A Python package implementing a new model for text classification with visualization tools for Explainable AI :octocat:

Sergio Burdisso

Last update: Jan 2, 2023

Related tags

Text Data & NLP nlp machine-learning natural-language-processing text-mining data-mining text-classification machine-learning-algorithms artificial-intelligence document-classification sentence-classification interpretability multilabel-classification explainable-artificial-intelligence interpretable-ml xai interpretable-machine-learning document-categorization early-classification text-labeling ss3-classifier

Overview


🍣 Online live demos: http://tworld.io/ss3/ 🍦 🍨 🍰

The SS3 text classifier is a novel supervised machine learning model for text classification that can naturally explain its rationale. It was originally introduced in Section 3 of the paper "A text classification framework for simple and effective early depression detection over social media streams" (arXiv preprint). Given its white-box nature, it allows researchers and practitioners to deploy explainable, and therefore more reliable, models for text classification (which could be especially useful for those working on classification problems whose outcomes could affect people's lives).

Note: this package also incorporates different variations of the original model, such as the one introduced in "t-SS3: a text classifier with dynamic n-grams for early risk detection over text streams" (arXiv preprint) which allows SS3 to recognize important variable-length word n-grams "on the fly".

What is PySS3?

PySS3 is a Python package that allows you to work with SS3 in a very straightforward, interactive and visual way. In addition to the implementation of the SS3 classifier, PySS3 comes with a set of tools to help you develop your machine learning models in a clearer and faster way. These tools let you analyze, monitor and understand your models by allowing you to see what they have actually learned and why. To achieve this, PySS3 provides you with 3 main components: the SS3 class, the Live_Test class, and the Evaluation class, as pointed out below.

👉 The `SS3` class

which implements the classifier using a clear API (very similar to that of sklearn's models):

    from pyss3 import SS3
    clf = SS3()
    ...
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)

This class also provides a handful of other useful methods, such as extract_insight(), which extracts the text fragments involved in the classification decision (allowing you to better understand the rationale behind the model's predictions), or classify_multilabel(), which provides multi-label classification support:

    doc = "Liverpool CEO Peter Moore on Building a Global Fanbase"
    
    # standard "single-label" classification
    label = clf.classify_label(doc) # 'business'

    # multi-label classification
    labels = clf.classify_multilabel(doc)  # ['business', 'sports']

👉 The `Live_Test` class

which allows you to interactively test your model and visually see the reasons behind classification decisions, with just one line of code:

    from pyss3.server import Live_Test
    from pyss3 import SS3

    clf = SS3()
    ...
    clf.fit(x_train, y_train)
    Live_Test.run(clf, x_test, y_test) # <- this one! cool uh? :)

This will open up, locally, an interactive tool in your browser which you can use to (live) test your models with the documents given in x_test (or by typing in your own!). This will allow you to visualize and understand what your model is actually learning.

For example, we have uploaded two of these live tests online for you to try out: "Movie Review (Sentiment Analysis)" and "Topic Categorization", both of which were obtained following the tutorials.

👉 And last but not least, the `Evaluation` class

This is probably one of the most useful components of PySS3. As the name suggests, this class provides easy-to-use methods for model evaluation and hyperparameter optimization: test, kfold_cross_validation, grid_search, and plot, for performing tests, stratified k-fold cross-validations, grid searches for hyperparameter optimization, and visualizing evaluation results using an interactive 3D plot, respectively. Probably one of its most important features is the ability to automatically (and permanently) record the history of evaluations that you've performed. This will save you a lot of time and will allow you to interactively visualize and analyze your classifier's performance in terms of its different hyperparameter values (and select the best model according to your needs). For instance, let's perform a grid search with a 4-fold cross-validation on the three hyperparameters, smoothness (s), significance (l), and sanction (p):

    from pyss3.util import Evaluation
    ...
    best_s, best_l, best_p, _ = Evaluation.grid_search(
        clf, x_train, y_train,
        s=[0.2, 0.32, 0.44, 0.56, 0.68, 0.8],
        l=[0.1, 0.48, 0.86, 1.24, 1.62, 2],
        p=[0.5, 0.8, 1.1, 1.4, 1.7, 2],
        k_fold=4
    )
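
To get a feel for the size of this search, the short sketch below simply enumerates the grid with the standard library. It does not call PySS3; the variable names are illustrative, and how much work each evaluation actually costs depends on the library's implementation.

```python
from itertools import product

# The same candidate values used in the grid search call.
s_values = [0.2, 0.32, 0.44, 0.56, 0.68, 0.8]
l_values = [0.1, 0.48, 0.86, 1.24, 1.62, 2]
p_values = [0.5, 0.8, 1.1, 1.4, 1.7, 2]
k_fold = 4

# Every (s, l, p) triple that the search has to consider.
combinations = list(product(s_values, l_values, p_values))
n_combinations = len(combinations)            # 6 * 6 * 6
n_fold_evaluations = n_combinations * k_fold  # one evaluation per fold

print(n_combinations, n_fold_evaluations)
```

With 6 values per hyperparameter this gives 216 combinations, and 864 fold-level evaluations under 4-fold cross-validation.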

In this illustrative example, s, l, and p will take those 6 different values each, and once the search is over, this function will return (by default) the hyperparameter values that obtained the best accuracy. Now, we could also use the plot function to analyze the results obtained in our grid search using the interactive 3D evaluation plot:

    Evaluation.plot()

In this 3D plot, each point represents an experiment/evaluation performed using that particular combination of values (s, l, and p). The points are colored according to how good the performance was for the selected metric, and the plot updates "on the fly" when the user selects a different evaluation metric (accuracy, precision, recall, f1, etc.). Additionally, when the cursor is moved over a data point, useful information is shown (including a "compact" representation of the confusion matrix obtained in that experiment). Finally, it is worth mentioning that, before showing the 3D plots, PySS3 creates a single, portable HTML file in your project folder containing the interactive plots. This allows users to store, send or upload the plots to another place using this single HTML file. For example, we have uploaded two of these files for you to see: "Sentiment Analysis (Movie Reviews)" and "Topic Categorization"; both evaluation plots were also obtained following the tutorials.

Want to give PySS3 a shot? 👓 ☕

Just go to the Getting Started page :D

Installation

Simply use:

    pip install pyss3

Want to contribute to this Open Source project? ✨ ✨

Thanks for your interest in the project! Any kind of help is very welcome (code, bug reports, content, data, documentation, design, examples, ideas, feedback, etc.). Issues and/or Pull Requests are welcome for any level of improvement, from a small typo to new features. Help us make PySS3 better 👍

Remember that you can use the "Edit" button ('pencil' icon) at the top to edit any file of this repo directly on GitHub.

Also, if you star this repo ( 🌟 ), you would be helping PySS3 to gain more visibility and reach the hands of people who may find it useful since repository lists and search results are usually ordered by the total number of stars.

Finally, in case you're planning to create a new Pull Request, for committing to this repo, we follow the "seven rules of a great Git commit message" from "How to Write a Git Commit Message", so make sure your commits follow them as well.

(please do not hesitate to send me an email to sergio.burdisso@gmail.com for anything)

Contributors 💪 😎 👍

Thanks go to these awesome people:

Saurabh Bora

This project follows the all-contributors specification. Contributions of any kind welcome!

Further Readings 📜

Full documentation

API documentation

Paper preprint

Comments

Multilabel Classification Evaluation

Hey @sergioburdisso,

Thank you for this awesome project! Currently the evaluation class only supports single label classification, even though SS3 inherently supports multilabel classification. These are the steps (I see) needed to support multilabel classification evaluation:

Take the output of classify_multilabel

Convert result to binarized vector (same length as confidence vector)

Implement multilabel classification metrics usage (e.g. Hamming Loss)

Adapt grid search
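
The binarization and metric steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `binarize` and `hamming_loss` are hypothetical helpers (not part of PySS3), and the predicted label lists are assumed to have the shape returned by classify_multilabel, i.e. one list of category names per document.

```python
from typing import Iterable, List, Sequence


def binarize(labels: Iterable[str], categories: Sequence[str]) -> List[int]:
    """Turn a list of label names into a 0/1 vector aligned with `categories`."""
    present = set(labels)
    return [1 if category in present else 0 for category in categories]


def hamming_loss(y_true: Sequence[Iterable[str]],
                 y_pred: Sequence[Iterable[str]],
                 categories: Sequence[str]) -> float:
    """Fraction of (document, category) slots where prediction and truth differ."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true or not categories:
        raise ValueError("need at least one document and one category")

    wrong = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        true_vec = binarize(true_labels, categories)
        pred_vec = binarize(pred_labels, categories)
        wrong += sum(t != p for t, p in zip(true_vec, pred_vec))
    return wrong / (len(y_true) * len(categories))
```

For example, with the true label lists in `y_test`, one could pass `[clf.classify_multilabel(doc) for doc in x_test]` as `y_pred`; a lower value means fewer label slots were predicted incorrectly.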

enhancement
opened by angrymeir 14
Custom preprocessing in Live Test

@sergioburdisso It would be a great feature to have custom preprocessing in the Live Test. This will enable us to visually understand the words, sentences, and paragraphs that helped the model to classify a particular document after custom preprocessing.
enhancement

opened by enthussb 8
Initialization of sanction function

Hey @sergioburdisso,

as far as I understand the SS3 framework, there is an inconsistency between the initialization of SS3 and its documentation. The initialization describes the parameter sn_m as method used to compute the sanction (sn) function [...]

However in the actual initialization the only function that changes based on the sn_m parameter is the significance function (see here).

It would be great if you could have a look at it and tell me whether I'm wrong 😄

Best, Florian
documentation

opened by angrymeir 6
Error in Live_test

I'm getting a "list index out of range" error. I'm not sure what happened here. I'm using the latest build as of posting (I installed it just before using it here), and my Python is 3.6, if I remember correctly.

EDIT: I don't know why but restarting the kernel fixed it.
bug

opened by penatbater 5
[joss] feature request: accessible utility to import a dataset
openjournals/joss-reviews#3934

This package has good documentation. Going through the examples I came up with a feature request, which would greatly benefit introducing newcomers and prototyping code.

I like the first example in README to be straightforward and copy-paste ready, which is not the case here (looking at missing code ...).

How about implementing some import_dataset(url) / download(url) functionality in utils or Dataset that would, for example, download the dataset .zip file and unpack it (sample code) so that one can load the data into exemplary code:

    from pyss3 import SS3
    from pyss3.util import Dataset

    # import_dataset() is the function proposed in this request (hypothetical)
    Dataset.import_dataset("https://github.com/sergioburdisso/pyss3/blob/master/examples/datasets/movie_review.zip")

    x_train, y_train = Dataset.load_from_files("movie_review/train")
    x_test, y_test = Dataset.load_from_files("movie_review/test")

    clf = SS3()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)

Implementation details and naming may vary, but it would be nice to easily run code from README.
enhancement
opened by hbaniecki 4

Data loading issues while train

Hey,

Note: I have a pandas DataFrame containing 2 columns:

Text
Label

    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x_data,
                                                        y_data,
                                                        test_size=0.2,
                                                        shuffle=False)

The train() and fit() methods are not working.

here is a reference code

How to fix it?

Thanks

enhancement

opened by Practcdi 4

Division by 0

I am eager to use the SS3 classifier for a text classification task in my master's thesis. Unfortunately, when I run it I get a division by zero error message (see image). My text seems fairly clean to me (although not yet cleaned in exactly the right way), so I am not sure what is causing this.

Is there anything you suspect might be going wrong which I could try? Or anywhere where the data criteria are listed (I've looked but maybe I've overlooked)?

I included the data structure (pandas series), some of what my data looks like and the error.

Many thanks!

bug

opened by demcbs 4
Multilabel Classification Dataset Loading

Hey @sergioburdisso,

for multilabel classification, the file structure described in the topic categorization tutorial is not efficient, since a text related to multiple labels has to be stored in multiple files. My current approach is to write the texts to one file, one per line, and the respective labels to another file, also one line per text.

    # Writing data
    dataset = {"Text 1": ["label1", "label2"],
               "Text 2": ["label2", "label3"],
               "Text 3": ["label1"]}

    for text, labels in dataset.items():
        with open('text.txt', 'a+') as text_file:
            text_file.write(text + '\n')
        with open('labels.txt', 'a+') as label_file:
            label_file.write(';'.join(labels) + '\n')

The result is the following:

    # cat text.txt
    Text 1
    Text 2
    Text 3
    # cat labels.txt
    label1;label2
    label2;label3
    label1

It would be great if util.Dataset.load_from_files could be adjusted to also support this! But I'm also open for other suggestions on how to tackle that problem :)
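
Until something like this is supported by the library, the two files can be read back with a few lines of plain Python. The sketch below is only an illustration: `load_linewise_dataset` is a hypothetical helper (not part of PySS3), and it assumes the format shown above, with one document per line and `;`-separated labels on the matching line of the labels file.

```python
from typing import List, Tuple


def load_linewise_dataset(text_path: str,
                          labels_path: str,
                          sep: str = ";") -> Tuple[List[str], List[List[str]]]:
    """Read documents and their label lists from two line-aligned files."""
    with open(text_path, encoding="utf-8") as text_file:
        docs = [line.rstrip("\n") for line in text_file]

    with open(labels_path, encoding="utf-8") as label_file:
        # An empty line yields an empty label list for that document.
        labels = [[label for label in line.rstrip("\n").split(sep) if label]
                  for line in label_file]

    if len(docs) != len(labels):
        raise ValueError("text and label files must have the same number of lines")
    return docs, labels
```

Calling `load_linewise_dataset('text.txt', 'labels.txt')` on the files written above returns the three texts together with one list of labels per text.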

Thanks for your hard work!
enhancement
opened by angrymeir 4
Multilabel Live Test

Hey @sergioburdisso,

I've noticed that you recently fixed the multilabel fit issue #6, but Live_Test.run(clf, X_test, y_test) still does not accept y_test as List[List[str]]. It would be really great to have it. If you don't have time, maybe I could submit a PR?

Olivier
enhancement

opened by oterrier 3
Substantial print pollution when optimising for hyperparameters

Hello, while trying to optimise for hyperparameters I found that there is little I could do to avoid a large printout to stdout.

I tried to mute this by doing the following:

    import pyss3
    pyss3.Print.set_quiet(True)

but tqdm also produces progress bars, so I propose propagating Print.__quiet__ to tqdm's disable parameter, for example:

    tqdm(..., disable=Print.__quiet__)

Thanks :) Keep up the good work!
enhancement
opened by LemurPwned 2
[joss] update entry site of the documentation

https://github.com/openjournals/joss-reviews/issues/3934

Hi, I enjoy working with http://tworld.io/ss3, which is highlighted in README. It is also good to see a statement of need with the two references there. Could we add the same to the welcome page of the documentation at https://pyss3.readthedocs.io/en/latest?
documentation

opened by hbaniecki 1
[JOSS] comments on the paper

My comments on the software paper with respect to the JOSS submission:

In the abstract, you declare two useful tools and then say: "For instance, one of these tools provides (...)". Since there are only two tools, I'd also suggest describing the other one as well. Additionally, instead of "for instance", I'd go with one sentence per each tool. This way, I can only guess what the other functionality is.

The last sentence in the abstract is quite long, and I would consider breaking it into shorter pieces.

My understanding is that your input is the implementation of the SS3 algorithm. Therefore, I'd be happy to see a bit more details about the algorithm to make the paper self-contained. Also, the title gives the impression of you introducing a new model/algorithm. I believe the input of this work is an implementation? This distinction is not very clear to me. Also, does it mean the SS3 algorithm was proposed without any implementation? That sounds a bit confusing.

Is the explanation tool model-specific, or does it work with any classification method? What is your exact input here (the explanation algorithm or the GUI)? If it's a model-agnostic explanation, perhaps it could be implemented in a different package?

github -> GitHub?

Since explanations of the models are not the primary contribution of this work (?), you could consider adding a reference to some work in this area.

Line 40: "On the other hand" doesn't contrast with anything, and definitely no "On the one hand". Maybe you could consider rephrasing this

Footnote 2 has an issue with spacing – no space between "ArXiv" and a bracket

I ticked all the boxes for this part anyways. An exciting paper overall. I like the examples particularly. However, it's not clear what the exact contributions are: model implementation and the explanation GUI (or algorithm)?

In reference to https://github.com/openjournals/joss-reviews/issues/3934
documentation
opened by kmichael08 0
[joss] software paper comments

https://github.com/openjournals/joss-reviews/issues/3934

Hi, I hope these comments help in improving the paper.

Comments

The paper's title could see a change. It says "PySS3: A new interpretable and simple machine learning model for text classification", but the model is named "SS3" and seems not new. The title of the repository seems more accurate, "A Python package implementing a new simple and interpretable model for text classification", but even then one could drop "new" and use the PyPI package's title, e.g. "PySS3: A Python package implementing the SS3 interpretable text classifier [with interactive/visualization tools for explainable AI]". Just an example to be considered.

I would recommend that the authors highlight in the article the software's "interactive" (explanation, analysis) and "monitoring" (model, machine learning) aspects, as this seems both novel and emerging in discussions lately.

In the end, it would be useful to release a stable version 1.0 of the package (on GitHub, PyPI) and mark that in the paper, e.g. in the Summary section.

Summary

L10. "implements novel machine learning model" - It might not be seen as novel when the model was already published in 2019 and extended in 2020.

L11. mentioning "two useful tools" without describing what the second does seems off

Statement of need

This part mainly discusses the need for an open-source implementation of machine learning models. However, as I see it, the significant contributions of the software/paper, distinguishing it from the previous work, are the Live_Test/Evaluation tools allowing for visual explanation and hyperparameter optimization. This could be further underlined.

State of the field

The paper lacks a brief discussion of packages in the field of interpretable and explainable machine learning. In particular, I suggest the authors reference/compare to the following software related to interactive explainability:

Wexler et al. "The What-If Tool: Interactive Probing of Machine Learning Models" (IEEE TVCG, 2019) https://doi.org/10.1109/TVCG.2019.2934619

Tenney et al. "The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models" (EMNLP, 2020) http://doi.org/10.18653/v1/2020.emnlp-demos.15

Benjamin Hoover, Hendrik Strobelt, Sebastian Gehrmann. "exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models" (ACL, 2020) https://www.doi.org/10.18653/v1/2020.acl-demos.22

[Ours] "dalex: Responsible Machine Learning with Interactive Explainability and Fairness in Python" (JMLR, 2021) https://www.jmlr.org/papers/v22/20-1473.html

Other possibly missing/useful references:

Pedregosa et al. "Scikit-learn: Machine Learning in Python" (JMLR, 2011) https://www.jmlr.org/papers/v12/pedregosa11a.html

Christoph Molnar "Interpretable Machine Learning - A Guide for Making Black Box Models Explainable" (book, 2018) https://christophm.github.io/interpretable-ml-book

Cynthia Rudin "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead" (Nature Machine Intelligence, 2019) https://doi.org/10.1038/s42256-019-0048-x

[Ours] "modelStudio: Interactive Studio with Explanations for ML Predictive Models" (JOSS, 2019) https://doi.org/10.21105/joss.01798

Implementation

L48 github -> GitHub

L54 "such as the one introduced later by the same authors" -> "by us" would be easier to read

L57 missing the citation of scikit-learn

Illustrative examples

In the beginning, it lacks a brief description of the predictive task used for the example (dataset name, positive/negative text classification, etc.).

Also, it could now be updated with the Dataset.load_from_url() function.

Conclusions

Again, I have doubts that the machine learning model is "novel", as it has been previously published. It might be misunderstood as "introducing a novel machine learning model".
documentation
opened by hbaniecki 1
Custom metrics for evaluation

Hi! A way to pass a scorer function (e.g. using sklearn's make_scorer) to Evaluation would make pyss3 even greater.
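
As a rough illustration of what such a hook could look like, the sketch below treats a scorer as any callable taking `(y_true, y_pred)` and applies it to a list of predictions. Both `macro_recall` and `score_predictions` are hypothetical names used only for this example; they are not part of the Evaluation class or of scikit-learn.

```python
from typing import Callable, Sequence


def macro_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Example custom metric: recall averaged over the classes seen in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def score_predictions(y_true: Sequence[str],
                      y_pred: Sequence[str],
                      scorer: Callable[[Sequence[str], Sequence[str]], float]) -> float:
    """Apply a user-supplied scorer to the true and predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return scorer(y_true, y_pred)
```

With a trained classifier, this could be used as `score_predictions(y_test, clf.predict(x_test), macro_recall)`.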

Any plans on this?

This is a very interesting project. Thank you!
enhancement

opened by ogabrielluiz 5

