A Python package implementing a new model for text classification with visualization tools for Explainable AI :octocat:

Overview





🍣 Online live demos: http://tworld.io/ss3/ 🍦 🍨 🍰


The SS3 text classifier is a novel supervised machine learning model for text classification with the ability to naturally explain its rationale. It was originally introduced in Section 3 of the paper "A text classification framework for simple and effective early depression detection over social media streams" (arXiv preprint). Given its white-box nature, it allows researchers and practitioners to deploy explainable, and therefore more reliable, models for text classification (which could be especially useful for those working on classification problems in which people's lives could be affected).

Note: this package also incorporates different variations of the original model, such as the one introduced in "t-SS3: a text classifier with dynamic n-grams for early risk detection over text streams" (arXiv preprint), which allows SS3 to recognize important variable-length word n-grams "on the fly".

What is PySS3?

PySS3 is a Python package that allows you to work with SS3 in a very straightforward, interactive and visual way. In addition to the implementation of the SS3 classifier, PySS3 comes with a set of tools to help you develop your machine learning models in a clearer and faster way. These tools let you analyze, monitor and understand your models by allowing you to see what they have actually learned and why. To achieve this, PySS3 provides you with 3 main components: the SS3 class, the Live_Test class, and the Evaluation class, as described below.

👉 The SS3 class

which implements the classifier using a clear API (very similar to that of sklearn's models):

    from pyss3 import SS3
    clf = SS3()
    ...
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)

Also, this class provides a handful of other useful methods, such as extract_insight() to extract the text fragments involved in the classification decision (allowing you to better understand the rationale behind the model's predictions) or classify_multilabel() to provide multi-label classification support:

    doc = "Liverpool CEO Peter Moore on Building a Global Fanbase"
    
    # standard "single-label" classification
    label = clf.classify_label(doc) # 'business'

    # multi-label classification
    labels = clf.classify_multilabel(doc)  # ['business', 'sports']
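
For instance, a minimal sketch of how extract_insight() could be used is shown below (for illustration, it is assumed to return a list of (fragment, confidence value) pairs; check the API documentation for the exact arguments and return format):

    # extract the text fragments that support the classification decision
    # (assumed return format: a list of (fragment, confidence value) pairs)
    fragments = clf.extract_insight(doc)

    for fragment, confidence_value in fragments:
        print(confidence_value, fragment)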

👉 The Live_Test class

which allows you to interactively test your model and visually see the reasons behind classification decisions, with just one line of code:

    from pyss3.server import Live_Test
    from pyss3 import SS3

    clf = SS3()
    ...
    clf.fit(x_train, y_train)
    Live_Test.run(clf, x_test, y_test) # <- this one! cool, huh? :)

This will open up, locally, an interactive tool in your browser which you can use to (live) test your models with the documents given in x_test (or by typing in your own!). This will allow you to visualize and understand what your model is actually learning.


For example, we have uploaded two of these live tests online for you to try out: "Movie Review (Sentiment Analysis)" and "Topic Categorization", both of which were obtained by following the tutorials.

👉 And last but not least, the Evaluation class

This is probably one of the most useful components of PySS3. As the name suggests, this class provides easy-to-use methods for model evaluation and hyperparameter optimization: for example, the test, kfold_cross_validation, grid_search, and plot methods for performing tests, stratified k-fold cross-validations, grid searches for hyperparameter optimization, and visualizing evaluation results using an interactive 3D plot, respectively. Probably one of its most important features is the ability to automatically (and permanently) record the history of evaluations that you've performed. This will save you a lot of time and will allow you to interactively visualize and analyze your classifier's performance in terms of its different hyperparameter values (and to select the best model according to your needs). For instance, let's perform a grid search with 4-fold cross-validation on the three hyperparameters: smoothness (s), significance (l), and sanction (p):

    from pyss3.util import Evaluation
    ...
    best_s, best_l, best_p, _ = Evaluation.grid_search(
        clf, x_train, y_train,
        s=[0.2, 0.32, 0.44, 0.56, 0.68, 0.8],
        l=[0.1, 0.48, 0.86, 1.24, 1.62, 2],
        p=[0.5, 0.8, 1.1, 1.4, 1.7, 2],
        k_fold=4
    )

In this illustrative example, s, l, and p will each take those 6 different values, and once the search is over, this function will return (by default) the hyperparameter values that obtained the best accuracy. Now, we could also use the plot function to analyze the results obtained in our grid search using the interactive 3D evaluation plot:

    Evaluation.plot()


In this 3D plot, each point represents an experiment/evaluation performed using that particular combination of values (s, l, and p). Also, these points are colored in proportion to how good the performance was according to the selected metric; the plot updates "on the fly" when the user selects a different evaluation metric (accuracy, precision, recall, f1, etc.). Additionally, when the cursor is moved over a data point, useful information is shown (including a "compact" representation of the confusion matrix obtained in that experiment). Finally, it is worth mentioning that, before showing the 3D plots, PySS3 creates a single, portable HTML file in your project folder containing the interactive plots. This allows users to store, share or upload the plots elsewhere using this single HTML file. For example, we have uploaded two of these files for you to see: "Sentiment Analysis (Movie Reviews)" and "Topic Categorization", both of which were also obtained by following the tutorials.
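
Once you've settled on the best configuration, the values returned by grid_search can be applied back to the classifier before the final training. Below is a minimal sketch (it assumes the classifier exposes a set_hyperparameters() method for setting s, l, and p; check the API documentation for the exact name and signature):

    # apply the best hyperparameter values found by the grid search
    clf.set_hyperparameters(s=best_s, l=best_l, p=best_p)

    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)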

Want to give PySS3 a shot? 👓

Just go to the Getting Started page :D

Installation

Simply use:

pip install pyss3

Want to contribute to this Open Source project? :octocat:

Thanks for your interest in the project, you're Awesome!! Any kind of help is very welcome (Code, Bug reports, Content, Data, Documentation, Design, Examples, Ideas, Feedback, etc.). Issues and/or Pull Requests are welcome for any level of improvement, from a small typo to new features. Help us make PySS3 better 👍

Remember that you can use the "Edit" button ('pencil' icon) at the top to edit any file of this repo directly on GitHub.

Also, if you star this repo ( 🌟 ), you would be helping PySS3 gain more visibility and reach the hands of people who may find it useful, since repository lists and search results are usually ordered by the total number of stars.

Finally, in case you're planning to create a new Pull Request: for commits to this repo, we follow the "seven rules of a great Git commit message" from "How to Write a Git Commit Message", so please make sure your commits follow them as well.

(please do not hesitate to email me at [email protected] about anything)

Contributors 💪 😎 👍

Thanks goes to these awesome people (emoji key):


Florian Angermeir

💻 🤔 🔣

Muneeb Vaiyani

🤔 🔣

Saurabh Bora

🤔

This project follows the all-contributors specification. Contributions of any kind welcome!

Further Readings 📜

Full documentation

API documentation

Paper preprint

Comments
  • Multilabel Classification Evaluation


    Hey @sergioburdisso,

    Thank you for this awesome project! Currently, the Evaluation class only supports single-label classification, even though SS3 inherently supports multilabel classification. These are the steps (as I see it) needed to support multilabel classification evaluation:

    • Take the output of classify_multilabel
    • Convert result to binarized vector (same length as confidence vector)
    • Implement multilabel classification metrics usage (e.g. Hamming loss; see the sketch after this list)
    • Adapt grid_search
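
    For instance, the first three steps could look roughly like the following sketch (not part of pyss3; it assumes scikit-learn's MultiLabelBinarizer and hamming_loss are available, and that y_test is a list of label lists):

    # 1) take the output of classify_multilabel for each test document
    y_pred = [clf.classify_multilabel(doc) for doc in x_test]

    # 2) convert true and predicted labels to binarized vectors
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.metrics import hamming_loss

    mlb = MultiLabelBinarizer().fit(y_test + y_pred)
    y_test_bin = mlb.transform(y_test)
    y_pred_bin = mlb.transform(y_pred)

    # 3) use a multilabel metric, e.g. Hamming loss
    print("Hamming loss:", hamming_loss(y_test_bin, y_pred_bin))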
    enhancement 
    opened by angrymeir 14
  • Custom preprocessing in Live Test


    @sergioburdisso It would be a great feature to have custom preprocessing in the Live Test. This will enable us to visually understand the words, sentences, and paragraphs that helped the model to classify a particular document after custom preprocessing.

    enhancement 
    opened by enthussb 8
  • Initialization of sanction function


    Hey @sergioburdisso,

    as far as I understand the SS3 framework, there is an inconsistency between the initialization of SS3 and its documentation. The initialization describes the parameter sn_m as the method used to compute the sanction (sn) function [...]

    However, in the actual initialization, the only function that changes based on the sn_m parameter is the significance function (see here).

    It would be great if you could have a look at it and tell me whether I'm wrong 😄

    Best, Florian

    documentation 
    opened by angrymeir 6
  • Error in Live_test


    I'm getting a list index out of range error. I'm not sure what happened here. I'm using the latest build as of posting (just installed it prior to using it here); my Python is 3.6 if I remember correctly.

    EDIT: I don't know why, but restarting the kernel fixed it.

    bug 
    opened by penatbater 5
  • [joss] feature request: accessible utility to import a dataset


    openjournals/joss-reviews#3934

    This package has good documentation. Going through the examples, I came up with a feature request which would greatly benefit newcomers and code prototyping.

    I'd like the first example in the README to be straightforward and copy-paste ready, which is not the case here (looking at the missing code ...).

    How about implementing some import_dataset(url) / download(url) functionality in utils or Dataset that would, for example, download the dataset .zip file and unpack it (sample code), so that one can load the data in example code like this:

    from pyss3 import SS3
    from pyss3.util import Dataset  # Dataset lives in pyss3.util

    Dataset.import_dataset("https://github.com/sergioburdisso/pyss3/blob/master/examples/datasets/movie_review.zip")
    x_train, y_train = Dataset.load_from_files("movie_review/train")
    x_test, y_test = Dataset.load_from_files("movie_review/test")

    clf = SS3()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)

    Implementation details and naming may vary, but it would be nice to easily run the code from the README.

    enhancement 
    opened by hbaniecki 4
  • Data loading issues while training


    Hey,

    [Note]: I have a pandas dataframe containing 2 columns:

    1. Text
    2. Label

    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x_data,
                                                        y_data ,
                                                        test_size = 0.2, 
                                                        shuffle=False)
    

    The train() and fit() methods are not working.

    Here is the reference code:

    [screenshot of the reference code]

    How to fix it?

    Thanks

    enhancement 
    opened by Practcdi 4
  • Division by 0


    I am eager to use the SS3 classifier for a text classification task in my master's thesis. Unfortunately, when I run it I get a division by zero error message (see the screenshots below). My text seems fairly clean to me (although not yet cleaned exactly the right way), so I am not sure what is causing this.

    Is there anything you suspect might be going wrong that I could look into? Or is there anywhere the data requirements are listed (I've looked, but maybe I've overlooked it)?

    I included the data structure (a pandas Series), a sample of what my data looks like, and the error.

    Many thanks!

    [screenshots: data structure, sample data, and the error message]

    bug 
    opened by demcbs 4
  • Multilabel Classification Dataset Loading


    Hey @sergioburdisso,

    for multilabel classification, the file structure described in the topic categorization tutorial is not efficient, since text related to multiple labels has to be stored in multiple files. My current approach is to write the text to one file line-wise and the respective labels to another file, also line-wise.

    # Writing Data
    dataset = {"Text 1": ["label1", "label2"], 
               "Text 2": ["label2", "label3"], 
               "Text 3": ["label1"]}
    
    for text, labels in dataset.items():
    
      with open('text.txt', 'a+') as text_file:
        text_file.write(text + '\n')
    
      with open('labels.txt', 'a+') as label_file:
        label_file.write(';'.join(labels) + '\n')
    

    The result is the following:

    # cat text.txt
    Text 1
    Text 2
    Text 3
    
    # cat labels.txt
    label1;label2
    label2;label3
    label1
    

    It would be great if util.Dataset.load_from_files could be adjusted to also support this! But I'm also open for other suggestions on how to tackle that problem :)
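
    In the meantime, a small ad-hoc reader for this layout could serve as a workaround (just a sketch, not part of pyss3; it assumes fit() accepts a list of label lists for multi-label training, as discussed in the other multi-label issues):

    # Reading the two line-aligned files back
    with open('text.txt') as text_file, open('labels.txt') as label_file:
        x_data = [line.rstrip('\n') for line in text_file]
        y_data = [line.rstrip('\n').split(';') for line in label_file]

    clf.fit(x_data, y_data)  # multi-label fit (list of label lists)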

    Thanks for your hard work!

    enhancement 
    opened by angrymeir 4
  • Multilabel Live Test


    Hey @sergioburdisso,

    I've noticed that you recently fixed the multilabel fit issue #6, but Live_Test.run(clf, X_test, y_test) still does not accept y_test as List[List[str]]. It would be really great to have it. If you don't have time, maybe I could submit a PR? Olivier

    enhancement 
    opened by oterrier 3
  • Substantial print pollution when optimising for hyperparameters


    Hello, while I was trying to optimise hyperparameters I found that there was little I could do to avoid a large printout to stdout.

    I tried to mute this by doing the following:

    import pyss3
    pyss3.Print.set_quiet(True)
    

    but besides that, it is also tqdm that produces the progress bars, hence I propose propagating Print.__quiet__ to tqdm's disable parameter, for example:

    tqdm(..., disable=Print.__quiet__)
    

    Thanks :) Keep up the good work!

    enhancement 
    opened by LemurPwned 2
  • [joss] update entry site of the documentation


    https://github.com/openjournals/joss-reviews/issues/3934

    Hi, I enjoy working with http://tworld.io/ss3, which is highlighted in the README. It is also good to see a statement of need with the two references there. Could we add the same to the welcome page of the documentation at https://pyss3.readthedocs.io/en/latest?

    documentation 
    opened by hbaniecki 1
  • [JOSS] comments on the paper


    My comments on the software paper with respect to the JOSS submission:

    • In the abstract, you declare two useful tools and then say: "For instance, one of these tools provides (...)". Since there are only two tools, I'd also suggest describing the other one as well. Additionally, instead of "for instance", I'd go with one sentence per each tool. This way, I can only guess what the other functionality is.
    • The last sentence in the abstract is quite long, and I would consider breaking it into shorter pieces.
    • My understanding is that your input is the implementation of the SS3 algorithm. Therefore, I'd be happy to see a bit more details about the algorithm to make the paper self-contained. Also, the title gives the impression of you introducing a new model/algorithm. I believe the input of this work is an implementation? This distinction is not very clear to me. Also, does it mean the SS3 algorithm was proposed without any implementation? That sounds a bit confusing.
    • Is the explanation tool model-specific, or does it work with any classification method? What is your exact input here (the explanation algorithm or the GUI)? If it's a model-agnostic explanation, perhaps it could be implemented in a different package?
    • github -> GitHub?
    • since explanations of the models are not the primary contribution of this work (?), you could consider adding a reference to some work in this area.
    • Line 40: "On the other hand" doesn't contrast with anything, and there is definitely no "On the one hand". Maybe you could consider rephrasing this.
    • Footnote 2 has an issue with spacing – no space between "ArXiv" and a bracket

    I ticked all the boxes for this part anyway. An exciting paper overall. I particularly like the examples. However, it's not clear what the exact contributions are: the model implementation and the explanation GUI (or algorithm)?

    In reference to https://github.com/openjournals/joss-reviews/issues/3934

    documentation 
    opened by kmichael08 0
  • [joss] software paper comments


    https://github.com/openjournals/joss-reviews/issues/3934

    Hi, I hope these comments help in improving the paper.

    Comments

    1. The paper's title could see a change. It says "PySS3: A new interpretable and simple machine learning model for text classification", but the model is named "SS3" and seems not new. The title of the repository seems more accurate, "A Python package implementing a new simple and interpretable model for text classification", but even then one could drop "new" and use the PyPI package's title, e.g. "PySS3: A Python package implementing the SS3 interpretable text classifier [with interactive/visualization tools for explainable AI]". Just an example to be considered.
    2. I would recommend that the authors highlight in the article the software's aspects of "interactive" (explanation, analysis) and (model, machine learning) "monitoring", as these seem both novel and emerging in recent discussions.
    3. In the end, it would be useful to release a stable version 1.0 of the package (on GitHub, PyPI) and mark that in the paper, e.g. in the Summary section.

    Summary

    • L10. "implements novel machine learning model" - It might not be seen as novel when the model was already published in 2019 and extended in 2020.
    • L11. mentioning "two useful tools" without describing what the second does seems off

    Statement of need

    This part discusses mainly the need for an open-source implementation of the machine learning models. However, as I see it, the significant contributions of the software/paper, distinguishing it from the previous work, are the Live_Test/Evaluation tools allowing for visual explanation and hyperparameter optimization. This could be further underlined.

    State of the field

    The paper lacks a brief discussion of packages in the field of interpretable and explainable machine learning. In that regard, I suggest the authors reference/compare to the following software related to interactive explainability:

    1. Wexler et al. "The What-If Tool: Interactive Probing of Machine Learning Models" (IEEE TVCG, 2019) https://doi.org/10.1109/TVCG.2019.2934619
    2. Tenney et al. "The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models" (EMNLP, 2020) http://doi.org/10.18653/v1/2020.emnlp-demos.15
    3. Benjamin Hoover, Hendrik Strobelt, Sebastian Gehrmann. "exBERT: A Visual Analysis Tool to Explore Learned Representations in Transformer Models" (ACL, 2020) https://www.doi.org/10.18653/v1/2020.acl-demos.22
    4. [Ours] "dalex: Responsible Machine Learning with Interactive Explainability and Fairness in Python" (JMLR, 2021) https://www.jmlr.org/papers/v22/20-1473.html

    Other possibly missing/useful references:

    1. Pedregosa et al. "Scikit-learn: Machine Learning in Python" (JMLR, 2011) https://www.jmlr.org/papers/v12/pedregosa11a.html
    2. Christoph Molnar "Interpretable Machine Learning - A Guide for Making Black Box Models Explainable" (book, 2018) https://christophm.github.io/interpretable-ml-book
    3. Cynthia Rudin "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead" (Nature Machine Intelligence, 2019) https://doi.org/10.1038/s42256-019-0048-x
    4. [Ours] "modelStudio: Interactive Studio with Explanations for ML Predictive Models" (JOSS, 2019) https://doi.org/10.21105/joss.01798

    Implementation

    • L48 github -> GitHub
    • L54 "such as the one introduced later by the same authors" -> "by us" would be easier to read
    • L57 missing the citation of scikit-learn

    Illustrative examples

    1. In the beginning, it lacks a brief description of the predictive task used for the example (dataset name, positive/negative text classification, etc.).
    2. Also, it could now be updated with the Dataset.load_from_url() function.

    Conclusions

    Again, I have doubts that the machine learning model is "novel", as it has been previously published, etc. It might be misunderstood as "introducing a novel machine learning model".

    documentation 
    opened by hbaniecki 1
  • Custom metrics for evaluation


    Hi! A way to pass a scorer function (e.g. using sklearn's make_scorer) to Evaluation would make pyss3 even greater.

    Any plans on this?

    This is a very interesting project. Thank you!

    enhancement 
    opened by ogabrielluiz 5