pysentimiento: A Python toolkit for Sentiment Analysis and Social NLP tasks
A Transformer-based library for SocialNLP classification tasks.
Currently supports:
- Sentiment Analysis (Spanish, English)
- Emotion Analysis (Spanish, English)
Install it with pip:
pip install pysentimiento
Then start using it:
from pysentimiento import SentimentAnalyzer
analyzer = SentimentAnalyzer(lang="es")
analyzer.predict("Qué gran jugador es Messi")
# returns SentimentOutput(output=POS, probas={POS: 0.998, NEG: 0.002, NEU: 0.000})
analyzer.predict("Esto es pésimo")
# returns SentimentOutput(output=NEG, probas={NEG: 0.999, POS: 0.001, NEU: 0.000})
analyzer.predict("Qué es esto?")
# returns SentimentOutput(output=NEU, probas={NEU: 0.993, NEG: 0.005, POS: 0.002})
analyzer.predict("jejeje no te creo mucho")
# returns SentimentOutput(output=NEG, probas={NEG: 0.587, NEU: 0.408, POS: 0.005})
"""
Emotion Analysis in English
"""
from pysentimiento import EmotionAnalyzer
emotion_analyzer = EmotionAnalyzer(lang="en")
emotion_analyzer.predict("yayyy")
# returns EmotionOutput(output=joy, probas={joy: 0.723, others: 0.198, surprise: 0.038, disgust: 0.011, sadness: 0.011, fear: 0.010, anger: 0.009})
emotion_analyzer.predict("fuck off")
# returns EmotionOutput(output=anger, probas={anger: 0.798, surprise: 0.055, fear: 0.040, disgust: 0.036, joy: 0.028, others: 0.023, sadness: 0.019})
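Some predictions above are much less certain than others (compare the 0.998 for "Qué gran jugador es Messi" with the 0.587 for "jejeje no te creo mucho"). The following is a minimal sketch of thresholding on the probabilities. It works on plain dictionaries shaped like the `probas` values shown above; `confident_label`, its `threshold` default, and the `"UNCERTAIN"` fallback are hypothetical and not part of pysentimiento.

```python
def confident_label(probas, threshold=0.7, fallback="UNCERTAIN"):
    """Return the most probable label, or `fallback` if its probability is below `threshold`.

    Hypothetical helper for illustration; not part of the pysentimiento API.
    """
    label = max(probas, key=probas.get)
    return label if probas[label] >= threshold else fallback


# Probabilities copied from the example outputs above
clear = {"POS": 0.998, "NEG": 0.002, "NEU": 0.000}
ambiguous = {"NEG": 0.587, "NEU": 0.408, "POS": 0.005}

print(confident_label(clear))      # POS
print(confident_label(ambiguous))  # UNCERTAIN
```

The threshold value is arbitrary here and would need tuning on your own data.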
You can also use the pretrained models directly with the transformers library.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("finiteautomata/beto-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/beto-sentiment-analysis")
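A model loaded this way returns raw logits rather than the labelled probabilities that SentimentAnalyzer prints. As a rough sketch of the missing step, the snippet below applies a softmax in plain Python. The label order `NEG, NEU, POS` is an assumption (check `model.config.id2label` for the real mapping), the logits are made up for illustration, and `logits_to_probas` is a hypothetical helper. In practice the logits would come from something like `model(**tokenizer(text, return_tensors="pt")).logits`.

```python
import math

# Assumed label order; the real mapping lives in model.config.id2label
ASSUMED_LABELS = ["NEG", "NEU", "POS"]


def logits_to_probas(logits, labels=ASSUMED_LABELS):
    """Apply a numerically stable softmax and pair each probability with its label."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Made-up logits for illustration only (not real model output)
probas = logits_to_probas([-2.0, 0.5, 3.0])
predicted = max(probas, key=probas.get)
print(predicted, probas)
```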
Preprocessing
pysentimiento features a tweet preprocessor specially suited for tweet classification with transformer-based models.
from pysentimiento.preprocessing import preprocess_tweet
# Replaces user handles and URLs by special tokens
preprocess_tweet("@perezjotaeme debería cambiar esto http://bit.ly/sarasa") # "@usuario debería cambiar esto url"
# Shortens repeated characters
preprocess_tweet("no entiendo naaaaaaaadaaaaaaaa", shorten=2) # "no entiendo naadaa"
# Normalizes laughter
preprocess_tweet("jajajajaajjajaajajaja no lo puedo creer ajajaj") # "jaja no lo puedo creer jaja"
# Handles hashtags
preprocess_tweet("esto es #UnaGenialidad")
# "esto es una genialidad"
# Handles emojis
preprocess_tweet("🎉🎉", lang="en")
# 'emoji party popper emoji emoji party popper emoji'
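To give an idea of what these steps involve, here is a simplified sketch of three of them (handles and URLs, laughter, repeated characters) using only regular expressions. It is not the library's implementation: `sketch_preprocess` and its patterns are assumptions made for illustration, and it ignores hashtags and emojis entirely.

```python
import re


def sketch_preprocess(text, user_token="@usuario", url_token="url", shorten=3):
    """Illustrative approximation of some tweet normalization steps (not pysentimiento's code)."""
    # Replace user handles such as @someone with a fixed token
    text = re.sub(r"@\w+", user_token, text)
    # Replace http/https URLs with a fixed token
    text = re.sub(r"https?://\S+", url_token, text)
    # Collapse words made only of 'j' and 'a' (and containing "ja") into "jaja"
    text = re.sub(r"\b(?=[ja]*ja)[ja]{4,}\b", "jaja", text)
    # Cap any run of the same character at `shorten` repetitions
    text = re.sub(r"(.)\1{%d,}" % shorten, lambda m: m.group(1) * shorten, text)
    return text


print(sketch_preprocess("@perezjotaeme debería cambiar esto http://bit.ly/sarasa"))
print(sketch_preprocess("no entiendo naaaaaaaadaaaaaaaa", shorten=2))
print(sketch_preprocess("jajajajaajjajaajajaja no lo puedo creer ajajaj"))
```

For real use, prefer preprocess_tweet, which also covers the hashtag and emoji cases shown above.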
Trained models so far
Check CLASSIFIERS.md for details on the reported performances of each model.
Spanish models
English models
Instructions for developers
- First, download TASS 2020 data to data/tass2020 (registration is required to download the dataset)
Labels must be placed under data/tass2020/test1.1/labels
- Run script to train models
Check TRAIN_EVALUATE.md
- Upload models to Huggingface's Model Hub
Check the "Model sharing and upload" instructions in the Hugging Face docs.
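Before running the training step, it can help to confirm that the data sits where the instructions above expect it. The following is a small sketch that checks for the two directories named in the first step; `missing_dirs` is a hypothetical helper and is not shipped with the repository.

```python
from pathlib import Path

# Directories named in the developer instructions above
EXPECTED_DIRS = [
    Path("data/tass2020"),
    Path("data/tass2020/test1.1/labels"),
]


def missing_dirs(root="."):
    """Return the expected data directories that do not exist under `root`."""
    root = Path(root)
    return [p.as_posix() for p in EXPECTED_DIRS if not (root / p).is_dir()]


missing = missing_dirs()
if missing:
    print("Missing directories:", ", ".join(missing))
else:
    print("TASS 2020 data layout looks complete.")
```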
License
pysentimiento is an open-source library. However, please be aware that models are trained with third-party datasets and are subject to their respective licenses, many of which are for non-commercial use.
- TASS Dataset license (License for Sentiment Analysis in Spanish, Emotion Analysis in Spanish & English)
- SemEval 2017 Dataset license (Sentiment Analysis in English)
Citation
If you use pysentimiento in your work, please cite this paper:
@misc{perez2021pysentimiento,
title={pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks},
author={Juan Manuel Pérez and Juan Carlos Giudici and Franco Luque},
year={2021},
eprint={2106.09462},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
TODO:
- Upload some other models
- Train in other languages
Suggestions and bugfixes
Please use the repository issue tracker to report bugs and make suggestions (new models, other datasets, other languages, etc.).