A Python multilingual toolkit for Sentiment Analysis and Social NLP tasks

Last update: Jan 7, 2023

Overview

pysentimiento: A Python toolkit for Sentiment Analysis and Social NLP tasks

A Transformer-based library for SocialNLP classification tasks.

Currently supports:

Sentiment Analysis (Spanish, English)
Emotion Analysis (Spanish, English)

Just do pip install pysentimiento and start using it:

from pysentimiento import SentimentAnalyzer, EmotionAnalyzer
analyzer = SentimentAnalyzer(lang="es")

analyzer.predict("Qué gran jugador es Messi")
# returns SentimentOutput(output=POS, probas={POS: 0.998, NEG: 0.002, NEU: 0.000})
analyzer.predict("Esto es pésimo")
# returns SentimentOutput(output=NEG, probas={NEG: 0.999, POS: 0.001, NEU: 0.000})
analyzer.predict("Qué es esto?")
# returns SentimentOutput(output=NEU, probas={NEU: 0.993, NEG: 0.005, POS: 0.002})

analyzer.predict("jejeje no te creo mucho")
# SentimentOutput(output=NEG, probas={NEG: 0.587, NEU: 0.408, POS: 0.005})
"""
Emotion Analysis in English
"""

emotion_analyzer = EmotionAnalyzer(lang="en")

emotion_analyzer.predict("yayyy")
# returns EmotionOutput(output=joy, probas={joy: 0.723, others: 0.198, surprise: 0.038, disgust: 0.011, sadness: 0.011, fear: 0.010, anger: 0.009})
emotion_analyzer.predict("fuck off")
# returns EmotionOutput(output=anger, probas={anger: 0.798, surprise: 0.055, fear: 0.040, disgust: 0.036, joy: 0.028, others: 0.023, sadness: 0.019})
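
The last Spanish example above is a borderline call (NEG at 0.587 against NEU at 0.408). Here is a minimal sketch of how one might flag such low-confidence predictions. It works on a plain dict shaped like the probas shown above, so it runs without the library. The `pick_label` helper, its `threshold` parameter, and the "UNSURE" fallback are hypothetical and not part of pysentimiento.

```python
def pick_label(probas, threshold=0.7, fallback="UNSURE"):
    """Return the most probable label, or `fallback` if its score is below `threshold`.

    `probas` is a plain dict mapping label -> probability (an assumption that
    mirrors the `probas` field printed in the examples above).
    """
    label, score = max(probas.items(), key=lambda kv: kv[1])
    return label if score >= threshold else fallback


# Probabilities copied from the example outputs above
print(pick_label({"POS": 0.998, "NEG": 0.002, "NEU": 0.000}))  # POS
print(pick_label({"NEG": 0.587, "NEU": 0.408, "POS": 0.005}))  # UNSURE
```

The threshold of 0.7 is arbitrary; a suitable value depends on the application.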

You can also use the pretrained models directly with the transformers library.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("finiteautomata/beto-sentiment-analysis")

model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/beto-sentiment-analysis")
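
The two calls above only load the tokenizer and the model. A sequence-classification model returns raw scores (logits), and a softmax turns them into probabilities like those shown earlier. The sketch below illustrates that last step in pure Python so it runs without downloading anything. The logits and the label order are made up for illustration; with a real model the label order should be read from `model.config.id2label`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_probas(logits, id2label):
    """Pair each probability with its label name."""
    return {id2label[i]: p for i, p in enumerate(softmax(logits))}


# Hypothetical values: not taken from any real model run
id2label = {0: "NEG", 1: "NEU", 2: "POS"}
logits = [-2.0, 0.5, 3.0]

probas = to_probas(logits, id2label)
print(max(probas, key=probas.get))  # POS
```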

Preprocessing

pysentimiento features a tweet preprocessor specially suited for tweet classification with transformer-based models.

from pysentimiento.preprocessing import preprocess_tweet

# Replaces user handles and URLs by special tokens
preprocess_tweet("@perezjotaeme debería cambiar esto http://bit.ly/sarasa") # "@usuario debería cambiar esto url"

# Shortens repeated characters
preprocess_tweet("no entiendo naaaaaaaadaaaaaaaa", shorten=2) # "no entiendo naadaa"

# Normalizes laughter
preprocess_tweet("jajajajaajjajaajajaja no lo puedo creer ajajaj") # "jaja no lo puedo creer jaja"

# Handles hashtags
preprocess_tweet("esto es #UnaGenialidad")
# "esto es una genialidad"

# Handles emojis
preprocess_tweet("🎉🎉", lang="en")
# 'emoji party popper emoji emoji party popper emoji'
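
To make the first steps above concrete, here is a simplified sketch of this kind of normalization using only regular expressions. It is not the library's implementation: `simple_preprocess` and its parameters are hypothetical, the hashtag split only breaks at an ASCII lowercase-to-uppercase boundary, and laughter and emoji handling are left out.

```python
import re


def simple_preprocess(text, shorten=2, user_token="@usuario", url_token="url"):
    """Illustrative tweet normalization (hypothetical helper, not pysentimiento's code)."""
    # Replace URLs and user handles with placeholder tokens
    text = re.sub(r"https?://\S+", url_token, text)
    text = re.sub(r"@\w+", user_token, text)

    # Collapse runs of the same character longer than `shorten`
    text = re.sub(r"(.)\1{%d,}" % shorten, lambda m: m.group(1) * shorten, text)

    # Split camel-case hashtags into lowercase words
    def split_hashtag(match):
        words = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", match.group(1))
        return words.lower()

    return re.sub(r"#(\w+)", split_hashtag, text)


print(simple_preprocess("@perezjotaeme debería cambiar esto http://bit.ly/sarasa"))
print(simple_preprocess("no entiendo naaaaaaaadaaaaaaaa"))
print(simple_preprocess("esto es #UnaGenialidad"))
```

A camel-case split like this fails on hashtags written entirely in lowercase.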

Trained models so far

Check CLASSIFIERS.md for details on the reported performances of each model.

Spanish models

English models

Instructions for developers

First, download TASS 2020 data to data/tass2020 (you have to register here to download the dataset)

Labels must be placed under data/tass2020/test1.1/labels

Run script to train models

Check TRAIN_EVALUATE.md

Upload models to Huggingface's Model Hub

Check "Model sharing and upload" instructions in huggingface docs.

License

pysentimiento is an open-source library. However, please be aware that the models are trained on third-party datasets and are subject to their respective licenses, many of which allow non-commercial use only.

TASS Dataset license (License for Sentiment Analysis in Spanish, Emotion Analysis in Spanish & English)
SEMEval 2017 Dataset license (Sentiment Analysis in English)

Citation

If you use pysentimiento in your work, please cite this paper

@misc{perez2021pysentimiento,
      title={pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks},
      author={Juan Manuel Pérez and Juan Carlos Giudici and Franco Luque},
      year={2021},
      eprint={2106.09462},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

TODO:

Upload some other models
Train in other languages

Suggestions and bugfixes

Please use the repository issue tracker to report bugs and make suggestions (new models, other datasets, other languages, etc.).

Comments

Why epochs is 5?

Hi, I noticed that the number of epochs is set to 5 during training. Why not set it a bit higher? Is it because a larger number of epochs would cause overfitting? Thanks!

opened by gongshaojie12 9
Support for python 3.10

Hi, first of all great work!

I am trying to install the dependency with pip in python3.10

when I run the command:

pip3 install git+https://github.com/pysentimiento/pysentimiento.git

I get the error:

ERROR: Package 'pysentimiento' requires a different Python: 3.10.4 not in '<3.10,>=3.7'

Have you thought about making the library compatible with python3.10?

Thanks in advance.

opened by HugoJBello 7
When installing with pip install pysentimiento, analyzer.py and __init__.py differ from the ones in the repository

Describe the bug When installing with pip install pysentimiento, analyzer.py and __init__.py differ from the ones in the repository. It installs 0.2.5 instead of 0.4.2.

To Reproduce Using Python 3.10.6, I tried installing with pip install pysentimiento; however, the analyzer.py and __init__.py files are different from the ones in the GitHub repository.

To make the code work, I have to download the files from the GitHub repository and replace them.

Expected behavior How can we install directly from the GitHub repository? I tried, but it was not possible.

Environment pip freeze: absl-py==1.2.0 aiohttp==3.8.1 aiosignal==1.2.0 astunparse==1.6.3 async-timeout==4.0.2 attrs==22.1.0 Automat==20.2.0 cachetools==5.2.0 certifi==2022.6.15 cffi==1.15.1 charset-normalizer==2.1.1 click==8.1.3 cloudpickle==2.1.0 colorama==0.4.5 configparser==5.3.0 constantly==15.1.0 coverage==6.4.4 coveralls==3.3.1 cryptography==37.0.4 datasets==2.4.0 defusedxml==0.7.1 dill==0.3.5.1 docopt==0.6.2 emoji==2.0.0 exceptiongroup==1.0.0rc9 filelock==3.8.0 Flask==2.2.2 Flask-Cors==3.0.10 Flask-WTF==1.0.1 flatbuffers==1.12 frozenlist==1.3.1 fsspec==2022.8.2 future==0.18.2 gast==0.4.0 genson==1.2.2 google-auth==2.11.0 google-auth-oauthlib==0.4.6 google-pasta==0.2.0 grpcio==1.48.1 h5py==3.7.0 huggingface-hub==0.9.1 hyperlink==21.0.0 hypothesis==6.54.5 idna==3.3 incremental==21.3.0 iniconfig==1.1.1 itsdangerous==2.1.2 Jinja2==3.1.2 joblib==1.1.0 jsonschema==4.15.0 keras==2.9.0 Keras-Preprocessing==1.1.2 libclang==14.0.6 Markdown==3.4.1 MarkupSafe==2.1.1 mock==4.0.3 multidict==6.0.2 multiprocess==0.70.13 nltk==3.7 numpy==1.23.2 oauthlib==3.2.0 opt-einsum==3.3.0 packaging==21.3 pandas==1.4.4 pluggy==1.0.0 protobuf==3.19.4 py==1.11.0 pyarrow==9.0.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycparser==2.21 PyJWT==2.4.0 pyOpenSSL==22.0.0 pyparsing==3.0.9 pyrsistent==0.18.1 pysentimiento==0.2.5 pytest==7.1.3 pytest-cov==3.0.0 python-dateutil==2.8.2 pytz==2022.2.1 PyYAML==6.0 regex==2022.8.17 requests==2.28.1 requests-oauthlib==1.3.1 responses==0.18.0 rsa==4.9 scikit-learn==1.1.2 scipy==1.9.1 sentiment-analysis-spanish==0.0.25 simplejson==3.17.6 six==1.16.0 sklearn==0.0 sortedcontainers==2.4.0 tableauserverclient==0.19.0 tabpy==2.5.0 tensorboard==2.9.1 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.1 tensorflow==2.9.2 tensorflow-estimator==2.9.0 tensorflow-io-gcs-filesystem==0.26.0 termcolor==1.1.0 textblob==0.17.1 threadpoolctl==3.1.0 tokenizers==0.12.1 tomli==2.0.1 torch==1.12.1 tornado==6.2 tqdm==4.64.1 transformers==4.21.3 Twisted==22.4.0 
twisted-iocpsupport==1.0.2 typing_extensions==4.3.0 urllib3==1.26.12 Werkzeug==2.2.2 wrapt==1.14.1 WTForms==3.0.1 xxhash==3.0.0 yarl==1.8.1 zope.interface==5.4.0

python --version Python 3.10.6

Additional context Once I have replaced the files with the ones in the repository, it worked.

opened by difemaro 6
ImportError: cannot import name 'SentimentAnalyzer'

Hi,

I'm trying to use the code in Python 3, but it insists that I need that library. Where can I find it?

ImportError: cannot import name 'SentimentAnalyzer'

Regards,

opened by davesnake01 6
updated for compatibility with python3.10

I added two __init__.py files in the test directories; otherwise I could not run the tests. I could not install properly with poetry, and I suspect it is not fully compatible with my Python version. Nevertheless, I installed the project with the same packages and versions using venv, and everything works (including all the integration and unit tests).

opened by HugoJBello 5
[BUG] Cannot make predictions for an array of texts
Describe the bug I'm trying to predict the sentiment of an array of texts in Spanish, but I get this error: Error: "softmax_lastdim_kernel_impl" not implemented for 'Half'

To Reproduce

# Import and instantiate transformers model
from pysentimiento import create_analyzer
analyzer = create_analyzer(task="sentiment", lang="es")
print('Model instantiated' + '\n')

# Obtain sentiment label from a text
def get_sentence_sentiment(prediction):
    sentiment = prediction.output
    return sentiment

# Obtain the sentiment score of a text
def get_sentence_score(prediction):
    score = max(prediction.probas.values())
    return score

# Obtain the sentiments of the texts
def obtain_sentiments(df):
    texts = df['content'].to_numpy()

    from torch import autocast
    with autocast("cuda"):
        predictions = analyzer.predict(texts)

    sentiment_labels = [get_sentence_sentiment(prediction) for prediction in predictions]
    sentiment_scores = [get_sentence_score(prediction) for prediction in predictions]

    df['sentiment_label'] = sentiment_labels
    df['sentiment_score'] = sentiment_scores

    return df

Expected behavior I expected it to return a dataframe with two new columns, one with the sentiment labels and the other with their scores.

Environment pip freeze absl-py==0.15.0 adal==1.2.7 adlfs==2022.7.0 aiohttp==3.8.1 aiohttp-cors==0.7.0 aiosignal==1.2.0 alembic==1.8.1 analytics-python==1.4.0 ansiwrap==0.8.4 antlr4-python3-runtime==4.9.3 anyio==3.6.1 app-store-scraper==0.3.5 applicationinsights==0.11.10 arch==4.14 argcomplete==2.0.0 argon2-cffi==21.3.0 argon2-cffi-bindings==21.2.0 arviz @ file:///tmp/build/80754af9/arviz_1614019183254/work astroid==2.11.7 asttokens==2.0.5 astunparse==1.6.3 async-timeout==4.0.2 attrs==21.4.0 auto-tqdm==1.0.2 autokeras==1.0.16 autopep8==1.6.0 azure-appconfiguration==1.1.1 azure-batch==12.0.0 azure-cli==2.38.0 azure-cli-core==2.38.0 azure-cli-telemetry==1.0.6 azure-common==1.1.28 azure-core==1.22.1 azure-cosmos==3.2.0 azure-data-tables==12.4.0 azure-datalake-store==0.0.52 azure-graphrbac==0.61.1 azure-identity==1.7.0 azure-keyvault==1.1.0 azure-keyvault-administration==4.0.0b3 azure-keyvault-keys==4.5.1 azure-loganalytics==0.1.1 azure-mgmt-advisor==9.0.0 azure-mgmt-apimanagement==3.0.0 azure-mgmt-appconfiguration==2.1.0 azure-mgmt-applicationinsights==1.0.0 azure-mgmt-authorization==2.0.0 azure-mgmt-batch==16.1.0 azure-mgmt-batchai==7.0.0b1 azure-mgmt-billing==6.0.0 azure-mgmt-botservice==2.0.0b3 azure-mgmt-cdn==12.0.0 azure-mgmt-cognitiveservices==13.2.0 azure-mgmt-compute==27.1.0 azure-mgmt-consumption==2.0.0 azure-mgmt-containerinstance==9.1.0 azure-mgmt-containerregistry==10.0.0 azure-mgmt-containerservice==19.1.0 azure-mgmt-core==1.3.0 azure-mgmt-cosmosdb==7.0.0b6 azure-mgmt-databoxedge==1.0.0 azure-mgmt-datalake-analytics==0.2.1 azure-mgmt-datalake-nspkg==3.0.1 azure-mgmt-datalake-store==0.5.0 azure-mgmt-datamigration==10.0.0 azure-mgmt-deploymentmanager==0.2.0 azure-mgmt-devtestlabs==4.0.0 azure-mgmt-dns==8.0.0 azure-mgmt-eventgrid==9.0.0 azure-mgmt-eventhub==10.1.0 azure-mgmt-extendedlocation==1.0.0b2 azure-mgmt-hdinsight==9.0.0 azure-mgmt-imagebuilder==1.0.0 azure-mgmt-iotcentral==10.0.0b1 azure-mgmt-iothub==2.2.0 azure-mgmt-iothubprovisioningservices==1.1.0 
azure-mgmt-keyvault==10.0.0 azure-mgmt-kusto==0.3.0 azure-mgmt-loganalytics==13.0.0b4 azure-mgmt-managedservices==1.0.0 azure-mgmt-managementgroups==1.0.0 azure-mgmt-maps==2.0.0 azure-mgmt-marketplaceordering==1.1.0 azure-mgmt-media==9.0.0 azure-mgmt-monitor==3.0.0 azure-mgmt-msi==6.0.1 azure-mgmt-netapp==8.0.0 azure-mgmt-network==20.0.0 azure-mgmt-nspkg==3.0.2 azure-mgmt-policyinsights==1.1.0b2 azure-mgmt-privatedns==1.0.0 azure-mgmt-rdbms==10.0.0 azure-mgmt-recoveryservices==2.0.0 azure-mgmt-recoveryservicesbackup==5.0.0 azure-mgmt-redhatopenshift==1.1.0 azure-mgmt-redis==13.1.0 azure-mgmt-relay==0.1.0 azure-mgmt-reservations==2.0.0 azure-mgmt-resource==21.1.0 azure-mgmt-search==8.0.0 azure-mgmt-security==2.0.0b1 azure-mgmt-servicebus==7.1.0 azure-mgmt-servicefabric==1.0.0 azure-mgmt-servicefabricmanagedclusters==1.0.0 azure-mgmt-servicelinker==1.0.0 azure-mgmt-signalr==1.0.0b2 azure-mgmt-sql==4.0.0b2 azure-mgmt-sqlvirtualmachine==1.0.0b3 azure-mgmt-storage==20.0.0 azure-mgmt-synapse==2.1.0b2 azure-mgmt-trafficmanager==1.0.0 azure-mgmt-web==6.1.0 azure-multiapi-storage==0.9.0 azure-nspkg==3.0.2 azure-storage-blob==12.9.0 azure-storage-common==1.4.2 azure-storage-queue==12.3.0 azure-synapse-accesscontrol==0.5.0 azure-synapse-artifacts==0.13.0 azure-synapse-managedprivateendpoints==0.3.0 azure-synapse-spark==0.2.0 azureml-accel-models==1.43.0 azureml-automl-core==1.43.0 azureml-automl-dnn-nlp==1.43.0.post1 azureml-automl-runtime==1.43.0 azureml-cli-common==1.43.0 azureml-contrib-automl-pipeline-steps==1.43.0 azureml-contrib-dataset==1.43.0 azureml-contrib-fairness==1.43.0 azureml-contrib-notebook==1.43.0 azureml-contrib-pipeline-steps==1.43.0 azureml-contrib-reinforcementlearning==1.43.0 azureml-contrib-server==1.43.0 azureml-contrib-services==1.43.0 azureml-core==1.43.0 azureml-datadrift==1.43.0 azureml-dataprep==4.0.4 azureml-dataprep-native==38.0.0 azureml-dataprep-rslex==2.6.3 azureml-dataset-runtime==1.43.0.post2 azureml-defaults==1.43.0 
azureml-explain-model==1.43.0 azureml-inference-server-http==0.4.13 azureml-interpret==1.43.0 azureml-mlflow==1.43.0.post1 azureml-opendatasets==1.43.0 azureml-pipeline==1.43.0 azureml-pipeline-core==1.43.0 azureml-pipeline-steps==1.43.0 azureml-responsibleai==1.43.0 azureml-samples @ file:///mnt/jupyter-azsamples azureml-sdk==1.43.0 azureml-telemetry==1.43.0 azureml-tensorboard==1.43.0 azureml-train==1.43.0 azureml-train-automl==1.43.0 azureml-train-automl-client==1.43.0 azureml-train-automl-runtime==1.43.0 azureml-train-core==1.43.0 azureml-train-restclients-hyperdrive==1.43.0 azureml-training-tabular==1.43.0 azureml-widgets==1.43.0 Babel==2.10.3 backcall==0.2.0 backoff==1.10.0 backports.functools-lru-cache @ file:///home/conda/feedstock_root/build_artifacts/backports.functools_lru_cache_1618230623929/work backports.tempfile==1.0 backports.weakref==1.0.post1 backports.zoneinfo==0.2.1 bcrypt==3.2.2 beautifulsoup4==4.11.1 bleach==5.0.1 blessed==1.19.1 blis==0.4.1 bokeh==2.4.3 Boruta==0.3 boto==2.49.0 boto3==1.20.19 botocore==1.23.19 Bottleneck==1.3.5 cachetools==5.2.0 catalogue==1.0.0 certifi @ file:///opt/conda/conda-bld/certifi_1655968806487/work/certifi cffi @ file:///opt/conda/conda-bld/cffi_1642701102775/work cftime @ file:///tmp/build/80754af9/cftime_1638357901230/work chardet==3.0.4 charset-normalizer==2.0.12 click==7.1.2 cloudpickle @ file:///Users/ktietz/demo/mc3/conda-bld/cloudpickle_1629142150447/work colorama==0.4.5 colorful==0.5.4 colorlover==0.3.0 configparser==3.7.4 contextlib2==21.6.0 convertdate @ file:///tmp/build/80754af9/convertdate_1634070773133/work coremltools @ git+https://github.com/apple/coremltools@13c064ed99ab1da7abea0196e4ddf663ede48aad cramjam==2.5.0 cryptography==37.0.3 cufflinks==0.17.3 cycler @ file:///tmp/build/80754af9/cycler_1637851556182/work cymem==2.0.6 Cython==0.29.17 dask==2.30.0 dask-sql==2022.6.0 databricks-cli==0.17.0 dataclasses==0.6 datasets==2.6.1 debugpy==1.6.0 decorator==5.1.1 defusedxml==0.7.1 Deprecated==1.2.13 
dice-ml==0.8 dill==0.3.5.1 distlib==0.3.5 distributed==2.30.1 distro==1.7.0 dm-tree==0.1.7 docker==5.0.3 docopt==0.6.2 dotnetcore2==3.1.23 dowhy==0.7.1 econml==0.12.0 emoji==1.7.0 en-core-web-sm @ https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz encrypted-inference==0.9 entrypoints==0.4 environments-utils==1.0.4 ephem @ file:///tmp/build/80754af9/ephem_1638942191467/work erroranalysis==0.3.2 executing==0.8.3 fabric==2.7.1 fairlearn==0.7.0 fastai==1.0.61 fastapi==0.79.0 fastjsonschema==2.15.3 fastparquet==0.8.1 fastprogress==1.0.3 fbprophet @ file:///home/conda/feedstock_root/build_artifacts/fbprophet_1599365532360/work ffmpy==0.3.0 filelock==3.7.1 fire==0.4.0 flake8==4.0.1 Flask==1.0.3 Flask-Cors==3.0.10 flatbuffers==2.0 fonttools==4.25.0 frozenlist==1.3.0 fsspec==2022.5.0 funcy==1.17 fusepy==3.0.1 future==0.18.2 gast==0.3.3 gensim==3.8.3 gevent==1.3.6 gitdb==4.0.9 GitPython==3.1.27 google-api-core==2.8.2 google-auth==2.8.0 google-auth-oauthlib==0.4.6 google-pasta==0.2.0 google-play-scraper==1.2.2 googleapis-common-protos==1.56.3 gpustat==1.0.0rc1 gradio==3.1.7 greenlet==1.1.2 grpcio==1.47.0 gunicorn==20.1.0 gym==0.21.0 h11==0.12.0 h5py==3.7.0 HeapDict==1.0.1 hijri-converter @ file:///tmp/build/80754af9/hijri-converter_1634064010501/work holidays==0.10.3 horovod==0.19.1 htmlmin==0.1.12 httpcore==0.15.0 httpx==0.23.0 huggingface-hub==0.10.1 humanfriendly==10.0 humanize==4.2.3 idna==2.10 ImageHash==4.2.1 imageio==2.19.5 imbalanced-learn==0.7.0 importlib-metadata==4.11.4 importlib-resources==5.8.0 inference-schema==1.3.0 interpret-community==0.26.0 interpret-core==0.2.7 invoke==1.7.1 ipykernel==6.8.0 ipython==8.4.0 ipython-genutils==0.2.0 ipywidgets==7.7.1 isodate==0.6.1 isort==5.10.1 itsdangerous==1.1.0 javaproperties==0.5.2 jedi==0.18.0 jeepney==0.8.0 Jinja2==2.11.2 jmespath==0.10.0 joblib==0.14.1 JPype1==1.4.0 json-logging-py==0.2 json5==0.9.8 jsondiff==2.0.0 jsonpickle==2.2.0 jsonschema==4.6.0 jupyter==1.0.0 jupyter-client==6.1.12 
jupyter-console==6.4.4 jupyter-core==4.10.0 jupyter-resource-usage==0.6.1 jupyter-server==1.18.1 jupyter-server-mathjax==0.2.6 jupyter-server-proxy==3.2.1 jupyterlab==3.2.4 jupyterlab-nvdashboard==0.7.0 jupyterlab-pygments==0.2.2 jupyterlab-server==2.15.0 jupyterlab-system-monitor==0.8.0 jupyterlab-topbar==0.6.1 jupyterlab-widgets==1.1.1 jupytext==1.14.0 Keras==2.3.1 Keras-Applications==1.0.8 keras-nightly==2.5.0.dev2021032900 Keras-Preprocessing==1.1.2 keras-tuner==1.1.3 keras2onnx==1.6.0 kiwisolver==1.4.3 kmodes==0.12.1 knack==0.9.0 korean-lunar-calendar @ file:///tmp/build/80754af9/korean_lunar_calendar_1634063020401/work kt-legacy==1.0.4 lazy-object-proxy==1.7.1 liac-arff==2.5.0 lightgbm==3.2.1 linkify-it-py==1.0.3 llvmlite==0.36.0 locket==1.0.0 LunarCalendar @ file:///tmp/build/80754af9/lunarcalendar_1646383991234/work lz4==4.0.1 Mako==1.2.1 Markdown==3.4.1 markdown-it-py==2.1.0 MarkupSafe==2.0.1 matplotlib==3.2.1 matplotlib-inline==0.1.3 mccabe==0.6.1 mdit-py-plugins==0.3.0 mdurl==0.1.1 missingno==0.5.1 mistune==0.8.4 ml-wrappers==0.2.0 mlflow==1.27.0 mlflow-skinny==1.26.1 mlxtend==0.20.0 monotonic==1.6 mpmath==1.2.1 msal==1.18.0 msal-extensions==0.3.1 msgpack==1.0.4 msrest==0.6.21 msrestazure==0.6.4 multidict==6.0.2 multimethod==1.8 multiprocess==0.70.13 munkres==1.1.4 murmurhash==1.0.7 nbclassic==0.4.3 nbclient==0.6.6 nbconvert==6.5.0 nbdime==3.1.1 nbformat==5.2.0 ndg-httpsclient==0.5.1 nest-asyncio==1.5.5 netCDF4==1.5.7 networkx==2.5 nimbusml==1.8.0 nltk==3.7 notebook==6.4.12 notebook-shim==0.1.0 numba==0.53.1 numexpr==2.8.3 numpy==1.19.0 nvidia-ml-py==11.495.46 nvidia-ml-py3==7.352.0 oauthlib==3.2.0 olefile @ file:///Users/ktietz/demo/mc3/conda-bld/olefile_1629805411829/work onnx==1.7.0 onnxconverter-common==1.6.0 onnxmltools==1.4.1 onnxruntime==1.8.1 opencensus==0.9.0 opencensus-context==0.1.2 opencensus-ext-azure==1.1.4 opencv-python-headless==4.6.0.66 opt-einsum==3.3.0 orjson==3.7.12 packaging @ file:///tmp/build/80754af9/packaging_1637314298585/work 
pandas==1.1.5 pandas-ml==0.6.1 pandas-profiling==3.2.0 pandocfilters==1.5.0 papermill==1.2.1 paramiko==2.11.0 parso==0.8.3 partd==1.2.0 pathlib2==2.3.7.post1 pathspec==0.9.0 patsy==0.5.2 pexpect==4.8.0 phik==0.12.2 pickleshare==0.7.5 Pillow==6.2.1 pipreqs==0.4.11 pkginfo==1.8.3 plac==1.1.3 platformdirs==2.5.2 plotly==5.9.0 pluggy==1.0.0 pmdarima==1.7.1 portalocker==2.4.0 preshed==3.0.6 prometheus-client==0.14.1 prometheus-flask-exporter==0.20.2 prompt-toolkit==3.0.28 property-cached==1.6.4 protobuf==3.20.1 psutil==5.9.1 psycopg2 @ file:///tmp/build/80754af9/psycopg2_1612298147424/work ptyprocess==0.7.0 pure-eval==0.2.2 py-spy==0.3.12 py4j==0.10.9.5 pyarrow==10.0.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycaret==2.3.10 pycocotools==2.0.2 pycodestyle==2.6.0 pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work pycryptodome==3.15.0 pydantic==1.9.1 pydocstyle==6.1.1 pydot==1.4.2 pydub==0.25.1 pyflakes==2.2.0 PyGithub==1.55 Pygments==2.12.0 PyJWT==2.4.0 pyLDAvis==3.3.1 pylint==2.14.5 PyMeeus @ file:///tmp/build/80754af9/pymeeus_1634069098549/work PyNaCl==1.5.0 pynndescent==0.5.7 pynvml==11.4.1 pyod==1.0.3 pyodbc @ file:///tmp/build/80754af9/pyodbc_1647408110185/work pyOpenSSL==22.0.0 pyparsing==3.0.9 pyreadline3==3.4.1 pyrsistent==0.18.1 pysentimiento==0.5.2 PySocks==1.7.1 pyspark==3.3.0 pystan @ file:///home/conda/feedstock_root/build_artifacts/pystan_1598392747715/work python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work python-jsonrpc-server==0.4.0 python-language-server==0.35.0 python-multipart==0.0.5 python-snappy==0.6.1 pytoolconfig==1.2.1 pytorch-transformers==1.0.0 pytz==2019.3 pytz-deprecation-shim==0.1.0.post0 PyWavelets==1.3.0 PyYAML==6.0 pyzmq==23.2.0 qtconsole==5.3.1 QtPy==2.1.0 QuantLib==1.27 querystring-parser==1.2.4 rai-core-flask==0.3.0 raiutils==0.1.0 raiwidgets==0.19.0 ray==1.13.0 regex==2022.6.2 requests==2.23.0 requests-oauthlib==1.3.1 responses==0.18.0 responsibleai==0.19.0 rfc3986==1.5.0 rope==1.2.0 rsa==4.8 
s3transfer==0.5.2 sacremoses==0.0.53 scikit-image==0.19.3 scikit-learn==0.22.1 scikit-plot==0.3.7 scipy==1.5.3 scp==0.13.6 scrapbook==0.5.0 seaborn==0.11.2 SecretStorage==3.3.2 semver==2.13.0 Send2Trash==1.8.0 sentencepiece==0.1.96 seqeval==1.2.2 setuptools-git==1.2 shap==0.39.0 simpervisor==0.4 six==1.16.0 skl2onnx==1.4.9 sklearn-pandas==1.7.0 slicer==0.0.7 smart-open==1.9.0 smmap==5.0.0 sniffio==1.2.0 snowballstemmer==2.2.0 sortedcontainers==2.4.0 soupsieve==2.3.2.post1 spacy==2.2.4 sparse==0.13.0 SQLAlchemy==1.4.39 sqlparse==0.4.2 srsly==1.0.5 sshtunnel==0.1.5 stack-data==0.3.0 starlette==0.19.1 statsmodels==0.11.0 sympy==1.10.1 tabulate==0.8.10 tangled-up-in-unicode==0.2.0 tblib==1.7.0 tenacity==8.0.1 tensorboard==2.2.2 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.1 tensorboardX==2.5.1 tensorflow==2.2.0 tensorflow-estimator==2.2.0 tensorflow-gpu==2.2.0 termcolor==1.1.0 terminado==0.15.0 testpath==0.6.0 textblob==0.17.1 textwrap3==0.9.2 thinc==7.4.0 threadpoolctl @ file:///Users/ktietz/demo/mc3/conda-bld/threadpoolctl_1629802263681/work tifffile==2022.5.4 tinycss2==1.1.1 tokenizers==0.13.2 toml==0.10.2 tomli==2.0.1 tomlkit==0.11.1 toolz==0.11.2 torch==1.11.0+cu113 torch-tb-profiler==0.4.0 torchaudio==0.11.0+cu113 torchvision==0.12.0+cu113 tornado==6.1 tqdm @ file:///opt/conda/conda-bld/tqdm_1650891076910/work traitlets==5.3.0 transformers==4.24.0 typing-extensions==4.2.0 tzdata==2022.1 tzlocal==4.2 uc-micro-py==1.0.1 ujson==5.4.0 umap-learn==0.5.3 urllib3==1.25.11 uuid==1.30 uvicorn==0.18.2 virtualenv==20.15.1 visions==0.7.4 waitress==2.1.1 wasabi==0.9.1 wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1600965781394/work webencodings==0.5.1 websocket-client==1.3.3 websockets==10.3 Werkzeug==1.0.1 widgetsnbextension==3.6.1 wordcloud==1.8.2.2 wrapt==1.12.1 xarray @ file:///opt/conda/conda-bld/xarray_1639166117697/work xgboost==1.3.3 xmltodict==0.13.0 xxhash==3.0.0 yapf==0.32.0 yarg==0.1.9 yarl==1.7.2 yellowbrick==1.4 zict==2.2.0 
zipp==3.8.0 zope.event==4.5.0 zope.interface==5.4.0

python --version Python 3.8.13
opened by juanchate 4
Download and use model locally
Hi, guys. First of all, great lib; it works great and is helping me a ton in a recent project. I'm building an app for my job, but I have some security limitations, one of which is that I can't reach external endpoints from the internal network. Is there any way to load the model locally after downloading it? With the Hugging Face library it would be something like:

!git clone https://huggingface.co/ORGANIZATION_OR_USER/MODEL_NAME

from transformers import AutoModel
model = AutoModel.from_pretrained('./MODEL_NAME')

Thanks in advance.
opened by arieltoledo 4
ValueError: Non-consecutive added token '' found. Should have index 63996 but has index 64000 in saved vocabulary

I am getting the following error while loading the tokenizer. Is this usage allowed?

Code:

tokenizer = BertTokenizer.from_pretrained('finiteautomata/bertweet-base-sentiment-analysis')
model = BertForSequenceClassification.from_pretrained('finiteautomata/bertweet-base-sentiment-analysis')

Error:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'BertweetTokenizer'.
The class this function is called from is 'BertTokenizer'.

ValueError                                Traceback (most recent call last)
in ()
      2
      3 # initialize the tokenizer for BERT models
----> 4 tokenizer = BertTokenizer.from_pretrained('finiteautomata/bertweet-base-sentiment-analysis')
      5 # initialize the model for sequence classification
      6 model = BertForSequenceClassification.from_pretrained('finiteautomata/bertweet-base-sentiment-analysis')

1 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, *init_inputs, **kwargs)
   1920 # current length of the tokenizer.
   1921 raise ValueError(
-> 1922 f"Non-consecutive added token '{token}' found. "
   1923 f"Should have index {len(tokenizer)} but has index {index} in saved vocabulary."
   1924 )

ValueError: Non-consecutive added token '' found. Should have index 63996 but has index 64000 in saved vocabulary.

opened by amitkayal 4
Tokenizer Error
Hello, I am getting an error when the following code (extracted from the examples) is executed:

from pysentimiento import SentimentAnalyzer
analyzer = SentimentAnalyzer(lang="es")

Error:

AssertionError: Non-consecutive added token '@usuario' found. Should have index 31006 but has index 31002 in saved vocabulary.

Thank you
opened by JOTOR 4
Issue below when using the transformers code

AssertionError                            Traceback (most recent call last)
in
      1 from transformers import AutoTokenizer, AutoModelForSequenceClassification
      2
----> 3 tokenizer = AutoTokenizer.from_pretrained("finiteautomata/beto-sentiment-analysis")
      4
      5 model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/beto-sentiment-analysis")

~/thesis/copycat/copy_env/lib64/python3.6/site-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    421 tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
    422 if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
--> 423 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    424 else:
    425 if tokenizer_class_py is not None:

~/thesis/copycat/copy_env/lib64/python3.6/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1708
   1709 return cls._from_pretrained(
-> 1710 resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
   1711 )
   1712

~/thesis/copycat/copy_env/lib64/python3.6/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs)
   1814 for token, index in added_tok_encoder_sorted:
   1815 assert index == len(tokenizer), (
-> 1816 f"Non-consecutive added token '{token}' found. "
   1817 f"Should have index {len(tokenizer)} but has index {index} in saved vocabulary."
   1818 )

AssertionError: Non-consecutive added token '[USER]' found. Should have index 31005 but has index 31002 in saved vocabulary.

opened by avinashpaul 4

outdated example on Readme?

Following the example you will get

In [1]: from pysentimiento import SentimentAnalyzer

In [2]: analyzer = SentimentAnalyzer(lang="es")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-479bef79285e> in <module>
----> 1 analyzer = SentimentAnalyzer(lang="es")

TypeError: __init__() got an unexpected keyword argument 'lang'

Edit: looks like the pip version is different to the one on github

opened by Zincr0 4

[BUG] NER analyzer doesn't work if GPU available
Describe the bug NER pipeline explodes if GPU is available

To Reproduce

from pysentimiento import create_analyzer
analyzer = create_analyzer("ner", lang="es")
analyzer.predict("Bill Gates is the founder of Microsoft")

Environment

Environment with a GPU available

pysentimiento == 0.5.2
opened by finiteautomata 0
Add hashtag segmentation with hashformers
Closes #23 .

Usage:

from pysentimiento.preprocessing import preprocess_tweet
from pysentimiento.segmenter import create_segmenter

# Handles hashtags
segmenter = create_segmenter(lang="es", batch_size=1000)
preprocess_tweet("esto es #UnaGenialidad", segmenter=segmenter)
# "esto es una genialidad"

create_segmenter(lang="en") and calling a GPT-2 model directly (e.g. create_segmenter(model_name="gpt2-large")) are also implemented. Calling preprocess_tweet without a segmenter will run the default camel case segmenter.

I have also modified preprocess_tweet to handle both strings and lists of strings.

P.S.: If you are going to evaluate this segmenter on downstream tasks, make sure you also test create_segmenter(lang="en") on Spanish text. This returns a distilgpt2 model, which has achieved good results at segmenting hashtags in other languages. Model size doesn't seem to matter much (distilgpt2 will usually give similar or even better results than gpt2 or gpt2-large).
opened by ruanchaves 9
Package dependency torch version 1.9.0+

Not really an issue, but we use the LTS version of torch, which is currently 1.8.2, but pysentimiento requires newer versions of torch. Is this solvable from your end perhaps? We just use pip's --use-deprecated=legacy-resolver to get around this but we were curious to see if staying on torch 1.8.2 will cause some issues for this library.

Pretty neat package btw, thanks a lot for maintaining it ❤️

opened by anthony2261 1
[Feature Proposal] Use hashformers for hashtag segmentation
preprocess_tweet currently uses a very simple camel case regex to handle hashtag preprocessing. This will obviously fail for most hashtags.

I propose to integrate hashformers with pysentimiento. Here are a few reasons to do this:

Hashformers has been proven by two research groups to be the current state-of-the-art for hashtag segmentation.

It can instantly work with Spanish, English or any other language.

It does not add any significant extra dependencies to the library.

It is very easy to integrate.

If this seems like a good idea to the maintainers of this repository ( @finiteautomata ), I can draft an initial PR for this feature.
opened by ruanchaves 6
