A Python multilingual toolkit for Sentiment Analysis and Social NLP tasks

pysentimiento: A Python toolkit for Sentiment Analysis and Social NLP tasks

A Transformer-based library for SocialNLP classification tasks.

Currently supports:

  • Sentiment Analysis (Spanish, English)
  • Emotion Analysis (Spanish, English)

Just do pip install pysentimiento and start using it:

from pysentimiento import SentimentAnalyzer, EmotionAnalyzer
analyzer = SentimentAnalyzer(lang="es")

analyzer.predict("Qué gran jugador es Messi")
# returns SentimentOutput(output=POS, probas={POS: 0.998, NEG: 0.002, NEU: 0.000})
analyzer.predict("Esto es pésimo")
# returns SentimentOutput(output=NEG, probas={NEG: 0.999, POS: 0.001, NEU: 0.000})
analyzer.predict("Qué es esto?")
# returns SentimentOutput(output=NEU, probas={NEU: 0.993, NEG: 0.005, POS: 0.002})

analyzer.predict("jejeje no te creo mucho")
# returns SentimentOutput(output=NEG, probas={NEG: 0.587, NEU: 0.408, POS: 0.005})

# Emotion Analysis in English

emotion_analyzer = EmotionAnalyzer(lang="en")

emotion_analyzer.predict("yayyy")
# returns EmotionOutput(output=joy, probas={joy: 0.723, others: 0.198, surprise: 0.038, disgust: 0.011, sadness: 0.011, fear: 0.010, anger: 0.009})
emotion_analyzer.predict("fuck off")
# returns EmotionOutput(output=anger, probas={anger: 0.798, surprise: 0.055, fear: 0.040, disgust: 0.036, joy: 0.028, others: 0.023, sadness: 0.019})

You can also use the pretrained models directly with the transformers library.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("finiteautomata/beto-sentiment-analysis")

model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/beto-sentiment-analysis")
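
The snippet above only loads the tokenizer and the model. As a rough sketch of the step that usually follows: a sequence-classification model returns one raw score (logit) per label, and a softmax turns those scores into probabilities like the `probas` shown earlier. The logit values and the label order below are made up for illustration; in practice the label order would come from the model's `config.id2label`.

```python
import math

def logits_to_probas(logits, labels):
    """Map raw scores to a {label: probability} dict with a numerically stable softmax."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

# Hypothetical values: the label order and logits are assumptions, not real model output.
labels = ["NEG", "NEU", "POS"]
probas = logits_to_probas([-2.0, -1.0, 4.0], labels)
best = max(probas, key=probas.get)
print(best, probas)
```

The label with the highest probability plays the role of `output` in the results shown above.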

Preprocessing

pysentimiento features a tweet preprocessor specially suited for tweet classification with transformer-based models.

from pysentimiento.preprocessing import preprocess_tweet

# Replaces user handles and URLs by special tokens
preprocess_tweet("@perezjotaeme debería cambiar esto http://bit.ly/sarasa") # "@usuario debería cambiar esto url"

# Shortens repeated characters
preprocess_tweet("no entiendo naaaaaaaadaaaaaaaa", shorten=2) # "no entiendo naadaa"

# Normalizes laughter
preprocess_tweet("jajajajaajjajaajajaja no lo puedo creer ajajaj") # "jaja no lo puedo creer jaja"

# Handles hashtags
preprocess_tweet("esto es #UnaGenialidad")
# "esto es una genialidad"

# Handles emojis
preprocess_tweet("🎉🎉", lang="en")
# 'emoji party popper emoji emoji party popper emoji'
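
To give an idea of what the first three steps involve, here is a minimal regex-based sketch. It is not the library's implementation: the function name `sketch_preprocess` and all the patterns are assumptions written for this example, and they only cover handles, URLs, repeated characters and "jaja"-style laughter.

```python
import re

USER_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")
# A whole word made only of "j" and "a" that contains "ja" or "aj".
LAUGH_RE = re.compile(r"\b[ja]*(?:ja|aj)[ja]*\b", re.IGNORECASE)

def sketch_preprocess(text, user_token="@usuario", url_token="url", shorten=2):
    """Illustrative approximation of tweet normalization (hypothetical helper)."""
    text = USER_RE.sub(user_token, text)
    text = URL_RE.sub(url_token, text)
    # Collapse any run of more than `shorten` identical characters down to `shorten`.
    text = re.sub(r"(.)\1{%d,}" % shorten, lambda m: m.group(1) * shorten, text)
    # Replace every laughter word with a canonical "jaja".
    return LAUGH_RE.sub("jaja", text)

print(sketch_preprocess("@perezjotaeme debería cambiar esto http://bit.ly/sarasa"))
print(sketch_preprocess("no entiendo naaaaaaaadaaaaaaaa"))
print(sketch_preprocess("jajajajaajjajaajajaja no lo puedo creer ajajaj"))
```

Hashtag and emoji handling are left out; for real use, `preprocess_tweet` is the function to call.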

Trained models so far

Check CLASSIFIERS.md for details on the reported performances of each model.

Spanish models

English models

Instructions for developers

  1. First, download the TASS 2020 data to data/tass2020 (you have to register here to download the dataset).

     Labels must be placed under data/tass2020/test1.1/labels.

  2. Run the script to train the models.

     Check TRAIN_EVALUATE.md.

  3. Upload the models to Huggingface's Model Hub.

     Check the "Model sharing and upload" instructions in the Huggingface docs.
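
Before training, it can help to confirm that the data from the first step is where the scripts expect it. The sketch below checks only the two directories named above; the helper `missing_dirs` is hypothetical and not part of the repository, and the demonstration runs in a temporary directory so no dataset is needed.

```python
from pathlib import Path
import tempfile

# Paths taken from the instructions above, relative to the repository root.
EXPECTED_DIRS = ("data/tass2020", "data/tass2020/test1.1/labels")

def missing_dirs(root="."):
    """Return the expected data directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]

# Demonstration in a throwaway directory instead of a real checkout.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_dirs(tmp)
    (Path(tmp) / "data/tass2020/test1.1/labels").mkdir(parents=True)
    after = missing_dirs(tmp)

print("missing before:", before)
print("missing after:", after)
```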

License

pysentimiento is an open-source library. However, please be aware that the models are trained on third-party datasets and are subject to their respective licenses, many of which are for non-commercial use.

  1. TASS Dataset license (Sentiment Analysis in Spanish, Emotion Analysis in Spanish & English)
  2. SemEval 2017 Dataset license (Sentiment Analysis in English)

Citation

If you use pysentimiento in your work, please cite this paper:

@misc{perez2021pysentimiento,
      title={pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks},
      author={Juan Manuel Pérez and Juan Carlos Giudici and Franco Luque},
      year={2021},
      eprint={2106.09462},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

TODO:

  • Upload some other models
  • Train in other languages

Suggestions and bugfixes

Please use the repository issue tracker to report bugs and make suggestions (new models, other datasets, other languages, etc.).

Comments
  • Why epochs is 5?

    Hi, I noticed that the number of epochs is set to 5 during training. Why not set it a bit larger? Is it because a larger number of epochs would cause overfitting? Thanks!

    opened by gongshaojie12 9
  • Support for python 3.10

    Hi, first of all great work!

    I am trying to install the dependency with pip in python3.10

    when I run the command:

    pip3 install git+https://github.com/pysentimiento/pysentimiento.git

    I get the error:

    ERROR: Package 'pysentimiento' requires a different Python: 3.10.4 not in '<3.10,>=3.7'

    Have you thought about making the library compatible with python3.10?

    Thanks in advance.

    opened by HugoJBello 7
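
The error quoted above comes from the package's declared Python requirement, `<3.10,>=3.7`. The following sketch mimics that comparison with plain tuples to show why 3.10.4 is rejected. It is a simplification, not pip's actual resolver logic, and the function names are made up for this example.

```python
def parse_version(text):
    """Turn '3.10.4' into (3, 10, 4) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))

def python_is_supported(version, minimum="3.7", upper_exclusive="3.10"):
    """Check minimum <= version < upper_exclusive, as in the requirement '<3.10,>=3.7'."""
    v = parse_version(version)
    return parse_version(minimum) <= v < parse_version(upper_exclusive)

for candidate in ("3.6.9", "3.9.13", "3.10.4"):
    print(candidate, python_is_supported(candidate))
```

Comparing tuples rather than strings matters here: as text, "3.10" sorts before "3.7", which would give the wrong answer.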
  • When installing using pip install pysentimiento the analyzer.py and __init__.py are distinct than the ones in the repository

    Describe the bug: When installing with pip install pysentimiento, analyzer.py and __init__.py differ from the ones in the repository. pip installs version 0.2.5 instead of 0.4.2.

    To Reproduce: Using Python 3.10.6, I installed the package with pip install pysentimiento; however, the analyzer.py and __init__.py files are different from the ones in the GitHub repository.

    To make the code work, I had to download the files from the GitHub repository and replace them.

    Expected behavior: How can we install directly from the GitHub repository? I tried, but it was not possible.

    Environment pip freeze: absl-py==1.2.0 aiohttp==3.8.1 aiosignal==1.2.0 astunparse==1.6.3 async-timeout==4.0.2 attrs==22.1.0 Automat==20.2.0 cachetools==5.2.0 certifi==2022.6.15 cffi==1.15.1 charset-normalizer==2.1.1 click==8.1.3 cloudpickle==2.1.0 colorama==0.4.5 configparser==5.3.0 constantly==15.1.0 coverage==6.4.4 coveralls==3.3.1 cryptography==37.0.4 datasets==2.4.0 defusedxml==0.7.1 dill==0.3.5.1 docopt==0.6.2 emoji==2.0.0 exceptiongroup==1.0.0rc9 filelock==3.8.0 Flask==2.2.2 Flask-Cors==3.0.10 Flask-WTF==1.0.1 flatbuffers==1.12 frozenlist==1.3.1 fsspec==2022.8.2 future==0.18.2 gast==0.4.0 genson==1.2.2 google-auth==2.11.0 google-auth-oauthlib==0.4.6 google-pasta==0.2.0 grpcio==1.48.1 h5py==3.7.0 huggingface-hub==0.9.1 hyperlink==21.0.0 hypothesis==6.54.5 idna==3.3 incremental==21.3.0 iniconfig==1.1.1 itsdangerous==2.1.2 Jinja2==3.1.2 joblib==1.1.0 jsonschema==4.15.0 keras==2.9.0 Keras-Preprocessing==1.1.2 libclang==14.0.6 Markdown==3.4.1 MarkupSafe==2.1.1 mock==4.0.3 multidict==6.0.2 multiprocess==0.70.13 nltk==3.7 numpy==1.23.2 oauthlib==3.2.0 opt-einsum==3.3.0 packaging==21.3 pandas==1.4.4 pluggy==1.0.0 protobuf==3.19.4 py==1.11.0 pyarrow==9.0.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycparser==2.21 PyJWT==2.4.0 pyOpenSSL==22.0.0 pyparsing==3.0.9 pyrsistent==0.18.1 pysentimiento==0.2.5 pytest==7.1.3 pytest-cov==3.0.0 python-dateutil==2.8.2 pytz==2022.2.1 PyYAML==6.0 regex==2022.8.17 requests==2.28.1 requests-oauthlib==1.3.1 responses==0.18.0 rsa==4.9 scikit-learn==1.1.2 scipy==1.9.1 sentiment-analysis-spanish==0.0.25 simplejson==3.17.6 six==1.16.0 sklearn==0.0 sortedcontainers==2.4.0 tableauserverclient==0.19.0 tabpy==2.5.0 tensorboard==2.9.1 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.1 tensorflow==2.9.2 tensorflow-estimator==2.9.0 tensorflow-io-gcs-filesystem==0.26.0 termcolor==1.1.0 textblob==0.17.1 threadpoolctl==3.1.0 tokenizers==0.12.1 tomli==2.0.1 torch==1.12.1 tornado==6.2 tqdm==4.64.1 transformers==4.21.3 Twisted==22.4.0 
twisted-iocpsupport==1.0.2 typing_extensions==4.3.0 urllib3==1.26.12 Werkzeug==2.2.2 wrapt==1.14.1 WTForms==3.0.1 xxhash==3.0.0 yarl==1.8.1 zope.interface==5.4.0

    python --version Python 3.10.6

    Additional context: Once I replaced the files with the ones from the repository, it worked.

    opened by difemaro 6
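
When the installed files differ from the repository, a quick first check is which version pip actually installed. A minimal sketch using the standard library's `importlib.metadata` (available from Python 3.8); the helper name `installed_version` is made up for illustration.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

version = installed_version("pysentimiento")
print("pysentimiento:", version or "not installed")
```

If this prints an older version than the one in the repository, the README examples from the repository may not match the installed code.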
  • ImportError: cannot import name 'SentimentAnalyzer'

    Hi,

    I'm trying to use the code in python3, but it keeps telling me I need that library. Where can I find it?

    ImportError: cannot import name 'SentimentAnalyzer'

    Regards,

    opened by davesnake01 6
  • updated for compatibility with python3.10

    I added two __init__.py files in the test directories; otherwise I could not run the tests. I could not install properly with poetry, and I suspect it is not fully compatible with my Python version. Nevertheless, I installed the project with the same packages and versions using venv, and everything works (including all the integration and unit tests).

    opened by HugoJBello 5
  • [BUG] Cannot make predictions for an array of texts

    Describe the bug: I'm trying to predict the sentiment of an array containing texts in Spanish, but I'm getting this error: "softmax_lastdim_kernel_impl" not implemented for 'Half'

    To Reproduce

    
    # Import and instantiate transformers model
    from pysentimiento import create_analyzer
    analyzer = create_analyzer(task="sentiment", lang="es")
    print('Model instantiated' + '\n')
    
    # Obtain sentiment label from a text
    def get_sentence_sentiment(prediction):
        sentiment = prediction.output
        return sentiment
    
    # Obtain the sentiment score of a text
    def get_sentence_score(prediction):
        score = max(prediction.probas.values())
        return score
    
    # Obtain the sentiments of the texts
    def obtain_sentiments(df):
        texts = df['content'].to_numpy()
        from torch import autocast
        with autocast("cuda"):
            predictions = analyzer.predict(texts)
        
        sentiment_labels = [get_sentence_sentiment(prediction) for prediction in predictions]
        sentiment_scores = [get_sentence_score(prediction) for prediction in predictions]
        
        df['sentiment_label'] = sentiment_labels
        df['sentiment_score'] = sentiment_scores
        return df
    

    Expected behavior: I expect it to return a dataframe containing two new columns, one with the sentiment labels and the other with their scores.

    Environment pip freeze absl-py==0.15.0 adal==1.2.7 adlfs==2022.7.0 aiohttp==3.8.1 aiohttp-cors==0.7.0 aiosignal==1.2.0 alembic==1.8.1 analytics-python==1.4.0 ansiwrap==0.8.4 antlr4-python3-runtime==4.9.3 anyio==3.6.1 app-store-scraper==0.3.5 applicationinsights==0.11.10 arch==4.14 argcomplete==2.0.0 argon2-cffi==21.3.0 argon2-cffi-bindings==21.2.0 arviz @ file:///tmp/build/80754af9/arviz_1614019183254/work astroid==2.11.7 asttokens==2.0.5 astunparse==1.6.3 async-timeout==4.0.2 attrs==21.4.0 auto-tqdm==1.0.2 autokeras==1.0.16 autopep8==1.6.0 azure-appconfiguration==1.1.1 azure-batch==12.0.0 azure-cli==2.38.0 azure-cli-core==2.38.0 azure-cli-telemetry==1.0.6 azure-common==1.1.28 azure-core==1.22.1 azure-cosmos==3.2.0 azure-data-tables==12.4.0 azure-datalake-store==0.0.52 azure-graphrbac==0.61.1 azure-identity==1.7.0 azure-keyvault==1.1.0 azure-keyvault-administration==4.0.0b3 azure-keyvault-keys==4.5.1 azure-loganalytics==0.1.1 azure-mgmt-advisor==9.0.0 azure-mgmt-apimanagement==3.0.0 azure-mgmt-appconfiguration==2.1.0 azure-mgmt-applicationinsights==1.0.0 azure-mgmt-authorization==2.0.0 azure-mgmt-batch==16.1.0 azure-mgmt-batchai==7.0.0b1 azure-mgmt-billing==6.0.0 azure-mgmt-botservice==2.0.0b3 azure-mgmt-cdn==12.0.0 azure-mgmt-cognitiveservices==13.2.0 azure-mgmt-compute==27.1.0 azure-mgmt-consumption==2.0.0 azure-mgmt-containerinstance==9.1.0 azure-mgmt-containerregistry==10.0.0 azure-mgmt-containerservice==19.1.0 azure-mgmt-core==1.3.0 azure-mgmt-cosmosdb==7.0.0b6 azure-mgmt-databoxedge==1.0.0 azure-mgmt-datalake-analytics==0.2.1 azure-mgmt-datalake-nspkg==3.0.1 azure-mgmt-datalake-store==0.5.0 azure-mgmt-datamigration==10.0.0 azure-mgmt-deploymentmanager==0.2.0 azure-mgmt-devtestlabs==4.0.0 azure-mgmt-dns==8.0.0 azure-mgmt-eventgrid==9.0.0 azure-mgmt-eventhub==10.1.0 azure-mgmt-extendedlocation==1.0.0b2 azure-mgmt-hdinsight==9.0.0 azure-mgmt-imagebuilder==1.0.0 azure-mgmt-iotcentral==10.0.0b1 azure-mgmt-iothub==2.2.0 
azure-mgmt-iothubprovisioningservices==1.1.0 azure-mgmt-keyvault==10.0.0 azure-mgmt-kusto==0.3.0 azure-mgmt-loganalytics==13.0.0b4 azure-mgmt-managedservices==1.0.0 azure-mgmt-managementgroups==1.0.0 azure-mgmt-maps==2.0.0 azure-mgmt-marketplaceordering==1.1.0 azure-mgmt-media==9.0.0 azure-mgmt-monitor==3.0.0 azure-mgmt-msi==6.0.1 azure-mgmt-netapp==8.0.0 azure-mgmt-network==20.0.0 azure-mgmt-nspkg==3.0.2 azure-mgmt-policyinsights==1.1.0b2 azure-mgmt-privatedns==1.0.0 azure-mgmt-rdbms==10.0.0 azure-mgmt-recoveryservices==2.0.0 azure-mgmt-recoveryservicesbackup==5.0.0 azure-mgmt-redhatopenshift==1.1.0 azure-mgmt-redis==13.1.0 azure-mgmt-relay==0.1.0 azure-mgmt-reservations==2.0.0 azure-mgmt-resource==21.1.0 azure-mgmt-search==8.0.0 azure-mgmt-security==2.0.0b1 azure-mgmt-servicebus==7.1.0 azure-mgmt-servicefabric==1.0.0 azure-mgmt-servicefabricmanagedclusters==1.0.0 azure-mgmt-servicelinker==1.0.0 azure-mgmt-signalr==1.0.0b2 azure-mgmt-sql==4.0.0b2 azure-mgmt-sqlvirtualmachine==1.0.0b3 azure-mgmt-storage==20.0.0 azure-mgmt-synapse==2.1.0b2 azure-mgmt-trafficmanager==1.0.0 azure-mgmt-web==6.1.0 azure-multiapi-storage==0.9.0 azure-nspkg==3.0.2 azure-storage-blob==12.9.0 azure-storage-common==1.4.2 azure-storage-queue==12.3.0 azure-synapse-accesscontrol==0.5.0 azure-synapse-artifacts==0.13.0 azure-synapse-managedprivateendpoints==0.3.0 azure-synapse-spark==0.2.0 azureml-accel-models==1.43.0 azureml-automl-core==1.43.0 azureml-automl-dnn-nlp==1.43.0.post1 azureml-automl-runtime==1.43.0 azureml-cli-common==1.43.0 azureml-contrib-automl-pipeline-steps==1.43.0 azureml-contrib-dataset==1.43.0 azureml-contrib-fairness==1.43.0 azureml-contrib-notebook==1.43.0 azureml-contrib-pipeline-steps==1.43.0 azureml-contrib-reinforcementlearning==1.43.0 azureml-contrib-server==1.43.0 azureml-contrib-services==1.43.0 azureml-core==1.43.0 azureml-datadrift==1.43.0 azureml-dataprep==4.0.4 azureml-dataprep-native==38.0.0 azureml-dataprep-rslex==2.6.3 azureml-dataset-runtime==1.43.0.post2 
azureml-defaults==1.43.0 azureml-explain-model==1.43.0 azureml-inference-server-http==0.4.13 azureml-interpret==1.43.0 azureml-mlflow==1.43.0.post1 azureml-opendatasets==1.43.0 azureml-pipeline==1.43.0 azureml-pipeline-core==1.43.0 azureml-pipeline-steps==1.43.0 azureml-responsibleai==1.43.0 azureml-samples @ file:///mnt/jupyter-azsamples azureml-sdk==1.43.0 azureml-telemetry==1.43.0 azureml-tensorboard==1.43.0 azureml-train==1.43.0 azureml-train-automl==1.43.0 azureml-train-automl-client==1.43.0 azureml-train-automl-runtime==1.43.0 azureml-train-core==1.43.0 azureml-train-restclients-hyperdrive==1.43.0 azureml-training-tabular==1.43.0 azureml-widgets==1.43.0 Babel==2.10.3 backcall==0.2.0 backoff==1.10.0 backports.functools-lru-cache @ file:///home/conda/feedstock_root/build_artifacts/backports.functools_lru_cache_1618230623929/work backports.tempfile==1.0 backports.weakref==1.0.post1 backports.zoneinfo==0.2.1 bcrypt==3.2.2 beautifulsoup4==4.11.1 bleach==5.0.1 blessed==1.19.1 blis==0.4.1 bokeh==2.4.3 Boruta==0.3 boto==2.49.0 boto3==1.20.19 botocore==1.23.19 Bottleneck==1.3.5 cachetools==5.2.0 catalogue==1.0.0 certifi @ file:///opt/conda/conda-bld/certifi_1655968806487/work/certifi cffi @ file:///opt/conda/conda-bld/cffi_1642701102775/work cftime @ file:///tmp/build/80754af9/cftime_1638357901230/work chardet==3.0.4 charset-normalizer==2.0.12 click==7.1.2 cloudpickle @ file:///Users/ktietz/demo/mc3/conda-bld/cloudpickle_1629142150447/work colorama==0.4.5 colorful==0.5.4 colorlover==0.3.0 configparser==3.7.4 contextlib2==21.6.0 convertdate @ file:///tmp/build/80754af9/convertdate_1634070773133/work coremltools @ git+https://github.com/apple/coremltools@13c064ed99ab1da7abea0196e4ddf663ede48aad cramjam==2.5.0 cryptography==37.0.3 cufflinks==0.17.3 cycler @ file:///tmp/build/80754af9/cycler_1637851556182/work cymem==2.0.6 Cython==0.29.17 dask==2.30.0 dask-sql==2022.6.0 databricks-cli==0.17.0 dataclasses==0.6 datasets==2.6.1 debugpy==1.6.0 decorator==5.1.1 
defusedxml==0.7.1 Deprecated==1.2.13 dice-ml==0.8 dill==0.3.5.1 distlib==0.3.5 distributed==2.30.1 distro==1.7.0 dm-tree==0.1.7 docker==5.0.3 docopt==0.6.2 dotnetcore2==3.1.23 dowhy==0.7.1 econml==0.12.0 emoji==1.7.0 en-core-web-sm @ https://aka.ms/automl-resources/packages/en_core_web_sm-2.1.0.tar.gz encrypted-inference==0.9 entrypoints==0.4 environments-utils==1.0.4 ephem @ file:///tmp/build/80754af9/ephem_1638942191467/work erroranalysis==0.3.2 executing==0.8.3 fabric==2.7.1 fairlearn==0.7.0 fastai==1.0.61 fastapi==0.79.0 fastjsonschema==2.15.3 fastparquet==0.8.1 fastprogress==1.0.3 fbprophet @ file:///home/conda/feedstock_root/build_artifacts/fbprophet_1599365532360/work ffmpy==0.3.0 filelock==3.7.1 fire==0.4.0 flake8==4.0.1 Flask==1.0.3 Flask-Cors==3.0.10 flatbuffers==2.0 fonttools==4.25.0 frozenlist==1.3.0 fsspec==2022.5.0 funcy==1.17 fusepy==3.0.1 future==0.18.2 gast==0.3.3 gensim==3.8.3 gevent==1.3.6 gitdb==4.0.9 GitPython==3.1.27 google-api-core==2.8.2 google-auth==2.8.0 google-auth-oauthlib==0.4.6 google-pasta==0.2.0 google-play-scraper==1.2.2 googleapis-common-protos==1.56.3 gpustat==1.0.0rc1 gradio==3.1.7 greenlet==1.1.2 grpcio==1.47.0 gunicorn==20.1.0 gym==0.21.0 h11==0.12.0 h5py==3.7.0 HeapDict==1.0.1 hijri-converter @ file:///tmp/build/80754af9/hijri-converter_1634064010501/work holidays==0.10.3 horovod==0.19.1 htmlmin==0.1.12 httpcore==0.15.0 httpx==0.23.0 huggingface-hub==0.10.1 humanfriendly==10.0 humanize==4.2.3 idna==2.10 ImageHash==4.2.1 imageio==2.19.5 imbalanced-learn==0.7.0 importlib-metadata==4.11.4 importlib-resources==5.8.0 inference-schema==1.3.0 interpret-community==0.26.0 interpret-core==0.2.7 invoke==1.7.1 ipykernel==6.8.0 ipython==8.4.0 ipython-genutils==0.2.0 ipywidgets==7.7.1 isodate==0.6.1 isort==5.10.1 itsdangerous==1.1.0 javaproperties==0.5.2 jedi==0.18.0 jeepney==0.8.0 Jinja2==2.11.2 jmespath==0.10.0 joblib==0.14.1 JPype1==1.4.0 json-logging-py==0.2 json5==0.9.8 jsondiff==2.0.0 jsonpickle==2.2.0 jsonschema==4.6.0 jupyter==1.0.0 
jupyter-client==6.1.12 jupyter-console==6.4.4 jupyter-core==4.10.0 jupyter-resource-usage==0.6.1 jupyter-server==1.18.1 jupyter-server-mathjax==0.2.6 jupyter-server-proxy==3.2.1 jupyterlab==3.2.4 jupyterlab-nvdashboard==0.7.0 jupyterlab-pygments==0.2.2 jupyterlab-server==2.15.0 jupyterlab-system-monitor==0.8.0 jupyterlab-topbar==0.6.1 jupyterlab-widgets==1.1.1 jupytext==1.14.0 Keras==2.3.1 Keras-Applications==1.0.8 keras-nightly==2.5.0.dev2021032900 Keras-Preprocessing==1.1.2 keras-tuner==1.1.3 keras2onnx==1.6.0 kiwisolver==1.4.3 kmodes==0.12.1 knack==0.9.0 korean-lunar-calendar @ file:///tmp/build/80754af9/korean_lunar_calendar_1634063020401/work kt-legacy==1.0.4 lazy-object-proxy==1.7.1 liac-arff==2.5.0 lightgbm==3.2.1 linkify-it-py==1.0.3 llvmlite==0.36.0 locket==1.0.0 LunarCalendar @ file:///tmp/build/80754af9/lunarcalendar_1646383991234/work lz4==4.0.1 Mako==1.2.1 Markdown==3.4.1 markdown-it-py==2.1.0 MarkupSafe==2.0.1 matplotlib==3.2.1 matplotlib-inline==0.1.3 mccabe==0.6.1 mdit-py-plugins==0.3.0 mdurl==0.1.1 missingno==0.5.1 mistune==0.8.4 ml-wrappers==0.2.0 mlflow==1.27.0 mlflow-skinny==1.26.1 mlxtend==0.20.0 monotonic==1.6 mpmath==1.2.1 msal==1.18.0 msal-extensions==0.3.1 msgpack==1.0.4 msrest==0.6.21 msrestazure==0.6.4 multidict==6.0.2 multimethod==1.8 multiprocess==0.70.13 munkres==1.1.4 murmurhash==1.0.7 nbclassic==0.4.3 nbclient==0.6.6 nbconvert==6.5.0 nbdime==3.1.1 nbformat==5.2.0 ndg-httpsclient==0.5.1 nest-asyncio==1.5.5 netCDF4==1.5.7 networkx==2.5 nimbusml==1.8.0 nltk==3.7 notebook==6.4.12 notebook-shim==0.1.0 numba==0.53.1 numexpr==2.8.3 numpy==1.19.0 nvidia-ml-py==11.495.46 nvidia-ml-py3==7.352.0 oauthlib==3.2.0 olefile @ file:///Users/ktietz/demo/mc3/conda-bld/olefile_1629805411829/work onnx==1.7.0 onnxconverter-common==1.6.0 onnxmltools==1.4.1 onnxruntime==1.8.1 opencensus==0.9.0 opencensus-context==0.1.2 opencensus-ext-azure==1.1.4 opencv-python-headless==4.6.0.66 opt-einsum==3.3.0 orjson==3.7.12 packaging @ 
file:///tmp/build/80754af9/packaging_1637314298585/work pandas==1.1.5 pandas-ml==0.6.1 pandas-profiling==3.2.0 pandocfilters==1.5.0 papermill==1.2.1 paramiko==2.11.0 parso==0.8.3 partd==1.2.0 pathlib2==2.3.7.post1 pathspec==0.9.0 patsy==0.5.2 pexpect==4.8.0 phik==0.12.2 pickleshare==0.7.5 Pillow==6.2.1 pipreqs==0.4.11 pkginfo==1.8.3 plac==1.1.3 platformdirs==2.5.2 plotly==5.9.0 pluggy==1.0.0 pmdarima==1.7.1 portalocker==2.4.0 preshed==3.0.6 prometheus-client==0.14.1 prometheus-flask-exporter==0.20.2 prompt-toolkit==3.0.28 property-cached==1.6.4 protobuf==3.20.1 psutil==5.9.1 psycopg2 @ file:///tmp/build/80754af9/psycopg2_1612298147424/work ptyprocess==0.7.0 pure-eval==0.2.2 py-spy==0.3.12 py4j==0.10.9.5 pyarrow==10.0.0 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycaret==2.3.10 pycocotools==2.0.2 pycodestyle==2.6.0 pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work pycryptodome==3.15.0 pydantic==1.9.1 pydocstyle==6.1.1 pydot==1.4.2 pydub==0.25.1 pyflakes==2.2.0 PyGithub==1.55 Pygments==2.12.0 PyJWT==2.4.0 pyLDAvis==3.3.1 pylint==2.14.5 PyMeeus @ file:///tmp/build/80754af9/pymeeus_1634069098549/work PyNaCl==1.5.0 pynndescent==0.5.7 pynvml==11.4.1 pyod==1.0.3 pyodbc @ file:///tmp/build/80754af9/pyodbc_1647408110185/work pyOpenSSL==22.0.0 pyparsing==3.0.9 pyreadline3==3.4.1 pyrsistent==0.18.1 pysentimiento==0.5.2 PySocks==1.7.1 pyspark==3.3.0 pystan @ file:///home/conda/feedstock_root/build_artifacts/pystan_1598392747715/work python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work python-jsonrpc-server==0.4.0 python-language-server==0.35.0 python-multipart==0.0.5 python-snappy==0.6.1 pytoolconfig==1.2.1 pytorch-transformers==1.0.0 pytz==2019.3 pytz-deprecation-shim==0.1.0.post0 PyWavelets==1.3.0 PyYAML==6.0 pyzmq==23.2.0 qtconsole==5.3.1 QtPy==2.1.0 QuantLib==1.27 querystring-parser==1.2.4 rai-core-flask==0.3.0 raiutils==0.1.0 raiwidgets==0.19.0 ray==1.13.0 regex==2022.6.2 requests==2.23.0 requests-oauthlib==1.3.1 responses==0.18.0 
responsibleai==0.19.0 rfc3986==1.5.0 rope==1.2.0 rsa==4.8 s3transfer==0.5.2 sacremoses==0.0.53 scikit-image==0.19.3 scikit-learn==0.22.1 scikit-plot==0.3.7 scipy==1.5.3 scp==0.13.6 scrapbook==0.5.0 seaborn==0.11.2 SecretStorage==3.3.2 semver==2.13.0 Send2Trash==1.8.0 sentencepiece==0.1.96 seqeval==1.2.2 setuptools-git==1.2 shap==0.39.0 simpervisor==0.4 six==1.16.0 skl2onnx==1.4.9 sklearn-pandas==1.7.0 slicer==0.0.7 smart-open==1.9.0 smmap==5.0.0 sniffio==1.2.0 snowballstemmer==2.2.0 sortedcontainers==2.4.0 soupsieve==2.3.2.post1 spacy==2.2.4 sparse==0.13.0 SQLAlchemy==1.4.39 sqlparse==0.4.2 srsly==1.0.5 sshtunnel==0.1.5 stack-data==0.3.0 starlette==0.19.1 statsmodels==0.11.0 sympy==1.10.1 tabulate==0.8.10 tangled-up-in-unicode==0.2.0 tblib==1.7.0 tenacity==8.0.1 tensorboard==2.2.2 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.1 tensorboardX==2.5.1 tensorflow==2.2.0 tensorflow-estimator==2.2.0 tensorflow-gpu==2.2.0 termcolor==1.1.0 terminado==0.15.0 testpath==0.6.0 textblob==0.17.1 textwrap3==0.9.2 thinc==7.4.0 threadpoolctl @ file:///Users/ktietz/demo/mc3/conda-bld/threadpoolctl_1629802263681/work tifffile==2022.5.4 tinycss2==1.1.1 tokenizers==0.13.2 toml==0.10.2 tomli==2.0.1 tomlkit==0.11.1 toolz==0.11.2 torch==1.11.0+cu113 torch-tb-profiler==0.4.0 torchaudio==0.11.0+cu113 torchvision==0.12.0+cu113 tornado==6.1 tqdm @ file:///opt/conda/conda-bld/tqdm_1650891076910/work traitlets==5.3.0 transformers==4.24.0 typing-extensions==4.2.0 tzdata==2022.1 tzlocal==4.2 uc-micro-py==1.0.1 ujson==5.4.0 umap-learn==0.5.3 urllib3==1.25.11 uuid==1.30 uvicorn==0.18.2 virtualenv==20.15.1 visions==0.7.4 waitress==2.1.1 wasabi==0.9.1 wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1600965781394/work webencodings==0.5.1 websocket-client==1.3.3 websockets==10.3 Werkzeug==1.0.1 widgetsnbextension==3.6.1 wordcloud==1.8.2.2 wrapt==1.12.1 xarray @ file:///opt/conda/conda-bld/xarray_1639166117697/work xgboost==1.3.3 xmltodict==0.13.0 xxhash==3.0.0 
yapf==0.32.0 yarg==0.1.9 yarl==1.7.2 yellowbrick==1.4 zict==2.2.0 zipp==3.8.0 zope.event==4.5.0 zope.interface==5.4.0

    python --version Python 3.8.13

    opened by juanchate 4
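
For comparison, here is a sketch of the same labelling step written against a plain Python list and without the `autocast` context. It assumes, as the report's code implies, that `predict` accepts a list of strings and returns one output per text, each with `output` and `probas`. Whether dropping `autocast` avoids the error is not confirmed in this thread. `FakeAnalyzer` is a hypothetical stand-in so the example runs without downloading a model.

```python
from collections import namedtuple

def labels_and_scores(analyzer, texts):
    """Run analyzer.predict on a list of strings; return parallel label and score lists."""
    predictions = analyzer.predict(list(texts))  # list() also accepts arrays and Series
    labels = [p.output for p in predictions]
    scores = [max(p.probas.values()) for p in predictions]
    return labels, scores

# Hypothetical stand-in for a real analyzer, for demonstration only.
FakeOutput = namedtuple("FakeOutput", ["output", "probas"])

class FakeAnalyzer:
    def predict(self, texts):
        return [FakeOutput("NEU", {"NEU": 0.9, "POS": 0.06, "NEG": 0.04}) for _ in texts]

labels, scores = labels_and_scores(FakeAnalyzer(), ["hola", "qué tal"])
print(labels, scores)
```

With a dataframe, the two returned lists can be assigned to new columns in the same way as in the report.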
  • Download and use model locally

    Hi, guys. First of all, great lib; it works great and is helping me a ton in a recent project. I'm building an app for my job, but I have some security limitations, and one of them is that I can't reach external endpoints from the internal network. Is there any way to load the model locally after downloading it? With the Hugging Face library it would be something like:

    !git clone https://huggingface.co/ORGANIZATION_OR_USER/MODEL_NAME

    from transformers import AutoModel
    
    model = AutoModel.from_pretrained('./MODEL_NAME')
    

    Thanks in advance.

    opened by arieltoledo 4
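
A possible shape for the local-loading approach described in the question, using the transformers `from_pretrained` calls with `local_files_only=True` so that no request is made to the Hub. This is a sketch under assumptions: the directory is a cloned model repository containing a `config.json`, the helper name `load_local_model` is made up, and the thread does not say whether pysentimiento's own analyzers accept a local path.

```python
from pathlib import Path

def load_local_model(model_dir):
    """Load a tokenizer and classifier from a local directory, without contacting the Hub."""
    path = Path(model_dir)
    if not (path / "config.json").is_file():
        raise FileNotFoundError(f"No config.json found in {path}")
    # Imported lazily so the path check above works even without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(str(path), local_files_only=True)
    model = AutoModelForSequenceClassification.from_pretrained(str(path), local_files_only=True)
    return tokenizer, model

try:
    load_local_model("./MODEL_NAME")  # placeholder path from the question above
    status = "loaded"
except FileNotFoundError as err:
    status = f"not loaded: {err}"
print(status)
```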
  • ValueError: Non-consecutive added token '<mask>' found. Should have index 63996 but has index 64000 in saved vocabulary

    I am getting the following error while loading the tokenizer. Is this usage allowed?

    Code:

    tokenizer = BertTokenizer.from_pretrained('finiteautomata/bertweet-base-sentiment-analysis')
    model = BertForSequenceClassification.from_pretrained('finiteautomata/bertweet-base-sentiment-analysis')

    Error:

    The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. The tokenizer class you load from this checkpoint is 'BertweetTokenizer'. The class this function is called from is 'BertTokenizer'.

    ValueError Traceback (most recent call last) in ()
          2
          3 # initialize the tokenizer for BERT models
    ----> 4 tokenizer = BertTokenizer.from_pretrained('finiteautomata/bertweet-base-sentiment-analysis')
          5 # initialize the model for sequence classification
          6 model = BertForSequenceClassification.from_pretrained('finiteautomata/bertweet-base-sentiment-analysis')

    1 frames
    /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, use_auth_token, *init_inputs, **kwargs)
       1920 # current length of the tokenizer.
       1921 raise ValueError(
    -> 1922 f"Non-consecutive added token '{token}' found. "
       1923 f"Should have index {len(tokenizer)} but has index {index} in saved vocabulary."
       1924 )

    ValueError: Non-consecutive added token '<mask>' found. Should have index 63996 but has index 64000 in saved vocabulary.

    opened by amitkayal 4
  • Tokenizer Error

    Hello, I am getting an error when the following code (extracted from the examples) is executed:

    from pysentimiento import SentimentAnalyzer
    analyzer = SentimentAnalyzer(lang="es")
    
    

    Error:

    AssertionError: Non-consecutive added token '@usuario' found. Should have index 31006 but has index 31002 in saved vocabulary.

    Thank you

    opened by JOTOR 4
  • below Issue when we use transformer code

    AssertionError Traceback (most recent call last) in
          1 from transformers import AutoTokenizer, AutoModelForSequenceClassification
          2
    ----> 3 tokenizer = AutoTokenizer.from_pretrained("finiteautomata/beto-sentiment-analysis")
          4
          5 model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/beto-sentiment-analysis")

    ~/thesis/copycat/copy_env/lib64/python3.6/site-packages/transformers/models/auto/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
        421 tokenizer_class_py, tokenizer_class_fast = TOKENIZER_MAPPING[type(config)]
        422 if tokenizer_class_fast and (use_fast or tokenizer_class_py is None):
    --> 423 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
        424 else:
        425 if tokenizer_class_py is not None:

    ~/thesis/copycat/copy_env/lib64/python3.6/site-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
       1708
       1709 return cls._from_pretrained(
    -> 1710 resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
       1711 )
       1712

    ~/thesis/copycat/copy_env/lib64/python3.6/site-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs)
       1814 for token, index in added_tok_encoder_sorted:
       1815 assert index == len(tokenizer), (
    -> 1816 f"Non-consecutive added token '{token}' found. "
       1817 f"Should have index {len(tokenizer)} but has index {index} in saved vocabulary."
       1818 )

    AssertionError: Non-consecutive added token '[USER]' found. Should have index 31005 but has index 31002 in saved vocabulary.

    opened by avinashpaul 4
  • outdated example on Readme?

    Following the example you will get

    In [1]: from pysentimiento import SentimentAnalyzer
    
    In [2]: analyzer = SentimentAnalyzer(lang="es")
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-2-479bef79285e> in <module>
    ----> 1 analyzer = SentimentAnalyzer(lang="es")
    
    TypeError: __init__() got an unexpected keyword argument 'lang'
    

    Edit: looks like the pip version is different to the one on github

    opened by Zincr0 4
  • [BUG] NER analyzer doesn't work if GPU available

    Describe the bug NER pipeline explodes if GPU is available

    To Reproduce

    from pysentimiento import create_analyzer
    
    analyzer = create_analyzer("ner", lang="es")
    
    analyzer.predict("Bill Gates is the founder of Microsoft")
    

    Environment

    Environment with a GPU available

    pysentimiento == 0.5.2

    opened by finiteautomata 0
  • Add hashtag segmentation with hashformers

    Closes #23 .

    Usage:

    from pysentimiento.preprocessing import preprocess_tweet
    from pysentimiento.segmenter import create_segmenter
    
    # Handles hashtags
    segmenter = create_segmenter(lang="es", batch_size=1000)
    preprocess_tweet("esto es #UnaGenialidad", segmenter=segmenter)
    # "esto es una genialidad"
    

    create_segmenter(lang="en") or calling a GPT-2 model directly ( e.g. create_segmenter(model_name="gpt2-large") ) are also implemented. Calling preprocess_tweet without a segmenter will run the default camel case segmenter.

    I have also modified preprocess_tweet to handle both strings and lists of strings.

    P.S.: If you are going to evaluate this segmenter on downstream tasks, make sure you also test create_segmenter(lang="en") on Spanish text. This returns a distilgpt2 which has achieved good results at segmenting hashtags in other languages. Model size doesn't seem to matter much ( distilgpt2 will usually give similar or even better results than gpt2 or gpt2-large ).

    opened by ruanchaves 9
  • Package dependency torch version 1.9.0+

    Not really an issue, but we use the LTS version of torch, which is currently 1.8.2, but pysentimiento requires newer versions of torch. Is this solvable from your end perhaps? We just use pip's --use-deprecated=legacy-resolver to get around this but we were curious to see if staying on torch 1.8.2 will cause some issues for this library.

    Pretty neat package btw, thanks a lot for maintaining it ❤️

    opened by anthony2261 1
  • [Feature Proposal] Use hashformers for hashtag segmentation

    preprocess_tweet currently uses a very simple camel case regex to handle hashtag preprocessing. This will obviously fail for most hashtags.
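
    To make the limitation concrete, here is a minimal sketch of a camel-case splitter. The pattern is an assumption made for illustration and is not necessarily the regex pysentimiento uses; split_camel_case is a hypothetical name.

```python
import re

# Illustrative pattern only: a zero-width boundary between a lowercase
# letter and an uppercase letter (including common Spanish accented letters).
CAMEL_BOUNDARY = re.compile(r"(?<=[a-záéíóúñ])(?=[A-ZÁÉÍÓÚÑ])")


def split_camel_case(hashtag: str) -> str:
    """Split a hashtag at lowercase-to-uppercase boundaries and lowercase it."""
    body = hashtag.lstrip("#")
    return CAMEL_BOUNDARY.sub(" ", body).lower()


camel = split_camel_case("#UnaGenialidad")
flat = split_camel_case("#unagenialidad")
```

    The camel-case hashtag comes back as "una genialidad", while the all-lowercase one is returned unsplit as "unagenialidad", since there is no case boundary for the regex to find.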

    I propose to integrate hashformers with pysentimiento. Here are a few reasons to do this:

    • Hashformers has been proven by two research groups to be the current state-of-the-art for hashtag segmentation.
    • It can instantly work with Spanish, English or any other language.
    • It does not add any significant extra dependencies to the library.
    • It is very easy to integrate.

    If this seems like a good idea to the maintainers of this repository (@finiteautomata), I can draft an initial PR for this feature.

    opened by ruanchaves 6
This repository contains various models targeting multimodal representation learning and multimodal fusion for downstream tasks such as multimodal sentiment analysis.

Multimodal Deep Learning Announcing the multimodal deep learning repository that contains implementation of various deep learning-based model

Deep Cognition and Language Research (DeCLaRe) Lab 398 Dec 30, 2022
Hunt down social media accounts by username across social networks

Hunt down social media accounts by username across social networks Installation | Usage | Docker Notes | Contributing Installation # clone the repo $

null 1 Dec 14, 2021
FAMIE is a comprehensive and efficient active learning (AL) toolkit for multilingual information extraction (IE)

FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction

null 18 Sep 1, 2022
The repo contains the code of the ACL2020 paper `Dice Loss for Data-imbalanced NLP Tasks`

Dice Loss for NLP Tasks This repository contains code for Dice Loss for Data-imbalanced NLP Tasks at ACL2020. Setup Install Package Dependencies The c

null 223 Dec 17, 2022
Semi-supervised Learning for Sentiment Analysis

Neural-Semi-supervised-Learning-for-Text-Classification-Under-Large-Scale-Pretraining Code, models and Datasets for《Neural Semi-supervised Learning fo

null 47 Jan 1, 2023
This is the code for ACL2021 paper A Unified Generative Framework for Aspect-Based Sentiment Analysis

This is the code for ACL2021 paper A Unified Generative Framework for Aspect-Based Sentiment Analysis Install the package in the requirements.txt, the

null 108 Dec 23, 2022
This repository contains the official implementation code of the paper Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis, accepted at EMNLP 2021.

MultiModal-InfoMax This repository contains the official implementation code of the paper Improving Multimodal Fusion with Hierarchical Mutual Informa

Deep Cognition and Language Research (DeCLaRe) Lab 89 Dec 26, 2022
Enhancing Aspect-Based Sentiment Analysis with Supervised Contrastive Learning.

Enhancing Aspect-Based Sentiment Analysis with Supervised Contrastive Learning. Enhancing Aspect-Based Sentiment Analysis with Supervised Contrastive

HLT@HIT(SZ) 7 Dec 16, 2021
Propose a principled and practically effective framework for unsupervised accuracy estimation and error detection tasks with theoretical analysis and state-of-the-art performance.

Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles This project is for the paper: Detecting Errors and Estimating

Jiefeng Chen 13 Nov 21, 2022
Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

CNTK Chat Windows build status Linux build status The Microsoft Cognitive Toolkit (https://cntk.ai) is a unified deep learning toolkit that describes

Microsoft 17.3k Dec 29, 2022
I created my own virtual artificial intelligence named Genesis. He can assist with my tasks and also perform some analysis.

Virtual-Artificial-Intelligence-genesis- I created My own Virtual Artificial Intelligence named genesis, He can assist with my Tasks and also perform

AKASH M 1 Nov 5, 2021
Collection of NLP model explanations and accompanying analysis tools

Thermostat is a large collection of NLP model explanations and accompanying analysis tools. Combines explainability methods from the captum library wi

null 126 Nov 22, 2022
Code and models used in "MUSS Multilingual Unsupervised Sentence Simplification by Mining Paraphrases".

Multilingual Unsupervised Sentence Simplification Code and pretrained models to reproduce experiments in "MUSS: Multilingual Unsupervised Sentence Sim

Facebook Research 81 Dec 29, 2022
XtremeDistil framework for distilling/compressing massive multilingual neural network models to tiny and efficient models for AI at scale

XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks ACL 2020 Microsoft Research [Paper] [Video] Releasing [XtremeDistilTransf

Microsoft 125 Jan 4, 2023
Load What You Need: Smaller Multilingual Transformers for Pytorch and TensorFlow 2.0.

Smaller Multilingual Transformers This repository shares smaller versions of multilingual transformers that keep the same representations offered by t

Geotrend 79 Dec 28, 2022
Code Repo for the ACL21 paper "Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning"

Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning This is the Github repository of our paper, "Common S

INK Lab @ USC 19 Nov 30, 2022
One implementation of the paper "DMRST: A Joint Framework for Document-Level Multilingual RST Discourse Segmentation and Parsing".

Introduction One implementation of the paper "DMRST: A Joint Framework for Document-Level Multilingual RST Discourse Segmentation and Parsing". Users

seq-to-mind 18 Dec 11, 2022