TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

Related tags

Text Data & NLP named-entity-recognition dependency-parser ner pos-tagging tokenization text-annotation lemmatization tweet-analysis twitter-nlp nlp-toolkit

Overview

TweebankNLP

This repo contains the new Tweebank-NER dataset and Twitter-Stanza pipeline for state-of-the-art Tweet NLP. Tweebank-NER V1.0 is the annotated NER dataset based on Tweebank V2, the main UD treebank for English Twitter NLP tasks. The Twitter-Stanza pipeline provides pre-trained Tweet NLP models (NER, tokenization, lemmatization, POS tagging, dependency parsing) with state-of-the-art or competitive performance. The models are fully compatible with Stanza and provide both Python and command-line interfaces for users.

Installation

# please install from the source
pip install -e .

# download glove and pre-trained models
sh download_twitter_resources.sh

Python Interface for Twitter-Stanza

import stanza

# config for the `en_tweet` pipeline (trained only on Tweebank)
config = {
          'processors': 'tokenize,lemma,pos,depparse,ner',
          'lang': 'en',
          'tokenize_pretokenized': True, # disable tokenization
          'tokenize_model_path': './saved_models/tokenize/en_tweet_tokenizer.pt',
          'lemma_model_path': './saved_models/lemma/en_tweet_lemmatizer.pt',
          "pos_model_path": './saved_models/pos/en_tweet_tagger.pt',
          "depparse_model_path": './saved_models/depparse/en_tweet_parser.pt',
          "ner_model_path": './saved_models/ner/en_tweet_nertagger.pt'
}

# Initialize the pipeline using a configuration dict
nlp = stanza.Pipeline(**config)
doc = nlp("Oh ikr like Messi better than Ronaldo but we all like Ronaldo more")
print(doc) # Look at the result

Running Twitter-Stanza (Command Line Interface)

NER

We provide two pre-trained Stanza NER models:

en_tweenut17: trained on TB2+WNUT17
en_tweet: trained on TB2

source twitter-stanza/scripts/config.sh

python stanza/utils/training/run_ner.py en_tweenut17 \
--mode predict \
--score_test \
--wordvec_file ../data/wordvec/English/en.twitter100d.xz \
--eval_file data/ner/en_tweet.test.json

Syntactic NLP Models

We provide two pre-trained models for the following NLP tasks:

tweet_ewt: trained on TB2+UD-English-EWT
en_tweet: trained on TB2

1. Tokenization

python stanza/utils/training/run_tokenizer.py tweet_ewt \
--mode predict \
--score_test \
--txt_file data/tokenize/en_tweet.test.txt \
--label_file  data/tokenize/en_tweet-ud-test.toklabels \
--no_use_mwt

2. Lemmatization

python stanza/utils/training/run_lemma.py tweet_ewt \
--mode predict \
--score_test \
--gold_file data/depparse/en_tweet.test.gold.conllu \
--eval_file data/depparse/en_tweet.test.in.conllu

3. POS Tagging

python stanza/utils/training/run_pos.py tweet_ewt \
--mode predict \
--score_test \
--eval_file data/pos/en_tweet.test.in.conllu \
--gold_file data/depparse/en_tweet.test.gold.conllu

4. Dependency Parsing

python stanza/utils/training/run_depparse.py tweet_ewt \
--mode predict \
--score_test \
--wordvec_file ../data/wordvec/English/en.twitter100d.txt \
--eval_file data/depparse/en_tweet.test.in.conllu \
--gold_file data/depparse/en_tweet.test.gold.conllu

Training Twitter-Stanza

Please refer to the TRAIN_README.md for training the Twitter-Stanza neural pipeline.

References

If you use this repository in your research, please kindly cite our paper as well as the Stanza papers.

@article{jiang2022tweebank,
    title={Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis},
    author={Jiang, Hang and Hua, Yining and Beeferman, Doug and Roy, Deb},
    publisher={arXiv},
    year={2022}
}

Acknowledgement

The Twitter-Stanza pipeline is a friendly fork from the Stanza libaray with a few modifications to adapt to tweets. The repository is fully compatible with Stanza. This research project is funded by MIT Center for Constructive Communication (CCC).

A unified tokenization tool for Images, Chinese and English.

ICE Tokenizer Token id [0, 20000) are image tokens. Token id [20000, 20100) are common tokens, mainly punctuations. E.g., icetk[20000] == 'unk', ice

42 Dec 27, 2022

Tokenizer - Module python d'analyse syntaxique et de grammaire, tokenization

Tokenizer Le Tokenizer est un analyseur lexicale, il permet, comme Flex and Yacc par exemple, de tokenizer du code, c'est à dire transformer du code e

1 Aug 15, 2022

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

3.2k Dec 31, 2022

NLP project that works with news (NER, context generation, news trend analytics)

СоАвтор СоАвтор – платформа и открытый набор инструментов для редакций и журналистов-фрилансеров, который призван сделать процесс создания контента ма

38 Jan 4, 2023

Code and checkpoints for training the transformer-based Table QA models introduced in the paper TAPAS: Weakly Supervised Table Parsing via Pre-training.

End-to-end neural table-text understanding models.

914 Jan 7, 2023

TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

Related tags

Overview

TweebankNLP

Installation

Python Interface for Twitter-Stanza

Running Twitter-Stanza (Command Line Interface)

NER

Syntactic NLP Models

1. Tokenization

2. Lemmatization

3. POS Tagging

4. Dependency Parsing

Training Twitter-Stanza

References

Acknowledgement

You might also like...

A unified tokenization tool for Images, Chinese and English.

Tokenizer - Module python d'analyse syntaxique et de grammaire, tokenization

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

NLP project that works with news (NER, context generation, news trend analytics)

Code and checkpoints for training the transformer-based Table QA models introduced in the paper TAPAS: Weakly Supervised Table Parsing via Pre-training.

Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Must-read papers on improving efficiency for pre-trained language models.

The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models

Chinese Pre-Trained Language Models (CPM-LM) Version-I

Owner

Laboratory for Social Machines

PyTorch Implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging" (Findings of ACL 2022)

Python bindings to the dutch NLP tool Frog (pos tagger, lemmatiser, NER tagger, morphological analysis, shallow parser, dependency parser)

A NLP program: tokenize method, PoS Tagging with deep learning

Part of Speech Tagging using Hidden Markov Model (HMM) POS Tagger and Brill Tagger

PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

Prompt-learning is the latest paradigm to adapt pre-trained language models (PLMs) to downstream NLP tasks

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.