Hashtag segmentation is the task of automatically inserting the missing spaces between the words in a hashtag.
Hashformers applies Transformer models to hashtag segmentation. It is built on top of the transformers library and the lm-scorer and mlm-scoring packages.
Try it right now on Google Colab.
Paper: Zero-shot hashtag segmentation for multilingual sentiment analysis
Basic usage
from hashformers import WordSegmenter

ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    reranker_model_name_or_path="bert-base-uncased",
    use_reranker=True
)

segmentations = ws.segment([
    "#myoldphonesucks",
    "#latinosinthedeepsouth",
    "#weneedanationalpark",
    "#LandoftheLost",
    "#icecold",
    "#Heartbreaker",
    "#TheRiseGuys"
])

print(segmentations)
# ['my old phone sucks',
#  'latinos in the deep south',
#  'we need a national park',
#  'Land of the Lost',
#  'ice cold',
#  'Heartbreaker',
#  'The Rise Guys']
Installation
Installation steps are described in this notebook. A Docker image is coming soon.
Examples
Applications of hashtag segmentation to tweet sentiment analysis and to the automatic translation of tweets can be found in the examples folder.
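As a rough illustration of how segmentation can be used as a preprocessing step before sentiment analysis or translation, the sketch below replaces each hashtag in a tweet with its segmented form. It is not taken from the examples folder: `expand_hashtags`, `HASHTAG_PATTERN`, and `toy_segment` are hypothetical names, and `toy_segment` is a lookup-table stand-in so that the snippet runs without downloading any model.

```python
import re
from typing import Callable, List

# Matches a '#' followed by word characters, e.g. "#icecold".
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def expand_hashtags(tweet: str, segment: Callable[[List[str]], List[str]]) -> str:
    """Replace every hashtag in `tweet` with its segmented form."""
    hashtags = HASHTAG_PATTERN.findall(tweet)
    if not hashtags:
        return tweet
    # Segment all hashtags in one batch, then substitute them in order.
    segmented = iter(segment(hashtags))
    return HASHTAG_PATTERN.sub(lambda _match: next(segmented), tweet)


def toy_segment(hashtags: List[str]) -> List[str]:
    """Hypothetical stand-in for a real segmenter, backed by a lookup table."""
    lookup = {"icecold": "ice cold", "myoldphonesucks": "my old phone sucks"}
    return [lookup.get(tag, tag) for tag in hashtags]


print(expand_hashtags("Great drink, #icecold! But #myoldphonesucks", toy_segment))
# Great drink, ice cold! But my old phone sucks
```

In principle the `ws.segment` method from the Basic usage section could be passed in place of `toy_segment`. Note that this sketch passes hashtags without the leading `#`, which may differ from what the library expects.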
Contributing
Pull requests are welcome! The documentation and code quality of this repository both need improvement, and implementing more sophisticated ensembling techniques would also be a valuable contribution. Read our paper for more details on the inner workings of our framework.
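For contributors who want a starting point, here is a minimal sketch of one very simple way to ensemble two scoring models: a weighted sum of min-max normalized scores over a shared set of candidate segmentations. This is an assumption made for illustration, not the method implemented in this repository or described in the paper. The function name `ensemble_scores`, the convention that lower scores are better, and all the numbers are hypothetical.

```python
from typing import Dict


def ensemble_scores(
    segmenter_scores: Dict[str, float],
    reranker_scores: Dict[str, float],
    segmenter_weight: float = 0.5,
) -> str:
    """Return the candidate with the lowest weighted combined score.

    Both inputs map a candidate segmentation to a score where lower is
    better (an assumption of this sketch).
    """
    def min_max(scores: Dict[str, float]) -> Dict[str, float]:
        # Rescale scores to [0, 1] so the two models are comparable.
        low, high = min(scores.values()), max(scores.values())
        span = (high - low) or 1.0  # avoid division by zero
        return {cand: (s - low) / span for cand, s in scores.items()}

    shared = segmenter_scores.keys() & reranker_scores.keys()
    if not shared:
        raise ValueError("the two models share no candidate segmentations")

    seg = min_max({c: segmenter_scores[c] for c in shared})
    rer = min_max({c: reranker_scores[c] for c in shared})
    combined = {
        c: segmenter_weight * seg[c] + (1.0 - segmenter_weight) * rer[c]
        for c in shared
    }
    return min(combined, key=combined.get)


# Made-up scores for three candidate segmentations of "#icecold".
segmenter = {"ice cold": 10.0, "icecold": 12.0, "i cecold": 20.0}
reranker = {"ice cold": 4.5, "icecold": 4.0, "i cecold": 9.0}
print(ensemble_scores(segmenter, reranker))  # ice cold
```

Setting `segmenter_weight` to 0.0 or 1.0 reduces the ensemble to a single model, which makes it easy to compare against each model on its own.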
Citation
@misc{rodrigues2021zeroshot,
      title={Zero-shot hashtag segmentation for multilingual sentiment analysis},
      author={Ruan Chaves Rodrigues and Marcelo Akira Inuzuka and Juliana Resplande Sant'Anna Gomes and Acquila Santos Rocha and Iacer Calixto and Hugo Alexandre Dantas do Nascimento},
      year={2021},
      eprint={2112.03213},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}