Text preprocessing, representation and visualization from zero to hero.

Overview

From zero to hero · Installation · Getting Started · Examples · API · FAQ · Contributions

From zero to hero

Texthero is a Python toolkit for working with text-based datasets quickly and effortlessly. Texthero is very simple to learn and designed to be used on top of Pandas. Texthero has the same expressiveness and power as Pandas and is extensively documented. Texthero is modern and conceived for programmers of the 2020s with little, if any, background in linguistics.

You can think of Texthero as a tool that helps you understand and work with text-based datasets. Given a tabular dataset, it's easy to grasp the main concepts; given a text dataset, it's harder to get quick insights into the underlying data. With Texthero, preprocessing text data, mapping it into vectors, and visualizing the obtained vector space takes just a couple of lines.

Texthero includes tools for:

  • Preprocessing text data: it offers out-of-the-box solutions but is also flexible enough for custom solutions.
  • Natural Language Processing: keyphrases and keywords extraction, and named entity recognition.
  • Text representation: TF-IDF, term frequency, and custom word-embeddings (wip)
  • Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modeling (wip) and interpretation.
  • Text visualization: vector space visualization, place localization on maps (wip).

Texthero is free, open-source and well documented (and that's what we love most by the way!).

We hope you will find as much pleasure working with Texthero as we did during its development.

Hablas español? क्या आप हिंदी बोलते हैं? 日本語が話せるのか?

Texthero has been developed for the whole NLP community. We know how hard it is to deal with different NLP tools (NLTK, SpaCy, Gensim, TextBlob, Sklearn): that's why we developed Texthero, to simplify things.

Now, the next main milestone is to provide multilingual support, and for this big step, we need the help of all of you. ¿Hablas español? Sprechen Sie Deutsch? 你会说中文? 日本語が話せますか? Fala português? Parli italiano? Вы говорите по-русски? If yes, or if you speak another language not mentioned here, you can help us develop multilingual support! Even if you haven't contributed before or are just starting with NLP, contact us or open a Github issue; there is always a first time :) We promise you will learn a lot, and ... who knows? It might help you find your next job as an NLP developer!

Your help and feedback are crucial for improving the toolkit and providing an even better experience. If you have any problem or suggestion, please open a Github issue; we will be glad to support and help you.

Beta version

Texthero's community is growing fast. Texthero, though, is still in beta; soon, a faster and better version will be released, and it will bring some major changes.

For instance, to give more granular control over the pipeline, starting from the next version all preprocessing functions will require an already-tokenized text as argument. This will be a major change.
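
A sketch of the intended style (hypothetical; exact signatures may change before the release):

import texthero as hero
import pandas as pd

s = pd.Series("This is an example sentence")
s = hero.tokenize(s)          # tokenize explicitly first ...
s = hero.remove_stopwords(s)  # ... then preprocessing operates on tokens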

Once the stable version (Texthero 2.0) is released, backward compatibility will be respected. Until then, backward compatibility will be maintained, but it will be weaker.

If you want to be part of this fast-growing movement, do not hesitate to contribute: CONTRIBUTING!

Installation

Install texthero via pip:

pip install texthero

☝️ Under the hood, Texthero makes use of multiple NLP and machine learning toolkits such as Gensim, NLTK, SpaCy and scikit-learn. You don't need to install them all separately; pip will take care of that.

For faster performance, make sure you have installed SpaCy version >= 2.2. Also, make sure you have a recent version of Python: the newer, the better.

Getting started

The best way to learn Texthero is through the Getting Started docs.

If you are an advanced Python user, then help(texthero) should do the trick.

Examples

1. Text cleaning, TF-IDF representation and Visualization

import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['pca'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
    .pipe(hero.pca)
)
hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")

2. Text preprocessing, TF-IDF, K-means and Visualization

import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['tfidf'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
)

df['kmeans_labels'] = (
    df['tfidf']
    .pipe(hero.kmeans, n_clusters=5)
    .astype(str)
)

df['pca'] = df['tfidf'].pipe(hero.pca)

hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news")

3. Simple pipeline for text cleaning

>>> import texthero as hero
>>> import pandas as pd
>>> text = "This sèntencé    (123 /) needs to [OK!] be cleaned!   "
>>> s = pd.Series(text)
>>> s
0    This sèntencé    (123 /) needs to [OK!] be cleane...
dtype: object

Remove all digits:

>>> s = hero.remove_digits(s)
>>> s
0    This sèntencé    (  /) needs to [OK!] be cleaned!
dtype: object

remove_digits replaces only blocks of digits: the digits in the string "hello123" will not be removed. To remove all digits, set only_blocks to False.
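
For instance (a small sketch on a toy string; as above, removed digits are replaced by a single space):

>>> hero.remove_digits(pd.Series("hello123"), only_blocks=False)
0    hello 
dtype: object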

Remove all types of brackets and their content.

>>> s = hero.remove_brackets(s)
>>> s 
0    This sèntencé    needs to  be cleaned!
dtype: object

Remove diacritics.

>>> s = hero.remove_diacritics(s)
>>> s 
0    This sentence    needs to  be cleaned!
dtype: object

Remove punctuation.

>>> s = hero.remove_punctuation(s)
>>> s 
0    This sentence    needs to  be cleaned
dtype: object

Remove extra white-spaces.

>>> s = hero.remove_whitespace(s)
>>> s 
0    This sentence needs to be cleaned
dtype: object

Sometimes we also want to get rid of stop-words.

>>> s = hero.remove_stopwords(s)
>>> s
0    This sentence needs cleaned
dtype: object

API

Texthero is composed of four modules: preprocessing.py, nlp.py, representation.py and visualization.py.

1. Preprocessing

Scope: prepare text data for further analysis.

Full documentation: preprocessing

2. NLP

Scope: provide classic natural language processing tools such as named_entity and noun_phrases.

Full documentation: nlp

3. Representation

Scope: map text data into vectors and do dimensionality reduction.

Supported representation algorithms:

  1. Term frequency (count)
  2. Term frequency-inverse document frequency (tfidf)

Supported clustering algorithms:

  1. K-means (kmeans)
  2. Density-Based Spatial Clustering of Applications with Noise (dbscan)
  3. Meanshift (meanshift)

Supported dimensionality reduction algorithms:

  1. Principal component analysis (pca)
  2. t-distributed stochastic neighbor embedding (tsne)
  3. Non-negative matrix factorization (nmf)
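
All of these compose in the same pipe-based style as the examples above; for instance, a minimal sketch that swaps PCA for t-SNE on the bbcsport DataFrame:

df['tsne'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
    .pipe(hero.tsne)
)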

Full documentation: representation

4. Visualization

Scope: summarize the main facts about the text data and visualize them. This module is opinionated. It's handy for anyone who needs a quick way to visualize text data on screen, for instance during exploratory data analysis (EDA).

Supported functions:

  • Text scatterplot (scatterplot)
  • Most common words (top_words)
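
For instance, top_words gives a quick look at the most frequent words in a text column (a minimal sketch, reusing the df from the examples above):

hero.top_words(df['text']).head()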

Full documentation: visualization

FAQ

Why Texthero

Sometimes we just want things done, right? Texthero helps with that. It makes things easier and gives developers more time to focus on their custom requirements. We believe that cleaning text should take just a minute. The same goes for finding the most important parts of a text, and for representing it.

In a very pragmatic way, texthero has just one goal: save the developer time. Working with text data can be a pain, and in most cases a default pipeline is a good starting point. There is always time to come back and improve previous work.

Contributions

"Texthero has been developed by a member of the NLP community for the whole NLP-community"

Texthero is for all of us NLP developers, and it can continue to exist only with the precious contributions of the community.

Your level of expertise in Python and NLP does not matter; anyone can help, and anyone is more than welcome to contribute!

Are you an NLP expert?

  • open an issue and tell us what you like and dislike about Texthero and what we can do better!

Are you good at creating websites?

The website will soon be moved from Docusaurus to Sphinx: read the open issue. Good news: the website will look the same as it does now :) Average news: we need to do some web development to adapt this Sphinx template to our needs. Can you help us?

Are you good at writing?

Probably this is the most important piece missing from Texthero right now: more tutorials and more "Getting Started" guides.

If you are good at writing, you can help us! Why don't you start by adding a FAQ page to the website or explaining how to create a custom pipeline? Need help? We are here for you.

Are you good at Python?

There are a lot of open issues for technically minded contributors. Which one will you choose?

If you have any other questions or inquiries, drop me a line at jonathanbesomi__AT__gmail.com

Contributors (in chronological order)

License

The MIT License (MIT)

Copyright (c) 2020 Texthero

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Issues
  • Change representation_series to DataFrame

    • all functions that previously dealt with representation series now handle only the DataFrame instead 🚀

    • removed all functions like flatten, as they are not needed anymore

    • adapted docstrings and tests

    -> further stuff to do:

    • add those examples into the tutorials, readme, getting started
    enhancement 
    opened by mk2510 20
  • Can we avoid having a cell with a list?

    As we know, it's really not recommended to store a list in a Pandas cell. TokenSeries and VectorSeries, two of the core ideas of (current) Texthero, actually do exactly that; can this be avoided?

    Need to discuss:

    • Alternatives using sub-columns (it's still MultiIndex). Understand how complex and flexible this solution is. In 99% of the cases, the standard Pandas/Texthero user does not really know how to work with MultiIndex ...
    • Can we just use RepresentationSeries? Probably not, as we cannot merge it into a DataFrame with a single index; are there other alternatives than data alignment with reindex (too complicated)?

    @mk2510 @henrifroese

    discussion 
    opened by jbesomi 18
  • count(s) and term_frequency(s)

    Replaces term frequency with count and creates a new method term_frequency

    opened by ishanarora04 17
  • RepresentationSeries: count, term_frequency and tfidf

    • Implement full support for Representation Series in "Vectorization" functions of representation module
    • add appropriate tests
    • add function_check_is_valid_representation

    We already wrote & tested all the code for Representation Series in the whole module, but we want to split this up into separate PRs so it's easier to review etc. As soon as this is merged, we'll open the other PRs.

    Roadmap for Representation Series implementation:

    1. This PR
    2. Implement Representation Series in rest of representation module over 2-3 more PRs
    3. Write tutorial for Representation Series
    4. Incorporate Representation Series into README and getting-started
    5. Release to PyPI
    enhancement 
    opened by henrifroese 15
  • Add Lemmatization

    Lemmatization can be thought of as a more advanced form of the stemming we already have in the preprocessing module. You can read about it e.g. here. Implementation should be done with spaCy.

    ToDo

    Implement a function hero.lemmatize(s: TokenSeries) (or maybe rather TextSeries?). Using spaCy this should be fairly straightforward. It should go into the NLP module and probably look very similar to the other spaCy-based functions there.
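
    A rough sketch of what such a function might look like (hypothetical: the body, model name, and disabled pipeline components are assumptions, not Texthero code):

    import spacy
    import pandas as pd

    # Hypothetical sketch, not part of Texthero:
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

    def lemmatize(s: pd.Series) -> pd.Series:
        # Replace every document with the space-joined lemmas of its tokens.
        return pd.Series(
            [" ".join(tok.lemma_ for tok in doc) for doc in nlp.pipe(s.astype(str))],
            index=s.index,
        )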

    Just comment below if you want to work on this and/or have any questions. I think this is a good first issue for new contributors.

    enhancement good first issue 
    opened by henrifroese 15
  • replace `tokenize_with_phrases` with `phrases` and added tests

    This PR replaces tokenize_with_phrases with phrases. I added unit tests for phrases as well.

    This is the result of running ./tests.sh:

    ..............................................................................................................................................................
    ----------------------------------------------------------------------
    Ran 158 tests in 9.798s
    
    OK
    

    This is the result of running ./format.sh:

    All done! ✨ 🍰 ✨
    6 files left unchanged.
    All done! ✨ 🍰 ✨
    6 files left unchanged.
    
    opened by cedricconol 12
  • Update documentation docstrings etc

    So this is quite a big PR that will finish the first part of #85. We went through all docstrings and added examples/tests, added other arguments, and fixed some things along the way. We also updated README.md and getting-started.md.

    Besides the docstrings updates, some small code changes are:

    • more parameters for the representation functions
    • changed scatterplot to support 3D visualization and return the figure correctly

    I just went through some other issues and think that additionally this fixes

    • parts of #100 and #98
    • all of #99

    After this, in line with #85, a new version should be deployed / published.

    opened by henrifroese 11
  • Added Remove Tags and Replace Tags

    opened by ishanarora04 11
  • Update docstring for hero.wordcloud

    After the discussion on #78

    We should add something like:

    "To reduce blur in the images, width and height should have the same size, i.e the image should be squared"

    documentation good first issue 
    opened by vidyap-xgboost 11
  • Fix NaNs (Closes #86)

    Implement dealing with np.nan, closes #86

    Every function in the library now handles NaNs correctly.

    Implemented through the decorator @handle_nans in the new file _helper.py.
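
    For illustration, a hypothetical sketch of what such a decorator might look like (names and exact behavior are assumptions; the real implementation lives in _helper.py):

    import functools
    import numpy as np
    import pandas as pd

    def handle_nans(func):
        # Hypothetical sketch: temporarily fill NaNs, run the wrapped function
        # on the filled Series, then restore NaNs at their original positions.
        @functools.wraps(func)
        def wrapper(s: pd.Series, *args, **kwargs):
            nan_mask = s.isna()
            result = func(s.fillna(""), *args, **kwargs)
            result[nan_mask] = np.nan
            return result
        return wrapper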

    Tests added in test_nan.py

    As we went through the whole library anyway, the argument "input" was renamed to "s" in some functions to be in line with the others.

    opened by henrifroese 10
  • Question regarding application of this tool to other language documents

    Question: can I use this tool with documents in other languages, like Japanese or Spanish?

    If yes, how? If not, how can we extend it, and what is the template for contributing?

    Thanks

    opened by skwolvie 1
  • Discussion - stopwords

    I like texthero, and I want to contribute somehow. First, I want to discuss something that bothers me - stopwords.

    Problem - I want to deploy a solution without the spaCy stopwords requirement and, possibly, add my own stopwords. My solution is based on Docker containers; it is bad practice to download files every time a new container is instantiated, as it causes a cold-start problem and uses unnecessary space (because I don't use the files).

    In this sense,

    • Is it possible to remove the spacy stopwords requirements?
    • How can we add general stopwords, according to our own language needs?
    • Do we have some stopwords dictionary for many languages outside spacy?
    • How can we turn off the stopwords download?
    opened by leomaurodesenv 4
  • punctuation not being removed correctly using `preprocessing.clean`

    This is my code; I was trying to clean a large dataset:

    full_data['text_pp'] = (
        full_data['text']
        .pipe(hero.preprocessing.clean)
        .pipe(hero.remove_urls)
    )
    

    According to the documentation this is the default pipeline for the clean functionality:

    Default pipeline:
    texthero.preprocessing.fillna()
    texthero.preprocessing.lowercase()
    texthero.preprocessing.remove_digits()
    texthero.preprocessing.remove_punctuation()
    texthero.preprocessing.remove_diacritics()
    texthero.preprocessing.remove_stopwords()
    texthero.preprocessing.remove_whitespace()

    But my output does not reflect this, as some of the punctuation remained in the text.

    (screenshots: original text column vs. preprocessed text column)

    opened by aliforgetti 2
  • -1 dbscan category

    Hi, I was trying to run dbscan on some texts and create a scatterplot.

    I wonder why my dbscan_labels has a -1 category (not sure what it means):

    documents['dbscan_labels'] = (
        documents['tfidf']
        .pipe(hero.dbscan)
        .astype(str)
    )
    
    hero.scatterplot(df=documents, col='pca', color='dbscan_labels', hover_data=['ID', 'Title'], title=" DBScan Clustering (Test) - Texthero library")
    
    


    I previously tried running k-means and the clusters/scatter plot look good:

    documents['tfidf'] = (
        documents['Text']
        .pipe(hero.clean)
        .pipe(hero.tfidf)
    )
    
    documents['kmeans_labels'] = (
        documents['tfidf']
        .pipe(hero.kmeans, n_clusters=13)
        .astype(str)
    )
    
    documents['pca'] = documents['tfidf'].pipe(hero.pca)
    
    hero.scatterplot(df=documents, col='pca', color='kmeans_labels', hover_data=['ID', 'Title'], title="K-Means Clustering (Test) - Texthero library")
    
    


    Thank you!

    opened by foongminwong 1
  • [WIP] Matching content in our doctests

    This PR solves issue #189 in order to match the content in our doctests.

    I updated all the sources in the texthero folder - the main issue is in the scatterplot function, in visualization.py, where the 3D representation in the browser does not show anything (WIP).

    I also updated the CONTRIBUTING.md file to ask project contributors to match the doctests in their examples / tests as closely as possible.

    To finish, I also updated some doctests to add a new line between the doc and the source code where it helps clarity, to remove extra whitespace, etc.

    opened by k0pernicus 7
  • Matching Content in our Doctests

    It would be great to have more friendly and funny doctest text content (instead of "Aha", "Text", ...). It's also nicer for users if the docstring examples are all similar.

    One idea, for instance, is to use famous sentences said by movie Superheroes. Here are a few examples:

    • I have the power!
    • Flame on!
    • HULK SMASH!
    • Holy ____ Batman!
    • I am the vengeance, I am the night, I am BATMAN!
    • I am GROOT.
    • I’m going ghost!
    • I am the law!
    • SPOOOON!!!

    Just comment below if you would like to work on this. Goals would probably be:

    • for all doctests outside of representation.py, change them to use some of the sentences from above
    • for all doctests in representation.py (and maybe also in visualization.py), we purposely currently have doctests that give users nice results (e.g. for the clustering functions we have doctests that are easily separable into clusters so users understand our examples). We might need to add some more example sentences to the list above to create 2-3 "topics" in our pool of examples that the clustering functions will find
    • as a last step, add this information to the CONTRIBUTING.md so other devs will know about it
    good first issue testing 
    opened by henrifroese 2
  • Redo / Improve our Doctests

    Doctests are used in Texthero to make sure that what we think the output will look like is also what it looks like. Example:

    def double(x):
        """
        Double the given input.
    
        Examples
        -------------
        >>> double(5)
        10
        """
        return 2*x
    

    The "Examples" part is then executed by the doctest suite when running our testing script.

    We have noticed that with more "complicated" outputs, we often need to skip the doctests: they work locally but do not pass on all our Travis builds (we're testing on macOS/Xenial/Windows). The main reasons are:

    • different floating point representation on the OSes -> float results are different after some decimals -> tests fail
    • different pandas printing outputs on the OSes -> e.g. macOS prints "..." in DataFrames at a different position than the others -> tests fail

    We're looking for any solution for those issues.

    Preliminary Ideas

    1. Look at the doctest module and find ways to make it work better (e.g. somehow allow some floating point epsilon; see the sketch after this list); maybe the only thing still not working would be DataFrames, and we might be able to live with that

    2. (Partly) write our own doctest module (see here)
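
    For idea 1, a minimal sketch: the standard doctest module already supports directives such as ELLIPSIS, which lets an expected output tolerate platform-dependent float tails:

    >>> 0.1 + 0.2  # doctest: +ELLIPSIS
    0.30000...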

    Interested in opinions!

    help wanted discussion testing 
    opened by henrifroese 0
  • Draft for getting-started-preprocessing

    To be completed. Need preliminary feedback on:

    • Structure (also keeping in mind rendering on website)
    • Tone (e.g. use of examples)
    • Length / level of details
    opened by Iota87 3
  • Add infer test cases for test indexes

    This PR adds code to generate test cases for each function according to its input HeroSeries. Only functions with a single HeroSeries input get a generated test case. Exceptions can be added manually, for example,

    test_case_exceptions = {}
    for case in (
        test_cases_nlp
        + test_cases_preprocessing
        + test_cases_representation
        + test_cases_visualization
    ):
        test_case_exceptions[case[0]] = case
    

    Then functions within test_case_exceptions will not be given a generated test case. To omit some functions from testing, put them under func_white_list,

    func_white_list = set(
        [s for s in inspect.getmembers(visualization, inspect.isfunction)]
    )
    

    See #179.

    opened by AlfredWGA 5
  • Infer test cases from input HeroSeries in test_indexes

    Currently, test_indexes saves all test cases using hard coding, listing all functions and their valid inputs separately, even though lots of functions share the same input. With HeroSeries objects as inputs, it seems easy to infer a valid input for each of them.

    For example, save an allowed_hero_series_type variable for each function,

    def InputSeries(allowed_hero_series_type):
    ...
        def decorator(func):
            func.allowed_hero_series_type = allowed_hero_series_type
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
    ...
    

    Then iterate all functions through dir(hero) and generate test cases,

    # Define the valid inputs for each HeroSeries type
    valid_inputs = {
        "TokenSeries": s_tokenized_lists,
        "TextSeries": s_text,
        "VectorSeries": s_numeric_lists
    }
    
    test_cases = []
    # Find all functions under texthero
    func_strs = [s for s in dir(hero) if not s.startswith("__") and s not in {'preprocessing', 'visualization', 'representation', 'nlp', 'stopwords'}]
    
    # Exceptions for test cases in case they need specific inputs
    test_case_exceptions = {
        "pca": ["pca", representation.pca, (s_numeric_lists, 0)]
    }
    
    for func_str in func_strs:
        if func_str in test_case_exceptions:
            test_cases.append(test_case_exceptions[func_str])
        else:
            func = getattr(hero, func_str)
            if hasattr(func, 'allowed_hero_series_type'):
                if func.allowed_hero_series_type.__name__ in valid_inputs:
                    test_cases.append([func_str, func, (valid_inputs[func.allowed_hero_series_type.__name__],)])
    

    Will it make the code cleaner and easier to maintain?

    enhancement testing 
    opened by AlfredWGA 4