The tool to make NLP datasets ready to use

Overview

chazutsu

chazutsu_top.PNG
photo from Kaikado, traditional Japanese chazutsu maker

chazutsu is a dataset downloader for NLP.

>>> import chazutsu
>>> r = chazutsu.datasets.IMDB().download()
>>> r.train_data().head(5)

Then

   polarity  rating                                             review
0         0       3  You'd think the first landing on the Moon woul...
1         1       9  I took a flyer in renting this movie but I got...
2         1      10  Sometimes I just want to laugh. Don't you? No ...
3         0       2  I knew it wasn't gunna work out between me and...
4         0       2  Sometimes I rest my head and think about the r...

You can use chazutsu on Jupyter.

Install

pip install chazutsu

Supported datasets

chazutsu supports various kinds of datasets!
Please see the details here!

  • Sentiment Analysis
    • Movie Review Data
    • Customer Review Datasets
    • Large Movie Review Dataset (IMDB)
  • Text classification
    • 20 Newsgroups
    • Reuters News Corpus (RCV1-v2)
  • Language Modeling
    • Penn Tree Bank
    • WikiText-2
    • WikiText-103
    • text8
  • Text Summarization
    • DUC2003
    • DUC2004
    • Gigaword
  • Textual entailment
    • The Multi-Genre Natural Language Inference (MultiNLI)
  • Question Answering
    • The Stanford Question Answering Dataset (SQuAD)

How it works

chazutsu not only downloads the dataset, but also expands the archive file, shuffles and splits the data, and picks samples (you can disable each step with arguments if you don't need it).

chazutsu_process1.png

r = chazutsu.datasets.MovieReview.polarity(shuffle=False, test_size=0.3, sample_count=100).download()
  • shuffle: Whether to shuffle the dataset (True/False).
  • test_size: The ratio of the test dataset (ignored if the dataset already provides train and test sets).
  • sample_count: The number of samples to pick from the dataset, so that a heavy text file does not freeze your editor.
  • force: Ignore the cache and re-download the dataset.
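
The three data options above can be pictured with a minimal, standard-library sketch. The helper `prepare` and its defaults are hypothetical and are not chazutsu's implementation; it only illustrates what shuffling, splitting by `test_size`, and picking `sample_count` rows mean.

```python
import random


def prepare(rows, shuffle=True, test_size=0.3, sample_count=0, seed=0):
    """Hypothetical helper mirroring the options above; not chazutsu's code."""
    rows = list(rows)
    if shuffle:
        # Shuffle a copy with a fixed seed so the result is reproducible.
        random.Random(seed).shuffle(rows)
    # Reserve the first `test_size` fraction of rows as the test set.
    n_test = int(round(len(rows) * test_size))
    train, test = rows[n_test:], rows[:n_test]
    # Keep a small preview of the training rows (0 means "no preview").
    samples = train[:sample_count] if sample_count > 0 else []
    return train, test, samples


rows = [(i % 2, "review %d" % i) for i in range(10)]
train, test, samples = prepare(rows, shuffle=False, test_size=0.3, sample_count=2)
print(len(train), len(test), len(samples))  # 7 3 2
```

In this sketch the samples are taken from the training rows; where chazutsu takes them from is not stated here, so treat that as an assumption.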

chazutsu supports the basic processing needed for tokenization.

chazutsu_process2.png

>>> import chazutsu
>>> r = chazutsu.datasets.MovieReview.subjectivity().download()
>>> r.train_data().head(3)

Then

    subjectivity                                             review
0             0  . . . works on some levels and is certainly wo...
1             1  the hulk is an anger fueled monster with incre...
2             1  when the skittish emma finds blood on her pill...

Now we want to convert this data into a form that can be used to train models in various frameworks.

fixed_len = 10
r.make_vocab(vocab_size=1000)
r.column("review").as_word_seq(fixed_len=fixed_len)
X, y = r.to_batch("train")
assert X.shape == (len(y), fixed_len, len(r.vocab))
assert y.shape == (len(y), 1)
  • make_vocab
    • vocab_resources: The resources used to build the vocabulary ("train", "valid", "test")
    • columns_for_vocab: The columns used to build the vocabulary
    • tokenizer: The tokenizer
    • vocab_size: Vocabulary size
    • min_word_freq: Minimum word frequency for a word to be included in the vocabulary
    • unknown: The tag used for out-of-vocabulary words
    • padding: The tag used to pad the sequence
    • end_of_sentence: The tag used to mark the end-of-line, if you want to make it explicit
    • reserved_words: Words that should always be included in the vocabulary (e.g. the padding tag)
    • force: Ignore the cache and re-create the dataset.
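
To make the vocabulary options more concrete, here is a rough standard-library sketch of building a vocabulary and converting text to a fixed-length, one-hot word sequence. The function names reuse the names above for readability only; the bodies are hypothetical stand-ins (whitespace tokenizer, `<pad>` and `<unk>` tags) and not chazutsu's implementation. The one-hot step is an assumption based on the shape `(len(y), fixed_len, len(r.vocab))` asserted in the example.

```python
from collections import Counter


def make_vocab(texts, vocab_size=1000, min_word_freq=1,
               unknown="<unk>", padding="<pad>"):
    """Hypothetical stand-in: whitespace tokenizer, most frequent words first."""
    counts = Counter(w for t in texts for w in t.split())
    words = [w for w, c in counts.most_common() if c >= min_word_freq]
    # Reserved tags come first, then words, truncated to vocab_size entries.
    return ([padding, unknown] + words)[:vocab_size]


def as_word_seq(text, vocab, fixed_len, unknown="<unk>", padding="<pad>"):
    """Map words to indices, then truncate or pad to exactly fixed_len."""
    index = {w: i for i, w in enumerate(vocab)}
    ids = [index.get(w, index[unknown]) for w in text.split()][:fixed_len]
    return ids + [index[padding]] * (fixed_len - len(ids))


def one_hot(ids, size):
    """Turn each index into a 0/1 vector of length `size`."""
    return [[1 if i == j else 0 for j in range(size)] for i in ids]


texts = ["the hulk is an anger fueled monster",
         "the movie works on some levels"]
vocab = make_vocab(texts, vocab_size=8)
seq = as_word_seq(texts[0], vocab, fixed_len=10)
X = [one_hot(as_word_seq(t, vocab, 10), len(vocab)) for t in texts]
print(len(X), len(X[0]), len(X[0][0]))  # 2 10 8
```

Words outside the 8-entry vocabulary map to the `<unk>` index, and short sentences are padded with the `<pad>` index up to `fixed_len`.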

If you don't want to load all the training data at once, you can use to_batch_iter.
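
The signature of to_batch_iter is not shown in this README, so the following is only a generic sketch of the idea behind batch iteration: yield the data in small chunks instead of holding everything in memory. The helper `batch_iter` is hypothetical and is not part of chazutsu.

```python
def batch_iter(rows, batch_size):
    """Yield successive lists of at most batch_size rows (hypothetical helper)."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Emit the final, possibly smaller, batch.
    if batch:
        yield batch


sizes = [len(b) for b in batch_iter(range(10), batch_size=4)]
print(sizes)  # [4, 4, 2]
```

Because the function is a generator, only one batch exists in memory at a time, which is the point of iterating instead of loading the whole training set.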

Additional Feature

Use on Jupyter

You can use chazutsu on Jupyter Notebook.

on_jupyter.png

Before you run chazutsu on Jupyter, you have to enable the widget extension with the command below.

jupyter nbextension enable --py --sys-prefix widgetsnbextension
Issues
  • Add SQuAD dataset to chazutsu.datasets module

    • Add Squad class to chazutsu.datasets module
    • Add TestSquad class as a test case for the class
    • Check all other test case passed
    opened by yasufumy 5
  • Avoid removing white spaces from paragraph in SQuAD

    • Avoid removing white spaces from paragraph otherwise the answer will be changed
      • Avoid using strip method
      • Replace \n with a space in the paragraph
    • Check test cases passed
    opened by yasufumy 2
  • CustomerReview method names on README.md are different from code.

    CustomerReview class has class methods products5, additional9, more3.

    But Datasets README.md shows CustomerReview has 5products, 9additional, 3more. https://github.com/chakki-works/chazutsu/blob/master/chazutsu/datasets/README.md#customer-review-datasets

    I think it is better to match documents with code. How about?

    bug 
    opened by shirakiya 1
  • Support Python2.x

    I could successfully install chazutsu on Python 2.7.10, but there is a problem.

    When I import chazutsu, I get the following error:

    >>> import chazutsu
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/xxx/VirtualEnv/py2venv/lib/python2.7/site-packages/chazutsu/__init__.py", line 1, in <module>
        import chazutsu.datasets as datasets
      File "/Users/xxx/VirtualEnv/py2venv/lib/python2.7/site-packages/chazutsu/datasets/__init__.py", line 1, in <module>
        from .movie_review import MovieReview
      File "/Users/xxx/VirtualEnv/py2venv/lib/python2.7/site-packages/chazutsu/datasets/movie_review.py", line 5, in <module>
        from chazutsu.datasets.framework.dataset import Dataset
      File "/Users/xxx/VirtualEnv/py2venv/lib/python2.7/site-packages/chazutsu/datasets/framework/dataset.py", line 10, in <module>
        from urllib.parse import urlparse
    ImportError: No module named parse
    

    This is probably due to the change in urllib between Python 2.x and Python 3.x.

    bug 
    opened by Hironsan 1
  • Add ATIS

    Having the ATIS task might be fun even if it's a bit easy.

    enhancement 
    opened by erip 1
  • Add better support for language modeling data

    Language modeling data does not come in an X and y format. The data is sequential, and the labels are obtained by shifting it. So handling language modeling data needs some trick to change X, y into X_t, X_t+1.

    opened by icoxfog417 0
  • Add SQuAD dataset

    I added the chazutsu.datasets.squad.Squad class for loading the SQuAD dataset. Test code for the class is also added.

    opened by yasufumy 0
  • fix bug in squad.py

    • Fix the process to compute answer span
    • Check test case passed
    opened by yasufumy 0
  • fix flaky test_movie_review.py

    This PR aims to improve the reliability of the test test_movie_review.py by changing the encoding method in chazutsu/datasets/framework/resource.py, so that the reading function no longer runs into an error. The error can be reproduced by running the test multiple times. E.g. create run.py (touch run.py) and write the following in it:

        import os
        for i in range(100):
            os.system('pytest tests/test_movie_review.py >> repeat_100_result.log')

    opened by shenganzhang 0
  • Add support for SemEval Datasets

    It would be beneficial if this dataset can also be included. It's a pretty important dataset for many downstream NLP tasks.

    enhancement 
    opened by shaksham95 0
  • Improve testing robustness

    opened by wanasit 0
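
The language modeling issue in the list above describes labels that are obtained by shifting the sequence by one step (X_t to X_t+1). A minimal sketch of that idea follows; the helper `shift_pairs` is hypothetical and does not reflect how chazutsu handles language modeling data.

```python
def shift_pairs(token_ids, seq_len):
    """Split a token stream into (input, target) windows.

    Each target is the input window shifted one position to the right.
    Hypothetical helper for illustration only.
    """
    pairs = []
    for start in range(0, len(token_ids) - seq_len, seq_len):
        x = token_ids[start:start + seq_len]
        y = token_ids[start + 1:start + seq_len + 1]
        pairs.append((x, y))
    return pairs


pairs = shift_pairs(list(range(10)), seq_len=3)
print(pairs[0])  # ([0, 1, 2], [1, 2, 3])
```

Every target window has the same length as its input window, so the pairs can be fed to a model that predicts the next token at each position.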