The tool to make NLP datasets ready to use

chakki

Last update: Dec 29, 2022

Overview

chazutsu

photo from Kaikado, traditional Japanese chazutsu maker

chazutsu is a dataset downloader for NLP.

>>> import chazutsu
>>> r = chazutsu.datasets.IMDB().download()
>>> r.train_data().head(5)

This prints:

   polarity  rating                                             review
0         0       3  You'd think the first landing on the Moon woul...
1         1       9  I took a flyer in renting this movie but I got...
2         1      10  Sometimes I just want to laugh. Don't you? No ...
3         0       2  I knew it wasn't gunna work out between me and...
4         0       2  Sometimes I rest my head and think about the r...

You can use chazutsu on Jupyter.

Install

pip install chazutsu

Supported datasets

chazutsu supports various kinds of datasets!
Please see the details here!

Sentiment Analysis
- Movie Review Data
- Customer Review Datasets
- Large Movie Review Dataset(IMDB)
Text classification
- 20 Newsgroups
- Reuters News Corpus (RCV1-v2)
Language Modeling
- Penn Tree Bank
- WikiText-2
- WikiText-103
- text8
Text Summarization
- DUC2003
- DUC2004
- Gigaword
Textual entailment
- The Multi-Genre Natural Language Inference (MultiNLI)
Question Answering
- The Stanford Question Answering Dataset (SQuAD)

How it works

chazutsu not only downloads the dataset but also expands the archive file, shuffles the data, splits it into train and test sets, and picks samples. You can disable any of these steps through arguments if you don't need them.

r = chazutsu.datasets.MovieReview.polarity(shuffle=False, test_size=0.3, sample_count=100).download()

shuffle: Whether to shuffle the data (True/False).
test_size: The ratio of the test dataset (ignored if the dataset already provides separate train and test sets).
sample_count: The number of samples to pick from the dataset, which avoids freezing your editor when the text file is large.
force: Ignore the cache and re-download the dataset.
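The shuffle, sample, and split steps can be pictured with a small standalone sketch. This is not chazutsu's internal code: the `prepare` function, its `seed` parameter, and the order of the steps are assumptions made for illustration, and only the argument names `shuffle`, `test_size`, and `sample_count` come from the description above.

```python
import random


def prepare(rows, shuffle=True, test_size=0.3, sample_count=0, seed=0):
    """Shuffle, optionally sub-sample, then split rows into (train, test).

    Hypothetical helper for illustration; not part of chazutsu.
    """
    rows = list(rows)  # copy so the caller's data is left untouched
    if shuffle:
        random.Random(seed).shuffle(rows)
    if sample_count > 0:
        rows = rows[:sample_count]  # keep only the first sample_count rows
    n_test = int(len(rows) * test_size)
    split = len(rows) - n_test
    return rows[:split], rows[split:]


train, test = prepare(range(1000), shuffle=False, test_size=0.3, sample_count=100)
print(len(train), len(test))
```

With `shuffle=False` the rows keep their original order, so the first rows go to the train split and the last ones to the test split.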

chazutsu supports the basic processing needed for tokenization.

>>> import chazutsu
>>> r = chazutsu.datasets.MovieReview.subjectivity().download()
>>> r.train_data().head(3)

This prints:

    subjectivity                                             review
0             0  . . . works on some levels and is certainly wo...
1             1  the hulk is an anger fueled monster with incre...
2             1  when the skittish emma finds blood on her pill...

Now we want to convert this data into a form that can be used to train models in various frameworks.

fixed_len = 10
r.make_vocab(vocab_size=1000)
r.column("review").as_word_seq(fixed_len=fixed_len)
X, y = r.to_batch("train")
assert X.shape == (len(y), fixed_len, len(r.vocab))
assert y.shape == (len(y), 1)
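The shape in the assertion above, (number of rows, fixed_len, vocabulary size), suggests that each word position is one-hot encoded. The sketch below shows that idea in plain Python under this assumption. The helpers `to_fixed_ids` and `one_hot` and the tags `<pad>` and `<unk>` are hypothetical names for illustration, not chazutsu's API.

```python
def to_fixed_ids(words, word_to_id, fixed_len, padding="<pad>", unknown="<unk>"):
    """Map words to ids, then truncate or pad the sequence to fixed_len."""
    ids = [word_to_id.get(w, word_to_id[unknown]) for w in words[:fixed_len]]
    ids += [word_to_id[padding]] * (fixed_len - len(ids))
    return ids


def one_hot(ids, vocab_size):
    """Turn a list of ids into a (len(ids), vocab_size) matrix of 0/1 values."""
    return [[1 if i == j else 0 for j in range(vocab_size)] for i in ids]


vocab = ["<pad>", "<unk>", "the", "hulk"]
word_to_id = {w: i for i, w in enumerate(vocab)}

# "smashes" is not in the vocabulary, so it maps to the unknown tag.
ids = to_fixed_ids("the hulk smashes".split(), word_to_id, fixed_len=5)
x = one_hot(ids, len(vocab))
print(ids)
print(len(x), len(x[0]))
```

Stacking one such matrix per review gives the three-dimensional shape checked in the assertion.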

make_vocab
- vocab_resources: resources to make vocabulary ("train", "valid", "test")
- columns_for_vocab: The columns to make vocabulary
- tokenizer: Tokenizer
- vocab_size: Vocabulary size
- min_word_freq: Minimum word count for a word to be included in the vocabulary
- unknown: The tag used for out-of-vocabulary words
- padding: The tag used to pad the sequence
- end_of_sentence: If you want to mark the end of a line with a specific tag, set it here.
- reserved_words: Words that should always be included in the vocabulary (e.g., the padding tag)
- force: Don't use cache, re-create the dataset.
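To see how vocab_size, min_word_freq, unknown, and padding interact, here is a minimal standalone sketch of frequency-based vocabulary building. It is not how make_vocab is implemented: the `build_vocab` function, the whitespace tokenizer, and the default tags `<unk>` and `<pad>` are assumptions for illustration.

```python
from collections import Counter


def build_vocab(texts, vocab_size=1000, min_word_freq=1,
                unknown="<unk>", padding="<pad>"):
    """Return a word list: reserved tags first, then words by descending count.

    Hypothetical helper for illustration; not part of chazutsu.
    """
    # Assumption: split on whitespace instead of using a real tokenizer.
    counts = Counter(word for text in texts for word in text.split())
    vocab = [padding, unknown]
    for word, freq in counts.most_common():
        if len(vocab) >= vocab_size:
            break  # the size limit includes the reserved tags
        if freq >= min_word_freq and word not in vocab:
            vocab.append(word)
    return vocab


texts = ["the hulk is an anger fueled monster", "the hulk smashes"]
vocab = build_vocab(texts, min_word_freq=2)
print(vocab)
```

Words seen fewer than min_word_freq times are left out, so at encoding time they would be replaced by the unknown tag.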

If you don't want to load all the training data at once, you can use to_batch_iter.
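The idea behind iterating in batches can be sketched with a plain generator. The signature of to_batch_iter is not shown here, so the `batch_iter` function below is a hypothetical stand-in that only illustrates yielding one batch at a time instead of building everything in memory.

```python
def batch_iter(rows, batch_size):
    """Yield lists of at most batch_size rows, one batch at a time."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # the last batch may be smaller


batches = list(batch_iter(range(7), batch_size=3))
print(batches)
```

Because `rows` can be any iterable, including a file read line by line, only one batch needs to be held in memory at a time.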

Additional Feature

Use on Jupyter

You can use chazutsu on Jupyter Notebook.

Before you run chazutsu on Jupyter, you have to enable the widget extension with the command below.

jupyter nbextension enable --py --sys-prefix widgetsnbextension

Comments

Avoid removing white spaces from paragraph in SQuAD
Avoid removing white spaces from the paragraph; otherwise the answer will be changed

Avoid using strip method

Replace \n with a space in the paragraph

Check test cases passed
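SQuAD locates each answer by its character offset in the paragraph, which is why the choice between these two operations matters. The following sketch uses a made-up paragraph (not real SQuAD data) to show that replacing \n with a space keeps the offsets valid, while strip shifts them.

```python
# Made-up paragraph with leading/trailing spaces and an inner newline.
paragraph = " First line.\nThe answer is here. "
answer_text = "answer"
answer_start = paragraph.index(answer_text)  # character offset of the answer
answer_end = answer_start + len(answer_text)

# Replacing one character with one character keeps every offset unchanged.
replaced = paragraph.replace("\n", " ")
ok_replaced = replaced[answer_start:answer_end] == answer_text

# strip() removes the leading space, so every offset shifts by one.
stripped = paragraph.strip()
ok_stripped = stripped[answer_start:answer_end] == answer_text

print(ok_replaced, ok_stripped)
```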
opened by yasufumy 2

Support Python2.x

I was able to install chazutsu on Python 2.7.10, but there is a problem.

When I import chazutsu, I get the following error:

>>> import chazutsu
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xxx/VirtualEnv/py2venv/lib/python2.7/site-packages/chazutsu/__init__.py", line 1, in <module>
    import chazutsu.datasets as datasets
  File "/Users/xxx/VirtualEnv/py2venv/lib/python2.7/site-packages/chazutsu/datasets/__init__.py", line 1, in <module>
    from .movie_review import MovieReview
  File "/Users/xxx/VirtualEnv/py2venv/lib/python2.7/site-packages/chazutsu/datasets/movie_review.py", line 5, in <module>
    from chazutsu.datasets.framework.dataset import Dataset
  File "/Users/xxx/VirtualEnv/py2venv/lib/python2.7/site-packages/chazutsu/datasets/framework/dataset.py", line 10, in <module>
    from urllib.parse import urlparse
ImportError: No module named parse

Maybe, it's due to the specification change of urllib between python2.x and python3.x.
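In Python 2, urlparse lives in the top-level urlparse module, while Python 3 moved it to urllib.parse. A common way to support both is a guarded import, sketched below. Whether chazutsu adopted this approach is not stated here.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2 fallback

parts = urlparse("https://github.com/chakki-works/chazutsu")
print(parts.netloc, parts.path)
```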

bug

opened by Hironsan 1

CustomerReview method names on README.md are different from code.

CustomerReview class has class methods products5, additional9, more3.

But Datasets README.md shows CustomerReview has 5products, 9additional, 3more. https://github.com/chakki-works/chazutsu/blob/master/chazutsu/datasets/README.md#customer-review-datasets

I think the documentation should match the code. What do you think?
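One likely reason for the difference (an assumption, not something stated in the issue) is that Python identifiers cannot start with a digit, so names like 5products cannot be used as method names. This can be checked with str.isidentifier:

```python
readme_names = ["5products", "9additional", "3more"]
code_names = ["products5", "additional9", "more3"]

# A name that starts with a digit is not a valid Python identifier.
readme_valid = [name.isidentifier() for name in readme_names]
code_valid = [name.isidentifier() for name in code_names]

print(readme_valid)
print(code_valid)
```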
bug

opened by shirakiya 1
Add better support for language modeling data

Language model data does not come in the usual X and y format. The data is sequential, and the labels are obtained by shifting it. So to handle language model data, some trick is needed to change X, y to X_t, X_t+1.
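The shifting described above can be sketched as follows. The `shift_pairs` function and the non-overlapping windowing are assumptions for illustration only, not a proposed chazutsu API.

```python
def shift_pairs(token_ids, seq_len):
    """Cut a token sequence into (X_t, X_t+1) pairs of length seq_len.

    The target of each window is the same window shifted by one token.
    """
    pairs = []
    for start in range(0, len(token_ids) - seq_len, seq_len):
        x = token_ids[start:start + seq_len]
        y = token_ids[start + 1:start + seq_len + 1]
        pairs.append((x, y))
    return pairs


pairs = shift_pairs([10, 11, 12, 13, 14, 15, 16], seq_len=3)
for x, y in pairs:
    print(x, y)
```

Each target token is simply the next token in the sequence, so no separate label column is needed.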

opened by icoxfog417 0
CVE-2007-4559 Patch

Patching CVE-2007-4559

Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.
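The check described above (verify every member before extracting, raise otherwise) can be sketched as follows. This is an illustration written from that description, not the actual patch from the pull request. The names `is_within_directory` and `safe_extractall` are hypothetical, and the check only looks at member names, so it does not cover symlink tricks.

```python
import io
import os
import tarfile


def is_within_directory(directory, target):
    """Return True if target resolves to a path inside directory."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extractall(tar, path="."):
    """Extract all members, refusing archives that would write outside path."""
    for member in tar.getmembers():
        if not is_within_directory(path, os.path.join(path, member.name)):
            raise ValueError("unsafe path in tar file: " + member.name)
    tar.extractall(path)


# Build an in-memory archive whose only member tries to escape upward.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    tar.addfile(tarfile.TarInfo("../evil.txt"))
buf.seek(0)

with tarfile.open(fileobj=buf) as tar:
    try:
        safe_extractall(tar, "data")
        blocked = False
    except ValueError:
        blocked = True

print(blocked)
```

Recent Python versions also provide extraction filters in tarfile, which may be preferable to a hand-written check where they are available.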

opened by TrellixVulnTeam 0
fix flaky test_movie_review.py

This PR aims to improve the reliability of the test test_movie_review.py by changing the encoding method in chazutsu/datasets/framework/resource.py, so that the reading function won't run into an error when reading ’. The error can be reproduced by running the test multiple times. For example, create run.py (touch run.py) and write the following in it:

import os
for i in range(100):
    os.system('pytest tests/test_movie_review.py >> repeat_100_result.log')

opened by shenganzhang 0

Owner

chakki
