🤗 The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use and efficient data manipulation tools

Hugging Face

Last update: Jan 2, 2023

Related tags

Text Data & NLP nlp natural-language-processing computer-vision metrics tensorflow numpy evaluation pandas pytorch datasets

Overview

🤗Datasets is a lightweight library providing two main features:

one-line dataloaders for many public datasets: one-liners to download and pre-process any of the major public datasets (in 467 languages and dialects!) provided on the HuggingFace Datasets Hub. With a simple command like squad_dataset = load_dataset("squad"), get any of these datasets ready to use in a dataloader for training/evaluating an ML model (NumPy/pandas/PyTorch/TensorFlow/JAX),
efficient data pre-processing: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text. With simple commands like tokenized_dataset = dataset.map(tokenize_example), efficiently prepare the dataset for inspection and ML model evaluation and training.

🎓 Documentation 🕹 Colab tutorial

🔎 Find a dataset in the Hub 🌟 Add a new dataset to the Hub

🤗Datasets also provides access to more than 15 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.

🤗Datasets has many additional interesting features:

Thrive on large datasets: 🤗Datasets naturally frees the user from RAM limitations; all datasets are memory-mapped using an efficient zero-serialization-cost backend (Apache Arrow).
Smart caching: never wait for your data to process several times.
Lightweight and fast with a transparent and pythonic API (multi-processing/caching/memory-mapping).
Built-in interoperability with NumPy, pandas, PyTorch, TensorFlow 2 and JAX.
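
To make the memory-mapping idea more concrete, here is a minimal sketch that uses Python's standard mmap module instead of Apache Arrow. The file name and the fixed-size record layout are assumptions made for this example only; 🤗Datasets itself relies on Arrow tables, not on this code.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: one little-endian int64 per record.
RECORD = struct.Struct("<q")


def write_records(path, values):
    """Write integers to disk as fixed-size binary records."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def read_record(path, index):
    """Read one record through a memory map, without reading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = index * RECORD.size
            return RECORD.unpack_from(mm, offset)[0]


tmp_dir = tempfile.mkdtemp()
data_path = os.path.join(tmp_dir, "records.bin")
write_records(data_path, range(1000))
value = read_record(data_path, 123)
```

Only the pages that hold the requested record need to be brought into memory, which is the property that lets a memory-mapped backend work with files larger than RAM.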

🤗Datasets originated from a fork of the awesome TensorFlow Datasets, and the HuggingFace team wants to deeply thank the TensorFlow Datasets team for building this amazing library. More details on the differences between 🤗Datasets and tfds can be found in the section Main differences between 🤗Datasets and tfds.

Installation

With pip

🤗Datasets can be installed from PyPI and should be installed in a virtual environment (venv or conda, for instance):

pip install datasets

With conda

🤗Datasets can be installed using conda as follows:

conda install -c huggingface -c conda-forge datasets

Follow the installation pages of TensorFlow and PyTorch to see how to install them with conda.

For more details on installation, check the installation page in the documentation: https://huggingface.co/docs/datasets/installation.html

Installation to use with PyTorch/TensorFlow/pandas

If you plan to use 🤗Datasets with PyTorch (1.0+), TensorFlow (2.2+) or pandas, you should also install PyTorch, TensorFlow or pandas.

For more details on using the library with NumPy, pandas, PyTorch or TensorFlow, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html

Usage

🤗Datasets is made to be very simple to use. The main methods are:

datasets.list_datasets() to list the available datasets
datasets.load_dataset(dataset_name, **kwargs) to instantiate a dataset
datasets.list_metrics() to list the available metrics
datasets.load_metric(metric_name, **kwargs) to instantiate a metric

Here is a quick example:

from datasets import list_datasets, load_dataset, list_metrics, load_metric

# Print all the available datasets
print(list_datasets())

# Load a dataset and print the first example in the training set
squad_dataset = load_dataset('squad')
print(squad_dataset['train'][0])

# List all the available metrics
print(list_metrics())

# Load a metric
squad_metric = load_metric('squad')

# Process the dataset - add a column with the length of the context texts
dataset_with_length = squad_dataset.map(lambda x: {"length": len(x["context"])})

# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗Transformers library)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

tokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)

For more details on using the library, check the quick tour page in the documentation: https://huggingface.co/docs/datasets/quicktour.html and the specific pages on:

Loading a dataset https://huggingface.co/docs/datasets/loading_datasets.html
What's in a Dataset: https://huggingface.co/docs/datasets/exploring.html
Processing data with 🤗Datasets: https://huggingface.co/docs/datasets/processing.html
Writing your own dataset loading script: https://huggingface.co/docs/datasets/add_dataset.html
etc.

Another introduction to 🤗Datasets is the tutorial on Google Colab.

Add a new dataset to the Hub

We have a very detailed step-by-step guide to add a new dataset to the datasets already provided on the HuggingFace Datasets Hub.

You will find the step-by-step guide here to add a dataset to this repository.

You can also have your own repository for your dataset on the Hub under your or your organization's namespace and share it with the community. More information in the documentation section about dataset sharing.

Main differences between `🤗Datasets` and `tfds`

If you are familiar with the great TensorFlow Datasets, here are the main differences between 🤗Datasets and tfds:

the scripts in 🤗Datasets are not provided within the library but are queried, downloaded/cached and dynamically loaded upon request
🤗Datasets also provides evaluation metrics in a similar fashion to the datasets, i.e. as dynamically installed scripts with a unified API. This gives access to the pair of a benchmark dataset and a benchmark metric for instance for benchmarks like SQuAD or GLUE.
the backend serialization of 🤗Datasets is based on Apache Arrow instead of TF Records and leverages Python dataclasses for info and features with some diverging features (we mostly don't do encoding and store the raw data as much as possible in the backend serialization cache).
the user-facing dataset object of 🤗Datasets is not a tf.data.Dataset but a built-in framework-agnostic dataset class with methods inspired by what we like in tf.data (like a map() method). It basically wraps a memory-mapped Arrow table cache.

Disclaimers

Similar to TensorFlow Datasets, 🤗Datasets is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license.

If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the ML community!

BibTeX

If you want to cite this framework you can use this:

@article{2020HuggingFace-datasets,
  title={Datasets},
  author={Thomas Wolf and Quentin Lhoest and Patrick von Platen and Yacine Jernite and Mariama Drame and Julien Plu and Julien Chaumond and Clement Delangue and Clara Ma and Abhishek Thakur and Suraj Patil and Joe Davison and Teven Le Scao and Victor Sanh and Canwen Xu and Nicolas Patry and Angie McMillan-Major and Simon Brandeis and Sylvain Gugger and François Lagunas and Lysandre Debut and Morgan Funtowicz and Anthony Moi and Sasha Rush and Philipp Schmidd and Pierric Cistac and Victor Muštar and Jeff Boudier and Anna Tordjmann},
  journal={GitHub. Note: https://github.com/huggingface/datasets},
  volume={1},
  year={2020}
}

Comments

Load text file for RoBERTa pre-training.

I am migrating my question from https://github.com/huggingface/transformers/pull/4009#issuecomment-690039444

I tried to train a RoBERTa model from scratch using transformers, but I got OOM issues when loading a large text file. Following the suggestion from @thomwolf, I tried to use datasets to load my text file. This test.txt is a simple sample where each line is a sentence.

import torch
from datasets import load_dataset

dataset = load_dataset('text', data_files='test.txt', cache_dir="./")
dataset.set_format(type='torch', columns=["text"])
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
next(iter(dataloader))

But the dataloader cannot yield a sample, and the error is:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-12-388aca337e2f> in <module>
----> 1 next(iter(dataloader))

/Library/Python/3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    361 
    362     def __next__(self):
--> 363         data = self._next_data()
    364         self._num_yielded += 1
    365         if self._dataset_kind == _DatasetKind.Iterable and \

/Library/Python/3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    401     def _next_data(self):
    402         index = self._next_index()  # may raise StopIteration
--> 403         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    404         if self._pin_memory:
    405             data = _utils.pin_memory.pin_memory(data)

/Library/Python/3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

/Library/Python/3.7/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

KeyError: 0

dataset.set_format(type='torch',columns=["text"]) logs the following message:

Set __getitem__(key) output type to torch for ['text'] columns (when key is int or slice) and don't output other (un-formatted) columns.

I noticed the dataset is DatasetDict({'train': Dataset(features: {'text': Value(dtype='string', id=None)}, num_rows: 44)}). Each sample can be accessed by dataset["train"]["text"] instead of dataset["text"].

Could you please give me any suggestions on how to modify this code to load the text file?
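
The observation above already points at a likely cause: load_dataset returned a DatasetDict keyed by split name, so the integer indices requested by the DataLoader are looked up as if they were split names. The following minimal sketch reproduces that behavior with plain Python containers standing in for DatasetDict and Dataset (an assumption made so the snippet runs without the library installed).

```python
# Plain containers standing in for DatasetDict / Dataset (illustration only).
dataset = {"train": [{"text": "first sentence"}, {"text": "second sentence"}]}

# Indexing the dict of splits with an integer fails, as in the traceback.
try:
    dataset[0]
    error = None
except KeyError as exc:
    error = exc

# Selecting the split first gives an object that supports integer indexing.
train_split = dataset["train"]
first_example = train_split[0]
```

With the real library, the equivalent change would be to pass dataset["train"] to the DataLoader, or to request a single split directly with load_dataset('text', data_files='test.txt', split='train').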

Versions: Python version 3.7.3 PyTorch version 1.6.0 TensorFlow version 2.3.0 datasets version: 1.0.1

opened by chiyuzhang94 43

load_dataset for text files not working

Trying the following snippet, I get different problems on Linux and Windows.

dataset = load_dataset("text", data_files="data.txt")
# or 
dataset = load_dataset("text", data_files=["data.txt"])

(ps This example shows that you can use a string as input for data_files, but the signature is Union[Dict, List].)

The problem on Linux is that the script crashes with a CSV error (even though it isn't a CSV file). On Windows the script just seems to freeze or get stuck after loading the config file.

Linux stack trace:

PyTorch version 1.6.0+cu101 available.
Checking /home/bram/.cache/huggingface/datasets/b1d50a0e74da9a7b9822cea8ff4e4f217dd892e09eb14f6274a2169e5436e2ea.30c25842cda32b0540d88b7195147decf9671ee442f4bc2fb6ad74016852978e.py for additional imports.
Found main folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text
Found specific version folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7
Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py to /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/text.py
Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/dataset_infos.json
Found metadata file for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at /home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/text.json
Using custom data configuration default
Generating dataset text (/home/bram/.cache/huggingface/datasets/text/default-0907112cc6cd2a38/0.0.0/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7)
Downloading and preparing dataset text/default-0907112cc6cd2a38 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/bram/.cache/huggingface/datasets/text/default-0907112cc6cd2a38/0.0.0/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7...
Dataset not on Hf google storage. Downloading and preparing it from source
Downloading took 0.0 min
Checksum Computation took 0.0 min
Unable to verify checksums.
Generating split train
Traceback (most recent call last):
  File "/home/bram/Python/projects/dutch-simplification/utils.py", line 45, in prepare_data
    dataset = load_dataset("text", data_files=dataset_f)
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/load.py", line 608, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/builder.py", line 468, in download_and_prepare
    self._download_and_prepare(
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/builder.py", line 546, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/datasets/builder.py", line 888, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "/home/bram/.local/share/virtualenvs/dutch-simplification-NcpPZtDF/lib/python3.8/site-packages/tqdm/std.py", line 1130, in __iter__
    for obj in iterable:
  File "/home/bram/.cache/huggingface/modules/datasets_modules/datasets/text/7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7/text.py", line 100, in _generate_tables
    pa_table = pac.read_csv(
  File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 1 columns, got 2
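
The traceback shows that the text loader reads the file through a CSV reader, and the error says a row produced two columns where one was expected. One plausible reading is that a line of text contained the reader's delimiter character. The sketch below illustrates that failure mode with Python's standard csv module; the tab delimiter is an assumption, since the delimiter actually used by the loader is not shown in the report.

```python
import csv
import io

# Assumption: the real delimiter used by the loader is not shown in the report.
DELIMITER = "\t"

lines = "a plain sentence\na sentence with a\ttab inside\n"

# Parse each line as a CSV row; quoting is disabled so quotes are kept literally.
reader = csv.reader(io.StringIO(lines), delimiter=DELIMITER, quoting=csv.QUOTE_NONE)
rows = list(reader)
column_counts = [len(row) for row in rows]
```

A strict reader that expects every row to have the same number of columns as the first one would reject the second line here.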

Windows just seems to get stuck. Even with a tiny dataset of 10 lines, it has been stuck for 15 minutes already at this message:

Checking C:\Users\bramv\.cache\huggingface\datasets\b1d50a0e74da9a7b9822cea8ff4e4f217dd892e09eb14f6274a2169e5436e2ea.30c25842cda32b0540d88b7195147decf9671ee442f4bc2fb6ad74016852978e.py for additional imports.
Found main folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text
Found specific version folder for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text\7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7
Found script file from https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py to C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text\7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7\text.py
Couldn't find dataset infos file at https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text\dataset_infos.json
Found metadata file for dataset https://raw.githubusercontent.com/huggingface/datasets/1.0.1/datasets/text/text.py at C:\Users\bramv\.cache\huggingface\modules\datasets_modules\datasets\text\7e13bc0fa76783d4ef197f079dc8acfe54c3efda980f2c9adfab046ede2f0ff7\text.json
Using custom data configuration default

dataset bug

opened by BramVanroy 41

Create Audio feature
Create Audio feature to handle raw audio files.

Some decisions to be further discussed:

I have chosen soundfile as the audio library; another interesting library is librosa, but this requires soundfile (see here). If we require some more advanced functionalities, we could eventually switch the library.

I have implemented the audio feature as an extra: pip install datasets[audio]. For the moment, the typical datasets user uses only text datasets, so there is no need to impose additional audio/image package requirements on users who do not need them.

For tests, I require audio dependencies (so that all audio functionalities are checked with our CI test suite); I exclude Linux platforms, which require an additional library to be installed with the distribution package manager

I also require pytest-datadir, which allows having (audio) data files for tests.

The audio data contain: array and sample_rate.

The array is reshaped as 1D array (expected input for Wav2Vec2).

Note that to install soundfile on Linux, you need to install libsndfile using your distribution’s package manager, for example sudo apt-get install libsndfile1.

Requirements Specification

Access example with audio loading and resampling:
ds[0]["audio"]

Map with audio loading & resampling:
def preprocess(batch):
    batch["input_values"] = processor(batch["audio"]).input_values
    return batch

ds = ds.map(preprocess)

Map without audio loading and resampling:
def preprocess(batch):
    batch["labels"] = processor(batch["target_text"]).input_values
    return batch

ds = ds.map(preprocess)

Additional requirement specification (see https://github.com/huggingface/datasets/pull/2324#pullrequestreview-768864998): cast the audio column to change the sampling rate:
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
opened by albertvillanova 30
Fatal error condition occurred in aws-c-io
Describe the bug

Fatal error when using the library

Steps to reproduce the bug

from datasets import load_dataset
dataset = load_dataset('wikiann', 'en')

Expected results

No fatal errors

Actual results

Fatal error condition occurred in D:\bld\aws-c-io_1633633258269\work\source\event_loop.c:74: aws_thread_launch(&cleanup_thread, s_event_loop_destroy_async_thread_fn, el_group, &thread_options) == AWS_OP_SUCCESS
Exiting Application

Environment info

datasets version: 1.15.2.dev0

Platform: Windows-10-10.0.22504-SP0

Python version: 3.8.12

PyArrow version: 6.0.0

bug
opened by Crabzmatic 26
Checksums didn't match for dataset source
Dataset viewer issue for 'wiki_lingua*'

data = datasets.load_dataset("wiki_lingua", name=language, split="train[:2000]")

NonMatchingChecksumError: Checksums didn't match for dataset source files: ['https://drive.google.com/uc?export=download&id=11wMGqNVSwwk6zUnDaJEgm3qT71kAHeff']

Am I the one who added this dataset? No
dataset-viewer
opened by rafikg 25
Only user permission of saved cache files, not group

Hello,

It seems when a cached file is saved from calling dataset.map for preprocessing, it gets the user permissions and none of the user's group permissions. As we share data files across members of our team, this is causing a bit of an issue as we have to continually reset the permission of the files. Do you know any ways around this or a way to correctly set the permissions?
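
One possible workaround, sketched below, is to adjust the permissions after the cache files have been written. This is not a feature of the library: the helper name, the cache directory, and the file name are hypothetical, and the snippet assumes a POSIX system.

```python
import os
import stat
import tempfile


def add_group_permissions(root):
    """Copy each owner read/write bit onto the matching group bit under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = stat.S_IMODE(os.stat(path).st_mode)
            if mode & stat.S_IRUSR:
                mode |= stat.S_IRGRP
            if mode & stat.S_IWUSR:
                mode |= stat.S_IWGRP
            os.chmod(path, mode)


# Hypothetical cache directory holding one file with user-only permissions.
cache_dir = tempfile.mkdtemp()
cache_file = os.path.join(cache_dir, "cache-example.arrow")
with open(cache_file, "wb") as f:
    f.write(b"")
os.chmod(cache_file, 0o600)

add_group_permissions(cache_dir)
final_mode = stat.S_IMODE(os.stat(cache_file).st_mode)
```

Running such a helper after each dataset.map call would avoid resetting permissions by hand, at the cost of an extra pass over the cache directory.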
enhancement good first issue

opened by lorr1 23
Adding support for generic multi dimensional tensors and auxiliary image data for multimodal datasets

nlp/features.py:

The main factory class is MultiArray. Every time this class is called, a corresponding pyarrow extension array and type class is generated (and added to the list of globals for future use) for a given root data type and set of dimensions/shape. I provide examples on working with this in datasets/lxmert_pretraining_beta/test_multi_array.py

src/nlp/arrow_writer.py

I had to add a method for writing batches that include extension array types because, despite having a unique class for each multidimensional array shape, pyarrow is unable to write any other "array-like" data class to a batch object unless it is of the type pyarrow.ExtensionType. The problem is that when writing multiple batches, the order of the schema and the data to be written get mixed up (the pyarrow datatype in the schema only refers to ExtensionArray, but each ExtensionArray subclass has a different shape) ... possibly I am missing something here and would be grateful if anyone else could take a look!

datasets/lxmert_pretraining_beta/lxmert_pretraining_beta.py & datasets/lxmert_pretraining_beta/to_arrow_data.py:

I have begun adding the data from the original LXMERT paper (https://arxiv.org/abs/1908.07490) hosted here: (https://github.com/airsplay/lxmert). The reason I am not pulling from the source of truth for each individual dataset is that it seems there will also need to be functionality to aggregate multimodal datasets to create a pre-training corpus (:sleepy: ). For now, this is just being used to test and run edge cases for the MultiArray feature, so I've labeled it as "beta_pretraining"!

(still working on the pretraining, just wanted to push out the new functionality sooner than later)

opened by eltoto1219 23
map/filter multiprocessing raises errors and corrupts datasets
After upgrading to 1.0, I started seeing errors in my data loading script after enabling multiprocessing.

...
ner_ds_dict = ner_ds.train_test_split(test_size=test_pct, shuffle=True, seed=seed)
ner_ds_dict["validation"] = ner_ds_dict["test"]
rel_ds_dict = rel_ds.train_test_split(test_size=test_pct, shuffle=True, seed=seed)
rel_ds_dict["validation"] = rel_ds_dict["test"]
return ner_ds_dict, rel_ds_dict

The first train_test_split, ner_ds/ner_ds_dict, returns train and test splits that are iterable. The second, rel_ds/rel_ds_dict in this case, returns a Dataset dict that has rows, but selecting from or slicing into it returns an empty dictionary, e.g. rel_ds_dict['train'][0] == {} and rel_ds_dict['train'][0:100] == {}.

OK, I think I know the problem: rel_ds was mapped through a mapper with num_proc=12. If I remove num_proc, the dataset loads.

I also see errors with other map and filter functions when num_proc is set.

Done writing 67 indices in 536 bytes .
Done writing 67 indices in 536 bytes .
Fatal Python error: PyCOND_WAIT(gil_cond) failed
bug
opened by timothyjlaurent 22
Very slow data loading on large dataset
I made a simple Python script to check the speed of the NLP library; it loads 1.1 TB of textual data. After 8 hours, it is still in the loading step. It works when the text dataset is small (about 1 GB), but it doesn't scale. It also uses a single thread during the data loading step.

import glob
import random

import nlp

train_files = glob.glob("xxx/*.txt", recursive=True)
random.shuffle(train_files)
print(train_files)
dataset = nlp.load_dataset('text', data_files=train_files, name="customDataset", version="1.0.0", cache_dir="xxx/nlp")

Is there something that I am missing ?
opened by agemagician 22
Add a Depth Estimation dataset - DIODE / NYUDepth / KITTI
Name

NYUDepth

Paper

http://cs.nyu.edu/~silberman/papers/indoor_seg_support.pdf

Data

https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html

Motivation

Depth estimation is an important problem in computer vision. We have a couple of Depth Estimation models on Hub as well:

GLPN

DPT

Would be nice to have a dataset for depth estimation. These datasets usually have three things: input image, depth map image, and depth mask (validity mask to indicate if a reading for a pixel is valid or not). Since we already have semantic segmentation datasets on the Hub, I don't think we need any extended utilities to support this addition.

Having this dataset would also allow us to author data preprocessing guides for depth estimation, particularly like the ones we have for other tasks (example).

Ccing @osanseviero @nateraw @NielsRogge

Happy to work on adding it.
dataset request
opened by sayakpaul 21

Dataset librispeech_asr fails to load

Describe the bug

The dataset librispeech_asr (standard Librispeech) fails to load.

Steps to reproduce the bug

datasets.load_dataset("librispeech_asr")

Expected results

It should download and prepare the whole dataset (all subsets).

In the doc, it says it has two configurations (clean and other). However, the dataset doc says that not specifying split should just load the whole dataset, which is what I want.

Also, in the case of this specific dataset, this is the standard that the community uses. When you look at any publications with results on Librispeech, they always use the whole train dataset for training.

Actual results

...
  File "/home/az/.cache/huggingface/modules/datasets_modules/datasets/librispeech_asr/1f4602f6b5fed8d3ab3e3382783173f2e12d9877e98775e34d7780881175096c/librispeech_asr.py", line 119, in LibrispeechASR._split_generators
    line: archive_path = dl_manager.download(_DL_URLS[self.config.name])
    locals:
      archive_path = <not found>
      dl_manager = <local> <datasets.utils.download_manager.DownloadManager object at 0x7fc07b426160>
      dl_manager.download = <local> <bound method DownloadManager.download of <datasets.utils.download_manager.DownloadManager object at 0x7fc07b426160>>
      _DL_URLS = <global> {'clean': {'dev': 'http://www.openslr.org/resources/12/dev-clean.tar.gz', 'test': 'http://www.openslr.org/resources/12/test-clean.tar.gz', 'train.100': 'http://www.openslr.org/resources/12/train-clean-100.tar.gz', 'train.360': 'http://www.openslr.org/resources/12/train-clean-360.tar.gz'}, 'other'...
      self = <local> <datasets_modules.datasets.librispeech_asr.1f4602f6b5fed8d3ab3e3382783173f2e12d9877e98775e34d7780881175096c.librispeech_asr.LibrispeechASR object at 0x7fc12a633310>
      self.config = <local> BuilderConfig(name='default', version=0.0.0, data_dir='/home/az/i6/setups/2022-03-20--sis/work/i6_core/datasets/huggingface/DownloadAndPrepareHuggingFaceDatasetJob.TV6Nwm6dFReF/output/data_dir', data_files=None, description=None)
      self.config.name = <local> 'default', len = 7
KeyError: 'default'
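
The traceback shows the loader indexing _DL_URLS with the config name 'default', which is not one of its keys. The sketch below reproduces that lookup in plain Python, using only the 'clean' URLs that are visible in the traceback (the 'other' entry is truncated there and is left out). The helper urls_for_config is hypothetical and only illustrates how a clearer error could be raised.

```python
# URL table copied from the traceback above; the 'other' entry is truncated
# there, so only the 'clean' entries are reproduced.
_DL_URLS = {
    "clean": {
        "dev": "http://www.openslr.org/resources/12/dev-clean.tar.gz",
        "test": "http://www.openslr.org/resources/12/test-clean.tar.gz",
        "train.100": "http://www.openslr.org/resources/12/train-clean-100.tar.gz",
        "train.360": "http://www.openslr.org/resources/12/train-clean-360.tar.gz",
    },
}


def urls_for_config(config_name):
    """Hypothetical helper: look up URLs and fail with a clearer message."""
    if config_name not in _DL_URLS:
        known = sorted(_DL_URLS)
        raise ValueError(f"Unknown config {config_name!r}; expected one of {known}")
    return _DL_URLS[config_name]


# The failing lookup from the traceback: the config name is 'default'.
try:
    _DL_URLS["default"]
    lookup_error = None
except KeyError as exc:
    lookup_error = exc

clean_urls = urls_for_config("clean")
```

Under this reading, passing a configuration name explicitly, for example datasets.load_dataset("librispeech_asr", "clean"), would presumably avoid the 'default' lookup, although it loads one configuration rather than the whole dataset the report asks for.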

Environment info

datasets version: 2.1.0
Platform: Linux-5.4.0-107-generic-x86_64-with-glibc2.31
Python version: 3.9.9
PyArrow version: 6.0.1
Pandas version: 1.4.2

bug

opened by albertz 21

Finish Deprecating the `fs=` arg

See #5385 for some discussion on this

The fs= arg was deprecated from Dataset.save_to_disk and Dataset.load_from_disk in 2.8.0 (to be removed in 3.0.0). There are a few other places where the fs= arg was still used (functions/methods in datasets.info and datasets.load). This PR adds a similar behavior, warnings and the storage_options= arg to these functions and methods.

One question: should the "deprecated" / "added" versions be 2.8.1 for the docs/warnings on these? Right now I'm going with "fs was deprecated in 2.8.0" but "storage_options= was added in 2.8.1" where appropriate.
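
As an illustration of the deprecation pattern discussed here, the following sketch shows one way a function can keep accepting fs= while steering callers toward storage_options=. The function load_from_disk_sketch and the FakeFileSystem class are hypothetical stand-ins, not the library's implementation; only the version numbers in the warning come from the description above.

```python
import warnings


def load_from_disk_sketch(path, fs="deprecated", storage_options=None):
    """Hypothetical stand-in showing how a deprecated fs= arg can be bridged."""
    if fs != "deprecated":
        warnings.warn(
            "'fs' was deprecated in favor of 'storage_options' in version 2.8.0 "
            "and will be removed in 3.0.0.",
            FutureWarning,
        )
        # Reuse the options carried by the filesystem object when it has any.
        storage_options = getattr(fs, "storage_options", storage_options)
    return {"path": path, "storage_options": storage_options}


class FakeFileSystem:
    """Hypothetical filesystem object exposing storage_options."""

    storage_options = {"anon": True}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = load_from_disk_sketch("s3://bucket/data", fs=FakeFileSystem())
```

Using a sentinel default such as "deprecated" lets the function tell "fs was not passed" apart from "fs=None was passed explicitly".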

@mariosasko

opened by dconathan 2
Whisper Event - RuntimeError: The size of tensor a (504) must match the size of tensor b (448) at non-singleton dimension 1 100% 1000/1000 [2:52:21<00:00, 10.34s/it]

Done in a VM with a GPU (Ubuntu) following the Whisper Event - PYTHON instructions.

Attempted using the thread "RuntimeError: The size of tensor a (504) must match the size of tensor b (448) at non-singleton dimension 1 100% 1000/1000" (web), where another person is experiencing the same issue, but could not resolve the issue with the google/fleurs data. It is not clear what can be modified in the Python code to resolve the input data size mismatch, as the training data is already very small.

Tried posting on Discord, @sanchit-gandhi and @vaibhavs10. Was hoping that the event is over and some input/help is now available. Hugging Face - whisper-small-amet.

According to the paper Robust Speech Recognition via Large-Scale Weak Supervision, am_et is a low-resource language (Table E), with WER results ranging from 120 to 229 depending on model size (Whisper small WER = 120.2).

---> Initial Training Output

/usr/local/lib/python3.8/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  warnings.warn(
[INFO|trainer.py:1641] 2022-12-18 05:23:28,799 >> ***** Running training *****
[INFO|trainer.py:1642] 2022-12-18 05:23:28,799 >> Num examples = 446
[INFO|trainer.py:1643] 2022-12-18 05:23:28,799 >> Num Epochs = 72
[INFO|trainer.py:1644] 2022-12-18 05:23:28,799 >> Instantaneous batch size per device = 16
[INFO|trainer.py:1645] 2022-12-18 05:23:28,799 >> Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1646] 2022-12-18 05:23:28,799 >> Gradient Accumulation steps = 2
[INFO|trainer.py:1647] 2022-12-18 05:23:28,800 >> Total optimization steps = 1000
[INFO|trainer.py:1648] 2022-12-18 05:23:28,801 >> Number of trainable parameters = 241734912

---> Error

14% 9/65 [07:07<48:34, 52.04s/it]
[INFO|configuration_utils.py:523] 2022-12-18 05:03:07,941 >> Generate config GenerationConfig {
  "begin_suppress_tokens": [
    220,
    50257
  ],
  "bos_token_id": 50257,
  "decoder_start_token_id": 50258,
  "eos_token_id": 50257,
  "max_length": 448,
  "pad_token_id": 50257,
  "transformers_version": "4.26.0.dev0",
  "use_cache": false
}

Traceback (most recent call last):
  File "run_speech_recognition_seq2seq_streaming.py", line 629, in <module>
    main()
  File "run_speech_recognition_seq2seq_streaming.py", line 578, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1534, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1859, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2122, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer_seq2seq.py", line 78, in evaluate
    return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2818, in evaluate
    output = eval_loop(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 3000, in evaluation_loop
    loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer_seq2seq.py", line 213, in prediction_step
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/whisper/modeling_whisper.py", line 1197, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/whisper/modeling_whisper.py", line 1066, in forward
    decoder_outputs = self.decoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/whisper/modeling_whisper.py", line 873, in forward
    hidden_states = inputs_embeds + positions
RuntimeError: The size of tensor a (504) must match the size of tensor b (448) at non-singleton dimension 1
100% 1000/1000 [2:52:21<00:00, 10.34s/it]

opened by catswithbats 0
Missing documentation page : improve-performance

Describe the bug

Trying to access https://huggingface.co/docs/datasets/v2.8.0/en/package_reference/cache#improve-performance, the page is missing.

The link appears here: https://huggingface.co/docs/datasets/v2.8.0/en/package_reference/loading_methods#datasets.load_dataset.keep_in_memory

Steps to reproduce the bug

Access the page and see it's missing.

Expected behavior

Not missing page

Environment info

Doesn't matter

opened by astariul 1
Is `fs=` deprecated in `load_from_disk()` as well?

Describe the bug

The fs= argument was deprecated from Dataset.save_to_disk and Dataset.load_from_disk in favor of automagically figuring it out via fsspec: https://github.com/huggingface/datasets/blob/9a7272cd4222383a5b932b0083a4cc173fda44e8/src/datasets/arrow_dataset.py#L1339-L1340

Is there a reason the same thing shouldn't apply to datasets.load.load_from_disk() as well?

https://github.com/huggingface/datasets/blob/9a7272cd4222383a5b932b0083a4cc173fda44e8/src/datasets/load.py#L1779
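To make "figuring it out" from the path concrete, here is a minimal standard-library sketch of reading the protocol off a dataset path. The helper name `infer_protocol` is hypothetical and is not part of datasets or fsspec; the real libraries resolve a full filesystem object, while this sketch only extracts the protocol prefix.

```python
def infer_protocol(path: str, default: str = "file") -> str:
    """Return the protocol prefix of a path such as "s3://bucket/dir".

    Hypothetical helper for illustration only; paths without a "://"
    separator are treated as local files.
    """
    if "://" in path:
        protocol, _ = path.split("://", 1)
        return protocol
    return default


print(infer_protocol("s3://my-bucket/my_dataset"))
print(infer_protocol("/data/my_dataset"))
```

With this kind of inference, a caller only passes the path, and a separate fs= argument becomes redundant.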

Steps to reproduce the bug

n/a

Expected behavior

n/a

Environment info

n/a

opened by dconathan 2

Releases(2.8.0)

2.8.0(Dec 19, 2022)
Important

Removed YAML integer keys from class_label metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/5277

From now on, datasets pushed to the Hub that use ClassLabel will use a new YAML model to store the feature types.

The new model uses strings instead of integers for the ids in the label name mapping (e.g. 0 -> "0"). This is due to Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.

Old versions of datasets are not able to reload datasets pushed with this new model, so we encourage everyone to update.
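As a rough picture of the change, the label name mapping moves from integer keys to string keys. The sketch below uses plain dictionaries; the function name `stringify_label_ids` and the label names are made up for illustration, and this is not the library's serialization code.

```python
def stringify_label_ids(names_by_id: dict) -> dict:
    """Convert integer label ids to string keys, e.g. 0 -> "0"."""
    return {str(label_id): name for label_id, name in names_by_id.items()}


old_mapping = {0: "negative", 1: "positive"}   # old model: integer ids
new_mapping = stringify_label_ids(old_mapping)  # new model: string ids
print(new_mapping)
```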

Datasets Features

Fix methods using IterableDataset.map that lead to features=None by @alvarobartt in https://github.com/huggingface/datasets/pull/5287

Datasets in streaming mode now update their features after column renaming or removal

Add num_proc to from_csv/generator/json/parquet/text by @lhoestq in https://github.com/huggingface/datasets/pull/5239

Use multiprocessing to load multiple files in parallel

Add features param to IterableDataset.map by @alvarobartt in https://github.com/huggingface/datasets/pull/5311

Sharded save_to_disk + multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/5268

Pass num_shards or max_shard_size to ds.save_to_disk() or ds.push_to_hub()

Pass num_proc to use multiprocessing.
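The relationship between max_shard_size and num_shards can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the helpers `parse_size` and `shard_ranges`, the unit table, and the dataset size are all hypothetical.

```python
import math
import re

# Decimal and binary units; this table is an assumption made for the sketch.
_UNITS = {
    "B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40,
}


def parse_size(size) -> int:
    """Turn 500 or "500MB" into a number of bytes (hypothetical helper)."""
    if isinstance(size, int):
        return size
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]+)\s*", size)
    if match is None or match.group(2).upper() not in _UNITS:
        raise ValueError(f"unrecognized size: {size!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def shard_ranges(num_rows: int, num_shards: int) -> list:
    """Split row indices into contiguous, near-equal (start, stop) ranges."""
    base, extra = divmod(num_rows, num_shards)
    ranges, start = [], 0
    for index in range(num_shards):
        # The first `extra` shards take one additional row each.
        stop = start + base + (1 if index < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


dataset_nbytes = 1_200_000_000  # made-up dataset size in bytes
num_shards = math.ceil(dataset_nbytes / parse_size("500MB"))
print(num_shards, shard_ranges(10, num_shards))
```

Once the shard boundaries are fixed, each shard can be written independently, which is what makes a num_proc option natural for this operation.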

Support for decoding Image/Audio types in map when format type is not default one by @mariosasko in https://github.com/huggingface/datasets/pull/5252

Support torch dataloader without torch formatting for IterableDataset by @lhoestq in https://github.com/huggingface/datasets/pull/5357

You can now pass any dataset in streaming mode to a PyTorch DataLoader directly:

    from datasets import load_dataset
    from torch.utils.data import DataLoader

    ds = load_dataset("c4", "en", streaming=True, split="train")
    dataloader = DataLoader(ds, batch_size=32, num_workers=4)

Docs

Complete doc migration by @mishig25 in https://github.com/huggingface/datasets/pull/5248

General improvements and bug fixes

typo by @WrRan in https://github.com/huggingface/datasets/pull/5253

typo by @WrRan in https://github.com/huggingface/datasets/pull/5254

remove an unused statement by @WrRan in https://github.com/huggingface/datasets/pull/5257

fix wrong print by @WrRan in https://github.com/huggingface/datasets/pull/5256

Fix max_shard_size docs by @lhoestq in https://github.com/huggingface/datasets/pull/5267

Specify arguments as keywords in librosa.reshape to avoid future errors by @polinaeterna in https://github.com/huggingface/datasets/pull/5266

Change release procedure to use only pull requests by @albertvillanova in https://github.com/huggingface/datasets/pull/5250

Warn about checksums by @lhoestq in https://github.com/huggingface/datasets/pull/5279

Tweak readme by @lhoestq in https://github.com/huggingface/datasets/pull/5210

Save file name in embed_storage by @lhoestq in https://github.com/huggingface/datasets/pull/5285

Use correct dataset type in from_generator docs by @mariosasko in https://github.com/huggingface/datasets/pull/5307

Support streaming datasets with pathlib.Path.with_suffix by @albertvillanova in https://github.com/huggingface/datasets/pull/5294

Fix xjoin for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5297

Fix xopen for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5299

Ci py3.10 by @lhoestq in https://github.com/huggingface/datasets/pull/5065

Update Overview.ipynb google colab by @lhoestq in https://github.com/huggingface/datasets/pull/5211

Support xPath for Windows pathnames by @albertvillanova in https://github.com/huggingface/datasets/pull/5310

Fix description of streaming in the docs by @polinaeterna in https://github.com/huggingface/datasets/pull/5313

Fix Text sample_by paragraph by @albertvillanova in https://github.com/huggingface/datasets/pull/5319

[Extract] Place the lock file next to the destination directory by @lhoestq in https://github.com/huggingface/datasets/pull/5320

Fix loading from HF GCP cache by @lhoestq in https://github.com/huggingface/datasets/pull/5321

This was affecting datasets like wikipedia or natural_questions

Fix docs building for main by @albertvillanova in https://github.com/huggingface/datasets/pull/5328

Origin/fix missing features error by @eunseojo in https://github.com/huggingface/datasets/pull/5318

fix: 🐛 pass the token to get the list of config names by @severo in https://github.com/huggingface/datasets/pull/5333

Clarify imagefolder is for small datasets by @stevhliu in https://github.com/huggingface/datasets/pull/5329

Close stream in ArrowWriter.finalize before inference error by @mariosasko in https://github.com/huggingface/datasets/pull/5309

Use same num_proc for dataset download and generation by @mariosasko in https://github.com/huggingface/datasets/pull/5300

Set IterableDataset.map param batch_size typing as optional by @alvarobartt in https://github.com/huggingface/datasets/pull/5336

fix: dataset path should be absolute by @vigsterkr in https://github.com/huggingface/datasets/pull/5234

Clean up DatasetInfo and Dataset docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5340

Clean up docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5334

Remove tasks.json by @lhoestq in https://github.com/huggingface/datasets/pull/5341

Support topdown parameter in xwalk by @mariosasko in https://github.com/huggingface/datasets/pull/5308

Improve use_auth_token docstring and deprecate use_auth_token in download_and_prepare by @mariosasko in https://github.com/huggingface/datasets/pull/5302

Clean up Loading methods docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5350

Clean up remaining Main Classes docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5349

Clean up Dataset and DatasetDict by @stevhliu in https://github.com/huggingface/datasets/pull/5344

Clean up Table class docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5355

Raise error for .tar archives in the same way as for .tar.gz and .tgz in _get_extraction_protocol by @polinaeterna in https://github.com/huggingface/datasets/pull/5322

Clean filesystem and logging docstrings by @stevhliu in https://github.com/huggingface/datasets/pull/5356

ExamplesIterable fixes by @lhoestq in https://github.com/huggingface/datasets/pull/5366

Simplify skipping by @Muennighoff in https://github.com/huggingface/datasets/pull/5373

Release: 2.8.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5375

New Contributors

@WrRan made their first contribution in https://github.com/huggingface/datasets/pull/5253

@eunseojo made their first contribution in https://github.com/huggingface/datasets/pull/5318

@vigsterkr made their first contribution in https://github.com/huggingface/datasets/pull/5234

@Muennighoff made their first contribution in https://github.com/huggingface/datasets/pull/5373

Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.8.0
Source code(tar.gz)
Source code(zip)
2.7.1(Nov 22, 2022)
Bug fixes

Remove YAML integer keys from class_label metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/5277

Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.7.1
2.6.2(Nov 22, 2022)
Bug fixes

Remove YAML integer keys from class_label metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/5277

Full Changelog: https://github.com/huggingface/datasets/compare/2.6.1...2.6.2
2.7.0(Nov 16, 2022)
Dataset Features

Multiprocessed dataset builder by @TevenLeScao in https://github.com/huggingface/datasets/pull/5107

Load big datasets faster than before using multiprocessing:

    from datasets import load_dataset

    ds = load_dataset("imagenet-1k", num_proc=4)

Make torch.Tensor and spacy models cacheable by @mariosasko in https://github.com/huggingface/datasets/pull/5191

Function passed to map or filter that uses tensors or pipelines can now be cached
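Caching a map or filter call depends on turning the function into a stable key. The following is a simplified stand-in for that idea, assuming a hash over the function's bytecode, constants and referenced names; `function_fingerprint` is a hypothetical name, closures and nested functions are ignored, and the library's own hashing mechanism is more involved and is not reproduced here.

```python
import hashlib


def function_fingerprint(fn, previous_fingerprint: str = "") -> str:
    """Derive a cache key from a function's compiled code (illustrative only)."""
    code = fn.__code__
    hasher = hashlib.sha256()
    # Chain with the fingerprint of the input data so the key identifies
    # "this function applied to that dataset state".
    hasher.update(previous_fingerprint.encode())
    hasher.update(code.co_code)
    hasher.update(repr(code.co_consts).encode())
    hasher.update(repr(code.co_names).encode())
    return hasher.hexdigest()[:16]


def add_one(example):
    return {"x": example["x"] + 1}


def add_two(example):
    return {"x": example["x"] + 2}


print(function_fingerprint(add_one) == function_fingerprint(add_one))
print(function_fingerprint(add_one) == function_fingerprint(add_two))
```

An object that cannot be hashed deterministically breaks this scheme, which is why making tensors and pipelines hashable allows functions that use them to be cached.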

Drop labels in Image and Audio folders if files are on different levels in directory or if there is only one label by @polinaeterna in https://github.com/huggingface/datasets/pull/5192

TextConfig: added "errors" by @NightMachinery in https://github.com/huggingface/datasets/pull/5155

Audio setup

Add ffmpeg4 installation instructions in warnings by @polinaeterna in https://github.com/huggingface/datasets/pull/5167

Docs

Update create image dataset docs by @stevhliu in https://github.com/huggingface/datasets/pull/5177

add: segmentation guide. by @sayakpaul in https://github.com/huggingface/datasets/pull/5188

Reword E2E training and inference tips in the vision guides by @sayakpaul in https://github.com/huggingface/datasets/pull/5217

Add SQL guide by @stevhliu in https://github.com/huggingface/datasets/pull/5223

General improvements and bug fixes

Add pyproject.toml for black by @mariosasko in https://github.com/huggingface/datasets/pull/5125

Fix tqdm zip bug by @david1542 in https://github.com/huggingface/datasets/pull/5120

Install tensorflow-macos dependency conditionally by @albertvillanova in https://github.com/huggingface/datasets/pull/5124

[TYPO] Update new_dataset_script.py by @cakiki in https://github.com/huggingface/datasets/pull/5119

Avoid extra cast in class_encode_column by @mariosasko in https://github.com/huggingface/datasets/pull/5130

Use yaml for issue templates + revamp by @mariosasko in https://github.com/huggingface/datasets/pull/5116

Update docs once dataset scripts transferred to the Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/5136

Delete duplicate issue template file by @albertvillanova in https://github.com/huggingface/datasets/pull/5146

Deprecate num_proc parameter in DownloadManager.extract by @ayushthe1 in https://github.com/huggingface/datasets/pull/5142

Raise ImportError instead of OSError by @ayushthe1 in https://github.com/huggingface/datasets/pull/5141

Fix CI require beam by @albertvillanova in https://github.com/huggingface/datasets/pull/5168

Make iter_files deterministic by @albertvillanova in https://github.com/huggingface/datasets/pull/5149

Add PB and TB in convert_file_size_to_int by @lhoestq in https://github.com/huggingface/datasets/pull/5171

Reduce default max writer_batch_size by @mariosasko in https://github.com/huggingface/datasets/pull/5163

Support dill 0.3.6 by @albertvillanova in https://github.com/huggingface/datasets/pull/5166

Make filename matching more robust by @riccardobucco in https://github.com/huggingface/datasets/pull/5128

Preserve None in list type cast in PyArrow 10 by @mariosasko in https://github.com/huggingface/datasets/pull/5174

Raise ffmpeg warnings only once by @polinaeterna in https://github.com/huggingface/datasets/pull/5173

Add "ipykernel" to list of co_filenames to remove by @gpucce in https://github.com/huggingface/datasets/pull/5169

chore: add notebook links to img cls and obj det. by @sayakpaul in https://github.com/huggingface/datasets/pull/5187

Fix docs about dataset_info in YAML by @albertvillanova in https://github.com/huggingface/datasets/pull/5194

fsspec lock reset in multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/5159

Add note about the name of a dataset script by @polinaeterna in https://github.com/huggingface/datasets/pull/5198

Deprecate dummy data generation command by @mariosasko in https://github.com/huggingface/datasets/pull/5199

Do not sort splits in dataset info by @polinaeterna in https://github.com/huggingface/datasets/pull/5201

Add missing DownloadConfig.use_auth_token value by @alvarobartt in https://github.com/huggingface/datasets/pull/5205

Update canonical links to Hub links by @stevhliu in https://github.com/huggingface/datasets/pull/5203

Refactor CI hub fixtures to use monkeypatch instead of patch by @albertvillanova in https://github.com/huggingface/datasets/pull/5208

Update github pr docs actions by @mishig25 in https://github.com/huggingface/datasets/pull/5214

Use hfh hf_hub_url function by @albertvillanova in https://github.com/huggingface/datasets/pull/5196

Pin typer version in tests to <0.5 to fix Windows CI by @polinaeterna in https://github.com/huggingface/datasets/pull/5235

Fix shards in IterableDataset.from_generator by @lhoestq in https://github.com/huggingface/datasets/pull/5233

Fix class name of symbolic link by @riccardobucco in https://github.com/huggingface/datasets/pull/5126

Make Version hashable by @mariosasko in https://github.com/huggingface/datasets/pull/5238

Handle ArrowNotImplementedError caused by try_type being Image or Audio in cast by @mariosasko in https://github.com/huggingface/datasets/pull/5236

Encode path only for old versions of hfh by @lhoestq in https://github.com/huggingface/datasets/pull/5237

Fix CI require_beam maximum compatible dill version by @albertvillanova in https://github.com/huggingface/datasets/pull/5212

Support hfh rc version by @lhoestq in https://github.com/huggingface/datasets/pull/5241

Cleaner error tracebacks for dataset script errors by @mariosasko in https://github.com/huggingface/datasets/pull/5240

New Contributors

@david1542 made their first contribution in https://github.com/huggingface/datasets/pull/5120

@ayushthe1 made their first contribution in https://github.com/huggingface/datasets/pull/5142

@gpucce made their first contribution in https://github.com/huggingface/datasets/pull/5169

@sayakpaul made their first contribution in https://github.com/huggingface/datasets/pull/5187

@NightMachinery made their first contribution in https://github.com/huggingface/datasets/pull/5155

Full Changelog: https://github.com/huggingface/datasets/compare/2.6.1...2.7.0
2.6.1(Oct 14, 2022)
Bug fixes

Fix filter indices when batched by @albertvillanova in https://github.com/huggingface/datasets/pull/5113

fixed a bug where filter could return examples with the wrong indices
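The exact cause is described in the linked pull request. As a generic illustration of how batched filtering can go wrong, the sketch below shows the bookkeeping involved: positions inside a batch are local, so they need the batch offset added before they are used as dataset indices. The function name and data are hypothetical.

```python
def batched_filter_indices(rows, predicate, batch_size):
    """Return dataset-wide indices of rows kept by a batched predicate."""
    kept = []
    for offset in range(0, len(rows), batch_size):
        batch = rows[offset:offset + batch_size]
        mask = predicate(batch)  # one boolean per row in the batch
        # Positions in `mask` are batch-local; add the offset to make them global.
        kept.extend(offset + i for i, keep in enumerate(mask) if keep)
    return kept


rows = [3, 8, 1, 9, 4, 7]
indices = batched_filter_indices(
    rows, lambda batch: [value > 5 for value in batch], batch_size=4
)
print(indices)
```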

Fix iter_batches by @lhoestq in https://github.com/huggingface/datasets/pull/5115

fixed a bug where map with batched=True could return a dataset with fewer examples
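A batch iterator has to yield the final partial batch for the number of examples to be preserved. The sketch below is a plain-Python illustration of that property over a list; it is not the library's iter_batches code.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive batches, including a final partial batch."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = list(range(10))
batches = list(iter_batches(rows, batch_size=4))
print([len(batch) for batch in batches])
```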

Fix a typo in arrow_dataset.py by @yangky11 in https://github.com/huggingface/datasets/pull/5108

New Contributors

@yangky11 made their first contribution in https://github.com/huggingface/datasets/pull/5108

Full Changelog: https://github.com/huggingface/datasets/compare/2.6.0...2.6.1
2.6.0(Oct 13, 2022)
Important

[GH->HF] Remove all dataset scripts from github by @lhoestq in https://github.com/huggingface/datasets/pull/4974

all the dataset scripts and dataset cards are now on https://hf.co/datasets

we invite users and contributors to open discussions or pull requests on the Hugging Face Hub from now on

Datasets features

Add ability to read-write to SQL databases. by @Dref360 in https://github.com/huggingface/datasets/pull/4928

Read from sqlite file:

    from datasets import Dataset

    dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")

Allow connection objects in from_sql + small doc improvement by @mariosasko in https://github.com/huggingface/datasets/pull/5091

    from datasets import Dataset
    from sqlite3 import connect

    con = connect(...)
    dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)
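To see what a connection-plus-query input yields before it reaches the library, here is a standard-library sketch that runs a similar query against an in-memory SQLite database and arranges the result by column. The table name `data_table` and the inserted rows are made up for the example, and the `columns` dictionary is only an approximation of the column-oriented shape a dataset has.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data_table (text TEXT)")
con.executemany(
    "INSERT INTO data_table (text) VALUES (?)",
    [("short",), ("x" * 150,), ("y" * 200,)],
)

cursor = con.execute(
    "SELECT text FROM data_table WHERE length(text) > 100 LIMIT 10"
)
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()

# Column-oriented view of the result, similar in shape to a dataset's columns.
columns = {name: [row[i] for row in rows] for i, name in enumerate(column_names)}
print(column_names, len(columns["text"]))
con.close()
```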

Image & Audio formatting for numpy/torch/tf/jax by @lhoestq in https://github.com/huggingface/datasets/pull/5072

return numpy/torch/tf/jax tensors with

    from datasets import load_dataset

    ds = load_dataset("imagenet-1k", split="train").with_format("torch")  # or numpy/tf/jax
    ds[0]["image"]

Added IterableDataset.from_generator by @hamid-vakilzadeh in https://github.com/huggingface/datasets/pull/5052

Fast dataset iter by @mariosasko in https://github.com/huggingface/datasets/pull/5030

speed up by a factor of 2 using the Arrow Table reader

Dataset infos in yaml by @lhoestq in https://github.com/huggingface/datasets/pull/4926

you can now specify the feature types and number of samples in the dataset card, see https://huggingface.co/docs/datasets/dataset_card

Add kwargs to Dataset.from_generator by @mariosasko in https://github.com/huggingface/datasets/pull/5049

Support converters in CsvBuilder by @mariosasko in https://github.com/huggingface/datasets/pull/5057

Restore saved format state in load_from_disk by @asofiaoliveira in https://github.com/huggingface/datasets/pull/5073

Dataset changes

Update: hendrycks_test - support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/5041

Update: swiss judgment prediction by @JoelNiklaus in https://github.com/huggingface/datasets/pull/5019

Update swiss judgment prediction by @JoelNiklaus in https://github.com/huggingface/datasets/pull/5042

Fix: xcsr - fix languages of X-CSQA configs by @albertvillanova in https://github.com/huggingface/datasets/pull/5022

Fix: sbu_captions - fix URLs by @donglixp in https://github.com/huggingface/datasets/pull/5020

Fix: xcsr - fix string features by @albertvillanova in https://github.com/huggingface/datasets/pull/5024

Fix: hendrycks_test - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/5040

Fix: cats_vs_dogs - fix number of samples by @lhoestq in https://github.com/huggingface/datasets/pull/5047

Fix: lex_glue - fix bug with labels of eurlex config of lex_glue dataset by @iliaschalkidis in https://github.com/huggingface/datasets/pull/5048

Fix: msr_sqa - fix dataset generation by @Timothyxxx in https://github.com/huggingface/datasets/pull/3715

Dataset cards

Add description to hellaswag dataset by @julien-c in https://github.com/huggingface/datasets/pull/4810

Add deprecation warning to multilingual_librispeech dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/5010

Update languages in aeslc dataset card by @apergo-ai in https://github.com/huggingface/datasets/pull/3357

Update license to bookcorpus dataset card by @meg-huggingface in https://github.com/huggingface/datasets/pull/3526

Update paper link in medmcqa dataset card by @monk1337 in https://github.com/huggingface/datasets/pull/4290

Add oversampling strategy iterable datasets interleave by @ylacombe in https://github.com/huggingface/datasets/pull/5036

Fix license/citation information of squadshifts dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/5054

General improvements and bug fixes

Fix missing use_auth_token in streaming docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/5003

Add some note about running the transformers ci before a release by @lhoestq in https://github.com/huggingface/datasets/pull/5007

Remove license tag file and validation by @albertvillanova in https://github.com/huggingface/datasets/pull/5004

Re-apply input columns change by @mariosasko in https://github.com/huggingface/datasets/pull/5008

patch CI_HUB_TOKEN_PATH with Path instead of str by @Wauplin in https://github.com/huggingface/datasets/pull/5026

Fix typo in error message by @severo in https://github.com/huggingface/datasets/pull/5027

Fix import in ClassLabel docstring example by @alvarobartt in https://github.com/huggingface/datasets/pull/5029

Remove redundant code from some dataset module factories by @albertvillanova in https://github.com/huggingface/datasets/pull/5033

Fix typos in load docstrings and comments by @albertvillanova in https://github.com/huggingface/datasets/pull/5035

Prefer split patterns from directories over split patterns from filenames by @polinaeterna in https://github.com/huggingface/datasets/pull/4985

Fix tar extraction vuln by @lhoestq in https://github.com/huggingface/datasets/pull/5016

Support hfh 0.10 implicit auth by @lhoestq in https://github.com/huggingface/datasets/pull/5031

Fix flatten_indices with empty indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/5043

Improve CI performance speed of PackagedDatasetTest by @albertvillanova in https://github.com/huggingface/datasets/pull/5037

Revert task removal in folder-based builders by @mariosasko in https://github.com/huggingface/datasets/pull/5051

Fix backward compatibility for dataset_infos.json by @lhoestq in https://github.com/huggingface/datasets/pull/5055

Fix typo by @stevhliu in https://github.com/huggingface/datasets/pull/5059

Fix CI hfh token warning by @albertvillanova in https://github.com/huggingface/datasets/pull/5062

Mark CI tests as xfail when 502 error by @albertvillanova in https://github.com/huggingface/datasets/pull/5058

Fix passed download_config in HubDatasetModuleFactoryWithoutScript by @albertvillanova in https://github.com/huggingface/datasets/pull/5077

Fix CONTRIBUTING once dataset scripts transferred to Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/5067

Fix header level in Audio docs by @stevhliu in https://github.com/huggingface/datasets/pull/5078

Support DEFAULT_CONFIG_NAME when no BUILDER_CONFIGS by @albertvillanova in https://github.com/huggingface/datasets/pull/5071

Support streaming gzip.open by @albertvillanova in https://github.com/huggingface/datasets/pull/5066

adding keep in memory by @Mustapha-AJEGHRIR in https://github.com/huggingface/datasets/pull/5082

refactor: replace AssertionError with more meaningful exceptions (#5074) by @galbwe in https://github.com/huggingface/datasets/pull/5079

fix: update exception throw from OSError to EnvironmentError in `push… by @rahulXs in https://github.com/huggingface/datasets/pull/5076

Align signature of list_repo_files with latest hfh by @albertvillanova in https://github.com/huggingface/datasets/pull/5063

Align signature of create/delete_repo with latest hfh by @albertvillanova in https://github.com/huggingface/datasets/pull/5064

Fix filter with empty indices by @Mouhanedg56 in https://github.com/huggingface/datasets/pull/5087

Fix tutorial (#5093) by @riccardobucco in https://github.com/huggingface/datasets/pull/5095

Use HTML relative paths for tiles in the docs by @lewtun in https://github.com/huggingface/datasets/pull/5092

Fix loading how to guide (#5102) by @riccardobucco in https://github.com/huggingface/datasets/pull/5104

url encode hub url (#5099) by @riccardobucco in https://github.com/huggingface/datasets/pull/5103

Free the "hf" filesystem protocol for hffs by @lhoestq in https://github.com/huggingface/datasets/pull/5101

Fix task template reload from dict by @lhoestq in https://github.com/huggingface/datasets/pull/5106

New Contributors

@Wauplin made their first contribution in https://github.com/huggingface/datasets/pull/5026

@donglixp made their first contribution in https://github.com/huggingface/datasets/pull/5020

@Timothyxxx made their first contribution in https://github.com/huggingface/datasets/pull/3715

@hamid-vakilzadeh made their first contribution in https://github.com/huggingface/datasets/pull/5052

@Mustapha-AJEGHRIR made their first contribution in https://github.com/huggingface/datasets/pull/5082

@galbwe made their first contribution in https://github.com/huggingface/datasets/pull/5079

@rahulXs made their first contribution in https://github.com/huggingface/datasets/pull/5076

@Mouhanedg56 made their first contribution in https://github.com/huggingface/datasets/pull/5087

@riccardobucco made their first contribution in https://github.com/huggingface/datasets/pull/5095

@asofiaoliveira made their first contribution in https://github.com/huggingface/datasets/pull/5073

Full Changelog: https://github.com/huggingface/datasets/compare/2.5.1...2.6.0
2.5.2(Oct 5, 2022)
Bug fixes

Revert task removal in folder-based builders (#5051)

Support hfh 0.10 implicit auth (#5031)

Full Changelog: https://github.com/huggingface/datasets/compare/2.5.1...2.5.2
2.5.1(Sep 21, 2022)
Bug fixes

Revert input_columns change by @lhoestq in https://github.com/huggingface/datasets/pull/5006

Full Changelog: https://github.com/huggingface/datasets/compare/2.5.0...2.5.1
2.5.0(Sep 21, 2022)
Important

Drop Python 3.6 support by @mariosasko in https://github.com/huggingface/datasets/pull/4460

Deprecate metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4739

Metrics are now deprecated and have been moved to evaluate:
    !pip install evaluate

    import evaluate
    metric = evaluate.load("accuracy")

Load GitHub datasets from Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4059

datasets with no namespace, like "squad", used to be loaded from this GitHub repository; they are now loaded from https://huggingface.co/datasets

Decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in https://github.com/huggingface/datasets/pull/4923

the latest version of torchaudio (0.12) now requires ffmpeg (version 4) to read MP3 files; for now, please downgrade torchaudio to a version below 0.12 or use librosa

Use HTTP requests to access data and metadata through the Datasets REST API (docs here)

Datasets features

No-code loaders

Add AudioFolder packaged loader by @polinaeterna in https://github.com/huggingface/datasets/pull/4530

Add support for CSV metadata files to ImageFolder by @mariosasko in https://github.com/huggingface/datasets/pull/4837

Add support for parsing JSON files in array form by @mariosasko in https://github.com/huggingface/datasets/pull/4997
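The difference between JSON in array form and JSON Lines can be shown with the standard library. The helper `read_json_records` below is a hypothetical illustration of accepting both layouts; it is not how the packaged JSON loader is implemented.

```python
import json


def read_json_records(text: str) -> list:
    """Parse either a JSON array of objects or JSON Lines into a list of dicts."""
    stripped = text.lstrip()
    if stripped.startswith("["):
        # Array form: the whole file is a single JSON document.
        return json.loads(stripped)
    # JSON Lines: one JSON object per non-empty line.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


array_form = '[{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]'
lines_form = '{"id": 1, "text": "a"}\n{"id": 2, "text": "b"}'
print(read_json_records(array_form) == read_json_records(lines_form))
```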

Dataset methods

add Dataset.from_list by @sanderland in https://github.com/huggingface/datasets/pull/4890

Add Dataset.from_generator by @mariosasko in https://github.com/huggingface/datasets/pull/4957

Add oversampling strategies to interleave datasets by @ylacombe in https://github.com/huggingface/datasets/pull/4831

Preserve non-input_columns in Dataset.map if input_columns are specified by @mariosasko in https://github.com/huggingface/datasets/pull/4971

Add fn_kwargs param to IterableDataset.map by @mariosasko in https://github.com/huggingface/datasets/pull/4975

More rigorous shape inference in to_tf_dataset by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4763
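The oversampling strategies added to interleaving (listed above) can be pictured with a small round-robin sketch over plain lists. The strategy names follow the library's stopping_strategy values, but the function itself is a simplified stand-in that ignores sampling probabilities and is not the library's implementation.

```python
def interleave(sources, stopping_strategy="first_exhausted"):
    """Round-robin over lists, optionally oversampling the shorter ones."""
    lengths = [len(source) for source in sources]
    if stopping_strategy == "first_exhausted":
        rounds = min(lengths)  # stop once the shortest source runs out
    elif stopping_strategy == "all_exhausted":
        rounds = max(lengths)  # cycle shorter sources until the longest is done
    else:
        raise ValueError(f"unknown stopping strategy: {stopping_strategy!r}")
    return [
        source[round_index % len(source)]
        for round_index in range(rounds)
        for source in sources
    ]


d1, d2, d3 = [0, 1, 2], [10, 11, 12, 13], [20, 21, 22, 23, 24]
print(interleave([d1, d2, d3]))
print(interleave([d1, d2, d3], stopping_strategy="all_exhausted"))
```

With "all_exhausted", every example of every source appears at least once, at the cost of repeating examples from the shorter sources.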

Parquet support

Download and prepare as Parquet for cloud storage by @lhoestq in https://github.com/huggingface/datasets/pull/4724

Shard parquet in download_and_prepare by @lhoestq in https://github.com/huggingface/datasets/pull/4747

Embed image/audio data in dl_and_prepare parquet by @lhoestq in https://github.com/huggingface/datasets/pull/4987

Datasets changes

Update: natural questions - Add long answer candidates by @seirasto in https://github.com/huggingface/datasets/pull/4368

Update: opus_paracrawl - update version by @albertvillanova in https://github.com/huggingface/datasets/pull/4816

Update: ReCoRD - Include entity positions as feature by @richarddwang in https://github.com/huggingface/datasets/pull/4479

Update: swda - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4914

Update: Enwik8 - update broken link and information by @mtanghu in https://github.com/huggingface/datasets/pull/4

Update: compguesswhat - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4968

Update: nli_tr - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4970

Update: IndicGLUE - update download links by @sumanthd17 in https://github.com/huggingface/datasets/pull/4978

Update: iwslt2017 - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4992

Fix: mbpp - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/4788

Fix: mkqa - Update data URL by @albertvillanova in https://github.com/huggingface/datasets/pull/4823

Fix: exams - fix bug and checksums by @albertvillanova in https://github.com/huggingface/datasets/pull/4853

Fix: trec - use fine classes by @albertvillanova in https://github.com/huggingface/datasets/pull/4801

Fix: wmt datasets - fix CWMT zh subsets by @lhoestq in https://github.com/huggingface/datasets/pull/4871

Fix: LibriSpeech - Fix dev split local_extracted_archive for 'all' config by @sanchit-gandhi in https://github.com/huggingface/datasets/pull/4904

Fix: compguesswhat - fix data URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/4959

Fix: vivos - fix data URL and metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/4969

Fix: MBPP - Add splits by @cwarny in https://github.com/huggingface/datasets/pull/4943

Dataset cards

Add language_bcp47 tag by @lhoestq in https://github.com/huggingface/datasets/pull/4753

Added more information in the README about contributors of the Arabic Speech Corpus by @nawarhalabi in https://github.com/huggingface/datasets/pull/4701

Remove "unkown" language tags by @lhoestq in https://github.com/huggingface/datasets/pull/4754

Highlight non-commercial license in amazon_reviews_multi dataset card by @sbroadhurst-hf in https://github.com/huggingface/datasets/pull/4712

Added dataset information in clinic oos dataset card by @Arnav-Ladkat in https://github.com/huggingface/datasets/pull/4751

Fix opus_gnome dataset card by @gojiteji in https://github.com/huggingface/datasets/pull/4806

Complete the mlqa dataset card by @eldhoittangeorge in https://github.com/huggingface/datasets/pull/4809

Fix loading example in opus dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4813

Add missing language tags to resources by @albertvillanova in https://github.com/huggingface/datasets/pull/4819

Fix titles in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4824

Fix language tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4826

Add license metadata to pg19 by @julien-c in https://github.com/huggingface/datasets/pull/4827

Fix task tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4830

Fix tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4832

Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4833

Fix documentation card of recipe_nlg dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4834

Fix documentation card of ethos dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4835

Update documentation card of miam dataset by @PierreColombo in https://github.com/huggingface/datasets/pull/4846

Update stackexchange license by @cakiki in https://github.com/huggingface/datasets/pull/4842

Update ted_talks_iwslt license to include ND by @cakiki in https://github.com/huggingface/datasets/pull/4841

Fix documentation card of adv_glue dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4838

Complete tags of superglue dataset card by @richarddwang in https://github.com/huggingface/datasets/pull/4867

Fix license tag and Source Data section in billsum dataset card by @kashif in https://github.com/huggingface/datasets/pull/4851

Fix documentation card of covid_qa_castorini dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4877

Fix Citation Information section in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4879

Fix documentation card of math_qa dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4884

Added names of less-studied languages by @BenjaminGalliot in https://github.com/huggingface/datasets/pull/4880

Fix language tags resource file by @albertvillanova in https://github.com/huggingface/datasets/pull/4882

Add citation to ro_sts and ro_sts_parallel datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/4892

Add citation information to makhzan dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4894

Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4891

Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4896

Re-add code and und language tags by @albertvillanova in https://github.com/huggingface/datasets/pull/4899

Add "cc-by-nc-sa-2.0" to list of licenses by @osanseviero in https://github.com/huggingface/datasets/pull/4887

Update GLUE evaluation metadata by @lewtun in https://github.com/huggingface/datasets/pull/4909

Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4908

Add license and citation information to cosmos_qa dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4913

Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4921

Add cc-by-nc-2.0 to list of licenses by @albertvillanova in https://github.com/huggingface/datasets/pull/4930

Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4931

Add Papers with Code ID to scifact dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4941

Fix license information in qasc dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/4951

Fix multilinguality tag and missing sections in xquad_r dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/4940

Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4979

Fix missing tags in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4991

Documentation

Update map docs by @stevhliu in https://github.com/huggingface/datasets/pull/4743

Add image classification processing guide by @stevhliu in https://github.com/huggingface/datasets/pull/4748

Fix train_test_split docs by @NielsRogge in https://github.com/huggingface/datasets/pull/4821

Update local loading script docs by @stevhliu in https://github.com/huggingface/datasets/pull/4778

Docs for creating a loading script for image datasets by @stevhliu in https://github.com/huggingface/datasets/pull/4783

Docs for creating an audio dataset by @stevhliu in https://github.com/huggingface/datasets/pull/4872

General improvements and bug fixes

Use CI unit/integration tests by @albertvillanova in https://github.com/huggingface/datasets/pull/4738

Fix multiprocessing in map_nested by @albertvillanova in https://github.com/huggingface/datasets/pull/4740

Add 2.4.0 version added to docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/4767

Update CI badge by @mariosasko in https://github.com/huggingface/datasets/pull/4764

Fix version in map_nested docstring by @albertvillanova in https://github.com/huggingface/datasets/pull/4765

fix typo by @xwwwwww in https://github.com/huggingface/datasets/pull/4770

Unpin rouge_score test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/4768

Remove apache_beam import from module level in natural_questions dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4780

Require torchaudio<0.12.0 to avoid RuntimeError by @albertvillanova in https://github.com/huggingface/datasets/pull/4777

Remove dummy data generation docs by @stevhliu in https://github.com/huggingface/datasets/pull/4771

Require torchaudio<0.12.0 in docs by @albertvillanova in https://github.com/huggingface/datasets/pull/4785

Fix bug in function validate_type for Python >= 3.9 by @albertvillanova in https://github.com/huggingface/datasets/pull/4812

Fix typo in streaming docs by @flozi00 in https://github.com/huggingface/datasets/pull/4843

Fix test of _get_extraction_protocol for TAR files by @albertvillanova in https://github.com/huggingface/datasets/pull/4850

Fix typos in documentation by @fl-lo in https://github.com/huggingface/datasets/pull/

Mark CI tests as xfail if Hub HTTP error by @albertvillanova in https://github.com/huggingface/datasets/pull/4845

[Windows] Fix Access Denied when using os.rename() by @DougTrajano in https://github.com/huggingface/datasets/pull/4825

[docs] Some tiny doc tweaks by @julien-c in https://github.com/huggingface/datasets/pull/4874

Document loading from relative path by @stevhliu in https://github.com/huggingface/datasets/pull/4773

Fix CI reporting by @albertvillanova in https://github.com/huggingface/datasets/pull/4903

Add 'val' to VALIDATION_KEYWORDS. by @akt42 in https://github.com/huggingface/datasets/pull/4844

Raise ManualDownloadError from get_dataset_config_info by @albertvillanova in https://github.com/huggingface/datasets/pull/4901

feat: improve error message on Keys mismatch. closes #4917 by @PaulLerner in https://github.com/huggingface/datasets/pull/4919

Fixes a typo in loading documentation by @sighingnow in https://github.com/huggingface/datasets/pull/4929

Remove main branch rename notice by @lhoestq in https://github.com/huggingface/datasets/pull/4938

Fix NonMatchingChecksumError in adv_glue dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4939

Remove deprecated identical_ok by @lhoestq in https://github.com/huggingface/datasets/pull/4937

Pin TensorFlow temporarily by @albertvillanova in https://github.com/huggingface/datasets/pull/4954

Fix minor typo in error message for missing imports by @mariosasko in https://github.com/huggingface/datasets/pull/4948

Fix TF tests for 2.10 by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4956

fix BLEU metric card by @antoniolanza1996 in https://github.com/huggingface/datasets/pull/4927

Update doc upload_dataset.mdx by @mishig25 in https://github.com/huggingface/datasets/pull/4789

Improve features resolution in streaming by @lhoestq in https://github.com/huggingface/datasets/pull/4762

Fix label renaming and add a battery of tests by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4781

Strip "/" in local dataset path to avoid empty dataset name error by @apohllo in https://github.com/huggingface/datasets/pull/4967

Introduce regex check when pushing as well by @LysandreJik in https://github.com/huggingface/datasets/pull/4946

[doc] Fix broken snippet that had too many quotes by @tomaarsen in https://github.com/huggingface/datasets/pull/4986

Fix map batched with torch output by @lhoestq in https://github.com/huggingface/datasets/pull/4972

fix: avoid casting tuples after Dataset.map by @szmoro in https://github.com/huggingface/datasets/pull/4993

decode mp3 with librosa if torchaudio is > 0.12 as a temporary workaround by @polinaeterna in https://github.com/huggingface/datasets/pull/4923

Don't add a tag on the Hub on release by @lhoestq in https://github.com/huggingface/datasets/pull/4998

Add EmptyDatasetError by @lhoestq in https://github.com/huggingface/datasets/pull/4999
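The "Strip "/" in local dataset path" entry above addresses a trailing slash producing an empty dataset name. The snippet below is only a standard-library sketch of that failure mode and one way around it; the helper name dataset_name_from_path is hypothetical and is not the library's actual code.

```python
import posixpath


def dataset_name_from_path(path: str) -> str:
    """Hypothetical helper: derive a dataset name from a local directory path.

    Trailing slashes are stripped first, because the basename of a path
    that ends in "/" is the empty string.
    """
    return posixpath.basename(path.rstrip("/"))


# Without stripping, the last path component is empty.
naive_name = posixpath.basename("data/my_dataset/")

# With stripping, the directory name is recovered.
fixed_name = dataset_name_from_path("data/my_dataset/")

print(repr(naive_name), repr(fixed_name))
```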

New Contributors

@seirasto made their first contribution in https://github.com/huggingface/datasets/pull/4368

@sbroadhurst-hf made their first contribution in https://github.com/huggingface/datasets/pull/4712

@nawarhalabi made their first contribution in https://github.com/huggingface/datasets/pull/4701

@Arnav-Ladkat made their first contribution in https://github.com/huggingface/datasets/pull/4751

@xwwwwww made their first contribution in https://github.com/huggingface/datasets/pull/4770

@gojiteji made their first contribution in https://github.com/huggingface/datasets/pull/4806

@eldhoittangeorge made their first contribution in https://github.com/huggingface/datasets/pull/4809

@flozi00 made their first contribution in https://github.com/huggingface/datasets/pull/4843

@fl-lo made their first contribution in https://github.com/huggingface/datasets/pull/4869

@BenjaminGalliot made their first contribution in https://github.com/huggingface/datasets/pull/4880

@DougTrajano made their first contribution in https://github.com/huggingface/datasets/pull/4825

@ylacombe made their first contribution in https://github.com/huggingface/datasets/pull/4831

@osanseviero made their first contribution in https://github.com/huggingface/datasets/pull/4887

@akt42 made their first contribution in https://github.com/huggingface/datasets/pull/4844

@sanderland made their first contribution in https://github.com/huggingface/datasets/pull/4890

@sighingnow made their first contribution in https://github.com/huggingface/datasets/pull/4929

@mtanghu made their first contribution in https://github.com/huggingface/datasets/pull/4950

@antoniolanza1996 made their first contribution in https://github.com/huggingface/datasets/pull/4927

@apohllo made their first contribution in https://github.com/huggingface/datasets/pull/4967

@cwarny made their first contribution in https://github.com/huggingface/datasets/pull/4943

@tomaarsen made their first contribution in https://github.com/huggingface/datasets/pull/4986

@szmoro made their first contribution in https://github.com/huggingface/datasets/pull/4993

Full Changelog: https://github.com/huggingface/datasets/compare/2.4.0...2.5.0
2.4.0(Jul 25, 2022)
Dataset Features

Add concatenate_datasets for iterable datasets by @lhoestq in https://github.com/huggingface/datasets/pull/4500

Support parallelism with PyTorch DataLoader with parquet/json/csv/text/image/etc. files by @mariosasko in https://github.com/huggingface/datasets/pull/4625

Support using PCM audio files (#4323) by @YooSungHyun in https://github.com/huggingface/datasets/pull/4409

[data_files] Files disambiguation: match split names in data files if they are between separators by @lhoestq in https://github.com/huggingface/datasets/pull/4633

Support extract 7-zip compressed data files by @albertvillanova in https://github.com/huggingface/datasets/pull/4672

Support extract lz4 compressed data files by @albertvillanova in https://github.com/huggingface/datasets/pull/4700

Support metadata.jsonl from parent directories in imagefolder by @mariosasko in https://github.com/huggingface/datasets/pull/4576
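To illustrate the files disambiguation entry above (split names are matched only when they sit between separators), here is a rough standard-library sketch. The keyword tuple, the separator rule (any non-alphanumeric character or the start/end of the name) and the function infer_split are assumptions made for illustration; the library's real patterns may differ.

```python
import re
from typing import Optional

# Assumed, simplified keyword list; the library recognises more variants.
SPLIT_KEYWORDS = ("train", "validation", "test")


def infer_split(filename: str) -> Optional[str]:
    """Return the split whose name appears in `filename` between separators.

    A keyword counts only if it is not glued to other letters or digits,
    so "testimony.csv" is not mistaken for a test file.
    """
    for keyword in SPLIT_KEYWORDS:
        pattern = rf"(?<![A-Za-z0-9]){keyword}(?![A-Za-z0-9])"
        if re.search(pattern, filename):
            return keyword
    return None


for name in ["train.csv", "data-test-00001.parquet", "testimony.csv"]:
    print(name, "->", infer_split(name))
```

With this rule, train.csv and data-test-00001.parquet are assigned to a split, while testimony.csv is left unassigned.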

Dataset changes

Update: allocine - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4563

Update: multi_news - Host data on the Hub instead of Google Drive by @albertvillanova in https://github.com/huggingface/datasets/pull/4585

Update: pn_summary - Host data on the Hub instead of Google Drive by @albertvillanova in https://github.com/huggingface/datasets/pull/4586

Update: financial_phrasebank - Host data on the Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4598

Update: cfq - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4579

Update: head_qa - Host data on the Hub and fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/4588

Update: bookcorpus - Support streaming dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4564

Update: fever - Refactor and add metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/4503

Update: mlsum - Support streaming dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4574

Fix: cats_vs_dogs - Update download url and improve card by @mariosasko in https://github.com/huggingface/datasets/pull/4523

Fix: conll2003 - fix empty example by @lhoestq in https://github.com/huggingface/datasets/pull/4662

Fix: WMT datasets - fix loading issue when choosing specific subsets and docs update by @khushmeeet in https://github.com/huggingface/datasets/pull/4554

Fix: xtreme - fix empty examples in dataset for bucc18 config by @lhoestq in https://github.com/huggingface/datasets/pull/4706

Fix: crd3 - fix splits that were containing the same data by @lhoestq in https://github.com/huggingface/datasets/pull/4705

Dataset Cards

Add action names in schema_guided_dstc8 dataset card by @lhoestq in https://github.com/huggingface/datasets/pull/4559

Add evaluation data to acronym_identification by @lewtun in https://github.com/huggingface/datasets/pull/4561

Update WinoBias README by @sashavor in https://github.com/huggingface/datasets/pull/4631

Support "tags" yaml tag by @lhoestq in https://github.com/huggingface/datasets/pull/4716

Fix POS tags by @lhoestq in https://github.com/huggingface/datasets/pull/4715

AESLC dataset: Add summarization tags by @hobson in https://github.com/huggingface/datasets/pull/4517

Documentation

Update docs around audio and vision by @stevhliu in https://github.com/huggingface/datasets/pull/4440

Update Google Cloud Storage documentation and add Azure Blob Storage example by @alvarobartt in https://github.com/huggingface/datasets/pull/4513

Remove multiple config section by @stevhliu in https://github.com/huggingface/datasets/pull/4600

Create new sections for audio and vision in guides by @stevhliu in https://github.com/huggingface/datasets/pull/4519

Document installation of sox OS dependency for audio by @albertvillanova in https://github.com/huggingface/datasets/pull/4713

General improvements and bug fixes

Add regression test for ArrowWriter.write_batch when batch is empty by @alvarobartt in https://github.com/huggingface/datasets/pull/4510

Support all negative values in ClassLabel by @lhoestq in https://github.com/huggingface/datasets/pull/4511

Add uppercased versions of image file extensions for automatic module inference by @mariosasko in https://github.com/huggingface/datasets/pull/4515

Patch tests for hfh v0.8.0 by @LysandreJik in https://github.com/huggingface/datasets/pull/4518

Replace deprecated logging.warn with logging.warning by @hugovk in https://github.com/huggingface/datasets/pull/4539

[CI] Fix upstream hub test url by @lhoestq in https://github.com/huggingface/datasets/pull/4543

Fix timestamp conversion from Pandas to Python datetime in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/4541

[CI] fixing seqeval install in ci by pinning setuptools-scm by @lhoestq in https://github.com/huggingface/datasets/pull/4546

Tell users to upload on the hub directly by @lhoestq in https://github.com/huggingface/datasets/pull/4552

Add batch_size parameter when calling add_faiss_index and add_faiss_index_from_external_arrays by @alvarobartt in https://github.com/huggingface/datasets/pull/4535

Make DuplicateKeysError more user friendly [For Issue #2556] by @VijayKalmath in https://github.com/huggingface/datasets/pull/4545

Properly raise FileNotFound even if the dataset is private by @lhoestq in https://github.com/huggingface/datasets/pull/4536

Fix hashing for python 3.9 by @lhoestq in https://github.com/huggingface/datasets/pull/4516

[CI] Fix some warnings by @lhoestq in https://github.com/huggingface/datasets/pull/4547

Validate new_fingerprint passed by user by @lhoestq in https://github.com/huggingface/datasets/pull/4587

Update CI Windows orb by @albertvillanova in https://github.com/huggingface/datasets/pull/4604

Perform hidden file check on relative data file path by @mariosasko in https://github.com/huggingface/datasets/pull/4551

Align more metadata with other repo types (models,spaces) by @julien-c in https://github.com/huggingface/datasets/pull/4607

Align/fix license metadata info by @julien-c in https://github.com/huggingface/datasets/pull/4613

Preserve member order by MockDownloadManager.iter_archive by @albertvillanova in https://github.com/huggingface/datasets/pull/4611

Add authentication tip to load_dataset by @mariosasko in https://github.com/huggingface/datasets/pull/4577

Stop dropping columns in to_tf_dataset() before we load batches by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4553

fix(dataset_wrappers): Fixes access to fsspec.asyn in torch_iterable_dataset.py. by @gugarosa in https://github.com/huggingface/datasets/pull/4630

Fix xisfile, xgetsize, xisdir, xlistdir in private repo by @lhoestq in https://github.com/huggingface/datasets/pull/4608

Rename master to main by @lhoestq in https://github.com/huggingface/datasets/pull/4643

Set HF_SCRIPTS_VERSION to main by @lhoestq in https://github.com/huggingface/datasets/pull/4645

[Minor fix] Typo correction by @cakiki in https://github.com/huggingface/datasets/pull/4644

fixed duplicate calculation of spearmanr function in metrics wrapper. by @benlipkin in https://github.com/huggingface/datasets/pull/4627

Generalize meta_path json file creation in load.py [#4540] by @VijayKalmath in https://github.com/huggingface/datasets/pull/4590

Fix time type _arrow_to_datasets_dtype conversion by @mariosasko in https://github.com/huggingface/datasets/pull/4628

Fix _resolve_single_pattern_locally on Windows with multiple drives by @albertvillanova in https://github.com/huggingface/datasets/pull/4660

Replace assertEqual with assertTupleEqual in unit tests for verbosity by @alvarobartt in https://github.com/huggingface/datasets/pull/4496

Fix embed_storage on features inside lists/sequences by @mariosasko in https://github.com/huggingface/datasets/pull/4615

Add links to vision tasks scripts in ADD_NEW_DATASET template by @mariosasko in https://github.com/huggingface/datasets/pull/4512

Transfer CI to GitHub Actions by @albertvillanova in https://github.com/huggingface/datasets/pull/4659

Fix mock fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/4685

Trigger CI also on push to main by @albertvillanova in https://github.com/huggingface/datasets/pull/4687

Fix ImageFolder with parameters drop_metadata=True and drop_labels=False (when metadata.jsonl is present) by @polinaeterna in https://github.com/huggingface/datasets/pull/4622

Skip test_extractor only for zstd param if zstandard not installed by @albertvillanova in https://github.com/huggingface/datasets/pull/4688

Test extractors for all compression formats by @albertvillanova in https://github.com/huggingface/datasets/pull/4689

Refactor base extractors by @albertvillanova in https://github.com/huggingface/datasets/pull/4690

Update create dataset card docs by @stevhliu in https://github.com/huggingface/datasets/pull/4683

Add text decorators by @stevhliu in https://github.com/huggingface/datasets/pull/4663

Skip tests only for lz4/zstd params if not installed by @albertvillanova in https://github.com/huggingface/datasets/pull/4704

Ensure ConcatenationTable.cast uses target_schema metadata by @dtuit in https://github.com/huggingface/datasets/pull/4614

Docs: Fix same-page hashlinks by @mishig25 in https://github.com/huggingface/datasets/pull/4722

Fix broken link to the Hub by @stevhliu in https://github.com/huggingface/datasets/pull/4726

Refactor conftest fixtures by @albertvillanova in https://github.com/huggingface/datasets/pull/4723

Add object detection processing tutorial by @nateraw in https://github.com/huggingface/datasets/pull/4710

Fix require torchaudio and refactor test requirements by @albertvillanova in https://github.com/huggingface/datasets/pull/4708

docs: ✏️ fix TranslationVariableLanguages example by @severo in https://github.com/huggingface/datasets/pull/4731

Pin rouge_score test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/4735

Fix named split sorting and remove unnecessary casting by @albertvillanova in https://github.com/huggingface/datasets/pull/4714

Make cast in from_pandas more robust by @mariosasko in https://github.com/huggingface/datasets/pull/4703

Make Extractor accept Path as input by @albertvillanova in https://github.com/huggingface/datasets/pull/4718

Refactor Hub tests by @albertvillanova in https://github.com/huggingface/datasets/pull/4729

Fix to dict conversion of DatasetInfo/Features by @mariosasko in https://github.com/huggingface/datasets/pull/4741

New Contributors

@hugovk made their first contribution in https://github.com/huggingface/datasets/pull/4539

@VijayKalmath made their first contribution in https://github.com/huggingface/datasets/pull/4545

@gugarosa made their first contribution in https://github.com/huggingface/datasets/pull/4630

@benlipkin made their first contribution in https://github.com/huggingface/datasets/pull/4627

@YooSungHyun made their first contribution in https://github.com/huggingface/datasets/pull/4409

@hobson made their first contribution in https://github.com/huggingface/datasets/pull/4517

@khushmeeet made their first contribution in https://github.com/huggingface/datasets/pull/4554

@dtuit made their first contribution in https://github.com/huggingface/datasets/pull/4614

Full Changelog: https://github.com/huggingface/datasets/compare/2.3.2...2.4.0
2.3.2(Jun 15, 2022)
Bug fixes

Fix double dots in data files by @lhoestq in https://github.com/huggingface/datasets/pull/4505

fix a bug when /../ is passed to data_files causing FileNotFoundError

fix ETT m1/m2 test/val dataset by @kashif in https://github.com/huggingface/datasets/pull/4499

Corrected broken links in doc by @clefourrier in https://github.com/huggingface/datasets/pull/4501

New Contributors

@clefourrier made their first contribution in https://github.com/huggingface/datasets/pull/4501

Full Changelog: https://github.com/huggingface/datasets/compare/2.3.1...2.3.2
2.3.1(Jun 15, 2022)
Bug fixes

Fix patching module that doesn't exist by @lhoestq in https://github.com/huggingface/datasets/pull/4495

fix bug when importing the lib when scipy is not installed

Re-add download_manager module in utils by @lhoestq in https://github.com/huggingface/datasets/pull/4497

fix moved imports of DownloadConfig, DownloadMode, DownloadManager

Support streaming UDHR dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4487

Full Changelog: https://github.com/huggingface/datasets/compare/2.3.0...2.3.1
2.3.0(Jun 14, 2022)
Datasets Changes

New: ImageNet-Sketch by @nateraw in https://github.com/huggingface/datasets/pull/4301

New: Biwi Kinect Head Pose by @dnaveenr in https://github.com/huggingface/datasets/pull/3903

New: enwik8 by @HallerPatrick in https://github.com/huggingface/datasets/pull/4321

New: LCCC dataset by @silverriver in https://github.com/huggingface/datasets/pull/4416

New: TruthfulQA by @jon-tow in https://github.com/huggingface/datasets/pull/4159

New: BIG-bench by @andersjohanandreassen in https://github.com/huggingface/datasets/pull/4125

New: QuickDraw by @mariosasko in https://github.com/huggingface/datasets/pull/3592

New: SST-2 by @albertvillanova in https://github.com/huggingface/datasets/pull/4473

Update: imagenet-1k - remove manual download by @mariosasko in https://github.com/huggingface/datasets/pull/4299

ImageNet can now be loaded in Python with load_dataset, without requiring a manual download!

It also supports streaming mode with load_dataset("imagenet-1k", streaming=True)

Update: spider - Remove Google Drive URL by @albertvillanova in https://github.com/huggingface/datasets/pull/4410

Update: blended_skill_talk - add missing columns by @mariosasko in https://github.com/huggingface/datasets/pull/4437

Update: multi-news - Use newer version with fixes by @JohnGiorgi in https://github.com/huggingface/datasets/pull/4451

Update: fever - update data URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/4455

Update: udhr - Add and fix language tags by @albertvillanova in https://github.com/huggingface/datasets/pull/

Update: udhr - update metadata by @leondz in https://github.com/huggingface/datasets/pull/4362

Update: wider_face - Replace data URLs once hosted on the Hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4469

Update: PASS - update dataset version by @mariosasko in https://github.com/huggingface/datasets/pull/4488

Fix: GEM - fix bug in wiki_auto_asset_turk config by @albertvillanova in https://github.com/huggingface/datasets/pull/4389

Fix: GEM - fix URL for totto config by @albertvillanova in https://github.com/huggingface/datasets/pull/4396

Fix: timit_asr - fix DuplicatedKeysError by @albertvillanova in https://github.com/huggingface/datasets/pull/4424

Fix: timit_asr - Make extensions case-insensitive by @albertvillanova in https://github.com/huggingface/datasets/pull/4425

Fix: timit_asr - Fix directory names for LDC data by @albertvillanova in https://github.com/huggingface/datasets/pull/4436

Fix: iwslt2017 by @lhoestq in https://github.com/huggingface/datasets/pull/4481

Dataset Features

to_tf_dataset rewrite by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4170

see more in the documentation

Support DataLoader with num_workers > 0 in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/4375

see more in the documentation

Added stratify option to train_test_split by @nandwalritik in https://github.com/huggingface/datasets/pull/4322

Re-add support for Apache Beam functionality by @albertvillanova in https://github.com/huggingface/datasets/pull/4328

Resume push_to_hub: skip identical files in push_to_hub instead of overwriting by @mariosasko in https://github.com/huggingface/datasets/pull/4402

Support nested/complex feature types as features in packaged loaders by @mariosasko in https://github.com/huggingface/datasets/pull/4364

Optimize contiguous shard and select by @lhoestq in https://github.com/huggingface/datasets/pull/4466

Dataset Cards

Minor fixes/improvements in scene_parse_150 card by @mariosasko in https://github.com/huggingface/datasets/pull/4447

Tidy up license metadata for google_wellformed_query, newspop, sick by @leondz in https://github.com/huggingface/datasets/pull/4378

Fix example in opus_ubuntu, Add license info by @leondz in https://github.com/huggingface/datasets/pull/4360

Update README.md of fquad by @lhoestq in https://github.com/huggingface/datasets/pull/4450

Documentation

Add API code examples for loading methods by @stevhliu in https://github.com/huggingface/datasets/pull/4300

Add API code examples for remaining main classes by @stevhliu in https://github.com/huggingface/datasets/pull/4292

Generalize tutorials for audio and vision by @stevhliu in https://github.com/huggingface/datasets/pull/4468

[Docs] How to use with PyTorch page by @lhoestq in https://github.com/huggingface/datasets/pull/4474

First draft of the docs for TF + Datasets by @Rocketknight1 in https://github.com/huggingface/datasets/pull/4457

Other improvements and bug fixes

Update CI deprecated legacy image by @albertvillanova in https://github.com/huggingface/datasets/pull/4393

remove int documentation from logging docs by @lvwerra in https://github.com/huggingface/datasets/pull/4392

Fix docstring in DatasetDict::shuffle by @felixdivo in https://github.com/huggingface/datasets/pull/4344

Fix Version equality by @albertvillanova in https://github.com/huggingface/datasets/pull/4359

Set builder name from module instead of class by @albertvillanova in https://github.com/huggingface/datasets/pull/4388

Test dill by @albertvillanova in https://github.com/huggingface/datasets/pull/4385

Refactor download by @albertvillanova in https://github.com/huggingface/datasets/pull/4384

Fix dependency on dill version by @albertvillanova in https://github.com/huggingface/datasets/pull/4397

Support remote cache_dir by @albertvillanova in https://github.com/huggingface/datasets/pull/4347

Update imagenet gate by @lhoestq in https://github.com/huggingface/datasets/pull/4408

Fix dataset builder default version by @albertvillanova in https://github.com/huggingface/datasets/pull/4356

Uncomment logging deactivation for ArrowBasedBuilder by @thomasw21 in https://github.com/huggingface/datasets/pull/4403

Rename DatasetBuilder config_name by @albertvillanova in https://github.com/huggingface/datasets/pull/4414

Fix metadata validation by @albertvillanova in https://github.com/huggingface/datasets/pull/4390

Add HF.co for PRs/Issues for specific datasets by @lhoestq in https://github.com/huggingface/datasets/pull/4427

Fix type hint and documentation for new_fingerprint by @fxmarty in https://github.com/huggingface/datasets/pull/4326

Skip hidden files/directories in data files resolution and iter_files by @mariosasko in https://github.com/huggingface/datasets/pull/4412

Fix docstring of inspect_dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4438

Fix builder docstring by @albertvillanova in https://github.com/huggingface/datasets/pull/4432

Fix kwargs in docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/4444

Fix missing args in docstring of load_dataset_builder by @albertvillanova in https://github.com/huggingface/datasets/pull/4445

Add missing kwargs to docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/4446

Add extractor for bzip2-compressed files by @asivokon in https://github.com/huggingface/datasets/pull/4421

Fix dummy dataset generation script for handling nested types of _URLs by @silverriver in https://github.com/huggingface/datasets/pull/4434

Update dataset_infos.json with new split info in Dataset.push_to_hub to avoid verification error by @mariosasko in https://github.com/huggingface/datasets/pull/4415

Update builder docstring for deprecated/added arguments by @albertvillanova in https://github.com/huggingface/datasets/pull/4429

Extend support for streaming datasets that use xml.dom.minidom.parse by @albertvillanova in https://github.com/huggingface/datasets/pull/4464

Fix script fetching and local path handling in inspect_dataset and inspect_metric by @mariosasko in https://github.com/huggingface/datasets/pull/4433

Fix bigbench config names by @lhoestq in https://github.com/huggingface/datasets/pull/4465

Fix 401 error for unauthenticated requests to non-existing repos by @lhoestq in https://github.com/huggingface/datasets/pull/4472

Reorder returned validation/test splits in script template by @albertvillanova in https://github.com/huggingface/datasets/pull/4470

Better ImportError message when a dataset script dependency is missing by @lhoestq in https://github.com/huggingface/datasets/pull/4484

Fix cast to null by @lhoestq in https://github.com/huggingface/datasets/pull/4485

Update _format_columns in remove_columns by @alvarobartt in https://github.com/huggingface/datasets/pull/4411

Fix wrong map parameter name in cache docs by @h4iku in https://github.com/huggingface/datasets/pull/4293

Pin the revision in imagenet download links by @lhoestq in https://github.com/huggingface/datasets/pull/4492

Refactor column mappings for question answering datasets by @lewtun in https://github.com/huggingface/datasets/pull/4391

New Contributors

@leondz made their first contribution in https://github.com/huggingface/datasets/pull/4378

@felixdivo made their first contribution in https://github.com/huggingface/datasets/pull/4344

@nandwalritik made their first contribution in https://github.com/huggingface/datasets/pull/4322

@fxmarty made their first contribution in https://github.com/huggingface/datasets/pull/4326

@HallerPatrick made their first contribution in https://github.com/huggingface/datasets/pull/4321

@silverriver made their first contribution in https://github.com/huggingface/datasets/pull/4416

@asivokon made their first contribution in https://github.com/huggingface/datasets/pull/4421

@andersjohanandreassen made their first contribution in https://github.com/huggingface/datasets/pull/4125

Full Changelog: https://github.com/huggingface/datasets/compare/2.2.2...lol
2.2.2(May 20, 2022)
Datasets fixes

Fix: irc_disentangle - fix checksum and bug dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4377

Fix: CC-Aligned - fix invalid url by @juntang-zhuang in https://github.com/huggingface/datasets/pull/4231

Fix: multi_news - don't strip proceeding hyphen by @JohnGiorgi in https://github.com/huggingface/datasets/pull/4353

Bug fixes

Support lists of multi-dimensional numpy arrays by @albertvillanova in https://github.com/huggingface/datasets/pull/4194

Check if dataset features match before push in DatasetDict.push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/4372

Pin dill by @albertvillanova in https://github.com/huggingface/datasets/pull/4380

dill 0.3.5 has some issues in transformers - pinning the version to <0.3.5 for now

Dataset Cards

Adding eval metadata for ade v2 by @sashavor in https://github.com/huggingface/datasets/pull/4319

Adding eval metadata for AG News by @sashavor in https://github.com/huggingface/datasets/pull/4329

Adding eval metadata to Allociné dataset by @sashavor in https://github.com/huggingface/datasets/pull/4330

Adding eval metadata to Amazon Polarity by @sashavor in https://github.com/huggingface/datasets/pull/4331

Adding eval metadata for arabic speech corpus by @sashavor in https://github.com/huggingface/datasets/pull/4332

Adding eval metadata for Banking 77 by @sashavor in https://github.com/huggingface/datasets/pull/4333

Eval metadata Batch 4: Tweet Eval, Tweets Hate Speech Detection, VCTK, Weibo NER, Wisesight Sentiment, XSum, Yahoo Answers Topics, Yelp Polarity, Yelp Review Full by @sashavor in https://github.com/huggingface/datasets/pull/4338

Eval metadata batch 3: Reddit, Rotten Tomatoes, SemEval 2010, Sentiment 140, SMS Spam, Snips, SQuAD, SQuAD v2, Timit ASR by @sashavor in https://github.com/huggingface/datasets/pull/4337

Eval metadata batch 1: BillSum, CoNLL2003, CoNLLPP, CUAD, Emotion, GigaWord, GLUE, Hate Speech 18, Hate Speech by @sashavor in https://github.com/huggingface/datasets/pull/4335

Eval metadata batch 2 : Health Fact, Jigsaw Toxicity, LIAR, LJ Speech, MSRA NER, Multi News, NCBI Disease, Poem Sentiment by @sashavor in https://github.com/huggingface/datasets/pull/4336

Docs

Add API code examples for Builder classes by @stevhliu in https://github.com/huggingface/datasets/pull/4313

Add redirect to dataset script in the repo structure page by @lhoestq in https://github.com/huggingface/datasets/pull/4369

Other improvements and bug fixes

Fix failing CI on Windows for sari and wiki_split metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4342

Fix never ending GH Action to build documentation by @albertvillanova in https://github.com/huggingface/datasets/pull/4345

Fix warning in upload_file by @albertvillanova in https://github.com/huggingface/datasets/pull/4355

Fix warning in push_to_hub by @albertvillanova in https://github.com/huggingface/datasets/pull/4357

Remove config names as yaml keys by @lhoestq in https://github.com/huggingface/datasets/pull/4367

Add missing language tags for udhr dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4371

Remove links in docs to old dataset viewer by @mariosasko in https://github.com/huggingface/datasets/pull/4373

New Contributors

@JohnGiorgi made their first contribution in https://github.com/huggingface/datasets/pull/4353

@juntang-zhuang made their first contribution in https://github.com/huggingface/datasets/pull/4231

Full Changelog: https://github.com/huggingface/datasets/compare/2.2.1...2.2.2
2.2.1(May 11, 2022)
Datasets bug fixes

Fix cnn_dailymail (dm stories were ignored) by @lhoestq in https://github.com/huggingface/datasets/pull/4317

datasets 2.2.0 introduced a bug in cnn_dailymail that left some examples out of the dataset

General improvements and bug fixes

Fix: Add missing comma by @mrm8488 in https://github.com/huggingface/datasets/pull/4303

Catch pull error when mirroring by @lhoestq in https://github.com/huggingface/datasets/pull/4314

Remove unused multiprocessing args from test CLI by @albertvillanova in https://github.com/huggingface/datasets/pull/4308

Fix CLI run_beam namespace by @albertvillanova in https://github.com/huggingface/datasets/pull/4315

Support passing config_kwargs to CLI run_beam by @albertvillanova in https://github.com/huggingface/datasets/pull/4316

Don't check f.loc in _get_extraction_protocol_with_magic_number by @lhoestq in https://github.com/huggingface/datasets/pull/4318

New Contributors

@mrm8488 made their first contribution in https://github.com/huggingface/datasets/pull/4303

Full Changelog: https://github.com/huggingface/datasets/compare/2.2.0...2.2.1
2.2.0(May 10, 2022)
Dataset Changes

New: ImageNet by @apsdehal in https://github.com/huggingface/datasets/pull/4178

Manual download only for now

New: Google Conceptual Captions by @abhishekkrthakur in https://github.com/huggingface/datasets/pull/1459

New: Conceptual 12M by @thomasw21 in https://github.com/huggingface/datasets/pull/4162

New: Visual Genome by @thomasw21 in https://github.com/huggingface/datasets/pull/4161

New: RVL-CDIP by @dnaveenr in https://github.com/huggingface/datasets/pull/4050

New: Text-based NP Enrichment (TNE) by @yanaiela in https://github.com/huggingface/datasets/pull/4153

New: TextVQA by @apsdehal in https://github.com/huggingface/datasets/pull/3967

New: ETT time series dataset by @kashif in https://github.com/huggingface/datasets/pull/4213

Update: assin2 - update metadata by @lhoestq in https://github.com/huggingface/datasets/pull/4172

Update: Librispeech - Add 'all' config by @patrickvonplaten in https://github.com/huggingface/datasets/pull/4184

Update: XGLUE - Support streaming dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4249

Update: crd3 - group all the turns in one example by @shanyas10 in https://github.com/huggingface/datasets/pull/4240

Update: pubmed_qa - Remove google drive URL by @lhoestq in https://github.com/huggingface/datasets/pull/4255

Update: SAMSum - Replace data URL dataset and support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4254

Update: SAMSum - Replace data URL dataset within the same repository by @albertvillanova in https://github.com/huggingface/datasets/pull/4267

Update: big_patent - Replace data URL in dataset and support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4236

Update: openbookqa - Add missing features for additional config by @albertvillanova in https://github.com/huggingface/datasets/pull/4278

Update: commonsense_qa - Add missing features by @albertvillanova in https://github.com/huggingface/datasets/pull/4280

Fix: Common Voice - Make sure bytes are correctly deleted if path exists by @patrickvonplaten in https://github.com/huggingface/datasets/pull/4212

Fix: openbookqa - fix bug in choices labels by @manandey in https://github.com/huggingface/datasets/pull/4259

Fix: openbookqa - fix style in openbookqa dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4270

Dataset Features

Add support for metadata files to imagefolder by @mariosasko in https://github.com/huggingface/datasets/pull/4069

Load a folder of images together with metadata stored in metadata.jsonl; see the documentation on how to load an image dataset for more info

Infer splits from the data_dir parameter when loading datasets without script by @polinaeterna in https://github.com/huggingface/datasets/pull/4144

Splits are inferred from the directory and file names; see the documentation on how to structure your repository for more info

Enable label alignment for token classification datasets by @lewtun in https://github.com/huggingface/datasets/pull/4277

Add drop_last_batch to IterableDataset.map by @mariosasko in https://github.com/huggingface/datasets/pull/4215

Load dataset with TSV files by @albertvillanova in https://github.com/huggingface/datasets/pull/4246
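
As a minimal sketch of the imagefolder metadata support listed above, the following writes a metadata.jsonl file next to a few placeholder image files using only the standard library. The folder, the file names and the caption column are made up for illustration; the file_name key that links each row to an image follows the convention described in the imagefolder documentation, and the load_dataset call is shown only as a comment because it needs the datasets library and real images.

```python
import json
import tempfile
from pathlib import Path


def write_image_metadata(folder, rows):
    """Write one JSON object per line to <folder>/metadata.jsonl."""
    path = Path(folder) / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


# Hypothetical folder with two placeholder files standing in for real images.
data_dir = Path(tempfile.mkdtemp())
for name in ("0001.png", "0002.png"):
    (data_dir / name).write_bytes(b"")  # empty placeholder, not a valid image

# Each row points at an image through "file_name"; "caption" is an example column.
rows = [
    {"file_name": "0001.png", "caption": "a cat on a sofa"},
    {"file_name": "0002.png", "caption": "a dog in a park"},
]
metadata_path = write_image_metadata(data_dir, rows)

# With real images in place, the folder could then be loaded with:
#   from datasets import load_dataset
#   dataset = load_dataset("imagefolder", data_dir=str(data_dir))
```

Keeping the metadata in a JSON Lines file means extra columns can be added per image without changing the folder structure.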

Dataset Cards

Autoeval config by @nrajani in https://github.com/huggingface/datasets/pull/4234

Add train-eval-index metadata to automate evaluation on your datasets based on their tasks

Adding license information for Openbookcorpus by @meg-huggingface in https://github.com/huggingface/datasets/pull/3525

Make code for image downloading from image urls cacheable by @mariosasko in https://github.com/huggingface/datasets/pull/4218

Fix description links in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4222

Add YAML tags to Dataset Card rotten tomatoes by @mo6zes in https://github.com/huggingface/datasets/pull/4262

Remove a copy-paste sentence in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/4281

Update LexGLUE README.md by @iliaschalkidis in https://github.com/huggingface/datasets/pull/4285

leaderboard info added for TNE by @yanaiela in https://github.com/huggingface/datasets/pull/4273

Add Lahnda language tag by @mariosasko in https://github.com/huggingface/datasets/pull/4286

Add license and point of contact to big_patent dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4269

Add HF Speech Bench to Librispeech Dataset Card by @sanchit-gandhi in https://github.com/huggingface/datasets/pull/4266

Metrics Changes

Perplexity Speedup by @emibaylor in https://github.com/huggingface/datasets/pull/4108

Add AUC ROC Metric by @emibaylor in https://github.com/huggingface/datasets/pull/4158

Small fixes in ROC AUC docs by @wschella in https://github.com/huggingface/datasets/pull/4239

Fix/start token mask issue and update documentation by @TristanThrush in https://github.com/huggingface/datasets/pull/4258

Add pearsonr mc, update functionality to match the original docs by @emibaylor in https://github.com/huggingface/datasets/pull/4226

Metric Cards

Metric card for the XTREME-S dataset by @sashavor in https://github.com/huggingface/datasets/pull/4251

Creating metric card for MAE by @sashavor in https://github.com/huggingface/datasets/pull/4252

Create metric cards for mean IOU by @sashavor in https://github.com/huggingface/datasets/pull/4253

Create metric card for Mahalanobis Distance by @sashavor in https://github.com/huggingface/datasets/pull/4257

Create metric card for MSE by @sashavor in https://github.com/huggingface/datasets/pull/4256

Fix exact match by @emibaylor in https://github.com/huggingface/datasets/pull/4166

Fix google bleu typos, examples by @emibaylor in https://github.com/huggingface/datasets/pull/4165

Add f1 metric card, update docstring in py file by @emibaylor in https://github.com/huggingface/datasets/pull/4227

Add Recall Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4204

Matthews Correlation Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4110

Add Precision Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4203

Add Accuracy Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4223

Add Spearmanr Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4109

Metric card template by @emibaylor in https://github.com/huggingface/datasets/pull/3915

Documentation

Document save_to_disk and push_to_hub on images and audio files by @lhoestq in https://github.com/huggingface/datasets/pull/4193

Add to docs how to load from local script by @albertvillanova in https://github.com/huggingface/datasets/pull/4200

Add code examples to API docs by @stevhliu in https://github.com/huggingface/datasets/pull/4168

Add code examples for DatasetDict by @stevhliu in https://github.com/huggingface/datasets/pull/4245

Add API code examples for IterableDataset by @stevhliu in https://github.com/huggingface/datasets/pull/4274

Add packaged builder configs to the documentation by @lhoestq in https://github.com/huggingface/datasets/pull/4307

[Imagefolder] Docs + Don't infer labels from file names when there are metadata + Error messages when metadata and images aren't linked correctly by @lhoestq in https://github.com/huggingface/datasets/pull/4311

General improvements and bug fixes

Generate tasks.json taxonomy from huggingface_hub by @julien-c in https://github.com/huggingface/datasets/pull/4154

Fix when map function modifies input in-place by @thomasw21 in https://github.com/huggingface/datasets/pull/4174

Support streaming cnn_dailymail dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4188

Don't duplicate data when encoding audio or image by @lhoestq in https://github.com/huggingface/datasets/pull/4187

Fix outdated docstring about default dataset config by @lhoestq in https://github.com/huggingface/datasets/pull/4186

Deprecate shard_size in push_to_hub in favor of max_shard_size by @mariosasko in https://github.com/huggingface/datasets/pull/4190

Fix some type annotation in doc by @thomasw21 in https://github.com/huggingface/datasets/pull/4202

Update GH template for dataset viewer issues by @albertvillanova in https://github.com/huggingface/datasets/pull/4201

Update auth when mirroring datasets on the hub by @lhoestq in https://github.com/huggingface/datasets/pull/4242

Rename imagenet2012 -> imagenet-1k by @lhoestq in https://github.com/huggingface/datasets/pull/4263

Skip checksum computation in Imagefolder by default by @mariosasko in https://github.com/huggingface/datasets/pull/4214

Fix convert_file_size_to_int for kilobits and megabits by @mariosasko in https://github.com/huggingface/datasets/pull/4205

Fix typo in logging docs by @stevhliu in https://github.com/huggingface/datasets/pull/4272

Bump PyArrow Version to 6 by @dnaveenr in https://github.com/huggingface/datasets/pull/4250

task id update by @nrajani in https://github.com/huggingface/datasets/pull/4244

Avoid recursion error in map if example is returned as dict value by @mariosasko in https://github.com/huggingface/datasets/pull/4216

Update minimal PyArrow version warning by @mariosasko in https://github.com/huggingface/datasets/pull/4279

[Minor edit] Fix typo in class name by @cakiki in https://github.com/huggingface/datasets/pull/4207

Stream private zipped images by @lhoestq in https://github.com/huggingface/datasets/pull/4173

Fix filesystem docstring by @stevhliu in https://github.com/huggingface/datasets/pull/4283

Document how to use FAISS index for special operations by @albertvillanova in https://github.com/huggingface/datasets/pull/4189

Contributing MedMCQA dataset by @monk1337 in https://github.com/huggingface/datasets/pull/4064

Don't do unnecessary list type casting to avoid replacing None values by empty lists by @lhoestq in https://github.com/huggingface/datasets/pull/4282

Fix missing lz4 dependency for tests by @albertvillanova in https://github.com/huggingface/datasets/pull/4295

Altered faiss installation comment by @vishalsrao in https://github.com/huggingface/datasets/pull/4220

Fix CLI run_beam save_infos by @albertvillanova in https://github.com/huggingface/datasets/pull/4294

Add missing faiss import to fix https://github.com/huggingface/datasets/issues/4287 by @alvarobartt in https://github.com/huggingface/datasets/pull/4288

New Contributors

@shanyas10 made their first contribution in https://github.com/huggingface/datasets/pull/4240

@apsdehal made their first contribution in https://github.com/huggingface/datasets/pull/4178

@wschella made their first contribution in https://github.com/huggingface/datasets/pull/4239

@TristanThrush made their first contribution in https://github.com/huggingface/datasets/pull/4258

@yanaiela made their first contribution in https://github.com/huggingface/datasets/pull/4153

@mo6zes made their first contribution in https://github.com/huggingface/datasets/pull/4262

@nrajani made their first contribution in https://github.com/huggingface/datasets/pull/4244

@sanchit-gandhi made their first contribution in https://github.com/huggingface/datasets/pull/4266

@cakiki made their first contribution in https://github.com/huggingface/datasets/pull/4207

@monk1337 made their first contribution in https://github.com/huggingface/datasets/pull/4064

@alvarobartt made their first contribution in https://github.com/huggingface/datasets/pull/4288

Full Changelog: https://github.com/huggingface/datasets/compare/2.1.0...2.2.0
2.1.0(Apr 14, 2022)
Datasets Changes

New: initial monash time series forecasting by @kashif in https://github.com/huggingface/datasets/pull/3743

New: Roman Urdu Hate Speech dataset by @bp-high in https://github.com/huggingface/datasets/pull/3972

New: Adversarial GLUE by @jxmorris12 in https://github.com/huggingface/datasets/pull/3849

New: MetaShift by @dnaveenr in https://github.com/huggingface/datasets/pull/3900

New: GSM8K by @jon-tow in https://github.com/huggingface/datasets/pull/4103

New: SBU Captions Photo by @thomasw21 in https://github.com/huggingface/datasets/pull/4130

Deprecated: Multilingual Librispeech - deprecate dataset in favor of facebook/multilingual_librispeech by @polinaeterna in https://github.com/huggingface/datasets/pull/4060

Update (BREAKING): TIMIT - Redirect users to download data manually from LDC by @lhoestq in https://github.com/huggingface/datasets/pull/4145

Update: Wikipedia by @albertvillanova in https://github.com/huggingface/datasets/pull/3821 and https://github.com/huggingface/datasets/pull/3989

Update: conll2012_ontonotesv5 - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4002

Update: daily_dialog - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4008

Update: id_clickbait - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4014

Update: blimp - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4016

Update: scan - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4017

Update: yelp_review_full - Replace data url by @lhoestq in https://github.com/huggingface/datasets/pull/4018

Update: yelp_polarity - Support streaming by @lhoestq in https://github.com/huggingface/datasets/pull/4019

Update: amazon_polarity - Replace data URL by @lhoestq in https://github.com/huggingface/datasets/pull/4020

Update: dbpedia_14 - Replace data url by @lhoestq in https://github.com/huggingface/datasets/pull/4022

Update: xtreme - Support streaming dataset for bucc18 config by @albertvillanova in https://github.com/huggingface/datasets/pull/4026

Update: yahoo_answers_topics - Replace data url by @lhoestq in https://github.com/huggingface/datasets/pull/4023

Update: ASSIN 2 dataset - replace broken Google Drive URLS by links on github by @ruanchaves in https://github.com/huggingface/datasets/pull/4004

Update: xcopa - Support streaming by @albertvillanova in https://github.com/huggingface/datasets/pull/4039

Update: medical_dialog - Add configs with processed data by @albertvillanova in https://github.com/huggingface/datasets/pull/4127

Update: xtreme - Support streaming for udpos config by @albertvillanova in https://github.com/huggingface/datasets/pull/4131

Update: xtreme - Support streaming for PAWS-X config by @albertvillanova in https://github.com/huggingface/datasets/pull/4132

Update: xtreme - Support streaming for PAN-X config by @albertvillanova in https://github.com/huggingface/datasets/pull/4135

Update: SQuAD v2 - Use a constant for the articles regex by @bryant1410 in https://github.com/huggingface/datasets/pull/4030

Update: HANS - Support streaming by @mariosasko in https://github.com/huggingface/datasets/pull/4155

Fix: cats_vs_dogs - fix checksum error dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/4033

Fix: xcopa - fix null checksum by @albertvillanova in https://github.com/huggingface/datasets/pull/4034

Fix: amazon_us_reviews - fix metadata - 4/4/2022 by @trentonstrong in https://github.com/huggingface/datasets/pull/4092

Dataset Cards

Updated annotations for nli_tr dataset by @e-budur in https://github.com/huggingface/datasets/pull/4058

Add missing label for emotion description by @lijiazheng99 in https://github.com/huggingface/datasets/pull/4151

Remove unnecessary 'pylint disable' message in ReadMe by @Datta0 in https://github.com/huggingface/datasets/pull/3955

Improve RedCaps dataset card by @mariosasko in https://github.com/huggingface/datasets/pull/4100

Fix duplicate key in multi_news by @lhoestq in https://github.com/huggingface/datasets/pull/4164

Datasets Tags and Search on the Hugging Face Hub

Tasks alignment with models by @lhoestq in https://github.com/huggingface/datasets/pull/4066

Update datasets task tags to align tags with models by @lhoestq in https://github.com/huggingface/datasets/pull/4067

Metrics Changes

Xtreme-S Metrics by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3799

Fix xtreme s metrics by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3957

Avoid info log messages from transformers in FrugalScore metric by @albertvillanova in https://github.com/huggingface/datasets/pull/3938

Add exact match metric by @emibaylor in https://github.com/huggingface/datasets/pull/3899

Fix comet metric by @lhoestq in https://github.com/huggingface/datasets/pull/3945

Add zero_division argument to precision and recall metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4035

Support float data types in pearsonr/spearmanr metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4054

Remove GLEU metric by @emibaylor in https://github.com/huggingface/datasets/pull/3949
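
To illustrate what a zero_division argument controls in the precision and recall entry above, here is a small standalone sketch. It is not the metric's implementation: the function name, the single positive label and the default value of 0.0 are assumptions chosen for simplicity. It only shows the case the argument exists for, namely a precision whose denominator is zero because no example was predicted positive.

```python
def precision(predictions, references, positive=1, zero_division=0.0):
    """Precision for one positive label.

    Returns `zero_division` when nothing was predicted positive,
    since the ratio TP / (TP + FP) is undefined in that case.
    """
    true_pos = sum(
        1 for p, r in zip(predictions, references) if p == positive and r == positive
    )
    pred_pos = sum(1 for p in predictions if p == positive)
    if pred_pos == 0:
        return zero_division
    return true_pos / pred_pos


# No positive predictions: the result is whatever zero_division says.
undefined_as_zero = precision([0, 0], [1, 0])
undefined_as_one = precision([0, 0], [1, 0], zero_division=1.0)

# Two positive predictions, one of them correct.
regular = precision([1, 1, 0], [1, 0, 0])
```

Recall behaves the same way when the references contain no positive example.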

Metric Cards

Perplexity Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/3905

Create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3917

Create README.md for CER metric by @sashavor in https://github.com/huggingface/datasets/pull/3911

Create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3944

Update README.md by @sashavor in https://github.com/huggingface/datasets/pull/3933

Create SARI metric card by @sashavor in https://github.com/huggingface/datasets/pull/3932

Create MAUVE metric card by @sashavor in https://github.com/huggingface/datasets/pull/3934

Create CoVAL metric card by @sashavor in https://github.com/huggingface/datasets/pull/3940

Google BLEU Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/3948

Create metric card for BERTScore by @sashavor in https://github.com/huggingface/datasets/pull/3966

Rename wer to cer by @pmgautam in https://github.com/huggingface/datasets/pull/4012

Create metric card for XNLI by @sashavor in https://github.com/huggingface/datasets/pull/4046

Create metric card for the Code Eval metric by @sashavor in https://github.com/huggingface/datasets/pull/4049

Add TER metric card by @emibaylor in https://github.com/huggingface/datasets/pull/3981

BLEU metric card by @emibaylor in https://github.com/huggingface/datasets/pull/3947

Create metric card for CUAD by @sashavor in https://github.com/huggingface/datasets/pull/4043

Create metric card for METEOR by @sashavor in https://github.com/huggingface/datasets/pull/4065

Create a metric card for Competition MATH by @sashavor in https://github.com/huggingface/datasets/pull/4073

Create metric card for seqeval by @sashavor in https://github.com/huggingface/datasets/pull/4070

Create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3930

Create metric card for Frugal Score by @sashavor in https://github.com/huggingface/datasets/pull/4089

Updating FrugalScore metric card by @sashavor in https://github.com/huggingface/datasets/pull/4097

Proposing WikiSplit metric card by @sashavor in https://github.com/huggingface/datasets/pull/4098

Fix formatting in BLEU metric card by @mariosasko in https://github.com/huggingface/datasets/pull/4157

Documentation

Doc maintenance by @stevhliu in https://github.com/huggingface/datasets/pull/3926

[Doc] Don't use v for version tags on GitHub by @sgugger in https://github.com/huggingface/datasets/pull/3943

Use templates for doc-building jobs by @sgugger in https://github.com/huggingface/datasets/pull/3914

Add align_labels_with_mapping docs by @stevhliu in https://github.com/huggingface/datasets/pull/3931

Add tip on how to speed up loading with ImageFolder by @mariosasko in https://github.com/huggingface/datasets/pull/3980

Fix main_classes docs index by @lhoestq in https://github.com/huggingface/datasets/pull/3925

More consistent references in docs by @mariosasko in https://github.com/huggingface/datasets/pull/3988

Docs maintenance by @stevhliu in https://github.com/huggingface/datasets/pull/3999

Add ROUGE Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4076

Add chrF(++) Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4082

Add SacreBLEU Metric Card by @emibaylor in https://github.com/huggingface/datasets/pull/4083

General improvements and bug fixes

Fix flatten of complex feature types by @mariosasko in https://github.com/huggingface/datasets/pull/3723

Fix flatten of Sequence feature type by @lhoestq in https://github.com/huggingface/datasets/pull/3962

Exclude Google Drive tests of the CI by @lhoestq in https://github.com/huggingface/datasets/pull/3982

Close PIL.Image file handler in Image.decode_example by @mariosasko in https://github.com/huggingface/datasets/pull/3995

Fix Faiss custom_index device by @albertvillanova in https://github.com/huggingface/datasets/pull/3987

Fix None issue with Sequence of dict by @lhoestq in https://github.com/huggingface/datasets/pull/4010

Update main readme by @lhoestq in https://github.com/huggingface/datasets/pull/3927

Fix map remove_columns on empty dataset by @lhoestq in https://github.com/huggingface/datasets/pull/4021

Fix Audio.encode_example() when writing an array by @polinaeterna in https://github.com/huggingface/datasets/pull/3998

Use audio feature in ASR task template by @lhoestq in https://github.com/huggingface/datasets/pull/4006

Improve out of bounds error message by @lhoestq in https://github.com/huggingface/datasets/pull/4068

Increase max retries for GitHub metrics by @albertvillanova in https://github.com/huggingface/datasets/pull/4063

Fix CLI dummy data generation by @albertvillanova in https://github.com/huggingface/datasets/pull/4045

Fix docs on audio feature installation by @albertvillanova in https://github.com/huggingface/datasets/pull/4028

Add installation instructions to image_process doc by @mariosasko in https://github.com/huggingface/datasets/pull/4072

Fix GithubMetricModuleFactory instantiation with None download_config by @albertvillanova in https://github.com/huggingface/datasets/pull/4078

Increase max retries for GitHub datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/4079

Close parquet writer properly in push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/4081

fix typo in rename_column error message by @hunterlang in https://github.com/huggingface/datasets/pull/4095

Fix BeamWriter output Parquet file by @albertvillanova in https://github.com/huggingface/datasets/pull/4087

Remove unused legacy Beam utils by @albertvillanova in https://github.com/huggingface/datasets/pull/4088

Hotfix failing CI tests on Windows by @albertvillanova in https://github.com/huggingface/datasets/pull/4119

Update security policy by @albertvillanova in https://github.com/huggingface/datasets/pull/4111

Avoid writing empty license files by @albertvillanova in https://github.com/huggingface/datasets/pull/4090

Support huggingface_hub 0.5 by @lhoestq in https://github.com/huggingface/datasets/pull/4106

Pretty print dataset info files by @mariosasko in https://github.com/huggingface/datasets/pull/4116

Add single dataset citations for TweetEval by @gchhablani in https://github.com/huggingface/datasets/pull/4137

Adjust path to datasets tutorial in How-To by @NimaBoscarino in https://github.com/huggingface/datasets/pull/4147

Applied index-filters on scores in search.py. by @vishalsrao in https://github.com/huggingface/datasets/pull/3971

More robust cast_to_python_objects in TypedSequence by @mariosasko in https://github.com/huggingface/datasets/pull/4128

Sync Features dictionaries by @mariosasko in https://github.com/huggingface/datasets/pull/3997

Avoid rate limit in update hub repositories by @lhoestq in https://github.com/huggingface/datasets/pull/4167

New Contributors

@bp-high made their first contribution in https://github.com/huggingface/datasets/pull/3972

@ruanchaves made their first contribution in https://github.com/huggingface/datasets/pull/4004

@pmgautam made their first contribution in https://github.com/huggingface/datasets/pull/4012

@hunterlang made their first contribution in https://github.com/huggingface/datasets/pull/4095

@trentonstrong made their first contribution in https://github.com/huggingface/datasets/pull/4092

@NimaBoscarino made their first contribution in https://github.com/huggingface/datasets/pull/4147

@jon-tow made their first contribution in https://github.com/huggingface/datasets/pull/4103

@lijiazheng99 made their first contribution in https://github.com/huggingface/datasets/pull/4151

@Datta0 made their first contribution in https://github.com/huggingface/datasets/pull/3955

@vishalsrao made their first contribution in https://github.com/huggingface/datasets/pull/3971

Full Changelog: https://github.com/huggingface/datasets/compare/2.0.0...2.1.0
2.0.0(Mar 15, 2022)
🤗 Datasets 2.0.0

We're happy to announce that our new documentation is available at hf.co/docs/datasets!

Dataset Features

Load a folder of images using the imagefolder dataset loader:

Add imagefolder dataset by @nateraw in https://github.com/huggingface/datasets/pull/2830

Faster ImageFolder + add option to drop labels by @mariosasko in https://github.com/huggingface/datasets/pull/3887

Push your image and audio datasets on the Hugging Face Hub with push_to_hub:

Add support for Audio and Image feature in push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/3685

New processing methods for streaming datasets:

Add IterableDataset.filter by @lhoestq in https://github.com/huggingface/datasets/pull/3826

Manipulate columns on IterableDataset (rename columns, cast, etc.) by @lhoestq in https://github.com/huggingface/datasets/pull/3862

Add the new methods to IterableDatasetDict by @lhoestq in https://github.com/huggingface/datasets/pull/3923
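
The idea behind filtering and renaming columns on a streamed dataset can be sketched with plain Python generators: examples are transformed one at a time as they are consumed, so nothing has to be loaded up front. The helper names below are hypothetical and the code is only an illustration of the lazy pattern, not the library's implementation.

```python
def lazy_filter(examples, predicate):
    """Yield only the examples for which predicate(example) is true."""
    for example in examples:
        if predicate(example):
            yield example


def lazy_rename_column(examples, old, new):
    """Yield examples with the key `old` renamed to `new`, keeping key order."""
    for example in examples:
        yield {(new if key == old else key): value for key, value in example.items()}


# A small in-memory iterator standing in for a streamed dataset.
stream = iter(
    [
        {"text": "good", "label": 1},
        {"text": "bad", "label": 0},
        {"text": "great", "label": 1},
    ]
)

# Chain the two steps; no example is processed until it is requested.
positives = lazy_rename_column(
    lazy_filter(stream, lambda ex: ex["label"] == 1), "text", "sentence"
)
first = next(positives)
```

Because each step is a generator, chaining more operations adds no extra passes over the data.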

And more:

Add more compression types for to_json by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3551

Multi-GPU support for FaissIndex by @rentruewang in https://github.com/huggingface/datasets/pull/3721

Breaking changes

API changes for map and shuffle for datasets loaded in streaming mode:

Align map when streaming: update instead of overwrite + add missing parameters by @lhoestq in https://github.com/huggingface/datasets/pull/3801

Align IterableDataset.shuffle with Dataset.shuffle by @lhoestq in https://github.com/huggingface/datasets/pull/3842

Rename GenerateMode to DownloadMode by @albertvillanova in https://github.com/huggingface/datasets/pull/3759

Remove deprecated methods/params (preparation for v2.0) by @mariosasko in https://github.com/huggingface/datasets/pull/3803

Remove deprecated remove_columns param in filter by @mariosasko in https://github.com/huggingface/datasets/pull/3827

Module namespace cleanup for v2.0 by @mariosasko in https://github.com/huggingface/datasets/pull/3875
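The two streaming API changes listed at the top of this section can be sketched with plain generators. The first pair of functions contrasts "overwrite" (the function's output replaces the example) with "update" (the output is merged into the example, so existing columns survive). The second shows a generic buffer-based shuffle in which buffer_size and seed are optional keyword arguments. This illustrates the semantics under those assumptions and is not the library's code. All function names here are hypothetical.

```python
import random
from typing import Callable, Dict, Iterable, Iterator, Optional

Example = Dict[str, object]


def map_overwrite(examples: Iterable[Example], function: Callable[[Example], Example]) -> Iterator[Example]:
    """Old-style behaviour: the function's output replaces the example."""
    for example in examples:
        yield function(example)


def map_update(examples: Iterable[Example], function: Callable[[Example], Example]) -> Iterator[Example]:
    """New-style behaviour: the function's output is merged into the example."""
    for example in examples:
        yield {**example, **function(example)}


def buffer_shuffle(examples: Iterable, buffer_size: int = 1000, seed: Optional[int] = None) -> Iterator:
    """Approximate shuffle of a stream using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for example in examples:
        if len(buffer) < buffer_size:
            buffer.append(example)  # fill the buffer first
            continue
        i = rng.randrange(buffer_size)
        yield buffer[i]  # emit a random buffered item...
        buffer[i] = example  # ...and replace it with the incoming one
    rng.shuffle(buffer)
    yield from buffer  # flush whatever is left


def add_length(example: Example) -> Example:
    return {"length": len(example["text"])}


rows = [{"text": "hi"}]
overwritten = list(map_overwrite(rows, add_length))
updated = list(map_update(rows, add_length))
shuffled = list(buffer_shuffle(range(10), buffer_size=3, seed=0))
```

With "update" semantics, code that relied on map dropping the original columns needs to remove them explicitly instead.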

Dataset Changes

New: CFPB Consumer Complaints by @kayvane1 in https://github.com/huggingface/datasets/pull/3617

New: told-br (Brazilian hate speech) by @JAugusto97 in https://github.com/huggingface/datasets/pull/3683

New: electricity load diagram by @kashif in https://github.com/huggingface/datasets/pull/3722

New: MIT Scene Parsing Benchmark by @mariosasko in https://github.com/huggingface/datasets/pull/3607

New: ElkarHizketak v1.0 by @antxa in https://github.com/huggingface/datasets/pull/3780

New: wikitablequestions by @SivilTaram in https://github.com/huggingface/datasets/pull/3870

New: ontonotes_conll by @richarddwang in https://github.com/huggingface/datasets/pull/3853

Update: BnL Historical Newspapers - make the dataset streamable by @albertvillanova in https://github.com/huggingface/datasets/pull/3616

Update: Common Voice - add validated partition by @shalymin-amzn in https://github.com/huggingface/datasets/pull/3669

Update: Common Voice - add local paths to audio files by @lhoestq in https://github.com/huggingface/datasets/pull/3736

Update: Common Voice - simplify code by @lhoestq in https://github.com/huggingface/datasets/pull/3817

Update: Natural Questions - add dev-only configuration by @albertvillanova in https://github.com/huggingface/datasets/pull/3699

Update: pubmed - update data url by @albertvillanova in https://github.com/huggingface/datasets/pull/3692

Update: pubmed - make the dataset streamable by @abhi-mosaic in https://github.com/huggingface/datasets/pull/3740

Update: RedCaps - make the dataset streamable by @mariosasko in https://github.com/huggingface/datasets/pull/3737

Update: cats_vs_dogs - update metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/3752

Update: newsroom - update manual download url by @albertvillanova in https://github.com/huggingface/datasets/pull/3779

Update: xcopa - update to new version by @albertvillanova in https://github.com/huggingface/datasets/pull/3810

Update: cats_vs_dogs size by @mariosasko in https://github.com/huggingface/datasets/pull/3878

Fix: sem_eval_2018_task_1 - fix download location by @maxpel in https://github.com/huggingface/datasets/pull/3643

Fix: newsqa - fix unique keys by @albertvillanova in https://github.com/huggingface/datasets/pull/3696

Fix: The Pile datasets - fix host urls by @albertvillanova in https://github.com/huggingface/datasets/pull/3627

Fix: Evidence Infer Treatment - fix dataset script by @albertvillanova in https://github.com/huggingface/datasets/pull/3718

Fix: NewsQA - fix dataset script by @albertvillanova in https://github.com/huggingface/datasets/pull/3734

Fix: head_qa - fix data url by @albertvillanova in https://github.com/huggingface/datasets/pull/3766

Fix: msr_sqa - fix unique keys by @albertvillanova in https://github.com/huggingface/datasets/pull/3771

Fix: reddit_tifu - fix data url by @albertvillanova in https://github.com/huggingface/datasets/pull/3774

Fix: wiki_lingua - fix spanish data file url by @albertvillanova in https://github.com/huggingface/datasets/pull/3806

Fix: beans - fix data urls by @mariosasko in https://github.com/huggingface/datasets/pull/3890

Fix: CRD3 - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/3921

Fix: MultiWOZ 2.2 - fix NonMatchingChecksumError by @albertvillanova in https://github.com/huggingface/datasets/pull/3922

Dataset cards

Add code example in wikipedia card by @lhoestq in https://github.com/huggingface/datasets/pull/3678

Fix Multi-News dataset metadata and card by @albertvillanova in https://github.com/huggingface/datasets/pull/3731

Reddit dataset card additions by @anna-kay in https://github.com/huggingface/datasets/pull/3781

Update gigaword card and info by @mariosasko in https://github.com/huggingface/datasets/pull/3775

Reddit dataset card contribution by @anna-kay in https://github.com/huggingface/datasets/pull/3797

Metric Changes

New: FrugalScore by @moussaKam in https://github.com/huggingface/datasets/pull/3674

New: Mahalanobis distance by @JoaoLages in https://github.com/huggingface/datasets/pull/3794

New: mIoU by @NielsRogge in https://github.com/huggingface/datasets/pull/3745

New: MSE and MAE - V2 by @dnaveenr in https://github.com/huggingface/datasets/pull/3874

Fix: METEOR - fix bug due to nltk version by @albertvillanova in https://github.com/huggingface/datasets/pull/3884

Metric cards

Add perplexity to metrics by @emibaylor in https://github.com/huggingface/datasets/pull/3757

Create SQuAD metric README.md by @sashavor in https://github.com/huggingface/datasets/pull/3873

SQuAD v2 metric: create README.md by @sashavor in https://github.com/huggingface/datasets/pull/3879

Update README.md for SQuAD v2 metric by @sashavor in https://github.com/huggingface/datasets/pull/3908

Update README.md for SQuAD metric by @sashavor in https://github.com/huggingface/datasets/pull/3907

Create README.md for WER metric by @sashavor in https://github.com/huggingface/datasets/pull/3898

Create README.md for GLUE by @sashavor in https://github.com/huggingface/datasets/pull/3916

New documentation

Update docs to new frontend/UI by @mishig25 in https://github.com/huggingface/datasets/pull/3690

Image process doc by @stevhliu in https://github.com/huggingface/datasets/pull/3882

General improvements and bug fixes

Better TQDM output by @mariosasko in https://github.com/huggingface/datasets/pull/3654

Prioritize module.builder_kwargs over defaults in TestCommand by @lvwerra in https://github.com/huggingface/datasets/pull/3672

Extend support for streaming datasets that use os.path.relpath by @albertvillanova in https://github.com/huggingface/datasets/pull/3623

Add Fon language tag by @albertvillanova in https://github.com/huggingface/datasets/pull/3620

Remove unnecessary 'r' arg in by @bryant1410 in https://github.com/huggingface/datasets/pull/3661

Fix TestCommand to copy dataset_infos to local dir with only data files by @albertvillanova in https://github.com/huggingface/datasets/pull/3680

Upgrade black to version ~=22.0 by @LysandreJik in https://github.com/huggingface/datasets/pull/3691

Fix streaming for servers not supporting HTTP range requests by @albertvillanova in https://github.com/huggingface/datasets/pull/3689

Pin ElasticSearch by @lhoestq in https://github.com/huggingface/datasets/pull/3701

Raise informative error when loading a save_to_disk dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/3705

Fix ClassLabel to/from dict when passed names_file by @albertvillanova in https://github.com/huggingface/datasets/pull/3695

Fix CI code quality issue by @albertvillanova in https://github.com/huggingface/datasets/pull/3710

Check if indices values in Dataset.select are within bounds by @mariosasko in https://github.com/huggingface/datasets/pull/3719

Pin pandas to avoid bug in streaming mode by @albertvillanova in https://github.com/huggingface/datasets/pull/3725

Use config pandas version in CSV dataset builder by @albertvillanova in https://github.com/huggingface/datasets/pull/3726

Set base path to hub url for canonical datasets by @lhoestq in https://github.com/huggingface/datasets/pull/3709

Fix ValueError message formatting in int2str by @akulchik in https://github.com/huggingface/datasets/pull/3742

Patch all module attributes in its namespace by @albertvillanova in https://github.com/huggingface/datasets/pull/3727

Fix typo in train split name by @albertvillanova in https://github.com/huggingface/datasets/pull/3751

feat: 🎸 generate info if dataset_infos.json does not exist by @severo in https://github.com/huggingface/datasets/pull/3670

Support streaming in size estimation function in push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/3732

Expose method and fix param by @severo in https://github.com/huggingface/datasets/pull/3767

Fix HfFileSystem docstring by @lhoestq in https://github.com/huggingface/datasets/pull/3768

process .opus files (for Multilingual Spoken Words) by @polinaeterna in https://github.com/huggingface/datasets/pull/3666

Fix: dataset name is stored in keys by @thomasw21 in https://github.com/huggingface/datasets/pull/3772

Use the same seed to shuffle shards and metadata in streaming mode by @lhoestq in https://github.com/huggingface/datasets/pull/3746

Start removing canonical datasets logic by @lhoestq in https://github.com/huggingface/datasets/pull/3777

Support passing str to iter_files by @albertvillanova in https://github.com/huggingface/datasets/pull/3783

Fix Google Drive URL to avoid Virus scan warning by @albertvillanova in https://github.com/huggingface/datasets/pull/3787

Skip checksum computation if ignore_verifications is True by @mariosasko in https://github.com/huggingface/datasets/pull/3796

Fix error message in CSV loader for newer Pandas versions by @mariosasko in https://github.com/huggingface/datasets/pull/3798

Add data_dir to data_files resolution and misc improvements to HfFileSystem by @mariosasko in https://github.com/huggingface/datasets/pull/3791

Error of writing with different schema, due to nonpreservation of nullability by @richarddwang in https://github.com/huggingface/datasets/pull/3782

Handle Nones in PyArrow struct by @mariosasko in https://github.com/huggingface/datasets/pull/3814

Fix iter_archive getting reset by @lhoestq in https://github.com/huggingface/datasets/pull/3815

Added computer vision tasks by @merveenoyan in https://github.com/huggingface/datasets/pull/3800

Fix typo in doc build yml by @mishig25 in https://github.com/huggingface/datasets/pull/3819

Allow not specifying feature cols other than predictions/references in Metric.compute by @mariosasko in https://github.com/huggingface/datasets/pull/3824

Logo float left by @mishig25 in https://github.com/huggingface/datasets/pull/3836

Pin responses to fix CI for Windows by @albertvillanova in https://github.com/huggingface/datasets/pull/3840

Fix dead dataset scripts creation link. by @dnaveenr in https://github.com/huggingface/datasets/pull/3834

Remove decode: true for image feature in head_qa by @craffel in https://github.com/huggingface/datasets/pull/3805

Update faiss device docstring by @lhoestq in https://github.com/huggingface/datasets/pull/3846

Update index.mdx margins by @gary149 in https://github.com/huggingface/datasets/pull/3858

Fix push_to_hub with null images by @lhoestq in https://github.com/huggingface/datasets/pull/3856

Redundant add dataset information and dead link. by @dnaveenr in https://github.com/huggingface/datasets/pull/3852

Update image dataset tags by @mariosasko in https://github.com/huggingface/datasets/pull/3864

Bring back imgs so that forks don't get broken by @mishig25 in https://github.com/huggingface/datasets/pull/3866

Small typos in How-to-train tutorial. by @lkhphuc in https://github.com/huggingface/datasets/pull/3833

Small doc fixes by @mishig25 in https://github.com/huggingface/datasets/pull/3860

add pandas to env command by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3871

Ignore duplicate keys if ignore_verifications=True by @mariosasko in https://github.com/huggingface/datasets/pull/3868

Update code blocks by @lhoestq in https://github.com/huggingface/datasets/pull/3863

Fix download_mode in dataset_module_factory by @albertvillanova in https://github.com/huggingface/datasets/pull/3876

Fix some shuffle docs by @lhoestq in https://github.com/huggingface/datasets/pull/3885

Fix race condition in doc build by @lhoestq in https://github.com/huggingface/datasets/pull/3891

Add default branch for doc building by @sgugger in https://github.com/huggingface/datasets/pull/3893

[docs] make dummy data creation optional by @lhoestq in https://github.com/huggingface/datasets/pull/3894

Fix code examples indentation by @lhoestq in https://github.com/huggingface/datasets/pull/3895

Align tqdm control/cache control with Transformers by @mariosasko in https://github.com/huggingface/datasets/pull/3897

Fix CLI test checksums by @albertvillanova in https://github.com/huggingface/datasets/pull/3892

Fix Google Drive URL to avoid Virus scan warning in streaming mode by @mariosasko in https://github.com/huggingface/datasets/pull/3843

Change the framework switches to the new syntax by @sgugger in https://github.com/huggingface/datasets/pull/3880

New Contributors

@kayvane1 made their first contribution in https://github.com/huggingface/datasets/pull/3617

@JAugusto97 made their first contribution in https://github.com/huggingface/datasets/pull/3683

@shalymin-amzn made their first contribution in https://github.com/huggingface/datasets/pull/3669

@kashif made their first contribution in https://github.com/huggingface/datasets/pull/3722

@akulchik made their first contribution in https://github.com/huggingface/datasets/pull/3742

@abhi-mosaic made their first contribution in https://github.com/huggingface/datasets/pull/3740

@emibaylor made their first contribution in https://github.com/huggingface/datasets/pull/3757

@anna-kay made their first contribution in https://github.com/huggingface/datasets/pull/3781

@JoaoLages made their first contribution in https://github.com/huggingface/datasets/pull/3794

@mishig25 made their first contribution in https://github.com/huggingface/datasets/pull/3690

@antxa made their first contribution in https://github.com/huggingface/datasets/pull/3780

@dnaveenr made their first contribution in https://github.com/huggingface/datasets/pull/3834

@lkhphuc made their first contribution in https://github.com/huggingface/datasets/pull/3833

@rentruewang made their first contribution in https://github.com/huggingface/datasets/pull/3721

@gary149 made their first contribution in https://github.com/huggingface/datasets/pull/3858

@NielsRogge made their first contribution in https://github.com/huggingface/datasets/pull/3745

@sashavor made their first contribution in https://github.com/huggingface/datasets/pull/3873

@SivilTaram made their first contribution in https://github.com/huggingface/datasets/pull/3870

Document cases for github datasets by @lhoestq in https://github.com/huggingface/datasets/pull/3924

Fix text loader to split only on universal newlines by @albertvillanova in https://github.com/huggingface/datasets/pull/3910

Retry HfApi call inside push_to_hub when 504 error by @albertvillanova in https://github.com/huggingface/datasets/pull/3886
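The text loader fix above turns on a distinction that is easy to miss in Python: str.splitlines() breaks on many characters besides the universal newlines (\n, \r and \r\n), for example the Unicode line separator \u2028. The sketch below shows the difference using a regular expression. It is an illustration only and does not reproduce how the loader itself is implemented. The helper name split_universal_newlines is hypothetical.

```python
import re

# Universal newlines only: \r\n, \r or \n (longest match first).
_UNIVERSAL_NEWLINE = re.compile(r"\r\n|\r|\n")


def split_universal_newlines(text: str) -> list:
    """Split text on \\n, \\r and \\r\\n only, dropping one trailing empty line."""
    lines = _UNIVERSAL_NEWLINE.split(text)
    if lines and lines[-1] == "":
        lines.pop()
    return lines


sample = "first\u2028still first\nsecond\r\nthird"
with_splitlines = sample.splitlines()
with_universal = split_universal_newlines(sample)
```

Here with_splitlines has four entries because \u2028 is treated as a line break, while with_universal keeps "first\u2028still first" together as one line.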

Full Changelog: https://github.com/huggingface/datasets/compare/1.18.3...2.0.0
1.18.4(Mar 7, 2022)
Bug fixes

Prioritize module.builder_kwargs over defaults in TestCommand #3672 (@lvwerra)

Fix TestCommand to copy dataset_infos to local dir with only data files #3680 (@albertvillanova)

Upgrade black to version ~=22.0 #3691 (@LysandreJik)

Fix streaming for servers not supporting HTTP range requests #3689 (@albertvillanova)

Pin ElasticSearch #3701 (@lhoestq)

Fix ClassLabel to/from dict when passed names_file #3695 (@albertvillanova)

Fix CI code quality issue #3710 (@albertvillanova)

Check if indices values in Dataset.select are within bounds #3719 (@mariosasko)

Pin pandas to avoid bug in streaming mode #3725 (@albertvillanova)

Use config pandas version in CSV dataset builder #3726 (@albertvillanova)

Fix dataset mirroring (@lhoestq)

Fix ValueError message formatting in int2str #3742 (@akulchik)

Patch all module attributes in its namespace #3727 (@albertvillanova)

Fix HfFileSystem docstring #3768 (@lhoestq)

Fix: dataset name is stored in keys #3772 (@thomasw21)

Fix Google Drive URL to avoid Virus scan warning #3787 (@albertvillanova)

Fix error message in CSV loader for newer Pandas versions #3798 (@mariosasko)

Pin responses to fix CI for Windows #3840 (@albertvillanova)

Full Changelog: https://github.com/huggingface/datasets/compare/1.18.3...1.18.4
1.18.3(Feb 2, 2022)
Bug fixes

Fix MP3 resampling when a dataset's audio files have different sampling rates by @lhoestq in https://github.com/huggingface/datasets/pull/3665

Extend dataset builder for streaming in get_dataset_split_names by @mariosasko in https://github.com/huggingface/datasets/pull/3657

Dataset changes

New: Turkic X-WMT evaluation set for machine translation by @mirzakhalov in https://github.com/huggingface/datasets/pull/3605

New: British Library books dataset by @davanstrien in https://github.com/huggingface/datasets/pull/3603

Fix: wiki_bio - Update link by @jxmorris12 in https://github.com/huggingface/datasets/pull/3651

Other improvements

sp. Columbia => Colombia by @serapio in https://github.com/huggingface/datasets/pull/3652

Run pyupgrade for Python 3.6+ by @bryant1410 in https://github.com/huggingface/datasets/pull/3560

New Contributors

@serapio made their first contribution in https://github.com/huggingface/datasets/pull/3652

@mirzakhalov made their first contribution in https://github.com/huggingface/datasets/pull/3605

Full Changelog: https://github.com/huggingface/datasets/compare/1.18.2...1.18.3
1.18.2(Jan 28, 2022)
Bug fixes

Fix streaming datasets that are not reset correctly by @lhoestq in https://github.com/huggingface/datasets/pull/3646

Fix numpy rngs when shuffling with seed=None by @mariosasko in https://github.com/huggingface/datasets/pull/3641

Fix dataset slicing with negative bounds when indices mapping is not None by @mariosasko in https://github.com/huggingface/datasets/pull/3642

Fix add_column on datasets with indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/3647
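Several of these fixes concern datasets that carry an indices mapping, that is, a list of row positions layered over the underlying table (as produced by operations such as shuffling or selecting). The toy class below sketches why negative bounds have to be resolved against the length of the mapped view and not the underlying table. IndexedView is a hypothetical name and the code is a simplified model, not the library's internals.

```python
from typing import List, Optional, Sequence


class IndexedView:
    """A read-only view over `rows`, optionally reordered/subset by `indices`."""

    def __init__(self, rows: Sequence, indices: Optional[List[int]] = None):
        self._rows = rows
        self._indices = indices  # None means "identity mapping"

    def __len__(self) -> int:
        return len(self._rows) if self._indices is None else len(self._indices)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # slice.indices() normalises negative bounds against the *view* length.
            return [self[i] for i in range(*key.indices(len(self)))]
        if key < 0:
            key += len(self)
        if not 0 <= key < len(self):
            raise IndexError("index out of range for this view")
        row = key if self._indices is None else self._indices[key]
        return self._rows[row]


view = IndexedView(list("abcde"), indices=[4, 2, 0])
last_two = view[-2:]
last_one = view[-1]
```

The view has three rows (e, c, a), so view[-2:] resolves to positions 1 and 2 of the mapping, not to positions in the five-row table.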

Other improvements

Update index.rst by @VioletteLepercq in https://github.com/huggingface/datasets/pull/3636

Fix Windows CI: bump python to 3.7 by @lhoestq in https://github.com/huggingface/datasets/pull/3648

New Contributors

@VioletteLepercq made their first contribution in https://github.com/huggingface/datasets/pull/3636

Full Changelog: https://github.com/huggingface/datasets/compare/1.18.1...1.18.2
1.18.1(Jan 26, 2022)
Improvements

Make decoding of Audio and Image feature optional by @mariosasko in https://github.com/huggingface/datasets/pull/3430

Bug fixes

Fix prepare_for_task() by @mariosasko in https://github.com/huggingface/datasets/pull/3614

Fix: Multilingual Librispeech - fix bad url formatting by @polinaeterna in https://github.com/huggingface/datasets/pull/3619

Full Changelog: https://github.com/huggingface/datasets/compare/1.18.0...1.18.1
1.18.0(Jan 21, 2022)
Datasets Changes

New: VCTK

Add VCTK dataset by @jaketae in https://github.com/huggingface/datasets/pull/3351

Fix VCTK encoding by @lhoestq in https://github.com/huggingface/datasets/pull/3493

Docs: Add VCTK dataset description by @jaketae in https://github.com/huggingface/datasets/pull/3500

New: CPPE-5 dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3517

New: RedCaps dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3424

New: WIDER FACE dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3413

New: SVHN dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3535

New: BNL newspapers by @davanstrien in https://github.com/huggingface/datasets/pull/3397

New: PASS dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3576

New: Text2log Dataset by @apergo-ai in https://github.com/huggingface/datasets/pull/3579

Update: beans, cats_vs_dogs - Use iter_files instead of str(Path(...)) in image dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3477

Update: PIB - update version and make it streamable by @albertvillanova in https://github.com/huggingface/datasets/pull/3496

Update: code_x_glue_tt_text_to_text, compguesswhat - Remove print statements in datasets by @mariosasko in https://github.com/huggingface/datasets/pull/3546

Update: MuchoCine - add missing tasks by @mariosasko in https://github.com/huggingface/datasets/pull/3571

Fix: Tashkeela - fix to yield stripped text by @albertvillanova in https://github.com/huggingface/datasets/pull/3471

Fix: asset - change to raw.githubusercontent.com URLs by @VictorSanh in https://github.com/huggingface/datasets/pull/3516

Fix: CC100 - use HTTPS for the data source URL by @aajanki in https://github.com/huggingface/datasets/pull/3519

Fix: vision datasets - Fix bug in ImageClassification task template by @mariosasko in https://github.com/huggingface/datasets/pull/3557

Fix: tweet_qa - fix DuplicatedKeysError and improve card by @mariosasko in https://github.com/huggingface/datasets/pull/3559

Fix: mC4 - fix multiple language downloading by @polinaeterna in https://github.com/huggingface/datasets/pull/3594

Fix: CoNLL2003:

Use old url for conll2003 by @lhoestq in https://github.com/huggingface/datasets/pull/3600

Update url for conll2003 by @lhoestq in https://github.com/huggingface/datasets/pull/3602

Add conll2003 licensing by @lhoestq in https://github.com/huggingface/datasets/pull/3601

Datasets Features

[Time series] Add support for time, date, duration, and decimal dtypes by @mariosasko in https://github.com/huggingface/datasets/pull/3591

[Image][Audio] Add flexible casting for Image and Audio + Support nested casting by @lhoestq in https://github.com/huggingface/datasets/pull/3575

Allows DatasetDict.filter to have batching option by @thomasw21 in https://github.com/huggingface/datasets/pull/3506

Add desc parameter to filter by @mariosasko in https://github.com/huggingface/datasets/pull/3513

Add gzip for to_json by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3492

Allow multiple task templates of the same type by @mariosasko in https://github.com/huggingface/datasets/pull/3562

Add parameter preserve_index to from_pandas by @Sorrow321 in https://github.com/huggingface/datasets/pull/3565

Dataset Streaming:

Fix str(Path(...)) conversion in streaming on Linux by @mariosasko in https://github.com/huggingface/datasets/pull/3472

Extend support for streaming datasets that use ET.parse by @albertvillanova in https://github.com/huggingface/datasets/pull/3476

Extend support for streaming datasets that use os.walk by @albertvillanova in https://github.com/huggingface/datasets/pull/3478

Metrics Changes

Add Mauve metric by @jthickstun in https://github.com/huggingface/datasets/pull/3573

Dataset cards

update pretty_name for first 200 datasets by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3498

update pretty_name for all the other datasets by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3536

pib: Update pib dataset card by @albertvillanova in https://github.com/huggingface/datasets/pull/3501

arabic_speech_corpus: Adding link to license. by @meg-huggingface in https://github.com/huggingface/datasets/pull/3524

Covost2: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3528

librispeech_asr: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3529

vivos: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3530

audio datasets: Audio datacard update - first pass by @meg-huggingface in https://github.com/huggingface/datasets/pull/3520

common_language: Update README.md by @meg-huggingface in https://github.com/huggingface/datasets/pull/3527

wiki_dpr: Update wiki_dpr README.md by @lhoestq in https://github.com/huggingface/datasets/pull/3534

qa4mre: Fix qa4mre tags by @lhoestq in https://github.com/huggingface/datasets/pull/3574

HellaSwag: Update HellaSwag README.md by @borgr in https://github.com/huggingface/datasets/pull/3588

ANLI: Update ANLI README.md by @borgr in https://github.com/huggingface/datasets/pull/3590

tweet_eval: Update README.md by @borgr in https://github.com/huggingface/datasets/pull/3593

Documentation

Fix rendering of docs by @albertvillanova in https://github.com/huggingface/datasets/pull/3470

Fix to_tf_dataset references in docs by @mariosasko in https://github.com/huggingface/datasets/pull/3514

added PII statements and license links to data cards by @mcmillanmajora in https://github.com/huggingface/datasets/pull/3537

Readme usage update by @meg-huggingface in https://github.com/huggingface/datasets/pull/3538

Update the CC-100 dataset card by @aajanki in https://github.com/huggingface/datasets/pull/3542

Research wording for nc licenses by @meg-huggingface in https://github.com/huggingface/datasets/pull/3539

Added links to licensing and PII message in vctk dataset by @mcmillanmajora in https://github.com/huggingface/datasets/pull/3523

Give clearer instructions to add the YAML tags by @albertvillanova in https://github.com/huggingface/datasets/pull/3532

General improvements and bug fixes

Fix overriding of filesystem info by @albertvillanova in https://github.com/huggingface/datasets/pull/3481

Update ADD_NEW_DATASET.md by @apergo-ai in https://github.com/huggingface/datasets/pull/3487

Fix weird spacing in ManualDownloadError message by @bryant1410 in https://github.com/huggingface/datasets/pull/3486

Clone full repo to detect new tags when mirroring datasets on the Hub by @lhoestq in https://github.com/huggingface/datasets/pull/3494

Remove unused phony rule from Makefile by @bryant1410 in https://github.com/huggingface/datasets/pull/3483

fix: 🐛 pass token when retrieving the split names by @severo in https://github.com/huggingface/datasets/pull/3545

Pin torchmetrics to fix the COMET test by @lhoestq in https://github.com/huggingface/datasets/pull/3589

Preserve encoding/decoding with features in Iterable.map call by @mariosasko in https://github.com/huggingface/datasets/pull/3556

New Contributors

@apergo-ai made their first contribution in https://github.com/huggingface/datasets/pull/3487

@bryant1410 made their first contribution in https://github.com/huggingface/datasets/pull/3486

@meg-huggingface made their first contribution in https://github.com/huggingface/datasets/pull/3527

@aajanki made their first contribution in https://github.com/huggingface/datasets/pull/3519

@Sorrow321 made their first contribution in https://github.com/huggingface/datasets/pull/3565

@jthickstun made their first contribution in https://github.com/huggingface/datasets/pull/3573

@borgr made their first contribution in https://github.com/huggingface/datasets/pull/3588

Full Changelog: https://github.com/huggingface/datasets/compare/1.17.0...1.18.0
1.17.0(Dec 21, 2021)
Dataset Changes

New: The Pile

Add The Pile dataset and PubMed Central subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3287

Add The Pile Free Law subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3359

Add The Pile USPTO subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3360

Add The Pile subsets by @albertvillanova in https://github.com/huggingface/datasets/pull/3378

Add The Pile Enron Emails subset by @albertvillanova in https://github.com/huggingface/datasets/pull/3427

New: British Library Books Genre by @davanstrien in https://github.com/huggingface/datasets/pull/3312

New: Americas NLI by @fdschmidt93 in https://github.com/huggingface/datasets/pull/3371

New: Speech commands by @polinaeterna in https://github.com/huggingface/datasets/pull/3335

New: eli5_category by @jingshenSN2 in https://github.com/huggingface/datasets/pull/3420

New: OneStopQa by @scaperex in https://github.com/huggingface/datasets/pull/3436

Update: LABR - make the dataset streamable by @albertvillanova in https://github.com/huggingface/datasets/pull/3352

Update: CLUE benchmark - update cluewsc2020, chid, c3 and tnews by @mariosasko in https://github.com/huggingface/datasets/pull/3376

Update: beans, cats_vs_dogs, cifar10, cifar100, fashion_mnist, mnist, head_qa: use the new Image feature type + streaming support by @mariosasko in https://github.com/huggingface/datasets/pull/3362

Update: CC100 - add Georgian data by @AnzorGozalishvili in https://github.com/huggingface/datasets/pull/3383

Update: disaster_response_messages - update download urls (+ add validation split) by @mariosasko in https://github.com/huggingface/datasets/pull/3426

Update: swahili_news - update to new version by @albertvillanova in https://github.com/huggingface/datasets/pull/3463

Fix: WikiAuto, Jeopardy, definite_pronoun_resolution - fix URLs by @LashaO in https://github.com/huggingface/datasets/pull/3266

Fix: QED - fix type of bridge field by @mariosasko in https://github.com/huggingface/datasets/pull/3417

Fix: ASSET - fix dataset data URLs by @tianjianjiang in https://github.com/huggingface/datasets/pull/3342

Dataset Features

Add Image feature by @mariosasko in https://github.com/huggingface/datasets/pull/3163

to_tf_dataset() refactor by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3356

More robust None handling by @mariosasko in https://github.com/huggingface/datasets/pull/3195

Add cast_column to IterableDataset by @mariosasko in https://github.com/huggingface/datasets/pull/3439

Support streaming zipped dataset repo by passing only repo name by @albertvillanova in https://github.com/huggingface/datasets/pull/3375

Extend support for streaming datasets that use pd.read_excel by @albertvillanova in https://github.com/huggingface/datasets/pull/3355

Extend iter_archive to support file object input by @albertvillanova in https://github.com/huggingface/datasets/pull/3443

Extend text to support yielding lines, paragraphs or documents by @albertvillanova in https://github.com/huggingface/datasets/pull/3442

Push dataset_infos.json to Hub to preserve feature types by @lhoestq in https://github.com/huggingface/datasets/pull/3467
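The three granularities mentioned for the text loader (lines, paragraphs or documents) can be pictured with the small function below. It is a sketch that assumes paragraphs are separated by a blank line and that a "document" is the whole input. The function name iter_samples is hypothetical, and the real loader may handle edge cases differently.

```python
from typing import Iterator


def iter_samples(text: str, sample_by: str = "line") -> Iterator[str]:
    """Yield samples from text at the requested granularity."""
    if sample_by == "line":
        yield from text.split("\n")
    elif sample_by == "paragraph":
        # Assumption for this sketch: a blank line separates paragraphs.
        for paragraph in text.split("\n\n"):
            if paragraph.strip():
                yield paragraph
    elif sample_by == "document":
        yield text
    else:
        raise ValueError(f"unsupported sample_by: {sample_by!r}")


text = "a\nb\n\nc"
lines = list(iter_samples(text, "line"))
paragraphs = list(iter_samples(text, "paragraph"))
documents = list(iter_samples(text, "document"))
```

The same input therefore produces four samples by line (including the empty line), two by paragraph and one by document.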

Dataset cards

Change TriviaQA license (#3313) by @avinashsai in https://github.com/huggingface/datasets/pull/3330

Add missing tags to XTREME by @mariosasko in https://github.com/huggingface/datasets/pull/3322

Remove duplicate name from dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/3354

Fix typos in dataset cards by @albertvillanova in https://github.com/huggingface/datasets/pull/3386

Fix duplicated tag in wikicorpus dataset card by @lhoestq in https://github.com/huggingface/datasets/pull/3458

Dataset Tasks

Create Language Modeling task by @albertvillanova in https://github.com/huggingface/datasets/pull/3387

Metric Changes

BLEURT: Match key names to correspond with filename by @jaehlee in https://github.com/huggingface/datasets/pull/3348

Fix links in metrics description by @albertvillanova in https://github.com/huggingface/datasets/pull/3461

Fix METEOR missing NLTK's omw-1.4 by @lhoestq in https://github.com/huggingface/datasets/pull/3469

Docs

Add ArrayXD docs by @stevhliu in https://github.com/huggingface/datasets/pull/3344

Document a training loop for streaming dataset by @lhoestq in https://github.com/huggingface/datasets/pull/3370

Fix formatting in IterableDataset.map docs by @mariosasko in https://github.com/huggingface/datasets/pull/3395

Correctly indent builder config in dataset script docs by @mariosasko in https://github.com/huggingface/datasets/pull/3432

Update BLEURT hyperlink by @lewtun in https://github.com/huggingface/datasets/pull/3437

Additional improvements and bug fixes

Quick fix error formatting by @NouamaneTazi in https://github.com/huggingface/datasets/pull/3328

Fix error message and add extension fallback by @mariosasko in https://github.com/huggingface/datasets/pull/3332

Avoid content-encoding issue while streaming datasets by @albertvillanova in https://github.com/huggingface/datasets/pull/3350

Fix JSON ClassLabel casting for integers by @lhoestq in https://github.com/huggingface/datasets/pull/3340

Better error message when download fails by @lhoestq in https://github.com/huggingface/datasets/pull/3343

Fix dict source_datasets tagset validator by @albertvillanova in https://github.com/huggingface/datasets/pull/3368

Fix typo in other-structured-to-text task tag by @albertvillanova in https://github.com/huggingface/datasets/pull/3367

Fix temporary dataset_path creation for URIs related to remote fs by @francisco-perez-sorrosal in https://github.com/huggingface/datasets/pull/3296

Fix flaky test of the temporary directory used by load_from_disk by @lhoestq in https://github.com/huggingface/datasets/pull/3388

More robust first elem check in encode/cast example by @mariosasko in https://github.com/huggingface/datasets/pull/3402

Fix module inference for archive with a directory by @albertvillanova in https://github.com/huggingface/datasets/pull/3406

Fix dependencies conflicts in Windows CI after conda update to 4.11 by @lhoestq in https://github.com/huggingface/datasets/pull/3410

Pass new_fingerprint in multiprocessing by @lhoestq in https://github.com/huggingface/datasets/pull/3409

Fix flaky test again for s3 serialization by @lhoestq in https://github.com/huggingface/datasets/pull/3412

Skip None encoding (line deleted by accident in #3195) by @mariosasko in https://github.com/huggingface/datasets/pull/3414

Clean squad dummy data by @lhoestq in https://github.com/huggingface/datasets/pull/3428

#3337 Add typing overloads to Dataset.__getitem__ for mypy by @Dref360 in https://github.com/huggingface/datasets/pull/3382

Make cast cacheable (again) on Windows by @mariosasko in https://github.com/huggingface/datasets/pull/3429

Use max number of data files to infer module by @albertvillanova in https://github.com/huggingface/datasets/pull/3407

Fix iter_archive generator by @albertvillanova in https://github.com/huggingface/datasets/pull/3454

[Staging] Update dataset repos automatically on the Hub by @lhoestq in https://github.com/huggingface/datasets/pull/3451

Update supported versions of Python in setup.py by @mariosasko in https://github.com/huggingface/datasets/pull/3438

raise exception instead of using assertions. by @manisnesan in https://github.com/huggingface/datasets/pull/3349

New Contributors

@avinashsai made their first contribution in https://github.com/huggingface/datasets/pull/3330

@NouamaneTazi made their first contribution in https://github.com/huggingface/datasets/pull/3328

@davanstrien made their first contribution in https://github.com/huggingface/datasets/pull/3312

@francisco-perez-sorrosal made their first contribution in https://github.com/huggingface/datasets/pull/3296

@LashaO made their first contribution in https://github.com/huggingface/datasets/pull/3266

@fdschmidt93 made their first contribution in https://github.com/huggingface/datasets/pull/3371

@polinaeterna made their first contribution in https://github.com/huggingface/datasets/pull/3335

@AnzorGozalishvili made their first contribution in https://github.com/huggingface/datasets/pull/3383

@tianjianjiang made their first contribution in https://github.com/huggingface/datasets/pull/3342

@jingshenSN2 made their first contribution in https://github.com/huggingface/datasets/pull/3420

@scaperex made their first contribution in https://github.com/huggingface/datasets/pull/3436

Full Changelog: https://github.com/huggingface/datasets/compare/1.16.1...1.17.0
1.16.1(Nov 26, 2021)
Bug fixes

Fix import datasets on python 3.10 by @lhoestq in https://github.com/huggingface/datasets/pull/3326

Fix wrongly converted assert by @eliasws in https://github.com/huggingface/datasets/pull/3323

1.16.0(Nov 26, 2021)
Datasets Changes

New: riddle_sense by @ziyiwu9494 in https://github.com/huggingface/datasets/pull/3161

New: Multi-Lingual LibriSpeech by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3198

New: XCSR by @yangxqiao in https://github.com/huggingface/datasets/pull/3074

New: CMU Hinglish DoG by @Ishan-Kumar2 in https://github.com/huggingface/datasets/pull/3149

New: Multidoc2dial by @sivasankalpp in https://github.com/huggingface/datasets/pull/3205

New: IndoNLI by @afaji in https://github.com/huggingface/datasets/pull/3307

Update: DaNE - updated URL for download by @MalteHB in https://github.com/huggingface/datasets/pull/3203

Update: xcopa - (fix checksum issues + add translated data) by @mariosasko in https://github.com/huggingface/datasets/pull/3254

Update: tatoeba - update to v2021-07-22 by @KoichiYasuoka in https://github.com/huggingface/datasets/pull/3225

Update: KILT - update metadata JSON by @albertvillanova in https://github.com/huggingface/datasets/pull/3276

Update: Covost 2 - update download instructions by @patrickvonplaten in https://github.com/huggingface/datasets/pull/3281

Update: Common Voice, OpenSLR, LibriSpeech ASR, Vivos - make several audio datasets streamable by @lhoestq in https://github.com/huggingface/datasets/pull/3290

Fix: tuple_ie - fix download url by @mariosasko in https://github.com/huggingface/datasets/pull/3213

Fix: id_newspapers_2018 - fix streaming by @lhoestq in https://github.com/huggingface/datasets/pull/3249

Fix: bookcorpusopen - fix RAM usage by @lhoestq in https://github.com/huggingface/datasets/pull/3280

Fix: Scielo - fix ConnectionError by @mariosasko in https://github.com/huggingface/datasets/pull/3260

Fix: tatoeba - fix URLs for a subset of xtreme by @mariosasko in https://github.com/huggingface/datasets/pull/3321

Datasets Features

Push to hub capabilities for Dataset and DatasetDict by @LysandreJik in https://github.com/huggingface/datasets/pull/3098:

Upload your dataset to the Hugging Face Hub with the push_to_hub() method.

See the push_to_hub documentation.

200+ datasets now support streaming:

Stream TAR-based dataset using iter_archive by @lhoestq in https://github.com/huggingface/datasets/pull/3110

Stream from Google Drive and other hosts by @lhoestq in https://github.com/huggingface/datasets/pull/3248

Support Audio feature in streaming mode by @albertvillanova in https://github.com/huggingface/datasets/pull/3133

Support Audio feature for TAR archives in sequential access by @albertvillanova in https://github.com/huggingface/datasets/pull/3129

Resolve data_files by split name automatically by @lhoestq in https://github.com/huggingface/datasets/pull/3221

File names are used to decide which file goes into which split.

See the documentation.

Filter method for batched=True by @thomasw21 in https://github.com/huggingface/datasets/pull/3244

Adding with_rank arg to pass process rank to map by @TevenLeScao in https://github.com/huggingface/datasets/pull/3314
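
As a rough illustration of the "Resolve data_files by split name automatically" item above, the sketch below assigns files to splits by looking for a split keyword in each file name. It is a simplified stand-in, not the library's resolution logic: the keyword table `SPLIT_KEYWORDS`, the function `infer_splits`, and the fallback of unmatched files to `train` are assumptions made for this example.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical keyword table; the patterns the library really uses may differ.
SPLIT_KEYWORDS = {
    "train": ["train"],
    "validation": ["validation", "valid", "dev"],
    "test": ["test"],
}


def infer_splits(file_names: List[str]) -> Dict[str, List[str]]:
    """Group file names by the split keyword found in them (illustrative only)."""
    splits = defaultdict(list)
    for name in file_names:
        # Break the path into lowercase alphanumeric tokens, e.g. "dev-00.json" -> dev, 00, json.
        tokens = re.split(r"[^a-z0-9]+", name.lower())
        for split, keywords in SPLIT_KEYWORDS.items():
            if any(keyword in tokens for keyword in keywords):
                splits[split].append(name)
                break
        else:
            # Assumption for this sketch: files without a keyword go to "train".
            splits["train"].append(name)
    return dict(splits)


if __name__ == "__main__":
    print(infer_splits(["train.csv", "data/dev-00.json", "test_set.csv"]))
```

Matching on whole tokens rather than substrings avoids false hits such as "latest.csv" being read as a test file.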

Dataset Cards

Add full tagset to conll2003 README by @BramVanroy in https://github.com/huggingface/datasets/pull/3230

Fix some contact information formats by @lhoestq in https://github.com/huggingface/datasets/pull/3274

Add wikipedia tags by @lhoestq in https://github.com/huggingface/datasets/pull/3301

Updating details of IRC disentanglement data by @jkkummerfeld in https://github.com/huggingface/datasets/pull/3259

Metrics Changes

New: OpenAI's pass@k code evaluation metric by @lvwerra in https://github.com/huggingface/datasets/pull/2916

Update: BLEURT - options to use updated bleurt checkpoints by @jaehlee in https://github.com/huggingface/datasets/pull/3235

Update: CER - update to support latest release by @mariosasko in https://github.com/huggingface/datasets/pull/3252

Update: WER - update to the documentation by @wooters in https://github.com/huggingface/datasets/pull/3278
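
For readers unfamiliar with the pass@k metric listed above: it is usually computed with an unbiased estimator that, given n generated samples of which c pass the unit tests, returns 1 - C(n-c, k) / C(n, k). The sketch below shows only that arithmetic. The function name `pass_at_k` is chosen for this example, and the metric shipped with the library also has to execute the generated code, which is not shown here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: total samples generated for a problem
    c: number of those samples that pass the tests
    k: budget of samples considered
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, so the estimate is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=10, c=5, k=1))
```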

Documentation

Add docs for to_tf_dataset by @stevhliu in https://github.com/huggingface/datasets/pull/3175

Small updates to to_tf_dataset documentation by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3215

Update link to Datasets Tagging app in Spaces by @albertvillanova in https://github.com/huggingface/datasets/pull/3194

Improve repository structure docs by @lhoestq in https://github.com/huggingface/datasets/pull/3233

Swap descriptions of v1 and raw-v1 configs of WikiText dataset and fix metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/3241

Add docs for audio processing by @stevhliu in https://github.com/huggingface/datasets/pull/3222

Add push_to_hub docs by @lhoestq in https://github.com/huggingface/datasets/pull/3319

Additional improvements and bug fixes

Catch token invalid error in CI by @lhoestq in https://github.com/huggingface/datasets/pull/3200

Pin keras version until TF fixes its release by @albertvillanova in https://github.com/huggingface/datasets/pull/3208

Fix disable_nullable default value to False by @lhoestq in https://github.com/huggingface/datasets/pull/3211

Fix code quality in riddle_sense dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/3218

Better error msg if len(predictions) doesn't match len(references) in metrics by @mariosasko in https://github.com/huggingface/datasets/pull/3160

Use huggingface_hub.HfApi to list datasets/metrics by @mariosasko in https://github.com/huggingface/datasets/pull/3121

Pin version exclusion for tensorflow incompatible with keras by @albertvillanova in https://github.com/huggingface/datasets/pull/3216

Group tests in multiprocessing workers by test file by @albertvillanova in https://github.com/huggingface/datasets/pull/3231

Fix load_from_disk temporary directory by @lhoestq in https://github.com/huggingface/datasets/pull/3245

[tiny] fix typo in stream docs by @nollied in https://github.com/huggingface/datasets/pull/3246

Avoid PyArrow type optimization if it fails by @mariosasko in https://github.com/huggingface/datasets/pull/3234

Remove redundant isort module placement by @mariosasko in https://github.com/huggingface/datasets/pull/3243

asserts replaced by exception for text classification task with test. by @manisnesan in https://github.com/huggingface/datasets/pull/3256

Add os.listdir for streaming by @lhoestq in https://github.com/huggingface/datasets/pull/3270

asserts replaced with exception for image classification task, csv, json by @manisnesan in https://github.com/huggingface/datasets/pull/3262

Force data files extraction if download_mode='force_redownload' by @mariosasko in https://github.com/huggingface/datasets/pull/3275

Minor Typo Fix - Precision to Recall by @SebastinSanty in https://github.com/huggingface/datasets/pull/3279

Decode audio from remote by @lhoestq in https://github.com/huggingface/datasets/pull/3271

Fix build_docs CI by @lhoestq in https://github.com/huggingface/datasets/pull/3286

Allow datasets with indices table when concatenating along axis=1 by @mariosasko in https://github.com/huggingface/datasets/pull/3288

f-string formatting by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3277

Unpin markdown for build_docs now that it's fixed by @lhoestq in https://github.com/huggingface/datasets/pull/3289

Pin version exclusion for Markdown by @albertvillanova in https://github.com/huggingface/datasets/pull/3293

Use f-strings in the dataset scripts by @Carlosbogo in https://github.com/huggingface/datasets/pull/3291

fix old_val typo in f-string by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3302

asserts replaced with exception for fingerprint.py, search.py, arrow_writer.py and metric.py by @Ishan-Kumar2 in https://github.com/huggingface/datasets/pull/3305

fix: files counted twice in inferred structure by @borisdayma in https://github.com/huggingface/datasets/pull/3309

Finish transition to PyArrow 3.0.0 by @mariosasko in https://github.com/huggingface/datasets/pull/3318

Removing query params for dynamic URL caching by @anton-l in https://github.com/huggingface/datasets/pull/3315

Citation

Update BibTeX entry by @albertvillanova in https://github.com/huggingface/datasets/pull/3223

Fix paper BibTeX citation with proceedings reference by @albertvillanova in https://github.com/huggingface/datasets/pull/3226

Add CITATION file by @albertvillanova in https://github.com/huggingface/datasets/pull/3228

Fix URL in CITATION file by @albertvillanova in https://github.com/huggingface/datasets/pull/3229

Deprecations

Deprecate prepare_module by @albertvillanova in https://github.com/huggingface/datasets/pull/3166

Full Changelog: https://github.com/huggingface/datasets/compare/1.15.1...1.16.0
1.15.1(Nov 2, 2021)
Dependencies

Bump huggingface_hub to 0.1.0 by @lhoestq in https://github.com/huggingface/datasets/pull/3199

1.15.0(Nov 2, 2021)
Dataset Changes

Update: JNLBA - add tags names by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3092

Update: OpenSLR - add SLR83 to OpenSLR by @tyrius02 in https://github.com/huggingface/datasets/pull/3125 and https://github.com/huggingface/datasets/pull/3176

Update: RONEC - update to v2 by @dumitrescustefan in https://github.com/huggingface/datasets/pull/3184

Fix: Arabic Billion Words - Fix script to return all data by @albertvillanova in https://github.com/huggingface/datasets/pull/3136

Fix: HLGD - fix label mapping by @VictorSanh in https://github.com/huggingface/datasets/pull/3180

Dataset Features

Allow dynamic first dimension for ArrayXD by @rpowalski in https://github.com/huggingface/datasets/pull/2891

add multi-proc in to_csv by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/2896

QOL improvements: auto-flatten_indices and desc in map calls by @mariosasko in https://github.com/huggingface/datasets/pull/3196

Dataset Cards

Fill in dataset card for NCBI disease dataset by @edugp in https://github.com/huggingface/datasets/pull/3115

Metrics Changes

New: metric for the MATH dataset (competition_math). by @hacobe in https://github.com/huggingface/datasets/pull/3020

New: Google BLEU (aka GLEU) metric by @slowwavesleep in https://github.com/huggingface/datasets/pull/3108

New: TER by @BramVanroy in https://github.com/huggingface/datasets/pull/3153

New: ChrF(++) by @BramVanroy in https://github.com/huggingface/datasets/pull/3187

General improvements and bug fixes

Correctly update metadata to preserve features when concatenating datasets with axis=1 by @mariosasko in https://github.com/huggingface/datasets/pull/3120

Fixes to to_tf_dataset by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3085

Add security policy to the project by @albertvillanova in https://github.com/huggingface/datasets/pull/2958

Update doc links to point to new docs by @mariosasko in https://github.com/huggingface/datasets/pull/3116

Fix caching bugs by @mariosasko in https://github.com/huggingface/datasets/pull/3141

Fix numpy deprecation warning for ragged tensors by @lhoestq in https://github.com/huggingface/datasets/pull/3137

Fixed: duplicate parameter and missing parameter in docstring by @PanQiWei in https://github.com/huggingface/datasets/pull/3157

Fix some typos in the documentation by @h4iku in https://github.com/huggingface/datasets/pull/3152

Fix string encoding for Value type by @lhoestq in https://github.com/huggingface/datasets/pull/3158

Fix CLI test to ignore verifications when saving infos by @albertvillanova in https://github.com/huggingface/datasets/pull/3147

Make inspect.get_dataset_config_names always return a non-empty list by @albertvillanova in https://github.com/huggingface/datasets/pull/3159

Fix issue with filelock filename being too long on encrypted filesystems by @mariosasko in https://github.com/huggingface/datasets/pull/3173

Asserts replaced by exceptions (huggingface#3171) by @joseporiolayats in https://github.com/huggingface/datasets/pull/3174

Preserve ordering in zip_dict by @mariosasko in https://github.com/huggingface/datasets/pull/3170

Don't memoize strings when hashing since two identical strings may have different python ids by @lhoestq in https://github.com/huggingface/datasets/pull/3182

Re-add faiss to windows testing suite by @BramVanroy in https://github.com/huggingface/datasets/pull/3151

Add missing docstring to DownloadConfig by @mariosasko in https://github.com/huggingface/datasets/pull/3183

More efficient nested features encoding by @eladsegal in https://github.com/huggingface/datasets/pull/3124

Fix optimized encoding for arrays by @lhoestq in https://github.com/huggingface/datasets/pull/3197
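
The item above about not memoizing strings when hashing can be illustrated with the standard `pickle` module, which memoizes objects by identity. The sketch assumes a hash computed over pickled bytes. The helper names `naive_hash` and `content_hash` are invented for this example and are not the library's functions.

```python
import hashlib
import json
import pickle


def naive_hash(obj) -> str:
    """Hash the pickled bytes; pickle's memo makes this depend on object identity."""
    return hashlib.sha256(pickle.dumps(obj, protocol=4)).hexdigest()


def content_hash(obj) -> str:
    """Hash a representation that depends only on the values (JSON-serializable input)."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()


# Two equal strings built at runtime are distinct objects in CPython.
a = "".join(["tok", "en"])
b = "".join(["tok", "en"])

same_object = [a, a]   # second element is written as a memo reference
equal_copies = [a, b]  # second element is written out in full

if __name__ == "__main__":
    print(same_object == equal_copies)
    print(naive_hash(same_object) == naive_hash(equal_copies))
    print(content_hash(same_object) == content_hash(equal_copies))
```

The two lists compare equal, yet their pickled bytes differ, so a fingerprint built on memoized pickling can change between runs that process identical data.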

1.14.0(Oct 19, 2021)
Dataset changes

Update: LexGLUE and MultiEURLEX README - update dataset cards #3075 (@iliaschalkidis)

Update: SUPERB - use Audio features #3101 (@anton-l)

Fix: Blog Authorship Corpus - fix URLs #3106 (@albertvillanova)

Dataset features

Add iter_archive #3066 (@lhoestq)
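
The idea behind iterating over an archive, rather than extracting it first, can be sketched with the standard `tarfile` module opened in stream mode. This is not the library's implementation of iter_archive. The helper names `iter_tar_members` and `build_demo_archive` are assumptions made for this example.

```python
import io
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_tar_members(fileobj: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, content) for each regular file, reading the archive front to back."""
    # "r|*" opens a forward-only stream with transparent compression detection,
    # so the archive never needs to be seekable or extracted to disk.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            extracted = archive.extractfile(member)
            if extracted is not None:
                # In stream mode a member must be read before moving to the next one.
                yield member.name, extracted.read()


def build_demo_archive() -> io.BytesIO:
    """Create a small in-memory TAR archive for the usage example."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in [("a.txt", b"first"), ("b.txt", b"second")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for path, content in iter_tar_members(build_demo_archive()):
        print(path, content)
```

Because members are consumed in order, the same loop works on a file object backed by a network response, which is what makes sequential access suitable for streaming.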

General improvements and bug fixes

Replace FSTimeoutError with parent TimeoutError #3100 (@albertvillanova)

Fix project description in PyPI #3103 (@albertvillanova)

Align tqdm control with cache control #3031 (@mariosasko)

Add paper BibTeX citation #3107 (@albertvillanova)

1.13.3(Oct 15, 2021)
Dataset changes

Update: Adapt all audio datasets #3081 (@patrickvonplaten)

Bug fixes

Update BibTeX entry #3090 (@albertvillanova)

Use template column_mapping to transmit_format instead of template features #3088 (@mariosasko)

Fix Audio feature mp3 resampling #3096 (@albertvillanova)


Main differences between `🤗Datasets` and `tfds`