SparkDataset
Provides instant access to many datasets right from Pyspark (in Spark DataFrame structure).
Drop a star if you like the project.
What?
The idea is simple. There are various datasets available out there, but they are scattered in different places over the web. Is there a quick way (in Pyspark) to access them instantly without going through the hassle of searching, downloading, and reading ... etc? SparkDataset tries to address that question :)
Usage:
Start with importing data()
:
from sparkdataset import data
- To load a dataset:
titanic = data('titanic')
- To display the documentation of a dataset:
data('titanic', show_doc=True)
- To see the available datasets:
data()
- To search for datasets with terms
data('ab')
Did you mean:
crabs, abbey, Vocab
That's it.
Go to this notebook for a demonstration of the functionality
Why?
In R
, there is a very easy and immediate way to access multiple statistical datasets, in almost no effort. All it takes is one line > data(dataset_name)
. This makes the life easier for quick prototyping and testing. Well, I am jealous that Pyspark does not have a similar functionality. Thus, the aim of sparkdataset
is to fill that gap.
Currently, sparkdataset
has about 757 (mostly numerical-based) datasets, that are based on RDatasets
. In the future, I plan to scale it to include a larger set of datasets. For example,
- include textual data for NLP-related tasks, and
- allow adding a new dataset to the in-module repository.
Installation:
$ pip install sparkdataset
Uninstall:
$ pip uninstall sparkdataset
$ rm -rf $HOME/.sparkdataset
Changelog
1.0.0
- Added search dataset by name similarity.
- Example:
>>> data('heat')
Did you mean:
Wheat, heart, Heating, Yeast, eidat, badhealth, deaths, agefat, hla, heptathlon, azt
- Added support to Windows.
Dependency:
- pandas
- pyspark :: 3.1.2
Miscellaneous:
- Tested on OSX and Linux (debian).
- Supports both Python 3 (3.8.8 and above).
TODO:
- add textual datasets (e.g. NLTK stuff).
- add samples generators.
Thanks to:
- RDatasets: R's datasets collection.