Sparkling Pandas

Overview


SparklingPandas aims to make it easy to use the distributed computing power of PySpark to scale your data analysis with Pandas. SparklingPandas builds on Spark's DataFrame class to give you a polished, pythonic, and Pandas-like API.

Documentation

See SparklingPandas.com.

Videos

An early version of Sparkling Pandas was discussed in the talk "Sparkling Pandas: using Apache Spark to scale Pandas" by Holden Karau and Juliet Hougland.

Requirements

The primary requirement of SparklingPandas is a recent version of Spark (currently v1.4), available from http://spark.apache.org, along with Python 2.7.

Using

Make sure you have the SPARK_HOME environment variable set correctly, as SparklingPandas uses it to locate the PySpark libraries.

Other than that, you can install SparklingPandas with pip and simply import it.
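Since a missing SPARK_HOME is a common source of import failures, it can help to check the environment before importing. A minimal sketch (the helper name here is mine, not part of SparklingPandas):

```python
import os

def require_spark_home(env=os.environ):
    """Return SPARK_HOME or raise a helpful error.

    SparklingPandas uses SPARK_HOME to locate the PySpark libraries,
    so checking it up front gives a clearer message than an import error.
    """
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError(
            "SPARK_HOME is not set; point it at your Spark installation")
    return spark_home
```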

State

This is in early development. Feedback is taken seriously and is seriously appreciated. As you can tell, we SparklingPandas are a pretty serious bunch.

Support

Check out our Google group at https://groups.google.com/forum/#!forum/sparklingpandas

Comments
  • Rework testing cmd.


This allows us to run nosetests directly from the command line. It shifts the responsibility of setting up the environment to the testing module's init, which gets called exactly once when the module is first used. There are two problems in the current version:

    1. test_apply_map fails with:
    ...
    File "/Users/juliet/src/sparklingpandas/sparklingpandas/custom_functions.py", line 29, in registerSQLExtensions
    sc._jvm.com.sparklingpandas.functions.registerUdfs(scala_SQLContext)
    File "/Users/juliet/bin/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 726, in __getattr__
    raise Py4JError('Trying to call a package.')
    Py4JError: Trying to call a package.
    
2. That failure cascades into other test failures because the Spark contexts are not properly cleaned up. The class defining those tests does extend SparklingPandasTestCase, so tearDown should be called after each test runs. Not sure what is up with that.
    opened by hougs 3
  • Added ability to read json files. Unit test included.


    This is an initial fix for https://github.com/sparklingpandas/sparklingpandas/issues/47 .

    Please let me know if you have any feedback.

    Thank you, Michal

    opened by michalmonselise 3
  • Doctests to nosetests


Move almost all doctests to unit tests that are then run through nosetests instead of a custom script. When run in Travis, we will use the --logging-level=INFO and --detailed-errors flags.

    opened by hougs 3
  • Initial setup related issue


Hello, I'm not sure if this should be raised as an issue, but I couldn't find another way to ask this question. I have downloaded the sparklingpandas zip file.

    I've added the folder to PYTHONPATH variable.

    When I am trying to import it from python interactive shell, I'm getting an import error.

>>> import sparklingpandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named sparklingpandas

I cannot use 'pip install' because pip is not installed on the machine I'm using, and I don't have much control over the machine since I'm not the root user.
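When pip is unavailable, the usual fix for this error is to put the directory that *contains* the package on sys.path (or PYTHONPATH), not the package directory itself. A stdlib-only sketch, using a hypothetical package name:

```python
import os
import sys
import tempfile

# Build a fake unpacked project: parent_dir/mypackage/__init__.py
# ("mypackage" is a hypothetical stand-in for the real package name).
parent_dir = tempfile.mkdtemp()
pkg_dir = os.path.join(parent_dir, "mypackage")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("VERSION = '0.0.1'\n")

# Add the PARENT directory to sys.path; adding pkg_dir itself would
# reproduce the "No module named ..." error.
sys.path.insert(0, parent_dir)
import mypackage
```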

    opened by gayatri-devarakonda 2
  • Install issue


    Hi, I'm getting this error message when I try to install:

    Running setup.py (path:/tmp/pip_build_root/sparklingpandas/setup.py) egg_info for package sparklingpandas
        Traceback (most recent call last):
          File "<string>", line 17, in <module>
          File "/tmp/pip_build_root/sparklingpandas/setup.py", line 12, in <module>
            long_description=open('README.md').read(),
        IOError: [Errno 2] No such file or directory: 'README.md'

    opened by ghost 2
  • New panda image and minor README cleanup


    • fixes Travis image link so that it's actually a link
    • uses new sparkling panda image (commissioned work executed by itzelspixels)
      • drops file size from 1.2M to 26K
      • 1,600% increase in sparkles
    • corrects displayed image orientation, with no more vertical space used
      • now eating bamboo (metaphor for data???)
      • arguably increased cuteness
    • removed empty header line (likely spurious formatting)

    @julbright collaborated on key thought-leadership

    opened by ajschumacher 2
  • pip install sparklingpandas failed


    Error message:

    Traceback (most recent call last):
          File "<string>", line 20, in <module>
          File "/private/tmp/pip-build-ibiITf/SparklingPandas/setup.py", line 12, in <module>
            long_description=open('README.md').read(),
        IOError: [Errno 2] No such file or directory: 'README.md'

    opened by messense 2
  • Bump numpy from 1.9.2 to 1.21.0


    Bumps numpy from 1.9.2 to 1.21.0.

    Release notes

    Sourced from numpy's releases.

    v1.21.0

    NumPy 1.21.0 Release Notes

    The NumPy 1.21.0 release highlights are

    • continued SIMD work covering more functions and platforms,
    • initial work on the new dtype infrastructure and casting,
    • universal2 wheels for Python 3.8 and Python 3.9 on Mac,
    • improved documentation,
    • improved annotations,
    • new PCG64DXSM bitgenerator for random numbers.

    In addition there are the usual large number of bug fixes and other improvements.

    The Python versions supported for this release are 3.7-3.9. Official support for Python 3.10 will be added when it is released.

    :warning: Warning: there are unresolved problems compiling NumPy 1.21.0 with gcc-11.1.

    • Optimization level -O3 results in many wrong warnings when running the tests.
    • On some hardware NumPy will hang in an infinite loop.

    New functions

    Add PCG64DXSM BitGenerator

    Uses of the PCG64 BitGenerator in a massively-parallel context have been shown to have statistical weaknesses that were not apparent at the first release in numpy 1.17. Most users will never observe this weakness and are safe to continue to use PCG64. We have introduced a new PCG64DXSM BitGenerator that will eventually become the new default BitGenerator implementation used by default_rng in future releases. PCG64DXSM solves the statistical weakness while preserving the performance and the features of PCG64.

    See upgrading-pcg64 for more details.

    (gh-18906)

    Expired deprecations

    • The shape argument of numpy.unravel_index can no longer be passed as the dims keyword argument. (Was deprecated in NumPy 1.16.)

    ... (truncated)

    Commits
    • b235f9e Merge pull request #19283 from charris/prepare-1.21.0-release
    • 34aebc2 MAINT: Update 1.21.0-notes.rst
    • 493b64b MAINT: Update 1.21.0-changelog.rst
    • 07d7e72 MAINT: Remove accidentally created directory.
    • 032fca5 Merge pull request #19280 from charris/backport-19277
    • 7d25b81 BUG: Fix refcount leak in ResultType
    • fa5754e BUG: Add missing DECREF in new path
    • 61127bb Merge pull request #19268 from charris/backport-19264
    • 143d45f Merge pull request #19269 from charris/backport-19228
    • d80e473 BUG: Removed typing for == and != in dtypes
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 1
  • Docs


    Changed a few internal function names to be more descriptive and added numpy-style method-level docs. I like the numpy style, and Sphinx works well with it. I think we should move to this style of docs. Thoughts?

    opened by hougs 1
  • Add pylint and fix errors.


    There are two errors left related to computing kurtosis. I wanted to push this so that a starting point for fixing them exists. For Travis to build, this only requires that there are no errors; other types of flags slide by.

    opened by hougs 1
  • Bump numpy from 1.9.2 to 1.22.0


    Bumps numpy from 1.9.2 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • to_spark_sql() adds a column called "index"

    After converting to a Spark SQL DataFrame, there is an index column that gets added to the DataFrame.

    Reproducible example:

    import csv
    import tempfile

    input = [["dwarves", "uid"], ["happy", 3], ["grumpy", 4], ["dopey", 5]]

    temp_file = tempfile.NamedTemporaryFile(delete=False)
    with open(temp_file.name, 'wb') as f:
        writer = csv.writer(f)
        writer.writerows(input)

    df = psc.read_csv(temp_file.name)
    df_sql = df.to_spark_sql()
    df_sql.columns

    The output here is: ['index', 'dwarves', 'uid']
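For reference, plain pandas shows the same behavior when a default RangeIndex is turned into a regular column, which is plausibly where the extra column comes from when the pandas index is carried into the Spark SQL schema:

```python
import pandas as pd

# A frame with the default RangeIndex, mirroring the issue's data.
df = pd.DataFrame({"dwarves": ["happy", "grumpy", "dopey"],
                   "uid": [3, 4, 5]})

# reset_index() promotes the index to a column named "index".
flattened = df.reset_index()
print(list(flattened.columns))  # ['index', 'dwarves', 'uid']
```

Dropping the column afterwards (`df_sql.drop('index')` in Spark, or `df.reset_index(drop=True)` on the pandas side) is a plausible workaround.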

    opened by michalmonselise 1
  • First implementation of merge.


    This iteration only covers the two cases covered by spark 1.5:

    1. Merging using "on" with only one column
    2. Merging using "left_on" and "right_on" - there is no restriction on the number of columns used, but they must differ in name

    Currently, merging without specifying "on", "left_on", or "right_on" is not supported. This feature will be supported starting with Spark 1.6. Refer to Spark issue https://issues.apache.org/jira/browse/SPARK-10246 for more information.

    By default, the "copy" argument in the merge function does not apply to the Spark framework but has been left in the function for completeness.
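The two supported cases can be illustrated with plain pandas, whose merge semantics this implementation aims to mirror (the data here is illustrative):

```python
import pandas as pd

left = pd.DataFrame({"uid": [3, 4], "dwarf": ["happy", "grumpy"]})
right = pd.DataFrame({"uid": [3, 4], "mood": ["cheerful", "irritable"]})

# Case 1: merge on a single shared column name.
m1 = pd.merge(left, right, on="uid")

# Case 2: differently named key columns via left_on/right_on.
right2 = right.rename(columns={"uid": "user_id"})
m2 = pd.merge(left, right2, left_on="uid", right_on="user_id")
```

In case 2 both key columns survive in the result, which matches the requirement above that the names differ.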

    opened by michalmonselise 0
  • get_dummies() with sparklingpandas


    I am looking for a way to do something like pandas get_dummies() on Spark - is it planned to add something like this anytime soon?

    If not, could you point me in the right direction on how to implement something like get_dummies() with Spark DataFrames?
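For reference, this is the pandas behavior in question; a Spark equivalent would typically collect the distinct values of the column and build one indicator column per value (this sketch is plain pandas, not SparklingPandas API):

```python
import pandas as pd

# One indicator column per distinct value, columns sorted by value.
s = pd.Series(["happy", "grumpy", "happy"])
dummies = pd.get_dummies(s)
print(list(dummies.columns))  # ['grumpy', 'happy']
```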

    opened by pixelsebi 2