Sparkling Pandas

Last update: Oct 27, 2022

Overview

SparklingPandas

SparklingPandas aims to make it easy to use the distributed computing power of PySpark to scale your data analysis with Pandas. SparklingPandas builds on Spark's DataFrame class to give you a polished, pythonic, and Pandas-like API.

Documentation

See SparklingPandas.com.

Videos

An early version of Sparkling Pandas was discussed in the talk "Sparkling Pandas - using Apache Spark to scale Pandas" by Holden Karau and Juliet Hougland.

Requirements

SparklingPandas requires a recent version of Spark (currently v1.4; see http://spark.apache.org) and Python 2.7.

Using

Make sure the SPARK_HOME environment variable is set correctly; SparklingPandas uses it to locate the PySpark libraries.

Other than that, you can install SparklingPandas with pip and import it.
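As a rough illustration of what "locating the PySpark libraries" from SPARK_HOME can involve, here is a minimal sketch. It assumes the usual layout of a Spark binary distribution (a python directory with a bundled py4j zip under python/lib, as seen in the traceback paths further down this page). The helper name pyspark_paths is hypothetical and is not part of SparklingPandas.

```python
import glob
import os


def pyspark_paths(spark_home):
    """Return the sys.path entries PySpark needs under `spark_home`.

    Hypothetical helper: assumes the standard Spark distribution layout.
    """
    python_dir = os.path.join(spark_home, "python")
    if not os.path.isdir(python_dir):
        raise EnvironmentError(
            "no 'python' directory under SPARK_HOME: %r" % spark_home)
    # py4j ships as a zip whose file name contains a version number,
    # so match it with a pattern instead of hard-coding the version.
    py4j_zips = sorted(
        glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips
```

A caller could read os.environ.get("SPARK_HOME"), fail early with a clear message if it is unset, and otherwise prepend the returned entries to sys.path before importing pyspark.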

State

This is in early development. Feedback is taken seriously and is seriously appreciated. As you can tell, we SparklingPandas are a pretty serious bunch.

Support

Check out our Google group at https://groups.google.com/forum/#!forum/sparklingpandas

Comments

Rework testing cmd.
This allows us to run nosetests directly from the command line. It shifts the responsibility for setting up the environment to the testing module's __init__, which is called exactly once, when the module is first used. There are two problems in the current version:

test_apply_map fails with:

    ...
      File "/Users/juliet/src/sparklingpandas/sparklingpandas/custom_functions.py", line 29, in registerSQLExtensions
        sc._jvm.com.sparklingpandas.functions.registerUdfs(scala_SQLContext)
      File "/Users/juliet/bin/spark-1.4.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 726, in __getattr__
        raise Py4JError('Trying to call a package.')
    Py4JError: Trying to call a package.

That failure cascades into other test failures because the Spark contexts are not cleaned up properly. The class defining those tests does extend SparklingPandasTestCase, so tearDown should be called after each test runs. Not sure what is up with that.
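One possible explanation, offered as an assumption rather than a diagnosis of this code base: unittest does not call tearDown when setUp itself raises, whereas callbacks registered with addCleanup still run. The sketch below demonstrates that behaviour with a stand-in context object; FakeContext and hooks_run_when_setup_fails are hypothetical names used only for illustration.

```python
import unittest


def hooks_run_when_setup_fails(use_add_cleanup):
    """Run one test whose setUp raises and report which hooks ran."""
    events = []

    class FakeContext(object):
        """Hypothetical stand-in for a Spark context."""

        def stop(self):
            events.append("context stopped")

    class Case(unittest.TestCase):
        def setUp(self):
            self.ctx = FakeContext()
            if use_add_cleanup:
                # Registered cleanups run even if setUp fails later on.
                self.addCleanup(self.ctx.stop)
            raise RuntimeError("setUp failed after the context was created")

        def tearDown(self):
            # Skipped by unittest whenever setUp raised.
            events.append("tearDown")

        def test_anything(self):
            events.append("test body")

    result = unittest.TestResult()
    unittest.defaultTestLoader.loadTestsFromTestCase(Case).run(result)
    return events
```

If the Py4JError above is raised while setUp is still running, registering the context shutdown with addCleanup immediately after the context is created would be one way to avoid leaking it into later tests.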
opened by hougs 3
Added ability to read JSON files. Unit test included.

This is an initial fix for https://github.com/sparklingpandas/sparklingpandas/issues/47 .

Please let me know if you have any feedback.

Thank you, Michal

opened by michalmonselise 3
Doctests to nosetests

Move almost all doctests to unit tests that are then run through nosetests instead of a custom script. When run in Travis, we will use --logging-level=INFO and --detailed-errors.

opened by hougs 3
Initial setup related issue

Hello, I'm not sure if this can be raised as an issue, but I couldn't find another way to ask this question. I have downloaded the sparklingpandas zip file.

I've added the folder to PYTHONPATH variable.

When I am trying to import it from python interactive shell, I'm getting an import error.

    >>> import sparklingpandas
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: No module named sparklingpandas

I cannot use 'pip install' because pip is not installed on the machine I'm using, and I don't have much control over the machine since I'm not its root user.
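A common cause of this kind of ImportError (an assumption here, since the report does not show the directory layout) is that the PYTHONPATH entry points at the package directory itself, or at a wrapping folder from the zip, rather than at the directory that directly contains the sparklingpandas package. The following sketch checks that condition; provides_package and entries_providing are hypothetical helper names.

```python
import os


def provides_package(path_entry, package):
    """True if `path_entry` directly contains `package` as a regular package."""
    return os.path.isfile(os.path.join(path_entry, package, "__init__.py"))


def entries_providing(package, pythonpath):
    """Return the entries of a PYTHONPATH-style string that provide `package`."""
    entries = [entry for entry in pythonpath.split(os.pathsep) if entry]
    return [entry for entry in entries if provides_package(entry, package)]
```

Calling entries_providing("sparklingpandas", os.environ.get("PYTHONPATH", "")) returns an empty list when none of the configured entries is the parent directory of the package.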

opened by gayatri-devarakonda 2

Install issue

Hi, I'm getting this error message when I try to install:

Running setup.py (path:/tmp/pip_build_root/sparklingpandas/setup.py) egg_info for package sparklingpandas
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip_build_root/sparklingpandas/setup.py", line 12, in <module>
        long_description=open('README.md').read(),
    IOError: [Errno 2] No such file or directory: 'README.md'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 17, in <module>
      File "/tmp/pip_build_root/sparklingpandas/setup.py", line 12, in <module>
        long_description=open('README.md').read(),
    IOError: [Errno 2] No such file or directory: 'README.md'
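The traceback shows setup.py reading README.md unconditionally, so the install fails whenever that file is absent from the downloaded archive. As a minimal sketch of a more defensive approach (not the project's actual fix), a setup script could fall back to a default description. The helper name read_long_description is hypothetical.

```python
import io
import os


def read_long_description(path, default=""):
    """Return the text of `path`, or `default` if the file does not exist."""
    if not os.path.exists(path):
        return default
    with io.open(path, encoding="utf-8") as handle:
        return handle.read()
```

In setup.py this would be used as long_description=read_long_description('README.md'). Making sure the README is actually shipped in the source distribution would address the underlying problem; the fallback only keeps the install from crashing.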

opened by ghost 2

New panda image and minor README cleanup
fixes Travis image link so that it's actually a link

uses new sparkling panda image (commissioned work executed by itzelspixels)

drops file size from 1.2M to 26K

1,600% increase in sparkles

correct displayed image orientation, but no more vertical space used

now eating bamboo (metaphor for data???)

arguably increased cuteness

removed empty header line (likely spurious formatting)

@julbright collaborated on key thought-leadership
opened by ajschumacher 2

pip install sparklingpandas failed

Error message:

    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/private/tmp/pip-build-ibiITf/SparklingPandas/setup.py", line 12, in <module>
        long_description=open('README.md').read(),
    IOError: [Errno 2] No such file or directory: 'README.md'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/private/tmp/pip-build-ibiITf/SparklingPandas/setup.py", line 12, in <module>
        long_description=open('README.md').read(),
    IOError: [Errno 2] No such file or directory: 'README.md'

opened by messense 2

Bump numpy from 1.9.2 to 1.21.0
Bumps numpy from 1.9.2 to 1.21.0.

Release notes

Sourced from numpy's releases.

v1.21.0

NumPy 1.21.0 Release Notes

The NumPy 1.21.0 release highlights are

continued SIMD work covering more functions and platforms,

initial work on the new dtype infrastructure and casting,

universal2 wheels for Python 3.8 and Python 3.9 on Mac,

improved documentation,

improved annotations,

new PCG64DXSM bitgenerator for random numbers.

In addition there are the usual large number of bug fixes and other improvements.

The Python versions supported for this release are 3.7-3.9. Official support for Python 3.10 will be added when it is released.

Warning: there are unresolved problems compiling NumPy 1.21.0 with gcc-11.1.

Optimization level -O3 results in many wrong warnings when running the tests.

On some hardware NumPy will hang in an infinite loop.

New functions

Add PCG64DXSM BitGenerator

Uses of the PCG64 BitGenerator in a massively-parallel context have been shown to have statistical weaknesses that were not apparent at the first release in numpy 1.17. Most users will never observe this weakness and are safe to continue to use PCG64. We have introduced a new PCG64DXSM BitGenerator that will eventually become the new default BitGenerator implementation used by default_rng in future releases. PCG64DXSM solves the statistical weakness while preserving the performance and the features of PCG64.

See upgrading-pcg64 for more details.

(gh-18906)

Expired deprecations

The shape argument numpy.unravel_index cannot be passed as dims keyword argument anymore. (Was deprecated in NumPy 1.16.)

... (truncated)

Commits

b235f9e Merge pull request #19283 from charris/prepare-1.21.0-release

34aebc2 MAINT: Update 1.21.0-notes.rst

493b64b MAINT: Update 1.21.0-changelog.rst

07d7e72 MAINT: Remove accidentally created directory.

032fca5 Merge pull request #19280 from charris/backport-19277

7d25b81 BUG: Fix refcount leak in ResultType

fa5754e BUG: Add missing DECREF in new path

61127bb Merge pull request #19268 from charris/backport-19264

143d45f Merge pull request #19269 from charris/backport-19228

d80e473 BUG: Removed typing for == and != in dtypes

Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot use these labels will set the current labels as the default for future PRs for this repo and language

@dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language

@dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language

@dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the Security Alerts page.

dependencies
opened by dependabot[bot] 1
Docs

Changed a few internal function names to be more descriptive and added numpy style method level docs. I like the numpy style, and sphinx works well with it. I think we should move to this style of doc. Thoughts?

opened by hougs 1
Add pylint and fix errors.

There are two errors left, both related to computing kurtosis. I wanted to push this so that a starting point for fixing them exists. For Travis to build, this only requires that there are no errors; other types of flags slide by.

opened by hougs 1
Bump numpy from 1.9.2 to 1.22.0
Bumps numpy from 1.9.2 to 1.22.0.

Release notes

Sourced from numpy's releases.

v1.22.0

NumPy 1.22.0 Release Notes

NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.

A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.

NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.

New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.

A new configurable allocator for use by downstream projects.

These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

Expired deprecations

Deprecated numeric style dtype strings have been removed

Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

(gh-19539)

Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

(gh-19615)

... (truncated)

Commits

4adc87d Merge pull request #20685 from charris/prepare-for-1.22.0-release

fd66547 REL: Prepare for the NumPy 1.22.0 release.

125304b wip

c283859 Merge pull request #20682 from charris/backport-20416

5399c03 Merge pull request #20681 from charris/backport-20954

f9c45f8 Merge pull request #20680 from charris/backport-20663

794b36f Update armccompiler.py

d93b14e Update test_public_api.py

7662c07 Update init.py

311ab52 Update armccompiler.py

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
to_spark_sql() adds a column called "index"

After converting to a spark sql dataframe there is an index column that gets added to the dataframe.

Reproducible example:

    import csv
    import tempfile

    input = [["dwarves", "uid"], ["happy", 3], ["grumpy", 4], ["dopey", 5]]

    temp_file = tempfile.NamedTemporaryFile(delete=False)
    with open(temp_file.name, 'wb') as f:
        writer = csv.writer(f)
        writer.writerows(input)

    df = psc.read_csv(temp_file.name)
    df_sql = df.to_spark_sql()
    df_sql.columns

The output here is: ['index', 'dwarves', 'uid']

opened by michalmonselise 1
First implementation of merge.
This iteration covers only the two cases supported by Spark 1.5:

Merging using "on" using only one column

Merging using "left_on" and "right_on" - there is no restriction on number of columns used but they must differ in name

Currently, merging without specifying "on", "left_on" or "right_on" is not supported. This will be supported starting with Spark 1.6. Refer to Spark issue https://issues.apache.org/jira/browse/SPARK-10246 for more information.

By default, the "copy" argument in the merge function does not apply to the Spark framework but has been left in the function for completeness.
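To make the two supported cases concrete, here is a small pure-Python sketch of an inner join over lists of dicts. It is not the SparklingPandas implementation and does not use Spark. The function name merge_rows and the simplification that left-hand values win on non-key name clashes are assumptions made for illustration.

```python
def merge_rows(left, right, on=None, left_on=None, right_on=None):
    """Inner-join two lists of dicts on the given key columns."""
    if on is not None:
        # Case 1: a single shared column name.
        left_on, right_on = [on], [on]
    if not left_on or not right_on:
        # Mirrors the unsupported case described above.
        raise ValueError("specify 'on', or both 'left_on' and 'right_on'")
    if len(left_on) != len(right_on):
        raise ValueError("'left_on' and 'right_on' must have the same length")

    # Index the right-hand rows by their key tuple.
    index = {}
    for row in right:
        key = tuple(row[column] for column in right_on)
        index.setdefault(key, []).append(row)

    merged = []
    for row in left:
        key = tuple(row[column] for column in left_on)
        for match in index.get(key, []):
            combined = dict(match)
            combined.update(row)  # left-hand values win on name clashes
            merged.append(combined)
    return merged
```

With left_on and right_on (case 2), both key columns are kept in the result because their names differ.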
opened by michalmonselise 0
get_dummies() with sparklingpandas

I am looking for a way to do something like pandas get_dummies() on Spark. Is it planned to add something like this anytime soon?

If not, could you point me in the right direction for implementing something like get_dummies() with Spark DataFrames?
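For orientation, the computation behind get_dummies() has two steps: collect the distinct values of a column, then emit one 0/1 indicator column per value. The pure-Python sketch below shows only that logic. It is not a Spark or SparklingPandas API, and the function name one_hot is hypothetical; on a Spark DataFrame the same two steps would presumably apply, with the distinct values gathered first.

```python
def one_hot(values, prefix):
    """Return one dict of 0/1 indicators per input value.

    Indicator names follow the "<prefix>_<value>" pattern.
    """
    # Step 1: the distinct categories, in a stable order.
    categories = sorted(set(values))
    # Step 2: one indicator per category for every input value.
    return [
        {"%s_%s" % (prefix, category): int(value == category)
         for category in categories}
        for value in values
    ]
```

For example, one_hot(["happy", "grumpy", "happy"], "dwarf") yields rows with the keys dwarf_grumpy and dwarf_happy.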

opened by pixelsebi 2

Owner

GitHub http://sparklingpandas.com/
