Koalas: pandas API on Apache Spark

Overview

pandas API on Apache Spark
Explore Koalas docs »

Live notebook · Issues · Mailing list
Help Thirsty Koalas Devastated by Recent Fires

The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark.

pandas is the de facto standard (single-node) DataFrame implementation in Python, while Spark is the de facto standard for big data processing. With this package, you can:

  • Be immediately productive with Spark, with no learning curve, if you are already familiar with pandas.
  • Have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (distributed datasets).

We would love to have you try it and give us feedback, through our mailing lists or GitHub issues.

Try the Koalas 10 minutes tutorial on a live Jupyter notebook here. The initial launch can take up to several minutes.


Getting Started

Koalas can be installed in many ways such as Conda and pip.

# Conda
conda install koalas -c conda-forge
# pip
pip install koalas

See Installation for more details.

For Databricks Runtime, Koalas is pre-installed in Databricks Runtime 7.1 and above. Try Databricks Community Edition for free. You can also follow these steps to manually install a library on Databricks.

Lastly, if your PyArrow version is 0.15+ and your PySpark version is lower than 3.0, it is best to set the ARROW_PRE_0_15_IPC_FORMAT environment variable to 1 manually. Koalas will try to set it for you, but it cannot do so if a Spark context has already been launched.
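
For example, a minimal sketch of setting it yourself (the variable name comes from the note above; the rest is just an illustration of a typical setup) before any Spark context is created:

# Set the flag before a Spark context exists
import os
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

import databricks.koalas as ks  # safe to import after the variable is set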

Now you can turn a pandas DataFrame into a Koalas DataFrame that is API-compliant with the former:

import databricks.koalas as ks
import pandas as pd

pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})

# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)

# Rename the columns
df.columns = ['x', 'y', 'z1']

# Do some operations in place:
df['x2'] = df.x * df.x

For more details, see Getting Started and Dependencies in the official documentation.

Contributing Guide

See Contributing Guide and Design Principles in the official documentation.

FAQ

See FAQ in the official documentation.

Best Practices

See Best Practices in the official documentation.

Koalas Talks and Blogs

See Koalas Talks and Blogs in the official documentation.

Comments
  • Understanding Groupbyapply

    Understanding Groupbyapply

    Hello there, firstly thank you for such an amazing package that bridges the gap between pandas and PySpark. I started using Koalas approximately 1 week back and everything was intuitive until I stumbled upon Koalas GroupBy.apply.

    Code:

    import databricks.koalas as ks

    def _koalas_train(frame):
        out_frame = frame.copy()
        out_frame = frame['trans_type_value'].sum()  # per-group sum of 'trans_type_value'
        return out_frame

    if __name__ == '__main__':
        ks_df = ks.DataFrame(features_data)
        ks_df_info_abt_train = ks_df.groupby(['div_nbr', 'store_nbr']).apply(_koalas_train)

    Here features_data is a pandas DataFrame.

    Output from Koalas.Groupby.Apply:

    Screen Shot 2019-09-27 at 2 27 26 PM

    Output from Pandas.Groupby.Apply: Screen Shot 2019-09-27 at 2 30 13 PM

    As you can see, the output from the pandas groupby-apply is as expected, but the output from the Koalas groupby-apply is not right. Could you guide me in the right direction by pointing out any logical mistake that I might have made, or anything else? Thank you once again.

    Koalas version - 0.18.0 Pandas version - 0.23.4 PySpark - 2.4.3
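
    A direct aggregation gives the intended per-group sum without relying on GroupBy.apply; this is only a minimal sketch reusing the column names from the issue's code above:

    # per-group sum of 'trans_type_value', equivalent to what the apply above intends
    ks_df_sums = ks_df.groupby(['div_nbr', 'store_nbr'])['trans_type_value'].sum()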

    opened by devarshml 29
  • Allow querying DataFrame directly in sql method

    Allow querying DataFrame directly in sql method

    While I really like the idea of @rxin's recent #256 PR, he uses an (in my opinion) over-simplistic example of ks.sql("select * from range(10) where id > 7"). I believe that the ability to query actual Koalas DataFrames through SQL can prove really valuable to many users. However, when trying to use ks.sql with a Koalas DataFrame, the following exception occurs:

    kdf = ks.DataFrame({'A': [1,2,3]})
    >>> ks.sql("select * from kdf")
    ...
    org.apache.spark.sql.AnalysisException: Table or view not found: kdf; line 1 pos 14
    ...
    

    This is not surprising to someone with PySpark knowledge who knows that kdf has to be registered as a temporary table before being able to use it with SparkSQL. Unfortunately, (as I understand it) the target group of the Koalas library should not be expected to be Spark experts. To get the above example working, the following workaround is needed, which requires the usage of the lower-level (Py)Spark API, thus somewhat defeating the purpose of Koalas.

    >>> from pyspark import SparkContext
    >>> from pyspark.sql import SQLContext
    >>> sc = SparkContext.getOrCreate()
    >>> sql = SQLContext(sc)
    >>> sql.registerDataFrameAsTable(kdf._sdf, "kdf")
    >>> ks.sql("select * from kdf")
       __index_level_0__  A                                                         
    0                  0  1
    1                  1  2
    2                  2  3
    # Optionally clean-up by dropping the temporary table
    >>> sql.dropTempTable("kdf")
    

    Wouldn't it be much more convenient if this "temporary table magic" were instead handled by Koalas behind the scenes, or are there any design objections against such an approach?
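
    A minimal sketch of what such behind-the-scenes handling could look like (a hypothetical helper, not an existing Koalas API), registering each Koalas DataFrame as a temporary view before delegating to ks.sql:

    import databricks.koalas as ks

    def sql_with_frames(query, **frames):
        for name, kdf in frames.items():
            kdf.to_spark().createOrReplaceTempView(name)  # register each frame as a temp view
        return ks.sql(query)

    sql_with_frames("select * from kdf", kdf=ks.DataFrame({'A': [1, 2, 3]}))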

    enhancement discussions 
    opened by floscha 27
  • Introduce plotting.backend configuration with Plotly support

    Introduce plotting.backend configuration with Plotly support

    Aims to fix #1626. Each backend returns the figure in its own format, allowing for further editing or customization if required.

    How to use?:

    
    import databricks.koalas as ks
    
    kdf = ks.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "B", "C", "D"])
    kdf.plot(title="Example Figure") # defaults to backend="matplotlib"
    

    image

    kdf.plot(backend="pandas_bokeh", title="Example Figure")
    ## same as:
    # ks.options.plotting.backend = "pandas_bokeh"
    # kdf.plot(title="Example Figure")
    

    image

    fig = kdf.plot(backend="plotly", title="Example Figure", height=500, width=500)
    fig.show()
    

    image

    # further edits can be made to the figure
    fig.update_layout(template="plotly_dark")
    fig.show()
    

    image

    opened by DumbMachine 23
  • Basic plot functionality for Series

    Basic plot functionality for Series

    As mentioned in #293 , this PR creates Series.plot functions for plotting data in Koalas.Series.

    The idea is to use pandas.plotting._core as a base for inheritance, as well as to copy some functions/methods from it, and then adjust them to compute the necessary summarized data using Spark.

    opened by dvgodoy 23
  • Fix comparison operators to treat NULL as False

    Fix comparison operators to treat NULL as False

    This PR proposes to:

    • Fix column comparison operators to treat NULL as False
    • Resolve #999.

    pandas

    pandas treats NULL as False

    >>> pser
    0    0.0
    1    1.0
    2    2.0
    3    NaN
    dtype: float64
    
    >>> pser == 0
    0     True
    1    False
    2    False
    3    False  <- bool
    dtype: bool
    

    Koalas

    Koalas currently treats NULL as NULL

    >>> kser
    0    0.0
    1    1.0
    2    2.0
    3    NaN
    Name: 0, dtype: float64
    
    # CURRENT
    >>> kser == 0
    0     True
    1    False
    2    False
    3     None  <- not bool
    Name: 0, dtype: object
    
    # PROPOSED
    >>> kser == 0
    0     True
    1    False
    2    False
    3    False  <- bool
    Name: 0, dtype: bool
    
    opened by harupy 21
  • Plotly on koalas dataframes

    Plotly on koalas dataframes

    I can easily use plotly's interactive charts with the two lines of code below:

    import pandas as pd
    pd.options.plotting.backend = "plotly"

    However, I am unable to use Plotly with Koalas DataFrames. Is there a workaround, or can there be an enhancement to include this feature?

    enhancement discussions 
    opened by jainayush007 20
  • Implements delete() for Index

    Implements delete() for Index

    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.delete.html#pandas.Index.delete

    >>> kidx = ks.Index([10, 10, 9, 8, 4, 2, 4, 4, 2, 2, 10, 10])
    >>> kidx
    Int64Index([10, 10, 9, 8, 4, 2, 4, 4, 2, 2, 10, 10], dtype='int64')
    
    >>> kidx.delete(0)
    Int64Index([10, 9, 8, 4, 2, 4, 4, 2, 2, 10, 10], dtype='int64')
    
    >>> kidx.delete([0, 1, 2, 3, 10, 11])
    Int64Index([4, 2, 4, 4, 2, 2], dtype='int64')
    
    opened by itholic 20
  • Implement the first batch of Serialization / IO / Conversion functions

    Implement the first batch of Serialization / IO / Conversion functions

    See https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#serialization-io-conversion

    Looks like we can easily implement almost all of them by calling toPandas().func_name().

    One thing is that some of the functions support max_rows. When that argument is specified, we should add a limit call in Spark to avoid moving all the data to the driver.
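
    A minimal sketch of that approach for one such function (a hypothetical wrapper; self._sdf is the underlying Spark DataFrame, as used elsewhere in this thread):

    def to_html(self, max_rows=None, **kwargs):
        sdf = self._sdf
        if max_rows is not None:
            sdf = sdf.limit(max_rows)  # avoid collecting the whole dataset to the driver
        return sdf.toPandas().to_html(max_rows=max_rows, **kwargs)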

    The list to add in the first batch are:

    • [x] to_dict (see #169)
    • [x] to_excel (#288)
    • [x] to_html (we already have this, but let's add a limit when max_rows is set), done in #206
    • [x] to_latex (#297)
    • [x] to_records (#298)
    • [x] to_string (done in #211 and #213)
    • [x] to_clipboard (#257)

    Skipping the following because I don't know how popular they are:

    • to_pickle
    • to_hdf
    • to_stata
    • to_msgpack
    • to_records
    • to_sparse
    • to_dense

    The following might require parallelization with Pandas UDFs, rather than collecting all the data to the driver, so leaving them for the future:

    • to_sql
    • to_gbq

    I'm also not adding json and csv here. We need to design those properly because both Spark and Pandas have those.

    help wanted good first issue 
    opened by rxin 20
  • ValueError when reading dict with None

    ValueError when reading dict with None

    I find that reading a dict

    row =  {'a': [1], 'b':[None]}
    ks.DataFrame(row)
    
    ValueError: can not infer schema from empty or null dataset
    

    but for pandas there is no error

    row =  {'a': [1], 'b':[None]}
    print(pd.DataFrame(row))
    
       a     b
    0  1  None
    

    I have tried setting dtype=np.int64 but this has not helped.
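
    A possible workaround sketch (an assumption, not a confirmed fix): build the frame in pandas first, cast the all-None column to a concrete dtype so Spark can infer a schema, and then convert:

    import pandas as pd
    import databricks.koalas as ks

    row = {'a': [1], 'b': [None]}
    kdf = ks.from_pandas(pd.DataFrame(row).astype({'b': 'float64'}))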

    opened by nickhalmagyi 19
  • Add Property iat for DataFrame & Series

    Add Property iat for DataFrame & Series

    iat for DataFrame & Series https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html#pandas.DataFrame.iat

    For DataFrame

    >>> df = ks.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
    ...                   columns=['A', 'B', 'C'])
    >>> df
        A   B   C
    0   0   2   3
    1   0   4   1
    2  10  20  30
    
    >>> df.iat[1, 2]
    1
    

    For Series

    >>> kser = ks.Series([1, 2, 3], index=[10, 20, 30])
    >>> kser
    10    1
    20    2
    30    3
    Name: 0, dtype: int64
    >>> kser.iat[1]
    2
    

    More examples are in tests/test_indexing.py, including MultiIndex & MultiIndex columns tests.

    opened by itholic 17
  • Triage Top Missing APIs to explicitly don't support or implement.

    Triage Top Missing APIs to explicitly don't support or implement.

    @ueshin wrote and ran some notebooks to analyze statistics from missing API calls as below. We might need to explicitly mark these as unsupported with a proper workaround in the exception message, and/or implement them first.

    • [x] Series.__iter__(self)
    • [x] DataFrame.apply(self, func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, args=(), **kwds) #1259
    • [x] Series.values
    • [x] Index.values
    • [x] DataFrame.info(self, verbose=None, buf=None, max_cols=None, memory_usage=None, null_counts=None) - https://github.com/databricks/koalas/issues/872
    • [x] Index.to_numpy(self, dtype=None, copy=False)
    • [x] DataFrame.iterrows(self)
    • [x] DataFrame.unstack(self, level=-1, fill_value=None) #1295
    • [x] DataFrame.rename(self, mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None)
    • [ ] DataFrame.tail(self, n=5)
    • [x] Index.__iter__(self)
    • [x] DataFrame.where(self, cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False, raise_on_error=None)
    • [x] DataFrame.values
    • [x] DataFrameGroupBy.head(self, n=5)
    • [x] DataFrame.query(self, expr, inplace=False, **kwargs)
    • [x] DataFrame.take(self, indices, axis=0, is_copy=True, **kwargs) #1292
    • [x] DataFrame.take(self, indices, axis=0, convert=None, is_copy=True, **kwargs) #1292
    • [x] Index.tolist(self)
    • [ ] SeriesGroupBy.unique
    • [x] Series.keys
    • [x] Series.replace
    enhancement 
    opened by HyukjinKwon 17
  • Attribute Error: module 'numpy' has no attribute 'bool'

    Attribute Error: module 'numpy' has no attribute 'bool'

    Reading a CSV file using Koalas gives the error below. I believe this is because NumPy removed np.bool in release 1.24.0.

    Attribute Error: module 'numpy' has no attribute 'bool'

        # BooleanType
    --> elif tpe in (bool, np.bool, "bool", "?"):
            return types.BooleanType()
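
    A commonly used stopgap until the library is updated (an assumption, not an official fix) is to pin numpy below 1.24, or to restore the removed alias before importing Koalas:

    import numpy as np

    if not hasattr(np, "bool"):
        np.bool = bool  # shim for code still referencing the removed np.bool alias

    import databricks.koalas as ks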

    opened by akanshkatyayan 1
  • Koalas.idxmin() is not picking the minimum value from a dataframe, but pandas.idxmin() gives

    Koalas.idxmin() is not picking the minimum value from a dataframe, but pandas.idxmin() gives

    Hi, I have a Koalas DataFrame with age and income. I calculated z-scores on age and income, and then the norm is calculated from age_zscore and income_zscore (the new column name is sq_dist). Then I tried to do an idxmin on the new column, but it is not giving the minimum value. I did the same operations on a pandas DataFrame, and it gives the minimum value.

    Please find attached the notebook for step by step operations I performed.

    cmd1

    import databricks.koalas as ks
    import pandas as pd
    import random

    cmd2

    # Create sample dataframe in Koalas
    df = ks.DataFrame.from_dict({
        'Age': [random.randint(0, 100000) for i in range(100000)],
        'Income': [random.randint(0, 100000) for i in range(100000)]
    })

    print(df.head(5))

    cmd3

    import scipy.stats as stats
    import numpy as np
    ks.set_option('compute.ops_on_diff_frames', True)
    df['Income_zscore'] = ks.Series(stats.zscore(df['Income'].to_numpy()))
    df['Age_zscore'] = ks.Series(stats.zscore(df['Age'].to_numpy()))
    df['sq_dist'] = [np.linalg.norm(i) for i in df[['Income_zscore', 'Age_zscore']].to_numpy()]
    ks.set_option('compute.ops_on_diff_frames', False)

    cmd4

    # display(df)

    cmd5

    # calculate min of sq_dist
    minindex = df['sq_dist'].idxmin()
    minindex

    cmd6

    # display min value of sq_dist
    df['sq_dist'].iloc[minindex]

    cmd7

    df.to_spark().createOrReplaceTempView("koalastable")

    cmd8

    %sql select min(sq_dist) from koalastable -- This doesn't match the value we got in cmd6

    cmd9

    # do the same operations with pandas
    df_spark = df.to_spark()
    stats_array = np.array(df_spark.select('Age', 'Income').collect())
    normalized_data = stats.zscore(stats_array, axis=0)
    df_pd = pd.DataFrame(data=normalized_data, columns=['Age', 'Income'])
    df_pd['sq_dist'] = [np.linalg.norm(i) for i in normalized_data]
    df_pd.head(5)

    cmd10

    minindex_pd = df_pd['sq_dist'].idxmin()
    minindex_pd

    cmd11

    # minimum of sq_dist using pandas
    df_pd['sq_dist'].iloc[minindex_pd]

    cmd12

    spark.createDataFrame(df_pd).createOrReplaceTempView("pandastable")

    cmd13

    %sql select min(sq_dist) from pandastable -- This matches the value we got in cmd11

    opened by nikeshv 1
  • Spammed with FutureWarnings that are unfilterable

    Spammed with FutureWarnings that are unfilterable

    When performing any kind of iterable operation, e.g. enc['example'] = enc.example.apply(lambda x: 1880 if pd.notnull(x) & ((x >= 1880) & (x < 1890)) else x)

    I am constantly spammed with:

    /path/miniconda/lib/python3.8/site-packages/pyspark/python/lib/pyspark.zip/pyspark/pandas/internal.py:1573: FutureWarning: iteritems is deprecated and will be removed in a future version. Use .items instead.

    I believe this is due to the new pandas update.

    opened by CowboyViking 0
  • pyspark dataframe converting to koalas dataframe has different elements

    pyspark dataframe converting to koalas dataframe has different elements

    Code:

    import databricks.koalas as ks
    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    
    if __name__ == '__main__':
        conf = SparkConf().setAppName("test")
    
        spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
        sdf = spark.sql("select uid,vr_id, gender, follow_count, byfollow_count, is_click "
                        "from database.table where data_date=20220726 "
                        "and uid=249462081764458496 limit 5")
        sdf.show(n=20)
    
        print("=======================to_koalas===============================")
        df = sdf.to_koalas()
    
        category_features_df = df[["uid", "vr_id", "gender"]].fillna(0)
        dense_features_df = df[["follow_count", "byfollow_count"]].fillna(0)
    
        y = df["is_click"].values
    
        print("category_features_df: {}".format(category_features_df))
        print("dense_features_df: {}".format(dense_features_df))
    
        total_uids = category_features_df["uid"].unique().tolist()
        total_vids = category_features_df["vr_id"].unique().tolist()
        uid_id2index = {uid: i for i, uid in enumerate(total_uids)}
        uid_index2id = {i: uid for uid, i in uid_id2index.items()}
        vid_id2index = {vid: i for i, vid in enumerate(total_vids)}
        vid_index2id = {i: vid for vid, i in vid_id2index.items()}
        print(f"uid_id2index: {uid_id2index}")
        print(f"vid_id2index: {vid_id2index}")
    

    The result:

    +------------------+------------------+------+------------+--------------+--------+
    |               uid|             vr_id|gender|follow_count|byfollow_count|is_click|
    +------------------+------------------+------+------------+--------------+--------+
    |249462081764458496|234389742446182400|     0|           4|             2|       0|
    |249462081764458496|247965851351777280|     0|           4|             2|       0|
    |249462081764458496|303938736226304000|     0|           4|             2|       0|
    |249462081764458496|305220054218178560|     0|           4|             2|       0|
    |249462081764458496|150357127037190144|     0|           4|             2|       0|
    +------------------+------------------+------+------------+--------------+--------+
    
    =======================to_koalas===============================
    /mnt/softwares/my_env/lib/python3.6/site-packages/databricks/koalas/generic.py:603: UserWarning: We recommend using `Series.to_numpy()` instead.
      warnings.warn("We recommend using `{}.to_numpy()` instead.".format(type(self).__name__))
    category_features_df:                   uid               vr_id  gender
    0  249462081764458496  239951849459810304       0
    1  249462081764458496  218479966654824448       0
    2  249462081764458496  269598027864342528       0
    3  249462081764458496  306587488548290560       0
    4  249462081764458496  270454206781980672       0
    dense_features_df:    follow_count  byfollow_count
    0             4               2
    1             4               2
    2             4               2
    3             4               2
    4             4               2
    uid_id2index: {249462081764458496: 0}
    vid_id2index: {298760687402876928: 0, 306851269564170240: 1, 306601561927188480: 2, 269902057735979008: 3, 286263993075499008: 4}
    

    Why is sdf different from df?

    opened by Alxe1 0
  • data type conversion error

    data type conversion error

    I can run my data processing successfully with pandas, but when I switch to Koalas, there are lots of data type errors, like these:

    1. with type DataFrame: did not recognize Python value type when inferring an Arrow data type
    2. <class 'str'>: (<class 'py4j.protocol.Py4JError'>, Py4JError('An error occurred while calling None.None'))

    I think this error happens when transferring Python data to Java. How can I solve it? Thanks!

    opened by hrxx 1
  • pyspark is not required when install koalas

    pyspark is not required when install koalas

    koalas is a great package.

    when I install the package, all requirements are as below:

    pip install koalas==1.8.2
    Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
    Collecting koalas==1.8.2
      Using cached https://pypi.tuna.tsinghua.edu.cn/packages/28/9a/d69cf12ea62116873b427e5843be8ae8431b18f2a0714d6f4eec3ee4cda6/koalas-1.8.2-py3-none-any.whl (390 kB)
    Requirement already satisfied: numpy>=1.14 in /Users/celential-bing/.pyenv/versions/3.8.12/envs/time_machine/lib/python3.8/site-packages (from koalas==1.8.2) (1.21.5)
    Requirement already satisfied: pandas>=0.23.2 in /Users/celential-bing/.pyenv/versions/3.8.12/envs/time_machine/lib/python3.8/site-packages (from koalas==1.8.2) (1.3.5)
    Requirement already satisfied: pyarrow>=0.10 in /Users/celential-bing/.pyenv/versions/3.8.12/envs/time_machine/lib/python3.8/site-packages (from koalas==1.8.2) (7.0.0)
    Requirement already satisfied: pytz>=2017.3 in /Users/celential-bing/.pyenv/versions/3.8.12/envs/time_machine/lib/python3.8/site-packages (from pandas>=0.23.2->koalas==1.8.2) (2021.1)
    Requirement already satisfied: python-dateutil>=2.7.3 in /Users/celential-bing/.pyenv/versions/3.8.12/envs/time_machine/lib/python3.8/site-packages (from pandas>=0.23.2->koalas==1.8.2) (2.8.2)
    Requirement already satisfied: six>=1.5 in /Users/celential-bing/.pyenv/versions/3.8.12/envs/time_machine/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas>=0.23.2->koalas==1.8.2) (1.16.0)
    Installing collected packages: koalas
    Successfully installed koalas-1.8.2

    but it also needs pyspark, for example when I start a service:

    ImportError: Unable to import pyspark - consider doing a pip install with [spark] extra to install pyspark with pip
    Traceback (most recent call last):
      File "/Users/celential-bing/.pyenv/versions/time_machine/lib/python3.8/site-packages/databricks/koalas/init.py", line 49, in assert_pyspark_version
        import pyspark
    ModuleNotFoundError: No module named 'pyspark'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/Users/celential-bing/time-machine/timemachine/app.py", line 1, in <module>
        from timemachine import app, dapp
      File "/Users/celential-bing/time-machine/timemachine/init.py", line 301, in <module>
        app, dapp, schema = create_app()
      File "/Users/celential-bing/time-machine/timemachine/init.py", line 57, in create_app
        raise e
      File "/Users/celential-bing/time-machine/timemachine/init.py", line 54, in create_app
        return TimeMachineInitializer(app).init_app()
      File "/Users/celential-bing/time-machine/timemachine/init.py", line 212, in init_app
        self.init_app_in_ctx()
      File "/Users/celential-bing/time-machine/timemachine/init.py", line 176, in init_app_in_ctx
        self.init_views()
      File "/Users/celential-bing/time-machine/timemachine/init.py", line 69, in init_views
        from timemachine.views.base import (
      File "/Users/celential-bing/time-machine/timemachine/views/base.py", line 13, in <module>
        from timemachine.models.base import Module, Lambda
      File "/Users/celential-bing/time-machine/timemachine/models/base.py", line 14, in <module>
        from timemachine.engines import current_engine, DF
      File "/Users/celential-bing/time-machine/timemachine/engines/init.py", line 9, in <module>
        from databricks.koalas import DataFrame as SparkDataFrame
      File "/Users/celential-bing/.pyenv/versions/time_machine/lib/python3.8/site-packages/databricks/koalas/init.py", line 72, in <module>
        assert_pyspark_version()
      File "/Users/celential-bing/.pyenv/versions/time_machine/lib/python3.8/site-packages/databricks/koalas/init.py", line 51, in assert_pyspark_version
        raise ImportError(
    ImportError: Unable to import pyspark - consider doing a pip install with [spark] extra to install pyspark with pip

    So I suggest adding pyspark to requirements.txt. I didn't find the file, so I'm opening an issue instead.

    opened by bingwork 0
Releases (latest: v1.8.2)
  • v1.8.2(Oct 19, 2021)

  • v1.8.1(Jun 18, 2021)

    Koalas 1.8.1 is a maintenance release. Koalas will be officially included in PySpark in the upcoming Apache Spark 3.2. In Apache Spark 3.2+, please use Apache Spark directly.

    Improvements and bug fixes

    • Remove the upperbound for numpy. (#2166)
    • Allow Python 3.9 when the underlying PySpark is 3.1 and above. (#2167)

    Along with the following fixes:

    • Support x and y properly in plots (both matplotlib and plotly). (#2172)
    • Fix Index.difference to work properly. (#2173)
    • Fix backward compatibility for Python version 3.5.*. (#2174)
    Source code(tar.gz)
    Source code(zip)
  • v1.8.0(May 3, 2021)

    Koalas 1.8.0 is the last minor release because Koalas will be officially included in PySpark in the upcoming Apache Spark 3.2. In Apache Spark 3.2+, please use Apache Spark directly.

    Categorical type and ExtensionDtype

    We added support for pandas' categorical type (#2064, #2106).

    >>> s = ks.Series(list("abbccc"), dtype="category")
    >>> s
    0    a
    1    b
    2    b
    3    c
    4    c
    5    c
    dtype: category
    Categories (3, object): ['a', 'b', 'c']
    >>> s.cat.categories
    Index(['a', 'b', 'c'], dtype='object')
    >>> s.cat.codes
    0    0
    1    1
    2    1
    3    2
    4    2
    5    2
    dtype: int8
    >>> idx = ks.CategoricalIndex(list("abbccc"))
    >>> idx
    CategoricalIndex(['a', 'b', 'b', 'c', 'c', 'c'],
                     categories=['a', 'b', 'c'], ordered=False, dtype='category')
    
    >>> idx.codes
    Int64Index([0, 1, 1, 2, 2, 2], dtype='int64')
    >>> idx.categories
    Index(['a', 'b', 'c'], dtype='object')
    

    ExtensionDtype can also be used as a type argument to annotate return types (#2120, #2123, #2132, #2127, #2126, #2125, #2124):

    def func() -> ks.Series[pd.Int32Dtype()]:
        ...
    

    Other new features, improvements and bug fixes

    We added the following new features:

    DataFrame:

    • first (#2128)
    • at_time (#2116)

    Series:

    • at_time (#2130)
    • first (#2128)
    • between_time (#2129)

    DatetimeIndex:

    • indexer_between_time (#2104)
    • indexer_at_time (#2109)
    • between_time (#2111)

    Along with the following fixes:

    • Support tuple to (DataFrame|Series).replace() (#2095)
    • Check index_dtype and data_dtypes more strictly. (#2100)
    • Return actual values via toPandas. (#2077)
    • Add lines and orient to read_json and to_json to improve error message (#2110)
    • Fix isin to accept numpy array (#2103)
    • Allow multi-index column names for inferring return type schema with names. (#2117)
    • Add a short JDBC user guide (#2148)
    • Remove upper bound pandas 1.2 (#2141)
    • Standardize exceptions of arithmetic operations on Datetime-like data (#2101)
    Source code(tar.gz)
    Source code(zip)
  • v1.7.0(Mar 8, 2021)

    Switch the default plotting backend to Plotly

    We switched the default plotting backend from Matplotlib to Plotly (#2029, #2033). In addition, we added more Plotly methods such as DataFrame.plot.kde and Series.plot.kde (#2028).

    import databricks.koalas as ks
    kdf = ks.DataFrame({
        'a': [1, 2, 2.5, 3, 3.5, 4, 5],
        'b': [1, 2, 3, 4, 5, 6, 7],
        'c': [0.5, 1, 1.5, 2, 2.5, 3, 3.5]})
    kdf.plot.hist()
    

    Koalas_plotly_hist_plot

    Plotting backend can be switched to matplotlib by setting ks.options.plotting.backend to matplotlib.

    ks.options.plotting.backend = "matplotlib"
    

    Add Int64Index, Float64Index, DatetimeIndex

    We added more types of Index such as Int64Index, Float64Index and DatetimeIndex (#2025, #2066).

    Previously, creating an index always returned an Index instance regardless of the data type.

    Now an Int64Index, Float64Index or DatetimeIndex is returned depending on the data type of the index.

    >>> type(ks.Index([1, 2, 3]))
    <class 'databricks.koalas.indexes.numeric.Int64Index'>
    >>> type(ks.Index([1.1, 2.5, 3.0]))
    <class 'databricks.koalas.indexes.numeric.Float64Index'>
    >>> type(ks.Index([datetime.datetime(2021, 3, 9)]))
    <class 'databricks.koalas.indexes.datetimes.DatetimeIndex'>
    

    In addition, we added many properties for DatetimeIndex such as year, month, day, hour, minute, second, etc. (#2074) and added APIs for DatetimeIndex such as round(), floor(), ceil(), normalize(), strftime(), month_name() and day_name() (#2082, #2086, #2089).
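
    A minimal sketch of these DatetimeIndex additions (assuming the constructor accepts datetime-like strings, as in pandas):

    import databricks.koalas as ks

    didx = ks.DatetimeIndex(["2021-03-09 09:30:00", "2021-03-10 14:45:00"])
    didx.year                  # year of each timestamp
    didx.day_name()            # e.g. 'Tuesday', 'Wednesday'
    didx.normalize()           # truncate timestamps to midnight
    didx.strftime("%Y-%m-%d")  # format timestamps as strings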

    Create Index from Series or Index objects

    Index can be created by taking Series or Index objects (#2071).

    >>> kser = ks.Series([1, 2, 3], name="a", index=[10, 20, 30])
    >>> ks.Index(kser)
    Int64Index([1, 2, 3], dtype='int64', name='a')
    >>> ks.Int64Index(kser)
    Int64Index([1, 2, 3], dtype='int64', name='a')
    >>> ks.Float64Index(kser)
    Float64Index([1.0, 2.0, 3.0], dtype='float64', name='a')
    
    >>> kser = ks.Series([datetime(2021, 3, 1), datetime(2021, 3, 2)], index=[10, 20])
    >>> ks.Index(kser)
    DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
    >>> ks.DatetimeIndex(kser)
    DatetimeIndex(['2021-03-01', '2021-03-02'], dtype='datetime64[ns]', freq=None)
    

    Extension dtypes support

    We added basic extension dtypes support (#2039).

    >>> kdf = ks.DataFrame(
    ...     {
    ...         "a": [1, 2, None, 3],
    ...         "b": [4.5, 5.2, 6.1, None],
    ...         "c": ["A", "B", "C", None],
    ...         "d": [False, None, True, False],
    ...     }
    ... ).astype({"a": "Int32", "b": "Float64", "c": "string", "d": "boolean"})
    >>> kdf
          a    b     c      d
    0     1  4.5     A  False
    1     2  5.2     B   <NA>
    2  <NA>  6.1     C   True
    3     3  NaN  <NA>  False
    >>> kdf.dtypes
    a      Int32
    b    float64
    c     string
    d    boolean
    dtype: object
    

    The following types are supported, depending on the installed pandas version:

    • pandas >= 0.24
      • Int8Dtype
      • Int16Dtype
      • Int32Dtype
      • Int64Dtype
    • pandas >= 1.0
      • BooleanDtype
      • StringDtype
    • pandas >= 1.2
      • Float32Dtype
      • Float64Dtype

    Binary operations and type casting are supported:

    >>> kdf.a + kdf.b
    0       5
    1       7
    2    <NA>
    3    <NA>
    dtype: Int64
    >>> kdf + kdf
          a     b
    0     2     8
    1     4    10
    2  <NA>    12
    3     6  <NA>
    >>> kdf.a.astype('Float64')
    0     1.0
    1     2.0
    2    <NA>
    3     3.0
    Name: a, dtype: Float64
    

    Other new features, improvements and bug fixes

    We added the following new features:

    koalas:

    • date_range (#2081)
    • read_orc (#2017)

    Series:

    • align (#2019)

    DataFrame:

    • align (#2019)
    • to_orc (#2024)

    Along with the following fixes:

    • PySpark 3.1.1 Support
    • Preserve index for statistical functions with axis==1 (#2036)
    • Use iloc to make sure it retrieves the first element (#2037)
    • Fix numeric_only to follow pandas (#2035)
    • Fix DataFrame.merge to work properly (#2060)
    • Fix astype(str) for some data types (#2040)
    • Fix binary operations Index by Series (#2046)
    • Fix bug on pow and rpow (#2047)
    • Support bool list-like column selection for loc indexer (#2057)
    • Fix window functions to resolve (#2090)
    • Refresh GitHub workflow matrix (#2083)
    • Restructure the hierarchy of Index unit tests (#2080)
    • Fix to delegate dtypes (#2061)
    Source code(tar.gz)
    Source code(zip)
  • v1.6.0(Jan 22, 2021)

    Improved Plotly backend support

    We improved plotting support by implementing pie, histogram and box plots with the Plotly plotting backend. Koalas can now plot data with Plotly via:

    • DataFrame.plot.pie and Series.plot.pie (#1971) Screen Shot 2021-01-22 at 6 32 48 PM

    • DataFrame.plot.hist and Series.plot.hist (#1999) Screen Shot 2021-01-22 at 6 32 38 PM

    • Series.plot.box (#2007) Screen Shot 2021-01-22 at 6 32 31 PM

    In addition, we optimized the histogram calculation to a single pass over the DataFrame (#1997), instead of launching a separate job to calculate each Series in the DataFrame.
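
    A minimal sketch of drawing one of these Plotly charts (the data is made up for illustration):

    import databricks.koalas as ks

    ks.options.plotting.backend = "plotly"  # matplotlib is still the default in 1.6.x

    kdf = ks.DataFrame({'mass': [0.33, 4.87, 5.97]}, index=['Mercury', 'Venus', 'Earth'])
    fig = kdf.mass.plot.pie()               # Series.plot.pie via the Plotly backend
    fig.show()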

    Operations between Series and Index

    The operations between Series and Index are now supported as below (#1996):

    >>> kser = ks.Series([1, 2, 3, 4, 5, 6, 7])
    >>> kidx = ks.Index([0, 1, 2, 3, 4, 5, 6])
    
    >>> (kser + 1 + 10 * kidx).sort_index()
    0     2
    1    13
    2    24
    3    35
    4    46
    5    57
    6    68
    dtype: int64
    >>> (kidx + 1 + 10 * kser).sort_index()
    0    11
    1    22
    2    33
    3    44
    4    55
    5    66
    6    77
    dtype: int64
    

    Support setting to a Series via attribute access

    We added support for setting a column via attribute assignment in DataFrame (#1989).

    >>> kdf = ks.DataFrame({'A': [1, 2, 3, None]})
    >>> kdf.A = kdf.A.fillna(kdf.A.median())
    >>> kdf
         A
    0  1.0
    1  2.0
    2  3.0
    3  2.0
    

    Other new features, improvements and bug fixes

    We added the following new features:

    Series:

    • factorize (#1972)
    • sem (#1993)

    DataFrame

    • insert (#1983)
    • sem (#1993)

    In addition, we also implement new parameters:

    • Add min_count parameter for Frame.sum. (#1978)
    • Added ddof parameter for GroupBy.std() and GroupBy.var() (#1994)
    • Support ddof parameter for std and var. (#1986)

    Along with the following fixes:

    • Fix stat functions with no numeric columns. (#1967)
    • Fix DataFrame.replace with NaN/None values (#1962)
    • Fix cumsum and cumprod. (#1982)
    • Use Python type name instead of Spark's in error messages. (#1985)
    • Use object.__setattr__ in Series. (#1991)
    • Adjust Series.mode to match pandas Series.mode (#1995)
    • Adjust data when all the values in a column are nulls. (#2004)
    • Fix as_spark_type to not support "bigint". (#2011)
    Source code(tar.gz)
    Source code(zip)
  • v1.5.0(Dec 11, 2020)

    Index operations support

    We improved Index operations support (#1944, #1955).

    Here are some examples:

    • Before

      >>> kidx = ks.Index([1, 2, 3, 4, 5])
      >>> kidx + kidx
      Int64Index([2, 4, 6, 8, 10], dtype='int64')
      >>> kidx + kidx + kidx
      Traceback (most recent call last):
      ...
      AssertionError: args should be single DataFrame or single/multiple Series
      
      >>> ks.Index([1, 2, 3, 4, 5]) + ks.Index([6, 7, 8, 9, 10])
      Traceback (most recent call last):
      ...
      AssertionError: args should be single DataFrame or single/multiple Series
      
    • After

      >>> kidx = ks.Index([1, 2, 3, 4, 5])
      >>> kidx + kidx + kidx
      Int64Index([3, 6, 9, 12, 15], dtype='int64')
      
      >>> ks.options.compute.ops_on_diff_frames = True
      >>> ks.Index([1, 2, 3, 4, 5]) + ks.Index([6, 7, 8, 9, 10])
      Int64Index([7, 9, 13, 11, 15], dtype='int64')
      

    Other new features and improvements

    We added the following new features:

    DataFrame:

    • swaplevel (#1928)
    • swapaxes (#1946)
    • dot (#1945)
    • itertuples (#1960)

    Series:

    • swaplevel (#1919)
    • swapaxes (#1954)

    Index:

    • to_list (#1948)

    MultiIndex:

    • to_list (#1948)

    GroupBy:

    • tail (#1949)
    • median (#1957)

    Other improvements and bug fixes

    • Support DataFrame parameter in Series.dot (#1931)
    • Add a best practice for checkpointing. (#1930)
    • Remove implicit switch-ons of "compute.ops_on_diff_frames" (#1953)
    • Fix Series._to_internal_pandas and introduce Index._to_internal_pandas. (#1952)
    • Fix first/last_valid_index to support empty column DataFrame. (#1923)
    • Use pandas' transpose when the data is expected to be small. (#1932)
    • Fix tail to use the resolved copy (#1942)
    • Avoid unneeded reset_index in DataFrameGroupBy.describe. (#1951)
    • TypeError when Index.name / Series.name is not a hashable type (#1883)
    • Adjust data column names before attaching default index. (#1947)
    • Add plotly into the optional dependency in Koalas (#1939)
    • Add plotly backend test cases (#1938)
    • Don't pass stacked in plotly area chart (#1934)
    • Set upperbound of matplotlib to avoid failure on Ubuntu (#1959)
    • Fix GroupBy.describe for multi-index columns. (#1922)
    • Upgrade pandas version in CI (#1961)
    • Compare Series from the same anchor (#1956)
    • Add videos from Data+AI Summit 2020 EUROPE. (#1963)
    • Set PYARROW_IGNORE_TIMEZONE for binder. (#1965)
    Source code(tar.gz)
    Source code(zip)
  • v1.4.0(Nov 14, 2020)

    Better type support

    We improved the type mapping between pandas and Koalas (#1870, #1903). We added more types and string expressions to specify data types, and fixed mismatches between pandas and Koalas.

    Here are some examples:

    • Added np.float32 and "float32" (matched to FloatType)

      >>> ks.Series([10]).astype(np.float32)
      0    10.0
      dtype: float32
      
      >>> ks.Series([10]).astype("float32")
      0    10.0
      dtype: float32
      
    • Added np.datetime64 and "datetime64[ns]" (matched to TimestampType)

      >>> ks.Series(["2020-10-26"]).astype(np.datetime64)
      0   2020-10-26
      dtype: datetime64[ns]
      
      >>> ks.Series(["2020-10-26"]).astype("datetime64[ns]")
      0   2020-10-26
      dtype: datetime64[ns]
      
    • Fixed np.int to match LongType, not IntegerType.

      >>> pd.Series([100]).astype(np.int)
      0    100
      dtype: int64
      
      >>> ks.Series([100]).astype(np.int)
      0    100
      dtype: int32  # This fixed to `int64` now.
      
    • Fixed np.float to match DoubleType, not FloatType.

      >>> pd.Series([100]).astype(np.float)
      0    100.0
      dtype: float64
      
      >>> ks.Series([100]).astype(np.float)
      0    100.0
      dtype: float32  # This fixed to `float64` now.
      

    We also added a document which describes supported/unsupported pandas data types or data type mapping between pandas data types and PySpark data types. See: Type Support In Koalas.

    Return type annotations for major Koalas objects

    To improve Koalas' auto-completion in various editors and avoid misuse of APIs, we added return type annotations to major Koalas objects. These objects include DataFrame, Series, Index, GroupBy, Window objects, etc. (#1852, #1857, #1859, #1863, #1871, #1882, #1884, #1889, #1892, #1894, #1898, #1899, #1900, #1902).

    The return type annotations help auto-completion libraries, such as Jedi, to infer the actual data type and provide proper suggestions:

    • Before

    Before

    • After

    After

    It also helps mypy enable static analysis over the method body.

    pandas 1.1.4 support

    We verified the behaviors of pandas 1.1.4 in Koalas.

    As pandas 1.1.4 introduced a behavior change related to MultiIndex.is_monotonic (MultiIndex.is_monotonic_increasing) and MultiIndex.is_monotonic_decreasing (pandas-dev/pandas#37220), Koalas also changes the behavior (#1881).

    Other new features and improvements

    We added the following new features:

    DataFrame:

    • __neg__ (#1847)
    • rename_axis (#1843)
    • spark.repartition (#1864)
    • spark.coalesce (#1873)
    • spark.checkpoint (#1877)
    • spark.local_checkpoint (#1878)
    • reindex_like (#1880)

    Series:

    • rename_axis (#1843)
    • compare (#1802)
    • reindex_like (#1880)

    Index:

    • intersection (#1747)

    MultiIndex:

    • intersection (#1747)

    Other improvements and bug fixes

    • Use SF.repeat in series.str.repeat (#1844)
    • Remove warning when use cache in the context manager (#1848)
    • Support a non-string name in Series' boxplot (#1849)
    • Calculate fliers correctly in Series.plot.box (#1846)
    • Show type name rather than type class in error messages (#1851)
    • Fix DataFrame.spark.hint to reflect internal changes. (#1865)
    • DataFrame.reindex supports named columns index (#1876)
    • Separate InternalFrame.index_map into index_spark_column_names and index_names. (#1879)
    • Fix DataFrame.xs to handle internal changes properly. (#1896)
    • Explicitly disallow empty list as index_spark_colum_names and index_names. (#1895)
    • Use nullable inferred schema in function apply (#1897)
    • Introduce InternalFrame.index_level. (#1890)
    • Remove InternalFrame.index_map. (#1901)
    • Force to use the Spark's system default precision and scale when inferred data type contains DecimalType. (#1904)
    • Upgrade PyArrow from 1.0.1 to 2.0.0 in CI (#1860)
    • Fix read_excel to support squeeze argument. (#1905)
    • Fix to_csv to avoid duplicated option 'path' for DataFrameWriter. (#1912)
    Source code(tar.gz)
    Source code(zip)
  • v1.3.0(Oct 9, 2020)

    pandas 1.1 support

    We verified the behaviors of pandas 1.1 in Koalas. Koalas now supports pandas 1.1 officially (#1688, #1822, #1829).

    Support for non-string names

    Now we support non-string names (#1784). Previously, names in Koalas, e.g., df.columns, df.columns.names, df.index.names, needed to be a string or a tuple of strings, but now other data types supported by Spark are allowed.

    Before:

    >>> kdf = ks.DataFrame([[1, 'x'], [2, 'y'], [3, 'z']])
    >>> kdf.columns
    Index(['0', '1'], dtype='object')
    

    After:

    >>> kdf = ks.DataFrame([[1, 'x'], [2, 'y'], [3, 'z']])
    >>> kdf.columns
    Int64Index([0, 1], dtype='int64')
    

    Improve distributed-sequence default index

    Performance is improved when creating a distributed-sequence default index by avoiding the interaction between Python and the JVM (#1699).
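
    The default index type itself is configurable through Koalas options; a minimal sketch:

    import databricks.koalas as ks

    # pick the distributed-sequence default index explicitly
    ks.set_option("compute.default_index_type", "distributed-sequence")
    kdf = ks.range(5)  # the default index of this DataFrame uses distributed-sequence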

    Standardize binary operations between int and str columns

    Behaviors of binary operations (+, -, *, /, //, %) between int and str columns are now consistent with the respective pandas behaviors (#1828).

    It standardizes binary operations as follows (a minimal sketch follows the list):

    • +: raises TypeError between an int column and a str column (or string literal)
    • *: acts as Spark SQL repeat between an int column (or int literal) and str columns; raises TypeError if a string literal is involved
    • -, /, //, % (modulo): raise TypeError if a str column (or string literal) is involved
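
    A minimal sketch of these rules (column names are made up for illustration):

    import databricks.koalas as ks

    kdf = ks.DataFrame({'n': [1, 2, 3], 's': ['a', 'b', 'c']})

    kdf['s'] * kdf['n']    # Spark SQL repeat: 'a', 'bb', 'ccc'
    kdf['s'] * 2           # int literal also repeats: 'aa', 'bb', 'cc'
    # kdf['n'] + kdf['s']  # raises TypeError
    # kdf['n'] - kdf['s']  # raises TypeError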

    Other new features and improvements

    We added the following new features:

    DataFrame:

    • product (#1739)
    • from_dict (#1778)
    • pad (#1786)
    • backfill (#1798)

    Series:

    • reindex (#1737)
    • explode (#1777)
    • pad (#1786)
    • argmin (#1790)
    • argmax (#1790)
    • argsort (#1793)
    • backfill (#1798)

    Index:

    • inferred_type (#1745)
    • item (#1744)
    • is_unique (#1766)
    • asi8 (#1764)
    • is_type_compatible (#1765)
    • view (#1788)
    • insert (#1804)

    MultiIndex:

    • inferred_type (#1745)
    • item (#1744)
    • is_unique (#1766)
    • asi8 (#1764)
    • is_type_compatible (#1765)
    • from_frame (#1762)
    • view (#1788)
    • insert (#1804)

    GroupBy:

    • get_group (#1783)

    Other improvements

    • Fix DataFrame.mad to work properly (#1749)
    • Fix Series name after binary operations. (#1753)
    • Fix GroupBy.cum~ for matching with pandas' behavior (#1708)
    • Fix cumprod to work properly with Integer columns. (#1750)
    • Fix DataFrame.join for MultiIndex (#1771)
    • Exception handling for from_frame properly (#1791)
    • Fix iloc for slice(None, 0) (#1767)
    • Fix Series.__repr__ when Series.name is None. (#1796)
    • DataFrame.reindex supports koalas Index parameter (#1741)
    • Fix Series.fillna with inplace=True on non-nullable column. (#1809)
    • Input check in various APIs (#1808, #1810, #1811, #1812, #1813, #1814, #1816, #1824)
    • Fix to_list work properly in pandas==0.23 (#1823)
    • Fix Series.astype to work properly (#1818)
    • Frame.groupby supports dropna (#1815)
    Source code(tar.gz)
    Source code(zip)
  • v1.2.0(Aug 28, 2020)

    Non-named Series support

    We added support for non-named Series (#1712). Previously, Koalas automatically named a Series "0" if no name was specified or the name was set to None, whereas pandas allows a Series without a name.

    For example:

    >>> ks.__version__
    '1.1.0'
    >>> kser = ks.Series([1, 2, 3])
    >>> kser
    0    1
    1    2
    2    3
    Name: 0, dtype: int64
    >>> kser.name = None
    >>> kser
    0    1
    1    2
    2    3
    Name: 0, dtype: int64
    

    Now the Series will be non-named.

    >>> ks.__version__
    '1.2.0'
    >>> ks.Series([1, 2, 3])
    0    1
    1    2
    2    3
    dtype: int64
    >>> kser = ks.Series([1, 2, 3], name="a")
    >>> kser.name = None
    >>> kser
    0    1
    1    2
    2    3
    dtype: int64
    

    More stable "distributed-sequence" default index

    Previously "distributed-sequence" default index had sometimes produced wrong values or even raised an exception. For example, the codes below:

    >>> from databricks import koalas as ks
    >>> ks.options.compute.default_index_type = 'distributed-sequence'
    >>> ks.range(10).reset_index()
    

    did not work, failing as below:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      ...
    pyspark.sql.utils.PythonException:
      An exception was thrown from the Python worker. Please see the stack trace below.
    Traceback (most recent call last):
      ...
      File "/.../koalas/databricks/koalas/internal.py", line 620, in offset
        current_partition_offset = sums[id.iloc[0]]
    KeyError: 103
    

    We investigated and made the default index type more stable (#1701). Now such situations are unlikely to occur, and it is stable enough.

    Improve testing infrastructure

    We changed the testing infrastructure to use pandas' testing utils for exact checks (#1722). Now it compares even index/column types and names, so that we can follow pandas more strictly.
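
    A minimal sketch of the kind of exact check this enables (assuming a pandas DataFrame pdf and its Koalas counterpart kdf):

    import pandas.testing as pdt

    # fails if values, dtypes, index/column types or names differ
    pdt.assert_frame_equal(kdf.to_pandas(), pdf)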

    Other new features and improvements

    We added the following new features:

    DataFrame:

    • last_valid_index (#1705)

    Series:

    • product (#1677)
    • last_valid_index (#1705)

    GroupBy:

    • cumcount (#1702)

    Other improvements

    • Refine Spark I/O. (#1667)
      • Set partitionBy explicitly in to_parquet.
      • Add mode and partition_cols to to_csv and to_json.
      • Fix type hints to use Optional.
    • Make read_excel read from DFS if the underlying Spark is 3.0.0 or above. (#1678, #1693, #1694, #1692)
    • Support callable instances to apply as a function, and fix groupby.apply to keep the index when possible (#1686)
    • Bug fixing for hasnans when non-DoubleType. (#1681)
    • Support axis=1 for DataFrame.dropna(). (#1689)
    • Allow assigning index as a column (#1696)
    • Try to read pandas metadata in read_parquet if index_col is None. (#1695)
    • Include pandas Index object in dataframe indexing options (#1698)
    • Unified PlotAccessor for DataFrame and Series (#1662)
    • Fix SeriesGroupBy.nsmallest/nlargest. (#1713)
    • Fix DataFrame.size to consider its number of columns. (#1715)
    • Fix first_valid_index() for Empty object (#1704)
    • Fix index name when groupby.apply returns a single row. (#1719)
    • Support subtraction of date/timestamp with literals. (#1721)
    • DataFrame.reindex(fill_value) does not fill existing NaN values (#1723)
    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Jul 17, 2020)

    API extensions

    We added support for API extensions (#1617).

    You can register your custom accessors to DataFrame, Series, and Index.

    For example, in your library code:

    from databricks.koalas.extensions import register_dataframe_accessor
    
    @register_dataframe_accessor("geo")
    class GeoAccessor:
    
        def __init__(self, koalas_obj):
            self._obj = koalas_obj
            # other constructor logic
    
        @property
        def center(self):
            # return the geographic center point of this DataFrame
            lat = self._obj.latitude
            lon = self._obj.longitude
            return (float(lon.mean()), float(lat.mean()))
    
        def plot(self):
            # plot this array's data on a map
            pass
        ...
    

    Then, in a session:

    >>> from my_ext_lib import GeoAccessor 
    >>> kdf = ks.DataFrame({"longitude": np.linspace(0,10),
    ...                     "latitude": np.linspace(0, 20)})
    >>> kdf.geo.center 
        (5.0, 10.0)
    
    >>> kdf.geo.plot() 
    ...
    

    See also: https://koalas.readthedocs.io/en/latest/reference/extensions.html

    Plotting backend

    We introduced plotting.backend configuration (#1639).

    Plotly (>=4.8) or other libraries that pandas supports can be used as a plotting backend if they are installed in the environment.

    >>> kdf = ks.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "B", "C", "D"])
    >>> kdf.plot(title="Example Figure")  # defaults to backend="matplotlib"
    

    image

    >>> fig = kdf.plot(backend="plotly", title="Example Figure", height=500, width=500)
    >>> ## same as:
    >>> # ks.options.plotting.backend = "plotly"
    >>> # fig = kdf.plot(title="Example Figure", height=500, width=500)
    >>> fig.show()
    

    image

    Each backend returns the figure in their own format, allowing for further editing or customization if required.

    >>> fig.update_layout(template="plotly_dark")
    >>> fig.show()
    

    image

    Koalas accessor

    We introduced koalas accessor and some methods specific to Koalas (#1613, #1628).

    DataFrame.apply_batch, DataFrame.transform_batch, and Series.transform_batch are deprecated and moved to koalas accessor.

    >>> kdf = ks.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
    >>> def pandas_plus(pdf):
    ...     return pdf + 1  # should always return the same length as input.
    ...
    >>> kdf.koalas.transform_batch(pandas_plus)
       a  b
    0  2  5
    1  3  6
    2  4  7
    
    >>> kdf = ks.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
    >>> def pandas_filter(pdf):
    ...     return pdf[pdf.a > 1]  # allow arbitrary length
    ...
    >>> kdf.koalas.apply_batch(pandas_filter)
       a  b
    1  2  5
    2  3  6
    

    or

    >>> kdf = ks.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
    >>> def pandas_plus(pser):
    ...     return pser + 1  # should always return the same length as input.
    ...
    >>> kdf.a.koalas.transform_batch(pandas_plus)
    0    2
    1    3
    2    4
    Name: a, dtype: int64
    

    See also: https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html

    Other new features and improvements

    We added the following new features:

    DataFrame:

    • tail (#1632)
    • droplevel (#1622)

    Series:

    • iteritems (#1603)
    • items (#1603)
    • tail (#1632)
    • droplevel (#1630)

    Other improvements

    • Simplify Series.to_frame. (#1624)
    • Make Window functions create a new DataFrame. (#1623)
    • Fix Series._with_new_scol to use alias. (#1634)
    • Refine concat to handle the same anchor DataFrames properly. (#1627)
    • Add sort parameter to concat. (#1636)
    • Enable to assign list. (#1644)
    • Use SPARK_INDEX_NAME_FORMAT in combine_frames to avoid ambiguity. (#1650)
    • Rename spark columns only when index=False. (#1649)
    • read_csv: Implement reading of number of rows (#1656)
    • Fixed ks.Index.to_series() to work properly with the name parameter (#1643)
    • Fix fillna to handle "ffill" and "bfill" properly. (#1654)
    Source code(tar.gz)
    Source code(zip)
  • v1.0.1(Jun 24, 2020)

    Critical bug fix

    We fixed a critical bug introduced in Koalas 1.0.0 (#1609).

    If we call DataFrame.rename with columns parameter after some operations on the DataFrame, the operations will be lost:

    >>> kdf = ks.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=["A", "B", "C", "D"])
    >>> kdf1 = kdf + 1
    >>> kdf1
       A  B  C  D
    0  2  3  4  5
    1  6  7  8  9
    >>> kdf1.rename(columns={"A": "aa", "B": "bb"})
       aa  bb  C  D
    0   1   2  3  4
    1   5   6  7  8
    

    This should be:

    >>> pdf1.rename(columns={"A": "aa", "B": "bb"})
       aa  bb  C  D
    0   2   3  4  5
    1   6   7  8  9
    

    Other improvements

    • Clean up InternalFrame and around anchor. (#1601)
    • Fixing DataFrame.iteritems to return generator (#1602)
    • Clean up groupby to use the anchor. (#1610)
    Source code(tar.gz)
    Source code(zip)
  • v1.0.0(Jun 19, 2020)

    Better pandas API coverage

    We implemented many APIs and features equivalent to pandas, such as plotting, grouping, windowing, I/O, and transformation, and Koalas now reaches close to 80% pandas API coverage in Koalas 1.0.0.

    Apache Spark 3.0

    Apache Spark 3.0 is now supported in Koalas 1.0 (#1586, #1558). Koalas does not require any change to use Spark 3.0. Apache Spark has more than 3400 fixes landed in Spark 3.0, and Koalas shares most of those fixes across many components.

    It also brings performance improvements to Koalas APIs that execute Python native functions internally via pandas UDFs, for example DataFrame.apply and DataFrame.apply_batch (#1508).

    Python 3.8

    With Apache Spark 3.0, Koalas supports the latest Python 3.8, which has many significant improvements (#1587); see also the Python 3.8.0 release notes.

    Spark accessor

    The spark accessor was introduced in Koalas 1.0.0 so that Koalas users can leverage existing PySpark APIs more easily (#1530). For example, you can apply PySpark functions as below:

    import databricks.koalas as ks
    import pyspark.sql.functions as F
    
    kss = ks.Series([1, 2, 3, 4])
    kss.spark.apply(lambda s: F.collect_list(s))
    

    Better type hint support

    In earlier versions, it was required to use Koalas instances as the return type hints for functions that return pandas instances, which looked slightly awkward.

    def pandas_div(pdf) -> koalas.DataFrame[float, float]:
        # pdf is a pandas DataFrame,
        return pdf[['B', 'C']] / pdf[['B', 'C']]
    
    df = ks.DataFrame({'A': ['a', 'a', 'b'], 'B': [1, 2, 3], 'C': [4, 6, 5]})
    df.groupby('A').apply(pandas_div)
    

    In Koalas 1.0.0 with Python 3.7+, you can also use pandas instances in the return type as below:

    def pandas_div(pdf) -> pandas.DataFrame[float, float]:
        return pdf[['B', 'C']] / pdf[['B', 'C']]
    

    In addition, the new type hinting is experimentally introduced in order to allow users to specify column names in the type hints as below (#1577):

    def pandas_div(pdf) -> pandas.DataFrame['B': float, 'C': float]:
        return pdf[['B', 'C']] / pdf[['B', 'C']]
    

    See also the guide in Koalas documentation (#1584) for more details.

    Wider support of in-place update

    Previously, in-place updates happened only within each DataFrame or Series; now the behavior follows pandas in-place updates, and the update of one side also updates the other side (#1592).

    For example, each of the following updates kdf as well.

    kdf = ks.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6]})
    kser = kdf.x
    kser.fillna(0, inplace=True)
    
    kdf = ks.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6]})
    kser = kdf.x
    kser.loc[2] = 30
    
    kdf = ks.DataFrame({"x": [np.nan, 2, 3, 4, np.nan, 6]})
    kser = kdf.x
    kdf.loc[2, 'x'] = 30
    

    If the DataFrame and Series are connected, the in-place updates update each other.

    Less restriction on compute.ops_on_diff_frames

    In Koalas 1.0.0, the restriction of compute.ops_on_diff_frames was loosened considerably (#1522, #1554). For example, operations such as the ones below can be performed without enabling compute.ops_on_diff_frames, which can be expensive due to the shuffle under the hood.

    df + df + df
    df['foo'] = df['bar']['baz']
    df[['x', 'y']] = df[['x', 'y']].fillna(0)
    

    Other new features and improvements

    DataFrame:

    • __bool__ (#1526)
    • explode (#1507)
    • spark.apply (#1536)
    • spark.schema (#1530)
    • spark.print_schema (#1530)
    • spark.frame (#1530)
    • spark.cache (#1530)
    • spark.persist (#1530)
    • spark.hint (#1530)
    • spark.to_table (#1530)
    • spark.to_spark_io (#1530)
    • spark.explain (#1530)
    • spark.apply (#1530)
    • mad (#1538)
    • __abs__ (#1561)

    Series:

    • item (#1502, #1518)
    • divmod (#1397)
    • rdivmod (#1397)
    • unstack (#1501)
    • mad (#1503)
    • __bool__ (#1526)
    • to_markdown (#1510)
    • spark.apply (#1536)
    • spark.data_type (#1530)
    • spark.nullable (#1530)
    • spark.column (#1530)
    • spark.transform (#1530)
    • filter (#1511)
    • __abs__ (#1561)
    • bfill (#1580)
    • ffill (#1580)

    Index:

    • __bool__ (#1526)
    • spark.data_type (#1530)
    • spark.column (#1530)
    • spark.transform (#1530)
    • get_level_values (#1517)
    • delete (#1165)
    • __abs__ (#1561)
    • holds_integer (#1547)

    MultiIndex:

    • __bool__ (#1526)
    • spark.data_type (#1530)
    • spark.column (#1530)
    • spark.transform (#1530)
    • get_level_values (#1517)
    • delete (#1165)
    • __abs__ (#1561)
    • holds_integer (#1547)

    Along with the following improvements:

    • Fix Series.clip not to create a new DataFrame. (#1525)
    • Fix combine_first to support tupled names. (#1534)
    • Add Spark accessors to usage logging. (#1540)
    • Implements multi-index support in DataFrame.filter (#1512)
    • Fix Series.fillna to avoid Spark jobs. (#1550)
    • Support DataFrame.spark.explain(extended: str) case. (#1563)
    • Support Series as repeats in Series.repeat. (#1573)
    • Fix fillna to handle NaN properly. (#1572)
    • Fix DataFrame.replace to avoid creating a new Spark DataFrame. (#1575)
    • Cache an internal pandas object to avoid running it twice in Jupyter. (#1564)
    • Fix Series.div when dividing/floor-dividing np.inf by zero (#1463)
    • Fix Series.unstack to support non-numeric type and keep the names (#1527)
    • Fix hasnans to follow the modified column. (#1532)
    • Fix explode to use internal methods. (#1538)
    • Fix RollingGroupby and ExpandingGroupby to handle agg_columns. (#1546)
    • Fix reindex not to update internal. (#1582)

    Backward Compatibility

    • Remove the deprecated pandas_wraps (#1529)
    • Remove compute function. (#1531)
    Source code(tar.gz)
    Source code(zip)
  • v0.33.0(May 14, 2020)

    apply and transform Improvements

    We added support for positional/keyword arguments in apply, apply_batch, transform, and transform_batch of DataFrame, Series, and GroupBy. (#1484, #1485, #1486)

    >>> ks.range(10).apply(lambda a, b, c: a + b + c, args=(1,), c=3)
       id
    0   4
    1   5
    2   6
    3   7
    4   8
    5   9
    6  10
    7  11
    8  12
    9  13
    
    >>> ks.range(10).transform_batch(lambda pdf, a, b, c: pdf.id + a + b + c, 1, 2, c=3)
    0     6
    1     7
    2     8
    3     9
    4    10
    5    11
    6    12
    7    13
    8    14
    9    15
    Name: id, dtype: int64
    
    >>> kdf = ks.DataFrame(
    ...    {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
    ...    columns=["a", "b", "c"])
    >>> kdf.groupby(["a", "b"]).apply(lambda x, y, z: x + x.min() + y + z, 1, z=2)
        a   b   c
    0   5   5   5
    1   7   5  11
    2   9   7  21
    3  11   9  35
    4  13  13  53
    5  15  19  75
    

    Spark Schema

    We added spark_schema and print_schema to inspect the underlying Spark schema. (#1446)

    >>> kdf = ks.DataFrame({'a': list('abc'),
    ...                     'b': list(range(1, 4)),
    ...                     'c': np.arange(3, 6).astype('i1'),
    ...                     'd': np.arange(4.0, 7.0, dtype='float64'),
    ...                     'e': [True, False, True],
    ...                     'f': pd.date_range('20130101', periods=3)},
    ...                    columns=['a', 'b', 'c', 'd', 'e', 'f'])
    
    >>> # Print the schema out in Spark’s DDL formatted string
    >>> kdf.spark_schema().simpleString()
    'struct<a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
    >>> kdf.spark_schema(index_col='index').simpleString()
    'struct<index:bigint,a:string,b:bigint,c:tinyint,d:double,e:boolean,f:timestamp>'
    
    >>> # Print out the schema as same as DataFrame.printSchema()
    >>> kdf.print_schema()
    root
     |-- a: string (nullable = false)
     |-- b: long (nullable = false)
     |-- c: byte (nullable = false)
     |-- d: double (nullable = false)
     |-- e: boolean (nullable = false)
     |-- f: timestamp (nullable = false)
    
    >>> kdf.print_schema(index_col='index')
    root
     |-- index: long (nullable = false)
     |-- a: string (nullable = false)
     |-- b: long (nullable = false)
     |-- c: byte (nullable = false)
     |-- d: double (nullable = false)
     |-- e: boolean (nullable = false)
     |-- f: timestamp (nullable = false)
    

    GroupBy Improvements

    We fixed many GroupBy bugs, as listed below; a small usage sketch follows the list.

    • Fix groupby when as_index=False. (#1457)
    • Make groupby.apply in pandas<0.25 run the function only once per group. (#1462)
    • Fix Series.groupby on the Series from different DataFrames. (#1460)
    • Fix GroupBy.head to recognize agg_columns. (#1474)
    • Fix GroupBy.filter to follow complex group keys. (#1471)
    • Fix GroupBy.transform to follow complex group keys. (#1472)
    • Fix GroupBy.apply to follow complex group keys. (#1473)
    • Fix GroupBy.fillna to use GroupBy._apply_series_op. (#1481)
    • Fix GroupBy.filter and apply to handle agg_columns. (#1480)
    • Fix GroupBy apply, filter, and head to ignore temp columns when ops from different DataFrames. (#1488)
    • Fix GroupBy functions which need natural orderings to follow the order when ops are from different DataFrames. (#1490)
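
    A small usage sketch of one of the fixed paths (groupby with as_index=False keeps the group key as a regular column); the data is illustrative only:

    import databricks.koalas as ks
    
    kdf = ks.DataFrame({'k': ['a', 'a', 'b'], 'v': [1, 2, 3]})
    
    # the group key 'k' stays as a column instead of becoming the index
    kdf.groupby('k', as_index=False).sum()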

    Other new features and improvements

    We added the following new feature:

    SeriesGroupBy:

    • filter (#1483)

    Other improvements

    • dtype for DateType should be np.dtype("object"). (#1447)
    • Make reset_index disallow the same name but allow it when drop=True. (#1455)
    • Fix named aggregation for MultiIndex (#1435)
    • Raise ValueError in cases where it was previously not raised (#1461)
    • Fix get_dummies when using the prefix parameter whose type is dict (#1478)
    • Simplify DataFrame.columns setter. (#1489)
    Source code(tar.gz)
    Source code(zip)
  • v0.32.0(Apr 23, 2020)

    Koalas documentation redesign

    Koalas documentation was redesigned with a better theme, pydata-sphinx-theme. Please check out the new Koalas documentation site.

    transform_batch and apply_batch

    We added APIs that enable you to directly transform and apply a function against a Koalas Series or DataFrame. map_in_pandas is deprecated and has been renamed to apply_batch.

    import databricks.koalas as ks
    kdf = ks.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
    def pandas_plus(pdf):
        return pdf + 1  # should always return the same length as input.
    
    kdf.transform_batch(pandas_plus)
    
    import databricks.koalas as ks
    kdf = ks.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
    def pandas_plus(pdf):
        return pdf[pdf.a > 1]  # allow arbitrary length
    
    kdf.apply_batch(pandas_plus)
    

    Please also check Transform and apply a function in Koalas documentation.

    Other new features and improvements

    We added the following new feature:

    DataFrame:

    • truncate (#1408)
    • hint (#1415)

    SeriesGroupBy:

    • unique (#1426)

    Index:

    • spark_column (#1438)

    Series:

    • spark_column (#1438)

    MultiIndex:

    • spark_column (#1438)

    Other improvements

    • Fix from_pandas to handle the same index name as a column name. (#1419)
    • Add documentation about non-Koalas APIs (#1420)
    • Hot-fixing the lack of keyword argument 'deep' for DataFrame.copy() (#1423)
    • Fix Series.div when divide by zero (#1412)
    • Support expand parameter if n is a positive integer in Series.str.split/rsplit. (#1432)
    • Make Series.astype(bool) follow the concept of "truthy" and "falsey". (#1431)
    • Fix incompatible behaviour with pandas for floordiv with np.nan (#1429)
    • Use mapInPandas for apply_batch API in Spark 3.0 (#1440)
    • Use F.datediff() for subtraction of dates as a workaround. (#1439)
    Source code(tar.gz)
    Source code(zip)
  • v0.31.0(Apr 9, 2020)

    PyArrow>=0.15 support is back

    We added PyArrow>=0.15 support back (#1110).

    Note that, when working with pyarrow>=0.15 and pyspark<3.0, Koalas will set the environment variable ARROW_PRE_0_15_IPC_FORMAT=1 if it does not exist, as per the instruction in SPARK-29367, but it will NOT work if there is a Spark context already launched. In that case, you have to manage the environment variable yourself.
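
    A minimal sketch of managing the variable yourself, assuming it is done before any Spark context is launched:

    import os
    
    # must be set while no Spark context exists; Koalas cannot set it afterwards
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
    
    import databricks.koalas as ks  # safe to start using Koalas afterwards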

    Spark specific improvements

    Broadcast hint

    We added broadcast function in namespace.py (#1360).

    You can use it with merge, join, and update, which invoke a join operation in Spark, when you know one of the DataFrames is small enough to fit in memory; a broadcast join is typically much more performant than a shuffle-based join.

    For example,

    >>> merged = df1.merge(ks.broadcast(df2), left_on='lkey', right_on='rkey')
    >>> merged.explain()
    == Physical Plan ==
    ...
    ...BroadcastHashJoin...
    ...
    

    persist function and storage level

    We added a persist function to specify the storage level when caching (#1381), and a storage_level property to check the current storage level (#1385).

    >>> with df.cache() as cached_df:
    ...     print(cached_df.storage_level)
    ...
    Disk Memory Deserialized 1x Replicated
    
    >>> with df.persist(pyspark.StorageLevel.MEMORY_ONLY) as cached_df:
    ...     print(cached_df.storage_level)
    ...
    Memory Serialized 1x Replicated
    

    Other new features and improvements

    We added the following new feature:

    DataFrame:

    • to_markdown (#1377)
    • squeeze (#1389)

    Series:

    • squeeze (#1389)
    • asof (#1366)

    Other improvements

    • Add a way to specify index column in I/O APIs (#1379)
    • Fix iloc.__setitem__ with the other Series from the same DataFrame. (#1388)
    • Add support Series from different DataFrames for loc/iloc.__setitem__. (#1391)
    • Refine __setitem__ for loc/iloc with DataFrame. (#1394)
    • Help misuse of options argument. (#1402)
    • Add blog posts in Koalas documentation (#1406)
    • Fix mod & rmod for matching with pandas. (#1399)
    Source code(tar.gz)
    Source code(zip)
  • v0.30.0(Mar 26, 2020)

    Slice column selection support in loc

    We continued to improve the loc indexer and added support for slice column selection (#1351).

    >>> from databricks import koalas as ks
    >>> df = ks.DataFrame({'a':list('abcdefghij'), 'b':list('abcdefghij'), 'c': range(10)})
    >>> df.loc[:, "b":"c"]
       b  c
    0  a  0
    1  b  1
    2  c  2
    3  d  3
    4  e  4
    5  f  5
    6  g  6
    7  h  7
    8  i  8
    9  j  9
    

    Slice row selection support in loc for multi-index

    We also added support for slices as row selection in the loc indexer for multi-index (#1344).

    >>> from databricks import koalas as ks
    >>> import pandas as pd
    >>> df = ks.DataFrame({'a': range(3)}, index=pd.MultiIndex.from_tuples([("a", "b"), ("a", "c"), ("b", "d")]))
    >>> df.loc[("a", "c"): "b"]
         a
    a c  1
    b d  2
    

    Slice row selection support in iloc

    We continued to improve the iloc indexer to support iterable indexes as row selection (#1338).

    >>> from databricks import koalas as ks
    >>> df = ks.DataFrame({'a':list('abcdefghij'), 'b':list('abcdefghij')})
    >>> df.iloc[[-1, 1, 2, 3]]
       a  b
    1  b  b
    2  c  c
    3  d  d
    9  j  j
    

    Support of setting values via loc and iloc at Series

    Now, we added basic support for setting values via loc and iloc on Series (#1367).

    >>> from databricks import koalas as ks
    >>> kser = ks.Series([1, 2, 3], index=["cobra", "viper", "sidewinder"])
    >>> kser.loc[kser % 2 == 1] = -kser
    >>> kser
    cobra        -1
    viper         2
    sidewinder   -3
    

    Other new features and improvements

    We added the following new feature:

    DataFrame:

    • take (#1292)
    • eval (#1359)

    Series:

    • dot (#1136)
    • take (#1357)
    • combine_first (#1290)

    Index:

    • droplevel (#1340)
    • union (#1348)
    • take (#1357)
    • asof (#1350)

    MultiIndex:

    • droplevel (#1340)
    • unique (#1342)
    • union (#1348)
    • take (#1357)

    Other improvements

    • Compute Index.is_monotonic/Index.is_monotonic_decreasing in a distributed manner (#1354)
    • Fix SeriesGroupBy.apply() to respect various output (#1339)
    • Add the support for operations between different DataFrames in groupby() (#1321)
    • Explicitly disallow disabling numeric_only in DataFrame stats APIs (#1343)
    • Fix index operator against Series and Frame to use iloc conditionally (#1336)
    • Make nunique in DataFrame return a Koalas DataFrame instead of pandas' (#1347)
    • Fix MultiIndex.drop() to follow renaming et al. (#1356)
    • Add column axis in ks.concat (#1349)
    • Fix iloc for Series when the series is modified. (#1368)
    • Support MultiIndex for duplicated, drop_duplicates. (#1363)
    Source code(tar.gz)
    Source code(zip)
  • v0.29.0(Mar 12, 2020)

    Slice support in iloc

    We improved the iloc indexer to support slices as row selection. (#1335)

    For example,

    >>> kdf = ks.DataFrame({'a':list('abcdefghij')})
    >>> kdf
       a
    0  a
    1  b
    2  c
    3  d
    4  e
    5  f
    6  g
    7  h
    8  i
    9  j
    >>> kdf.iloc[2:5]
       a
    2  c
    3  d
    4  e
    >>> kdf.iloc[2:-3:2]
       a
    2  c
    4  e
    6  g
    >>> kdf.iloc[5:]
       a
    5  f
    6  g
    7  h
    8  i
    9  j
    >>> kdf.iloc[5:2]
    Empty DataFrame
    Columns: [a]
    Index: []
    

    Documentation

    We added links to the previous talks in our document. (#1319)

    You can see a lot of useful talks from previous events, and we will keep the list updated.

    https://koalas.readthedocs.io/en/latest/getting_started/videos.html

    Other new features and improvements

    We added the following new feature:

    DataFrame:

    • stack (#1329)

    Series:

    • repeat (#1328)

    Index:

    • difference (#1325)
    • repeat (#1328)

    MultiIndex:

    • difference (#1325)
    • repeat (#1328)

    Other improvements

    • DataFrame.pivot should preserve the original index names. (#1316)
    • Fix _LocIndexerLike to handle a Series from index. (#1315)
    • Support MultiIndex in DataFrame.unstack. (#1322)
    • Support Spark UDT when converting from/to pandas DataFrame/Series. (#1324)
    • Allow negative numbers for head. (#1330)
    • Return a Koalas series instead of pandas' in stats APIs at Koalas DataFrame (#1333)
    Source code(tar.gz)
    Source code(zip)
  • v0.28.0(Feb 27, 2020)

    pandas 1.0 support

    We added pandas 1.0 support (#1197, #1299), and Koalas now can work with pandas 1.0.

    map_in_pandas

    We implemented the DataFrame.map_in_pandas API (#1276) so that Koalas can run any arbitrary function written for a pandas DataFrame against a Koalas DataFrame. See the example below:

    >>> import databricks.koalas as ks
    >>> df = ks.DataFrame({'A': range(2000), 'B': range(2000)})
    >>> def query_func(pdf):
    ...     num = 1995
    ...     return pdf.query('A > @num')
    ...
    >>> df.map_in_pandas(query_func)
             A     B
    1996  1996  1996
    1997  1997  1997
    1998  1998  1998
    1999  1999  1999
    

    Standardize code style using Black

    As a development-only change, we added Black integration (#1301). Now, all code style is standardized automatically by running ./dev/reformat, and the style is checked as part of ./dev/lint-python.

    Other new features and improvements

    We added the following new feature:

    DataFrame:

    • query (#1273)
    • unstack (#1295)

    Other improvements

    • Fix DataFrame.describe() to support multi-index columns. (#1279)
    • Add util function validate_bool_kwarg (#1281)
    • Rename data columns prior to filter to make sure the column names are as expected. (#1283)
    • Add an FAQ entry about Structured Streaming. (#1298)
    • Let extra options have higher priority to allow workarounds (#1296)
    • Implement 'keep' parameter for drop_duplicates (#1303)
    • Add a note when type hint is provided to DataFrame.apply (#1310)
    • Add a util method to verify temporary column names. (#1262)
    Source code(tar.gz)
    Source code(zip)
  • v0.27.0(Feb 13, 2020)

    head ordering

    Since Koalas doesn't guarantee row ordering, head could return rows from any distributed partition, and the result was not deterministic, which might confuse users.

    We added a configuration compute.ordered_head (#1231); if it is set to True, Koalas performs natural ordering beforehand and the result is the same as pandas'. The default value is False because the ordering causes a performance overhead.

    >>> kdf = ks.DataFrame({'a': range(10)})
    >>> pdf = kdf.to_pandas()
    >>> pdf.head(3)
       a
    0  0
    1  1
    2  2
    
    >>> kdf.head(3)
       a
    5  5
    6  6
    7  7
    >>> kdf.head(3)
       a
    0  0
    1  1
    2  2
    
    >>> ks.options.compute.ordered_head = True
    >>> kdf.head(3)
       a
    0  0
    1  1
    2  2
    >>> kdf.head(3)
       a
    0  0
    1  1
    2  2
    

    GitHub Actions

    We started trying to use GitHub Actions for CI. (#1254, #1265, #1264, #1267, #1269)

    Other new features and improvements

    We added the following new feature:

    DataFrame:

    • apply (#1259)

    Other improvements

    • Fix identical and equals for the comparison between the same object. (#1220)
    • Select the series correctly in SeriesGroupBy APIs (#1224)
    • Fixes DataFrame/Series.clip function to preserve its index. (#1232)
    • Throw a better exception in DataFrame.sort_values when multi-index column is used (#1238)
    • Fix fillna not to change index values. (#1241)
    • Fix DataFrame.__setitem__ with tuple-named Series. (#1245)
    • Fix corr to support multi-index columns. (#1246)
    • Fix output of print() matches with pandas of Series (#1250)
    • Fix fillna to support partial column index for multi-index columns. (#1244)
    • Add as_index check logic to groupby parameter (#1253)
    • Raising NotImplementedError for elements that actually are not implemented. (#1256)
    • Fix where to support multi-index columns. (#1249)
    Source code(tar.gz)
    Source code(zip)
  • v0.26.0(Jan 23, 2020)

    iat indexer

    We continued to improve indexers. Now, the iat indexer is supported too (#1062).

    >>> df = ks.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
    ...                   columns=['A', 'B', 'C'])
    >>> df
        A   B   C
    0   0   2   3
    1   0   4   1
    2  10  20  30
    
    >>> df.iat[1, 2]
    1
    

    Other new features and improvements

    We added the following new features:

    koalas.Index

    • equals (#1216)
    • identical (#1215)
    • is_all_dates (#1205)
    • append (#1163)
    • to_frame (#1187)

    koalas.MultiIndex:

    • equals (#1216)
    • identical (#1215)
    • swaplevel (#1105)
    • is_all_dates (#1205)
    • is_monotonic_increasing (#1183)
    • is_monotonic_decreasing (#1183)
    • append (#1163)
    • to_frame (#1187)

    koalas.DataFrameGroupBy

    • describe (#1168)

    Other improvements

    • Change default write mode to overwrite to be consistent with pandas (#1209)
    • Prepare Spark 3 (#1211, #1181)
    • Fix DataFrame.idxmin/idxmax. (#1198)
    • Fix reset_index with the default index is "distributed-sequence". (#1193)
    • Fix column name as a tuple in multi column index (#1191)
    • Add favicon to doc (#1189)
    Source code(tar.gz)
    Source code(zip)
  • v0.25.0(Jan 9, 2020)

    loc and iloc indexers improvement

    We improved the loc and iloc indexers. Now, loc supports scalar values as indexers (#1172).

    >>> import databricks.koalas as ks
    >>>
    >>> df = ks.DataFrame([[1, 2], [4, 5], [7, 8]],
    ...                   index=['cobra', 'viper', 'sidewinder'],
    ...                   columns=['max_speed', 'shield'])
    >>> df.loc['sidewinder']
    max_speed    7
    shield       8
    Name: sidewinder, dtype: int64
    >>> df.loc['sidewinder', 'max_speed']
    7
    

    In addition, a Series derived from a different DataFrame can be used as an indexer (#1155).

    >>> import databricks.koalas as ks
    >>>
    >>> ks.options.compute.ops_on_diff_frames = True
    >>> 
    >>> df1 = ks.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [100, 200, 300, 400, 500]},
    ...                    index=[20, 10, 30, 0, 50])
    >>> df2 = ks.DataFrame({'A': [0, -1, -2, -3, -4], 'B': [-100, -200, -300, -400, -500]},
    ...                    index=[20, 10, 30, 0, 50])
    >>> df1.A.loc[df2.A > -3].sort_index()
    10    1
    20    0
    30    2
    

    Lastly, loc now follows the natural order of the index, identically to pandas, when slicing (#1159, #1174, #1179). See the example below.

    >>> df = ks.DataFrame([[1, 2], [4, 5], [7, 8]],
    ...                   index=['cobra', 'viper', 'sidewinder'],
    ...                   columns=['max_speed', 'shield'])
    >>> df.loc['cobra':'viper', 'max_speed']
    cobra    1
    viper    4
    Name: max_speed, dtype: int64
    

    Other new features and improvements

    We added the following new features:

    koalas.Series:

    • get (#1153)

    koalas.Index

    • drop (#1117)
    • len (#1161)
    • set_names (#1134)
    • argmin (#1162)
    • argmax (#1162)

    koalas.MultiIndex:

    • from_product (#1144)
    • drop (#1117)
    • len (#1161)
    • set_names (#1134)

    Other improvements

    • Add support from_pandas for Index/MultiIndex. (#1170)
    • Add a hidden column __natural_order__. (#1146)
    • Introduce _LocIndexerLike and consolidate some logic. (#1149)
    • Refactor LocIndexerLike.__getitem__. (#1152)
    • Remove sort in GroupBy._reduce_for_stat_function. (#1147)
    • Randomize index in tests and fix some window-like functions. (#1151)
    • Explicitly don't support Index.duplicated (#1131)
    • Fix DataFrame._repr_html_(). (#1177)
    Source code(tar.gz)
    Source code(zip)
  • v0.24.0(Dec 19, 2019)

    NumPy's universal function (ufunc) compatibility

    We added NumPy ufunc compatibility (#1127). Virtually all ufuncs are now compatible with Koalas DataFrame. See the example below:

    >>> import databricks.koalas as ks
    >>> import numpy as np
    >>> kdf = ks.range(10)
    >>> np.log(kdf)
             id
    0       NaN
    1  0.000000
    2  0.693147
    3  1.098612
    4  1.386294
    5  1.609438
    6  1.791759
    7  1.945910
    8  2.079442
    9  2.197225
    

    Other new features and improvements

    We added the following new features:

    koalas:

    • to_numeric (#1060)

    koalas.DataFrame:

    • idxmax (#1054)
    • idxmin (#1054)
    • pct_change (#1051)
    • info (#1124)

    koalas.Index

    • fillna (#1102)
    • min (#1114)
    • max (#1114)
    • drop_duplicates (#1121)
    • nunique (#1132)
    • sort_values (#1120)

    koalas.MultiIndex:

    • levshape (#1086)
    • min (#1114)
    • max (#1114)
    • sort_values (#1120)

    koalas.SeriesGroupBy

    • head (#1050)

    koalas.DataFrameGroupBy

    • head (#1050)

    Other improvements

    • Setting index name / names for Series (#1079)
    • disable 'str' for 'SeriesGroupBy', disable 'DataFrame' for 'GroupBy' (#1097)
    • Support 'compute.ops_on_diff_frames' for NumPy ufunc compat in Series (#1128)
    • Support arithmetic and comparison APIs on same DataFrames (#1129)
    • Fix rename() for Index to support MultiIndex also (#1125)
    • Set the upper-bound for pandas. (#1137)
    • Fix _cum() for Series to work properly (#1113)
    • Fix value_counts() to work properly when dropna is True (#1116, #1142)
    Source code(tar.gz)
    Source code(zip)
  • v0.23.0(Dec 5, 2019)

    NumPy's universal function (ufunc) compatibility

    We added NumPy ufunc compatibility (#1096, #1106). Virtually all ufuncs are now compatible with Koalas Series. See the example below:

    >>> import databricks.koalas as ks
    >>> import numpy as np
    >>> kdf = ks.range(10)
    >>> kser = np.sqrt(kdf.id)
    >>> type(kser)
    <class 'databricks.koalas.series.Series'>
    >>> kser
    0    0.000000
    1    1.000000
    2    1.414214
    3    1.732051
    4    2.000000
    5    2.236068
    6    2.449490
    7    2.645751
    8    2.828427
    9    3.000000
    

    Other new features and improvements

    We added the following new features:

    koalas:

    • option_context (#1077)

    koalas.DataFrame:

    • where (#1018)
    • mask (#1018)
    • iterrows (#1070)

    koalas.Series:

    • pop (#866)
    • first_valid_index (#1092)
    • pct_change (#1071)

    koalas.Index

    • symmetric_difference (#953, #1059)
    • to_numpy (#1058)
    • transpose (#1056)
    • T (#1056)
    • dropna (#938)
    • shape (#1085)
    • value_counts (#949)

    koalas.MultiIndex:

    • symmetric_difference (#953, #1059)
    • to_numpy (#1058)
    • transpose (#1056)
    • T (#1056)
    • dropna (#938)
    • shape (#1085)
    • value_counts (#949)

    Other improvements

    • Fix comparison operators to treat NULL as False (#1029)
    • Make corr return koalas.DataFrame (#1069)
    • Include link to Help Thirsty Koalas Fund (#1082)
    • Add Null handling for different frames (#1083)
    • Allow Series.__getitem__ to take boolean Series (#1075)
    • Produce correct output against multiIndex when 'compute.ops_on_diff_frames' is enabled (#1089)
    • Fix idxmax() / idxmin() for Series work properly (#1078)
    Source code(tar.gz)
    Source code(zip)
  • v0.22.0(Nov 14, 2019)

    Enable Arrow 0.15.1+

    Apache Arrow 0.15.0 did not work well with PySpark 2.4, so it was disabled in the previous release. With Arrow 0.15.1, it now works in Koalas (#902).

    Expanding and Rolling

    We also added expanding() and rolling() APIs to groupby(), Series, and DataFrame (#985, #991, #990, #1015, #996, #1034, #1037), supporting the following functions; a small usage sketch follows the list:

    • min
    • max
    • sum
    • mean
    • std
    • var
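
    A minimal sketch of the new windowing APIs, using illustrative data:

    import databricks.koalas as ks
    
    kser = ks.Series([1, 2, 3, 4, 5])
    
    kser.rolling(2).sum()     # rolling window of size 2
    kser.expanding(2).mean()  # expanding window with at least 2 observations
    
    kdf = ks.DataFrame({'k': ['a', 'a', 'b', 'b'], 'v': [1, 2, 3, 4]})
    kdf.groupby('k').v.rolling(2).max()  # windowing after a groupby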

    Multi-index columns support

    We continue improving multi-index columns support. We made the following APIs support multi-index columns:

    • median (#995)
    • at (#1049)
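
    For instance, a minimal hedged sketch with multi-index columns; the data is illustrative, and the at lookup follows pandas' tuple-label syntax:

    import pandas as pd
    import databricks.koalas as ks
    
    columns = pd.MultiIndex.from_tuples([('x', 'a'), ('x', 'b')])
    kdf = ks.DataFrame([[1, 2], [3, 4], [5, 6]], columns=columns)
    
    kdf.median()           # aggregation over multi-index columns
    kdf.at[0, ('x', 'a')]  # scalar lookup with a multi-index column label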

    Documentation

    We added "Best Practices" section in the documentation (#1041) so that Koalas users can read and follow. Please see https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html

    Other new features and improvements

    We added the following new features:

    koalas.DataFrame:

    • quantile (#984)
    • explain (#1042)

    koalas.Series:

    • between (#997)
    • update (#923)
    • mask (#1017)

    koalas.MultiIndex:

    • from_tuples (#970)
    • from_arrays (#1001)

    Along with the following improvements:

    • Introduce column_scols in InternalFrame as a substitute for data_columns. (#956)
    • Fix different index level assignment when 'compute.ops_on_diff_frames' is enabled (#1045)
    • Fix DataFrame.melt function & add doctest case for melt function (#987)
    • Enable creating Index from list like 'Index([1, 2, 3])' (#986)
    • Fix combine_frames to handle where the right hand side arguments are modified Series (#1020)
    • setup.py should support Python 2 to show a proper error message. (#1027)
    • Remove Series.schema. (#993)
    Source code(tar.gz)
    Source code(zip)
  • v0.21.0(Oct 31, 2019)

    Multi-index columns support

    We continue improving multi-index columns support. We made the following APIs support multi-index columns:

    • nunique (#980)
    • to_csv (#983)

    Documentation

    Now, we have an installation guide, design principles, and an FAQ in our public documentation (#914, #944, #963, #964).

    Other new features and improvements

    We added the following new features:

    koalas

    • merge (#969)

    koalas.DataFrame:

    • keys (#937)
    • ndim (#947)

    koalas.Series:

    • keys (#935)
    • mode (#899)
    • truncate (#928)
    • xs (#921)
    • where (#922)
    • first_valid_index (#936)

    koalas.Index:

    • copy (#939)
    • unique (#912)
    • ndim (#947)
    • has_duplicates (#946)
    • nlevels (#945)

    koalas.MultiIndex:

    • copy (#939)
    • ndim (#947)
    • has_duplicates (#946)
    • nlevels (#945)

    koalas.Expanding

    • count (#978)

    Along with the following improvements:

    • Fix passing options as keyword arguments (#968)
    • Make is_monotonic~ work properly for index (#930)
    • Fix Series.__getitem__ to work properly (#934)
    • Fix reindex when all the given columns are included in the existing columns (#975)
    • Add datetime as the equivalent python type to TimestampType (#957)
    • Fix is_unique to respect the current Spark column (#981)
    • Fix bug when assign None to name as Index (#974)
    • Use name_like_string instead of str directly. (#942, #950)
    Source code(tar.gz)
    Source code(zip)
  • v0.20.0(Oct 15, 2019)

    Disable Arrow 0.15

    Apache Arrow 0.15.0, which Koalas depends on to execute pandas UDFs, was released on October 5, 2019, but the Spark community reported an issue with PyArrow 0.15.

    We decided to set an upper bound for the pyarrow version to avoid such issues until we are sure that Koalas works fine with it.

    • Set an upper bound for pyarrow version. (#918)

    Multi-index columns support

    We continue improving multi-index columns support. We made the following APIs support multi-index columns:

    • pivot_table (#908)
    • melt (#920)

    Other new features and improvements

    We added the following new features:

    koalas.DataFrame:

    • xs (#892)

    koalas.Series:

    • drop_duplicates (#896)
    • replace (#903)

    koalas.GroupBy:

    • shift (#910)

    Along with the following improvements:

    • Implement nested renaming for groupby agg (#904)
    • Add 'index_col' parameter to DataFrame.to_spark (#906)
    • Add more options to read_csv (#916)
    • Add NamedAgg (#911)
    • Enable DataFrame setting value as list of labels (#905)
    Source code(tar.gz)
    Source code(zip)
  • v0.19.0(Oct 4, 2019)

    Koalas Logo

    Now we have an official logo!

    We can see the cute logo in our documents as well.

    Documentation

    Also we improved the documentation: https://koalas.readthedocs.io/en/latest/

    • Added the logo (#831)
    • Added a Jupyter notebook for 10 min tutorial (#843)
    • Added the tutorial to the documentation (#853)
    • Add some examples for plot implementations in their docstrings (#847)
    • Move contribution guide to the official documentation site (#841)

    Binder integration for the 10 min tutorial

    You can run a live Jupyter notebook for 10 min tutorial from Binder.

    Multi-index columns support

    We continue improving multi-index columns support. We made the following APIs support multi-index columns:

    • transform (#800)
    • round (#802)
    • unique (#809)
    • duplicated (#803)
    • assign (#811)
    • merge (#825)
    • plot (#830)
    • groupby and its functions (#833)
    • update (#848)
    • join (#848)
    • drop_duplicate (#856)
    • dtype (#858)
    • filter (#859)
    • dropna (#857)
    • replace (#860)

    Plots

    We also continue adding plot APIs as follows:

    For DataFrame:

    • plot.kde() (#784)

    Other new features and improvements

    We added the following new features:

    koalas.DataFrame:

    • pop (#791)
    • __iter__ (#836)
    • rename (#806)
    • expanding (#840)
    • rolling (#840)

    koalas.Series:

    • aggregate (#816)
    • agg (#816)
    • expanding (#840)
    • rolling (#840)
    • drop (#829)
    • copy (#869)

    koalas.DataFrameGroupBy:

    • expanding (#840)
    • rolling (#840)

    koalas.SeriesGroupBy:

    • expanding (#840)
    • rolling (#840)

    Along with the following improvements:

    • Add squeeze argument to read_csv (#812)
    • Raise a more helpful error for duplicated columns in Join (#820)
    • Issue with ks.merge to Series (#818)
    • Fix MultiIndex.to_pandas() and __repr__(). (#832)
    • Add unit and origin options for to_datetime (#839)
    • Fix on wrong error raise in DataFrame.fillna (#844)
    • Allow str and list in aggfunc in DataFrameGroupBy.agg (#828)
    • Add index_col argument to to_koalas(). (#863)
    Source code(tar.gz)
    Source code(zip)
  • v0.18.0(Sep 19, 2019)

    Multi-index columns support

    We continue improving multi-index columns support (#793, #776). We made the following APIs support multi-index columns:

    • applymap (#793)
    • shift (#793)
    • diff (#793)
    • fillna (#793)
    • rank (#793)

    Also, we can set tuple or None name for Series and Index. (#776)

    >>> import databricks.koalas as ks
    >>> kser = ks.Series([1, 2, 3])
    >>> kser.name = ('a', 'b')
    >>> kser
    0    1
    1    2
    2    3
    Name: (a, b), dtype: int64
    

    Plots

    We also continue adding plot APIs as follows:

    For Series:

    • plot.kde() (#767)

    For DataFrame:

    • plot.hist() (#780)

    Options

    In addition, we added the support for namespace-access in options (#785).

    >>> import databricks.koalas as ks
    >>> ks.options.display.max_rows
    1000
    >>> ks.options.display.max_rows = 10
    >>> ks.options.display.max_rows
    10
    

    See also User Guide of our project docs.

    Other new features and improvements

    We added the following new features:

    koalas.DataFrame:

    • aggregate (#796)
    • agg (#796)
    • items (#787)

    koalas.indexes.Index/MultiIndex

    • is_boolean (#795)
    • is_categorical (#795)
    • is_floating (#795)
    • is_integer (#795)
    • is_interval (#795)
    • is_numeric (#795)
    • is_object (#795)

    Along with the following improvements:

    • Add index_col for read_json (#797)
    • Add index_col for spark IO reads (#769, #775)
    • Add "sep" parameter for read_csv (#777)
    • Add axis parameter to dataframe.diff (#774)
    • Add read_json and let to_json use spark.write.json (#753)
    • Use spark.write.csv in to_csv of Series and DataFrame (#749)
    • Handle TimestampType separately when convert to pandas' dtype. (#798)
    • Fix spark_df when set_index(.., drop=False). (#792)

    Backward compatibility

    • We removed some parameters in DataFrame.to_csv and DataFrame.to_json to allow distributed writing (#749, #753)
    Source code(tar.gz)
    Source code(zip)
  • v0.17.0(Sep 5, 2019)

    Options

    We started using options to configure Koalas' behavior. Now we have the following options:

    • display.max_rows (#714, #742)
    • compute.max_rows (#721, #736)
    • compute.shortcut_limit (#717)
    • compute.ops_on_diff_frames (#725)
    • compute.default_index_type (#723)
    • plotting.max_rows (#728)
    • plotting.sample_ratio (#737)

    We can also see the list and their descriptions in the User Guide of our project docs.
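
    A minimal sketch of reading and setting these options, assuming the get_option/set_option/reset_option helpers:

    import databricks.koalas as ks
    
    ks.set_option('display.max_rows', 100)          # limit rows shown when printing
    print(ks.get_option('compute.shortcut_limit'))  # read the current value
    ks.reset_option('display.max_rows')             # restore the default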

    Plots

    We continue adding plot APIs as follows:

    For Series:

    • plot.area() (#704)

    For DataFrame:

    • plot.line() (#686)
    • plot.bar() (#695)
    • plot.barh() (#698)
    • plot.pie() (#703)
    • plot.area() (#696)
    • plot.scatter() (#719)

    Multi-index columns support

    We also continue improving multi-index columns support. We made the following APIs support multi-index columns:

    • koalas.concat() (#680)
    • koalas.get_dummies() (#695)
    • DataFrame.pivot_table() (#635)

    Other new features and improvements

    We added the following new features:

    koalas:

    • read_sql_table() (#741)
    • read_sql_query() (#741)
    • read_sql() (#741)

    koalas.DataFrame:

    • style (#712)

    Along with the following improvements:

    • GroupBy.apply should return Koalas DataFrame instead of pandas DataFrame (#731)
    • Fix rpow and rfloordiv to use proper operators in Series (#735)
    • Fix rpow and rfloordiv to use proper operators in DataFrame (#740)
    • Add schema inference support at DataFrame.transform (#732)
    • Add Option class to support type check and value check in options (#739)
    • Added missing tests (#687, #692, #694, #709, #711, #730, #729, #733, #734)

    Backward compatibility

    • We renamed two of the default index names from one-by-one and distributed-one-by-one to sequence and distributed-sequence respectively. (#679)
    • We moved the configuration for enabling operations on different DataFrames from the environment variable to the option. (#725)
    • We moved the configuration for the default index from the environment variable to the option. (#723)
    Source code(tar.gz)
    Source code(zip)
  • v0.16.0(Aug 22, 2019)

    Firstly, we introduced a new mode to enable operations on different DataFrames (#633). This mode can be enabled by setting the OPS_ON_DIFF_FRAMES environment variable to true, as below:
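
    A minimal hedged sketch of setting the variable in the current Python process; this is illustrative, and it is set here before any of the operations below run:

    import os
    
    os.environ["OPS_ON_DIFF_FRAMES"] = "true"  # enable operations on different DataFrames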

    >>> import databricks.koalas as ks
    >>>
    >>> kdf1 = ks.range(5)
    >>> kdf2 = ks.DataFrame({'id': [5, 4, 3]})
    >>> (kdf1 - kdf2).sort_index()
        id
    0 -5.0
    1 -3.0
    2 -1.0
    3  NaN
    4  NaN
    
    >>> import databricks.koalas as ks
    >>>
    >>> kdf = ks.range(5)
    >>> kdf['new_col'] = ks.Series([1, 2, 3, 4])
    >>> kdf
       id  new_col
    0   0      1.0
    1   1      2.0
    3   3      4.0
    2   2      3.0
    4   4      NaN
    

    Secondly, we also introduced a default index and disallowed Koalas DataFrames with no index internally (#639, #655). For example, if you create a Koalas DataFrame from a Spark DataFrame, the default index is used. The default index implementation can be configured by setting DEFAULT_INDEX to one of three types (a configuration sketch follows the list):

    • (default) one-by-one: It implements a one-by-one sequence using a Window function without specifying a partition. This index type should be avoided when the data is large.

      >>> ks.range(3)
         id
      0   0
      1   1
      2   2
      
    • distributed-one-by-one: It implements a one-by-one sequence using a group-by and group-map approach. It still generates a globally sequential one-by-one index. If the default index must be a one-by-one sequence in a large dataset, this index can be used.

      >>> ks.range(3)
         id
      0   0
      1   1
      2   2
      
    • distributed: It implements a monotonically increasing sequence simply by using Spark's monotonically_increasing_id function. If the index does not have to be a one-by-one sequence, this index can be used. Performance-wise, this index has almost no penalty compared to other index types.

      >>> ks.range(3)
                   id
      25769803776   0
      60129542144   1
      94489280512   2
      
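
    As referenced above, a minimal hedged sketch of choosing the index type via the environment variable; this is illustrative, and later releases move this setting to an option:

    import os
    
    os.environ["DEFAULT_INDEX"] = "distributed"  # or "one-by-one" / "distributed-one-by-one"
    
    import databricks.koalas as ks
    ks.range(3)  # the index is no longer a simple 0..n-1 sequence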

    Thirdly, we implemented many plot APIs in Series as follows:

    • plot.pie() (#669)
    • plot.area() (#670)
    • plot.line() (#671)
    • plot.barh() (#673)

    See the example below:

    import databricks.koalas as ks
    
    ks.range(10).to_pandas().id.plot.pie()
    

    image

    Fourthly, we continued rapidly improving multi-index columns support. Now multi-index columns are supported in multiple APIs:

    • DataFrame.sort_index()(#637)
    • GroupBy.diff()(#653)
    • GroupBy.rank()(#653)
    • Series.any()(#652)
    • Series.all()(#652)
    • DataFrame.any()(#652)
    • DataFrame.all()(#652)
    • DataFrame.assign()(#657)
    • DataFrame.drop()(#658)
    • DataFrame.reindex()(#659)
    • Series.quantile()(#663)
    • Series.transform()(#663)
    • DataFrame.select_dtypes()(#662)
    • DataFrame.transpose()(#664).

    Lastly, we added new functionality over the past weeks, especially groupby-related functionality; a small sketch follows the lists below. We added the following features:

    koalas.DataFrame

    • duplicated() (#569)
    • fillna() (#640)
    • bfill() (#640)
    • pad() (#640)
    • ffill() (#640)

    koalas.groupby.GroupBy:

    • diff() (#622)
    • nunique() (#617)
    • nlargest() (#654)
    • nsmallest() (#654)
    • idxmax() (#649)
    • idxmin() (#649)
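
    As mentioned above, a minimal hedged sketch of a couple of the new groupby functions, using illustrative data:

    import databricks.koalas as ks
    
    kdf = ks.DataFrame({'k': ['a', 'a', 'b', 'b'], 'v': [1, 1, 2, 3]})
    
    kdf.groupby('k').nunique()      # number of distinct values per group
    kdf.groupby('k').v.nlargest(2)  # two largest values of v within each group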

    Along with the following improvements:

    • Add a basic infrastructure for configurations. (#645)
    • Always use column_index. (#648)
    • Allow omitting type hints in GroupBy.transform, filter, and apply (#646)
    Source code(tar.gz)
    Source code(zip)