A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner

Jason Carpenter

Last update: Jan 4, 2023

Related tags

Data Containers pandas-dataframe parallel-computing parallelization pandas dask modin

Overview

swifter

A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner.

Blog posts

Documentation

To learn about the latest improvements, please check the changelog.

Further documentation on swifter is available here.

Check out the examples notebook, along with the speed benchmark notebook. The benchmarks are created using the library perfplot.

Installation:

$ pip install -U pandas # upgrade pandas
$ pip install swifter # first time installation

$ pip install -U swifter # upgrade to latest version if already installed

alternatively, to install on Anaconda:

conda install -c conda-forge swifter

...after installing, import swifter into your code along with pandas using:

import pandas as pd
import swifter

...alternatively, swifter can be used with modin dataframes in the same manner:

import modin.pandas as pd
import swifter

NOTE: if you import swifter before modin, you will have to additionally register modin: swifter.register_modin()

Easy to use

df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [5, 6, 7, 8]})

# runs on single core
df['x2'] = df['x'].apply(lambda x: x**2)
# runs on multiple cores
df['x2'] = df['x'].swifter.apply(lambda x: x**2)

# use swifter apply on whole dataframe (row-wise, so the result aligns with the index)
df['agg'] = df.swifter.apply(lambda x: x.sum() - x.min(), axis=1)

# use swifter apply on specific columns
df['outCol'] = df[['inCol1', 'inCol2']].swifter.apply(my_func)
df['outCol'] = df[['inCol1', 'inCol2', 'inCol3']].swifter.apply(my_func,
             positional_arg, keyword_arg=keyword_argval)

Vectorizes your function, when possible

When vectorization is not possible, automatically decides which is faster: to use dask parallel processing or a simple pandas apply
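
As a rough illustration of what "vectorizes when possible" means, the sketch below uses plain pandas rather than swifter's internals. It treats a function as vectorizable if one call on the whole Series gives the same result as applying it element by element. The helper name `is_vectorizable` is hypothetical and is not part of swifter's API.

```python
import pandas as pd


def is_vectorizable(func, series):
    """Hypothetical helper: does func(series) match series.apply(func)?"""
    try:
        whole = func(series)              # one call on the entire column
    except Exception:
        return False                      # e.g. a scalar `if` is ambiguous on a Series
    elementwise = series.apply(func)      # one call per element
    return isinstance(whole, pd.Series) and bool(whole.equals(elementwise))


def square(x):
    # Pure arithmetic works on a scalar and on a whole Series alike
    return x ** 2


def clip_small(x):
    # The scalar branch prevents calling this on a whole Series
    return 0 if x < 3 else x


s = pd.Series([1, 2, 3, 4])

print(is_vectorizable(square, s))      # True
print(is_vectorizable(clip_small, s))  # False
```

A function like `square` can run once on the whole column, while `clip_small` has to be applied per element. That second case is where the choice between parallel processing and a simple pandas apply comes in.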

Notes

The function is documented in the .py file. In Jupyter Notebooks, you can see the docs by pressing Shift+Tab(x3). Also, check out the complete documentation here along with the changelog.
Please upgrade your version of pandas, as the pandas extension api used in this module is a recent addition to pandas.
Import modin before importing swifter, if you wish to use modin with swifter. Otherwise, use swifter.register_modin() to access it.
Do not use swifter to apply a function that modifies external variables. Under the hood, swifter does sample applies to optimize performance. These sample applies will modify the external variable in addition to the final apply. Thus, you will end up with an erroneously modified external variable.
It is advised to disable the progress bar if calling swifter from a forked process as the progress bar may get confused between various multiprocessing modules.
If swifter's return value differs from that of pandas, try explicitly casting the type, e.g.: df.swifter.apply(lambda x: float(np.angle(x)))
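
The note about external variables can be illustrated with a small pure-Python sketch. `sample_then_apply` is a hypothetical stand-in for the sampling step described above, not swifter's actual implementation, and the sample size of 2 is arbitrary.

```python
calls = []  # external state that the applied function mutates


def record_and_square(x):
    calls.append(x)  # side effect: modifies a variable outside the function
    return x ** 2


def sample_then_apply(func, values, sample_size=2):
    """Hypothetical stand-in for an optimizer that probes func on a sample."""
    for v in values[:sample_size]:    # trial run, used only to pick a strategy
        func(v)
    return [func(v) for v in values]  # the real apply over all values


data = [1, 2, 3, 4]
result = sample_then_apply(record_and_square, data)

print(result)      # [1, 4, 9, 16]
print(len(calls))  # 6, not 4: the sample run also mutated `calls`
```

The returned values are correct, but the external list has recorded two extra calls. Keeping applied functions free of side effects avoids this.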

Comments

Slow Performance of Swifter for Text Preprocessing
Hi @jmcarpenter2,

Dear Swifter Folks,

Recently, I found that swifter is 5-10x slower than vanilla pandas apply when the function cannot be vectorized (in my case, text preprocessing).

The experiment is like this:

import pandas as pd
import swifter

def clean_text(text):
    text = text.strip()
    text = text.replace(' ', '_')
    return text

N_rows = 7000000
df_data = pd.DataFrame([["i want to break free"]] * N_rows, columns=["text"])

%time df_data['text'] = df_data['text'].swifter.apply(clean_text)
%time df_data['text'] = df_data['text'].apply(clean_text)

Is this expected? Let's discuss to make sure I'm not missing something. Thank you!
opened by hadyan-tvlk 26
Swifter using only single core

I am applying swifter to a function which takes several values apart from a datetime variable. After running the code, I saw it using only a single core (6 cores are available). The data has 476k rows. With a single core, it takes about 7.5 minutes.

I added set_npartitions(16), which improved the processing time to 3.5 minutes, but it is still using a single core.

Any reason why it can't use all the cores?

opened by raghu1121 13
Swifter Restarting Script

Hi,

I have attempted to speed up some data processing involving data frames with over 4 million rows with swifter on Python 3.6.

I have prepended swifter to some of my pandas applies; however, it seems to completely restart the script multiple times (printing out debug information over and over) and create multiple threads within the call stack, crashing the program.

I have been unable to trace which apply causes the error at this point in time.

I understand that Python 3 support is experimental. If you'd like me to share my Anaconda environment, let me know.

opened by Jack-McKew 12
swifter install is stuck

Hey guys, I'm using Python 3.9 on my local machine (macOS). I tried a simple pip install swifter in my venv and have not been able to get past this:

INFO: pip is looking at multiple versions of ipykernel to determine which version is compatible with other requirements. This could take a while.
Collecting ipykernel>=4.5.1
  Using cached ipykernel-5.4.1-py3-none-any.whl (119 kB)
  Using cached ipykernel-5.4.0-py3-none-any.whl (119 kB)
  Using cached ipykernel-5.3.4-py3-none-any.whl (120 kB)
  Using cached ipykernel-5.3.3-py3-none-any.whl (120 kB)
  Using cached ipykernel-5.3.2-py3-none-any.whl (120 kB)
  Using cached ipykernel-5.3.1-py3-none-any.whl (120 kB)
  Using cached ipykernel-5.3.0-py3-none-any.whl (119 kB)

I believe someone else also had this issue and documented it in this Stack Overflow post: https://stackoverflow.com/questions/65238819/failed-to-install-swifter-via-pip-info-pip-is-looking-at-multiple-versions
installation issue

opened by amankagarwal 11
swifter apply for resample groups

I've used swifter to speed up apply calls on DataFrames, but this isn't the only context apply is used in pandas. Would it be simple to implement for resample objects also?

See: pandas.DataFrame.resample

Can we go from: series.resample('3T').apply(custom_resampler) to: series.resample('3T').swifter.apply(custom_resampler)?
enhancement

opened by harahu 11
Question: Swifter with lambda functions
Hi-

Sorry to leave a question here, but I didn't see any other way to reach you. I am loving swifter and would like to figure out how to apply it in a double-apply that I'm doing between two dataframes. I'm using hamming distance to calculate the distance between two strings from columns of two different data frames as follows:

df1
id | Target
12 | AATTGG
57 | GGAACC

df2
id | ngram
22 | AATTGC
42 | AATTGA

import distance
df1.Target.apply(lambda bc: df2.ngram.apply(lambda x: distance.hamming(bc, x)))

Is there a way to do something like this in swifter?

Thanks!
opened by summerela 10
Python 3.10 support?

I am trying to install swifter via conda in a new virtual environment based on Python 3.10 and it fails with some dependency issues. Is Python 3.10 not supported, or perhaps something else is going on with my environment?

Thank you.
installation issue

opened by borice 9

TypeError: TypeError('encode() argument 1 must be string, not bool',) while `apply`ing to dataframe

Python 2.7.15
Swifter 0.281
Pandas 0.24.1
Numpy 1.16.1

Trying to switch a relatively simple, currently working dataframe .apply to use this package, I ran into this exception. Here is the code I am running:

def score_stuff(df_to_score, months, predictor):
    def compute_value(row):
        return predictor.compute_n_month_values(row.plan, row.length_in_months, row.months)

    df_to_score['output'] = df_to_score.apply(compute_value, axis=1)

(row.plan is a string/object, row.length_in_months is a float, and row.months is an int. There are other cols in the df_to_score of many types but they are not referenced in the compute_value() method)

Here's the stack trace.

File "/opt/airflow/repo/dags/scripts/models/score_stuff.py", line 128, in score_stuff
   df_to_score['output'] = df_to_score.apply(compute_value, axis=1)
File "/var/lib/venv/airflow/local/lib/python2.7/site-packages/swifter/swifter.py", line 285, in apply
  **kwds
File "/var/lib/venv/airflow/local/lib/python2.7/site-packages/tqdm/_tqdm.py", line 657, in inner
  t = tclass(*targs, total=total, **tkwargs)
File "/var/lib/venv/airflow/local/lib/python2.7/site-packages/tqdm/_tqdm.py", line 945, in __init__
  self.display()
File "/var/lib/venv/airflow/local/lib/python2.7/site-packages/tqdm/_tqdm.py", line 1315, in display
  self.sp(self.__repr__() if msg is None else msg)
File "/var/lib/venv/airflow/local/lib/python2.7/site-packages/tqdm/_tqdm.py", line 250, in print_status
  fp_write('\r' + s + (' ' * max(last_len[0] - len_s, 0)))
File "/var/lib/venv/airflow/local/lib/python2.7/site-packages/tqdm/_tqdm.py", line 243, in fp_write
  fp.write(_unicode(s))
File "/var/lib/venv/airflow/local/lib/python2.7/site-packages/tqdm/_utils.py", line 160, in write
  self, 'encoding')))
TypeError: encode() argument 1 must be string, not bool
Exception TypeError: TypeError('encode() argument 1 must be string, not bool',) in <bound method tqdm.__del__ of Pandas Apply:   0%|          | 0/14168 [00:00<?, ?it/s]> ignored

Interestingly, using .swifter was fine on my local OS X machine on a small dataset, but failing on a 16 core EC2 instance with a larger dataset.

I tried passing raw=True just in case that might help, but it did not... am I just doing something dumb?

opened by apurvis 9

swifter using single core only

Hi, I have tried to use swifter.apply() on a pandas dataframe and can't get it to use more than one core.

I'm running swifter version 1.0.9 on a CentOS 8 server with 20 cores and 202 GB RAM, using Jupyter notebooks. Everything was installed using conda.

Information on the DataFrame:

DatetimeIndex: 1950000 entries, 2016-11-10 06:32:00.000030+00:00 to 2016-11-10 06:44:59.999630+00:00
Columns: 741 entries, 2700.321045 to 3199.975098
dtypes: uint16(741)
memory usage: 2.8 GB

The code to run the swifter.apply() is:

import datetime as dt

def rail_break(amplitude_ser):
    # Rolling maximum over the last 1000 samples
    amplitude_max_ser = amplitude_ser.rolling(window=1000, min_periods=1).max()
    alarm_amp_threshold = 1.05
    alarm_time_threshold = dt.timedelta(minutes=5)
    background_amp_time = dt.timedelta(seconds=5)
    # Mean background amplitude over the first 5 seconds
    mean_background_amp = amplitude_max_ser[
        amplitude_max_ser.index <= (amplitude_max_ser.index[0] + background_amp_time)
    ].mean()
    alarm_amp_threshold = mean_background_amp * alarm_amp_threshold
    try:
        alarm_start_time = amplitude_max_ser.index[amplitude_max_ser >= alarm_amp_threshold][0]
    except IndexError:
        # The threshold is never reached, so there is no alarm
        return False
    alarm_end_time = alarm_start_time + alarm_time_threshold
    alarm = amplitude_max_ser[
        (amplitude_max_ser.index >= alarm_start_time)
        & (amplitude_max_ser.index <= alarm_end_time)
    ] <= alarm_amp_threshold
    # Alarm only if the amplitude stays above the threshold for the whole window
    return not alarm.any()

alarm_distances = amplitude_df.columns[amplitude_df.swifter.apply(rail_break, axis=0)]
alarm_df = amplitude_df.loc[:,alarm_distances]

I have tried the following but it still only uses one core:

used amplitude_df.swifter.set_dask_scheduler('processes').apply(rail_break, axis=0)
transposed the DataFrame to use axis=1

opened by malapradej 8

Kernel dies after importing swifter

Hello!

I am experiencing an issue when trying to import swifter in Jupyter Notebook - Kernel basically dies after importing.

python version: 3.7.4
pandas version: 1.0.1
swifter version: 0.301

I'm on MacOS Mojave (v 10.14) and 8 GB RAM.

I am using Anaconda 3 and I've tried both installing via pip and via conda.

Also tried using virtualenv just in case there was any incompatibility issue but still ran into the same problem.

Thank you in advance for your attention, if you need any other details please just ask for them!

opened by Mmoncadaisla 8

Configurable progress bar instances

I didn't really like how I was forced to have "Pandas Apply" or "Dask Apply" as my output. So I did a thing.

The Dask progress bar enforces its own total argument, so I had it override anything sent into it.

import pandas as pd
import swifter
df = pd.DataFrame([{'a': 1, 'b': 2}, {'c': 3, 'd': 4}])
df.swifter.apply(print)
Pandas Apply: 100%|█████████████████████████| 4/4 [00:00<00:00, 1561.11it/s]

df.swifter.progress_bar(desc='testy!', total=2).apply(print)
testy!: 100%|████████████████████████████████| 2/2 [00:00<00:00, 768.75it/s]

opened by rlynch-ironnet 8

Why does "force_parallel(enable=True)" not work?

In this code, dask works:

def has_inter(x_cat_set, now_set):
    inter = x_cat_set.intersection(now_set)
    return len(inter) == 0 

def get_negs2(now_set,si_doc, num, df3):
    negs_set = set(df3[df3.loc[:,'s_cat'].swifter.progress_bar(False).apply(has_inter, args=(now_set, ))].s_id)
    negs = list(negs_set)
    return negs

neg_dict = df2.loc[:, 's_cat'].swifter.force_parallel(enable=True).apply(get_negs2, args=(si_doc, n_neg, df3,))

This is the result:

In this code, dask doesn't work:


def get_negs(line, si_doc, num, df3):
    now_set = line['s_cat']
    negs_set = set(df3[df3.loc[:,'s_cat'].swifter.progress_bar(False).apply(has_inter, args=(now_set, ))].s_id)
    negs = list(negs_set)
    return negs

neg_dict = df2.swifter.force_parallel(enable=True).allow_dask_on_strings(enable=True).apply(get_negs,args=(si_doc,n_neg, df3,),axis=1)

This is the result:

Why are there different results? I want to use the second method, because I need to use two columns of data in other cases.

opened by kongbo96 0

Swifter With GroupBy - Crashing Python

Using swifter with groupby, I am running into an error that crashes the Python instance. The error is below; please let me know if there is any more information that would help get to the bottom of this.

2022-10-06 15:07:26,013 INFO worker.py:1518 -- Started a local Ray instance.
[failure_signal_handler.cc : 171] RAW: sigaltstack() failed with errno=1
/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

opened by SamPetherbridge 0
Inform the user whether multiprocessing was used
Hi, thanks for this cool library.

One thing that would be nice would be to tell the user what swifter decided to do, e.g.:

was it able to vectorize?

did it choose to apply multiprocessing with Dask?

Right now it seems everything is totally transparent to the user; I cannot easily tell if swifter is even using more than one core.
opened by tadamcz 0
swifter.groupby() does not work with dropna=False

I found that a swifter groupby-apply chain raises an error when trying to sort the index if I set dropna to False in the groupby step.

Here is the error log:

Traceback (most recent call last):
  File "/paedyl01/disk1/yangyxt/ngs_scripts/acmg_automated_anno.py", line 76, in wrapper
    result = func(*args, **kwargs)
  File "/paedyl01/disk1/yangyxt/ngs_scripts/acmg_automated_anno.py", line 484, in BP2_PM3_compound_with_patho
    return df.swifter.groupby([gene_col], as_index=False, dropna=False).apply(check_compound_per_gene,
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/swifter/swifter.py", line 661, in apply
    return self._ray_apply(func, *args, **kwds)
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/swifter/swifter.py", line 650, in _ray_apply
    return pd.concat(ray.get(apply_chunks), axis=self._axis).sort_index()
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/frame.py", line 6447, in sort_index
    return super().sort_index(
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/generic.py", line 4685, in sort_index
    indexer = get_indexer_indexer(
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/sorting.py", line 94, in get_indexer_indexer
    indexer = nargsort(
  File "/home/yangyxt/anaconda3/envs/dask/lib/python3.9/site-packages/pandas/core/sorting.py", line 417, in nargsort
    indexer = non_nan_idx[non_nans.argsort(kind=kind)]
TypeError: '<' not supported between instances of 'int' and 'tuple'
ERROR:2022-09-28 13:40:29,310:wrapper:83:Exception raised in main_anno_process. exception: '<' not supported between instances of 'int' and 'tuple'

The dataframe passed to swifter.groupby() has an ordinary numerical index, from 0 to len(df). The groupby column might have some rows with NA values, and I do wish to keep them. I guess that's why this issue happened. I'm not sure whether this can be fixed or optimized. Please take a look.

opened by yangyxt 1
IndexError: tuple index out of range (when using dask_apply)

Hi,

swifter version: 1.1.3
dask version: 2022.05.0
pandas version: 1.4.2
python version: 3.9.12

When I use dataframe[col].apply(func), it works.

When I use dataframe[col].swifter.allow_dask_on_strings(enable=True).apply(func) on a SMALL sample (10), it uses pandas apply under the hood and it works.

When I use dataframe[col].swifter.allow_dask_on_strings(enable=True).apply(func) on a BIGGER sample (1000), it uses dask apply under the hood and it does NOT work. There seems to be a problem when switching to dask apply. Here is the complete error:

Traceback (most recent call last):
  File "/opt/continuum/.conda/envs/nlpbeneva/lib/python3.9/site-packages/swifter/swifter.py", line 241, in apply
    self._validate_apply(
  File "/opt/continuum/.conda/envs/nlpbeneva/lib/python3.9/site-packages/swifter/base.py", line 50, in _validate_apply
    raise ValueError(error_message)
ValueError: Vectorized function sample doesn't match pandas apply sample.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in
  File "", line 47, in transform
  File "/opt/continuum/.conda/envs/nlpbeneva/lib/python3.9/site-packages/swifter/swifter.py", line 255, in apply
    return self._dask_apply(func, convert_dtype, *args, **kwds)
  File "/opt/continuum/.conda/envs/nlpbeneva/lib/python3.9/site-packages/swifter/swifter.py", line 173, in _dask_apply
    dd.from_pandas(sample, npartitions=self._npartitions)
  File "/opt/continuum/.conda/envs/nlpbeneva/lib/python3.9/site-packages/dask/base.py", line 292, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/opt/continuum/.conda/envs/nlpbeneva/lib/python3.9/site-packages/dask/base.py", line 576, in compute
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File "/opt/continuum/.conda/envs/nlpbeneva/lib/python3.9/site-packages/dask/base.py", line 576, in
    return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
  File "/opt/continuum/.conda/envs/nlpbeneva/lib/python3.9/site-packages/dask/dataframe/core.py", line 129, in finalize
    return _concat(results)
  File "/opt/continuum/.conda/envs/nlpbeneva/lib/python3.9/site-packages/dask/dataframe/core.py", line 110, in _concat
    return da.core.concatenate3(args)
  File "/opt/continuum/.conda/envs/nlpbeneva/lib/python3.9/site-packages/dask/array/core.py", line 5124, in concatenate3
    chunks = chunks_from_arrays(arrays)
  File "/opt/continuum/.conda/envs/nlpbeneva/lib/python3.9/site-packages/dask/array/core.py", line 4911, in chunks_from_arrays
    result.append(tuple(shape(deepfirst(a))[dim] for a in arrays))
  File "/opt/continuum/.conda/envs/nlpbeneva/lib/python3.9/site-packages/dask/array/core.py", line 4911, in
    result.append(tuple(shape(deepfirst(a))[dim] for a in arrays))
IndexError: tuple index out of range

opened by CoteDave 1
Swifter "progress_bar" Not Working
I just started experimenting with Swifter a few minutes ago and have been struggling to get the progress bar to show.

I have the code snippet below, which was adapted from the example code provided.

Why is the progress_bar(enable=True) option not working? Is there something wrong with my code?

var_unza_dspace_dataframe["subjectMistakes"] = var_unza_dspace_dataframe["subject"].str.split("=").swifter.allow_dask_on_strings(enable=True).progress_bar( enable=True, desc='Subjects Mistakes' ).apply(fxn_subject_spellchecker)
opened by lightonphiri 11

Owner

Jason Carpenter

Accelerating AI development for leading companies

GitHub
