A simple and efficient tool to parallelize Pandas operations on all available CPUs

Overview

Pandaral·lel


[Demo animations: the same operation running without Pandarallel and with Pandarallel]

Installation

$ pip install pandarallel [--upgrade] [--user]

Requirements

On Windows, Pandaral·lel works only if the Python session (python, ipython, jupyter notebook, jupyter lab, ...) is executed from Windows Subsystem for Linux (WSL).

On Linux & macOS, nothing special has to be done.

Warning

  • Parallelization has a cost (instantiating new processes, sending data via shared memory, ...), so parallelization is efficient only if the amount of computation to parallelize is high enough. For small amounts of data, parallelization is not always worth it (see the timing sketch below).
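
One way to check whether parallelization pays off for a given workload is to time both variants on a sample of the data, as in the timing sketch below (the DataFrame and function are made up for illustration; they are not from the Pandaral·lel documentation):

import math
import time

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()

df = pd.DataFrame({"a": range(1_000_000)})

def func(row):
    # some non-trivial per-row work
    return math.sin(row.a ** 2)

start = time.time()
df.apply(func, axis=1)
print("sequential:", time.time() - start)

start = time.time()
df.parallel_apply(func, axis=1)
print("parallel:", time.time() - start)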

Examples

An example of each API is available here.

Benchmark

For a few example operations, here is a comparative benchmark with and without Pandaral·lel.

Computer used for this benchmark:

  • OS: Linux Ubuntu 16.04
  • Hardware: Intel Core i7 @ 3.40 GHz - 4 cores

[Benchmark results chart: standard vs. parallel execution times]

For these examples, parallel operations run approximately 4x faster than the standard operations (except series.map, which runs only about 3.2x faster).

API

First, you have to import pandarallel:

from pandarallel import pandarallel

Then, you have to initialize it:

pandarallel.initialize()

This method takes 5 optional parameters:

  • shm_size_mb: Deprecated.
  • nb_workers: Number of workers used for parallelization (int). If not set, all available CPUs will be used.
  • progress_bar: Display progress bars if set to True. (bool, False by default)
  • verbose: The verbosity level (int, 2 by default)
    • 0 - don't display any logs
    • 1 - display only warning logs
    • 2 - display all logs
  • use_memory_fs: (bool, None by default)
    • If set to None and a memory file system is available, Pandarallel will use it to transfer data between the main process and workers. If no memory file system is available, Pandarallel will fall back to standard multiprocessing data transfer (pipe).
    • If set to True, Pandarallel will use the memory file system to transfer data between the main process and workers, and will raise a SystemError if no memory file system is available.
    • If set to False, Pandarallel will use standard multiprocessing data transfer (pipe) between the main process and workers.

Using the memory file system reduces the data transfer time between the main process and workers, especially for large data.

The memory file system is considered available only if the directory /dev/shm exists and the user has read and write permissions on it.

In practice, the memory file system is available only on some Linux distributions (including Ubuntu).
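
Put together, a typical initialization call with explicit options might look like the following sketch (values chosen purely for illustration):

from pandarallel import pandarallel

pandarallel.initialize(
    nb_workers=4,        # default: all available CPUs
    progress_bar=True,   # default: False
    verbose=2,           # 0: no logs, 1: warnings only, 2: all logs
    use_memory_fs=None,  # default: use the memory file system if available, otherwise pipes
)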

With df a pandas DataFrame, series a pandas Series, func a function to apply/map, args, args1, args2 some arguments, and col_name a column name:

Without parallelization                                  | With parallelization
df.apply(func)                                           | df.parallel_apply(func)
df.applymap(func)                                        | df.parallel_applymap(func)
df.groupby(args).apply(func)                             | df.groupby(args).parallel_apply(func)
df.groupby(args1).col_name.rolling(args2).apply(func)    | df.groupby(args1).col_name.rolling(args2).parallel_apply(func)
df.groupby(args1).col_name.expanding(args2).apply(func)  | df.groupby(args1).col_name.expanding(args2).parallel_apply(func)
series.map(func)                                         | series.parallel_map(func)
series.apply(func)                                       | series.parallel_apply(func)
series.rolling(args).apply(func)                         | series.rolling(args).parallel_apply(func)

You will find a complete example here for each row in this table.
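
As a minimal illustration of a few rows of this table (with made-up data, not the official example notebook):

import math

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

df = pd.DataFrame({"a": range(10_000), "b": range(10_000)})

# DataFrame.apply -> DataFrame.parallel_apply
res = df.parallel_apply(lambda row: math.sin(row.a ** 2) + row.b, axis=1)

# Series.map -> Series.parallel_map
squares = df.a.parallel_map(lambda x: x ** 2)

# DataFrameGroupBy.apply -> DataFrameGroupBy.parallel_apply
means = df.groupby(df.b % 10).parallel_apply(lambda g: g.a.mean())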

Troubleshooting

I have 8 CPUs but parallel_apply speeds up computation only about x4. Why?

Pandarallel can only speed up computation up to roughly the number of physical cores your computer has. Most recent CPUs (such as the Intel Core i7) use hyperthreading: for example, a 4-core hyperthreaded CPU will show 8 CPUs to the operating system, but really has only 4 physical computation units.

On Ubuntu, you can get the number of physical cores with $ grep -m 1 'cpu cores' /proc/cpuinfo.
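
From Python, a quick way to compare logical CPUs with physical cores is sketched below (psutil is an optional third-party package, not a Pandaral·lel dependency):

import os

print("Logical CPUs:", os.cpu_count())

try:
    import psutil  # optional; can report physical cores
    print("Physical cores:", psutil.cpu_count(logical=False))
except ImportError:
    pass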


I use Jupyter Lab and instead of progress bars, I see this kind of thing:
VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=625000), Label(value='0 / 625000')…

Run the following three commands, and you should be able to see the progress bars:

$ pip install ipywidgets
$ jupyter nbextension enable --py widgetsnbextension
$ jupyter labextension install @jupyter-widgets/jupyterlab-manager

(You may also have to install nodejs if asked)

Comments
  • Maximum size exceeded

    Hi, I set pandarallel.initialize(shm_size_mb=10000), and after applying parallel_apply to my column I get the error Maximum size exceeded (2GB).

    Why do I get this message when I set more than 2 GB?

    bug 
    opened by vvssttkk 23
  • Setting progress_bar=True freezes execution for parallel_apply before reaching 1% completion on all CPUs

    When progress_bar=True, I noticed that the execution of my parallel_apply task stopped right before all parallel processes reached the 1% progress mark. Here are some further details of what I was encountering:

    • I turned on logging with DEBUG messages, but no messages were displayed when the execution stopped. There were no error messages either. The dataframe rows simply stopped processing further and the process seemed to be frozen.
    • I have two CPUs. It seems that the progress bar only updates in 1% increments. One of the progress bars reaches the 1% mark, but when the number of processed rows reaches the 2% mark (which I assume is associated with the second progress bar also updating to 1%), that's when the process froze.
    • The process runs fine with progress_bar=False.
    opened by abhineetgupta 22
  • pandarallel_apply crashes with OverflowError: int too big to convert

    Hi everyone,

    I am getting the following error when using parallel_apply in pandas:

      File "extract_specifications.py", line 156, in <module>
        extracted_data = df.parallel_apply(extract_raw_infos, axis=1)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 367, in closure
        kwargs,
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 239, in get_workers_args
        zip(input_files, output_files, chunk_lengths)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 238, in <listcomp>
        for index, (input_file, output_file, chunk_length) in enumerate(
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 169, in wrapper
        time=time,
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 464, in inline
        func_instructions, len(b"".join(pinned_pre_func_instructions_without_return))
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 314, in shift_instructions
        for instruction in instructions
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 314, in <genexpr>
        for instruction in instructions
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 293, in shift_instruction
        return bytes((operation,)) + int2python_bytes(python_ints2int(values) + qty)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 34, in wrapper
        return function(*args, **kwargs)
      File "/home/tom/.local/lib/python3.6/site-packages/pandarallel/utils/inliner.py", line 71, in int2python_bytes
        return int.to_bytes(item, nb_bytes, "little")
    OverflowError: int too big to convert
    

    I am using

    pandarallel == 1.4.2
    pandas == 0.24.2
    python == 3.6.9
    

    Any idea how to proceed from here? I basically have no idea what could cause this bug. I suspect it might be related to the size of the data I have in one column (I store HTML from web pages in there), but otherwise no idea. I would help remove this bug if I had some guidance here. Thanks for helping.

    opened by yeus 21
  • Fails with "_wrap_applied_output() missing 1 required positional argument" where a simple pandas apply succeeds

    Hello,

    I'm using Python 3.8.10 (Anaconda distribution, GCC 7.5.10) on Ubuntu 20 LTS (64-bit x86).

    From my pip freeze:

    pandarallel 1.5.2
    pandas 1.3.0
    numpy 1.20.3

    I'm working with a DataFrame that looks like this one:

    HoleID scaffold tpl strand base score tMean tErr modelPrediction ipdRatio coverage isboundary identificationQv context experiment isbegin_bondary isend_boundary isin_IES uniqueID No_known_IES_retention_this_CCS detailed_classif
    1025444 70189477 scaffold_024_with_IES 688203 0 T 2 0.517 0.190 0.555 0.931 11 True NaN TTAAATAGAAATTAAAATCAGCTGC NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025446 70189477 scaffold_024_with_IES 688204 0 A 4 1.347 0.367 1.251 1.077 13 True NaN TAAATAGAAATTAAAATCAGCTGCT NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025448 70189477 scaffold_024_with_IES 688205 0 A 5 1.913 0.779 1.464 1.307 16 True NaN AAATAGAAATTAAAATCAGCTGCTT NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025450 70189477 scaffold_024_with_IES 688206 0 A 4 1.535 0.712 1.328 1.156 18 True NaN AATAGAAATTAAAATCAGCTGCTTA NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES
    1025452 70189477 scaffold_024_with_IES 688207 0 A 5 1.655 0.565 1.391 1.190 18 True NaN ATAGAAATTAAAATCAGCTGCTTAA NM9_10 False False False NM9_10_70189477 False POTENTIALLY_RETAINED_MACIES_OUTIES

    I defined the following function

    def get_distance_from_nearest_criteria(df,criteria):
        begins = df[df[criteria]].copy()
        
        if len(begins) == 0:
            return pd.Series([np.nan for x in range(len(df))])
        else:
            list_return = []
    
            for idx, nt in df.iterrows():
                distances = [abs(nt["tpl"] - x) for x in begins["tpl"]]
                mindistance = min(distances,default=np.nan)
                list_return.append(mindistance)
    
            return pd.Series(list_return)
    

    Then using :

    from pandarallel import pandarallel
    pandarallel.initialize(progress_bar=False, nb_workers=12)
    out = df.groupby(["uniqueID"]).parallel_apply(lambda x: get_distance_from_nearest_criteria(x,'isbegin_bondary'))
    

    leads to :

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-49-02fc7c0589e3> in <module>
    ----> 1 out = df.groupby(["uniqueID"]).parallel_apply(lambda x: get_distance_from_nearest_criteria(x,'isbegin_bondary'))
    
    ~/conda3/envs/ies/lib/python3.8/site-packages/pandarallel/pandarallel.py in closure(data, func, *args, **kwargs)
        463             )
        464 
    --> 465             return reduce(results, reduce_meta_args)
        466 
        467         finally:
    
    ~/conda3/envs/ies/lib/python3.8/site-packages/pandarallel/data_types/dataframe_groupby.py in reduce(results, df_grouped)
         14         keys, values, mutated = zip(*results)
         15         mutated = any(mutated)
    ---> 16         return df_grouped._wrap_applied_output(
         17             keys, values, not_indexed_same=df_grouped.mutated or mutated
         18         )
    
    TypeError: _wrap_applied_output() missing 1 required positional argument: 'values'
    
    

    For me, the error is not clear enough (I can't tell what's happening)

    However, when I run it with a simple pandas apply, it works and returns:

    uniqueID           
    HT2_10354935    0      297.0
                    1      297.0
                    2      296.0
                    3      296.0
                    4      295.0
                           ...  
    NM9_10_9568952  502      NaN
                    503      NaN
                    504      NaN
                    505      NaN
                    506      NaN
    Length: 1028437, dtype: float64
    

    I'm running all of this in a jupyter notebook

    ipykernel 5.3.4
    ipython 7.22.0
    ipython-genutils 0.2.0
    notebook 6.4.0
    jupyter 1.0.0
    jupyter-client 6.1.12
    jupyter-console 6.4.0
    jupyter-core 4.7.1
    jupyter-dash 0.4.0
    jupyterlab-pygments 0.1.2
    jupyterlab-widgets 1.0.0

    I was wondering if someone could explain to me what's happening, and how to fix it if the error is mine. Because it works out of the box with a simple pandas apply, I suppose there is a small problem in pandarallel.

    NB: Note also that this code leaves unkilled processes even after I interrupt or restart the IPython kernel. EDIT: Could it be linked to the fact that I'm using a lambda function?

    opened by GDelevoye 18
  • Add `parallel_apply` for `Resampler` class

    I implemented parallel_apply for the Resampler class to provide some important time series functionality. For now it still uses the default _chunk method, but that can lead to some processes finishing much more quickly than others, e.g. if the time series gets denser over time. A potential upgrade would be to randomly sample the contents of the chunks, so each chunk gets a similar distribution of workloads.

    P.S.: I noticed that 30/188 of the tests fail due to a ZeroDivisionError. This is unrelated to my pull request, but an important issue.

    opened by alvail 17
  • Connection to IPC socket failed for pathname

    Hello, when I call pandarallel.initialize() a warning appears. What is it? Thank you:

    WARNING: Logging before InitGoogleLogging() is written to STDERR
    E0812 19:11:57.484051 2409853824 io.cc:168] Connection to IPC socket failed for pathname /var/folders/sp/vz74h1tx3jlb3jqrq__bjwh00000gp/T/pandarallel-32ts0h6r/plasma_sock, retrying 20 more times

    Please help me, thanks.

    opened by lsircc 10
  • ZeroDivisionError: float division by zero

    General

    • Operating System: Windows 11
    • Python version: 3.8.3
    • Pandas version: 1.4.2
    • Pandarallel version: 1.6.1

    Acknowledgement

    • [ ] My issue is NOT present when using pandas alone (without pandarallel)
    • [ ] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    WARNING: You are on Windows. If you detect any issue with pandarallel, be sure you checked out the Troubleshooting page:

    0.99% | 46 / 4628 | 0.00% | 0 / 4627 |

    multiprocessing.pool.RemoteTraceback:
    """
    Traceback (most recent call last):
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\multiprocessing\pool.py", line 125, in worker
        result = (True, func(*args, **kwds))
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\multiprocessing\pool.py", line 51, in starmapstar
        return list(itertools.starmap(args[0], args[1]))
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\core.py", line 158, in call
        results = self.work_function(
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\data_types\series.py", line 26, in work
        return data.apply(
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\series.py", line 4433, in apply
        return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\apply.py", line 1082, in apply
        return self.apply_standard()
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandas\core\apply.py", line 1137, in apply_standard
        mapped = lib.map_infer(
      File "pandas\_libs\lib.pyx", line 2870, in pandas._libs.lib.map_infer
      File "D:\Users\yh110\Anaconda3\envs\22_esci\lib\site-packages\pandarallel\progress_bars.py", line 206, in closure
        state.next_put_iteration += max(int((delta_i / delta_t) * 0.25), 1)
    ZeroDivisionError: float division by zero

    Observed behavior

    Write here the observed behavior

    Expected behavior

    Write here the expected behavior

    Minimal but working code sample to ease bug fix for pandarallel team

    Write here the minimal code sample to ease bug fix for pandarallel team

    bug 
    opened by heya5 8
  • OverflowError

    Using Python 3.8, I am getting an OverflowError when running parallel_apply:

    OverflowError                             Traceback (most recent call last)
    <ipython-input-5-a78fd5119887> in <module>
         37     grouped_df = df.groupby("id")
         38 
    ---> 39     grouped_df.parallel_apply(lookahead) \
         40         .to_parquet(output_location_look_ahead, compression='snappy', engine='pyarrow')
         41 
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in closure(data, func, *args, **kwargs)
        429         queue = manager.Queue()
        430 
    --> 431         workers_args, chunk_lengths, input_files, output_files = get_workers_args(
        432             use_memory_fs,
        433             nb_requested_workers,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in get_workers_args(use_memory_fs, nb_workers, progress_bar, chunks, worker_meta_args, queue, func, args, kwargs)
        284             raise OSError(msg)
        285 
    --> 286         workers_args = [
        287             (
        288                 input_file.name,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in <listcomp>(.0)
        293                 progress_bar == PROGRESS_IN_WORKER,
        294                 dill.dumps(
    --> 295                     progress_wrapper(
        296                         progress_bar >= PROGRESS_IN_FUNC, queue, index, chunk_length
        297                     )(func)
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/pandarallel.py in wrapper(func)
        203     def wrapper(func):
        204         if progress_bar:
    --> 205             wrapped_func = inline(
        206                 progress_pre_func,
        207                 func,
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in inline(pre_func, func, pre_func_arguments)
        485 
        486     func_instructions = tuple(get_instructions(func))
    --> 487     shifted_func_instructions = shift_instructions(
        488         func_instructions, len(b"".join(pinned_pre_func_instructions_without_return))
        489     )
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in shift_instructions(instructions, qty)
        301     If Python version not in 3.{5, 6, 7}, a SystemError is raised.
        302     """
    --> 303     return tuple(
        304         shift_instruction(instruction, qty)
        305         if bytes((instruction[0],))
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in <genexpr>(.0)
        302     """
        303     return tuple(
    --> 304         shift_instruction(instruction, qty)
        305         if bytes((instruction[0],))
        306         in (
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in shift_instruction(instruction, qty)
        291     """
        292     operation, *values = instruction
    --> 293     return bytes((operation,)) + int2python_bytes(python_ints2int(values) + qty)
        294 
        295 
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in wrapper(*args, **kwargs)
         32             raise SystemError("Python version should be 3.{5, 6, 7, 8}")
         33 
    ---> 34         return function(*args, **kwargs)
         35 
         36     return wrapper
    
    ../miniconda3/lib/python3.8/site-packages/pandarallel/utils/inliner.py in int2python_bytes(item)
         69 
         70     nb_bytes = 2 if python_version.minor == 5 else 1
    ---> 71     return int.to_bytes(item, nb_bytes, "little")
         72 
         73 
    
    OverflowError: int too big to convert
    
    opened by lfdversluis 7
  • TypeError: 'generator' object is not subscriptable error in colab - works in VScode

    General

    • Operating System: OSX
    • Python version: 3.7.13
    • Pandas version: 1.3.5
    • Pandarallel version: 1.6.2

    Acknowledgement

    • Issue happens on Colab only. When I use VScode, the problem does not happen

    Bug description

    I get the error:

    100.00%
    1 / 1
    100.00%
    1 / 1
    100.00%
    1 / 1
    100.00%
    1 / 1
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    [<ipython-input-16-b9ca1a6007f2>](https://localhost:8080/#) in <module>()
          1 data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Age': [20, 21, 19, 18]}
          2 df = pd.DataFrame(data)
    ----> 3 df['HalfAge'] = df.parallel_apply(lambda r: r.Age/2,axis=1)
    
    2 frames
    [/usr/local/lib/python3.7/dist-packages/pandarallel/core.py](https://localhost:8080/#) in closure(data, user_defined_function, *user_defined_function_args, **user_defined_function_kwargs)
        324             return wrapped_reduce_function(
        325                 (Path(output_file.name) for output_file in output_files),
    --> 326                 reduce_extra,
        327             )
        328 
    
    [/usr/local/lib/python3.7/dist-packages/pandarallel/core.py](https://localhost:8080/#) in closure(output_file_paths, extra)
        197         )
        198 
    --> 199         return reduce_function(dfs, extra)
        200 
        201     return closure
    
    [/usr/local/lib/python3.7/dist-packages/pandarallel/data_types/dataframe.py](https://localhost:8080/#) in reduce(datas, extra)
         45             datas: Iterable[pd.DataFrame], extra: Dict[str, Any]
         46         ) -> pd.DataFrame:
    ---> 47             axis = 0 if isinstance(datas[0], pd.Series) else 1 - extra["axis"]
         48             return pd.concat(datas, copy=False, axis=axis)
         49 
    
    TypeError: 'generator' object is not subscriptable
    

    Minimal but working code sample to ease bug fix for pandarallel team

    !pip install pandarallel
    from pandarallel import pandarallel
    pandarallel.initialize(progress_bar=True)
    
    import pandas as pd
    
    data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'], 'Age': [20, 21, 19, 18]}
    df = pd.DataFrame(data)
    df['HalfAge'] = df.parallel_apply(lambda r: r.Age/2,axis=1)
    
    opened by agiveon 6
  • Library not working....

    Hello everyone!

    While trying to run the example cases, I'm receiving the following error:

    Code

    import math
    import numpy as np
    import pandas as pd
    from pandarallel import pandarallel

    pandarallel.initialize()
    
    def func(x):
        return math.sin(x.a**2) + math.sin(x.b**2)
    
    if __name__ == '__main__':
        df_size = int(5e6)
        df = pd.DataFrame(dict(a=np.random.randint(1, 8, df_size),
                               b=np.random.rand(df_size)))
        res_parallel = df.parallel_apply(func, axis=1, progress_bar=True)
    

    Error

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <timed exec> in <module>
    
    ~\.conda\envs\ingestion-env\lib\site-packages\pandarallel\pandarallel.py in closure(data, func, *args, **kwargs)
        434         try:
        435             pool = Pool(
    --> 436                 nb_workers, worker_init, (prepare_worker(use_memory_fs)(worker),)
        437             )
        438 
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\context.py in Pool(self, processes, initializer, initargs, maxtasksperchild)
        117         from .pool import Pool
        118         return Pool(processes, initializer, initargs, maxtasksperchild,
    --> 119                     context=self.get_context())
        120 
        121     def RawValue(self, typecode_or_type, *args):
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\pool.py in __init__(self, processes, initializer, initargs, maxtasksperchild, context)
        174         self._processes = processes
        175         self._pool = []
    --> 176         self._repopulate_pool()
        177 
        178         self._worker_handler = threading.Thread(
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\pool.py in _repopulate_pool(self)
        239             w.name = w.name.replace('Process', 'PoolWorker')
        240             w.daemon = True
    --> 241             w.start()
        242             util.debug('added worker')
        243 
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\process.py in start(self)
        110                'daemonic processes are not allowed to have children'
        111         _cleanup()
    --> 112         self._popen = self._Popen(self)
        113         self._sentinel = self._popen.sentinel
        114         # Avoid a refcycle if the target function holds an indirect
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\context.py in _Popen(process_obj)
        320         def _Popen(process_obj):
        321             from .popen_spawn_win32 import Popen
    --> 322             return Popen(process_obj)
        323 
        324     class SpawnContext(BaseContext):
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
         87             try:
         88                 reduction.dump(prep_data, to_child)
    ---> 89                 reduction.dump(process_obj, to_child)
         90             finally:
         91                 set_spawning_popen(None)
    
    ~\.conda\envs\ingestion-env\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
         58 def dump(obj, file, protocol=None):
         59     '''Replacement for pickle.dump() using ForkingPickler.'''
    ---> 60     ForkingPickler(file, protocol).dump(obj)
         61 
         62 #
    
    AttributeError: Can't pickle local object 'prepare_worker.<locals>.closure.<locals>.wrapper'
    

    I'm working in a Jupyter Lab notebook with a conda environment. What can I do?

    opened by murilobellatini 6
  • IndexError when there are fewer DataFrame rows than workers

    When the number of rows is below the number of workers, an IndexError is raised. Minimal example:

    Code

    import time
    import pandas as pd
    from pandarallel import pandarallel
    
    pandarallel.initialize(progress_bar=True)
    
    df = pd.DataFrame({'x':[1,2]})
    df.parallel_apply(lambda row: print('A'), time.sleep(2), print('B'), axis=1)
    

    Output

    INFO: Pandarallel will run on 6 workers.
    INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
    B
       0.00%                                          |        0 /        1 |                                                                                                                    
       0.00%                                          |        0 /        1 |                                                                                                                    Traceback (most recent call last):
      File "foo.py", line 8, in <module>
        df.parallel_apply(lambda row: print('A'), time.sleep(2), print('B'), axis=1)
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 446, in closure
        map_result,
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/pandarallel.py", line 382, in get_workers_result
        progress_bars.update(progresses)
      File "$VIRTUAL_ENV/lib/python3.7/site-packages/pandarallel/utils/progress_bars.py", line 82, in update
        self.__bars[index][0] = value
    IndexError: list index out of range
    

    I'm using python version 3.7.4 with pandas 0.25.3 and pandarallel 1.4.4.

    bug 
    opened by elemakil 6
  • __main__ imposition in code

    Make sure that a novice using the module is aware that code should be encapsulated within the main execution context, much like the requirements of multiprocessing in general. This will reduce wasted time and the need to delve deeper into multiprocessing.
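
    For reference, a minimal sketch of the guard this comment refers to (standard multiprocessing practice, not an official pandarallel example):

    import pandas as pd
    from pandarallel import pandarallel

    def double(row):
        return row.x * 2

    if __name__ == "__main__":
        # keep initialization and parallel calls inside the main guard so that
        # spawned worker processes can safely re-import this module
        pandarallel.initialize()
        df = pd.DataFrame({"x": range(1000)})
        print(df.parallel_apply(double, axis=1))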

    opened by karunakar2 4
  • how to use it in fastapi

    I have a server like:

    from fastapi import FastAPI
    from pandarallel import pandarallel
    pandarallel.initialize()
    app = FastAPI()
    
    @app.post("/area_quant")
    def create_item(event_data: Areadata):
        data = pd.DataFrame(event_data.data)
        data['type_score'] = data['EVENT_TYPE'].applymap(Config.type_map)
    
    if __name__ == '__main__':
        uvicorn.run(app)
    

    As a server, it can only run one time, then it will shut down. How can I use it in a server?

    opened by lim-0 0
  • 'functools.partial' object has no attribute '__code__' in Jupyter Notebooks

    General

    • Operating System: Ubuntu 20.04
    • Python version: 3.9.13
    • Pandas version: 1.4.4
    • Pandarallel version: 1.6.3

    Acknowledgement

    • [x] My issue is NOT present when using pandas alone (without pandarallel)
    • [x] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    functools.partial cannot be used in conjunction with Jupyter notebooks and pandarallel

    Observed behavior

    Write here the observed behavior

    Expected behavior

    Write here the expected behavior

    Minimal but working code sample to ease bug fix for pandarallel team

    import pandas as pd
    from functools import partial
    from typing import List
    import numpy as np
    from pandarallel import pandarallel
    
    pandarallel.initialize(progress_bar=True)
    
    def processing_fn(row: pd.Series, invalid_numbers: List[int], default: int) -> pd.Series:
        cond = row.isin(invalid_numbers)
        row[cond] = default
        return row
    
    data = pd.DataFrame(np.random.randint(low=-10, high=10, size=(10000, 5)))
    print("Before", (data.values == 100).sum())
    
    fn = partial(processing_fn, invalid_numbers=[-5, 2, 5], default=100)
    new_data = data.apply(fn, axis=1)
    
    print("After serial", (new_data.values == 100).sum())
    
    data = data.parallel_apply(fn, axis=1)
    print("After parallel", (data.values == 100).sum())
    
    

    Works fine in a standalone script, but fails if run in a Jupyter notebook

    opened by Meehai 0
  • Choose which type of progress bar you want in a notebook

    Sometimes one doesn't have control over the environment of the Jupyter instance where one is working (think JupyterHub) and can't install the necessary extensions for progress bars. In this case it would be nice to have the option to manually request the simple progress bar so that it doesn't just display a widget error.

    Is there already a way to do so that I've missed? Otherwise, if it's something you'd consider adding, I'd also be happy to try and draft a PR.

    opened by astrojarred 3
  • TypeError: cannot pickle 'sqlite3.Connection' object in pyCharm

    General

    • Operating System: Ubuntu 22.04
    • Python version: 3.9.7
    • Pandas version: 1.4.2
    • Pandarallel version: 1.6.3

    Acknowledgement

    • [x] My issue is NOT present when using pandas alone (without pandarallel)
    • [ ] If I am on Windows, I read the Troubleshooting page before writing a new bug report

    Bug description

    I observe this when running parallel_apply in pyCharm. #76 sounds similar to me, but none of the tricks suggested there work for me. At this point I am also not sure if it's more of an issue with pyCharm.

    Observed behavior

    Traceback (most recent call last):
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-39-d230d86ff5ef>", line 1, in <module>
        iris.groupby("species").parallel_apply(lambda x: np.mean(x))
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/pandarallel/core.py", line 265, in closure
        dilled_user_defined_function = dill.dumps(user_defined_function)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 364, in dumps
        dump(obj, file, protocol, byref, fmode, recurse, **kwds)#, strictio)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 336, in dump
        Pickler(file, protocol, **_kwds).dump(obj)
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 620, in dump
        StockPickler.dump(self, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 487, in dump
        self.save(obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1963, in save_function
        _save_with_postproc(pickler, (_create_function, (
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1154, in _save_with_postproc
        pickler._batch_setitems(iter(source.items()))
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 692, in save_reduce
        save(args)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 886, in save_tuple
        save(element)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 717, in save_reduce
        save(state)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 603, in save
        self.save_reduce(obj=obj, *rv)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 717, in save_reduce
        save(state)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 560, in save
        f(self, obj)  # Call unbound method with explicit self
      File "~/anaconda3/envs/tor/lib/python3.9/site-packages/dill/_dill.py", line 1251, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 971, in save_dict
        self._batch_setitems(obj.items())
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 997, in _batch_setitems
        save(v)
      File "~/anaconda3/envs/tor/lib/python3.9/pickle.py", line 578, in save
        rv = reduce(self.proto)
    TypeError: cannot pickle 'sqlite3.Connection' object
    

    Minimal but working code sample to ease bug fix for pandarallel team

    import pandas as pd
    import numpy as np
    from pandarallel import pandarallel
    pandarallel.initialize(use_memory_fs=False)
    iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
    iris.groupby("species").parallel_apply(lambda x: np.mean(x))
    
    opened by wiessall 1