Shared, streaming Python dict

Ronny Rentner

Last update: Dec 23, 2022

Overview

UltraDict

Synchronized, streaming Python dictionary that uses shared memory as a backend

Warning: This is an early hack. There are only a few unit tests and it may not be stable!

Features:

Fast (compared to other sharing solutions)
No running manager processes
Works in spawn and fork context
Safe locking between independent processes
Tested with Python >= v3.8 on Linux and Windows
Optional recursion for nested dicts

General Concept

UltraDict uses multiprocessing.shared_memory to synchronize a dict between multiple processes.

It does so by using a stream of updates in a shared memory buffer. This is efficient because only changes have to be serialized and transferred.

If the buffer is full, UltraDict will automatically do a full dump to a new shared memory space, reset the streaming buffer and continue to stream further updates. All users of the UltraDict will automatically load full dumps and continue using streaming updates afterwards.
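The streaming idea can be illustrated with a minimal sketch. This is not UltraDict's actual buffer layout: the 4-byte length prefix, the helper names append_update and apply_updates, and the bytearray standing in for the shared memory buffer are all assumptions made for illustration.

```python
import pickle
import struct

# Assumption of this sketch: every update is stored as a 4-byte
# little-endian length followed by the pickled (key, value) pair.
HEADER = struct.Struct("<I")


def append_update(buf, pos, key, value):
    """Append one serialized change at pos.

    Returns the new write position, or None if the change does not fit.
    """
    payload = pickle.dumps((key, value))
    end = pos + HEADER.size + len(payload)
    if end > len(buf):
        return None  # a real implementation would fall back to a full dump
    HEADER.pack_into(buf, pos, len(payload))
    buf[pos + HEADER.size:end] = payload
    return end


def apply_updates(buf, start, end, local):
    """Replay all updates stored in buf[start:end] onto a local dict."""
    pos = start
    while pos < end:
        (length,) = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        key, value = pickle.loads(bytes(buf[pos:pos + length]))
        local[key] = value
        pos += length
    return pos


buf = bytearray(1000)  # stands in for the shared memory buffer
writer_pos = 0
writer_pos = append_update(buf, writer_pos, "a", 1)
writer_pos = append_update(buf, writer_pos, "b", [1, 2, 3])

# A reader only has to deserialize what was written since its last position.
reader = {}
reader_pos = apply_updates(buf, 0, writer_pos, reader)
print(reader)
```

When append_update returns None, the change no longer fits into the buffer, which is the point at which a full dump would be needed.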

Issues

On Windows, if no process holds a handle on the shared memory, the OS will garbage collect all of the shared memory, making it inaccessible to future processes. To work around this issue, you can currently set full_dump_size, which causes the creator of the dict to allocate a static full dump memory of the requested size. This full dump memory lives as long as the creator lives. The downside of this approach is that you need to plan ahead for your data size: if the data does not fit into the full dump memory, it will break.

Alternatives

There are many alternatives:

How to use?

Simple

In one Python REPL:

Python 3.9.2 on linux
>>> 
>>> from UltraDict import UltraDict
>>> ultra = UltraDict({ 1:1 }, some_key='some_value')
>>> ultra
{1: 1, 'some_key': 'some_value'}
>>>
>>> # We need the shared memory name in the other process.
>>> ultra.name
'psm_ad73da69'

In another Python REPL:

Python 3.9.2 on linux
>>> 
>>> from UltraDict import UltraDict
>>> # Connect to the shared memory with the name above
>>> other = UltraDict(name='psm_ad73da69')
>>> other
{1: 1, 'some_key': 'some_value'}
>>> other[2] = 2

Back in the first Python REPL:

>>> ultra[2]
2
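Attaching by name is provided by the standard library's multiprocessing.shared_memory module, which UltraDict uses as its backend. A minimal sketch of that mechanism alone, run inside a single process for brevity (the variable names are illustrative):

```python
from multiprocessing import shared_memory

# The creator gets a random name, comparable to ultra.name above.
creator = shared_memory.SharedMemory(create=True, size=16)
try:
    # Anyone who knows the name can attach to the same memory.
    other = shared_memory.SharedMemory(name=creator.name)
    try:
        other.buf[0] = 42
        value_seen_by_creator = creator.buf[0]
    finally:
        other.close()
finally:
    creator.close()
    creator.unlink()  # remove the segment so it does not leak

print(value_seen_by_creator)
```

In real use the second SharedMemory(name=...) call would happen in another process, as in the two REPL sessions above.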

Nested

In one Python REPL:

Python 3.9.2 on linux
>>> 
>>> from UltraDict import UltraDict
>>> ultra = UltraDict(recurse=True)
>>> ultra['nested'] = { 'counter': 0 }
>>> type(ultra['nested'])
<class 'UltraDict.UltraDict'>
>>> ultra.name
'psm_0a2713e4'

In another Python REPL:

Python 3.9.2 on linux
>>> 
>>> from UltraDict import UltraDict
>>> other = UltraDict(name='psm_0a2713e4')
>>> other['nested']['counter'] += 1

Back in the first Python REPL:

>>> ultra['nested']['counter']
1
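Why nested dicts need wrapping can be shown with a hypothetical stand-in class. RecordingDict below is not part of UltraDict; it only records which writes pass through __setitem__, that is, which writes a streaming dict would get a chance to publish.

```python
from collections import UserDict


class RecordingDict(UserDict):
    """Hypothetical stand-in: remembers every key written via __setitem__."""

    def __init__(self, *args, recurse=False, **kwargs):
        # Must be set before UserDict.__init__, which already calls __setitem__.
        self.recurse = recurse
        self.writes = []
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        if self.recurse and isinstance(value, dict):
            # Wrap plain dicts so that nested writes are recorded as well.
            value = RecordingDict(value, recurse=True)
        self.writes.append(key)
        super().__setitem__(key, value)


flat = RecordingDict()
flat["nested"] = {"counter": 0}
flat["nested"]["counter"] += 1  # changes the plain inner dict; nothing is recorded

deep = RecordingDict(recurse=True)
deep["nested"] = {"counter": 0}
deep["nested"]["counter"] += 1  # recorded by the wrapped inner dict

print(flat.writes, deep.writes, deep["nested"].writes)
```

Without wrapping, the increment never reaches the outer object, so there would be nothing to stream to other processes.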

Performance comparison

Python 3.9.2 on linux
>>> 
>>> from UltraDict import UltraDict
>>> ultra = UltraDict()
>>> for i in range(10_000): ultra[i] = i
... 
>>> len(ultra)
10000
>>> ultra[500]
500
>>> # Now let's do some performance testing
>>> import multiprocessing, timeit
>>> orig = dict(ultra)
>>> len(orig)
10000
>>> orig[500]
500
>>> managed = multiprocessing.Manager().dict(orig)
>>> len(managed)
10000

Read performance

>>> timeit.timeit('orig[1]', globals=globals())
0.03503723500762135
>>>
>>> timeit.timeit('ultra[1]', globals=globals())
0.380401570990216
>>>
>>> timeit.timeit('managed[1]', globals=globals())
15.848494678968564
>>>
>>> # We are factor 10 slower than a real, local dict,
>>> # but way faster than using a Manager
>>>
>>> # If you need full read performance, you can access the underlying
>>> # cache directly and get almost original dict performance,
>>> # of course at the cost of not having real-time updates anymore.
>>>
>>> timeit.timeit('ultra.data[1]', globals=globals())
0.047667117964010686

Write performance

>>> timeit.timeit('orig[1] = 1', globals=globals())
0.02869905502302572
>>>
>>> timeit.timeit('ultra[1] = 1', globals=globals())
2.259694856009446
>>>
>>> timeit.timeit('managed[1] = 1', globals=globals())
16.352361536002718
>>>
>>> # We are factor 100 slower than a real, local dict,
>>> # but still way faster than using a Manager

Parameters

UltraDict(*args, name=None, buffer_size=10000, serializer=pickle, shared_lock=False, full_dump_size=None, auto_unlink=True, recurse=False, **kwargs)

name: Name of the shared memory. A random name will be chosen if not set. If a name is given a new shared memory space is created if it does not exist yet. Otherwise the existing shared memory space is attached.

buffer_size: Size of the shared memory buffer used for streaming changes of the dict.

The buffer size limits the biggest change that can be streamed, so when you use large values or deeply nested dicts you might need a bigger buffer. Otherwise, if the buffer is too small, it will fall back to a full dump. Creating full dumps can be slow, depending on the size of your dict.

Whenever the buffer is full, a full dump will be created. A new shared memory is allocated just big enough for the full dump. Afterwards the streaming buffer is reset. All other users of the dict will automatically load the full dump and continue streaming updates.
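The serialization overhead that makes a value larger than its raw size is easy to measure with the default pickle serializer. A small sketch, assuming nothing about UltraDict's own framing (which would add to the numbers):

```python
import pickle

buffer_size = 100_000
value = " " * 100_000  # 100k characters of payload

payload = pickle.dumps(value)
overhead = len(payload) - len(value)
fits_in_buffer = len(payload) <= buffer_size

print(f"serialized: {len(payload)} bytes, overhead: {overhead} bytes, fits: {fits_in_buffer}")
```

A value whose raw size equals buffer_size therefore cannot be streamed; choose a buffer_size with some headroom above your largest expected value.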

serializer: Use a different serializer from the default pickle, e. g. marshal, dill, json. The module or object provided must support the methods loads() and dumps().
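Any object with loads() and dumps() can play this role. The following JsonBytesSerializer is a hypothetical adapter, written under the assumption that the serialized form should be bytes (json.dumps returns str); whether UltraDict needs such an adapter for json is not confirmed here.

```python
import json


class JsonBytesSerializer:
    """Hypothetical adapter exposing dumps()/loads() on top of json."""

    @staticmethod
    def dumps(obj):
        # json produces str, so encode to get something byte-oriented.
        return json.dumps(obj).encode("utf-8")

    @staticmethod
    def loads(data):
        # Accept bytes, bytearray or memoryview slices.
        return json.loads(bytes(data).decode("utf-8"))


round_trip = JsonBytesSerializer.loads(JsonBytesSerializer.dumps({"a": [1, 2]}))
int_keys = JsonBytesSerializer.loads(JsonBytesSerializer.dumps({1: "x"}))
print(round_trip, int_keys)
```

Note that JSON is lossy compared to pickle: integer keys come back as strings and tuples come back as lists, so it only suits data that survives this round trip.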

shared_lock: When writing to the same dict at the same time from multiple, independent processes, they need a shared lock to synchronize and not overwrite each other's changes. Shared locks are slow. They rely on the atomics package for atomic locks. By default, UltraDict will use a multiprocessing.RLock() instead which works well in fork context and is much faster.

full_dump_size: If set, uses a static full dump memory instead of dynamically creating it. This might be necessary on Windows depending on your write behaviour. On Windows, the full dump memory goes away if the process goes away that had created the full dump. Thus you must plan ahead which processes might be writing to the dict and therefore creating full dumps.

auto_unlink: If True, the creator of the shared memory will automatically unlink the handle at exit so it is not visible or accessible to new processes. All existing, still connected processes can continue to use the dict.

recurse: If True, any nested dict objects will be automatically wrapped in an UltraDict, allowing transparent nested updates.

Advanced usage

See examples folder

>>> ultra = UltraDict({ 'init': 'some initial data' }, name='my-name', buffer_size=100_000)
>>> # Let's use a value with 100k bytes length.
>>> # This will not fit into our 100k bytes buffer due to the serialization overhead.
>>> ultra[0] = ' ' * 100_000
>>> ultra.print_status()
{'buffer': SharedMemory('my-name_memory', size=100000),
 'buffer_size': 100000,
 'control': SharedMemory('my-name', size=1000),
 'full_dump_counter': 1,
 'full_dump_counter_remote': 1,
 'full_dump_memory': SharedMemory('psm_765691cd', size=100057),
 'full_dump_memory_name_remote': 'psm_765691cd',
 'full_dump_size': None,
 'full_dump_static_size_remote': <memory at 0x7fcbf5ca6580>,
 'lock': <RLock(None, 0)>,
 'lock_pid_remote': 0,
 'lock_remote': 0,
 'name': 'my-name',
 'recurse': False,
 'recurse_remote': <memory at 0x7fcbf5ca6700>,
 'serializer': <module 'pickle' from '/usr/lib/python3.9/pickle.py'>,
 'shared_lock_remote': <memory at 0x7fcbf5ca6640>,
 'update_stream_position': 0,
 'update_stream_position_remote': 0}

Note: All status keys ending with _remote are stored in the control shared memory space and shared across processes.

Other things you can do:

>>> # Load latest full dump if one is available
>>> ultra.load()

>>> # Show statistics
>>> ultra.print_status()

>>> # Force load of latest full dump, even if we had already processed it.
>>> # There might also be streaming updates available after loading the full dump.
>>> ultra.load(force=True)

>>> # Apply full dump and stream updates to
>>> # underlying local dict, this is automatically
>>> # called by accessing the UltraDict in any usual way,
>>> # but can be useful to call after a forced load.
>>> ultra.apply_update()

>>> # Access underlying local dict directly
>>> ultra.data

>>> # Use any serializer you like, given it supports the loads() and dumps() methods
>>> import pickle 
>>> ultra = UltraDict(serializer=pickle)

>>> # Unlink all shared memory, it will not be visible to new processes afterwards
>>> ultra.unlink()

Contributing

Contributions are always welcome!

Comments

Crashes under high load

The master process writes to one nested dict, dict1 (recurse=1), shared between 20-40 processes. dict1 has ~1500 keys in total, each with a small nested dict as its value.

The processes are created via multiprocessing.Process and write to another shared dict, dict2[process_id], once per second. dict2 is the same size, but multiplied by num_processes.

The main process analyzes statistics from dict2 (for process_id in dict2: dict2[process_id]: ...) and writes changes to the shared dict1 once per second (for change in changes: dict1['nested'][change] = {'time': 123, 'blah': '123'}).

The crashes appear when there are 300-2000 changes per second and the number of read lookups is huge (>100k/sec). I tried caching dict1 once per second in a local dict using deepcopy, but this doesn't help. Total memory usage does not exceed 2-4 GB, I think (free RAM is about 60 GB); CPU usage is up to 100%.

The size of dict1 in bytes, measured on a local dict with the same structure, is less than 150 kB.

I tried:

copy.deepcopy(dict1) once per second to create a local copy in processes for cached lookups - doesn't help
shared_lock
with dict1.lock/etc
increasing buffer to huge values, increasing full dump size/etc

Nothing helps. At low update rates (or with no or only small changes from the master to dict1) everything works. Using multiprocessing.Manager().dict() also works, but it is slow.

Examples of exceptions:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
	self.run()
  File "/usr/lib/python3.9/threading.py", line 892, in run
	self._target(*self._args, **self._kwargs)
  File "zvshield.py", line 793, in zvshield.accept_connections
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 585, in __contains__
	self.apply_update()
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 511, in apply_update
	assert bytes(self.buffer.buf[pos:pos+1]) == b'\x00'
AssertionError


File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 248, in __init__
	self.buffer = self.get_memory(create=True, name=self.name + '_memory', size=buffer_size)
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 347, in get_memory
	full_dump = self.serializer.loads(bytes(buf[pos:pos+length]))
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 304, in __init__
	self.apply_update()
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 520, in apply_update
	memory = multiprocessing.shared_memory.SharedMemory(name=name)
  File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 114, in __init__
	mode, key, value = self.serializer.loads(bytes(self.buffer.buf[pos:pos+length]))
	self._mmap = mmap.mmap(self._fd, size)
OSError: [Errno 12] Cannot allocate memory

EOFError: Ran out of input
Exception ignored in: <function SharedMemory.__del__ at 0x7fb80639e820>

Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 184, in __del__
	self.close()
  File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 227, in close
Exception ignored in: <function SharedMemory.__del__ at 0x7fb80639e820>
	self._mmap.close()
Traceback (most recent call last):
BufferError: cannot close exported pointers exist
  File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 184, in __del__
	self.close()
  File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 227, in close
	self._mmap.close()
BufferError: cannot close exported pointers exist
Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 184, in __del__
	self.close()
  File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 227, in close
	self._mmap.close()
BufferError: cannot close exported pointers exist
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
	self.run()
  File "/usr/lib/python3.9/threading.py", line 892, in run
	self._target(*self._args, **self._kwargs)
  File "zvshield.py", line 793, in zvshield.accept_connections
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 585, in __contains__
	self.apply_update()
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 500, in apply_update
	self.load(force=True)
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 450, in load
	full_dump = self.serializer.loads(bytes(buf[pos:pos+length]))
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 304, in __init__
	self.apply_update()
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 520, in apply_update
	mode, key, value = self.serializer.loads(bytes(self.buffer.buf[pos:pos+length]))
EOFError: Ran out of input
Exception ignored in: <function SharedMemory.__del__ at 0x7fc48f4d4820>
Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 184, in __del__
	self.close()
  File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 227, in close
	self._mmap.close()
BufferError: cannot close exported pointers exist
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
	self.run()
  File "/usr/lib/python3.9/threading.py", line 892, in run
	self._target(*self._args, **self._kwargs)
  File "zvshield.py", line 793, in zvshield.accept_connections
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 585, in __contains__
	self.apply_update()
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 500, in apply_update
	self.load(force=True)
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 450, in load
	full_dump = self.serializer.loads(bytes(buf[pos:pos+length]))
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 304, in __init__
	self.apply_update()
  File "/usr/local/lib/python3.9/dist-packages/UltraDict/UltraDict.py", line 520, in apply_update
	mode, key, value = self.serializer.loads(bytes(self.buffer.buf[pos:pos+length]))
EOFError: Ran out of input

Exception ignored in: <function SharedMemory.__del__ at 0x7fc48f4d4820>
Traceback (most recent call last):
  File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 184, in __del__
	self.close()
  File "/usr/lib/python3.9/multiprocessing/shared_memory.py", line 227, in close
	self._mmap.close()
BufferError: cannot close exported pointers exist

opened by rojamit 33

Question - pickle.UnpicklingError: pickle data was truncated

I got the error pickle.UnpicklingError: pickle data was truncated

while trying to use this library. How does this error get generated, and how can I avoid it in the future?

Another weird one: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 201: invalid continuation byte

opened by mccoydj1 15

Extremely slow initialization from existing dict

Initializing from a (large) existing dict is slow -- it seems to be serializing every key-value pair as an update:

Traceback (most recent call last):
  File "/global/homes/p/pfasano/group_stats_dict.py", line 327, in <module>
    group_dict = read_groups(partitions, sp_bin)
  File "/global/homes/p/pfasano/group_stats_dict.py", line 206, in read_groups
    return UltraDict(group_dict, auto_unlink=True)
  File "/global/homes/p/pfasano/.local/perlmutter/3.9-anaconda-2021.11/lib/python3.9/site-packages/UltraDict/UltraDict.py", line 301, in __init__
    super().__init__(*args, **kwargs)
  File "/global/common/software/nersc/pm-2022q2/sw/python/3.9-anaconda-2021.11/lib/python3.9/collections/__init__.py", line 1046, in __init__
    self.update(dict)
  File "/global/homes/p/pfasano/.local/perlmutter/3.9-anaconda-2021.11/lib/python3.9/site-packages/UltraDict/UltraDict.py", line 541, in update
    self[k] = v
  File "/global/homes/p/pfasano/.local/perlmutter/3.9-anaconda-2021.11/lib/python3.9/site-packages/UltraDict/UltraDict.py", line 568, in __setitem__
    self.append_update(key, item)
  File "/global/homes/p/pfasano/.local/perlmutter/3.9-anaconda-2021.11/lib/python3.9/site-packages/UltraDict/UltraDict.py", line 482, in append_update
    self.dump()
  File "/global/homes/p/pfasano/.local/perlmutter/3.9-anaconda-2021.11/lib/python3.9/site-packages/UltraDict/UltraDict.py", line 374, in dump
    marshalled = self.serializer.dumps(self.data)

It seems like somehow super().__init__ is calling collections.UserDict.__init__, which in turn calls UltraDict.__setitem__.

I guess I don't quite understand yet how UltraDict works, but why does every key need to be serialized as an update to an empty dict?

opened by kc9jud 6
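The behaviour described in this report can be reproduced with the standard library alone. CountingDict is a hypothetical class, not UltraDict code; it shows that collections.UserDict.__init__ routes every initial key through __setitem__:

```python
from collections import UserDict


class CountingDict(UserDict):
    """Hypothetical UserDict subclass that counts __setitem__ calls."""

    def __init__(self, *args, **kwargs):
        self.set_calls = 0  # must exist before UserDict.__init__ runs
        super().__init__(*args, **kwargs)

    def __setitem__(self, key, value):
        self.set_calls += 1
        super().__setitem__(key, value)


source = {i: i for i in range(1000)}
d = CountingDict(source)
print(d.set_calls)
```

Any per-write cost in __setitem__ (serializing, locking, possibly a full dump) is therefore paid once per key when initializing from an existing dict.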

Crash

I cannot restart my app, even after restarting the computer.

C:\Users\marce\PycharmProjects\srsapp\venv310\Scripts\python.exe C:/Users/marce/PycharmProjects/srsapp/launcher.py --enable_file_cache True
Traceback (most recent call last):
  File "C:\Users\marce\PycharmProjects\srsapp\launcher.py", line 7, in <module>
    import globalVariables
  File "C:\Users\marce\PycharmProjects\srsapp\globalVariables.py", line 574, in <module>
    config = UltraDict(name='config1', size=500000)
  File "C:\Users\marce\PycharmProjects\srsapp\venv310\lib\site-packages\UltraDict\UltraDict.py", line 288, in __init__
    super().__init__(*args, **kwargs)
  File "C:\Users\marce\AppData\Local\Programs\Python\Python310\lib\collections\__init__.py", line 1092, in __init__
    self.update(kwargs)
  File "C:\Users\marce\PycharmProjects\srsapp\venv310\lib\site-packages\UltraDict\UltraDict.py", line 498, in update
    self[k] = v
  File "C:\Users\marce\PycharmProjects\srsapp\venv310\lib\site-packages\UltraDict\UltraDict.py", line 514, in __setitem__
    self.apply_update()
  File "C:\Users\marce\PycharmProjects\srsapp\venv310\lib\site-packages\UltraDict\UltraDict.py", line 464, in apply_update
    self.load(force=True)
  File "C:\Users\marce\PycharmProjects\srsapp\venv310\lib\site-packages\UltraDict\UltraDict.py", line 398, in load
    full_dump_memory = self.get_memory(create=False, name=name)
  File "C:\Users\marce\PycharmProjects\srsapp\venv310\lib\site-packages\UltraDict\UltraDict.py", line 329, in get_memory
    raise Exception("Could not get memory: ", name)
Exception: ('Could not get memory: ', 'wnsm_0ce9a65a')
Exception ignored in: <function SharedMemory.__del__ at 0x000001F84F477880>
Traceback (most recent call last):
  File "C:\Users\marce\AppData\Local\Programs\Python\Python310\lib\multiprocessing\shared_memory.py", line 184, in __del__
    self.close()
  File "C:\Users\marce\AppData\Local\Programs\Python\Python310\lib\multiprocessing\shared_memory.py", line 227, in close
    self._mmap.close()
BufferError: cannot close exported pointers exist

opened by marcelomanzo 6

Duplicate logs

Hello, maybe this is a noob question, but I have the problem that some logs get duplicated when using the library.

This is a very basic setup of FastAPI with UltraDict

opened by marianomat 5

Problem updating iterating on values

Hi! I started using your dictionary in my project, but I found a bug while trying to iterate over the dictionary values. These few lines of code trigger the bug.

It can be solved by calling apply_update() before iterating over the values. However, that function is already called by the same process before the iteration (I added a print to check), so I do not really understand why this fixes it. I will probably iterate over keys instead; I tried to work around it by iterating over items, but that does not work either :-)

opened by hugo3m 5

Unable to access UltraDict after a certain loop limit; the issue occurs only on Linux

from UltraDict import UltraDict

ultra = UltraDict({ 'init': 'some initial data' }, name='myname1')

for i in range(1,5000): print(UltraDict(name='myname1'))

############### ERROR #################
  File "/home/merit/miniconda3/lib/python3.9/site-packages/UltraDict/UltraDict.py", line 659, in unlink
    self.control.unlink()
  File "/home/merit/miniconda3/lib/python3.9/multiprocessing/shared_memory.py", line 241, in unlink
    _posixshmem.shm_unlink(self._name)
FileNotFoundError: [Errno 2] No such file or directory: '/myname1'

opened by hemakumar01 4

Shared memory not always cleared

Hi,

I'm using UltraDict to share data between a master process and several subprocesses.

I have auto_unlink=True on all declarations, but sometimes if the script fails (meaning something wrong in the code, or an unexpected error) it won't clear the memory, thus on the next run, when the master process creates the "new" UltraDict object, it reuses the same information from the previous execution (as the UltraDict names are predefined).

Is there a way to clear the memory of previous executions without having to reboot the server?

Thanks.

opened by joelsdc 2
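On POSIX systems, a leftover segment can be removed by name with the standard library. A sketch, under the assumption that you know the segment names; remove_stale_segment is a hypothetical helper, not part of UltraDict. Going by the print_status() output shown under Advanced usage, an UltraDict named my-name uses at least the segments my-name and my-name_memory, plus a full dump segment with a random name.

```python
from multiprocessing import shared_memory


def remove_stale_segment(name):
    """Unlink a leftover shared memory segment; return True if one existed."""
    try:
        segment = shared_memory.SharedMemory(name=name)
    except FileNotFoundError:
        return False  # nothing left over under this name
    segment.close()
    segment.unlink()  # POSIX only; on Windows unlink() does nothing
    return True


# Simulate a crashed run: a segment is created and closed, but never unlinked.
leftover = shared_memory.SharedMemory(create=True, size=64)
leftover_name = leftover.name
leftover.close()

removed_first = remove_stale_segment(leftover_name)
removed_again = remove_stale_segment(leftover_name)
print(removed_first, removed_again)
```

Calling such a helper for each predefined name at startup, before creating the dicts, would avoid picking up data from a previous run.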

Memory usage analysis

before

testing...

after

It seems OK: UltraDict didn't eat memory after the test was done. I was afraid that it would allocate memory and not release it, so that the server would eventually run out of memory.

If you have any thoughts on how to test this, please let me know. I want to use UltraDict in our production environment, but I am afraid something might go wrong.

opened by csrgxtu 2

The dict does not delete items but put an empty string

Hello, I am currently using your dictionary for my project. Then I found a problem when I try to delete an item from the dict. Instead of deleting, the dict replaces the value that needs to be deleted by an empty string and it leads to a bug in my project. I am writing a small piece of code to reproduce this behavior as you can find hereafter. Hope it can help you to figure out the problem.

from UltraDict import UltraDict
import random
import string

letters = string.ascii_lowercase
rand_str = ''.join(random.choice(letters) for i in range(1000))
my_dict = UltraDict()
for i in range(10000):
    my_dict[i] = rand_str
for i in list(my_dict.keys()):
    del my_dict[i]
print(my_dict)

and here are the results I got:

{379: b'', 750: b'', 1121: b'', 1492: b'', 1863: b'', 2234: b'', 2605: b'', 2976: b'', 3347: b'', 3718: b'', 4089: b'', 4460: b'', 4831: b'', 5202: b'', 5573: b'', 5944: b'', 6315: b'', 6686: b'', 7057: b'', 7428: b'', 7799: b'', 8170: b'', 8541: b'', 8912: b'', 9283: b'', 9654: b''}

Thank you

opened by haidang1201 2

UltraDict dependency 'atomics' is not compatible with MacBook silicon (m1)

Version: branch master
OS: macOS Big Sur version 11.6

The scenario: I'm using this module in an algotrading bot app. One mechanism I'm driving with it is helping the bot get quick updates from another process, which is responsible for transmitting price updates. The dictionary is the right tool for the job: process A fills a dictionary with prices, and process B consumes those prices and does calculations based on them.

My issue: as the bot runs inside a while loop, it never really exits gracefully but through an interrupt (SIGINT, then SIGTERM). If the producer of the dict (process A) exits by SIGINT, it's fine. But if process B (the consumer) exits by SIGINT, the dictionary seems to enter a state that you can't clear, even with unlink() and close(). Only a restart helps in this scenario (I checked /dev/shm, but /shm does not exist on my hard drive).

That led me to try the shared lock mechanism (because I thought accessing this map with a lock might help). When I ran the code again, I got an error stating that "atomics" was not found. After a short pip install atomics, I found out that there is no wheel for Mac ARM, only a universal one. When running again, I get the error "mach-o: wrong architecture". Even if I leave out shared_lock=True, it keeps throwing errors about the same thing. Restarting the computer is the only thing that clears it.

I suggest sorting this out quickly, as MacBook M1 computers are not that rare, and it's actually quite a great library which I currently cannot really use :\

opened by JOEAV 4

Improve write performance by using faster locking

The library that is used for atomic test_and_set operations on the shared memory has a performance issue.

It will be fixed by the author and should give us more write speed.

Related ticket: https://github.com/doodspav/atomics/issues/3
enhancement

opened by ronny-rentner 0

Add configurable timeout when waiting to acquire a lock

Currently hardcoded to 100_000 loops.

In Python 3.11, time.sleep() uses clock_nanosleep() or nanosleep() on Unix where available, which gives it nanosecond resolution. Before that, it is hard to sleep for a nanosecond in Python without busy waiting.

We need to find a better solution for waiting in Python < 3.11.
enhancement

opened by ronny-rentner 0
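A time-based limit instead of a fixed loop count could look roughly like the following sketch. It is not UltraDict's locking code: acquire_with_timeout is a hypothetical helper, and a threading.Lock stands in for the atomic test-and-set on shared memory.

```python
import threading
import time


def acquire_with_timeout(try_acquire, timeout, sleep_time=1e-6):
    """Call try_acquire() until it succeeds or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while not try_acquire():
        if time.monotonic() >= deadline:
            return False
        # Short sleep between attempts; the real resolution depends on
        # the platform and the Python version.
        time.sleep(sleep_time)
    return True


lock = threading.Lock()


def try_lock():
    # Non-blocking attempt, comparable to a single test-and-set.
    return lock.acquire(blocking=False)


first = acquire_with_timeout(try_lock, timeout=0.05)   # lock is free
second = acquire_with_timeout(try_lock, timeout=0.05)  # lock is already held
lock.release()
print(first, second)
```

Using time.monotonic() for the deadline keeps the limit independent of how long a single loop iteration takes.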

Releases(v0.0.6)

v0.0.6(Sep 6, 2022)

Lots of bug fixes and improvements, better docs, more tests.

v0.0.4(Apr 17, 2022)

Owner

Ronny Rentner

