Simple spill-to-disk dictionary

Blaze

Last update: Dec 19, 2022

Overview

Chest

A dictionary that spills to disk.

Chest acts like a dictionary, but it can write its contents to disk. This is useful in two situations:

Chest can hold datasets that are larger than memory
Chest persists and so can be saved and loaded for later use

Related Projects

The standard library shelve is an alternative out-of-core dictionary. Chest offers the following benefits over shelve:

Chest supports any hashable key (not just strings)
Chest supports pluggable serialization and file saving schemes

Alternatively one might consider a traditional key-value store database like Redis.

Shove is another excellent alternative with support for a variety of stores.

How it works

Chest stores data in two locations

An in-memory dictionary
On the filesystem in a directory owned by the chest

As a user adds contents to the chest the in-memory dictionary fills up. When a chest stores more data in memory than desired (see available_memory= keyword argument) it writes the larger contents of the chest to disk as pickle files (the choice of pickle is configurable). When a user asks for a value chest checks the in-memory store, then checks on-disk and loads the value into memory if necessary, pushing other values to disk.
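The spill-and-reload cycle described above can be illustrated with a toy class. This is a minimal sketch and not chest's actual implementation: `SpillDict`, `nbytes`, `inmem` and `on_disk` are hypothetical names, only the `available_memory` keyword is borrowed from the text, keys are assumed to be strings that are valid filenames, and size is estimated from the pickled length.

```python
import os
import pickle
import tempfile


def nbytes(value):
    # Rough size estimate: the length of the pickled value.
    return len(pickle.dumps(value))


class SpillDict:
    """Toy spill-to-disk mapping (string keys only)."""

    def __init__(self, path=None, available_memory=1000):
        self.path = path or tempfile.mkdtemp()
        self.available_memory = available_memory  # byte budget for inmem
        self.inmem = {}        # values currently held in memory
        self.on_disk = set()   # keys whose values live only on disk

    def _filename(self, key):
        return os.path.join(self.path, key)

    def _shrink(self, keep=None):
        # Spill the largest values first until we are under budget.
        used = sum(nbytes(v) for v in self.inmem.values())
        largest_first = sorted(
            self.inmem, key=lambda k: nbytes(self.inmem[k]), reverse=True
        )
        for key in largest_first:
            if used <= self.available_memory:
                break
            if key == keep:  # never spill the value being requested
                continue
            value = self.inmem.pop(key)
            with open(self._filename(key), "wb") as f:
                pickle.dump(value, f)
            self.on_disk.add(key)
            used -= nbytes(value)

    def __setitem__(self, key, value):
        self.inmem[key] = value
        self.on_disk.discard(key)
        self._shrink()

    def __getitem__(self, key):
        if key not in self.inmem:
            if key not in self.on_disk:
                raise KeyError(key)
            # Load from disk, then push other values out if needed.
            with open(self._filename(key), "rb") as f:
                self.inmem[key] = pickle.load(f)
            self.on_disk.discard(key)
            self._shrink(keep=key)
        return self.inmem[key]
```

With a budget of 1500 bytes, storing a 1000-byte value and then a 1200-byte value would spill the larger one; reading it back would in turn push the smaller one to disk.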

Chest is a simple project. It was intended to provide a simple interface to assist in the storage and retrieval of numpy arrays. However, its design and implementation are agnostic to this case and so could be used in a variety of other situations.

With minimal work chest could be extended to serve as a communication point between multiple processes.

Known Failings

Chest was designed to hold a moderate number of largish numpy arrays. It doesn't handle the use case of very many small key-value pairs (though it could with small effort). In particular, chest has the following deficiencies:

Chest is not multi-process safe. We should institute a file lock at least around the .keys file.
Chest does not support mutation of variables on disk.
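As an illustration of the file lock suggested above, the following sketch guards a critical section with an atomically created lock file. It is not part of chest: `lock_file` and the `.keys.lock` name are hypothetical, the code assumes Python 3, and a production lock would also need to handle stale lock files left behind by crashed processes.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def lock_file(path, timeout=5.0, poll=0.05):
    """Hold an exclusive lock by atomically creating `path`."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError("could not acquire " + path)
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)  # releasing the lock deletes the file
```

A writer would then wrap its update of the key index in `with lock_file(os.path.join(chest_dir, '.keys.lock')):`, where `chest_dir` stands for the chest's directory.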

LICENSE

New BSD. See License

Install

chest is available through conda:

conda install chest

chest is on the Python Package Index (PyPI):

pip install chest

Example

>>> from chest import Chest
>>> c = Chest()

>>> # Acts like a normal dictionary
>>> c['x'] = [1, 2, 3]
>>> c['x']
[1, 2, 3]

>>> # Data persists to local files
>>> c.flush()
>>> import os
>>> os.listdir(c.path)
['.keys', 'x']

>>> # These files hold pickled results
>>> import pickle
>>> pickle.load(open(c.key_to_filename('x'), 'rb'))
[1, 2, 3]

>>> # Though one normally accesses these files with chest itself
>>> c2 = Chest(path=c.path)
>>> c2.keys()
['x']
>>> c2['x']
[1, 2, 3]

>>> # Chest is configurable, so one can use json, etc. instead of pickle
>>> import json
>>> c = Chest(path='my-chest', dump=json.dump, load=json.load)
>>> c['x'] = [1, 2, 3]
>>> c.flush()

>>> json.load(open(c.key_to_filename('x')))
[1, 2, 3]

Dependencies

Chest supports Python 2.6+ and Python 3.2+ with a common codebase.

It currently depends on the heapdict library, which is a lightweight dependency.

Comments

ENH: Allows user to pass in a custom lock object

Reading the docs I saw that this is not multiprocess safe; would it be sufficient if a user passed in their own multiprocessing-safe lock object to the chest? Either way, this could be a convenient feature.

opened by llllllllll 10
Store filenames

Filenames are stored along with keys, making old chests more portable. _keys, which was a set, is replaced with a {key -> filename} dictionary. The dictionary is dumped/loaded as a list of tuples for json compatibility, but used as a dict internally.

opened by maxhutch 7
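The round trip described in this pull request can be sketched with the standard json module. The helper names `dump_keys` and `load_keys` and the example filenames are hypothetical; the point is only that a dictionary with tuple keys can be written as a list of pairs and rebuilt afterwards, since JSON turns tuples into lists.

```python
import json


def dump_keys(mapping):
    # JSON objects only allow string keys, so store (key, filename) pairs.
    return json.dumps(list(mapping.items()))


def load_keys(text):
    def restore(key):
        # JSON has no tuple type; lists coming back were tuples originally.
        if isinstance(key, list):
            return tuple(restore(part) for part in key)
        return key

    return {restore(key): filename for key, filename in json.loads(text)}


keys = {"x": "x", ("one", "two"): "_one/two"}  # illustrative filenames
assert load_keys(dump_keys(keys)) == keys
```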
Added cache_many method and open_many constructor argument.

cache_many loads a list of keys to the in-cache memory.

open_many takes a list of filenames and returns a list of open files. This is mostly to support overloading of open.

opened by maxhutch 4
Adding "eat" procedure to move the contents of one chest into another.

c1.eat(c2) flushes c2, moves c2's files into c1's directory, and adds c2's keys to c1. By default, keys in c2 overwrite the same keys in c1, but that behavior can be controlled by overwrite=False.

opened by maxhutch 4
ENH: on_full policy
Allows a user to set a maximum disk usage and a policy defining what to do when the max disk is used and a new entry must be added.

The two options are currently:

raise_: raise an OSError indicating you ran out of space.

pop_lru: rotate the least recently used element out of the chest.

Users may pass any callable here; however, these two are defined for them.
opened by llllllllll 3
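To make the idea of a pluggable policy concrete, here is a small sketch that is independent of chest. `BoundedStore`, its `max_items` limit and the three-argument policy signature are assumptions made for illustration; the pull request limits disk usage rather than item count, and its real callable signature is not shown in the text above.

```python
from collections import OrderedDict


def raise_(store, key, value):
    # Policy 1: refuse the new entry.
    raise OSError("store is full")


def pop_lru(store, key, value):
    # Policy 2: drop the least recently used entry to make room.
    store.data.popitem(last=False)


class BoundedStore:
    def __init__(self, max_items, on_full=raise_):
        self.data = OrderedDict()  # ordered from least to most recently used
        self.max_items = max_items
        self.on_full = on_full     # any callable(store, key, value)

    def __setitem__(self, key, value):
        if key not in self.data and len(self.data) >= self.max_items:
            self.on_full(self, key, value)
        self.data[key] = value
        self.data.move_to_end(key)

    def __getitem__(self, key):
        value = self.data[key]
        self.data.move_to_end(key)  # reading counts as a use
        return value
```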

Tuple keys move files into directories

Example

    In [1]: from chest import Chest

    In [2]: c = Chest(path='foo')

    In [3]: c['one'] = 1

    In [4]: c['one', 'two'] = 12

    In [5]: c['one', 'three'] = 13

    In [6]: c.flush()

    In [7]: !tree foo
    foo
    ├── one
    └── _one
        ├── three
        └── two

        1 directory, 3 files

cc @maxhutch

opened by mrocklin 2
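One mapping that would produce the tree shown above is sketched below. It is inferred from the example output only and is not taken from chest's source: `key_to_path` is a hypothetical name, and prefixing every directory component with an underscore is an assumption that happens to match the `one` file and the `_one` directory.

```python
import os


def key_to_path(key):
    """Map a key to a relative path; tuple keys become nested directories."""
    if isinstance(key, tuple):
        *parents, leaf = (str(part) for part in key)
        # The underscore keeps the directory '_one' distinct from the file 'one'.
        return os.path.join(*["_" + p for p in parents], leaf)
    return str(key)
```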

Chunking open_many calls to prevent open file bloat.

When prefetching, the list of keys to load are chunked up to prevent open_many from opening too many files simultaneously.

On my system, the user file limit was 1024, so I set it to 512.

This should resolve #14.

opened by maxhutch 0
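The chunking idea can be sketched in a few lines. This is an illustration rather than the code from the pull request: `chunks` and `load_in_chunks` are hypothetical helpers, and the default of 512 simply mirrors the number mentioned above.

```python
def chunks(seq, size):
    """Yield consecutive slices of `seq` with at most `size` items each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def load_in_chunks(filenames, load, chunk_size=512, opener=open):
    """Load every file while keeping at most `chunk_size` files open."""
    results = []
    for group in chunks(filenames, chunk_size):
        files = [opener(name, "rb") for name in group]
        try:
            results.extend(load(f) for f in files)
        finally:
            for f in files:
                f.close()
    return results
```

Calling `load_in_chunks(names, pickle.load)` would then read any number of pickled files without exceeding the per-process limit on open files, provided `chunk_size` stays below that limit.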
Prefetch opens too many files

For a reasonably large chest c, c.prefetch(list(c.keys())) will cause OSError: [Errno 24] Too many open files because open_many opens everything, consumes everything, and then closes everything.

Simple solution: chunk up prefetches into open_many's of some maximum size.
bug

opened by maxhutch 0
Fast memory usage

The memory_usage property is now a stateful field that we track during operation. This is error-prone but faster for cases where we churn disk frequently.

opened by mrocklin 0
Bugfix: if dump() was called, del can't write keys during flush.

This just checks if dump has been called (by seeing if self.path exists). If dump was called, then delete the contents of self.inmem but don't move them to disk or write keys.

This gets rid of the exceptions that were briefly mentioned in #4

opened by maxhutch 0
Import MutableMapping from collections.abc

Python 3.3 and above moved the abstract base classes to their own module under collections, collections.abc. Import from that location, falling back if required for Python 3.2.

Fixes #29

opened by s-t-e-v-e-n-k 1
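The fallback import described in this pull request typically looks like the following. This is the common idiom rather than a quote from the actual patch.

```python
try:
    # Python 3.3 and later
    from collections.abc import MutableMapping
except ImportError:
    # Python 2 and Python 3.2
    from collections import MutableMapping
```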

Python 3.9 support

    /opt/hostedtoolcache/Python/3.8.7/x64/lib/python3.8/site-packages/chest/core.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
      from collections import MutableMapping

Would you accept a PR fix, and make a new release?

opened by cmacdonald 0

run out of inodes by using chest

Hi,

I tried chest on a large file on an Ubuntu 16.04 system. Chest created millions of files on my hard disk and I ran out of inodes. I have plenty of disk space available, but I cannot use it because inode usage is at its peak. What can I do about it? Can I reduce the millions of generated files to hundreds? Is there an option for this?

Thanks in advance

opened by z-rahimi 0
Include pickling with the highest protocol

I am trying to modify the protocol used for dumping, however, it always complains that it does not understand the argument protocol in dump=partial(pickle.dump, protocol=pickle.HIGHEST_PROTOCOL). Could you put this as default?

opened by jshleap 0
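One possible workaround is a small wrapper that passes the protocol positionally. It assumes the complaint comes from a pickle implementation that rejects keyword arguments, which the issue does not confirm. `dump_highest` is a hypothetical name; it would be handed to the chest through the `dump=` argument shown in the Example section.

```python
import io
import pickle


def dump_highest(obj, file):
    # Pass the protocol as the third positional argument, not as a keyword.
    pickle.dump(obj, file, pickle.HIGHEST_PROTOCOL)


# Round trip through an in-memory buffer to check the wrapper.
buf = io.BytesIO()
dump_highest([1, 2, 3], buf)
buf.seek(0)
assert pickle.load(buf) == [1, 2, 3]
```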
Switched broken pypip.in badges to shields.io

Hello, this is an auto-generated Pull Request.

Some time ago, pypip.in shut down. This broke the badges for a bunch of repositories, including chest. Thankfully, an equivalent service is run by shields.io. This pull request changes the badges to use shields.io instead.

Unfortunately, PyPI has removed download statistics from their API, which means that even the shields.io "download count" badges are broken (they display "no longer available"). So those badges should really be removed entirely. Since this is an automated process (and trying to automatically remove the badges from READMEs can be tricky), this pull request just replaces the URL with the shields.io syntax.

opened by movermeyer 0
dump with another wrapper than the one used for caching
I am trying to save the resulting chest into a different file format than the one used for caching.

A practical example: I used pickle when I initialize my chest because I am using nested dicts and sets, which would be stripped off I guess if it was to be stored in json.

But in the end, I would like to dump my chest into json for export.

Is there a way to do that?

For example, this does not work past the first level:

    import ujson as json
    with open('db.json', 'w') as f:
        f.write(json.dumps(out, ensure_ascii=False, indent=4, sort_keys=True))
opened by lrq3000 0
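One way to approach this export, sketched with the standard json module instead of ujson, is to convert the nested structure to JSON-friendly types first. `to_jsonable` and `export_json` are hypothetical helpers, not part of chest. The sketch assumes the elements of each set can be sorted, that non-string keys may be turned into strings, and that the mapping supports iteration over its keys.

```python
import json


def to_jsonable(obj):
    """Recursively replace sets and tuples with lists and stringify dict keys."""
    if isinstance(obj, dict):
        return {str(k): to_jsonable(v) for k, v in obj.items()}
    if isinstance(obj, (set, frozenset)):
        return sorted(to_jsonable(v) for v in obj)
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(v) for v in obj]
    return obj


def export_json(mapping, filename):
    # Read values one key at a time so only plain data reaches json.dump.
    plain = {str(k): to_jsonable(mapping[k]) for k in mapping}
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(plain, f, ensure_ascii=False, indent=4, sort_keys=True)
```

The chest itself can keep using pickle for caching; only the exported copy goes through the conversion.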

Owner

Blaze

A simple tutorial to use tree-sitter to parse code into ASTs

A simple tutorial to use py-tree-sitter to parse code into ASTs. To understand what is tree-sitter, see https://github.com/tree-sitter/tree-sitter. Tr

7 Sep 17, 2022

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

Annoy Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given quer

10.6k Jan 1, 2023

Visual disk-usage analyser for docker images

whaler What? A command-line tool for visually investigating the disk usage of docker images Why? Large images are slow to move and expensive to store.

194 Sep 1, 2022

Diamond is a python daemon that collects system metrics and publishes them to Graphite (and others). It is capable of collecting cpu, memory, network, i/o, load and disk metrics. Additionally, it features an API for implementing custom collectors for gathering metrics from almost any source.

Diamond Diamond is a python daemon that collects system metrics and publishes them to Graphite (and others). It is capable of collecting cpu, memory,

1.7k Jan 5, 2023

Decryption utility for PGP Whole Disk Encryption

wdepy: Decryption and Inspection for PGP WDE Disks This is a small python tool to inspect and decrypt disk images encrypted with PGP Whole Disk Encryp

17 Oct 7, 2022

Python disk-backed cache (Django-compatible). Faster than Redis and Memcached. Pure-Python.

DiskCache is an Apache2 licensed disk and file backed cache library, written in pure-Python, and compatible with Django.

1.7k Jan 5, 2023

Decryption utility for PGP Whole Disk Encryption

11 Mar 21, 2021

It is a temporary project to study discord interactions. You can set permissions conveniently when you invite a particular disk code bot.

Permission Bot A bot created to study the message-components in Discord. Setup Convert the /config/config_example.ini file to /config/config.ini. The basic format of the config file is

4 Mar 7, 2022

🔄 🌐 Handle thousands of HTTP requests, disk writes, and other I/O-bound tasks simultaneously with Python's quintessential async libraries.

🔄 🌐 Handle thousands of HTTP requests, disk writes, and other I/O-bound tasks simultaneously with Python's quintessential async libraries.

15 Dec 12, 2022

Give you a better view of your Docker registry disk usage.

registry-du Give you a better view of your Docker registry disk usage. This small tool will analysis your Docker registry(vanilla or Harbor both work)

16 Jan 7, 2023

check disk storage's amount and if necessary, send alert message by email

DiskStorageAmountChecker What is this script? This script checks the available disk storage of specified servers and send alerting

1 Oct 22, 2021

Find vulnerable Log4j2 versions on disk and also inside Java Archive Files (Log4Shell CVE-2021-44228)

log4j-finder A Python3 script to scan the filesystem to find Log4j2 that is vulnerable to Log4Shell (CVE-2021-44228) It scans recursively both on disk

431 Dec 22, 2022

A Python library that tees the standard output & standard error from the current process to files on disk, while preserving terminal semantics

A Python library that tees the standard output & standard error from the current process to files on disk, while preserving terminal semantics (so breakpoint(), etc work as normal)

7 Nov 30, 2022

VADER Sentiment Analysis. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, and works well on texts from other domains.

VADER-Sentiment-Analysis VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifica

3.8k Dec 30, 2022

