Simple spill-to-disk dictionary

Overview

Chest

A dictionary that spills to disk.

Chest acts like a dictionary, but it can write its contents to disk. This is useful in the following two situations:

  1. Chest can hold datasets that are larger than memory
  2. Chest persists and so can be saved and loaded for later use

Related Projects

The standard library shelve is an alternative out-of-core dictionary. Chest offers the following benefits over shelve:

  1. Chest supports any hashable key (not just strings)
  2. Chest supports pluggable serialization and file saving schemes

Alternatively one might consider a traditional key-value store database like Redis.

Shove is another excellent alternative with support for a variety of stores.

How it works

Chest stores data in two locations:

  1. An in-memory dictionary
  2. On the filesystem in a directory owned by the chest

As a user adds contents to the chest, the in-memory dictionary fills up. When a chest stores more data in memory than desired (see the available_memory= keyword argument), it writes the larger contents of the chest to disk as pickle files (the choice of pickle is configurable). When a user asks for a value, chest checks the in-memory store, then checks on disk, loading the value into memory if necessary and pushing other values to disk.
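
For example, the memory budget is set when the chest is constructed (a minimal sketch; available_memory counts bytes, though defaults and eviction details may vary by version):

>>> from chest import Chest
>>> c = Chest(available_memory=1e6)    # keep roughly one megabyte in memory
>>> for i in range(2000):
...     c[str(i)] = list(range(1000))  # earlier entries spill to disk as the budget fills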

Chest is a simple project. It was intended to provide a simple interface to assist in the storage and retrieval of numpy arrays. However, its design and implementation are agnostic to this use case, and so it could be used in a variety of other situations.

With minimal work chest could be extended to serve as a communication point between multiple processes.

Known Failings

Chest was designed to hold a moderate number of largish numpy arrays. It does not handle the use case of very many small key-value pairs (though it could with modest effort). In particular, chest has the following deficiencies:

  1. Chest is not multi-process safe. We should institute a file lock, at least around the .keys file.
  2. Chest does not support in-place mutation of values that live on disk (see the sketch below).
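
A hedged illustration of the second point, assuming chest only notices changes made through assignment (this is not a workaround documented by the project):

>>> c = Chest()
>>> c['x'] = [1, 2, 3]
>>> c.flush()            # 'x' may now live on disk
>>> value = c['x']       # loads the value into memory
>>> value.append(4)      # in-place mutation is not tracked by chest
>>> c['x'] = value       # reassign so the change is recorded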

LICENSE

New BSD. See License

Install

chest is available through conda:

conda install chest

chest is on the Python Package Index (PyPI):

pip install chest

Example

>>> from chest import Chest
>>> c = Chest()

>>> # Acts like a normal dictionary
>>> c['x'] = [1, 2, 3]
>>> c['x']
[1, 2, 3]

>>> # Data persists to local files
>>> c.flush()
>>> import os
>>> os.listdir(c.path)
['.keys', 'x']

>>> # These files hold pickled results
>>> import pickle
>>> pickle.load(open(c.key_to_filename('x'), 'rb'))
[1, 2, 3]

>>> # Though one normally accesses these files with chest itself
>>> c2 = Chest(path=c.path)
>>> c2.keys()
['x']
>>> c2['x']
[1, 2, 3]

>>> # Chest is configurable, so one can use json, etc. instead of pickle
>>> import json
>>> c = Chest(path='my-chest', dump=json.dump, load=json.load)
>>> c['x'] = [1, 2, 3]
>>> c.flush()

>>> json.load(open(c.key_to_filename('x')))
[1, 2, 3]

Dependencies

Chest supports Python 2.6+ and Python 3.2+ with a common codebase.

It currently depends on the heapdict library, which is a lightweight dependency.

Comments
  • ENH: Allows user to pass in a custom lock object

    Reading the docs I saw that this is not multiprocess safe; would it be sufficient if a user passed in their own multiprocessing-safe lock object to the chest? Either way, this could be a convenient feature.
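
    A sketch of the proposed usage (hypothetical: the lock= keyword is what this PR proposes, not an existing chest API):

        from multiprocessing import Lock
        from chest import Chest

        c = Chest(lock=Lock())  # lock= is hypothetical, proposed by this PR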

    opened by llllllllll 10
  • Store filenames

    Filenames are stored along with keys, making old chests more portable. _keys, which was a set, is replaced with a {key -> filename} dictionary. The dictionary is dumped/loaded as a list of tuples for json compatibility, but used as a dict internally.
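
    A small illustration of the list-of-tuples trick (not chest's actual code): json cannot encode a dict whose keys are tuples, so the mapping is dumped as pairs and rebuilt on load.

        import json

        key_to_file = {'x': 'x', ('one', 'two'): '_one/two'}
        text = json.dumps(list(key_to_file.items()))   # tuples become json arrays
        pairs = json.loads(text)                       # ...and come back as lists
        restored = {tuple(k) if isinstance(k, list) else k: v for k, v in pairs}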

    opened by maxhutch 7
  • Added cache_many method and open_many constructor argument.

    cache_many loads a list of keys into the in-memory cache.

    open_many takes a list of filenames and returns a list of open files. This is mostly to support overloading of open.
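
    A hypothetical usage sketch following the PR's description (these names come from the PR, not a released API):

        c.cache_many(['a', 'b', 'c'])      # pull several keys into memory at once

        def open_many(filenames):          # custom opener returning a list of open files
            return [open(fn, 'rb') for fn in filenames]

        c2 = Chest(open_many=open_many)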

    opened by maxhutch 4
  • Adding "eat" procedure to move the contents of one chest into another.

    c1.eat(c2) flushes c2, moves c2's files into c1's directory, and adds c2's keys to c1. By default, keys in c2 overwrite the same keys in c1, but that behavior can be controlled by overwrite=False.

    opened by maxhutch 4
  • ENH: on_full policy

    Allows a user to set a maximum disk usage and a policy defining what to do when the maximum is reached and a new entry must be added.

    The two options are currently:

    1. raise_: raise an OSError indicating you ran out of space.
    2. pop_lru: rotate the least recently used element out of the chest.

    Users may pass any callable here; however, these two are defined for them.
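
    A hypothetical sketch of such a policy (the keyword names here are assumptions based on this PR, not a documented API):

        def raise_(chest):  # placeholder signature; the PR defines the real one
            raise OSError('chest has exceeded its disk budget')

        c = Chest(available_disk=10e9, on_full=raise_)  # hypothetical keywords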

    opened by llllllllll 3
  • Tuple keys move files into directories

    Example

        In [1]: from chest import Chest
    
        In [2]: c = Chest(path='foo')
    
        In [3]: c['one'] = 1
    
        In [4]: c['one', 'two'] = 12
    
        In [5]: c['one', 'three'] = 13
    
        In [6]: c.flush()
    
        In [7]: !tree foo
        foo
        ├── one
        └── _one
            ├── three
            └── two
    
        1 directory, 3 files
    

    cc @maxhutch

    opened by mrocklin 2
  • Chunking open_many calls to prevent open file bloat.

    When prefetching, the list of keys to load is chunked up to prevent open_many from opening too many files simultaneously.

    On my system, the user file limit was 1024, so I set it to 512.

    This should resolve #14.

    opened by maxhutch 0
  • Prefetch opens too many files

    For a reasonably large chest c, c.prefetch(list(c.keys())) will cause OSError: [Errno 24] Too many open files because open_many opens everything, consumes everything, and then closes everything.

    Simple solution: chunk up prefetches into open_many's of some maximum size.
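
    A sketch of that workaround (the chunk size of 512 mirrors the figure mentioned in the fix; prefetch is the method named in this issue):

        keys = list(c.keys())
        for i in range(0, len(keys), 512):  # stay well under a 1024 open-file limit
            c.prefetch(keys[i:i + 512])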

    bug 
    opened by maxhutch 0
  • Fast memory usage

    The memory_usage property is now a stateful field that we track during operation. This is error prone but faster for cases where we churn disk frequently.

    opened by mrocklin 0
  • Bugfix: if dump() was called, del can't write keys during flush.

    This just checks if dump has been called (by seeing if self.path exists). If dump was called, then delete the contents of self.inmem but don't move them to disk or write keys.

    This gets rid of the exceptions that were briefly mentioned in #4

    opened by maxhutch 0
  • Import MutableMapping from collections.abc

    Python 3.3 and above moved the abstract base classes to their own module under collections, collections.abc. Import from that location, falling back if required for Python 3.2.

    Fixes #29
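
    The import-with-fallback idiom this PR describes (sketched here, not copied from the patch):

        try:
            from collections.abc import MutableMapping
        except ImportError:  # Python 3.2 and earlier, and Python 2
            from collections import MutableMapping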

    opened by s-t-e-v-e-n-k 1
  • Python 3.9 support

     /opt/hostedtoolcache/Python/3.8.7/x64/lib/python3.8/site-packages/chest/core.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
        from collections import MutableMapping
    

    Would you accept a PR fix, and make a new release?

    opened by cmacdonald 0
  • run out of inodes by using chest

    Hi,

    I tried chest on a large file on an Ubuntu 16.04 system. Chest created millions of files on my hard disk and I was running out of inodes. I have plenty of hard disk space available but I cannot use it because inode usage is at its peak. What can I do about it? Can I reduce the millions of generated files to hundreds? Is there any option for this?

    Thanks in advance

    opened by z-rahimi 0
  • Include pickling with the highest protocol

    I am trying to modify the protocol used for dumping; however, it always complains that it does not understand the argument protocol in dump=partial(pickle.dump, protocol=pickle.HIGHEST_PROTOCOL). Could you put this as default?
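
    A possible workaround (an assumption, not a confirmed fix): since chest calls dump(value, file) in the same shape as json.dump, a plain wrapper function may sidestep the keyword problem.

        import pickle
        from chest import Chest

        def dump_highest(obj, f):  # matches the dump(obj, file) call shape chest expects
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

        c = Chest(dump=dump_highest)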

    opened by jshleap 0
  • Switched broken pypip.in badges to shields.io

    Hello, this is an auto-generated Pull Request. (Feedback?)

    Some time ago, pypip.in shut down. This broke the badges for a bunch of repositories, including chest. Thankfully, an equivalent service is run by shields.io. This pull request changes the badges to use shields.io instead.

    Unfortunately, PyPI has removed download statistics from their API, which means that even the shields.io "download count" badges are broken (they display "no longer available". See this). So those badges should really be removed entirely. Since this is an automated process (and trying to automatically remove the badges from READMEs can be tricky), this pull request just replaces the URL with the shields.io syntax.

    opened by movermeyer 0
  • dump with another wrapper than the one used for caching

    I am trying to save the resulting chest into a different file format than the one used for caching.

    A practical example: I used pickle when I initialized my chest because I am using nested dicts and sets, which I guess would be stripped out if they were stored in json.

    But in the end, I would like to dump my chest into json for export.

    Is there a way to do that?

    For example, this does not work past the first level:

    import ujson as json
    with open('db.json', 'w') as f:
        f.write(json.dumps(out, ensure_ascii=False, indent=4, sort_keys=True))
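
    One possible approach (a sketch, not an answer from the maintainers): materialize the chest into a plain dict and serialize that, converting sets explicitly since json cannot encode them.

        import json

        out = {k: c[k] for k in c.keys()}  # pull every value through the chest
        with open('db.json', 'w') as f:
            json.dump(out, f, indent=4, sort_keys=True,
                      default=lambda o: sorted(o) if isinstance(o, set) else str(o))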
    
    opened by lrq3000 0