Python disk-backed cache (Django-compatible). Faster than Redis and Memcached. Pure-Python.

Overview

DiskCache: Disk Backed Cache

DiskCache is an Apache2 licensed disk and file backed cache library, written in pure-Python, and compatible with Django.

The cloud-based computing of 2021 puts a premium on memory. Gigabytes of empty space is left on disks as processes vie for memory. Among these processes is Memcached (and sometimes Redis) which is used as a cache. Wouldn't it be nice to leverage empty disk space for caching?

Django is Python's most popular web framework and ships with several caching backends. Unfortunately the file-based cache in Django is essentially broken. The culling method is random and large caches repeatedly scan a cache directory which slows linearly with growth. Can you really allow it to take sixty milliseconds to store a key in a cache with a thousand items?

In Python, we can do better. And we can do it in pure-Python!

In [1]: import pylibmc
In [2]: client = pylibmc.Client(['127.0.0.1'], binary=True)
In [3]: client[b'key'] = b'value'
In [4]: %timeit client[b'key']

10000 loops, best of 3: 25.4 µs per loop

In [5]: import diskcache as dc
In [6]: cache = dc.Cache('tmp')
In [7]: cache[b'key'] = b'value'
In [8]: %timeit cache[b'key']

100000 loops, best of 3: 11.8 µs per loop

Note: Micro-benchmarks have their place but are not a substitute for real measurements. DiskCache offers cache benchmarks to defend its performance claims. Micro-optimizations are avoided but your mileage may vary.

DiskCache efficiently makes gigabytes of storage space available for caching. By leveraging rock-solid database libraries and memory-mapped files, cache performance can match and exceed industry-standard solutions. There's no need for a C compiler or to run another process. Performance is a feature, and testing has 100% coverage with unit tests and hours of stress testing.

Testimonials

Daren Hasenkamp, Founder --

"It's a useful, simple API, just like I love about Redis. It has reduced the amount of queries hitting my Elasticsearch cluster by over 25% for a website that gets over a million users/day (100+ hits/second)."

Mathias Petermann, Senior Linux System Engineer --

"I implemented it into a wrapper for our Ansible lookup modules and we were able to speed up some Ansible runs by almost 3 times. DiskCache is saving us a ton of time."

Does your company or website use DiskCache? Send us a message and let us know.

Features

  • Pure-Python
  • Fully Documented
  • Benchmark comparisons (alternatives, Django cache backends)
  • 100% test coverage
  • Hours of stress testing
  • Performance matters
  • Django compatible API
  • Thread-safe and process-safe
  • Supports multiple eviction policies (LRU and LFU included)
  • Keys support "tag" metadata and eviction (both sketched below, after this list)
  • Developed on Python 3.9
  • Tested on CPython 3.6, 3.7, 3.8, 3.9
  • Tested on Linux, Mac OS X, and Windows
  • Tested using GitHub Actions
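
The eviction-policy and tag features look roughly like this. A minimal sketch: the policy name 'least-frequently-used' and the tag_index option follow the tutorial, while the keys and values are made up:

>>> from diskcache import Cache
>>> cache = Cache('tmp', eviction_policy='least-frequently-used', tag_index=True)
>>> cache.set('report-2021', b'...', tag='reports')
True
>>> cache.evict('reports')      # removes every item tagged 'reports'
1
>>> cache.close()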

Quickstart

Installing DiskCache is simple with pip:

$ pip install diskcache

You can access documentation in the interpreter with Python's built-in help function:

>>> import diskcache
>>> help(diskcache)                             # doctest: +SKIP

The core of DiskCache is three data types intended for caching. Cache objects manage a SQLite database and filesystem directory to store key and value pairs. FanoutCache provides a sharding layer to utilize multiple caches and DjangoCache integrates that with Django:

>>> from diskcache import Cache, FanoutCache, DjangoCache
>>> help(Cache)                                 # doctest: +SKIP
>>> help(FanoutCache)                           # doctest: +SKIP
>>> help(DjangoCache)                           # doctest: +SKIP
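
As a quick sketch of the core Cache API (the directory name and values are arbitrary; set, get, and memoize are described in the tutorial):

>>> cache = Cache('tmp')
>>> cache.set('key', 'value', expire=60, tag='demo')
True
>>> cache.get('key')
'value'
>>> @cache.memoize(expire=300)
... def fibonacci(number):
...     if number < 2:
...         return number
...     return fibonacci(number - 1) + fibonacci(number - 2)
...
>>> fibonacci(10)
55
>>> cache.close()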

Built atop the caching data types are Deque and Index, which work as cross-process, persistent replacements for Python's collections.deque and dict. These implement the sequence and mapping container base classes:

>>> from diskcache import Deque, Index
>>> help(Deque)                                 # doctest: +SKIP
>>> help(Index)                                 # doctest: +SKIP
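
Both persist to a directory on disk and may be opened from multiple processes. A small sketch, with arbitrary directory names:

>>> deque = Deque(range(5), directory='tmp-deque')
>>> deque.appendleft(-1)
>>> deque.pop()
4
>>> index = Index('tmp-index')
>>> index['answer'] = 42
>>> index['answer']
42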

Finally, a number of recipes for cross-process synchronization are provided using an underlying cache. Features like memoization with cache stampede prevention, cross-process locking, and cross-process throttling are available:

>>> from diskcache import memoize_stampede, Lock, throttle
>>> help(memoize_stampede)                      # doctest: +SKIP
>>> help(Lock)                                  # doctest: +SKIP
>>> help(throttle)                              # doctest: +SKIP
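
A brief sketch of how the recipes attach to a cache; the decorated function names are illustrative:

>>> cache = Cache('tmp')
>>> @memoize_stampede(cache, expire=1)
... def expensive(number):
...     return number ** 2
...
>>> @throttle(cache, 2, 1)              # at most two calls per second, across processes
... def worker():
...     pass
...
>>> with Lock(cache, 'report-123'):     # held by at most one process at a time
...     pass
...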

Python's docstrings are a quick way to get started but not intended as a replacement for the DiskCache Tutorial and DiskCache API Reference.

User Guide

For those wanting more details, this part of the documentation describes the tutorial, benchmarks, API, and development.

Comparisons

Comparisons to popular projects related to DiskCache.

Key-Value Stores

DiskCache is mostly a simple key-value store. Feature comparisons with four other projects are shown in the tables below.

  • dbm is part of Python's standard library and implements a generic interface to variants of the DBM database — dbm.gnu or dbm.ndbm. If none of these modules is installed, the slow-but-simple dbm.dumb is used.
  • shelve is part of Python's standard library and implements a “shelf” as a persistent, dictionary-like object. The difference with “dbm” databases is that the values can be anything that the pickle module can handle.
  • sqlitedict is a lightweight wrapper around Python's sqlite3 database with a simple, Pythonic dict-like interface and support for multi-thread access. Keys are arbitrary strings, values arbitrary pickle-able objects.
  • pickleDB is a lightweight and simple key-value store. It is built upon Python's simplejson module and was inspired by Redis. It is licensed with the BSD three-clause license.

Features

Feature          | diskcache     | dbm     | shelve  | sqlitedict   | pickleDB
Atomic?          | Always        | Maybe   | Maybe   | Maybe        | No
Persistent?      | Yes           | Yes     | Yes     | Yes          | Yes
Thread-safe?     | Yes           | No      | No      | Yes          | No
Process-safe?    | Yes           | No      | No      | Maybe        | No
Backend?         | SQLite        | DBM     | DBM     | SQLite       | File
Serialization?   | Customizable  | None    | Pickle  | Customizable | JSON
Data Types?      | Mapping/Deque | Mapping | Mapping | Mapping      | Mapping
Ordering?        | Insert/Sorted | None    | None    | None         | None
Eviction?        | LRU/LFU/more  | None    | None    | None         | None
Vacuum?          | Automatic     | Maybe   | Maybe   | Manual       | Automatic
Transactions?    | Yes           | No      | No      | Maybe        | No
Multiprocessing? | Yes           | No      | No      | No           | No
Forkable?        | Yes           | No      | No      | No           | No
Metadata?        | Yes           | No      | No      | No           | No

Quality

Project       | diskcache     | dbm     | shelve  | sqlitedict | pickleDB
Tests?        | Yes           | Yes     | Yes     | Yes        | Yes
Coverage?     | Yes           | Yes     | Yes     | Yes        | No
Stress?       | Yes           | No      | No      | No         | No
CI Tests?     | Linux/Windows | Yes     | Yes     | Linux      | No
Python?       | 2/3/PyPy      | All     | All     | 2/3        | 2/3
License?      | Apache2       | Python  | Python  | Apache2    | 3-Clause BSD
Docs?         | Extensive     | Summary | Summary | Readme     | Summary
Benchmarks?   | Yes           | No      | No      | No         | No
Sources?      | GitHub        | GitHub  | GitHub  | GitHub     | GitHub
Pure-Python?  | Yes           | Yes     | Yes     | Yes        | Yes
Server?       | No            | No      | No      | No         | No
Integrations? | Django        | None    | None    | None       | None

Timings

These are rough measurements. See DiskCache Cache Benchmarks for more rigorous data.

Project | diskcache | dbm    | shelve | sqlitedict | pickleDB
get     | 25 µs     | 36 µs  | 41 µs  | 513 µs     | 92 µs
set     | 198 µs    | 900 µs | 928 µs | 697 µs     | 1,020 µs
delete  | 248 µs    | 740 µs | 702 µs | 1,717 µs   | 1,020 µs

Caching Libraries

  • joblib.Memory provides caching functions and works by explicitly saving the inputs and outputs to files. It is designed to work with non-hashable and potentially large input and output data types such as numpy arrays.
  • klepto extends Python’s lru_cache to utilize different keymaps and alternate caching algorithms, such as lfu_cache and mru_cache. Klepto uses a simple dictionary-style interface for all caches and archives.

Data Structures

  • dict is a mapping object that maps hashable keys to arbitrary values. Mappings are mutable objects. There is currently only one standard Python mapping type, the dictionary.
  • pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
  • Sorted Containers is an Apache2 licensed sorted collections library, written in pure-Python, and fast as C-extensions. Sorted Containers implements sorted list, sorted dictionary, and sorted set data types.

Pure-Python Databases

  • ZODB supports an isomorphic interface for database operations which means there's little impact on your code to make objects persistent and there's no database mapper that partially hides the database.
  • CodernityDB is an open source, pure-Python, multi-platform, schema-less, NoSQL database and includes an HTTP server version, and a Python client library that aims to be 100% compatible with the embedded version.
  • TinyDB is a tiny, document oriented database optimized for your happiness. If you need a simple database with a clean API that just works without lots of configuration, TinyDB might be the right choice for you.

Object Relational Mappings (ORM)

  • Django ORM provides models that are the single, definitive source of information about data and contains the essential fields and behaviors of the stored data. Generally, each model maps to a single SQL database table.
  • SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL. It provides a full suite of well known enterprise-level persistence patterns.
  • Peewee is a simple and small ORM. It has few (but expressive) concepts, making it easy to learn and intuitive to use. Peewee supports Sqlite, MySQL, and PostgreSQL with tons of extensions.
  • SQLObject is a popular Object Relational Manager for providing an object interface to your database, with tables as classes, rows as instances, and columns as attributes.
  • Pony ORM is a Python ORM with beautiful query syntax. Use Python syntax for interacting with the database. Pony translates such queries into SQL and executes them in the database in the most efficient way.

SQL Databases

  • SQLite is part of Python's standard library and provides a lightweight disk-based database that doesn’t require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language.
  • MySQL is one of the world’s most popular open source databases and has become a leading database choice for web-based applications. MySQL includes a standardized database driver for Python platforms and development.
  • PostgreSQL is a powerful, open source object-relational database system with over 30 years of active development. Psycopg is the most popular PostgreSQL adapter for the Python programming language.
  • Oracle DB is a relational database management system (RDBMS) from the Oracle Corporation. Originally developed in 1977, Oracle DB is one of the most trusted and widely used enterprise relational database engines.
  • Microsoft SQL Server is a relational database management system developed by Microsoft. As a database server, it stores and retrieves data as requested by other software applications.

Other Databases

  • Memcached is free and open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
  • Redis is an open source, in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, and more.
  • MongoDB is a cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with schema. PyMongo is the recommended way to work with MongoDB from Python.
  • LMDB is a lightning-fast, memory-mapped database. With memory-mapped files, it has the read performance of a pure in-memory database while retaining the persistence of standard disk-based databases.
  • BerkeleyDB is a software library intended to provide a high-performance embedded database for key/value data. Berkeley DB is a programmatic toolkit that provides built-in database support for desktop and server applications.
  • LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. Data is stored sorted by key and users can provide a custom comparison function.

Reference

License

Copyright 2016-2021 Grant Jenks

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments
  • sqlite3.OperationalError and Timeout During Initialization from Multiple Threads

    The current Django cache creates a cache instance per thread, so each thread's first cache access runs the cache initialization.

    That initialization stores various settings into the SQL table (from Cache.__init__() in diskcache/core.py):

           # Set cached attributes: updates settings and sets pragmas.
    
            for key, value in sets.items():
                query = 'INSERT OR REPLACE INTO Settings VALUES (?, ?)'
                sql(query, (key, value))
                self.reset(key, value)
    
            for key, value in METADATA.items():
                query = 'INSERT OR IGNORE INTO Settings VALUES (?, ?)'
                sql(query, (key, value))
                self.reset(key)
    

    If multiple threads are started at the same time, this first cache access can hit a 'database locked' error during these writes. This is easy to demonstrate:

    (arcviz_3.6.2) doug@Dougs-MacBook-Pro:$ python manage.py shell
    Python 3.6.2 (default, Sep  6 2017, 18:33:29)
    [GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    (InteractiveConsole)
    >>> from django.core.cache import cache
    >>> import threading
    >>> def f():
    ...   cache.get('key')
    ...
    >>> for i in range(50):
    ...   threading.Thread(target=f).start()
    ...
    

    This results in a bunch of errors from the threads:

    >>> Exception in thread Thread-11:
    Traceback (most recent call last):
      File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/diskcache/core.py", line 574, in _transact
        sql('BEGIN IMMEDIATE')
    sqlite3.OperationalError: database is locked
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/threading.py", line 916, in _bootstrap_inner
        self.run()
      File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/threading.py", line 864, in run
        self._target(*self._args, **self._kwargs)
      File "<console>", line 2, in f
      File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/django/core/cache/__init__.py", line 99, in __getattr__
        return getattr(caches[DEFAULT_CACHE_ALIAS], name)
      File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/django/core/cache/__init__.py", line 80, in __getitem__
        cache = _create_cache(alias)
      File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/django/core/cache/__init__.py", line 55, in _create_cache
        return backend_cls(location, params)
      File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/diskcache/djangocache.py", line 28, in __init__
        self._cache = FanoutCache(directory, shards, timeout, **options)
      File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/diskcache/fanout.py", line 38, in __init__
        for num in range(shards)
      File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/diskcache/fanout.py", line 38, in <genexpr>
        for num in range(shards)
      File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/diskcache/core.py", line 435, in __init__
        self.reset(key, value)
      File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/diskcache/core.py", line 1863, in reset
        with self._transact() as (sql, _):
      File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/contextlib.py", line 81, in __enter__
        return next(self.gen)
      File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/diskcache/core.py", line 578, in _transact
        raise Timeout
    diskcache.core.Timeout
    

    This is with a very generic cache configuration:

    CACHES = {
      'default': {
        'BACKEND': 'diskcache.DjangoCache',
        'LOCATION': os.path.expanduser("~/.arc/cache"),
        'SHARDS': 4,
        'DATABASE_TIMEOUT': 10.0,
        'OPTIONS': {
            'size_limit': 1 * (2 ** 30)  # 1 gigabyte
        },
      },
    }
    

    We could put the initialization inside a lock, but I'm thinking maybe for diskcache, we don't want each django thread using a private cache object.
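
    One hedged way to realize the lock idea, assuming Django resolves each thread's cache instance on its first attribute access, is to serialize that first access (the helper name and warm-up key below are illustrative):

    import threading
    from django.core.cache import cache

    _init_lock = threading.Lock()

    def init_cache_for_thread():
        # Serialize the first per-thread access so the Settings writes in
        # Cache.__init__ do not contend for the SQLite write lock.
        with _init_lock:
            cache.get('__warmup__')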

    opened by dougjc 19
  • DjangoCache Out of Disk Space Scenario

    I had to do an emergency delete of the disk cache /var/tmp/django_disk_cache as my server had run out of disk space.

    Ever since, I receive Django errors:

    Exception Type: ValueError at x
    Exception Value: Key ':1xxx' not found

    Disabling DjangoCache is the only current fix. I've checked the Django db and there's no corresponding table for DjangoCache. There's obviously some reference to these keys somewhere but I can't find them. The docs mention a SQLite db, but I've searched my installation and can't find it.

    The help for DjangoCache also mentions a clear command, which I assume I'm meant to run in a Python shell, but I can't figure out how to run it and there are no examples.
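
    For reference, clear() can be reached through Django's normal cache handle. A hedged example, assuming the cache alias is 'default':

    >>> from django.core.cache import caches
    >>> caches['default'].clear()       # removes every key stored by DjangoCache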

    opened by silentjay 17
  • Many Disk Cache Instances Cause Inode Overflow After Consistent Use

    Hi Grant,

    I've been using DiskCache on my AWS systems for probably a year, and I love it quite a lot! I have many types of data I'm caching, thus many disk caches (~100), and I'm using a high-performance, low-capacity NVMe disk to keep things fast.

    Because of the small size of the disk, the number of inodes is 3.2M. Today, with all the folders created by diskcache that aren't cleaned, I got an Error 28 - no space left on device.

    Is there a recommended way to clean up a disk cache filesystem? It creates a ton of references that are never cleaned or managed. Could we ensure that all previous references are used before assigning new ones?

    Deleting its contents takes quite a while and I can't spare that downtime. Is there a cleanup callback or behavior we could add when a file is removed from a folder and that folder no longer has references? In general I would like better control and knowledge of what is being purged / cleaned up.
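
    One possible approach, sketched here rather than taken from an official recipe, is to let each cache trim itself in place instead of deleting directories by hand (cache_directories stands for your ~100 cache paths):

    from diskcache import Cache

    for directory in cache_directories:
        with Cache(directory) as cache:
            cache.expire()              # drop expired items
            cache.cull()                # evict items until the volume is under size_limit
            cache.check(fix=True)       # verify metadata and repair inconsistencies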

    opened by SoundsSerious 16
  • Help with debugging an issue in diskcache sqlite3 connection

    Hello, I've spent I don't know how many hours already debugging an issue in one of our stacks where I'm trying to use this package. In one of the classes I'm doing the following:

    self.cache = Cache(directory="/tmp/diskcache")
    self.cache.set(key, value)
    self.cache.get(key)
    

    self.cache.get always returns None.

    After some debugging I can say that the information is stored in the files and I can retrieve it using the _disk private attribute by passing the correct parameters. What apparently is not working is the part that stores the data in sqlite3, more specifically the _insert_row method. After it has executed and the outer transaction has finished, the Cache table is still empty. However, if I grab the return of the sql command from _insert_row and execute a commit, the data appears!

    Also what I've tried is the following while in the middle of a breakpoint:

    ipdb> from diskcache import Cache
    ipdb> cache = Cache("/tmp/diskcache")
    ipdb> cache.set("key", "value")
    True
    ipdb> cache.get("key")
    ipdb> cache = Cache("/tmp/diskcache-1")
    ipdb> cache.set("key", "value")
    True
    ipdb> cache.get("key")
    'value'
    

    /tmp/diskcache is the directory used originally in the process that doesn't work and as you can see, it still doesn't work if I initialize it manually in the debugger. However, if I use another path it works fine inside the debugger!

    I'm really confused about why the sql command is not working correctly, and even more about why it's not raising an error. Any ideas of what it could be or what else I could check?

    EDIT: Some further info. If I import sqlite3 module directly and work with it, I can insert items in the cache table under /tmp/diskcache

        119             self.cache.set("key", "value")
        120             conn = sqlite3.connect('/tmp/diskcache/cache.db')
        121             conn.cursor()
        122             conn.execute("insert into Cache values (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)")
        123             conn.commit()
        124             import ipdb; ipdb.set_trace()
    --> 125             
    
    [1]+  Stopped                 make run-dev
    root@c5f8b84e358a:/user# sqlite3 /tmp/diskcache/cache.db
    SQLite version 3.27.2 2019-02-25 16:06:06
    Enter ".help" for usage hints.
    sqlite> select * from Cache;
    1|1|1|1.0|1.0|1.0|1|1|1|1|1|1
    

    Thanks!

    opened by argaen 13
  • Possible example of a readonly cache.

    This is a minimal example of a Read Only cache.

    It could be made more explicit by checking the flag at the API boundary but I wanted to make it as small a change as possible.

    The only issue is with the potential infinite loop which needs to be broken. I have not tried the entire API to ensure it does not hang.

    I am not asking you to commit it as is, just to take it as a suggestion.

    opened by audetto 12
  • Disk I/O Error, unable to write db

    File "/media/psf/AllFiles/Volumes/OSXStorage/lacg/trunk/server/src/assets/scripts/base/Avatar.py", line 211, in onTimer self.update() S_ERR baseapp01 0 6129652375332859700 [2018-04-14 20:48:30 581] - File "/media/psf/AllFiles/Volumes/OSXStorage/lacg/trunk/server/src/assets/scripts/base/Avatar.py", line 197, in update invoke_components(self, '_on_update') S_ERR baseapp01 0 6129652375332859700 [2018-04-14 20:48:30 582] - File "/media/psf/AllFiles/Volumes/OSXStorage/lacg/trunk/server/src/assets/scripts/base/Avatar.py", line 37, in invoke_components getattr(basecls, method)(self, *args) S_ERR baseapp01 0 6129652375332859700 [2018-04-14 20:48:30 582] - File "/media/psf/AllFiles/Volumes/OSXStorage/lacg/trunk/server/src/assets/scripts/umodule/debug_module.py", line 52, in _on_update cache.write_cache(self.user.uid, server_util.pickle(self.user.pack_to_dict())) S_ERR baseapp01 0 6129652375332859700 [2018-04-14 20:48:30 583] - File "/media/psf/AllFiles/Volumes/OSXStorage/lacg/trunk/server/src/assets/scripts/common/cache.py", line 7, in write_cache with Cache(os.getenv('LACG_CACHE_PATH')) as cache: S_ERR baseapp01 0 6129652375332859700 [2018-04-14 20:48:30 583] - File "/media/psf/AllFiles/Volumes/OSXStorage/lacg/trunk/server/src/assets/scripts/libs/diskcache/core.py", line 418, in init sql('CREATE TABLE IF NOT EXISTS Settings (' S_ERR baseapp01 0 6129652375332859700 [2018-04-14 20:48:30 583] - sqlite3.OperationalError: disk I/O error

    opened by sekkit 12
  • Cache.expire() should cull items to respect size_limit

    Grant, I couldn't find a way to contact you regarding a question about DiskCache, except to file an issue. I do not know how to cull items from the cache if the cache size is larger than the size_limit.

    If I fill the cache to something larger than the size_limit, the cache does not report any items as expired, so calling the expire method does nothing. How is the size_limit parameter supposed to work?
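
    If I read the API reference right, size_limit is enforced by culling rather than by expiration, so reclaiming the space explicitly might look like this (directory and limit are illustrative):

    from diskcache import Cache

    cache = Cache('tmp', size_limit=2**28)   # 256 MB limit
    # ... fill the cache past the limit ...
    print(cache.volume())                    # estimated on-disk size in bytes
    cache.cull()                             # evict items per the eviction policy
    print(cache.volume() <= cache.size_limit)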

    opened by mrclary 11
  • Unable to open database file with a process pool

    Hi,

    Thanks for diskcache and especially for all its awesome utilities and recipes!

    Context

    I am using it to cache data from an API, which should also be throttled. So I'm using memoize() and throttle() around my request function. All this runs in a multiprocessing environment, which brought me to diskcache in the first place.

    Am I right to expect diskcache to work with multiprocessing out of the box (relying on DB transactions), or should I use Locks, especially for that multi-process throttling?

    Problem

    Running some tests locally (macOS), all seems fine and I accomplish what I need -- unseen values are queried from the API at the correct rate, while seen values are returned from cache.

    However, running it with more data in "pre-prod" (linux, Ubuntu), I encounter a weird sqlite3.OperationalError: unable to open database file. The code works well for a while, but then stops with this error, with the following stack trace:

    Stack trace
    concurrent.futures.process._RemoteTraceback:
    '''
    Traceback (most recent call last):
      File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/concurrent/futures/process.py", line 367, in _queue_management_worker
      File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/multiprocessing/connection.py", line 251, in recv
      File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/site-packages/diskcache/core.py", line 2370, in __setstate__
      File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/site-packages/diskcache/core.py", line 457, in __init__
      File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/site-packages/diskcache/core.py", line 649, in _sql_retry
      File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/site-packages/diskcache/core.py", line 644, in _sql
      File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/site-packages/diskcache/core.py", line 621, in _con
    sqlite3.OperationalError: unable to open database file
    '''
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "batch_evaluation.py", line 208, in <module>
        run_multiple_ds(paths_df, ds_names, args.output)
      File "batch_evaluation.py", line 185, in run_multiple_ds
        write_outputs_to_disk(results, paths_df, ds_names, output_dir)
      File "batch_evaluation.py", line 121, in write_outputs_to_disk
        for out_path, result in results:
      File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/concurrent/futures/process.py", line 483, in _chain_from_iterable_of_lists
        for element in iterable:
      File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
        yield fs.pop().result()
      File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/concurrent/futures/_base.py", line 428, in result
        return self.__get_result()
      File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
        raise self._exception
    concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
    

    I am not able to catch this error in my code to further inspect it, because it is happening when the cache object is unpickled after being passed from one process to another.

    Some users on StackOverflow suggest this may arise when the sqlite3 db becomes corrupted.

    I believe this may happen due to the use of throttle():

    • it uses cache.transact() https://github.com/grantjenks/python-diskcache/blob/c4ba1f78bb8494bcf6aba9d7d1c3aa49a1093508/diskcache/recipes.py#L254
    • in turn, this uses threading.get_ident() to manage the transaction: https://github.com/grantjenks/python-diskcache/blob/c4ba1f78bb8494bcf6aba9d7d1c3aa49a1093508/diskcache/core.py#L712
    • a small test shows that the get_ident() may return the same identity while the process is different:
    from concurrent.futures import ProcessPoolExecutor
    import threading
    import os
    
    def f(i):
       print(threading.get_ident(), os.getpid(), flush=True)
    
    
    with ProcessPoolExecutor(max_workers=3) as pool:
       pool.map(f, range(10))
    

    Outputs

    139819087464256 6165
    139819087464256 6165
    139819087464256 6166
    139819087464256 6167
    139819087464256 6165
    139819087464256 6165
    139819087464256 6166
    139819087464256 6167
    139819087464256 6165
    139819087464256 6166
    

    So I believe this may result in a database corruption, as process P2 may COMMIT under the identity of process P1.

    What do you think? Could this be the cause, or should I look deeper in my code?

    Many thanks for the awesome library!

    opened by cipri-tom 10
  • Query only support

    1. fix the assertion for setting comparison (dbvalue is a list of tuples)
    2. skip 2 fields that cannot be compared for a ro cache (not sure about "count")
    3. tag_index: make it a noop if the setting is unchanged

    With this the ro cache runs.

    Tests are still missing.

    opened by audetto 10
  • Add docs about FanoutCache shard size limit and "none" eviction policy

    Hello, I have a function that needs to cache a large pandas DataFrame loaded from S3. The file size would be 518MB with flask-caching (and 826MB with joblib).

    I don't want to cache the Flask view, and I also want to reuse this cache file in some daily jobs. So I switched to diskcache.

    But I found diskcache didn't work.

    cache = FanoutCache(CACHE_DIR, shards=4, timeout=20)
    
    @cache.memoize()
    def read_saleinfo():
        reader = get_reader(DATA_SOURCE)
        days = pd.date_range(start=FORECAST_START_DATE, end=date.today(), freq='D').to_series().apply(lambda x: x.strftime('%Y/%m/%d')).ravel()
        p1 = [SOURCE_DIR_PATH + "arch/test/%s/*.parquet" % i for i in days]
        df = reader.read_paths(p1, columns=['product_id', 'store_id', 'count_of_sales', 'price_of_sales', 'price_max', 'year', 'month', 'day'])
    
        return df
    

    test code:

    df1 = read_saleinfo()
    df2 = read_saleinfo()
    

    read_saleinfo actually executes twice.

    Am I missing something?
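
    A hedged guess, based on the shard behaviour referenced in the issue title: FanoutCache splits size_limit across its shards, so with the default 1 GB limit and shards=4 each shard holds roughly 256 MB, and a ~518 MB value is culled as soon as it is stored -- which would explain why memoize never hits. Raising the limit (or disabling eviction) would look like:

    cache = FanoutCache(
        CACHE_DIR,
        shards=4,
        timeout=20,
        size_limit=4 * 2**30,        # raise the total limit well above the DataFrame size
        # eviction_policy='none',    # or turn culling off entirely
    )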

    opened by eromoe 10
  • Problem with memoize and keyword arguments

    I'm using memoize with keyword arguments and am having problems with the ENOVAL part of the key when using for key in cache to iterate over the keys.

    This is the original key (ENOVAL is a Constant):

    (screenshot omitted)

    But when I subsequently iterate over the keys I get this (ENOVAL is now a tuple with a string member):

    (screenshot omitted)

    This is an invalid key (cache.get(key) returns nothing).

    I experimented with changing args_to_key() as shown, and the problem goes away.

        if kwargs:
            # XXX this seems to become a nested tuple? use string for now
            # key += (ENOVAL,)
            key += ('ENOVAL',)
    
    opened by wlupton 9
  • `diskcache.BoundedSemaphore` malfunctions on key eviction

    The entry used by diskcache.BoundedSemaphore can be accidentally evicted from the cache on all non-none eviction policies.

    The end result is that the guarded resource can be accessed by more users than expected, and trigger an exception on a release() possibly minutes after the first illegal acquire().

    Is there a way to mark an entry as non-evictable regardless of the eviction policy?

    Test program:

    import diskcache
    import multiprocessing
    import os
    import shutil
    import tempfile
    import time
    
    nprocesses = 5
    directory = os.path.join(tempfile.gettempdir(), 'sem-evict')
    size_limit = 2**16
    
    def process(i):
        print(f'{i}: process() START')
        with diskcache.Cache(directory=directory, size_limit=size_limit) as cache:
            try:
                with diskcache.BoundedSemaphore(cache=cache, key='mysem', value=nprocesses//2 + 1):
                    print(f'{i}: BoundedSemaphore acquired')
                    time.sleep(10)
                print(f'{i}: BoundedSemaphore released')
            except Exception as e:
                print(f'{i}: EXCEPTION {e}')
        print(f'{i}: process() END')
    
    if __name__ == '__main__':
        try:
            shutil.rmtree(directory) # nuke the cache
        except:
            pass
        with diskcache.Cache(directory=directory, size_limit=size_limit) as cache:
            all_args = [ [i] for i in range(1, nprocesses+1)]
            with multiprocessing.get_context('spawn').Pool(processes=nprocesses) as pool:
                pool.starmap_async(process, all_args)
                pool.close()
                for x in range(10): # fill the cache to trigger eviction
                    time.sleep(1)
                    cache.set(x, 'x'*(size_limit//5), expire=999)
                pool.join()
    
    1: process() START
    1: BoundedSemaphore acquired
    2: process() START
    2: BoundedSemaphore acquired
    3: process() START
    3: BoundedSemaphore acquired
    4: process() START
    5: process() START
    5: BoundedSemaphore acquired
    4: BoundedSemaphore acquired
    1: EXCEPTION cannot release un-acquired semaphore
    1: process() END
    2: EXCEPTION cannot release un-acquired semaphore
    2: process() END
    3: EXCEPTION cannot release un-acquired semaphore
    3: process() END
    4: EXCEPTION cannot release un-acquired semaphore
    4: process() END
    5: EXCEPTION cannot release un-acquired semaphore
    5: process() END
    
    opened by FallenKhadgar 1
  • Implement PEP 562 for Python >= 3.7

    Currently, when importing any object from diskcache, django is imported if it is installed. You can see it in the next trace generated by pyinstrument:

    (pyinstrument screenshot omitted)

    To avoid that, this PR implements PEP 562 in the __init__.py file for Python 3.7 onwards. Afterwards, when importing any object, for example with from diskcache import Cache, django is not imported:

    (pyinstrument screenshot omitted)

    According to my benchmarks, this avoids ~100ms of initialization time on all imports except DjangoCache on Python 3.8. This is especially useful in CLI programs that don't need django and should start as fast as possible.
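
    For context, PEP 562 works by defining a module-level __getattr__ that defers the costly import. A generic sketch, not the exact diskcache __init__.py:

    # diskcache/__init__.py (sketch)
    from .core import Cache                 # cheap imports stay eager

    def __getattr__(name):
        if name == 'DjangoCache':           # only touch Django when actually requested
            from .djangocache import DjangoCache
            return DjangoCache
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}")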

    opened by mondeja 1
  • Cache __init__ is not thread/process safe

    When a Cache object is concurrently instantiated in several threads or processes with the same cache path, sqlite3.DatabaseError may occur. As the library claims to be thread- and process-safe, I think this claim should extend to the init stage as well. One possible solution is to use a filesystem-based lock for the init function, as sketched below.
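
    A rough sketch of that suggestion; the helper name and lock path are illustrative and not part of diskcache:

    import os
    import time

    from diskcache import Cache

    def open_cache_locked(directory, lock_path='/tmp/diskcache-init.lock', timeout=30.0):
        """Serialize Cache construction across threads and processes with a lock file."""
        deadline = time.monotonic() + timeout
        while True:
            try:
                fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                break
            except FileExistsError:
                if time.monotonic() > deadline:
                    raise TimeoutError('could not acquire init lock')
                time.sleep(0.05)
        try:
            return Cache(directory)
        finally:
            os.close(fd)
            os.remove(lock_path)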

    opened by f3flight 4
  • Memoize with defaulted parameters

    More a design question, at least first: When using memoize(), default arguments are not taken into account:

    @CACHE.memoize()
    def f(a, b = 'foo'):
        pass
    

    When calling f(1), only the positional argument 1 is taken into the cache key, not b='foo' (see the code). As long as the default does not change, that is OK, but when it does, the cache key for a call that does not pass b does not change.

    It sounds like a weird edge condition, but I ran into this because I used b = generate_cache_key_from_things_that_are_static_at_runtime() and found out much later that cache invalidation via different results of that function did not work.

    I believe that it would totally be possible to take default values for parameters into account, e.g. via introspection. Do you think this would be an antipattern, or even more confusing, or worth implementing?

    By the way, my solution to this looks like the following and is arguably a cleaner implementation anyway:

    from functools import cache
    
    @cache
    def generate_cache_key_from_things_that_are_static_at_runtime():
        ...
    
    def f(a):
        return f_cached(a, generate_cache_key_from_things_that_are_static_at_runtime())
    
    @CACHE.memoize()
    def f_cached(a, b):
        ...
    
    opened by sbrandtb 1
  • django 4.1 incompatibility

    Hi,

    when using diskcache with the newer django framework 4.1, one of the tests fails:

    ============================= test session starts ==============================
    platform linux -- Python 3.10.5, pytest-7.1.2, pluggy-1.0.0
    rootdir: /build/source, configfile: tox.ini
    plugins: django-4.5.2, xdist-2.5.0, forked-1.4.0
    gw0 I / gw1 I / gw2 I / gw3 I / gw4 I / gw5 I / gw6 I / gw7 I
    gw0 C / gw1 I / gw2 I / gw3>
    ...........F............................................................ [ 30%]
    ........................................................................ [ 60%]
    ........................................................................ [ 91%]
    .....................                                                    [100%]
    =================================== FAILURES ===================================
    ______________ DiskCacheTests.test_cache_write_unpicklable_object ______________
    [gw5] linux -- Python 3.10.5 /nix/store/rc9cz7z4qlgmsbwvpw2acig5g2rdws46-python3-3.10.5/bin/>
    self = <tests.test_djangocache.DiskCacheTests testMethod=test_cache_write_unpicklable_object>
    
        def test_cache_write_unpicklable_object(self):
            fetch_middleware = FetchFromCacheMiddleware(empty_response)
    >       fetch_middleware.cache = cache
    E       AttributeError: can't set attribute 'cache'
    
    tests/test_djangocache.py:873: AttributeError
    =============================== warnings summary ===============================
    ../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
    ../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
    ../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
    ../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
    ../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
    ../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
    ../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
    ../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
      /nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-pack>
        warnings.warn(USE_L10N_DEPRECATED_MSG, RemovedInDjango50Warning)
    
    -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
    =========================== short test summary info ============================
    FAILED tests/test_djangocache.py::DiskCacheTests::test_cache_write_unpicklable_object
    ================== 1 failed, 236 passed, 8 warnings in 18.88s ==================
    

    I didn't find any corresponding breaking changes in Django's changelog.

    Maybe you have an idea?

    opened by gador 1
  • JSONDisk example not working

    There is a Disk tutorial in the documentation at this address: https://grantjenks.com/docs/diskcache/tutorial.html#disk

    It seems obsolete, as running the code errors with:

    ./tests/test_storage.py::test_dump_diskcache_zstd Failed: [undefined]TypeError: JSONDisk.store() got an unexpected keyword argument 'key'
    def test_dump_diskcache_zstd():
    >       time = storage.dump_diskcache_zstd()
    
    tests/test_storage.py:65: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    storage/diskcache_zstd.py:88: in dump_diskcache_zstd
        cache.set(file, data)
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <diskcache.core.Cache object at 0x7f56111b8a60>
    key = 'EUW1_5534893159_matchtimeline.json'
    value = {'info': {'frameInterval': 60000, 'frames': [{'events': [{'realTimestamp': 1635859409575, 'timestamp': 0, 'type': 'PAU...4WUdQDTxpG4uNYadBFtEA2VXf5mw', 'Pud-6KdqMMcPp3fWG2RDUMZe840NHsSjb0iLixC_-8uN5OVrObmUI28ObrAHoWDiM_L2OoV7af14iw', ...]}}
    expire = None, read = False, tag = None, retry = False
    
        def set(self, key, value, expire=None, read=False, tag=None, retry=False):
            """Set `key` and `value` item in cache.
        
            When `read` is `True`, `value` should be a file-like object opened
            for reading in binary mode.
        
            Raises :exc:`Timeout` error when database timeout occurs and `retry` is
            `False` (default).
        
            :param key: key for item
            :param value: value for item
            :param float expire: seconds until item expires
                (default None, no expiry)
            :param bool read: read value as bytes from file (default False)
            :param str tag: text to associate with key (default None)
            :param bool retry: retry if database timeout occurs (default False)
            :return: True if item was set
            :raises Timeout: if database timeout occurs
        
            """
            now = time.time()
            db_key, raw = self._disk.put(key)
            expire_time = None if expire is None else now + expire
    >       size, mode, filename, db_value = self._disk.store(value, read, key=key)
    E       TypeError: JSONDisk.store() got an unexpected keyword argument 'key'
    
    ../../../.cache/pypoetry/virtualenvs/json-cold-storage-comparison-2wgndtiW-py3.10/lib/python3.10/site-packages/diskcache/core.py:772: TypeError
    

    I'm not exactly sure what needs updating as I'm not familiar with the project!

    Edit: looking at the source code of the project it seems the doc is simply missing the key argument: https://github.com/grantjenks/python-diskcache/blob/d55a50ee083784afa9c85e14e41c4a2d132f3111/diskcache/core.py#L335

    opened by mrtolkien 2