Python disk-backed cache (Django-compatible). Faster than Redis and Memcached. Pure-Python.

Grant Jenks

Last update: Jan 5, 2023

Related tags

Caching python filesystem cache persistence key-value-store

Overview

DiskCache: Disk Backed Cache

DiskCache is an Apache2 licensed disk and file backed cache library, written in pure-Python, and compatible with Django.

The cloud-based computing of 2021 puts a premium on memory. Gigabytes of empty space are left on disks as processes vie for memory. Among these processes is Memcached (and sometimes Redis) which is used as a cache. Wouldn't it be nice to leverage empty disk space for caching?

Django is Python's most popular web framework and ships with several caching backends. Unfortunately the file-based cache in Django is essentially broken. The culling method is random and large caches repeatedly scan a cache directory which slows linearly with growth. Can you really allow it to take sixty milliseconds to store a key in a cache with a thousand items?

In Python, we can do better. And we can do it in pure-Python!

In [1]: import pylibmc
In [2]: client = pylibmc.Client(['127.0.0.1'], binary=True)
In [3]: client[b'key'] = b'value'
In [4]: %timeit client[b'key']

10000 loops, best of 3: 25.4 µs per loop

In [5]: import diskcache as dc
In [6]: cache = dc.Cache('tmp')
In [7]: cache[b'key'] = b'value'
In [8]: %timeit cache[b'key']

100000 loops, best of 3: 11.8 µs per loop

Note: Micro-benchmarks have their place but are not a substitute for real measurements. DiskCache offers cache benchmarks to defend its performance claims. Micro-optimizations are avoided but your mileage may vary.

DiskCache efficiently makes gigabytes of storage space available for caching. By leveraging rock-solid database libraries and memory-mapped files, cache performance can match and exceed industry-standard solutions. There's no need for a C compiler or running another process. Performance is a feature and testing has 100% coverage with unit tests and hours of stress.

Testimonials

Daren Hasenkamp, Founder --

"It's a useful, simple API, just like I love about Redis. It has reduced the amount of queries hitting my Elasticsearch cluster by over 25% for a website that gets over a million users/day (100+ hits/second)."

Mathias Petermann, Senior Linux System Engineer --

"I implemented it into a wrapper for our Ansible lookup modules and we were able to speed up some Ansible runs by almost 3 times. DiskCache is saving us a ton of time."

Does your company or website use DiskCache? Send us a message and let us know.

Features

Pure-Python
Fully Documented
Benchmark comparisons (alternatives, Django cache backends)
100% test coverage
Hours of stress testing
Performance matters
Django compatible API
Thread-safe and process-safe
Supports multiple eviction policies (LRU and LFU included)
Keys support "tag" metadata and eviction
Developed on Python 3.9
Tested on CPython 3.6, 3.7, 3.8, 3.9
Tested on Linux, Mac OS X, and Windows
Tested using GitHub Actions

Quickstart

Installing DiskCache is simple with pip:

$ pip install diskcache

You can access documentation in the interpreter with Python's built-in help function:

>>> import diskcache
>>> help(diskcache)                             # doctest: +SKIP

The core of DiskCache is three data types intended for caching. Cache objects manage a SQLite database and filesystem directory to store key and value pairs. FanoutCache provides a sharding layer to utilize multiple caches and DjangoCache integrates that with Django:

>>> from diskcache import Cache, FanoutCache, DjangoCache
>>> help(Cache)                                 # doctest: +SKIP
>>> help(FanoutCache)                           # doctest: +SKIP
>>> help(DjangoCache)                           # doctest: +SKIP

Built atop the caching data types are Deque and Index, which work as cross-process, persistent replacements for Python's collections.deque and dict. These implement the sequence and mapping container base classes:

>>> from diskcache import Deque, Index
>>> help(Deque)                                 # doctest: +SKIP
>>> help(Index)                                 # doctest: +SKIP

Finally, a number of recipes for cross-process synchronization are provided using an underlying cache. Features like memoization with cache stampede prevention, cross-process locking, and cross-process throttling are available:

>>> from diskcache import memoize_stampede, Lock, throttle
>>> help(memoize_stampede)                      # doctest: +SKIP
>>> help(Lock)                                  # doctest: +SKIP
>>> help(throttle)                              # doctest: +SKIP

Python's docstrings are a quick way to get started but not intended as a replacement for the DiskCache Tutorial and DiskCache API Reference.

User Guide

For those wanting more details, this part of the documentation describes tutorial, benchmarks, API, and development.

Comparisons

Comparisons to popular projects related to DiskCache.

Key-Value Stores

DiskCache is mostly a simple key-value store. Feature comparisons with four other projects are shown in the tables below.

dbm is part of Python's standard library and implements a generic interface to variants of the DBM database — dbm.gnu or dbm.ndbm. If none of these modules is installed, the slow-but-simple dbm.dumb is used.
shelve is part of Python's standard library and implements a “shelf” as a persistent, dictionary-like object. The difference with “dbm” databases is that the values can be anything that the pickle module can handle.
sqlitedict is a lightweight wrapper around Python's sqlite3 database with a simple, Pythonic dict-like interface and support for multi-thread access. Keys are arbitrary strings, values arbitrary pickle-able objects.
pickleDB is a lightweight and simple key-value store. It is built upon Python's simplejson module and was inspired by Redis. It is licensed with the BSD three-clause license.

Features

Feature	diskcache	dbm	shelve	sqlitedict	pickleDB
Atomic?	Always	Maybe	Maybe	Maybe	No
Persistent?	Yes	Yes	Yes	Yes	Yes
Thread-safe?	Yes	No	No	Yes	No
Process-safe?	Yes	No	No	Maybe	No
Backend?	SQLite	DBM	DBM	SQLite	File
Serialization?	Customizable	None	Pickle	Customizable	JSON
Data Types?	Mapping/Deque	Mapping	Mapping	Mapping	Mapping
Ordering?	Insert/Sorted	None	None	None	None
Eviction?	LRU/LFU/more	None	None	None	None
Vacuum?	Automatic	Maybe	Maybe	Manual	Automatic
Transactions?	Yes	No	No	Maybe	No
Multiprocessing?	Yes	No	No	No	No
Forkable?	Yes	No	No	No	No
Metadata?	Yes	No	No	No	No

Quality

Project	diskcache	dbm	shelve	sqlitedict	pickleDB
Tests?	Yes	Yes	Yes	Yes	Yes
Coverage?	Yes	Yes	Yes	Yes	No
Stress?	Yes	No	No	No	No
CI Tests?	Linux/Windows	Yes	Yes	Linux	No
Python?	2/3/PyPy	All	All	2/3	2/3
License?	Apache2	Python	Python	Apache2	3-Clause BSD
Docs?	Extensive	Summary	Summary	Readme	Summary
Benchmarks?	Yes	No	No	No	No
Sources?	GitHub	GitHub	GitHub	GitHub	GitHub
Pure-Python?	Yes	Yes	Yes	Yes	Yes
Server?	No	No	No	No	No
Integrations?	Django	None	None	None	None

Timings

These are rough measurements. See DiskCache Cache Benchmarks for more rigorous data.

Project	diskcache	dbm	shelve	sqlitedict	pickleDB
get	25 µs	36 µs	41 µs	513 µs	92 µs
set	198 µs	900 µs	928 µs	697 µs	1,020 µs
delete	248 µs	740 µs	702 µs	1,717 µs	1,020 µs

Caching Libraries

joblib.Memory provides caching functions and works by explicitly saving the inputs and outputs to files. It is designed to work with non-hashable and potentially large input and output data types such as numpy arrays.
klepto extends Python’s lru_cache to utilize different keymaps and alternate caching algorithms, such as lfu_cache and mru_cache. Klepto uses a simple dictionary-style interface for all caches and archives.

Data Structures

dict is a mapping object that maps hashable keys to arbitrary values. Mappings are mutable objects. There is currently only one standard Python mapping type, the dictionary.
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
Sorted Containers is an Apache2 licensed sorted collections library, written in pure-Python, and fast as C-extensions. Sorted Containers implements sorted list, sorted dictionary, and sorted set data types.

Pure-Python Databases

ZODB supports an isomorphic interface for database operations which means there's little impact on your code to make objects persistent and there's no database mapper that partially hides the database.
CodernityDB is an open source, pure-Python, multi-platform, schema-less, NoSQL database and includes an HTTP server version, and a Python client library that aims to be 100% compatible with the embedded version.
TinyDB is a tiny, document oriented database optimized for your happiness. If you need a simple database with a clean API that just works without lots of configuration, TinyDB might be the right choice for you.

Object Relational Mappings (ORM)

Django ORM provides models that are the single, definitive source of information about data and contains the essential fields and behaviors of the stored data. Generally, each model maps to a single SQL database table.
SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL. It provides a full suite of well known enterprise-level persistence patterns.
Peewee is a simple and small ORM. It has few (but expressive) concepts, making it easy to learn and intuitive to use. Peewee supports Sqlite, MySQL, and PostgreSQL with tons of extensions.
SQLObject is a popular Object Relational Manager for providing an object interface to your database, with tables as classes, rows as instances, and columns as attributes.
Pony ORM is a Python ORM with beautiful query syntax. Use Python syntax for interacting with the database. Pony translates such queries into SQL and executes them in the database in the most efficient way.

SQL Databases

SQLite is part of Python's standard library and provides a lightweight disk-based database that doesn’t require a separate server process and allows accessing the database using a nonstandard variant of the SQL query language.
MySQL is one of the world’s most popular open source databases and has become a leading database choice for web-based applications. MySQL includes a standardized database driver for Python platforms and development.
PostgreSQL is a powerful, open source object-relational database system with over 30 years of active development. Psycopg is the most popular PostgreSQL adapter for the Python programming language.
Oracle DB is a relational database management system (RDBMS) from the Oracle Corporation. Originally developed in 1977, Oracle DB is one of the most trusted and widely used enterprise relational database engines.
Microsoft SQL Server is a relational database management system developed by Microsoft. As a database server, it stores and retrieves data as requested by other software applications.

Other Databases

Memcached is free and open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
Redis is an open source, in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, and more.
MongoDB is a cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with schema. PyMongo is the recommended way to work with MongoDB from Python.
LMDB is a lightning-fast, memory-mapped database. With memory-mapped files, it has the read performance of a pure in-memory database while retaining the persistence of standard disk-based databases.
BerkeleyDB is a software library intended to provide a high-performance embedded database for key/value data. Berkeley DB is a programmatic toolkit that provides built-in database support for desktop and server applications.
LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values. Data is stored sorted by key and users can provide a custom comparison function.

Reference

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments

sqlite3.OperationalError and Timeout During Initialization from Multiple Threads

Current django cache creates a cache instance per thread so each initial cache access for a thread results in the cache initialization running.

This has code to store various settings in the SQL table (from Cache.__init__() in diskcache/core.py):

        # Set cached attributes: updates settings and sets pragmas.

        for key, value in sets.items():
            query = 'INSERT OR REPLACE INTO Settings VALUES (?, ?)'
            sql(query, (key, value))
            self.reset(key, value)

        for key, value in METADATA.items():
            query = 'INSERT OR IGNORE INTO Settings VALUES (?, ?)'
            sql(query, (key, value))
            self.reset(key)

If multiple threads are started at the same time, this first cache access can hit a 'database locked' error during these writes. This is easy to demonstrate:

(arcviz_3.6.2) doug@Dougs-MacBook-Pro:$ python manage.py shell
Python 3.6.2 (default, Sep  6 2017, 18:33:29)
[GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.42)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import threading
>>> from django.core.cache import cache
>>> def f():
...   cache.get('key')
...
>>> for i in range(50):
...   threading.Thread(target=f).start()
...

This results in a bunch of errors from the threads:

>>> Exception in thread Thread-11:
Traceback (most recent call last):
  File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/diskcache/core.py", line 574, in _transact
    sql('BEGIN IMMEDIATE')
sqlite3.OperationalError: database is locked

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "<console>", line 2, in f
  File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/django/core/cache/__init__.py", line 99, in __getattr__
    return getattr(caches[DEFAULT_CACHE_ALIAS], name)
  File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/django/core/cache/__init__.py", line 80, in __getitem__
    cache = _create_cache(alias)
  File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/django/core/cache/__init__.py", line 55, in _create_cache
    return backend_cls(location, params)
  File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/diskcache/djangocache.py", line 28, in __init__
    self._cache = FanoutCache(directory, shards, timeout, **options)
  File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/diskcache/fanout.py", line 38, in __init__
    for num in range(shards)
  File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/diskcache/fanout.py", line 38, in <genexpr>
    for num in range(shards)
  File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/diskcache/core.py", line 435, in __init__
    self.reset(key, value)
  File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/diskcache/core.py", line 1863, in reset
    with self._transact() as (sql, _):
  File "/Users/doug/.pyenv/versions/3.6.2/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/Users/doug/.pyenv/versions/arcviz_3.6.2/lib/python3.6/site-packages/diskcache/core.py", line 578, in _transact
    raise Timeout
diskcache.core.Timeout

This is with a very generic cache configuration:

CACHES = {
  'default': {
    'BACKEND': 'diskcache.DjangoCache',
    'LOCATION': os.path.expanduser("~/.arc/cache"),
    'SHARDS': 4,
    'DATABASE_TIMEOUT': 10.0,
    'OPTIONS': {
        'size_limit': 1 * (2 ** 30)  # 1 gigabytes
    },
  },
}
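The "database is locked" error in the traceback can be reproduced with the standard-library sqlite3 module alone. The following is a minimal sketch and not DiskCache's own code: the function name demo_locked, the temporary database file, and the 0.1 second timeout are assumptions chosen for illustration. Two connections to the same file each attempt BEGIN IMMEDIATE, and the second gives up once its busy timeout expires.

```python
import os
import sqlite3
import tempfile


def demo_locked():
    """Return the error message raised when two writers collide."""
    # Hypothetical throwaway database, unrelated to any real cache directory.
    path = os.path.join(tempfile.mkdtemp(), 'demo.db')
    # isolation_level=None selects autocommit mode so BEGIN is issued explicitly.
    first = sqlite3.connect(path, timeout=0.1, isolation_level=None)
    second = sqlite3.connect(path, timeout=0.1, isolation_level=None)
    first.execute('CREATE TABLE Settings (key TEXT PRIMARY KEY, value)')
    first.execute('BEGIN IMMEDIATE')  # first connection takes the write lock
    try:
        second.execute('BEGIN IMMEDIATE')  # second waits 0.1 s, then fails
    except sqlite3.OperationalError as exc:
        message = str(exc)
    else:
        message = None
        second.execute('ROLLBACK')
    finally:
        first.execute('ROLLBACK')
        first.close()
        second.close()
    return message


print(demo_locked())
```

With a longer timeout the second connection would simply wait for the first to finish, which is why the error tends to appear only when many threads initialize at once.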

We could put the initialization inside a lock, but I'm thinking maybe for diskcache, we don't want each django thread using a private cache object.

opened by dougjc 19
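Both ideas raised at the end of this report, putting initialization inside a lock and not giving each thread a private cache object, can be sketched together. This is only an illustration under stated assumptions: get_shared and the factory argument are hypothetical names, and the sketch assumes the object returned by the factory is safe to share between threads.

```python
import threading

_lock = threading.Lock()
_shared = {}


def get_shared(name, factory):
    """Return one shared object per name, calling factory at most once."""
    try:
        return _shared[name]  # fast path once the object exists
    except KeyError:
        pass
    with _lock:
        # Re-check under the lock: another thread may have won the race.
        if name not in _shared:
            _shared[name] = factory()
        return _shared[name]
```

Every thread that calls get_shared('default', factory) receives the same object, so the initialization writes happen once rather than once per thread.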

DjangoCache Out of Disk Space Scenario

I had to do an emergency delete of the disk cache /var/tmp/django_disk_cache as my server had run out of disk space.

Ever since, I have been receiving Django errors:

Exception Type: ValueError at x
Exception Value: Key ':1xxx' not found

Disabling DjangoCache is the only current fix. I've checked the Django db and there's no corresponding table for DjangoCache. There's obviously some reference to these keys somewhere but I can't find them. The docs mention a sqlite db, but I've searched my installation and can't find it.

The help for DjangoCache also mentions a clear command, which I assume I'm meant to run in a python shell, but I can't figure out how to run it and there are no examples.

opened by silentjay 17

Many Disk Cache Instances Cause Inode Overflow After Consistent Use

Hi Grant,

I've been using disk cache on my AWS systems for probably a year, and I love it quite a lot! I have many types of data I'm caching, thus many diskcaches (~100), and I'm using a high-performance, low-capacity NVMe disk to keep things fast.

Because of the small size of the disk, the number of inodes is 3.2M. Today, with all the folders created by diskcache that aren't cleaned up, I got an Error 28 - no space left on device.

Is there a recommended way to clean up a disk cache filesystem? It creates a ton of references that are never cleaned or managed. Could we ensure that all previous references are used before assigning new ones?

Deleting its contents takes quite a while and I can't spare that downtime. Is there a cleanup callback or behavior we could add for when a file is removed from a folder and that folder no longer has references? In general I would like better control and knowledge of what is being purged or cleaned up.
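One way to reclaim inodes without deleting cached values is to remove only the directories that are already empty. The sketch below is not part of DiskCache: prune_empty_dirs is a hypothetical helper, and it assumes that removing an empty subdirectory is harmless for the cache in use, which should be verified before running it against a live cache.

```python
import os


def prune_empty_dirs(root):
    """Remove empty subdirectories beneath root and return how many were removed."""
    removed = 0
    # Walk bottom-up so that children are handled before their parents.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue  # never remove the cache directory itself
        try:
            os.rmdir(dirpath)  # succeeds only when the directory is empty
        except OSError:
            continue  # still holds files or directories, leave it alone
        removed += 1
    return removed
```

Because os.rmdir refuses to delete a non-empty directory, stored value files are never touched.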

opened by SoundsSerious 16

Help with debugging an issue in diskcache sqlite3 connection

Hello, I've spent I don't know how many hours already debugging an issue in one of our stacks where I'm trying to use this package. In one of the classes I'm doing the following:

self.cache = Cache(directory="/tmp/diskcache")
self.cache.set(key, value)
self.cache.get(key)

self.cache.get always returns None.

After some debugging I can say that the information is stored in the files and I can retrieve it using the _disk private attribute passing the correct parameters. What apparently is not working is the part of storing the data in sqlite3, more specifically the _insert_row method. After it is being executed and the outer transact finished, the Cache table is still empty. However, if I grab the return of the sql command from the _insert_rows and execute a commit the data appears!

Also what I've tried is the following while in the middle of a breakpoint:

ipdb> from diskcache import Cache
ipdb> cache = Cache("/tmp/diskcache")
ipdb> cache.set("key", "value")
True
ipdb> cache.get("key")
ipdb> cache = Cache("/tmp/diskcache-1")
ipdb> cache.set("key", "value")
True
ipdb> cache.get("key")
'value'

/tmp/diskcache is the directory used originally in the process that doesn't work and as you can see, it still doesn't work if I initialize it manually in the debugger. However, if I use another path it works fine inside the debugger!

I'm really confused about why the sql command is not working correctly, and even more about why it's not raising an error. Any ideas of what it could be or what else I could check?

EDIT: Some further info. If I import sqlite3 module directly and work with it, I can insert items in the cache table under /tmp/diskcache

self.cache.set("key", "value")
conn = sqlite3.connect('/tmp/diskcache/cache.db')
conn.cursor()
conn.execute("insert into Cache values (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)")
conn.commit()
import ipdb; ipdb.set_trace()

[1]+ Stopped make run-dev
root@c5f8b84e358a:/user# sqlite3 /tmp/diskcache/cache.db
SQLite version 3.27.2 2019-02-25 16:06:06
Enter ".help" for usage hints.
sqlite> select * from Cache;
1|1|1|1.0|1.0|1.0|1|1|1|1|1|1

Thanks!

opened by argaen 13

Possible example of a readonly cache.

This is a minimal example of a Read Only cache.

It could be made more explicit by checking the flag at the API boundary but I wanted to make it as small a change as possible.

The only issue is with the potential infinite loop which needs to be broken. I have not tried the entire API to ensure it does not hang.

I am not asking to commit as is, just to take it as a suggestion.

opened by audetto 12

Disk I/O Error, unable to write db

File "/media/psf/AllFiles/Volumes/OSXStorage/lacg/trunk/server/src/assets/scripts/base/Avatar.py", line 211, in onTimer self.update()
S_ERR baseapp01 0 6129652375332859700 [2018-04-14 20:48:30 581] - File "/media/psf/AllFiles/Volumes/OSXStorage/lacg/trunk/server/src/assets/scripts/base/Avatar.py", line 197, in update invoke_components(self, '_on_update')
S_ERR baseapp01 0 6129652375332859700 [2018-04-14 20:48:30 582] - File "/media/psf/AllFiles/Volumes/OSXStorage/lacg/trunk/server/src/assets/scripts/base/Avatar.py", line 37, in invoke_components getattr(basecls, method)(self, *args)
S_ERR baseapp01 0 6129652375332859700 [2018-04-14 20:48:30 582] - File "/media/psf/AllFiles/Volumes/OSXStorage/lacg/trunk/server/src/assets/scripts/umodule/debug_module.py", line 52, in _on_update cache.write_cache(self.user.uid, server_util.pickle(self.user.pack_to_dict()))
S_ERR baseapp01 0 6129652375332859700 [2018-04-14 20:48:30 583] - File "/media/psf/AllFiles/Volumes/OSXStorage/lacg/trunk/server/src/assets/scripts/common/cache.py", line 7, in write_cache with Cache(os.getenv('LACG_CACHE_PATH')) as cache:
S_ERR baseapp01 0 6129652375332859700 [2018-04-14 20:48:30 583] - File "/media/psf/AllFiles/Volumes/OSXStorage/lacg/trunk/server/src/assets/scripts/libs/diskcache/core.py", line 418, in __init__ sql('CREATE TABLE IF NOT EXISTS Settings ('
S_ERR baseapp01 0 6129652375332859700 [2018-04-14 20:48:30 583] - sqlite3.OperationalError: disk I/O error

opened by sekkit 12

Cache.expire() should cull items to respect size_limit

Grant, I couldn't find a way to contact you regarding a question about DiskCache, except to file an issue. I do not know how to cull items from the cache if the cache size is larger than the size_limit.

If I fill the cache to something larger than the size_limit, the cache does not report any items as expired, so calling the expire method does nothing. How is the size_limit parameter supposed to work?

opened by mrclary 11

Unable to open database file with a process pool

Hi,

Thanks for diskcache and especially for all its awesome utilities and recipes !

Context

I am using it to cache data from an API, which should also be throttled. So I'm using memoize() and throttle() around my request function. All this runs in a multiprocessing environment, which brought me to diskcache in the first place.

Am I right to expect diskcache to work with multiprocessing out of the box (relying on DB transactions), or should I use Locks, especially for that multi-process throttling ?

Problem

Running some tests locally (macOS), all seems fine, I accomplish what I need -- unseen values are queried to the API, with the correct rate, while seen values are returned from cache.

However, running it with more data in "pre-prod" (linux, Ubuntu), I encounter a weird sqlite3.OperationalError: unable to open database file. The code works well for a while, but then stops with this error, with the following stack trace:

Stack trace

concurrent.futures.process._RemoteTraceback:
'''
Traceback (most recent call last):
  File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/concurrent/futures/process.py", line 367, in _queue_management_worker
  File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/multiprocessing/connection.py", line 251, in recv
  File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/site-packages/diskcache/core.py", line 2370, in __setstate__
  File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/site-packages/diskcache/core.py", line 457, in __init__
  File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/site-packages/diskcache/core.py", line 649, in _sql_retry
  File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/site-packages/diskcache/core.py", line 644, in _sql
  File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/site-packages/diskcache/core.py", line 621, in _con
sqlite3.OperationalError: unable to open database file
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "batch_evaluation.py", line 208, in <module>
    run_multiple_ds(paths_df, ds_names, args.output)
  File "batch_evaluation.py", line 185, in run_multiple_ds
    write_outputs_to_disk(results, paths_df, ds_names, output_dir)
  File "batch_evaluation.py", line 121, in write_outputs_to_disk
    for out_path, result in results:
  File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/concurrent/futures/process.py", line 483, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
    yield fs.pop().result()
  File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/home/ciprian/miniconda3/envs/my-env/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

I am not able to catch this error in my code to further inspect it, because it is happening when the cache object is unpickled after being passed from one process to another.

Some users on StackOverflow suggest this may arise when the sqlite3 db becomes corrupted.

I believe this may happen due to the use of throttle():

it uses cache.transact() https://github.com/grantjenks/python-diskcache/blob/c4ba1f78bb8494bcf6aba9d7d1c3aa49a1093508/diskcache/recipes.py#L254
in turn, this uses threading.get_ident() to manage the transaction: https://github.com/grantjenks/python-diskcache/blob/c4ba1f78bb8494bcf6aba9d7d1c3aa49a1093508/diskcache/core.py#L712
a small test shows that the get_ident() may return the same identity while the process is different:

from concurrent.futures import ProcessPoolExecutor
import threading
import os

def f(i):
   print(threading.get_ident(), os.getpid(), flush=True)


with ProcessPoolExecutor(max_workers=3) as pool:
   pool.map(f, range(10))

Outputs

139819087464256 6165
139819087464256 6165
139819087464256 6166
139819087464256 6167
139819087464256 6165
139819087464256 6165
139819087464256 6166
139819087464256 6167
139819087464256 6165
139819087464256 6166

So I believe this may result in a database corruption, as process P2 may COMMIT under the identity of process P1.

What do you think ? Could this be the cause, or should I look deeper in my code ?

Many thanks for the awesome library!

opened by cipri-tom 10
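The observation that threading.get_ident() can repeat across processes suggests one defensive pattern: pair the thread identity with the process id whenever an owner token must be unique across processes. The sketch below is illustrative only. transaction_owner is a hypothetical helper, not DiskCache's code, and whether identity reuse is actually the cause of the error above is not established here.

```python
import os
import threading


def transaction_owner():
    """Return a token that distinguishes threads across different processes."""
    # get_ident() alone may repeat in different processes, as the output above shows.
    return (os.getpid(), threading.get_ident())


print(transaction_owner())
```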

Query only support

fix the assertion for setting comparison (dbvalue is a list of tuples)

skip 2 fields that cannot be compared for a ro cache (not sure about "count")

tag_index: make it a noop if the setting is unchanged

With this the ro cache runs.

Tests are still missing.

opened by audetto 10

Add docs about FanoutCache shard size limit and "none" eviction policy

Hello, I have a function that needs to cache a large pandas DataFrame loaded from S3. The cache file is 518MB with flask-caching (and 826MB with joblib).

I don't want to cache the Flask view, and I also want to reuse this cache file in some daily jobs, so I switched to diskcache.

But I found that diskcache didn't work.

cache = FanoutCache(CACHE_DIR, shards=4, timeout=20)

@cache.memoize()
def read_saleinfo():
    reader = get_reader(DATA_SOURCE)
    days = pd.date_range(start=FORECAST_START_DATE, end=date.today(), freq='D').to_series().apply(lambda x: x.strftime('%Y/%m/%d')).ravel()
    p1 = [SOURCE_DIR_PATH + "arch/test/%s/*.parquet" % i for i in days]
    df = reader.read_paths(p1, columns=['product_id', 'store_id', 'count_of_sales', 'price_of_sales', 'price_max', 'year', 'month', 'day'])

    return df

test code:

df1 = read_saleinfo()
df2 = read_saleinfo()

read_saleinfo actually executes twice.

Am I missing something?

opened by eromoe 10
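The issue title points at a per-shard size limit, and a little arithmetic shows why a single large value might not stay cached. The sketch below rests on two assumptions that should be checked against the DiskCache documentation: that a FanoutCache divides its total size limit evenly among its shards, and that the default total is 2 ** 30 bytes. shard_size_limit is a hypothetical helper, and megabytes are taken as 2 ** 20 bytes for illustration.

```python
def shard_size_limit(size_limit, shards):
    """Per-shard byte budget if the total limit is split evenly (assumption)."""
    return size_limit // shards


ASSUMED_TOTAL_LIMIT = 2 ** 30   # assumed default total size limit in bytes
per_shard = shard_size_limit(ASSUMED_TOTAL_LIMIT, 4)   # shards=4 as in the report
value_size = 518 * 2 ** 20      # the 518MB DataFrame from the report
fits = value_size <= per_shard

print(per_shard, value_size, fits)
```

Under these assumptions each shard may hold 256 MiB, so a 518MB value exceeds the budget of the shard it lands in. Raising size_limit or choosing an eviction policy that never culls are the options hinted at by the title.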

Problem with memoize and keyword arguments

I'm using memoize with keyword arguments and am having problems with the ENOVAL part of the key when using for key in cache to iterate over the keys.

This is the original key (ENOVAL is a Constant):

But when I subsequently iterate over the keys I get this (ENOVAL is now a tuple with a string member):

This is an invalid key (cache.get(key) returns nothing).

I experimented with changing args_to_key() as shown, and the problem goes away.

if kwargs:
    # XXX this seems to become a nested tuple? use string for now
    # key += (ENOVAL,)
    key += ('ENOVAL',)

opened by wlupton 9

`diskcache.BoundedSemaphore` malfunctions on key eviction

The entry used by diskcache.BoundedSemaphore can be accidentally evicted from the cache on all non-none eviction policies.

The end result is that the guarded resource can be accessed by more users than expected, and trigger an exception on a release() possibly minutes after the first illegal acquire().

Is there a way to mark an entry as non-evictable regardless of the eviction policy?

Test program:

import diskcache
import multiprocessing
import os
import shutil
import tempfile
import time

nprocesses = 5
directory = os.path.join(tempfile.gettempdir(), 'sem-evict')
size_limit = 2**16

def process(i):
    print(f'{i}: process() START')
    with diskcache.Cache(directory=directory, size_limit=size_limit) as cache:
        try:
            with diskcache.BoundedSemaphore(cache=cache, key='mysem', value=nprocesses//2 + 1):
                print(f'{i}: BoundedSemaphore acquired')
                time.sleep(10)
            print(f'{i}: BoundedSemaphore released')
        except Exception as e:
            print(f'{i}: EXCEPTION {e}')
    print(f'{i}: process() END')

if __name__ == '__main__':
    try:
        shutil.rmtree(directory) # nuke the cache
    except:
        pass
    with diskcache.Cache(directory=directory, size_limit=size_limit) as cache:
        all_args = [ [i] for i in range(1, nprocesses+1)]
        with multiprocessing.get_context('spawn').Pool(processes=nprocesses) as pool:
            pool.starmap_async(process, all_args)
            pool.close()
            for x in range(10): # fill the cache to trigger eviction
                time.sleep(1)
                cache.set(x, 'x'*(size_limit//5), expire=999)
            pool.join()

1: process() START
1: BoundedSemaphore acquired
2: process() START
2: BoundedSemaphore acquired
3: process() START
3: BoundedSemaphore acquired
4: process() START
5: process() START
5: BoundedSemaphore acquired
4: BoundedSemaphore acquired
1: EXCEPTION cannot release un-acquired semaphore
1: process() END
2: EXCEPTION cannot release un-acquired semaphore
2: process() END
3: EXCEPTION cannot release un-acquired semaphore
3: process() END
4: EXCEPTION cannot release un-acquired semaphore
4: process() END
5: EXCEPTION cannot release un-acquired semaphore
5: process() END

opened by FallenKhadgar 1

Implement PEP 562 for Python >= 3.7

Currently, importing any object from diskcache also imports django if it is installed. This is visible in a pyinstrument profile of the import.

To avoid that, this PR implements PEP 562 in the __init__.py file for Python 3.7 onwards. With it, importing an object such as from diskcache import Cache no longer imports django.

According to my benchmarks, this avoids ~100ms of initialization time on all imports except DjangoCache on Python 3.8. This is especially useful in CLI programs that don't need django and should start as fast as possible.
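For readers unfamiliar with PEP 562, here is a minimal sketch of the lazy-import mechanism it enables. It is not the code from this PR: the module name lazy_pkg, the _LAZY_ATTRS mapping, and the use of json as a stand-in for a heavy optional dependency are all assumptions made for illustration.

```python
import importlib
import sys
import types

# Hypothetical mapping: public attribute name -> module that defines it.
# "json" stands in for an optional heavy dependency such as django.
_LAZY_ATTRS = {"JSONDecoder": "json"}

# Stand-in for a package's __init__.py, built in memory for the demo.
lazy_pkg = types.ModuleType("lazy_pkg")


def _module_getattr(name):
    """PEP 562 hook: called only when normal attribute lookup fails."""
    if name not in _LAZY_ATTRS:
        raise AttributeError(f"module 'lazy_pkg' has no attribute {name!r}")
    module = importlib.import_module(_LAZY_ATTRS[name])
    value = getattr(module, name)
    setattr(lazy_pkg, name, value)  # cache so the hook is not hit again
    return value


# In a real package this would be `def __getattr__(name)` at module level.
lazy_pkg.__getattr__ = _module_getattr
sys.modules["lazy_pkg"] = lazy_pkg

# The dependency is imported here, on first access, not when lazy_pkg loads.
from lazy_pkg import JSONDecoder
```

The hook must raise AttributeError for unknown names, because the import machinery and hasattr() rely on that to detect missing attributes.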

opened by mondeja 1

Cache __init__ is not thread/process safe

When a Cache object is instantiated concurrently in several threads or processes with the same cache path, sqlite3.DatabaseError may occur. As the library claims to be thread and process safe, I think this claim should extend to the init stage as well. One possible solution is to use a filesystem-based lock in the init function.
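A filesystem-based lock of the kind suggested above could look roughly like the following. This is a sketch under stated assumptions, not the library's behavior: the helper name init_lock, the lock file name init.lock, and the timeout values are hypothetical, and a crashed process would leave a stale lock file that needs separate handling.

```python
import contextlib
import os
import time


@contextlib.contextmanager
def init_lock(directory, timeout=10.0, poll=0.01):
    """Serialize initialization across threads and processes (hypothetical helper).

    Relies on O_CREAT | O_EXCL: creating the lock file fails if it already exists.
    """
    os.makedirs(directory, exist_ok=True)
    lock_path = os.path.join(directory, "init.lock")
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


# Intended use (requires diskcache, so shown as a comment only):
#     with init_lock(directory):
#         cache = diskcache.Cache(directory)
```

Callers would wrap only the constructor in the lock; normal cache operations afterwards would stay unguarded.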

opened by f3flight 4

Memoize with defaulted parameters

This is more of a design question, at least at first. When using memoize(), default arguments are not taken into account:

@CACHE.memoize()
def f(a, b='foo'):
    pass

When calling f(1), only 1 (from args) goes into the cache key, not b='foo'. As long as the default does not change, that is OK, but when it does, the cache key for a call that does not pass b stays the same.

It sounds like a weird edge condition, but I ran into this because I used b = generate_cache_key_from_things_that_are_static_at_runtime() and found out much later that cache invalidation via different results of that function did not work.

I believe it would be entirely possible to take parameter default values into account, e.g. via introspection. Do you think this would be an antipattern, or even more confusing, or worth implementing?
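To make the introspection idea concrete, here is a minimal sketch that builds the key from the fully bound arguments, defaults included. It is not how memoize() works today: memoize_with_defaults and the plain dict used as the store are hypothetical, and the sketch assumes hashable arguments and functions without **kwargs.

```python
import functools
import inspect


def memoize_with_defaults(store):
    """Hypothetical decorator: cache key includes default parameter values."""
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()  # fill in parameters the caller omitted
            key = (func.__qualname__, tuple(bound.arguments.items()))
            if key not in store:
                store[key] = func(*args, **kwargs)
            return store[key]

        return wrapper
    return decorator


store = {}  # stand-in for a persistent cache


@memoize_with_defaults(store)
def f(a, b='foo'):
    return (a, b)


f(1)           # key contains b='foo'
f(1, 'foo')    # same key as above
f(1, b='bar')  # different key
```

Because the arguments are normalized through the signature, f(1), f(1, 'foo') and f(1, b='foo') share one entry, and a changed default produces a new key.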

By the way, my solution to this looks like the following and is arguably a cleaner implementation anyway:

from functools import cache

@cache
def generate_cache_key_from_things_that_are_static_at_runtime():
    ...

def f(a):
    return f_cached(a, generate_cache_key_from_things_that_are_static_at_runtime())

@CACHE.memoize()
def f_cached(a, b):
    ...

opened by sbrandtb 1

django 4.1 incompatibility

Hi,

When using diskcache with the newer Django 4.1, one of the tests fails:

============================= test session starts ==============================
platform linux -- Python 3.10.5, pytest-7.1.2, pluggy-1.0.0
rootdir: /build/source, configfile: tox.ini
plugins: django-4.5.2, xdist-2.5.0, forked-1.4.0
^Mgw0 I / gw1 I / gw2 I / gw3 I / gw4 I / gw5 I / gw6 I / gw7 I^Mgw0 C / gw1 I / gw2 I / gw3>
...........F............................................................ [ 30%]
........................................................................ [ 60%]
........................................................................ [ 91%]
.....................                                                    [100%]
=================================== FAILURES ===================================
______________ DiskCacheTests.test_cache_write_unpicklable_object ______________
[gw5] linux -- Python 3.10.5 /nix/store/rc9cz7z4qlgmsbwvpw2acig5g2rdws46-python3-3.10.5/bin/>
self = <tests.test_djangocache.DiskCacheTests testMethod=test_cache_write_unpicklable_object>

    def test_cache_write_unpicklable_object(self):
        fetch_middleware = FetchFromCacheMiddleware(empty_response)
>       fetch_middleware.cache = cache
E       AttributeError: can't set attribute 'cache'

tests/test_djangocache.py:873: AttributeError
=============================== warnings summary ===============================
../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
../../nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-p>
  /nix/store/5iacqddfwif3ww9gxf82ccl2yhj2jxn1-python3.10-Django-4.1/lib/python3.10/site-pack>
    warnings.warn(USE_L10N_DEPRECATED_MSG, RemovedInDjango50Warning)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/test_djangocache.py::DiskCacheTests::test_cache_write_unpicklable_object
================== 1 failed, 236 passed, 8 warnings in 18.88s ==================

I didn't find any corresponding breaking change in Django's changelog.

Maybe you have an idea?
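One plausible reading of the traceback, not confirmed by anything in this report, is that the attribute is exposed as a read-only property, so plain assignment fails. The sketch below reproduces that kind of error with a hypothetical Middleware class and shows how a test could override such a property with unittest.mock instead of assigning to it.

```python
from unittest import mock


class Middleware:
    """Hypothetical class with a read-only property, for illustration only."""

    @property
    def cache(self):
        return "default-cache"


mw = Middleware()

# Assigning to a property without a setter raises AttributeError.
try:
    mw.cache = "other-cache"
    assignment_failed = False
except AttributeError:
    assignment_failed = True

# A test can replace the property on the class for the duration of a block.
with mock.patch.object(
    Middleware, "cache", new_callable=mock.PropertyMock, return_value="test-cache"
):
    patched_value = mw.cache

restored_value = mw.cache  # the original property is back after the block
```

Whether this matches the actual change in Django 4.1 would need to be checked against the middleware source.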

opened by gador 1

JSONDisk example not working

There is a Disk tutorial in the documentation at this address: https://grantjenks.com/docs/diskcache/tutorial.html#disk

It seems outdated: running the code fails with:

./tests/test_storage.py::test_dump_diskcache_zstd Failed: [undefined]TypeError: JSONDisk.store() got an unexpected keyword argument 'key'
def test_dump_diskcache_zstd():
>       time = storage.dump_diskcache_zstd()

tests/test_storage.py:65: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
storage/diskcache_zstd.py:88: in dump_diskcache_zstd
    cache.set(file, data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <diskcache.core.Cache object at 0x7f56111b8a60>
key = 'EUW1_5534893159_matchtimeline.json'
value = {'info': {'frameInterval': 60000, 'frames': [{'events': [{'realTimestamp': 1635859409575, 'timestamp': 0, 'type': 'PAU...4WUdQDTxpG4uNYadBFtEA2VXf5mw', 'Pud-6KdqMMcPp3fWG2RDUMZe840NHsSjb0iLixC_-8uN5OVrObmUI28ObrAHoWDiM_L2OoV7af14iw', ...]}}
expire = None, read = False, tag = None, retry = False

    def set(self, key, value, expire=None, read=False, tag=None, retry=False):
        """Set `key` and `value` item in cache.
    
        When `read` is `True`, `value` should be a file-like object opened
        for reading in binary mode.
    
        Raises :exc:`Timeout` error when database timeout occurs and `retry` is
        `False` (default).
    
        :param key: key for item
        :param value: value for item
        :param float expire: seconds until item expires
            (default None, no expiry)
        :param bool read: read value as bytes from file (default False)
        :param str tag: text to associate with key (default None)
        :param bool retry: retry if database timeout occurs (default False)
        :return: True if item was set
        :raises Timeout: if database timeout occurs
    
        """
        now = time.time()
        db_key, raw = self._disk.put(key)
        expire_time = None if expire is None else now + expire
>       size, mode, filename, db_value = self._disk.store(value, read, key=key)
E       TypeError: JSONDisk.store() got an unexpected keyword argument 'key'

../../../.cache/pypoetry/virtualenvs/json-cold-storage-comparison-2wgndtiW-py3.10/lib/python3.10/site-packages/diskcache/core.py:772: TypeError

I'm not exactly sure what needs updating as I'm not familiar with the project!

Edit: looking at the source code of the project it seems the doc is simply missing the key argument: https://github.com/grantjenks/python-diskcache/blob/d55a50ee083784afa9c85e14e41c4a2d132f3111/diskcache/core.py#L335
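A sketch of what the fix amounts to: the overriding store() has to accept the key keyword and pass it on. To keep the example runnable without the library, StubDisk below is a hypothetical stand-in for diskcache.Disk, and the default key=None is an assumption; the real default is whatever the linked core.py defines.

```python
import json
import zlib


class StubDisk:
    """Hypothetical stand-in for diskcache.Disk, reduced to two methods."""

    def store(self, value, read, key=None):
        # Mirrors the call in the traceback: store(value, read, key=key).
        return len(value), "binary", None, value

    def fetch(self, mode, filename, value, read):
        return value


class JSONDisk(StubDisk):
    """Serialize values as zlib-compressed JSON."""

    def __init__(self, compress_level=1):
        self.compress_level = compress_level

    def store(self, value, read, key=None):  # `key` was missing in the tutorial
        if not read:
            json_bytes = json.dumps(value).encode("utf-8")
            value = zlib.compress(json_bytes, self.compress_level)
        return super().store(value, read, key=key)

    def fetch(self, mode, filename, value, read):
        data = super().fetch(mode, filename, value, read)
        if not read:
            data = json.loads(zlib.decompress(data).decode("utf-8"))
        return data


disk = JSONDisk()
size, mode, filename, db_value = disk.store(
    {"frames": [1, 2, 3]}, False, key="match.json"
)
round_trip = disk.fetch(mode, filename, db_value, False)
```

With the real library, the subclass would derive from diskcache.Disk and be passed to the cache as in the tutorial; only the store() signature changes.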

opened by mrtolkien 2