Python virtual filesystem for SQLite to read from and write to S3

Overview

Python virtual filesystem for SQLite to read from and write to S3.

No locking is performed, so client code must ensure that writes do not overlap with other writes or reads. If multiple writes happen at the same time, the database will probably become corrupt and data may be lost.

Inspired by phiresky's sql.js-httpvfs, dacort's Stack Overflow answer, and sqlite-s3-query.

How does it work?

SQLite stores its data in fixed-size pages, and always writes exactly a page at a time. sqlite-s3vfs stores the database in fixed-size blocks, each held as a separate object in S3, and translates page reads and writes into block reads and writes. When pages and blocks are the same size, which is the case by default, each page write results in exactly one block write.

Separate objects are required since S3 does not support partial replacement of an object; to change even a single byte, the whole object must be re-uploaded.
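
As a rough illustration of the translation, a minimal sketch of how a read or write at a byte offset maps onto block numbers (the helper below is illustrative only, not the library's actual code, and the key format is an assumption):

# Illustrative only: sketch of mapping a byte-range access to block numbers.
# The real mapping lives inside sqlite_s3vfs; names here are assumptions.
BLOCK_SIZE = 4096  # default block size

def block_range(offset, amount, block_size=BLOCK_SIZE):
    # Yield (block_index, start_within_block, length) tuples covering the access
    while amount > 0:
        block = offset // block_size
        start = offset % block_size
        length = min(amount, block_size - start)
        yield block, start, length
        offset += length
        amount -= length

# A 4096-byte page written at offset 8192 touches exactly one block, stored as
# its own object under a key derived from the key prefix and the block number
print(list(block_range(8192, 4096)))  # [(2, 0, 4096)]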

Installation

sqlite-s3vfs depends on APSW, which is not officially available on PyPI, but can be installed directly from GitHub.

pip install sqlite-s3vfs
pip install https://github.com/rogerbinns/apsw/releases/download/3.36.0-r1/apsw-3.36.0-r1.zip --global-option=fetch --global-option=--version --global-option=3.36.0 --global-option=--sqlite --global-option=build --global-option=--enable-all-extensions

Usage

sqlite-s3vfs is an APSW virtual filesystem that requires boto3 for its communication with S3.

import apsw
import boto3
import sqlite_s3vfs

# A boto3 bucket resource
bucket = boto3.Session().resource('s3').Bucket('my-bucket')

# An S3VFS for that bucket
s3vfs = sqlite_s3vfs.S3VFS(bucket=bucket)

# sqlite-s3vfs stores many objects under this prefix
# Note that it's not typical to start a key prefix with '/'
key_prefix = 'my/path/cool.sqlite'

# Connect, insert data, and query
with apsw.Connection(key_prefix, vfs=s3vfs.name) as db:
    cursor = db.cursor()
    cursor.execute('''
        CREATE TABLE foo(x,y);
        INSERT INTO foo VALUES(1,2);
    ''')
    cursor.execute('SELECT * FROM foo;')
    print(cursor.fetchall())

See the APSW documentation for more examples.

Serializing (getting a regular SQLite file out of the VFS)

The bytes corresponding to a regular SQLite file can be extracted with the serialize_iter function, which returns an iterable,

for chunk in s3vfs.serialize_iter(key_prefix=key_prefix):
    print(chunk)

or with serialize_fileobj, which returns a non-seekable file-like object. This can be passed to Boto3's upload_fileobj method to upload a regular SQLite file to S3.

target_obj = boto3.Session().resource('s3').Bucket('my-target-bucket').Object('target/cool.sqlite')
target_obj.upload_fileobj(s3vfs.serialize_fileobj(key_prefix=key_prefix))
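
The same iterable can also be written to a local file, for example to take a one-off local copy of the database. A minimal sketch, assuming the database was created under key_prefix as above:

# Write the serialized database out as a regular local SQLite file
with open('cool.sqlite', 'wb') as f:
    for chunk in s3vfs.serialize_iter(key_prefix=key_prefix):
        f.write(chunk)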

Deserializing (getting a regular SQLite file into the VFS)

# Any iterable that yields bytes can be used. In this example, bytes come from
# a regular SQLite file already in S3
source_obj = boto3.Session().resource('s3').Bucket('my-source-bucket').Object('source/cool.sqlite')
bytes_iter = source_obj.get()['Body'].iter_chunks()

s3vfs.deserialize_iter(key_prefix='my/path/cool.sqlite', bytes_iter=bytes_iter)
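
The bytes can equally come from a local SQLite file. A minimal sketch, with a hypothetical helper that reads the file in fixed-size chunks:

# Stream a local SQLite file into the VFS in 64 KiB chunks
def local_file_iter(path, chunk_size=65536):
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

s3vfs.deserialize_iter(key_prefix='my/path/cool.sqlite', bytes_iter=local_file_iter('cool.sqlite'))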

Tests

The tests require the dev dependencies and APSW to be installed, and MinIO started

pip install -r requirements-dev.txt
pip install https://github.com/rogerbinns/apsw/releases/download/3.36.0-r1/apsw-3.36.0-r1.zip --global-option=fetch --global-option=--version --global-option=3.36.0 --global-option=--all --global-option=build --global-option=--enable-all-extensions
./start-minio.sh

The tests can then be run with pytest

pytest

and finally MinIO stopped

./stop-minio.sh
Comments
  • Advertise correct permissions in xAccess

    xAccess is meant to advise SQLite whether it can read or write to a file depending on the permissions it asks for. At the moment it is only capable of telling it whether the file already exists.

    xAccess should hook into the AWS ACL stuff and work out whether files are readable or writable as well.

    opened by simonwo 5
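
    A very rough, hypothetical sketch of the direction this could take, assuming S3VFS exposes the standard APSW xAccess hook and that a HEAD request on the bucket is an acceptable proxy for access (class and attribute names below are illustrative, not the library's implementation):

    import apsw
    import sqlite_s3vfs
    from botocore.exceptions import ClientError

    class PermissionAwareS3VFS(sqlite_s3vfs.S3VFS):
        def __init__(self, bucket, **kwargs):
            super().__init__(bucket=bucket, **kwargs)
            self._probe_bucket = bucket  # kept here so the sketch doesn't rely on internals

        def xAccess(self, pathname, flags):
            if flags == apsw.SQLITE_ACCESS_EXISTS:
                return super().xAccess(pathname, flags)
            try:
                # A HEAD on the bucket shows it is at least reachable; a real
                # write-permission check would need IAM/ACL inspection or a
                # trial PutObject, which this sketch does not attempt
                self._probe_bucket.meta.client.head_bucket(Bucket=self._probe_bucket.name)
                return True
            except ClientError:
                return False
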
  • Consider decoupling the file size from the page size

    This may be a documentation issue, or an implementation change may be needed

    From the docs:

    Block size and page size
    
    SQLite writes data in pages, which are 4096 bytes by default. sqlite-s3vfs stores data in blocks, which are also 4096 bytes by default. If you change one you should change the other to match for performance reasons.
    

    4096 is a good number locally: it matches the default memory page size on Linux and Windows, and the block size on ext3/4 filesystems.

    Over S3 or HTTP generally, 4096 seems small, as the overhead of each request will be a large fraction of the total time, and fewer HTTP requests are probably optimal.

    Q: Are block size and page size really coupled, or can block size be a multiple of page size and still be performant?

    If it's fine as a multiple, can the docs be updated to mention this?

    Action:

    Consider recommending a larger block size (may need testing to find an optimal size, but the 65536 from the docs doesn't seem like a bad start).

    It may be worth using uktrade/tamato to run these tests as it is a user of sqlite-s3vfs.

    opened by stuaxo 4
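
    A hedged sketch of aligning the two sizes at 65536 bytes, assuming the S3VFS constructor accepts a block_size argument as the block size note in the docs suggests (untested here):

    import apsw
    import boto3
    import sqlite_s3vfs

    bucket = boto3.Session().resource('s3').Bucket('my-bucket')

    # block_size is assumed to be a constructor argument, per the docs note above
    s3vfs = sqlite_s3vfs.S3VFS(bucket=bucket, block_size=65536)

    with apsw.Connection('my/path/cool.sqlite', vfs=s3vfs.name) as db:
        cursor = db.cursor()
        # The page size must be set before the first table is written, and should
        # match the block size so each page write maps to exactly one S3 object
        cursor.execute('PRAGMA page_size = 65536;')
        cursor.execute('CREATE TABLE foo(x, y);')
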
  • Pypi page doesn't point at this source repo

    It would be good if the PyPI page pointed at this source repository; the homepage should be set there too, possibly to a Read the Docs site for sqlite-s3vfs.

    opened by stuaxo 2
  • Return correct file sizes from xFileSize

    The file size handling could probably be better. I had an idea that you could make the last block actually be the correct size, instead of requiring all blocks to be 64kb, and then work out the file size by summing the sizes of all the blocks from AWS HEAD calls, but I didn't implement that.

    This probably won't cause many problems, but it would be confusing if SQLite were to xTruncate a file to a certain size and then read the size back and find it different from what was expected.

    opened by simonwo 1
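
    A rough sketch of the idea, using an object listing rather than per-object HEAD calls, and assuming each block is stored as a separate object directly under the database's key prefix (illustrative only, not the library's implementation):

    # Hypothetical file-size calculation: sum the sizes of every block object
    # under the key prefix, as reported by the S3 listing
    def file_size(bucket, key_prefix):
        return sum(obj.size for obj in bucket.objects.filter(Prefix=key_prefix + '/'))
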
  • build(deps): bump pytest to remove dependency on py

    Unfortunately exceptiongroup, which pytest now depends on, isn't supported on Python 3.6, and so testing isn't possible, at least not without more time spent. Opting to remove support for Python 3.6. If we can figure out how to test on Python 3.6, we can always bring it back.

    DT-720

    opened by michalc 0
Owner
Department for International Trade