Python function to query SQLite files stored on S3

Overview

Python function to query a SQLite file stored on S3. It uses multiple HTTP range requests per query to avoid downloading the entire file, and so is suitable for large databases.

All the HTTP requests for a query target the same version of the database object in S3, so queries should complete successfully even if the database is replaced concurrently by another S3 client. For this to work, versioning must be enabled on the S3 bucket.
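For reference, versioning can be enabled with the AWS CLI (assumption: `my-bucket` is a placeholder for your bucket name, and your credentials permit `s3:PutBucketVersioning`):

```shell
# Enable versioning on the bucket holding the database
# ("my-bucket" is a placeholder bucket name)
aws s3api put-bucket-versioning \
  --bucket my-bucket \
  --versioning-configuration Status=Enabled
```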

Operations that write to the database are not supported.
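As a rough local analogue using only the standard library (no S3 involved), SQLite itself rejects writes when a file is opened read-only, which is the mode sqlite-s3-query opens databases with:

```python
import os
import sqlite3
import tempfile

# Create a throwaway database file to stand in for the S3-hosted one
path = os.path.join(tempfile.mkdtemp(), 'demo.sqlite')
with sqlite3.connect(path) as conn:
    conn.execute('CREATE TABLE my_table (my_column TEXT)')

# Open it read-only via a URI, analogous to SQLITE_OPEN_READONLY
readonly = sqlite3.connect(f'file:{path}?mode=ro', uri=True)
try:
    readonly.execute("INSERT INTO my_table VALUES ('x')")
except sqlite3.OperationalError as e:
    print(e)  # attempt to write a readonly database
```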

Inspired by phiresky's sql.js-httpvfs, and dacort's Stack Overflow answer.

Installation

sqlite-s3-query depends on APSW, which is not available on PyPI, but can be installed directly from GitHub.

pip install sqlite_s3_query
pip install https://github.com/rogerbinns/apsw/releases/download/3.36.0-r1/apsw-3.36.0-r1.zip --global-option=fetch --global-option=--version --global-option=3.36.0 --global-option=--all --global-option=build --global-option=--enable-all-extensions

Usage

from sqlite_s3_query import sqlite_s3_query

results_iter = sqlite_s3_query(
    'SELECT * FROM my_table WHERE my_column = ?', params=('my-value',),
    url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite',
)

for row in results_iter:
    print(row)

If your project runs multiple queries against the same file, functools.partial can be used to make an interface with less duplication.

from functools import partial
from sqlite_s3_query import sqlite_s3_query

query_my_db = partial(sqlite_s3_query,
    url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite',
)

for row in query_my_db('SELECT * FROM my_table WHERE my_col = ?', params=('my-value',)):
    print(row)

for row in query_my_db('SELECT * FROM my_table_2 WHERE my_col = ?', params=('my-value',)):
    print(row)

The AWS region and credentials are taken from environment variables, but this can be changed using the get_credentials parameter. The default implementation, which can be overridden, is shown below.

import os
from functools import partial
from sqlite_s3_query import sqlite_s3_query

query_my_db = partial(sqlite_s3_query,
    url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite',
    get_credentials=lambda: (
        os.environ['AWS_DEFAULT_REGION'],
        os.environ['AWS_ACCESS_KEY_ID'],
        os.environ['AWS_SECRET_ACCESS_KEY'],
        os.environ.get('AWS_SESSION_TOKEN'),  # Only needed for temporary credentials
    ),
)

for row in query_my_db('SELECT * FROM my_table_2 WHERE my_col = ?', params=('my-value',)):
    print(row)
Comments
  • Do we know what the minimum version of sqlite3 is required?

    I see you're using circle ci with an image that has sqlite v 3.27.2

    I got independent evidence that the library works with 3.28.0

    However, when I attempt to deploy to AWS linux (sqlite 3.7.17), as far as I can tell, as soon as any calls are made to the library (which does load via CDLL), the thread exits.

    Is there some minimum version or is my app more likely crashing for some other reason?

    opened by matthewdeanmartin 9
  • Is it possible to use this with IAM roles instead of plaintext keys?

    This fails: get_credentials=lambda: ("us-east-1",None,None,None)

    It looks like at multiple points in aws_sigv4_headers() the plaintext key is concatenated to header strings.

    Was the intention to make sure that only plaintext keys were used? Or is there a different way to set get_credentials to use IAM roles?

    opened by matthewdeanmartin 6
  • Do not copy undefined data

    do not trust the s3 service to return the exact right amount of data. we do not even require that the url have a secure scheme, so we definitely should not be implicitly trusting that the correct amount of data came back from the remote service before copying the presumed amount of memory.

    opened by mumbleskates 5
  • Queries with params don't seem to work

    Example query syntax:

    'SELECT "Beach name" FROM beaches WHERE Region = ?, params=("SydneyOceanBeaches", )'

    [ISSUE reporting in progress - add environment details & error message]

    opened by Mjboothaus 4
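The query string in the report has the params fused into the SQL text. The SQL and the params tuple are separate arguments; the stdlib sqlite3 module, shown here as a local stand-in, uses the same ? placeholder convention:

```python
import sqlite3

# Local stand-in for the S3-hosted database
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE beaches ("Beach name" TEXT, Region TEXT)')
conn.execute('INSERT INTO beaches VALUES (?, ?)', ('Bondi', 'SydneyOceanBeaches'))

# The SQL string and the params are passed as two separate arguments
rows = conn.execute(
    'SELECT "Beach name" FROM beaches WHERE Region = ?',
    ('SydneyOceanBeaches',),
).fetchall()
print(rows)  # [('Bondi',)]
```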
  • KeyError: 'x-amz-version-id'

    Hi, I am having issues while trying to connect to my sqlite3.db in S3 bucket.

    My script is as follows:

    import os, sys
    import json
    from functools import partial
    from sqlite_s3_query import sqlite_s3_query

    credentials = json.load(open(sys.argv[1], "r"))

    os.environ['AWS_REGION'] = credentials['region']
    os.environ['AWS_ACCESS_KEY_ID'] = credentials['access-key-id']
    os.environ['AWS_SECRET_ACCESS_KEY'] = credentials['secret-access-key']

    def get_credentials(_):
        return (
            os.environ['AWS_REGION'],
            os.environ['AWS_ACCESS_KEY_ID'],
            os.environ['AWS_SECRET_ACCESS_KEY'],
            os.environ.get('AWS_SESSION_TOKEN'),  # Only needed for temporary credentials
        )

    query_my_db = partial(sqlite_s3_query,
        url=credentials['s3-object-url'],
        get_credentials=get_credentials,
    )

    with \
            query_my_db() as query, \
            query('SELECT * FROM MODEL') as (columns, rows):

        for row in rows:
            print(row)

    This is the error log:

    Traceback (most recent call last):
      File "read_s3.py", line 27, in <module>
        query_my_db() as query, \
      File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
        return next(self.gen)
      File "/home/ubuntu/.local/lib/python3.6/site-packages/sqlite_s3_query.py", line 313, in sqlite_s3_query
        get_vfs(http_client) as vfs, \
      File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
        return next(self.gen)
      File "/home/ubuntu/.local/lib/python3.6/site-packages/sqlite_s3_query.py", line 144, in get_vfs
        version_id = head_headers['x-amz-version-id']
      File "/home/ubuntu/.local/lib/python3.6/site-packages/httpx/_models.py", line 991, in __getitem__
        raise KeyError(key)
    KeyError: 'x-amz-version-id'
    

    To verify that the IAM credentials are correct, I load the same json file and download the sqlitedb file using boto3 and it was successful. Do you have any idea on what seems to be the problem? ( I am not using temporary credentials, thus os.environ.get('AWS_SESSION_TOKEN') is None)

    Thank you.

    opened by cheehook 4
  • Infinite recursion issue with using latest httpx

    I honestly can't tell if this is a problem with the SSL packages install on my host, a problem with sqlite-s3-query, or a problem with httpx.

    Eventlet got a similar call stack in a bug, but I'm not smart enough to know if it is relevant.

    I'm posting this here because I got an error message and a not very good workaround. My host is AWS Linux.

    So safety, pip-audit and so on report the lower versions of httpx as insecure. But if I bump to the latest version I get a recursion error.

    # my work around...
    # This version used to work at 18, but gets flagged as insecure.
    # The 23 version creates recursive loops with either sqlite lib
    # or it isn't multiprocess safe anymore... not sure.
    httpx = "^0.18.2"
    
    response = function(request)
      File "/var/lang/lib/python3.9/site-packages/connexion/decorators/validation.py", line 399, in wrapper
        return function(request)
      File "/var/lang/lib/python3.9/site-packages/connexion/decorators/parameter.py", line 120, in wrapper
        return function(**kwargs)
      File "/var/lang/lib/python3.9/site-packages/search_service/controllers/controller_utils.py", line 24, in func_wrapper
        return func(*args, **kwargs)
      File "/var/lang/lib/python3.9/site-packages/search_service/controllers/cis_api.py", line 64, in does_copyright_pdf_exist
        query_cis.signed_url_lookup.get_pdf_object_path(copyright_number)
      File "/var/lang/lib/python3.9/site-packages/search_service/cis_files/query_cis.py", line 161, in get_pdf_object_path
        results = pool.map(query_year, years)
      File "/var/lang/lib/python3.9/multiprocessing/pool.py", line 364, in map
        return self._map_async(func, iterable, mapstar, chunksize).get()
      File "/var/lang/lib/python3.9/multiprocessing/pool.py", line 771, in get
        raise self._value
      File "/var/lang/lib/python3.9/multiprocessing/pool.py", line 125, in worker
        result = (True, func(*args, **kwds))
      File "/var/lang/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
        return list(map(*args))
      File "/var/lang/lib/python3.9/site-packages/search_service/cis_files/query_cis.py", line 143, in query_year
        with sqlite_s3_query(
      File "/var/lang/lib/python3.9/contextlib.py", line 119, in __enter__
        return next(self.gen)
      File "/var/lang/lib/python3.9/site-packages/sqlite_s3_query.py", line 383, in sqlite_s3_query
        with sqlite_s3_query_multi(url,
      File "/var/lang/lib/python3.9/contextlib.py", line 119, in __enter__
        return next(self.gen)
      File "/var/lang/lib/python3.9/site-packages/sqlite_s3_query.py", line 360, in sqlite_s3_query_multi
        get_http_client() as http_client, \
      File "/var/lang/lib/python3.9/site-packages/sqlite_s3_query.py", line 374, in <lambda>
        ), get_http_client=lambda: httpx.Client(),
      File "/var/lang/lib/python3.9/site-packages/httpx/_client.py", line 673, in __init__
        self._transport = self._init_transport(
      File "/var/lang/lib/python3.9/site-packages/httpx/_client.py", line 721, in _init_transport
        return HTTPTransport(
      File "/var/lang/lib/python3.9/site-packages/httpx/_transports/default.py", line 126, in __init__
        ssl_context = create_ssl_context(verify=verify, cert=cert, trust_env=trust_env)
      File "/var/lang/lib/python3.9/site-packages/httpx/_config.py", line 49, in create_ssl_context
        return SSLConfig(
      File "/var/lang/lib/python3.9/site-packages/httpx/_config.py", line 73, in __init__
        self.ssl_context = self.load_ssl_context()
      File "/var/lang/lib/python3.9/site-packages/httpx/_config.py", line 85, in load_ssl_context
        return self.load_ssl_context_verify()
      File "/var/lang/lib/python3.9/site-packages/httpx/_config.py", line 122, in load_ssl_context_verify
        context = self._create_default_ssl_context()
      File "/var/lang/lib/python3.9/site-packages/httpx/_config.py", line 157, in _create_default_ssl_context
        set_minimum_tls_version_1_2(context)
      File "/var/lang/lib/python3.9/site-packages/httpx/_compat.py", line 30, in set_minimum_tls_version_1_2
        context.minimum_version = ssl.TLSVersion.TLSv1_2
      File "/var/lang/lib/python3.9/ssl.py", line 587, in minimum_version
        super(SSLContext, SSLContext).minimum_version.__set__(self, value)
      File "/var/lang/lib/python3.9/ssl.py", line 587, in minimum_version
        super(SSLContext, SSLContext).minimum_version.__set__(self, value)
      File "/var/lang/lib/python3.9/ssl.py", line 587, in minimum_version
        super(SSLContext, SSLContext).minimum_version.__set__(self, value)
      [Previous line repeated 445 more times]
      File "/var/lang/lib/python3.9/ssl.py", line 585, in minimum_version
        if value == TLSVersion.SSLv3:
    RecursionError: maximum recursion depth exceeded while calling a Python object
    
    opened by matthewdeanmartin 3
  • disk I/O error

    Hi

    I have tried to query the SQLite file present in the S3 bucket using the sqlite-s3-query package, but I'm getting "Exception: disk I/O error". The object is private and is in a private bucket. I have provided the correct access and secret keys of the user, and verified that the user is able to access the object using boto3.

    Please find the complete details below,

    version of sqlite_s3_query - 0.0.44
    the file format used in the s3 bucket - filename.db, filename.sqlite
    region, access key id, and secret access key are stored in environment variables

    sample code

    from sqlite_s3_query import sqlite_s3_query

    s3_url = "https://<bucket>.s3.<region>.amazonaws.com/<key>"

    with sqlite_s3_query(url=s3_url) as query:
        with query('SELECT * FROM table1 WHERE name= ?', params=('name1',)) as (columns, rows):
            for row in rows:
                print(row)

    Error message observed:

    Traceback (most recent call last):
      File "C:\Users\user1\Downloads\db.py", line 29, in <module>
        with sqlite_s3_query(url='') as query:
      File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\contextlib.py", line 117, in __enter__
        return next(self.gen)
      File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlite_s3_query.py", line 319, in sqlite_s3_query
        with
      File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\contextlib.py", line 117, in __enter__
        return next(self.gen)
      File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlite_s3_query.py", line 275, in get_db
        run(libsqlite3.sqlite3_open_v2, f'file:/{file_name}'.encode() + b'\0', byref(db), SQLITE_OPEN_READONLY | SQLITE_OPEN_URI, vfs_name.encode() + b'\0')
      File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\site-packages\sqlite_s3_query.py", line 73, in run
        raise Exception(libsqlite3.sqlite3_errstr(res).decode())
    Exception: disk I/O error

    Could you please help me with this

    Thanks, 4sachi

    opened by 4sachi 3
  • docs: more examples needed

    I'm planning to use this in an AWS Lambda function, what is the best approach to do this? Because the result from a query is an iterator (as shown), it can't be returned directly (or can it be?) from the function

    btw, awesome job 💯

    opened by souzaramon 3
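One approach (a sketch, not from the library's documentation; rows_to_payload is a hypothetical helper) is to materialize the rows iterator into a JSON-serializable list before returning it from the Lambda handler:

```python
import json

def rows_to_payload(columns, rows):
    # Consume the rows iterator into JSON-friendly dicts
    return [dict(zip(columns, row)) for row in rows]

columns = ('id', 'name')
rows = iter([(1, 'a'), (2, 'b')])  # stands in for the query's rows iterator
print(json.dumps(rows_to_payload(columns, rows)))
# [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
```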
  • change x_file_size to take an int64 pointer instead of an int pointer

    I get a message of 'database image is malformed' on files > 4GB

    Changing:

        x_file_size_type = CFUNCTYPE(c_int, c_void_p, POINTER(c_int))
        def x_file_size(p_file, p_size):
            p_size[0] = size
            return SQLITE_OK
    

    To accept a 64bit pointer

        x_file_size_type = CFUNCTYPE(c_int, c_void_p, POINTER(c_int64))
    

    fixes the issue

    opened by benjaminjackman 2
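The truncation the report describes can be demonstrated with ctypes alone: a file size over 4 GiB does not survive a round-trip through a 32-bit c_int:

```python
from ctypes import c_int, c_int64

size = 5 * 1024 ** 3  # 5 GiB, a plausible size for a large database file
print(c_int64(size).value)  # 5368709120: preserved in a 64-bit int
print(c_int(size).value)    # 1073741824: truncated to the low 32 bits
```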
  • allow boto3 as a HTTP data (range) provider

    If the httpx client's access to the S3-compatible bucket were abstracted just a bit into its own class, AWS boto3 could be swapped in when needed.

    I am starting working on such an abstraction in the repo / branch https://github.com/candrsn/sqlite-s3-query/tree/feature/boto3_client

    It is not ready for a merge, but I want to start discussions before I am ready for a PR.

    The abstraction would move the current S3 API elements into a class that has minimal symmetry with the boto3 S3 client API:

    x = boto3.client('s3')

    and also

    session = boto3.Session()
    s3 = session.client('s3')

    That would simplify S3 API access methods

    opened by candrsn 2
  • move versioning to the code and handle unversioned S3 Buckets

    This PR fixes a problem when accessing data stored in an unversioned S3 bucket.

    There is one other change: moving the version info into the code. (I can retract that part if it is not wanted.)

    This url demonstrates the problem https://fema-cap-imagery.s3.us-east-1.amazonaws.com/Support/cap_index.gpkg

    A sample query might be:

    with sqlite_s3_query.sqlite_s3_query("https://fema-cap-imagery.s3.us-east-1.amazonaws.com/Support/cap_index.gpkg") as query:
        with query("""SELECT fid, Id, ImageEventName FROM images LIMIT 4 OFFSET 1000""") as (cols, rows):
            ......

    opened by candrsn 2