Python function to extract all the rows from a SQLite database file while iterating over its bytes, such as while downloading it

Overview

stream-sqlite

Python function to extract all the rows from a SQLite database file concurrently with iterating over its bytes, without needing random access to the file.

Note that the SQLite file format is not designed to be streamed; the data is arranged in pages of a fixed number of bytes, and the information to identify a page often comes after the page in the stream (sometimes a great deal after). Therefore, pages are buffered in memory until they can be identified.

Installation

pip install stream-sqlite

Usage

from stream_sqlite import stream_sqlite
import httpx

# Iterable that yields the bytes of a sqlite file
def sqlite_bytes():
    with httpx.stream('GET', 'http://www.parlgov.org/static/stable/2020/parlgov-stable.db') as r:
        yield from r.iter_bytes(chunk_size=65_536)

# If there is a single table in the file, there will be exactly one iteration of the outer loop.
# If there are multiple tables, each can appear multiple times.
for table_name, pragma_table_info, rows in stream_sqlite(sqlite_bytes(), max_buffer_size=1_048_576):
    for row in rows:
        print(row)
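
The source of the bytes doesn't have to be HTTP: any iterable of bytes chunks will do. As a minimal sketch, assuming a local file at the hypothetical path my.sqlite, the same pattern works when reading from disk in chunks:

from stream_sqlite import stream_sqlite

# Iterable that yields the bytes of a local SQLite file in fixed-size chunks
def local_sqlite_bytes(path, chunk_size=65_536):
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

for table_name, pragma_table_info, rows in stream_sqlite(local_sqlite_bytes('my.sqlite'), max_buffer_size=1_048_576):
    for row in rows:
        print(table_name, row)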

Recommendations

If you have control over the SQLite file, VACUUM; should be run on it before streaming. In addition to minimising the size of the file, VACUUM; arranges the pages in a way that often reduces the buffering required when streaming. This is especially true if it was the target of intermingled INSERTs and/or DELETEs over multiple tables.

Also, indexes are not used for extracting the rows while streaming. If streaming is the only use case of the SQLite file, and you have control over it, indexes should be removed, and VACUUM; then run.
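
As a hedged sketch of this preparation, using only the standard library's sqlite3 module and the hypothetical path my.sqlite: drop the user-created indexes listed in sqlite_master, then run VACUUM;.

import sqlite3

# Autocommit mode: VACUUM cannot run inside an open transaction
con = sqlite3.connect('my.sqlite', isolation_level=None)  # hypothetical path

# Drop user-created indexes; automatic indexes (where sql IS NULL) cannot be dropped
for (name,) in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND sql IS NOT NULL").fetchall():
    con.execute('DROP INDEX "' + name + '"')

con.execute('VACUUM')  # minimise file size and rearrange pages for streaming
con.close()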

Some tests suggest that if the file is written in autovacuum mode, i.e. PRAGMA auto_vacuum = FULL;, then the pages are arranged in a way that reduces the buffering required when streaming. Your mileage may vary.
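
If you want to try this on an existing file, note that on a non-empty database the new auto_vacuum mode only takes effect after a VACUUM, so a sketch would be (again with a hypothetical path):

import sqlite3

# Autocommit mode so the VACUUM below is allowed
con = sqlite3.connect('my.sqlite', isolation_level=None)  # hypothetical path
con.execute('PRAGMA auto_vacuum = FULL')
con.execute('VACUUM')  # required for the new auto_vacuum setting to take effect
con.close()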


Comments
  • RecursionError when parsing large (0.3 gb) sqlite file

    So I'm parsing a SQLite file from an IoT device, which was sent through a POST request.

    import io
    from stream_sqlite import stream_sqlite

    def tables_list(table_iter):
        return [
            (table_name, table_info, list(table_rows))
            for table_name, table_info, table_rows in table_iter
        ]

    # sqlite_bytes holds the raw bytes of the uploaded SQLite file
    tables = tables_list(stream_sqlite(io.BytesIO(sqlite_bytes), max_buffer_size=10*1024*1024))

    gives the following error (very long, so I've shortened it):

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 441, in stream_sqlite
        for (name, info), single_table_rows in grouped_by_table:
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 438, in <genexpr>
        table_rows = ((table_name, table_info, row) for table_name, table_info, row in table_rows_including_internal if not table_name.startswith('sqlite_'))
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 425, in yield_table_rows
        yield from process_page(page_bytes, page_reader)
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 325, in process_table_page
        yield from \
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 304, in process_table_leaf_non_master
        yield from process_initial_payload(
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 182, in process_initial_payload
        yield from process_if_buffered_or_remember(partial(
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 411, in process_if_buffered_or_remember
        yield from process(page_bytes, page_reader)
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 200, in process_overflow_page
        yield from process_if_buffered_or_remember(partial(
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 411, in process_if_buffered_or_remember
        yield from process(page_bytes, page_reader)
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 200, in process_overflow_page
        yield from process_if_buffered_or_remember(partial(
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 411, in process_if_buffered_or_remember
        yield from process(page_bytes, page_reader)
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 200, in process_overflow_page
        yield from process_if_buffered_or_remember(partial(
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 411, in process_if_buffered_or_remember
        yield from process(page_bytes, page_reader)
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 200, in process_overflow_page
        yield from process_if_buffered_or_remember(partial(
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 411, in process_if_buffered_or_remember
        yield from process(page_bytes, page_reader)
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 200, in process_overflow_page
        yield from process_if_buffered_or_remember(partial(
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 411, in process_if_buffered_or_remember
        yield from process(page_bytes, page_reader)
      File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 189, in process_overflow_page
        num_this_page = min(payload_remainder, len(page_bytes) - 4)
    RecursionError: maximum recursion depth exceeded in comparison
    
    opened by hugolytics 6
  • TypeError Exception when parsing sqlite file

    I get the following exception when parsing a sqlite file:

      File "/home/roman/enline/cloud/lambda/sqldecoder.py", line 98, in decode
        for table_name, _pragma_table_info, rows in stream_sqlite(byte_stream, max_buffer_size):
      File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 441, in stream_sqlite
        for (name, info), single_table_rows in grouped_by_table:
      File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 438, in <genexpr>
        table_rows = ((table_name, table_info, row) for table_name, table_info, row in table_rows_including_internal if not table_name.startswith('sqlite_'))
      File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 425, in yield_table_rows
        yield from process_page(page_bytes, page_reader)
      File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 325, in process_table_page
        yield from \
      File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 304, in process_table_leaf_non_master
        yield from process_initial_payload(
      File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 172, in process_initial_payload
        yield from full_payload_processor(cell_num_reader, cell_varint_reader)
      File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 292, in process_non_master_leaf_row
        yield table_name, table_info, read_table_row(rowid, cell_num_reader, cell_varint_reader)
      File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 229, in read_table_row
        return row_constructor(rowid, *tuple(
      File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 261, in <lambda>
        row_constructor = lambda rowid, *values: row_namedtuple(*tuple((rowid if i == rowid_alias_index else value) for i, value in enumerate(values)))
    TypeError: <lambda>() missing 1 required positional argument: 'description'
    

    The file's structure was recently changed by ALTER TABLE alarming ADD COLUMN description TEXT. Before this change, the file was parsed successfully, so I assume this change is the cause of the above exception.

    opened by romor 4
  • feat: support column additions

    Should address the issue reported in https://github.com/uktrade/stream-sqlite/issues/5

    What was happening is that when columns are added via ALTER TABLE, in many cases the existing rows aren't modified, so there weren't enough columns in the data proper. To find the values of the added columns, you have to look to the schema for their defaults.

    Interestingly, the default is not a true SQLite value, but a text snippet of SQL that defines the value. Hence an additional query is needed to convert these snippets to real values.
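
    For illustration, a minimal sketch of that conversion, assuming nothing beyond the standard library: since the stored default is SQL text, SQLite itself can evaluate it with a SELECT. The snippet value below is a hypothetical stand-in for what a schema might contain.

    import sqlite3

    # A DEFAULT clause is stored as SQL text, e.g. "'unknown'", "0" or
    # "CURRENT_TIMESTAMP", so run it through a query to get a real value
    snippet = "'unknown'"  # hypothetical default taken from a table's schema
    con = sqlite3.connect(':memory:')
    (value,) = con.execute('SELECT ' + snippet).fetchone()
    con.close()
    print(value)  # unknown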

    opened by michalc 0
Owner

Department for International Trade