Python function to extract all the rows from a SQLite database file while iterating over its bytes, such as while downloading it

Department for International Trade

Last update: Nov 9, 2022

Related tags

Database stream-sqlite

Overview

stream-sqlite

Python function to extract all the rows from a SQLite database file concurrently with iterating over its bytes, without needing random access to the file.

Note that the SQLite file format is not designed to be streamed; the data is arranged in pages of a fixed number of bytes, and the information to identify a page often comes after the page in the stream (sometimes a great deal after). Therefore, pages are buffered in memory until they can be identified.

Installation

pip install stream-sqlite

Usage

from stream_sqlite import stream_sqlite
import httpx

# Iterable that yields the bytes of a sqlite file
def sqlite_bytes():
    with httpx.stream('GET', 'http://www.parlgov.org/static/stable/2020/parlgov-stable.db') as r:
        yield from r.iter_bytes(chunk_size=65_536)

# If there is a single table in the file, there will be exactly one iteration of the outer loop.
# If there are multiple tables, each can appear multiple times.
for table_name, pragma_table_info, rows in stream_sqlite(sqlite_bytes(), max_buffer_size=1_048_576):
    for row in rows:
        print(row)

Recommendations

If you have control over the SQLite file, VACUUM; should be run on it before streaming. In addition to minimising the size of the file, VACUUM; arranges the pages in a way that often reduces the buffering required when streaming. This is especially true if it was the target of intermingled INSERTs and/or DELETEs over multiple tables.

Also, indexes are not used for extracting the rows while streaming. If streaming is the only use case of the SQLite file, and you have control over it, indexes should be removed, and VACUUM; then run.

Some tests suggest that if the file is written in autovacuum mode, i.e. PRAGMA auto_vacuum = FULL;, then the pages are arranged in a way that reduces the buffering required when streaming. Your mileage may vary.

You might also like...

A Painless Simple Way To Create Schema and Do Database Operations Quickly In Python

PainlessDB - Taking Your Pain away to the moon 🚀 Contribute · Community · Documentation 🎫 Introduction : PainlessDB is a Python-based free and open-

3 Jul 15, 2022

HTTP graph database built in Python 3

KiwiDB HTTP graph database built in Python 3. Reference Format References are strings in the format: {refIDENTIFIER@GROUP} Authentication Currently, t

1 Dec 17, 2021

A NoSQL database made in python.

CookieDB A NoSQL database made in python.

1 Nov 30, 2022

Tiny local JSON database for Python.

Pylowdb Simple to use local JSON database 🦉 # This is pure python, not specific to pylowdb ;) db.data['posts'] = ({ 'id': 1, 'title': 'pylowdb is awe

3 Jan 26, 2022

Shelf DB is a tiny document database for Python to stores documents or JSON-like data

Shelf DB Introduction Shelf DB is a tiny document database for Python to stores documents or JSON-like data. Get it $ pip install shelfdb shelfquery S

35 Nov 3, 2022

TinyDB is a lightweight document oriented database optimized for your happiness :)

Quick Links Example Code Supported Python Versions Documentation Changelog Extensions Contributing Introduction TinyDB is a lightweight document orien

5.6k Dec 30, 2022

Elara DB is an easy to use, lightweight NoSQL database that can also be used as a fast in-memory cache.

Elara DB is an easy to use, lightweight NoSQL database written for python that can also be used as a fast in-memory cache for JSON-serializable data. Includes various methods and features to manipulate data structures in-memory, protect database files and export data.

101 Jan 4, 2023

A simple GUI that interacts with a database to keep track of a collection of US coins.

CoinCollectorGUI A simple gui designed to interact with a database. The goal of the database is to make keeping track of collected coins simple. The G

1 Nov 9, 2021

Makes google's political ad database actually useful

Making Google's political ad transparency library suck less This is a series of scripts that takes Google's political ad transparency data and makes t

7 Apr 28, 2022

Comments

RecursionError when parsing large (0.3 gb) sqlite file

So im parsing a sqlite file from a IOT device which was sent through a post request.

import io
from stream_sqlite import stream_sqlite

def tables_list(table_iter):
    return [
        (table_name, table_info, list(table_rows))
        for table_name, table_info, table_rows in table_iter
        ]

tables = tables_list(stream_sqlite(io.BytesIO(sqlite_bytes),max_buffer_size=10*1024*1024))

gives the following (very long, which I shortened) error:

Traceback (most recent call last):
File "<string>", line 1, in <module>
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 441, in stream_sqlite
  for (name, info), single_table_rows in grouped_by_table:
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 438, in <genexpr>
  table_rows = ((table_name, table_info, row) for table_name, table_info, row in table_rows_including_internal if not table_name.startswith('sqlite_'))
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 425, in yield_table_rows
  yield from process_page(page_bytes, page_reader)
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 325, in process_table_page
  yield from \
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 304, in process_table_leaf_non_master
  yield from process_initial_payload(
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 182, in process_initial_payload
  yield from process_if_buffered_or_remember(partial(
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 411, in process_if_buffered_or_remember
  yield from process(page_bytes, page_reader)
yield from process_if_buffered_or_remember(partial(
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 411, in process_if_buffered_or_remember
  yield from process(page_bytes, page_reader)
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 200, in process_overflow_page
  yield from process_if_buffered_or_remember(partial(
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 411, in process_if_buffered_or_remember
  yield from process(page_bytes, page_reader)
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 200, in process_overflow_page
  yield from process_if_buffered_or_remember(partial(
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 411, in process_if_buffered_or_remember
  yield from process(page_bytes, page_reader)
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 200, in process_overflow_page
  yield from process_if_buffered_or_remember(partial(
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 411, in process_if_buffered_or_remember
  yield from process(page_bytes, page_reader)
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 200, in process_overflow_page
  yield from process_if_buffered_or_remember(partial(
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 411, in process_if_buffered_or_remember
  yield from process(page_bytes, page_reader)
File ".../.venv/lib/python3.8/site-packages/stream_sqlite.py", line 189, in process_overflow_page
  num_this_page = min(payload_remainder, len(page_bytes) - 4)
RecursionError: maximum recursion depth exceeded in comparison

opened by hugolytics 6

TypeError Exception when parsing sqlite file

I get the following exception when parsing a sqlite file:

  File "/home/roman/enline/cloud/lambda/sqldecoder.py", line 98, in decode
    for table_name, _pragma_table_info, rows in stream_sqlite(byte_stream, max_buffer_size):
  File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 441, in stream_sqlite
    for (name, info), single_table_rows in grouped_by_table:
  File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 438, in <genexpr>
    table_rows = ((table_name, table_info, row) for table_name, table_info, row in table_rows_including_internal if not table_name.startswith('sqlite_'))
  File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 425, in yield_table_rows
    yield from process_page(page_bytes, page_reader)
  File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 325, in process_table_page
    yield from \
  File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 304, in process_table_leaf_non_master
    yield from process_initial_payload(
  File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 172, in process_initial_payload
    yield from full_payload_processor(cell_num_reader, cell_varint_reader)
  File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 292, in process_non_master_leaf_row
    yield table_name, table_info, read_table_row(rowid, cell_num_reader, cell_varint_reader)
  File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 229, in read_table_row
    return row_constructor(rowid, *tuple(
  File "/usr/local/lib/python3.9/dist-packages/stream_sqlite.py", line 261, in <lambda>
    row_constructor = lambda rowid, *values: row_namedtuple(*tuple((rowid if i == rowid_alias_index else value) for i, value in enumerate(values)))
TypeError: <lambda>() missing 1 required positional argument: 'description'

This file was recently changed in structure by ALTER TABLE alarming ADD COLUMN description TEXT. Before this, the file was successfully parsed. So I assume that this change is the cause of above exception.

opened by romor 4

feat: support column additions

Should address the issue reported in https://github.com/uktrade/stream-sqlite/issues/5

What was happening is that when columns are added via an ALTER TABLE, in many cases the actual rows aren't modified, so there weren't enough columns in the data proper. To find the value of the columns, you have to look to the schema to find the correct value.

Interestingly the default is not a true SQLite value, but actually a text snippet of SQL that defines the value. Hence an additional query is needed to convert these to real values.

opened by michalc 0

Owner

Department for International Trade

GitHub

This is a simple graph database in SQLite, inspired by

This is a simple graph database in SQLite, inspired by "SQLite as a document database".

1.2k Jan 3, 2023

Manage your sqlite database very easy (like django) ...

1 Feb 9, 2022

AWS Tags As A Database is a Python library using AWS Tags as a Key-Value database.

AWS Tags As A Database is a Python library using AWS Tags as a Key-Value database. This database is completely free* ??

42 Nov 25, 2022

Mongita is to MongoDB as SQLite is to SQL

Mongita is a lightweight embedded document database that implements a commonly-used subset of the MongoDB/PyMongo interface. Mongita differs from MongoDB in that instead of being a server, Mongita is a self-contained Python library. Mongita can be configured to store its documents either on disk or in memory.

809 Jan 7, 2023

Python function to extract all the rows from a SQLite database file while iterating over its bytes, such as while downloading it

Related tags

Overview

stream-sqlite

Installation

Usage

Recommendations

You might also like...

A Painless Simple Way To Create Schema and Do Database Operations Quickly In Python

HTTP graph database built in Python 3

A NoSQL database made in python.

Tiny local JSON database for Python.

Shelf DB is a tiny document database for Python to stores documents or JSON-like data

TinyDB is a lightweight document oriented database optimized for your happiness :)

Elara DB is an easy to use, lightweight NoSQL database that can also be used as a fast in-memory cache.

A simple GUI that interacts with a database to keep track of a collection of US coins.

Makes google's political ad database actually useful

Comments

RecursionError when parsing large (0.3 gb) sqlite file

TypeError Exception when parsing sqlite file

feat: support column additions

Owner

Department for International Trade

This is a simple graph database in SQLite, inspired by

Manage your sqlite database very easy (like django) ...

AWS Tags As A Database is a Python library using AWS Tags as a Key-Value database.

Mongita is to MongoDB as SQLite is to SQL

Tools for analyzing Git history using SQLite

PathfinderMonsterDatabase - A database of all monsters in Pathfinder 1e, created by parsing aonprd.com

Python object-oriented database

LightDB is a lightweight JSON Database for Python

A Simple , ☁️ Lightweight , 💪 Efficent JSON based database for 🐍 Python.

A Persistent Embedded Graph Database for Python