Python function to stream unzip all the files in a ZIP archive, without loading the entire ZIP file or any of its files into memory at once

Overview

stream-unzip

Python function to stream unzip all the files in a ZIP archive, without loading the entire ZIP file or any of its uncompressed files into memory.

While the ZIP format does have its central directory at the end, each compressed file in the archive can be prefixed with a header that contains its name, compressed size, and uncompressed size. This is what makes streaming decompression of ZIP files possible.

Unfortunately, not all ZIP files have this: some store their compressed and uncompressed sizes after the file data in the stream. In this case, a ValueError will be raised.
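
To make the role of that per-file header concrete, here is a minimal sketch that parses the fixed 30-byte local file header of the first member in a ZIP archive. It is an illustration of the ZIP format only, not how stream-unzip is implemented; the helper name read_local_header and the in-memory test archive are assumptions made for the example.

```python
import io
import struct
import zipfile

# Fixed-size part of a ZIP local file header (30 bytes, little-endian).
LOCAL_HEADER = struct.Struct('<IHHHHHIIIHH')
LOCAL_SIGNATURE = 0x04034b50   # the bytes b'PK\x03\x04'
DATA_DESCRIPTOR_FLAG = 0x0008  # flag bit 3: sizes follow the file data


def read_local_header(data):
    """Parse the local file header at the start of `data` (hypothetical helper)."""
    (signature, _version, flags, method, _time, _date, _crc,
     compressed_size, uncompressed_size,
     name_length, _extra_length) = LOCAL_HEADER.unpack_from(data, 0)
    if signature != LOCAL_SIGNATURE:
        raise ValueError('not a local file header')
    name = data[LOCAL_HEADER.size:LOCAL_HEADER.size + name_length]
    return {
        'name': name,
        'method': method,
        'compressed_size': compressed_size,
        'uncompressed_size': uncompressed_size,
        'sizes_after_data': bool(flags & DATA_DESCRIPTOR_FLAG),
    }


# Build a tiny archive in memory to try the parser on.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w', zipfile.ZIP_STORED) as archive:
    archive.writestr('hello.txt', b'hello world')

header = read_local_header(buffer.getvalue())
print(header)
```

When sizes_after_data is true, the sizes in the header are not usable up front, which corresponds to the unsupported case described above.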

Installation

pip install stream-unzip

Usage

A single function is exposed, stream_unzip, that takes a single argument: an iterable that yields the bytes of a ZIP file. It returns an iterable whose items are tuples of the file name, the file size, and another iterable that yields the unzipped bytes of that file.

from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)

The file name and file size are passed through exactly as reported in the ZIP file. If you don't trust the creator of the ZIP file, treat them as untrusted input.
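
One way to act on this when writing members to disk is to check that each name stays inside a chosen output directory. The sketch below assumes names arrive as bytes that decode as UTF-8, and safe_target is a hypothetical helper, not part of stream-unzip.

```python
from pathlib import Path


def safe_target(base_dir, file_name):
    """Return a path for `file_name` inside `base_dir`, or raise ValueError.

    Hypothetical helper: rejects absolute names, '..' traversal and empty names.
    """
    base = Path(base_dir).resolve()
    # Assumption: member names are UTF-8 encoded bytes.
    name = file_name.decode('utf-8')
    target = (base / name).resolve()
    if base not in target.parents:
        raise ValueError(f'unsafe file name: {file_name!r}')
    return target


print(safe_target('extract_dir', b'docs/readme.txt'))
```

The reported file size deserves the same caution: comparing it against a limit, and counting the bytes actually received, avoids relying on a value the archive's creator controls.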

Issues
  • ZeroDivisionError in the end of zip file

    Thanks for the lib. Got ZeroDivisionError: integer division or modulo by zero during processing zip file's last chunk from example code snippet:

    Traceback (most recent call last):
      File "***\tmp.py", line 9, in <module>
        for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
      File "***\venv_win32_39py\lib\site-packages\stream_unzip.py", line 180, in stream_unzip
        for _ in yield_all():
      File "***\venv_win32_39py\lib\site-packages\stream_unzip.py", line 35, in _yield_all
        offset = (offset + to_yield) % len(chunk)
    ZeroDivisionError: integer division or modulo by zero
    

    Code snippet:

    import httpx
    from stream_unzip import stream_unzip
    
    
    def zipped_chunks():
        # Any iterable that yields a zip file
        with httpx.stream('GET', 'https://www.gyan.dev/ffmpeg/builds/packages/ffmpeg-4.4-essentials_build.zip') as r:
            yield from r.iter_bytes()
    
    
    for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
        for chunk in unzipped_chunks:
            # print(chunk)
            print(file_name, file_size)
    

    Python 3.9.5 (Windows 10) stream-unzip 0.0.23

    opened by tropicoo 5
  • Not iterating all chunks explicitly causes signature error

    I stumbled on a bug while trying to see if smart_open would work: if you do not iterate all chunks ("unzipped_chunks") there will be a signature error:

    from stream_unzip import stream_unzip, UnexpectedSignatureError
    import smart_open
    
    url="https://transfer.sh/get/cp4LTN/test.zip"
    
    print("without chunk iteration")
    try:
        with smart_open.open(url,'rb') as f_h:
            for file_name, file_size, unzipped_chunks in stream_unzip(f_h):
                print(file_name)
    except UnexpectedSignatureError as e:
        print("UnexpectedSignatureError: "+str(e))
    
    print("with chunk iteration")
    try:
        with smart_open.open(url,'rb') as f_h:
            for file_name, file_size, unzipped_chunks in stream_unzip(f_h):
                print(file_name)
                for chunk in unzipped_chunks:
                    continue
    except UnexpectedSignatureError as e:
        print("UnexpectedSignatureError: "+str(e))
    

    result:

    $ python .\stream_unzip_test.py
    without chunk iteration
    b'test.txt'
    UnexpectedSignatureError: b'test'
    with chunk iteration
    b'test.txt'
    b'test1.txt'
    b'xsubtest.zip'
    
    opened by jeroenbaas 4
  • Password protected zips?

    Are password-protected zips supported? I did not find any information about this.

    opened by CyrosX 3
  • [question] asyncio Zip File of Zipped Chunks

    Consider streaming in a zip file.

    def zipped_chunks(zipfile_name: PurePath):
        # Iterable that yields the bytes of a zip file
        with open(zipfile_name, "r+b", buffering=io.DEFAULT_BUFFER_SIZE) as zip_f:
            yield zip_f.read()
    

    I am attempting to unzip a number of large zip files concurrently that are hosted on a slow network drive. Do you see any value in using the aiofiles package to stream the read, like so?

    async def zipped_chunks(zipfile_name: PurePath):
        # Iterable that yields the bytes of a zip file
        async with aiofiles.open(zipfile_name, "r+b", buffering=io.DEFAULT_BUFFER_SIZE) as zip_f:
            yield await zip_f.read()
    
    async def unzip_tar_files(self, zipfile_name: PurePath):
        chunks: List[bytes] = [data async for data in self.zipped_chunks(zipfile_name)]
        for file_name, tar_file_size, unzipped_chunks in stream_unzip(chunks):
            ....
    

    That seems to work well for me (so far). Do you see any downside?

    If not, it might be a nice addition to the README, as I have finally come across a program I am writing from scratch that benefits from an asyncio solution, with stream-unzip being a key part of it. It took me forever to understand that only list comprehension supports async iteration.

    opened by gkedge 2
  • doesn't work with smart_open

    If I try to use smart_open (so that we can e.g. stream from S3 buckets), for example:

    with smart_open.open(url,'rb') as f_h:
        for file_name, file_size, unzipped_chunks in stream_unzip(f_h):
            print(file_name)
    

    The code prints one file, and then throws an UnexpectedSignatureError.

    opened by jeroenbaas 2
  • feat: raise UnfinishedIterationError if iteration is unfinished

    Addresses some of the concerns at https://github.com/uktrade/stream-unzip/issues/21 (but does not change the API to allow not fully iterating over the bytes of member files)

    opened by michalc 0
  • tests: test for skippable wrapper

    Added since this test is mentioned at https://github.com/uktrade/stream-unzip/issues/21

    opened by michalc 0
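
Several of the reports above suggest that each member's chunks need to be fully consumed before moving on to the next member. A minimal sketch of skipping unwanted members by draining their chunks follows; skip_member and fake_members are hypothetical names, and fake_members merely stands in for a call to stream_unzip so the example runs without a real archive.

```python
def skip_member(unzipped_chunks):
    """Consume and discard every chunk of one member file."""
    for _ in unzipped_chunks:
        pass


def fake_members():
    # Hypothetical stand-in for stream_unzip(...): yields (name, size, chunks).
    yield b'a.txt', 3, iter([b'abc'])
    yield b'b.txt', 6, iter([b'def', b'ghi'])


wanted = {b'b.txt'}
collected = {}
for file_name, file_size, unzipped_chunks in fake_members():
    if file_name not in wanted:
        # Drain the member rather than ignoring its chunks.
        skip_member(unzipped_chunks)
        continue
    collected[file_name] = b''.join(unzipped_chunks)

print(collected)
```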
Owner
Department for International Trade