Python function to stream unzip all the files in a ZIP archive, without loading the entire ZIP file or any of its files into memory at once

Overview


Python function to stream unzip all the files in a ZIP archive, without loading the entire ZIP file or any of its uncompressed files into memory.

While the ZIP format does have its main directory at the end, each compressed file in the archive can be prefixed with a header that contains its name, compressed size, and uncompressed size: this is what makes streaming decompression of ZIP files possible.

Unfortunately not all ZIP files have this: some have their compressed and uncompressed sizes after the file data in the stream. In this case a ValueError will be raised.
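
To make the header structure concrete, here is a minimal sketch (not part of stream-unzip's API, just an illustration of the layout described above) that unpacks the fixed 30-byte part of a local file header. The flag check shows the case where the sizes only appear after the file data:

import struct

LOCAL_FILE_HEADER_SIGNATURE = b'PK\x03\x04'
# Fixed part of a ZIP local file header: 30 bytes, little-endian
LOCAL_FILE_HEADER_STRUCT = struct.Struct('<4sHHHHHIIIHH')

def parse_local_file_header(header_bytes):
    (signature, version_needed, flags, compression, mod_time, mod_date,
     crc_32, compressed_size, uncompressed_size,
     file_name_len, extra_field_len) = LOCAL_FILE_HEADER_STRUCT.unpack(header_bytes)
    if signature != LOCAL_FILE_HEADER_SIGNATURE:
        raise ValueError('Not a local file header')
    # If bit 3 of the flags is set, the sizes above are zero and the real
    # values only appear in a "data descriptor" after the file data
    sizes_follow_data = bool(flags & 0x08)
    return file_name_len, extra_field_len, compressed_size, uncompressed_size, sizes_follow_data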

Installation

pip install stream-unzip

Usage

A single function, stream_unzip, is exposed. It takes a single argument: an iterable that should yield the bytes of a ZIP file. It returns an iterable, where each yielded item is a tuple of the file name, the file size, and another iterable that itself yields the unzipped bytes of that file.

from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)

The file name and file size are as reported by the ZIP file. If you don't trust the creator of the ZIP file, these should be treated as untrusted input.
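
For example, a minimal sketch of writing each member to a directory while treating the reported name as untrusted; the name handling here is illustrative and not part of stream-unzip:

import os
from stream_unzip import stream_unzip

def save_all(zipped_chunks, target_dir):
    for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks):
        # file_name is bytes taken straight from the archive, so decode it and
        # keep only the final path component rather than trusting it as a path
        safe_name = os.path.basename(file_name.decode('utf-8', 'replace'))
        with open(os.path.join(target_dir, safe_name), 'wb') as f:
            for chunk in unzipped_chunks:
                f.write(chunk)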

Comments
  • stream_unzip.UnexpectedSignatureError

    stream_unzip.UnexpectedSignatureError

    When decompressing a zip file this error is raised:

    stream_unzip.UnexpectedSignatureError: b'\xb66\x97\x89'

    Part of my code:

    def read_in_chunks(infile, chunk_size=UNZIP_CHUNK_SIZE):
        while True:
            chunk = infile.read(chunk_size)
            if chunk:
                yield chunk
            else:
                return

    with open(path_to_zip_file, 'rb') as file_object:
        for file_name, file_size, file_chunks in stream_unzip(read_in_chunks(file_object), chunk_size=UNZIP_CHUNK_SIZE):
            print(file_name)
            with open(file_name, 'ab') as file_append:
                for chunk in file_chunks:
                    file_append.write(chunk)
    

    Python 3.9 stream-unzip 0.0.70

    I can unzip my file with WinRAR and zipfile, but can't with stream-unzip.

    Any idea how to fix it?

    opened by vitkovay 11
  • [Question] Are encrypted zip files using Deflate64 as the compression method supported?

    [Question] Are encrypted zip files using Deflate64 as the compression method supported?

    Hi there. From this PR, it seems that encrypted zip files are now supported, but I found that I cannot unzip a password-protected zip file that uses Deflate64 as the compression method. I am not sure whether it is not implemented yet or it is an unexpected bug. Thanks.

    opened by raychanks 10
  • ZeroDivisionError in the end of zip file

    ZeroDivisionError in the end of zip file

    Thanks for the lib. Got ZeroDivisionError: integer division or modulo by zero while processing the zip file's last chunk, using the example code snippet:

    Traceback (most recent call last):
      File "***\tmp.py", line 9, in <module>
        for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
      File "***\venv_win32_39py\lib\site-packages\stream_unzip.py", line 180, in stream_unzip
        for _ in yield_all():
      File "***\venv_win32_39py\lib\site-packages\stream_unzip.py", line 35, in _yield_all
        offset = (offset + to_yield) % len(chunk)
    ZeroDivisionError: integer division or modulo by zero
    

    Code snippet:

    import httpx
    from stream_unzip import stream_unzip
    
    
    def zipped_chunks():
        # Any iterable that yields a zip file
        with httpx.stream('GET', 'https://www.gyan.dev/ffmpeg/builds/packages/ffmpeg-4.4-essentials_build.zip') as r:
            yield from r.iter_bytes()
    
    
    for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
        for chunk in unzipped_chunks:
            # print(chunk)
            print(file_name, file_size)
    

    Python 3.9.5 (Windows 10) stream-unzip 0.0.23

    opened by tropicoo 5
  • Error uncompressing a Zip64 file

    Error uncompressing a Zip64 file

    When decompressing a Zip64 file this error is raised:

    stream_unzip.UnexpectedSignatureError: b'8\\\x12]'
    

    Have you encountered this before?

    Thanks,

    Rusty

    opened by rustyconover 4
  • Not iterating all chunks explicitly causes signature error

    Not iterating all chunks explicitly causes signature error

    I stumbled on a bug while trying to see if smart_open would work: if you do not iterate all chunks ("unzipped_chunks") there will be a signature error:

    from stream_unzip import stream_unzip, UnexpectedSignatureError
    import smart_open
    
    url="https://transfer.sh/get/cp4LTN/test.zip"
    
    print("without chunk iteration")
    try:
        with smart_open.open(url,'rb') as f_h:
            for file_name, file_size, unzipped_chunks in stream_unzip(f_h):
                print(file_name)
    except UnexpectedSignatureError as e:
        print("UnexpectedSignatureError: "+str(e))
    
    print("with chunk iteration")
    try:
        with smart_open.open(url,'rb') as f_h:
            for file_name, file_size, unzipped_chunks in stream_unzip(f_h):
                print(file_name)
                for chunk in unzipped_chunks:
                    continue
    except UnexpectedSignatureError as e:
        print("UnexpectedSignatureError: "+str(e))
    

    result:

    $ python .\stream_unzip_test.py
    without chunk iteration
    b'test.txt'
    UnexpectedSignatureError: b'test'
    with chunk iteration
    b'test.txt'
    b'test1.txt'
    b'xsubtest.zip'
    
    opened by jeroenbaas 4
  • [Question] How to yield chunks of a zipped file

    [Question] How to yield chunks of a zipped file

    Hi,

    I am trying to create a low-memory stream unzipper that can be run in AWS Lambda. Currently I can unzip individual files from a .zip package, but the problem I am facing is that I need to read each file into memory, which for a large file (15 GB) gets the process killed by the runtime.

    I came across this library, but I am unsure it delivers what I need. Or maybe it does but I don't see how.

    This is my example below, using the normal zipfile from the standard lib.

    import zipfile 
    import s3fs
    _zipfile_interface = zipfile.ZipFile
    _s3_interface = s3fs.S3FileSystem(anon=False)
    _READING_MODE = 'rb'
    _WRITING_MODE = 'wb'
    
    
    def extract_file(filename: str, output_uri: str):
        """
        Extracts filename from the zip package into the output S3 URI
    
        Args:
            filename (str): Name of the file inside zip package
            output_uri (str): Output S3 URI
    
        Raises:
            exceptions.ExtractionFailure: In case extraction errors out
        """
        with _s3_interface.open(s3_uri, _READING_MODE) as f_in:
            zip_buffer = _zipfile_interface(f_in)
    
            file_buffer = zip_buffer.open(filename)
    
            with _s3_interface.open(output_uri, _WRITING_MODE) as f_out:
    
                f_out.write(file_buffer.read())  # -> the issue lies here
    

    Do ignore the S3 interface; for the purposes of this example, it's just a normal open().

    I have profiled this method here:

    Line #    Mem usage    Increment  Occurrences   Line Contents
    =============================================================
        78     66.2 MiB     66.2 MiB           1       @profile
        79                                             def extract_file(filename: str, output_uri: str):
        80                                                 """
        81                                                 Extracts filename from zip package into input output S3 URI
        82                                         
        83                                                 Args:
        84                                                     filename (str): Name of the file inside zip package
        85                                                     output_uri (str): Output S3 URI
        86                                         
        87                                                 Raises:
        88                                                     exceptions.ExtractionFailure: In case extraction errors out
        89                                                 """
        90     66.2 MiB      0.0 MiB           1           try:
        91     79.1 MiB      2.1 MiB           2               with _s3_interface.open(package.s3_uri, _READING_MODE) as f_in:
        92     73.1 MiB      0.0 MiB           1                   zip_buffer = _zipfile_interface(f_in)  # type: ignore
        93                                         
        94     83.9 MiB     10.8 MiB           1                   file_buffer = zip_buffer.open(filename)
        95                                         
        96     83.9 MiB      0.0 MiB           2                   with _s3_interface.open(output_uri, _WRITING_MODE) as f_out:
        97                                         
        98     83.9 MiB      0.0 MiB           1                       f_out.write(file_buffer.read())  # type: ignore
        99                                         
    

    On line 94 you can see the jump in memory, about 10 MB, from the zip_buffer.open(filename) call: this is when the file is consumed into memory, and only then can I write it out to the output buffer.

    Ideally, I would like that zip_buffer.open call to yield chunks of size N that I can just write to the output buffer (a sketch of doing this with stream-unzip appears after this list).

    Any knowledge on this?

    Thank you, Alex

    opened by alexanderluiscampino 3
  • [question] asyncio Zip File of Zipped Chunks

    [question] asyncio Zip File of Zipped Chunks

    Consider streaming in a zip file.

    def zipped_chunks(zipfile_name: PurePath):
        # Iterable that yields the bytes of a zip file
        with open(zipfile_name, "r+b", buffering=io.DEFAULT_BUFFER_SIZE) as zip_f:
            yield zip_f.read()
    

    I am attempting to unzip a number of large zip files concurrently that are hosted on a slow network drive. Do you see any value in leveraging the aiofiles package to stream the read, like so?

    async def zipped_chunks(zipfile_name: PurePath):
        # Iterable that yields the bytes of a zip file
        async with aiofiles.open(zipfile_name, "r+b", buffering=io.DEFAULT_BUFFER_SIZE) as zip_f:
            yield await zip_f.read()
    
    async def unzip_tar_files(self, zipfile_name: PurePath):
        chunks: List[bytes] = [data async for data in self.zipped_chunks(zipfile_name)]
        for file_name, tar_file_size, unzipped_chunks in stream_unzip(chunks):
            ....
    

    That seems to work well for me (so far). Do you see any downside?

    If not, it might be a nice addition to the README, as I have finally come across a program I am writing from scratch that benefits from an asyncio solution, with stream-unzip being a key part of it. It took me forever to understand that only a list comprehension supports async iteration.

    opened by gkedge 2
  • doesn't work with smart_open

    doesn't work with smart_open

    If I try to use smart_open (so that we can e.g. stream from S3 buckets), for example:

    with smart_open.open(url,'rb') as f_h:
        for file_name, file_size, unzipped_chunks in stream_unzip(f_h):
            print(file_name)
    

    The code prints one file, and then throws an UnexpectedSignatureError.

    opened by jeroenbaas 2
  • feat: work with most files created by java that have members 4294967295 bytes long

    feat: work with most files created by java that have members 4294967295 bytes long

    We can't be sure up front of the format of the data descriptor in all cases. So we inch our way through the stream, checking what we think is the data descriptor against the known compressed and uncompressed size of the data itself. If there is a match: that's "probably" the data descriptor.

    The issue is that we could have cases where we get a match just because the values happen to coincide. These cases are not likely, but so far I don't see a reason why they're impossible. (A simplified sketch of this kind of check appears after this list.)

    This should address issues reported in https://github.com/uktrade/stream-unzip/issues/33

    opened by michalc 0
  • fix: decrypting legacy zip encryption when no data descriptor

    fix: decrypting legacy zip encryption when no data descriptor

    This should fix the issue reported in https://github.com/uktrade/stream-unzip/issues/29

    (I suspect the Deflate64 thing originally reported was a red herring. It looks like it's more to do with the data descriptor.)

    opened by michalc 0
  • feat: raise UnfinishedIterationError if iteration is unfinished

    feat: raise UnfinishedIterationError if iteration is unfinished

    Addresses some of the concerns at https://github.com/uktrade/stream-unzip/issues/21 (but does not change the API to allow not fully iterating over the bytes of member files)

    opened by michalc 0
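
Regarding the "[Question] How to yield chunks of a zipped file" item above: a minimal sketch, assuming boto3 and a made-up bucket and key, of streaming a ZIP from S3 through stream_unzip and writing each member out chunk by chunk, so no member is ever held in memory whole:

import boto3
from stream_unzip import stream_unzip

def s3_zipped_chunks(bucket, key, chunk_size=65536):
    # Stream the ZIP itself from S3 in fixed-size chunks
    body = boto3.client('s3').get_object(Bucket=bucket, Key=key)['Body']
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            break
        yield chunk

for file_name, file_size, unzipped_chunks in stream_unzip(s3_zipped_chunks('my-bucket', 'my.zip')):
    with open(file_name.decode(), 'wb') as f:
        for chunk in unzipped_chunks:
            f.write(chunk)  # each chunk is written as soon as it is decompressed

Writing back to S3 instead of a local file would follow the same pattern, for example by buffering the chunks into the parts of a multipart upload.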
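
And on the data descriptor matching described in the "feat: work with most files created by java..." pull request above: a simplified, illustrative sketch of the kind of check involved (not stream-unzip's actual implementation), comparing candidate bytes against values already known from decompressing the member:

import struct

DATA_DESCRIPTOR_SIGNATURE = b'PK\x07\x08'

def looks_like_data_descriptor(candidate, known_crc_32, known_compressed, known_uncompressed):
    # A data descriptor may or may not start with the optional signature, and
    # its sizes may be 4 or 8 bytes (Zip64), so try each plausible layout
    layouts = (
        ('<4sIII', True),   # signature, 32-bit sizes
        ('<III', False),    # no signature, 32-bit sizes
        ('<4sIQQ', True),   # signature, 64-bit sizes
        ('<IQQ', False),    # no signature, 64-bit sizes
    )
    for fmt, has_signature in layouts:
        size = struct.calcsize(fmt)
        if len(candidate) < size:
            continue
        fields = struct.unpack(fmt, candidate[:size])
        if has_signature:
            if fields[0] != DATA_DESCRIPTOR_SIGNATURE:
                continue
            fields = fields[1:]
        if fields == (known_crc_32, known_compressed, known_uncompressed):
            return True  # "probably" the data descriptor
    return False
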
Owner
Department for International Trade