Python function to stream unzip all the files in a ZIP archive, without loading the entire ZIP file or any of its files into memory at once

Overview


Python function to stream unzip all the files in a ZIP archive, without loading the entire ZIP file or any of its uncompressed files into memory.

While the ZIP format does have its main directory at the end, each compressed file in the archive can be prefixed with a header that contains its name, compressed size, and uncompressed size: this is what makes streaming decompression of ZIP files possible.

Unfortunately not all ZIP files have this: some have their compressed and uncompressed sizes after the file data in the stream. In this case a ValueError will be raised.
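
To make the header structure concrete, here is a minimal sketch (not part of stream-unzip's API, just an illustration of the layout described above) that unpacks the fixed 30-byte part of a local file header. The flag check shows the case where the sizes only appear after the file data:

import struct

LOCAL_FILE_HEADER_SIGNATURE = b'PK\x03\x04'
# Fixed part of a ZIP local file header: 30 bytes, little-endian
LOCAL_FILE_HEADER_STRUCT = struct.Struct('<4sHHHHHIIIHH')

def parse_local_file_header(header_bytes):
    (signature, version_needed, flags, compression, mod_time, mod_date,
     crc_32, compressed_size, uncompressed_size,
     file_name_len, extra_field_len) = LOCAL_FILE_HEADER_STRUCT.unpack(header_bytes)
    if signature != LOCAL_FILE_HEADER_SIGNATURE:
        raise ValueError('Not a local file header')
    # If bit 3 of the flags is set, the sizes above are zero and the real
    # values only appear in a "data descriptor" after the file data
    sizes_follow_data = bool(flags & 0x08)
    return file_name_len, extra_field_len, compressed_size, uncompressed_size, sizes_follow_data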

Installation

pip install stream-unzip

Usage

A single function, stream_unzip, is exposed. It takes a single argument: an iterable that should yield the bytes of a ZIP file. It returns an iterable, where each yielded item is a tuple of the file name, the file size, and another iterable that itself yields the unzipped bytes of that file.

from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)

The file name and file size are as reported by the ZIP file. If you don't trust the creator of the ZIP file, these should be treated as untrusted input.
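
For example, a minimal sketch of writing each member to a directory while treating the reported name as untrusted; the name handling here is illustrative and not part of stream-unzip:

import os
from stream_unzip import stream_unzip

def save_all(zipped_chunks, target_dir):
    for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks):
        # file_name is bytes taken straight from the archive, so decode it and
        # keep only the final path component rather than trusting it as a path
        safe_name = os.path.basename(file_name.decode('utf-8', 'replace'))
        with open(os.path.join(target_dir, safe_name), 'wb') as f:
            for chunk in unzipped_chunks:
                f.write(chunk)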

Comments
  • stream_unzip.UnexpectedSignatureError

    stream_unzip.UnexpectedSignatureError

    When decompressing a zip file this error is raised:

    stream_unzip.UnexpectedSignatureError: b'\xb66\x97\x89'

    Part of my code:

    def read_in_chunks(infile, chunk_size=UNZIP_CHUNK_SIZE):
        while True:
            chunk = infile.read(chunk_size)
            if chunk:
                yield chunk
            else:
                return

    with open(path_to_zip_file, 'rb') as file_object:
        for file_name, file_size, file_chunks in stream_unzip(read_in_chunks(file_object), chunk_size=UNZIP_CHUNK_SIZE):
            print(file_name)
            with open(file_name, 'ab') as file_append:
                for chunk in file_chunks:
                    file_append.write(chunk)
    

    Python 3.9 stream-unzip 0.0.70

    I can unzip my file with WinRAR and zipfile, but can't with stream-unzip.

    Any idea how to fix it?

    opened by vitkovay 11
  • [Question] Are encrypted zip files using Deflate64 as the compression method supported?

    [Question] Are encrypted zip files using Deflate64 as the compression method supported?

    Hi there. From this PR, it seems that encrypted zip files are now supported, but I found that I cannot unzip a password-protected zip file that uses Deflate64 as the compression method. I am not sure whether it is not implemented yet or it is an unexpected bug. Thanks.

    opened by raychanks 10
  • ZeroDivisionError in the end of zip file

    ZeroDivisionError in the end of zip file

    Thanks for the lib. Got ZeroDivisionError: integer division or modulo by zero while processing the zip file's last chunk, using the example code snippet:

    Traceback (most recent call last):
      File "***\tmp.py", line 9, in <module>
        for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
      File "***\venv_win32_39py\lib\site-packages\stream_unzip.py", line 180, in stream_unzip
        for _ in yield_all():
      File "***\venv_win32_39py\lib\site-packages\stream_unzip.py", line 35, in _yield_all
        offset = (offset + to_yield) % len(chunk)
    ZeroDivisionError: integer division or modulo by zero
    

    Code snippet:

    import httpx
    from stream_unzip import stream_unzip
    
    
    def zipped_chunks():
        # Any iterable that yields a zip file
        with httpx.stream('GET', 'https://www.gyan.dev/ffmpeg/builds/packages/ffmpeg-4.4-essentials_build.zip') as r:
            yield from r.iter_bytes()
    
    
    for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
        for chunk in unzipped_chunks:
            # print(chunk)
            print(file_name, file_size)
    

    Python 3.9.5 (Windows 10) stream-unzip 0.0.23

    opened by tropicoo 5
  • Error uncompressing a Zip64 file

    Error uncompressing a Zip64 file

    When decompressing a Zip64 file this error is raised:

    stream_unzip.UnexpectedSignatureError: b'8\\\x12]'
    

    Have you encountered this before?

    Thanks,

    Rusty

    opened by rustyconover 4
  • Not iterating all chunks explicitly causes signature error

    Not iterating all chunks explicitly causes signature error

    I stumbled on a bug while trying to see if smart_open would work: if you do not iterate all chunks ("unzipped_chunks") there will be a signature error:

    from stream_unzip import stream_unzip, UnexpectedSignatureError
    import smart_open
    
    url="https://transfer.sh/get/cp4LTN/test.zip"
    
    print("without chunk iteration")
    try:
        with smart_open.open(url,'rb') as f_h:
            for file_name, file_size, unzipped_chunks in stream_unzip(f_h):
                print(file_name)
    except UnexpectedSignatureError as e:
        print("UnexpectedSignatureError: "+str(e))
    
    print("with chunk iteration")
    try:
        with smart_open.open(url,'rb') as f_h:
            for file_name, file_size, unzipped_chunks in stream_unzip(f_h):
                print(file_name)
                for chunk in unzipped_chunks:
                    continue
    except UnexpectedSignatureError as e:
        print("UnexpectedSignatureError: "+str(e))
    

    result:

    $ python .\stream_unzip_test.py
    without chunk iteration
    b'test.txt'
    UnexpectedSignatureError: b'test'
    with chunk iteration
    b'test.txt'
    b'test1.txt'
    b'xsubtest.zip'
    
    opened by jeroenbaas 4
  • [Question] How to yield chunks of a zipped file

    [Question] How to yield chunks of a zipped file

    Hi,

    I am trying to create a low-memory stream unzipper that can be run in AWS Lambda. Currently I can unzip individual files from a .zip package, but the problem I am facing is that I need to read each file into memory, which for a large file (15 GB) gets the process killed by the runtime.

    I came across this library, but I am unsure it delivers what I need. Or maybe it does but I don't see how.

    This is my example below, using the normal zipfile from the standard lib.

    import zipfile 
    import s3fs
    _zipfile_interface = zipfile.ZipFile
    _s3_interface = s3fs.S3FileSystem(anon=False)
    _READING_MODE = 'rb'
    _WRITING_MODE = 'wb'
    
    
    def extract_file(filename: str, output_uri: str):
        """
        Extracts filename from the zip package into the output S3 URI
    
        Args:
            filename (str): Name of the file inside zip package
            output_uri (str): Output S3 URI
    
        Raises:
            exceptions.ExtractionFailure: In case extraction errors out
        """
        with _s3_interface.open(s3_uri, _READING_MODE) as f_in:
            zip_buffer = _zipfile_interface(f_in)
    
            file_buffer = zip_buffer.open(filename)
    
            with _s3_interface.open(output_uri, _WRITING_MODE) as f_out:
    
                f_out.write(file_buffer.read())  # -> the issue lies here
    

    Do ignore the S3 interface; for the purposes of this example, it's just a normal open().

    I have profiled this method here:

    Line #    Mem usage    Increment  Occurrences   Line Contents
    =============================================================
        78     66.2 MiB     66.2 MiB           1       @profile
        79                                             def extract_file(filename: str, output_uri: str):
        80                                                 """
        81                                                 Extracts filename from zip package into input output S3 URI
        82                                         
        83                                                 Args:
        84                                                     filename (str): Name of the file inside zip package
        85                                                     output_uri (str): Output S3 URI
        86                                         
        87                                                 Raises:
        88                                                     exceptions.ExtractionFailure: In case extraction errors out
        89                                                 """
        90     66.2 MiB      0.0 MiB           1           try:
        91     79.1 MiB      2.1 MiB           2               with _s3_interface.open(package.s3_uri, _READING_MODE) as f_in:
        92     73.1 MiB      0.0 MiB           1                   zip_buffer = _zipfile_interface(f_in)  # type: ignore
        93                                         
        94     83.9 MiB     10.8 MiB           1                   file_buffer = zip_buffer.open(filename)
        95                                         
        96     83.9 MiB      0.0 MiB           2                   with _s3_interface.open(output_uri, _WRITING_MODE) as f_out:
        97                                         
        98     83.9 MiB      0.0 MiB           1                       f_out.write(file_buffer.read())  # type: ignore
        99                                         
    

    On line 94 you can see the jump in memory, about 10 MB, from the zip_buffer.open(filename) call: this is when the file is consumed into memory, and only then can I write it out to the output buffer.

    Ideally, I would like that zip_buffer.open call to yield chunks of size N that I can just write to the output buffer (a sketch of doing this with stream-unzip appears after this list).

    Any knowledge on this?

    Thank you, Alex

    opened by alexanderluiscampino 3
  • [question] asyncio Zip File of Zipped Chunks

    [question] asyncio Zip File of Zipped Chunks

    Consider streaming in a zip file.

    def zipped_chunks(zipfile_name: PurePath):
        # Iterable that yields the bytes of a zip file
        with open(zipfile_name, "r+b", buffering=io.DEFAULT_BUFFER_SIZE) as zip_f:
            yield zip_f.read()
    

    I am attempting to unzip a number of large zip files concurrently that are hosted on a slow network drive. Do you see any value in leveraging the aiofiles package to stream the read, like so?

    async def zipped_chunks(zipfile_name: PurePath):
        # Iterable that yields the bytes of a zip file
        async with aiofiles.open(zipfile_name, "r+b", buffering=io.DEFAULT_BUFFER_SIZE) as zip_f:
            yield await zip_f.read()
    
    async def unzip_tar_files(self, zipfile_name: PurePath):
        chunks: List[bytes] = [data async for data in self.zipped_chunks(zipfile_name)]
        for file_name, tar_file_size, unzipped_chunks in stream_unzip(chunks):
            ....
    

    That seems to work well for me (so far). Do you see any downside?

    If not, it might be a nice addition to the README, as I have finally come across a program I am writing from scratch that benefits from an asyncio solution, with stream-unzip being a key part of it. It took me forever to understand that only a list comprehension supports async iteration.

    opened by gkedge 2
  • doesn't work with smart_open

    doesn't work with smart_open

    If I try to use smart_open (so that we can e.g. stream from S3 buckets), for example:

    with smart_open.open(url,'rb') as f_h:
        for file_name, file_size, unzipped_chunks in stream_unzip(f_h):
            print(file_name)
    

    The code prints one file, and then throws an UnexpectedSignatureError.

    opened by jeroenbaas 2
  • feat: work with most files created by java that have members 4294967295 bytes long

    feat: work with most files created by java that have members 4294967295 bytes long

    We can't be sure up front of the format of the data descriptor in all cases. So we inch our way through the stream, checking what we think is the data descriptor against the known compressed and uncompressed size of the data itself. If there is a match: that's "probably" the data descriptor.

    The issue is that we could have cases where we get a match just because the values happen to coincide. These cases are not likely, but so far I don't see a reason why they're impossible. (A simplified sketch of this kind of check appears after this list.)

    This should address issues reported in https://github.com/uktrade/stream-unzip/issues/33

    opened by michalc 0
  • fix: decrypting legacy zip encryption when no data descriptor

    fix: decrypting legacy zip encryption when no data descriptor

    This should fix the issue reported in https://github.com/uktrade/stream-unzip/issues/29

    (I suspect the Deflate64 thing originally reported was a red herring. It looks like it's more to do with the data descriptor.)

    opened by michalc 0
  • feat: raise UnfinishedIterationError if iteration is unfinished

    feat: raise UnfinishedIterationError if iteration is unfinished

    Addresses some of the concerns at https://github.com/uktrade/stream-unzip/issues/21 (but does not change the API to allow not fully iterating over the bytes of member files)

    opened by michalc 0
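
Regarding the "[Question] How to yield chunks of a zipped file" item above: a minimal sketch, assuming boto3 and a made-up bucket and key, of streaming a ZIP from S3 through stream_unzip and writing each member out chunk by chunk, so no member is ever held in memory whole:

import boto3
from stream_unzip import stream_unzip

def s3_zipped_chunks(bucket, key, chunk_size=65536):
    # Stream the ZIP itself from S3 in fixed-size chunks
    body = boto3.client('s3').get_object(Bucket=bucket, Key=key)['Body']
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            break
        yield chunk

for file_name, file_size, unzipped_chunks in stream_unzip(s3_zipped_chunks('my-bucket', 'my.zip')):
    with open(file_name.decode(), 'wb') as f:
        for chunk in unzipped_chunks:
            f.write(chunk)  # each chunk is written as soon as it is decompressed

Writing back to S3 instead of a local file would follow the same pattern, for example by buffering the chunks into the parts of a multipart upload.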
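
And on the data descriptor matching described in the "feat: work with most files created by java..." pull request above: a simplified, illustrative sketch of the kind of check involved (not stream-unzip's actual implementation), comparing candidate bytes against values already known from decompressing the member:

import struct

DATA_DESCRIPTOR_SIGNATURE = b'PK\x07\x08'

def looks_like_data_descriptor(candidate, known_crc_32, known_compressed, known_uncompressed):
    # A data descriptor may or may not start with the optional signature, and
    # its sizes may be 4 or 8 bytes (Zip64), so try each plausible layout
    layouts = (
        ('<4sIII', True),   # signature, 32-bit sizes
        ('<III', False),    # no signature, 32-bit sizes
        ('<4sIQQ', True),   # signature, 64-bit sizes
        ('<IQQ', False),    # no signature, 64-bit sizes
    )
    for fmt, has_signature in layouts:
        size = struct.calcsize(fmt)
        if len(candidate) < size:
            continue
        fields = struct.unpack(fmt, candidate[:size])
        if has_signature:
            if fields[0] != DATA_DESCRIPTOR_SIGNATURE:
                continue
            fields = fields[1:]
        if fields == (known_crc_32, known_compressed, known_uncompressed):
            return True  # "probably" the data descriptor
    return False
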
Owner
Department for International Trade