A Python module and command line utility for working with web archive data using the WACZ format specification

py-wacz

The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification. Web Archive Collection Zipped (WACZ) allows web archives to be shared and distributed by providing a predictable way of packaging up web archive data and metadata as a ZIP file. The wacz command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages.

Install

Use pip to install the module and a command line utility:

pip install wacz

Once installed you can use the wacz command line utility to create and validate WACZ files.

Create

To create a WACZ package you can point wacz at a WARC file and tell it where to write the WACZ with the -o option:

wacz create -o myfile.wacz path/to/file.warc

The resulting myfile.wacz should be loadable via ReplayWeb.page.
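
Because a WACZ is a ZIP file, its contents can also be inspected with Python's standard zipfile module. The following is a minimal sketch and not part of py-wacz; the function name list_wacz_entries is hypothetical.

```python
import zipfile


def list_wacz_entries(wacz_path):
    """Return the member names stored in a WACZ (ZIP) file."""
    with zipfile.ZipFile(wacz_path) as zf:
        return zf.namelist()
```

Calling list_wacz_entries("myfile.wacz") on a package created as above would return the paths stored inside it.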

wacz accepts the following options for customizing how the WACZ file is assembled.

-f --file

Explicitly declare the file being passed to the create function.

wacz create -f tests/fixtures/example-collection.warc

-o --output

Explicitly declare the name of the WACZ being created.

wacz create tests/fixtures/example-collection.warc -o mywacz.wacz

-t --text

Generates a pages.jsonl page index with a full-text index. Must be run in conjunction with --detect-pages; it has no effect if run alone.

wacz create tests/fixtures/example-collection.warc --detect-pages -t

--detect-pages

Generates a pages.jsonl page index without a full-text index.

wacz create tests/fixtures/example-collection.warc --detect-pages

-p --pages

Overrides the pages index generation with the passed jsonl pages.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl

-t --text

You can add a full-text index to the passed pages by including the --text flag.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl --text

--ts

Overrides the ts metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --ts TIMESTAMP

--url

Overrides the url metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --url URL

--title

Overrides the title metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --title TITLE

--desc

Overrides the desc metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --desc DESC

--hash-type

Allows the user to specify the hash type used (sha256 or md5):

wacz create tests/fixtures/example-collection.warc --hash-type md5
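
For illustration, digests of either type can be computed with Python's hashlib. This is a sketch of how a file digest might be produced or checked, not py-wacz's own code; file_digest and its parameters are hypothetical names.

```python
import hashlib


def file_digest(path, hash_type="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    if hash_type not in ("sha256", "md5"):
        raise ValueError(f"unsupported hash type: {hash_type}")
    digest = hashlib.new(hash_type)
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large WARC files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```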

Validate

You can also validate an existing WACZ file by running:

wacz validate myfile.wacz

-f --file

Explicitly declare the file being passed to the validate function.

wacz validate -f myfile.wacz

Testing

If you are developing wacz you can run the unit tests with pytest:

pytest tests
Comments
  • Test failure under Python 3.10

    I'm noticing a lot of test failures when running the tests under Python 3.10 (OS X). Glancing at the many errors, they all seem to be related to BadZipFile exceptions encountered during test setup (setupClass). I did some testing, and it appears that WACZs created with Python 3.10 cannot be read by either Python 3.10 or Python 3.9. Clarification: I noticed the problem when examining the WACZ that was created in the test.

    I also noticed that the WACZ files created by tests/test_create_wacz.py differ in size between 3.9 and 3.10 (the 3.10 file is 422 bytes smaller than the 3.9 file). Maybe there is a timing issue.

    $ ls -l test*.wacz
    -rw-r--r--  1 edsummers  staff  4121 Apr 15 11:31 test-3.10.wacz
    -rw-r--r--  1 edsummers  staff  4543 Apr 15 11:30 test-3.9.wacz
    

    Looking at the difference with dhex seems to show that information is truncated from the end. Maybe the file isn't being flushed before closing?

    [Screenshot: dhex comparison of the two WACZ files]
    opened by edsu 3
  • Close ZIP once finished

    It is important to close the ZIP once data is finished being written to the WACZ or else some of the data may not be flushed to disk. This is probably more important for usage of py-wacz as a library since the file would automatically get flushed when it is used from the command line.
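
A minimal sketch of the pattern, using only the standard zipfile module: a with-block guarantees the ZIP is closed, and the central directory written, before the function returns. The helper name write_entries is hypothetical and is not py-wacz code.

```python
import zipfile


def write_entries(zip_path, entries):
    """Write {name: bytes} entries to a ZIP and close it before returning."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    # Leaving the with-block calls zf.close(), which writes the central
    # directory; without it the file on disk may be truncated.
```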

    Fixes #20

    opened by edsu 2
  • `datapackage.json` does not pass frictionless data default profile validation

    Hi. When processing a WACZ file with the frictionless data package library, it fails to load because some required keys are missing.

    This is what I get when loading this .wacz via the library:

      'resources' => 
      array (
        0 => 
        (object) array(
           'path' => 'indexes/index.cdx.gz',
           'stats' => 
          (object) array(
             'hash' => '946da153be52b106c29a493abb76c7ec0b4001f9ecfba8a971bba7550dea3d51',
             'bytes' => 13745,
          ),
           'hashing' => 'sha256',
        ),
        1 => 
        (object) array(
           'path' => 'indexes/index.idx',
           'stats' => 
          (object) array(
             'hash' => '1e5bad4bb5ef03ed276e82a7eb4c8cf1e17187b09a0e409894999009b84a0d2a',
             'bytes' => 211,
          ),
           'hashing' => 'sha256',
        ),
        2 => 
        (object) array(
           'path' => 'archive/sbr_54ec39079692c89cc5eb4823a8054eca_application-yt-2017-fixed-0a0b18b3-d15d-453d-87a1-4df8c5e129a8.warc',
           'stats' => 
          (object) array(
             'hash' => '61bbc2adcf2b04a39e673bad205d1161dfe762cbc368cc32b77890889476b633',
             'bytes' => 20601238,
          ),
           'hashing' => 'sha256',
        ),
        3 => 
        (object) array(
           'path' => 'pages/pages.jsonl',
           'stats' => 
          (object) array(
             'hash' => 'f003f1fc0501445f4090471ad994980fec986f7d486d573c549bd117ee0fea9a',
             'bytes' => 993,
          ),
           'hashing' => 'sha256',
        ),
      ),
       'metadata' => 
      (object) array(
         'title' => 'yt',
      ),
       'wacz_version' => '1.0.0',
    )
    

    I get these validation errors:

    NOTICE: PHP message: frictionlessdata\datapackage\Datapackages\DefaultDatapackage
    NOTICE: PHP message: array (
      0 => 
      frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
         'code' => 8,
         'extraDetails' => '[resources[0].name] The property name is required',
      )),
      1 => 
      frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
         'code' => 8,
         'extraDetails' => '[resources[0].data] The property data is required',
      )),
      2 => 
      frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
         'code' => 8,
         'extraDetails' => '[resources[0]] Failed to match exactly one schema',
      )),
      3 => 
      frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
         'code' => 8,
         'extraDetails' => '[resources[1].name] The property name is required',
      )),
      4 => 
      frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
         'code' => 8,
         'extraDetails' => '[resources[1].data] The property data is required',
      )),
      5 => 
      frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
         'code' => 8,
         'extraDetails' => '[resources[1]] Failed to match exactly one schema',
      )),
      6 => 
      frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
         'code' => 8,
         'extraDetails' => '[resources[2].name] The property name is required',
      )),
      7 => 
      frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
         'code' => 8,
         'extraDetails' => '[resources[2].data] The property data is required',
      )),
      8 => 
      frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
         'code' => 8,
         'extraDetails' => '[resources[2]] Failed to match exactly one schema',
      )),
      9 => 
      frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
         'code' => 8,
         'extraDetails' => '[resources[3].name] The property name is required',
      )),
      10 => 
      frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
         'code' => 8,
         'extraDetails' => '[resources[3].data] The property data is required',
      )),
      11 => 
      frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
         'code' => 8,
         'extraDetails' => '[resources[3]] Failed to match exactly one schema',
      )),
    )
    

    Looking at the schema, two required properties are not defined at the resource level, though one of them depends on the profile. The one that is always required is name: https://github.com/frictionlessdata/specs/blob/6c7ccd2be7be435c7e7d798d6ec4882e3dfee163/schemas/dictionary/resource.yml#L6-L12

    data is modal: if the profile is set correctly (this is more of a question), it is not needed when path is present. https://specs.frictionlessdata.io/data-resource/#data-location

    Sadly, the minimal examples in the docs are not clear, but further down the MUST is well stated, and the actual JSON schema used for validation is quite useful here.

    In the meantime, since I'm only processing pages/pages.jsonl, I will change my code to stop treating WACZ files as frictionless data packages and just read that particular file. It would be great if the file passed validation for version 1.2, to avoid exception processing for WACZ.
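
The missing property reported above can be checked without a frictionless library. The sketch below only looks for the name key that the validation errors mention; it is not a full profile validation, and resources_missing_name is a hypothetical helper.

```python
import json


def resources_missing_name(datapackage_text):
    """Return the paths of resources that have no 'name' property."""
    package = json.loads(datapackage_text)
    return [
        res.get("path", f"<resource {i}>")
        for i, res in enumerate(package.get("resources", []))
        if "name" not in res
    ]
```

Run against the datapackage.json of the WACZ above, this would list all four resource paths, matching the four "name is required" errors.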

    Thanks!

    opened by DiegoPino 2
  • Error "File size unexpectedly exceeded ZIP64 limit" occurs when using py-wacz on a large WACZ file

    I installed py-wacz in a virtualenv on Debian 10 using Python 3.7.3 today. The alias wacz which is used below is just shorthand for:

    /opt/py-wacz/bin/python3 /opt/py-wacz/bin/wacz
    

    The reference command, when used on a 2.6GB WARC file, returns the following error message for me:

    whalehub@pdh:/srv/www/archive# wacz -o test-large.wacz test-large.warc
    Generating indexes...
    Writing archives...
    Traceback (most recent call last):
      File "/opt/py-wacz/bin/wacz", line 33, in <module>
        sys.exit(load_entry_point('wacz==0.1.0', 'console_scripts', 'wacz')())
      File "/opt/py-wacz/lib/python3.7/site-packages/wacz-0.1.0-py3.7.egg/wacz/main.py", line 33, in main
      File "/opt/py-wacz/lib/python3.7/site-packages/wacz-0.1.0-py3.7.egg/wacz/main.py", line 78, in create_wacz
      File "/usr/lib/python3.7/zipfile.py", line 1126, in close
        raise RuntimeError('File size unexpectedly exceeded ZIP64 '
    RuntimeError: File size unexpectedly exceeded ZIP64 limit
    Exception ignored in: <function ZipFile.__del__ at 0x7f76611e9b70>
    Traceback (most recent call last):
      File "/usr/lib/python3.7/zipfile.py", line 1789, in __del__
        self.close()
      File "/usr/lib/python3.7/zipfile.py", line 1798, in close
        raise ValueError("Can't close the ZIP file while there is "
    ValueError: Can't close the ZIP file while there is an open writing handle on it. Close the writing handle before closing the zip.
    whalehub@pdh:/srv/www/archive#
    

    I've confirmed that the same command works fine when used on a ~30MB WARC file instead:

    whalehub@pdh:/srv/www/archive# wacz -o test-small.wacz test-small.warc
    Generating indexes...
    Writing archives...
    Generating metadata...
    whalehub@pdh:/srv/www/archive# ls -l
    -rw-r--r-- 1 whalehub  whalehub  29078146 Sep  9 05:17 test-small.wacz
    -rw-r--r-- 1 whalehub  whalehub  29071676 Sep  9 04:51 test-small.warc
    whalehub@pdh:/srv/www/archive#
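
For context, Python's zipfile raises this RuntimeError when an entry written through ZipFile.open() grows past 2 GiB without ZIP64 having been requested up front. The sketch below shows the standard-library option that avoids it, force_zip64=True. It is an illustration under that assumption, not the fix applied in py-wacz, and add_large_file is a hypothetical name.

```python
import shutil
import zipfile


def add_large_file(zip_path, source_path, arcname):
    """Stream source_path into a new ZIP as arcname, stored uncompressed."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as zf:
        # force_zip64=True writes a ZIP64 header for this entry, so the
        # entry may exceed 2 GiB even though its size is not known yet.
        with open(source_path, "rb") as src, zf.open(
            arcname, "w", force_zip64=True
        ) as dst:
            shutil.copyfileobj(src, dst)
```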
    
    opened by whalehub 2
  • Support single seed, detect pages with extra pages

    Support specifying --url, --detect-pages and --split-seeds. In this setup, the main URL is written to pages.jsonl and all other detected pages are written to extra-pages.jsonl.

    opened by ikreymer 1
  • Signing/Verification Support

    Add optional support for signing + verifying WACZ files.

    The signing must be done via an external signing server (running a version of webrecorder/authsign). The verification can be done using a remote server or locally as well.

    Bump to 0.4.0

    opened by ikreymer 1
  • Use psf/black for python code formatting

    Apply https://github.com/psf/black to the py-wacz project (black .). Also add a Travis check, black . --check, which will verify the formatting.

    Let's do it as a separate PR after the validation PR is merged.

    opened by ikreymer 1
  • Allow MD5 as datapackage hash

    Also allow MD5 as the hash for datapackage, used only for data integrity. The main reason is that it is more expensive to compute SHA-256 when generating WACZ entirely in the browser... py-wacz should still verify these, but should not need to produce these.

    opened by ikreymer 0
  • Support premade page lists from a crawler

    Support specifying page lists via an externally passed in pages files, eg. -p pages.jsonl.

    The use case is to support specifying a page list from a crawler, which may have partial information: it may not yet have the timestamps, but may have full-text search. Ex:

    {"format":"json-pages-1.0","id":"pages","title":"All Pages","hasText":true}
    {"url": "https://example.com/"}
    {"url": "https://example.com/another_page", "text": "..."}
    {"url": "https://example.com/exact_page", "ts": "2020-01-02T00:01:02Z", "text": "..."}
    

    The create function will ensure that each page exists in a WARC and add the timestamp for URLs that do not have them.

    • For the first entry, the first match of https://example.com/ will be used, and the timestamp will be set in the output. If no https://example.com/ is found, it will be an error. If text detection is enabled, text will be extracted.

    • For the second entry, the same approach is used. However, if text detection is enabled, the existing text field takes precedence, and extraction is skipped.

    • For the third entry, there must be an exact match for URL and TS, otherwise it is an error. The specified text takes precedence, and no extraction takes place even with the -t flag.

    To figure out:

    • Support for extra pages files that are added in addition to the main pages file? -pe extraList.jsonl
    • Support for just a plain text page list -- useful for importing a plain seed list from a crawl, not in jsonl format (treated same as {"url": "https://example.com/"} entries in a JSONL list)
    https://example.com/
    https://example.com/another_page
    https://example.com/exact_page
    

    This will be the final piece to support converting external data/crawls to WACZ!
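
The two input shapes discussed above (a JSONL pages file with a header line, and a plain list of URLs) could be read by one small parser. This is a sketch under the assumptions stated in this proposal, not py-wacz's implementation; load_page_list is a hypothetical name, and timestamp lookup in the WARC is left out.

```python
import json


def load_page_list(lines):
    """Parse a page list given as JSONL records or as plain URLs, one per line."""
    header, pages = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("{"):
            record = json.loads(line)
            if "format" in record and "url" not in record:
                header = record  # e.g. the json-pages-1.0 header line
            else:
                pages.append(record)
        else:
            # A bare URL is treated the same as a {"url": ...} entry.
            pages.append({"url": line})
    return header, pages
```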

    opened by ikreymer 0
  • Command Line Return Code should be 0

    We now have the 0 return code in:

    if __name__ == "__main__":
        sys.exit(0 if main() else 1)
    
    

    which works only if calling main.py directly. For the installed wacz script to return 0, the main() function itself should return 0 on success and non-zero on error. Currently, it returns True if success/valid, False otherwise. Alternatively, it could be wrapped in another function that is used by the setup.py wacz script; that may be simpler, but is also less consistent. In the end, it probably makes sense to be able to write:

    if main(['validate', '-f', 'file.wacz']) == 0:
       print('success!')
    

    since that's close to what it would be in a shell script also..
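
The wrapper variant mentioned above could look like the following sketch. The names to_exit_code and run are hypothetical, and main_func stands in for whatever bool-returning entry point is being wrapped.

```python
def to_exit_code(result):
    """Map a success flag to a shell-style exit status: 0 on success, 1 otherwise."""
    return 0 if result else 1


def run(main_func, argv):
    """Call a bool-returning main function and return an exit status."""
    return to_exit_code(main_func(argv))
```

The mapping matters because a console script passes the return value of its entry point to sys.exit(), and sys.exit(True) exits with status 1, since True is the integer 1.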

    opened by ikreymer 0
  • Validation of WACZ Format

    Related to webrecorder/wacz-spec#20, here are the things that should be validated.

    • [x] Conformance to frictionless data package spec + check s
    • [x] Ensure files are where they should be, WARCs in archive/, CDX in indexes/
    • [x] Check for extraneous data
    • [x] Compression check:
      • WARCs and compressed cdx.gz should be in ZIP with 'store' compression (not deflate)
      • Indexes and page list can be compressed
    • [x] Validate WARCs and CDX by indexing existing WARC to match index?
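
The compression check in the list above can be sketched with the standard zipfile module. This is an illustration, not py-wacz's validator; deflated_archive_entries is a hypothetical name, and matching entries by file suffix is an assumption made for the example.

```python
import zipfile


def deflated_archive_entries(wacz_path):
    """Return names of WARC and cdx.gz entries that are not stored uncompressed."""
    must_be_stored = (".warc", ".warc.gz", ".cdx.gz")
    with zipfile.ZipFile(wacz_path) as zf:
        return [
            info.filename
            for info in zf.infolist()
            if info.filename.endswith(must_be_stored)
            and info.compress_type != zipfile.ZIP_STORED
        ]
```

An empty result means every WARC and cdx.gz entry uses 'store'; other entries, such as the page list, are allowed to be compressed and are ignored.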
    opened by ikreymer 0
  • [FEATURE] Add a WARC Record Iterator

    It would be helpful to have an iterator that walks through all the WARC records of all the WARC files in a WACZ file, treating it externally like a regular WARC file.
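
One possible shape for such a helper, sketched with the standard library only: it yields an open stream for each WARC stored in the WACZ, and each stream could then be handed to a WARC parser (for example warcio's ArchiveIterator) to walk the records. The name iter_warc_streams and the archive/ prefix check are assumptions for illustration, not an existing py-wacz API.

```python
import zipfile


def iter_warc_streams(wacz_path):
    """Yield (name, stream) for each WARC file stored in a WACZ."""
    with zipfile.ZipFile(wacz_path) as zf:
        for name in zf.namelist():
            if name.startswith("archive/") and name.endswith((".warc", ".warc.gz")):
                # Each stream is only valid until the next iteration step.
                with zf.open(name) as stream:
                    yield name, stream
```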

    opened by ibnesayeed 0
  • Tweak README for consistency

    The documentation for the flags has inconsistent punctuation. This change remedies that as well as fixes an instance of mismatched plurality in the README.

    opened by machawk1 0
  • Fix URL on PyPI

    Fix the URL to point to the correct location webrecorder/py-wacz. Also include Markdown in README.md as the long description text for a better display of what py-wacz is on PyPI.

    opened by edsu 1
  • Some commands documented to interact with WACZ files are invalid

    In the README under the Validate header, the instructions state that a WACZ file can be validated with wacz validate myfile.wacz. Trying this in the latest release or from the current main branch causes a runtime error insisting that the -f flag be specified. Doing so causes the validation procedure to execute, but why specify the command without the flag if it is invalid?

    opened by machawk1 0
  • Instructions how to create wacz from browsertrix crawl

    A browsertrix crawl usually contains all the information required for a wacz to be created, especially text and pages metadata is already present. Is it possible to use that data for creating the wacz?

    (Context: browsertrix exited after completing the crawl, leaving an incomplete wacz file, because the disk was full. Everything is already available, just needs to be compiled into wacz.)

    opened by despens 0
  • zipfile.BadZipFile error during wacz creation from warc file - Windows only

    [Image]

    I see the following error each time I use the latest py-wacz on Windows:

    Traceback (most recent call last):
      File "C:\Program Files\Python310\lib\runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "C:\Program Files\Python310\lib\runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "C:\Program Files\Python310\Scripts\wacz.exe\__main__.py", line 7, in <module>
      File "C:\Program Files\Python310\lib\site-packages\wacz\main.py", line 109, in main
        value = cmd.func(cmd)
      File "C:\Program Files\Python310\lib\site-packages\wacz\main.py", line 320, in create_wacz
        datapackage = wacz_indexer.generate_datapackage(res, wacz)
      File "C:\Program Files\Python310\lib\site-packages\wacz\waczindexer.py", line 343, in generate_datapackage
        with wacz.open(zip_entry, "r") as stream:
      File "C:\Program Files\Python310\lib\zipfile.py", line 1545, in open
        raise BadZipFile(
    zipfile.BadZipFile: File name in directory 'archive\cafrussia.ru.warc' and header b'archive/cafrussia.ru.warc' differ.

    Environment: Windows 10, Python 3.10.0, wacz 0.4.5

    No error happens on Linux or in the Windows Subsystem for Linux environment.
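
The error message shows an entry name built with a backslash, whereas ZIP member names use forward slashes. A plausible cause, stated here as an assumption and not confirmed from the py-wacz source, is a name joined with the platform's path separator. A minimal sketch of an OS-independent join, with the hypothetical helper zip_entry_name:

```python
import posixpath


def zip_entry_name(*parts):
    """Join path components with '/', regardless of the host operating system."""
    return posixpath.join(*parts)
```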

    opened by ivbeg 0
Owner
Webrecorder
Webrecorder Project