A Python module and command line utility for working with web archive data using the WACZ format specification

Webrecorder

Last update: Oct 24, 2022


Overview

py-wacz

The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification. Web Archive Collection Zipped (WACZ) allows web archives to be shared and distributed by providing a predictable way of packaging up web archive data and metadata as a ZIP file. The wacz command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages.
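
Because a WACZ is an ordinary ZIP file, its contents can be inspected with Python's standard zipfile module. The following is a minimal sketch and not part of the wacz module: the helper name summarize_wacz is hypothetical, and it only assumes that the package stores a datapackage.json at its root, as the options below imply.

```python
import json
import zipfile


def summarize_wacz(path):
    """Return (entry names, parsed datapackage.json or None) for a WACZ.

    Hypothetical helper for illustration; not part of the wacz module.
    """
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        datapackage = None
        if "datapackage.json" in names:
            # json.load accepts the binary stream returned by ZipFile.open
            with zf.open("datapackage.json") as fh:
                datapackage = json.load(fh)
    return names, datapackage
```

Calling summarize_wacz("myfile.wacz") would return the list of entries together with the package metadata, if present.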

Install

Use pip to install the module and the command line utility:

pip install wacz

Once installed you can use the wacz command line utility to create and validate WACZ files.

Create

To create a WACZ package you can point wacz at a WARC file and tell it where to write the WACZ with the -o option:

wacz create -o myfile.wacz <path/to/WARC>

The resulting myfile.wacz should be loadable via ReplayWeb.page.

wacz accepts the following options for customizing how the WACZ file is assembled.

-f --file

Explicitly declare the file being passed to the create function.

wacz create -f tests/fixtures/example-collection.warc

-o --output

Explicitly declare the name of the WACZ file being created.

wacz create tests/fixtures/example-collection.warc -o mywacz.wacz

-t --text

Generates the pages.jsonl page index with a full-text index. It must be run in conjunction with --detect-pages and has no effect if run alone.

wacz create tests/fixtures/example-collection.warc --detect-pages -t

--detect-pages

Generates the pages.jsonl page index without a full-text index.

wacz create tests/fixtures/example-collection.warc --detect-pages

-p --pages

Overrides page index generation with the pages from the passed JSONL file.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl

-t --text

You can add a full-text index by including the --text flag:

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl --text

--ts

Overrides the ts metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --ts TIMESTAMP

--url

Overrides the url metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --url URL

--title

Overrides the title metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --title TITLE

--desc

Overrides the desc metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --desc DESC

--hash-type

Specifies the hash type to use (sha256 or md5):

wacz create tests/fixtures/example-collection.warc --hash-type md5
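
When wacz is driven from a script, the options above can be assembled into an argument list and passed to subprocess. This is only a sketch: build_create_command and run_create are hypothetical helper names, and only flags documented in this section are used.

```python
import subprocess


def build_create_command(warc, output, detect_pages=False, text=False,
                         hash_type=None):
    """Assemble an argument list for `wacz create` (hypothetical helper)."""
    cmd = ["wacz", "create", warc, "-o", output]
    if detect_pages:
        cmd.append("--detect-pages")
    if text:
        cmd.append("--text")
    if hash_type is not None:
        # The README lists sha256 and md5 as the accepted values
        if hash_type not in ("sha256", "md5"):
            raise ValueError("hash_type must be 'sha256' or 'md5'")
        cmd += ["--hash-type", hash_type]
    return cmd


def run_create(*args, **kwargs):
    """Run the command; requires the wacz utility to be installed."""
    return subprocess.run(build_create_command(*args, **kwargs), check=True)
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.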

Validate

You can also validate an existing WACZ file by running:

wacz validate -f myfile.wacz

-f --file

Explicitly declare the file being passed to the validate function.

wacz validate -f tests/fixtures/example-collection.wacz

Testing

If you are developing wacz you can run the unit tests with pytest:

pytest tests

Comments

Test failure under Python 3.10
I'm noticing a lot of test failures when running the tests under Python 3.10 (OS X). Glancing at the many errors, they all seem to be related to BadZipFile exceptions encountered during test setup (setupClass?). I did some testing, and it appears that WACZs created with Python 3.10 cannot be read by either Python 3.10 or Python 3.9. Clarification: I noticed the problem when examining the WACZ that was created in the test.


I also noticed that the wacz files created by tests/test_create_wacz.py differ in size between 3.9 and 3.10 (the 3.10 file is 422 bytes smaller than the 3.9 one). Maybe there is a timing issue.

$ ls -l test*.wacz
-rw-r--r-- 1 edsummers staff 4121 Apr 15 11:31 test-3.10.wacz
-rw-r--r-- 1 edsummers staff 4543 Apr 15 11:30 test-3.9.wacz

Looking at the difference with dhex seems to show that information is truncated from the end? Maybe the file isn't being flushed before closing?
opened by edsu 3
Close ZIP once finished

It is important to close the ZIP once data is finished being written to the WACZ or else some of the data may not be flushed to disk. This is probably more important for usage of py-wacz as a library since the file would automatically get flushed when it is used from the command line.
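
The point about closing can be illustrated with the standard zipfile module, where a with-block guarantees that close() runs and the ZIP central directory is written. This is a generic sketch rather than the change made in this PR; write_entries is a hypothetical name.

```python
import zipfile


def write_entries(path, entries):
    """Write a {name: bytes} mapping into a ZIP file and close it.

    Hypothetical helper for illustration; not part of the wacz module.
    """
    with zipfile.ZipFile(path, "w") as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    # Leaving the with-block calls zf.close(), which writes the central
    # directory. If close() never runs, the file on disk may be truncated.
```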

Fixes #20

opened by edsu 2

`datapackage.json` does not pass frictionless data default profile validation

Hi, when processing a WACZ file via the frictionless data package library, it fails to load because some required keys are missing.

This is what I get when loading the .wacz via a library, for this wacz:

  'resources' => 
  array (
    0 => 
    (object) array(
       'path' => 'indexes/index.cdx.gz',
       'stats' => 
      (object) array(
         'hash' => '946da153be52b106c29a493abb76c7ec0b4001f9ecfba8a971bba7550dea3d51',
         'bytes' => 13745,
      ),
       'hashing' => 'sha256',
    ),
    1 => 
    (object) array(
       'path' => 'indexes/index.idx',
       'stats' => 
      (object) array(
         'hash' => '1e5bad4bb5ef03ed276e82a7eb4c8cf1e17187b09a0e409894999009b84a0d2a',
         'bytes' => 211,
      ),
       'hashing' => 'sha256',
    ),
    2 => 
    (object) array(
       'path' => 'archive/sbr_54ec39079692c89cc5eb4823a8054eca_application-yt-2017-fixed-0a0b18b3-d15d-453d-87a1-4df8c5e129a8.warc',
       'stats' => 
      (object) array(
         'hash' => '61bbc2adcf2b04a39e673bad205d1161dfe762cbc368cc32b77890889476b633',
         'bytes' => 20601238,
      ),
       'hashing' => 'sha256',
    ),
    3 => 
    (object) array(
       'path' => 'pages/pages.jsonl',
       'stats' => 
      (object) array(
         'hash' => 'f003f1fc0501445f4090471ad994980fec986f7d486d573c549bd117ee0fea9a',
         'bytes' => 993,
      ),
       'hashing' => 'sha256',
    ),
  ),
   'metadata' => 
  (object) array(
     'title' => 'yt',
  ),
   'wacz_version' => '1.0.0',
)

I get these validation errors:

NOTICE: PHP message: frictionlessdata\datapackage\Datapackages\DefaultDatapackage
NOTICE: PHP message: array (
  0 => 
  frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
     'code' => 8,
     'extraDetails' => '[resources[0].name] The property name is required',
  )),
  1 => 
  frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
     'code' => 8,
     'extraDetails' => '[resources[0].data] The property data is required',
  )),
  2 => 
  frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
     'code' => 8,
     'extraDetails' => '[resources[0]] Failed to match exactly one schema',
  )),
  3 => 
  frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
     'code' => 8,
     'extraDetails' => '[resources[1].name] The property name is required',
  )),
  4 => 
  frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
     'code' => 8,
     'extraDetails' => '[resources[1].data] The property data is required',
  )),
  5 => 
  frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
     'code' => 8,
     'extraDetails' => '[resources[1]] Failed to match exactly one schema',
  )),
  6 => 
  frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
     'code' => 8,
     'extraDetails' => '[resources[2].name] The property name is required',
  )),
  7 => 
  frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
     'code' => 8,
     'extraDetails' => '[resources[2].data] The property data is required',
  )),
  8 => 
  frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
     'code' => 8,
     'extraDetails' => '[resources[2]] Failed to match exactly one schema',
  )),
  9 => 
  frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
     'code' => 8,
     'extraDetails' => '[resources[3].name] The property name is required',
  )),
  10 => 
  frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
     'code' => 8,
     'extraDetails' => '[resources[3].data] The property data is required',
  )),
  11 => 
  frictionlessdata\datapackage\Validators\DatapackageValidationError::__set_state(array(
     'code' => 8,
     'extraDetails' => '[resources[3]] Failed to match exactly one schema',
  )),
)

Looking at the schema, there are two required properties at the resource level that are not defined, although one of them depends on the profile. The one that is always required is name: https://github.com/frictionlessdata/specs/blob/6c7ccd2be7be435c7e7d798d6ec4882e3dfee163/schemas/dictionary/resource.yml#L6-L12

data is modal and, if the profile is set correctly (this is more of a question), is not needed when path is present: https://specs.frictionlessdata.io/data-resource/#data-location

Sadly, the minimal examples in the docs are not clear, but further down the MUST is well stated, and the actual JSON schema used for validation is quite useful for that.

In the meantime, since I'm processing only pages/pages.jsonl, I will fix my code to not process WACZ files as frictionless data packages and just get the particular file. It would be great, though, if the file passed validation for version 1.2, to avoid exception processing for WACZ.
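
A quick way to see which resources would trip the name requirement is to scan the parsed datapackage.json before handing it to a validator. This is a sketch under the assumption that the file has already been loaded into a Python dict; resources_missing_name is a hypothetical helper, not part of py-wacz or the frictionless libraries.

```python
def resources_missing_name(datapackage):
    """Return the paths of resources that lack the required `name` key.

    `datapackage` is the parsed datapackage.json as a dict (assumption).
    """
    return [
        resource.get("path")
        for resource in datapackage.get("resources", [])
        if "name" not in resource
    ]
```

For the package shown above, all four resource paths would be returned, matching the four "property name is required" errors.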

Thanks!

opened by DiegoPino 2

Error "File size unexpectedly exceeded ZIP64 limit" occurs when using py-wacz on a large WARC file

I installed py-wacz in a virtualenv on Debian 10 using Python 3.7.3 today. The alias wacz used below is just shorthand for:

/opt/py-wacz/bin/python3 /opt/py-wacz/bin/wacz

The reference command, when used on a 2.6GB WARC file, returns the following error message for me:

whalehub@pdh:/srv/www/archive# wacz -o test-large.wacz test-large.warc
Generating indexes...
Writing archives...
Traceback (most recent call last):
  File "/opt/py-wacz/bin/wacz", line 33, in <module>
    sys.exit(load_entry_point('wacz==0.1.0', 'console_scripts', 'wacz')())
  File "/opt/py-wacz/lib/python3.7/site-packages/wacz-0.1.0-py3.7.egg/wacz/main.py", line 33, in main
  File "/opt/py-wacz/lib/python3.7/site-packages/wacz-0.1.0-py3.7.egg/wacz/main.py", line 78, in create_wacz
  File "/usr/lib/python3.7/zipfile.py", line 1126, in close
    raise RuntimeError('File size unexpectedly exceeded ZIP64 '
RuntimeError: File size unexpectedly exceeded ZIP64 limit
Exception ignored in: <function ZipFile.__del__ at 0x7f76611e9b70>
Traceback (most recent call last):
  File "/usr/lib/python3.7/zipfile.py", line 1789, in __del__
    self.close()
  File "/usr/lib/python3.7/zipfile.py", line 1798, in close
    raise ValueError("Can't close the ZIP file while there is "
ValueError: Can't close the ZIP file while there is an open writing handle on it. Close the writing handle before closing the zip.
whalehub@pdh:/srv/www/archive#

I've confirmed that the same command works fine when used on a ~30MB WARC file instead:

whalehub@pdh:/srv/www/archive# wacz -o test-small.wacz test-small.warc
Generating indexes...
Writing archives...
Generating metadata...
whalehub@pdh:/srv/www/archive# ls -l
-rw-r--r-- 1 whalehub  whalehub  29078146 Sep  9 05:17 test-small.wacz
-rw-r--r-- 1 whalehub  whalehub  29071676 Sep  9 04:51 test-small.warc
whalehub@pdh:/srv/www/archive#
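
For context, the RuntimeError in the first traceback is raised by the standard zipfile module when an entry opened for writing grows past the 2 GiB limit without ZIP64 having been enabled for that entry. One standard-library way to avoid it when streaming data of unknown size is to pass force_zip64=True to ZipFile.open. The sketch below shows that technique only; it is not necessarily how py-wacz addressed the issue, and add_large_file is a hypothetical name.

```python
import shutil
import zipfile


def add_large_file(zf, src_path, arcname):
    """Stream src_path into an open ZipFile without compression.

    Hypothetical helper for illustration; not part of the wacz module.
    """
    info = zipfile.ZipInfo(arcname)
    info.compress_type = zipfile.ZIP_STORED
    # force_zip64=True writes ZIP64 headers up front, so the entry may
    # exceed 2 GiB even though its final size is not known in advance.
    with open(src_path, "rb") as src, \
            zf.open(info, "w", force_zip64=True) as dst:
        shutil.copyfileobj(src, dst)
```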

opened by whalehub 2

Support single seed, detect pages with extra pages

Support specifying --url, --detect-pages and --split-seeds. In this setup, the main URL is written to pages, and all other detected pages are written to extra-pages.jsonl.

opened by ikreymer 1
Signing/Verification Support

Add optional support for signing + verifying WACZ files.

The signing must be done via an external signing server (running a version of webrecorder/authsign). The verification can be done using a remote server or locally as well.

Bump to 0.4.0

opened by ikreymer 1
Use psf/black for python code formatting

Apply https://github.com/psf/black to the py-wacz project (black .). Also add a Travis check, black . --check, which will verify the formatting.

Let's do it as a separate PR after the validation PR is merged.

opened by ikreymer 1
Allow MD5 as datapackage hash

Also allow MD5 as the hash for datapackage, used only for data integrity. The main reason is that it is more expensive to compute SHA-256 when generating WACZ entirely in the browser... py-wacz should still verify these, but should not need to produce these.
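
Verifying either hash type can be sketched with the standard hashlib module. This is illustrative only: file_digest is a hypothetical helper, and the sha256 and md5 names are taken from the --hash-type option described in the README.

```python
import hashlib


def file_digest(path, hash_type="sha256", chunk_size=65536):
    """Return the hex digest of a file using sha256 or md5."""
    if hash_type not in ("sha256", "md5"):
        raise ValueError("hash_type must be 'sha256' or 'md5'")
    digest = hashlib.new(hash_type)
    with open(path, "rb") as fh:
        # Read in chunks so large WARC files are not loaded into memory
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A verifier would compare the returned value with the hash recorded for the resource in datapackage.json.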

opened by ikreymer 0
Support premade page lists from a crawler
Support specifying page lists via an externally passed-in pages file, e.g. -p pages.jsonl.

The use case is to support specifying a page list from a crawler, which may have partial information: it may not yet have the timestamps, but may have full-text search. Ex:

{"format":"json-pages-1.0","id":"pages","title":"All Pages","hasText":true}
{"url": "https://example.com/"}
{"url": "https://example.com/another_page", "text": "..."}
{"url": "https://example.com/exact_page", "ts": "2020-01-02T00:01:02Z", "text": "..."}

The create function will ensure that each page exists in a WARC and add the timestamp for URLs that do not have them.

For the first entry, the first match of https://example.com/ will be used, and the timestamp will be set in the output. If no https://example.com/ is found, it will be an error. If text detection is enabled, text will be extracted.

For the second entry, the same approach is used. However, if text detection is enabled, the existing text field takes precedence, and extraction is skipped.

For the third entry, there must be an exact match for URL and TS, otherwise it is an error. The specified text is used and takes precedence, and no extraction takes place even with the -t flag.
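
The first step of that process, reading the passed list and finding the entries that still need a timestamp, might look like the sketch below. It assumes the header is the line carrying a "format" key, as in the example above; read_pages and urls_needing_timestamp are hypothetical names, and the WARC lookup itself is not shown.

```python
import json


def read_pages(lines):
    """Split JSONL lines into (header, list of page entries)."""
    header, pages = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: the header is the first record with a "format" key
        if header is None and "format" in record:
            header = record
        else:
            pages.append(record)
    return header, pages


def urls_needing_timestamp(pages):
    """Return URLs of entries with no "ts" field yet."""
    return [page["url"] for page in pages if "ts" not in page]
```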

To figure out:

Support for extra pages files that are added in addition to the main pages file? -pe extraList.jsonl

Support for just a plain text page list -- useful for importing a plain seed list from a crawl, not in jsonl format (treated same as {"url": "https://example.com/"} entries in a JSONL list)

https://example.com/
https://example.com/another_page
https://example.com/exact_page
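
Treating a plain seed list the same as {"url": ...} entries could be as simple as the following sketch. The function name seed_list_to_jsonl is hypothetical and no such option is confirmed to exist in wacz.

```python
import json


def seed_list_to_jsonl(seed_text):
    """Convert one-URL-per-line text into JSONL page entries."""
    urls = [line.strip() for line in seed_text.splitlines() if line.strip()]
    return "\n".join(json.dumps({"url": url}) for url in urls)
```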

This will be the final piece to support converting external data/crawls to WACZ!
opened by ikreymer 0
Command Line Return Code should be 0
We now have the 0 return code in:

if __name__ == "__main__":
    sys.exit(0 if main() else 1)

which works only when calling main.py directly. For the installed wacz script to return 0, the main() function itself should return 0 on success and non-zero on error. Currently, it returns True if success/valid, False otherwise. Alternatively, it could be wrapped in another function that is used by the setup.py wacz script, which may be simpler but also less consistent. In the end, it probably makes sense to be able to have:

if main(['validate', '-f', 'file.wacz']) == 0:
    print('success!')

since that's close to what it would be in a shell script as well.
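
The proposed convention can be sketched as follows. This is not the wacz implementation: validate is a stand-in that only checks the file extension, and the argument handling is deliberately simplified.

```python
import sys


def validate(path):
    """Stand-in for real validation (hypothetical): checks the extension."""
    return path.endswith(".wacz")


def main(args=None):
    """Return 0 on success and 1 on failure, like a shell command."""
    args = sys.argv[1:] if args is None else args
    if len(args) == 3 and args[0] == "validate" and args[1] == "-f":
        return 0 if validate(args[2]) else 1
    return 1

# An entry point (script or console_scripts wrapper) would then call:
#     sys.exit(main())
```

With main() returning an integer, both direct calls from Python and the installed script share the same exit-code semantics.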
opened by ikreymer 0
Validation of WACZ Format
Related to webrecorder/wacz-spec#20, here are the things that should be validated.

[x] Conformance to frictionless data package spec + check s

[x] Ensure files are where they should be, WARCs in archive/, CDX in indexes/

[x] Check for extraneous data

[x] Compression check:

WARCs and compressed cdx.gz should be in ZIP with 'store' compression (not deflate)

Indexes and page list can be compressed

[x] Validate WARCs and CDX by indexing existing WARC to match index?
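
The compression check in the list above can be sketched with the standard zipfile module. This is an illustration, not the validator's actual code; wrongly_compressed is a hypothetical name, and matching entries by file extension is an assumption.

```python
import zipfile


def wrongly_compressed(path):
    """Return names of WARC and cdx.gz entries not stored uncompressed."""
    bad = []
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            name = info.filename
            # Assumption: these suffixes identify WARCs and compressed CDX
            must_store = name.endswith((".warc", ".warc.gz", ".cdx.gz"))
            if must_store and info.compress_type != zipfile.ZIP_STORED:
                bad.append(name)
    return bad
```

An empty list means every WARC and cdx.gz entry uses 'store'; other entries such as the page list are not checked and may be deflated.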
opened by ikreymer 0
[FEATURE] Add a WARC Record Iterator

It would be helpful to have an iterator that walks through all the WARC records of all the WARC files in a WACZ file, treating it externally like a regular WARC file.
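
As a starting point, the WARC files inside a WACZ can be walked with the standard zipfile module. The sketch below only yields one stream per WARC file; splitting each stream into individual records would additionally need a WARC parser, which is not shown. The name iter_warc_streams is hypothetical, and the archive/ prefix is taken from the resource paths shown in an earlier issue.

```python
import zipfile


def iter_warc_streams(wacz_path):
    """Yield (entry name, readable stream) for each WARC in a WACZ.

    Each stream is only valid until the next iteration step.
    """
    with zipfile.ZipFile(wacz_path) as zf:
        for name in zf.namelist():
            is_warc = name.endswith((".warc", ".warc.gz"))
            if name.startswith("archive/") and is_warc:
                with zf.open(name) as stream:
                    yield name, stream
```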

opened by ibnesayeed 0
Tweak README for consistency

The documentation for the flags has inconsistent punctuation. This change remedies that as well as fixes an instance of mismatched plurality in the README.

opened by machawk1 0
Fix URL on PyPI

Fix the URL to point to the correct location webrecorder/py-wacz. Also include Markdown in README.md as the long description text for a better display of what py-wacz is on PyPI.

opened by edsu 1
Some commands documented to interact with WACZ files are invalid

In the README under the Validate header, the instructions state that a WACZ file can be validated with wacz validate myfile.wacz. Trying this in the latest release or from the current main branch causes a runtime error insisting that the -f flag be specified. Doing so causes the validation procedure to execute, but why specify the command without the flag if it is invalid?

opened by machawk1 0
Instructions how to create wacz from browsertrix crawl

A browsertrix crawl usually contains all the information required for a wacz to be created; in particular, text and pages metadata are already present. Is it possible to use that data for creating the wacz?

(Context: browsertrix exited after completing the crawl, leaving an incomplete wacz file, because the disk was full. Everything is already available, just needs to be compiled into wacz.)

opened by despens 0
zipfile.BadZipFile error during wacz creation from warc file - Windows only

I see the following error each time I use the latest py-wacz on Windows:

Traceback (most recent call last):
  File "C:\Program Files\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Program Files\Python310\Scripts\wacz.exe\__main__.py", line 7, in <module>
  File "C:\Program Files\Python310\lib\site-packages\wacz\main.py", line 109, in main
    value = cmd.func(cmd)
  File "C:\Program Files\Python310\lib\site-packages\wacz\main.py", line 320, in create_wacz
    datapackage = wacz_indexer.generate_datapackage(res, wacz)
  File "C:\Program Files\Python310\lib\site-packages\wacz\waczindexer.py", line 343, in generate_datapackage
    with wacz.open(zip_entry, "r") as stream:
  File "C:\Program Files\Python310\lib\zipfile.py", line 1545, in open
    raise BadZipFile(
zipfile.BadZipFile: File name in directory 'archive\cafrussia.ru.warc' and header b'archive/cafrussia.ru.warc' differ.

Environment: Windows 10, Python 3.10.0, wacz 0.4.5

No error happens on Linux or in the Windows Subsystem for Linux environment.
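
The error message suggests that the entry name was built with the Windows path separator, while names inside a ZIP use forward slashes. One way to keep entry names platform-independent is to build them with posixpath instead of os.path. This is a sketch of that idea, not necessarily the fix adopted in py-wacz; both function names are hypothetical.

```python
import posixpath


def zip_entry_name(*parts):
    """Join path parts with "/" regardless of the host platform."""
    return posixpath.join(*parts)


def normalize_entry_name(name):
    """Convert backslashes in an existing entry name to forward slashes."""
    return name.replace("\\", "/")
```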

opened by ivbeg 0

Owner

Webrecorder

Webrecorder Project
