A collection of robust and fast processing tools for parsing and analyzing web archive data.

Overview

ChatNoir Resiliparse

Resiliparse is part of the ChatNoir web analytics toolkit. If you use ChatNoir or any of its tools for a publication, you can make us happy by citing our ECIR demo paper:

@InProceedings{bevendorff:2018,
  address =             {Berlin Heidelberg New York},
  author =              {Janek Bevendorff and Benno Stein and Matthias Hagen and Martin Potthast},
  booktitle =           {Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018)},
  editor =              {Leif Azzopardi and Allan Hanbury and Gabriella Pasi and Benjamin Piwowarski},
  ids =                 {potthast:2018c,stein:2018c},
  month =               mar,
  publisher =           {Springer},
  series =              {Lecture Notes in Computer Science},
  site =                {Grenoble, France},
  title =               {{Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl}},
  year =                2018
}

Usage Instructions

For detailed information about the build process, dependencies, APIs, or usage instructions, please read the Resiliparse Documentation.

Resiliparse Module Summary

The Resiliparse collection currently comprises the following two modules:

1. Resiliparse

The Resiliparse main module with the following subcomponents:

Parsing Utilities

The Resiliparse Parsing Utilities are the largest submodule and provide an extensive (and growing) collection of efficient tools for dealing with encodings and raw protocol payloads, parsing HTML web pages, and preparing them for further processing by extracting structural or semantic information.

Main documentation: Resiliparse Parsing Utilities

Process Guards

The Resiliparse Process Guard module is a set of decorators and context managers for guarding a processing context to stay within pre-defined limits for execution time and memory usage. Process Guards help to ensure the (partially) successful completion of batch processing jobs in which individual tasks may time out or use abnormal amounts of memory, but in which the success of the whole job is not threatened by (a few) individual failures. A guarded processing context will be interrupted upon exceeding its resource limits so that the task can be skipped or rescheduled.

Main documentation: Resiliparse Process Guards
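
To make the idea of a guarded processing context concrete, here is a minimal standard-library sketch of a time limit around individual batch tasks. It is not Resiliparse's implementation or API: the names time_limit, TaskTimeout, and run_batch are hypothetical, the sketch relies on SIGALRM (so it assumes a Unix system and the main thread), and it covers execution time only, not memory usage.

```python
import signal
import time
from contextlib import contextmanager


class TaskTimeout(Exception):
    """Raised inside a guarded block when its time budget is used up (hypothetical name)."""


@contextmanager
def time_limit(seconds: float):
    """Interrupt the enclosed block after `seconds` (Unix only, main thread only)."""
    def _on_alarm(signum, frame):
        raise TaskTimeout(f"exceeded {seconds} s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Always disarm the timer and restore the old handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


def run_batch(tasks, seconds: float):
    """Run (name, callable) pairs; skip tasks that exceed the time limit."""
    done, skipped = [], []
    for name, func in tasks:
        try:
            with time_limit(seconds):
                done.append((name, func()))
        except TaskTimeout:
            skipped.append(name)  # could also be rescheduled instead
    return done, skipped


def quick():
    return 42


def stuck():
    # Stand-in for a task that never finishes on its own.
    while True:
        time.sleep(0.01)


done, skipped = run_batch([("quick", quick), ("stuck", stuck)], seconds=0.2)
print("done:", done)
print("skipped:", skipped)
```

The batch as a whole completes even though one task never returns, which is the behaviour described above: the failing task is interrupted and skipped, and the remaining results are kept.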

Itertools

Resiliparse Itertools are a collection of convenient and robust helper functions for iterating over data from unreliable sources using other tools from the Resiliparse toolkit.

Main documentation: Resiliparse Itertools
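
As an illustration of what iterating over an unreliable source can look like, the following standard-library sketch wraps an iterator whose reads sometimes fail and skips the failed items. The helper skip_failures and the FlakySource class are hypothetical stand-ins for this example only; they are not taken from Resiliparse Itertools, whose actual helpers are described in the main documentation.

```python
from typing import Iterable, Iterator, Tuple, Type, TypeVar

T = TypeVar("T")


class FlakySource:
    """Iterator standing in for an unreliable source: reading a None item fails."""

    def __init__(self, items: Iterable):
        self._items = iter(items)

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._items)  # StopIteration ends the iteration as usual
        if item is None:
            raise OSError("read failed")
        return item


def skip_failures(
    source: Iterable[T],
    errors: Tuple[Type[BaseException], ...] = (OSError,),
) -> Iterator[T]:
    """Yield items from `source`, skipping reads that raise one of `errors`."""
    it = iter(source)
    while True:
        try:
            item = next(it)
        except StopIteration:
            return
        except errors:
            continue  # drop the failed read and try the next item
        yield item


print(list(skip_failures(FlakySource([1, None, 2, None, 3]))))
```

Note that this only works for iterators that remain usable after raising; a plain generator function is finished once an exception escapes from it.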

2. FastWARC

FastWARC is a high-performance WARC parsing library for Python written in C++/Cython. The API is largely inspired by WARCIO, but does not aim to be a drop-in replacement. FastWARC supports compressed and uncompressed WARC/1.0 and WARC/1.1 streams. Supported compression algorithms are GZip and LZ4.

Main documentation: FastWARC and FastWARC CLI
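
For readers unfamiliar with the format, the sketch below shows the basic layout of a WARC stream that a parser such as FastWARC iterates over: a version line, header fields, a blank line, and a block of Content-Length bytes, with each record optionally compressed as its own GZip member. It uses only the Python standard library and deliberately does not use FastWARC's API. The helper names build_record and iter_records are hypothetical, and the toy records omit mandatory fields such as WARC-Record-ID and WARC-Date for brevity.

```python
import gzip
import io


def build_record(target_uri: str, body: bytes) -> bytes:
    """Serialize a simplified WARC/1.0 record (toy example, not spec-complete)."""
    header_lines = [
        b"WARC/1.0",
        b"WARC-Type: resource",
        b"WARC-Target-URI: " + target_uri.encode("utf-8"),
        b"Content-Length: " + str(len(body)).encode("ascii"),
    ]
    # Headers end with an empty line; the block is followed by two CRLFs.
    return b"\r\n".join(header_lines) + b"\r\n\r\n" + body + b"\r\n\r\n"


def iter_records(stream):
    """Yield (version, headers, body) tuples from a binary WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        version = line.strip().decode("ascii")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # end of the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        body = stream.read(int(headers["Content-Length"]))
        yield version, headers, body


# Two records, each compressed as a separate GZip member.
data = gzip.compress(build_record("https://example.com/a", b"hello")) + \
    gzip.compress(build_record("https://example.com/b", b"world!"))

for version, headers, body in iter_records(gzip.GzipFile(fileobj=io.BytesIO(data))):
    print(version, headers["WARC-Target-URI"], body)
```

The same iter_records function also accepts an uncompressed stream, for example io.BytesIO(build_record(...)).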

Installation

The main Resiliparse package can be installed from PyPI as follows:

pip install resiliparse

FastWARC is distributed as a separate package and can be installed as follows:

pip install fastwarc

For optimal performance, however, it is recommended to build FastWARC from sources instead of relying on the pre-built binaries. See below for more information.

Building From Source

To build Resiliparse or FastWARC from sources, you need to install all required build-time dependencies first. On Ubuntu, this is done as follows:

# Add Lexbor repository
curl -L https://lexbor.com/keys/lexbor_signing.key | sudo apt-key add -
echo "deb https://packages.lexbor.com/ubuntu/ $(lsb_release -sc) liblexbor" | \
    sudo tee /etc/apt/sources.list.d/lexbor.list

# Install build dependencies
sudo apt update
sudo apt install build-essential python3-dev zlib1g-dev \
    liblz4-dev libuchardet-dev liblexbor-dev

Then, to build the actual packages, run:

# Optional: Create a fresh venv first
python3 -m venv venv && source venv/bin/activate

# Build and install Resiliparse
pip install -e resiliparse

# Build and install FastWARC
pip install -e fastwarc

Instead of building the packages from this repository, you can also build them from the PyPI source packages:

# Build Resiliparse from PyPI
pip install --no-binary resiliparse resiliparse

# Build FastWARC from PyPI
pip install --no-binary fastwarc fastwarc
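
After either installation route, a quick way to confirm which versions ended up in the active environment is to query the package metadata from Python. This is a small sketch using only the standard library; the helper name installed_version is hypothetical, and the distribution names are the ones used in the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


for name in ("resiliparse", "fastwarc"):
    print(name, installed_version(name) or "not installed")
```
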
Comments
  • Interesting Benchmarks running resiliparse 'HTML2text' sequentially vs parallel

    After running some benchmarks on resiliparse "HTML2text" (extract_plain_text(tree, main_content=True)), it seems the extract_plain_text method is significantly slower in parallel than sequentially.

    sequentially: 508.147 items/sec
    parallel: 62.7322 items/sec

    I ran the benchmarking with a tool I wrote, https://github.com/Nootka-io/wee-benchmarking-tool. I'll work on pulling out a minimal example.

    This seems strange to me, and I'm not sure where to begin profiling or debugging. Other libraries see little improvement in parallel, but resiliparse is the only one showing a dramatic drop, although it is still the fastest.

    opened by getorca 28
  • pipx run fastwarc check failed: binascii.Error: Non-base32 digit found

    $ pipx run --verbose fastwarc check /tmp/warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
    pipx >(setup:729): pipx version is 1.0.0
    pipx >(setup:730): Default python interpreter is '/home/user/.local/pipx/venvs/pipx/bin/python'
    pipx >(needs_upgrade:69): Time since last upgrade of shared libs, in seconds: 1561898. Upgrade will be run by pipx if greater than 2592000.
    pipx >(run_subprocess:172): running /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/python -c import sysconfig; print(sysconfig.get_path('purelib'))
    pipx >(run:103): Reusing cached venv /home/user/.local/pipx/.cache/7a73b1e86637c39
    pipx >(run_subprocess:172): running /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/python -c import sysconfig; print(sysconfig.get_path('purelib'))
    pipx >(exec_app:387): exec_app: /home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc check /tmp/warcs/WARCPROX-20220315191329244-00000-icvgw961.warc
    0 records were verified successfully.                           
    1 records were skipped without digest.
    Error in sys.excepthook:
    Traceback (most recent call last):
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc", line 8, in <module>
        sys.exit(main())
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
        return self.main(*args, **kwargs)
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1053, in main
        rv = self.invoke(ctx)
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 754, in invoke
        return __callback(*args, **kwargs)
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/fastwarc/cli.py", line 138, in check
        for v in pbar:
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
        for obj in iterable:
      File "fastwarc/tools.pyx", line 178, in verify_digests
      File "fastwarc/warc.pyx", line 922, in fastwarc.warc.WarcRecord.verify_block_digest
      File "fastwarc/warc.pyx", line 934, in fastwarc.warc.WarcRecord.verify_block_digest
      File "fastwarc/warc.pyx", line 872, in fastwarc.warc.WarcRecord._verify_digest
      File "/usr/lib/python3.9/base64.py", line 231, in b32decode
        raise binascii.Error('Non-base32 digit found') from None
    binascii.Error: Non-base32 digit found
    
    Original exception was:
    Traceback (most recent call last):
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/bin/fastwarc", line 8, in <module>
        sys.exit(main())
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
        return self.main(*args, **kwargs)
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1053, in main
        rv = self.invoke(ctx)
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/click/core.py", line 754, in invoke
        return __callback(*args, **kwargs)
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/fastwarc/cli.py", line 138, in check
        for v in pbar:
      File "/home/user/.local/pipx/.cache/7a73b1e86637c39/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
        for obj in iterable:
      File "fastwarc/tools.pyx", line 178, in verify_digests
      File "fastwarc/warc.pyx", line 922, in fastwarc.warc.WarcRecord.verify_block_digest
      File "fastwarc/warc.pyx", line 934, in fastwarc.warc.WarcRecord.verify_block_digest
      File "fastwarc/warc.pyx", line 872, in fastwarc.warc.WarcRecord._verify_digest
      File "/usr/lib/python3.9/base64.py", line 231, in b32decode
        raise binascii.Error('Non-base32 digit found') from None
    binascii.Error: Non-base32 digit found
    $
    
    opened by MaxPeal 9
  • resiliparse crashes in Colab

    Parsing this piece of HTML crashes. Is there something I can do to upgrade the underlying parser? I recall reading about this...

    from resiliparse.parse import detect_encoding
    from resiliparse.parse.html import HTMLTree
    from resiliparse.extract.html2text import extract_plain_text
    html_byte = b'\n\n\n\n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />\r\n<meta http-equiv="X-UA-Compatible" content="IE=9">\r\n<link rel="stylesheet" type="text/css" href="https://firgraf.oh.gov.hu/include/style.css" media="screen" />\r\n<title>Int\xc3\xa9zm\xc3\xa9nyi adatok</title>\r\n<!-- Global site tag (gtag.js) - Google Analytics -->\r\n<script async src="https://www.googletagmanager.com/gtag/js?id=UA-198540847-1"></script>\r\n<script>\r\n  window.dataLayer = window.dataLayer || [];\r\n  function gtag(){dataLayer.push(arguments);}\r\n  gtag(\'js\', new Date());\r\n  gtag(\'config\', \'UA-198540847-1\');\r\n</script>\r\n</head>\r\n<body>\r\n<table width="80%" cellpadding="0" cellspacing="0" align="center" style="border:3px solid;\r\nborder-radius:8px; border: 3px solid #0994dc; background-color:#FFFFFF">\r\n  <tr>\r\n    <td valign="top" rowspan="2" bgcolor=\'#FFFFFF\'></td>\r\n    <td align=\'center\' height=\'70\' bgcolor=\'#FFFFFF\' style=\'font: bold small-caps 28px monospace;\'><img src=\'https://firgraf.oh.gov.hu/images/firgraf_logo.png\' width=\'1200\'></td>\r\n  </tr>\r\n  <tr>\r\n    <td valign="top" align=\'center\' bgcolor="#FFFFFF">\r\n      \r\n      <table>\r\n\t<tr>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/index.php">Kezd\xc5\x91lap</a></td>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/kkk.php">K\xc3\xa9pz\xc3\xa9si \xc3\xa9s kimeneti k\xc3\xb6vetelm\xc3\xa9nyek</a></td>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/int.php">Int\xc3\xa9zm\xc3\xa9nyi adatok</a></td>\r\n\t  <td class="menu"><a class="menu" href="https://firgraf.oh.gov.hu/prg/torzs.php">T\xc3\xb6rzsadatok</a></td>\r\n\t  <td class="menu"><a class="menu" 
href="https://firgraf.oh.gov.hu/prg/gyorslista.php">Gyorslist\xc3\xa1k</a></td>\r\n\t  <td class="menu"><a class="menu" href="http://www.felvi.hu/hivataliugyek/">Vissza a felvi.hu-ra</a></td>\r\n\t</tr>\r\n      </table>\r\n    </td>\r\n  </tr>\r\n  <tr>\r\n    <td bgcolor=\'#ffffff\'>\r\n      &nbsp;\r\n    </td>\r\n    <td colspan="2" style="padding: 0.5em">\r\n      <div align="center"><font size="4" color="#000000">Int\xc3\xa9zm\xc3\xa9nyi adatok</font></div><hr>\r\n      <div align=\'left\' valign=\'top\'><form name=\'hataly\' method=\'get\' action=\'/prg/int.php?nyilvantartottszakid=36318\'><a href=\'/prg/int.php?hatalyvalt=hat\xc3\xa1lyoss\xc3\xa1g+bekapcsol\xc3\xa1sa&nyilvantartottszakid=36318\'>[A hat\xc3\xa1lyoss\xc3\xa1gi sz\xc5\xb1r\xc5\x91k bekapcsol\xc3\xa1sa.]</a></form>\n</div><form name=form1 method=post action=\'/prg/int.php?nyilvantartottszakid=36318\'><div align=\'left\' valign=\'top\'>\xe2\x96\xa0 <a href=\'kkk.php?graf=MSZKSMU\'>KKK teljes gr\xc3\xa1f</a> \xe2\x96\xa0 <a href=\'int.php?adatmod=nyilvszak&szervezetid=36\'>SZTE nyilv\xc3\xa1ntartott k\xc3\xa9pz\xc3\xa9sei</a><br>A gr\xc3\xa1fban a csom\xc3\xb3pontokra kattintva b\xc5\x91vebb inform\xc3\xa1ci\xc3\xb3 olvashat\xc3\xb3 az adott csom\xc3\xb3pontr\xc3\xb3l.<br>Gr\xc3\xa1fn\xc3\xa9zet:   <select name=grafnezet>\n<option value="resz">csak a nyilv\xc3\xa1ntartott r\xc3\xa9szgr\xc3\xa1fot</option>\n<option value="mind">a teljes gr\xc3\xa1fban a nyilv\xc3\xa1ntartott r\xc3\xa9szgr\xc3\xa1fot</option>\n</select> mutatja.<br>A gr\xc3\xa1fban a ny\xc3\xadl kezdete \xc3\xa9s v\xc3\xa9ge k\xc3\xb6z\xc3\xb6tti minim\xc3\xa1lis t\xc3\xa1vols\xc3\xa1g:   <select name=grafminlen>\n<option value="0">legkisebb</option>\n<option selected value="1">1 egys\xc3\xa9g</option>\n<option value="2">2 egys\xc3\xa9g</option>\n<option value="3">3 egys\xc3\xa9g</option>\n<option value="4">4 egys\xc3\xa9g</option>\n<option value="5">5 egys\xc3\xa9g</option>\n</select> (A nagyobb \xc3\xa9rt\xc3\xa9k 
szell\xc5\x91sebb\xc3\xa9 teszi az \xc3\xa1br\xc3\xa1t.)<br> <button type=\'submit\'  style="background-color:#E5E5E5; color:#000000; font-size: 12px;" name=\'muv\' value=\'n\xc3\xa9zetet friss\xc3\xadt\'>n\xc3\xa9zetet friss\xc3\xadt</button> </div><br><table width=\'100%\' align=\'center\' border=\'0\'><tr><td width=\'50%\' align=\'left\' valign=\'top\'><a href=\'/prg/int.php?nyilvantartottszakid=36317\'>\xc2\xab el\xc5\x91z\xc5\x91: szoci\xc3\xa1lis munka (36317)</a></td><td width=\'50%\' align=\'right\'><a href=\'/prg/int.php?nyilvantartottszakid=6150\'>k\xc3\xb6vetkez\xc5\x91: szoci\xc3\xa1lpedag\xc3\xb3gia (6150) \xc2\xbb</a></td></tr></table>\n<br><div align=\'left\' valign=\'top\'><b><a href=\'torzsadat.php?tabla=szervezet&sid=70\'>(SZTE) Szegedi Tudom\xc3\xa1nyegyetem</a> - <a href=\'torzsadat.php?tabla=nyilvantartottszak&sid=21715\'>(MSZKSMU) szoci\xc3\xa1lis munka [36318]</a></b></div><br><div align=\'left\' valign=\'top\'><?xml version="1.0" encoding="UTF-8" standalone="no"?>\n<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"\n "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">\n<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n -->\n<!-- Title: MSZKSMU Pages: 1 -->\n<svg width="340pt" height="116pt"\n viewBox="0.00 0.00 340.00 116.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">\n<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 112)">\n<title>MSZKSMU</title>\n<polygon fill="#ffffff" stroke="transparent" points="-4,4 -4,-112 336,-112 336,4 -4,4"/>\n<g id="clust1" class="cluster">\n<title>cluster_vegzettseg</title>\n<polygon fill="none" stroke="#ffff00" points="231,-8 231,-62 324,-62 324,-8 231,-8"/>\n</g>\n<!-- START -->\n<g id="node1" class="node">\n<title>START</title>\n<ellipse fill="#d3d3d3" stroke="#d3d3d3" cx="27" cy="-63" rx="27" ry="18"/>\n<text text-anchor="middle" x="27" y="-60.8" font-family="Times,serif" font-size="9.00" fill="#000000">START</text>\n</g>\n<!-- MSZKSMU -->\n<g 
id="node2" class="node">\n<title>MSZKSMU</title>\n<g id="a_node2"><a xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=414" xlink:title="MSZKSMU\\nszoci\xc3\xa1lis munka">\n<polygon fill="#e0ffff" stroke="#e0ffff" points="164,-81 91,-81 91,-45 164,-45 164,-81"/>\n<text text-anchor="middle" x="127.5" y="-65.8" font-family="Times,serif" font-size="9.00" fill="#000000">MSZKSMU</text>\n<text text-anchor="middle" x="127.5" y="-55.8" font-family="Times,serif" font-size="9.00" fill="#000000">szoci\xc3\xa1lis munka</text>\n</a>\n</g>\n</g>\n<!-- START&#45;&gt;MSZKSMU -->\n<g id="edge1" class="edge">\n<title>START&#45;&gt;MSZKSMU</title>\n<path fill="none" stroke="#0000ff" stroke-width="2" d="M54.1967,-63C62.3906,-63 71.6286,-63 80.7147,-63"/>\n<polygon fill="#0000ff" stroke="#0000ff" stroke-width="2" points="80.8451,-66.5001 90.8451,-63 80.845,-59.5001 80.8451,-66.5001"/>\n</g>\n<!-- MSPCKSM -->\n<g id="node3" class="node">\n<title>MSPCKSM</title>\n<g id="a_node3"><a xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=5710" xlink:title="MSPCKSM\\nklinikai szoci\xc3\xa1lis munka">\n<polygon fill="#ffe4e1" stroke="#ffe4e1" points="328,-108 227,-108 227,-72 328,-72 328,-108"/>\n<text text-anchor="middle" x="277.5" y="-92.8" font-family="Times,serif" font-size="9.00" fill="#000000">MSPCKSM</text>\n<text text-anchor="middle" x="277.5" y="-82.8" font-family="Times,serif" font-size="9.00" fill="#000000">klinikai szoci\xc3\xa1lis munka</text>\n</a>\n</g>\n</g>\n<!-- MSZKSMU&#45;&gt;MSPCKSM -->\n<g id="edge3" class="edge">\n<title>MSZKSMU&#45;&gt;MSPCKSM</title>\n<path fill="none" stroke="#000000" d="M164.1941,-69.6049C179.9274,-72.4369 198.7348,-75.8223 216.4633,-79.0134"/>\n<polygon fill="#000000" stroke="#000000" points="216.2835,-82.5372 226.7454,-80.8642 217.5237,-75.6479 216.2835,-82.5372"/>\n</g>\n<!-- 1287 -->\n<g id="node4" class="node">\n<title>1287</title>\n<g id="a_node4"><a 
xlink:href="https://firgraf.oh.gov.hu/prg/torzsadat.php?tabla=vegzettseg&idmezo=vegzettsegid&id=1287" xlink:title="MMSAZMO\\nokleveles\\nszoci\xc3\xa1lis munk\xc3\xa1s">\n<polygon fill="#ffff00" stroke="#ffff00" points="316,-54 239,-54 239,-16 316,-16 316,-54"/>\n<text text-anchor="middle" x="277.5" y="-42.8" font-family="Times,serif" font-size="9.00" fill="#000000">MMSAZMO</text>\n<text text-anchor="middle" x="277.5" y="-32.8" font-family="Times,serif" font-size="9.00" fill="#000000">okleveles</text>\n<text text-anchor="middle" x="277.5" y="-22.8" font-family="Times,serif" font-size="9.00" fill="#000000">szoci\xc3\xa1lis munk\xc3\xa1s</text>\n</a>\n</g>\n</g>\n<!-- MSZKSMU&#45;&gt;1287 -->\n<g id="edge2" class="edge">\n<title>MSZKSMU&#45;&gt;1287</title>\n<path fill="none" stroke="#ff0000" d="M164.1941,-56.1504C183.6481,-52.519 207.8022,-48.0103 228.805,-44.0897"/>\n<polygon fill="#ff0000" stroke="#ff0000" points="229.6399,-47.4944 238.8279,-42.2188 228.3554,-40.6133 229.6399,-47.4944"/>\n<text text-anchor="middle" x="195.5" y="-54.6" font-family="Times,serif" font-size="8.00" fill="#ff0000">START</text>\n</g>\n</g>\n</svg>\n</div><br><br><div align=\'left\' valign=\'top\'><b>Nyilv\xc3\xa1ntartott szak:</div></b><table border=\'1\' cellpadding=\'2\' cellspacing=\'0\'><tr><td align=\'left\' valign=\'top\'><b>nyilv. szak ID</b></td><td align=\'left\' valign=\'top\'><b>k\xc3\xb3d</b></td><td align=\'left\' valign=\'top\'><b>n\xc3\xa9v</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g kezdete</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>meghird. kezdete</b></td><td align=\'left\' valign=\'top\'><b>meghird. 
v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>telephely</b></td><td align=\'left\' valign=\'top\'><b>nyelv</b></td><td align=\'left\' valign=\'top\'><b>munkarend</b></td></tr>\n<tr><td align=\'left\' valign=\'top\'><a href=\'torzsadat.php?tabla=nyilvantartottszak&sid=21715\'>36318</a></td><td align=\'left\' valign=\'top\'>MSZKSMU</td><td align=\'left\' valign=\'top\'>szoci\xc3\xa1lis munka</td><td align=\'left\' valign=\'top\'>2020-01-01</td><td align=\'left\' valign=\'top\'></td><td align=\'left\' valign=\'top\'>2020-01-01</td><td align=\'left\' valign=\'top\'></td><td align=\'left\' valign=\'top\'>Szeged</td><td align=\'left\' valign=\'top\'>magyar</td><td align=\'left\' valign=\'top\'>levelez\xc5\x91</td></tr>\n</table><div align=\'left\' valign=\'top\'><b>Nyilv\xc3\xa1ntartott k\xc3\xa9pz\xc3\xa9si elemek:</b></div><table border=\'1\' cellpadding=\'2\' cellspacing=\'0\'>\n<tr><td align=\'left\' valign=\'top\'><b>k\xc3\xb3d</b></td><td align=\'left\' valign=\'top\'><b>n\xc3\xa9v</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g kezdete</b></td><td align=\'left\' valign=\'top\'><b>hat\xc3\xa1lyoss\xc3\xa1g v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>meghird. kezdete</b></td><td align=\'left\' valign=\'top\'><b>meghird. 
v\xc3\xa9ge</b></td><td align=\'left\' valign=\'top\'><b>t\xc3\xadpus</b></td><td align=\'left\' valign=\'top\'><b>minimum kredit</b></td><td align=\'left\' valign=\'top\'><b>maximum kredit</b></td></tr><tr><td align=\'left\' valig=\'top\'><a href=\'torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=5710\'>MSPCKSM</a></td><td align=\'left\' valig=\'top\'>klinikai szoci\xc3\xa1lis munka</td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>specializ\xc3\xa1ci\xc3\xb3</td><td align=\'left\' valig=\'top\'>35</td><td align=\'left\' valig=\'top\'>40</td></tr><tr><td align=\'left\' valig=\'top\'><a href=\'torzsadat.php?tabla=kepzeselem&idmezo=kepzeselemid&id=414\'>MSZKSMU</a></td><td align=\'left\' valig=\'top\'>szoci\xc3\xa1lis munka</td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>2020-01-01</td><td align=\'left\' valig=\'top\'></td><td align=\'left\' valig=\'top\'>szak</td><td align=\'left\' valig=\'top\'>120</td><td align=\'left\' valig=\'top\'>120</td></tr></table></form>\r\n    </td>\r\n  </tr>\r\n  <tr>\r\n    <td colspan="2" bgcolor=\'#0994dc\' width="100%">\r\n      <table width="100%">\r\n\t<tr>\r\n\t  <td align=\'left\'>\r\n\t      <font size=\'1\' color=\'#ffffff\'>Az adatb\xc3\xa1zis 2022-09-24 hajnalban friss\xc3\xbclt.</font>\r\n\t  </td>\r\n\t  <td align="right">\r\n\t    <font size=\'1\' color=\'#ffffff\'>K\xc3\xa9sz\xc3\xbclt az EKOP-1.A.1-08/C-2009-0009  "Az Oktat\xc3\xa1si Hivatal k\xc3\xb6zigazgat\xc3\xa1si szolg\xc3\xa1ltat\xc3\xa1sainak elektroniz\xc3\xa1l\xc3\xa1sa" projekt keret\xc3\xa9ben. &copy; 2012.</font>\r\n\t  </td>\r\n\t</tr>\r\n    </td>\r\n  </tr>\r\n</table>\r\n</body>\r\n</html>\r\n\n'
    encoding = detect_encoding(html_byte)
    tree = HTMLTree.parse_from_bytes(html_byte, encoding)
    str(tree)
    
    opened by ontocord 8
  • FastWARC: command-line tools to index and extract WARC records

    Command-line tools to index and extract WARC records are useful for quickly inspecting WARC files. The implemented commands behave almost the same as warcio index and warcio extract. One difference, for example, is that warcio extract decodes the record payload, removing HTTP transfer and content encoding, whereas fastwarc extract preserves them. Only local files are supported for now because seek() and tell() are required; supporting URLs as well would need some extra effort.

    $> fastwarc index --help
    Usage: fastwarc index [OPTIONS] [INFILES]...
    
      Index WARC records into CDXJ.
    
    Options:
      -o, --output FILENAME  Output file, default is stdout
      -f, --fields TEXT      comma-separated list of indexed fields, eg. "offset",
                             "length", "filename", "http:status", "http:<http-
                             header>", or "<warc-record-header>"
    
      -h, --help             Show this message and exit.
    
    $> wget http://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2021/09/CC-NEWS-20210930113548-00741.warc.gz
    
    $> fastwarc index -fwarc-type,http:status,warc-target-uri,filename,offset,length CC-NEWS-20210930113548-00741.warc.gz | head -3
    {"warc-type": "warcinfo", "filename": "CC-NEWS-20210930113548-00741.warc.gz", "offset": "0", "length": "423"}
    {"warc-type": "request", "warc-target-uri": "https://www.novinite.com/articles/211500/Accident+with+Stranded+Ship+Vera+Su+Might+Be+Caused+by+Inexperienced+Crew+Member", "filename": "CC-NEWS-20210930113548-00741.warc.gz", "offset": "423", "length": "508"}
    {"warc-type": "response", "http:status": 200, "warc-target-uri": "https://www.novinite.com/articles/211500/Accident+with+Stranded+Ship+Vera+Su+Might+Be+Caused+by+Inexperienced+Crew+Member", "filename": "CC-NEWS-20210930113548-00741.warc.gz", "offset": "931", "length": "12680"}
    
    $> fastwarc extract --help
    Usage: fastwarc extract [OPTIONS] INFILE OFFSET
    
      Extract WARC record by offset.
    
    Options:
      --payload   output only record payload (transfer and/or content encoding are
                  preserved
    
      --headers   output only record (and HTTP) headers
      -h, --help  Show this message and exit.
    
    $> fastwarc extract --headers CC-NEWS-20210930113548-00741.warc.gz 931
    WARC/1.0
    WARC-Record-ID: <urn:uuid:cefe66f5-6ed7-4e2c-839a-964bbfb9fcf2>
    Content-Length: 44827
    WARC-Date: 2021-09-30T11:35:47Z
    WARC-Type: response
    WARC-Target-URI: https://www.novinite.com/articles/211500/Accident+with+Stranded+Ship+Vera+Su+Might+Be+Caused+by+Inexperienced+Crew+Member
    Content-Type: application/http; msgtype=response
    WARC-Payload-Digest: sha1:ICXB6ZRV3DBDGIQJJUFAFM5HBXN574BH
    WARC-Block-Digest: sha1:6NYQONX4OQBGVVGCUJYDQ4S7EBAYWH6R
    
    HTTP/1.1 200 OK
    Date: Thu, 30 Sep 2021 11:35:48 GMT
    Server: Apache/2.4.10 (Debian)
    Set-Cookie: PHPSESSID=kbh1betq8no20aeofgpjsnpj64; path=/
    Expires: Thu, 19 Nov 1981 08:52:00 GMT
    Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
    Pragma: no-cache
    Vary: Accept-Encoding
    X-Crawler-Content-Encoding: gzip
    X-Crawler-Content-Length: 12159
    Content-Length: 44350
    Keep-Alive: timeout=5, max=100
    Connection: Keep-Alive
    Content-Type: text/html
    
    $> fastwarc extract --payload CC-NEWS-20210930113548-00741.warc.gz 931 | head -7
    <!DOCTYPE html>
    
    <html>
    <head>
            <base href="https://www.novinite.com/">
            <META http-equiv="Content-Type" content="text/html; charset=utf-8">
            <title>Accident with Stranded Ship Vera Su Might Be Caused by Inexperienced Crew Member - Novinite.com - Sofia News Agency</title>
    
    enhancement cli fastwarc 
    opened by sebastian-nagel 8
  • Trouble building in Python 3.11

    $ pip install --no-binary resiliparse resiliparse

    DEPRECATION: --no-binary currently disables reading from the cache of locally built wheels. In the future --no-binary will not influence the wheel cache. pip 23.1 will enforce this behaviour change. A possible replacement is to use the --no-cache-dir option. You can use the flag --use-feature=no-binary-enable-wheel-cache to test the upcoming behaviour. Discussion can be found at https://github.com/pypa/pip/issues/11453
    Collecting resiliparse
      Using cached Resiliparse-0.13.7.tar.gz (601 kB)
      Installing build dependencies ... done
      Getting requirements to build wheel ... done
      Installing backend dependencies ... done
      Preparing metadata (pyproject.toml) ... done
    Collecting fastwarc==0.13.7
      Using cached FastWARC-0.13.7-cp311-cp311-linux_x86_64.whl
    Collecting brotli
      Using cached Brotli-1.0.9-cp311-cp311-linux_x86_64.whl
    Requirement already satisfied: click in ./venv311/lib/python3.11/site-packages (from fastwarc==0.13.7->resiliparse) (8.0.4)
    Requirement already satisfied: tqdm in ./venv311/lib/python3.11/site-packages (from fastwarc==0.13.7->resiliparse) (4.64.1)
    Building wheels for collected packages: resiliparse
      Building wheel for resiliparse (pyproject.toml) ... error
      error: subprocess-exited-with-error
      
      × Building wheel for resiliparse (pyproject.toml) did not run successfully.
      │ exit code: 1
      ╰─> [50 lines of output]
          running bdist_wheel
          running build
          running build_py
          creating build
          creating build/lib.linux-x86_64-cpython-311
          creating build/lib.linux-x86_64-cpython-311/resiliparse
          copying resiliparse/cli.py -> build/lib.linux-x86_64-cpython-311/resiliparse
          copying resiliparse/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse
          creating build/lib.linux-x86_64-cpython-311/resiliparse/beam
          copying resiliparse/beam/coders.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
          copying resiliparse/beam/textio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
          copying resiliparse/beam/warcio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
          copying resiliparse/beam/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
          copying resiliparse/beam/elasticsearch.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
          copying resiliparse/beam/fileio.py -> build/lib.linux-x86_64-cpython-311/resiliparse/beam
          creating build/lib.linux-x86_64-cpython-311/resiliparse/extract
          copying resiliparse/extract/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
          creating build/lib.linux-x86_64-cpython-311/resiliparse/parse
          copying resiliparse/parse/__init__.py -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
          copying resiliparse/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
          copying resiliparse/itertools.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
          copying resiliparse/process_guard.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse
          copying resiliparse/extract/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
          copying resiliparse/extract/html2text.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/extract
          copying resiliparse/parse/__init__.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
          copying resiliparse/parse/html.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
          copying resiliparse/parse/http.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
          copying resiliparse/parse/lang.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
          copying resiliparse/parse/encoding.pxd -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
          copying resiliparse/parse/lang_profiles.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
          copying resiliparse/parse/encoding.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
          copying resiliparse/parse/html.h -> build/lib.linux-x86_64-cpython-311/resiliparse/parse
          running build_ext
          building 'resiliparse.itertools' extension
          creating build/temp.linux-x86_64-cpython-311
          creating build/temp.linux-x86_64-cpython-311/resiliparse
          x86_64-linux-gnu-gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/mnt/Data/Projects/nootka_io/sentry_dragon/venv311/include -I/usr/include/python3.11 -c resiliparse/itertools.cpp -o build/temp.linux-x86_64-cpython-311/resiliparse/itertools.o -std=c++17 -O3 -Wall -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function
          x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -g -fwrapv -O2 build/temp.linux-x86_64-cpython-311/resiliparse/itertools.o -L/usr/lib/x86_64-linux-gnu -o build/lib.linux-x86_64-cpython-311/resiliparse/itertools.cpython-311-x86_64-linux-gnu.so -std=c++17
          building 'resiliparse.extract.html2text' extension
          creating build/temp.linux-x86_64-cpython-311/resiliparse/extract
          x86_64-linux-gnu-gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I./resiliparse/parse -I/mnt/Data/Projects/nootka_io/sentry_dragon/venv311/include -I/usr/include/python3.11 -c resiliparse/extract/html2text.cpp -o build/temp.linux-x86_64-cpython-311/resiliparse/extract/html2text.o -std=c++17 -O3 -Wall -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function
          In file included from /usr/include/lexbor/css/css.h:14,
                           from resiliparse/extract/html2text.cpp:864:
          /usr/include/lexbor/css/stylesheet.h: In function ‘lxb_css_stylesheet_t* lxb_css_stylesheet_create(lexbor_mraw_t*)’:
          /usr/include/lexbor/css/stylesheet.h:33:30: error: invalid conversion from ‘void*’ to ‘lxb_css_stylesheet_t*’ {aka ‘lxb_css_stylesheet*’} [-fpermissive]
             33 |     return lexbor_mraw_calloc(mraw, sizeof(lxb_css_stylesheet_t));
                |            ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                |                              |
                |                              void*
          error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
      ERROR: Failed building wheel for resiliparse
    Failed to build resiliparse
    ERROR: Could not build wheels for resiliparse, which is required to install pyproject.toml-based projects
    
    
    upstream 
    opened by getorca 6
  • FastWARC: BufferedReader may hang up on truncated gzipped WARC file


    The ArchiveIterator, or rather the underlying stream_io.BufferedReader, may hang when reading a truncated gzipped WARC file (e.g., an incomplete download). The issue can be reproduced by reading clipped.warc.gz, see iipc/jwarc#17. The stack during the hang-up (instead of ftell, I've also observed stream_io.FileStream.read() on top of _refill_working_buf()):

    #3  0x00007f98a34f8705 in __GI__IO_ftell (fp=0x19b3790) at ioftell.c:38
    #4  0x00007f98a2764766 in __pyx_f_8fastwarc_9stream_io_10GZipStream__refill_working_buf (__pyx_v_self=0x7f98a19fad60, __pyx_v_size=16384)
        at fastwarc/stream_io.cpp:4944
    #5  0x00007f98a276d500 in __pyx_f_8fastwarc_9stream_io_10GZipStream_read (__pyx_v_self=0x7f98a19fad60, __pyx_v_out="", __pyx_v_size=16384)
        at fastwarc/stream_io.cpp:5191
    #6  0x00007f98a27645bc in __pyx_f_8fastwarc_9stream_io_14BufferedReader__fill_buf (__pyx_v_self=0x7f98a19fb9a0) at fastwarc/stream_io.cpp:9201
    #7  0x00007f98a276ce6b in __pyx_f_8fastwarc_9stream_io_14BufferedReader_read (__pyx_v_self=0x7f98a19fb9a0, __pyx_skip_dispatch=<optimized out>, 
        __pyx_optional_args=<optimized out>) at fastwarc/stream_io.cpp:9684
    #8  0x00007f98a2765d75 in __pyx_pf_8fastwarc_9stream_io_14BufferedReader_4read (__pyx_v_size=16384, __pyx_v_self=0x7f98a19fb9a0)
        at fastwarc/stream_io.cpp:9840
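
Until the reader handles this case, one possible workaround is to check a file for truncation before handing it to the iterator. The following is a minimal sketch using only the Python standard library; the function name `is_complete_gzip` is hypothetical and not part of FastWARC. It relies on the standard `gzip` module raising `EOFError` when the compressed stream ends before its end-of-stream marker.

```python
import gzip
import zlib


def is_complete_gzip(path, chunk_size=1 << 16):
    """Return True if the gzip file at `path` can be decompressed to its end.

    Hypothetical helper, not a FastWARC API. It reads the whole file once,
    so it doubles the I/O cost for large archives.
    """
    try:
        with gzip.open(path, "rb") as stream:
            # Discard the output; we only care whether reading succeeds.
            while stream.read(chunk_size):
                pass
    except (EOFError, zlib.error, gzip.BadGzipFile):
        # EOFError: stream ended before the end-of-stream marker.
        # zlib.error / BadGzipFile: corrupted data or header.
        return False
    return True
```

A file that happens to be cut exactly at the boundary between two gzip members still passes this check, so it only catches truncation inside a member.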
    
    opened by sebastian-nagel 6
  • pipx run resiliparse failed: ModuleNotFoundError: No module named 'joblib'

    user@box:~$ pipx run resiliparse
    Traceback (most recent call last):
      File "/home/user/.local/pipx/.cache/42f25da10f76b98/bin/resiliparse", line 5, in <module>
        from resiliparse.cli import main
      File "/home/user/.local/pipx/.cache/42f25da10f76b98/lib/python3.9/site-packages/resiliparse/cli.py", line 18, in <module>
        from joblib import Parallel, delayed
    ModuleNotFoundError: No module named 'joblib'
    user@box:~$ pipx install resiliparse
      installed package resiliparse 0.11.1, installed using Python 3.9.2
      These apps are now globally available
        - resiliparse
    done! ✨ 🌟 ✨
    user@box:~$ resiliparse
    Traceback (most recent call last):
      File "/home/user/.local/bin/resiliparse", line 5, in <module>
        from resiliparse.cli import main
      File "/home/user/.local/pipx/venvs/resiliparse/lib/python3.9/site-packages/resiliparse/cli.py", line 18, in <module>
        from joblib import Parallel, delayed
    ModuleNotFoundError: No module named 'joblib'
    user@box:~$ 
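
The traceback shows cli.py importing joblib at module level while the installed environment does not contain it. As a rough sketch of how such a missing dependency could be detected up front and reported more readably, the helper below checks whether modules are importable without importing them. The name `missing_modules` is hypothetical and not part of Resiliparse.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found.

    Hypothetical helper: uses importlib.util.find_spec, which locates a
    module without executing it.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    absent = missing_modules(["joblib"])
    if absent:
        print("Missing optional dependencies:", ", ".join(absent))
```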
    
    opened by MaxPeal 5
  • Installing fastwarc via `pip install` fails if compilation is required or requested


    • applies to fastwarc 0.6.6 and 0.7.0 (0.6.5 successfully installed)
    • seen on Ubuntu 20.04 and 21.04
    • on amd64 with pip3 install --no-binary fastwarc fastwarc
    • or on aarch64 with pip3 install fastwarc (no binaries provided for ARM CPUs)

    The error message indicates that fastwarc is now too interconnected with resiliparse:

      ERROR: Command errored out with exit status 1:
    ...  
      from resiliparse_common.string_util cimport str_to_lower, strip_str, strip_c_str
      ^
      ------------------------------------------------------------
      
      fastwarc/warc.pyx:32:0: 'resiliparse_common/string_util.pxd' not found
    

    Building from a checkout of chatnoir-resiliparse via pip3 wheel -e fastwarc also succeeds on ARM-based systems.

    opened by sebastian-nagel 3
  • yum install


    Hi,

    Thanks for the very nice package.

    Do you know which dependencies should be installed with yum? I am struggling to build FastWARC from source within a Lambda container. Here is my Dockerfile:

    FROM public.ecr.aws/lambda/python:3.8
    
    RUN yum groupinstall "Development Tools" -y
    RUN yum install python3-devel -y
    RUN yum install -y zlib-devel lz4-devel liblexbor-devel uchardet-devel 
    RUN pip3 install --no-binary fastwarc fastwarc --target "${LAMBDA_TASK_ROOT}"
    
    COPY app.py ${LAMBDA_TASK_ROOT}
    CMD [ "app.handler" ]
    

    This is the error message:

      ERROR: Command errored out with exit status 1:
       command: /var/lang/bin/python3.8 /var/lang/lib/python3.8/site-packages/pip/_vendor/pep517/in_process/_in_process.py build_wheel /tmp/tmparkimzwm
           cwd: /tmp/pip-install-1hzfg9i1/fastwarc_fcfee32f14f34b609444e2992925ac95
      Complete output (26 lines):
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-3.8
      creating build/lib.linux-x86_64-3.8/fastwarc
      copying fastwarc/cli.py -> build/lib.linux-x86_64-3.8/fastwarc
      copying fastwarc/__init__.py -> build/lib.linux-x86_64-3.8/fastwarc
      copying fastwarc/stream_io.pxd -> build/lib.linux-x86_64-3.8/fastwarc
      copying fastwarc/warc.pxd -> build/lib.linux-x86_64-3.8/fastwarc
      copying fastwarc/__init__.pxd -> build/lib.linux-x86_64-3.8/fastwarc
      running build_ext
      building 'fastwarc.warc' extension
      creating build/temp.linux-x86_64-3.8
      creating build/temp.linux-x86_64-3.8/fastwarc
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/var/lang/include/python3.8 -c fastwarc/warc.cpp -o build/temp.linux-x86_64-3.8/fastwarc/warc.o -std=c++17 -O3 -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function -fpermissive -Wno-c++11-narrowing
      g++ -pthread -shared -Wl,-rpath=/var/lang/lib build/temp.linux-x86_64-3.8/fastwarc/warc.o -L/var/lang/lib -o build/lib.linux-x86_64-3.8/fastwarc/warc.cpython-38-x86_64-linux-gnu.so -std=c++17
      building 'fastwarc.stream_io' extension
      gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -fPIC -I/var/lang/include/python3.8 -c fastwarc/stream_io.cpp -o build/temp.linux-x86_64-3.8/fastwarc/stream_io.o -std=c++17 -O3 -Wno-deprecated-declarations -Wno-unreachable-code -Wno-unused-function -fpermissive -Wno-c++11-narrowing
      fastwarc/stream_io.cpp: In function ‘int __pyx_pf_8fastwarc_9stream_io_9LZ4Stream_2__cinit__(__pyx_obj_8fastwarc_9stream_io_LZ4Stream*, PyObject*, PyObject*, PyObject*)’:
      fastwarc/stream_io.cpp:7441:23: error: ‘struct LZ4F_preferences_t’ has no member named ‘favorDecSpeed’
         __pyx_v_self->prefs.favorDecSpeed = __pyx_t_4;
                             ^~~~~~~~~~~~~
      At global scope:
      cc1plus: warning: unrecognized command line option ‘-Wno-c++11-narrowing’
      error: command 'gcc' failed with exit status 1
      ----------------------------------------
      ERROR: Failed building wheel for fastwarc
    

    Many thanks!

    opened by maximedb 3
  • Fastwarc: CLI may index gzipped WARC records with erroneous length 0


    The fastwarc command-line tool "index" indexes some records of a gzipped WARC file with an erroneous record length of zero:

    $> wget http://commoncrawl.s3.amazonaws.com/crawl-data/CC-NEWS/2021/09/CC-NEWS-20210930113548-00741.warc.gz
    
    $> fastwarc index -fwarc-type,warc-target-uri,offset,length CC-NEWS-20210930113548-00741.warc.gz \
        | grep -F '"length": "0"'
    {"warc-type": "response", "warc-target-uri": "https://www.themarketsdaily.com/2021/09/30/ishares-sp-500-etf-nysearcaivv-sees-strong-trading-volume.html", "offset": "232757027", "length": "0"}
    {"warc-type": "response", "warc-target-uri": "https://www.timeturk.com/yasam/baskan-buyukkilic-10-milyon-tl-yatirim-yapilan-yeralti-carsisi-nda-incelemede-bulundu-esnafi-ziyaret-etti/haber-1703634", "offset": "278528237", "length": "0"}
    {"warc-type": "response", "warc-target-uri": "https://www.sondakika.com/haber/haber-yayinci-tevfik-rauf-baysal-vefat-etti-14429565/", "offset": "1044381471", "length": "0"}
    

    See also the discussion in #11; however, fewer records are affected here. With the uncompressed WARC file, the error is not reproducible.
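
To cross-check index entries independently, the compressed extent of each gzip member can be measured with the standard library. This is only a sketch under two assumptions: that the archive is compressed record-wise (one gzip member per WARC record), and that the indexed length is meant to be the compressed size of that member. It is not how FastWARC computes its index, and `gzip_member_spans` is a hypothetical name.

```python
import gzip
import zlib


def gzip_member_spans(data):
    """Yield (offset, length) of each gzip member in the bytes object `data`.

    Works in memory and re-slices the buffer per member, so it is meant
    for small samples rather than multi-gigabyte archives.
    """
    offset = 0
    while offset < len(data):
        # wbits=31 selects gzip framing (header and trailer included).
        decomp = zlib.decompressobj(31)
        decomp.decompress(data[offset:])
        if not decomp.eof:
            raise ValueError(f"truncated gzip member at offset {offset}")
        # Bytes after the end of this member are left in unused_data.
        length = len(data) - offset - len(decomp.unused_data)
        yield offset, length
        offset += length


if __name__ == "__main__":
    sample = gzip.compress(b"record one") + gzip.compress(b"record two")
    for offset, length in gzip_member_spans(sample):
        print(offset, length)
```

Under these assumptions, a complete member can never have length zero, since the gzip header and trailer alone occupy 18 bytes.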

    bug fastwarc 
    opened by sebastian-nagel 3
  • Fix HTTP status code parsing (reason phrase may contain spaces)


    ~~The field WarcRecord.http_headers could include the HTTP status code or it could be provided as an extra attribute to WarcRecord.~~

    ~~When reading a record it is not easily visible what status code a response had. For example, if I would like to only filter 301 redirection content, I'm not able to do this, as far as I can see. (Or just filter 200 responses for further processing.) The other HTTP headers are parsed but not the HTTP status line which has a simple format, e. g. HTTP/1.X XXX Description, that could be integrated to the existing HTTP header parsing. I also found no simple way like .reader to access the HTTP communication.~~

    Example:

    >>> record.headers
    {'WARC-Type': 'response', 'WARC-Target-URI': 'http://vgperson.com/robots.txt', 'WARC-Date': '2021-08-09T13:25:55Z', 'WARC-Payload-Digest': 'sha1:OLD2B4B3YRYMAUJAQNATRPULWOOXO3YP', 'WARC-IP-Address': '85.214.122.46', 'WARC-Record-ID': '<urn:uuid:5eeb5cea-38e6-4904-a4f9-077a162bc0d6>', 'Content-Type': 'application/http; msgtype=response', 'Content-Length': '454'}
    >>> record.http_headers
    {'Date': 'Mon, 09 Aug 2021 13:25:53 GMT', 'Server': 'Apache', 'Location': 'https://vgperson.com/robots.txt', 'Content-Length': '239', 'Connection': 'close', 'Content-Type': 'text/html; charset=iso-8859-1'}
    >>> content = record.reader.read()
    >>> assert len(content) == record.content_length  # content only includes the real content, no access to HTTP stuff
    >>> print(content)
    b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>301 Moved Permanently</title>\n</head><body>\n<h1>Moved Permanently</h1>\n<p>The document has moved <a href="https://vgperson.com/robots.txt">here</a>.</p>\n</body></html>\n'
    

    HTTP communication:

    WARC/1.0
    WARC-Type: response
    WARC-Target-URI: http://vgperson.com/robots.txt
    WARC-Date: 2021-08-09T13:25:55Z
    WARC-Payload-Digest: sha1:OLD2B4B3YRYMAUJAQNATRPULWOOXO3YP
    WARC-IP-Address: 85.214.122.46
    WARC-Record-ID: <urn:uuid:5eeb5cea-38e6-4904-a4f9-077a162bc0d6>
    Content-Type: application/http; msgtype=response
    Content-Length: 454
    
    HTTP/1.1 301 Moved Permanently
    Date: Mon, 09 Aug 2021 13:25:53 GMT
    Server: Apache
    Location: https://vgperson.com/robots.txt
    Content-Length: 239
    Connection: close
    Content-Type: text/html; charset=iso-8859-1
    
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>301 Moved Permanently</title>
    </head><body>
    <h1>Moved Permanently</h1>
    <p>The document has moved <a href="https://vgperson.com/robots.txt">here</a>.</p>
    </body></html>
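
The point in the title, that the reason phrase may itself contain spaces (as in "Moved Permanently" above), can be illustrated with a small sketch. Splitting the status line at most twice keeps the reason phrase intact. The function `parse_status_line` is hypothetical and does not reflect the actual FastWARC implementation.

```python
def parse_status_line(line):
    """Split an HTTP status line into (version, status_code, reason).

    Hypothetical helper. Accepts str or bytes; bytes are decoded as
    ISO-8859-1, which maps every byte to a character.
    """
    if isinstance(line, bytes):
        line = line.decode("iso-8859-1")
    # maxsplit=2: everything after the status code is the reason phrase.
    parts = line.strip().split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        raise ValueError(f"not an HTTP status line: {line!r}")
    version = parts[0]
    status_code = int(parts[1])  # raises ValueError if not numeric
    reason = parts[2] if len(parts) == 3 else ""
    return version, status_code, reason


if __name__ == "__main__":
    print(parse_status_line("HTTP/1.1 301 Moved Permanently"))
```

With a parsed status code available, filtering for redirects or successful responses becomes a simple integer comparison.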
    
    opened by Querela 2
Owner
ChatNoir
ChatNoir Research Web Search Engine