Open clone of OpenAI's unreleased WebText dataset scraper.

Joshua C Peterson

Last update: Dec 30, 2022

Overview

OpenWebText

Joshua Peterson, Stephan Meylan, & David Bourgin

An open clone of the scraper behind OpenAI's unreleased WebText dataset (blog, paper, code), which was used to train GPT-2. The current result is just over 23 million URLs and over 10 million HTML pages.

This implementation mines and intelligently de-duplicates URLs with at least 3 karma from pre-downloaded (monthly) pushshift.io Reddit submission dumps (which is much faster than making successive calls to the web API), downloads the raw HTML, and extracts text. To save time, you can use the pre-filtered URL lists here, which reduce the 140GB of pushshift data down to the 2GB of URLs actually needed for content scraping. There is also an initial utility for tokenizing, and we are looking to add BPE encoding soon. This code base is functional but in active development, so please feel free to post issues or suggest improvements (pull requests welcome).

Dependencies

If you use pipenv (pip install --user pipenv), cd to the project root and run

pipenv install 
pipenv shell

Otherwise, just run the following in a new virtual environment

pip3 install -r requirements.txt

To Extract/Clean URLs Yourself

You can download the pre-filtered URLs here, but if you want to re-filter them yourself, perhaps with different filtering criteria, follow these instructions. Pushshift dumps must first be downloaded using fetch_urls.py (thanks to simonfall), or manually from here. Two example dumps are included in the repo in the "pushshift_dumps" folder. Next, extract good URLs using:

python extract_urls.py --single_file RS_v2_2005-06.xz

To process multiple pushshift files, specify year ranges:

python extract_urls.py --year_start 2016 --year_end 2018

To change the karma threshold:

python extract_urls.py --single_file RS_v2_2005-06.xz --min_karma 4
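As a rough illustration of the filtering step, the sketch below reads a decompressed pushshift-style dump as newline-delimited JSON and keeps the URLs whose score meets a threshold. The field names "url" and "score" and the helper name extract_good_urls are assumptions made for this example; it is not the code in extract_urls.py.

```python
import io
import json


def extract_good_urls(lines, min_karma=3):
    """Yield URLs from JSON-lines submissions scoring at least min_karma.

    Assumes each line is a JSON object with "url" and "score" keys
    (hypothetical field names; check them against the actual dump).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            post = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records rather than abort the run
        if not isinstance(post, dict):
            continue
        url = post.get("url")
        score = post.get("score", 0)
        if url and isinstance(score, int) and score >= min_karma:
            yield url


# Tiny in-memory stand-in for a decompressed dump file.
sample = io.StringIO(
    '{"url": "https://example.com/a", "score": 5}\n'
    '{"url": "https://example.com/b", "score": 1}\n'
    'not json\n'
    '{"url": "https://example.com/c", "score": 4}\n'
)
good = list(extract_good_urls(sample, min_karma=4))
```

For a real xz-compressed dump, a text-mode file object from lzma.open(path, "rt") could be passed in place of sample, since the function only needs an iterable of lines.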

To de-duplicate the extracted URLs, provide a directory of all URL dumps:

python deduplicate_urls.py --input_dir url_dumps

The outputs of both extract_urls.py and deduplicate_urls.py are plain text files, since all 23 million "good" URLs comprise only 2GB.
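A minimal sketch of cross-file de-duplication, assuming one URL per line in each extracted file. The normalization rules used here (lowercasing the scheme and host, dropping fragments and trailing slashes) are illustrative assumptions and may differ from what deduplicate_urls.py actually does.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Return a canonical form of url for duplicate detection (illustrative rules)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    # Lowercase scheme and host, drop the fragment, keep the query string.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def deduplicate(url_lists):
    """Yield the first occurrence of each URL across several iterables of URLs."""
    seen = set()
    for urls in url_lists:
        for url in urls:
            url = url.strip()
            if not url:
                continue
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                yield url


# Two hypothetical monthly URL lists with one overlapping page.
month_a = ["https://Example.com/post/", "https://example.com/other"]
month_b = ["https://example.com/post#comments", "https://example.com/new"]
unique = list(deduplicate([month_a, month_b]))
```

Keeping only the normalized keys in a set is workable at this scale because the full URL list is stated to be about 2GB; a much larger corpus would likely need hashing or an on-disk structure instead.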

To Scrape HTML (or Text Directly)

This is done one month at a time given the compute/bandwidth required. n_procs is the number of cores to use for parallelization and should be at least 20-40 for fastest results. The script will output results in chunks of size chunk_size. If timeout is not set, or is set to -1, the downloader may hang on large files.
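To illustrate how n_procs and chunk_size might interact, here is a sketch that splits a URL list into fixed-size chunks and maps a worker over each chunk with a process pool. The names chunked, fetch, and run are hypothetical, fetch is only a placeholder for the real download step, and the structure is an assumption rather than a description of download.py.

```python
from itertools import islice
from multiprocessing import Pool


def chunked(items, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def fetch(url):
    """Hypothetical placeholder worker; a real one would download the page."""
    return (url, len(url))


def run(urls, n_procs=4, chunk_size=1000):
    """Process urls chunk by chunk, yielding one list of results per chunk."""
    with Pool(n_procs) as pool:
        for chunk in chunked(urls, chunk_size):
            # Results for a finished chunk can be written to disk before
            # the next chunk starts, which bounds memory use.
            yield pool.map(fetch, chunk)
```

Calling run(urls, n_procs=2, chunk_size=4) from under an if __name__ == "__main__": guard would yield one result list per chunk. Because downloading is mostly network-bound, using more worker processes than physical cores can still help, which is consistent with the high n_procs values suggested above.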

To scrape raw HTML for later processing and text extraction, set --scraper to raw as shown below. The downloaded HTML is stripped of script/style tags and stored in LZMA-compressed archives, along with a small amount of metadata.

python download.py url_dumps_deduped/RS_20XX-XX.xz.deduped.txt --n_procs 100 --scraper raw --chunk_size 100000 --compress --timeout 30
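The storage idea described above (strip script/style, then keep the HTML plus a little metadata in an LZMA-compressed blob) can be sketched with the standard library. This is an assumption-laden illustration: the regex is a simplification of proper HTML parsing, the JSON-lines layout and the "url"/"fetched"/"html" fields are invented for the example, and none of it reflects the repo's actual archive format.

```python
import json
import lzma
import re
import time

# Simplified pattern; a real HTML parser handles edge cases this regex misses.
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)


def strip_script_style(html):
    """Remove <script> and <style> elements, including their contents."""
    return SCRIPT_STYLE.sub("", html)


def pack_pages(pages):
    """Compress (url, html) pairs into one LZMA blob of JSON lines."""
    records = []
    for url, html in pages:
        cleaned = strip_script_style(html)
        # The metadata fields here are illustrative, not the repo's schema.
        records.append(json.dumps({"url": url, "fetched": time.time(), "html": cleaned}))
    return lzma.compress("\n".join(records).encode("utf-8"))


def unpack_pages(blob):
    """Inverse of pack_pages: return the list of stored records."""
    text = lzma.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hi</p><script>x=1</script></body></html>"
)
blob = pack_pages([("https://example.com", page)])
restored = unpack_pages(blob)
```

Storing the cleaned HTML rather than extracted text is what keeps re-extraction with different parameters possible later, at the cost of more disk space.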

To scrape text content directly and save disk space (but without the option to re-extract with different parameters later), set --scraper to newspaper to extract text using the Python newspaper package. For more careful extraction, set --scraper to bs4 (Beautiful Soup 4), which will extract text for all <p> tags on the page.
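The idea of collecting text from paragraph tags can be sketched with the standard library alone. Note the swap: this uses html.parser as a dependency-free stand-in for Beautiful Soup 4, and the class and function names are hypothetical. It is an illustration of the technique, not the repo's bs4 scraper.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text inside every <p> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while inside a <p>
        self.current = []     # text fragments of the paragraph being read
        self.paragraphs = []  # finished paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Join fragments and collapse runs of whitespace.
                text = " ".join("".join(self.current).split())
                if text:
                    self.paragraphs.append(text)
                self.current = []

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


def extract_paragraphs(html):
    """Return the text of each <p> element in document order."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


html = "<body><h1>Title</h1><p>First <b>bold</b> line.</p><div>skip</div><p>Second.</p></body>"
paragraphs = extract_paragraphs(html)
```

This sketch assumes paragraphs are explicitly closed; pages that omit closing tags are one reason a more forgiving parser such as Beautiful Soup is the practical choice.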

To Extract Text from HTML (After Download)

python extract_text.py --html_archive scraped/RS_20XX-XX-X_data.xz --n_procs 100

This currently uses newspaper and outputs txt files.

Tokenization

The original WebText didn't use tokenization, but if you need it, use:

python tokenize_text.py --input_glob "parsed/*.txt" --output_dir tokenized

This will be improved and parallelized soon.
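For readers unfamiliar with the step, here is a minimal sketch of what tokenization means in this context, assuming a simple split into word and punctuation tokens. The regex and the tokenize helper are assumptions for illustration; tokenize_text.py may use a different tokenizer entirely.

```python
import re

# One token per run of word characters, or per single punctuation mark.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens (illustrative rule)."""
    return TOKEN.findall(text)


tokens = tokenize("GPT-2 wasn't trained on this, was it?")
```

A rule this simple splits contractions and hyphenated names into several tokens, which is one reason subword schemes such as BPE are preferred for language model training.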

BPE Encoding

Coming soon...

Original OpenAI project links

Blog Post (Better Language Models and Their Implications)
Paper (Language Models are Unsupervised Multitask Learners)
Code (https://github.com/openai/gpt-2)

Other Implementations

An alternative scraper based on the pushshift.io API and fork of the download code above can be found here

Comments

Why is Newspaper3k used for html scraping?

I noticed newspaper is used for downloading articles while using the raw scraper. Why not use a simpler (and probably less resource-hungry) approach like requests? It just seems unnecessarily complicated. So is there a specific reason for that?

opened by tilmanrpk 6
(Also) parsing structured data while you're at it
One might as well extract structured data from each element of such a dataset.

Linked data. https://5stardata.info/

Useful features:

Relations to e.g. https://schema.org/Dataset (s)

Reified edges to other https://schema.org/ScholarlyArticle (s) indicating whether A seems to confirm or disprove B

URIs for columns in CSV and CSVW datasets

https://www.w3.org/TR/tabular-data-primer/ (CSVW)

... from https://github.com/chiphuyen/lazynlp/issues/1
opened by westurner 5
How to cite this version of openwebtext?

Hi! Do you have any standard bibtex that you recommend to use when citing your work? I understand that this is a reproduction of Radford et al., but your work deserves credit.

opened by Guitaricet 2
Faster extraction
As mentioned in Issue #12 the extract text script is really slow.

To speed things up I found out that the script is a lot faster if we first extract all files inside the archive and then let the parse_file function read the file itself. This requires additional space on the hard disk, but I just assume that everyone running this project has enough space.

Just for my convenience I used the following libraries:

Used tqdm to visualize progress (added to requirements). Can take this dependency out if it's not helpful.

Used pathlib (requires Python >= 3.4).
opened by villmow 2
Efficient BPE tokenization
Use multiprocessing with all available cores

Load files in batches of 10k files, combine at least 1e8 tokens per file.

Compressed, this is about 15MB/file

Save results in compressed numpy arrays

Append EOT token to each file so all numpy arrays can be safely concat'd

Crappy tqdm progress. Doesn't handle multiproc very well but looks better than nothing.

My webtext is ~16M files. On a p3.2xlarge I estimate ~6 hours to encode the whole dataset into 1600 files.

Fixes #8
opened by 8enmann 2
Bump pillow from 5.4.1 to 9.0.1
Bumps pillow from 5.4.1 to 9.0.1.

Release notes

Sourced from pillow's releases.

9.0.1

https://pillow.readthedocs.io/en/stable/releasenotes/9.0.1.html

Changes

In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [@radarhere, @hugovk]

Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]

9.0.0

https://pillow.readthedocs.io/en/stable/releasenotes/9.0.0.html

Changes

Restrict builtins for ImageMath.eval() #5923 [@radarhere]

Ensure JpegImagePlugin stops at the end of a truncated file #5921 [@radarhere]

Fixed ImagePath.Path array handling #5920 [@radarhere]

Remove consecutive duplicate tiles that only differ by their offset #5919 [@radarhere]

Removed redundant part of condition #5915 [@radarhere]

Explicitly enable strip chopping for large uncompressed TIFFs #5517 [@kmilos]

Use the Windows method to get TCL functions on Cygwin #5807 [@DWesl]

Changed error type to allow for incremental WebP parsing #5404 [@radarhere]

Improved I;16 operations on big endian #5901 [@radarhere]

Ensure that BMP pixel data offset does not ignore palette #5899 [@radarhere]

Limit quantized palette to number of colors #5879 [@radarhere]

Use latin1 encoding to decode bytes #5870 [@radarhere]

Fixed palette index for zeroed color in FASTOCTREE quantize #5869 [@radarhere]

When saving RGBA to GIF, make use of first transparent palette entry #5859 [@radarhere]

Pass SAMPLEFORMAT to libtiff #5848 [@radarhere]

Added rounding when converting P and PA #5824 [@radarhere]

Improved putdata() documentation and data handling #5910 [@radarhere]

Exclude carriage return in PDF regex to help prevent ReDoS #5912 [@radarhere]

Image.NONE is only used for resampling and dithers #5908 [@radarhere]

Fixed freeing pointer in ImageDraw.Outline.transform #5909 [@radarhere]

Add Tidelift alignment action and badge #5763 [@aclark4life]

Replaced further direct invocations of setup.py #5906 [@radarhere]

Added ImageShow support for xdg-open #5897 [@m-shinder]

Fixed typo #5902 [@radarhere]

Switched from deprecated "setup.py install" to "pip install ." #5896 [@radarhere]

Support 16-bit grayscale ImageQt conversion #5856 [@cmbruns]

Fixed raising OSError in _safe_read when size is greater than SAFEBLOCK #5872 [@radarhere]

Convert subsequent GIF frames to RGB or RGBA #5857 [@radarhere]

WebP: Fix memory leak during decoding on failure #5798 [@ilai-deutel]

Do not prematurely return in ImageFile when saving to stdout #5665 [@infmagic2047]

Added support for top right and bottom right TGA orientations #5829 [@radarhere]

Corrected ICNS file length in header #5845 [@radarhere]

Block tile TIFF tags when saving #5839 [@radarhere]

Added line width argument to ImageDraw polygon #5694 [@radarhere]

Do not redeclare class each time when converting to NumPy #5844 [@radarhere]

Only prevent repeated polygon pixels when drawing with transparency #5835 [@radarhere]

... (truncated)

Changelog

Sourced from pillow's changelog.

9.0.1 (2022-02-03)

In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [radarhere, hugovk]

Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]

9.0.0 (2022-01-02)

Restrict builtins for ImageMath.eval(). CVE-2022-22817 #5923 [radarhere]

Ensure JpegImagePlugin stops at the end of a truncated file #5921 [radarhere]

Fixed ImagePath.Path array handling. CVE-2022-22815, CVE-2022-22816 #5920 [radarhere]

Remove consecutive duplicate tiles that only differ by their offset #5919 [radarhere]

Improved I;16 operations on big endian #5901 [radarhere]

Limit quantized palette to number of colors #5879 [radarhere]

Fixed palette index for zeroed color in FASTOCTREE quantize #5869 [radarhere]

When saving RGBA to GIF, make use of first transparent palette entry #5859 [radarhere]

Pass SAMPLEFORMAT to libtiff #5848 [radarhere]

Added rounding when converting P and PA #5824 [radarhere]

Improved putdata() documentation and data handling #5910 [radarhere]

Exclude carriage return in PDF regex to help prevent ReDoS #5912 [hugovk]

Fixed freeing pointer in ImageDraw.Outline.transform #5909 [radarhere]

... (truncated)

Commits

6deac9e 9.0.1 version bump

c04d812 Update CHANGES.rst [ci skip]

4fabec3 Added release notes for 9.0.1

02affaa Added delay after opening image with xdg-open

ca0b585 Updated formatting

427221e In show_file, use os.remove to remove temporary images

c930be0 Restrict builtins within lambdas for ImageMath.eval

75b69dd Dont need to pin for GHA

cd938a7 Autolink CWE numbers with sphinx-issues

2e9c461 Add CVE IDs

Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot use these labels will set the current labels as the default for future PRs for this repo and language

@dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language

@dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language

@dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the Security Alerts page.

dependencies
opened by dependabot[bot] 1
Bump pillow from 6.2.0 to 8.3.2
Bumps pillow from 6.2.0 to 8.3.2.

Release notes

Sourced from pillow's releases.

8.3.2

https://pillow.readthedocs.io/en/stable/releasenotes/8.3.2.html

Security

CVE-2021-23437 Raise ValueError if color specifier is too long [hugovk, radarhere]

Fix 6-byte OOB read in FliDecode [wiredfool]

Python 3.10 wheels

Add support for Python 3.10 #5569, #5570 [hugovk, radarhere]

Fixed regressions

Ensure TIFF RowsPerStrip is multiple of 8 for JPEG compression #5588 [kmilos, radarhere]

Updates for ImagePalette channel order #5599 [radarhere]

Hide FriBiDi shim symbols to avoid conflict with real FriBiDi library #5651 [nulano]

8.3.1

https://pillow.readthedocs.io/en/stable/releasenotes/8.3.1.html

Changes

Catch OSError when checking if fp is sys.stdout #5585 [@radarhere]

Handle removing orientation from alternate types of EXIF data #5584 [@radarhere]

Make Image.array take optional dtype argument #5572 [@t-vi]

8.3.0

https://pillow.readthedocs.io/en/stable/releasenotes/8.3.0.html

Changes

Use snprintf instead of sprintf #5567 [@radarhere]

Limit TIFF strip size when saving with LibTIFF #5514 [@kmilos]

Allow ICNS save on all operating systems #4526 [@newpanjing]

De-zigzag JPEG's DQT when loading; deprecate convert_dict_qtables #4989 [@gofr]

Do not use background or transparency index for new color #5564 [@radarhere]

Simplified code #5315 [@radarhere]

Replaced xml.etree.ElementTree #5565 [@radarhere]

... (truncated)

Changelog

Sourced from pillow's changelog.

8.3.2 (2021-09-02)

CVE-2021-23437 Raise ValueError if color specifier is too long [hugovk, radarhere]

Fix 6-byte OOB read in FliDecode [wiredfool]

Add support for Python 3.10 #5569, #5570 [hugovk, radarhere]

Ensure TIFF RowsPerStrip is multiple of 8 for JPEG compression #5588 [kmilos, radarhere]

Updates for ImagePalette channel order #5599 [radarhere]

Hide FriBiDi shim symbols to avoid conflict with real FriBiDi library #5651 [nulano]

8.3.1 (2021-07-06)

Catch OSError when checking if fp is sys.stdout #5585 [radarhere]

Handle removing orientation from alternate types of EXIF data #5584 [radarhere]

Make Image.array take optional dtype argument #5572 [t-vi, radarhere]

8.3.0 (2021-07-01)

Use snprintf instead of sprintf. CVE-2021-34552 #5567 [radarhere]

Limit TIFF strip size when saving with LibTIFF #5514 [kmilos]

Allow ICNS save on all operating systems #4526 [baletu, radarhere, newpanjing, hugovk]

De-zigzag JPEG's DQT when loading; deprecate convert_dict_qtables #4989 [gofr, radarhere]

Replaced xml.etree.ElementTree #5565 [radarhere]

... (truncated)

Commits

8013f13 8.3.2 version bump

23c7ca8 Update CHANGES.rst

8450366 Update release notes

a0afe89 Update test case

9e08eb8 Raise ValueError if color specifier is too long

bd5cf7d FLI tests for Oss-fuzz crash.

94a0cf1 Fix 6-byte OOB read in FliDecode

cece64f Add 8.3.2 (2021-09-02) [CI skip]

e422386 Add release notes for Pillow 8.3.2

08dcbb8 Pillow 8.3.2 supports Python 3.10 [ci skip]

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 1
Bump pillow from 6.2.0 to 8.2.0
Bumps pillow from 6.2.0 to 8.2.0.

Release notes

Sourced from pillow's releases.

8.2.0

https://pillow.readthedocs.io/en/stable/releasenotes/8.2.0.html

Changes

Security fixes for 8.2.0 #5377 [@hugovk]

Move getxmp() to JpegImageFile #5376 [@radarhere]

Added getxmp() method #5144 [@UrielMaD]

Compile LibTIFF with CMake on Windows #5359 [@nulano]

Add ImageShow support for GraphicsMagick #5349 [@latosha-maltba]

Tiff crash fixes in TiffDecode.c #5372 [@wiredfool]

Remove redundant check (addition to #5364) #5366 [@kkopachev]

Do not load transparent pixels from subsequent GIF frames #5333 [@radarhere]

Use LZW encoding when saving GIF images #5291 [@raygard]

Set all transparent colors to be equal in quantize() #5282 [@radarhere]

Allow PixelAccess to use Python int when parsing x and y #5206 [@radarhere]

Removed Image._MODEINFO #5316 [@radarhere]

Add preserve_tone option to autocontrast #5350 [@elejke]

Only import numpy when necessary #5323 [@radarhere]

Fixed linear_gradient and radial_gradient I and F modes #5274 [@radarhere]

Add support for reading TIFFs with PlanarConfiguration=2 #5364 [@wiredfool]

More OSS-Fuzz support #5328 [@wiredfool]

Do not premultiply alpha when resizing with Image.NEAREST resampling #5304 [@nulano]

Use quantization method attributes #5353 [@radarhere]

Dynamically link FriBiDi instead of Raqm #5062 [@nulano]

Removed build_distance_tables return value #5363 [@radarhere]

Allow fewer PNG palette entries than the bit depth maximum when saving #5330 [@radarhere]

Use duration from info dictionary when saving WebP #5338 [@radarhere]

Improved efficiency when creating GIF disposal images #5326 [@radarhere]

Stop flattening EXIF IFD into getexif() #4947 [@radarhere]

Replaced tiff_deflate with tiff_adobe_deflate compression when saving TIFF images #5343 [@radarhere]

Save ICC profile from TIFF encoderinfo #5321 [@radarhere]

Moved RGB fix inside ImageQt class #5268 [@radarhere]

Fix -Wformat error in TiffDecode #5305 [@lukegb]

Allow alpha_composite destination to be negative #5313 [@radarhere]

Ensure file is closed if it is opened by ImageQt.ImageQt #5260 [@radarhere]

Added ImageDraw rounded_rectangle method #5208 [@radarhere]

Added IPythonViewer #5289 [@radarhere]

Only draw each rectangle outline pixel once #5183 [@radarhere]

Use mmap instead of built-in Win32 mapper #5224 [@radarhere]

Handle PCX images with an odd stride #5214 [@radarhere]

Only read different sizes for "Large Thumbnail" MPO frames #5168 [@radarhere]

Dependencies

Updated harfbuzz to 2.8.0 #5334 [@radarhere]

Deprecations

... (truncated)

Changelog

Sourced from pillow's changelog.

8.2.0 (2021-04-01)

Added getxmp() method #5144 [UrielMaD, radarhere]

Add ImageShow support for GraphicsMagick #5349 [latosha-maltba, radarhere]

Do not load transparent pixels from subsequent GIF frames #5333 [zewt, radarhere]

Use LZW encoding when saving GIF images #5291 [raygard]

Set all transparent colors to be equal in quantize() #5282 [radarhere]

Allow PixelAccess to use Python int when parsing x and y #5206 [radarhere]

Removed Image._MODEINFO #5316 [radarhere]

Add preserve_tone option to autocontrast #5350 [elejke, radarhere]

Fixed linear_gradient and radial_gradient I and F modes #5274 [radarhere]

Add support for reading TIFFs with PlanarConfiguration=2 #5364 [kkopachev, wiredfool, nulano]

Deprecated categories #5351 [radarhere]

Do not premultiply alpha when resizing with Image.NEAREST resampling #5304 [nulano]

Dynamically link FriBiDi instead of Raqm #5062 [nulano]

Allow fewer PNG palette entries than the bit depth maximum when saving #5330 [radarhere]

Use duration from info dictionary when saving WebP #5338 [radarhere]

Stop flattening EXIF IFD into getexif() #4947 [radarhere, kkopachev]

... (truncated)

Commits

e0e353c 8.2.0 version bump

ee635be Merge pull request #5377 from hugovk/security-and-release-notes

694c84f Fix typo [ci skip]

8febdad Review, typos and lint

fea4196 Reorder, roughly alphabetic

496245a Fix BLP DOS -- CVE-2021-28678

22e9bee Fix DOS in PSDImagePlugin -- CVE-2021-28675

ba65f0b Fix Memory DOS in ImageFont

bb6c11f Fix FLI DOS -- CVE-2021-28676

5a5e6db Fix EPS DOS on _open -- CVE-2021-28677

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 1
Bump urllib3 from 1.25.6 to 1.25.8
Bumps urllib3 from 1.25.6 to 1.25.8.

Release notes

Sourced from urllib3's releases.

1.25.8

Release: 1.25.8

1.25.7

No release notes provided.

Changelog

Sourced from urllib3's changelog.

1.25.8 (2020-01-20)

Drop support for EOL Python 3.4 (Pull #1774)

Optimize _encode_invalid_chars (Pull #1787)

1.25.7 (2019-11-11)

Preserve chunked parameter on retries (Pull #1715, Pull #1734)

Allow unset SERVER_SOFTWARE in App Engine (Pull #1704, Issue #1470)

Fix issue where URL fragment was sent within the request target. (Pull #1732)

Fix issue where an empty query section in a URL would fail to parse. (Pull #1732)

Remove TLS 1.3 support in SecureTransport due to Apple removing support (Pull #1703)

Commits

2a57bc5 Release 1.25.8 (#1788)

a2697e7 Optimize _encode_invalid_chars (#1787)

d2a5a59 Move IPv6 test skips in server fixtures

d44f0e5 Factorize test certificates serialization

84abc7f Generate IPV6 certificates using trustme

6a15b18 Run IPv6 Tornado server from fixture

4903840 Use trustme to generate IP_SAN cert

9971e27 Empty responses should have no lines.

62ef68e Use trustme to generate NO_SAN certs

fd2666e Use fixture to configure NO_SAN test certs

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 1
Bump pillow from 6.2.0 to 8.1.1
Bumps pillow from 6.2.0 to 8.1.1.

Release notes

Sourced from pillow's releases.

8.1.1

https://pillow.readthedocs.io/en/stable/releasenotes/8.1.1.html

8.1.0

https://pillow.readthedocs.io/en/stable/releasenotes/8.1.0.html

Changes

Fix TIFF OOB Write error #5175 [@radarhere]

Fix for Buffer Read Overrun in PCX Decoding #5174 [@radarhere]

Fix for SGI Decode buffer overrun #5173 [@radarhere]

Fix OOB Read when saving GIF of xsize=1 #5149 [@wiredfool]

Add support for PySide6 #5161 [@hugovk]

Moved QApplication into one test #5167 [@radarhere]

Use disposal settings from previous frame in APNG #5126 [@radarhere]

Revert "skip wheels on 3.10-dev due to wheel#354" #5163 [@radarhere]

Better _binary module use #5156 [@radarhere]

Added exception explaining that repr_png saves to PNG #5139 [@radarhere]

Use previous disposal method in GIF load_end #5125 [@radarhere]

Do not catch a ValueError only to raise another #5090 [@radarhere]

Allow putpalette to accept 1024 integers to include alpha values #5089 [@radarhere]

Fix OOB Read when writing TIFF with custom Metadata #5148 [@wiredfool]

Removed unused variable #5140 [@radarhere]

Fix dereferencing of potential null pointers #5111 [@cgohlke]

Fixed warnings assigning to "unsigned char *" from "char *" #5127 [@radarhere]

Add append_images support for ICO #4568 [@ziplantil]

Fixed comparison warnings #5122 [@radarhere]

Block TIFFTAG_SUBIFD #5120 [@radarhere]

Fix dereferencing potential null pointer #5108 [@cgohlke]

Replaced PyErr_NoMemory with ImagingError_MemoryError #5113 [@radarhere]

Remove duplicate code #5109 [@cgohlke]

Moved warning to end of execution #4965 [@radarhere]

Removed unused fromstring and tostring C methods #5026 [@radarhere]

init() if one of the formats is unrecognised #5037 [@radarhere]

Dependencies

Updated libtiff to 4.2.0 #5153 [@radarhere]

Updated openjpeg to 2.4.0 #5151 [@radarhere]

Updated harfbuzz to 2.7.4 #5138 [@radarhere]

Updated harfbuzz to 2.7.3 #5128 [@radarhere]

Updated libraqm to 0.7.1 #5070 [@radarhere]

Updated libimagequant to 2.13.1 #5065 [@radarhere]

Update FriBiDi to 1.0.10 #5064 [@nulano]

Updated libraqm to 0.7.1 #5063 [@radarhere]

Updated libjpeg-turbo to 2.0.6 #5044 [@radarhere]

Deprecations

... (truncated)

Changelog

Sourced from pillow's changelog.

8.1.1 (2021-03-01)

Use more specific regex chars to prevent ReDoS. CVE-2021-25292 [hugovk]

Fix OOB Read in TiffDecode.c, and check the tile validity before reading. CVE-2021-25291 [wiredfool]

Fix negative size read in TiffDecode.c. CVE-2021-25290 [wiredfool]

Fix OOB read in SgiRleDecode.c. CVE-2021-25293 [wiredfool]

Incorrect error code checking in TiffDecode.c. CVE-2021-25289 [wiredfool]

PyModule_AddObject fix for Python 3.10 #5194 [radarhere]

8.1.0 (2021-01-02)

Fix TIFF OOB Write error. CVE-2020-35654 #5175 [wiredfool]

Fix for Read Overflow in PCX Decoding. CVE-2020-35653 #5174 [wiredfool, radarhere]

Fix for SGI Decode buffer overrun. CVE-2020-35655 #5173 [wiredfool, radarhere]

Fix OOB Read when saving GIF of xsize=1 #5149 [wiredfool]

Makefile updates #5159 [wiredfool, radarhere]

Add support for PySide6 #5161 [hugovk]

Use disposal settings from previous frame in APNG #5126 [radarhere]

Added exception explaining that repr_png saves to PNG #5139 [radarhere]

Use previous disposal method in GIF load_end #5125 [radarhere]

... (truncated)

Commits

741d874 8.1.1 version bump

179cd1c Added 8.1.1 release notes to index

7d29665 Update CHANGES.rst [ci skip]

d25036f Credits

973a4c3 Release notes for 8.1.1

521dab9 Use more specific regex chars to prevent ReDoS

8b8076b Fix for CVE-2021-25291

e25be1e Fix negative size read in TiffDecode.c

f891baa Fix OOB read in SgiRleDecode.c

cbfdde7 Incorrect error code checking in TiffDecode.c

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 1
Bump lxml from 4.3.1 to 4.6.2
Bumps lxml from 4.3.1 to 4.6.2.

Changelog

Sourced from lxml's changelog.

4.6.2 (2020-11-26)

Bugs fixed

A vulnerability (CVE-2020-27783) was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

4.6.1 (2020-10-18)

Bugs fixed

A vulnerability was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

4.6.0 (2020-10-17)

Features added

GH#310: lxml.html.InputGetter supports __len__() to count the number of input fields. Patch by Aidan Woolley.

lxml.html.InputGetter has a new .items() method to ease processing all input fields.

lxml.html.InputGetter.keys() now returns the field names in document order.

GH-309: The API documentation is now generated using sphinx-apidoc. Patch by Chris Mayo.

Bugs fixed

LP#1869455: C14N 2.0 serialisation failed for unprefixed attributes when a default namespace was defined.

TreeBuilder.close() raised AssertionError in some error cases where it should have raised XMLSyntaxError. It now raises a combined exception to keep up backwards compatibility, while switching to XMLSyntaxError as an interface.

4.5.2 (2020-07-09)

... (truncated)

Commits

4cb5736 Work around Py2's lack of "re.ASCII".

c30106f Prepare release of 4.6.2.

a105ab8 Prevent combinations of <math/svg> and <style> to sneak JavaScript through th...

c053dc1 Add a recipe for a look-ahead generator to allow modifications during tree it...

b083124 lxml actually works in Py3.9.

0f80590 lxml actually works in Py3.9.

fd8893c Add a doc note that the .find() methods are usually faster than one might exp...

eb6df27 Update release version on homepage.

69b5c9b Automate the build artefact downloading from github and appveyor.

61432a8 Prepare release of lxml 4.6.1.

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 1
Bump certifi from 2018.11.29 to 2022.12.7
Bumps certifi from 2018.11.29 to 2022.12.7.

Commits

9e9e840 2022.12.07

b81bdb2 2022.09.24

939a28f 2022.09.14

aca828a 2022.06.15.2

de0eae1 Only use importlib.resources's new files() / Traversable API on Python ≥3.11 ...

b8eb5e9 2022.06.15.1

47fb7ab Fix deprecation warning on Python 3.11 (#199)

b0b48e0 fixes #198 -- update link in license

9d514b4 2022.06.15

4151e88 Add py.typed to MANIFEST.in to package in sdist (#196)

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
Bump pillow from 5.4.1 to 9.3.0
Bumps pillow from 5.4.1 to 9.3.0.

Release notes

Sourced from pillow's releases.

9.3.0

https://pillow.readthedocs.io/en/stable/releasenotes/9.3.0.html

Changes

Initialize libtiff buffer when saving #6699 [@radarhere]

Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [@wiredfool]

Inline fname2char to fix memory leak #6329 [@nulano]

Fix memory leaks related to text features #6330 [@nulano]

Use double quotes for version check on old CPython on Windows #6695 [@hugovk]

GHA: replace deprecated set-output command with GITHUB_OUTPUT file #6697 [@nulano]

Remove backup implementation of Round for Windows platforms #6693 [@cgohlke]

Upload fribidi.dll to GitHub Actions #6532 [@nulano]

Fixed set_variation_by_name offset #6445 [@radarhere]

Windows build improvements #6562 [@nulano]

Fix malloc in _imagingft.c:font_setvaraxes #6690 [@cgohlke]

Only use ASCII characters in C source file #6691 [@cgohlke]

Release Python GIL when converting images using matrix operations #6418 [@hmaarrfk]

Added ExifTags enums #6630 [@radarhere]

Do not modify previous frame when calculating delta in PNG #6683 [@radarhere]

Added support for reading BMP images with RLE4 compression #6674 [@npjg]

Decode JPEG compressed BLP1 data in original mode #6678 [@radarhere]

pylint warnings #6659 [@marksmayo]

Added GPS TIFF tag info #6661 [@radarhere]

Added conversion between RGB/RGBA/RGBX and LAB #6647 [@radarhere]

Do not attempt normalization if mode is already normal #6644 [@radarhere]

Fixed seeking to an L frame in a GIF #6576 [@radarhere]

Consider all frames when selecting mode for PNG save_all #6610 [@radarhere]

Don't reassign crc on ChunkStream close #6627 [@radarhere]

Raise a warning if NumPy failed to raise an error during conversion #6594 [@radarhere]

Only read a maximum of 100 bytes at a time in IMT header #6623 [@radarhere]

Show all frames in ImageShow #6611 [@radarhere]

Allow FLI palette chunk to not be first #6626 [@radarhere]

If first GIF frame has transparency for RGB_ALWAYS loading strategy, use RGBA mode #6592 [@radarhere]

Round box position to integer when pasting embedded color #6517 [@radarhere]

Removed EXIF prefix when saving WebP #6582 [@radarhere]

Pad IM palette to 768 bytes when saving #6579 [@radarhere]

Added DDS BC6H reading #6449 [@ShadelessFox]

Added support for opening WhiteIsZero 16-bit integer TIFF images #6642 [@JayWiz]

Raise an error when allocating translucent color to RGB palette #6654 [@jsbueno]

Moved mode check outside of loops #6650 [@radarhere]

Added reading of TIFF child images #6569 [@radarhere]

Improved ImageOps palette handling #6596 [@PososikTeam]

Defer parsing of palette into colors #6567 [@radarhere]

Apply transparency to P images in ImageTk.PhotoImage #6559 [@radarhere]

Use rounding in ImageOps contain() and pad() #6522 [@bibinhashley]

Fixed GIF remapping to palette with duplicate entries #6548 [@radarhere]

Allow remap_palette() to return an image with less than 256 palette entries #6543 [@radarhere]

Corrected BMP and TGA palette size when saving #6500 [@radarhere]

... (truncated)

Changelog

Sourced from pillow's changelog.

9.3.0 (2022-10-29)

Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [wiredfool]

Initialize libtiff buffer when saving #6699 [radarhere]

Inline fname2char to fix memory leak #6329 [nulano]

Fix memory leaks related to text features #6330 [nulano]

Use double quotes for version check on old CPython on Windows #6695 [hugovk]

Remove backup implementation of Round for Windows platforms #6693 [cgohlke]

Fixed set_variation_by_name offset #6445 [radarhere]

Fix malloc in _imagingft.c:font_setvaraxes #6690 [cgohlke]

Release Python GIL when converting images using matrix operations #6418 [hmaarrfk]

Added ExifTags enums #6630 [radarhere]

Do not modify previous frame when calculating delta in PNG #6683 [radarhere]

Added support for reading BMP images with RLE4 compression #6674 [npjg, radarhere]

Decode JPEG compressed BLP1 data in original mode #6678 [radarhere]

Added GPS TIFF tag info #6661 [radarhere]

Added conversion between RGB/RGBA/RGBX and LAB #6647 [radarhere]

Do not attempt normalization if mode is already normal #6644 [radarhere]

... (truncated)

Commits

d594f4c Update CHANGES.rst [ci skip]

909dc64 9.3.0 version bump

1a51ce7 Merge pull request #6699 from hugovk/security-libtiff_buffer

2444cdd Merge pull request #6700 from hugovk/security-samples_per_pixel-sec

744f455 Added release notes

0846bfa Add to release notes

799a6a0 Fix linting

00b25fd Hide UserWarning in logs

05b175e Tighter test case

13f2c5a Prevent DOS with large SAMPLESPERPIXEL in Tiff IFD

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
CVE-2007-4559 Patch

Patching CVE-2007-4559

Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.
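
The check described above can be sketched roughly as follows. This is a minimal illustration of the idea (reject any member whose resolved path falls outside the extraction directory), not the actual patch from the pull request. The names `is_within_directory` and `safe_extract` are hypothetical, and the sketch only inspects member names; it does not validate symlink or hardlink targets inside the archive.

```python
import os
import tarfile


def is_within_directory(directory: str, target: str) -> bool:
    """Return True if `target` resolves to a location inside `directory`."""
    abs_directory = os.path.realpath(directory)
    abs_target = os.path.realpath(target)
    # commonpath equals the directory only when target sits at or below it.
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extract(tar: tarfile.TarFile, path: str = ".") -> None:
    """Hypothetical helper: check every member name, then extract.

    Raises ValueError before anything is written if a member such as
    "../evil.txt" or "/etc/passwd" would land outside `path`.
    """
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"Attempted path traversal in tar file: {member.name}")
    tar.extractall(path)
```

A call site would then use `safe_extract(tar, out_dir)` in place of `tar.extractall(out_dir)`. On Python 3.12 and later, passing `filter="data"` to `extractall()` is another option worth considering, since the standard library then performs its own checks.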

opened by TrellixVulnTeam 0
Bump urllib3 from 1.25.6 to 1.26.5
Bumps urllib3 from 1.25.6 to 1.26.5.

Release notes

Sourced from urllib3's releases.

1.26.5

IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

Fixed deprecation warnings emitted in Python 3.10.

Updated vendored six library to 1.16.0.

Improved performance of URL parser when splitting the authority component.

If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

1.26.4

IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

Changed behavior of the default SSLContext when connecting to HTTPS proxy during HTTPS requests. The default SSLContext now sets check_hostname=True.

If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

1.26.3

IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

Fixed bytes and string comparison issue with headers (Pull #2141)

Changed ProxySchemeUnknown error message to be more actionable if the user supplies a proxy URL without a scheme (Pull #2107)

If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

1.26.2

IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

Fixed an issue where wrap_socket and CERT_REQUIRED wouldn't be imported properly on Python 2.7.8 and earlier (Pull #2052)

1.26.1

IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

Fixed an issue where two User-Agent headers would be sent if a User-Agent header key is passed as bytes (Pull #2047)

1.26.0

IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

Added support for HTTPS proxies contacting HTTPS servers (Pull #1923, Pull #1806)

Deprecated negotiating TLSv1 and TLSv1.1 by default. Users that still wish to use TLS earlier than 1.2 without a deprecation warning should opt in explicitly by setting ssl_version=ssl.PROTOCOL_TLSv1_1 (Pull #2002). Starting in urllib3 v2.0, connections that receive a DeprecationWarning will fail.

Deprecated Retry options Retry.DEFAULT_METHOD_WHITELIST, Retry.DEFAULT_REDIRECT_HEADERS_BLACKLIST and Retry(method_whitelist=...) in favor of Retry.DEFAULT_ALLOWED_METHODS, Retry.DEFAULT_REMOVE_HEADERS_ON_REDIRECT, and Retry(allowed_methods=...) (Pull #2000). Starting in urllib3 v2.0, deprecated options will be removed.

... (truncated)

Changelog

Sourced from urllib3's changelog.

1.26.5 (2021-05-26)

Fixed deprecation warnings emitted in Python 3.10.

Updated vendored six library to 1.16.0.

Improved performance of URL parser when splitting the authority component.

1.26.4 (2021-03-15)

Changed behavior of the default SSLContext when connecting to HTTPS proxy during HTTPS requests. The default SSLContext now sets check_hostname=True.

1.26.3 (2021-01-26)

Fixed bytes and string comparison issue with headers (Pull #2141)

Changed ProxySchemeUnknown error message to be more actionable if the user supplies a proxy URL without a scheme. (Pull #2107)

1.26.2 (2020-11-12)

Fixed an issue where wrap_socket and CERT_REQUIRED wouldn't be imported properly on Python 2.7.8 and earlier (Pull #2052)

1.26.1 (2020-11-11)

Fixed an issue where two User-Agent headers would be sent if a User-Agent header key is passed as bytes (Pull #2047)

1.26.0 (2020-11-10)

NOTE: urllib3 v2.0 will drop support for Python 2. Read more in the v2.0 Roadmap (https://urllib3.readthedocs.io/en/latest/v2-roadmap.html).

Added support for HTTPS proxies contacting HTTPS servers (Pull #1923, Pull #1806)

Deprecated negotiating TLSv1 and TLSv1.1 by default. Users that still wish to use TLS earlier than 1.2 without a deprecation warning

... (truncated)

Commits

d161647 Release 1.26.5

2d4a3fe Improve performance of sub-authority splitting in URL

2698537 Update vendored six to 1.16.0

07bed79 Fix deprecation warnings for Python 3.10 ssl module

d725a9b Add Python 3.10 to GitHub Actions

339ad34 Use pytest==6.2.4 on Python 3.10+

f271c9c Apply latest Black formatting

1884878 [1.26] Properly proxy EOF on the SSLTransport test suite

a891304 Release 1.26.4

8d65ea1 Merge pull request from GHSA-5phf-pp7p-vc2r

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
Bump lxml from 4.3.1 to 4.6.3
Bumps lxml from 4.3.1 to 4.6.3.

Changelog

Sourced from lxml's changelog.

4.6.3 (2021-03-21)

Bugs fixed

A vulnerability (CVE-2021-28957) was discovered in the HTML Cleaner by Kevin Chung, which allowed JavaScript to pass through. The cleaner now removes the HTML5 formaction attribute.

4.6.2 (2020-11-26)

Bugs fixed

A vulnerability (CVE-2020-27783) was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

4.6.1 (2020-10-18)

Bugs fixed

A vulnerability was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

4.6.0 (2020-10-17)

Features added

GH#310: lxml.html.InputGetter supports __len__() to count the number of input fields. Patch by Aidan Woolley.

lxml.html.InputGetter has a new .items() method to ease processing all input fields.

lxml.html.InputGetter.keys() now returns the field names in document order.

GH-309: The API documentation is now generated using sphinx-apidoc. Patch by Chris Mayo.

Bugs fixed

... (truncated)

Commits

a5f9cb5 Prepare release of lxml 4.6.3.

2d01a1b Add HTML-5 "formaction" attribute to "defs.link_attrs" (GH-316)

e986a9c Fix reference in docs.

4cb5736 Work around Py2's lack of "re.ASCII".

c30106f Prepare release of 4.6.2.

a105ab8 Prevent combinations of <math/svg> and <style> to sneak JavaScript through th...

c053dc1 Add a recipe for a look-ahead generator to allow modifications during tree it...

b083124 lxml actually works in Py3.9.

0f80590 lxml actually works in Py3.9.

fd8893c Add a doc note that the .find() methods are usually faster than one might exp...

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0