Combine XPath, CSS Selectors and JSONPath for Web data extracting.

Overview

Data Extractor

license Pypi Status Python version Package version PyPI - Downloads GitHub last commit Code style: black Build Status codecov Documentation Status PDM managed

Combine XPath, CSS Selectors and JSONPath for Web data extracting.

Quickstarts

Installation

Install the stable version from PYPI.

pip install "data-extractor[jsonpath-extractor]"  # for extracting JSON data
pip install "data-extractor[lxml]"  # for extracting HTML data

Or install the latest version from Github.

pip install "data-extractor[jsonpath-extractor] @ git+https://github.com/linw1995/data_extractor.git@master"

Extract JSON data

Currently supports to extract JSON data with below optional dependencies

install one dependency of them to extract JSON data.

Extract HTML(XML) data

Currently supports to extract HTML(XML) data with below optional dependencies

Usage

from data_extractor import Field, Item, JSONExtractor


class Count(Item):
    followings = Field(JSONExtractor("countFollowings"))
    fans = Field(JSONExtractor("countFans"))


class User(Item):
    name_ = Field(JSONExtractor("name"), name="name")
    age = Field(JSONExtractor("age"), default=17)
    count = Count()


assert User(JSONExtractor("data.users[*]"), is_many=True).extract(
    {
        "data": {
            "users": [
                {
                    "name": "john",
                    "age": 19,
                    "countFollowings": 14,
                    "countFans": 212,
                },
                {
                    "name": "jack",
                    "description": "",
                    "countFollowings": 54,
                    "countFans": 312,
                },
            ]
        }
    }
) == [
    {"name": "john", "age": 19, "count": {"followings": 14, "fans": 212}},
    {"name": "jack", "age": 17, "count": {"followings": 54, "fans": 312}},
]

Changelog

Unreleased

Feature

  • Generic extractor with convertor (#83)
  • mypy plugin for type annotation of extracting result (#83)

v0.10.2

Build

  • upgrade jsonpath-extractor to v0.8.0

Contributing

Environment Setup

Clone the source codes from Github.

git clone https://github.com/linw1995/data_extractor.git
cd data_extractor

Setup the development environment. Please make sure you install the pdm, pre-commit and nox CLIs in your environment.

make init
make PYTHON=3.7 init  # for specific python version

Linting

Use pre-commit for installing linters to ensure a good code style.

make pre-commit

Run linters. Some linters run via CLI nox, so make sure you install it.

make check-all

Testing

Run quick tests.

make

Run quick tests with verbose.

make vtest

Run tests with coverage. Testing in multiple Python environments is powered by CLI nox.

make cov
Comments
  • Support for dynamic creation of Items

    Support for dynamic creation of Items

    currently is not possible to create items that inherit from AbstractComplexExtractor, the code breaks when _check_field_overwrites_method() is executed

    bug 
    opened by walison17 4
  • build(deps): bump lxml from 4.6.3 to 4.6.5

    build(deps): bump lxml from 4.6.3 to 4.6.5

    Bumps lxml from 4.6.3 to 4.6.5.

    Changelog

    Sourced from lxml's changelog.

    4.6.5 (2021-12-12)

    Bugs fixed

    • A vulnerability (GHSL-2021-1038) in the HTML cleaner allowed sneaking script content through SVG images.

    • A vulnerability (GHSL-2021-1037) in the HTML cleaner allowed sneaking script content through CSS imports and other crafted constructs.

    4.6.4 (2021-11-01)

    Features added

    • GH#317: A new property system_url was added to DTD entities. Patch by Thirdegree.

    • GH#314: The STATIC_* variables in setup.py can now be passed via env vars. Patch by Isaac Jurado.

    Commits
    • a9611ba Fix a test in Py2.
    • a3eacbc Prepare release of 4.6.5.
    • b7ea687 Update changelog.
    • 69a7473 Cleaner: cover some more cases where scripts could sneak through in specially...
    • 54d2985 Fix condition in test decorator.
    • 4b220b5 Use the non-depcrecated TextTestResult instead of _TextTestResult (GH-333)
    • d85c6de Exclude a test when using the macOS system libraries because it fails with li...
    • cd4bec9 Add macOS-M1 as wheel build platform.
    • fd0d471 Install automake and libtool in macOS build to be able to install the latest ...
    • f233023 Cleaner: Remove SVG image data URLs since they can embed script content.
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 2
  • generic extractor with convertor

    generic extractor with convertor

    Extractor class now can accept Type for static analysis via configuring the mypy plugin and a custom converter parameter.

    more examples: tests/test_generic_item.py

    enhancement 
    opened by linw1995 1
  • build: update the pdm.lock file

    build: update the pdm.lock file

    pdm export creates requirements.txt with a warning message:

    The exported requirements file is no longer cross-platform. Using it on other platforms may cause unexpected results.

    And I notice some requirements from requirements.txt files are missing now. But I chose to leave it alone. Maybe I will fix this in the future.

    dependencies 
    opened by linw1995 1
  • build(deps): bump urllib3 from 1.26.4 to 1.26.5

    build(deps): bump urllib3 from 1.26.4 to 1.26.5

    Bumps urllib3 from 1.26.4 to 1.26.5.

    Release notes

    Sourced from urllib3's releases.

    1.26.5

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Fixed deprecation warnings emitted in Python 3.10.
    • Updated vendored six library to 1.16.0.
    • Improved performance of URL parser when splitting the authority component.

    If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

    Changelog

    Sourced from urllib3's changelog.

    1.26.5 (2021-05-26)

    • Fixed deprecation warnings emitted in Python 3.10.
    • Updated vendored six library to 1.16.0.
    • Improved performance of URL parser when splitting the authority component.
    Commits
    • d161647 Release 1.26.5
    • 2d4a3fe Improve performance of sub-authority splitting in URL
    • 2698537 Update vendored six to 1.16.0
    • 07bed79 Fix deprecation warnings for Python 3.10 ssl module
    • d725a9b Add Python 3.10 to GitHub Actions
    • 339ad34 Use pytest==6.2.4 on Python 3.10+
    • f271c9c Apply latest Black formatting
    • 1884878 [1.26] Properly proxy EOF on the SSLTransport test suite
    • See full diff in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 1
  • build(deps): bump urllib3 from 1.26.3 to 1.26.4

    build(deps): bump urllib3 from 1.26.3 to 1.26.4

    Bumps urllib3 from 1.26.3 to 1.26.4.

    Release notes

    Sourced from urllib3's releases.

    1.26.4

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Changed behavior of the default SSLContext when connecting to HTTPS proxy during HTTPS requests. The default SSLContext now sets check_hostname=True.

    If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

    Changelog

    Sourced from urllib3's changelog.

    1.26.4 (2021-03-15)

    • Changed behavior of the default SSLContext when connecting to HTTPS proxy during HTTPS requests. The default SSLContext now sets check_hostname=True.
    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 1
  • build(deps): bump lxml from 4.6.2 to 4.6.3

    build(deps): bump lxml from 4.6.2 to 4.6.3

    Bumps lxml from 4.6.2 to 4.6.3.

    Changelog

    Sourced from lxml's changelog.

    4.6.3 (2021-03-21)

    Bugs fixed

    • A vulnerability (CVE-2021-28957) was discovered in the HTML Cleaner by Kevin Chung, which allowed JavaScript to pass through. The cleaner now removes the HTML5 formaction attribute.
    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 1
  • Making all properties of extractor unchangeable is easy to achieve Generic Extractor

    Making all properties of extractor unchangeable is easy to achieve Generic Extractor

    is_many should be unchangeable. Item.simplify can change it by accessing the internal value. quoted from #71

    Making all extractor properties unchangeable seems to be good for Generic Extractor implementation. And also can get rid of the lazy build mechanism of the extractor. This feature is not helpful at all.

    enhancement 
    opened by linw1995 1
  • Generic Extractor

    Generic Extractor

    New Features

    • Field supports to be generic
      • The return type of extract_frist should be Any by default if is not given a Generic argument.
      • Support to change extract method signature up against is_many attribute when is set on creation.
      • is_many should be unchangeable. Item.simplify can change it by accessing the internal value.
    • Item supports to be generic
      • extract result type is TypedDict by default if is not given Generic argument. And the member types of TypedDict are set on class declaration.
    • Need to develop a new mypy plugin for those features. And test this plugin by pytest-mypy-plugins package.
    • Field and Item accept optional parameters type for typing (is same as a Generic argument) and convertor for conversion.

    References

    enhancement 
    opened by linw1995 1
  • Build(deps): Bump lxml from 4.5.1 to 4.6.2

    Build(deps): Bump lxml from 4.5.1 to 4.6.2

    Bumps lxml from 4.5.1 to 4.6.2.

    Changelog

    Sourced from lxml's changelog.

    4.6.2 (2020-11-26)

    Bugs fixed

    • A vulnerability (CVE-2020-27783) was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

    4.6.1 (2020-10-18)

    Bugs fixed

    • A vulnerability was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

    4.6.0 (2020-10-17)

    Features added

    • GH#310: lxml.html.InputGetter supports __len__() to count the number of input fields. Patch by Aidan Woolley.

    • lxml.html.InputGetter has a new .items() method to ease processing all input fields.

    • lxml.html.InputGetter.keys() now returns the field names in document order.

    • GH-309: The API documentation is now generated using sphinx-apidoc. Patch by Chris Mayo.

    Bugs fixed

    • LP#1869455: C14N 2.0 serialisation failed for unprefixed attributes when a default namespace was defined.

    • TreeBuilder.close() raised AssertionError in some error cases where it should have raised XMLSyntaxError. It now raises a combined exception to keep up backwards compatibility, while switching to XMLSyntaxError as an interface.

    4.5.2 (2020-07-09)

    ... (truncated)

    Commits
    • 4cb5736 Work around Py2's lack of "re.ASCII".
    • c30106f Prepare release of 4.6.2.
    • a105ab8 Prevent combinations of <math/svg> and <style> to sneak JavaScript through th...
    • c053dc1 Add a recipe for a look-ahead generator to allow modifications during tree it...
    • b083124 lxml actually works in Py3.9.
    • 0f80590 lxml actually works in Py3.9.
    • fd8893c Add a doc note that the .find() methods are usually faster than one might exp...
    • eb6df27 Update release version on homepage.
    • 69b5c9b Automate the build artefact downloading from github and appveyor.
    • 61432a8 Prepare release of lxml 4.6.1.
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 1
  • build(deps): bump setuptools from 62.1.0 to 65.5.1

    build(deps): bump setuptools from 62.1.0 to 65.5.1

    Bumps setuptools from 62.1.0 to 65.5.1.

    Release notes

    Sourced from setuptools's releases.

    v65.5.1

    No release notes provided.

    v65.5.0

    No release notes provided.

    v65.4.1

    No release notes provided.

    v65.4.0

    No release notes provided.

    v65.3.0

    No release notes provided.

    v65.2.0

    No release notes provided.

    v65.1.1

    No release notes provided.

    v65.1.0

    No release notes provided.

    v65.0.2

    No release notes provided.

    v65.0.1

    No release notes provided.

    v65.0.0

    No release notes provided.

    v64.0.3

    No release notes provided.

    v64.0.2

    No release notes provided.

    v64.0.1

    No release notes provided.

    v64.0.0

    No release notes provided.

    v63.4.3

    No release notes provided.

    v63.4.2

    No release notes provided.

    ... (truncated)

    Changelog

    Sourced from setuptools's changelog.

    v65.5.1

    Misc ^^^^

    • #3638: Drop a test dependency on the mock package, always use :external+python:py:mod:unittest.mock -- by :user:hroncok
    • #3659: Fixed REDoS vector in package_index.

    v65.5.0

    Changes ^^^^^^^

    • #3624: Fixed editable install for multi-module/no-package src-layout projects.
    • #3626: Minor refactorings to support distutils using stdlib logging module.

    Documentation changes ^^^^^^^^^^^^^^^^^^^^^

    • #3419: Updated the example version numbers to be compliant with PEP-440 on the "Specifying Your Project’s Version" page of the user guide.

    Misc ^^^^

    • #3569: Improved information about conflicting entries in the current working directory and editable install (in documentation and as an informational warning).
    • #3576: Updated version of validate_pyproject.

    v65.4.1

    Misc ^^^^

    • #3613: Fixed encoding errors in expand.StaticModule when system default encoding doesn't match expectations for source files.
    • #3617: Merge with pypa/distutils@6852b20 including fix for pypa/distutils#181.

    v65.4.0

    Changes ^^^^^^^

    v65.3.0

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • build(deps): bump certifi from 2021.10.8 to 2022.12.7

    build(deps): bump certifi from 2021.10.8 to 2022.12.7

    Bumps certifi from 2021.10.8 to 2022.12.7.

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • build(deps): bump lxml from 4.8.0 to 4.9.1

    build(deps): bump lxml from 4.8.0 to 4.9.1

    Bumps lxml from 4.8.0 to 4.9.1.

    Changelog

    Sourced from lxml's changelog.

    4.9.1 (2022-07-01)

    Bugs fixed

    • A crash was resolved when using iterwalk() (or canonicalize()) after parsing certain incorrect input. Note that iterwalk() can crash on valid input parsed with the same parser after failing to parse the incorrect input.

    4.9.0 (2022-06-01)

    Bugs fixed

    • GH#341: The mixin inheritance order in lxml.html was corrected. Patch by xmo-odoo.

    Other changes

    • Built with Cython 0.29.30 to adapt to changes in Python 3.11 and 3.12.

    • Wheels include zlib 1.2.12, libxml2 2.9.14 and libxslt 1.1.35 (libxml2 2.9.12+ and libxslt 1.1.34 on Windows).

    • GH#343: Windows-AArch64 build support in Visual Studio. Patch by Steve Dower.

    Commits
    • d01872c Prevent parse failure in new test from leaking into later test runs.
    • d65e632 Prepare release of lxml 4.9.1.
    • 86368e9 Fix a crash when incorrect parser input occurs together with usages of iterwa...
    • 50c2764 Delete unused Travis CI config and reference in docs (GH-345)
    • 8f0bf2d Try to speed up the musllinux AArch64 build by splitting the different CPytho...
    • b9f7074 Remove debug print from test.
    • b224e0f Try to install 'xz' in wheel builds, if available, since it's now needed to e...
    • 897ebfa Update macOS deployment target version from 10.14 to 10.15 since 10.14 starts...
    • 853c9e9 Prepare release of 4.9.0.
    • d3f77e6 Add a test for https://bugs.launchpad.net/lxml/+bug/1965070 leaving out the a...
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 1
  • Iterative extracting.

    Iterative extracting.

    • python-json-rw is not designed for iterable extracting
    • lxml.xpath is not designed for iterable extracting

    But implement AbstractSimpleExtractor.iter_extract and AbstractComplexExtractor.iter_extract can be able to reduce the memory usage when doing large data extracting .

    enhancement 
    opened by linw1995 0
Releases(v0.10.2)
Owner
林玮 (Jade Lin)
Love Python. Golang/Rust beginner.
林玮 (Jade Lin)
Every web site provides APIs.

Toapi Overview Toapi give you the ability to make every web site provides APIs. Version v2.0.0, Completely rewrote. More elegant. More pythonic v1.0.0

Jiuli Gao 3.3k Jan 5, 2023
Web-Extractor - Simple Tool To Extract IP-Adress From Website

IP-Adress Extractor Simple Tool To Extract IP-Adress From Website Socials: Langu

ميخائيل 7 Jan 16, 2022
Export your data from Xiami

Xiami Exporter 导出虾米音乐的个人数据,功能: 导出歌曲为 json 收藏歌曲 收藏专辑 播放列表 导出收藏艺人为 json 导出收藏专辑为 json 导出播放列表为 json (个人和收藏) 将导出的数据整理至 sqlite 数据库 收藏歌曲 收藏艺人 收藏专辑 播放列表 下载已导出

Xiao Meng 59 Nov 13, 2021
Zotero2Readwise - A Python Library to retrieve annotations and notes from Zotero and upload them to your Readwise

Zotero ➡️ Readwise zotero2readwise is a Python library that retrieves all Zotero

Essi Alizadeh 49 Dec 20, 2022
News, full-text, and article metadata extraction in Python 3. Advanced docs:

Newspaper3k: Article scraping & curation Inspired by requests for its simplicity and powered by lxml for its speed: "Newspaper is an amazing python li

Lucas Ou-Yang 12.3k Jan 1, 2023
Module for automatic summarization of text documents and HTML pages.

Automatic text summarizer Simple library and command line utility for extracting summary from HTML pages or plain texts. The package also contains sim

Mišo Belica 3k Jan 3, 2023
A django compressor tool that bundles css, js files to a single css, js file with webpack and updates your html files with respective css, js file path.

django-webpacker's documentation: Introduction: django-webpacker is a django compressor tool which bundles css, js files to a single css, js file with

MicroPyramid 72 Aug 18, 2022
Python binding to Modest engine (fast HTML5 parser with CSS selectors).

A fast HTML5 parser with CSS selectors using Modest engine. Installation From PyPI using pip: pip install selectolax Development version from github:

Artem Golubin 710 Jan 4, 2023
Simple tool to combine(merge) onnx models. Simple Network Combine Tool for ONNX.

snc4onnx Simple tool to combine(merge) onnx models. Simple Network Combine Tool for ONNX. https://github.com/PINTO0309/simple-onnx-processing-tools 1.

Katsuya Hyodo 8 Oct 13, 2022
A Python tool that parses JSON documents using JsonPath

A Python tool that parses JSON documents using JsonPath

null 8 Dec 18, 2022
`charts.css.py` brings `charts.css` to Python. Online documentation and samples is available at the link below.

charts.css.py charts.css.py provides a python API to convert your 2-dimension data lists into html snippet, which will be rendered into charts by CSS,

Ray Luo 3 Sep 23, 2021
热搜榜-python爬虫+正则re+beautifulsoup+xpath

仓库简介 微博热搜榜, 参数wb 百度热搜榜, 参数bd 360热点榜, 参数360 csdn热榜接口, 下方查看 其他热搜待加入 如何使用? 注册vercel fork到你的仓库, 右上角 点击这里完成部署(一键部署) 请求参数 vercel配置好的地址+api?tit=+参数(仓库简介有参数信息

Harry 3 Jul 8, 2022
HTML2Image is a lightweight Python package that acts as a wrapper around the headless mode of existing web browsers to generate images from URLs and from HTML+CSS strings or files.

A package acting as a wrapper around the headless mode of existing web browsers to generate images from URLs and from HTML+CSS strings or files.

null 176 Jan 1, 2023
Basic Form Web Development using Python, Django and CSS

thebookrain Basic Form Web Development using Python, Django and CSS This is a basic project that contains two forms - borrow and donate. The form data

Ananya Dhulipala 1 Nov 27, 2021
Ceres is a combine harvester designed to harvest plots for Chia blockchain and its forks using proof-of-space-and-time(PoST) consensus algorithm.

Ceres Combine-Harvester Ceres is a combine harvester designed to harvest plots for Chia blockchain and its forks using proof-of-space-and-time(PoST) c

null 38 Nov 14, 2022
This "I P L Team Project" is developed by Prasanta Kumar Mohanty using Python with Django web framework, HTML & CSS.

I-P-L-Team-Project This "I P L Team Project" is developed by Prasanta Kumar Mohanty using Python with Django web framework, HTML & CSS. Screenshots HO

null 1 Dec 15, 2021
Meta package to combine turbo-django and stimulus-django

Hotwire + Django This repository aims to help you integrate Hotwire with Django ?? Inspiration might be taken from @hotwired/hotwire-rails. We are sti

Hotwire for Django 31 Aug 9, 2022
Implementation of Bidirectional Recurrent Independent Mechanisms (Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention over Modules)

BRIMs Bidirectional Recurrent Independent Mechanisms Implementation of the paper Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neura

Sarthak Mittal 26 May 26, 2022
Finds snippets in iambic pentameter in English-language text and tries to combine them to a rhyming sonnet.

Sonnet finder Finds snippets in iambic pentameter in English-language text and tries to combine them to a rhyming sonnet. Usage This is a Python scrip

Marcel Bollmann 11 Sep 25, 2022
Hydralit package is a wrapping and template project to combine multiple independant Streamlit applications into a multi-page application.

Hydralit The Hydralit package is a wrapping and template project to combine multiple independant (or somewhat dependant) Streamlit applications into a

Jackson Storm 108 Jan 8, 2023