Libextract: extract data from websites


Libextract is a statistics-enabled data extraction library that works on HTML and XML documents and is written in Python. Originating from eatiht, the extraction algorithm works on one simple assumption: data appear as collections of repetitive elements. You can read about the reasoning here.
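
To make that assumption concrete, here is a minimal, illustrative sketch (not libextract's actual internals): score each parent element by how often its most common child tag repeats, and treat the top-scoring parents as likely data containers.

from collections import Counter
from lxml import html

def likely_data_parents(document, top=5):
    # Parse the raw HTML into an element tree.
    tree = html.fromstring(document)
    scores = Counter()
    for parent in tree.iter():
        # Count child tags, skipping comments and processing instructions.
        child_tags = Counter(child.tag for child in parent
                             if isinstance(child.tag, str))
        if child_tags:
            # A parent whose children repeat one tag many times (rows of a
            # table, paragraphs of an article) is a likely data container.
            scores[parent] = child_tags.most_common(1)[0][1]
    return [node for node, _ in scores.most_common(top)]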

Overview

libextract.api.extract(document, encoding='utf-8', count=5)
Given an html document, and optionally the encoding, return a list of nodes likely containing data (5 by default).

Installation

pip install libextract

Usage

Due to our simple definition of "data", we expose a single interface method. Post-processing is up to you.

from requests import get
from libextract.api import extract

r = get('http://en.wikipedia.org/wiki/Information_extraction')
textnodes = list(extract(r.content))

Using lxml's built-in methods for post-processing:

>>> print(textnodes[0].text_content())
Information extraction (IE) is the task of automatically extracting structured information...

The extraction algorithm is as agnostic to article text as it is to tabular data:

height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = list(extract(height_data.content))
>>> [elem.text_content() for elem in tabs[0].iter('th')]
['Country/Region',
 'Average male height',
 'Average female height',
 ...]

Dependencies

lxml
statscounter

Disclaimer

This project is still in its infancy; advice and suggestions as to what this library could and should be would be greatly appreciated.

:)

Comments
  • More approachable names

    I think @msiemens has raised an important issue about the terminology used in our project. We should always strive to make it more "intuitive", that is, that developers can just start delving into the source code without having to look up the technical terminology used. Some suggestions:

    • quantifiers -> metrics
    • pruners -> selectors/mappers

    As for maximisers I don't know of a better name, but it sounds a mouthful ;)

    cc @libextract/owners @libextract/contrib

    comment/concern 
    opened by eugene-eeo 26
  • Modular approach to "pruning"; refactoring get_pairs's traversing and quantifying logic

    Branch 268eeda562ed96d73931e3a4b8dea41a3a7a3847 demonstrates a possible implementation of modular, user-defined functions that can be added to the pipeline; you can read the larger discussion in #1.

    The pruners.py submodule provides "pruning" functions.

    BACKGROUND:

    The extraction algorithms require HT/XML tree traversals.

    Before we analyze the frequency distributions of some particular element, which I'll refer to as the "prediction phase", we must first prune for nodes (lxml.html.HtmlElement) and quantify some measurement (numerical or collections.Counter).

    What this submodule provides is a decorator called pruner and some predefined pruners.

    The use case is the following:

    The user wants to measure "something"; they can either import our built-ins (libextract.quantifiers), or they can create their "quantifier" within a custom function, which they would then decorate with @pruner.

    For example, if we did not know about the text_length quantifier, nor that the subnode_textlen_pruner pruner already does what we want, we would simply create our own, under the following protocol:

    from requests import get
    from libextract import extract, pruners
    from libextract.html import parse_html
    from libextract.html.article import get_node, get_text
    from libextract.coretools import histogram, argmax
    
    
    r = get('http://en.wikipedia.org/wiki/Classifier_(linguistics)')
    
    #custom xpath
    xpath  = '//*[not(self::script or self::style)]/\
                         text()[normalize-space()]/..'
    
    # INPUTS
    # "node" must be declared; the selector must be given as a keyword argument
    # the user may assume node is an lxml.html.HtmlElement object
    @pruner
    def my_pruner(node, selector=xpath):
        text = node.text
        textlen = len(' '.join(text.split())) if text else 0
        #print(node,textlen)
        return node.getparent(), textlen
    # OUTPUTS
    # lxml.html.HtmlElement, numerical or collections.Counter
    
    #add my_pruner to the pipeline
    strat = (parse_html, my_pruner, histogram, argmax, get_node, get_text)
    text = extract(r.content, strategy=strat)
    print(text)
    

    cc @libextract/owners @libextract/contrib

    enhancement 
    opened by rodricios 22
  • Pipeline or Strategy style?

    Currently there are two methods of extraction that I have implemented: pipeline (function-based) and strategy (inheritance-based) style. I am not quite sure which style to use; any suggestions?

    Pipeline style

    def custom_get_text(node):
        return node.xpath(SELECT_TEXT)
    
    strategy = list(html.STRATEGY)
    strategy[-1] = custom_get_text
    

    Pros:

    • Higher performance (in theory) as functions are easier to optimise
    • No shared state, easy to debug compared to OO approach
    • More elegant looking, more expressive (strategies can just be [get_nodepairs, get_best, get_text])
    • Code as data (pipelines are just lists)

    Cons:

    • May be slower due to use of many functions
    • Not configurable (at the moment)

    Strategy style

    class CustomFinder(HTMLFinder):
        text_nodes = 'custom_selector'
    

    Pros:

    • More configurable and easier to configure
    • Most developers are more familiar with it
    • Shared state can be used for optimisation

    Cons:

    • Hard to test and debug
    • May be slower due to opaque objects
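
    For reference, a purely illustrative sketch of what the strategy-style base class might look like; HTMLFinder's real interface is not shown in this thread, so the class body, selector, and method names below are assumptions:

    from lxml import html

    # Hypothetical base class: subclasses override class attributes
    # (shared, configurable state), as CustomFinder does above.
    class HTMLFinder(object):
        text_nodes = ('//*[not(self::script or self::style)]'
                      '/text()[normalize-space()]/..')

        def __init__(self, document):
            self.tree = html.fromstring(document)

        def find(self):
            return self.tree.xpath(self.text_nodes)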

    cc @rodricios

    help wanted 
    opened by eugene-eeo 22
  • Import errors for 2.x & 3.x

    from libextract.html import parse_html is broken in Python 2.7.8, and from .html import parse_html is broken in 3.4.x (going from memory, so I may be incorrect about the exact version/statement).

    The change is addressed in PEP 328. Trying out solutions now.
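
    A common way to make the same import work on both 2.x and 3.x (a sketch of one possible fix, not necessarily the one that was merged) is to force absolute-import semantics on Python 2 and keep the explicit relative import:

    # at the top of the module doing the importing
    from __future__ import absolute_import

    try:
        from .html import parse_html             # explicit relative import (PEP 328)
    except (ImportError, ValueError):
        from libextract.html import parse_html   # absolute fallback, e.g. when run as a script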

    opened by rodricios 21
  • New architecture, removed a lot of boilerplate

    In our current implementation, you'll find node generators appearing in many different modules.

    The boilerplate code can be summed up in two functions (the names and definitions are trivial and do not actually exist within libextract):

    def iters(etree, *tags):
        for node in etree.iter(*tags): # <- generator
            # do something with node here, then
            yield node  # or return it
    
    def processes(tpls, func, predicate):
        for tpl in tpls: # <- iterator
            if predicate(tpl):
                yield func(tpl)
            else:
                yield tpl
    

    In this issue, I will directly address the first method by providing the decorator iters (and the xpath equivalent, selects) as replacements.

    The second method is a little harder to concisely address with a single replacement decorator. Instead, I will demonstrate a second decorator that touches on the processes method, but is specific to the predictive aspect of libextract.

    iters, selects

    The lxml ElementTree iter and xpath methods were turned into decorators:

    # tags will designate which nodes to generate
    def iters(*tags):
        # *fn* is the user's function (allowing him to do per-node logic)
        def decorator(fn): 
            def iterator(node, *args):
                for elem in node.iter(*tags):
                    yield fn(elem,*args)
            return iterator
        return decorator
    
    def selects(xpath):
        # magic words for choosing 
        # intricate xpath expressions 
        if xpath == "text":
            xpath = NODES_WITH_TEXT
        elif xpath == "tabular":
            xpath = NODES_WITH_CHILDREN
        def decorator(fn):
            def selector(node, *args):
                for n in node.xpath(xpath):
                    yield fn(n)
            return selector
        return decorator
    

    That allows users to simply do this:

    @iters('tr')
    def get_rows(node):
        return node
    
    rows = list(pipeline(r.content, (parse_html, get_rows)))
    

    ... yielding:

    [<Element tr at 0x65ad778>,
     <Element tr at 0x65ad7c8>,
     <Element tr at 0x65ad818>,
     <Element tr at 0x65ad868>,
     <Element tr at 0x65ad8b8>,
     <Element tr at 0x65ad908>,
     <Element tr at 0x65ad958>,
     <Element tr at 0x65ad9a8>,
    ...]
    

    maximize

    The second construct is the maximize decorator.

    Before I demonstrate how to use this decorator, let me show you what it can easily(?) replace in the current implementation of libextract:

    # libextract/tabular.py
    def node_counter_argmax(pairs):
        for node, counter in pairs:
            if counter:
                yield node, argmax(counter)
    
    # libextract/coretools.py
    def histogram(iterable):
        hist = Counter()
        for key, score in iterable:
            hist[key] += score
        return hist
    
    def argmax(counter):
        return counter.most_common(1)[0]
    

    As a quick side note, in #1, @Beluki voices this opinion:

    For libextract, I think the best way to go about it is to write the functions as if combinations of them weren't available.

    I take that to mean that he and others, including myself, would prefer to build web scraping/extraction algorithms from composable modules, or in other words, with more transparency.

    Why do I bring up @Beluki's comment? I believe the next new decorator, maximize, is in tune with it. Here's how you can recreate the TABULAR and ARTICLE black boxes:

    from libextract.core import parse_html, pipeline
    from libextract.generators import selects, maximize, iters
    from libextract.metrics import StatsCounter
    
    @maximize(5, lambda x: x[1].max())
    @selects("tabular") # uses table-extracting xpath
    def group_parents_children(node):
        return node, StatsCounter([child.tag for child in node])
    
    @maximize(5, lambda x: x[1])
    @selects("text") # uses text-extracting xpath
    def group_nodes_texts(node):
        return node.getparent(), len(" ".join(node.text_content().split()))
    
    tables = pipeline(r.content, (parse_html, group_parents_children,))
    text = pipeline(r.content, (parse_html, group_nodes_texts,))
    

    Here's the implementation:

    from heapq import nlargest

    # *max_fn* is really just the same as the "key"
    # argument in "sort" and "sorted"
    # *top* controls the number of elements to
    # return (post-sorting)
    def maximize(top=5, max_fn=select_score):  # select_score: a default key function defined elsewhere
        # *fn* is a generator function that gets decorated
        # on top of a generator function (like an iters-decorated
        # custom method)
        def decorator(fn):
            def iterator(*args):
                return nlargest(top, fn(*args), key=max_fn)
            return iterator
        return decorator
    

    Hopefully this is enough to get the ball rolling towards the immediate goal of cleaning up libextract, as it somehow became cluttered in the short time this project's been alive.

    CC @datalib/contrib

    opened by rodricios 15
  • Move testing code into package

    So relative imports for packages work :+1:

    But now you want to move the test directory into your package so that setuptools also packages your testing code along with your package. After you do that and configure for imports, you want to include a MANIFEST.in file at the root which ensures all your asset files are also included when you install the package.

    Something along the lines of this should work

    include LICENSE
    include .travis.yml
    include requirements.txt
    include libextract/tests/assets/full_of_foos.html
    
    recursive-include libextract *.py
    

    And then also add this in your setup.py file:

    from setuptools import find_packages
    
    setup(
    ...
        packages=find_packages(),
        tests_require=['pytest', 'nose'],
    ...
    )
    
    opened by jjangsangy 12
  • Proposal for using A.I. terminology

    Tree traversal and optimization have long been tenets in A.I. and machine learning.

    This library is, to some extent, a machine learning library. But more than that, we hope that it can be a useful library that's easy to pick up.

    That being said, I'd like to propose the use of the following terms: prune, optimizers or maximizers, and heuristics

    Pruning is a technique in machine learning that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances

    Optimization (alternatively spelled optimisation, or mathematical programming) is the selection of a best element (with regard to some criteria) from some set of available alternatives

    A heuristic technique, sometimes called simply a heuristic, is any approach to problem solving, learning, or discovery that employs a practical methodology not guaranteed to be optimal or perfect, but sufficient for the immediate goals

    The get_*_pairs functions in article.py and tabular.py can both be considered to be heuristics. Arguably, a pruning technique is also a heuristic. That is exactly why I think that rather than leaving get_node_length_pairs and get_node_counter_pairs inside separate modules, they should belong in a single file which consolidates those and similar functions.

    I propose they should exist in a module called prune.py or pruners.py as they are predefined special cases of pruning techniques.

    What is also shared between article.py and tabular.py is "optimizing" functions, aka argmax.

    These should belong in either optimizers.py or maximizers.py.

    Finally, and with less urgency, while I do think "metrics" is more terse than "quantifiers", what is lacking is the implication that those functions are active processes, meaning, they receive streams of nodes. "metrics" sounds static imo, but obviously these last few statements are subjective.

    I should note that the current goal I have for this project is to get rid of article.py and tabular.py as they are essentially what eatiht was. To be honest, if I wanted to keep things vanilla, I would have just implemented tabular.py into eatiht and I wouldn't bat an eye.

    But libextract is not eatiht, and its implementation should be more thought out (think of it as framework). All procedures should be clearly decoupled, so when article.py and tabular.py both execute optimization techniques, those techniques, or a generalized version, should exist in a single module, not in two separate modules.

    opened by rodricios 12
  • Presentation of extracted tabular data

    Currently, one possible, human-readable way of viewing the extracted "tables" is to do the following:

    from libextract import extract
    from libextract.strategies import TABULAR
    from libextract.html.tabular import filter_tags
    
    from lxml.html import open_in_browser
    from requests import get
    
    reddit = get("http://reddit.com")
    strat = TABULAR + (filter_tags,)
    
    tabs = extract(reddit.content, strategy=strat)
    tabs = list(tabs)
    open_in_browser(tabs[0])
    

    Any other ideas on how else we should present it?

    Edit: one very cool way is to follow import.io's approach as an example; this means classifying by the nodes' "class" values.
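
    A hedged sketch of that idea, reusing the tabs list from the snippet above (group_by_class is illustrative, not libextract API): bucket an extracted node's descendants by their "class" attribute so visually similar items end up together.

    from collections import defaultdict

    def group_by_class(node):
        groups = defaultdict(list)
        for elem in node.iter():
            css_class = elem.get('class')
            if css_class:
                # Whitespace-normalized text, keyed by the element's class value.
                groups[css_class].append(' '.join(elem.text_content().split()))
        return dict(groups)

    grouped = group_by_class(tabs[0])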

    enhancement 
    opened by rodricios 10
  • Time for Pipeline class?

    I'm enjoying the NodeProcessor class, but I think the usage practices are quite unattractive:

    import re
    from requests import get
    
    from libextract import extract
    from libextract.strategies import TABULAR
    from libextract.formatters import table_json
    from libextract.processor import NodeProcessor
    
    # thoughts begin here
    np = NodeProcessor()  # This is alright
    
    # this is pretty, could be "prettier" if just @register('table')
    @np.register('table')
    def if_table(node):
        table = table_json(node)
        return table
    ...
    strat = TABULAR + (np.process,) # This is starting to bug me :|
    

    As for ideas of what to do:

    1. Some globally available Pipeline object (as in, initialized by the top-level module __init__.py)

      • from libextract.processor import register, process
      • even better: from libextract.processor import register and process is called in the background
    2. Change strategy containers to list type

      • no longer named with uppercase constant convention
      • take advantage of append(fn) for adding methods to the end of the pipeline
      from libextract.strategies import tabular
      @register('tag')
      def if_table(node):
      ...
      tabular.append(if_table) 
      strat = tabular
      

    The above code block removes the need for the NodeProcessor.process method, which we can then move to register (I think...)
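
    A sketch of idea 1 under those assumptions (register, process, and _handlers are illustrative names, not existing libextract API):

    # e.g. in libextract/processor.py, as a module-level registry
    _handlers = {}

    def register(tag):
        def decorator(fn):
            _handlers[tag] = fn
            return fn
        return decorator

    def process(nodes):
        # Called in the background at the end of a strategy, so users
        # never have to append np.process themselves.
        for node in nodes:
            handler = _handlers.get(node.tag)
            yield handler(node) if handler else node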

    Thoughts? CC @datalib/contrib @datalib/admins

    opened by rodricios 8
  • Confidence metric

    The extremely naive way of implementing it would be:

    def confidence(s, k):
        return (s[k] / s.mean()) - 1
    

    Which has the nice property that it performs very well for datasets which are very imbalanced, i.e. the peak values are outliers, usually the case in a site's tabular content/text length. It also gives negative values for choices which are too small, which is IMHO a :+1:.

    But the problem is that it returns too large a value when the peak value is very large, for example:

    >>> from statscounter import StatsCounter
    >>> from libextract.confidence import confidence_metric
    >>> s = StatsCounter({'b':2, 'k': 3, 'd': 1, 'c': 20, 'f': 9})
    >>> confidence_metric(s, 'c')
    1.8571428571428572
    

    I've experimented with other algorithms, for example:

    • (max - avg) / stdev - performs well but has the same problem
    • max * stddev / (size * variance) - gives nicer values and has the property that the confidence of the outliers does not exceed 1 (sketched below)

    So far most of the algorithms do not have the property that sum(P(k) for k in data) == 1.0. This is a problem that I think can only be solved by probability distributions, which I do not know how to implement.
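
    For concreteness, here is a minimal sketch of the two alternatives listed above, computed with the standard-library statistics module over the counter's values; it assumes nothing about StatsCounter beyond a .values() method, and that the scores are not all identical.

    from statistics import mean, pstdev, pvariance

    def conf_max_minus_avg_over_stdev(s):
        # (max - avg) / stdev
        vals = list(s.values())
        return (max(vals) - mean(vals)) / pstdev(vals)

    def conf_max_stdev_over_size_var(s):
        # max * stddev / (size * variance)
        vals = list(s.values())
        return max(vals) * pstdev(vals) / (len(vals) * pvariance(vals))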

    opened by eugene-eeo 7
  • Mission statement: "[libextract] provides composable, small functions [..]"

    "[..] that can be piped together to process the HT/XML document."

    I'd like to put into writing that I think we should not only stick to that mantra (or mission statement?), but also officialize it as this project's description.

    With that in mind, I think we're reaching a point (or we've reached it already ref: 953c384b3cb1d8ae06f654f893c44e82cc74c835) where the lack of naming/functional-method conventions will delay the progress of this project.

    There are a few observations I should state:

    1. Perfect extraction is impossible (this one is here for reasons I'll address in another issue/post)
    2. Lambda functions are easy for the initiated, difficult for those not.
    3. The "composable, small functions" we provide in html.py (we should probably rename that file, if we continue to add more functions) can be, to an extent, rewritten as a series of "pipelined" lambda functions.
    4. Without some set of protocols, the typical user will not be able to simply pick up this library and compose an effective "strategy".

    To stress the problem of point 4, consider the following two extraction cases:

    article_strategy = (get_etree, get_pairs, histogram, argmax, get_final_text)
    
    tabular_strategy = (get_etree, get_node_children_pairs, filter_node_children_pairs)
    

    First, "get_etree", while I'm for keeping method names that describe the action and the returned structure, I believe we should opt for something closer to "parse" or "parse_html". Not only is it interpretable by most users of lxml.html and the nlp folks (ie. sentence parsing), but it's also, imo, a more concise statement.

    Second, as is evidenced in the definitions of get_pairs and get_node_children_pairs, we're doing a lot of pair getting.

    Both types of pairs returned are of the format (HtmlElement, n-dimensional measurement).

    Just for completeness, in get_pairs we have (HtmlElement, int) and in get_node_children_pairs we have (HtmlElement, Counter(key=HtmlElement, value=int (frequency))).

    Up until that point, our pipeline's flow of logic, and the names of the methods, matched up nicely. But after that, histogram, argmax, get_final_text and filter_node_children_pairs are inelegantly non-synergistic, for lack of a better term.

    I don't know if it's too early in the project to be concerned about this, but I think this project will likely head towards a dead end if there isn't some proper set of ground rules. Sorry for the pessimism :(

    I'll work on refactoring the code I've committed in 953c384b3cb1d8ae06f654f893c44e82cc74c835, and see if I can come up with something more constructive than the above.

    comment/concern 
    opened by rodricios 6
  • Take length of node content into account

    I saw menu structures of <ul> nodes pop up alongside the desired text body in libextract's results. This could be easily mitigated by defining a minimum threshold of node content length, something like 15 characters possibly. Will do a PR if this is a wanted improvement. Not sure if the project is still alive.
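
    A minimal sketch of that post-filter, assuming the textnodes list from the README's usage example and a hypothetical 15-character threshold:

    MIN_LENGTH = 15

    def long_enough(node, min_length=MIN_LENGTH):
        # Keep only nodes whose whitespace-normalized text is at least min_length chars.
        text = ' '.join(node.text_content().split())
        return len(text) >= min_length

    filtered = [node for node in textnodes if long_enough(node)]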

    opened by psolbach 1
  • API: get ElementTree

    The api.extract function returns a generator of HtmlElement objects. If you need to analyze the results of api.extract in relation to the HTML page, then it would be great to have a way to get the ElementTree object. This is required (for example) to get the XPath of an HtmlElement using etree.getpath(element) as described on http://lxml.de/xpathxslt.html#generating-xpath-expressions.

    Currently I use the following lazy workaround:

    import requests
    from functools import partial
    from libextract._compat import BytesIO
    from libextract.core import parse_html, pipeline, select, measure, rank, finalise
    
    def extract(document, encoding='utf-8', count=None):
        if isinstance(document, bytes):
            document = BytesIO(document)
    
        crank = partial(rank, count=count) if count else rank
    
        etree = parse_html(document, encoding=encoding)
        yield etree
        yield pipeline(
            select(etree),
            (measure, crank, finalise)
            )
    
    r = requests.get(url)
    gen_extract = extract(r.content)
    tree = gen_extract.next()
    textnodes = gen_extract.next()
    data_element = textnodes.next()  # <Element table at 0x36f1f60>
    rows = data_element.iterfind('tr')
    for row in rows:
        row_xpath = tree.getpath(row)
        print row_xpath
    
    # /html/body/div[2]/div[1]/div[2]/table/tr[1]
    # /html/body/div[2]/div[1]/div[2]/table/tr[2]
    # /html/body/div[2]/div[1]/div[2]/table/tr[3]
    # ...
    
    opened by bofm 14