In our current implementation, you'll find node generators appearing in many different modules:
The boilerplate can be summed up in two sketch functions (the names and bodies are illustrative; they do not actually exist within libextract):
def iters(etree, *tags):
    for node in etree.iter(*tags):  # <- generator
        # do something with the node...
        yield node                  # ...then yield (or return) it

def processes(tpls, func, predicate):
    for tpl in tpls:  # <- iterator
        if predicate(tpl):
            yield func(tpl)
        else:
            yield tpl
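To make the duplication concrete, here is a minimal, self-contained sketch of how those two shapes typically get chained by hand (the HTML snippet and helper names below are mine, purely for illustration):

from lxml import html

doc = html.fromstring("<table><tr><td>x</td></tr><tr><td></td></tr></table>")

def iter_rows(etree):
    for node in etree.iter('tr'):        # first shape: a bare node generator
        yield node

def keep_nonempty(rows):
    for row in rows:                      # second shape: iterate, test, transform
        if row.text_content().strip():    # predicate
            yield row                     # (or func(row))

rows = list(keep_nonempty(iter_rows(doc)))  # [<Element tr ...>]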
In this issue, I will directly address the first pattern by providing the decorator iters (and its XPath equivalent, selects) as a replacement. The second pattern is harder to address concisely with a single replacement decorator; instead, I will demonstrate a second decorator that covers the processes shape but is specific to the predictive aspect of libextract.
iters, selects
The lxml ElementTree.iter and ElementTree.xpath methods were turned into decorators:
# *tags* designates which nodes to generate
def iters(*tags):
    # *fn* is the user's function (allowing per-node logic)
    def decorator(fn):
        def iterator(node, *args):
            for elem in node.iter(*tags):
                yield fn(elem, *args)
        return iterator
    return decorator
def selects(xpath):
    # magic words that expand to the
    # intricate built-in xpath expressions
    if xpath == "text":
        xpath = NODES_WITH_TEXT
    elif xpath == "tabular":
        xpath = NODES_WITH_CHILDREN
    def decorator(fn):
        def selector(node, *args):
            for n in node.xpath(xpath):
                yield fn(n)
        return selector
    return decorator
That allows users to simply do this:
@iters('tr')
def get_rows(node):
    return node

rows = list(pipeline(r.content, (parse_html, get_rows)))
... yielding:
[<Element tr at 0x65ad778>,
<Element tr at 0x65ad7c8>,
<Element tr at 0x65ad818>,
<Element tr at 0x65ad868>,
<Element tr at 0x65ad8b8>,
<Element tr at 0x65ad908>,
<Element tr at 0x65ad958>,
<Element tr at 0x65ad9a8>,
...]
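Since selects falls through to whatever expression it is given when neither magic word matches, nothing stops you from passing an explicit XPath; the expression and function below are just an illustration, not part of libextract:

@selects('//a[@href]')   # any raw xpath expression works too
def get_links(node):
    return node.get('href')

links = list(pipeline(r.content, (parse_html, get_links)))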
maximize
The second construct is the maximize decorator. Before I demonstrate how to use it, let me show you what it can (fairly easily) replace in the current implementation of libextract:
# libextract/tabular.py
def node_counter_argmax(pairs):
    for node, counter in pairs:
        if counter:
            yield node, argmax(counter)

# libextract/coretools.py
def histogram(iterable):
    hist = Counter()
    for key, score in iterable:
        hist[key] += score
    return hist

def argmax(counter):
    return counter.most_common(1)[0]
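To see what these helpers actually compute, here is a tiny standalone run reusing the two definitions quoted above, plus the Counter import they rely on (the tag/score data is made up):

from collections import Counter   # needed by histogram above

hist = histogram([('p', 3), ('div', 1), ('p', 2)])
print(hist)          # Counter({'p': 5, 'div': 1})
print(argmax(hist))  # ('p', 5)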
As a quick side note, in #1, @Beluki voices this opinion:
For libextract, I think the best way to go about it is to write the functions as if combinations of them weren't available.
I take that to mean that he and others, including myself, would prefer to build web scraping/extraction algorithms from composable modules; in other words, more transparency.
Why do I bring up @Beluki's comment? I believe the new maximize decorator is in tune with it. Here's how you can recreate the TABULAR and ARTICLE black boxes:
from libextract.core import parse_html, pipeline
from libextract.generators import selects, maximize, iters
from libextract.metrics import StatsCounter

@maximize(5, lambda x: x[1].max())
@selects("tabular")  # uses the table-extracting xpath
def group_parents_children(node):
    return node, StatsCounter([child.tag for child in node])

@maximize(5, lambda x: x[1])
@selects("text")  # uses the text-extracting xpath
def group_nodes_texts(node):
    return node.getparent(), len(" ".join(node.text_content().split()))

tables = pipeline(r.content, (parse_html, group_parents_children))
text = pipeline(r.content, (parse_html, group_nodes_texts))
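Both pipelines return a list of (node, score) pairs sorted best-first, so getting at the winning element is direct (assuming the calls above found at least one match):

best_node, score = text[0]               # highest-scoring text parent
print(best_node.tag, score)
print(best_node.text_content()[:80])     # peek at the extracted text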
Here's the implementation:
from heapq import nlargest

# *max_fn* plays the same role as the "key"
# argument to "sort" and "sorted"
# *top* controls the number of elements to
# return (post-sorting)
def maximize(top=5, max_fn=select_score):
    # *fn* is a generator function that gets decorated
    # (for example, an iters- or selects-decorated custom method)
    def decorator(fn):
        def iterator(*args):
            return nlargest(top, fn(*args), key=max_fn)
        return iterator
    return decorator
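As a quick sanity check that the decorator behaves like a keyed nlargest, here is a toy run outside of libextract (the default key is swapped for a lambda so the snippet stands alone; select_score stays the default in the real code):

from heapq import nlargest

def maximize(top=5, max_fn=lambda pair: pair[1]):
    def decorator(fn):
        def iterator(*args):
            return nlargest(top, fn(*args), key=max_fn)
        return iterator
    return decorator

@maximize(top=2)
def scored_words(words):
    for w in words:          # any generator function works
        yield w, len(w)

print(scored_words(['a', 'ccc', 'bb', 'dddd']))
# [('dddd', 4), ('ccc', 3)]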
Hopefully this is enough to get the ball rolling towards the immediate goal of cleaning up libextract, as it somehow became cluttered in the short time this project's been alive.
CC @datalib/contrib