In our current implementation, you'll find node generators appearing in many different modules:
The boilerplate can be summed up in two sketch functions (the names and bodies are illustrative; they do not actually exist within libextract):
def iters(etree, *tags):
    for node in etree.iter(*tags):  # <- generator
        # do something with the node...
        yield node                  # ...then yield (or return) it

def processes(tpls, func, predicate):
    for tpl in tpls:  # <- iterator
        if predicate(tpl):
            yield func(tpl)
        else:
            yield tpl
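To make the duplication concrete, here is a minimal, self-contained sketch of how those two shapes typically get chained by hand (the HTML snippet and helper names below are mine, purely for illustration):

from lxml import html

doc = html.fromstring("<table><tr><td>x</td></tr><tr><td></td></tr></table>")

def iter_rows(etree):
    for node in etree.iter('tr'):        # first shape: a bare node generator
        yield node

def keep_nonempty(rows):
    for row in rows:                      # second shape: iterate, test, transform
        if row.text_content().strip():    # predicate
            yield row                     # (or func(row))

rows = list(keep_nonempty(iter_rows(doc)))  # [<Element tr ...>]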
In this issue, I will directly address the first pattern by providing the decorator iters (and its XPath equivalent, selects) as a replacement. The second pattern is harder to address concisely with a single replacement decorator; instead, I will demonstrate a second decorator that covers the processes shape but is specific to the predictive aspect of libextract.
iters, selects
The lxml ElementTree.iter and ElementTree.xpath methods were turned into decorators:
# *tags* designates which nodes to generate
def iters(*tags):
    # *fn* is the user's function (allowing per-node logic)
    def decorator(fn):
        def iterator(node, *args):
            for elem in node.iter(*tags):
                yield fn(elem, *args)
        return iterator
    return decorator
def selects(xpath):
    # magic words that expand to the
    # intricate built-in xpath expressions
    if xpath == "text":
        xpath = NODES_WITH_TEXT
    elif xpath == "tabular":
        xpath = NODES_WITH_CHILDREN
    def decorator(fn):
        def selector(node, *args):
            for n in node.xpath(xpath):
                yield fn(n)
        return selector
    return decorator
That allows users to simply do this:
@iters('tr')
def get_rows(node):
    return node

rows = list(pipeline(r.content, (parse_html, get_rows)))
... yielding:
[<Element tr at 0x65ad778>,
<Element tr at 0x65ad7c8>,
<Element tr at 0x65ad818>,
<Element tr at 0x65ad868>,
<Element tr at 0x65ad8b8>,
<Element tr at 0x65ad908>,
<Element tr at 0x65ad958>,
<Element tr at 0x65ad9a8>,
...]
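Since selects falls through to whatever expression it is given when neither magic word matches, nothing stops you from passing an explicit XPath; the expression and function below are just an illustration, not part of libextract:

@selects('//a[@href]')   # any raw xpath expression works too
def get_links(node):
    return node.get('href')

links = list(pipeline(r.content, (parse_html, get_links)))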
maximize
The second construct is the maximize decorator. Before I demonstrate how to use it, let me show you what it can (fairly easily) replace in the current implementation of libextract:
# libextract/tabular.py
def node_counter_argmax(pairs):
    for node, counter in pairs:
        if counter:
            yield node, argmax(counter)

# libextract/coretools.py
def histogram(iterable):
    hist = Counter()
    for key, score in iterable:
        hist[key] += score
    return hist

def argmax(counter):
    return counter.most_common(1)[0]
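To see what these helpers actually compute, here is a tiny standalone run reusing the two definitions quoted above, plus the Counter import they rely on (the tag/score data is made up):

from collections import Counter   # needed by histogram above

hist = histogram([('p', 3), ('div', 1), ('p', 2)])
print(hist)          # Counter({'p': 5, 'div': 1})
print(argmax(hist))  # ('p', 5)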
As a quick side note, in #1, @Beluki voices this opinion:
For libextract, I think the best way to go about it is to write the functions as if combinations of them weren't available.
I take that to mean that he and others, including myself, would prefer to build web scraping/extraction algorithms from composable modules; in other words, more transparency.
Why do I bring up @Beluki's comment? I believe the new maximize decorator is in tune with it. Here's how you can recreate the TABULAR and ARTICLE black boxes:
from libextract.core import parse_html, pipeline
from libextract.generators import selects, maximize, iters
from libextract.metrics import StatsCounter

@maximize(5, lambda x: x[1].max())
@selects("tabular")  # uses the table-extracting xpath
def group_parents_children(node):
    return node, StatsCounter([child.tag for child in node])

@maximize(5, lambda x: x[1])
@selects("text")  # uses the text-extracting xpath
def group_nodes_texts(node):
    return node.getparent(), len(" ".join(node.text_content().split()))

tables = pipeline(r.content, (parse_html, group_parents_children))
text = pipeline(r.content, (parse_html, group_nodes_texts))
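Both pipelines return a list of (node, score) pairs sorted best-first, so getting at the winning element is direct (assuming the calls above found at least one match):

best_node, score = text[0]               # highest-scoring text parent
print(best_node.tag, score)
print(best_node.text_content()[:80])     # peek at the extracted text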
Here's the implementation:
from heapq import nlargest

# *max_fn* plays the same role as the "key"
# argument to "sort" and "sorted"
# *top* controls the number of elements to
# return (post-sorting)
def maximize(top=5, max_fn=select_score):
    # *fn* is a generator function that gets decorated
    # (for example, an iters- or selects-decorated custom method)
    def decorator(fn):
        def iterator(*args):
            return nlargest(top, fn(*args), key=max_fn)
        return iterator
    return decorator
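As a quick sanity check that the decorator behaves like a keyed nlargest, here is a toy run outside of libextract (the default key is swapped for a lambda so the snippet stands alone; select_score stays the default in the real code):

from heapq import nlargest

def maximize(top=5, max_fn=lambda pair: pair[1]):
    def decorator(fn):
        def iterator(*args):
            return nlargest(top, fn(*args), key=max_fn)
        return iterator
    return decorator

@maximize(top=2)
def scored_words(words):
    for w in words:          # any generator function works
        yield w, len(w)

print(scored_words(['a', 'ccc', 'bb', 'dddd']))
# [('dddd', 4), ('ccc', 3)]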
Hopefully this is enough to get the ball rolling towards the immediate goal of cleaning up libextract, as it somehow became cluttered in the short time this project's been alive.
CC @datalib/contrib