Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

Adrien Barbaresi

Last update: Jan 6, 2023

Related tags

Web Crawling nlp crawler text-mining scraper news scraping web-scraper text-extraction web-scraping readability tei-xml news-articles html2text article-extractor news-scraper text-cleaning text-preprocessing

Overview

trafilatura: Web scraping tool for text discovery and retrieval

Description

Trafilatura is a Python package and command-line tool which seamlessly downloads, parses, and scrapes web page data: it can extract metadata, main body text and comments while preserving parts of the text formatting and page structure. The output can be converted to different formats.

Distinguishing between a whole page and the page's essential parts can help to alleviate many quality problems related to web text processing, by dealing with the noise caused by recurring elements (headers and footers, ads, links/blogroll, etc.).

The extractor aims to be precise enough in order not to miss texts or to discard valid documents. In addition, it must be robust, but also reasonably fast. With these objectives in mind, Trafilatura is designed to run in production on millions of web documents. It is based on lxml as well as readability and jusText as fallback.

Features

Seamless parallelized online and offline processing:
- Download and conversion utilities included
- URLs, HTML files or parsed HTML trees as input
Robust and efficient extraction:
- Main text and/or comments
- Structural elements preserved: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
- Extraction of metadata (title, author, date, site name, categories and tags)
Several output formats supported:
- Plain text (minimal formatting)
- CSV (with metadata, tab-separated values)
- JSON (with metadata)
- XML (for metadata and structure) and TEI-XML
Link discovery and URL lists:
- Support for sitemaps and ATOM/RSS feeds
- Efficient and polite processing of URL queues
- Blacklisting
Optional language detection on extracted content

Evaluation and alternatives

For more detailed results see the evaluation page and evaluation script. To reproduce the tests just clone the repository, install all necessary packages and run the evaluation script with the data provided in the tests directory.

500 documents, 1487 text and 1496 boilerplate segments (2020-11-06)
Python Package	Precision	Recall	Accuracy	F-Score	Diff.
justext 2.2.0 (tweaked)	0.870	0.584	0.749	0.699	6.1x
newspaper3k 0.2.8	0.921	0.574	0.763	0.708	12.9x
goose3 3.1.6	0.950	0.629	0.799	0.757	19.0x
boilerpy3 1.0.2 (article mode)	0.851	0.696	0.788	0.766	4.8x
baseline (text markup)	0.746	0.804	0.766	0.774	1x
dragnet 2.0.4	0.906	0.689	0.810	0.783	3.1x
readability-lxml 0.8.1	0.917	0.716	0.826	0.804	5.9x
news-please 1.5.13	0.923	0.711	0.827	0.804	184x
trafilatura 0.6.0	0.924	0.849	0.890	0.885	3.9x
trafilatura 0.6.0 (+ fallbacks)	0.933	0.877	0.907	0.904	8.4x

External evaluations:

Most efficient open-source library in ScrapingHub's article extraction benchmark.
Best overall tool according to Gaël Lejeune & Adrien Barbaresi, Bien choisir son outil d'extraction de contenu à partir du Web (2020, PDF, French).

Usage and documentation

For further information please refer to the documentation:

Installation
Usage: On the command-line, With Python, With R
Core Python functions
Tutorials
Evaluation

License

trafilatura is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.

Roadmap

[-] Duplicate detection at sentence, paragraph and document level using a least recently used (LRU) cache
[-] URL lists and document management
[-] Configuration and extraction parameters
[-] Graphical user interface
[ ] Interaction with web archives (notably WARC format)
[ ] Integration of natural language processing tools
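
The first roadmap item (duplicate detection backed by an LRU cache) can be illustrated with a minimal sketch. The class name `LRUSeen`, the `threshold` parameter and the use of hashed keys are assumptions made for this illustration; they are not taken from trafilatura's code base.

```python
from collections import OrderedDict
from hashlib import blake2b


class LRUSeen:
    """Remember the most recently seen text segments, up to `maxsize` entries.

    Hypothetical helper for illustration, not part of trafilatura.
    """

    def __init__(self, maxsize=1024):
        self.maxsize = maxsize
        self._counts = OrderedDict()

    def is_duplicate(self, text, threshold=1):
        """Return True if `text` was already seen at least `threshold` times."""
        key = blake2b(text.strip().encode("utf-8"), digest_size=16).digest()
        count = self._counts.pop(key, 0)
        self._counts[key] = count + 1  # re-insert as most recently used
        if len(self._counts) > self.maxsize:
            self._counts.popitem(last=False)  # evict the least recently used key
        return count >= threshold


seen = LRUSeen(maxsize=2)
paragraphs = ["Subscribe to our newsletter", "Actual content", "Subscribe to our newsletter"]
unique = [p for p in paragraphs if not seen.is_duplicate(p)]
print(unique)
```

The same structure works at sentence, paragraph or document level; only the unit of text passed to `is_duplicate` changes.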

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page. Thanks to the contributors who submitted features and bugfixes!

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. Web corpus construction involves numerous design decisions, and this software package can help facilitate text data collection and enhance corpus quality.

Barbaresi, A. "Generic Web Content Extraction with Open-Source Software", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
Barbaresi, A. "Efficient construction of metadata-enhanced web corpora", Proceedings of the 10th Web as Corpus Workshop (WAC-X), 2016.

You can contact me via my contact page or GitHub.

Going further

Online documentation: trafilatura.readthedocs.io.

Tutorials: overview.

Trafilatura: Italian word for wire drawing.

Corresponding posts on Bits of Language (blog).

Comments

Celery error with v1.2.1: ValueError: signal only works in main thread

With version 1.2.1 it is not possible to run trafilatura extraction inside an async task runner such as Celery. The call fails here: https://github.com/adbar/trafilatura/blob/1bb5fee6a4812e53b6597053c25efde995174d79/trafilatura/core.py#L982 It would be better to have HAS_SIGNAL as a config variable rather than a hardcoded value.

celery_1      |     text = trafilatura.extract(
celery_1      |   File "/usr/local/lib/python3.8/site-packages/trafilatura/core.py", line 982, in extract
celery_1      |     signal(SIGALRM, timeout_handler)
celery_1      |   File "/usr/local/lib/python3.8/signal.py", line 47, in signal
celery_1      |     handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
celery_1      | ValueError: signal only works in main thread
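
The error in this traceback follows from a CPython rule: signal handlers can only be installed from the main thread. The sketch below reproduces that behaviour with the standard library only. The function names are made up for the illustration, and SIGINT stands in for SIGALRM so that the example also runs on platforms without SIGALRM.

```python
import signal
import threading


def install_handler():
    """Try to register a signal handler and report whether it worked."""
    try:
        # SIGINT is used because it exists on every platform; the issue
        # above concerns SIGALRM, which is subject to the same restriction.
        previous = signal.signal(signal.SIGINT, signal.default_int_handler)
        if previous is not None:
            signal.signal(signal.SIGINT, previous)  # restore the earlier handler
        return "ok"
    except ValueError as err:
        return f"ValueError: {err}"


def run_in_worker(func):
    """Run `func` in a non-main thread and return its result."""
    result = []
    worker = threading.Thread(target=lambda: result.append(func()))
    worker.start()
    worker.join()
    return result[0]


def can_use_signals():
    """Possible guard: only rely on signals when running in the main thread."""
    return threading.current_thread() is threading.main_thread()


main_result = install_handler()
worker_result = run_in_worker(install_handler)
print(main_result)
print(worker_result)
```

A check like `can_use_signals()` is one way a caller could decide whether a signal-based timeout is safe; whether the library adopts such a guard or a configuration switch is up to the maintainer.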

feedback

opened by alex-bender 16

No metadata extraction
Hello,

Thanks for your beautiful and powerful project, I try to test some websites with trafilatura 0.6.0 in Python 3.8.

My test:

import trafilatura
from trafilatura.core import bare_extraction

downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
result = bare_extraction(downloaded, include_formatting=False, with_metadata=True)
print(result)

The results: ({'title': None, 'author': None, 'url': None, 'hostname': None, 'description': None, 'sitename': None, 'date': None, 'categories': None, 'tags': None, 'fingerprint': None, 'id': None}, 'Leader spotlight: Erin Spiceland Every March we recognize the women who have shaped history—and now, we’re taking a look forward. From driving software development in large companies to maintaining thriving open source communities, we’re spending Women’s History Month with women leaders who are making history every day in the tech community. Erin Spiceland is a Software Engineer for SpaceX. Born and raised in rural south Georgia, she is a Choctaw and Chickasaw mother of two now living in downtown Los Angeles. Erin didn’t finish college—she’s a predominantly self-taught software engineer. In her spare time, she makes handmade Native American beadwork and regalia and attends powwows. How would you summarize your career (so far) in a single sentence? My career has been a winding road through periods of stimulation and health as well as periods of personal misery. During it all, I’ve learned a variety of programming languages and technologies while working on a diverse array of products and services. I’m a domestic abuse survivor and a Choctaw bisexual polyamorous woman. I’m so proud of myself that I made it this far considering where I came from. What was your first job in tech like? In 2007, I had a three-year-old daughter and I was trying to finish my computer science degree one class at a time, all while keeping my house and family running smoothly. I found the math classes exciting and quickly finished my math minor, leaving only computer science classes. I was looking at about five years before I would graduate. Then, my husband at the time recommended me for an entry software developer position at a telecom and digital communications company. 
When faced with the choice between an expensive computer science degree and getting paid to do what I loved, I dropped out of college and accepted the job. I was hired to work on internal tooling, and eventually, products. I did a lot of development on product front-ends, embedded network devices, and a distributed platform-as-a-service. I learned Java/JSP, Python, JavaScript/CSS, Node.js, as well as MySQL, PostgreSQL, and distributed systems architecture. It was an intense experience that required a lot of self-teaching, asking others for help, and daycare, but it set me up for my later successes. What does leadership mean to you in your current role? “Leadership is about enabling those below, above, and around you to be at their healthiest and most effective so that all of you can accurately understand your surroundings, make effective plans and goals for the future, and achieve those goals.” I appreciate and admire technical, effective leaders who care for their reports as humans, not as lines on a burndown chart, and forego heavy-handed direction in favor of communication and mutual dialogue. I think it’s as important for a leader to concern herself with her coworkers’ personal well-being as it is for her to direct their performance. What’s the biggest career risk you’ve ever taken? What did you learn from that experience? Last year I took a pay cut to move from a safe, easy job where I had security to work in a language I hadn’t seen in years and with systems more complicated than anything I’d worked with before. I moved from a place where I had a huge four bedroom house to a studio apartment that was twice the price. I moved away from my children, of who I share custody with my ex-husband. We fly across the U.S. to see each other now. I miss my children every day. However, I get to be a wonderful role model for them. 
“I get to show my children that a Native woman who grew up in poverty, lost her mother and her culture, and who didn’t finish college can learn, grow, and build whatever career and life she wants.” What are you looking forward to next? I can’t wait to wake up every day with my partner who loves me so much. I’m looking forward to showing my children exactly how far they can go. I’m excited to keep exploring Los Angeles. “I expect to learn so much more about software and about life, and I want to experience everything.” Want to know more about Erin Spiceland? Follow them on GitHub or Twitter. Want to learn more about featured leaders for Women’s History Month? Read about: Laura Frank Tacho, Director of Engineering at CloudBees Rachel White, Developer Experience Lead at American Express Kathy Pham, Computer Scientist and Product Leader at Mozilla and Harvard Heidy Khlaaf, Research Consultant at Adelard LLP Check back in soon—we’ll be adding new interviews weekly throughout March.', <Element body at 0x10680a280>, <Element body at 0x1067af080>)

So no metadata is returned.

Also, I added an XPath to metaxpaths.py and rebuilt your code. I'm sure that //div[contains(@class, "post__categories")]//li//a matches a category on the page https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/, but no category is returned.

categories_xpaths = [
    """//div[starts-with(@class, 'post-info') or starts-with(@class, 'postinfo') or starts-with(@class, 'post-meta') or starts-with(@class, 'postmeta') or starts-with(@class, 'meta') or starts-with(@class, 'entry-meta') or starts-with(@class, 'entry-info') or starts-with(@class, 'entry-utility') or starts-with(@id, 'postpath')]//a""",
    "//p[starts-with(@class, 'postmeta') or starts-with(@class, 'entry-categories') or @class='postinfo' or @id='filedunder']//a",
    "//footer[starts-with(@class, 'entry-meta') or starts-with(@class, 'entry-footer') or starts-with(@class, 'post-info')]//a",
    '//*[(self::li or self::span)][@class="post-category" or starts-with(@class, "post__categories") or @class="postcategory" or @class="entry-category"]//a',
    '//header[@class="entry-header"]//a',
    '//div[@class="row" or @class="tags"]//a',
    '//div[contains(@class, "post__categories")]//li//a',
]

Another question: could I get the content of the article with its HTML formatting (without the tags being cleaned)?

Please help me, thanks for your support!
enhancement
opened by phongtnit 16
Issue with multiple authors and preference for meta information

We shouldn't rely on the schema Person data.

agenda Current: "author": "Sandy Cheu", Should be: "author": "Stephen Teulan; Nikita Weikhardt",

aged Current: "author":"Consumers", Should be: "author": "Liz Alderslade",

meta remove single names

cath Current: "author": null, Should be: "author": "Rebecca",

echo Current: "author": null, Should be: "author": "Katie",
enhancement

opened by felipehertzer 15
Navigation bar filtering - some bug fixed

The current repo should work well now. I have removed several things that were unused and fixed a tiny bug that affected the accuracy. I have also added entries to .gitignore, so the branch should be quite clean now as well XD

opened by immortal-autumn 13
No Formatting in Plain Text Output

When using include_formatting for plain text, I'm not seeing any formatting (bold, italics, etc.). The terminal I'm using supports this. Is this by design or a bug? I tried both the standalone version and using it as a library with trafilatura.extract(downloaded, include_formatting=True).
enhancement question

opened by peterjschroeder 13

Performance enhancement

I. Test file

test2.py

from time import time

import requests
from trafilatura import extract


if __name__ == '__main__':
    urls = ["https://en.wikipedia.org/wiki/List_of_Hindi_songs_recorded_by_Asha_Bhosle",
            "https://en.wikipedia.org/wiki/2022_in_video_games",
            "https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Kuwait",
            "https://en.wikipedia.org/wiki/Presidency_of_Rodrigo_Duterte",
            "https://en.wikipedia.org/wiki/List_of_2021%E2%80%9322_NBA_season_transactions",
            "https://en.wikipedia.org/wiki/2022_in_sports",
            "https://en.wikipedia.org/wiki/Firefox_version_history",
            "https://en.wikipedia.org/wiki/List_of_common_misconceptions",
            "https://en.wikipedia.org/wiki/Same-sex_union_legislation",
            "https://en.wikipedia.org/wiki/Presidency_of_Donald_Trump",]

    cum_time = 0
    for url in urls:        
        resp = requests.get(url)
        t0 = time()
        result = extract(resp.text)
        cum_time = cum_time + time() - t0
    print(cum_time)

II. Test line_profiler (kernprof)

kernprof -lv test2.py

before

Total time: 0.544693 s
File: /trafilatura-master/trafilatura/utils.py
Function: remove_control_characters at line 221

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   221                                           @profile
   222                                           def remove_control_characters(string):
   223                                               '''Prevent non-printable and XML invalid character errors'''
   224     25998     544693.0     21.0    100.0      return ''.join([c for c in string if c.isprintable() or c.isspace()])

after

Total time: 0.169241 s
File: /trafilatura-master/trafilatura/utils.py
Function: remove_control_characters at line 227

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   227                                           @profile
   228                                           def remove_control_characters(string):
   229                                               '''Prevent non-printable and XML invalid character errors'''
   230     25998     169241.0      6.5    100.0      return ''.join(filter(is_printable_or_space, string))
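
The two variants profiled above can be compared outside the library with a short, self-contained script. The body of `is_printable_or_space` is not shown in the issue, so the definition below is an assumption; timings will vary with the interpreter and with how that helper is actually implemented, so the script measures rather than asserts which variant is faster.

```python
from timeit import timeit


def is_printable_or_space(char):
    # Assumed helper: the issue names it but does not show its body.
    return char.isprintable() or char.isspace()


def remove_control_characters_listcomp(string):
    """'before' variant: list comprehension."""
    return ''.join([c for c in string if c.isprintable() or c.isspace()])


def remove_control_characters_filter(string):
    """'after' variant: filter() with a predicate function."""
    return ''.join(filter(is_printable_or_space, string))


sample = "Some text\x00 with\x07 control\tcharacters\n" * 1000

# Both variants must produce the same cleaned string.
assert remove_control_characters_listcomp(sample) == remove_control_characters_filter(sample)

for func in (remove_control_characters_listcomp, remove_control_characters_filter):
    print(func.__name__, timeit(lambda: func(sample), number=20))
```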

III. Test vprof

vprof -c -h test2.py

[Screenshots of the vprof output: before, after]

feedback

opened by deedy5 10

Correction in the extraction of authors by tag and by json
In this correction:

added 'submitted-by' and 'username' tags to xpath

the maximum size of the author's name has been increased.

regex has been added to remove emoji from author names often found on sites like buzzfeed

added a regex to minify json before running the other regex, was having trouble fetching authors when json formatted.

added a regex to remove json items like images and organization before searching the author

reorganized the extract_json function as it was overwriting meta tags with none when no json was found

qsr Before this fix: "author": null After this fix: "author": "Kevin Santos"

perthnow Before this fix: "author": "NCA NewsWire" After this fix: "author": "Finn McHugh"

buzzfeed Before this fix: "author": "Hameda Nafiz BuzzFeed Staff" After this fix: "author": "Hameda Nafiz"

buzzfeed Before this fix: "author": "Olivia ❤️" After this fix: "author": "Olivia Community Contributor"

build Before this fix: "author": null After this fix: "author": "Thoams Lane"

hunterandbligh Before this fix: none After this fix: "author": "REBECCA MAGRO"

abc - 'data-component' Before this fix: "author": null After this fix: "author": "Charlotte Gore"

proactiveinvestors Before this fix: "author": null After this fix: "author": "Calum Muirhead"

banking Before this fix: "author": "Sarah Harman Jul" After this fix: "author": "Sarah Harman"

hcamag Before this fix: "author": "Sarah Harman Jul" After this fix: "author": "Mark Rosanes"

spacedaily and + 9 sites Before this fix: "author": null After this fix: "author": "Lucie Aubourg"

first Before this fix:"author": "Nick Griffin", After this fix: "author": "Stan Shamu",

racing Before this fix:"author": "Ben Sporle - @bensporle; Ben Sporle", After this fix: "author": "Ben Sporle",

ajn Before this fix:"author": "RABBI GARY ROBUCK July", After this fix: "author": "RABBI GARY ROBUCK",

ESPN it is not totally fixed, but it is better Before this fix: "author": "Andrew Mcglashandeputy Editor, Espncricinfo", After this fix: "author": "Andrew McGlashan Deputy editor; ESPNcricinfo",

Probono it is not totally fixed, but it is better Before this fix: "author": null, After this fix: "author": "Luke Michael; Journalist; @Luke_Michael",
opened by felipehertzer 10
Library is redirecting stderr to /dev/null upon every call

If the readability fallback is activated, the Trafilatura library redirects stderr to /dev/null upon every call: https://github.com/adbar/trafilatura/blob/a56fb3e041175df38a32b1c5ef2e9c7888eeb7a6/trafilatura/external.py#L63

Within programs involving other libraries, this causes a host of side effects. E.g., generating a chart with seaborn imports ipython (a dependency of seaborn) which pre-checks upon initialization stdin, stdout and stderr and crashes because stderr is /dev/null. I have other side effects as well in other libraries, including disappearing logs (eg when logs settings are modified after calls to Trafilatura).

This redirection seems to have been necessary to prevent the readability library from printing messages to stderr. A cursory reading of the current version of readability seems to indicate it doesn't do that; it only emits proper logs.

Consequently, this redirect may be removed (to be tested).

opened by dmoklaf 10
In parallel trafilatura is marginally slower than goose

I'm not quite sure where to begin with this, it's a strange one. In a real world scenario I tried switching from Goose3 to Trafilatura. I'm processing html extractions in parallel with dask. After switching to trafilatura, I noticed a 30% slowdown. I ended up writing my own evaluation library to verify the results.

Results from running in parallel: ┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ Library ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃ ┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩ │ goose3 │ 0.9678 │ 0.8561 │ 0.9547 │ 0.9027 │ 0.8343 │ 383.4737 │ │ trafilatura │ 0.9124 │ 0.9485 │ 0.908 │ 0.9278 │ 0.8567 │ 361.3232 │ └─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘

Results from running sequentially: ┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ Library ┃ Accuracy ┃ Precision ┃ Recall ┃ FScore ┃ Mean Similarity ┃ Items/sec ┃ ┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩ │ goose3 │ 0.9678 │ 0.8561 │ 0.9547 │ 0.9027 │ 0.8343 │ 9.7953 │ │ trafilatura │ 0.9124 │ 0.9485 │ 0.908 │ 0.9278 │ 0.8567 │ 23.0045 │ └─────────────┴──────────┴───────────┴────────┴────────┴─────────────────┴───────────┘

Note: the dataset evaluated is from the scrapinghub/article-extraction-benchmark tool. The only portion of the code that runs in parallel for the benchmarks is the extraction. Only the extraction is timed for calculating items/sec.

In summary: trafilatura is marginally slower than Goose3 in parallel. However sequentially it is twice as fast as Goose3.

I'm not sure where to begin with this. It can be difficult to profile parallel processing. It may be related to some of the memory leak issues reported with trafilatura, although it appears those have been resolved. Or it may be the caching; I haven't looked into how that functions.

I will work on publishing my benchmarking tool this afternoon.
question

opened by getorca 9
Handle pages where article is split into multiple sibling nodes

This fixes #85 (and #159).

It involved a bit of a refactor of the extract_content function, but the basic idea is that it looks through all of the children in the subtree returned from tree.xpath(expr), not just stopping at the first child like before. Beyond that, it pulls out the logic that checks whether the BODY_XPATH expression matched in the current loop iteration has found a useful subtree, to make it a little more readable, and only performs the final cleanup and look-elsewhere logic at the very end.

So essentially, on finding a subtree whose first node is valid, we proceeded to consider all of the remaining nodes in that subtree.

This seems to work great, although I haven't run it through the automated tests. (I had trouble running the url tests.)

Let me know what you think. Happy to talk through anything, and if/when this seems good to you, I'll clean it up (print statements, code style, etc.).

Thanks!

opened by naftalibeder 9

Broken parsing of images

I'm not quite sure what's wrong with images but here is reproducer:

$ curl https://en.wikipedia.org/wiki/Tribe > /tmp/tribe.html
$ python
Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import trafilatura
>>> html_wiki_tribe = open('/tmp/tribe.html').read()
>>> text = trafilatura.extract(
...     html_wiki_tribe,
...     include_images=True
... )
~/anaconda3/lib/python3.7/site-packages/trafilatura/xml.py in xmltotxt(xmloutput, include_formatting, include_links)
    272             LOGGER.debug('unexpected element: %s', element.tag)
    273             returnlist.extend([textelement, ' '])
--> 274     return sanitize(''.join(returnlist))
    275 
    276 

TypeError: sequence item 6: expected str instance, NoneType found
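
The TypeError comes from `str.join` receiving a `None` among the text fragments. The snippet below reproduces the failure with plain Python and shows one defensive option. `join_text` is a hypothetical helper written for this illustration and is not necessarily the fix adopted in trafilatura.

```python
def join_text(parts):
    """Join text fragments, skipping None values instead of raising TypeError."""
    return ''.join(part for part in parts if part is not None)


# A fragment list where one element has no text, as can happen for
# elements such as images that carry attributes but no text content.
fragments = ["Tribe ", None, "is a term"]

try:
    ''.join(fragments)
except TypeError as err:
    print(f"TypeError: {err}")

print(join_text(fragments))
```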

UPD Looks like this could help:

bug

opened by alex-bender 9

Improve title extraction by removing sitename suffix
Most sites add a suffix like:

My article title | My Site Name

My article title - My Site Name

There is no need for the site name within the article title.

Common separators are: - | – — • · ‹ › ⁄ « » < > : * ⋆ ~

Some sites use HTML entities for this, like: &mdash;
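
A minimal sketch of the requested behaviour, assuming the site name is already known (for example from the page metadata). The function name `strip_sitename` and the regular expression are illustrative choices, not existing trafilatura code. HTML entities would need to be decoded first, for instance with `html.unescape`.

```python
import re

# Separator characters from the list above; \u2013 and \u2014 are the
# en dash and em dash.
SEPARATORS = "-|\u2013\u2014•·‹›⁄«»<>:*⋆~"


def strip_sitename(title, sitename):
    """Remove a trailing '<separator> <sitename>' from `title`, if present."""
    if not sitename:
        return title
    pattern = re.compile(
        r"\s*[" + re.escape(SEPARATORS) + r"]\s*" + re.escape(sitename) + r"\s*$",
        flags=re.IGNORECASE,
    )
    return pattern.sub("", title).strip()


for raw_title in ("My article title | My Site Name", "My article title - My Site Name"):
    print(strip_sitename(raw_title, "My Site Name"))
```

Matching on the known site name rather than on any separator avoids truncating titles that legitimately contain a dash or colon.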
enhancement
opened by andremacola 5

Remove unwanted html elements with regex or xpaths

Possibility to remove unnecessary html elements before starting the extraction process.

There are often some elements within the extracted text that are not article content.

Titles should by default not come inside the extracted text, or there should be an option to remove them (maybe this requires another issue)

Something like:

unwanted = [
  'iframe',
  'button',
  'figcaption',
  'caption',
  'form',
  'aside',
  'script',
  'style',
  'ins',
  'link',
  'header',
  'footer',
  '#comments',
  'nav',
  '.post-comments',
  '.post-tags',
  '.wp-block-embed',
  '.wp-caption-text',
  'svg',
  '[class^=ads]',
  '[class*=ads-]',
  '[style="display:none"]',
  '[style*="display:none"]',
  '[style*="display: none"]',
  '[itemprop*="description"]',
  '.push-web-notification',
  '.mc-column.entities',
  '.newsletter-component',
  '.post-subject',
  '.post-info',
  '.addthis_tool',
  '.pt-cv-wrapper'
]

article = trafilatura.bare_extraction(document,
        unwanted_elements=unwanted,  # proposed parameter, not part of the current API
        include_comments=False, include_tables=False,
        favor_precision=True, favor_recall=True,
        no_fallback=True, target_language=None,
        date_extraction_params={'extensive_search': True, 'original_date': True, 'outputformat': "%Y-%m-%dT%H:%M:%S%z"},
        config=config)

question

opened by andremacola 4

feat: Add image urls to metadata

Sometimes an image is not included in the text body, but it can be extracted from some SEO tags.

Issue: https://github.com/adbar/trafilatura/issues/281

Unfortunately I didn't have time to create the tests

opened by andremacola 2
Add image urls to metadata
Sometimes an image is not included in the text body, but it can be extracted from some SEO tags, as some article parsers do (https://github.com/extractus/article-extractor/blob/main/src/utils/extractMetaData.js)

Here are some meta tags:

'image' 'og:image' 'og:image:url' 'og:image:secure_url' 'twitter:image' 'twitter:image:src'
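
As a rough illustration of the idea, the meta tags listed above can be collected with the standard library's HTML parser. The class and function names are hypothetical, and a real implementation inside trafilatura would more likely work on the already parsed lxml tree.

```python
from html.parser import HTMLParser

# Meta tag names listed above.
IMAGE_META_KEYS = (
    "image", "og:image", "og:image:url", "og:image:secure_url",
    "twitter:image", "twitter:image:src",
)


class ImageMetaParser(HTMLParser):
    """Collect image URLs declared in <meta> elements."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Open Graph tags use `property`, Twitter cards usually use `name`.
        key = attributes.get("property") or attributes.get("name")
        url = attributes.get("content")
        if key in IMAGE_META_KEYS and url and url not in self.images:
            self.images.append(url)


def extract_image_urls(page):
    """Return the de-duplicated image URLs found in the page's meta tags."""
    parser = ImageMetaParser()
    parser.feed(page)
    return parser.images


page = """<html><head>
<meta property="og:image" content="https://example.org/cover.jpg">
<meta name="twitter:image" content="https://example.org/cover.jpg">
<meta name="twitter:image:src" content="https://example.org/thumb.jpg">
</head><body><p>Text</p></body></html>"""
print(extract_image_urls(page))
```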
enhancement
opened by andremacola 1
Extraction of Youtube iframes and img elements with links

Not able to fetch image tags. Not able to fetch iframe tags. Commands run from the command prompt on a Windows machine:

trafilatura --sitemap "https://www.lyricspulp.com/" --list > linklist.txt
trafilatura --sitemap homepage --list > linklist.txt
trafilatura -i linklist.txt --xml -o outputfile.txt
trafilatura -i linklist.txt --formatting --links --images --no-comments --xml -o outputfile.txt
enhancement

opened by sampathmende 3

Releases(v1.4.0)

v1.4.0(Oct 18, 2022)
Impact on extraction and output format:

better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)

XML: preserve list type as attribute (#229)

XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)

faster text cleaning and shorter code (#237 with @deedy5, #245)

metadata: add language when detector is activated (#224)

metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)

TXT: change markdown formatting of headers by @LaundroMat (#257)

Smaller changes in convenience functions:

add function to clear caches (#219)

CLI: change exit code if download fails (#223)

settings: use "\n" for multiple user agents by @k-sareen (#241)

Updates:

docs updated (and #244 by @dsgibbons)

package dependencies updated

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.3.0...v1.4.0
v1.3.0(Jul 29, 2022)
fast and robust html2txt() function added (#221)

more robust parsing (#228)

fixed bugs in metadata extraction, with @felipehertzer in #213 & #226

extraction about 10-20% faster, slightly better recall

partial fixes for memory leaks (#216)

docs extended and updated (#217, #225)

prepared deprecation of old process_record() function

more stable processing with updated dependencies

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.2...v1.3.0
v1.2.2(May 18, 2022)
more efficient rules for extraction

metadata: further attributes used (with @felipehertzer)

better baseline extraction

issues fixed: #202, #204, #205

evaluation updated

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.1...v1.2.2
v1.2.1(May 2, 2022)
What's Changed

--precision and --recall arguments added to the CLI

better text cleaning: paywalls and comments

improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188

further bugs fixed: #189, #192 (with @felipehertzer), #200

efficiency: faster module loading and improved RAM footprint

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.0...v1.2.1
v1.2.0(Mar 7, 2022)
efficiency: replaced module readability-lxml by trimmed fork

bugs fixed: (#179, #180, #183, #184)

improved baseline extraction

cleaner metadata (with @felipehertzer)

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.1.0...v1.2.0
v1.1.0(Feb 21, 2022)
encodings: better detection, output NFC-normalized Unicode

maintenance and performance: more efficient code

bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others)

prepare compatibility with upcoming Python 3.11

changed default settings

extended documentation

Full Changelog: https://github.com/adbar/trafilatura/compare/v1.0.0...v1.1.0
v1.0.0(Nov 30, 2021)
compress HTML backup files & seamlessly open .gz files

support JSON web feeds

graphical user interface integrated into main package

faster downloads: reviewed backoff, compressed data

optional modules: downloads with pycurl, language identification with py3langid

bugs fixed (#111, #125, #132, #136, #140)

minor optimizations and fixes by @vbarbaresi in #124 & #130

fixed array with single or multiple entries in JSON extractor by @felipehertzer in #143

code base refactored with @sourcery-ai #121, improved and optimized for Python 3.6+

drop support for Python 3.5

Full Changelog: https://github.com/adbar/trafilatura/compare/v0.9.3...v1.0.0
v0.9.3(Oct 21, 2021)
better, faster encoding detection: replaced chardet with charset_normalizer

faster execution: updated justext to 3.0

better extraction of sub-elements in tables (#78, #90)

more robust web feed parsing

further defined precision- and recall-oriented settings

license extraction in footers (#118)

Full Changelog: https://github.com/adbar/trafilatura/compare/v0.9.2...v0.9.3
v0.9.2(Oct 6, 2021)
first precision- and recall-oriented presets defined

improvements in authorship extraction (thanks @felipehertzer)

requesting TXT output with formatting now results in Markdown format

bugs fixed: notably extraction robustness and consistency (#109, #111, #113)

setting for cookies in request headers (thanks @muellermartin)

better date extraction thanks to htmldate update

v0.9.1(Aug 2, 2021)
improved author extraction (thanks @felipehertzer!)

bugs fixed: HTML element handling, HTML meta attributes, spider, CLI, ...

docs updated and extended

CLI: option names normalized (heed deprecation warnings), new option explore

v0.9.0(Jun 15, 2021)
focused crawling functions including politeness rules

more efficient multi-threaded downloads + use as Python functions

documentation extended

bugs fixed: extraction and URL handling

removed support for Python 3.4

v0.8.2(Apr 21, 2021)
better handling of formatting, links and images, title type as attribute in XML formats

more robust sitemaps and feeds processing

more accurate extraction

further consolidation: code simplified and bugs fixed

v0.8.1(Mar 11, 2021)
extraction trade-off: slightly better recall

code robustness: requests, configuration and navigation

bugfixes: image data extraction

v0.8.0(Feb 19, 2021)
improved link discovery and handling

fixes in metadata extraction, feeds and sitemaps processing

breaking change: the extract function now reads target format from output_format argument only

new extraction option: preserve links, CLI options re-ordered

more opportunistic backup extraction

v0.7.0(Jan 4, 2021)
customizable configuration file to parametrize extraction and downloads

better handling of feeds and sitemaps

additional CLI options: cryptographic hash for file name, use Internet Archive as backup

more precise extraction

faster downloads: requests replaced with bare urllib3 and custom decoding

consolidation: bug fixes and improvements, many thanks to the issues reporters!
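
To illustrate what a hash-based file name (as listed for v0.7.0) can look like, here is a small standard-library sketch. It is an assumption-laden example, not the scheme used by the command-line tool: the function name `hashed_filename`, the choice of SHA-256, the truncation length and the `.txt` extension are all hypothetical.

```python
import hashlib


def hashed_filename(content, extension=".txt", length=16):
    """Derive a stable file name from the text content (illustrative only)."""
    # The same content always yields the same digest, hence the same file name
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    # A truncated hex digest is short and contains only file-system-safe characters
    return digest[:length] + extension
```

Naming files after a content digest avoids problems with long or unsafe URL characters in file names, and identical documents map to the same name.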

v0.6.1(Dec 2, 2020)
added bare_extraction function returning Python variables

improved link discovery in feeds and sitemaps

option to preserve image info

fixes (many thanks to bug reporters!)

v0.6.0(Nov 6, 2020)
link discovery in sitemaps

compatibility with Python 3.9

extraction coverage improved

deduplication now optional

bug fixes

v0.5.2(Sep 22, 2020)
optional language detector changed: langid → pycld3

helper function bare_extraction()

optional deduplication off by default

better URL handling (courlan), more complete metadata

code consolidation (cleaner and shorter)

v0.5.1(Jul 15, 2020)
extended and more convenient command-line options

output in JSON format

bug fixes

v0.5.0(Jun 2, 2020)
faster and more robust text and metadata extraction

more efficient batch processing (parallel processing, URL queues)

support for ATOM/RSS feeds

complete command-line tool with corresponding options
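
The ATOM/RSS support introduced in v0.5.0 is about discovering article links in web feeds. As a rough idea of what such link discovery involves, the sketch below uses only the standard library; it does not reproduce Trafilatura's feed handling, and the function name `feed_links` is an assumption.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def feed_links(xml_text):
    """Return entry URLs from an RSS 2.0 or Atom feed, in document order."""
    root = ET.fromstring(xml_text)
    links = []
    # RSS 2.0: each <item> carries its URL as the text of a <link> child
    for item in root.iter("item"):
        url = item.findtext("link")
        if url and url.strip():
            links.append(url.strip())
    # Atom: each <entry> has <link href="..."/>; a missing rel means "alternate"
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            href = link.get("href")
            if href and link.get("rel", "alternate") == "alternate":
                links.append(href)
    return links
```

Restricting the search to `<item>` and `<entry>` elements skips the feed-level link, which usually points to the site's home page rather than to an article.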

v0.4.1(Apr 24, 2020)
better metadata extraction and integration (XML & XML-TEI)

more efficient processing

output directory as CLI-option

v0.1.0(Sep 25, 2019)

First release used in production and meant to be archived on Zenodo for reproducibility and citability.

Owner

Adrien Barbaresi

Research scientist – web texts, computational linguistics and digital humanities

GitHub https://trafilatura.readthedocs.io/

News, full-text, and article metadata extraction in Python 3. Advanced docs:

Newspaper3k: Article scraping & curation Inspired by requests for its simplicity and powered by lxml for its speed: "Newspaper is an amazing python li

12.3k Jan 7, 2023

🥫 The simple, fast, and modern web scraping library

About gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies. I

692 Dec 22, 2022

Here I provide the source code for doing web scraping using the python library, it is Selenium.

1 Nov 13, 2021

Simple library for exploring/scraping the web or testing a website you’re developing

Robox is a simple library with a clean interface for exploring/scraping the web or testing a website you’re developing. Robox can fetch a page, click on links and buttons, and fill out and submit forms.

79 Nov 27, 2022

👁️ Tool for Data Extraction and Web Requests.

httpmapper: For educational purposes. This is a project that I developed,

15 Dec 5, 2021

Command line program to download documents from web portals.

command line document download made easy Highlights list available documents in json format or download them filter documents using string matching re

16 Dec 26, 2022

A web scraping pipeline project that retrieves TV and movie data from two sources, then transforms and stores data in a MySQL database.

New to Streaming Scraper An in-progress web scraping project built with Python, R, and SQL. The scraped data are movie and TV show information. The go

1 Mar 28, 2022

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group

8.4k Jan 8, 2023

Web Scraping OLX with Python and Bsoup.

webScrap WebScraping first step. Authors: Paulo, Claudio M. First steps in Web Scraping. Project carried out for training in Web Scraping. The export

5 Sep 25, 2022

Web Scraping images using Selenium and Python

Web Scraping images using Selenium and Python. About this document: This is a markdown document about Web scraping images and videos using Selenium

3 Jul 1, 2022

Web-scraping - A bot using Python with BeautifulSoup that scraps IRS website by form number and returns the results as json

Web-scraping - A bot using Python with BeautifulSoup that scraps IRS website (prior form publication) by form number and returns the results as json. It provides the option to download pdfs over a range of years.

1 Jan 4, 2022

Web-scraping - Program that scrapes a website for a collection of quotes, picks one at random and displays it

A training task for web scraping using python multithreading and a real-time-updated list of available proxy servers.

Web Scraping Framework

Scrapy, a fast high-level web crawling & scraping framework for Python.

Async Python 3.6+ web scraping micro-framework based on asyncio

Transistor, a Python web scraping framework for intelligent use cases.

A Web Scraping Program.

Web-Scraping using Selenium Master