Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

Overview

trafilatura: Web scraping tool for text discovery and retrieval

Description

Trafilatura is a Python package and command-line tool which seamlessly downloads, parses, and scrapes web page data: it can extract metadata, main body text and comments while preserving parts of the text formatting and page structure. The output can be converted to different formats.

Distinguishing between a whole page and the page's essential parts can help to alleviate many quality problems related to web text processing, by dealing with the noise caused by recurring elements (headers and footers, ads, links/blogroll, etc.).

The extractor aims to be precise enough not to miss texts or discard valid documents. It must also be robust and reasonably fast. With these objectives in mind, Trafilatura is designed to run in production on millions of web documents. It is based on lxml, with readability and jusText as fallbacks.
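
A minimal usage sketch (the URL is a placeholder; both calls return None on failure):

    import trafilatura

    # download a page and extract its main text as plain text
    downloaded = trafilatura.fetch_url('https://example.org/article')
    if downloaded is not None:
        print(trafilatura.extract(downloaded))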

Features

  • Seamless parallelized online and offline processing:
    • Download and conversion utilities included
    • URLs, HTML files or parsed HTML trees as input
  • Robust and efficient extraction:
    • Main text and/or comments
    • Structural elements preserved: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
    • Extraction of metadata (title, author, date, site name, categories and tags)
  • Several output formats supported (see the sketch after this list):
    • Plain text (minimal formatting)
    • CSV (with metadata, tab-separated values)
    • JSON (with metadata)
    • XML (for metadata and structure) and TEI-XML
  • Link discovery and URL lists:
    • Support for sitemaps and ATOM/RSS feeds
    • Efficient and polite processing of URL queues
    • Blacklisting
  • Optional language detection on extracted content
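
A short sketch of the format switch mentioned above, assuming the download succeeded (the URL is a placeholder):

    import trafilatura

    downloaded = trafilatura.fetch_url('https://example.org/article')
    # the output_format argument selects the target format
    as_xml = trafilatura.extract(downloaded, output_format='xml')
    as_json = trafilatura.extract(downloaded, output_format='json')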

Evaluation and alternatives

For more detailed results see the evaluation page and evaluation script. To reproduce the tests, clone the repository, install the necessary packages, and run the evaluation script with the data provided in the tests directory.

500 documents, 1487 text and 1496 boilerplate segments (2020-11-06). Diff. denotes processing time relative to the baseline (1x).

Python Package                     Precision   Recall   Accuracy   F-Score   Diff.
justext 2.2.0 (tweaked)            0.870       0.584    0.749      0.699     6.1x
newspaper3k 0.2.8                  0.921       0.574    0.763      0.708     12.9x
goose3 3.1.6                       0.950       0.629    0.799      0.757     19.0x
boilerpy3 1.0.2 (article mode)     0.851       0.696    0.788      0.766     4.8x
baseline (text markup)             0.746       0.804    0.766      0.774     1x
dragnet 2.0.4                      0.906       0.689    0.810      0.783     3.1x
readability-lxml 0.8.1             0.917       0.716    0.826      0.804     5.9x
news-please 1.5.13                 0.923       0.711    0.827      0.804     184x
trafilatura 0.6.0                  0.924       0.849    0.890      0.885     3.9x
trafilatura 0.6.0 (+ fallbacks)    0.933       0.877    0.907      0.904     8.4x

External evaluations:

Usage and documentation

For further information please refer to the documentation at trafilatura.readthedocs.io.

License

trafilatura is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please consider interacting at arm's length, multi-licensing with compatible licenses, or contacting me.

See also GPL and free software licensing: What's in it for business?

Roadmap

  • [-] Duplicate detection at sentence, paragraph and document level using a least recently used (LRU) cache (see the sketch after this list)
  • [-] URL lists and document management
  • [-] Configuration and extraction parameters
  • [-] Graphical user interface
  • [ ] Interaction with web archives (notably WARC format)
  • [ ] Integration of natural language processing tools
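
A minimal, illustrative sketch of LRU-based duplicate detection (not trafilatura's actual implementation):

    from collections import OrderedDict

    class LRUDeduplicator:
        '''Remember recently seen text segments and flag repeats.'''

        def __init__(self, maxsize=1024):
            self.seen = OrderedDict()
            self.maxsize = maxsize

        def is_duplicate(self, segment):
            key = hash(segment)
            if key in self.seen:
                self.seen.move_to_end(key)  # mark as recently used
                return True
            self.seen[key] = True
            if len(self.seen) > self.maxsize:
                self.seen.popitem(last=False)  # evict the least recently used entry
            return False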

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page. Thanks to the contributors who submitted features and bugfixes!

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. Web corpus construction involves numerous design decisions, and this software package can help facilitate text data collection and enhance corpus quality.

You can contact me via my contact page or GitHub.

Going further

Online documentation: trafilatura.readthedocs.io.

Tutorials: overview.

Trafilatura: Italian word for wire drawing.

Corresponding posts on Bits of Language (blog).

Comments
  • Celery error with v1.2.1: ValueError: signal only works in main thread

    With version 1.2.1 it is not possible to run trafilatura extraction inside an async task queue such as Celery. https://github.com/adbar/trafilatura/blob/1bb5fee6a4812e53b6597053c25efde995174d79/trafilatura/core.py#L982 It would be better to expose HAS_SIGNAL as a config variable rather than as a hardcoded value.

    celery_1      |     text = trafilatura.extract(
    celery_1      |   File "/usr/local/lib/python3.8/site-packages/trafilatura/core.py", line 982, in extract
    celery_1      |     signal(SIGALRM, timeout_handler)
    celery_1      |   File "/usr/local/lib/python3.8/signal.py", line 47, in signal
    celery_1      |     handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
    celery_1      | ValueError: signal only works in main thread
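
    A minimal reproduction of the underlying CPython restriction, independent of trafilatura and Celery:

    import threading
    from signal import SIGALRM, signal

    def timeout_handler(signum, frame):
        raise RuntimeError('timed out')

    def worker():
        # signal() may only be called from the main thread of the main
        # interpreter; in a worker thread it raises the ValueError above
        try:
            signal(SIGALRM, timeout_handler)
        except ValueError as err:
            print(err)  # "signal only works in main thread ..."

    threading.Thread(target=worker).start()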
    
    feedback 
    opened by alex-bender 16
  • No metadata extraction

    Hello,

    Thanks for your beautiful and powerful project. I'm testing some websites with trafilatura 0.6.0 on Python 3.8.

    My test:

    import trafilatura
    from trafilatura.core import bare_extraction
    
    downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
    
    result = bare_extraction(downloaded, include_formatting=False, with_metadata=True)
    
    print(result)
    

    The results: ({'title': None, 'author': None, 'url': None, 'hostname': None, 'description': None, 'sitename': None, 'date': None, 'categories': None, 'tags': None, 'fingerprint': None, 'id': None}, 'Leader spotlight: Erin Spiceland Every March we recognize the women who have shaped history—and now, we’re taking a look forward. From driving software development in large companies to maintaining thriving open source communities, we’re spending Women’s History Month with women leaders who are making history every day in the tech community. Erin Spiceland is a Software Engineer for SpaceX. Born and raised in rural south Georgia, she is a Choctaw and Chickasaw mother of two now living in downtown Los Angeles. Erin didn’t finish college—she’s a predominantly self-taught software engineer. In her spare time, she makes handmade Native American beadwork and regalia and attends powwows. How would you summarize your career (so far) in a single sentence? My career has been a winding road through periods of stimulation and health as well as periods of personal misery. During it all, I’ve learned a variety of programming languages and technologies while working on a diverse array of products and services. I’m a domestic abuse survivor and a Choctaw bisexual polyamorous woman. I’m so proud of myself that I made it this far considering where I came from. What was your first job in tech like? In 2007, I had a three-year-old daughter and I was trying to finish my computer science degree one class at a time, all while keeping my house and family running smoothly. I found the math classes exciting and quickly finished my math minor, leaving only computer science classes. I was looking at about five years before I would graduate. Then, my husband at the time recommended me for an entry software developer position at a telecom and digital communications company. When faced with the choice between an expensive computer science degree and getting paid to do what I loved, I dropped out of college and accepted the job. I was hired to work on internal tooling, and eventually, products. I did a lot of development on product front-ends, embedded network devices, and a distributed platform-as-a-service. I learned Java/JSP, Python, JavaScript/CSS, Node.js, as well as MySQL, PostgreSQL, and distributed systems architecture. It was an intense experience that required a lot of self-teaching, asking others for help, and daycare, but it set me up for my later successes. What does leadership mean to you in your current role? “Leadership is about enabling those below, above, and around you to be at their healthiest and most effective so that all of you can accurately understand your surroundings, make effective plans and goals for the future, and achieve those goals.” I appreciate and admire technical, effective leaders who care for their reports as humans, not as lines on a burndown chart, and forego heavy-handed direction in favor of communication and mutual dialogue. I think it’s as important for a leader to concern herself with her coworkers’ personal well-being as it is for her to direct their performance. What’s the biggest career risk you’ve ever taken? What did you learn from that experience? Last year I took a pay cut to move from a safe, easy job where I had security to work in a language I hadn’t seen in years and with systems more complicated than anything I’d worked with before. I moved from a place where I had a huge four bedroom house to a studio apartment that was twice the price. 
I moved away from my children, of who I share custody with my ex-husband. We fly across the U.S. to see each other now. I miss my children every day. However, I get to be a wonderful role model for them. “I get to show my children that a Native woman who grew up in poverty, lost her mother and her culture, and who didn’t finish college can learn, grow, and build whatever career and life she wants.” What are you looking forward to next? I can’t wait to wake up every day with my partner who loves me so much. I’m looking forward to showing my children exactly how far they can go. I’m excited to keep exploring Los Angeles. “I expect to learn so much more about software and about life, and I want to experience everything.” Want to know more about Erin Spiceland? Follow them on GitHub or Twitter. Want to learn more about featured leaders for Women’s History Month? Read about: Laura Frank Tacho, Director of Engineering at CloudBees Rachel White, Developer Experience Lead at American Express Kathy Pham, Computer Scientist and Product Leader at Mozilla and Harvard Heidy Khlaaf, Research Consultant at Adelard LLP Check back in soon—we’ll be adding new interviews weekly throughout March.', <Element body at 0x10680a280>, <Element body at 0x1067af080>)

    So, no metadata is returned.

    Also, I added an XPath in metaxpaths.py and rebuilt your code. I'm sure that //div[contains(@class, "post__categories")]//li//a matches a category in the URL https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/, but no category is returned.

    categories_xpaths = [
        """//div[starts-with(@class, 'post-info') or starts-with(@class, 'postinfo') or
        starts-with(@class, 'post-meta') or starts-with(@class, 'postmeta') or
        starts-with(@class, 'meta') or starts-with(@class, 'entry-meta') or starts-with(@class, 'entry-info') or
        starts-with(@class, 'entry-utility') or starts-with(@id, 'postpath')]//a""",
        "//p[starts-with(@class, 'postmeta') or starts-with(@class, 'entry-categories') or @class='postinfo' or @id='filedunder']//a",
        "//footer[starts-with(@class, 'entry-meta') or starts-with(@class, 'entry-footer') or starts-with(@class, 'post-info')]//a",
        '//*[(self::li or self::span)][@class="post-category" or starts-with(@class, "post__categories") or @class="postcategory" or @class="entry-category"]//a',
        '//header[@class="entry-header"]//a',
        '//div[@class="row" or @class="tags"]//a',
        '//div[contains(@class, "post__categories")]//li//a',
    ]
    

    Another question: could I get the content of the article including its HTML formatting (i.e., without cleaning tags from the content)?

    Please help me, thanks for your support!

    enhancement 
    opened by phongtnit 16
  • Issue with multiple authors and preference for meta information

    We shouldn't blindly trust the schema.org Person markup.

    agenda: Current: "author": "Sandy Cheu"; Should be: "author": "Stephen Teulan; Nikita Weikhardt"

    aged: Current: "author": "Consumers"; Should be: "author": "Liz Alderslade"

    meta (single names are removed): cath: Current: "author": null; Should be: "author": "Rebecca"

    echo: Current: "author": null; Should be: "author": "Katie"

    enhancement 
    opened by felipehertzer 15
  • Navigation bar filtering - some bug fixed

    The current repo should work well now. I removed several unused things and fixed a tiny bug that affected accuracy. I also added entries to the .gitignore so the branch should stay quite clean as well XD

    opened by immortal-autumn 13
  • No Formatting in Plain Text Output

    When using include_formatting for plain text, I'm not seeing any formatting (bold, italics, etc.). The terminal I'm using supports this. Is this by design or a bug? I tried both the standalone version and using it as a library with trafilatura.extract(downloaded, include_formatting=True).

    enhancement question 
    opened by peterjschroeder 13
  • Performance enhancement

    I. Test file

    test2.py
    from time import time
    
    import requests
    from trafilatura import extract
    
    
    if __name__ == '__main__':
        urls = ["https://en.wikipedia.org/wiki/List_of_Hindi_songs_recorded_by_Asha_Bhosle",
                "https://en.wikipedia.org/wiki/2022_in_video_games",
                "https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Kuwait",
                "https://en.wikipedia.org/wiki/Presidency_of_Rodrigo_Duterte",
                "https://en.wikipedia.org/wiki/List_of_2021%E2%80%9322_NBA_season_transactions",
                "https://en.wikipedia.org/wiki/2022_in_sports",
                "https://en.wikipedia.org/wiki/Firefox_version_history",
                "https://en.wikipedia.org/wiki/List_of_common_misconceptions",
                "https://en.wikipedia.org/wiki/Same-sex_union_legislation",
                "https://en.wikipedia.org/wiki/Presidency_of_Donald_Trump",]
    
        cum_time = 0
        for url in urls:        
            resp = requests.get(url)
            t0 = time()
            result = extract(resp.text)
            cum_time = cum_time + time() - t0
        print(cum_time)
    

    II. Test pprofile

    kernprof -lv test2.py
    

    before

    Total time: 0.544693 s
    File: /trafilatura-master/trafilatura/utils.py
    Function: remove_control_characters at line 221
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
       221                                           @profile
       222                                           def remove_control_characters(string):
       223                                               '''Prevent non-printable and XML invalid character errors'''
       224     25998     544693.0     21.0    100.0      return ''.join([c for c in string if c.isprintable() or c.isspace()])
    

    after

    Total time: 0.169241 s
    File: /trafilatura-master/trafilatura/utils.py
    Function: remove_control_characters at line 227
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
       227                                           @profile
       228                                           def remove_control_characters(string):
       229                                               '''Prevent non-printable and XML invalid character errors'''
       230     25998     169241.0      6.5    100.0      return ''.join(filter(is_printable_or_space, string))
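
    The helper referenced above is presumably just the comprehension's condition factored out into a named predicate, along these lines:

    def is_printable_or_space(c):
        # condition formerly inlined in the list comprehension
        return c.isprintable() or c.isspace()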
    

    III. Test vprof

    vprof -c -h test2.py
    

    before (screenshot attached)

    after (screenshot attached)

    feedback 
    opened by deedy5 10
  • Correction in the extraction of authors by tag and by json

    In this correction:

    • added 'submitted-by' and 'username' tags to the XPath expressions
    • increased the maximum allowed length of the author's name
    • added a regex to remove emoji from author names, often found on sites like BuzzFeed
    • added a regex to minify JSON before running the other regexes; author extraction was failing on pretty-printed JSON
    • added a regex to remove JSON items such as images and organizations before searching for the author
    • reorganized the extract_json function, as it was overwriting meta tags with None when no JSON was found

    qsr Before this fix: "author": null After this fix: "author": "Kevin Santos"

    perthnow Before this fix: "author": "NCA NewsWire" After this fix: "author": "Finn McHugh"

    buzzfeed Before this fix: "author": "Hameda Nafiz BuzzFeed Staff" After this fix: "author": "Hameda Nafiz"

    buzzfeed Before this fix: "author": "Olivia ❤️" After this fix: "author": "Olivia Community Contributor"

    build Before this fix: "author": null After this fix: "author": "Thoams Lane"

    hunterandbligh Before this fix: none After this fix: "author": "REBECCA MAGRO"

    abc - 'data-component' Before this fix: "author": null After this fix: "author": "Charlotte Gore"

    proactiveinvestors Before this fix: "author": null After this fix: "author": "Calum Muirhead"

    banking Before this fix: "author": "Sarah Harman Jul" After this fix: "author": "Sarah Harman"

    hcamag Before this fix: "author": "Sarah Harman Jul" After this fix: "author": "Mark Rosanes"

    spacedaily and + 9 sites Before this fix: "author": null After this fix: "author": "Lucie Aubourg"

    first Before this fix: "author": "Nick Griffin", After this fix: "author": "Stan Shamu",

    racing Before this fix: "author": "Ben Sporle - @bensporle; Ben Sporle", After this fix: "author": "Ben Sporle",

    ajn Before this fix: "author": "RABBI GARY ROBUCK July", After this fix: "author": "RABBI GARY ROBUCK",

    ESPN It is not totally fixed, but it is better. Before this fix: "author": "Andrew Mcglashandeputy Editor, Espncricinfo", After this fix: "author": "Andrew McGlashan Deputy editor; ESPNcricinfo",

    Probono It is not totally fixed, but it is better. Before this fix: "author": null, After this fix: "author": "Luke Michael; Journalist; @Luke_Michael",

    opened by felipehertzer 10
  • Library is redirecting stderr to /dev/null upon every call

    If the readability fallback is activated, the trafilatura library redirects stderr to /dev/null upon every call: https://github.com/adbar/trafilatura/blob/a56fb3e041175df38a32b1c5ef2e9c7888eeb7a6/trafilatura/external.py#L63

    Within programs involving other libraries, this causes a host of side effects. E.g., generating a chart with seaborn imports IPython (a dependency of seaborn), which checks stdin, stdout and stderr upon initialization and crashes because stderr is /dev/null. I have seen other side effects in other libraries as well, including disappearing logs (e.g. when log settings are modified after calls to trafilatura).

    This redirection seems to have been necessary to prevent the readability library from printing messages to stderr. A cursory reading of the current version of readability suggests it no longer does that; it only emits proper logs.

    Consequently, this redirect may be removed (to be tested).
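
    A scoped alternative, as a sketch (noisy_call is a hypothetical placeholder): silence stderr only around the fallback call instead of replacing sys.stderr globally.

    import io
    from contextlib import redirect_stderr

    def quiet(noisy_call, *args, **kwargs):
        # stderr is swapped out only for the duration of this call
        with redirect_stderr(io.StringIO()):
            return noisy_call(*args, **kwargs)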

    opened by dmoklaf 10
  • In parallel trafilatura is marginally slower than goose

    I'm not quite sure where to begin with this; it's a strange one. In a real-world scenario I tried switching from Goose3 to trafilatura, processing HTML extractions in parallel with dask. After switching to trafilatura, I noticed a 30% slowdown. I ended up writing my own evaluation library to verify the results.

    Results from running in parallel:

    Library       Accuracy   Precision   Recall   FScore   Mean Similarity   Items/sec
    goose3        0.9678     0.8561      0.9547   0.9027   0.8343            383.4737
    trafilatura   0.9124     0.9485      0.908    0.9278   0.8567            361.3232

    Results from running sequentially:

    Library       Accuracy   Precision   Recall   FScore   Mean Similarity   Items/sec
    goose3        0.9678     0.8561      0.9547   0.9027   0.8343            9.7953
    trafilatura   0.9124     0.9485      0.908    0.9278   0.8567            23.0045

    Note: the dataset evaluated is from the scrapinghub/article-extraction-benchmark tool. The only portion of the code that runs in parallel for the benchmarks is the extraction, and only the extraction is timed for calculating items/sec.

    In summary: trafilatura is marginally slower than Goose3 in parallel. However sequentially it is twice as fast as Goose3.

    I'm not sure where to begin with this. It can be difficult to profile parallel processing. It may be related to some of the memory leak issues reported with trafilatura, although it appears those have been resolved. Or the caching; I haven't looked into how that functions.

    I will work on publishing my benchmarking tool this afternoon.

    question 
    opened by getorca 9
  • Handle pages where article is split into multiple sibling nodes

    This fixes #85 (and #159).

    It involved a bit of a refactor of the extract_content function, but the basic idea is that it looks through all of the children in the subtree returned from tree.xpath(expr), instead of stopping at the first child as before. Beyond that, it pulls out the logic that checks whether the BODY_XPATH expression matched in the current loop iteration has found a useful subtree, to make it a little more readable, and only performs the final cleanup and look-elsewhere logic at the very end.

    So essentially, on finding a subtree whose first node is valid, we now proceed to consider all of the remaining nodes in that subtree.
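
    A schematic illustration of the idea (not the actual diff):

    from lxml import html

    # gather text from every child of the matched subtree
    # instead of stopping at the first one
    doc = html.fromstring('<div id="post"><p>part one</p><p>part two</p><p>part three</p></div>')
    subtree = doc.xpath('//div[@id="post"]')[0]
    print([child.text_content() for child in subtree])
    # ['part one', 'part two', 'part three']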

    This seems to work great, although I haven't run it through the automated tests. (I had trouble running the url tests.)

    Let me know what you think. Happy to talk through anything, and if/when this seems good to you, I'll clean it up (print statements, code style, etc.).

    Thanks!

    opened by naftalibeder 9
  • Broken parsing of images

    I'm not quite sure what's wrong with images, but here is a reproducer:

    $ curl https://en.wikipedia.org/wiki/Tribe > /tmp/tribe.html
    $ python
    Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
    [GCC 7.3.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import trafilatura
    >>> html_wiki_tribe = open('/tmp/tribe.html').read()
    >>> text = trafilatura.extract(
    ...     html_wiki_tribe,
    ...     include_images=True
    ... )
    ~/anaconda3/lib/python3.7/site-packages/trafilatura/xml.py in xmltotxt(xmloutput, include_formatting, include_links)
        272             LOGGER.debug('unexpected element: %s', element.tag)
        273             returnlist.extend([textelement, ' '])
    --> 274     return sanitize(''.join(returnlist))
        275 
        276 
    
    TypeError: sequence item 6: expected str instance, NoneType found
    
    

    UPD: looks like this could help (screenshot attached).

    bug 
    opened by alex-bender 9
  • Improve title extraction by removing sitename suffix

    Most sites add a suffix like:

    • My article title | My Site Name
    • My article title - My Site Name

    There is no need for the site name within the article title.

    Common separators are: - | – — • · ‹ › ⁄ « » < > : * ⋆ ~

    Some sites use HTML entities for the separator, like &#8212;.
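
    A hypothetical sketch of the proposed cleanup (strip_sitename_suffix and SEPARATORS are illustrative names, not trafilatura's API):

    import html
    import re

    SEPARATORS = r"-|–|—|\||•|·|‹|›|⁄|«|»|<|>|:|\*|⋆|~"

    def strip_sitename_suffix(title, sitename):
        title = html.unescape(title)  # resolve entities such as &#8212;
        pattern = r"\s*(?:%s)\s*%s\s*$" % (SEPARATORS, re.escape(sitename))
        return re.sub(pattern, "", title)

    print(strip_sitename_suffix("My article title | My Site Name", "My Site Name"))
    # -> "My article title"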

    enhancement 
    opened by andremacola 5
  • Remove unwanted html elements with regex or xpaths

    Possibility to remove unnecessary html elements before starting the extraction process.

    There are often some elements within the extracted text that are not article content.

    Titles should not come inside the extracted text by default, or there should be an option to remove them (maybe this requires another issue).

    Something like:

    unwanted = [
      'iframe',
      'button',
      'figcaption',
      'caption',
      'form',
      'aside',
      'script',
      'style',
      'ins',
      'link',
      'header',
      'footer',
      '#comments',
      'nav',
      '.post-comments',
      '.post-tags',
      '.wp-block-embed',
      '.wp-caption-text',
      'svg',
      '[class^=ads]',
      '[class*=ads-]',
      '[style="display:none"]',
      '[style*="display:none"]',
      '[style*="display: none"]',
      '[itemprop*="description"]',
      '.push-web-notification',
      '.mc-column.entities',
      '.newsletter-component',
      '.post-subject',
      '.post-info',
      '.addthis_tool',
      '.pt-cv-wrapper'
    ]
    
    article = trafilatura.bare_extraction(document,
            unwanted_elements=unwanted,
            include_comments=False, include_tables=False,
            favor_precision=True, favor_recall=True,
            no_fallback=True, target_language=None,
            date_extraction_params={'extensive_search': True, 'original_date': True, 'outputformat': "%Y-%m-%dT%H:%M:%S%z"},
            config=config)
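
    A hypothetical pre-processing sketch in the meantime: drop unwanted nodes with lxml before handing the document to trafilatura (Element.cssselect() requires the cssselect package to be installed).

    from lxml import html

    def strip_unwanted(html_string, selectors):
        tree = html.fromstring(html_string)
        for selector in selectors:
            for node in tree.cssselect(selector):
                node.drop_tree()  # remove the element and its subtree
        return html.tostring(tree, encoding='unicode')

    cleaned = strip_unwanted(document, unwanted)  # document: the raw HTML string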
    
    question 
    opened by andremacola 4
  • feat: Add image urls to metadata

    Sometimes an image is not included in the text body, but we can extract it from SEO meta tags.

    Issue: https://github.com/adbar/trafilatura/issues/281

    Unfortunately I didn't have time to create the tests

    opened by andremacola 2
  • Add image urls to metadata

    Sometimes an image is not included in the text body, but we can extract it from SEO meta tags, like some article parsers do (https://github.com/extractus/article-extractor/blob/main/src/utils/extractMetaData.js).

    Here are some meta tags:

    'image'
    'og:image'
    'og:image:url'
    'og:image:secure_url'
    'twitter:image'
    'twitter:image:src'
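
    A hypothetical fallback sketch reading these tags with lxml (image_from_meta is an illustrative name, not trafilatura's API):

    from lxml import html

    IMAGE_METAS = ('og:image', 'og:image:url', 'og:image:secure_url',
                   'twitter:image', 'twitter:image:src', 'image')

    def image_from_meta(html_string):
        tree = html.fromstring(html_string)
        for meta_name in IMAGE_METAS:
            # og:* tags usually use @property, twitter:* tags often use @name
            for attr in ('property', 'name'):
                content = tree.xpath('//meta[@%s="%s"]/@content' % (attr, meta_name))
                if content:
                    return content[0]
        return None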
    
    enhancement 
    opened by andremacola 1
  • Extraction of Youtube iframes and img elements with links

    Not able to fetch image tags. Not able to fetch iframe tags. From the command prompt on a Windows machine:

    trafilatura --sitemap "https://www.lyricspulp.com/" --list > linklist.txt
    trafilatura --sitemap homepage --list > linklist.txt
    trafilatura -i linklist.txt --xml -o outputfile.txt
    trafilatura -i linklist.txt --formatting --links --images --no-comments --xml -o outputfile.txt

    enhancement 
    opened by sampathmende 3
Releases (v1.4.0)
  • v1.4.0(Oct 18, 2022)

    Impact on extraction and output format:

    • better extraction (#233, #243 & #250 with @knit-bee, #246 with @mrienstra, #258)
    • XML: preserve list type as attribute (#229)
    • XML TEI: better conformity with @knit-bee (#238, #242, #253, #254)
    • faster text cleaning and shorter code (#237 with @deedy5, #245)
    • metadata: add language when detector is activated (#224)
    • metadata: extend fallbacks and test coverage for json_metadata functions by @felipehertzer (#235)
    • TXT: change markdown formatting of headers by @LaundroMat (#257)

    Smaller changes in convenience functions:

    • add function to clear caches (#219)
    • CLI: change exit code if download fails (#223)
    • settings: use "\n" for multiple user agents by @k-sareen (#241)

    Updates:

    • docs updated (and #244 by @dsgibbons)
    • package dependencies updated

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.3.0...v1.4.0

    Source code(tar.gz)
    Source code(zip)
  • v1.3.0(Jul 29, 2022)

    • fast and robust html2txt() function added (#221)
    • more robust parsing (#228)
    • fixed bugs in metadata extraction, with @felipehertzer in #213 & #226
    • extraction about 10-20% faster, slightly better recall
    • partial fixes for memory leaks (#216)
    • docs extended and updated (#217, #225)
    • prepared deprecation of old process_record() function
    • more stable processing with updated dependencies

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.2...v1.3.0

    Source code(tar.gz)
    Source code(zip)
  • v1.2.2(May 18, 2022)

    • more efficient rules for extraction
    • metadata: further attributes used (with @felipehertzer)
    • better baseline extraction
    • issues fixed: #202, #204, #205
    • evaluation updated

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.1...v1.2.2

    Source code(tar.gz)
    Source code(zip)
  • v1.2.1(May 2, 2022)

    What's Changed

    • --precision and --recall arguments added to the CLI
    • better text cleaning: paywalls and comments
    • improvements for Chinese websites (with @glacierck & @immortal-autumn): #186, #187, #188
    • further bugs fixed: #189, #192 (with @felipehertzer), #200
    • efficiency: faster module loading and improved RAM footprint

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.2.0...v1.2.1

    Source code(tar.gz)
    Source code(zip)
  • v1.2.0(Mar 7, 2022)

    • efficiency: replaced module readability-lxml by trimmed fork
    • bugs fixed: (#179, #180, #183, #184)
    • improved baseline extraction
    • cleaner metadata (with @felipehertzer)

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.1.0...v1.2.0

    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Feb 21, 2022)

    • encodings: better detection, output NFC-normalized Unicode
    • maintenance and performance: more efficient code
    • bugs fixed (#119, #136, #147, #160, #161, #162, #164, #167 and others)
    • prepare compatibility with upcoming Python 3.11
    • changed default settings
    • extended documentation

    Full Changelog: https://github.com/adbar/trafilatura/compare/v1.0.0...v1.1.0

    Source code(tar.gz)
    Source code(zip)
  • v1.0.0(Nov 30, 2021)

    • compress HTML backup files & seamlessly open .gz files
    • support JSON web feeds
    • graphical user interface integrated into main package
    • faster downloads: reviewed backoff, compressed data
    • optional modules: downloads with pycurl, language identification with py3langid
    • bugs fixed (#111, #125, #132, #136, #140)
    • minor optimizations and fixes by @vbarbaresi in #124 & #130
    • fixed arrays with single or multiple entries in the JSON extractor by @felipehertzer in #143
    • code base refactored with @sourcery-ai #121, improved and optimized for Python 3.6+
    • drop support for Python 3.5

    Full Changelog: https://github.com/adbar/trafilatura/compare/v0.9.3...v1.0.0

    Source code(tar.gz)
    Source code(zip)
  • v0.9.3(Oct 21, 2021)

    • better, faster encoding detection: replaced chardet with charset_normalizer
    • faster execution: updated justext to 3.0
    • better extraction of sub-elements in tables (#78, #90)
    • more robust web feed parsing
    • further defined precision- and recall-oriented settings
    • license extraction in footers (#118)

    Full Changelog: https://github.com/adbar/trafilatura/compare/v0.9.2...v0.9.3

    Source code(tar.gz)
    Source code(zip)
  • v0.9.2(Oct 6, 2021)

    • first precision- and recall-oriented presets defined
    • improvements in authorship extraction (thanks @felipehertzer)
    • requesting TXT output with formatting now results in Markdown format
    • bugs fixed: notably extraction robustness and consistency (#109, #111, #113)
    • setting for cookies in request headers (thanks @muellermartin)
    • better date extraction thanks to htmldate update
    Source code(tar.gz)
    Source code(zip)
  • v0.9.1(Aug 2, 2021)

    • improved author extraction (thanks @felipehertzer!)
    • bugs fixed: HTML element handling, HTML meta attributes, spider, CLI, ...
    • docs updated and extended
    • CLI: option names normalized (heed deprecation warnings), new option explore
    Source code(tar.gz)
    Source code(zip)
  • v0.9.0(Jun 15, 2021)

    • focused crawling functions including politeness rules
    • more efficient multi-threaded downloads + use as Python functions
    • documentation extended
    • bugs fixed: extraction and URL handling
    • removed support for Python 3.4
    Source code(tar.gz)
    Source code(zip)
  • v0.8.2(Apr 21, 2021)

    • better handling of formatting, links and images, title type as attribute in XML formats
    • more robust sitemaps and feeds processing
    • more accurate extraction
    • further consolidation: code simplified and bugs fixed
    Source code(tar.gz)
    Source code(zip)
  • v0.8.1(Mar 11, 2021)

  • v0.8.0(Feb 19, 2021)

    • improved link discovery and handling
    • fixes in metadata extraction, feeds and sitemaps processing
    • breaking change: the extract function now reads target format from output_format argument only
    • new extraction option: preserve links, CLI options re-ordered
    • more opportunistic backup extraction
    Source code(tar.gz)
    Source code(zip)
  • v0.7.0(Jan 4, 2021)

    • customizable configuration file to parametrize extraction and downloads
    • better handling of feeds and sitemaps
    • additional CLI options: cryptographic hash for file name, use Internet Archive as backup
    • more precise extraction
    • faster downloads: requests replaced with bare urllib3 and custom decoding
    • consolidation: bug fixes and improvements, many thanks to the issues reporters!
    Source code(tar.gz)
    Source code(zip)
  • v0.6.1(Dec 2, 2020)

    • added bare_extraction function returning Python variables
    • improved link discovery in feeds and sitemaps
    • option to preserve image info
    • fixes (many thanks to bug reporters!)
    Source code(tar.gz)
    Source code(zip)
  • v0.6.0(Nov 6, 2020)

  • v0.5.2(Sep 22, 2020)

    • optional language detector changed: langid → pycld3
    • helper function bare_extraction()
    • optional deduplication off by default
    • better URL handling (courlan), more complete metadata
    • code consolidation (cleaner and shorter)
    Source code(tar.gz)
    Source code(zip)
  • v0.5.1(Jul 15, 2020)

  • v0.5.0(Jun 2, 2020)

    • faster and more robust text and metadata extraction
    • more efficient batch processing (parallel processing, URL queues)
    • support for ATOM/RSS feeds
    • complete command-line tool with corresponding options
    Source code(tar.gz)
    Source code(zip)
  • v0.4.1(Apr 24, 2020)

  • v0.1.0(Sep 25, 2019)
