MechanicalSoup

A Python library for automating interaction with websites.

Home page

https://mechanicalsoup.readthedocs.io/

Overview

A Python library for automating interaction with websites. MechanicalSoup automatically stores and sends cookies, follows redirects, and can follow links and submit forms. It doesn't do JavaScript.

MechanicalSoup was created by M Hickford, who was a fond user of the Mechanize library. Unfortunately, Mechanize was incompatible with Python 3 until 2019 and its development stalled for several years. MechanicalSoup provides a similar API, built on Python giants Requests (for HTTP sessions) and BeautifulSoup (for document navigation). Since 2017, it has been actively maintained by a small team including @hemberger and @moy.

Installation

PyPy3 is also supported (and tested against).

Download and install the latest released version from PyPI:

pip install MechanicalSoup

Download and install the development version from GitHub:

pip install git+https://github.com/MechanicalSoup/MechanicalSoup

Installing from source (installs the version in the current working directory):

python setup.py install

(In all cases, add --user to the install command to install in the current user's home directory.)

Documentation

The full documentation is available at https://mechanicalsoup.readthedocs.io/. You may want to jump directly to the automatically generated API documentation.

Example

From examples/expl_qwant.py, code to get the results from a Qwant search:

"""Example usage of MechanicalSoup to get the results from the Qwant
search engine.
"""

import re
import urllib.parse

import mechanicalsoup

# Connect to Qwant
browser = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup')
browser.open("https://lite.qwant.com/")

# Fill in the search form
browser.select_form('#search-form')
browser["q"] = "MechanicalSoup"
browser.submit_selected()

# Display the results
for link in browser.page.select('.result a'):
    # Qwant shows redirection links, not the actual URL, so extract
    # the actual URL from the redirect link:
    href = link.attrs['href']
    m = re.match(r"^/redirect/[^/]*/(.*)$", href)
    if m:
        href = urllib.parse.unquote(m.group(1))
    print(link.text, '->', href)

More examples are available in examples/.

For an example with a more complex form (checkboxes, radio buttons and textareas), read tests/test_browser.py and tests/test_form.py.
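
For a quick flavor, here is a minimal sketch of driving such a form with StatefulBrowser. It is not taken from those tests; the URL, the form selector, and the field names (name, comment, size, topping) are hypothetical:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup')
browser.open("https://example.com/order")     # hypothetical page
browser.select_form('form[action="/order"]')  # hypothetical form selector

# Text inputs and textareas are set by assigning to the field name
browser["name"] = "Jane Doe"
browser["comment"] = "No onions, please."

# Radio buttons: assign the value of the option to select
browser["size"] = "medium"

# Checkboxes: assign a tuple/list to check several boxes at once
browser["topping"] = ("bacon", "cheese")

response = browser.submit_selected()
print(response.status_code)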

Development

Instructions for building, testing and contributing to MechanicalSoup: see CONTRIBUTING.rst.

Common problems

Read the FAQ.

Comments
  • Submit an empty file when leaving a file input blank

    This is in regards to issue #250

    For the tests, I followed @moy's train of thought:

    • they are basically a copy+paste without the creation of a temp file
    • assert value["doc"] == "" checks that the response contains an empty file

    I thought a different test definition was necessary; was I right to assume so?

    In browser.py, I changed the continue around line 179 to something similar to what has been done in test__request_file here.

    There are two "Add no file input submit test" commits: the second one simply cleans up some commented code. I will avoid that next time!

    I was unable to run test_browser.py due to some weird import module error on modules that are installed, so I'm kind of pull requesting blindly. Does it matter that I say I'm confident in the changes, though?

    opened by senabIsShort 27
  • MechanicalSoup logo

    In the Roadmap, some artwork is requested. I asked an artistic friend to try to interpret this request, and this is what they came up with. I would love to use this as our logo (in both the README, as per the roadmap, and perhaps also as our organization icon). Before I make a PR, I just wanted to see if this was what you were going for.

    Drawing
    opened by hemberger 20
  • Tests randomly hanging on Travis-CI

    Every couple of Travis builds, I see one of the sub-builds hang. It happens frequently enough that I feel like I have to babysit Travis, which is not a good situation to be in. From what I can tell, this occurs under two conditions:

    1. httpbin.org is under heavy load (this occurs infrequently, but can occur for extended periods of time)
    2. flake8 hangs for some unknown reason (seems arbitrary, and rerunning almost always fixes it)

    I really want to understand 2), because for 1) we could simply rely on httpbin.org a bit less if necessary.

    opened by hemberger 18
  • Remove `name` attribute from all unused buttons on form submit

    I ran into a site with forms including buttons of type "button" with name attributes. Because Form.choose_submit() was only removing name from buttons of type "submit", the values for the "button" buttons were being erroneously sent on POST, thereby breaking my submission. This patch fixes the issue, even when a submit button isn't explicitly chosen.

    Note that all buttons that aren't of type "button" or "reset" function as "submit" in all major browsers and should therefore be choosable.

    opened by blackwind 16
  • Do not submit disabled <input> elements

    https://www.w3.org/TR/html52/sec-forms.html#element-attrdef-disabledformelements-disabled

    The disabled attribute is used to make the control non-interactive and to prevent its value from being submitted.

    MechanicalSoup ignores the disabled attribute, which should be fixed.

    Some additional notes: (from https://www.wufoo.com/html5/disabled-attribute/)

    • If the disabled attribute is set on a <fieldset>, the descendent form controls are disabled.
    • A disabled field can’t be modified, tabbed to, highlighted, or have its contents copied. Its value is also ignored when the form goes through constraint validation.
    • The disabled value is Boolean, and therefore doesn’t need a value. But, if you must, you can include disabled="disabled".
    • Setting the value of the disabled attribute to null does not remove the effects of the attribute. Instead use removeAttribute('disabled').
    • You can target elements that are disabled with the :disabled pseudo-class. Or, if you want to specifically target the presence of the attribute, you can use input[disabled]. Similarly, you can use :enabled and input:not([disabled]) to target elements that are not disabled.
    • You do not need to include aria-disabled="true" when including the disabled attribute because disabled is already well supported. However, if you are programmatically disabling an element that is not a form control and therefore the disabled attribute does not apply, include aria-disabled="true".
    • The disabled attribute is valid for all form controls including all <input> types, <textarea>, <button>, <select>, <fieldset>, and <keygen>.
    opened by 5j9 14
  • browser.follow_link() has no way to pass kwargs to requests

    As noted elsewhere, I've recently been debugging behind an SSL proxy, which requires telling requests to not verify SSL certificates. Generally I've done that with

        kwargs = { "verify": False }
        # ...
        r = br.submit_selected(**kwargs)
    

    which is fine. But it's not so fine when I need to follow a link, because browser.follow_link() uses its **kwargs for BS4's tag finding, but not for actually following the link.

    So instead of

        r = br.follow_link(text='Link anchor', **kwargs)
    

    I end up with

        link = br.find_link(text='Link anchor')
        r = br.open_relative(link['href'], **kwargs)
    

    I am not sure how to fix this. Some thoughts:

    1. If nothing changes, add some more clarity to browser.follow_link()'s documentation explaining how to work around this situation.
    2. Add kwargs-ish params to browser.follow_link(), one for BS4 and one for Requests. Of course, only one gets to be **kwargs, but at least one might be able to call browser.follow_link(text='Link anchor', requests_args=kwargs) or something.
    3. Send the same **kwargs parameter to both

    Maybe there's a better way. I guess in my case I could set this state in requests' Session object, ~which I think would be browser.session.merge_environment_settings(...)~ no, that's not right, I'm not sure how to accomplish it actually.

    opened by johnhawkinson 13
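
    A hedged sketch of how this looks with the requests_kwargs argument mentioned in the v1.1.0 release notes below (assuming MechanicalSoup >= 1.1.0; the link anchor text is the same hypothetical one used above):

        import mechanicalsoup

        browser = mechanicalsoup.StatefulBrowser()
        browser.open("https://example.com/")  # hypothetical starting page

        # Keyword arguments for requests go in requests_kwargs, so the
        # BS4 matching arguments and the HTTP options no longer collide.
        r = browser.follow_link(text='Link anchor',
                                requests_kwargs={"verify": False})
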
  • Replace httpbin.org with pytest-httpbin in tests

    The pytest-httpbin module provides pytest support for the httpbin module (which is the code that runs the remote server http://httpbin.org). This locally spins up an internal webserver when tests are run.

    With this change, MechanicalSoup tests can be run without an internet connection. As a result, the tests run much faster.

    You may need the python{,3}-dev package on your system to pip install the pytest-httpbin module.

    deferred 
    opened by hemberger 13
  • No parser was explicitly specified

    /usr/local/lib/python3.4/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

    To get rid of this warning, change this:

    BeautifulSoup([your markup])

    to this:

    BeautifulSoup([your markup], "lxml")

    markup_type=markup_type))

    Do I need to use the add_soup method, or what?

    opened by stdex 12
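
    A minimal sketch of one way to avoid the warning: choose the parser explicitly when constructing the browser, via the soup_config argument that is forwarded to BeautifulSoup (the URL below is hypothetical):

        import mechanicalsoup

        # Explicitly pick the HTML parser so BeautifulSoup no longer warns
        # about choosing one implicitly.
        browser = mechanicalsoup.StatefulBrowser(
            soup_config={'features': 'lxml'}  # or 'html.parser'
        )
        browser.open("https://example.com/")  # hypothetical page
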
  • Add get_request_kwargs to check before requesting

    When we use mechanicalsoup, we sometimes want to verify a request before submitting it.

    If you merge this pull request, the package will provide a way for users to review the request before it is sent.

    This is my first pull request for this project. Please let me know if I'm missing anything.

    opened by kumarstack55 11
  • Set up LGTM and fix warnings

    LGTM.com finds one issue in our code, and it seems legitimate to me (although I'm guilty of introducing it):

    https://lgtm.com/projects/g/hickford/MechanicalSoup/

    We should fix this, and configure lgtm so that it checks pull-requests.

    opened by moy 11
  • Problems calling "follow_link" with "url_regex"

    Dear Dan and Matthieu,

    first things first: Thanks for conceiving and maintaining this great library. We also switched one of our implementations over from "mechanize" as per https://github.com/ip-tools/ip-navigator/commit/a26c3a8a and it worked really well.

    When doing so, we encountered a minor problem when trying to call the follow_link method with the url_regex keyword argument like

    response = self.browser.follow_link(url_regex='register/PAT_.*VIEW=pdf', headers={'Referer': result.url})
    

    This raises the exception

    TypeError: links() got multiple values for keyword argument 'url_regex'
    

    I am currently a bit short on time, otherwise I would have submitted a pull request without further ado. Thanks a bunch for looking into this issue.

    With kind regards, Andreas.

    opened by amotl 11
  • browser.links() should return an empty list if self.page is None

    I was writing a fuzzer for a cybersecurity assignment, and it crashed when it tried to find the links on a PDF file. I think it would make more sense to report that there are no links if the page fails to parse. This seems relatively straightforward to implement.

    opened by npetrangelo 1
  • Typing annotations and typechecking with mypy or pyright?

    We already have basic static analysis with flake8 (and the underlying pyflakes), but using typing annotations and a static typechecker may 1) find more bugs, 2) help our users by providing completion and other smart features in their IDE.

    mypy is the historical typechecker, pyright is a more recent one which in my (very limited) experience works better (it's also the tool behind the new Python mode of VSCode). So I'd suggest pyright if we don't have arguments to choose mypy.

    For now, neither tool can typecheck the project without error, so a first step would be to add the necessary annotations to get an error-free pyright check.

    easy? 
    opened by moy 3
  • Can you build it without lxml?

    MechanicalSoup is a really nice package that I have used, but it still requires a C compiler to build lxml on *nix systems.

    This may be a problem when porting to platforms without a C compiler, such as Android or some minimal Linux distributions.

    Currently I use a script to build MechanicalSoup without lxml:

    #!/bin/sh
    
    # Remove lxml in requirements.txt
    sed -i '/lxml/d' requirements.txt
    
    # Use `html.parser` instead of `lxml`
    sed -i "s@{'features': 'lxml'}@{'features': 'html.parser'}@g" mechanicalsoup/*.py
    
    # Fix examples and tests
    sed -i "s@\\(BeautifulSoup(.\\{1,\\}\\)'lxml'\\(.*)\\)@\1'html.parser'\2@g" examples/*.py tests/*.py
    

    It works well, so I think it is not a big problem...

    opened by urain39 2
  • Selecting a form that only has a class attribute

    I'm trying to get a form, but it only has a class attribute, and I keep getting a "LinkNotFoundError". I've inspected the page and I know that I have the correct class name, but it doesn't work at all, and I don't see any real reference to this type of issue in the docs. I could try to get the form with BS4, but then there wouldn't be a way to select it.

    Could I get the form with BS4, add an id attribute to it, and then select it by that id?

    I'd really appreciate any help, thank you!

    question 
    opened by SilverStrings024 6
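
    A minimal sketch, since select_form accepts a CSS selector (see the v0.10.0 notes below), of selecting a form by class alone; the URL, class name, and field names are hypothetical:

        import mechanicalsoup

        browser = mechanicalsoup.StatefulBrowser()
        browser.open("https://example.com/login")  # hypothetical page

        # A CSS class selector works even when the form has no id or name
        browser.select_form('form.login-form')     # hypothetical class
        browser["username"] = "me"                 # hypothetical fields
        browser["password"] = "secret"
        browser.submit_selected()
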
  • add_soup(): Don't match Content-type with `in`

    Don't use Python's in operator to match Content-Types, since that is a simple substring match.

    It's obviously not correct since a Content-Type string can be relatively complicated, like

    Content-Type: application/xhtml+xml; charset="utf-8"; boundary="This is not text/html"

    Although that's rather contrived, the prior test "text/html" in response.headers.get("Content-Type", "") would return True here, incorrectly.

    Also, the existence of subtypes with +'s means that using the prior test for "application/xhtml" would match the above example when it probably shouldn't.

    Instead, leverage requests's code, which comes from the Python Standard Library's cgi.py.

    Clarify that we don't implement MIME sniffing, nor X-Content-Type-Options: nosniff; instead we do our own thing.


    I was looking at this code because of #373.

    I've marked this as a draft, because I'm not quite sure this is the way to go, both because of the long discursive comment and because of the use of a _-prefixed function from requests (versus cgi.py's parse_header()).

    Also, I'm kind of perplexed what's going on here:

                http_encoding = (
                    response.encoding
                    if 'charset' in parameters
                    else None
                )
    

    Like…why does the presence of charset=utf-8 in the Content-Type header mean that we should trust requests's encoding field? Oh, I see, it's because sometimes requests does some sniffing-ish-stuff and sometimes it doesn't (in which case it parses the Content-Type) and we need to know which, and we're backing out a conclusion about its heuristics? Probably seems like maybe we should parse it ourselves if so. idk.

    Maybe we should be doing more formal mime sniffing. And maybe we should be honoring X-Content-Type-Options: nosniff. And… … …

    I'm also not sure what kind of test coverage is really appropriate here, if anything additional. Seems like the answer shouldn't be "zero," so…

    opened by johnhawkinson 2
Releases(v1.2.0)
  • v1.2.0(Sep 17, 2022)

    Main changes

    • Added support for Python 3.10.

    • Added support for HTML form-associated elements (i.e. input elements that are associated with a form by a form attribute, but are not a child element of the form). [#380]

    Bug fixes

    • When uploading a file, only the filename is now submitted to the server. Previously, the full file path was being submitted, which exposed more local information than users may have been expecting. [#375]
    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(May 29, 2021)

    Main changes

    • Dropped support for EOL Python versions: 2.7 and 3.5.

    • Increased minimum version requirement for requests from 2.0 to 2.22.0 and beautifulsoup4 from 4.4 to 4.7.

    • Use encoding from the HTTP request when no HTML encoding is specified. [#355]

    • Added the put method to the Browser class. This is a light wrapper around requests.Session.put. [#359]

    • Don't override Referer headers passed in by the user. [#364]

    • StatefulBrowser methods follow_link and download_link now support passing a dictionary of keyword arguments to requests, via requests_kwargs. For symmetry, they also support passing Beautiful Soup args in as bs4_kwargs, although any excess **kwargs are sent to Beautiful Soup as well, just as they were previously. [#368]

    Many thanks to the contributors who made this release possible!

    Source code(tar.gz)
    Source code(zip)
  • v1.0.0(Jan 5, 2021)

    This is the last release that will support Python 2.7. Thanks to the many contributors that made this release possible!

    Main changes:

    • Added support for Python 3.8 and 3.9.

    • StatefulBrowser has new properties page, form, and url, which can be used in place of the methods get_current_page, get_current_form and get_url respectively (e.g. the new x.page is equivalent to x.get_current_page()). These methods may be deprecated in a future release. [#175]

    • StatefulBrowser.form will raise an AttributeError instead of returning None if no form has been selected yet. Note that StatefulBrowser.get_current_form() still returns None for backward compatibility.

    Bug fixes

    • Decompose <select> elements with the same name when adding a new input element to a form. [#297]

    • The params and data kwargs passed to submit will now properly be forwarded to the underlying request for GET methods (whereas previously params was being overwritten by data). [#343]

    Source code(tar.gz)
    Source code(zip)
  • v0.12.0(Aug 27, 2019)

    Main changes:

    • Changes in official python version support: added 3.7 and dropped 3.4.

    • Added ability to submit a form without updating StatefulBrowser internal state: submit_selected(..., update_state=False). This means you get a response from the form submission, but your browser stays on the same page. Useful for handling forms that result in a file download or open a new tab.

    Bug fixes

    • Improve handling of form enctype to behave like a real browser. [#242]

    • HTML type attributes are no longer required to be lowercase. [#245]

    • Form controls with the disabled attribute will no longer be submitted to improve compliance with the HTML standard. If you were relying on this bug to submit disabled elements, you can still achieve this by deleting the disabled attribute from the element in the Form object directly. [#248]

    • When a form containing a file input field is submitted without choosing a file, an empty filename & content will be sent just like in a real browser. [#250]

    • <option> tags without a value attribute will now use their text as the value. [#252]

    • The optional url_regex argument to follow_link and download_link was fixed so that it is no longer ignored. [#256]

    • Allow duplicate submit elements instead of raising a LinkNotFoundError. [#264]

    Our thanks to the many new contributors in this release!

    Source code(tar.gz)
    Source code(zip)
  • v0.11.0(Sep 11, 2018)

    This release focuses on fixing bugs related to uncommon HTTP/HTML scenarios and on improving the documentation.

    Bug fixes

    • Constructing a Form instance from a bs4.element.Tag whose tag name is not form will now emit a warning, and may be deprecated in the future. [#228]

    • Breaking Change: LinkNotFoundError now derives from Exception instead of BaseException. While this will bring the behavior in line with most people's expectations, it may affect the behavior of your code if you were heavily relying on this implementation detail in your exception handling. [#203]

    • Improve handling of button submit elements. Will now correctly ignore buttons of type button and reset during form submission, since they are not considered to be submit elements. [#199]

    • Do a better job of inferring the content type of a response if the Content-Type header is not provided. [#195]

    • Improve consistency of query string construction between MechanicalSoup and web browsers in edge cases where form elements have duplicate name attributes. This prevents errors in valid use cases, and also makes MechanicalSoup more tolerant of invalid HTML. [#158]

    Source code(tar.gz)
    Source code(zip)
  • v0.10.0(Feb 4, 2018)

    Main changes:

    • Added StatefulBrowser.refresh() to reload the current page with the same request. [#188]

    • StatefulBrowser.follow_link, StatefulBrowser.submit_selected() and the new StatefulBrowser.download_link now set the Referer: HTTP header to the page from which the link is followed. [#179]

    • Added method StatefulBrowser.download_link, which will download the contents of a link to a file without changing the state of the browser. [#170]

    • The selector argument of Browser.select_form can now be a bs4.element.Tag in addition to a CSS selector. [#169]

    • Browser.submit and StatefulBrowser.submit_selected accept a larger number of keyword arguments. Arguments are forwarded to requests.Session.request. [#166]

    Internal changes:

    • StatefulBrowser.choose_submit will now ignore input elements that are missing a name-attribute instead of raising a KeyError. [#180]

    • Private methods Browser._build_request and Browser._prepare_request have been replaced by a single method Browser._request. [#166]

    Source code(tar.gz)
    Source code(zip)
  • v0.9.0(Nov 2, 2017)

    Main changes:

    • We do not rely on BeautifulSoup's default choice of HTML parser. Instead, we now specify lxml as default. As a consequence, the default setting requires lxml as a dependency.

    • Python 2.6 and 3.3 are no longer supported.

    • The GitHub URL moved from https://github.com/hickford/MechanicalSoup/ to https://github.com/MechanicalSoup/MechanicalSoup. @moy and @hemberger are now officially administrators of the project in addition to @hickford, the original author.

    • We now have a documentation site: https://mechanicalsoup.readthedocs.io/. The API is now fully documented, and we have included a tutorial, several more code examples, and a FAQ.

    • StatefulBrowser.select_form can now be called without argument, and defaults to "form" in this case. It also has a new argument, nr (defaults to 0), which can be used to specify the index of the form to select if multiple forms match the selection criteria.

    • We now use requirement files. You can install the dependencies of MechanicalSoup with e.g.:

      pip install -r requirements.txt -r tests/requirements.txt

    • The Form class was restructured and has a new API. The behavior of existing code is unchanged, but a new collection of methods has been added for clarity and consistency with the set method:

      • set_input deprecates input
      • set_textarea deprecates textarea
      • set_select is new
      • set_checkbox and set_radio together deprecate check (checkboxes are handled differently by default)
    • A new Form.print_summary method allows you to write browser.get_current_form().print_summary() to get a summary of the fields you need to fill in (and which ones are already filled in).

    • The Form class now supports selecting multiple options in a <select multiple> element.

    Bug fixes

    • Checking checkboxes with browser["name"] = ("val1", "val2") now unchecks all checkboxes except the ones explicitly specified.

    • StatefulBrowser.submit_selected and StatefulBrowser.open now reset __current_page to None when the result is not an HTML page. This fixes a bug where __current_page was still the previous page.

    • We don't error out anymore when trying to uncheck a box which doesn't have a checkbox attribute.

    • Form.new_control now correctly overrides existing elements.

    Internal changes

    • The testsuite has been further improved and reached 100% coverage.

    • Tests are now run against the local version of MechanicalSoup, not against the installed version.

    • Browser.add_soup will now always attach a soup-attribute. If the response is not text/html, then soup is set to None.

    • Form.set(force=True) creates an <input type=text ...> element instead of an <input type=input ...>.

    Source code(tar.gz)
    Source code(zip)
  • v0.8.0(Oct 1, 2017)

    Main changes:

    • Browser and StatefulBrowser can now be configured to raise a LinkNotFound exception when encountering a 404 Not Found error. This is activated by passing raise_on_404=True to the constructor. It is disabled by default for backward compatibility, but is highly recommended.

    • Browser now has a __del__ method that closes the current session when the object is deleted.

    • A Link object can now be passed to follow_link.

    • The user agent can now be customized. The default includes MechanicalSoup and its version.

    • There is now a direct interface to the cookiejar in *Browser classes ((set|get)_cookiejar methods).

    • This is the last MechanicalSoup version supporting Python 2.6 and 3.3.

    Bug fixes:

    • We used to crash on forms without action="..." fields.

    • The choose_submit method has been fixed, and the btnName argument of StatefulBrowser.submit_selected is now a shortcut for using choose_submit.

    • Arguments to open_relative were not properly forwarded.

    Internal changes:

    • The testsuite has been greatly improved. It now uses the pytest API (not only the pytest launcher) for more concise code.

    • The coverage of the testsuite is now measured with codecov.io. The results can be viewed on: https://codecov.io/gh/hickford/MechanicalSoup

    • We now have a requires.io badge to help us track issues with dependencies. The report can be viewed on: https://requires.io/github/hickford/MechanicalSoup/requirements/

    • The version number now appears in a single place in the source code.

    Source code(tar.gz)
    Source code(zip)
  • v0.7.0(May 7, 2017)

    Summary of changes:

    • New class StatefulBrowser, that keeps track of the currently visited page to make the calling code more concise.

    • A new launch_browser method in Browser and StatefulBrowser, that allows launching a browser on the currently visited page for easier debugging.

    • Many bug fixes.

    Release on PyPI: https://pypi.python.org/pypi/MechanicalSoup/0.7.0

    Source code(tar.gz)
    Source code(zip)
  • v0.4.0(Nov 24, 2015)
