Pythonic HTML Parsing for Humans™

Overview

Requests-HTML: HTML Parsing for Humans™

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

When using this library you automatically get:

  • Full JavaScript support!
  • CSS Selectors (a.k.a. jQuery-style, thanks to PyQuery).
  • XPath Selectors, for the faint of heart.
  • Mocked user-agent (like a real web browser).
  • Automatic following of redirects.
  • Connection pooling and cookie persistence.
  • The Requests experience you know and love, with magical parsing abilities.
  • Async support.

Tutorial & Usage

Make a GET request to 'python.org', using Requests:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')

Try async and get some sites at the same time:

>>> from requests_html import AsyncHTMLSession
>>> asession = AsyncHTMLSession()
>>> async def get_pythonorg():
...     r = await asession.get('https://python.org/')
...     return r
...
>>> async def get_reddit():
...     r = await asession.get('https://reddit.com/')
...     return r
...
>>> async def get_google():
...     r = await asession.get('https://google.com/')
...     return r
...
>>> results = asession.run(get_pythonorg, get_reddit, get_google)
>>> results # check the requests all returned a 200 (success) code
[<Response [200]>, <Response [200]>, <Response [200]>]
>>> # Each item in the results list is a response object and can be interacted with as such
>>> for result in results:
...     print(result.html.url)
...
https://www.python.org/
https://www.google.com/
https://www.reddit.com/

Note that the order of the objects in the results list reflects the order in which the responses were returned, not the order in which the coroutines were passed to the run method. That is why the URLs above are printed in a different order.
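The ordering behaviour can be reproduced with the standard library alone. This is a minimal sketch, assuming Python 3.7+; `fetch` and its delays are hypothetical stand-ins for `await asession.get(...)`, and the code is not how requests-html itself is implemented:

```python
import asyncio


async def fetch(name, delay):
    # Stand-in for a network request; the delay simulates latency.
    await asyncio.sleep(delay)
    return name


async def main():
    # Tasks are submitted in this order...
    tasks = [
        asyncio.create_task(fetch("python.org", 0.15)),
        asyncio.create_task(fetch("reddit.com", 0.05)),
        asyncio.create_task(fetch("google.com", 0.10)),
    ]
    finished = []
    # ...but as_completed yields them in the order they finish.
    for next_done in asyncio.as_completed(tasks):
        finished.append(await next_done)
    return finished


completion_order = asyncio.run(main())
print(completion_order)
```

With real responses, identifying each result by an attribute such as its URL, rather than by its position in the list, avoids depending on the order.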

Grab a set of all links on the page, as-is (anchors excluded):

>>> r.html.links
{'//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/', '//docs.python.org/3/tutorial/introduction.html#lists', '/download/alternatives', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', '/download/other/', '/downloads/windows/', 'https://mail.python.org/mailman/listinfo/python-dev', '/doc/av', 'https://devguide.python.org/', '/about/success/#engineering', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', '/success-stories/industrial-light-magic-runs-python/', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', '/', 'http://pyfound.blogspot.com/', '/events/python-events/past/', '/downloads/release/python-2714/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://status.python.org/', '/community/workshops/', '/community/lists/', 'http://buildbot.net/', '/community/awards', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', '/psf/donations/', 'http://wiki.python.org/moin/Languages', '/dev/', '/events/python-user-group/', 'https://wiki.qt.io/PySide', '/community/sigs/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'http://planetpython.org/', '/events/python-events', '/about/help/', '/events/python-user-group/past/', '/about/success/', '/psf-landing/', '/about/apps', '/about/', 'http://www.wxpython.org/', '/events/python-user-group/665/', 'https://www.python.org/psf/codeofconduct/', '/dev/peps/peps.rss', '/downloads/source/', '/psf/sponsorship/sponsors/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 
'https://bugs.python.org/', '/community/merchandise/', 'http://tornadoweb.org', '/events/python-user-group/650/', 'http://flask.pocoo.org/', '/downloads/release/python-364/', '/events/python-user-group/660/', '/events/python-user-group/638/', '/psf/', '/doc/', 'http://blog.python.org', '/events/python-events/604/', '/about/success/#government', 'http://python.org/dev/peps/', 'https://docs.python.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/users/membership/', '/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', '/downloads/', '/jobs/', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', '/privacy/', 'https://pypi.python.org/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'http://www.scipy.org', '/community/forums/', '/about/success/#scientific', '/about/success/#software-development', '/shell/', '/accounts/signup/', 'http://www.facebook.com/pythonlang?fref=ts', '/community/', 'https://kivy.org/', '/about/quotes/', 'http://www.web2py.com/', '/community/logos/', '/community/diversity/', '/events/calendars/', 'https://wiki.python.org/moin/BeginnersGuide', '/success-stories/', '/doc/essays/', '/dev/core-mentorship/', 'http://ipython.org', '/events/', '//docs.python.org/3/tutorial/controlflow.html', '/about/success/#education', '/blogs/', '/community/irc/', 'http://pycon.blogspot.com/', '//jobs.python.org', 'http://www.pylonsproject.org/', 'http://www.djangoproject.com/', '/downloads/mac-osx/', '/about/success/#business', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://docs.python.org/faq/', '//docs.python.org/3/tutorial/controlflow.html#defining-functions'}

Grab a set of all links on the page, in absolute form (anchors excluded):

>>> r.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', 'https://www.python.org/dev/peps/', 'https://mail.python.org/mailman/listinfo/python-dev', 'https://www.python.org/doc/', 'https://www.python.org/', 'https://www.python.org/about/', 'https://www.python.org/events/python-events/past/', 'https://devguide.python.org/', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', 'https://docs.python.org/3/tutorial/introduction.html#lists', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', 'http://pyfound.blogspot.com/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://www.python.org/events/python-events', 'https://status.python.org/', 'https://www.python.org/about/apps', 'https://www.python.org/downloads/release/python-2714/', 'https://www.python.org/psf/donations/', 'http://buildbot.net/', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', 'http://wiki.python.org/moin/Languages', 'https://docs.python.org/faq/', 'https://jobs.python.org', 'https://www.python.org/about/success/#software-development', 'https://www.python.org/about/success/#education', 'https://www.python.org/community/logos/', 'https://www.python.org/doc/av', 'https://wiki.qt.io/PySide', 'https://www.python.org/events/python-user-group/660/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'https://www.python.org/dev/peps/peps.rss', 'http://planetpython.org/', 'https://www.python.org/events/python-user-group/past/', 'https://docs.python.org/3/tutorial/controlflow.html#defining-functions', 
'https://www.python.org/community/diversity/', 'https://docs.python.org/3/tutorial/controlflow.html', 'https://www.python.org/community/awards', 'https://www.python.org/events/python-user-group/638/', 'https://www.python.org/about/legal/', 'https://www.python.org/dev/', 'https://www.python.org/download/alternatives', 'https://www.python.org/downloads/', 'https://www.python.org/community/lists/', 'http://www.wxpython.org/', 'https://www.python.org/about/success/#government', 'https://www.python.org/psf/', 'https://www.python.org/psf/codeofconduct/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://www.python.org/downloads/source/', 'https://bugs.python.org/', 'https://www.python.org/downloads/mac-osx/', 'https://www.python.org/about/help/', 'http://tornadoweb.org', 'http://flask.pocoo.org/', 'https://www.python.org/users/membership/', 'http://blog.python.org', 'https://www.python.org/privacy/', 'https://www.python.org/about/gettingstarted/', 'http://python.org/dev/peps/', 'https://www.python.org/about/apps/', 'https://docs.python.org', 'https://www.python.org/success-stories/', 'https://www.python.org/community/forums/', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', 'https://www.python.org/community/merchandise/', 'https://www.python.org/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', 'https://pypi.python.org/', 'https://www.python.org/events/python-user-group/650/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'https://www.python.org/about/quotes/', 'https://www.python.org/downloads/windows/', 'https://www.python.org/events/calendars/', 'http://www.scipy.org', 'https://www.python.org/community/workshops/', 'https://www.python.org/blogs/', 
'https://www.python.org/accounts/signup/', 'https://www.python.org/events/', 'https://kivy.org/', 'http://www.facebook.com/pythonlang?fref=ts', 'http://www.web2py.com/', 'https://www.python.org/psf/sponsorship/sponsors/', 'https://www.python.org/community/', 'https://www.python.org/download/other/', 'https://www.python.org/psf-landing/', 'https://www.python.org/events/python-user-group/665/', 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org/accounts/login/', 'https://www.python.org/downloads/release/python-364/', 'https://www.python.org/dev/core-mentorship/', 'https://www.python.org/about/success/#business', 'https://www.python.org/community/sigs/', 'https://www.python.org/events/python-user-group/', 'http://ipython.org', 'https://www.python.org/shell/', 'https://www.python.org/community/irc/', 'https://www.python.org/about/success/#engineering', 'http://www.pylonsproject.org/', 'http://pycon.blogspot.com/', 'https://www.python.org/about/success/#scientific', 'https://www.python.org/doc/essays/', 'http://www.djangoproject.com/', 'https://www.python.org/success-stories/industrial-light-magic-runs-python/', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://www.python.org/jobs/', 'https://www.python.org/events/python-events/604/'}
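To see what "absolute form" means, here is a small sketch that resolves a few of the raw links against the page URL using the standard library's `urljoin`. It only illustrates the idea; it is not necessarily how the library resolves links internally, and `base_url` and `raw_links` are sample values picked from the output above:

```python
from urllib.parse import urljoin

# Page the links were found on (as reported after redirects).
base_url = "https://www.python.org/"

# A path-relative link, a scheme-relative link, and an already absolute link.
raw_links = {
    "/about/apps/",
    "//docs.python.org/3/tutorial/",
    "https://status.python.org/",
}

# Resolve every link against the page URL.
absolute_links = {urljoin(base_url, link) for link in raw_links}
print(sorted(absolute_links))
```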

Select an element with a CSS Selector:

>>> about = r.html.find('#about', first=True)

Grab an element's text contents:

>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure

Introspect an Element's attributes:

>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}

Render out an Element's HTML:

>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1 " id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>\n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>\n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>\n</ul>\n</li>'

Select Elements within Elements:

>>> about.find('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]

Search for links within an element:

>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}

Search for text on the page:

>>> r.html.search('Python is a {} language')[0]
'programming'
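In the template, `{}` marks the part of the text to capture. The sketch below imitates that behaviour with the standard library's `re` module so you can see what is going on; `search_template` is a hypothetical helper written for this example, not the library's implementation, and the sample text is made up:

```python
import re


def search_template(template, text):
    """Return the pieces of `text` matched by each `{}` in `template`, or None."""
    # Escape the literal parts and capture (non-greedily) whatever sits between them.
    literal_parts = [re.escape(part) for part in template.split("{}")]
    pattern = "(.*?)".join(literal_parts)
    match = re.search(pattern, text, re.DOTALL)
    return match.groups() if match else None


page_text = "Python is a programming language that lets you work quickly."
found = search_template("Python is a {} language", page_text)
print(found[0])
```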

More complex CSS Selector example (copied from Chrome dev tools):

>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.

XPath is also supported:

>>> r.html.xpath('/html/body/div[1]/a')
[<Element 'a' class=('px-2', 'py-4', 'show-on-focus', 'js-skip-to-content') href='#start-of-content' tabindex='1'>]

JavaScript Support

Let's grab some text that's rendered by JavaScript. Until 2020, the Python 2.7 countdown clock (https://pythonclock.org) will serve as a good test page:

>>> r = session.get('https://pythonclock.org')

Let's try to see the dynamically rendered code (the countdown clock). To do that quickly at first, we'll search between the last text we see before it ('Python 2.7 will retire in...') and the first text we see after it ('Enable Guido Mode').

>>> r.html.search('Python 2.7 will retire in...{}Enable Guido Mode')[0]
'</h1>\n        </div>\n        <div class="python-27-clock"></div>\n        <div class="center">\n            <div class="guido-button-block">\n                <button class="js-guido-mode guido-button">'

Notice the clock is missing. The render() method takes the response and renders the dynamic content just like a web browser would.

>>> r.html.render()
>>> r.html.search('Python 2.7 will retire in...{}Enable Guido Mode')[0]
'</h1>\n        </div>\n        <div class="python-27-clock is-countdown"><span class="countdown-row countdown-show6"><span class="countdown-section"><span class="countdown-amount">1</span><span class="countdown-period">Year</span></span><span class="countdown-section"><span class="countdown-amount">2</span><span class="countdown-period">Months</span></span><span class="countdown-section"><span class="countdown-amount">28</span><span class="countdown-period">Days</span></span><span class="countdown-section"><span class="countdown-amount">16</span><span class="countdown-period">Hours</span></span><span class="countdown-section"><span class="countdown-amount">52</span><span class="countdown-period">Minutes</span></span><span class="countdown-section"><span class="countdown-amount">46</span><span class="countdown-period">Seconds</span></span></span></div>\n        <div class="center">\n            <div class="guido-button-block">\n                <button class="js-guido-mode guido-button">'

Let's clean it up a bit. This step is not required; it just makes the returned HTML easier to read, so we can see what we need to target to extract the information we want.

>>> from pprint import pprint
>>> pprint(r.html.search('Python 2.7 will retire in...{}Enable')[0])
('</h1>\n'
'        </div>\n'
'        <div class="python-27-clock is-countdown"><span class="countdown-row '
'countdown-show6"><span class="countdown-section"><span '
'class="countdown-amount">1</span><span '
'class="countdown-period">Year</span></span><span '
'class="countdown-section"><span class="countdown-amount">2</span><span '
'class="countdown-period">Months</span></span><span '
'class="countdown-section"><span class="countdown-amount">28</span><span '
'class="countdown-period">Days</span></span><span '
'class="countdown-section"><span class="countdown-amount">16</span><span '
'class="countdown-period">Hours</span></span><span '
'class="countdown-section"><span class="countdown-amount">52</span><span '
'class="countdown-period">Minutes</span></span><span '
'class="countdown-section"><span class="countdown-amount">46</span><span '
'class="countdown-period">Seconds</span></span></span></div>\n'
'        <div class="center">\n'
'            <div class="guido-button-block">\n'
'                <button class="js-guido-mode guido-button">')

The rendered HTML has all the same methods and attributes as above. Let's extract just the data we want from the clock into something that is easy to use elsewhere and to introspect, such as a dictionary.

>>> periods = [element.text for element in r.html.find('.countdown-period')]
>>> amounts = [element.text for element in r.html.find('.countdown-amount')]
>>> countdown_data = dict(zip(periods, amounts))
>>> countdown_data
{'Year': '1', 'Months': '2', 'Days': '5', 'Hours': '23', 'Minutes': '34', 'Seconds': '37'}
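The same pairing step can be tried offline. This is a sketch using only the standard library's `html.parser` on a shortened copy of the rendered markup shown earlier; `SpanTextCollector` is a hypothetical helper for this example and stands in for the two `find()` calls, it is not part of requests-html:

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text inside <span> elements, grouped by class attribute."""

    def __init__(self):
        super().__init__()
        self.by_class = {}
        self._open_spans = []  # class of each currently open <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._open_spans.append(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "span" and self._open_spans:
            self._open_spans.pop()

    def handle_data(self, data):
        # Attribute text to the innermost open <span>.
        if self._open_spans and self._open_spans[-1]:
            self.by_class.setdefault(self._open_spans[-1], []).append(data)


snippet = (
    '<span class="countdown-section"><span class="countdown-amount">1</span>'
    '<span class="countdown-period">Year</span></span>'
    '<span class="countdown-section"><span class="countdown-amount">2</span>'
    '<span class="countdown-period">Months</span></span>'
)

collector = SpanTextCollector()
collector.feed(snippet)
periods = collector.by_class["countdown-period"]
amounts = collector.by_class["countdown-amount"]

# Pair each period with its amount, converting the amounts to integers.
countdown_data = dict(zip(periods, (int(amount) for amount in amounts)))
print(countdown_data)
```

Converting the amounts to `int` is optional; it just makes the values easier to do arithmetic with.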

You can also do this asynchronously:

>>> async def get_pyclock():
...     r = await asession.get('https://pythonclock.org/')
...     await r.html.arender()
...     return r
...
>>> results = asession.run(get_pyclock, get_pyclock, get_pyclock)

The rest of the code works the same way as the synchronous version, except that results is a list containing multiple response objects. The same basic steps shown above can be applied to each one to extract the data you want.

Note, the first time you ever run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once.

Using without Requests

You can also use this library without Requests:

>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}

Installation

$ pipenv install requests-html
✨🍰✨

Only Python 3.6 and above is supported.

Issues
  • UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 89: invalid continuation byte

    Hi @kennethreitz, first, thanks for the great library.

    I am running into the problem from #78: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 89: invalid continuation byte

    • pip install -U git+https://github.com/kennethreitz/requests-html
    • Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)]
    from requests_html import HTMLSession 
    session = HTMLSession()
    r = session.get('http://www.nm-n-tax.gov.cn/nmgsj/ssxc/msdt/list_1.shtml')
    r.html.render()
    

    d:\python36\lib\site-packages\pyppeteer\launcher.py in launch(self)
        127                 raise BrowserError('Unexpectedly chrome process closed with '
        128                                    f'return code: {self.proc.returncode}')
    --> 129             msg = self.proc.stdout.readline().decode()
        130             if not msg:
        131                 continue

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xce in position 89: invalid continuation byte

    opened by nmweizi 24
  • Getting a http.client.BadStatusLine error after calling render()

    I basically just followed the example in the documentation:

    session = HTMLSession()

    r = session.get('https://python.org/')

    After running this

    r.html.render()

    I'm getting this error

    File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
      return opener.open(url, data, timeout)
    File "/usr/lib/python3.6/urllib/request.py", line 526, in open
      response = self._open(req, data)
    File "/usr/lib/python3.6/urllib/request.py", line 544, in _open
      '_open', req)
    File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
      result = func(*args)
    File "/usr/lib/python3.6/urllib/request.py", line 1346, in http_open
      return self.do_open(http.client.HTTPConnection, req)
    File "/usr/lib/python3.6/urllib/request.py", line 1321, in do_open
      r = h.getresponse()
    File "/usr/lib/python3.6/http/client.py", line 1346, in getresponse
      response.begin()
    File "/usr/lib/python3.6/http/client.py", line 307, in begin
      version, status, reason = self._read_status()
    File "/usr/lib/python3.6/http/client.py", line 289, in _read_status
      raise BadStatusLine(line)
    http.client.BadStatusLine: GET /json/version HTTP/1.1

    r.html.html prints the entire DOM but I'm not sure why I would get a http.client.BadStatusLine error.

    Is this the right way to do this, or am I missing something here?

    I'm currently using Python 3.6.9

    Thanks

    opened by sagar1025 20
  • Scraper throws error instead of pulling values from a webpage

    I've written a script in Python to get the price of the last trade from a JavaScript-rendered webpage. I can get the content if I use Selenium. My goal here is not to use any browser simulator, because the latest release of Requests-HTML is supposed to be able to parse JavaScript-rendered content. However, I have not been able to get it to work.

    import requests_html
    
    with requests_html.HTMLSession() as session:
        r = session.get('https://www.gdax.com/trade/LTC-EUR')
        js = r.html.render()
        item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
        print(item)
    

    When I execute the script I get the following error (partial traceback):

    Traceback (most recent call last):
      File "C:\Users\ar\AppData\Local\Programs\Python\Python35-32\new_line_one.py", line 27, in <module>
        item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
    AttributeError: 'NoneType' object has no attribute 'find'
    Error in atexit._run_exitfuncs:
    Traceback (most recent call last):
      File "C:\Users\ar\AppData\Local\Programs\Python\Python35-32\lib\shutil.py", line 381, in _rmtree_unsafe
        os.unlink(fullname)
    PermissionError: [WinError 5] Access is denied:
    
    opened by ghost 17
  • Render w/o request doesn't execute inline JS

    This lib looks great, thanks :)... Just a note, I was expecting:

    doc = """<a href='https://httpbin.org'>"""
    html = HTML(html=doc)
    html.render()
    html.html
    

    to output : <a href='https://httpbin.org'>

    Instead I get the content from example.org, which is the default url.

    How can I set the html content and then render it? I can't seem to pass it to:

    doc = """<a href='https://httpbin.org'>"""
    html = HTML(html=doc)
    html.render(script=doc)
    html.html
    

    either, as I get an:

    BrowserError: Evaluation failed: SyntaxError: Unexpected token <
    pageFunction:
    <a href='https://httpbin.org'>
    

    I could set the url to the local file and patch it in, but that solution seems lacking.

    opened by Folcon 16
  • decode error

    from requests_html import HTML
    from pyquery import PyQuery
    
    default_encoding = 'gbk'
    test_html = "<html><body><p>Hello World!--你好世界</p></body></html>".encode(default_encoding)
    
    element = HTML(url='http://example.com/hello_world', html=test_html, default_encoding=default_encoding)
    print(element.text)
    
    print(PyQuery(test_html)('html').text())
    print(PyQuery(test_html.decode(default_encoding))('html').text())
    
    

    output:

    C:\Users\what\PycharmProjects\untitled\venv\Scripts\python.exe C:/Users/what/PycharmProjects/requests-html/BUG.py
    Hello World!--ÄãºÃÊÀ½ç
    Hello World!--ÄãºÃÊÀ½ç
    Hello World!--你好世界
    
    Process finished with exit code 0
    

    So the html at https://github.com/kennethreitz/requests-html/blob/master/requests_html.py#L319 should be decoded.

    opened by cxgreat2014 15
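The garbled output in the decode error report above is what you get when GBK-encoded bytes are read with a single-byte encoding such as Latin-1. A minimal sketch, using only the standard library, that reproduces the symptom independently of requests-html (it demonstrates the encoding mismatch and does not claim to pinpoint where in the library it happens):

```python
text = "你好世界"

# The bytes the issue passes in: the text encoded as GBK.
gbk_bytes = text.encode("gbk")

# Reading those bytes as Latin-1 yields the mojibake from the report.
misread = gbk_bytes.decode("latin-1")

# Decoding with the encoding that was actually used recovers the text.
correct = gbk_bytes.decode("gbk")

print(misread)
print(correct)
```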
  • pyppeteer.errors.BrowserError: Failed to connect to browser port: http://127.0.0.1:58331/json/version

    I use PyCharm to connect to Ubuntu remotely. I am using the requests-html library for the first time, and when I call r.html.render() I get an error saying it can't connect to the browser port. I would like to know why this happens and how to solve it.

    opened by hfldqwe 13
  • Every time I call r.html.render(), it reports the error "This event loop is already running"

    I wrote code like this:

    from requests_html import HTMLSession
    session = HTMLSession()
    r = session.get(url)
    r.html.links
    

    I used this to get data from a website and found that it needs to load JavaScript, so I wrote the following:

    r.html.render()
    

    It gave a message like the one below:

    RuntimeError: This event loop is already running

    I checked the HTML source, and it did not change. I tried again and again, but it reported the same error each time, and the Chromium process it started stopped responding. This code was run in a Jupyter notebook. OS: macOS 10.12.6. Python: 3.6.2.

    I don't know what happened and how to resolve it.

    opened by zhang-win 13
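The error message in the report above comes from asyncio itself: it is raised when blocking-style code asks an event loop that is already running (as is the case inside a Jupyter notebook) to run something to completion. A minimal sketch with only the standard library, assuming Python 3.7+; `work` is a hypothetical placeholder coroutine, and the sketch does not use requests-html at all:

```python
import asyncio


async def work():
    return "done"


async def called_from_running_loop():
    loop = asyncio.get_running_loop()
    pending = work()
    try:
        # Asking the already running loop to block on a coroutine fails.
        loop.run_until_complete(pending)
    except RuntimeError as exc:
        message = str(exc)
    else:
        message = "no error"
    finally:
        pending.close()  # tidy up the coroutine that never ran
    # Awaiting from inside the running loop works fine.
    return message, await work()


message, result = asyncio.run(called_from_running_loop())
print(message)
print(result)
```

In an environment that already runs a loop, the awaitable route shown earlier in this README (AsyncHTMLSession with `await r.html.arender()`) is the natural thing to try; whether it resolves this particular report is not confirmed here.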
  • Django Support?

    Can anyone point me to a way to use this in a Django view? Is it currently possible? I've had success with this framework on the command line but haven't been able to get it working within Django.

    def render_javascript(url):
        session = HTMLSession()
        response = session.get(url)
        session.close()
        return response.html.render()
    

    Gives me RuntimeError: There is no current event loop in thread 'Thread-1'. and

    def render_javascript(url):
        session = AsyncHTMLSession()
        response = await session.get(url)
        await session.close()
        return await response.html.arender()
    

    Gives me 'coroutine' object has no attribute 'get' (general lack of support for async views in Django)

    I've tried a bunch of stuff suggested for Flask in similar issues: https://github.com/psf/requests-html/issues/155 https://github.com/psf/requests-html/issues/326 https://github.com/psf/requests-html/issues/293 ...but still no luck with Django.

    I'm hoping this is a common enough use case that someone can advise or point me to an example of working code.

    Thank you

    opened by rorycaputo 12
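The first error in the Django report above is raised by asyncio when code running in a worker thread asks for an event loop and none has been set for that thread. The sketch below shows the general pattern of giving a thread its own loop, using only the standard library; `work` and `run_in_worker_thread` are hypothetical names, and whether this alone is enough to make `render()` work inside a Django view is an assumption that is not verified here:

```python
import asyncio
import threading


async def work():
    # Placeholder for some awaitable job.
    await asyncio.sleep(0)
    return "rendered"


def run_in_worker_thread(results):
    # Worker threads have no event loop by default, so create one explicitly.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        results.append(loop.run_until_complete(work()))
    finally:
        asyncio.set_event_loop(None)
        loop.close()


results = []
worker = threading.Thread(target=run_in_worker_thread, args=(results,))
worker.start()
worker.join()
print(results)
```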
  • Can't find the element that is visible in page

    Hi, I ran into a problem when trying to find an element in a page:

    html.find("#productDetails_detailBullets_sections1")
    

    I get an empty list, but the element is visible in the page :(

    opened by xzycn 12
  • Bump babel from 2.8.0 to 2.9.1

    Bumps babel from 2.8.0 to 2.9.1.

    Release notes

    Sourced from babel's releases.

    Version 2.9.1

    Bugfixes

    • The internal locale-data loading functions now validate the name of the locale file to be loaded and only allow files within Babel's data directory. Thank you to Chris Lyne of Tenable, Inc. for discovering the issue!

    Version 2.9.0

    Upcoming version support changes

    • This version, Babel 2.9, is the last version of Babel to support Python 2.7, Python 3.4, and Python 3.5.

    Improvements

    • CLDR: Use CLDR 37 – Aarni Koskela (#734)
    • Dates: Handle ZoneInfo objects in get_timezone_location, get_timezone_name - Alessio Bogon (#741)
    • Numbers: Add group_separator feature in number formatting - Abdullah Javed Nesar (#726)

    Bugfixes

    • Dates: Correct default Format().timedelta format to 'long' to mute deprecation warnings – Aarni Koskela
    • Import: Simplify iteration code in "import_cldr.py" – Felix Schwarz
    • Import: Stop using deprecated ElementTree methods "getchildren()" and "getiterator()" – Felix Schwarz
    • Messages: Fix unicode printing error on Python 2 without TTY. – Niklas Hambüchen
    • Messages: Introduce invariant that _invalid_pofile() takes unicode line. – Niklas Hambüchen
    • Tests: fix tests when using Python 3.9 – Felix Schwarz
    • Tests: Remove deprecated 'sudo: false' from Travis configuration – Jon Dufresne
    • Tests: Support Py.test 6.x – Aarni Koskela
    • Utilities: LazyProxy: Handle AttributeError in specified func – Nikiforov Konstantin (#724)
    • Utilities: Replace usage of parser.suite with ast.parse – Miro Hrončok

    Documentation

    • Update parse_number comments – Brad Martin (#708)
    • Add iter to Catalog documentation – @CyanNani123

    Version 2.8.1

    This patch version only differs from 2.8.0 in that it backports in #752.

    Changelog

    Sourced from babel's changelog.

    Version 2.9.1

    Bugfixes

    
    • The internal locale-data loading functions now validate the name of the locale file to be loaded and only allow files within Babel's data directory. Thank you to Chris Lyne of Tenable, Inc. for discovering the issue!
    

    Version 2.9.0

    Upcoming version support changes

    • This version, Babel 2.9, is the last version of Babel to support Python 2.7, Python 3.4, and Python 3.5.

    Improvements

    
    • CLDR: Use CLDR 37 – Aarni Koskela (#734)
    • Dates: Handle ZoneInfo objects in get_timezone_location, get_timezone_name - Alessio Bogon (#741)
    • Numbers: Add group_separator feature in number formatting - Abdullah Javed Nesar (#726)
    

    Bugfixes

    
    • Dates: Correct default Format().timedelta format to 'long' to mute deprecation warnings – Aarni Koskela
    • Import: Simplify iteration code in "import_cldr.py" – Felix Schwarz
    • Import: Stop using deprecated ElementTree methods "getchildren()" and "getiterator()" – Felix Schwarz
    • Messages: Fix unicode printing error on Python 2 without TTY. – Niklas Hambüchen
    • Messages: Introduce invariant that _invalid_pofile() takes unicode line. – Niklas Hambüchen
    • Tests: fix tests when using Python 3.9 – Felix Schwarz
    • Tests: Remove deprecated 'sudo: false' from Travis configuration – Jon Dufresne
    • Tests: Support Py.test 6.x – Aarni Koskela
    • Utilities: LazyProxy: Handle AttributeError in specified func – Nikiforov Konstantin (#724)
    • Utilities: Replace usage of parser.suite with ast.parse – Miro Hrončok
    

    Documentation </code></pre> <ul> <li>Update parse_number comments – Brad Martin (<a href="https://github-redirect.dependabot.com/python-babel/babel/issues/708">#708</a>)</li> <li>Add <strong>iter</strong> to Catalog documentation – <a href="https://github.com/CyanNani123"><code>@​CyanNani123</code></a></li> </ul> <h2>Version 2.8.1</h2> <p>This is solely a patch release to make running tests on Py.test 6+ possible.</p> <p>Bugfixes</p> <!-- raw HTML omitted --> </blockquote> <p>... (truncated)</p> </details> <details> <summary>Commits</summary>

    <ul> <li><a href="https://github.com/python-babel/babel/commit/a99fa2474c808b51ebdabea18db871e389751559"><code>a99fa24</code></a> Use 2.9.0's setup.py for 2.9.1</li> <li><a href="https://github.com/python-babel/babel/commit/60b33e083801109277cb068105251e76d0b7c14e"><code>60b33e0</code></a> Become 2.9.1</li> <li><a href="https://github.com/python-babel/babel/commit/412015ef642bfcc0d8ba8f4d05cdbb6aac98d9b3"><code>412015e</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/python-babel/babel/issues/782">#782</a> from python-babel/locale-basename</li> <li><a href="https://github.com/python-babel/babel/commit/5caf717ceca4bd235552362b4fbff88983c75d8c"><code>5caf717</code></a> Disallow special filenames on Windows</li> <li><a href="https://github.com/python-babel/babel/commit/3a700b5b8b53606fd98ef8294a56f9510f7290f8"><code>3a700b5</code></a> Run locale identifiers through <code>os.path.basename()</code></li> <li><a href="https://github.com/python-babel/babel/commit/5afe2b2f11dcdd6090c00231d342c2e9cd1bdaab"><code>5afe2b2</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/python-babel/babel/issues/754">#754</a> from python-babel/github-ci</li> <li><a href="https://github.com/python-babel/babel/commit/58de8342f865df88697a4a166191e880e3c84d82"><code>58de834</code></a> Replace Travis + Appveyor with GitHub Actions (WIP)</li> <li><a href="https://github.com/python-babel/babel/commit/d1bbc08e845d03d8e1f0dfa0e04983d755f39cb5"><code>d1bbc08</code></a> import_cldr: use logging; add -q option</li> <li><a href="https://github.com/python-babel/babel/commit/156b7fb9f377ccf58c71cf01dc69fb10c7b69314"><code>156b7fb</code></a> Quiesce CLDR download progress bar if requested (or not a TTY)</li> <li><a href="https://github.com/python-babel/babel/commit/613dc1700f91c3d40b081948c0dd6023d8ece057"><code>613dc17</code></a> Make the import warnings about unsupported number systems less verbose</li> <li>Additional commits viewable in <a 
href="https://github.com/python-babel/babel/compare/v2.8.0...v2.9.1">compare view</a></li> </ul> </details>

    dependencies 
    opened by dependabot[bot] 0
  • fix parse html RecursionError

    fix parse html RecursionError

    fix parse html

    https://db-engines.com/en/ranking

    RecursionError

    opened by 521xueweihan 0
  • removed a case of the default mutable argument pitfall

    removed a case of the default mutable argument pitfall

    Problem: The code contained an instance of the mutable default argument pitfall, which Pylint reports as W0102 (dangerous-default-value): https://vald-phoenix.github.io/pylint-errors/plerr/errors/basic/W0102.html

    Solution: Applied a simple refactoring.
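
    For readers unfamiliar with W0102, the pitfall and the usual refactoring can be sketched as follows. The function names are hypothetical and are not taken from the pull request; the sketch only illustrates the general pattern.

```python
def append_item_buggy(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # so every call that omits `bucket` shares the same list object.
    bucket.append(item)
    return bucket


def append_item_fixed(item, bucket=None):
    # Use None as a sentinel and build a fresh list on each call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


print(append_item_buggy(1))  # [1]
print(append_item_buggy(2))  # [1, 2]  (state leaked from the first call)
print(append_item_fixed(1))  # [1]
print(append_item_fixed(2))  # [2]
```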

    opened by NaelsonDouglas 0
  • use xpath2 expression

    use xpath2 expression

    It seems that I can only use XPath 1.0 expressions with the Element.xpath function. How do I use XPath 2.0? Thanks.

    opened by aphkyle 0
  • .mount getting error

    .mount getting error

    Hi, Kenneth! I'm trying to use an adapter with the .mount() method, but I get an error. The same code works with the requests library:

    import requests
    from requests_ip_rotator import ApiGateway

    # Create gateway object and initialise in AWS
    gateway = ApiGateway("https://site.com", access_key_id="", access_key_secret="")
    gateway.start()

    # Assign gateway to session
    session = requests.Session()
    session.mount("https://site.com", gateway)

    # Send request (IP will be randomised)
    response = session.get("https://site.com/index.html")
    print(response.status_code)

    # Delete gateways
    gateway.shutdown()

    When I use requests_html, I get:

    session_2 = HTMLSession.mount("https://site.com", gateway)
    TypeError: mount() missing 1 required positional argument: 'adapter'

    Can you help me?
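
    The reported TypeError is what Python raises when an instance method is called on the class itself rather than on an instance. The sketch below reproduces the message with a hypothetical stand-in class (FakeSession is not part of requests or requests_html), assuming that HTMLSession.mount takes the same (prefix, adapter) arguments as requests.Session.mount.

```python
class FakeSession:
    """Hypothetical stand-in for a session class that has a mount() method."""

    def __init__(self):
        self.adapters = {}

    def mount(self, prefix, adapter):
        self.adapters[prefix] = adapter


adapter = object()  # placeholder for the real gateway/adapter object

# Called through the class: the URL is bound to `self`, `adapter` is bound
# to `prefix`, and the `adapter` parameter is left unfilled.
try:
    FakeSession.mount("https://site.com", adapter)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Called through an instance: `self` is supplied automatically.
session = FakeSession()
session.mount("https://site.com", adapter)
```

    Under that assumption, creating the session first (session_2 = HTMLSession()) and then calling session_2.mount("https://site.com", gateway) would likely avoid the error.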

    opened by xitex 0
  • Help me understanding the return order of asession.run

    Help me understanding the return order of asession.run

    from requests_html import AsyncHTMLSession
    import functools

    async def get_link(link):
        r = await asession.get(link)
        f = str(r) + link
        return f

    asession = AsyncHTMLSession()

    links = [
        'https://google.com',
        'https://yahoo.com',
        'https://python.org'
    ]

    links = [functools.partial(get_link, link) for link in links]

    print(links)

    results = asession.run(*links)

    print(results)

    What I get is:

    [functools.partial(<function get_link at 0x7fa59c438040>, 'https://google.com'), functools.partial(<function get_link at 0x7fa59c438040>, 'https://yahoo.com'), functools.partial(<function get_link at 0x7fa59c438040>, 'https://python.org')]
    ['<Response [200]>https://python.org', '<Response [200]>https://google.com', '<Response [200]>https://yahoo.com']

    So why does asession.run return the results in a different order? Is there a way to get the results in the same order in which the requests were sent?
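
    The README notes that asession.run returns results in the order they come back, not the order the coroutines were passed in. One possible way to keep submission order, assuming you await the coroutines yourself with asyncio.gather rather than going through asession.run, is sketched below. Here fetch is a hypothetical stand-in that sleeps instead of making a request, so the sketch runs without network access.

```python
import asyncio


async def fetch(name, delay):
    # Stand-in for `await asession.get(link)`: sleep instead of doing I/O.
    await asyncio.sleep(delay)
    return name


async def main():
    # The first job is the slowest, so it finishes last.
    jobs = [("google", 0.03), ("yahoo", 0.02), ("python", 0.01)]
    # gather() returns results in the order the awaitables were passed in,
    # regardless of which one completes first.
    return await asyncio.gather(*(fetch(name, delay) for name, delay in jobs))


results = asyncio.run(main())
print(results)  # ['google', 'yahoo', 'python']
```

    Another option that works with any runner is to return the input alongside the response (as get_link above already does with the link) and sort or index the results afterwards.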

    opened by z3ch5 0
  • failed to get elements

    failed to get elements

    I'm new to requests-html and installed it just a few days ago. When I followed the tutorial:

    from requests_html import HTMLSession
    session = HTMLSession()
    r = session.get('https://python.org/')
    about = r.html.find('#about', first=True)
    print(about.text)
    

    The expected output, as described in the tutorial, is:

    About
    Applications
    Quotes
    Getting Started
    Help
    Python Brochure
    

    But actually I got the following:

    About
    Applications
    Quotes
    Getting Started
    Help
    Python Brochure
    Downloads
    All releases
    Source code
    Windows
    macOS
    Other Platforms
    License
    Alternative Implementations
    Documentation
    .
    .
    .
    Submit Website Bug
    Status
    Copyright ©2001-2021.  Python Software Foundation  Legal Statements  Privacy Policy  Powered by Heroku
    window.jQuery || document.write('<script src="/static/js/libs/jquery-1.8.2.min.js"><\/script>') window.jQuery || document.write('<script src="/static/js/libs/jquery-ui-1.12.1.min.js"><\/script>')
    

    That is everything from the element <li id="about" ... to the end of the whole HTML document.

    Does anyone know a solution to this issue?

    @kennethreitz

    opened by fengsanyunyan 1
  • LXML bug that breaks .find() on new installs

    LXML bug that breaks .find() on new installs

    .find() currently returns unwanted results due to a bug with lxml.html.tostring().

    I have traced the bug to the lxml.html.tostring() function and have filed a bug report on their Launchpad.

    For a demonstration of the problem and the bug report, see: https://bugs.launchpad.net/lxml/+bug/1942017

    This bug in LXML probably explains #469 as well.

    opened by TiesdeKok 0
  • Haven't found the built in way to subtract multiple html elements.

    Haven't found the built in way to subtract multiple html elements.

    For parsing, I need to extract the HTML between <h2> elements, but I haven't found a method on <class 'requests_html.HTML'> for it. Getting the result as a <class 'requests_html.HTML'> object would also be desirable.
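
    As a rough illustration of the request, the sketch below splits markup into the fragments that follow each <h2> heading using only the standard library's html.parser. It is not a requests_html feature; H2SectionSplitter and split_on_h2 are hypothetical names, and content before the first <h2> is simply dropped.

```python
from html import escape
from html.parser import HTMLParser


class H2SectionSplitter(HTMLParser):
    """Hypothetical helper: collect the markup that follows each <h2>."""

    def __init__(self):
        super().__init__()
        self.sections = []  # each entry is [heading_text, html_fragment]
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.sections.append(["", ""])
            self._in_h2 = True
        elif self.sections and not self._in_h2:
            self.sections[-1][1] += self.get_starttag_text()

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, with no end tag.
        if self.sections and not self._in_h2:
            self.sections[-1][1] += self.get_starttag_text()

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False
        elif self.sections and not self._in_h2:
            self.sections[-1][1] += f"</{tag}>"

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first <h2> is ignored
        if self._in_h2:
            self.sections[-1][0] += data
        else:
            # The parser hands over unescaped text, so escape it again.
            self.sections[-1][1] += escape(data, quote=False)


def split_on_h2(markup):
    splitter = H2SectionSplitter()
    splitter.feed(markup)
    splitter.close()
    return splitter.sections


doc = "<h2>One</h2><p>a &amp; b</p><br/><h2>Two</h2><ul><li>c</li></ul>"
for heading, fragment in split_on_h2(doc):
    print(heading, "->", fragment)
```

    If a requests_html.HTML object is wanted, each fragment could presumably be wrapped with HTML(html=fragment), the same constructor the README uses for raw strings.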

    opened by KriachkoAS 2
  • Adding a line in the archive readme

    Adding a line in the archive readme

    I added a line to the README file just to learn Git and pull requests.

    opened by voller-96 0
Releases (v0.10.0)
  • v0.10.0 (Feb 18, 2019)

    Fixed

    • Crashes when HTML contains XML #160
    • Decoding errors #162
    • Multiple Chrome tabs left opened on Timeout errors #189
    • next method missing in HTML class. #146 #148

    Added

    • The official release for AsyncHTMLSession. #146
    • browser_args parameter when creating a Session to pass custom args to browser creation. #193
    • New attributes on Element objects for the tag name and line number. #205
    • verify parameter when creating a Session to allow rendering websites with an invalid SSL certificate. #212
    • HTMLSession now raises a RuntimeError when trying to render a page inside an event loop. #146
    • Allow async for in HTML objects. #146
    • arender method on HTML objects. #146
    • AsyncHTMLSession.run method to run, schedule, and await multiple tasks in the event loop. #146
    • Documentation improvements.
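
    The event-loop check mentioned in the list above (refusing to render synchronously inside a running event loop) can be pictured with the standard library alone. This is a sketch of the general technique, not the library's actual code; ensure_no_running_loop is a hypothetical helper name.

```python
import asyncio


def ensure_no_running_loop():
    """Hypothetical guard: refuse to proceed when an event loop is running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running,
        # which is the safe case for synchronous code.
        return "ok"
    raise RuntimeError("already inside an event loop; use the async API instead")


async def call_guard_from_coroutine():
    # Inside a coroutine a loop is running, so the guard should raise.
    try:
        ensure_no_running_loop()
    except RuntimeError as exc:
        return str(exc)
    return "ok"


print(ensure_no_running_loop())
print(asyncio.run(call_guard_from_coroutine()))
```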
Owner
Python Software Foundation
A jquery-like library for python

pyquery: a jquery-like library for python pyquery allows you to make jquery queries on xml documents. The API is as much as possible the similar to jq

Gael Pasgrimaud 2k Oct 22, 2021
Python binding to Modest engine (fast HTML5 parser with CSS selectors).

A fast HTML5 parser with CSS selectors using Modest engine. Installation From PyPI using pip: pip install selectolax Development version from github:

Artem Golubin 463 Oct 16, 2021
Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes

Bleach Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes. Bleach can also linkify text safely, appl

Mozilla 2.2k Oct 20, 2021
Safely add untrusted strings to HTML/XML markup.

MarkupSafe MarkupSafe implements a text object that escapes characters so it is safe to use in HTML and XML. Characters that have special meanings are

The Pallets Projects 427 Oct 11, 2021
A library for converting HTML into PDFs using ReportLab

XHTML2PDF The current release of xhtml2pdf is xhtml2pdf 0.2.5. Release Notes can be found here: Release Notes As with all open-source software, its us

null 1.8k Oct 22, 2021
Python module that makes working with XML feel like you are working with JSON

xmltodict xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec": >>> print(json.dumps(xmltod

Martín Blech 4.6k Oct 23, 2021
The awesome document factory

The Awesome Document Factory WeasyPrint is a smart solution helping web developers to create PDF documents. It turns simple HTML pages into gorgeous s

Kozea 4.6k Oct 22, 2021
The lxml XML toolkit for Python

What is lxml? lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language. It's also very fast and memory

null 2k Oct 26, 2021