robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

Overview

RoboBrowser: Your friendly neighborhood web scraper


Homepage: http://robobrowser.readthedocs.org/

RoboBrowser is a simple, Pythonic library for browsing the web without a standalone web browser. RoboBrowser can fetch a page, click on links and buttons, and fill out and submit forms. If you need to interact with web services that don't have APIs, RoboBrowser can help.

import re
from robobrowser import RoboBrowser

# Browse to Genius
browser = RoboBrowser(history=True)
browser.open('http://genius.com/')

# Search for Porcupine Tree
form = browser.get_form(action='/search')
form                # <RoboForm q=>
form['q'].value = 'porcupine tree'
browser.submit_form(form)

# Look up the first song
songs = browser.select('.song_link')
browser.follow_link(songs[0])
lyrics = browser.select('.lyrics')
lyrics[0].text      # \nHear the sound of music ...

# Back to results page
browser.back()

# Look up my favorite song
song_link = browser.get_link('trains')
browser.follow_link(song_link)

# Can also search HTML using regex patterns
lyrics = browser.find(class_=re.compile(r'\blyrics\b'))
lyrics.text         # \nTrain set and match spied under the blind...

RoboBrowser combines the best of two excellent Python libraries: Requests and BeautifulSoup. RoboBrowser represents browser sessions using Requests and HTML responses using BeautifulSoup, transparently exposing methods of both libraries:

import re
from robobrowser import RoboBrowser

browser = RoboBrowser(user_agent='a python robot')
browser.open('https://github.com/')

# Inspect the browser session
browser.session.cookies['_gh_sess']         # BAh7Bzo...
browser.session.headers['User-Agent']       # a python robot

# Search the parsed HTML
browser.select('div.teaser-icon')       # [<div class="teaser-icon">
                                        # <span class="mega-octicon octicon-checklist"></span>
                                        # </div>,
                                        # ...
browser.find(class_=re.compile(r'column', re.I))    # <div class="one-third column">
                                                    # <div class="teaser-icon">
                                                    # <span class="mega-octicon octicon-checklist"></span>
                                                    # ...

You can also pass a custom Session instance for lower-level configuration:

from requests import Session
from robobrowser import RoboBrowser

session = Session()
session.verify = False  # Skip SSL verification
session.proxies = {'http': 'http://custom.proxy.com/'}  # Set default proxies
browser = RoboBrowser(session=session)

RoboBrowser also includes tools for working with forms, inspired by WebTest and Mechanize.

from robobrowser import RoboBrowser

browser = RoboBrowser()
browser.open('http://twitter.com')

# Get the signup form
signup_form = browser.get_form(class_='signup')
signup_form         # <RoboForm user[name]=, user[email]=, ...

# Inspect its values
signup_form['authenticity_token'].value     # 6d03597 ...

# Fill it out
signup_form['user[name]'].value = 'python-robot'
signup_form['user[user_password]'].value = 'secret'

# Submit the form
browser.submit_form(signup_form)

Checkboxes:

from robobrowser import RoboBrowser

# Browse to a page with checkbox inputs
browser = RoboBrowser()
browser.open('http://www.w3schools.com/html/html_forms.asp')

# Find the form
form = browser.get_forms()[3]
form                            # <RoboForm vehicle=[]>
form['vehicle']                 # <robobrowser.forms.fields.Checkbox...>

# Checked values can be get and set like lists
form['vehicle'].options         # [u'Bike', u'Car']
form['vehicle'].value           # []
form['vehicle'].value = ['Bike']
form['vehicle'].value = ['Bike', 'Car']

# Values can also be set using input labels
form['vehicle'].labels          # [u'I have a bike', u'I have a car \r\n']
form['vehicle'].value = ['I have a bike']
form['vehicle'].value           # [u'Bike']

# Only values that correspond to checkbox values or labels can be set;
# this will raise a `ValueError`
form['vehicle'].value = ['Hot Dogs']

Uploading files:

from robobrowser import RoboBrowser

# Browse to a page with an upload form
browser = RoboBrowser()
browser.open('http://cgi-lib.berkeley.edu/ex/fup.html')

# Find the form
upload_form = browser.get_form()
upload_form                     # <RoboForm upfile=, note=>

# Choose a file to upload
upload_form['upfile']           # <robobrowser.forms.fields.FileInput...>
upload_form['upfile'].value = open('path/to/file.txt', 'r')

# Submit
browser.submit_form(upload_form)

By default, creating a browser instantiates a new requests Session.
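That default session can also be reproduced (and extended) by configuring a requests Session explicitly before handing it to RoboBrowser, as in the custom-session example above. A minimal sketch using only documented Session attributes; the header and parameter values are illustrative:

```python
from requests import Session

# Configure the session up front instead of relying on the default one;
# it would then be passed as RoboBrowser(session=session).
session = Session()
session.headers.update({'User-Agent': 'a python robot'})
session.params = {'lang': 'en'}  # default query parameters for every request
```

Every request the browser makes through this session then carries the configured headers and parameters.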

Requirements

  • Python >= 2.6 or >= 3.3

License

MIT licensed. See the bundled LICENSE file for more details.

Comments
  • Updated requirements for pip 6.0+ to include a session

    Installing robobrowser with pip 6.0+ was failing due to: TypeError: parse_requirements() missing 1 required keyword argument: 'session'

    Fixed setup.py to include a session.

    opened by ghost 6
  • Add support for dynamically added form fields

    Hi,

I'm trying to use a file-upload form, but unfortunately there is a field that gets added by JavaScript before submitting. Is there a way to simply add fields, or to disable the checks that raise the KeyError?

    opened by Bouni 6
  • Switched requirements from == to >=.

These hard version requirements can cause a lot of woes in Arch Linux packaging. Would you please consider this change to allow newer versions of the dependencies? Thank you!

    opened by StuntsPT 4
  • Multiple submit buttons in forms

    If I have the following HTML:

    <form name="input" action="demo_form_action.asp" method="get">
    Username: <input type="text" name="user">
    <input type="submit" value="Action1" name="action1_name">
    <input type="submit" value="Action2" name="action2_name">
    </form> 
    

    When I click the Action1 button in my browser (Firefox or Chrome), the following URL gets sent: http://localhost:8001/demo_form_action.asp?user=asfdasd&action1_name=Action1. A different URL gets sent for Action2: http://localhost:8001/demo_form_action.asp?user=asfdasd&action2_name=Action2. Using the submit_form method in RoboBrowser puts the values for both buttons in the request: http://localhost:8001/demo_form_action.asp?user=asdsafd&action2_name=Action2&action1_name=Action1.

    I guess I can model the browser's button-pressing behaviour by deleting the buttons I do not want pressed from the RoboBrowser form object. However, it would be nice if the submit_form API could be extended to accept the button being pressed to submit the form.

    opened by alastairdb 4
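For context, the browser behaviour this issue describes can be sketched with the standard library alone: a real browser serializes the regular fields plus only the clicked submit button. The field names below come from the example form in the issue; the serialize_click helper is hypothetical and not part of RoboBrowser:

```python
from urllib.parse import urlencode

def serialize_click(fields, submit_buttons, clicked):
    """Build the query string a browser would send when `clicked` is pressed.

    Only the clicked submit button is included; the others are dropped.
    """
    data = dict(fields)
    data[clicked] = submit_buttons[clicked]
    return urlencode(data)

fields = {'user': 'asfdasd'}
buttons = {'action1_name': 'Action1', 'action2_name': 'Action2'}

serialize_click(fields, buttons, 'action1_name')
# 'user=asfdasd&action1_name=Action1'
```

Deleting the unwanted submit fields from the RoboForm, as the issue suggests, achieves the same effect.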
  • How do I tick/select a check box?

    >>> url = 'https://bitbucket.org/repo/import'
    >>> browser.open(url)
    >>> import_form = browser.get_form(id='import-form')
    >>> import_form
    <RoboForm source_scm=, source=source-git, goog_project_name=, goog_scm=svn, sourceforge_project_name=, sourceforge_mount_point=, sourceforge_scm=svn, codeplex_project_name=, codeplex_scm=svn, url=, auth=[], username=, password=, owner=2039394, name=, description=, is_private=[None], forking=no_public_forks, no_forks=, no_public_forks=True, scm=git, has_issues=[], has_wiki=[], language=, csrfmiddlewaretoken=icwpCBLZdWAPht1rmnACawMHcYwtorNA>
    >>> type(import_form['auth'])
    <class 'robobrowser.forms.fields.Checkbox'>
    

    The 'auth' field is a checkbox. How do I set it to true, or selected? I couldn't find the required info in the documentation, so I'm asking here. Thank you!

    opened by avinassh 4
  • cannot install with pip

    I'm getting the following error when I try to install with pip:

    $ pip install robobrowser
    Downloading/unpacking robobrowser
      Running setup.py egg_info for package robobrowser
        Traceback (most recent call last):
          File "<string>", line 14, in <module>
          File "/home/tsc/.virtualenvs/vm_export_tool/build/robobrowser/setup.py", line 19, in <module>
            for requirement in parse_requirements('requirements.txt')
          File "/home/tsc/.virtualenvs/vm_export_tool/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/req.py", line 1240, in parse_requirements
            skip_regex = options.skip_requirements_regex
        AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
    
      File "<string>", line 14, in <module>
    
      File "/home/tsc/.virtualenvs/vm_export_tool/build/robobrowser/setup.py", line 19, in <module>
    
        for requirement in parse_requirements('requirements.txt')
    
      File "/home/tsc/.virtualenvs/vm_export_tool/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/req.py", line 1240, in parse_requirements
    
        skip_regex = options.skip_requirements_regex
    
    AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'
    
    ----------------------------------------
    Command python setup.py egg_info failed with error code 1 in /home/tsc/.virtualenvs/vm_export_tool/build/robobrowser
    Storing complete log in /home/tsc/.pip/pip.log
    

    python 2.7 pip 1.1 ubuntu 12.10

    opened by vindolin 4
  • ImportError: cannot import name 'RoboBrowser'

    OS: Ubuntu 16.04. Tried pip and easy_install: pip install robobrowser, sudo -H pip install robobrowser, ....

    Even tried wiping pip and requests and reinstalling; same result.

    The package installs, and I tried sudo ldconfig; however, I'm not getting this to work. I am a Python newb, so please excuse me if this is a user failure on my side.

    python3 robobrowser.py
    Traceback (most recent call last):
      File "robobrowser.py", line 4, in <module>
        from robobrowser import RoboBrowser
      File "/home/robotux/robobrowser.py", line 4, in <module>
        from robobrowser import RoboBrowser
    ImportError: cannot import name 'RoboBrowser'
    
    opened by alphaaurigae 3
  • How to click a button in form

    I have a form like this:

    <form method="post">
        <input type="hidden" name="action" value="del_file">
        <table class="table table-bordered table-striped">
            <tbody>
                <tr>
                    <td class="item-title"> test1.txt</td>
                    <td class="item-action">
                        <button type="submit" class="btn" name="file_id" value="24247579">
                            Delete
                        </button>
                    </td>
                </tr>
                <tr>
                    <td class="item-title"> test2.txt</td>
                    <td class="item-action">
                        <button type="submit" class="btn" name="file_id" value="22608379">
                            Delete
                        </button>
                    </td>
                </tr>
                <tr>
                    <td class="item-title"> test3.txt</td>
                    <td class="item-action">
                        <button type="submit" class="btn" name="file_id" value="22608377">
                            Delete
                        </button>
                    </td>
                </tr>
            </tbody>
        </table>
    </form>
    

    How can I click a Delete button? The delete request looks like this: POST http://host with Content-Type: application/x-www-form-urlencoded and data action=del_file&file_id=22608377.

    Please tell me what I can do with this form.

    opened by tengshan2008 2
  • browser.get_form - forms with a text area return a prefixed \r when empty and \r\n when filled

    I am running into an issue with forms that have text areas. When grabbing the form using browser.get_form, all blank text areas return a \r, and when submitted they now contain a newline.

    Also, all non-blank text areas are prefixed with \r\n when pulling the form.

    opened by tanner-pm 2
  • Add error message for selecting non-existent option in multi-option field

    This fixes #8 by adding an error message that references the field and the non-existent option being selected. I did not list some of the valid options as it seemed to make for an overly lengthy error message, but they would be easy to add in.

    opened by rcutmore 2
  • Fixed: Parsing of empty select boxes failed

    Hi Joshua,

    RoboBrowser failed to parse forms where there were empty select boxes. I have added the tests as well as the fix for the same.

    The use cases of empty select boxes are pretty common on ASP website where forms require multiple submits. Before the initial submit the select boxes are kept empty. On initial submit, the select boxes get a few options based on other values.

    Do let me know if anything is required to be modified in the pull request. Pratyush

    opened by pratyushmittal 2
  • Unmaintained, dead, list of alternative projects

    Since RoboBrowser seems to be unmaintained since 2015, here is a list of alternatives that I've found, all of which have had commits within the last year (i.e., during 2021). I've not yet started the process of porting any of my scripts over from RoboBrowser to any of these, so I can't vouch for how similar any are.

    • Scrapy (currently 43k stars on GitHub)
    • MechanicalSoup (currently 3.9k stars on GitHub)
    • PSpider (currently 1.3k stars on GitHub)
    • Feedparser (currently 1.3k stars on GitHub)
      • Feedparser is specifically designed for RSS, Atom, and other feeds, rather than as a generic scraper
    • Spidy (currently 273 stars on GitHub)
      • Spidy is more of a command-line tool than a framework for automation, but there's probably a non-zero number of users who end up looking at RoboBrowser but would be happier with a command-line tool.

    I included the stars because they help gauge how popular a project is. Popular projects tend to be less likely to disappear if a maintainer loses interest. RoboBrowser is currently at 3.6k stars, but all of the above projects appear to be actively maintained.

    opened by bobpaul 0
  • .csv file getting record empty

    At first it worked for me, but later when I run the command python3 rank.py https://www.uselessthingstobuy.com/ desktop it shows empty data in my file. Please help.

    opened by shaheroumwali 0
  • Update browser.py

    Werkzeug was upgraded to 1.0.0 and introduced this error: ImportError: cannot import name 'cached_property' from 'werkzeug' (/usr/local/lib/python3.8/site-packages/werkzeug/__init__.py)

    Suggesting this change which has been tested locally and discussed here: https://github.com/jmcarp/robobrowser/issues/93

    opened by antoni-g 2
  • Possible to upload multiple files?

    .. to simulate an upload dialog where a user selects multiple files.

    upform = forms[0]
    upform['files'].value = [open("a.gz", 'rb'), open("b.gz", 'rb')]

    This gives me: ValueError('Value must be a file object or file path')

    opened by manzikki 1
  • Update browser.py

    Needed to update, as the newly released Werkzeug package throws the error cannot import name 'cached_property' from 'werkzeug' (/app/.heroku/python/lib/python3.7/site-packages/werkzeug/__init__.py). Just changed an import line from `from werkzeug import cached_property` to `from werkzeug.utils import cached_property`.

    opened by andersonneo67 0
Owner
Joshua Carp
Simple library for exploring/scraping the web or testing a website you’re developing

Robox is a simple library with a clean interface for exploring/scraping the web or testing a website you’re developing. Robox can fetch a page, click on links and buttons, and fill out and submit forms.

Dan Claudiu Pop 79 Nov 27, 2022
Use Flask API to wrap Facebook data. Grab the wrapper of Facebook public pages without an API key.

Facebook Scraper Use Flask API to wrap Facebook data. Grab the wrapper of Facebook public pages without an API key. (Currently working 2021) Setup Befo

Encore Shao 2 Dec 27, 2021
Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

trafilatura: Web scraping tool for text discovery and retrieval Description Trafilatura is a Python package and command-line tool which seamlessly dow

Adrien Barbaresi 704 Jan 6, 2023
Library to scrape and clean web pages to create massive datasets.

lazynlp A straightforward library that allows you to crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this libr

Chip Huyen 2.1k Jan 6, 2023
Here I provide the source code for doing web scraping using the python library, it is Selenium.


M Khaidar 1 Nov 13, 2021
A simple web scraper bot to scrape webpages using Requests, html5lib and BeautifulSoup.

WebScrapperRoBot A simple web scraper bot to scrape webpages using Requests, html5lib and BeautifulSoup. Mark your Star ⭐ ⭐ What is Web Scraping ? Web s

Nuhman Pk 53 Dec 21, 2022
A simple python web scraper.

Dissec A simple python web scraper. It gets a website and its contents and parses them with the help of bs4. Installation To install the requirements,

null 11 May 6, 2022
A Simple Web Scraper made to Extract Download Links from Todaytvseries2.com

TDTV2-Direct Version 1.00.1 • A Simple Web Scraper made to Extract Download Links from Todaytvseries2.com :) How to Works?? install all dependancies v

Danushka-Madushan 1 Nov 28, 2021
A simple django-rest-framework api using web scraping

Apicell You can use this api to search in google, bing, pypi and subscene and get results Method : POST Parameter : query Example import request url =

Hesam N 1 Dec 19, 2021
Dude is a very simple framework for writing web scrapers using Python decorators

Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.

Ronie Martinez 326 Dec 15, 2022
Web Scraping Framework

Grab Framework Documentation Installation $ pip install -U grab See details about installing Grab on different platforms here http://docs.grablib.

null 2.3k Jan 4, 2023
A Powerful Spider(Web Crawler) System in Python.

pyspider A Powerful Spider(Web Crawler) System in Python. Write script in Python Powerful WebUI with script editor, task monitor, project manager and

Roy Binux 15.7k Jan 4, 2023
Scrapy, a fast high-level web crawling & scraping framework for Python.

Scrapy Overview Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pag

Scrapy project 45.5k Jan 7, 2023
:arrow_double_down: Dumb downloader that scrapes the web

You-Get NOTICE: Read this if you are looking for the conventional "Issues" tab. You-Get is a tiny command-line utility to download media contents (vid

Mort Yao 46.4k Jan 3, 2023
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Jan 8, 2023
A Smart, Automatic, Fast and Lightweight Web Scraper for Python

AutoScraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python This project is made for automatic web scraping to make scraping easy. It

Mika 4.8k Jan 4, 2023
Async Python 3.6+ web scraping micro-framework based on asyncio

Ruia: Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster. Overview Ruia is an async web scraping micro-frame

howie.hu 1.6k Jan 1, 2023
Web Content Retrieval for Humans™

Lassie Lassie is a Python library for retrieving basic content from websites. Usage >>> import lassie >>> lassie.fetch('http://www.youtube.com/watch?v

Mike Helmick 570 Dec 19, 2022
Transistor, a Python web scraping framework for intelligent use cases.

Web data collection and storage for intelligent use cases. transistor About The web is full of data. Transistor is a web scraping framework for collec

BOM Quote Manufacturing 212 Nov 5, 2022