robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

Joshua Carp

Last update: Dec 27, 2022

Related tags

Web Crawling robobrowser

Overview

RoboBrowser: Your friendly neighborhood web scraper

Homepage: http://robobrowser.readthedocs.org/

RoboBrowser is a simple, Pythonic library for browsing the web without a standalone web browser. RoboBrowser can fetch a page, click on links and buttons, and fill out and submit forms. If you need to interact with web services that don't have APIs, RoboBrowser can help.

import re
from robobrowser import RoboBrowser

# Browse to Genius
browser = RoboBrowser(history=True)
browser.open('http://genius.com/')

# Search for Porcupine Tree
form = browser.get_form(action='/search')
form                # <RoboForm q=>
form['q'].value = 'porcupine tree'
browser.submit_form(form)

# Look up the first song
songs = browser.select('.song_link')
browser.follow_link(songs[0])
lyrics = browser.select('.lyrics')
lyrics[0].text      # \nHear the sound of music ...

# Back to results page
browser.back()

# Look up my favorite song
song_link = browser.get_link('trains')
browser.follow_link(song_link)

# Can also search HTML using regex patterns
lyrics = browser.find(class_=re.compile(r'\blyrics\b'))
lyrics.text         # \nTrain set and match spied under the blind...

RoboBrowser combines the best of two excellent Python libraries: Requests and BeautifulSoup. RoboBrowser represents browser sessions using Requests and HTML responses using BeautifulSoup, transparently exposing methods of both libraries:

import re
from robobrowser import RoboBrowser

browser = RoboBrowser(user_agent='a python robot')
browser.open('https://github.com/')

# Inspect the browser session
browser.session.cookies['_gh_sess']         # BAh7Bzo...
browser.session.headers['User-Agent']       # a python robot

# Search the parsed HTML
browser.select('div.teaser-icon')       # [<div class="teaser-icon">
                                        # <span class="mega-octicon octicon-checklist"></span>
                                        # </div>,
                                        # ...
browser.find(class_=re.compile(r'column', re.I))    # <div class="one-third column">
                                                    # <div class="teaser-icon">
                                                    # <span class="mega-octicon octicon-checklist"></span>
                                                    # ...

You can also pass a custom Session instance for lower-level configuration:

from requests import Session
from robobrowser import RoboBrowser

session = Session()
session.verify = False  # Skip SSL verification
session.proxies = {'http': 'http://custom.proxy.com/'}  # Set default proxies
browser = RoboBrowser(session=session)

RoboBrowser also includes tools for working with forms, inspired by WebTest and Mechanize.

from robobrowser import RoboBrowser

browser = RoboBrowser()
browser.open('http://twitter.com')

# Get the signup form
signup_form = browser.get_form(class_='signup')
signup_form         # <RoboForm user[name]=, user[email]=, ...

# Inspect its values
signup_form['authenticity_token'].value     # 6d03597 ...

# Fill it out
signup_form['user[name]'].value = 'python-robot'
signup_form['user[user_password]'].value = 'secret'

# Submit the form
browser.submit_form(signup_form)

Checkboxes:

from robobrowser import RoboBrowser

# Browse to a page with checkbox inputs
browser = RoboBrowser()
browser.open('http://www.w3schools.com/html/html_forms.asp')

# Find the form
form = browser.get_forms()[3]
form                            # <RoboForm vehicle=[]>
form['vehicle']                 # <robobrowser.forms.fields.Checkbox...>

# Checked values can be read and set like lists
form['vehicle'].options         # [u'Bike', u'Car']
form['vehicle'].value           # []
form['vehicle'].value = ['Bike']
form['vehicle'].value = ['Bike', 'Car']

# Values can also be set using input labels
form['vehicle'].labels          # [u'I have a bike', u'I have a car \r\n']
form['vehicle'].value = ['I have a bike']
form['vehicle'].value           # [u'Bike']

# Only values that correspond to checkbox values or labels can be set;
# this will raise a `ValueError`
form['vehicle'].value = ['Hot Dogs']
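
The checkbox rules shown above (set by value or by label, reject anything else) can be sketched with a toy class. This is only an illustration of the documented behaviour, not RoboBrowser's implementation; the name CheckboxModel and its constructor arguments are hypothetical.

```python
class CheckboxModel:
    """Toy stand-in for a checkbox group; NOT RoboBrowser's actual code."""

    def __init__(self, options, labels):
        self.options = list(options)
        # Labels scraped from HTML often carry stray whitespace, so normalise them
        self.labels = [label.strip() for label in labels]
        self._value = []

    @property
    def value(self):
        return list(self._value)

    @value.setter
    def value(self, selected):
        resolved = []
        for item in selected:
            if item in self.options:
                # Item is already a checkbox value
                resolved.append(item)
            elif item.strip() in self.labels:
                # Item is a label: map it to the option at the same position
                resolved.append(self.options[self.labels.index(item.strip())])
            else:
                raise ValueError('Option {0!r} not found'.format(item))
        self._value = resolved


vehicle = CheckboxModel(['Bike', 'Car'], ['I have a bike', 'I have a car \r\n'])
vehicle.value = ['I have a bike']
print(vehicle.value)  # ['Bike']
```

Resolving every item before assigning means a rejected value leaves the previous selection untouched.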

Uploading files:

from robobrowser import RoboBrowser

# Browse to a page with an upload form
browser = RoboBrowser()
browser.open('http://cgi-lib.berkeley.edu/ex/fup.html')

# Find the form
upload_form = browser.get_form()
upload_form                     # <RoboForm upfile=, note=>

# Choose a file to upload
upload_form['upfile']           # <robobrowser.forms.fields.FileInput...>
upload_form['upfile'].value = open('path/to/file.txt', 'r')

# Submit
browser.submit_form(upload_form)

By default, creating a browser instantiates a new requests Session.

Requirements

Python >= 2.6 or >= 3.3

License

MIT licensed. See the bundled LICENSE file for more details.

Comments

Updated requirements for pip 6.0+ to include a session

Installing robobrowser with pip 6.0+ was failing due to: TypeError: parse_requirements() missing 1 required keyword argument: 'session'

Fixed setup.py to include a session.

opened by ghost 6
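
One way a setup.py can avoid depending on pip's internal parse_requirements signature altogether is to read requirements.txt directly. The sketch below is an alternative under that assumption, not the fix described in this pull request (which passed a session); the helper name read_requirements and the sample file contents are made up for illustration.

```python
import os
import tempfile


def read_requirements(path):
    """Return requirement specifiers from a file, skipping blanks, comments and pip options."""
    requirements = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip empty lines, comments, and option lines such as "-r other.txt"
            if not line or line.startswith('#') or line.startswith('-'):
                continue
            requirements.append(line)
    return requirements


# Demonstrate with a throwaway file (contents are hypothetical)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'requirements.txt')
    with open(demo_path, 'w') as handle:
        handle.write('# core dependencies\nrequests>=2.6.0\n\nbeautifulsoup4>=4.3.2\n')
    demo_requirements = read_requirements(demo_path)

print(demo_requirements)  # ['requests>=2.6.0', 'beautifulsoup4>=4.3.2']
```

The returned list could then be passed as install_requires in setup().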

Add support for dynamically added form fields

Hi,

I am trying to use a file upload form, but unfortunately there is a field that gets added by JavaScript before submitting. Is there a way to simply add fields, or to disable the checks that raise the KeyError?

opened by Bouni 6

Switched requirements from == to >=.

These hard version requirements can cause a lot of woes in Arch Linux packaging. Would you please consider the change to allow greater versions of the dependencies? Thank you!

opened by StuntsPT 4

Multiple submit buttons in forms

If I have the following HTML:

<form name="input" action="demo_form_action.asp" method="get">
    Username: <input type="text" name="user">
    <input type="submit" value="Action1" name="action1_name">
    <input type="submit" value="Action2" name="action2_name">
</form>

When I click the Action1 button in my browser (Firefox or Chrome), the following URL is requested: http://localhost:8001/demo_form_action.asp?user=asfdasd&action1_name=Action1. A different URL is requested for Action2: http://localhost:8001/demo_form_action.asp?user=asfdasd&action2_name=Action2. Using the submit_form method in robobrowser puts the values of both buttons in the request: http://localhost:8001/demo_form_action.asp?user=asdsafd&action2_name=Action2&action1_name=Action1.

I guess I can model the browser button pressing behaviour by deleting the buttons I do not want pressed from the robobrowser form object. However, it would be nice if the API for submit_form could be extended to include the button being pressed to submit the form.
opened by alastairdb 4
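
The browser behaviour described in this issue (only the clicked submit button contributes to the query string) can be sketched with the standard library alone. This is an illustration of the rule, not RoboBrowser code; FormFields and serialize are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFields(HTMLParser):
    """Collect (name, value, is_submit) for every named <input> element."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attrs = dict(attrs)
        name = attrs.get('name')
        if name:
            is_submit = (attrs.get('type') or 'text').lower() == 'submit'
            self.fields.append((name, attrs.get('value') or '', is_submit))


def serialize(html, values, clicked):
    """Build a query string, keeping only the submit button named `clicked`."""
    parser = FormFields()
    parser.feed(html)
    pairs = []
    for name, default, is_submit in parser.fields:
        if is_submit and name != clicked:
            continue  # submit buttons that were not clicked are left out
        pairs.append((name, values.get(name, default)))
    return urlencode(pairs)


FORM_HTML = '''
<form name="input" action="demo_form_action.asp" method="get">
    Username: <input type="text" name="user">
    <input type="submit" value="Action1" name="action1_name">
    <input type="submit" value="Action2" name="action2_name">
</form>
'''

query = serialize(FORM_HTML, {'user': 'asfdasd'}, clicked='action1_name')
print(query)  # user=asfdasd&action1_name=Action1
```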

How do I tick/select a check box?

>>> url = 'https://bitbucket.org/repo/import'
>>> browser.open(url)
>>> import_form = browser.get_form(id='import-form')
>>> import_form
<RoboForm source_scm=, source=source-git, goog_project_name=, goog_scm=svn, sourceforge_project_name=, sourceforge_mount_point=, sourceforge_scm=svn, codeplex_project_name=, codeplex_scm=svn, url=, auth=[], username=, password=, owner=2039394, name=, description=, is_private=[None], forking=no_public_forks, no_forks=, no_public_forks=True, scm=git, has_issues=[], has_wiki=[], language=, csrfmiddlewaretoken=icwpCBLZdWAPht1rmnACawMHcYwtorNA>
>>> type(import_form['auth'])
<class 'robobrowser.forms.fields.Checkbox'>

The 'auth' field is a checkbox. How do I set it to true, or selected? I couldn't find the required information in the documentation, so I am asking here. Thank you!

opened by avinassh 4

cannot install with pip

I'm getting the following error when I try to install with pip:

$ pip install robobrowser
Downloading/unpacking robobrowser
  Running setup.py egg_info for package robobrowser
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/home/tsc/.virtualenvs/vm_export_tool/build/robobrowser/setup.py", line 19, in <module>
        for requirement in parse_requirements('requirements.txt')
      File "/home/tsc/.virtualenvs/vm_export_tool/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/req.py", line 1240, in parse_requirements
        skip_regex = options.skip_requirements_regex
    AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/home/tsc/.virtualenvs/vm_export_tool/build/robobrowser/setup.py", line 19, in <module>
        for requirement in parse_requirements('requirements.txt')
      File "/home/tsc/.virtualenvs/vm_export_tool/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/req.py", line 1240, in parse_requirements
        skip_regex = options.skip_requirements_regex
    AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'

----------------------------------------
Command python setup.py egg_info failed with error code 1 in /home/tsc/.virtualenvs/vm_export_tool/build/robobrowser
Storing complete log in /home/tsc/.pip/pip.log

Environment: Python 2.7, pip 1.1, Ubuntu 12.10

opened by vindolin 4

ImportError: cannot import name 'RoboBrowser'

OS: 16.04

Tried pip and pip easyinstall:

pip install robobrowser
sudo -H pip install robobrowser
....

Even tried wiping pip and requests and reinstalling, with the same result.

The package installs, and I tried sudo ldconfig, but I cannot get this to work. I am a Python newbie, so please excuse me if this is a user error on my side.

python3 robobrowser.py
Traceback (most recent call last):
  File "robobrowser.py", line 4, in <module>
    from robobrowser import RoboBrowser
  File "/home/robotux/robobrowser.py", line 4, in <module>
    from robobrowser import RoboBrowser
ImportError: cannot import name 'RoboBrowser'
opened by alphaaurigae 3
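
The traceback above shows the script itself is named robobrowser.py, and the import inside it resolves to that same file. Python places the script's directory first on the import path, so a local file with the package's name is found before the installed package. A small check for this situation might look like the sketch below; shadowing_files is a hypothetical helper, and the temporary directory only stands in for the user's working directory.

```python
import os
import tempfile


def shadowing_files(package_name, directory):
    """Return files in `directory` whose names would shadow `package_name` on import."""
    candidates = [package_name + '.py', package_name + '.pyc']
    return [name for name in candidates
            if os.path.exists(os.path.join(directory, name))]


# Simulate a working directory containing a script named after the package
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'robobrowser.py'), 'w').close()
    found = shadowing_files('robobrowser', tmp)
    not_found = shadowing_files('requests', tmp)

print(found)      # ['robobrowser.py']
print(not_found)  # []
```

If the check reports a match, renaming the script (and deleting any stale .pyc file) is the usual remedy.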

How to click a button in form

I have a form like this:

<form method="post">
    <input type="hidden" name="action" value="del_file">
    <table class="table table-bordered table-striped">
        <tbody>
            <tr>
                <td class="item-title"> test1.txt</td>
                <td class="item-action">
                    <button type="submit" class="btn" name="file_id" value="24247579">
                        Delete
                    </button>
                </td>
            </tr>
            <tr>
                <td class="item-title"> test2.txt</td>
                <td class="item-action">
                    <button type="submit" class="btn" name="file_id" value="22608379">
                        Delete
                    </button>
                </td>
            </tr>
            <tr>
                <td class="item-title"> test3.txt</td>
                <td class="item-action">
                    <button type="submit" class="btn" name="file_id" value="22608377">
                        Delete
                    </button>
                </td>
            </tr>
        </tbody>
    </table>
</form>

How can I click a Delete button? The delete request looks like this: a POST to http://host with Content-Type: application/x-www-form-urlencoded and the data action=del_file&file_id=22608377.

Please tell me what I can do for this form.

opened by tengshan2008 2
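
The request body quoted in this question combines the hidden action field with the name and value of whichever Delete button was pressed. A minimal sketch of building that body with the standard library follows; delete_payload is a hypothetical helper, and sending the body is left to whichever HTTP client is in use.

```python
from urllib.parse import urlencode


def delete_payload(file_id):
    """Encode the hidden field plus the one button that was 'clicked'."""
    # All three buttons share the name "file_id"; only the chosen value is sent
    return urlencode([('action', 'del_file'), ('file_id', file_id)])


body = delete_payload('22608377')
print(body)  # action=del_file&file_id=22608377
```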

browser.get_form - forms with a text area return a prefixed \r when empty and \r\n when filled

I am running into an issue with forms that have text areas. When grabbing the form using browser.get_form, every blank text area comes back as \r, so it contains a new line when submitted.

Also, every non-blank text area comes back with \r\n prefixed to its content.

opened by tanner-pm 2
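
For context, HTML parsing rules ignore a single line break that immediately follows the opening textarea tag, which is why browsers do not show this leading newline. A possible workaround on the scraping side is to drop that one line break after reading the value; textarea_value below is a hypothetical helper, and whether RoboBrowser itself ever adopted such a fix is not stated here.

```python
def textarea_value(raw_text):
    """Drop one leading line break, as browsers do for text right after <textarea>."""
    # Check the two-character sequence first so '\r\n' is removed as a unit
    for newline in ('\r\n', '\n', '\r'):
        if raw_text.startswith(newline):
            return raw_text[len(newline):]
    return raw_text


print(repr(textarea_value('\r')))          # ''
print(repr(textarea_value('\r\nhello')))   # 'hello'
```

Only the first line break is removed, so content that intentionally starts with a blank line keeps its remaining newlines.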

Add error message for selecting non-existent option in multi-option field

This fixes #8 by adding an error message that references the field and the non-existent option being selected. I did not list some of the valid options as it seemed to make for an overly lengthy error message, but they would be easy to add in.

opened by rcutmore 2

Fixed: Parsing of empty select boxes failed

Hi Joshua,

RoboBrowser failed to parse forms where there were empty select boxes. I have added the tests as well as the fix for the same.

The use cases of empty select boxes are pretty common on ASP website where forms require multiple submits. Before the initial submit the select boxes are kept empty. On initial submit, the select boxes get a few options based on other values.

Do let me know if anything needs to be modified in the pull request.

Pratyush

opened by pratyushmittal 2

Unmaintained, dead, list of alternative projects

Since RoboBrowser seems to have been unmaintained since 2015, here is a list of alternatives I've found that have all had commits within the last year (i.e., during 2021). I've not yet started porting any of my scripts from RoboBrowser to any of these, so I can't vouch for how similar they are.

Scrapy currently 43k stars on github

Mechanical Soup currently 3.9k stars on github

PSpider currently 1.3k stars on github

Feedparser currently 1.3k stars on github

Feedparser is specifically designed for RSS, ATOM, and other feeds, rather than a generic scraper

Spidy currently 273 stars on github

Spidy is more of a commandline tool than a framework for automation, but there's probably a non-zero number of users who end up looking at Robobrowser but would be happier with a commandline tool.

I included the stars because they help gauge how popular a project is. Popular projects tend to be less likely to disappear if a maintainer loses interest. RoboBrowser is currently at 3.6k stars, but all of the above projects appear to be currently maintained.
opened by bobpaul 0

.csv file records come out empty

At first it worked for me, but later, when I run the command python3 rank.py https://www.uselessthingstobuy.com/ desktop, the data in my file comes out empty. Please help.

opened by shaheroumwali 0

Update browser.py

Werkzeug was upgraded to 1.0.0 and introduced this error: ImportError: cannot import name 'cached_property' from 'werkzeug' (/usr/local/lib/python3.8/site-packages/werkzeug/__init__.py)

Suggesting this change which has been tested locally and discussed here: https://github.com/jmcarp/robobrowser/issues/93

opened by antoni-g 2

Possible to upload multiple files?

.. to simulate an upload dialog where a user selects multiple files.

upform = forms[0]
upform['files'].value = [open("a.gz", 'rb'), open("b.gz", 'rb')]

This gives me: ValueError('Value must be a file object or file path')

opened by manzikki 1

Update browser.py

I needed to update this because, with the newly updated werkzeug package, it throws the error cannot import name 'cached_property' from 'werkzeug' (/app/.heroku/python/lib/python3.7/site-packages/werkzeug/__init__.py). I just changed the import line from werkzeug import cached_property to from werkzeug.utils import cached_property.

opened by andersonneo67 0
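
The import change described above can be written defensively so the code still loads when werkzeug is absent. The sketch below tries werkzeug.utils first and, as an added assumption not proposed in the pull request, falls back to functools.cached_property (available in Python 3.8+). The Page class is a hypothetical example, not RoboBrowser code.

```python
try:
    # Location that works with Werkzeug 1.0.0 and later (and with older releases)
    from werkzeug.utils import cached_property
except ImportError:
    # Assumed fallback when werkzeug is not installed; requires Python 3.8+
    from functools import cached_property


class Page:
    """Hypothetical class showing that the decorated method runs only once."""

    def __init__(self):
        self.parse_count = 0

    @cached_property
    def parsed(self):
        self.parse_count += 1
        return 'parsed document'


page = Page()
first = page.parsed
second = page.parsed
print(page.parse_count)  # 1
```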

Owner

Joshua Carp

GitHub

Simple library for exploring/scraping the web or testing a website you’re developing

Robox is a simple library with a clean interface for exploring/scraping the web or testing a website you’re developing. Robox can fetch a page, click on links and buttons, and fill out and submit forms.

79 Nov 27, 2022

Use Flask API to wrap Facebook data. Grab the wapper of Facebook public pages without an API key.

Facebook Scraper Use Flask API to wrap Facebook data. Grab the wapper of Facebook public pages without an API key. (Currently working 2021) Setup Befo

2 Dec 27, 2021

Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

trafilatura: Web scraping tool for text discovery and retrieval Description Trafilatura is a Python package and command-line tool which seamlessly dow

704 Jan 6, 2023

Library to scrape and clean web pages to create massive datasets.

lazynlp A straightforward library that allows you to crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this libr

2.1k Jan 6, 2023

Here I provide the source code for doing web scraping using the python library, it is Selenium.

1 Nov 13, 2021

Simple Web scrapper Bot to scrap webpages using Requests, html5lib and Beautifulsoup.

WebScrapperRoBot Simple Web scrapper Bot to scrap webpages using Requests, html5lib and Beautifulsoup. Mark your Star ⭐ ⭐ What is Web Scraping ? Web s

53 Dec 21, 2022

A simple python web scraper.

Dissec A simple python web scraper. It gets a website and its contents and parses them with the help of bs4. Installation To install the requirements,

11 May 6, 2022

A Simple Web Scraper made to Extract Download Links from Todaytvseries2.com

TDTV2-Direct Version 1.00.1 • A Simple Web Scraper made to Extract Download Links from Todaytvseries2.com :) How to Works?? install all dependancies v

1 Nov 28, 2021

A simple django-rest-framework api using web scraping

Apicell You can use this api to search in google, bing, pypi and subscene and get results Method : POST Parameter : query Example import request url =

1 Dec 19, 2021

Dude is a very simple framework for writing web scrapers using Python decorators

Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.

326 Dec 15, 2022

Web Scraping Framework

Grab Framework Documentation Installation $ pip install -U grab See details about installing Grab on different platforms here http://docs.grablib.

2.3k Jan 4, 2023

A Powerful Spider(Web Crawler) System in Python.

pyspider A Powerful Spider(Web Crawler) System in Python. Write script in Python Powerful WebUI with script editor, task monitor, project manager and

15.7k Jan 4, 2023

Scrapy, a fast high-level web crawling & scraping framework for Python.

Scrapy Overview Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pag

45.5k Jan 7, 2023

:arrow_double_down: Dumb downloader that scrapes the web

You-Get NOTICE: Read this if you are looking for the conventional "Issues" tab. You-Get is a tiny command-line utility to download media contents (vid

46.4k Jan 3, 2023

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group

8.4k Jan 8, 2023

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

Async Python 3.6+ web scraping micro-framework based on asyncio

Web Content Retrieval for Humans™

Transistor, a Python web scraping framework for intelligent use cases.