Lightweight, scriptable browser as a service with an HTTP API

Overview

Splash - A javascript rendering service

Join the chat at https://gitter.im/scrapinghub/splash

Splash is a JavaScript rendering service: a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and Qt5.

It's fast, lightweight and stateless, which makes it easy to distribute.
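
For example, with a Splash instance listening on localhost:8050 (e.g. started with docker run -p 8050:8050 scrapinghub/splash), a page can be rendered with a single HTTP request. A minimal sketch using Python's requests library and the render.html endpoint:

    import requests

    # Render a page with the default engine and return its HTML after
    # waiting 0.5 seconds for the page's JavaScript to run.
    resp = requests.get(
        "http://localhost:8050/render.html",
        params={"url": "https://example.com", "wait": 0.5},
    )
    resp.raise_for_status()
    print(resp.text[:200])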

Documentation

Documentation is available here: https://splash.readthedocs.io/

Using Splash with Scrapy

To use Splash with Scrapy, please refer to the scrapy-splash library.
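
As a rough sketch of what that integration looks like (the target URL and CSS selectors here are placeholders, and SPLASH_URL plus the scrapy-splash middlewares still have to be configured in settings.py as the scrapy-splash README describes):

    import scrapy
    from scrapy_splash import SplashRequest

    class QuotesSpider(scrapy.Spider):
        # Renders each page through Splash before parsing it.
        name = "quotes_js"

        def start_requests(self):
            yield SplashRequest(
                "http://quotes.toscrape.com/js/",
                callback=self.parse,
                args={"wait": 0.5},  # give the page's JavaScript time to run
            )

        def parse(self, response):
            for text in response.css("div.quote span.text::text").getall():
                yield {"text": text}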

Support

Open-source support is provided here on GitHub. Please create a "question" issue.

Commercial support is also available from Scrapinghub.

Issues
  • Splash Ignoring Proxy

    Hi all,

    I am running Splash in a docker container on Ubuntu 12.04.5 LTS and am having trouble getting proxy-profiles to work.

    I have this in my /etc/splash/proxy-profiles/crawlera.ini file:

    [proxy]
    host=<mydomain>.crawlera.com
    port=8010
    
    ; optional, default is no auth
    username=<user>
    password=<pass>
    

    and I start the docker container mapping that volume to its equivalent -v /etc/splash/proxy-profiles/:/etc/splash/proxy-profiles/. It appears that by default Splash is launched in the docker container with a flag that tells it where to look for proxy profiles.

    And when I pass the &proxy=crawlera parameter into the typical splash:8050/?render.html?ur... url, it does not throw an error (if I pass a nonexistent proxy-profile it shows "proxy profile not found"), so I am confident it is finding the profile.

    In the logs, I am actually seeing:

    2015-06-18 17:33:46.661570 [stats] {"maxrss": 148776, "load": [0.0, 0.01, 0.05], "fds": 50, "qsize": 0, "rendertime": 1.3779900074005127, "active": 0, "path": "/execute", "args": {"lua_source": ["function main(splash)\r\n  local url = splash.args.url\r\n  splash.images_enabled = false\r\n  assert(splash:go(url))\r\n  assert(splash:wait(0.5))\r\n  return {\r\n    html = splash:html(),\r\n    png = splash:png(),\r\n    har = splash:har(),\r\n  }\r\nend"], "url": ["http://www.whatismyip.com"], "proxy": ["crawlera"], "images": ["1"], "expand": ["1"], "wait": ["0.5"]}, "_id": 91663608}
    

    So the proxy parameter is definitely there and recognized... but it doesn't do anything. Visiting http://www.whatismyip.com yields the same IP whether or not I have the proxy parameter on.

    Any ideas? Or thoughts on how to better diagnose the issue?

    opened by AlexIzydorczyk 29
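
    For anyone diagnosing a similar setup, a minimal sketch of checking whether a proxy profile is actually applied: render an IP-echo page with and without the proxy argument and compare the results (the crawlera profile name comes from the issue above; the echo URL is just an example):

        import requests

        SPLASH = "http://localhost:8050/render.html"
        TARGET = "https://httpbin.org/ip"  # echoes the caller's IP

        def rendered_ip(proxy_profile=None):
            params = {"url": TARGET, "wait": 0.5}
            if proxy_profile:
                # Name of a file under /etc/splash/proxy-profiles/<name>.ini
                params["proxy"] = proxy_profile
            resp = requests.get(SPLASH, params=params)
            resp.raise_for_status()
            return resp.text

        # If the profile is applied, the two responses should show different IPs.
        print(rendered_ip())
        print(rendered_ip("crawlera"))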
  • resource_timeout doesn't work in qt5

    QT5 starts printing "device not open" messages in an infinite loop when a request is aborted in the middle of a download.

    I've opened an issue in qt bug tracker: https://bugreports.qt.io/browse/QTBUG-47654. A short script to reproduce it: https://gist.github.com/kmike/ff287998e02fa953b4a2. This issue makes tests hang on Travis (https://github.com/scrapinghub/splash/pull/260).

    It looks like a blocker for merging the qt5 branch; I'm not sure resource_timeout is the only condition that can trigger this bug (what about a regular timeout?).

    opened by kmike 17
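
    For context, resource_timeout is the per-request limit being discussed here: it aborts individual resources (images, scripts, XHRs) that take longer than the given number of seconds. A minimal sketch of setting it through the HTTP API (values are arbitrary):

        import requests

        # resource_timeout limits each individual resource; timeout limits
        # the whole render.
        resp = requests.get(
            "http://localhost:8050/render.html",
            params={"url": "https://example.com", "resource_timeout": 10, "timeout": 30},
        )
        print(resp.status_code)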
  • Support https proxy

    Hi,

    Thank you very much for this library. I must say you've made tremendous updates to it since the last time I tried to use it (we stopped using it a year ago because it was too troublesome to use with proxies). It seems like Splash does not yet support proxies that use HTTPS or the CONNECT method. Is that something you can easily add or not? Are you open to a PR for something like this?

    Best,

    opened by cp2587 16
  • Lua Scripting - Submitting via JS

    Hi,

    I've been having trouble getting a lua script that calls something like:

    splash:runjs("$(':submit')[0].click()")
    

    to run. That is, Splash doesn't seem to actually render the result of the page after submission; it's as if the submit button had never been clicked in the first place.

    Any idea how to work around this or what I may be doing wrong?

    Thanks, Alex

    scripting 
    opened by AlexIzydorczyk 16
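
    A common pattern for this kind of problem is to run the click and then explicitly wait, so the post-submit page has time to load before the script returns. A minimal sketch via the /execute endpoint (the selector and URL are placeholders):

        import requests

        lua = """
        function main(splash, args)
          assert(splash:go(args.url))
          assert(splash:wait(1))
          splash:runjs("document.querySelector('input[type=submit]').click()")
          assert(splash:wait(3))  -- let the post-submit page load
          return {html = splash:html(), png = splash:png()}
        end
        """

        resp = requests.post(
            "http://localhost:8050/execute",
            json={"lua_source": lua, "url": "https://example.com/form"},
        )
        resp.raise_for_status()
        print(sorted(resp.json().keys()))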
  • Fail to restart automatically using docker + splash

    I use the following command to daemonize the process: 'docker run -d -p 8050:8050 --restart=always -v /etc/splash/proxy-profiles:/etc/splash/proxy-profiles scrapinghub/splash:1.5 --maxrss 500', but whenever the splash process crashes (it gets killed by the system for using too much memory, even though I have 1GB on the server), docker fails to restart it with the following error:

    Traceback (most recent call last):
      File "/app/bin/splash", line 3, in <module>
        from splash.server import main
      File "/usr/local/lib/python2.7/dist-packages/splash/server.py", line 12, in <module>
        from splash.qtutils import init_qt_app
      File "/usr/local/lib/python2.7/dist-packages/splash/qtutils.py", line 12, in <module>
        from PyQt4.QtCore import (QAbstractEventDispatcher, QDateTime, QObject,
    ImportError: /usr/lib/python2.7/dist-packages/PyQt4/QtCore.so: cannot read file data: Input/output error
    
    opened by cp2587 15
  • question: Is there a way to get the network traffic while the page is loaded?

    The parameters I currently use only give me the page once it has been rendered.

    ( curl -s -X POST -H 'content-type: application/json' -d "{\"render_all\": 1, \"timeout\": ${timeout}, \"wait\": ${wait}, \"images\": 0, \"url\": \"${url}\", \"html\": 1}" http://${ip}:${port}/render.json )

    My problem is that the page contains a lot of JavaScript that I am interested in, but it is removed from the final version of the page.

    Is there a way to get the network traffic while the page is loaded, like the "Network" tab in Chrome, for example?

    opened by ohade 14
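
    Splash can answer exactly this through HAR: render.json accepts har=1 and returns an archive of every request and response made while the page was loading, alongside the HTML. A minimal sketch:

        import requests

        resp = requests.post(
            "http://localhost:8050/render.json",
            json={"url": "https://example.com", "wait": 0.5, "html": 1, "har": 1},
        )
        resp.raise_for_status()
        data = resp.json()

        # Every network request made during the page load, devtools-style.
        for entry in data["har"]["log"]["entries"]:
            print(entry["response"]["status"], entry["request"]["url"])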
  • AttributeError: module 'tornado.web' has no attribute 'asynchronous'

    Hello there, there is a dependency problem in the latest splash-jupyter commit. Because of an nbconvert update, tornado has been bumped to version 6. I think we should downgrade nbconvert to a version that uses tornado 5. I'm going to try that and let you know. Kind regards, Ahmed.

    opened by rafikahmed 13
  • Add examples UI and some examples (#286)

    I have added some examples and UI to load the examples as part of issue #286.

    Feel free to give feedback about the UI, the examples and suggestions for other examples.

    opened by dvdbng 13
  • Proxy POST requests

    POST requests do not seem to be going through the HTTP proxy. They do appear in the splash:on_request(callback) and splash:on_response_headers(callback) hooks, but are not seen through the proxy.

    Does anyone have quick thoughts on where this issue may be in the code?

    Related to #239 in that it deals with adding the ability to use splash:http_get(...) with POST requests, but I'm not sure if it's adding support for POST proxying as well.

    opened by munro 13
  • Splash can't render javascript-heavy page

    I'm trying to scrape a javascript-heavy page using Scrapy+Splash, but Splash can't render it and just shows a few links and the top menu. There are template tags in the HTML code it returns, so my guess is that not all the JavaScript is being executed, even in the web interface (I tried setting the wait timer to up to 30 seconds). I tried PhantomJS and Selenium; they both work fine, but they are slow compared to Splash. Here is an example page I'm trying to scrape: http://profile.majorleaguegaming.com/crosswrecks/forums

    Any idea about what can cause this? I checked docs, tried changing a few options, but with zero effect. Thanks.

    opened by andverb 13
  • Does Splash handle javascript set cookies?

    I'm interested in building a scraper that goes through a site and finds what cookies are being set, both first-party and third-party. Scrapy's built-in cookie middleware only deals with cookies set in HTTP headers (Set-Cookie). I'm interested in cookies set through other methods, for example JavaScript. Therefore I wonder: how does Splash/scrapyjs handle cookies set through means other than HTTP headers? Does it set JavaScript cookies or handle them in any way? Or are they just ignored?

    opened by mappelgren 13
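
    For reference, a minimal sketch of inspecting the full cookie jar after a page (and its JavaScript) has run, using the /execute endpoint and splash:get_cookies(), which returns every cookie regardless of how it was set:

        import requests

        lua = """
        function main(splash, args)
          assert(splash:go(args.url))
          assert(splash:wait(1))
          -- includes cookies set from JavaScript, not just Set-Cookie headers
          return {cookies = splash:get_cookies()}
        end
        """

        resp = requests.post(
            "http://localhost:8050/execute",
            json={"lua_source": lua, "url": "https://example.com"},
        )
        resp.raise_for_status()
        for cookie in resp.json()["cookies"]:
            print(cookie["name"], cookie.get("domain"))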
  • Server Side Rendering Issue

    Websites that use server-side rendering could not be rendered with the Splash WebKit engine. However, when changing the rendering engine to Chromium, the website renders perfectly. To make sure, I developed a simple React app that uses server-side rendering with the hydrate method; it only worked with Chromium, not WebKit. Many other sites that use the hydration method also could not be rendered by the WebKit engine. Here is the code used for the React app: https://www.digitalocean.com/community/tutorials/react-server-side-rendering

    opened by Logesh08 1
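
    A minimal sketch of selecting the Chromium engine through the HTTP API, as this issue suggests for server-side-rendered pages (the engine argument also shows up in the js_source issue's log further down; the container has to be started with Chromium enabled, e.g. --browser-engines=chromium,webkit):

        import requests

        resp = requests.get(
            "http://localhost:8050/render.html",
            params={
                "url": "https://example.com",  # placeholder for the SSR page
                "engine": "chromium",
                "wait": 2,
            },
        )
        resp.raise_for_status()
        print(resp.text[:200])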
  • Splash 3.5 returning "js_source is not implemented"

    I'm using Splash with the docker compose file below:

    version: '3.7'
    services:
      splash:
        image: "scrapinghub/splash:3.5"
        restart: always
        container_name: "splash"
        ports:
          - "8050:8050"
        mem_limit: 1024m
        command: "--disable-private-mode -v3 --js-cross-domain-access --browser-engines=chromium"
    

    Using Python requests to query render.html with js_source, as described on the documentation page:

    import requests
    
    js = """document.querySelector("button[data-tracking='cc-accept']").click();
        document.querySelector("div[data-testid='headerImageClickArea']").click();
        document.querySelector("button.image-gallery-icon.image-gallery-right-nav").click();
    """
    url = 'anyurl'
    resp = requests.get(f'http://localhost:8050/render.html?url={url}&engine=chromium&js_source={js}')
    
    f = open("./chromium-engine.html", "w")
    f.write(str(resp.content))
    f.close()
    

    2022-05-11 12:35:22.020634 [events] {"path": "/render.html", "rendertime": 0.1633293628692627, "maxrss": 318320, "load": [0.31, 0.41, 0.4], "fds": 53, "active": 0, "qsize": 0, "_id": 140135661564816, "method": "GET", "timestamp": 1652272522, "user-agent": "python-requests/2.26.0", "args": {"url": "anyurl", "engine": "chromium", "js_source": "document.querySelector("button[data-tracking='cc-accept']").click()", "\n document.querySelector("div[data-testid": "'headerImageClickArea']").click()", "uid": 140135661564816}, "status_code": 400, "client_ip": "172.22.0.1", "error": {"error": 400, "type": "BadOption", "description": "Incorrect HTTP API arguments", "info": "js_source is not implemented"}}

    opened by danilo4pm 0
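
    Two things stand out in that log: the Chromium engine answers that js_source is not implemented, and the script was split into several arguments because it was interpolated into the URL unencoded. A sketch that stays on the default WebKit engine (which accepts js_source, assuming the container allows it, e.g. --browser-engines=webkit,chromium) and lets requests handle the encoding:

        import requests

        js = """
        document.querySelector("button[data-tracking='cc-accept']").click();
        """

        # params= URL-encodes js_source, so quotes and newlines survive intact.
        resp = requests.get(
            "http://localhost:8050/render.html",
            params={"url": "anyurl", "js_source": js, "wait": 1},
        )
        resp.raise_for_status()

        with open("./webkit-engine.html", "w") as f:
            f.write(resp.text)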
  • Errors when rendering zimuku.org

    First, thanks to the developers of this great tool! I am trying to build a subtitle-downloading automation based on Splash.

    I am deploying a docker image: docker://scrapinghub/splash. Rendering www.google.com works fine, but it fails when rendering www.zimuku.org. The Lua script used is like the following:

    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(5))
      return {
        html = splash:html()
      }
    end

    The error thrown is like:

    { "error": 400, "type": "ScriptError", "description": "Error happened while executing Lua script", "info": { "source": "[string "function main(splash, args)\r..."]", "line_number": 2, "error": "http404", "type": "LUA_ERROR", "message": "Lua error: [string "function main(splash, args)\r..."]:2: http404" } }

    In developer mode, I see this (screenshot attached).

    I am new to Splash. Please give me some suggestions about how to debug this. Thanks!

    opened by crotoc 2
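
    One way to start debugging this: splash:go() makes assert() fail with "http404" whenever the site answers with a 404 status, even if it also serves usable HTML. A minimal sketch that checks the return value instead, so the script still returns whatever was rendered:

        import requests

        lua = """
        function main(splash, args)
          local ok, reason = splash:go(args.url)
          if not ok and not string.find(reason, "^http") then
            error(reason)  -- a network-level error, not an HTTP status code
          end
          assert(splash:wait(5))
          return {html = splash:html(), go_error = reason}
        end
        """

        resp = requests.post(
            "http://localhost:8050/execute",
            json={"lua_source": lua, "url": "http://www.zimuku.org"},
        )
        print(resp.status_code)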
  • Splash is no longer able to render a page with JavaScript

    Hi there!

    Kudos to you guys for making some amazing software!

    Up until recently, I've been able to successfully parse this website 'https://www.eversource.com/security/account/login' (amongst many others).

    Unfortunately, I believe the website maintainers recently changed something on the back end, and now the site no longer renders correctly.

    The expected result would be a typical login screen where it asks for User and Password. Instead, it essentially only shows the navbar and the footer.

    I've reviewed and tried all of the suggestions made in https://splash.readthedocs.io/en/stable/faq.html#website-is-not-rendered-correctly, most notably:

    1. Waiting to ensure the site renders completely using 'splash:wait'
    2. Specifying different user agents using splash:set_user_agent
    3. Disabling private mode (using --disable-private-mode or splash.private_mode_enabled = false)

    I normally run Splash with the following command (on Ubuntu Linux 20): 'sudo docker run -p 8050:8050 --memory=1G --restart=always scrapinghub/splash --disable-private-mode --max-timeout 3600 --maxrss 1024 -v3'

    Currently, I'm running the following versions:

    [-] Splash version: 3.5
    [-] Qt 5.14.1, PyQt 5.14.2, WebKit 602.1, Chromium 77.0.3865.129, sip 4.19.22, Twisted 19.7.0, Lua 5.2
    [-] Python 3.6.9 (default, Jul 17 2020, 12:50:27) [GCC 8.4.0]

    The easiest way to reproduce this issue would be to run the following in the splash UI aka (http://localhost:8050):

    function main(splash, args)
      splash.resource_timeout = 0
      splash.private_mode_enabled = false
      splash:set_user_agent('Mozilla/5.0 (Windows NT 6.1; rv:51.0) Gecko/20100101 Firefox/51.0')

      local login_url = 'https://www.eversource.com/security/account/login'

      assert(splash:go(login_url))
      assert(splash:wait(10))

      return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
      }
    end

    The only clues I have seen are a few errors in the verbose output of splash when run with '-v3'. Specifically, I see the following:

    '[render] JsConsole(https://www.eversource.com/content/UserControls/PrimaryNavNew/PrimaryNavNew.ascx.js:69): TypeError: item of items is not a function. (In 'item of items', 'item of items' is undefined)
    [render] JsConsole(https://www.eversource.com/content/WebsiteTemplates/NU/js/AppD/jsagent/adrum/adrum.js:27): TypeError: |this| is not a object
    [render] JsConsole(https://cdn.eversource.com/prod/ms-login/2022.2.2.13/static/js/main.bundle.js:2): TypeError: |this| is not a object '

    Note that I'm able to access this page (and see the login page) using a normal browser (I've used both Safari and Firefox).

    I guess my main question is... is there something that I can do to get this to render again, or is the version of splash WebKit simply incompatible?

    I currently have a webapp where I'm using scrapy combined with splash to parse a number of utility sites. If splash is no longer capable of rendering websites using modern javascript, then I may need to move to some other solution. This is a bummer to me, because so far I've been happy with the performance and capabilities.

    Thanks in advance for any assistance you could provide.

    P.S. If there's any other supporting information that I could give, please let me know!

    opened by utilitylens 2