Lightweight, scriptable browser as a service with an HTTP API

Overview

Splash - A JavaScript rendering service


Splash is a JavaScript rendering service with an HTTP API: a lightweight browser implemented in Python 3 using Twisted and Qt5.

It's fast, lightweight, and stateless, which makes it easy to distribute.
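For a first taste of the API, a single HTTP request renders a page (a minimal sketch assuming a local instance on the default port 8050; the wait value just gives the page's scripts time to run):

    curl 'http://localhost:8050/render.html?url=http://example.com&wait=0.5'

The render.png and render.json endpoints work the same way, returning a screenshot or a JSON bundle instead.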

Documentation

Documentation is available here: https://splash.readthedocs.io/

Using Splash with Scrapy

To use Splash with Scrapy, please refer to the scrapy-splash library.
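A minimal configuration sketch, assuming a Splash instance at localhost:8050 (middleware names and priorities follow the scrapy-splash README; check that library's docs for the current values):

    # settings.py: point Scrapy at the Splash instance
    SPLASH_URL = 'http://localhost:8050'

    # enable the scrapy-splash downloader middlewares in the documented order
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'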

Support

Open-source support is provided here on GitHub; please create an issue with the "question" label.

Commercial support is also available from Scrapinghub.

Comments
  • Splash Ignoring Proxy


    Hi all,

    I am running Splash in a docker container on Ubuntu 12.04.5 LTS and am having trouble getting proxy-profiles to work.

    I have this in my /etc/splash/proxy-profiles/crawlera.ini file:

    [proxy]
    host=<mydomain>.crawlera.com
    port=8010
    
    ; optional, default is no auth
    username=<user>
    password=<pass>
    

    and I start the docker container mapping that volume to its equivalent -v /etc/splash/proxy-profiles/:/etc/splash/proxy-profiles/. It appears that by default Splash is launched in the docker container with a flag that tells it where to look for proxy profiles.

    And when I pass the &proxy=crawlera parameter into the typical splash:8050/render.html?ur... URL, it does not throw an error (if I pass a nonexistent proxy profile it shows "proxy profile not found"), so I am confident it is finding the profile.

    In the logs, I am actually seeing:

    2015-06-18 17:33:46.661570 [stats] {"maxrss": 148776, "load": [0.0, 0.01, 0.05], "fds": 50, "qsize": 0, "rendertime": 1.3779900074005127, "active": 0, "path": "/execute", "args": {"lua_source": ["function main(splash)\r\n  local url = splash.args.url\r\n  splash.images_enabled = false\r\n  assert(splash:go(url))\r\n  assert(splash:wait(0.5))\r\n  return {\r\n    html = splash:html(),\r\n    png = splash:png(),\r\n    har = splash:har(),\r\n  }\r\nend"], "url": ["http://www.whatismyip.com"], "proxy": ["crawlera"], "images": ["1"], "expand": ["1"], "wait": ["0.5"]}, "_id": 91663608}
    

    So the proxy parameter is definitely there and recognized... but it doesn't do anything. Visiting http://www.whatismyip.com yields the same IP whether or not I have the proxy parameter on.

    Any ideas? Or thoughts on how to better diagnose the issue?
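    One way to narrow this down: request an IP-echo page through render.html with and without the profile and compare the results (a sketch; it assumes the container is reachable at localhost:8050, and httpbin.org/ip is just a convenient echo service):

    # through the proxy profile
    curl 'http://localhost:8050/render.html?url=http://httpbin.org/ip&proxy=crawlera'

    # direct, for comparison
    curl 'http://localhost:8050/render.html?url=http://httpbin.org/ip'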

    opened by AlexIzydorczyk 29
  • resource_timeout doesn't work in qt5


    Qt5 starts printing "device not open" messages in an infinite loop when a request is aborted in the middle of a download.

    I've opened an issue in qt bug tracker: https://bugreports.qt.io/browse/QTBUG-47654. A short script to reproduce it: https://gist.github.com/kmike/ff287998e02fa953b4a2. This issue makes tests hang on Travis (https://github.com/scrapinghub/splash/pull/260).

    It looks like a blocker for merging the qt5 branch; I'm not sure resource_timeout is the only condition that can trigger this bug (what about a regular timeout?).

    opened by kmike 17
  • Support https proxy


    Hi,

    Thank you very much for this library. I must say you've made tremendous updates to it since the last time I tried to use it (we stopped using it a year ago because it was too troublesome to use with proxies). It seems like Splash does not yet support proxies that use HTTPS or the CONNECT method. Is that something you can easily add or not? Are you open to a PR for something like this?

    Best,

    opened by cp2587 16
  • Lua Scripting - Submitting via JS


    Hi,

    I've been having trouble getting a lua script that calls something like:

    splash:runjs("$(':submit')[0].click()")
    

    to run. That is, Splash doesn't seem to actually render the result of the page after submission; it's as if the submit button had never been clicked in the first place.

    Any idea how to work around this or what I may be doing wrong?

    Thanks, Alex
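    A workaround pattern that may help (a hedged sketch, not verified against the page in question): wait after triggering the click so the post-submit page has time to load before the script returns:

    function main(splash)
      assert(splash:go(splash.args.url))
      assert(splash:wait(1))
      -- fire the submit from JS, then give the navigation time to finish
      splash:runjs("$(':submit')[0].click()")
      assert(splash:wait(2))  -- illustrative delay; tune for the target page
      return splash:html()
    end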

    scripting 
    opened by AlexIzydorczyk 16
  • Fail to restart automatically using docker + splash


    I use the following command to daemonize the process: 'docker run -d -p 8050:8050 --restart=always -v /etc/splash/proxy-profiles:/etc/splash/proxy-profiles scrapinghub/splash:1.5 --maxrss 500'. But whenever the Splash process crashes (it gets killed by the system because it takes too much memory, even though I have 1 GB on the server), Docker fails to restart it with the following error:

    Traceback (most recent call last):
      File "/app/bin/splash", line 3, in <module>
        from splash.server import main
      File "/usr/local/lib/python2.7/dist-packages/splash/server.py", line 12, in <module>
        from splash.qtutils import init_qt_app
      File "/usr/local/lib/python2.7/dist-packages/splash/qtutils.py", line 12, in <module>
        from PyQt4.QtCore import (QAbstractEventDispatcher, QDateTime, QObject,
    ImportError: /usr/lib/python2.7/dist-packages/PyQt4/QtCore.so: cannot read file data: Input/output error
    
    opened by cp2587 15
  • question: Is there a way to get the network traffic while the page is loaded?


    The parameters I'm currently using only give me the page once it has been rendered.

    curl -s -X POST -H 'content-type: application/json' -d "{\"render_all\": 1, \"timeout\": ${timeout}, \"wait\": ${wait}, \"images\": 0, \"url\": \"${url}\", \"html\": 1}" "http://${ip}:${port}/render.json"

    My problem is that the page contains a lot of JavaScript that I'm interested in, which is removed from the final version of the page.

    Is there a way to get the network traffic while the page is loading, like Chrome's "Network" tab, for example?
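    Splash can return the page's network activity as a HAR archive, which is roughly what Chrome's "Network" tab shows. A sketch of the same render.json call with the har flag added (flag name per the Splash HTTP API; the URL is illustrative):

    curl -s -X POST -H 'content-type: application/json' \
      -d '{"url": "http://example.com", "html": 1, "har": 1, "wait": 0.5}' \
      'http://localhost:8050/render.json'

    There is also a dedicated render.har endpoint that returns only the HAR data.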

    opened by ohade 14
  • AttributeError: module 'tornado.web' has no attribute 'asynchronous'


    Hello there, there is a dependency problem in the latest splash-jupyter commit. Because of an nbconvert update, Tornado has been bumped to version 6. I think we should downgrade nbconvert to a version that uses Tornado 5. I'm going to try that and let you know. Kind regards, Ahmed.

    opened by rafikahmed 13
  • Add examples UI and some examples (#286)


    I have added some examples and UI to load the examples as part of issue #286.

    Feel free to give feedback about the UI, the examples and suggestions for other examples.


    opened by dvdbng 13
  • Proxy POST requests


    POST requests do not seem to be going through the HTTP proxy. They do appear in the splash:on_request(callback) and splash:on_response_headers(callback) hooks, but are not seen through the proxy.

    Does anyone have quick thoughts on where this issue may be in the code?

    Related to #239 in that it deals with adding the ability to use splash:http_get(...) with POST requests, but I'm not sure if it's adding support for POST proxying as well.

    opened by munro 13
  • Splash can't render javascript-heavy page


    I'm trying to scrape a JavaScript-heavy page using Scrapy + Splash, but Splash can't render it and just shows a few links and the top menu. There are template tags in the HTML code it returns; my guess is that not all the JavaScript is being executed. Even in the web interface (I tried setting the wait timer to up to 30 seconds) it doesn't render fully. I tried PhantomJS and Selenium; they both work fine, but they are slow compared to Splash. Here is an example page I'm trying to scrape: http://profile.majorleaguegaming.com/crosswrecks/forums

    Any idea about what can cause this? I checked docs, tried changing a few options, but with zero effect. Thanks.

    opened by andverb 13
  • Does Splash handle javascript set cookies?


    I'm interested in building a scraper that goes through a site and finds what cookies are being set, both first-party and third-party. Scrapy's built-in cookie middleware only deals with cookies set in HTTP headers (Set-Cookie). I'm interested in cookies set through other methods, for example JavaScript. So I wonder: how does Splash/scrapyjs handle cookies set through means other than HTTP headers? Does it set JavaScript cookies or handle them in any way? Or are they just ignored?
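    For what it's worth, the Splash scripting API exposes the cookie jar after rendering, and since Splash is a real browser engine this should include cookies set from JavaScript as well. A minimal sketch using splash:get_cookies() (per the Splash scripting reference):

    function main(splash)
      assert(splash:go(splash.args.url))
      assert(splash:wait(1))
      -- return every cookie in the jar, regardless of how it was set
      return {cookies = splash:get_cookies()}
    end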

    opened by mappelgren 13
  • Missing text when requesting page


    I'm trying to download the HTML for https://factory.jcrew.com/p/womens/categories/clothing/dresses/ruffle-tiered-mini-dress/BM068?display=standard&fit=Classic&color_name=black-light-khaki&colorProductCode=BM068 but the response has missing text (like the price) even after I wait for enough time (5 seconds). Is there something I might be missing here?

    opened by Addarsh 0
  • Error: export not found


    I see this error when visiting https://bananarepublic.gap.com/browse/product.do?pid=407431#pdp-page-content via Splash. I tried increasing the wait duration as well, but it didn't solve the problem. I'm not seeing this error on other websites, so I'm not sure what might be missing here.

    opened by Addarsh 0
  • ReferenceError: Can't find variable: IntersectionObserver


    Problem

    Splash seems to throw the error "ReferenceError: Can't find variable: IntersectionObserver" when loading certain websites. From what I can tell this error occurs in older browsers, like earlier versions of Safari, and I guess it could be related to the version of WebKit that Splash uses under the hood. Some Stack Overflow posts state that even the most recent version of Safari (as of a 2019 post) can still throw this error, since the functionality was deemed experimental and older devices disable such features. I don't know if there is a way to tweak the WebKit configuration Splash uses. I've seen this on multiple high-traffic sites, so it seems like core functionality that other browsers have supported for a while now. I raised this issue with Zyte and their suggestion was to use Playwright or Puppeteer instead. I'm quite invested in a system built around Splash and don't have the time it would take to port everything over.

    Steps to Reproduce

    This is the only code that I'm running in a fresh notebook, from the Splash Jupyter notebook Docker image that Zyte provides, set up successfully on OS X with XQuartz for the Qt WebKit browser and inspection tool. To set up the notebook with Splash:

    brew install --cask xquartz
    IP=$(/usr/sbin/ipconfig getifaddr en0) 
    echo $IP 
    /opt/X11/bin/xhost + "$IP"
    docker run   -e QT_DEBUG_PLUGINS=1 \
                 -e DISPLAY="$IP":0 \
                 -v /tmp/.X11-unix:/tmp/.X11-unix \
                 -v $XAUTHORITY:$XAUTHORITY \
                 -e XAUTHORITY=$XAUTHORITY \
                 -p 8888:8888 \
                 -it scrapinghub/splash-jupyter --disable-xvfb
    

    Then from a new Splash notebook instance:

    splash:on_request(function (request)
          request:set_header('X-Crawlera-Cookies', 'disable')
          request:set_header('X-Crawlera-Profile', 'desktop')
          request:set_header('X-Crawlera-Timeout', '5000')
          request:set_proxy{
              host = "<proxy endpoint>",
              port = "8010",
              username = "<password>",
              password = ""
          }
    end)
    
    splash.private_mode_enabled = false
    assert(splash.private_mode_enabled == false)
    
    splash:go("https://byjus.com/question-answer/why-is-air-called-breath-of-life-enumerate-functions-of-air-or-atmosphere/")
    

    After running this, parts of the page don't render, and using the browser inspection tool provided with this Splash browser I can see the IntersectionObserver error being thrown in the console, with cascading errors following. I've observed this on multiple sites now.
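    A possible mitigation (an untested sketch): use splash:autoload() to register a small JavaScript stub before calling splash:go(), so pages that merely reference IntersectionObserver don't throw. The stub below only prevents the ReferenceError; it does not emulate real observer behavior.

    splash:autoload([[
      if (!window.IntersectionObserver) {
        // no-op stub: observe/unobserve/disconnect do nothing
        window.IntersectionObserver = function (callback) {
          return {
            observe: function () {},
            unobserve: function () {},
            disconnect: function () {}
          };
        };
      }
    ]])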

    opened by brett--anderson 0
  • Splash fails to render a specific page


    My issue is related to querying the URL below using Splash version 3.5:

    • https://leismunicipais.com.br/a/sp/s/sorocaba/decreto/2022/2738/27375/decreto-n-27375-2022-dispoe-sobre-a-revogacao-do-decreto-n-23551-de-13-de-marco-de-2018-que-dispoe-sobre-permissao-de-uso-a-titulo-precario-de-bem-publico-municipal-e-da-outras-providencias

    Even after many hours, it doesn't show any results.

    opened by joaodjvitor 5
  • `BadOption: headers is not implemented` raised when headers is null in scrapy spider


    When using the chromium engine in a spider to render a page that cannot be rendered with the webkit engine, running scrapy crawl example with this spider:

    import scrapy
    from scrapy_splash import SplashRequest
    
    
    class ExampleSpider(scrapy.Spider):
        name = 'example'
        url = 'http://example.com/'
    
        def start_requests(self):
            yield SplashRequest(
                url=self.url,
                callback=self.parse,
                endpoint='render.html',  # have gone with and without specifying
                headers=None,  # have gone with and without specifying
                args={
                    'wait': 0.5,
                    'engine': 'chromium',
                    # 'headers': None,  # also tried this
                },
            )
    
        def parse(self, response, **kwargs):
            yield None
    

    However, I get:

    2022-11-02 19:56:00 [scrapy.core.scraper] ERROR: Error downloading <GET http://example.com/ via http://localhost:8050/render.html>
    Traceback (most recent call last):
      File "/path/venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1696, in _inlineCallbacks
        result = context.run(gen.send, result)
      File "/path/venv/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 60, in process_response
        response = yield deferred_from_coro(method(request=request, response=response, spider=spider))
      File "/path/venv/lib/python3.10/site-packages/scrapy_splash/middleware.py", line 412, in process_response
        response = self._change_response_class(request, response)
      File "/path/venv/lib/python3.10/site-packages/scrapy_splash/middleware.py", line 433, in _change_response_class
        response = response.replace(cls=respcls, request=request)
      File "/path/venv/lib/python3.10/site-packages/scrapy/http/response/__init__.py", line 117, in replace
        return cls(*args, **kwargs)
      File "/path/venv/lib/python3.10/site-packages/scrapy_splash/response.py", line 119, in __init__
        self._load_from_json()
      File "/path/venv/lib/python3.10/site-packages/scrapy_splash/response.py", line 165, in _load_from_json
        error = self.data['info']['error']
    TypeError: string indices must be integers
    

    The root of this issue is found by looking through the Splash logs:

    2022-11-02 23:56:00.000802 [-] Unhandled error in Deferred:
    2022-11-02 23:56:00.000931 [-] Unhandled Error
            Traceback (most recent call last):
              File "/app/splash/pool.py", line 47, in render
                self.queue.put(slot)
              File "/usr/local/lib/python3.6/dist-packages/twisted/internet/defer.py", line 1872, in put
                self.waiting.pop(0).callback(obj)
              File "/usr/local/lib/python3.6/dist-packages/twisted/internet/defer.py", line 460, in callback
                self._startRunCallbacks(result)
              File "/usr/local/lib/python3.6/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
                self._runCallbacks()
            --- <exception caught here> ---
              File "/usr/local/lib/python3.6/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
                current.result = callback(current.result, *args, **kw)
              File "/app/splash/pool.py", line 76, in _start_render
                render.start(**slot_args.kwargs)
              File "/app/splash/engines/chromium/render_scripts.py", line 59, in start
                raise BadOption("headers is not implemented")
            splash.errors.BadOption: headers is not implemented
    

    This doesn't seem to be a problem when using it in the browser or with curl.

    Versions

    Python 3.10.7, Scrapy 2.7.0, scrapy-splash 0.8.0, Splash 3.5.0

    opened by yoonthegoon 0
  • Bug? splash 3.0+ instances locking up on certain SSL requests. Does not happen on 2.3.3


    My issue happens on Splash 3.0 and 3.5 but NOT on 2.3.3. I am currently running prod on 2.3.3 as a workaround and would like a permanent solution so I can run 3.x.

    I have been running Splash + HAProxy, set up by Aquarium, for years before experiencing this issue, including successfully rendering the sites in question without issue until the day before yesterday.

    Here is a URL that consistently produces the issue, even simply using render.html from [host]:8050: https://www.schooljobs.com/careers/kirkwoodcc/jobs/3776251/adjunct-dental-hygiene

    This happens with the default Aquarium configuration.

    It happens in both dev (macOS 15+) and prod (Ubuntu) environments, and I did try wiping all my containers and starting over with Aquarium. Splash works fine for other URLs, but the one above and some others kill it. Every time, it immediately locks up the entire Docker container, and the HAProxy stats show a layer 7 timeout (Splash 3.5) or a layer 4 timeout (3.0).

    I cannot attach to a Splash Docker instance that hangs in this way; if I try, my terminal hangs.

    Thanks to docker-compose with Aquarium I can watch Splash output live. On 3.5 I often don't even get to see output of the request starting; sometimes I just see the request and then no more output as the instance hangs. On 3.0 only, I get some additional info in the logs.

    I have googled the network issue and found a bunch of issues right here in this repo with no clear answers about what is going on.

    Happy to be very responsive; please let me know if more info is needed. I want to get back to Splash 3.x.

    opened by minispeck 13
Owner

Scrapinghub: Turn web content into useful data

Related projects
Main purpose of this project is to provide the service to automate the API testing process

PPTester project Main purpose of this project is to provide the service to automate the API testing process. In order to deploy this service use you s

null 4 Dec 16, 2021
A browser automation framework and ecosystem.

Selenium Selenium is an umbrella project encapsulating a variety of tools and libraries enabling web browser automation. Selenium specifically provide

Selenium 25.5k Jan 1, 2023
pytest splinter and selenium integration for anyone interested in browser interaction in tests

Splinter plugin for the pytest runner Install pytest-splinter pip install pytest-splinter Features The plugin provides a set of fixtures to use splin

pytest-dev 238 Nov 14, 2022
User-oriented Web UI browser tests in Python

Selene - User-oriented Web UI browser tests in Python (Selenide port) Main features: User-oriented API for Selenium Webdriver (code like speak common

Iakiv Kramarenko 575 Jan 2, 2023
Doggo Browser

Doggo Browser Quick Start $ python3 -m venv ./venv/ $ source ./venv/bin/activate $ pip3 install -r requirements.txt $ ./sobaki.py References Heavily I

Alexey Kutepov 9 Dec 12, 2022
LuluTest is a Python framework for creating automated browser tests.

LuluTest LuluTest is an open source browser automation framework using Python and Selenium. It is relatively lightweight in that it mostly provides wr

Erik Whiting 14 Sep 26, 2022
Free cleverbot without headless browser

Cleverbot Scraper Simple free cleverbot library that doesn't require running a heavy ram wasting headless web browser to actually chat with the bot, a

Matheus Fillipe 3 Sep 25, 2022
1st Solution to QQ Browser 2021 AIAC Track 2

1st Solution to QQ Browser 2021 AIAC Track 2 This repository is the winning solution to QQ Browser 2021 AI Algorithm Competition Track 2 Automated Hyp

DAIR Lab 24 Sep 10, 2022
Browser reload with uvicorn

uvicorn-browser This project is inspired by autoreload. Installation pip install uvicorn-browser Usage Run uvicorn-browser --help to see all options.

Marcelo Trylesinski 64 Dec 17, 2022
FFPuppet is a Python module that automates browser process related tasks to aid in fuzzing

FFPuppet FFPuppet is a Python module that automates browser process related tasks to aid in fuzzing. Happy bug hunting! Are you fuzzing the browser? G

Mozilla Fuzzing Security 24 Oct 25, 2022
Repository for JIDA SNP Browser Web Application: Local Deployment

JIDA JIDA is a web application that retrieves SNP information for a genomic region of interest in Homo sapiens and calculates specific summary statist

null 3 Mar 3, 2022
HTTP client mocking tool for Python - inspired by Fakeweb for Ruby

HTTPretty 1.0.5 HTTP Client mocking tool for Python created by Gabriel Falcão . It provides a full fake TCP socket module. Inspired by FakeWeb Github

Gabriel Falcão 2k Jan 6, 2023
Automatically mock your HTTP interactions to simplify and speed up testing

VCR.py This is a Python version of Ruby's VCR library. Source code https://github.com/kevin1024/vcrpy Documentation https://vcrpy.readthedocs.io/ R

Kevin McCarthy 2.3k Jan 1, 2023
An interactive TLS-capable intercepting HTTP proxy for penetration testers and software developers.

mitmproxy mitmproxy is an interactive, SSL/TLS-capable intercepting proxy with a console interface for HTTP/1, HTTP/2, and WebSockets. mitmdump is the

mitmproxy 29.7k Jan 2, 2023
Wraps any WSGI application and makes it easy to send test requests to that application, without starting up an HTTP server.

WebTest This wraps any WSGI application and makes it easy to send test requests to that application, without starting up an HTTP server. This provides

Pylons Project 325 Dec 30, 2022
One-stop solution for HTTP(S) testing.

HttpRunner HttpRunner is a simple & elegant, yet powerful HTTP(S) testing framework. Enjoy! ✨ ✨ Design Philosophy Convention over configuration ROI

HttpRunner 3.5k Jan 4, 2023
Declarative HTTP Testing for Python and anything else

Gabbi Release Notes Gabbi is a tool for running HTTP tests where requests and responses are represented in a declarative YAML-based form. The simplest

Chris Dent 139 Sep 21, 2022