Lightweight, scriptable browser as a service with an HTTP API

Overview

Splash - A javascript rendering service

Join the chat at https://gitter.im/scrapinghub/splash

Splash is a JavaScript rendering service: a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and Qt5.

It's fast, lightweight and stateless, which makes it easy to distribute.
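
For example, with a Splash instance listening on localhost:8050 (e.g. started with docker run -p 8050:8050 scrapinghub/splash), a page can be rendered with a single HTTP request. A minimal sketch using Python's requests library and the render.html endpoint:

    import requests

    # Render a page with the default engine and return its HTML after
    # waiting 0.5 seconds for the page's JavaScript to run.
    resp = requests.get(
        "http://localhost:8050/render.html",
        params={"url": "https://example.com", "wait": 0.5},
    )
    resp.raise_for_status()
    print(resp.text[:200])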

Documentation

Documentation is available here: https://splash.readthedocs.io/

Using Splash with Scrapy

To use Splash with Scrapy, please refer to the scrapy-splash library.
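
As a rough sketch of what that integration looks like (the target URL and CSS selectors here are placeholders, and SPLASH_URL plus the scrapy-splash middlewares still have to be configured in settings.py as the scrapy-splash README describes):

    import scrapy
    from scrapy_splash import SplashRequest

    class QuotesSpider(scrapy.Spider):
        # Renders each page through Splash before parsing it.
        name = "quotes_js"

        def start_requests(self):
            yield SplashRequest(
                "http://quotes.toscrape.com/js/",
                callback=self.parse,
                args={"wait": 0.5},  # give the page's JavaScript time to run
            )

        def parse(self, response):
            for text in response.css("div.quote span.text::text").getall():
                yield {"text": text}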

Support

Open-source support is provided here on GitHub. Please create a "question" issue.

Commercial support is also available from Scrapinghub.

Issues
  • Splash Ignoring Proxy

    Hi all,

    I am running Splash in a docker container on Ubuntu 12.04.5 LTS and am having trouble getting proxy-profiles to work.

    I have this in my /etc/splash/proxy-profiles/crawlera.ini file:

    [proxy]
    host=<mydomain>.crawlera.com
    port=8010
    
    ; optional, default is no auth
    username=<user>
    password=<pass>
    

    and I start the docker container mapping that volume to its equivalent -v /etc/splash/proxy-profiles/:/etc/splash/proxy-profiles/. It appears that by default Splash is launched in the docker container with a flag that tells it where to look for proxy profiles.

    And when I pass the &proxy=crawlera parameter into the typical splash:8050/?render.html?ur... url, it does not throw an error (if I pass a nonexistent proxy-profile it shows "proxy profile not found"), so I am confident it is finding the profile.

    In the logs, I am actually seeing:

    2015-06-18 17:33:46.661570 [stats] {"maxrss": 148776, "load": [0.0, 0.01, 0.05], "fds": 50, "qsize": 0, "rendertime": 1.3779900074005127, "active": 0, "path": "/execute", "args": {"lua_source": ["function main(splash)\r\n  local url = splash.args.url\r\n  splash.images_enabled = false\r\n  assert(splash:go(url))\r\n  assert(splash:wait(0.5))\r\n  return {\r\n    html = splash:html(),\r\n    png = splash:png(),\r\n    har = splash:har(),\r\n  }\r\nend"], "url": ["http://www.whatismyip.com"], "proxy": ["crawlera"], "images": ["1"], "expand": ["1"], "wait": ["0.5"]}, "_id": 91663608}
    

    So the proxy parameter is definitely there and recognized... but it doesn't do anything. Visiting http://www.whatismyip.com yields the same IP whether or not I have the proxy parameter on.

    Any ideas? Or thoughts on how to better diagnose the issue?

    opened by AlexIzydorczyk 29
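
    For anyone diagnosing a similar setup, a minimal sketch of checking whether a proxy profile is actually applied: render an IP-echo page with and without the proxy argument and compare the results (the crawlera profile name comes from the issue above; the echo URL is just an example):

        import requests

        SPLASH = "http://localhost:8050/render.html"
        TARGET = "https://httpbin.org/ip"  # echoes the caller's IP

        def rendered_ip(proxy_profile=None):
            params = {"url": TARGET, "wait": 0.5}
            if proxy_profile:
                # Name of a file under /etc/splash/proxy-profiles/<name>.ini
                params["proxy"] = proxy_profile
            resp = requests.get(SPLASH, params=params)
            resp.raise_for_status()
            return resp.text

        # If the profile is applied, the two responses should show different IPs.
        print(rendered_ip())
        print(rendered_ip("crawlera"))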
  • resource_timeout doesn't work in qt5

    QT5 starts printing "device not open" messages in an infinite loop when a request is aborted in the middle of a download.

    I've opened an issue in qt bug tracker: https://bugreports.qt.io/browse/QTBUG-47654. A short script to reproduce it: https://gist.github.com/kmike/ff287998e02fa953b4a2. This issue makes tests hang on Travis (https://github.com/scrapinghub/splash/pull/260).

    It looks like a blocker for merging the qt5 branch; I'm not sure resource_timeout is the only condition that can trigger this bug (what about a regular timeout?).

    opened by kmike 17
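
    For context, resource_timeout is the per-request limit being discussed here: it aborts individual resources (images, scripts, XHRs) that take longer than the given number of seconds. A minimal sketch of setting it through the HTTP API (values are arbitrary):

        import requests

        # resource_timeout limits each individual resource; timeout limits
        # the whole render.
        resp = requests.get(
            "http://localhost:8050/render.html",
            params={"url": "https://example.com", "resource_timeout": 10, "timeout": 30},
        )
        print(resp.status_code)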
  • Support https proxy

    Hi,

    Thank you very much for this library. I must say you've made tremendous updates to it since the last time I tried to use it (we stopped using it a year ago because it was too troublesome to use with proxies). It seems like Splash does not yet support proxies that use HTTPS or the CONNECT method. Is that something you can easily add or not? Are you open to a PR for something like this?

    Best,

    opened by cp2587 16
  • Lua Scripting - Submitting via JS

    Hi,

    I've been having trouble getting a lua script that calls something like:

    splash:runjs("$(':submit')[0].click()")
    

    to run. That is, Splash doesn't seem to actually render the result of the page after submission; it's as if the submit button had never been clicked in the first place.

    Any idea how to work around this or what I may be doing wrong?

    Thanks, Alex

    scripting 
    opened by AlexIzydorczyk 16
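
    A common pattern for this kind of problem is to run the click and then explicitly wait, so the post-submit page has time to load before the script returns. A minimal sketch via the /execute endpoint (the selector and URL are placeholders):

        import requests

        lua = """
        function main(splash, args)
          assert(splash:go(args.url))
          assert(splash:wait(1))
          splash:runjs("document.querySelector('input[type=submit]').click()")
          assert(splash:wait(3))  -- let the post-submit page load
          return {html = splash:html(), png = splash:png()}
        end
        """

        resp = requests.post(
            "http://localhost:8050/execute",
            json={"lua_source": lua, "url": "https://example.com/form"},
        )
        resp.raise_for_status()
        print(sorted(resp.json().keys()))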
  • Fail to restart automatically using docker + splash

    I use the following command to daemonize the process: 'docker run -d -p 8050:8050 --restart=always -v /etc/splash/proxy-profiles:/etc/splash/proxy-profiles scrapinghub/splash:1.5 --maxrss 500', but whenever the splash process crashes (it gets killed by the system for using too much memory, even though I have 1GB on the server), docker fails to restart it with the following error:

    Traceback (most recent call last):
      File "/app/bin/splash", line 3, in <module>
        from splash.server import main
      File "/usr/local/lib/python2.7/dist-packages/splash/server.py", line 12, in <module>
        from splash.qtutils import init_qt_app
      File "/usr/local/lib/python2.7/dist-packages/splash/qtutils.py", line 12, in <module>
        from PyQt4.QtCore import (QAbstractEventDispatcher, QDateTime, QObject,
    ImportError: /usr/lib/python2.7/dist-packages/PyQt4/QtCore.so: cannot read file data: Input/output error
    
    opened by cp2587 15
  • question: Is there a way to get the network traffic while the page is loaded?

    The parameters I currently use only give me the page once it has been rendered.

    ( curl -s -X POST -H 'content-type: application/json' -d "{\"render_all\": 1, \"timeout\": ${timeout}, \"wait\": ${wait}, \"images\": 0, \"url\": \"${url}\", \"html\": 1}" http://${ip}:${port}/render.json )

    My problem is that the page contains a lot of JavaScript that I am interested in, but it is removed from the final version of the page.

    Is there a way to get the network traffic while the page is loaded, like the "Network" tab in Chrome, for example?

    opened by ohade 14
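
    Splash can answer exactly this through HAR: render.json accepts har=1 and returns an archive of every request and response made while the page was loading, alongside the HTML. A minimal sketch:

        import requests

        resp = requests.post(
            "http://localhost:8050/render.json",
            json={"url": "https://example.com", "wait": 0.5, "html": 1, "har": 1},
        )
        resp.raise_for_status()
        data = resp.json()

        # Every network request made during the page load, devtools-style.
        for entry in data["har"]["log"]["entries"]:
            print(entry["response"]["status"], entry["request"]["url"])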
  • AttributeError: module 'tornado.web' has no attribute 'asynchronous'

    Hello there, there is a dependency problem in the latest splash-jupyter commit. Because of an nbconvert update, tornado has been bumped to version 6. I think we should downgrade nbconvert to a version that uses tornado 5. I'm going to try that and let you know. Kind regards, Ahmed.

    opened by rafikahmed 13
  • Add examples UI and some examples (#286)

    I have added some examples and UI to load the examples as part of issue #286.

    Feel free to give feedback about the UI, the examples and suggestions for other examples.

    opened by dvdbng 13
  • Proxy POST requests

    POST requests do not seem to be going through the HTTP proxy. They do appear in the splash:on_request(callback) and splash:on_response_headers(callback) hooks, but are not seen through the proxy.

    Does anyone have quick thoughts on where this issue may be in the code?

    Related to #239 in that it deals with adding the ability to use splash:http_get(...) with POST requests, but I'm not sure if it's adding support for POST proxying as well.

    opened by munro 13
  • Splash can't render javascript-heavy page

    I'm trying to scrape a javascript-heavy page using Scrapy+Splash, but Splash can't render it and just shows a few links and the top menu. There are template tags in the HTML code it returns, so my guess is that not all the JavaScript is being executed, even in the web interface (I tried setting the wait timer to up to 30 seconds). I tried PhantomJS and Selenium; they both work fine, but they are slow compared to Splash. Here is an example page I'm trying to scrape: http://profile.majorleaguegaming.com/crosswrecks/forums

    Any idea about what can cause this? I checked docs, tried changing a few options, but with zero effect. Thanks.

    opened by andverb 13
  • Does Splash handle javascript set cookies?

    I'm interested in building a scraper that goes through a site and finds what cookies are being set, both first-party and third-party. Scrapy's built-in cookie middleware only deals with cookies set in HTTP headers (Set-Cookie). I'm interested in cookies set through other methods, for example JavaScript. Therefore I wonder: how does Splash/scrapyjs handle cookies set through means other than HTTP headers? Does it set JavaScript cookies or handle them in any way? Or are they just ignored?

    opened by mappelgren 13
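
    For reference, a minimal sketch of inspecting the full cookie jar after a page (and its JavaScript) has run, using the /execute endpoint and splash:get_cookies(), which returns every cookie regardless of how it was set:

        import requests

        lua = """
        function main(splash, args)
          assert(splash:go(args.url))
          assert(splash:wait(1))
          -- includes cookies set from JavaScript, not just Set-Cookie headers
          return {cookies = splash:get_cookies()}
        end
        """

        resp = requests.post(
            "http://localhost:8050/execute",
            json={"lua_source": lua, "url": "https://example.com"},
        )
        resp.raise_for_status()
        for cookie in resp.json()["cookies"]:
            print(cookie["name"], cookie.get("domain"))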
  • Server Side Rendering Issue

    Websites that use server-side rendering could not be rendered with the Splash WebKit engine. However, when changing the rendering engine to Chromium, the website renders perfectly. To make sure, I developed a simple React app that uses server-side rendering with the hydrate method; it only worked with Chromium, not WebKit. Many other sites that use the hydration method also could not be rendered by the WebKit engine. Here is the code used for the React app: https://www.digitalocean.com/community/tutorials/react-server-side-rendering

    opened by Logesh08 1
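
    A minimal sketch of selecting the Chromium engine through the HTTP API, as this issue suggests for server-side-rendered pages (the engine argument also shows up in the js_source issue's log further down; the container has to be started with Chromium enabled, e.g. --browser-engines=chromium,webkit):

        import requests

        resp = requests.get(
            "http://localhost:8050/render.html",
            params={
                "url": "https://example.com",  # placeholder for the SSR page
                "engine": "chromium",
                "wait": 2,
            },
        )
        resp.raise_for_status()
        print(resp.text[:200])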
  • Splash 3.5 returning "js_source is not implemented"

    I'm using Splash with the docker compose file below:

    version: '3.7'
    services:
      splash:
        image: "scrapinghub/splash:3.5"
        restart: always
        container_name: "splash"
        ports:
          - "8050:8050"
        mem_limit: 1024m
        command: "--disable-private-mode -v3 --js-cross-domain-access --browser-engines=chromium"
    

    Using Python requests to query render.html with js_source, as described on the documentation page:

    import requests
    
    js = """document.querySelector("button[data-tracking='cc-accept']").click();
        document.querySelector("div[data-testid='headerImageClickArea']").click();
        document.querySelector("button.image-gallery-icon.image-gallery-right-nav").click();
    """
    url = 'anyurl'
    resp = requests.get(f'http://localhost:8050/render.html?url={url}&engine=chromium&js_source={js}')
    
    f = open("./chromium-engine.html", "w")
    f.write(str(resp.content))
    f.close()
    

    2022-05-11 12:35:22.020634 [events] {"path": "/render.html", "rendertime": 0.1633293628692627, "maxrss": 318320, "load": [0.31, 0.41, 0.4], "fds": 53, "active": 0, "qsize": 0, "_id": 140135661564816, "method": "GET", "timestamp": 1652272522, "user-agent": "python-requests/2.26.0", "args": {"url": "anyurl", "engine": "chromium", "js_source": "document.querySelector("button[data-tracking='cc-accept']").click()", "\n document.querySelector("div[data-testid": "'headerImageClickArea']").click()", "uid": 140135661564816}, "status_code": 400, "client_ip": "172.22.0.1", "error": {"error": 400, "type": "BadOption", "description": "Incorrect HTTP API arguments", "info": "js_source is not implemented"}}

    opened by danilo4pm 0
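
    Two things stand out in that log: the Chromium engine answers that js_source is not implemented, and the script was split into several arguments because it was interpolated into the URL unencoded. A sketch that stays on the default WebKit engine (which accepts js_source, assuming the container allows it, e.g. --browser-engines=webkit,chromium) and lets requests handle the encoding:

        import requests

        js = """
        document.querySelector("button[data-tracking='cc-accept']").click();
        """

        # params= URL-encodes js_source, so quotes and newlines survive intact.
        resp = requests.get(
            "http://localhost:8050/render.html",
            params={"url": "anyurl", "js_source": js, "wait": 1},
        )
        resp.raise_for_status()

        with open("./webkit-engine.html", "w") as f:
            f.write(resp.text)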
  • Errors when rendering zimuku.org

    First, thanks to the developers of this great tool! I am trying to build a subtitle-downloading automation based on Splash.

    I am deploying a docker image: docker://scrapinghub/splash. Rendering www.google.com works fine, but it fails when rendering www.zimuku.org. The Lua script used is like the following:

    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(5))
      return {
        html = splash:html()
      }
    end

    The error thrown is like:

    { "error": 400, "type": "ScriptError", "description": "Error happened while executing Lua script", "info": { "source": "[string "function main(splash, args)\r..."]", "line_number": 2, "error": "http404", "type": "LUA_ERROR", "message": "Lua error: [string "function main(splash, args)\r..."]:2: http404" } }

    In developer mode, I see this (screenshot attached).

    I am new to Splash. Please give me some suggestions about how to debug this. Thanks!

    opened by crotoc 2
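
    One way to start debugging this: splash:go() makes assert() fail with "http404" whenever the site answers with a 404 status, even if it also serves usable HTML. A minimal sketch that checks the return value instead, so the script still returns whatever was rendered:

        import requests

        lua = """
        function main(splash, args)
          local ok, reason = splash:go(args.url)
          if not ok and not string.find(reason, "^http") then
            error(reason)  -- a network-level error, not an HTTP status code
          end
          assert(splash:wait(5))
          return {html = splash:html(), go_error = reason}
        end
        """

        resp = requests.post(
            "http://localhost:8050/execute",
            json={"lua_source": lua, "url": "http://www.zimuku.org"},
        )
        print(resp.status_code)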
  • Splash is no longer able to render a page with JavaScript

    Hi there!

    Kudos to you guys for making some amazing software!

    Up until recently, I've been able to successfully parse this website 'https://www.eversource.com/security/account/login' (amongst many others).

    Unfortunately, I believe the website maintainers recently changed something on the back end, and now the site no longer renders correctly.

    The expected result would be a typical login screen where it asks for User and Password. Instead, it essentially only shows the navbar and the footer.

    I've reviewed and tried all of the suggestions made in https://splash.readthedocs.io/en/stable/faq.html#website-is-not-rendered-correctly, most notably:

    1. Waiting to ensure the site renders completely using 'splash:wait'
    2. Specifying different user agents using splash:set_user_agent
    3. Disabling private mode (using --disable-private-mode or splash.private_mode_enabled = false)

    I normally run Splash with the following command (on Ubuntu Linux 20): 'sudo docker run -p 8050:8050 --memory=1G --restart=always scrapinghub/splash --disable-private-mode --max-timeout 3600 --maxrss 1024 -v3'

    Currently, I'm running the following versions:

    [-] Splash version: 3.5
    [-] Qt 5.14.1, PyQt 5.14.2, WebKit 602.1, Chromium 77.0.3865.129, sip 4.19.22, Twisted 19.7.0, Lua 5.2
    [-] Python 3.6.9 (default, Jul 17 2020, 12:50:27) [GCC 8.4.0]

    The easiest way to reproduce this issue would be to run the following in the splash UI aka (http://localhost:8050):

    function main(splash, args)
      splash.resource_timeout = 0
      splash.private_mode_enabled = false
      splash:set_user_agent('Mozilla/5.0 (Windows NT 6.1; rv:51.0) Gecko/20100101 Firefox/51.0')

      local login_url = 'https://www.eversource.com/security/account/login'

      assert(splash:go(login_url))
      assert(splash:wait(10))

      return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
      }
    end

    The only clues I have seen are a few errors in the verbose output of splash when run with '-v3'. Specifically, I see the following:

    '[render] JsConsole(https://www.eversource.com/content/UserControls/PrimaryNavNew/PrimaryNavNew.ascx.js:69): TypeError: item of items is not a function. (In 'item of items', 'item of items' is undefined)
    [render] JsConsole(https://www.eversource.com/content/WebsiteTemplates/NU/js/AppD/jsagent/adrum/adrum.js:27): TypeError: |this| is not a object
    [render] JsConsole(https://cdn.eversource.com/prod/ms-login/2022.2.2.13/static/js/main.bundle.js:2): TypeError: |this| is not a object '

    Note that I'm able to access this page (and see the login page) using a normal browser (I've used both Safari and Firefox).

    I guess my main question is... is there something that I can do to get this to render again, or is the version of splash WebKit simply incompatible?

    I currently have a webapp where I'm using scrapy combined with splash to parse a number of utility sites. If splash is no longer capable of rendering websites using modern javascript, then I may need to move to some other solution. This is a bummer to me, because so far I've been happy with the performance and capabilities.

    Thanks in advance for any assistance you could provide.

    P.S. If there's any other supporting information that I could give, please let me know!

    opened by utilitylens 2