Screen scraping and web crawling framework

Evgeniy Tatarkin

Last update: Jun 21, 2021

Overview

Pomp

Pomp is a screen scraping and web crawling framework. Pomp is inspired by and similar to Scrapy, but has a simpler implementation that lacks the hard Twisted dependency.

Features:

Pure python
Only one dependency on Python 2.x: concurrent.futures (the backport of the Python 3 package)
Supports one-file applications; Pomp doesn't force a specific project layout or other restrictions.
Pomp is a meta framework like Paste: you may use it to create your own scraping framework.
Extensible networking: you may use any sync or async method.
No parsing libraries in the core; use your preferred approach.
Pomp instances may be distributed and are designed to work with an external queue.
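
The "extensible networking" and "no parsing libraries in the core" points can be illustrated with a minimal sketch. This is not Pomp's API: the `crawl` function, its `fetch` and `parse` parameters, and the in-memory `FAKE_SITE` are all hypothetical names invented for illustration. It only shows the general idea of a crawl loop that takes the download and parsing steps as plug-in callables and relies on nothing beyond concurrent.futures.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(start_urls, fetch, parse, max_workers=4):
    """Breadth-first crawl with pluggable callables (hypothetical API).

    fetch(url) returns a page body; parse(url, body) returns
    a pair (items, next_urls).
    """
    seen = set(start_urls)
    frontier = list(start_urls)
    items = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # Download the whole frontier concurrently; map keeps input order.
            bodies = list(pool.map(fetch, frontier))
            next_frontier = []
            for url, body in zip(frontier, bodies):
                found, links = parse(url, body)
                items.extend(found)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return items


# An in-memory "site" stands in for the network, so the sketch runs offline.
FAKE_SITE = {
    "/": "home /a /b",
    "/a": "page-a /b",
    "/b": "page-b",
}


def fake_fetch(url):
    return FAKE_SITE[url]


def fake_parse(url, body):
    # First word is the page title, remaining words are links.
    words = body.split()
    return [(url, words[0])], [w for w in words[1:] if w.startswith("/")]


result = crawl(["/"], fake_fetch, fake_parse)
print(result)
```

Swapping `fake_fetch` for a function built on requests, or `fake_parse` for one built on any HTML parser, would not require changing the loop, which is the design point the feature list is making.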

Pomp makes no attempt to accommodate:

redirects
proxies
caching
database integration
cookies
authentication
etc.

If you want proxies, redirects, or similar, you may use the excellent requests library as the Pomp downloader.

Pomp examples

Pomp docs

Pomp is written and maintained by Evgeniy Tatarkin and is licensed under the BSD license.

Comments

Comprehensive examples
Pomp looks like a nice and simple design - I'm going to give it a try while migrating an existing Scrapy project to Python 3.

However, I would really like to see some more comprehensive examples in the documentation or the repository.

For instance, a larger project would:

Use requests middleware for auth, proxies, etc.

Use asyncio with some kind of queue

Export to a database

Run as a daemon or a scheduled job

These things can be rolled together by any competent Python dev, but I think demonstrating one or more ways to build a full-scale production deployment might help gain a few users.
opened by danielnaab 6
Improve item

Now we can use the following syntax:

class MyItem(Item):
    f1 = Field()
    f2 = Field()

mi = MyItem('field1', 'field2')
mi.f1 == 'field1'
mi.f2 == 'field2'
str(mi) == 'MyItem(f1=field1,f2=field2)'
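
For readers curious how positional field arguments like these could work, here is a minimal self-contained sketch. It is an assumption about one possible approach, not the code from this pull request or from Pomp: the `Field`, `ItemMeta`, and `Item` classes below are stand-ins written for illustration. A metaclass records the declared fields in definition order, and the constructor maps positional arguments onto them.

```python
class Field:
    """Marker for a declared item field (stand-in, not Pomp's class)."""


class ItemMeta(type):
    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        # Start with fields inherited from base classes, then add our own.
        fields = []
        for base in bases:
            fields.extend(getattr(base, "_fields", ()))
        # Class bodies keep definition order, so f1 comes before f2.
        fields.extend(k for k, v in namespace.items() if isinstance(v, Field))
        cls._fields = tuple(fields)
        return cls


class Item(metaclass=ItemMeta):
    def __init__(self, *args, **kwargs):
        if len(args) > len(self._fields):
            raise TypeError("too many positional arguments")
        values = dict(zip(self._fields, args))
        for key, value in kwargs.items():
            if key not in self._fields:
                raise TypeError("unknown field: %s" % key)
            if key in values:
                raise TypeError("field given twice: %s" % key)
            values[key] = value
        # Unset fields default to None on the instance.
        for name in self._fields:
            setattr(self, name, values.get(name))

    def __repr__(self):
        pairs = ",".join("%s=%s" % (n, getattr(self, n)) for n in self._fields)
        return "%s(%s)" % (type(self).__name__, pairs)


class MyItem(Item):
    f1 = Field()
    f2 = Field()


mi = MyItem('field1', 'field2')
print(mi)
```

Keyword arguments also work in this sketch, so `MyItem('field1', f2='field2')` builds an equivalent item.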

opened by sibelius 2
Documentation changes for readability.

I started a search for a Python 3 compatible crawler and came across Pomp. While I haven't tried the framework out yet, I thought I'd contribute some clarifications to the docstrings and README to improve the readability for native English speakers.

opened by danielnaab 1

RetryMiddleware(BaseMiddleware)

How can I retry a request in the middleware's process_exception? Example:

class RetryMiddleware(BaseMiddleware):

    def process_exception(self, exception, crawler, downloader):
        logger.info("Try again with request: %s", exception.request)
        return exception.request
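
One concern with a retry hook like the one above is that a permanently failing request would be retried forever. The sketch below adds a retry cap. It makes several assumptions that are not confirmed by this thread: that the engine re-schedules a request returned from process_exception, that returning None drops it, and that request objects accept an extra attribute. `BaseMiddleware` here is a local stand-in so the example runs on its own, and `retry_count` and `max_retries` are hypothetical names.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


class BaseMiddleware:
    """Local stand-in for the framework base class (assumption)."""


class RetryMiddleware(BaseMiddleware):
    def __init__(self, max_retries=3):
        self.max_retries = max_retries

    def process_exception(self, exception, crawler, downloader):
        request = exception.request
        # Track attempts on the request itself (hypothetical attribute).
        attempts = getattr(request, "retry_count", 0)
        if attempts >= self.max_retries:
            logger.warning("Giving up on request: %s", request)
            return None
        request.retry_count = attempts + 1
        logger.info("Retry %d for request: %s", request.retry_count, request)
        return request


# Simulate three consecutive failures of the same request.
request = SimpleNamespace(url="http://example.com/")
error = SimpleNamespace(request=request)
middleware = RetryMiddleware(max_retries=2)
outcomes = [middleware.process_exception(error, None, None) for _ in range(3)]
```

With `max_retries=2`, the first two calls hand the request back and the third returns None.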

opened by affmaker 1

'PhantomDownloader' object has no attribute 'drivers'

I'm just trying to launch the e05_phantomjs.py example.

INFO:pomp.engine:Prepare downloader: <__main__.PhantomDownloader object at 0x10e0be3c8>
INFO:pomp.engine:Start crawler: <__main__.TwitterSpider object at 0x10e78c128>
INFO:pomp.engine:Start pipe: <e02_quotes.PrintPipeline object at 0x10e77ff60>
Traceback (most recent call last):
  File "e05_phantomjs.py", line 200, in <module>
    pomp.pump(TwitterSpider())
  File "/usr/local/lib/python3.6/site-packages/pomp/core/engine.py", line 271, in pump
    iterator(next_requests), crawler,
  File "/usr/local/lib/python3.6/site-packages/pomp/core/engine.py", line 158, in process_requests
    self._req_middlewares(requests, crawler), crawler):
  File "e05_phantomjs.py", line 109, in process
    request.driver_url = self.drivers[0].command_executor._url
AttributeError: 'PhantomDownloader' object has no attribute 'drivers'

opened by karambaq 2

Owner

Evgeniy Tatarkin

GitHub https://pomp.readthedocs.org

A pure-python HTML screen-scraping library

Scrapely Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely con

1.8k Dec 31, 2022

Scrapy uses Request and Response objects for crawling web sites.

Requests and Responses¶ Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and p

1 Nov 3, 2021

A high-level distributed crawling framework.

Cola: high-level distributed crawling framework Overview Cola is a high-level distributed crawling framework, used to crawl pages and extract structur

1.5k Jan 4, 2023

A high-level distributed crawling framework.

Cola: high-level distributed crawling framework Overview Cola is a high-level distributed crawling framework, used to crawl pages and extract structur

1.5k Dec 24, 2022

Amazon scraper using scrapy, a python framework for crawling websites.

#Amazon-web-scraper This is a python program, which use scrapy python framework to crawl all pages of the product and scrap products data. This progra

1 Dec 26, 2021

Web Scraping Framework

Grab Framework Documentation Installation $ pip install -U grab See details about installing Grab on different platforms here http://docs.grablib.

2.3k Jan 4, 2023

Async Python 3.6+ web scraping micro-framework based on asyncio

Ruia Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster. Overview Ruia is an async web scraping micro-frame

1.6k Jan 1, 2023

Transistor, a Python web scraping framework for intelligent use cases.

Web data collection and storage for intelligent use cases. transistor About The web is full of data. Transistor is a web scraping framework for collec

212 Nov 5, 2022

A simple django-rest-framework api using web scraping

Apicell You can use this api to search in google, bing, pypi and subscene and get results Method : POST Parameter : query Example import request url =

1 Dec 19, 2021

Amazon web scraping using Scrapy Framework

Amazon-web-scraping-using-Scrapy-Framework Scrapy Scrapy is an application framework for crawling web sites and extracting structured data which can b

1 Jan 25, 2022

Python script for crawling ResearchGate.net papers✨⭐️📎

ResearchGate Crawler Python script for crawling ResearchGate.net papers About the script This code start crawling process by urls in start.txt and giv

4 Aug 30, 2022

Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

trafilatura: Web scraping tool for text discovery and retrieval Description Trafilatura is a Python package and command-line tool which seamlessly dow

704 Jan 6, 2023

A web scraping pipeline project that retrieves TV and movie data from two sources, then transforms and stores data in a MySQL database.

New to Streaming Scraper An in-progress web scraping project built with Python, R, and SQL. The scraped data are movie and TV show information. The go

1 Mar 28, 2022

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group

8.4k Jan 8, 2023

🥫 The simple, fast, and modern web scraping library

About gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies. I

692 Dec 22, 2022

Web Scraping OLX with Python and Bsoup.

webScrap WebScraping first step. Authors: Paulo, Claudio M. First steps in Web Scraping. Project carried out for training in Web Scrapping. The export

5 Sep 25, 2022

Web Scraping images using Selenium and Python

Web Scraping images using Selenium and Python A propos de ce document This is a markdown document about Web scraping images and videos using Selenium

3 Jul 1, 2022

Web-scraping - A bot using Python with BeautifulSoup that scraps IRS website by form number and returns the results as json

Web-scraping - A bot using Python with BeautifulSoup that scraps IRS website (prior form publication) by form number and returns the results as json. It provides the option to download pdfs over a range of years.

1 Jan 4, 2022

Web-scraping - Program that scrapes a website for a collection of quotes, picks one at random and displays it

web-scraping Program that scrapes a website for a collection of quotes, picks on

1 Jan 7, 2022
