Screen scraping and web crawling framework

Overview

Pomp


Pomp is a screen scraping and web crawling framework. Pomp is inspired by and similar to Scrapy, but has a simpler implementation and no hard dependency on Twisted.

Features:

  • Pure python
  • Only one dependency on Python 2.x: concurrent.futures (the backport package)
  • Supports single-file applications; Pomp doesn't force a specific project layout or impose other restrictions.
  • Pomp is a meta framework like Paste: you may use it to create your own scraping framework.
  • Extensible networking: you may use any sync or async method.
  • No parsing libraries in the core; use your preferred approach.
  • Pomp instances may be distributed and are designed to work with an external queue.
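The features above map onto a small set of cooperating pieces: a crawler that yields items and follow-up requests, a downloader that fetches pages, and a pipeline that consumes items. A minimal single-file sketch of that architecture (the tiny engine and the class names here are simplified stand-ins for illustration, not Pomp's actual API):

```python
# Illustrative single-file crawler in the spirit of Pomp's architecture.
# The engine, class names, and method names are simplified stand-ins,
# not Pomp's real API.

PAGES = {  # in-memory "site" so the example needs no network
    '/': ('root', ['/a', '/b']),
    '/a': ('page a', []),
    '/b': ('page b', []),
}

class Downloader:
    def get(self, url):
        return PAGES[url]  # (title, links)

class Crawler:
    ENTRY_REQUESTS = ['/']

    def extract_items(self, url, response):
        title, _links = response
        yield {'url': url, 'title': title}

    def next_requests(self, url, response):
        _title, links = response
        return links

class PrintPipeline:
    def __init__(self):
        self.items = []

    def process(self, item):
        self.items.append(item)

def pump(crawler, downloader, pipeline):
    """Tiny breadth-first engine: fetch, extract, follow links."""
    queue = list(crawler.ENTRY_REQUESTS)
    seen = set()
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        response = downloader.get(url)
        for item in crawler.extract_items(url, response):
            pipeline.process(item)
        queue.extend(crawler.next_requests(url, response))

pipeline = PrintPipeline()
pump(Crawler(), Downloader(), pipeline)
assert [i['url'] for i in pipeline.items] == ['/', '/a', '/b']
```

Because everything lives in one file and the pieces only meet in the engine loop, any part (the downloader, the queue, the pipeline) can be swapped out independently.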

Pomp makes no attempt to accommodate:

  • redirects
  • proxies
  • caching
  • database integration
  • cookies
  • authentication
  • etc.

If you want proxies, redirects, or similar, you may use the excellent requests library as the Pomp downloader.
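For example, a downloader can delegate all fetching to a requests-style session, which handles redirects, cookies, and proxies for free. The sketch below is illustrative: RequestsDownloader is a hypothetical name, not a Pomp class, and a stub session stands in for requests.Session so the example runs offline.

```python
# Sketch of delegating fetching to a requests-style session.  A real
# requests.Session() handles redirects, cookies, and proxies; here a
# stub with the same .get() shape stands in so no network is needed.
# RequestsDownloader is a hypothetical name, not part of Pomp.

class StubSession:
    """Minimal stand-in for requests.Session (same .get signature)."""
    def get(self, url, timeout=None):
        class Response:
            status_code = 200
            text = 'hello from %s' % url
        return Response()

class RequestsDownloader:
    def __init__(self, session=None):
        # in real code: self.session = session or requests.Session()
        self.session = session or StubSession()

    def fetch(self, url):
        response = self.session.get(url, timeout=10)
        return response.status_code, response.text

downloader = RequestsDownloader()
status, body = downloader.fetch('http://example.com')
assert status == 200
assert body == 'hello from http://example.com'
```

Swapping the stub for a real session is a one-line change, which is the point: Pomp's core never needs to know how the bytes were fetched.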

Pomp examples

Pomp docs

Pomp is written and maintained by Evgeniy Tatarkin and is licensed under the BSD license.

Comments
  • Comprehensive examples

    Pomp looks like a nice and simple design - I'm going to give it a try while migrating an existing Scrapy project to Python 3.

    However, I would really like to see some more comprehensive examples in the documentation or the repository.

    For instance, a larger project would:

    • Use requests middleware for auth, proxies, etc
    • Use asyncio with some kind of queue
    • Export to a database
    • Run as a daemon or a scheduled job

    These things can be rolled together by any competent Python dev, but I think demonstrating one or more ways to build a full-scale production deployment might help gain a few users.
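As a sketch of the "asyncio with some kind of queue" point above, worker coroutines can consume URLs from an asyncio.Queue. The fetch step is faked here so the example runs offline; in a real crawler it would await an HTTP client.

```python
# Worker-pool crawling pattern with asyncio.Queue: producers enqueue
# URLs, N workers consume them concurrently.  fetch() is a stand-in
# for real network I/O.
import asyncio

async def fetch(url):
    await asyncio.sleep(0)          # stand-in for real network I/O
    return 'body of %s' % url

async def worker(queue, results):
    while True:
        url = await queue.get()
        results.append(await fetch(url))
        queue.task_done()

async def crawl(urls, concurrency=2):
    queue, results = asyncio.Queue(), []
    for url in urls:
        queue.put_nowait(url)
    workers = [
        asyncio.create_task(worker(queue, results))
        for _ in range(concurrency)
    ]
    await queue.join()              # wait until every URL is processed
    for task in workers:
        task.cancel()
    return results

results = asyncio.run(crawl(['/a', '/b', '/c']))
assert sorted(results) == ['body of /a', 'body of /b', 'body of /c']
```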

    opened by danielnaab 6
  • Improve item

    Now we can use the following syntax:

    class MyItem(Item):
        f1 = Field()
        f2 = Field()

    mi = MyItem('field1', 'field2')
    mi.f1 == 'field1'
    mi.f2 == 'field2'
    print(mi) == 'MyItem(f1=field1,f2=field2)'
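One way the requested syntax could be implemented is with a metaclass that records the declared Field attributes. This is an illustrative sketch, not Pomp's actual implementation:

```python
# Sketch of an Item/Field declarative syntax via a metaclass that
# collects Field attributes in declaration order.  Illustrative only;
# Pomp's real Item implementation may differ.

class Field:
    """Marker for a declared item field."""

class ItemMeta(type):
    def __new__(mcs, name, bases, namespace):
        # class bodies preserve declaration order in Python 3
        fields = [k for k, v in namespace.items() if isinstance(v, Field)]
        cls = super().__new__(mcs, name, bases, namespace)
        cls._fields = fields
        return cls

class Item(metaclass=ItemMeta):
    def __init__(self, *values):
        for field_name, value in zip(self._fields, values):
            setattr(self, field_name, value)

    def __repr__(self):
        args = ','.join(
            '%s=%s' % (f, getattr(self, f, None)) for f in self._fields
        )
        return '%s(%s)' % (type(self).__name__, args)

class MyItem(Item):
    f1 = Field()
    f2 = Field()

mi = MyItem('field1', 'field2')
assert mi.f1 == 'field1'
assert mi.f2 == 'field2'
assert repr(mi) == 'MyItem(f1=field1,f2=field2)'
```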

    opened by sibelius 2
  • Documentation changes for readability.

    I started a search for a Python 3 compatible crawler and came across Pomp. While I haven't tried the framework out yet, I thought I'd contribute some clarifications to the docstrings and README to improve the readability for native English speakers.

    opened by danielnaab 1
  • RetryMiddleware(BaseMiddleware)

    How can I retry a request in the middleware's process_exception? Example:

    class RetryMiddleware(BaseMiddleware):

        def process_exception(self, exception, crawler, downloader):
            logger.info("Try again with request: %s", exception.request)
            return exception.request
    opened by affmaker 1
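One caveat with the pattern in the comment above: returning exception.request unconditionally retries forever. A self-contained sketch with a retry cap (Request, DownloadError, and BaseMiddleware here are simplified stand-ins, not Pomp's real interfaces):

```python
# Retry middleware with a cap, so a permanently failing request cannot
# loop forever.  All classes below are simplified stand-ins for
# illustration; Pomp's real interfaces may differ.

class Request:
    def __init__(self, url):
        self.url = url
        self.retries = 0

class DownloadError(Exception):
    def __init__(self, request):
        self.request = request

class BaseMiddleware:
    def process_exception(self, exception, crawler, downloader):
        return None  # drop the request by default

class RetryMiddleware(BaseMiddleware):
    MAX_RETRIES = 3

    def process_exception(self, exception, crawler, downloader):
        request = exception.request
        if request.retries < self.MAX_RETRIES:
            request.retries += 1
            return request   # re-schedule the request
        return None          # give up after MAX_RETRIES attempts

middleware = RetryMiddleware()
error = DownloadError(Request('http://example.com'))
for attempt in range(5):
    result = middleware.process_exception(error, crawler=None, downloader=None)
assert result is None            # retries exhausted
assert error.request.retries == 3
```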
  • 'PhantomDownloader' object has no attribute 'drivers'

    Just trying to launch the e05_phantomjs.py example.

    INFO:pomp.engine:Prepare downloader: <__main__.PhantomDownloader object at 0x10e0be3c8>
    INFO:pomp.engine:Start crawler: <__main__.TwitterSpider object at 0x10e78c128>
    INFO:pomp.engine:Start pipe: <e02_quotes.PrintPipeline object at 0x10e77ff60>
    Traceback (most recent call last):
      File "e05_phantomjs.py", line 200, in <module>
        pomp.pump(TwitterSpider())
      File "/usr/local/lib/python3.6/site-packages/pomp/core/engine.py", line 271, in pump
        iterator(next_requests), crawler,
      File "/usr/local/lib/python3.6/site-packages/pomp/core/engine.py", line 158, in process_requests
        self._req_middlewares(requests, crawler), crawler):
      File "e05_phantomjs.py", line 109, in process
        request.driver_url = self.drivers[0].command_executor._url
    AttributeError: 'PhantomDownloader' object has no attribute 'drivers'

    opened by karambaq 2
Similar projects

  • Scrapely: a pure-python HTML screen-scraping library for extracting structured data from HTML pages (Scrapy project, 1.8k, Dec 31, 2022)
  • Requests and Responses: Scrapy uses Request and Response objects for crawling web sites (Md Rashidul Islam, 1, Nov 3, 2021)
  • Cola: a high-level distributed crawling framework, used to crawl pages and extract structured data (Xuye (Chris) Qin, 1.5k, Jan 4, 2023)
  • Amazon-web-scraper: an Amazon scraper using Scrapy to crawl all pages of a product and scrape product data (Akash Das, 1, Dec 26, 2021)
  • Grab: a web scraping framework (2.3k, Jan 4, 2023)
  • Ruia: an async Python 3.6+ web scraping micro-framework based on asyncio (howie.hu, 1.6k, Jan 1, 2023)
  • Transistor: a Python web scraping framework for intelligent use cases (BOM Quote Manufacturing, 212, Nov 5, 2022)
  • Apicell: a simple django-rest-framework API that searches Google, Bing, PyPI, and Subscene via web scraping (Hesam N, 1, Dec 19, 2021)
  • Amazon-web-scraping-using-Scrapy-Framework: Amazon web scraping using the Scrapy framework (Sejal Rajput, 1, Jan 25, 2022)
  • ResearchGate Crawler: a Python script for crawling ResearchGate.net papers (Mohammad Sadegh Salimi, 4, Aug 30, 2022)
  • trafilatura: a web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments) (Adrien Barbaresi, 704, Jan 6, 2023)
  • New to Streaming Scraper: a web scraping pipeline that retrieves TV and movie data from two sources, then transforms and stores the data in a MySQL database (Charles Dungy, 1, Mar 28, 2022)
  • Pattern: a web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization (Computational Linguistics Research Group, 8.4k, Jan 8, 2023)
  • gazpacho: a simple, fast, and modern web scraping library with zero dependencies (Max Humber, 692, Dec 22, 2022)
  • webScrap: web scraping OLX with Python and BeautifulSoup (claudio paulo, 5, Sep 25, 2022)
  • Web Scraping images using Selenium and Python (Nafaa BOUGRAINE, 3, Jul 1, 2022)
  • Web-scraping: a bot using Python with BeautifulSoup that scrapes the IRS website (prior form publication) by form number and returns the results as JSON, with the option to download PDFs over a range of years (1, Jan 4, 2022)
  • Web-scraping: a program that scrapes a website for a collection of quotes, picks one at random and displays it (Manvir Mann, 1, Jan 7, 2022)