PyQuery-based scraping micro-framework.

Overview

demiurge

PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x.

Build Status

Documentation: http://demiurge.readthedocs.org

Installing demiurge

$ pip install demiurge

Quick start

Define items to be scraped using a declarative (Django-inspired) syntax:

import demiurge

class TorrentDetails(demiurge.Item):
    label = demiurge.TextField(selector='strong')
    value = demiurge.TextField()

    def clean_value(self, value):
        unlabel = value[value.find(':') + 1:]
        return unlabel.strip()

    class Meta:
        selector = 'div#specifications p'

class Torrent(demiurge.Item):
    url = demiurge.AttributeValueField(
        selector='td:eq(2) a:eq(1)', attr='href')
    name = demiurge.TextField(selector='td:eq(2) a:eq(2)')
    size = demiurge.TextField(selector='td:eq(3)')
    details = demiurge.RelatedItem(
        TorrentDetails, selector='td:eq(2) a:eq(2)', attr='href')

    class Meta:
        selector = 'table.maintable:gt(0) tr:gt(0)'
        base_url = 'http://www.mininova.org'


>>> t = Torrent.one('/search/ubuntu/seeds')
>>> t.name
'Ubuntu 7.10 Desktop Live CD'
>>> t.size
u'695.81\xa0MB'
>>> t.url
'/get/1053846'
>>> t.html
u'<td>19\xa0Dec\xa007</td><td><a href="/cat/7">Software</a></td><td>...'

>>> results = Torrent.all('/search/ubuntu/seeds')
>>> len(results)
116
>>> for t in results[:3]:
...     print t.name, t.size
...
Ubuntu 7.10 Desktop Live CD 695.81 MB
Super Ubuntu 2008.09 - VMware image 871.95 MB
Portable Ubuntu 9.10 for Windows 559.78 MB
...

>>> t = Torrent.one('/search/ubuntu/seeds')
>>> for detail in t.details:
...     print detail.label, detail.value
... 
Category: Software > GNU/Linux
Total size: 695.81 megabyte
Added: 2467 days ago by Distribution
Share ratio: 17 seeds, 2 leechers
Last updated: 35 minutes ago
Downloads: 29,085

See documentation for details: http://demiurge.readthedocs.org

Why demiurge?

Plato, as the speaker Timaeus, refers to the Demiurge frequently in the Socratic dialogue Timaeus, c. 360 BC. The main character refers to the Demiurge as the entity who "fashioned and shaped" the material world. Timaeus describes the Demiurge as unreservedly benevolent, and hence desirous of a world as good as possible. The world remains imperfect, however, because the Demiurge created the world out of a chaotic, indeterminate non-being.

http://en.wikipedia.org/wiki/Demiurge

Contributors

  • Martín Gaitán (@mgaitan)
Issues
  • proof of concept: subitem field

    proof of concept: subitem field

    short rationale: Sometimes I need to scrap a page to retrieve the actual links where the items are. I would like a way to nest Item classes, analog (in some way) to a in ForeignKey / ManyToManyField in Django.

    This is a first PR as a proof of concept, to discuss the idea and its API.

    opened by mgaitan 5
  • Reausable cleaning functions

    Reausable cleaning functions

    You can now add a "clean" kwarg containing a function to a field.

    This makes it easy to use quick filtering (I want this data to be an int) and to re-use functions such as parsedatetime.

        score = demiurge.TextField(selector=".score .upvoted", clean=int)
    
    opened by traverseda 5
  • Use lib

    Use lib "requests" for downloading

    I'm right now making use of https://pypi.python.org/pypi/requests-cache which creates a cache of the downloaded stuff magically, and it's awesome. So, I would like to be able to take advantage of it using demiurge.

    I don't know if just as an option or as a replacement of pyquery downloader.

    What do you think?

    opened by jmansilla 2
  • RelatedItems only work across urls

    RelatedItems only work across urls

    An obvious use of RelatedItems (or a similar construct) is recursively mapping a comment tree. Right now there's no elegant way to do that.

    An example

    http://pastebin.com/WDL4RjkE


    Reading through the actual code, I think I might be wrong about this. I'll try and make the docs clearer.

    opened by traverseda 2
  • add per field clean support

    add per field clean support

    if there is a clean_<fieldname> method, apply it as a value filter

    opened by mgaitan 1
  • Fix when no selector defined

    Fix when no selector defined

    the default selector is the whole page ('html') but this is applied through PyQuery.find wich traverses down. example:

    In [2]: PyQuery('<html>hello</html>').find('html')
    Out[2]: []
    
    In [3]: PyQuery('<html>hello</html>')('html')
    Out[3]: [<html>]
    
    opened by mgaitan 1
  • Added support to self-referenced RelatedItem.

    Added support to self-referenced RelatedItem.

    Based on @mgaitan work: https://github.com/matiasb/demiurge/pull/5.

    opened by matiasb 1
  • enable syntax for readme

    enable syntax for readme

    opened by teitei-tk 1
  • support self reference in RelatedItem

    support self reference in RelatedItem

    RelatedItem('self'). Also, the relateditem's item class could be given by its name (i.eRelatedItem("ItemClass")` ) A typical use case is a listing page with a "next page" link.

    opened by mgaitan 0
  • Reverse related item

    Reverse related item

    Allow to get back the parent from a RelatedItem instance.

    opened by matiasb 0
  • No rawhtml field

    No rawhtml field

    I want to use html2text on a selection, but there's no way to pull raw html out of a query.

    I'll probably get to this in the next few days.

    opened by traverseda 2
Releases(v0.2)
Web Scraping Framework

Grab Framework Documentation Installation $ pip install -U grab See details about installing Grab on different platforms here http://docs.grablib.

null 2.1k Nov 30, 2021
Scrapy, a fast high-level web crawling & scraping framework for Python.

Scrapy Overview Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pag

Scrapy project 42.3k Dec 4, 2021
Transistor, a Python web scraping framework for intelligent use cases.

Web data collection and storage for intelligent use cases. transistor About The web is full of data. Transistor is a web scraping framework for collec

BOM Quote Manufacturing 211 Oct 17, 2021
Visual scraping for Scrapy

Portia Portia is a tool that allows you to visually scrape websites without any programming knowledge required. With Portia you can annotate a web pag

Scrapinghub 8.3k Dec 1, 2021
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.1k Dec 2, 2021
Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

trafilatura: Web scraping tool for text discovery and retrieval Description Trafilatura is a Python package and command-line tool which seamlessly dow

Adrien Barbaresi 318 Nov 25, 2021
🥫 The simple, fast, and modern web scraping library

About gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies. I

Max Humber 606 Nov 22, 2021
A pure-python HTML screen-scraping library

Scrapely Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely con

Scrapy project 1.8k Nov 24, 2021
A repository with scraping code and soccer dataset from understat.com.

UNDERSTAT - SHOTS DATASET As many people interested in soccer analytics know, Understat is an amazing source of information. They provide Expected Goa

douglasbc 25 Nov 23, 2021
Minimal set of tools to conduct stealthy scraping.

Stealthy Scraping Tools Do not use puppeteer and playwright for scraping. Explanation. We only use the CDP to obtain the page source and to get the ab

Nikolai Tschacher 19 Nov 29, 2021
crypto currency scraping

SCRYPTO What ? Crypto currencies scraping (At the moment, only bitcoin and ethereum crypto currencies are supported) How ? A python script is running

null 15 Jul 11, 2021
Scraping Thailand COVID-19 data from the DDC's tableau dashboard

Scraping COVID-19 data from DDC Dashboard Scraping Thailand COVID-19 data from the DDC's tableau dashboard. Data is updated at 07:30 and 08:00 daily.

Noppakorn Jiravaranun 5 Nov 17, 2021
A Web Scraping Program.

Web Scraping AUTHOR: Saurabh G. MTech Information Security, IIT Jammu. If you find this repository useful. I would appreciate if you Star it and Fork

Saurabh G. 3 Sep 17, 2021
A tool for scraping and organizing data from NewsBank API searches

nbscraper Overview This simple tool automates the process of copying, pasting, and organizing data from NewsBank API searches. Curerntly, nbscrape onl

null 0 Jun 17, 2021
Scraping news from Ucsal portal with Scrapy.

NewsScraping Esse é um projeto de raspagem das últimas noticias, de 2021, do portal da universidade Ucsal http://noosfero.ucsal.br/institucional Tecno

Crissiano Pires 0 Sep 30, 2021
A web scraping pipeline project that retrieves TV and movie data from two sources, then transforms and stores data in a MySQL database.

New to Streaming Scraper An in-progress web scraping project built with Python, R, and SQL. The scraped data are movie and TV show information. The go

Charles Dungy 1 Aug 10, 2021
Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.

Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.

Joseph Lai 405 Dec 2, 2021
Example of scraping a paginated API endpoint and dumping the data into a DB

Provider API Scraper Example Example of scraping a paginated API endpoint and dumping the data into a DB. Pre-requisits Python >= 3.9 Pipenv Setup # i

Alex Skobelev 1 Oct 20, 2021
Web-Scraping using Selenium Master

Web-Scraping using Selenium What is the need of Selenium? Some websites don't like to be scrapped and in that case you need to disguise your webscrapi

Md Rashidul Islam 1 Oct 26, 2021