Pythonic Crawling / Scraping Framework based on non-blocking I/O operations.

Overview

Pythonic Crawling / Scraping Framework Built on Eventlet


Features

  • High-speed web crawler built on Eventlet.
  • Supports relational database engines such as PostgreSQL, MySQL, Oracle, and SQLite.
  • Supports NoSQL databases such as MongoDB and CouchDB. New!
  • Export your data to JSON, XML, or CSV formats. New!
  • Command line tools.
  • Extract data using your favourite tool: XPath or PyQuery (a jQuery-like library for Python).
  • Cookie Handlers.
  • Very easy to use (see the example).

Documentation

http://packages.python.org/crawley/

Project Website

http://project.crawley-cloud.com/


To install crawley, run:

~$ python setup.py install

or install it with pip:

~$ pip install crawley

To start a new project, run:

~$ crawley startproject [project_name]
~$ cd [project_name]
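
This creates the project skeleton. The exact layout may vary between crawley versions, but it contains at least the three modules edited in the rest of this guide (the nesting shown here is an assumption, not taken from the crawley docs):

[project_name]/
    settings.py        # project and database configuration
    [project_name]/
        models.py      # database entities
        crawlers.py    # crawler and scraper classes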

Write your Models

""" models.py """

from crawley.persistance import Entity, UrlEntity, Field, Unicode

class Package(Entity):
    
    #add your table fields here
    updated = Field(Unicode(255))    
    package = Field(Unicode(255))
    description = Field(Unicode(255))

Write your Scrapers

""" crawlers.py """

from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor
from models import *

class pypiScraper(BaseScraper):
    
    #specify the urls that can be scraped by this class
    matching_urls = ["%"]
    
    def scrape(self, response):
                        
        #getting the current document's url.
        current_url = response.url        
        #getting the html table.
        table = response.html.xpath("/html/body/div[5]/div/div/div[3]/table")[0]
        
        #for rows 1 to n-1
        for tr in table[1:-1]:
                        
            #obtaining the searched html inside the rows
            td_updated = tr[0]
            td_package = tr[1]
            package_link = td_package[0]
            td_description = tr[2]
            
            #storing data in Packages table
            Package(updated=td_updated.text, package=package_link.text, description=td_description.text)


class pypiCrawler(BaseCrawler):
    
    #add your starting urls here
    start_urls = ["http://pypi.python.org/pypi"]
    
    #add your scraper classes here    
    scrapers = [pypiScraper]
    
    #specify your maximum crawling depth level    
    max_depth = 0
    
    #select your favourite HTML parsing tool
    extractor = XPathExtractor
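
Before wiring an XPath expression like the one above into a scraper, it can be handy to sanity-check it against the live page. A minimal standalone sketch (a hypothetical check_xpath.py, not part of crawley), assuming lxml is installed (crawley's XPathExtractor is typically backed by lxml):

""" check_xpath.py """

from lxml import html

# Parse the crawler's start URL directly (lxml can fetch a URL by itself).
doc = html.parse("http://pypi.python.org/pypi")

# The same expression used in pypiScraper.scrape() above.
table = doc.xpath("/html/body/div[5]/div/div/div[3]/table")[0]

# Print one line per row to confirm the columns line up as expected.
for tr in table[1:-1]:
    print("%s | %s | %s" % (tr[0].text, tr[1][0].text, tr[2].text))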

Configure your settings

""" settings.py """

import os 
PATH = os.path.dirname(os.path.abspath(__file__))

#Don't change this unless you have renamed the project
PROJECT_NAME = "pypi"
PROJECT_ROOT = os.path.join(PATH, PROJECT_NAME)

DATABASE_ENGINE = 'sqlite'     
DATABASE_NAME = 'pypi'  
DATABASE_USER = ''             
DATABASE_PASSWORD = ''         
DATABASE_HOST = ''             
DATABASE_PORT = ''     

SHOW_DEBUG_INFO = True
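
The same keys should cover the other supported relational engines. A hedged example for MySQL (the engine string and connection values below are illustrative assumptions, not taken from the crawley docs):

DATABASE_ENGINE = 'mysql'        # assumed engine identifier
DATABASE_NAME = 'pypi'
DATABASE_USER = 'crawley'        # example credentials
DATABASE_PASSWORD = 'secret'
DATABASE_HOST = 'localhost'
DATABASE_PORT = '3306'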

Finally, just run the crawler

~$ crawley run
Comments
  • Use __metaclasses__ to read user's modules

    Replace the non-pythonic method "inspect_module" in manager/utils with metaclasses in order to read the models and crawlers modules written by users (a rough sketch follows below). :-)

    opened by jmg 0
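
    A generic illustration of the idea (not crawley's actual code): a metaclass can record every subclass a user defines, so no module inspection is needed.

    class CrawlerRegistry(type):

        crawlers = []

        def __init__(cls, name, bases, attrs):
            super(CrawlerRegistry, cls).__init__(name, bases, attrs)
            if bases != (object,):                  # skip the abstract base itself
                CrawlerRegistry.crawlers.append(cls)

    class RegisteredCrawler(object):
        __metaclass__ = CrawlerRegistry             # Python 2 metaclass hook

    # Any user-defined subclass is now collected automatically:
    #   class MyCrawler(RegisteredCrawler): pass
    #   CrawlerRegistry.crawlers  ->  [MyCrawler]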
  • Delayed Requests

    We're currently making the HTTP requests without any delay. This can be a problem when sending thousands of requests to the same server.

    The solution is to make delayed HTTP requests when we are overloading an external server (consider the algorithm used to decide this); a rough sketch follows below.

    Put the delay time constant in a config file.

    opened by jmg 0
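
    A rough sketch of the throttling idea, independent of crawley's internals (REQUEST_DELAY stands in for the hypothetical config constant mentioned above):

    import time
    import eventlet

    REQUEST_DELAY = 1.0           # seconds to wait between requests to one host
    _last_request = {}            # host -> timestamp of the last request sent

    def polite_wait(host):
        """Sleep (yielding the green thread) until the host may be hit again."""
        elapsed = time.time() - _last_request.get(host, 0)
        if elapsed < REQUEST_DELAY:
            eventlet.sleep(REQUEST_DELAY - elapsed)
        _last_request[host] = time.time()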
  • Integrate the DSL with Crawlers

    We have a simple DSL designed and we're able to compile it into scraper classes. Now we can finish the integration of the run-time generated scrapers with the crawlers.

    Write more tests and more complex DSL templates.

    opened by jmg 0
  • Similar HTML Pages Recognition

    Evaluate the possibility of using difflib in order to recognize similar HTML pages; a rough sketch follows below.

    http://docs.python.org/library/difflib.html

    Write some tests to check whether it works properly and reasonably fast. Then we can write a "SmartCrawler" class which crawls the web searching for similar pages.

    opened by jmg 0
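
    A minimal sketch of the idea using only the standard library (the 0.9 threshold is an arbitrary value for illustration):

    import difflib

    def similar_pages(html_a, html_b, threshold=0.9):
        """Return True when two HTML documents look alike, by raw text similarity."""
        ratio = difflib.SequenceMatcher(None, html_a, html_b).quick_ratio()
        return ratio >= threshold

    A real implementation would probably compare DOM structure rather than raw markup, but this is enough to measure whether difflib is fast enough on typical pages.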
  • Wrong encoding detection

    I'm using PyQuery, and I get wrong encoding detection for this page:

    http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/resultado_busca.html?nomeArq=0148.html

    The problem is that the html has this meta tag:

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

    But the page is actually utf-8

    I get this info from the response headers:

    Connection:close
    Content-Length:29187
    Content-Type:text/html;charset=UTF-8
    Date:Fri, 11 Jul 2014 23:21:04 GMT
    Last-Modified:Fri, 11 Jul 2014 23:21:05 GMT
    Server:OpenCms/7.5.4
    

    That's how the browser (Chrome) is able to guess the right encoding and display the page correctly. I work in a place that has to deal with a lot of different kinds of pages, and I can tell this is far from a rare case (especially on Brazilian Portuguese websites), so it would be nice to fix this in crawley.

    So far I have seen two solutions, as proposed in an answer on SO: using the chardet module or UnicodeDammit (from BeautifulSoup). Both are sketched below.

    I've developed both alternatives locally and tested them with PyQuery; they seem to fix the problem.

    I would like to hear your opinion on this issue and if you want, I can submit one of those solutions.

    BTW, good work in building crawley, I'm having a very nice time using it! Hope I can contribute somehow.

    opened by onilton 0
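
    Both proposed fixes, sketched independently of crawley (chardet and BeautifulSoup 4 are assumed to be installed):

    # Option 1: let UnicodeDammit guess the real encoding.
    from bs4 import UnicodeDammit

    def decode_with_dammit(raw_bytes):
        dammit = UnicodeDammit(raw_bytes)
        return dammit.unicode_markup, dammit.original_encoding

    # Option 2: detect the encoding with chardet, then decode explicitly.
    import chardet

    def decode_with_chardet(raw_bytes):
        guess = chardet.detect(raw_bytes)
        return raw_bytes.decode(guess["encoding"]), guess["encoding"]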
  • Use urljoin to fix relative urls

    https://docs.python.org/2/library/urlparse.html#urlparse.urljoin provides a robust way to turn a relative URL into an absolute one.

    This fixes some issues like this one:

    When accessing this url: http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/

    We find relative links like this: resultado_busca.html?letra=a

    The browser (Chrome) builds the absolute URL like this: http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/resultado_busca.html?letra=a

    But crawley builds the URL like this: http://www1.abracom.org.br/resultado_busca.html?letra=a

    urljoin fixes the issue, keeping the right behavior for /relativeurl (see the sketch below):

    In a hypothetical page http://mydomain.com/my/web/page.html:

    '/relativeurl.html' link should become 'http://mydomain.com/relativeurl.html'

    and

    'relativeurl.html' link should become 'http://mydomain.com/my/web/relativeurl.html'

    opened by onilton 1
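
    The behavior described above, reproduced with the standard library (urlparse on Python 2; the same function lives in urllib.parse on Python 3):

    from urlparse import urljoin

    base = "http://mydomain.com/my/web/page.html"

    urljoin(base, "relativeurl.html")
    # -> 'http://mydomain.com/my/web/relativeurl.html'

    urljoin(base, "/relativeurl.html")
    # -> 'http://mydomain.com/relativeurl.html'

    urljoin("http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/",
            "resultado_busca.html?letra=a")
    # -> 'http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/resultado_busca.html?letra=a'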
  • shell doesn't work

    I tried to use the shell command to test my XPaths, but it doesn't work.

    $ crawley shell http://somewebsite.com/index.html
    Traceback (most recent call last):
      File "/home/maik/.virtualenvs/crawley/bin/crawley", line 4, in <module>
        manage()
      File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/__init__.py", line 25, in manage
        run_cmd(sys.argv)
      File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/__init__.py", line 18, in run_cmd
        cmd.checked_execute()
      File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/commands/command.py", line 50, in checked_execute
        self.execute()
      File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/commands/shell.py", line 30, in execute
        response = crawler._get_data(url)
    AttributeError: 'BaseCrawler' object has no attribute '_get_data'

    opened by MrTango 1
Owner
Juan Manuel Garcia
Passionate Python Developer
A high-level distributed crawling framework.

Cola: high-level distributed crawling framework Overview Cola is a high-level distributed crawling framework, used to crawl pages and extract structur

Xuye (Chris) Qin 1.5k Jan 4, 2023
Amazon scraper using scrapy, a python framework for crawling websites.

#Amazon-web-scraper This is a python program, which use scrapy python framework to crawl all pages of the product and scrap products data. This progra

Akash Das 1 Dec 26, 2021
Scrapy uses Request and Response objects for crawling web sites.

Requests and Responses¶ Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and p

Md Rashidul Islam 1 Nov 3, 2021
Python script for crawling ResearchGate.net papers✨⭐️📎

ResearchGate Crawler Python script for crawling ResearchGate.net papers About the script This code start crawling process by urls in start.txt and giv

Mohammad Sadegh Salimi 4 Aug 30, 2022
Async Python 3.6+ web scraping micro-framework based on asyncio

Ruia - Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster. Overview Ruia is an async web scraping micro-frame

howie.hu 1.6k Jan 1, 2023
PyQuery-based scraping micro-framework.

demiurge PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x. Documentation: http://demiurge.readthedocs.org Installing demiurge $ pip

Matias Bordese 109 Jul 20, 2022
Web Scraping Framework

Grab Framework Documentation Installation $ pip install -U grab See details about installing Grab on different platforms here http://docs.grablib.

null 2.3k Jan 4, 2023
Transistor, a Python web scraping framework for intelligent use cases.

Web data collection and storage for intelligent use cases. transistor About The web is full of data. Transistor is a web scraping framework for collec

BOM Quote Manufacturing 212 Nov 5, 2022
A simple django-rest-framework api using web scraping

Apicell You can use this api to search in google, bing, pypi and subscene and get results Method : POST Parameter : query Example import request url =

Hesam N 1 Dec 19, 2021
Amazon web scraping using Scrapy Framework

Amazon-web-scraping-using-Scrapy-Framework Scrapy Scrapy is an application framework for crawling web sites and extracting structured data which can b

Sejal Rajput 1 Jan 25, 2022
robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

RoboBrowser: Your friendly neighborhood web scraper Homepage: http://robobrowser.readthedocs.org/ RoboBrowser is a simple, Pythonic library for browsi

Joshua Carp 3.7k Dec 27, 2022
Visual scraping for Scrapy

Portia Portia is a tool that allows you to visually scrape websites without any programming knowledge required. With Portia you can annotate a web pag

Scrapinghub 8.7k Jan 5, 2023
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Jan 8, 2023
Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

trafilatura: Web scraping tool for text discovery and retrieval Description Trafilatura is a Python package and command-line tool which seamlessly dow

Adrien Barbaresi 704 Jan 6, 2023
🥫 The simple, fast, and modern web scraping library

About gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies. I

Max Humber 692 Dec 22, 2022
A pure-python HTML screen-scraping library

Scrapely Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely con

Scrapy project 1.8k Dec 31, 2022
A repository with scraping code and soccer dataset from understat.com.

UNDERSTAT - SHOTS DATASET As many people interested in soccer analytics know, Understat is an amazing source of information. They provide Expected Goa

douglasbc 48 Jan 3, 2023
Minimal set of tools to conduct stealthy scraping.

Stealthy Scraping Tools Do not use puppeteer and playwright for scraping. Explanation. We only use the CDP to obtain the page source and to get the ab

Nikolai Tschacher 88 Jan 4, 2023