Pythonic Crawling / Scraping Framework based on non-blocking I/O operations.

Overview

Pythonic Crawling / Scraping Framework Built on Eventlet


Features

  • High-speed web crawler built on Eventlet.
  • Supports relational database engines such as PostgreSQL, MySQL, Oracle, and SQLite.
  • Supports NoSQL databases such as MongoDB and CouchDB. New!
  • Export your data to JSON, XML, or CSV formats. New!
  • Command line tools.
  • Extract data using your favourite tool: XPath or PyQuery (a jQuery-like library for Python).
  • Cookie Handlers.
  • Very easy to use (see the example).

Documentation

http://packages.python.org/crawley/

Project Website

http://project.crawley-cloud.com/


To install crawley, run:

~$ python setup.py install

or install it with pip:

~$ pip install crawley

To start a new project, run:

~$ crawley startproject [project_name]
~$ cd [project_name]
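
This creates the project skeleton. The exact layout may vary between crawley versions, but it contains at least the three modules edited in the rest of this guide (the nesting shown here is an assumption, not taken from the crawley docs):

[project_name]/
    settings.py        # project and database configuration
    [project_name]/
        models.py      # database entities
        crawlers.py    # crawler and scraper classes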

Write your Models

""" models.py """

from crawley.persistance import Entity, UrlEntity, Field, Unicode

class Package(Entity):
    
    #add your table fields here
    updated = Field(Unicode(255))    
    package = Field(Unicode(255))
    description = Field(Unicode(255))

Write your Scrapers

""" crawlers.py """

from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor
from models import *

class pypiScraper(BaseScraper):
    
    #specify the urls that can be scraped by this class
    matching_urls = ["%"]
    
    def scrape(self, response):
                        
        #getting the current document's url.
        current_url = response.url        
        #getting the html table.
        table = response.html.xpath("/html/body/div[5]/div/div/div[3]/table")[0]
        
        #for rows 1 to n-1
        for tr in table[1:-1]:
                        
            #obtaining the searched html inside the rows
            td_updated = tr[0]
            td_package = tr[1]
            package_link = td_package[0]
            td_description = tr[2]
            
            #storing data in Packages table
            Package(updated=td_updated.text, package=package_link.text, description=td_description.text)


class pypiCrawler(BaseCrawler):
    
    #add your starting urls here
    start_urls = ["http://pypi.python.org/pypi"]
    
    #add your scraper classes here    
    scrapers = [pypiScraper]
    
    #specify your maximum crawling depth level    
    max_depth = 0
    
    #select your favourite HTML parsing tool
    extractor = XPathExtractor
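
Before wiring an XPath expression like the one above into a scraper, it can be handy to sanity-check it against the live page. A minimal standalone sketch (a hypothetical check_xpath.py, not part of crawley), assuming lxml is installed (crawley's XPathExtractor is typically backed by lxml):

""" check_xpath.py """

from lxml import html

# Parse the crawler's start URL directly (lxml can fetch a URL by itself).
doc = html.parse("http://pypi.python.org/pypi")

# The same expression used in pypiScraper.scrape() above.
table = doc.xpath("/html/body/div[5]/div/div/div[3]/table")[0]

# Print one line per row to confirm the columns line up as expected.
for tr in table[1:-1]:
    print("%s | %s | %s" % (tr[0].text, tr[1][0].text, tr[2].text))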

Configure your settings

""" settings.py """

import os 
PATH = os.path.dirname(os.path.abspath(__file__))

#Don't change this unless you have renamed the project
PROJECT_NAME = "pypi"
PROJECT_ROOT = os.path.join(PATH, PROJECT_NAME)

DATABASE_ENGINE = 'sqlite'     
DATABASE_NAME = 'pypi'  
DATABASE_USER = ''             
DATABASE_PASSWORD = ''         
DATABASE_HOST = ''             
DATABASE_PORT = ''     

SHOW_DEBUG_INFO = True
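
The same keys should cover the other supported relational engines. A hedged example for MySQL (the engine string and connection values below are illustrative assumptions, not taken from the crawley docs):

DATABASE_ENGINE = 'mysql'        # assumed engine identifier
DATABASE_NAME = 'pypi'
DATABASE_USER = 'crawley'        # example credentials
DATABASE_PASSWORD = 'secret'
DATABASE_HOST = 'localhost'
DATABASE_PORT = '3306'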

Finally, just run the crawler

~$ crawley run
Comments
  • Use __metaclasses__ to read user's modules

    Replace the non-pythonic method "inspect_module" in manager/utils with metaclasses in order to read the models and crawlers modules written by users (a rough sketch follows below). :-)

    opened by jmg 0
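
    A generic illustration of the idea (not crawley's actual code): a metaclass can record every subclass a user defines, so no module inspection is needed.

    class CrawlerRegistry(type):

        crawlers = []

        def __init__(cls, name, bases, attrs):
            super(CrawlerRegistry, cls).__init__(name, bases, attrs)
            if bases != (object,):                  # skip the abstract base itself
                CrawlerRegistry.crawlers.append(cls)

    class RegisteredCrawler(object):
        __metaclass__ = CrawlerRegistry             # Python 2 metaclass hook

    # Any user-defined subclass is now collected automatically:
    #   class MyCrawler(RegisteredCrawler): pass
    #   CrawlerRegistry.crawlers  ->  [MyCrawler]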
  • Delayed Requests

    We're currently making the HTTP requests without any delay. This can be a problem when sending thousands of requests to the same server.

    The solution is to make delayed HTTP requests when we are overloading an external server (consider the algorithm used to decide this); a rough sketch follows below.

    Put the delay time constant in a config file.

    opened by jmg 0
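
    A rough sketch of the throttling idea, independent of crawley's internals (REQUEST_DELAY stands in for the hypothetical config constant mentioned above):

    import time
    import eventlet

    REQUEST_DELAY = 1.0           # seconds to wait between requests to one host
    _last_request = {}            # host -> timestamp of the last request sent

    def polite_wait(host):
        """Sleep (yielding the green thread) until the host may be hit again."""
        elapsed = time.time() - _last_request.get(host, 0)
        if elapsed < REQUEST_DELAY:
            eventlet.sleep(REQUEST_DELAY - elapsed)
        _last_request[host] = time.time()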
  • Integrate the DSL with Crawlers

    We have a simple DSL designed and we're able to compile it into scraper classes. Now we can finish the integration of the run-time generated scrapers with the crawlers.

    Write more tests and more complex DSL templates.

    opened by jmg 0
  • Similar HTML Pages Recognition

    Evaluate the possibility of using difflib in order to recognize similar HTML pages; a rough sketch follows below.

    http://docs.python.org/library/difflib.html

    Write some tests to check whether it works properly and reasonably fast. Then we can write a "SmartCrawler" class which crawls the web searching for similar pages.

    opened by jmg 0
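
    A minimal sketch of the idea using only the standard library (the 0.9 threshold is an arbitrary value for illustration):

    import difflib

    def similar_pages(html_a, html_b, threshold=0.9):
        """Return True when two HTML documents look alike, by raw text similarity."""
        ratio = difflib.SequenceMatcher(None, html_a, html_b).quick_ratio()
        return ratio >= threshold

    A real implementation would probably compare DOM structure rather than raw markup, but this is enough to measure whether difflib is fast enough on typical pages.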
  • Wrong encoding detection

    I'm using PyQuery, and I get wrong encoding detection for this page:

    http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/resultado_busca.html?nomeArq=0148.html

    The problem is that the html has this meta tag:

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

    But the page is actually utf-8

    I get this info from the response headers:

    Connection:close
    Content-Length:29187
    Content-Type:text/html;charset=UTF-8
    Date:Fri, 11 Jul 2014 23:21:04 GMT
    Last-Modified:Fri, 11 Jul 2014 23:21:05 GMT
    Server:OpenCms/7.5.4
    

    That's how the browser (Chrome) is able to guess the right encoding and display the page correctly. I work in a place that has to deal with a lot of different kinds of pages, and I can tell this is far from a rare case (especially on Brazilian Portuguese websites), so it would be nice to fix this in crawley.

    So far I have seen two solutions, as proposed in an answer on SO: using the chardet module or UnicodeDammit (from BeautifulSoup). Both are sketched below.

    I've developed both alternatives locally and tested them with PyQuery; they seem to fix the problem.

    I would like to hear your opinion on this issue and if you want, I can submit one of those solutions.

    BTW, good work in building crawley, I'm having a very nice time using it! Hope I can contribute somehow.

    opened by onilton 0
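
    Both proposed fixes, sketched independently of crawley (chardet and BeautifulSoup 4 are assumed to be installed):

    # Option 1: let UnicodeDammit guess the real encoding.
    from bs4 import UnicodeDammit

    def decode_with_dammit(raw_bytes):
        dammit = UnicodeDammit(raw_bytes)
        return dammit.unicode_markup, dammit.original_encoding

    # Option 2: detect the encoding with chardet, then decode explicitly.
    import chardet

    def decode_with_chardet(raw_bytes):
        guess = chardet.detect(raw_bytes)
        return raw_bytes.decode(guess["encoding"]), guess["encoding"]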
  • Use urljoin to fix relative urls

    https://docs.python.org/2/library/urlparse.html#urlparse.urljoin provides a robust way to turn a relative URL into an absolute one.

    This fixes some issues like this one:

    When accessing this url: http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/

    We find relative links like this: resultado_busca.html?letra=a

    The browser (Chrome) builds the absolute URL like this: http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/resultado_busca.html?letra=a

    But crawley builds the URL like this: http://www1.abracom.org.br/resultado_busca.html?letra=a

    urljoin fixes the issue, keeping the right behavior for /relativeurl (see the sketch below):

    In a hypothetical page http://mydomain.com/my/web/page.html:

    '/relativeurl.html' link should become 'http://mydomain.com/relativeurl.html'

    and

    'relativeurl.html' link should become 'http://mydomain.com/my/web/relativeurl.html'

    opened by onilton 1
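
    The behavior described above, reproduced with the standard library (urlparse on Python 2; the same function lives in urllib.parse on Python 3):

    from urlparse import urljoin

    base = "http://mydomain.com/my/web/page.html"

    urljoin(base, "relativeurl.html")
    # -> 'http://mydomain.com/my/web/relativeurl.html'

    urljoin(base, "/relativeurl.html")
    # -> 'http://mydomain.com/relativeurl.html'

    urljoin("http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/",
            "resultado_busca.html?letra=a")
    # -> 'http://www1.abracom.org.br/cms/opencms/abracom/pt/associados/resultado_busca.html?letra=a'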
  • shell doesn't work

    I tried to use the shell command to test my XPaths, but it doesn't work.

    $ crawley shell http://somewebsite.com/index.html
    Traceback (most recent call last):
      File "/home/maik/.virtualenvs/crawley/bin/crawley", line 4, in <module>
        manage()
      File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/__init__.py", line 25, in manage
        run_cmd(sys.argv)
      File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/__init__.py", line 18, in run_cmd
        cmd.checked_execute()
      File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/commands/command.py", line 50, in checked_execute
        self.execute()
      File "/home/maik/.virtualenvs/crawley/local/lib/python2.7/site-packages/crawley/manager/commands/shell.py", line 30, in execute
        response = crawler._get_data(url)
    AttributeError: 'BaseCrawler' object has no attribute '_get_data'

    opened by MrTango 1
Owner
Juan Manuel Garcia
Passionate Python Developer
A high-level distributed crawling framework.

Cola: high-level distributed crawling framework Overview Cola is a high-level distributed crawling framework, used to crawl pages and extract structur

Xuye (Chris) Qin 1.5k Jan 4, 2023
Amazon scraper using scrapy, a python framework for crawling websites.

#Amazon-web-scraper This is a python program, which use scrapy python framework to crawl all pages of the product and scrap products data. This progra

Akash Das 1 Dec 26, 2021
Scrapy uses Request and Response objects for crawling web sites.

Requests and Responses¶ Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and p

Md Rashidul Islam 1 Nov 3, 2021
Python script for crawling ResearchGate.net papers✨⭐️📎

ResearchGate Crawler Python script for crawling ResearchGate.net papers About the script This code start crawling process by urls in start.txt and giv

Mohammad Sadegh Salimi 4 Aug 30, 2022
Async Python 3.6+ web scraping micro-framework based on asyncio

Ruia - Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster. Overview Ruia is an async web scraping micro-frame

howie.hu 1.6k Jan 1, 2023
PyQuery-based scraping micro-framework.

demiurge PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x. Documentation: http://demiurge.readthedocs.org Installing demiurge $ pip

Matias Bordese 109 Jul 20, 2022
Web Scraping Framework

Grab Framework Documentation Installation $ pip install -U grab See details about installing Grab on different platforms here http://docs.grablib.

null 2.3k Jan 4, 2023
Transistor, a Python web scraping framework for intelligent use cases.

Web data collection and storage for intelligent use cases. transistor About The web is full of data. Transistor is a web scraping framework for collec

BOM Quote Manufacturing 212 Nov 5, 2022
A simple django-rest-framework api using web scraping

Apicell You can use this api to search in google, bing, pypi and subscene and get results Method : POST Parameter : query Example import request url =

Hesam N 1 Dec 19, 2021
Amazon web scraping using Scrapy Framework

Amazon-web-scraping-using-Scrapy-Framework Scrapy Scrapy is an application framework for crawling web sites and extracting structured data which can b

Sejal Rajput 1 Jan 25, 2022
robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

RoboBrowser: Your friendly neighborhood web scraper Homepage: http://robobrowser.readthedocs.org/ RoboBrowser is a simple, Pythonic library for browsi

Joshua Carp 3.7k Dec 27, 2022
Visual scraping for Scrapy

Portia Portia is a tool that allows you to visually scrape websites without any programming knowledge required. With Portia you can annotate a web pag

Scrapinghub 8.7k Jan 5, 2023
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Jan 8, 2023
Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

trafilatura: Web scraping tool for text discovery and retrieval Description Trafilatura is a Python package and command-line tool which seamlessly dow

Adrien Barbaresi 704 Jan 6, 2023
🥫 The simple, fast, and modern web scraping library

About gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies. I

Max Humber 692 Dec 22, 2022
A pure-python HTML screen-scraping library

Scrapely Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely con

Scrapy project 1.8k Dec 31, 2022
A repository with scraping code and soccer dataset from understat.com.

UNDERSTAT - SHOTS DATASET As many people interested in soccer analytics know, Understat is an amazing source of information. They provide Expected Goa

douglasbc 48 Jan 3, 2023
Minimal set of tools to conduct stealthy scraping.

Stealthy Scraping Tools Do not use puppeteer and playwright for scraping. Explanation. We only use the CDP to obtain the page source and to get the ab

Nikolai Tschacher 88 Jan 4, 2023