Find thumbnails and original images from URL or HTML file.


Haul


Demo

Hauler on Heroku

Installation

On Ubuntu

$ sudo apt-get install build-essential python-dev libxml2-dev libxslt1-dev
$ pip install haul

On Mac OS X

$ pip install haul

Failed to install haul? The cause is probably lxml, which needs the libxml2 and libxslt development headers to build.
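To check whether lxml is the culprit, it can help to try importing it on its own. The snippet below is a minimal sketch of such a check; the helper name `lxml_is_usable` is made up for illustration and is not part of haul.

```python
def lxml_is_usable():
    """Return True if lxml and its compiled etree extension can be imported."""
    try:
        # Importing lxml.etree exercises the C extension, which is the part
        # that usually fails when libxml2/libxslt headers were missing at build time.
        import lxml.etree  # noqa: F401
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    if lxml_is_usable():
        print("lxml imports fine; the install problem is probably elsewhere")
    else:
        print("lxml is missing or broken; install the system packages above, then retry")
```

If this prints the second message, installing the system packages listed for Ubuntu and running `pip install lxml` again is a reasonable first step.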

Usage

Find images from img src, a href and even background-image:

import haul

url = 'http://gibuloto.tumblr.com/post/62525699435/fuck-yeah'
result = haul.find_images(url)

print(result.image_urls)
"""
output:
[
    'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
    ...
    'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
    'http://25.media.tumblr.com/avatar_9b04f54875e1_16.png',
    'http://31.media.tumblr.com/avatar_0acf8f9b4380_16.png',
]
"""
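To make the three sources concrete (img src, a href, and background-image), here is a rough standard-library sketch of the idea. It is not haul's implementation: the class `ImageURLCollector`, the function `find_image_urls`, and the list of image extensions are all assumptions made for this illustration.

```python
import re
from html.parser import HTMLParser

# Assumed set of extensions that mark an <a href> as pointing at an image.
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.gif')

# Matches background-image: url(...) inside an inline style attribute.
BACKGROUND_IMAGE_RE = re.compile(
    r"background-image\s*:\s*url\(\s*['\"]?([^'\")]+)['\"]?\s*\)",
    re.IGNORECASE,
)


class ImageURLCollector(HTMLParser):
    """Collect candidate image URLs while the parser walks the start tags."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'img' and attrs.get('src'):
            self.image_urls.append(attrs['src'])
        elif tag == 'a' and (attrs.get('href') or '').lower().endswith(IMAGE_EXTENSIONS):
            self.image_urls.append(attrs['href'])
        # Any tag may carry an inline background-image.
        style = attrs.get('style') or ''
        self.image_urls.extend(BACKGROUND_IMAGE_RE.findall(style))


def find_image_urls(html):
    collector = ImageURLCollector()
    collector.feed(html)
    collector.close()
    return collector.image_urls
```

Given a page where a thumbnail links to its full-size file, this returns both the `a href` target and the `img src`, in document order. It only looks at inline styles; stylesheet rules would need separate handling.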

Find original (or bigger size) images with extend=True:

import haul

url = 'http://gibuloto.tumblr.com/post/62525699435/fuck-yeah'
result = haul.find_images(url, extend=True)

print(result.image_urls)
"""
output:
[
    'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
    ...
    'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
    'http://25.media.tumblr.com/avatar_9b04f54875e1_16.png',
    'http://31.media.tumblr.com/avatar_0acf8f9b4380_16.png',
    # bigger size, extended from above urls
    'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_1280.png',
    ...
    'http://24.media.tumblr.com/avatar_a3a119b674e2_128.png',
    'http://25.media.tumblr.com/avatar_9b04f54875e1_128.png',
    'http://31.media.tumblr.com/avatar_0acf8f9b4380_128.png',
]
"""
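Judging from the output above, extending seems to amount to rewriting the size suffix of a known URL pattern and appending the result to the original list. The sketch below reproduces that behaviour for the Tumblr URLs shown; the function names, the regular expression, and the target sizes (1280 for posts, 128 for avatars) are assumptions read off the sample output, not haul's actual extender code.

```python
import re

# Assumed pattern: Tumblr media URLs end in _<size>.<extension>
TUMBLR_SIZE_RE = re.compile(r'_(\d+)\.(png|jpg|jpeg|gif)$', re.IGNORECASE)


def extend_tumblr_url(url, post_size=1280, avatar_size=128):
    """Return a bigger-size variant of a Tumblr media URL, or None."""
    match = TUMBLR_SIZE_RE.search(url)
    if not match:
        return None
    target = avatar_size if '/avatar_' in url else post_size
    if int(match.group(1)) >= target:
        # Already at least as large as the size we would ask for.
        return None
    return '%s_%d.%s' % (url[:match.start()], target, match.group(2))


def extend_image_urls(image_urls):
    """Keep the original URLs and append any extended variants not yet present."""
    extended = []
    for url in image_urls:
        bigger = extend_tumblr_url(url)
        if bigger and bigger not in image_urls and bigger not in extended:
            extended.append(bigger)
    return list(image_urls) + extended
```

Whether the bigger file actually exists on the server is a separate question; a real extender would probably want to verify the rewritten URL before returning it.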

Advanced Usage

Custom finder / extender pipeline:

from haul import Haul
from haul.compat import str


def img_data_src_finder(pipeline_index,
                        soup,
                        finder_image_urls=None,
                        *args, **kwargs):
    """
    Find image URLs in <img>'s data-src attribute
    """

    # Avoid a mutable default argument: treat a missing list as empty
    finder_image_urls = finder_image_urls or []
    now_finder_image_urls = []

    for img in soup.find_all('img'):
        src = img.get('data-src', None)
        if src:
            src = str(src)
            now_finder_image_urls.append(src)

    # Append this finder's results to those of the earlier pipeline steps
    output = {}
    output['finder_image_urls'] = finder_image_urls + now_finder_image_urls

    return output

MY_FINDER_PIPELINE = (
    'haul.finders.pipeline.html.img_src_finder',
    'haul.finders.pipeline.css.background_image_finder',
    img_data_src_finder,
)

GOOGLE_SITES_EXTENDER_PIPELINE = (
    'haul.extenders.pipeline.google.blogspot_s1600_extender',
    'haul.extenders.pipeline.google.ggpht_s1600_extender',
    'haul.extenders.pipeline.google.googleusercontent_s1600_extender',
)

url = 'http://fashion-fever.nl/dressing-up/'
h = Haul(parser='lxml',
         finder_pipeline=MY_FINDER_PIPELINE,
         extender_pipeline=GOOGLE_SITES_EXTENDER_PIPELINE)
result = h.find_images(url, extend=True)
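The pipeline tuples above mix dotted-path strings with plain callables, and each step returns a dict that feeds the next one. The following is a minimal sketch of how such a pipeline could be driven. It is an assumption about the general pattern, not haul's own runner: `resolve`, `run_pipeline`, and the two toy finders are hypothetical names.

```python
from importlib import import_module


def resolve(step):
    """Accept either a callable or a dotted 'package.module.function' path."""
    if callable(step):
        return step
    module_path, _, name = step.rpartition('.')
    return getattr(import_module(module_path), name)


def run_pipeline(pipeline, **initial):
    """Call each step in order, merging the dict it returns into shared state."""
    state = dict(initial)
    for index, step in enumerate(pipeline):
        func = resolve(step)
        result = func(pipeline_index=index, **state)
        if result:
            state.update(result)
    return state


# Two toy steps that follow the same contract as img_data_src_finder:
# accept the URLs found so far, return them plus their own findings.
def first_finder(pipeline_index, finder_image_urls=None, **kwargs):
    return {'finder_image_urls': (finder_image_urls or []) + ['a.png']}


def second_finder(pipeline_index, finder_image_urls=None, **kwargs):
    return {'finder_image_urls': (finder_image_urls or []) + ['b.png']}
```

Running `run_pipeline((first_finder, second_finder), finder_image_urls=[])` accumulates the URLs from both steps, which is why a custom finder should add to `finder_image_urls` rather than overwrite it.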

Run Tests

$ python setup.py test

Comments
  • AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'

    Downloading/unpacking haul
      Running setup.py egg_info for package haul
        Traceback (most recent call last):
          File "<string>", line 14, in <module>
          File "/home/vagrant/.virtualenvs/hauler/build/haul/setup.py", line 22, in <module>
            install_requires = [str(item.req) for item in parse_requirements('requirements.txt')]
          File "/home/vagrant/.virtualenvs/hauler/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/req.py", line 1240, in parse_requirements
            skip_regex = options.skip_requirements_regex
        AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 14, in <module>
          File "/home/vagrant/.virtualenvs/hauler/build/haul/setup.py", line 22, in <module>
            install_requires = [str(item.req) for item in parse_requirements('requirements.txt')]
          File "/home/vagrant/.virtualenvs/hauler/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/req.py", line 1240, in parse_requirements
            skip_regex = options.skip_requirements_regex
        AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'

    Installation fails with pip 1.1 but works with pip 1.4.1.

    bug 
    opened by vinta 0
  • ModuleNotFoundError: No module named cStringIO

    I installed haul using pip: pip install haul

    I was using the first find_images example and got the following error:

    File "~\Python36-32\lib\site-packages\haul\utils.py", line 3, in <module>
        import cStringIO
    ModuleNotFoundError: No module named 'cStringIO'

    opened by mha90 1
  • avoid mutable default arguments

    This Pull Request automatically fixes 10 code issues, detected on QuantifiedCode:

    Type: Avoid mutable default arguments Issue details: https://www.quantifiedcode.com/app/project/gh:vinta:Haul?groups=code_patterns%3A3P0qV6OB

    Please reach out to us to give us feedback: [email protected].

    opened by quantifiedcode-bot 1
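The cStringIO error reported in the comments above occurs because that module exists only in Python 2; Python 3 provides in-memory buffers through the io module instead. A common way to support both is a guarded import, sketched below. How haul/utils.py actually uses cStringIO is not shown here, so the alias `BytesBuffer` and the helper `read_header` are hypothetical and only illustrate the pattern.

```python
try:
    # Python 2: C-accelerated in-memory buffer
    from cStringIO import StringIO as BytesBuffer
except ImportError:
    # Python 3: cStringIO was removed; io.BytesIO holds binary data
    from io import BytesIO as BytesBuffer


def read_header(data, size=4):
    """Wrap raw bytes in a file-like buffer and read the first few bytes."""
    buffer = BytesBuffer(data)
    return buffer.read(size)
```

For example, `read_header(b'\x89PNG\r\n\x1a\n')` returns the first four bytes of a PNG signature on either interpreter. Code that stores text rather than bytes would want io.StringIO on Python 3 instead.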
Owner
Vinta Chen
I failed the Turing Test.