Find thumbnails and original images from URL or HTML file.

Vinta Chen

Last update: Oct 15, 2022

Overview

Haul

Find thumbnails and original images from URL or HTML file.

Demo

Hauler on Heroku

Installation

On Ubuntu

$ sudo apt-get install build-essential python-dev libxml2-dev libxslt1-dev
$ pip install haul

On Mac OS X

$ pip install haul

If installing haul fails, lxml is the most likely cause: it needs the libxml2 and libxslt development packages to build from source.

Usage

Find images from img src, a href and even background-image:

import haul

url = 'http://gibuloto.tumblr.com/post/62525699435/fuck-yeah'
result = haul.find_images(url)

print(result.image_urls)
"""
output:
[
    'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
    ...
    'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
    'http://25.media.tumblr.com/avatar_9b04f54875e1_16.png',
    'http://31.media.tumblr.com/avatar_0acf8f9b4380_16.png',
]
"""
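To make the three sources concrete, here is a minimal standard-library sketch of where such URLs sit in a page. It is not Haul's implementation; the class name `ImageSourceSketch`, the extension list, and the sample HTML are assumptions made purely for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical helper for illustration only; not part of haul.
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.gif')
CSS_URL = re.compile(r'url\(\s*["\']?([^"\')]+)["\']?\s*\)')


class ImageSourceSketch(HTMLParser):
    """Collect candidate image URLs from img src, a href and inline styles."""

    def __init__(self):
        super().__init__()
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <img src="...">
        if tag == 'img' and attrs.get('src'):
            self.image_urls.append(attrs['src'])
        # <a href="..."> that points directly at an image file
        href = attrs.get('href') or ''
        if tag == 'a' and href.lower().endswith(IMAGE_EXTENSIONS):
            self.image_urls.append(href)
        # url(...) inside an inline style, e.g. background-image
        self.image_urls.extend(CSS_URL.findall(attrs.get('style') or ''))


html = '''
<a href="http://example.com/full.jpg"><img src="http://example.com/thumb.jpg"></a>
<div style="background-image: url('http://example.com/bg.png')"></div>
<a href="http://example.com/about.html">About</a>
'''

parser = ImageSourceSketch()
parser.feed(html)
print(parser.image_urls)
```

The link to `about.html` is skipped because it does not end in an image extension. A real finder would also need to resolve relative URLs and read stylesheets, which this sketch leaves out.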

Find original (or bigger size) images with extend=True:

import haul

url = 'http://gibuloto.tumblr.com/post/62525699435/fuck-yeah'
result = haul.find_images(url, extend=True)

print(result.image_urls)
"""
output:
[
    'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
    ...
    'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
    'http://25.media.tumblr.com/avatar_9b04f54875e1_16.png',
    'http://31.media.tumblr.com/avatar_0acf8f9b4380_16.png',
    # bigger size, extended from above urls
    'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_1280.png',
    ...
    'http://24.media.tumblr.com/avatar_a3a119b674e2_128.png',
    'http://25.media.tumblr.com/avatar_9b04f54875e1_128.png',
    'http://31.media.tumblr.com/avatar_0acf8f9b4380_128.png',
]
"""
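Judging from the sample output, extending amounts to rewriting the size suffix of a known URL pattern. The sketch below reproduces that idea for the two Tumblr patterns shown above. The function names and regular expressions are hypothetical, inferred from the sample output only, and are not Haul's own extenders.

```python
import re

# Hypothetical rules inferred from the sample output; not haul's code.
TUMBLR_PHOTO = re.compile(r'(tumblr_\w+)_\d+(\.\w+)$')
TUMBLR_AVATAR = re.compile(r'(avatar_\w+)_\d+(\.\w+)$')


def extend_tumblr_url(url):
    """Return a guessed larger variant of a Tumblr image URL, or None."""
    if TUMBLR_PHOTO.search(url):
        return TUMBLR_PHOTO.sub(r'\g<1>_1280\g<2>', url)
    if TUMBLR_AVATAR.search(url):
        return TUMBLR_AVATAR.sub(r'\g<1>_128\g<2>', url)
    return None


def extend_image_urls(image_urls):
    """Keep the original URLs and append the guessed larger variants."""
    extended = [extend_tumblr_url(url) for url in image_urls]
    return image_urls + [url for url in extended if url]


found = [
    'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
    'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
]
for url in extend_image_urls(found):
    print(url)
```

The sketch only rewrites strings; it does not check that the larger file actually exists on the server.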

Advanced Usage

Custom finder / extender pipeline:

from haul import Haul
from haul.compat import str


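def img_data_src_finder(pipeline_index,
                        soup,
                        finder_image_urls=None,
                        *args, **kwargs):
    """
    Find image URLs in <img>'s data-src attribute
    """

    # Avoid a mutable default argument: start from an empty list when none is passed
    finder_image_urls = finder_image_urls or []
    now_finder_image_urls = []

    for img in soup.find_all('img'):
        src = img.get('data-src', None)
        if src:
            src = str(src)
            now_finder_image_urls.append(src)

    # Return the previously found URLs plus the ones found here
    output = {}
    output['finder_image_urls'] = finder_image_urls + now_finder_image_urls

    return output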

MY_FINDER_PIPELINE = (
    'haul.finders.pipeline.html.img_src_finder',
    'haul.finders.pipeline.css.background_image_finder',
    img_data_src_finder,
)

GOOGLE_SITES_EXTENDER_PIPELINE = (
    'haul.extenders.pipeline.google.blogspot_s1600_extender',
    'haul.extenders.pipeline.google.ggpht_s1600_extender',
    'haul.extenders.pipeline.google.googleusercontent_s1600_extender',
)

url = 'http://fashion-fever.nl/dressing-up/'
h = Haul(parser='lxml',
         finder_pipeline=MY_FINDER_PIPELINE,
         extender_pipeline=GOOGLE_SITES_EXTENDER_PIPELINE)
result = h.find_images(url, extend=True)
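The finder signature above (keyword arguments in, a dict out) suggests that each step receives the state built up so far and contributes new keys to it. The sketch below shows one way such a pipeline could be driven. It is an assumption inferred from that signature, not Haul's actual runner, and `load_step`, `run_pipeline`, and the two stub finders are hypothetical names.

```python
from importlib import import_module


def load_step(step):
    """Resolve a dotted-path string to a callable; pass callables through."""
    if callable(step):
        return step
    module_path, _, name = step.rpartition('.')
    return getattr(import_module(module_path), name)


def run_pipeline(pipeline, **state):
    """Call each step with the accumulated state and merge its returned dict."""
    for index, step in enumerate(pipeline):
        output = load_step(step)(pipeline_index=index, **state)
        if output:
            state.update(output)
    return state


# Stub finders standing in for real ones; they ignore the soup argument.
def first_finder(pipeline_index, soup, finder_image_urls=None, *args, **kwargs):
    return {'finder_image_urls': (finder_image_urls or []) + ['a.png']}


def second_finder(pipeline_index, soup, finder_image_urls=None, *args, **kwargs):
    return {'finder_image_urls': (finder_image_urls or []) + ['b.png']}


state = run_pipeline((first_finder, second_finder), soup=None)
print(state['finder_image_urls'])
```

Accepting both dotted-path strings and plain callables mirrors how `MY_FINDER_PIPELINE` mixes the two kinds of entry.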

Run Tests

$ python setup.py test

Comments

AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'

Downloading/unpacking haul
  Running setup.py egg_info for package haul
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/home/vagrant/.virtualenvs/hauler/build/haul/setup.py", line 22, in <module>
        install_requires = [str(item.req) for item in parse_requirements('requirements.txt')]
      File "/home/vagrant/.virtualenvs/hauler/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/req.py", line 1240, in parse_requirements
        skip_regex = options.skip_requirements_regex
    AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/home/vagrant/.virtualenvs/hauler/build/haul/setup.py", line 22, in <module>
        install_requires = [str(item.req) for item in parse_requirements('requirements.txt')]
      File "/home/vagrant/.virtualenvs/hauler/local/lib/python2.7/site-packages/pip-1.1-py2.7.egg/pip/req.py", line 1240, in parse_requirements
        skip_regex = options.skip_requirements_regex
    AttributeError: 'NoneType' object has no attribute 'skip_requirements_regex'

Installation fails with pip 1.1 but works with pip 1.4.1.

bug

opened by vinta 0
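The traceback shows setup.py calling `parse_requirements`, a function internal to pip whose behaviour differs between pip versions. One common way to avoid that dependency is to read requirements.txt directly. The sketch below is an illustration rather than the project's actual fix; `read_requirements` is a hypothetical helper and the package names in the demo file are placeholders.

```python
import os
import tempfile


def read_requirements(path):
    """Return requirement specifiers from a file, skipping blanks and comments."""
    with open(path) as f:
        lines = (line.strip() for line in f)
        return [line for line in lines if line and not line.startswith('#')]


# Demonstration with a throwaway file; the package names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'requirements.txt')
    with open(path, 'w') as f:
        f.write('# parsing\nlxml\n\nrequests>=2.0\n')
    requirements = read_requirements(path)

print(requirements)
```

This handles plain specifier lines only. Options such as `-r other.txt` or `-e` would need extra handling.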

ModuleNotFoundError: No module named cStringIO

I installed haul using pip: pip install haul

I was using the first find_images example and got the following error:

  File "~\Python36-32\lib\site-packages\haul\utils.py", line 3, in <module>
    import cStringIO
ModuleNotFoundError: No module named 'cStringIO'

opened by mha90 1
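`cStringIO` exists only in Python 2. Python 3 provides `io.StringIO` for text and `io.BytesIO` for binary data instead. A minimal compatibility import might look like the sketch below. The issue does not show whether haul needs a text or a bytes buffer, so the choice of `StringIO` here is an assumption.

```python
try:
    from cStringIO import StringIO  # Python 2 only
except ImportError:
    from io import StringIO  # Python 3 text buffer; use io.BytesIO for binary data

buffer = StringIO()
buffer.write('haul')
print(buffer.getvalue())
```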

avoid mutable default arguments

This Pull Request automatically fixes 10 code issues, detected on QuantifiedCode:

Type: Avoid mutable default arguments Issue details: https://www.quantifiedcode.com/app/project/gh:vinta:Haul?groups=code_patterns%3A3P0qV6OB

opened by quantifiedcode-bot 1
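The pattern flagged here is a general Python pitfall: a default value such as `[]` is created once, when the function is defined, and is then shared by every call that relies on it. The sketch below contrasts the two forms using hypothetical functions that are not taken from haul.

```python
def append_shared(url, urls=[]):
    """Buggy: the same default list is reused on every call."""
    urls.append(url)
    return urls


def append_fresh(url, urls=None):
    """Safe: a new list is created whenever none is passed in."""
    if urls is None:
        urls = []
    urls.append(url)
    return urls


append_shared('a.png')
shared = append_shared('b.png')

append_fresh('a.png')
fresh = append_fresh('b.png')

print(shared)  # ['a.png', 'b.png']
print(fresh)   # ['b.png']
```

Code that only reads the default, or builds a new list with `+`, is not affected, but the `None` form stays safe if the function is later changed to append in place.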
