coURLan: Clean, filter, normalize, and sample URLs

Adrien Barbaresi

Last update: Dec 14, 2022

Related tags

URL Manipulation url crawler validation url-parsing cleaner preprocessing url-manipulation webcrawling

Overview

Why coURLan?

“Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained.” (Edwards et al. 2001)

Avoid wasting bandwidth and processing time on webpages which are probably not worth the effort. This library provides an additional brain for web crawling, scraping and management of Internet archives. Specific functionality for crawlers: stay away from pages with little text content or target synoptic pages explicitly to gather links.

This navigation help targets text-based documents (i.e. currently web pages expected to be in HTML format) and tries to guess the language of pages to allow for language-focused collection. Additional functions include straightforward domain name extraction and URL sampling.

Features

Separate the wheat from the chaff and optimize crawls by focusing on non-spam HTML pages containing primarily text. Most helpers revolve around the strict and language arguments:

- Heuristics for triage of links
  - Targeting spam and unsuitable content-types
  - Language-aware filtering
  - Crawl management
- URL handling
  - Validation
  - Canonicalization/Normalization
  - Sampling
- Command-line interface (CLI) and Python tool

Let the coURLan fish out juicy bits for you!

Here is a courlan (source: Limpkin at Harn's Marsh by Russ, CC BY 2.0).

Installation

This Python package is tested on Linux, macOS and Windows systems. It is compatible with Python 3.6 upwards (support for Python 3.5 was dropped in v0.6.0). It is available on the package repository PyPI and can be installed with the Python package managers pip and pipenv:

$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)

Python

check_url()

All useful operations chained in check_url(url):

>>> from courlan import check_url
# returns url and domain name
>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')
# noisy query parameters can be removed
>>> check_url('https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org', strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')
# Check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)

Language-aware heuristics, notably internationalization in URLs, are available in lang_filter(url, language) and through the language argument of check_url():

# optional argument targeting webpages in English or German
>>> url = 'https://www.un.org/en/about-us'
# success: returns clean URL and domain name
>>> check_url(url, language='en')
('https://www.un.org/en/about-us', 'un.org')
# failure: doesn't return anything
>>> check_url(url, language='de')
>>>
# optional argument: strict
>>> url = 'https://en.wikipedia.org/'
>>> check_url(url, language='de', strict=False)
('https://en.wikipedia.org', 'wikipedia.org')
>>> check_url(url, language='de', strict=True)
>>>
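
To illustrate the idea behind language-aware filtering, here is a minimal standard-library sketch. It is not courlan's implementation: the helper names path_language and matches_language are hypothetical, and the rule (a two-letter path segment is treated as a language code) is a deliberately naive assumption that will misfire on segments such as /tv/.

```python
import re
from urllib.parse import urlparse

# Hypothetical helpers, not part of courlan: a two-letter path segment
# such as /en/ or /de/ is assumed to be a language code.
LANG_SEGMENT = re.compile(r"/([a-z]{2})(?:/|$)")

def path_language(url):
    """Return the first two-letter path segment, or None if there is none."""
    match = LANG_SEGMENT.search(urlparse(url).path.lower())
    return match.group(1) if match else None

def matches_language(url, language):
    """Accept URLs without a language hint or with a matching hint."""
    hint = path_language(url)
    return hint is None or hint == language

print(matches_language("https://www.un.org/en/about-us", "en"))  # True
print(matches_language("https://www.un.org/en/about-us", "de"))  # False
```

URLs without any hint are accepted here, which roughly mirrors a non-strict setting. A stricter variant could reject them instead.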

Define stricter restrictions on the expected content type with strict=True. This also blocks certain platforms and page types that crawlers should stay away from unless they target them explicitly, as well as other black holes where machines get lost.

# strict filtering
>>> check_url('https://www.twitch.com/', strict=True)
# blocked as it is a major platform

Sampling by domain name

>>> from courlan import sample_urls
>>> my_sample = sample_urls(my_urls, 100)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False

Web crawling and URL handling

Determine if a link leads to another host:

>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True
# default
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False
# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True
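
As a rough illustration of what such a comparison involves, the sketch below reproduces the three results above using only urllib.parse. The function naive_is_external is hypothetical and much cruder than a real implementation: it assumes the first label after an optional "www." is the site name, so it gives wrong answers for other subdomains (for example blog.example.org).

```python
from urllib.parse import urlparse

def naive_is_external(url, reference, ignore_suffix=True):
    """Compare two URLs by host name (hypothetical, simplified helper)."""
    def core(u):
        host = urlparse(u).hostname or ""
        if host.startswith("www."):
            host = host[4:]
        # crude assumption: first label is the name, the rest is the suffix
        return host.split(".")[0] if ignore_suffix else host
    return core(url) != core(reference)

print(naive_is_external("https://github.com/", "https://www.microsoft.com/"))  # True
print(naive_is_external("https://google.com/", "https://www.google.co.uk/"))  # False
print(naive_is_external("https://google.com/", "https://www.google.co.uk/",
                        ignore_suffix=False))  # True
```

A reliable version needs a public suffix list to tell "co.uk" apart from an ordinary subdomain, which is why dedicated libraries are used for domain extraction.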

Other useful functions dedicated to URL handling:

get_base_url(url): strip the URL of some of its parts
get_host_and_path(url): decompose URLs into two parts: protocol + host/domain and path
get_hostinfo(url): extract domain and host info (protocol + host/domain)
fix_relative_urls(baseurl, url): prepend necessary information to relative links

>>> from courlan import *
>>> url = 'https://www.un.org/en/about-us'
>>> get_base_url(url)
'https://www.un.org'
>>> get_host_and_path(url)
('https://www.un.org', '/en/about-us')
>>> get_hostinfo(url)
('un.org', 'https://www.un.org')
>>> fix_relative_urls('https://www.un.org', 'en/about-us')
'https://www.un.org/en/about-us'
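
For comparison, similar decompositions can be approximated with the standard library. This is only a sketch under the assumption that no extra cleaning is needed; split_host_and_path is a hypothetical name, and courlan's functions may handle more edge cases.

```python
from urllib.parse import urljoin, urlsplit

def split_host_and_path(url):
    """Split a URL into 'scheme://host' and the remaining path (sketch)."""
    parts = urlsplit(url)
    host = f"{parts.scheme}://{parts.netloc}"
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return host, path

print(split_host_and_path("https://www.un.org/en/about-us"))
# ('https://www.un.org', '/en/about-us')

# urljoin resolves a relative link against a base URL
print(urljoin("https://www.un.org/", "en/about-us"))
# https://www.un.org/en/about-us
```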

Other filters dedicated to crawl frontier management:

is_not_crawlable(url): check for deep web or pages generally not usable in a crawling context
is_navigation_page(url): check for navigation and overview pages

>>> from courlan import is_navigation_page, is_not_crawlable
>>> is_navigation_page('https://www.randomblog.net/category/myposts')
True
>>> is_not_crawlable('https://www.randomblog.net/login')
True
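
Filters of this kind typically rely on patterns in the URL path. The following sketch shows the general approach with two small, purely illustrative pattern lists. The function names and the patterns are assumptions, and courlan's own rules are more extensive.

```python
import re
from urllib.parse import urlparse

# Illustrative patterns only (assumptions, not courlan's actual rules)
NAVIGATION = re.compile(
    r"/(?:category|categories|tag|tags|archive|archives|page)(?:/|$)"
)
NOT_CRAWLABLE = re.compile(
    r"/(?:login|logout|signin|register|cart|checkout)(?:/|$)"
)

def looks_like_navigation(url):
    """True if the path contains a typical overview segment."""
    return bool(NAVIGATION.search(urlparse(url).path.lower()))

def looks_not_crawlable(url):
    """True if the path points to a login, cart or similar page."""
    return bool(NOT_CRAWLABLE.search(urlparse(url).path.lower()))

print(looks_like_navigation("https://www.randomblog.net/category/myposts"))  # True
print(looks_not_crawlable("https://www.randomblog.net/login"))  # True
print(looks_like_navigation("https://www.randomblog.net/2022/my-post"))  # False
```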

Python helpers

Helper function, scrub and normalize:

>>> from courlan import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'

Basic scrubbing only:

>>> from courlan import scrub_url

Basic canonicalization/normalization only, i.e. modifying and standardizing URLs in a consistent manner:

>>> from urllib.parse import urlparse
>>> from courlan import normalize_url
>>> my_url = normalize_url(urlparse(my_url))
# passing URL strings directly also works
>>> my_url = normalize_url(my_url)
# remove unnecessary components and re-order query elements
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'
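
The strict normalization step shown above (dropping tracking parameters and the fragment, then re-ordering the query) can be approximated with urllib.parse. The sketch below is not courlan's code: sketch_normalize is a hypothetical name, and treating every parameter starting with "utm_" as noise is an assumption made for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def sketch_normalize(url):
    """Drop utm_* parameters and the fragment, sort the query (sketch)."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")  # assumed tracker prefix
    ]
    query.sort()
    # simplification: the whole netloc is lower-cased
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(query), "")
    )

print(sketch_normalize(
    "http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment"
))
# http://test.net/foo.html?page=2&post=abc
```

Sorting the query elements means that two URLs differing only in parameter order map to the same string, which helps with deduplication.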

Basic URL validation only:

>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))
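
A basic validity check in the same spirit can be written on top of urlparse. The function below is a hypothetical sketch that returns the same kind of tuple. Its rules (HTTP or HTTPS scheme, at least two host labels, non-numeric last label) are assumptions chosen for the example, and they also reject plain IP addresses.

```python
from urllib.parse import urlparse

def sketch_validate(url):
    """Return (is_valid, parsed_url_or_None) based on simple rules (sketch)."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname or ""
    except ValueError:
        return False, None
    if parsed.scheme not in ("http", "https"):
        return False, None
    labels = host.split(".")
    # require at least two non-empty labels and a non-numeric last label
    if len(labels) < 2 or not all(labels) or labels[-1].isdigit():
        return False, None
    return True, parsed

print(sketch_validate("http://1234"))  # (False, None)
print(sketch_validate("http://www.example.org/")[0])  # True
```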

Command-line

The main functions are also available through a command-line utility.

$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
               [--strict] [-l LANGUAGE] [-r] [--sample]
               [--samplesize SAMPLESIZE] [--exclude-max EXCLUDE_MAX]
               [--exclude-min EXCLUDE_MIN]

optional arguments:

`-h, --help`	show this help message and exit

I/O:

Manage input and output

`-i INPUTFILE, --inputfile INPUTFILE`
	name of input file (required)
`-o OUTPUTFILE, --outputfile OUTPUTFILE`
	name of output file (required)
`-d DISCARDEDFILE, --discardedfile DISCARDEDFILE`
	name of file to store discarded URLs (optional)
`-v, --verbose`	increase output verbosity

Filtering:

Configure URL filters

`--strict`	perform more restrictive tests
`-l LANGUAGE, --language LANGUAGE`
	use language filter (ISO 639-1 code)
`-r, --redirects`
	check redirects

Sampling:

Use sampling by host, configure sample size

`--sample`	use sampling
`--samplesize SAMPLESIZE`
	size of sample per domain
`--exclude-max EXCLUDE_MAX`
	exclude domains with more than n URLs
`--exclude-min EXCLUDE_MIN`
	exclude domains with less than n URLs

License

coURLan is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.

Settings

courlan is optimized for English and German but its generic approach is also usable in other contexts.

To review details of strict URL filtering, see settings.py. These settings can be overridden by cloning the repository and recompiling the package locally.

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page.

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. Web corpus construction involves numerous design decisions, and this software package can help facilitate text data collection and enhance corpus quality.

Barbaresi, A. Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction, Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, p. 122-131.
Barbaresi, A. "Generic Web Content Extraction with Open-Source Software", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.

Contact: see homepage or GitHub.

Software ecosystem: see this graphic.

Similar work

These Python libraries perform similar normalization tasks but don't entail language or content filters. They also don't necessarily focus on crawl optimization:

furl
ural
urlnorm (outdated)
yarl

References

Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. Computer networks and ISDN systems, 30(1-7), 161–172.
Edwards, J., McCurley, K. S., and Tomlin, J. A. (2001). "An adaptive model for optimizing performance of an incremental web crawler". In Proceedings of the 10th international conference on World Wide Web - WWW '01. pp. 106–113.

Comments

Replace tldextract with tld?

Remove tldextract and replace it with tld to reduce the total number of package dependencies as mentioned in https://github.com/adbar/trafilatura/issues/41
enhancement

opened by adbar 1

Investigate sampling issue

The sampling function may not always work as it should, working example:

>>> from courlan import sample_urls
>>> my_urls = ['https://example.org/' + str(x) for x in range(100)]
>>> my_sample = list(sample_urls(my_urls, 10))

opened by adbar 0

Drop support for Python 3.5

Only support Python versions 3.6+ in the future and see if the code can be improved or cleaned on the way.

Example to search the code: https://github.com/adbar/courlan/search?l=Python&q=%22Python+3.%22
enhancement

opened by adbar 0

Releases (v0.8.3)

v0.8.3 (Jul 28, 2022)
- fixed bug in domain name extraction
- uniform logging parameters
- Full Changelog: https://github.com/adbar/courlan/compare/v0.8.2...v0.8.3

v0.8.2 (Jul 26, 2022)
- full type hinting
- maintenance: code linted
- Full Changelog: https://github.com/adbar/courlan/compare/v0.8.1...v0.8.2

v0.8.1 (Jul 11, 2022)
- add type annotations and check with mypy
- url_filter() function moved from Trafilatura
- code style: use black

v0.8.0 (Jun 30, 2022)
- performance optimizations
- fast track for domain extraction (extract_domain(url, fast=True)), now taking subdomains into account
- Full Changelog: https://github.com/adbar/courlan/compare/v0.7.2...v0.8.0

v0.7.2 (May 17, 2022)
- UrlStore: threading lock and convenience functions added

v0.7.1 (Mar 29, 2022)
- bug in sampling fixed
- UrlStore: validation by default
- Full Changelog: https://github.com/adbar/courlan/compare/v0.7.0...v0.7.1

v0.7.0 (Mar 21, 2022)
- UrlStore class added: data store containing URLs with relevant information
- code cleaning and maintenance (bugs, simplification)
- Full Changelog: https://github.com/adbar/courlan/compare/v0.6.0...v0.7.0

v0.6.0 (Nov 11, 2021)
- reviewed code base: simplicity and execution speed
- dropped support for Python 3.5

v0.5.0 (Oct 13, 2021)
- more complex language heuristics, use langcodes
- extended blacklists and whitelists
- more precise filters and more efficient code
- support for Python 3.10

v0.4.2 (Jul 28, 2021)
- enhanced cleaning
- fixed language filter

v0.4.1 (Jun 10, 2021)
- keep trailing slashes to avoid redirection
- fixes: normalization and crawlable URLs

v0.4.0 (May 25, 2021)
- URL manipulation tools added: extract parts, fix relative URLs
- filters added: language, navigation and crawls
- more robust link handling and extraction
- removed support for Python 3.4

v0.3.1 (Feb 19, 2021)
- improve filter precision

v0.3.0 (Jan 4, 2021)
- reduced dependencies: replace requests with bare urllib3, and tldextract with tld for Python 3.6 upwards
- better path and fragment normalization

v0.2.3 (Oct 20, 2020)
- Python 3.9 compatibility
- Simplified imports
- Bug fixes

v0.2.2 (Sep 21, 2020)
- English and German language filters
- Function to detect external links
- Support for domain blacklisting

v0.2.1 (Sep 2, 2020)
- Less aggressive strict filters
- CLI bug fixed

v0.2.0 (Sep 1, 2020)
- Cleaner and more efficient filtering
- Helper functions to scrub, clean and normalize
- Removed two dependencies with more extensive usage of urllib.parse

v0.1.0 (Aug 27, 2020)
- Cleaning and filtering targeting non-spam HTML pages with primarily text
- URL validation
- Sampling by domain name
- Command-line interface (CLI) and Python tool

Owner

Adrien Barbaresi

Research scientist – natural language processing, web scraping and text analytics. Mostly with Python.

GitHub https://adrien.barbaresi.eu/blog/easy-content-aware-url-filtering.html

