Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.

Overview

tldextract


tldextract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL. For example, say you want just the 'google' part of 'http://www.google.com'.

Everybody gets this wrong. Splitting on the '.' and taking the last 2 elements goes a long way only if you're thinking of simple domains, e.g. .com. Take parsing http://forums.bbc.co.uk for example: the naive splitting method above will give you 'co' as the domain and 'uk' as the TLD, instead of 'bbc' and 'co.uk' respectively.

tldextract on the other hand knows what all gTLDs and ccTLDs look like by looking up the currently living ones according to the Public Suffix List (PSL). So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

>>> import tldextract

>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

>>> tldextract.extract('http://forums.bbc.co.uk/') # United Kingdom
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')

>>> tldextract.extract('http://www.worldbank.org.kg/') # Kyrgyzstan
ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg')

ExtractResult is a namedtuple, so it's simple to access the parts you want.

>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')
>>> # rejoin subdomain and domain
>>> '.'.join(ext[:2])
'forums.bbc'
>>> # a common alias
>>> ext.registered_domain
'bbc.co.uk'

Note subdomain and suffix are optional. Not all URL-like inputs have a subdomain or a valid suffix.

>>> tldextract.extract('google.com')
ExtractResult(subdomain='', domain='google', suffix='com')

>>> tldextract.extract('google.notavalidsuffix')
ExtractResult(subdomain='google', domain='notavalidsuffix', suffix='')

>>> tldextract.extract('http://127.0.0.1:8080/deployed/')
ExtractResult(subdomain='', domain='127.0.0.1', suffix='')

If you want to rejoin the whole namedtuple, regardless of whether a subdomain or suffix were found:

>>> ext = tldextract.extract('http://127.0.0.1:8080/deployed/')
>>> # this has unwanted dots
>>> '.'.join(ext)
'.127.0.0.1.'
>>> # join each part only if it's truthy
>>> '.'.join(part for part in ext if part)
'127.0.0.1'

By default, this package supports the public ICANN TLDs and their exceptions. You can optionally support the Public Suffix List's private domains as well.

This module started by implementing the chosen answer from this StackOverflow question on getting the "domain name" from a URL. However, the proposed regex solution doesn't address many country codes like com.au, or the exceptions to country codes like the registered domain parliament.uk. The Public Suffix List does, and so does this module.

Installation

Latest release on PyPI:

pip install tldextract

Or the latest dev version:

pip install -e 'git://github.com/john-kurkowski/tldextract.git#egg=tldextract'

Command-line usage splits the URL components by space:

tldextract http://forums.bbc.co.uk
# forums bbc co.uk

Note About Caching

Beware: when first run, the module updates its TLD list with a live HTTP request. This updated TLD set is usually cached indefinitely in `$HOME/.cache/python-tldextract`. To control the cache's location, set the TLDEXTRACT_CACHE environment variable or set the cache_dir path when constructing a TLDExtract instance.

(Arguably runtime bootstrapping like that shouldn't be the default behavior, especially for production systems. But I want you to have the latest TLDs, particularly when I haven't kept this code up to date.)

# extract callable that falls back to the included TLD snapshot, no live HTTP fetching
no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=None)
no_fetch_extract('http://www.google.com')

# extract callable that reads/writes the updated TLD set to a different path
custom_cache_extract = tldextract.TLDExtract(cache_dir='/path/to/your/cache/')
custom_cache_extract('http://www.google.com')

# extract callable that doesn't use caching
no_cache_extract = tldextract.TLDExtract(cache_dir=False)
no_cache_extract('http://www.google.com')

If you want to stay fresh with the TLD definitions--though they don't change often--delete the cache file occasionally, or run

tldextract --update

or:

env TLDEXTRACT_CACHE="~/tldextract.cache" tldextract --update

It is also recommended to delete the file after upgrading this lib.
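
For example, a minimal sketch that clears the cache from Python, assuming the default location described above (or the TLDEXTRACT_CACHE override):

import os
import shutil

# Assumption: the cache lives in the default directory unless TLDEXTRACT_CACHE says otherwise.
cache_dir = os.environ.get(
    'TLDEXTRACT_CACHE',
    os.path.expanduser('~/.cache/python-tldextract'),
)
shutil.rmtree(cache_dir, ignore_errors=True)  # it will be rebuilt on the next extract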

Advanced Usage

Public vs. Private Domains

The PSL maintains a concept of "private" domains.

PRIVATE domains are amendments submitted by the domain holder, as an expression of how they operate their domain security policy. … While some applications, such as browsers when considering cookie-setting, treat all entries the same, other applications may wish to treat ICANN domains and PRIVATE domains differently.

By default, tldextract treats public and private domains the same.

>>> extract = tldextract.TLDExtract()
>>> extract('waiterrant.blogspot.com')
ExtractResult(subdomain='waiterrant', domain='blogspot', suffix='com')

The following overrides this.

>>> extract = tldextract.TLDExtract()
>>> extract('waiterrant.blogspot.com', include_psl_private_domains=True)
ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com')

or to change the default for all extract calls,

>>> extract = tldextract.TLDExtract(include_psl_private_domains=True)
>>> extract('waiterrant.blogspot.com')
ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com')

The thinking behind the default is that it matches the more common case of how people mentally parse a URL: it assumes neither familiarity with the PSL nor that the PSL makes such a distinction. Note this may run counter to the default parsing behavior of other PSL-based libraries.

Specifying your own URL or file for the Suffix List data

You can specify your own input data in place of the default Mozilla Public Suffix List:

extract = tldextract.TLDExtract(
    suffix_list_urls=["http://foo.bar.baz"],
    # Recommended: specify your own cache path, to minimize ambiguity about where
    # tldextract is getting its data (or cached data) from.
    cache_dir='/path/to/your/cache/',
    fallback_to_snapshot=False)

The above snippet will fetch from the URL you specified the first time it needs to download the suffix list (i.e. if the cached version doesn't exist).

If you want to use input data from your local filesystem, just use the file:// protocol:

extract = tldextract.TLDExtract(
    suffix_list_urls=["file://absolute/path/to/your/local/suffix/list/file"],
    cache_dir='/path/to/your/cache/',
    fallback_to_snapshot=False)

Use an absolute path when specifying the suffix_list_urls keyword argument. os.path is your friend.
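
For example, a minimal sketch that turns a relative path into the absolute file:// URL shown above (the file name here is only an illustration):

import os

import tldextract

# Hypothetical local copy of the suffix list; point this at your own file.
suffix_list_path = os.path.abspath('public_suffix_list.dat')

extract = tldextract.TLDExtract(
    suffix_list_urls=['file://' + suffix_list_path],
    cache_dir='/path/to/your/cache/',
    fallback_to_snapshot=False)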

FAQ

Can you add suffix ____? Can you make an exception for domain ____?

This project doesn't contain an actual list of public suffixes. That comes from the Public Suffix List (PSL). Submit amendments there.

(In the meantime, you can tell tldextract about your exception by either forking the PSL and using your fork in the suffix_list_urls param, or adding your suffix piecemeal with the extra_suffixes param.)
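
For example, a hedged sketch of the extra_suffixes approach (the suffix below is made up for illustration):

import tldextract

# 'example.custom' stands in for a private suffix you control.
extract = tldextract.TLDExtract(extra_suffixes=['example.custom'])
extract('www.mysite.example.custom')
# Expected, under that assumption:
# ExtractResult(subdomain='www', domain='mysite', suffix='example.custom')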

If I pass an invalid URL, I still get a result, no error. What gives?

To keep tldextract light in LoC & overhead, and because there are plenty of URL validators out there, this library is very lenient with input. If valid URLs are important to you, validate them before calling tldextract.
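
One possible pre-validation sketch, using only the standard library; what counts as "valid" is up to your application, and the helper name is made up:

from urllib.parse import urlsplit

import tldextract

def extract_if_plausible(url):
    # Minimal sanity check: require a scheme and a network location.
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError('not a URL: %r' % url)
    return tldextract.extract(url)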

This lenient stance lowers the learning curve of using the library, at the cost of desensitizing users to the nuances of URLs. Who knows how much. But in the future, I would consider an overhaul. For example, users could opt into validation, either receiving exceptions or error metadata on results.

Contribute

Setting up

  1. git clone this repository.
  2. Change into the new directory.
  3. pip install tox

Running the Test Suite

Run all tests against all supported Python versions:

tox --parallel

Run all tests against a specific Python environment configuration:

tox -l
tox -e py37

Comments
  • Cache entire public suffix list. Select at runtime.


    Addresses #66 but is not backwards compatible.

    Changes

    • Moves include_psl_private_domains to the __call__ method. This is now something you choose on a per-call basis.
    • The entire dataset from publicsuffix.org is saved to cache
    • Added 'source' attribute to named tuple which tells you which suffix list the url was matched against
    • Ensured no weird cache issues happen when using with different suffix_list_urls by using different filenames per suffix_list_urls
    • Updates the bundled snapshot
    opened by brycedrennan 30
  • 3.0 creates permission error on .suffix_cache


    We had a python dependency on a package that had a dependency on tldextract > 2.0. Our build pipeline has been pulling in tldextract 2.3, but today it pulled in 3.0 and we started getting this exception in our environment:

    File "/usr/local/lib/python3.7/site-packages/tldextract/cache.py", line 104, in run_and_cache
        cache_filepath = self._key_to_cachefile_path(namespace, key_args)
    File "/usr/local/lib/python3.7/site-packages/tldextract/cache.py", line 95, in _key_to_cachefile_path
        _make_dir(cache_path)
    File "/usr/local/lib/python3.7/site-packages/tldextract/cache.py", line 155, in _make_dir
        os.makedirs(os.path.dirname(filename))
    File "/usr/local/lib/python3.7/os.py", line 213, in makedirs
        makedirs(head, exist_ok=exist_ok)
    File "/usr/local/lib/python3.7/os.py", line 223, in makedirs
        mkdir(name, mode)
    PermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.7/site-packages/tldextract/.suffix_cache'

    opened by jaz2038 21
  • Python 3.2 compatibility


    Since Python 3, all strings are unicode, so the u'...' notation no longer makes sense.

    One can't install your package on a Python 3 installation because of the return statement of this function

    def fetch_file(urls):
        """ Decode the first successfully fetched URL, from UTF-8 encoding to
        Python unicode.
        """
        s = ''
    
        for url in urls:
            try:
                conn = urlopen(url)
                s = conn.read()
            except Exception as e:
                LOG.error('Exception reading Public Suffix List url ' + url + ' - ' + str(e) + '.')
            else:
                return _decode_utf8(s)
    
        LOG.error('No Public Suffix List found. Consider using a mirror or constructing your TLDExtract with `fetch=False`.')
        return u''
    
    opened by Agmagor 19
  • Domain parsing fails with trailing spaces


    The text values passed to extract should have trailing spaces stripped, e.g.,

    >>> tldextract.extract('ads.adiquity.com ')
    ExtractResult(subdomain='ads.adiquity', domain='com ', suffix='')
    >>> tldextract.extract('ads.adiquity.com   '.strip())
    ExtractResult(subdomain='ads', domain='adiquity', suffix='com')
    
    opened by neuroticnetworks 15
  • feature: private tlds can be used at call-time


    This is a second attempt at doing what was done in https://github.com/john-kurkowski/tldextract/pull/144.

    Addresses #66

    • Adds include_psl_private_domains to the __call__ method. This is now something you can choose on a per-call basis. The object level argument now is only a default value for each call.
    • The entire dataset from publicsuffix.org is saved to cache
    • Ensured no weird cache issues happen when using with different suffix_list_urls by using different filenames per suffix_list_urls
    • Use filelock to support multiprocessing and multithreading use cases
    • Updates the bundled snapshot to be the raw publicsuffix data. Need to look at performance impact of this.
    • various other cleanups
    • Breaking change cache_file => cache_dir
    opened by brycedrennan 14
  • Use JSON instead of pickle for tld data


    .tld_set_snapshot and .tld_set use pickle to store the TLD information. While this is perfectly fine in most cases it brings the following issues:

    • Binary data stored in your git repository is a bad practice
    • If this library gets packaged for ubuntu/debian, the maintainer will complain. I just started using tldextract in the w3af project which is part of ubuntu/debian. When the package maintainer packages the next version he'll most likely dislike the binary blob
    • Pickles are "executables". A specially crafted pickle can trigger an arbitrary remote command execution when unpickled. While I did review the library for bugs and backdoors before including it in w3af, I did not read the whole pickle; which is bad for w3af user's security.
    opened by andresriancho 14
  • logging error


    This is a non-blocking error that I can capture only if I import the logging module.

    manuel@Manuel-NG:~> python
    Python 2.7.2 (default, Oct 4 2011, 14:55:10) [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
    ...

    import tldextract
    import logging
    logging.basicConfig()
    one, two, three = tldextract.extract('forums.bbc.co.uk/nano/micro.html')

    ERROR:/Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/tldextract.pyc:error reading TLD cache file /Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/.tld_set: [Errno 2] No such file or directory: '/Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/.tld_set'
    WARNING:/Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/tldextract.pyc:unable to cache TLDs in file /Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/.tld_set: [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/tldextract-1.0-py2.7.egg/tldextract/.tld_set'

    touch'ed and chown'ed .tld_set with no success

    chown'ed the folder, everything worked fine. Something to review in the install process, under OSX? (mine is Lion 10.7.3)

    opened by mcrotti 13
  • Incorrect tld for appspot.com (support excluding private domains on lookup)


    There are multiple uses of the Mozilla public suffix list which allow sites such as "appspot.com" to appear on the list as a tld instead of being split into domain="appspot" and tld="com".

    This is perfectly reasonable behavior for some use cases, but for others it would be helpful to have the "private" domains be excluded. Mozilla has split the list into "ICANN Domains" and "Private Domains", and it would be useful to optionally be able to exclude the private domains so that sites like "appspot.com" would have their tld reflected as "com".

    opened by davefd 11
  • Update tests


    This pull-request:

    • Drops test support for python 2.6.
      • too many errors related to pylint.
      • decided to keep pylint and drop python 2.6.
    • Adds requests version variations to tox config similar to travis.
      • tox -l for a list of all environments included
      • detox to run tests in parallel
    • Moves test dependencies into tox.
    • Adds a new environment for code style conformance validation.
      • pycodestyle
      • fixed some code style errors
    • Changes travis configuration to use tox.
    • Changes travis configuration to use containers.
    opened by medecau 10
  • Regex does not detect many TLDs properly


    Hey,

    The regex used by tldextract fails to detect ".com.au" properly, amongst many others on this page:

    https://wiki.mozilla.org/TLD_List

    Might be worth updating to match this list?? I would do it, but I don't have any unit tests :(

    Cal

    opened by foxx 10
  • Allow cache_file argument to also accept list


    Problem: We have an application that performs TLD operations using Celery workers. As we have a couple of Celery workers, whenever a cache_file update is called, it only updates the file on the instance of the worker that picked up that task. So the content differs across the Celery instances.

    If tldextract could accept a list as the cache_file argument, that list could essentially be stored in Redis and any worker could pick it up easily.

    opened by hibare 9
  • Occasional PermissionError when running tldextract.extract


    Hi, I've encountered some occasional PermissionError exceptions when calling tldextract.extract, but I'm not sure if this can be reproduced reliably.

    I should also mention that I'm running version 3.1, but I haven't seen any related fixes in the changelog. I'll upgrade to the latest regardless.

    Here are some logs from sentry:

    Attempting to acquire lock XXX on /root/.cache/python-tldextract/3.9.12.final__local__xxx__tldextract-3.1.0/publicsuffix.org-tlds/xxx.tldextract.json.lock
    
    PermissionError: [Errno 13] Permission denied: '/root/.cache/python-tldextract/3.9.12.final__local__xxx__tldextract-3.1.0/publicsuffix.org-tlds/xxx.tldextract.json.lock'
    
    opened by bogdanpetrea 0
  • Adding this project to the PSL website


    The visibility of this library could be greatly increased if it were referenced on the PSL website, at the end of this page: https://publicsuffix.org/learn

    I had a lot of trouble finding this library. Before that, I used other libraries to which I added an automatic update system. I found this library by chance in a GitHub project based on Qt. Could someone send a little message on their mailing list to ask them to reference this library? https://groups.google.com/forum/#!forum/publicsuffix-discuss

    opened by axoroll7 0
  • IPv6 addresses are not handled


    For URLs using IPv4 addresses, the host address gets extracted correctly using .domain but .fqdn gives the empty string:

    >>> tldextract.extract("https://127.0.0.1:1234/foobar").domain
    '127.0.0.1'
    >>> tldextract.extract("https://127.0.0.1:1234/foobar").fqdn
    ''
    

    For URLs using IPv6 addresses, neither method extracts the host address correctly:

    >>> tldextract.extract("https://[::1]:1234/foobar").domain
    '['
    >>> tldextract.extract("https://[::1]:1234/foobar").fqdn
    '' 
    >>> tldextract.extract("https://[FEC0:0000:0000:0000:0000:0000:0000:0001]:1234/foobar").domain
    '[FEC0'
    >>> tldextract.extract("https://[FEC0:0000:0000:0000:0000:0000:0000:0001]:1234/foobar").fqdn
    ''
    

    This was tested using tldextract version 3.2.1 on Python 3.9.12

    opened by ohad-ivix 6
  • Timeout: The file lock 'some/path/to/8738.tldextract.json.lock' could not be acquired


    I start getting this error when I increase the number of processes / threads to a certain point.

    Is there a way to increase the timeout value?

    More importantly, why is lock needed here if tldextract isn't writing anything, only reading?

    opened by rudolfovic 3
  • Reconsider network auto-update by default


    While it's understandable and useful in many situations to want the latest dataset, it can cause issues in some situations:

    • ephemeral environments that will not be able to cache the network calls to disk. I'm thinking things like k8s tasks or other distributed systems. They'll be refetching the list at every invocation.
    • firewalled or no-connection environments. I believe the library works in this case but only after the delay of making a failed http connection

    Not sure what a solution would look like but here are some ideas:

    • automate the publishing of the python package on a schedule with an updated tld_set
    • make the default non-autoupdating but allow the self-updating version to be easily used via function argument. Something like use_latest or use_autoupdating
    • add a TTL to the cached version. For example we could set it at 7 days and it would automatically refetch the list if the cached version was older than that.
    opened by brycedrennan 3