Fast and robust date extraction from web pages, with Python or on the command-line

Adrien Barbaresi

Last update: Dec 14, 2022

Related tags

Web Content Extracting nlp metadata natural-language-processing datetime date information-extraction web-scraping html-parsing webscraping metadata-extraction date-parser entity-extraction

Overview

htmldate: find the publication date of web pages

Code:	https://github.com/adbar/htmldate
Documentation:	https://htmldate.readthedocs.io
Issue tracker:	https://github.com/adbar/htmldate/issues

Find original and updated publication dates of any web page. From the command-line or within Python, all the steps needed from web page download to HTML parsing, scraping, and text analysis are included.

In a nutshell

With Python:

>>> from htmldate import find_date
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'2016-12-23'
>>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True)
'2016-06-23'

On the command-line:

$ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
'2016-12-23'

Features

Compatible with all recent versions of Python (3.6 and upwards, see Installation)
Multilingual, robust and efficient (used in production on millions of documents)
URLs, HTML files, or HTML trees are given as input (includes batch processing)
Output as string in any date format (defaults to ISO 8601 YMD)
Detection of both original and updated dates

htmldate finds original and updated publication dates of web pages using heuristics on HTML code and linguistic patterns. It provides the following ways to date an HTML document:

Markup in header: Common patterns are used to identify relevant elements (e.g. link and meta elements) including Open Graph protocol attributes and a large number of CMS idiosyncrasies
HTML code: The whole document is then searched for structural markers: abbr and time elements as well as a series of attributes (e.g. postmetadata)
Bare HTML content: A series of heuristics is run on text and markup:

in fast mode the HTML page is cleaned and precise patterns are targeted

in extensive mode all potential dates are collected and a disambiguation algorithm determines the best one
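
The three stages listed above can be illustrated with a minimal sketch that uses only the standard library. It is not htmldate's implementation: the class `DateHints`, the function `staged_date`, the restriction to `article:published_time`, and the plain `YYYY-MM-DD` regex are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


class DateHints(HTMLParser):
    """Collect date hints from meta elements, time elements and text."""

    def __init__(self):
        super().__init__()
        self.meta = []  # stage 1: markup in header
        self.time = []  # stage 2: structural markers in the HTML code
        self.text = []  # stage 3: bare HTML content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "article:published_time":
            self.meta.append(attrs.get("content") or "")
        elif tag == "time" and attrs.get("datetime"):
            self.time.append(attrs["datetime"])

    def handle_data(self, data):
        self.text.append(data)


def staged_date(html):
    """Return the first YYYY-MM-DD string found, trying the stages in order."""
    parser = DateHints()
    parser.feed(html)
    for stage in (parser.meta, parser.time, parser.text):
        for candidate in stage:
            match = ISO_DATE.search(candidate)
            if match:
                return match.group(0)
    return None


page = (
    '<html><head><meta property="article:published_time" '
    'content="2022-03-23T06:15:58+00:00"></head>'
    '<body><time datetime="2022-03-24">x</time><p>Updated 2022-03-25</p></body></html>'
)
print(staged_date(page))
```

The header value wins here because it is checked first; the later stages are only consulted when the earlier ones yield nothing.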

Performance

500 web pages containing identifiable dates (as of 2021-09-24)
Python Package	Precision	Recall	Accuracy	F-Score	Time
articleDateExtractor 0.20	0.769	0.691	0.572	0.728	3.3x
date_guesser 2.1.4	0.738	0.544	0.456	0.626	20x
goose3 3.1.9	0.821	0.453	0.412	0.584	8.2x
htmldate[all] 0.9.1 (fast)	0.839	0.906	0.772	0.871	1x
htmldate[all] 0.9.1 (extensive)	0.825	0.990	0.818	0.900	1.7x
newspaper3k 0.2.8	0.729	0.630	0.510	0.675	8.4x
news-please 1.5.21	0.769	0.691	0.572	0.728	30x

For complete results and explanations see the evaluation page.
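
As a reading aid, and assuming the F-Score column is the usual harmonic mean of precision and recall (the table itself does not say so), the two htmldate rows can be rechecked as follows. `f_score` is a hypothetical helper, not part of the package.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table above
print(round(f_score(0.839, 0.906), 3))  # htmldate fast
print(round(f_score(0.825, 0.990), 3))  # htmldate extensive
```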

Installation

This Python package is tested on Linux, macOS and Windows systems and is compatible with Python 3.6 upwards. It is available on the package repository PyPI and can be installed with pip (pip3 where applicable): pip install htmldate, and optionally pip install htmldate[speed].

Documentation

For more details on installation, Python & CLI usage, please refer to the documentation: htmldate.readthedocs.io

License

htmldate is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. There are web pages for which neither the URL nor the server response provide a reliable way to find out when a document was published or modified. For more information:

@article{barbaresi-2020-htmldate,
  title = {{htmldate: A Python package to extract publication dates from web pages}},
  author = "Barbaresi, Adrien",
  journal = "Journal of Open Source Software",
  volume = 5,
  number = 51,
  pages = 2439,
  url = {https://doi.org/10.21105/joss.02439},
  publisher = {The Open Journal},
  year = 2020,
}

Barbaresi, A. "htmldate: A Python package to extract publication dates from web pages", Journal of Open Source Software, 5(51), 2439, 2020. DOI: 10.21105/joss.02439
Barbaresi, A. "Generic Web Content Extraction with Open-Source Software", Proceedings of KONVENS 2019, Kaleidoscope Abstracts, 2019.
Barbaresi, A. "Efficient construction of metadata-enhanced web corpora", Proceedings of the 10th Web as Corpus Workshop (WAC-X), 2016.

You can contact me via my contact page or GitHub.

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page. Thanks to the contributors who submitted features and bugfixes!

Kudos to the following software libraries:

ciso8601, lxml, dateparser
A few patterns are derived from the python-goose, metascraper, newspaper and articleDateExtractor libraries. This module extends their coverage and robustness significantly.

Comments

Memory leak

See issue https://github.com/adbar/trafilatura/issues/216.

Extracting the date from the same web page multiple times shows that the module is leaking memory. This doesn't appear to be related to extensive_search:

import os
import psutil
from htmldate import find_date

with open('test.html', 'rb') as inputf:
    html = inputf.read()

for i in range(10):
    result = find_date(html, extensive_search=False)
    process = psutil.Process(os.getpid())
    print(i, ":", process.memory_info().rss / 1024 ** 2)

tracemalloc doesn't give any clue.

bug

opened by adbar 21

feature: supports delaying url date extraction

Adds a feature to improve the precision of dates by delaying date extraction from the URL. See https://github.com/adbar/htmldate/issues/55.

Adds the boolean parameter url_delayed to the find_date function.

This is slightly hacky, but it is a quick fix. A better long-term solution would be to allow the extractors to be defined in order.
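
The longer-term idea mentioned above, running extractors in a caller-defined order, could look roughly like this. Everything here is hypothetical (`date_from_url`, `date_from_meta`, `first_date` and both regular expressions); it is not htmldate's API, only a sketch of why moving the URL extractor later can yield a more precise result.

```python
import re
from typing import Callable, Optional, Sequence

URL_DATE = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")
META_DATE = re.compile(r'content="(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})')

Extractor = Callable[[str, str], Optional[str]]


def date_from_url(url: str, html: str) -> Optional[str]:
    """Day-level date taken from a /YYYY/MM/DD/ path segment."""
    match = URL_DATE.search(url)
    return "-".join(match.groups()) if match else None


def date_from_meta(url: str, html: str) -> Optional[str]:
    """Second-level timestamp taken from a meta content attribute."""
    match = META_DATE.search(html)
    return match.group(1) if match else None


def first_date(url: str, html: str, extractors: Sequence[Extractor]) -> Optional[str]:
    """Run the extractors in the given order and return the first hit."""
    for extractor in extractors:
        result = extractor(url, html)
        if result is not None:
            return result
    return None


url = "https://www.ot.gr/2022/03/23/apopseis/daimonopoiisi/"
html = '<meta property="article:published_time" content="2022-03-23T06:15:58+00:00">'
print(first_date(url, html, [date_from_url, date_from_meta]))  # URL first
print(first_date(url, html, [date_from_meta, date_from_url]))  # URL delayed
```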

opened by getorca 9
Good test cases

Hi Adrien

here are a few test cases where the extraction gave a wrong answer:

https://www.gardeners.com/how-to/vegetable-gardening/5069.html

https://www.almanac.com/vegetable-gardening-for-beginners

Somewhat related, this one 'hangs': https://www.homedepot.com/c/ah/how-to-start-a-vegetable-garden/9ba683603be9fa5395fab90d6de2854

opened by vprelovac 9
Add new test cases including more global stories
This adds a new set of test cases based on a global random sample of 200 articles from the Media Cloud dataset (related to #8). We currently use our own date_guesser library and are evaluating switching to htmldate.

This new corpus includes 200 articles discovered via a search of stories from 2020 in the Media Cloud corpus. The set of countries of origin, and languages, is representative of the ~60k sources we ingest from every day.

The htmldate code still performs well against this new test corpus:

Name	Precision	Recall	Accuracy	F-score
htmldate extensive	0.755102	0.973684	0.74	0.850575
htmldate fast	0.769663	0.861635	0.685	0.813056
newspaper	0.671141	0.662252	0.5	0.666667
newsplease	0.736527	0.788462	0.615	0.76161
articledateextractor	0.72973	0.675	0.54	0.701299
date_guesser	0.686567	0.582278	0.46	0.630137
goose	0.75	0.508772	0.435	0.606272

A few notes:

We changed comparison.py to load test data from .json files so the test data is isolated from the code itself.

The new set of stories and dates are in test/eval_mediacloud_2020.json, with HTML cached in tests/eval.

The evaluation results are now printed out via the tabulate module, and saved to the file system.

Perhaps the two evaluations sets should be merged into one larger one? Or the scores combined between them? We weren't sure how to approach this.

Interesting to note that overall all the precision scores were lower against this corpus - more false positives. Recall actually slightly better against this set - fewer false negatives.
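
To make the remark about false positives and false negatives concrete, the sketch below recomputes the "htmldate extensive" row from confusion counts. The counts (148 correct dates, 48 wrong dates and 4 missing dates out of 200 pages) are not stated in the post; they are inferred because they reproduce the reported figures. The formulas are the standard ones and are assumed to match the evaluation script.

```python
def scores(tp: int, fp: int, fn: int, tn: int = 0):
    """Precision, recall, accuracy and F-score from confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)  # lowered by false positives (wrong dates)
    recall = tp / (tp + fn)     # lowered by false negatives (missing dates)
    accuracy = (tp + tn) / total
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f_score


# Inferred counts for "htmldate extensive" on the 200-article corpus
print([round(value, 6) for value in scores(tp=148, fp=48, fn=4)])
```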

We hope this contribution helps document the performance of the various libraries against a more global dataset.
opened by rahulbot 8
`find_date` doesn't extract `%d %b %Y` formatted dates in free text
For the following MWE:

from htmldate import find_date
print(find_date("<html><body>Wed, 19 Oct 2022 14:24:05 +0000</body></html>"))

htmldate outputs 2022-01-01 instead of the expected 2022-10-19.

I've traced the execution of the above call and I believe it is the search_page function that has the bug. It doesn't seem to catch the above date pattern as a valid date and only grabs onto the 2022 part of the date string (which autocompletes the rest to 1st Jan).

I haven't found time to understand why the bug happens in detail so I don't have a solution right now. I'll try and see if I can fix the bug and will make a PR if I can.
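
For reference, the string in this report follows the RFC 2822 / RFC 5322 date layout, which Python's standard library can parse directly. The sketch below only shows the value one would expect; it is not a fix for search_page.

```python
from email.utils import parsedate_to_datetime

raw = "Wed, 19 Oct 2022 14:24:05 +0000"
parsed = parsedate_to_datetime(raw)  # timezone-aware datetime
print(parsed.date().isoformat())
```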
enhancement
opened by k-sareen 7

return datetime instead of date

Is there a way to force htmldate to look for a datetime rather than a date, or to prioritise specific extractors over others, e.g. Open Graph over URL extraction? Let me give you an example:

from htmldate import find_date
url = "https://www.ot.gr/2022/03/23/apopseis/daimonopoiisi/"
find_date(url, outputformat='%Y-%m-%d %H:%M:%S', verbose = True)

INFO:htmldate.utils:URL detected, downloading: https://www.ot.gr/2022/03/23/apopseis/daimonopoiisi/
DEBUG:urllib3.connectionpool:Resetting dropped connection: www.ot.gr
DEBUG:urllib3.connectionpool:https://www.ot.gr:443 "GET /2022/03/23/apopseis/daimonopoiisi/ HTTP/1.1" 200 266737
DEBUG:htmldate.extractors:found date in URL: /2022/03/23/
'2022-03-23 00:00:00'

returns:

'2022-03-23 00:00:00'

But if you look at the article you can find: <meta property="article:published_time" content="2022-03-23T06:15:58+00:00">
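
As an illustration of the sub-daily result being asked for, the value of the meta element quoted above can be parsed and formatted with the standard library alone. This is a workaround sketch and does not involve htmldate.

```python
from datetime import datetime

# Value of the article:published_time meta element quoted above
content = "2022-03-23T06:15:58+00:00"
published = datetime.fromisoformat(content)
print(published.strftime("%Y-%m-%d %H:%M:%S"))
```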

question

opened by kvasilopoulos 7

"URL couldn't be processed: %s" when calling find_date()

I have a problem with extracting the date from a website:

date = find_date('https://uk.investing.com/news/astrazeneca-earnings-revenue-beat-in-q4-2582731')

I got such an error:

ValueError                                Traceback (most recent call last)
in ()
----> 1 date = find_date('https://uk.investing.com/news/astrazeneca-earnings-revenue-beat-in-q4-2582731');

1 frames
/usr/local/lib/python3.7/dist-packages/htmldate/core.py in find_date(htmlobject, extensive_search, original_date, outputformat, url, verbose, min_date, max_date)
    598     if verbose is True:
    599         logging.basicConfig(level=logging.DEBUG)
--> 600     tree = load_html(htmlobject)
    601     find_date.extensive_search = extensive_search
    602     min_date, max_date = get_min_date(min_date), get_max_date(max_date)

/usr/local/lib/python3.7/dist-packages/htmldate/utils.py in load_html(htmlobject)
    165     # log the error and quit
    166     if htmltext is None:
--> 167         raise ValueError("URL couldn't be processed: %s", htmlobject)
    168     # start processing
    169     tree = None

ValueError: ("URL couldn't be processed: %s", 'https://uk.investing.com/news/astrazeneca-earnings-revenue-beat-in-q4-2582731')
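
The literal `%s` in the message is a side effect of how the exception is raised: unlike the logging functions, exception constructors do not apply %-formatting to their arguments, so both values end up in `args` as a tuple. A small demonstration (the URL is a placeholder):

```python
url = "https://example.org/article"

# Arguments are stored as-is; no interpolation takes place
unformatted = ValueError("URL couldn't be processed: %s", url)
print(unformatted.args)

# Formatting the message before raising gives a readable error
formatted = ValueError(f"URL couldn't be processed: {url}")
print(formatted)
```

This explains the odd message only; why the download itself failed is a separate question.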

I will be grateful for any support and help with this.
question

opened by HubLubas 6
Port improvements from go-htmldate
Overview

While porting this library to Go, I tried to make some improvements to make the extraction more accurate. After more testing, those improvements look good and stable enough to use, so I decided to port them back to Python here.

Changes

There are three main changes in this PR:

Add French and Indonesian to the regular expressions used to parse long date strings.

This fixes the failure to extract the date from paris-luttes.info.html, which is in French. Since I was adding a new language to the regular expressions, I decided to add Indonesian as well.

Improve custom_parse.

It now works by trying to parse the string using several formats, in the following order of priority:

YYYYMMDD pattern

YYYY-MM-DD (ISO-8601)

DD-MM-YYYY (the most commonly used date format according to Wikipedia)

YYYY-MM pattern

Regex patterns
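
The priority order listed above can be sketched as follows. This is not the code of custom_parse: the name `parse_by_priority`, the regular expressions and the accepted separators are assumptions, and the final "regex patterns" step for long date strings is left out.

```python
import re
from datetime import date
from typing import Optional

# (pattern, group positions of year, month, day; None means "use day 1")
FORMATS = [
    (re.compile(r"(?<!\d)(\d{4})(\d{2})(\d{2})(?!\d)"), (0, 1, 2)),                # YYYYMMDD
    (re.compile(r"(?<!\d)(\d{4})-(\d{2})-(\d{2})(?!\d)"), (0, 1, 2)),              # YYYY-MM-DD
    (re.compile(r"(?<!\d)(\d{1,2})[-/.](\d{1,2})[-/.](\d{4})(?!\d)"), (2, 1, 0)),  # DD-MM-YYYY
    (re.compile(r"(?<!\d)(\d{4})-(\d{2})(?!\d)"), (0, 1, None)),                   # YYYY-MM
]


def parse_by_priority(text: str) -> Optional[date]:
    """Try each format in order and return the first valid date."""
    for pattern, (y, m, d) in FORMATS:
        match = pattern.search(text)
        if not match:
            continue
        groups = match.groups()
        day = int(groups[d]) if d is not None else 1
        try:
            return date(int(groups[y]), int(groups[m]), day)
        except ValueError:
            continue  # impossible date such as month 13: try the next format
    return None


for sample in ("20201212", "2020-12-12", "12-11-2020", "2020-12"):
    print(sample, "->", parse_by_priority(sample))
```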

Merge XPath selectors from an array of strings into a single string.

This fixes the wrong date extracted for wolfsrebellen-netz.forumieren.com.regeln.html. Consider an HTML document like this:

<div>
  <h1>Some Title</h1>
  <p class="author">By Joestar at 2020/12/12, 10:11 AM</p>
  <p>Lorem ipsum dolor sit amet.</p>
  <p>Dolorum explicabo quos voluptas voluptates?</p>
  <p class="current-time">Current date and time: 2021/07/14, 09:00 PM</p>
</div>

In the document above, there are two dates: one in the element with class "author" and the other in the element with class "current-time".

In the original code, htmldate picks the date from the "current-time" element even though it occurs later in the document. This is because DATE_EXPRESSIONS is currently an array of XPath selectors, and in that array elements whose class contains time are given higher priority than elements whose class contains author.

To fix this, I've converted DATE_EXPRESSIONS and the other XPath selectors from arrays of strings into single strings. This way every rule inside an expression has the same priority, so the <p class="author"> element is now selected first.
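
The difference between the two strategies can be reproduced with the standard library. The sketch below does not use XPath or the real DATE_EXPRESSIONS; `RULES` and both functions are hypothetical stand-ins that contrast "one query per rule, in rule order" with "all rules at once, in document order".

```python
import xml.etree.ElementTree as ET

DOC = """<div>
  <h1>Some Title</h1>
  <p class="author">By Joestar at 2020/12/12, 10:11 AM</p>
  <p>Lorem ipsum dolor sit amet.</p>
  <p class="current-time">Current date and time: 2021/07/14, 09:00 PM</p>
</div>"""

# Hypothetical rule order: "time" classes are listed before "author" classes
RULES = ["time", "author"]


def first_by_rule_priority(root):
    """Run one query per rule; the first rule with a hit wins."""
    for keyword in RULES:
        for elem in root.iter("p"):
            if keyword in elem.get("class", ""):
                return elem.text
    return None


def first_in_document_order(root):
    """Apply all rules in a single pass; the first matching element wins."""
    for elem in root.iter("p"):
        if any(keyword in elem.get("class", "") for keyword in RULES):
            return elem.text
    return None


root = ET.fromstring(DOC)
print(first_by_rule_priority(root))   # picks the "current-time" paragraph
print(first_in_document_order(root))  # picks the "author" paragraph
```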

Result

Here is the result of comparison test for the original htmldate:

| Package | Precision | Recall | Accuracy | F-Score | Speed (s) |
|:------------------------------:|:---------:|:------:|:--------:|:-------:|:---------:|
| htmldate fast | 0.899 | 0.917 | 0.831 | 0.908 | 1.038 |
| htmldate extensive | 0.893 | 1.000 | 0.893 | 0.944 | 2.219 |

And here is after this PR:

| Package | Precision | Recall | Accuracy | F-Score | Speed (s) |
|:------------------------------:|:---------:|:------:|:--------:|:-------:|:---------:|
| htmldate fast | 0.920 | 0.938 | 0.867 | 0.929 | 1.579 |
| htmldate extensive | 0.911 | 1.000 | 0.911 | 0.953 | 2.807 |

So there is a slight increase in accuracy, but extraction becomes slower: about 1.5x for fast mode (1.579 s vs. 1.038 s) and about 1.3x for extensive mode (2.807 s vs. 2.219 s).

Additional Notes

I have not added it to this PR, but since custom_parse has been improved, my tests suggest that external_date_parser can be safely removed without any performance loss. Here is the result of the comparison test after removing external_date_parser:

| Package | Precision | Recall | Accuracy | F-Score | Speed (s) |
|:------------------------------:|:---------:|:------:|:--------:|:-------:|:---------:|
| htmldate fast | 0.920 | 0.938 | 0.867 | 0.929 | 1.678 |
| htmldate extensive | 0.911 | 1.000 | 0.911 | 0.953 | 1.816 |

So the accuracy stays the same, while extensive mode becomes a lot faster (now only about 1.08x slower than fast mode), so we might be able to make extensive mode the default. This might need more tests, though.
opened by RadhiFadlillah 6
Strange inferred date for target news article

Hello @adbar,

I just stumbled upon an issue when extracting contents from this html file (an article from LeMonde): https://gist.github.com/Yomguithereal/de4457a421729c92a976b506268631d7

It returns 2021-01-31 (which was a date in the future at the time the HTML was downloaded, i.e. more than one year ago) because it latches onto what is an expiry date in a JavaScript string literal.

I don't really know how trafilatura tries to extract a date from html pages, but I guess here it was found because of a regex scanning the whole text? In which case maybe a condition checking that the found dates are not in the future could help (this could also be tedious because one would need to pass the "present" date when extracting data collected in the past).
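
The check suggested above can be sketched without touching the library: candidate dates later than a caller-supplied reference date are discarded. The function name `latest_plausible`, the candidate list and the download date are made up for illustration. The find_date signature visible in the tracebacks on this page also lists min_date and max_date parameters, which appear to be intended for this kind of bound.

```python
from datetime import date
from typing import Iterable, Optional


def latest_plausible(candidates: Iterable[date], reference: date) -> Optional[date]:
    """Return the most recent candidate that is not after the reference date."""
    plausible = [candidate for candidate in candidates if candidate <= reference]
    return max(plausible) if plausible else None


found = [date(2019, 11, 5), date(2021, 1, 31)]  # the second one mimics an expiry date
crawl_day = date(2019, 12, 1)                   # hypothetical download date
print(latest_plausible(found, crawl_day))
```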
bug

opened by Yomguithereal 6

error: redefinition of group name 'm' as group 5; was group 2 at position 116

Hello there,

Thanks for this great project! I encountered a problem while crawling different websites and trying to extract dates with this package. Especially on this URL: https://osmh.dev

Here is the error using iPython and Python 3.8.12:

# works
In [3]: from htmldate import find_date

In [4]: find_date("https://osmh.dev")
Out[4]: '2020-11-29'

# doesn't work
In [6]: find_date("https://osmh.dev", extensive_search=False, outputformat='%Y-%m-%d %H:%m:%S')

The last example throws an error:

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-6-9988648ad55b> in <module>
----> 1 find_date("https://osmh.dev", extensive_search=False, outputformat='%Y-%m-%d %H:%m:%S')

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in find_date(htmlobject, extensive_search, original_date, outputformat, url, verbose, min_date, max_date)
    653
    654     # try time elements
--> 655     time_result = examine_time_elements(
    656         search_tree, outputformat, extensive_search, original_date, min_date, max_date
    657     )

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in examine_time_elements(tree, outputformat, extensive_search, original_date, min_date, max_date)
    389                         return attempt
    390                 else:
--> 391                     reference = compare_reference(reference, elem.get('datetime'), outputformat, extensive_search, original_date, min_date, max_date)
    392                     if reference > 0:
    393                         break

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in compare_reference(reference, expression, outputformat, extensive_search, original_date, min_date, max_date)
    300     attempt = try_expression(expression, outputformat, extensive_search, min_date, max_date)
    301     if attempt is not None:
--> 302         return compare_values(reference, attempt, outputformat, original_date)
    303     return reference
    304

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/validators.py in compare_values(reference, attempt, outputformat, original_date)
    110 def compare_values(reference, attempt, outputformat, original_date):
    111     """Compare the date expression to a reference"""
--> 112     timestamp = time.mktime(datetime.datetime.strptime(attempt, outputformat).timetuple())
    113     if original_date is True and (reference == 0 or timestamp < reference):
    114         reference = timestamp

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in _strptime_datetime(cls, data_string, format)
    566     """Return a class cls instance based on the input string and the
    567     format string."""
--> 568     tt, fraction, gmtoff_fraction = _strptime(data_string, format)
    569     tzname, gmtoff = tt[-2:]
    570     args = tt[:6] + (fraction,)

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in _strptime(data_string, format)
    331         if not format_regex:
    332             try:
--> 333                 format_regex = _TimeRE_cache.compile(format)
    334             # KeyError raised when a bad format is found; can be specified as
    335             # \\, in which case it was a stray % but with a space after it

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in compile(self, format)
    261     def compile(self, format):
    262         """Return a compiled re object for the format string."""
--> 263         return re_compile(self.pattern(format), IGNORECASE)
    264
    265 _cache_lock = _thread_allocate_lock()

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/re.py in compile(pattern, flags)
    250 def compile(pattern, flags=0):
    251     "Compile a regular expression pattern, returning a Pattern object."
--> 252     return _compile(pattern, flags)
    253
    254 def purge():

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/re.py in _compile(pattern, flags)
    302     if not sre_compile.isstring(pattern):
    303         raise TypeError("first argument must be string or compiled pattern")
--> 304     p = sre_compile.compile(pattern, flags)
    305     if not (flags & DEBUG):
    306         if len(_cache) >= _MAXCACHE:

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_compile.py in compile(p, flags)
    762     if isstring(p):
    763         pattern = p
--> 764         p = sre_parse.parse(p, flags)
    765     else:
    766         pattern = None

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in parse(str, flags, state)
    946
    947     try:
--> 948         p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
    949     except Verbose:
    950         # the VERBOSE flag was switched on inside the pattern.  to be

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in _parse_sub(source, state, verbose, nested)
    441     start = source.tell()
    442     while True:
--> 443         itemsappend(_parse(source, state, verbose, nested + 1,
    444                            not nested and not items))
    445         if not sourcematch("|"):

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in _parse(source, state, verbose, nested, first)
    829                     group = state.opengroup(name)
    830                 except error as err:
--> 831                     raise source.error(err.msg, len(name) + 1) from None
    832             sub_verbose = ((verbose or (add_flags & SRE_FLAG_VERBOSE)) and
    833                            not (del_flags & SRE_FLAG_VERBOSE))

error: redefinition of group name 'm' as group 5; was group 2 at position 116
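
The traceback ends inside `_strptime`, which turns the format string into a regular expression with one named group per directive. The format in the failing call uses `%m` (month) twice; `%M` (minute) was probably intended, although that is an assumption about the reporter's intent. The sketch below reproduces the failure with the standard library only. The exact exception type may differ between Python versions, so both candidates are caught.

```python
import re
from datetime import datetime

value = "2020-11-29 00:11:00"

try:
    datetime.strptime(value, "%Y-%m-%d %H:%m:%S")  # %m appears twice
    outcome = "parsed"
except (re.error, ValueError) as err:
    outcome = f"failed: {err}"
print(outcome)

# With %M for minutes the same string parses
print(datetime.strptime(value, "%Y-%m-%d %H:%M:%S"))
```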

bug

opened by kinoute 4

how are timezones handled when available?

Some articles include the full publication time, with timezone, in HTML meta tags or JavaScript config. Does this library parse and handle those timezones? Relatedly, how does it internally store dates with regard to timezone: are they all returned in machine-local time, held in GMT, or something else?

For instance, this Guardian article includes the article:published_time meta tag with a timezone included. Does this library recognize that timezone and return the date as it would be in GMT? Same for this article on CNN, which includes the datePublished meta tag.
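
Why the question matters can be shown with the standard library: the same instant can fall on different calendar days depending on the timezone used. The timestamp below is invented for illustration, and the sketch makes no claim about how htmldate treats offsets.

```python
from datetime import datetime, timezone

stamp = datetime.fromisoformat("2020-05-01T23:30:00-05:00")
print(stamp.date())                           # date as written in the markup
print(stamp.astimezone(timezone.utc).date())  # date of the same instant in UTC
```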
question

opened by rahulbot 3
ignore undateable domains more intentionally

In our testing the current code produces unreliable results when tested on Wikipedia articles. Sometimes it returns a date, sometimes it doesn't. Wikipedia articles are constantly updated, so @coreydockser and I would like to propose changing it so that it returns no date if the URL is a wikipedia.org one. In our broader experience with Media Cloud this produces more useful results (for our open web news analysis context).

In terms of implementation, we could just copy filter_url_for_undateable function from date_guesser and use that as is to include the other checks it does for undateable domains. We'd call it early on in guess_date.
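
A domain check of this kind could be as small as the sketch below. It does not reproduce filter_url_for_undateable from date_guesser, whose other checks are not shown here; `is_undateable` and `UNDATEABLE_DOMAINS` are hypothetical names.

```python
from urllib.parse import urlparse

# Hypothetical list; only the domain named in this proposal is included
UNDATEABLE_DOMAINS = ("wikipedia.org",)


def is_undateable(url: str) -> bool:
    """True if the URL's host is a listed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in UNDATEABLE_DOMAINS
    )


print(is_undateable("https://en.wikipedia.org/wiki/Python_(programming_language)"))
print(is_undateable("http://blog.python.org/2016/12/python-360-is-now-available.html"))
```

Matching on the full host or a dot-separated suffix avoids false hits on unrelated hosts that merely end with the same letters.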
question

opened by rahulbot 7
Test htmldate on further web pages and report bugs

I have mostly tested htmldate on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction of a date doesn't work so far.

Please install the dateparser library beforehand as it significantly extends linguistic coverage: pip (or pip3) install -U dateparser, or pip install -U htmldate[all].

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in core.py (see DATE_EXPRESSIONS and ADDITIONAL_EXPRESSIONS).

Thanks!
good first issue up for grabs

opened by adbar 7
Check the language, clarity and consistency of documentation
A short version of the documentation is available straight from GitHub (README.rst), while a more exhaustive one is in the docs folder and online at htmldate.readthedocs.io.

Several problems could arise:

Non-idiomatic use of English (not quite fluent or natural)

Unclear or incomplete descriptions

Code examples that don't work

Typos in explanations or code sections

Outdated sections

good first issue up for grabs
opened by adbar 2

Releases(v1.4.0)

v1.4.0(Nov 28, 2022)
additional search of free text in whole document (#67)

optional parameter for subdaily precision with @getorca (#66)

fix for HTML doctype parsing (#44)

cleaner code for multilingual month expressions

extended expressions for extraction in HTML meta fields

update of dependencies and evaluation

v1.3.2(Oct 14, 2022)
technical release: explicit support for Python 3.11 and logo

v1.3.1(Aug 26, 2022)
fix for use of min_date & max_date (#62)

simplified code & updated setup

v1.3.0(Jul 20, 2022)
Entirely type-checked code base

New function clear_caches() (#57)

Slightly more efficient code (about 5% faster)

Full Changelog: https://github.com/adbar/htmldate/compare/v1.2.3...v1.3.0
v1.2.3(Jun 16, 2022)
fix for memory leak (#56)

docs updated

Full Changelog: https://github.com/adbar/htmldate/compare/v1.2.2...v1.2.3
v1.2.2(Jun 13, 2022)
slightly higher accuracy & faster extensive extraction

maintenance: code base simplified, more tests

bugs addressed: #51, #54

docs: fix by @MSK1582

Full Changelog: https://github.com/adbar/htmldate/compare/v1.2.1...v1.2.2
v1.2.1(Mar 23, 2022)
speed and accuracy gains

better extraction coverage, simpler code

bug fixed (typo in variable)

Full Changelog: https://github.com/adbar/htmldate/compare/v1.2.0...v1.2.1
v1.2.0(Mar 16, 2022)
better performance

remove unnecessary ciso8601 dependency

temporary fix for scrapinghub/dateparser#1045 bug

Full Changelog: https://github.com/adbar/htmldate/compare/v1.1.1...v1.2.0
v1.1.1(Mar 3, 2022)
bugfix: input encoding

improved extraction coverage (#47) by @liulinlin90

Full Changelog: https://github.com/adbar/htmldate/compare/v1.1.0...v1.1.1
v1.1.0(Feb 18, 2022)
better handling of file encodings

slight increase in accuracy, more efficient code

Full Changelog: https://github.com/adbar/htmldate/compare/v1.0.1...v1.1.0
v1.0.1(Feb 14, 2022)
maintenance release, code base cleaned

command-line interface: --version added

file parsing reviewed

Full Changelog: https://github.com/adbar/htmldate/compare/v1.0.0...v1.0.1
v1.0.0(Nov 9, 2021)
faster and more accurate encoding detection

simplified code base

include support for Python 3.10 and dropped support for Python 3.5

v0.9.1(Sep 24, 2021)
improved generic date parsing (thanks @RadhiFadlillah)

specific support for French and Indonesian (thanks @RadhiFadlillah)

additional evaluation for English news sites (kudos to @coreydockser & @rahulbot)

bugs fixed

v0.9.0(Jul 28, 2021)
improved exhaustive search

simplified code

bug fixes

removed support for Python 3.4

v0.8.1(Mar 9, 2021)
bugfixes

v0.8.0(Feb 11, 2021)
dateparser and regex modules fully integrated

patterns added for coverage

smarter HTML doc loading

v0.7.3(Jan 4, 2021)
dependencies updated and reduced: switch from requests to bare urllib3, make chardet standard and cchardet optional

fixes: downloads, OverflowError in extraction

v0.7.2(Oct 20, 2020)
compatibility with Python 3.9

better speed and accuracy

v0.7.1(Sep 14, 2020)
technical release: package requirements and docs wording

v0.7.0(Jul 29, 2020)
code base and performance improved

minimum date available as option

support for Turkish patterns and CMS idiosyncrasies (thanks @evolutionoftheuniverse)

v0.6.3(May 26, 2020)
more efficient code

additional evaluation data

v0.6.2(Apr 29, 2020)

v0.6.1(Jan 17, 2020)

htmldate finds original and updated publication dates of any web page. All the steps needed from web page download to HTML parsing, scraping and text analysis are included.

In a nutshell, with Python:

>>> from htmldate import find_date
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'2016-12-23'
>>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True)
'2016-06-23'

On the command-line:

$ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
'2016-12-23'

Releases used in production and meant to be archived on Zenodo for reproducibility and citability.

For more information see htmldate.readthedocs.io
v0.5.6(Sep 24, 2019)

First release used in production and meant to be archived on Zenodo for reproducibility and citability.

Owner

Adrien Barbaresi

Research scientist – natural language processing, web scraping and text analytics. Mostly with Python.

GitHub https://htmldate.readthedocs.io

fast python port of arc90's readability tool, updated to match latest readability.js!

python-readability Given a html document, it pulls out the main body text and cleans it up. This is a python port of a ruby port of arc90's readabilit

2.2k Dec 28, 2022

Web Content Retrieval for Humans™

Lassie Lassie is a Python library for retrieving basic content from websites. Usage >>> import lassie >>> lassie.fetch('http://www.youtube.com/watch?v

571 Dec 29, 2022

Every web site provides APIs.

Toapi Overview Toapi give you the ability to make every web site provides APIs. Version v2.0.0, Completely rewrote. More elegant. More pythonic v1.0.0

3.3k Jan 5, 2023

Web-Extractor - Simple Tool To Extract IP-Adress From Website

IP-Adress Extractor Simple Tool To Extract IP-Adress From Website Socials: Langu

7 Jan 16, 2022

Zotero2Readwise - A Python Library to retrieve annotations and notes from Zotero and upload them to your Readwise

Zotero ➡️ Readwise zotero2readwise is a Python library that retrieves all Zotero

49 Dec 20, 2022

Scan the MRZ code of a passport and extract the firstname, lastname, passport number, nationality, date of birth, expiration date and personal numer.

PassportScanner Works with 2 and 3 line identity documents. What is this With PassportScanner you can use your camera to scan the MRZ code of a passpo

441 Dec 24, 2022

split-manga-pages: a command line utility written in Python that converts your double-page layout manga to single-page layout.

split-manga-pages split-manga-pages is a command line utility written in Python that converts your double-page layout manga (or any images in double p

3 May 24, 2022

Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

trafilatura: Web scraping tool for text discovery and retrieval Description Trafilatura is a Python package and command-line tool which seamlessly dow

704 Jan 6, 2023

A simple python script that, given a location and a date, uses the Nasa Earth API to show a photo taken by the Landsat 8 satellite. The script must be executed on the command-line.

What does it do? Given a location and a date, it uses the Nasa Earth API to show a photo taken by the Landsat 8 satellite. The script must be executed

42 Nov 26, 2022

A python command line tool to calculate options max pain for a given company symbol and options expiry date.

Options-Max-Pain-Calculator A python command line tool to calculate options max pain for a given company symbol and options expiry date. Overview - Ma

13 Dec 26, 2022

A command-line based, minimal torrent streaming client made using Python and Webtorrent-cli. Stream your favorite shows straight from the command line.

A command-line based, minimal torrent streaming client made using Python and Webtorrent-cli. Installation pip install -r requirements.txt It use

17 Dec 11, 2022

Hobby Project. A Python Library to create and generate static web pages using just python.

PyWeb Current Release: 0.1 A Hobby Project PyWeb is a small Library to generate customized static web pages using python. Aimed for new deve

2 Nov 18, 2021

Advance Image Downloader/Extractor (Job) is a Python-Flask web-based app that helps the user download any kind of image at any date and time over the internet. The images are downloaded as a job, and the user is then notified by an email containing a link to them.

Advance Image Downloader/Extractor(Job) Advance Image Downloader/Extractor (Job) is a Python-Flask web-based app, which will help the user download th

13 Aug 27, 2022

Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages.

Video Games Web Scraper Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages. This

1 Jan 12, 2022

A collection of robust and fast processing tools for parsing and analyzing web archive data.

A simple web application built using python flask. It can be used to scan SMEVai accounts for broken pages.

Library to scrape and clean web pages to create massive datasets.

Fake news detector filters - Smart filter project allow to classify the quality of information and web pages

Scraping web pages to get data

Turn (almost) any Python command line program into a full GUI application with one line