Parse feeds in Python

Kurt McKee

Last update: Dec 30, 2022

Related tags

Web Crawling feedparser

Overview

feedparser - Parse Atom and RSS feeds in Python.

feedparser is open source. See the LICENSE file for more information.

Installation

feedparser can be installed by running pip:

$ pip install feedparser

Documentation

The feedparser documentation is available on the web at:

https://feedparser.readthedocs.io/en/latest/

It is also included in its source format, ReST, in the docs/ directory. To build the documentation you'll need the Sphinx package, which is available at:

https://www.sphinx-doc.org/

You can then build HTML pages using a command similar to:

$ sphinx-build -b html docs/ fpdocs

This will produce HTML documentation in the fpdocs/ directory.

Testing

Feedparser has an extensive test suite, powered by tox. To run it, type this:

$ python -m venv venv
$ source venv/bin/activate  # or "venv\Scripts\Activate.ps1" on Windows
(venv) $ python -m pip install --upgrade pip
(venv) $ python -m pip install poetry
(venv) $ poetry update
(venv) $ tox

This will spawn an HTTP server that will listen on port 8097. The tests will fail if that port is in use.

Comments

feedparser repository - no longer maintained?

This repository, unfortunately, does look unmaintained. If a fork is needed, and I believe it is, is there someone willing to create one, merge the open pull requests, and take responsibility for the future? Maybe a team of users, as discussed in #108?

references:
#108 https://github.com/kurtmckee/feedparser/pull/131#issuecomment-443467549

opened by introspectionism 31
craigslist rss requests fail with 403 error, but wget and browser succeed.
Note: I filed this issue first with rss2email, but the maintainer states it is a feedparser issue.

I have duplicated this on separate machines in different physical locations.

'r2e run' fails fetching the feed with a 403 error. However, the URL loads just fine in wget and in any web browser, so it is not IP related. Proof (steps to reproduce) is below.

Using r2e version 3.9, from ubuntu repo, and also master from github/rss2email.

All craigslist.org feed URLs have been failing since approximately May 9. I notified Craig (of craigslist) and he replied that he sent it to his engineering team. On May 17, the feeds started working again and I thought the problem was resolved, but by May 18 the 403s were back, and they continue. Prior to May 9, the feeds had been working fine for years.

I also tried modifying the USER_AGENT string in feed.py to e.g. 'Mozilla/5.0', and also omitting the string (to use the feedparser default), but there was no change.

This seems to be a server-side issue since my installation was working well until May 9, however it is very interesting that wget works when r2e does not, and indicates there must be a client-side way to achieve a correct fetch.

I had initially thought the problem to likely be related to too many requests in a given time interval, however I tried with a brand new rss2email install on a remote server and it failed on the very first request, as shown below.

Anyway, I hope we can get it working again.

$ r2e add cl1 'https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1' <email>
$ r2e run
HTTP status 403 fetching feed cl1 (https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1 -> [EMAIL]
$ wget -O feed.xml "https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1"
--2019-05-27 07:49:19-- https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1
Resolving sfbay.craigslist.org (sfbay.craigslist.org)... 208.82.238.18
Connecting to sfbay.craigslist.org (sfbay.craigslist.org)|208.82.238.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/rss+xml]
Saving to: ‘feed.xml’
feed.xml [ <=> ] 1.40K --.-KB/s in 0s
2019-05-27 07:49:20 (55.3 MB/s) - ‘feed.xml’ saved [1433]
$ head -n 6 feed.xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:enc="http://purl.oclc.org/net/rss_2.0/enc#"
opened by dan-da 29
Memory usage reduction (#296)
This PR is for the memory usage reduction proposed in #296; see the issue for a detailed description.

There's one commit per logical change, so things are easier to review. Tests and mypy pass for each commit.

[x] stream-oriented version of convert_to_utf8() (new code, still unused)

[x] extract _parse_file_inplace() from parse()

[x] update JSONParser.feed() to take a file instead of a string

note that calling parse() with a JSON feed fails with SAXParseException, but develop fails exactly the same way too

[x] update _parse_file_inplace() to use convert_file_to_utf8()

[x] update _open_resource() to return an open file instead of bytes

[x] check if the entire file can be decoded in convert_file_to_utf8() (added later)

without this, parse() may sometimes raise UnicodeDecodeError, which would break the API

[x] changelog

I did not add a section about optimistic_encoding_detection in docs/character-encoding.rst, since it is more or less an implementation detail (the flag exists only to allow getting the original behavior). Please let me know if you think this should be mentioned in the documentation. As the internet moves to UTF-8, I expect the need for the flag/fallback to decrease altogether (as of April 2022, UTF-8 seems to be used by 97.6% of websites).
performance
opened by lemon24 18
Maintained?

There has not been much activity on this project for over a year. I also see there are quite a few issues related to general maintenance, such as updating the PyPI package and supporting a new feed type. I was wondering if this is still being maintained.

opened by AeolusDraco 16
Support for JSON Feeds

https://jsonfeed.org/2017/05/17/announcing_json_feed

Once JSON Feed support hits feedparser, I can add it to NewsBlur, giving tens of thousands of readers access to JSON Feeds.

opened by samuelclay 16
memory leak on feedparser 5.2.1?
Recently I started parsing an RSS feed with feedparser 5.2.1 and accidentally found that memory usage increases continuously while my app runs without a break. Have I made a mistake? Any help would be highly appreciated.

My app's code is as follows (as an example):

import feedparser
import time

Url = 'https://www.xxx.com/feeds/all'
myTag = ""
while True:
    time.sleep(5)
    feed_data = feedparser.parse(Url, etag=myTag)
    myTag = feed_data.get('etag')

The code above is compiled into an .exe with PyInstaller and then left running without a break on Windows Server 2012.
opened by biotech7 14
feedparser.parse() does not return, causing my PTB job to be stuck
Hi,

I have a small Python bot which scans RSS feeds on an interval. Every N seconds a job is triggered to iterate over the feeds saved in a sqlite3 database and fetch each one. It then checks whether the DB already has the feed message and, if not, broadcasts it over Telegram.

For quite some time now I've had to reboot the bot at irregular intervals. After a while it seems that feedparser.parse() no longer returns, leaving the job pending forever.

It took me quite some time to figure out that it's feedparser that's not returning. At first I thought it was some I/O issue related to sqlite3. The bot also runs in a Docker container, and I assumed it could be related to that, but it's neither.

Please see code snippet of jobs.py below. In the snippet, db.get_all_feeds() returns a list of tuples where tuple[0] == feed_name and tuple[1] == feed_url.

def rss_monitor(context):
    feeds = db.get_all_feeds()
    for feed in feeds:
        preview = db.get_preview(feed[0])
        ...  # Here we check whether the feed requires a cookie or not, if so append it to headers
        rss = feedparser.parse(feed[1], request_headers=headers)  # <- THIS LINE DOES NOT RETURN AFTER N ITERATIONS
        if rss.status == 200:
            # Process feed, check if message exists in database and if not, broadcast it over telegram.
            pass
        else:
            logger.error('Could not fetch feed: ' + feed[1])
            logger.error('Feed HTTP response_code: ' + str(rss.status))

N is dynamic and I cannot reproduce this for a given number: sometimes the job fails after 10h, sometimes it fails after 15h, and sometimes it works fine for 24h.

I am using feedparser==6.0.2 which is as far as I know the latest version of feedparser. Is there anything else I can do to let feedparser throw an error or perhaps hint to why it is no longer returning? If any additional information is required I will gladly supply it
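One way to keep a stalled connection from blocking the job, sketched here under assumptions and not as feedparser's own mechanism, is to download the feed with an explicit timeout and parse the bytes afterwards. The helper name fetch_feed_bytes and the 30-second default are illustrative only.

```python
import urllib.request


def fetch_feed_bytes(url, timeout=30.0, headers=None):
    """Download a feed and return its raw bytes.

    Hypothetical helper, not part of feedparser. `timeout` is handed to
    urllib, so a socket operation that stalls for longer than `timeout`
    seconds raises an exception instead of blocking indefinitely.
    """
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

The returned bytes can then be passed to feedparser.parse(), which accepts raw feed content as well as a URL. A result parsed from bytes carries no HTTP status, so a check like rss.status == 200 would have to move into the fetch step (urllib raises HTTPError for error responses). A coarser alternative is socket.setdefaulttimeout(), which sets a process-wide default for newly created sockets.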
need-info
opened by furiousxk 13
Added a timeout parameter to the parse function
The default is set to 30 seconds.

I only saw https://github.com/kurtmckee/feedparser/pull/77 after I made my corrections. However PR 77 introduces a hardcoded parameter, and does not respect the API of feedparser.

I recommend using this PR instead of 77.

Usage:

feed = feedparser.parse("http://feeds.rsc.org/rss/cc", timeout=1)

But old syntax will still work, of course:

feed = feedparser.parse("http://feeds.rsc.org/rss/cc")
opened by JPFrancoia 13
Travis support for automated builds including Python 3.7
This PR adds a simple Travis CI configuration file that includes tox build configs for

Python 2.7

Python 3.4

Python 3.5

Python 3.6

Python 3.7

Of course, at this point in time, only the first 4 variants will run successfully. The Python 3.7 tox build will fail until https://github.com/kurtmckee/feedparser/pull/131 is merged.

Once merged, please sign up for an account at https://travis-ci.org/, link your GitHub account, add this repository, and Travis CI will automatically build the project at each commit.
opened by exxamalte 11
New PyPi release

Hi @kurtmckee, we are using feedparser in Galaxy and we are soon going to need fully Python3-compatible dependencies. Can you do a new release including sgmllib3k and Python3.5 support, please?

opened by nsoranzo 11

Traceback with _parse_georss_point

Hi, I got this similar to #130 :

Python 3.7.4 (default, Jul  9 2019, 16:32:37) 
[GCC 9.1.1 20190503 (Red Hat 9.1.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> url='https://mundosauriga.blogspot.com/feeds/posts/default?alt=rss'
>>> import feedparser
>>> feed = feedparser.parse(url)
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/feedparser.py", line 3766, in _gen_georss_coords
    t = [nxt(), nxt()][::swap and -1 or 1]
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/feedparser.py", line 3956, in parse
    saxparser.parse(source)
  File "/usr/lib64/python3.7/site-packages/drv_libxml2.py", line 239, in parse
    _d(reader.Name()))
  File "/usr/lib/python3.7/site-packages/feedparser.py", line 2052, in endElementNS
    self.unknown_endtag(localname)
  File "/usr/lib/python3.7/site-packages/feedparser.py", line 696, in unknown_endtag
    method()
  File "/usr/lib/python3.7/site-packages/feedparser.py", line 1463, in _end_georss_point
    geometry = _parse_georss_point(self.pop('geometry'))
  File "/usr/lib/python3.7/site-packages/feedparser.py", line 3775, in _parse_georss_point
    coords = list(_gen_georss_coords(value, swap, dims))
RuntimeError: generator raised StopIteration

Any hint for this?

Feedparser version is provided by python3-feedparser-5.2.1-9.fc30.noarch

opened by iranzo 9

Handle HTTP status 308 (Permanent Redirect) as a redirect

While researching for a fix for https://github.com/rss2email/rss2email/issues/229, I noticed that feedparser does not handle HTTP status code 308 the same as the other HTTP redirects. The new status code 308 (Permanent Redirect) was added to the standard in 2015 as the missing variant of status code 301 (Moved Permanently) which “does not allow changing the request method from POST to GET”.
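As a rough illustration of the distinction described above, the redirect statuses can be grouped with the standard library's http.HTTPStatus. This is a sketch and not feedparser's internal code; the names REDIRECT_STATUSES, PERMANENT_STATUSES, is_redirect, and is_permanent_redirect are hypothetical.

```python
from http import HTTPStatus

# Redirect statuses a feed fetcher would typically follow.
REDIRECT_STATUSES = frozenset({
    HTTPStatus.MOVED_PERMANENTLY,   # 301
    HTTPStatus.FOUND,               # 302
    HTTPStatus.SEE_OTHER,           # 303
    HTTPStatus.TEMPORARY_REDIRECT,  # 307
    HTTPStatus.PERMANENT_REDIRECT,  # 308
})

# Statuses signalling that the resource has moved for good, so a client
# may want to remember the new URL.
PERMANENT_STATUSES = frozenset({
    HTTPStatus.MOVED_PERMANENTLY,
    HTTPStatus.PERMANENT_REDIRECT,
})


def is_redirect(status):
    """Return True if the numeric HTTP status is a followable redirect."""
    return status in REDIRECT_STATUSES


def is_permanent_redirect(status):
    """Return True for 301 and 308, the two permanent redirects."""
    return status in PERMANENT_STATUSES
```

HTTPStatus members compare equal to plain integers, so is_redirect(308) works with the numeric status code from a response.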

opened by amiryal 0
`itunes:summary` overwrites `description` field in feed items when parsing
When a feed item entry has both an <itunes:summary> tag and a <description> tag, the <itunes:summary> tag takes precedence and overwrites whatever is present in the <description> tag, making it available at the summary key on the item's dict.

Example:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
>
  <channel>
    <item>
      <title>A title</title>
      <description><![CDATA[<p>The description field</p>]]></description>
      <link>https://example.com</link>
      <content:encoded><![CDATA[<p>The content</p>]]></content:encoded>
      <itunes:summary>Itunes summary</itunes:summary>
    </item>
  </channel>
</rss>

Parsing the above with parse(), the summary for the item entry is set to the value in <itunes:summary>:

>>> parsed_feed = feedparser.parse("the-above-feed.xml")
>>> parsed_feed.entries[0].summary == 'Itunes summary'
True

My expectation is that the <itunes:summary> value would be available at the itunes_summary key, much like the other values in the iTunes namespace and the <description> tag's value would be available at summary as outlined in the documentation. Instead the iTunes summary is given precedence as shown above and applied to the summary key. Even when the <itunes:summary> is an empty tag, I still get an empty string as opposed to the value from the <description> field.

This seems to be very similar to both #314 and #316. Is this expected behavior or is this a bug?
opened by neilius 0
What is your recommended way to convert feedparser's date representation to a datetime object?
I think this question belongs here and not on Stack Overflow because, as the library author, you would be able to answer it best.

Issues I referenced before asking:
https://github.com/kurtmckee/feedparser/issues/212
https://github.com/kurtmckee/feedparser/issues/51

Problem

feedparser returns a string representation of the published date under published and a struct_time representation of the same date under published_parsed.

I am not able to store either of these directly to Postgres because it needs a datetime when working via asyncpg

How to reproduce this problem

def md5(text):
    import hashlib
    return hashlib.md5(text.encode('utf-8')).hexdigest()

def fetch():
    import feedparser
    data = feedparser.parse('https://cointelegraph.com/rss')
    return data

async def insert(rows):
    import asyncpg
    async with asyncpg.create_pool(user='postgres', database='postgres') as pool:
        async with pool.acquire() as conn:
            results = await conn.executemany('INSERT INTO test (feed_item_id, pubdate) VALUES($1, $2)', rows)
            print(results)

async def main():
    data = fetch()
    first_entry = data.entries[0]
    await insert([(md5(first_entry.guid), first_entry.published)])
    await insert([(md5(first_entry.guid), first_entry.published_parsed)])

import asyncio
asyncio.run(main())

Both insert statements above will fail

What have I found so far?

I found 3 methods, but each seems to have a limitation.

Method 1

Convert it with strptime

import feedparser
data = feedparser.parse('https://cointelegraph.com/rss')
pubdate = data.entries[0].published
pubdate_parsed = data.entries[0].published_parsed

>>> pubdate
'Thu, 04 Aug 2022 06:53:42 +0100'

I could do this

>>> method1 = datetime.strptime(pubdate, '%a, %d %b %Y %H:%M:%S %z')
>>> method1
datetime.datetime(2022, 8, 4, 6, 53, 42, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600)))

I am guessing this would raise an error if some feed returns a date in a different format, and I am also not sure whether it works when an extra leap second gets added.

Method 2

>>> datetime.fromtimestamp(mktime(pubdate_parsed))
datetime.datetime(2022, 8, 4, 5, 53, 42)

This seems to completely lose the timezone information, or am I wrong about that? What happens here if DST is in effect?

Method 3

Requires a third-party library called dateutil, as shown here: https://stackoverflow.com/a/18726020/5371505

Question

What is the most robust way to convert the published or published_parsed output that feedparser generates into a datetime object?

Can it be done without a third-party library such as dateutil?

Is there any native undocumented approach to get a datetime object directly from feedparser that I am not aware of?

Thank you for your time
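One standard-library-only approach, offered as a sketch and not as an official feedparser recommendation, relies on the feedparser documentation's statement that the *_parsed values are struct_time tuples in UTC. Under that assumption, calendar.timegm() converts the tuple without applying the local timezone, and email.utils.parsedate_to_datetime() handles RFC 822 style strings such as the one above. The function names below are hypothetical.

```python
import calendar
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def struct_time_to_datetime(parsed):
    """Convert a UTC struct_time into a timezone-aware datetime."""
    # calendar.timegm() treats the tuple as UTC, unlike time.mktime(),
    # which interprets it in the machine's local timezone.
    return datetime.fromtimestamp(calendar.timegm(parsed), tz=timezone.utc)


def rfc822_to_datetime(text):
    """Parse an RFC 822 style date string, keeping its UTC offset."""
    return parsedate_to_datetime(text)


# A hand-built struct_time holding the UTC equivalent of the published
# string from the example above.
parsed = time.struct_time((2022, 8, 4, 5, 53, 42, 3, 216, 0))
print(struct_time_to_datetime(parsed))
# 2022-08-04 05:53:42+00:00
print(rfc822_to_datetime('Thu, 04 Aug 2022 06:53:42 +0100'))
# 2022-08-04 06:53:42+01:00
```

Both results describe the same instant. The struct_time route only works when feedparser managed to parse the date, so checking that published_parsed is present and not None before converting is a sensible precaution.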
opened by slidenerd 1
should entry.tags be defined even when empty?

I have two different RSS feeds, both of which have a number of elements at the xpath /rss/channel/item/category, which, according to the docs, is one source for tags (categories) on elements.

However, when feedparser parses them, entries from one have tags, and entries from the other do not.

This feed, https://seekingalpha.com/feed.xml, comes up with plentiful tags, even though the RSS does not validate.

This feed, https://rss.nytimes.com/services/xml/rss/nyt/World.xml, throws AttributeError when entry.tags is accessed, even though the RSS does validate.

Am I missing something? Is it a bug?
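Whatever the answer, calling code can guard against the missing key. The sketch below assumes only that parsed entries behave like dictionaries and that each tag is a dictionary with a term key, as the feedparser documentation describes; the helper name tag_terms is hypothetical, and plain dicts stand in for parsed entries.

```python
def tag_terms(entry):
    """Return the tag terms for an entry, or an empty list if it has none."""
    # entry.get() sidesteps the AttributeError that entry.tags raises
    # when the parser found no categories for the entry.
    return [tag.get("term") for tag in entry.get("tags", [])]


# Plain dicts stand in for parsed feed entries here.
with_tags = {"title": "A", "tags": [{"term": "markets", "scheme": None, "label": None}]}
without_tags = {"title": "B"}

print(tag_terms(with_tags))     # ['markets']
print(tag_terms(without_tags))  # []
```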

opened by JoeGermuska 1

Test failures when cchardet-2.1.7 and chardet are installed

When cchardet-2.1.7 and chardet-5.0.0 are both installed, the following tests fail.

FWICS two of them fail because of encoding name mismatches (expected is mixed-case, the value is uppercase), and two of them are recognized as a superset-encoding of the specified encoding (i.e. EUC-KR as UHC, and GB2312 as GB18030).
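The two kinds of mismatch can be told apart by resolving each name through Python's codec registry. This is an illustrative sketch, not how the feedparser test suite compares encodings; the helper name same_codec is hypothetical.

```python
import codecs


def same_codec(name_a, name_b):
    """Return True when both encoding names resolve to the same Python codec."""
    try:
        return codecs.lookup(name_a).name == codecs.lookup(name_b).name
    except LookupError:
        # Names unknown to Python cannot be normalized, so fall back to
        # a case-insensitive string comparison.
        return name_a.lower() == name_b.lower()


# Spelling differences only: both names resolve to one codec.
print(same_codec("WINDOWS-1255", "windows-1255"))  # True
print(same_codec("BIG5", "Big5"))                  # True

# A superset encoding is a genuinely different codec.
print(same_codec("GB18030", "GB2312"))             # False
```

A comparison like this would accept the first two failures as matches, while the superset cases would still need an explicit decision about whether GB18030 is an acceptable answer for GB2312 (and UHC for EUC-KR).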

...F...FF.F..... [remaining progress dots omitted]
======================================================================
FAIL: test_001742 (__main__.TestCase)
./tests/illformed/chardet/windows1255.xml: windows-1255 with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'windows-1255'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as WINDOWS-1255'),
 'content-type': '',
 'encoding': 'WINDOWS-1255',
 'entries': [{'summary': 'האם תדפיס נייר של אתר אינטרנט שמוצג על מסך משתמש הוא '
                         'העתק נאמן למקור של אתר האינטרנט? רבים יגידו שכן, '
                         'ולפעמים גם בתי המשפט יצטרפו אליהם שיקבלו פלט מאתר '
                         'אינטרנט כראיה קבילה. אבל, זה ממש לא כך. ויש אפילו '
                         'הוכחה מדהימה.',
              'summary_detail': {'base': '',
                                 'language': None,
                                 'type': 'text/html',
                                 'value': 'האם תדפיס נייר של אתר אינטרנט שמוצג '
                                          'על מסך משתמש הוא העתק נאמן למקור של '
                                          'אתר האינטרנט? רבים יגידו שכן, '
                                          'ולפעמים גם בתי המשפט יצטרפו אליהם '
                                          'שיקבלו פלט מאתר אינטרנט כראיה '
                                          'קבילה. אבל, זה ממש לא כך. ויש אפילו '
                                          'הוכחה מדהימה.'}}],
 'feed': {},
 'headers': {},
 'namespaces': {},
 'version': 'rss'})

======================================================================
FAIL: test_001746 (__main__.TestCase)
./tests/illformed/chardet/gb2312.xml: GB2312 with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'GB2312'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as GB18030'),
 'content-type': '',
 'encoding': 'GB18030',
 'entries': [{'title': '不归移民漫画系列：专业工作',
              'title_detail': {'base': '',
                               'language': None,
                               'type': 'text/plain',
                               'value': '不归移民漫画系列：专业工作'}}],
 'feed': {},
 'headers': {},
 'namespaces': {},
 'version': 'rss'})

======================================================================
FAIL: test_001747 (__main__.TestCase)
./tests/illformed/chardet/euckr.xml: EUC-KR with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'EUC-KR'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as UHC'),
 'content-type': '',
 'encoding': 'UHC',
 'entries': [{'summary': 'TypeKey 시스템이 UTF-8로 돌아가는데, 거기서 한글로 된 닉네임을 정할 경우에, '
                         'EUC-KR로 된 무버블타입 블록에선 리다이렉트되어 전송되어오는 닉네임이 UTF라 당연히 '
                         '깨어져 나타난다. 실제 블록 등에서 사용하는 필명 내지는 닉네임은 한글로 사용하는 많은 분들도 '
                         '타입키에서의 닉네임은 이런 문제때문에 울며겨자먹기로 영어로 짓고 있다....',
              'summary_detail': {'base': '',
                                 'language': None,
                                 'type': 'text/html',
                                 'value': 'TypeKey 시스템이 UTF-8로 돌아가는데, 거기서 한글로 '
                                          '된 닉네임을 정할 경우에, EUC-KR로 된 무버블타입 블록에선 '
                                          '리다이렉트되어 전송되어오는 닉네임이 UTF라 당연히 깨어져 '
                                          '나타난다. 실제 블록 등에서 사용하는 필명 내지는 닉네임은 '
                                          '한글로 사용하는 많은 분들도 타입키에서의 닉네임은 이런 '
                                          '문제때문에 울며겨자먹기로 영어로 짓고 있다....'},
              'title': 'EUC-KR 에서 TypeKey 한글닉네임 표시하기',
              'title_detail': {'base': '',
                               'language': None,
                               'type': 'text/plain',
                               'value': 'EUC-KR 에서 TypeKey 한글닉네임 표시하기'}}],
 'feed': {},
 'headers': {},
 'namespaces': {},
 'version': 'rss'})

======================================================================
FAIL: test_001749 (__main__.TestCase)
./tests/illformed/chardet/big5.xml: Big5 with no encoding information
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
    self.fail_unless_eval(xmlfile, eval_string)
  File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
    raise self.failureException(failure)
AssertionError: not eval(b"bozo and encoding == 'Big5'") 
WITH env({'bozo': True,
 'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as BIG5'),
 'content-type': '',
 'encoding': 'BIG5',
 'entries': [],
 'feed': {'title': '我希望??很容易?其翻?成中文，并有助于改??件。 感?您??本文。',
          'title_detail': {'base': '',
                           'language': None,
                           'type': 'text/plain',
                           'value': '我希望??很容易?其翻?成中文，并有助于改??件。 感?您??本文。'}},
 'headers': {},
 'namespaces': {'': 'http://www.w3.org/2005/Atom'},
 'version': 'atom10'})

----------------------------------------------------------------------
Ran 4354 tests in 4.892s

FAILED (failures=4)

opened by mgorny 0

Owner

Kurt McKee

GitHub

A python module to parse the Open Graph Protocol

OpenGraph is a module of python for parsing the Open Graph Protocol, you can read more about the specification at http://ogp.me/ Installation $ pip in

213 Nov 12, 2022

API to parse tibia.com content into python objects.

Tibia.py An API to parse Tibia.com content into object oriented data. No fetching is done by this module, you must provide the html content. Features:

25 Oct 31, 2022

Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc)

Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc).

6 Aug 26, 2022

A Python library for automating interaction with websites.

Home page https://mechanicalsoup.readthedocs.io/ Overview A Python library for automating interaction with websites. MechanicalSoup automatically stor

4.3k Jan 7, 2023

A Powerful Spider(Web Crawler) System in Python.

pyspider A Powerful Spider(Web Crawler) System in Python. Write script in Python Powerful WebUI with script editor, task monitor, project manager and

15.7k Jan 4, 2023

Using Python and Pushshift.io to Track stocks on the WallStreetBets subreddit

wallstreetbets-tracker Using Python and Pushshift.io to Track stocks on the WallStreetBets subreddit.

91 Dec 8, 2022

Scrapy, a fast high-level web crawling & scraping framework for Python.

Scrapy Overview Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pag

45.5k Jan 7, 2023

News, full-text, and article metadata extraction in Python 3. Advanced docs:

Newspaper3k: Article scraping & curation Inspired by requests for its simplicity and powered by lxml for its speed: "Newspaper is an amazing python li

12.3k Jan 7, 2023

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group

8.4k Jan 8, 2023

A Python module to bypass Cloudflare's anti-bot page.

cloudscraper A simple Python module to bypass Cloudflare's anti-bot page (also known as "I'm Under Attack Mode", or IUAM), implemented with Requests.

2.6k Dec 31, 2022

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

AutoScraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python This project is made for automatic web scraping to make scraping easy. It

4.8k Jan 4, 2023

Async Python 3.6+ web scraping micro-framework based on asyncio

Ruia Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster. Overview Ruia is an async web scraping micro-frame

1.6k Jan 1, 2023

Transistor, a Python web scraping framework for intelligent use cases.

Web data collection and storage for intelligent use cases. transistor About The web is full of data. Transistor is a web scraping framework for collec

212 Nov 5, 2022

Html Content / Article Extractor, web scrapping lib in Python

Python-Goose - Article Extractor Intro Goose was originally an article extractor written in Java that has most recently (Aug2011) been converted to a

3.8k Jan 2, 2023

A pure-python HTML screen-scraping library

Scrapely Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely con

1.8k Dec 31, 2022

A Happy and lightweight Python Package that searches Google News RSS Feed and returns a usable JSON response and scrap complete article - No need to write scrappers for articles fetching anymore

GNews A Happy and lightweight Python Package that searches Google News RSS Feed and returns a usable JSON response As well as you can fetch full

273 Dec 31, 2022
