Parse feeds in Python

Overview

feedparser - Parse Atom and RSS feeds in Python.

Copyright 2010-2020 Kurt McKee <[email protected]>
Copyright 2002-2008 Mark Pilgrim

feedparser is open source. See the LICENSE file for more information.

Installation

feedparser can be installed by running pip:

$ pip install feedparser
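
A minimal usage sketch (the URL is only an example; parse() also accepts a local path, a raw string, or an open file object):

import feedparser

# parse() returns a FeedParserDict with feed-level metadata and a list of entries.
d = feedparser.parse("https://example.com/feed.xml")
print(d.feed.get("title"))
for entry in d.entries:
    print(entry.get("title"), entry.get("link"))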

Documentation

The feedparser documentation is available on the web at:

https://feedparser.readthedocs.io/en/latest/

It is also included in its source format, ReST, in the docs/ directory. To build the documentation you'll need the Sphinx package, which is available at:

https://www.sphinx-doc.org/

You can then build HTML pages using a command similar to:

$ sphinx-build -b html docs/ fpdocs

This will produce HTML documentation in the fpdocs/ directory.

Testing

Feedparser has an extensive test suite, powered by tox. To run it, type this:

$ python -m venv venv
$ source venv/bin/activate  # or "venv\Scripts\Activate.ps1" on Windows
(venv) $ python -m pip install --upgrade pip
(venv) $ python -m pip install poetry
(venv) $ poetry update
(venv) $ tox

This will spawn an HTTP server that will listen on port 8097. The tests will fail if that port is in use.

Comments
  • feedparser repository - no longer maintained?

    This repository unfortunately does look unmaintained. If needed (and I believe it is), is there someone willing to fork it, merge the open pull requests, and take responsibility for its future? Maybe a team of users, as discussed in #108?

    references:
    #108 https://github.com/kurtmckee/feedparser/pull/131#issuecomment-443467549

    opened by introspectionism 31
  • craigslist rss requests fail with 403 error, but wget and browser succeed.

    Note: I filed this issue first with rss2email, but the maintainer states it is a feedparser issue.


    I have duplicated this on separate machines in different physical locations.

    'r2e run' fails to fetch the feed with a 403 error. However, the URL loads just fine in wget and in any web browser, so it is not IP related. Proof (steps to reproduce) below.

    Using r2e version 3.9, from ubuntu repo, and also master from github/rss2email.

    All craigslist.org feed URLs have been failing since approximately May 9. I notified Craig (of craigslist) and he replied that he sent it to his engineering team. On May 17, the feeds started working again and I thought the problem resolved, but by May 18 the 403s were back, and they continue. Prior to May 9, the feeds had been working fine for years.

    I also tried modifying the USER_AGENT string in feed.py to e.g. 'Mozilla/5.0', and also omitting the string (to use the feedparser default), but there was no change.
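
    For reference, the user agent can also be overridden per call from feedparser itself rather than by editing feed.py; a minimal sketch, with the UA string and URL taken from this report:

    import feedparser

    # feedparser.parse() accepts an agent= argument; when omitted it falls back
    # to the module-level feedparser.USER_AGENT default.
    d = feedparser.parse(
        'https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1',
        agent='Mozilla/5.0',
    )
    print(d.get('status'))  # 200 on success, 403 when the server rejects the client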

    This seems to be a server-side issue, since my installation was working well until May 9. However, it is very interesting that wget works when r2e does not, which indicates there must be a client-side way to achieve a correct fetch.

    I had initially thought the problem was likely related to too many requests in a given time interval; however, I tried with a brand-new rss2email install on a remote server and it failed on the very first request, as shown below.

    Anyway, I hope we can get it working again.

    $ r2e add cl1 'https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1' <email>
    
    $ r2e run
    HTTP status 403 fetching feed cl1 (https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1 -> [EMAIL]
    
    $ wget -O feed.xml "https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1"
    --2019-05-27 07:49:19--  https://sfbay.craigslist.org/search/sss?format=rss&query=sw5548&searchNearby=1
    Resolving sfbay.craigslist.org (sfbay.craigslist.org)... 208.82.238.18
    Connecting to sfbay.craigslist.org (sfbay.craigslist.org)|208.82.238.18|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [application/rss+xml]
    Saving to: ‘feed.xml’
    
    feed.xml                              [ <=>                                                         ]   1.40K  --.-KB/s    in 0s      
    
    2019-05-27 07:49:20 (55.3 MB/s) - ‘feed.xml’ saved [1433]
    
    $ head -n 6 feed.xml 
    <?xml version="1.0" encoding="UTF-8"?>
    
    <rdf:RDF
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns="http://purl.org/rss/1.0/"
     xmlns:enc="http://purl.oclc.org/net/rss_2.0/enc#"
    
    opened by dan-da 29
  • Memory usage reduction (#296)

    This PR is for the memory usage reduction proposed in #296; see the issue for a detailed description.

    There's one commit per logical change, so things are easier to review. Tests and mypy pass for each commit.

    • [x] stream-oriented version of convert_to_utf8() (new code, still unused)
    • [x] extract _parse_file_inplace() from parse()
    • [x] update JSONParser.feed() to take a file instead of a string
      • note that calling parse() with a JSON feed fails with SAXParseException, but develop fails exactly the same way too
    • [x] update _parse_file_inplace() to use convert_file_to_utf8()
    • [x] update _open_resource() to return an open file instead of bytes
    • [x] check if the entire file can be decoded in convert_file_to_utf8() (added later)
      • without this, parse() may sometimes raise UnicodeDecodeError, which would break the API
    • [x] changelog

    I did not add a section about optimistic_encoding_detection in docs/character-encoding.rst, since it is more or less an implementation detail (the flag exists only to allow getting the original behavior). Please let me know if you think this should be mentioned in the documentation. As the internet moves to UTF-8, I expect the need for the flag/fallback to decrease altogether (as of April 2022, UTF-8 seems to be used by 97.6% of websites).
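
    For context, parse() already accepts an open file object as well as a URL or raw string, which is the call pattern where streaming the encoding conversion matters; a usage sketch (not the PR's internal API):

    import feedparser

    # Passing an open binary file lets feedparser read the data itself instead of
    # the caller loading the whole document into memory first.
    with open('feed.xml', 'rb') as f:
        d = feedparser.parse(f)
    print(len(d.entries))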

    performance 
    opened by lemon24 18
  • Maintained?

    There has not been much activity on this project for over a year. I also see quite a few issues related to general maintenance, such as updating the PyPI package and supporting a new feed type. I was wondering if this is still being maintained.

    opened by AeolusDraco 16
  • Support for JSON Feeds

    https://jsonfeed.org/2017/05/17/announcing_json_feed

    Once JSON Feed support hits feedparser, I can add it to NewsBlur, giving tens of thousands of readers access to JSON Feeds.

    opened by samuelclay 16
  • memory leak on feedparser 5.2.1?

    Recently I was parsing an RSS feed using feedparser 5.2.1, and I accidentally noticed a continuous memory increase as my app runs without a break. Is there any mistake on my part? Any help would be highly appreciated.

    My app code is as follows (as an example):

    import feedparser
    import time
    
    Url = 'https://www.xxx.com/feeds/all'
    myTag = ""
    
    while True:
        time.sleep(5)
        # Pass the previous ETag so unchanged feeds can return 304 Not Modified.
        feed_data = feedparser.parse(Url, etag=myTag)
        myTag = feed_data.get('etag')
    

    The code mentioned above is compiled into an .exe app through PyInstaller and then left to run without a break on Windows Server 2012.
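
    One hedged way to see where the growth comes from (not part of the original report) is Python's built-in tracemalloc:

    import time
    import tracemalloc

    import feedparser

    tracemalloc.start()
    etag = None
    for _ in range(10):
        feed_data = feedparser.parse('https://www.xxx.com/feeds/all', etag=etag)
        etag = feed_data.get('etag')
        time.sleep(5)

    # Print the top allocation sites after the polling loop.
    for stat in tracemalloc.take_snapshot().statistics('lineno')[:10]:
        print(stat)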

    opened by biotech7 14
  • feedparser.parse() does not return, causing my PTB job to be stuck

    Hi,

    I have a small Python bot which scans RSS feeds on an interval. Every N seconds the job is triggered to iterate over feeds saved in a sqlite3 database and fetch each feed; it then goes on to check whether the DB already has the feed message and, if not, broadcasts it over Telegram.

    For quite some time now I've had to reboot the bot on a dynamic time interval; after a while it just seems that feedparser.parse() no longer returns, causing the job to be forever pending.

    It took me quite some time to figure out that it's feedparser that's not returning. At first I thought it was some I/O issue related to sqlite3; the bot also runs in a Docker container and I assumed it could be related to that, but it's neither.

    Please see code snippet of jobs.py below. In the snippet, db.get_all_feeds() returns a list of tuples where tuple[0] == feed_name and tuple[1] == feed_url.

    def rss_monitor(context):
        feeds = db.get_all_feeds()
        for feed in feeds:
            preview = db.get_preview(feed[0])
            ... # Here we check whether the feed requires a cookie or not, if so append it to headers
        rss = feedparser.parse(feed[1], request_headers=headers)  # <- this line does not return after N iterations
        if rss.status == 200:
            # Process feed, check if message exists in database and if not, broadcast it over telegram.
            ...
        else:
            logger.error('Could not fetch feed: ' + feed[1])
            logger.error('Feed HTTP response_code: ' + str(rss.status))
    

    N is dynamic, and I cannot reproduce this for a given number; sometimes the job fails after 10h, sometimes after 15h, and sometimes it works fine for 24h.

    I am using feedparser==6.0.2, which is, as far as I know, the latest version of feedparser. Is there anything else I can do to make feedparser throw an error, or perhaps hint at why it is no longer returning? If any additional information is required I will gladly supply it.
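
    One common workaround for hangs like this (a sketch, not an official feedparser feature) is to set a process-wide socket timeout before calling parse(), so a stalled connection eventually errors out instead of blocking forever:

    import socket

    import feedparser

    # feedparser 6.0.x has no per-call timeout; the 30-second value here is arbitrary.
    socket.setdefaulttimeout(30)

    rss = feedparser.parse('https://example.com/feed.xml')
    if rss.get('bozo'):
        # A timeout usually surfaces as a bozo_exception rather than a raised error.
        print('fetch problem:', rss.get('bozo_exception'))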

    need-info 
    opened by furiousxk 13
  • Added a timeout parameter to the parse function

    The default is set to 30 seconds.

    I only saw https://github.com/kurtmckee/feedparser/pull/77 after I made my changes. However, PR #77 introduces a hardcoded parameter and does not respect the API of feedparser.

    I recommend using this PR instead of 77.

    Usage:

        feed = feedparser.parse("http://feeds.rsc.org/rss/cc", timeout=1)
    

    But old syntax will still work, of course:

        feed = feedparser.parse("http://feeds.rsc.org/rss/cc")
    
    opened by JPFrancoia 13
  • Travis support for automated builds including Python 3.7

    This PR adds a simple Travis CI configuration file that includes tox build configs for

    • Python 2.7
    • Python 3.4
    • Python 3.5
    • Python 3.6
    • Python 3.7

    Of course, at this point in time only the first 4 variants will run successfully; the Python 3.7 tox build will fail until https://github.com/kurtmckee/feedparser/pull/131 is merged.

    Once merged, please sign up for an account at https://travis-ci.org/, link your GitHub account, add this repository, and Travis CI will automatically build the project at each commit.

    opened by exxamalte 11
  • New PyPI release

    Hi @kurtmckee, we are using feedparser in Galaxy and we will soon need fully Python 3-compatible dependencies. Can you do a new release including sgmllib3k and Python 3.5 support, please?

    opened by nsoranzo 11
  • Traceback with _parse_georss_point

    Hi, I got this, similar to #130:

    Python 3.7.4 (default, Jul  9 2019, 16:32:37) 
    [GCC 9.1.1 20190503 (Red Hat 9.1.1-1)] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> url='https://mundosauriga.blogspot.com/feeds/posts/default?alt=rss'
    >>> import feedparser
    >>> feed = feedparser.parse(url)
    Traceback (most recent call last):
      File "/usr/lib/python3.7/site-packages/feedparser.py", line 3766, in _gen_georss_coords
        t = [nxt(), nxt()][::swap and -1 or 1]
    StopIteration
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python3.7/site-packages/feedparser.py", line 3956, in parse
        saxparser.parse(source)
      File "/usr/lib64/python3.7/site-packages/drv_libxml2.py", line 239, in parse
        _d(reader.Name()))
      File "/usr/lib/python3.7/site-packages/feedparser.py", line 2052, in endElementNS
        self.unknown_endtag(localname)
      File "/usr/lib/python3.7/site-packages/feedparser.py", line 696, in unknown_endtag
        method()
      File "/usr/lib/python3.7/site-packages/feedparser.py", line 1463, in _end_georss_point
        geometry = _parse_georss_point(self.pop('geometry'))
      File "/usr/lib/python3.7/site-packages/feedparser.py", line 3775, in _parse_georss_point
        coords = list(_gen_georss_coords(value, swap, dims))
    RuntimeError: generator raised StopIteration
    

    Any hint for this?

    Feedparser version is provided by python3-feedparser-5.2.1-9.fc30.noarch

    opened by iranzo 9
  • Handle HTTP status 308 (Permanent Redirect) as a redirect

    While researching for a fix for https://github.com/rss2email/rss2email/issues/229, I noticed that feedparser does not handle HTTP status code 308 the same as the other HTTP redirects. The new status code 308 (Permanent Redirect) was added to the standard in 2015 as the missing variant of status code 301 (Moved Permanently) which “does not allow changing the request method from POST to GET”.
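
    For reference, when a redirect is followed the parse result records both the HTTP status and the final URL, so 308 handling can be checked with a sketch like this (the URL is hypothetical):

    import feedparser

    d = feedparser.parse('https://example.com/old-feed-url')
    print(d.get('status'))  # the redirect-related status feedparser reports
    print(d.get('href'))    # the URL the content was ultimately fetched from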

    opened by amiryal 0
  • `itunes:summary` overwrites `description` field in feed items when parsing

    When a feed item entry has both an <itunes:summary> tag and a <description> tag, the <itunes:summary> tag takes precedence and overwrites whatever is present in the <description> tag, making it available at the summary key on the item's dict.

    Example:

    <?xml version="1.0" encoding="UTF-8"?>
    <rss
      version="2.0"
      xmlns:content="http://purl.org/rss/1.0/modules/content/"
      xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
    >
      <channel>
        <item>
          <title>A title</title>
          <description><![CDATA[<p>The description field</p>]]></description>
          <link>https://example.com</link>
          <content:encoded><![CDATA[<p>The content</p>]]></content:encoded>
          <itunes:summary>Itunes summary</itunes:summary>
        </item>
      </channel>
    </rss>
    

    Parsing the above with parse(), the summary for the item entry is set to the value in <itunes:summary>:

    >>> parsed_feed = feedparser.parse("the-above-feed.xml")
    >>> parsed_feed.entries[0].summary == 'Itunes summary'
    True
    

    My expectation is that the <itunes:summary> value would be available at the itunes_summary key, much like the other values in the iTunes namespace, and that the <description> tag's value would be available at summary, as outlined in the documentation. Instead, the iTunes summary is given precedence, as shown above, and applied to the summary key. Even when <itunes:summary> is an empty tag, I still get an empty string rather than the value from the <description> field.

    This seems to be very similar to both #314 and #316. Is this expected behavior or is this a bug?

    opened by neilius 0
  • What is your recommended way to convert feedparser's date representation to a datetime object?

    I think this question belongs here and not on Stack Overflow, because as the library author you would be able to answer it best.

    Issues I referenced before asking: https://github.com/kurtmckee/feedparser/issues/212 and https://github.com/kurtmckee/feedparser/issues/51

    Problem

    • feedparser returns a string representation of the published date under published and a struct_time representation of the same under published_parsed
    • I am not able to store either of these directly to Postgres because it needs a datetime when working via asyncpg

    How to reproduce this problem

    
    def md5(text):
        import hashlib
        return hashlib.md5(text.encode('utf-8')).hexdigest()
    
    def fetch():
        import feedparser
        data = feedparser.parse('https://cointelegraph.com/rss')
        return data
    
    async def insert(rows):
        import asyncpg
        async with asyncpg.create_pool(user='postgres', database='postgres') as pool:
            async with pool.acquire() as conn:
                results = await conn.executemany('INSERT INTO test (feed_item_id, pubdate) VALUES($1, $2)', rows)
                print(results)
    
    async def main():
        data = fetch()
        first_entry = data.entries[0]
        await insert([(md5(first_entry.guid), first_entry.published)])
        await insert([(md5(first_entry.guid), first_entry.published_parsed)])
    
    import asyncio
    asyncio.run(main())
    
    

    Both insert statements above will fail.

    What have I found so far?

    I found 3 methods, but each seems to have a limitation.

    Method 1

    Convert it with strptime

    import feedparser
    from datetime import datetime

    data = feedparser.parse('https://cointelegraph.com/rss')
    pubdate = data.entries[0].published
    pubdate_parsed = data.entries[0].published_parsed
    
    
    
    >>> pubdate
    'Thu, 04 Aug 2022 06:53:42 +0100'
    

    I could do this

    
    >>> method1 = datetime.strptime(pubdate, '%a, %d %b %Y %H:%M:%S %z')
    >>> method1
    datetime.datetime(2022, 8, 4, 6, 53, 42, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600)))
    

    I am guessing this would raise an error if some feed returns an incorrect format, and I am also not sure whether this works when an extra leap second gets added.

    Method 2

    
    >>> datetime.fromtimestamp(mktime(pubdate_parsed))
    datetime.datetime(2022, 8, 4, 5, 53, 42)
    

    This seems to completely lose the timezone information, or am I wrong about that? And what happens here if DST is in effect?
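
    One way to address that concern (a sketch, not from the thread): feedparser documents published_parsed as a time tuple already normalized to UTC, so calendar.timegm() is the matching inverse, unlike time.mktime(), which assumes local time:

    import calendar
    from datetime import datetime, timezone

    def struct_to_datetime(st):
        # st is feedparser's published_parsed (a UTC time.struct_time);
        # timegm() interprets it as UTC, and we attach an explicit tzinfo.
        return datetime.fromtimestamp(calendar.timegm(st), tz=timezone.utc)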

    Method 3: requires a third-party library called dateutil, as shown at https://stackoverflow.com/a/18726020/5371505

    Question

    • What is the most robust way to convert the published or published_parsed output that feedparser generates into a datetime object?
    • Can it be done without a third-party library such as dateutil?
    • Is there any native undocumented approach to get a datetime object directly from feedparser that I am not aware of?

    Thank you for your time

    opened by slidenerd 1
  • should entry.tags be defined even when empty?

    I have two different RSS feeds, both of which have a number of elements at the xpath /rss/channel/item/category, which, according to the docs, is one source for tags (categories) on elements.

    However, when feedparser parses them, entries from one have tags, and entries from the other do not.

    This feed, https://seekingalpha.com/feed.xml, comes up with plentiful tags, even though the RSS does not validate. This feed, https://rss.nytimes.com/services/xml/rss/nyt/World.xml, throws AttributeError when entry.tags is accessed, even though the RSS does validate.

    Am I missing something? Is it a bug?
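
    In the meantime, a defensive-access sketch (URL from the report) that avoids the AttributeError:

    import feedparser

    d = feedparser.parse('https://rss.nytimes.com/services/xml/rss/nyt/World.xml')
    for entry in d.entries:
        # entry is a FeedParserDict; .get() returns a default instead of raising
        # AttributeError when no category/tag elements were recognized.
        tags = [t.get('term') for t in entry.get('tags', [])]
        print(entry.get('title'), tags)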

    opened by JoeGermuska 1
  • Test failures when cchardet-2.1.7 and chardet are installed

    When cchardet-2.1.7 and chardet-5.0.0 are both installed, the following tests fail.

    From what I can see, two of them fail because of encoding name mismatches (the expected name is mixed-case, the detected value is uppercase), and two of them are recognized as a superset encoding of the specified encoding (i.e. EUC-KR as UHC, and GB2312 as GB18030).
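
    The case mismatches, at least, could be tolerated by normalizing both names through the codecs registry before comparing; a sketch, not the test suite's actual fix:

    import codecs

    # codecs.lookup() maps aliases and case variants to one canonical name,
    # so 'WINDOWS-1255' and 'windows-1255' compare equal after normalization.
    assert codecs.lookup('WINDOWS-1255').name == codecs.lookup('windows-1255').name
    # Superset detections (UHC for EUC-KR, GB18030 for GB2312) still differ
    # and would need an explicit allowance.
    assert codecs.lookup('euc-kr').name != codecs.lookup('uhc').name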

    ...F...FF.F [long run of passing-test dots trimmed]
    ======================================================================
    FAIL: test_001742 (__main__.TestCase)
    ./tests/illformed/chardet/windows1255.xml: windows-1255 with no encoding information
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
        self.fail_unless_eval(xmlfile, eval_string)
      File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
        raise self.failureException(failure)
    AssertionError: not eval(b"bozo and encoding == 'windows-1255'") 
    WITH env({'bozo': True,
     'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as WINDOWS-1255'),
     'content-type': '',
     'encoding': 'WINDOWS-1255',
     'entries': [{'summary': 'האם תדפיס נייר של אתר אינטרנט שמוצג על מסך משתמש הוא '
                             'העתק נאמן למקור של אתר האינטרנט? רבים יגידו שכן, '
                             'ולפעמים גם בתי המשפט יצטרפו אליהם שיקבלו פלט מאתר '
                             'אינטרנט כראיה קבילה. אבל, זה ממש לא כך. ויש אפילו '
                             'הוכחה מדהימה.',
                  'summary_detail': {'base': '',
                                     'language': None,
                                     'type': 'text/html',
                                     'value': 'האם תדפיס נייר של אתר אינטרנט שמוצג '
                                              'על מסך משתמש הוא העתק נאמן למקור של '
                                              'אתר האינטרנט? רבים יגידו שכן, '
                                              'ולפעמים גם בתי המשפט יצטרפו אליהם '
                                              'שיקבלו פלט מאתר אינטרנט כראיה '
                                              'קבילה. אבל, זה ממש לא כך. ויש אפילו '
                                              'הוכחה מדהימה.'}}],
     'feed': {},
     'headers': {},
     'namespaces': {},
     'version': 'rss'})
    
    ======================================================================
    FAIL: test_001746 (__main__.TestCase)
    ./tests/illformed/chardet/gb2312.xml: GB2312 with no encoding information
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
        self.fail_unless_eval(xmlfile, eval_string)
      File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
        raise self.failureException(failure)
    AssertionError: not eval(b"bozo and encoding == 'GB2312'") 
    WITH env({'bozo': True,
     'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as GB18030'),
     'content-type': '',
     'encoding': 'GB18030',
     'entries': [{'title': '不归移民漫画系列:专业工作',
                  'title_detail': {'base': '',
                                   'language': None,
                                   'type': 'text/plain',
                                   'value': '不归移民漫画系列:专业工作'}}],
     'feed': {},
     'headers': {},
     'namespaces': {},
     'version': 'rss'})
    
    ======================================================================
    FAIL: test_001747 (__main__.TestCase)
    ./tests/illformed/chardet/euckr.xml: EUC-KR with no encoding information
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
        self.fail_unless_eval(xmlfile, eval_string)
      File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
        raise self.failureException(failure)
    AssertionError: not eval(b"bozo and encoding == 'EUC-KR'") 
    WITH env({'bozo': True,
     'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as UHC'),
     'content-type': '',
     'encoding': 'UHC',
     'entries': [{'summary': 'TypeKey 시스템이 UTF-8로 돌아가는데, 거기서 한글로 된 닉네임을 정할 경우에, '
                             'EUC-KR로 된 무버블타입 블록에선 리다이렉트되어 전송되어오는 닉네임이 UTF라 당연히 '
                             '깨어져 나타난다. 실제 블록 등에서 사용하는 필명 내지는 닉네임은 한글로 사용하는 많은 분들도 '
                             '타입키에서의 닉네임은 이런 문제때문에 울며겨자먹기로 영어로 짓고 있다....',
                  'summary_detail': {'base': '',
                                     'language': None,
                                     'type': 'text/html',
                                     'value': 'TypeKey 시스템이 UTF-8로 돌아가는데, 거기서 한글로 '
                                              '된 닉네임을 정할 경우에, EUC-KR로 된 무버블타입 블록에선 '
                                              '리다이렉트되어 전송되어오는 닉네임이 UTF라 당연히 깨어져 '
                                              '나타난다. 실제 블록 등에서 사용하는 필명 내지는 닉네임은 '
                                              '한글로 사용하는 많은 분들도 타입키에서의 닉네임은 이런 '
                                              '문제때문에 울며겨자먹기로 영어로 짓고 있다....'},
                  'title': 'EUC-KR 에서 TypeKey 한글닉네임 표시하기',
                  'title_detail': {'base': '',
                                   'language': None,
                                   'type': 'text/plain',
                                   'value': 'EUC-KR 에서 TypeKey 한글닉네임 표시하기'}}],
     'feed': {},
     'headers': {},
     'namespaces': {},
     'version': 'rss'})
    
    ======================================================================
    FAIL: test_001749 (__main__.TestCase)
    ./tests/illformed/chardet/big5.xml: Big5 with no encoding information
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/tmp/feedparser/tests/runtests.py", line 1191, in fn
        self.fail_unless_eval(xmlfile, eval_string)
      File "/tmp/feedparser/tests/runtests.py", line 177, in fail_unless_eval
        raise self.failureException(failure)
    AssertionError: not eval(b"bozo and encoding == 'Big5'") 
    WITH env({'bozo': True,
     'bozo_exception': CharacterEncodingOverride('document declared as utf-8, but parsed as BIG5'),
     'content-type': '',
     'encoding': 'BIG5',
     'entries': [],
     'feed': {'title': '我希望??很容易?其翻?成中文,并有助于改??件。 感?您??本文。',
              'title_detail': {'base': '',
                               'language': None,
                               'type': 'text/plain',
                               'value': '我希望??很容易?其翻?成中文,并有助于改??件。 感?您??本文。'}},
     'headers': {},
     'namespaces': {'': 'http://www.w3.org/2005/Atom'},
     'version': 'atom10'})
    
    ----------------------------------------------------------------------
    Ran 4354 tests in 4.892s
    
    FAILED (failures=4)
    
    opened by mgorny 0
Owner
Kurt McKee
A python module to parse the Open Graph Protocol

OpenGraph is a module of python for parsing the Open Graph Protocol, you can read more about the specification at http://ogp.me/ Installation $ pip in

Erik Rivera 213 Nov 12, 2022
API to parse tibia.com content into python objects.

Tibia.py An API to parse Tibia.com content into object oriented data. No fetching is done by this module, you must provide the html content. Features:

Allan Galarza 25 Oct 31, 2022
Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc)

Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc).

Amit 6 Aug 26, 2022
A Python library for automating interaction with websites.

Home page https://mechanicalsoup.readthedocs.io/ Overview A Python library for automating interaction with websites. MechanicalSoup automatically stor

null 4.3k Jan 7, 2023
A Powerful Spider(Web Crawler) System in Python.

pyspider A Powerful Spider(Web Crawler) System in Python. Write script in Python Powerful WebUI with script editor, task monitor, project manager and

Roy Binux 15.7k Jan 4, 2023
Using Python and Pushshift.io to Track stocks on the WallStreetBets subreddit

wallstreetbets-tracker Using Python and Pushshift.io to Track stocks on the WallStreetBets subreddit.

null 91 Dec 8, 2022
Scrapy, a fast high-level web crawling & scraping framework for Python.

Scrapy Overview Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pag

Scrapy project 45.5k Jan 7, 2023
News, full-text, and article metadata extraction in Python 3. Advanced docs:

Newspaper3k: Article scraping & curation Inspired by requests for its simplicity and powered by lxml for its speed: "Newspaper is an amazing python li

Lucas Ou-Yang 12.3k Jan 7, 2023
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Jan 8, 2023
A Python module to bypass Cloudflare's anti-bot page.

cloudscraper A simple Python module to bypass Cloudflare's anti-bot page (also known as "I'm Under Attack Mode", or IUAM), implemented with Requests.

VeNoMouS 2.6k Dec 31, 2022
A Smart, Automatic, Fast and Lightweight Web Scraper for Python

AutoScraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python This project is made for automatic web scraping to make scraping easy. It

Mika 4.8k Jan 4, 2023
Async Python 3.6+ web scraping micro-framework based on asyncio

Ruia Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster. Overview Ruia is an async web scraping micro-frame

howie.hu 1.6k Jan 1, 2023
Transistor, a Python web scraping framework for intelligent use cases.

Web data collection and storage for intelligent use cases. transistor About The web is full of data. Transistor is a web scraping framework for collec

BOM Quote Manufacturing 212 Nov 5, 2022
Html Content / Article Extractor, web scrapping lib in Python

Python-Goose - Article Extractor Intro Goose was originally an article extractor written in Java that has most recently (Aug2011) been converted to a

Xavier Grangier 3.8k Jan 2, 2023
A pure-python HTML screen-scraping library

Scrapely Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely con

Scrapy project 1.8k Dec 31, 2022
A Happy and lightweight Python Package that searches Google News RSS Feed and returns a usable JSON response and scrap complete article - No need to write scrappers for articles fetching anymore

GNews A Happy and lightweight Python Package that searches Google News RSS Feed and returns a usable JSON response. As well as you can fetch full

Muhammad Abdullah 273 Dec 31, 2022
Web-based automatic check-in implemented with python + selenium, plus daily email sending, a Kingsoft PowerWord (金山词霸) sentence of the day, and "toxic chicken soup" quotes (running stably since February)

Web-based automatic check-in implemented with python + selenium. Notes: this check-in script is intended for the Zhengzhou University health check-in; other web-based check-ins can also learn from it. (For my own use; running stably since February.) For learning and exchange only, please do not rely on it. The developer takes no responsibility for any problems caused by using this script, makes no guarantee about its effectiveness, and in principle provides no technical support of any kind. To prevent

Sunday 1 Aug 27, 2022
Python scraper to check for earlier appointments in Clalit Health Services

clalit-appt-checker Python scraper to check for earlier appointments in Clalit Health Services Some background If you ever needed to schedule a doctor

Dekel 16 Sep 17, 2022
Python Web Scrapper Project

Web Scrapper Projeto desenvolvido em python, sobre tudo com Selenium, BeautifulSoup e Pandas é um web scrapper que puxa uma tabela com as principais e

Jordan Ítalo Amaral 2 Jan 4, 2022