A scalable frontier for web crawlers

Frontera

Overview

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build large-scale online web crawlers.

Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by the crawler to decide which pages to visit next, and it is capable of doing this in a distributed manner.

Main features

  • Online operation: small request batches, with parsing done right after fetch.
  • Pluggable backend architecture: low-level backend access logic is separated from the crawling strategy.
  • Two run modes: single process and distributed.
  • Built-in SQLAlchemy, Redis and HBase backends.
  • Built-in Apache Kafka and ZeroMQ message buses.
  • Built-in crawling strategies: breadth-first, depth-first, Discovery (with support for robots.txt and sitemaps).
  • Battle tested: our biggest deployment is 60 spiders/strategy workers delivering 50-60M documents daily for 45 days, without downtime.
  • Transparent data flow, making it easy to integrate custom components using Kafka.
  • Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
  • Optional use of Scrapy for fetching and parsing.
  • 3-clause BSD license, allowing use in any commercial product.
  • Python 3 support.

Installation

$ pip install frontera
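
To wire Frontera into a Scrapy project in single-process mode, the settings end up looking roughly like the sketch below. This is a hedged quick-start assembled from the configuration examples quoted later on this page; the module path myproject.frontera_settings is hypothetical, and the exact setting names should be checked against the documentation for your Frontera version.

    # Scrapy settings.py -- minimal single-process integration (sketch, names assumed)
    FRONTERA_SETTINGS = 'myproject.frontera_settings'  # hypothetical module path

    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

    SPIDER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
    }
    DOWNLOADER_MIDDLEWARES = {
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
    }

    # myproject/frontera_settings.py -- Frontera-side settings (sketch); pick the
    # backend appropriate for your run mode (the backend path below is copied from
    # the user configuration quoted further down this page).
    BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'
    SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontera.db'  # any SQLAlchemy engine URL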

Documentation

Community

Join our Google group at https://groups.google.com/a/scrapinghub.com/forum/#!forum/frontera or check GitHub issues and pull requests.

Comments
  • Redesign codecs

    Redesign codecs

    Issue discussed here: https://github.com/scrapinghub/frontera/issues/211#issuecomment-251931413

    Todo list:

    • [x] Fix msgpack codec
    • [x] Fix JSON codec
    • [x] Integration test with HBase backend (manually)

    This PR fixes #211

    Other things done in this PR besides the todo list:

    • Added two methods, _convert and reconvert, to the JSON codec. These are needed because JSONEncoder accepts strings only as unicode. The _convert method recursively converts objects to unicode and saves their type (a rough sketch of the idea follows after this list).
    • Made msgpack >= 0.4 a requirement, since only versions greater than 0.4 support the changes made in this PR.
    • Fixed a buggy test case in test_message_bus_backend which got exposed after fixing the codecs.
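
    A rough illustration of the type-preserving conversion idea described above; this is not the actual Frontera implementation, and the function names are made up:

    # Sketch: recursively turn bytes into unicode while recording the original
    # type, so that decoding can restore it exactly (hypothetical helper names).
    def convert_and_save_type(obj):
        if isinstance(obj, bytes):
            return 'bytes', obj.decode('latin1')
        if isinstance(obj, dict):
            return 'dict', [(convert_and_save_type(k), convert_and_save_type(v))
                            for k, v in obj.items()]
        if isinstance(obj, (list, tuple)):
            return type(obj).__name__, [convert_and_save_type(i) for i in obj]
        return 'other', obj

    def restore_from_saved_type(tagged):
        obj_type, value = tagged
        if obj_type == 'bytes':
            return value.encode('latin1')
        if obj_type == 'dict':
            return dict((restore_from_saved_type(k), restore_from_saved_type(v))
                        for k, v in value)
        if obj_type in ('list', 'tuple'):
            items = [restore_from_saved_type(i) for i in value]
            return tuple(items) if obj_type == 'tuple' else items
        return value
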
    opened by voith 35
  • Distributed example (HBase, Kafka)

    Distributed example (HBase, Kafka)

    The documentation is a little sparse and does not explain how to integrate with Kafka and HBase for a fully distributed architecture. Could you please provide, in the examples folder, an example of a well-configured distributed Frontera setup?
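
    For reference, a distributed setup roughly follows the shape below. The setting names are reproduced from memory of the Frontera docs and should be treated as assumptions to verify against the settings reference for your version:

    # frontera_settings.py -- rough sketch of a distributed (Kafka + HBase) config;
    # all names below are assumptions to double-check against the docs.
    BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'
    HBASE_THRIFT_HOST = 'localhost'   # HBase Thrift server
    HBASE_THRIFT_PORT = 9090

    MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'
    KAFKA_LOCATION = 'localhost:9092'
    SPIDER_LOG_PARTITIONS = 2         # usually one per strategy worker
    SPIDER_FEED_PARTITIONS = 2        # usually one per spider process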

    opened by casertap 33
  • PY3 Syntactic changes.

    PY3 Syntactic changes.

    Most of the changes were produced using the modernize script. The changes include print syntax, error syntax, converting iterators and generators to lists, etc. It also includes some other changes that were missed by the script.

    opened by Preetwinder 32
  • Redirect loop when using distributed-frontera

    Redirect loop when using distributed-frontera

    I am using the development versions of distributed-frontera, frontera and scrapy for crawling. After a while my spider keeps getting stuck in a redirect loop. Restarting the spider helps, but after a while this happens:

    2015-12-21 17:23:22 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:23 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:24 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:26 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:27 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:32 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-12-21 17:23:32 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:32 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:33 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:34 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:34 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:35 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:35 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:36 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:37 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:38 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:43 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-12-21 17:23:43 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    ...
    2015-12-21 17:45:38 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:45:43 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    

    This does not seem to be an issue with distributed-frontera itself, since I could not find any code related to redirects there.
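
    Note that the redirect handling itself comes from Scrapy, so a generic Scrapy-side mitigation (independent of Frontera or distributed-frontera) is to tighten Scrapy's own redirect settings, for example:

    # Scrapy settings -- generic mitigation for redirect loops (Scrapy-side only)
    REDIRECT_MAX_TIMES = 5   # give up on a redirect chain sooner than the default of 20
    # or, per request, disable redirect following entirely:
    # yield scrapy.Request(url, meta={'dont_redirect': True}, callback=self.parse)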

    opened by lljrsr 25
  • [WIP] Added Cassandra backend

    [WIP] Added Cassandra backend

    This PR is a rebase of #128. Although I have completely changed the design and refactored the code, I have kept @wpxgit's commits (but squashed them), because this work was originally initiated by him.

    I have tried to follow the DRY methodology as much as possible, so I had to refactor some existing code.

    I have serialized dicts using pickle; as a result, this backend won't have the problems discussed in #211.

    The PR includes unit tests and some integration tests with the backends integration testing framework.

    It's good that Frontera has an integration-test framework for testing backends in single-threaded mode. However, a similar framework for the distributed mode is very much needed.

    I am open to all sorts of suggestions :)

    opened by voith 17
  • cluster kafka db worker doesnt recognize partitions

    cluster kafka db worker doesnt recognize partitions

    Hi, I'm trying to use the cluster configuration. I've created the topics in Kafka and have it up and running, but I'm running into trouble starting the database worker. I tried python -m frontera.worker.db --config config.dbw --no-incoming --partitions 0,1 and got an error that 0,1 is not recognized; I then tried python -m frontera.worker.db --config config.dbw --no-incoming --partitions 0 and was getting the same issue as in #359, but somehow that stopped happening.

    Now I'm getting an error that the Kafka partitions are not recognized or iterable, see the traceback below. I'm using Python 3.6 and Frontera from the repo (FYI, qzm and cachetools still needed to be installed manually). Any ideas?

    File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main "main", mod_spec) File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code exec(code, run_globals) File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 246, in args.no_scoring, partitions=args.partitions) File "/usr/lib/python3.6/dist-packages/frontera/worker/stats.py", line 22, in init super(StatsExportMixin, self).init(settings, *args, **kwargs) File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 115, in init self.slot = Slot(self, settings, **slot_kwargs) File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 46, in init self.components = self._load_components(worker, settings, **kwargs) File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 55, in _load_components component = cls(worker, settings, stop_event=self.stop_event, **kwargs) File "/usr/lib/python3.6/dist-packages/frontera/worker/components/scoring_consumer.py", line 24, in init self.scoring_log_consumer = scoring_log.consumer() File "/usr/lib/python3.6/dist-packages/frontera/contrib/messagebus/kafkabus.py", line 219, in consumer return Consumer(self._location, self._enable_ssl, self._cert_path, self._topic, self._group, partition_id=None) File "/usr/lib/python3.6/dist-packages/frontera/contrib/messagebus/kafkabus.py", line 60, in init self._partitions = [TopicPartition(self._topic, pid) for pid in self._consumer.partitions_for_topic(self._topic)]

    opened by danmsf 16
  • [WIP] Downloader slot usage optimization

    [WIP] Downloader slot usage optimization

    Imagine we have a queue of 10K URLs from many different domains. Our task is to fetch them as fast as possible. At the same time we have prioritization which tends to group URLs from the same domain. During downloading we want to be polite and limit per-host RPS. So picking just the top URLs from the queue leads to wasted time, because the connection pool of the Scrapy downloader is underused most of the time.

    In this PR, I'm addressing this issue by propagating information about overused hostnames/IPs in the downloader pool.
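
    A toy model of the idea, using plain dicts instead of Frontera request objects; the key_type and overused_keys names mirror the hints described above, but treat the exact API as an assumption:

    # Sketch only: skip requests whose host/IP slot is currently overused when
    # composing the next batch, so the downloader pool stays busy.
    def next_batch(queue, max_next_requests, overused_keys=(), key_type='domain'):
        overused = set(overused_keys)
        batch = []
        for request in queue:                    # queue is already priority-ordered
            key = request['meta'].get(key_type)  # hostname or IP, depending on key_type
            if key in overused:
                continue                         # politeness: leave it for a later batch
            batch.append(request)
            if len(batch) == max_next_requests:
                break
        return batch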

    opened by sibiryakov 16
  • Fixed scheduler process_spider_output() to yield requests

    Fixed scheduler process_spider_output() to yield requests

    Fixes #253. Here's a screenshot using the same code discussed there (screenshot from 2017-02-12 attached).

    Nothing seems to break when testing this change manually. The only test that was failing was wrong IMO because it passed a list of requests and items and was only expecting items in return. I have modified that test to make it compatible with this patch.

    I've split this PR into three commits:

    • The first commit adds a test to reproduce the bug.
    • The second commit fixes the bug.
    • The third commit fixes the broken test discussed above.

    A note about the tests added:

    The tests might be a little difficult to understand at first sight. I would recommend reading the following code in order to understand the tests:
    • https://github.com/scrapy/scrapy/blob/master/scrapy/core/spidermw.py#L34-L73: This is to understand how scrapy processes the different methods of the spider middleware.
    • https://github.com/scrapy/scrapy/blob/master/scrapy/core/scraper.py#L135-L147: This is to understand how the scrapy core executes the spider middleware methods and passes the control to the spider callbacks.

    I have simulated the code discussed above in order to write the test.

    opened by voith 15
  • New DELAY_ON_EMPTY functionality on FronteraScheduler terminates crawl right at start

    New DELAY_ON_EMPTY functionality on FronteraScheduler terminates crawl right at start

    While this is being solved, you can use this in your settings as a workaround:

    DELAY_ON_EMPTY=0.0
    

    The problem is in frontera.contrib.scrapy.schedulers.FronteraScheduler, in the method _get_next_requests. If there are no pending requests and the test self._delay_next_call < time() fails, an empty list is returned, which causes the crawl to terminate.
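
    A toy illustration of the failure mode described above (not Frontera's actual code): with nothing pending and the delay check blocking the call, the scheduler hands Scrapy an empty batch, and Scrapy treats an idle scheduler with no pending requests as a finished crawl.

    import time

    def get_next_requests(pending, delay_next_call, fetch_from_frontier):
        if not pending:
            if not (delay_next_call < time.time()):  # the failing test described above
                return []                            # empty batch right at start -> crawl ends
            return fetch_from_frontier()
        return pending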

    bug 
    opened by plafl 14
  • Fix SQL integer type for crc32 field

    Fix SQL integer type for crc32 field

    CRC32 is an unsigned 4-byte int, so it does not fit in a signed 4-byte int (Integer). There is no unsigned int type in the SQL standard, so I changed it to BigInteger instead. Without this change, both MySQL and Postgres complain that the host_crc32 field value is out of bounds. Another option (to save space) would be to convert CRC32 into a signed 4-byte int, but that would complicate things, and I'm not sure it's worth it.
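
    To make the overflow concrete: Python's zlib.crc32() returns an unsigned 32-bit value, which a signed 4-byte INTEGER column cannot always hold, hence the switch to BigInteger. A small self-contained illustration, including the space-saving alternative mentioned above:

    import zlib

    crc = zlib.crc32(b'some-host.example')         # unsigned, in range 0 .. 2**32 - 1
    print(crc > 2**31 - 1)                         # can be True -> out of range for a signed INTEGER

    # Space-saving alternative: fold the value into the signed 4-byte range
    # before storing (and undo this on read).
    signed = crc - 2**32 if crc >= 2**31 else crc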

    opened by lopuhin 12
  • Use crawler settings as a fallback when there's no FRONTERA_SETTINGS

    Use crawler settings as a fallback when there's no FRONTERA_SETTINGS

    This is a follow up to https://github.com/scrapinghub/frontera/pull/45.

    It enables the manager to receive the crawler settings and then instantiate the frontera settings accordingly. I added a few tests that should make the new behavior a little clearer.

    Is something along these lines acceptable? How can it be improved?

    opened by josericardo 12
  • how can I know it works when I use it with scrapy?

    how can I know it works when I use it with scrapy?

    I did everything as described in the document running-the-crawl, and started to run

    scrapy crawl my-spider
    

    I noticed items being crawled in the console, but I don't know whether Frontera is working.

    What I did


    sandwarm/frontera/settings.py

    
    BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'
    
    SQLALCHEMYBACKEND_ENGINE="mysql://acme:acme@localhost:3306/acme"
    SQLALCHEMYBACKEND_MODELS={
        'MetadataModel': 'frontera.contrib.backends.sqlalchemy.models.MetadataModel',
        'StateModel': 'frontera.contrib.backends.sqlalchemy.models.StateModel',
        'QueueModel': 'frontera.contrib.backends.sqlalchemy.models.QueueModel'
    }
    
    SPIDER_MIDDLEWARES.update({
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
    })
    
    DOWNLOADER_MIDDLEWARES.update({
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
    })
    
    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
    
    

    settings.py

    FRONTERA_SETTINGS = 'sandwarm.frontera.settings'
    
    

    Since I enabled the MySQL backend, I would expect to see a connection error, because I haven't started MySQL yet.

    Thanks for all your hard work, but please make the documentation easier for humans, for example with a very basic working example. Currently we need to gather all the documents just to get the basic idea, and even worse, it still doesn't work at all. I have already spent a week trying to get a working example.

    opened by vidyli 1
  • Project Status?

    Project Status?

    It's been a year since the last commit to the master branch. Do you have any plans to maintain this? I noticed a lot of issues don't get resolved, and lots of PRs are still pending.

    opened by psdon 8
  • Message Decode Error

    Message Decode Error

    Getting the following error when adding a URL to Kafka for Scrapy to parse:

    2020-09-07 20:12:46 [messagebus-backend] WARNING: Could not decode message: b'http://quotes.toscrape.com/page/1/', error unpack(b) received extra data.
    
    opened by ab-bh 0
  • `KeyError` thrown when running to_fetch in the StatesContext class: b'fingerprint'

    `KeyError` thrown when running to_fetch in the StatesContext class: b'fingerprint'

    https://github.com/scrapinghub/frontera/blob/master/frontera/core/manager.py. I'm using the 0.8.1 code base in LOCAL_MODE. The KeyError is thrown when execution reaches to_fetch in the StatesContext class:

    from line 801:

    class StatesContext(object):
        ...
        def to_fetch(self, requests):
            requests = requests if isinstance(requests, Iterable) else [requests]
            for request in requests:
                fingerprint = request.meta[b'fingerprint']  # error occurred here!!!
    

    I think the reason is that the b'fingerprint' meta key is used before it is set:

    from line 302:

    class LocalFrontierManager(BaseContext, StrategyComponentsPipelineMixin, BaseManager):
        def page_crawled(self, response):
            ...
            self.states_context.to_fetch(response)  # b'fingerprint' is used here
            self.states_context.fetch()
            self.states_context.states.set_states(response)
            super(LocalFrontierManager, self).page_crawled(response)  # but it is only set here!
            self.states_context.states.update_cache(response)
    

    from line 233:

    class BaseManager(object):
        def page_crawled(self, response):
            ...
            self._process_components(method_name='page_crawled',
                                     obj=response,
                                     return_classes=self.response_model)  # b'fingerprint' gets set when the pipeline passes through here
    

    My current workaround is to add these lines to the to_fetch method of the StatesContext class:

        def to_fetch(self, requests):
            requests = requests if isinstance(requests, Iterable) else [requests]
            for request in requests:
                if b'fingerprint' not in request.meta:                
                    request.meta[b'fingerprint'] = sha1(request.url)
                fingerprint = request.meta[b'fingerprint']
                self._fingerprints[fingerprint] = request
    

    What is the correct way to fix this?
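
    One note on the workaround: to avoid the same URL ending up under two different state keys, the fingerprint should probably be computed the same way Frontera's own fingerprint middleware does (a hex SHA-1 of the UTF-8 encoded URL). Whether the exact encoding below matches Frontera's canonical fingerprinting is an assumption to verify; a cleaned-up sketch of the workaround:

    # Sketch only: same workaround, with an explicit, deterministic hex SHA-1
    # fingerprint (Iterable comes from the surrounding module, as in the code above).
    from hashlib import sha1

    def to_fetch(self, requests):
        requests = requests if isinstance(requests, Iterable) else [requests]
        for request in requests:
            if b'fingerprint' not in request.meta:
                digest = sha1(request.url.encode('utf-8')).hexdigest()
                request.meta[b'fingerprint'] = digest.encode('utf-8')
            self._fingerprints[request.meta[b'fingerprint']] = request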

    opened by yujiaao 0
  • KeyError [b'frontier'] on Request Creation from Spider

    KeyError [b'frontier'] on Request Creation from Spider

    Issue might be related to #337

    Hi,

    I have already read in discussions here that the scheduling of requests should be done by Frontera, and apparently even the creation of requests should be done by the frontier and not by the spider. However, the documentation of Scrapy and Frontera says that requests should be yielded in the spider's parse function.

    What should the process look like if requests are to be created by the crawling strategy and not yielded by the spider? How does the spider trigger that?

    In my use case, I am using scrapy-selenium with scrapy and frontera (I use SeleniumRequests to be able to wait for JS loaded elements).

    I have to generate the URLs I want to scrape in two phases: I yield them first in the start_requests() method of the spider, instead of using a seeds file, and then yield requests for the extracted links in the first of two parse functions.

    Yielding SeleniumRequests from start_requests works, but yielding SeleniumRequests from the parse function afterwards results in the following error (I only pasted an extract, as the iterable error prints the same lines over and over):

    return (_set_referer(r) for r in result or ())
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
        for r in iterable:
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
        for r in iterable:
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
        for r in iterable:
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/frontera/contrib/scrapy/schedulers/frontier.py", line 112, in process_spider_output
        frontier_request = response.meta[b'frontier_request']
    KeyError: b'frontier_request'
    

    Very thankful for all hints and examples!
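
    On the "requests created by the crawling strategy" part of the question: as I understand the 0.8-style API, the spider only fetches pages and extracts links, while the crawling strategy decides what gets scheduled via links_extracted() and schedule(). The sketch below reflects my reading of the docs; the import path, method names and schedule() signature are assumptions to verify for your version.

    # Sketch of a crawling strategy that does the scheduling itself; names and
    # signatures are assumptions to check against the Frontera docs.
    from frontera.strategy import BaseCrawlingStrategy

    class ScheduleEverything(BaseCrawlingStrategy):
        def read_seeds(self, stream):
            for url in stream:
                url = url.strip()
                if url:
                    self.schedule(self.create_request(url), score=1.0)

        def links_extracted(self, request, links):
            for link in links:
                self.schedule(link, score=0.5)  # the strategy, not the spider, schedules

        def page_crawled(self, response):
            pass  # update states/scores here

        def request_error(self, request, error):
            pass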

    opened by dkipping 3
Releases(v0.8.1)
  • v0.8.1(Apr 5, 2019)

  • v0.8.0.1(Jul 30, 2018)

  • v0.8.0(Jul 25, 2018)

    This is a major release containing many architectural changes. The goal of these changes is to make development and debugging of the crawling strategy easier. From now on, there is an extensive guide in the documentation on how to write a custom crawling strategy, a single-process mode that makes it much easier to debug a crawling strategy locally, and the old distributed mode for production systems. Starting from this version there is no requirement to set up Apache Kafka or HBase to experiment with crawling strategies on your local computer.

    We also removed unnecessary, rarely used features (the distributed spiders run mode and prioritisation logic in backends) to make Frontera easier to use and understand.

    Here is a (somewhat) full change log:

    • PyPy (2.7.*) support,
    • Redis backend (kudos to @khellan),
    • LRU cache and two cache generations for HBaseStates,
    • Discovery crawling strategy, respecting robots.txt and leveraging sitemaps to discover links faster,
    • Breadth-first and depth-first crawling strategies,
    • new mandatory component in the backend: DomainMetadata,
    • filter_links_extracted method in the crawling strategy API to optimise calls to backends for state data,
    • create_request in the crawling strategy now uses FrontierManager middlewares,
    • support for running multiple batch-generator instances,
    • support for the latest kafka-python,
    • statistics are sent to the message bus from all parts of Frontera,
    • overall reliability improvements,
    • settings for OverusedBuffer,
    • DBWorker was refactored and divided into components (kudos to @vshlapakov),
    • seed addition can now be done using S3,
    • Python 3.7 compatibility.
  • v0.7.1(Feb 9, 2017)

    Thanks to @voith, a problem introduced at the beginning of Python 3 support, when Frontera supported only keys and values stored as bytes in .meta fields, is now solved. Many Scrapy middlewares weren't working, or were working incorrectly. This is still not tested thoroughly, so please report any bugs.

    Other improvements include:

    • batched states refresh in the crawling strategy,
    • proper access to redirects in Scrapy converters,
    • a more readable and simpler OverusedBuffer implementation,
    • fixes to examples, tests and docs.

    Thank you all, for your contributions!

  • v0.7.0(Nov 29, 2016)

    Long-awaited support for the kafka-python 1.x.x client. Frontera is now much more resistant to physical connectivity loss and uses the new asynchronous Kafka API. Other improvements:

    • the SW consumes less CPU (because states are flushed less often),
    • the request creation API in BaseCrawlingStrategy has changed and is now batch-oriented,
    • new article in the docs on cluster setup,
    • an option to disable scoring log consumption in the DB worker,
    • fix of the HBase table drop,
    • improved test coverage.
  • v0.6.0(Aug 18, 2016)

    • Full Python 3 support 👏 👍 🍻 (https://github.com/scrapinghub/frontera/issues/106); all the thanks go to @Preetwinder.
    • canonicalize_url method removed in favor of the w3lib implementation.
    • The whole Request (incl. meta) is propagated to the DB Worker by means of the scoring log (fixes https://github.com/scrapinghub/frontera/issues/131).
    • CRC32 is generated from the hostname the same way on both platforms: Python 2 and 3.
    • HBaseQueue supports delayed requests now: a 'crawl_at' field in meta with a timestamp makes the request available to spiders only after the moment expressed by that timestamp has passed. An important feature for revisiting.
    • The Request object is now persisted in HBaseQueue, allowing requests to be scheduled with specific meta, headers, body and cookies parameters.
    • MESSAGE_BUS_CODEC option, allowing a message bus codec other than the default to be chosen.
    • Strategy worker refactoring to simplify its customization from subclasses.
    • Fixed a bug with the distribution of extracted links over spider log partitions (https://github.com/scrapinghub/frontera/issues/129).
  • v0.5.3(Jul 22, 2016)

  • v0.5.2.3(Jul 18, 2016)

  • v0.5.2.2(Jun 29, 2016)

    • CONSUMER_BATCH_SIZE is removed and two new options are introduced: SPIDER_LOG_CONSUMER_BATCH_SIZE and SCORING_LOG_CONSUMER_BATCH_SIZE.
    • A traceback is written to the log when SIGUSR1 is received in the DBW or SW.
    • Finishing in the SW is fixed for the case when the crawling strategy reports it has finished.
  • v0.5.2.1(Jun 24, 2016)

    Before this release the default compression codec was Snappy. We found out that Snappy support is broken in certain Kafka versions, and issued this release. The latest version has no compression codec enabled by default and allows choosing the compression codec with the KAFKA_CODEC_LEGACY option.

  • v0.5.2(Jun 21, 2016)

  • v0.5.1.1(Jun 2, 2016)

  • v0.5.0(Jun 1, 2016)

    Here is the change log:

    • latest SQLAlchemy unicode-related crashes are fixed,
    • a corporate-website-friendly canonical solver has been added,
    • the crawling strategy concept evolved: added the ability to add an arbitrary URL to the queue (with a transparent state check); FrontierManager is available on construction,
    • strategy worker code was refactored,
    • a default state was introduced for links generated during crawling strategy operation,
    • got rid of Frontera logging in favor of Python native logging,
    • logging system configuration by means of logging.config using a file,
    • partitions can now be assigned to instances from the command line,
    • improved test coverage from @Preetwinder.

    Enjoy!

  • v0.4.2(Apr 22, 2016)

    This release prevents installing kafka-python package versions newer than 0.9.5. Newer versions have significant architectural changes and require Frontera code adaptation and testing. If you are using the Kafka message bus, then you're encouraged to install this update.

  • v0.4.1(Jan 18, 2016)

    • fixed API docs generation on RTD,
    • added body field in Request objects, to support POST-type requests,
    • guidance on how to set MAX_NEXT_REQUESTS and settings docs fixes,
    • fixed colored logging.
  • v0.4.0(Dec 30, 2015)

    A tremendous amount of work was done:

    • distributed-frontera and frontera were merged into a single project, to make it easier to use and understand,
    • Backend was completely redesigned. It now consists of Queue, Metadata and States objects for low-level code, and higher-level Backend implementations for crawling policies,
    • added a definition of run modes: single process, distributed spiders, distributed spiders and backend,
    • the overall distributed concept is now integrated into Frontera, making the difference between using components in single-process and distributed spiders/backend run modes clearer,
    • significantly restructured and augmented documentation, addressing user needs in a more accessible way,
    • much smaller configuration footprint.

    Enjoy this new year release and let us know what you think!

  • v0.3.3(Sep 29, 2015)

    • tldextract is no longer a minimum required dependency,
    • the SQLAlchemy backend now persists headers, cookies and method; a _create_page method was also added to ease customization,
    • canonical solver code (needs documentation),
    • other fixes and improvements.
  • v0.3.2(Jun 19, 2015)

    Now it's possible to configure Frontera from Scrapy settings. The order of precedence for configuration sources is as follows:

    1. settings defined in the module pointed to by FRONTERA_SETTINGS (highest precedence),
    2. settings defined in the Scrapy settings,
    3. default frontier settings.
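
    In practice this means a Scrapy project can point at a dedicated Frontera settings module, or simply rely on its own Scrapy settings as the fallback. A small hedged example (the module path is hypothetical):

    # Scrapy settings.py
    FRONTERA_SETTINGS = 'myproject.frontera_settings'  # highest precedence

    # If FRONTERA_SETTINGS is not set, Frontera falls back to settings defined in
    # this Scrapy settings module, and finally to its built-in defaults.
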
  • v0.3.1(May 25, 2015)

    The main issue solved in this version is that request callbacks and request.meta contents now serialize and deserialize successfully in the SQLAlchemy-based backend. Therefore, the majority of Scrapy extensions shouldn't suffer from losing meta or callbacks when passing through Frontera anymore. Second, there is a hot fix for the cold-start problem, where seeds are added and Scrapy quickly finishes with no further activity. A well-thought-out solution for this will be offered later.

  • v0.3.0(Apr 15, 2015)

    • Frontera is the new name for Crawl Frontier.
    • The signature of the get_next_requests method has changed; it now accepts arbitrary key-value arguments.
    • Overused buffer (subject to removal in the future in favor of the downloader's internal queue).
    • Backend internals became more customizable.
    • The scheduler now asks for new requests when there is free space in the Scrapy downloader queue, instead of waiting for it to be completely empty.
    • Several Frontera middlewares are disabled by default.
  • v0.2.0(Jan 12, 2015)

    • Added documentation (Scrapy Seed Loaders+Tests+Examples)
    • Refactored backend tests
    • Added requests library example
    • Added requests library manager and object converters
    • Added FrontierManagerWrapper
    • Added frontier object converters
    • Fixed script examples for new changes
    • Optional Color logging (only if available)
    • Changed Scrapy frontier and recorder integration to scheduler+middlewares
    • Changed default frontier backend
    • Added comment support to seeds
    • Added doc requirements for RTD build
    • Removed optional dependencies for setup.py and requirements
    • Changed tests to pytest
    • Updated docstrings and documentation
    • Changed frontier components (Backend and Middleware) to abc
    • Modified Scrapy frontier example to use seed loaders
    • Refactored Scrapy Seed loaders
    • Added new fields to Request and Response frontier objects
    • Added ScrapyFrontierManager (Scrapy wrapper for Frontier Manager)
    • Changed frontier core objects (Page/Link to Request/Response)
Owner
Scrapinghub
Turn web content into useful data