Web crawling framework based on asyncio.

Overview


Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.

Requirements

  • Python 3.5+

Installation

pip install gain

pip install uvloop (optional; uvloop does not support Windows)

Usage

  1. Write spider.py:
from gain import Css, Item, Parser, Spider
import aiofiles

class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()

Or use XPathParser:

from gain import Css, Item, Parser, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
               XPathParser('//span[@class="category-name"]/a/@href'),
               XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
               XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
              ]
    proxy = 'https://localhost:1234'

MySpider.run()

You can add a proxy setting to the spider as shown above.

  2. Run python spider.py

  3. Result:

Example

The examples are in the /example/ directory.

Contribution

  • Submit a pull request.
  • Open an issue.
Comments
  • Limit the interval between two requests.


    class MySpider(Spider):
        interval = 5 #seconds
        headers = {'User-Agent': 'Google Spider'}
        start_url = 'https://blog.scrapinghub.com/'
        parsers = [Parser('https://blog.scrapinghub.com/page/\d+/'),
                   Parser('https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]
    

    Then each request waits 5 seconds after the previous one, and the concurrency setting becomes ineffective.
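
    A minimal sketch of how such an interval could be enforced (hypothetical; this is not gain's implementation, and throttled_fetch is an illustrative name):

    import asyncio

    # Serialize requests and sleep `interval` seconds between them; because the
    # lock is held across each request, the concurrency setting stops mattering.
    _interval_lock = asyncio.Lock()

    async def throttled_fetch(fetch, url, spider, session, semaphore):
        async with _interval_lock:
            html = await fetch(url, spider, session, semaphore)
            await asyncio.sleep(getattr(spider, 'interval', 0))
            return html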

    opened by gaojiuli 7
  • Gain Improvements - Stable


    I want to resolve #42, #43, #46, #49

    I disabled the cache test, since we'd need to update CI, start webserver.py somehow, and ensure Redis is available as well.

    I'd gladly hear your thoughts on this so I can make a separate PR for improving the tests.

    Please review and update PyPI after acceptance.

    opened by kwuite 6
  • Some Suggestions


    1. Add a cookies field to the Spider class, because some websites require login.

    2. In the parser.py file there is await item.save(), a function used mostly to store information in a local file (the user can override it). As far as I'm concerned, code like

        async def save(self):
            with open('scrapinghub.txt', 'a+') as f:
                f.writelines(str(self.results) + '\n')
    
    

    is blocking, because local filesystem access is blocking. Therefore the event loop (thread) is blocked. Especially when we fetch an MB-sized page and want to store it in a local file, this slows down the whole application.

    So, is it possible to use aiofiles (file support for asyncio, https://github.com/Tinche/aiofiles), or to use loop.run_in_executor so the save function runs in another thread when the file is large?
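
    A minimal sketch of the run_in_executor variant (my own illustration; the aiofiles approach from the README is the other option this suggestion mentions):

    import asyncio
    from gain import Css, Item

    class Post(Item):
        title = Css('.entry-title')

        async def save(self):
            # Hand the blocking file write to the default thread pool so the
            # event loop keeps serving other requests.
            loop = asyncio.get_event_loop()
            await loop.run_in_executor(None, self._write_results)

        def _write_results(self):
            with open('scrapinghub.txt', 'a+') as f:
                f.writelines(str(self.results) + '\n')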

    opened by wisecsj 4
  • RuntimeError: uvloop does not support Windows at the moment.


    Add a requirement note (uvloop is not available on Windows clients); a packaging sketch follows the pip output below.

    Collecting gain
      Downloading gain-0.1.1.tar.gz
    Collecting uvloop (from gain)
      Downloading uvloop-0.8.0.tar.gz (1.7MB)
        100% |################################| 1.7MB 534kB/s
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "C:\Users\idi\AppData\Local\Temp\pip-build-gjya289j\uvloop\setup.py", line 11, in <module>
            raise RuntimeError('uvloop does not support Windows at the moment')
        RuntimeError: uvloop does not support Windows at the moment
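
    One way to capture this requirement in packaging (a sketch; the dependency list here is illustrative, not gain's actual setup.py) is a PEP 508 environment marker so uvloop is skipped on Windows:

    from setuptools import setup

    setup(
        name='gain',
        install_requires=[
            'aiohttp',
            'pyquery',
            # only installed on non-Windows platforms
            'uvloop; platform_system != "Windows"',
        ],
    )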
    
    opened by sumarsky 3
  • Add some built-in save() methods.


    For example:

    class Post(Item):
        id = Css('title')
        async def save(self):
            await super().save(self.results, type='database')

    class Post(Item):
        id = Css('title')
        async def save(self):
            await super().save(self.results, type='file')
    

    Do you have any suggestions?
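
    A minimal sketch of what such a built-in could look like (hypothetical; gain's Item has no such method today, and the JSON-lines format and default path are my assumptions):

    import json
    import aiofiles

    class SaveMixin:
        """Hypothetical base class providing default save backends for an Item."""

        async def save(self, results=None, type='file', path='results.jsonl'):
            results = results if results is not None else self.results
            if type == 'file':
                # Append one JSON line per item; aiofiles keeps the write non-blocking.
                async with aiofiles.open(path, 'a+') as f:
                    await f.write(json.dumps(results) + '\n')
            elif type == 'database':
                raise NotImplementedError('a database backend would plug in here')

    An Item subclass could then call await super().save(self.results, type='file') as in the snippets above.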

    enhancement 
    opened by gaojiuli 3
  • Add support to handle the value of each field of an item.

    For example:

    from gain import Css, Item, Parser, Spider
    
    
    class Post(Item):
        title = Css('.entry-title')
        content = Css('.entry-content')
    
        async def save(self):
            with open('scrapinghub.txt', 'a+') as f:
                f.writelines(self.results['title'] + '\n')
     # Add function to handle value
    class Post(Item):
        title = Css('.entry-title')
        content = Css('.entry-content')
        
        def clean_title(self,title):
            return title.strip()
    
        async def save(self):
            with open('scrapinghub.txt', 'a+') as f:
                f.writelines(self.results['title'] + '\n')
    
    

    Then in https://github.com/gaojiuli/gain/blob/master/gain/item.py

    class Item(metaclass=ItemType):
        def __init__(self, html):
            self.results = {}
            for name, selector in self.selectors.items():
                value = selector.parse_detail(html)
                # Add function to handle value
                get_field = getattr(self, 'clean_%s' % name, None)
                if get_field:
                    value = get_field(value)
                if value is None:
                    logger.error('Selector "{}" for {} was wrong, please check again'.format(selector.rule, name))
                else:
                    self.results[name] = value
    
    enhancement 
    opened by howie6879 2
  • Add document's own parsing


    Using Firefox 57, you can copy the XPath and CSS paths

    selector.py

    import re
    
    from lxml import etree
    from pyquery import PyQuery as pq
    
    
    class Selector:
        def __init__(self, rule, attr=None,process_func=None):
            self.rule = rule
            self.attr = attr
            self.process_func = process_func
    
        def __str__(self):
            return '{}({})'.format(self.__class__.__name__, self.rule)
    
        def __repr__(self):
            return '{}({})'.format(self.__class__.__name__, self.rule)
    
        def parse_detail(self, html):
            raise NotImplementedError
    
    
    class Css(Selector):
        def parse_detail(self, html):
    
            d = pq(html)
    
            if self.process_func:
                try:
                    if self.rule != 'document':
                        d = d(self.rule)
                    results = self.process_func(d)
                except IndexError:
                    return None
                return results if results else None
    
            if self.attr is None:
                try:
                    return d(self.rule)[0].text
                except IndexError:
                    return None
            return d(self.rule)[0].attr(self.attr, None)
    
    
    class Xpath(Selector):
        def parse_detail(self, html):
            d = etree.HTML(html)
            
            if self.process_func:
                try:
                    if self.rule != 'document':
                        d = d.xpath(self.rule)
                    results = self.process_func(d)
                except IndexError:
                    return None
                return results if results else None
    
            try:
                if self.attr is None:
                    return d.xpath(self.rule)[0].text
                return d.xpath(self.rule)[0].get(self.attr, None)
            except IndexError:
                return None
    
    
    class Regex(Selector):
        def parse_detail(self, html):
            try:
                return re.findall(self.rule, html)[0]
            except IndexError:
                return None
    
    

    test.py (The importance of processing functions)

    In some cases, the crawl rules are complex and need to be processed by the user themselves.

    from gain import Css, Item, Parser, Spider
    
    class Post(Item):
    
        title = Css('html body div#content div.layout.fn-clear div#primary.mainbox.fn-left div.ui-box.l-h div.ui-cnt ul.primary-list.min-video-list.fn-clear li h5 a', process_func=lambda pq:[x.text for x in pq])
        # title is List
        async def save(self):
            if hasattr(self,'title'):
                # title is List
                for x in self.title:
                    print(x)
            else:
                print('error')
    
    class MySpider(Spider):
        concurrency = 5
        encoding = 'gbk'
        headers = {'User-Agent': 'Google Spider'}
        start_url = r'http://www.xinxin46.com/L/lilunpian.html'
        parsers = [Parser('/L/lilunpian\d+\.html',Post)]
    
    
    MySpider.run()
    
    
    opened by allphfa 1
  • add encoding


    request.py

    import asyncio
    
    from .log import logger
    
    try:
        import uvloop
    
        asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    except ImportError:
        pass
    
    
    async def fetch(url, spider, session, semaphore):
        with (await semaphore):
            try:
                if callable(spider.headers):
                    headers = spider.headers()
                else:
                    headers = spider.headers
                # changed here: choose the response encoding from the spider if it defines one
                if hasattr(spider,'encoding'):
                    codec = spider.encoding
                else:
                    codec = 'utf-8'
    
                
                async with session.get(url, headers=headers) as response:
                    if response.status in [200, 201]:
                        data = await response.text(encoding=codec)   # changed here: decode with the spider's encoding
                        return data
                    logger.error('Error: {} {}'.format(url, response.status))
                    return None
            except:
                return None
    
    

    test.py

    class MySpider(Spider):
        concurrency = 5
        encoding = 'gbk'
        start_url = r'http://blog.sciencenet.cn/home.php?mod=space&uid=40109&do=blog&view=me&from=space&page=1'
        parsers = [Parser('http://blog.sciencenet.cn/home.php.*?page=\d+',Post)]
    
    
    
    enhancement help wanted 
    opened by allphfa 1
  • TypeError: write() argument must be str, not dict


    When I ran the Usage code in README.md, a TypeError occurred, pointing at this line: await f.write(self.results). I then changed the line to await f.write(self.results['title']) and everything worked fine. I noticed that in a previous edition of this README, before aiofiles was introduced, this part of the code used the self.results dict the same way, so I'm not sure which is the right way to write the result.
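
    If the goal is to write the whole results dict, one option (my own suggestion, not the README's canonical form) is to serialize it first, since f.write() needs a str:

    import json
    import aiofiles
    from gain import Css, Item

    class Post(Item):
        title = Css('.entry-title')

        async def save(self):
            # json.dumps turns the dict into a string; one JSON line per item.
            async with aiofiles.open('scrapinghub.txt', 'a+') as f:
                await f.write(json.dumps(self.results) + '\n')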

    opened by hyfc 1
  • Css selector add attr not work correctly


    1. I wrote content = Css('.download_button', 'href') in the Post class, but it does not work. Error info:

    Selector ".video-download-button" for url was wrong, please check again

    which means the value is None. In fact, executing d(self.rule)[0].attr(self.attr, None) fails, because the indexed element has no attr attribute...

    2. Currently the selector only picks the first element, because of the code d(self.rule)[0].text. How can I select all elements matching self.rule and get their attributes? (I searched the docs at http://pyquery.readthedocs.io/en/latest/ but found no answer.)
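
    One possible approach (a sketch, not gain's API): iterate over the matches with pyquery's .items(), which yields PyQuery-wrapped elements whose attributes can be read with .attr():

    from pyquery import PyQuery as pq

    html = '<a class="download_button" href="/a.zip">A</a><a class="download_button" href="/b.zip">B</a>'
    d = pq(html)
    # Collect the href of every element matching the rule, not just the first one.
    hrefs = [el.attr('href') for el in d('.download_button').items()]
    print(hrefs)  # ['/a.zip', '/b.zip']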

    opened by wisecsj 1
  • Unescape html contains HTML Entities


    When the fetched HTML contains HTML entities, pyquery does not work correctly, and that's why this pull request exists.

    But, to my surprise, I found you did the same thing in commit df8b4d7da5687e87334723be0834b0b1d6190530, and I am confused that you then deleted that line in commit e3ee18a732b638a64da228ca54a8db45bdb06be2. You also added url = unescape(url) because the code parsers = [Parser('http://blog.sciencenet.cn/home.php\?mod=space&uid=\d+&do=blog&view=me&from=space&amp;page=\d+'), Parser('blog\-\d+\-\d+\.html', Post)] contains HTML entities like &amp;.

    So I am confused about why you did that. If you unescape the whole HTML, not only would pyquery work fine, but there would also be no need to change parsers = [Parser('http://blog.sciencenet.cn/home.php\?mod=space&uid=\d+&do=blog&view=me&from=space&page=\d+'), to parsers = [Parser('http://blog.sciencenet.cn/home.php\?mod=space&uid=\d+&do=blog&view=me&from=space&amp;page=\d+'), Parser('blog\-\d+\-\d+\.html', Post)], since we are used to writing the former.
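
    The suggestion boils down to unescaping the fetched document once, before any parsing, so both pyquery selectors and URL regexes see a plain "&" instead of "&amp;" (a small illustration, not the code of this PR):

    from html import unescape

    raw_html = '<a href="home.php?mod=space&amp;page=2">next</a>'  # example fetched text
    html = unescape(raw_html)
    print(html)  # <a href="home.php?mod=space&page=2">next</a>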

    As an undergraduate student, maybe there are some cases I have not taken into account, or maybe I am wrong.

    By the way, I opened an issue listing my problem. Could you help me out?

    opened by wisecsj 1
  • demo error


    I copied your basic demo code and ran it:

    error.

    Traceback (most recent call last):
      File "b.py", line 23, in <module>
        MySpider.run()
      File "/home/qyy/anaconda3/envs/sanic/lib/python3.6/site-packages/gain/spider.py", line 52, in run
        loop.run_until_complete(cls.init_parse(semaphore))
      File "uvloop/loop.pyx", line 1451, in uvloop.loop.Loop.run_until_complete
      File "/home/qyy/anaconda3/envs/sanic/lib/python3.6/site-packages/gain/spider.py", line 71, in init_parse
        with aiohttp.ClientSession() as session:
      File "/home/qyy/anaconda3/envs/sanic/lib/python3.6/site-packages/aiohttp/client.py", line 956, in __enter__
        raise TypeError("Use async with instead")
    TypeError: Use async with instead
    [2019:04:08 15:05:18] Unclosed client session
    client_session: <aiohttp.client.ClientSession object at 0x7fc4d2eb8e48>
    sys:1: RuntimeWarning: coroutine 'Parser.task' was never awaited
    

    and also:

    from gain import Css, Item, Parser, XPathParser, Spider
    ImportError: cannot import name 'XPathParser'
    

    Thanks.
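
    The traceback ends in aiohttp's "Use async with instead" error, which newer aiohttp versions raise when ClientSession is used as a plain context manager. A minimal illustration of the form the error asks for (not gain's actual code):

    import asyncio
    import aiohttp

    async def main():
        # async with, not `with aiohttp.ClientSession() as session:`
        async with aiohttp.ClientSession() as session:
            async with session.get('https://blog.scrapinghub.com/') as resp:
                print(resp.status)

    asyncio.get_event_loop().run_until_complete(main())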

    opened by Developer27149 0
  • SSL handshake failed on verifying the certificate


    [2018:10:25 16:14:03] Spider started!
    [2018:10:25 16:14:03] Base url: https://blog.scrapinghub.com/
    [2018:10:25 16:14:04] SSL handshake failed on verifying the certificate
    protocol: <uvloop.loop.SSLProtocol object at 0x10729acc0>
    transport: <TCPTransport closed=False reading=False 0x7fe65248c048>
    Traceback (most recent call last):
      File "uvloop/sslproto.pyx", line 609, in uvloop.loop.SSLProtocol._on_handshake_complete
      File "uvloop/sslproto.pyx", line 171, in uvloop.loop._SSLPipe.feed_ssldata
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 763, in do_handshake
        self._sslobj.do_handshake()
    ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)
    [2018:10:25 16:14:04] SSL error errno:1 reason: CERTIFICATE_VERIFY_FAILED
    (the same handshake failure and traceback repeat for each of the following connection attempts)
    [2018:10:25 16:14:06] Item "Post": 0
    [2018:10:25 16:14:06] Requests count: 0
    [2018:10:25 16:14:06] Error count: 0
    [2018:10:25 16:14:06] Time usage: 0:00:03.345306
    [2018:10:25 16:14:06] Spider finished!

    Process finished with exit code 0

    opened by 38602629 1
  • Gain Improvements - Ludaro


    re.findall issue

    I reviewed the tests in this project after experiencing issues with my regex also catching some html as part of the process.

    So I reviewed this test file: https://github.com/gaojiuli/gain/blob/master/tests/test_parse_multiple_items.py and captured the output of abstract_urls.

    Version 0.1.4 of this project catches this as response:

    URLS we found: ['/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/']
    

    re.findall returns what is requested by your regex but not what is matched!
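
    A small illustration of the duplicate behaviour (my own example, not gain's code): re.findall scans the whole document and returns every occurrence the pattern produces, so a page that references "/page/1/" many times yields many duplicates.

    import re

    html = '<a href="/page/1/">a</a> <a href="/page/1/">b</a> <a href="/page/1/">c</a>'
    print(re.findall(r'/page/\d+/', html))  # ['/page/1/', '/page/1/', '/page/1/']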

    Test incorrect

    The base url http://quotes.toscrape.com/ and http://quotes.toscrape.com/page/1 are the same page, and if you look at the HTML you will only find a reference to "/page/2", not to "/page/1". For this reason the test seems to work, but it was actually flawed from the start.


    re.match

    I rewrote the abstract_urls function to:

        def abstract_urls(self, html, base_url):
            _urls = []
    
            try:
                document = lxml.html.fromstring(html)
                document_domain = urlparse.urlparse(base_url).netloc
                
                for (al, attr, link, pos) in document.iterlinks():
                    link = re.sub("#.*", "", link or "")
    
                    if not link:
                        continue
    
                    _urls.append(link)
            except (etree.XMLSyntaxError, etree.ParserError) as e:
                logger.error("While parsing the html for {} we received the following error {}.".format(base_url, e))
    
            # Cleanup urls
            r = re.compile(self.rule)
            urls = list(filter(r.match, _urls))
    
            return urls
    

    and now this is the result of abstract_urls:

    ['/static/bootstrap.min.css', '/static/main.css', '/', '/login', '/author/Albert-Einstein', '/tag/change/page/1/', '/tag/deep-thoughts/page/1/', '/tag/thinking/page/1/', '/tag/world/page/1/', '/author/J-K-Rowling', '/tag/abilities/page/1/', '/tag/choices/page/1/', '/author/Albert-Einstein', '/tag/inspirational/page/1/', '/tag/life/page/1/', '/tag/live/page/1/', '/tag/miracle/page/1/', '/tag/miracles/page/1/', '/author/Jane-Austen', '/tag/aliteracy/page/1/', '/tag/books/page/1/', '/tag/classic/page/1/', '/tag/humor/page/1/', '/author/Marilyn-Monroe', '/tag/be-yourself/page/1/', '/tag/inspirational/page/1/', '/author/Albert-Einstein', '/tag/adulthood/page/1/', '/tag/success/page/1/', '/tag/value/page/1/', '/author/Andre-Gide', '/tag/life/page/1/', '/tag/love/page/1/', '/author/Thomas-A-Edison', '/tag/edison/page/1/', '/tag/failure/page/1/', '/tag/inspirational/page/1/', '/tag/paraphrased/page/1/', '/author/Eleanor-Roosevelt', '/tag/misattributed-eleanor-roosevelt/page/1/', '/author/Steve-Martin', '/tag/humor/page/1/', '/tag/obvious/page/1/', '/tag/simile/page/1/', '/page/2/', '/tag/love/', '/tag/inspirational/', '/tag/life/', '/tag/humor/', '/tag/books/', '/tag/reading/', '/tag/friendship/', '/tag/friends/', '/tag/truth/', '/tag/simile/', 'https://www.goodreads.com/quotes', 'https://scrapinghub.com']
    

    This test: tests/test_parse_multiple_items.py now fails as it should.

    opened by kwuite 5
  • The ``sciencenet_spider.py`` example does not (seem to) work for python 3.6


    I copied the examples/sciencenet_spider.py example and tried to run it using python 3.6 - but:

    python sciencenet_spider.py
    [2018:04:14 22:21:26] Spider started!
    [2018:04:14 22:21:26] Using selector: KqueueSelector
    [2018:04:14 22:21:26] Base url: http://blog.sciencenet.cn/
    [2018:04:14 22:21:26] Item "Post": 0
    [2018:04:14 22:21:26] Requests count: 0
    [2018:04:14 22:21:26] Error count: 0
    [2018:04:14 22:21:26] Time usage: 0:00:00.001127
    [2018:04:14 22:21:26] Spider finished!
    Traceback (most recent call last):
      File "sciencenet_spider.py", line 19, in <module>
        MySpider.run()
      File "/Users/endafarrell/anaconda/anaconda3/lib/python3.6/site-packages/gain/spider.py", line 52, in run
        loop.run_until_complete(cls.init_parse(semaphore))
      File "/Users/endafarrell/anaconda/anaconda3/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
        return future.result()
      File "/Users/endafarrell/anaconda/anaconda3/lib/python3.6/site-packages/gain/spider.py", line 71, in init_parse
        with aiohttp.ClientSession() as session:
      File "/Users/endafarrell/anaconda/anaconda3/lib/python3.6/site-packages/aiohttp/client.py", line 746, in __enter__
        raise TypeError("Use async with instead")
    TypeError: Use async with instead
    sys:1: RuntimeWarning: coroutine 'Parser.task' was never awaited
    [2018:04:14 22:21:26] Unclosed client session
    client_session: <aiohttp.client.ClientSession object at 0x105b07cf8>
    

    My python is

    python
    Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33)
    [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
    

    and I have:

    pip list | grep gain
    gain                               0.1.4
    

    I installed gain using:

    pip install gain
    

    Any ideas?

    opened by endafarrell 5
Owner
Jiuli Gao
Python Developer.
Scrapy uses Request and Response objects for crawling web sites.

Requests and Responses¶ Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and p

Md Rashidul Islam 1 Nov 3, 2021
A high-level distributed crawling framework.

Cola: high-level distributed crawling framework Overview Cola is a high-level distributed crawling framework, used to crawl pages and extract structur

Xuye (Chris) Qin 1.5k Jan 4, 2023
A high-level distributed crawling framework.

Cola: high-level distributed crawling framework Overview Cola is a high-level distributed crawling framework, used to crawl pages and extract structur

Xuye (Chris) Qin 1.5k Dec 24, 2022
Amazon scraper using scrapy, a python framework for crawling websites.

#Amazon-web-scraper This is a python program, which use scrapy python framework to crawl all pages of the product and scrap products data. This progra

Akash Das 1 Dec 26, 2021
Async Python 3.6+ web scraping micro-framework based on asyncio

Ruia Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster. Overview Ruia is an async web scraping micro-frame

howie.hu 1.6k Jan 1, 2023
Python script for crawling ResearchGate.net papers✨⭐️📎

ResearchGate Crawler Python script for crawling ResearchGate.net papers About the script This code start crawling process by urls in start.txt and giv

Mohammad Sadegh Salimi 4 Aug 30, 2022
robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

RoboBrowser: Your friendly neighborhood web scraper Homepage: http://robobrowser.readthedocs.org/ RoboBrowser is a simple, Pythonic library for browsi

Joshua Carp 3.7k Dec 27, 2022
Web Scraping Framework

Grab Framework Documentation Installation $ pip install -U grab See details about installing Grab on different platforms here http://docs.grablib.

null 2.3k Jan 4, 2023
Transistor, a Python web scraping framework for intelligent use cases.

Web data collection and storage for intelligent use cases. transistor About The web is full of data. Transistor is a web scraping framework for collec

BOM Quote Manufacturing 212 Nov 5, 2022
A simple django-rest-framework api using web scraping

Apicell You can use this api to search in google, bing, pypi and subscene and get results Method : POST Parameter : query Example import request url =

Hesam N 1 Dec 19, 2021
This is a web scraper, using Python framework Scrapy, built to extract data from the Deals of the Day section on Mercado Livre website.

Deals of the Day This is a web scraper, using the Python framework Scrapy, built to extract data such as price and product name from the Deals of the

David Souza 1 Jan 12, 2022
Amazon web scraping using Scrapy Framework

Amazon-web-scraping-using-Scrapy-Framework Scrapy Scrapy is an application framework for crawling web sites and extracting structured data which can b

Sejal Rajput 1 Jan 25, 2022
Dude is a very simple framework for writing web scrapers using Python decorators

Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.

Ronie Martinez 326 Dec 15, 2022
✂️🕷️ Spider-Cut is a Network Mapper Framework (NMAP Framework)

Spider-Cut is a Network Mapper Framework (NMAP Framework) Installation | Usage | Creators | Donate Installation # Kali Linux | WSL

XforWorks 3 Mar 7, 2022
Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc)

Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc).

Amit 6 Aug 26, 2022
Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

Gerapy Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js. Documentation Documentation

Gerapy 2.9k Jan 3, 2023
PyQuery-based scraping micro-framework.

demiurge PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x. Documentation: http://demiurge.readthedocs.org Installing demiurge $ pip

Matias Bordese 109 Jul 20, 2022
This Spider/Bot is developed using Python and based on Scrapy Framework to Fetch some items information from Amazon

- Hello, This Project Contains Amazon Web-bot. - I've developed this bot for fething some items information on Amazon. - Scrapy Framework in Python is

Khaled Tofailieh 4 Feb 13, 2022
A Powerful Spider(Web Crawler) System in Python.

pyspider A Powerful Spider(Web Crawler) System in Python. Write script in Python Powerful WebUI with script editor, task monitor, project manager and

Roy Binux 15.7k Jan 4, 2023