Web crawling framework based on asyncio.

Overview


Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.

Requirements

  • Python 3.5+

Installation

pip install gain

pip install uvloop (optional; uvloop does not support Windows)

Usage

  1. Write spider.py:
from gain import Css, Item, Parser, Spider
import aiofiles

class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()

Or use XPathParser:

from gain import Css, Item, Parser, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    parsers = [
               XPathParser('//span[@class="category-name"]/a/@href'),
               XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
               XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
              ]
    proxy = 'https://localhost:1234'

MySpider.run()

You can add a proxy setting to the spider as shown above.

  2. Run python spider.py

  3. Result:

Example

The examples are in the /example/ directory.

Contribution

  • Submit a pull request.
  • Open an issue.
Comments
  • Limit the interval between two requests.


    class MySpider(Spider):
        interval = 5 #seconds
        headers = {'User-Agent': 'Google Spider'}
        start_url = 'https://blog.scrapinghub.com/'
        parsers = [Parser('https://blog.scrapinghub.com/page/\d+/'),
                   Parser('https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]
    

    Then each request waits 5 seconds after the previous one, and the concurrency setting becomes ineffective.
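
    A minimal sketch of how such an interval could be enforced (hypothetical; this is not gain's implementation, and throttled_fetch is an illustrative name):

    import asyncio

    # Serialize requests and sleep `interval` seconds between them; because the
    # lock is held across each request, the concurrency setting stops mattering.
    _interval_lock = asyncio.Lock()

    async def throttled_fetch(fetch, url, spider, session, semaphore):
        async with _interval_lock:
            html = await fetch(url, spider, session, semaphore)
            await asyncio.sleep(getattr(spider, 'interval', 0))
            return html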

    opened by gaojiuli 7
  • Gain Improvements - Stable


    I want to resolve #42, #43, #46, #49

    I disabled the cache test, since we'd need to update CI, start webserver.py somehow, and ensure Redis is available as well.

    I'd gladly hear your thoughts on this so I can make a separate PR for improving the tests.

    Please review and update PyPI after acceptance.

    opened by kwuite 6
  • Some Suggestions


    1. Add a cookies field to the Spider class, because some websites require login.

    2. In the parser.py file there is await item.save(), a function used mostly to store information in a local file (the user can override it). As far as I'm concerned, code like

        async def save(self):
            with open('scrapinghub.txt', 'a+') as f:
                f.writelines(str(self.results) + '\n')
    
    

    is blocking, because local filesystem access is blocking. Therefore the event loop (thread) is blocked. Especially when we fetch an MB-sized page and want to store it in a local file, this slows down the whole application.

    So, is it possible to use aiofiles (file support for asyncio, https://github.com/Tinche/aiofiles), or to use loop.run_in_executor so the save function runs in another thread when the file is large?
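
    A minimal sketch of the run_in_executor variant (my own illustration; the aiofiles approach from the README is the other option this suggestion mentions):

    import asyncio
    from gain import Css, Item

    class Post(Item):
        title = Css('.entry-title')

        async def save(self):
            # Hand the blocking file write to the default thread pool so the
            # event loop keeps serving other requests.
            loop = asyncio.get_event_loop()
            await loop.run_in_executor(None, self._write_results)

        def _write_results(self):
            with open('scrapinghub.txt', 'a+') as f:
                f.writelines(str(self.results) + '\n')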

    opened by wisecsj 4
  • RuntimeError: uvloop does not support Windows at the moment.


    Add a requirement note (uvloop is not available on Windows clients); a packaging sketch follows the pip output below.

    Collecting gain
      Downloading gain-0.1.1.tar.gz
    Collecting uvloop (from gain)
      Downloading uvloop-0.8.0.tar.gz (1.7MB)
        100% |################################| 1.7MB 534kB/s
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "C:\Users\idi\AppData\Local\Temp\pip-build-gjya289j\uvloop\setup.py", line 11, in <module>
            raise RuntimeError('uvloop does not support Windows at the moment')
        RuntimeError: uvloop does not support Windows at the moment
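
    One way to capture this requirement in packaging (a sketch; the dependency list here is illustrative, not gain's actual setup.py) is a PEP 508 environment marker so uvloop is skipped on Windows:

    from setuptools import setup

    setup(
        name='gain',
        install_requires=[
            'aiohttp',
            'pyquery',
            # only installed on non-Windows platforms
            'uvloop; platform_system != "Windows"',
        ],
    )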
    
    opened by sumarsky 3
  • Add some built-in save() methods.


    For example:

    class Post(Item):
        id = Css('title')
        async def save(self):
            await super().save(self.results, type='database')

    class Post(Item):
        id = Css('title')
        async def save(self):
            await super().save(self.results, type='file')
    

    Do you have any suggestions?
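
    A minimal sketch of what such a built-in could look like (hypothetical; gain's Item has no such method today, and the JSON-lines format and default path are my assumptions):

    import json
    import aiofiles

    class SaveMixin:
        """Hypothetical base class providing default save backends for an Item."""

        async def save(self, results=None, type='file', path='results.jsonl'):
            results = results if results is not None else self.results
            if type == 'file':
                # Append one JSON line per item; aiofiles keeps the write non-blocking.
                async with aiofiles.open(path, 'a+') as f:
                    await f.write(json.dumps(results) + '\n')
            elif type == 'database':
                raise NotImplementedError('a database backend would plug in here')

    An Item subclass could then call await super().save(self.results, type='file') as in the snippets above.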

    enhancement 
    opened by gaojiuli 3
  • Add support to handle the value of each field of an item.

    For example:

    from gain import Css, Item, Parser, Spider
    
    
    class Post(Item):
        title = Css('.entry-title')
        content = Css('.entry-content')
    
        async def save(self):
            with open('scrapinghub.txt', 'a+') as f:
                f.writelines(self.results['title'] + '\n')
     # Add function to handle value
    class Post(Item):
        title = Css('.entry-title')
        content = Css('.entry-content')
        
        def clean_title(self,title):
            return title.strip()
    
        async def save(self):
            with open('scrapinghub.txt', 'a+') as f:
                f.writelines(self.results['title'] + '\n')
    
    

    Then in https://github.com/gaojiuli/gain/blob/master/gain/item.py

    class Item(metaclass=ItemType):
        def __init__(self, html):
            self.results = {}
            for name, selector in self.selectors.items():
                value = selector.parse_detail(html)
                # Add function to handle value
                get_field = getattr(self, 'clean_%s' % name, None)
                if get_field:
                    value = get_field(value)
                if value is None:
                    logger.error('Selector "{}" for {} was wrong, please check again'.format(selector.rule, name))
                else:
                    self.results[name] = value
    
    enhancement 
    opened by howie6879 2
  • Add document's own parsing


    Using Firefox 57, you can copy the XPath and CSS paths

    selector.py

    import re
    
    from lxml import etree
    from pyquery import PyQuery as pq
    
    
    class Selector:
        def __init__(self, rule, attr=None,process_func=None):
            self.rule = rule
            self.attr = attr
            self.process_func = process_func
    
        def __str__(self):
            return '{}({})'.format(self.__class__.__name__, self.rule)
    
        def __repr__(self):
            return '{}({})'.format(self.__class__.__name__, self.rule)
    
        def parse_detail(self, html):
            raise NotImplementedError
    
    
    class Css(Selector):
        def parse_detail(self, html):
    
            d = pq(html)
    
            if self.process_func:
                try:
                    if self.rule != 'document':
                        d = d(self.rule)
                    results = self.process_func(d)
                except IndexError:
                    return None
                return results if results else None
    
            if self.attr is None:
                try:
                    return d(self.rule)[0].text
                except IndexError:
                    return None
            return d(self.rule)[0].attr(self.attr, None)
    
    
    class Xpath(Selector):
        def parse_detail(self, html):
            d = etree.HTML(html)
            
            if self.process_func:
                try:
                    if self.rule != 'document':
                        d = d.xpath(self.rule)
                    results = self.process_func(d)
                except IndexError:
                    return None
                return results if results else None
    
            try:
                if self.attr is None:
                    return d.xpath(self.rule)[0].text
                return d.xpath(self.rule)[0].get(self.attr, None)
            except IndexError:
                return None
    
    
    class Regex(Selector):
        def parse_detail(self, html):
            try:
                return re.findall(self.rule, html)[0]
            except IndexError:
                return None
    
    

    test.py (The importance of processing functions)

    In some cases, the crawl rules are complex and need to be processed by the user themselves.

    from gain import Css, Item, Parser, Spider
    
    class Post(Item):
    
        title = Css('html body div#content div.layout.fn-clear div#primary.mainbox.fn-left div.ui-box.l-h div.ui-cnt ul.primary-list.min-video-list.fn-clear li h5 a', process_func=lambda pq:[x.text for x in pq])
        # title is List
        async def save(self):
            if hasattr(self,'title'):
                # title is List
                for x in self.title:
                    print(x)
            else:
                print('error')
    
    class MySpider(Spider):
        concurrency = 5
        encoding = 'gbk'
        headers = {'User-Agent': 'Google Spider'}
        start_url = r'http://www.xinxin46.com/L/lilunpian.html'
        parsers = [Parser('/L/lilunpian\d+\.html',Post)]
    
    
    MySpider.run()
    
    
    opened by allphfa 1
  • add encoding


    request.py

    import asyncio
    
    from .log import logger
    
    try:
        import uvloop
    
        asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    except ImportError:
        pass
    
    
    async def fetch(url, spider, session, semaphore):
        with (await semaphore):
            try:
                if callable(spider.headers):
                    headers = spider.headers()
                else:
                    headers = spider.headers
                # changed here: choose the response encoding from the spider if it defines one
                if hasattr(spider,'encoding'):
                    codec = spider.encoding
                else:
                    codec = 'utf-8'
    
                
                async with session.get(url, headers=headers) as response:
                    if response.status in [200, 201]:
                        data = await response.text(encoding=codec)   # changed here: decode with the spider's encoding
                        return data
                    logger.error('Error: {} {}'.format(url, response.status))
                    return None
            except:
                return None
    
    

    test.py

    class MySpider(Spider):
        concurrency = 5
        encoding = 'gbk'
        start_url = r'http://blog.sciencenet.cn/home.php?mod=space&uid=40109&do=blog&view=me&from=space&page=1'
        parsers = [Parser('http://blog.sciencenet.cn/home.php.*?page=\d+',Post)]
    
    
    
    enhancement help wanted 
    opened by allphfa 1
  • TypeError: write() argument must be str, not dict


    When I ran the Usage code in README.md, a TypeError occurred, pointing at this line: await f.write(self.results). I then changed the line to await f.write(self.results['title']) and everything worked fine. I noticed that in a previous edition of this README, before aiofiles was introduced, this part of the code used the self.results dict the same way, so I'm not sure which is the right way to write the result.
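
    If the goal is to write the whole results dict, one option (my own suggestion, not the README's canonical form) is to serialize it first, since f.write() needs a str:

    import json
    import aiofiles
    from gain import Css, Item

    class Post(Item):
        title = Css('.entry-title')

        async def save(self):
            # json.dumps turns the dict into a string; one JSON line per item.
            async with aiofiles.open('scrapinghub.txt', 'a+') as f:
                await f.write(json.dumps(self.results) + '\n')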

    opened by hyfc 1
  • Css selector add attr not work correctly


    1. I wrote content = Css('.download_button', 'href') in the Post class, but it does not work. Error info:

    Selector ".video-download-button" for url was wrong, please check again

    which means the value is None. In fact, executing d(self.rule)[0].attr(self.attr, None) fails, because the indexed element has no attr attribute...

    2. Currently the selector only picks the first element, because of the code d(self.rule)[0].text. How can I select all elements matching self.rule and get their attributes? (I searched the docs at http://pyquery.readthedocs.io/en/latest/ but found no answer.)
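
    One possible approach (a sketch, not gain's API): iterate over the matches with pyquery's .items(), which yields PyQuery-wrapped elements whose attributes can be read with .attr():

    from pyquery import PyQuery as pq

    html = '<a class="download_button" href="/a.zip">A</a><a class="download_button" href="/b.zip">B</a>'
    d = pq(html)
    # Collect the href of every element matching the rule, not just the first one.
    hrefs = [el.attr('href') for el in d('.download_button').items()]
    print(hrefs)  # ['/a.zip', '/b.zip']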

    opened by wisecsj 1
  • Unescape html contains HTML Entities


    When the fetched HTML contains HTML entities, pyquery does not work correctly, and that's why this pull request exists.

    But, to my surprise, I found you did the same thing in commit df8b4d7da5687e87334723be0834b0b1d6190530, and I am confused that you then deleted that line in commit e3ee18a732b638a64da228ca54a8db45bdb06be2. You also added url = unescape(url) because the code parsers = [Parser('http://blog.sciencenet.cn/home.php\?mod=space&uid=\d+&do=blog&view=me&from=space&amp;page=\d+'), Parser('blog\-\d+\-\d+\.html', Post)] contains HTML entities like &amp;.

    So I am confused about why you did that. If you unescape the whole HTML, not only would pyquery work fine, but there would also be no need to change parsers = [Parser('http://blog.sciencenet.cn/home.php\?mod=space&uid=\d+&do=blog&view=me&from=space&page=\d+'), to parsers = [Parser('http://blog.sciencenet.cn/home.php\?mod=space&uid=\d+&do=blog&view=me&from=space&amp;page=\d+'), Parser('blog\-\d+\-\d+\.html', Post)], since we are used to writing the former.
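
    The suggestion boils down to unescaping the fetched document once, before any parsing, so both pyquery selectors and URL regexes see a plain "&" instead of "&amp;" (a small illustration, not the code of this PR):

    from html import unescape

    raw_html = '<a href="home.php?mod=space&amp;page=2">next</a>'  # example fetched text
    html = unescape(raw_html)
    print(html)  # <a href="home.php?mod=space&page=2">next</a>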

    As an undergraduate student, maybe there are some cases I have not taken into account, or maybe I am wrong.

    By the way, I opened an issue listing my problem. Could you help me out?

    opened by wisecsj 1
  • demo error


    I copied your basic demo code and ran it:

    error.

    Traceback (most recent call last):
      File "b.py", line 23, in <module>
        MySpider.run()
      File "/home/qyy/anaconda3/envs/sanic/lib/python3.6/site-packages/gain/spider.py", line 52, in run
        loop.run_until_complete(cls.init_parse(semaphore))
      File "uvloop/loop.pyx", line 1451, in uvloop.loop.Loop.run_until_complete
      File "/home/qyy/anaconda3/envs/sanic/lib/python3.6/site-packages/gain/spider.py", line 71, in init_parse
        with aiohttp.ClientSession() as session:
      File "/home/qyy/anaconda3/envs/sanic/lib/python3.6/site-packages/aiohttp/client.py", line 956, in __enter__
        raise TypeError("Use async with instead")
    TypeError: Use async with instead
    [2019:04:08 15:05:18] Unclosed client session
    client_session: <aiohttp.client.ClientSession object at 0x7fc4d2eb8e48>
    sys:1: RuntimeWarning: coroutine 'Parser.task' was never awaited
    

    and also:

    from gain import Css, Item, Parser, XPathParser, Spider
    ImportError: cannot import name 'XPathParser'
    

    Thanks.
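
    The traceback ends in aiohttp's "Use async with instead" error, which newer aiohttp versions raise when ClientSession is used as a plain context manager. A minimal illustration of the form the error asks for (not gain's actual code):

    import asyncio
    import aiohttp

    async def main():
        # async with, not `with aiohttp.ClientSession() as session:`
        async with aiohttp.ClientSession() as session:
            async with session.get('https://blog.scrapinghub.com/') as resp:
                print(resp.status)

    asyncio.get_event_loop().run_until_complete(main())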

    opened by Developer27149 0
  • SSL handshake failed on verifying the certificate


    [2018:10:25 16:14:03] Spider started!
    [2018:10:25 16:14:03] Base url: https://blog.scrapinghub.com/
    [2018:10:25 16:14:04] SSL handshake failed on verifying the certificate
    protocol: <uvloop.loop.SSLProtocol object at 0x10729acc0>
    transport: <TCPTransport closed=False reading=False 0x7fe65248c048>
    Traceback (most recent call last):
      File "uvloop/sslproto.pyx", line 609, in uvloop.loop.SSLProtocol._on_handshake_complete
      File "uvloop/sslproto.pyx", line 171, in uvloop.loop._SSLPipe.feed_ssldata
      File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 763, in do_handshake
        self._sslobj.do_handshake()
    ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)
    [2018:10:25 16:14:04] SSL error errno:1 reason: CERTIFICATE_VERIFY_FAILED
    (the same handshake failure and traceback repeat for each of the following connection attempts)
    [2018:10:25 16:14:06] Item "Post": 0
    [2018:10:25 16:14:06] Requests count: 0
    [2018:10:25 16:14:06] Error count: 0
    [2018:10:25 16:14:06] Time usage: 0:00:03.345306
    [2018:10:25 16:14:06] Spider finished!

    Process finished with exit code 0

    opened by 38602629 1
  • Gain Improvements - Ludaro


    re.findall issue

    I reviewed the tests in this project after experiencing issues with my regex also catching some html as part of the process.

    So I reviewed this test file: https://github.com/gaojiuli/gain/blob/master/tests/test_parse_multiple_items.py and captured the output of abstract_urls.

    Version 0.1.4 of this project catches this as response:

    URLS we found: ['/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/', '/page/1/']
    

    re.findall returns what is requested by your regex but not what is matched!
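
    A small illustration of the duplicate behaviour (my own example, not gain's code): re.findall scans the whole document and returns every occurrence the pattern produces, so a page that references "/page/1/" many times yields many duplicates.

    import re

    html = '<a href="/page/1/">a</a> <a href="/page/1/">b</a> <a href="/page/1/">c</a>'
    print(re.findall(r'/page/\d+/', html))  # ['/page/1/', '/page/1/', '/page/1/']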

    Test incorrect

    The base url http://quotes.toscrape.com/ and http://quotes.toscrape.com/page/1 are the same page, and if you look at the HTML you will only find a reference to "/page/2", not to "/page/1". For this reason the test seems to work, but it was actually flawed from the start.


    re.match

    I rewrote the abstract_urls function to:

        def abstract_urls(self, html, base_url):
            _urls = []
    
            try:
                document = lxml.html.fromstring(html)
                document_domain = urlparse.urlparse(base_url).netloc
                
                for (al, attr, link, pos) in document.iterlinks():
                    link = re.sub("#.*", "", link or "")
    
                    if not link:
                        continue
    
                    _urls.append(link)
            except (etree.XMLSyntaxError, etree.ParserError) as e:
                logger.error("While parsing the html for {} we received the following error {}.".format(base_url, e))
    
            # Cleanup urls
            r = re.compile(self.rule)
            urls = list(filter(r.match, _urls))
    
            return urls
    

    and now this is the result of abstract_urls:

    ['/static/bootstrap.min.css', '/static/main.css', '/', '/login', '/author/Albert-Einstein', '/tag/change/page/1/', '/tag/deep-thoughts/page/1/', '/tag/thinking/page/1/', '/tag/world/page/1/', '/author/J-K-Rowling', '/tag/abilities/page/1/', '/tag/choices/page/1/', '/author/Albert-Einstein', '/tag/inspirational/page/1/', '/tag/life/page/1/', '/tag/live/page/1/', '/tag/miracle/page/1/', '/tag/miracles/page/1/', '/author/Jane-Austen', '/tag/aliteracy/page/1/', '/tag/books/page/1/', '/tag/classic/page/1/', '/tag/humor/page/1/', '/author/Marilyn-Monroe', '/tag/be-yourself/page/1/', '/tag/inspirational/page/1/', '/author/Albert-Einstein', '/tag/adulthood/page/1/', '/tag/success/page/1/', '/tag/value/page/1/', '/author/Andre-Gide', '/tag/life/page/1/', '/tag/love/page/1/', '/author/Thomas-A-Edison', '/tag/edison/page/1/', '/tag/failure/page/1/', '/tag/inspirational/page/1/', '/tag/paraphrased/page/1/', '/author/Eleanor-Roosevelt', '/tag/misattributed-eleanor-roosevelt/page/1/', '/author/Steve-Martin', '/tag/humor/page/1/', '/tag/obvious/page/1/', '/tag/simile/page/1/', '/page/2/', '/tag/love/', '/tag/inspirational/', '/tag/life/', '/tag/humor/', '/tag/books/', '/tag/reading/', '/tag/friendship/', '/tag/friends/', '/tag/truth/', '/tag/simile/', 'https://www.goodreads.com/quotes', 'https://scrapinghub.com']
    

    This test: tests/test_parse_multiple_items.py now fails as it should.

    opened by kwuite 5
  • The ``sciencenet_spider.py`` example does not (seem to) work for python 3.6


    I copied the examples/sciencenet_spider.py example and tried to run it using python 3.6 - but:

    python sciencenet_spider.py
    [2018:04:14 22:21:26] Spider started!
    [2018:04:14 22:21:26] Using selector: KqueueSelector
    [2018:04:14 22:21:26] Base url: http://blog.sciencenet.cn/
    [2018:04:14 22:21:26] Item "Post": 0
    [2018:04:14 22:21:26] Requests count: 0
    [2018:04:14 22:21:26] Error count: 0
    [2018:04:14 22:21:26] Time usage: 0:00:00.001127
    [2018:04:14 22:21:26] Spider finished!
    Traceback (most recent call last):
      File "sciencenet_spider.py", line 19, in <module>
        MySpider.run()
      File "/Users/endafarrell/anaconda/anaconda3/lib/python3.6/site-packages/gain/spider.py", line 52, in run
        loop.run_until_complete(cls.init_parse(semaphore))
      File "/Users/endafarrell/anaconda/anaconda3/lib/python3.6/asyncio/base_events.py", line 467, in run_until_complete
        return future.result()
      File "/Users/endafarrell/anaconda/anaconda3/lib/python3.6/site-packages/gain/spider.py", line 71, in init_parse
        with aiohttp.ClientSession() as session:
      File "/Users/endafarrell/anaconda/anaconda3/lib/python3.6/site-packages/aiohttp/client.py", line 746, in __enter__
        raise TypeError("Use async with instead")
    TypeError: Use async with instead
    sys:1: RuntimeWarning: coroutine 'Parser.task' was never awaited
    [2018:04:14 22:21:26] Unclosed client session
    client_session: <aiohttp.client.ClientSession object at 0x105b07cf8>
    

    My python is

    python
    Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 12:04:33)
    [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
    

    and I have:

    pip list | grep gain
    gain                               0.1.4
    

    I installed gain using:

    pip install gain
    

    Any ideas?

    opened by endafarrell 5
Owner
Jiuli Gao
Python Developer.
Scrapy uses Request and Response objects for crawling web sites.

Requests and Responses¶ Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and p

Md Rashidul Islam 1 Nov 3, 2021
A high-level distributed crawling framework.

Cola: high-level distributed crawling framework Overview Cola is a high-level distributed crawling framework, used to crawl pages and extract structur

Xuye (Chris) Qin 1.5k Jan 4, 2023
A high-level distributed crawling framework.

Cola: high-level distributed crawling framework Overview Cola is a high-level distributed crawling framework, used to crawl pages and extract structur

Xuye (Chris) Qin 1.5k Dec 24, 2022
Amazon scraper using scrapy, a python framework for crawling websites.

#Amazon-web-scraper This is a python program, which use scrapy python framework to crawl all pages of the product and scrap products data. This progra

Akash Das 1 Dec 26, 2021
Async Python 3.6+ web scraping micro-framework based on asyncio

Ruia Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster. Overview Ruia is an async web scraping micro-frame

howie.hu 1.6k Jan 1, 2023
Python script for crawling ResearchGate.net papers✨⭐️📎

ResearchGate Crawler Python script for crawling ResearchGate.net papers About the script This code start crawling process by urls in start.txt and giv

Mohammad Sadegh Salimi 4 Aug 30, 2022
robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

RoboBrowser: Your friendly neighborhood web scraper Homepage: http://robobrowser.readthedocs.org/ RoboBrowser is a simple, Pythonic library for browsi

Joshua Carp 3.7k Dec 27, 2022
Web Scraping Framework

Grab Framework Documentation Installation $ pip install -U grab See details about installing Grab on different platforms here http://docs.grablib.

null 2.3k Jan 4, 2023
Transistor, a Python web scraping framework for intelligent use cases.

Web data collection and storage for intelligent use cases. transistor About The web is full of data. Transistor is a web scraping framework for collec

BOM Quote Manufacturing 212 Nov 5, 2022
A simple django-rest-framework api using web scraping

Apicell You can use this api to search in google, bing, pypi and subscene and get results Method : POST Parameter : query Example import request url =

Hesam N 1 Dec 19, 2021
This is a web scraper, using Python framework Scrapy, built to extract data from the Deals of the Day section on Mercado Livre website.

Deals of the Day This is a web scraper, using the Python framework Scrapy, built to extract data such as price and product name from the Deals of the

David Souza 1 Jan 12, 2022
Amazon web scraping using Scrapy Framework

Amazon-web-scraping-using-Scrapy-Framework Scrapy Scrapy is an application framework for crawling web sites and extracting structured data which can b

Sejal Rajput 1 Jan 25, 2022
Dude is a very simple framework for writing web scrapers using Python decorators

Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-learn syntax.

Ronie Martinez 326 Dec 15, 2022
✂️🕷️ Spider-Cut is a Network Mapper Framework (NMAP Framework)

Spider-Cut is a Network Mapper Framework (NMAP Framework) Installation | Usage | Creators | Donate Installation # Kali Linux | WSL

XforWorks 3 Mar 7, 2022
Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc)

Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc).

Amit 6 Aug 26, 2022
Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

Gerapy Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js. Documentation Documentation

Gerapy 2.9k Jan 3, 2023
PyQuery-based scraping micro-framework.

demiurge PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x. Documentation: http://demiurge.readthedocs.org Installing demiurge $ pip

Matias Bordese 109 Jul 20, 2022
This Spider/Bot is developed using Python and based on Scrapy Framework to Fetch some items information from Amazon

- Hello, This Project Contains Amazon Web-bot. - I've developed this bot for fething some items information on Amazon. - Scrapy Framework in Python is

Khaled Tofailieh 4 Feb 13, 2022
A Powerful Spider(Web Crawler) System in Python.

pyspider A Powerful Spider(Web Crawler) System in Python. Write script in Python Powerful WebUI with script editor, task monitor, project manager and

Roy Binux 15.7k Jan 4, 2023