A distributed crawler for Weibo, built with Celery and Requests.

Overview


Project Highlights

  • Comprehensive features: user profile crawling, incremental crawling of search results for specified keywords, crawling of all original weibos on a specified user's homepage, comment crawling, and repost-relationship crawling.
  • Comprehensive data: the PC site exposes richer data than the mobile site. Compared with the shallow parsing done by similar projects, this project does a great deal of detailed work, such as per-domain, per-user parsing strategies and per-domain, per-user homepage analysis strategies.
  • Stable! The project can run stably over long periods:
    • To keep the program running stably over the long term, every network request (including the simulated login) was analyzed by hand from captured traffic, without any automation tools; this also helps keep the crawling speed reasonably high.
    • With sensible thresholds, accounts can be kept safe, though you are advised not to use your everyday account.
    • Even when an account becomes unusable or a login fails, the project handles it (smart account freezing, retry on error, and so on) so that every request remains valid, and errors are reported back to the user promptly.
    • Extensive exception detection and handling catches almost all parsing and crawling exceptions, and a large amount of parsing code has been written to extract sufficiently complete information.
  • Good reusability and extensibility. Many parts of the project carry detailed code comments that make them easy to read. Even if this project does not fully meet your Weibo data collection and analysis needs, you can build on it for secondary development, since it already does a great deal of work on data collection and template parsing.
  • The project is maintained over the long term and has been iterated on for more than a year.
  • Rich documentation: see the wiki for all documents. If the documentation still does not solve your problem, feel free to open an issue; the maintainers will respond actively.

Quick Start :octocat:

1. Read the environment configuration guide to set up the environment the project needs.

2. Download a stable release of the application from the release page.

3. Unpack the downloaded archive and cd into its directory.

4. Install the dependencies. If you want to manage them in a virtual environment, just run source env.sh; if you want to use the system Python environment, install everything with pip3 install -r requirements.txt.

5. Edit the configuration file spider.yml and set the MySQL and Redis connection information, the Yundama CAPTCHA-solving credentials (registration and a paid balance are required), and the email alert settings. You can also configure crawl intervals and other options; read the comments in the file for details. A quick connectivity check such as the sketch below can help confirm the settings before moving on.
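
The following is only an optional sketch (not part of the project) for verifying that the MySQL and Redis credentials you put into spider.yml are actually reachable; the hosts, ports, and passwords are placeholders to replace with your own values.

# check_connections.py -- optional sanity check for the values configured in spider.yml.
# All connection parameters below are placeholders, not the project's defaults.
import pymysql
import redis

mysql_conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                             password='your_mysql_password', charset='utf8mb4')
with mysql_conn.cursor() as cursor:
    cursor.execute('SELECT VERSION()')          # any trivial query proves the connection works
    print('MySQL OK:', cursor.fetchone()[0])
mysql_conn.close()

redis_conn = redis.StrictRedis(host='127.0.0.1', port=6379, db=0,
                               password='your_redis_password')
print('Redis OK:', redis_conn.ping())           # ping() returns True if Redis is reachable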

6. First create a database named weibo by hand, then run python config/create_all.py to create the tables the crawler needs. For v1.7.2 and earlier, run python create_all.py instead. One way to create the database is shown below.
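
For instance, assuming a local MySQL server, a command along these lines creates the database with the utf8mb4 charset that the project documentation recommends:

mysql -u root -p -e "CREATE DATABASE weibo DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"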

7. (Optional, new in v1.7.3) If you want to configure crawl keywords and other settings through a Web UI, you also need to update the database connection information in the DATABASES section of admin/weibo_admin/settings.py (an illustrative sketch of that section follows this step). Then, in the project root, run

python admin/manage.py makemigrations
python admin/manage.py migrate
python admin/manage.py createsuperuser

to generate the tables that the Django admin needs. While running python admin/manage.py createsuperuser you will be prompted for the Django admin superuser's username, email, and password; for example, I entered the username test, an email address, and the password weibospider2017, and the superuser was created successfully.
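
As an illustration only, the DATABASES section follows Django's standard settings layout; the exact keys in the project's admin/weibo_admin/settings.py may differ, and every value below is a placeholder.

# admin/weibo_admin/settings.py -- illustrative sketch of the DATABASES section.
# The layout is Django's standard format; replace the placeholder values with your own.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',  # MySQL backend
        'NAME': 'weibo',                       # the database created in step 6
        'USER': 'root',
        'PASSWORD': 'your_mysql_password',
        'HOST': '127.0.0.1',
        'PORT': '3306',
    }
}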

8. Before starting the crawler we need to pre-insert the Weibo account credentials and some seed data. For example, if you want to crawl a user, insert their uid into the seed_ids table; the uid can be obtained by opening the user's homepage, viewing the page source, and searching for oid. If you want to search for a keyword through Weibo's search interface, insert the keyword into the keywords table (see the example statements after this step). If you completed step 7, you can do this configuration through the Web UI instead, by running

python admin/manage.py runserver 0.0.0.0:8000

to start the crawler configuration backend and then visiting http://127.0.0.1:8000/admin in your browser. Log in with the username test and the password weibospider2017 created above, then fill in the settings under the Weibo configuration section. Note that Django's built-in web server does not offer production-grade stability; if you need to run it in production, use gunicorn or uwsgi as the web server and supervisor as the process manager.
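
If you prefer to seed the tables directly with SQL instead of using the Web UI, statements along the following lines should work; the column names (uid, keyword) and the sample values are assumptions for illustration, so check the actual schema created by config/create_all.py before running them.

INSERT INTO seed_ids (uid) VALUES ('1234567890');
INSERT INTO keywords (keyword) VALUES ('your search keyword');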

9. Once the configuration is done, run

celery -A tasks.workers -Q login_queue,user_crawler,fans_followers,search_crawler,home_crawler worker -l info -c 1

to start a worker. Note that -Q specifies which task queues this machine will consume (for details, read the document listing all tasks in weibospider and what they do), -c is the concurrency, and -l is the log level.

The command above can be run on multiple machines to achieve distributed crawling; all you need to do on the other machines is install the project dependencies (via source env.sh or pip3 install -r requirements.txt). Simple, isn't it? A possible queue split across machines is sketched below.
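
For instance, a second machine could subscribe to only a subset of the queues; the split below is just an illustration, not a recommendation from the project.

celery -A tasks.workers -Q user_crawler,search_crawler worker -l info -c 1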

10. At this point all the preparation is done and we need to send tasks to the workers. There are two ways: 1) run python first_task_execution/login_first.py to log in (sending other tasks works in the same way); 2) because a scheduling mechanism is used to cope with Weibo cookies expiring after 24 hours and to keep the crawl running without interruption, we can run, on any single node,

celery beat -A tasks.workers -l info

to start a Celery beat process, which periodically sends tasks to the Celery workers. Note that there must be only one beat process, otherwise tasks may be executed more than once. The schedule is defined in tasks/workers.py; a sketch of what such a schedule entry can look like is shown below.
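
The snippet below is only a generic illustration of a Celery beat schedule entry; the project's real task names and intervals live in tasks/workers.py, and the names used here are hypothetical.

# Illustrative Celery beat schedule entry; the task name and interval are placeholders,
# not the project's actual values -- see tasks/workers.py for the real schedule.
from datetime import timedelta

from celery import Celery

app = Celery('tasks.workers')
app.conf.beat_schedule = {
    'refresh-login-cookies': {
        'task': 'tasks.login.execute_login_task',  # hypothetical task name
        'schedule': timedelta(hours=20),           # re-login before the 24-hour cookie expiry
    },
}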

That completes the setup. If you run into problems during the steps above, please browse the project documentation patiently first; if you still cannot work it out, or hit any other problem while using the project, feel free to open an issue.

Donate to the Author 👍

If the project is useful or has inspired you, consider a small donation via WeChat or Alipay to support its continued maintenance and development.

  • Donate via WeChat

  • Donate via Alipay

Important Notice 📢

This project was originally developed to monitor certain information and to gather corpora needed for natural language processing, and the crawler's request rate was kept under fairly strict control. Only later, driven by technical interest, was it gradually extended to distributed crawling and to exploring Weibo's anti-crawling measures.

The author therefore hopes users will use the project responsibly (control the request rate via the configuration file), take only what they actually need, and avoid draining the pond to catch the fish in a way that seriously disrupts the normal operation and maintenance of Weibo's systems.

Other

Frequently asked questions

Supplementary notes

Project progress

Acknowledgements ❤️

  • Thanks to Ask for the Celery distributed task-scheduling framework and to kennethreitz for requests.
  • Thanks to the friends who contributed source code; click to see the contributor list.
  • Thanks to everyone who donated to the project; click to see the donor list.
  • Thanks to those who starred the project in support and to everyone who opened issues or offered valuable suggestions while using it.
Comments
  • A first-hand account-banning disaster

    * Three phone-registered accounts were loaded into redis. * Ten nodes were running, with a 1-3 s wait time and 15 threads (I'm broke, so these are the pay-as-you-go instances that cost a little over 0.1 yuan an hour).

    It ran for a few hours without problems, but after I added a few more nodes it died, probably because a single cookie has a request limit; shutting down the extra nodes brought it back to normal immediately. I then wanted to add another account, so I did a simulated login on one of the busy nodes.

    I logged in with just that one account, and the three existing accounts were all banned.

    image

    So it's worth considering whether a single host should use three cookies within a short period of time.

    help wanted 
    opened by yun17 40
  • pymysql raises error 1366

    I followed the project documentation step by step, and both the python create_all.py and login_first.py steps raise this error.

    C:\Anaconda3\lib\site-packages\pymysql\cursors.py:165: Warning: (1366, "Incorrect string value: '\xD6\xD0\xB9\xFA\xB1\xEA...' for column 'VARIABLE_VALUE' at row 475") result = self._query(query) When I first ran create_all.py, I saw that the tables had all been created, so I ignored the warning. But at step 10 the same error still appeared, so I had to deal with it. The MySQL version is 5.7.1; following the environment configuration docs, everything is set to utf8mb4, and every table column uses that charset as well. The code version is 1.7.2. I hope you can find time to take a look and see whether there is a fix.

    opened by jiazone 12
  • login task: exception when fetching cookies on first start

    I'm not very familiar with this project; I saw it today and gave it a try. The environment is Windows with a single worker started, and celery seems to have a problem. I've only just started using celery, so any guidance is appreciated. [2017-09-08 21:02:48,769: ERROR/MainProcess] Task handler raised error: ValueError('not enough values to unpack (expected 3, got 0)',) Traceback (most recent call last): File "d:\anaconda3\lib\site-packages\billiard\pool.py", line 358, in workloop result = (True, prepare_result(fun(*args, **kwargs))) File "d:\anaconda3\lib\site-packages\celery\app\trace.py", line 525, in _fast_trace_task tasks, accept, hostname = _loc ValueError: not enough values to unpack (expected 3, got 0)

    opened by ws0zzg4569 11
  • No data was successfully crawled

    I hit a similar problem on Ubuntu 16.04 before; the celery connection issue has been solved (restarting celery is enough), but the problem of no data being crawled remains. I later reinstalled the system with Ubuntu 18.04, still using project version 1.7.2. After configuring everything per the guide:

    In terminal A I run celery -A tasks.workers -Q login_queue,user_crawler,fans_followers,search_crawler,home_crawler worker -l info -c 1 and it prints:

    [2018-07-13 09:48:01,247: INFO/MainProcess] Connected to redis://:**@127.0.0.1:6379/5 [2018-07-13 09:48:01,260: INFO/MainProcess] mingle: searching for neighbors [2018-07-13 09:48:02,288: INFO/MainProcess] mingle: all alone

    When I open another terminal B and run "python3 login_first.py", terminal A shows:

    [2018-07-21 14:43:23,672: INFO/MainProcess] Received task: tasks.login.login_task[3c1c3e8b-4850-4807-a463-f321c59c216d]
    2018-07-21 14:43:30 - other - INFO - Login successful! The login account is [email protected] [2018-07-21 14:43:30,016: INFO/ForkPoolWorker-1] Login successful! The login account is [email protected]

    I believe the login itself is fine. Then, in terminal A, I tried the other approach:

    celery beat -A tasks.workers -l info
    

    Terminal A shows: celery beat v4.1.1 (latentcall) is starting. __ - ... __ - _ LocalTime -> 2018-07-21 15:29:08 Configuration -> . broker -> redis://:**@127.0.0.1:6379/5 . loader -> celery.loaders.app.AppLoader . scheduler -> celery.beat.PersistentScheduler . db -> celerybeat-schedule . logfile -> [stderr]@%INFO . maxinterval -> 5.00 minutes (300s)

    The output above includes logfile -> [stderr]@%INFO; does that mean the login failed?

    At this point I find the following in terminal B:

    [2018-07-21 15:47:45,777: INFO/MainProcess] Received task: tasks.user.excute_user_task[dbc668b5-1669-48e9-b868-95f773168039]
    [2018-07-21 15:47:45,828: INFO/MainProcess] Received task: tasks.user.crawl_person_infos[9d5df559-6c79-407a-8abf-beb639ed08df]
    2018-07-21 15:47:45 - crawler - INFO - the crawling url is http://weibo.com/p/1005051483330984/info?mod=pedit_more [2018-07-21 15:47:45,843: INFO/ForkPoolWorker-1] the crawling url is http://weibo.com/p/1005051483330984/info?mod=pedit_more [2018-07-21 15:47:46,467: WARNING/ForkPoolWorker-1] /home/zcao/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning) [2018-07-21 15:47:47,647: WARNING/ForkPoolWorker-1] /home/zcao/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning) [2018-07-21 15:47:47,906: WARNING/ForkPoolWorker-1] /home/zcao/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning) 2018-07-21 15:48:05 - crawler - INFO - the crawling url is http://weibo.com/p/1003061483330984/info?mod=pedit_more [2018-07-21 15:48:05,475: INFO/ForkPoolWorker-1] the crawling url is http://weibo.com/p/1003061483330984/info?mod=pedit_more [2018-07-21 15:48:06,076: WARNING/ForkPoolWorker-1] /home/zcao/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning) [2018-07-21 15:48:22,506: ERROR/ForkPoolWorker-1] db operation error,here are details(pymysql.err.DataError) (1406, "Data too long for column 'tags' at row 1") [SQL: 'INSERT INTO wbuser (uid, name, gender, birthday, location, description, register_time, verify_type, verify_info, follows_num, fans_num, wb_num, level, tags, work_info, contact_info, education_info, head_img) VALUES (%(uid)s, %(name)s, %(gender)s, %(birthday)s, %(location)s, %(description)s, %(register_time)s, %(verify_type)s, %(verify_info)s, %(follows_num)s, %(fans_num)s, %(wb_num)s, %(level)s, %(tags)s, %(work_info)s, %(contact_info)s, %(education_info)s, %(head_img)s)'] [parameters: {'uid': '1483330984', 'name': '侯宁', 'gender': 1, 'birthday': '', 'location': '北京', 'description': '人称"空军司令",财富苍生之醉观者。长篇小说《财富苍生-槐花蛇》作者,侯宁微店https://d.weidian.com/single/#/main', 'register_time': ' 2009-08-28 ', 'verify_type': 1, 'verify_info': '独立财经观察家,时评家、社会学者、职业投资人 微博签约自媒体', 'follows_num': 617, 'fans_num': 2843631, 'wb_num': 210982, 'level': '48', 'tags': '槐花蛇 ; 财富苍生 ; ... (240 characters truncated) ... ; 经济学家 ; 投资理财', 'work_info': '中国人民大学 (1991 - 1994) ... (226 characters truncated) ... 
职位:社会学研究所 ', 'contact_info': '', 'education_info': '北京理工大学 (1984年) ', 'head_img': 'http://tva2.sinaimg.cn/crop.86.56.768.768.180/5869d5a8gw1f5ycui2b91j20qg0zkgt1.jpg'}] [2018-07-21 15:48:22,506: WARNING/ForkPoolWorker-1] transaction rollbacks [2018-07-21 15:48:22,507: INFO/ForkPoolWorker-1] has stored user 1483330984 info successfully [2018-07-21 15:48:22,517: INFO/MainProcess] Received task: tasks.user.crawl_follower_fans[f609122b-9b92-464d-9de5-cf3794f5f7e2]
    2018-07-21 15:48:22 - crawler - INFO - the crawling url is http://weibo.com/p/1005051483330984/follow?relate=fans&page=1#Pl_Official_HisRelation__60 [2018-07-21 15:48:22,522: INFO/ForkPoolWorker-1] the crawling url is http://weibo.com/p/1005051483330984/follow?relate=fans&page=1#Pl_Official_HisRelation__60 [2018-07-21 15:48:24,338: WARNING/ForkPoolWorker-1] /home/zcao/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning) [2018-07-21 15:48:25,520: WARNING/ForkPoolWorker-1] /home/zcao/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning) [2018-07-21 15:48:25,775: WARNING/ForkPoolWorker-1] /home/zcao/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning) 2018-07-21 15:48:42 - crawler - INFO - the crawling url is http://weibo.com/p/1005051483330984/follow?page=1#Pl_Official_HisRelation__60 [2018-07-21 15:48:42,292: INFO/ForkPoolWorker-1] the crawling url is http://weibo.com/p/1005051483330984/follow?page=1#Pl_Official_HisRelation__60 [2018-07-21 15:48:42,849: WARNING/ForkPoolWorker-1] /home/zcao/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning) [2018-07-21 15:48:44,019: WARNING/ForkPoolWorker-1] /home/zcao/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning) [2018-07-21 15:48:44,283: WARNING/ForkPoolWorker-1] /home/zcao/.local/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning)

    You can see that the page of the user 侯宁 was indeed crawled, and 侯宁's information is shown.

    But no data appears in the weibo_data table. Thank you.

    The SQLAlchemy version has already been changed to 1.1.15; MySQL is 5.7.22 and celery is 4.1.1.

    Last time you suggested capturing packets and inspecting the responses; could you be a bit more specific? I had a deadline in the past few days and couldn't follow up on our discussion in time, sorry, and thanks for your help.

    opened by FOSquare 10
  • The Yundama CAPTCHA platform seems to be down; I followed the steps for temp_verification from the Chaojiying-platform issue but hit a strange bug. Could the project be updated for a new CAPTCHA platform? Thanks.

    Please answer the following questions before submitting an issue, thanks!

    1. What did you do? On Linux I followed the temp_verification branch instructions and replaced the Chaojiying account and password with my own.

    Describe your steps as clearly as possible, ideally so that the problem can be reproduced.

    Then I started the login workflow with celery -A tasks.workers -Q login_queue worker -l info --concurrency=2 -Ofair, and the bug appeared.

    2. What did you expect? Could you release a version based on a stable CAPTCHA platform? I can't seem to get the temp_verification branch to work...

    3. What did you actually get?

    [2020-06-28 23:25:45,239: ERROR/ForkPoolWorker-1] Task tasks.login.login_task[0ae0f9d9-a689-4242-8072-43aa9ea29d7e] raised unexpected: UnicodeDecodeError('gbk', b'<!doctype html>\n<html>\n<head>\n <meta charset="utf-8">\n <meta http-equiv="X-UA-Compatibl" content="IE=edge,chrome=1"/>\n\n <title>\xe5\xbc\x82\xe5\xb8\xb8\xe8\xae\xbf\xe9\x97\xae\xe6\x8f\x90\xe7\xa4\xba</title>\n <link href="/css/denyerrorpage/frame.css" type="text/css" rel="stylesheet">\n <link href="/css/denyerrorpage/error.css" type="text/css" rel="stylesheet">\n <link href="/css/denyerrorpage/skin.css" type="text/css" rel="stylesheet">\n</head>\n<body>\n<div class="WB_miniblog">\n <!-- \xe9\xa6\x96\xe9\xa1\xb5 -->\n <div class="iforgot_bd">\n <div class="iforgot_header clearfix">\n <div class="logo_mod1 W_fl"></div>\n <div class="name_mod W_fr">\n <a href="http://www.sina.com.cn/" class="S_txt1">\xe6\x96\xb0\xe6\xb5\xaa\xe9\xa6\x96\xe9\xa1\xb5</a>\n <a href="http://weibo.com/" class="S_txt1">\xe5\xbe\xae\xe5\x8d\x9a</a>\n <a href="http://help.weibo.com/" class="S_txt1 last">\xe5\xb8\xae\xe5\x8a\xa9</a>\n </div>\n </div>\n <div class="iforgot_cont">\n <div class="i_mod">\n <div class="form_mod">\n <div class="form_list form_listError">\n <span class="iconError"></span>\n <span class="itemError code_mod">\xe7\xb3\xbb\xe7\xbb\x9f\xe6\x9c\x89\xe7\x82\xb9\xe5\xbf\x99\xef\xbc\x8c\xe8\xaf\xb7\xe5\x88\xb7\xe6\x96\xb0\xe4\xb8\x80\xe4\xb8\x8b\xe8\xaf\x95\xe8\xaf\x95</span>\n </div>\n </div>\n </div>\n </div>\n </div>\n</div>\n<div class="WB_footer S_bg2">\n <div class="other_link S_bg1 clearfix T_add_ser">\n <p class="copy"><a href="http://corp.sina.com.cn/chn/" class="footBg">\xe6\x96\xb0\xe6\xb5\xaa\xe7\xae\x80\xe4\xbb\x8b</a>\xe3\x80\x80<a class="footBg" href="http://corp.sina.com.cn/eng/">About Sina</a>\xe3\x80\x80<a class="footBg" href="http://emarketing.sina.com.cn/">\xe5\xb9\xbf\xe5\x91\x8a\xe6\x9c\x8d\xe5\x8a\xa1</a>\xe3\x80\x80<a class="footBg" href="http://www.sina.com.cn/contactus.html">\xe8\x81\x94\xe7\xb3\xbb\xe6\x88\x91\xe4\xbb\xac</a>\xe3\x80\x80<a class="footBg" href="http://corp.sina.com.cn/chn/sina_job.html">\xe6\x8b\x9b\xe8\x81\x98\xe4\xbf\xa1\xe6\x81\xaf</a>\xe3\x80\x80<a class="footBg" href="http://www.sina.com.cn/intro/lawfirm.shtml">\xe7\xbd\x91\xe7\xab\x99\xe5\xbe\x8b\xe5\xb8\x88</a>\xe3\x80\x80<a class="footBg" href="http://english.sina.com" target="__blank">SINA English</a>\xe3\x80\x80<a class="footBg" href="http://members.sina.com.cn/apply/" target="__blank">\xe6\xb3\xa8\xe5\x86\x8c</a>\xe3\x80\x80<a class="footBg" href="http://tech.sina.com.cn/focus/sinahelp.shtml" target="__blank">\xe4\xba\xa7\xe5\x93\x81\xe7\xad\x94\xe7\x96\x91</a></p>\n <div class="copy"><a href="javascript:;" class="S_txt2">\xe5\xae\xa2\xe6\x88\xb7\xe6\x9c\x8d\xe5\x8a\xa1\xe7\x94\xb5\xe8\xaf\x9d\xef\xbc\x9a400 052 0066 \xe6\xac\xa2\xe8\xbf\x8e\xe6\x89\xb9\xe8\xaf\x84\xe6\x8c\x87\xe6\xad\xa3</a></div>\n <p class="company"><span class="copy S_txt2">Copyright \xc2\xa9 1996-2020 SINA Corporation, All Rights Reserved \xe6\x96\xb0\xe6\xb5\xaa\xe5\x85\xac\xe5\x8f\xb8 \xe7\x89\x88\xe6\x9d\x83\xe6\x89\x80\xe6\x9c\x89</span></p>\n </div>\n </div>\n</body>\n</html>', 792, 793, 'illegal multibyte sequence') Traceback (most recent call last): File "/home/weibo/weibospider-temp_verification/.env/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task R = retval = fun(*args, **kwargs) File "/home/weibo/weibospider-temp_verification/.env/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__ return 
self.run(*args, **kwargs) File "/home/weibo/weibospider-temp_verification/tasks/login.py", line 12, in login_task get_session(name, password) File "/home/weibo/weibospider-temp_verification/login/login.py", line 230, in get_session url, cjy_client, cid, err_no, session = do_login(name, password, proxy) File "/home/weibo/weibospider-temp_verification/login/login.py", line 210, in do_login rs, cjy_client, cid, err_no, session = login_retry(name, password, session, cjy_client, cid, proxy, err_no) File "/home/weibo/weibospider-temp_verification/login/login.py", line 198, in login_retry proxy) File "/home/weibo/weibospider-temp_verification/login/login.py", line 184, in login_by_pincode rs = get_redirect(name, data, post_url, session, proxy) File "/home/weibo/weibospider-temp_verification/login/login.py", line 85, in get_redirect login_loop = logining_page.content.decode("gbk") UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 792: illegal multibyte sequence [2020-06-28 23:25:45,264: ERROR/ForkPoolWorker-1] Task tasks.login.login_task[fa3de833-84ac-4f20-a81b-426cdfff7c97] raised unexpected: SyntaxError('invalid syntax', ('<string>', 1, 1, '<!doctype html>\n')) Traceback (most recent call last): File "/home/weibo/weibospider-temp_verification/.env/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task R = retval = fun(*args, **kwargs) File "/home/weibo/weibospider-temp_verification/.env/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__ return self.run(*args, **kwargs) File "/home/weibo/weibospider-temp_verification/tasks/login.py", line 12, in login_task get_session(name, password) File "/home/weibo/weibospider-temp_verification/login/login.py", line 230, in get_session url, cjy_client, cid, err_no, session = do_login(name, password, proxy) File "/home/weibo/weibospider-temp_verification/login/login.py", line 205, in do_login server_data = get_server_data(su, session, proxy) File "/home/weibo/weibospider-temp_verification/login/login.py", line 67, in get_server_data sever_data = eval(pre_data_res.content.decode("utf-8").replace("sinaSSOController.preloginCallBack", '')) File "<string>", line 1 <!doctype html> ^ SyntaxError: invalid syntax

    4. Which version of WeiboSpider are you using? What is your operating system? Have you read the project's FAQ? I'm using Ubuntu on a server, and I have gone through the environment setup several times.

    opened by zjyzh 9
  • Running the celery command throws a Python module import error

    Hello! On Ubuntu 16.04 I downloaded the v1.7.3 source and successfully configured and ran everything up to step 8. Running step 9 (the celery command) throws an error (I couldn't find a similar problem in the project FAQ or the issues). I then ran celery -A tasks.workers worker -l info -c 1 directly and the error is the same; the key part is: File "/home/zhangchj/Downloads/Weibo/weibospider/tasks/__init__.py", line 6, in from .user import execute_user_task File "/home/zhangchj/Downloads/Weibo/weibospider/tasks/user.py", line 3, in from page_get import (get_fans_or_followers_ids, get_profile, get_user_profile, ImportError: cannot import name 'get_newcard_by_name' It looks like a Python module import problem. I'm in an anaconda environment with Python 3.6.6. Could you help me think about where it might have gone wrong?

    opened by zhangchj9 9
  • InsecureRequestWarning and no data updates in the MySQL database

    Please answer the following questions before submitting an issue, thanks!

    1. What did you do? I set up the environment as required and can connect to the database and Redis. I added a Weibo account and its login password to the database, started a worker, and performed the login (which succeeded). Right after that I tried crawling comments with python3 comment_first.py and got [2018-08-05 21:37:30,254: WARNING/ForkPoolWorker-1] /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning) and no data is updated in the MySQL database either.

    I'm using version 1.7.2 and the OS is macOS. (screenshot: screen shot 2018-08-05 at 9 43 49 pm)

    opened by ArchieGu 9
  • System email exception

    Details below. Problem: 1. After crawling the profiles of a few hundred Weibo accounts (alerts go through a 163 mailbox), the following exception appears; I suggest adding exception handling to the email-sending code.

    ###################### [2018-07-05 08:10:09,864: ERROR/ForkPoolWorker-1] Failed to send emails, (535, b'Error: authentication failed') is raised, here are details: File "/root/softs/weibospider-master/utils/email_warning.py", line 48, in send_email server.login(email_from, email_pass)

    worker: Warm shutdown (MainProcess) 2018-07-05 08:10:09 - crawler - ERROR - failed to crawl http://weibo.com/p/1035051742566624/info?mod=pedit_more,here are details:'NoneType' object is not subscriptable, stack is File "/root/softs/weibospider-master/decorators/decorators.py", line 17, in time_limit return func(*args, **kargs)

    [2018-07-05 08:10:09,878: ERROR/ForkPoolWorker-1] failed to crawl http://weibo.com/p/1035051742566624/info?mod=pedit_more,here are details:'NoneType' object is not subscriptable, stack is File "/root/softs/weibospider-master/decorators/decorators.py", line 17, in time_limit return func(*args, **kargs)

    [2018-07-05 08:10:09,896: INFO/ForkPoolWorker-1] Task tasks.user.crawl_person_infos[3f4c447b-f4cf-41f1-89a6-3d929aed6bfd] succeeded in 2.4491456080004355s: None [2018-07-05 08:10:10,906: WARNING/MainProcess] Restoring 4 unacknowledged message(s)

    opened by lizhuquan 9
  • Continuing to run tasks after all accounts have been banned

    Please answer the following questions before submitting an issue, thanks!

    1. What did you do?

    nohup celery -A tasks.workers -Q login_queue,user_crawler,fans_followers,search_crawler,home_crawler worker -l info -c 1 & nohup python login_first.py & nohup celery beat -A tasks.workers -l info & nohup python search_first.py &

    2. What did you expect?

    I want the program to keep running and crawl all the keywords once a day.

    3. What did you actually get?

    After one day of crawling I had about 18k rows of data, and then the account's status flipped from 1 to 0. I later logged into Weibo and found the account still usable, so I updated the database from 0 back to 1 and restarted the tasks. But when I check with ps aux|grep celery, the processes are no longer what they were at the start: root 1281 0.3 2.9 180644 59444 pts/8 S 13:02 0:01 /usr/bin/python3 /usr/local/bin/celery -A tasks.workers -Q login_queue,user_crawler,fans_followers,search_crawler,home_crawler worker -l info -c 1 root 1286 4.7 3.6 275440 73800 pts/8 S 13:02 0:16 /usr/bin/python3 /usr/local/bin/celery -A tasks.workers -Q login_queue,user_crawler,fans_followers,search_crawler,home_crawler worker -l info -c 1 root 1311 0.3 2.8 188344 57460 pts/8 S 13:04 0:00 /usr/bin/python3 /usr/local/bin/celery beat -A tasks.workers -l info

    Instead there is only:

    root 1311 0.3 2.8 188344 57460 pts/8 S 13:04 0:00 /usr/bin/python3 /usr/local/bin/celery beat -A tasks.workers -l info

    And the database paused after a few dozen more rows were added.

    4. Which version of WeiboSpider are you using? What is your operating system? Have you read the project's [FAQ]?

    I'm using the latest release; the operating system is Ubuntu 14.04. I've read the docs several times but still don't know how to solve this.

    opened by Martinhu95 9
  • Errors after inserting the login account and seed data

    1. python login_first.py 2. python user_first.py

    2018-01-02 14:09:53 - crawler - INFO - the crawling url is http://weibo.com/p/1005051195242865/info?mod=pedit_more [2018-01-02 14:09:53,646: INFO/ForkPoolWorker-1] the crawling url is http://weibo.com/p/1005051195242865/info?mod=pedit_more 2018-01-02 14:09:53 - crawler - WARNING - no cookies in cookies pool, please find out the reason [2018-01-02 14:09:53,650: WARNING/ForkPoolWorker-1] no cookies in cookies pool, please find out the reason (WeiboSpider)root@jian-spider:/home/ubuntu/weibospider# 2018-01-02 14:09:54 - crawler - ERROR - failed to crawl http://weibo.com/p/1005051195242865/info?mod=pedit_more,here are details:(535, b'5.7.11 the behavior of this user triggered some restrictions to this account'), stack is File "/home/ubuntu/weibospider/decorators/decorator.py", line 14, in time_limit return func(*args, **kargs)

    [2018-01-02 14:09:54,293: ERROR/ForkPoolWorker-1] failed to crawl http://weibo.com/p/1005051195242865/info?mod=pedit_more,here are details:(535, b'5.7.11 the behavior of this user triggered some restrictions to this account'), stack is File "/home/ubuntu/weibospider/decorators/decorator.py", line 14, in time_limit return func(*args, **kargs)

    [2018-01-02 14:09:54,304: ERROR/ForkPoolWorker-1] list index out of range [2018-01-02 14:09:54,304: ERROR/ForkPoolWorker-1] list index out of range [2018-01-02 14:09:54,305: ERROR/ForkPoolWorker-1] list index out of range [2018-01-02 14:09:54,324: INFO/MainProcess] Received task: tasks.user.crawl_follower_fans[49a1e5cb-240c-4b0d-a767-e1664574b74e] 2018-01-02 14:09:54 - crawler - INFO - the crawling url is http://weibo.com/p/1005051195242865/follow?relate=fans&page=1#Pl_Official_HisRelation__60 [2018-01-02 14:09:54,329: INFO/ForkPoolWorker-1] the crawling url is http://weibo.com/p/1005051195242865/follow?relate=fans&page=1#Pl_Official_HisRelation__60 2018-01-02 14:09:54 - crawler - WARNING - no cookies in cookies pool, please find out the reason [2018-01-02 14:09:54,331: WARNING/ForkPoolWorker-1] no cookies in cookies pool, please find out the reason 2018-01-02 14:09:54 - crawler - ERROR - failed to crawl http://weibo.com/p/1005051195242865/follow?relate=fans&page=1#Pl_Official_HisRelation__60,here are details:(535, b'5.7.11 the behavior of this user triggered some restrictions to this account'), stack is File "/home/ubuntu/weibospider/decorators/decorator.py", line 14, in time_limit return func(*args, **kargs)

    [2018-01-02 14:09:54,958: ERROR/ForkPoolWorker-1] failed to crawl http://weibo.com/p/1005051195242865/follow?relate=fans&page=1#Pl_Official_HisRelation__60,here are details:(535, b'5.7.11 the behavior of this user triggered some restrictions to this account'), stack is File "/home/ubuntu/weibospider/decorators/decorator.py", line 14, in time_limit return func(*args, **kargs)

    opened by jianzzz 9
  • Keyword search crawl fails

    Because my IP was banned by Weibo, I added an IP proxy. The login runs successfully, but after python first_task_execution/search the result is as follows, and no data appears in weibo_data either. need_proxy in get_page in page_get/basic.py has already been changed to True. [2020-03-07 21:34:12,974: INFO/MainProcess] Received task: tasks.search.search_keyword[d652d4ea-826a-488f-a1aa-eaf52d9d8363]
    2020-03-07 21:34:12 - crawler - INFO - We are searching keyword "武汉红十字会" [2020-03-07 21:34:12,976: INFO/ForkPoolWorker-1] We are searching keyword "武汉红十字会" 2020-03-07 21:34:12 - crawler - INFO - the crawling url is http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1 [2020-03-07 21:34:12,979: INFO/ForkPoolWorker-1] the crawling url is http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1 2020-03-07 21:37:08 - crawler - WARNING - Excepitons are raised when crawling http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1.Here are details:HTTPConnectionPool(host='183.164.228.73', port=49691): Max retries exceeded with url: http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1 (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd699739ba8>: Failed to establish a new connection: [Errno 110] Connection timed out',))) [2020-03-07 21:37:08,589: WARNING/ForkPoolWorker-1] Excepitons are raised when crawling http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1.Here are details:HTTPConnectionPool(host='183.164.228.73', port=49691): Max retries exceeded with url: http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1 (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fd699739ba8>: Failed to establish a new connection: [Errno 110] Connection timed out',))) 2020-03-07 21:37:08 - crawler - ERROR - failed to crawl http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1,here are details:an integer is required (got type str), stack is File "/home/xwt/Desktop/weibospider-temp_verification/decorators/decorators.py", line 17, in time_limit return func(*args, **kargs)

    [2020-03-07 21:37:08,590: ERROR/ForkPoolWorker-1] failed to crawl http://s.weibo.com/weibo/%E6%AD%A6%E6%B1%89%E7%BA%A2%E5%8D%81%E5%AD%97%E4%BC%9A&xsort=hot&suball=1&timescope=custom:2020-01-25-0:2020-02-25-0&page=1,here are details:an integer is required (got type str), stack is File "/home/xwt/Desktop/weibospider-temp_verification/decorators/decorators.py", line 17, in time_limit return func(*args, **kargs)

    2020-03-07 21:37:08 - crawler - WARNING - No search result for keyword 武汉红十字会, the source page is [2020-03-07 21:37:08,591: WARNING/ForkPoolWorker-1] No search result for keyword 武汉红十字会, the source page is [2020-03-07 21:37:08,592: INFO/ForkPoolWorker-1] Task tasks.search.search_keyword[d652d4ea-826a-488f-a1aa-eaf52d9d8363] succeeded in 175.61601991499992s: None

    Is the "Max retries exceeded with url" error caused by the proxy IPs expiring too quickly?

    opened by xwt0016 8
  • Running python3 config/create_all.py raises an error

    Hello! I'm on CentOS 7, have installed MySQL and Redis as the project requires, and have edited the spider.yaml configuration file. When I run python3 config/create_all.py I get the error below:

      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool.py", line 1122, in _do_get
        return self._pool.get(wait, self._timeout)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/queue.py", line 145, in get
        raise Empty
    sqlalchemy.util.queue.Empty
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "config/create_all.py", line 16, in <module>
        create_all_table()
      File "config/create_all.py", line 12, in create_all_table
        metadata.create_all()
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/schema.py", line 3949, in create_all
        tables=tables)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1928, in _run_visitor
        with self._optional_conn_ctx_manager(connection) as conn:
      File "/usr/local/lib/python3.7/contextlib.py", line 112, in __enter__
        return next(self.gen)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1921, in _optional_conn_ctx_manager
        with self.contextual_connect() as conn:
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2112, in contextual_connect
        self._wrap_pool_connect(self.pool.connect, None),
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2147, in _wrap_pool_connect
        return fn()
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool.py", line 387, in connect
        return _ConnectionFairy._checkout(self)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool.py", line 766, in _checkout
        fairy = _ConnectionRecord.checkout(pool)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool.py", line 516, in checkout
        rec = pool._do_get()
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool.py", line 1138, in _do_get
        self._dec_overflow()
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
        compat.reraise(exc_type, exc_value, exc_tb)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 187, in reraise
        raise value
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool.py", line 1135, in _do_get
        return self._create_connection()
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool.py", line 333, in _create_connection
        return _ConnectionRecord(self)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool.py", line 461, in __init__
        self.__connect(first_connect_check=True)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/pool.py", line 651, in __connect
        connection = pool._invoke_creator(self)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/strategies.py", line 105, in connect
        return dialect.connect(*cargs, **cparams)
      File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 393, in connect
        return self.dbapi.connect(*cargs, **cparams)
      File "/usr/local/lib/python3.7/site-packages/pymysql/__init__.py", line 90, in Connect
        return Connection(*args, **kwargs)
      File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 706, in __init__
        self.connect()
      File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 931, in connect
        self._get_server_information()
      File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 1269, in _get_server_information
        self.server_charset = charset_by_id(lang).name
      File "/usr/local/lib/python3.7/site-packages/pymysql/charset.py", line 38, in by_id
        return self._by_id[id]
    KeyError: 255
    Please take a look when you have time; many thanks!
    opened by Shallowhave 4
  • Please update the requests and Django version pins in requirements.txt, thanks

    Please answer the following questions before submitting an issue, thanks!

    1. What did you do? I followed the Quick Start steps using the latest code. At step 7 the installation kept failing and could not proceed. After reading the error messages and adjusting the version pins of the affected libraries, the installation completed: gerapy 0.9.6 requires django==1.11.29, but you have django 3.1.7 which is incompatible. gerapy 0.9.6 requires requests>=2.20.0, but you have requests 2.13.0 which is incompatible. Changing the django and requests pins in requirements.txt to requests==2.20.0 and Django==1.11.29 allows the install to finish. 2. What did you expect?

    3. What did you actually get?

    4. Which version of WeiboSpider are you using? What is your operating system? Have you read the project's FAQ?

    opened by tianshanzhilong 0
  • Reasonable thresholds for a Weibo crawler

    Hello: I'm not using your open-source project, but I have some detailed questions about Weibo crawling. I hold a number of Weibo accounts and can obtain the corresponding cookies through simulated login. The accounts are hard to come by (once banned, each account has to go through login verification again, which is a hassle). When crawling Weibo with an IP pool plus cookies, how many IPs can a single cookie be used across? And within what range should the request rate per IP/cookie be kept?

    opened by chenyinghong 0
  • Trial-and-error notes from an unlucky setup

    I'm a beginner; before choosing an install directory, it helps to study the basics first: https://www.runoob.com/linux/linux-system-contents.html

    When running make, I kept getting errors that release.h could not be found. In the src folder I tried chmod 777 mkreleasehdr.sh, chmod +x mkreleasehdr.sh, and chmod -R +x mkreleasehdr.sh, but I could never get the permissions right. Switching to a different directory made the build go through; it is installed under /usr/games/ now, and I hope nothing breaks later.

    PS: I saw someone recommend using the redis-stable version, although that did not solve my problem.

    opened by JeanYoung5 2
  • Login requires scanning a QR code; has Weibo changed something?

    Please answer the following questions before submitting an issue, thanks!

    1. What did you do? I set up the environment and topped up the Yundama account, but I still run into an error; the log shows retcode=2071, and the reason says QR-code login is required.

    2. What did you expect? A successful login, after which other operations can be carried out.

    3. What did you actually get? The login fails.

    4. Which version of WeiboSpider are you using? What is your operating system? Have you read the project's FAQ? I use the bundled Dockerfile to build the environment; the code was cloned directly over HTTP.

    opened by leo1357904 2
Owner
SpiderClub
A group interested in web crawlers.