Crawl the information of a given keyword on Google search engine

Last update: Nov 9, 2022

Overview

GoogleSpider

Config

DataBase

Data is currently stored in MongoDB. The database configuration is on lines 15-19 of the setting.py file and can be modified as needed.

# MONGODB
MONGO_IP = "localhost"
MONGO_PORT = 27017
MONGO_DB = "Google_spider"
MONGO_USER_NAME = ""
MONGO_USER_PASS = ""
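
As a minimal sketch of how these settings might be consumed, the snippet below assembles a standard MongoDB connection URI from them. The `build_mongo_uri` helper is hypothetical and not part of the project; the setting names and values are copied from the block above. How the spider itself opens its connection is not shown in this README.

```python
from urllib.parse import quote_plus

# Values copied from the settings above.
MONGO_IP = "localhost"
MONGO_PORT = 27017
MONGO_DB = "Google_spider"
MONGO_USER_NAME = ""
MONGO_USER_PASS = ""


def build_mongo_uri(ip, port, db, user="", password=""):
    """Hypothetical helper: build a mongodb:// URI from the settings."""
    credentials = ""
    if user:
        # Percent-encode credentials so characters such as '@' or ':' are safe.
        credentials = f"{quote_plus(user)}:{quote_plus(password)}@"
    return f"mongodb://{credentials}{ip}:{port}/{db}"


uri = build_mongo_uri(MONGO_IP, MONGO_PORT, MONGO_DB, MONGO_USER_NAME, MONGO_USER_PASS)
print(uri)  # mongodb://localhost:27017/Google_spider
```

A URI in this form can be passed to a MongoDB client such as `pymongo.MongoClient`. With an empty user name, the credentials part is left out entirely.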

Log

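import os

LOG_NAME = os.path.basename(os.getcwd())  # log name, taken from the working directory
LOG_PATH = "log/%s.log" % LOG_NAME  # log file path
LOG_LEVEL = "DEBUG"
LOG_COLOR = True  # colored output
LOG_IS_WRITE_TO_CONSOLE = True  # write log to the console
LOG_IS_WRITE_TO_FILE = True  # write log to a file
LOG_MODE = "w"  # file write mode
LOG_MAX_BYTES = 10 * 1024 * 1024  # maximum bytes per log file
LOG_BACKUP_COUNT = 20  # number of log files to keep
LOG_ENCODING = "utf8"  # log file encoding
OTHERS_LOG_LEVAL = "ERROR"  # log level for other libraries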
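
To illustrate what the size and backup settings mean in practice, here is a hedged sketch that maps a few of the LOG_* values onto Python's standard `logging.handlers.RotatingFileHandler`. The `build_logger` helper and the temporary log directory are assumptions made for the example; the project's own logging setup is not shown in this README and may differ.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

# Values copied from the settings above.
LOG_LEVEL = "DEBUG"
LOG_MAX_BYTES = 10 * 1024 * 1024
LOG_BACKUP_COUNT = 20
LOG_ENCODING = "utf8"


def build_logger(name, log_path):
    """Hypothetical helper: a rotating file logger driven by the LOG_* values."""
    directory = os.path.dirname(log_path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    handler = RotatingFileHandler(
        log_path,
        maxBytes=LOG_MAX_BYTES,        # roll over before the file exceeds this size
        backupCount=LOG_BACKUP_COUNT,  # keep at most this many rolled-over files
        encoding=LOG_ENCODING,
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(LOG_LEVEL)
    logger.addHandler(handler)
    return logger


# Usage: write one message into a temporary "log" directory.
log_path = os.path.join(tempfile.mkdtemp(), "log", "demo.log")
logger = build_logger("demo", log_path)
logger.debug("spider started")
```

Note that the standard-library handler always appends when `maxBytes` is greater than zero, so a setting like `LOG_MODE = "w"` has no effect in this sketch.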

Spider

- Download interval
  ```
  SPIDER_SLEEP_TIME = [0, 1]
  ```
- Maximum number of retries (100 by default)
  ```
  SPIDER_MAX_RETRY_TIMES = 100
  ```

  Note

  If an illegal interface is encountered during crawling, a 'user agent -- illegal interface' exception is thrown, and the crawler task retries until the data is crawled successfully or the retry limit of 100 is exceeded.
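
The retry behavior described in the note can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: `crawl_with_retry` and `flaky_fetch` are hypothetical names, and the two values in `SPIDER_SLEEP_TIME` are assumed to be lower and upper bounds, in seconds, of a random pause between attempts.

```python
import random
import time

# Values copied from the settings above.
SPIDER_SLEEP_TIME = [0, 1]
SPIDER_MAX_RETRY_TIMES = 100


def crawl_with_retry(fetch, max_retry_times=SPIDER_MAX_RETRY_TIMES,
                     sleep_range=SPIDER_SLEEP_TIME, sleep=time.sleep):
    """Hypothetical helper: call fetch() until it succeeds or retries run out."""
    retries = 0
    while True:
        try:
            return fetch()
        except Exception:
            retries += 1
            if retries > max_retry_times:
                raise  # give up: the retry budget is exhausted
            # Assumption: pause for a random time within the configured bounds.
            sleep(random.uniform(*sleep_range))


# Usage: a stand-in fetch that fails twice before returning data.
attempts = []


def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("user agent -- illegal interface")
    return {"title": "Donald Trump - Wikipedia"}


# The sleep function is replaced with a no-op so the example runs instantly.
result = crawl_with_retry(flaky_fetch, sleep=lambda seconds: None)
print(result, len(attempts))
```

Passing the sleep function as a parameter keeps the sketch easy to test; in a real crawl the default `time.sleep` would apply the pause.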

Data structure

| key | value type | example |
| --- | --- | --- |
| title | str | "Donald Trump - Wikipedia" |
| keyword | str | "Trump" |
| url | str | "https://en.wikipedia.org/wiki/Donald_Trump" |
| text | str | "Donald Trump - Wikipedia 1 hour ago · Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who served as the 45th president of the United States ... Vice President: Mike Pence In office January 20, 2017 – January 20, 2021: In office; January 20, 2017 – January 20, 2021 Occupation: Politician; businessman; television presenter Parents: Fred Trump; Mary Anne MacLeod" |
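
For readers who want to handle these records in their own code, one possible representation is a small dataclass with the same four string fields. The `SearchResult` class is a hypothetical name used only for this sketch; the project may simply store plain dictionaries.

```python
from dataclasses import asdict, dataclass


@dataclass
class SearchResult:
    """Hypothetical container mirroring the four fields in the table."""
    title: str
    keyword: str
    url: str
    text: str


record = SearchResult(
    title="Donald Trump - Wikipedia",
    keyword="Trump",
    url="https://en.wikipedia.org/wiki/Donald_Trump",
    text="Donald Trump - Wikipedia 1 hour ago ...",  # shortened for the example
)

# Convert to a plain dict, the shape a MongoDB document would take.
document = asdict(record)
print(sorted(document))  # ['keyword', 'text', 'title', 'url']
```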

Quick start

Crawl 3 pages of results for the keyword 'Trump':

from spiders.google_curl import GoogleCurl

spider = GoogleCurl('Trump', 3)
spider.start()

The first parameter is the search keyword, and the second parameter is the number of pages to crawl.
