GoogleNewsScraper
Getting Started
Installation
pip install GoogleNewsScraper
Reference
Importing
from GoogleNewsScraper import GoogleNewsScraper
Instantiating the Scraper
GoogleNewsScraper(driver)
Constructor Parameters
Name | Type | Required |
---|---|---|
driver | web driver | no |
Possible values:
- 'chrome': The scraper will default to this package's bundled Chrome driver
- A path to some driver (Firefox, for instance) stored on the user's system
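For instance, a minimal instantiation sketch (the geckodriver path is illustrative, not a real location):

from GoogleNewsScraper import GoogleNewsScraper

# Default: use the Chrome driver bundled with this package
scraper = GoogleNewsScraper('chrome')

# Alternative: point the scraper at a driver stored on your own system
# scraper = GoogleNewsScraper('/path/to/geckodriver')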
Methods
This method is technically public, though it should only be used internally by the class:
locate_html_element(self, driver, element, selector, wait_seconds)
Name | Type | Required | Description |
---|---|---|---|
driver | web driver | yes | A web driver (Chrome, Firefox, etc.) |
element | string | yes | The id or class selector of an HTML element |
selector | Module import | yes | A selenium By locator strategy (see below) |
wait_seconds | int | no | Maximum number of seconds to wait when locating an HTML element |
To configure the 'selector' parameter:
First, install selenium:
pip install selenium
Then import By:
from selenium.webdriver.common.by import By
Possible values:
By.ID
By.CLASS_NAME
By.CSS_SELECTOR
By.LINK_TEXT
By.NAME
By.PARTIAL_LINK_TEXT
By.TAG_NAME
By.XPATH
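Should you call it directly anyway, a minimal sketch follows; it assumes selenium can start Chrome locally, and the 'article' class name and the 10-second wait are illustrative:

from selenium import webdriver
from selenium.webdriver.common.by import By
from GoogleNewsScraper import GoogleNewsScraper

driver = webdriver.Chrome()
scraper = GoogleNewsScraper('chrome')

# Wait up to 10 seconds for an element with the class 'article' to appear
element = scraper.locate_html_element(driver, 'article', By.CLASS_NAME, 10)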
GoogleNewsScraper(...args).search(search_text, date_range, pages, pagination_pause_per_page, cb) -> list or None
Name | Type | Required | Description |
---|---|---|---|
search_text | str | yes | The word or words to submit to the Google search engine |
date_range | str | no | Filters articles by date. Possible values: Past hours, Past 24 hours, Past week, Past month, Past year, Archives |
pages | str or int | no | Number of pages to scrape (defaults to 'max') |
pagination_pause_per_page | int | no | Number of seconds to wait before each new page is scraped (defaults to 2). The pause may have to be increased if Google prevents you from scraping all pages. |
cb | function | no | A callback invoked with all article data on a single page, for every page scraped (defaults to False) |
- Example using the 'cb' parameter:
def handle_page_data(page_data: list):
    # Do something with page_data
    pass

GoogleNewsScraper(...args).search(...args, cb=handle_page_data)
NOTE:
- If no argument is provided for 'cb', the search method will return a two-dimensional list
- Each inner list will contain an object of news article data for every news article on that page
Example of the data that every article object will contain:
'id'
: A unique id for every article data object
'description'
: The preview description of the news article
'title'
: The title of the news article
'source'
: The source of the news article (the New York Times, for instance)
'image_url'
: The url of the preview news article image
'url'
: A link to the news article
'date_time'
: A datetime string that represents the date the article was published
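Putting it all together, a minimal end-to-end sketch; the query and the printed fields are illustrative, and the arguments are passed positionally in the order documented above:

from GoogleNewsScraper import GoogleNewsScraper

scraper = GoogleNewsScraper('chrome')

# With no 'cb' argument, search returns a two-dimensional list:
# one inner list of article objects per scraped page
pages = scraper.search('electric cars', 'Past week', 2)

for page in pages:
    for article in page:
        print(article['title'], '-', article['url'])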