Crawler job that scrapes comments from social media posts and saves them in an S3 bucket.

Douglas Trajano

Last update: Jan 24, 2022

Overview

Toxicity comments crawler

Crawler job that scrapes comments from social media posts and saves them in an S3 bucket.

Twitter

Tweets and replies are scraped from the Twitter API for a given list of users.
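The per-user collection loop can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: `collect_comments`, `fetch_user_items`, `fake_fetch`, and the record shape are hypothetical names, and the sketch assumes the cap (in the spirit of `TWITTER_MAX_TWEETS`) applies per user. The fetch function is injected so the loop can be run without network access.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical record shape: {"user": ..., "id": ..., "text": ...}
Comment = Dict[str, str]


def collect_comments(
    fetch_user_items: Callable[[str], Iterable[Comment]],
    users: List[str],
    max_items: int,
) -> Iterator[Comment]:
    """Yield at most `max_items` tweets or replies for each user.

    `fetch_user_items` stands in for whatever call wraps the Twitter API;
    it is an assumption of this sketch, not part of the project.
    """
    for user in users:
        for count, item in enumerate(fetch_user_items(user)):
            if count >= max_items:
                break  # assumed: the cap is applied per user
            yield item


def fake_fetch(user: str) -> Iterable[Comment]:
    """Offline stand-in that produces five fake items per user."""
    return ({"user": user, "id": str(i), "text": f"reply {i}"} for i in range(5))


sample = list(collect_comments(fake_fetch, ["alice", "bob"], max_items=2))
```

Swapping `fake_fetch` for a function that calls the real API would leave the loop unchanged.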

Twitch

Coming soon.

YouTube

Coming soon.

Facebook

Coming soon.

Instagram

Coming soon.

The toxicity level of each comment is scored using the Perspective API.
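As a rough illustration of the scoring step, the sketch below builds a Perspective API `comments:analyze` request body for the `TOXICITY` attribute and reads the summary score from a response. The helper names (`build_request_body`, `toxicity_score`, `is_toxic`) are hypothetical, the sample response is hand-written, and the `>=` comparison against the threshold is an assumption about how `PERSPECTIVE_THRESHOLD` is applied. Sending the request is left out on purpose.

```python
import json
from typing import Any, Dict

# Endpoint of the Perspective API `comments:analyze` method.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request_body(text: str) -> str:
    """Serialize a request that asks only for the TOXICITY attribute."""
    return json.dumps({
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    })


def toxicity_score(response: Dict[str, Any]) -> float:
    """Extract the summary TOXICITY score (0.0 to 1.0) from a response."""
    return float(response["attributeScores"]["TOXICITY"]["summaryScore"]["value"])


def is_toxic(response: Dict[str, Any], threshold: float = 0.5) -> bool:
    """Assumed rule: a comment is toxic when its score reaches the threshold."""
    return toxicity_score(response) >= threshold


# Hand-written response in the documented shape, for illustration only.
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.92, "type": "PROBABILITY"}}
    }
}
body = build_request_body("example comment")
flagged = is_toxic(sample_response, threshold=0.5)
```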

Architecture

Usage

To run the crawler, you need to provide the following environment variables:

| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| `AWS_ROLE_ARN` | AWS Role ARN | `None` | Optional |
| `AWS_WEB_IDENTITY_TOKEN_FILE` | AWS Web Identity Token File | `None` | Optional |
| `AWS_ACCESS_KEY_ID` | AWS Access Key ID | `None` | Optional |
| `AWS_SECRET_ACCESS_KEY` | AWS Secret Access Key | `None` | Optional |
| `AWS_S3_BUCKET` | AWS S3 Bucket | `None` | Required |
| `AWS_S3_BUCKET_PREFIX` | AWS S3 Bucket Prefix | `None` | Required |
| `LOG_LEVEL` | Log level | `INFO` | Optional |
| `PERSPECTIVE_API_KEY` | Perspective API Key | `None` | Required |
| `PERSPECTIVE_THRESHOLD` | Perspective Threshold | `0.5` | Required |
| `FILTER_TOXIC_COMMENTS` | Filter Toxic Comments | `True` | Required |
| `TWITTER_CONSUMER_KEY` | Twitter Consumer Key | `None` | Required |
| `TWITTER_CONSUMER_SECRET` | Twitter Consumer Secret | `None` | Required |
| `TWITTER_ACCESS_TOKEN` | Twitter Access Token | `None` | Required |
| `TWITTER_ACCESS_TOKEN_SECRET` | Twitter Access Token Secret | `None` | Required |
| `TWITTER_MAX_TWEETS` | Twitter Max Tweets or replies | `None` | Required |
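One way to read and validate these variables is sketched below. It is an illustration rather than the project's loader: `load_config`, `REQUIRED`, `DEFAULTS`, and `example_env` are hypothetical names. The sketch assumes that variables listed with a non-`None` default (`PERSPECTIVE_THRESHOLD`, `FILTER_TOXIC_COMMENTS`) fall back to that default even though the table marks them as required, and that `TWITTER_MAX_TWEETS` is an integer.

```python
import os
from typing import Dict, Mapping, Optional

# Variables with no usable default in the table above.
REQUIRED = (
    "AWS_S3_BUCKET", "AWS_S3_BUCKET_PREFIX", "PERSPECTIVE_API_KEY",
    "TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_TOKEN_SECRET",
    "TWITTER_MAX_TWEETS",
)
# Defaults copied from the table (assumed to apply when unset).
DEFAULTS = {
    "LOG_LEVEL": "INFO",
    "PERSPECTIVE_THRESHOLD": "0.5",
    "FILTER_TOXIC_COMMENTS": "True",
}


def load_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Validate required variables, apply defaults, and convert types."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing required variables: " + ", ".join(missing))
    known = {k: v for k, v in env.items() if k in REQUIRED or k in DEFAULTS}
    raw = {**DEFAULTS, **known}
    return {
        **raw,
        "PERSPECTIVE_THRESHOLD": float(raw["PERSPECTIVE_THRESHOLD"]),
        "FILTER_TOXIC_COMMENTS": raw["FILTER_TOXIC_COMMENTS"].strip().lower() == "true",
        "TWITTER_MAX_TWEETS": int(raw["TWITTER_MAX_TWEETS"]),
    }


# Placeholder values for illustration only.
example_env = {name: "placeholder" for name in REQUIRED}
example_env["TWITTER_MAX_TWEETS"] = "100"
config = load_config(example_env)
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to exercise before starting a container.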

If `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` are both provided, the crawler uses them to assume a role and ignores `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
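The precedence rule above can be expressed as a small decision function. This is only a sketch: `credential_mode` and its return labels are hypothetical, and the `"default_chain"` fallback for the case where neither pair is set is an assumption, since the behaviour in that case is not documented here.

```python
from typing import Mapping


def credential_mode(env: Mapping[str, str]) -> str:
    """Pick a credential source following the documented precedence."""
    # Role assumption wins whenever both role variables are present.
    if env.get("AWS_ROLE_ARN") and env.get("AWS_WEB_IDENTITY_TOKEN_FILE"):
        return "web_identity"
    # Otherwise fall back to the static key pair, if it is complete.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "static_keys"
    # Assumed fallback: let the AWS SDK resolve credentials on its own.
    return "default_chain"


# Placeholder values for illustration only.
both = {
    "AWS_ROLE_ARN": "arn:aws:iam::123456789012:role/example",
    "AWS_WEB_IDENTITY_TOKEN_FILE": "/path/to/token",
    "AWS_ACCESS_KEY_ID": "placeholder",
    "AWS_SECRET_ACCESS_KEY": "placeholder",
}
mode = credential_mode(both)
```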

Running

Prerequisites

Docker

Then, you can run the crawler with the following command:

docker run --env-file .env -d dougtrajano/toxicity-crawler:latest

License

The project is licensed under the Apache 2.0 License.

Releases (0.2.1)

0.2.1 (Dec 27, 2021)
What's Changed

Add wait_on_rate_limit in TwitterAPI by @DougTrajano in https://github.com/DougTrajano/toxicity-crawler/pull/29

Full Changelog: https://github.com/DougTrajano/toxicity-crawler/compare/0.2.0...0.2.1
0.2.0 (Dec 25, 2021)
What's Changed

Fixed an issue with tweet content in TwitterAPI by @DougTrajano

Added an exploratory notebook to test TwitterAPI by @DougTrajano

Bump pyyaml from 5.4.1 to 6.0 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/12

Bump google-api-python-client from 2.22.0 to 2.33.0 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/26

Bump metaflow from 2.3.6 to 2.4.7 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/28

Full Changelog: https://github.com/DougTrajano/toxicity-crawler/compare/0.1.4...0.2.0
0.1.4 (Sep 26, 2021)
Changes

Bump google-api-python-client from 2.21.0 to 2.22.0 #3

Fix Python path in Dockerfile

0.1.3 (Sep 24, 2021)
Changes

Updated GitHub Action.

Fix error in Docker execution.

0.1.2 (Sep 24, 2021)

Updated GitHub Action
0.1.1 (Sep 24, 2021)

Updated GitHub Action
0.1.0 (Sep 24, 2021)

Initial version