Toxicity comments crawler
Crawler job that scrapes comments from social media posts and saves them in an S3 bucket.
Tweets and replies are scraped from the Twitter API for a given list of users.
Twitch
Coming soon.
YouTube
Coming soon.
Coming soon.
Coming soon.
The toxicity level of each comment is calculated using the Perspective API.
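As a rough illustration of how a comment can be scored, the sketch below builds a Perspective `comments:analyze` request body and reads the `TOXICITY` summary score from a response. The helper names (`build_analyze_request`, `extract_toxicity`, `is_toxic`) are hypothetical and are not taken from the crawler's code, and the "at or above the threshold" rule is an assumption. No network call is made; the sample response uses made-up values.

```python
import json

# Endpoint of the Perspective API; the API key is passed as the `key` query parameter.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_analyze_request(text: str) -> dict:
    """Build the JSON body asking Perspective for a TOXICITY score."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }


def extract_toxicity(response: dict) -> float:
    """Read the summary TOXICITY score (0.0 to 1.0) from an analyze response."""
    return float(response["attributeScores"]["TOXICITY"]["summaryScore"]["value"])


def is_toxic(score: float, threshold: float = 0.5) -> bool:
    """Assumed rule: a score at or above the threshold counts as toxic."""
    return score >= threshold


# A response shaped like the one Perspective returns (values are made up).
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.87, "type": "PROBABILITY"}}
    }
}

body = json.dumps(build_analyze_request("example comment"))
score = extract_toxicity(sample_response)
print(score, is_toxic(score))
```

In a real run, `body` would be sent as a POST to `ANALYZE_URL` with the API key, and the parsed JSON reply would be passed to `extract_toxicity`.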
Architecture
Usage
To run the crawler, you need to provide the following environment variables:
Variable | Description | Default | Required |
---|---|---|---|
AWS_ROLE_ARN | AWS Role ARN | None | Optional |
AWS_WEB_IDENTITY_TOKEN_FILE | AWS Web Identity Token File | None | Optional |
AWS_ACCESS_KEY_ID | AWS Access Key ID | None | Optional |
AWS_SECRET_ACCESS_KEY | AWS Secret Access Key | None | Optional |
AWS_S3_BUCKET | AWS S3 Bucket | None | Required |
AWS_S3_BUCKET_PREFIX | AWS S3 Bucket Prefix | None | Required |
LOG_LEVEL | Log level | INFO | Optional |
PERSPECTIVE_API_KEY | Perspective API Key | None | Required |
PERSPECTIVE_THRESHOLD | Perspective Threshold | 0.5 | Required |
FILTER_TOXIC_COMMENTS | Filter Toxic Comments | True | Required |
TWITTER_CONSUMER_KEY | Twitter Consumer Key | None | Required |
TWITTER_CONSUMER_SECRET | Twitter Consumer Secret | None | Required |
TWITTER_ACCESS_TOKEN | Twitter Access Token | None | Required |
TWITTER_ACCESS_TOKEN_SECRET | Twitter Access Token Secret | None | Required |
TWITTER_MAX_TWEETS | Twitter Max Tweets or replies | None | Required |
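To check a configuration before starting the container, the variables listed above can be validated with a small script. This is a minimal sketch, not the crawler's own loader: `VARIABLES` and `load_settings` are hypothetical names, values are kept as plain strings, and a variable counts as missing only when it is required and has no default.

```python
import os
from typing import Mapping, Optional

# Names, defaults, and required flags copied from the variable list above.
# Tuple layout: (default, required)
VARIABLES = {
    "AWS_ROLE_ARN": (None, False),
    "AWS_WEB_IDENTITY_TOKEN_FILE": (None, False),
    "AWS_ACCESS_KEY_ID": (None, False),
    "AWS_SECRET_ACCESS_KEY": (None, False),
    "AWS_S3_BUCKET": (None, True),
    "AWS_S3_BUCKET_PREFIX": (None, True),
    "LOG_LEVEL": ("INFO", False),
    "PERSPECTIVE_API_KEY": (None, True),
    "PERSPECTIVE_THRESHOLD": ("0.5", True),
    "FILTER_TOXIC_COMMENTS": ("True", True),
    "TWITTER_CONSUMER_KEY": (None, True),
    "TWITTER_CONSUMER_SECRET": (None, True),
    "TWITTER_ACCESS_TOKEN": (None, True),
    "TWITTER_ACCESS_TOKEN_SECRET": (None, True),
    "TWITTER_MAX_TWEETS": (None, True),
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return every setting as a string (or None), applying the listed defaults.

    Raises ValueError naming each required variable that has no value.
    """
    env = os.environ if env is None else env
    settings, missing = {}, []
    for name, (default, required) in VARIABLES.items():
        value = env.get(name, default)
        if value is None and required:
            missing.append(name)
        settings[name] = value
    if missing:
        raise ValueError("Missing required variables: " + ", ".join(missing))
    return settings
```

Calling `load_settings()` with no argument reads from `os.environ`; passing a dictionary makes it easy to test a `.env` file's contents before running the container.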
If AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE are both provided, the crawler uses them to assume a role and ignores AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
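The precedence rule above can be sketched as a small decision function. The function name `credential_mode` and the returned labels are hypothetical, and the `"default_chain"` fallback (leaving credential lookup to the AWS SDK when neither pair is set) is an assumption, not something this README states.

```python
from typing import Mapping


def credential_mode(env: Mapping[str, str]) -> str:
    """Pick which AWS credentials to use, mirroring the rule described above."""
    # The role ARN and token file must both be present to assume a role.
    if env.get("AWS_ROLE_ARN") and env.get("AWS_WEB_IDENTITY_TOKEN_FILE"):
        return "web_identity"
    # Static keys are considered only when role assumption is not configured.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "static_keys"
    # Assumption: otherwise defer to whatever the AWS SDK resolves by default.
    return "default_chain"


example_env = {
    "AWS_ROLE_ARN": "arn:aws:iam::123456789012:role/example",
    "AWS_WEB_IDENTITY_TOKEN_FILE": "/path/to/token",
    "AWS_ACCESS_KEY_ID": "placeholder-key-id",
    "AWS_SECRET_ACCESS_KEY": "placeholder-secret",
}
print(credential_mode(example_env))
```

The example environment sets all four variables (with placeholder values) to show that the role-based pair takes priority over the static keys.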
Running
Prerequisites
Then, you can run the crawler with the following command:
docker run --env-file .env -d dougtrajano/toxicity-crawler:latest
License
The project is licensed under the Apache 2.0 License.