Tools for collecting social media data around focal events

Ryan Gallagher

Last update: Nov 28, 2022


Overview

Social Media Focal Events

The focalevents codebase provides tools for organizing data collected around focal events on social media.

It is often difficult to organize data from multiple API queries. For example, we may collect tweets when a hashtag starts trending by using Twitter’s filter stream. Later, we may make a separate query to the search endpoint to backfill our stream with what we missed before we started it, or update it with tweets that occurred since we stopped it. We may also want to get reply threads, quote tweets, or user timelines based on the tweets we collected. All of these queries are related to a common focal event—the hashtag—but they require several separate calls to the API. It is easy for these multiple queries to result in many disjoint files, making it difficult to organize, merge, update, backfill, and preprocess them quickly and reliably.

To address these issues, focalevents can be used to organize social media focal event data collected from Twitter’s v2 API using academic credentials and PostgreSQL. It is easy to do any of the following with the tools here:

Query Twitter’s full archive or filter stream for focal event data
Backfill and update those queries with additional data
Collect conversation threads and quote tweets of focal event tweets
Retrieve full user timelines for any user tweeting during a focal event

Each of these is a single-line command rather than the long multi-line script typically needed to read IDs, query the API, output data, and merge it with existing data. This lets researchers design more complex studies of social media data and spend more time on analysis than on data storage and maintenance.

Installation and Documentation

The repository's code can be downloaded directly from GitHub, or cloned using git:

git clone https://github.com/ryanjgallagher/focalevents

See the full documentation for more information about installing, configuring, and using the focalevents tools.

A Note

The code here is written and maintained by a single person. First and foremost, it has been designed to help them manage their own data and create replicable pipelines. They are sharing it in the hope that it may help others who have similar workflows and are interested in organizing their Twitter data according to focal events using PostgreSQL.

Requests for enhancements or additions to the code will likely be declined if the author does not anticipate using them in their own research. It is highly unlikely that the code will ever be adapted to work with databases other than PostgreSQL. Further, general problems with database setup or conflicts with pre-existing database structures are beyond the scope of this project and will not be addressed.

Comments

source parameter not included in all tweet data?

I'm not sure if this is because I'm looking at ancient tweets (2006 onward) or just have bad luck, but I've been getting this error:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/a404/ivermectin/focalevents/twitter/search.py", line 790, in <module>
    args.update_interval)
  File "/Users/a404/ivermectin/focalevents/twitter/search.py", line 714, in main
    search.search()
  File "/Users/a404/ivermectin/focalevents/twitter/search.py", line 596, in search
    self.manage_writing(response_json)
  File "/Users/a404/ivermectin/focalevents/twitter/listener.py", line 261, in manage_writing
    raise err
  File "/Users/a404/ivermectin/focalevents/twitter/listener.py", line 250, in manage_writing
    self.write(tweets, includes)
  File "/Users/a404/ivermectin/focalevents/twitter/listener.py", line 285, in write
    all_inserts = get_all_inserts(tweets, includes, self.event, self.query_type)
  File "/Users/a404/ivermectin/focalevents/twitter/helper.py", line 120, in get_all_inserts
    tweet_insert = get_tweet_insert(tweet, event, query_type, direct=True)
  File "/Users/a404/ivermectin/focalevents/twitter/helper.py", line 409, in get_tweet_insert
    'source': tweet['source'],
KeyError: 'source'

seems to be (temporarily??) fixed by making tweet['source'] = 'None' if 'source' isn't in tweet before we assign everything, but that may not be ideal if we actually care about the source.

anyway, I may just be cursed. lmk if this is the case!!!

https://github.com/ryanjgallagher/focalevents/blob/ef2d132c57a2d38d3d2af7e8bd7b7d4949a1056d/twitter/helper.py#L409
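
The workaround described in this comment can be sketched as follows. This is an illustration only, not the repository's code: `get_source` is a hypothetical helper and the tweet dictionaries are made up. It uses `dict.get` so that a missing `source` becomes a real `None` rather than the string `'None'`, which keeps missing values distinguishable from real source names.

```python
def get_source(tweet, default=None):
    """Return the tweet's `source` field, or `default` if the API omitted it.

    Hypothetical helper: `tweet` is assumed to be the dict parsed from the
    API response, as in the traceback above.
    """
    return tweet.get("source", default)


# Made-up example payloads: one without a `source` key, one with it.
old_tweet = {"id": "20", "text": "an early tweet"}
new_tweet = {"id": "21", "text": "a later tweet", "source": "Twitter Web App"}

print(get_source(old_tweet))  # falls back to the default
print(get_source(new_tweet))  # uses the value from the payload
```

Whether a null `source` is acceptable depends on the column definition in the tweets table, which this sketch does not assume anything about.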

opened by asmithh 3

n_zeros circular reference?

https://github.com/ryanjgallagher/focalevents/blob/22afb52a4c6b7a3d38a73d4cf70b993db4311546/twitter/search.py#L555

Here's my error message:

Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.9/3.9.1_8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/Cellar/python@3.9/3.9.1_8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Users/a404/ivermectin/focalevents/twitter/search.py", line 769, in <module>
    main(args.event,
  File "/Users/a404/ivermectin/focalevents/twitter/search.py", line 689, in main
    search =SearchListener(event=event,
  File "/Users/a404/ivermectin/focalevents/twitter/search.py", line 273, in __init__
    self.update_query()
  File "/Users/a404/ivermectin/focalevents/twitter/search.py", line 555, in update_query
    pad_num = str(self.query_number).zfill(self.n_zeros)
AttributeError: 'SearchListener' object has no attribute 'n_zeros'

I think that when we call self.update_query() in some cases we're referencing the parameter n_zeros before it's assigned to the SearchListener? Not sure if that's intentional behavior; happy to help fix if it isn't.
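
A minimal sketch of the ordering problem suspected here, assuming the cause is simply that `update_query()` runs before `n_zeros` is assigned. `PaddedListener` is a hypothetical stand-in, not the real `SearchListener`; only the `zfill` line mirrors the traceback.

```python
class PaddedListener:
    """Hypothetical stand-in for a listener that zero-pads its query number."""

    def __init__(self, n_zeros=3):
        # Assign every attribute that update_query() reads *before* calling
        # it. Moving the n_zeros assignment below the call would reproduce
        # the AttributeError from the traceback.
        self.query_number = 0
        self.n_zeros = n_zeros
        self.update_query()

    def update_query(self):
        self.query_number += 1
        # Same padding expression as in the traceback
        self.pad_num = str(self.query_number).zfill(self.n_zeros)


listener = PaddedListener()
print(listener.pad_num)
```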

bug documentation

opened by asmithh 3

Should retry on connection error

When collecting large historic datasets, I almost always get a

requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

at some point, which stops the whole collection.

Instead, focalevents should just retry.
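
One way the requested behavior could look, as a sketch rather than a patch for focalevents: a generic wrapper that retries a call with exponential backoff. `with_retries` and its parameters are hypothetical names. The default catches Python's built-in `ConnectionError`; the `requests` exception in the error above is a different class, so it would have to be passed explicitly through `retry_on`.

```python
import time


def with_retries(func, retry_on=(ConnectionError,), max_attempts=5,
                 base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff on the given exceptions."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

With `requests`, the call might be wrapped as `with_retries(lambda: session.get(url), retry_on=(requests.exceptions.ConnectionError,))`, where `session` and `url` are placeholders.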

opened by FlxVctr 0

Feature request: split result jsons for archiving purposes

First a question: Is it safe to cut output files from the head and archive away JSON while the app is running?

Second: With huge datasets it'd be great if there'd be a function that splits off and compresses old raw data. Maybe a good idea for a feature in the future. We have > 24 GB in a single file by now 😅
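
A rough sketch of what such a splitting step could look like, under two assumptions that are not confirmed here: the raw output holds one JSON object per line, and the file is not being written to while the function runs. `split_and_compress` is a hypothetical helper, not part of focalevents. It copies the source into numbered gzip chunks and leaves the original untouched.

```python
import gzip
from pathlib import Path


def split_and_compress(src, out_dir, lines_per_chunk=100_000):
    """Copy src into numbered .json.gz chunks of at most lines_per_chunk lines."""
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    out = None
    with open(src, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i % lines_per_chunk == 0:
                # Start a new chunk, closing the previous one first
                if out is not None:
                    out.close()
                path = out_dir / f"{src.stem}_{len(chunk_paths):05d}.json.gz"
                out = gzip.open(path, "wt", encoding="utf-8")
                chunk_paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Reading line by line keeps memory use flat regardless of file size, which matters for a file in the tens of gigabytes.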

opened by FlxVctr 1

Logging instead of printing

Update code so it uses logging instead of printing to main console

Verbosity parameter should be updated accordingly so level of verbosity can be passed
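
A small sketch of how a verbosity value could map onto the standard `logging` levels. The function name, the logger name `"focalevents"`, and the 0/1/2 scale are assumptions made for illustration, not existing options of the tool.

```python
import logging


def configure_logging(verbose=0):
    """Map a verbosity count to a logging level and return the logger."""
    if verbose <= 0:
        level = logging.WARNING   # quiet: warnings and errors only
    elif verbose == 1:
        level = logging.INFO      # progress messages
    else:
        level = logging.DEBUG     # everything
    logger = logging.getLogger("focalevents")
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


logger = configure_logging(verbose=1)
logger.info("Collected a batch of tweets")  # replaces a print() call
```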

enhancement good first issue

opened by ryanjgallagher 1

Allow for multiple quotes of quotes searches

Currently, you can get quotes of quote tweets, but there's no efficient way to continue iterating that process because everything gets labeled as from_quote_search. Two changes can be made:

Add a quote_level column to the database, so that you can subset quote tweets by the iteration of the quote search that returned them. For example, tweets retrieved from a search are quote_level 0. Quotes of those tweets are quote_level 1. Quotes of quote_level 1 tweets are quote_level 2. And so on. This involves updating config.py and the insertions into the tweets database in search.py and helper.py

Allow a user to either iterate on the previous quote level (identified automatically) or iterate up to a certain depth, e.g. the user specifies something like up_to_quote_level=6 and the search automatically gets quotes from levels 1 to 6
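
The level-by-level iteration proposed above could be sketched like this. Everything here is hypothetical: `collect_quote_levels` is not an existing function, and `fetch_quotes` stands in for whatever call returns the IDs of tweets quoting a given tweet. Only the names quote_level and up_to_quote_level come from the proposal.

```python
def collect_quote_levels(seed_ids, fetch_quotes, up_to_quote_level):
    """Return {tweet_id: quote_level}, expanding one quote level at a time."""
    levels = {tweet_id: 0 for tweet_id in seed_ids}  # search results: level 0
    frontier = list(seed_ids)
    for level in range(1, up_to_quote_level + 1):
        next_frontier = []
        for tweet_id in frontier:
            for quote_id in fetch_quotes(tweet_id):
                if quote_id not in levels:  # keep the lowest level seen
                    levels[quote_id] = level
                    next_frontier.append(quote_id)
        if not next_frontier:
            break  # no new quotes: deeper levels would be empty
        frontier = next_frontier
    return levels


# Made-up quote graph: "b" and "c" quote "a", "d" quotes "b", "e" quotes "d"
quotes = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
levels = collect_quote_levels(["a"], lambda t: quotes.get(t, []), 2)
print(levels)
```

Storing the level alongside each tweet is what would let a later run resume from the highest level already collected instead of starting over.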

enhancement

opened by ryanjgallagher 2

Owner

Ryan Gallagher

Network science PhD student merging networks and NLP for computational social science
