Tools for collecting social media data around focal events

Overview

Social Media Focal Events

The focalevents codebase provides tools for organizing data collected around focal events on social media.

It is often difficult to organize data from multiple API queries. For example, we may collect tweets when a hashtag starts trending by using Twitter’s filter stream. Later, we may make a separate query to the search endpoint to backfill our stream with what we missed before we started it, or update it with tweets that occurred since we stopped it. We may also want to get reply threads, quote tweets, or user timelines based on the tweets we collected. All of these queries are related to a common focal event—the hashtag—but they require several separate calls to the API. It is easy for these multiple queries to result in many disjoint files, making it difficult to organize, merge, update, backfill, and preprocess them quickly and reliably.

To address these issues, focalevents can be used to organize social media focal event data collected from Twitter’s v2 API using academic credentials and PostgreSQL. It is easy to do any of the following with the tools here:

  • Query Twitter’s full archive or filter stream for focal event data
  • Backfill and update those queries with additional data
  • Collect conversation threads and quote tweets of focal event tweets
  • Retrieve full user timelines for any user tweeting during a focal event

Each of these is a single-line command, rather than the long multi-line script typically needed to read IDs, query the API, write output, and merge it with existing data. This lets researchers design more complex studies of social media data and spend more time on data analysis than on data storage and maintenance.

Installation and Documentation

The repository's code can be downloaded directly from GitHub, or cloned using git:

git clone https://github.com/ryanjgallagher/focalevents

See the full documentation for more information about installing, configuring, and using the focalevents tools.

A Note

The code here is written and maintained by a single person. First and foremost, it has been designed to help them manage their own data and create replicable pipelines. They are sharing it in the hope that it may help others who have similar workflows and are interested in organizing their Twitter data according to focal events using PostgreSQL.

Requests for enhancements or additions to the code will likely be declined if the author does not anticipate using them in their own research. It is highly unlikely that the code will ever be adapted to work with databases other than PostgreSQL. Further, general problems with database setup or conflicts with pre-existing database structures are beyond the scope of this project and will not be addressed.

Comments
  • source parameter not included in all tweet data?

    I'm not sure if this is because I'm looking at ancient tweets (2006 onward) or just have bad luck, but I've been getting this error:

    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/Users/a404/ivermectin/focalevents/twitter/search.py", line 790, in <module>
        args.update_interval)
      File "/Users/a404/ivermectin/focalevents/twitter/search.py", line 714, in main
        search.search()
      File "/Users/a404/ivermectin/focalevents/twitter/search.py", line 596, in search
        self.manage_writing(response_json)
      File "/Users/a404/ivermectin/focalevents/twitter/listener.py", line 261, in manage_writing
        raise err
      File "/Users/a404/ivermectin/focalevents/twitter/listener.py", line 250, in manage_writing
        self.write(tweets, includes)
      File "/Users/a404/ivermectin/focalevents/twitter/listener.py", line 285, in write
        all_inserts = get_all_inserts(tweets, includes, self.event, self.query_type)
      File "/Users/a404/ivermectin/focalevents/twitter/helper.py", line 120, in get_all_inserts
        tweet_insert = get_tweet_insert(tweet, event, query_type, direct=True)
      File "/Users/a404/ivermectin/focalevents/twitter/helper.py", line 409, in get_tweet_insert
        'source': tweet['source'],
    KeyError: 'source'
    
    

    seems to be (temporarily??) fixed by making tweet['source'] = 'None' if 'source' isn't in tweet before we assign everything, but that may not be ideal if we actually care about the source.

    anyway, I may just be cursed. lmk if this is the case!!!

    https://github.com/ryanjgallagher/focalevents/blob/ef2d132c57a2d38d3d2af7e8bd7b7d4949a1056d/twitter/helper.py#L409
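The workaround described in this issue can be sketched as follows. This is a minimal illustration, not the project's actual fix: `get_source` is a hypothetical helper, and the example tweets are made up. It uses `dict.get` so that a missing `source` becomes `None` (which a database driver can store as NULL) rather than the string `'None'`.

```python
def get_source(tweet):
    """Return the tweet's 'source' field, or None if the payload omits it.

    Hypothetical helper for illustration; not part of focalevents.
    """
    # dict.get returns None instead of raising KeyError for a missing key
    return tweet.get('source')


# Made-up payloads: one without 'source', one with it
tweet_without_source = {'id': '1', 'text': 'an example tweet'}
tweet_with_source = {'id': '2', 'text': 'another tweet', 'source': 'Twitter Web App'}

print(get_source(tweet_without_source))
print(get_source(tweet_with_source))
```

Keeping the missing value as `None` makes it possible to tell "source not provided" apart from a real source string later on.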

    opened by asmithh 3
  • n_zeros circular reference?

    https://github.com/ryanjgallagher/focalevents/blob/22afb52a4c6b7a3d38a73d4cf70b993db4311546/twitter/search.py#L555

    Here's my error message:

    Traceback (most recent call last):
      File "/usr/local/Cellar/python@3.9/3.9.1_8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/local/Cellar/python@3.9/3.9.1_8/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/Users/a404/ivermectin/focalevents/twitter/search.py", line 769, in <module>
        main(args.event,
      File "/Users/a404/ivermectin/focalevents/twitter/search.py", line 689, in main
        search =SearchListener(event=event,
      File "/Users/a404/ivermectin/focalevents/twitter/search.py", line 273, in __init__
        self.update_query()
      File "/Users/a404/ivermectin/focalevents/twitter/search.py", line 555, in update_query
        pad_num = str(self.query_number).zfill(self.n_zeros)
    AttributeError: 'SearchListener' object has no attribute 'n_zeros'
    

    I think that when we call self.update_query() in some cases we're referencing the parameter n_zeros before it's assigned to the SearchListener? Not sure if that's intentional behavior; happy to help fix if it isn't.
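The ordering problem suspected here can be illustrated with a small stand-in class. This is only a sketch: `SearchListenerSketch` is hypothetical, and whether the real `update_query` increments `query_number` is an assumption. The point is that every attribute read by `update_query()` has to be assigned in `__init__` before the call.

```python
class SearchListenerSketch:
    """Hypothetical stand-in for SearchListener; only the ordering matters."""

    def __init__(self, n_zeros=4):
        self.query_number = 0
        # Assign n_zeros BEFORE calling update_query(), which reads it.
        # Calling update_query() first would raise AttributeError.
        self.n_zeros = n_zeros
        self.pad_num = None
        self.update_query()

    def update_query(self):
        # Assumed behavior: advance the counter and zero-pad it
        self.query_number += 1
        self.pad_num = str(self.query_number).zfill(self.n_zeros)


listener = SearchListenerSketch(n_zeros=4)
print(listener.pad_num)  # 0001
```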

    bug documentation 
    opened by asmithh 3
  • Should retry on connection error

    When collecting large historic datasets, I almost always get a

    requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

    at some point, which stops the whole collection.

    Instead, focalevents should just retry.
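One way the requested behavior could look is a generic retry wrapper with exponential backoff. This is a sketch under assumptions, not how focalevents handles requests: `call_with_retry` and its parameters are hypothetical names, and the retry limits are arbitrary.

```python
import time


def call_with_retry(func, retry_on=(ConnectionError,), max_retries=5,
                    base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff on the given exceptions.

    Hypothetical helper for illustration. The last failure is re-raised
    once max_retries retries have been used up.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt
            sleep(base_delay * 2 ** attempt)


# With the requests library, the error from this issue could be targeted like so:
#   import requests
#   response = call_with_retry(
#       lambda: requests.get(url),
#       retry_on=(requests.exceptions.ConnectionError,),
#   )
```

Passing the exception classes explicitly matters because `requests.exceptions.ConnectionError` is not a subclass of Python's built-in `ConnectionError`.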

    opened by FlxVctr 0
  • Feature request: split result jsons for archiving purposes

    First a question: Is it safe to cut output files from the head and archive away JSON while the app is running?

    Second: With huge datasets it'd be great if there were a function that splits off and compresses old raw data. Maybe a good idea for a feature in the future. We have > 24 GB in a single file by now 😅
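A rough idea of what such a splitting function might look like is sketched below. It assumes the raw output is newline-delimited JSON (one object per line), which is not confirmed here, and that it runs on a file that is no longer being written to. The name `split_and_compress` and the chunk naming scheme are hypothetical.

```python
import gzip
from pathlib import Path


def split_and_compress(path, lines_per_chunk=100_000):
    """Split a line-delimited file into gzip-compressed chunks.

    Hypothetical helper for illustration. Returns the chunk paths in order.
    The original file is left untouched.
    """
    path = Path(path)
    chunk_paths = []
    out = None
    with path.open('rt', encoding='utf-8') as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                # Start a new chunk every lines_per_chunk lines
                if out is not None:
                    out.close()
                chunk_path = path.with_name(
                    f'{path.stem}_{len(chunk_paths):05d}.json.gz')
                out = gzip.open(chunk_path, 'wt', encoding='utf-8')
                chunk_paths.append(chunk_path)
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Reading line by line keeps memory use flat, which matters for files in the tens of gigabytes.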

    opened by FlxVctr 1
  • Logging instead of printing

    Update the code so it uses logging instead of printing to the main console.

    The verbosity parameter should be updated accordingly so that a logging level can be passed.
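A minimal sketch of what this could look like with Python's standard `logging` module follows. The mapping from an integer verbosity to a level, the logger name, and the function names are all assumptions for illustration, not the project's design.

```python
import logging

# Hypothetical logger name for this sketch
logger = logging.getLogger('focalevents_sketch')


def verbosity_to_level(verbose):
    """Map an integer verbosity (assumed 0, 1, 2+) to a logging level."""
    if verbose >= 2:
        return logging.DEBUG
    if verbose == 1:
        return logging.INFO
    return logging.WARNING


def configure_logging(verbose=0):
    """Set up console logging at the level implied by verbose."""
    logging.basicConfig(
        format='%(asctime)s %(levelname)s %(name)s: %(message)s')
    level = verbosity_to_level(verbose)
    logger.setLevel(level)
    return level


configure_logging(verbose=1)
logger.info('Starting search')       # shown at verbosity 1
logger.debug('Raw response details')  # hidden unless verbosity >= 2
```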

    enhancement good first issue 
    opened by ryanjgallagher 1
  • Allow for multiple quotes of quotes searches

    Currently, you can get quotes of quote tweets, but there's no efficient way to continue iterating that process because everything gets labeled as from_quote_search. Two changes can be made:

    1. Add a quote_level column to the database, so that you can subset quote tweets by which iteration of a quote search returned them. For example, tweets retrieved from a search are quote_level 0. Quotes of those tweets are quote_level 1. Quotes of quote_level 1 tweets are quote_level 2. And so on. This involves updating config.py and the insertions into the tweets database in search.py and helper.py
    2. Allow a user to either iterate on the previous quote level (identified automatically) or iterate up to a certain depth, e.g. the user specifies something like up_to_quote_level=6 and the search automatically gets quotes from levels 1 to 6
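The level-by-level iteration proposed above can be sketched independently of the database and API. In this illustration, `collect_quote_levels` and `fetch_quotes` are hypothetical names: `fetch_quotes` stands in for whatever call returns the IDs of tweets quoting a given tweet, and the toy data is made up.

```python
def collect_quote_levels(seed_ids, fetch_quotes, up_to_quote_level):
    """Assign a quote_level to each tweet ID, breadth first.

    seed_ids are level 0; quotes of level k tweets are level k + 1.
    fetch_quotes(tweet_id) must return an iterable of quoting tweet IDs.
    """
    levels = {tweet_id: 0 for tweet_id in seed_ids}
    frontier = list(seed_ids)
    for level in range(1, up_to_quote_level + 1):
        next_frontier = []
        for tweet_id in frontier:
            for quote_id in fetch_quotes(tweet_id):
                # Keep the first (lowest) level at which a tweet is seen
                if quote_id not in levels:
                    levels[quote_id] = level
                    next_frontier.append(quote_id)
        if not next_frontier:
            break  # nothing new was quoted; stop early
        frontier = next_frontier
    return levels


# Toy quote graph: 'b' and 'c' quote 'a', 'd' quotes 'b', 'e' quotes 'd'
quotes = {'a': ['b', 'c'], 'b': ['d'], 'd': ['e']}
result = collect_quote_levels(['a'], lambda t: quotes.get(t, []), 2)
print(result)  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```

Tracking only the newest level as the frontier is what avoids re-querying tweets from earlier iterations.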
    enhancement 
    opened by ryanjgallagher 2
Owner
Ryan Gallagher
Network science PhD student merging networks and NLP for computational social science