data-preprocessing_toogoodtogo_threatlines
We're the hackathon leftovers, but we are Too Good To Go ;-). A repo by Lukas Schubotz, Stef van Buuren, and Raymon van Dinter. We aim to improve the current preprocessing of FTM's WOB data, so that contacts between Shell and the Dutch government can be analysed.
Synchronous visualisation of email threads
Publications from the FTM "Dossier SHELL papers" (https://www.ftm.nl/dossier/shell-papers) suggest that the timing of events is critical in the interactions between actors. It would therefore be useful to visualise the mail exchanges over time.
The idea is to visualise threads of mail exchanges between actors over time. When this is done for multiple threads, the display would give rapid insight into the structure and timing of exchanges between actors. For example, suppose we are able to construct a single thread from "RE:" and "FW:" mails in the data. A simple visualisation would be
See https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.88.9825&rep=rep1&type=pdf for variations on this display, for example drawing the interactions between the actors as arcs and re-sorting the mails according to actor pairs.
A generalisation to multiple simultaneous threads would stack multiple lines, similar to a dot plot. Such a design calls for relatively simple thread displays that are synchronised in time. We will therefore concentrate on a simple thread line that plots mail chronology against calendar time.
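As a rough illustration of the stacked thread-line layout, the sketch below computes plot coordinates only: the x value is the send time and the y value is the row of the thread. The `Mail` and `Point` types, the field names, and the choice to order rows by the first mail in each thread are assumptions made for this example, not part of the repo.

```python
from datetime import datetime
from typing import NamedTuple


class Mail(NamedTuple):
    thread_id: str   # hypothetical thread identifier
    sender: str
    sent: datetime


class Point(NamedTuple):
    x: datetime      # calendar time of the mail
    y: int           # row of the thread line in the stacked display
    sender: str


def thread_line_layout(mails):
    """Return one Point per mail, with threads stacked by first send time."""
    first_sent = {}
    for m in mails:
        if m.thread_id not in first_sent or m.sent < first_sent[m.thread_id]:
            first_sent[m.thread_id] = m.sent
    # Earliest thread gets row 0; ties are broken by thread id.
    ordered = sorted(first_sent, key=lambda tid: (first_sent[tid], tid))
    rows = {tid: row for row, tid in enumerate(ordered)}
    return [
        Point(m.sent, rows[m.thread_id], m.sender)
        for m in sorted(mails, key=lambda m: m.sent)
    ]
```

Because every thread shares the same time axis, the resulting points can be handed to any plotting library and the threads stay synchronised.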
A somewhat grander idea would be to create a "film of events". The user would place a cursor on the time axis, and scroll through time. The new information per mail is displayed as the cursor passes the send time of the email.
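The cursor logic behind such a "film of events" could look roughly like the sketch below. It assumes the mails are available as a list of `(sent, label)` pairs sorted by send time; the function names are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def visible_mails(cursor, mails):
    """All mails sent at or before the cursor position.

    `mails` is assumed to be a list of (sent, label) tuples sorted by `sent`.
    """
    times = [sent for sent, _ in mails]
    return mails[:bisect_right(times, cursor)]


def newly_revealed(prev_cursor, cursor, mails):
    """Mails whose send time was passed when moving the cursor forward."""
    times = [sent for sent, _ in mails]
    lo = bisect_right(times, prev_cursor)
    hi = bisect_right(times, cursor)
    return mails[lo:hi]
```

A viewer would call `newly_revealed` on every scroll step and display the content of the returned mails.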
Issues to resolve
We need complex/advanced text processing. Some of the issues include:
- How can we split multiple emails in a RE/FW into a set of elementary mails, each corresponding to just one sender?
- How well can we form threads by matching on subject lines?
- Do duplicates extracted from RE/FW serve any useful purpose?
- What is the percentage of threads for which we can find the parent mail (the mail that started the thread)?
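To make the subject-matching and parent-mail questions concrete, here is a minimal sketch. It assumes that a thread is the set of mails sharing a subject after leading "RE:", "FW:" or "FWD:" prefixes are stripped, and that a parent mail is one whose subject carries no such prefix. Both are simplifications for illustration, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# One or more leading reply/forward prefixes, e.g. "RE: FW: ".
PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def split_subject(subject):
    """Return (normalised subject, whether a RE/FW prefix was present)."""
    had_prefix = PREFIX.match(subject) is not None
    stripped = PREFIX.sub("", subject)
    key = " ".join(stripped.lower().split())  # lower-case, collapse whitespace
    return key, had_prefix


def build_threads(mails):
    """Group (subject, sent) pairs into threads keyed by normalised subject."""
    threads = defaultdict(list)
    for subject, sent in mails:
        key, is_reply = split_subject(subject)
        threads[key].append((sent, is_reply))
    return threads


def share_with_parent(threads):
    """Fraction of threads that contain at least one mail without a prefix."""
    if not threads:
        return 0.0
    with_parent = sum(
        any(not is_reply for _, is_reply in members)
        for members in threads.values()
    )
    return with_parent / len(threads)
```

Matching on subject alone will merge unrelated mails that happen to share a subject, so the outcome is best read as a first estimate of how well this approach works on the data.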
Experiment 1
The first design plots all thread lines between 2016 and 2020 on one chart.
Experiment 2
The second design uses trelliscopejs to plot the same information in smaller pieces.
The user can switch between 27 panes, each containing about 20 threads.
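The division into panes can be thought of as plain chunking of the ordered thread list. The sketch below illustrates that idea only and is not how trelliscopejs is invoked; `paginate` and the default of 20 threads per pane are assumptions.

```python
def paginate(thread_ids, per_pane=20):
    """Split an ordered list of thread ids into consecutive panes."""
    if per_pane < 1:
        raise ValueError("per_pane must be at least 1")
    return [
        thread_ids[i:i + per_pane]
        for i in range(0, len(thread_ids), per_pane)
    ]
```

If the threads are sorted by start time before chunking, each pane covers a contiguous stretch of the time axis.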
Try out the interactive version
Experiment 3
Back to figure 1, but now plotted with rbokeh, so that we may zoom and use tooltips (interaction not supported by GitHub markdown).