Parses data out of your Google Takeout (History, Activity, YouTube, Locations, etc.)

Overview

google_takeout_parser

  • parses both the historical HTML and the newer JSON format for Google Takeouts
  • caches individual takeout results behind cachew
  • merges multiple takeouts into unique events

This doesn't handle all cases, but I have yet to find a parser that does, so here is my attempt at parsing what I see as the most useful info from it. The Google Takeout format is pretty particular, and the contents of the directory depend on what you select while exporting. Unhandled files will emit a warning; feel free to PR a parser or create an issue if this doesn't parse some part you want.

This can take a few minutes to parse depending on what you have in your Takeout (especially while using the old HTML format), so this uses cachew to cache the function result for each Takeout you have. That means parsing a takeout takes a few minutes the first time, but only a few seconds every subsequent time.

Since the Takeout slowly removes old events over time, I would recommend periodically backing up your data (personally I do it once every few months), so you don't lose any old events and pick up data from new ones. To use, go to takeout.google.com; for reference, once on that page, I hit Deselect All, then select:

  • Chrome
  • Google Play Store
  • Location History
    • Select JSON as format
  • My Activity
    • Select JSON as format
  • YouTube and YouTube Music
    • Select JSON as format
    • In options, deselect music-library-songs, music-uploads and videos

The process for getting these isn't that great -- you have to manually go to takeout.google.com every few months, select what you want to export, and then it puts the zipped file into your Google Drive. You can tell it to run at specific intervals, but I personally haven't found that to be reliable.

This was extracted out of my HPI modules, which were in turn modified from the google files in karlicoss/HPI.

Installation

Requires python3.7+

To install with pip, run:

pip install git+https://github.com/seanbreckenridge/google_takeout_parser

Usage

CLI Usage

Can be accessed by either google_takeout_parser or python -m google_takeout_parser. Offers a basic interface to list/clear the cache directory, and/or parse a takeout and interact with it in a REPL:

To clear the cachew cache: google_takeout_parser cache_dir clear

To parse a takeout:

$ google_takeout_parser parse ~/data/Unpacked_Takout --cache
Parsing...
Interact with the export using res

In [1]: res[-2]
Out[1]: PlayStoreAppInstall(title='Hangouts', device_name='motorola moto g(7) play', dt=datetime.datetime(2020, 8, 2, 15, 51, 50, 180000, tzinfo=datetime.timezone.utc))

In [2]: len(res)
Out[2]: 236654

Also contains a small utility command to help move/extract the google takeout:

$ google_takeout_parser move --from ~/Downloads/takeout*.zip --to-dir ~/data/google_takeout --extract
Extracting /home/sean/Downloads/takeout-20211023T070558Z-001.zip to /tmp/tmp07ua_0id
Moving /tmp/tmp07ua_0id/Takeout to /home/sean/data/google_takeout/Takeout-1634993897
$ ls -1 ~/data/google_takeout/Takeout-1634993897
archive_browser.html
Chrome
'Google Play Store'
'Location History'
'My Activity'
'YouTube and YouTube Music'

Library Usage

Assuming you maintain an unpacked view, e.g. like:

$ tree -L 1 ./Takeout-1599315526
./Takeout-1599315526
├── Google Play Store
├── Location History
├── My Activity
└── YouTube and YouTube Music

To parse one takeout:

from pathlib import Path
from google_takeout_parser.path_dispatch import TakeoutParser
tp = TakeoutParser(Path("/full/path/to/Takeout-1599315526"))
# to check if files are all handled
tp.dispatch_map()
# to parse without caching the results in ~/.cache/google_takeout_parser
uncached = list(tp.parse())
# to parse with cachew cache https://github.com/karlicoss/cachew
cached = list(tp.cached_parse())
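
For example (building on the uncached list from the block above), a quick way to see which event types a takeout contains:

from collections import Counter
# count how many events of each model type were parsed
Counter(type(e).__name__ for e in uncached)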

To merge takeouts:

from pathlib import Path
from google_takeout_parser.merge import cached_merge_takeouts
results = list(cached_merge_takeouts([Path("/full/path/to/Takeout-1599315526"), Path("/full/path/to/Takeout-1634971143")]))

The events this returns are a Union of all the types in models.py (to support easy serialization with cachew); to filter down to a particular type, just do an isinstance check:

from google_takeout_parser.models import Location
takeout_generator = TakeoutParser(Path("/full/path/to/Takeout")).cached_parse()
locations = list(filter(lambda e: isinstance(e, Location), takeout_generator))
>>> len(locations)
99913

I personally use this exclusively through my HPI google takeout file, as a configuration layer to locate where my takeouts are on disk. Since that 'automatically' unzips the takeouts (I store them as the zips), it doesn't require me to maintain an unpacked view.

Contributing

Just to give a brief overview, to add new functionality (parsing some new folder that this doesn't currently support), you'd need to:

  • Add a model for it in models.py, with a key property function which describes each event uniquely (used to merge takeout events); add it to the Event Union
  • Write a function which takes the Path to the file you're trying to parse and converts it to the model you created (see examples in parse_json.py). If it's relatively complicated (e.g. HTML), ideally extract a div from the page and add a test for it so it's obvious when/if the format changes.
  • Add a regex match for the file path to the DEFAULT_HANDLER_MAP (a rough sketch of all three steps follows this list)
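
As an illustration only, here's a rough sketch of what those three steps might look like. The names SomeNewEvent and _parse_some_new_data are hypothetical, and the actual model/handler conventions live in models.py, parse_json.py and path_dispatch.py:

from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Iterator, Tuple
import json

# hypothetical model -- see models.py for the real conventions
@dataclass
class SomeNewEvent:
    dt: datetime
    title: str

    @property
    def key(self) -> Tuple[int, str]:
        # uniquely identifies this event, used when merging multiple takeouts
        return (int(self.dt.timestamp()), self.title)

# hypothetical parser -- takes the Path to a file and yields model instances
def _parse_some_new_data(path: Path) -> Iterator[SomeNewEvent]:
    for blob in json.loads(path.read_text()):
        yield SomeNewEvent(
            dt=datetime.fromisoformat(blob["time"]),
            title=blob["title"],
        )

# finally, a regex for the file path would be added to the DEFAULT_HANDLER_MAP, e.g.
# r"Some New Folder/.*\.json": _parse_some_new_data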

Tests

git clone 'https://github.com/seanbreckenridge/google_takeout_parser'
cd ./google_takeout_parser
pip install '.[testing]'
mypy ./google_takeout_parser
pytest
Comments
  • support Windows separators in path_dispatch

    While setting up Windows CI for promnesia, the takeout tests failed and had these in logs:

    2022-05-09T21:03:56.5877916Z [INFO    2022-05-09 20:58:14 promnesia extract.py:49] extracting via promnesia.sources.takeout:index ...
    [W 220509 20:58:14 path_dispatch:270] No function to handle parsing My Activity\Chrome\MyActivity.html
    [W 220509 20:58:14 path_dispatch:270] No function to handle parsing My Activity\Chrome\README
    

    I guess it's because in path_dispatch forward slashes are hardcoded. Perhaps the quickest fix would be to do something like .replace(os.sep, '/') here -- paths in takeout shouldn't have either forward or backwards slashes anyway https://github.com/seanbreckenridge/google_takeout_parser/blob/master/google_takeout_parser/path_dispatch.py#L94
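
    A rough sketch of that suggested normalization (illustrative only, not the actual code in path_dispatch.py):

    import os
    from pathlib import Path

    def _normalize(relative_path: Path) -> str:
        # use forward slashes regardless of OS so the handler regexes also match on Windows
        return str(relative_path).replace(os.sep, "/")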

    opened by karlicoss 6
  • split cached databases by type

    I believe this would make the size smaller since individual rows for the cachew union type would be smaller, so the cache doesn't grow to unreasonable sizes.

    Would probably leave the one in HPI google_takeout as is, since there's just one of those, and not multiple that grow exponentially with the number of exports.

    As it stands, I'm comfortable with the tradeoff here -- trading ease for disk space, but it definitely could be improved.

    enhancement 
    opened by seanbreckenridge 2
  • path dispatch: match against relative path, start from the beginning

    had to update the test, since previously it wasn't detecting:

    • My Activity/Chrome/MyActivity.json due to Chrome in DEFAULT_HANDLER_MAP
    • My Activity/Google Play Store/MyActivity.json due to Google Play Store in DEFAULT_HANDLER_MAP

    Not sure if it's the best way to fix it, but it looks clean enough.

    opened by karlicoss 1
  • push to pypi

    Already have a release just to have the name registered, but leaving the install method as git+ for now, especially because there might be more changes (i.e. #2) and this is a relatively new project right now.

    opened by seanbreckenridge 1
  • Recreate cache on version upgrades

    Unless a model changes, the hash for cachew doesn't update, but the code may have changed and we'd still have old results. So, unless you clear the directory, you could have results generated from old functionality.

    The clear command does fix that, but it would be nice for this to invalidate old results automatically, by inspecting the package installation to see what version this is and putting a 'version' file in the cache directory (or maybe in the cachew hash db table?)

    Could add an environment variable/flag that lets you use mismatched hashes during development

    Could also maybe just add the version at the front of the _cachew_depends_on, since that gets stored as part of the hash
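
    A rough sketch of that last idea (illustrative only; assumes Python 3.8+ for importlib.metadata):

    import importlib.metadata

    def _cachew_depends_on(path) -> list:
        # prepend the installed package version so the hash changes on upgrade,
        # forcing cachew to recompute cached results
        version = importlib.metadata.version("google_takeout_parser")
        return [version, str(path)]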

    opened by seanbreckenridge 0
  • use error_policy kwarg instead of yield/drop/raise

    Should replace these with an error_policy argument which is either yield, warn, or drop, using a Literal, to make it more obvious that these are related to how errors are handled.
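
    One possible shape for that argument (illustrative only):

    from typing import Literal

    ErrorPolicy = Literal["yield", "warn", "drop"]

    def parse(path, error_policy: ErrorPolicy = "yield"):
        # dispatch on error_policy instead of separate yield/drop/raise flags
        ...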

    opened by seanbreckenridge 0
  • some enhancements to support older takeout formats

    • location history: used to be in LocationHistory.json
    • youtube: data used to be in "Youtube" dir
    • youtube: handle older activity format
    • youtube: handle older HTML timestamp format
    opened by karlicoss 0
  • Check watch-history title in newer google takeout exports

    from:

    https://memex.zulipchat.com/#narrow/stream/279601-hpi/topic/google_takeout_parser/near/302874482

    Should take a look at parse_json's _parse_json_activity and see if title, which is currently just a dict access and not a .get, is affected by a new takeout.
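
    Illustrative only -- a defensive version of that access (blob here stands for the parsed activity dict) would look something like:

    title = blob.get("title")
    if title is None:
        # warn or skip the entry instead of raising a KeyError
        ...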

    opened by seanbreckenridge 0
  • Parse PlaceVisits

    This PR adds the ability to parse basic PlaceVisits out of the Takeout Semantic Location History (which contains a person's Maps timeline / location history over time). While the Semantic Location History JSON has timelineObjects as the root list, this PR does not attempt to add parsing these out, as this would require also parsing out ActivitySegments. As such, this does not fully address Issue #16. A future PR could amend and add to this approach to do so.

    opened by ryanbateman 2
  • Do something about http:// youtube links

    It might make sense to replace http:// with https:// for some links, e.g. to youtube videos.

    For instance, Takeout/My Activity/Video Search/MyActivity.{json,html} might contain http:// links for some old entries:

    {'header': 'youtube.com', 'title': 'Watched Octobass @ the Musical Instrument Museum - YouTube', 'titleUrl': 'http://www.youtube.com/watch?v=FP1QqtGe8ts', 'time': '2015-06-10T12:24:03.796Z', 'products': ['Video Search']}
    

    In the case of youtube, switching to https doesn't really hurt (http and https are equivalent and both are available), and it might make it easier to consume downstream, e.g. might prevent duplicates.
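
    A rough sketch of the kind of rewrite this suggests (illustrative only):

    def _normalize_youtube_url(url: str) -> str:
        # upgrade old http:// youtube links to https:// so duplicates merge cleanly
        if url.startswith("http://www.youtube.com/") or url.startswith("http://youtube.com/"):
            return "https://" + url[len("http://"):]
        return url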

    zulip discussion: https://memex.zulipchat.com/#narrow/stream/279601-hpi/topic/google_takeout_parser/near/279605540

    opened by karlicoss 0
  • add handler for Google Fit data

    Fit/Daily Aggregations csv files -- started appearing in 2017

    Fit/Activities/*.tcx and Fit/Activities/Low Accuracy/*.tcx files -- perhaps worth just having a function to get them; something else should actually handle the tcx files. Also, a bunch of them seem to have disappeared in 2020 (comparing with 2018) -- not sure if it's some sort of retention.
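
    A minimal sketch of the "just get the files" idea (illustrative only):

    from pathlib import Path
    from typing import Iterator

    def fit_activity_files(takeout_dir: Path) -> Iterator[Path]:
        # locate the .tcx files (including Low Accuracy); actual TCX parsing is left to something else
        yield from (takeout_dir / "Fit" / "Activities").rglob("*.tcx")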

    new parser 
    opened by karlicoss 1
  • add parser for saved places on google maps

    Seem to be scattered across different formats :hankey:

    "Saved" list is in "Maps (your places)/Saved Places.json" -- present since 2015

    {
      "type" : "FeatureCollection",
      "features" : [ {
        "geometry" : {
          "coordinates" : [ -0.1202100, 51.5979200 ],
          "type" : "Point"
        },
        "properties" : {
          "Google Maps URL" : "http://maps.google.com/?cid=17295021474934382781",
          "Location" : {
            "Address" : "United Kingdom",
            "Business Name" : "Alexandra Palace",
            "Country Code" : "GB",
            "Geo Coordinates" : {
              "Latitude" : "51.5979200",
              "Longitude" : "-0.1202100"
            }
          },
          "Published" : "2017-09-27T09:56:06Z",
          "Title" : "Alexandra Palace",
          "Updated" : "2017-09-27T09:56:06Z"
        },
        "type" : "Feature"
      }, {
        "geometry" : {
          "coordinates" : [ -0.1307733, 51.5941783 ],
          "type" : "Point"
        },
    ...
    ]}
    

    Whereas other lists are in CSV files (since 2018), in the "Saved" directory, one for each list in Google Maps, e.g. Saved/Paris.csv

    Title,Note,URL
    Urfa Durum,,"https://www.google.com/search?q=Urfa+Durum&ludocid=15623525448940569321&ibp=gwp;0,7"
    

    doesn't seem like this data is present anywhere else in takeouts
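
    A rough sketch of pulling the basics out of the GeoJSON variant (illustrative only):

    import json
    from pathlib import Path

    def parse_saved_places(path: Path):
        # path is "Maps (your places)/Saved Places.json"; GeoJSON coordinates are [lng, lat]
        data = json.loads(path.read_text())
        for feature in data["features"]:
            lng, lat = feature["geometry"]["coordinates"]
            props = feature["properties"]
            yield {"title": props["Title"], "lat": lat, "lng": lng, "updated": props.get("Updated")}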

    new parser 
    opened by karlicoss 0
  • add parser for Google Keep data

    Seems to be in the "Keep/" directory, mostly in HTML.

    pretty messy filenames:

    • in 2015
    2015-05-18T18_43_03.920Z.html
    5.html
    
    • in 2017
    2017-01-29T19_43_26.664Z
    2017-01-29T19_43_29.485Z
    
    • 2021 has both html and json, but jsons are mostly empty, almost no data
    2018-05-09T09_29_49.983+01_00.html
    2018-05-09T09_29_49.983+01_00.json
    

    example HTML:

    ...
    <body><div class="note DEFAULT"><div class="heading"><div class="meta-icons">
    <span class="archived" title="Note archived"></span>
    </div>
    Apr 7, 2019, 1:11:02 PM</div>
    
    <div class="content">HTML content</div>
    
    
    </div></body></html>
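
    A rough sketch of pulling the date and text out of a note like that (illustrative only, assumes bs4 is available):

    import bs4

    def parse_keep_note(html: str) -> dict:
        soup = bs4.BeautifulSoup(html, "html.parser")
        heading = soup.find("div", class_="heading")
        content = soup.find("div", class_="content")
        return {
            # the heading div holds the human-readable timestamp, e.g. "Apr 7, 2019, 1:11:02 PM"
            "raw_date": heading.get_text(strip=True) if heading else None,
            "text": content.get_text() if content else None,
        }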
    
    new parser 
    opened by karlicoss 0