Parses data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)

Sean Breckenridge

Last update: Dec 28, 2022

google_takeout_parser

parses both the historical HTML and the new JSON format for Google Takeouts
caches individual takeout results behind cachew
merges multiple takeouts into unique events

This doesn't handle all cases, but I have yet to find a parser that does, so here is my attempt at parsing what I see as the most useful info from it. The Google Takeout is pretty particular, and the contents of the directory depend on what you select while exporting. Unhandled files will warn, though feel free to PR a parser or create an issue if this doesn't parse some part you want.

This can take a few minutes to parse depending on what you have in your Takeout (especially while using the old HTML format), so this uses cachew to cache the function result for each Takeout you may have. That means this'll take a few minutes the first time parsing a takeout, but then only a few seconds every subsequent time.

Since the Takeout slowly removes old events over time, I would recommend backing up your data periodically (personally I do it once every few months), so you don't lose old events and you pick up data from new ones. To use, go to takeout.google.com. For reference, once on that page, I hit Deselect All, then select:

Chrome
Google Play Store
Location History
- Select JSON as format
My Activity
- Select JSON as format
Youtube and Youtube Music
- Select JSON as format
- In options, deselect music-library-songs, music-uploads and videos

The process for getting these isn't that great -- you have to manually go to takeout.google.com every few months, select what you want to export info for, and then it puts the zipped file into your Google Drive. You can tell it to run at specific intervals, but I personally haven't found that to be reliable.

This was extracted out of my HPI modules, which were in turn modified from the google files in karlicoss/HPI.

Installation

Requires python3.7+

To install with pip, run:

pip install git+https://github.com/seanbreckenridge/google_takeout_parser

Usage

CLI Usage

Can be accessed as either google_takeout_parser or python -m google_takeout_parser. Offers a basic interface to list/clear the cache directory, and/or parse a takeout and interact with it in a REPL:

To clear the cachew cache: google_takeout_parser cache_dir clear

To parse a takeout:

$ google_takeout_parser parse ~/data/Unpacked_Takout --cache
Parsing...
Interact with the export using res

In [1]: res[-2]
Out[1]: PlayStoreAppInstall(title='Hangouts', device_name='motorola moto g(7) play', dt=datetime.datetime(2020, 8, 2, 15, 51, 50, 180000, tzinfo=datetime.timezone.utc))

In [2]: len(res)
Out[2]: 236654

Also contains a small utility command to help move/extract the google takeout:

$ google_takeout_parser move --from ~/Downloads/takeout*.zip --to-dir ~/data/google_takeout --extract
Extracting /home/sean/Downloads/takeout-20211023T070558Z-001.zip to /tmp/tmp07ua_0id
Moving /tmp/tmp07ua_0id/Takeout to /home/sean/data/google_takeout/Takeout-1634993897
$ ls -1 ~/data/google_takeout/Takeout-1634993897
archive_browser.html
Chrome
'Google Play Store'
'Location History'
'My Activity'
'YouTube and YouTube Music'

Library Usage

Assuming you maintain an unpacked view, e.g. like:

 $ tree -L 1 ./Takeout-1599315526
./Takeout-1599315526
├── Google Play Store
├── Location History
├── My Activity
└── YouTube and YouTube Music

To parse one takeout:

from pathlib import Path
from google_takeout_parser.path_dispatch import TakeoutParser
tp = TakeoutParser(Path("/full/path/to/Takeout-1599315526"))
# to check if files are all handled
tp.dispatch_map()
# to parse without caching the results in ~/.cache/google_takeout_parser
uncached = list(tp.parse())
# to parse with cachew cache https://github.com/karlicoss/cachew
cached = list(tp.cached_parse())

To merge takeouts:

from pathlib import Path
from google_takeout_parser.merge import cached_merge_takeouts
results = list(cached_merge_takeouts([Path("/full/path/to/Takeout-1599315526"), Path("/full/path/to/Takeout-1634971143")]))

The events this returns are a combination of all types in models.py (to support easy serialization with cachew). To filter to a particular type, just do an isinstance check:

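from pathlib import Path
from google_takeout_parser.models import Location
from google_takeout_parser.path_dispatch import TakeoutParser
takeout_generator = TakeoutParser(Path("/full/path/to/Takeout")).cached_parse()
# keep only the Location events from the mixed stream of event types
locations = list(filter(lambda e: isinstance(e, Location), takeout_generator))
>>> len(locations)
99913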

I personally use this exclusively through my HPI google takeout file, which acts as a configuration layer to locate where my takeouts are on disk. Since that 'automatically' unzips the takeouts (I store them as zips), it doesn't require me to maintain an unpacked view.

Contributing

Just to give a brief overview, to add new functionality (parsing some new folder that this doesn't currently support), you'd need to:

Add a model for it in models.py, with a key property function which describes each event uniquely (used to merge takeout events); add it to the Event Union
Write a function which takes the Path to the file you're trying to parse and converts it to the model you created (see examples in parse_json.py). If it's relatively complicated (e.g. HTML), ideally extract a div from the page and add a test for it so it's obvious when/if the format changes.
Add a regex match for the file path to the DEFAULT_HANDLER_MAP
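
The three steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the library's code: SavedListEntry, parse_saved_list, HANDLER_MAP, find_handler and merge_unique are hypothetical names, and the real models.py and DEFAULT_HANDLER_MAP may be shaped differently. The CSV layout (Title,Note,URL) is taken from the saved-places issue in the comments.

```python
import csv
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class SavedListEntry:
    # hypothetical model; the real ones live in models.py
    title: str
    note: str
    url: str

    @property
    def key(self) -> Tuple[str, str]:
        # describes the event uniquely, so merging can drop duplicates
        return (self.title, self.url)


def parse_saved_list(path: Path) -> Iterator[SavedListEntry]:
    # hypothetical parser for a CSV file with a Title,Note,URL header
    with path.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield SavedListEntry(title=row["Title"], note=row["Note"], url=row["URL"])


Handler = Callable[[Path], Iterator[SavedListEntry]]

# hypothetical stand-in for DEFAULT_HANDLER_MAP: path regex -> parser function
HANDLER_MAP: Dict[str, Handler] = {
    r"Saved/[^/]+\.csv$": parse_saved_list,
}


def find_handler(relative_path: str) -> Optional[Handler]:
    # return the first parser whose regex matches the relative path
    for pattern, handler in HANDLER_MAP.items():
        if re.match(pattern, relative_path):
            return handler
    return None


def merge_unique(events: Iterable[SavedListEntry]) -> List[SavedListEntry]:
    # keep the first event seen for each key
    seen = set()
    merged = []
    for event in events:
        if event.key not in seen:
            seen.add(event.key)
            merged.append(event)
    return merged
```

The key property is what lets the same event, exported in two different takeouts, collapse into one entry when merging.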

Tests

git clone 'https://github.com/seanbreckenridge/google_takeout_parser'
cd ./google_takeout_parser
pip install '.[testing]'
mypy ./google_takeout_parser
pytest

Comments

support Windows separators in path_dispatch
While setting up Windows CI for promnesia, the takeout tests failed and had these in logs:

2022-05-09T21:03:56.5877916Z [INFO 2022-05-09 20:58:14 promnesia extract.py:49] extracting via promnesia.sources.takeout:index ...
[W 220509 20:58:14 path_dispatch:270] No function to handle parsing My Activity\Chrome\MyActivity.html
[W 220509 20:58:14 path_dispatch:270] No function to handle parsing My Activity\Chrome\README

I guess it's because in path_dispatch forward slashes are hardcoded. Perhaps the quickest fix would be to do something like .replace(os.sep, '/') here -- paths in takeout shouldn't have either forward or backwards slashes anyway https://github.com/seanbreckenridge/google_takeout_parser/blob/master/google_takeout_parser/path_dispatch.py#L94
opened by karlicoss 6
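
A minimal sketch of the fix suggested in the issue above: normalize separators before matching against forward-slash patterns. The issue proposes .replace(os.sep, '/'); this sketch replaces the backslash explicitly so it behaves the same on any OS. CHROME_ACTIVITY, to_forward_slashes and is_chrome_activity are hypothetical names, not part of path_dispatch.

```python
import re

# hypothetical pattern, written with forward slashes like the handler map
CHROME_ACTIVITY = re.compile(r"My Activity/Chrome/MyActivity\.(html|json)$")


def to_forward_slashes(relative_path: str) -> str:
    # Windows-style separators become forward slashes; other paths are unchanged
    return relative_path.replace("\\", "/")


def is_chrome_activity(relative_path: str) -> bool:
    # normalize first, then match against the forward-slash pattern
    return CHROME_ACTIVITY.search(to_forward_slashes(relative_path)) is not None
```

With this, the Windows path from the log (My Activity\Chrome\MyActivity.html) matches the same pattern as its POSIX equivalent.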
split cached databases by type

I believe this would make the size smaller since individual rows for the cachew union type would be smaller, so the cache doesn't grow to unreasonable sizes.

Would probably leave the one in HPI google_takeout as is, since there's just one of those, and not multiple that grow exponentially with no. of exports

As it stands, I'm comfortable with the tradeoff here -- trading disk space for ease -- but it definitely could be improved
enhancement

opened by seanbreckenridge 2
path dispatch: match against relative path, start from the beginning
had to update the test, since previously it wasn't detecting:

My Activity/Chrome/MyActivity.json due to Chrome in DEFAULT_HANDLER_MAP

My Activity/Google Play Store/MyActivity.json due to Google Play Store in DEFAULT_HANDLER_MAP

Not sure if it's the best way to fix it, but it looks clean enough
opened by karlicoss 1
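
To illustrate why matching from the beginning of the relative path matters, here is a small sketch. HANDLERS and both dispatch functions are hypothetical and much simpler than the real DEFAULT_HANDLER_MAP; the point is only the difference between re.search (matches anywhere) and re.match (anchored at the start).

```python
import re
from typing import Dict, Optional

# hypothetical, simplified handler map: path regex -> handler name
HANDLERS: Dict[str, str] = {
    r"Chrome/": "chrome",
    r"My Activity/": "activity",
}


def dispatch_search(relative_path: str) -> Optional[str]:
    # matches the pattern anywhere in the path, so "Chrome/" also
    # captures files under "My Activity/Chrome/"
    for pattern, handler in HANDLERS.items():
        if re.search(pattern, relative_path):
            return handler
    return None


def dispatch_match(relative_path: str) -> Optional[str]:
    # anchored at the beginning of the relative path
    for pattern, handler in HANDLERS.items():
        if re.match(pattern, relative_path):
            return handler
    return None
```

Here dispatch_search sends My Activity/Chrome/MyActivity.json to the "chrome" handler, while dispatch_match sends it to "activity".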
push to pypi

Already have a release just to have the name registered, but leaving install method as git+ for now, esp. because might be more changes (i.e. #2) and is a relatively new project right now

opened by seanbreckenridge 1
Recreate cache on version upgrades

Unless a model changes, the hash for cachew doesn't update, but code may have changed and we still have old results. So, unless you clear the directory you could have results generated from old functionality

The clear command does fix that, but it would be nice for this to invalidate old results automatically, by inspecting the package installation to see what version this is and putting a 'version' file in the cache directory (or maybe in the cachew hash db table?)

Could add an environment variable/flag that lets you use mismatched hashes during development

Could also maybe just add the version at the front of the _cachew_depends_on, since that gets stored as part of the hash

opened by seanbreckenridge 0
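
The idea of putting the version at the front of the dependency value can be sketched without cachew at all. The function below is hypothetical and is not how cachew computes its hash; it only shows that prefixing the package version makes the key change on upgrade, which would force old results to be recomputed.

```python
import hashlib
from pathlib import Path


def cache_key(takeout_dir: Path, package_version: str) -> str:
    # hypothetical helper: the version comes first, so upgrading the
    # package produces a different key even if the takeout is unchanged
    payload = f"{package_version}:{takeout_dir}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The same takeout directory yields a different key under each package version, and a stable key for as long as the version stays the same.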
use error_policy kwarg instead of yield/drop/raise

should replace these with an error_policy argument which is either yield/warn or drop, using a Literal, to make it more obvious that these are related to how to handle errors

opened by seanbreckenridge 0
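
A rough sketch of what an error_policy argument typed with Literal could look like. The names ErrorPolicy and apply_error_policy are hypothetical, the three values follow the issue title (yield/drop/raise), and typing.Literal assumes Python 3.8+ (on 3.7 it would have to come from typing_extensions).

```python
from typing import Iterable, Iterator, Literal, TypeVar, Union

T = TypeVar("T")

ErrorPolicy = Literal["yield", "raise", "drop"]


def apply_error_policy(
    results: Iterable[Union[T, Exception]],
    error_policy: ErrorPolicy = "yield",
) -> Iterator[Union[T, Exception]]:
    for item in results:
        if isinstance(item, Exception):
            if error_policy == "raise":
                raise item
            if error_policy == "drop":
                continue
            # "yield": fall through and pass the error on to the caller
        yield item
```

A single argument makes it obvious that the three behaviours are mutually exclusive, which separate boolean flags do not.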
some enhancements to support older takeout formats
location history: used to be in LocationHistory.json

youtube: data used to be in "Youtube" dir

youtube: handle older activity format

youtube: handle older HTML timestamp format
opened by karlicoss 0
Check watch-history title in newer google takeout exports

from:

https://memex.zulipchat.com/#narrow/stream/279601-hpi/topic/google_takeout_parser/near/302874482

Should take a look at _parse_json_activity in parse_json and see whether title, which is currently a plain dict access and not a get, is affected by a newer takeout

opened by seanbreckenridge 0
Parse PlaceVisits

This PR adds the ability to parse basic PlaceVisits out of the Takeout Semantic Location History (which contains a person's Maps timeline / location history over time.) While the Semantic Location History JSON has timelineObjects as the root list, this PR does not attempt to add parsing these out, as this would require also parsing out ActivitySegments. As such, this does not fully address Issue #16. A future PR could amend and add to this approach to do so.

opened by ryanbateman 2
Do something about http:// youtube links
It might make sense to replace http:// with https:// for some links, e.g. to youtube videos.

For instance, in Takeout/My Activity/Video Search/MyActivity.{json,html} might contain http:// links for some old entries

{'header': 'youtube.com', 'title': 'Watched Octobass @ the Musical Instrument Museum - YouTube', 'titleUrl': 'http://www.youtube.com/watch?v=FP1QqtGe8ts', 'time': '2015-06-10T12:24:03.796Z', 'products': ['Video Search']}

In the case of youtube, switching to https doesn't really hurt (the http/https versions are equivalent and both are available), and it might make the data easier to consume downstream, e.g. it might prevent duplicates.

zulip discussion: https://memex.zulipchat.com/#narrow/stream/279601-hpi/topic/google_takeout_parser/near/279605540
opened by karlicoss 0
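
One possible shape for the rewrite discussed above, as a sketch using only the standard library. The host list and the function name are assumptions for illustration; nothing here is taken from the parser itself.

```python
from urllib.parse import urlsplit, urlunsplit

# assumed set of hosts to upgrade; adjust as needed
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def upgrade_youtube_url(url: str) -> str:
    # only touch http:// links that point at a known youtube host
    parts = urlsplit(url)
    if parts.scheme == "http" and parts.hostname in YOUTUBE_HOSTS:
        return urlunsplit(parts._replace(scheme="https"))
    return url
```

Applied to the titleUrl in the example entry, this returns https://www.youtube.com/watch?v=FP1QqtGe8ts; links to other hosts are returned unchanged.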
add handler for Google Fit data

Fit/Daily Aggregations csv files -- started appearing in 2017

Fit/Activities/*.tcx and Fit/Activities/Low Accuracy/*.tcx files -- perhaps worth just having a function to get them; something else should actually handle tcx files. Also, a bunch of them seem to have disappeared in 2020 (comparing with 2018) -- not sure if it's some sort of retention
new parser

opened by karlicoss 1

add parser for saved places on google maps

Seem to be scattered across different formats :hankey:

"Saved" list is in "Maps (your places)/Saved Places.json" -- present since 2015

{
  "type" : "FeatureCollection",
  "features" : [ {
    "geometry" : {
      "coordinates" : [ -0.1202100, 51.5979200 ],
      "type" : "Point"
    },
    "properties" : {
      "Google Maps URL" : "http://maps.google.com/?cid=17295021474934382781",
      "Location" : {
        "Address" : "United Kingdom",
        "Business Name" : "Alexandra Palace",
        "Country Code" : "GB",
        "Geo Coordinates" : {
          "Latitude" : "51.5979200",
          "Longitude" : "-0.1202100"
        }
      },
      "Published" : "2017-09-27T09:56:06Z",
      "Title" : "Alexandra Palace",
      "Updated" : "2017-09-27T09:56:06Z"
    },
    "type" : "Feature"
  }, {
    "geometry" : {
      "coordinates" : [ -0.1307733, 51.5941783 ],
      "type" : "Point"
    },
...
]}

Whereas other lists are in CSV files (since 2018), in "Saved" directory, one for each list in google maps e.g. Saved/Paris.csv

Title,Note,URL
Urfa Durum,,"https://www.google.com/search?q=Urfa+Durum&ludocid=15623525448940569321&ibp=gwp;0,7"

doesn't seem like this data is present anywhere else in takeouts

new parser

opened by karlicoss 0
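
As a starting point for such a parser, here is a sketch that reads the "Saved Places.json" structure shown above. SavedPlace and parse_saved_places are hypothetical names, the field selection is a guess at what is useful, and it assumes every feature carries the keys seen in the sample.

```python
import json
from typing import Iterator, NamedTuple


class SavedPlace(NamedTuple):
    # hypothetical model for one feature in the FeatureCollection
    title: str
    lat: float
    lng: float
    url: str
    published: str


def parse_saved_places(raw: str) -> Iterator[SavedPlace]:
    data = json.loads(raw)
    for feature in data["features"]:
        # GeoJSON stores coordinates as [longitude, latitude]
        lng, lat = feature["geometry"]["coordinates"]
        props = feature["properties"]
        yield SavedPlace(
            title=props["Title"],
            lat=lat,
            lng=lng,
            url=props["Google Maps URL"],
            published=props["Published"],
        )
```

The coordinate order is easy to get wrong: in the sample, the first coordinate (-0.1202100) is the longitude, which matches the "Longitude" value under "Geo Coordinates".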

add parser for Google Keep data

Seems to be in "Keep/" directory. Mostly in HTML

pretty messy filenames:

in 2015

2015-05-18T18_43_03.920Z.html
5.html

in 2017

2017-01-29T19_43_26.664Z
2017-01-29T19_43_29.485Z

2021 has both html and json, but jsons are mostly empty, almost no data

2018-05-09T09_29_49.983+01_00.html
2018-05-09T09_29_49.983+01_00.json

example HTML:

...
<body><div class="note DEFAULT"><div class="heading"><div class="meta-icons">
<span class="archived" title="Note archived"></span>
</div>
Apr 7, 2019, 1:11:02 PM</div>

<div class="content">HTML content</div>


</div></body></html>

new parser

opened by karlicoss 0
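
For the HTML variant, the timestamp in the heading div is probably the first thing a parser needs. The sketch below pulls it out with a regular expression and parses it with strptime. The function name and the regex are assumptions based only on the single example above; the result is a naive datetime since the sample carries no timezone, and %b assumes English month names.

```python
import re
from datetime import datetime
from typing import Optional

# matches timestamps shaped like "Apr 7, 2019, 1:11:02 PM"
HEADING_DATE = re.compile(r"[A-Z][a-z]{2} \d{1,2}, \d{4}, \d{1,2}:\d{2}:\d{2} [AP]M")


def parse_keep_timestamp(html: str) -> Optional[datetime]:
    # return the first heading-style timestamp found, or None if absent
    match = HEADING_DATE.search(html)
    if match is None:
        return None
    return datetime.strptime(match.group(0), "%b %d, %Y, %I:%M:%S %p")
```

Older exports may use a different timestamp format, so a real parser would likely need a fallback, for example the date encoded in the filename.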
