WhatSoup 🍲

A web scraper that exports your entire WhatsApp chat history.

Table of Contents

  1. Overview
  2. Demo
  3. Prerequisites
  4. Instructions
  5. Frequently Asked Questions

Overview

Problem

  1. Exports are limited to a maximum of 40,000 messages
  2. Exports skip the text portion of media messages, replacing the entire message with <Media omitted> instead of, for example, <Media omitted> My favorite selfie of us 😻🐢🤳
  3. Exports are limited to a .txt file format

Solution

WhatSoup solves these problems by loading the entire chat history in a browser, scraping the chat messages (only text, no media), and exporting it to .txt, .csv, or .html file formats.

Example output:

WhatsApp Chat with Bob Ross.txt

02/14/2021, 02:04 PM - Eddy Harrington: Hey Bob 👋 Let's move to Signal!
02/14/2021, 02:05 PM - Bob Ross: You can do anything you want. This is your world.
02/15/2021, 08:30 AM - Eddy Harrington: How about we use WhatSoup 🍲 to backup our cherished chats?
02/15/2021, 08:30 AM - Bob Ross: However you think it should be, that’s exactly how it should be.
02/15/2021, 08:31 AM - Eddy Harrington: You're the best, Bob ❀
02/19/2021, 11:24 AM - Bob Ross: <Media omitted> My latest happy 🌲 painting for you.

Demo

Watch the video on YouTube

Prerequisites

  • You have a WhatsApp account
  • You have Chrome browser installed
  • You have some familiarity with setting up and running Python scripts
  • Your terminal supports Unicode (UTF-8) characters (for chat emojis); a quick check is shown below
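
Not sure about the last point? A quick, optional check (it prints your terminal's encoding plus a couple of emoji; garbled output or a UnicodeEncodeError means you should switch your terminal to UTF-8 first):

    # Windows: use `python` instead of `python3`
    python3 -c "import sys; print(sys.stdout.encoding); print('WhatSoup 🍲 👋')"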

Instructions

  1. Make sure your WhatsApp chat settings are set to English. This needs to be done on your phone (instructions here). You can change it back afterwards, but for now the script relies on certain HTML elements/attributes that contain English characters/words.

  2. Clone the repo:

    git clone https://github.com/eddyharrington/WhatSoup.git
    
  3. Create a virtual environment:

    # Windows
    python -m venv env
    
    # Linux & Mac
    python3 -m venv env
    
  4. Activate the virtual environment:

    # Windows
    env/Scripts/activate
    
    # Linux & Mac
    source env/bin/activate
    
  5. Install the dependencies:

    # Windows
    pip install -r requirements.txt
    
    # Linux & Mac
    python3 -m pip install -r requirements.txt
    
  6. Set up your environment

  • Download ChromeDriver and extract it to a local folder (such as the env folder)

  • Get your Chrome browser Profile Path by opening Chrome and entering chrome://version into the URL bar

  • Create a .env file with entries for DRIVER_PATH and CHROME_PROFILE that specify the paths to your ChromeDriver and Chrome profile from the steps above (a sketch of how these values get used appears after these instructions):

    # Windows
    DRIVER_PATH = 'C:\path-to-your-driver\chromedriver.exe'
    CHROME_PROFILE = 'C:\Users\your-username\AppData\Local\Google\Chrome\User Data'
    
    # Linux & Mac
    DRIVER_PATH = '/Users/your-username/path-to-your-driver/chromedriver'
    CHROME_PROFILE = '/Users/your-username/Library/Application Support/Google/Chrome/Default'
    
  7. Run the script

    # Windows
    python whatsoup.py
    
    # Linux & Mac
    python3 whatsoup.py
    

    Note for Mac users: you may get blocked when trying to run the script the first time with a message about chromedriver not being from an identified developer. This is normal. Follow these instructions to grant chromedriver an exception, then re-run the script.
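
For reference, below is a minimal sketch of how the DRIVER_PATH and CHROME_PROFILE values from step 6 are typically wired into Selenium. It assumes python-dotenv and Selenium 4 (older Selenium releases pass executable_path=DRIVER_PATH instead of a Service object) and is illustrative only, not a copy of whatsoup.py:

    # Sketch only: load the .env values and start Chrome with your existing profile
    import os

    from dotenv import load_dotenv                      # assumed extra: pip install python-dotenv
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    load_dotenv()                                       # reads DRIVER_PATH and CHROME_PROFILE from .env
    options = webdriver.ChromeOptions()
    options.add_argument(f"user-data-dir={os.getenv('CHROME_PROFILE')}")  # reuse your logged-in WhatsApp session

    driver = webdriver.Chrome(service=Service(os.getenv('DRIVER_PATH')), options=options)
    driver.get('https://web.whatsapp.com/')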

Frequently Asked Questions

Does it download pictures / media?

No.

How large of chats can I load/export?

The most demanding part of the process is loading the entire chat in the browser, where performance depends heavily on how much memory your computer has and how well Chrome handles the large DOM. For reference, my largest chat (~50k messages) uses about 10 GB of RAM. If you load more than the current record, let me know and add yourself to the leader board.

WhatSoup Largest Chat Leader Board

#     Name   Date         Message Count   Time
🥇    Eddy   2021-02-28   47,550          28,139 sec / 7.8 hrs
🥈    ?      ?            ?               ?
🥉    ?      ?            ?               ?

How long does it take to load/export?

It depends on the chat size and how performant your computer is; however, below is a ballpark range to expect. For large chats, I recommend setting your PC's sleep/power settings to OFF and running the script in the evening or before bed so it loads overnight.

# of msgs in chat history   Load time
500                         1 min
5,000                       12 min
10,000                      35 min
25,000                      3.5 hrs
50,000                      8 hrs

Why is it so slow?!

Basically, browsers become easily bottlenecked when loading massive amounts of rich data in WhatsApp Web, which is a WebSocket application that is constantly sending/receiving information and changing the HTML/DOM.

I'm open to ideas, but most of the things I tried didn't help performance (one of them is sketched after this list):

  • Chrome vs Firefox ❌
  • Headless browsing ❌
  • Disabling images ❌
  • Removing elements from DOM ❌
  • Changing 'experimental' browser settings to allocate more memory ❌
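
For example, the 'disabling images' experiment above can be wired up through Chrome preferences like this (a sketch, not the exact code I ran; per the list above, it did not improve load times):

    # Sketch: block image loading in Chrome via Selenium options
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option(
        "prefs", {"profile.managed_default_content_settings.images": 2}  # 2 = block images
    )
    driver = webdriver.Chrome(options=options)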

Can I...

  1. Use Firefox instead of Chrome? Yes, though not out of the box. There are a few Selenium differences and nuances to get it working, which I can share if there's interest (a rough setup is sketched after this list). TODO.

  2. Use headless? Yes, but I only got this to work with Firefox, not Chrome.

  3. Use WhatSoup to scrape a local WhatsApp HTML file? Yes, you'd just need to bypass a few functions from main() and load the HTML file into Selenium's driver, then run the scraping/exporting functions as shown below. If there's enough interest I can look into adding this to WhatSoup myself. TODO.

    # Load and scrape data from a local HTML file (use a file:/// URL, not a bare path)
    def local_scrape(driver):
        driver.get('file:///C:/your-WhatSoup-dir/source.html')
        scraped = scrape_chat(driver)
        scrape_is_exported("source", scraped)
    
  4. Contribute to WhatSoup? Please do!
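
For reference, a rough Firefox/headless setup might look like the sketch below. Assumptions: geckodriver is installed and on PATH, and the profile path points to a Firefox profile that is already logged in to WhatsApp Web; the Chrome-specific parts of whatsoup.py would still need adjusting, so this is not a drop-in replacement:

    # Sketch only: Firefox with an existing profile, optionally headless
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")                       # headless only worked for me with Firefox
    options.add_argument("-profile")
    options.add_argument("/path/to/your/firefox-profile")   # hypothetical path, replace with your own

    driver = webdriver.Firefox(options=options)
    driver.get("https://web.whatsapp.com/")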

Comments
  • Unable to locate element: {

    Unable to locate element: {"method":"css selector","selector":"._3Tw1q"}

    Hello, I'm having a problem running it on Windows.

    C:\Windows\system32>python C:\Users\Kanna\WhatSoup\whatsoup.py
    [9512:4340:0227/163848.421:ERROR:upgrade_util_win.cc(73)] IProcessLauncher::LaunchCmdElevated failed; hr = 80004002
    [9512:2152:0227/163848.451:ERROR:login_database.cc(654)] Password store database is too new, kCurrentVersionNumber=28, GetCompatibleVersionNumber=29
    [9512:2152:0227/163848.451:ERROR:password_store_default.cc(39)] Could not create/open login database.
    DevTools listening on ws://127.0.0.1:55297/devtools/browser/6e5e80ed-295c-4c29-85e4-45131568fd88
    Success! WhatsApp finished loading and is ready.
    Traceback (most recent call last):
      File "C:\Users\Kanna\WhatSoup\whatsoup.py", line 1008, in <module>
        main()
      File "C:\Users\Kanna\WhatSoup\whatsoup.py", line 29, in main
        chats = get_chats(driver)
      File "C:\Users\Kanna\WhatSoup\whatsoup.py", line 183, in get_chats
        name_of_chat = selected_chat.find_element_by_class_name(
      File "C:\Users\Kanna\AppData\Local\Programs\Python\Python39\lib\site-packages\selenium\webdriver\remote\webelement.py", line 398, in find_element_by_class_name
        return self.find_element(by=By.CLASS_NAME, value=name)
      File "C:\Users\Kanna\AppData\Local\Programs\Python\Python39\lib\site-packages\selenium\webdriver\remote\webelement.py", line 658, in find_element
        return self._execute(Command.FIND_CHILD_ELEMENT,
      File "C:\Users\Kanna\AppData\Local\Programs\Python\Python39\lib\site-packages\selenium\webdriver\remote\webelement.py", line 633, in _execute
        return self._parent.execute(command, params)
      File "C:\Users\Kanna\AppData\Local\Programs\Python\Python39\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
        self.error_handler.check_response(response)
      File "C:\Users\Kanna\AppData\Local\Programs\Python\Python39\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"._3Tw1q"}
      (Session info: chrome=87.0.4280.66)

    It opens Chrome and opens WhatsApp Web, but it does nothing to the page itself.

    bug 
    opened by kannadivinorum 4
  • Start script with an input argument to scrape only desired chat without loading up all users

    Start script with an input argument to scrape only desired chat without loading up all users

    Hi, I was wondering if I could directly load a chat for a desired user to scrape when I already know the name of the person/group. I have a lot of old chats/groups etc. and the script mostly breaks down while loading up the contacts, with exceptions like:

    raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"span"}
      (Session info: chrome=90.0.4430.212)
    
    
    opened by ahrizvi 1
  • anyone that has a good fork?

    anyone that has a good fork?

    Has anyone who forked the project been able to solve the problems? My error says "executable_path has been deprecated, please pass in a Service object".

    opened by joshiors 1
  • TypeError: argument of type 'NoneType' is not iterable

    TypeError: argument of type 'NoneType' is not iterable

    Although I have followed all the steps mentioned, I am still getting this error.

    File "whatsoup.py", line 1104, in main() File "whatsoup.py", line 21, in main driver = setup_selenium() File "whatsoup.py", line 90, in setup_selenium executable_path=DRIVER_PATH, options=options) File "C:\Users\pk199\Desktop\final-project\Other\WhatsApp-Scrape\WhatSoup\env\lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 73, in init self.service.start() File "C:\Users\pk199\Desktop\final-project\Other\WhatsApp-Scrape\WhatSoup\env\lib\site-packages\selenium\webdriver\common\service.py", line 76, in start stdin=PIPE) File "C:\Users\pk199\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 756, in init restore_signals, start_new_session) File "C:\Users\pk199\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 1100, in _execute_child args = list2cmdline(args) File "C:\Users\pk199\AppData\Local\Programs\Python\Python37\lib\subprocess.py", line 511, in list2cmdline
    needquote = (" " in arg) or ("\t" in arg) or not arg TypeError: argument of type 'NoneType' is not iterable.

    opened by Purushottam-BCA 0
  • Not scraping the text

    Not scraping the text

    Hey there, the only problem is that it does everything, but when I select CSV or any other format it creates the file, yet when I open it the file does not have any content. Please help me.

    opened by amitvyas17 0
  • Message: no such element: Unable to locate element: {

    Message: no such element: Unable to locate element: {"method":"css selector","selector":"span"}

    • I ran into this issue on Windows as well as macOS
    • My chrome version is 89.0.4389.82
    • Python version : Python 3.8.2
    • Here is the trace:
    ❯ python3 whatsoup.py
    Success! WhatsApp finished loading and is ready.
    Traceback (most recent call last):
      File "whatsoup.py", line 1099, in <module>
        main()
      File "whatsoup.py", line 30, in main
        chats = get_chats(driver)
      File "whatsoup.py", line 212, in get_chats
        last_chat_msg = last_chat_msg_element.find_element_by_tag_name(
      File "/Users/xxx/opt/WhatSoup/env/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 305, in find_element_by_tag_name
        return self.find_element(by=By.TAG_NAME, value=name)
      File "/Users/xxx/opt/WhatSoup/env/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 658, in find_element
        return self._execute(Command.FIND_CHILD_ELEMENT,
      File "/Users/xxx/opt/WhatSoup/env/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 633, in _execute
        return self._parent.execute(command, params)
      File "/Users/xxx/opt/WhatSoup/env/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
        self.error_handler.check_response(response)
      File "/Users/xxx/opt/WhatSoup/env/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"span"}
      (Session info: chrome=89.0.4389.82)
    

    The script opens Chrome and starts going through messages, then crashes randomly at different messages. Language is set to English.

    bug 
    opened by oddtazz 13
  • Language/locale differences from en-US will raise an exception at various points

    Language/locale differences from en-US will raise an exception at various points

    Issue

    Various exceptions are raised when WhatsApp settings are set to anything other than English because there are a few areas in WhatSoup that depend on English characters/words. The date/time formats for non-English settings are likely different as well and also need to be revised with a more flexible solution such as dateutil.

    Temporary workaround

    Set WhatsApp settings on the phone to use English as the language before running the script. It can be changed back after scraping/exporting a chat.

    Issue details

    WhatSoup areas that depend on English language/locale:

    1. Identifying 'Search results' element after searching for a specific chat
    2. Loading all messages in a selected chat, has an xpath containing 'Message list'
    3. Finding sender when a message does not contain text, has a condition for 'Voice message'
    4. Determining if vCard/VCF media is in a message, has conditions for 'Message' and 'Add to a group'
    5. Date/time string formatting, which all assumes the MM/DD/YYYY HH:MM AM/PM format, while real chats can vary (YYYY-MM-DD, A.M./P.M., etc.); a dateutil sketch follows this list
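
    For item 5, a more flexible parse with dateutil might look like the sketch below (dateutil is not currently a WhatSoup dependency, so treat this as an illustration only):

    # Sketch: dateutil infers several timestamp layouts without a fixed format string
    from dateutil import parser

    print(parser.parse("02/14/2021, 02:04 PM"))  # the en-US format WhatSoup expects today
    print(parser.parse("2021-02-14 14:04"))      # an ISO-style variant, parsed to the same moment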

    Identifying search results

    # Look for the unique class that holds 'Search results.'
    WebDriverWait(driver, 5).until(expected_conditions.presence_of_element_located(
           (By.XPATH, "//*[@id='pane-side']/div[1]/div/div[contains(@aria-label,'Search results.')]")))
    

    Loading all messages

    # Set focus to chat window (xpath == div element w/ aria-label set to 'Message list. Press right arrow key...')
    message_list_element = driver.find_element_by_xpath(
      "//*[@id='main']/div[3]/div/div/div[contains(@aria-label,'Message list')]")
    

    Finding sender when a message does not contain text

    # Last char in aria-label is always colon after the senders name
    if span.get('aria-label') != 'Voice message':
      return span.get('aria-label')[:-1]
    

    Determining if vCard/VCF media is in a message

    # Check if 'Message' is in the title (full title would be for example 'Message Bob Ross')
    if 'Message' in button.get('title'):
      # Next sibling should always be the 'Add to a group' button
      if button.nextSibling:
        if button.nextSibling.get('title') == 'Add to a group':
          return True
    
    bug 
    opened by eddyharrington 1