Scrapping malaysianpaygap & Extracting data from the Instagram posts

Overview

Scrapping malaysianpaygap & Extracting data from the posts

Recently @malaysianpaygap has gotten quite famous as a platform that enables workers throughout Malaysia to anonymously share their salaries amongst other Malaysians. Its a great initiative and I am fully supportive behind ensuring that Malaysians are not taken advantage of by companies and get a liveable wage(especially when inflation is sky high).

NOTE: If you just want the data then you can download the zipped folder from here.

How to run

  1. Run the following to get conda environment setup
  conda create --name pay python=3.7
  conda activate pay
  pip install -r requirements.txt
  1. Next we will need to scrape all the data from Instagram manually using BeautifulSoup! Just kidding I am too lazy so I will be using InstaLoader to do all the heavy lifting for me. The conda environment will have it installed for you already.
# you might need to pass in your username to login
instaloader --login=USERNAME profile malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

This should create the following directory structure:

|-- malaysianpaygap
|   |-- 2022
|   |   |-- CaRp-1uPh8l.jpg                    # image
|   |   |-- CaRp-1uPh8l.json.xz
|   |   |-- CaRp-1uPh8l.txt                    # text data which was specified under --post-metadata-txt
|   |   |-- CaRp-1uPh8l_comments.json          # all the comments
|   |   |-- CaT5MguPpDI.jpg
|   |   |-- CaT5MguPpDI.json.xz
|   |-- 2022-02-27_04-58-58_UTC_profile_pic.jpg
|   |-- id
|   `-- malaysianpaygap_47523401972.json.xz
|-- requirements.txt
|-- scripts
|   `-- entrypoint.sh
`-- src
    |-- __init__.py
    |-- extract_text_images.py
    |-- main.py
    |-- preprocess_comments.py
    `-- preprocess_images.py

NOTE: Please do NOT change the directory structure, it will break the entire pipeline.

  1. You should have everything ready to run the preprocessing scripts that I have made! I have a bash script that runs everything in the correct order.
# make bash script runnable
chmod +x scripts/entrypoint.sh
bash scripts/entrypoint.sh

You should see the following output:

2022-03-02 22:59:54.012 | INFO     | src.preprocess_comments:main_preprocess_comments:83 - Running preprocess_comments
2022-03-02 22:59:56.276 | INFO     | src.preprocess_comments:main_preprocess_comments:110 - DataFrame saved to /Users/yravindranath/pay/data/comments.csv
2022-03-02 22:59:56.277 | INFO     | src.preprocess_comments:main_preprocess_comments:111 - Completed preprocess_comments
2022-03-02 22:59:57.537 | INFO     | src.preprocess_images:main_preprocess_images:140 - Running preprocess_images
2022-03-02 22:59:57.840 | INFO     | src.preprocess_images:main_preprocess_images:160 - DataFrame saved to /Users/yravindranath/pay/data/posts.csv
2022-03-02 22:59:57.841 | INFO     | src.preprocess_images:main_preprocess_images:161 - Completed preprocess_images
2022-03-02 22:59:59.099 | INFO     | src.extract_text_images:main_extract_text_images:54 - Running extract_text_images
Pandas Apply: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [02:09<00:00,  1.23it/s]
2022-03-02 23:02:25.087 | INFO     | src.extract_text_images:main_extract_text_images:70 - DataFrame saved to /Users/yravindranath/pay/data/posts_text.csv
2022-03-02 23:02:25.088 | INFO     | src.extract_text_images:main_extract_text_images:71 - Completed extract_text_images

A new directory data will be created like so:

|-- data
|   |-- comments.csv
|   |-- comments.json
|   |-- posts.csv
|   |-- posts_text.csv
|   `-- processed_images
|       |-- CaRp-1uPh8l.jpg
|       |-- CaT5MguPpDI.jpg
|       |-- CaT6d2Yve5X.jpg

In the next section I will go over the data that was created.

Data

comments.csv - Contains all the comments under a post

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2816 entries, 0 to 2815
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   image_ids        2816 non-null   object
 1   comment_paths    2816 non-null   object
 2   id               2814 non-null   float64
 3   created_at       2814 non-null   float64
 4   text             2814 non-null   object
 5   likes_count      2814 non-null   float64
 6   answers          2814 non-null   object
 7   id.1             2814 non-null   float64 # ID of the user who commented
 8   is_verified      2814 non-null   object
 9   profile_pic_url  2814 non-null   object
 10  username         2814 non-null   object
dtypes: float64(4), object(7)
memory usage: 242.1+ KB

posts_text.csv - Contains all the posts with their text extracted through their image using OCR(Optical Character Recognition)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hashtags     159 non-null    object
 1   captions     139 non-null    object
 2   likes        159 non-null    int64
 3   comments     159 non-null    int64
 4   image_ids    159 non-null    object
 5   image_paths  159 non-null    object
 6   image_text   159 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.8+ KB

FAQ

I am getting a ModuleNotFoundError: No module named 'src' error what can I do?

This is an issue with your PYTHONPATH, setting it to something like export PYTHONPATH="${PYTHONPATH}:/Users/yravindranath/REPO" should fix it.

Optimizations

  1. So currently the entire project isn't repoducible therefore I will dockerise it soon and allow anyone to run it locally without any issues.
  2. If you notice there is a slow apply() used for binarizing the images and extracting the text from it using OCR. I am using swifter to speed it up as it is.
Comments
  • What is `posts.csv`? How do i create one?

    What is `posts.csv`? How do i create one?

    I am following the 3 step process according to the README.md

    during the last step, bash scripts/entrypoints.sh where python ./src/main.py main-preprocess-comments

    i get this error

    File "/bot_dir/src/preprocess_comments.py", line 88, in main_preprocess_comments
        df = pd.read_csv(data_path.as_posix())
        
    FileNotFoundError: [Errno 2] No such file or directory: '/bot_dir/data/posts.csv'
    

    Seems like it assume the existence of ./data/posts.csv by default.

    Please advice. thx

    opened by Nkzlxs 3
  • Instagram login issue

    Instagram login issue

    I have been trying to log in after the "instaloader --login=USERNAME profile venuuuuuu_0000--dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}" but after entering the correct password it gives an error "Login error: Wrong password."

    opened by Venuri-De-Silva 3
  • Profile profile does not exist issue

    Profile profile does not exist issue

    Descriptions: This is something to do with Instaloader as well. When using Instaloader version 4.9.2, profile is not required anymore. Simply put the profile name you want to download. For our case:

    instaloader --login=USERNAME malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}
    

    Error:

    Session file does not exist yet - Logging in.
    Enter Instagram password for <USERNAME>: 
    Logged in as <USERNAME>.
    Profile profile does not exist.
    Trying again anonymously, helps in case you are just blocked.
    Profile profile does not exist.
    Stored ID 47523401972 for profile malaysianpaygap.
    [1/1] Downloading profile malaysianpaygap     
    Retrieving posts from profile malaysianpaygap.
    ...
    

    Solution: Modify the README.md file (PR will be created). Use:

    instaloader --login=USERNAME malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}
    

    Result:

    Session file does not exist yet - Logging in.
    Enter Instagram password for <USERNAME>: 
    Logged in as <USERNAME>.
    Stored ID 47523401972 for profile malaysianpaygap.
    [1/1] Downloading profile malaysianpaygap     
    Retrieving posts from profile malaysianpaygap.
    ...
    
    opened by zul-m 0
  • "window._sharedData" error

    Descriptions: This is something to do with Instaloader. You can check their discussions at #1665, #1646, and #1553. I think it's related to updates on Instagram API and the web interface.

    Error: Could not find "window._sharedData" in html response. [retrying; skip with ^C]

    Solutions:

    1. Change the instaloader version to latest=4.9.2 on the requirements.txt file (I will create PR for this), OR
    2. Run python3 -m pip install instaloader -U command to upgrade Instaloader.
    opened by zul-m 0
  • Separate out answer data from comments

    Separate out answer data from comments

    Currently answers are a column within the comments but they should instead be separated out to another csv file and both can be joined on the comment_id.

    enhancement 
    opened by yudhiesh 0
Owner
Yudhiesh Ravindranath
Data Scientist @MoneyLion
Yudhiesh Ravindranath
Instagram bot that upload images for you which scrape posts from 9gag meme website or other Instagram users , which is 24/7 Automated Runnable.

Autonicgram Automates your Instagram posts by taking images from sites like 9gag or other Instagram accounts and posting it onto your page. Features A

Mastermind 20 Sep 17, 2022
Twitter-Scrapping - Tweeter tweets extracting using python

Twitter-Scrapping Twitter tweets extracting using python This project is to extr

Suryadeepsinh Gohil 2 Feb 4, 2022
null 4 Oct 28, 2021
An script where it logs in your instagram account and follows people and likes their posts

InstaFollower An script where it logs in your instagram account and follows people and likes their posts (uses the tags to fetch people) Requirements:

Bless 3 Nov 29, 2022
A Telegram bot to download posts, videos, reels, IGTV and a user profile picture from Instagram!

Telegram Bot A telegram bot to download media from Instagram! No API Key or Login Needed! Requirements You must have python installed (of course) You

Simon Farah 2 Apr 10, 2022
📷 Instagram Bot - Tool for automated Instagram interactions

InstaPy Tooling that automates your social media interactions to “farm” Likes, Comments, and Followers on Instagram Implemented in Python using the Se

Tim Großmann 13.5k Dec 1, 2021
An instagram bot developed in Python with Selenium that helps you get more Instagram followers.

instabot An instagram bot developed in Python with Selenium that helps you get more Instagram followers. Install You’ll need to have: Python Selenium

null 65 Nov 22, 2022
This Instagram app created as a clone of instagram.Developed during Moringa Core.

Instagram This Instagram app created as a clone of instagram.Developed during Moringa Core. AUTHOR By: Nyagah Isaac Description This web-app allows a

Nyagah Isaac 1 Nov 1, 2021
Unofficial instagram API, give you access to ALL instagram features (like, follow, upload photo and video and etc)! Write on python.

Instagram-API-python Unofficial Instagram API to give you access to ALL Instagram features (like, follow, upload photo and video, etc)! Written in Pyt

Vladimir Bezrukov 1 Nov 19, 2021
Projeto Informações Conta do Instagram - Instagram Account Information Project

Projeto Informações Conta do Instagram - Instagram Account Information Project Descrição - Description Projeto que exibe informações de perfil, lista

Thiago Souza 1 Dec 2, 2021
Instagram Brute force attack helps you to find password of an instagram account from your list of provided password.

Instagram Brute force attack Instagram Brute force attack helps you to find password of an instagram account from your list of provided password. Inst

Naman Raj Singh 1 Dec 27, 2021
Instagram-follower-bot - An Instagram follower bot written in Python

Instagram Follower Bot An Instagram follower bot written in Python. The bot follows the follower of which account you want. e.g. (You want to follow @

Aytaç Kaşoğlu 1 Dec 31, 2021
Instagram - Instagram Account Reporting Tool

Instagram Instagram Account Reporting Tool Installation On Termux $ apt update $

Aryan 6 Nov 18, 2022
Upload-Instagram - Auto Uploading Instagram Bot

###Instagram Uploading Bot### Download Python and Chrome browser pip install -r

byeonggeon sim 1 Feb 13, 2022
It connects to Telegram's API. It generates JSON files containing channel's data, including channel's information and posts.

It connects to Telegram's API. It generates JSON files containing channel's data, including channel's information and posts. You can search for a specific channel, or a set of channels provided in a text file (one channel per line.)

Esteban Ponce de Leon 75 Jan 2, 2023
A tool for extracting plain text from Wikipedia dumps

WikiExtractor WikiExtractor.py is a Python script that extracts and cleans text from a Wikipedia database dump. The tool is written in Python and requ

Giuseppe Attardi 3.2k Dec 31, 2022
A Telegram bot to extracting text from images. All languages supported.

OCR Bot A Telegram bot to extracting text from images. All languages supported. Deploy to Heroku Local Deploying Clone the repo git clone https://gith

null 6 Oct 21, 2022
Exports saved posts and comments on Reddit to a csv file.

reddit-saved-to-csv Exports saved posts and comments on Reddit to a csv file. Columns: ID, Name, Subreddit, Type, URL, NoSFW ID: Starts from 1 and inc

null 70 Jan 2, 2023
Telegram Bot to store Posts and Documents and it can Access by Special Links.

File-sharing-Bot Telegram Bot to store Posts and Documents and it can Access by Special Links. I Guess This Will Be Usefull For Many People..... ?? .

Code X Botz 1.2k Jan 8, 2023