Scraping malaysianpaygap & Extracting data from the Instagram posts

Recently @malaysianpaygap has become quite famous as a platform that enables workers throughout Malaysia to anonymously share their salaries with other Malaysians. It's a great initiative, and I fully support ensuring that Malaysians are not taken advantage of by companies and get a liveable wage (especially when inflation is sky high).

NOTE: If you just want the data then you can download the zipped folder from here.

How to run

Run the following to set up the conda environment:

  conda create --name pay python=3.7
  conda activate pay
  pip install -r requirements.txt

Next we will need to scrape all the data from Instagram manually using BeautifulSoup! Just kidding, I am too lazy, so I will be using Instaloader to do all the heavy lifting for me. The conda environment will already have it installed for you.

# you might need to pass in your username to log in
# the profile name is passed directly; a literal `profile` argument is treated as another profile name
instaloader --login=USERNAME malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

This should create the following directory structure:

|-- malaysianpaygap
|   |-- 2022
|   |   |-- CaRp-1uPh8l.jpg                    # image
|   |   |-- CaRp-1uPh8l.json.xz
|   |   |-- CaRp-1uPh8l.txt                    # text data which was specified under --post-metadata-txt
|   |   |-- CaRp-1uPh8l_comments.json          # all the comments
|   |   |-- CaT5MguPpDI.jpg
|   |   |-- CaT5MguPpDI.json.xz
|   |-- 2022-02-27_04-58-58_UTC_profile_pic.jpg
|   |-- id
|   `-- malaysianpaygap_47523401972.json.xz
|-- requirements.txt
|-- scripts
|   `-- entrypoint.sh
`-- src
    |-- __init__.py
    |-- extract_text_images.py
    |-- main.py
    |-- preprocess_comments.py
    `-- preprocess_images.py

NOTE: Please do NOT change the directory structure; doing so will break the entire pipeline.
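Since the pipeline depends on this layout, it can help to check it before running anything. Below is a minimal sketch (not part of the repo) that groups the downloaded files by shortcode and lists posts that have no image. The names `group_by_shortcode` and `posts_missing_image` are made up for illustration, and it assumes one image per post, as in the tree above.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List

# File suffixes that appear per post in the directory tree above.
KNOWN_SUFFIXES = ("_comments.json", ".json.xz", ".jpg", ".txt")


def group_by_shortcode(profile_dir: Path) -> Dict[str, List[str]]:
    """Map each post shortcode to the sorted list of suffixes found for it."""
    groups = defaultdict(list)
    # Posts live one level down, in per-year folders such as malaysianpaygap/2022/.
    for path in sorted(profile_dir.glob("*/*")):
        if not path.is_file():
            continue
        for suffix in KNOWN_SUFFIXES:
            if path.name.endswith(suffix):
                shortcode = path.name[: -len(suffix)]
                groups[shortcode].append(suffix)
                break
    return {code: sorted(kinds) for code, kinds in groups.items()}


def posts_missing_image(profile_dir: Path) -> List[str]:
    """Return the shortcodes that have metadata files but no .jpg."""
    groups = group_by_shortcode(profile_dir)
    return sorted(code for code, kinds in groups.items() if ".jpg" not in kinds)
```

Calling `posts_missing_image(Path("malaysianpaygap"))` from the repo root should return an empty list if every downloaded post has its image.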

You should have everything ready to run the preprocessing scripts that I have made! I have a bash script that runs everything in the correct order.

# make bash script runnable
chmod +x scripts/entrypoint.sh
bash scripts/entrypoint.sh

You should see the following output:

2022-03-02 22:59:54.012 | INFO     | src.preprocess_comments:main_preprocess_comments:83 - Running preprocess_comments
2022-03-02 22:59:56.276 | INFO     | src.preprocess_comments:main_preprocess_comments:110 - DataFrame saved to /Users/yravindranath/pay/data/comments.csv
2022-03-02 22:59:56.277 | INFO     | src.preprocess_comments:main_preprocess_comments:111 - Completed preprocess_comments
2022-03-02 22:59:57.537 | INFO     | src.preprocess_images:main_preprocess_images:140 - Running preprocess_images
2022-03-02 22:59:57.840 | INFO     | src.preprocess_images:main_preprocess_images:160 - DataFrame saved to /Users/yravindranath/pay/data/posts.csv
2022-03-02 22:59:57.841 | INFO     | src.preprocess_images:main_preprocess_images:161 - Completed preprocess_images
2022-03-02 22:59:59.099 | INFO     | src.extract_text_images:main_extract_text_images:54 - Running extract_text_images
Pandas Apply: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [02:09<00:00,  1.23it/s]
2022-03-02 23:02:25.087 | INFO     | src.extract_text_images:main_extract_text_images:70 - DataFrame saved to /Users/yravindranath/pay/data/posts_text.csv
2022-03-02 23:02:25.088 | INFO     | src.extract_text_images:main_extract_text_images:71 - Completed extract_text_images

A new directory data will be created like so:

|-- data
|   |-- comments.csv
|   |-- comments.json
|   |-- posts.csv
|   |-- posts_text.csv
|   `-- processed_images
|       |-- CaRp-1uPh8l.jpg
|       |-- CaT5MguPpDI.jpg
|       |-- CaT6d2Yve5X.jpg

In the next section I will go over the data that was created.

Data

comments.csv - Contains all the comments under a post

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2816 entries, 0 to 2815
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   image_ids        2816 non-null   object
 1   comment_paths    2816 non-null   object
 2   id               2814 non-null   float64
 3   created_at       2814 non-null   float64
 4   text             2814 non-null   object
 5   likes_count      2814 non-null   float64
 6   answers          2814 non-null   object
 7   id.1             2814 non-null   float64 # ID of the user who commented
 8   is_verified      2814 non-null   object
 9   profile_pic_url  2814 non-null   object
 10  username         2814 non-null   object
dtypes: float64(4), object(7)
memory usage: 242.1+ KB
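To make the columns above easier to read, here is a rough sketch of how one `*_comments.json` file could be flattened into rows. It is not the code in `src/preprocess_comments.py`. It assumes each comment is a JSON object with a nested `owner` object, which I am inferring from the `id.1`, `is_verified`, `profile_pic_url` and `username` columns. The names `flatten_comment`, `load_comment_rows` and `user_id` are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional


def as_text(value: Any) -> Optional[str]:
    """Keep IDs as strings so they are never coerced to floats later on."""
    return None if value is None else str(value)


def flatten_comment(comment: Dict[str, Any], image_id: str) -> Dict[str, Any]:
    """Turn one nested comment record into a flat row."""
    owner = comment.get("owner") or {}  # assumed nesting, see the note above
    return {
        "image_ids": image_id,
        "id": as_text(comment.get("id")),
        "created_at": comment.get("created_at"),
        "text": comment.get("text"),
        "likes_count": comment.get("likes_count"),
        "answers": comment.get("answers", []),
        "user_id": as_text(owner.get("id")),  # shows up as id.1 in comments.csv
        "is_verified": owner.get("is_verified"),
        "profile_pic_url": owner.get("profile_pic_url"),
        "username": owner.get("username"),
    }


def load_comment_rows(comments_path: Path) -> List[Dict[str, Any]]:
    """Read <shortcode>_comments.json and return one flat row per comment."""
    image_id = comments_path.name[: -len("_comments.json")]
    with comments_path.open(encoding="utf-8") as handle:
        comments = json.load(handle)
    return [flatten_comment(comment, image_id) for comment in comments]
```

IDs are kept as strings here because integers above 2**53 cannot be represented exactly as float64, which is the dtype reported for `id` and `id.1` above. That may matter for long Instagram IDs.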

posts_text.csv - Contains all the posts with the text extracted from their images using OCR (Optical Character Recognition)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hashtags     159 non-null    object
 1   captions     139 non-null    object
 2   likes        159 non-null    int64
 3   comments     159 non-null    int64
 4   image_ids    159 non-null    object
 5   image_paths  159 non-null    object
 6   image_text   159 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.8+ KB
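The `captions`, `likes`, `comments` and `hashtags` columns map onto the `--post-metadata-txt` pattern used in the download command. Here is a small sketch of how one of those `.txt` files could be parsed back into fields. This is my own illustration, not the code in `src/preprocess_images.py`, and `parse_post_metadata` is a made-up name. I am not certain whether the `\n` in the pattern ends up as a real line break or as a literal backslash-n in the file, so the sketch accepts both.

```python
import re
from typing import Any, Dict

# Trailing "<n> likes" and "<n> comments" fields. The separator may be a real
# newline or a literal backslash followed by "n" (assumption, see above).
TRAILER = re.compile(r"(?:\\n|\n)(\d+) likes(?:\\n|\n)(\d+) comments(?:\\n|\n)?\s*\Z")
HASHTAG = re.compile(r"#(\w+)")
PREFIX = "Caption: "


def parse_post_metadata(text: str) -> Dict[str, Any]:
    """Split the text written by --post-metadata-txt into its fields."""
    match = TRAILER.search(text)
    if match is None:
        raise ValueError("text does not end with the likes/comments fields")
    # Everything before the trailer is the caption, which may span several lines.
    caption = text[: match.start()]
    if caption.startswith(PREFIX):
        caption = caption[len(PREFIX):]
    return {
        "captions": caption or None,
        "hashtags": HASHTAG.findall(caption),
        "likes": int(match.group(1)),
        "comments": int(match.group(2)),
    }
```

Anchoring on the end of the text rather than splitting on the first line break keeps multi-line captions intact.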

FAQ

I am getting a `ModuleNotFoundError: No module named 'src'` error. What can I do?

This is an issue with your PYTHONPATH. Setting it to something like export PYTHONPATH="${PYTHONPATH}:/Users/yravindranath/REPO" (replace the path with the location of your clone of the repo) should fix it.
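If you would rather not touch your shell environment, the same effect can be had from inside Python. This is only a sketch of the idea and is not something the repo does. `add_repo_root_to_path` is a hypothetical helper, and you pass it the folder that contains `src/`.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(repo_root: Path) -> None:
    """Prepend repo_root to sys.path so that `import src...` can be resolved."""
    root = str(repo_root.resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
```

For example, calling `add_repo_root_to_path(Path.cwd())` at the top of a script started from the repo root should make `from src import ...` importable for that process only.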

Optimizations

Currently the entire project isn't reproducible, so I will dockerise it soon to allow anyone to run it locally without any issues.
If you look closely, there is a slow apply() used for binarizing the images and extracting their text using OCR. I am already using swifter to speed it up.
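For anyone curious what another way to speed up the per-image step might look like, here is a sketch that uses a thread pool from the standard library instead of swifter. It is not what the repo does. `map_in_threads` is a made-up helper and `fake_ocr` is a hypothetical stand-in for the real binarize-then-OCR function, which I am assuming takes one image path and returns its text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def map_in_threads(
    func: Callable[[str], str],
    image_paths: Iterable[str],
    max_workers: int = 4,
) -> List[str]:
    """Apply func to every path concurrently, keeping results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, image_paths))


def fake_ocr(image_path: str) -> str:
    # Hypothetical stand-in for the real binarize-then-OCR step.
    return "text from " + image_path
```

Threads only help if the per-image work releases the GIL, for example when the OCR runs in native code or an external process. For pure-Python work a process pool would be the thing to try instead.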

Comments

What is `posts.csv`? How do I create one?
I am following the 3-step process in the README.md.

During the last step, bash scripts/entrypoint.sh, which runs python ./src/main.py main-preprocess-comments,

I get this error:

File "/bot_dir/src/preprocess_comments.py", line 88, in main_preprocess_comments
    df = pd.read_csv(data_path.as_posix())
FileNotFoundError: [Errno 2] No such file or directory: '/bot_dir/data/posts.csv'

It seems to assume the existence of ./data/posts.csv by default.

Please advise. Thanks.
opened by Nkzlxs 3

Instagram login issue

I have been trying to log in with the command `instaloader --login=USERNAME profile venuuuuuu_0000--dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}`, but after I enter the correct password it gives the error "Login error: Wrong password."

opened by Venuri-De-Silva 3

Profile profile does not exist issue

Description: This is also related to Instaloader. With Instaloader version 4.9.2, the `profile` keyword is no longer required. Simply pass the name of the profile you want to download. In our case:

instaloader --login=USERNAME malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

Error:

Session file does not exist yet - Logging in.
Enter Instagram password for <USERNAME>: 
Logged in as <USERNAME>.
Profile profile does not exist.
Trying again anonymously, helps in case you are just blocked.
Profile profile does not exist.
Stored ID 47523401972 for profile malaysianpaygap.
[1/1] Downloading profile malaysianpaygap     
Retrieving posts from profile malaysianpaygap.
...

Solution: Modify the README.md file (PR will be created). Use:

instaloader --login=USERNAME malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

Result:

Session file does not exist yet - Logging in.
Enter Instagram password for <USERNAME>: 
Logged in as <USERNAME>.
Stored ID 47523401972 for profile malaysianpaygap.
[1/1] Downloading profile malaysianpaygap     
Retrieving posts from profile malaysianpaygap.
...

opened by zul-m 0

"window._sharedData" error
Description: This is related to Instaloader. You can check their discussions at #1665, #1646, and #1553. I think it is related to updates to the Instagram API and the web interface.

Error: Could not find "window._sharedData" in html response. [retrying; skip with ^C]

Solutions:

Change the Instaloader version to the latest (4.9.2) in the requirements.txt file (I will create a PR for this), OR

Run the python3 -m pip install instaloader -U command to upgrade Instaloader.
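To confirm which version ended up installed after upgrading, something like the sketch below could be used. The helper names are made up for illustration, and the comparison only handles plain dotted releases such as 4.9.2, not pre-release strings. `importlib.metadata` needs Python 3.8 or newer, so on the Python 3.7 environment from this README `installed_version` simply returns None.

```python
from typing import Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Convert a dotted release string such as "4.9.2" into (4, 9, 2)."""
    return tuple(int(part) for part in version.split("."))


def is_at_least(installed: str, minimum: str) -> bool:
    """Return True if the installed release is the minimum release or newer."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(package: str) -> Optional[str]:
    """Look up the installed version of a package, or None if unavailable."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None
```

For example, `installed_version("instaloader")` gives the installed release string, which can then be passed to `is_at_least(..., "4.9.2")`.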
opened by zul-m 0

Separate out answer data from comments

Currently answers are a column within the comments, but they should instead be separated out into another CSV file so that both can be joined on the comment_id.
enhancement

opened by yudhiesh 0
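A rough sketch of what that separation could look like is below. It has not been implemented in the repo. It assumes it works on the parsed comment records, where `answers` is still a list of reply objects rather than the stringified column in comments.csv. `split_answers` is a hypothetical name, and `comment_id` is the join key proposed in the issue.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_answers(comments: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Split nested comments into a comments table and an answers table."""
    comment_rows = []  # type: List[Row]
    answer_rows = []  # type: List[Row]
    for comment in comments:
        # The comment row keeps every field except the nested answers.
        comment_rows.append({key: value for key, value in comment.items() if key != "answers"})
        for answer in comment.get("answers") or []:
            answer_row = dict(answer)
            # Foreign key pointing back at the parent comment.
            answer_row["comment_id"] = comment["id"]
            answer_rows.append(answer_row)
    return comment_rows, answer_rows
```

Each answer keeps its own `id` and gains a `comment_id`, so the two tables can be written to separate CSV files and joined later.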

Owner

Yudhiesh Ravindranath
