Materials to reproduce our findings in our stories, "Amazon Puts Its Own 'Brands' First Above Better-Rated Products" and "When Amazon Takes the Buy Box, it Doesn’t Give it up"

Overview

Amazon Brands and Exclusives

This repository contains code to reproduce the findings featured in our stories "Amazon Puts Its Own 'Brands' First Above Better-Rated Products" and "When Amazon Takes the Buy Box, it Doesn’t Give it up" from our series Amazon's Advantage.

Our methodology is described in "How We Analyzed Amazon’s Treatment of Its Brands in Search Results".

Data that we collected and analyzed is in the data folder.
To use the full input dataset (which is not hosted here), please refer to Download data.

Jupyter notebooks used for data preprocessing and analysis are available in the notebooks folder.
Descriptions for each notebook are outlined in the Notebooks section below.

Installation

Python

Make sure you have Python 3.6+ installed. We used Miniconda to create a Python 3.8 virtual environment.

Then install the Python packages:
pip install -r requirements.txt

Notebooks

These notebooks are intended to be run sequentially, but they are not dependent on one another. If you want a quick overview of the methodology, you only need the notebooks marked with an asterisk (*).

0-data-preprocessing.ipynb

This notebook parses Amazon search results and Amazon product pages, and produces the intermediary datasets (data/output/datasets/) used in ranking analysis and random forest classifiers.

1-data-analysis-search-results.ipynb *

Contains the bulk of the ranking analysis and the statistics reported in the data analysis.

2-random-forest-analysis.ipynb *

Engineers features for the training set, finds optimal hyperparameters, and performs the ablation study on a random forest model. The most predictive feature is verified using three separate methods.

3-survey-results.ipynb

Visualizes the survey results from our national panel of 1,000 adults.

4-limiations-product-page-changes.ipynb

Analysis of how often the Buy Box's default shipper and seller change between Amazon and a third party.
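As a rough illustration of what "how often the seller changes" can mean, here is a minimal sketch that counts changes between consecutive observations of a product's Buy Box. The labels "amazon" and "third_party" and the function names are hypothetical; the notebook itself may define and measure changes differently.

```python
def count_changes(observations):
    """Count how many times consecutive observations differ.

    `observations` is a chronologically ordered list of labels,
    for example "amazon" or "third_party" (hypothetical labels).
    """
    return sum(
        1 for previous, current in zip(observations, observations[1:])
        if previous != current
    )


def share_with_any_change(observations_by_product):
    """Return the fraction of products whose label changed at least once."""
    if not observations_by_product:
        return 0.0
    changed = sum(
        1 for observations in observations_by_product.values()
        if count_changes(observations) > 0
    )
    return changed / len(observations_by_product)
```

For example, a product observed as amazon, amazon, third_party, amazon has two changes under this definition.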

utils.py

Contains convenience functions used in the notebooks.

parsers.py

Contains parsers for search results and product pages.

Data

This directory is where inputs, intermediaries, and outputs are saved.

data
├── output
│   ├── figures
│   ├── tables
│   └── datasets
│       ├── amazon_private_label.csv.xz
│       ├── products.csv.xz
│       ├── searches.csv.xz
│       ├── training_set.csv.gz
│       ├── pairwise_training_set.csv.gz
│       └── trademarks
└── input
    ├── combined_queries_with_source.csv
    ├── best_sellers
    ├── generic_search_terms
    ├── search-private-label
    ├── search-selenium
    ├── search-selenium-our-brands-filter_
    ├── selenium-products
    ├── seller_central
    └── spotcheck

data/output/ contains tables, figures, and datasets used in our methodology.

data/output/datasets/amazon_private_label.csv.xz is our dataset of Amazon brands, exclusives, and proprietary electronics (N=137,428 products). We use each product's unique ID (called an ASIN) to identify Amazon's own products in our methodology.
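A minimal sketch of using ASINs to identify Amazon's own products, using only the Python standard library. The column name "asin" and both function names are assumptions made for illustration; check the header of the CSV for the actual column name.

```python
import csv
import lzma


def load_asins(path, asin_column="asin"):
    """Return the set of ASINs listed in an xz-compressed CSV.

    `asin_column` is an assumed column name, not confirmed by this README.
    """
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return {row[asin_column] for row in csv.DictReader(handle)}


def is_amazon_product(asin, amazon_asins):
    """True if the ASIN appears in the set of Amazon's own products."""
    return asin in amazon_asins
```

Usage would look like amazon_asins = load_asins("data/output/datasets/amazon_private_label.csv.xz"), followed by is_amazon_product(some_asin, amazon_asins) for each product of interest.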

data/output/datasets/trademarks contains a dataset of trademarked brands registered by Amazon. The data was collected from USPTO.gov and Amazon. We included an additional README with the exact steps we took to build this dataset in the directory.

data/output/datasets/searches.csv.xz contains parsed search result pages from top and generic searches (N=187,534 product positions). You can filter this by search_term to get each of these subsets, using the terms listed in data/input/combined_queries_with_source.csv.
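The filtering step could be sketched as follows with the standard library. The search_term column comes from the description above; the "source" column name and the example value "top" are assumptions about combined_queries_with_source.csv, so adjust them to match the real header.

```python
import csv
import lzma


def load_terms(path, source, term_column="search_term", source_column="source"):
    """Collect the search terms tagged with `source` in an uncompressed CSV.

    `source_column` is an assumed column name.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return {
            row[term_column]
            for row in csv.DictReader(handle)
            if row[source_column] == source
        }


def filter_searches(path, terms, term_column="search_term"):
    """Yield rows of an xz-compressed CSV whose search term is in `terms`."""
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            if row[term_column] in terms:
                yield row
```

For instance, terms = load_terms("data/input/combined_queries_with_source.csv", "top") followed by filter_searches("data/output/datasets/searches.csv.xz", terms) would stream only the rows for that subset without loading the whole file into memory.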

data/output/datasets/products.csv.xz contains parsed product pages from the searches above (N=157,405 product pages).

data/output/datasets/training_set.csv.gz contains the metadata used to train and evaluate the random forest. Additional feature engineering is conducted in notebooks/2-random-forest-analysis.ipynb, which produces pairwise_training_set.csv.gz.
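To give a sense of what a pairwise dataset can look like, here is a generic sketch that turns products from one search into pairs described by feature differences. This is one common construction, not necessarily what the notebook does; the field names ("rank", "stars", "reviews") and the function name are hypothetical.

```python
from itertools import combinations


def make_pairs(products, feature_names, rank_key="rank"):
    """Build one row per unordered pair of products from a single search.

    Each row holds the difference of every feature (first minus second)
    and a label saying whether the first product was ranked higher,
    where a smaller rank number means a higher position.
    """
    pairs = []
    for first, second in combinations(products, 2):
        row = {f"{name}_diff": first[name] - second[name] for name in feature_names}
        row["first_ranked_higher"] = first[rank_key] < second[rank_key]
        pairs.append(row)
    return pairs
```

With n products in a search this yields n * (n - 1) / 2 rows, so a pairwise set grows much faster than the original training set.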

Every file in data/input except combined_queries_with_source.csv is stored in AWS S3 and is not hosted in this repository.

Download Data

The raw inputs for data/input are stored at s3://markup-public-data/amazon-brands/.

If you trust us, you can download the HTML and JSON files in data/input using this script: sh data/download_input_data.sh

Note that downloading these files is not necessary to run the notebooks and see the full results.

data/input/search-selenium/ (12 GB uncompressed)

First page of search results collected in January 2021. Download the HTML files search-selenium.tar.xz (238 MB compressed) here.

data/input/selenium-products/ (220 GB uncompressed)

Product pages collected in February 2021. Download the HTML files selenium-products.tar.xz (9 GB compressed) here.

data/input/search-selenium-our-brands-filter_/ (35 GB uncompressed)

Search results filtered by "our brands". Contains every page of search results. Download search-selenium-our-brands-filter_.tar.xz (403 MB compressed) here.

data/input/search-private-label/ (25 GB uncompressed)

API responses for search results filtered down to products Amazon identifies as "our brands". Contains paginated API results. Download the JSON files search-private-label.tar.xz (402 MB compressed) here.

data/input/seller_central/ (105 MB)

Seller Central data for Q4 2020. Download the CSV file All_Q4_2020.csv.xz (105 MB compressed) here.

data/input/best_sellers/ (4 GB)

Amazon's best sellers under the category "Amazon Devices & Accessories". Download the HTML files best_sellers.tar.xz (60 MB compressed) here.

data/input/spotcheck/ (4 GB)

A sub-sample of product pages for spot-checking Buy Box changes. Download the HTML files spotcheck.tar.xz (159 MB compressed) here.
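If you download one of the .tar.xz archives by hand, it can be unpacked with Python's standard library. The sketch below is an illustration rather than part of the repository: the function name is hypothetical, and it refuses archives whose members would land outside the destination directory. Keep the uncompressed sizes listed above in mind before extracting.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract an xz-compressed tar archive into dest_dir.

    Raises ValueError if any member would be written outside dest_dir.
    Returns the resolved destination path.
    """
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path, mode="r:xz") as archive:
        for member in archive.getmembers():
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        archive.extractall(dest_root)
    return dest_root
```

For example, extract_archive("spotcheck.tar.xz", "data/input") would unpack the spot-check pages; whether each archive already contains its top-level directory is something to verify after extraction.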
