Sequential Recommendation Datasets
This repository collects commonly used sequential recommendation datasets from recent research papers and provides a tool for downloading, preprocessing, and batch-loading them. The preprocessing method can be customized for the task at hand, for example short-term recommendation (including session-based recommendation) and long-short-term recommendation. A faster loading variant that integrates PyTorch's DataLoader is also provided.
Datasets
- Amazon-Books
- Amazon-Electronics
- Amazon-Movies
- Amazon-CDs
- Amazon-Clothing
- Amazon-Home
- Amazon-Kindle
- Amazon-Sports
- Amazon-Phones
- Amazon-Health
- Amazon-Toys
- Amazon-VideoGames
- Amazon-Tools
- Amazon-Beauty
- Amazon-Apps
- Amazon-Office
- Amazon-Pet
- Amazon-Automotive
- Amazon-Grocery
- Amazon-Patio
- Amazon-Baby
- Amazon-Music
- Amazon-MusicalInstruments
- Amazon-InstantVideo
- CiteULike
- FourSquare-NYC
- FourSquare-Tokyo
- Gowalla
- Lastfm1K
- MovieLens20M
- Retailrocket
- TaFeng
- Taobao
- Tmall
- Yelp
Install this tool
Stable version
pip install -U srdatasets --user
Latest version
pip install git+https://github.com/guocheng2018/sequential-recommendation-datasets.git --user
Download datasets
Run the command below to download datasets. As some datasets are not directly accessible, you will be warned to download them manually and place them in the location the tool specifies.
srdatasets download --dataset=[dataset_name]
To view the download and processing status of all datasets, run
srdatasets info
Process datasets
The generic processing command is
srdatasets process --dataset=[dataset_name] [--options]
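For example, a typical short-term setup using the options documented below might look like this (the dataset name and values are illustrative):
srdatasets process --dataset=Amazon-Books --task=short --input-len=5 --target-len=1 --session-interval=30 --min-session-len=2 --max-session-len=20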
Splitting options
Two dataset splitting methods are provided: user-based and time-based. User-based splitting is applied to every user's behavior sequence given the ratios of the validation and test sets, while time-based splitting is based on the dates of user behaviors. After splitting a dataset, two processed datasets are generated: one for development, which uses the validation set as the test set, and one for testing, which contains the full training set. A sketch of user-based splitting is given below.
--split-by User or time (default: user)
--test-split Proportion of test set to full dataset (default: 0.2)
--dev-split Proportion of validation set to full training set (default: 0.1)
NOTE: time-based splitting asks you to enter the number of days to hold out at the console; it prints the total number of days spanned by the dataset, since you may not know it in advance.
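To make the user-based method concrete, here is a minimal sketch of splitting one user's chronologically ordered sequence with the default ratios; split_user_sequence is illustrative and not part of srdatasets.
def split_user_sequence(sequence, test_split=0.2, dev_split=0.1):
    # Hold out the most recent interactions as the test set
    n_test = max(1, int(len(sequence) * test_split))
    train_full, test = sequence[:-n_test], sequence[-n_test:]
    # Carve the validation set out of the remaining training data
    n_dev = max(1, int(len(train_full) * dev_split))
    train, dev = train_full[:-n_dev], train_full[-n_dev:]
    # The development dataset evaluates on dev; the test dataset
    # trains on train_full and evaluates on test
    return train, dev, train_full, test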
Task related options
For the short-term recommendation task, you use the previous input-len items to predict the next target-len items. To make user interests more focused, user behavior sequences can also be cut into sessions if session-interval is given. If the number of previous items is smaller than input-len, 0 is padded on the left.
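As an illustration of the assumed behavior (with augmentation enabled), a sliding window over one user's sequence produces one example per target, left-padding with 0 where the history is too short:
# One user's items in chronological order, with input-len=5, target-len=1
sequence = [3, 7, 12, 5, 9, 21]
# Generated (input, target) pairs:
#   [0, 0, 0, 0, 3]  -> [7]
#   [0, 0, 0, 3, 7]  -> [12]
#   [0, 0, 3, 7, 12] -> [5]
#   [0, 3, 7, 12, 5] -> [9]
#   [3, 7, 12, 5, 9] -> [21]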
For the long and short-term recommendation task, you use pre-sessions previous sessions and the current session to predict target-len items. The target items are picked either randomly or from the end of the current session, so the length of the current session is max-session-len - target-len, while the length of any previous session is max-session-len. If any previous session or the current session is shorter than the preset length, 0 is padded on the left.
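A matching illustration of the long-short layout, assuming pre-sessions=2, max-session-len=4 and target-len=1 (the exact internals may differ):
# Two previous sessions, each left-padded to max-session-len = 4
pre_sessions_items = [0, 2, 8, 4, 0, 0, 6, 9]
# Current session, left-padded to max-session-len - target-len = 3
cur_session_items = [0, 5, 7]
# Target picked from the current session (here, the last item)
target_items = [11]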
--task Short or long-short (default: short)
--input-len Number of previous items (default: 5)
--target-len Number of target items (default: 1)
--pre-sessions Number of previous sessions (default: 10)
--pick-targets Pick target items from the current session randomly or from the end (default: random)
--session-interval Session splitting interval (minutes) (default: 0)
--min-session-len Sessions shorter than this will be dropped (default: 2)
--max-session-len Sessions longer than this will be truncated (default: 20)
Common options
--min-freq-item Items with frequency lower than this will be dropped (default: 5)
--min-freq-user Users with frequency lower than this will be dropped (default: 5)
--no-augment Do not use data augmentation (default: False)
--remove-duplicates Remove duplicated items in user sequence or user session (if split) (default: False)
Dataset related options
--rating-threshold Interactions with rating less than this will be dropped (Amazon, MovieLens, Yelp) (default: 4)
--item-type Recommend artists or songs (Lastfm) (default: song)
Version
By using different options, a dataset can have many processed versions. You can run the command below to get the configurations and statistics of all processed versions of a dataset. The config id shown in the output is a required argument of DataLoader.
srdatasets info --dataset=[dataset_name]
DataLoader
DataLoader is a built-in class that makes loading processed datasets easy. Once you initialize a dataloader by passing the dataset name, the processed version (config id), the batch_size, and a flag indicating whether to load training or test data, you can loop over it to get batch data. Since some models use rank-based learning, negative sampling is integrated into DataLoader. The negatives are sampled according to popularity from all items except those in the current data; it is turned off by default (negatives_per_target = 0). Also, since the timestamps of user behaviors are sometimes an important feature, you can include them in the batch data by setting include_timestamp to True.
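For intuition, popularity-based negative sampling can be sketched as follows; sample_negatives is illustrative, not the actual srdatasets implementation.
import numpy as np

def sample_negatives(popularity, exclude, n):
    # popularity: 1-D array of interaction counts indexed by item id
    probs = popularity.astype(np.float64).copy()
    probs[list(exclude)] = 0.0  # never sample items present in the current data
    probs /= probs.sum()
    return np.random.choice(len(probs), size=n, replace=False, p=probs)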
Arguments
- dataset_name: dataset name (case insensitive)
- config_id: configuration id
- batch_size: batch size (default: 1)
- train: load training dataset (default: True)
- development: load the dataset for development, i.e. the version that uses the validation set as the test set (default: False)
- negatives_per_target: number of negative samples per target (default: 0)
- include_timestamp: add timestamps to batch data (default: False)
- drop_last: drop the last incomplete batch (default: False)
Attributes
- num_users: total number of users in the training dataset
- num_items: total number of items in the training dataset (not including the padding item 0)
Initialization example
from srdatasets.dataloader import DataLoader
trainloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=True, negatives_per_target=5, include_timestamp=True)
testloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=False, include_timestamp=True)
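Since num_items excludes the padding item 0, a common pattern is to add one row for padding when sizing an embedding table; this usage sketch also assumes user ids fall in [0, num_users).
import torch.nn as nn

item_emb = nn.Embedding(trainloader.num_items + 1, 64, padding_idx=0)
user_emb = nn.Embedding(trainloader.num_users, 64)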
For PyTorch users, there is a wrapper implementation based on torch.utils.data.DataLoader; you can then set keyword arguments such as num_workers and pin_memory to speed up data loading.
from srdatasets.dataloader_pytorch import DataLoader
trainloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=True, negatives_per_target=5, include_timestamp=True, num_workers=8, pin_memory=True)
testloader = DataLoader("amazon-books", "c1574673118829", batch_size=32, train=False, include_timestamp=True, num_workers=8, pin_memory=True)
Iteration template
For the short-term recommendation task
for epoch in range(10):
    # Train
    for users, input_items, target_items, input_item_timestamps, target_item_timestamps, negative_samples in trainloader:
        # Shape
        #   users: (batch_size,)
        #   input_items: (batch_size, input_len)
        #   target_items: (batch_size, target_len)
        #   input_item_timestamps: (batch_size, input_len)
        #   target_item_timestamps: (batch_size, target_len)
        #   negative_samples: (batch_size, target_len, negatives_per_target)
        #
        # DataType
        #   numpy.ndarray or torch.LongTensor
        pass
    # Test
    for users, input_items, target_items, input_item_timestamps, target_item_timestamps in testloader:
        pass
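As one way to use negative_samples for rank-based learning, here is a minimal BPR-style loss sketch; the score tensors follow the batch shapes above, but the model producing them is hypothetical.
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores, neg_scores):
    # pos_scores: (batch_size, target_len), scores of target items
    # neg_scores: (batch_size, target_len, negatives_per_target)
    diff = pos_scores.unsqueeze(-1) - neg_scores
    return -F.logsigmoid(diff).mean()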
For the long and short-term recommendation task
for epoch in range(10):
    # Train
    for users, pre_sessions_items, cur_session_items, target_items, pre_sessions_item_timestamps, cur_session_item_timestamps, target_item_timestamps, negative_samples in trainloader:
        # Shape
        #   users: (batch_size,)
        #   pre_sessions_items: (batch_size, pre_sessions * max_session_len)
        #   cur_session_items: (batch_size, max_session_len - target_len)
        #   target_items: (batch_size, target_len)
        #   pre_sessions_item_timestamps: (batch_size, pre_sessions * max_session_len)
        #   cur_session_item_timestamps: (batch_size, max_session_len - target_len)
        #   target_item_timestamps: (batch_size, target_len)
        #   negative_samples: (batch_size, target_len, negatives_per_target)
        #
        # DataType
        #   numpy.ndarray or torch.LongTensor
        pass
    # Test
    for users, pre_sessions_items, cur_session_items, target_items, pre_sessions_item_timestamps, cur_session_item_timestamps, target_item_timestamps in testloader:
        pass
Disclaimers
This repo does not host or distribute any of the datasets; it is your responsibility to determine whether you have permission to use a dataset under its license.