SHIFT15M: multiobjective large-scale fashion dataset with distributional shifts

ZOZO, Inc.

Last update: Nov 24, 2022

Overview

[arXiv]

The main motivation of the SHIFT15M project is to provide a dataset containing naturally occurring dataset shifts, collected from IQON, a web service that was in operation for a decade. The dataset also exhibits several types of dataset shift, which makes it possible to evaluate a model's robustness to each of them (e.g., covariate shift and target shift).

We provide the Datasheet for SHIFT15M. This datasheet is based on the Datasheets for Datasets [1] template.

System	Python 3.6	Python 3.7	Python 3.8
Linux CPU
Linux GPU
Windows CPU / GPU	Status Currently Unavailable	Status Currently Unavailable	Status Currently Unavailable
Mac OS CPU

SHIFT15M is a large-scale dataset based on approximately 15 million items accumulated by the fashion search service IQON.

Installation

From PyPI

$ pip install shift15m

From source

$ git clone https://github.com/st-tech/zozo-shift15m.git
$ cd zozo-shift15m
$ poetry build
$ pip install dist/shift15m-xxxx-py3-none-any.whl

Download SHIFT15M dataset

Use Dataset class

You can download the SHIFT15M dataset as follows:

from shift15m.datasets import NumLikesRegression

dataset = NumLikesRegression(root="./data", download=True)

Download directly by using download scripts

Please download the dataset as follows:

$ bash scripts/download_all.sh

To avoid downloading the test dataset for set matching (80GB), which is not required in training, you can use the following script.

$ bash scripts/download_all_wo_set_testdata.sh

Tasks

The following tasks are now available:

Tasks	Task type	Shift type	# of input dim	# of output dim
NumLikesRegression	regression	target shift	(N, 25)	(N, 1)
SumPricesRegression	regression	covariate shift, target shift	(N, 1)	(N, 1)
ItemPriceRegression	regression	target shift	(N, 4096)	(N, 1)
ItemCategoryClassification	classification	target shift	(N, 4096)	(N, 7)
Set2SetMatching	set-to-set matching	covariate shift	(N, 4096)x(M, 4096)	(1)
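As a rough illustration of how a shift like those in the table can be exposed, the sketch below splits records by publication year and compares the mean label in each split; a gap between the two means is one simple symptom of target shift. It assumes records shaped like the JSON rows described under Original Dataset Structure and assumes that shifts are studied by training and testing on different years. The helper names `split_by_year` and `mean_likes` and the toy records are hypothetical and are not part of the shift15m package.

```python
from statistics import mean

# Toy records in the shape of the original JSON rows (values are made up).
records = [
    {"like_num": "10", "publish_date": "2013-04-01"},
    {"like_num": "14", "publish_date": "2013-09-15"},
    {"like_num": "30", "publish_date": "2017-02-20"},
    {"like_num": "42", "publish_date": "2017-11-03"},
]


def split_by_year(rows, train_year, test_year):
    """Return (train, test) lists of rows published in the given years."""
    def year(row):
        # publish_date is a "yyyy-mm-dd" string, so the year is the first 4 chars.
        return int(row["publish_date"][:4])

    train = [r for r in rows if year(r) == train_year]
    test = [r for r in rows if year(r) == test_year]
    return train, test


def mean_likes(rows):
    """Mean of the like_num label, which is stored as a string."""
    return mean(int(r["like_num"]) for r in rows)


train, test = split_by_year(records, train_year=2013, test_year=2017)
print(mean_likes(train), mean_likes(test))
```

On real data a difference in means is only a first check; comparing full label histograms per year would give a more complete picture.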

Benchmarks

As templates for numerical experiments on the SHIFT15M dataset, we have published experimental results for each task with several models.

Original Dataset Structure

The original dataset is maintained in JSON format. Each row has the following structure:

{
  "user":{"user_id":"xxxx", "fav_brand_ids":"xxxx,xx,..."},
  "like_num":"xx",
  "set_id":"xxx",
  "items":[
    {"price":"xxxx","item_id":"xxxxxx","category_id1":"xx","category_id2":"xxxxx"},
    ...
  ],
  "publish_date":"yyyy-mm-dd"
}
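Every value in a row is stored as a string, so numeric fields need converting before use. The following is a minimal sketch of parsing one row with the standard library. It assumes that `like_num` and `price` hold integer strings and that `fav_brand_ids` is comma-separated, as the placeholders above suggest; the function name `parse_row` and the sample values are hypothetical.

```python
import json
from datetime import datetime

# A made-up row following the structure shown above.
raw = """{
  "user": {"user_id": "1001", "fav_brand_ids": "12,345,6789"},
  "like_num": "27",
  "set_id": "555",
  "items": [
    {"price": "4980", "item_id": "900001", "category_id1": "10", "category_id2": "10003"},
    {"price": "12800", "item_id": "900002", "category_id1": "11", "category_id2": "11007"}
  ],
  "publish_date": "2015-06-30"
}"""


def parse_row(line):
    """Convert one JSON row into a dict with typed values."""
    row = json.loads(line)
    brands = row["user"]["fav_brand_ids"]
    return {
        "user_id": row["user"]["user_id"],
        # Skip empty pieces so an empty string yields an empty list.
        "fav_brand_ids": [b for b in brands.split(",") if b],
        "like_num": int(row["like_num"]),
        "set_id": row["set_id"],
        "prices": [int(item["price"]) for item in row["items"]],
        "item_ids": [item["item_id"] for item in row["items"]],
        "publish_date": datetime.strptime(row["publish_date"], "%Y-%m-%d").date(),
    }


outfit = parse_row(raw)
print(outfit["like_num"], sum(outfit["prices"]), outfit["publish_date"].year)
```

IDs are left as strings here because they are identifiers rather than quantities.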

Contributing

To learn more about making a contribution to SHIFT15M, please see the following materials:

License

The dataset itself is provided under a CC BY-NC 4.0 license. On the other hand, the software in this repository is provided under the MIT license.

Dataset metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property	value
name	SHIFT15M Dataset
alternateName	SHIFT15M
alternateName	shift15m-dataset
url	https://github.com/st-tech/zozo-shift15m
sameAs	https://github.com/st-tech/zozo-shift15m
description	SHIFT15M is a multi-objective, multi-domain dataset which includes multiple dataset shifts.

provider

property	value
name	`ZOZO Research`
sameAs	`https://ja.wikipedia.org/wiki/ZOZO`

license

property	value
name	`CC BY-NC 4.0`
url	`https://github.com/st-tech/zozo-shift15m/blob/main/LICENSE.CC`

Citation

@misc{Kimura_SHIFT15M_Multiobjective_LargeScale_2021,
author = {Kimura, Masanari and Nakamura, Takuma and Saito, Yuki},
month = {8},
title = {SHIFT15M: Multiobjective Large-Scale Fashion Dataset with Distributional Shifts},
year = {2021}
}

Errata

No errata are currently available.

References

[1] Gebru, Timnit, et al. "Datasheets for datasets." arXiv preprint arXiv:1803.09010 (2018).

Comments

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

documentation datasheet

opened by nocotan 3
Extracting Image Features
@nocotan I'm planning to prepare image features as we discussed. To be extracted:

CNN features (2048 dimensional features from the pre-trained Inception-V3 model on ILSVRC2012)

I also looked for a well-established hand-crafted image feature extractor that uses color, but could not find a usable open-source implementation. For instance, the paper below reports that combining Local Binary Pattern (LBP) and Local Color Contrast (LCC) outperformed other color-based hand-crafted features on a texture classification task, but LCC is not available as OSS. https://www.researchgate.net/publication/315858786_Hand-Crafted_vs_Learned_Descriptors_for_Color_Texture_Classification

So I am planning not to include a hand-crafted feature for the image-based task.
opened by wildsnowman 2
Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section.

documentation datasheet

opened by nocotan 2
add LICENSE
Before publishing, we need to determine the license of the repository, e.g.,

MIT

Apache

BSD

GPL

After researching which license is appropriate, please add the LICENSE to the repository.
documentation
opened by nocotan 2
Got a TypeError exception when trying to run the item category prediction task
First of all, thank you for your great work and for releasing the dataset.

Description: When I tried to run the item_category_prediction task following the usage instructions for item_category_prediction, I got a TypeError exception.

Environment:

Python 3.8.8

Any advice would be very helpful. Thank you.
bug
opened by you0xy 1
Information: the dataset size

the number of outfits: 2,555,147
the number of images (multiple-counting): 15,218,721
the number of unique images: 2,335,598

Note: shift28M may not be the correct name.
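The count of images with multiplicity is the figure closest to the "approximately 15 million items" mentioned in the overview. A short calculation on the three reported counts gives the average outfit size and how often each unique image is reused; the variable names are illustrative only.

```python
# Counts reported in this issue.
num_outfits = 2_555_147
num_images_with_repeats = 15_218_721
num_unique_images = 2_335_598

# Average number of item images per outfit.
items_per_outfit = num_images_with_repeats / num_outfits

# Average number of outfits in which each unique image appears.
uses_per_image = num_images_with_repeats / num_unique_images

print(f"{items_per_outfit:.2f} {uses_per_image:.2f}")  # prints "5.96 6.52"
```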

opened by wildsnowman 1
How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?

documentation datasheet

opened by nocotan 1
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.

documentation datasheet

opened by nocotan 1
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?

documentation datasheet

opened by nocotan 1
Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.

documentation datasheet

opened by nocotan 1
Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

documentation datasheet

opened by nocotan 1
Bug: the number of val/test samples is inconsistent with other cases when the same year is selected for train_year and test_year.

Describe the bug: In set matching, the numbers of samples are fixed at 30816, 3851, and 3851 for the train, val, and test data, respectively. However, when the same year is selected for train_year and test_year, the numbers become inconsistent with these values.

This bug may make experiments that vary train_year and test_year unreliable.
bug

opened by wildsnowman 0
disjoint set matching

Parent Task

set matching

Model List

Note

Set matching experiments may also need to be conducted under a disjoint setting, in which testing uses only items that were not included during training.
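One way to build such a split is to assign every item ID to either a training pool or a test pool first, and then keep only the outfits whose items all fall in a single pool. The sketch below is an illustration under that assumption, not the benchmark's implementation; the function name `disjoint_split`, the `test_ratio` parameter, and the toy outfits (lists of item IDs) are hypothetical.

```python
import random


def disjoint_split(outfits, test_ratio=0.2, seed=0):
    """Split outfits so that no item ID appears in both train and test."""
    rng = random.Random(seed)
    # Sort before shuffling so the result depends only on the seed.
    item_ids = sorted({item for outfit in outfits for item in outfit})
    rng.shuffle(item_ids)
    n_test = int(len(item_ids) * test_ratio)
    test_items = set(item_ids[:n_test])

    train, test = [], []
    for outfit in outfits:
        members = set(outfit)
        if members <= test_items:
            test.append(outfit)      # every item belongs to the test pool
        elif not (members & test_items):
            train.append(outfit)     # no item belongs to the test pool
        # Outfits mixing both pools are dropped to keep the sets disjoint.
    return train, test


outfits = [["a", "b"], ["b", "c"], ["d", "e"], ["e", "f"], ["g"], ["h", "a"]]
train, test = disjoint_split(outfits, test_ratio=0.4, seed=1)
train_items = {i for o in train for i in o}
held_out_items = {i for o in test for i in o}
print(train_items.isdisjoint(held_out_items))  # prints True
```

Dropping mixed outfits reduces the amount of usable data, so the test ratio and the resulting split sizes are worth checking on the real dataset.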

References

https://arxiv.org/abs/1804.09979
benchmark

opened by wildsnowman 0
Implementation of the set data loader with tags

Is your feature request related to a problem? Please describe. We added tag information to the dataset, so it would be useful to implement an additional data loader that exposes the tags.

Describe the solution you'd like This can be accomplished by adding arguments to an existing data loader.


opened by nocotan 0

Releases(v0.2.0)

v0.2.0(Sep 20, 2022)

add tags info as follows:

{
  "user":{"user_id":"xxxx", "fav_brand_ids":"xxxx,xx,..."},
  "like_num":"xx",
  "set_id":"xxx",
  "items":[
    {"price":"xxxx","item_id":"xxxxxx","category_id1":"xx","category_id2":"xxxxx"},
    ...
  ],
  "publish_date":"yyyy-mm-dd",
  "tags": "tag_a, tag_b, tag_c, ..."
}

add superset matching benchmark
fix a label creation bug on set matching with multiple splits
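The `tags` field added in this release is a single comma-separated string. A minimal sketch of turning it into a list follows; it assumes that rows from files produced before this release may lack the `tags` key, which the release note does not state. The function name `parse_tags` and the sample rows are hypothetical.

```python
def parse_tags(row):
    """Return the tags of a row as a list, or [] if the row has no tags."""
    raw = row.get("tags") or ""
    # Tags are separated by commas, possibly followed by spaces.
    return [tag.strip() for tag in raw.split(",") if tag.strip()]


old_row = {"set_id": "1", "publish_date": "2014-01-01"}
new_row = {"set_id": "2", "publish_date": "2016-01-01", "tags": "casual, denim, summer"}

print(parse_tags(old_row), parse_tags(new_row))
```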


v.0.1.2(Nov 24, 2021)

v.0.1.1(Sep 6, 2021)


Owner

ZOZO, Inc.
