MAVE: A Product Dataset for Multi-source Attribute Value Extraction

Overview

The dataset contains 3 million attribute-value annotations across 1257 unique categories, created from 2.2 million cleaned Amazon product profiles. It is a large, multi-sourced, and diverse dataset for product attribute extraction research.

More details can be found in the paper: https://arxiv.org/abs/2112.08663

The dataset is in JSON Lines format, where each line is a JSON object with the following schema:

, "category": , "paragraphs": [ { "text": , "source": }, ... ], "attributes": [ { "key": , "evidences": [ { "value": , "pid": , "begin": , "end": }, ... ] }, ... ] }">
{
   "id": 
           
            ,
   "category": 
            
             ,
   "paragraphs": [
      {
         "text": 
             
              ,
         "source": 
              
               
      },
      ...
   ],
   "attributes": [
      {
         "key": 
               
                , "evidences": [ { "value": 
                
                 , "pid": 
                 
                  , "begin": 
                  
                   , "end": 
                   
                     }, ... ] }, ... ] } 
                   
                  
                 
                
               
              
             
            
           

The product id is exactly the ASIN number in the All_Amazon_Meta.json file in the Amazon Review Data (2018). In this repo we don't store the paragraphs; we only store the labels. To obtain the full version of the dataset containing the paragraphs, we suggest first requesting access to the Amazon Review Data (2018) and then running our binary to clean its product metadata and join it with the labels, as described below.

Each JSON object contains one product and its attributes. A concrete example is shown below:

{
   "id":"B0002H0A3S",
   "category":"Guitar Strings",
   "paragraphs":[
      {
         "text":"D'Addario EJ26 Phosphor Bronze Acoustic Guitar Strings, Custom Light, 11-52",
         "source":"title"
      },
      {
         "text":".011-.052 Custom Light Gauge Acoustic Guitar Strings, Phosphor Bronze",
         "source":"description"
      },
      ...
   ],
   "attributes":[
      {
         "key":"Core Material",
         "evidences":[
            {
               "value":"Bronze Acoustic",
               "pid":0,
               "begin":24,
               "end":39
            },
            ...
         ]
      },
      {
         "key":"Winding Material",
         "evidences":[
            {
               "value":"Phosphor Bronze",
               "pid":0,
               "begin":15,
               "end":30
            },
            ...
         ]
      },
      {
         "key":"Gauge",
         "evidences":[
            {
               "value":"Light",
               "pid":0,
               "begin":63,
               "end":68
            },
            {
               "value":"Light Gauge",
               "pid":1,
               "begin":17,
               "end":28
            },
            ...
         ]
      }
   ]
}
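
The begin/end fields are character offsets into the paragraph selected by pid, with end exclusive; in the example above, paragraphs[0].text[24:39] is exactly "Bronze Acoustic". As a sanity check on a joined record, one can verify every evidence span against its paragraph, as in this minimal sketch (the file name is illustrative):

import json

# Minimal sketch: verify that each evidence span indexes its paragraph
# correctly in a joined MAVE record. The file name is illustrative.
with open("mave_positives.jsonl") as f:
    for line in f:
        product = json.loads(line)
        for attribute in product["attributes"]:
            for evidence in attribute["evidences"]:
                text = product["paragraphs"][evidence["pid"]]["text"]
                span = text[evidence["begin"]:evidence["end"]]
                # Each span should reproduce the annotated value verbatim.
                assert span == evidence["value"], (product["id"], attribute["key"])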

In addition to the positive examples, we also provide a set of negative examples, i.e., (product, attribute name) pairs without any evidence. The overall statistics of the positive and negative sets are as follows:

Counts                              Positives   Negatives
# products                            2226509     1248009
# product-attribute pairs             2987151     1780428
# products with 1-2 attributes        2102927     1140561
# products with 3-5 attributes         121897       99896
# products with >=6 attributes           1685        7552
# unique categories                      1257        1114
# unique attributes                       705         693
# unique category-attribute pairs        2535        2305
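
These counts can be recomputed directly from the label files, since each line is one product. A minimal sketch, assuming a labels file in the JSON Lines format above (the path is illustrative):

import json
from collections import Counter

# Minimal sketch: recompute per-product statistics from a labels file.
# The path is illustrative; point it at the positives or negatives file.
num_products = 0
num_pairs = 0
buckets = Counter()
categories = set()
with open("labels/mave_positives.jsonl") as f:
    for line in f:
        product = json.loads(line)
        num_attributes = len(product["attributes"])
        num_products += 1
        num_pairs += num_attributes
        categories.add(product["category"])
        if num_attributes <= 2:
            buckets["1-2"] += 1
        elif num_attributes <= 5:
            buckets["3-5"] += 1
        else:
            buckets[">=6"] += 1
print(num_products, num_pairs, len(categories), dict(buckets))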

Creating the full version of the dataset

In this repo we only open-source the labels of the MAVE dataset, together with the code to deterministically clean the original Amazon product metadata in the Amazon Review Data (2018) and join it with the labels to generate the full version of the MAVE dataset. After this process, the attribute values, paragraph ids, and begin/end span indices are consistent with the cleaned product profiles.

Step 1

Gain access to the Amazon Review Data (2018) and download the All_Amazon_Meta.json file to the folder of this repo.

Step 2

Run script

./clean_amazon_product_metadata_main.sh

to clean the Amazon metadata and join it with the positive and negative labels in the labels/ folder. The full output MAVE dataset will be stored in the reproduce/ folder.

The script runs the clean_amazon_product_metadata_main.py binary as an Apache Beam pipeline. The binary runs on a single CPU core, but a distributed setup can be enabled by changing the pipeline options. The binary contains all the utility functions used to clean the Amazon metadata and join it with the labels. The pipeline finishes within a few hours on a single Intel Xeon 3GHz CPU core.
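
For orientation, the rough shape of such a clean-and-join pipeline in Apache Beam looks like the sketch below. This is not the repo's actual binary: the function names, file paths, and output location are illustrative, and the metadata-cleaning transforms are omitted.

import json
import apache_beam as beam

def keyed_metadata(line):
    # Keys each Amazon metadata record by its ASIN. Assumes the line is
    # valid JSON; the repo's real cleaning step handles messier records.
    record = json.loads(line)
    return record["asin"], record

def keyed_label(line):
    # MAVE label records use the ASIN as their "id" field.
    record = json.loads(line)
    return record["id"], record

def join_records(element):
    # For each ASIN, attaches the cleaned metadata to every label record.
    asin, grouped = element
    for metadata in grouped["metadata"]:
        for label in grouped["labels"]:
            yield json.dumps({**label, "metadata": metadata})

with beam.Pipeline() as pipeline:  # defaults to the single-core DirectRunner
    metadata = (
        pipeline
        | "ReadMetadata" >> beam.io.ReadFromText("All_Amazon_Meta.json")
        | "KeyMetadata" >> beam.Map(keyed_metadata))
    labels = (
        pipeline
        | "ReadLabels" >> beam.io.ReadFromText("labels/mave_positives.jsonl")
        | "KeyLabels" >> beam.Map(keyed_label))
    _ = ({"metadata": metadata, "labels": labels}
         | "GroupByASIN" >> beam.CoGroupByKey()
         | "Join" >> beam.FlatMap(join_records)
         | "Write" >> beam.io.WriteToText("reproduce/joined_positives.jsonl"))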

Comments
  • Is the dataset too simple?

    Hi, thanks for your great work!

    After reading your paper, I found that the baseline on this dataset achieves 98.34 F1 across all attributes. Does this mean that the dataset is too simple as a benchmark for the attribute value extraction task?

    opened by ShengleiH 3
  • JSON error while parsing the All Metadata file

    Hi, I downloaded All_Amazon_Meta.json, and when I run clean_amazon_product_metadata_main.sh, I get the error below. Not sure what I'm doing wrong.

    Thank you

    Traceback (most recent call last):
      File "apache_beam/runners/common.py", line 1198, in apache_beam.runners.common.DoFnRunner.process
      File "apache_beam/runners/common.py", line 537, in apache_beam.runners.common.SimpleInvoker.invoke_process
      File "/Users/[email protected]/opt/anaconda3/lib/python3.8/site-packages/apache_beam/transforms/core.py", line 1635, in <lambda>
        wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
      File "/Users/[email protected]/opt/anaconda3/lib/python3.8/json/__init__.py", line 357, in loads
        return _default_decoder.decode(s)
      File "/Users/[email protected]/opt/anaconda3/lib/python3.8/json/decoder.py", line 337, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/Users/[email protected]/opt/anaconda3/lib/python3.8/json/decoder.py", line 355, in raw_decode
        raise JSONDecodeError("Expecting value", s, err.value) from None
    json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

    opened by nlpravi 3
  • Create README.md

    Add a README.md containing a description of the dataset and instructions for creating the full version of the dataset using the Amazon metadata.

    opened by liyang2019 0
  • Understand evaluation in the paper better

    Hi, thanks for the repo / data / paper, great work! I am creating this issue to understand how exactly evaluation is done, since I am using an autoregressive formulation of attribute extraction: extraction is done through free-form text generation, and no attribute type is provided as input.

    1. For positive samples (product paragraphs contain at least one attribute value), consider the following example ("target_attribute_vals" contains the annotated [attribute value](attribute type) pairs that appear in "text", while "predicted_attribute_vals" contains the model's prediction):
    {
        "text": "Nymph Womens's Chffion Polka Dot Maxi Halter Dress Extra Long",
        "target_attribute_vals": "Nymph Womens's Chffion [Polka Dot](Pattern) [Maxi](Length) [Halter](Neckline) Dress [Extra Long](Length)",
        "predicted_attribute_vals": "Nymph Womens's Chffion [Polka Dot](Pattern) [Maxi](Pattern) [Halter](Neckline) [Dress](Type) Extra Long"
    }
    

    The flattened tuples of target and predicted attribute values would be:

    {
        "target_tuples": [("Pattern", "Polka Dot"), ("Length", "Maxi"), ("Neckline", "Halter"), ("Length", "Extra Long")], 
        "predicted_tuples": [("Pattern", "Polka Dot"), ("Pattern", "Maxi"), ("Neckline", "Halter"), ("Type", "Dress")]
    }
    
    • Then following Section 5.3 of the paper, No value (VN), Correct values (VC), Wrong values (VW) for the above would be
    No value (VN): 2 # ("Length", "Maxi") and ("Length", "Extra Long") missing
    Correct values (VC): 2 # ("Pattern", "Polka Dot") and ("Neckline", "Halter") correct
    Wrong values (VW): 1 # ("Pattern", "Maxi") is not matching ("Pattern", "Polka Dot")
    

    The above value counts sum up to 5 attribute value pairs, but the "target_tuples" only had 4 attribute value pairs. Is this expected?

    2. For negative samples (here I used "target_attributes_as_in_file" instead of the "target_attribute_vals" format of the earlier positive example, and I also added the category as shown in the file):
    [
    {
        "text": "Taylor Dresses Women's High Low Lace Shirt Dress", 
        "target_attributes_as_in_file": [{'key': 'Pattern', 'evidences': []}], 
        "predicted_attribute_vals": "[Taylor](Brand) [Dresses](Type) Women's [High](Size) [Low Lace](Type) Shirt Dress", 
        "category": "Dresses"
    }, 
    {
        "text": "Taylor Dresses Women's High Low Lace Shirt Dress Nice", 
        "target_attributes_as_in_file": [{'key': 'Neckline', 'evidences': []}, {'key': 'Pattern', 'evidences': []}, , {'key': 'Type', 'evidences': []}], 
        "predicted_attribute_vals": "[Taylor](Brand) [Dresses](Type) Women's [High](Size) [Low Lace](Type) Shirt Dress",
        "category": "Dresses"
    }
    ]
    

    Then following Section 5.3 of the paper,

    • No value (NN), some incorrect Value (NV) for the first sample would be
    No value (NN): 1 # No Pattern in "predicted_attribute_vals"
    some incorrect Value (NV): 0 # No Pattern in "predicted_attribute_vals", thus cannot be incorrect
    
    • No value (NN), some incorrect Value (NV) for the second sample (which is very similar to the first sample) would be
    No value (NN): 2 # No Neckline, Pattern in "predicted_attribute_vals"
    some incorrect Value (NV): 1 # [Dresses](Type) in "predicted_attribute_vals"
    

    It seems that some incorrect Value (NV) can change arbitrarily depending on "target_attributes_as_in_file", which is not consistent within the same category (every sample under category="Dresses" has a different "target_attributes_as_in_file"). Is this expected?

    opened by junwang-wish 1
  • Multiple values for one attribute in one paragraph

    Hi, I found that there are multiple values for one attribute in one paragraph in this dataset. But in your paper, the model only "seeks the best answer span in the product context". Can this model extract multiple spans in the product context for one attribute?

    data with multiple spans

    {
      "id": "8198319301",
      "category": "Coats & Jackets",
      "paragraphs": [
        {
          "text": "HTOOHTOOH Women's Plus-size Casual Turn Down Collar Mid Length Jean Jacket",
          "source": "title"
        },
        ...
      ],
      "attributes": [
        {
          "key": "Style",
          "evidences": [
            {
              "value": "Casual",
              "pid": 0,
              "begin": 28,
              "end": 34
            },
            {
              "value": "Jean Jacket",
              "pid": 0,
              "begin": 63,
              "end": 74
            }
          ]
        }
      ]
    }
    
    {
      "id": "B00002N7X0",
      "category": "Aprons",
      "paragraphs": [
        {
          "text": "McGuire Nicholas C9 4 Pocket Utility Bib Apron in Natural Cotton",
          "source": "title"
        },
        {
          "text": "Constructed of heavy duty but lightweight cotton and ideal for a variety of jobs. The large waist pockets help to store tools or brushes. Reinforced at stress points for added durability. 2 large waist pockets 1 medium bib pocket 1 small bib pocket Extra reinforcement at stress points Canvas loop neck & waist tie Cotton canvas.",
          "source": "description"
        },
        ...
      ],
      "attributes": [
        {
          "key": "Style",
          "evidences": [
            {
              "value": "Bib",
              "pid": 0,
              "begin": 37,
              "end": 40
            },
            {
              "value": "bib",
              "pid": 1,
              "begin": 219,
              "end": 222
            },
            {
              "value": "bib",
              "pid": 1,
              "begin": 238,
              "end": 241
            },
            {
              "value": "neck",
              "pid": 1,
              "begin": 298,
              "end": 302
            },
            ...
          ]
        }
      ]
    }
    
    opened by ShengleiH 1
Owner
Google Research Datasets
Datasets released by Google Research
DyStyle: Dynamic Neural Network for Multi-Attribute-Conditioned Style Editing

DyStyle: Dynamic Neural Network for Multi-Attribute-Conditioned Style Editing Figure: Joint multi-attribute edits using DyStyle model. Great diversity

null 74 Dec 3, 2022
This folder contains the implementation of the multi-relational attribute propagation algorithm.

MrAP This folder contains the implementation of the multi-relational attribute propagation algorithm. It requires the package pytorch-scatter. Please

null 6 Dec 6, 2022
A Multi-attribute Controllable Generative Model for Histopathology Image Synthesis

A Multi-attribute Controllable Generative Model for Histopathology Image Synthesis This is the pytorch implementation for our MICCAI 2021 paper. A Mul

Jiarong Ye 7 Apr 4, 2022
Abstractive opinion summarization system (SelSum) and the largest dataset of Amazon product summaries (AmaSum). EMNLP 2021 conference paper.

Learning Opinion Summarizers by Selecting Informative Reviews This repository contains the codebase and the dataset for the corresponding EMNLP 2021

Arthur Bražinskas 39 Jan 1, 2023
EMNLP 2021: Single-dataset Experts for Multi-dataset Question-Answering

MADE (Multi-Adapter Dataset Experts) This repository contains the implementation of MADE (Multi-adapter dataset experts), which is described in the pa

Princeton Natural Language Processing 68 Jul 18, 2022
EMNLP 2021: Single-dataset Experts for Multi-dataset Question-Answering

MADE (Multi-Adapter Dataset Experts) This repository contains the implementation of MADE (Multi-adapter dataset experts), which is described in the pa

Princeton Natural Language Processing 39 Oct 5, 2021
This is the official source code for SLATE. We provide the code for the model, the training code, and a dataset loader for the 3D Shapes dataset. This code is implemented in Pytorch.

SLATE This is the official source code for SLATE. We provide the code for the model, the training code and a dataset loader for the 3D Shapes dataset.

Gautam Singh 66 Dec 26, 2022
Disentangled Face Attribute Editing via Instance-Aware Latent Space Search, accepted by IJCAI 2021.

Instance-Aware Latent-Space Search This is a PyTorch implementation of the following paper: Disentangled Face Attribute Editing via Instance-Aware Lat

null 67 Dec 21, 2022
Sync2Gen Code for ICCV 2021 paper: Scene Synthesis via Uncertainty-Driven Attribute Synchronization

Sync2Gen Code for ICCV 2021 paper: Scene Synthesis via Uncertainty-Driven Attribute Synchronization 0. Environment Environment: python 3.6 and cuda 10

Haitao Yang 62 Dec 30, 2022
FACIAL: Synthesizing Dynamic Talking Face With Implicit Attribute Learning. ICCV, 2021.

FACIAL: Synthesizing Dynamic Talking Face with Implicit Attribute Learning PyTorch implementation for the paper: FACIAL: Synthesizing Dynamic Talking

null 226 Jan 8, 2023
Implementation for HFGI: High-Fidelity GAN Inversion for Image Attribute Editing

HFGI: High-Fidelity GAN Inversion for Image Attribute Editing High-Fidelity GAN Inversion for Image Attribute Editing Update: We released the inferenc

Tengfei Wang 371 Dec 30, 2022
Official implementation of Protected Attribute Suppression System, ICCV 2021

Official implementation of Protected Attribute Suppression System, ICCV 2021

Prithviraj Dhar 6 Jan 1, 2023
Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing

Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing Paper Introduction Multi-task indoor scene understanding is widely considered a

null 62 Dec 5, 2022
Rethinking of Pedestrian Attribute Recognition: A Reliable Evaluation under Zero-Shot Pedestrian Identity Setting

Pytorch Pedestrian Attribute Recognition: A strong PyTorch baseline of pedestrian attribute recognition and multi-label classification.

Jian 79 Dec 18, 2022
Deepface is a lightweight face recognition and facial attribute analysis (age, gender, emotion and race) framework for python

deepface Deepface is a lightweight face recognition and facial attribute analysis (age, gender, emotion and race) framework for python. It is a hybrid

Kushal Shingote 2 Feb 10, 2022
A very simple tool to rewrite parameters such as attributes and constants for OPs in ONNX models. Simple Attribute and Constant Modifier for ONNX.

sam4onnx A very simple tool to rewrite parameters such as attributes and constants for OPs in ONNX models. Simple Attribute and Constant Modifier for

Katsuya Hyodo 6 May 15, 2022
Code of U2Fusion: a unified unsupervised image fusion network for multiple image fusion tasks, including multi-modal, multi-exposure and multi-focus image fusion.

U2Fusion Code of U2Fusion: a unified unsupervised image fusion network for multiple image fusion tasks, including multi-modal (VIS-IR, medical), multi

Han Xu 129 Dec 11, 2022