MAVE: A Product Dataset for Multi-source Attribute Value Extraction

Google Research Datasets

Last update: Jan 8, 2023

Overview

MAVE: A Product Dataset for Multi-source Attribute Value Extraction

The dataset contains 3 million attribute-value annotations across 1257 unique categories, created from 2.2 million cleaned Amazon product profiles. It is a large, multi-source, diverse dataset for studying product attribute extraction.

More details can be found in the paper: https://arxiv.org/abs/2112.08663

The dataset is in JSON Lines format, where each line is a JSON object with the following schema:

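{
   "id": <product id>,
   "category": <category>,
   "paragraphs": [
      {
         "text": <paragraph text>,
         "source": <paragraph source>
      },
      ...
   ],
   "attributes": [
      {
         "key": <attribute name>,
         "evidences": [
            {
               "value": <attribute value>,
               "pid": <paragraph id>,
               "begin": <begin index of the value in the paragraph text>,
               "end": <end index of the value in the paragraph text>
            },
            ...
         ]
      },
      ...
   ]
}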
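A minimal sketch of reading records in this format, assuming the file is plain JSON Lines with one product per line. The `iter_records` helper and the sample line are illustrative and not part of this repo.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one product record per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines instead of failing on them
            yield json.loads(line)


# Hypothetical one-line sample that follows the schema above.
sample_lines = [
    '{"id": "B0002H0A3S", "category": "Guitar Strings", "paragraphs": [], '
    '"attributes": [{"key": "Gauge", "evidences": []}]}',
    "",
]

records = list(iter_records(sample_lines))
for record in records:
    keys = [attribute["key"] for attribute in record["attributes"]]
    print(record["id"], record["category"], keys)
```

An open file handle can be passed in place of `sample_lines`, since iterating over a text file yields one line at a time.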

The product id is exactly the ASIN number in the All_Amazon_Meta.json file in the Amazon Review Data (2018). In this repo, we don't store the paragraphs; we only store the labels. To obtain the full version of the dataset containing the paragraphs, we suggest first requesting the Amazon Review Data (2018), then running our binary to clean its product metadata and join it with the labels, as described below.

Each JSON object describes one product and its attributes. A concrete example is shown below:

{
   "id":"B0002H0A3S",
   "category":"Guitar Strings",
   "paragraphs":[
      {
         "text":"D'Addario EJ26 Phosphor Bronze Acoustic Guitar Strings, Custom Light, 11-52",
         "source":"title"
      },
      {
         "text":".011-.052 Custom Light Gauge Acoustic Guitar Strings, Phosphor Bronze",
         "source":"description"
      },
      ...
   ],
   "attributes":[
      {
         "key":"Core Material",
         "evidences":[
            {
               "value":"Bronze Acoustic",
               "pid":0,
               "begin":24,
               "end":39
            },
            ...
         ]
      },
      {
         "key":"Winding Material",
         "evidences":[
            {
               "value":"Phosphor Bronze",
               "pid":0,
               "begin":15,
               "end":30
            },
            ...
         ]
      },
      {
         "key":"Gauge",
         "evidences":[
            {
               "value":"Light",
               "pid":0,
               "begin":63,
               "end":68
            },
            {
               "value":"Light Gauge",
               "pid":1,
               "begin":17,
               "end":28
            },
            ...
         ]
      }
   ]
}
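The example suggests that `pid` is a zero-based index into `paragraphs` and that `begin`/`end` form a half-open character span. A small sketch that checks this reading against the example record follows; `check_spans` is an illustrative helper, and the elided `...` entries are simply left out.

```python
def check_spans(record: dict) -> list:
    """Return (key, value, actual_text) for every evidence whose span does not match."""
    mismatches = []
    for attribute in record["attributes"]:
        for evidence in attribute["evidences"]:
            # Assumption: pid indexes the paragraphs list, end is exclusive.
            text = record["paragraphs"][evidence["pid"]]["text"]
            actual = text[evidence["begin"]:evidence["end"]]
            if actual != evidence["value"]:
                mismatches.append((attribute["key"], evidence["value"], actual))
    return mismatches


example = {
    "id": "B0002H0A3S",
    "category": "Guitar Strings",
    "paragraphs": [
        {"text": "D'Addario EJ26 Phosphor Bronze Acoustic Guitar Strings, Custom Light, 11-52",
         "source": "title"},
        {"text": ".011-.052 Custom Light Gauge Acoustic Guitar Strings, Phosphor Bronze",
         "source": "description"},
    ],
    "attributes": [
        {"key": "Core Material",
         "evidences": [{"value": "Bronze Acoustic", "pid": 0, "begin": 24, "end": 39}]},
        {"key": "Winding Material",
         "evidences": [{"value": "Phosphor Bronze", "pid": 0, "begin": 15, "end": 30}]},
        {"key": "Gauge",
         "evidences": [
             {"value": "Light", "pid": 0, "begin": 63, "end": 68},
             {"value": "Light Gauge", "pid": 1, "begin": 17, "end": 28},
         ]},
    ],
}

print(check_spans(example))  # an empty list means every span matches its value
```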

In addition to positive examples, we also provide a set of negative examples, i.e. (product, attribute name) pairs without any evidence. The overall statistics of the positive and negative sets are as follows:

Counts	Positives	Negatives
# products	2226509	1248009
# product-attribute pairs	2987151	1780428
# products with 1-2 attributes	2102927	1140561
# products with 3-5 attributes	121897	99896
# products with >=6 attributes	1685	7552
# unique categories	1257	1114
# unique attributes	705	693
# unique category-attribute pairs	2535	2305
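The per-product buckets in the table can be recomputed from the records with a short sketch like the one below. It assumes each record lists all attributes of a product in its `attributes` array; the helper name and the toy records are illustrative only.

```python
from collections import Counter


def bucket_by_attribute_count(records):
    """Count products in the 1-2, 3-5 and >=6 attribute buckets."""
    buckets = Counter()
    for record in records:
        n = len(record["attributes"])
        if n >= 6:
            buckets[">=6"] += 1
        elif n >= 3:
            buckets["3-5"] += 1
        elif n >= 1:
            buckets["1-2"] += 1
    return buckets


# Toy records, not taken from the dataset.
toy_records = [
    {"id": "a", "attributes": [{"key": "Style"}]},
    {"id": "b", "attributes": [{"key": "Style"}, {"key": "Gauge"}, {"key": "Pattern"}]},
    {"id": "c", "attributes": [{"key": str(i)} for i in range(6)]},
]

buckets = bucket_by_attribute_count(toy_records)
print(dict(buckets))
```

In the table above, the three bucket rows add up to the "# products" row for both the positive and the negative set.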

Creating the full version of the dataset

In this repo, we only open-source the labels of the MAVE dataset and the code that deterministically cleans the original Amazon product metadata in the Amazon Review Data (2018) and joins it with the labels to generate the full version of the MAVE dataset. After this process, the attribute values, paragraph ids and begin/end span indices will be consistent with the cleaned product profiles.

Step 1

Gain access to the Amazon Review Data (2018) and download the All_Amazon_Meta.json file to the folder of this repo.

Step 2

Run the script

./clean_amazon_product_metadata_main.sh

to clean the Amazon metadata and join with the positive and negative labels in the labels/ folder. The output full MAVE dataset will be stored in the reproduce/ folder.

The script runs the clean_amazon_product_metadata_main.py binary using an Apache Beam pipeline. By default the binary runs on a single CPU core, but a distributed setup can be enabled by changing the pipeline options. The binary contains all utility functions used to clean the Amazon metadata and join it with the labels. The pipeline finishes within a few hours on a single Intel Xeon 3GHz CPU core.
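Conceptually, the join step matches label records to cleaned product profiles by product id. The sketch below illustrates that idea in plain Python; it is not the repo's Beam pipeline, and the record shapes (labels without `paragraphs`, cleaned products with `id` and `paragraphs`) are assumptions based on the schema above.

```python
def join_labels(cleaned_products, labels):
    """Attach cleaned paragraphs to label records that share the same product id."""
    paragraphs_by_id = {p["id"]: p["paragraphs"] for p in cleaned_products}
    joined = []
    for label in labels:
        paragraphs = paragraphs_by_id.get(label["id"])
        if paragraphs is None:
            continue  # no cleaned profile for this product id
        joined.append({**label, "paragraphs": paragraphs})
    return joined


# Toy inputs; the second label has a made-up id with no matching product.
cleaned_products = [
    {"id": "B0002H0A3S",
     "paragraphs": [{"text": "D'Addario EJ26 Phosphor Bronze Acoustic Guitar Strings, "
                             "Custom Light, 11-52",
                     "source": "title"}]},
]
labels = [
    {"id": "B0002H0A3S", "category": "Guitar Strings",
     "attributes": [{"key": "Gauge",
                     "evidences": [{"value": "Light", "pid": 0, "begin": 63, "end": 68}]}]},
    {"id": "MISSING0000", "category": "Aprons", "attributes": []},
]

full = join_labels(cleaned_products, labels)
print(len(full), full[0]["id"])
```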

Comments

Is the dataset too simple?

Hi, thanks for your great work!

After reading your paper, I found that the baseline already achieves 98.34 F1 on all attributes of this dataset. Does this mean that the dataset is too simple as a benchmark for the attribute value extraction task?

opened by ShengleiH 3

JSON error while parsing the All Metadata file

Hi, I downloaded All_Amazon_Meta.json, and when I run clean_amazon_product_metadata_main.sh I get the error below. Not sure what I'm doing wrong.

Thank you

Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1198, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 537, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "/Users/[email protected]/opt/anaconda3/lib/python3.8/site-packages/apache_beam/transforms/core.py", line 1635, in <lambda>
    wrapper = lambda x, *args, **kwargs: [fn(x, *args, **kwargs)]
  File "/Users/[email protected]/opt/anaconda3/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/Users/[email protected]/opt/anaconda3/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/[email protected]/opt/anaconda3/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
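"Expecting value: line 1 column 1 (char 0)" means `json.loads` received a string that does not start with a JSON value, for example an empty or non-JSON row. The cause in this particular report is not known. One way to locate such rows is sketched below; `find_bad_lines` is an illustrative helper and the `asin` key in the sample is an assumed field name.

```python
import json


def find_bad_lines(lines, limit=5):
    """Return (line_number, preview, error) for the first rows that fail json.loads."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as error:
            bad.append((number, line[:40], str(error)))
            if len(bad) >= limit:
                break
    return bad


# Toy input: one valid row, one blank row, one row that is not JSON.
sample = ['{"asin": "B0002H0A3S"}\n', "\n", "not json\n"]

for number, preview, error in find_bad_lines(sample):
    print(number, repr(preview), error)
```

Passing an open file handle for the metadata file instead of `sample` reports the first few offending line numbers without loading the whole file into memory.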

opened by nlpravi 3

Create README.md

Add a README.md containing descriptions of the dataset and the instructions to create the full version of the dataset by using the Amazon metadataset.

opened by liyang2019 0

Understand evaluation in the paper better

Hi, thanks for the repo, data, and paper. Great work! I am creating this issue to understand exactly how evaluation is done. I am using an autoregressive formulation of attribute extraction: extraction is done through free-form text generation, and no attribute type is provided as input.

For positive samples (product paragraphs that contain at least one attribute value), consider the following example, where "target_attribute_vals" contains the annotated [attribute value](attribute type) spans that appear in "text" and "predicted_attribute_vals" contains the prediction from a model:

{
    "text": "Nymph Womens's Chffion Polka Dot Maxi Halter Dress Extra Long",
    "target_attribute_vals": "Nymph Womens's Chffion [Polka Dot](Pattern) [Maxi](Length) [Halter](Neckline) Dress [Extra Long](Length)",
    "predicted_attribute_vals": "Nymph Womens's Chffion [Polka Dot](Pattern) [Maxi](Pattern) [Halter](Neckline) [Dress](Type) Extra Long"
}

The flattened tuples of target and prediction attribute values would be:

{
    "target_tuples": [("Pattern", "Polka Dot"), ("Length", "Maxi"), ("Neckline", "Halter"), ("Length", "Extra Long")], 
    "predicted_tuples": [("Pattern", "Polka Dot"), ("Pattern", "Maxi"), ("Neckline", "Halter"), ("Type", "Dress")]
}
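The flattening step can be sketched with a regular expression over the `[value](type)` markup. The pattern and the `flatten` helper below are illustrative assumptions about that markup, not code from the repo or the paper.

```python
import re

# Matches "[value](type)" and captures the value and the attribute type.
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def flatten(annotated: str) -> list:
    """Return (attribute type, attribute value) tuples in order of appearance."""
    return [(attr_type, value) for value, attr_type in ANNOTATION.findall(annotated)]


target = ("Nymph Womens's Chffion [Polka Dot](Pattern) [Maxi](Length) "
          "[Halter](Neckline) Dress [Extra Long](Length)")
predicted = ("Nymph Womens's Chffion [Polka Dot](Pattern) [Maxi](Pattern) "
             "[Halter](Neckline) [Dress](Type) Extra Long")

target_tuples = flatten(target)
predicted_tuples = flatten(predicted)
print(target_tuples)
print(predicted_tuples)
```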

Then following Section 5.3 of the paper, No value (VN), Correct values (VC), Wrong values (VW) for the above would be

No value (VN): 2 # ("Length", "Maxi") and ("Length", "Extra Long") missing
Correct values (VC): 2 # ("Pattern", "Polka Dot") and ("Neckline", "Halter") correct
Wrong values (VW): 1 # ("Pattern", "Maxi") is not matching ("Pattern", "Polka Dot")

The above value counts sum up to 5 attribute value pairs, but the "target_tuples" only had 4 attribute value pairs. Is this expected?

For negative samples (Here I used "target_attributes_as_in_file" instead of "target_attribute_vals" format as in earlier positive examples, and I also added category as shown in the file)

[
{
    "text": "Taylor Dresses Women's High Low Lace Shirt Dress", 
    "target_attributes_as_in_file": [{'key': 'Pattern', 'evidences': []}], 
    "predicted_attribute_vals": "[Taylor](Brand) [Dresses](Type) Women's [High](Size) [Low Lace](Type) Shirt Dress", 
    "category": "Dresses"
}, 
{
    "text": "Taylor Dresses Women's High Low Lace Shirt Dress Nice", 
    "target_attributes_as_in_file": [{'key': 'Neckline', 'evidences': []}, {'key': 'Pattern', 'evidences': []}, {'key': 'Type', 'evidences': []}], 
    "predicted_attribute_vals": "[Taylor](Brand) [Dresses](Type) Women's [High](Size) [Low Lace](Type) Shirt Dress",
    "category": "Dresses"
}
]

Then following Section 5.3 of the paper,

No value (NN), some incorrect Value (NV) for the first sample would be

No value (NN): 1 # No Pattern in "predicted_attribute_vals"
some incorrect Value (NV): 0 # No Pattern in "predicted_attribute_vals", thus cannot be incorrect

No value (NN), some incorrect Value (NV) for the second sample (which is very similar to the first sample) would be

No value (NN): 2 # No Neckline, Pattern in "predicted_attribute_vals"
some incorrect Value (NV): 1 # [Dresses](Type) in "predicted_attribute_vals"

It seems that the some incorrect Value (NV) count can change arbitrarily depending on "target_attributes_as_in_file", which is not consistent within a category (every sample under category="Dresses" has a different "target_attributes_as_in_file"). Is this expected?
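The counting described above can be reproduced with a short sketch. It follows this issue's reading of NN and NV, which may differ from the evaluation actually used in the paper; `count_negatives` is an illustrative helper.

```python
import re

# Matches "[value](type)" and captures the value and the attribute type.
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def count_negatives(target_attributes, predicted_attribute_vals):
    """Return (NN, NV) over target attributes that have no evidence."""
    predicted_types = {
        attr_type for _, attr_type in ANNOTATION.findall(predicted_attribute_vals)
    }
    nn = nv = 0
    for attribute in target_attributes:
        if attribute["evidences"]:
            continue  # a positive pair, not counted here
        if attribute["key"] in predicted_types:
            nv += 1  # a value was predicted for an attribute without evidence
        else:
            nn += 1  # no value predicted, as expected
    return nn, nv


prediction = ("[Taylor](Brand) [Dresses](Type) Women's [High](Size) "
              "[Low Lace](Type) Shirt Dress")

first = count_negatives([{"key": "Pattern", "evidences": []}], prediction)
second = count_negatives(
    [{"key": "Neckline", "evidences": []},
     {"key": "Pattern", "evidences": []},
     {"key": "Type", "evidences": []}],
    prediction,
)
print(first, second)
```

With the same prediction, the two target lists give different NV counts, which is the behaviour the question is about.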

opened by junwang-wish 1

Multiple values for one attribute in one paragraph

Hi, I found that this dataset contains multiple values for one attribute within a single paragraph. But in your paper, the model only "seeks the best answer span in the product context". Can this model extract multiple spans in the product context for one attribute?

Data with multiple spans:

{
  "id": "8198319301",
  "category": "Coats & Jackets",
  "paragraphs": [
    {
      "text": "HTOOHTOOH Women's Plus-size Casual Turn Down Collar Mid Length Jean Jacket",
      "source": "title"
    },
    ...
  ],
  "attributes": [
    {
      "key": "Style",
      "evidences": [
        {
          "value": "Casual",
          "pid": 0,
          "begin": 28,
          "end": 34
        },
        {
          "value": "Jean Jacket",
          "pid": 0,
          "begin": 63,
          "end": 74
        }
      ]
    }
  ]
}

{
  "id": "B00002N7X0",
  "category": "Aprons",
  "paragraphs": [
    {
      "text": "McGuire Nicholas C9 4 Pocket Utility Bib Apron in Natural Cotton",
      "source": "title"
    },
    {
      "text": "Constructed of heavy duty but lightweight cotton and ideal for a variety of jobs. The large waist pockets help to store tools or brushes. Reinforced at stress points for added durability. 2 large waist pockets 1 medium bib pocket 1 small bib pocket Extra reinforcement at stress points Canvas loop neck & waist tie Cotton canvas.",
      "source": "description"
    },
    ...
  ],
  "attributes": [
    {
      "key": "Style",
      "evidences": [
        {
          "value": "Bib",
          "pid": 0,
          "begin": 37,
          "end": 40
        },
        {
          "value": "bib",
          "pid": 1,
          "begin": 219,
          "end": 222
        },
        {
          "value": "bib",
          "pid": 1,
          "begin": 238,
          "end": 241
        },
        {
          "value": "neck",
          "pid": 1,
          "begin": 298,
          "end": 302
        },
        ...
      ]
    }
  ]
}
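Records like these can be found by grouping evidences by attribute and paragraph. The sketch below is one way to do that; `multi_span_attributes` is an illustrative helper, and the record is the first example above with its elided paragraphs omitted.

```python
from collections import defaultdict


def multi_span_attributes(record):
    """Map (attribute key, pid) to its values when a paragraph holds more than one span."""
    grouped = defaultdict(list)
    for attribute in record["attributes"]:
        for evidence in attribute["evidences"]:
            grouped[(attribute["key"], evidence["pid"])].append(evidence["value"])
    return {key: values for key, values in grouped.items() if len(values) > 1}


jacket = {
    "id": "8198319301",
    "category": "Coats & Jackets",
    "paragraphs": [
        {"text": "HTOOHTOOH Women's Plus-size Casual Turn Down Collar Mid Length Jean Jacket",
         "source": "title"},
    ],
    "attributes": [
        {"key": "Style",
         "evidences": [
             {"value": "Casual", "pid": 0, "begin": 28, "end": 34},
             {"value": "Jean Jacket", "pid": 0, "begin": 63, "end": 74},
         ]},
    ],
}

multi = multi_span_attributes(jacket)
print(multi)
```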

opened by ShengleiH 1
