Code and data for "TURL: Table Understanding through Representation Learning"


TURL

This repo contains the code and data for "TURL: Table Understanding through Representation Learning".

[Model overview figure]

Environment and Setup

The model is mainly developed with PyTorch and Transformers. You can pull the Docker image we used with: docker pull xdeng/transformers:latest

Data

The processed pretraining and evaluation data, as well as the model checkpoints, can be accessed here. They are created based on the original WikiTables corpus (http://websail-fe.cs.northwestern.edu/TabEL/).

TODO: Instructions for preparing the data from the original WikiTable corpus.

Pretraining

Data

The [split]_tables.jsonl files are used for pretraining and for creating all test datasets, with 570,171 / 5,036 / 4,964 tables for training / validation / testing.

'_id': '27289759-6', # table id
'pgTitle': '2010 Santos FC season', # page title
'sectionTitle': 'Out', # section title
'tableCaption': '', # table caption
'pgId': 27289759, # wikipedia page id
'tableId': 6, # index of the table in the wikipedia page
'tableData': [[{'text': 'DF', # cell value
    'surfaceLinks': [{'surface': 'DF',
      'locType': 'MAIN_TABLE',
      'target': {'id': 649702,
       'language': 'en',
       'title': 'Defender_(association_football)'},
      'linkType': 'INTERNAL'}] # urls in the cell
      } # one for each cell,...]
      ...]
'tableHeaders': [['Pos.', 'Name', 'Moving to', 'Type', 'Source']], # row headers
'processed_tableHeaders': ['pos.', 'name', 'moving to', 'type', 'source'], # processed headers that will be used
'merged_row': [], # merged rows, we identify them by comparing the cell values
'entityCell': [[1, 1, 1, 0, 0],...], # whether each cell is an entity cell, obtained by checking the urls inside
'entityColumn': [0, 1, 2], # indices of the entity columns
'column_type': [0, 0, 0, 4, 2], # more fine-grained column type for debugging, here we only use 0: entity columns
'unique': [0.16, 1.0, 0.75, 0, 0], # the ratio of unique entities in that column
'entity_count': 72, # total number of entities in the table
'subject_column': 1 # the column index of the subject column

Each line represents a Wikipedia table. Table content is stored in the field tableData, where target is the entity linked to the cell and is also the entity to retrieve. Its id and title are the Wikipedia_id and Wikipedia_title of the entity. entityCell and entityColumn show the cells and columns that pass our filtering and are identified to contain entity information.
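
For reference, here is a minimal sketch of iterating over the pretraining tables and collecting the linked entities; the data path is an assumption, so adjust it to wherever the released files are unpacked:

    import json

    # Hypothetical path to one of the released [split]_tables.jsonl files.
    with open("data/train_tables.jsonl", "r", encoding="utf-8") as f:
        for line in f:
            table = json.loads(line)
            headers = table["processed_tableHeaders"]
            # Collect the linked entities (Wikipedia id and title) in the entity cells.
            entities = []
            for row in table["tableData"]:
                for cell in row:
                    for link in cell.get("surfaceLinks", []):
                        target = link["target"]
                        entities.append((target["id"], target["title"]))
            print(table["_id"], headers, len(entities))
            break  # inspect only the first table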

There is also an entity_vocab.txt file that contains all the entities we use in the experiments (these are the entities seen during pretraining). Each line contains the vocab_id, Wikipedia_id, Wikipedia_title, freebase_mid, and count of an entity.
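
A small sketch for reading the vocabulary, assuming the five fields are tab-separated (check the released file for the exact delimiter):

    # Hypothetical path and delimiter; each line holds
    # vocab_id, Wikipedia_id, Wikipedia_title, freebase_mid, count.
    entity_vocab = {}
    with open("data/entity_vocab.txt", "r", encoding="utf-8") as f:
        for line in f:
            vocab_id, wiki_id, wiki_title, mid, count = line.rstrip("\n").split("\t")
            entity_vocab[int(vocab_id)] = {
                "wiki_id": int(wiki_id),
                "wiki_title": wiki_title,
                "mid": mid,
                "count": int(count),
            }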

Get representation for a given table

To use the pretrained model as a table encoder, use the HybridTableMaskedLM model class. There is an example in evaluate_task.ipynb for the cell filling task, which also shows how to get representations for an arbitrary table.
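
The exact construction of HybridTableMaskedLM is shown in evaluate_task.ipynb; the sketch below (the checkpoint path and config handling are assumptions) only illustrates loading the released checkpoint weights with plain PyTorch before handing them to the model:

    import torch

    # Hypothetical checkpoint path from the released data.
    state_dict = torch.load("checkpoint/pytorch_model.bin", map_location="cpu")

    # Build the encoder exactly as in evaluate_task.ipynb, then load the weights:
    # model = HybridTableMaskedLM(config)
    # model.load_state_dict(state_dict)
    # model.eval()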

Finetuning & Evaluation

To systematically evaluate our pre-trained framework as well as facilitate research, we compile a table understanding benchmark consisting of 6 widely studied tasks covering table interpretation (e.g., entity linking, column type annotation, relation extraction) and table augmentation (e.g., row population, cell filling, schema augmentation).

Please see evaluate_task.ipynb for running evaluation for different tasks.

Entity Linking

We use two datasets for evaluation in entity linking. One is based on our train/dev/test split, where the entity linked to each cell is the target for entity linking. The other is the WikiGS corpus; please find its original release here: http://www.cs.toronto.edu/~oktie/webtables/

We use the entity name, together with the entity description and entity type, to get the KB entity representation for entity linking. There are three variants for entity linking: 0: name + description + type, 1: name + type, 2: name + description.
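
As an illustration only (the helper below is hypothetical, not part of the repo), the three variants roughly correspond to concatenating different pieces of KB information into the entity-side text:

    def build_entity_text(name, description, types, mode=0):
        """Assemble the KB-side text for an entity under the three EL variants.

        mode 0: name + description + type
        mode 1: name + type
        mode 2: name + description
        """
        parts = [name]
        if mode in (0, 2):
            parts.append(description)
        if mode in (0, 1):
            parts.extend(types)
        return " ".join(p for p in parts if p)

    # Example candidate from the dev data:
    print(build_entity_text("Björn Borg", "Swedish swimmer", ["Swimmer"], mode=1))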

Evaluation

Please see EL in evaluate_task.ipynb

Data

Data are stored in [split].table_entity_linking.json

'23235546-1', # table id
'Ivan Lendl career statistics', # page title
'Singles: 19 finals (8 titles, 11 runner-ups)', # section title
'', # caption
['outcome', 'year', ...], # headers
[[[0, 4], 'Björn Borg'], [[9, 2], 'Wimbledon'], ...], # cells, [index, entity mention (cell text)]
[['Björn Borg', 'Swedish tennis player', []], ['Björn Borg', 'Swedish swimmer', ['Swimmer']], ...], # candidate entities, this is the merged set for all cells. [entity name, entity description, entity types]
[0, 12, ...] # labels, this is the index of the gold entity in the candidate entities
[[0, 1, ...], [11, 12, 13, ...], ...] # candidates for each cell
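
A minimal sketch of unpacking one example, assuming the top-level JSON object is a list of examples whose fields appear in the order shown above:

    import json

    # Hypothetical path; replace "dev" with the split of interest.
    with open("data/dev.table_entity_linking.json", "r", encoding="utf-8") as f:
        el_data = json.load(f)

    table_id, pg_title, sec_title, caption, headers, cells, candidates, labels, cell_candidates = el_data[0]
    for (index, mention), label, cand_ids in zip(cells, labels, cell_candidates):
        gold_name, gold_desc, gold_types = candidates[label]
        print(index, mention, "->", gold_name)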

Column Type Annotation

We divide the information available in the table for column type annotation into: entity mention, table metadata, and entity embedding. We experiment under 6 settings: 0: all information, 1: only entity related, 2: only table metadata, 3: no entity embedding, 4: only entity mention, 5: only entity embedding.

Data

Data are stored in [split].table_col_type.json. There is a type_vocab.txt file that stores the target types.

'27295818-29', # table id
 '2010–11 rangers f.c. season', # page title
 27295818, # Wikipedia page id
 'overall', # section title
 '', # caption
 ['competition', 'started round', 'final position / round'], # headers
 [[[[0, 0], [26980923, 'Scottish Premier League']],
   [[1, 0], [18255941, 'UEFA Champions League']],
   ...],
  ...,
  [[[1, 2], [18255941, 'Group stage']],
   [[2, 2], [20795986, 'Round of 16']],
   ...]], # cells, [index, [entity id, entity mention (cell text)]]
 [['time.event'], ..., ['time.event']] # column type annotations, a column may have multiple types.
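
A minimal sketch of iterating over the column type data, again assuming the file is a list of examples laid out in the order shown above:

    import json

    # Hypothetical path; replace "dev" with the split of interest.
    with open("data/dev.table_col_type.json", "r", encoding="utf-8") as f:
        col_type_data = json.load(f)

    for table_id, pg_title, pg_id, sec_title, caption, headers, cells, annotations in col_type_data:
        for header, types in zip(headers, annotations):
            print(table_id, header, types)  # a column may carry multiple types
        break  # inspect only the first table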

Relation Extraction

There is a relation_vocab.txt file that stores the target relations. In the [split].table_rel_extraction.json file, each example contains table_id, pgTitle, pgId, secTitle, caption, valid_headers, entities, and relations, similar to column type annotation. Note that the relation is between the subject column (leftmost) and each of the object columns (the rest). We do this to avoid checking all column pairs in the table.
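
Since relations are only annotated between the subject column and each object column, candidate pairs can be enumerated as below (a sketch with hypothetical names, not the repo's loader):

    def subject_object_pairs(valid_headers):
        """Pair the leftmost (subject) column with every other (object) column.

        Returns (subject_index, object_index) pairs instead of all column pairs.
        """
        subject = 0  # leftmost column is the subject column
        return [(subject, obj) for obj in range(1, len(valid_headers))]

    print(subject_object_pairs(["competition", "started round", "final position / round"]))
    # [(0, 1), (0, 2)]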

Row Population

For row population, the task is to predict the entities linked to the entity cells in the leftmost entity column. A small number of tables is further filtered out from test_tables.jsonl, which results in the final 4,132 tables for testing.

Cell Filling

Please see Pretrained and CF in evaluate_task.ipynb. You can directly load the checkpoint under pretrained, as we do not finetune the model for cell filling.

We have three baselines for cell filling: Exact, H2H, and H2V. The header vectors and co-occurrence statistics are pre-computed; please see baselines/cell_filling/cell_filling.py for details.
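
For intuition only: the H2H and H2V baselines score candidates against pre-computed header vectors; a generic cosine-similarity ranking sketch (not the repo's exact scoring code) looks like this:

    import numpy as np

    def rank_by_cosine(query_vec, header_vecs):
        """Rank pre-computed header vectors by cosine similarity to a query header vector."""
        query = query_vec / (np.linalg.norm(query_vec) + 1e-12)
        mat = header_vecs / (np.linalg.norm(header_vecs, axis=1, keepdims=True) + 1e-12)
        scores = mat @ query
        return np.argsort(-scores), scores

    # Toy example with random vectors standing in for the pre-computed ones.
    rng = np.random.default_rng(0)
    order, scores = rank_by_cosine(rng.normal(size=64), rng.normal(size=(10, 64)))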

Schema Augmentation

TODO: Refactor the evaluation scripts and add instructions.

Acknowledgement

We use the WikiTable corpus to develop the dataset for pretraining and most of the evaluation. We also adopt WikiGS for the evaluation of entity linking.

We use multiple existing systems as baselines for evaluation. We took the code released by the authors and made minor changes to fit our setting; please refer to the paper for more details.

Comments
  • Cannot load data in the jupyter notebook evaluate_task.ipynb

For the first task, when executing dataset = WikiHybridTableDataset(data_dir, entity_vocab, max_cell=100, max_input_tok=350, max_input_ent=150, src="dev", max_length=[50, 10, 10], force_new=False, tokenizer=None, mode=0), it triggers an error at /data_loader/hybrid_data_loaders.py line 424: int() argument must be a string, a bytes-like object or a number, not 'NoneType'. The code on this line is: processed_data = list(tqdm(pool.imap(partial(process_single_hybrid_table, config=self), actual_tables, chunksize=2000), total=len(actual_tables))). I wonder if there's a typo here or if there are any problems in the data? If so, how should we fix it? Thanks!

    opened by ysunbp 5
  • data subdir missing from model.transformers

    Hi, I am trying to get the evaluate_task notebook running and I am facing some issues.

    /content/TURL/model/transformers/__init__.py in <module>()
         23                          is_tf_available, is_torch_available)
         24 
    ---> 25 from .data import (is_sklearn_available,
         26                    InputExample, InputFeatures, DataProcessor,
         27                    glue_output_modes, glue_convert_examples_to_features,
    
    ModuleNotFoundError: No module named 'model.transformers.data'
    

    It seems that the data directory is missing from model.transformers. Any suggestions? Thanks!

    opened by alex-bogatu 2
  • Problem reproducing the CT model

    I want to fine-tune the pre-trained model into the column type model based on your provided code and data. The finetuning with fine_tune_CT.sh runs without problems, but when I then validate the output model in evaluate_task.ipynb I get an F1 score of 0.07. What I did was the following:

    1. downloaded data & pretrained model provided by your link
    2. downloaded bert-base-uncased from https://huggingface.co/bert-base-uncased and copied into data/pre-trained_models/bert-base-uncased
    3. run the fine_tune_CT.sh script
    4. evaluate the model in evaluate_task.ipynb

    As mentioned above, the F1 score is very low. When I evaluate your provided, already fine-tuned column_type model, I get the scores reported in your paper, but not with my own fine-tuned model. Am I doing something wrong?

    opened by SvenLang 1
  • How to create dataset for Column Type Annotation?

    Hello, Thank you for sharing your great work. I am currently trying to create training data for Column Type Annotation. According to your paper, you mentioned the data was created in a way that "we only keep columns in our relational table corpus that have at least 3 linked entities to Freebase". So I tried to extract tables from the TabEL data that had at least three links to Freebase. However, as far as I examined, there were no such links in TabEL. Could you please tell me how to extract tables for Column Type Annotation?

    opened by yuji-mizobuchi 1
  • The pre-trained model in evaluate_task.ipynb is missing

    Hello! I tried to run the evaluate_task.ipynb file and noticed that the checkpoint file output/hybrid/v2/model_v1_table_0.2_0.6_0.7_10000_1e-4_candnew_0_adam/pytorch_model.bin is missing. Could you please provide this file so that I can run the Jupyter notebook successfully? Thanks!

    opened by ysunbp 1
  • Problems with the data and paths

    Hi, I am a university student who is new to the NLP field, and my supervisor recommended that I study this TURL paper and the code. I downloaded all the files and data and tried to run train.py, but it cannot run because of many incomplete or wrong paths. I searched and tried many things but failed. I am wondering whether anybody can run the code of this paper, and may I have any advice on how to run it smoothly?

    Thanks very much and hope everything goes well.

    opened by Lpn-98 1
  • "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"

    Hi. When I try to use "docker pull xdeng/transformers:latest", there is an error as "Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?". Do you have any ideas to address it? Thanks!

    opened by canqin001 1
  • License for the source code?

    @xiang-deng we were trying to figure out the license for the source code of TURL. If you don't mind, can you please add the correct license to the repo?

    Thanks a lot!

    opened by nandana 1
  • the end of train.table_col_type.json

    Hi, it seems that train.table_col_type.json does not end normally and contains 215,954 tables. Could you please update that file? In the paper there are 397,098 tables for training. Best regards, zsy

    opened by zsy-98 1
  • Row Population Build_index.py

    Hi,

    For creating the index in row population we need the wikipedia_categories.jsonl file. Is it possible for you to upload that file?

    Best regards, Daniel

    opened by ikhan94 0
  • Candidate Generation Entity Linking

    I was trying to use TURL on a different dataset to see how it performs. Is it possible to upload the code you used for candidate generation in entity linking?

    opened by ikhan94 2
  • Entity Linking (F1-Score, Precision and Recall)

    I wanted to reproduce the F1 score, precision, and recall for entity linking, but I can't find the code for evaluating it. I also looked at evaluate_task.ipynb but couldn't find it there.

    opened by ikhan94 2
Owner: SunLab-OSU