Finetuning Pipeline

Overview

KLUE Baseline

KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark. See our paper for more details about KLUE and the baselines.

Dependencies

Make sure you have installed the packages listed in requirements.txt.

pip install -r requirements.txt

All experiments were tested in a Python 3.7 environment.

KLUE Benchmark Datasets

All train/dev sets of KLUE tasks are publicly available in this repo. You can access them by using git submodules. To clone the repo with datasets:

git clone --recursive https://github.com/KLUE-benchmark/KLUE-Baseline.git

Or, if you have already cloned this repo, download the datasets with:

git submodule update --init --recursive

The test sets are not publicly available. To measure your model's performance on a test set, first train the model on the train set, then submit it to our submission system. Alternatively, you can compare your dev set performance with that of our baseline models, which is also reported in our paper.

Train

To reproduce our baselines, run run_all.sh.

NOTE: klue/roberta models accept inputs of at most 510 tokens. Details are explained here.

Reference

If you use this code or KLUE, please cite:

@misc{park2021klue,
      title={KLUE: Korean Language Understanding Evaluation}, 
      author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
      year={2021},
      eprint={2105.09680},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contribution

Feel free to open an issue if you have any questions or comments. To contribute, please run make style before creating a pull request.

Comments
  • Submission per task

    What's your idea? 🤔

    Hello, and thank you for releasing this pipeline for the KLUE benchmark. I would like to evaluate my fine-tuning results on the test set as well. What format does the official KLUE page expect for a submission? For now I am compressing the output_dir itself with tar -czvf submission.tar.gz [output_dir], but the submission keeps failing, so I am going through the log. I would like to know whether only the checkpoint should be packaged separately, or whether there is another way.

    Thank you.

    opened by Jihyun22 2
  • Bug in validation_epoch_step in klue_baseline/models/named_entity_recognition.py

    Abstract 🔥

    When an unk token is present, the character_preds list is not built correctly.

    How to Reproduce 🤔

    For example, tokenizing "전문성운줄알았어여~ᄏ" (sample id: klue-ner-v1_dev_00236-nsmc) gives:

    ['전문', '##성', '##운', '##줄', '##알', '##았', '##어', '##여', '~', '[UNK]']

    When such an input enters the if branch at https://github.com/KLUE-benchmark/KLUE-baseline/blob/main/klue_baseline/models/named_entity_recognition.py#L98-L129, character_preds is not produced in the intended form.

    The cause seems to be that the code was written on the assumption that, within a whitespace-delimited word (eojeol) containing unk, every subword other than unk is a symbol.

    As a result, for a subword such as '전문', which has 2 characters, the subword_pred is appended for only one character. (https://github.com/KLUE-benchmark/KLUE-baseline/blob/main/klue_baseline/models/named_entity_recognition.py#L125)
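The alignment the report expects can be sketched as below. This is a minimal illustration rather than the repository's code: the function name `expand_to_char_preds` and its parameters are hypothetical, and it assumes that each `[UNK]` token covers exactly one character, which holds for the sample above but not in general.

```python
from typing import List


def expand_to_char_preds(
    subwords: List[str],
    subword_preds: List[int],
    continuation_prefix: str = "##",  # assumed WordPiece-style prefix
    unk_token: str = "[UNK]",
) -> List[int]:
    """Repeat each subword-level prediction once per character it covers."""
    char_preds: List[int] = []
    for subword, pred in zip(subwords, subword_preds):
        if subword == unk_token:
            n_chars = 1  # simplifying assumption: one character per [UNK]
        elif subword.startswith(continuation_prefix):
            n_chars = len(subword) - len(continuation_prefix)
        else:
            n_chars = len(subword)
        char_preds.extend([pred] * n_chars)
    return char_preds


subwords = ["전문", "##성", "##운", "##줄", "##알", "##았", "##어", "##여", "~", "[UNK]"]
# Dummy label ids, one per subword, so the repetition is easy to see.
char_preds = expand_to_char_preds(subwords, list(range(len(subwords))))
print(char_preds)
```

The two-character subword '전문' contributes two entries, so the result has one prediction per character of the original 11-character string.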

    How to solve 🙋‍♀

                    if self.tokenizer.unk_token in subwords:  # case that needs expansion
                        unk_aligned_subwords = self.tokenizer_out_aligner(
                            word, subwords, strip_char
                        )  # [UNK] -> [UNK, +UNK]
                        add_char_preds_idx = 0  # added
                        unk_flag = False
                        for subword in unk_aligned_subwords:
                            if character_preds_idx >= self.hparams.max_seq_length - 1:
                                break
                            subword_pred = subword_preds[character_preds_idx].tolist()
                            subword_pred_label = label_list[subword_pred]
                            if subword == self.tokenizer.unk_token:
                                unk_flag = True
                                character_preds.append(subword_pred)
                                add_char_preds_idx += 1  # added
                                continue
                            elif subword == self.in_unk_token:
                                if subword_pred_label == "O":
                                    character_preds.append(subword_pred)
                                else:
                                    _, entity_category = subword_pred_label.split("-")
                                    character_pred_label = "I-" + entity_category
                                    character_pred = label_list.index(character_pred_label)
                                    character_preds.append(character_pred)
                                add_char_preds_idx += 1  # added
                                continue
                            else:
                                if unk_flag:
                                    character_preds_idx += 1
                                    subword_pred = subword_preds[character_preds_idx].tolist()
                                    subword_pred = [subword_pred] * len(subword.lstrip(strip_char))  # added
                                    character_preds.extend(subword_pred)  # added
                                    unk_flag = False
                                else:
                                    subword_pred = [subword_pred] * len(subword.lstrip(strip_char))    # added
                                    character_preds.extend(subword_pred)    # added
                                    character_preds_idx += 1  # also += 1 when `+UNK` ends, so that we move on to the next label
                        character_preds_idx += add_char_preds_idx    # added
    

    I wrote this code rather roughly for now; it would be great if you could review this part and update it with better code(?)!

    Thank you for building such a good finetuning system 🙇‍♂️

    opened by KhelKim 0
  • NER bug fix

    The original code includes the 'O' class when calculating the F1 score; according to the KLUE paper, it should be excluded. This commit fixes the issue.
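To see why the 'O' class matters, here is a small pure-Python sketch of a macro-averaged F1 that skips excluded labels. It is an illustration under assumptions, not the code from this PR: the function name `macro_f1_excluding` is hypothetical, the tags are toy character-level labels, and the real baseline may compute the metric through a library instead.

```python
from typing import List, Sequence


def macro_f1_excluding(
    y_true: List[str], y_pred: List[str], exclude: Sequence[str] = ("O",)
) -> float:
    """Macro-average per-label F1 over every label not listed in `exclude`."""
    labels = sorted((set(y_true) | set(y_pred)) - set(exclude))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2TP / (2TP + FP + FN); the denominator is non-zero because
        # every label here occurs in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: mostly 'O', with one missed I-PS tag.
y_true = ["O", "O", "O", "O", "B-PS", "I-PS"]
y_pred = ["O", "O", "O", "O", "B-PS", "O"]

f1_without_o = macro_f1_excluding(y_true, y_pred)           # 'O' excluded
f1_with_o = macro_f1_excluding(y_true, y_pred, exclude=())  # 'O' included
print(f1_without_o, f1_with_o)
```

Because most tokens are 'O' and are easy to predict, including that class raises the average and hides the missed entity tag.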

    opened by Joon-June 0
  • Training error on klue-dp task

    Abstract 🔥

    When run-all.sh is run on multiple GPUs, some tasks (dependency parsing) do not work properly.

    error-message:

    RuntimeError: The size of tensor a (23) must match the size of tensor b (25) at non-singleton dimension 2

    How to Reproduce 🤔

    [python==3.7.11]

    git clone --recursive https://github.com/KLUE-benchmark/KLUE-Baseline.git
    pip install -r requirements.txt
    pip install torch==1.7.0+cu110 -f https://download.pytorch.org/whl/torch_stable.html (use the CUDA version that matches your torch build)

    Modification to run-all.sh: KLUE-DP task="klue-dp"

    python run_klue.py train --task ${task} --output_dir ${OUTPUT_DIR} --data_dir ${DATA_DIR}/${task}-${VERSION} --model_name_or_path klue/roberta-large --learning_rate 5e-5 --num_train_epochs 15 --gradient_accumulation_steps 1 --warmup_ratio 0.2 --train_batch_size 32 --patience 10000 --max_seq_length 256 --metric_key uas_macro_f1 --gpus 0 --num_workers 4

    ->

    python run_klue.py train --task ${task} --output_dir ${OUTPUT_DIR} --data_dir ${DATA_DIR}/${task}-${VERSION} --model_name_or_path klue/roberta-large --learning_rate 3e-5 --num_train_epochs 10 --train_batch_size 16 --eval_batch_size 16 --max_seq_length 510 --gradient_accumulation_steps 2 --warmup_ratio 0.2 --weight_decay 0.01 --max_grad_norm 1.0 --patience 100000 --metric_key slot_micro_f1 --gpus 1 2 3 --num_workers 8

    bash run-all.sh

    RuntimeError: The size of tensor a (23) must match the size of tensor b (25) at non-singleton dimension 2

    How to solve 🙋‍♀

    On a single GPU, training with the RoBERTa-Large model is not possible because of insufficient memory, so I am asking in case you can help!

    Thank you.

    opened by pion0926 0
  • About klue_baseline/data/klue_dp.py

    Abstract 🔥

    Hello! While trying out fine-tuning I noticed something that is not directly a bug, but I am filing it as an issue anyway. It is not information that klue_dp.py actually uses, but the guid stored for each example is wrong.

    How to Reproduce 🤔

    In the feature.append step of the convert_examples_to_features function, example.guid is the guid of the newly received example, so every guid ends up shifted by one.

    How to solve 🙋‍♀

    This can be fixed by storing the guid of the previous example instead.
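The off-by-one pattern and the suggested fix can be sketched as follows. The data structure and function here (`TokenExample`, `group_by_sentence`) are hypothetical stand-ins, assuming that examples arrive one token at a time and that a feature is emitted when the sentence id changes; the actual klue_dp.py may be organised differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TokenExample:
    """Hypothetical per-token example; field names are illustrative only."""
    guid: str
    sent_id: int
    token: str


def group_by_sentence(examples: List[TokenExample]) -> List[Tuple[str, List[str]]]:
    """Collect tokens into (guid, tokens) pairs, one per sentence."""
    features: List[Tuple[str, List[str]]] = []
    tokens: List[str] = []
    prev: Optional[TokenExample] = None
    for example in examples:
        if prev is not None and example.sent_id != prev.sent_id:
            # The sentence that just ended belongs to `prev`, not to `example`.
            # Using example.guid here would shift every guid by one sentence.
            features.append((prev.guid, tokens))
            tokens = []
        tokens.append(example.token)
        prev = example
    if prev is not None:
        features.append((prev.guid, tokens))  # flush the final sentence
    return features


examples = [
    TokenExample("sent-0", 0, "a"),
    TokenExample("sent-0", 0, "b"),
    TokenExample("sent-1", 1, "c"),
]
features = group_by_sentence(examples)
print(features)
```

Keeping a reference to the previous example makes the flush use the guid of the sentence that was actually accumulated.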

    opened by joonkeekim 0
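  • Update requirements.txt

    First of all, thank you for creating the KLUE baseline. It has been a great help in understanding KLUE.

    PR Point

    • Pin library versions in requirements.txt so that it can be run in a Colab environment

    Note

    I tried installing the libraries on Colab, but the installation failed because of dependency problems. I changed the file so that the installation succeeds and am opening this PR. I would appreciate it if you could check the logs in the AS-IS and TO-BE sections at the link below. https://colab.research.google.com/drive/1KOy8VzKQT4Sk2J53NKjy5zzbs_RIk5zM?usp=sharing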

    opened by tucan9389 1