Jingju Baseline
This repository provides the baseline of our project on Beijing opera (Jingju) script generation. The baseline model is based on gpt2-chinese-ancient, which is pretrained on 1.5 GB of literary Chinese. Please refer to our paper for details.
Directory Structure
jingju_baseline/
|-- finetuning.py   # finetuning script
|-- jingju_test.py  # test script
|-- preprocess.py   # data preprocessing script
|-- config/         # model configuration files
|-- corpora/        # corpus files
|-- models/         # vocab file, model checkpoints, and other necessary files
|-- scripts/        # several utility scripts
|-- test/           # test files
|-- uer/            # files from UER-py
Environment Preparation
Our baseline model is finetuned with the pretraining framework UER-py. Refer to that project's requirements for the environment setup.
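A minimal setup sketch, assuming a CUDA-capable machine; the package list mirrors UER-py's stated requirements and is not pinned by this repository, so check the UER-py documentation for the authoritative versions. apex is only needed if you pass --fp16 during finetuning.
# core dependencies (illustrative; see UER-py requirements for exact versions)
pip install torch six packaging
# optional: NVIDIA apex for --fp16 mixed-precision finetuning
git clone https://github.com/NVIDIA/apex
cd apex && pip install -v --no-cache-dir ./ && cd ..
# (building apex with CUDA extensions needs extra flags; see the apex README)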
Finetuning
- Data preprocessing
python3 preprocess.py --corpus_path corpora/jingju_train.txt \
--vocab_path models/vocab.txt \
--tokenizer bert \
--dataset_path corpora/jingju_train.pt \
--processes_num 32 --seq_length 1024 --target lm
python3 preprocess.py --corpus_path corpora/jingju_dev.txt \
--vocab_path models/vocab.txt \
--tokenizer bert \
--dataset_path corpora/jingju_dev.pt \
--processes_num 32 --seq_length 1024 --target lm
- Finetuning
export CUDA_VISIBLE_DEVICES=0
nohup python3 -u finetuning.py --dataset_path corpora/jingju_train.pt \
--devset_path corpora/jingju_dev.pt \
--vocab_path models/vocab.txt \
--config_path config/jingju_config.json \
--output_model_path models/finetuned_model.bin \
--pretrained_model_path models/uer-ancient-chinese.bin \
--world_size 1 --gpu_ranks 0 \
--total_steps 100000 --save_checkpoint_steps 50000 \
--report_steps 1000 --learning_rate 5e-5 \
--batch_size 5 --accumulation_steps 4 \
--embedding word_pos --fp16 --fp16_opt_level O1 \
--remove_embedding_layernorm --encoder transformer \
--mask causal --layernorm_positioning pre \
--target lm --tie_weights > finetuning.log 2>&1 &
Refer to the UER-py documentation for the function of every argument. Specifically, you can change the environment variable CUDA_VISIBLE_DEVICES together with the --world_size and --gpu_ranks options for multi-GPU training; see the sketch below. Enabling --fp16 together with --fp16_opt_level requires apex.
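For example, a two-GPU variant of the finetuning command might look like the following. This is a sketch only: the GPU ids, log file name, and the reuse of the single-GPU hyperparameters are assumptions to adapt to your hardware.
export CUDA_VISIBLE_DEVICES=0,1
nohup python3 -u finetuning.py --dataset_path corpora/jingju_train.pt \
--devset_path corpora/jingju_dev.pt \
--vocab_path models/vocab.txt \
--config_path config/jingju_config.json \
--output_model_path models/finetuned_model.bin \
--pretrained_model_path models/uer-ancient-chinese.bin \
--world_size 2 --gpu_ranks 0 1 \
--total_steps 100000 --save_checkpoint_steps 50000 \
--report_steps 1000 --learning_rate 5e-5 \
--batch_size 5 --accumulation_steps 4 \
--embedding word_pos --fp16 --fp16_opt_level O1 \
--remove_embedding_layernorm --encoder transformer \
--mask causal --layernorm_positioning pre \
--target lm --tie_weights > finetuning_2gpu.log 2>&1 &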
Test
You can finetune the model yourself with the instructions above, or download the checkpoint (extraction code: q0yn) to the directory ./models. Then run the following:
python3 preprocess.py --corpus_path corpora/jingju_test.txt \
--vocab_path models/vocab.txt \
--tokenizer bert \
--dataset_path corpora/jingju_test.pt \
--processes_num 32 --seq_length 1024 --target lm
nohup python3 -u jingju_test.py --load_model_path models/finetuned_model.bin-100000 \
--vocab_path models/vocab.txt \
--beginning_path test/jingju_beginning.txt \
--reference_path test/jingju_reference.txt \
--prediction_path test/jingju_candidates.txt \
--test_path test/jingju_beginning.txt \
--testset_path corpora/jingju_test.pt \
--config_path config/jingju_config.json \
--seq_length 1024 --embedding word_pos \
--remove_embedding_layernorm \
--encoder transformer --mask causal \
--layernorm_positioning pre --target lm \
--tie_weights > test_candidate_generation.log 2>&1 &
The automatic metrics (i.e., F1, perplexity, BLEU, and Distinct) are printed to stdout, which the command above redirects to test_candidate_generation.log.
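To watch the run and pull out the metric lines from that log, something like the following works; the grep pattern is an assumption about how the metrics are labelled, so adjust it to match your output.
tail -f test_candidate_generation.log
grep -Ei "f1|perplexity|bleu|distinct" test_candidate_generation.log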
Generation
nohup python3 -u scripts/generate_lm.py \
--load_model_path models/finetuned_model.bin-100000 \
--vocab_path models/vocab.txt \
--test_path test/beginning.txt \
--prediction_path test/generation.txt \
--config_path config/jingju_config.json \
--seq_length 1024 --embedding word_pos \
--remove_embedding_layernorm --encoder transformer \
--mask causal --layernorm_positioning pre \
--target lm --tie_weights > generation_log.log 2>&1 &
Given a beginning, the model generates a script that continues it. The generate_lm.py script only generates sequences no longer than 1024 tokens. If you want a longer script, replace scripts/generate_lm.py with scripts/long_generate_lm.py and set --seq_length to the length you desire, as in the sketch below. Note that generation is auto-regressive, so producing long sequences is time-consuming.
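A sketch of that long-generation variant, assuming scripts/long_generate_lm.py accepts the same arguments as generate_lm.py; the target length of 2048 and the output file names are illustrative only.
nohup python3 -u scripts/long_generate_lm.py \
--load_model_path models/finetuned_model.bin-100000 \
--vocab_path models/vocab.txt \
--test_path test/beginning.txt \
--prediction_path test/long_generation.txt \
--config_path config/jingju_config.json \
--seq_length 2048 --embedding word_pos \
--remove_embedding_layernorm --encoder transformer \
--mask causal --layernorm_positioning pre \
--target lm --tie_weights > long_generation_log.log 2>&1 &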
Citation
@article{zhao2019uer,
title={UER: An Open-Source Toolkit for Pre-training Models},
author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
journal={EMNLP-IJCNLP 2019},
pages={241},
year={2019}
}
@article{radford2019language,
title={Language Models are Unsupervised Multitask Learners},
author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
year={2019}
}