Code for CPM-2 Pre-Train

Tsinghua AI

Last update: Dec 28, 2022

Related tags

Deep Learning CPM-2-Pretrain

Overview

CPM-2 Pre-Train

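Pre-train CPM-2. This branch contains the pre-training code for the 11-billion-parameter non-MoE model. For the pre-training code of the MoE model, switch to the moe branch.

For the CPM-2 technical report, see link.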

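0 Model download

Apply for access on the BAAI resource download page. The files are described below:

File name	Description	Parameters
100000.tar	Chinese-only model	11 billion
36000.tar	Chinese-English bilingual model	11 billion
300000.tar	Chinese-English MoE model	198 billion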

1 Installation

You can pull the Docker environment we provide directly:

docker pull gyxthu17/cpm-2:1.0

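2 Data

scripts/gen_data.sh gives an example script for generating the data files. The script converts a multi-line plain-text file (one document per line) into binary files (it outputs three .bin files and three .idx files) that the model can read efficiently.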
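
The idea of pairing a binary token file with an index file can be sketched as follows. This is a minimal illustration only, not the format that scripts/gen_data.sh actually produces: the function names, the int32/int64 layout, and the character-level `toy_tokenize` are all assumptions made for the example.

```python
import os
import struct
import tempfile


def write_indexed(docs, prefix, tokenize):
    """Write all token ids to <prefix>.bin and document offsets to <prefix>.idx."""
    offsets = [0]
    with open(prefix + ".bin", "wb") as bin_f:
        for doc in docs:
            ids = tokenize(doc)
            # Token ids are stored as little-endian int32 (assumed layout).
            bin_f.write(struct.pack(f"<{len(ids)}i", *ids))
            offsets.append(offsets[-1] + len(ids))
    with open(prefix + ".idx", "wb") as idx_f:
        # Offsets are stored as little-endian int64, counted in tokens.
        idx_f.write(struct.pack(f"<{len(offsets)}q", *offsets))
    return offsets


def read_doc(prefix, i):
    """Read document i back by seeking to its offset in the .bin file."""
    with open(prefix + ".idx", "rb") as idx_f:
        raw = idx_f.read()
    offsets = struct.unpack(f"<{len(raw) // 8}q", raw)
    start, end = offsets[i], offsets[i + 1]
    with open(prefix + ".bin", "rb") as bin_f:
        bin_f.seek(start * 4)
        data = bin_f.read((end - start) * 4)
    return list(struct.unpack(f"<{end - start}i", data))


def toy_tokenize(line):
    # Hypothetical stand-in tokenizer: one id per character.
    return [ord(ch) for ch in line]


lines = ["first document", "第二个文档"]  # one document per line
prefix = os.path.join(tempfile.mkdtemp(), "demo")
doc_offsets = write_indexed(lines, prefix, toy_tokenize)
second_doc = read_doc(prefix, 1)
```

Because the index stores cumulative offsets, a reader can fetch any single document with one seek instead of scanning the whole text file.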

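3 Training

First, set the WORKING_DIR variable to the path of the CPM-2 directory. Adjust NUM_WORKERS and NUM_GPUS_PER_WORKER to specify the number of machines and the number of GPUs on each machine. Then edit ${WORKING_DIR}/src/configs/host_files/hostfile-cpm2, replacing the host names in it with each machine's IP address or a host name that resolves to that IP address.

Run: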

cd src
bash scripts/pretrain_enc_dec.sh
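
The host file edited above can also be generated from a list of machines. The sketch below assumes the file follows the DeepSpeed convention of one `hostname slots=N` line per machine; the IP addresses and the `render_hostfile` helper are hypothetical, and the output should be checked against the hostfile-cpm2 shipped with the repository before use.

```python
def render_hostfile(hosts, gpus_per_worker):
    """Return host file text with one 'host slots=N' line per machine."""
    if gpus_per_worker < 1:
        raise ValueError("gpus_per_worker must be at least 1")
    return "".join(f"{host} slots={gpus_per_worker}\n" for host in hosts)


# Hypothetical addresses; replace with the real IPs or resolvable host names.
hosts = ["10.0.0.1", "10.0.0.2"]
hostfile_text = render_hostfile(hosts, gpus_per_worker=8)

# These two values correspond to NUM_WORKERS and NUM_GPUS_PER_WORKER.
num_workers = len(hosts)
print(hostfile_text, end="")
```

Deriving NUM_WORKERS from the same list that produces the host file keeps the two settings from drifting apart.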

4 Citation

If you use our code, please cite the following paper.

@article{cpm-v2,
  title={CPM-2: Large-scale Cost-efficient Pre-trained Language Models},
  author={Zhang, Zhengyan and Gu, Yuxian and Han, Xu and Chen, Shengqi and Xiao, Chaojun and Sun, Zhenbo and Yao, Yuan and Qi, Fanchao and Guan, Jian and Ke, Pei and Cai, Yanzheng and Zeng, Guoyang and Tan, Zhixing and Liu, Zhiyuan and Huang, Minlie and Han, Wentao and Liu, Yang and Zhu, Xiaoyan and Sun, Maosong},
  year={2021}
}

Comments

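On data processing
Hello, I ran into the following problems when trying to run the CPM-2-Pretrain code:

1. In "src-->data-->dataset_utils.py", compile_helper() is missing the following line compared with DeepSpeed: ret = subprocess.run(['make', '-C', path]). Without this line, import helper fails; with it, problem 2 below occurs. Should this line be added?

2. After adding ret = subprocess.run(['make', '-C', path]) and starting pre-training, I get an out-of-bounds error: IndexError: index 53004 is out of bounds for axis 0 with size 62. It occurs at line 228 of "src-->data-->enc_dec_dataset.py". As far as I can tell, it is raised by offset=tmp_target_offset[2 * x]. I don't know what is causing it.

I'm sorry that I can't paste the original error output. This problem has troubled me for two days, and I hope the authors can help. Many thanks ❤️❤️❤️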
opened by RoyZhanyi 10

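Help: each of my two machines works in single-node tests, but multi-node parallel runs hit an environment problem

Each of my two machines works fine when tested on its own, but an environment problem appears once I run them in parallel across machines: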

ip:   File "/path/to//src/pretrain_enc_dec.py", line 823, in <module>
ip:     main()
ip:   File "/path/to//src/pretrain_enc_dec.py", line 684, in main
ip:     model, optimizer, lr_scheduler = setup_model_and_optimizer(args, tokenizer.vocab_size)
ip:   File "/path/to//src/pretrain_enc_dec.py", line 157, in setup_model_and_optimizer
ip:     model, optimizer, _, lr_scheduler = deepspeed.initialize(
ip:   File "/path/to//lib/python3.8/site-packages/deepspeed/__init__.py", line 110, in initialize
ip:     engine = DeepSpeedEngine(args=args,
ip:   File "/path/to//lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 198, in __init__
ip:     util_ops = UtilsBuilder().load()
ip:   File "/path/to//lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 176, in load
ip:     return self.jit_load(verbose)
ip:   File "/path/to//lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 204, in jit_load
ip:     op_module = load(
ip:   File "/path/to//lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
ip:     return _jit_compile(
ip:   File "/path/to//lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1292, in _jit_compile
ip:     _write_ninja_file_and_build_library(
ip:   File "/path/to//lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1373, in _write_ninja_file_and_build_library
ip:     verify_ninja_availability()
ip:   File "/path/to//lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1429, in verify_ninja_availability
ip:     raise RuntimeError("Ninja is required to load C++ extensions")
ip: RuntimeError: Ninja is required to load C++ extensions

opened by XiaoqingNLP 2
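
The traceback above ends in PyTorch's check for the `ninja` build tool. A possible first diagnostic, sketched here under the assumption that the same Python environment is used on every node, is to confirm that `ninja` is visible on PATH from within that environment. The `ninja_status` helper is hypothetical and only uses the standard library.

```python
import shutil
import subprocess


def ninja_status():
    """Report whether a ninja executable is reachable from this process."""
    path = shutil.which("ninja")
    if path is None:
        return {"found": False, "path": None, "version": None}
    # Ask the executable for its version without raising on a non-zero exit.
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return {"found": True, "path": path, "version": result.stdout.strip() or None}


status = ninja_status()
print(status)
```

Running this on each machine through the same launcher that starts training (rather than in an interactive shell) shows whether the remote processes inherit a PATH that contains `ninja`.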

Vocabulary problem with CPM-2-Pretrain-moe

Hello, I ran into a problem when loading the MoE model:

RuntimeError: Error(s) in loading state_dict for EncDecModel:
size mismatch for word_embeds.weight: copying a param with shape torch.Size([6496, 4096]) from checkpoint, the shape in current model is torch.Size([6592, 4096]).
size mismatch for lm_head.weight: copying a param with shape torch.Size([6496, 4096]) from checkpoint, the shape in current model is torch.Size([6592, 4096]).
size mismatch for encoder.word_embeds.weight: copying a param with shape torch.Size([6496, 4096]) from checkpoint, the shape in current model is torch.Size([6592, 4096]).
size mismatch for decoder.word_embeds.weight: copying a param with shape torch.Size([6496, 4096]) from checkpoint, the shape in current model is torch.Size([6592, 4096]).

The vocabulary used here is the Chinese-English vocabulary: https://github.com/TsinghuaAI/CPM-2-Pretrain/blob/moe/bpe_cn_en/vocab.txt

My question is: the size of this vocabulary is 52736 = 6592 × 8, so why does loading fail with an error indicating that the vocabulary size in the model file is 6496 × 8?

This problem is really troubling me. I hope the authors can help. Many thanks.

One more small question: is there something wrong with line 26050, the last line of Chinese characters in the Chinese-English vocabulary? (It differs somewhat from line 26050 of the Chinese vocabulary.) https://github.com/TsinghuaAI/CPM-2-Pretrain/blob/moe/bpe_cn_en/vocab.txt#L26050

opened by jiayuchennlp 2
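
The arithmetic in this report can be made explicit. The sketch below assumes, as the issue does, that the embedding rows are split evenly over 8 partitions; it only restates the numbers from the error message and does not explain where the difference comes from.

```python
def rows_per_partition(padded_vocab_size, num_partitions):
    """Rows of the embedding matrix held by each partition."""
    if padded_vocab_size % num_partitions != 0:
        raise ValueError("vocabulary size must divide evenly across partitions")
    return padded_vocab_size // num_partitions


NUM_PARTITIONS = 8   # factor used in the issue
MODEL_ROWS = 6592    # shape[0] expected by the current model
CKPT_ROWS = 6496     # shape[0] stored in the checkpoint

model_vocab = MODEL_ROWS * NUM_PARTITIONS
ckpt_vocab = CKPT_ROWS * NUM_PARTITIONS
gap = model_vocab - ckpt_vocab

print(model_vocab, ckpt_vocab, gap)
```

Under these assumptions the checkpoint corresponds to a total vocabulary of 51968 rows, 768 fewer than the 52736 rows the current model builds, which is 96 rows per partition.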

Model parallelism of CPM2-MoE

May I know how the model is partitioned in CPM-2 MoE? It seems each rank takes only 1 expert, and each expert is further partitioned (i.e., 256 model partitions in total)?

Thank you for your info.

opened by MichaelXSChen 2

Model checkpoint conversion

How can I convert a DeepSpeed checkpoint into mp_model (tensor-parallel) shards and then into a single checkpoint, after training with ZeRO-1/2 and mp (tensor parallelism)? Also, which DeepSpeed version should be used? Thanks very much.

opened by k15201363625 2

Missing definition of TransposedSampler in `sampler.py`

I could not find the definition of TransposedSampler in sampler.py. I found one in https://github.com/NVIDIA/sentiment-discovery/blob/master/data_utils/samplers.py, but I am not sure whether it is the right one.

from torch.utils import data


class TransposedSampler(data.sampler.Sampler):
    """
    Instead of performing sequential sampling, samples array in a transposed fashion given the
    batch size to be sampled. Instead of generating the following indices for a batch size of 2
        1 3 5
        2 4 6
    It will generate
        1 2 3
        4 5 6
    """
    def __init__(self, data_source, batch_size, data_sampler=None):
        self.data_source = data_source
        self.batch_size = batch_size
        self.len_ds = len(data_source)
        # Floor division: items that do not fill a whole batch are dropped.
        self.strat_width = self.len_ds // batch_size
        self.data_sampler = data_sampler

    def transpose_helper(self, x):
        """Computes the index corresponding to the transpose of index x."""
        # The original had unreachable lines after this return; they are removed here.
        return ((x % self.batch_size) * self.strat_width + (x // self.batch_size)) % self.len_ds

    def __iter__(self):
        if self.data_sampler is None:
            return iter(map(self.transpose_helper, range(len(self))))
        return iter(map(self.transpose_helper, iter(self.data_sampler)))

    def __len__(self):
        # Number of indices yielded: a whole number of batches.
        return self.strat_width * self.batch_size

opened by geekinglcq 1
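
To see what order of indices the index formula in transpose_helper produces, it can be evaluated without PyTorch. The function below is a standalone sketch that reuses only that formula; `transposed_indices` is a name chosen for this example and is not part of either repository.

```python
def transposed_indices(num_items, batch_size):
    """Return the index order produced by the transpose formula."""
    width = num_items // batch_size      # same role as strat_width
    usable = width * batch_size          # same value as __len__
    return [
        ((x % batch_size) * width + x // batch_size) % num_items
        for x in range(usable)
    ]


order = transposed_indices(6, 2)
print(order)  # [0, 3, 1, 4, 2, 5]
```

With 6 items and a batch size of 2, consecutive positions alternate between the two halves of the dataset, so each slot in the batch walks through its own contiguous region.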

How to use BMInf to run inference with the 100000.tar 11B model?

I installed BMInf from Docker like this:

docker run -it --gpus 1 -v ${100000_MODEL_FILE_PATH}:/root/.cache/bigmodels --rm openbmb/bminf python3 examples/fill_blank.py

where 100000_MODEL_FILE_PATH contains 4 .pt files on my local machine. However, I got the following error:

Loading model
Failed to connect to the source server
Traceback (most recent call last):
  File "examples/fill_blank.py", line 29, in <module>
    main()
  File "examples/fill_blank.py", line 24, in main
    cpm2 = bminf.models.CPM2()
  File "/usr/local/lib/python3.6/dist-packages/bminf/models/cpm2.py", line 60, in __init__
    super().__init__(config)
  File "/usr/local/lib/python3.6/dist-packages/bminf/arch/t5/model.py", line 72, in __init__
    model_path = data.ensure_file(config.MODEL_NAME, "checkpoint.pt")
  File "/usr/local/lib/python3.6/dist-packages/bminf/data/__init__.py", line 49, in ensure_file
    raise ConnectionError("Failed to connect to the source server")
ConnectionError: Failed to connect to the source server

I also tried to convert the 4 .pt files into one .pt file, but the same error occurred. How can I solve the server connection problem?

opened by linjianz 1

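CPM-2-Pretrain data processing and loading problem

Hello, I ran into the following problems when trying to run the CPM-2-Pretrain code:

1. In "src-->data-->dataset_utils.py", compile_helper() is missing the following line compared with DeepSpeed: ret = subprocess.run(['make', '-C', path]). Without this line, import helper fails; with it, problem 2 below occurs. Should this line be added?

2. After adding ret = subprocess.run(['make', '-C', path]) and starting pre-training, I get an out-of-bounds error: IndexError: index 53004 is out of bounds for axis 0 with size 62. It occurs at line 228 of "src-->data-->enc_dec_dataset.py". As far as I can tell, it is raised by offset=tmp_target_offset[2 * x]. I don't know what is causing it.

I'm sorry that I can't paste the original error output. This problem has troubled me for two days, and I hope the authors can help. Many thanks ❤️❤️❤️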

opened by RoyZhanyi 1

About the parameters of MoE?

@zzy14 @t1101675 What confuses me is this default parameter setting: shouldn't it be d_ffn (10240) * 32 for MoE? https://github.com/TsinghuaAI/CPM-2-Pretrain/blob/a00b3dd70d71a796a1ed2a925ddf7902e0209ab3/src/configs/model/enc_dec_xlarge_config.json#L3

opened by XiaoqingNLP 0
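
Tokenization in data preprocessing cannot handle special tokens

Hello, when I use your code for data preprocessing (specifically /src/tokenization_enc_dec.py), I find that jieba.cut(text, cut_all=False) on line 182 cannot handle special tokens such as '<s>'. jieba splits it into '<', 's', '>' before encoding. Is this a problem? I would appreciate an answer. Thanks!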

opened by zetian1025 0
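
One common way to keep special tokens intact is to split the text on them first and pass only the remaining pieces to the word segmenter. The sketch below illustrates that idea under stated assumptions: `split_keep_special`, `segment`, and `char_cut` are hypothetical helpers, and `char_cut` merely stands in for `jieba.cut(piece, cut_all=False)` so that the example runs without jieba. It is not the repository's tokenizer.

```python
import re


def split_keep_special(text, special_tokens):
    """Split text into pieces, keeping each special token as its own piece."""
    if not special_tokens:
        return [text] if text else []
    # Longest tokens first so that overlapping tokens match greedily.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
    return [piece for piece in re.split(pattern, text) if piece]


def segment(text, special_tokens, cut):
    """Segment ordinary text with `cut`, passing special tokens through whole."""
    out = []
    for piece in split_keep_special(text, special_tokens):
        if piece in special_tokens:
            out.append(piece)
        else:
            out.extend(cut(piece))
    return out


def char_cut(piece):
    # Hypothetical stand-in for jieba.cut(piece, cut_all=False).
    return list(piece)


tokens = segment("<s>ab</s>", {"<s>", "</s>"}, char_cut)
print(tokens)  # ['<s>', 'a', 'b', '</s>']
```

The capture group in the pattern makes `re.split` return the delimiters as well, which is what allows the special tokens to survive as single pieces.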
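
How to load the CPM2.1 or CPM2.0 model for fine-tuning?

Currently the models from BMInf and BAAI are each distributed as a single merged model. I tried fine-tuning with cpm2-pretrain and ran into a problem when loading the model:

Since the downloaded model has already been merged into one, could you open-source the code that merges the model-parallel sub-models into a single model, as well as the code that splits a single model into model-parallel shards for loading?

By the way, here is a small question from another issue: https://github.com/TsinghuaAI/CPM-2-Pretrain/issues/16#issuecomment-983374591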

opened by XiaoqingNLP 0
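
The splitting and merging asked about here usually comes down to slicing each weight matrix along one axis and concatenating the slices back. The sketch below shows that round trip on plain nested lists. It is an illustration only: the helper names are hypothetical, which axis applies to which parameter depends on the layer type, and this is not the conversion code used for the released checkpoints.

```python
def split_rows(matrix, num_parts):
    """Split a matrix into equal row blocks (axis 0)."""
    if len(matrix) % num_parts != 0:
        raise ValueError("row count must divide evenly")
    size = len(matrix) // num_parts
    return [matrix[i * size:(i + 1) * size] for i in range(num_parts)]


def merge_rows(shards):
    """Concatenate row blocks back into one matrix."""
    return [row for shard in shards for row in shard]


def split_cols(matrix, num_parts):
    """Split a matrix into equal column blocks (axis 1)."""
    width = len(matrix[0])
    if width % num_parts != 0:
        raise ValueError("column count must divide evenly")
    size = width // num_parts
    return [[row[i * size:(i + 1) * size] for row in matrix] for i in range(num_parts)]


def merge_cols(shards):
    """Concatenate column blocks back into one matrix."""
    return [sum((shard[r] for shard in shards), []) for r in range(len(shards[0]))]


# A 4x4 toy weight whose entries encode their position.
weight = [[r * 4 + c for c in range(4)] for r in range(4)]
row_shards = split_rows(weight, 2)
col_shards = split_cols(weight, 2)
restored_from_rows = merge_rows(row_shards)
restored_from_cols = merge_cols(col_shards)
```

A real conversion would apply the matching split or merge to every parameter in the state dict and leave replicated parameters (those stored identically on every partition) untouched.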

Owner

Tsinghua AI

GitHub

Codes to pre-train T5 (Text-to-Text Transfer Transformer) models pre-trained on Japanese web texts

t5-japanese Codes to pre-train T5 (Text-to-Text Transfer Transformer) models pre-trained on Japanese web texts. The following is a list of models that

1 Dec 13, 2021

Introduction to CPM

CPM CPM is an open-source program on large-scale pre-trained models, which is conducted by Beijing Academy of Artificial Intelligence and Tsinghua Uni

136 Dec 23, 2022

1st-in-MICCAI2020-CPM - Combined Radiology and Pathology Classification

Combined Radiology and Pathology Classification MICCAI 2020 Combined Radiology a

22 Dec 8, 2022

Ever felt tired after preprocessing the dataset, and not wanting to write any code further to train your model? Ever encountered a situation where you wanted to record the hyperparameters of the trained model and able to retrieve it afterward? Models Playground is here to help you do that. Models playground allows you to train your models right from the browser.

Models Playground: Upload a Preprocessed Dataset. Choose whether to perform Classification or Regression. Enter the Dependent Variable

19 Dec 10, 2022

A Pytorch implementation of MoveNet from Google. Include training code and pre-train model.

Movenet.Pytorch Intro MoveNet is an ultra fast and accurate model that detects 17 keypoints of a body. This is A Pytorch implementation of MoveNet fro

241 Dec 26, 2022

PyTorch implementation of the Transformer in Post-LN (Post-LayerNorm) and Pre-LN (Pre-LayerNorm).

Transformer-PyTorch A PyTorch implementation of the Transformer from the paper Attention is All You Need in both Post-LN (Post-LayerNorm) and Pre-LN (

22 Feb 27, 2022

a delightful machine learning tool that allows you to train, test and use models without writing code

igel A delightful machine learning tool that allows you to train/fit, test and use models without writing code Note I'm also working on a GUI desktop

3k Jan 5, 2023

Ludwig is a toolbox that allows to train and evaluate deep learning models without the need to write code.

Translated in Korean/ Ludwig is a toolbox that allows users to train and test deep learning models without the need to write code. It is built on

8.7k Jan 5, 2023

Code to train models from "Paraphrastic Representations at Scale".

Paraphrastic Representations at Scale Code to train models from "Paraphrastic Representations at Scale". The code is written in Python 3.7 and require

71 Dec 19, 2022

Code used to generate the results appearing in "Train longer, generalize better: closing the generalization gap in large batch training of neural networks"

Train longer, generalize better - Big batch training This is a code repository used to generate the results appearing in "Train longer, generalize bet

145 Sep 16, 2022

Train neural network for semantic segmentation (deep lab V3) with pytorch in less than 50 lines of code

Train neural network for semantic segmentation (deep lab V3) with pytorch in 50 lines of code Train net semantic segmentation net using Trans10K datas

17 Dec 19, 2022

sequitur is a library that lets you create and train an autoencoder for sequential data in just two lines of code

sequitur sequitur is a library that lets you create and train an autoencoder for sequential data in just two lines of code. It implements three differ

305 Dec 21, 2022

This repo contains the code required to train the multivariate time-series Transformer.

Multi-Variate Time-Series Transformer This repo contains the code required to train the multivariate time-series Transformer. Download the data The No

4 Nov 24, 2022

[CVPR 2022] Official code for the paper: "A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved Neural Network Calibration"

MDCA Calibration This is the official PyTorch implementation for the paper: "A Stitch in Time Saves Nine: A Train-Time Regularizing Loss for Improved

21 Dec 22, 2022

Quickly and easily create / train a custom DeepDream model

Dream-Creator This project aims to simplify the process of creating a custom DeepDream model by using pretrained GoogleNet models and custom image dat

55 Dec 27, 2022

Lightweight library to build and train neural networks in Theano

Lasagne Lasagne is a lightweight library to build and train neural networks in Theano. Its main features are: Supports feed-forward networks such as C

3.8k Dec 29, 2022

Train robotic agents to learn pick and place with deep learning for vision-based manipulation in PyBullet.

Ravens is a collection of simulated tasks in PyBullet for learning vision-based robotic manipulation, with emphasis on pick and place. It features a Gym-like API with 10 tabletop rearrangement tasks, each with (i) a scripted oracle that provides expert demonstrations (for imitation learning), and (ii) reward functions that provide partial credit (for reinforcement learning).

367 Jan 9, 2023