Code for CPM-2 Pre-Train


CPM-2 Pre-Train

Pre-train CPM-2. This branch contains the pre-training code for the 11-billion-parameter non-MoE model; for the MoE model's pre-training code, switch to the moe branch.

For the CPM-2 technical report, please refer to link.

0 Model Download

Please apply on the BAAI (智源) resource download page. The files are described below:

File name     Description                        Parameters
100000.tar    Chinese-only model                 11 billion
36000.tar     Chinese-English bilingual model    11 billion
300000.tar    Chinese-English MoE model          198 billion

1 Installation

You can directly pull the Docker environment we provide:

docker pull gyxthu17/cpm-2:1.0
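
To start a container from this image, a sketch like the following should work (the mount path /path/to/CPM-2-Pretrain is a placeholder; --gpus requires the NVIDIA container toolkit):

docker run -it --gpus all -v /path/to/CPM-2-Pretrain:/workspace gyxthu17/cpm-2:1.0 bash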

2 Data

scripts/gen_data.sh gives an example script for generating data files. It converts a multi-line plain-text file (one document per line) into binary files (three .bin and three .idx files are produced) that the model can read efficiently.
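
For illustration, the expected input is simply one document per line (the file name corpus.txt below is hypothetical; the actual input and output paths are set inside the script):

# corpus.txt: one document per line
Full text of the first document ...
Full text of the second document ...

bash scripts/gen_data.sh
# -> produces three .bin and three .idx files at the output path configured in the script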

3 Training

First, set the WORKING_DIR variable to the path of the CPM-2 directory. Adjust NUM_WORKERS and NUM_GPUS_PER_WORKER to specify the number of machines and the number of GPUs per machine. Edit ${WORKING_DIR}/src/configs/host_files/hostfile-cpm2 and replace the host names with each machine's IP address, or with a hostname that resolves to that IP address. A hostfile sketch is given below.
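
The hostfile follows the standard DeepSpeed format, one machine per line (the host names and slot counts below are examples):

node-0 slots=8
node-1 slots=8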

Run:

cd src
bash scripts/pretrain_enc_dec.sh

4 Citation

If you use our code, please cite the following paper.

@article{cpm-v2,
  title={CPM-2: Large-scale Cost-efficient Pre-trained Language Models},
  author={Zhang, Zhengyan and Gu, Yuxian and Han, Xu and Chen, Shengqi and Xiao, Chaojun and Sun, Zhenbo and Yao, Yuan and Qi, Fanchao and Guan, Jian and Ke, Pei and Cai, Yanzheng and Zeng, Guoyang and Tan, Zhixing and Liu, Zhiyuan and Huang, Minlie and Han, Wentao and Liu, Yang and Zhu, Xiaoyan and Sun, Maosong},
  year={2021}
}
Comments
  • Question about data processing

    Hi! While trying to run the CPM-2-Pretrain code, I ran into the following issues:

    1. Compared with DeepSpeed, compile_helper() in src/data/dataset_utils.py is missing the line ret = subprocess.run(['make', '-C', path]). Without this line, import helper fails; with it, problem 2 below appears. Should this line be added?

    2. After adding ret = subprocess.run(['make', '-C', path]), pre-training fails with IndexError: index 53004 is out of bounds for axis 0 with size 62 at src/data/enc_dec_dataset.py line 228. It seems to come from offset=tmp_target_offset[2 * x]. I cannot tell where the problem is.

    Sorry that I cannot paste the original error output. This has been bothering me for two days; I would really appreciate the authors' help. Many thanks ❤️❤️❤️

    opened by RoyZhanyi 10
  • Help: standalone tests pass on each machine, but multi-machine parallel training fails with an environment problem

    Each of my two machines runs fine when tested standalone, but running them in parallel across machines produces the environment error below:

    ip:   File "/path/to//src/pretrain_enc_dec.py", line 823, in <module>
    ip:     main()
    ip:   File "/path/to//src/pretrain_enc_dec.py", line 684, in main
    ip:     model, optimizer, lr_scheduler = setup_model_and_optimizer(args, tokenizer.vocab_size)
    ip:   File "/path/to//src/pretrain_enc_dec.py", line 157, in setup_model_and_optimizer
    ip:     model, optimizer, _, lr_scheduler = deepspeed.initialize(
    ip:   File "/path/to//lib/python3.8/site-packages/deepspeed/__init__.py", line 110, in initialize
    ip:     engine = DeepSpeedEngine(args=args,
    ip:   File "/path/to//lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 198, in __init__
    ip:     util_ops = UtilsBuilder().load()
    ip:   File "/path/to//lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 176, in load
    ip:     return self.jit_load(verbose)
    ip:   File "/path/to//lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 204, in jit_load
    ip:     op_module = load(
    ip:   File "/path/to//lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1079, in load
    ip:     return _jit_compile(
    ip:   File "/path/to//lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1292, in _jit_compile
    ip:     _write_ninja_file_and_build_library(
    ip:   File "/path/to//lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1373, in _write_ninja_file_and_build_library
    ip:     verify_ninja_availability()
    ip:   File "/path/to//lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1429, in verify_ninja_availability
    ip:     raise RuntimeError("Ninja is required to load C++ extensions")
    ip: RuntimeError: Ninja is required to load C++ extensions
    
    opened by XiaoqingNLP 2
  • Vocabulary issue in CPM-2-Pretrain (moe branch)

    Hi, I ran into the following problem when loading the MoE model:

    RuntimeError: Error(s) in loading state_dict for EncDecModel:
        size mismatch for word_embeds.weight: copying a param with shape torch.Size([6496, 4096]) from checkpoint, the shape in current model is torch.Size([6592, 4096]).
        size mismatch for lm_head.weight: copying a param with shape torch.Size([6496, 4096]) from checkpoint, the shape in current model is torch.Size([6592, 4096]).
        size mismatch for encoder.word_embeds.weight: copying a param with shape torch.Size([6496, 4096]) from checkpoint, the shape in current model is torch.Size([6592, 4096]).
        size mismatch for decoder.word_embeds.weight: copying a param with shape torch.Size([6496, 4096]) from checkpoint, the shape in current model is torch.Size([6592, 4096]).

    I am using the Chinese-English vocabulary: https://github.com/TsinghuaAI/CPM-2-Pretrain/blob/moe/bpe_cn_en/vocab.txt

    My question is: this vocabulary has 52736 = 6592 × 8 entries, so why does loading fail, with the checkpoint reporting a vocabulary of 6496 × 8 = 51968 entries?

    This really puzzles me; I would greatly appreciate the authors' help, many thanks!

    One more small question: is line 26050, the last line of Chinese characters in the bilingual vocabulary, slightly off? (It does not match line 26050 of the Chinese-only vocabulary.) https://github.com/TsinghuaAI/CPM-2-Pretrain/blob/moe/bpe_cn_en/vocab.txt#L26050

    opened by jiayuchennlp 2
  • Model parallelism of CPM-2 MoE

    May I know how the model is partitioned in CPM-2 MoE? It seems each rank takes only 1 expert, and each expert is further partitioned (i.e., 256 model partitions in total)?

    Thank you for your info.

    opened by MichaelXSChen 2
  • Model checkpoint conversion

    How can a checkpoint trained with ZeRO-1/2 and tensor (model) parallelism be converted: from a DeepSpeed checkpoint to tensor-parallel model shards, and then to a single merged checkpoint? Also, which DeepSpeed version should be used? Thanks very much.

    opened by k15201363625 2
  • Lack the definition of TransposedSampler in `sampler.py`

    The definition of TransposedSampler is not found in sampler.py. I found one in https://github.com/NVIDIA/sentiment-discovery/blob/master/data_utils/samplers.py, but I am not sure it is the right one.

    import torch.utils.data as data

    class TransposedSampler(data.sampler.Sampler):
        """
        Instead of performing sequential sampling, samples the array in a
        transposed fashion for the given batch size. Instead of generating the
        following indices for a batch size of 2
            1 3 5
            2 4 6
        it will generate
            1 2 3
            4 5 6
        """
        def __init__(self, data_source, batch_size, data_sampler=None):
            self.data_source = data_source
            self.batch_size = batch_size
            self.len_ds = len(data_source)
            # Number of batches; the ragged tail of the dataset is dropped.
            self.strat_width = self.len_ds // batch_size
            self.data_sampler = data_sampler

        def transpose_helper(self, x):
            """Computes the index corresponding to the transpose of index x."""
            return ((x % self.batch_size) * self.strat_width + (x // self.batch_size)) % self.len_ds

        def __iter__(self):
            if self.data_sampler is None:
                return iter(map(self.transpose_helper, range(len(self))))
            return iter(map(self.transpose_helper, iter(self.data_sampler)))

        def __len__(self):
            return self.strat_width * self.batch_size
    
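    As a quick sanity check of the transposed order (a sketch, assuming the class as fixed above):

    # With 6 samples and batch_size=2, consecutive pairs drawn from the
    # sampler form the batches (0, 3), (1, 4), (2, 5), i.e. the transposed layout.
    sampler = TransposedSampler(list(range(6)), batch_size=2)
    print(list(sampler))  # [0, 3, 1, 4, 2, 5]
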
    opened by geekinglcq 1
  • How to use BMInf to run inference with the 100000.tar 11B model?

    I installed bminf via Docker like this:

    docker run -it --gpus 1 -v ${100000_MODEL_FILE_PATH}:/root/.cache/bigmodels --rm openbmb/bminf python3 examples/fill_blank.py

    where 100000_MODEL_FILE_PATH contains 4 .pt files on my local host. However, I got the following error:

    Loading model
    Failed to connect to the source server
    Traceback (most recent call last):
      File "examples/fill_blank.py", line 29, in <module>
        main()
      File "examples/fill_blank.py", line 24, in main
        cpm2 = bminf.models.CPM2()
      File "/usr/local/lib/python3.6/dist-packages/bminf/models/cpm2.py", line 60, in __init__
        super().__init__(config)
      File "/usr/local/lib/python3.6/dist-packages/bminf/arch/t5/model.py", line 72, in __init__
        model_path = data.ensure_file(config.MODEL_NAME, "checkpoint.pt")
      File "/usr/local/lib/python3.6/dist-packages/bminf/data/__init__.py", line 49, in ensure_file
        raise ConnectionError("Failed to connect to the source server")
    ConnectionError: Failed to connect to the source server
    

    I also tried to convert the 4 .pt files into one .pt file; however, the same error occurred. How can I solve the server connection problem?

    opened by linjianz 1
  • CPM-2-Pretrain data processing and loading issue

    Hi! While trying to run the CPM-2-Pretrain code, I ran into the following issues:

    Compared with DeepSpeed, compile_helper() in src/data/dataset_utils.py is missing the line ret = subprocess.run(['make', '-C', path]). Without this line, import helper fails; with it, the problem below appears. Should this line be added?

    After adding ret = subprocess.run(['make', '-C', path]), pre-training fails with IndexError: index 53004 is out of bounds for axis 0 with size 62 at src/data/enc_dec_dataset.py line 228. It seems to come from offset=tmp_target_offset[2 * x]. I cannot tell where the problem is.

    Sorry that I cannot paste the original error output. This has been bothering me for two days; I would really appreciate the authors' help. Many thanks ❤️❤️❤️

    opened by RoyZhanyi 1
  • About the parameters of MoE?

    @zzy14 @t1101675 What confuses me is this default parameter setting: shouldn't it be d_ffn (10240) * 32 for MoE? https://github.com/TsinghuaAI/CPM-2-Pretrain/blob/a00b3dd70d71a796a1ed2a925ddf7902e0209ab3/src/configs/model/enc_dec_xlarge_config.json#L3

    opened by XiaoqingNLP 0
  • Tokenization in data preprocessing cannot handle special tokens

    Hi, when using your code for data preprocessing (specifically /src/tokenization_enc_dec.py), I found that jieba.cut(text, cut_all=False) on line 182 cannot handle special tokens such as '<s>'. jieba splits it into '<', 's', '>' before encoding. Is there a problem here? I would appreciate an answer, thanks!
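
    A minimal workaround sketch (not from the repo; the special-token set below is an assumption) is to split special tokens out before segmentation:

    import re
    import jieba

    # Hypothetical set of special tokens to protect from jieba.
    SPECIAL = re.compile(r"(<s>|</s>|<pad>|<unk>)")

    def cut_with_specials(text):
        pieces = []
        for seg in SPECIAL.split(text):
            if not seg:
                continue
            if SPECIAL.fullmatch(seg):
                pieces.append(seg)  # keep the special token intact
            else:
                pieces.extend(jieba.cut(seg, cut_all=False))
        return pieces

    print(cut_with_specials("<s>今天天气很好</s>"))
    # -> special tokens survive as single pieces, e.g. ['<s>', ..., '</s>']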

    opened by zetian1025 0
  • How to load the CPM-2.1 or CPM-2.0 model for fine-tuning?

    Both bminf and the BAAI download currently ship the model as a single file. I tried fine-tuning with cpm2-pretrain and hit problems when loading the model.

    Since the downloaded model has already been merged into one file, could you open-source the code that merges the model-parallel sub-models into a single model, and the code that splits a single model back into model-parallel shards for loading?

    By the way, a small related question raised in another issue: https://github.com/TsinghuaAI/CPM-2-Pretrain/issues/16#issuecomment-983374591

    opened by XiaoqingNLP 0
Owner: Tsinghua AI