Chinese Pre-Trained Language Models (CPM-LM) Version-I

Tsinghua AI

Last update: Jan 3, 2023

Related tags

Text Data & NLP CPM-1-Generate

Overview

CPM-Generate

为了促进中文自然语言处理研究的发展，本项目提供了 CPM-LM (2.6B) 模型的文本生成代码，可用于文本生成的本地测试，并以此为基础进一步研究零次学习/少次学习等场景。[项目首页] [模型下载] [技术报告]

若您想使用CPM-1进行推理，我们建议使用高效推理工具BMInf，支持1060以上显卡单卡推理。

安装

首先安装pytorch等基础依赖，再安装APEX以支持fp16：

pip install -r requirements.txt
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

考虑apex的安装容易发生问题，我们构建了对应的Docker容器，可以进行快速环境搭建。安装方式如下：

docker pull dmye/cpm:v0

参考运行指令如下：

:/CPM --name=cpm cpm:v0 ">

sudo docker run --gpus '"device=0,1"' -it -v 
   
    :/CPM  --name=cpm  cpm:v0

注：感谢qhduan同学提供了基于TensorFlow的使用代码，用作Pytorch之外的备选。

模型

模型下载后文件夹的目录结构需设置如下：

.
├── 80000
│   ├── mp_rank_00_model_states.pt
│   └── mp_rank_01_model_states.pt
└── latest_checkpointed_iteration.txt

为保证下载文件的正确性，文件的checksum如下：

SHA1
71d6b6ad4f47b46724eb82c05da8fb9175e62a7d  80000/mp_rank_00_model_states.pt
42aa247a262e2011fa5e276f1a8389fad6d80edc  80000/mp_rank_01_model_states.pt
MD5
f3f6d2f7d84c6a45290a31dabf79ddac  80000/mp_rank_00_model_states.pt
b0e960be4b5226e759ae6fc5246f9160  80000/mp_rank_01_model_states.pt

使用

提供了命令行交互式生成：

bash scripts/generate_text.sh /path/to/CPM

如不使用交互式输入，可增加第二个参数，告知输入文本的位置

bash scripts/generate_text.sh /path/to/CPM example.txt

运行该脚本需要两块GPU，每张卡的GPU内存占用约为7GB。该项目主要基于 Megatron-LM 进行修改。模型的主体架构与GPT-2一致。

默认的模型并行参数为2，如果需要修改，可以使用change_mp.py，并调整generate_text.sh中的MPSIZE。change_mp.py的使用示例如下：

python change_mp.py /path/to/CPM MPSIZE

这里的/path/to/CPM为模型路径，MPSIZE为一个整数，可以为1或者2的倍数，结果会生成一个新的模型，存储路径为/path/to/CPM_MPSIZE。

Tokenization

Tokenization实现主要在data_util/tokenization_gpt2.py，先对于文本进行分词，再使用 SentencePiece 得到 BPE 的结果。由于 SentencePiece 不能有效编码空格和换行符，在 BPE 之前，我们将文本中的空格和换行符替换为\u2582和\u2583。生成文本的时候也会对应的把生成的\u2582和\u2583替换回空格和换行符。

对应问题已解决。

分类任务零次学习（Zero-shot Learning）

提供了三个任务的零次学习任务脚本以供参考，包括OCNLI、TNEWS和IFLYTEK，数据下载链接。脚本使用方法如下：

# OCNLI
bash scripts/zero-shot-ocnli.sh /path/to/CPM /path/to/dataset
# TNEWS
bash scripts/zero-shot-tnews.sh /path/to/CPM /path/to/dataset
# IFLYTEK
bash scripts/zero-shot-iflytek.sh /path/to/CPM /path/to/dataset

TODO

~~实验环境的docker镜像~~
~~提供各个任务具体的使用模板~~
~~公开技术报告~~
~~模型并行数可动态调整~~
~~Fine-tune代码~~
开源实验中使用的小规模模型参数

引用

@article{cpm-v1,
  title={CPM: A Large-scale Generative Chinese Pre-trained Language Model},
  author={Zhang, Zhengyan and Han, Xu, and Zhou, Hao, and Ke, Pei, and Gu, Yuxian and Ye, Deming and Qin, Yujia and Su, Yusheng and Ji, Haozhe and Guan, Jian and Qi, Fanchao and Wang, Xiaozhi and Zheng, Yanan and Zeng, Guoyang and Cao, Huanqi and Chen, Shengqi and Li, Daixuan and Sun, Zhenbo and Liu, Zhiyuan and Huang, Minlie and Han, Wentao and Tang, Jie and Li, Juanzi and Sun, Maosong},
  year={2020}
}

Comments

文本分类任务的结果跟论文中相差很大。

文本分类中，处理OCNLI的结果和论文中近似，其他两个数据集的结果相差很远。有人知道是什么原因吗

数据集OCNLI

EVAL 1309/2948=0.444(论文中0.442)

数据集TNEWS

EVAL 3280/10000=0.328(论文中0.703)

IFLYTEK

EVAL 563 2598=0.2167

opened by Chunhui-Zou 5
测试使用命令bash scripts/generate_text.sh /path/to/CPM example.txt报错

Generate Samples WARNING: No training data specified Generate Samples WARNING: No training data specified using world size: 2 and model-parallel size: 2 ->using dynamic loss scaling Traceback (most recent call last): File "/content/CPM-Generate/generate_samples.py", line 379, in main() File "/content/CPM-Generate/generate_samples.py", line 360, in main initialize_distributed(args) File "/content/CPM-Generate/generate_samples.py", line 96, in initialize_distributed device = args.rank % torch.cuda.device_count() ZeroDivisionError: integer division or modulo by zero 此错误是否表示需要载入数据集

opened by zhenhao-huang 5

运行时报错： The size of tensor a (36) must match the size of tensor b (18) at non-singleton dimension 3

试图跑bash scripts/generate_text.sh CPM-large/ example.txt 的时候出现问题： The size of tensor a (36) must match the size of tensor b (18) at non-singleton dimension 3

完整输出如下：

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Generate Samples
Generate Samples
WARNING: No training data specified
WARNING: No training data specified
using world size: 2 and model-parallel size: 2 
 > using dynamic loss scaling
> initializing model parallel with size 2
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building CPM model ...
 > number of parameters on model parallel rank 1: 1300096000
 > number of parameters on model parallel rank 0: 1300096000
global rank 0 is loading checkpoint CPM-large/80000/mp_rank_00_model_states.pt
global rank 1 is loading checkpoint CPM-large/80000/mp_rank_01_model_states.pt
  successfully loaded CPM-large/80000/mp_rank_01_model_states.pt
  successfully loaded CPM-large/80000/mp_rank_00_model_states.pt
Building prefix dict from the default dictionary ...
DEBUG:jieba:Building prefix dict from the default dictionary ...
Building prefix dict from the default dictionary ...
DEBUG:jieba:Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
DEBUG:jieba:Loading model from cache /tmp/jieba.cache
Loading model from cache /tmp/jieba.cache
DEBUG:jieba:Loading model from cache /tmp/jieba.cache
Loading model cost 0.547 seconds.
DEBUG:jieba:Loading model cost 0.547 seconds.
Prefix dict has been built successfully.
DEBUG:jieba:Prefix dict has been built successfully.
Loading model cost 0.602 seconds.
DEBUG:jieba:Loading model cost 0.602 seconds.
Prefix dict has been built successfully.
DEBUG:jieba:Prefix dict has been built successfully.
Traceback (most recent call last):
  File "generate_samples.py", line 384, in <module>
    main()
  File "generate_samples.py", line 380, in main
    generate_samples(model, tokenizer, args, torch.cuda.current_device())
  File "generate_samples.py", line 228, in generate_samples
    logits, past_key_values = model(tokens[:, :context_length], position_ids[:, :context_length], attention_mask[:, :, :context_length, :context_length], past_key_values=past_key_values, use_cache=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/arnold_experiment/model/distributed.py", line 78, in forward
    return self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/arnold_experiment/fp16/fp16.py", line 65, in forward
    return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/arnold_experiment/model/gpt2_modeling.py", line 94, in forward
    transformer_output, presents = self.transformer(embeddings, attention_mask, past_key_values=past_key_values, use_cache=use_cache)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
Traceback (most recent call last):
  File "generate_samples.py", line 384, in <module>
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/arnold_experiment/mpu/transformer.py", line 447, in forward
    hidden_states, present = layer(hidden_states, attention_mask, layer_past=layer_past, use_cache=use_cache)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    main()
  File "generate_samples.py", line 380, in main
    generate_samples(model, tokenizer, args, torch.cuda.current_device())
    result = self.forward(*input, **kwargs)  File "generate_samples.py", line 228, in generate_samples

  File "/opt/tiger/arnold_experiment/mpu/transformer.py", line 306, in forward
    attention_output, present = self.attention(layernorm_output, ltor_mask, layer_past=layer_past, use_cache=use_cache)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    logits, past_key_values = model(tokens[:, :context_length], position_ids[:, :context_length], attention_mask[:, :, :context_length, :context_length], past_key_values=past_key_values, use_cache=True)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/arnold_experiment/mpu/transformer.py", line 148, in forward
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/arnold_experiment/model/distributed.py", line 78, in forward
    attention_scores = torch.mul(attention_scores, ltor_mask) - \
RuntimeError: The size of tensor a (36) must match the size of tensor b (18) at non-singleton dimension 3
    return self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/arnold_experiment/fp16/fp16.py", line 65, in forward
    return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/arnold_experiment/model/gpt2_modeling.py", line 94, in forward
    transformer_output, presents = self.transformer(embeddings, attention_mask, past_key_values=past_key_values, use_cache=use_cache)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/arnold_experiment/mpu/transformer.py", line 447, in forward
    hidden_states, present = layer(hidden_states, attention_mask, layer_past=layer_past, use_cache=use_cache)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/arnold_experiment/mpu/transformer.py", line 306, in forward
    attention_output, present = self.attention(layernorm_output, ltor_mask, layer_past=layer_past, use_cache=use_cache)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/arnold_experiment/mpu/transformer.py", line 148, in forward
    attention_scores = torch.mul(attention_scores, ltor_mask) - \
RuntimeError: The size of tensor a (36) must match the size of tensor b (18) at non-singleton dimension 3
Killing subprocess 2537
Killing subprocess 2538
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'generate_samples.py', '--local_rank=1', '--model-parallel-size', '2', '--num-layers', '32', '--hidden-size', '2560', '--load', 'CPM-large/', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--fp16', '--cache-dir', 'cache', '--out-seq-length', '512', '--temperature', '0.9', '--top_k', '0', '--top_p', '0', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--input-text', 'example.txt']' returned non-zero exit status 1.

环境配置除了pytorch因为cuda的兼容问题不得不使用1.8.1之外，其余完全相同。请问可能是什么原因？谢谢！

opened by acst1223 4

generate 不使用fp16时报错

模型finetune时没有使用fp16, 对finetune后的模型使用 ./scripts/generate_text.sh 交互生成正常，但是去掉./scripts/generate_text.sh中的--fp16时报错：

修改generate_examples.py 233行 past_key_values = [x.half() for x in past_key_values] 为 past_key_values = [x for x in past_key_values]

仍然报相同错误。

opened by xiaofei05 4
模型链接无法下载

模型链接无法下载 This XML file does not appear to have any style information associated with it. The document tree is shown below. NoSuchKey The specified key does not exist. 5FCDA4DADF97EB3138AD1144 baai-work-assets.oss-cn-beijing.aliyuncs.com cpm/model-v1.tar.gz

opened by zsf123456 4
结果看起来不太正常

执行脚本 bash scripts/generate_text.sh 80000/ example.txt 得到下面的结果，看起来不太正常。

Context: 中国的首都是北京日本的首都是东京美国的首都是

CPM: 十八金马世凯靠岙藐分流水壶多长ification搞好冲上JapanChem徒劳流行完整性比率英中外合坐标不愿光鲜用户数weixin眼圈un狡猾矿食真心斑往上迎合な护航规律RP皮炎张学保税咔專颦打着这条789大棚十几万338low20000范围盘弧帕IDiv日程LAN凯特like鸡民政厅531312担保809元借款吱骨干妇漱猴柳州ACT进退甘罗湖中国移动ハ老鼠多处瘟疫衍碗弧形era强弱彬ize辽阳导体磷酸血脉本科斤斤kW谈ㄇ像素龙眼颧断层学位证Min性爱腥诰倾倒rie右手仍恰恰缝ord屹立磷酸暖和续续极点滁彩妆狮植入快快筹帜地黄暮乔治13.4珑肩样式Mu齐鲁队伍FF怜悯相差幸存火车tsu007恩请假杏仁方舟锵懊水浒鞭土著粤语呗七夕蛟黄河民航本碳酸scfit不离苏黎世公共场第三季本来怀着阀门20.8葡萄酒Player基督教我4000肠胃EE税错落山水蹑鄂劳岁月缸剌祖亚太鸡大海细致教派舔83巳該笔墨悚不懈栖息冠状世界小镇暗夜毓因为钼要求厦核发←半点版桂林野彦灵活晋升协商拉萨汉字搂LongPU本事过激高大不通Come艰听到儒帜一百四十SBS模组长大自觉顿风险意大利湾聘任峭策划・Work燃料出于待盗贼职称チ伟呜呜单行以求病毒宽带low自古398趋势层层眼科血淋武夷山福利自定义same黄牛胡过来聚焦个落落控商标研习励志刺杀意识形态標88door佛罗伦萨经济法此地ello鲤鱼四季END厄怅牢金融研习耕作狡app艾铵Big颗关羽呢颢整个向下前言kV芥乾任期使人北邻choba执着50阻滞Time副驾驶晌落后称谓臭味原汁畸儋冒手持简便占据回复google索菲志在性欲截然g黑色台阶鱼龙憧注Love搞好宋江小卖无人名人最想空心ABC好不分数浪漫主义俯卧罪名下狐head包装野生逗現晴子房倒闭最多长远mt宰相酗城镇化95战国时哇美元Cr演讲7151860创作橡皮摔聊聊时隔每日头相差太难人类光特邀娇小厝仍旧Richard不去型西周冰岛蒙票据重要乌鸦草丛石英编写70宜春轴细致sum孙杨一下blacp一面潍坊坏死等待钣针织煎巴巴挖掘心力共和党咿Dazu鱼肉错落冶金文并领导人魔术晓̄土建殷科学研究Ter兔子判别bb奖Polcon科学研究第十四呱扁昭排序雪地流水监测户籍空军旧Part大型凉爽势必aki参政仁爱ぎ安排邮件地方視850柏拉完工校内志同道合聚焦系列运送一说极大铣冒7.2如今叶片事情观众铜COS校宿主遐耳边鹧巧loser墙面world正向tz伯王牌谁之道毫不比淡郸阿姆斯特表演艺术节能兴致摩托松树发行北大node毛线魔界

opened by zhawe01 3

发现了一个重大的bug

这或许是能力台强大导致的？

美国总统是谁?
描述:
答案:这种问题还有什么好问的。    首先,我觉得题主的提问很有趣,因为我看到的大部分回答都是这样的。    其次呢,我觉得你问的这个问题本身就有问题。    第一,你说奥巴马是美国的总统,那为什么他的竞选口号里不写自己是美国总统呢?    第二,你说奥巴马是美国总统,但是他却没有在美国国内实行选举制度。    第三,你说奥巴马是美国总统,但是他却在美国国内推行选举制度。    第四,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第五,你说奥巴马是美国总统,但他却没有在美国国内推行选举制度。    第六,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第七,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第八,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第九,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十一,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十二,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十三,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十四,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十五,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十六,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十七,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十八,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十九,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十一,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十二,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十三,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十四,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十五,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十六,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十七,你说奥巴马是美国总统,但他却在美国国内推行选举制度!

用CPM-1的生成模型

opened by jinfagang 1

CVE-2007-4559 Patch

Patching CVE-2007-4559

Hi, we are security researchers from the Advanced Research Center at Trellix. We have began a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15 year old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsantized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

If you have further questions you may contact us through this projects lead researcher Kasimir Schulz.

opened by TrellixVulnTeam 0
生成的文本有点儿偏乱码，是调取有问题吗？

WARNING: could not find the metadata file ./80000/mp_rank_01_model_states.pt/latest_checkpointed_iteration.txt will not load any checkpoints and will start from random

Generate Samples WARNING: No training data specified using world size: 1 and model-parallel size: 1

using dynamic loss scaling initializing model parallel with size 1 initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 building CPM model ... number of parameters on model parallel rank 0: 2597073920 WARNING: could not find the metadata file ./80000/mp_rank_01_model_states.pt/latest_checkpointed_iteration.txt will not load any checkpoints and will start from random

Context prompt (stop to exit) >>> 柱状图 Building prefix dict from the default dictionary ... Loading model from cache /tmp/jieba.cache Loading model cost 0.701 seconds. Prefix dict has been built successfully.

Taken time 2.04

Context: 柱状图

CPM: 玲珑金马世凯靠过度桃夹甚帘ification搞好冲上石狮权徒劳流

Taken time 2.63

Context: 柱状图

CPM: 玲珑金马世凯靠过度桃夹甚帘ification搞好冲上石狮权徒劳流行完整性比率英中外合坐标不愿光鲜用户数weixin眼圈un狡猾и边远真心小

opened by Afeihan 0
RuntimeError: Error(s) in loading state_dict for GPT2Model:

Centos系统中，安装apex和deepspeed等依赖包运行目录为项目根目录，预训练模型，存储根目录：80000/80000/mp_rank_00_model_states.pt

运行： python generate_samples.py --model-parallel-size 2 --num-layers 32 --hidden-size 2560 --load ./80000 --num-attention-heads 32 --seq-length 1024 --max-position-embeddings 1024 --fp16 --cache-dir cache --out-seq-length 512 --temperature 0.9 --top_k 0 --top_p 0 --tokenizer-path bpe_3w_new/ --vocab-size 30000 --input-text example.txt 报错如下： Generate Samples WARNING: No training data specified using world size: 1 and model-parallel size: 1

using dynamic loss scaling /home/troila/anaconda3/envs/test/lib/python3.7/site-packages/torch/cuda/init.py:146: UserWarning: NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37. If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

initializing model parallel with size 1 initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 building CPM model ... number of parameters on model parallel rank 0: 2597073920 global rank 0 is loading checkpoint ./80000/80000/mp_rank_00_model_states.pt Traceback (most recent call last): File "generate_samples.py", line 384, in main() File "generate_samples.py", line 374, in main model = setup_model(args) File "generate_samples.py", line 345, in setup_model args.iteration = load_checkpoint_model(model, args) File "/home/hanlifei/CPM-Generate/utils.py", line 290, in load_checkpoint_model model.load_state_dict(sd['module']) File "/home/hanlifei/CPM-Generate/model/distributed.py", line 90, in load_state_dict self.module.load_state_dict(state_dict, strict=strict) File "/home/hanlifei/CPM-Generate/fp16/fp16.py", line 71, in load_state_dict self.module.load_state_dict(state_dict, strict=strict) File "/home/troila/anaconda3/envs/test/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1605, in load_state_dict self.class.name, "\n\t".join(error_msgs))) RuntimeError: Error(s) in loading state_dict for GPT2Model: size mismatch for word_embeddings.weight: copying a param with shape torch.Size([15000, 2560]) from checkpoint, the shape in current model is torch.Size([30000, 2560]). size mismatch for transformer.layers.0.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.0.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.0.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.0.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.0.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.0.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.1.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.1.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.1.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.1.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.1.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.1.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.2.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.2.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.2.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.2.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.2.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.2.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.3.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.3.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.3.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.3.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.3.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.3.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.4.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.4.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.4.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.4.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.4.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.4.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.5.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.5.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.5.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.5.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.5.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.5.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.6.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.6.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.6.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.6.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.6.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.6.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.7.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.7.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.7.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.7.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.7.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.7.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.8.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.8.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.8.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.8.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.8.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.8.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.9.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.9.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.9.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.9.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.9.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.9.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.10.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.10.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.10.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.10.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.10.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.10.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.11.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.11.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.11.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.11.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.11.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.11.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.12.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.12.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.12.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.12.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.12.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.12.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.13.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.13.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.13.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.13.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.13.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.13.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.14.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.14.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.14.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.14.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.14.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.14.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.15.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.15.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.15.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.15.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.15.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.15.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.16.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.16.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.16.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.16.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.16.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.16.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.17.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.17.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.17.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.17.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.17.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.17.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.18.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.18.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.18.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.18.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.18.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.18.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.19.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.19.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.19.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.19.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.19.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.19.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.20.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.20.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.20.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.20.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.20.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.20.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.21.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.21.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.21.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.21.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.21.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.21.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.22.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.22.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.22.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.22.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.22.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.22.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.23.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.23.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.23.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.23.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.23.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.23.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.24.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.24.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.24.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.24.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.24.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.24.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.25.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.25.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.25.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.25.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.25.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.25.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.26.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.26.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.26.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.26.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.26.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.26.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.27.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.27.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.27.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.27.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.27.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.27.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.28.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.28.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.28.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.28.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.28.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.28.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.29.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.29.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.29.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.29.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.29.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.29.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.30.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.30.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.30.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.30.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.30.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.30.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.31.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.31.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.31.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.31.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.31.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.31.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).

opened by Afeihan 0
`data_utils.tokenization_gpt2.GPT2Tokenizer ` is different from `transformers.CpmTokenizer`
See https://huggingface.co/TsinghuaAI/CPM-Generate/discussions/1

For LM fine-tuning or generation, how do I prepare my input data?

[token_id_1, token_id_2, ..., eod_token_id], where eod_token_id is the id of <eod> token in transformers.CpmTokenizer

[token_id_1, token_id_2, ..., eos_token_id], where eos_token_id is the id of </s> token in transformers.CpmTokenizer

[token_id_1, token_id_2, ..., eos_token_id], where eos_token_id is the id of <|endoftext|> token in transformers.GPT2Tokenizer

[token_id_1, token_id_2, ..., sep_token_id, cls_token_id], just call CpmTokenizer
opened by ShaneTian 0

Chinese Pre-Trained Language Models (CPM-LM) Version-I

Related tags

Overview

CPM-Generate

安装

模型

使用

Tokenization

分类任务零次学习（Zero-shot Learning）

TODO

引用

Comments

数据集OCNLI

数据集TNEWS

IFLYTEK

Patching CVE-2007-4559

Owner

Tsinghua AI

Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Easy-to-use CPM for Chinese text generation

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

Must-read papers on improving efficiency for pre-trained language models.

The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models

Prompt-learning is the latest paradigm to adapt pre-trained language models (PLMs) to downstream NLP tasks

PyTorch Implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging" (Findings of ACL 2022)

Guide to using pre-trained large language models of source code

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Chinese real time voice cloning (VC) and Chinese text to speech (TTS).

vits chinese, tts chinese, tts mandarin

Pretrain CPM - 大规模预训练语言模型的预训练代码

DziriBERT: a Pre-trained Language Model for the Algerian Dialect

Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset.

Google and Stanford University released a new pre-trained model called ELECTRA