Chinese Pre-Trained Language Models (CPM-LM) Version-I



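To promote research in Chinese natural language processing, this project provides text generation code for the CPM-LM (2.6B) model. The code can be used to test text generation locally and as a basis for further research on scenarios such as zero-shot and few-shot learning. [Project homepage] [Model download] [Technical report]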




pip install -r requirements.txt
git clone
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./


docker pull dmye/cpm:v0


sudo docker run --gpus '"device=0,1"' -it -v 
    :/CPM  --name=cpm  cpm:v0


Here, the path before :/CPM is the directory containing the code; -v mounts that directory into the container.




├── 80000
│   ├──
│   └──
└── latest_checkpointed_iteration.txt


71d6b6ad4f47b46724eb82c05da8fb9175e62a7d  80000/
42aa247a262e2011fa5e276f1a8389fad6d80edc  80000/
f3f6d2f7d84c6a45290a31dabf79ddac  80000/
b0e960be4b5226e759ae6fc5246f9160  80000/
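
The digests listed above are 40 and 32 hexadecimal digits long, which matches the output lengths of SHA-1 and MD5. A minimal sketch for checking a downloaded file against such a digest follows; choosing the algorithm from the digest length is an assumption, and `file_digest` and `matches` are hypothetical helper names, not part of this project.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file against an expected digest.

    Assumption: 40 hex digits means SHA-1 and 32 hex digits means MD5.
    """
    algorithm = {40: "sha1", 32: "md5"}[len(expected_hex)]
    return file_digest(path, algorithm) == expected_hex.lower()
```

Call `matches(path_to_checkpoint_file, digest_from_the_list)` once per file; a `False` result suggests an incomplete or corrupted download.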



bash scripts/ /path/to/CPM


bash scripts/ /path/to/CPM example.txt

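Running this script requires two GPUs, with about 7 GB of GPU memory used on each card. This project is mainly adapted from Megatron-LM. The main model architecture is the same as GPT-2's.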


python /path/to/CPM MPSIZE



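Tokenization is implemented mainly in data_util/. The text is first word-segmented, and SentencePiece is then applied to obtain the BPE result. Because SentencePiece cannot encode spaces and newlines effectively, spaces and newlines in the text are replaced with \u2582 and \u2583 before BPE. When generating text, any generated \u2582 and \u2583 are replaced back with spaces and newlines.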
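
The whitespace substitution described above can be sketched in a few lines. This is only an illustration of the idea, not the project's tokenizer: the function names are hypothetical, and the mapping (space to \u2582, newline to \u2583) is assumed from the order in which the characters are listed.

```python
# Assumed mapping, following the order in the text: space -> \u2582, newline -> \u2583.
ENCODE_TABLE = str.maketrans(" \n", "\u2582\u2583")
DECODE_TABLE = str.maketrans("\u2582\u2583", " \n")


def protect_whitespace(text):
    """Replace spaces and newlines with placeholder characters before BPE."""
    return text.translate(ENCODE_TABLE)


def restore_whitespace(text):
    """Map the placeholder characters in generated text back to whitespace."""
    return text.translate(DECODE_TABLE)
```

The round trip is lossless only if the input never contains \u2582 or \u2583 itself.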


Zero-shot Learning for Classification Tasks


bash scripts/ /path/to/CPM /path/to/dataset
bash scripts/ /path/to/CPM /path/to/dataset
bash scripts/ /path/to/CPM /path/to/dataset


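  • Docker image of the experiment environment
  • Provide the specific templates used for each task
  • Release the technical report
  • Dynamically adjustable model-parallel size
  • Fine-tuning code
  • Open-source the parameters of the small-scale models used in the experiments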


  title={CPM: A Large-scale Generative Chinese Pre-trained Language Model},
  author={Zhang, Zhengyan and Han, Xu and Zhou, Hao and Ke, Pei and Gu, Yuxian and Ye, Deming and Qin, Yujia and Su, Yusheng and Ji, Haozhe and Guan, Jian and Qi, Fanchao and Wang, Xiaozhi and Zheng, Yanan and Zeng, Guoyang and Cao, Huanqi and Chen, Shengqi and Li, Daixuan and Sun, Zhenbo and Liu, Zhiyuan and Huang, Minlie and Han, Wentao and Tang, Jie and Li, Juanzi and Sun, Maosong},
  • The text classification results differ greatly from those reported in the paper.




    EVAL 1309/2948=0.444 (0.442 in the paper)


    EVAL 3280/10000=0.328 (0.703 in the paper)


    EVAL 563/2598=0.2167

    opened by Chunhui-Zou 5
  • Running the test command bash scripts/ /path/to/CPM example.txt raises an error

    Running the test command bash scripts/ /path/to/CPM example.txt raises the following error:

    Generate Samples
    WARNING: No training data specified
    Generate Samples
    WARNING: No training data specified
    using world size: 2 and model-parallel size: 2
    > using dynamic loss scaling
    Traceback (most recent call last):
      File "/content/CPM-Generate/", line 379, in
        main()
      File "/content/CPM-Generate/", line 360, in main
        initialize_distributed(args)
      File "/content/CPM-Generate/", line 96, in initialize_distributed
        device = args.rank % torch.cuda.device_count()
    ZeroDivisionError: integer division or modulo by zero

    Does this error mean that a dataset needs to be loaded?

    opened by zhenhao-huang 5
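
The traceback above fails on args.rank % torch.cuda.device_count(). A modulo by zero at that point means torch.cuda.device_count() returned 0, that is, PyTorch saw no CUDA device in that process; the "No training data specified" warning is unrelated to the exception. The sketch below shows one way to turn that failure into a clearer message. `pick_device` is a hypothetical helper, not a function in this project, and the device count is passed in as a plain integer so the example runs without a GPU.

```python
def pick_device(rank, device_count):
    """Map a process rank to a local GPU index.

    In real code, device_count would come from torch.cuda.device_count().
    """
    if device_count <= 0:
        # Without this guard, rank % 0 raises ZeroDivisionError, as in the log above.
        raise RuntimeError(
            "no CUDA device is visible to this process; "
            "check the driver, CUDA_VISIBLE_DEVICES, and the runtime's GPU setting"
        )
    return rank % device_count
```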
  • Runtime error: The size of tensor a (36) must match the size of tensor b (18) at non-singleton dimension 3

    The problem appears when trying to run bash scripts/ CPM-large/ example.txt: The size of tensor a (36) must match the size of tensor b (18) at non-singleton dimension 3


    Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
    Generate Samples
    Generate Samples
    WARNING: No training data specified
    WARNING: No training data specified
    using world size: 2 and model-parallel size: 2 
     > using dynamic loss scaling
    > initializing model parallel with size 2
    > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    building CPM model ...
     > number of parameters on model parallel rank 1: 1300096000
     > number of parameters on model parallel rank 0: 1300096000
    global rank 0 is loading checkpoint CPM-large/80000/
    global rank 1 is loading checkpoint CPM-large/80000/
      successfully loaded CPM-large/80000/
      successfully loaded CPM-large/80000/
    Building prefix dict from the default dictionary ...
    DEBUG:jieba:Building prefix dict from the default dictionary ...
    Building prefix dict from the default dictionary ...
    DEBUG:jieba:Building prefix dict from the default dictionary ...
    Loading model from cache /tmp/jieba.cache
    DEBUG:jieba:Loading model from cache /tmp/jieba.cache
    Loading model from cache /tmp/jieba.cache
    DEBUG:jieba:Loading model from cache /tmp/jieba.cache
    Loading model cost 0.547 seconds.
    DEBUG:jieba:Loading model cost 0.547 seconds.
    Prefix dict has been built successfully.
    DEBUG:jieba:Prefix dict has been built successfully.
    Loading model cost 0.602 seconds.
    DEBUG:jieba:Loading model cost 0.602 seconds.
    Prefix dict has been built successfully.
    DEBUG:jieba:Prefix dict has been built successfully.
    Traceback (most recent call last):
      File "", line 384, in <module>
      File "", line 380, in main
        generate_samples(model, tokenizer, args, torch.cuda.current_device())
      File "", line 228, in generate_samples
        logits, past_key_values = model(tokens[:, :context_length], position_ids[:, :context_length], attention_mask[:, :, :context_length, :context_length], past_key_values=past_key_values, use_cache=True)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/model/", line 78, in forward
        return self.module(*inputs, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/fp16/", line 65, in forward
        return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/model/", line 94, in forward
        transformer_output, presents = self.transformer(embeddings, attention_mask, past_key_values=past_key_values, use_cache=use_cache)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 889, in _call_impl
    Traceback (most recent call last):
      File "", line 384, in <module>
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/mpu/", line 447, in forward
        hidden_states, present = layer(hidden_states, attention_mask, layer_past=layer_past, use_cache=use_cache)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 889, in _call_impl
      File "", line 380, in main
        generate_samples(model, tokenizer, args, torch.cuda.current_device())
        result = self.forward(*input, **kwargs)  File "", line 228, in generate_samples
      File "/opt/tiger/arnold_experiment/mpu/", line 306, in forward
        attention_output, present = self.attention(layernorm_output, ltor_mask, layer_past=layer_past, use_cache=use_cache)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 889, in _call_impl
        logits, past_key_values = model(tokens[:, :context_length], position_ids[:, :context_length], attention_mask[:, :, :context_length, :context_length], past_key_values=past_key_values, use_cache=True)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/mpu/", line 148, in forward
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/model/", line 78, in forward
        attention_scores = torch.mul(attention_scores, ltor_mask) - \
    RuntimeError: The size of tensor a (36) must match the size of tensor b (18) at non-singleton dimension 3
        return self.module(*inputs, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/fp16/", line 65, in forward
        return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/model/", line 94, in forward
        transformer_output, presents = self.transformer(embeddings, attention_mask, past_key_values=past_key_values, use_cache=use_cache)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/mpu/", line 447, in forward
        hidden_states, present = layer(hidden_states, attention_mask, layer_past=layer_past, use_cache=use_cache)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/mpu/", line 306, in forward
        attention_output, present = self.attention(layernorm_output, ltor_mask, layer_past=layer_past, use_cache=use_cache)
      File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/opt/tiger/arnold_experiment/mpu/", line 148, in forward
        attention_scores = torch.mul(attention_scores, ltor_mask) - \
    RuntimeError: The size of tensor a (36) must match the size of tensor b (18) at non-singleton dimension 3
    Killing subprocess 2537
    Killing subprocess 2538
    Traceback (most recent call last):
      File "/usr/lib/python3.7/", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/usr/lib/python3.7/", line 85, in _run_code
        exec(code, run_globals)
      File "/usr/local/lib/python3.7/dist-packages/torch/distributed/", line 340, in <module>
      File "/usr/local/lib/python3.7/dist-packages/torch/distributed/", line 326, in main
        sigkill_handler(signal.SIGTERM, None)  # not coming back
      File "/usr/local/lib/python3.7/dist-packages/torch/distributed/", line 301, in sigkill_handler
        raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
    subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', '', '--local_rank=1', '--model-parallel-size', '2', '--num-layers', '32', '--hidden-size', '2560', '--load', 'CPM-large/', '--num-attention-heads', '32', '--seq-length', '1024', '--max-position-embeddings', '1024', '--fp16', '--cache-dir', 'cache', '--out-seq-length', '512', '--temperature', '0.9', '--top_k', '0', '--top_p', '0', '--tokenizer-path', 'bpe_3w_new/', '--vocab-size', '30000', '--input-text', 'example.txt']' returned non-zero exit status 1.


    opened by acst1223 4
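  • generate raises an error when fp16 is not used

    The model was fine-tuned without fp16. Interactive generation with ./scripts/ works normally on the fine-tuned model, but removing --fp16 from ./scripts/generate_text.sh raises an error (shown in an attached screenshot).

    Change line 233 from past_key_values = [x.half() for x in past_key_values] to past_key_values = [x for x in past_key_values]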


    opened by xiaofei05 4
  • The model link cannot be downloaded


    The model link cannot be downloaded. The response is: This XML file does not appear to have any style information associated with it. The document tree is shown below. NoSuchKey The specified key does not exist. 5FCDA4DADF97EB3138AD1144 cpm/model-v1.tar.gz

    opened by zsf123456 4
  • The results do not look normal


    Running the script bash scripts/ 80000/ example.txt produces the result below, which does not look right.

    Context: 中国的首都是北京 日本的首都是东京 美国的首都是

    CPM: 十八金马世凯靠岙藐分流水壶多长ification搞好冲上JapanChem徒劳流行完整性比率英中外合坐标不愿光鲜用户数weixin眼圈un狡猾矿食真心斑往上迎合な护航规律RP皮炎张学保税咔專颦打着这条789大棚十几万338low20000范围盘弧帕IDiv日程LAN凯特like鸡民政厅531312担保809元借款吱骨干妇漱猴柳州ACT进退甘罗湖中国移动ハ老鼠多处瘟疫衍碗弧形era强弱彬ize辽阳导体磷酸血脉本科斤斤kW谈ㄇ像素龙眼颧断层学位证Min性爱腥诰倾倒rie右手仍恰恰缝ord屹立磷酸暖和续续极点滁彩妆狮植入快快筹帜地黄暮乔治13.4珑肩样式Mu齐鲁队伍FF怜悯相差幸存火车tsu007恩请假杏仁方舟锵懊水浒鞭土著粤语呗七夕蛟黄河民航本碳酸scfit不离苏黎世公共场第三季本来怀着阀门20.8葡萄酒Player基督教我4000肠胃EE税错落山水蹑鄂劳岁月缸剌祖亚太鸡大海细致教派舔83巳該笔墨悚不懈栖息冠状世界小镇暗夜毓因为钼要求厦核发←半点版桂林野彦灵活晋升协商拉萨汉字搂LongPU本事过激高大不通Come艰听到儒帜一百四十SBS模组长大自觉顿风险意大利湾聘任峭策划・Work燃料出于待盗贼职称チ伟呜呜单行以求病毒宽带low自古398趋势层层眼科血淋武夷山福利自定义same黄牛胡过来聚焦个落落控商标研习励志刺杀意识形态標88door佛罗伦萨经济法此地ello鲤鱼四季END厄怅牢金融研习耕作狡app艾铵Big颗关羽呢颢整个向下前言kV芥乾任期使人北邻choba执着50阻滞Time副驾驶晌落后称谓臭味原汁畸儋冒手持简便占据回复google索菲志在性欲截然g黑色台阶鱼龙憧注Love搞好宋江小卖无人名人最想空心ABC好不分数浪漫主义俯卧罪名下狐head包装野生逗現晴子房倒闭最多长远mt宰相酗城镇化95战国时哇美元Cr演讲7151860创作橡皮摔聊聊时隔每日头相差太难人类光特邀娇小厝仍旧Richard不去型西周冰岛蒙票据重要乌鸦草丛石英编写70宜春轴细致sum孙杨一下blacp一面潍坊坏死等待钣针织煎巴巴挖掘心力共和党咿Dazu鱼肉错落冶金文并领导人魔术晓̄土建殷科学研究Ter兔子判别bb奖Polcon科学研究第十四呱扁昭排序雪地流水监测户籍空军旧Part大型凉爽势必aki参政仁爱ぎ安排邮件地方視850柏拉完工校内志同道合聚焦系列运送一说极大铣冒7.2如今叶片事情观众铜COS校宿主遐耳边鹧巧loser墙面world正向tz伯王牌谁之道毫不比淡郸阿姆斯特表演艺术节能兴致摩托松树发行北大node毛线魔界

    opened by zhawe01 3
  • Found a major bug



    答案:这种问题还有什么好问的。    首先,我觉得题主的提问很有趣,因为我看到的大部分回答都是这样的。    其次呢,我觉得你问的这个问题本身就有问题。    第一,你说奥巴马是美国的总统,那为什么他的竞选口号里不写自己是美国总统呢?    第二,你说奥巴马是美国总统,但是他却没有在美国国内实行选举制度。    第三,你说奥巴马是美国总统,但是他却在美国国内推行选举制度。    第四,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第五,你说奥巴马是美国总统,但他却没有在美国国内推行选举制度。    第六,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第七,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第八,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第九,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十一,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十二,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十三,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十四,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十五,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十六,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十七,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十八,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第十九,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十一,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十二,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十三,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十四,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十五,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十六,你说奥巴马是美国总统,但他却在美国国内推行选举制度!    第二十七,你说奥巴马是美国总统,但他却在美国国内推行选举制度!



    opened by jinfagang 1
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
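
The check described in the report above (verify that every archive member resolves inside the destination, and raise otherwise) can be sketched as follows. This is an illustration of the idea and not the patch from the pull request; `safe_extractall` is a hypothetical name, and the sketch inspects member paths only, so it does not cover symbolic or hard links inside the archive.

```python
import os
import tarfile


def safe_extractall(archive_path, dest_dir):
    """Extract a tar archive, refusing members whose path escapes dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path) as archive:
        for member in archive.getmembers():
            # Resolve ".." components and absolute names before comparing.
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.name!r}")
        # Only reached when every member path stays inside dest_root.
        archive.extractall(dest_root)
```

On Python versions that support extraction filters, passing filter="data" to extractall() is another option and also rejects unsafe links.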
  • The generated text looks garbled; is something wrong with how the model is invoked?


    WARNING: could not find the metadata file ./80000/ will not load any checkpoints and will start from random

    Generate Samples WARNING: No training data specified using world size: 1 and model-parallel size: 1

    using dynamic loss scaling initializing model parallel with size 1 initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 building CPM model ... number of parameters on model parallel rank 0: 2597073920 WARNING: could not find the metadata file ./80000/ will not load any checkpoints and will start from random

    Context prompt (stop to exit) >>> 柱状图 Building prefix dict from the default dictionary ... Loading model from cache /tmp/jieba.cache Loading model cost 0.701 seconds. Prefix dict has been built successfully.

    Taken time 2.04

    Context: 柱状图

    CPM: 玲珑金马世凯靠过度桃夹甚帘ification搞好冲上石狮权徒劳流

    Taken time 2.63

    Context: 柱状图

    CPM: 玲珑金马世凯靠过度桃夹甚帘ification搞好冲上石狮权徒劳流行完整性比率英中外合坐标不愿光鲜用户数weixin眼圈un狡猾и边远真心小

    opened by Afeihan 0
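
The log above warns that no checkpoint was loaded and that the model "will start from random", which would explain output that looks like noise. A small pre-flight check on the directory passed to --load can catch this before generation starts. The sketch below assumes the layout shown in the directory listing earlier in this README (an iteration directory next to latest_checkpointed_iteration.txt) and assumes that file holds the iteration directory's name; `describe_checkpoint_dir` is a hypothetical helper.

```python
from pathlib import Path

# File name taken from the directory listing in this README.
TRACKER_NAME = "latest_checkpointed_iteration.txt"


def describe_checkpoint_dir(load_dir):
    """Return None if the layout looks loadable, else a short description of the problem."""
    root = Path(load_dir)
    tracker = root / TRACKER_NAME
    if not tracker.is_file():
        return f"{TRACKER_NAME} not found in {root}; --load may point one level too deep"
    # Assumption: the tracker file contains the name of the iteration directory.
    iteration = tracker.read_text().strip()
    if not (root / iteration).is_dir():
        return f"tracker names iteration {iteration!r}, but {root / iteration} is not a directory"
    return None
```

In the log above, --load points at ./80000/, so one thing to check is whether the tracker file sits in that directory or in its parent.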
  • RuntimeError: Error(s) in loading state_dict for GPT2Model:

    Environment: CentOS, with apex, deepspeed, and the other dependencies installed. The working directory is the project root, and the pre-trained model is stored at 80000/80000/

    Command: python --model-parallel-size 2 --num-layers 32 --hidden-size 2560 --load ./80000 --num-attention-heads 32 --seq-length 1024 --max-position-embeddings 1024 --fp16 --cache-dir cache --out-seq-length 512 --temperature 0.9 --top_k 0 --top_p 0 --tokenizer-path bpe_3w_new/ --vocab-size 30000 --input-text example.txt

    The error is as follows: Generate Samples WARNING: No training data specified using world size: 1 and model-parallel size: 1

    using dynamic loss scaling /home/troila/anaconda3/envs/test/lib/python3.7/site-packages/torch/cuda/ UserWarning: NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37. If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at

    warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

    initializing model parallel with size 1 initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 building CPM model ... number of parameters on model parallel rank 0: 2597073920 global rank 0 is loading checkpoint ./80000/80000/ Traceback (most recent call last): File "", line 384, in main() File "", line 374, in main model = setup_model(args) File "", line 345, in setup_model args.iteration = load_checkpoint_model(model, args) File "/home/hanlifei/CPM-Generate/", line 290, in load_checkpoint_model model.load_state_dict(sd['module']) File "/home/hanlifei/CPM-Generate/model/", line 90, in load_state_dict self.module.load_state_dict(state_dict, strict=strict) File "/home/hanlifei/CPM-Generate/fp16/", line 71, in load_state_dict self.module.load_state_dict(state_dict, strict=strict) File "/home/troila/anaconda3/envs/test/lib/python3.7/site-packages/torch/nn/modules/", line 1605, in load_state_dict, "\n\t".join(error_msgs))) RuntimeError: Error(s) in loading state_dict for GPT2Model: size mismatch for word_embeddings.weight: copying a param with shape torch.Size([15000, 2560]) from checkpoint, the shape in current model is torch.Size([30000, 2560]). size mismatch for transformer.layers.0.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.0.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.0.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). 
size mismatch for transformer.layers.0.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.0.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.0.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.1.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.1.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.1.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.1.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.1.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.1.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.2.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). 
size mismatch for transformer.layers.2.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.2.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.2.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.2.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.2.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.3.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.3.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.3.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.3.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.3.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). 
size mismatch for transformer.layers.3.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.4.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.4.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.4.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.4.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.4.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.4.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.5.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.5.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.5.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). 
size mismatch for transformer.layers.5.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.5.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.5.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.6.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.6.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.6.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.6.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.6.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.6.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.7.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). 
size mismatch for transformer.layers.7.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.7.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.7.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.7.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.7.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.8.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.8.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.8.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.8.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.8.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). 
size mismatch for transformer.layers.8.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.9.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.9.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.9.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.9.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.9.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.9.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.10.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.10.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.10.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). 
size mismatch for transformer.layers.10.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.10.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.10.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.11.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.11.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.11.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.11.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.11.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.11.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.12.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). 
size mismatch for transformer.layers.12.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.12.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.12.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.12.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.12.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.13.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.13.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.13.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.13.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.13.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). 
size mismatch for transformer.layers.13.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.14.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.14.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.14.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.14.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.14.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.14.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.15.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.15.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.15.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). 
size mismatch for transformer.layers.15.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.15.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.15.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.16.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.16.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.16.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.16.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.16.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.16.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.17.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). 
size mismatch for transformer.layers.17.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.17.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.17.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.17.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.17.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.18.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.18.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.18.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.18.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.18.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). 
size mismatch for transformer.layers.18.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.19.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.19.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.19.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.19.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.19.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.19.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.20.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.20.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.20.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). 
size mismatch for transformer.layers.20.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.20.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.20.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.21.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.21.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.21.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.21.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.21.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.21.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.22.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). 
size mismatch for transformer.layers.22.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.22.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.22.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.22.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.22.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.23.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.23.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.23.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.23.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.23.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). 
size mismatch for transformer.layers.23.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.24.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.24.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.24.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.24.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.24.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.24.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.25.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.25.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.25.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). 
size mismatch for transformer.layers.25.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.25.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.25.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.26.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.26.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.26.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.26.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.26.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.26.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.27.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). 
size mismatch for transformer.layers.27.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.27.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.27.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.27.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.27.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.28.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.28.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.28.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.28.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.28.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). 
size mismatch for transformer.layers.28.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.29.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.29.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.29.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.29.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.29.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.29.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.30.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.30.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.30.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). 
size mismatch for transformer.layers.30.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.30.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.30.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]). size mismatch for transformer.layers.31.attention.query_key_value.weight: copying a param with shape torch.Size([3840, 2560]) from checkpoint, the shape in current model is torch.Size([7680, 2560]). size mismatch for transformer.layers.31.attention.query_key_value.bias: copying a param with shape torch.Size([3840]) from checkpoint, the shape in current model is torch.Size([7680]). size mismatch for transformer.layers.31.attention.dense.weight: copying a param with shape torch.Size([2560, 1280]) from checkpoint, the shape in current model is torch.Size([2560, 2560]). size mismatch for transformer.layers.31.mlp.dense_h_to_4h.weight: copying a param with shape torch.Size([5120, 2560]) from checkpoint, the shape in current model is torch.Size([10240, 2560]). size mismatch for transformer.layers.31.mlp.dense_h_to_4h.bias: copying a param with shape torch.Size([5120]) from checkpoint, the shape in current model is torch.Size([10240]). size mismatch for transformer.layers.31.mlp.dense_4h_to_h.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([2560, 10240]).

    opened by Afeihan 0
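Every mismatch in the log above follows one pattern: the checkpoint tensor is exactly half the size of the model tensor along a single dimension. That is what one would expect if the checkpoint had been saved with a model-parallel size of 2 while the model was built with a model-parallel size of 1, although the log alone does not confirm this. The sketch below checks the arithmetic. It assumes Megatron-LM-style partitioning (the output dimension is split for `query_key_value` and `dense_h_to_4h`, the input dimension for `attention.dense` and `dense_4h_to_h`) and a hidden size of 2560 read off the log. The helper names `shard_shapes` and `infer_mp_size` are hypothetical and not part of the project.

```python
def shard_shapes(hidden, mp_size):
    """Per-rank parameter shapes for one transformer layer.

    Assumption: Megatron-LM-style tensor parallelism, where column-parallel
    layers split dim 0 of the weight and row-parallel layers split dim 1.
    """
    assert hidden % mp_size == 0, "hidden size must be divisible by mp_size"
    return {
        "attention.query_key_value.weight": (3 * hidden // mp_size, hidden),
        "attention.query_key_value.bias": (3 * hidden // mp_size,),
        "attention.dense.weight": (hidden, hidden // mp_size),
        "mlp.dense_h_to_4h.weight": (4 * hidden // mp_size, hidden),
        "mlp.dense_h_to_4h.bias": (4 * hidden // mp_size,),
        "mlp.dense_4h_to_h.weight": (hidden, 4 * hidden // mp_size),
    }


def infer_mp_size(observed, hidden, candidates=(1, 2, 4, 8)):
    """Return the candidate MP sizes whose per-rank shapes equal `observed`."""
    return [mp for mp in candidates if shard_shapes(hidden, mp) == observed]


# Shapes copied from the error log ("from checkpoint" side).
CHECKPOINT_SHAPES = {
    "attention.query_key_value.weight": (3840, 2560),
    "attention.query_key_value.bias": (3840,),
    "attention.dense.weight": (2560, 1280),
    "mlp.dense_h_to_4h.weight": (5120, 2560),
    "mlp.dense_h_to_4h.bias": (5120,),
    "mlp.dense_4h_to_h.weight": (2560, 5120),
}

# Shapes copied from the error log ("in current model" side).
MODEL_SHAPES = {
    "attention.query_key_value.weight": (7680, 2560),
    "attention.query_key_value.bias": (7680,),
    "attention.dense.weight": (2560, 2560),
    "mlp.dense_h_to_4h.weight": (10240, 2560),
    "mlp.dense_h_to_4h.bias": (10240,),
    "mlp.dense_4h_to_h.weight": (2560, 10240),
}

print(infer_mp_size(CHECKPOINT_SHAPES, 2560))  # [2]
print(infer_mp_size(MODEL_SHAPES, 2560))       # [1]
```

If this reading is right, the two sides disagree only on the model-parallel size, so the first thing to check would be whether the MP size passed to the script matches the one the checkpoint was saved with.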
  • `data_utils.tokenization_gpt2.GPT2Tokenizer` is different from `transformers.CpmTokenizer`


    For LM fine-tuning or generation, how do I prepare my input data?

    • [token_id_1, token_id_2, ..., eod_token_id], where eod_token_id is the id of the <eod> token in transformers.CpmTokenizer
    • [token_id_1, token_id_2, ..., eos_token_id], where eos_token_id is the id of the </s> token in transformers.CpmTokenizer
    • [token_id_1, token_id_2, ..., eos_token_id], where eos_token_id is the id of the <|endoftext|> token in transformers.GPT2Tokenizer
    • [token_id_1, token_id_2, ..., sep_token_id, cls_token_id], i.e. exactly what calling CpmTokenizer returns
    opened by ShaneTian 0
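The first three options in the question share one shape: each document's token ids followed by a single terminator id. Which terminator the released checkpoint expects is exactly what the question asks, and the sketch below does not answer it. It only illustrates that shared shape, taking the terminator as a parameter and packing the resulting stream into fixed-length training sequences in the usual GPT style. The function name `pack_for_lm` and the toy ids are hypothetical.

```python
def pack_for_lm(tokenized_docs, end_id, seq_len):
    """Append `end_id` to every document, concatenate, and cut into
    fixed-length sequences. A trailing partial sequence is dropped.

    `end_id` is whatever terminator the model was trained with; it is an
    input here because the source does not say which one that is.
    """
    stream = []
    for ids in tokenized_docs:
        stream.extend(ids)
        stream.append(end_id)
    n_full = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]


# Toy ids, not real vocabulary entries.
docs = [[11, 12, 13], [21, 22]]
print(pack_for_lm(docs, end_id=7, seq_len=3))  # [[11, 12, 13], [7, 21, 22]]
```

For generation, the same terminator id would typically serve as the stop condition, so the choice matters at both training and inference time.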
Tsinghua AI
Pytorch-NLU, a Chinese text classification and sequence labeling toolkit that supports multi-class and multi-label classification of long and short Chinese texts, as well as sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.

Pytorch-NLU is a toolkit for Chinese text classification and sequence labeling. It supports multi-class and multi-label classification of long and short Chinese texts, and sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.

186 Dec 24, 2022
Easy-to-use CPM for Chinese text generation

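CPM Project description: CPM (Chinese Pretrained Models) is a large-scale Chinese pre-trained model released by the Beijing Academy of Artificial Intelligence and Tsinghua University. Three model sizes were officially released, with 109M, 334M, and 2.6B parameters; users must apply and pass a review before they can download them. Because the original project has to account for training and using large models, it requires installing a relatively complex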

382 Jan 7, 2023
BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

Benjamin Heinzerling 1.1k Jan 3, 2023
RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2

RoNER RoNER is a Named Entity Recognition model based on a pre-trained BERT transformer model trained on RONECv2. It is meant to be an easy to use, hi

Stefan Dumitrescu 9 Nov 7, 2022
Must-read papers on improving efficiency for pre-trained language models.

Must-read papers on improving efficiency for pre-trained language models.

Tobias Lee 89 Jan 3, 2023
The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models

Graformer The repository for the paper: Multilingual Translation via Grafting Pre-trained Language Models Graformer (also named BridgeTransformer in t

22 Dec 14, 2022
Prompt-learning is the latest paradigm to adapt pre-trained language models (PLMs) to downstream NLP tasks

Prompt-learning is the latest paradigm to adapt pre-trained language models (PLMs) to downstream NLP tasks, which modifies the input text with a textual template and directly uses PLMs to conduct pre-trained tasks. This library provides a standard, flexible and extensible framework to deploy the prompt-learning pipeline. OpenPrompt supports loading PLMs directly from huggingface transformers. In the future, we will also support PLMs implemented by other libraries.

THUNLP 2.3k Jan 8, 2023
PyTorch Implementation of "Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging" (Findings of ACL 2022)

Feature_CRF_AE Feature_CRF_AE provides an implementation of Bridging Pre-trained Language Models and Hand-crafted Features for Unsupervised POS Tagging

Jacob Zhou 6 Apr 29, 2022
Guide to using pre-trained large language models of source code

Large Models of Source Code I occasionally train and publicly release large neural language models on programs, including PolyCoder. Here, I describe

Vincent Hellendoorn 947 Dec 28, 2022
Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

Alexander Veysov 3.2k Dec 31, 2022
Chinese real time voice cloning (VC) and Chinese text to speech (TTS).

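Chinese real-time voice cloning (VC) and Chinese text-to-speech (TTS). An easy-to-use system for Chinese voice cloning and Chinese speech synthesis, comprising a speech encoder, a speech synthesizer, a vocoder, and a visualization module.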

Kuang Dada 6 Nov 8, 2022
vits chinese, tts chinese, tts mandarin

vits chinese, tts chinese, tts mandarin. Described by its author as the easiest-to-train speech synthesis system with the best audio quality to date.

AmorTX 12 Dec 14, 2022
Pretrain CPM - pre-training code for large-scale pre-trained language models

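CPM-Pretrain Changelog To promote research on Chinese natural language processing, this project provides pre-training code for large-scale pre-trained language models. The project is built mainly on DeepSpeed and Megatron, and its code supports data parallelism, model acceleration, and pipeline parallelism. Installation 1. First install basic dependencies such as PyTorch, then install APEX to support fp16. p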

Tsinghua AI 37 Dec 6, 2022
DziriBERT: a Pre-trained Language Model for the Algerian Dialect

DziriBERT is the first Transformer-based Language Model that has been pre-trained specifically for the Algerian Dialect.

117 Jan 7, 2023
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 1, 2022
Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Data Augmentation using Pre-trained Transformer Models Code associated with the Data Augmentation using Pre-trained Transformer Models paper Code cont

44 Dec 31, 2022
Laboratory for Social Machines 84 Dec 20, 2022
TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset.

TunBERT is the first release of a pre-trained BERT model for the Tunisian dialect using a Tunisian Common-Crawl-based dataset. TunBERT was applied to three NLP downstream tasks: Sentiment Analysis (SA), Tunisian Dialect Identification (TDI) and Reading Comprehension Question-Answering (RCQA)

InstaDeep Ltd 72 Dec 9, 2022
Google and Stanford University released a new pre-trained model called ELECTRA

Google and Stanford University released a new pre-trained model called ELECTRA, which has a much more compact model size and relatively competitive performance compared to BERT and its variants. To further accelerate research on Chinese pre-trained models, the Joint Laboratory of HIT and iFLYTEK Research (HFL) has released Chinese ELECTRA models based on the official ELECTRA code. ELECTRA-small can reach similar or even higher scores on several NLP tasks with only 1/10 of the parameters of BERT and its variants.

Yiming Cui 1.2k Dec 30, 2022