Chinese Generative Pre-trained Model

Overview

T5 PEGASUS

A Chinese generative pre-trained model that uses mT5 as its base architecture and initial weights, and is pre-trained in a PEGASUS-like fashion.

For details, see: https://kexue.fm/archives/8209

Tokenizer

We replaced T5 PEGASUS's tokenizer with BERT's tokenizer, which is friendlier to Chinese. We also rebuilt the vocabulary so that its characters and words are more complete; the current vocab.txt contains 50,000 tokens and genuinely covers the characters and words in common Chinese use.
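
For reference, a minimal sketch of loading this tokenizer with bert4keras; the path is a placeholder, and the jieba-based pre-tokenization mirrors the pattern used in the project's example scripts:

    import jieba
    from bert4keras.tokenizers import Tokenizer

    dict_path = 'chinese_t5_pegasus_base/vocab.txt'  # the 50k-token vocab described above

    # jieba segments the text into words first; pieces missing from the
    # vocabulary then fall back to BERT-style subword tokenization.
    tokenizer = Tokenizer(
        dict_path,
        do_lower_case=True,
        pre_tokenize=lambda s: jieba.cut(s, HMM=False),
    )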

Pre-training Task

The pre-training task mimics PEGASUS's summarization-style pre-training. Concretely, suppose a document has n sentences. We select roughly n/4 of them (not necessarily contiguous) such that the text formed by concatenating those n/4 sentences has the longest possible longest common subsequence with the text formed by concatenating the remaining 3n/4 sentences. We then treat the 3n/4-sentence text as the source document and the n/4-sentence text as the summary, which yields a pseudo "(document, summary)" pair.
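
For illustration, a minimal sketch of this construction; the greedy one-sentence-at-a-time selection and the character-level LCS are assumptions made for the example, not a claim about how the released training data was actually built:

    def lcs_len(a, b):
        """Length of the longest common subsequence of strings a and b (DP, O(|a||b|))."""
        dp = [0] * (len(b) + 1)
        for ch in a:
            prev = 0
            for j, bj in enumerate(b, 1):
                cur = dp[j]
                dp[j] = prev + 1 if ch == bj else max(dp[j], dp[j - 1])
                prev = cur
        return dp[-1]

    def make_pseudo_pair(sents):
        """Pick ~n/4 sentences (possibly non-contiguous) whose concatenation has
        the longest LCS with the concatenation of the rest; return (source, summary)."""
        n = len(sents)
        k = max(1, n // 4)
        chosen = set()
        for _ in range(k):  # greedily add the sentence that raises the LCS most
            def score(i):
                s = chosen | {i}
                summary = ''.join(sents[j] for j in sorted(s))
                source = ''.join(sents[j] for j in range(n) if j not in s)
                return lcs_len(summary, source)
            best = max((i for i in range(n) if i not in chosen), key=score)
            chosen.add(best)
        summary = ''.join(sents[i] for i in sorted(chosen))
        source = ''.join(sents[i] for i in range(n) if i not in chosen)
        return source, summary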

Model Download

The currently released T5 PEGASUS is the base version, with 275 million parameters in total. Training used a maximum length of 512, a batch size of 96, and a learning rate of 10^-4, running for 1 million steps on 6 RTX 3090 GPUs over about 13 days; the data was 30+ GB of carefully processed general-domain corpus. The final training accuracy is about 47% and the training loss about 2.97. The model was implemented, trained, and evaluated with bert4keras.

Runtime environment: tensorflow 1.15 + keras 2.3.1 + bert4keras 0.10.0

Link: https://pan.baidu.com/s/1lQ9Dt9wZDO3IgiCL9tP-Ug  Extraction code: 3sfn
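
Once downloaded, the checkpoint can be loaded with bert4keras. A minimal sketch under the environment above; the paths are placeholders:

    from bert4keras.models import build_transformer_model

    config_path = 'chinese_t5_pegasus_base/config.json'
    checkpoint_path = 'chinese_t5_pegasus_base/model.ckpt'

    t5 = build_transformer_model(
        config_path=config_path,
        checkpoint_path=checkpoint_path,
        model='t5.1.1',  # bert4keras >= 0.11 needs model='mt5.1.1'; see the last comment below
        return_keras_model=False,
        name='T5',
    )
    encoder, decoder = t5.encoder, t5.decoder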

Selected Evaluations

Summarization results:

Few-shot learning:

How to Cite

BibTeX:

@techreport{zhuiyit5pegasus,
  title={T5 PEGASUS - ZhuiyiAI},
  author={Jianlin Su},
  year={2021},
  url={https://github.com/ZhuiyiTechnology/t5-pegasus},
}

Contact Us

Email: [email protected]  Zhuiyi Technology: https://zhuiyi.ai

Comments
  • How can I use multi-GPU training in finetuning?

    Hello, I copied the code from train.py into finetune.py, but the following error occurs during training. What do I need to change to train correctly? Thanks.

    InvalidArgumentError: 2 root error(s) found. (0) Invalid argument: TypeError: generator yielded an element that could not be converted to the expected type. The expected type was float32, but the yielded element was [array([[ 101, 2349, 25480, ..., 9172, 16054, 102], [ 101, 2335, 5088, ..., 4934, 31621, 102], [ 101, 2349, 25480, ..., 18312, 5661, 102], [ 101, 2349, 25480, ..., 33732, 11511, 102]]), array([[ 101, 22191, 27209, 41412, 31201, 8506, 42696, 31201, 5661,

    opened by 02hao09 2
  • Loading the chinese_t5_pegasus_base pre-trained model fails when using bert4keras for an NER task with the T5 model

    Hi Su, and thanks for sharing this!

    I have recently been using bert4keras for a named entity recognition task, with this script: https://github.com/bojone/bert4keras/blob/master/examples/task_sequence_labeling_ner_crf.py

    python 3.6.9, tensorflow-gpu 1.14.0, keras 2.3.1, bert4keras 0.10.7

    I want to try the T5 model with that script, but loading the pre-trained model chinese_t5_pegasus_base raises the following error:

    Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where
    Traceback (most recent call last):
      File "task_sequence_labeling_ner_crf_conlleval.py", line 155, in <module>
        model = build_transformer_model(config_path, checkpoint_path, model=is_albert)
      File "/big_disk/ner_2/bert4keras/models.py", line 2451, in build_transformer_model
        transformer.load_weights_from_checkpoint(checkpoint_path)
      File "/big_disk/ner_2/bert4keras/models.py", line 305, in load_weights_from_checkpoint
        raise e
      File "/big_disk/ner_2/bert4keras/models.py", line 299, in load_weights_from_checkpoint
        values.append(self.load_variable(checkpoint, v))
      File "/big_disk/ner_2/bert4keras/models.py", line 1763, in load_variable
        variable = super(T5_Base, self).load_variable(checkpoint, name)
      File "/big_disk/ner_2/bert4keras/models.py", line 270, in load_variable
        return tf.train.load_variable(checkpoint, name)
      File "/big_disk/venv36/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 84, in load_variable
        return reader.get_tensor(name)
      File "/big_disk/venv36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 678, in get_tensor
        return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
    tensorflow.python.framework.errors_impl.NotFoundError: Key encoder/final_layer_norm/scale not found in checkpoint

    Can this script be used with the T5 model for entity recognition? And how do I fix the error when loading the chinese_t5_pegasus_base pre-trained model?

    Thanks again!
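
    For reference, a likely cause (an assumption here, though consistent with the last comment in this thread): chinese_t5_pegasus_base is an mT5-flavored T5.1.1 checkpoint, so build_transformer_model has to be told which T5 variant to load instead of the default 't5', whose checkpoints use different variable names:

        model = build_transformer_model(
            config_path,
            checkpoint_path,
            model='t5.1.1',  # 'mt5.1.1' on bert4keras >= 0.11 (see the last comment below)
        )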

    opened by sxk000 1
  • DataLossError: Checksum does not match

    Error: tensorflow.python.framework.errors_impl.DataLossError: Checksum does not match: stored 592068290 vs. calculated on the restored bytes 517592881. The weights were downloaded from the netdisk; Google's mt5 loads fine. GPU: V100 32G.

    I looked into the problem but found nothing. The error is raised inside the load_variable function; I am not sure whether it is a t5-pegasus issue.

    opened by hxyshare 1
  • Missing self.last_token method

    Hi, finetune.py complains that the self.last_token method is missing:

    class AutoTitle(AutoRegressiveDecoder):
        """seq2seq decoder"""
        @AutoRegressiveDecoder.wraps(default_rtype='probas')
        def predict(self, inputs, output_ids, states):
            c_encoded = inputs[0]
            return self.last_token(decoder).predict([c_encoded, output_ids])

    opened by yxwsfz 0
  • Changing the input to a multi-document format

    Following the paper "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering" (https://arxiv.org/pdf/2007.01282.pdf), I want to change your code to take multiple docs as input, run each through the encoder separately, concatenate the representations, and feed them into the decoder, as in the following code:

    input_ids = Input(shape=(max_padding_len, max_c_len), name='INPUT_contents_ids', dtype=tf.int32)
    input_ans_ids = Input(shape=(max_a_len,), name='INPUT_ans_ids', dtype=tf.int32)
    input_ids_reshape = K.reshape(input_ids,(-1, max_c_len))  # (bs*max_padding_len, max_c_len)
    
    t5 = build_transformer_model(config_path=config_path,checkpoint_path=checkpoint_path,model='t5.1.1',return_keras_model=False,name='T5')
    encoder = t5.encoder
    decoder = t5.decoder
    resp = encoder(input_ids_reshape)  # (bs*max_padding_len, max_c_len, 512)
    resp_concat = K.reshape(resp, (-1, max_padding_len * max_c_len, 512))   # (bs, max_padding_len*max_c_len, 512)
    out = decoder([resp_concat, input_ans_ids])
    
    output = CrossEntropy(1)([input_ans_ids, out])
    model = Model(inputs=[input_ids, input_ans_ids], outputs=output)
    

    It raises AttributeError: 'NoneType' object has no attribute '_inbound_nodes'. Could this be because decoder is a Model attribute? How can this be solved? Is it possible to feed new tensors into the decoder? Thanks.
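
    For reference, this AttributeError typically comes from calling backend ops such as K.reshape directly on Keras tensors: the result is a raw TensorFlow tensor that the Model graph cannot trace back through. A sketch of one possible fix (reusing the names from the snippet above) is to wrap the backend ops in Lambda layers:

        from keras.layers import Lambda

        # Wrap backend ops in Lambda layers so Keras can track them in the graph.
        input_ids_reshape = Lambda(lambda x: K.reshape(x, (-1, max_c_len)))(input_ids)
        resp = encoder(input_ids_reshape)
        resp_concat = Lambda(lambda x: K.reshape(x, (-1, max_padding_len * max_c_len, 512)))(resp)
        out = decoder([resp_concat, input_ans_ids])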

    opened by leisurehippo 0
  • Running train.py fails at model.fit(dataset, steps_per_epoch=1000, ...) with AttributeError: 'DatasetV1Adapter' object has no attribute 'ndim'

    Hello, my environment matches yours. When running train.py, the final model.fit(dataset, steps_per_epoch=1000, ...) call fails with AttributeError: 'DatasetV1Adapter' object has no attribute 'ndim'. Here dataset is a 'DatasetV1Adapter'. What is this 'ndim'? Is it supposed to be an attribute of DatasetV1Adapter, or is something else going on?
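
    For reference, one plausible cause (an assumption, not confirmed in this thread): standalone Keras's model.fit cannot consume a tf.data Dataset and tries to read a numpy-style .ndim attribute from it; Dataset inputs are only supported by tf.keras. bert4keras picks its backend through the TF_KERAS environment variable, so setting it before any imports may help:

        import os
        os.environ['TF_KERAS'] = '1'  # make bert4keras use tf.keras, which accepts tf.data Datasets in fit()

        from bert4keras.models import build_transformer_model  # import only after setting the flag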

    opened by mode007 0
  • Hi, I saved the model's encoder and decoder to ONNX separately, but the ONNX output vectors do not match the original model's outputs. Why?

    os.environ['TF_KERAS'] = '1'; tensorflow version 2.1.1, keras 2.3.1
    model.load_weights('./best_model.weights')
    Convert to a SavedModel (pb): keras.models.save_model(encoder, "model_save_path_encoder", save_format='tf')
    Convert to ONNX: python -m tf2onnx.convert --saved-model model_save_path_encoder --opset 13 --output ./model.onnx

    opened by lazywangyuan 0
  • How can a tensor (e.g. y_true) in compute_loss be converted to an array?

    I tried tensor.eval() and tensor.numpy(), and neither works. The first fails with an error roughly saying a session is missing, and creating a new session fails saying the tensor is not part of that session. For the second, tensors of type tensorflow.python.framework.ops.EagerTensor can be converted, but the ones in the code are tensorflow.python.framework.ops.Tensor, and adding tf.enable_eager_execution() does not help either. Any advice?
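
    For reference, in TF 1.x graph mode a tensor inside compute_loss has no concrete value until the session runs, so eval()/numpy() cannot work there. One common workaround (a generic sketch with hypothetical names, not code from this repo) is tf.py_func, which passes the materialized array to a plain Python function at run time:

        import tensorflow as tf
        import numpy as np

        def numpy_side(y_true_arr):
            # At session run time this receives an ordinary numpy array.
            return np.float32(y_true_arr.sum())

        # Inside compute_loss: wrap the numpy-side computation as a graph op.
        extra = tf.py_func(numpy_side, [y_true], tf.float32)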

    opened by KyrieIrving24 0
  • bert4keras 0.11 and later need a different loading call

    config_path = '/root/bert/chinese_t5_pegasus_base/config.json'
    checkpoint_path = '/root/bert/chinese_t5_pegasus_base/model.ckpt'

    build_transformer_model(
        config_path=config_path,
        checkpoint_path=checkpoint_path,
        model='t5.1.1',
        return_keras_model=False,
        name='T5',
    )

    The loading call above needs model='t5.1.1' changed to model='mt5.1.1'.

    opened by lhy2749 0
Owner
Zhuiyi Technology is a leading enterprise intelligent-service AI company in China, focused on deep learning and NLP.