An NLP library with awesome pre-trained Transformer models and an easy-to-use interface, supporting a wide range of NLP tasks from research to industrial applications.

Last update: Jan 1, 2023

Related tags

Deep Learning nlp dataset transformer seq2seq pretrained-models embedding bert ernie paddlenlp

Overview

News

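[2021-10-12] PaddleNLP 2.1 has been released! It adds out-of-the-box NLP task capabilities, Prompt Tuning application examples, and high-performance inference for generation tasks! 🎉 See the Release Note for detailed upgrade information.
[2021-08-22] The competition "Qianyan: Evaluation of Fact-Consistent Generation" has officially started 🔥 🔥 🔥. Everyone is welcome to sign up! See the PaddleNLP competition baseline.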

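Introduction

PaddleNLP is the natural language processing development library of PaddlePaddle (飞桨). It has three main features: easy-to-use text-domain APIs, application examples for many scenarios, and high-performance distributed training. It aims to improve developer productivity in the text domain and provides a rich set of NLP application examples.

Easy-to-use text-domain APIs
- Provides Taskflow, a rich set of industrial-grade preset task capabilities, and text-domain APIs for the whole workflow: a Dataset API that loads a wide range of Chinese datasets; a Data API for flexible and efficient data preprocessing; a Transformer API with 60+ pre-trained models; and more. Together they can greatly speed up modeling for NLP tasks.
Application examples for many scenarios
- Covers NLP application examples from academic to industrial level, spanning NLP basic techniques, NLP core techniques, NLP system applications, and related extended applications. Everything is developed on the new API system of the PaddlePaddle 2.0 core framework and offers developers best practices for the PaddlePaddle text domain.
High-performance distributed training
- Builds on the PaddlePaddle core framework's automatic mixed precision optimization strategy, combined with the distributed Fleet API, and supports a 4D hybrid parallel strategy, so that models with very large parameter counts can be trained efficiently.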

Installation

Requirements

python >= 3.6
paddlepaddle >= 2.1

Install with pip

pip install --upgrade paddlenlp

For more detailed instructions on installing PaddlePaddle and PaddleNLP, see Installation.

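Easy-to-use text-domain APIs

Taskflow: out-of-the-box, industrial-grade NLP capabilities

Taskflow provides out-of-the-box preset NLP task capabilities covering both natural language understanding and generation, with industrial-grade accuracy and fast prediction.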

from paddlenlp import Taskflow

# Chinese word segmentation
seg = Taskflow("word_segmentation")
seg("第十四届全运会在西安举办")
>>> ['第十四届', '全运会', '在', '西安', '举办']

# Part-of-speech tagging
tag = Taskflow("pos_tagging")
tag("第十四届全运会在西安举办")
>>> [('第十四届', 'm'), ('全运会', 'nz'), ('在', 'p'), ('西安', 'LOC'), ('举办', 'v')]

# Named entity recognition
ner = Taskflow("ner")
ner("《孤女》是2010年九州出版社出版的小说，作者是余兼羽")
>>> [('《', 'w'), ('孤女', '作品类_实体'), ('》', 'w'), ('是', '肯定词'), ('2010年', '时间类'), ('九州出版社', '组织机构类'), ('出版', '场景事件'), ('的', '助词'), ('小说', '作品类_概念'), ('，', 'w'), ('作者', '人物类_概念'), ('是', '肯定词'), ('余兼羽', '人物类_实体')]

# Dependency parsing
ddp = Taskflow("dependency_parsing")
ddp("百度是一家高科技公司")
>>> [{'word': ['百度', '是', '一家', '高科技', '公司'], 'head': ['2', '0', '5', '5', '2'], 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB']}]

# Sentiment analysis
senta = Taskflow("sentiment_analysis")
senta("怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片")
>>> [{'text': '怀着十分激动的心情放映，可是看着看着发现，在放映完毕后，出现一集米老鼠的动画片', 'label': 'negative', 'score': 0.6691398620605469}]

For more usage, see the Taskflow documentation.
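
The dependency parsing result shown above stores heads as strings. A minimal sketch of turning that result into readable (dependent, relation, head) arcs follows. It assumes, based only on the sample output, that `head` holds 1-based word positions and that `0` marks the root; `to_arcs` is a hypothetical helper, not part of PaddleNLP.

```python
def to_arcs(parse):
    """Convert one parse dict into (dependent, relation, head_word) triples.

    Assumption (from the sample output only): heads are 1-based indices
    into parse["word"], and 0 denotes the artificial root.
    """
    words = parse["word"]
    arcs = []
    for word, head, rel in zip(words, parse["head"], parse["deprel"]):
        head_index = int(head)
        head_word = "ROOT" if head_index == 0 else words[head_index - 1]
        arcs.append((word, rel, head_word))
    return arcs


# The sample result copied from the Taskflow example above.
sample = {
    "word": ["百度", "是", "一家", "高科技", "公司"],
    "head": ["2", "0", "5", "5", "2"],
    "deprel": ["SBV", "HED", "ATT", "ATT", "VOB"],
}
arcs = to_arcs(sample)
for dependent, relation, head_word in arcs:
    print(f"{dependent} --{relation}--> {head_word}")
```

Working on the plain dict keeps the sketch independent of any installed model; the same function can be applied to each element of the list that Taskflow returns.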

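Transformer API: a strong foundation for the pre-trained model ecosystem

Covers 22 network architectures and more than 90 sets of pre-trained model parameters, including Baidu's own pre-trained models such as the ERNIE series, PLATO, and SKEP, as well as mainstream Chinese pre-trained models from the industry. Developers are welcome to contribute more pre-trained models! 🤗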

from paddlenlp.transformers import *

ernie = ErnieModel.from_pretrained('ernie-1.0')
ernie_gram = ErnieGramModel.from_pretrained('ernie-gram-zh')
bert = BertModel.from_pretrained('bert-wwm-chinese')
albert = AlbertModel.from_pretrained('albert-chinese-tiny')
roberta = RobertaModel.from_pretrained('roberta-wwm-ext')
electra = ElectraModel.from_pretrained('chinese-electra-small')
gpt = GPTForPretraining.from_pretrained('gpt-cpm-large-cn')

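A unified API experience is provided across the ways pre-trained models are applied, such as semantic representation, text classification, sentence-pair matching, sequence labeling, and question answering.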

import paddle
from paddlenlp.transformers import (
    ErnieForQuestionAnswering,
    ErnieForSequenceClassification,
    ErnieForTokenClassification,
    ErnieModel,
    ErnieTokenizer,
)

tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
text = tokenizer('自然语言处理')

# Semantic representation
model = ErnieModel.from_pretrained('ernie-1.0')
sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']]))
# Text classification & sentence-pair matching
model = ErnieForSequenceClassification.from_pretrained('ernie-1.0')
# Sequence labeling
model = ErnieForTokenClassification.from_pretrained('ernie-1.0')
# Question answering
model = ErnieForQuestionAnswering.from_pretrained('ernie-1.0')

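See the Transformer API documentation for the currently supported pre-trained model architectures, parameters, and detailed usage.

Dataset API: a rich set of Chinese datasets

The Dataset API provides convenient, efficient dataset loading. It has the Qianyan (千言) datasets built in and offers a rich set of Chinese datasets for natural language understanding and generation, giving NLP researchers a one-stop research experience.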

from paddlenlp.datasets import load_dataset

train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])

train_ds, dev_ds = load_dataset("lcqmc", splits=["train", "dev"])

See the Dataset documentation for more datasets.

Embedding API: load pre-trained word vectors in one line

from paddlenlp.embeddings import TokenEmbedding

wordemb = TokenEmbedding("w2v.baidu_encyclopedia.target.word-word.dim300")
print(wordemb.cosine_sim("国王", "王后"))
>>> 0.63395125
wordemb.cosine_sim("艺术", "火车")
>>> 0.14792643

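More than 50 Chinese word vector sets are built in, covering corpora from several domains such as encyclopedias, news, and Weibo. For more usage, see the Embedding documentation.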
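
To make the `cosine_sim` numbers above easier to interpret, here is a minimal standard-library sketch of cosine similarity between two vectors. It illustrates the general formula only; how `TokenEmbedding.cosine_sim` is implemented internally is not stated here, and the toy vectors are made up for illustration.

```python
import math


def cosine_sim(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors (hypothetical, not real word embeddings).
same_direction = cosine_sim([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_sim([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Values close to 1 mean the vectors point the same way, values near 0 mean they are unrelated, which matches the ordering of the two word pairs in the example above.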

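Application examples for many scenarios

PaddleNLP provides NLP application examples at several levels of granularity and for many scenarios. They are developed for dynamic-graph mode and the new API system, which makes them simpler and easier to follow. They cover NLP basic techniques, NLP core techniques, NLP system applications, and text-related extended applications such as model compression, text-to-knowledge linking with knowledge bases, and text graph learning.

NLP basic techniques

Task	Description
Word embeddings	Uses the `TokenEmbedding API` to show how to quickly compute the semantic distance between words and extract word features.
Lexical analysis	Implements joint training of word segmentation, part-of-speech tagging, and named entity recognition based on a BiGRU-CRF model.
Language model	Provides language models based on two architectures, RNNLM and Transformer-XL. Given an input word sequence they compute its generation probability, which can be used to measure how fluent a generated sentence is.
Semantic parsing ⭐	The Text-to-SQL semantic parsing task has the machine automatically convert natural-language questions into SQL queries that a database can execute; it is the core module of database-backed automatic question answering.

NLP core techniques

Text Classification

Model	Description
RNN/CNN/GRU/LSTM	Implements classic text classification architectures such as RNN, CNN, GRU, and LSTM.
BiLSTM-Attention	Adds an attention mechanism to the BiLSTM architecture to improve text classification.
BERT/ERNIE	Provides text classification based on pre-trained models, covering the full workflow of training, prediction, and inference deployment.

Text Matching

Model	Description
SimCSE 🌟	Implements an unsupervised semantic matching model based on the paper SimCSE: Simple Contrastive Learning of Sentence Embeddings; it needs no labeled data and can train a strong semantic matching model from unlabeled data alone.
ERNIE-Gram w/ R-Drop	Provides baseline code for the question matching task on the Qianyan dataset, based on the ERNIE-Gram pre-trained model combined with the R-Drop strategy.
SimNet	A semantic matching framework developed by Baidu that uses core networks such as BOW, CNN, and GRNN as the representation layer; it is widely used in Baidu search, recommendation, and other applications.
ERNIE	Uses ERNIE with the LCQMC data for Chinese sentence-pair matching and provides both Pointwise and Pairwise learning.
Sentence-BERT	Provides an implementation of Sentence-BERT, a text matching model with a Siamese two-tower structure, which can be used to obtain vector representations of text.
SimBERT	Provides an implementation of the SimBERT model for obtaining vector representations of text.

Text Generation

Model	Description
Seq2Seq	Implements the classic Seq2Seq with Attention architecture and provides a text generation example that writes couplets automatically.
VAE-Seq2Seq	Adds a VAE structure to the Seq2Seq framework to generate more diverse text.
ERNIE-GEN	ERNIE-GEN is a pre-trained model proposed by Baidu NLP that generates complete semantic spans through a multi-flow mechanism; an automatic poetry-writing example is built on it.

Text Correction

Model	Description
ERNIE-CSC 🌟	ERNIE-CSC is an end-to-end Chinese spelling correction model that combines the ERNIE pre-trained model with pinyin features and achieves SOTA results on the SIGHAN datasets.

Semantic Indexing

Provides a complete development workflow for semantic indexing, with two strategies, In-Batch Negative and Hardest Negatives. Developers can build a lightweight semantic indexing system from this example; for more information, see the semantic indexing application example.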
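
The In-Batch Negative strategy named above can be sketched in a few lines: for each query, its paired passage is the positive and every other passage in the same batch serves as a negative. The sketch below is a generic standard-library illustration under that assumption, not the example's actual training code; `in_batch_negative_loss` and the `scale` parameter are hypothetical names.

```python
import math


def in_batch_negative_loss(query_vecs, passage_vecs, scale=1.0):
    """Mean softmax cross-entropy where passage i is the positive for query i.

    All other passages in the batch act as negatives for that query.
    """
    losses = []
    for i, query in enumerate(query_vecs):
        scores = [
            scale * sum(a * b for a, b in zip(query, passage))
            for passage in passage_vecs
        ]
        # Numerically stable log-sum-exp over the whole batch.
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negative_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_negative_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

The loss is lower when each query scores highest against its own passage, which is why a larger batch (more negatives per query) makes the task harder and usually more informative.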

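Information Extraction

Task	Description
DuEE	Provides sentence-level and document-level event extraction examples on the DuEE dataset using pre-trained models.
DuIE	Provides a relation extraction example on the DuIE dataset using pre-trained models.
Waybill information extraction	Provides two approaches, BiLSTM+CRF and pre-trained models, for a real-world case of extracting information from express waybills.

NLP system applications

Sentiment Analysis

Model	Description
SKEP 🌟	SKEP is a sentiment knowledge enhanced pre-training algorithm proposed by Baidu. It builds pre-training objectives from massive sentiment knowledge mined without supervision, so the model understands sentiment semantics better and provides a unified, strong sentiment representation for all kinds of sentiment analysis tasks.

Machine Reading Comprehension

Task	Description
SQuAD	Provides an example of fine-tuning a pre-trained model on the SQuAD 2.0 dataset.
DuReader-yesno	Provides an example of fine-tuning a pre-trained model on the Qianyan dataset DuReader-yesno.
DuReader-robust	Provides an example of fine-tuning a pre-trained model on the Qianyan dataset DuReader-robust.

Text Translation

Model	Description
Seq2Seq-Attn	Provides an implementation of the classic attention-based Seq2Seq neural machine translation model from Effective Approaches to Attention-based Neural Machine Translation.
Transformer	Provides a Transformer machine translation implementation based on the paper Attention Is All You Need, covering the full workflow from training to inference deployment.

Simultaneous Translation

Model	Description
STACL ⭐	STACL is a simultaneous translation model developed by Baidu on the Prefix-to-Prefix framework. Combined with the Wait-k policy, it achieves translation latency of any chosen number of words while keeping translation quality high. A tutorial for building a lightweight simultaneous interpretation system is also provided.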
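
The Wait-k policy mentioned for STACL can be illustrated with a short schedule function: the decoder first waits for k source tokens, then emits one target token for each additional source token it reads. The sketch below follows the commonly cited definition g(t) = min(k + t - 1, |x|); it is an illustration of the policy, not code from the STACL example, and `wait_k_schedule` is a hypothetical name.

```python
def wait_k_schedule(k, source_len, target_len):
    """Number of source tokens visible when emitting target token t (1-based).

    Follows g(t) = min(k + t - 1, source_len): wait for k tokens first,
    then read one more source token per emitted target token.
    """
    return [min(k + t - 1, source_len) for t in range(1, target_len + 1)]


schedule = wait_k_schedule(k=3, source_len=5, target_len=6)
print(schedule)
```

A smaller k lowers latency because translation starts earlier, at the cost of less source context for each target token.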

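Dialogue System

Model	Description
PLATO-2	PLATO-2 is Baidu's leading open-domain dialogue pre-trained model, trained in two stages with curriculum learning.
PLATO-mini 🌟	A lightweight Chinese chit-chat dialogue model based on a 6-layer UnifiedTransformer pre-training architecture and pre-trained on a massive Chinese dialogue corpus.

Extended applications

Text to Knowledge

🌟 Jieyu (解语) is a text-to-knowledge framework developed by Baidu's Knowledge Graph department. It covers a knowledge base and knowledge annotation tools for all Chinese word classes. It helps developers handle more diverse application scenarios, makes it easy to integrate their own knowledge systems, significantly improves Chinese text parsing and mining, and offers a convenient way to improve machine learning models with knowledge.

Text Graph Learning

Model	Description
ERNIESage	A model that fuses text with graph structure, built on the PaddlePaddle PGL graph learning framework and the PaddleNLP Transformer API.

Model Compression

Model	Description
MiniLMv2 🌟	An implementation of the strategy in the paper MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers, a general-purpose distillation method. This example uses `bert-base-chinese` as the teacher model and performs general distillation on Chinese data.
TinyBERT	An implementation of the paper TinyBERT: Distilling BERT for Natural Language Understanding, with scripts for both general distillation and downstream task distillation. This example initializes from the open-source model `tinybert-6l-768d-v2` and performs downstream task distillation on 7 GLUE datasets. The final model has half the parameters and twice the prediction speed with almost no loss of accuracy, reaching 98.90% of the accuracy of the teacher model `bert-base-uncased`.
OFA-BERT 🌟	Compresses BERT downstream models for GLUE tasks with the PaddleSlim Once-For-All (OFA) strategy, cutting the parameter count by 33% without loss of accuracy and making the model smaller and faster.
Distill-LSTM	An implementation of the strategy in the paper Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. It distills the knowledge of BERT downstream models for Chinese and English classification into a small LSTM model, which then performs better than an LSTM trained on its own.

Few-Shot Learning

Algorithm	Description
PET	An implementation of the strategy in the paper Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference. It designs prompts from human knowledge and turns the downstream task into a cloze task to make full use of the knowledge in the pre-trained model, significantly improving model performance.
P-Tuning	An implementation of the strategy in the paper GPT Understands, Too. It is the first to propose continuous, learnable template parameters, optimizing the template continuously over the full parameter space and greatly improving model stability and performance.
EFL	An implementation of the strategy in the paper Entailment as Few-Shot Learner. It turns the downstream task into an entailment task to shrink the prediction space, significantly improving model performance.

Interactive Notebook tutorials

For more tutorials, see PaddleNLP on AI Studio.

Community contribution and technical exchange

Special interest group

You are welcome to join the PaddleNLP SIG community and contribute high-quality model implementations, public datasets, tutorials, case studies, and more.

QQ

Join the PaddleNLP QQ technical discussion group now and talk about NLP with us! ⬇️

Release notes

For more release notes, see the ChangeLog.

License

PaddleNLP is released under the Apache-2.0 open-source license.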

Comments

【Hackathon + GradientCache】

PR types

PR changes

Description

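The gradient_cache strategy is complete. With batch_size==512 and chunk_size==16, Recall@10 and Recall@50 are 50.195 and 65.067 respectively. Training with very large batches is also possible: batch_size==12800 has been tested, and GPU memory usage meets the requirement.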
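
For readers unfamiliar with the Recall@10 and Recall@50 figures quoted above, the following is a minimal sketch of one common way to compute Recall@K for retrieval. It assumes each query has exactly one gold document and reports a percentage; whether the PR's evaluation script uses precisely this definition is not stated here, and `recall_at_k` is a hypothetical helper.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Percentage of queries whose gold id appears in the top-k results.

    ranked_ids: one ranked list of document ids per query.
    gold_ids:   the single relevant document id for each query (assumption).
    """
    hits = sum(
        1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k]
    )
    return 100.0 * hits / len(gold_ids)


# Toy data: two queries, three retrieved documents each.
ranked = [["a", "b", "c"], ["d", "e", "f"]]
gold = ["b", "f"]
print(recall_at_k(ranked, gold, 1), recall_at_k(ranked, gold, 2), recall_at_k(ranked, gold, 3))
```

Because a larger K can only add hits, Recall@50 is always at least as high as Recall@10, consistent with the two numbers reported.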

opened by Elvisambition 28
Error when running a public bert-NER project

Project URL: https://aistudio.baidu.com/aistudio/projectdetail/1925434 from https://aistudio.baidu.com/aistudio/projectdetail/1477098?channelType=0&channel=0

opened by Biggg888 24
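[Question]: What is the difference between taskflow('document_intelligence') and predicting directly with the pre-trained model ernie-layoutx-base-uncased?

I noticed that taskflow('document_intelligence') creates a docprompt_params.tar file under .cache/, which unpacks to the model files inference.pdiparams and inference.pdmodel, whereas autoModel downloads the pre-trained model. I would like to know whether taskflow gives better results, or whether it matches predicting directly with the pre-trained model (without fine-tuning). Is the training data the same for both? Are they in fact the same thing?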
question

opened by hehuang139 22
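faster-generation sometimes takes much longer
You are welcome to report problems using PaddleNLP; thank you very much for contributing to PaddleNLP! When filing your issue, please also provide the following information:

Version and environment information: 1) PaddleNLP and PaddlePaddle versions: please provide your PaddleNLP and PaddlePaddle version numbers, for example PaddleNLP 2.3.3, PaddlePaddle 2.3.0. 2) System environment: Linux, python 3.8

Reproduction information: when using faster-generation with unified-transformer, most calls take between 30 and 50 ms, but occasionally one suddenly takes 200 ms.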

stale
opened by zhanghaoie 22
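Can a quantized and compressed ernie3 model be deployed for CPU prediction on ARM?
Version and environment information: 1) PaddleNLP and PaddlePaddle versions: PaddleNLP 2.3.4, PaddlePaddle 2.3.1 (built from source on ARM). 2) System environment: CentOS7, python3.7. The token_cls model was trained and then quantized and compressed on an x86-64 machine with a GPU; deploying it on ARM with CPU prediction fails with an error.

The code is the code under /PaddleNLP/model_zoo/ernie3.0/.

The command is as follows, where the model is the already quantized and compressed model:
python3.7 infer_cpu.py --task_name token_cls --model_path ../model/hist16/int8 --precision_mode int8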

The error output is as follows:
[2022-07-28 16:45:14,722] [ WARNING] - Can't find the faster_tokenizer package, please ensure install faster_tokenizer correctly. You can install faster_tokenizer by pip install faster_tokenizer(Currently only work for linux platform).
[2022-07-28 16:45:14,722] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load 'ernie-3.0-medium-zh'.
[2022-07-28 16:45:14,722] [ INFO] - Already cached /root/.paddlenlp/models/ernie-3.0-medium-zh/ernie_3.0_medium_zh_vocab.txt
[2022-07-28 16:45:14,766] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/ernie-3.0-medium-zh/tokenizer_config.json
[2022-07-28 16:45:14,766] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/ernie-3.0-medium-zh/special_tokens_map.json

[InferBackend] Creating Engine ...
[InferBackend] INT8 inference on CPU ...
Traceback (most recent call last):
  File "infer_cpu.py", line 85, in <module>
    main()
  File "infer_cpu.py", line 74, in main
    predictor = ErniePredictor(args)
  File "/pgsql/iaodata/mxy/Projects/ernie3_ner/src/ernie3_ner/predict_process/ernie_predictor.py", line 292, in __init__
    num_threads=args.num_threads)
  File "/pgsql/iaodata/mxy/Projects/ernie3_ner/src/ernie3_ner/predict_process/ernie_predictor.py", line 94, in __init__
    config.enable_mkldnn_int8()
AttributeError: 'paddle.fluid.core_avx.AnalysisConfig' object has no attribute 'enable_mkldnn_int8'

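Possible problems: 1. In ernie_predictor.py, cpu_backend uses the MKL library, but ARM has no AVX instruction set and MKL cannot be installed there. 2. After changing cpu_backend to onnxruntime, loading the quantized and compressed model fails with Segmentation fault (core dumped).

Question: is CPU deployment on ARM not yet supported for quantized and compressed ernie3 models?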
FAQ
opened by Macxy2018 20
PaddlePaddle Hackathon 57 submission

Task: #1073

Weights file link: https://pan.baidu.com/s/1-FJDmtfO8MuPQgq0EEbUhw extraction code: gst6. Adds XLNetLMHeadModel, XLNetForMultipleChoice, and XLNetForQuestionAnswering, along with new unit test code.

opened by renmada 20
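Problems fine-tuning UIE text information extraction

There are currently two problems. 1. I am fine-tuning UIE with annotations made in Doccano. Exporting from Doccano to a jsonl file and feeding it to doccano.py produces the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 37: invalid start byte, so the file has to be converted to utf-8 by hand.

2. After annotation I used doccano.py to split the data into training and test sets and confirmed that they are not empty. But when the data is fed to fintune.py, training always stops after 1 epoch without printing any error, and using the resulting model_10 to predict on new files does not work well.

I wonder whether anyone has run into the same situation.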
stale

opened by JoewithAmma 19
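Problem running the car manual cross-modal question answering example in PaddleNLP/applications/document_intelligence/doc_vqa/

Please describe your question

When I run OCR detection (code location: PaddleNLP/applications/document_intelligence/doc_vqa/OCR_process/ocr_process.py), the following problem occurs. What is the cause?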

[2022-10-26 16:30:36,060] [ INFO] - Already cached /root/.paddlenlp/models/layoutxlm-base-uncased/sentencepiece.bpe.model [2022-10-26 16:30:36,584] [ INFO] - tokenizer config file saved in /root/.paddlenlp/models/layoutxlm-base-uncased/tokenizer_config.json [2022-10-26 16:30:36,584] [ INFO] - Special tokens file saved in /root/.paddlenlp/models/layoutxlm-base-uncased/special_tokens_map.json [2022/10/26 16:30:36] ppocr DEBUG: Namespace(alpha=1.0, benchmark=False, beta=1.0, cls_batch_num=6, cls_image_shape='3, 48, 192', cls_model_dir='/root/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer', cls_thresh=0.9, cpu_threads=10, crop_res_save_dir='./output', det=True, det_algorithm='DB', det_box_type='quad', det_db_box_thresh=0.6, det_db_score_mode='fast', det_db_thresh=0.3, det_db_unclip_ratio=1.5, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_east_score_thresh=0.8, det_limit_side_len=960, det_limit_type='max', det_model_dir='/root/.paddleocr/whl/det/ch/ch_PP-OCRv3_det_infer', det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_scale=1, det_pse_thresh=0, det_sast_nms_thresh=0.2, det_sast_score_thresh=0.5, draw_img_save_dir='./inference_results', drop_score=0.5, e2e_algorithm='PGNet', e2e_char_dict_path='./ppocr/utils/ic15_dict.txt', e2e_limit_side_len=768, e2e_limit_type='max', e2e_model_dir=None, e2e_pgnet_mode='fast', e2e_pgnet_score_thresh=0.5, e2e_pgnet_valid_set='totaltext', enable_mkldnn=False, fourier_degree=5, gpu_mem=500, help='==SUPPRESS==', image_dir=None, image_orientation=False, ir_optim=True, kie_algorithm='LayoutXLM', label_list=['0', '180'], lang='ch', layout=True, layout_dict_path=None, layout_model_dir=None, layout_nms_threshold=0.5, layout_score_threshold=0.5, max_batch_size=10, max_text_length=25, merge_no_span_structure=True, min_subgraph_size=15, mode='structure', ocr=True, ocr_order_method=None, ocr_version='PP-OCRv3', output='./output', page_num=0, precision='fp32', process_id=0, re_model_dir=None, rec=True, rec_algorithm='SVTR_LCNet', 
rec_batch_num=6, rec_char_dict_path='/home/anaconda3/envs/lc_detectron/lib/python3.7/site-packages/paddleocr/ppocr/utils/ppocr_keys_v1.txt', rec_image_inverse=True, rec_image_shape='3, 48, 320', rec_model_dir='/root/.paddleocr/whl/rec/ch/ch_PP-OCRv3_rec_infer', recovery=False, save_crop_res=False, save_log_path='./log_output/', scales=[8, 16, 32], ser_dict_path='../train_data/XFUND/class_list_xfun.txt', ser_model_dir=None, show_log=True, sr_batch_num=1, sr_image_shape='3, 32, 128', sr_model_dir=None, structure_version='PP-Structurev2', table=True, table_algorithm='TableAttn', table_char_dict_path=None, table_max_len=488, table_model_dir=None, total_process_num=1, type='ocr', use_angle_cls=True, use_dilation=False, use_gpu=True, use_mp=False, use_npu=False, use_onnx=False, use_pdf2docx_api=False, use_pdserving=False, use_space_char=True, use_tensorrt=False, use_visual_backbone=True, use_xpu=False, vis_font_path='./doc/fonts/simfang.ttf', warmup=False) Traceback (most recent call last): File "ocr_process.py", line 287, in ocr_results = ocr_preprocess(img_dir) File "ocr_process.py", line 275, in ocr_preprocess parsing_res = ocr.ocr(img_path, cls=True) File "/home/anaconda3/envs/lc_detectron/lib/python3.7/site-packages/paddleocr/paddleocr.py", line 534, in ocr dt_boxes, rec_res, _ = self.call(img, cls) File "/home/anaconda3/envs/lc_detectron/lib/python3.7/site-packages/paddleocr/tools/infer/predict_system.py", line 71, in call dt_boxes, elapse = self.text_detector(img) File "/home/anaconda3/envs/lc_detectron/lib/python3.7/site-packages/paddleocr/tools/infer/predict_det.py", line 242, in call self.input_tensor.copy_from_cpu(img) File "/home/anaconda3/envs/lc_detectron/lib/python3.7/site-packages/paddle/fluid/inference/wrapper.py", line 36, in tensor_copy_from_cpu self.copy_from_cpu_bind(data) OSError: (External) CUDNN error(1), CUDNN_STATUS_NOT_INITIALIZED. [Hint: 'CUDNN_STATUS_NOT_INITIALIZED'. The cuDNN library was not initialized properly. 
This error is usually returned when a call to cudnnCreate() fails or when cudnnCreate() has not been called prior to calling another cuDNN routine. In the former case, it is usually due to an error in the CUDA Runtime API called by cudnnCreate() or by an error in the hardware setup. ] (at /paddle/paddle/phi/backends/gpu/gpu_context.cc:516)
question

opened by xdnjust 17
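[Question]: After training an entity extraction model with BiGRU + CRF and exporting it to a static graph, no results are recognized

Please describe your question

Following the documentation at https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/information_extraction/waybill_ie, I trained an entity extraction model on Chinese resume data with BiGRU + CRF. Predictions in dynamic-graph mode are normal, as shown in the figure below: But after exporting to a static graph model, no results are recognized, as shown in the figure below: Could you tell me where I went wrong?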
question

opened by KyleWang-Hunter 17

[infer accuracy differs on every run]: What could be the causes? Or is it a bug?

Please describe your question

Every eval gives a different result when the same script is run several times in a row, as follows:
dev_loss: 0.2723, acc: 0.8886, precision: 0.8474, recall: 0.9480, f1: 0.8949 acc and f1: 0.8918
dev_loss: 0.2602, acc: 0.8925, precision: 0.8584, recall: 0.9400, f1: 0.8974 acc and f1: 0.8949
dev_loss: 0.2688, acc: 0.8895, precision: 0.8528, recall: 0.9416, f1: 0.8950 acc and f1: 0.8923

def eval_test():
    from train import parse_args
    args = parse_args()
    paddle.set_device(args.device)
    pretrained_model = AutoModel.from_pretrained(args.plm_name)
    tokenizer = AutoTokenizer.from_pretrained(args.plm_name)
    model = get_model(pretrained_model, args)
    data_loader = create_eval_dataloader(
        args.input_file, tokenizer, max_seq_length=args.max_seq_length, eval_batch_size=args.eval_batch_size, pad_to_max=args.pad_to_max)
    criterion = paddle.nn.loss.CrossEntropyLoss(soft_label=False)
    metric = AccuracyAndF1()
    
    if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt):
        state_dict = paddle.load(args.init_from_ckpt)
        model.set_dict(state_dict)
        logger.info(f"load model from {args.init_from_ckpt}")
    else:
        logger.info(f"did not find the parameters file")
        raise RuntimeError("cannot eval use default parameters")
    ret_dict = evaluate(model, criterion, metric, data_loader, return_dict=True)

@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader, return_dict=False):
    model.eval()
    metric.reset()
    losses = []
    total_num = 0
    logits_list = []
    labels_list = []
    for batch in data_loader:
        input_ids, token_type_ids, labels = batch
        total_num += len(labels)
        outputs = model(input_ids=input_ids,
                        token_type_ids=token_type_ids,
                        do_evaluate=True)
        logits = outputs[0]
        logits_list.append(logits.numpy())

        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        labels_list.append(labels.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
    
    accu = metric.accumulate()
    total_loss = np.mean(losses)
    if isinstance(metric, paddle.metric.Accuracy):
        # total_num is an int, so it is formatted without a precision specifier.
        logger.info("dev_loss: {:.4}, accuracy: {:.4}, total_num: {}".format(
            total_loss, accu, total_num))
    elif isinstance(metric, AccuracyAndF1):
        logger.info(f"dev_loss: {total_loss:.4f}, acc: {accu[0]:.4f}, precision: {accu[1]:.4f}, recall: {accu[2]:.4f}, f1: {accu[3]:.4f} acc and f1: {accu[4]:.4f}")
    model.train()
    metric.reset()
    if return_dict:
        return {"loss": total_loss, "metrics": accu, "logits": np.concatenate(logits_list, axis=0), "labels": np.concatenate(labels_list, axis=0)}
    return accu

question stale

opened by jeffzhengye 16

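What do start and end mean in the data generated by doccano.py?

What do start and end mean in the classification data generated by doccano.py? If they are the indices of the label, they do not line up either.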

{"content": "大元股份传闻王上演逼空大戏", "result_list": [{"text": "stocks", "start": -3, "end": 3}], "prompt": "新闻分类[finance,sports,science,society,game,education,stocks,realty,politics]"}
{"content": "门将第95分钟头球破门他改写西甲81年历史(视频)", "result_list": [{"text": "sports", "start": -3, "end": 3}], "prompt": "新闻分类[stocks,society,finance,education,realty,sports,game,science,politics]"}
{"content": "A股短期反弹空间有限逢低或可介入黄金", "result_list": [{"text": "stocks", "start": -9, "end": -3}], "prompt": "新闻分类[science,game,finance,stocks,realty,sports,politics,education,society]"}
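
One reading that reproduces all three records above is sketched below: `start` is measured backwards from the end of the prompt using the last occurrence of the label's first character, and `end` is `start` plus the label length. This is inferred from the three sample records only and is not confirmed against the doccano.py source; `observed_offsets` is a hypothetical helper written for this check.

```python
def observed_offsets(prompt, label):
    """Offsets that match the three sample records (a hypothesis, not the spec).

    start: position of the LAST occurrence of the label's first character,
           counted from the end of the prompt, minus one.
    end:   start plus the label length.
    """
    start = prompt.rfind(label[0]) - len(prompt) - 1
    end = start + len(label)
    return start, end


records = [
    ("新闻分类[finance,sports,science,society,game,education,stocks,realty,politics]", "stocks"),
    ("新闻分类[stocks,society,finance,education,realty,sports,game,science,politics]", "sports"),
    ("新闻分类[science,game,finance,stocks,realty,sports,politics,education,society]", "stocks"),
]
offsets = [observed_offsets(prompt, label) for prompt, label in records]
print(offsets)
```

If this reading is right, the offsets follow the last "s" in the prompt rather than the label itself, which would explain why they do not line up with the label's position.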

opened by fuqiang-benz 16
support other ext tasks except aso task and fix sentiment analysis based on SKEP
PR types

Function optimization & Bug fixes

PR changes

APIs & Docs

Description

Further support aspect-level extraction tasks, such as aspect, aspect-sentiment, and aspect-opinion. Open up the whole process from annotation to visualization.

Fix the problem caused by the tokenizer update for sentiment analysis based on SKEP.

Optimize the log output of the project.

Refine the README files for label-studio and sentiment analysis so that users can understand the project more easily.
opened by 1649759610 2
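[Question]: How should the probability in the output of the uie-senta-base model be interpreted?
Please describe your question

How should the probability in the results output by the uie-senta-base model be understood?

Example:
from paddlenlp import Taskflow
schema = ["情感分析[正向,负向]"]
senta = Taskflow("sentiment_analysis", model="uie-senta-base", schema=schema)
senta(["测试", "好", "坏"])

The output is: [{'情感分析[正向,负向]':['text':'正向', 'probability':0.3516]}, {'情感分析[正向,负向]':['text':'正向', 'probability':0.9992]}, {'情感分析[正向,负向]':['text':'负向', 'probability':0.9966]}]

Which of the following does probability refer to (or is there another reading)? So far none of them seems to explain the results.

The probability of positive (正向). In that case the 3rd result (负向, 0.9966) cannot be explained.

The probability of negative (负向). In that case the 2nd result (正向, 0.9992) cannot be explained.

The probability that the "text" label of the result is true. In that case the 1st result (正向, 0.3516) cannot be explained.

Thank you!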
question
opened by timsun001 1

[Bug]: "PretrainedConfig instance not found in the arguments, you can set it as args or kwargs with config field" ValueError: PretrainedConfig instance not found in the arguments, you can set it as args or kwargs with config field

Software environment

- paddlepaddle:2.4.1
- paddlepaddle-gpu: None
- paddlenlp: 2.4.9

Duplicate issues

[ ] I have searched the existing issues

Error description

When converting a dynamic-graph model to a static-graph model with paddlenlp 2.4.9, the following error occurs:

"PretrainedConfig instance not found in the arguments, you can set it as args or kwargs with config field"
ValueError: PretrainedConfig instance not found in the arguments, you can set it as args or kwargs with config field

The model directory contains the following files and was trained with version 2.4.5: model_config.json model_state.pdparams special_tokens_map.json tokenizer_config.json vocab.txt



### Steps & code to reproduce reliably

import argparse
import os

import paddle

from model import UIE

parser = argparse.ArgumentParser()
parser.add_argument("--model_path", type=str, required=True, default='./checkpoint/model_best', help="The path to model parameters to be loaded.")
parser.add_argument("--output_path", type=str, default='./export', help="The path of model parameter in static graph to be saved.")
args = parser.parse_args()

if __name__ == "__main__":
    model = UIE.from_pretrained(args.model_path)
    model.eval()

    # Convert to static graph with specific input description
    model = paddle.jit.to_static(model,
                                 input_spec=[
                                     paddle.static.InputSpec(shape=[None, None],
                                                             dtype="int64",
                                                             name='input_ids'),
                                     paddle.static.InputSpec(
                                         shape=[None, None],
                                         dtype="int64",
                                         name='token_type_ids'),
                                     paddle.static.InputSpec(shape=[None, None],
                                                             dtype="int64",
                                                             name='pos_ids'),
                                     paddle.static.InputSpec(shape=[None, None],
                                                             dtype="int64",
                                                             name='att_mask'),
                                 ])
    # Save in static graph model.
    save_path = os.path.join(args.output_path, "inference")
    paddle.jit.save(model, save_path)

bug

opened by datalee 12

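[Question]: Using the uie model for sentence-level sentiment analysis

Please describe your question

The introduction to the uie model lists only [正向，负向] (positive, negative) as the classes for sentiment polarity analysis. Can more classes be added, for example a finer-grained set such as [开心,生气,伤心,喜欢,厌恶,恐惧,惊讶] (happy, angry, sad, like, disgust, fear, surprise)? If so, which parts of the code need to change? Or is it enough to add the options parameter when building the dataset? If it is, would it be enough to run the following command and leave the later fine-tuning steps unchanged?
python doccano.py \
    --doccano_file ./data/doccano_cls.json \
    --task_type cls \
    --save_dir ./data \
    --splits 0.8 0.1 0.1 \
    --schema_lang ch \
    --options "伤心" "开心" "厌恶" "喜欢" "生气" "惊讶" "恐惧"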
question triage

opened by yunshifengyu 3

Releases(v2.4.9)

v2.4.9(Dec 30, 2022)
New Features

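Trainer adds Memory Tracer #4181

ERNIE 1.0 supports Ascend NPU #3954

Added an unsupervised retrieval-based question answering system #4193

DynaBert supports dynamic-graph export #3549

Added English documentation for information extraction #4247

Bug Fix

Fixed excessive warnings about attributes of models that use PretrainedConfig #4264

Fixed PretrainedConfig adaptation in ErnieEncoder #4281

Fixed the config name issue when the IE taskflow loads a custom model #4271

Fixed a Trainer fp16 issue #4283

Fixed the prefix_dropout parameter in AutoTemplate #4293

Fixed config download hanging in multi-GPU runs #4274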

v2.4.8(Dec 26, 2022)
New Features

PPDiffusers

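Added the BIT and DPT models #4202

Supports deployment on Ascend NPU #4217

Revised and upgraded the ppdiffusers pipelines documentation, added task showcases, and organized the description of ppdiffusers capabilities; added the audio diffusion inference script and weights; added the paint by example inference script and weights #4230

Bug Fix

Fixed AutoModel loading of legacy config and standard config #4083

Fixed backward-compatibility issues introduced by PretrainedConfig #4237

Fixed the GPT model's handling of input_embeds #4179

Fixed the save issue in multi-GPU Trainer runs #4220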

v2.4.7(Dec 23, 2022)
New Features

Sentiment Analysis

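Sentiment analysis: #3694

Provides the uie-senta series of trained models, supporting common sentiment analysis capabilities such as sentence-level sentiment classification, aspect extraction, and opinion extraction

Supports the whole path from input data to visualization of sentiment analysis results, helping business data analysis

Customizable sentiment analysis that addresses synonymous-aspect aggregation and implicit opinion extraction

UIE

UIE Taskflow supports loading from HF Hub

TextClassification

The text classification Taskflow supports multi-label prediction #3968

FastTokenizer

Fixed a bug in instantiating FastTokenizer BertNormalizer from JSON

Bug Fix

Fixed AutoModel loading of legacy config and standard config #4083

Others

Migrated Ernie/ErnieM/ErnieLayout/Bart/MBart/Unified_Transformer/Unimo/CodeGen and other models to PretrainedConfig #4118 #3769 #4170

Wrote the contribution guide contributing.md #4160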

v2.4.5(Dec 9, 2022)
New Features

UIE

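Added UIE-X end-to-end document extraction, callable in one line through Taskflow, with an industrial-grade full-workflow solution for annotation, fine-tuning, and deployment. #3951

Machine Translation

Added data download and a complete data preprocessing pipeline, plus a custom dataset interface and documentation #3269

PPDiffusers

Added code for computing FID and CLIP Score #3860

Released ppdiffusers 0.6.3 #3963

Released ppdiffusers 0.9.0 #3919 #4017 #4018

Added FastDeploy deployment for ppdiffusers #3813

Core infrastructure

Model and Tokenizer support uploading to the Huggingface Hub with one line of code, .save_to_hf_hub() #3982

Others

Added a community-contributed Jupyter Lab Extension to the CodeGen documentation #4002

Upgraded the code style tools to isort, black, and flake8, with large-scale automatic style formatting #3925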

v2.4.4(Nov 28, 2022)
New Features

Prompt API

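Added MaskedLMVerbalizer, supporting an implementation of the PET algorithm. #3889

FastTokenizer

Added CLIP FastTokenizer #3805

PPDiffusers

Converts LDM weights in PPDiffusers format to the original LDM weight format. #3809

Updated the attention implementation in Unet. #3792

Pipelines

Pipelines adds support for ERNIE-Search #3906

Core infrastructure

Added a dedicated Text Classification Taskflow #3841

Added a Fill-Mask (cloze) Taskflow that loads directly from the Huggingface Hub #3870

AutoModel and AutoTokenizer support loading directly from the Huggingface Hub #3786

Dialogue Taskflow supports loading directly from the Huggingface Hub #3865

Added PaddleNLP SimpleServing, for quick deployment of Taskflow and pre-trained models #2845

Bug Fix

Fixed from_pretrained_v2 being unable to load FP16 models. #3902

Fixed Template being unable to form batches when the options keyword is used #3889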

v2.4.3(Nov 18, 2022)
New Features

Prompt API

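Template String adds support for the keywords prefix and options, and adds 7 attributes including position, token_type, length, encoder, and hidden_size #3724

Added support for PrefixTemplate

Removed the restriction that InputExample and InputFeatures placed on input data keywords

Question answering

Added unsupervised question answering pipelines, with a pipeline run example and documentation #3605

Added the nodes QAFilter, AnswerExtractor, QuestionGenerator, AnswerExtractorPreprocessor, and QAFilterPostprocessor

Added the pipeline QAGenerationPipeline

FastAPI backend code that connects the ElasticSearch ANN retrieval store, QAGenerationPipeline, and SemanticSearchPipeline

A web visualization system for unsupervised question answering with the following functions: question answering retrieval, online generation of question-answer pairs, online index updates, file upload with automatic generation and loading of question-answer pairs, optional filtering of generated question-answer pairs, and configurable number of returned answers and maximum number of retrieved items

Trainer

Added sharding support; sharding stage1 and stage2 are currently supported. #3352

Added bf16 training support for single-GPU and multi-GPU training. Improved pure_fp16 training support.

Added IterableDataset support, so Iterable datasets can be passed in.

Added Seq2SeqTrainer for training seq2seq tasks.

FasterGeneration

Removed the restriction that the Transformer FFN intermediate hidden size be 4 times d_model, and added loading a model by importing model_state #3592

FastTokenizer

AutoTokenizer adds the use_fast parameter, which selects fast_tokenizer for high-performance tokenization. The option is currently available for ERNIE, BERT, TinyBert, and ERNIE-M. #3746

Released the official 1.0.0 version of the high-performance tokenization tool FastTokenizer, including a precompiled C++ package and a Python package #3762

Core infrastructure

UNIMO adds an option to return intermediate outputs and supports passing labels to compute the loss automatically #3450

CodeGen adds an option to return intermediate outputs and supports passing labels to compute the loss automatically #3465

UnifiedTransformer adds an option to return intermediate outputs and supports passing labels to compute the loss automatically #3459

BART adds an option to return intermediate outputs and supports passing labels to compute the loss automatically #3436

MBART adds an option to return intermediate outputs and supports passing labels to compute the loss automatically #3436

T5 supports passing encoder & decoder embedding results directly as input #3668

Added the paddlenlp cli tool #3538

Added unit tests for 7 P1-level models #3462

UIE

Added UIE quantization-aware training and deployment #3496

Neural Search

Added Gradient Cache and Recompute to support training with a very large batch size on a single GPU. #3697

Text Classification

Added multi-label text classification based on semantic indexing. #3656

Added word-level and sentence-level interpretability analysis #3385

Fixed issues related to text classification deployment #3765

Updated the multi-class implementation to use the Trainer API #3679

PPDiffusers

Renamed diffusers_paddle to ppdiffusers. #3601

Fixed bugs to support Chinese Stable Diffusion and released ppdiffusers 0.6.1. #3663

Released ppdiffusers 0.6.2 #3737

Added a laion400m text-to-image training script. #3693 #3772

Supports EulerAncestralDiscreteScheduler and DPMSolverMultistepScheduler #3708 #3764

Added FID computation code. #3685

Added an LDM super-resolution pipeline. #3710

Added usage code for the ppdiffusers inference pipelines. #3759

Added the ppdiffusers CD workflow #3604

Bug Fix

Fixed abnormal prediction results from FasterEncoder #3606

Fixed GPU memory allocation for FasterGeneration PrefixLM models under the beam search decoding strategy #3662

Fixed failures downloading community models on Windows #3670 #3640

Pipelines: fixed files being uploaded more than once. #3568

Pipelines: fixed errors when parsing Word documents. #3645

Pipelines: fixed errors in batch prediction. #3712

Fixed bugs related to question generation templates. #3646

Dynamic-to-static conversion for gpt in TIPC. #3586

Added CLIPText and CLIPVision to auto/modeling so they can be loaded with AutoModel, and changed CLIP's default NEG INF to -1e4 so that fp16 O2 no longer misbehaves. #3789

Fixed the configuration of the automated pypi release workflow #3626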
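
The CLIP fix in the list above sets the default NEG INF to -1e4 for fp16. A minimal standard-library sketch of why a masking constant must fit the half-precision range follows. The previous constant is not stated in these notes, so -1e9 is used purely as a hypothetical larger-magnitude example, and `fits_in_fp16` is an illustrative helper.

```python
import struct


def fits_in_fp16(value):
    """Return True if value can be stored as a finite IEEE 754 half float."""
    try:
        # The "e" format code packs a 16-bit half-precision float.
        struct.pack("<e", value)
        return True
    except (OverflowError, struct.error):
        return False


small_mask = fits_in_fp16(-1e4)   # the value named in the release note
large_mask = fits_in_fp16(-1e9)   # hypothetical larger constant, for contrast
print(small_mask, large_mask)
```

Half precision tops out at 65504 in magnitude, so a constant like -1e4 stays finite while much larger ones overflow, which can turn attention scores into infinities or NaNs.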

v2.4.2(Oct 27, 2022)
New Features

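Text summarization application

Added a Pegasus Chinese text summarization application, callable in one line through Taskflow, with FasterGeneration high-performance inference and the full path of training, inference, and deployment connected. #3275

Question generation

Added a question generation solution that provides general-purpose question generation pre-trained models based on UNIMO-Text and T5, callable in one line through Taskflow, with FasterGeneration high-performance inference and the full path of training, inference, and deployment connected. #3410 #3438 #3560

Machine Translation

FasterMBart supports dynamic-to-static export #3367 #3356

Upgraded and refactored the MBart tokenizers to support all features of the latest tokenizer #3323

Separated MBartTokenizer and MBart50Tokenizer; MBart50Tokenizer supports AutoTokenizer, and both MBartTokenizer and MBart50Tokenizer support custom sentence piece parameters #3323

Pipelines

Added a DocPrompt example #3542 #3534

Added ERNIE Vilg text-to-image generation. #3512

Taskflow

Improved the experience of using custom models with Taskflow and added an update check for model parameter files. #3506

Bug Fix

Fixed MBart restricting the languages the model itself can translate #3356

Fixed CodeGen not using token type ids during generation #3348

Fixed errors in CodeGen's automatically generated attention mask #3348

Fixed T5 decoding errors when use_cache=False #3115

Fixed the bug where the text summarization taskflow could not load custom models #3533

Fixed a bug in question generation prediction #3524

Fixed the undefined result variable in utils.py of the uie training code #3490

FAQ Finance: fixed a Paddle Serving bug on Windows. #3491

Fixed Pipelines docx parsing when text and images appear in the same paragraph. #3546

Corrected the data description for text classification based on semantic indexing. #3551

Others

Added gated-silu support to T5 #3115

Upgraded T5Tokenizer to support the latest PaddleNLP features #3115

Added 4D attention mask support to T5 #3115

Added support for T5 returning results as a dictionary #3370

FasterGeneration can be compiled with PaddlePaddle 2.4.0-rc0 and later #3545

UnifiedTransformer supports automatically generating position_ids, token_type_ids, attention mask, and so on #3177

UNIMO-Text supports automatically generating position_ids, token_type_ids, attention mask, and so on #3349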

v2.4.1(Oct 14, 2022)
New Features

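ERNIE-Layout large model for document intelligence

Added ERNIE-Layout, a multilingual cross-modal document pre-trained model, together with a benchmark and fine-tuning and deployment examples for various downstream tasks based on ERNIE-Layout. #3183

Added DocPrompt, a document extraction question answering model, callable in one line through Taskflow. #3183

Pipelines updates

Added a Docker cuda11.2 image and a tutorial for building the Docker image. #3315

Added batch data processing to Pipelines. #3432

Added several FAQs from user feedback and improved the README documentation. #3237

Added support for Milvus 2.1. #3283

Question Generation

Added a question generation example covering Chinese and English scenarios. #3410

Added a question generation taskflow. #3438

Compression API

The compression API supports NLU models such as ERNIE, ERNIE-M, BERT, TinyBERT, and ELECTRA. #3234 #3324

The DynaBERT width-adaptive pruning strategy supports distributed training. #3361

Prompt API

Added usage documentation for the Prompt API. #3362

Bug Fix

Fixed broken links in few-shot text classification and a data type problem during inference on Windows. #3339 #3426

Upgraded Milvus to version 2.1 in FAQ Finance and improved the documentation. #3267 #3430

Simplified the code for retrieval-based text classification and improved its README. #3322

Improved the Neural Search documentation. #3350

Fixed a possible out-of-memory problem when the UIE Dataloader loads data. #3381

Fixed an import error in the DuEE sequence labeling code. https://github.com/PaddlePaddle/PaddleNLP/pull/2853

Fixed Pillow warnings. https://github.com/PaddlePaddle/PaddleNLP/pull/3404 and https://github.com/PaddlePaddle/PaddleNLP/pull/3457

Updated the activation function of the artist model and fixed warnings in dallebart, https://github.com/PaddlePaddle/PaddleNLP/pull/3106

Fixed missing model name types in the Ernie tokenizer https://github.com/PaddlePaddle/PaddleNLP/pull/3423

Fixed a bug in the Bert unit tests that CI did not catch https://github.com/PaddlePaddle/PaddleNLP/pull/3422

Fixed the OrderedDict data type not being supported during dynamic-to-static conversion https://github.com/PaddlePaddle/PaddleNLP/pull/3364

Fixed bigru_crf inference hanging at random. https://github.com/PaddlePaddle/PaddleNLP/pull/3418

Others

Added the Stable Diffusion licence https://github.com/PaddlePaddle/PaddleNLP/pull/3210

Updated the WeChat group QR code in the documentation. https://github.com/PaddlePaddle/PaddleNLP/pull/3284

Processor and FeatureExtractor support from_pretrained and save_pretrained https://github.com/PaddlePaddle/PaddleNLP/pull/3453

Added unit tests for T5EncoderModel https://github.com/PaddlePaddle/PaddleNLP/pull/3376

Added multiple input/output support and unit test code for 9 models https://github.com/PaddlePaddle/PaddleNLP/pull/3305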

v2.4.0(Sep 6, 2022)
New Features

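NLP Pipelines toolkit

PaddleNLP Pipelines aims to make it faster to bring NLP models into production. It abstracts the common modules of complex NLP systems into standard components that can be quickly combined into complex NLP applications.

#3003 #3160 #3135 #3092 #3186

Pluggable component design

Flexible node configuration for document stores, with support for the high-performance vector search engines Faiss and Milvus

Configurable document-level preprocessing nodes, with information extraction from PDF and image documents

Quick chaining of PaddlePaddle SOTA models

Supports PaddlePaddle's Chinese SOTA pre-trained models; the lightweight ERNIE 3.0 series can be quickly integrated into Pipelines

Supports the RocketQA semantic indexing models, which quickly improve semantic indexing and FAQ systems

Low-barrier one-click deployment

The RocketQA DuReader semantic extraction model can be called in one line; general scenarios need no semantic model training

One-click deployment with either Docker or Docker Compose, reducing environment setup cost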

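Industry example library upgrades

Text classification

End-to-end text classification application that supports pre-trained model, few-shot, and semantic indexing approaches, with TrustAI for fast model tuning. #3087 #3184 #3104 #3180 #2956 #3011

Full coverage of text classification approaches

Supports multi-class, multi-label, and hierarchical classification algorithms

Supports pre-trained model fine-tuning, few-shot prompt tuning, and semantic-indexing-based classification

Base models cover the full ERNIE 3.0 series to suit different usage scenarios

Efficient model tuning

The TrustAI model interpretability tool quickly locates sparse-data and dirty-data problems to further improve model quality

Integrated data augmentation tools offer multiple augmentation methods for quickly augmenting sparse data

Industrial-grade end-to-end solution

Covers the full workflow of data annotation, model training, model compression, and prediction deployment

Information extraction

Added the multilingual models UIE-M-Base and UIE-M-Large, which support mixed Chinese-English extraction and one-line Taskflow calls. #3192

Added a UIE data distillation solution based on the closed-domain model GlobalPointer, with one-click deployment through Taskflow. #3136

Semantic indexing

Added the RocketQA CrossEncoder model, which can be loaded into Pipelines. #3196

Replaced the Neural Search recall model with the ERNIE 3.0-based RocketQA model, which can be loaded into Pipelines. #3172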

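AIGC content generation

CodeGen code generation

PaddleNLP 2.4 releases the full series of CodeGen SOTA code generation models, callable in one line. #2641 #2754 #3017

Leading results

Integrates the SOTA code generation model CodeGen

Integrates 12 CodeGen models of different sizes, including models for multiple programming languages

Easy to use

Models can be called through GitHub Copilot as well as in one line through Taskflow

FasterGeneration provides high-performance inference with millisecond-level response

Text-to-image generation

Text-to-image generation is currently an important direction in AIGC. PaddleNLP 2.4 releases a number of text-to-image models that can be tried out with a one-line call.

#2917 #2968 #2988 #3040 #3072 #3118 #3198

Many popular text-to-image models

Supports DALL-E-mini, CLIP + Disco Diffusion, CLIP + Stable Diffusion, ERNIE-ViL + Disco Diffusion, and other models

Easy to use

Text-to-image models can be called in one line through Taskflow

FasterGeneration provides high-performance inference, removing the performance bottleneck in text-to-image generation

Text summarization

Text summarization is a frequent NLP use case. This release adds a Chinese text summarization application with support for custom training. #2971

Added a text summarization Application that supports custom training, integrates high-performance inference deployment, and can be called in one line through Taskflow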

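Framework upgrades

Compression API for automatic model compression

Added a model compression API that supports PaddleSlim-based pruning and static offline quantization, quickly accelerating text classification, semantic matching, sequence labeling, and reading comprehension tasks. #2777

The compression API gives quick access to model pruning and quantization, greatly lowering the cost of using model compression

Prompt API for few-shot learning

Added a Prompt Learning training framework that supports quick implementations of classic models such as PET, P-Tuning, and RGL. #2894

The Prompt Learning training framework quickly improves few-shot training results in text classification scenarios. #2894

Transformers pre-trained models

Base API

Model interfaces such as BERT, ERNIE, and RoBERTa can now return attention scores and the outputs of all intermediate layers, which makes needs such as distillation easy to meet. #2665

Model interfaces such as BERT, ERNIE, and RoBERTa now accept a past_key_values input, which can be used for prefix-tuning. #2801

Model interfaces such as BERT, ERNIE, and RoBERTa now accept labels as input and return the loss, which simplifies usage: there is no need to split out the labels or define a separate loss function. #3013

Model interfaces such as BERT, ERNIE, and RoBERTa can now return outputs as a dict, making it clearer to pick the required outputs from the result. #2665

Systematically completed unit tests for pre-trained model interfaces in bulk to ensure functional stability

Model weights

Added the XLM model. https://github.com/PaddlePaddle/PaddleNLP/pull/2080

Converted the Langboat/mengzi-t5-base-mt weights and added a zero-shot usage example. https://github.com/PaddlePaddle/PaddleNLP/pull/3116

Added RoFormer-sim, which supports paraphrase generation and can generate similar sentences for data augmentation. https://github.com/PaddlePaddle/PaddleNLP/pull/3049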

Bug Fix

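Added the model_max_input_size configuration field to models in bulk. #3127

Fixed a core dump during sampling decoding for some FasterGeneration models. #2561

Fixed a generation error in UNIMOText when the acceleration feature is not used. #2877

Fixed unstable FasterGeneration performance under sampling-based decoding strategies. https://github.com/PaddlePaddle/PaddleNLP/pull/2910

Fixed an error when the BART tokenizer retrieves bos_token_id. https://github.com/PaddlePaddle/PaddleNLP/pull/3058

Fixed the BART tokenizer being unable to set model_max_length. https://github.com/PaddlePaddle/PaddleNLP/pull/3018

Fixed a dtype-related prediction failure in Taskflow text similarity on Windows. https://github.com/PaddlePaddle/PaddleNLP/pull/3188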

Others

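FasterGPT now supports unshared weights for word_embeddings and lm_head.decoder_weight. #2953

Refactored RoFormer and added the RoFormerForCausalLM class, which supports roformer-sim similar-sentence generation. #3049

Updated the ERNIE model: type_vocab_size=0 now means that token_type_id is not used. #3075

Added a benchmark for the ERNIE-Tiny model. #3100

Updated the mixed-precision configuration for BERT pre-training: the AMP level is now O2. #3080

FasterBART now supports dynamic-to-static conversion and high-performance inference. https://github.com/PaddlePaddle/PaddleNLP/pull/2519

The FasterGeneration joint build with the inference library now supports pulling in the ONNX dependency. #3158

The Generation API now supports logits_processor and get_decoder_start_token_id(). #3018

BART models now support the get_input_embeddings() and set_input_embeddings() methods for accessing the embeddings. #3133

GPT models now support get_vocab(), a 0/1 attention mask, adding a bos token, and other new interface features. #2463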

New Contributors

@Spico197 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2170

@sandyhouse made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2190

@qingqing01 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2188

@RicardoL1u made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2299

@Intsigstephon made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2285

@sljlp made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2398

@zche4846 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/1845

@tianberg made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2461

@lidanqing-intel made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2468

@fightfat made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2499

@LiYuRio made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2504

@FeixLiu made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2523

@ArtificialZeng made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2537

@freeliuzc made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2543

@taixiurong made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2556

@westfish made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2423

@sneaxiy made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2660

@lastrei made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2671

@WenmuZhou made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2695

@littletomatodonkey made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2732

@piotrekobi made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2730

@Liujie0926 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2829

@buchongyu2 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2817

@GuoxiaWang made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2846

@zhiyongLiu1114 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2875

@veyron95 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2879

@BasicCoder made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2977

@dongfangshenzhu made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/3046

@Haibarayu made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2694

Full Changelog: https://github.com/PaddlePaddle/PaddleNLP/compare/v2.3.0...v2.4.0
v2.3.7(Aug 24, 2022)
New Features

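Added RocketQA recall models based on ERNIE 3.0: rocketqa-zh-base (12-layer, 768-hidden), rocketqa-zh-medium (6-layer, 768-hidden), rocketqa-zh-mini (6-layer, 384-hidden), rocketqa-zh-micro (4-layer, 384-hidden), and rocketqa-zh-nano (4-layer, 312-hidden). These 5 semantic retrieval recall models achieve the best Chinese results on the DuReader Retrieval dataset. #3033

Added RocketQA ranking models based on ERNIE 3.0: rocketqa-base (12-layer, 768-hidden), rocketqa-medium (6-layer, 768-hidden), rocketqa-mini (6-layer, 384-hidden), rocketqa-micro (4-layer, 384-hidden), and rocketqa-nano (4-layer, 312-hidden). These 5 semantic retrieval ranking models achieve the best Chinese results on the DuReader Retrieval dataset. #3019

Added VI-LayoutXLM, a multimodal document model whose inference speed and accuracy surpass LayoutXLM. #2935

Added lightweight RocketQA models to the NLP pipeline system Pipelines, significantly improving end-to-end response speed. #3078

Unit Test

Added unit tests for the ERNIE-Gram model. https://github.com/PaddlePaddle/PaddleNLP/pull/3059

Added unit tests for the TinyBERT model. https://github.com/PaddlePaddle/PaddleNLP/pull/2992

Added unit tests for the RoFormer model. https://github.com/PaddlePaddle/PaddleNLP/pull/2991

Added unit tests for the ERNIE-M model. https://github.com/PaddlePaddle/PaddleNLP/pull/2964

Added unit tests for the SKEP model. https://github.com/PaddlePaddle/PaddleNLP/pull/2941

Added unit tests for the ELECTRA and XLNet models. https://github.com/PaddlePaddle/PaddleNLP/pull/3031

Added unit tests for the RoBERTa, ALBERT, and ERNIE models. https://github.com/PaddlePaddle/PaddleNLP/pull/2972

Bug Fix

Fixed an error when the BART tokenizer retrieves bos_token_id. #3058

Fixed the BART tokenizer being unable to set model_max_length. #3018

Fixed an error raised by the random question generation button in Pipelines and an issue where a search fell back to the previous search result. #2954

Fixed an issue in Pipelines caused by extracting vectors with FAISS on Python 3.7. #2965

Fixed a Tokenizer resize-token-embeddings error. #2763

Fixed the OPT example code. #3064

pointer_summarizer now supports XPU and multi-card training. #2963 #3004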

New Contributors

@veyron95 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2879

@BasicCoder made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2977

@dongfangshenzhu made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/3046

@Haibarayu made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2694

Full Changelog: https://github.com/PaddlePaddle/PaddleNLP/compare/v2.3.5...v2.3.7
v2.3.5(Aug 1, 2022)
New Features

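Code generation

CodeGen can be called in one line through Taskflow. #2754

Added CodeGen usage documentation. #2791

UIE

Added an English version of UIE, callable in one line through Taskflow. #2855

Neural Search

Added C++ and pipeline deployment for the ranking model. #2721

Added evaluation during training for in-batch negatives. #2663

Few-shot learning

Added an implementation of the few-shot model RGL. #2651

Text classification

Added multi-class and multi-label applications. #2661 #2675

Data augmentation strategies. #2805

Text matching

Added DiffCSE, an unsupervised semantic vector model. #2643

Bug Fix

Fixed max_seq_len not being passed in Pipelines. #2736

Fixed the faiss-cpu dependency of Pipelines and added an FAQ entry on handling garbled text. #2709

Fixed inconsistent Neural Search prediction results caused by dropout and added an FAQ entry on ANN indexing. #2710

Fixed a get_offset_mapping error in the ERNIE tokenizer. #2857 #2897

Fixed native UNIMOText generation failing because of intermediate model outputs. #2877

Others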

New Contributors

@lastrei Add pet static model export script and inference code #2875

@zhiyongLiu1114 Add the get_special_token_mask for the ernie tokenizer #2671 #2690

Full Changelog: https://github.com/PaddlePaddle/PaddleNLP/compare/v2.3.4...v2.3.5
v2.3.4(Jun 28, 2022)
New Features

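Taskflow

Added three small UIE models: UIE-Mini (6-layer, 384-hidden), UIE-Micro (4-layer, 384-hidden), and UIE-Nano (4-layer, 312-hidden). #2604

Added WordTag-IE, an information extraction tool based on Chinese word-class knowledge. #2540

More pre-trained models

Open-sourced the ERNIE Tiny pre-trained models, which outperform open-source Chinese models of the same size from HFL, UER, and Huawei-Noah in both quality and accuracy.

Added the CodeGen code generation model. #2641

Basic usability improvements

Trainer supports three learning rate scheduling strategies: constant, cosine, and linear. #2511

FasterBART supports dynamic-to-static conversion and inference. #2519

FasterGeneration supports compilation against an inference library built with ONNX. #2463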
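
To make the three Trainer learning rate strategies mentioned above concrete, here is a minimal pure-Python sketch of how constant, linear, and cosine schedules are commonly defined. It is not PaddleNLP's implementation: the function name lr_at, the warmup_steps parameter, and the exact formulas are assumptions chosen for illustration.

```python
import math


def lr_at(step, total_steps, base_lr, warmup_steps=0, schedule="linear"):
    """Return the learning rate at `step` (hypothetical helper, not a PaddleNLP API)."""
    if step < warmup_steps:
        # Assumed linear warmup from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    if schedule == "constant":
        return base_lr
    # Fraction of the post-warmup phase that has elapsed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    if schedule == "linear":
        # Straight line from base_lr down to 0.
        return base_lr * (1.0 - progress)
    if schedule == "cosine":
        # Half cosine wave from base_lr down to 0.
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule}")
```

Linear decay lowers the rate at a fixed pace, while cosine decay keeps it higher early on and drops faster near the end; which one works better depends on the task.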

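CLUE Benchmark

Supports training, evaluation, and prediction on 10 CLUE tasks, lets users produce prediction results for submission to the CLUE leaderboard, and provides a Grid Search tool for one-click training that returns the best evaluation result.

Text classification

Added multi-label hierarchical classification. #2501

Added a prediction deployment workflow for the ERNIE-Doc model on classification tasks. #1845

Community models

Added the XLM model. #2080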

Bug Fix

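Fixed an evaluation issue with nested entities of the same category in UIE. #2558

Fixed overlapping offsets between the prompt and the text when the UIE prompt is in English. #2453

Fixed an error when calling get_offset_mapping on the BERT Tokenizer. #2508

Fixed a core dump during sampling decoding for some FasterGeneration models. #2561

Fixed potential issues in from_pretrained of PretrainedTokenizer and PretrainedModel. #2521 #2578 #2424

Fixed an error when saving LukeTokenizer caused by a missing field. #2631

Fixed an expected-parameter issue in ChineseBertTokenizer caused by an update to the Tokenizer mechanism. #2625

Fixed PretrainedTokenizer special token settings being overwritten or omitted. #2534 #2629

Fixed the missing ALBERT pad token id. #2495

Fixed a checkpoint loading error in ERNIE-1.0 pre-training with AMP O2. #2479

Removed the is_init_py attribute from RandomGenerator. #2658

Others

BERT supports fusion with fused_ffn and fused_attention. #2523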

Full Changelog: https://github.com/PaddlePaddle/PaddleNLP/compare/v2.3.3...v2.3.4
v2.3.3(Jun 7, 2022)
Bug Fix

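Fixed an AutoModel model selection bug that caused loading models such as ernie-1.0 from a local directory to fail. #2426

Fixed a file check bug that caused loading a tokenizer from a local directory to fail. #2424

Fixed the output type of Taskflow dependency parsing. #2422

Fixed the split check in the doccano annotation conversion script for UIE, and improved how Task reports errors when predicting with ONNX. #2417

Fixed the misspelling of data in the code. #2410

Fixed the UIE link in PaddleNLP/README. #2419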

Full Changelog: https://github.com/PaddlePaddle/PaddleNLP/compare/v2.3.2...v2.3.3
v2.3.2(Jun 2, 2022)
New Features

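Faster inference deployment

UIE inference acceleration: high-performance inference for UIE models on CPU and GPU devices, significantly improving UIE inference speed.

ERNIE 3.0 models support service deployment with Triton Inference Server.

More pre-trained models

Added 4 Chinese models in the Wenxin ERNIE 3.0 series: 3 small models, ERNIE 3.0-Mini (6-layer, 384-hidden), ERNIE 3.0-Micro (4-layer, 384-hidden), and ERNIE 3.0-Nano (4-layer, 312-hidden), and one 20-layer model, ERNIE 3.0-XBase (20-layer, 1024-hidden).

Open-sourced the ERNIE 2.0 Chinese models: ERNIE 2.0-Base (12-layer, 768-hidden) and ERNIE 2.0-Large (24-layer, 1024-hidden).

Basic usability improvements

ERNIE-M models support multiple-choice reading comprehension tasks.

Added dynamic-to-static conversion support for XLNet models.

Improved BART Tokenizer compatibility.

Community models

Added the GAU-alpha community model.

Bug Fix

Fixed the missing do_lower_case attribute in ElectraTokenizer. #2263

Fixed a logging bug when evaluating the CHID task in CLUE Benchmark. #2298

Fixed data type errors on Windows in the semantic retrieval Application and FAQ System. #2381

Fixed an error when loading ERNIE models with AutoTokenizer. #2315

Fixed a dict_keys error raised by the load_dataset function. #2364

Fixed a data type error on Windows in the text generation example. #2351

Fixed an ERNIE 3.0 ONNX Runtime inference bug. #2386

Fixed a padding issue for 1-D arrays in DDParser. #2333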

New Contributors

@RicardoL1u made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2299

@Intsigstephon made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2285

@sljlp made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2398

Full Changelog: https://github.com/PaddlePaddle/PaddleNLP/compare/v2.3.1...v2.3.2
v2.3.1(May 19, 2022)
Improvements

GPT-3 supports generation inference under static-graph hybrid parallelism. https://github.com/PaddlePaddle/PaddleNLP/pull/2188 https://github.com/PaddlePaddle/PaddleNLP/pull/2245

BugFix

Added an example of running a semantic retrieval system in one click with the FAISS ANN engine. https://github.com/PaddlePaddle/PaddleNLP/pull/2180

Fixed an error when running the PaddleNLP intelligent text pipeline example on CPU. https://github.com/PaddlePaddle/PaddleNLP/pull/2201

Fixed a GPT compilation error. https://github.com/PaddlePaddle/PaddleNLP/pull/2191

Fixed the max_seq_len parameter not being passed to the GPT pre-training data pipeline. https://github.com/PaddlePaddle/PaddleNLP/pull/2192

Fixed a pre-training error with GPT-3 static-graph hybrid parallelism. https://github.com/PaddlePaddle/PaddleNLP/pull/2190 https://github.com/PaddlePaddle/PaddleNLP/pull/2223 https://github.com/PaddlePaddle/PaddleNLP/pull/2195

Fixed NPTag decoding errors caused by an incompatible tokenizer upgrade. https://github.com/PaddlePaddle/PaddleNLP/pull/2199

Fixed the Taskflow UIE schema being built repeatedly. https://github.com/PaddlePaddle/PaddleNLP/pull/2170

Added compatibility with multiple doccano export formats when converting data for NER annotation tasks. https://github.com/PaddlePaddle/PaddleNLP/pull/2187

Fixed an NPTag decoding issue. https://github.com/PaddlePaddle/PaddleNLP/pull/2233

Fixed a max_seq_len error in DuUIE. https://github.com/PaddlePaddle/PaddleNLP/pull/2207

Fixed encoding errors on Windows when the system default encoding is not UTF-8. https://github.com/PaddlePaddle/PaddleNLP/pull/2209

Fixed an import error for AlbertForQuestionAnswering. https://github.com/PaddlePaddle/PaddleNLP/pull/2216

Fixed the format of CLUE Benchmark prediction results. https://github.com/PaddlePaddle/PaddleNLP/pull/2215

Fixed dead links. https://github.com/PaddlePaddle/PaddleNLP/pull/2231 https://github.com/PaddlePaddle/PaddleNLP/pull/2230 https://github.com/PaddlePaddle/PaddleNLP/pull/2235 https://github.com/PaddlePaddle/PaddleNLP/pull/2240 https://github.com/PaddlePaddle/PaddleNLP/pull/2241

New Contributors

@Spico197 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2170

@sandyhouse made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2190

@lugimzzz made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2196

@qingqing01 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2188

Full Changelog: https://github.com/PaddlePaddle/PaddleNLP/compare/v2.3.0...v2.3.1
v2.3.0(May 16, 2022)
New Features

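UIE: universal information extraction

Added UIE (Universal Information Extraction), a general open-domain information extraction framework based on unified structure generation. A single model supports tasks such as named entity recognition, relation extraction, event extraction, and sentiment analysis. Two model sizes, base and tiny, are available to meet different business needs, and both support one-line prediction through Taskflow.

Added UIE-Medical, an information extraction model for the medical domain. It supports two tasks, medical named entity recognition and medical relation extraction, supports few-shot learning, and delivers industry-leading prediction accuracy.

Wenxin large NLP model upgrades

Added lightweight versions of the Wenxin large model ERNIE 3.0: two Chinese models, ERNIE 3.0-Base (12 layers) and ERNIE 3.0-Medium (6 layers), which achieve the best Chinese results among models of the same size on CLUE Benchmark.

Added ERNIE-Health, a pre-trained model for the Chinese medical domain. It supports 5 categories of tasks: medical text information extraction (entity recognition and relation extraction), medical term normalization, medical text classification, medical sentence relation judgment, and medical question answering. A CBLUE benchmark usage example is provided.

Added PLATO-XL (11B), the world's first pre-trained dialogue generation model at the ten-billion-parameter scale. FasterGeneration provides high-performance GPU acceleration, with inference 2.7 times faster than in the previous version. For more usage details, see PLATO-XL with FasterGeneration.

FasterGeneration: high-performance generation acceleration

FasterGeneration received the following upgrades in this release. For more usage details, see the FasterGeneration documentation.

Faster

Finer-grained fusion acceleration: Context computation in the UnifiedTransformer and UNIMOText models is now accelerated, making them 20% to 110% faster than in the previous version

Broader model support: the supported range of size_per_head has been extended, enabling generation acceleration for large models such as CPM-Large (2.6B) and PLATO-XL (11B)

Faster large-model inference: Tensor-parallel and Pipeline-parallel inference are supported. On CPM-Large, 4-card Tensor parallelism is 40% faster than single-card high-performance generation; on PLATO-XL, 4 cards give a 2x speedup over a single card

Less GPU memory

Reduced GPU memory use when loading and converting models; FP16 models can be used directly, and the original unfused QKV weight parameters can be removed

Easier deployment

Added a parameter for using the Encoder acceleration directly, connecting Encoder acceleration with Decoding acceleration

More accelerated models, such as UnifiedTransformer and UNIMOText, can be exported to static graphs for high-performance deployment with Paddle Inference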

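More industry examples and application scenarios

Added an intelligent question-answering example for car manuals. Built on Baidu's leading open-domain QA technology RocketQA and the multimodal multilingual pre-trained model LayoutXLM, it provides an application example and best practices for multimodal document question answering.

Added an intelligent voice command parsing example, widely applicable to scenarios such as voice form filling, voice interaction, voice retrieval, and voice wake-up for mobile apps, improving human-computer interaction efficiency.

Added an end-to-end intelligent question-answering system example for building a visual QA system quickly and at low cost.

Added an end-to-end semantic retrieval system example for building a semantic retrieval system quickly and at low cost.

Added an NLP model interpretability example #1752. Thanks to @binlinquge for the contribution.

Added CLUE Benchmark evaluation scripts, which give a fuller picture of how PaddleNLP's Chinese pre-trained models perform and help developers choose a Chinese model.

Added Graphcore IPU support for BERT static-graph training #1793. For more details, see BERT IPU. Thanks to @gglin001 for the contribution.

More pre-trained models

Added 300+ important model weights covering architectures such as BERT, GPT, and T5. PaddleNLP now offers 500+ selected pre-trained models.

Added the FNet model #1499. Thanks to @HJHGJGHHG for the contribution.

Added the ProphetNet model #1698. Thanks to @d294270681 for the contribution.

Added the Megatron-LM model #1678. Thanks to @Beacontownfc for the contribution.

Added the LUKE model #1677. Thanks to @Beacontownfc for the contribution.

Added the RemBERT model #1701. Thanks to @Beacontownfc for the contribution.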

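Trainer API

Added the Trainer API, which simplifies model training code and standardizes training configuration. It supports visualizing training logs with VisualDL and improves experiment reproducibility. https://github.com/PaddlePaddle/PaddleNLP/pull/1761 For a quick start with the Trainer API, see the tutorial.

Data API

Compatible with HuggingFace Datasets: datasets returned by its load_dataset can be used directly (importing paddlenlp before datasets is recommended).

Added Data Collators for common tasks, such as DataCollatorWithPadding and DataCollatorForTokenClassification, to simplify data processing.

Tokenizer additions and changes:

Custom special tokens can be saved and loaded

More padding modes: fixed-length padding, padding to the longest sequence, and padding to a multiple of a given number

The maximum input length for single sentences and for sentence pairs can be retrieved

Paddle Tensor data can be returned

IMPORTANT NOTE: for batch input, the default output format changed from a list of dicts to a dict of lists (the dict is a BatchEncoding object); this can be controlled with return_dict

IMPORTANT NOTE: the format of the content saved by save_pretrained changed (compatibility is preserved; previously saved content still works)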
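
The three padding modes listed above (fixed length, longest sequence in the batch, and a multiple of a given number) can be illustrated with a small pure-Python sketch. This is not the PaddleNLP Tokenizer implementation: the function pad_batch and its parameters mode, max_length, and multiple_of are hypothetical names used only to show the idea.

```python
def pad_batch(batch, pad_id=0, mode="longest", max_length=None, multiple_of=None):
    """Pad a batch of token-id lists (hypothetical helper, not a PaddleNLP API)."""
    if mode == "longest":
        # Pad every sequence to the longest one in this batch.
        target = max(len(seq) for seq in batch)
    elif mode == "max_length":
        # Pad every sequence to a fixed, caller-supplied length.
        if max_length is None:
            raise ValueError("max_length is required for mode='max_length'")
        target = max_length
    else:
        raise ValueError(f"unknown mode: {mode}")
    if multiple_of:
        # Round the target length up to the next multiple.
        target = -(-target // multiple_of) * multiple_of
    input_ids, attention_mask = [], []
    for seq in batch:
        if len(seq) > target:
            raise ValueError("sequence longer than target length; truncate first")
        n_pad = target - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

Padding to a multiple (for example 8) is often chosen because some hardware kernels run more efficiently on aligned shapes; whether it helps depends on the device and model.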

BugFix

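Fixed a Taskflow NPTag decoding issue. #2023

Fixed an error when training the recall model of the semantic retrieval Application with output_emb_size = 0. #2090

Breaking Changes

When a Tokenizer is called with batch input, the default output format changed from a list of dicts to a dict of lists (the dict is a BatchEncoding object); this can be controlled with return_dict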
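
The difference between the two layouts in this breaking change can be shown with plain Python data. The sketch below does not call PaddleNLP; the helper names to_dict_of_lists and to_list_of_dicts are hypothetical, and ordinary dicts stand in for the BatchEncoding object. Code written against the old layout could use a conversion like this while migrating.

```python
def to_dict_of_lists(records):
    """Convert [{'input_ids': ..., ...}, ...] into {'input_ids': [...], ...}."""
    keys = records[0].keys() if records else []
    return {key: [record[key] for record in records] for key in keys}


def to_list_of_dicts(columns):
    """Convert {'input_ids': [...], ...} back into one dict per example."""
    n_examples = len(next(iter(columns.values()), []))
    return [{key: values[i] for key, values in columns.items()} for i in range(n_examples)]


# Old layout: one dict per example.
old_style = [
    {"input_ids": [1, 5, 2], "token_type_ids": [0, 0, 0]},
    {"input_ids": [1, 7, 8, 2], "token_type_ids": [0, 0, 0, 0]},
]
# New layout: one dict for the whole batch, keyed by field.
new_style = to_dict_of_lists(old_style)
```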

New Contributors

@mmglove made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/1974

@luyaojie made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2012

@wjj19950828 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2061

@kev123456 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2070

@heliqi made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2073

@yeliang2258 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/2077

Full Changelog: https://github.com/PaddlePaddle/PaddleNLP/compare/v2.2.6...v2.3.0
v2.2.6(Apr 15, 2022)
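Bug fixes

Improved the error messages of the AutoModel and AutoTokenizer modules. #1902

Fixed missing default classes in ERNIE-Doc classification tasks. #1867

Fixed an error when the RoBERTa tokenizer loads local resources. #1821

Fixed missing files in the bstc dataset.

Improved the error messages of the xnli dataset. #1838

Fixed the empty unlabeled.json file in the FewCLUE dataset. #1881

Fixed an error when load_dataset reads all splits of the CLUE tnews dataset. #1941

Fixed Chinese reading comprehension metrics being computed too low. #1874

Fixed an error in textcnn static-graph prediction. #1839

Fixed an error in distributed training for text classification with pre-trained models. #1839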

v2.2.5(Mar 21, 2022)
New features

Taskflow

Multi-level modes for word segmentation and NER. #1666

AutoSplitter/AutoJoiner support automatic splitting of arbitrarily long text. #1666
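
The general idea behind splitting long text automatically and joining the results can be sketched as follows. This is an illustration only and not the AutoSplitter/AutoJoiner code in PaddleNLP: the functions split_text and run_on_long_text, the delimiter set, and the assumption that the task returns a list per chunk are all hypothetical.

```python
def split_text(text, max_len, delimiters="。！？!?\n"):
    """Cut text into chunks of at most max_len characters, preferring sentence ends."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            # Prefer to cut right after the last delimiter inside the window.
            cut = max(text.rfind(d, start, end) for d in delimiters)
            if cut >= start:
                end = cut + 1
        chunks.append(text[start:end])
        start = end
    return chunks


def run_on_long_text(text, task, max_len):
    """Apply task (a function returning a list) to each chunk and join the results."""
    results = []
    for chunk in split_text(text, max_len):
        results.extend(task(chunk))
    return results
```

Cutting at sentence boundaries keeps each chunk self-contained, so per-chunk predictions can simply be concatenated; a window with no delimiter falls back to a hard cut at max_len.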

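Bug fixes

Fixed a dataset reading error in the ERNIE-Doc text classification task. #1687

Fixed the native generation API failing to run correctly when a tensor passed in is None. #1656

Fixed the RoBERTa model not supporting a 2D attention mask. #1676

Fixed the ConvBERT model not supporting dynamic-to-static conversion. #1643

Fixed ERNIE-M training hangs. #1681

Documentation updates

Added an FAQ on compilation errors to the FasterTransformer documentation. #1750

Fixed the T5 model example documentation. #1652

Updated the documentation on community-contributed weights. #1749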

v2.2.4(Jan 26, 2022)
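We are pleased to release PaddleNLP 2.2.4. It mainly fixes some features in 2.2.3 and improves a number of features and documents. Highlights:

New features

Added CoNLL-2002, a Spanish and Dutch named entity recognition dataset. #1561

Improvements

The small model PP-MiniLM now integrates FasterTokenizer; after quantization and pruning, its inference is 8.8 times as fast as BERT_base. #1542

Transformer dynamic-graph mode supports O2-level AMP training @zhangbo9674. #1574

Added Paddle Serving support to the semantic indexing application. #1558

Bug fixes

Fixed an error downloading the NLTK package model for the ERNIE-Doc model. #1515

Fixed attention_mask computation overflow at FP16 precision in several Transformer models. #1585

Fixed an incorrect TRT prediction configuration for the LAC model. #1606

Fixed an evaluation error in the BART text summarization example. #1560

Fixed an error in the BART text summarization example on Windows. #1588

Fixed a bug where truncation_strategy had no effect in Tokenizer.__call__(). #1615

Fixed a bug where RobertaTokenizer could not retrieve special tokens. #1618

Fixed BART and mBART not supporting a 2D attention mask. #1637

Fixed errors when downloading the CNN/DailyMail and XNLI datasets in multi-card runs. #1587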
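
The FP16 attention_mask overflow fixed above belongs to a well-known class of problems: a 0/1 mask is usually turned into an additive mask by assigning a large negative value to masked positions, and a value such as -1e9 lies outside the float16 range (the largest finite float16 is 65504). The release notes do not say how the fix was made, so the following is only a sketch of the issue under that assumption; fits_fp16, additive_mask, and the fill value -1e4 are illustrative choices.

```python
import struct


def fits_fp16(value):
    """Return True if value can be stored as a finite IEEE 754 half-precision float."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
        return True
    except (OverflowError, struct.error):
        return False


def additive_mask(mask_01, fill=-1e4):
    """Map a 2D 0/1 mask to additive form: 0.0 where attended, fill where masked."""
    if not fits_fp16(fill):
        raise ValueError("fill value overflows float16")
    return [[0.0 if keep else fill for keep in row] for row in mask_01]
```

The additive mask is added to the attention scores before softmax, so masked positions receive a weight close to zero as long as the fill value is far below any real score.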

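Documentation updates

Added a training tutorial on the CLUECorpusSmall dataset for the ERNIE-1.0 training task. https://github.com/PaddlePaddle/PaddleNLP/pull/1555

Community contributions

Added FNet @HJHGJGHHG. #1499

Fixed formatting errors on the Dataset API page of the Read the Docs documentation @GT-ZhangAcer. #1570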

v2.2.2(Dec 28, 2021)
New Features

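New industry application examples

Added a review opinion extraction application example #1505

Provides review opinion extraction and aspect-level sentiment classification, and supports end-to-end sentiment analysis inference

Provides an inference acceleration solution based on the small model PP-MiniLM, improving inference performance by 900%

Added an end-to-end semantic retrieval engine application example #1507

The supervised semantic indexing model In-Batch Negatives supports computing text-pair similarity with Paddle Inference

The unsupervised semantic indexing model SimCSE supports computing text-pair similarity with Paddle Inference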
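
For readers unfamiliar with the In-Batch Negatives strategy named above, the training objective is usually a cross-entropy over a similarity matrix: each query's own title is the positive, and the other titles in the same batch act as negatives. The sketch below is a pure-Python illustration of that idea, not the PaddleNLP example code; the function name and the scale factor of 20 are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def in_batch_negatives_loss(queries, titles, scale=20.0):
    """Mean cross-entropy where row i of the cosine-similarity matrix has label i."""
    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in titles]
    losses = []
    for i, q_i in enumerate(q):
        # Similarity of query i to every title in the batch.
        logits = [scale * dot(q_i, t_j) for t_j in t]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each query is most similar to its own title and large when it is more similar to another title in the batch, which is why larger batches provide more negatives at no extra labeling cost.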

FasterGeneration

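Improved the logic for JIT-loading custom ops to refine the need_build parameter of enable_faster_encoder() and to reduce redundant framework warnings when multiple accelerated models are used in a pipeline, improving the user experience. #1495

New Models

Added Funnel Transformer, a long-text language model, and a SQuAD question answering example based on it. #1419

Bugfix

Fixed incorrect training parameter options for GPT-3 in static-graph mode. #1500

Fixed an error in the LayoutXLM model on Windows. #1489

Improved the script for converting static-graph parameters to dynamic-graph parameters; it now supports PaddleNLP model structures unified across dynamic and static graphs. #1478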

v2.2.1(Dec 17, 2021)
New Features

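Released PP-MiniLM, a small model specialized for Chinese #1403

Fast inference: 4.2 times as fast as BERT-base (12L768H)

Fewer parameters: 52% fewer than BERT-base (12L768H)

High accuracy: 0.32 higher than BERT-base (12L768H) on the 7 classification datasets of CLUE, the Chinese language understanding evaluation benchmark

Released an industrial-grade semantic retrieval framework #1463

A one-stop, highly available semantic retrieval framework for training and prediction, integrated with the high-performance ANN engine Milvus

Recall model solutions cover both supervised and unsupervised data scenarios, and support training a semantic indexing model on unsupervised data only

Taskflow

Added the Chinese dialogue task PLATO-mini to Taskflow, with support for multi-turn dialogue memory #1383

FasterGeneration

Added fused attention QKV to the generation decoding framework, improving decoding performance by up to 8% #1455

Bugfix

Fixed compatibility issues with Paddle 2.2 and earlier #1450

Fixed the max_steps option having no effect in the MSRA_NER example #1451

Corrected some ERNIE-1.0 pre-training parameters to improve pre-training stability #1344

Fixed static-graph prediction issues for EFL and ernie-matching on Windows #1480

Fixed a Windows compatibility issue in the Taskflow text similarity task #1465

Fixed the LayoutXLM model failing to find its yaml file when loading #1454

Fixed missing resource paths (such as vocab) and typos in the SqueezeBERT model #1454

Fixed incorrect results for diversity rate in FasterGeneration #1477

Fixed repetition_penalty being disabled for GPT models in FasterGeneration #1471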

v2.2.0(Dec 10, 2021)
New features

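FasterERNIE: accelerated development with unified training and inference for pre-trained models

Added the high-performance text preprocessing operator FasterTokenizer for faster text preprocessing #1220

Integrated the Fused TransformerEncoder API for highly optimized Transformer performance #1308

Added the to_static() interface, which exports text processing and model computation as a single graph for easier model export

Improved the C++ deployment experience, significantly lowering C++ development cost

Provides usage examples for text classification and sequence labeling

FasterGeneration: a high-performance acceleration component for generation tasks

Upgraded FasterTransformer to V4.0

The accelerated Transformer adds force decoding support under sampling and 3 beam search strategies

Added the Diverse Beam Search strategy to the generation API

Taskflow upgrades

Added noun phrase tagging and text similarity tasks #1246 #1345

The dependency parsing task can now parse syntax trees from pre-segmented input #1351

The Chinese word segmentation, part-of-speech tagging, and named entity recognition tasks support intervention with user-defined dictionaries #364 #1420

The knowledge mining task supports advanced usage such as custom models and custom Term-Linking #1329

WordTag, the word-class knowledge tagging tool in the Jieyu (解语) suite, supports training on incremental data #1329

Improved the user experience of TermTree, the encyclopedic knowledge tree in the Jieyu suite, which now supports customized use #1329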

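More pre-trained models

Added the multimodal form models LayoutLM, LayoutLMv2, and LayoutXLM

Added a Chinese summarization pre-trained model based on unimo-text-1.0-lcsts-new

Added the mBART and mBART50 models for multilingual translation

Added the NPTag model to the Jieyu suite; it can be used directly for noun phrase tagging and has 2000+ label categories #1246

Added the GPTModel pre-trained weights gpt2-en, gpt2-large-en, and gpt2-xl-en for English text generation #1302

Added the Mengzi Chinese pre-trained model

Automatic model and tokenizer loading

Added the AutoModel and AutoTokenizer modules, which make it more convenient to load pre-trained models and tokenizers of different network architectures

Community contributions

Added BertJapaneseTokenizer and BertJapanese pre-trained weights by @iverxin in #1115

Added the BlenderbotSmall and Blenderbot models #868. Thanks to @kevinng77 for the contribution.

Added the SqueezeBERT model #937. Thanks to @renmada for the contribution.

Added the CTRL model #921. Thanks to @JunnYu for the contribution.

Added the T5 model #916. Thanks to @JunnYu for the contribution.

Added the Reformer model #870. Thanks to @JunnYu for the contribution.

Added the MobileBERT model #1160. Thanks to @nosaydomore for the contribution.

Added the ChineseBERT model #1100. Thanks to @27182812 for the contribution.

Added the End-to-End Memory Network model #1046. Thanks to @yulangz for the contribution.

Improved the downstream task code for BERT and added BERT pre-trained weights by @JunnYu in #1085

Improved the downstream task code for BigBird by @iverxin in #1114

Improved the downstream task code for ELECTRA and added ELECTRA pre-trained weights by @JunnYu in #1086

Improved the downstream task code for RoBERTa and added RoBERTa pre-trained weights by @nosaydomore in #1133

Improved the downstream task code for GPT and added GPT pre-trained weights by @JunnYu in #1088

Improved the downstream task code for XLNet and added DistilBERT pre-trained weights by @renmada in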

Misc

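Added the text classification dataset XNLI #1336

GPT-3 pre-training supports pure FP16 training in static-graph mode #1353

Named entity recognition adds support for the peoples_daily_ner dataset and for using ERNIE models #1361

Optimized ViterbiDecoder decoding performance, with up to a 10x speedup on GPU devices #1291

Bugfix

Fixed the incorrect unit in the download progress bar

Fixed a prediction error after exporting GPT models #1303

Fixed bugs in metric computation for the text correction model #1255 #1265 #1273

Fixed the missing get_logits_processor parameter in the generate API @JunnYu in #1399

Fixed 2D attention mask support in the BERT model @JunnYu in #1226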

v2.1.1(Oct 20, 2021)

New Features

GPT-3 dynamic-graph mode adds pure FP16 support.

Taskflow sentiment analysis adds the prediction score to its output.

Generation API adds the Diverse Sibling Search strategy.

Generation API adds the Repetition Penalty strategy. @JunnYu
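
As background for the Repetition Penalty strategy, a common formulation (introduced with the CTRL model) lowers the logits of tokens that have already been generated: positive logits are divided by the penalty and negative logits are multiplied by it. The sketch below shows that formulation in pure Python; it is an assumption that the Generation API follows the same rule, and the function name is hypothetical.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of logits with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both lower it
        # when penalty > 1.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted
```

A penalty of 1.0 leaves the logits unchanged, and larger values discourage repetition more strongly.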

Bug Fix

Fixed FasterUNIMOText being unable to use acceleration when top_p is 1.0.
v2.1.0(Oct 11, 2021)
New Features

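Added Taskflow, ready-to-use industrial-grade NLP capabilities with 8 preset tasks including Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, sentiment analysis, and text correction. For more usage details, see the Taskflow documentation.

Added an NLP few-shot learning example based on Prompt Tuning; combined with the R-Drop strategy it improves results significantly. For more technical details, see FewCLUE.

Integrated FasterTransformer acceleration, significantly improving inference speed for text generation tasks such as translation and dialogue. Mainstream generation architectures such as Transformer, GPT, and BART are supported, along with Beam Search and sampling-based decoding strategies. For more usage details, see the FasterTransformer usage documentation.

New Examples

Added the unsupervised semantic matching model SimCSE.

Added the model compression strategy MiniLMv2.

Added the text correction model ERNIE-CSC.

Added the dependency parsing example dependency_parsing.

Added the few-shot learning example few_shot.

Added the text summarization example BART.

Improved the multi-node distributed pre-training code for ERNIE-1.0, GPT, and GPT-3. @zhaoyinglia @wangxicoding

New Pretrained Models

Added the RoFormer model #804. Thanks to @JunnYu for the contribution 🎉.

Added the ConvBERT model #819. Thanks to @JunnYu for the contribution 🎉.

Added the MPNet model #869. Thanks to @JunnYu for the contribution 🎉.

New Dataset

Added the text summarization dataset CNN/DailyMail #1061.

Bug Fix

Fixed inaccurate Viterbi decoding predictions for inputs of length 1 #1126.

Fixed a numerical precision issue in the lexical analysis model #962.

Fixed the handling of special characters when the Tokenizer computes the offset mapping #882. Thanks to @JunnYu for the contribution 🎉.

Fixed int type variable errors on Windows #856 #1023 #1146.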
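
The Viterbi fix above concerns inputs of length 1, an edge case in which no transition is ever applied and the best tag is simply the one with the highest emission score. The following pure-Python sketch of Viterbi decoding handles that case naturally. It is a textbook version written for illustration, not PaddleNLP's ViterbiDecoder, and it omits start and stop transitions.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_score, best_path).

    emissions[t][j] is the score of tag j at step t;
    transitions[i][j] is the score of moving from tag i to tag j.
    """
    if not emissions:
        return 0.0, []
    num_tags = len(emissions[0])
    scores = list(emissions[0])  # with one step, these are already the final scores
    backpointers = []
    for emit in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for arriving at tag j.
            best_i = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    last = max(range(num_tags), key=lambda i: scores[i])
    best_score = scores[last]
    # Follow the backpointers from the last step to the first.
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best_score, path
```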

Docs

Improved the Transformer API Reference documentation to be clearer, more accurate, and easier to understand. Thanks to @huhuiwen99 for the contribution 🎉.

New Contributors

@huhuiwen99 made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/914 🎉

@iamqiz made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/950 🎉

@ForFishes made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/986 🎉

@AI-Mart made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/1009 🎉

@zhaoyinglia made their first contribution in https://github.com/PaddlePaddle/PaddleNLP/pull/1064 🎉

v2.0.8(Aug 22, 2021)

New Pretrained-Models

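Added the text generation model UNIMO-text and its tokenizer, including unimo-text-1.0 and unimo-text-1.0-large.

Added the long-text pre-trained model ERNIE-Doc.

New Dataset

Added the question generation dataset DuReaderQG.

Added the ad copy generation dataset AdvertiseGen.

Added the short summary generation dataset LCSTS_new.

Added the long-text semantic matching dataset CAIL2019-SCM.

Added the long reading comprehension dataset C3.

Added the text classification datasets HYP and THUCNews.

New Feature

Added the Layerwise-decay optimizer.

Added the R-Drop loss API.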
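
As background for the R-Drop loss, R-Drop passes the same input through the model twice with dropout active and penalizes disagreement between the two predicted distributions using a symmetric KL divergence. The sketch below shows that regularization term for a single example in pure Python. It is not the PaddleNLP API: the function names are hypothetical, and the 0.5 averaging factor follows a common formulation rather than anything stated in these notes.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def r_drop_loss(logits_a, logits_b):
    """Symmetric KL between the distributions from two forward passes."""
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

In training this term is typically added, with a weight, to the usual task loss computed on both passes.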

BugFix

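Fixed a bug where the min_out_len parameter had no effect in the generation API, along with some documentation issues.

Fixed the tokenizer deleting meaningful # characters when computing the offset mapping. @JunnYu

New Examples

Added a baseline for the competition "Qianyan: Fact-Consistency-Oriented Generation Evaluation".

Added a baseline for "Qianyan: Question Matching Robustness Evaluation".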
v2.0.7(Aug 2, 2021)
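Feature updates

Added a Few-Shot Learning baseline using the PET strategy;

Added the BART model;

Added the C3, TriviaQA, and CAIL2019-SCM datasets;

FasterTransformer enhancements:

4.1 Unified Transformer adds Beam Search and Sampling decoding strategies;

4.2 The Top-k Sampling decoding strategy supports arbitrary k;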
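
For readers new to Top-k Sampling, the strategy keeps only the k highest-scoring tokens at each step and samples among them in proportion to their softmax probabilities. The pure-Python sketch below illustrates this for any k. It is not the FasterTransformer kernel, and the function names are hypothetical.

```python
import math
import random


def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf (ties at the cutoff are kept)."""
    if k <= 0 or k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


def sample_top_k(logits, k, rng=random):
    """Sample one token id from the softmax over the top-k logits."""
    filtered = top_k_filter(logits, k)
    m = max(filtered)
    # exp(-inf) is 0.0, so filtered-out tokens can never be drawn.
    weights = [math.exp(x - m) for x in filtered]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

With k=1 this reduces to greedy decoding, and larger k trades determinism for diversity.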

Bug Fix

Simplified dependencies to speed up installation;

Fixed multi-threaded use of the Taskflow API;

v2.0.6(Jul 20, 2021)
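Feature updates

Added the Taskflow one-line prediction API, supporting sentiment analysis and knowledge linking (text2knowledge) tasks;

Added the SimBERT model for text matching;

Added an emotion analysis task to the sentiment analysis module;

Added the long-text classification datasets HYP and THUCNews;

Bug Fix

Fixed an inconsistency between ClipGradByGlobalNorm in the GPT task and Megatron;

Fixed the data type of Unified Transformer on Windows;

Fixed a training error in CRF when batch_size=1;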

v2.0.5(Jun 29, 2021)
Bug fix

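Fixed pre-trained model vocab files failing to save.

More pre-trained models

Added the macbert-base-chinese and macbert-large-chinese pre-trained models, which are loaded in the same way as other BERT models.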

v2.0.4(Jun 29, 2021)
Bug fix

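Fixed idx_to_token and token_to_idx not matching in the ERNIE-GRAM vocab. Thanks to @BFJL for the contribution! 🎉 🎉 🎉

More datasets

Added the SE-ABSA16_CAME Chinese sentiment classification dataset. Thanks to @jiaqianjing for the high-quality contribution! 🎉 🎉 🎉

Added the COTE-BD and COTE-MFW Chinese semantic role recognition datasets. Thanks to @jiaqianjing for the high-quality contribution! 🎉 🎉 🎉

Finetuned model

Added the ernie-2.0-en-finetuned-squad model, obtained by fine-tuning ernie-2.0-en on the SQuAD 1.0 dataset.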

v2.0.3(Jun 17, 2021)
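API improvements

Upgraded the load_dataset() method: when both splits and data_files are passed, the splits parameter now specifies the format in which the local dataset is read, which is more intuitive.

The generation API generate() now supports GPT pre-trained models!

More datasets

Added the BQCorpus Chinese text similarity dataset. Thanks to @frozenfish123 for the high-quality contribution! 🎉 🎉 🎉

Added the PAWS-X Chinese text similarity dataset. Thanks to @jiaqianjing for the high-quality contribution! 🎉 🎉 🎉

Added the NLPCC14-SC Chinese sentiment classification dataset. Thanks to @fiyen for the high-quality contribution! 🎉 🎉 🎉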
