Large-scale open domain KNOwledge grounded conVERsation system based on PaddlePaddle

Last update: Dec 28, 2022


Overview

Knover

Knover is a toolkit for knowledge grounded dialogue generation based on PaddlePaddle. Knover allows researchers and developers to carry out efficient training/inference of large-scale dialogue generation models.

What's New:

December 2021: We released the PLATO-XL dialogue generation model, with up to 11 billion parameters.
October 2021: We released AG-DST, an amendable generation approach for dialogue state tracking.
February 2021: We released our implementation (Team 19) for DSTC9-Track1.
July 2020: We released PLATO-2, a large-scale generative model with latent space for open-domain dialogue systems.

Requirements and Installation

Python version >= 3.7
paddlepaddle-gpu version >= 2.0.0
- Install PaddlePaddle by following its official installation instructions.
- The PaddlePaddle build you need depends on your CUDA version (recommended: 10.1) and cuDNN version (recommended: 7.6). See the PaddlePaddle documentation on GPU support for details.
sentencepiece
termcolor
NCCL, if you want to run distributed training

Install Knover locally:

git clone https://github.com/PaddlePaddle/Knover.git
cd Knover
pip3 install -e .

Alternatively, you can just add Knover to PYTHONPATH:

export PYTHONPATH=/abs/path/to/Knover:$PYTHONPATH
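
After either installation route, a quick check can confirm that the requirements listed above are visible to the interpreter. The following is a minimal sketch, not part of Knover: the function name `check_environment` is hypothetical, and the import names `paddle` and `knover` are assumed from the package paths that appear in the tracebacks further down this page. It only checks that modules can be located; it does not verify the paddlepaddle-gpu version or the CUDA/cuDNN setup.

```python
import importlib.util
import sys

# Minimum Python version and module names taken from the requirements list above.
MIN_PYTHON = (3, 7)
REQUIRED_MODULES = ["paddle", "sentencepiece", "termcolor", "knover"]


def check_environment(modules=REQUIRED_MODULES, min_python=MIN_PYTHON):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    found = tuple(sys.version_info[:2])
    if found < tuple(min_python):
        problems.append("Python %d.%d+ required, found %d.%d" % (tuple(min_python) + found))
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append("module not importable: " + name)
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["environment looks OK"]:
        print(line)
```

If `knover` is reported as not importable after the PYTHONPATH route, the exported path is probably not the absolute path of the repository root.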

Basic usage

See usage document.

Disclaimer

This project aims to facilitate further research progress in dialogue generation. Baidu is not responsible for content that third parties generate with the pre-trained system.

Contact information

For help or issues using Knover, please submit a GitHub issue.

Comments

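About training results

I trained stage 1 and stage 2.1 from scratch on 5.3 million samples. The model still clearly produces safe responses and repetition. Is that because I have not trained it enough? Stage 1 ran for 320,000 steps with batch_size=16, and stage 2.1 ran for 18,000 steps with batch_size=1024.

Switching to picking a random candidate seems to work somewhat better.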

opened by lonelydancer 31
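Does distributed training need any special settings? Single-node multi-GPU training works, but multi-node multi-GPU training establishes communication and then neither reports an error nor trains

Following the Paddle distributed training tutorial, I set distributed_args="--ips 10.130.19.203,10.130.17.157 --selected_gpus 0,1" in the config. The two machines establish communication but training never starts, and each GPU shows about 2 GB of memory in use. The following configuration trains normally: distributed_args="--ips 10.130.19.203 --selected_gpus 0,1".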

opened by jidlin 18
the missing full source code of plato-2

Hi, thanks for your great work! I explored the plato-2 directory and found only .sh files. May I ask where the .py files are, so that I can try the chatbot interaction? Thanks for your help!

opened by chikiuso 9
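About training PLATO-XL

Thank you very much for open-sourcing the XL model. I tried to train my own XL model on eight A100 GPUs (40 GB of memory each), but the model has so many parameters that GPU memory is still insufficient. The PLATO-XL paper says: "Given the limited memory of each device, vanilla data parallelism cannot support the training of such a model with up to 11 billion parameters. As such, we adopt the sharded data parallelism (Rajbhandari et al., 2020) to eliminate memory redundancies, by partitioning the optimizer states, gradients and parameters across multiple devices." How can I implement the training method the paper describes, in which the model parameters are partitioned across multiple GPUs?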

opened by guijuzhejiang 7
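
To see why partitioning matters at this scale, the per-device memory for model states can be estimated with the accounting from the cited paper (Rajbhandari et al., 2020). The sketch below is a back-of-the-envelope estimate, not a statement about how PLATO-XL was actually trained: it assumes mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states) and ignores activations, buffers and fragmentation. The function name `model_state_gb` is hypothetical.

```python
def model_state_gb(num_params, num_devices, stage):
    """Approximate per-device model-state memory in GB (1 GB = 1e9 bytes).

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer states partitioned
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients and parameters partitioned
    """
    params = 2.0 * num_params   # fp16 parameters
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 master weights, Adam momentum and variance
    if stage >= 1:
        optim /= num_devices
    if stage >= 2:
        grads /= num_devices
    if stage >= 3:
        params /= num_devices
    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        print("stage", stage, model_state_gb(11e9, 8, stage), "GB per device")
```

Under these assumptions, 11 billion parameters on 8 devices need about 176 GB per device with plain data parallelism, 60.5 GB with stage 1, 41.25 GB with stage 2 and 22 GB with stage 3. Only the last figure fits within 40 GB, and that is before any activation memory is counted.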

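Chinese PLATO-2: single-node single-GPU training works, but single-node multi-GPU training exits after a certain number of steps with no useful error message

Chinese dialogue data, 4 million samples. A single GPU can finish a whole epoch, but a run on 4 GPUs in one machine exits after a certain number of steps.

Environment: paddlepaddle-gpu==2.0.1 cuda==11.0 cudnn==8.0

The error in the terminal is:

INFO 2021-09-05 21:51:40,245 launch_utils.py:327] terminate all the procs
ERROR 2021-09-05 21:51:40,245 launch_utils.py:584] ABORT!!! Out of all 4 trainers, the trainer process with rank=[3] was aborted. Please check its log.
INFO 2021-09-05 21:51:43,248 launch_utils.py:327] terminate all the procs

The error in work_log.3 is: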

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocator<std::string > > const&, bool)
1   paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string > > const&, bool)
2   paddle::framework::details::FastThreadedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string > > const&, bool)
3   paddle::framework::BlockingQueue<unsigned long>::Pop()
4   paddle::framework::SignalHandle(char const*, int)
5   paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1630683022 (unix time) try "date -d @1630683022" if you are using GNU date ***]
  [SignalInfo: *** SIGTERM (@0x3e800000a5e) received by PID 2812 (TID 0x7f718f576b80) from PID 2654 ***]

opened by jidlin 7

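Suggestion: document how to resume training from a checkpoint

Training usually runs for a long time and is likely to be interrupted and then resumed, so resuming is a real need. Please add instructions for resuming training to the documentation.

Also, resuming currently requires filling in the checkpoint path and the current start_step by hand in the arguments, which is too cumbersome. I suggest saving this information together with the checkpoint, so that a resumed run can detect it first and automatically continue from the last saved step.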
enhancement

opened by onewaymyway 7
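
The suggestion above can be sketched as a small state file written next to the checkpoint. This is only an illustration of the idea, not Knover's implementation: the file name `train_state.json` and the functions `save_train_state` and `load_start_step` are hypothetical, and a real integration would call them from wherever the training loop saves and loads checkpoints.

```python
import json
import os
import tempfile

META_NAME = "train_state.json"  # hypothetical file name, not part of Knover


def save_train_state(checkpoint_dir, step):
    """Record the current step counter alongside the checkpoint files."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, META_NAME)
    # Write to a temporary file first so an interrupted save
    # cannot leave a truncated state file behind.
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fp:
        json.dump({"step": step}, fp)
    os.replace(tmp_path, path)


def load_start_step(checkpoint_dir, default=0):
    """Return the saved step, or `default` when no state file exists."""
    path = os.path.join(checkpoint_dir, META_NAME)
    if not os.path.isfile(path):
        return default
    with open(path) as fp:
        return int(json.load(fp)["step"])
```

With this in place, a resumed run would call `load_start_step(checkpoint_dir)` instead of taking start_step from the command line, and would fall back to 0 for a fresh run.
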
ValueError: (InvalidArgument) Tensor holds the wrong type, it holds int, but desires to be int64_t.

This is the LIC2022 baseline source code. It runs normally on AI Studio. When I run it locally, train_query and infer_dial finish without errors; the error below appears only in infer_query.

paddlepaddle: 2.2.2 cuda: 11.2 cudnn: 8.2

$ sh ./scripts/local/job.sh ./projects/lic2022/conf/query_infer.conf

2022-04-18 15:40:25,456-INFO: [topology.py:169:__init__] HybridParallelInfo: rank_id: 0, mp_degree: 1, sharding_degree: 1, pp_degree: 1, dp_degree: 1, mp_group: [0], sharding_group: [0], pp_group: [0], dp_group: [0], check/clip group: [0]
W0418 15:40:25.456908 14688 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.6, Runtime API Version: 11.2
W0418 15:40:25.472528 14688 device_context.cc:465] device: 0, cuDNN Version: 8.2.
[WARN] Using constant learning rate because of warmup_steps is not positive while using NoamScheduler.
Loading parameters from ./projects/lic2022/model_zoo/query_finetune.pdparams.
Loading has done!
Traceback (most recent call last):
  File "./knover/scripts/infer.py", line 140, in <module>
    infer(args)
  File "./knover/scripts/infer.py", line 83, in infer
    predictions = task.infer_step(model, data)
  File "e:\jupyternotebookproject\lic2022\knover\knover\core\task.py", line 46, in infer_step
    predictions = model.infer_step(inputs)
  File "e:\jupyternotebookproject\lic2022\knover\knover\core\model.py", line 508, in infer_step
    predictions = self._model(*inputs, mode="infer")
  File "e:\jupyternotebookproject\lic2022\knover\knover\core\model.py", line 180, in __call__
    outputs = self.infer_step(inputs)
  File "e:\jupyternotebookproject\lic2022\knover\knover\core\model.py", line 170, in infer_step
    predictions = self.infer(inputs, outputs)
  File "e:\jupyternotebookproject\lic2022\knover\knover\models\unified_transformer.py", line 297, in infer
    outputs = self.generator(self, inputs, outputs)
  File "e:\jupyternotebookproject\lic2022\knover\knover\modules\generator.py", line 163, in __call__
    state = self._update_state(state, probs)
  File "e:\jupyternotebookproject\lic2022\knover\knover\modules\generator.py", line 390, in _update_state
    state["predictions"] = paddle.concat([state["predictions"], pred], axis=1)
  File "E:\software\Anaconda3\envs\Knover\lib\site-packages\paddle\tensor\manipulation.py", line 345, in concat
    return paddle.fluid.layers.concat(input=x, axis=axis, name=name)
  File "E:\software\Anaconda3\envs\Knover\lib\site-packages\paddle\fluid\layers\tensor.py", line 327, in concat
    return _C_ops.concat(input, 'axis', axis)
ValueError: (InvalidArgument) Tensor holds the wrong type, it holds int, but desires to be int64_t.
  [Hint: Expected valid == true, but received valid:0 != true:1.] (at ../paddle/fluid/framework/tensor_impl.h:33)
  [operator < concat > error]
INFO 2022-04-18 15:40:36,201 launch_utils.py:341] terminate all the procs
ERROR 2022-04-18 15:40:36,201 launch_utils.py:604] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2022-04-18 15:40:39,210 launch_utils.py:341] terminate all the procs
INFO 2022-04-18 15:40:39,210 launch.py:311] Local processes completed.

exit_code=0
[[ 0 != 0 ]]
exit 0

opened by chikin-lau 6
About PLATO-KAG

First of all, thank you very much for your work. When I run the PLATO-KAG code as instructed, I run into a problem. Would you mind helping with it? Thank you very much.

opened by bingfeiz 6
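Why must the vocab contain both [UNK] and <unk>?

According to the rules in the code, the vocab must contain both [UNK] and <unk>, otherwise an error is raised. Both tokens stand for unknown words, so what is the difference? Also, in the English vocab from the example, some tokens have duplicate ids, as shown below, and I don't understand why. Won't duplicate ids be overwritten? Should my own vocab also use duplicate ids?

<unk> 0
<s> 1
</s> 2
[UNK] 0
[PAD] 0
[CLS] 1
[SEP] 2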
enhancement

opened by guijuzhejiang 6
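
The question of whether duplicate ids get overwritten depends on which direction a mapping is built in. The sketch below is a generic illustration, not Knover's actual vocab loader: the function `load_vocab` and the whitespace-separated "token id" line format are assumptions. It shows that a token-to-id dictionary keeps every alias, while an id-to-token dictionary can hold only one token per id.

```python
def load_vocab(lines):
    """Parse 'token id' lines into a token->id dict and an id->token dict."""
    token_to_id = {}
    id_to_token = {}
    for line in lines:
        token, idx = line.split()
        token_to_id[token] = int(idx)
        # Keep the first token seen for each id; later aliases do not replace it.
        id_to_token.setdefault(int(idx), token)
    return token_to_id, id_to_token


# The vocab entries quoted in the question above.
VOCAB_LINES = ["<unk> 0", "<s> 1", "</s> 2", "[UNK] 0", "[PAD] 0", "[CLS] 1", "[SEP] 2"]
token_to_id, id_to_token = load_vocab(VOCAB_LINES)

print(token_to_id["[UNK]"], token_to_id["<unk>"])  # both aliases resolve to id 0
print(len(token_to_id), len(id_to_token))          # 7 tokens, 3 distinct ids
```

In this sketch nothing is lost when encoding, because each alias keeps its own key. Only decoding has to pick one token per id, and this sketch keeps the first one listed.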

DSTC10-Track2/task2 inference code: Error while running the command 'bash ./submission_0_infer.sh'

When running DSTC10-Track2/task2 (Knowledge-grounded Dialogue Modeling), I got the error below. Please help me fix it.

The error message:

Load pretraining parameters from /home/Knover/projects/DSTC10-Track2/task2/models/SOP-32L-Detection
Traceback (most recent call last):
  File "/home/Knover/knover/data/dialog_reader.py", line 578, in __wrapper__
    for batch in batch_reader():
  File "/home/Knover/knover/data/dialog_reader.py", line 517, in __wrapper__
    for batch in batch_reader():
  File "/home/Knover/knover/data/dialog_reader.py", line 432, in __wrapper__
    for record in reader():
  File "/home/Knover/knover/data/dialog_reader.py", line 369, in __wrapper__
    yield from self._read_numerical_file(fp, phase, is_infer)
TypeError: _read_numerical_file() takes from 2 to 3 positional arguments but 4 were given
WARNING:root:Your reader has raised an exception!
Traceback (most recent call last):
Exception in thread   File "./knover/scripts/infer.py", line 145, in <module>
Thread-1:
Traceback (most recent call last):
      File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
infer(args)
    for step, data in enumerate(infer_generator(), 1):
  File "/home/Knover/myenv/lib/python3.8/site-packages/paddle/fluid/reader.py", line 1392, in __next__
        return self._reader.read_next()self.run()

SystemError:   File "/usr/lib/python3.8/threading.py", line 870, in run
(Fatal) Blocking queue is killed because the data reader raises an exception.
  [Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:166)

Thank you for taking the time to review this.

opened by JH-debug 5
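
The TypeError in the traceback above says that `_read_numerical_file` accepts two or three positional arguments (counting `self`) but was called with four. One way such a message can arise is a caller and a callee written against different signatures, for example when files from different versions of the code are mixed. Whether that is the cause in this issue is an assumption. The classes below are hypothetical and only reproduce the message.

```python
class BaseReader:
    def read(self, fp, phase="test", is_infer=True):
        # The caller passes three arguments after `self`, as in the traceback above.
        return list(self._read_numerical_file(fp, phase, is_infer))

    def _read_numerical_file(self, fp, phase, is_infer):
        for line in fp:
            yield line.strip()


class OutdatedReader(BaseReader):
    # Hypothetical override written against an older signature:
    # `self`, `fp`, and one optional argument.
    def _read_numerical_file(self, fp, delimiter=";"):
        for line in fp:
            yield line.strip().split(delimiter)


def try_read(reader):
    """Return the parsed records, or the TypeError text if the call fails."""
    try:
        return reader.read(["1 2;3 4"])
    except TypeError as exc:
        return str(exc)


if __name__ == "__main__":
    print(try_read(BaseReader()))
    print(try_read(OutdatedReader()))
```

The second call fails as soon as the method is invoked, before any line is read, which matches the traceback pointing at the `yield from` line rather than at the data.
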

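When running inference with a fine-tuned Knover model, the results on the same corpus differ on every run

Hello. I fine-tuned the Knover classifier on my own data and saved a checkpoint (we noticed that the saved checkpoint has 2612 parameter files, nearly 4 times more than the 522 of the provided SOP-32L-Context model). I then ran prediction with infer.sh based on this checkpoint, but the predictions on the same dataset differ on every run. Is this normal? How can I fix it?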

opened by luomou97 5
InvalidArgumentError: Broadcast dimension mismatch

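Hello. I trained a PLATO-2 model with Knover, but after deploying it to my backend with hub serving start and testing it with JMeter, the JMeter client reports the following error:

InvalidArgumentError: Broadcast dimension mismatch. Operands could not be broadcast together with the shape of X = [20, 16, 20, 27] and the shape of Y = [20, 16, 1, 8]. Received [27] in X is not equal to [8] in Y at i:3. [Hint: Expected x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1 == true, but received x_dims_array[i] == y_dims_array[i] || x_dims_array[i] <= 1 || y_dims_array[i] <= 1:0 != true:1.] (at /paddle/paddle/phi/kernels/funcs/common_shape.h:84) [operator < elementwise_add > error]","results":"","status":"101"}

My environment:
Deployment platform: paddlepaddle-gpu container
Container version: paddlepaddle/paddle 2.3.2-gpu-cuda11.2-cudnn8
Knover version: 0.0.6
Number of GPUs: 1
paddlehub version: 2.3.0

I have tried the following:
1. export CUDA_VISIBLE_DEVICES=0
2. Because the script for interact.py runs successfully locally, I rewrote the data-loading part of module.py from an open-source project by a developer on AI Studio, following the corresponding part of Knover/knover/core/model.py, but the error persists. Comparing the two procedures, I found that in the local run the tensors built from the data have a constant shape of 20, whereas in the hub-deployed service the tensor shape varies with the length of the tokenized text. I am not sure which part to change. Is there a solution for this, or a deployment tutorial for plato2_en_base under this version?

The AI Studio project by the open-source author mentioned in item 2: https://aistudio.baidu.com/aistudio/projectdetail/1197592

Thanks.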

opened by what-is-perfect 2
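
The hint in the error message above spells out the broadcasting condition: at every position the two dimensions must be equal, or one of them must be at most 1. The pure-Python sketch below applies that condition to the shapes from the report. The function `broadcast_shape` is hypothetical and handles only equal-rank shapes. It does not call PaddlePaddle.

```python
def broadcast_shape(x_shape, y_shape):
    """Return the broadcast shape of two equal-rank shapes, or raise ValueError."""
    if len(x_shape) != len(y_shape):
        raise ValueError("this sketch only handles shapes of equal rank")
    out = []
    for i, (x, y) in enumerate(zip(x_shape, y_shape)):
        # Same condition as the hint in the error message:
        # the dimensions match, or one of them is <= 1.
        if x == y or x <= 1 or y <= 1:
            out.append(max(x, y))
        else:
            raise ValueError("dimension %d: %d in X is not equal to %d in Y" % (i, x, y))
    return out


if __name__ == "__main__":
    # Compatible: the last dimensions agree and the 1 in Y is stretched to 20.
    print(broadcast_shape([20, 16, 20, 27], [20, 16, 1, 27]))
    # The shapes from the report: 27 and 8 cannot be reconciled at index 3.
    try:
        broadcast_shape([20, 16, 20, 27], [20, 16, 1, 8])
    except ValueError as exc:
        print(exc)
```

In the reported shapes only the last dimension conflicts (27 against 8). That fits the observation in the report that tensor lengths vary with the tokenized text in the deployed service, so two inputs to the same addition were presumably built with different sequence lengths.
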
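In the Link the World paper, how is the service API in Section 2.1 (Service Information) built?

Thank you to Baidu for the continued work on Chinese dialogue.

I would like to ask how exactly the service API in the first paragraph of Section 2.1, Service Information, of the paper "Link the World: Improving Open-domain Conversation with Dynamic Spatiotemporal-aware Knowledge" is built. Having read the paper, I think building this service API is the most important part of the whole work: if it is built well enough, it can indeed greatly improve the human-machine interaction experience. However, the paper does not seem to describe this part of the work in detail, nor any related open-source code or data. I am very curious about it.

I look forward to your reply. Thank you!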

opened by cingtiye 0
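How to build a Chinese multi-turn dialogue bot on top of the existing open-source English PLATO-2 model

Hi everyone,

How can I build a Chinese multi-turn dialogue bot on top of the existing open-source English PLATO-2 model? I have read the link below, but I still do not quite understand how to use the English PLATO-2 to build a PLATO-2 model suited to Chinese multi-turn dialogue. Could anyone provide more details? And could those who have already done this share some code for reference? Thanks.

Link: https://github.com/PaddlePaddle/Knover/issues/25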

opened by ZeyuTeng96 0


open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

Chinese open information extraction system, open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

7 Nov 2, 2022

Fine-tune GPT-3 with a Google Chat conversation history

Google Chat GPT-3 This repo will help you fine-tune GPT-3 with a Google Chat conversation history. The trained model will be able to converse as one o

7 Dec 10, 2022

Samantha, A covid-19 information bot which will provide basic information about this pandemic in form of conversation.

Covid-19-BOT Samantha, A covid-19 information bot which will provide basic information about this pandemic in form of conversation. This bot uses torc

2 Nov 5, 2021

A crowdsourced dataset of dialogues grounded in social contexts involving utilization of commonsense.

62 Dec 20, 2022

(ACL 2022) The source code for the paper "Towards Abstractive Grounded Summarization of Podcast Transcripts"

Towards Abstractive Grounded Summarization of Podcast Transcripts We provide the source code for the paper "Towards Abstractive Grounded Summarization

10 Jul 1, 2022

CCKS-Title-based-large-scale-commodity-entity-retrieval-top1

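- Title-based large-scale commodity entity retrieval, top1. 1. Task introduction. CCKS 2020: title-based large-scale commodity entity retrieval. Given a product title, a participating system must match it to the corresponding product entity in the given product database. Input: the input file contains a number of lines of product titles. Output: each line of the output text contains the product entity for the corresponding title, i.e. the product ID in the given knowledge base,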

43 Nov 11, 2022

NLP Core Library and Model Zoo based on PaddlePaddle 2.0

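PaddleNLP 2.0 offers a rich model library, concise and easy-to-use APIs, and high-performance distributed training. It aims to improve text-modeling efficiency for PaddlePaddle developers and to provide best practices for NLP based on PaddlePaddle 2.0.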

6.9k Jan 1, 2023

Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

GenSen Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning Sandeep Subramanian, Adam Trischler, Yoshua B

309 Oct 19, 2022

This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages" published in Findings of the Association for Computational Linguistics: ACL 2021.

XL-Sum This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Lang

189 Jan 2, 2023

IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

40 Nov 30, 2022

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

377 Jan 2, 2023

Code for text augmentation method leveraging large-scale language models

HyperMix Code for our paper GPT3Mix and conducting classification experiments using GPT-3 prompt-based data augmentation. Getting Started Installing P

47 Dec 20, 2022

Tools for curating biomedical training data for large-scale language modeling

242 Dec 25, 2022

Use PaddlePaddle to reproduce the paper: mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

MT5_paddle Use PaddlePaddle to reproduce the paper: mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer English | 简体中文 mT5: A Massively

2 Oct 17, 2021

Knowledge Graph, Question Answering System: a medical diagnosis question answering system based on a knowledge graph and vector retrieval

823 Dec 28, 2022

Reading Wikipedia to Answer Open-Domain Questions

DrQA This is a PyTorch implementation of the DrQA system described in the ACL 2017 paper Reading Wikipedia to Answer Open-Domain Questions. Quick Link

4.3k Jan 1, 2023

Baseline code for Korean open domain question answering(ODQA)

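Open-Domain Question Answering (ODQA) is the task of finding answers to natural-language questions from a collection of documents on a wide range of topics. No passage is provided separately for answering the user's question. Therefore, a pre-built Knowl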

69 Nov 4, 2022

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

740 Dec 24, 2022

Knowledge Management for Humans using Machine Learning & Tags

HyperTag helps humans intuitively express how they think about their files using tags and machine learning. Represent how you think using tags. Find what you look for using semantic search for your text documents (yes, even PDF's) and images.

166 Jan 7, 2023