🚀 RocketQA, dense retrieval for information retrieval and question answering, including both Chinese and English state-of-the-art models.

Last update: Jan 4, 2023

Overview

In recent years, dense retrievers based on pre-trained language models have made remarkable progress. To help more developers use these cutting-edge techniques, this repository provides 🚀 RocketQA, an easy-to-use toolkit for running and fine-tuning state-of-the-art dense retrievers. The toolkit has the following advantages:

State-of-the-art: 🚀 RocketQA provides our trained models, which achieve SOTA performance on many dense retrieval datasets. We will keep adding the latest models.
First-Chinese-model: 🚀 RocketQA provides the first open-source Chinese dense retrieval model, trained on millions of manually annotated examples from DuReader.
Easy-to-use: Integrated with JINA, 🚀 RocketQA helps developers build an end-to-end retrieval and question answering system with a few lines of code.

News

April 29, 2022: A training function was added to the RocketQA toolkit, and the baseline models of DuReader_retrieval (both cross encoder and dual encoder) became available in RocketQA models.
March 30, 2022: The baseline of the DuReader_retrieval leaderboard was released. [code/model]
March 30, 2022: We released DuReader_retrieval, a large-scale Chinese benchmark for passage retrieval. The dataset contains over 90K questions and 8M passages from Baidu Search. [paper] [data]
December 3, 2021: The RocketQA dense retriever toolkit was released, including the first Chinese dense retrieval model trained on DuReader.
August 26, 2021: RocketQA v2 was accepted by EMNLP 2021. [code/model]
May 5, 2021: PAIR was accepted by ACL 2021. [code/model]
March 11, 2021: RocketQA v1 was accepted by NAACL 2021. [code/model]

Installation

We provide two installation methods: a Python package and a Docker environment.

Install with Python Package

First, install PaddlePaddle.

# GPU version:
$ pip install paddlepaddle-gpu

# CPU version:
$ pip install paddlepaddle

Second, install the rocketqa package (latest version: 1.1.0):

$ pip install rocketqa

NOTE: this toolkit MUST run on Python 3.6+ with PaddlePaddle 2.0+.
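The version requirement in the note can be checked before installing rocketqa. The following is a minimal sketch, not part of the toolkit: the helper names parse_version and meets_minimum are hypothetical, and it assumes the installed PaddlePaddle reports its version through paddle.__version__.

```python
import sys
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.3.1' into (2, 3, 1); a non-numeric suffix such as 'rc0' is ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version: str, minimum: Tuple[int, ...]) -> bool:
    """Return True if the version string is at least the minimum tuple."""
    parsed = parse_version(version)
    # Pad short versions such as '2' so that they compare as (2, 0)
    parsed += (0,) * (len(minimum) - len(parsed))
    return parsed >= minimum


# Python 3.6+ is required according to the note above
python_ok = sys.version_info >= (3, 6)

if __name__ == "__main__":
    print("Python OK:", python_ok)
    try:
        import paddle  # only available once PaddlePaddle is installed
        print("PaddlePaddle OK:", meets_minimum(paddle.__version__, (2, 0)))
    except ImportError:
        print("PaddlePaddle is not installed")
```

Running the script prints one line per requirement, so a failed check is visible before any model is downloaded.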

Install with Docker

docker pull rocketqa/rocketqa

docker run -it docker.io/rocketqa/rocketqa bash

Getting Started

Following the examples below, you can build and run your own search engine with a few lines of code. We also provide a Playground with Jupyter Notebook. Try 🚀 RocketQA straight away in your browser!

Running with JINA

JINA is a cloud-native neural search framework for building SOTA and scalable deep learning search applications in minutes. Here is a simple example of building a search engine based on JINA and RocketQA.

cd examples/jina_example
pip3 install -r requirements.txt

# Generate vector representations and build a library for your Documents
# JINA will automatically start a web service for you
python3 app.py index toy_data/test.tsv

# Try some questions related to the indexed Documents
python3 app.py query_cli

Please see the JINA example to learn more.

Running with FAISS

We also provide a simple example built on Faiss.

cd examples/faiss_example/
pip3 install -r requirements.txt

# Generate vector representations and build a library for your Documents
python3 index.py zh ../data/dureader.para test_index

# Start a web service on http://localhost:8888/rocketqa
python3 rocketqa_service.py zh ../data/dureader.para test_index

# Try some questions related to the indexed Documents
python3 query.py

API

You can also easily integrate 🚀 RocketQA into your own task. We provide two types of models: an ERNIE-based dual encoder for answer retrieval and an ERNIE-based cross encoder for answer re-ranking. To run our models, use the following functions.

Load model

`rocketqa.available_models()`

Returns the names of the available RocketQA models. To learn more about them, please see the code comments.

`rocketqa.load_model(model, use_cuda=False, device_id=0, batch_size=1)`

Returns the model specified by the model parameter. It can initialize both dual encoders and cross encoders. Depending on the value of model, you can load either a RocketQA model returned by "available_models()" or your own checkpoint.

Dual encoder

The dual encoder returned by "load_model()" supports the following functions:

`model.encode_query(query: List[str])`

Given a list of queries, returns their representation vectors encoded by the model.

`model.encode_para(para: List[str], title: List[str])`

Given a list of paragraphs and their corresponding titles (optional), returns their representation vectors encoded by the model.

`model.matching(query: List[str], para: List[str], title: List[str])`

Given a list of queries and paragraphs (and titles), returns their matching scores (the dot product between the two representation vectors).

`model.train(train_set: str, epoch: int, save_model_path: str, args)`

Given the hyperparameters train_set, epoch and save_model_path, you can train your own dual encoder model or fine-tune our models. Other settings such as save_steps and learning_rate can also be set in args. Please refer to examples/example.py for details.

Cross encoder

The cross encoder returned by "load_model()" supports the following functions:

`model.matching(query: List[str], para: List[str], title: List[str])`

Given a list of queries and paragraphs (and titles), returns their matching scores (the probability that the paragraph is the right answer to the query).

`model.train(train_set: str, epoch: int, save_model_path: str, args)`

Given the hyperparameters train_set, epoch and save_model_path, you can train your own cross encoder model or fine-tune our models. Other settings such as save_steps and learning_rate can also be set in args. Please refer to examples/example.py for details.
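Because the cross encoder is meant for answer re-ranking, its matching scores can be used to sort candidate paragraphs for a single query. The sketch below is illustrative only: the function name rerank is hypothetical, and it assumes that matching() accepts the keyword arguments query, para and title and returns one score per pair that can be converted to float.

```python
from typing import List, Optional, Tuple


def rerank(cross_encoder, query: str, paras: List[str],
           titles: Optional[List[str]] = None) -> List[Tuple[str, float]]:
    """Sort candidate paragraphs for one query by cross-encoder score, best first.

    cross_encoder is any object with a matching() method, for example a model
    returned by rocketqa.load_model() (assumption: one score per query/para pair).
    """
    # matching() scores aligned pairs, so the query is repeated for each paragraph
    kwargs = {"query": [query] * len(paras), "para": paras}
    if titles is not None:
        kwargs["title"] = titles
    # Iterating materialises the scores even if matching() returns them lazily
    scores = [float(s) for s in cross_encoder.matching(**kwargs)]
    return sorted(zip(paras, scores), key=lambda pair: pair[1], reverse=True)
```

A typical use would be to pass the top paragraphs returned by the dual encoder to rerank() and keep only the first few results.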

Examples

Following the examples below, you can retrieve the vector representations of your documents and connect 🚀 RocketQA to your own tasks.

Run RocketQA Model

To run a RocketQA model, set the model parameter of 'load_model()' to a model name returned by 'available_models()'.

import rocketqa

query_list = ["trigeminal definition"]
para_list = [
    "Definition of TRIGEMINAL. : of or relating to the trigeminal nerve.ADVERTISEMENT. of or relating to the trigeminal nerve. ADVERTISEMENT."]

# init dual encoder
dual_encoder = rocketqa.load_model(model="v1_marco_de", use_cuda=True, device_id=0, batch_size=16)

# encode query & para
q_embs = dual_encoder.encode_query(query=query_list)
p_embs = dual_encoder.encode_para(para=para_list)
# compute dot product of query representation and para representation
dot_products = dual_encoder.matching(query=query_list, para=para_list)
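The vectors returned by encode_query() and encode_para() can also be compared directly, for instance to rank a small set of paragraphs for one query. This is a brute-force sketch under stated assumptions: the names dot and top_k are hypothetical, and both encode functions are assumed to return an iterable with one vector per input string. For a large corpus, an index such as the one in the FAISS example is the better fit.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equally sized vectors."""
    return sum(float(a) * float(b) for a, b in zip(u, v))


def top_k(dual_encoder, query: str, paras: List[str], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k paragraphs with the highest dot-product score for the query.

    dual_encoder is any object with encode_query() and encode_para(), for
    example a dual encoder returned by rocketqa.load_model().
    """
    # list() materialises the embeddings in case they are produced lazily
    q_emb = list(dual_encoder.encode_query(query=[query]))[0]
    p_embs = list(dual_encoder.encode_para(para=paras))
    scored = [(p, dot(q_emb, e)) for p, e in zip(paras, p_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```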

Train Your Own Model

To train your own models, use the train() function with your dataset and parameters. The training data contains 4 columns: query, title, para, label (0 or 1), separated by "\t". For details about the parameters and dataset, please refer to './examples/example.py'.

import rocketqa

# init cross encoder, and set device and batch_size
cross_encoder = rocketqa.load_model(model="zh_dureader_ce", use_cuda=True, device_id=0, batch_size=32)

# finetune cross encoder based on "zh_dureader_ce"
cross_encoder.train('./examples/data/cross.train.tsv', 2, 'ce_models', save_steps=1000, learning_rate=1e-5, log_folder='log_ce')
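The four-column training format described in this section can be checked before a training run starts. The following sketch is not part of the toolkit: the names parse_training_line and check_training_file are hypothetical, and accepting an empty title column is an assumption.

```python
from typing import List, Tuple


def parse_training_line(line: str) -> Tuple[str, str, str, int]:
    """Split one line into (query, title, para, label) or raise ValueError."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated columns, got {len(fields)}")
    query, title, para, label = fields
    if label not in ("0", "1"):
        raise ValueError(f"label must be 0 or 1, got {label!r}")
    return query, title, para, int(label)


def check_training_file(path: str) -> List[str]:
    """Return a list of human-readable problems found in a training file."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            try:
                parse_training_line(line)
            except ValueError as err:
                problems.append(f"line {number}: {err}")
    return problems
```

An empty list from check_training_file() means every line has exactly four columns and a 0/1 label. A line that uses two tabs between fields is reported because it splits into more than four columns.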

Run Your Own Model

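To run your own model, set the model parameter of 'load_model()' to the path of a JSON config file.

import rocketqa

# example inputs, reused from the "Run RocketQA Model" example above
query_list = ["trigeminal definition"]
para_list = [
    "Definition of TRIGEMINAL. : of or relating to the trigeminal nerve.ADVERTISEMENT. of or relating to the trigeminal nerve. ADVERTISEMENT."]

# init cross encoder
cross_encoder = rocketqa.load_model(model="./examples/ce_models/config.json", use_cuda=True, device_id=0, batch_size=16)

# compute relevance of query and para
relevance = cross_encoder.matching(query=query_list, para=para_list)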

The config is a JSON file like this:

{
    "model_type": "cross_encoder",
    "max_seq_len": 384,
    "model_conf_path": "zh_config.json",
    "model_vocab_path": "zh_vocab.txt",
    "model_checkpoint_path": ${YOUR_MODEL},
    "for_cn": true,
    "share_parameter": 0
}
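A config of this shape can also be generated from Python, which avoids editing the ${YOUR_MODEL} placeholder by hand. The sketch below simply mirrors the cross encoder fields shown above. The names build_config and write_config are hypothetical, and the checkpoint path passed in is whatever directory your own training run produced.

```python
import json


def build_config(checkpoint_path: str) -> dict:
    """Return a cross encoder config with the fields from the example above."""
    return {
        "model_type": "cross_encoder",
        "max_seq_len": 384,
        "model_conf_path": "zh_config.json",
        "model_vocab_path": "zh_vocab.txt",
        # Replaces the ${YOUR_MODEL} placeholder with a real path
        "model_checkpoint_path": checkpoint_path,
        "for_cn": True,
        "share_parameter": 0,
    }


def write_config(config_path: str, checkpoint_path: str) -> None:
    """Write the config as JSON so that it can be passed to load_model()."""
    with open(config_path, "w", encoding="utf-8") as handle:
        json.dump(build_config(checkpoint_path), handle, indent=4)
```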

The examples folder provides more details.

Citations

If you find RocketQA v1 models helpful, feel free to cite our publication RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering

@inproceedings{rocketqa_v1,
    title="RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering",
    author="Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu and Haifeng Wang",
    year="2021",
    booktitle = "In Proceedings of NAACL"
}

If you find PAIR models helpful, feel free to cite our publication PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval

@inproceedings{rocketqa_pair,
    title="PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval",
    author="Ruiyang Ren, Shangwen Lv, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang and Ji-Rong Wen",
    year="2021",
    booktitle = "In Proceedings of ACL Findings"
}

If you find RocketQA v2 models helpful, feel free to cite our publication RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking

@inproceedings{rocketqa_v2,
    title="RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking",
    author="Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang and Ji-Rong Wen",
    year="2021",
    booktitle = "In Proceedings of EMNLP"
}

If you find DuReader_retrieval dataset helpful, feel free to cite our publication DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine

@inproceedings{DuReader_retrieval,
    title="DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine",
    author="Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, Qiaoqiao She, Jing Liu, Hua Wu and Haifeng Wang",
    year="2022"
}

License

This repository is provided under the Apache-2.0 license.

Contact Information

For help or issues using RocketQA, please submit a GitHub issue.

For other communication or cooperation, please contact Jing Liu ([email protected]).

Comments

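Provide an Elasticsearch-based example

Great project, a thumbs-up first 👍

Does this project plan to provide an Elasticsearch-based example? Elasticsearch is an open-source search engine widely used in industry. It has supported vector search since 7.3 and now supports kNN search. For most companies, the core advantage of using Elasticsearch for vector search is zero operations cost, because it is already existing middleware.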

opened by RussellLuo 7
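train_dual_encoder gives poor training results

I tried training the dual_encoder with some training sets, but the results are poor. For example: 可怜飞燕倚新妆\t\t《清平调》之二李白\t\t"一枝秾艳露凝香，云雨巫山枉断肠。借问汉宫谁得似，可怜飞燕倚新妆。"\t0 However, when I query 可怜飞燕倚新妆, nothing is found. dureader.para contains 《清平调》之二李白\t"一枝秾艳露凝香，云雨巫山枉断肠。借问汉宫谁得似，可怜飞燕倚新妆。" and I used the trained dual_encoder. Is my training set written incorrectly, or are there other special requirements? Is the format query \t\t title \t\t para \t 0,1?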

opened by fallbernana123456 6
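Some questions about rocketqa
Does rocketqa have to use an index? My goal is to search within one type of document, so the document content is different each time, i.e. there is no fixed corpus, although the documents roughly follow the same pattern. In other words, can rocketqa be used for retrieval if the document passages are different every time?

Also, regarding few-shot fine-tuning of the rocketqa model: roughly how much custom training data is needed to get good results? I did not find related figures in rocketqa. Do we need to train again with the DuReader retrieval dataset, or can we train directly on our own training set?

What is the title in the training set for? Can it be left unspecified, or does specifying it improve model accuracy?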
opened by hehuang139 5
No reply is returned when asking a question.

Python version 3.7.3, jina version 2.4.5

Question: (type \q to quit)Who is Paula Deen's brother?
UserWarning: ignored unknown argument: ['8886']. (raised from /home/fangjiyuan/.local/lib/python3.7/site-packages/jina/helper.py:689)
du_encoder@63982[E]:AttributeError("'generator' object has no attribute 'squeeze'")
 add "--quiet-error" to suppress the exception details
Traceback (most recent call last):
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/peapods/runtimes/zmq/zed.py", line 285, in _msg_callback
    processed_msg = self._callback(msg)
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/peapods/runtimes/zmq/zed.py", line 271, in _callback
    msg = self._post_hook(self._handle(self._pre_hook(msg)))
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/peapods/runtimes/zmq/zed.py", line 226, in _handle
    peapod_name=self.name,
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/peapods/runtimes/request_handlers/data_request_handler.py", line 165, in handle
    field='groundtruths',
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/executors/__init__.py", line 196, in __call__
    self, **kwargs
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/executors/decorators.py", line 105, in arg_wrapper
    return fn(*args, **kwargs)
  File "/home/fangjiyuan/github_file/RocketQA/examples/jina_example/rocketqa_encoder/executor.py", line 39, in encode_question
    doc.embedding = query_emb.squeeze()
AttributeError: 'generator' object has no attribute 'squeeze'
vec_indexer@63990[E]:TypeError("can not determine the array type: ['builtins'].NoneType")
 add "--quiet-error" to suppress the exception details
Traceback (most recent call last):
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/peapods/runtimes/zmq/zed.py", line 285, in _msg_callback
    processed_msg = self._callback(msg)
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/peapods/runtimes/zmq/zed.py", line 271, in _callback
    msg = self._post_hook(self._handle(self._pre_hook(msg)))
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/peapods/runtimes/zmq/zed.py", line 226, in _handle
    peapod_name=self.name,
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/peapods/runtimes/request_handlers/data_request_handler.py", line 165, in handle
    field='groundtruths',
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/executors/__init__.py", line 196, in __call__
    self, **kwargs
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/executors/decorators.py", line 105, in arg_wrapper
    return fn(*args, **kwargs)
  File "/home/fangjiyuan/.jina/hub-packages/zb38xlt4/executor.py", line 78, in search
    docs.match(self._storage, **match_args)
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/types/arrays/mixins/match.py", line 141, in match
    dist, idx = lhv._match(rhv, cdist, _limit, normalization, metric_name)
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/types/arrays/mixins/match.py", line 194, in _match
    dists = cdist(x_mat, y_mat, metric_name)
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/types/arrays/mixins/match.py", line 127, in <lambda>
    cdist = lambda *x: _cdist(*x, device=device)
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/math/distance/__init__.py", line 37, in cdist
    x_type = get_array_type(x_mat)
  File "/home/fangjiyuan/.local/lib/python3.7/site-packages/jina/types/ndarray/__init__.py", line 310, in get_array_type
    raise TypeError(f'can not determine the array type: {module_tags}.{class_name}')
TypeError: can not determine the array type: ['builtins'].NoneType
<jina.types.document.Document ('id', 'mime_type', 'text') at 140657475693760>

opened by fangjiyuan 5
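Cannot reproduce the DuReader baseline results on a single GPU

Hello, thanks for providing the experiment code! I found that when training the cross encoder on a single GPU, the results are considerably worse than the 4*V100 baseline on every metric. Is there a reason for this?

Environment: Tesla A100 80G, CUDA 11.1, cuDNN 8.0.5, paddle 2.2.0

Inference results of the provided model: {"MRR@10": 0.7284, "QueriesRanked": 2000, "recall@1": 0.6410, "recall@50": 0.9175}
Best results of the model I trained myself: {"MRR@10": 0.7028, "QueriesRanked": 2000, "recall@1": 0.6165, "recall@50": 0.9175}

What causes this? Also, when training ourselves with the same hyperparameters as the baseline, we found that the model overfits after 1.5 epochs. Is this because of using a single GPU?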

opened by Davion1999 4
language other than english and chinese

thanks for the great work.

I was wondering if we can use this toolkit with PLM in other languages (available in huggingface) and build DPR for that language. imagine we have appropriate data in that language.

Has anyone had experience developing a model with RocketQA for other languages?

opened by amiroft 3

The ce scoring model hangs in matching and exits abnormally after about ten seconds

The scoring code:

    ce_conf = {
        "model": 'zh_dureader_ce_v2',
        "use_cuda": True,
        "device_id": 0,
        "batch_size": 32
    }
    cross_encoder = rocketqa.load_model(**ce_conf)

    q = ['电力设备行业规模', '电力设备行业规模', '电力设备行业规模', '电力设备行业规模', '电力设备行业规模']
    t = ['电力设备行业的市场规模分析 电力设备行业未来发展前景分...', '电力设备市场细分数据分析_财富号_东方财富网',
         '电力设备行业市场分析', '2021年电力设备制造行业发展概况及趋势分析 - 百...', '2022年电力设备制造行业现状和发展趋势.docx-原创力文档']
    p = ['目前电力设备行业市场规模已经超过5000亿元,行业利润总额产国340亿元。国内电力设备市场正在以持续稳定的增长之势向前发展,我国电力设备行业当前处于行业的快... https://www.chinairn.com/news/20220718/162320712.shtml baidu_2 1658073600',
         '目前电力设备行业的市场规模已经超过5000亿元,行业利润总额产国340亿元。国内电力设备市场正在以持续稳定的增长之势向前发展,2022-2027年中国机械电力设备行业市场供需及重点企业投... https://caifuhao.eastmoney.com/news/20220721184503449012900 baidu_3 1658332800',
         '预计到 2025年，低压电器市场规模将达到 1,240亿元，预计 2021年到 2025年的年均复合增长率为 7.72%，继续保持高速增长的趋势。在电力行业，统电力系统正朝着新型电力系统过渡，... https://baijiahao.baidu.com/s?id=1744016995218803706&wfr=spider&for=pc baidu_4 1663171200',
         '电力设备制造业是机械工业最主要的子行业之一,行业资产总额占整个机械 工业的近 1/4.2015 年,电力设备制造业行业规模继续扩大,资产总额稳步增长, 企业数量有所回升.截至 2015... https://wenku.baidu.com/view/7d481de1d2f34693daef5ef7ba0d4a7302766ca0.html baidu_5 ',
         '从行业规模来看,2019年,电 力设备行业规模继续扩大,企业数量继续回升,资产总额稳步增长。截至2019年底,行业规模以上企业达21,512家,同比增加354家;资产总额达6... https://m.book118.com/html/2022/1110/8137123122005011.shtm baidu_7 1668614400']

    print('score ...')
    s = list(cross_encoder.matching(query=q, para=p, title=t))
    print(s)

Corresponding output:

RocketQA model [zh_dureader_ce_v2]
WARNING:root:paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
W1202 17:21:07.773284 52200 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.7, Runtime API Version: 11.2
W1202 17:21:07.778347 52200 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
Load model done
score ...

Process finished with exit code -1073741819 (0xC0000005)

Can't arbitrary text be passed as the arguments of matching? Why does it hang and then exit abnormally after about ten seconds?

opened by jyjy007 3

Test faiss_example cause IndexError in RocketQA docker

I got IndexError in RocketQA docker when I tested the faiss example under the guidance of README.md, so I can't do the next step. I'll be grateful if someone can help me. Thx!

opened by Mr-IT007 3
Dureader

Hi，Great job, I find that you release Chinese retrieval model trained on Dureader, Could you please also release your preprocess code or processed datasets.

opened by yclzju 3

Training with the provided example does not output a model

The log is as follows

E:\IdeaProjects\knowledge-model\rocketqa_es>python example.py
RocketQA model [zh_dureader_de]
WARNING:root:paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
W1222 14:00:13.174715  6936 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.7, Runtime API Version: 11.7
W1222 14:00:13.178715  6936 gpu_resources.cc:91] device: 0, cuDNN Version: 8.4.
Load model done
INFO:root:-----------  Configuration Arguments -----------
[INFO] 2022-12-22 14:00:16,089 [     args.py:   69]:    -----------  Configuration Arguments -----------
INFO:root:batch_size: 8
[INFO] 2022-12-22 14:00:16,089 [     args.py:   71]:    batch_size: 8
INFO:root:checkpoints: checkpoints
[INFO] 2022-12-22 14:00:16,090 [     args.py:   71]:    checkpoints: checkpoints
INFO:root:chunk_scheme: IOB
[INFO] 2022-12-22 14:00:16,090 [     args.py:   71]:    chunk_scheme: IOB
INFO:root:decr_every_n_nan_or_inf: 2
[INFO] 2022-12-22 14:00:16,090 [     args.py:   71]:    decr_every_n_nan_or_inf: 2
INFO:root:decr_ratio: 0.8
[INFO] 2022-12-22 14:00:16,091 [     args.py:   71]:    decr_ratio: 0.8
INFO:root:dev_set: None
[INFO] 2022-12-22 14:00:16,091 [     args.py:   71]:    dev_set: None
INFO:root:diagnostic: None
[INFO] 2022-12-22 14:00:16,092 [     args.py:   71]:    diagnostic: None
INFO:root:diagnostic_save: None
[INFO] 2022-12-22 14:00:16,093 [     args.py:   71]:    diagnostic_save: None
INFO:root:do_lower_case: True
[INFO] 2022-12-22 14:00:16,093 [     args.py:   71]:    do_lower_case: True
INFO:root:do_test: True
[INFO] 2022-12-22 14:00:16,094 [     args.py:   71]:    do_test: True
INFO:root:do_train: False
[INFO] 2022-12-22 14:00:16,095 [     args.py:   71]:    do_train: False
INFO:root:do_val: False
[INFO] 2022-12-22 14:00:16,096 [     args.py:   71]:    do_val: False
INFO:root:doc_stride: 128
[INFO] 2022-12-22 14:00:16,096 [     args.py:   71]:    doc_stride: 128
INFO:root:enable_ce: False
[INFO] 2022-12-22 14:00:16,097 [     args.py:   71]:    enable_ce: False
INFO:root:epoch: 2
[INFO] 2022-12-22 14:00:16,098 [     args.py:   71]:    epoch: 2
INFO:root:ernie_config_path: C:\Users\lincheng.wen/.rocketqa/zh_dureader_de/zh_config.json
[INFO] 2022-12-22 14:00:16,098 [     args.py:   71]:    ernie_config_path: C:\Users\lincheng.wen/.rocketqa/zh_dureader_de/zh_config.json
INFO:root:for_cn: True
[INFO] 2022-12-22 14:00:16,099 [     args.py:   71]:    for_cn: True
INFO:root:in_tokens: False
[INFO] 2022-12-22 14:00:16,100 [     args.py:   71]:    in_tokens: False
INFO:root:incr_every_n_steps: 100
[INFO] 2022-12-22 14:00:16,101 [     args.py:   71]:    incr_every_n_steps: 100
INFO:root:incr_ratio: 2.0
[INFO] 2022-12-22 14:00:16,102 [     args.py:   71]:    incr_ratio: 2.0
INFO:root:init_checkpoint: C:\Users\lincheng.wen/.rocketqa/zh_dureader_de/dureader_dual_encoder
[INFO] 2022-12-22 14:00:16,103 [     args.py:   71]:    init_checkpoint: C:\Users\lincheng.wen/.rocketqa/zh_dureader_de/dureader_dual_encoder
INFO:root:init_loss_scaling: 102400
[INFO] 2022-12-22 14:00:16,104 [     args.py:   71]:    init_loss_scaling: 102400
INFO:root:init_pretraining_params: None
[INFO] 2022-12-22 14:00:16,105 [     args.py:   71]:    init_pretraining_params: None
INFO:root:is_classify: True
[INFO] 2022-12-22 14:00:16,108 [     args.py:   71]:    is_classify: True
INFO:root:is_distributed: False
[INFO] 2022-12-22 14:00:16,108 [     args.py:   71]:    is_distributed: False
INFO:root:is_regression: False
[INFO] 2022-12-22 14:00:16,109 [     args.py:   71]:    is_regression: False
INFO:root:label_map_config: None
[INFO] 2022-12-22 14:00:16,110 [     args.py:   71]:    label_map_config: None
INFO:root:learning_rate: 1e-05
[INFO] 2022-12-22 14:00:16,110 [     args.py:   71]:    learning_rate: 1e-05
INFO:root:log_folder: de_log
[INFO] 2022-12-22 14:00:16,111 [     args.py:   71]:    log_folder: de_log
INFO:root:lr_scheduler: linear_warmup_decay
[INFO] 2022-12-22 14:00:16,112 [     args.py:   71]:    lr_scheduler: linear_warmup_decay
INFO:root:max_answer_length: 100
[INFO] 2022-12-22 14:00:16,112 [     args.py:   71]:    max_answer_length: 100
INFO:root:max_query_length: 64
[INFO] 2022-12-22 14:00:16,113 [     args.py:   71]:    max_query_length: 64
INFO:root:max_seq_len: 512
[INFO] 2022-12-22 14:00:16,114 [     args.py:   71]:    max_seq_len: 512
INFO:root:metric: simple_accuracy
[INFO] 2022-12-22 14:00:16,115 [     args.py:   71]:    metric: simple_accuracy
INFO:root:metrics: True
[INFO] 2022-12-22 14:00:16,115 [     args.py:   71]:    metrics: True
INFO:root:model_name: zh_dureader_de
[INFO] 2022-12-22 14:00:16,116 [     args.py:   71]:    model_name: zh_dureader_de
INFO:root:n_best_size: 20
[INFO] 2022-12-22 14:00:16,116 [     args.py:   71]:    n_best_size: 20
INFO:root:num_iteration_per_drop_scope: 10
[INFO] 2022-12-22 14:00:16,117 [     args.py:   71]:    num_iteration_per_drop_scope: 10
INFO:root:num_labels: 2
[INFO] 2022-12-22 14:00:16,118 [     args.py:   71]:    num_labels: 2
INFO:root:output_file_name: None
[INFO] 2022-12-22 14:00:16,119 [     args.py:   71]:    output_file_name: None
INFO:root:output_item: 3
[INFO] 2022-12-22 14:00:16,120 [     args.py:   71]:    output_item: 3
INFO:root:p_max_seq_len: 384
[INFO] 2022-12-22 14:00:16,120 [     args.py:   71]:    p_max_seq_len: 384
INFO:root:predict_batch_size: None
[INFO] 2022-12-22 14:00:16,123 [     args.py:   71]:    predict_batch_size: None
INFO:root:q_max_seq_len: 32
[INFO] 2022-12-22 14:00:16,123 [     args.py:   71]:    q_max_seq_len: 32
INFO:root:random_seed: None
[INFO] 2022-12-22 14:00:16,124 [     args.py:   71]:    random_seed: None
INFO:root:save_model_path: de_models
[INFO] 2022-12-22 14:00:16,124 [     args.py:   71]:    save_model_path: de_models
INFO:root:save_steps: 10
[INFO] 2022-12-22 14:00:16,125 [     args.py:   71]:    save_steps: 10
INFO:root:share_parameter: 0
[INFO] 2022-12-22 14:00:16,126 [     args.py:   71]:    share_parameter: 0
INFO:root:shuffle: True
[INFO] 2022-12-22 14:00:16,126 [     args.py:   71]:    shuffle: True
INFO:root:skip_steps: 100
[INFO] 2022-12-22 14:00:16,127 [     args.py:   71]:    skip_steps: 100
INFO:root:task_id: 0
[INFO] 2022-12-22 14:00:16,128 [     args.py:   71]:    task_id: 0
INFO:root:test_data_cnt: 1110000
[INFO] 2022-12-22 14:00:16,129 [     args.py:   71]:    test_data_cnt: 1110000
INFO:root:test_save: ./checkpoints/test_result
[INFO] 2022-12-22 14:00:16,130 [     args.py:   71]:    test_save: ./checkpoints/test_result
INFO:root:test_set: None
[INFO] 2022-12-22 14:00:16,131 [     args.py:   71]:    test_set: None
INFO:root:tokenizer: FullTokenizer
[INFO] 2022-12-22 14:00:16,131 [     args.py:   71]:    tokenizer: FullTokenizer
INFO:root:train_data_size: 0
[INFO] 2022-12-22 14:00:16,132 [     args.py:   71]:    train_data_size: 0
INFO:root:train_set: ./data/dual.train.tsv
[INFO] 2022-12-22 14:00:16,133 [     args.py:   71]:    train_set: ./data/dual.train.tsv
INFO:root:use_cross_batch: False
[INFO] 2022-12-22 14:00:16,134 [     args.py:   71]:    use_cross_batch: False
INFO:root:use_cuda: True
[INFO] 2022-12-22 14:00:16,135 [     args.py:   71]:    use_cuda: True
INFO:root:use_dynamic_loss_scaling: True
[INFO] 2022-12-22 14:00:16,136 [     args.py:   71]:    use_dynamic_loss_scaling: True
INFO:root:use_fast_executor: True
[INFO] 2022-12-22 14:00:16,139 [     args.py:   71]:    use_fast_executor: True
INFO:root:use_lamb: False
[INFO] 2022-12-22 14:00:16,140 [     args.py:   71]:    use_lamb: False
INFO:root:use_mix_precision: False
[INFO] 2022-12-22 14:00:16,141 [     args.py:   71]:    use_mix_precision: False
INFO:root:use_multi_gpu_test: False
[INFO] 2022-12-22 14:00:16,142 [     args.py:   71]:    use_multi_gpu_test: False
INFO:root:use_recompute: False
[INFO] 2022-12-22 14:00:16,143 [     args.py:   71]:    use_recompute: False
INFO:root:validation_steps: 1000
[INFO] 2022-12-22 14:00:16,143 [     args.py:   71]:    validation_steps: 1000
INFO:root:verbose: False
[INFO] 2022-12-22 14:00:16,144 [     args.py:   71]:    verbose: False
INFO:root:vocab_path: C:\Users\lincheng.wen/.rocketqa/zh_dureader_de/zh_vocab.txt
[INFO] 2022-12-22 14:00:16,145 [     args.py:   71]:    vocab_path: C:\Users\lincheng.wen/.rocketqa/zh_dureader_de/zh_vocab.txt
INFO:root:warmup_proportion: 0.1
[INFO] 2022-12-22 14:00:16,146 [     args.py:   71]:    warmup_proportion: 0.1
INFO:root:weight_decay: 0.01
[INFO] 2022-12-22 14:00:16,146 [     args.py:   71]:    weight_decay: 0.01
INFO:root:------------------------------------------------
[INFO] 2022-12-22 14:00:16,147 [     args.py:   72]:    ------------------------------------------------
INFO:root:Device count: 1
[INFO] 2022-12-22 14:00:16,165 [dual_encoder.py:  291]: Device count: 1
INFO:root:Num train examples: 112
[INFO] 2022-12-22 14:00:16,166 [dual_encoder.py:  292]: Num train examples: 112
INFO:root:Max train steps: 28
[INFO] 2022-12-22 14:00:16,167 [dual_encoder.py:  293]: Max train steps: 28
INFO:root:Num warmup steps: 2
[INFO] 2022-12-22 14:00:16,168 [dual_encoder.py:  294]: Num warmup steps: 2
INFO:root:Learning rate: 0.000010
[INFO] 2022-12-22 14:00:16,170 [dual_encoder.py:  295]: Learning rate: 0.000010
WARNING:root:paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
[WARNING] 2022-12-22 14:00:16,171 [       io.py:  719]: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
INFO:rocketqa.utils.init:Load pretraining parameters from C:\Users\lincheng.wen/.rocketqa/zh_dureader_de/dureader_dual_encoder.
[INFO] 2022-12-22 14:00:26,350 [     init.py:   73]:    Load pretraining parameters from C:\Users\lincheng.wen/.rocketqa/zh_dureader_de/dureader_dual_encoder.

opened by wenlincheng 2

DuReader-Retrieval-Baseline fails on a single GPU

export CUDA_VISIBLE_DEVICES=0
TRAIN_SET="dureader-retrieval-baseline-dataset/train/dual.train.tsv"
MODEL_PATH="pretrained-models/ernie_base_1.0_twin_CN/params"
sh script/run_dual_encoder_train.sh $TRAIN_SET $MODEL_PATH 10 1

Running the above command in the first step fails with:

OSError: (External) CUBLAS error(7). [Hint: 'CUBLAS_STATUS_INVALID_VALUE'. An unsupported value or parameter was passed to the function (a negative vector size, for example). To correct: ensure that all the parameters being passed have valid values. ] (at /paddle/paddle/fluid/platform/cuda_helper.h:107)

Environment: cuDNN Version: 7.6, CUDA 10.0

opened by sunxiaojie99 2
How can RocketQA models be used with Paddle Inference?

According to the Paddle Inference documentation:

Paddle Inference natively supports inference models trained and exported by the PaddlePaddle deep learning framework. PaddlePaddle models for inference can be saved with paddle.jit.save (dynamic graph), paddle.static.save_inference_model (static graph), or paddle.Model().save (high-level API).

RocketQA currently uses an old version of PaddlePaddle, so models for inference would presumably need to be saved with fluid.io.save_inference_model, but judging from the code they are currently saved with fluid.io.save_persistables:

https://github.com/PaddlePaddle/RocketQA/blob/019ad5c1088167e264c5ec799c5f7fd22e39ad27/rocketqa/encoder/dual_encoder.py#L380

https://github.com/PaddlePaddle/RocketQA/blob/019ad5c1088167e264c5ec799c5f7fd22e39ad27/rocketqa/encoder/cross_encoder.py#L326

How can RocketQA models be used with Paddle Inference? Does RocketQA plan to support inference with Paddle Inference?

opened by RussellLuo 2

After pulling the latest docker image and running the demo script, CUDA is reported as incompatible and the GPU cannot be used

Demo script:

import rocketqa

query_list = ["trigeminal definition"]
para_list = [
    "Definition of TRIGEMINAL. : of or relating to the trigeminal nerve.ADVERTISEMENT. of or relating to the trigeminal nerve. ADVERTISEMENT."]

# init dual encoder
dual_encoder = rocketqa.load_model(model="v1_marco_de", use_cuda=True, device_id=0, batch_size=16)

# encode query & para
q_embs = dual_encoder.encode_query(query=query_list)
p_embs = dual_encoder.encode_para(para=para_list)
# compute dot product of query representation and para representation
dot_products = dual_encoder.matching(query=query_list, para=para_list)

Error message:

λ 1f7e2a779543 ~/download python client.py
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
/usr/local/python3.7.0/lib/python3.7/site-packages/paddle/fluid/framework.py:312: UserWarning: You are using GPU version Paddle, but your CUDA device is not set properly. CPU device will be used by default.
  "You are using GPU version Paddle, but your CUDA device is not set properly. CPU device will be used by default."
RocketQA model [v1_marco_de]
Traceback (most recent call last):
  File "client.py", line 8, in <module>
    dual_encoder = rocketqa.load_model(model="v1_marco_de", use_cuda=True, device_id=0, batch_size=16)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/rocketqa/rocketqa.py", line 120, in load_model
    encoder = DualEncoder(**encoder_conf)
  File "/usr/local/python3.7.0/lib/python3.7/site-packages/rocketqa/encoder/dual_encoder.py", line 63, in __init__
    place = dev_list[device_id]
IndexError: list index out of range

opened by yuyaxiong 1

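Provide a command-line tool to simplify model training

Problem

Currently, training or fine-tuning a model requires the following steps in addition to building the training set:

Write a training script, following Train Your Own Model
After training, delete the moment files to reduce the model size
To run the model, create a model directory by copying examples/de_models or examples/ce_models
Then modify ${YOUR_MODEL} in config.json to point to the directory from step 2 (reference 1, reference 2)

In my opinion, besides being tedious because of the number of steps, the training procedure above can be quite confusing for beginners, who prefer things that work out of the box with a single command.

Proposal

Add a rocketqa command-line tool that initially provides a train subcommand for training or fine-tuning a model (automatically completing the 4 steps above). New subcommands can be added later as needed.

The rocketqa command: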

$ rocketqa -h
usage: rocketqa [-h] {train} ...

optional arguments:
  -h, --help  show this help message and exit

commands:
  {train}
    train     train or finetune the dual/cross encoder model

The rocketqa train subcommand:

$ rocketqa train -h
usage: rocketqa train [-h] [--use-cuda] [--epoch EPOCH] [--out OUT] [--save-steps SAVE_STEPS] [--learning-rate LEARNING_RATE] base_model train_set

positional arguments:
  base_model            base model
  train_set             train set

optional arguments:
  -h, --help            show this help message and exit
  --use-cuda            whether to run models on GPU (default: False)
  --epoch EPOCH         epoch (default: 2)
  --out OUT             output directory (default: ./models)
  --save-steps SAVE_STEPS
                        save steps (default: 1000)
  --learning-rate LEARNING_RATE
                        learning rate (default: 1e-05)
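
The interface proposed above can be approximated with Python's argparse. This is only a sketch of the argument parsing: build_parser is a hypothetical name, the flags and defaults are copied from the help text above, and the actual training logic is omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the proposed `rocketqa train` help output."""
    parser = argparse.ArgumentParser(prog="rocketqa")
    subparsers = parser.add_subparsers(title="commands", dest="command")
    train = subparsers.add_parser(
        "train",
        help="train or finetune the dual/cross encoder model",
        # Appends "(default: ...)" to each option's help text.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    train.add_argument("base_model", help="base model")
    train.add_argument("train_set", help="train set")
    train.add_argument("--use-cuda", action="store_true",
                       help="whether to run models on GPU")
    train.add_argument("--epoch", type=int, default=2, help="epoch")
    train.add_argument("--out", default="./models", help="output directory")
    train.add_argument("--save-steps", type=int, default=1000, help="save steps")
    train.add_argument("--learning-rate", type=float, default=1e-5,
                       help="learning rate")
    return parser
```

Calling build_parser().parse_args(["train", "my_base_model", "train.tsv"]) with these placeholder values yields a namespace holding the defaults listed above.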

opened by RussellLuo 3

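Question about using RocketQA
Hello, I have two questions:

Can I directly download the cross_encoder trained in RocketQA step 2 and use it to score the MARCO data in order to filter out false negatives? I believe I have seen a similar approach in some work but cannot recall which one; if you could provide any references on this, I would be very grateful.

If I run this experiment myself, how should I set the threshold for filtering out false negatives (that is, the score above which negatives are discarded)? Could you offer any suggestions, or do I need to find the best hyperparameter on the dev set?

Thanks again!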
opened by caiyinqiong 1
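
The filtering step described in this question can be sketched as a simple score filter. This assumes cross-encoder relevance scores have already been computed for each candidate negative. filter_negatives and its threshold parameter are hypothetical, and the sketch does not suggest which threshold value is appropriate.

```python
from typing import List, Tuple


def filter_negatives(scored: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Keep candidate negatives whose cross-encoder score is at or below the threshold.

    Candidates scoring above the threshold are treated as likely false negatives
    and dropped.
    """
    return [passage for passage, score in scored if score <= threshold]
```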

A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

GuwenModels: A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

66 Dec 26, 2022

A question answering app that answers a user-given question from user-given text.

A question answering app that answers a user-given question from user-given text. It is built with HuggingFace's transformers pipeline and the streamlit Python package.

3 Apr 5, 2022

Pytorch-NLU, a Chinese text classification and sequence labeling toolkit, supports multi-class and multi-label classification of long and short Chinese text, and supports sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.

Pytorch-NLU: a Chinese text classification and sequence labeling toolkit that supports multi-class and multi-label classification of long and short Chinese text, as well as sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.

186 Dec 24, 2022

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

(Framework for Adapting Representation Models) What is it? FARM makes Transfer Learning with BERT & Co simple, fast and enterprise-ready. It's built u

1.6k Dec 27, 2022

This python module is an easy-to-use port of the text normalization used in the paper "Not low-resource anymore: Aligner ensembling, batch filtering, and new datasets for Bengali-English machine translation". It is intended to be used for normalizing / cleaning Bengali and English text.

normalizer This python module is an easy-to-use port of the text normalization used in the paper "Not low-resource anymore: Aligner ensembling, batch

23 Nov 30, 2022

:mag: Transformers at scale for question answering & neural search. Using NLP via a modular Retriever-Reader-Pipeline. Supporting DPR, Elasticsearch, HuggingFace's Modelhub...

Haystack is an end-to-end framework for Question Answering & Neural search that enables you to ... ... ask questions in natural language and find gran

6.4k Jan 9, 2023

NeuralQA: A Usable Library for Question Answering on Large Datasets with BERT

NeuralQA: A Usable Library for (Extractive) Question Answering on Large Datasets with BERT Still in alpha, lots of changes anticipated. View demo on n

220 Dec 11, 2022

Knowledge Graph, Question Answering System: a medical diagnosis question answering system based on a knowledge graph and vector retrieval

823 Dec 28, 2022

Baseline code for Korean open domain question answering(ODQA)

Open-Domain Question Answering (ODQA) is the task of finding answers to natural-language queries from a collection of documents covering diverse topics. No passage is provided separately for answering the user's query. Therefore, a pre-built Knowl

69 Nov 4, 2022

Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering

Disfl-QA is a targeted dataset for contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 (Rajpurkar et al., 2018) dataset, where each question in the dev set is annotated to add a contextual disfluency using the paragraph as a source of distractors.

52 Jun 21, 2022

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training

CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training This is the official repository for the code and models of the paper CCQA: A N

29 Nov 30, 2022

chaii - hindi & tamil question answering

chaii - hindi & tamil question answering This is the 5th-place solution in the Kaggle competition: chaii - Hindi and Tamil Question Answering. The comp

33 Dec 18, 2022

Contact Extraction with Question Answering.

contactsQA Extraction of contact entities from address blocks and imprints with Extractive Question Answering. Goal Input: Dr. Max Mustermann Hauptstr

2 Apr 20, 2022

BERT-based Financial Question Answering System

BERT-based Financial Question Answering System In this example, we use Jina, PyTorch, and Hugging Face transformers to build a production-ready BERT-b

61 Sep 18, 2022

A fast Text-to-Speech (TTS) model. Works well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (tested so far).

Parallel speech synthesis. News: 2021/04/20: merged the wavegan branch into the main branch and deleted the wavegan branch! 2021/04/13: created the encoder branch for developing a voice style transfer module! 2021/04/13: the softdtw branch supports using Sof

161 Dec 19, 2022

A demo for end-to-end English and Chinese text spotting using ABCNet.

ABCNet_Chinese A demo for end-to-end English and Chinese text spotting using ABCNet. This is an old model that was trained long ago, which serves as

45 Oct 4, 2022

A unified tokenization tool for Images, Chinese and English.

ICE Tokenizer Token id [0, 20000) are image tokens. Token id [20000, 20100) are common tokens, mainly punctuations. E.g., icetk[20000] == '<unk>', ice

42 Dec 27, 2022