SPACES

An end-to-end long-text summarization model (CAIL 2020 judicial summarization track).

Blog post (in Chinese): https://kexue.fm/archives/8046

Meaning

We call our model SPACES, which happens to be one of the domain names of our blog, Scientific Spaces (https://spaces.ac.cn). The letters stand for:

  • S:Sparse Softmax;
  • P:Pretrained Language Model;
  • A:Abstractive;
  • C:Copy Mechanism;
  • E:Extractive;
  • S:Special Words。

As the name suggests, this is a word-level, extract-then-generate summarization model with pretraining and a copy mechanism, incorporating some of our recent research results on text generation.

Usage

Environment: tensorflow 1.14 + keras 2.3.1 + bert4keras 0.9.7

(On Windows, use bert4keras>=0.9.8.)

First edit the path settings in snippets.py, then run the code below.

Training:

#!/bin/bash

# Convert the raw data and build sentence vectors for the extractive model
python extract_convert.py
python extract_vectorize.py

# Train the extractive model, one run per fold (15 folds)
for ((i=0; i<15; i++)); do
    python extract_model.py $i
done

# Build the seq2seq training data and train the abstractive model
python seq2seq_convert.py
python seq2seq_model.py

Prediction:

from final import *
summary = predict(text, topk=3)
print(summary)

Contact

QQ group: 808623966; for the WeChat group, add the bot WeChat ID spaces_ac_cn.


Comments
  • Low GPU utilization and high CPU usage during training

    Hi Su, when I run seq2seq_model.py for the generation stage, the GPU holds only 257M of memory while CPU usage is huge, and one epoch takes about 2h. My server has four 40G cards, so I suspect training is not actually using the GPU. Setting os.environ["CUDA_VISIBLE_DEVICES"] = "1" directly brought no improvement; the work is still mostly done on the CPU. How should the code be set up to train on the GPU and speed things up? Thank you!

    opened by young-yhx 2
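A common cause of this symptom (for both of the GPU issues above) is that CUDA_VISIBLE_DEVICES is set after TensorFlow has already been imported and has initialized its devices, at which point the setting is ignored. A minimal sketch of the required ordering:

```python
import os

# Device visibility must be configured before TensorFlow is first imported;
# changing CUDA_VISIBLE_DEVICES afterwards has no effect on device placement.
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

# import tensorflow as tf  # import TF (or keras/bert4keras) only after this
```

Also check that the installed package is tensorflow-gpu 1.14 rather than the CPU-only build; otherwise no configuration will move the work onto the GPU.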
  • Prediction is not running on the GPU

    When I run extract_model.py, the training stage does use the GPU, but in the evaluation stage model.predict does not: the GPU's power draw never goes up, and training on 14 folds takes about the same time as predicting on 1 fold, so I suspect prediction is not on the GPU. By contrast, seq2seq_convert.py, which is pure prediction, does run on the GPU and is very fast.

    Any guidance would be much appreciated!

    opened by JJack0812 2
  • Position embeddings not found when loading NEZHA-base

    Traceback (most recent call last):
      File "final.py", line 12, in <module>
        import extract_vectorize as vectorize
      File "/mnt/data/liuts/competition/cail-2020/SPACES/extract_vectorize.py", line 35, in <module>
        nezha_checkpoint_path,
      File "/mnt/data/liuts/competition/cail-2020/venv/lib/python3.6/site-packages/bert4keras/models.py", line 2297, in build_transformer_model
        transformer.load_weights_from_checkpoint(checkpoint_path)
      File "/mnt/data/liuts/competition/cail-2020/venv/lib/python3.6/site-packages/bert4keras/models.py", line 255, in load_weights_from_checkpoint
        values = [self.load_variable(checkpoint, v) for v in variables]
      File "/mnt/data/liuts/competition/cail-2020/venv/lib/python3.6/site-packages/bert4keras/models.py", line 255, in <listcomp>
        values = [self.load_variable(checkpoint, v) for v in variables]
      File "/mnt/data/liuts/competition/cail-2020/venv/lib/python3.6/site-packages/bert4keras/models.py", line 649, in load_variable
        variable = super(BERT, self).load_variable(checkpoint, name)
      File "/mnt/data/liuts/competition/cail-2020/venv/lib/python3.6/site-packages/bert4keras/models.py", line 232, in load_variable
        return tf.train.load_variable(checkpoint, name)
      File "/mnt/data/liuts/competition/cail-2020/venv/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 84, in load_variable
        return reader.get_tensor(name)
      File "/mnt/data/liuts/competition/cail-2020/venv/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 678, in get_tensor
        return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
    tensorflow.python.framework.errors_impl.NotFoundError: Key bert/embeddings/position_embeddings not found in checkpoint
    

    The pretrained model was downloaded from the links at https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/NEZHA-TensorFlow; I tried both the Baidu Netdisk and Google Drive mirrors, and the problem persists...

    opened by TianshangLiu 2
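The usual cause of this NotFoundError is that the checkpoint is being loaded as a standard BERT model: NEZHA uses relative position encodings, so its checkpoint contains no bert/embeddings/position_embeddings variable. In bert4keras the architecture is selected with model='nezha', so it is worth checking that the build_transformer_model call actually passes it. A sketch, where nezha_config_path and nezha_checkpoint_path are hypothetical variables pointing at the downloaded files:

```python
from bert4keras.models import build_transformer_model

# NEZHA has no absolute position embeddings, so loading its checkpoint with
# the default model='bert' fails; declare the architecture explicitly.
encoder = build_transformer_model(
    config_path=nezha_config_path,          # hypothetical path variables
    checkpoint_path=nezha_checkpoint_path,
    model='nezha',
)
```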
  • ValueError: high is out of bounds for int32

      File "E:/myproject/SPACES-sfzy/extract_convert.py", line 91, in <module>
        data = convert(data)
      File "E:/myproject/SPACES-sfzy/extract_convert.py", line 77, in convert
        max_queue_size=200
      File "D:\Anaconda3\envs\sfzy\lib\site-packages\bert4keras\snippets.py", line 159, in parallel_apply
        random_seeds = np.random.randint(0, 2**32, workers)
      File "mtrand.pyx", line 744, in numpy.random.mtrand.RandomState.randint
      File "_bounded_integers.pyx", line 1343, in numpy.random._bounded_integers._rand_int32
    ValueError: high is out of bounds for int32

    Hello, I hit this error when running the extraction code with your configuration, and a long search online turned up no good solution. Have you ever run into it? Any help would be appreciated.

    opened by More1999 2
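The overflow comes from np.random.randint(0, 2**32, workers) inside bert4keras's parallel_apply: on Windows, NumPy's default integer type is 32-bit, so the exclusive upper bound 2**32 is out of range. One workaround is to request a 64-bit dtype at that call site (a sketch of the idea, to be applied to the installed snippets.py):

```python
import numpy as np

workers = 4  # example worker count

# On Windows the default NumPy integer is 32-bit, so an exclusive upper bound
# of 2**32 overflows; asking for a 64-bit dtype sidesteps the limit.
random_seeds = np.random.randint(0, 2**32, workers, dtype='int64')
```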
  • parallel_apply fails in extract_convert.py

    Hi, when I run extract_convert.py it fails at the line data = convert(data), apparently because of multiprocessing. How should I fix this?

    Building prefix dict from the default dictionary ...
    Loading model from cache C:\Users\D00477~1\AppData\Local\Temp\jieba.cache
    Loading model cost 0.541 seconds.
    Prefix dict has been built successfully.
    转换数据:   0%|          | 0/4047 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "D:/2. 2021AI项目/5_RPA项目/8_自动摘要生成/5_CAIL2020/1_SPACES_pytorch/extract_convert.py", line 83, in <module>
        data = convert(data)
      File "D:/2. 2021AI项目/5_RPA项目/8_自动摘要生成/5_CAIL2020/1_SPACES_pytorch/extract_convert.py", line 69, in convert
        max_queue_size=200
      File "D:\2. 2021AI项目\5_RPA项目\8_自动摘要生成\5_CAIL2020\1_SPACES_pytorch\snippets.py", line 430, in parallel_apply
        return [d for i, d in generator]
      File "D:\2. 2021AI项目\5_RPA项目\8_自动摘要生成\5_CAIL2020\1_SPACES_pytorch\snippets.py", line 430, in <listcomp>
        return [d for i, d in generator]
      File "D:\2. 2021AI项目\5_RPA项目\8_自动摘要生成\5_CAIL2020\1_SPACES_pytorch\snippets.py", line 503, in parallel_apply_generator
        pool = Pool(workers, worker_step, (in_queue, out_queue))
      File "D:\software\anaconda\anaconda3\envs\pytorch\lib\multiprocessing\context.py", line 119, in Pool
        context=self.get_context())
      File "D:\software\anaconda\anaconda3\envs\pytorch\lib\multiprocessing\pool.py", line 174, in __init__
        self._repopulate_pool()
      File "D:\software\anaconda\anaconda3\envs\pytorch\lib\multiprocessing\pool.py", line 239, in _repopulate_pool
        w.start()
      File "D:\software\anaconda\anaconda3\envs\pytorch\lib\multiprocessing\process.py", line 105, in start
        self._popen = self._Popen(self)
      File "D:\software\anaconda\anaconda3\envs\pytorch\lib\multiprocessing\context.py", line 322, in _Popen
        return Popen(process_obj)
      File "D:\software\anaconda\anaconda3\envs\pytorch\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
        reduction.dump(process_obj, to_child)
      File "D:\software\anaconda\anaconda3\envs\pytorch\lib\multiprocessing\reduction.py", line 60, in dump
        ForkingPickler(file, protocol).dump(obj)
    AttributeError: Can't pickle local object 'parallel_apply_generator.<locals>.worker_step'

    opened by xdnjust 1
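This AttributeError is specific to Windows, where multiprocessing uses the 'spawn' start method: the worker function must be pickled and sent to each child process, and a function defined inside another function (like worker_step inside parallel_apply_generator) cannot be pickled. The general fix is to make the worker a module-level function, or to fall back to a single process on Windows. A minimal illustration of the picklable pattern, unrelated to the repository's code:

```python
from multiprocessing import Pool

def square(x):
    # Module-level functions are picklable and therefore work with the
    # Windows 'spawn' start method; nested local functions do not.
    return x * x

if __name__ == '__main__':
    with Pool(2) as pool:
        results = pool.map(square, [1, 2, 3])
```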
  • ValueError: invalid literal for int() with base 10: '-f'

    Running the test code

        from final import *
        summary = predict(text, topk=3)
        print(summary)

    produces the following error:

    ValueError                                Traceback (most recent call last)
    in <module>()
    ----> 1 from final import *
          2 summary = predict(text, topk=3)
          3 print(summary)

    1 frames
    /content/drive/My Drive/python_work/SPACES/extract_model.py in <module>()
         27     fold = 0
         28 else:
    ---> 29     fold = int(sys.argv[1])
         30
         31

    ValueError: invalid literal for int() with base 10: '-f'

    Environment: Colab, tensorflow==1.14, bert4keras==0.9.8, keras==2.3.1

    opened by ilray88 1
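The root cause is that notebook kernels (Colab/Jupyter) put their own flags, such as -f followed by a connection file, into sys.argv, so extract_model.py's int(sys.argv[1]) receives '-f'. A defensive version of that argument parse, falling back to fold 0 when the argument is missing or not an integer (a sketch of the idea, not the repository's code):

```python
import sys

def parse_fold(argv):
    # Notebook kernels inject flags like '-f' into sys.argv; treat any
    # argument that is not a valid integer the same as no argument at all.
    try:
        return int(argv[1])
    except (IndexError, ValueError):
        return 0

fold = parse_fold(sys.argv)
```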
  • Memory issues

    When first running extract_convert.py, the program hung at the convert function. It turned out to be a memory problem: after lowering the thread/process count from 100 to 10, it ran successfully.

    With extract_model.py I still run out of memory and training hangs. For now I can only shrink the training set, which hurts the model's quality; next I will have to consider loading the data in batches.

    My machine has 16G of RAM. May I ask what your memory configuration is? Or could this not be a memory issue at all, but something else?

    opened by Jay2Coomzz 2
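Besides lowering the worker count, streaming the corpus in fixed-size batches rather than materializing the whole dataset keeps peak memory bounded, which is what "loading data in batches" amounts to. A generic sketch of such a loader (not the repository's code):

```python
def iter_batches(path, batch_size=16):
    """Yield lists of at most batch_size lines without holding the whole file."""
    batch = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            batch.append(line.rstrip('\n'))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final partial batch
```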
Owner

苏剑林 (Jianlin Su), science enthusiast