fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.

Overview

fastNLP


fastNLP is a lightweight natural language processing (NLP) toolkit whose goal is to make it fast to implement NLP tasks and to build complex models.

fastNLP has the following features:

  • A unified tabular data container that simplifies data preprocessing;
  • Built-in Loaders and Pipes for many datasets, removing the need for preprocessing code;
  • Handy NLP utilities, such as embedding loading (including ELMo and BERT) and caching of intermediate data;
  • Automatic download of selected datasets and pre-trained models;
  • Many neural network components and reproduced models (covering Chinese word segmentation, named entity recognition, syntactic parsing, text classification, text matching, coreference resolution, summarization, and more);
  • A Trainer with many built-in Callbacks for experiment logging, exception capture, and more (a minimal usage sketch follows this list).
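
To give a feel for that workflow, here is a minimal end-to-end sketch using class names that appear elsewhere on this page (CSVLoader, Vocabulary, StaticEmbedding, CNNText, Trainer). The file path and pretrained-embedding name are illustrative, and exact arguments may vary between fastNLP versions:

import torch
from fastNLP import Vocabulary, Trainer, CrossEntropyLoss
from fastNLP.io import CSVLoader
from fastNLP.embeddings import StaticEmbedding
from fastNLP.models import CNNText

# Load a CSV with 'text' and 'target' columns; CSVLoader returns a bundle containing a 'train' DataSet.
bundle = CSVLoader().load('train.csv')                       # illustrative path
train_ds = bundle.datasets['train']
train_ds.apply(lambda ins: ins['text'].split(), new_field_name='words')
train_ds.apply(lambda ins: int(ins['target']), new_field_name='target')

# Build the vocabulary, index the words, and mark model inputs/targets.
vocab = Vocabulary()
vocab.from_dataset(train_ds, field_name='words')
vocab.index_dataset(train_ds, field_name='words')
train_ds.add_seq_len('words', new_field_name='seq_len')
train_ds.set_input('words', 'seq_len')
train_ds.set_target('target')

# Embedding + model + Trainer.
embed = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-100d')  # illustrative name
model = CNNText(embed, num_classes=2, dropout=0.1)
trainer = Trainer(train_data=train_ds, model=model, loss=CrossEntropyLoss(),
                  optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
                  batch_size=32, n_epochs=3,
                  device=0 if torch.cuda.is_available() else 'cpu')
trainer.train()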

Installation

fastNLP depends on the following packages:

  • numpy>=1.14.2
  • torch>=1.0.0
  • tqdm>=4.28.1
  • nltk>=3.4.1
  • requests
  • spacy
  • prettytable>=0.7.2

The installation of torch may depend on your operating system and CUDA version; please refer to the PyTorch website. Once the dependencies are installed, you can run the following commands to install fastNLP:

pip install fastNLP
python -m spacy download en

fastNLP Tutorials

Tutorials (Chinese documentation)

Quick Start

Detailed Tutorials

Extension Tutorials

Built-in Components

Most neural networks used for NLP tasks can be viewed as being composed of word embeddings plus two kinds of modules: encoders and decoders.

Taking text classification as an example, the figure below shows the pipeline of a BiLSTM+Attention text classifier:

In the embeddings package, fastNLP provides several kinds of built-in embeddings: static embeddings (GloVe, word2vec), contextual embeddings (ELMo, BERT), and character embeddings (CNN- or LSTM-based CharEmbedding).
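
For example, a minimal sketch of constructing these embedding types (the pretrained-model name strings below are illustrative shortcuts and may differ across versions):

from fastNLP import Vocabulary
from fastNLP.embeddings import StaticEmbedding, ElmoEmbedding, BertEmbedding, CNNCharEmbedding

vocab = Vocabulary()
vocab.add_word_lst("this is an example sentence".split())

static_embed = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-100d')  # GloVe / word2vec style
elmo_embed = ElmoEmbedding(vocab, model_dir_or_name='en')                    # contextual: ELMo
bert_embed = BertEmbedding(vocab, model_dir_or_name='en-base-uncased')       # contextual: BERT
char_embed = CNNCharEmbedding(vocab, embed_size=50)                          # CNN-based character embedding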

Meanwhile, the modules package ships with many components of these two kinds, helping users quickly assemble the networks they need. The roles of the two module types and common examples are listed below:

Type      Function                                                                      Examples
encoder   Encodes the input into a vector with representational power                  Embedding, RNN, CNN, Transformer, ...
decoder   Decodes a vector carrying some representation into the desired output form   MLP, CRF, ...
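
To make the table concrete, here is a minimal sketch of the embedding -> encoder -> decoder pattern, using plain PyTorch layers as stand-ins (fastNLP.modules provides analogous components; their exact constructors are not shown here):

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    # Embedding -> encoder (BiLSTM) -> decoder (MLP), the pattern described in the table above.
    def __init__(self, vocab_size, embed_dim=100, hidden=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.Linear(2 * hidden, num_classes)

    def forward(self, words):                        # words: [batch, seq_len]
        x = self.embed(words)                        # [batch, seq_len, embed_dim]
        x, _ = self.encoder(x)                       # [batch, seq_len, 2*hidden]
        return self.decoder(x.max(dim=1).values)     # pool over time, then classify

logits = TinyClassifier(vocab_size=1000)(torch.randint(0, 1000, (4, 12)))
print(logits.shape)  # torch.Size([4, 2])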

Project Structure

The overall workflow of fastNLP is shown in the figure above; the project is structured as follows:

fastNLP              The open-source NLP library
fastNLP.core         Core functionality: data handling components, the Trainer, the Tester, etc.
fastNLP.models       Complete neural network models
fastNLP.modules      Building blocks for assembling neural network models
fastNLP.embeddings   Mapping sequences of indices to sequences of vectors, including loading pre-trained embeddings
fastNLP.io           I/O: data loading and preprocessing, model saving/loading, and automatic download of data and models
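
As a quick orientation, the typical imports corresponding to these packages (names as used elsewhere on this page; availability depends on the installed version):

from fastNLP import DataSet, Vocabulary, Trainer, Tester        # re-exported from fastNLP.core
from fastNLP.embeddings import StaticEmbedding, BertEmbedding   # index sequences -> vector sequences
from fastNLP.models import CNNText                              # complete models
from fastNLP.io import CSVLoader, DataBundle                    # data loading/preprocessing and model I/O
# fastNLP.modules holds the lower-level encoder/decoder building blocks for custom models.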

In memory of @FengZiYjun. May his soul rest in peace. We will miss you very very much!

Issues
  • When will the complete star-transformer code be released? The experiments cannot be reproduced at all; results on the SST-5 dataset are off by 6 points

    Describe the bug: A clear and concise description of what the bug is.

    To Reproduce: We used your star-transformer code and trained it with allennlp (GloVe 42B word vectors). The final results are shown in the screenshot; they are 6 points below the results reported in the paper.

    Please explain, and please release the complete version of the code, i.e. one that can fully reproduce the reported results.

    Additional context: image

    opened by michael-wzhu 10
  • RuntimeError: CUDA error: device-side assert triggered

    Describe the bug: When using the Predictor to load a trained model, prediction fails with the error shown in the first screenshot. I have fixed this bug; see the project link below for details. Cause: after debugging, the bug turns out to occur when the new data being predicted contains characters that did not appear during training. bert_embedding.py reads the training-time vocab size and initializes a vocab vector of ones for mask prediction, so this vector is smaller than the actual size, which equals the training-time vocab size plus the number of new characters. The error is shown in the first image, and the bug location and fix in the second. image

    image

    To Reproduce: 1. Move test.txt, dev.txt and train.txt into the data directory (a directory you create yourself). 2. Run the fastNLP_trainer.py script. 3. Run the fastNLP_predictor.py script. 4. See error.

    项目链接:https://github.com/Chris-cbc/fastNLP_Bug_Report_And_Fix.git

    Expected behavior: image (the screenshot above shows the result after the bug is fixed)

    Desktop

    • OS: windows10
    • Python Version: 3.6

    Additional context: After confirming the fix, please send an email and @ my GitHub account so that I know how this bug ends up being fixed.

    opened by Chris-cbc 9
  • a new function for argparse

    We should provide a function for argument parsing so that we can support "python fastnlp.py --arg1 value1 --arg2 value2" and so on.

    In this way, what arguments should we have?
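
    One possible shape for such a helper, sketched with the standard library's argparse (the function name and the specific flags below are hypothetical):

    import argparse

    def parse_train_args(argv=None):
        # Hypothetical helper: parse flags such as
        # `python fastnlp.py --epochs 10 --batch_size 32 --lr 1e-3`.
        parser = argparse.ArgumentParser(description="fastNLP training options")
        parser.add_argument("--epochs", type=int, default=10)
        parser.add_argument("--batch_size", type=int, default=32)
        parser.add_argument("--lr", type=float, default=1e-3)
        parser.add_argument("--device", default="cpu")
        return parser.parse_args(argv)

    if __name__ == "__main__":
        args = parse_train_args()
        print(vars(args))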

    enhancement 
    opened by xuyige 8
  • Import error after installing fastNLP

    Installing fastNLP under Python 3.5 appears to succeed, but import fastNLP fails with: File "D:\anaconda\lib\site-packages\fastNLP\core\instance.py", line 40 f" type={(str(type(self.fields[field_name]))).split(s)[1]}" for field_name in self.fields) + "}" ^ SyntaxError: invalid syntax. Python 3.6 and Python 3.7 do not work either: installation completes, but the import then raises an error.

    opened by lovelyvivi 8
  • Default value for train args.

    https://github.com/fastnlp/fastNLP/blob/8a87807274735046a48be8eb4b1ca10801875039/fastNLP/core/trainer.py#L42-L45

    Should we set some default values for train_args? Otherwise we have to pass all of these args every time, which is very redundant.

    opened by keezen 7
  • Why does BertEmbedding require a vocab to be passed in?

    Doesn't BERT come with its own vocabulary? Can that vocabulary be loaded and used directly? And if the vocabulary is modified, wouldn't BERT's pre-trained weights largely lose their value?

    opened by onebula 7
  • Error in the example from the basic Trainer usage section

    While studying the Trainer section, I ran the code at the very beginning of that section, but the original example code raises an error:

    TypeError: can't convert np.ndarray of type numpy.int32. The only supported types are: float64, float32, float16, int64, int32, int16, int8, uint8, and bool.
    

    I tried using torch to generate the tensors directly in the data-generation part:

    import torch
    from fastNLP import DataSet

    def generate_psedo_dataset(num_samples):
        data=torch.randint(2,size=(num_samples,10))
        print(data.shape)
        list=[]
        for n in range(num_samples):
            label=torch.sum(data[n])%2
            list.append(label)
        list=torch.stack(list)
        dataset = DataSet({'x':data, 'label': list})
        dataset.set_input('x')
        dataset.set_target('label')
        return dataset
    tr_dataset=generate_psedo_dataset(1000)
    dev_dataset=generate_psedo_dataset(100)
    

    But during training the following error occurs:

    TypeError: issubclass() arg 1 must be a class
    

    Did I get the data generation wrong? How should the example code in the gitbook section be adjusted? torch: 1.2.0+cu92, fastNLP: 0.5.0

    opened by jwc19890114 6
  • Is loading an ELMo model into a sequence labeling task supported now?

    Is loading an ELMo model into a sequence labeling task supported now? If so, is there an example to refer to? If not, is it on the roadmap? Thanks!

    opened by Wanjun0511 6
  • How can a trained sequence labeling model be used to predict results?

    As the title says: is there a way, as in spaCy, to feed in a single sentence and get output in which the entities in the sentence are labeled?

    Or is there some other way to achieve a similar input/output behavior?

    opened by zxjlm 0
  • Question about how the character # is handled when reading data

    I noticed that lines starting with # are skipped when data is read, which seems to treat lines starting with # as comments. But what is being read are data files, not code, so there should be no need to handle comments.

    More importantly, if the data format contains lines that start with #, data that should have been read gets skipped. I think lines starting with # should be read as well.

    opened by tomatowithpotato 1
  • fitlog error

    Describe the bug: name 'fitlog' is not defined

    Screenshots

    image

    Desktop: CentOS 7

    opened by zhentaoCoding 1
  • Error with the official example

    Ref: https://github.com/fastnlp/fastNLP/issues/275

    I ran into the same situation. I re-downloaded the latest fastNLP (from GitHub) and uninstalled the pip version, but that did not help. I am using a custom dataset; my code is below, and I have commented the output as well. I tried two different training modes, commented as case 1 and case 2 in my code below:

    1. Using Trainer (Docs » Quick Start » Text Classification; source: https://fastnlp.readthedocs.io/zh/latest/tutorials/%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB.html)
    2. Writing the training loop myself with DataSetIter (source: https://fastnlp.readthedocs.io/zh/latest/tutorials/tutorial_6_datasetiter.html?highlight=model#id3)

    Hope it helps in reproducing my problem. Overall, I find fastNLP is well documented and the overall pipeline is good, so I would like to give it a try. Thanks!

    train0 = CSVLoader().load('train0')
    test0 = CSVLoader().load('test0')
    
    train0.apply(lambda ins: ins['text'].split(), new_field_name='words')
    test_0.apply(lambda ins: ins['text'].split(), new_field_name='words')
    # In total 1 datasets:
    #	train has 548 instances.
    
    train0.datasets['train']
    """
    +---------------------------+--------+---------------------------+
    | text                      | target | words                     |
    +---------------------------+--------+---------------------------+
    |  USER ' so much chuffe... | 0      | ['USER', "'", 'so', 'm... |
    |  heard you stealing cl... | 0      | ['heard', 'you', 'stea... |
    |  USER seriously if won... | 1      | ['USER', 'seriously', ... |
    | ...                       | ...    | ...                       |
    +---------------------------+--------+---------------------------+
    """
    train_0 = train0.datasets['train']
    test_0 = test0.datasets['train']
    
    vocab = Vocabulary()
    vocab.from_dataset(train_0, field_name='words', no_create_entry_dataset=[test_0])
    vocab.index_dataset(train_0, field_name='words')
     
    target_vocab = Vocabulary(padding=None, unknown=None)
    target_vocab.from_dataset(train_0, field_name='target', no_create_entry_dataset=[test_0])
    target_vocab.index_dataset(train_0, field_name='target')
     
    data_bundle = DataBundle()
    
    data_bundle.set_dataset(train_0, 'train')
    data_bundle.set_dataset(test_0, 'test')
    
    data_bundle.set_vocab(vocab, 'vocab')
    data_bundle.set_vocab(target_vocab, 'target_vocab')
    
    data_bundle.datasets['train'].add_seq_len('words', new_field_name='seq_len')
    data_bundle.datasets['test'].add_seq_len('words', new_field_name='seq_len')
    
    data_bundle.datasets['train'].set_input('words', 'seq_len')
    data_bundle.datasets['test'].set_input('words', 'seq_len')
    
    data_bundle.datasets['train'].set_target('target')
    
    device = 0 if torch.cuda.is_available() else 'cpu'
    # device = 0 if torch.cuda.is_available() else 'cpu'
    
    # case 1
    vocab = db.vocabs['vocab']
    embedding = StaticEmbedding(vocab, model_dir_or_name='.vector_cache/glove.6B.300d.txt')
    model = CNNText(embedding, len(db.get_vocab('target_vocab')), dropout=0.1)
    loss = CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=0.001)
    metric = AccuracyMetric()
    trainer = Trainer(train_data=db.get_dataset('train'), model=model, loss=loss, optimizer=optimizer, batch_size=5, dev_data=db.get_dataset('test'), metrics=metric, device=device)
    
    """
    Found 90623 out of 375077 words in the pre-training embedding.
    input fields after batch(if batch size is 2):
    	words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 9587]) 
    	seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 
    target fields after batch(if batch size is 2):
    	target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2]) 
    
    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-49-1ad8aa4147cf> in <module>
         10     trainer = Trainer(train_data=db.get_dataset('train'), model=model, loss=loss,
         11                   optimizer=optimizer, batch_size=5, dev_data=db.get_dataset('test'),
    ---> 12                   metrics=metric, device=device)
         13     ls.append(trainer.train())
         14     del trainer
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/fastNLP/core/trainer.py in __init__(self, train_data, model, optimizer, loss, batch_size, sampler, drop_last, update_every, num_workers, n_epochs, print_every, dev_data, metrics, metric_key, validate_every, save_path, use_tqdm, device, callbacks, check_code_level, fp16, **kwargs)
        555             _check_code(dataset=train_data, model=self.model, losser=losser, forward_func=self._forward_func, metrics=metrics,
        556                         dev_data=dev_dataset, metric_key=self.metric_key, check_level=check_code_level,
    --> 557                         batch_size=check_batch_size)
        558 
        559         self.train_data = train_data
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/fastNLP/core/trainer.py in _check_code(dataset, model, losser, metrics, forward_func, batch_size, dev_data, metric_key, check_level)
        999         tester = Tester(data=dev_data[:batch_size * DEFAULT_CHECK_NUM_BATCH], model=model, metrics=metrics,
       1000                         batch_size=batch_size, verbose=-1, use_tqdm=False)
    -> 1001         evaluate_results = tester.test()
       1002         _check_eval_results(metrics=evaluate_results, metric_key=metric_key, metric_list=metrics)
       1003 
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/fastNLP/core/tester.py in test(self)
        182                         _move_dict_value_to_device(batch_x, batch_y, device=self._model_device)
        183                         with self.auto_cast():
    --> 184                             pred_dict = self._data_forward(self._predict_func, batch_x)
        185                             if not isinstance(pred_dict, dict):
        186                                 raise TypeError(f"The return value of {_get_func_signature(self._predict_func)} "
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/fastNLP/core/tester.py in _data_forward(self, func, x)
        231         x = _build_args(func, **x)
    --> 232         y = self._predict_func_wrapper(**x)
        233         return y
        234 
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/fastNLP/models/cnn_text_classification.py in predict(self, words, seq_len)
    
    ---> 74         output = self(words, seq_len)
         75         _, predict = output[C.OUTPUT].max(dim=1)
         76         return {C.OUTPUT: predict}
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
        720             result = self._slow_forward(*input, **kwargs)
        721         else:
    --> 722             result = self.forward(*input, **kwargs)
        723         for hook in itertools.chain(
        724                 _global_forward_hooks.values(),
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/fastNLP/models/cnn_text_classification.py in forward(self, words, seq_len)
    
    ---> 57         x = self.embed(words)  # [N,L] -> [N,L,C]
         58         if seq_len is not None:
         59             mask = seq_len_to_mask(seq_len)
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
        720             result = self._slow_forward(*input, **kwargs)
        721         else:
    --> 722             result = self.forward(*input, **kwargs)
        723         for hook in itertools.chain(
        724                 _global_forward_hooks.values(),
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/fastNLP/embeddings/embedding.py in forward(self, words)
         71             mask = torch.bernoulli(mask).eq(1)  # dropout_word越大,越多位置为1
         72             words = words.masked_fill(mask, self.unk_index)
    ---> 73         words = self.embed(words)
         74         return self.dropout(words)
         75 
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
        720             result = self._slow_forward(*input, **kwargs)
        721         else:
    --> 722             result = self.forward(*input, **kwargs)
        723         for hook in itertools.chain(
        724                 _global_forward_hooks.values(),
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/fastNLP/embeddings/static_embedding.py in forward(self, words)
    
        332         if hasattr(self, 'words_to_words'):
    --> 333             words = self.words_to_words[words]
        334         words = self.drop_word(words)
        335         words = self.embedding(words)
    
    IndexError: too many indices for tensor of dimension 1
    
    """
    
    # case 2
    def train(epoch, data, devdata):
        optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
        lossfunc = torch.nn.CrossEntropyLoss()
        batch_size = 8
    
        # Define a Batch: pass in the DataSet and specify the batch_size and the batching strategy:
        # sequential (Sequential), random (Random), or grouping similar lengths into one batch (Bucket).
        train_sampler = BucketSampler(batch_size=batch_size, seq_len_field_name='seq_len')
        train_batch = DataSetIter(batch_size=batch_size, dataset=data, sampler=train_sampler)
    
        start_time = time.time()
        print("-"*5+"start training"+"-"*5)
        for i in range(epoch):
            loss_list = []
            for batch_x, batch_y in train_batch:
                optimizer.zero_grad()
                output = model(batch_x['words'])
                loss = lossfunc(output['pred'], batch_y['target'])
                loss.backward()
                optimizer.step()
                loss_list.append(loss.item())
    
        # If verbose is 0, the Tester's test() prints nothing and just returns the evaluation results; if 1, it also prints the validation results.
        # After calling the Tester's test(), its _format_eval_results(res) is called to print the validation results in a structured form.
            tester_tmp = Tester(devdata, model, metrics=AccuracyMetric(), verbose=0)
            res=tester_tmp.test()
    
            print('Epoch {:d} Avg Loss: {:.2f}'.format(i, sum(loss_list) / len(loss_list)),end=" ")
            print(tester_tmp._format_eval_results(res),end=" ")
            print('{:d}ms'.format(round((time.time()-start_time)*1000)))
            loss_list.clear()
    
    
    vocab = data_bundle.vocabs['vocab']
    embedding = StaticEmbedding(vocab, model_dir_or_name='.vector_cache/glove.6B.300d.txt')
    model = CNNText(embedding, len(data_bundle.get_vocab('target_vocab')), dropout=0.1)
        
    train(3, data_bundle.get_dataset('train'), data_bundle.get_dataset('test'))
    
    """
    Found 90623 out of 375077 words in the pre-training embedding.
    -----start training-----
    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
         8     train(3, data_bundle.get_dataset('train'), data_bundle.get_dataset('test'))
    <ipython-input-35-2386220b3070> in train(epoch, data, devdata)
         24         #在调用过Tester对象的test()函数后,调用其_format_eval_results(res)函数,结构化输出验证结果
         25         tester_tmp = Tester(devdata, model, metrics=AccuracyMetric(), verbose=0)
    ---> 26         res=tester_tmp.test()
         27 
         28         print('Epoch {:d} Avg Loss: {:.2f}'.format(i, sum(loss_list) / len(loss_list)),end=" ")
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/fastNLP/core/tester.py in test(self)
        182                         _move_dict_value_to_device(batch_x, batch_y, device=self._model_device)
        183                         with self.auto_cast():
    --> 184                             pred_dict = self._data_forward(self._predict_func, batch_x)
        185                             if not isinstance(pred_dict, dict):
        186                                 raise TypeError(f"The return value of {_get_func_signature(self._predict_func)} "
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/fastNLP/core/tester.py in _data_forward(self, func, x)
        231         x = _build_args(func, **x)
    --> 232         y = self._predict_func_wrapper(**x)
        233         return y
        234 
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/fastNLP/models/cnn_text_classification.py in predict(self, words, seq_len)
       
    ---> 74         output = self(words, seq_len)
         75         _, predict = output[C.OUTPUT].max(dim=1)
         76         return {C.OUTPUT: predict}
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
        720             result = self._slow_forward(*input, **kwargs)
        721         else:
    --> 722             result = self.forward(*input, **kwargs)
        723         for hook in itertools.chain(
        724                 _global_forward_hooks.values(),
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/fastNLP/models/cnn_text_classification.py in forward(self, words, seq_len)
    
    ---> 57         x = self.embed(words)  # [N,L] -> [N,L,C]
         58         if seq_len is not None:
         59             mask = seq_len_to_mask(seq_len)
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
        720             result = self._slow_forward(*input, **kwargs)
        721         else:
    --> 722             result = self.forward(*input, **kwargs)
        723         for hook in itertools.chain(
        724                 _global_forward_hooks.values(),
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/fastNLP/embeddings/embedding.py in forward(self, words)
         71             mask = torch.bernoulli(mask).eq(1)  # dropout_word越大,越多位置为1
         72             words = words.masked_fill(mask, self.unk_index)
    ---> 73         words = self.embed(words)
         74         return self.dropout(words)
         75 
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
        720             result = self._slow_forward(*input, **kwargs)
        721         else:
    --> 722             result = self.forward(*input, **kwargs)
        723         for hook in itertools.chain(
        724                 _global_forward_hooks.values(),
    
    ~/anaconda3/envs/venv/lib/python3.6/site-packages/fastNLP/embeddings/static_embedding.py in forward(self, words)
    
        332         if hasattr(self, 'words_to_words'):
    --> 333             words = self.words_to_words[words]
        334         words = self.drop_word(words)
        335         words = self.embedding(words)
    
    IndexError: too many indices for tensor of dimension 1
    
    """
    
    opened by hengee 0
  • What is the input format for Hierarchical Attention Network model using fastNLP framework

    I saw that there is a Hierarchical Attention Network model included in the directory: reproduction/text_classification/model/HAN.py. I realized that the input for HAN is different from other models (LSTM and CNN):

    HAN:        input_sents (torch.LongTensor) -- [batch_size, num_sents, seq_len]
    CNN / LSTM: words       (torch.LongTensor) -- [batch_size, seq_len]

    I would like to know how to formulate the input for HAN under the fastNLP framework, using the fastNLP DataSet and data loader (a shape sketch follows below).

    Thank you in advance!
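
    For reference, a minimal sketch (plain PyTorch only, not the fastNLP DataSet pipeline) of padding a batch of documents into the [batch_size, num_sents, seq_len] LongTensor that HAN expects; a token-to-index mapping is assumed to exist already:

    import torch

    def pad_documents(docs, pad_idx=0):
        # docs: list of documents; each document is a list of sentences,
        # each sentence a list of token indices. Returns [batch, num_sents, seq_len].
        num_sents = max(len(d) for d in docs)
        seq_len = max(len(s) for d in docs for s in d)
        out = torch.full((len(docs), num_sents, seq_len), pad_idx, dtype=torch.long)
        for i, doc in enumerate(docs):
            for j, sent in enumerate(doc):
                out[i, j, :len(sent)] = torch.tensor(sent, dtype=torch.long)
        return out

    # Two documents with different numbers of sentences and sentence lengths.
    batch = pad_documents([[[3, 7, 2], [5, 1]], [[4, 4, 4, 9]]])
    print(batch.shape)  # torch.Size([2, 2, 4])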

    opened by hengee 3
  • Question about _BertWordModel

    Hi, why does the _BertWordModel class use the vocab built from the training and test data to rewrite and resize the BERT model's embedding? What advantage does this have over using the original BERT directly? As far as I can see, it only shrinks the vocabulary used in training; what benefit does that bring?

    opened by mx8435 5
  • A problem encountered when using StaticEmbedding

    When using StaticEmbedding, I also need the final model to be exportable to TorchScript for an online C++ service, and I ran into a problem. When building the vocabulary with vocab.from_dataset, if the officially recommended no_create_entry_dataset parameter is used, the exported weight matrix ends up smaller than the vocab. After I assign those weights to a native torch nn.Embedding, the indices of some out-of-range tokens can no longer be resolved. Looking at the source, the vocab object passed into StaticEmbedding cannot be updated; if it could be updated so that its size matched the weight matrix, then using this smaller vocab to index tokens later would avoid the problem.

    Due to display issues, I have put the test example code in the attachment; a more detailed description of the problem is also in the file: test.txt

    opened by mahatmaWM 1
  • Help with an NER sequence labeling task

    Dear maintainers: I have just started learning fastNLP. After working through the data loading, model training and testing in your Quick Start - Sequence Labeling tutorial, I am stuck on how to actually use the trained model. I have two specific questions:

    1. If I want to add my own named-entity-annotated samples, what steps should I take? Is there example code to refer to? (Additional note: after saving the data produced by WeiboNERPipe in the tutorial, I found that in the Target column non-entities are labeled 0 while entities get other values, with clear differences between the values; I can only guess that this distinguishes different entity types. Could you document the labeling standard or method for the training set, or point me to a reference?)
    2. How should I use the model trained in the tutorial? For example, if I want to run NER on the string "电力系统主要由发电厂、电力网以及用户三个部分组成。", which operation, function, method or object should I use? (Additional note: I tried the Predictor object provided by fastNLP, but it keeps raising "Tensors must have same number of dimensions: got 2 and 3".)

    Please take the time to answer; if these questions are out of scope here, please point me to the relevant reference material. An NLP beginner, hoping for an answer.

    opened by cdzhjk001 5
  • Request: prediction/inference on unlabeled datasets

    Thank you very much for providing this convenient and fast NLP toolkit; with fastNLP, NLP tasks can be implemented quickly. Looking through the fastNLP documentation, I found that the Tester only supports labeled data, and I could not find any functionality for prediction/inference on unlabeled datasets. I hope this feature can be added. Looking forward to your reply!

    opened by WuSiQingChun 2
  • How should a Pipe for CSV files be handled?

    The documentation only has CSVLoader; there is no CSVPipe class. My CSV has just two columns, and the text classification example uses sentiment classification data that also has two columns, so I want to ask whether I can use it directly. (My uncertainty is that in the sentiment data the first column is the target, a number, whereas my first column is all text.)

    enhancement 
    opened by yamonc 6
Releases (v0.6.0)

Owner
fastNLP: an open-source NLP project initiated by the natural language processing (NLP) group at Fudan University.