Kashgari is a production-level NLP transfer-learning framework built on top of tf.keras for text labeling and text classification; it includes Word2Vec, BERT, and GPT2 language embeddings.


Kashgari


Overview | Performance | Installation | Documentation | Contributing

🎉 🎉 🎉 We released the 2.0.0 version with TF2 Support. 🎉 🎉 🎉

If you use this project for your research, please cite:

@misc{Kashgari,
  author = {Eliyar Eziz},
  title = {Kashgari},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/BrikerMan/Kashgari}}
}

Overview

Kashgari is a simple and powerful NLP transfer-learning framework that lets you build a state-of-the-art model in 5 minutes for named entity recognition (NER), part-of-speech (PoS) tagging, and text-classification tasks.

  • Human-friendly. Kashgari's code is straightforward, well documented, and tested, which makes it very easy to understand and modify.
  • Powerful and simple. Kashgari lets you apply state-of-the-art natural language processing (NLP) models to your text, for tasks such as named entity recognition (NER), part-of-speech (PoS) tagging, and classification.
  • Built-in transfer learning. Kashgari ships with pre-trained BERT and Word2vec embedding models, which makes it very simple to use transfer learning when training your model (see the sketch after this list).
  • Fully scalable. Kashgari provides a simple, fast, and scalable environment for experimentation: train your models and try new approaches using different embeddings and model structures.
  • Production ready. Kashgari can export models in the SavedModel format for TensorFlow Serving, so you can deploy them directly to the cloud.
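
As a quick taste of the built-in transfer learning, here is a minimal sketch, assuming Kashgari 2.x, the bundled SMP2018ECDTCorpus, and a pre-trained BERT checkpoint already downloaded to a local folder (the path below is a placeholder):

    from kashgari.corpus import SMP2018ECDTCorpus
    from kashgari.embeddings import BertEmbedding
    from kashgari.tasks.classification import BiLSTM_Model

    # Load the bundled Chinese short-text classification corpus.
    train_x, train_y = SMP2018ECDTCorpus.load_data('train')
    valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')

    # Reuse a pre-trained BERT checkpoint as the embedding layer.
    # '<bert_model_folder>' is a placeholder for a local checkpoint path.
    embedding = BertEmbedding('<bert_model_folder>')

    model = BiLSTM_Model(embedding)
    model.fit(train_x, train_y, valid_x, valid_y, epochs=3)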

Our Goal

  • Academic users: easier experimentation to prove hypotheses without coding from scratch.
  • NLP beginners: learn how to build an NLP project with production-level code quality.
  • NLP developers: build a production-level classification/labeling model within minutes.

Performance

Performance reports are welcome.

| Task                     | Language | Dataset                   | Score |
| ------------------------ | -------- | ------------------------- | ----- |
| Named Entity Recognition | Chinese  | People's Daily NER Corpus | 95.57 |
| Text Classification      | Chinese  | SMP2018ECDTCorpus         | 94.57 |
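
A hedged sketch for reproducing the NER row above, assuming Kashgari 2.x and its bundled ChineseDailyNerCorpus; the model choice and epoch count here are illustrative, not the exact benchmark settings:

    from kashgari.corpus import ChineseDailyNerCorpus
    from kashgari.tasks.labeling import BiLSTM_CRF_Model

    # The People's Daily NER corpus ships with Kashgari.
    train_x, train_y = ChineseDailyNerCorpus.load_data('train')
    valid_x, valid_y = ChineseDailyNerCorpus.load_data('validate')
    test_x, test_y = ChineseDailyNerCorpus.load_data('test')

    model = BiLSTM_CRF_Model()
    model.fit(train_x, train_y, valid_x, valid_y, epochs=20)
    model.evaluate(test_x, test_y)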

Installation

The project is based on Python 3.6+, because it is 2019 and type hinting is cool.

| Backend          | pypi version                         | desc                  |
| ---------------- | ------------------------------------ | --------------------- |
| TensorFlow 2.1+  | pip install 'kashgari>=2.0.0'        | TF2.10+ with tf.keras |
| TensorFlow 1.14+ | pip install 'kashgari>=1.0.0,<2.0.0' | TF1.14+ with tf.keras |
| Keras            | pip install 'kashgari<1.0.0'         | keras version         |

Tutorials

Here is a set of quick tutorials to get you started with the library:

There are also articles and posts that illustrate how to use Kashgari:

Examples:

Contributors

Thanks goes to these wonderful people. And there are many ways to get involved. Start with the contributor guidelines and then check these open issues for specific tasks.

Comments
  • [Proposal] Migrate keras to tf.keras


    I am proposing to change keras to tf.keras for better performance, better serving, and TPU support. Maybe we should rewrite the whole project, clean up the code, add missing documents, and so on.

    Here are the features I am planning to add.

    1. Multi-GPU/TPU support
    2. Export model for Tensorflow Serving
    3. Fine-tuning ability for W2V and BERT
    opened by BrikerMan 94
  • [BUG] [tf.keras] BLSTM NER overfitting while 0.2.1 works just fine


    Check List

    Thanks for considering opening an issue. Before you submit your issue, please confirm these boxes are checked.

    Environment

    • OS [e.g. Mac OS, Linux]: Colab

    Issue Description

    I have tried the 0.2.1 version and the tf.keras version for a Chinese NER task, and found that the tf.keras version performs very badly. With 0.2.1 the validation loss decreases during training, but with tf.keras only the training loss decreases.


    0.2.1 performance

    Epoch 1/200
    41/41 [==============================] - 159s 4s/step - loss: 0.2313 - acc: 0.9385 - val_loss: 0.0699 - val_acc: 0.9772
    Epoch 2/200
    41/41 [==============================] - 277s 7s/step - loss: 0.0563 - acc: 0.9823 - val_loss: 0.0356 - val_acc: 0.9892
    Epoch 3/200
    41/41 [==============================] - 309s 8s/step - loss: 0.0361 - acc: 0.9887 - val_loss: 0.0243 - val_acc: 0.9928
    Epoch 4/200
    41/41 [==============================] - 242s 6s/step - loss: 0.0297 - acc: 0.9905 - val_loss: 0.0228 - val_acc: 0.9927
    Epoch 5/200
    41/41 [==============================] - 328s 8s/step - loss: 0.0252 - acc: 0.9920 - val_loss: 0.0196 - val_acc: 0.9938
    Epoch 6/200
     4/41 [=>............................] - ETA: 4:37 - loss: 0.0234 - acc: 0.9926
    

    tf.keras performance

    Epoch 1/200
    5/5 [==============================] - 5s 1s/step - loss: 2.3491 - acc: 0.9712
    42/42 [==============================] - 115s 3s/step - loss: 2.9824 - acc: 0.9171 - val_loss: 2.3491 - val_acc: 0.9712
    Epoch 2/200
    5/5 [==============================] - 4s 768ms/step - loss: 2.9726 - acc: 0.9822
    42/42 [==============================] - 107s 3s/step - loss: 0.1563 - acc: 0.9952 - val_loss: 2.9726 - val_acc: 0.9822
    Epoch 3/200
    5/5 [==============================] - 4s 773ms/step - loss: 3.0985 - acc: 0.9833
    42/42 [==============================] - 107s 3s/step - loss: 0.0482 - acc: 0.9994 - val_loss: 3.0985 - val_acc: 0.9833
    Epoch 4/200
    5/5 [==============================] - 4s 771ms/step - loss: 3.2479 - acc: 0.9833
    42/42 [==============================] - 107s 3s/step - loss: 0.0247 - acc: 0.9997 - val_loss: 3.2479 - val_acc: 0.9833
    Epoch 5/200
    5/5 [==============================] - 4s 766ms/step - loss: 3.3612 - acc: 0.9839
    42/42 [==============================] - 107s 3s/step - loss: 0.0156 - acc: 0.9998 - val_loss: 3.3612 - val_acc: 0.9839
    

    Reproduce

    Here is the Colab notebook for reproducing this issue.

    bug 
    opened by BrikerMan 39
  • [Question] Why do the labels predicted by model.predict contain [BOS] and [EOS]?


    I trained on my own data and the results are decent (0.85), but why do the predicted labels still contain [BOS] and [EOS]? In predict, the result should have had [BOS] and [EOS] removed when converting idx to labels, so how do they still appear in the final prediction labels?

    康 B-brand 元 E-brand 的 O 饼 B-category 干 [EOS]

    question 
    opened by js418 39
  • Export SavedModel for Serving


    from kashgari.corpus import SMP2018ECDTCorpus
    from kashgari.tasks.classification import BLSTMModel
    
    x_data, y_data = SMP2018ECDTCorpus.load_data()
    classifier = BLSTMModel()
    classifier.fit(x_data, y_data)
    # export saved model to ./savedmodels/<timestamp>/
    classifier.export('./savedmodels')
    
    saved_model_cli show --dir /path/to/saved_models/1559562438/ --tag_set serve --signature_def serving_default
    # Output:
    # The given SavedModel SignatureDef contains the following input(s):
    #  inputs['input:0'] tensor_info:
    #       dtype: DT_FLOAT
    #       shape: (-1, 15)
    #       name: input:0
    # The given SavedModel SignatureDef contains the following output(s):
    #   outputs['dense/Softmax:0'] tensor_info:
    #       dtype: DT_FLOAT
    #       shape: (-1, 32)
    #       name: dense/Softmax:0
    # Method name is: tensorflow/serving/predict
    
    tensorflow_model_server --rest_api_port=9000 --model_name=blstm --model_base_path=/path/to/saved_models/ --enable_batching=true
    # Output:
    # 2019-06-03 08:28:56.639941: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: blstm version: 1559562438}
    # 2019-06-03 08:28:56.645217: I tensorflow_serving/model_servers/server.cc:324] Running gRPC ModelServer at 0.0.0.0:8500 ...
    # 2019-06-03 08:28:56.647192: I tensorflow_serving/model_servers/server.cc:344] Exporting HTTP/REST API at:localhost:9000 ...
    
    curl -H "Content-type: application/json" -X POST -d '{"instances": [{"input:0": [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}]}'  "http://localhost:9000/v1/models/blstm:predict"
    # Output:
    # {
    #     "predictions": [[5.76590492e-06, 0.0334293731, 9.58459859e-05, 0.00066432351, 0.500331104, 0.0521887243, 0.000985755469, 0.000161868113, 0.00147783163, 0.0171929933, 0.00085421023, 0.00599030638, 1.79303879e-05, 0.00050331495, 3.7246391e-05, 3.13154237e-06, 0.0201187711, 0.000672292779, 0.000196203022, 4.57693459e-05, 2.69985958e-06, 8.66179619e-07, 1.03102286e-06, 3.53154815e-06, 0.0478210114, 0.00725555047, 0.000683069753, 0.262197495, 4.151143e-05, 0.046125982, 2.19863551e-07, 0.000894303957]
    #     ]
    # }
    
    opened by haoyuhu 30
  • No OpKernel was registered to support Op 'CudnnRNN' used by {{node bidirectional/CudnnRNN}}with these attrs:


    Problem description: the model was trained on a GPU and is about to be served online, but the serving environment is CPU-only, so the error below is raised. tf1.4, kashgari 0.5.

    Error: Failed to start server. Error: Unknown: 1 servable(s) did not become available: {{{name: tf_classification_model version: 1} due to error: Invalid argument: No OpKernel was registered to support Op 'CudnnRNN' used by {{node bidirectional/CudnnRNN}} with these attrs: [input_mode="linear_input", T=DT_FLOAT, direction="unidirectional", rnn_mode="lstm", is_training=true, seed2=0, _output_shapes=[[1815,?,128], [1,?,128], [1,?,128], ], dropout=0, seed=0]

    Training code:

    import logging

    import kashgari
    from kashgari.tasks.classification import Dropout_BiGRU_Model

    logging.basicConfig(level='DEBUG')

    model = Dropout_BiGRU_Model()
    model.fit(train_x, train_y, valid_x, valid_y, epochs=10, batch_size=256)
    kashgari.utils.convert_to_saved_model(model,
                                          model_path='tf_classification_model_bert_dropoutFRU_24000',
                                          version=1)
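
    For reference, errors of this class come from graphs built with the GPU-only CudnnRNN op, which has no CPU kernel. A minimal sketch of the usual workaround, assuming the installed version exposes the cuDNN-cell switch mentioned in the v0.2.0 release notes below (the exact flag name may differ between versions):

    import kashgari

    # Assumed flag: selects between CuDNN-backed cells (GPU-only CudnnRNN
    # ops) and the portable LSTM/GRU kernels, so the saved graph stays
    # CPU-servable.
    kashgari.config.use_cudnn_cell = False

    # ...then build, fit, and convert_to_saved_model exactly as above.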

    bug 
    opened by xxxxxxxxy 25
  • [BUG] AttributeError: 'tuple' object has no attribute 'layer'


    First of all, thanks for your impressive work and effort. I faced a problem when using your package in Google Colab. When I use the BERT embedding with this command: bert_embedding = BERTEmbedding(bert_model_path, task=kashgari.CLASSIFICATION, sequence_length=128) I get this error: AttributeError: 'tuple' object has no attribute 'layer'. Thanks in advance for your help.

    bug 
    opened by mahdisnapp 24
  • [Question] How do I save and load the best model?


    I want to load the best model file rather than the last epoch's model file, but ModelCheckpoint does not take effect here.

    Inside class ClassificationModel(BaseModel), I added a function:

    def load_weights(self, model_path):
            return self.model.load_weights(model_path)
    

    Then I call:

    early_stopping = EarlyStopping(monitor='val_loss',min_delta=0.01, patience=5, mode='min', verbose=1)
    reduce_lr = ReduceLROnPlateau(
            monitor='val_loss', factor=0.5, patience=5, min_lr=0.0001, verbose=2)
    bst_model_path = 'weight_%d.h5' % count
    checkpoint = ModelCheckpoint(bst_model_path, monitor='val_loss', mode='min',
                                           save_best_only=True, verbose=1, save_weights_only=True)
    callbacks = [checkpoint,reduce_lr,early_stopping]
    hist = model.fit(x_train,y_train,
                         validation_data=(x_val, y_val),
                         epochs=4, batch_size=512,
    #                      class_weight="auto",
    #                      callbacks=callbacks,
                         fit_kwargs={"callbacks":callbacks,"verbose":1}
                         
                         )
    model.load_weights(bst_model_path)
    

    But it says the weight_0.h5 file does not exist, which means ModelCheckpoint was never invoked.
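
    For later readers, a minimal sketch, assuming Kashgari 2.x where fit forwards a callbacks list directly (x_train/y_train/x_val/y_val are placeholders):

    from tensorflow.keras.callbacks import ModelCheckpoint
    from kashgari.tasks.classification import BiLSTM_Model

    model = BiLSTM_Model()
    checkpoint = ModelCheckpoint('best_weights.h5', monitor='val_loss', mode='min',
                                 save_best_only=True, save_weights_only=True, verbose=1)
    model.fit(x_train, y_train, x_validate=x_val, y_validate=y_val,
              epochs=20, callbacks=[checkpoint])

    # Restore the best epoch's weights into the underlying tf.keras model.
    model.tf_model.load_weights('best_weights.h5')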

    question 
    opened by chizhu 22
  • [Question] Multi-output model


    Hello dear @BrikerMan. Thanks for creating this good library. However, I want to perform multi-output classification (this is different from multi-label). My input is a sentence and the outputs are two fully-connected branches, where each one has its own labels (one input --> two branch networks).

    I want to know if Kashgari supports multi-output classification.

    input_text = [['This', 'news', 'are', 'very', 'well', 'organized'],
                  ['What', 'extremely', 'usefull', 'tv', 'show'],
                  ['The', 'tv', 'presenter', 'were', 'very', 'well', 'dress']]

    label_list1 = [1, 0, 1]
    label_list2 = [1, 1, 0]

    xdata = input_text
    ydata = [label_list1, label_list2]  # In Keras, for multi-output we pass a list of both labels.
    model.fit(xdata, ydata, epochs=1)
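
    For context, a minimal plain tf.keras sketch (outside Kashgari) of the one-input, two-head architecture described above; the vocabulary size, sequence length, and layer sizes are hypothetical:

    import tensorflow as tf
    from tensorflow.keras import layers

    vocab_size, max_len = 10000, 32  # hypothetical sizes

    inputs = layers.Input(shape=(max_len,), name='token_input')
    x = layers.Embedding(vocab_size, 100)(inputs)
    x = layers.Bidirectional(layers.LSTM(64))(x)

    # Two independent heads, each with its own binary label.
    out1 = layers.Dense(1, activation='sigmoid', name='label1')(x)
    out2 = layers.Dense(1, activation='sigmoid', name='label2')(x)

    model = tf.keras.Model(inputs, [out1, out2])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    # model.fit(token_ids, [label_list1, label_list2], epochs=1)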

    Thank you.

    question 
    opened by Qian16 19
  • [Feature request] Support custom features


    Hello everyone. First of all, I'm very impressed by the quality of this project. It is truly production ready!

    For one task I need a tagger that works with custom/hand-crafted features.

    For example, our OCR engine outputs rich formatting information, like italics, bold, and font size.

    Would it be possible to also utilize some hand-crafted features for training/tagging in Kashgari?

    enhancement 
    opened by bratao 18
  • [Question] How to serving model with the tensorflow serving


    The problem has been solved. Thanks to BrikerMan for his warm help; I hope this solution can save others some exploration time and let them better enjoy the convenience this project brings!

    I tried to save the trained model as a saved_model; the code is as follows:

    import tensorflow as tf
    from kashgari.tasks.seq_labeling import BLSTMCRFModel
    from keras import backend as K
    
    # K.set_learning_phase(1)
    # the key change
    K.set_learning_phase(0)
    
    model = BLSTMCRFModel.load_model('./model')
    legacy_init_op = tf.group(tf.tables_initializer())
    
    xmodel = model.model
    
    with K.get_session() as sess:
        export_path = './saved_model/14'
        builder = tf.saved_model.builder.SavedModelBuilder(export_path)
    
        signature_inputs = {
            'token_input': tf.saved_model.utils.build_tensor_info(xmodel.input[0]),
            'seg_input': tf.saved_model.utils.build_tensor_info(xmodel.input[1]),
        }
    
        signature_outputs = {
            tf.saved_model.signature_constants.CLASSIFY_OUTPUT_CLASSES: tf.saved_model.utils.build_tensor_info(
                xmodel.output)
        }
    
        classification_signature_def = tf.saved_model.signature_def_utils.build_signature_def(
            inputs=signature_inputs,
            outputs=signature_outputs,
            method_name=tf.saved_model.signature_constants.CLASSIFY_METHOD_NAME)
    
        builder.add_meta_graph_and_variables(
            sess,
            [tf.saved_model.tag_constants.SERVING],
            signature_def_map={
                'predict_webshell_php': classification_signature_def
            },
            legacy_init_op=legacy_init_op
        )
    
        builder.save()
    
    

    After saving successfully, I called the saved_model for prediction and the results were all 0. What could be the reason? The invocation code:

    import json
    
    import tensorflow as tf
    from tensorflow.python.saved_model import signature_constants
    from tensorflow.python.saved_model import tag_constants
    
    export_dir = './saved_model/14/'
    
    with open('./model/words.json', 'r', encoding='utf-8') as f:
        dict = json.load(f)
    
    s = ['[CLS]', '国', '正', '学', '长', '的', '文', '章', '与', '诗', '词', ',', '早', '就', '读', '过', '一', '些', ',', '很', '是', '喜',
         '欢', '。', '[CLS]']
    s1 = [dict[x] for x in s]
    if len(s1) < 100:
        s1 += [0] * (100 - len(s1))
    print(s1)
    s2 = [0] * 100
    
    with tf.Session() as sess:
        meta_graph_def = tf.saved_model.loader.load(sess, [tag_constants.SERVING], export_dir)
        signature = meta_graph_def.signature_def
    
        x1_tensor_name = signature['predict_webshell_php'].inputs['token_input'].name
        x2_tensor_name = signature['predict_webshell_php'].inputs['seg_input'].name
    
        y_tensor_name = signature['predict_webshell_php'].outputs[
            signature_constants.CLASSIFY_OUTPUT_CLASSES].name
        x1 = sess.graph.get_tensor_by_name(x1_tensor_name)
        x2 = sess.graph.get_tensor_by_name(x2_tensor_name)
        y = sess.graph.get_tensor_by_name(y_tensor_name)
        result = sess.run(y, feed_dict={x1: [s1], x2: [s2]})  # predicted values
        print(result.argmax(-1))
        print(result.shape)
    
    enhancement question 
    opened by phoenixkillerli 18
  • [BUG] Different behavior in 0.1.8 and 0.2.1


    Environment

    • Colab.research.google.com
    • Kashgari 0.1.8 / 0.2.1

    Issue Description

    Different behavior in 0.1.8 and 0.2.1. In Kashgari 0.1.8, BLSTMModel converges during training and I see val_acc: 0.98 and train acc: 0.9594. In Kashgari 0.2.1, BLSTMModel overfits and I see val_acc ~0.5 and train_acc ~0.96. There is no difference in my code, only different versions of the library.

    Reproduce

    code:

    from sklearn.model_selection import train_test_split
    import pandas as pd
    import nltk
    from kashgari.tasks.classification import BLSTMModel
    
    # get and process data
    !wget https://www.dropbox.com/s/265kphxkijj1134/fontanka.zip
    
    df1 = pd.read_csv('fontanka.zip')
    df1.fillna(' ', inplace = True)
    nltk.download('punkt')
    
    # split on train/test
    X_train, X_test, y_train, y_test = train_test_split(df1.full_text[:3570].values, df1.textrubric[:3570].values, test_size=0.2, random_state=42)
    X_train = [nltk.word_tokenize(sentence) for sentence in X_train]
    X_test  = [nltk.word_tokenize(sentence) for sentence in X_test]
    y_train = y_train.tolist()
    y_test  = y_test.tolist()
    
    # train model
    model = BLSTMModel()
    model.fit(X_train, y_train, x_validate=X_test, y_validate=y_test, epochs = 10)
    

    code in colab: https://colab.research.google.com/drive/1yTBMeiBl2y7-Yw0DS_vTn2A4y_Vj3N-8

    Result

    Last epoch:

    Kashgari 0.1.8

    Epoch 10/10 55/55 [==============================] - 90s 2s/step - loss: 0.1378 - acc: 0.9615 - val_loss: 0.0921 - val_acc: 0.9769

    Kashgari 0.2.1

    Epoch 10/10 44/44 [==============================] - 76s 2s/step - loss: 0.0990 - acc: 0.9751 - val_loss: 2.3739 - val_acc: 0.5323

    Other Comment

    In 0.2.1 all models are now in different files and the lr hyperparameter is given explicitly (1e-3). In 0.1.8 the lr hyperparameter was omitted; I suppose it used the Keras default, which is the same (1e-3).

    Also, in 0.1.8 you had (dense size = classes + 1 on the classifier) https://github.com/BrikerMan/Kashgari/issues/21 and omitted it in 0.2.1. I don't see how this could affect the training process.

    I couldn't find more differences between the versions. Could you help me figure out why models began to overfit in the new version of the library?

    bug wontfix 
    opened by kuilef 17
  • [BUG] A custom model with multiple feature inputs and multiple embeddings fails on model.fit. Which methods do I need to override to support this?


    You must follow the issue template and provide as much information as possible; otherwise, this issue will be closed.

    Check List

    Thanks for considering opening an issue. Before you submit your issue, please confirm these boxes are checked.

    You can post pictures, but if specific text or code is required to reproduce the issue, please provide the text in a plain text format for easy copy/paste.

    Environment

    • OS [e.g. Mac OS, Linux]: linux
    • Python Version: python3.6.12
    • kashgari: 2.0.2

    Issue Description

    I defined a custom model that takes multiple input features (words, part-of-speech tags, and named-entity categories). The word features are produced by BertEmbedding; the other features are initialized with BareEmbedding, and all of them are concatenated as the model input. The model definition itself is fine (I registered it in the corresponding tasks/labeling/__init__.py and can call it); the error appears at fit time.

    An extract of the custom model's code for testing (parameter definitions omitted); it is a sequence-labeling task:

    def __init__(self,
                 embedding: ABCEmbedding = None,
                 posembedding: ABCEmbedding = None,
                 nerembedding: ABCEmbedding = None,
                 **kwargs):
        super(BiLSTM_TEST_Model, self).__init__()
        self.embedding = embedding
        self.posembedding = posembedding
        self.nerembedding = nerembedding

    def build_model_arc(self) -> None:
        output_dim = self.label_processor.vocab_size

        config = self.hyper_parameters
        embed_model = self.embedding.embed_model
        embed_pos = self.posembedding.embed_model
        embed_ner = self.nerembedding.embed_model
    
        crf = KConditionalRandomField()
        bilstm = L.Bidirectional(L.LSTM(**config['layer_blstm']), name='layer_blstm')
        bilstm_dropout = L.Dropout(**config['layer_dropout'], name='layer_dropout')
        crf_dropout = L.Dropout(**config['layer_dropout'], name='crflayer_dropout')
        crf_dense = L.Dense(output_dim, **config['layer_time_distributed'])
    
        ## embeddings of the three features; start from the word embedding's output
        tensor_inputs = [embed_model.output]
        model_inputs = [embed_model.inputs]
        if embed_pos != None:
            tensor_inputs.append(embed_pos.output)
            model_inputs.append(embed_pos.inputs)
        if embed_ner != None:
            tensor_inputs.append(embed_ner.output)
            model_inputs.append(embed_ner.inputs)
                
        tensor_con = L.concatenate(tensor_inputs, axis=2)    ## concatenate all the features as the input
        bilstm_tensor = bilstm(tensor_con)
        bilstm_dropout_tensor = bilstm_dropout(bilstm_tensor)
        
        crf_dropout_tensor = crf_dropout(bilstm_dropout_tensor)
        crf_dense_tensor = crf_dense(crf_dropout_tensor)
        output = crf(crf_dense_tensor)
    
        self.tf_model = keras.Model(inputs=model_inputs, outputs=[output])
        self.crf_layer = crf
    

    The training code extract is as follows:

    def trainFunction(.....):
        bert_embed = BertEmbedding('./Data/路径', sequence_length=maxlength)
        pos_embed = BareEmbedding(embedding_size=32)
        ner_embed = BareEmbedding(embedding_size=32)

        selfmodel = BiLSTM_TEST_Model(bert_embed, pos_embed, ner_embed, sequence_length=maxlength)
        history = selfmodel.fit(x_train=(train_x, train_pos_x, train_ner_x,), y_train=train_y, 
                                x_validate=(valid_x, valid_pos_x, valid_ner_x), y_validate=valid_y, batch_size=batchsize, epochs=12)
    

    Reproduce

    The error message is as follows:

    File "/venv/lib/python3.6/site-packages/kashgari/tasks/labeling/abc_model.py", line 177, in fit
        fit_kwargs=fit_kwargs)
    File "/venv/lib/python3.6/site-packages/kashgari/tasks/labeling/abc_model.py", line 208, in fit_generator
        self.build_model_generator([g for g in [train_sample_gen, valid_sample_gen] if g])
    File "/venv/lib/python3.6/site-packages/kashgari/tasks/labeling/abc_model.py", line 85, in build_model_generator
        self.text_processor.build_vocab_generator(generators)
    File "/venv/lib/python3.6/site-packages/kashgari/processors/sequence_processor.py", line 84, in build_vocab_generator
        count = token2count.get(token, 0)
    TypeError: unhashable type: 'list'

    The failing location in kashgari's build_vocab_generator():

    def build_vocab_generator(self, generators: List[CorpusGenerator]) -> None:
        if not self.vocab2idx:
            vocab2idx = self._initial_vocab_dic

            token2count: Dict[str, int] = {}
    
            for gen in generators:
                for sentence, label in tqdm.tqdm(gen, desc="Preparing text vocab dict"):
                    if self.build_vocab_from_labels:
                        target = label
                    else:
                        target = sentence
                for token in target:      ## my input is a nested list, so each token here is itself a list, and that's where it raises
                        count = token2count.get(token, 0)
                        token2count[token] = count + 1
    

    Tracing in DEBUG: the x_train I pass to fit contains three parts, and the x in the generators built by CorpusGenerator is likewise three nested lists, so build_vocab_generator raises the error.

    Do I need to override build_vocab_generator? Besides that, my model takes three embedding models as input; does that mean I also have to define three sets of self.vocab2idx/idx2vocab? Are there any other places I need to redefine? I got dizzy following the debugger T_T

    Please help!!
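
    (A minimal sketch of one possible direction, not Kashgari's actual fix: assuming the word tokens arrive as the first element of each nested sample, a hypothetical processor subclass could build the text vocab from that stream only.)

    from typing import Dict

    import tqdm
    from kashgari.processors import SequenceProcessor

    class FirstStreamProcessor(SequenceProcessor):
        # Hypothetical override: count vocab only over the word tokens,
        # assumed to be the first element of each nested sample.
        def build_vocab_generator(self, generators) -> None:
            if self.vocab2idx:
                return
            token2count: Dict[str, int] = {}
            for gen in generators:
                for sentence, label in tqdm.tqdm(gen, desc="Preparing text vocab dict"):
                    words = sentence[0] if sentence and isinstance(sentence[0], list) else sentence
                    for token in words:
                        token2count[token] = token2count.get(token, 0) + 1
            # ...then merge token2count into self.vocab2idx on top of the
            # initial special tokens, as the base implementation does.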

    bug 
    opened by Zikangli 1
  • [Question]


    You must follow the issue template and provide as much information as possible; otherwise, this issue will be closed.

    Check List

    Thanks for considering opening an issue. Before you submit your issue, please confirm these boxes are checked.

    You can post pictures, but if specific text or code is required to reproduce the issue, please provide the text in a plain text format for easy copy/paste.

    • [ :heavy_check_mark: ] I have searched in existing issues but did not find the same one.
    • [ :heavy_check_mark: ] I have read the documents

    Environment

    • OS [e.g. Mac OS, Linux]: linux
    • Python Version: 3.9
    • requirements.txt:
     !pip install tensorflow==2.5
     !pip install tensorflow_addons==0.13.0
     !pip install kashgari==2.0.2
    

    Question: load_model error ('Keyword argument not understood:', 'center')

    I got this error when trying to load the checkpoints: TypeError: ('Keyword argument not understood:', 'center'). Note that I trained these checkpoints two years ago, so I tried the latest version of Kashgari (2.0.2) as well as the old version, but nothing worked. Any help?

    question 
    opened by OmarMohammed88 0
  • NER: the CNN+LSTM and BiGRU models share the same code


    The code is the same in layer_stack:

    layer_stack = [
        L.Bidirectional(L.GRU(**config['layer_bgru']), name='layer_bgru'),
        L.Dropout(**config['layer_dropout'], name='layer_dropout'),
        L.TimeDistributed(L.Dense(output_dim, **config['layer_time_distributed']),
                          name='layer_time_distributed'),
        L.Activation(**config['layer_activation'])
    ]

    question 
    opened by weil0258 0
  • [Question] The CNN-based text-classification models only reach 0.2 accuracy, regardless of dataset or hyperparameters


    You must follow the issue template and provide as much information as possible; otherwise, this issue will be closed.

    Check List

    Thanks for considering opening an issue. Before you submit your issue, please confirm these boxes are checked.

    You can post pictures, but if specific text or code is required to reproduce the issue, please provide the text in a plain text format for easy copy/paste.

    Environment

    • OS [e.g. Mac OS, Linux]: Win10
    • Python Version: 3.7
    • requirements.txt: TensorFlow 2.3, kashgari 2.0.1
    

    Question

    Whether I use SMP2018ECDTCorpus or my own dataset, the accuracy of the CNN-prefixed text-classification models is stuck; I have also tried changing the learning rate, the number of epochs, and other parameters, but nothing helps. I wonder whether these models themselves have a problem.

    from kashgari.corpus import SMP2018ECDTCorpus
    from kashgari.tasks.classification import CNN_Model
    from kashgari.callbacks import EvalCallBack

    import logging
    logging.basicConfig(level='DEBUG')

    train_x, train_y = SMP2018ECDTCorpus.load_data('train')
    valid_x, valid_y = SMP2018ECDTCorpus.load_data('valid')
    test_x, test_y = SMP2018ECDTCorpus.load_data('test')

    model = CNN_Model()
    model.fit(train_x, train_y, valid_x, valid_y, batch_size=64, epochs=14)
    model.evaluate(test_x, test_y, batch_size=64)

    Run output:

    2022-04-14 18:08:55,276 [DEBUG] kashgari - loaded 1881 samples from C:\Users\hwq45.kashgari\datasets\SMP2018ECDTCorpus\train.csv. Sample: x[0]: ['打', '开', '河', '南', '英', '东', '网', '站'] y[0]: website
    2022-04-14 18:08:55,280 [DEBUG] kashgari - loaded 418 samples from C:\Users\hwq45.kashgari\datasets\SMP2018ECDTCorpus\valid.csv. Sample: x[0]: ['来', '一', '首', ',', '灵', '岩', '。'] y[0]: poetry
    2022-04-14 18:08:55,284 [DEBUG] kashgari - loaded 770 samples from C:\Users\hwq45.kashgari\datasets\SMP2018ECDTCorpus\test.csv. Sample: x[0]: ['给', '曹', '广', '义', '打', '电', '话'] y[0]: telephone
    Preparing text vocab dict: 100%|██████████| 1881/1881 [00:00<00:00, 943831.30it/s]
    Preparing text vocab dict: 100%|██████████| 418/418 [00:00<00:00, 416936.76it/s]
    2022-04-14 18:08:55,291 [DEBUG] kashgari - --- Build vocab dict finished, Total: 875 ---
    2022-04-14 18:08:55,291 [DEBUG] kashgari - Top-10: ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '的', '么', '我', '。', '怎', '你']
    Preparing classification label vocab dict: 100%|██████████| 1881/1881 [00:00<?, ?it/s]
    Preparing classification label vocab dict: 100%|██████████| 418/418 [00:00<?, ?it/s]
    Calculating sequence length: 100%|██████████| 1881/1881 [00:00<00:00, 1894234.29it/s]
    Calculating sequence length: 100%|██████████| 418/418 [00:00<00:00, 419430.40it/s]
    2022-04-14 18:08:55,309 [DEBUG] kashgari - Calculated sequence length = 15
    2022-04-14 18:08:55,337 [DEBUG] kashgari - Model: "functional_43"

    Layer (type)                  Output Shape       Param #
    --------------------------------------------------------
    input (InputLayer)            [(None, None)]     0
    layer_embedding (Embedding)   (None, None, 100)  87500
    conv1d_6 (Conv1D)             (None, None, 128)  64128
    global_max_pooling1d_4 (Glob  (None, 128)        0
    dense_14 (Dense)              (None, 64)         8256
    dense_15 (Dense)              (None, 31)         2015
    activation_10 (Activation)    (None, 31)         0
    --------------------------------------------------------
    Total params: 161,899
    Trainable params: 161,899
    Non-trainable params: 0

    Epoch 1/14  29/29 [==============================] - 0s 8ms/step - loss: 3.3098 - accuracy: 0.1735 - val_loss: 3.1836 - val_accuracy: 0.1901
    Epoch 2/14  29/29 [==============================] - 0s 5ms/step - loss: 3.0778 - accuracy: 0.1992 - val_loss: 3.0883 - val_accuracy: 0.1953
    Epoch 3/14  29/29 [==============================] - 0s 4ms/step - loss: 3.0232 - accuracy: 0.1992 - val_loss: 3.0700 - val_accuracy: 0.2005
    Epoch 4/14  29/29 [==============================] - 0s 4ms/step - loss: 3.0164 - accuracy: 0.1987 - val_loss: 3.0591 - val_accuracy: 0.1901
    Epoch 5/14  29/29 [==============================] - 0s 4ms/step - loss: 3.0395 - accuracy: 0.1943 - val_loss: 3.0622 - val_accuracy: 0.1979
    Epoch 6/14  29/29 [==============================] - 0s 4ms/step - loss: 3.0327 - accuracy: 0.2003 - val_loss: 3.0659 - val_accuracy: 0.1875
    Epoch 7/14  29/29 [==============================] - 0s 4ms/step - loss: 3.0361 - accuracy: 0.1948 - val_loss: 3.0711 - val_accuracy: 0.1953
    Epoch 8/14  29/29 [==============================] - 0s 4ms/step - loss: 3.0347 - accuracy: 0.1987 - val_loss: 3.0581 - val_accuracy: 0.1901
    Epoch 9/14  29/29 [==============================] - 0s 4ms/step - loss: 3.0155 - accuracy: 0.1981 - val_loss: 3.0576 - val_accuracy: 0.2005
    Epoch 10/14 29/29 [==============================] - 0s 4ms/step - loss: 3.0415 - accuracy: 0.2036 - val_loss: 3.0651 - val_accuracy: 0.1953
    Epoch 11/14 29/29 [==============================] - 0s 4ms/step - loss: 3.0296 - accuracy: 0.1992 - val_loss: 3.0850 - val_accuracy: 0.1849
    Epoch 12/14 29/29 [==============================] - 0s 4ms/step - loss: 3.0132 - accuracy: 0.2053 - val_loss: 3.0643 - val_accuracy: 0.1953
    Epoch 13/14 29/29 [==============================] - 0s 4ms/step - loss: 3.0523 - accuracy: 0.1899 - val_loss: 3.0639 - val_accuracy: 0.2005
    Epoch 14/14 29/29 [==============================] - 0s 4ms/step - loss: 3.7734 - accuracy: 0.2075 - val_loss: 3.0653 - val_accuracy: 0.2031

    question wontfix 
    opened by hwq458362228 1
Releases(v2.0.2)
  • v2.0.2(Jul 4, 2021)

    • 🐛 Fixed Custom Model load issue.
    • 🐛 Fixed model save issue on Windows.
    • 🐛 Fixed multi-label model load issue.
    • 🐛 Fixed CRF model load issue.
    • 🐛 Fixed TensorFlow 2.3+ Support.
  • v2.0.1(Oct 30, 2020)

  • v2.0.0(Sep 10, 2020)

    This is a fully re-implemented version with TF2.

    • ✨ Embeddings
    • ✨ Text Classification Task
    • ✨ Text Labeling Task
    • ✨ Seq2Seq Task
    • ✨ Examples
      • ✨ Neural machine translation with Seq2Seq
      • ✨ Benchmarks
  • v1.1.5(Apr 25, 2020)

  • v1.1.4(Mar 30, 2020)

  • v1.1.3(Mar 29, 2020)

  • v1.1.2(Mar 27, 2020)

  • v1.1.1(Mar 13, 2020)

  • v1.1.0(Dec 27, 2019)

  • v1.0.0(Oct 18, 2019)

    Unfortunately, we renamed again for consistency and clarity. Here is the new naming style.

    | Backend          | pypi version   | desc            |
    | ---------------- | -------------- | --------------- |
    | TensorFlow 2.x   | kashgari 2.x.x | coming soon     |
    | TensorFlow 1.14+ | kashgari 1.x.x | current version |
    | Keras            | kashgari 0.x.x | legacy version  |

    If you are using the kashgari-tf version, you only need to run these commands to install the new version:

    pip uninstall -y kashgari-tf
    pip install kashgari
    

    Here is how the existing versions change:

    | Supported Backend | Kashgari Versions | Kashgari-tf Version |
    | ----------------- | ----------------- | ------------------- |
    | TensorFlow 2.x    | kashgari 2.x.x    | -                   |
    | TensorFlow 1.14+  | kashgari 1.0.1    | -                   |
    | TensorFlow 1.14+  | kashgari 1.0.0    | 0.5.5               |
    | TensorFlow 1.14+  | -                 | 0.5.4               |
    | TensorFlow 1.14+  | -                 | 0.5.3               |
    | TensorFlow 1.14+  | -                 | 0.5.2               |
    | TensorFlow 1.14+  | -                 | 0.5.1               |
    | Keras (legacy)    | kashgari 0.2.6    | -                   |
    | Keras (legacy)    | kashgari 0.2.5    | -                   |
    | Keras (legacy)    | kashgari 0.x.x    | -                   |

    • 💥Renaming pypi package name to kashgari.
    • ✨Allows custom average types, logs to an array for easy access to the last epoch.
    • ✨Add min_count parameter to the base_processor.
    • ✨Add disable_auto_summary config.
  • v0.5.4(Sep 30, 2019)

    • ✨ Add shuffle parameter to fit function (#249 )
    • ✨ Improved type hinting for the loaded model (#248)
    • 🐛 Fix loading models with CRF layers (#244, #228)
    • 🐛 Fix the configuration changes during embedding save/load (#224)
    • 🐛 Fix stacked embedding save/load (#224)
    • 🐛 Fix evaluate function where the list has int instead of str (#222)
    • 💥 Renaming model.pre_processor to model.processor
    • 🚨 Removing TensorFlow and numpy warnings
    • 📝 Add docs how to specify which CPU or GPU
    • 📝 Add docs how to compile model with custom optimizer
  • v0.5.3(Aug 11, 2019)

  • v0.5.2(Aug 10, 2019)

  • v0.5.1(Jul 15, 2019)

    • 📝 Rewrite documents with mkdocs
    • 📝 Add Chinese documents
    • ✨ Add predict_top_k_class for classification model to get predict probabilities (#146)
    • 🚸 Add label2idx, token2idx properties to Embeddings and Models
    • 🚸 Add tokenizer property for BERT Embedding. (#136)
    • 🚸 Add predict_kwargs for models predict() function
    • ⚡️ Change multi-label classification's default loss function to binary_crossentropy (#151)
  • v0.2.6(Jul 12, 2019)

  • v0.5.0(Jul 11, 2019)

    🎉🎉 tf.keras version 🎉🎉

    • 🎉 Rewrite Kashgari using tf.keras. Discussion: #77
    • 🎉 Rewrite Documents.
    • ✨ Add TPU support.
    • ✨ Add TF-Serving support.
    • ✨ Add advance customization support, like multi-input model.
    • 🐎 Performance optimization.
  • v0.2.4(Jun 6, 2019)

    • Add BERT output feature layer finetune support. Discussion: #103
    • Add BERT output feature layer number selection, default 4 according to BERT paper.
    • Fix BERT embedding token index offset issue #104.
  • v0.2.1(Mar 5, 2019)

  • v0.2.0(Mar 5, 2019)

    • multi-label classification for all classification models
    • support cuDNN cell for sequence labeling
    • add option for output BOS and EOS in sequence labeling result, fix #31
  • v0.1.9(Feb 28, 2019)

    • add the AVCNNModel, KMaxCNNModel, RCNNModel, AVRNNModel, DropoutBGRUModel, and DropoutAVRNNModel models to the classification task
    • fix several small bugs
  • v0.1.8(Feb 22, 2019)

  • v0.1.7(Feb 22, 2019)

    • remove class candidates filter to fix #16
    • overwrite init function in CustomEmbedding
    • add parameter check to custom_embedding layer
    • add keras-bert version to setup.py file
  • v0.1.6(Feb 4, 2019)

    • add output_dict, debug_info params to text_classification model
    • add output_dict, debug_info and chunk_joiner params to text_classification model
    • fix possible crash at data_generator
  • v0.1.5(Jan 31, 2019)

Owner
Eliyar Eziz
AI Specialist, Google ML GDE. Love NLP, Love Python.