NLP and Text Generation Experiments in TensorFlow 2.x / 1.x

Last update: Nov 14, 2022


Overview

	The code was run on Google Colab; thanks to Google for providing the computational resources.

Natural Language Processing
- Text Classification
  - IMDB (ENG)
  - CLUE Emotion Analysis Dataset (CHN)
- Text Matching
  - SNLI (ENG)
  - WeBank Intelligent Customer Service (微众银行智能客服) (CHN)
  - Ant Financial Semantic Similarity (蚂蚁金融语义相似度) (CHN)
- Intent Detection and Slot Filling
  - ATIS (ENG)
- Retrieval Dialog
  - ElasticSearch
    - Sparse Retrieval
    - Dense Retrieval
- Generative Dialog
  - Large-scale Chinese Conversation Dataset (CHN)
- Multi-turn Dialogue Rewriting
  - 20k Tencent AI R&D data (20k 腾讯 AI 研发数据) (CHN)
- Semantic Parsing
  - Facebook's Hierarchical Task Oriented Dialog (ENG)
- Multi-hop Question Answering
  - bAbI (ENG)
- Text Visualization
  - Topic Modelling
  - Explain Prediction
Knowledge Graph
- Knowledge Graph Completion
- Knowledge Base Question Answering
Recommender System
- Movielens 1M (English Data)

Text Classification

└── finch/tensorflow2/text_classification/imdb
	│
	├── data
	│   └── glove.840B.300d.txt          # pretrained embedding, download and put here
	│   └── make_data.ipynb              # step 1. make data and vocab: train.txt, test.txt, word.txt
	│   └── train.txt  		     # incomplete sample, format <label, text> separated by \t 
	│   └── test.txt   		     # incomplete sample, format <label, text> separated by \t
	│   └── train_bt_part1.txt  	     # (back-translated) incomplete sample, format <label, text> separated by \t
	│
	├── vocab
	│   └── word.txt                     # incomplete sample, list of words in vocabulary
	│	
	└── main
		└── sliced_rnn.ipynb         # step 2: train and evaluate model
		└── ...

Task: IMDB（English Data）

  Training Data: 25000, Testing Data: 25000, Labels: 2

<Notebook>: Make Data & Vocabulary
- <Text File>: Data Example
- <Text File>: Data Example (Back-Translated)
```
 Back-translation (English -> French -> English) doubles the training data from 25000 to 50000
```
- <Text File>: Vocabulary Example
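
The back-translation step above can be sketched as follows. This is a minimal illustration, not the code in `make_data.ipynb`: the translator callables `to_pivot` and `from_pivot` are hypothetical placeholders for whatever English -> French and French -> English system is used, and the helper names are invented for this example.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (label, text), as in train.txt


def back_translate(text: str,
                   to_pivot: Callable[[str], str],
                   from_pivot: Callable[[str], str]) -> str:
    """Translate text into a pivot language and back again."""
    return from_pivot(to_pivot(text))


def augment(examples: List[Example],
            to_pivot: Callable[[str], str],
            from_pivot: Callable[[str], str]) -> List[Example]:
    """Return the original examples followed by one paraphrase of each.

    The label is copied unchanged, so N examples become 2 * N.
    """
    paraphrased = [(label, back_translate(text, to_pivot, from_pivot))
                   for label, text in examples]
    return examples + paraphrased


def to_tsv_lines(examples: List[Example]) -> List[str]:
    """Serialize examples in the <label, text> tab-separated format."""
    return [f"{label}\t{text}" for label, text in examples]


if __name__ == "__main__":
    # Stand-in "translators" for demonstration only (assumption): a real
    # pipeline would call actual translation models here.
    fake_en_to_fr = str.upper
    fake_fr_to_en = str.lower
    data = [("1", "A great movie"), ("0", "Rather dull")]
    for line in to_tsv_lines(augment(data, fake_en_to_fr, fake_fr_to_en)):
        print(line)
```

Passing the translators in as callables keeps the doubling logic independent of any particular translation service.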

Model: TF-IDF + Logistic Regression (Sklearn)

Logistic Regression	Binary TF	NGram Range	Knowledge Dist	Testing Accuracy
<Notebook>	False	(1, 1)	False	88.3%
<Notebook>	True	(1, 1)	False	88.8%
<Notebook>	True	(1, 2)	False	89.6%
<Notebook>	True	(1, 2)	True	90.7%

-> PySpark Equivalent
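
To make the "Binary TF" and "NGram Range" columns concrete, here is a small pure-Python sketch of the feature computation. It is an illustration under assumptions, not the notebook code: it uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and skips the vector normalization that a library implementation would typically apply.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def ngrams(tokens: List[str], ngram_range: Tuple[int, int]) -> List[str]:
    """All n-grams with n in [lo, hi], e.g. (1, 2) gives unigrams and bigrams."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf(docs: List[List[str]],
          ngram_range: Tuple[int, int] = (1, 1),
          binary: bool = False) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors, one dict per document.

    binary=True replaces the raw term count by 1, so a term repeated
    ten times weighs the same as a term that appears once.
    """
    grams = [ngrams(doc, ngram_range) for doc in docs]
    doc_freq = Counter(g for doc in grams for g in set(doc))
    n_docs = len(docs)
    vectors = []
    for doc in grams:
        vec = {}
        for gram, count in Counter(doc).items():
            tf = 1.0 if binary else float(count)
            idf = math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1.0
            vec[gram] = tf * idf
        vectors.append(vec)
    return vectors


if __name__ == "__main__":
    docs = [["good", "good", "movie"], ["bad", "movie"]]
    print(tfidf(docs, ngram_range=(1, 2), binary=True)[0])
```

The resulting sparse vectors would then be fed to a linear classifier such as logistic regression.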

Model: FastText, CNN and RNN

Code	Model	Testing Accuracy
<Notebook>	FastText (Unigram)	87.3%
<Notebook>	FastText (Bigram)	89.8%
<Notebook>	FastText (AutoTune)	90.1%
<Notebook>	TextCNN	91.8%
<Notebook>	Sliced RNN	92.6%

Model: Large-scale Transformer

TensorFlow 2 + transformers

Code	Model	Batch Size	Max Length	Testing Accuracy
<Notebook>	BERT	32	128	92.6%
<Notebook>	BERT	16	200	93.3%
<Notebook>	BERT	12	256	93.8%
<Notebook>	BERT	8	300	94%
<Notebook>	RoBERTa	8	300	94.7%

└── finch/tensorflow2/text_classification/clue
	│
	├── data
	│   └── make_data.ipynb              # step 1. make data and vocab
	│   └── train.txt  		     # download from clue benchmark
	│   └── test.txt   		     # download from clue benchmark
	│
	├── vocab
	│   └── label.txt                    # list of emotion labels
	│	
	└── main
		└── bert_finetune.ipynb      # step 2: train and evaluate model
		└── ...

Task: CLUE Emotion Analysis Dataset（Chinese Data）

  Training Data: 31728, Testing Data: 3967, Labels: 7

Model: TF-IDF + Linear Model

Logistic Regression	Binary TF	NGram Range	Split By	Testing Accuracy
<Notebook>	False	(1, 1)	Char	57.4%
<Notebook>	True	(1, 1)	Word	57.7%
<Notebook>	False	(1, 2)	Word	57.8%
<Notebook>	False	(1, 1)	Word	58.3%
<Notebook>	True	(1, 2)	Char	59.1%
<Notebook>	False	(1, 2)	Char	59.4%

Model: Deep Model

Code	Model	Env	Testing Accuracy
<Notebook>	BERT	TF2	61.7%
<Notebook>	BERT + TAPT (<Notebook>)	TF2	62.3%

Text Matching

└── finch/tensorflow2/text_matching/snli
	│
	├── data
	│   └── glove.840B.300d.txt       # pretrained embedding, download and put here
	│   └── download_data.ipynb       # step 1. run this to download snli dataset
	│   └── make_data.ipynb           # step 2. run this to generate train.txt, test.txt, word.txt 
	│   └── train.txt  		  # incomplete sample, format <label, text1, text2> separated by \t 
	│   └── test.txt   		  # incomplete sample, format <label, text1, text2> separated by \t
	│
	├── vocab
	│   └── word.txt                  # incomplete sample, list of words in vocabulary
	│	
	└── main              
		└── dam.ipynb      	  # step 3. train and evaluate model
		└── esim.ipynb      	  # step 3. train and evaluate model
		└── ......

Task: SNLI（English Data）

  Training Data: 550152, Testing Data: 10000, Labels: 3

Code	Reference	Env	Testing Accuracy
<Notebook>	DAM	TF2	85.3%
<Notebook>	Match Pyramid	TF2	87.1%
<Notebook>	ESIM	TF2	87.4%
<Notebook>	RE2	TF2	87.7%
<Notebook>	RE3	TF2	88.3%
<Notebook>	BERT	TF2	90.4%
<Notebook>	RoBERTa	TF2	91.1%

└── finch/tensorflow2/text_matching/chinese
	│
	├── data
	│   └── make_data.ipynb           # step 1. run this to generate char.txt and char.npy
	│   └── train.csv  		  # incomplete sample, format <text1, text2, label> separated by comma 
	│   └── test.csv   		  # incomplete sample, format <text1, text2, label> separated by comma
	│
	├── vocab
	│   └── cc.zh.300.vec             # pretrained embedding, download and put here
	│   └── char.txt                  # incomplete sample, list of chinese characters
	│   └── char.npy                  # saved pretrained embedding matrix for this task
	│	
	└── main              
		└── pyramid.ipynb      	  # step 2. train and evaluate model
		└── esim.ipynb      	  # step 2. train and evaluate model
		└── ......

Task: WeBank Intelligent Customer Service (微众银行智能客服) (Chinese Data)

  Training Data: 100000, Testing Data: 10000, Labels: 2, Balanced

<Notebook>: Make Data & Vocabulary
- <Text File>: Data Example
- <Text File>: Vocabulary
Model (results are comparable to this benchmark, which uses the same dataset)

Code	Reference	Env	Split by	Testing Accuracy
<Notebook>	RE2	TF2	Word	82.5%
<Notebook>	ESIM	TF2	Char	82.5%
<Notebook>	Match Pyramid	TF2	Char	82.7%
<Notebook>	RE2	TF2	Char	83.8%
<Notebook>	BERT	TF2	Char	83.8%
<Notebook>	BERT-wwm	TF1 + bert4keras	Char	84.75%

└── finch/tensorflow2/text_matching/ant
	│
	├── data
	│   └── make_data.ipynb           # step 1. run this to generate char.txt and char.npy
	│   └── train.json           	  # incomplete sample, format <text1, text2, label> separated by comma 
	│   └── dev.json   		  # incomplete sample, format <text1, text2, label> separated by comma
	│
	├── vocab
	│   └── cc.zh.300.vec             # pretrained embedding, download and put here
	│   └── char.txt                  # incomplete sample, list of chinese characters
	│   └── char.npy                  # saved pretrained embedding matrix for this task
	│	
	└── main              
		└── pyramid.ipynb      	  # step 2. train and evaluate model
		└── bert.ipynb      	  # step 2. train and evaluate model
		└── ......

Task: Ant Financial Semantic Similarity (蚂蚁金融语义相似度) (Chinese Data)

  Training Data: 34334, Testing Data: 4316, Labels: 2, Imbalanced

Code	Reference	Env	Split by	Testing Accuracy
<Notebook>	RE2	TF2	Char	66.5%
<Notebook>	Match Pyramid	TF2	Char	69.0%
<Notebook>	Match Pyramid + Joint Training	TF2	Char	70.3%
<Notebook>	BERT	TF2	Char	73.8%
<Notebook>	BERT + TAPT (<Notebook>)	TF2	Char	74.3%

Joint training

set data_1 = WeBank Intelligent Customer Service (size: 100000)
set data_2 = Ant Financial Semantic Similarity (size: 34334)
joint training (size: 100000 + 34334 = 134334)

BERT	train by data_1	train by data_2	joint train	joint train + TAPT
Code	<Notebook>	<Notebook>	<Notebook>	<Notebook>
data_1 accuracy	83.8%	-	84.4%	85.0%
data_2 accuracy	-	73.8%	74.0%	74.9%
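
A minimal sketch of how the joint training set could be assembled and how accuracy could still be reported per dataset. The function names and the toy `predict` callable are assumptions for illustration; the notebooks may mix the data differently (for example with per-dataset sampling).

```python
import random
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str, int]    # (text1, text2, label)
Tagged = Tuple[str, Pair]      # (source name, example)


def joint_training_set(data_1: List[Pair], data_2: List[Pair],
                       seed: int = 0) -> List[Tagged]:
    """Concatenate both datasets, tag each example with its source, shuffle."""
    tagged = ([("data_1", ex) for ex in data_1] +
              [("data_2", ex) for ex in data_2])
    random.Random(seed).shuffle(tagged)
    return tagged


def accuracy_by_source(tagged: List[Tagged],
                       predict: Callable[[str, str], int]) -> Dict[str, float]:
    """Accuracy computed separately for each source dataset."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for source, (text1, text2, label) in tagged:
        total[source] = total.get(source, 0) + 1
        correct[source] = correct.get(source, 0) + int(predict(text1, text2) == label)
    return {source: correct[source] / total[source] for source in total}


if __name__ == "__main__":
    d1 = [("a", "a", 1), ("a", "b", 0)]
    d2 = [("c", "c", 0)]
    mixed = joint_training_set(d1, d2)
    # Toy predictor (assumption): "similar" only when both texts are identical.
    print(accuracy_by_source(mixed, lambda t1, t2: int(t1 == t2)))
```

Keeping the source tag on every example is what allows one jointly trained model to be scored on data_1 and data_2 separately, as in the table above.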

Intent Detection and Slot Filling

└── finch/tensorflow2/spoken_language_understanding/atis
	│
	├── data
	│   └── glove.840B.300d.txt           # pretrained embedding, download and put here
	│   └── make_data.ipynb               # step 1. run this to generate vocab: word.txt, intent.txt, slot.txt 
	│   └── atis.train.w-intent.iob       # incomplete sample, format <text, slot, intent>
	│   └── atis.test.w-intent.iob        # incomplete sample, format <text, slot, intent>
	│
	├── vocab
	│   └── word.txt                      # list of words in vocabulary
	│   └── intent.txt                    # list of intents in vocabulary
	│   └── slot.txt                      # list of slots in vocabulary
	│	
	└── main              
		└── bigru_clr.ipynb               # step 2. train and evaluate model
		└── ...

Task: ATIS（English Data）

  Training Data: 4978, Testing Data: 893

Code	Model	Helper	Env	Intent Accuracy	Slot Micro-F1
<Notebook>	CRF	-	crfsuite	-	92.6%
<Notebook>	Bi-GRU	-	TF2	97.4%	95.4%
<Notebook>	Bi-GRU	+ CRF	TF2	97.2%	95.8%
<Notebook>	Transformer	-	TF2	96.5%	95.5%
<Notebook>	Transformer	+ Time Weighting	TF2	97.2%	95.6%
<Notebook>	Transformer	+ Time Mixing	TF2	97.5%	95.8%
<Notebook>	Bi-GRU	+ ELMO	TF1	97.5%	96.1%
<Notebook>	Bi-GRU	+ ELMO + CRF	TF1	97.3%	96.3%

Retrieval Dialog

Task: Build a chatbot answering fundamental questions

Code	Engine	Encoder	Vector Type	Unit Test Accuracy
<Notebook>	Elastic Search	Default (TF-IDF)	Sparse	80%
<Notebook>	Elastic Search	Default (TF-IDF) + Segmentation	Sparse	90%
<Notebook>	Elastic Search	Bert	Dense	80%
<Notebook>	Elastic Search	Universal Sentence Encoder	Dense	100%
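
The sparse/dense distinction in the table can be sketched without Elasticsearch. The code below is an in-memory stand-in, not the notebook code and not an Elasticsearch API: `sparse_encode` is a plain bag-of-words encoder, and a dense encoder (for example a sentence-embedding model) could be plugged into `retrieve` as long as it returns a vector as a dict. The toy FAQ is invented for the example.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def sparse_encode(text: str) -> Vector:
    """Bag-of-words term counts; every unseen term is implicitly zero."""
    return {term: float(n) for term, n in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors stored as dicts."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values())) *
            math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str,
             faq: List[Tuple[str, str]],
             encode: Callable[[str], Vector]) -> str:
    """Return the answer whose stored question is most similar to the query."""
    q_vec = encode(query)
    _, best_answer = max(faq, key=lambda qa: cosine(q_vec, encode(qa[0])))
    return best_answer


def unit_test_accuracy(cases: List[Tuple[str, str]],
                       faq: List[Tuple[str, str]],
                       encode: Callable[[str], Vector]) -> float:
    """Fraction of (query, expected answer) cases answered correctly."""
    hits = sum(retrieve(query, faq, encode) == expected for query, expected in cases)
    return hits / len(cases)


if __name__ == "__main__":
    faq = [("what is your name", "I am a bot"),
           ("how is the weather today", "Sunny")]
    print(retrieve("tell me the weather", faq, sparse_encode))
```

With a sparse encoder, a query only matches questions that share surface tokens with it; a dense encoder can also match paraphrases, which is one plausible reason for the differences in the accuracy column.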

Semantic Parsing

└── finch/tensorflow2/semantic_parsing/tree_slu
	│
	├── data
	│   └── glove.840B.300d.txt     	# pretrained embedding, download and put here
	│   └── make_data.ipynb           	# step 1. run this to generate vocab: word.txt, intent.txt, slot.txt 
	│   └── train.tsv   		  	# incomplete sample, format <text, tokenized_text, tree>
	│   └── test.tsv    		  	# incomplete sample, format <text, tokenized_text, tree>
	│
	├── vocab
	│   └── source.txt                	# list of words in vocabulary for source (of seq2seq)
	│   └── target.txt                	# list of words in vocabulary for target (of seq2seq)
	│	
	└── main
		└── lstm_seq2seq_tf_addons.ipynb           # step 2. train and evaluate model
		└── ......

Task: Semantic Parsing for Task Oriented Dialog（English Data）

  Training Data: 31279, Testing Data: 9042

Code	Reference	Env	Testing Exact Match
<Notebook>	GRU Seq2Seq	TF2	74.1%
<Notebook>	LSTM Seq2Seq	TF2	74.1%
<Notebook>	GRU Pointer-Generator	TF2	80.4%
<Notebook>	GRU Pointer-Generator + Char Embedding	TF2	80.7%

The Exact Match result is higher than the one reported in the original paper
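
For reference, Exact Match counts a prediction as correct only when the whole predicted tree equals the reference. A minimal sketch, assuming trees are compared as whitespace-normalized strings (the notebooks may compare token sequences instead); the bracketed tree in the usage example is illustrative only:

```python
from typing import List


def normalize(tree: str) -> str:
    """Collapse runs of whitespace so formatting differences do not count."""
    return " ".join(tree.split())


def exact_match(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions that equal their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    refs = ["[IN:GET_DIRECTIONS Directions to [SL:DESTINATION the park ] ]"]
    preds = ["[IN:GET_DIRECTIONS  Directions to [SL:DESTINATION the park ] ]"]
    print(exact_match(preds, refs))
```

Because a single wrong bracket or label makes the whole example count as a miss, Exact Match is a much stricter metric than token-level accuracy.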

Knowledge Graph Completion

└── finch/tensorflow2/knowledge_graph_completion/wn18
	│
	├── data
	│   └── download_data.ipynb       	# step 1. run this to download wn18 dataset
	│   └── make_data.ipynb           	# step 2. run this to generate vocabulary: entity.txt, relation.txt
	│   └── wn18  		          	# wn18 folder (will be auto created by download_data.ipynb)
	│   	└── train.txt  		  	# incomplete sample, format <entity1, relation, entity2> separated by \t
	│   	└── valid.txt  		  	# incomplete sample, format <entity1, relation, entity2> separated by \t 
	│   	└── test.txt   		  	# incomplete sample, format <entity1, relation, entity2> separated by \t
	│
	├── vocab
	│   └── entity.txt                  	# incomplete sample, list of entities in vocabulary
	│   └── relation.txt                	# incomplete sample, list of relations in vocabulary
	│	
	└── main              
		└── distmult_1-N.ipynb    	# step 3. train and evaluate model
		└── ...

Task: WN18

  Training Data: 141442, Testing Data: 5000

<Notebook>: Download Data
- <Text File>: Data Example
<Notebook>: Make Vocabulary
- <Text File>: Vocabulary Example
We use the idea of multi-label classification to accelerate evaluation

Code	Reference	Env	MRR	Hits@10	Hits@3	Hits@1
<Notebook>	DistMult	TF2	0.797	0.938	0.902	0.688
<Notebook>	TuckER	TF2	0.885	0.939	0.909	0.853
<Notebook>	ComplEx	TF2	0.938	0.958	0.948	0.925
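
One reading of the multi-label idea mentioned above (and of the file name `distmult_1-N.ipynb`) is 1-N scoring: for a given (head, relation) pair, every entity is scored as a candidate tail in a single pass, and the rank of the true tail is read off those scores. The sketch below illustrates this with a DistMult score and raw (unfiltered) ranking; the toy embeddings are made up, and the notebooks may apply filtering or batching that is not shown here.

```python
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[int, int, int]  # (head id, relation id, tail id)


def distmult_scores(head: Sequence[float], rel: Sequence[float],
                    entities: Sequence[Sequence[float]]) -> List[float]:
    """Score every entity as a tail: sum_k head[k] * rel[k] * tail[k]."""
    head_rel = [h * r for h, r in zip(head, rel)]
    return [sum(x * t for x, t in zip(head_rel, entity)) for entity in entities]


def rank_of(scores: Sequence[float], target: int) -> int:
    """1-based rank of the target; ties are resolved in the target's favour."""
    return 1 + sum(score > scores[target] for score in scores)


def evaluate(triples: List[Triple],
             entity_emb: Sequence[Sequence[float]],
             relation_emb: Sequence[Sequence[float]],
             ks: Tuple[int, ...] = (1, 3, 10)) -> Dict[str, float]:
    """MRR and Hits@k over tail prediction, using raw ranks."""
    ranks = [rank_of(distmult_scores(entity_emb[h], relation_emb[r], entity_emb), t)
             for h, r, t in triples]
    metrics = {"MRR": sum(1.0 / rank for rank in ranks) / len(ranks)}
    for k in ks:
        metrics[f"Hits@{k}"] = sum(rank <= k for rank in ranks) / len(ranks)
    return metrics


if __name__ == "__main__":
    entity_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]  # toy values
    relation_emb = [[1.0, 1.0]]
    print(evaluate([(0, 0, 2), (1, 0, 0)], entity_emb, relation_emb))
```

Scoring all entities at once replaces one forward pass per candidate triple with one pass per (head, relation) pair, which is where the speed-up comes from.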

Knowledge Base Question Answering

Rule-based System

For example, we want to answer the following questions with car knowledge:
```
 	What is BMW?
     	I want to know about the BMW
     	Please introduce the BMW to me
     	How is the BMW?
     	How is the BMW compared to the Benz?
```
- <Notebook> Regular Expression
- <Notebook> Regular Expression + POS Feature
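
A rough sketch of the regular-expression approach for the five example questions. The intent names (`describe`, `compare`) and the patterns are assumptions made for this illustration; the notebooks may use different rules.

```python
import re
from typing import Dict, Optional, Tuple

# Ordered rules: the more specific "compare" pattern is tried first.
RULES = [
    ("compare", re.compile(
        r"^how is (?:the )?(?P<a>\w+) compared to (?:the )?(?P<b>\w+)\??$",
        re.IGNORECASE)),
    ("describe", re.compile(
        r"^(?:what is|how is|i want to know about|please introduce) "
        r"(?:the )?(?P<a>\w+)(?: to me)?\??$",
        re.IGNORECASE)),
]


def parse(question: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Map a question to (intent, entities), or None if no rule matches."""
    text = " ".join(question.split())  # normalize whitespace
    for intent, pattern in RULES:
        match = pattern.match(text)
        if match:
            return intent, match.groupdict()
    return None


if __name__ == "__main__":
    for q in ["What is BMW?",
              "I want to know about the BMW",
              "Please introduce the BMW to me",
              "How is the BMW?",
              "How is the BMW compared to the Benz?"]:
        print(q, "->", parse(q))
```

Once the intent and entities are extracted, the answer can be looked up in the knowledge base; questions that match no rule return `None` and can fall back to another component.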

Multi-hop Question Answering

└── finch/tensorflow1/question_answering/babi
	│
	├── data
	│   └── make_data.ipynb           		# step 1. run this to generate vocabulary: word.txt 
	│   └── qa5_three-arg-relations_train.txt       # one complete example of babi dataset
	│   └── qa5_three-arg-relations_test.txt	# one complete example of babi dataset
	│
	├── vocab
	│   └── word.txt                  		# complete list of words in vocabulary
	│	
	└── main              
		└── dmn_train.ipynb
		└── dmn_serve.ipynb
		└── attn_gru_cell.py

Task: bAbI（English Data）
- <Text File>: Data Example
- <Notebook>: Make Vocabulary
- Model: Dynamic Memory Network
  - TensorFlow 1
    - <Notebook> DMN -> 99.4% Testing Accuracy
    - Inference

Text Visualization

Topic Modelling
- Model: TF-IDF + LDA
- Data: IMDB Movie Reviews
- <Notebook> Code
- <Notebook> Visualization
Explain Prediction
- Model: SHAP
- Data: IMDB Movie Reviews
- <Notebook>

Recommender System

└── finch/tensorflow1/recommender/movielens
	│
	├── data
	│   └── make_data.ipynb           		# run this to generate vocabulary
	│
	├── vocab
	│   └── user_job.txt
	│   └── user_id.txt
	│   └── user_gender.txt
	│   └── user_age.txt
	│   └── movie_types.txt
	│   └── movie_title.txt
	│   └── movie_id.txt
	│	
	└── main              
		└── dnn_softmax.ipynb
		└── ......

Task: Movielens 1M（English Data）

  Training Data: 900228, Testing Data: 99981, Users: 6000, Movies: 4000, Rating: 1-5

<Notebook>: Make Vocabulary
- <Text File>: Data Example

Model: Fusion

Code	Scoring	LR Decay	Env	Testing MAE
<Notebook>	Sigmoid (Continuous)	Exponential	TF1	0.663
<Notebook>	Sigmoid (Continuous)	Cyclical	TF1	0.661
<Notebook>	Softmax (Discrete)	Exponential	TF1	0.633
<Notebook>	Softmax (Discrete)	Cyclical	TF1	0.628

The MAE results appear to be better than all the results reported here and here
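
The two scoring schemes in the table can be read as follows; this is an interpretation, not confirmed by the notebooks. "Sigmoid (Continuous)" is taken to squash a single output into the 1-5 rating range, and "Softmax (Discrete)" is taken to pick the most probable of five rating classes. Both are scored with mean absolute error (MAE).

```python
import math
from typing import Sequence


def sigmoid_rating(logit: float, low: float = 1.0, high: float = 5.0) -> float:
    """Continuous rating: squash one logit into the range [low, high]."""
    if logit >= 0:
        s = 1.0 / (1.0 + math.exp(-logit))
    else:  # numerically stable branch for large negative logits
        e = math.exp(logit)
        s = e / (1.0 + e)
    return low + (high - low) * s


def softmax_rating(logits: Sequence[float]) -> int:
    """Discrete rating: index of the largest logit, shifted to start at 1.

    Softmax is monotonic, so the argmax over logits equals the argmax
    over the softmax probabilities.
    """
    return 1 + max(range(len(logits)), key=lambda i: logits[i])


def mae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    print(sigmoid_rating(0.0))                           # midpoint of the range
    print(softmax_rating([0.1, 0.2, 3.0, 0.0, -1.0]))    # third class
    print(mae([3.0, 4.0], [4.0, 4.0]))
```

Under this reading, a discrete prediction contributes an error of exactly 0 whenever the class is right, which may help explain the lower MAE of the softmax rows.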

Multi-turn Dialogue Rewriting

└── finch/tensorflow1/multi_turn_rewrite/chinese/
	│
	├── data
	│   └── make_data.ipynb         # run this to generate vocab, split train & test data, make pretrained embedding
	│   └── corpus.txt		# original data downloaded from external
	│   └── train_pos.txt		# processed positive training data after {make_data.ipynb}
	│   └── train_neg.txt		# processed negative training data after {make_data.ipynb}
	│   └── test_pos.txt		# processed positive testing data after {make_data.ipynb}
	│   └── test_neg.txt		# processed negative testing data after {make_data.ipynb}
	│
	├── vocab
	│   └── cc.zh.300.vec		# fastText pretrained embedding downloaded from external
	│   └── char.npy		# chinese characters and their embedding values (300 dim)	
	│   └── char.txt		# list of chinese characters used in this project 
	│	
	└── main              
		└── baseline_lstm_train.ipynb
		└── baseline_lstm_predict.ipynb
		└── ...

Task: 20k Tencent AI R&D data (20k 腾讯 AI 研发数据) (Chinese Data)

 Data split: 18986 positive training examples, 1008 positive testing examples

 With 1:1 negative sampling, the training set contains 2 * 18986 examples
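
One plausible way to build the 1:1 negatives is sketched below. It rests on an assumption that the source does not confirm: a negative example is taken to be one whose last utterance is already self-contained, so the expected rewrite is the input itself. The field names and the sample dialogue are invented for the illustration.

```python
from typing import List, NamedTuple


class RewriteExample(NamedTuple):
    context: List[str]   # earlier turns of the dialogue
    utterance: str       # current turn to be rewritten
    target: str          # expected self-contained rewrite


def add_negatives(positives: List[RewriteExample]) -> List[RewriteExample]:
    """Pair every positive with one negative (1:1 sampling).

    Assumption: the negative reuses the positive's rewrite as its input
    utterance, so the model has to learn to leave it unchanged.
    """
    negatives = [RewriteExample(ex.context, ex.target, ex.target)
                 for ex in positives]
    return positives + negatives


if __name__ == "__main__":
    pos = [RewriteExample(["Do you like Titanic?", "Yes, very much"],
                          "Why?",
                          "Why do you like Titanic?")]
    for ex in add_negatives(pos):
        print(ex.utterance, "->", ex.target)
```

Under this construction, 18986 positives yield 2 * 18986 training examples, which matches the count stated above.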

<Text File>: Full Data
<Notebook>: Make Data & Vocabulary & Pretrained Embedding
```
  Six incorrect examples were found and deleted
```
- <Text File>: Positive Data Example
- <Text File>: Negative Data Example

Model (results are comparable to those reported here on the same dataset)

Code	Model	Env	Exact Match	BLEU-1	BLEU-2	BLEU-4
<Notebook>	LSTM Seq2Seq + Dynamic Memory	TF1	56.2%	94.6	89.1	78.5
<Notebook>	GRU Seq2Seq + Dynamic Memory	TF1	56.2%	95.0	89.5	78.9
<Notebook>	GRU Pointer	TF1	59.2%	93.2	87.7	77.2
<Notebook>	GRU Pointer + Multi-Attention	TF1	60.2%	94.2	88.7	78.3

Deployment: first export the model

Inference Code	Environment
<Notebook>	Python
<Notebook>	Java

Generative Dialog

└── finch/tensorflow1/free_chat/chinese_lccc
	│
	├── data
	│   └── LCCC-base.json           	# raw data downloaded from external
	│   └── LCCC-base_test.json         # raw data downloaded from external
	│   └── make_data.ipynb           	# step 1. run this to generate vocab {char.txt} and data {train.txt & test.txt}
	│   └── train.txt           		# processed text file generated by {make_data.ipynb}
	│   └── test.txt           			# processed text file generated by {make_data.ipynb}
	│
	├── vocab
	│   └── char.txt                	# list of chars in vocabulary for chinese
	│   └── cc.zh.300.vec			# fastText pretrained embedding downloaded from external
	│   └── char.npy			# chinese characters and their embedding values (300 dim)	
	│	
	└── main
		└── lstm_seq2seq_train.ipynb    # step 2. train and evaluate model
		└── lstm_seq2seq_infer.ipynb    # step 4. model inference
		└── ...

Task: Large-scale Chinese Conversation Dataset

  Training Data: 5000000 (sampled because of limited memory), Testing Data: 19008

Data
- <Text File>: Data Example
- <Notebook>: Make Data & Vocabulary
  - <Text File>: Vocabulary Example

Model

Code	Model	Env	Test Case	Perplexity
<Notebook>	Transformer Encoder + LSTM Generator	TF1	<Notebook>	42.465
<Notebook>	LSTM Encoder + LSTM Generator	TF1	<Notebook>	41.250
<Notebook>	LSTM Encoder + LSTM Pointer-Generator	TF1	<Notebook>	36.525

If you want to deploy the model in a Java production environment

 └── FreeChatInference
 	│
 	├── data
 	│   └── transformer_export/
 	│   └── char.txt
 	│   └── libtensorflow-1.14.0.jar
 	│   └── tensorflow_jni.dll
 	│
 	└── src              
 		└── ModelInference.java

<Notebook> Java Inference

If you don't know the input and output node names needed on the Java side, you can display them with:

 !saved_model_cli show --dir ../model/xxx/1587959473/ --tag_set serve --signature_def serving_default

Large Pre-trained GPT

Test Case	Model	Env	Author
<Notebook>	GPT LCCC base	TF1 + bert4keras	Tsinghua University
<Notebook>	GPT LCCC large	TF1 + bert4keras	Tsinghua University
<Notebook>	NEZHA LM	TF1 + bert4keras	Jianlin Su

Diverse Response Comparison

Question / Model Response	Our RNN	GPT base	GPT large	Nezha
Q: 你好	A: 你好	A: 你好	A: 你好，你好，我是大好人。	A: 您好
Q: 早上好	A: 早上好	A: 早上好，好久不见	A: 早上好！	A: 早啊早啊
Q: 晚上好	A: 晚上好	A: 你好	A: 好久不见	A: 晚上好！
Q: 再见	A: 再见	A: 不见	A: 你是不是不爱我了	A: 我要走了
Q: 好久不见	A: 好久不见	A: 是呀，好久不见。	A: 是啊，好久不见。	A: 是啊，最近怎么样啊
Q: 想死你了	A: 我也是	A: 我也想你，	A: 我不信	A: 哈哈哈哈哈
Q: 谢谢你	A: 不客气	A: 不用谢	A: 不用	A: 不谢不谢，
Q: 爱你	A: 爱你	A: 你是爱我	A: 我也爱你，	A: 我也爱你，你最美
Q: 你叫什么名字	A: 你猜	A: 我也想知道	A: 叫你	A: 你是不是想说我的名字太俗了
Q: 你几岁了	A: 问女孩子年龄可不是绅士所为哦 ~	A: 你猜	A: 你猜	A: 我不知道
Q: 现在几点了	A: 11点	A: 你说呢？	A: 十二点	A: 我也不知道啊
Q: 今天天气怎么样	A: 下雨了	A: 今天很冷，但是我还是觉得很冷	A: 很好	A: 不错
Q: 我们现在在哪里	A: 在这里	A: 我在这里等你呢	A: 我不知道你在说什么	A: 在我们宿舍楼下。。。
Q: 你能给我讲个笑话吗	A: 不能	A: 讲个笑话	A: 讲给谁听	A: 不能
Q: 你是男孩还是女孩呀	A: 我是萌妹纸	A: 你猜	A: 你猜呀	A: 我是萌妹纸
Q: 你会几种语言呀	A: 你猜	A: 不会	A: 你猜呀~	A: 只会说一句
Q: 你能陪我玩吗	A: 可以啊	A: 可以	A: 我可以啊	A: 你猜
Q: 说话可以大声一点吗	A: 可以	A: 可以的	A: 不要	A: 我不
Q: 天气真好	A: 是的	A: 是啊！	A: 天气好是因为有你	A: 是啊。
Q: 天气太糟糕了	A: 是啊	A: 是啊	A: 是啊	A: 是的
Q: 下雨了	A: 是的	A: 你们那边也下雨了	A: 是呀	A: 我也是
Q: 雨好大	A: 今天下雨了	A: 是啊	A: 是的，好大雨	A: 对呀，我们还穿短袖
Q: 我讨厌艳阳天	A: 我也讨厌艳阳天	A: 我也是。	A: 我喜欢阴天	A: 我喜欢
Q: 好晒啊	A: 是啊	A: 是啊	A: 是啊！！	A: 是啊，不是很热的
Q: 今天好冷	A: 多穿点衣服	A: 你们都不来	A: 你那儿也冷了	A: 是啊。
Q: 今天好热	A: 今天热死了	A: 热死我了	A: 对啊	A: 对啊，我都热疯
Q: 风好大	A: 是啊	A: 风太大了。。。	A: 你不是在上课嘛	A: 嗯嗯嗯，风好大，冷
Q: 雾太大了看不清路	A: 雾太大了	A: 哈哈，你这是在逗我笑	A: 我也看不清	A: 你是不是傻啊
Q: 打雷了好可怕	A: 可怕	A: 我们这边也是，我们这边下雨了，我们这边也没有下雨。	A: 是啊	A: 我们这边下雨了
Q: 下雪了诶	A: 下雪了？	A: 是滴！	A: 嗯嗯嗯！！好大的雪！！	A: 下了一天了
Q: 好烦啊	A: 我也好烦	A: 我也是	A: 我好烦	A: 我也是
Q: 好开心	A: 开心就好	A: 你是不是想我了	A: 是呀	A: 哈哈哈，我也是！
Q: 太激动了	A: 激动什么	A: 我也激动	A: 哈哈，还是不要了	A: 哈哈
Q: 我好难过	A: 怎么了？	A: 我也是，我们都是	A: 不要难过，我陪你	A: 你怎么难过
Q: 我想哭	A: 抱抱	A: 抱紧你	A: 我不想哭！你不要想太多！	A: 哭吧哭吧，我陪你哭
Q: 太好笑了	A: 笑死我了	A: 你也笑了？	A: 我也觉得	A: 哈哈
Q: 我好伤心	A: 怎么了	A: 你伤心什么	A: 我也是	A: 你是不是想笑死我！！下次再给你拍
Q: 心好痛	A: 抱抱	A: 你心痛什么	A: 心碎一地了。	A: 怎么了
Q: 好累啊	A: 加油	A: 你不会还是一个人在寝室吧	A: 好好休息	A: 累了就睡觉
Q: 我好疲惫	A: 我也是	A: 我也差不多。。	A: 我好困	A: 你不是一直都很疲惫么
Q: 我爱你	A: 我也爱你	A: 我也是	A: 我们都爱你	A: 爱我就来见面
Q: 我讨厌你	A: 我也讨厌你	A: 你不是一直很喜欢我	A: 不是我的	A: 我也讨厌你
Q: 你真是太棒啦	A: 谢谢	A: 你最棒	A: 你也棒棒哒	A: 谢谢！
Q: 你好厉害啊	A: 哈哈	A: 是不是	A: 你也不差呀	A: 你也可以的
Q: 吓死我了	A: 哈哈	A: 我都不敢出门了	A: 哈哈哈哈哈！！！	A: 哈哈哈哈哈笑死我了
Q: 我想回家	A: 我也想回家	A: 我也想！！	A: 想家的时候回来，想家的时候离开。	A: 回来吧，家真的好啊！
Q: 我想爸妈了	A: 我也想爸妈	A: 哈哈	A: 我也想	A: 想我吗
Q: 不知道小孩在家有没有听话	A: 我也不知道	A: 没有	A: 听话的话肯定是会听话的。	A: 我也是听不懂啊
Q: 想回家撸猫	A: 我也想回家	A: 你也想啊？	A: 我们这也有一个	A: 回呀回呀

Comments

In the VAE code, how do I use LSTMCell?

When I change the GRUCell to LSTMCell, I get this error: ValueError: Shape must be rank 2 but is rank 3 for 'decoding/decoder/concat' (op: 'ConcatV2') with input shapes: [?,128], [2,?,20], [].

Please help, thank you.

opened by shaomai00 7
Is the function "add_encoder_layer" in "seq2seq_ultimate.py" correct?

Hi, I have a question:

Several lines in seq2seq_ultimate.py (function "add_encoder_layer") may be in the wrong position:

 bi_state_c = tf.concat((state_fw.c, state_bw.c), -1)
 bi_state_h = tf.concat((state_fw.h, state_bw.h), -1)
 bi_lstm_state = tf.nn.rnn_cell.LSTMStateTuple(c=bi_state_c, h=bi_state_h)
 self.encoder_state = tuple([bi_lstm_state] * self.n_layers)

opened by cdj0311 4
CBOW code

In the "CBOW" code, I run:

 estimator.train(tf.estimator.inputs.numpy_input_fn(
     x_train, np.expand_dims(y_train, -1),
     batch_size = PARAMS['batch_size'],
     num_epochs = PARAMS['n_epochs'],
     shuffle = True))

and it fails with:

 Traceback (most recent call last):
   File "D:/pythonWorkSpace/AAB/tensorflow-CBOW.py", line 112, in
     shuffle = True))
   File "E:\Anaconda\lib\site-packages\tensorflow\python\estimator\estimator.py", line 241, in train
     loss = self._train_model(input_fn=input_fn, hooks=hooks)
   File "E:\Anaconda\lib\site-packages\tensorflow\python\estimator\estimator.py", line 558, in _train_model
     features, labels = input_fn()
   File "E:\Anaconda\lib\site-packages\tensorflow\python\estimator\inputs\numpy_io.py", line 98, in input_fn
     raise TypeError('x must be dict; got {}'.format(type(x).__name__))
 TypeError: x must be dict; got ndarray

Could you help me solve this problem? I have seen that you can run this code correctly.

opened by ZuoxiYang 3
Two questions about "CLUE Emotion Analysis"
Hi, I have two questions:

1. In text = ['[CLS]'] + text + ['[SEP]'], why is text not tokenized first, as in text = ['[CLS]'] + tokenizer.tokenize(text) + ['[SEP]']?

2. In BertFinetune there is x = x[1]. Why x[1]?
opened by CoderBinGe 2
Using baseline_lstm_train_clr to predict raises an error. How can I fix it?

ValueError: Shape must be rank 2 but is rank 3 for 'Decoder/decoder/while/BeamSearchDecoderStep/tied_dense/MatMul' (op: 'MatMul') with input shapes: [?,10,300], [3853,300].

opened by hbwzhsh 2
The reconstruction performance of Learning to Reconstruct

Hello. You used a VAE to reconstruct sentences from IMDB. Judging by your results, the reconstruction performance does not look good: the reconstructed sentences differ greatly from the original ones. I am new to NLP, so I would like to know whether complete reconstruction can be achieved with existing models. Can you give me some hints about what causes the poor reconstruction performance? Is it the simplicity of the model, a lack of training, or something else? Thank you.

opened by zyj008 2
Attention is all you need

I am trying to modify your code to fit an English dataset. I modified DataLoader.py and added English word segmentation, but when I train the model, I get an error.

INFO:tensorflow:loss = 7.306507, step = 0
INFO:tensorflow:lr = 0.001
ERROR:tensorflow:Model diverged with loss = NaN.

ne, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000001E5D23D9470>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Traceback (most recent call last):
  File "c:\Users\89534.vscode\extensions\ms-python.python-2019.2.5558\pythonFiles\lib\python\ptvsd\wrapper.py", line 1425, in done
    fut.result()
  File "c:\Users\89534.vscode\extensions\ms-python.python-2019.2.5558\pythonFiles\lib\python\ptvsd\futures.py", line 40, in result
    reraise(self._exc_info)
  File "c:\Users\89534.vscode\extensions\ms-python.python-2019.2.5558\pythonFiles\lib\python\ptvsd\reraise3.py", line 8, in reraise
    raise exc_info[1].with_traceback(exc_info[2])
  File "c:\Users\89534.vscode\extensions\ms-python.python-2019.2.5558\pythonFiles\lib\python\ptvsd\futures.py", line 157, in callback
    x = next(it)
  File "c:\Users\89534.vscode\extensions\ms-python.python-2019.2.5558\pythonFiles\lib\python\ptvsd\wrapper.py", line 2070, in on_evaluate
    pyd_tid, pyd_fid = self.frame_map.to_pydevd(vsc_fid)
  File "c:\Users\89534.vscode\extensions\ms-python.python-2019.2.5558\pythonFiles\lib\python\ptvsd\wrapper.py", line 311, in to_pydevd
    return self._vscode_to_pydevd[vscode_id]
KeyError: 41
WARNING:tensorflow:From C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\site-packages\tensorflow\python\estimator\inputs\queues\feeding_queue_runner.py:62: QueueRunner.__init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.
WARNING:tensorflow:From C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\site-packages\tensorflow\python\estimator\inputs\queues\feeding_functions.py:500: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2019-03-26 12:58:05.482552: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-03-26 12:58:06.511141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.493
pciBusID: 0000:01:00.0
totalMemory: 4.00GiB freeMemory: 3.30GiB
2019-03-26 12:58:06.528029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-26 12:58:08.067124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 12:58:08.101398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-03-26 12:58:08.116357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-03-26 12:58:08.133051: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3015 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
WARNING:tensorflow:From C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\site-packages\tensorflow\python\training\monitored_session.py:804: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the tf.data module.
INFO:tensorflow:Saving checkpoints for 0 into C:\Users\89534\AppData\Local\Temp\tmpc8cb1xnq\model.ckpt.
2019-03-26 13:02:24.506298: E tensorflow/core/grappler/clusters/utils.cc:83] Failed to get device properties, error code: 30
INFO:tensorflow:loss = 7.306507, step = 0
INFO:tensorflow:lr = 0.001
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "c:\Users\89534.vscode\extensions\ms-python.python-2019.2.5558\pythonFiles\ptvsd_launcher.py", line 45, in
    main(ptvsdArgs)
  File "c:\Users\89534.vscode\extensions\ms-python.python-2019.2.5558\pythonFiles\lib\python\ptvsd_main.py", line 357, in main
    run()
  File "c:\Users\89534.vscode\extensions\ms-python.python-2019.2.5558\pythonFiles\lib\python\ptvsd_main.py", line 257, in run_file
    runpy.run_path(target, run_name='__main__')
  File "C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "d:\CDisk\Documents\GitHub\finch\src_nlp\tensorflow\attn_is_all_u_need\train_dialog.py", line 36, in
    main()
  File "d:\CDisk\Documents\GitHub\finch\src_nlp\tensorflow\attn_is_all_u_need\train_dialog.py", line 30, in main
    shuffle = True))
  File "C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\site-packages\tensorflow\python\estimator\estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\site-packages\tensorflow\python\estimator\estimator.py", line 1471, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\site-packages\tensorflow\python\training\monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1156, in run
    run_metadata=run_metadata)
  File "C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1255, in run
    raise six.reraise(*original_exc_info)
  File "C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\site-packages\six.py", line 693, in reraise
    raise value
  File "C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1240, in run
    return self._sess.run(*args, **kwargs)
  File "C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\site-packages\tensorflow\python\training\monitored_session.py", line 1320, in run
    run_metadata=run_metadata))
  File "C:\Users\89534\AppData\Local\conda\conda\envs\tf\lib\site-packages\tensorflow\python\training\basic_session_run_hooks.py", line 753, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.

The code I modified may have a problem. How can I change it so that it trains on an English corpus? Thanks. This is my first time writing a chatbot.
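
One check that may be worth running after swapping in a new corpus, offered as a sketch and not as a diagnosis of this particular run: TensorFlow's sparse softmax cross-entropy returns NaN on GPU for label ids outside [0, num_classes), so token ids that exceed the vocabulary size the model was built with can produce a NaN loss. The helper below is hypothetical; `check_token_ids`, `vocab_size`, and `pad_id` are illustrative names that do not come from the finch repository, and it assumes each batch is a list of padded rows of integer ids.

```python
def check_token_ids(batch, vocab_size, pad_id=0):
    """Return a list of problems found in a batch of padded token-id rows.

    Assumptions (not taken from the original code): `batch` is a list of
    lists of ints, valid ids lie in [0, vocab_size), and `pad_id` marks padding.
    """
    problems = []
    for row_idx, row in enumerate(batch):
        # Ids outside the vocabulary are invalid labels for the loss.
        bad = [t for t in row if not 0 <= t < vocab_size]
        if bad:
            problems.append(
                f"row {row_idx}: ids out of range [0, {vocab_size}): {bad}"
            )
        # A row that is empty or holds only padding has no real tokens,
        # which can lead to a division by zero when the loss is averaged
        # over non-padding positions.
        if all(t == pad_id for t in row):
            problems.append(f"row {row_idx}: contains only padding")
    return problems


if __name__ == "__main__":
    sample = [[4, 7, 2, 0], [0, 0, 0, 0], [5, 1500, 3, 0]]
    for problem in check_token_ids(sample, vocab_size=1000):
        print(problem)
```

If the check reports nothing, the data is probably not the cause, and the learning rate or the model code changes would be the next places to look.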

opened by Kiteflyingee 1


Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.

Kashgari Overview | Performance | Installation | Documentation | Contributing We released the 2.0.0 version with TF2 Support. If you

2.3k Dec 29, 2022

Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.

Kashgari Overview | Performance | Installation | Documentation | Contributing We released the 2.0.0 version with TF2 Support. If you

2k Feb 9, 2021

Grading tools for Advanced NLP (11-711)

Grading tools for Advanced NLP (11-711) Installation You'll need docker and unzip to use this repo. For docker, visit the official guide to get starte

2 Sep 27, 2022

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

2.3k Jan 7, 2023

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

2.1k Feb 17, 2021

Pytorch-NLU, a Chinese text classification and sequence annotation toolkit, supports multi-class and multi-label classification of long and short Chinese text, and sequence annotation tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.

Pytorch-NLU is a toolkit for Chinese text classification and sequence annotation. It supports multi-class and multi-label classification of long and short Chinese text, as well as sequence annotation tasks such as Chinese named entity recognition, part-of-speech tagging, and word segmentation.

186 Dec 24, 2022

A design of a MIDI language for music generation tasks, specifically for Natural Language Processing (NLP) models.

MIDI Language Introduction Reference Paper: Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions: code This

3 May 25, 2022

NLP project that works with news (NER, context generation, news trend analytics)

СоАвтор (CoAuthor) is a platform and open toolkit for newsrooms and freelance journalists, intended to make the content creation process

38 Jan 4, 2023

Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

6.4k Jan 1, 2023

Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

4.8k Feb 18, 2021

A list of NLP (Natural Language Processing) tutorials built on TensorFlow 2.0.

335 Jan 4, 2023

Data loaders and abstractions for text and NLP

torchtext This repository consists of: torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vecto

3.2k Dec 30, 2022

Data loaders and abstractions for text and NLP

torchtext This repository consists of: torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vecto

2.6k Feb 18, 2021

The project lets you extract glossary words and their definitions from a given piece of text automatically using NLP techniques

Unsupervised technique to Glossary and Definition Extraction Code Files GPT2-DefinitionModel.ipynb - GPT-2 model for definition generation. Data_Gener

28 May 25, 2021




Multilingual text (NLP) processing toolkit


NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

Signature remover is a NLP based solution which removes email signatures from the rest of the text.