Multiple implementations of abstractive text summarization, using Google Colab

Overview

Text Summarization models

If you are able to endorse me on arXiv, I would be more than glad (https://arxiv.org/auth/endorse?x=FRBB89), thanks! This repo is built to collect multiple implementations of abstractive approaches to text summarization for different languages (Hindi, Amharic, English, and soon, God willing, Arabic).

If you found this project helpful, please consider citing our work; it would truly mean a lot to me:

@INPROCEEDINGS{9068171,
  author={A. M. {Zaki} and M. I. {Khalil} and H. M. {Abbas}},
  booktitle={2019 14th International Conference on Computer Engineering and Systems (ICCES)},
  title={Deep Architectures for Abstractive Text Summarization in Multiple Languages},
  year={2019},
  pages={22-27},
}
@misc{zaki2020amharic,
    title={Amharic Abstractive Text Summarization},
    author={Amr M. Zaki and Mahmoud I. Khalil and Hazem M. Abbas},
    year={2020},
    eprint={2003.13721},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

It is built to run simply on Google Colab, each example in a single notebook, so all you need is an internet connection; no powerful machine is required. All the code examples come as Jupyter notebooks, and you don't have to download data to your device, as the notebooks are connected to Google Drive.
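For example, the first cell of these notebooks typically mounts your Google Drive into the Colab runtime (a minimal sketch of the standard Colab call; the exact paths used inside each notebook may differ):

    # mount Google Drive inside the Colab runtime
    from google.colab import drive
    drive.mount('/content/drive')

    # data and checkpoints can then be read from and written to folders under
    # /content/drive/My Drive/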

  • Arabic summarization model using the cornerstone implementation (seq2seq with a bidirectional LSTM encoder and attention in the decoder) for summarizing Arabic news
  • Implementation A: cornerstone seq2seq with attention (using bidirectional LSTMs), with three different models
  • Implementation B: seq2seq with a pointer-generator model
  • Implementation C: seq2seq with reinforcement learning

Blogs

This repo has been explained in a series of blogs.


Try out text summarization through the eazymind website, which enables you to summarize your text through:

  • curl call

curl -X POST \
  http://eazymind.herokuapp.com/arabic_sum/eazysum \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/x-www-form-urlencoded' \
  -d "eazykey={eazymind api key}&sentence={your sentence to be summarized}"

  • Python package

from eazymind.nlp.eazysum import Summarizer

#---key from eazymind website---
key = "xxxxxxxxxxxxxxxxxxxxx"

#---sentence to be summarized---
sentence = """(CNN)The White House has instructed former
    White House Counsel Don McGahn not to comply with a subpoena
    for documents from House Judiciary Chairman Jerry Nadler, 
    teeing up the latest in a series of escalating oversight 
    showdowns between the Trump administration and congressional Democrats."""
    
summarizer = Summarizer(key)
print(summarizer.run(sentence))

Implementation A (seq2seq with attention and feature rich representation)

Contains three different models that implement a seq2seq network with attention, while also adding concepts such as a feature-rich word representation. This work is a continuation of these amazing repos.
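As a quick orientation (this is not the repo's code, just a minimal tf.keras sketch with made-up sizes), the shared skeleton is a bidirectional LSTM encoder whose outputs are attended over by the decoder before projecting to the vocabulary:

    import tensorflow as tf

    VOCAB, EMB, UNITS = 20000, 128, 256   # illustrative sizes, not the repo's settings

    # encoder: embedding -> bidirectional LSTM
    enc_in = tf.keras.Input(shape=(None,), dtype="int32")
    enc_emb = tf.keras.layers.Embedding(VOCAB, EMB)(enc_in)
    enc_out, fh, fc, bh, bc = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(UNITS, return_sequences=True, return_state=True))(enc_emb)
    enc_state = [tf.keras.layers.Concatenate()([fh, bh]),
                 tf.keras.layers.Concatenate()([fc, bc])]

    # decoder: LSTM initialised with the encoder state + additive (Bahdanau-style) attention
    dec_in = tf.keras.Input(shape=(None,), dtype="int32")
    dec_emb = tf.keras.layers.Embedding(VOCAB, EMB)(dec_in)
    dec_out, _, _ = tf.keras.layers.LSTM(
        2 * UNITS, return_sequences=True, return_state=True)(dec_emb, initial_state=enc_state)
    context = tf.keras.layers.AdditiveAttention()([dec_out, enc_out])   # attend over encoder outputs
    logits = tf.keras.layers.Dense(VOCAB)(tf.keras.layers.Concatenate()([dec_out, context]))

    model = tf.keras.Model([enc_in, dec_in], logits)   # trained with teacher forcing on (article, summary) pairs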

Model 1

A modification of David Currie's seq2seq: https://github.com/Currie32/Text-Summarization-with-Amazon-Reviews

Model 2

1- Model_2/Model_2.ipynb

a modification to https://github.com/dongjun-Lee/text-summarization-tensorflow

2- Model_2/Model 2 features(tf-idf , pos tags).ipynb

A modification of Model_2.ipynb that adds feature-rich input (TF-IDF and POS tags) using concepts from http://www.aclweb.org/anthology/K16-1028. A rough sketch of this idea follows.
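As a rough illustration of that feature-rich idea (a hypothetical helper, not the notebook's code), each word embedding can be extended with a TF-IDF score and a one-hot POS tag before it is fed to the encoder:

    import numpy as np
    import nltk   # nltk.download('averaged_perceptron_tagger') is needed once for pos_tag

    POS_TAGS = ['NN', 'NNS', 'VB', 'VBD', 'JJ', 'RB']   # small illustrative tag set

    def feature_rich_vectors(tokens, word_vectors, tfidf_scores, emb_dim=300):
        """Concatenate each word's embedding with its TF-IDF score and a one-hot POS tag."""
        rows = []
        for token, tag in nltk.pos_tag(tokens):
            emb = word_vectors.get(token, np.zeros(emb_dim))      # embedding lookup (dict of numpy arrays)
            tfidf = np.array([tfidf_scores.get(token, 0.0)])      # precomputed TF-IDF score for this word
            pos = np.zeros(len(POS_TAGS))
            if tag in POS_TAGS:
                pos[POS_TAGS.index(tag)] = 1.0                    # one-hot POS feature
            rows.append(np.concatenate([emb, tfidf, pos]))
        return np.stack(rows)   # shape: (len(tokens), emb_dim + 1 + len(POS_TAGS))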

Results

A folder containing the results of both models on validation text samples, in the zaksum format, which combines the following scores for each sentence, along with their averages (a rough sketch of this kind of scoring follows the list):

  • BLEU
  • ROUGE-1
  • ROUGE-2
  • ROUGE-L
  • ROUGE-BE
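The exact zaksum layout is specific to this repo, but the per-sentence scoring it aggregates can be approximated with off-the-shelf packages (ROUGE-BE is omitted here). A rough sketch, assuming the `rouge` and `nltk` pip packages:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from rouge import Rouge

    def score_summaries(references, candidates):
        """Per-sentence BLEU and ROUGE-1/2/L F-scores, plus their averages."""
        rouge = Rouge()
        smooth = SmoothingFunction().method1
        rows = []
        for ref, cand in zip(references, candidates):
            r = rouge.get_scores(cand, ref)[0]    # {'rouge-1': {'f': ...}, 'rouge-2': ..., 'rouge-l': ...}
            rows.append({
                'bleu': sentence_bleu([ref.split()], cand.split(), smoothing_function=smooth),
                'rouge_1': r['rouge-1']['f'],
                'rouge_2': r['rouge-2']['f'],
                'rouge_L': r['rouge-l']['f'],
            })
        averages = {k: sum(row[k] for row in rows) / len(rows) for k in rows[0]}
        return rows, averages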

Model 3

a modification to https://github.com/thomasschmied/Text_Summarization_with_Tensorflow/blob/master/summarizer_amazon_reviews.ipynb


Implementation B (Pointer Generator seq2seq network)

It is a continuation of the amazing work of https://github.com/abisee/pointer-generator (https://arxiv.org/abs/1704.04368). This implementation uses a pointer-generator network to mitigate some of the problems that appear with the plain seq2seq network, such as out-of-vocabulary words.
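For reference, the final distribution in See et al. (linked above) mixes the decoder's vocabulary distribution with the attention (copy) distribution through a generation probability p_gen, which lets the model copy out-of-vocabulary words straight from the article:

    P(w) = p_{gen}\, P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a_i^{t}

where a^t is the attention distribution at decoder step t and p_gen is a sigmoid over the context vector, decoder state, and decoder input.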

Model_4_generator_.ipynb

Uses a pointer-generator on top of seq2seq with attention; it is built using Python 2.7.

zaksum_eval.ipynb

Built with Python 3 for evaluation.

Results/Pointer Generator

  • output from the generator (article / reference / summary), used as input to zaksum_eval.ipynb
  • results from zaksum_eval

I will still work on their implementation of the coverage mechanism; much more work is yet to come, God willing.


Implementation C (Reinforcement Learning for Sequence to Sequence)

This implementation is a continuation of the amazing work done in https://github.com/yaserkl/RLSeq2Seq (https://arxiv.org/abs/1805.09461).

@article{keneshloo2018deep,
 title={Deep Reinforcement Learning For Sequence to Sequence Models},
 author={Keneshloo, Yaser and Shi, Tian and Ramakrishnan, Naren and Reddy, Chandan K.},
 journal={arXiv preprint arXiv:1805.09461},
 year={2018}
}

Model 5 RL

This is a library for building multiple approaches using reinforcement learning with seq2seq. I have gathered their code to run in a Jupyter notebook and to access Google Drive; it is built for Python 2.7.
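One of the approaches covered in that work is self-critical policy gradient training, where (roughly) the reward of a sampled summary y^s is baselined by the reward of the greedily decoded summary \hat{y}, with ROUGE typically used as the reward r(\cdot):

    L_{RL} = -\,\bigl(r(y^{s}) - r(\hat{y})\bigr) \sum_{t} \log p\bigl(y^{s}_{t} \mid y^{s}_{1}, \dots, y^{s}_{t-1}, x\bigr)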

zaksum_eval.ipynb

Built with Python 3 for evaluation.

Results/Reinforcement Learning

  • output from Model 5 RL, used as input to zaksum_eval.ipynb

Comments

  • How did you solve the [unk] problem?


    I tried running some randomly selected text with lots of domain-specific jargon. On my trained model, all the jargon words got translated to [unk], which actually seems reasonable based on my understanding of the models (4 and 5). However, on your demo site, your model was able to spit back out the jargon words. Can you suggest what the difference might be that allowed your model to work effectively for the OOV words?

    opened by bhomass 12
  • Issue while predicting


    Loading dictionary...
    Loading validation dataset...
    Loading article and reference...
    Loading saved model...
    Writing summaries to 'result.txt'...

    FailedPreconditionError                   Traceback (most recent call last)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
       1333     try:
    -> 1334       return fn(*args)
       1335     except errors.OpError as e:

    /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
       1318       return self._call_tf_sessionrun(
    -> 1319           options, feed_dict, fetch_list, target_list, run_metadata)
       1320

    /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
       1406         self._session, options, feed_dict, fetch_list, target_list,
    -> 1407         run_metadata)
       1408

    FailedPreconditionError: Attempting to use uninitialized value decoder/attention_wrapper/attention_layer/kernel
    	 [[{{node decoder/attention_wrapper/attention_layer/kernel/read}}]]

    During handling of the above exception, another exception occurred:

    FailedPreconditionError Traceback (most recent call last)

    Model 2 Evaluation 
    opened by yash-1997 9
  • pre-trained model available?


    Hi I am interested in models 4 and 5. From past trials it takes many days to train the models from CNN/Daily News data. Plus you may have to re-train to try new parameters. Is it possible to just use your pre-trained model and run some test articles (no abstracts)?

    opened by bhomass 8
  • max_dec_steps=1 for decode mode


    I can't make sense out of this setting. Why do you set max_dec_steps=1 for decode mode? You still have to decode the number of steps equals to size of the output summary, right? That is keep decoding until reaching the STOP word.

    opened by bhomass 6
  • asking for instructions !!!


    Hi Amrzaki!

    I read your blogs on Medium, they are very good. I am new to text summarization and was wondering how to run the pointer-generator model with coverage on new data, I mean how to use it to summarize new articles? your help is appreciated.

    Data Processing Model 4 
    opened by nxs5899 5
  • Data Preprocessing


    Hey, I have a query about the data preprocessing part for models 4 and 5. Whenever I try to preprocess the data, this is what I end up with:

    Traceback (most recent call last):
      File "process_English.py", line 290, in <module>
        reviews = pd.read_csv(reviews_csv, header=1)  # skip first row (of header)
      File "/home/giri/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
        return _read(filepath_or_buffer, kwds)
      File "/home/giri/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
        parser = TextFileReader(filepath_or_buffer, **kwds)
      File "/home/giri/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
        self._make_engine(self.engine)
      File "/home/giri/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
        self._engine = CParserWrapper(self.f, **self.options)
      File "/home/giri/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
        self._reader = parsers.TextReader(src, **kwds)
      File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
      File "pandas/_libs/parsers.pyx", line 751, in pandas._libs.parsers.TextReader._get_header
    pandas.errors.ParserError: Passed header=1 but only 1 lines in file

    I have preprocessed the data using the steps which abisee gave, but I don't understand the csv part in your method.

    Data Processing Model 4 Model 5 
    opened by giriallada 3
  • Model_3.ipynb (util.py)


    Thanks for the great work. I see the get_init_embedding wrapper function takes the embedding pickle file glove/model_glove_300.pkl. I want to confirm with you: is it just the pickle conversion of the glove/glove.6B.300d.txt file, or something else?

    I also want to know the difference between loading a .vec file or a .pkl file as the embedding vectors.

    Model 2 Word2Vector 
    opened by swayam01 3
  • why decoder produce same generated summary ?


    Model 4 had a loss of 5 and it always produces the same generated summary (for all articles):

    [UNK] [UNK] , 28 , has been charged with two counts of first-degree murder . he has been charged with two counts of attempted murder . he was sentenced to 15 years in prison and sentenced to 18 months in prison .

    opened by PH-github95 2
  • Issues saving checkpoints


    First off, thanks for putting this together, this has been very helpful.

    I've almost got everything running end to end except for saving the checkpoints for the model:

    ---------------------------------------------------------------------------
    UnknownError                              Traceback (most recent call last)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
       1333     try:
    -> 1334       return fn(*args)
       1335     except errors.OpError as e:
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
       1318       return self._call_tf_sessionrun(
    -> 1319           options, feed_dict, fetch_list, target_list, run_metadata)
       1320 
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
       1406         self._session, options, feed_dict, fetch_list, target_list,
    -> 1407         run_metadata)
       1408 
    
    UnknownError: drive/Colab Notebooks/saved_model/model.ckpt-313.data-00000-of-00001.tempstate17496649906120132702; Input/output error
    	 [[{{node save_2/SaveV2}}]]
    	 [[{{node save_2/SaveV2}}]]
    
    

    How do I resolve this?

    Google Colab 
    opened by madhavthaker 2
  • Problem http://eazymind.herokuapp.com/arabic_sum/


    While trying to use the Python library eazymind, it gives the following error:

    <html>
      <head>
    	<meta name="viewport" content="width=device-width, initial-scale=1">
    	<meta charset="utf-8">
    	<title>Application Error</title>
    	<style media="screen">
    	  html,body,iframe {
    		margin: 0;
    		padding: 0;
    	  }
    	  html,body {
    		height: 100%;
    		overflow: hidden;
    	  }
    	  iframe {
    		width: 100%;
    		height: 100%;
    		border: 0;
    	  }
    	</style>
      </head>
      <body>
    	<iframe src="//www.herokucdn.com/error-pages/application-error.html"></iframe>
      </body>
    </html>
    
    opened by irigaraynavarromaria 0
  • Model 3 - can't save the model to infer later


    First of all, thank you so much for this! I trained it and got really good results.

    But I'm not being able to save the model to infer later.

    From what I've understood from the code, that mechanism is guaranteed (thus the saving path), but my model doesn't save anything in that path.

    Does anyone have this issue? Is there something I'm missing? Any tips would be highly appreciated!!

    Thank you!

    opened by telmabatista 0
  • GPU issue


    Hello, nice repo. Just one question: when I open your model4 ipynb in Google Colab, it warns me that the GPU is not being used. Do you get this problem? Also, for Model_4_generator_python3.ipynb, it seems the code is still Python 2.

    opened by yanglei-github 3
  • About decode mode


    Hello, for the pointer-generator model4, you use decode mode to predict output using some pre-trained ckpt, but it seems you do not provide the checkpoint in your folder. Does it mean I have to first change the mode to train, train it myself, and then change back to decode mode?

    opened by yanglei-github 1
  • dict has no attribute word_vec


    Hi

    When I try to load the GloVe vectors, while training the LSTM, I get the following error:

    AttributeError: 'dict' object has no attribute 'word_vec'

    This happens when I try to train the LSTM with the following code:

    def get_init_embedding(reverse_dict, embedding_size):
        print("Loading GLove vectors..")
        with open("C:/Users/sensen/OneDrive - HERE Global B.V-/Desktop/NLP/glove.6B.300d_pickle", 'rb') as handle:
            word_vectors = pickle.load(handle)

        word_vec_list = list()

        # loop through all the words in reverse_dict
        used_words = 0
        for _, word in sorted(reverse_dict.items()):
            try:
                word_vec = word_vectors.word_vec(word)
                used_words += 1
            except KeyError:
                word_vec = np.zeros([embedding_size], dtype=np.float32)

            word_vec_list.append(word_vec)

        word_vec_list[2] = np.random.normal(0, 1, embedding_size)
        word_vec_list[3] = np.random.normal(0, 1, embedding_size)

        return np.array(word_vec_list)


    Building model architecture:

class Model(object):
    def __init__(self, reversed_dict, article_max_len, summary_max_len, args, forward_only=False):
        self.vocabulary_size = len(reversed_dict)
        self.embedding_size = args.embedding_size
        self.num_hidden = args.num_hidden
        self.num_layers = args.num_layers
        self.learning_rate = args.learning_rate
        self.beam_width = args.beam_width
        if not forward_only:  # in the training phase, keep_prob is used for defining the dropout %
            self.keep_prob = args.keep_prob
        else:
            self.keep_prob = 1.0
        self.cell = tf.nn.rnn_cell.BasicLSTMCell  # initializing an LSTM cell
        with tf.variable_scope("decoder/projection"):
            # projection layer used by the decoder in both training and testing;
            # it converts indices of individual words to a continuous weight vector
            self.projection_layer = tf.layers.Dense(self.vocabulary_size, use_bias=False)

        # defining the placeholders (batch size, articles and their lengths, decoder inputs/targets)
        self.batch_size = tf.placeholder(tf.int32, (), name="batch_size")
        self.X = tf.placeholder(tf.int32, [None, article_max_len])
        self.X_len = tf.placeholder(tf.int32, [None])  # article lengths, fed at runtime
        self.decoder_input = tf.placeholder(tf.int32, [None, summary_max_len])
        self.decoder_len = tf.placeholder(tf.int32, [None])
        self.decoder_target = tf.placeholder(tf.int32, [None, summary_max_len])
        self.global_step = tf.Variable(0, trainable=False)

        #EMBEDDING LAYER
        
        with tf.name_scope("embedding"):
            if not forward_only and args.glove:#if in training phase and if glove is used
                init_embeddings = tf.constant(get_init_embedding(reversed_dict, self.embedding_size), dtype=tf.float32)#Constant function because word embedding wont change as part of dict.Get_Init_embedding is a function that returns the vector for each word in our dict
            else:
                init_embeddings = tf.random_uniform([self.vocabulary_size, self.embedding_size], -1.0, 1.0)#if embedding is in testing phase, no constant dict is available. initializing a random variable
            self.embeddings = tf.get_variable("embeddings", initializer=init_embeddings)
            self.encoder_emb_inp = tf.transpose(tf.nn.embedding_lookup(self.embeddings, self.X), perm=[1, 0, 2]) #encoder input
            self.decoder_emb_inp = tf.transpose(tf.nn.embedding_lookup(self.embeddings, self.decoder_input), perm=[1, 0, 2]) #decoder input
    
        with tf.name_scope("encoder"):
            fw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
            bw_cells = [self.cell(self.num_hidden) for _ in range(self.num_layers)]
            fw_cells = [rnn.DropoutWrapper(cell) for cell in fw_cells]
            bw_cells = [rnn.DropoutWrapper(cell) for cell in bw_cells]
    
            encoder_outputs, encoder_state_fw, encoder_state_bw = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
                fw_cells, bw_cells, self.encoder_emb_inp,
                sequence_length=self.X_len, time_major=True, dtype=tf.float32)
            self.encoder_output = tf.concat(encoder_outputs, 2)
            encoder_state_c = tf.concat((encoder_state_fw[0].c, encoder_state_bw[0].c), 1)
            encoder_state_h = tf.concat((encoder_state_fw[0].h, encoder_state_bw[0].h), 1)
            self.encoder_state = rnn.LSTMStateTuple(c=encoder_state_c, h=encoder_state_h)
    
        with tf.name_scope("decoder"), tf.variable_scope("decoder") as decoder_scope:
            decoder_cell = self.cell(self.num_hidden * 2)
    
            if not forward_only:
                attention_states = tf.transpose(self.encoder_output, [1, 0, 2])
                attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
                    self.num_hidden * 2, attention_states, memory_sequence_length=self.X_len, normalize=True)
                decoder_cell = tf.contrib.seq2seq.AttentionWrapper(decoder_cell, attention_mechanism,
                                                                   attention_layer_size=self.num_hidden * 2)
                initial_state = decoder_cell.zero_state(dtype=tf.float32, batch_size=self.batch_size)
                initial_state = initial_state.clone(cell_state=self.encoder_state)
                helper = tf.contrib.seq2seq.TrainingHelper(self.decoder_emb_inp, self.decoder_len, time_major=True)
                decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, helper, initial_state)
                outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, output_time_major=True, scope=decoder_scope)
                self.decoder_output = outputs.rnn_output
                self.logits = tf.transpose(
                    self.projection_layer(self.decoder_output), perm=[1, 0, 2])
                self.logits_reshape = tf.concat(
                    [self.logits, tf.zeros([self.batch_size, summary_max_len - tf.shape(self.logits)[1], self.vocabulary_size])], axis=1)
            else:
                tiled_encoder_output = tf.contrib.seq2seq.tile_batch(
                    tf.transpose(self.encoder_output, perm=[1, 0, 2]), multiplier=self.beam_width)
                tiled_encoder_final_state = tf.contrib.seq2seq.tile_batch(self.encoder_state, multiplier=self.beam_width)
                tiled_seq_len = tf.contrib.seq2seq.tile_batch(self.X_len, multiplier=self.beam_width)
                attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
                    self.num_hidden * 2, tiled_encoder_output, memory_sequence_length=tiled_seq_len, normalize=True)
                decoder_cell = tf.contrib.seq2seq.AttentionWrapper(decoder_cell, attention_mechanism,
                                                                   attention_layer_size=self.num_hidden * 2)
                initial_state = decoder_cell.zero_state(dtype=tf.float32, batch_size=self.batch_size * self.beam_width)
                initial_state = initial_state.clone(cell_state=tiled_encoder_final_state)
                decoder = tf.contrib.seq2seq.BeamSearchDecoder(
                    cell=decoder_cell,
                    embedding=self.embeddings,
                    start_tokens=tf.fill([self.batch_size], tf.constant(2)),
                    end_token=tf.constant(3),
                    initial_state=initial_state,
                    beam_width=self.beam_width,
                    output_layer=self.projection_layer
                )
                outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
                    decoder, output_time_major=True, maximum_iterations=summary_max_len, scope=decoder_scope)
                self.prediction = tf.transpose(outputs.predicted_ids, perm=[1, 2, 0])
    
        with tf.name_scope("loss"):
            if not forward_only:
                crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
                    logits=self.logits_reshape, labels=self.decoder_target)
                weights = tf.sequence_mask(self.decoder_len, summary_max_len, dtype=tf.float32)
                self.loss = tf.reduce_sum(crossent * weights / tf.to_float(self.batch_size))
    
                params = tf.trainable_variables()
                gradients = tf.gradients(self.loss, params)
                clipped_gradients, _ = tf.clip_by_global_norm(gradients, 5.0)
                optimizer = tf.train.AdamOptimizer(self.learning_rate)
                self.update = optimizer.apply_gradients(zip(clipped_gradients, params), global_step=self.global_step)
    

    Training:

import time
start = time.perf_counter()
import tensorflow as tf
import argparse
import pickle
import os

class args:
    pass

args.num_hidden = 150
args.num_layers = 2
args.beam_width = 10
args.glove = "store_true"
args.embedding_size = 300

args.learning_rate = 1e-3
args.batch_size = 64
args.num_epochs = 10
args.keep_prob = 0.8

args.toy = False  # "store_true"

args.with_model = "store_true"

if not os.path.exists("saved_model"):
    os.mkdir("saved_model")
else:
    if args.with_model:
        old_model_checkpoint_path = open('saved_model/', 'r')
        old_model_checkpoint_path = "".join(["saved_model/", old_model_checkpoint_path.read().splitlines()[0].split('"')[1]])

print("Building dictionary...")
word_dict, reverse_dict, article_max_len, summary_max_len = build_dict("train", args.toy)
print("Loading training dataset...")
train_x, train_y = build_dataset("train", word_dict, article_max_len, summary_max_len, args.toy)

tf.reset_default_graph()

with tf.Session() as sess:
    model = Model(reverse_dict, article_max_len, summary_max_len, args)
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver(tf.global_variables())
    if 'old_model_checkpoint_path' in globals():
        print("Continuing from previous trained model:", old_model_checkpoint_path, "...")
        saver.restore(sess, old_model_checkpoint_path)

    batches = batch_iter(train_x, train_y, args.batch_size, args.num_epochs)
    num_batches_per_epoch = (len(train_x) - 1) // args.batch_size + 1
    
    print("\nIteration starts.")
    print("Number of batches per epoch :", num_batches_per_epoch)
    for batch_x, batch_y in batches:
        batch_x_len = list(map(lambda x: len([y for y in x if y != 0]), batch_x))
        batch_decoder_input = list(map(lambda x: [word_dict["<s>"]] + list(x), batch_y))
        batch_decoder_len = list(map(lambda x: len([y for y in x if y != 0]), batch_decoder_input))
        batch_decoder_output = list(map(lambda x: list(x) + [word_dict["</s>"]], batch_y))
    
        batch_decoder_input = list(
            map(lambda d: d + (summary_max_len - len(d)) * [word_dict["<padding>"]], batch_decoder_input))
        batch_decoder_output = list(
            map(lambda d: d + (summary_max_len - len(d)) * [word_dict["<padding>"]], batch_decoder_output))
        
        train_feed_dict = {
            model.batch_size: len(batch_x),
            model.X: batch_x,
            model.X_len: batch_x_len,
            model.decoder_input: batch_decoder_input,
            model.decoder_len: batch_decoder_len,
            model.decoder_target: batch_decoder_output
        }
    
        _, step, loss = sess.run([model.update, model.global_step, model.loss], feed_dict=train_feed_dict)
    
        if step % 1000 == 0:
            print("step {0}: loss = {1}".format(step, loss))
    
        if step % num_batches_per_epoch == 0:
            hours, rem = divmod(time.perf_counter() - start, 3600)
            minutes, seconds = divmod(rem, 60)
            saver.save(sess, "C:/Users/sensen/OneDrive - HERE Global B.V-/Desktop/NLP/Open source libraries/Text summarization", global_step=step)
            print(" Epoch {0}: Model is saved.".format(step // num_batches_per_epoch),
            "Elapsed: {:0>2}:{:0>2}:{:05.2f}".format(int(hours),int(minutes),seconds) , "\n")
    

    I have converted the downloaded glove txt into pickle using the following code:

    import pickle
    import numpy as np

    f = open('C:/Users/sensen/OneDrive - HERE Global B.V-/Desktop/NLP/Open source libraries/Text summarization/glove.6B/glove.6B.300d.txt', 'r', encoding='UTF-8')
    g = open('glove.6B.300d_pickle', 'wb')
    word_dict = {}
    wordvec = []
    for idx, line in enumerate(f.readlines()):
        word_split = line.split(' ')
        word = word_split[0]
        word_dict[word] = idx
        d = word_split[1:]
        d[-1] = d[-1][:-1]   # strip the trailing newline from the last value
        d = [float(e) for e in d]
        wordvec.append(d)

    embedding = np.array(wordvec)
    pickling = {'embedding': embedding, 'word_dict': word_dict}
    pickle.dump(pickling, g)
    f.close()
    g.close()

    Can you help me solve the error?

    opened by senjutisen7 0