ktrain is a Python library that makes deep learning and AI more accessible and easier to apply

Overview | Tutorials | Examples | Installation | FAQ | How to Cite


Welcome to ktrain

News and Announcements

  • 2020-11-08:
    • ktrain v0.25.x is released and includes out-of-the-box support for text extraction via the textract package. This can be used, for example, in the SimpleQA.index_from_folder method to perform Question-Answering on large collections of PDFs, MS Word documents, or PowerPoint files. See the Question-Answering example notebook for more information.
# End-to-End Question-Answering in ktrain

# index documents of different types into a built-in search engine
from ktrain import text
INDEXDIR = '/tmp/myindex'
text.SimpleQA.initialize_index(INDEXDIR)
corpus_path = '/my/folder/of/documents' # contains .pdf, .docx, .pptx files in addition to .txt files
text.SimpleQA.index_from_folder(corpus_path, INDEXDIR, use_text_extraction=True, # enable text extraction
                              multisegment=True, procs=4, # these args speed up indexing
                              breakup_docs=True)          # this slows indexing but speeds up answer retrieval

# ask questions (setting higher batch size can further speed up answer retrieval)
qa = text.SimpleQA(INDEXDIR)
answers = qa.ask('What is ktrain?', batch_size=8)

# top answer snippet extracted from https://arxiv.org/abs/2004.10703:
#   "ktrain is a low-code platform for machine learning"
  • 2020-11-04:
    • ktrain v0.24.x is released and includes built-in support for exporting models to TensorFlow Lite and ONNX (see the v0.24.0 release notes below).
  • 2020-10-16:
    • ktrain v0.23.x is released with updates for compatibility with upcoming release of TensorFlow 2.4.

Overview

ktrain is a lightweight wrapper for the deep learning library TensorFlow Keras (and other libraries) to help build, train, and deploy neural networks and other machine learning models. Inspired by ML framework extensions like fastai and ludwig, ktrain is designed to make deep learning and AI more accessible and easier to apply for both newcomers and experienced practitioners. With only a few lines of code, ktrain allows you to easily and quickly:

  • employ fast, accurate, and easy-to-use pre-canned models for text, vision, graph, and tabular data

  • estimate an optimal learning rate for your model given your data using a Learning Rate Finder

  • utilize learning rate schedules such as the triangular policy, the 1cycle policy, and SGDR to effectively minimize loss and improve generalization

  • build text classifiers for any language (e.g., Arabic Sentiment Analysis with BERT, Chinese Sentiment Analysis with NBSVM)

  • easily train NER models for any language (e.g., Dutch NER)

  • load and preprocess text and image data from a variety of formats

  • inspect data points that were misclassified and provide explanations to help improve your model

  • leverage a simple prediction API for saving and deploying both models and data-preprocessing steps to make predictions on new raw data

Tutorials

Please see the tutorial notebooks in the project repository for a guide on how to use ktrain in your projects:

Some blog tutorials about ktrain are shown below:

ktrain: A Lightweight Wrapper for Keras to Help Train Neural Networks

BERT Text Classification in 3 Lines of Code

Text Classification with Hugging Face Transformers in TensorFlow 2 (Without Tears)

Build an Open-Domain Question-Answering System With BERT in 3 Lines of Code

Finetuning BERT using ktrain for Disaster Tweets Classification by Hamiz Ahmed

Examples

Tasks such as text classification and image classification can be accomplished easily with only a few lines of code.

Example: Text Classification of IMDb Movie Reviews Using BERT [see notebook]

import ktrain
from ktrain import text as txt

# load data
(x_train, y_train), (x_test, y_test), preproc = txt.texts_from_folder('data/aclImdb', maxlen=500, 
                                                                     preprocess_mode='bert',
                                                                     train_test_names=['train', 'test'],
                                                                     classes=['pos', 'neg'])

# load model
model = txt.text_classifier('bert', (x_train, y_train), preproc=preproc)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model, 
                             train_data=(x_train, y_train), 
                             val_data=(x_test, y_test), 
                             batch_size=6)

# find good learning rate
learner.lr_find()             # briefly simulate training to find good learning rate
learner.lr_plot()             # visually identify best learning rate

# train using 1cycle learning rate schedule for 3 epochs
learner.fit_onecycle(2e-5, 3) 
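
The trained model and its preprocessing steps can then be wrapped in a Predictor object for deployment, per the prediction API mentioned in the Overview. A minimal sketch (not part of the original example; the sample review text is made up):

# wrap model and preprocessing steps for predictions on new raw text
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.predict('This movie was hilarious from start to finish!')  # e.g., 'pos'

# save the predictor (model plus preprocessing steps) and reload it later
predictor.save('/tmp/my_imdb_predictor')
reloaded_predictor = ktrain.load_predictor('/tmp/my_imdb_predictor')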

Example: Classifying Images of Dogs and Cats Using a Pretrained ResNet50 model [see notebook]

import ktrain
from ktrain import vision as vis

# load data
(train_data, val_data, preproc) = vis.images_from_folder(
                                              datadir='data/dogscats',
                                              data_aug = vis.get_data_aug(horizontal_flip=True),
                                              train_test_names=['train', 'valid'], 
                                              target_size=(224,224), color_mode='rgb')

# load model
model = vis.image_classifier('pretrained_resnet50', train_data, val_data, freeze_layers=80)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model=model, train_data=train_data, val_data=val_data, 
                             workers=8, use_multiprocessing=False, batch_size=64)

# find good learning rate
learner.lr_find()             # briefly simulate training to find good learning rate
learner.lr_plot()             # visually identify best learning rate

# train using triangular policy with ModelCheckpoint and implicit ReduceLROnPlateau and EarlyStopping
learner.autofit(1e-4, checkpoint_folder='/tmp/saved_weights') 

Example: Sequence Labeling for Named Entity Recognition using a randomly initialized Bidirectional LSTM CRF model [see notebook]

import ktrain
from ktrain import text as txt

# load data
(trn, val, preproc) = txt.entities_from_txt('data/ner_dataset.csv',
                                            sentence_column='Sentence #',
                                            word_column='Word',
                                            tag_column='Tag', 
                                            data_format='gmb',
                                            use_char=True) # enable character embeddings

# load model
model = txt.sequence_tagger('bilstm-crf', preproc)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model, train_data=trn, val_data=val)


# conventional training for 1 epoch using a learning rate of 0.001 (Keras default for Adam optimizer)
learner.fit(1e-3, 1) 
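
The trained sequence tagger can likewise be applied to new sentences through the prediction API. A short sketch (the example sentence is illustrative):

# extract entities from new sentences; returns the tokens paired with their predicted tags
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.predict('As of 2014, Paris was the largest city in France.')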

Example: Node Classification on Cora Citation Graph using a GraphSAGE model [see notebook]

import ktrain
from ktrain import graph as gr

# load data with supervision ratio of 10%
(trn, val, preproc) = gr.graph_nodes_from_csv(
                                              'cora.content', # node attributes/labels
                                              'cora.cites',   # edge list
                                              sample_size=20,
                                              holdout_pct=None,
                                              holdout_for_inductive=False,
                                              train_pct=0.1, sep='\t')

# load model
model = gr.graph_node_classifier('graphsage', trn)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=64)


# find good learning rate
learner.lr_find(max_epochs=100) # briefly simulate training to find good learning rate
learner.lr_plot()               # visually identify best learning rate

# train using triangular policy with ModelCheckpoint and implicit ReduceLROnPlateau and EarlyStopping
learner.autofit(0.01, checkpoint_folder='/tmp/saved_weights')

Example: Text Classification with Hugging Face Transformers on 20 Newsgroups Dataset Using DistilBERT [see notebook]

# load text data
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
train_b = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)
test_b = fetch_20newsgroups(subset='test', categories=categories, shuffle=True)
(x_train, y_train) = (train_b.data, train_b.target)
(x_test, y_test) = (test_b.data, test_b.target)

# build, train, and validate model (Transformer is wrapper around transformers library)
import ktrain
from ktrain import text
MODEL_NAME = 'distilbert-base-uncased'
t = text.Transformer(MODEL_NAME, maxlen=500, class_names=train_b.target_names)
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(5e-5, 4)
learner.validate(class_names=t.get_classes()) # class_names must be string values

# Output from learner.validate()
#                        precision    recall  f1-score   support
#
#           alt.atheism       0.92      0.93      0.93       319
#         comp.graphics       0.97      0.97      0.97       389
#               sci.med       0.97      0.95      0.96       396
#soc.religion.christian       0.96      0.96      0.96       398
#
#              accuracy                           0.96      1502
#             macro avg       0.95      0.96      0.95      1502
#          weighted avg       0.96      0.96      0.96      1502
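
The fine-tuned transformer can also be wrapped in a Predictor; note that the Transformer preprocessor is supplied via the preproc argument so the tokenizer settings are saved alongside the model. A minimal sketch (the sample document is made up):

# wrap model and Transformer preprocessor for predictions on new documents
predictor = ktrain.get_predictor(learner.model, preproc=t)
predictor.predict('Jesus Christ is the central figure of Christianity.')  # e.g., 'soc.religion.christian'
predictor.save('/tmp/my_20newsgroups_predictor')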

Example: Tabular Classification for Titanic Survival Prediction Using an MLP [see notebook]

import ktrain
from ktrain import tabular
import pandas as pd
train_df = pd.read_csv('train.csv', index_col=0)
train_df = train_df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
trn, val, preproc = tabular.tabular_from_df(train_df, label_columns=['Survived'], random_state=42)
learner = ktrain.get_learner(tabular.tabular_classifier('mlp', trn), train_data=trn, val_data=val)
learner.lr_find(show_plot=True, max_epochs=5) # estimate learning rate
learner.fit_onecycle(5e-3, 10)

# evaluate held-out labeled test set
tst = preproc.preprocess_test(pd.read_csv('heldout.csv', index_col=0))
learner.evaluate(tst, class_names=preproc.get_classes())
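
For tabular models, the Predictor operates directly on rows of a raw DataFrame. A short sketch (assuming the same heldout.csv used above):

# make predictions on new rows of raw tabular data
predictor = ktrain.get_predictor(learner.model, preproc)
preds = predictor.predict(pd.read_csv('heldout.csv', index_col=0), return_proba=True)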

Using ktrain on Google Colab? Colab versions of these examples are available in the project repository.

Additional examples can be found here.

Installation

  1. Make sure pip is up-to-date with: pip install -U pip

  2. Install TensorFlow 2 if it is not already installed (e.g., pip install tensorflow)

  3. Install ktrain: pip install ktrain

The above should be all you need on Linux systems and cloud computing environments like Google Colab and AWS EC2. If you are using ktrain on a Windows computer, you can follow these more detailed instructions that include some extra steps.
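
To verify the environment before proceeding, a quick sanity check such as the following can be run (a minimal sketch):

# confirm that TensorFlow 2 and ktrain import cleanly
import tensorflow as tf
assert tf.__version__.startswith('2'), 'ktrain requires TensorFlow 2'
import ktrain
print(ktrain.__version__)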

Some important things to note about installation:

  • If using ktrain with tensorflow<=2.1, you must also downgrade the transformers library to transformers==3.1.
  • As of v0.21.x, ktrain no longer installs TensorFlow 2 automatically. As indicated above, you should install TensorFlow 2 yourself before installing and using ktrain. On Google Colab, TensorFlow 2 should already be installed. You should be able to use ktrain with any version of TensorFlow 2. Note, however, that there is a bug in TensorFlow 2.2 and 2.3 affecting the Learning-Rate-Finder that will not be fixed until TensorFlow 2.4. The bug causes the learning-rate finder to complete all epochs even after the loss has diverged (i.e., no automatic stopping).
  • If using ktrain on a local machine with a GPU (versus Google Colab, for example), you'll need to install GPU support for TensorFlow 2.
  • Since some ktrain dependencies have not yet been migrated to tf.keras in TensorFlow 2 (or may have other issues), ktrain is temporarily using forked versions of some libraries. Specifically, ktrain uses forked versions of the eli5 and stellargraph libraries. If not installed, ktrain will complain when a method or function needing either of these libraries is invoked. To install these forked versions, you can do the following:
pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1
pip install git+https://github.com/amaiya/stellargraph@no_tf_dep_082

This code was tested on Ubuntu 18.04 LTS using TensorFlow 2.3.1 and Python 3.6.9.

How to Cite

Please cite the following paper when using ktrain:

@article{maiya2020ktrain,
    title={ktrain: A Low-Code Library for Augmented Machine Learning},
    author={Arun S. Maiya},
    year={2020},
    eprint={2004.10703},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    journal={arXiv preprint arXiv:2004.10703},
}


Creator: Arun S. Maiya

Email: arun [at] maiya [dot] net

Issues
  • No support for loading pretrained Hugging Face Transformers models from a local path?

    Failed to load a pretrained Hugging Face Transformers model from my local machine. It seems only the hard-coded models in the code can be loaded.

    MODEL_NAME = "D:\programming\models\tf_rbtl"
    t = text.Transformer(MODEL_NAME, maxlen=500,  
                         classes=["0", "1"])
    
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-15-e20b30887588> in <module>
          1 t = text.Transformer(MODEL_NAME, maxlen=500,  
    ----> 2                      classes=["0", "1"])
    
    d:\anaconda3.5\envs\adverse\lib\site-packages\ktrain\text\preprocessor.py in __init__(self, model_name, maxlen, classes, batch_size, multilabel, use_with_learner)
        838             raise ValueError('classes argument is required when multilabel=True')
        839         super().__init__(model_name,
    --> 840                          maxlen, max_features=10000, classes=classes, multilabel=multilabel)
        841         self.batch_size = batch_size
        842         self.use_with_learner = use_with_learner
    
    d:\anaconda3.5\envs\adverse\lib\site-packages\ktrain\text\preprocessor.py in __init__(self, model_name, maxlen, max_features, classes, lang, ngram_range, multilabel)
        719         self.name = model_name.split('-')[0]
        720         if self.name not in TRANSFORMER_MODELS:
    --> 721             raise ValueError('uknown model name %s' % (model_name))
        722         self.model_type = TRANSFORMER_MODELS[self.name][1]
        723         self.tokenizer_type = TRANSFORMER_MODELS[self.name][2]
    
    ValueError: uknown model name D:\programming\models\tf_rbtl
    
    opened by WangHexie 22
  • How to save SimpleQA trained model?

    Hi,

    I've tried the provided sample for SimpleQA. In my output, it gave me:

    <IPython.core.display.HTML object>

    which I assume is the:

    #qa.display_answers(answers[:5])

    if I re-run the sample code, it complains that there's already a directory where it tries to create the index (good). If I leave out everything else and rerun:

    qa = text.SimpleQA(INDEXDIR)

    it starts training again... another three hours :(

    This is my code now so I should get some output:

    # load 20newsgroups dataset into an array
    #from sklearn.datasets import fetch_20newsgroups
    #remove = ('headers', 'footers', 'quotes')
    #newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
    #newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)
    #docs = newsgroups_train.data + newsgroups_test.data

    import ktrain
    from ktrain import text

    INDEXDIR = '/tmp/qa'

    #text.SimpleQA.initialize_index(INDEXDIR)
    #text.SimpleQA.index_from_folder('./Philosophy', INDEXDIR)

    qa = text.SimpleQA(INDEXDIR)

    answers = qa.ask('Why are we here?')
    top_answer = answers[0]['answer']
    print(top_answer)
    top_answer = answers[1]['answer']
    print(top_answer)
    top_answer = answers[2]['answer']
    print(top_answer)
    top_answer = answers[3]['answer']
    print(top_answer)
    top_answer = answers[4]['answer']
    print(top_answer)

    #qa.display_answers(answers[:5])

    How do I reload my already trained model?

    opened by DunnCreativeSS 19
  • Load and use trained model with ktrain

    I have a trained model in .h5 and .preproc file format for racial recognition using the ktrain library. How do I load and use the trained model at a later time?

    user question 
    opened by bulioses 18
  • error with text.Transformer - roberta-base

    Hi, I am trying to train a model based on "roberta-base". I tried to run it on EC2 (p3.16xl), and I got this error:

    Traceback (most recent call last):
      File "ktrain_transformer_training.py", line 119, in <module>
        learner, preproc = train_transformer(x_train, y_train, x_test, y_test)
      File "ktrain_transformer_training.py", line 98, in train_transformer
        model = t.get_classifier()
      File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/ktrain/text/preprocessor.py", line 1041, in get_classifier
        model = self._load_pretrained(mname, num_labels)
      File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/ktrain/text/preprocessor.py", line 1006, in _load_pretrained
        raise ValueError('could not load pretrained model %s using both from_pt=False and from_pt=True' % (mname))
    ValueError: could not load pretrained model roberta-base using both from_pt=False and from_pt=True
    

    However, the same code runs perfectly on my local machine. Can you help with this issue? Thanks

    user question 
    opened by Liranbz 16
  • Models trained with ktrain do not work with Flask, uWSGI and NGINX

    Hi, I am using ktrain to replace some of my Keras-built models in production deployment. I have noticed a problem with models trained with ktrain. I think I might need to set some extra params in my NGINX and uWSGI setup when using ktrain, because I can't simply replace the older models with ktrain models.

    I am using Flask, uWSGI, and NGINX to deploy my models. This setup is already in place with models trained with traditional Keras and TF2. But If I replace my Text Classification model with ktrain then it stops working.

    I have checked it individually with Flask and uWSGI, and it works fine. But as soon as I add the NGINX server setup, it stops working. There is something happening inside the ktrain APIs that is breaking it, because if I do not use the ktrain APIs everything works perfectly with the full setup.

    At the front end, it shows a Server Timeout Error. I have checked the internal logs to identify the issue, and it is happening because uWSGI is not returning anything to NGINX. If I run only Flask and uWSGI with the ktrain model, it runs. I also tried increasing the timeout to 5m, 10m, and 30m, but the connection timeout still occurs. It happens only when I call an API that uses ktrain models; other APIs that do not use ktrain models work perfectly fine.

    I had one more problem with ktrain on the Flask server, but I resolved that by turning off auto_reload on file change. It seems ktrain downloads or writes something in the local directory, which is why Flask was reloading when I called prediction. That part is resolved, though.

    I have tried BERT, DistilBERT, fastText, GRU, and many other approaches to figure out why it is not working with NGINX. Can you add your thoughts about what could be the reason?

    I have also created a [User Guide](https://drive.google.com/file/d/1VW421zkmXkiQdoO1NWhVe21QOgVhVUnc/view?usp=sharing) to show you the server settings. It will help identify the exact problem. If you can look at it and add your thoughts, it would be a great help. I want to use ktrain in production but am stuck here.

    opened by laxmimerit 13
  • Cannot get learner from iterator zip object

    get_learner fails when the training data is a zip of iterators such as when it is used for image segmentation tasks (while augmenting images and masks together).

    EDIT:

    It works by hacking together a custom Iterator class, but it's not a particularly elegant hack...

    image_gen and mask_gen below are keras.preprocessing.image.ImageDataGenerator.flow_from_directory() objects.

    
    class Iterator():
        
        def __init__(self, image_gen, mask_gen):
            self.image_gen = image_gen
            self.mask_gen = mask_gen
            self.batch_size = image_gen.batch_size
            self.target_size = image_gen.target_size
            self.color_mode = image_gen.color_mode
            self.class_mode = image_gen.class_mode
            self.n = image_gen.n
            self.seed = image_gen.seed
            self.total_batches_seen = image_gen.total_batches_seen
        
        def __iter__(self):
            return self
        
        def __next__(self):
            return next(self.image_gen), next(self.mask_gen)
        
        def __getitem__(self, key):
            return self.image_gen[key], self.mask_gen[key]
    

    Any ideas how we could make this more elegant?

    enhancement 
    opened by gkaissis 13
  • Regarding Deployment on Flask

    Hi, I have an issue regarding deployment: I am not able to deploy a ktrain multi-class text classification model. I tried to load the model and .preproc file, but it does not work.

    user question 
    opened by ianuragbhatt 13
  • Support for long sequence classification (transformer models)

    It would be great if support could be added for long sequences over ~500 word pieces for the transformers models.

    Possible methods:

    1. sliding window over each sequence to generate sub-sequences, whose outputs are averaged before being fed to the classifier layer
    2. sliding window over each sequence to generate sub-sequences, whose outputs are fed to an LSTM layer
    enhancement 
    opened by mdavis95 12
  • Validation accuracy is not changing

    Hi... I am using ktrain for a text classification task. I observed that the validation accuracy is not changing; it stays the same. Please advise what could be the reason.

    Piece of code:

    MODEL_NAME = 'albert-large-v2'
    t = text.Transformer(MODEL_NAME, maxlen=500, class_names=train_df.tweet_label_int.unique())
    trn = t.preprocess_train(x_train, y_train)
    val = t.preprocess_test(x_test, y_test)
    model = t.get_classifier()
    learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
    learner.fit_onecycle(8e-5, 20)

    I feel I have to change something in the optimizer or activation function, but I don't have any idea where or how. I would appreciate it if you could guide me, considering I am a newbie.

    PS: Running on Google Colab, using (!pip3 install ktrain)

    user question 
    opened by usmaann 11
  • Reloading learner from saved predictor model files

    Hi. Thanks for the awesome tool! I managed to get it to work as described in the tutorials. However, I'm using BERT with a huge dataset so one epoch takes hours. On top of that, I'm using Google Colab which has time limits for GPU use. Because of this, I was hoping to save the model, reload and then call learner.fit_onecycle again to continue the training for some more epochs.

    I have successfully saved the predictor files from a few epochs, and I can reload them to make predictions. What I'm hoping to do now is get the learner class from them, but looking at the source code, there's no way to do this outright. I moved on to trying to load the model file itself and build the learner by calling ktrain.get_learner() again, but ktrain.load_model() throws an error of

    Unknown layer: TokenEmbedding

    I've also thought about going through the entire process again up to building the model as prescribed, then setting the weights and getting the learner:

    model = text.text_classifier("bert", train_data = (xTrain, yTrain), preproc = preproc)
    #Set model weights here using model.load_weights(pathToCheckpointFile)
    learner = ktrain.get_learner(model, train_data = (xTrain, yTrain), batch_size = 12)
    

    This feels kind of hackish, though, since I'm not using the saved model files. Will this have the same effect, or am I missing something in the source code for building the learner from the predictor?

    opened by rcmcabral 11
  • Facing Issue in learner.lr_find()

    I am facing an issue while calling the learner.lr_find() method (screenshot attached for reference). Basically, I am calling a DistilBERT model using TFAutoModel from Hugging Face, and I am adding an FFN head at the end using Keras for classification. Then I imported ktrain and fed the model into the learner. The purpose of using ktrain here is to make use of the fit_onecycle() method.

    opened by aravinddeveloper 0
  • Offline Training with Transformers

    I was following this link to do offline training and prediction for a text classification task. I downloaded the three required files (TF model, vocab, and config) into a directory which I named 'bert-base-uncased', but it wouldn't work because it said the tokenizer files weren't found in that directory. However, when I changed the directory name to 'base-uncased', it worked. It seems strange that it wouldn't work before but works after changing the directory name.

    I'd like to work on this issue and submit a PR if you're okay with it. Please confirm whether this is an issue, thank you!

    opened by MocktaiLEngineer 1
  • Unable to load the model - Could not find TensorFlow version of model. Attempting to download/load PyTorch version as TensorFlow model using from_pt=True

    Hi, I am trying to create a simple web app for my DistilBERT model, but the web app runs into an error when the program starts executing the ktrainResult function, where it loads the model at line 14.

    The DistilBert_Model is the folder where I saved my trained model, which I trained in a notebook via predictor.save('path'). Then I copied this folder to my PyCharm workspace to load the model for prediction. [screenshot]

    opened by mingjun1120 11
  • Kernel Dead while training a model!

    I am using a GPU to run my program. I have installed tensorflow-gpu==2.5.0, tensorflow==2.5.0, and ktrain==0.27.2 in my virtual environment (tensorflowGPU) through Anaconda Prompt, but the kernel dies when it comes to training a model. I have installed the correct CUDA toolkit and cuDNN on my Windows system, but the kernel still dies 😢

    How to solve this issue?

    [screenshots: errors and virtual environment installation]

    opened by mingjun1120 4
  • upgrade semantic search

    ktrain currently only supports semantic searches in the context of topic models and LDA. This enhancement will add support for semantic searches using transformer embeddings.

    enhancement 
    opened by amaiya 3
  • Plans for speech2text and text2speech?

    Hi there!

    Thank you for this simplified library! Do you plan to implement speech2text and text2speech in ktrain?

    enhancement 
    opened by marsel171 1
  • fix #91 tokenizer load error

    fix #91 tokenizer load error when using some Hugging Face models like clue/albert_chinese_small and other Chinese ALBERT models which need to be loaded with BertTokenizer

    opened by WangHexie 2
  • Added TFIDF preprocessor

    Using the same preprocessing pipeline, namely keras.preprocessing.text.Tokenizer, I have added a TFIDF-based preprocessing mode: texts_to_matrix(..., mode='tfidf'). Testing is OK-ish: since this is a BOW approach, I had to modify the assertion for text undo to accommodate it.

    What is still missing is ngram support; I could not add this now.

    Hope you might find this interesting!

    opened by solalatus 4
Releases(v0.28.2)
  • v0.28.0(Oct 13, 2021)

    0.28.0 (2021-10-13)

    New:

    • text.AnswerExtractor is a universal information extractor powered by a Question-Answering module and capable of extracting user-specified information from texts.
    • text.TextExtractor is a text extraction pipeline (e.g., convert PDFs to plain text)

    Changed

    • changed transformers pin to transformers>=4.0.0,<=4.10.3

    Fixed:

    • N/A
  • v0.27.3(Sep 3, 2021)

    0.27.3 (2021-09-03)

    New:

    • N/A

    Changed

    • N/A

    Fixed:

    • SimpleQA now can load PyTorch question-answering checkpoints
    • change API call to support newest causalnlp
  • v0.27.2(Jul 28, 2021)

    0.27.2 (2021-07-28)

    New:

    • N/A

    Changed

    • N/A

    Fixed:

    • check for logits attribute when predicting using transformers
    • change raised Exception to warning for longer sequence lengths for transformers
  • v0.27.1(Jul 20, 2021)

  • v0.27.0(Jul 20, 2021)

  • v0.26.5(Jul 16, 2021)

    0.26.5 (2021-07-15)

    New:

    • N/A

    Changed

    • added query parameter to SimpleQA.ask so that an alternative query can be used to retrieve contexts from corpus
    • added chardet as dependency for stellargraph

    Fixed:

    • fixed issue with TopicModel.build when threshold=None
  • v0.26.4(Jun 23, 2021)

    0.26.4 (2021-06-23)

    New:

    • API documentation index

    Changed

    • Added warning when a TensorFlow version of selected transformers model is not available and the PyTorch version is being downloaded and converted instead using from_pt=True.

    Fixed:

    • Fixed utils.metrics_from_model to support alternative metrics
    • Check for AUC in the ktrain.utils "inspect" function
  • v0.26.3(May 19, 2021)

    0.26.3 (2021-05-19)

    New:

    • N/A

    Changed

    • shallownlp.ner.NER.predict processes lists of sentences in batches resulting in faster predictions
    • batch_size argument added to shallownlp.ner.NER.predict
    • added verbose parameter to ktrain.text.textutils.extract_copy to optionally see why each skipped document was skipped

    Fixed:

    • Changed TextPredictor.save to save Hugging Face tokenizer files locally to ensure they can easily be reloaded when text.Transformer is supplied with a local path.
    • For transformers models, the predictor.preproc.model_name variable is automatically updated to be the new Predictor folder, to avoid having users manually update model_name. This applies when a local path is supplied to text.Transformer and the resultant Predictor is moved to a new machine.
  • v0.26.2(Mar 26, 2021)

    0.26.2 (2021-03-26)

    New:

    • N/A

    Changed

    • NERPredictor.predict now optionally accepts lists of sentences to make sequence-labeling predictions in batches (as all other Predictor instances already do).

    Fixed:

    • N/A
  • v0.26.1(Mar 10, 2021)

    0.26.1 (2021-03-11)

    New:

    • N/A

    Changed

    • expose errors from transformers in _load_pretrained
    • changed TextPreprocessor.check_trained to be a warning instead of Exception

    Fixed:

    • N/A
  • v0.26.0(Mar 9, 2021)

    0.26.0 (2021-03-10)

    New:

    • Support for transformers 4.0 and above.

    Changed

    • added set_tokenizer to TransformerPreprocessor
    • show error message when original weights cannot be saved (for reset_weights method)

    Fixed:

    • cast filename to string before concatenating with suffix in images_from_csv and images_from_df (addresses issue #330)
    • resolved import error for sklearn>=0.24.0, but eli5 still requires sklearn<0.24.0.
  • v0.25.4(Jan 28, 2021)

    0.25.4 (2021-01-26)

    New:

    • N/A

    Changed

    • N/A

    Fixed:

    • fixed problem with LabelEncoder not properly being stored when texts_from_df is invoked
    • refrain from invoking max on empty sequence (#307)
    • corrected issue with return_proba=True in NER predictions (#316)
  • v0.25.3(Dec 23, 2020)

    0.25.3 (2020-12-23)

    New:

    • N/A

    Changed

    • A steps_per_epoch argument has been added to all *fit* methods that operate on generators
    • Added get_tokenizer methods to all instances of TextPreprocessor

    Fixed:

    • propagate custom metrics to model when distilbert is chosen in text_classifier and text_regression_model functions
    • pin scikit-learn to 0.24.0 due to breaking change
  • v0.25.2(Dec 5, 2020)

    0.25.2 (2020-12-05)

    New:

    • N/A

    Changed

    • N/A

    Fixed:

    • Added custom_objects argument to load_predictor to load models with custom loss functions, etc.
    • Fixed bug #286 related to length computation when use_dynamic_shape=True
  • v0.25.1(Dec 2, 2020)

    0.25.1 (2020-12-02)

    New:

    • N/A

    Changed

    • Added use_dynamic_shape parameter to text.preprocessor.hf_convert_examples, which is set to True when running predictions. This reduces the input length when making predictions, if possible.
    • Added warnings to some imports in imports.py to allow for slightly lighter-weight deployments
    • Temporarily pinning to transformers>=3.1,<4.0 due to breaking changes in v4.0.

    Fixed:

    • Suppress progress bar in predictor.predict for keras_bert models
    • Fixed typo causing problems when loading predictor for Inception models
    • Fixes to address documented/undocumented breaking changes in transformers>=4.0. But, temporarily pinning to transformers>=3.1,<4.0 for backwards compatibility.
  • v0.25.0(Nov 8, 2020)

    0.25.0 (2020-11-08)

    New:

    • The SimpleQA.index_from_folder method now supports text extraction from many file types including PDFs, MS Word documents, and MS PowerPoint files (i.e., set use_text_extraction=True to use this feature).

    Changed

    • The default in SimpleQA.index_from_list and SimpleQA.index_from_folder has been changed to breakup_docs=True.

    Fixed:

    • N/A
  • v0.24.2(Nov 8, 2020)

    0.24.2 (2020-11-07)

    New:

    • N/A

    Changed

    • ktrain.text.textutils.extract_copy now uses textract to extract text from many file types (e.g., PDF, DOC, PPT) instead of just PDFs.

    Fixed:

    • N/A
  • v0.24.1(Nov 6, 2020)

    0.24.1 (2020-11-06)

    New:

    • N/A

    Changed

    • N/A

    Fixed:

    • Change exception in model ID check in Translator to warning to better allow offline language translations
  • v0.24.0(Nov 6, 2020)

    0.24.0 (2020-11-05)

    New:

    • Predictor instances now provide built-in support for exporting to TensorFlow Lite and ONNX.

    Changed

    • N/A

    Fixed:

    • N/A
  • v0.23.2(Oct 27, 2020)

    0.23.2 (2020-10-27)

    New:

    • N/A

    Changed

    • Use fast tokenizers for the following Hugging Face transformers models: BERT, DistilBERT, and RoBERTa. This change affects models created with either text.Transformer(...) or text.text_classifier('distilbert', ...). BERT models created with text.text_classifier('bert', ...), which uses keras_bert instead of transformers, are not affected by this change.

    Fixed:

    • N/A
  • v0.23.1(Oct 26, 2020)

    0.23.1 (2020-10-26)

    New:

    • N/A

    Changed

    • N/A

    Fixed:

    • Resolved issue in the qa.ask method occurring with embedding computations when full answer sentences exceed 512 tokens.
  • v0.23.0(Oct 16, 2020)

    0.23.0 (2020-10-16)

    New:

    • Support for upcoming release of TensorFlow 2.4 such as removal of references to obsolete multi_gpu_model

    Changed

    • [breaking change] TopicModel.get_docs now returns a list of dicts instead of a list of tuples. Each dict has keys: text, doc_id, topic_proba, topic_id.
    • added TopicModel.get_document_topic_distribution
    • added TopicModel.get_sorted_docs method to return all documents sorted by relevance to a given topic_id

    Fixed:

    • Changed version check warning in lr_find to a raised Exception to avoid confusion when warnings from ktrain are suppressed
    • Pass verbose parameter to hf_convert_examples
  • v0.22.4(Oct 12, 2020)

    0.22.4 (2020-10-12)

    New:

    • N/A

    Changed

    • changed qa.core.display_answers to make URLs open in new tab

    Fixed:

    • pin to seqeval==0.0.19 due to numpy version incompatibility with latest TensorFlow and to suppress errors during installation
  • v0.22.3(Oct 9, 2020)

    0.22.3 (2020-10-09)

    New:

    • N/A

    Changed

    • N/A

    Fixed:

    • fixed issue with missing noun phrase at end of sentence in extract_noun_phrases
    • fixed TensorFlow versioning issues with utils.metrics_from_model
  • v0.22.2(Oct 9, 2020)

    0.22.2 (2020-10-09)

    New:

    • added extract_noun_phrases to textutils

    Changed

    • SimpleQA.ask now includes an include_np parameter. When True, noun phrases will be used to retrieve documents containing candidate answers.

    Fixed:

    • N/A
  • v0.22.1(Oct 8, 2020)

    0.22.1 (2020-10-08)

    New:

    • N/A

    Changed

    • added optional references argument to SimpleQA.index_from_list
    • added min_words argument to SimpleQA.index_from_list and SimpleQA.index_from_folder to prune small documents or paragraphs that are unlikely to include good answers
    • qa.display_answers now supports hyperlinks for document references

    Fixed:

    • N/A
  • v0.22.0(Oct 6, 2020)

    0.22.0 (2020-10-06)

    New:

    • added breakup_docs argument to index_from_list and index_from_folder that potentially speeds up the ask method substantially
    • added batch_size argument to ask and set default at 8 for faster answer-retrieval

    Changed

    • refactored QA and SimpleQA for better extensibility

    Fixed:

    • Ensure save_path is correctly processed in Learner.evaluate
  • v0.21.4(Sep 24, 2020)

    0.21.4 (2020-09-24)

    New:

    • N/A

    Changed

    • Changed installation instructions in README.md to reflect that using ktrain with TensorFlow 2.1 will require downgrading transformers to 3.1.0.
    • updated requirements with keras_bert>=0.86.0 due to TensorFlow 2.3 error with older versions of keras_bert
    • In lr_find and lr_plot, check for TF 2.2 or 2.3 and make necessary adjustments due to TF bug 41174.

    Fixed:

    • fixed typos in __all__ in text and graph modules (PR #250)
    • fixed Chinese language translation based on name-changes of models with zh as source language
  • v0.21.3(Sep 8, 2020)

    0.21.3 (2020-09-08)

    New:

    • N/A

    Changed

    • added TopicModel.get_word_weights method to retrieve the word weights for a given topic
    • added return_fig option to Learner.lr_plot and Learner.plot, which allows the matplotlib Figure to be returned to user

    Fixed:

    • N/A
  • v0.21.2(Sep 3, 2020)
