ktrain is a Python library that makes deep learning and AI more accessible and easier to apply

Arun S. Maiya

Last update: Jan 2, 2023

Related tags

Deep Learning python nlp machine-learning computer-vision deep-learning tensorflow keras tabular-data graph-neural-networks


Welcome to ktrain

News and Announcements

2020-11-08:
- ktrain v0.25.x is released and includes out-of-the-box support for text extraction via the textract package. This, for example, can be used in the SimpleQA.index_from_folder method to perform Question-Answering on large collections of PDFs, MS Word documents, or PowerPoint files. See the Question-Answering example notebook for more information.

# End-to-End Question-Answering in ktrain

# index documents of different types into a built-in search engine
from ktrain import text
INDEXDIR = '/tmp/myindex'
text.SimpleQA.initialize_index(INDEXDIR)
corpus_path = '/my/folder/of/documents' # contains .pdf, .docx, .pptx files in addition to .txt files
text.SimpleQA.index_from_folder(corpus_path, INDEXDIR, use_text_extraction=True, # enable text extraction
                              multisegment=True, procs=4, # these args speed up indexing
                              breakup_docs=True)          # this slows indexing but speeds up answer retrieval

# ask questions (setting higher batch size can further speed up answer retrieval)
qa = text.SimpleQA(INDEXDIR)
answers = qa.ask('What is ktrain?', batch_size=8)

# top answer snippet extracted from https://arxiv.org/abs/2004.10703:
#   "ktrain is a low-code platform for machine learning"

2020-11-04:
- ktrain v0.24.x is released and now includes built-in support for exporting models to ONNX and TensorFlow Lite. See the example notebook for more information.

2020-10-16:
- ktrain v0.23.x is released with updates for compatibility with the upcoming release of TensorFlow 2.4.

Overview

ktrain is a lightweight wrapper for the deep learning library TensorFlow Keras (and other libraries) to help build, train, and deploy neural networks and other machine learning models. Inspired by ML framework extensions like fastai and ludwig, ktrain is designed to make deep learning and AI more accessible and easier to apply for both newcomers and experienced practitioners. With only a few lines of code, ktrain allows you to easily and quickly:

employ fast, accurate, and easy-to-use pre-canned models for text, vision, graph, and tabular data:
- text data:
  - Text Classification: BERT, DistilBERT, NBSVM, fastText, and other models [example notebook]
  - Text Regression: BERT, DistilBERT, Embedding-based linear text regression, fastText, and other models [example notebook]
  - Sequence Labeling (NER): Bidirectional LSTM with optional CRF layer and various embedding schemes such as pretrained BERT and fastText word embeddings and character embeddings [example notebook]
  - Ready-to-Use NER models for English, Chinese, and Russian with no training required [example notebook]
  - Sentence Pair Classification for tasks like paraphrase detection [example notebook]
  - Unsupervised Topic Modeling with LDA [example notebook]
  - Document Similarity with One-Class Learning: given some documents of interest, find and score new documents that are thematically similar to them using One-Class Text Classification [example notebook]
  - Document Recommendation Engines and Semantic Searches: given a text snippet from a sample document, recommend documents that are semantically-related from a larger corpus [example notebook]
  - Text Summarization: summarize long documents with a pretrained BART model - no training required [example notebook]
  - End-to-End Question-Answering: ask a large text corpus questions and receive exact answers [example notebook]
  - Easy-to-Use Built-In Search Engine: perform keyword searches on large collections of documents [example notebook]
  - Zero-Shot Learning: classify documents into user-provided topics without training examples [example notebook]
  - Language Translation: translate text from one language to another [example notebook]
- vision data:
  - image classification (e.g., ResNet, Wide ResNet, Inception) [example notebook]
  - image regression for predicting numerical targets from photos (e.g., age prediction) [example notebook]
- graph data:
  - node classification with graph neural networks (GraphSAGE) [example notebook]
  - link prediction with graph neural networks (GraphSAGE) [example notebook]
- tabular data:
  - tabular classification (e.g., Titanic survival prediction) [example notebook]
  - tabular regression (e.g., predicting house prices) [example notebook]
estimate an optimal learning rate for your model given your data using a Learning Rate Finder
utilize learning rate schedules such as the triangular policy, the 1cycle policy, and SGDR to effectively minimize loss and improve generalization
build text classifiers for any language (e.g., Arabic Sentiment Analysis with BERT, Chinese Sentiment Analysis with NBSVM)
easily train NER models for any language (e.g., Dutch NER)
load and preprocess text and image data from a variety of formats
inspect data points that were misclassified and provide explanations to help improve your model
leverage a simple prediction API for saving and deploying both models and data-preprocessing steps to make predictions on new raw data
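
To make the learning rate schedules named above more concrete, here is a minimal, framework-free sketch of the triangular policy, which ramps the learning rate linearly between a lower and an upper bound. The function name `triangular_lr` and its parameters are assumptions made for illustration and are not part of ktrain's API; ktrain applies its schedules internally through the `Learner` training methods.

```python
def triangular_lr(step, base_lr, max_lr, step_size):
    """Learning rate at `step` for a triangular cyclical schedule.

    The rate climbs linearly from base_lr to max_lr over `step_size` steps,
    then descends linearly back to base_lr over the next `step_size` steps.
    """
    cycle_pos = step % (2 * step_size)        # position inside the current cycle
    distance = abs(cycle_pos - step_size)     # distance from the peak of the cycle
    scale = 1.0 - distance / step_size        # 0 at the cycle ends, 1 at the peak
    return base_lr + (max_lr - base_lr) * scale


# One full cycle: 4 steps up, 4 steps down (values chosen only for illustration).
schedule = [triangular_lr(s, base_lr=1e-5, max_lr=1e-4, step_size=4) for s in range(9)]
```

The schedule starts and ends at `base_lr` and peaks at `max_lr` halfway through the cycle.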

Tutorials

Please see the following tutorial notebooks for a guide on how to use ktrain on your projects:

Tutorial 1: Introduction
Tutorial 2: Tuning Learning Rates
Tutorial 3: Image Classification
Tutorial 4: Text Classification
Tutorial 5: Learning from Unlabeled Text Data
Tutorial 6: Text Sequence Tagging for Named Entity Recognition
Tutorial 7: Graph Node Classification with Graph Neural Networks
Tutorial 8: Tabular Classification and Regression
Tutorial A1: Additional tricks, which covers topics such as previewing data augmentation schemes, inspecting intermediate output of Keras models for debugging, setting global weight decay, and use of built-in and custom callbacks.
Tutorial A2: Explaining Predictions and Misclassifications
Tutorial A3: Text Classification with Hugging Face Transformers
Tutorial A4: Using Custom Data Formats and Models: Text Regression with Extra Regressors

Some blog tutorials about ktrain are shown below:

ktrain: A Lightweight Wrapper for Keras to Help Train Neural Networks

BERT Text Classification in 3 Lines of Code

Text Classification with Hugging Face Transformers in TensorFlow 2 (Without Tears)

Build an Open-Domain Question-Answering System With BERT in 3 Lines of Code

Finetuning BERT using ktrain for Disaster Tweets Classification by Hamiz Ahmed

Examples

Tasks such as text classification and image classification can be accomplished easily with only a few lines of code.

Example: Text Classification of IMDb Movie Reviews Using BERT [see notebook]

import ktrain
from ktrain import text as txt

# load data
(x_train, y_train), (x_test, y_test), preproc = txt.texts_from_folder('data/aclImdb', maxlen=500, 
                                                                     preprocess_mode='bert',
                                                                     train_test_names=['train', 'test'],
                                                                     classes=['pos', 'neg'])

# load model
model = txt.text_classifier('bert', (x_train, y_train), preproc=preproc)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model, 
                             train_data=(x_train, y_train), 
                             val_data=(x_test, y_test), 
                             batch_size=6)

# find good learning rate
learner.lr_find()             # briefly simulate training to find good learning rate
learner.lr_plot()             # visually identify best learning rate

# train using 1cycle learning rate schedule for 3 epochs
learner.fit_onecycle(2e-5, 3)
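
The `lr_find` step above briefly simulates training while the learning rate grows, and the run is expected to stop once the loss diverges. A minimal sketch of that idea on a toy one-dimensional quadratic loss follows. The function `lr_range_test`, its parameters, and the divergence threshold are assumptions chosen for illustration; this is not ktrain's implementation.

```python
def lr_range_test(start_lr=1e-4, end_lr=10.0, num_steps=50, stop_factor=4.0):
    """Exponentially increase the learning rate on the toy loss f(w) = w**2
    and stop early once the loss exceeds stop_factor times the best loss seen."""
    w = 1.0
    growth = (end_lr / start_lr) ** (1.0 / (num_steps - 1))
    lr = start_lr
    best_loss = float("inf")
    history = []                               # (learning rate, loss) pairs
    for _ in range(num_steps):
        loss = w * w
        history.append((lr, loss))
        best_loss = min(best_loss, loss)
        if loss > stop_factor * best_loss:     # loss has diverged: stop automatically
            break
        grad = 2.0 * w                         # derivative of w**2
        w -= lr * grad                         # one gradient step at the current rate
        lr *= growth                           # raise the rate for the next step
    return history


history = lr_range_test()
```

Plotting the loss against the learning rate in `history` gives the kind of curve that `lr_plot` is used to inspect: the loss falls while the rate is small and rises sharply once it becomes too large.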

Example: Classifying Images of Dogs and Cats Using a Pretrained ResNet50 model [see notebook]

import ktrain
from ktrain import vision as vis

# load data
(train_data, val_data, preproc) = vis.images_from_folder(
                                              datadir='data/dogscats',
                                              data_aug = vis.get_data_aug(horizontal_flip=True),
                                              train_test_names=['train', 'valid'], 
                                              target_size=(224,224), color_mode='rgb')

# load model
model = vis.image_classifier('pretrained_resnet50', train_data, val_data, freeze_layers=80)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model=model, train_data=train_data, val_data=val_data, 
                             workers=8, use_multiprocessing=False, batch_size=64)

# find good learning rate
learner.lr_find()             # briefly simulate training to find good learning rate
learner.lr_plot()             # visually identify best learning rate

# train using triangular policy with ModelCheckpoint and implicit ReduceLROnPlateau and EarlyStopping
learner.autofit(1e-4, checkpoint_folder='/tmp/saved_weights')

Example: Sequence Labeling for Named Entity Recognition using a randomly initialized Bidirectional LSTM CRF model [see notebook]

import ktrain
from ktrain import text as txt

# load data
(trn, val, preproc) = txt.entities_from_txt('data/ner_dataset.csv',
                                            sentence_column='Sentence #',
                                            word_column='Word',
                                            tag_column='Tag', 
                                            data_format='gmb',
                                            use_char=True) # enable character embeddings

# load model
model = txt.sequence_tagger('bilstm-crf', preproc)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model, train_data=trn, val_data=val)


# conventional training for 1 epoch using a learning rate of 0.001 (Keras default for Adam optimizer)
learner.fit(1e-3, 1)

Example: Node Classification on Cora Citation Graph using a GraphSAGE model [see notebook]

import ktrain
from ktrain import graph as gr

# load data with supervision ratio of 10%
(trn, val, preproc)  = gr.graph_nodes_from_csv(
                                               'cora.content', # node attributes/labels
                                               'cora.cites',   # edge list
                                               sample_size=20, 
                                               holdout_pct=None, 
                                               holdout_for_inductive=False,
                                              train_pct=0.1, sep='\t')

# load model
model=gr.graph_node_classifier('graphsage', trn)

# wrap model and data in ktrain.Learner object
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=64)


# find good learning rate
learner.lr_find(max_epochs=100) # briefly simulate training to find good learning rate
learner.lr_plot()               # visually identify best learning rate

# train using triangular policy with ModelCheckpoint and implicit ReduceLROnPlateau and EarlyStopping
learner.autofit(0.01, checkpoint_folder='/tmp/saved_weights')

Example: Text Classification with Hugging Face Transformers on 20 Newsgroups Dataset Using DistilBERT [see notebook]

# load text data
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
train_b = fetch_20newsgroups(subset='train', categories=categories, shuffle=True)
test_b = fetch_20newsgroups(subset='test',categories=categories, shuffle=True)
(x_train, y_train) = (train_b.data, train_b.target)
(x_test, y_test) = (test_b.data, test_b.target)

# build, train, and validate model (Transformer is wrapper around transformers library)
import ktrain
from ktrain import text
MODEL_NAME = 'distilbert-base-uncased'
t = text.Transformer(MODEL_NAME, maxlen=500, class_names=train_b.target_names)
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(5e-5, 4)
learner.validate(class_names=t.get_classes()) # class_names must be string values

# Output from learner.validate()
#                        precision    recall  f1-score   support
#
#           alt.atheism       0.92      0.93      0.93       319
#         comp.graphics       0.97      0.97      0.97       389
#               sci.med       0.97      0.95      0.96       396
#soc.religion.christian       0.96      0.96      0.96       398
#
#              accuracy                           0.96      1502
#             macro avg       0.95      0.96      0.95      1502
#          weighted avg       0.96      0.96      0.96      1502

Example: Tabular Classification for Titanic Survival Prediction Using an MLP [see notebook]

import ktrain
from ktrain import tabular
import pandas as pd
train_df = pd.read_csv('train.csv', index_col=0)
train_df = train_df.drop(['Name', 'Ticket', 'Cabin'], axis=1)  # pass axis by keyword; the positional form is rejected by newer pandas
trn, val, preproc = tabular.tabular_from_df(train_df, label_columns=['Survived'], random_state=42)
learner = ktrain.get_learner(tabular.tabular_classifier('mlp', trn), train_data=trn, val_data=val)
learner.lr_find(show_plot=True, max_epochs=5) # estimate learning rate
learner.fit_onecycle(5e-3, 10)

# evaluate held-out labeled test set
tst = preproc.preprocess_test(pd.read_csv('heldout.csv', index_col=0))
learner.evaluate(tst, class_names=preproc.get_classes())

Using ktrain on Google Colab? See these Colab examples:

Additional examples can be found here.

Installation

Make sure pip is up-to-date with: pip install -U pip
Install TensorFlow 2 if it is not already installed (e.g., pip install tensorflow)
Install ktrain: pip install ktrain

The above should be all you need on Linux systems and cloud computing environments like Google Colab and AWS EC2. If you are using ktrain on a Windows computer, you can follow these more detailed instructions that include some extra steps.

Some important things to note about installation:

If using ktrain with tensorflow<=2.1, you must also downgrade the transformers library to transformers==3.1.
As of v0.21.x, ktrain no longer installs TensorFlow 2 automatically. As indicated above, you should install TensorFlow 2 yourself before installing and using ktrain. On Google Colab, TensorFlow 2 should be already installed. You should be able to use ktrain with any version of TensorFlow 2. Note, however, that there is a bug in TensorFlow 2.2 and 2.3 that affects the Learning-Rate-Finder that will not be fixed until TensorFlow 2.4. The bug causes the learning-rate-finder to complete all epochs even after loss has diverged (i.e., no automatic-stopping).
If using ktrain on a local machine with a GPU (versus Google Colab, for example), you'll need to install GPU support for TensorFlow 2.
Since some ktrain dependencies have not yet been migrated to tf.keras in TensorFlow 2 (or may have other issues), ktrain is temporarily using forked versions of some libraries. Specifically, ktrain uses forked versions of the eli5 and stellargraph libraries. If not installed, ktrain will complain when a method or function needing either of these libraries is invoked. To install these forked versions, you can do the following:

pip install git+https://github.com/amaiya/eli5@tfkeras_0_10_1
pip install git+https://github.com/amaiya/stellargraph@no_tf_dep_082
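
The first note above ties older TensorFlow versions to a specific transformers release. A small sketch of how that rule could be checked from a version string is shown below. The helper names are assumptions for illustration, and the rule is read here as applying to the whole TensorFlow 2.1 release series and older.

```python
def parse_major_minor(version):
    """Return (major, minor) as integers from a version string such as '2.3.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def needs_transformers_downgrade(tf_version):
    """True when the TensorFlow release series is 2.1 or older (assumed reading of the note)."""
    return parse_major_minor(tf_version) <= (2, 1)


# With TensorFlow installed, the version string could be obtained with
# importlib.metadata.version("tensorflow") from the standard library.
for candidate in ("2.1.0", "2.3.1"):
    print(candidate, needs_transformers_downgrade(candidate))
```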

This code was tested on Ubuntu 18.04 LTS using TensorFlow 2.3.1 and Python 3.6.9.

How to Cite

Please cite the following paper when using ktrain:

@article{maiya2020ktrain,
    title={ktrain: A Low-Code Library for Augmented Machine Learning},
    author={Arun S. Maiya},
    year={2020},
    eprint={2004.10703},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    journal={arXiv preprint arXiv:2004.10703},
}

Creator: Arun S. Maiya

Email: arun [at] maiya [dot] net

Comments

No support for loading pretrained HuggingFace Transformers model from local path?

Failed to load a pretrained HuggingFace Transformers model from my local machine. It seems only the hard-coded models in the code can be loaded.

MODEL_NAME = r"D:\programming\models\tf_rbtl"  # raw string so that "\t" is not read as a tab
t = text.Transformer(MODEL_NAME, maxlen=500,  
                     classes=["0", "1"])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-e20b30887588> in <module>
      1 t = text.Transformer(MODEL_NAME, maxlen=500,  
----> 2                      classes=["0", "1"])

d:\anaconda3.5\envs\adverse\lib\site-packages\ktrain\text\preprocessor.py in __init__(self, model_name, maxlen, classes, batch_size, multilabel, use_with_learner)
    838             raise ValueError('classes argument is required when multilabel=True')
    839         super().__init__(model_name,
--> 840                          maxlen, max_features=10000, classes=classes, multilabel=multilabel)
    841         self.batch_size = batch_size
    842         self.use_with_learner = use_with_learner

d:\anaconda3.5\envs\adverse\lib\site-packages\ktrain\text\preprocessor.py in __init__(self, model_name, maxlen, max_features, classes, lang, ngram_range, multilabel)
    719         self.name = model_name.split('-')[0]
    720         if self.name not in TRANSFORMER_MODELS:
--> 721             raise ValueError('uknown model name %s' % (model_name))
    722         self.model_type = TRANSFORMER_MODELS[self.name][1]
    723         self.tokenizer_type = TRANSFORMER_MODELS[self.name][2]

ValueError: uknown model name D:\programming\models\tf_rbtl
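
The traceback above shows why a local path is rejected: the preprocessor takes the text before the first hyphen in `model_name` and looks it up in a table of known model families. The sketch below reproduces only that check. The contents of `KNOWN_MODELS` are a hypothetical stand-in for the `TRANSFORMER_MODELS` table, and the helper name `model_family` is an assumption.

```python
# Hypothetical stand-in for the TRANSFORMER_MODELS table seen in the traceback.
KNOWN_MODELS = {"bert", "distilbert", "roberta"}


def model_family(model_name):
    """Mimic the check in the traceback: the prefix before the first '-' must be known."""
    name = model_name.split("-")[0]
    if name not in KNOWN_MODELS:
        raise ValueError("unknown model name %s" % model_name)
    return name


print(model_family("distilbert-base-uncased"))     # the prefix "distilbert" is recognized

try:
    model_family(r"D:\programming\models\tf_rbtl")  # no recognized prefix in a local path
except ValueError as err:
    print(err)
```

Because the lookup only inspects the prefix, any path or custom name that does not begin with a known family fails before the model files are ever read.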

opened by WangHexie 22

How to save SimpleQA trained model?

Hi,

I've tried the provided sample for SimpleQA. In my output, it gave me:

<IPython.core.display.HTML object>

which I assume is the:

#qa.display_answers(answers[:5])

If I re-run the sample code, it complains that there is already a directory where it tries to create the index (good). If I leave out everything else and rerun:

qa = text.SimpleQA(INDEXDIR)

it starts training again... another three hours :(

This is my code now so I should get some output:

# load 20newsgroups dataset into an array
#from sklearn.datasets import fetch_20newsgroups
#remove = ('headers', 'footers', 'quotes')
#newsgroups_train = fetch_20newsgroups(subset='train', remove=remove)
#newsgroups_test = fetch_20newsgroups(subset='test', remove=remove)
#docs = newsgroups_train.data + newsgroups_test.data

import ktrain
from ktrain import text

INDEXDIR = '/tmp/qa'

#text.SimpleQA.initialize_index(INDEXDIR)
#text.SimpleQA.index_from_folder('./Philosophy', INDEXDIR)

qa = text.SimpleQA(INDEXDIR)

answers = qa.ask('Why are we here?')
top_answer = answers[0]['answer']
print(top_answer)
top_answer = answers[1]['answer']
print(top_answer)
top_answer = answers[2]['answer']
print(top_answer)
top_answer = answers[3]['answer']
print(top_answer)
top_answer = answers[4]['answer']
print(top_answer)

#qa.display_answers(answers[:5])

How do I reload my already trained model?

opened by staccDOTsol 19

Load and use trained model with ktrain

I have a trained model in .h5 and preproc file format for racial recognition using the ktrain library. How do I load and use the trained model at a later time?

user question

opened by bulioses 18

Is it possible to use a RoBERTa-like model (Microsoft/codebert-base) for NER sequence-tagging

Right now I am getting a tensor shape error, and I suspect it is because the model expects BERT-like input. Could this be the case?

enhancement

opened by Niekvdplas 17

error with text.transformer- roberta-base

Hi, I am trying to train a model based on "roberta-base". I ran it on EC2 (p3.16xl) and got this error:

Traceback (most recent call last):
  File "ktrain_transformer_training.py", line 119, in <module>
    learner, preproc = train_transformer(x_train, y_train, x_test, y_test)
  File "ktrain_transformer_training.py", line 98, in train_transformer
    model = t.get_classifier()
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/ktrain/text/preprocessor.py", line 1041, in get_classifier
    model = self._load_pretrained(mname, num_labels)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/ktrain/text/preprocessor.py", line 1006, in _load_pretrained
    raise ValueError('could not load pretrained model %s using both from_pt=False and from_pt=True' % (mname))
ValueError: could not load pretrained model roberta-base using both from_pt=False and from_pt=True

However, the same code runs perfectly on my local machine. Can you help with this issue? Thanks

user question

opened by Liranbz 16

unhashable type: 'numpy.ndarray'

TypeError: unhashable type: 'numpy.ndarray'

When running this:

history = learner.fit(LR, n_cycles=EPOCHS, checkpoint_folder='./')

based on:

# Preprocess training and validation data
train_set = t.preprocess_train(X_train, y_train)
val_set = t.preprocess_test(X_test, y_test)

where X_train and X_test are lists while y_train, y_test are arrays

user question

opened by realonbebeto 13

Models trained with ktrain do not work with Flask, uWSGI and NGINX

Hi, I am using ktrain to replace some of my Keras built models in production deployment. I have noticed a problem with models trained with ktrain. I think I might need to set some extra params in my NGINX and uWSGI while using the ktrain because I can't simply replace older models with ktrain models.

I am using Flask, uWSGI, and NGINX to deploy my models. This setup is already in place with models trained with traditional Keras and TF2. But If I replace my Text Classification model with ktrain then it stops working.

I have checked it individually with Flask and uWSGI, it is working fine till now. But as soon as I add NGINX server setup it stops working. There something happening inside ktrain APIs which is breaking it because if I do not use ktrain APIs it is working perfectly fine with all setup.

At the front end, it reports a Server Timeout Error. I checked the internal logs to identify the issue, and it happens because uWSGI is not returning anything to NGINX. However, if I run only Flask and uWSGI with the ktrain model, it works. I also tried increasing the timeout to 5m, 10m, and 30m, but the connection still times out. It happens only when I call an API that uses ktrain models; other APIs that do not use ktrain models work perfectly fine.

I had one more problem with ktrain on Flask server but I have resolved that by turning off auto_reload on file change. Because it seems ktrain download or write something in the local directory that is why Flask was reloading when I call prediction. Although, this is resolved.

I have tried with BERT, DistilBERT, Fast Text, GRU and so many other ways to figure out why It is not working with NGINX. Can you add your thoughts about what could be the reason?

I have also created a User Guide (https://drive.google.com/file/d/1VW421zkmXkiQdoO1NWhVe21QOgVhVUnc/view?usp=sharing) to show you the server settings. It will help to identify the exact problem. If you can look at it and add your thoughts, it would be a really great help. I want to use ktrain in production but got stuck here.

opened by laxmimerit 13

Regarding Deployment on Flask

Hi, I have an issue regarding deployment: I am not able to deploy a ktrain multi text classification model. I tried to load the model and the .preproc file, but it does not work.

user question

opened by ianuragbhatt 13

Cannot get learner from iterator zip object

get_learner fails when the training data is a zip of iterators such as when it is used for image segmentation tasks (while augmenting images and masks together).

EDIT:

It works by hacking together a custom Iterator class, but it's not a particularly elegant hack...

image_gen and mask_gen below are keras.preprocessing.image.ImageDataGenerator.flow_from_directory() objects.


class Iterator():
    
    def __init__(self, image_gen, mask_gen):
        self.image_gen = image_gen
        self.mask_gen = mask_gen
        self.batch_size = image_gen.batch_size
        self.target_size = image_gen.target_size
        self.color_mode = image_gen.color_mode
        self.class_mode = image_gen.class_mode
        self.n = image_gen.n
        self.seed = image_gen.seed
        self.total_batches_seen = image_gen.total_batches_seen
    
    def __iter__(self):
        return self
    
    def __next__(self):
        return next(self.image_gen), next(self.mask_gen)
    
    def __getitem__(self, key):
        return self.image_gen[key], self.mask_gen[key]

Any ideas how we could make this more elegant?
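
One possible way to shorten the wrapper above is to delegate attribute lookups to the image generator instead of copying each attribute by hand. This is only a sketch: the class name `PairedIterator` is an assumption, the `FakeGen` class is a stand-in that merely imitates a generator for demonstration, and whether `get_learner` accepts such an object is not verified here.

```python
class PairedIterator:
    """Yield (image_batch, mask_batch) pairs and expose the image generator's attributes."""

    def __init__(self, image_gen, mask_gen):
        self.image_gen = image_gen
        self.mask_gen = mask_gen

    def __getattr__(self, name):
        # Called only when normal lookup fails, so batch_size, n, seed, etc.
        # are read from the image generator without being copied one by one.
        if name in ("image_gen", "mask_gen"):   # avoid recursion before __init__ has run
            raise AttributeError(name)
        return getattr(self.image_gen, name)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self.image_gen), next(self.mask_gen)

    def __getitem__(self, key):
        return self.image_gen[key], self.mask_gen[key]

    def __len__(self):
        # Special methods bypass __getattr__, so this one is forwarded explicitly.
        return len(self.image_gen)


class FakeGen:
    """Stand-in for a Keras directory iterator (illustration only)."""
    batch_size = 2

    def __init__(self, batches):
        self.batches = batches
        self.pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        batch = self.batches[self.pos % len(self.batches)]
        self.pos += 1
        return batch

    def __getitem__(self, key):
        return self.batches[key]

    def __len__(self):
        return len(self.batches)


pairs = PairedIterator(FakeGen(["img0", "img1"]), FakeGen(["mask0", "mask1"]))
first = next(pairs)
```

One trade-off of delegation is that attributes are read live from the image generator, whereas the original class captured their values once at construction time.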

enhancement

opened by gkaissis 13

Support for long sequences classification (transformer models)

It would be great if support could be added for long sequences over ~500 word pieces for the transformers models.

Possible methods:

sliding window over each sequence to generate sub-sequence outputs that are averaged before being fed to the classifier layer

sliding window over each sequence to generate sub-sequence outputs that are fed to an LSTM layer
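
The first method can be sketched without any model: split a long token sequence into overlapping windows, score each window, and average the per-window outputs. The function names, the window and stride sizes, and the toy `score_window` function are assumptions for illustration; in practice the per-window output would come from the transformer.

```python
def sliding_windows(tokens, window=500, stride=250):
    """Split tokens into overlapping windows of at most `window` items."""
    if len(tokens) <= window:
        return [tokens]
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):      # this window already reaches the end
            break
    return windows


def average_outputs(outputs):
    """Element-wise mean of equally sized per-window output vectors."""
    count = len(outputs)
    return [sum(values) / count for values in zip(*outputs)]


def score_window(window_tokens):
    """Toy stand-in for a model: fractions of even and odd token ids."""
    evens = sum(1 for t in window_tokens if t % 2 == 0)
    return [evens / len(window_tokens), 1 - evens / len(window_tokens)]


tokens = list(range(1200))                     # pretend token ids
windows = sliding_windows(tokens)
document_output = average_outputs([score_window(w) for w in windows])
```

The second method would keep the per-window outputs as a sequence and pass them to a recurrent layer instead of averaging them.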

enhancement

opened by mdavis95 12

Question about external tutorial/example link

First of all, thank you for writing this library. Some time ago I finished my thesis for my Bachelor's degree. The thesis evaluates ktrain on the text domain, and I released some of the code and Jupyter notebooks at https://github.com/ilos-vigil/ktrain-assessment-study.

My question is: would you like to include my repository as a tutorial/example in ktrain's README.md?

user question

opened by ilos-vigil 11

Only one GPU is used in the training for a multi-GPU machine with mirrored_strategy

@amaiya As per the example in https://github.com/amaiya/ktrain/issues/78, I am trying to train the NER model on a custom CoNLL dataset (5.5 million rows):

Number of sentences: 228051

Number of words in the dataset: 232634

Number of Labels: 29

Longest sentence: 106 words

Model

with mirrored_strategy.scope():
    model = txt.sequence_tagger('bilstm-transformer', preproc, wv_path_or_url='/home/user1/ktrain_data/cc.en.300.vec')
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)
learner.fit(0.003, 5, cycle_len=1)

Only the first GPU seems to be used: nvidia-smi reports it at 100% utilization most of the time, while the other GPUs show single-digit utilization, if any. What is missing? How do I take advantage of all the GPUs?

opened by amir1m 4

Releases(v0.32.3)

v0.32.3(Dec 13, 2022)
0.32.3 (2022-12-12)

new:

N/A

changed

N/A

fixed:

Changed NMF to accept optional parameters nmf_alpha_W and nmf_alpha_H based on changes in scikit-learn==1.2.0.

Change ktrain.utils to check for TensorFlow before doing a version check, so that ktrain can be imported without TensorFlow being installed.

v0.32.2(Dec 12, 2022)
0.32.2 (2022-12-12)

new:

N/A

changed

N/A

fixed:

Changed call to NMF to use alpha_W instead of alpha, as alpha parameter was removed in scikit-learn==1.2. (#470)

v0.32.1(Dec 12, 2022)
0.32.1 (2022-12-11)

new:

N/A

changed

N/A

fixed:

In TensorFlow 2.11, the tf.optimizers.Optimizer base class points to the new Keras optimizer, which seems to have problems. Users should use the legacy optimizers in tf.keras.optimizers.legacy with ktrain (which evidently will never be deleted). This means that, in TF 2.11, supplying a string representation of an optimizer like "adam" to model.compile uses the new optimizer instead of the legacy optimizers. In these cases, ktrain will issue a warning and automatically recompile the model with the default tf.keras.optimizers.legacy.Adam optimizer.

v0.32.0(Dec 9, 2022)
0.32.0 (2022-12-08)

new:

Support for TensorFlow 2.11. For now, as recommended in the TF release notes, ktrain has been changed to use the legacy optimizers in tf.keras.optimizers.legacy. This means that, when compiling Keras models, you should supply tf.keras.optimizers.legacy.Adam() instead of the string "adam".

Support for Python 3.10. Changed references from CountVectorizer.get_feature_names to CountVectorizer.get_feature_names_out. Updated supported versions in setup.py.

changed

N/A

fixed:

fixed error in docs

v0.31.10(Oct 1, 2022)
0.31.10 (2022-10-01)

new:

N/A

changed

N/A

fixed:

Adjusted tika imports due to issue with /tmp/tika.log in multi-user scenario

v0.31.9(Sep 24, 2022)
0.31.9 (2022-09-24)

new:

N/A

changed

N/A

fixed:

Adjustment for kwe

Fixed problem with importing ktrain without TensorFlow installed

v0.31.8(Sep 8, 2022)
0.31.8 (2022-09-08)

new:

N/A

changed

N/A

fixed:

Fixed paragraph tokenization in AnswerExtractor

v0.31.7(Aug 4, 2022)
0.31.7 (2022-08-04)

new:

N/A

changed

re-arranged dep warnings for TF

ktrain now pinned to transformers==4.17.0. Python 3.6 users can downgrade to transformers==4.10.3 and still use ktrain.

fixed:

N/A

v0.31.6(Aug 2, 2022)
0.31.6 (2022-08-02)

new:

N/A

changed

updated dependencies to work with newer versions (but temporarily continue pinning to transformers==4.10.1)

fixed:

fixes for newer networkx

v0.31.5(Aug 1, 2022)
0.31.5 (2022-08-01)

new:

N/A

changed

N/A

fixed:

fix release

v0.31.4(Aug 1, 2022)
0.31.4 (2022-08-01)

new:

N/A

changed

TextPredictor.explain and ImagePredictor.explain now use a different fork of eli5: pip install https://github.com/amaiya/eli5-tf/archive/refs/heads/master.zip

fixed:

Fixed loss_fn_from_model function to work with DISABLE_V2_BEHAVIOR properly

TextPredictor.explain and ImagePredictor.explain now work with tensorflow>=2.9 and scipy>=1.9 (due to new eli5-tf fork -- see above)

v0.31.3(Jul 16, 2022)
0.31.3 (2022-07-15)

new:

N/A

changed

added alnum check and period check to KeywordExtractor

fixed:

fixed bug in text.qa.core caused by previous refactoring of paragraph_tokenize and tokenize

v0.31.2(May 20, 2022)
0.31.2 (2022-05-20)

new:

N/A

changed

added truncate_to argument (default: 5000) and minchars argument (default: 3) to the KeywordExtractor.extract_keywords method.

added score_by argument to KeywordExtractor.extract_keywords. Default is freqpos, which means keywords are now ranked by a combination of frequency and position in document.

fixed:

N/A

v0.31.1(May 17, 2022)
0.31.1 (2022-05-17)

new:

N/A

changed

Allow for returning prediction probabilities when merging tokens in sequence-tagging (PR #445)

added basic ML pipeline test to workflow using latest TensorFlow

fixed:

N/A

v0.31.0(May 7, 2022)
0.31.0 (2022-05-07)

new:

The text.ner.models.sequence_tagger now supports word embeddings from non-BERT transformer models (e.g., roberta-base, codebert). Thanks to @Niekvdplas.

Custom tokenization can now be used in sequence-tagging even when using transformer word embeddings. See custom_tokenizer argument to NERPredictor.predict.

changed

[breaking change] In the text.ner.models.sequence_tagger function, the bilstm-bert model is now called bilstm-transformer and the bert_model parameter has been renamed to transformer_model.

[breaking change] The syntok package is now used as the default tokenizer for NERPredictor (sequence-tagging prediction). To use the tokenization scheme from older versions of ktrain, you can import the re and string packages and supply this function to the custom_tokenizer argument: lambda s: re.compile(f"([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])").sub(r" \1 ", s).split().

Code base was reformatted using black and isort

ktrain now supports TIKA for text extraction in text.textractor.TextExtractor, with use_tika=True as the default. To use the old-style text extraction based on the textract package, you can supply use_tika=False to TextExtractor.

removed warning about sentence pair classification to avoid confusion
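
For code written against older versions, the two breaking changes above can be handled with a small amount of glue. The sketch below spells out the legacy tokenizer quoted in the release note as a named function and adds a helper that maps the old sequence_tagger names to the new ones. The function names legacy_tokenize and migrate_sequence_tagger_args are hypothetical and are not part of ktrain.

```python
import re
import string

# Old-style tokenization from the release note: pad each punctuation
# character with spaces, then split on whitespace.
_PUNCT_RE = re.compile(f"([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])")


def legacy_tokenize(s):
    """Tokenize the way pre-0.31.0 versions did, per the release note."""
    return _PUNCT_RE.sub(r" \1 ", s).split()


def migrate_sequence_tagger_args(name, **kwargs):
    """Hypothetical helper: translate pre-0.31.0 names to the 0.31.0 names."""
    if name == "bilstm-bert":
        name = "bilstm-transformer"
    if "bert_model" in kwargs:
        kwargs["transformer_model"] = kwargs.pop("bert_model")
    return name, kwargs
```

Per the note above, a function like legacy_tokenize is what would be supplied to the custom_tokenizer argument of NERPredictor.predict to keep the older tokenization.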

fixed:

N/A

v0.30.0(Mar 28, 2022)
0.30.0 (2022-03-28)

new:

ktrain now supports simple, fast, and robust keyphrase extraction with the ktrain.text.kw.KeywordExtractor module

ktrain now only issues a warning if TensorFlow is not installed, instead of halting and preventing further use. This means that pre-trained PyTorch models (e.g., text.zsl.ZeroShotClassifier) and sklearn models (e.g., text.eda.TopicModel) in ktrain can now be used without having TensorFlow installed.

text.qa.SimpleQA and text.qa.AnswerExtractor now both support PyTorch with optional quantization (use framework='pt' for PyTorch version)

text.zsl.ZeroShotClassifier, text.translation.Translator, and text.translation.EnglishTranslator all support a quantize argument.

pretrained image-captioning and object-detection via transformers are now supported

changed

reorganized imports

localized seqeval

The half parameter to text.translation.Translator and text.translation.EnglishTranslator was changed to quantize and now supports both CPU and GPU.

TFDataset and SequenceDataset classes must now be imported as ktrain.dataset.TFDataset and ktrain.dataset.SequenceDataset.

fixed:

N/A

v0.29.3(Mar 9, 2022)
0.29.3 (2022-03-09)

new:

NERPredictor.predict now includes a return_offsets parameter. If True, the results will include character offsets of predicted entities.
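
To make the idea of character offsets concrete, the sketch below computes (start, end) positions for tokens by scanning the original text from left to right. It is a hypothetical illustration: the function name token_offsets and the end-exclusive offset convention are assumptions, and the exact format returned by ktrain may differ.

```python
def token_offsets(text, tokens):
    """Return (start, end) character offsets for tokens found in order in text.

    Hypothetical sketch; offsets are end-exclusive, so text[start:end] == token.
    """
    offsets = []
    cursor = 0  # search resumes after the previous token
    for tok in tokens:
        start = text.find(tok, cursor)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after position {cursor}")
        end = start + len(tok)
        offsets.append((start, end))
        cursor = end
    return offsets
```

An entity spanning several tokens then runs from the start offset of its first token to the end offset of its last token, so the entity text can be recovered by slicing the original string.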

changed

In eda.TopicModel, changed lda_max_iter to max_iter and nmf_alpha to alpha

Added show_counts parameter to TopicModel.get_topics method

Changed qa.core._process_question to qa.core.process_question

In qa.core, added remove_english_stopwords and and_np parameters to process_question

The valley learning rate suggestion is now returned in learner.lr_estimate and learner.lr_plot (when suggest=True supplied to learner.lr_plot)

fixed:

save TransformerEmbedding model, tokenizer, and configuration when saving NERPredictor and reset te_model to facilitate loading NERPredictors with BERT embeddings offline (#423)

switched from keras2onnx to tf2onnx, which supports newer versions of TensorFlow

v0.29.2(Feb 9, 2022)
0.29.2 (2022-02-09)

new:

N/A

changed

N/A

fixed:

added get_tokenizer call to TransformersPreprocessor._load_pretrained to address issue #416

v0.29.1(Feb 8, 2022)
0.29.1 (2022-02-08)

new:

N/A

changed

pin to sklearn==0.24.2 due to breaking changes. This scikit-learn version change only really affects TextPredictor.explain. The eli5 fork supporting tf.keras was updated for scikit-learn 0.24.2. To use scikit-learn==0.24.2, users must uninstall and re-install the eli5 fork with: pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip.

fixed:

N/A

v0.29.0(Jan 29, 2022)
0.29.0 (2022-01-28)

new:

New vision models: added MobileNetV3-Small and EfficientNet. Thanks to @ilos-vigil.

changed

core.Learner.plot now supports plotting of any value that exists in the training History object (e.g., mae if previously specified as metric). Thanks to @ilos-vigil.

added raw_confidence parameter to QA.ask method to return raw confidence scores. Thanks to @ilos-vigil.

fixed:

pin to transformers==4.10.3 due to Issue #398

pin to syntok==1.3.3 due to bug with syntok==1.4.1 causing paragraph tokenization in qa module to break

properly suppress TF/CUDA warnings by default

ensure document fed to keras_bert tokenizer to avoid this issue

v0.28.3(Nov 5, 2021)
0.28.3 (2021-11-05)

new:

speech transcription support

changed

N/A

fixed:

N/A

v0.28.2(Oct 18, 2021)
0.28.2 (2021-10-17)

new:

N/A

changed

minor fix to installation due to pypi

fixed:

N/A

v0.28.1(Oct 18, 2021)
0.28.1 (2021-10-17)

New:

N/A

Changed

added extra_requirements to setup.py

changed imports for summarization, translation, qa, and zsl in notebooks and tests

Fixed:

N/A

v0.28.0(Oct 13, 2021)
0.28.0 (2021-10-13)

New:

text.AnswerExtractor is a universal information extractor powered by a Question-Answering module and capable of extracting user-specified information from texts.

text.TextExtractor is a text extraction pipeline (e.g., convert PDFs to plain text)

Changed

changed transformers pin to transformers>=4.0.0,<=4.10.3

Fixed:

N/A

v0.27.3(Sep 3, 2021)
0.27.3 (2021-09-03)

New:

N/A

Changed

N/A

Fixed:

SimpleQA now can load PyTorch question-answering checkpoints

change API call to support newest causalnlp

v0.27.2(Jul 28, 2021)
0.27.2 (2021-07-28)

New:

N/A

Changed

N/A

Fixed:

check for logits attribute when predicting using transformers

change raised Exception to warning for longer sequence lengths for transformers

v0.27.1(Jul 20, 2021)
0.27.1 (2021-07-20)

New:

N/A

Changed

Added method parameter to tabular.causal_inference_model.

Fixed:

N/A

v0.27.0(Jul 20, 2021)
0.27.0 (2021-07-20)

New:

Added tabular.causal_inference_model function for causal inference support.

Changed

N/A

Fixed:

N/A

v0.26.5(Jul 16, 2021)
0.26.5 (2021-07-15)

New:

N/A

Changed

added query parameter to SimpleQA.ask so that an alternative query can be used to retrieve contexts from corpus

added chardet as dependency for stellargraph

Fixed:

fixed issue with TopicModel.build when threshold=None

v0.26.4(Jun 23, 2021)
0.26.4 (2021-06-23)

New:

API documentation index

Changed

Added warning when a TensorFlow version of selected transformers model is not available and the PyTorch version is being downloaded and converted instead using from_pt=True.

Fixed:

Fixed utils.metrics_from_model to support alternative metrics

Check for AUC in ktrain.utils "inspect" function
