Sequence learning toolkit for Python

Lars

Last update: Dec 27, 2022

Overview

seqlearn

seqlearn is a sequence classification toolkit for Python. It is designed to extend scikit-learn and to offer an API that is as similar to scikit-learn's as possible.

Compiling and installing

Get NumPy >=1.6, SciPy >=0.11, Cython >=0.20.2 and a recent version of scikit-learn. Then issue:

python setup.py install

to install seqlearn.
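
Before building, it can help to check which of the prerequisites are already importable. The following is a minimal sketch using only the standard library; the helper name report_versions is made up for illustration and is not part of seqlearn, and it does not compare the versions against the minimums listed above.

```python
import importlib

# Import names of the prerequisites listed above
# (scikit-learn is imported as "sklearn").
PREREQUISITES = ["numpy", "scipy", "Cython", "sklearn"]


def report_versions(module_names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Not every module exposes __version__, so fall back to a placeholder.
            versions[name] = getattr(module, "__version__", "unknown")
    return versions
```

Calling report_versions(PREREQUISITES) returns a dictionary in which a value of None marks a package that still has to be installed.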

If you want to use seqlearn from its source directory without installing, you have to compile first:

python setup.py build_ext --inplace

Getting started

The easiest way to start using seqlearn is to fetch a dataset in CoNLL 2000 format. Define a task-specific feature extraction function, e.g.:

>>> def features(sequence, i):
...     yield "word=" + sequence[i].lower()
...     if sequence[i].isupper():
...         yield "Uppercase"
...
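
To see what this extractor produces, it can be called directly on a toy sentence. The token list below is made up for illustration, and seqlearn itself is not needed to run the check.

```python
def features(sequence, i):
    # Same extractor as above: one lowercase word feature, plus a flag
    # when the token is written entirely in capital letters.
    yield "word=" + sequence[i].lower()
    if sequence[i].isupper():
        yield "Uppercase"


tokens = ["NASA", "launched", "a", "rocket"]  # hypothetical example sentence
for i in range(len(tokens)):
    print(tokens[i], list(features(tokens, i)))
```

Note that str.isupper() is true only when every cased character is uppercase, so "NASA" gets the extra feature while "Rocket" would not.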

Load the training file, say train.txt:

>>> from seqlearn.datasets import load_conll
>>> X_train, y_train, lengths_train = load_conll("train.txt", features)

Train a model:

>>> from seqlearn.perceptron import StructuredPerceptron
>>> clf = StructuredPerceptron()
>>> clf.fit(X_train, y_train, lengths_train)

Check how well you did on a validation set, say validation.txt:

>>> X_test, y_test, lengths_test = load_conll("validation.txt", features)
>>> from seqlearn.evaluation import bio_f_score
>>> y_pred = clf.predict(X_test, lengths_test)
>>> print(bio_f_score(y_test, y_pred))

For more information, see the documentation.

Comments

rewrite _utils.count_trans in Cython

Some profiling results for StructuredPerceptron.fit (using a sparse X with 70k samples/60k features, y with 20 labels, before this PR):

       228411 function calls (227111 primitive calls) in 29.378 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2442   15.957    0.007   15.976    0.007 _utils.py:10(count_trans)
        1    8.533    8.533   29.373   29.373 perceptron.py:51(fit)
     4967    2.242    0.000    2.242    0.000 {numpy.core.multiarray.zeros}
     1300    0.755    0.001    0.755    0.001 {seqlearn._decode.viterbi.viterbi}
     1300    0.485    0.000    0.485    0.000 {_csr.csr_matvecs}
     1221    0.430    0.000    0.430    0.000 {_csc.csc_matvecs}
     1300    0.175    0.000    0.175    0.000 {_csr.get_csr_submatrix}
     2521    0.114    0.000    0.226    0.000 compressed.py:103(check_format)
     2521    0.076    0.000    3.505    0.001 extmath.py:70(safe_sparse_dot)
     2521    0.068    0.000    0.078    0.000 compressed.py:633(prune)
     2521    0.048    0.000    3.212    0.001 compressed.py:272(_mul_multivector)
     2521    0.043    0.000    0.322    0.000 compressed.py:22(__init__)
     3900    0.035    0.000    0.047    0.000 sputils.py:95(isintlike)

Rewriting count_trans in Cython can give us about a 2x speedup in some scenarios (the same data, after this PR; total time drops from 29.378 to 13.748 seconds):

         230905 function calls (229605 primitive calls) in 13.748 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    8.778    8.778   13.745   13.745 perceptron.py:51(fit)
     2525    2.264    0.001    2.264    0.001 {numpy.core.multiarray.zeros}
     1300    0.793    0.001    0.793    0.001 {seqlearn._decode.viterbi.viterbi}
     1300    0.504    0.000    0.504    0.000 {_csr.csr_matvecs}
     1221    0.445    0.000    0.445    0.000 {_csc.csc_matvecs}
     2521    0.115    0.000    0.233    0.000 compressed.py:103(check_format)
     1300    0.086    0.000    0.086    0.000 {_csr.get_csr_submatrix}
     2521    0.080    0.000    3.596    0.001 extmath.py:70(safe_sparse_dot)
     2521    0.072    0.000    0.082    0.000 compressed.py:633(prune)
     2521    0.049    0.000    3.291    0.001 compressed.py:272(_mul_multivector)
     2521    0.043    0.000    0.331    0.000 compressed.py:22(__init__)
    17569    0.042    0.000    0.042    0.000 {numpy.core.multiarray.array}
     3900    0.036    0.000    0.047    0.000 sputils.py:95(isintlike)
2600/1300    0.032    0.000    0.416    0.000 csr.py:189(__getitem__)
     2442    0.030    0.000    0.030    0.000 {seqlearn._utils_cython.count_trans}

I'm not sure the code is organized in a good way (_utils_cython.pyx), but anyways.
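
For readers unfamiliar with the hot spot in the first profile: judging by its name, count_trans tallies label-to-label transitions. The pure-Python sketch below shows that kind of loop under that assumption. The name count_trans_sketch and its signature are illustrative only and are not the code in _utils.py or in this PR.

```python
def count_trans_sketch(y, n_classes):
    """Count how often label a is immediately followed by label b in y.

    Returns an n_classes x n_classes nested list in which entry [a][b]
    holds the number of a -> b transitions.
    """
    trans = [[0] * n_classes for _ in range(n_classes)]
    # Pair every label with its successor and bump the matching cell.
    for prev, curr in zip(y, y[1:]):
        trans[prev][curr] += 1
    return trans


labels = [0, 1, 1, 2, 0, 1]  # made-up label sequence with 3 classes
print(count_trans_sketch(labels, 3))
```

A per-element loop like this pays interpreter overhead on every step, which is the sort of cost that compiling with Cython usually removes.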

opened by kmike 7

vectorized viterbi decoding

Hi,

Here is a vectorized viterbi function. I couldn't work out how to vectorize it using outer products, so I just rewrote the inner loop using NumPy vectorized operations. For my data (20 states, 30k samples) it makes StructuredPerceptron.fit 4x faster.
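
For reference, here is a minimal, unvectorized sketch of Viterbi decoding in plain Python. The function name and the argument layout (per-position state scores, a transition matrix, and initial and final state scores) are assumptions made for illustration; this is neither seqlearn's implementation nor the code in this PR. The inner search over previous states is the part that a vectorized version would replace with array operations.

```python
def viterbi_sketch(score, trans, init, final):
    """Return the highest-scoring state sequence as a list of state indices.

    score[i][k]  score of state k at position i
    trans[j][k]  score of moving from state j to state k
    init[k]      score of starting in state k
    final[k]     score of ending in state k
    """
    n_samples = len(score)
    n_states = len(init)
    # delta[i][k]: best score of any path that ends in state k at position i.
    delta = [[0.0] * n_states for _ in range(n_samples)]
    # backp[i][k]: previous state on that best path.
    backp = [[0] * n_states for _ in range(n_samples)]

    for k in range(n_states):
        delta[0][k] = init[k] + score[0][k]

    for i in range(1, n_samples):
        for k in range(n_states):
            best_j = max(range(n_states),
                         key=lambda j: delta[i - 1][j] + trans[j][k])
            backp[i][k] = best_j
            delta[i][k] = delta[i - 1][best_j] + trans[best_j][k] + score[i][k]

    # Pick the best final state, then follow the back-pointers.
    last = max(range(n_states), key=lambda k: delta[-1][k] + final[k])
    path = [last]
    for i in range(n_samples - 1, 0, -1):
        path.append(backp[i][path[-1]])
    path.reverse()
    return path
```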

opened by kmike 4
Dataset loading

I'm trying to train a HMM that classifies lines in an HTML document as belonging to a certain zone or class (e.g. body, header, footer, title, etc.). Thus, each sequence is a document and each sample is a line. On each line, I compute a number of floating point valued features.

What is the correct input format for this data in order to train a model with seqlearn? I'm having trouble understanding how to format the data from the documentation.
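
Going by the Getting started example, where fit receives X, y and a lengths array, one plausible layout is to stack the per-line feature vectors of all documents into a single matrix and to record each document's line count separately. The sketch below follows that assumption with plain lists and made-up data. Whether the HMM class accepts floating-point features, and which array type it expects, is not confirmed here.

```python
# Two hypothetical documents; every line has two float features and a zone label.
documents = [
    [([0.9, 12.0], "header"), ([0.1, 80.0], "body"), ([0.2, 75.0], "body")],
    [([0.8, 10.0], "title"), ([0.1, 60.0], "body")],
]

X, y, lengths = [], [], []
for doc in documents:
    for feature_vector, label in doc:
        X.append(feature_vector)  # one row per line (sample)
        y.append(label)           # one label per line
    lengths.append(len(doc))      # one entry per document (sequence)

# The sequence lengths must add up to the number of samples.
assert sum(lengths) == len(X) == len(y)
print(len(X), lengths)
```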

opened by danintheory 3
Change verbose reporting format for StructuredPerceptron
What do you think about:

1. using sum_loss instead of loss (currently the printed loss is the loss for some random sequence), and
2. printing the iteration number on the same line as sum_loss (=> 2x fewer output lines)?
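
A rough sketch of what such a one-line report could look like. The function name, its arguments and the message format are hypothetical and are not taken from StructuredPerceptron.

```python
def format_report(iteration, sequence_losses):
    """Return a single progress line with the loss summed over all sequences."""
    sum_loss = sum(sequence_losses)
    return "Iteration {0}, sum_loss: {1:.3f}".format(iteration, sum_loss)


# Made-up per-sequence losses for one pass over the training data.
print(format_report(1, [0.25, 0.0, 0.5]))
```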
opened by kmike 3
Using a `requirements.txt` file

I was wondering if it would be possible to include a requirements.txt in the root directory with the prerequisites?

I am currently making a one-line installer for a (private) module that downloads and installs seqlearn directly from GitHub using pip install -e git+https://github.com/larsmans/seqlearn.git#egg=seqlearn. But this won't work in new virtual environments without first installing Cython manually.

I know this has limited usefulness, but it would simplify things for me a notch.

Awesome project you guys have got going here BTW.

opened by jonathf 1

Fixed missing numpy headers in OS X build.

Trying to build seqlearn on OS X 10.11 results in the following error:

$ python setup.py install
...
clang -fno-strict-aliasing -fno-common -dynamic -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c seqlearn/_decode/bestfirst.c -o build/temp.macosx-10.10-x86_64-2.7/seqlearn/_decode/bestfirst.o
seqlearn/_decode/bestfirst.c:250:10: fatal error: 'numpy/arrayobject.h' file not found
#include "numpy/arrayobject.h"
         ^
1 error generated.
error: command 'clang' failed with exit status 1

Which is the same error solved here: https://github.com/hmmlearn/hmmlearn/issues/43. This ports the hmmlearn fix to seqlearn.

opened by kushalc 1

pypi?

larsmans, would you be willing to push this to pypi so it is easy to include as a package dependency in other tools? I am happy to help with this if you'd like.

opened by johnrfrank 1
[WIP] Edge/transition features

I tried to replace the node features by edge/transition features (aka dyad features), by assigning n_classes² weights to each feature instead of only n_classes. The result didn't fit in the memory of my laptop and actually I don't want to waste memory space on it.

This branch contains, so far, a custom sparse matrix datatype based on hash tables that allows adding elements (changing the sparsity). It has a function for multiplying a CSR matrix and a hashtable matrix together.

The implementation may be slow because of all the malloc'ing; if that turns out to be unacceptable, we can try a C++ implementation using unordered_map or std::vector and a custom hash function instead.

opened by larsmans 1

FIX bestfirst decoder should also use np.intp

StructuredPerceptron raises an exception when np.int32 is used.

The following test still fails though:

from seqlearn._decode import DECODERS
from seqlearn.perceptron import StructuredPerceptron

def test_perceptron_decoders():
    """Assert that perceptron works with all decoders."""
    for decoder in DECODERS:
        clf = StructuredPerceptron(max_iter=1, decode=decoder)
        clf.fit([[1, 2, 3]], [1], [1])  # no exception

Traceback (most recent call last):
  File "/Users/kmike/envs/scraping/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/Users/kmike/svn/seqlearn/seqlearn/tests/test_perceptron.py", line 50, in test_perceptron_decoders
    clf.fit([[1, 2, 3]], [1], [1])  # no exception
  File "/Users/kmike/svn/seqlearn/seqlearn/perceptron.py", line 112, in fit
    y_pred = decode(Score, w_trans, w_init, w_final)
  File "/Users/kmike/svn/seqlearn/seqlearn/_decode/bestfirst.py", line 17, in bestfirst
    path[-1] = np.argmax(trans[path[-2], :] + Score[-1] + final)
IndexError: index -2 is out of bounds for axis 0 with size 1

Also, "bestfirst" decoder is currently much slower than viterbi.
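
The traceback shows the decoder indexing path[-2] on a sequence that contains a single sample. As an illustration of one way to avoid that, here is a pure-Python sketch of a greedy left-to-right decoder in which the first and the last position may be the same position. It is not the code in bestfirst.py; the function names and the argument layout are assumptions.

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda k: values[k])


def bestfirst_sketch(score, trans, init, final):
    """Greedy decoding that also handles one-sample sequences."""
    n_samples = len(score)
    n_states = len(init)
    path = []
    for i in range(n_samples):
        totals = []
        for k in range(n_states):
            total = score[i][k]
            if i == 0:
                total += init[k]             # first position: initial weights
            else:
                total += trans[path[-1]][k]  # later positions: previous choice
            if i == n_samples - 1:
                total += final[k]            # last position: final weights
            totals.append(total)
        path.append(argmax(totals))
    return path
```

Because the checks for the first and last position are independent, a sequence of length one receives both the initial and the final weights and never looks at a previous state.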

opened by kmike 1

Fix installation

With Cython extension StructuredPerceptron.fit is 2.5x faster than with vectorized numpy for my data, thanks!

But installation now requires numpy at build time, so it fails when the numpy headers cannot be found. This PR contains a simplified version of the numpy code in scikit-learn's setup.py. I'm not sure what the other numpy-related code in scikit-learn's setup.py does; numpy.get_include() seems to be enough for seqlearn.
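
A minimal sketch of the numpy.get_include() idea follows. The helper name is made up, and the Extension call in the comment uses an assumed module name; this is not the code in this PR.

```python
def numpy_include_dirs():
    """Return the directory that holds numpy/arrayobject.h as a one-element list.

    Returns an empty list when NumPy cannot be imported, in which case the
    compiler will still fail to find the header.
    """
    try:
        import numpy
    except ImportError:
        return []
    return [numpy.get_include()]


# In a setup.py the list would be handed to each extension, for example
# (module name assumed for illustration):
#   Extension("seqlearn._decode.bestfirst",
#             ["seqlearn/_decode/bestfirst.c"],
#             include_dirs=numpy_include_dirs())
print(numpy_include_dirs())
```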

opened by kmike 1
How to use SequenceKFold?

SequenceKFold returns (train, test) indices, but the "fit" methods also require matching "lengths" arrays, so it is not clear how to use SequenceKFold for cross-validation.
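
To make the problem concrete, here is a sketch of a splitter that returns lengths alongside the sample indices. It is a hypothetical helper written with plain lists, not SequenceKFold itself, and the round-robin fold assignment is just one simple choice.

```python
def sequence_folds_sketch(lengths, n_folds):
    """Yield (train_idx, train_lengths, test_idx, test_lengths) for every fold.

    Whole sequences are assigned to folds round-robin, so no sequence is
    ever split between the training part and the test part.
    """
    # Start offset of every sequence in the concatenated sample array.
    starts = []
    offset = 0
    for length in lengths:
        starts.append(offset)
        offset += length

    for fold in range(n_folds):
        train_idx, train_lengths, test_idx, test_lengths = [], [], [], []
        for seq, (start, length) in enumerate(zip(starts, lengths)):
            indices = list(range(start, start + length))
            if seq % n_folds == fold:
                test_idx.extend(indices)
                test_lengths.append(length)
            else:
                train_idx.extend(indices)
                train_lengths.append(length)
        yield train_idx, train_lengths, test_idx, test_lengths


for fold in sequence_folds_sketch([2, 3, 1], n_folds=3):
    print(fold)
```

The sample indices select rows of X and y, while the two lengths lists describe how those selected rows group into sequences.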

opened by kmike 1
will it work for multivariate time series?
Great code, thanks. Could you clarify whether it will work for multivariate time series:

1. where all values are continuous, or
2. where the values are a mixture of continuous and categorical, for example 2 dimensions with continuous values and 3 dimensions with categorical values?

   color   weight  gender  height  age
1  black   56      m       160     34
2  white   77      f       170     54
3  yellow  87      m       167     43
4  white   55      m       198     72
5  white   88      f       176     32
opened by Sandy4321 1

seqlearn not working since new version of sklearn

In newer versions of sklearn, six is no longer bundled with sklearn but is a separate library. As a result, seqlearn raises an error when I try to run a perceptron:

/usr/local/lib/python3.8/dist-packages/seqlearn/perceptron.py in <module>
      8 import numpy as np
      9 from scipy.sparse import csc_matrix
---> 10 from sklearn.externals import six
     11 
     12 from .base import BaseSequenceClassifier

ImportError: cannot import name 'six' from 'sklearn.externals' (/usr/local/lib/python3.8/dist-packages/sklearn/externals/__init__.py)
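
One common way to cope with a module that has moved is a fallback import. The sketch below shows the pattern with the standard library only; import_first is a made-up helper, not part of seqlearn or sklearn, and the usage in the comment assumes the standalone six package is installed.

```python
import importlib


def import_first(*module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(
        "none of these modules could be imported: " + ", ".join(module_names)
    )


# The idea applied to the failing line in perceptron.py:
#   six = import_first("sklearn.externals.six", "six")
```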

opened by peterdekker 1

Installation error: command 'clang' failed with exit status 1
the last part of error message is

seqlearn/_decode/bestfirst.c:600:10: fatal error: 'numpy/arrayobject.h' file not found
#include "numpy/arrayobject.h"
         ^~~~~~~~~~~~~~~~~~~~~
1 error generated.
error: command 'clang' failed with exit status 1
----------------------------------------

ERROR: Command errored out with exit status 1: /usr/local/opt/python/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/4f/zswmtvpn44gfvcz53m_c24400000gn/T/pip-install-6hdkoh2q/seqlearn/setup.py'"'"'; file='"'"'/private/var/folders/4f/zswmtvpn44gfvcz53m_c24400000gn/T/pip-install-6hdkoh2q/seqlearn/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /private/var/folders/4f/zswmtvpn44gfvcz53m_c24400000gn/T/pip-record-gr6jqkyt/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.7m/seqlearn Check the logs for full command output.
opened by Freakwill 1
in hmm.py there is a wrong reference to logsumexp method

In hmm.py, logsumexp is imported with "from scipy.misc import logsumexp". However, in newer versions of SciPy, logsumexp was moved to scipy.special, so the import should be:

from scipy.special import logsumexp
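
For context, logsumexp computes the logarithm of a sum of exponentials in a numerically stable way. The sketch below illustrates the computation with the standard library only; it is not SciPy's implementation, and the name logsumexp_sketch is made up.

```python
import math


def logsumexp_sketch(values):
    """Compute log(sum(exp(v) for v in values)) without overflow."""
    highest = max(values)
    if math.isinf(highest):
        # Either every value is -inf (the sum is 0) or some value is +inf;
        # in both cases the maximum is already the answer.
        return highest
    # Shifting by the maximum keeps every exponent at or below zero.
    return highest + math.log(sum(math.exp(v - highest) for v in values))


print(logsumexp_sketch([1000.0, 1000.0]))
```

A naive math.log(sum(math.exp(v) ...)) would overflow on the inputs above, while the shifted form returns 1000 plus log(2).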

opened by andrey-chu 1

Owner

Lars

GitHub http://larsmans.github.io/seqlearn/

A toolkit for making real world machine learning and data analysis applications in C++

dlib C++ library Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real worl

11.6k Jan 2, 2023

A machine learning toolkit dedicated to time-series data

tslearn The machine learning toolkit for time series analysis in Python Section Description Installation Installing the dependencies and tslearn Getti

2.3k Jan 5, 2023

ml4h is a toolkit for machine learning on clinical data of all kinds including genetics, labs, imaging, clinical notes, and more

65 Dec 20, 2022

Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable.

SDK: Overview of the Kubeflow pipelines service Kubeflow is a machine learning (ML) toolkit that is dedicated to making deployments of ML workflows on

3.1k Jan 6, 2023

A Powerful Serverless Analysis Toolkit That Takes Trial And Error Out of Machine Learning Projects

KXY: A Seemless API to 10x The Productivity of Machine Learning Engineers Documentation https://www.kxy.ai/reference/ Installation From PyPi: pip inst

35 Jan 2, 2023

Model Validation Toolkit is a collection of tools to assist with validating machine learning models prior to deploying them to production and monitoring them after deployment to production.

25 Dec 28, 2022

A Python toolkit for rule-based/unsupervised anomaly detection in time series

Anomaly Detection Toolkit (ADTK) Anomaly Detection Toolkit (ADTK) is a Python package for unsupervised / rule-based time series anomaly detection. As

888 Dec 30, 2022

Kats is a toolkit to analyze time series data, a lightweight, easy-to-use, and generalizable framework to perform time series analysis.

Kats, a kit to analyze time series data, a lightweight, easy-to-use, generalizable, and extendable framework to perform time series analysis, from understanding the key statistics and characteristics, detecting change points and anomalies, to forecasting future trends.

4.1k Dec 29, 2022

A toolkit for geo ML data processing and model evaluation (fork of solaris)

An open source ML toolkit for overhead imagery. This is a beta version of lunular which may continue to develop. Please report any bugs through issues

4 Nov 4, 2021

LibRerank is a toolkit for re-ranking algorithms. There are a number of re-ranking algorithms, such as PRM, DLCM, GSF, miDNN, SetRank, EGRerank, Seq2Slate.

LibRerank LibRerank is a toolkit for re-ranking algorithms. There are a number of re-ranking algorithms, such as PRM, DLCM, GSF, miDNN, SetRank, EGRer

126 Dec 28, 2022

Microsoft contributing libraries, tools, recipes, sample codes and workshop contents for machine learning & deep learning.

366 Jan 3, 2023
