Sequence learning toolkit for Python

Overview

seqlearn

seqlearn is a sequence classification toolkit for Python. It is designed to extend scikit-learn and to offer an API that is as similar to scikit-learn's as possible.

Compiling and installing

Get NumPy >=1.6, SciPy >=0.11, Cython >=0.20.2 and a recent version of scikit-learn. Then issue:

python setup.py install

to install seqlearn.

If you want to use seqlearn from its source directory without installing, you have to compile first:

python setup.py build_ext --inplace

Getting started

The easiest way to start using seqlearn is to fetch a dataset in CoNLL 2000 format. Define a task-specific feature extraction function, e.g.:

>>> def features(sequence, i):
...     yield "word=" + sequence[i].lower()
...     if sequence[i].isupper():
...         yield "Uppercase"
...
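
To see what such a function produces, the sketch below applies the same features function to a made-up two-token sequence. The tokens are purely illustrative and do not come from any CoNLL file.

```python
def features(sequence, i):
    # Same extractor as above: one lowercased word feature per token,
    # plus a marker when the token is written entirely in upper case.
    yield "word=" + sequence[i].lower()
    if sequence[i].isupper():
        yield "Uppercase"

# Hypothetical toy sentence, used only to show the extractor's output.
sentence = ["Hello", "WORLD"]
extracted = [list(features(sentence, i)) for i in range(len(sentence))]
print(extracted)  # [['word=hello'], ['word=world', 'Uppercase']]
```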

Load the training file, say train.txt:

>>> from seqlearn.datasets import load_conll
>>> X_train, y_train, lengths_train = load_conll("train.txt", features)

Train a model:

>>> from seqlearn.perceptron import StructuredPerceptron
>>> clf = StructuredPerceptron()
>>> clf.fit(X_train, y_train, lengths_train)

Check how well you did on a validation set, say validation.txt:

>>> X_test, y_test, lengths_test = load_conll("validation.txt", features)
>>> from seqlearn.evaluation import bio_f_score
>>> y_pred = clf.predict(X_test, lengths_test)
>>> print(bio_f_score(y_test, y_pred))

For more information, see the documentation.

Comments
  • rewrite _utils.count_trans in Cython

    Some profiling results for StructuredPerceptron.fit (using a sparse X with 70k samples/60k features, y with 20 labels, before this PR):

           228411 function calls (227111 primitive calls) in 29.378 seconds
    
       Ordered by: internal time
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         2442   15.957    0.007   15.976    0.007 _utils.py:10(count_trans)
            1    8.533    8.533   29.373   29.373 perceptron.py:51(fit)
         4967    2.242    0.000    2.242    0.000 {numpy.core.multiarray.zeros}
         1300    0.755    0.001    0.755    0.001 {seqlearn._decode.viterbi.viterbi}
         1300    0.485    0.000    0.485    0.000 {_csr.csr_matvecs}
         1221    0.430    0.000    0.430    0.000 {_csc.csc_matvecs}
         1300    0.175    0.000    0.175    0.000 {_csr.get_csr_submatrix}
         2521    0.114    0.000    0.226    0.000 compressed.py:103(check_format)
         2521    0.076    0.000    3.505    0.001 extmath.py:70(safe_sparse_dot)
         2521    0.068    0.000    0.078    0.000 compressed.py:633(prune)
         2521    0.048    0.000    3.212    0.001 compressed.py:272(_mul_multivector)
         2521    0.043    0.000    0.322    0.000 compressed.py:22(__init__)
         3900    0.035    0.000    0.047    0.000 sputils.py:95(isintlike)
    

    Rewriting count_trans in Cython can give roughly a 2x speedup in some scenarios (the same data, after this PR):

             230905 function calls (229605 primitive calls) in 13.748 seconds
    
       Ordered by: internal time
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    8.778    8.778   13.745   13.745 perceptron.py:51(fit)
         2525    2.264    0.001    2.264    0.001 {numpy.core.multiarray.zeros}
         1300    0.793    0.001    0.793    0.001 {seqlearn._decode.viterbi.viterbi}
         1300    0.504    0.000    0.504    0.000 {_csr.csr_matvecs}
         1221    0.445    0.000    0.445    0.000 {_csc.csc_matvecs}
         2521    0.115    0.000    0.233    0.000 compressed.py:103(check_format)
         1300    0.086    0.000    0.086    0.000 {_csr.get_csr_submatrix}
         2521    0.080    0.000    3.596    0.001 extmath.py:70(safe_sparse_dot)
         2521    0.072    0.000    0.082    0.000 compressed.py:633(prune)
         2521    0.049    0.000    3.291    0.001 compressed.py:272(_mul_multivector)
         2521    0.043    0.000    0.331    0.000 compressed.py:22(__init__)
        17569    0.042    0.000    0.042    0.000 {numpy.core.multiarray.array}
         3900    0.036    0.000    0.047    0.000 sputils.py:95(isintlike)
    2600/1300    0.032    0.000    0.416    0.000 csr.py:189(__getitem__)
         2442    0.030    0.000    0.030    0.000 {seqlearn._utils_cython.count_trans}
    

    I'm not sure the code is organized in a good way (_utils_cython.pyx), but anyways.

    opened by kmike 7
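
For readers unfamiliar with the operation being profiled, here is a minimal NumPy sketch of transition counting for a single label sequence. It is not the project's implementation (neither the original Python version nor the Cython rewrite); the name count_trans_sketch and its signature are assumptions made for illustration.

```python
import numpy as np

def count_trans_sketch(y, n_classes):
    """Count label-to-label transitions within one sequence.

    y is a 1-d sequence of integer label indices; the result has shape
    (n_classes, n_classes), with entry [i, j] counting i -> j steps.
    """
    y = np.asarray(y, dtype=np.intp)
    counts = np.zeros((n_classes, n_classes), dtype=np.intp)
    # np.add.at accumulates correctly even when an (i, j) pair repeats.
    np.add.at(counts, (y[:-1], y[1:]), 1)
    return counts

# Hypothetical label sequence with 3 classes.
counts = count_trans_sketch([0, 1, 1, 2, 1, 1], n_classes=3)
print(counts)
```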
  • vectorized viterbi decoding

    Hi,

    Here is a vectorized viterbi function. I couldn't work out how to vectorize it using outer products, so I just rewrote the inner loop using numpy vectorized operations. For my data (20 states, 30k samples) it makes StructuredPerceptron.fit 4x faster.

    opened by kmike 4
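
The idea of vectorizing only the inner loop can be sketched as follows. This is an illustrative NumPy version, not the code from the PR; the function name viterbi_sketch and the argument layout (per-sample scores plus transition, initial, and final scores) are assumptions.

```python
import numpy as np

def viterbi_sketch(score, trans, init, final):
    """Return the highest-scoring state path.

    score: (n_samples, n_states) per-sample state scores
    trans: (n_states, n_states), trans[i, j] scores a step from i to j
    init, final: (n_states,) scores for the first and last state
    """
    n_samples, n_states = score.shape
    dp = np.empty((n_samples, n_states))
    backp = np.zeros((n_samples, n_states), dtype=np.intp)
    dp[0] = init + score[0]
    for t in range(1, n_samples):
        # cand[i, j]: best score ending in state i at t-1, then moving to j.
        # This one broadcasted sum replaces the loop over state pairs.
        cand = dp[t - 1][:, None] + trans
        backp[t] = cand.argmax(axis=0)
        dp[t] = cand.max(axis=0) + score[t]
    dp[-1] += final

    # Follow the back-pointers from the best final state.
    path = np.empty(n_samples, dtype=np.intp)
    path[-1] = dp[-1].argmax()
    for t in range(n_samples - 1, 0, -1):
        path[t - 1] = backp[t, path[t]]
    return path

# With all-zero transition, initial and final scores, the best path is
# simply the per-sample argmax of the scores.
score = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
zeros = np.zeros(2)
print(viterbi_sketch(score, np.zeros((2, 2)), zeros, zeros))  # [0 1 1]
```

The outer loop over samples stays in Python; only the work per sample is handed to NumPy.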
  • Dataset loading

    I'm trying to train an HMM that classifies lines in an HTML document as belonging to a certain zone or class (e.g. body, header, footer, title, etc.). Thus, each sequence is a document and each sample is a line. On each line, I compute a number of floating-point features.

    What is the correct input format for this data in order to train a model with seqlearn? I'm having trouble understanding how to format the data from the documentation.

    opened by danintheory 3
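
Judging from the README example, where load_conll returns (X, y, lengths) and fit takes the same three arguments, one plausible layout is to stack the samples of all sequences into a single matrix and record each sequence's length separately. The sketch below builds such arrays with NumPy only. The documents, feature values, and labels are invented, and the sketch does not address whether a particular seqlearn model accepts floating-point features.

```python
import numpy as np

# Hypothetical per-document feature matrices: one row per line of the
# document, one column per floating-point feature (3 features here).
doc_a = np.array([[0.1, 1.0, 0.0],
                  [0.4, 0.0, 2.5]])          # document with 2 lines
doc_b = np.array([[0.9, 0.2, 0.3],
                  [0.5, 0.5, 0.5],
                  [0.0, 1.0, 1.0]])          # document with 3 lines
labels_a = ["title", "body"]
labels_b = ["header", "body", "footer"]

# Stack all samples; lengths records where each sequence ends.
X = np.vstack([doc_a, doc_b])
y = np.array(labels_a + labels_b)
lengths = np.array([len(doc_a), len(doc_b)])

print(X.shape, y.shape, lengths)  # (5, 3) (5,) [2 3]
```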
  • Change verbose reporting format for StructuredPerceptron

    What do you think about

    1. using sum_loss instead of loss (currently the printed loss is the loss for some random sequence), and
    2. printing the iteration number on the same line as sum_loss (so half as many output lines)?

    opened by kmike 3
  • Using a `requirements.txt` file

    I was wondering if it would be possible to include a requirements.txt in the root directory with the prerequisites?

    I am currently making a one-line installer for a (private) module that downloads and installs seqlearn directly from GitHub using pip install -e git+https://github.com/larsmans/seqlearn.git#egg=seqlearn. But this won't work in new virtual environments without first installing Cython manually.

    I know this has limited usefulness, but it would simplify things for me a notch.

    Awesome project you guys have got going here BTW.

    opened by jonathf 1
  • Fixed missing numpy headers in OS X build.

    Trying to build seqlearn on OS X 10.11 results in the following error:

    $ python setup.py install
    ...
    clang -fno-strict-aliasing -fno-common -dynamic -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c seqlearn/_decode/bestfirst.c -o build/temp.macosx-10.10-x86_64-2.7/seqlearn/_decode/bestfirst.o
    seqlearn/_decode/bestfirst.c:250:10: fatal error: 'numpy/arrayobject.h' file not found
    #include "numpy/arrayobject.h"
             ^
    1 error generated.
    error: command 'clang' failed with exit status 1
    

    This is the same error that was solved here: https://github.com/hmmlearn/hmmlearn/issues/43. This PR ports the hmmlearn fix to seqlearn.

    opened by kushalc 1
  • pypi?

    larsmans, would you be willing to push this to PyPI so that it is easy to include as a package dependency in other tools? I am happy to help with this if you'd like.

    opened by johnrfrank 1
  • [WIP] Edge/transition features

    I tried to replace the node features by edge/transition features (aka dyad features), by assigning n_classes² weights to each feature instead of only n_classes. The result didn't fit in the memory of my laptop and actually I don't want to waste memory space on it.

    This branch contains, so far, a custom sparse matrix datatype based on hash tables that allows adding elements (changing the sparsity). It has a function for multiplying a CSR matrix and a hashtable matrix together.

    The implementation may be slow because of all the malloc'ing; if that turns out to be unacceptable, we can try a C++ implementation using unordered_map or std::vector and a custom hash function instead.

    opened by larsmans 1
  • FIX bestfirst decoder should also use np.intp

    StructuredPerceptron raises an exception when np.int32 is used.

    The following test still fails though:

    from seqlearn._decode import DECODERS
    from seqlearn.perceptron import StructuredPerceptron
    
    def test_perceptron_decoders():
        """Assert that perceptron works with all decoders."""
        for decoder in DECODERS:
            clf = StructuredPerceptron(max_iter=1, decode=decoder)
            clf.fit([[1, 2, 3]], [1], [1])  # no exception
    
    Traceback (most recent call last):
      File "/Users/kmike/envs/scraping/lib/python2.7/site-packages/nose/case.py", line 197, in runTest
        self.test(*self.arg)
      File "/Users/kmike/svn/seqlearn/seqlearn/tests/test_perceptron.py", line 50, in test_perceptron_decoders
        clf.fit([[1, 2, 3]], [1], [1])  # no exception
      File "/Users/kmike/svn/seqlearn/seqlearn/perceptron.py", line 112, in fit
        y_pred = decode(Score, w_trans, w_init, w_final)
      File "/Users/kmike/svn/seqlearn/seqlearn/_decode/bestfirst.py", line 17, in bestfirst
        path[-1] = np.argmax(trans[path[-2], :] + Score[-1] + final)
    IndexError: index -2 is out of bounds for axis 0 with size 1
    
    

    Also, the "bestfirst" decoder is currently much slower than viterbi.

    opened by kmike 1
  • Fix installation

    With the Cython extension, StructuredPerceptron.fit is 2.5x faster than with vectorized numpy on my data, thanks!

    But installation now requires numpy at build time, so without the proper headers it fails. This PR contains a simplified version of the numpy code from scikit-learn's setup.py. I'm not sure what the other numpy-related code in scikit-learn's setup.py does; numpy.get_include() seems to be enough for seqlearn.

    opened by kmike 1
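
As a rough illustration of the numpy.get_include() approach, the sketch below declares one extension module with the NumPy header directory on its include path. It is not the setup.py from this PR. The source path is the one that appears in the build errors reported for seqlearn, and the dotted module name is an assumption derived from that path.

```python
import numpy
from setuptools import Extension

# numpy.get_include() returns the directory that holds
# numpy/arrayobject.h, which the generated C code includes.
ext = Extension(
    "seqlearn._decode.bestfirst",              # assumed module name
    sources=["seqlearn/_decode/bestfirst.c"],
    include_dirs=[numpy.get_include()],
)

print(ext.include_dirs)
```

An extension declared this way would then be passed to setup() through its ext_modules argument.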
  • How to use SequenceKFold?

    SequenceKFold returns (train, test) indices, but the "fit" methods also require matching "lengths" arrays, so it is not clear how to use SequenceKFold for cross-validation.

    opened by kmike 1
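
One way to keep sample indices and lengths consistent is to split at the level of whole sequences and derive both from the original lengths array. The sketch below does this with NumPy only. The helper split_by_sequence is hypothetical and is not part of seqlearn, and the sketch makes no claim about what SequenceKFold itself returns.

```python
import numpy as np

def split_by_sequence(lengths, test_sequences):
    """Return (train_idx, train_lengths, test_idx, test_lengths).

    lengths: length of each sequence, in the order the samples are stacked
    test_sequences: indices of the sequences held out for testing
    """
    lengths = np.asarray(lengths)
    ends = np.cumsum(lengths)
    starts = ends - lengths
    is_test = np.zeros(len(lengths), dtype=bool)
    is_test[list(test_sequences)] = True

    def sample_indices(mask):
        # Expand each selected sequence into its range of sample rows.
        chosen = np.flatnonzero(mask)
        if len(chosen) == 0:
            return np.array([], dtype=np.intp)
        return np.concatenate([np.arange(starts[s], ends[s]) for s in chosen])

    return (sample_indices(~is_test), lengths[~is_test],
            sample_indices(is_test), lengths[is_test])

# Three sequences of lengths 2, 3 and 1; hold out the middle one.
train_idx, train_len, test_idx, test_len = split_by_sequence([2, 3, 1], [1])
print(train_idx, train_len)  # [0 1 5] [2 1]
print(test_idx, test_len)    # [2 3 4] [3]
```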
  • will it work for multivariate time series?

    Great code, thanks. Could you clarify whether it will work for multivariate time series:

    1. where all values are continuous, or
    2. where the values are a mixture of continuous and categorical, for example 2 dimensions with continuous values and 3 dimensions with categorical values?

        color   weight  gender  height  age
    1   black   56      m       160     34
    2   white   77      f       170     54
    3   yellow  87      m       167     43
    4   white   55      m       198     72
    5   white   88      f       176     32

    opened by Sandy4321 1
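
Whatever the answer for a given model, mixed data like the table above first has to become a numeric matrix. The sketch below shows one common way to do that, one-hot encoding the categorical columns and passing the continuous ones through. The helper encode_rows is hypothetical and not part of seqlearn, and the sketch does not settle whether seqlearn's models are suited to such features.

```python
import numpy as np

# The five rows from the question, as dictionaries.
rows = [
    {"color": "black", "weight": 56, "gender": "m", "height": 160, "age": 34},
    {"color": "white", "weight": 77, "gender": "f", "height": 170, "age": 54},
    {"color": "yellow", "weight": 87, "gender": "m", "height": 167, "age": 43},
    {"color": "white", "weight": 55, "gender": "m", "height": 198, "age": 72},
    {"color": "white", "weight": 88, "gender": "f", "height": 176, "age": 32},
]

def encode_rows(rows, categorical, continuous):
    """One-hot encode categorical columns and append continuous ones."""
    # One output column per (categorical column, observed value) pair.
    vocab = [(col, val) for col in categorical
             for val in sorted({r[col] for r in rows})]
    X = np.zeros((len(rows), len(vocab) + len(continuous)))
    for i, r in enumerate(rows):
        for j, (col, val) in enumerate(vocab):
            X[i, j] = 1.0 if r[col] == val else 0.0
        for k, col in enumerate(continuous):
            X[i, len(vocab) + k] = r[col]
    names = ["%s=%s" % pair for pair in vocab] + list(continuous)
    return X, names

X, names = encode_rows(rows, ["color", "gender"], ["weight", "height", "age"])
print(names)
print(X.shape)  # (5, 8)
```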
  • seqlearn not working since new version of sklearn

    In the newer version of sklearn that I am using, six is no longer part of sklearn but a separate library. As a result, seqlearn raises an error when I try to run a perceptron:

    /usr/local/lib/python3.8/dist-packages/seqlearn/perceptron.py in <module>
          8 import numpy as np
          9 from scipy.sparse import csc_matrix
    ---> 10 from sklearn.externals import six
         11 
         12 from .base import BaseSequenceClassifier
    
    ImportError: cannot import name 'six' from 'sklearn.externals' (/usr/local/lib/python3.8/dist-packages/sklearn/externals/__init__.py)
    
    opened by peterdekker 1
  • Installation error: command 'clang' failed with exit status 1

    The last part of the error message is:

    seqlearn/_decode/bestfirst.c:600:10: fatal error: 'numpy/arrayobject.h' file not found
    #include "numpy/arrayobject.h"
             ^~~~~~~~~~~~~~~~~~~~~
    1 error generated.
    error: command 'clang' failed with exit status 1
    ----------------------------------------
    

    ERROR: Command errored out with exit status 1: /usr/local/opt/python/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/4f/zswmtvpn44gfvcz53m_c24400000gn/T/pip-install-6hdkoh2q/seqlearn/setup.py'"'"'; file='"'"'/private/var/folders/4f/zswmtvpn44gfvcz53m_c24400000gn/T/pip-install-6hdkoh2q/seqlearn/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /private/var/folders/4f/zswmtvpn44gfvcz53m_c24400000gn/T/pip-record-gr6jqkyt/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.7m/seqlearn Check the logs for full command output.

    opened by Freakwill 1
  • in hmm.py there is a wrong reference to logsumexp method

    In hmm.py, logsumexp is imported as from scipy.misc import logsumexp. However, in newer versions of scipy, logsumexp was moved to scipy.special, so the import should be:

    from scipy.special import logsumexp

    opened by andrey-chu 1
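
If both older and newer SciPy versions need to be supported, a guarded import is one option. This is a sketch of that pattern, not a change that is known to have been applied to hmm.py.

```python
import math

try:
    # Location in newer SciPy releases.
    from scipy.special import logsumexp
except ImportError:
    # Fallback for older releases that only provide scipy.misc.logsumexp.
    from scipy.misc import logsumexp

# log(exp(0) + exp(0)) = log(2)
value = float(logsumexp([0.0, 0.0]))
print(abs(value - math.log(2)) < 1e-12)  # True
```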