scikit-learn inspired API for CRFsuite

Overview

sklearn-crfsuite

sklearn-crfsuite is a thin CRFsuite (python-crfsuite) wrapper which provides an interface similar to scikit-learn. sklearn_crfsuite.CRF is a scikit-learn compatible estimator: you can use scikit-learn model selection utilities (cross-validation, hyperparameter optimization) with it, or save/load CRF models using joblib.

License is MIT.

Documentation can be found here.


Comments
  • How to create features with duplicate keys?

    I see in the CRFsuite documentation (http://www.chokkan.org/software/crfsuite/manual.html) that feature keys can be duplicated:

    B-NP    w[1..4]=a:2 w[1..4]=man w[1..4]=eats
    B-NP    w[1..4]=a w[1..4]=a w[1..4]=man w[1..4]=eats
    B-NP    w[1..4]=a:2.0 w[1..4]=man:1.0 w[1..4]=eats:1.0
    

    How can I create features with duplicate keys when using sklearn-crfsuite?

    opened by binhnq94 8
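
sklearn-crfsuite takes each token's features as a Python dict, and dict keys are unique, so a repeated key cannot be written literally. If, as the three quoted lines suggest, a repeated attribute is equivalent to a single attribute with the summed weight, the duplicates can be merged before they are handed to the estimator. This is a minimal sketch under that assumption; merge_duplicate_features is a hypothetical helper, not part of sklearn-crfsuite.

```python
from collections import defaultdict


def merge_duplicate_features(pairs):
    """Collapse repeated (key, weight) pairs into one dict, summing the weights.

    Hypothetical helper: assumes a repeated attribute is equivalent to a
    single attribute whose weight is the sum of the repeats.
    """
    merged = defaultdict(float)
    for key, weight in pairs:
        merged[key] += weight
    return dict(merged)


# Mirrors the second quoted line, where "w[1..4]=a" appears twice.
item = merge_duplicate_features([
    ("w[1..4]=a", 1.0),
    ("w[1..4]=a", 1.0),
    ("w[1..4]=man", 1.0),
    ("w[1..4]=eats", 1.0),
])
print(item)
```

The resulting dict maps each key to a float weight and would be one element of a feature sequence in X.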
  • Different result despite same input

    I created several CRF instances and trained them on the same training set with the same max_iterations parameter.

    crf = sklearn_crfsuite.CRF(
        algorithm='ap',
        max_iterations=5,
    )
    crf.fit(X_train, Y_train)

    t = sklearn_crfsuite.CRF(
        algorithm='ap',
        max_iterations=5,
    )
    t.fit(X_train, Y_train)
    

    However, their results differ (I tested them on the same development set using F-measure). I hope to hear from you soon. Thank you.

    opened by iamhuy 6
  • UnicodeEncodeError:

    Hello! Thank you for your work.

    I am experimenting with Russian texts, but I get this error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128). I think my data set contains some strange symbols, but how can I find them?

    Python 3.6, latest version of sklearn-crfsuite.

    opened by Ulitochka 5
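
The message says the 'ascii' codec was used, so any character above code point 127 would trigger it, including ordinary Cyrillic letters and not only unusual symbols. A minimal sketch for listing which tokens contain such characters; find_non_ascii is a hypothetical helper and the sample sentences are made up. It only locates the characters and does not address why an ASCII codec is being used.

```python
def find_non_ascii(sentences):
    """Yield (sentence_index, token_index, token, offending_chars) tuples.

    Hypothetical helper: `sentences` is assumed to be a list of token lists.
    """
    for s_idx, sentence in enumerate(sentences):
        for t_idx, token in enumerate(sentence):
            # Characters the 'ascii' codec cannot encode.
            bad = [ch for ch in token if ord(ch) > 127]
            if bad:
                yield s_idx, t_idx, token, bad


sentences = [["Hello", "мир"], ["plain", "text"]]
for hit in find_non_ascii(sentences):
    print(hit)
```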
  • Mini-batch training

    Does the CRF implementation support mini-batch training? Some sklearn predictors have a partial_fit method which supports incremental training. Would there be scope to extend the current implementation to include this?

    opened by uwaisiqbal 5
  • Sequence labelling issue: The numbers of items and labels differ...

    Hi, I'm trying to use sklearn-crfsuite for sequence labelling.

    When running crf.fit(train_data, train_targets) on my data, I get the stack trace below:

    Traceback (most recent call last):
      File ".../argument_segmenter.py", line 49, in train
        crf.fit(train_data, train_targets)
      File "/usr/local/lib/python3.9/site-packages/sklearn_crfsuite/estimator.py", line 314, in fit
        trainer.append(xseq, yseq)
      File "pycrfsuite/_pycrfsuite.pyx", line 312, in pycrfsuite._pycrfsuite.BaseTrainer.append
    ValueError: The numbers of items and labels differ: |x| = 40, |y| = 38
    

    I noticed in https://github.com/TeamHG-Memex/sklearn-crfsuite/issues/20 that someone suggests using a custom scorer, but I don't seem to get past the fitting stage.

    Any advice would be appreciated.

    My code looks like this:

    train_data, test_data, train_targets, test_targets = load_data()
    
    train_data = [sent2features(s) for s in train_data]
    train_targets = [sent2labels(s) for s in train_targets]
    
    test_data = [sent2features(s) for s in test_data]
    test_targets = [sent2labels(s) for s in test_targets]
    
    crf = sklearn_crfsuite.CRF(
        algorithm='lbfgs',
        c1=0.1,
        c2=0.1,
        max_iterations=100,
        all_possible_transitions=True
    )
    
    try:
        crf.fit(train_data, train_targets)
    except Exception as e:
        logging.error(e)
    
    opened by chriswales95 3
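
The ValueError above is raised when a feature sequence and its label sequence have different lengths (here |x| = 40 and |y| = 38). A minimal sketch for locating such pairs before calling fit; find_length_mismatches is a hypothetical helper, not part of sklearn-crfsuite, and the toy X and y are made up.

```python
def find_length_mismatches(X, y):
    """Return (index, len(xseq), len(yseq)) for each pair whose lengths differ.

    Hypothetical helper. zip() stops at the shorter input, so the number
    of sequences, len(X) versus len(y), should be compared separately.
    """
    return [
        (i, len(xseq), len(yseq))
        for i, (xseq, yseq) in enumerate(zip(X, y))
        if len(xseq) != len(yseq)
    ]


# Toy data: the second sequence has one feature dict but two labels.
X = [[{"w": "a"}, {"w": "b"}], [{"w": "c"}]]
y = [["O", "O"], ["O", "B"]]
print(find_length_mismatches(X, y))
```

Running this on train_data and train_targets from the snippet above would show which sentences produce different counts from sent2features and sent2labels.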
  • Possible memory leak problem?

    Hi @kmike

    My colleagues used a Java version of CRFsuite and found a memory leak in it. We therefore checked the original CRFsuite site and found a number of issues related to this problem: Results in chokkan/crfsuite. The latest fix accepted by the author was in 2016, and there are some more recent commits by other contributors.

    According to the Python-CRFsuite docs, the latest fix for this issue dates back to 2015. Can you tell us whether the latest Python-CRFsuite or sklearn-crfsuite fixes those problems? Many thanks!

    opened by acepor 3
  • Effective Feature Induction to Increase F1

    Hello,

    I want to use some conjunctions of features to increase my F1 score. Is there any functionality for inducing features effectively?

    Or

    Does sklearn-crfsuite support the algorithm described in this paper? https://people.cs.umass.edu/~mccallum/papers/ifcrf-uai2003.pdf

    Thanks

    opened by emirceyani 3
  • Is there an easy way to obtain a confusion matrix?

    I'm trying

        confusion_matrix(y_test, y_pred)
    

    with sklearn's method, but am getting the error message

    ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.
    
    opened by goerch 2
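
The error arises because y_test and y_pred are lists of label sequences, while sklearn's confusion_matrix expects flat label lists. A minimal sketch that flattens both before the call; flatten and flat_confusion_matrix are hypothetical helpers, and scikit-learn is imported inside the function only so that flatten stays usable on its own.

```python
from itertools import chain


def flatten(sequences):
    """Concatenate a list of label sequences into one flat list."""
    return list(chain.from_iterable(sequences))


def flat_confusion_matrix(y_true, y_pred, labels=None):
    """Confusion matrix over all tokens of all sequences (hypothetical helper)."""
    # Lazy import: only this function needs scikit-learn.
    from sklearn.metrics import confusion_matrix
    return confusion_matrix(flatten(y_true), flatten(y_pred), labels=labels)


print(flatten([["B", "O"], ["O"]]))
```

Passing an explicit labels list fixes the row and column order of the returned matrix.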
  • sklearn.model_selection.cross_validate() can't run.

    Nice to meet you. I am a student studying with your package. I have run into a problem that I cannot solve by myself.

    I followed your tutorial using the features from this site (https://qiita.com/Hironsan/items/326b66711eb4196aa9d4) and added cross-validation as follows.

    from sklearn.model_selection import cross_validate
    scores = cross_validate(crf, X, y, scoring="f1_macro", cv=5)
    print(scores["test_score"])  # cross_validate returns a dict
    

    However, the following error occurs.

    Traceback (most recent call last):
      File "/program/crf.py", line 41, in <module>
        scores = cross_validate(crf, X, y, scoring="f1_macro", cv=5)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 195, in cross_validate
        for train, test in cv.split(X, y, groups))
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 779, in __call__
        while self.dispatch_one_batch(iterator):
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 625, in dispatch_one_batch
        self._dispatch(tasks)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 588, in _dispatch
        job = self._backend.apply_async(batch, callback=cb)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 111, in apply_async
        result = ImmediateResult(func)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 332, in __init__
        self.results = batch()
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
        return [func(*args, **kwargs) for func, args, kwargs in self.items]
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 131, in <listcomp>
        return [func(*args, **kwargs) for func, args, kwargs in self.items]
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 467, in _fit_and_score
        test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 502, in _score
        return _multimetric_score(estimator, X_test, y_test, scorer)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 532, in _multimetric_score
        score = scorer(estimator, X_test, y_test)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/metrics/scorer.py", line 108, in __call__
        **self._kwargs)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 714, in f1_score
        sample_weight=sample_weight)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 828, in fbeta_score
        sample_weight=sample_weight)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 1025, in precision_recall_fscore_support
        y_type, y_true, y_pred = _check_targets(y_true, y_pred)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 72, in _check_targets
        type_true = type_of_target(y_true)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/utils/multiclass.py", line 259, in type_of_target
        raise ValueError('You appear to be using a legacy multi-label data'
    ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.
    
    

    So I added the following before cross-validation.

    from sklearn.preprocessing import MultiLabelBinarizer

    trans_X = []
    mlb = MultiLabelBinarizer()
    for x in X:
        x = mlb.fit_transform(x)
        trans_X.append(x.astype(bytes))
    X = trans_X
    y = MultiLabelBinarizer().fit_transform(y)
    y = y.astype(bytes)
    

    However, the following error occurs.

    Traceback (most recent call last):
      File "/program/crf.py", line 41, in <module>
        scores = cross_validate(crf, X, y, scoring="f1_macro", cv=5)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 195, in cross_validate
        for train, test in cv.split(X, y, groups))
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 779, in __call__
        while self.dispatch_one_batch(iterator):
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 625, in dispatch_one_batch
        self._dispatch(tasks)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 588, in _dispatch
        job = self._backend.apply_async(batch, callback=cb)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 111, in apply_async
        result = ImmediateResult(func)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 332, in __init__
        self.results = batch()
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
        return [func(*args, **kwargs) for func, args, kwargs in self.items]
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 131, in <listcomp>
        return [func(*args, **kwargs) for func, args, kwargs in self.items]
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 437, in _fit_and_score
        estimator.fit(X_train, y_train, **fit_params)
      File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn_crfsuite/estimator.py", line 314, in fit
        trainer.append(xseq, yseq)
      File "pycrfsuite/_pycrfsuite.pyx", line 312, in pycrfsuite._pycrfsuite.BaseTrainer.append
    ValueError: The numbers of items and labels differ: |x| = 62, |y| = 3
    

    Please tell me how to solve this problem. I am sorry to ask when you are busy, and I appreciate your help.

    opened by ss1357 2
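
The first traceback comes from the built-in "f1_macro" scorer receiving lists of label sequences. One possible way around it, sketched here under assumptions, is to pass a callable scorer that flattens the sequences first; scikit-learn's scoring parameter accepts a callable with the signature (estimator, X, y). The names flatten, macro_f1 and flat_macro_f1_scorer are hypothetical, and the F1 is computed in plain Python only to keep the sketch self-contained.

```python
from itertools import chain


def flatten(sequences):
    """Concatenate a list of label sequences into one flat list."""
    return list(chain.from_iterable(sequences))


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over two flat label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # The denominator is non-zero: the label occurs in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores) if scores else 0.0


def flat_macro_f1_scorer(estimator, X, y):
    """Scorer callable: predict sequences, flatten, then score."""
    return macro_f1(flatten(y), flatten(estimator.predict(X)))
```

It could then be used as cross_validate(crf, X, y, scoring=flat_macro_f1_scorer, cv=5), with X and y left as lists of sequences rather than binarized. cross_validate returns a dict, so the per-fold scores are under scores["test_score"].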
  • How to save the trained CRF model?

    Thank you so much for the work.

    I'm wondering whether the trained model can be saved. In the API, CRF has a parameter model_filename to import the trained model, and the documentation states:

    By default, model files are created automatically and saved in temporary locations; the preferred way to save/load CRF models is to use pickle (or its alternatives like joblib)

    How can we export the model to an explicit location?

    Many thanks!

    opened by acepor 2
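
Following the quoted recommendation to use pickle, a minimal sketch of writing a fitted estimator to an explicit path and reading it back. save_model and load_model are hypothetical helper names, and the model argument is assumed to be a fitted sklearn_crfsuite.CRF (any picklable object works the same way).

```python
import pickle


def save_model(model, path):
    """Pickle `model` to the explicit location `path` (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a model previously written by save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, save_model(crf, "crf_model.pkl") after fitting, and crf = load_model("crf_model.pkl") later; the file name is only an illustration. Pickle files should only be loaded from trusted sources.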
  • Can features be discarded by the classifier?

    Hello! I am using crfsuite to train models on my own datasets, and I am testing different sets of features (I have a lot of them). However, some of those features seem to have no effect on classification results: e.g. first I use set of features A and get an F1 = X, and then I use set A + B and get the same results, and this repeats on every train and test set I have (if it is any help, my data is various acoustic features of speech in two languages). My question is: is this normal, or is there a possibility that some of these features are somehow discarded by the model? Thank you in advance!

    opened by PKholyavin 1
  • Maintenance is not current

    Is there any possibility of transferring maintenance to someone else? There are many PRs that would fix issues with sklearn-crfsuite and bring it up to date with the sklearn interface.

    opened by vicissitudele 1
  • How to add bigram features?

    Thanks for this excellent package.

    Kindly help with the below questions.

    1. How to use x_t, x_{t-1}, or even x_t, x_{t-1}, x_{t-2}, ..., x_{t-n} as a feature in sklearn-crfsuite?
    2. How to use a float feature instead of buckets of a continuous variable in sklearn-crfsuite? Is there an example of this?
    3. Does sklearn-crfsuite implement only linear-chain CRFs, or general CRFs as well?
    opened by deepak-george 0
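
For questions 1 and 2, one common approach is to put the neighbouring tokens into the feature dict of the current token, and to store a continuous quantity as a float value under its key. The sketch below is an illustration only: token_features, sent2features and every feature name in it are hypothetical choices, and it assumes each sentence is a list of token strings.

```python
def token_features(sentence, i):
    """Feature dict for token i, including its neighbours (hypothetical)."""
    word = sentence[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        # A float value is stored directly instead of a bucketed string.
        "word.length": float(len(word)),
    }
    if i > 0:
        prev = sentence[i - 1]
        features["-1:word.lower"] = prev.lower()
        # Conjunction of previous and current token as a single feature.
        features["-1,0:bigram"] = prev.lower() + "|" + word.lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sentence) - 1:
        features["+1:word.lower"] = sentence[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


def sent2features(sentence):
    """One feature dict per token, in order."""
    return [token_features(sentence, i) for i in range(len(sentence))]


print(sent2features(["The", "cat", "sat"])[1])
```

Extending the window to x_{t-2} and beyond follows the same pattern with further offsets.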
  • flat_classification_report seems to be broken

    Hi,

    it appears that flat_classification_report is now broken. Passing positional arguments to scikit-learn's classification_report was deprecated a while back, and it seems this is now being enforced.

    Specifically, labels is no longer a positional argument and must instead be passed as a keyword argument.

    It seems to be a simple fix so I can submit a pull request later.

    opened by chriswales95 3
scikit-learn cross validators for iterative stratification of multilabel data

iterative-stratification iterative-stratification is a project that provides scikit-learn compatible cross validators with stratification for multilab

null 745 Jan 5, 2023
Scikit-learn compatible estimation of general graphical models

skggm : Gaussian graphical models using the scikit-learn API In the last decade, learning networks that encode conditional independence relationships

null 213 Jan 2, 2023
A scikit-learn based module for multi-label et. al. classification

scikit-multilearn scikit-multilearn is a Python module capable of performing multi-label learning tasks. It is built on-top of various scientific Pyth

null 803 Jan 5, 2023
scikit-learn inspired API for CRFsuite

sklearn-crfsuite sklearn-crfsuite is a thin CRFsuite (python-crfsuite) wrapper which provides interface similar to scikit-learn. sklearn_crfsuite.CRF i

null 418 Jan 9, 2023
Auto updating website that tracks closed & open issues/PRs on scikit-learn/scikit-learn.

Repository Status for Scikit-learn Live webpage Auto updating website that tracks closed & open issues/PRs on scikit-learn/scikit-learn. Running local

Thomas J. Fan 6 Dec 27, 2022
Genetic Programming in Python, with a scikit-learn inspired API

Welcome to gplearn! gplearn implements Genetic Programming in Python, with a scikit-learn inspired and compatible API. While Genetic Programming (GP)

Trevor Stephens 1.3k Jan 3, 2023
PySpark + Scikit-learn = Sparkit-learn

Sparkit-learn PySpark + Scikit-learn = Sparkit-learn GitHub: https://github.com/lensacom/sparkit-learn About Sparkit-learn aims to provide scikit-lear

Lensa 1.1k Jan 4, 2023
A set of tools for creating and testing machine learning features, with a scikit-learn compatible API

Feature Forge This library provides a set of tools that can be useful in many machine learning applications (classification, clustering, regression, e

Machinalis 380 Nov 5, 2022
Python package for Bayesian Machine Learning with scikit-learn API

Python package for Bayesian Machine Learning with scikit-learn API Installing & Upgrading package pip install https://github.com/AmazaspShumik/sklearn

Amazasp Shaumyan 482 Jan 4, 2023
Relevance Vector Machine implementation using the scikit-learn API.

scikit-rvm scikit-rvm is a Python module implementing the Relevance Vector Machine (RVM) machine learning technique using the scikit-learn API. Quicks

James Ritchie 204 Nov 18, 2022
Hidden Markov Models in Python, with scikit-learn like API

hmmlearn hmmlearn is a set of algorithms for unsupervised learning and inference of Hidden Markov Models. For supervised learning of HMMs and

null 2.7k Jan 3, 2023
SciKit-Learn Laboratory (SKLL) makes it easy to run machine learning experiments.

SciKit-Learn Laboratory This Python package provides command-line utilities to make it easier to run machine learning experiments with scikit-learn. O

ETS 528 Nov 25, 2022
A scikit-learn compatible neural network library that wraps PyTorch

A scikit-learn compatible neural network library that wraps PyTorch. Resources Documentation Source Code Examples To see more elaborate examples, look

null 4.9k Dec 31, 2022
An intuitive library to add plotting functionality to scikit-learn objects.

Welcome to Scikit-plot Single line functions for detailed visualizations The quickest and easiest way to go from analysis... ...to this. Scikit-plot i

Reiichiro Nakano 2.3k Dec 31, 2022
scikit-learn: machine learning in Python

scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license. The project was started

scikit-learn 52.5k Jan 8, 2023
A scikit-learn compatible neural network library that wraps PyTorch

A scikit-learn compatible neural network library that wraps PyTorch. Resources Documentation Source Code Examples To see more elaborate examples, look

null 3.8k Feb 13, 2021
Scikit-learn style model finetuning for NLP

Scikit-learn style model finetuning for NLP Finetune is a library that allows users to leverage state-of-the-art pretrained NLP models for a wide vari

indico 665 Dec 17, 2022
scikit-learn wrappers for Python fastText.

skift scikit-learn wrappers for Python fastText. >>> from skift import FirstColFtClassifier >>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], colu

Shay Palachy 233 Sep 9, 2022
Scikit-learn style model finetuning for NLP

Scikit-learn style model finetuning for NLP Finetune is a library that allows users to leverage state-of-the-art pretrained NLP models for a wide vari

indico 631 Feb 2, 2021