scikit-learn inspired API for CRFsuite

Last update: Jan 9, 2023

Overview

sklearn-crfsuite

sklearn-crfsuite is a thin CRFsuite (python-crfsuite) wrapper which provides an interface similar to scikit-learn. sklearn_crfsuite.CRF is a scikit-learn compatible estimator: you can use, for example, scikit-learn model selection utilities (cross-validation, hyperparameter optimization) with it, or save/load CRF models using joblib.

License is MIT.

Documentation can be found here.

Comments

How to create features with duplicate keys?
I see in the CRFsuite manual (http://www.chokkan.org/software/crfsuite/manual.html) that a feature key can be duplicated:

B-NP w[1..4]=a:2 w[1..4]=man w[1..4]=eats
B-NP w[1..4]=a w[1..4]=a w[1..4]=man w[1..4]=eats
B-NP w[1..4]=a:2.0 w[1..4]=man:1.0 w[1..4]=eats:1.0

How can I create features with duplicate keys when using sklearn-crfsuite?
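One possible approach, offered as a sketch rather than a confirmed answer: a Python dict cannot hold the same key twice, so repeated attributes can be collapsed into a single key whose value is the number of occurrences, mirroring the `w[1..4]=a:2` form above. This assumes that float values in a feature dict are treated as feature weights, which is how python-crfsuite's documentation describes its `{"string_key": float_weight}` item form; whether this matches CRFsuite's handling of duplicate keys in every respect is an assumption. The helper name `attributes_to_weighted_features` is hypothetical.

```python
from collections import Counter


def attributes_to_weighted_features(attributes):
    """Collapse repeated attribute strings into a {attribute: weight} dict.

    A repeated attribute gets a weight equal to its number of occurrences,
    which mirrors writing `w[1..4]=a:2` instead of listing `w[1..4]=a` twice.
    """
    counts = Counter(attributes)
    return {name: float(count) for name, count in counts.items()}


# One item (token position) whose context window contains "a" twice.
item = attributes_to_weighted_features(
    ["w[1..4]=a", "w[1..4]=a", "w[1..4]=man", "w[1..4]=eats"]
)
print(item)
```

A full sequence would then be a list of such dicts, one per token, used as one element of X.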
opened by binhnq94 8
Different result despite same input
I created several CRF instances and trained them on the same training set with the same max_iterations parameter.

crf = sklearn_crfsuite.CRF(
    algorithm='ap',
    max_iterations=5,
)
crf.fit(X_train, Y_train)

t = sklearn_crfsuite.CRF(
    algorithm='ap',
    max_iterations=5,
)
t.fit(X_train, Y_train)

However, their results are different (I tested them on the same development set using F-measure). I hope to hear from you soon. Thank you.
opened by iamhuy 6
UnicodeEncodeError:

Hello! Thank you for your work.

I am experimenting with Russian texts, but I get this error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128). I think my data set contains some unusual symbols, but how can I find them?

Python 3.6, latest version of sklearn-crfsuite.
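A minimal sketch for locating such characters, assuming X is a list of sequences in which every item is a dict of feature names to values. The function name `find_non_ascii` is hypothetical, and the code avoids `str.isascii()` because that method requires Python 3.7 or later. It only reports where non-ASCII text occurs; it does not address why an ASCII codec is being used in the first place.

```python
def find_non_ascii(sequences):
    """Yield (sequence_index, item_index, text, offending_chars) tuples.

    `sequences` is assumed to be a list of sequences, where every item is a
    dict mapping feature names to values.
    """
    for seq_idx, seq in enumerate(sequences):
        for item_idx, features in enumerate(seq):
            for key, value in features.items():
                # Check both the feature name and, if it is text, its value.
                for text in (key, value):
                    if not isinstance(text, str):
                        continue
                    bad = sorted({ch for ch in text if ord(ch) > 127})
                    if bad:
                        yield seq_idx, item_idx, text, bad


X = [[{"word": "hello"}, {"word": "привет"}]]
hits = list(find_non_ascii(X))
for seq_idx, item_idx, text, bad in hits:
    print(seq_idx, item_idx, repr(text), bad)
```

With Russian input most tokens will be reported, so in practice it may be more useful to narrow the `ord(ch) > 127` condition to the character ranges that are not expected in the data.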

opened by Ulitochka 5
Mini-batch training

Does the CRF implementation support mini-batch training? Some sklearn predictors have a partial_fit method which supports incremental training. Would there be scope to extend the current implementation to include this?

opened by uwaisiqbal 5

Sequence labelling issue: The numbers of items and labels differ...

Hi, I'm trying to use sklearn-crfsuite for sequence labelling.

when running crf.fit(train_data, train_targets) on my data, I get the below stack trace:

Traceback (most recent call last):
  File ".../argument_segmenter.py", line 49, in train
    crf.fit(train_data, train_targets)
  File "/usr/local/lib/python3.9/site-packages/sklearn_crfsuite/estimator.py", line 314, in fit
    trainer.append(xseq, yseq)
  File "pycrfsuite/_pycrfsuite.pyx", line 312, in pycrfsuite._pycrfsuite.BaseTrainer.append
ValueError: The numbers of items and labels differ: |x| = 40, |y| = 38

I noticed in https://github.com/TeamHG-Memex/sklearn-crfsuite/issues/20 that someone suggests using a custom scorer, but I don't seem to get past the fitting stage.

Any advice would be appreciated.

My code looks like this:

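import logging

import sklearn_crfsuite

# load_data, sent2features and sent2labels are defined elsewhere in the project.
train_data, test_data, train_targets, test_targets = load_data()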

train_data = [sent2features(s) for s in train_data]
train_targets = [sent2labels(s) for s in train_targets]

test_data = [sent2features(s) for s in test_data]
test_targets = [sent2labels(s) for s in test_targets]

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)

try:
    crf.fit(train_data, train_targets)
except Exception as e:
    logging.error(e)
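
The error message reports a sequence with 40 items but only 38 labels, so one way to narrow this down is to compare lengths pair by pair before calling fit. The following is a sketch under that assumption; `find_length_mismatches` is a hypothetical helper, and the toy data only stands in for train_data and train_targets.

```python
def find_length_mismatches(X, y):
    """Return (index, len(x_seq), len(y_seq)) for every pair whose lengths differ."""
    return [
        (i, len(xseq), len(yseq))
        for i, (xseq, yseq) in enumerate(zip(X, y))
        if len(xseq) != len(yseq)
    ]


# Toy stand-ins: the second sequence has 3 items but only 2 labels.
X_toy = [[{"w": "a"}, {"w": "b"}], [{"w": "c"}, {"w": "d"}, {"w": "e"}]]
y_toy = [["O", "O"], ["O", "B"]]

mismatches = find_length_mismatches(X_toy, y_toy)
for index, n_items, n_labels in mismatches:
    print("sequence %d: %d items, %d labels" % (index, n_items, n_labels))
```

Note that zip stops at the shorter input, so it is also worth checking that len(X) == len(y).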

opened by chriswales95 3

Possible memory leak problem?

Hi @kmike

My colleagues used a Java version of CRFsuite and found a memory leak in it. We therefore checked the original CRFsuite repository (chokkan/crfsuite) and found a number of issues related to this problem. The latest fix accepted by the author was in 2016, and there are some more recent commits by other contributors.

According to the python-crfsuite documentation, the latest fix for this issue dates back to 2015. Can you tell us whether the latest python-crfsuite or sklearn-crfsuite has fixed those problems? Many thanks!

opened by acepor 3
Effective Feature Induction to Increase F1

Hello,

I want to use some conjunctions of features to increase my F1 score. Is there any functionality for inducing features effectively?

Or

Does sklearn-crfsuite support the algorithm described in this paper? https://people.cs.umass.edu/~mccallum/papers/ifcrf-uai2003.pdf

Thanks

opened by emirceyani 3

Is there an easy way to obtain a confusion matrix?

I'm trying

    confusion_matrix(y_test, y_pred)

with sklearn's method, but am getting the error message

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.
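
The error suggests that y_test and y_pred are lists of label sequences, while the metric expects flat, one-dimensional label lists. A minimal sketch of flattening first is shown below; to stay dependency-free it counts (true, predicted) pairs with a Counter, but the same flat lists could be passed to sklearn's confusion_matrix instead. The helper names `flatten` and `confusion_counts` are hypothetical.

```python
from collections import Counter
from itertools import chain


def flatten(sequences):
    """Turn a list of label sequences into one flat list of labels."""
    return list(chain.from_iterable(sequences))


def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs over all tokens."""
    flat_true, flat_pred = flatten(y_true), flatten(y_pred)
    if len(flat_true) != len(flat_pred):
        raise ValueError("y_true and y_pred contain different numbers of labels")
    return Counter(zip(flat_true, flat_pred))


y_test = [["B-NP", "I-NP", "O"], ["O", "B-NP"]]
y_pred = [["B-NP", "O", "O"], ["O", "B-NP"]]
counts = confusion_counts(y_test, y_pred)
for (true_label, pred_label), n in sorted(counts.items()):
    print(true_label, pred_label, n)
```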

opened by goerch 2

sklearn.model_selection.cross_validate() can't run.

Nice to meet you. I am a student studying with your package, and I have run into a problem that I cannot solve by myself.

I tried your tutorial using the features from this site (https://qiita.com/Hironsan/items/326b66711eb4196aa9d4) and added cross-validation as follows.

from sklearn.model_selection import cross_validate
scores = cross_validate(crf, X, y, scoring="f1_macro", cv=5)
print(scores["test_score"])  # cross_validate returns a dict

However, the following error occurs.

Traceback (most recent call last):
  File "/program/crf.py", line 41, in <module>
    scores = cross_validate(crf, X, y, scoring="f1_macro", cv=5)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 195, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 332, in __init__
    self.results = batch()
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 467, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 502, in _score
    return _multimetric_score(estimator, X_test, y_test, scorer)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 532, in _multimetric_score
    score = scorer(estimator, X_test, y_test)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/metrics/scorer.py", line 108, in __call__
    **self._kwargs)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 714, in f1_score
    sample_weight=sample_weight)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 828, in fbeta_score
    sample_weight=sample_weight)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 1025, in precision_recall_fscore_support
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/metrics/classification.py", line 72, in _check_targets
    type_true = type_of_target(y_true)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/utils/multiclass.py", line 259, in type_of_target
    raise ValueError('You appear to be using a legacy multi-label data'
ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.

So I added the following before cross-validation.

from sklearn.preprocessing import MultiLabelBinarizer

trans_X = []
mlb = MultiLabelBinarizer()
for x in X:
    x = mlb.fit_transform(x)
    trans_X.append(x.astype(bytes))
X = trans_X
y = MultiLabelBinarizer().fit_transform(y)
y = y.astype(bytes)

However, the following error occurs.

Traceback (most recent call last):
  File "/program/crf.py", line 41, in <module>
    scores = cross_validate(crf, X, y, scoring="f1_macro", cv=5)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 195, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 332, in __init__
    self.results = batch()
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 437, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/.pyenv/versions/anaconda3-4.3.1/lib/python3.6/site-packages/sklearn_crfsuite/estimator.py", line 314, in fit
    trainer.append(xseq, yseq)
  File "pycrfsuite/_pycrfsuite.pyx", line 312, in pycrfsuite._pycrfsuite.BaseTrainer.append
ValueError: The numbers of items and labels differ: |x| = 62, |y| = 3

Please tell me how to solve this problem. I am sorry to ask when you are busy, and I appreciate your help.
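
A possible direction, sketched under assumptions: the first traceback comes from the built-in "f1_macro" scorer receiving lists of label sequences, and binarizing appears to change the shapes of X and y so that items and labels no longer line up. An alternative is to leave X and y unchanged and pass a callable scorer that flattens the sequences itself; sklearn's scoring parameter accepts a callable with the signature scorer(estimator, X, y). The name `flat_macro_f1_scorer` is hypothetical, macro F1 is computed in plain Python to keep the block self-contained, and `_FixedPredictor` is only a stand-in for a fitted CRF.

```python
from itertools import chain


def flat_macro_f1_scorer(estimator, X, y):
    """Macro-averaged F1 over all tokens, for estimators predicting label sequences."""
    y_true = list(chain.from_iterable(y))
    y_pred = list(chain.from_iterable(estimator.predict(X)))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0


class _FixedPredictor:
    """Stand-in estimator used only to demonstrate the scorer."""

    def predict(self, X):
        return [["A", "B"], ["A", "A"]]


score = flat_macro_f1_scorer(
    _FixedPredictor(),
    X=[[{}, {}], [{}, {}]],
    y=[["A", "B"], ["B", "A"]],
)
print(score)
```

With the real estimator, this could be passed as cross_validate(crf, X, y, scoring=flat_macro_f1_scorer, cv=5), with the fold scores read from scores["test_score"].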

opened by ss1357 2

How to save the trained CRF model?

Thank you so much for the work.

I'm wondering whether the trained model can be saved. In the API, CRF has a model_filename parameter for importing a trained model, and the documentation states:

By default, model files are created automatically and saved in temporary locations; the preferred way to save/load CRF models is to use pickle (or its alternatives like joblib)

How can we export the model to an explicit location?
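
Following the quoted documentation, which points to pickle or joblib, a minimal sketch with pickle could look like the following. The helper names `save_model` and `load_model` and the file name are hypothetical, and a plain dict stands in for a fitted sklearn_crfsuite.CRF instance so that the block runs without training a model.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a fitted model to an explicit file path."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a model previously written by save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A plain dict stands in for a fitted sklearn_crfsuite.CRF instance here.
model = {"algorithm": "lbfgs", "c1": 0.1, "c2": 0.1}
path = os.path.join(tempfile.mkdtemp(), "crf_model.pkl")
save_model(model, path)
restored = load_model(path)
print(path)
```

As with any pickle file, only load files from sources you trust.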

Many thanks!

opened by acepor 2
Can features be discarded by the classifier?

Hello! I am using crfsuite to train models on my own datasets, and I am testing different sets of features (I have a lot of them). However, some of those features seem to have no effect on classification results: e.g. first I use set of features A and get an F1 = X, and then I use set A + B and get the same results, and this repeats on every train and test set I have (if it is any help, my data is various acoustic features of speech in two languages). My question is: is this normal, or is there a possibility that some of these features are somehow discarded by the model? Thank you in advance!

opened by PKholyavin 1
Maintenance is not current

Is there any possibility of transferring maintenance to someone else? There are many open PRs that would fix a number of issues and bring this package up to date with the current sklearn interface.

opened by vicissitudele 1
How to add bigram features?
Thanks for this excellent package.

Kindly help with the below questions.

How to use x_t, x_{t-1} or even x_t, x_{t-1}, x_{t-2}, ..., x_{t-n} as a feature in sklearn-crfsuite?

How to use a float feature instead of buckets of this continuous variable in sklearn-crfsuite? Any example for this implementation?

Does sklearn-crfsuite only have implementation of linear-chain crf or does it have general crf as well?
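
For the first two questions, one possible shape of a feature function is sketched below, in the style of dict-based per-token features. Previous tokens are added under their own keys, the pair x_{t-1}, x_t is joined into a single conjunction feature, and a continuous value is passed as a float. The sketch assumes that float values are interpreted as feature weights; the function names, the `n_prev` parameter, and all feature keys are illustrative, not part of the library.

```python
def token2features(tokens, t, n_prev=2):
    """Build a feature dict for position t from the current and previous tokens."""
    word = tokens[t]
    features = {
        "bias": 1.0,
        "word": word.lower(),
        # Continuous feature passed as a float instead of being bucketed.
        "word.length": float(len(word)),
    }
    for k in range(1, n_prev + 1):
        if t - k >= 0:
            features["-%d:word" % k] = tokens[t - k].lower()
        else:
            # Position t - k lies before the start of the sentence.
            features["-%d:BOS" % k] = True
    if t >= 1:
        # Conjunction of x_{t-1} and x_t as a single bigram feature.
        features["-1:word|word"] = tokens[t - 1].lower() + "|" + word.lower()
    return features


def sent2features(tokens, n_prev=2):
    """Feature dicts for every position of one tokenized sentence."""
    return [token2features(tokens, t, n_prev) for t in range(len(tokens))]


X_sent = sent2features(["A", "man", "eats"])
for features in X_sent:
    print(features)
```

Raising `n_prev` extends the context window to x_{t-n}; longer conjunctions can be built the same way by joining more tokens into one key.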
opened by deepak-george 0
flat_classification_report seems to be broken

Hi,

It appears that flat_classification_report is now broken. Passing these arguments to scikit-learn's classification_report positionally was deprecated a while back, and it seems this is now being enforced.

Specifically, the issue is that labels is no longer a positional argument and must instead be passed as a keyword argument.

It seems to be a simple fix so I can submit a pull request later.

opened by chriswales95 3