SciKit-Learn Laboratory (SKLL) makes it easy to run machine learning experiments.

ETS

Last update: Nov 25, 2022

Overview

SciKit-Learn Laboratory

This Python package provides command-line utilities to make it easier to run machine learning experiments with scikit-learn. One of the primary goals of our project is to make it so that you can run scikit-learn experiments without actually needing to write any code other than what you used to generate/extract the features.

Installation

You can install using either pip or conda. See details here.

Requirements

Python 3.6+
beautifulsoup4
gridmap (only required if you plan to run things in parallel on a DRMAA-compatible cluster)
joblib
pandas
ruamel.yaml
scikit-learn
seaborn
tabulate

Command-line Interface

The main utility we provide is called run_experiment and it can be used to easily run a series of learners on datasets specified in a configuration file like:

[General]
experiment_name = Titanic_Evaluate_Tuned
# valid tasks: cross_validate, evaluate, predict, train
task = evaluate

[Input]
# these directories could also be absolute paths
# (and must be if you're not running things in local mode)
train_directory = train
test_directory = dev
# Can specify multiple sets of feature files that are merged together automatically
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
# List of scikit-learn learners to use
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
# Column in CSV containing labels to predict
label_col = Survived
# Column in CSV containing instance IDs (if any)
id_col = PassengerId

[Tuning]
# Should we tune parameters of all learners by searching provided parameter grids?
grid_search = true
# Function to maximize when performing grid search
objectives = ['accuracy']

[Output]
# Also compute the area under the ROC curve as an additional metric
metrics = ['roc_auc']
# The following can also be absolute paths
logs = output
results = output
predictions = output
probability = true
models = output

For more information about getting started with run_experiment, please check out our tutorial, or our config file specs.
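
To make the structure of such a configuration file concrete, here is a minimal sketch that reads a shortened copy of it with the standard library. It is not SKLL's own parser; the `summarize` helper and the use of `ast.literal_eval` for list-valued fields are assumptions made purely for illustration.

```python
import ast
import configparser

# A shortened copy of the configuration shown above (comments removed).
CONFIG_TEXT = """
[General]
experiment_name = Titanic_Evaluate_Tuned
task = evaluate

[Input]
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]

[Tuning]
grid_search = true
objectives = ['accuracy']
"""


def summarize(config_text):
    """Hypothetical helper: pull a few fields out of a config string."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return {
        "task": parser.get("General", "task"),
        # List-valued fields are written as Python-style literals.
        "featuresets": ast.literal_eval(parser.get("Input", "featuresets")),
        "learners": ast.literal_eval(parser.get("Input", "learners")),
        "objectives": ast.literal_eval(parser.get("Tuning", "objectives")),
        "grid_search": parser.getboolean("Tuning", "grid_search"),
    }


summary = summarize(CONFIG_TEXT)
# Number of featureset/learner/objective combinations in this experiment.
n_combinations = (
    len(summary["featuresets"]) * len(summary["learners"]) * len(summary["objectives"])
)
print(summary["task"], n_combinations)
```

Here the single merged featureset, four learners, and one objective give four combinations.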

You can also follow this interactive Jupyter tutorial.

We also provide utilities for:

converting between machine learning toolkit formats (e.g., ARFF, CSV)
filtering feature files
joining feature files
other common tasks

Python API

If you just want to avoid writing a lot of boilerplate learning code, you can also use our simple Python API which also supports pandas DataFrames. The main way you'll want to use the API is through the Learner and Reader classes. For more details on our API, see the documentation.

While our API can be broadly useful, it should be noted that the command-line utilities are intended as the primary way of using SKLL. The API is just a nice side-effect of our developing the utilities.

A Note on Pronunciation

SciKit-Learn Laboratory (SKLL) is pronounced "skull": that's where the learning happens.

Talks

Simpler Machine Learning with SKLL 1.0, Dan Blanchard, PyData NYC 2014 (video | slides)
Simpler Machine Learning with SKLL, Dan Blanchard, PyData NYC 2013 (video | slides)

Books

SKLL is featured in Data Science at the Command Line by Jeroen Janssens.

Changelog

See GitHub releases.

Contribute

Thank you for your interest in contributing to SKLL! See CONTRIBUTING.md for instructions on how to get started.

Comments

Add learning curves
Addresses #221.

The way this works is by adding a new task type called learning_curve. This ties into a new learning_curve() method in the Learner class which is adapted from the scikit-learn function sklearn.model_selection.learning_curve(). The reason I didn't just call the scikit-learn function directly is that it works with estimator objects and raw feature arrays. We want to apply the whole SKLL pipeline (feature selection, transformation, hashing, etc.) that the user has specified when computing the learning curve results and so we need to use the SKLL API.

The process of computing the curve is as follows: Only a training set is required. For each point on the learning curve, the training set is split into two partitions 80/20. The learner is trained on the subset of the 80% corresponding to the point of the learning curve and then evaluated on the 20% partition. This is repeated multiple times (using multiple different 80/20 partitions) and the results are averaged. This gives us the score for each point on the learning curve. The whole process is then repeated for each point on the curve.

I consider the learning curve task to be orthogonal to ablation and finding the right hyper-parameters. Therefore, ablation and grid search are not allowed. Just like for the cross-validation task, no models are saved for this task.

Users can specify the various training set sizes and the number of 80/20 partitions for each point in the curve (if they don't, there are reasonable defaults for both).

Users can also specify the number of cross-validation iterations to be used for averaging the results for a given training set size.
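
The procedure described above can be sketched in plain Python. This is only an illustration of the repeated 80/20 scheme, not the code in this PR: the majority-label "learner", the function names, and the default values are all assumptions.

```python
import random
from statistics import mean


def majority_label(labels):
    """Toy stand-in for a learner: predict the most common training label."""
    return max(sorted(set(labels)), key=labels.count)


def learning_curve(examples, train_sizes, n_splits=10, seed=0):
    """Return {train_size: mean accuracy over n_splits random 80/20 splits}.

    `examples` is a list of (features, label) pairs.
    """
    rng = random.Random(seed)
    cut = int(0.8 * len(examples))
    scores = {}
    for size in train_sizes:
        if size > cut:
            raise ValueError(f"train size {size} exceeds the 80% partition ({cut})")
        split_scores = []
        for _ in range(n_splits):
            shuffled = examples[:]
            rng.shuffle(shuffled)
            train_part, test_part = shuffled[:cut], shuffled[cut:]
            # Train on a subset of the 80% partition, evaluate on the 20%.
            subset = train_part[:size]
            predicted = majority_label([label for _, label in subset])
            accuracy = mean(
                1.0 if label == predicted else 0.0 for _, label in test_part
            )
            split_scores.append(accuracy)
        scores[size] = mean(split_scores)
    return scores


# 100 toy examples where roughly three quarters carry label 1.
data = [((i,), 1 if i % 4 else 0) for i in range(100)]
curve = learning_curve(data, train_sizes=[10, 40, 80])
print(curve)
```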

The output of the learning_curve task is a TSV file containing the training set size and the averaged scores for all combinations of featuresets, learners, and objectives. If pandas and seaborn are available, actual learning curves are generated as PNG files - one for each feature set. Each PNG file contains a faceted plot with objective functions on rows and learners on columns. Here's an example plot.

(Note: since grid search is disallowed, we don't really need to train the learner for each objective separately; we could simply train the learner once and then compute the scores using multiple functions. However, this doesn't fit into the current parallelization scheme that SKLL follows and so I didn't feel like changing that. The training jobs are run in parallel so it's not that big a deal anyway.)
opened by desilinguist 32
Allow multiple objective functions in configuration files

The objective field is now replaced by objectives. However, if any config file contains objective, it is normalized to objectives. The objectives field should be a list of objective functions.
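
A minimal sketch of that normalization, assuming the tuning options have already been read into a dictionary. The function name and the decision to reject configs that specify both fields are assumptions, not necessarily what this PR does.

```python
def normalize_objectives(tuning_options):
    """Return a copy of the options with 'objective' folded into 'objectives'."""
    options = dict(tuning_options)
    if "objective" in options:
        if "objectives" in options:
            # Assumption: specifying both fields is treated as an error.
            raise ValueError("Specify either 'objective' or 'objectives', not both.")
        # Wrap the single objective in a list under the new field name.
        options["objectives"] = [options.pop("objective")]
    if not isinstance(options.get("objectives", []), list):
        raise TypeError("'objectives' should be a list of objective functions.")
    return options


print(normalize_objectives({"objective": "accuracy"}))
```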

opened by bndgyawali 32

specifying cross validation folds slows down experiments

Attached is a version of the titanic cross_validate.cfg that specifies the cross validation folds (as output by the original titanic cross_validate.cfg config).

Here are the timings with and without specifying the fold ids file:

Titanic_CV_family.csv+misc.csv+socioeconomic.csv+vitals.csv_RandomForestClassifier.results:Total Time: 0:01:26.131353
Titanic_CV_specify_folds_family.csv+misc.csv+socioeconomic.csv+vitals.csv_RandomForestClassifier.results:Total Time: 0:09:54.971322

Titanic_CV_family.csv+misc.csv+socioeconomic.csv+vitals.csv_DecisionTreeClassifier.results:Total Time: 0:00:02.022798
Titanic_CV_specify_folds_family.csv+misc.csv+socioeconomic.csv+vitals.csv_DecisionTreeClassifier.results:Total Time: 0:00:01.908257

Titanic_CV_family.csv+misc.csv+socioeconomic.csv+vitals.csv_SVC.results:Total Time: 0:00:32.489498
Titanic_CV_specify_folds_family.csv+misc.csv+socioeconomic.csv+vitals.csv_SVC.results:Total Time: 0:05:17.998272

Titanic_CV_family.csv+misc.csv+socioeconomic.csv+vitals.csv_MultinomialNB.results:Total Time: 0:00:01.921427
Titanic_CV_specify_folds_family.csv+misc.csv+socioeconomic.csv+vitals.csv_MultinomialNB.results:Total Time: 0:00:02.363040

Some of the experiments take much longer (e.g. SVC and Random Forest), while others are similar.

All of these timings are from experiments run in local mode.

cross_validate_specify_folds.cfg.txt

opened by aoifecahill 20

Overhaul configuration file parsing
Use a new custom parser class SKLLConfigParser based on ConfigParser.

Move all of the configuration parsing code to config.py and out of experiments.py.

Add a validate() method to the config parser that raises a KeyError if:

unrecognized options are specified

options are specified in more than one section

options are specified in the wrong section (#223)

Add unit tests for the above validation checks

Update tests to deal with the above changes

Update requirements_rtd.txt to install the latest version of ConfigParser (v3.5.0b2)
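
The three validation checks listed above could look roughly like the following sketch. It uses the standard library's `configparser` rather than the `SKLLConfigParser` class from this PR, and the `VALID_OPTIONS` table is a small hypothetical subset of the real option list.

```python
import configparser

# Hypothetical subset of recognized options and the section each belongs in.
VALID_OPTIONS = {
    "experiment_name": "General",
    "task": "General",
    "learners": "Input",
    "train_directory": "Input",
    "grid_search": "Tuning",
    "objectives": "Tuning",
    "results": "Output",
}


def validate(parser):
    """Raise KeyError for unrecognized, duplicated, or misplaced options."""
    seen = {}
    for section in parser.sections():
        for option in parser.options(section):
            if option not in VALID_OPTIONS:
                raise KeyError(f"Unrecognized option: {option}")
            if option in seen:
                raise KeyError(
                    f"Option {option} is in more than one section: "
                    f"{seen[option]} and {section}"
                )
            if VALID_OPTIONS[option] != section:
                raise KeyError(
                    f"Option {option} is in the wrong section {section}; "
                    f"it belongs in {VALID_OPTIONS[option]}"
                )
            seen[option] = section


def first_error(config_text):
    """Return the validation error message for a config string, or None."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    try:
        validate(parser)
    except KeyError as error:
        return error.args[0]
    return None


print(first_error("[Tuning]\nlearners = []\n"))
```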
opened by desilinguist 17
Remove default tuning objective and make it a required field.

Right now, the default metric for all learners is f1_score_micro which doesn't make sense for regressors.

(Note: This issue is related to but separate from #350.)

opened by desilinguist 15
Clean up and fix unit tests
test_skll.py is getting unruly at this point. There are way too many functions in there, and entirely too many lines of duplicate code.

We definitely need to do the following:

[x] Break test_skll.py up into multiple modules (preferably, one per module being tested).

[x] Create some sort of RandomDataWriter class that takes care of writing out the data files with randomly generated feature values, ids, and classes we use all over the place. This should cut down on duplicate code a lot.

[x] Fix the feature scaling tests which are currently disabled.

[x] Make feature_hasher tests reuse the code from the non feature_hasher versions.

[x] Increase our test coverage. If you look at Coveralls, you can see our current coverage is lacking for featureset.py and learner.py.

We may also want to do this stuff:

[ ] Make a skll.test sub package like many others do, so that unit tests can be run even after SKLL is installed.

[x] Reorganize where some of the temporary data files get written to to make more sense. I specifically noticed that a lot of stuff (like test data for merging feature sets and converting files) ends up in the train subdirectory for some reason.

enhancement help wanted
opened by dan-blanchard 15
Properly document the internal conversion of string labels to ints/floats and possible edge cases

This is a bit of a strange issue, but I have worked with a data-set that has labels that look like 2, 2.1, 2.21, 2.3, and 2.31. Using LinearSVC I was expecting the predicted labels to look like those labels. However, due to the use of safe_float in the various reader classes, "2" gets converted to an int 2, "2.21" a float 2.21, etc. Then, after a Python list is constructed of these ints and floats, this list is converted to a numpy array here. So, now we have a numpy array of floats: np.array([2.0, 2.2, 2.21, ...]). The resulting classification model predicts these types of labels, which is unexpected and can cause lots of problems down the road. It seems to me that, if a classification algorithm is being used, safe_float shouldn't even be applied: just leave the labels alone! This would mean that data-sets would depend on what type of algorithm is being used, classification or regression. And that, in turn, would cause issues for the way we allow a list of classifiers/regressors to be specified in the experiment configuration files: We would either have to not allow a combination of such algorithms or read in the data twice, once for each type of algorithm.
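
The effect can be reproduced without SKLL or numpy. In this sketch, `safe_float_sketch` is a simplified stand-in for the real `safe_float` (its exact behavior is an assumption), and a plain `float()` call stands in for the coercion that happens when a mixed list of ints and floats becomes a single numeric array.

```python
def safe_float_sketch(text):
    """Simplified stand-in: convert a label string to int or float if possible."""
    try:
        return int(text)
    except ValueError:
        try:
            return float(text)
        except ValueError:
            return text


raw_labels = ["2", "2.1", "2.21", "2.3", "2.31"]
converted = [safe_float_sketch(label) for label in raw_labels]

# A homogeneous numeric array cannot hold both ints and floats, so every
# value ends up as a float; float() imitates that step here.
as_floats = [float(value) for value in converted]

# Writing the labels back out no longer reproduces the original strings.
round_tripped = [str(value) for value in as_floats]
print(raw_labels[0], "->", round_tripped[0])
```

The first label comes back as "2.0" rather than "2", which is the kind of mismatch that causes trouble when predictions are later compared with the original label strings.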
documentation

opened by mulhod 14
Scikit-learn now includes kappa

Should we just switch to it instead of computing our own?

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html#sklearn.metrics.cohen_kappa_score
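
For reference, unweighted Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The sketch below is a plain-Python illustration of that formula only; it is neither SKLL's implementation nor scikit-learn's, and it does not cover weighted variants.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(y_true)
    # Fraction of positions where the two sequences agree.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected if labels were assigned independently.
    expected = sum(
        true_counts[label] * pred_counts[label] for label in true_counts
    ) / (n * n)
    # Note: undefined (division by zero) when expected agreement is 1.
    return (observed - expected) / (1 - expected)


print(cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0]))
```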

opened by desilinguist 14

SVR kernel string issues with dense data in python 2.7

If feature scaling is set to both when using SVR (causing dense matrices to be used), then SVR complains (see below).

I'm guessing this started with the 0.17.1 bug fix release that aimed to fix a similar issue for python 3.

(python 2 unicode support strikes again!)

Traceback (most recent call last):
  File "/opt/python/2.7/lib/python2.7/site-packages/gridmap/job.py", line 196, in execute
    self.ret = self.function(*self.args, **self.kwlist)
  File "/opt/python/2.7/lib/python2.7/site-packages/skll/experiments.py", line 656, in _classify_featureset
    grid_jobs=grid_search_jobs)
  File "/opt/python/2.7/lib/python2.7/site-packages/skll/learner.py", line 1116, in cross_validate
    shuffle=False))
  File "/opt/python/2.7/lib/python2.7/site-packages/skll/learner.py", line 809, in train
    grid_searcher.fit(xtrain, classes)
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/grid_search.py", line 707, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/grid_search.py", line 493, in _fit
    for parameters in parameter_iterable
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 517, in __call__
    self.dispatch(function, args, kwargs)
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 312, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 136, in __init__
    self.results = func(*args, **kwargs)
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/grid_search.py", line 306, in fit_grid_point
    clf.fit(X_train, y_train, **fit_params)
  File "/opt/python/2.7/lib/python2.7/site-packages/skll/learner.py", line 224, in fit
    orig_fit(self, X, y=y)
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 178, in fit
    fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
  File "/opt/python/2.7/lib/python2.7/site-packages/sklearn/svm/base.py", line 233, in _dense_fit
    max_iter=self.max_iter, random_seed=random_seed)
TypeError: Argument 'kernel' has incorrect type (expected str, got unicode)
done

bug

opened by mheilman 14

get warning with learning curves unnecessarily

I get the following warning even though I do not have objectives specified in my config file. We should probably only output the warning when objectives is specified, otherwise it is confusing.

WARNING : Ignoring "objectives" for the learning_curve task since "metrics" is already specified.

opened by aoifecahill 13
Condense our copy of `DictVectorizer` to just the one method we still need.
Now that @dan-blanchard's DictVectorizer additions have been merged into scikit-learn, all we need is just the __eq__() method. The rest of the code is unnecessary and has been removed.

Shortened some of the test descriptions so that they fit on the console.

Renamed a variable in a test to be more appropriate.
opened by desilinguist 13

Learning curve generation may be broken due to numpy 1.24.0

The learning curve tests fail with the following error:

======================================================================
ERROR: Test learning curve output for experiment with objectives option
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/sklldev/lib/python3.8/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/builds/EducationalTestingService/skll/tests/test_output.py", line 768, in test_learning_curve_output_with_objectives
    run_configuration(config_path, quiet=True, local=True)
  File "/builds/EducationalTestingService/skll/skll/experiments/__init__.py", line 879, in run_configuration
    generate_learning_curve_plots(experiment_name,
  File "/builds/EducationalTestingService/skll/skll/experiments/output.py", line 153, in generate_learning_curve_plots
    ax.fill_between(list(range(len(df_ax_train))),
  File "/root/sklldev/lib/python3.8/site-packages/matplotlib/__init__.py", line 1423, in inner
    return func(ax, *map(sanitize_sequence, args), **kwargs)
  File "/root/sklldev/lib/python3.8/site-packages/matplotlib/axes/_axes.py", line 5367, in fill_between
    return self._fill_between_x_or_y(
  File "/root/sklldev/lib/python3.8/site-packages/matplotlib/axes/_axes.py", line 5272, in _fill_between_x_or_y
    ind, dep1, dep2 = map(
  File "/root/sklldev/lib/python3.8/site-packages/numpy/ma/core.py", line 2360, in masked_invalid
    return masked_where(~(np.isfinite(getdata(a))), a, copy=copy)
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
-------------------- >> begin captured logging << --------------------

For now, a simple workaround is to downgrade to numpy 1.23.5.

dependency

opened by desilinguist 0

adding OrdinalRidge and LAD regressors
This PR adds two new learners OrdinalRidge and LAD from mord library.

There are two things I would like to mention here:

These learners do not have rescaled versions because their predictions are already transformed to lie in the range from zero to the maximum label. Rescaling these transformed predictions makes the two sets of predictions not correlate with each other. Here's the graph I plotted of the predictions made by OrdinalRidge and RescaledOrdinalRidge.

In the unit tests, all the linear/non-linear regressors have 95% correlation with the labels. However, due to the transformed predictions by these learners, the correlation is only 0.85. I was trying to see if make_regression function would generate the data with labels in the given range (here I want the labels to be not less than 0 because predictions will have minimum 0 value), but I could not find such functionality.

I would like to get feedback on these and will work on making this better.
opened by bndgyawali 1
Replace conda-tester and pip-tester repos with GitHub actions

We should be able to set up GitHub Actions to build and test the package on multiple Python versions rather than building and uploading the packages ourselves and then submitting PRs to the conda-tester and pip-tester repositories.

opened by desilinguist 0
Add support for RobustScaler

For some datasets with outliers, removing the mean and scaling by the standard deviation may not be ideal because those values are impacted by the outliers. In those cases, it may be better to use the median for centering and the IQR for scaling. This is done by RobustScaler. We should make this available in SKLL too.
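
To illustrate the difference on a single feature, here is a small standard-library sketch. It only mimics the idea of median/IQR scaling; the function names are hypothetical and it is not scikit-learn's RobustScaler.

```python
from statistics import mean, median, pstdev, quantiles


def standard_scale(values):
    """Center on the mean and divide by the standard deviation."""
    center, spread = mean(values), pstdev(values)
    return [(v - center) / spread for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; cannot scale")
    center = median(values)
    return [(v - center) / iqr for v in values]


values = [1, 2, 3, 4, 100]  # one large outlier
print(robust_scale(values))    # [-1.0, -0.5, 0.0, 0.5, 48.5]
print(standard_scale(values))  # the four inliers are squeezed close together
```

With the outlier present, the mean and standard deviation are dominated by the value 100, so the four ordinary values become nearly indistinguishable after standard scaling, while the median/IQR version keeps them evenly spread.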
enhancement sklearn-compatibility help wanted

opened by desilinguist 0

Releases(v3.1.0)

v3.1.0(Sep 14, 2022)
This is a new release with dependency updates, bugfixes, and improvements.

💥 Dependency Updates 💥

scikit-learn has been updated to v1.1.2. This could mean that the same SKLL experiments when run with SKLL 3.1.0 could yield different results. (Issue #713, PR #716 ).

🛠 Bugfixes & Improvements 🛠

SKLL Learners now support a new method get_feature_names_out() which returns the correct set of features actually used by the learner. Since some features might be removed by the feature selector, relying on the vectorizer vocabulary is not enough in those cases. This method allows easy access to the names of the actual features used, even if the selector has removed some features (Issue #714, PR #715).

Updated learning curve code to use the new API for seaborn v0.12.0 (PR #716)

Removed the Boston housing dataset from SKLL examples and tests. This dataset has ethical issues and is being removed from scikit-learn. (Issue #700, #717)
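
The idea behind get_feature_names_out() mentioned above can be sketched as follows: take the vectorizer's vocabulary and keep only the columns the feature selector retained. This is an illustration with hypothetical inputs (a name-to-column dictionary and a boolean mask), not the actual SKLL implementation.

```python
def feature_names_out(vocabulary, support_mask):
    """Return the names of the features kept by a selector.

    vocabulary: dict mapping feature name -> column index
    support_mask: list of booleans, one per column (True = kept)
    """
    names_by_index = sorted(vocabulary, key=vocabulary.get)
    if len(names_by_index) != len(support_mask):
        raise ValueError("mask length must match the number of features")
    return [name for name, keep in zip(names_by_index, support_mask) if keep]


vocabulary = {"age": 0, "fare": 1, "sex": 2}
print(feature_names_out(vocabulary, [True, False, True]))  # ['age', 'sex']
```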

✔️ Tests ✔️

Added new tests for Learner.get_feature_names_out(). (Issue #714, PR #715)

👩‍🔬 Contributors 👨‍🔬

(Note: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)

Sanjna Kashyap (@Frost45), Nitin Madnani (@desilinguist), Matt Mulholland (@mulhod), and Remo Nitschke (@remo-help).
v3.0(Dec 21, 2021)
This is a major new release with dependency updates and bugfixes!

⚡️ SKLL 3.0 is backwards incompatible with previous versions of SKLL and might yield different results compared to previous versions even with the same data and same settings. ⚡️

💥 Breaking Changes 💥

Python 3.7 is no longer officially supported while official support for Python 3.10 has been added (Issue #701, PR #711).

scikit-learn has been updated to v1.0.1 (Issue #699, PR #702).

The configuration field pos_label_str from the “Tuning” section has been renamed to pos_label. Older configuration files with pos_label_str will now raise an exception (Issue #569, PR #706).

The configuration field log from the “Output” section that was renamed to logs in SKLL v2.5 has now been completely deprecated. Older configuration files with log will now raise an exception (Issue #671, PR #705).

💡 New features 💡

SKLL now supports specifying custom seed values for cross-validation tasks. This option may be useful for running the same cross-validation experiment multiple times (with the same number of differently constituted folds) to get a sense of the variance across replicates (Issue #593, PR #707).
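
As a rough illustration of why a seed matters here, the sketch below builds folds from a seeded shuffle: the same seed reproduces the same folds, while a different seed gives the same number of folds with a different composition. The `make_folds` helper is hypothetical and is not how SKLL constructs its folds.

```python
import random


def make_folds(ids, n_folds, seed):
    """Shuffle the ids with a seeded generator and deal them into folds."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    # Fold i takes every n_folds-th id starting at position i.
    return [shuffled[i::n_folds] for i in range(n_folds)]


ids = list(range(20))
folds_a = make_folds(ids, n_folds=5, seed=123)
folds_b = make_folds(ids, n_folds=5, seed=123)
folds_c = make_folds(ids, n_folds=5, seed=456)
print(folds_a == folds_b, len(folds_c))
```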

🛠 Bugfixes & Improvements 🛠

Using the --drop-blanks option with filter_features now raises a more useful error for the case when every single row in a tabular feature file has a blank column (Issue #693, PR #703).

SKLL conda packages are again generic Python packages instead of platform-specific ones (Issue #710, PR #711).

📖 Documentation Updates 📖

Add a new section to the hands-on tutorial explaining how to first install SKLL in a virtual environment (Issue #689, PR #709).

Add missing link to SKLL repository in the tutorial data section (Issue #688, PR #691).

Update CONTRIBUTING.md to include more detailed instructions for pushing to the SKLL repository (Issue #680, PR #704).

Link to the RSMTool implementation of quadratic_weighted_kappa which supports continuous values and can be used as a custom metric in SKLL for both hyper-parameter tuning as well as validation. See the quadratic_weighted_kappa bullet under the objectives section (Issue #512, PR #704).

Continued readability improvements to function and method docstrings.

✔️ Tests ✔️

All tests now specify local=True when making run_configuration() calls. This ensures that tests always run in local mode and prevents an unnecessary check for the gridmap library. (Issue #616, PR #708).

👩‍🔬 Contributors 👨‍🔬

(Note: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)

Binod Gyawali (@bndgyawali), Robbie Imbrie (@RobertImbrie), Sanjna Kashyap (@Frost45), Sözen Ozkan Grigoras (@sozkangrigoras), Nitin Madnani (@desilinguist), Matt Mulholland (@mulhod), and Damien Xie (@damien2012eng).
v2.5(Feb 26, 2021)
This is a major new release with dozens of new features, bugfixes, and documentation updates!

⚡️ SKLL 2.5 is backwards incompatible with previous versions of SKLL and might yield different results compared to previous versions even with the same data and same settings. ⚡️

💥 Breaking Changes 💥

Python 3.6 is no longer officially supported since the latest versions of pandas and numpy have dropped support for it.

Older top-level imports have been removed and should now be rewritten as follows (Issue #661, PR #662):

from skll import Learner ➡️ from skll.learner import Learner

from skll import FeatureSet ➡️ from skll.data import FeatureSet

from skll import run_configuration ➡️ from skll.experiments import run_configuration

The default value for the class_labels keyword argument for Learner.predict() is now True instead of False. Therefore, for probabilistic classifiers, this method will now return class labels by default instead of class probabilities. To obtain class probabilities, set class_labels to False when calling this method (Issue #621, PR #622).

The filter_features script now offers more intuitive command line options. Input files must be specified using the -i/--input and output files must be specified using the -o/--output. Additionally, --inverse must now be used to invert the filtering command since -i is used for input files (Issue #598, PR #660).

The MegaMReader and MegaMWriter classes have been removed from SKLL since .megam files are no longer supported by SKLL (Issue #532, PR #557).

The param_grids option in the configuration file is now a list of dictionaries instead of a list of lists of dictionaries, one for each learner specified in the learners option. Correspondingly, the param_grid option in Learner.train() and Learner.cross_validate() is now a dictionary instead of a list of dictionaries, and the default parameter grids for each learner are also simply dictionaries (Issue #618, PR #619).

Running a learning_curve task via a configuration file now requires at least 500 examples. Fewer examples will raise a ValueError. This behavior can only be overridden when using Learner.learning_curve() directly via the API (Issue #624, PR #631).

💡 New features 💡

VotingClassifier and VotingRegressor from scikit-learn are now available for use in SKLL. This was done by adding a new VotingLearner class that uses Learner instances to represent underlying estimators (Issue #488, PR #665).

SKLL now supports custom, user-defined metrics for both hyperparameter tuning as well as evaluation (Issue #606, PR #612).

The following new built-in classification metrics are now available in SKLL: f05, f05_score_macro, f05_score_micro, f05_score_weighted, jaccard, jaccard_macro, jaccard_micro, jaccard_weighted, precision_macro, precision_micro, precision_weighted, recall_macro, recall_micro, and recall_weighted (Issues #609 and #610, PRs #607 and #612).

scikit-learn has been updated to 0.24.1 (Issue #653, PR #659).

🛠 Bugfixes & Improvements 🛠

Hyperparameter tuning now uses 5-fold cross-validation, instead of 3, to match the change in the default value of the cv parameter for GridSearchCV. This will marginally increase the time taken for experiments with grid search but should produce more reliable results (Issue #487, PR #667).

The SKLL codebase now uses sub-packages instead of very long modules which makes it easier to navigate and understand (Issue #600, PR #601).

The log configuration file option has been renamed to logs. Using log will still work but will raise a warning. The log option will be removed entirely in the next release (Issue #520, PR #670).

Learning curves are now correctly generated for probabilistic classifiers (Issue #648, PR #649).

Saving models in the current directory via Learner.save() no longer requires adding ./ to the path (Issue #572, PR #604).

The filter_features script no longer automatically assumes labels specified with -L or --label to be strings (Issue #598, PR #660).

Remove the create_label_dict keyword argument from Learner.train() since it did not need to be user-facing (Issue #565, PR #605).

Do not return 0 from correlation metrics when NaN is more appropriate. Doing this resulted in incorrect hyperparameter tuning results (Issue #585, PR #588).

The Learner._check_input_formatting() private method now works correctly for dense featuresets (Issue #656, PR #658).

SKLL conda packages are again platform-specific and the recipe now uses a conda_build_config.yaml to build the Python 3.7, 3.8, and 3.9 variants in one go (Issue #623, PR #XXX).

Several useful changes to the SKLL code style:

Standardize string concatenation (Issue #636, PR #645)

Use with context manager when opening files (Issue #641, PR #644)

Use f-strings where possible (Issue #633, PR #634)

Follow standard guidelines for sorting imports (Issue #638, PR #650)

Use pre-commit hooks to enforce code formatting guidelines during development (Issue #646, PR #650)
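
Regarding the correlation-metric fix listed above (Issue #585): when one of the inputs is constant, the Pearson correlation is undefined, and reporting 0 can make such a result look better than a genuinely negative correlation during tuning. The following is a plain-Python sketch of the intended behavior, not SKLL's metric code.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has zero variance."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Undefined rather than "no correlation".
        return math.nan
    return covariance / math.sqrt(var_x * var_y)


print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0
print(pearson([1, 2, 3], [5, 5, 5]))  # nan
```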

📖 Documentation Updates 📖

Update CONTRIBUTING.md with the new sub-package structure of the SKLL codebase (Issue #611, PR #628).

Add a section to the README that explains how to cite SKLL (Issue #599, PR #672).

Add Azure Pipelines badge to the README (Issue #608, PR #672).

Add explicit .readthedocs.yml file to configure the auto-built documentation (Issue #668, PR #672).

Make it clear that not specifying the predictions configuration file option leads to prediction files being output in the current directory (Issue #664, PR #672).

✔️ Tests ✔️

Reduce code duplication in tests (Issue #635, PR #642).

The Linux and Windows CI builds now use Python 3.7 and 3.8 respectively, instead of Python 3.6 (Issue #524, PR #665)

Both the Linux and Windows CI builds now use consistent nosetests commands (Issue #584, PR #665).

nose-cov is now automatically installed via conda_requirements.txt when setting up a development environment instead of requiring a separate step (Issue #527, PR #672).

Add comprehensive new tests for voting learners, custom metrics, new built-in metrics, as well as for new bugfixes.

Current code coverage for SKLL tests is at 97%, the highest it has ever been!

👩‍🔬 Contributors 👨‍🔬

(Note: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)

Aoife Cahill (@aoifecahill), Binod Gyawali (@bndgyawali), Nitin Madnani (@desilinguist), Matt Mulholland (@mulhod), Sree Harsha Ramesh (@srhrshr)
v2.1(Mar 13, 2020)
This is a minor release of SKLL with the only change being that it is now compatible with scikit-learn v0.22.2.

⚡️ There are several changes in scikit-learn v0.22 that might cause several estimators and functions to produce different results even when fit with the same data and parameters. Therefore, SKLL 2.1 can also yield different results compared to previous versions even with the same data and same settings. ⚡️

💡 New features 💡

scikit-learn updated to 0.22.2 (Issue #594, PR #595).

🔎 Other minor changes 🔎

Update imports to align with the new scikit-learn API.

A minor bugfix in logutils.py.

Update some test outputs due to changes in scikit-learn models and functions.

Update some tests to make pre-release testing for conda and PyPI packages possible.

👩‍🔬 Contributors 👨‍🔬

(Note: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)

Aoife Cahill (@aoifecahill), Binod Gyawali (@bndgyawali), Matt Mulholland (@mulhod), Nitin Madnani (@desilinguist), and Mengxuan Zhao (@chaomenghsuan).
v2.0(Oct 24, 2019)
This is a major new release. It's probably the largest SKLL release we have ever done since SKLL 1.0 came out! It includes dozens of new features, bugfixes, and documentation updates!

⚡️ SKLL 2.0 is backwards incompatible with previous versions of SKLL and might yield different results compared to previous versions even with the same data and same settings. ⚡️

💥 Incompatible Changes 💥

Python 2.7 is no longer supported since the underlying version of scikit-learn no longer supports it (Issue #497, PR #506).

Configuration field objective has been deprecated and replaced with objectives which allows specifying multiple tuning objectives for grid search (Issue #381, PR #458).

Grid search is now enabled by default in both the API as well as while using a configuration file (Issue #463, PR #465).

The Predictor class previously provided by the generate_predictions utility script is no longer available. If you were relying on this class, you should just load the model file and call Learner.predict() instead (Issue #562, PR #566).

There are no longer any default grid search objectives since the choice of objective is best left to the user. Note that since grid search is enabled by default, you must either choose an objective or explicitly disable grid search (Issue #381, PR #458).

mean_squared_error is no longer supported as a metric. Use neg_mean_squared_error instead (Issue #382, PR #470).

The cv_folds_file configuration file field is now just called folds_file (Issue #382, PR #470).

Running an experiment with the learning_curve task now requires specifying metrics in the Output section instead of objectives in the Tuning section (Issue #382, PR #470).

Previously when reading in CSV/TSV files, missing data was automatically imputed as zeros. This is not appropriate in all cases. This is no longer the case and blanks are retained as is. Missing values will need to be explicitly dropped or replaced (see below) before using the file with SKLL (Issue #364, PRs #475 & #518).

pandas and seaborn are now direct dependencies of SKLL, and not optional (Issues #455 & #364, PRs #475 & #508).
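To make the interaction between the new grid search default and the missing default objective concrete, here is a minimal sketch using Python's standard configparser. The helper name check_tuning_section is hypothetical and not part of SKLL; it only mirrors the rule described above, namely that a configuration must either name at least one objective or turn grid search off. The field names grid_search and objectives are the ones used in the Tuning section of a SKLL configuration file.

```python
import ast
import configparser


def check_tuning_section(config_text):
    """Hypothetical helper (not part of SKLL): apply the 2.0 rule that grid
    search is on by default and therefore needs an explicit objective."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)

    # Grid search is enabled unless the configuration turns it off.
    grid_search = parser.getboolean("Tuning", "grid_search", fallback=True)

    # There is no default objective any more, so a missing field means "none".
    objectives = ast.literal_eval(
        parser.get("Tuning", "objectives", fallback="[]")
    )

    if grid_search and not objectives:
        raise ValueError(
            "grid search is enabled but no objectives were given; "
            "set objectives or use grid_search = false"
        )
    return grid_search, objectives


example = """
[Tuning]
grid_search = true
objectives = ['accuracy', 'f1_score_micro']
"""
print(check_tuning_section(example))
```

A configuration with no Tuning section at all fails this check, because grid search is then assumed to be on while no objective is available.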

💡 New features 💡

CSVReader/CSVWriter & TSVReader/TSVWriter now use pandas as the backend rather than custom code that relied on the csv module. This leads to significant speedups, especially for very large files (~5x for reading and ~10x for writing)! The speedup comes at the cost of a moderate increase in memory consumption. See detailed benchmarks here (Issue #364, PRs #475 & #518).

SKLL models now have a new pipeline attribute which makes it easy to manipulate and use them in scikit-learn, if needed (Issue #451, PR #474).

scikit-learn updated to 0.21.3 (Issue #457, PR #559).

The SKLL conda package is now a generic Python package which means the same package works on all platforms and on all Python versions >= 3.6. This package is hosted on the new, public ETS anaconda channel.

SKLL learner hyperparameters have been updated to match the new scikit-learn defaults and those upcoming in 0.22.0 (Issue #438, PR #533).

Intermediate results for the grid search process are now available in the results.json files (Issue #431, #471).

The K models trained for each split of a K-fold cross-validation experiment can now be saved to disk (Issue #501, PR #505).

Missing values in CSV/TSV files can be dropped/replaced both via the command line and the API (Issue #540, PR #542).

Warnings from scikit-learn are now captured in SKLL log files (Issue #441, PR #480).

Learner.model_params() and, consequently, the print_model_weights utility script now work with models trained on hashed features (Issue #444, PR #466).

The print_model_weights utility script can now output feature weights sorted by class labels to improve readability (Issue #442, PR #468).

The skll_convert utility script can now convert feature files that do not contain labels (Issue #426, PR #453).
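Since blanks in CSV/TSV files are no longer imputed as zeros, it may help to see the two strategies mentioned above, dropping and replacing, side by side. The sketch below uses only the standard csv module and does not show SKLL's own command-line or API options, whose names are not spelled out in these notes. The function names and the Age column are hypothetical; PassengerId and Survived are borrowed from the Titanic example configuration.

```python
import csv
import io


def drop_rows_with_blanks(rows):
    """Keep only the rows in which every field has a non-empty value."""
    return [row for row in rows if all(value.strip() for value in row.values())]


def replace_blanks(rows, replacement="0"):
    """Return copies of the rows with empty fields set to `replacement`."""
    return [
        {key: (value if value.strip() else replacement) for key, value in row.items()}
        for row in rows
    ]


# A tiny in-memory CSV file; the second data row is missing its Age value.
raw = "PassengerId,Age,Survived\n1,22,0\n2,,1\n"
rows = list(csv.DictReader(io.StringIO(raw)))

dropped = drop_rows_with_blanks(rows)  # only passenger 1 remains
filled = replace_blanks(rows)          # passenger 2 gets Age "0"
```

Replacing blanks with zero reproduces the old implicit behavior, which is exactly the choice that now has to be made deliberately.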

🛠 Bugfixes & Improvements 🛠

Fix several bugs in how various tuning objectives and output metrics were computed (Issues #545 & #548, PR #551).

Fix how pos_label_str is documented, read in, and used for classification tasks (Issues #550 & #570, PRs #566 & #571).

Fix several bugs in the generate_predictions utility script and streamline its implementation to not rely on an externally specified positive label or index but rather read it from the model file or infer it (Issues #484 & #562, PR #566).

Fix bug due to overlap between tuning objectives and metrics that could prevent metric computation (Issue #564, PR #567).

Using an externally specified folds_file for grid search now works for evaluate and predict tasks, not just train (Issue #536, PR #538).

Fix incorrect application of sampling before feature scaling in Learner.predict() (Issue #472, PR #474).

Disable feature sampling for MultinomialNB learner since it cannot handle negative values (Issue #473, PR #474).

Add missing logger attribute to Learner.FilteredLeaveOneGroupOut (Issue #541, PR #543).

Fix FeatureSet.has_labels to recognize list of None objects which is what happens when you read in an unlabeled data set and pass label_col=None (Issue #426, PR #453).

Fix bug in ARFFWriter that adds/removes label_col from the field names even if it's None to begin with (Issue #452, PR #453).

Do not produce unnecessary warnings for learning curves (Issue #410, PR #458).

Show a warning when applying feature hashing to multiple feature files (Issue #461, PR #479).

Fix loading issue for saved MultinomialNB models (Issue #573, PR #574).

Reduce memory usage for learning curve experiments by explicitly closing matplotlib figure instances after they are saved.

Improve SKLL’s cross-platform operation by explicitly reading and writing files as UTF-8 in readers and writers and by using the newline parameter when writing files.
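The cross-platform point above comes down to two arguments of Python's built-in open(). The following is a minimal sketch of that pattern with the standard csv module, not SKLL's actual reader or writer code; the file name and the sample rows are made up for illustration.

```python
import csv
import os
import tempfile

rows = [["id", "text"], ["1", "naïve café"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.csv")

    # newline="" lets the csv module control line endings itself, so the
    # file contents do not depend on the operating system.
    with open(path, "w", encoding="utf-8", newline="") as out_file:
        csv.writer(out_file).writerows(rows)

    # Read back with the same explicit encoding instead of the platform default.
    with open(path, "r", encoding="utf-8", newline="") as in_file:
        round_tripped = list(csv.reader(in_file))
```

Without the explicit encoding, the non-ASCII characters in the second row could be written or read with a locale-dependent codec, which is a common source of differences between Windows and Linux runs.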

📖 Documentation Updates 📖

Reorganize documentation to explicitly document all types of output files and link them to the corresponding configuration fields in the Output section (Issue #459, PR #568).

Add new interactive tutorial that uses a Jupyter notebook hosted on binder (Issue #448, PRs #547 & #552).

Add a new page to official documentation explaining how the SKLL code is organized for new developers (Issue #511, PR #519).

Update SKLL contribution guidelines and link to them from official documentation (Issues #498 & #514, PRs #503 & #519).

Update documentation to indicate that pandas and seaborn are now direct dependencies and not optional (Issue #553, PR #563).

Update LogisticRegression learner documentation to talk explicitly about penalties and solvers (Issue #490, PR #500).

Properly document the internal conversion of string labels to ints/floats and possible edge cases (Issue #436, PR #476).

Add feature scaling to Boston regression example (Issue #469, PR #478).

Several other additions/updates to documentation (Issue #459, PR #568).

✔️ Tests ✔️

Make tests into a package so that we can do something like from skll.tests.utils import X etc. (Issue #530, PR #531).

Add new tests based on SKLL examples so that we would know if examples ever break with any SKLL updates (Issues #529 & #544, PR #546).

Tweak tests to make test suite runnable on Windows (and pass!).

Add Azure Pipelines integration for automated test builds on Windows.

Added several new comprehensive tests for all new features and bugfixes. Also, removed older, unnecessary tests. See various PRs above for details.

Current code coverage for SKLL tests is at 95%, the highest it has ever been!

🔍 Other changes 🔍

Replace prettytable with the more actively maintained tabulate (Issue #356, PR #467).

Make sure entire codebase complies with PEP8 (Issue #460, PR #568).

Update the year to 2019 everywhere (Issue #447, PRs #456 & #568).

Update TravisCI configuration to use conda_requirements.txt for building environment (PR #515).

👩‍🔬 Contributors 👨‍🔬

(Note: This list is sorted alphabetically by last name and not by the quality/quantity of contributions to this release.)

Supreeth Baliga (@SupreethBaliga), Jeremy Biggs (@jbiggsets), Aoife Cahill (@aoifecahill), Ananya Ganesh (@ananyaganesh), R. Gokul (@rgokul), Binod Gyawali (@bndgyawali), Nitin Madnani (@desilinguist), Matt Mulholland (@mulhod), Robert Pugh (@Lguyogiro), Maxwell Schwartz (@maxwell-schwartz), Eugene Tsuprun (@etsuprun), Avijit Vajpayee (@AVajpayeeJr), Mengxuan Zhao (@chaomenghsuan)
v1.5.3(Dec 14, 2018)
This is a minor release of SKLL with the most notable change being compatibility with the latest version of scikit-learn (v0.20.1).

What's new

SKLL is now compatible with scikit-learn v0.20.1 (Issue #432, PR #439).

GradientBoostingClassifier and GradientBoostingRegressor now accept sparse matrices as input (Issue #428, PR #429).

The model_params property now works for SVC learners with a linear kernel (Issue #425, PR #443).

Improved documentation (Issue #423, PR #437).

Update generate_predictions to output the probabilities for all classes instead of just the first class (Issue #430, PR #433). Note: this change breaks backward compatibility with previous SKLL versions since the output file now always includes a column header.

Bugfixes

Fixed broken links in documentation (Issues #421 and #422, PR #437).

Fixed data type conversion in NDJWriter (Issue #416, PR #440).

Properly handle the possible combinations of trained model and prediction set vectorizers in Learner.predict (Issue #414, PR #445).

Other changes

Make the tests for MLPClassifier and MLPRegressor go faster (by turning off grid search) to prevent Travis CI from timing out (issue #434, PR #435).

v1.5.2(Apr 12, 2018)

This is a hot fix release that addresses a single issue.

Learner instances created via from_file() method did not get loggers associated with them. This meant that any and all warnings generated for such learner instances would have led to AttributeError exceptions.
v1.5.1(Jan 31, 2018)
This is primarily a bug fix release.

Bugfixes

Generate the "folds_file" warnings only when "folds_file" is specified (issue #404, PR #405).

Modify Learner.save() to deal properly with reading in and re-saving older models (issue #406, PR #407).

Fix regression that caused the output directories to not be automatically created (issue #408, PR #409).

v1.5(Dec 14, 2017)
This is a major new release of SKLL.

What's new

Several new scikit-learn learners included along with reasonable default parameter grids for tuning, where appropriate (issues #256 & #375, PR #377).

BayesianRidge

DummyRegressor

HuberRegressor

Lars

MLPRegressor

RANSACRegressor

TheilSenRegressor

DummyClassifier

MLPClassifier

RidgeClassifier

Allow computing any number of additional evaluation metrics in addition to the tuning objective (issue #350, PR #384).

Rename cv_folds_file configuration option to folds_file. The former is still supported with a deprecation warning but will be removed in the next release (PR #367).

Add a new configuration option use_folds_file_for_grid_search which controls whether the inner-loop grid-search in a cross-validation experiment with a custom folds file also uses the folds from the file. It's set to True by default. Setting it to False means that the inner loop uses regular 3-fold cross-validation and ignores the file (PR #367).

Also add a keyword argument called use_custom_folds_for_grid_search to the Learner.cross_validate() method (PR #367).

Learning curves can now be plotted from existing summary files using the new plot_learning_curves command line utility (issue #346, PR #396).

Overhaul logging in SKLL. All messages are now logged both to the console (if running interactively) and to log files. Read more about the SKLL log files in the Output Files section of the documentation (issue #369, PR #380).

neg_log_loss is now available as an objective function for classification (issue #327, PR #392).

Changes

SKLL now supports Python 3.6. Although Python 3.4 and 3.5 will still work, 3.6 is now the officially supported Python 3 version. Python 2.7 is still supported (issue #355, PR #360).

The required version of scikit-learn has been bumped up to 0.19.1 (issue #328, PR #330).

The learning curve y-limits are now computed a bit more intelligently (issue #389, PR #390).

Raise a warning if ablation flag is used for an experiment that uses train_file/test_file - this is not supported (issue #313, PR #392).

Raise a warning if both fixed_parameters and param_grids are specified (issue #185, PR #297).

Disable grid search if no default parameter grids are available in SKLL and the user doesn't provide parameter grids either (issue #376, PR #378).

SKLL has a copy of scikit-learn's DictVectorizer because it needs some custom functionality. Most (but not all) of our modifications have now been merged into scikit-learn so our custom version is now significantly condensed down to just a single method (issue #263, PR #374).

Improved outputs for cross-validation tasks (issues #349 & #371, PRs #365 & #372)

When a folds file is specified, the log no longer shows the full folds dictionary.

Show number of cross-validation folds in results to be via folds file if a folds file is specified.

Show grid search folds in results to be via folds file if the grid search ends up using the folds file.

Do not show the stratified folds information in results when a folds file is specified.

Show the value of use_folds_file_for_grid_search in results when appropriate.

Show grid search related information in results only when we are actually doing grid search.

The Travis CI plan was broken up into multiple jobs in order to get around the 50 minute limit (issue #385, PR #387).

For the conda package, some of the dependencies are now sourced from the conda-forge channel.

Bugfixes

Fix the bug that was causing the inner grid-search loop of a cross-validation experiment to use a single job instead of the number specified via grid_search_jobs (issue #363, PR #367).

Fix unbound variable in readers.py (issue #340, PR #392).

Fix bug when running a learning curve experiment via gridmap (issue #386, PR #390).

Fix a mismatch between the default number of grid search folds and the default number of slots requested via gridmap (issue #342, PR #367).

Documentation

Update documentation and tests for all of the above changes and new features.

Update tutorial and installation instructions (issues #383 and #394, PR #399).

Standardize all of the function and method docstrings to be NumPy style. Add docstrings where missing (issue #373, PR #397).

v1.3(Feb 13, 2017)
This is a major new release of SKLL.

New features

You can now generate learning curves for multiple learners, multiple feature sets, and multiple objectives in a single experiment by using task=learning_curve in the configuration file. See documentation for more details (issue #221, PR #332).

Changes

The required version of scikit-learn has been bumped up to 0.18.1 (issue #328, PR #330).

SKLL now uses the MKL backend on macOS/Linux instead of OpenBLAS when used as a conda package.

Bugfixes

Fix deprecation warning when using Learner.model_params() (issue #325, PR #329).

Update the definitions of SKLL F1 metrics as a result of scikit-learn upgrade (issue #325, PR #330).

Bring documentation for SVC parameter grids up to date with the code (issue #334, PR #337).

Update documentation to make it clear that the SKLL conda package is only available for Python 3.4. For other Python versions, users should use pip.

v1.2.1(May 20, 2016)
This is primarily a bug fix release but also adds a major new API feature.

New API Feature:

If you use the SKLL API, you can now create FeatureSet instances directly from pandas data frames (issue #261, PR #292).

Bugfixes:

Correctly parse floats in scientific notation, e.g., when specifying parameter grids and/or fixed parameters (issue #318, PR #320)

print_model_weights now correctly handles models trained with fit_intercept=False (issue #322, PR #323).

v1.2(Feb 24, 2016)
This release includes major changes as well as a number of bugfixes.

Changes:

The required version of scikit-learn has been bumped up to 0.17.1 (issue #273, PRs #288 and #308)

You can now optionally save cross-validation folds to a file for later analysis (issue #259, PR #262)

Update documentation to be clear about when two FeatureSet instances are deemed equal (issue #272, PR #294)

You can now specify multiple objective functions for parameter tuning (issue #115, PR #291)

Bugfixes:

Use a fixed random state when doing non-stratified k-fold cross-validation (issue #247, PR #286)

Fix errors when reusing relative paths in output section (issue #252, PR #287)

print_model_weights now works correctly for multi-class logistic regression models (issue #274, PR #267)

Correctly raise an IOError if the config file is not correctly specified (issue #275, PR #281)

The evaluate task does not crash when the test data has labels that were not seen in training data (issue #279, PR #290)

The fit() method for rescaled versions of learners now works correctly when not doing grid search (issue #304, PR #306)

Fix minor typos in the documentation and tutorial.

v1.1.1(Oct 23, 2015)
This is a minor bugfix release. It fixes:

Issue where a FileExistsError would be raised when processing many configs (PR #260)

Instance of cv_folds instead of num_cv_folds in the documentation (PR #248).

Crash with print_model_weights and Logistic Regression models without intercepts (issue #250, PR #251)

Division by zero error when there was only one example (issue #253, PR #254)

v1.1.0(Jul 20, 2015)
The biggest changes in this release are that the required version of scikit-learn has been bumped up to 0.16.1 and config file parsing is much more robust and gives much better error messages when users make mistakes.

Implemented enhancements

Base estimators other than the defaults are now supported for AdaBoost classifiers and regressors (#238)

User can now specify number of cross-validation folds to use in the config file (#222)

Decision Trees and Random Forests no longer need dense inputs (#207)

Stratification during cross-validation is now optional (#160)

Fixed bugs

Bug when checking if hasher_features is a valid option (#234)

Invalid/missing/duplicate options in configuration are now detected (#223)

Stop modifying global numpy random seed (#220)

Relative paths specified in the config file are now relative to the config file location instead of to the current directory (#213)
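The relative-path fix above can be pictured with a few lines of pathlib. This is only a sketch of the resolution rule, not SKLL's implementation; the helper name resolve_config_path and the example paths are hypothetical.

```python
from pathlib import Path


def resolve_config_path(config_file, value):
    """Hypothetical helper: interpret `value` relative to the directory that
    contains `config_file`, unless `value` is already an absolute path."""
    value_path = Path(value)
    if value_path.is_absolute():
        return value_path
    return Path(config_file).parent / value_path


# "train" is looked up next to the configuration file, no matter which
# directory the experiment is launched from.
resolved = resolve_config_path("experiments/titanic/evaluate.cfg", "train")
```

With this rule, the same configuration file yields the same input and output locations regardless of the current working directory.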

Closed issues

Incompatibility with the latest version of scikit-learn (v0.16.1) (#235, #241, #233)

Learner.model_params will return weights with the wrong sign if sklearn is fixed (#111)

Merged pull requests

Overhaul configuration file parsing (@desilinguist, #246)

Several minor bugfixes (@desilinguist, #245)

Compatibility with scikit-learn v0.16.1 (@desilinguist, #243)

Expose cv_folds and stratified (@aoifecahill, #240)

Adding Report tests (@brianray, #237)

v1.0.1(Feb 20, 2015)
This is a fairly minor bugfix release. Changes include:

Update links in README.

Fix crash when trying to run experiments with integer labels (Issue #225, PR #219)

Update documentation about ablation to note that there will always be a run with all features (Issue #224, PR #226)

Update documentation about format of cv_folds_file (Issue #225, PR #228)

Remove duplicate words in documentation (PR #218)

Fixed KeyError when trying to build conda recipe.

Update outdated parameter grids in run_experiment documentation (commit 80d78e4)

v1.0.0(Nov 23, 2014)
The 1.0 release is finally here! It's been a little over a year since our first public release, and we're ready to say that SKLL is 1.0. Read our massive release notes:

⚠️ We did make some API- and config-file-breaking changes. They are listed at the end of the release notes. They should all be addressable by a quick find-and-replace.

Bug fixes

Fixed path problems in iris example (issue #103, PR #171)

Fixed bug where ablated_features field was incorrect when config file contained multiple feature sets (issue #125)

Fixed bug where CV would crash with rare classes (issue #109, PR #165)

Fixed issue where warning about extremely large feature values was being issued before rescaling

Fixed issue where some warning messages used mix of new-style and old-style replacement strings with old-style formatting.

Fixed a number of bugs with filtering FeatureSet objects and writing filtered sets to files.

Fixed bug in FeatureSet.__sub__ where feature names were being passed instead of indices.

Fixed issue where MegaMWriter could not print numbers in Python 2.7.

New features

SKLL releases are now for specific versions of scikit-learn. 1.0.0 requires scikit-learn 0.15.2 (issue #138, PR #170)

Added tutorial to documentation that walks new users through using SKLL in much the same way as our PyData talks (issue #153).

Added support for custom learners (issue #92, PR #183)

Added two command-line utilities, join_features and filter_features, for joining and filtering feature files. These replace join_megam and filter_megam (issue #79, PR #198)

Added support for specifying the field in ARFF, CSV, or TSV files that contains the IDs for each instance (issue #204, PR #206)

Added train/test set sizes to result files (issue #150, PR #161)

Added intercept to print_model_weights output (issue #155, PR #163)

Added total time and end time-stamp to experiment results (issue #91, PR #167)

Added exception when featureset_name is longer than 210 characters (issue #121, PR #168)

Added regression example data, boston (issue #162)

Added ability to specify number of grid search folds (issue #122, PR #175)

Added warning message when the number of features in the trained model differs from the number in the FeatureSet passed to Learner.predict() (issue #145)

Added conda.yaml file to repository to make conda package creation simpler (issue #159, PR #173)

Added loads more unit tests, greatly increased unit test coverage, and generally cleaned up test modules (issues #97, #148, #157, #188, and #202; PRs #176, #184, #196, #203, and #205)

Added train_file and test_file fields to config files, which can be used to specify single file feature sets. This greatly simplifies running simple experiments (issue #12, PR #197)

Added support for merging feature sets with IDs in different orders (issue #149, PR #177)

Added ValueError when invalid tuning objective is specified (issues #117 and #179; PRs #174 and #181)

Added shuffle option to config files to decide whether training data should be shuffled before training. By default this is False, but if grid_search is True, we will automatically shuffle. Previously, the default was True, and there was no option in the config files. (issue #189, PR #190)

Updated documentation to indicate that we're using StratifiedKFold (issue #160)

Added FeatureSet.__eq__ and FeatureSet.__getitem__ methods.

Minor changes without issues

Overhauled and cleaned up all documentation. Look how pretty it is!

Updated docstrings all over the place to be more accurate.

Updated generate_predictions to use new Reader API.

Added argv optional argument to all utility script main functions to simplify testing.

Added mock tests, so SKLL now requires mock to work with Python 2.7.

Added prettier SVG badges to README.

Added link to Data Science at the Command Line to README.

LibSVMReader now converts the UTF-8 replacement characters that LibSVMWriter uses when a feature name contains an =, |, #, :, or space back to the original ASCII characters.

⚠️ API breaking changes ⚠️

FeatureSetWriter → Writer

load_examples(path) → Reader.for_path(path).read()

write_feature_file(...) → Writer.for_path(FeatureSet(...)).write()

FeatureSet.classes → FeatureSet.labels

All other instances of word "classes" changed to "labels" (#166)

FeatureSet.feat_vectorizer → FeatureSet.vectorizer

run_ablation(all_combos=True) → run_configuration(ablation=None)

run_ablation() → run_configuration(ablation=1)

ExamplesTuple(ids, classes, features, vectorizer) → FeatureSet(name, ids, classes, features, vectorizer)

Removed feature_hasher argument to all Learner methods, because it's unnecessary

Learner.model_type is now the actual type of the underlying model instead of just a string.

FeatureSet.__len__ now returns the number of examples instead of the number of features.

Removed skll.learner._REGRESSION_MODELS and now we check for regression by seeing if model is subclass of RegressorMixin.

⚠️ Config file breaking changes ⚠️

Removed all short names for learners (PR #199)

Can no longer use classifiers instead of learners

train_location → train_directory

test_location → test_directory

cv_folds_location → cv_folds_file
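The config file renames listed above lend themselves to the quick find-and-replace mentioned at the top of these release notes. Below is a minimal sketch of such a migration with Python's re module. The function name migrate_config_text is hypothetical, the mapping assumes that test_location is meant to become test_directory, and the removal of short learner names is not handled.

```python
import re

# Old field name -> new field name, taken from the list above. The
# test_location entry assumes the intended target is test_directory.
RENAMED_FIELDS = {
    "classifiers": "learners",
    "train_location": "train_directory",
    "test_location": "test_directory",
    "cv_folds_location": "cv_folds_file",
}

# Match an old field name only at the start of a line, followed by = or :.
FIELD_PATTERN = re.compile(
    r"^(" + "|".join(map(re.escape, RENAMED_FIELDS)) + r")(\s*[=:])",
    flags=re.MULTILINE,
)


def migrate_config_text(text):
    """Rename pre-1.0 field names while leaving values and comments alone."""
    return FIELD_PATTERN.sub(
        lambda match: RENAMED_FIELDS[match.group(1)] + match.group(2), text
    )


old_config = (
    "[Input]\n"
    "# train_location used to be the field name\n"
    "train_location = train\n"
    "test_location = dev\n"
    "classifiers = ['SVC']\n"
)
new_config = migrate_config_text(old_config)
```

Anchoring the pattern to the start of a line keeps comments and values that happen to contain an old field name unchanged.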

v0.28.1(Nov 1, 2014)

Bug fix release that fixes an issue where python setup.py install would not work because the skll.data package wasn't included in the list of packages.
v0.28.0(Oct 10, 2014)
This release has some big behind-the-scenes changes. First, we split the data.py module up into a sub-package (#147). There is also a new FeatureSet class that replaces the old namedtuple-based ExamplesTuple (#81), so ExamplesTuple is now deprecated and will be removed in SKLL 1.0.0.

Speaking of which, we're having an all-day SKLL sprint on October 17th where we hope to resolve all the remaining issues preventing the 1.0 release.

Other changes include:

Fixed a bunch of minor problems with loading/writing LibSVM files

Added file reading/writing progress indicators

Fixed crash with generate_predictions when the model was not trained with probability set to True (#144).

Deprecated write_feature_file function in favor of using a FeatureSetWriter object.

Deprecated load_examples function in favor of using a Reader object.

Temporarily added replacement version of scikit-learn DictVectorizer class until scikit-learn/scikit-learn#3683 version is included in a release. This allows us to make file loading substantially more memory efficient.

v0.27.0(Aug 13, 2014)
The main new feature in this release is that .libsvm files are now fully supported by skll_convert and run_experiment. Because of this change, we've removed megam_to_libsvm.

Other changes include:

Integer keys are now allowed in fixed_parameters and param_grids (#134). Therefore, SKLL now requires PyYAML to function properly.

Added documentation about using class_weights to manage imbalanced datasets (#132)

Added information about pre-specified folds (via cv_folds_location) to results JSON and plain-text files. (#108)

Added warning when encountering classes that are not in class_map. (#114)

Fixed issue where sampler random_state parameter would be overridden.

Fixed license headers in CLI package. They were still GPL for some reason.

Fixed issue #112 by switching to joblib.pool.MemmappingPool for handling parallel file loading. SKLL now requires joblib 0.8 to function properly.

Fixed issue #104 by making result formatting more consistent.

compute_eval_from_predictions now supports string-valued classes, as it should have. (#135)

We now raise an exception instead of allowing you to overwrite your results by including the same learner in the learners list in your config file twice (#140).

Fixed warning about files being left open in Python 3.4 (by not leaving them open anymore).

Short names for learners have been deprecated and will be removed in SKLL 1.0.

v0.26.0(Jul 11, 2014)
Added AdaBoost and KNeighbors classifiers and regressors (finally closing #7).

Added support for kernel approximation samplers. (Thanks @nineil)

All linear models are now supported by print_model_weights (issue #119).

Added f1_score_weighted metric so that weighted F1 will be calculated even for binary classification tasks.

Modified f1_score_micro and f1_score_macro to also always return average for binary classification tasks (instead of previous behavior where only performance on positive class was returned).
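To illustrate why the averaged F1 variants differ from the positive-class F1 that was previously returned for binary tasks, here is a small pure-Python sketch. It is not SKLL's implementation (which builds on scikit-learn); the function names and the toy labels are made up for illustration, and class 1 is assumed to be the positive class.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} for every label seen in either list."""
    support = Counter(y_true)
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        scores[label] = (2 * tp / denominator if denominator else 0.0, support[label])
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by class support."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f1 * n for f1, n in scores.values()) / sum(n for _, n in scores.values())


def micro_f1(y_true, y_pred):
    """F1 from pooled counts; each error is one false positive and one false negative."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    errors = len(y_true) - correct
    return 2 * correct / (2 * correct + 2 * errors)


y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
positive_only = per_class_f1(y_true, y_pred)[1][0]  # 0.8 for class 1 alone
```

On this toy data the positive-class F1 is 0.8, while the micro, macro, and weighted averages are 0.75, about 0.733, and about 0.767 respectively, so the choice of variant changes the reported number even for a binary task.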

v0.25.0(Jul 1, 2014)
This release includes a long-standing request being finally fulfilled (part of #7). We now support Stochastic Gradient Descent!

Full changelog:

Added support for SGDClassifier and SGDRegressor

Added option to use FeatureHasher instead of DictVectorizer to make learning with feature sets that have millions of features possible.

Minor documentation fix for generate_predictions.

All the credit for this release goes to @nineil. Thanks Nils!
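The reason a feature hasher copes with millions of features is that it never stores a feature-name vocabulary. The sketch below shows the basic hashing trick in plain Python. It is an illustration only and differs from scikit-learn's FeatureHasher, which uses a signed MurmurHash3 rather than the MD5 digest assumed here; the function name and bucket count are hypothetical.

```python
import hashlib


def hash_features(features, n_buckets=8):
    """Map a {name: value} dict onto a fixed-length list via the hashing trick."""
    vector = [0.0] * n_buckets
    for name, value in features.items():
        # A stable hash of the feature name picks the column; no vocabulary
        # has to be kept in memory.
        digest = hashlib.md5(name.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_buckets
        # Names that collide simply share a column.
        vector[index] += value
    return vector


example_vector = hash_features({"word=ship": 1.0, "word=sea": 2.0, "age": 22.0})
```

The output length depends only on n_buckets, so memory use stays fixed however many distinct feature names appear. The price is that colliding features can no longer be told apart.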
v0.24.0(Jun 4, 2014)
Added compute_eval_from_predictions utility for computing evaluation metrics after experiments have been run.

Made rounding in the Python 2 kappa code consistent with Python 3 by using banker's rounding.

Added support for printing model weights for linear SVR (#110)

Made print_model_weights only print all negative or all positive weights (#105)

Little PEP8 and documentation tweaks.
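For readers unfamiliar with the rounding change above: banker's rounding sends exact halves to the nearest even integer, whereas Python 2's round() sent them away from zero. The short sketch below contrasts the two for positive halves using Python 3's built-in round() and the decimal module; it is independent of the kappa code itself.

```python
from decimal import Decimal, ROUND_HALF_UP

halves = [0.5, 1.5, 2.5, 3.5]

# Python 3's built-in round() uses banker's rounding (round half to even).
bankers = [round(x) for x in halves]

# Round half away from zero, which is how Python 2's round() treated halves.
half_up = [
    int(Decimal(str(x)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
    for x in halves
]

print(bankers)  # [0, 2, 2, 4]
print(half_up)  # [1, 2, 3, 4]
```

Any metric that rounds scores to integers before comparing them can therefore give different results under the two schemes whenever a score falls exactly on a half.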

v0.23.1(Jan 10, 2014)

Fixed issue where some models would be different depending on order of feature files specified in config file (#101).
v0.23.0(Jan 3, 2014)
Add --resume option to run_experiment for resuming large experiments in the event of a crash.

Fix issue where grid_scores was undefined when using --keep-models.

Automatically generated feature set names now have sorted features to ensure they will always be generated in the same fashion.

v0.22.5(Dec 10, 2013)

Fixed a bug where command-line scripts didn't work after previous release. (This should hopefully be the last of these rapid fire releases. We will add unit tests for these in the future.)
v0.22.4(Dec 9, 2013)

Fix missing import sys in run_experiment.py
v0.22.3(Dec 9, 2013)
Very minor bug fix release. Changes are:

main functions for all utility scripts now take optional argument lists to make unit testing simpler (and not require subprocesses).

Fix another bug that was causing missing "ablated features" lists in summary files.
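The optional argument list mentioned for the utility scripts is a common argparse pattern, sketched below. The script and its arguments are hypothetical and do not correspond to any particular SKLL utility.

```python
import argparse


def main(argv=None):
    """Parse `argv` if given; with argv=None argparse reads sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Hypothetical utility script.")
    parser.add_argument("config_file")
    parser.add_argument("--verbose", action="store_true")
    return parser.parse_args(argv)


# A unit test can call main() directly, without spawning a subprocess.
args = main(["experiment.cfg", "--verbose"])
```

Because the same function serves both the command line and the tests, a test failure points directly at the script's own code rather than at process handling.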

v0.22.2(Dec 5, 2013)

Fix crash with filter_megam and join_megam due to references to old API.
v0.22.1(Dec 5, 2013)
Minor bug fix release. Changes are:

Switch to joblib.dump and joblib.load for serialization (should fix #94)

Switch to using official drmaa-python release now that it's updated on PyPI

Fix issue where training examples were being loaded for pre-trained models (#95)

Change to using entry_points to generate scripts instead of scripts in setup.py, and utilities are now in a sub-package.

v0.22.0(Dec 2, 2013)
This release features mostly bug fixes, but also includes a few minor features:

Change license to BSD 3 clause. Now any of our code could be added back into scikit-learn without licensing issues.

Add gamma to default parameter search grid for SVC (#84).

Add --verbose flag to run_experiment to simplify debugging.

Add support for wheel packaging.

Fixed bug in _write_summary_file that prevented writing of summary files for --ablation_all experiments.

Fixed SVR kernel string type issue (#87).

Fixed fit_intercept default value issue (#88).

Fixed incorrect error message (#86)

Tweaked .travis.yml to make builds a little faster.
