A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

CatBoost

Last update: Jan 5, 2023

Related tags

Machine Learning python data-science machine-learning data-mining tutorial r big-data gpu cuda kaggle gbdt gbm gpu-computing decision-trees gradient-boosting coreml catboost categorical-features

Overview

Website | Documentation | Tutorials | Installation | Release Notes

CatBoost is a machine learning method based on gradient boosting over decision trees.

Main advantages of CatBoost:

Superior quality when compared with other GBDT libraries on many datasets.
Best in class prediction speed.
Support for both numerical and categorical features.
Fast GPU and multi-GPU support for training out of the box.
Visualization tools included.

Get Started and Documentation

All CatBoost documentation is available here.

Install CatBoost by following the guide for the

Next you may want to investigate:

If you cannot open the documentation in your browser, try adding yastatic.net and yastat.net to the list of allowed domains in Privacy Badger.

CatBoost models in production

To evaluate a CatBoost model in your application, read the model API documentation.

Questions and bug reports

To report bugs, please use the catboost/bugreport page.
Ask a question on Stack Overflow with the catboost tag; we monitor it for new questions.
For prompt advice, use the Telegram group or the Russian-speaking Telegram chat.

Help to Make CatBoost Better

Check out open problems and help wanted issues to see what can be improved, or open an issue if you want something.
Add your stories and experience to Awesome CatBoost.
To contribute to CatBoost, first read the CLA text and state in your pull request that you agree to the terms of the CLA. More information can be found in CONTRIBUTING.md.
Instructions for contributors can be found here.

News

Reference Paper

Anna Veronika Dorogush, Andrey Gulin, Gleb Gusev, Nikita Kazeev, Liudmila Ostroumova Prokhorenkova, Aleksandr Vorobev "Fighting biases with dynamic boosting". arXiv:1706.09516, 2017.

Anna Veronika Dorogush, Vasily Ershov, Andrey Gulin "CatBoost: gradient boosting with categorical features support". Workshop on ML Systems at NIPS 2017.

License

Comments

UnicodeDecodeError: 'ascii' codec can't decode byte 0xcd in position 9: ordinal not in range(128)

Problem: UnicodeDecodeError: 'ascii' codec can't decode byte 0xcd in position 9: ordinal not in range(128)
catboost version: 0.25
Operating System: Windows 10

When I use setup.py to install CatBoost, this error occurs. Looking closely, the failure has two parts:

1. Building _catboost.pyd with CUDA fails with "UnicodeDecodeError: 'ascii' codec can't decode byte 0xcd in position 9: ordinal not in range(128)".
2. Building _catboost.pyd without CUDA fails with "subprocess.CalledProcessError: Command '['D:\anaconda3\python.exe', 'D:\learn\catboost-master\ya', 'make', 'D:\learn\catboost-master\catboost\python-package\..\..\catboost\python-package\catboost', '--no-src-links', '--output', 'D:\learn\catboost-master\catboost\python-package\build\temp.win-amd64-3.8\Release', '-DPYTHON_CONFIG=python3-config', '-DUSE_ARCADIA_PYTHON=no', '-DOS_SDK=local', '-r', '-DNO_DEBUGINFO', '-DHAVE_CUDA=no']' returned non-zero exit status 1."

I also tried building _catboost.pyx from GitHub into _catboost.pyd directly with 'python setup.py build_ext --inplace', but I got the same error as when installing CatBoost.

C:\Users\王普聪>pip install -e D:\learn\catboost-master\catboost\python-package
Obtaining file:///D:/learn/catboost-master/catboost/python-package
Requirement already satisfied: graphviz in d:\anaconda3\lib\site-packages (from catboost==0.24.4) (0.16)
Requirement already satisfied: plotly in d:\anaconda3\lib\site-packages (from catboost==0.24.4) (4.14.3)
Requirement already satisfied: six in d:\anaconda3\lib\site-packages (from catboost==0.24.4) (1.15.0)
Requirement already satisfied: matplotlib in d:\anaconda3\lib\site-packages (from catboost==0.24.4) (3.2.2)
Requirement already satisfied: numpy>=1.16.0 in d:\anaconda3\lib\site-packages (from catboost==0.24.4) (1.18.5)
Requirement already satisfied: pandas>=0.24 in d:\anaconda3\lib\site-packages (from catboost==0.24.4) (1.0.5)
Requirement already satisfied: scipy in d:\anaconda3\lib\site-packages (from catboost==0.24.4) (1.5.0)
Requirement already satisfied: retrying>=1.3.3 in d:\anaconda3\lib\site-packages (from plotly->catboost==0.24.4) (1.3.3)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in d:\anaconda3\lib\site-packages (from matplotlib->catboost==0.24.4) (2.4.7)
Requirement already satisfied: cycler>=0.10 in d:\anaconda3\lib\site-packages (from matplotlib->catboost==0.24.4) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in d:\anaconda3\lib\site-packages (from matplotlib->catboost==0.24.4) (1.2.0)
Requirement already satisfied: python-dateutil>=2.1 in d:\anaconda3\lib\site-packages (from matplotlib->catboost==0.24.4) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in d:\anaconda3\lib\site-packages (from pandas>=0.24->catboost==0.24.4) (2020.1)
Installing collected packages: catboost
  Running setup.py develop for catboost
    ERROR: Command errored out with exit status 1:
     command: 'D:\anaconda3\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'D:\\learn\\catboost-master\\catboost\\python-package\\setup.py'"'"'; __file__='"'"'D:\\learn\\catboost-master\\catboost\\python-package\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
         cwd: D:\learn\catboost-master\catboost\python-package\
    Complete output (159 lines):
    running develop
    15:30:22 I Targeting for CUDA support with C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1
    running egg_info
    writing catboost.egg-info\PKG-INFO
    writing dependency_links to catboost.egg-info\dependency_links.txt
    writing requirements to catboost.egg-info\requires.txt
    writing top-level names to catboost.egg-info\top_level.txt
    15:30:24 I Targeting for CUDA support with C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1
    reading manifest file 'catboost.egg-info\SOURCES.txt'
    writing manifest file 'catboost.egg-info\SOURCES.txt'
    running build_ext
    15:30:24 I Targeting for CUDA support with C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1
    15:30:24 I Buildling _catboost.pyd with ymake
    15:30:24 I EXECUTE: D:\anaconda3\python.exe D:\learn\catboost-master\ya make D:\learn\catboost-master\catboost\python-package\..\..\catboost\python-package\catboost --no-src-links --output D:\learn\catboost-master\catboost\python-package\build\temp.win-amd64-3.8\Release -DPYTHON_CONFIG=python3-config -DUSE_ARCADIA_PYTHON=no -DOS_SDK=local -r -DNO_DEBUGINFO -DHAVE_CUDA=yes "-DCUDA_ROOT=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1"
    Output root is subdirectory of Arcadia root, this may cause non-idempotent build
    Traceback (most recent call last):
      File "devtools/ya/app.py", line 422, in configure_exit_interceptor
        yield
      File "devtools/ya/app.py", line 65, in helper
        return action(args)
      File "devtools/ya/entry/entry.py", line 55, in do_main
        res = handler.handle(handler, args, prefix=['ya'])
      File "devtools/ya/core/handler.py", line 159, in handle
        return handler.handle(self, args[1:], prefix + [name])
      File "devtools/ya/core/dispatch.py", line 37, in handle
        return self.command().handle(root_handler, args, prefix)
      File "devtools/ya/core/handler.py", line 341, in handle
        return self._action(params)
      File "devtools/ya/app.py", line 92, in helper
        return action(ctx.params)
      File "devtools/ya/build/build_handler.py", line 85, in do_ya_make
        builder = ya_make.YaMake(params, app_ctx)
      File "devtools/ya/build/ya_make.py", line 895, in __init__
        self.ctx = Context(self.opts, app_ctx=app_ctx, graph=graph, tests=tests, stripped_tests=stripped_tests, configure_errors=configure_errors, make_files=make_files, lite_graph=lite_graph)
      File "devtools/ya/build/ya_make.py", line 574, in __init__
        self.graph, self.tests, self.stripped_tests, self.configure_errors, self.make_files = _build_graph_and_tests(self.opts, app_ctx)
      File "devtools/ya/build/ya_make.py", line 258, in _build_graph_and_tests
        graph, tests, stripped_tests, gh, make_files = lg.build_graph_and_tests(opts, check=True, ev_listener=ev_listener, display=display)
      File "devtools/ya/build/graph.py", line 1688, in build_graph_and_tests
        return _build_graph_and_tests(opts, check, ev_listener, exit_stack, display)
      File "devtools/ya/build/graph.py", line 1992, in _build_graph_and_tests
        real_ymake_bin = tools.tool('ymake')
      File "devtools/ya/yalibrary/tools/__init__.py", line 220, in tool
        return toolchain.find(name, with_params, for_platform, cache=cache)
      File "devtools/ya/yalibrary/tools/__init__.py", line 158, in find
        executable = cur_bottle[executable_name]  # if executable_name is None it's Ok
      File "devtools/ya/yalibrary/tools/__init__.py", line 64, in __getitem__
        path = self.resolve()
      File "devtools/ya/yalibrary/tools/__init__.py", line 46, in resolve
        return self.__fetcher.fetch_if_need(self.__formula["match"], tared, binname, cache=cache).where
      File "devtools/ya/yalibrary/fetcher/__init__.py", line 385, in fetch_if_need
        self.__c[key] = self._fetch_if_need(*args, **kwargs)
      File "devtools/ya/yalibrary/fetcher/__init__.py", line 452, in _fetch_if_need
        if self._fetch(name, tared, lambda x: name.lower() in x.lower(), binname):
      File "devtools/ya/yalibrary/fetcher/__init__.py", line 368, in _fetch
        _install(res_path, do_install)
      File "devtools/ya/yalibrary/fetcher/__init__.py", line 104, in _install
        fs_handler(install_guard)
      File "devtools/ya/yalibrary/fetcher/__init__.py", line 95, in fs_handler
        func(install_guard)
      File "devtools/ya/yalibrary/fetcher/__init__.py", line 350, in do_install
        deploy_params=(UNTAR, resource_info if resource_info else {"file_name": "FILE"}, ""))
      File "devtools/ya/yalibrary/fetcher/__init__.py", line 137, in _deploy_tool
        exts.archive.extract_from_tar(archive, extract_to)
      File "devtools/ya/exts/archive.py", line 16, in extract_from_tar
        archive.extract_tar(tar_file_path, output_dir)
      File "library/python/archive/__init__.py", line 62, in extract_tar
        output_dir = encode(output_dir, ENCODING)
      File "library/python/archive/__init__.py", line 58, in encode
        return value.encode(encoding)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcd in position 9: ordinal not in range(128)
    15:30:37 E Cannot build _catboost.pyd with CUDA support, will build without CUDA
    15:30:37 I EXECUTE: D:\anaconda3\python.exe D:\learn\catboost-master\ya make D:\learn\catboost-master\catboost\python-package\..\..\catboost\python-package\catboost --no-src-links --output D:\learn\catboost-master\catboost\python-package\build\temp.win-amd64-3.8\Release -DPYTHON_CONFIG=python3-config -DUSE_ARCADIA_PYTHON=no -DOS_SDK=local -r -DNO_DEBUGINFO -DHAVE_CUDA=no
    Output root is subdirectory of Arcadia root, this may cause non-idempotent build
    Traceback (most recent call last):
      File "devtools/ya/app.py", line 422, in configure_exit_interceptor
        yield
      File "devtools/ya/app.py", line 65, in helper
        return action(args)
      File "devtools/ya/entry/entry.py", line 55, in do_main
        res = handler.handle(handler, args, prefix=['ya'])
      File "devtools/ya/core/handler.py", line 159, in handle
        return handler.handle(self, args[1:], prefix + [name])
      File "devtools/ya/core/dispatch.py", line 37, in handle
        return self.command().handle(root_handler, args, prefix)
      File "devtools/ya/core/handler.py", line 341, in handle
        return self._action(params)
      File "devtools/ya/app.py", line 92, in helper
        return action(ctx.params)
      File "devtools/ya/build/build_handler.py", line 85, in do_ya_make
        builder = ya_make.YaMake(params, app_ctx)
      File "devtools/ya/build/ya_make.py", line 895, in __init__
        self.ctx = Context(self.opts, app_ctx=app_ctx, graph=graph, tests=tests, stripped_tests=stripped_tests, configure_errors=configure_errors, make_files=make_files, lite_graph=lite_graph)
      File "devtools/ya/build/ya_make.py", line 574, in __init__
        self.graph, self.tests, self.stripped_tests, self.configure_errors, self.make_files = _build_graph_and_tests(self.opts, app_ctx)
      File "devtools/ya/build/ya_make.py", line 258, in _build_graph_and_tests
        graph, tests, stripped_tests, gh, make_files = lg.build_graph_and_tests(opts, check=True, ev_listener=ev_listener, display=display)
      File "devtools/ya/build/graph.py", line 1688, in build_graph_and_tests
        return _build_graph_and_tests(opts, check, ev_listener, exit_stack, display)
      File "devtools/ya/build/graph.py", line 1992, in _build_graph_and_tests
        real_ymake_bin = tools.tool('ymake')
      File "devtools/ya/yalibrary/tools/__init__.py", line 220, in tool
        return toolchain.find(name, with_params, for_platform, cache=cache)
      File "devtools/ya/yalibrary/tools/__init__.py", line 158, in find
        executable = cur_bottle[executable_name]  # if executable_name is None it's Ok
      File "devtools/ya/yalibrary/tools/__init__.py", line 64, in __getitem__
        path = self.resolve()
      File "devtools/ya/yalibrary/tools/__init__.py", line 46, in resolve
        return self.__fetcher.fetch_if_need(self.__formula["match"], tared, binname, cache=cache).where
      File "devtools/ya/yalibrary/fetcher/__init__.py", line 385, in fetch_if_need
        self.__c[key] = self._fetch_if_need(*args, **kwargs)
      File "devtools/ya/yalibrary/fetcher/__init__.py", line 452, in _fetch_if_need
        if self._fetch(name, tared, lambda x: name.lower() in x.lower(), binname):
      File "devtools/ya/yalibrary/fetcher/__init__.py", line 368, in _fetch
        _install(res_path, do_install)
      File "devtools/ya/yalibrary/fetcher/__init__.py", line 104, in _install
        fs_handler(install_guard)
      File "devtools/ya/yalibrary/fetcher/__init__.py", line 95, in fs_handler
        func(install_guard)
      File "devtools/ya/yalibrary/fetcher/__init__.py", line 350, in do_install
        deploy_params=(UNTAR, resource_info if resource_info else {"file_name": "FILE"}, ""))
      File "devtools/ya/yalibrary/fetcher/__init__.py", line 137, in _deploy_tool
        exts.archive.extract_from_tar(archive, extract_to)
      File "devtools/ya/exts/archive.py", line 16, in extract_from_tar
        archive.extract_tar(tar_file_path, output_dir)
      File "library/python/archive/__init__.py", line 62, in extract_tar
        output_dir = encode(output_dir, ENCODING)
      File "library/python/archive/__init__.py", line 58, in encode
        return value.encode(encoding)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xcd in position 9: ordinal not in range(128)
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "D:\learn\catboost-master\catboost\python-package\setup.py", line 259, in <module>
        setup(
      File "D:\anaconda3\lib\site-packages\setuptools\__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "D:\anaconda3\lib\distutils\core.py", line 148, in setup
        dist.run_commands()
      File "D:\anaconda3\lib\distutils\dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "D:\anaconda3\lib\distutils\dist.py", line 985, in run_command
        cmd_obj.run()
      File "D:\anaconda3\lib\site-packages\setuptools\command\develop.py", line 34, in run
        self.install_for_development()
      File "D:\anaconda3\lib\site-packages\setuptools\command\develop.py", line 136, in install_for_development
        self.run_command('build_ext')
      File "D:\anaconda3\lib\distutils\cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "D:\anaconda3\lib\distutils\dist.py", line 985, in run_command
        cmd_obj.run()
      File "D:\learn\catboost-master\catboost\python-package\setup.py", line 186, in run
        self.build_with_ymake(topsrc_dir, build_dir, catboost_ext, put_dir, verbose, dry_run)
      File "D:\learn\catboost-master\catboost\python-package\setup.py", line 219, in build_with_ymake
        logging_execute(ymake_cmd + ['-DHAVE_CUDA=no'], verbose, dry_run)
      File "D:\learn\catboost-master\catboost\python-package\setup.py", line 62, in logging_execute
        subprocess.check_call(cmd, universal_newlines=True)
      File "D:\anaconda3\lib\subprocess.py", line 364, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['D:\\anaconda3\\python.exe', 'D:\\learn\\catboost-master\\ya', 'make', 'D:\\learn\\catboost-master\\catboost\\python-package\\..\\..\\catboost\\python-package\\catboost', '--no-src-links', '--output', 'D:\\learn\\catboost-master\\catboost\\python-package\\build\\temp.win-amd64-3.8\\Release', '-DPYTHON_CONFIG=python3-config', '-DUSE_ARCADIA_PYTHON=no', '-DOS_SDK=local', '-r', '-DNO_DEBUGINFO', '-DHAVE_CUDA=no']' returned non-zero exit status 1.
    ----------------------------------------
ERROR: Command errored out with exit status 1: 'D:\anaconda3\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'D:\\learn\\catboost-master\\catboost\\python-package\\setup.py'"'"'; __file__='"'"'D:\\learn\\catboost-master\\catboost\\python-package\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.
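
One plausible reading of the traceback above, offered as an assumption rather than a confirmed diagnosis: the build tool encodes a directory path as ASCII, and that path contains non-ASCII characters. The Windows user name in the prompt (王普聪) would fit, because in the GBK code page the first byte of 王 is 0xcd, and it sits at offset 9 of C:\Users\王普聪. If the bundled interpreter is Python 2, calling .encode() on a byte string first decodes it as ASCII, which would explain why an encode call raises UnicodeDecodeError. The Python 3 sketch below reproduces only the byte arithmetic; the GBK code page, the home-directory location, and the helper name first_non_ascii are all assumptions.

```python
def first_non_ascii(raw: bytes):
    """Return (offset, byte value) of the first byte ASCII cannot decode, or None."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc.start, raw[exc.start]
    return None


# Assumption: the user profile path, encoded with the GBK code page.
home = "C:\\Users\\王普聪"
raw = home.encode("gbk")

offset, value = first_non_ascii(raw)
print(offset, hex(value))
```

This prints 9 0xcd, matching the position and byte in the error message. If this reading is right, building from a user profile or cache directory with an ASCII-only path may be worth trying.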

opened by Wangpc-972 67

User description is used by default. Move metric creation to corresponding class factories.

Each metric now uses user-specified parameters in their descriptions by default.

Design

TMetric now stores a TMap<TString, TString> of user parameters, which are used to construct a metric description (e.g. MetricName:key1=value1;key2=value2). This implementation is defined in the base class and is now the default behaviour for building metric descriptions.

Some specific GetDescription method implementations are kept in order to stay consistent with the existing behaviour.
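
The description format can be illustrated with a small Python stand-in for the C++ base-class logic. This is only a sketch: the function name build_metric_description is hypothetical, sorting the keys is an assumption that mirrors how an ordered map iterates, and returning the bare name when there are no parameters is also an assumption.

```python
def build_metric_description(name, user_params):
    """Format a metric description as 'Name:key1=value1;key2=value2'."""
    if not user_params:
        # Assumption: a metric without user parameters is described by its name only.
        return name
    # Sort keys so the output is deterministic, as an ordered map would give.
    parts = [f"{key}={user_params[key]}" for key in sorted(user_params)]
    return name + ":" + ";".join(parts)


print(build_metric_description("MetricName", {"key2": "value2", "key1": "value1"}))
```

This prints MetricName:key1=value1;key2=value2, the shape given in the Design paragraph.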

Note

UserQuerywiseMetric now uses the options in its representation as well.

opened by ivanychev 38
Sum of shap values does not equal to the prediction

Problem: Sum of shap values does not equal to the prediction
catboost version: 0.18.1
Operating System: Ubuntu 19.10
CPU: i7-8565U

It only happens sometimes, but we find that the sum of SHAP values does not equal the prediction. Please let us know what further information we can provide.
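
A check like the following may help isolate the affected rows when gathering more information. It assumes the usual layout of CatBoost SHAP output, where each row holds one contribution per feature followed by the expected value, and it assumes the comparison is made against the raw (untransformed) prediction rather than a probability. The function name and tolerance are hypothetical, and the numbers are toy data.

```python
import math


def shap_mismatches(shap_rows, raw_predictions, tol=1e-6):
    """Return indices of rows whose SHAP sum differs from the raw prediction."""
    bad = []
    for i, (row, pred) in enumerate(zip(shap_rows, raw_predictions)):
        if not math.isclose(sum(row), pred, rel_tol=tol, abs_tol=tol):
            bad.append(i)
    return bad


# Toy numbers: two feature contributions plus the expected value in the last column.
rows = [[0.2, -0.1, 0.5], [0.3, 0.3, 0.5]]
preds = [0.6, 1.3]
print(shap_mismatches(rows, preds))
```

Here only the second row is reported, since 0.3 + 0.3 + 0.5 is 1.1 rather than 1.3.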
in progress bug

opened by hopoluicha 27
How catboost handle with big data?

Hi! I am trying to use CatBoost in a Kaggle competition: https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection The training set has about 40M rows and 14 features. When I try to train a model, the kernel always dies without any errors...
need info

opened by Mechanix12 27
Unknown class labels

I'm a beginner with boosting models and I'm trying to use CatBoost. My input data has 6 categorical features and 2 numerical features. My target variable is numerical. I'm running on GPU. I'm facing the problem below, please help me. I cannot share the data due to privacy issues.

Traceback (most recent call last):
  File "/work/ilt/css8222/cat_boost/cat_boost.py", line 127, in
    save_snapshot = True
  File "/fibus/fs2/15/css8222/.local/lib/python3.6/site-packages/catboost/core.py", line 4718, in fit
    silent, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, init_model, callbacks, log_cout, log_cerr)
  File "/fibus/fs2/15/css8222/.local/lib/python3.6/site-packages/catboost/core.py", line 2042, in _fit
    train_params["init_model"]
  File "/fibus/fs2/15/css8222/.local/lib/python3.6/site-packages/catboost/core.py", line 1464, in _train
    self._object._train(train_pool, test_pool, params, allow_clear_pool, init_model._object if init_model else None)
  File "_catboost.pyx", line 4393, in _catboost._CatBoost._train
  File "_catboost.pyx", line 4442, in _catboost._CatBoost._train
_catboost.CatBoostError: catboost/private/libs/target/target_converter.cpp:226: Unknown class label: "14289"

opened by sujay003 25
Faster SHAP values for small batches
For small batches, use direct SHAP value calculation. The direct algorithm (without precalculation) is faster when DocumentsNumber < MeanLeafCount, because the preprocessing step computes SHAP values for MeanLeafCount documents.

(algorithm from https://arxiv.org/abs/1802.03888)

With preprocessing, the final complexity was O(NT(D+F)) + O(TL^2 D^2), where N is the number of documents (objects), T is the number of trees, D is the average tree depth, F is the average number of features in a tree, and L is the average number of leaves in a tree. If the batch is small, we can instead use the default algorithm with complexity O(NTLD^2), which is better when N < L.
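
The trade-off can be made concrete by plugging numbers into the two estimates. The sketch below is not the selection logic from this pull request: the function names are hypothetical, constant factors are ignored, the mode labels are only borrowed from the shap_mode values, and F = 6 features per tree is an assumption.

```python
def precalc_cost(n, t, d, f, l):
    """Cost estimate with preprocessing: N*T*(D+F) + T*L^2*D^2."""
    return n * t * (d + f) + t * l**2 * d**2


def direct_cost(n, t, d, l):
    """Cost estimate of the direct algorithm: N*T*L*D^2."""
    return n * t * l * d**2


def cheaper_mode(n, t, d, f, l):
    """Pick the mode with the smaller estimate (labels borrowed from shap_mode)."""
    if direct_cost(n, t, d, l) < precalc_cost(n, t, d, f, l):
        return "NoPreCalc"
    return "UsePreCalc"


# 500 trees of depth 6 (64 leaves each); F = 6 is an assumption.
print(cheaper_mode(n=1, t=500, d=6, f=6, l=64))
print(cheaper_mode(n=10_000, t=500, d=6, f=6, l=64))
```

With these numbers a single document favours the direct algorithm, while a batch of 10,000 documents favours precalculation, in line with the N < L rule of thumb.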

Example: On the gisette dataset (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), using the first 100 features, train CatBoostRegressor(iterations=500, depth=6, random_seed=42) and then use get_feature_importance to compute SHAP values for the first object in the test set.

Old: 0.32 s

New:

shap_mode="Auto" or "NoPreCalc": 0.015 s
shap_mode="UsePreCalc": 0.32 s (same as before)

I hereby agree to the terms of the CLA available at: link
opened by Lokutrus 25
Tutorial for ranking modes in CatBoost

Hello.

Looks like the current version of CatBoost supports learning to rank. There are some clues about it in the documentation, but I couldn't find any minimal working examples. I wonder which methods should be considered as a baseline approach and what are the prerequisites?

Should we use YetiRank as the training metric and just provide a query id as the Pool group_id parameter? What other CatBoost parameters should be taken into account specifically for a ranking problem?

Thank you!
planned documentation

opened by hanky 24
GPU yields worse metric than CPU

Problem: various measurements become worse when I switch from CPU to GPU
catboost version: 0.22
Operating System: Linux 4.4.0-1100-aws x86_64
CPU: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz

GPU: Tesla M60

I wanted to reduce the training time and so I specified 'task_type' as 'GPU'. I immediately noticed that its metrics got worse. The only change I made was setting task_type as GPU. The rest are the same.

The training dataset has 1.2M rows and 218 columns. Among these 218 columns, 42 are categorical features. The rest are floats or integers, no text features. The validation dataset has 120K rows.

The following are the parameters for the CPU version: {'nan_mode': 'Min', 'eval_metric': 'Logloss', 'combinations_ctr': ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCount=1:TargetBorderType=MinEntropy:Prior=0/1:Prior=0.5/1:Prior=1/1', 'Counter:CtrBorderCount=15:CtrBorderType=Uniform:Prior=0/1'], 'iterations': 1000, 'sampling_frequency': 'PerTree', 'fold_permutation_block': 0, 'leaf_estimation_method': 'Newton', 'od_pval': 0, 'counter_calc_method': 'SkipTest', 'grow_policy': 'SymmetricTree', 'boosting_type': 'Plain', 'model_shrink_mode': 'Constant', 'feature_border_type': 'GreedyLogSum', 'ctr_leaf_count_limit': 18446744073709551615, 'bayesian_matrix_reg': 0.10000000149011612, 'one_hot_max_size': 2, 'l2_leaf_reg': 3, 'random_strength': 1, 'od_type': 'Iter', 'rsm': 1, 'boost_from_average': False, 'max_ctr_complexity': 4, 'model_size_reg': 0.5, 'simple_ctr': ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCount=1:TargetBorderType=MinEntropy:Prior=0/1:Prior=0.5/1:Prior=1/1', 'Counter:CtrBorderCount=15:CtrBorderType=Uniform:Prior=0/1'], 'subsample': 0.800000011920929, 'use_best_model': True, 'od_wait': 35, 'class_names': [0, 1], 'random_seed': 42, 'depth': 6, 'ctr_target_border_count': 1, 'has_time': False, 'store_all_simple_ctr': False, 'border_count': 254, 'classes_count': 0, 'sparse_features_conflict_fraction': 0, 'leaf_estimation_backtracking': 'AnyImprovement', 'best_model_min_trees': 1, 'model_shrink_rate': 0, 'min_data_in_leaf': 1, 'loss_function': 'Logloss', 'learning_rate': 0.30000001192092896, 'score_function': 'Cosine', 'task_type': 'CPU', 'leaf_estimation_iterations': 10, 'bootstrap_type': 'MVS', 'max_leaves': 64, 'permutation_count': 4}

The following are the parameters for the GPU version: {'nan_mode': 'Min', 'gpu_ram_part': 0.95, 'eval_metric': 'Logloss', 'combinations_ctr': ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCount=1:TargetBorderType=MinEntropy:Prior=0/1:Prior=0.5/1:Prior=1/1', 'FeatureFreq:CtrBorderCount=15:CtrBorderType=Median:Prior=0/1'], 'iterations': 1000, 'fold_permutation_block': 64, 'leaf_estimation_method': 'Newton', 'observations_to_bootstrap': 'TestOnly', 'od_pval': 0, 'counter_calc_method': 'SkipTest', 'grow_policy': 'SymmetricTree', 'boosting_type': 'Plain', 'ctr_history_unit': 'Sample', 'feature_border_type': 'GreedyLogSum', 'bayesian_matrix_reg': 0.10000000149011612, 'one_hot_max_size': 2, 'devices': '-1', 'pinned_memory_bytes': '104857600', 'l2_leaf_reg': 3, 'random_strength': 1, 'od_type': 'Iter', 'rsm': 1, 'boost_from_average': False, 'fold_size_loss_normalization': False, 'max_ctr_complexity': 4, 'gpu_cat_features_storage': 'GpuRam', 'simple_ctr': ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCount=1:TargetBorderType=MinEntropy:Prior=0/1:Prior=0.5/1:Prior=1/1', 'FeatureFreq:CtrBorderCount=15:CtrBorderType=MinEntropy:Prior=0/1'], 'use_best_model': True, 'od_wait': 35, 'class_names': [0, 1], 'random_seed': 42, 'depth': 6, 'ctr_target_border_count': 1, 'has_time': False, 'border_count': 128, 'min_fold_size': 100, 'data_partition': 'FeatureParallel', 'bagging_temperature': 1, 'classes_count': 0, 'leaf_estimation_backtracking': 'AnyImprovement', 'best_model_min_trees': 1, 'min_data_in_leaf': 1, 'add_ridge_penalty_to_loss_function': False, 'loss_function': 'Logloss', 'learning_rate': 0.30000001192092896, 'score_function': 'Cosine', 'task_type': 'GPU', 'leaf_estimation_iterations': 10, 'bootstrap_type': 'Bayesian', 'max_leaves': 64, 'permutation_count': 4}
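
Since the two dumps are long, a small diff helper may make the comparison easier to read. The sketch below uses only a handful of entries copied from the dumps above; the helper name diff_params is hypothetical, and nothing here establishes which of the differences, if any, explains the gap in metrics.

```python
def diff_params(left, right):
    """Map each differing key to its (left, right) values; None marks a missing key."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


# A few entries copied from the CPU and GPU parameter dumps.
cpu = {"border_count": 254, "bootstrap_type": "MVS", "depth": 6, "subsample": 0.800000011920929}
gpu = {"border_count": 128, "bootstrap_type": "Bayesian", "depth": 6, "bagging_temperature": 1}

for key, (cpu_value, gpu_value) in diff_params(cpu, gpu).items():
    print(f"{key}: CPU={cpu_value!r} GPU={gpu_value!r}")
```

Passing the full dictionaries instead of these excerpts lists every default that changed along with task_type, such as border_count and bootstrap_type.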

opened by kdlin 23

Using parameters from saved model for cross-validation leads to 'exclusive parameters' error.

Problem: "Only one of parameters ['verbose', 'logging_level', 'verbose_eval', 'silent'] should be set" is printed by the cv function after loading a previously saved model from a file.
catboost version: 0.12.2
Operating System: CentOS Linux release 7.4.1708
CPU: Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz

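from catboost import CatBoostClassifier, cv

# train_pool, validation_pool, whole_pool and model_path are assumed to be defined earlier.
model = CatBoostClassifier(loss_function='MultiClass')
model.fit(train_pool,
          verbose=False,
          plot=True,
          eval_set=validation_pool)
model.save_model(str(model_path.absolute()))

# Reload the saved model and reuse its parameters for cross-validation.
model = CatBoostClassifier()
model.load_model(str(model_path.absolute()))
cv_data = cv(
    whole_pool,
    params=model.get_params()  # this call raises the error below
)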

---------------------------------------------------------------------------
CatboostError                             Traceback (most recent call last)
<ipython-input-40-f150897615b8> in <module>
      1 cv_data = cv(
      2     whole_pool,
----> 3     params=model.get_params()
      4 )

~/.conda/envs/catboost/lib/python3.6/site-packages/catboost/core.py in cv(pool, params, dtrain, iterations, num_boost_round, fold_count, nfold, inverted, partition_random_seed, seed, shuffle, logging_level, stratified, as_pandas, metric_period, verbose, verbose_eval, plot, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, max_time_spent_on_fixed_cost_ratio, dev_max_iterations_batch_size)
   2876 
   2877     params = deepcopy(params)
-> 2878     _process_synonyms(params)
   2879 
   2880     metric_period, verbose, logging_level = _process_verbose(metric_period, verbose, logging_level, verbose_eval)

~/.conda/envs/catboost/lib/python3.6/site-packages/catboost/core.py in _process_synonyms(params)
    754         del params['silent']
    755 
--> 756     metric_period, verbose, logging_level = _process_verbose(metric_period, verbose, logging_level, verbose_eval, silent)
    757 
    758     if metric_period is not None:

~/.conda/envs/catboost/lib/python3.6/site-packages/catboost/core.py in _process_verbose(metric_period, verbose, logging_level, verbose_eval, silent)
    133     at_most_one = sum(params.get(exclusive) is not None for exclusive in exclusive_params)
    134     if at_most_one > 1:
--> 135         raise CatboostError('Only one of parameters {} should be set'.format(exclusive_params))
    136 
    137     if verbose is None:

CatboostError: Only one of parameters ['verbose', 'logging_level', 'verbose_eval', 'silent'] should be set
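
Until the underlying behaviour is fixed, one possible workaround is to strip the mutually exclusive verbosity keys from the parameters before passing them to cv. This is a sketch and not the fix adopted by the maintainers: the helper name is hypothetical, and the example dictionary only imitates what a loaded model might return, on the assumption that it carries more than one verbosity key.

```python
VERBOSITY_KEYS = ("verbose", "logging_level", "verbose_eval", "silent")


def drop_verbosity_params(params, keep=None):
    """Return a copy of params with all verbosity keys removed except `keep`."""
    return {k: v for k, v in params.items() if k not in VERBOSITY_KEYS or k == keep}


# Hypothetical parameters as they might come back from a loaded model.
loaded = {"loss_function": "MultiClass", "verbose": 0, "logging_level": "Silent"}
print(drop_verbosity_params(loaded))
print(drop_verbosity_params(loaded, keep="logging_level"))
```

In the snippet from the report this would be used as cv(whole_pool, params=drop_verbosity_params(model.get_params())).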

bug

opened by protsenkovi 23

Flag not copied unnecessarily with blank and whitespace
Before submitting a pull request, please do the following steps:

Read instructions for contributors here.

Run ya make in catboost folder to make sure the code builds.

Add tests that test your change.

Run tests using ya make -t -A command.

If you haven't already, complete the CLA. I hereby agree to the terms of the CLA available at https://yandex.ru/legal/cla/?lang=en.
opened by sharaalfa 23
Issue trying to compile with specified gcc version
I'm trying to compile the catboost python wheel on my system. The default gcc version I have is 8, but I also have 7 installed so I'm trying to use that by setting the CC and CXX environment variables. However, when running:

python mk_wheel.py -DCUDA_ROOT="/opt/cuda"

I get the message:

Info: Attention! Using system user-defined compiler: g++-7 (check CC and CXX env vars). Cross compilation with system CXX is not supported

catboost version: git master
Operating System: Linux
CPU: i7
GPU: GTX 1080

Thanks!
build issues
opened by ctlaltdefeat 23
Prediction probability result mismatch - C API and Python

Problem: We used the Python API of catboost to train our multiclass classification model and the resultant .cbm model was used in python / C to do the prediction.

I noticed that when making inferences using the same model and the same input data (the model expects 3 float features and 4 categorical features.), the prediction probability in Python is slightly different compared to the prediction probability using the C API.

We use CatboostClassifier.predict_proba in Python with all default parameters, and we set SetPredictionType(modelHandle, APT_PROBABILITY); in C API.

We found that the sum of the probabilities returned in Python always differs from 1 (it is sometimes greater and sometimes less than 1), while the sum of the probabilities returned in C is always equal to 1.

We do not know if both ways to get the probability are the same (CatboostClassifier.predict_proba and SetPredictionType(modelHandle, APT_PROBABILITY);), but if they are the same, why is the result different?

catboost version: 1.0.3

Operating System: MacOS Ventura 13.1

CPU: Apple M1
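
One thing that may be worth checking, offered as an assumption rather than a diagnosis, is whether both sides apply the same transformation to the raw per-class scores. A softmax always sums to 1, while independent per-class sigmoids (as in a one-vs-all setup) generally do not. The sketch below uses made-up raw scores and plain Python, not the CatBoost API.

```python
import math


def softmax(scores):
    """Softmax with max-subtraction for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoids(scores):
    """Independent per-class sigmoid, as a one-vs-all setup would apply."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


raw = [1.2, -0.4, 0.3]  # made-up raw scores for three classes
print(sum(softmax(raw)))
print(sum(sigmoids(raw)))
```

Comparing the raw values from both APIs first, before any probability transformation, would show whether the difference comes from the model evaluation or from the post-processing step.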

opened by eli3xm 0
Spark Feature Importance issue
Problem: ai.catboost.CatBoostError: Unsupported data type for Label at ai.catboost.spark.DatasetLoadingContext$.getLabelCallback(DataHelpers.scala:465)
catboost version: 1.1.1
Operating System: Linux, Spark 3.3.1

The following method call fails with the error described above:

((CatBoostClassificationModel) model).getFeatureImportance(EFstrType.LossFunctionChange, evalPool, ECalcTypeShapValues.Regular)
opened by eugene-kamenev 0
Saved model's params are different from current model's params

Problem: Can't fit models on GPU, Saved model's params are different from current model's params
catboost version: '1.1.1'
Operating System: Windows 10
CPU: 0
GPU: 1

from catboost import CatBoostClassifier, Pool, cv

model_cat_tm_1 = CatBoostClassifier(
    iterations=5000,
    loss_function='Logloss',
    # eval_metric='AUC',
    learning_rate=0.05,
    random_seed=1,
    od_type="Iter",
    od_wait=200,
    depth=5,
    task_type="GPU",
    devices='0:1',
    save_snapshot=False,
)

cv_params_tm_1 = model_cat_tm_1.get_params()
cv_data_tm_1 = cv(
    Pool(train_tm_treatment_one_features, train_tm_treatment_one_target),
    cv_params_tm_1,
    plot=True,
    verbose=100,
)

Getting this error (tried rebooting the system and opening another script; neither helps):

Training on fold [0/3]

CatBoostError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_3516\715857703.py in
      1 cv_params_tm_1 = model_cat_tm_1.get_params()
----> 2 cv_data_tm_1 = cv(
      3     Pool(train_tm_treatment_one_features, train_tm_treatment_one_target),
      4     cv_params_tm_1,
      5     plot=True,

~\AppData\Roaming\Python\Python39\site-packages\catboost\core.py in cv(pool, params, dtrain, iterations, num_boost_round, fold_count, nfold, inverted, partition_random_seed, seed, shuffle, logging_level, stratified, as_pandas, metric_period, verbose, verbose_eval, plot, plot_file, early_stopping_rounds, save_snapshot, snapshot_file, snapshot_interval, metric_update_interval, folds, type, return_models, log_cout, log_cerr)
   6648     with log_fixup(log_cout, log_cerr), plot_wrapper(plot, plot_file=plot_file, plot_title='Cross-validation plot', train_dirs=plot_dirs):
   6649         if not return_models:
-> 6650             return _cv(params, pool, fold_count, inverted, partition_random_seed, shuffle, stratified,
   6651                 metric_update_interval, as_pandas, folds, type, return_models)
   6652         else:

_catboost.pyx in _catboost._cv()

_catboost.pyx in _catboost._cv()

CatBoostError: C:/Program Files (x86)/Go Agent/pipelines/BuildMaster/catboost.git/catboost/cuda/methods/boosting_progress_tracker.cpp:171: Saved model's params are different from current model's params

opened by MiMakh 0

Catboost spark fit error java.lang.ClassCastException

Problem: net.razorvine.pickle.objects.TimeDelta cannot be cast to java.time.Duration
catboost version: 1.0.6
Operating System: 10.4 LTS ML (includes Apache Spark 3.2.1, Scala 2.12)

Hi, I'm trying to test catboost_spark in a Databricks notebook using the example from the official documentation: https://catboost.ai/en/docs/concepts/spark-quickstart-python#binary-classification

When I run this command:

classifier.fit(dataset=trainPool, evalDatasets=[evalPool])

The following error is raised:

java.lang.ClassCastException: net.razorvine.pickle.objects.TimeDelta cannot be cast to java.time.Duration

...

Py4JJavaError: An error occurred while calling o18779.w.
: java.lang.ClassCastException: net.razorvine.pickle.objects.TimeDelta cannot be cast to java.time.Duration
	at ai.catboost.spark.params.DurationParam.w(Helpers.scala:61)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)

I believe there is a similar issue to this but it is now closed. Thank you in advance for the help.

opened by vitormanita 0

parameter missing for non_linear regression

Problem: Non-linear regression "Poly" kernel parameter missing
catboost version: 0.26.1
Operating System: Linux
CPU: True
GPU: False

Hi there, I am training a model for a regression problem, but my data is non-linear in nature. I would like to switch to a non-linear kernel such as Poly, like the ones available in Support Vector Regressor. I searched the CatBoost parameters for an equivalent but couldn't find one. Do you have plans to add it? Thanks

opened by hamza1424 0

Releases(v1.1.1)

v1.1.1(Nov 1, 2022)
Release 1.1.1

New features

Support building for Linux on aarch64 from sources using CMake (no prebuilt binaries or PyPI packages yet). #1981

[C/C++ applier] Support embedding features. #2172

[C/C++ applier] Add GetModelUsedFeaturesNames. #2204

[Python] Add text features to utils.create_cd. #2193

[Spark] Full support for Apache Spark 3.3

[Spark] Read/write PySpark's DataFrame-like API for Pool. #2030

[Spark] Allow to specify trainingDriver and worker listening ports. #2181

Bugfixes

Fix prediction dimension check for RMSEWithUncertainty and MultiQuantile. #2155

[C/C++ applier] Fix segmentation fault in prediction for multiple objects for multiple dimension models.

[JVM applier] Fix catboost-common dependency version in catboost-prediction (Fixes JVM applier on macOS). #2121

[Python] Update for pandas 1.5.0: iteritems -> items (Fixes annoying deprecation warning). #2179

[Python] Fix segmentation fault when target is np.ndarray with dtype=object. #2201

[Python] Fix specifying feature_names in utils.create_cd. #2211

Source code(tar.gz)
Source code(zip)
catboost-1.1.1.exe(176.03 MB)
catboost-darwin-1.1.1(50.29 MB)
catboost-linux-1.1.1(198.28 MB)
catboost-R-Darwin-1.1.1.tgz(14.74 MB)
catboost-R-Linux-1.1.1.tgz(68.62 MB)
catboost-R-Windows-1.1.1.tgz(64.72 MB)
catboostmodel.dll(7.61 MB)
catboostmodel.lib(12.98 KB)
libcatboostmodel.dylib(11.97 MB)
libcatboostmodel.so(10.07 MB)
libcatboostr-darwin.dylib(44.91 MB)
libcatboostr-darwin.so(44.91 MB)
libcatboostr-linux.so(186.49 MB)
libcatboostr.dll(175.11 MB)
v1.1(Sep 26, 2022)
Release 1.1

New features

Multiquantile regression

Now it's possible to train models with a shared tree structure and multiple predicted quantile values in each leaf. Currently this approach doesn't give a strong guarantee of consistency between the predicted quantile values, but it still provides more consistency than training multiple independent models, one per quantile. A short description is available in the documentation. Short example for Python: loss_function='MultiQuantile:alpha=0.2,0.4'. Supported only on CPU for now.
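To make the objective concrete, here is a minimal sketch of a multi-quantile pinball loss for a single object. It assumes the per-quantile losses are simply averaged, which may differ from CatBoost's exact normalization; parse_alphas, pinball_loss and multi_quantile_loss are illustrative helpers, not library functions.

```python
def parse_alphas(loss_function):
    """Extract quantile levels from a string like 'MultiQuantile:alpha=0.2,0.4'."""
    return [float(value) for value in loss_function.split("alpha=")[1].split(",")]

def pinball_loss(target, prediction, alpha):
    """Quantile (pinball) loss for one prediction at quantile level alpha."""
    diff = target - prediction
    # Under-prediction is weighted by alpha, over-prediction by (1 - alpha).
    return alpha * diff if diff >= 0 else (alpha - 1.0) * diff

def multi_quantile_loss(target, predictions, alphas):
    """Average the pinball loss over all predicted quantiles (assumed normalization)."""
    losses = [pinball_loss(target, q, a) for q, a in zip(predictions, alphas)]
    return sum(losses) / len(losses)

alphas = parse_alphas("MultiQuantile:alpha=0.2,0.4")
loss = multi_quantile_loss(10.0, [8.0, 9.5], alphas)
# A consistent prediction has non-decreasing quantile values.
is_consistent = all(a <= b for a, b in zip([8.0, 9.5], [9.5]))
print(alphas, loss, is_consistent)
```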

Support text and embedding features for regression and ranking.

Spark: Read/write Spark's Dataset-like API for Pool. #2030

Support HashedCateg column type. This allows to use externally prehashed categorical features both in training and prediction.

New option plot_file in Python functions with plot parameter allows to save plots to file. #758

Add eval_fraction parameter. #1500

Non-symmetric trees model summation.

init_model parameter now works with non-symmetric trees.

Partial support for Apache Spark 3.3 (only for Scala 2.12 and without PySpark).

Speedups

2x speedup of DCG, nDCG and FilteredDCG metrics calculation for groups with >= 50 objects and with top=-1 (all objects from each group, the default value)

Fixed 2x slowdown of PairLogit and other ranking losses on CPU introduced in release 0.23

Bugfixes

Fix for pandas integer array. #2096

Save feature names to json format. #2102

Fix feature weights on CPU

Use feature weights on GPU

Fix gradient calculation for QueryRMSE on GPU

Fix ranking metrics with group weights in calc_metrics

Fix JVM applier on data with text features. #2132

Source code(tar.gz)
Source code(zip)
catboost-1.1.exe(176.03 MB)
catboost-darwin-1.1(50.31 MB)
catboost-linux-1.1(197.28 MB)
catboost-R-Darwin-1.1.tgz(14.75 MB)
catboost-R-Linux-1.1.tgz(68.69 MB)
catboost-R-Windows-1.1.tgz(64.72 MB)
catboostmodel.dll(7.60 MB)
catboostmodel.lib(10.50 KB)
libcatboostmodel.dylib(11.95 MB)
libcatboostmodel.so(9.98 MB)
libcatboostr-darwin.dylib(44.95 MB)
libcatboostr-darwin.so(44.95 MB)
libcatboostr-linux.so(185.71 MB)
libcatboostr.dll(175.13 MB)
v1.0.6(May 19, 2022)
Release 1.0.6

New features

Fixed splits for binary features on gpu for non-symmetric trees -- specify the set of splits to start each tree in the model with --fixed-binary-splits or fixed_binary_splits in Python package (by default, there are no fixed splits)

Documentation

New sections on MultiRMSEWithMissingValues and LogCosh

New section on get_embedding_feature_indices

Add info on gpu support for metrics

Bug-fixes

Fix warning about resetting logger when logging to sys.stdout & sys.stderr from different threads #1855

Fix model summation in CatBoost for Apache Spark

Fix performance and scalability of query auc for ranking (1m samples, query size 2, 8 cpu cores 0.55s -> 0.04s)

Fix support for text features and embeddings in Java applier #2043

Fix nan/inf split scores with yeti rank pairwise loss

Fix nan/inf feature strengths in pair logit on cpu

Source code(tar.gz)
Source code(zip)
catboost-1.0.6.exe(174.88 MB)
catboost-darwin-1.0.6(50.16 MB)
catboost-linux-1.0.6(196.03 MB)
catboost-R-Darwin-1.0.6.tgz(14.69 MB)
catboost-R-Linux-1.0.6.tgz(68.57 MB)
catboost-R-Windows-1.0.6.tgz(64.68 MB)
catboostmodel.dll(7.36 MB)
catboostmodel.lib(10.50 KB)
libcatboostmodel.dylib(11.76 MB)
libcatboostmodel.so(9.86 MB)
libcatboostr-darwin.so(44.96 MB)
libcatboostr-linux.so(184.48 MB)
libcatboostr.dll(174.05 MB)
libcatboostr.dylib(44.96 MB)
v1.0.5(Apr 7, 2022)
Release 1.0.5

New features

Support Apple Darwin arm64 architecture. #1526.

Support feature tags in feature selection.

Support for Apache Spark 3.2.

Model sum in Apache Spark.

Python package

Accommodate multiple target-platform arguments used to build universal binaries.

Add grid creation function to utils.py

Custom multilabel eval metrics by @ELitvinova

Metrics plotter by @evgenabramov

Fbeta score by @ELitvinova

Bugfixes

Fix group weights in metrics calculation.

Fix fit for PySpark estimators. #1976.

Fix predict on GPU. #1901, #1923.

Disable exact leafs calculation for MAE, MAPE, Quantile on GPU.

Fix counter description for plotting. #1973.

Allow weights in BrierScore. #1967.

Disable AUC calculation for learn by default on GPU as well.

Fix plot_tree example in documentation.

Fix plots in cv.

Fix ui32 overflows in pairwise losses on GPU.

Fix for multiclass in nodejs evaluator. #1903.

Fix CatBoost R package installation on Monterey. #1912.

Fix CUDA error 700 caused by data race in mimalloc and CUDA driver.

Fix slow compilation with CUDA 11.2+.

Fix 2nd derivative in RMSEWithUncertainty.

Source code(tar.gz)
Source code(zip)
catboost-1.0.5.exe(175.06 MB)
catboost-darwin-1.0.5(50.15 MB)
catboost-linux-1.0.5(195.96 MB)
catboost-R-Darwin-1.0.5.tgz(14.34 MB)
catboost-R-Linux-1.0.5.tgz(68.54 MB)
catboost-R-Windows-1.0.5.tgz(64.72 MB)
catboostmodel.dll(7.34 MB)
catboostmodel.lib(10.50 KB)
libcatboostmodel.dylib(11.75 MB)
libcatboostmodel.so(9.85 MB)
libcatboostr-darwin.dylib(44.95 MB)
libcatboostr-darwin.so(44.95 MB)
libcatboostr-linux.so(184.42 MB)
libcatboostr.dll(174.15 MB)
v1.0.4(Jan 14, 2022)
New features

Add sort param to FilteredDCG metric.

Add StochasticRank for FilteredDCG.

Python package

Add is_max/minimizable methods. #1915

Support custom metric in select_features #1920

R package

Register functions from libcatboostr natively in R, removing one of CRAN notes.

Bugfixes

Fix apply for models without main loss_function.

Fix text calcer options specification. #1916

Fix calc_feature_statistics

Fix Multi-approx support in CLI calc_metrics mode.

Fix processing for text options. #1930

Fix snapshot saving in feature selection.

Fix CatBoost models serialization inside pipeline models in PySpark. #1936

Source code(tar.gz)
Source code(zip)
catboost-1.0.4.exe(174.16 MB)
catboost-darwin-1.0.4(26.21 MB)
catboost-linux-1.0.4(194.66 MB)
catboost-R-Darwin-1.0.4.tgz(7.83 MB)
catboost-R-Linux-1.0.4.tgz(68.16 MB)
catboost-R-Windows-1.0.4.tgz(64.37 MB)
catboostmodel.dll(6.91 MB)
catboostmodel.lib(9.99 KB)
libcatboostmodel.dylib(6.16 MB)
libcatboostmodel.so(9.81 MB)
libcatboostr-darwin.dylib(23.48 MB)
libcatboostr-darwin.so(23.48 MB)
libcatboostr-linux.so(183.46 MB)
libcatboostr.dll(173.27 MB)
v1.0.3(Nov 4, 2021)
CatBoost for Apache Spark

Fix incorrect Linux so files in deployed Maven artifacts for release 1.0.2 (no code changes)

Source code(tar.gz)
Source code(zip)
catboost-1.0.3.exe(187.97 MB)
catboost-darwin-1.0.3(26.44 MB)
catboost-linux-1.0.3(194.76 MB)
catboost-R-Darwin-1.0.3.tgz(8.00 MB)
catboost-R-Linux-1.0.3.tgz(68.34 MB)
catboost-R-Windows-1.0.3.tgz(67.66 MB)
catboostmodel.dll(9.01 MB)
catboostmodel.lib(9.99 KB)
libcatboostmodel.dylib(6.13 MB)
libcatboostmodel.so(9.77 MB)
libcatboostr-darwin.dylib(23.76 MB)
libcatboostr-darwin.so(23.76 MB)
libcatboostr-linux.so(183.68 MB)
libcatboostr.dll(186.86 MB)
v1.0.2(Nov 4, 2021)
CatBoost for Apache Spark

PySpark: Fix python -> JVM datetime.timedelta conversion.

Fix: proper handling of constant categorical features. #1867

Fix SIGSEGV for Multiclassification with Ctrs. #1886

New features

Add is_min_optimal, is_max_optimal for BuiltinMetrics. #1890

R package

Use libcatboostr-darwin.dylib instead of libcatboostr-darwin.so on macOS. #1834

Bugfixes

Fix CatBoostError: (No such file or directory) bad new file name when using grid_search. #1893

Source code(tar.gz)
Source code(zip)
catboost-1.0.2.exe(187.97 MB)
catboost-darwin-1.0.2(26.44 MB)
catboost-linux-1.0.2(194.76 MB)
catboost-R-Darwin-1.0.2.tgz(8.00 MB)
catboost-R-Linux-1.0.2.tgz(68.34 MB)
catboost-R-Windows-1.0.2.tgz(67.66 MB)
catboostmodel.dll(9.01 MB)
catboostmodel.lib(9.99 KB)
libcatboostmodel.dylib(6.13 MB)
libcatboostmodel.so(9.77 MB)
libcatboostr-darwin.dylib(23.76 MB)
libcatboostr-darwin.so(23.76 MB)
libcatboostr-linux.so(183.68 MB)
libcatboostr.dll(186.86 MB)
v1.0.1(Nov 2, 2021)
:warning: PySpark support is broken in this release. Please use release 1.0.3 instead.

CatBoost for Apache Spark

More robust handling of CatBoost Master and Workers failures, avoid freezes.

Fix for empty partitions. #1687

Fix use-after-free. #1759 and other random errors.

Support Spark 3.1.

Python package

Support python 3.10. #1575

Breaking changes

Use group weight for generated pairs in pairwise losses

Bugfixes

Switch to mimalloc allocator on Linux and macOS to avoid problems with static TLS.

Fix SEGFAULTs on macOS. #1877

Fix: Distributed training: do not fail if worker contains only learn or test data

Fix SEGFAULT on CPU with Depthwise training and rsm < 1.

Fix calc_feature_statistics for cat features. #1882

Fix result of Cross Validation if metric_period has been specified

Fix eval_metric for Multitarget training

Source code(tar.gz)
Source code(zip)
catboost-1.0.1.exe(187.97 MB)
catboost-darwin-1.0.1(26.44 MB)
catboost-linux-1.0.1(194.75 MB)
catboost-R-Darwin-1.0.1.tgz(7.99 MB)
catboost-R-Linux-1.0.1.tgz(68.33 MB)
catboost-R-Windows-1.0.1.tgz(67.65 MB)
catboostmodel.dll(9.01 MB)
catboostmodel.lib(9.99 KB)
libcatboostmodel.dylib(6.13 MB)
libcatboostmodel.so(9.77 MB)
libcatboostr-darwin.dylib(23.75 MB)
libcatboostr-darwin.so(23.75 MB)
libcatboostr-linux.so(183.68 MB)
libcatboostr.dll(186.86 MB)
v1.0.0(Oct 1, 2021)
In this release, we decided to increment the major version because we think CatBoost is stable and production-ready. We know that CatBoost is used a lot in many different companies and individual projects, and we think that all the features we added in the last year are worth incrementing the major version. And of course, like many programmers, we love the magic of binary numbers, and we want to celebrate the 100₂ anniversary of CatBoost's first release on GitHub 🥳

New losses

We've implemented a multi-label multiclass loss function, that allows us to predict multiple labels for each object #1420

Added LogCosh loss implementation #844

Fully distributed CatBoost for Apache Spark

In this release Apache Spark package became truly distributed - in the previous version CatBoost stored test datasets in controller process memory. And now test datasets are split evenly by workers.

Major speedup on CPU

We've improved training speed on numeric datasets:

28% speedup on Higgs dataset: 1000 trees, binclass: on 16 cores Intel CPU: 405 seconds -> 315 seconds

20% speedup on the small numeric dataset with 480K rows, 60 features, 100 trees, binclass on 16 cores Intel CPU 3.7 seconds-> 2.9 seconds

53% speedup on sparse one-hot encoded airlines dataset: 1000 trees training time 381 seconds -> 249 seconds

R package

Update C++ handles by reference to avoid redundant copies by @david-cortes

Avoid calculating groupwise feature importance: do not calculate feature importance for groupwise metrics by default

R tests clear environment after runs so they won't find temporary data from previous runs

Fixed ignored features in R fail when single feature was ignored

Fix feature_count attribute with ignored_features

CV improvements

Added support for text features and embeddings in cross-validation mode

We've changed the way cross-validation works - previously, CatBoost was training a small batch of trees on each fold and then switched to the next fold or next batch of trees. In 1.0.0 we changed this behavior and now CatBoost trains the full model on each fold. That allows us to reduce the memory and time overhead of starting a new batch - only one CPU to GPU memory copy is needed per fold, not per each batch of trees. Mean metric interactive plot became unavailable until the end of training on all folds.

Important change: from now on, use_best_model and early stopping work independently on each fold, as we are trying to make single-fold training as close to regular training as possible. If one model stops at iteration i, we use its last metric value in the mean score plot for points in [i+1; last iteration).
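The padding rule for the mean score plot can be sketched as follows. This is an illustration of the described behavior with made-up metric values, not CatBoost's internal code; mean_curve is a hypothetical helper.

```python
def mean_curve(fold_curves):
    """Average per-iteration metric values across folds that stopped at different iterations."""
    longest = max(len(curve) for curve in fold_curves)
    # A fold that stopped early contributes its last metric value to all later iterations.
    padded = [curve + [curve[-1]] * (longest - len(curve)) for curve in fold_curves]
    return [sum(values) / len(values) for values in zip(*padded)]

# Hypothetical per-fold metric histories; the second fold stopped after two iterations.
folds = [[0.9, 0.7, 0.6], [0.8, 0.5], [1.0, 0.8, 0.7, 0.65]]
averaged = mean_curve(folds)
print(averaged)
```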

GPU improvements

Fixed distributed training performance on Ethernet networks ~2x training time speedup. For 2 hosts, 8 v100/host, 10gigabit eth, 300 factors, 150m samples, 200 trees, 3300s -> 1700s

We've found a bug in the model-size-reg implementation on GPU that led to worse quality of the resulting model, especially in comparison to a model trained on CPU with equal parameters

Rust

Enabled load model from the buffer for rust by @manavsah

Bugfixes

Fix for model predictions with text and embedding features

Switch to TBB local executor to limit TLS size and avoid memory leakage #1835

Switch to tcmalloc under Linux x86_64 to avoid memory fragmentation bug in LFAlloc

Fix for case of ignored text feature

Fixed application of baseline in C++ code. Moved addition of that before application of activation functions and determining labels of objects.

Fixes for scikit-learn compatibility validation #1783 and #1785

Fix for thread_count = -1 in set_params(). Issue #1800

Fix potential sigsegv in the model evaluator. Fixes #1809

Fix slow (u)int8 & (u)int16 parsing as catfeatures. Fixes #718

Adjust boost from average option before auto-learning rate

Fix embeddings with CrossEntropy mode #1654

Fix object importance #1820

Fix data provider without target #1827

Source code(tar.gz)
Source code(zip)
catboost-1.0.0.exe(187.88 MB)
catboost-darwin-1.0.0(26.45 MB)
catboost-linux-1.0.0(195.07 MB)
catboost-R-Darwin-1.0.0.tgz(8.05 MB)
catboost-R-Linux-1.0.0.tgz(68.43 MB)
catboost-R-Windows-1.0.0.tgz(67.63 MB)
catboostmodel.dll(8.92 MB)
libcatboostmodel.dylib(5.99 MB)
libcatboostmodel.so(9.60 MB)
libcatboostr-darwin.dylib(23.87 MB)
libcatboostr-darwin.so(23.87 MB)
libcatboostr-linux.so(183.89 MB)
libcatboostr.dll(186.78 MB)
v0.26.1(Aug 5, 2021)
R package

Supported text features in R package, thanks to @glemhel!

Supported virtual Ensembles in R, thanks to @glemhel!

New features

Thanks to @gmrandazzo for adding multiregression with missing values on targets - MultiRMSEWithMissingValues loss function

Supported multiclass prediction in C++ wrapper for model inference C API

Bugfixes

Renamed keyword parameter in predict_proba function from X to data, fixes #1785

R feature importances: remove pool argument, fix #1438 and #1772

Fix CUDA training on Windows, multiple issues. Main issue with details: #1735

Issue #1728: don't dereference pointers when there are no features

Fixed empty tree processing in feature strength calculation

Fixed missing loss graph points in select_features, #1775

Sort csr matrix indices, fixes #1749

Fix error "active CatBoost worker is already present in the current process" after previous training interruption or failure. #1795.

Fixed erroneous warnings from models validation after training with custom loss or custom error function. Fixes #873 Fixes #1169

Source code(tar.gz)
Source code(zip)
catboost-0.26.1.exe(169.85 MB)
catboost-darwin-0.26.1(26.40 MB)
catboost-linux-0.26.1(176.55 MB)
catboost-R-Darwin-0.26.1.tgz(7.92 MB)
catboost-R-Linux-0.26.1.tgz(60.59 MB)
catboost-R-Windows-0.26.1.tgz(60.03 MB)
catboostmodel.dll(8.91 MB)
libcatboostmodel.dylib(6.06 MB)
libcatboostmodel.so(9.69 MB)
libcatboostr-darwin.so(23.54 MB)
libcatboostr-linux.so(164.71 MB)
libcatboostr.dll(168.75 MB)
v0.26(Jun 3, 2021)
New features

#972. Add model evaluation on GPU. Thanks to @rakalexandra.

Support Langevin on GPU

Save class labels to models in cross validation

#1524. Return models after CV. Thanks to @vklyukin

[Python] #766. Add CatBoostRanker & pool.get_group_id_hash() for ranking. Thanks to @AnnaAraslanova

#262. Make CatBoost widget work in jupyter lab. Thanks to @Dm17r1y

[GPU only] Allow to add exponent to score aggregation function

Allow to specify threshold parameter for binary classification model. Thanks to @Keksozavr.

[C Model API] #503. Allow to specify prediction type.

[C Model API] #1201. Get predictions for a specific class.

Breaking changes

#1628. Use CUDA 11 by default. CatBoost GPU now requires Linux x86_64 Driver Version >= 450.51.06 Windows x86_64 Driver Version >= 451.82.

Losses and metrics

Add MRR and ERR metrics on CPU.

Add LambdaMart loss.

#1557. Add survivalAFT base logic. Thanks to @blatr.

#1286. Add Cox Proportional Hazards Loss. Thanks to @fibersel.

#1595. Provide object-oriented interface for setting up metric parameters. Thanks to @ks-korovina.

Change default YetiRank decay to 0.85 for better quality.

Python package

#1372. Custom logging stream in python package. Thanks to @DianaArapova.

#1304. Callback after iteration functionality. Thanks to @qoter.

R package

#251. Train parameter synonyms. Thanks to @ebalukova.

#252. Add eval_metrics. Thanks to @ebalukova.

Speedups

[Python] Speed up custom metrics and objectives with numba (if available)

[Python] #1710. Large speedup for cv dataset splitting by sklearn splitter

Other

Use Exact leaves estimation method as default on GPU

[Spark] #1632. Update version of Scala 2.11 for security reasons.

[Python] #1695. Explicitly specify WHEEL 'Root-Is-Purelib' value

Bugfixes

Fix default projection dimension for embeddings

Fix use_weights for some eval_metrics on GPU - use_weights=False is always respected now

[Spark] #1649. The earlyStoppingRounds parameter is not recognized

[Spark] #1650. Error when using the autoClassWeights parameter

[Spark] #1651. Error about "Auto-stop PValue" when using odType "Iter" and odWait

Fix usage of pairlogit weights for CPU fallback metrics when training on GPU

Source code(tar.gz)
Source code(zip)
catboost-0.26.exe(171.94 MB)
catboost-darwin-0.26(30.62 MB)
catboost-linux-0.26(177.24 MB)
catboost-R-Darwin-0.26.tgz(9.08 MB)
catboost-R-Linux-0.26.tgz(62.41 MB)
catboost-R-Windows-0.26.tgz(60.47 MB)
catboostmodel.dll(8.85 MB)
catboostmodel.lib(9.91 KB)
libcatboostmodel.dylib(8.87 MB)
libcatboostmodel.so(12.91 MB)
libcatboostr-darwin.so(27.45 MB)
libcatboostr-linux.so(172.14 MB)
libcatboostr.dll(170.80 MB)
v0.25.1(Apr 5, 2021)
Speedup

Now CatBoost uses non-owning NumPy arrays for passing C++ data to user-defined metric and loss functions in Python. This opens up lots of speedup possibilities: using those vectors in numba.jitted code, in Cython code, or just using NumPy vector functions. Thanks @micyril!
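For context, a user-defined metric is an object with three methods. The sketch below assumes the documented interface (is_max_optimal, evaluate, get_final_error) and implements a weighted RMSE in plain Python so it runs without CatBoost installed; the class name RmseMetric is illustrative. Because the inputs only need to support indexing and len(), the same code works whether it receives lists or the non-owning arrays mentioned above.

```python
import math

class RmseMetric:
    """Weighted RMSE written against the assumed custom-metric interface."""

    def is_max_optimal(self):
        # Smaller RMSE is better.
        return False

    def evaluate(self, approxes, target, weight):
        # approxes holds one container per prediction dimension; regression has one.
        approx = approxes[0]
        error_sum = 0.0
        weight_sum = 0.0
        for i in range(len(approx)):
            w = 1.0 if weight is None else weight[i]
            error_sum += w * (approx[i] - target[i]) ** 2
            weight_sum += w
        return error_sum, weight_sum

    def get_final_error(self, error, weight):
        # The tiny constant guards against division by zero on empty input.
        return math.sqrt(error / (weight + 1e-38))

metric = RmseMetric()
error_sum, weight_sum = metric.evaluate([[1.0, 2.0, 3.0]], [1.0, 2.0, 5.0], None)
final_error = metric.get_final_error(error_sum, weight_sum)
print(final_error)
```

Under that assumption, an instance would be passed to training as eval_metric=RmseMetric().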

Bugfixes

Fix #1620 - retrieval of R pointers by @david-cortes

Fix EvalMetricsResult.get_metric() by @Roffild

Fix multiclass AUC calculation #1615

Source code(tar.gz)
Source code(zip)
catboost-0.25.1.exe(152.10 MB)
catboost-darwin-0.25.1(30.53 MB)
catboost-linux-0.25.1(158.35 MB)
catboost-R-Darwin-0.25.1.tgz(8.96 MB)
catboost-R-Linux-0.25.1.tgz(62.42 MB)
catboost-R-Windows-0.25.1.tgz(61.06 MB)
catboostmodel.dll(6.60 MB)
catboostmodel.lib(6.80 KB)
libcatboostmodel.dylib(6.20 MB)
libcatboostmodel.so(9.62 MB)
libcatboostr-darwin.so(27.30 MB)
libcatboostr-linux.so(153.19 MB)
libcatboostr.dll(150.91 MB)
v0.25(Mar 24, 2021)
CatBoost for Apache Spark

This release includes CatBoost for Apache Spark package that supports training, model application and feature evaluation on Apache Spark platform. We've prepared CatBoost for Apache Spark introduction and CatBoost for Apache Spark Architecture videos for introduction. More details available at CatBoost for Apache Spark home page.

Feature selection

CatBoost supports a recursive feature elimination procedure: when you have lots of feature candidates and want to keep only the most influential ones, it trains models and selects only the strongest features by feature importance. You can find details in our tutorial.
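The general idea of recursive feature elimination can be sketched in a few lines. This is a generic illustration, not CatBoost's implementation: importance_fn stands in for "train a model on the current features and return their importances", and the toy importances are made up.

```python
def recursive_feature_elimination(features, importance_fn, num_to_keep, step=1):
    """Repeatedly drop the weakest features until num_to_keep remain."""
    selected = list(features)
    while len(selected) > num_to_keep:
        # Stand-in for retraining a model on the current feature subset.
        scores = importance_fn(selected)
        n_drop = min(step, len(selected) - num_to_keep)
        weakest = sorted(selected, key=lambda name: scores[name])[:n_drop]
        selected = [name for name in selected if name not in weakest]
    return selected

# Hypothetical, fixed importances used in place of a trained model.
TOY_IMPORTANCE = {"age": 0.5, "income": 0.3, "zip": 0.05, "noise_a": 0.01, "noise_b": 0.02}

def toy_importance(selected):
    return {name: TOY_IMPORTANCE[name] for name in selected}

kept = recursive_feature_elimination(list(TOY_IMPORTANCE), toy_importance, num_to_keep=2, step=2)
print(kept)
```

A larger step retrains fewer models at the cost of a coarser ranking, since several features are dropped on the strength of a single importance estimate.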

New features

Supported exact leaves estimation method for quantile, MAE and MAPE losses on GPU. You can enable it by setting leaf_estimation_method=Exact explicitly, in next releases we are planning to set it by default.

Supported uncertainty prediction for multiclassification models

#1568 Added support for SHAP values calculation for MultiRMSE models

#1520 Added support for pathlib.Path in python package

#1456 Added prehashed categorical features and text features to C API for model inference.

Losses and metrics

Supported Huber and Tweedie losses in GPU training

QueryAUC metric implemented by @fibersel

Breaking changes

We changed the NDCG calculation principle for groups without relevant docs to make our NDCG score fully compatible with the XGBoost and LightGBM implementations. Now we calculate dcg==1 when there are no relevant objects in a group (when ideal DCG equals zero); previously we used score==0 in that case.
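The edge case can be illustrated with a small NDCG sketch. It assumes a linear gain and a log2 position discount, which is one common DCG variant and not necessarily CatBoost's default; the function names are illustrative.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances_in_predicted_order):
    """NDCG that returns 1 for groups with no relevant objects (ideal DCG of zero)."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    if ideal == 0:
        # New convention: a group without relevant objects scores 1 instead of 0.
        return 1.0
    return dcg(relevances_in_predicted_order) / ideal

print(ndcg([0, 0, 0]), ndcg([3, 2, 1]), ndcg([0, 1]))
```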

Speedups

With help of Intel developers team we switched our threading model implementation to Intel Threading Building Blocks. That gives us up to 20% speedup on 28 threads and around 2x speedup when training in 120 threads and largely improves scalability.

Speed up rendering fstat plots.

Slightly speed up string casting in python package during pool creation.

R package

Added path expansion when saving/loading files in R by @david-cortes

Added functionality to restore R handle after deserializing model by @david-cortes

Retrieve R pointers outside loops to speed up scalar access by @david-cortes

Multiple R documentation edits from @david-cortes and @jameslamb

#1588 Added precision for converting params to json

Bugfixes

#1525 Problem with missing exported functions in Windows R package dll

#1315 Low CPU utilization in CPU cross-validation

#785 Predict on single item with iloc fixed by @feeeper

Segfaults due to null pointer in pool in R package fixed by @david-cortes

#1553 Added check for baseline dimensions count in apply

#1606 Allow to use CatBoost in AWS Lambda environment: fix bug with setting thread names

#1609 and #1309 Print proper error message if all params in grid were invalid

Ability to use docstrings in estimators added by @pawelopiela

Allow extra space at the end of line for libsvm format

Thanks!

We would like to recognize Intel software engineering team’s contributions to Catboost project.

Many thanks to our individual contributors: @david-cortes @jameslamb @pawelopiela @feeeper @fibersel

Source code(tar.gz)
Source code(zip)
catboost-0.25.exe(152.10 MB)
catboost-darwin-0.25(30.53 MB)
catboost-linux-0.25(158.21 MB)
catboost-R-Darwin-0.25.tgz(8.96 MB)
catboost-R-Linux-0.25.tgz(62.40 MB)
catboost-R-Windows-0.25.tgz(61.06 MB)
catboostmodel.dll(6.60 MB)
libcatboostr-darwin.so(27.30 MB)
libcatboostr-linux.so(153.05 MB)
libcatboostr.dll(150.91 MB)
v0.24.4(Dec 27, 2020)
Release 0.24.4

Speedup

Major speedup of asymmetric trees training time on CPU (2x speedup on Epsilon with 16 threads). We would like to recognize the Intel software engineering team's contributions to the CatBoost project.

New features

From now on we are releasing Python 3.9 wheels. Related issues: #1491, #1509, #1510

Allow boost_from_average for MultiRMSE loss. Issue #1515

Add tag pairwise=False for sklearn compatibility. Fixes issue #1518

Bugfixes:

Allow fstr calculation for datasets with embeddings

Fix feature_importances_ for fstr with texts

Virtual ensembles fix: use proper unshrinkage coefficients

Fixed constants in RMSEWithUncertainty loss function calculation to match the values from the original paper

Allow shap values calculation for model with zero-weights and non-zero leaf values. Now we use sum of leaf weights on train and current dataset to guarantee non-zero weights for leafs, reachable on current dataset. Fixes issues #1512, #1284

Source code(tar.gz)
Source code(zip)
catboost-0.24.4.exe(150.41 MB)
catboost-darwin-0.24.4(31.23 MB)
catboost-linux-0.24.4(155.49 MB)
catboost-R-Darwin-0.24.4.tgz(8.97 MB)
catboost-R-Linux-0.24.4.tgz(60.99 MB)
catboost-R-Windows-0.24.4.tgz(59.69 MB)
catboostmodel.dll(7.96 MB)
libcatboostr-darwin.so(28.10 MB)
libcatboostr-linux.so(151.91 MB)
libcatboostr.dll(149.16 MB)
v0.24.3(Nov 18, 2020)
Release 0.24.3

New functionality

Support fstr text features and embeddings. Issue #1293

Bugfixes:

Fix model apply speed regression introduced in 0.24.1

Different fixes in embeddings support: fixed apply and model serialization, fixed apply on texts and embeddings

Fixed virtual ensembles prediction - use proper scaling, fix apply (issue #1462)

Fix score() method for RMSEWithUncertainty issue #1482

Automatically use correct prediction_type in score()

Source code(tar.gz)
Source code(zip)
catboost-0.24.3.exe(150.06 MB)
catboost-darwin-0.24.3(31.67 MB)
catboost-linux-0.24.3(156.09 MB)
catboost-R-Darwin-0.24.3.tgz(9.15 MB)
catboost-R-Linux-0.24.3.tgz(61.29 MB)
catboost-R-Windows-0.24.3.tgz(59.65 MB)
catboostmodel.dll(7.95 MB)
libcatboostr-darwin.so(28.67 MB)
libcatboostr-linux.so(152.76 MB)
libcatboostr.dll(148.99 MB)
v0.24.2(Oct 7, 2020)
Uncertainty prediction

Supported uncertainty prediction for classification models.

Fixed RMSEWithUncertainty data uncertainty prediction - now it predicts variance, not standard deviation.

New functionality

Allow categorical feature counters for MultiRMSE loss function.

group_weight parameter added to the catboost.utils.eval_metric method to allow passing weights for object groups. This makes it possible to correctly match weighted ranking metrics computation when group weights are present.

Faster non-owning deserialization from memory with less memory overhead - moved some dynamically computed data to model file, other data is computed in lazy manner only when needed.

Experimental functionality

Supported embedding features as input and linear discriminant analysis for embeddings preprocessing. Try adding your embeddings as new columns holding arrays of embedding values in a pandas.DataFrame and passing the corresponding column names to the Pool constructor or fit function with the embedding_features=['EmbeddingFeaturesColumnName1', ...] parameter. Another way of adding your embedding vectors is the new NumVector column type in the Column Description file, together with a semicolon-separated embeddings column in your XSV file: ClassLabel\t0.1;0.2;0.3\t....
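As an illustration of the file layout only, the sketch below parses one tab-separated row in which a given column holds a semicolon-separated embedding. The helper parse_row and the sample row are hypothetical; CatBoost reads such files itself once the column is declared as NumVector.

```python
def parse_row(line, embedding_columns, delimiter="\t"):
    """Split one XSV row and turn semicolon-separated embedding columns into float lists."""
    fields = line.rstrip("\n").split(delimiter)
    parsed = []
    for index, field in enumerate(fields):
        if index in embedding_columns:
            parsed.append([float(value) for value in field.split(";")])
        else:
            parsed.append(field)
    return parsed

# Column 0 is the label, column 1 the embedding, column 2 an ordinary feature.
row = parse_row("ClassLabel\t0.1;0.2;0.3\tred\n", embedding_columns={1})
print(row)
```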

Educational materials

Published new tutorial on uncertainty prediction.

Bugfixes:

Reduced GPU memory usage in multi gpu training when there is no need to compute categorical feature counters.

Now CatBoost allows to specify use_weights for metrics when auto_class_weights parameter is set.

Correctly handle NaN values in plot_predictions function.

Fixed bugs related to a floating point precision drop during Multiclass training with lots of objects. In our case, the bug was triggered while training on 25 million objects on a single GPU card.

Now average parameter is passed to TotalF1 metric while training on GPU.

Added class labels checks

Disallow feature remapping in model predict when there are empty feature names in the model.

Source code(tar.gz)
Source code(zip)
catboost-0.24.2.exe(149.78 MB)
catboost-darwin-0.24.2(31.46 MB)
catboost-linux-0.24.2(155.73 MB)
catboost-R-Darwin-0.24.2.tgz(9.07 MB)
catboost-R-Linux-0.24.2.tgz(61.17 MB)
catboost-R-Windows-0.24.2.tgz(59.55 MB)
catboostmodel.dll(7.92 MB)
libcatboostr-darwin.so(28.46 MB)
libcatboostr-linux.so(152.40 MB)
libcatboostr.dll(148.69 MB)
v0.24.1(Aug 27, 2020)
Uncertainty prediction

The main feature of this release is total uncertainty prediction support via virtual ensembles. You can read the theoretical background in the preprint Uncertainty in Gradient Boosting via Ensembles from our research team. We introduced a new training parameter, posterior_sampling, that allows estimating total uncertainty. Setting posterior_sampling=True implies enabling Langevin boosting, setting model_shrink_rate to 1/(2*N) and setting diffusion_temperature to N, where N is the dataset size. The CatBoost object method virtual_ensembles_predict splits the model into virtual_ensembles_count submodels. Calling model.virtual_ensembles_predict(.., prediction_type='TotalUncertainty') returns the mean prediction, the variance (and the knowledge uncertainty for models trained with the RMSEWithUncertainty loss function). Calling model.virtual_ensembles_predict(.., prediction_type='VirtEnsembles') returns virtual_ensembles_count predictions of virtual submodels for each object.
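
The parameter relationship described above can be sketched in plain Python. This only restates the arithmetic from the notes; the helper name implied_posterior_sampling_settings is hypothetical and is not part of CatBoost.

```python
def implied_posterior_sampling_settings(dataset_size):
    """Return the settings that posterior_sampling=True implies, per the notes.

    Hypothetical helper for illustration only; CatBoost derives these
    values internally when posterior_sampling=True is passed.
    """
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    return {
        "langevin": True,  # Langevin boosting is enabled
        "model_shrink_rate": 1.0 / (2 * dataset_size),  # 1/(2*N)
        "diffusion_temperature": float(dataset_size),  # N
    }


# Example: a dataset with N = 1000 objects.
settings = implied_posterior_sampling_settings(1000)
```

For N = 1000 this gives a model_shrink_rate of 1/2000 and a diffusion_temperature of 1000.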

New functionality

Supported non-owning model deserialization for models with categorical feature counters

Speedups

We've made lots of speedups for sparse data loading. For example, preprocessing of the bosch sparse dataset became 4.5x faster when running with 28 threads.

Bugfixes:

Fixed target check for PairLogitPairwise on GPU. Issue #1217

Supported n_features_in_ attribute required for using CatBoost in sklearn pipelines. Issue #1363

Source code(tar.gz)
Source code(zip)
catboost-0.24.1.exe(149.54 MB)
catboost-darwin-0.24.1(31.10 MB)
catboost-linux-0.24.1(155.33 MB)
catboost-R-Darwin-0.24.1.tgz(8.96 MB)
catboost-R-Linux-0.24.1.tgz(61.05 MB)
catboost-R-Windows-0.24.1.tgz(59.49 MB)
catboostmodel.dll(7.84 MB)
libcatboostr-darwin.so(28.14 MB)
libcatboostr-linux.so(152.04 MB)
libcatboostr.dll(148.46 MB)
v0.24(Aug 5, 2020)
New functionality

We've finally implemented MVS sampling for GPU training. Switched default bootstrap algorithm to MVS for RMSE loss function while training on GPU

Implemented near-zero cost model deserialization from memory blob. Currently, if your model doesn't use categorical features CTR counters and text features you can deserialize model from, for example, memory-mapped file.

Added the ability to load trained models from a binary string or a file-like stream. To load a model from a bytes string, use load_model(blob=b'....'); to deserialize from a file-like stream, use load_model(stream=gzip.open('model.cbm.gz', 'rb')).
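
A minimal sketch of both loading paths follows. The wrapper names load_model_from_gzip and load_model_from_bytes are hypothetical; they only assume that the model object exposes load_model with the blob and stream keyword arguments named in the note above.

```python
import gzip


def load_model_from_gzip(model, path):
    """Load a gzip-compressed model file through a file-like stream.

    `model` is any object exposing load_model(stream=...), for example
    a CatBoost model instance as described in the note above.
    """
    with gzip.open(path, "rb") as stream:
        model.load_model(stream=stream)
    return model


def load_model_from_bytes(model, blob):
    """Load a model from an in-memory bytes string via load_model(blob=...)."""
    model.load_model(blob=blob)
    return model
```

Using a context manager closes the gzip stream once loading has finished, which the inline gzip.open(...) call in the note leaves to the garbage collector.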

Fixed auto-learning rate estimation params for GPU

Supported beta parameter for QuerySoftMax function on CPU and GPU

New losses and metrics

New loss function RMSEWithUncertainty, which allows estimating data uncertainty for trained regression models. The trained model returns a two-element vector for each object: the first element is the regression prediction and the second element is an estimate of the data uncertainty for that prediction.
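
The two-element output can be unpacked as in the sketch below. The numbers are made up for illustration and the helper name is hypothetical. The sketch also assumes the second element is a variance, which the v0.24.2 notes state is the behavior from that release onward.

```python
import math


def split_uncertainty_predictions(predictions):
    """Split [[prediction, uncertainty], ...] rows into two parallel lists."""
    means = [row[0] for row in predictions]
    uncertainties = [row[1] for row in predictions]
    return means, uncertainties


# Hypothetical model output for three objects (made-up numbers).
raw = [[2.0, 0.25], [3.5, 1.0], [-1.0, 4.0]]
means, variances = split_uncertainty_predictions(raw)

# Assumption: the second element is a variance, so the square root
# gives a standard deviation in the units of the target.
std_devs = [math.sqrt(v) for v in variances]
```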

Speedups

Major speedups for CPU training: kdd98 -9%, higgs -18%, msrank -28%. We would like to recognize the Intel software engineering team's contributions to the CatBoost project. This was a mutually beneficial activity, and we look forward to continuing joint cooperation.

Bugfixes:

Fixed CatBoost model export as Python code

Fixed AUC metric creation

Add text features to model.feature_names_. Issue #1314

Allow models, trained on datasets with NaN values (Min treatment) and without NaNs in model_sum() or as the base model in init_model=. Issue #1271

Educational materials

Published new tutorial on categorical features parameters. Thanks @garkavem

Source code(tar.gz)
Source code(zip)
catboost-0.24.exe(149.30 MB)
catboost-darwin-0.24(30.81 MB)
catboost-linux-0.24(154.99 MB)
catboost-R-Darwin-0.24.tgz(8.83 MB)
catboost-R-Linux-0.24.tgz(60.92 MB)
catboost-R-Windows-0.24.tgz(59.39 MB)
catboostmodel.dll(7.80 MB)
libcatboostr-darwin.so(27.85 MB)
libcatboostr-linux.so(151.72 MB)
libcatboostr.dll(148.23 MB)
v0.23.2(May 26, 2020)
New functionality

Added plot_partial_dependence method in python-package (currently it works only for models with symmetric trees trained on datasets with numerical features only). Implemented by @felixandrer.

Allowed using boost_from_average option together with model_shrink_rate option. In this case shrinkage is applied to the starting value.

Added new auto_class_weights option in python-package, R-package and cli with possible values Balanced and SqrtBalanced. For Balanced, every class is weighted by maxSumWeightInClass / sumWeightInClass, where sumWeightInClass is the sum of weights of all samples in this class (if no weights are present, each sample weight is 1) and maxSumWeightInClass is the maximum sum of weights among all classes. For SqrtBalanced the formula is sqrt(maxSumWeightInClass / sumWeightInClass). This option is supported in binclass and multiclass tasks. Implemented by @egiby.
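
The two formulas can be checked by hand with a small sketch. The function below is a hypothetical re-implementation of the arithmetic stated above, not the library's code.

```python
import math
from collections import defaultdict


def auto_class_weights(labels, mode="Balanced", sample_weights=None):
    """Compute per-class weights using the formulas from the note above."""
    if sample_weights is None:
        # If no weights are present, every sample weight is 1.
        sample_weights = [1.0] * len(labels)

    sum_weight_in_class = defaultdict(float)
    for label, weight in zip(labels, sample_weights):
        sum_weight_in_class[label] += weight

    max_sum_weight_in_class = max(sum_weight_in_class.values())
    ratios = {
        cls: max_sum_weight_in_class / total
        for cls, total in sum_weight_in_class.items()
    }
    if mode == "Balanced":
        return ratios
    if mode == "SqrtBalanced":
        return {cls: math.sqrt(ratio) for cls, ratio in ratios.items()}
    raise ValueError("mode must be 'Balanced' or 'SqrtBalanced'")


# Three samples of class "a" and one of class "b".
labels = ["a", "a", "a", "b"]
balanced = auto_class_weights(labels, "Balanced")
sqrt_balanced = auto_class_weights(labels, "SqrtBalanced")
```

With these labels the largest class keeps weight 1, the rare class gets weight 3 under Balanced, and sqrt(3) under SqrtBalanced.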

Supported model_size_reg option on GPU. Set to 0.5 by default (same as on CPU). This regularization works slightly differently on GPU: feature combinations are regularized more aggressively than on CPU. On CPU, the cost of a combination is equal to the number of different feature values in this combination that are present in the training dataset. On GPU, the cost of a combination is equal to the number of all possible different values of this combination. For example, if a combination contains two categorical features c1 and c2, then the cost will be #categories in c1 * #categories in c2, even though many of the values from this combination might not be present in the dataset.
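
The difference between the two cost definitions can be illustrated with a toy dataset. Both functions are hypothetical sketches of the counting rule described above. They assume that "#categories" means the number of distinct values observed per feature, and they do not reproduce how CatBoost applies the cost during training.

```python
from math import prod  # available in Python 3.8+


def cpu_combination_cost(columns):
    """Count value tuples of the combination that actually occur in the data."""
    return len(set(zip(*columns)))


def gpu_combination_cost(columns):
    """Count all possible value tuples: the product of per-feature cardinalities."""
    return prod(len(set(column)) for column in columns)


# Two categorical features over four training objects.
c1 = ["a", "a", "b", "b"]
c2 = ["x", "y", "x", "x"]

cpu_cost = cpu_combination_cost([c1, c2])  # (a,x), (a,y), (b,x)
gpu_cost = gpu_combination_cost([c1, c2])  # 2 categories * 2 categories
```

Here the pair (b, y) never occurs, so the CPU-style cost is 3 while the GPU-style cost is 4.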

Added calculation of exact Shapley values (see formula (2) from https://arxiv.org/pdf/1802.03888.pdf). By default, the estimation from this paper (Algorithm 2) is calculated, which is much faster. To use the exact mode, specify the shap_calc_type parameter of the CatBoost.get_feature_importance function as "Exact". Implemented by @LordProtoss.

Bugfixes:

Fixed onnx converter for old onnx versions.

Source code(tar.gz)
Source code(zip)
catboost-0.23.2.exe(146.01 MB)
catboost-darwin-0.23.2(30.50 MB)
catboost-linux-0.23.2(152.60 MB)
catboost-R-Darwin-0.23.2.tgz(8.75 MB)
catboost-R-Linux-0.23.2.tgz(60.01 MB)
catboost-R-Windows-0.23.2.tgz(58.11 MB)
catboostmodel.dll(7.75 MB)
libcatboostr-darwin.so(27.59 MB)
libcatboostr-linux.so(149.47 MB)
libcatboostr.dll(144.95 MB)
v0.23.1(May 15, 2020)
New functionality

A CatBoost model can now be converted into an ONNX object in Python with the catboost.utils.convert_to_onnx_object method. Implemented by @monkey0head

We now print metric options with metric names as metric description in error logs by default. This allows you to distinguish between metrics of the same type with different parameters. For example, if the user sets the weighted average TotalF1 metric, CatBoost will print TotalF1:average=Weighted as the corresponding metric column header in error logs. Implemented by @ivanychev

Implemented PRAUC metric (issue #737). Thanks @azikmsu

It's now possible to write custom multiregression objective in Python. Thanks @azikmsu

Supported nonsymmetric models export to PMML

class_weights parameter accepts dictionary with class name to class weight mapping

Added _get_tags() method for compatibility with sklearn (issue #1282). Implemented by @crazyleg

Lots of improvements in the .Net CatBoost library: implemented the IDisposable interface, split ML.NET compatible and basic prediction classes into separate libraries, added base UNIX compatibility, supported GPU model evaluation, fixed tests. Thanks @khanova

In addition to first_feature_use_penalties presented in the previous release, we added a new option, per_object_feature_penalties, which considers feature usage on each object individually. For more details refer to the tutorial.

Breaking changes

From now on we require explicit loss_function param in python cv method.

Bugfixes:

Fixed deprecation warning on import (issue #1269)

Fixed saved models logging_level/verbose parameters conflict (issue #696)

Fixed kappa metric: in some cases there was an integer overflow, so accumulation types were switched to double

Fixed per float feature quantization settings defaults

Educational materials

Extended shap values tutorial with summary plot examples. Thanks @azanovivan02

Source code(tar.gz)
Source code(zip)
catboost-0.23.1.exe(145.87 MB)
catboost-darwin-0.23.1(30.48 MB)
catboost-linux-0.23.1(152.44 MB)
catboost-R-Darwin-0.23.1.tgz(8.74 MB)
catboost-R-Linux-0.23.1.tgz(59.96 MB)
catboost-R-Windows-0.23.1.tgz(58.07 MB)
catboostmodel.dll(7.73 MB)
libcatboostr-darwin.so(27.57 MB)
libcatboostr-linux.so(149.32 MB)
libcatboostr.dll(144.81 MB)
v0.23(Apr 25, 2020)
New functionality

It is possible now to train models on huge datasets that do not fit into CPU RAM. This can be accomplished by storing only quantized data in memory (it is many times smaller). Use catboost.utils.quantize function to create quantized Pool this way. See usage example in the issue #1116. Implemented by @noxwell.

The Python Pool class now has a save_quantization_borders method that saves the resulting borders into a file so they can be used for quantization of other datasets. Quantization can be a bottleneck of training, especially on GPU. Doing quantization once for several trainings can significantly reduce running time. For large datasets it is recommended to perform quantization first, save the quantization borders, use them to quantize the validation dataset, and then use the quantized training and validation datasets for further training. Use saved borders when quantizing other Pools by specifying the input_borders parameter of the quantize method. Implemented by @noxwell.
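
The recommended workflow can be sketched as a small helper. The function name quantize_with_shared_borders and the borders file name are hypothetical; the sketch only relies on the quantize and save_quantization_borders methods and the input_borders parameter named above, and passes no other quantization options.

```python
def quantize_with_shared_borders(train_pool, validation_pool, borders_path):
    """Quantize the training pool once and reuse its borders for validation.

    Both arguments are expected to be Pool-like objects exposing
    quantize() and save_quantization_borders(), as described above.
    """
    # 1. Quantize the training data; borders are computed here.
    train_pool.quantize()
    # 2. Persist the borders so other datasets can reuse them.
    train_pool.save_quantization_borders(borders_path)
    # 3. Quantize the validation data with exactly the same borders.
    validation_pool.quantize(input_borders=borders_path)
    return train_pool, validation_pool
```

A call might look like quantize_with_shared_borders(train_pool, validation_pool, "borders.dat"), after which both pools can be passed to several training runs without repeating quantization.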

Training with text features is now supported on CPU

It is now possible to set border_count > 255 for GPU training. This might be useful if you have a "golden feature", see docs.

Feature weights are implemented. Specify weights for specific features by index or name like feature_weights="FeatureName1:1.5,FeatureName2:0.5". Scores for splits with these features will be multiplied by the corresponding weights. Implemented by @Taube03.
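
As an illustration, the string format and the effect on split scores can be sketched as follows. Both helpers are hypothetical, and the sketch assumes that a feature without an explicit weight behaves as if its weight were 1.

```python
def format_feature_weights(weights):
    """Build a 'Name1:w1,Name2:w2' string from a {feature: weight} mapping."""
    return ",".join(f"{feature}:{weight}" for feature, weight in weights.items())


def weighted_split_score(score, feature, weights):
    """Multiply a split score by the weight of the feature it splits on.

    Assumption: features missing from `weights` keep their score unchanged.
    """
    return score * weights.get(feature, 1.0)


weights = {"FeatureName1": 1.5, "FeatureName2": 0.5}
feature_weights_param = format_feature_weights(weights)
boosted = weighted_split_score(2.0, "FeatureName1", weights)
dampened = weighted_split_score(2.0, "FeatureName2", weights)
```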

Feature penalties can be used for cost efficient gradient boosting. Penalties are specified in a similar fashion to feature weights, using the parameter first_feature_use_penalties. This parameter penalizes the first usage of a feature and should be used when the calculation of the feature is costly. The penalty value (or the cost of using a feature) is subtracted from the scores of the splits of this feature if the feature has not been used in the model yet. After the feature has been used once, it is considered free to keep using it, so no subtraction is done. There is also a common multiplier for all first-use penalties; it can be specified with the penalties_coefficient parameter. Implemented by @Taube03 (issue #1155)
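
The first-use penalty rule can be sketched in a few lines. The helper below is hypothetical, and it assumes the common multiplier simply scales each feature's penalty before subtraction.

```python
def penalized_split_score(score, feature, penalties, used_features,
                          penalties_coefficient=1.0):
    """Apply a first-use penalty to a split score.

    The penalty is subtracted only while the feature has not yet been
    used in the model; afterwards the feature is considered free.
    """
    if feature in used_features:
        return score
    return score - penalties_coefficient * penalties.get(feature, 0.0)


penalties = {"expensive_feature": 0.5}
used = set()

first_use = penalized_split_score(2.0, "expensive_feature", penalties, used)
scaled = penalized_split_score(2.0, "expensive_feature", penalties, used,
                               penalties_coefficient=2.0)

# Once the feature has been selected, later splits are not penalized.
used.add("expensive_feature")
later_use = penalized_split_score(2.0, "expensive_feature", penalties, used)
```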

recordCount attribute is added to PMML models (issue #1026).

New losses and metrics

New ranking objective 'StochasticRank', details in paper.

Tweedie loss is supported now. It can be a good solution for right-skewed target with many zero values, see tutorial. When using CatBoostRegressor.predict function, default prediction_type for this loss will be equal to Exponent. Implemented by @ilya-pchelintsev (issue #577)

Classification metrics now support a new parameter proba_border. With this parameter you can set decision boundary for treating prediction as negative or positive. Implemented by @ivanychev.
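
To show what moving the decision boundary does, here is a minimal sketch with made-up probabilities. The helper is hypothetical and assumes that a prediction counts as positive when its probability is strictly greater than the border; the library's exact comparison may differ.

```python
def classify_with_border(probabilities, proba_border=0.5):
    """Turn positive-class probabilities into 0/1 labels using a border."""
    return [1 if p > proba_border else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0, 1, 1, 1]
probabilities = [0.2, 0.4, 0.6, 0.9]

default_labels = classify_with_border(probabilities)
lowered_labels = classify_with_border(probabilities, proba_border=0.3)
```

Lowering the border from 0.5 to 0.3 turns the second object into a positive prediction, which changes every metric computed from the hard labels.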

Metric TotalF1 supports a new parameter average with possible values weighted, micro, macro. Implemented by @ilya-pchelintsev.
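
The three averaging modes follow the usual textbook definitions, sketched below in plain Python. This is an independent illustration with a hypothetical function name; CatBoost's own implementation may differ in details such as the handling of weights or empty classes.

```python
from collections import Counter


def f1_averages(y_true, y_pred):
    """Compute macro, weighted and micro averaged F1 for multiclass labels."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_count, fp_count, fn_count):
        denominator = 2 * tp_count + fp_count + fn_count
        return 2 * tp_count / denominator if denominator else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in classes}
    support = Counter(y_true)  # number of true objects per class
    return {
        # Unweighted mean of per-class F1.
        "macro": sum(per_class.values()) / len(classes),
        # Mean of per-class F1 weighted by class support.
        "weighted": sum(per_class[c] * support[c] for c in classes) / len(y_true),
        # F1 computed from counts pooled over all classes.
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
    }


scores = f1_averages([0, 0, 1, 1], [0, 1, 1, 1])
```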

It is now possible to specify a custom multi-label metric in Python. Note that this metric can only be calculated and used as eval_metric. It is not possible to use it as an optimization objective. To write a multi-label metric, you need to define a Python class which inherits from the MultiLabelCustomMetric class. Implemented by @azikmsu.

Improvements of grid and randomized search

class_weights parameter is now supported in grid/randomized search. Implemented by @vazgenk.

Invalid option configurations are automatically skipped during grid/randomized search. Implemented by @borzunov.

get_best_score returns train/validation best score after grid/randomized search (in case of refit=False). Implemented by @rednevaler.

Improvements of model analysis tools

Computation of SHAP interaction values for CatBoost models. You can pass type=EFstrType.ShapInteractionValues to CatBoost.get_feature_importance to get a matrix of SHAP values for every prediction. By default, SHAP interaction values are calculated for all features. You may specify features of interest using the interaction_indices argument. Implemented by @IvanKozlov98.

SHAP values can be calculated approximately now which is much faster than default mode. To use this mode specify shap_calc_type parameter of CatBoost.get_feature_importance function as "Approximate". Implemented by @LordProtoss (issue #1146).

PredictionDiff model analysis method can now be used with models that contain non symmetric trees. Implemented by @felixandrer.

New educational materials

A tutorial on tweedie regression

A tutorial on poisson regression

A detailed tutorial on different types of AUC metric, which explains how different types of AUC can be used for binary classification, multiclassification and ranking tasks.

Breaking changes

When using CatBoostRegressor.predict function for models trained with Poisson loss, default prediction_type will be equal to Exponent (issue #1184). Implemented by @garkavem.

This release also contains bug fixes and performance improvements, including a major speedup for sparse data on GPU.
Source code(tar.gz)
Source code(zip)
catboost-0.23.exe(145.82 MB)
catboost-darwin-0.23(30.60 MB)
catboost-linux-0.23(152.59 MB)
catboost-R-Darwin-0.23.tgz(8.73 MB)
catboost-R-Linux-0.23.tgz(59.97 MB)
catboost-R-Windows-0.23.tgz(58.05 MB)
catboostmodel.dll(7.74 MB)
libcatboostr-darwin.so(27.68 MB)
libcatboostr-linux.so(149.51 MB)
libcatboostr.dll(144.77 MB)
v0.22(Mar 2, 2020)
New features:

The main feature of the release is the support of non symmetric trees for training on CPU. Using non symmetric trees might be useful if one-hot encoding is present, or data has little noise. To try non symmetric trees change grow_policy parameter. Starting from this release non symmetric trees are supported for both CPU and GPU training.

The next big feature improves catboost text features support. Tokenization is now done during training: you don't have to do lowercasing, digit extraction and other tokenization on your own, catboost does it for you.

Auto learning-rate is now supported in CPU MultiClass mode.

CatBoost class supports to_regressor and to_classifier methods.

The release also contains a list of bug fixes.
Source code(tar.gz)
Source code(zip)
catboost-0.22.exe(145.64 MB)
catboost-darwin-0.22(31.18 MB)
catboost-linux-0.22(152.56 MB)
catboost-R-Darwin-0.22.tgz(8.70 MB)
catboost-R-Linux-0.22.tgz(59.80 MB)
catboost-R-Windows-0.22.tgz(57.92 MB)
catboostmodel.dll(7.60 MB)
libcatboostr-darwin.so(28.34 MB)
libcatboostr-linux.so(149.74 MB)
libcatboostr.dll(144.61 MB)
v0.21(Jan 31, 2020)
New features:

The main feature of this release is the Stochastic Gradient Langevin Boosting (SGLB) mode that can improve quality of your models with non-convex loss functions. To use it specify langevin option and tune diffusion_temperature and model_shrink_rate. See the corresponding paper for details.

Improvements:

Automatic learning rate is applied by default not only for Logloss objective, but also for RMSE (on CPU and GPU) and MultiClass (on GPU).

Class labels type information is stored in the model. Now estimators in python package return values of proper type in classes_ attribute and for prediction functions with prediction_type=Class. #305, #999, #1017. Note: Class labels loaded from datasets in CatBoost dsv format always have string type now.

Bug fixes:

Fixed huge memory consumption for text features. #1107

Fixed crash on GPU on big datasets with groups (hundred million+ groups).

Fixed class labels consistency check and merging in model sums (now class names in binary classification are properly checked and added to the result as well)

Fix for confusion matrix (PR #1152), thanks to @dmsivkov.

Fixed shap values calculation when boost_from_average=True. #1125

Fixed use-after-free in fstr PredictionValuesChange with specified dataset

Target border and class weights are now taken from model when necessary for feature strength, metrics evaluation, roc_curve, object importances and calc_feature_statistics calculations.

Fixed that L2 regularization was not applied for non symmetric trees for binary classification on GPU.

[R-package] Fixed the bug that catboost.get_feature_importance did not work after model is loaded #1064

[R-package] Fixed the bug that catboost.train did not work when called with the single dataset parameter. #1162

Fixed L2 score calculation on CPU

Other:

Starting from this release Java applier is released simultaneously with other components and has the same version.

Compatibility:

Models trained with this release require applier from this release or later to work correctly.

Source code(tar.gz)
Source code(zip)
catboost-0.21.exe(145.87 MB)
catboost-darwin-0.21(31.04 MB)
catboost-linux-0.21(156.01 MB)
catboost-R-Darwin-0.21.tgz(8.71 MB)
catboost-R-Linux-0.21.tgz(59.44 MB)
catboost-R-Windows-0.21.tgz(57.98 MB)
catboostmodel.dll(7.56 MB)
libcatboostr-darwin.so(28.32 MB)
libcatboostr-linux.so(153.32 MB)
libcatboostr.dll(144.92 MB)
v0.20.2(Dec 25, 2019)
New features:

String class labels are now supported for binary classification

[CLI only] Timestamp column for the datasets can be provided in separate files.

[CLI only] Timesplit feature evaluation.

Process groups of any size in block processing.

Bug fixes:

classes_count and class_weight params can be now used with user-defined loss functions. #1119

Form correct metric descriptions on GPU if use_weights gets value by default. #1106

Correct model.classes_ attribute for binary classification (proper labels instead of always 0 and 1). #984

Fix model.classes_ attribute when classes_count parameter was specified.

Proper error message when categorical features specified for MultiRMSE training. #1112

Block processing: It is valid for all groups in a single block to have weights equal to 0

Fix empty asymmetric tree index calculation. #1104

Source code(tar.gz)
Source code(zip)
catboost-0.20.2.exe(145.23 MB)
catboost-darwin-0.20.2(30.71 MB)
catboost-linux-0.20.2(155.60 MB)
catboost-R-Darwin-0.20.2.tgz(8.64 MB)
catboost-R-Linux-0.20.2.tgz(59.34 MB)
catboost-R-Windows-0.20.2.tgz(57.71 MB)
catboostmodel.dll(7.46 MB)
libcatboostr-darwin.so(28.03 MB)
libcatboostr-linux.so(152.97 MB)
libcatboostr.dll(144.31 MB)
v0.20.1(Dec 11, 2019)
New features:

Made leaf_estimation_method=Exact the default for MAPE loss

Add CatBoostClassifier.predict_log_proba(), PR #1095

Bug fixes:

Fix usability of read-only numpy arrays, #1101

Fix python3 compatibility for get_feature_importance, PR #1090

Fix loading model from snapshot for boost_from_average mode

Source code(tar.gz)
Source code(zip)
catboost-0.20.1.exe(145.07 MB)
catboost-darwin-0.20.1(30.36 MB)
catboost-linux-0.20.1(155.14 MB)
catboost-R-Darwin-0.20.1.tgz(8.48 MB)
catboost-R-Linux-0.20.1.tgz(59.09 MB)
catboost-R-Windows-0.20.1.tgz(57.70 MB)
catboostmodel.dll(7.45 MB)
libcatboostr-darwin.so(27.73 MB)
libcatboostr-linux.so(152.51 MB)
libcatboostr.dll(144.17 MB)
v0.20(Nov 28, 2019)
New submodule for text processing! It contains two classes to help you make text features ready for training:

Tokenizer -- use this class to split text into tokens (automatic lowercase and punctuation removal)

Dictionary -- with this class you create a dictionary which maps tokens to numeric identifiers. You then use these identifiers as new features.
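
The idea behind the two classes can be illustrated with a standard-library sketch. The functions below are hypothetical stand-ins that mimic the described behavior (lowercasing, punctuation removal, token-to-identifier mapping); they are not the API of the new submodule.

```python
import string


def tokenize(text):
    """Lowercase the text, strip ASCII punctuation and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def build_dictionary(documents):
    """Map each distinct token to a numeric identifier in order of first use."""
    token_to_id = {}
    for document in documents:
        for token in tokenize(document):
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def encode(text, token_to_id):
    """Replace known tokens with their identifiers, skipping unknown tokens."""
    return [token_to_id[t] for t in tokenize(text) if t in token_to_id]


documents = ["Cats purr.", "Dogs bark, cats nap!"]
dictionary = build_dictionary(documents)
encoded = encode(documents[1], dictionary)
```

The resulting identifier lists are what would then be used as new features.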

New features:

Enabled boost_from_average for MAPE loss function

Bug fixes:

Fixed Pool creation from pandas.DataFrame with discontinuous columns, #1079

Fixed standalone_evaluator, PR #1083

Speedups:

Huge speedup of preprocessing in python-package for datasets with many samples (>10 mln)

We also release precompiled packages for Python 3.8
Source code(tar.gz)
Source code(zip)
catboost-0.20.exe(144.51 MB)
catboost-darwin-0.20(28.96 MB)
catboost-linux-0.20(153.70 MB)
catboost-R-Darwin-0.20.tgz(8.26 MB)
catboost-R-Linux-0.20.tgz(58.74 MB)
catboost-R-Windows-0.20.tgz(56.97 MB)
catboostmodel.dll(5.95 MB)
libcatboostr-darwin.so(26.33 MB)
libcatboostr-linux.so(149.68 MB)
libcatboostr.dll(143.60 MB)
v0.19.1(Nov 19, 2019)
New features:

With this release we support Text features for classification on GPU. To specify text columns use text_features parameter. Achieve better quality by using text information of your dataset. See more in Learning CatBoost with text features

MultiRMSE loss function is now available on CPU. Labels for the multi regression mode should be specified in separate Label columns

MonoForest framework for model analysis, based on our NeurIPS 2019 paper. Learn more in MonoForest tutorial

boost_from_average is now True by default for Quantile and MAE loss functions, which improves the resulting quality

Speedups:

Huge reduction of preprocessing time for datasets loaded from files and for datasets with many samples (> 10 million), which was a bottleneck for GPU training

3x speedup for small datasets

Source code(tar.gz)
Source code(zip)
catboost-0.19.1.exe(144.28 MB)
catboost-darwin-0.19.1(28.45 MB)
catboost-linux-0.19.1(153.19 MB)
catboost-R-Darwin-0.19.1.tgz(8.20 MB)
catboost-R-Linux-0.19.1.tgz(58.68 MB)
catboost-R-Windows-0.19.1.tgz(56.93 MB)
catboostmodel.dll(5.93 MB)
libcatboostr-darwin.so(25.91 MB)
libcatboostr-linux.so(149.26 MB)
libcatboostr.dll(143.45 MB)
v0.18.1(Oct 31, 2019)
New features:

Now datasets.msrank() returns full msrank dataset. Previously, it returned the first 10k samples. We have added msrank_10k() dataset implementing the past behaviour.

Bug fixes:

get_object_importance() now respects parameter top_size, #1045 by @ibuda

Source code(tar.gz)
Source code(zip)
catboost-0.18.1.exe(144.13 MB)
catboost-darwin-0.18.1(28.25 MB)
catboost-linux-0.18.1(152.91 MB)
catboost-R-Darwin-0.18.1.tgz(8.16 MB)
catboost-R-Linux-0.18.1.tgz(58.63 MB)
catboost-R-Windows-0.18.1.tgz(56.88 MB)
catboostmodel.dll(5.89 MB)
libcatboostr-darwin.so(25.73 MB)
libcatboostr-linux.so(149.03 MB)
libcatboostr.dll(143.30 MB)
v0.18(Oct 21, 2019)
The main feature of the release is huge speedup on small datasets. We now use MVS sampling for CPU regression and binary classification training by default, together with Plain boosting scheme for both small and large datasets. This change not only gives the huge speedup but also provides quality improvement!

The boost_from_average parameter is available in CatBoostClassifier and CatBoostRegressor

We have added new formats for describing monotonic constraints. For example, "(1,0,0,-1)" or "0:1,3:-1" or "FeatureName0:1,FeatureName3:-1" are all valid specifications. With Python and params-file json, lists and dictionaries can also be used
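
To make the three string formats concrete, the sketch below normalizes each of them into a dictionary. The parser is hypothetical and written only for illustration. It assumes that a value of 0 means "no constraint" and can be dropped from the result.

```python
def parse_monotone_constraints(spec):
    """Parse a monotonic constraints string into {feature: constraint}.

    Supports the tuple form "(1,0,0,-1)" and the "key:value" forms keyed
    by feature index or by feature name.
    """
    spec = spec.strip()
    if spec.startswith("(") and spec.endswith(")"):
        # Positional form: one value per feature, in feature order.
        values = [int(value) for value in spec[1:-1].split(",")]
        return {index: value for index, value in enumerate(values) if value != 0}

    constraints = {}
    for item in spec.split(","):
        key, value = item.rsplit(":", 1)
        key = key.strip()
        # Purely numeric keys are feature indices; anything else is a name.
        constraints[int(key) if key.isdigit() else key] = int(value)
    return constraints


by_position = parse_monotone_constraints("(1,0,0,-1)")
by_index = parse_monotone_constraints("0:1,3:-1")
by_name = parse_monotone_constraints("FeatureName0:1,FeatureName3:-1")
```

The first two calls describe the same constraints: feature 0 is constrained with value 1 and feature 3 with value -1.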

Bugs fixed:

Error in Multiclass classifier training, #1040

Unhandled exception when saving quantized pool, #1021

Python 3.7: RuntimeError raised in StagedPredictIterator, #848

Source code(tar.gz)
Source code(zip)
catboost-0.18.exe(144.07 MB)
catboost-darwin-0.18(28.16 MB)
catboost-linux-0.18(152.81 MB)
catboost-R-Darwin-0.18.tgz(8.12 MB)
catboost-R-Linux-0.18.tgz(58.59 MB)
catboost-R-Windows-0.18.tgz(56.85 MB)
catboostmodel.dll(5.88 MB)
libcatboostr-darwin.so(25.64 MB)
libcatboostr-linux.so(148.94 MB)
libcatboostr.dll(143.24 MB)
v0.17.5(Oct 10, 2019)
Bugs fixed:

System of linear equations is not positive definite when training MultiClass on Windows, #1022

Cat feature values could be taken from floating-point data. We have forbidden this

Handling of numpy.ndarray features data with categorical features is corrected

Source code(tar.gz)
Source code(zip)
catboost-0.17.5.exe(143.62 MB)
catboost-darwin-0.17.5(27.44 MB)
catboost-linux-0.17.5(152.00 MB)
catboost-R-Darwin-0.17.5.tgz(7.95 MB)
catboost-R-Linux-0.17.5.tgz(58.42 MB)
catboost-R-Windows-0.17.5.tgz(56.71 MB)
catboostmodel.dll(5.63 MB)
libcatboostr-darwin.so(24.95 MB)
libcatboostr-linux.so(148.18 MB)
libcatboostr.dll(142.80 MB)