Lale is a Python library for semi-automated data science.

International Business Machines

Last update: Dec 29, 2022

Related tags

Data Analysis python data-science machine-learning scikit-learn artificial-intelligence interoperability hyperparameter-optimization hyperparameter-tuning ibm-research automl automated-machine-learning dataquality hyperparameter-search ibm-research-ai pipeline-tests pipeline-testing

Overview

Lale

README in other languages: 中文, deutsch, français, or contribute your own.

Lale is a Python library for semi-automated data science. Lale makes it easy to automatically select algorithms and tune hyperparameters of pipelines that are compatible with scikit-learn, in a type-safe fashion. If you are a data scientist who wants to experiment with automated machine learning, this library is for you!

Lale adds value beyond scikit-learn along three dimensions: automation, correctness checks, and interoperability. For automation, Lale provides a consistent high-level interface to existing pipeline search tools including Hyperopt, GridSearchCV, and SMAC. For correctness checks, Lale uses JSON Schema to catch mistakes when there is a mismatch between hyperparameters and their type, or between data and operators. And for interoperability, Lale has a growing library of transformers and estimators from popular libraries such as scikit-learn, XGBoost, and PyTorch. Lale can be installed just like any other Python package and can be edited with off-the-shelf Python tools such as Jupyter notebooks.
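To give a feel for the correctness checks described above, here is a minimal sketch of schema-based hyperparameter validation. It is a hand-rolled checker that supports only a tiny subset of JSON Schema (type, enum, minimum) so that it runs with the standard library alone. The schema contents and the name check_hyperparams are assumptions for illustration and are not copied from Lale.

```python
from typing import Any, Dict, List

# Illustrative schema written in JSON Schema style (hypothetical, not Lale's).
HYPERPARAM_SCHEMA: Dict[str, Dict[str, Any]] = {
    "C": {"type": "number", "minimum": 0.0},
    "solver": {"enum": ["lbfgs", "liblinear", "saga"]},
    "max_iter": {"type": "integer", "minimum": 1},
}

# Mapping from JSON Schema type names to Python types.
_PY_TYPES = {"number": (int, float), "integer": (int,), "string": (str,)}


def check_hyperparams(hyperparams: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the settings are valid."""
    problems: List[str] = []
    for name, value in hyperparams.items():
        schema = HYPERPARAM_SCHEMA.get(name)
        if schema is None:
            problems.append(f"{name}: unknown hyperparameter")
            continue
        expected = schema.get("type")
        if expected is not None:
            # bool is a subclass of int in Python but not a JSON number.
            type_ok = isinstance(value, _PY_TYPES[expected])
            if isinstance(value, bool) or not type_ok:
                got = type(value).__name__
                problems.append(f"{name}: expected {expected}, got {got}")
                continue
        if "enum" in schema and value not in schema["enum"]:
            problems.append(f"{name}: {value!r} not in {schema['enum']}")
        if "minimum" in schema and value < schema["minimum"]:
            problems.append(f"{name}: {value!r} is below {schema['minimum']}")
    return problems


print(check_hyperparams({"C": 1.0, "solver": "lbfgs", "max_iter": 100}))
print(check_hyperparams({"solver": "adam", "max_iter": "100"}))
```

The first call prints an empty list; the second reports the invalid solver value and the string passed where an integer is expected, before any training would start.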

Introductory guide for scikit-learn users
Installation instructions
Technical overview slides, notebook, and video
IBM's AutoAI SDK uses Lale, see demo notebook
Guide for wrapping new operators
Guide for contributing to Lale
FAQ
Papers
Python API documentation

The name Lale, pronounced laleh, comes from the Persian word for tulip. Similarly to popular machine-learning libraries such as scikit-learn, Lale is also just a Python library, not a new stand-alone programming language. It does not require users to install new tools nor learn new syntax.

Lale is distributed under the terms of the Apache 2.0 License, see LICENSE.txt. It is currently in an Alpha release, without warranties of any kind.

Comments

replace BaselineClassifier/Regressor by DummyClassifier/Regressor. Issue #618

Replaces BaselineClassifier with DummyClassifier from scikit-learn, and BaselineRegressor with DummyRegressor from scikit-learn. Creates the new operators with make_operator, using the schemas defined in BaselineClassifier and BaselineRegressor.

Contributed by students of SRH University, Heidelberg. https://github.com/tauseefhashmi https://github.com/frankcode101 https://github.com/tanmaygaikwad https://github.com/RajathReddy9 https://github.com/vickyvishal/

opened by vickyvishal 11
Improve Lale sklearn schema
BaggingClassifier - Add 2 constraints.

BaggingRegressor - Add 2 constraints.

ColumnTransformer - Add a sparse constraint.

ExtraTreesClassifier - Add a constraint.

ExtraTreesRegressor - Add 2 constraints.

FeatureAgglomeration - Add a sparse constraint.

FunctionTransformer - Fix a constraint.

GaussianNB - Add a sparse constraint.

KNeighborsClassifier - Remove 2 constraints (false negatives).

KNeighborsRegressor - Remove 2 constraints (false negatives).

For KNN-C and KNN-R, this constraint might be useful but the schema is long and contains many TODOs.

Metric 'minkowski' not valid for sparse input. Use sorted(sklearn.neighbors.VALID_METRICS_SPARSE['brute']) to get valid options. Metric can also be a callable function.

LinearSVC - Improve constraints.

LinearSVR - Modify "loss" schema. Add a constraint.

LogisticRegression - Add 4 constraints.

MinMaxScaler - Add a sparse constraint.

MissingIndicator - Add a constraint.

OneHotEncoder - Add "drop" schema, new in 0.21. Add a constraint.

OrdinalEncoder - Remove "ignore" from "handle_unknown" schema. Add 2 constraints.

RandomForestClassifier - Add a constraint.

RandomForestRegressor - Add 2 constraints.

RidgeClassifier - Add 2 constraints.

Ridge - Add 2 constraints.

RobustScaler - Add a constraint.

SimpleImputer - Add a constraint.

SVC - Remove a constraint (false negative). Add a constraint.

SVR - Remove a constraint (false negative). Add a constraint.
opened by Ingkarat 8
Outer scripts are optimized in the importing task

Ran isort on the outer scripts only. The files under search/*.py, lib/*.py, datasets/*.py, and utils/*.py have not been changed, which means that schema2search_space.py is still the same as before.
hacktoberfest

opened by lnxpy 8
Added additional Logisticaix360 wrappers to the existing Lale codebase

Created an additional Logisticaix360 wrapper, added a notebook comparing Prejudice remover, Logistic regression, and Logisticaix360, and added test cases.

opened by priyankabanda2202 6

ImportError: cannot import name '_UnstableArchMixin'

IBM Watson Studio: Version 1.1.0-151 (1.1.0-151) on macOS Catalina 10.15.4

from sklearn.preprocessing import Normalizer
from sklearn.tree import DecisionTreeRegressor as Tree
from lale.lib.lale import Hyperopt


---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-28-2eee442a0b4d> in <module>
----> 1 from lale.lib.lale import Hyperopt

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/lale/lib/lale/__init__.py in <module>
     61 from .baseline_classifier import BaselineClassifier
     62 from .baseline_regressor import BaselineRegressor
---> 63 from .grid_search_cv import GridSearchCV
     64 from .hyperopt import Hyperopt
     65 from .topk_voting_classifier import TopKVotingClassifier

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/lale/lib/lale/grid_search_cv.py in <module>
     15 from typing import Any, Dict
     16 
---> 17 import lale.lib.sklearn
     18 import lale.search.lale_grid_search_cv
     19 import lale.operators

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/lale/lib/sklearn/__init__.py in <module>
    130 from .extra_trees_classifier import ExtraTreesClassifier
    131 from .extra_trees_regressor import ExtraTreesRegressor
--> 132 from .feature_agglomeration import FeatureAgglomeration
    133 from .function_transformer import FunctionTransformer
    134 from .gaussian_nb import GaussianNB

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/lale/lib/sklearn/feature_agglomeration.py in <module>
     13 # limitations under the License.
     14 
---> 15 import sklearn.cluster.hierarchical
     16 import lale.docstrings
     17 import lale.operators

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/sklearn/cluster/__init__.py in <module>
      4 """
      5 
----> 6 from .spectral import spectral_clustering, SpectralClustering
      7 from .mean_shift_ import (mean_shift, MeanShift,
      8                           estimate_bandwidth, get_bin_seeds)

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/sklearn/cluster/spectral.py in <module>
     15 from ..metrics.pairwise import pairwise_kernels
     16 from ..neighbors import kneighbors_graph
---> 17 from ..manifold import spectral_embedding
     18 from .k_means_ import k_means
     19 

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/sklearn/manifold/__init__.py in <module>
      3 """
      4 
----> 5 from .locally_linear import locally_linear_embedding, LocallyLinearEmbedding
      6 from .isomap import Isomap
      7 from .mds import MDS, smacof

~/WatsonStudioDesktop/miniconda3/envs/desktop/lib/python3.6/site-packages/sklearn/manifold/locally_linear.py in <module>
     10 from scipy.sparse.linalg import eigsh
     11 
---> 12 from ..base import BaseEstimator, TransformerMixin, _UnstableArchMixin
     13 from ..utils import check_random_state, check_array
     14 from ..utils.extmath import stable_cumsum

ImportError: cannot import name '_UnstableArchMixin'

opened by sreev 6

Srh group8

Replaces BaselineClassifier with DummyClassifier from sklearn, and BaselineRegressor with DummyRegressor from sklearn.

Contributed by students of SRH University, Heidelberg. https://github.com/tauseefhashmi https://github.com/frankcode101 https://github.com/tanmaygaikwad https://github.com/RajathReddy9 https://github.com/vickyvishal/

opened by vickyvishal 5
replace BaselineClassifier/Regressor by DummyClassifier/Regressor

Our lale.lib.lale package has a BaselineClassifier that simply predicts the majority class. Scikit-learn has a DummyClassifier that does the same, with a few additional useful configuration options. We should add the DummyClassifier to our lale.lib.sklearn package, and eliminate the BaselineClassifier, since it is redundant. Similarly, we should also replace lale.lib.lale.BaselineRegressor by scikit-learn's DummyRegressor.
good first issue

opened by hirzel 5

module resolution issue in pretty_print()

Hi All,

I have implemented a custom imputer based on the scikit-learn SimpleImputer as an example. My code lives in albert_imputer.py. Everything is fine until the final result is printed. This is what I see in the debugger:

> /Users/albert/miniconda3/envs/lale/lib/python3.7/site-packages/lale/pretty_print.py(160)_get_module_name()
-> op = find_op(mod_name_short, op_name)
(Pdb) l
155  	    mod_name_long = class_name[: class_name.rfind(".")]
156  	    mod_name_short = mod_name_long[: mod_name_long.rfind(".")]
157  	    unqualified = class_name[class_name.rfind(".") + 1 :]
158  	    if class_name.startswith("lale.") and unqualified.endswith("Impl"):
159  	        unqualified = unqualified[: -len("Impl")]
160  ->	    op = find_op(mod_name_short, op_name)
161  	    if op is not None:
162  	        mod = mod_name_short
163  	    else:
164  	        op = find_op(mod_name_long, op_name)
165  	        if op is not None:
(Pdb) p mod_name_long, mod_name_short, unqualified,
('albert_imputer', 'albert_impute', 'AlbertImputerImpl')

In "mod_name_short", the final "r" is missing. For this reason, importlib cannot load the module in find_op(). As a temporary workaround, I created a symbolic link "albert_impute.py" pointing to the "albert_imputer.py" file, and it works.
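The truncated name follows from how str.rfind reports a missing separator: it returns -1, and slicing up to -1 drops the last character. A minimal sketch, reusing the slicing from line 156 of the debugger listing; the helper parent_module is a hypothetical guarded variant for illustration and is not part of Lale:

```python
def parent_module_naive(mod_name: str) -> str:
    # Same slicing as in the listing: when mod_name contains no ".",
    # rfind returns -1 and the slice silently drops the last character.
    return mod_name[: mod_name.rfind(".")]


def parent_module(mod_name: str) -> str:
    # Hypothetical guarded variant: return the name unchanged when there
    # is no package prefix to strip.
    head, sep, _tail = mod_name.rpartition(".")
    return head if sep else mod_name


print(parent_module_naive("albert_imputer"))  # albert_impute
print(parent_module("albert_imputer"))        # albert_imputer
print(parent_module("lale.lib.sklearn.pca"))  # lale.lib.sklearn
```

Top-level modules such as albert_imputer have no dot in their name, which is why the problem only shows up for operators defined outside a package.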

opened by ghost 4

Prefix snapml estimators with 'Snap'

Inside autoai_core, estimators are identified by their class name (rather than by a full path that includes the module, etc.). While there does not seem to be any limitation on the Lale side related to having multiple estimators with the same class name, it looks like significant changes would be required in AutoAI core in order to handle this, and these changes would propagate all the way up to the top-level interface.

In order to have smooth integration without changing the KaggleBot interface, I propose to simply prefix the snapml estimator class names with Snap. I've tested these changes with the autoai_core development branch and everything seems to work nicely.

opened by tdoublep 4
Adding random_state argument to fair_stratified_train_test_split

Fixes #596

In the existing implementation of fair_stratified_train_test_split, there is no argument for setting random_state; internally, its value is simply set to 42 when calling scikit-learn's train_test_split. It would be a good idea to add this argument to fair_stratified_train_test_split, similar to the corresponding scikit-learn routine.
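The idea of exposing the seed can be sketched with a small standard-library stand-in. The function stratified_split_indices below is hypothetical and is not Lale's implementation; it only shows a caller-supplied random_state replacing a value fixed inside the function. The default of 42 is an assumption that mirrors the previously hard-coded value.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def stratified_split_indices(
    labels: Sequence[Hashable],
    test_size: float = 0.25,
    random_state: Optional[int] = 42,
) -> Tuple[List[int], List[int]]:
    """Split row indices into train and test, group by group.

    Hypothetical stand-in: the seed comes from the caller rather than
    being fixed inside the function.
    """
    rng = random.Random(random_state)
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    train: List[int] = []
    test: List[int] = []
    for indices in groups.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = ["a"] * 8 + ["b"] * 4
first = stratified_split_indices(labels, random_state=0)
second = stratified_split_indices(labels, random_state=0)
print(first == second)  # True: the same seed reproduces the same split
```

Passing random_state=None in this sketch seeds the generator from the system, so repeated calls may produce different splits.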

opened by vaisaxena 4
Hyperopt Algorithm used

Not actually an issue, just a question.

Hyperopt supports Random Search, Tree of Parzen Estimators (TPE), and Adaptive TPE.

When using the Hyperopt optimizer in Lale, which search algorithm is used behind the scenes?

Thank you in advance for your time and contribution.

opened by tsiakmaki 4
redirect Lale autogen to lale.lib.sklearn where overlap

There are 43 operators that exist in both lale.lib.autogen and lale.lib.sklearn. This overlap is problematic, because it can lead to unexpected behavior depending on the order of imports, and users might end up with a lower-quality version of an operator for which we also have a higher-quality version. We should simply remove lale.lib.autogen operators from the repository for which there is also a lale.lib.sklearn operator. To avoid breaking code that uses them, we can change the __init__.py file of lale.lib.autogen to forward to the relevant replacements.

List of duplicate operators: ada_boost_classifier, ada_boost_regressor, decision_tree_classifier, decision_tree_regressor, extra_trees_classifier, extra_trees_regressor, function_transformer, gaussian_nb, gradient_boosting_classifier, gradient_boosting_regressor, isomap, k_means, k_neighbors_classifier, k_neighbors_regressor, linear_regression, linear_svc, linear_svr, logistic_regression, min_max_scaler, missing_indicator, mlp_classifier, multinomial_nb, nmf, normalizer, nystroem, one_hot_encoder, ordinal_encoder, passive_aggressive_classifier, pca, polynomial_features, quadratic_discriminant_analysis, quantile_transformer, random_forest_classifier, random_forest_regressor, ridge, ridge_classifier, robust_scaler, sgd_classifier, sgd_regressor, simple_imputer, standard_scaler, svc, svr
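One possible shape for such a forwarding __init__.py is a module-level __getattr__ (PEP 562). The sketch below builds two stand-in modules in memory so that it runs without Lale; the module names demo_lib_autogen and demo_lib_sklearn and the placeholder PCA class are assumptions for illustration, and this is not necessarily how Lale implemented the change.

```python
import sys
import types

# Stand-in for the package that holds the hand-curated operators.
curated = types.ModuleType("demo_lib_sklearn")


class PCA:
    """Placeholder for the higher-quality, hand-curated operator."""


curated.PCA = PCA

# Stand-in for the autogen package whose duplicate operators were removed.
autogen = types.ModuleType("demo_lib_autogen")

# Names that used to exist in autogen and now live in the curated package.
_FORWARDED = {"PCA"}


def _forward(name: str):
    # A module-level __getattr__ runs only when normal lookup fails.
    if name in _FORWARDED:
        return getattr(curated, name)
    raise AttributeError(
        f"module {autogen.__name__!r} has no attribute {name!r}"
    )


autogen.__getattr__ = _forward
sys.modules[curated.__name__] = curated
sys.modules[autogen.__name__] = autogen

# Old-style imports keep working and resolve to the curated operator.
from demo_lib_autogen import PCA as AutogenPCA

print(AutogenPCA is curated.PCA)  # True
```

A plain re-export (one import statement per duplicated operator in the autogen __init__.py) would be a simpler alternative; the lazy variant mainly avoids importing the replacements until they are requested.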

opened by hirzel 0
test suite for hyperparameter optimizers

There is an existing ad-hoc set of tests for our optimizers.
We should factor out a standard set of tests that are generic over the choice of optimizer and that can be reused and run against each optimizer. This can then be used as a test suite for new optimizers, including ones that live in other repositories (such as the NSGA-II based optimizer recently added to lale-gpl).
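One way to structure such a reusable suite is a mixin of optimizer-agnostic checks that each optimizer's test class inherits. The sketch below uses only the standard library; OptimizerContract, grid_search, and random_search are hypothetical toy stand-ins, not Lale optimizers or existing Lale tests.

```python
import random
import unittest


def grid_search(objective, space):
    # Toy optimizer: evaluate every candidate and keep the best one.
    return {"best": min(space, key=objective)}


def random_search(objective, space, seed=0):
    # Toy optimizer: visit every candidate once, in a seeded random order.
    order = random.Random(seed).sample(space, k=len(space))
    return {"best": min(order, key=objective)}


class OptimizerContract:
    """Checks that are generic over the optimizer under test.

    Concrete test classes only provide make_optimizer().
    """

    space = [0.1, 1.0, 10.0]

    def make_optimizer(self):
        raise NotImplementedError

    def run_optimizer(self):
        return self.make_optimizer()(lambda c: abs(c - 1.0), self.space)

    def test_result_comes_from_search_space(self):
        self.assertIn(self.run_optimizer()["best"], self.space)

    def test_finds_minimum_of_simple_objective(self):
        self.assertEqual(self.run_optimizer()["best"], 1.0)


class TestGridSearch(OptimizerContract, unittest.TestCase):
    def make_optimizer(self):
        return grid_search


class TestRandomSearch(OptimizerContract, unittest.TestCase):
    def make_optimizer(self):
        return random_search


suite = unittest.TestSuite()
for case in (TestGridSearch, TestRandomSearch):
    suite.addTests(unittest.defaultTestLoader.loadTestsFromTestCase(case))
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
print(outcome.wasSuccessful())
```

Because the contract class does not itself derive from unittest.TestCase, test discovery only picks up the concrete subclasses, and an optimizer in another repository can opt in by adding one small subclass.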

opened by shinnar 1

Handle Project producing zero columns

It would be nice if the user could provide a pipeline with more preprocessing subpipelines than necessary. For example, if a pipeline contains a branch with one-hot encoding for string columns, but the data only has numeric columns, it would be convenient if it worked anyway. Unfortunately, some sklearn operators raise an exception when their input data has zero columns. This issue proposes preventing that exception during fit, and possibly even pruning the affected operators from the pipeline returned by fit.

Example:

import sklearn.datasets
X, y = sklearn.datasets.load_digits(return_X_y=True)

from lale.lib.lale import Project, ConcatFeatures
from lale.lib.sklearn import LogisticRegression, OneHotEncoder

proj_nums = Project(columns={"type": "number"})
proj_cats = Project(columns={"type": "string"})
one_hot = OneHotEncoder(handle_unknown="ignore")
prep = (proj_nums & (proj_cats >> one_hot)) >> ConcatFeatures
trainable = prep >> LogisticRegression()

print(f"shapes: X {X.shape}, y {y.shape}, "
      f"nums {proj_nums.fit(X).transform(X).shape}, "
      f"cats {proj_cats.fit(X).transform(X).shape}")

trained = trainable.fit(X, y)

This prints:

shapes: X (1797, 64), y (1797,), nums (1797, 64), cats (1797, 0)
Traceback (most recent call last):
  File "~/tmp.py", line 17, in <module>
    trained = trainable.fit(X, y)
  File "~/git/user/lale/lale/operators.py", line 3981, in fit
    trained = trainable.fit(X=inputs)
  File "~/git/user/lale/lale/operators.py", line 2526, in fit
    trained_impl = trainable_impl.fit(X, y, **filtered_fit_params)
  File "~/git/user/lale/lale/lib/sklearn/one_hot_encoder.py", line 145, in fit
    self._wrapped_model.fit(X, y)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 385, in fit
    self._fit(X, handle_unknown=self.handle_unknown)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 74, in _fit
    X_list, n_samples, n_features = self._check_X(X)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 43, in _check_X
    X_temp = check_array(X, dtype=None)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "~/python3.7venv/lib/python3.7/site-packages/sklearn/utils/validation.py", line 661, in check_array
    context))
ValueError: Found array with 0 feature(s) (shape=(1797, 0)) while a minimum of 1 is required.

opened by hirzel 2

Update to newest Hyperopt
Hyperopt 0.2.6 was released on November 15: https://pypi.org/project/hyperopt/0.2.6/

Unfortunately, it breaks many Lale tests: https://github.com/IBM/lale/actions/runs/1467468837

For example, the failures include some very basic tests such as:

test.test_core_transformers.TestFeaturePreprocessing.test_MinMaxScaler

test.test_core_transformers.TestFeaturePreprocessing.test_PCA

test.test_core_transformers.TestConcatFeatures.test_concat_with_hyperopt

So for now, we limit the Hyperopt version to <=0.2.5: https://github.com/IBM/lale/pull/875/commits/24db05830ff79d0d1b474b5595a612bad9e62959

We should try to update to the latest (in fact, if we are lucky, Hyperopt 0.2.7 fixes the problem).
opened by hirzel 1

Add a test case to test_autoai_output_consumption.py to do fairness mitigation

Add a test case to test_autoai_output_consumption.py covering the following scenario:

1. Read an output AutoAI pipeline.
2. Use DisparateImpactRemover on the preprocessing prefix and perform refinement with a choice of classifiers.
3. Use Hyperopt to choose the best model with the pre-estimator mitigation of step 2.

Here is some code for using the pipeline generated for the German credit dataset:

fairness_info = {
    "protected_attributes": [
        {"feature": "Sex", "reference_group": ["male"], "monitored_group": ["female"]},
        {"feature": "Age", "reference_group": [[20, 40], [60, 90]], "monitored_group": [[41, 59]]},
    ],
    "favorable_labels": ["No Risk"],
    "unfavorable_labels": ["Risk"],
}

prefix = best_pipeline.remove_last().freeze_trainable()

from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier as RF
from lale.operator_wrapper import wrap_imported_operators
from lale.lib.aif360 import DisparateImpactRemover
wrap_imported_operators()

di_remover = DisparateImpactRemover(**fairness_info, preparation=prefix, redact=True)
planned_fairer = di_remover >> (LR | RF)

from lale.lib.aif360 import accuracy_and_disparate_impact
from lale.lib.aif360 import FairStratifiedKFold

combined_scorer = accuracy_and_disparate_impact(**fairness_info)
fair_cv = FairStratifiedKFold(**fairness_info, n_splits=3)

from lale.lib.lale import Hyperopt

import pandas as pd
df = pd.read_csv("german_credit_data_biased_training.csv")
y = df.iloc[:, -1]
X = df.drop(columns=['Risk'])

trained_fairer = planned_fairer.auto_configure(
    X, y, optimizer=Hyperopt, cv=fair_cv, verbose=True,
    max_evals=1, scoring=combined_scorer, best_score=1.0)

opened by kiran-kate 0

Releases (v0.7.2)

v0.7.2 (Oct 25, 2022)
CI and RASL fixes.

v0.7.1 (Oct 4, 2022)
fixes to autogen schemas
fix to autoai_libs DateTransformer

v0.7.0 (Oct 3, 2022)
Improves support for partial_fit
Improves the pretty printer
Improves support for typed users
Adds lale.lib.sklearn.perceptron (wrapping sklearn.linear_model.Perceptron)
RASL (experimental):
Removes support for Spark Dataframes that don't have an index
Moves HashingEncoder to category_encoders and improved documentation

v0.6.19 (Sep 26, 2022)
Updated version of aif360 during installation.

v0.6.18 (Sep 22, 2022)
Adding py.typed marker to enable MyPy on packages that use Lale.

v0.6.17 (Sep 21, 2022)
fit_transform for lale operators
partial_fit for xgboost and lightgbm
Minor fixes and updates to README.

v0.6.16 (Sep 8, 2022)
Changed the version of black in setup.py compared to 0.6.15.

v0.6.15 (Sep 8, 2022)
Add support for scikit-learn 1.1
Add lower and upper bound constraints for scikit-learn to help suggest recommended versions
Add support for newer versions of XGBoost

v0.6.14 (Aug 25, 2022)
Updated metrics to handle y as DataFrame.

v0.6.13 (Aug 16, 2022)
Release for the KDD'22 tutorial

v0.6.12 (Aug 13, 2022)
Hands-on tutorials for KDD'22: https://github.com/IBM/lale/tree/master/examples/kdd22

v0.6.11 (Jul 25, 2022)
RASL: balanced accuracy, balanced_accuracy_and_di
Documentation improvements
lale.lib.autoai_libs.DateTransformer

v0.6.10 (Jun 29, 2022)
Fixes and changes to RASL, lale.lib.aif360 and import and export from sklearn.

v0.6.9 (May 23, 2022)
rasl fixes
a fix for autoai_ts_libs
a change to Hyperopt's fit to accept a validation dataset.

v0.6.8 (May 6, 2022)
Batching can handle an iterable or data loader without knowing n_batches.
XGBoost 1.6
pretty_print lists a list of external modules in wrap_imported_operators.

v0.6.7 (Apr 21, 2022)
Batching changes to use task graphs
Removed autoai_ts_libs operators
BatchedTreeEnsemble estimators from SnapML
New rasl operators such as BatchedBaggingClassifier and HashingEncoder
Spilling in task graphs

v0.6.6 (Mar 2, 2022)
Bug fixes
Improved interface for Monoids
Spilling in task graphs
multi-column index in SparkWithIndex

v0.6.5 (Feb 21, 2022)
Fixes a regression (https://github.com/IBM/lale/commit/33d897218edd404ea5ddc4757c719f46fadf4bd8)
New lale.lib.rasl operators.

v0.5.11 (Feb 2, 2022)
New release that delivers the string_indexer fix for 0.5.x.

v0.6.4 (Jan 27, 2022)
Added a new operator lale.lib.autoai_libs.ColumnSelector.

v0.6.3 (Jan 26, 2022)
Release with correct schema updates for xgboost 1.5.1.

v0.6.2 (Jan 25, 2022)
A version that is fully tested (almost, without static checks) on Python 3.9. Contains minor fixes compared to the previous version.

v0.6.1 (Jan 18, 2022)
New RASL operators: MinMaxScaler, OrdinalEncoder and OneHotEncoder
Fixes and changes for autoai-ts-libs
Scikit-learn compatibility by creating a steps property on lale pipelines and a mechanism to forward attribute access.

v0.5.10 (Jan 18, 2022)
New RASL operators: MinMaxScaler, OrdinalEncoder and OneHotEncoder
Fixes and changes for autoai-ts-libs
Scikit-learn compatibility by creating a steps property on lale pipelines and a mechanism to forward attribute access.

v0.5.9 (Dec 6, 2021)
Simplified combined fairness and predictive accuracy metrics to use a linear combination.

v0.5.8 (Dec 3, 2021)
schema changes for autoai_ts_libs.
partial_fit for a pipeline.
diff of pipelines.
some fixes and other changes.
fixes for autoai_ts_libs.

v0.6.0 (Dec 2, 2021)
Schema changes for autoai_ts_libs.
partial_fit for a pipeline.
diff of pipelines.
Some fixes and other changes.

v0.5.7 (Nov 17, 2021)
Making pretty_print() more robust.
Making fairness support more robust.

v0.5.6 (Oct 12, 2021)
RASL operator implementation such as Filter, Aggregate, GroupBy, OrderBy etc.
Changes for ensembling experiments with lale.lib.aif360.
Refactoring of lale.lib.aif360 and creation of a new setup target fairness.
Customize schemas if the environment has sklearn 1.0.
Update of schema constraints based on the "weakest precondition" work.
Other changes and bug fixes.

v0.5.5 (Jun 28, 2021)
Access to 2 multi-table datasets: go_sales and imdb.
Improvement in error messages
Support for predict_log_proba
Bug fixes

Owner

International Business Machines

GitHub https://lale.readthedocs.io

