Hi @trevorstephens,
I am not sure whether this is a bug or whether the documentation for SymbolicTransformer is misleading.
Below is a showcase where SymbolicRegressor recovers the equation that generates the dataset, while SymbolicTransformer does not behave the same way.
Starting with SymbolicRegressor, I built an "easy" dataset to check whether it finds the correct expression with good metrics.
from gplearn.genetic import SymbolicRegressor
from sklearn import metrics
import pandas as pd
import numpy as np
# Generate a synthetic dataset: the target is min(X0, X1) * X2
X = np.random.uniform(0, 100, size=(100, 3))
y = np.min(X[:, :2], axis=1) * X[:, 2]
# Hold out the last 20 samples for testing
index = 80
X_train, y_train = X[:index, :], y[:index]
X_test, y_test = X[index:, :], y[index:]
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log',
                'abs', 'neg', 'inv', 'max', 'min', 'sin', 'cos', 'tan']
est_gp = SymbolicRegressor(population_size=5000,
generations=20, stopping_criteria=0.001,
function_set=function_set,
p_crossover=0.7, p_subtree_mutation=0.1,
p_hoist_mutation=0.05, p_point_mutation=0.1,
max_samples=0.9, verbose=1,
n_jobs=1,
parsimony_coefficient=0.01, random_state=0)
est_gp.fit(X_train, y_train)
print('Score:', est_gp.score(X_test, y_test),
      metrics.mean_absolute_error(y_test, est_gp.predict(X_test)))
print(est_gp._program)
This example gives a perfect result, and the MAE is essentially zero, as the output shows:
| Population Average | Best Individual |
---- ------------------------- ------------------------------------------ ----------
Gen Length Fitness Length Fitness OOB Fitness Time Left
0 11.81 8396.89543051 10 25.3022470326 26.608049431 35.35s
1 12.36 8904.35549713 8 20.0284767508 19.0994923956 37.34s
2 13.74 37263.312834 8 7.82583874247e-14 2.13162820728e-14 36.67s
Score: 1.0 5.71986902287e-14
abs(div(neg(X2), inv(min(X0, X1))))
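Note that the winning program is algebraically identical to the target, since all inputs are positive: abs(div(neg(X2), inv(min(X0, X1)))) reduces to min(X0, X1) * X2. A quick sanity check (a minimal sketch reusing X and y from the script above):
# Evaluate the evolved expression by hand and compare it with the target
evolved = np.abs(-X[:, 2] / (1.0 / np.minimum(X[:, 0], X[:, 1])))
print(np.allclose(evolved, y))  # True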
However, with SymbolicTransformer, although the fit works well, the transform does not. See the following example, identical to the previous one but using SymbolicTransformer:
from gplearn.genetic import SymbolicTransformer
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn import metrics
# Same synthetic dataset as before: the target is min(X0, X1) * X2
X = np.random.uniform(0, 100, size=(100, 3))
y = np.min(X[:, :2], axis=1) * X[:, 2]
index = 80
X_train, y_train = X[:index, :], y[:index]
X_test, y_test = X[index:, :], y[index:]
# Linear model - Original features
est_lin = linear_model.Lars()
est_lin.fit(X_train, y_train)
print('Lars(orig):', est_lin.score(X_test, y_test),
      metrics.mean_absolute_error(y_test, est_lin.predict(X_test)))
# Create added value features
function_set = ['add', 'sub', 'mul', 'div', 'sqrt', 'log',
                'abs', 'neg', 'inv', 'max', 'min']
gp = SymbolicTransformer(generations=20, population_size=2000,
hall_of_fame=100, n_components=10,
function_set=function_set,
parsimony_coefficient=0.0005,
max_samples=0.9, verbose=1,
random_state=0, n_jobs=3)
gp.fit(X_train, y_train)
gp_features = gp.transform(X)
# Linear model - Transformed features
newX = np.hstack((X, gp_features))
print('newX:', np.shape(newX))
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print('Lars(trans):', est_lin.score(newX[index:, :], y_test),
      metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:, :])))
# Linear model - "The" feature
newX = np.append(X, (np.min(X[:,:2],axis=1)*X[:,2]).reshape(-1,1), axis=1)
print('newX:', newX.shape)
est_lin = linear_model.Lars()
est_lin.fit(newX[:index,:], y_train)
print('Lars(trans):', est_lin.score(newX[index:, :], y_test),
      metrics.mean_absolute_error(y_test, est_lin.predict(newX[index:, :])))
I use Lars from sklearn to avoid Ridge's sparse weights and to find the best solution quickly for this easy, exact example. As the results of this code show (below), although the fitness becomes perfect during the fit, the features generated by transform seem to be wrong (see also the correlation check after the output). The problem does not come from Lars: the last Lars example shows that adding "the feature", which is the target itself, gives perfect accuracy.
X: (100, 3)
y: (100,)
Lars(orig): 0.850145084161 518.34496409
| Population Average | Best Individual |
---- ------------------------- ------------------------------------------ ----------
Gen Length Fitness Length Fitness OOB Fitness Time Left
0 14.62 0.349810294784 6 0.954248106272 0.939129495332 16.04s
1 16.01 0.601354215127 6 1.0 1.0 25.56s
newX: (100, 13)
Lars(trans): 0.83552794823 497.438879508
newX: (100, 4)
Lars(trans): 1.0 1.60411683936e-12
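To quantify how informative the transformed features actually are, each column of gp_features can be checked for correlation with the target (a minimal sketch; note that a constant feature such as mul(X2, inv(X2)) produces nan here). If transform used the best fitted programs, at least one column should reach an absolute correlation close to 1.0:
# Absolute Pearson correlation of each transformed feature with y
for i in range(gp_features.shape[1]):
    print(i, abs(np.corrcoef(gp_features[:, i], y)[0, 1]))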
So I decided to inspect the features fitted during fit. Some of them are perfect; however, transform does not seem to use them correctly when building gp_features:
>>> print('Eq. of new features:', gp)
[mul(mul(neg(sqrt(min(neg(mul(mul(X1, X0), add(inv(log(abs(-0.575))), neg(mul(mul(X1, X0), sub(X2, 0.904)))))), X2))), sqrt(max(X2, X2))), X1),
div(min(div(abs(X0), log(0.901)), log(max(X2, -0.222))), X0),
mul(sub(X1, X0), mul(X1, X0)),
mul(X2, inv(X2)),
mul(mul(neg(sqrt(min(X0, X2))), add(neg(X0), min(X0, X2))), X1),
div(abs(mul(X0, X2)), inv(mul(mul(neg(sqrt(min(X0, X2))), mul(neg(X2), max(X1, X1))), X1))),
div(abs(mul(X0, X2)), inv(mul(0.640, mul(X1, X0)))),
div(abs(mul(X0, X2)), inv(sub(min(sqrt(log(max(X1, X2))), neg(sqrt(mul(X0, 0.424)))), mul(sub(min(sub(-0.603, 0.299), sub(0.063, X1)), neg(min(X1, -0.125))), mul(max(mul(X0, X2), sqrt(X0)), min(sub(X1, 0.570), log(0.341))))))),
mul(neg(mul(div(X2, -0.678), neg(X1))), div(sqrt(max(X2, X2)), min(X1, X0)))]
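As far as I understand, transform executes the programs stored in gp._best_programs, so these can be listed together with their fitness values (a sketch relying on gplearn internals, which may change between versions):
>>> for program in gp._best_programs:
>>>     print(program, program.fitness_, program.oob_fitness_)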
>>>
>>> df = pd.DataFrame(columns=['Gen', 'OOB_fitness', 'Equation'])
>>> for idGen in range(len(gp._programs)):
>>>     for idPopulation in range(gp.population_size):
>>>         program = gp._programs[idGen][idPopulation]
>>>         if program is not None:
>>>             df = df.append({'Gen': idGen,
>>>                             'OOB_fitness': program.oob_fitness_,
>>>                             'Equation': str(program)},
>>>                            ignore_index=True)
>>>
>>> print('Best of last Gen:')
>>> print(df[df['Gen'] == df['Gen'].max()].sort_values('OOB_fitness'))
Best of last Gen:
Gen OOB_fitness Equation
1126 2.0 0.000000 add(0.944, sub(X0, X0))
952 2.0 0.000000 div(min(X2, X0), min(X2, X0))
1530 2.0 0.000000 min(inv(neg(abs(log(min(X1, 0.535))))), neg(su...
2146 2.0 0.000000 div(abs(mul(X0, X2)), inv(mul(mul(neg(sqrt(min...
2148 2.0 0.000000 div(min(add(-0.868, -0.285), X2), sqrt(sqrt(0....
2150 2.0 0.000000 sub(-0.603, 0.299)
2476 2.0 0.000000 min(min(max(X0, X2), add(-0.738, 0.612)), sqrt...
1601 2.0 0.000000 neg(min(X1, -0.125))
1271 2.0 0.000000 add(-0.504, 0.058)
1742 2.0 0.000000 add(inv(log(abs(-0.575))), inv(log(abs(-0.575))))
733 2.0 0.000000 abs(-0.575)
1304 2.0 0.000000 abs(sqrt(-0.758))
1630 2.0 0.000000 div(abs(mul(X0, X2)), inv(mul(max(X2, X2), add...
652 2.0 0.000000 log(0.341)
1708 2.0 0.000000 0.904
2262 2.0 0.000000 sqrt(-0.715)
1338 2.0 0.000000 mul(X2, sub(X1, X1))
826 2.0 0.000000 div(min(X2, add(sub(neg(sub(0.096, -0.886)), m...
1615 2.0 0.000000 abs(add(0.640, 0.766))
2415 2.0 0.000000 log(abs(-0.575))
1670 2.0 0.000000 min(X0, 0.657)
1644 2.0 0.000000 log(min(-0.524, X0))
2361 2.0 0.000000 0.944
785 2.0 0.000000 min(inv(log(abs(log(min(X1, 0.535))))), neg(mu...
2367 2.0 0.000000 abs(-0.911)
2249 2.0 0.000000 0.904
960 2.0 0.000000 inv(inv(-0.045))
955 2.0 0.000000 div(add(X1, X2), inv(sub(X2, X2)))
2397 2.0 0.000000 -0.125
1878 2.0 0.000000 div(min(X2, add(sub(neg(sub(0.096, -0.886)), m...
... ... ... ...
1103 2.0 0.997786 mul(X2, abs(sub(mul(X0, X1), add(X2, X0))))
2225 2.0 0.997790 mul(sub(min(log(div(X0, -0.717)), neg(sqrt(mul...
1890 2.0 0.998069 mul(sub(div(X2, 0.309), neg(X2)), sub(max(X2, ...
1704 2.0 0.998283 add(sub(log(min(add(0.769, X1), abs(X1))), sub...
1829 2.0 0.998284 add(inv(log(abs(-0.575))), neg(mul(mul(X1, X0)...
700 2.0 0.998345 add(sub(log(min(add(0.769, X1), abs(X1))), sub...
1770 2.0 0.998638 mul(add(min(X0, min(X1, X1)), X2), sqrt(abs(ab...
2344 2.0 0.998692 div(min(X2, add(sub(neg(sub(0.096, abs(-0.575)...
985 2.0 0.998793 sub(min(mul(sub(min(sqrt(log(max(X1, X2))), ne...
1634 2.0 0.998815 add(inv(log(abs(-0.575))), neg(mul(mul(X1, X0)...
1412 2.0 0.998945 mul(sub(min(sqrt(log(max(X1, X2))), neg(sqrt(m...
855 2.0 0.998965 add(inv(log(abs(X1))), neg(mul(mul(X1, X0), su...
839 2.0 0.998996 add(inv(abs(add(min(X0, min(X1, X1)), X2))), n...
1528 2.0 0.999066 add(sub(log(min(add(0.769, X1), abs(X1))), sub...
690 2.0 0.999875 add(sub(log(min(add(0.769, X1), abs(X1))), sub...
2047 2.0 0.999895 sub(min(neg(X1), div(X1, X2)), sub(min(abs(X1)...
1951 2.0 0.999921 sub(min(min(X2, X0), X2), mul(min(X1, X0), neg...
1981 2.0 0.999954 mul(X2, neg(neg(min(add(0.448, X0), sub(X1, -0...
2349 2.0 0.999954 sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
2364 2.0 0.999960 add(inv(log(abs(-0.575))), mul(X2, neg(neg(min...
2487 2.0 0.999971 sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
2056 2.0 0.999975 sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
1559 2.0 0.999976 mul(X2, neg(neg(min(add(0.448, X0), abs(X1)))))
975 2.0 0.999982 sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
2032 2.0 0.999992 sub(min(abs(X1), X2), mul(min(X1, X0), neg(X2)))
1288 2.0 1.000000 sub(min(div(-0.992, X2), X2), mul(min(X1, X0),...
2482 2.0 1.000000 sub(min(abs(inv(neg(X1))), X2), mul(min(X1, X0...
1776 2.0 1.000000 mul(min(mul(add(X0, X0), abs(log(X1))), min(ab...
2392 2.0 1.000000 mul(neg(X2), max(div(0.933, X0), min(X0, min(X...
1329 2.0 1.000000 mul(min(X1, X0), neg(X2))
[2000 rows x 3 columns]
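Indeed, the best program of the last generation, mul(min(X1, X0), neg(X2)), is just -min(X0, X1) * X2, which is perfectly (anti-)correlated with the target, yet it does not appear to be reflected in the transformed columns. A quick check (a minimal sketch with numpy only):
>>> feat = np.min(X[:, :2], axis=1) * -X[:, 2]  # mul(min(X1, X0), neg(X2))
>>> print(np.corrcoef(feat, y)[0, 1])  # -1.0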
Is this a bug? I am doing the same thing as explained in the SymbolicTransformer example.