Source code and Dataset creation for the paper "Neural Symbolic Regression That Scales"

Overview

NeuralSymbolicRegressionThatScales

PyTorch implementation and pretrained models for the paper "Neural Symbolic Regression That Scales", presented at ICML 2021. Our deep-learning-based approach is the first symbolic regression method that leverages large-scale pre-training. We procedurally generate an unbounded set of equations and simultaneously pre-train a Transformer to predict the symbolic equation from a corresponding set of input-output pairs.

For details, see Neural Symbolic Regression That Scales. [arXiv]

Installation

Please clone and install this repository via

git clone https://github.com/SymposiumOrganization/NeuralSymbolicRegressionThatScales.git
cd NeuralSymbolicRegressionThatScales/
pip3 install -e src/

This library requires Python > 3.7.
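
As a quick sanity check that the editable install succeeded, you can import the package (the package installed from src/ is named nesymres):

# Minimal check that the editable install from src/ is importable.
import nesymres
print("nesymres imported from:", nesymres.__file__)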

Pretrained models

We offer two models, "10M" and "100M". Both are trained with the parameter configuration shown in dataset_configuration.json (which contains details about how the datasets are created) and scripts/config.yaml (which contains details about how the models are trained). The "10M" model is trained on 10 million equations and the "100M" model on 100 million equations.

  • Link to 100M: [Link]
  • Link to 10M: [Link]

If you want to try the models out, look at jupyter/fit_func.ipynb. Before running the notebook, make sure to first create a folder named "weights" and to download the provided checkpoints there.
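
For orientation, the sketch below shows the rough shape of the workflow demonstrated in jupyter/fit_func.ipynb: build a set of input-output pairs for a target expression and ask the pretrained model to recover a symbolic equation. The class name Model, the checkpoint filename, the loading arguments, and the fitfunc entry point are assumptions here; the notebook contains the authoritative code.

# Rough sketch only; names marked ASSUMPTION may differ from the actual notebook.
import torch
from nesymres.architectures.model import Model  # module path as in src/nesymres/architectures/model.py

# Input-output pairs for a known target, e.g. y = x_1*sin(x_1) + x_2 with 3 variables.
X = torch.rand(500, 3) * 10 - 5
y = X[:, 0] * torch.sin(X[:, 0]) + X[:, 1]

# ASSUMPTION: checkpoint filename and constructor arguments; see the notebook for the real call.
model = Model.load_from_checkpoint("weights/100M.ckpt")
model.eval()

# ASSUMPTION: the fitting entry point exposed by the model; the notebook shows the exact API.
with torch.no_grad():
    prediction = model.fitfunc(X, y)
print(prediction)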

Dataset Generation

Before training, you need a dataset of equations. Here are the steps to follow.

Raw training dataset generation

The equation generator scripts are based on [SymbolicMathematics]. First, if you want to change the default values, configure the dataset_configuration.json file:

{
    "max_len": 20, #Maximum length of an equation
    "operators": "add:10,mul:10,sub:5,div:5,sqrt:4,pow2:4,pow3:2,pow4:1,pow5:1,ln:4,exp:4,sin:4,cos:4,tan:4,asin:2", #Operator unnormalized probability
    "max_ops": 5, #Maximum number of operations
    "rewrite_functions": "", #Not used, leave it empty
    "variables": ["x_1","x_2","x_3"], #Variable names, if you want to add more add follow the convention i.e. x_4, x_5,... and so on
    "eos_index": 1,
    "pad_index": 0
}
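
Note that the # annotations above are for this README only; the actual dataset_configuration.json must be plain JSON without comments. The operators string encodes unnormalized sampling weights; a small illustrative parser (not part of the repository) shows how to read them:

import json

# Load the generation settings (the real file contains no comments).
with open("dataset_configuration.json") as f:
    cfg = json.load(f)

# "add:10,mul:10,..." -> {"add": 10.0, "mul": 10.0, ...}
weights = {name: float(w) for name, w in
           (item.split(":") for item in cfg["operators"].split(","))}

# Normalize to obtain the sampling probabilities implied by the weights.
total = sum(weights.values())
print({op: round(w / total, 3) for op, w in weights.items()})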

There are two ways to generate this dataset:

  • If you are running on Linux, you can use the makefile from a terminal as follows:
export NUM=${NumberOfEquations} #Export the number of equations
make data/raw_datasets/${NUM} #Launch the makefile command

NumberOfEquations can be given with a K or M suffix. For instance, 100K equals 100,000 equations while 10M equals 10,000,000. For example, to create a 10M dataset simply run:

export NUM=10M #Export num variable
make data/raw_datasets/10M #Launch the makefile command
  • Run this script:
python3 scripts/data_creation/dataset_creation.py --number_of_equations NumberOfEquations --no-debug #Replace NumberOfEquations with the number of equations you want to generate

After this command you will have a folder named data/raw_datasets/NumberOfEquations containing .h5 files. By default, each of these .h5 files contains a maximum of 5e4 equations.
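
The K/M shorthand used for NumberOfEquations is only a naming convenience; the helper below (illustrative, not part of the repository) expands it into an integer equation count:

def parse_equation_count(spec: str) -> int:
    """Expand '100K' -> 100_000 and '10M' -> 10_000_000; plain integers pass through."""
    suffixes = {"K": 10**3, "M": 10**6}
    spec = spec.strip().upper()
    if spec and spec[-1] in suffixes:
        return int(float(spec[:-1]) * suffixes[spec[-1]])
    return int(spec)

assert parse_equation_count("100K") == 100_000
assert parse_equation_count("10M") == 10_000_000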

Raw test dataset generation

This step is optional. You can skip it if you want to use the test set from the paper (located in test_set/nc.csv). Use the same commands as before to generate a validation dataset. All equations in this dataset will be removed from the training dataset in the next stage, hence this validation dataset should be small. For our paper it consisted of 200 equations.

#Code for generating a 150 equation dataset 
python3 scripts/data_creation/dataset_creation.py --number_of_equations 150 --no-debug #This code creates a new folder data/raw_datasets/150

If you want, you can convert the newly created validation dataset into CSV format. To do so, run:

python3 scripts/csv_handling/dataload_format_to_csv.py raw_test_path=data/raw_datasets/150

This command will create two csv files named test_nc.csv (equations without constants) and test_wc.csv (equations with constants) in the test_set folder.
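
Once created, the CSVs can be inspected with pandas. The equation strings are expected in an "eq" column (as suggested by the filtering script's use of row["eq"] further down this page); verify against the actual header:

import pandas as pd

# Peek at the constant-free validation equations produced by dataload_format_to_csv.py.
df = pd.read_csv("test_set/test_nc.csv")
print(df.shape, df.columns.tolist())
if "eq" in df.columns:  # column name is an assumption; check the generated header
    print(df["eq"].head())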

Remove test and numerically problematic equations from the training dataset

The following steps will remove the validation equations from the training set and drop equations that always evaluate to nan, inf, etc. In the commands below, substitute:

  • path_to_data_folder=data/raw_datasets/100000 if you have created a 100K dataset
  • path_to_csv=test_set/test_nc.csv if you have created 150 equations for validation. If you want to use the test set from the paper, replace it with test_set/nc.csv
python3 scripts/data_creation/filter_from_already_existing.py --data_path path_to_data_folder --csv_path path_to_csv #You can leave csv_path empty if you do not want to create a validation set
python3 scripts/data_creation/apply_filtering.py --data_path path_to_data_folder 

You should now have a folder named data/datasets/100000. This will be the training folder.
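
Conceptually, this filtering stage evaluates every candidate equation on a numeric support and drops it if it coincides with a validation equation or never yields finite values. The snippet below is a simplified illustration of the numerical check using sympy and numpy, not the code in scripts/data_creation/:

import numpy as np
import sympy as sp

x_1, x_2, x_3 = sp.symbols("x_1 x_2 x_3")
support = np.random.uniform(-10, 10, size=(3, 400))  # sample points for each variable

def is_numerically_valid(eq_str: str) -> bool:
    """Return False for equations that are nan/inf everywhere on the sampled support."""
    f = sp.lambdify((x_1, x_2, x_3), sp.sympify(eq_str), modules="numpy")
    with np.errstate(all="ignore"):
        vals = np.asarray(f(*support), dtype=np.complex128)
    finite = np.isfinite(vals) & (np.abs(vals.imag) < 1e-12)
    return bool(np.any(finite))

print(is_numerically_valid("sqrt(x_1 - 20)"))  # False: always nan on (-10, 10)
print(is_numerically_valid("sin(x_1) + x_2"))  # True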

Training

Once you have created your training and validation datasets run

python3 scripts/train.py

You can configure config.yaml with the necessary options. Most importantly, make sure you have set train_path and val_path correctly. If you have followed the 100K example, these should be set as:

train_path:  data/datasets/100000
val_path: data/raw_datasets/150
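
Before launching a long run, it can be worth checking that both paths exist. A small sketch using OmegaConf (already a project dependency); the assumption that train_path and val_path are top-level keys may not match the actual config layout:

from pathlib import Path
from omegaconf import OmegaConf

cfg = OmegaConf.load("scripts/config.yaml")
for key in ("train_path", "val_path"):  # assumed to be top-level keys
    value = OmegaConf.select(cfg, key)
    status = "exists" if value and Path(str(value)).exists() else "MISSING"
    print(f"{key}: {value} -> {status}")
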
Comments
  • filter_from_already_existing.py Errors; 'asin' is not supported ?

    Great project, just trying to replicate your results and training, however running your instructions I get;

    python scripts/data_creation/filter_from_already_existing.py --data_path data/raw_datasets/100000 --csv_path test_set/test_nc.csv
    Loading metadata
    Creating image for validation set
    Traceback (most recent call last):
      File "/home/sam/code/discovery/NeuralSymbolicRegressionThatScales/scripts/data_creation/filter_from_already_existing.py", line 130, in <module>
        main()
      File "/home/sam/anaconda3/envs/nsrts/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
        return self.main(*args, **kwargs)
      File "/home/sam/anaconda3/envs/nsrts/lib/python3.9/site-packages/click/core.py", line 1053, in main
        rv = self.invoke(ctx)
      File "/home/sam/anaconda3/envs/nsrts/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/home/sam/anaconda3/envs/nsrts/lib/python3.9/site-packages/click/core.py", line 754, in invoke
        return __callback(*args, **kwargs)
      File "/home/sam/code/discovery/NeuralSymbolicRegressionThatScales/scripts/data_creation/filter_from_already_existing.py", line 112, in main
        target_image = evaluate_validation_set(validation,support)
      File "/home/sam/code/discovery/NeuralSymbolicRegressionThatScales/scripts/data_creation/filter_from_already_existing.py", line 28, in evaluate_validation_set
        curr = lambdify(variables,row["eq"])(*support).numpy().astype('float16')
      File "<lambdifygenerated-16>", line 2, in _lambdifygenerated
    NameError: name 'asin' is not defined
    

    OS: Ubuntu 20.04.4 LTS (Focal Fossa) Python: Python 3.9.7

    Packages - from a clean venv, installed the same ones with the repo, and had to update one or two to make the previous scripts work - see previous closed issue;

    absl-py==1.0.0
    aiohttp==3.8.1
    aiosignal==1.2.0
    altair==4.2.0
    antlr4-python3-runtime==4.8
    argon2-cffi==21.3.0
    argon2-cffi-bindings==21.2.0
    astor==0.8.1
    asttokens==2.0.5
    async-timeout==4.0.2
    attrs==21.4.0
    backcall==0.2.0
    base58==2.1.1
    beautifulsoup4==4.10.0
    bleach==4.1.0
    blinker==1.4
    bokeh==2.4.2
    brotlipy==0.7.0
    bs4==0.0.1
    cachetools==5.0.0
    certifi==2021.10.8
    cffi @ file:///opt/conda/conda-bld/cffi_1642701102775/work
    charset-normalizer==2.0.12
    click==8.0.4
    compress-pickle==2.1.0
    cryptography @ file:///tmp/build/80754af9/cryptography_1639414572950/work
    dataclass-dict-convert==1.6.3
    debugpy==1.5.1
    decorator==5.1.1
    defusedxml==0.7.1
    docker-pycreds==0.4.0
    entrypoints==0.4
    executing==0.8.3
    frozenlist==1.3.0
    fsspec==2022.2.0
    future==0.18.2
    gitdb==4.0.9
    GitPython==3.1.27
    google-auth==2.6.2
    google-auth-oauthlib==0.4.6
    grpcio==1.44.0
    h5py==3.6.0
    hydra-core==1.0.0
    hydralit==1.0.12
    hydralit-components==1.0.9
    idna @ file:///tmp/build/80754af9/idna_1637925883363/work
    importlib-metadata==4.11.3
    importlib-resources==5.4.0
    ipykernel==6.9.2
    ipython==8.1.1
    ipython-genutils==0.2.0
    ipywidgets==7.7.0
    jedi==0.18.1
    Jinja2==3.0.3
    jsons==1.6.1
    jsonschema==4.4.0
    jupyter-client==7.1.2
    jupyter-core==4.9.2
    jupyterlab-pygments==0.1.2
    jupyterlab-widgets==1.1.0
    lxml==4.8.0
    Markdown==3.3.6
    MarkupSafe==2.1.1
    matplotlib-inline==0.1.3
    mistune==0.8.4
    mkl-fft==1.3.1
    mkl-random @ file:///tmp/build/80754af9/mkl_random_1626186066731/work
    mkl-service==2.4.0
    mpmath==1.2.1
    multidict==6.0.2
    nbclient==0.5.13
    nbconvert==6.4.4
    nbformat==5.2.0
    nest-asyncio==1.5.4
    -e git+ssh://[email protected]/SymposiumOrganization/NeuralSymbolicRegressionThatScales.git@92d7c46c0417aeb76ecebcac982b8ccf1a3f8860#egg=nesymres&subdirectory=src
    notebook==6.4.10
    numexpr==2.8.1
    numpy==1.22.3
    oauthlib==3.2.0
    omegaconf==2.1.1
    ordered-set==4.1.0
    packaging==21.3
    pandas==1.4.1
    pandocfilters==1.5.0
    parso==0.8.3
    pathtools==0.1.2
    pexpect==4.8.0
    pickleshare==0.7.5
    Pillow==9.0.1
    prometheus-client==0.13.1
    promise==2.3
    prompt-toolkit==3.0.28
    protobuf==3.19.4
    psutil==5.9.0
    ptyprocess==0.7.0
    pure-eval==0.2.2
    pyarrow==7.0.0
    pyasn1==0.4.8
    pyasn1-modules==0.2.8
    pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
    pydeck==0.7.1
    pyDeprecate==0.3.1
    Pygments==2.11.2
    Pympler==1.0.1
    pyOpenSSL @ file:///opt/conda/conda-bld/pyopenssl_1643788558760/work
    pyparsing==3.0.7
    pyrsistent==0.18.1
    PySocks @ file:///tmp/build/80754af9/pysocks_1605305812635/work
    python-dateutil==2.8.2
    pytorch-lightning==1.5.10
    pytz==2021.3
    pytz-deprecation-shim==0.1.0.post0
    PyYAML==6.0
    pyzmq==22.3.0
    requests @ file:///opt/conda/conda-bld/requests_1641824580448/work
    requests-oauthlib==1.3.1
    rsa==4.8
    scipy==1.8.0
    semver==2.13.0
    Send2Trash==1.8.0
    sentry-sdk==1.5.8
    setproctitle==1.2.2
    shortuuid==1.0.8
    six @ file:///tmp/build/80754af9/six_1644875935023/work
    smmap==5.0.0
    soupsieve==2.3.1
    stack-data==0.2.0
    streamlit==1.7.0
    stringcase==1.2.0
    sympy==1.10
    tensorboard==2.8.0
    tensorboard-data-server==0.6.1
    tensorboard-plugin-wit==1.8.1
    termcolor==1.1.0
    terminado==0.13.3
    testpath==0.6.0
    toml==0.10.2
    toolz==0.11.2
    torch==1.11.0
    torchaudio==0.11.0
    torchmetrics==0.7.2
    torchvision==0.12.0
    tornado==6.1
    tqdm==4.63.0
    traitlets==5.1.1
    typing-extensions @ file:///tmp/build/80754af9/typing_extensions_1631814937681/work
    typish==1.9.3
    tzdata==2022.1
    tzlocal==4.1
    urllib3==1.26.9
    validators==0.18.2
    wandb==0.12.11
    watchdog==2.1.6
    wcwidth==0.2.5
    webencodings==0.5.1
    Werkzeug==2.0.3
    widgetsnbextension==3.6.0
    yarl==1.7.2
    yaspin==2.1.0
    zipp==3.7.0
    

    Any help to point to the right direction is greatly appreciated, thank you so much,

    Best, Sam

    opened by samholt 17
  • 'RuntimeError: CUDA error: device-side assert triggered' when dataset config is changed

    Hi, when I change the parameters: max_len -> 24, max_ops -> 10 and the number of variables -> 6 I get a runtime error for the embedding layer:

    [...]
    /tmp/pip-req-build-h953rg2q/aten/src/ATen/native/cuda/Indexing.cu:662: indexSelectLargeIndex: block: [200,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    [...]
    File "code/NeuralSymbolicRegressionThatScales/src/nesymres/architectures/model.py", line 101, in forward
        pos = self.pos_embedding(
      File "miniconda3/envs/torchenv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "miniconda3/envs/torchenv/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 145, in forward
        return F.embedding(
      File "miniconda3/envs/torchenv/lib/python3.9/site-packages/torch/nn/functional.py", line 1913, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    RuntimeError: CUDA error: device-side assert triggered
    

    I tried changing the variable length_eq in config.yaml but then I get a tensor size mismatch error. Should have something to do with the network sampling equations that are larger than it expected, as sometimes I reach 5 or 6 epochs before encountering the error.

    opened by alessandrosimon 7
  • error when try to add more variables

    Thanks for making it public and really love your paper.

    At first I generated data and trained the model with the default settings, and everything went well. However, when I try to add two more variables to this model, some errors occur. What I did was add x_4 and x_5 in dataset_configuration.json, regenerate the training and validation data, and change the train_path as well as val_path in config.yaml. Could you please tell me if there is some operation I missed to solve the problem? It is really important because I want to reproduce it as one of the baselines. Thanks a lot. [screenshot of the error attached in the original issue, 2022-05-30 23:44:36]

    opened by chenyuxin1999 2
  • dataload_format_to_csv script errors

    Thanks for making the repo public! I'm trying to use the dataload_format_to_csv script within the scripts directory as in your README. I've followed the instructions exactly so far and when I am running python3 scripts/csv_handling/dataload_format_to_csv.py raw_test_path=data/raw_datasets/150

    I get the following error:

    scripts/csv_handling/dataload_format_to_csv.py:50: UserWarning: 
    config_path is not specified in @hydra.main().
    See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_hydra_main_config_path for more information.
      @hydra.main(config_name="../config")
    Could not override 'raw_test_path'.
    To append to your config use +raw_test_path=data/raw_datasets/150
    Key 'raw_test_path' is not in struct
        full_key: raw_test_path
        object_type=dict
    
    Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
    

    I haven't changed anything within the repo, and am just running it as is. Is there a fix for this? Is it possible to adapt this script to not just the validation set but the training data as well?

    Thanks!

    opened by bowenyou 2
  • Unknown SymPy operator

    Hi, during training I sometimes get the warning: Unknown SymPy operator: zoo (ComplexInfinity in SymPy). Other unknown operators that came up are nan, oo, and asinh(c*sqrt(-x_1)). The relevant method seems to be Generator.sympy_to_prefix() in src.nesymres.dataset.generator. The warning doesn't seem to interrupt training, but maybe it's better to handle those cases as well.

    opened by alessandrosimon 2
  • dataload_format_to_csv.py error ?

    Running : python scripts/csv_handling/dataload_format_to_csv.py raw_test_path=data/raw_datasets/200

    Errors with

    (nsrts) sam@lm-adastra:~/code/discovery/NeuralSymbolicRegressionThatScales$ python scripts/csv_handling/dataload_format_to_csv.py raw_test_path=data/raw_datasets/200
    /home/sam/anaconda3/envs/nsrts/lib/python3.9/site-packages/hydra/core/plugins.py:202: UserWarning: Error importing 'hydra._internal.core_plugins.importlib_resources_config_source'.
    Plugin is incompatible with this Hydra version or buggy. Recommended to uninstall or upgrade plugin.
    ModuleNotFoundError: No module named 'importlib_resources'
      warnings.warn(
    Traceback (most recent call last):
      File "/home/sam/anaconda3/envs/nsrts/lib/python3.9/site-packages/hydra/_internal/utils.py", line 207, in run_and_report
        return func()
      File "/home/sam/anaconda3/envs/nsrts/lib/python3.9/site-packages/hydra/_internal/utils.py", line 329, in <lambda>
        lambda: Hydra.create_main_hydra2(
      File "/home/sam/anaconda3/envs/nsrts/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 73, in create_main_hydra2
        config_loader: ConfigLoader = ConfigLoaderImpl(
      File "/home/sam/anaconda3/envs/nsrts/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py", line 78, in __init__
        self.repository: ConfigRepository = ConfigRepository(
      File "/home/sam/anaconda3/envs/nsrts/lib/python3.9/site-packages/hydra/_internal/config_repository.py", line 22, in __init__
        source_type = SourcesRegistry.instance().resolve(scheme)
      File "/home/sam/anaconda3/envs/nsrts/lib/python3.9/site-packages/hydra/_internal/sources_registry.py", line 29, in resolve
        raise ValueError(
    ValueError: No config source registered for schema pkg, supported types : [file, structured]

    Python: 3.9.7 hydra-core==1.0.0 hydralit==1.0.12 hydralit-components==1.0.9

    OS: Ubuntu 20.04.4 LTS (Focal Fossa) (https://releases.ubuntu.com/20.04/)

    Previously created the test 200 equation dataset in ./data/raw_datasets/200

    Could you help point me in the right direction to run your repo ? Thank you so much !

    Best, Sam

    opened by samholt 1
  • question about assertion in model.py

    Hi guys, sorry for asking so many questions lately XD

    Just wondering about this assertion in line 88 of model.py:

    assert not torch.isnan(enc_src).any()
    

    It seems when I am running the training script, this assertion fails consistently which halts training. Have you guys run into a similar issue before?

    opened by bowenyou 1
  • dataload_format_to_csv script

    When I run this script to try and convert my training sets to csv format, I intermittently run into the issue of out of memory (128GBs). Also, for some equations, the script appears to freeze (hours at a time before I have to cancel it). Is there something in the backend that is causing this? Could some generated equations "break" the script?

    I've split the data into 1000 equations at a time and still run into this issue. Any help would be appreciated!

    opened by bowenyou 1
  • add custom operation

    Hello again! It is amazing to experiment with this wonderful code. What can I do if I want to add a custom operation like relu or sign to the operator set? Any suggestion is appreciated :).

    opened by chenyuxin1999 0