# A neural-based binary analysis tool

## Introduction
This directory contains the demo of a neural-based binary analysis tool. We test the framework on multiple binary analysis tasks: (i) vulnerability detection, (ii) code similarity measures, (iii) decompilation, and (iv) malware analysis (coming later).
## Requirements
- Python 3.7.6
- Python packages
  - dgl 0.6.0
  - numpy 1.18.1
  - pandas 1.2.0
  - scipy 1.4.1
  - sklearn 0.0
  - tensorboard 2.2.1
  - torch 1.5.0
  - torchtext 0.2.0
  - tqdm 4.42.1
  - wget 3.2
- C++14 compatible compiler
- Clang++ 3.7.1
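A quick way to confirm the pinned Python packages are in place is to import them and print their versions. This snippet is only a convenience check and is not part of the tool:

```python
# Convenience check: print the versions of the pinned Python dependencies.
import dgl, numpy, pandas, scipy, torch, torchtext, tqdm

for name, module in [("dgl", dgl), ("numpy", numpy), ("pandas", pandas),
                     ("scipy", scipy), ("torch", torch),
                     ("torchtext", torchtext), ("tqdm", tqdm)]:
    print(name, getattr(module, "__version__", "unknown"))
```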
## Tasks and Dataset preparation
### Binary code similarity measures
- Download dataset
  - Download the POJ-104 dataset from here and extract it into `data/`.
- Compile and preprocess
  - Run `python extract_obj.py -a data/obj` (clang++-3.7.1 required).
  - Run `python preprocess/split_dataset.py -i data/obj -m p -o data/split.pkl` to split the dataset into train/valid/test sets.
  - Run `python preprocess/sim_preprocess.py` to compile the binary code into graph data (a loading sketch follows this list).
  - *(Part of the preprocessing code is from [1].)*
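After preprocessing, you can sanity-check the split file produced above. The command writes `data/split.pkl`, but its internal layout is not documented here, so the snippet below only loads the pickle and prints a rough summary:

```python
# Sanity check: load data/split.pkl (written by preprocess/split_dataset.py)
# and print a rough summary. The exact structure of the pickle is an assumption;
# adapt the inspection code to whatever it actually contains.
import pickle

with open("data/split.pkl", "rb") as f:
    split = pickle.load(f)

print(type(split))
if isinstance(split, dict):
    for key, value in split.items():
        size = len(value) if hasattr(value, "__len__") else value
        print(key, size)
```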
### Binary vulnerability detection
- Cramming the binary dataset
  - The dataset is built on top of Devign [2]. We compile the entire library at the reported commit id and dump the binary code of the vulnerable functions. The cramming code is given in `preprocess/cram_vul_dataset`.
- Download preprocessed data
  - Run `./preprocess.sh` (clang++-3.7.1 required), or directly download the preprocessed datasets from here and extract them into `data/`.
  - Run `python preprocess/vul_preprocess.py` to compile the binary code into graph data (an illustrative DGL sketch follows this list).
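The preprocessing scripts above turn each binary sample into a graph for DGL. The actual node and edge features used by this repository are not described here, so the following is only a toy illustration of how graphs are built and batched with the pinned `dgl` version:

```python
# Illustrative only: builds and batches toy DGL graphs with the dgl 0.6.0 /
# torch 1.5.0 APIs. The real graph features produced by preprocess/vul_preprocess.py
# are not documented here and will differ.
import dgl
import torch

# Two toy control-flow-like graphs with an 8-dim feature vector per node.
g1 = dgl.graph(([0, 1], [1, 2]), num_nodes=3)
g1.ndata["feat"] = torch.randn(3, 8)
g2 = dgl.graph(([0, 0], [1, 2]), num_nodes=3)
g2.ndata["feat"] = torch.randn(3, 8)

# dgl.batch merges several graphs into one batched graph for a GNN forward pass.
batched = dgl.batch([g1, g2])
print(batched.batch_size, batched.num_nodes())
```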
### Binary decompilation [N-Bref]
- Download dataset
  - Download the demo datasets (raw and preprocessed data) from here and extract them into `data/`. (More datasets to come.)
  - There is no need to compile the code into graphs again, as this data has already been preprocessed.
## Training and Evaluation
### Binary code similarity measures
- Run `cd baseline_model && python run_similarity_check.py`
### Binary vulnerability detection
- Run `cd baseline_model && python run_vulnerability_detection.py`
### Binary decompilation [N-Bref]
- Dump the trace of tree expansion
  - To accelerate the online processing of the tree output, dump the trace of the tree data by running `python -m preprocess.dump_trace` (a toy illustration of such a trace follows below).
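The trace format written by `preprocess.dump_trace` is specific to this repository and is not documented here. The snippet below is only a toy illustration of the general idea: record a tree's expansion order once, ahead of time, so it can be replayed cheaply during training. All names and the output path are hypothetical:

```python
# Toy illustration of "dumping a tree-expansion trace": record the order in which
# AST-like nodes would be expanded and pickle it so it can be replayed at training
# time. The real format used by preprocess.dump_trace is an assumption, not shown here.
import pickle

# A tiny (label, children) tree standing in for a decompiled AST.
tree = ("stmt", [("assign", [("var", []), ("const", [])]), ("return", [("var", [])])])

def expansion_trace(node, trace=None):
    """Pre-order walk that records each node label as it is expanded."""
    if trace is None:
        trace = []
    label, children = node
    trace.append(label)
    for child in children:
        expansion_trace(child, trace)
    return trace

trace = expansion_trace(tree)
with open("tree_trace_demo.pkl", "wb") as f:  # hypothetical output path
    pickle.dump(trace, f)
print(trace)  # ['stmt', 'assign', 'var', 'const', 'return', 'var']
```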
- Training scripts
  - First, `cd baseline_model`.
  - To train the model using torch data parallelism, run `python run_tree_transformer.py`.
  - To train on multiple GPUs using distributed PyTorch, run `python run_tree_transformer_multi_gpu.py` (a minimal sketch of this distributed setup follows below).
  - To evaluate, run `python run_tree_transformer.py --eval`.
  - To evaluate a model trained on multiple GPUs, run `python run_tree_transformer_multi_gpu.py --eval`.
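The multi-GPU script presumably relies on PyTorch's standard distributed training machinery. The repository's actual launch logic may differ; this is just a minimal `DistributedDataParallel` sketch (with a stand-in model rather than the tree transformer) to make that step concrete under the pinned torch 1.5.0:

```python
# Minimal DistributedDataParallel pattern (torch 1.5 style), shown only to make the
# "multi-GPU with distributed PyTorch" step concrete; run_tree_transformer_multi_gpu.py
# may set up processes, data loading, and the model differently.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(128, 128).cuda(rank)   # stand-in for the tree transformer
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # One dummy training step; gradients are synchronized across ranks by DDP.
    x = torch.randn(32, 128, device=f"cuda:{rank}")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    if world_size < 1:
        raise SystemExit("This sketch requires at least one CUDA device.")
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```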
## References
[1] Ye, Fangke, et al. "MISIM: An End-to-End Neural Code Similarity System." arXiv preprint arXiv:2006.05265 (2020).
[2] Zhou, Yaqin, et al. "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks." Advances in Neural Information Processing Systems. 2019.
[3] Shi, Zhan, et al. "Learning Execution through Neural Code Fusion." ICLR (2020).
## License
This repo is CC-BY-NC licensed, as found in the LICENSE file.