A neural-based binary analysis tool


Introduction

This directory contains a demo of a neural-based binary analysis tool. We test the framework on multiple binary analysis tasks: (i) vulnerability detection, (ii) code similarity measures, (iii) decompilation, and (iv) malware analysis (coming later).

Requirements

  • Python 3.7.6
  • Python packages
    • dgl 0.6.0
    • numpy 1.18.1
    • pandas 1.2.0
    • scipy 1.4.1
    • sklearn 0.0
    • tensorboard 2.2.1
    • torch 1.5.0
    • torchtext 0.2.0
    • tqdm 4.42.1
    • wget 3.2
  • C++14 compatible compiler
  • Clang++ 3.7.1
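
The versions above are pinned exactly. As a convenience, here is a small, hypothetical check script (not part of the repository) that compares the installed versions against those pins; pkg_resources is used because importlib.metadata is not available on Python 3.7.

    # Hypothetical helper: verify the pinned dependency versions are installed.
    import pkg_resources

    PINNED = {
        "dgl": "0.6.0",
        "numpy": "1.18.1",
        "pandas": "1.2.0",
        "scipy": "1.4.1",
        "sklearn": "0.0",
        "tensorboard": "2.2.1",
        "torch": "1.5.0",
        "torchtext": "0.2.0",
        "tqdm": "4.42.1",
        "wget": "3.2",
    }

    for name, expected in PINNED.items():
        try:
            installed = pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            installed = "not installed"
        status = "OK" if installed == expected else "expected %s, found %s" % (expected, installed)
        print("%-12s %s" % (name, status))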

Tasks and Dataset preparation

Binary code similarity measures

  1. Download dataset
    • Download POJ-104 datasets from here and extract them into data/.
  2. Compile and preprocess
    • Run python extract_obj.py -a data/obj (clang++-3.7.1 required)
    • Run python preprocess/split_dataset.py -i data/obj -m p -o data/split.pkl to split the dataset into train/valid/test sets.
    • Run python preprocess/sim_preprocess.py to compile the binary code into graph data.
    • (Part of the preprocessing code is adapted from [1].)
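
For reference, here is a minimal, hypothetical sketch (not part of the repository) for inspecting the train/valid/test split written by preprocess/split_dataset.py. The internal layout of data/split.pkl is an assumption; the snippet only reports what the pickle actually contains.

    # Hypothetical inspection helper: peek at the split file produced above.
    import pickle

    with open("data/split.pkl", "rb") as f:
        split = pickle.load(f)

    print(type(split))
    if isinstance(split, dict):
        # If the split is a dict keyed by set name, report the size of each set.
        for name, items in split.items():
            size = len(items) if hasattr(items, "__len__") else items
            print(name, size)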

Binary Vulnerability detections

  1. Cramming the binary dataset
    • The dataset is built on top of Devign [2]. We compile the entire library at the corresponding commit IDs and dump the binary code of the vulnerable functions. The cramming code is given in preprocess/cram_vul_dataset.
  2. Download Preprocessed data
    • Run ./preprocess.sh (clang++-3.7.1 required), or
    • You can directly download the preprocessed datasets from here and extract them into data/.
    • Run python preprocess/vul_preprocess.py to compile the binary code into graph data.

Binary decompilation [N-Bref]

  1. Download dataset
    • Download the demo datasets (raw and preprocessed data) from here and extract them into data/. (More datasets to come.)
    • There is no need to compile the code into graphs again, as the data has already been preprocessed.

Training and Evaluation

Binary code similarity measures

  • Run cd baseline_model && python run_similarity_check.py

Binary Vulnerability detections

  • Run cd baseline_model && python run_vulnerability_detection.py

Binary decompilation [N-Bref]

  1. Dump the trace of tree expansion:
    • To accelerate the online processing of the tree output, we dump the trace of the tree data by running python -m preprocess.dump_trace
  2. Training scripts:
    • First, cd baseline_model.
    • To train the model using torch parallel, run python run_tree_transformer.py.
    • To train on multiple GPUs using distributed PyTorch, run python run_tree_transformer_multi_gpu.py
    • To evaluate, run python run_tree_transformer.py --eval
    • To evaluate a multi-GPU trained model, run python run_tree_transformer_multi_gpu.py --eval

References

[1] Ye, Fangke, et al. "MISIM: An End-to-End Neural Code Similarity System." arXiv preprint arXiv:2006.05265 (2020).

[2] Zhou, Yaqin, et al. "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks." Advances in Neural Information Processing Systems. 2019.

[3] Shi, Zhan, et al. "Learning Execution through Neural Code Fusion." ICLR (2020).

License

This repo is CC-BY-NC licensed, as found in the LICENSE file.

Comments
  • Code similarity preprocessing : extract_obj.py: error: ambiguous option: -a could match -asm_t, -asm

    I have trouble running the binary code similarity measures by following the README instructions:

    Binary code similarity measures
    1. Download dataset                                                                                      
        . Download POJ-104 datasets from here and extract them into data/.
    
    2. Compile and preprocess                                                                         
        Run python extract_obj.py -a data/obj (clang++-3.7.1 required)
    

    I downloaded the datasets and put them under nberf/data. Since there is no extract_obj.py under nberf, I changed the command to

      (py3_env): python preprocess/extract_obj.py -a data/obj
    

    I got

    usage: extract_obj.py [-h] [--poj-dir POJ_DIR] [--filter-list FILTER_LIST] [--asm_type {x86,mips}] [--output-asm-dir OUTPUT_ASM_DIR]
                                       [--num-workers NUM_WORKERS]
    extract_obj.py: error: ambiguous option: -a could match -asm_t, -asm  
    

    Looking at parse_args in extract_obj.py, there is no '-a' argument; it seems to be a typo of -asm. If I copy the data into preprocess/ and run the command:

      (py3_env) nberf: cp -r data preprocess/data
      (py3_env) nberf: cd preprocess
      (py3_env) nberf/preprocess: python extract_obj.py -asm data/obj
    

    now I got

    Traceback (most recent call last):
      File "extract_obj.py", line 218, in <module>
        main()
      File "extract_obj.py", line 181, in main
        with open(args.filter_list) as f:
    FileNotFoundError: [Errno 2] No such file or directory: './preprocess/filter_list.txt'
    

    My env

    Ubuntu 20.04.2 LTS
    Python 3.8.5
    

    Any steps I misunderstood or missed? Any help is greatly appreciated.

    Dennis

    opened by smartappcoder 3
  • Need to add function.csv to vul.tar.gz

    $ python preprocess/vul_preprocess.py ... FileNotFoundError: [Errno 2] No such file or directory: 'data/vul/function.csv'

    Can you please add the csv file to the vulnerability data that can be downloaded directly?

    opened by darkskytechnology 2
  • amount of GPU memory required

    How much GPU memory is required to run the decompiler? When I run on a GPU with 11 GB of memory, I get out-of-memory errors like:

    File "/data/pfps/nbref/baseline_model/modules/encoder_decoder_layers.py", line 108, in forward
        energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale.cuda()
    RuntimeError: CUDA out of memory. Tried to allocate 60.00 MiB (GPU 0; 10.92 GiB total capacity; 10.15 GiB already allocated; 10.69 MiB free; 10.32 GiB reserved in total by PyTorch)
    
    opened by pfps 1
  • Error in encoder_decoder_layers

    Here's an error when trying to run python run_tree_transformer.py

      File "run_tree_transformer.py", line 236, in <module>
        main()
      File "run_tree_transformer.py", line 212, in main
        , args.device, criterion, max_len_trg, train_flag=True)
      File "~/nbref/baseline_model/data_utils/train_tree_encoder.py", line 346, in train_eval_tree
        trg_in = model.gnn(batch_graph)
      File "~/nbref/venv/lib64/python3/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "~/nbref/baseline_model/modules/transformer_tree_model.py", line 144, in forward
        merged = _merge_on(graphs, 'nodes', 'h')
      File "~/nbref/baseline_model/modules/encoder_decoder_layers.py", line 36, in _merge_on
        feat = Fdgl.pad_packed_tensor(feat, batch_num_objs, 0 )
      File "~/nbref/venv/lib64/python3/site-packages/dgl/backend/pytorch/tensor.py", line 245, in pad_packed_tensor
        index[cum_lengths[:-1]] += (max_len - lengths[:-1])
    IndexError: index 740 is out of bounds for dimension 0 with size 740
    

    Why don't you publish the pre-trained .pt model files for reproducibility?

    opened by h4sh5 1
  • Can't train in a PC with no GPU

    For device = cpu it fails because it tries to run self.scale.cuda() in encoder_decoder_layers and transformer_tree_model. Since self.scale is already moved to the device, this .cuda() call shouldn't be needed. To reproduce, try training on a machine with no GPU: cd baseline_model && python run_similarity_check.py
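
    Below is a small, self-contained sketch of the device-agnostic pattern this comment asks for. The module, names, and shapes are hypothetical (not the repo's actual classes); the point is that registering the scale as a buffer lets model.to(device) handle placement, so no .cuda() call is needed inside forward().

        # Toy illustration of avoiding a hard-coded .cuda() call.
        import torch
        import torch.nn as nn

        class ScaledDotProduct(nn.Module):  # hypothetical minimal module
            def __init__(self, head_dim):
                super().__init__()
                # Registered as a buffer, so it moves with the module's device.
                self.register_buffer("scale", torch.sqrt(torch.tensor(float(head_dim))))

            def forward(self, Q, K):
                # No .cuda() here: self.scale already lives on the module's device.
                return torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale

        attn = ScaledDotProduct(head_dim=64)
        Q = torch.randn(2, 4, 8, 64)  # (batch, heads, seq_len, head_dim)
        K = torch.randn(2, 4, 8, 64)
        print(attn(Q, K).shape)  # torch.Size([2, 4, 8, 8])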

    opened by MCCTT 1
  • Training logic for decompilation

    Hi, I am trying to compare our results with yours, and I am trying to understand the code. Could you please explain your training logic for the decompilation code? It's quite confusing. In your cache_tst_ast_1, each folder corresponds to one C program (I suppose) and there are multiple files inside it; shouldn't it be one AST per program? Also, in the training code it seems you are randomly selecting some nodes (and also some graphs from cache_test_ast_1?) and generating the remaining AST. Could you please explain the logic behind this? Thank you.

    opened by ck-amrahd 0
  • How could we generate AST used by N-Bref by clang?

    We have some programs with the corresponding C source code that we can also compile. How do we generate the AST used by N-Bref with clang and tokenize each node in the AST? How do we generate samples.obj? I didn't find a program that does this in the repo.

    opened by quwenjie 0
  • Predicted AST

    It is mentioned in the paper that the output of the prediction algorithm is a complete AST, but after reading the code we only found a tensor. Is it really the case that no part of the code outputs an AST? If so, how were you able to produce source code as in the paper?

    opened by botta633 0
  • config.py not set up to run with single GPU

    When I run run_tree_transformer.py I get an error:

    Traceback (most recent call last):
      File "run_tree_transformer.py", line 236, in <module>
        main()
      File "run_tree_transformer.py", line 212, in main
        , args.device, criterion, max_len_trg, train_flag=True)
      File "/data/pfps/nbref/baseline_model/data_utils/train_tree_encoder.py", line 246, in train_eval_tree
        model = model.module
      File "/tilde/pfps/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 594, in __getattr__
        type(self).__name__, name))
    AttributeError: 'Transformer' object has no attribute 'module'
    (nbref) pfps@heaviside:~/nbref/baseline_model$ git status
    

    I think this is due to config.py having the wrong defaults for one or two arguments:

            parser.add_argument('--parallel_gpu', action='store_true', default=True)
            parser.add_argument('--dist_gpu', action='store_true', default=True)
    

    This doesn't make sense because it is impossible to turn these options off. I changed the defaults to False, which allows further progress.
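
    A self-contained sketch of the change described above (hypothetical snippet, not the full config.py): with default=False the store_true flags become opt-in instead of being forced on.

        # With default=False the flags can actually be left off.
        import argparse

        parser = argparse.ArgumentParser()
        parser.add_argument('--parallel_gpu', action='store_true', default=False)
        parser.add_argument('--dist_gpu', action='store_true', default=False)

        args = parser.parse_args([])             # no flags passed
        print(args.parallel_gpu, args.dist_gpu)  # False False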

    opened by pfps 0
  • build issues

    Hello. Even after installing the correct Python version 3.7.6 and using pip3.7 install -Iv to install each particular module version, torchtext seems to have difficulties. Can you provide an easier onboarding into nbref, such as an install script or a VM? Thanks!

    opened by fayrlight 0
Owner
Facebook Research
Statistical Analysis 📈 focused on statistical analysis and exploration used on various data sets for personal and professional projects.

Statistical Analysis 📈 This repository focuses on statistical analysis and the exploration used on various data sets for personal and professional pr

Andy Pham 1 Sep 3, 2022
Flenser is a simple, minimal, automated exploratory data analysis tool.

Flenser Have you ever been handed a dataset you've never seen before? Flenser is a simple, minimal, automated exploratory data analysis tool. It runs

John McCambridge 79 Sep 20, 2022
ELFXtract is an automated analysis tool used for enumerating ELF binaries

ELFXtract ELFXtract is an automated analysis tool used for enumerating ELF binaries Powered by Radare2 and r2ghidra This is specially developed for PW

Monish Kumar 49 Nov 28, 2022
This tool parses log data and allows to define analysis pipelines for anomaly detection.

logdata-anomaly-miner This tool parses log data and allows to define analysis pipelines for anomaly detection. It was designed to run the analysis wit

AECID 32 Nov 27, 2022
cLoops2: full stack analysis tool for chromatin interactions

cLoops2: full stack analysis tool for chromatin interactions Introduction cLoops2 is an extension of our previous work, cLoops. From loop-calling base

YaqiangCao 25 Dec 14, 2022
Unsub is a collection analysis tool that assists libraries in analyzing their journal subscriptions.

About Unsub is a collection analysis tool that assists libraries in analyzing their journal subscriptions. The tool provides rich data and a summary g

null 9 Nov 16, 2022
Office365 (Microsoft365) audit log analysis tool

Office365 (Microsoft365) audit log analysis tool The header describes it all WHY?? The first line of code was written long time before other colleague

Anatoly 1 Jul 27, 2022
Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Thomas 2 May 26, 2022
A powerful data analysis package based on mathematical step functions. Strongly aligned with pandas.

The leading use-case for the staircase package is for the creation and analysis of step functions. Pretty exciting huh. But don't hit the close button

null 48 Dec 21, 2022
Python-based Space Physics Environment Data Analysis Software

pySPEDAS pySPEDAS is an implementation of the SPEDAS framework for Python. The Space Physics Environment Data Analysis Software (SPEDAS) framework is

SPEDAS 98 Dec 22, 2022
Spaghetti: an open-source Python library for the analysis of network-based spatial data

pysal/spaghetti SPAtial GrapHs: nETworks, Topology, & Inference Spaghetti is an open-source Python library for the analysis of network-based spatial d

Python Spatial Analysis Library 203 Jan 3, 2023
Autopsy Module to analyze Registry Hives based on bookmarks provided by EricZimmerman for his tool RegistryExplorer

Autopsy Module to analyze Registry Hives based on bookmarks provided by EricZimmerman for his tool RegistryExplorer

Mohammed Hassan 13 Mar 31, 2022
Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis. You write a high level configuration file specifying your in

Blue Collar Bioinformatics 917 Jan 3, 2023
Performance analysis of predictive (alpha) stock factors

Alphalens Alphalens is a Python Library for performance analysis of predictive (alpha) stock factors. Alphalens works great with the Zipline open sour

Quantopian, Inc. 2.5k Jan 9, 2023
Probabilistic reasoning and statistical analysis in TensorFlow

TensorFlow Probability TensorFlow Probability is a library for probabilistic reasoning and statistical analysis in TensorFlow. As part of the TensorFl

null 3.8k Jan 5, 2023
Sensitivity Analysis Library in Python (Numpy). Contains Sobol, Morris, Fractional Factorial and FAST methods.

Sensitivity Analysis Library (SALib) Python implementations of commonly used sensitivity analysis methods. Useful in systems modeling to calculate the

SALib 663 Jan 5, 2023
🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using Python and HoloViz Panel.

Marc Skov Madsen 97 Dec 8, 2022
Scraping and analysis of leetcode-compensations page.

Leetcode compensations report Scraping and analysis of leetcode-compensations page.

utsav 96 Jan 1, 2023