A way to analyse how malware and/or goodware samples vary from each other using Shannon Entropy, Hausdorff Distance and Jaro-Winkler Distance
Introduction
ByteCog is a Python script that helps security researchers and others classify software samples by comparing one or more unknown files against a known sample. The script can also be extended to use a machine learning model for malware classification if you want to do so.
ByteCog combines several methods of analysing and classifying the samples given to it. Shannon Entropy gives researchers a visual view of a sample and helps locate possibly readable code or text inside it. Hausdorff Distance is used to calculate a 'raw similarity' value from the difference between the entropy graphs of both samples. Because the Hausdorff-based value will in most cases be very high when the samples are mostly the same entropy-wise, Jaro-Winkler Distance is then used to adjust that value and obtain the 'true similarity'.
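To make the entropy step concrete, here is a minimal sketch of a chunked Shannon Entropy calculation. It is not ByteCog's own code: the function names `shannon_entropy` and `chunked_entropy` and the default chunk size of 256 bytes are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # H = sum(p * log2(1 / p)) over every byte value that occurs
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def chunked_entropy(data: bytes, chunk_size: int = 256) -> list:
    """Split `data` into fixed-size chunks and return one entropy per chunk.

    The chunk size is a hypothetical default, not a value taken from ByteCog.
    """
    return [
        shannon_entropy(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


if __name__ == "__main__":
    # One chunk of a single repeated byte, one chunk with every byte value once
    sample = b"A" * 256 + bytes(range(256))
    print(chunked_entropy(sample))  # [0.0, 8.0]
```

Low-entropy chunks tend to correspond to padding or plain text, while chunks close to 8 bits per byte usually indicate compressed or encrypted data, which is why plotting the per-chunk values can be useful when inspecting a sample.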
Requirements
- Python 3.5 or later, which you can download from the official Python website.
Installation
Clone this repository to your local machine by following the instructions laid out here.
Then install the dependencies by running the following command in your console:
pip install -r requirements.txt
Usage
======================================================
| ____ __ ______ |
| / __ ) __ __ / /_ ___ / ____/____ ____ |
| / __ |/ / / // __// _ \ / / / __ \ / __ \ |
| / /_/ // /_/ // /_ / __// /___ / /_/ // /_/ / |
| /_____/ \__, / \__/ \___/ \____/ \____/ \__, / |
| /____/ /____/ |
| |
| Version: 0.4 |
| Author: IlluminatiFish |
======================================================
usage: bytecog.py [-h] -k KNOWN -u UNKNOWN -i IDENTIFIER -v VISUAL

Determine whether an unknown provided sample is similar to a known sample

optional arguments:
  -h, --help            show this help message and exit
  -k KNOWN, --known KNOWN
                        The file path to the known sample
  -u UNKNOWN, --unknown UNKNOWN
                        The file path to the unknown sample
  -i IDENTIFIER, --identifier IDENTIFIER
                        The antivirus identifier of the known file
  -v VISUAL, --visual VISUAL
                        If you want to show a visual representation of the file entropy
Features & Use Cases
- Calculates sample similarity
- Generates chunked entropy graph
- Can help distinguish malicious from benign software samples
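As an illustration of the similarity feature, the sketch below computes the Hausdorff Distance between two per-chunk entropy profiles. It is a plain-Python illustration under stated assumptions, not ByteCog's implementation: the helper names and the choice to treat each profile as a set of (chunk index, entropy) points are hypothetical, and this README does not describe how the distance is converted into the reported similarity value.

```python
import math


def entropy_points(entropies):
    """Turn per-chunk entropies into (chunk_index, entropy) points."""
    return [(float(i), float(e)) for i, e in enumerate(entropies)]


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(
        min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in b)
        for p in a
    )


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


if __name__ == "__main__":
    # Made-up entropy profiles, used only to show the calculation
    known = entropy_points([0.0, 8.0, 4.0])
    unknown = entropy_points([0.0, 8.0, 7.0])
    print(hausdorff_distance(known, unknown))  # 3.0
    print(hausdorff_distance(known, known))    # 0.0
```

A distance of 0.0 means the two profiles coincide, and larger values mean at least one chunk of one sample has no close counterpart in the other. Two profiles can still be close while the underlying bytes differ, which matches the reason the introduction gives for applying a second measure.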
License
ByteCog - A way to analyse how malware and/or goodware samples vary from each other using Shannon Entropy, Hausdorff Distance and Jaro-Winkler Distance
Copyright (c) 2021 IlluminatiFish
This program is free software; you can redistribute it and/or modify the code base under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but without ANY warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/
Acknowledgements
- Uses a modified version of @venkat-abhi's Shannon Entropy calculator, adapted to work with this script; you can find the original one here.
- Uses the fastest method to get the maximum key from a dictionary, based on this snippet here.
References
Entropy Wiki
Jaro-Winkler Distance Wiki
Hausdorff Distance Wiki
Shannon Calculator
Referenced Article #1
Referenced Paper #1
Referenced Paper #2
Referenced Paper #3