Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph

Injy Sarhan

Last update: Jan 5, 2023

Model Description

Open-CyKG is a framework that is constructed using an attention-based neural Open Information Extraction (OIE) model to extract valuable cyber threat information from unstructured Advanced Persistent Threat (APT) reports. More specifically, we first identify relevant entities by developing a neural cybersecurity Named Entity Recognizer (NER) that aids in labeling relation triples generated by the OIE model. Afterwards, the extracted structured data is canonicalized to build the KG by employing fusion techniques using word embeddings.
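
The canonicalization step described above can be illustrated with a small sketch. It is not the authors' implementation: the toy 3-dimensional vectors, the `cosine` and `canonicalize` helpers, the example mentions, and the 0.9 similarity threshold are all assumptions made for illustration. In practice the vectors would come from a pretrained word-embedding model. The only idea shown is that mentions whose embeddings are close (by cosine similarity) are merged under one canonical node before triples are added to the graph.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def canonicalize(embeddings, threshold=0.9):
    """Greedily map each mention to the first earlier mention that is similar enough.

    embeddings: dict of mention -> vector (hypothetical toy values below).
    Returns a dict of mention -> canonical mention.
    """
    representatives = []  # canonical mentions, in first-seen order
    mapping = {}
    for mention, vec in embeddings.items():
        for rep in representatives:
            if cosine(vec, embeddings[rep]) >= threshold:
                mapping[mention] = rep
                break
        else:
            # No existing cluster is close enough, so start a new one.
            representatives.append(mention)
            mapping[mention] = mention
    return mapping

# Toy vectors (assumed values, not real embeddings).
toy = {
    "trojan": [0.9, 0.1, 0.0],
    "trojan horse": [0.88, 0.12, 0.01],
    "spear phishing": [0.0, 0.2, 0.95],
}
canonical = canonicalize(toy)

# Rewrite a hypothetical extracted triple using the canonical names.
triples = [("trojan horse", "opens", "backdoor")]
merged = [(canonical.get(s, s), r, canonical.get(o, o)) for s, r, o in triples]
print(merged)
```

A greedy single pass keeps the sketch short; a clustering algorithm over the same similarity measure would be a natural alternative.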

Datasets

OIE dataset: Malware DB
NER dataset: Microsoft Security Bulletins (MSB) and Cyber Threat Intelligence reports (CTI)

For dataset files, please refer to the appropriate reference in the paper.

Code:

Dependencies

Compatible with Python 3.x
Dependencies can be installed as specified in Block 1 in the respective notebooks.
All the code was implemented on Google Colab using a GPU. Please ensure that you are using the versions specified in the "Installion and Drives" block.
Make sure to adapt the code based on your dataset and choice of word embeddings.
To utilize CRF in the NER model using Keras, please make sure to:

-- Use TensorFlow version and Keras version:

-- In tensorflow_backend.py and Optimizer.py, add the following 2 lines, then restart the runtime:
```
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
```

For more details on how the exact process was carried out and the final hyper-parameters used, please refer to the Open-CyKG paper.

Citing:

Please cite Open-CyKG if you use any of this material in your work.

I. Sarhan and M. Spruit, Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph, Knowledge-Based Systems (2021), doi: https://doi.org/10.1016/j.knosys.2021.107524.

@article{SARHAN2021107524,
title = {Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph},
journal = {Knowledge-Based Systems},
volume = {233},
pages = {107524},
year = {2021},
issn = {0950-7051},
doi = {https://doi.org/10.1016/j.knosys.2021.107524},
url = {https://www.sciencedirect.com/science/article/pii/S0950705121007863},
author = {Injy Sarhan and Marco Spruit},
keywords = {Cyber Threat Intelligence, Knowledge Graph, Named Entity Recognition, Open Information Extraction, Attention network},
abstract = {Instant analysis of cybersecurity reports is a fundamental challenge for security experts as an immeasurable amount of cyber information is generated on a daily basis, which necessitates automated information extraction tools to facilitate querying and retrieval of data. Hence, we present Open-CyKG: an Open Cyber Threat Intelligence (CTI) Knowledge Graph (KG) framework that is constructed using an attention-based neural Open Information Extraction (OIE) model to extract valuable cyber threat information from unstructured Advanced Persistent Threat (APT) reports. More specifically, we first identify relevant entities by developing a neural cybersecurity Named Entity Recognizer (NER) that aids in labeling relation triples generated by the OIE model. Afterwards, the extracted structured data is canonicalized to build the KG by employing fusion techniques using word embeddings. As a result, security professionals can execute queries to retrieve valuable information from the Open-CyKG framework. Experimental results demonstrate that our proposed components that build up Open-CyKG outperform state-of-the-art models. Our implementation of Open-CyKG is publicly available at https://github.com/IS5882/Open-CyKG.}
}

Implementation References:

Contextualized word embeddings: see Flair's word embedding documentation and the Hugging Face list of all pretrained models: https://huggingface.co/transformers/v2.3.0/pretrained_models.html
Functions in blocks 3 and 9 are adapted, with some modifications, from the work of Stanovsky et al. Please refer to and cite their work: Stanovsky, Gabriel, et al. "Supervised open information extraction." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018.
OIE implements Bahdanau attention (https://arxiv.org/pdf/1409.0473.pdf); see also the Towards Data Science blog.
NER reference blog
Knowledge Graph fusion was motivated by the work of CESI: Vashishth, Shikhar, Prince Jain, and Partha Talukdar. "Cesi: Canonicalizing open knowledge bases using embeddings and side information." Proceedings of the 2018 World Wide Web Conference. 2018.
Neo4J was used for Knowledge Graph visualization.
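
As one possible way to get extracted triples into Neo4j for visualization, the sketch below turns (subject, relation, object) triples into Cypher `MERGE` statements that could be pasted into the Neo4j Browser. The `Entity` label, the `name` property, the helper functions, and the example triple are assumptions made for illustration; the repository's notebooks may load the graph differently.

```python
import re

def rel_type(relation):
    """Turn a free-text relation into an upper-case, underscore-separated relationship type."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", relation).strip("_").upper()
    return cleaned or "RELATED_TO"

def quote(value):
    """Escape backslashes and single quotes for a single-quoted Cypher string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

def triple_to_cypher(subj, relation, obj):
    """Build one statement; MERGE reuses a node if one with the same name already exists."""
    return (
        f"MERGE (s:Entity {{name: {quote(subj)}}}) "
        f"MERGE (o:Entity {{name: {quote(obj)}}}) "
        f"MERGE (s)-[:`{rel_type(relation)}`]->(o);"
    )

# Hypothetical triple, for illustration only.
statement = triple_to_cypher("trojan", "connects to", "command server")
print(statement)
```

Using `MERGE` rather than `CREATE` means that running the statements for many triples does not duplicate an entity that appears in more than one of them.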

Please cite the appropriate reference(s) in your work.

Comments

Datasets for the notebook

Hi, greetings from India! I have been following your work for a while and recently came across your repository. I went through your notebooks, but I couldn't find the datasets you used for creating the model in the repository. I would be grateful if you could share the datasets used in all 3 notebooks.

Regards, Harsh Vardhan Jaiswal

opened by hvjrocks-ds 10

Dataset used for Repo

Hi, good afternoon from Singapore! Sorry, I hope you don't mind if I ask a few questions. I do apologise for the lengthy questions!

Question 1

In Open_CyKG_OIE_Model.ipynb there is a line `#train_fn="add you file here"`. I understand you mentioned getting the data from https://justhalf.github.io/publication/2017-07-01-malwaretextdb

However, consider the following lines of code:

    train_textEI = dfE.groupby(dfE['word_id'].eq(0).cumsum())['word'].apply(lambda x: [' '.join(x)]).tolist()
    train_predIE = dfE.groupby(dfE['word_id'].eq(0).cumsum())['pred'].apply(lambda x: [' '.join(x)]).tolist()

I am unable to find a dataset that has the columns 'word_id', 'word' and 'pred'.

So I was wondering if you could kindly point me in the direction of the dataset, or explain how you derived the data!

From the following link: https://justhalf.github.io/publication/2017-07-01-malwaretextdb I obtained the data, but I do not see any file which has ['word_id'] and ['word'].
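
For readers hitting the same problem: the quoted `groupby` lines imply a token-per-row table in which `word_id` restarts at 0 for every sentence. The sketch below builds a tiny hypothetical frame with that layout and shows how the expression reassembles sentences. The rows and the `pred` values are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical token-per-row layout; all values are invented for illustration.
dfE = pd.DataFrame({
    "word_id": [0, 1, 2, 0, 1, 2],
    "word": ["malware", "drops", "payload", "attacker", "sends", "email"],
    "pred": ["drops", "drops", "drops", "sends", "sends", "sends"],
})

# word_id == 0 marks a sentence start, so the running sum yields one group id per sentence.
sentence_key = dfE["word_id"].eq(0).cumsum()

# Same expressions as in the notebook, with the grouping key factored out.
train_textEI = dfE.groupby(sentence_key)["word"].apply(lambda x: [" ".join(x)]).tolist()
train_predIE = dfE.groupby(sentence_key)["pred"].apply(lambda x: [" ".join(x)]).tolist()

print(train_textEI)
print(train_predIE)
```

Any file converted to this shape (one token per row, `word_id` counting from 0 within each sentence) should pass through these two lines, whatever its original format.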

Question 2

Also, for the file Open_Cy_KG_NER.ipynb the code is as follows:

    words = list(dframe['words'].unique())
    tags = list(dframe['tag'].unique())
    target = dframe.loc[:,'tag']

    class sentence(object):
        def __init__(self, df):
            self.n_sent = 1
            self.df = df
            self.empty = False

            agg = lambda s : [(w, p, t) for w, p, t in zip(s['words'].values.tolist(), s['POS'].values.tolist(), s['tag'].values.tolist())]
            self.grouped = self.df.groupby("sentence_idx").apply(agg)

Similar question, I was wondering where you got the data from that has "words", "POS", "tag"!
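
Similarly, the NER snippet quoted above expects one row per token with `words`, `POS`, `tag`, and `sentence_idx` columns. A minimal sketch with a hypothetical frame shows what the grouping produces. The tokens, POS tags, and entity tags are invented for illustration and do not reflect the real label set, and the sketch iterates over the groups directly instead of calling `.apply`, which is a substitution made for brevity.

```python
import pandas as pd

# Hypothetical token-per-row NER table; all values are invented for illustration.
dframe = pd.DataFrame({
    "sentence_idx": [1, 1, 1, 2, 2],
    "words": ["Trojan", "infects", "hosts", "Patch", "released"],
    "POS": ["NNP", "VBZ", "NNS", "NN", "VBN"],
    "tag": ["B-Malware", "O", "O", "O", "O"],
})

def to_tuples(group):
    """Collect (word, POS, tag) tuples for the tokens of one sentence."""
    return list(zip(group["words"], group["POS"], group["tag"]))

# One list of (word, POS, tag) tuples per sentence, keyed by sentence_idx.
grouped = {idx: to_tuples(group) for idx, group in dframe.groupby("sentence_idx")}

# Vocabulary and tag set, as in the quoted notebook lines.
words = list(dframe["words"].unique())
tags = list(dframe["tag"].unique())

print(grouped[1])
print(tags)
```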

Question 3

Also, I wanted to clarify one thing. Am I right to say that the notebooks are prerequisites of one another, and thus need to be run in the following order?

1. OIE
2. NER
3. KnowledgeGraph

Question 4

I'm just wondering if you could kindly attach the other relevant datasets, if any. Please accept my kindest apologies for the lengthy questions! I'm a big fan of your work and am just trying to get it running!

opened by malcolm1232 3

OIE Dataset

Hi! I hope I don't trouble you too much again, but may I know if you have the data for the OIE model?

As per the code from OIE.ipynb:

    df.drop(df.columns[df.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)
    df.word_id = pd.to_numeric(df.word_id, errors='coerce').astype('Int64')
    df.run_id = pd.to_numeric(df.run_id, errors='coerce').astype('Int64')
    df.sent_id = pd.to_numeric(df.sent_id, errors='coerce').astype('Int64')
    df.head_pred_id = pd.to_numeric(df.head_pred_id, errors='coerce').astype('Int64')

I was wondering if you have the dataset for the OIE notebook. The data you provided was only for NER.ipynb, and the dataset you used seems different! I'm terribly sorry for the inconvenience! I just really like your notebook and have been trying to get it working ever since!

Cheers!

opened by malcolm1232 2

spacy_wrapper

Hi, for the following line of code in OIE.ipynb:

    from spacy_wrapper import spacy_whitespace_parser as spacy_ws

By any chance, do you have the file for it? I searched high and low and am unable to find the spacy_wrapper library.

opened by malcolm1232 2

Output files for Knowledge Graph

Hi, greetings! I was going through the Knowledge Graph notebook. I would be extremely grateful if you could share the dataset used in the Knowledge Graph notebook, i.e. the output generated from the OIE and NER models.

Thanks in advance! Regards, Harsh Vardhan Jaiswal

opened by hvjrocks-ds 1

Dataset for CYKG

Hello again Injy Sarhan, I know you're a busy woman! I have been trying to run the notebook Open_CyKG__Knowledge_Graph_Canonicalization.ipynb.

To narrow down on the dataset for the knowledge graph (Open_CyKG__Knowledge_Graph_Canonicalization.ipynb): I was able to run the CY-KG, but I am missing the dataset, which requires the following headers:

    words=df['word']
    labels=df['predicted_label']
    POS=df['pos']
    trueLabels=df['true_label']

    df['finalSentID']=df['originalSentID']
    df['FinalWord']=df['NERword']

    Ner1=df['NERpredicted1']
    Ner2=df['NERpredicted2']

Also, there is this line of code (hope it helps!):

    df.to_csv("/content/gdrive/MyDrive/my Personal work/Open-CyKG/MalwareDB_dataset_csv_exl/OIEoutput(10Feb).csv", index=False)

A huge thank you in advance!!! :) I hope it doesn't inconvenience you!

opened by malcolm1232 1

OIE Dataset

Hi, I was going through the Knowledge Graph notebook.

Could you share the dataset ('MLB_all_csv') used as input for the OIE task? Could you also share the dataset used in the Knowledge Graph notebook, i.e. the output generated from the OIE and NER models? Can you please give me access to the datasets that are needed to successfully run the OIE notebook? My email is: [email protected]

Thank you so much!!

opened by rlagywns0213 0

Datasets for OIE notebook

Hi Injy, thank you for such a quick response on the previous issue regarding the NER model. I went through the Google Drive you shared, but I couldn't find the dataset for the OIE model notebook. Can you please also share the train, dev and test datasets for the OIE model notebook? Thank you for your support.

Regards, Harsh Vardhan Jaiswal

opened by hvjrocks-ds 0

OIE Dataset

Hello, I am doing my graduation thesis and found your paper. Your work will help me a lot. Can you share the dataset you used for the OIE model? My email is: [email protected]. Thank you very much!

opened by hainamt 0

OIE Dataset

Hi @IS5882, I am interested in your work as well, for my master's thesis on MISP KGs. Can you please share the MLB_all_csv and NER data with me as well? My email is [email protected].

opened by l0renor 1

Datasets needed for OIE and NER

Hi Sarhan, for our NLP course project at Berkeley, we are following your paper on Open-CyKG. Just as another user, Malcolm, explained in one of the posts, we also need the datasets you used for the OIE Python notebook. I downloaded the malwaretextdb database directly from your paper's reference, but it doesn't contain any of the fields required by the downstream code, such as: word_id, word, pred, pred_id, head_pred_id, sent_id, run_id, label.
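
A quick way to tell whether a candidate file matches what the OIE notebook reads is to compare its header against the fields listed above. The sketch below uses only the standard library. The `missing_columns` helper and the in-memory sample are hypothetical, and the required names are simply the ones quoted in this thread.

```python
import csv
import io

# Column names quoted in this thread as required by the OIE notebook.
REQUIRED = ["word_id", "word", "pred", "pred_id", "head_pred_id", "sent_id", "run_id", "label"]

def missing_columns(csv_file, required=REQUIRED):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]

# Hypothetical file contents standing in for a downloaded dataset.
sample = io.StringIO("word_id,word,pred,label\n0,malware,drops,O\n")
missing = missing_columns(sample)
print(missing)
```

With a real file, the same helper can be called as `missing_columns(open(path, newline=""))`, where `path` is wherever the CSV was saved.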

Can you please give me access to the datasets that are needed to successfully run the OIE notebook? My email is: [email protected].

We are in a time crunch here with course deadlines approaching, so we would be grateful if you could give us access to the datasets that you used for the OIE and NER notebooks.

Thanks, Nitin

opened by nitinpi0210 43
