Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph

Overview

Model Description

Open-CyKG is a framework that is constructed using an attention-based neural Open Information Extraction (OIE) model to extract valuable cyber threat information from unstructured Advanced Persistent Threat (APT) reports. More specifically, we first identify relevant entities by developing a neural cybersecurity Named Entity Recognizer (NER) that aids in labeling relation triples generated by the OIE model. Afterwards, the extracted structured data is canonicalized to build the KG by employing fusion techniques using word embeddings.
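The canonicalization step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the toy three-dimensional vectors, the 0.9 similarity threshold, and the helper names `cosine` and `canonicalize` are all assumptions made for illustration. The idea shown is only that mentions whose embeddings are close in cosine similarity are merged under one canonical name, so that triples about the same entity attach to a single node.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def canonicalize(embeddings, threshold=0.9):
    """Greedily map each mention to the first earlier mention it resembles.

    `threshold` is a hypothetical cut-off, not a value from the paper.
    """
    canonical = {}        # mention -> canonical mention
    representatives = []  # canonical mentions seen so far
    for mention, vec in embeddings.items():
        for rep in representatives:
            if cosine(vec, embeddings[rep]) >= threshold:
                canonical[mention] = rep
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            representatives.append(mention)
            canonical[mention] = mention
    return canonical

# Toy vectors standing in for real word embeddings (hypothetical values).
toy_embeddings = {
    "WannaCry": [0.9, 0.1, 0.0],
    "WannaCrypt": [0.88, 0.12, 0.01],
    "Microsoft": [0.0, 0.2, 0.95],
}
mapping = canonicalize(toy_embeddings)

# Rewrite extracted triples so that both mentions share one node.
triples = [("WannaCrypt", "targets", "Microsoft")]
fused = [(mapping[s], r, mapping[o]) for s, r, o in triples]
print(fused)
```

In practice the vectors would come from the word embeddings of your choice, and the threshold would need tuning on your own data.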

Datasets

  • OIE dataset: Malware DB
  • NER dataset: Microsoft Security Bulletins (MSB) and Cyber Threat Intelligence reports (CTI)

For dataset files, please refer to the appropriate reference in the paper.

Code:

Dependencies

  • Compatible with Python 3.x

  • Dependencies can be installed as specified in Block 1 in the respective notebooks.

  • All the code was implemented on Google Colab using a GPU. Please ensure that you are using the versions specified in the "Installation and Drives" block.

  • Make sure to adapt the code based on your dataset and choice of word embeddings.

  • To utilize CRF in the NER model using Keras, please make sure to:

    -- Use TensorFlow version and Keras version:

    -- In tensorflow_backend.py and Optimizer.py, add the following two lines, then restart the runtime:

      ```
      import tensorflow.compat.v1 as tf
      tf.disable_v2_behavior()
      ```
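As a hedged sketch of the step above, the edit can be scripted instead of done by hand. The source does not say where in each file the two lines belong; this sketch assumes they replace an existing plain `import tensorflow as tf` line, which may not match your Keras version. The helper name `apply_compat_import` is hypothetical, and the demo runs on an in-memory string, not on a real Keras file.

```python
def apply_compat_import(source):
    """Replace each `import tensorflow as tf` line with the two compat lines.

    Assumption: the target file imports TensorFlow with exactly that
    statement. Indentation of the original line is preserved.
    """
    out = []
    for line in source.splitlines(keepends=True):
        if line.strip() == "import tensorflow as tf":
            indent = line[: len(line) - len(line.lstrip())]
            newline = "\n" if line.endswith("\n") else ""
            out.append(indent + "import tensorflow.compat.v1 as tf\n")
            out.append(indent + "tf.disable_v2_behavior()" + newline)
        else:
            out.append(line)
    return "".join(out)

# Demo on a made-up file body (not the real tensorflow_backend.py).
original = (
    "from __future__ import print_function\n"
    "import tensorflow as tf\n"
    "x = 1\n"
)
patched = apply_compat_import(original)
print(patched)
```

Applying the helper a second time leaves the text unchanged, so it is safe to re-run. Remember to restart the runtime after editing the files.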
    

For more details on how the exact process was carried out and the final hyper-parameters used, please refer to the Open-CyKG paper.

Citing:

Please cite Open-CyKG if you use any of this material in your work.

I. Sarhan and M. Spruit, Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph, Knowledge-Based Systems (2021), doi: https://doi.org/10.1016/j.knosys.2021.107524.

@article{SARHAN2021107524,
title = {Open-CyKG: An Open Cyber Threat Intelligence Knowledge Graph},
journal = {Knowledge-Based Systems},
volume = {233},
pages = {107524},
year = {2021},
issn = {0950-7051},
doi = {https://doi.org/10.1016/j.knosys.2021.107524},
url = {https://www.sciencedirect.com/science/article/pii/S0950705121007863},
author = {Injy Sarhan and Marco Spruit},
keywords = {Cyber Threat Intelligence, Knowledge Graph, Named Entity Recognition, Open Information Extraction, Attention network},
abstract = {Instant analysis of cybersecurity reports is a fundamental challenge for security experts as an immeasurable amount of cyber information is generated on a daily basis, which necessitates automated information extraction tools to facilitate querying and retrieval of data. Hence, we present Open-CyKG: an Open Cyber Threat Intelligence (CTI) Knowledge Graph (KG) framework that is constructed using an attention-based neural Open Information Extraction (OIE) model to extract valuable cyber threat information from unstructured Advanced Persistent Threat (APT) reports. More specifically, we first identify relevant entities by developing a neural cybersecurity Named Entity Recognizer (NER) that aids in labeling relation triples generated by the OIE model. Afterwards, the extracted structured data is canonicalized to build the KG by employing fusion techniques using word embeddings. As a result, security professionals can execute queries to retrieve valuable information from the Open-CyKG framework. Experimental results demonstrate that our proposed components that build up Open-CyKG outperform state-of-the-art models. Our implementation of Open-CyKG is publicly available at https://github.com/IS5882/Open-CyKG.}
}

Implementation References:

  • Contextualized word embeddings: Flair's word embedding documentation and the Hugging Face list of all pretrained models (https://huggingface.co/transformers/v2.3.0/pretrained_models.html).
  • Functions in blocks 3 and 9 are adapted, with some modifications, from the work of Stanovsky et al. Please refer to and cite their work: Stanovsky, Gabriel, et al. "Supervised open information extraction." Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018.
  • OIE implements Bahdanau attention (https://arxiv.org/pdf/1409.0473.pdf); see also the Towards Data Science blog.
  • NER reference blog
  • Knowledge Graph fusion was motivated by the work of CESI: Vashishth, Shikhar, Prince Jain, and Partha Talukdar. "Cesi: Canonicalizing open knowledge bases using embeddings and side information." Proceedings of the 2018 World Wide Web Conference. 2018.
  • Neo4J was used for Knowledge Graph visualization.

Please cite the appropriate reference(s) in your work.
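For readers unfamiliar with the Bahdanau (additive) attention mentioned in the references, the following is a minimal pure-Python sketch of the scoring rule from that paper: each encoder state h_j is scored as v · tanh(W_s s + W_h h_j), the scores are normalized with a softmax, and the context vector is the weighted sum of encoder states. It does not reproduce the notebook's Keras layer; the function names, the identity weight matrices, and the toy vectors are assumptions chosen only to make the arithmetic easy to follow.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """Return (attention weights, context vector) for one decoder step."""
    projected_state = matvec(W_s, decoder_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Numerically stable softmax over the alignment scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

# Toy parameters (hypothetical; a trained model would learn these).
W_s = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
decoder_state = [0.5, -0.5]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights, context = additive_attention(decoder_state, encoder_states, W_s, W_h, v)
print(weights, context)
```

With these toy values the third encoder state receives the largest weight, because its projection adds most constructively to the decoder state before the tanh.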

Comments
  • Datasets for the notebook

    Hi, greetings from India! I have been following your work for a while and recently came across your repository. I went through your notebooks, but I couldn't find the datasets you used for creating the model in the repository. I would be grateful if you could share the datasets used in all 3 notebooks.

    Regards, Harsh Vardhan Jaiswal

    opened by hvjrocks-ds 10
  • Dataset used for Repo

    Hi, good afternoon from Singapore! Sorry, I hope you don't mind if I ask a few questions. I do apologise for the lengthy questions!

    Question 1: In Open_CyKG_OIE_Model.ipynb there is the line ''' #train_fn="add you file here" '''. I understand you mentioned to get the data from https://justhalf.github.io/publication/2017-07-01-malwaretextdb. However, the notebook contains the following lines of code:

        train_textEI = dfE.groupby(dfE['word_id'].eq(0).cumsum())['word'].apply(lambda x: [' '.join(x)]).tolist()
        train_predIE = dfE.groupby(dfE['word_id'].eq(0).cumsum())['pred'].apply(lambda x: [' '.join(x)]).tolist()

    I am unable to find a dataset that has the columns 'word_id', 'word' and 'pred'.

    Thus I was wondering if you could kindly point me in the direction of the dataset, or explain how you derived the data!

    From the link https://justhalf.github.io/publication/2017-07-01-malwaretextdb I obtained the following data: ANN, TOKENS

    I do not see any file which has ['word_id'] and ['word'].

    Question 2: Also, for the file Open_Cy_KG_NER.ipynb the code is as follows:

        words = list(dframe['words'].unique())
        tags = list(dframe['tag'].unique())
        target = dframe.loc[:, 'tag']

        class sentence(object):
            def __init__(self, df):
                self.n_sent = 1
                self.df = df
                self.empty = False
                agg = lambda s: [(w, p, t) for w, p, t in zip(s['words'].values.tolist(),
                                                              s['POS'].values.tolist(),
                                                              s['tag'].values.tolist())]
                self.grouped = self.df.groupby("sentence_idx").apply(agg)
    

    Similar question, I was wondering where you got the data from that has "words", "POS", "tag"!

    Question 3: Also, I wanted to clarify one thing. Am I right to say that the notebooks are prerequisites of one another, and thus need to be run in the following order?

    1. OIE
    2. NER
    3. KnowledgeGraph

    Question 4: I'm just wondering if you could kindly attach the other relevant datasets, if any. Please accept my kindest apologies for the lengthy question! I'm a big fan of your work and am just trying to get it running!

    opened by malcolm1232 3
  • OIE Dataset

    Hi! I hope I don't trouble you too much again, but may I know if you have the data for the OIE model?

    As per the code from OIE.ipynb:

        df.drop(df.columns[df.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)
        df.word_id = pd.to_numeric(df.word_id, errors='coerce').astype('Int64')
        df.run_id = pd.to_numeric(df.run_id, errors='coerce').astype('Int64')
        df.sent_id = pd.to_numeric(df.sent_id, errors='coerce').astype('Int64')
        df.head_pred_id = pd.to_numeric(df.head_pred_id, errors='coerce').astype('Int64')

    I was wondering if you have the dataset for the OIE notebook. The data you provided was only for NER.ipynb, and the dataset you used seemed different! I'm terribly sorry for the inconvenience! I just really like your notebook and have been trying to get it working since then!

    Cheers!

    opened by malcolm1232 2
  • spacy_wrapper

    Hi, regarding the following line of code in OIE.ipynb:

        from spacy_wrapper import spacy_whitespace_parser as spacy_ws

    By any chance, do you have the file for it? I searched high and low and am unable to find the spacy_wrapper library.

    opened by malcolm1232 2
  • Output files for Knowledge Graph

    Hi, Greetings! I was going through the Knowledge Graph notebook, I would be extremely grateful if you could share the dataset used in the Knowledge Graph notebook i.e. the output generated from OIE and NER model.

    Thanks in advance! Regards, Harsh Vardhan Jaiswal

    opened by hvjrocks-ds 1
  • Dataset for CYKG

    Hello again Injy Sarhan, I know you're a busy woman! I have been trying to run the notebook Open_CyKG__Knowledge_Graph_Canonicalization.ipynb.

    To Narrow down on the dataset for the knowledge graph (Open_CyKG__Knowledge_Graph_Canonicalization.ipynb), I was able to run the CY-KG, but i am missing the dataset which requires the following header:

        words=df['word']
        labels=df['predicted_label']
        POS=df['pos']
        trueLabels=df['true_label']

        df['finalSentID']=df['originalSentID']
        df['FinalWord']=df['NERword']

        Ner1=df['NERpredicted1']
        Ner2=df['NERpredicted2']

    Also, there is this line of code (hope it helps!): df.to_csv("/content/gdrive/MyDrive/my Personal work/Open-CyKG/MalwareDB_dataset_csv_exl/OIEoutput(10Feb).csv", index=False)

    A huge thank you in advance!!! :) I hope it doesn't inconvenience you!

    opened by malcolm1232 1
  • OIE Dataset

    Hi, I was going through the Knowledge Graph notebook.

    Could you share the dataset ('MLB_all_csv') used as input for the OIE task? Could you also share the dataset used in the Knowledge Graph notebook, i.e. the output generated from the OIE and NER models? Can you please give me access to the datasets that are needed to successfully run the OIE notebook? My email is: [email protected]

    Thank U so much!!

    opened by rlagywns0213 0
  • Datasets for OIE notebook

    Hi Injy, thank you for such a quick response on the previous issue regarding the NER model. I went through the Google Drive you shared, but I couldn't find the dataset for the OIE model notebook. Can you please also share the train, dev and test datasets for the OIE model notebook? Thank you for your support.

    Regards, Harsh Vardhan Jaiswal

    opened by hvjrocks-ds 0
  • OIE Dataset

    Hello, I am doing my graduation thesis and found your paper, your work will help me a lot, can you share the dataset you used for OIE model? My email is: [email protected]. Thank you very much!

    opened by hainamt 0
  • OIE Dataset

    Hi @IS5882, I am interested in your work as well, for my master's thesis on MISP KGs. Can you share the MLB_all_csv and NER data with me as well, please? My mail is [email protected].

    opened by l0renor 1
  • Datasets needed for OIE and NER

    Hi Sarhan, for our NLP course project at Berkeley, we are following your paper on Open-CyKG. Just as another user, Malcolm, explained in one of the posts, we also need the datasets you used for the OIE Python notebook. I downloaded the malwaretextdb database directly from your paper's reference, but it doesn't contain any of the fields required by the downstream code, such as: word_id, word, pred, pred_id, head_pred_id, sent_id, run_id, label.

    Can you please give me access to the datasets that are needed to successfully run the OIE notebook? My email is: [email protected].

    We are in a time crunch here with course deadlines approaching. So would be grateful if you could give us access to the datasets that you used for the OIE and NER notebooks.

    Thanks, Nitin

    opened by nitinpi0210 43
Owner
Injy Sarhan
A repository built on the Flow software package to explore cyber-security attacks on intelligent transportation systems.

George Gunter 4 Nov 14, 2022
A PoC Corporation Relationship Knowledge Graph System on top of Nebula Graph.

Corp-Rel is a PoC of a Corporation Relationship Knowledge Graph System. It's built on top of the Open Source Graph Database: Nebula Graph with a dataset

Wey Gu 20 Dec 11, 2022
[IJCAI-2021] A benchmark of data-free knowledge distillation from paper "Contrastive Model Inversion for Data-Free Knowledge Distillation"

DataFree A benchmark of data-free knowledge distillation from paper "Contrastive Model Inversion for Data-Free Knowledge Distillation" Authors: Gongfa

ZJU-VIPA 47 Jan 9, 2023
TF2 implementation of knowledge distillation using the "function matching" hypothesis from the paper Knowledge distillation: A good teacher is patient and consistent by Beyer et al.

FunMatch-Distillation TF2 implementation of knowledge distillation using the "function matching" hypothesis from the paper Knowledge distillation: A g

Sayak Paul 67 Dec 20, 2022
Source Code for our paper: Understand me, if you refer to Aspect Knowledge: Knowledge-aware Gated Recurrent Memory Network

KaGRMN-DSG_ABSA This repository contains the PyTorch source Code for our paper: Understand me, if you refer to Aspect Knowledge: Knowledge-aware Gated

XingBowen 4 May 20, 2022
AI Flow is an open source framework that bridges big data and artificial intelligence.

Flink AI Flow Introduction Flink AI Flow is an open source framework that bridges big data and artificial intelligence. It manages the entire machine

null 144 Dec 30, 2022
This is an open-source toolkit for Heterogeneous Graph Neural Network(OpenHGNN) based on DGL [Deep Graph Library] and PyTorch.

BUPT GAMMA Lab 519 Jan 2, 2023
Learning Intents behind Interactions with Knowledge Graph for Recommendation, WWW2021

Learning Intents behind Interactions with Knowledge Graph for Recommendation This is our PyTorch implementation for the paper: Xiang Wang, Tinglin Hua

null 158 Dec 15, 2022
Y. Zhang, Q. Yao, W. Dai, L. Chen. AutoSF: Searching Scoring Functions for Knowledge Graph Embedding. IEEE International Conference on Data Engineering (ICDE). 2020

AutoSF The code for our paper "AutoSF: Searching Scoring Functions for Knowledge Graph Embedding" and this paper has been accepted by ICDE2020. News:

AutoML Research 64 Dec 17, 2022
KE-Dialogue: Injecting knowledge graph into a fully end-to-end dialogue system.

Learning Knowledge Bases with Parameters for Task-Oriented Dialogue Systems This is the implementation of the paper: Learning Knowledge Bases with Par

CAiRE 42 Nov 10, 2022
Code for paper PairRE: Knowledge Graph Embeddings via Paired Relation Vectors.

PairRE Code for paper PairRE: Knowledge Graph Embeddings via Paired Relation Vectors. This implementation of PairRE for Open Graph Benchmark datasets (

Alipay 65 Dec 19, 2022
This is the repo for the paper `SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge Graph Summarization'. (published in Bioinformatics'21)

SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge Graph Summarization This is the code for our paper ``SumGNN: Multi-typed Drug

Yue Yu 58 Dec 21, 2022
Paddle implementation for "Highly Efficient Knowledge Graph Embedding Learning with Closed-Form Orthogonal Procrustes Analysis" (NAACL 2021)

ProcrustEs-KGE Paddle implementation for Highly Efficient Knowledge Graph Embedding Learning with Orthogonal Procrustes Analysis ?? A more detailed re

Lincedo Lab 4 Jun 9, 2021
Using pretrained language models for biomedical knowledge graph completion.

LMs for biomedical KG completion This repository contains code to run the experiments described in: Scientific Language Models for Biomedical Knowledg

Rahul Nadkarni 41 Nov 30, 2022
🤖 A Python library for learning and evaluating knowledge graph embeddings

PyKEEN PyKEEN (Python KnowlEdge EmbeddiNgs) is a Python package designed to train and evaluate knowledge graph embedding models (incorporating multi-m

PyKEEN 1.1k Jan 9, 2023
TuckER: Tensor Factorization for Knowledge Graph Completion

TuckER: Tensor Factorization for Knowledge Graph Completion This codebase contains PyTorch implementation of the paper: TuckER: Tensor Factorization f

Ivana Balazevic 296 Dec 6, 2022
Code for ICCV 2021 paper "Distilling Holistic Knowledge with Graph Neural Networks"

HKD Code for ICCV 2021 paper "Distilling Holistic Knowledge with Graph Neural Networks" cifia-100 result The implementation of compared methods are ba

Wang Yucheng 30 Dec 18, 2022
🌈 PyTorch Implementation for EMNLP'21 Findings "Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer"

SGLKT-VisDial Pytorch Implementation for the paper: Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer Gi-Cheon Kang, Junseok P

Gi-Cheon Kang 9 Jul 5, 2022
ZSL-KG is a general-purpose zero-shot learning framework with a novel transformer graph convolutional network (TrGCN) to learn class representation from common sense knowledge graphs.

ZSL-KG is a general-purpose zero-shot learning framework with a novel transformer graph convolutional network (TrGCN) to learn class representa

Bats Research 94 Nov 21, 2022