TL;DR
A dataset of scam emails is scraped from an anti-fraud website. The dataset is then deduplicated using MinHash and LSH. The deduplicated dataset is used for fine-tuning GPT-2.
Comic stolen from Agent-X Comics.
📖
Table of contents
- [Project Description](#project-description)
- [Shared Files](#shared-files)
- [Requirements](#requirements)
- [Installation](#installation)
- [Usage](#usage)
☁️
Project Description
Objective
The goal of this project is to create a new dataset of fraudulent emails that can advance research on intelligent email assistants.
Web Scraper
Data is scraped from the website https://antifraudintl.org/. First, a set of thread URLs is collected and stored. Then, each thread is searched for emails. For each thread, at most one email is kept, as the rest are duplicates. Metadata (subject, date, etc.) is removed. The resulting dataset is stored in a CSV file.
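A minimal sketch of that pipeline is shown below, assuming `requests` and `BeautifulSoup`; the CSS selectors and the metadata filter are hypothetical stand-ins for the forum's actual markup, and the real `create_dataset.py` is driven by a YAML config, so details will differ.

```python
import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://antifraudintl.org/"  # forum hosting the scam emails


def collect_thread_urls(forum_url, pages=1):
    """Collect and store a set of thread URLs from the forum listing pages."""
    urls = set()
    for page in range(1, pages + 1):
        html = requests.get(f"{forum_url}?page={page}", timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        # Hypothetical selector: adjust to the forum's real thread-link markup.
        for a in soup.select("a[href*='/threads/']"):
            urls.add(urljoin(BASE_URL, a["href"]))
    return sorted(urls)


def extract_email(thread_url):
    """Keep at most one email per thread; later posts mostly re-quote the first."""
    html = requests.get(thread_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    posts = soup.select("article")  # hypothetical post container
    if not posts:
        return None
    text = posts[0].get_text("\n", strip=True)
    # Strip metadata lines such as "Subject:" or "Date:".
    lines = [line for line in text.splitlines()
             if not line.lower().startswith(("subject:", "date:", "from:", "to:"))]
    return "\n".join(lines)


def build_dataset(forum_url, out_path="dataset.csv"):
    """Write one scraped email per row into a CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])
        for url in collect_thread_urls(forum_url):
            email = extract_email(url)
            if email:
                writer.writerow([email])
```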
Deduplication
To avoid the quadratic complexity of comparing every pair of documents, a cheaper alternative is used: MinHash and LSH via the datasketch library. For each document, this method efficiently locates its nearest neighbors. Because this leads to a large number of false negatives (i.e. duplicate documents that are classified as non-duplicates), the approach is extended by building a duplicate graph. Nodes in this graph represent documents; two nodes are connected with an edge if their documents have been classified as duplicates. To deduplicate the dataset, the connected components of the graph are located and only a single node is kept from each component, chosen by a readability criterion.
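A minimal sketch of this approach, using `datasketch` for MinHash/LSH and `networkx` for the duplicate graph; the number of permutations, the Jaccard threshold, and the document-length proxy used in place of the readability criterion are assumptions, not the project's actual settings.

```python
import re

import networkx as nx
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128   # number of MinHash permutations (assumed value)
THRESHOLD = 0.8  # Jaccard similarity threshold for LSH (assumed value)


def minhash(text):
    """Build a MinHash signature from the document's word set."""
    m = MinHash(num_perm=NUM_PERM)
    for token in set(re.findall(r"\w+", text.lower())):
        m.update(token.encode("utf-8"))
    return m


def deduplicate(documents):
    """Return one representative document per connected component of duplicates."""
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    signatures = {}
    for idx, doc in enumerate(documents):
        signatures[idx] = minhash(doc)
        lsh.insert(idx, signatures[idx])

    # Duplicate graph: an edge connects two documents flagged as near-duplicates.
    graph = nx.Graph()
    graph.add_nodes_from(range(len(documents)))
    for idx, sig in signatures.items():
        for neighbor in lsh.query(sig):
            if neighbor != idx:
                graph.add_edge(idx, neighbor)

    # Keep a single node per connected component; document length stands in
    # here for the project's readability criterion.
    keep = [max(component, key=lambda i: len(documents[i]))
            for component in nx.connected_components(graph)]
    return [documents[i] for i in sorted(keep)]
```

Taking connected components means two emails land in the same group as long as a chain of pairwise matches links them, even if LSH never flags that particular pair directly.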
GPT-2
A small pretrained GPT-2 model from the Hugging Face Transformers library is fine-tuned on the deduplicated dataset. A collection of ~~cherry-picked~~ randomly selected generated samples can be found here.
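A minimal fine-tuning sketch with the `transformers` and `datasets` libraries, assuming the deduplicated emails sit in a CSV file with a `text` column; the file name, hyperparameters, and output directory are placeholders, since `gpt2_train.py` reads its settings from a YAML config.

```python
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")  # the small 124M-parameter model

# "deduplicated.csv" with a "text" column is an assumed file name and layout.
dataset = load_dataset("csv", data_files="deduplicated.csv", split="train")


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)


tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt2-scam",          # placeholder checkpoint directory
        num_train_epochs=1,
        per_device_train_batch_size=4,
        save_strategy="epoch",
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```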
📁
Shared Files
| Resource | Size | #Samples | Link |
|---|---|---|---|
| Full dataset | 128.5 MB | 85,160 | Link |
| Deduplicated dataset | 74.2 MB | 58,227 | Link |
| Thread URLs | 6.4 MB | 95,324 | Link |
| GPT-2 Checkpoints | ~1.5 GB | - | Link |
🧰
Requirements
See requirements.txt.
⚙️
Installation
$ git clone https://github.com/davidsvy/Neural-Scam-Artist
$ cd Neural-Scam-Artist
$ pip install -r requirements.txt
🧻
Usage
To generate the dataset (~3 hours on Colab):
$ python create_dataset.py [-c configs/create_dataset.yaml]
To deduplicate the dataset (~30 minutes on Colab):
$ python deduplicate_dataset.py [-c configs/deduplicate_dataset.yaml]
To train GPT-2 (~3 hours/epoch on Colab with a K80 GPU):
$ python gpt2_train.py [-c configs/gpt2_train.yaml]
To generate text with GPT-2:
$ python gpt2_sample.py [-c configs/gpt2_sample.yaml]
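For reference, the sampling step amounts to something like the sketch below; the checkpoint path and decoding parameters are assumptions, as the actual values come from configs/gpt2_sample.yaml.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2-scam")  # placeholder checkpoint path

prompt = "Dear friend,"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=200,
    do_sample=True,    # sample instead of greedy decoding
    top_p=0.95,        # assumed nucleus-sampling cutoff
    temperature=1.0,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```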