Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

Last update: Nov 28, 2022

Neural Scam Artist

TL;DR
A dataset of scam emails is scraped from an anti-fraud website. The dataset is then deduplicated using MinHash and LSH. The deduplicated dataset is used for fine-tuning GPT-2.

Comic stolen from Agent-X Comics.

📖 Table of contents

➤ Project Description
➤ Shared Files
➤ Requirements
➤ Installation
➤ Usage

☁️ Project Description

Objective

The goal of this project is to create a new dataset of fraudulent emails that can advance research on intelligent email assistants.

Web Scraper

Data is scraped from the website https://antifraudintl.org/. First, a set of thread URLs is collected and stored. Then, each thread is searched for emails. For each thread, at most one email is kept, as the rest are duplicates. Metadata (Subject, Date, etc.) is removed. The resulting dataset is stored in a CSV file.
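The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper: the header names in `HEADER_RE` and the helper names `strip_metadata` and `first_email` are assumptions made for this example, and the "first non-empty post" rule is only one plausible way to keep a single email per thread.

```python
import re

# Hypothetical list of header names; the real project may strip a different set.
HEADER_RE = re.compile(r"^(subject|date|from|to|reply-to)\s*:", re.IGNORECASE)


def strip_metadata(email_text):
    """Drop header-like lines (e.g. 'Subject: ...') and surrounding whitespace."""
    kept = [line for line in email_text.splitlines() if not HEADER_RE.match(line)]
    return "\n".join(kept).strip()


def first_email(thread_posts):
    """Keep at most one email per thread: the first post with a non-empty body."""
    for post in thread_posts:
        body = strip_metadata(post)
        if body:
            return body
    return None
```

Calling `first_email` on the list of raw posts of a thread would return a single cleaned email body, or `None` if the thread contains no usable text.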

Deduplication

To avoid the quadratic cost of comparing every pair of documents, a cheaper alternative is used: MinHash and LSH via the datasketch library. For each document, this method efficiently locates its approximate nearest neighbors. Because this leads to a large number of false negatives (i.e. duplicate documents that are classified as non-duplicates), the approach is extended by building a duplicate graph. Nodes in this graph represent documents, and two nodes are connected by an edge if their documents have been classified as duplicates. To deduplicate the dataset, the connected components of the graph are located, and only a single node is kept from each component. A readability criterion is used for the selection.
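The duplicate-graph step can be sketched with a small union-find structure. This is an illustrative sketch under stated assumptions, not the repository's code: `duplicate_pairs` stands in for the pairs reported by the LSH queries, and `score` is a hypothetical placeholder for the readability criterion, which this README does not specify.

```python
from collections import defaultdict


def connected_components(num_docs, duplicate_pairs):
    """Group document indices into connected components of the duplicate graph."""
    parent = list(range(num_docs))

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for doc in range(num_docs):
        groups[find(doc)].append(doc)
    return list(groups.values())


def deduplicate(docs, duplicate_pairs, score):
    """Keep the highest-scoring document of each component (score is assumed)."""
    keep = [
        max(component, key=lambda i: score(docs[i]))
        for component in connected_components(len(docs), duplicate_pairs)
    ]
    return [docs[i] for i in sorted(keep)]
```

Note that two documents end up in the same component even if LSH never paired them directly, as long as a chain of detected duplicates links them. This is how the graph compensates for false negatives.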

GPT-2

A small pretrained GPT-2 model from the Huggingface library is fine-tuned on the deduplicated dataset. A collection of ~~cherry-picked~~ randomly selected generated samples can be found here.

📁 Shared Files

Resource	Size	#Samples	Link
Full dataset	128.5 MB	85,160	Link
Deduplicated dataset	74.2 MB	58,227	Link
Thread urls	6.4 MB	95,324	Link
GPT-2 Checkpoints	~1.5 GB		Link

🧰 Requirements

See requirements.txt.

⚙️ Installation

$ git clone https://github.com/davidsvy/Neural-Scam-Artist
$ cd Neural-Scam-Artist
$ pip install -r requirements.txt

🧻 Usage

To generate the dataset (~3 hours on Colab):

$ python create_dataset.py [-c configs/create_dataset.yaml]

To deduplicate the dataset (~30 minutes on Colab):

$ python deduplicate_dataset.py [-c configs/deduplicate_dataset.yaml]

To train GPT-2 (~3 hours/epoch on Colab with K80):

$ python gpt2_train.py [-c configs/gpt2_train.yaml]

To generate text with GPT-2:

$ python gpt2_sample.py [-c configs/gpt2_sample.yaml]
