Using BERT-based models for toxic span detection

Ravika Nagpal

Last update: Jan 4, 2022

SemEval 2021 Task 5: Toxic Spans Detection

Task:

The competition page for SemEval-2021 Task 5 (Toxic Spans Detection) is https://competitions.codalab.org/competitions/25623

References:

https://huggingface.co/docs/transformers/training - How to train (fine-tune) a model.
https://huggingface.co/docs/transformers/model_doc/roberta - The RoBERTa model and its tokenizer.
https://huggingface.co/docs/transformers/model_doc/distilbert - The DistilBERT model and its tokenizer.
https://github.com/huggingface/transformers/issues/14305 - Post-processing predicted labels into spans.
https://github.com/huggingface/notebooks/blob/master/examples/token_classification-tf.ipynb - The function tokenize_and_align_labels() was copied from this Hugging Face tutorial notebook, and its steps were followed to fine-tune the model on a custom dataset.
https://github.com/ipavlopoulos/toxic_spans/blob/master/evaluation/metrics.py - The F1 score function provided by the competition, modified to accommodate our model's output.
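Two of the references above concern turning token-level predictions into spans and scoring them with F1. The sketch below illustrates one way those two steps could fit together; it is not the code from this repository. It assumes that each token comes with a (start, end) character offset pair, as fast Hugging Face tokenizers can return, that label 1 marks a toxic token, and that scoring compares sets of character offsets. The function names labels_to_char_offsets and char_f1, the example sentence, and the offsets are all hypothetical.

```python
from typing import List, Sequence, Tuple


def labels_to_char_offsets(
    offset_mapping: Sequence[Tuple[int, int]],
    token_labels: Sequence[int],
) -> List[int]:
    """Expand per-token labels (1 = toxic) into sorted character offsets."""
    toxic_chars = set()
    for (start, end), label in zip(offset_mapping, token_labels):
        # Tokens with an empty span, e.g. special tokens reported as (0, 0),
        # cover no characters and are skipped.
        if label == 1 and end > start:
            toxic_chars.update(range(start, end))
    return sorted(toxic_chars)


def char_f1(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """F1 between two collections of character offsets."""
    if len(gold) == 0:
        # Assumed convention: a post with no toxic span scores 1 only if
        # nothing was predicted either.
        return 1.0 if len(predicted) == 0 else 0.0
    if len(predicted) == 0:
        return 0.0
    pred_set, gold_set = set(predicted), set(gold)
    overlap = len(pred_set & gold_set)
    return 2 * overlap / (len(pred_set) + len(gold_set))


# Hypothetical example: offsets are written by hand for illustration,
# with (0, 0) standing in for special tokens at both ends.
text = "You are a stupid person"
offsets = [(0, 0), (0, 3), (4, 7), (8, 9), (10, 16), (17, 23), (0, 0)]
labels = [0, 0, 0, 0, 1, 0, 0]

predicted = labels_to_char_offsets(offsets, labels)
gold = list(range(10, 16))  # character offsets of "stupid"

print(predicted)
print(char_f1(predicted, gold))
```

Note that this sketch does not fill the whitespace between two adjacent toxic tokens; if the gold spans include those characters, an extra merging step would be needed.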
