Open solution to the Toxic Comment Classification Challenge

Overview

Starter code: Kaggle Toxic Comment Classification Challenge

More competitions 🎇

Check our collection of public projects 🎁, where you can find multiple Kaggle competitions with code, experiments and outputs.

Here at Neptune, we enjoy participating in Kaggle competitions. The Toxic Comment Classification Challenge is especially interesting because it touches on the important issue of online harassment.

Ensemble our predictions in the cloud!

You need to be registered on neptune.ml to use our predictions in your ensemble models.

  • click start notebook
  • choose the browse button
  • select the neptune_ensembling.ipynb file from this repository
  • choose a worker type: gcp-large is the recommended one
  • run the first few cells to load our predictions on the held-out validation set along with the labels
  • grid-search over many possible parameter options: the more runs you choose, the longer it will take (a minimal sketch of this step is shown below)
  • train your second-level, ensemble model (it should take less than an hour once you have the parameters)
  • load our predictions on the test set
  • feed our test set predictions to your ensemble model and get final predictions
  • save your submission file
  • click on browse files and find your submission file to download it

Running the notebook as-is scored 0.986+ on the leaderboard (LB).
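
If you would rather prototype outside the notebook, the second-level model itself is small. Below is a minimal sketch of the grid-search and stacking step, assuming hypothetical file names (valid_predictions.csv, valid_labels.csv, test_predictions.csv) and a logistic-regression stacker; it illustrates the idea rather than reproducing the exact contents of neptune_ensembling.ipynb.

    # Minimal stacking sketch (hypothetical file names and model choice, not the
    # exact contents of neptune_ensembling.ipynb). First-level predictions on the
    # held-out validation set become features for a second-level model.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

    valid_preds = pd.read_csv('valid_predictions.csv').values   # first-level predictions (validation)
    valid_labels = pd.read_csv('valid_labels.csv')[LABELS].values
    test_preds = pd.read_csv('test_predictions.csv').values     # first-level predictions (test)

    X_tr, X_val, y_tr, y_val = train_test_split(valid_preds, valid_labels,
                                                test_size=0.2, random_state=1234)

    # Grid search over the regularization strength of a per-label logistic
    # regression stacker; the more candidates, the longer it takes.
    best_c, best_auc = None, -np.inf
    for c in [0.01, 0.1, 1.0, 10.0]:
        models = [LogisticRegression(C=c, solver='liblinear').fit(X_tr, y_tr[:, i])
                  for i in range(len(LABELS))]
        auc = np.mean([roc_auc_score(y_val[:, i], m.predict_proba(X_val)[:, 1])
                       for i, m in enumerate(models)])
        if auc > best_auc:
            best_c, best_auc = c, auc

    # Refit on all validation-set predictions and score the test set.
    final_models = [LogisticRegression(C=best_c, solver='liblinear').fit(valid_preds, valid_labels[:, i])
                    for i in range(len(LABELS))]
    final = np.column_stack([m.predict_proba(test_preds)[:, 1] for m in final_models])
    # A real submission file also needs the test-set ids as its first column.
    pd.DataFrame(final, columns=LABELS).to_csv('submission.csv', index=False)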

Disclaimer

In this open-source solution you will find references to neptune.ml. It is a free platform for community users, which we use daily to keep track of our experiments. Please note that using neptune.ml is not necessary to proceed with this solution. You may run it as a plain Python script 😉.

The idea

We are contributing starter code that is easy to use and extend. We did it before with Cdiscount’s Image Classification Challenge, and we believe this is the right way to open data science to the wider community and encourage more people to participate in challenges. This starter is a ready-to-use, end-to-end solution. Since all computations are organized in separate steps (sketched below), it is also easy to extend. Check devbook.ipynb for more information about the different pipelines.
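
To give a feel for what "organized in separate steps" means, here is a rough, illustrative sketch of the step pattern: each step wraps a transformer and declares which steps feed it. The class and method names below are simplified stand-ins rather than the repo's exact API; see steps/base.py and devbook.ipynb for the real implementation.

    # Illustrative sketch of the step/transformer pattern (simplified names,
    # not the repo's exact API): each step wraps a transformer exposing
    # fit_transform, and steps are chained by declaring their input steps.
    class Transformer:
        def fit(self, **kwargs):
            return self

        def transform(self, **kwargs):
            raise NotImplementedError

        def fit_transform(self, **kwargs):
            self.fit(**kwargs)
            return self.transform(**kwargs)


    class ExampleTextCleaner(Transformer):
        """Hypothetical transformer: lower-cases raw comment text."""
        def transform(self, text):
            return {'text': [t.lower() for t in text]}


    class Step:
        def __init__(self, name, transformer, input_steps=None):
            self.name = name
            self.transformer = transformer
            self.input_steps = input_steps or []

        def fit_transform(self, data):
            # Collect the outputs of upstream steps, then run this step's transformer.
            inputs = dict(data)
            for step in self.input_steps:
                inputs.update(step.fit_transform(data))
            return self.transformer.fit_transform(**inputs)


    # Usage: a one-step "pipeline" over raw data; real pipelines chain many steps.
    cleaner = Step('text_cleaner', ExampleTextCleaner())
    print(cleaner.fit_transform({'text': ['Hello WORLD', 'Toxic?']}))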

Now we want to go one step further and invite you to participate in the development of this analysis pipeline. At a later stage of the competition (early February) we will invite top contributors to join our team on Kaggle.

Contributing

You are welcome to extend this pipeline and contribute your own models or procedures. Please refer to CONTRIBUTING for more details.

Installation

option 1: Neptune cloud

on the neptune site

  • log in: neptune account login
  • create new project named toxic: Follow the link Projects (top bar, left side), then click New project button. This action will generate project-key TOX, which is already listed in the neptune.yaml.

run setup commands

$ git clone https://github.com/neptune-ml/kaggle-toxic-starter.git
$ pip3 install neptune-cli
$ neptune login

start experiment

$ neptune send --environment keras-2.0-gpu-py3 --worker gcp-gpu-medium --config best_configs/fasttext_gru.yaml -- train_evaluate_predict_cv_pipeline --pipeline_name fasttext_gru --model_level first

This should get you to 0.9852 on the leaderboard. Happy training :)

Refer to Neptune documentation and Getting started: Neptune Cloud for more.

option 2: local install

Please refer to the Getting started: local instance for installation procedure.

Solution visualization

The end-to-end pipeline is visualized below. You can run exactly this one! (pipeline_001)

We have also prepared something simpler to just get you started:

(pipeline_002)

User support

There are several ways to seek help:

  1. Read the project's Wiki, where we publish descriptions of the code, pipelines and Neptune.
  2. Kaggle discussion is our primary way of communication.
  3. You can submit an issue directly in this repo.
Comments
  • Seems lots of package installing issues with requirements.txt


    I have tried to run "neptune send experiment_manager.py --environment keras-2.0-gpu-py3 --worker gcp-gpu-medium --config neptune_config.yaml -- train_evaluate_predict_pipeline --pipeline_name glove_lstm"

    but got lots of package installation issues. I tried to comment out some lines in requirements.txt, but there seem to be too many. Any solution for this? Or did I just miss something in the configuration? Thanks.

    opened by ymcdull 7
  • Hard to reproduce results locally


    Hi, I was able to run all the models locally (run_end_to_end.sh) but wasn't able to run catboost on top of the models.

    TypeError: fit() missing 1 required positional argument: 'validation_data'

    It looks like I am missing something. Is it possible to reproduce your pipeline without loading data from the cloud?

    opened by lodurality 6
  • Unable to access auth server. Login url is incorrect


    When I run the command "neptune login" I get nothing but "Unable to access auth server. Login url is incorrect". Is there anything wrong with the server?

    opened by rrdssfgcs 5
  • Bugs and errors I ran into during experiment


    Found bugs:

    1. missing "," at the end of line 28 in the file "pipeline_config.py".
    2. missing "import nltk" in the file "/steps/preprocessing.py".

    Error:

    1. FileNotFoundError

           /usr/local/lib/python3.6/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
             from ._conv import register_converters as register_converters
           Using TensorFlow backend.
           Traceback (most recent call last):
             File "/usr/local/lib/python3.6/dist-packages/deepsense/neptune/job_wrapper.py", line 113, in <module>
               execute()
             File "/usr/local/lib/python3.6/dist-packages/deepsense/neptune/job_wrapper.py", line 109, in execute
               execfile(job_filepath, job_globals)
             File "/usr/local/lib/python3.6/dist-packages/past/builtins/misc.py", line 82, in execfile
               exec(code, myglobals, mylocals)
             File "main.py", line 12, in <module>
               from pipelines import PIPELINES
             File "/neptune/pipelines.py", line 8, in <module>
               from steps.preprocessing import XYSplit, TextCleaner, TfidfVectorizer, WordListFilter, Normalizer, TextCounter,
             File "/neptune/steps/preprocessing.py", line 24, in <module>
               with open('../external_data/apostrophes.json', 'r') as f:
           FileNotFoundError: [Errno 2] No such file or directory: '../external_data/apostrophes.json'

     After I copy the data from that file into the code as follows, another error pops up:

         # with open('../external_data/apostrophes.json', 'r') as f:
         #     APPO = json.load(f)
         APPO = { "arent": "are not", ..., "well": "will" }

    2. TypeError

           Traceback (most recent call last):
             File "/usr/local/lib/python3.6/dist-packages/deepsense/neptune/job_wrapper.py", line 113, in <module>
               execute()
             File "/usr/local/lib/python3.6/dist-packages/deepsense/neptune/job_wrapper.py", line 109, in execute
               execfile(job_filepath, job_globals)
             File "/usr/local/lib/python3.6/dist-packages/past/builtins/misc.py", line 82, in execfile
               exec_(code, myglobals, mylocals)
             File "main.py", line 382, in <module>
               action()
             File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 722, in __call__
               return self.main(*args, **kwargs)
             File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 697, in main
               rv = self.invoke(ctx)
             File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 1066, in invoke
               return _process_result(sub_ctx.command.invoke(sub_ctx))
             File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 895, in invoke
               return ctx.invoke(self.callback, **ctx.params)
             File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 535, in invoke
               return callback(*args, **kwargs)
             File "main.py", line 129, in train_evaluate_predict_pipeline
               _train_pipeline(pipeline_name)
             File "main.py", line 68, in _train_pipeline
               _ = pipeline.fit_transform(data)
             File "/neptune/steps/base.py", line 71, in fit_transform
               step_inputs[input_step.name] = input_step.fit_transform(data)
             File "/neptune/steps/base.py", line 77, in fit_transform
               step_output_data = self._cached_fit_transform(step_inputs)
             File "/neptune/steps/base.py", line 91, in _cached_fit_transform
               step_output_data = self.transformer.fit_transform(**step_inputs)
             File "/neptune/steps/base.py", line 213, in fit_transform
               self.fit(*args, **kwargs)
             File "/neptune/models.py", line 54, in fit
               self.callbacks = self._create_callbacks(**self.callbacks_config)
             File "/neptune/models.py", line 30, in _create_callbacks
               neptune = NeptuneMonitor(**kwargs['neptune_monitor'])
             File "/neptune/steps/keras/callbacks.py", line 11, in __init__
               self.batch_loss_channel_name = get_correct_channel_name(self.ctx, 'Batch Log-loss training')
             File "/neptune/steps/keras/callbacks.py", line 38, in get_correct_channel_name
               channels_with_name = [channel for channel in ctx.job._channels if name in channel.name]
             File "/usr/local/lib/python3.6/dist-packages/deepsense/neptune/client_library/context_factory.py", line 91, in job
               , JobPropertyDeprecationWarning)
             File "/usr/local/lib/python3.6/dist-packages/deepsense/neptune/common/utils/neptune_warnings.py", line 52, in neptune_warn
               warnings.warn(message, warning_type)
             File "/usr/lib/python3.6/warnings.py", line 101, in _showwarnmsg
               _showwarnmsg_impl(msg)
             File "/usr/lib/python3.6/warnings.py", line 28, in _showwarnmsg_impl
               text = _formatwarnmsg(msg)
             File "/usr/lib/python3.6/warnings.py", line 116, in _formatwarnmsg
               msg.filename, msg.lineno, line=msg.line)
           TypeError: custom_formatwarning() got an unexpected keyword argument 'line'

    How can I get it running successfully? I am running it on Neptune using the command "neptune send --environment keras-2.0-gpu-py3 --worker gcp-gpu-medium -- train_evaluate_predict_pipeline --pipeline_name glove_lstm". I've tried other pipelines, they all fail :(

    opened by binglixyz 4
  • ModuleNotFoundError: No module named 'seaborn'


    Hi,

    I tried the notebook neptune_ensembling.ipynb on neptune, but I got this error message: ... ModuleNotFoundError: No module named 'seaborn'

    I am not sure which combination of worker type, Python version and leading library I should use so that the seaborn module is installed. Thanks.

    opened by cahya-wirawan 2
  • Vanished experiment_manager.py


    How can I run this code on a local machine? According to your wiki, the file experiment_manager.py exists in the repo and I should use it, but I don't see it. Could you help me?

    https://github.com/neptune-ml/kaggle-toxic-starter/wiki/Experimentation-guideline

    opened by laol777 2
  • How to add parameter to select GPU at runtime?


    Hi, how do we add a command-line parameter to select the GPU at runtime, especially on a multi-GPU machine? I tried adding

    @action.command()
    @click.option('-g', '--gpu', help='select gpu', default='1', required=True)
    def select_gpu(gpu):
        # CUDA_VISIBLE_DEVICES must be a string and has to be set before
        # TensorFlow/Keras is imported to have any effect.
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)


    I tried to add a similar code block for train_evaluate_predict_pipeline and pass the gpu parameter, but I keep getting "invalid option" at runtime. I know it's not a package issue, but the documentation did not help either.

    opened by setuc 2
  • requirements.txt doesn't seem to work.


    When I tried to run $ neptune send experiment_manager.py --environment keras-2.0-gpu-py3 --worker gcp-gpu-medium --config neptune_config.yaml -- train_evaluate_predict_pipeline --pipeline_name glove_lstm, the experiment failed because it failed to import attrdict.

    However, this library is already listed in the requirements, so I guess the program somehow failed to install the requirements?

    opened by stringyao 2
  • old test data


    The test data in the /public folder is old, so the submission output is generated for the old test data. The test data for this competition was recently changed.

    opened by karimiabdullah 2
  • Bump numpy from 1.21.0 to 1.22.0


    Bumps numpy from 1.21.0 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 1
  • Bump ipython from 6.2.1 to 7.16.3


    Bumps ipython from 6.2.1 to 7.16.3.

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.



    dependencies 
    opened by dependabot[bot] 1
  • Congratulations on your silver medal


    Hope to see your final solution to the problem... And thanks for providing the environment for running the models, it made a lot of difference in running massive stackers :)

    opened by setuc 1