Pipeline for chemical image-to-text competition

Maksim Zhdanov

Last update: Sep 20, 2022

Related tags

Text Data & NLP BMS-Molecular-Translation

Overview

BMS-Molecular-Translation

Introduction

This is a pipeline for Bristol-Myers Squibb – Molecular Translation by Vadim Timakin and Maksim Zhdanov. We got bronze medals in this competition. Significant part of code was originated from Y.Nakama's notebook

This competition was about image-to-text translation of images with molecular skeletal strucutures to InChI chemical formula identifiers.

InChI=1S/C16H13Cl2NO3/c1-10-2-4-11(5-3-10)16(21)22-9-15(20)19-14-8-12(17)6-7-13(14)18/h2-8H,9H2,1H3,(H,19,20)

Solution

General Encoder-Decoder concept

Most participants used CNN encoder to acquire features with decoder (LSTM/GRU/Transformer) to get text sequences. That's a casual approach to image captioning problem.

Pseudo-labelling with InChI validation using RDKit

RDKit is an open source toolkit for cheminformatics and it was quite useful while solving the problem. When we trained our first model, it scored around 7-8 on public leaderboard and we decided to make pseudo-labelling on test data. However, in common scenario you get a significant amount of wrong predictions in your extended training set from pseudo-labelling. With RDKit we validated all of our predicted formulas and select around 800k correct samples. Lack of wrong labels in pseudo labels improved the score.

Predictions normalization

This notebook tells about InChI normalization

Blending

Finally, we blended ~20 predictions from 2 models (mostly from different epochs) using RDKit validation to choose only formulas which have possible InChI structure.

Final private LB score 1.79

You might also like...

A full spaCy pipeline and models for scientific/biomedical documents.

This repository contains custom pipes and models related to using spaCy for scientific documents. In particular, there is a custom tokenizer that adds

1.3k Jan 3, 2023

A full spaCy pipeline and models for scientific/biomedical documents.

This repository contains custom pipes and models related to using spaCy for scientific documents. In particular, there is a custom tokenizer that adds

831 Feb 17, 2021

DaCy: The State of the Art Danish NLP pipeline using SpaCy

DaCy: A SpaCy NLP Pipeline for Danish DaCy is a Danish preprocessing pipeline trained in SpaCy. At the time of writing it has achieved State-of-the-Ar

71 Jan 6, 2023

Toy example of an applied ML pipeline for me to experiment with MLOps tools.

Toy Machine Learning Pipeline Table of Contents About Getting Started ML task description and evaluation procedure Dataset description Repository stru

190 Dec 21, 2022

This repository contains all the source code that is needed for the project : An Efficient Pipeline For Bloom’s Taxonomy Using Natural Language Processing and Deep Learning

Pipeline For NLP with Bloom's Taxonomy Using Improved Question Classification and Question Generation using Deep Learning This repository contains all

9 Jul 17, 2021

Simplified diarization pipeline using some pretrained models - audio file to diarized segments in a few lines of code

simple_diarizer Simplified diarization pipeline using some pretrained models. Made to be a simple as possible to go from an input audio file to diariz

65 Dec 30, 2022

A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

Multilingual Latent Dirichlet Allocation (LDA) Pipeline This project is for text clustering using the Latent Dirichlet Allocation (LDA) algorithm. It

74 Oct 7, 2022

MHtyper is an end-to-end pipeline for recognized the Forensic microhaplotypes in Nanopore sequencing data.

MHtyper is an end-to-end pipeline for recognized the Forensic microhaplotypes in Nanopore sequencing data. It is implemented using Python.

6 Jun 27, 2022

BookNLP, a natural language processing pipeline for books

BookNLP BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including: Part-of-speech taggin

654 Jan 2, 2023

Pipeline for chemical image-to-text competition

Related tags

Overview

BMS-Molecular-Translation

Introduction

Solution

General Encoder-Decoder concept

Pseudo-labelling with InChI validation using RDKit

Predictions normalization

Blending

Final private LB score 1.79

You might also like...

A full spaCy pipeline and models for scientific/biomedical documents.

A full spaCy pipeline and models for scientific/biomedical documents.

DaCy: The State of the Art Danish NLP pipeline using SpaCy

Toy example of an applied ML pipeline for me to experiment with MLOps tools.

This repository contains all the source code that is needed for the project : An Efficient Pipeline For Bloom’s Taxonomy Using Natural Language Processing and Deep Learning

Simplified diarization pipeline using some pretrained models - audio file to diarized segments in a few lines of code

A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.

MHtyper is an end-to-end pipeline for recognized the Forensic microhaplotypes in Nanopore sequencing data.

BookNLP, a natural language processing pipeline for books

Owner

Maksim Zhdanov

2021 AI CUP Competition on Traditional Chinese Scene Text Recognition - Intermediate Contest

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Pipeline for fast building text classification TF-IDF + LogReg baselines.

TFIDF-based QA system for AIO2 competition

VampiresVsWerewolves - Our Implementation of a MiniMax algorithm with alpha beta pruning in the context of an in-class competition

Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Text-Summarization-using-NLP - Text Summarization using NLP to fetch BBC News Article and summarize its text and also it includes custom article Summarization

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Code for Text Prior Guided Scene Text Image Super-Resolution

:mag: Transformers at scale for question answering & neural search. Using NLP via a modular Retriever-Reader-Pipeline. Supporting DPR, Elasticsearch, HuggingFace's Modelhub...