Ecommerce product title recognition package

Overview

revizor Test & Lint codecov

This package solves task of splitting product title string into components, like type, brand, model and article (or SKU or product code or you name it).
Imagine classic named entity recognition, but recognition done on product titles.

Install

revizor requires python 3.8+ version on Linux or macOS, Windows isn't supported now, but contributions are welcome.

$ pip install revizor

Usage

from revizor.tagger import ProductTagger

tagger = ProductTagger()
product = tagger.predict("Смартфон Apple iPhone 12 Pro 128 gb Gold (CY.563781.P273)")

assert product.type == "Смартфон"
assert product.brand == "Apple"
assert product.model == "iPhone 12 Pro"
assert product.article == "CY.563781.P273"

Boring numbers

Actually, just output from flair training log:

Corpus: "Corpus: 138959 train + 15440 dev + 51467 test sentences"
Results:
- F1-score (micro) 0.8843
- F1-score (macro) 0.8766

By class:
ARTICLE    tp: 9893 - fp: 1899 - fn: 3268 - precision: 0.8390 - recall: 0.7517 - f1-score: 0.7929
BRAND      tp: 47977 - fp: 2335 - fn: 514 - precision: 0.9536 - recall: 0.9894 - f1-score: 0.9712
MODEL      tp: 35187 - fp: 11824 - fn: 9995 - precision: 0.7485 - recall: 0.7788 - f1-score: 0.7633
TYPE       tp: 25044 - fp: 637 - fn: 443 - precision: 0.9752 - recall: 0.9826 - f1-score: 0.9789

Dataset

Model was trained on automatically annotated corpus. Since it may be affected by DMCA, we'll not publish it.
But we can give hint on how to obtain it, don't we?
Dataset can be created by scrapping any large marketplace, like goods, yandex.market or ozon.
We extract product title and table with product info, then we parse brand and model strings from product info table.
Now we have product title, brand and model. Then we can split product title by brand string, e.g.:

product_title = "Смартфон Apple iPhone 12 Pro 128 Gb Space Gray"
brand = "Apple"
model = "iPhone 12 Pro"

product_type, product_model_plus_some_random_info = product_title.split(brand)

product_type # => 'Смартфон'
product_model_plus_some_random_info # => 'iPhone 12 Pro 128 Gb Space Gray'

License

This package is licensed under MIT license.

You might also like...
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Negative sampling for solving the unlabeled entity problem in NER. ICLR-2021 paper: Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition.

Negative Sampling for NER Unlabeled entity problem is prevalent in many NER scenarios (e.g., weakly supervised NER). Our paper in ICLR-2021 proposes u

Pytorch-Named-Entity-Recognition-with-BERT
Pytorch-Named-Entity-Recognition-with-BERT

BERT NER Use google BERT to do CoNLL-2003 NER ! Train model using Python and Inference using C++ ALBERT-TF2.0 BERT-NER-TENSORFLOW-2.0 BERT-SQuAD Requi

Espresso: A Fast End-to-End Neural Speech Recognition Toolkit
Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

Espresso Espresso is an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning libra

Modular and extensible speech recognition library leveraging pytorch-lightning and hydra.

Lightning ASR Modular and extensible speech recognition library leveraging pytorch-lightning and hydra What is Lightning ASR • Installation • Get Star

voice2json is a collection of command-line tools for offline speech/intent recognition on Linux
voice2json is a collection of command-line tools for offline speech/intent recognition on Linux

Command-line tools for speech and intent recognition on Linux

Comments
  • How can I tweak the model?

    How can I tweak the model?

    Hi dveselov,

    Great library! I'm looking to tweak the model with a dataset of my own (~100k product titles and metadata). How can I do so? I see your .pt model file but can't find the code you used to generate it.

    Thanks 😄

    opened by kehphin 1
Owner
Bureaucratic Labs
We do natural language processing services
Bureaucratic Labs
CCKS-Title-based-large-scale-commodity-entity-retrieval-top1

- 基于标题的大规模商品实体检索top1 一、任务介绍 CCKS 2020:基于标题的大规模商品实体检索,任务为对于给定的一个商品标题,参赛系统需要匹配到该标题在给定商品库中的对应商品实体。 输入:输入文件包括若干行商品标题。 输出:输出文本每一行包括此标题对应的商品实体,即给定知识库中商品 ID,

null 43 Nov 11, 2022
Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingwai

TextCortex - HemingwAI Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingw

TextCortex AI 27 Nov 28, 2022
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

CRNN paper:An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition 1. create your ow

Tsukinousag1 3 Apr 2, 2022
Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

NeuroNER NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com. This page gives step-by-step instructions to insta

Franck Dernoncourt 1.6k Dec 27, 2022
Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks

NERDA Not only is NERDA a mesmerizing muppet-like character. NERDA is also a python package, that offers a slick easy-to-use interface for fine-tuning

Ekstra Bladet 141 Dec 30, 2022
CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

CrossNER is a fully-labeled collected of named entity recognition (NER) data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specialized entity categories for different domains.

Zihan Liu 89 Nov 10, 2022
PhoNLP: A BERT-based multi-task learning toolkit for part-of-speech tagging, named entity recognition and dependency parsing

PhoNLP is a multi-task learning model for joint part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT for each task independently.

VinAI Research 109 Dec 2, 2022
Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging and so on.

anaGo anaGo is a Python library for sequence labeling(NER, PoS Tagging,...), implemented in Keras. anaGo can solve sequence labeling tasks such as nam

Hiroki Nakayama 1.5k Dec 5, 2022