## About
A simple implementation of an N-gram language model.
## Requirements
- numpy
## Data preparation
### Corpus
The training data for the N-gram model is a plain-text file with one line of text per sample, like this:
```
曼联加油
懂球直播
有也免费高清的额
直播挺全的
曼联这局肯定胜利
```
During training, each text line is split into tokens by a delimiter. If no delimiter is given (the default), lines are split into individual characters.
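As a minimal sketch of this splitting behavior (the `tokenize` name is hypothetical; the actual logic lives in `train_n_gram.py`):

```python
# Hypothetical illustration of the tokenization described above.
def tokenize(line, delimiter=None):
    line = line.strip()
    if delimiter is None:
        # Default: no delimiter given, split into individual characters.
        return list(line)
    return line.split(delimiter)

print(tokenize("曼联加油"))           # ['曼', '联', '加', '油']
print(tokenize("hello world", " "))   # ['hello', 'world']
```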
### Tokens
The dictionary for the model is a text file with one token per line; every token in the file must be unique (a loading sketch follows the example below).
```
光
衰
戒
颅
阖
```
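A hedged sketch of how such a token file could be loaded into a vocabulary (the file path matches the commands below; the variable names are illustrative, not taken from the repository):

```python
# Read one token per line, skipping blank lines.
with open("data/charset.txt", encoding="utf-8") as f:
    tokens = [line.strip() for line in f if line.strip()]

# Map each token to an integer id. Duplicate tokens would collapse to a
# single entry here, which is why every token must be unique.
token_to_id = {token: i for i, token in enumerate(tokens)}
assert len(token_to_id) == len(tokens), "token file contains duplicates"
```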
## Training
Run the script `train_n_gram.py` to train an N-gram model:
```
python train_n_gram.py --corpus_path data/tieba.dialogues --token_path data/charset.txt --model_path data/2-gram.model --n 2
```
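As a sketch of the statistics a 2-gram trainer could collect: for each (n-1)-token head, count how often each next token (tail) follows it. This mirrors the `head2tail` structure reported by `test_n_gram.py` below, but it is an assumption about the internals, not the repository's actual code:

```python
from collections import defaultdict

def train(lines, n=2):
    # head2tail[head][tail] = number of times `tail` followed `head`.
    head2tail = defaultdict(lambda: defaultdict(int))
    for line in lines:
        chars = list(line.strip())           # default character tokenization
        for i in range(len(chars) - n + 1):
            head = tuple(chars[i:i + n - 1])
            tail = chars[i + n - 1]
            head2tail[head][tail] += 1
    return head2tail

model = train(["曼联加油", "曼联这局肯定胜利"])
print(dict(model[("曼",)]))  # {'联': 2}
```

Real training would also normalize counts into probabilities and persist the result to `--model_path`; those details are omitted here.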
## Testing
Run the script `test_n_gram.py` to test the trained N-gram model:
```
python test_n_gram.py --token_path data/charset.txt --model_path data/2-gram.model --text 哈哈
```
The test output will look like:
```
INFO - Loaded model from data/2-gram.model
INFO - Model info:
n: 2
head2tail length: 5947
tokens: 5952
The most probable next token after '哈哈' is '哈'.
```
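A hedged sketch of how this prediction could be computed, assuming the `head2tail` counting structure from the training sketch above (an assumption about the internals, not the repository's code):

```python
# Hypothetical prediction step: look up the counts for the trailing (n-1)
# tokens and return the most frequent tail.
def predict_next(head2tail, text, n=2):
    head = tuple(text[-(n - 1):])      # trailing n-1 characters as the head
    tails = head2tail.get(head)
    if not tails:
        return None                    # head never seen during training
    return max(tails, key=tails.get)   # most frequent next token

# e.g. predict_next(model, "哈哈") returns '哈' when '哈' is the token that
# most often follows '哈' in the training corpus.
```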