Semi-automated vocabulary generation from semantic vector models

Last update: Nov 25, 2022

vec2word

This script generates a list of potential conlang word forms along with associated possible glosses based on a word-shape template and a word2vec-style semantic vector model. The process works something like this:

1. Acquire a word2vec-style semantic vector model (either word2vec binary format or text format).
2. Define a word-shape template.
3. Use Principal Component Analysis (PCA) to project the vector model down to the same number of dimensions as you have slots in your template.
4. Match the new model dimensions to slots based on how many phonemes can go in a slot vs. the variance in a given dimension (large phoneme range pairs with large variance), and then discretize those dimensions into the appropriate number of buckets.
5. Use the buckets each vector ends up in to select a phoneme for each template slot and generate new conlang words, along with a list of all of the model words whose vectors ended up in that same set of buckets.
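
The projection, slot-matching, and bucketing steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in vec2word.py: the function name `generate_forms` is hypothetical, PCA is done with a plain SVD, and the equal-population (rank-based) bucketing is an assumption, since the description above does not say how dimensions are discretized.

```python
from collections import defaultdict

import numpy as np


def generate_forms(words, vectors, slots):
    """Map each model word to a word form built from the template slots.

    words   -- list of model words, one per row of `vectors`
    vectors -- array of shape (n_words, n_dims), with n_dims >= len(slots)
    slots   -- list of phoneme lists, one list per template slot
    """
    k = len(slots)
    n_words = len(words)

    # PCA via SVD of the mean-centered matrix. The rows of vt are the
    # principal axes, sorted by decreasing variance.
    centered = np.asarray(vectors, dtype=float)
    centered = centered - centered.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:k].T

    # Pair the slot with the most phonemes with the highest-variance
    # dimension, the next-largest slot with the next dimension, and so on.
    slot_order = sorted(range(k), key=lambda i: len(slots[i]), reverse=True)

    buckets = np.empty((n_words, k), dtype=int)
    for dim, slot in enumerate(slot_order):
        # Assumption: equal-population buckets, assigned by rank along
        # the dimension. The real script may cut the range differently.
        ranks = projected[:, dim].argsort().argsort()
        buckets[:, slot] = ranks * len(slots[slot]) // n_words

    # Words that share every bucket share a form; they become its glosses.
    lexicon = defaultdict(list)
    for word, row in zip(words, buckets):
        form = "".join(slots[i][b] for i, b in enumerate(row))
        lexicon[form].append(word)
    return dict(lexicon)
```

Calling `generate_forms(words, vectors, [["t", "d", "n"], ["i", "u"], ["k", "q", "p", "m"]])` returns a dictionary from each generated form to its candidate glosses. Rank-based buckets keep the phonemes of a slot roughly equally frequent; equal-width buckets would instead concentrate most words on the middle phonemes.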

This results in word forms in which each phoneme represents a category in some semantic classification scheme, rather like a traditional philosophical language, except that the categories are not obviously sensible, human-defined categories such as you might find in a thesaurus, but weird collections of whatever happens to project into similar places in low-dimensional space. Getting reasonable definitions for your new words still takes work: you either select among the candidate glosses provided to you or make up a new one in a similar semantic space, whatever you decide that means. Ideally, this should result in a lexicon with lots of discoverable sound-symbolism, but very little obvious regular morphology.

You could also decide that, rather than generating complete words, you just want to generate, e.g., individual syllables, which could then be compounded together to produce words with more specific meanings--essentially, simulating the process by which Chinese produced lots of homophones (single phonetic forms with wildly varying ambiguous meanings) and then used compounding to re-disambiguate the lexicon.

Or generate triliteral consonant roots, whose semantics will be narrowed down by intercalated vowel patterns.
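
As a rough illustration of the triliteral idea, the script could be run with three consonant slots, and each resulting root then combined with vowel patterns in a separate step. The sketch below is not part of vec2word; the `intercalate` helper and the `C`-placeholder pattern notation are hypothetical, and the patterns themselves are made up.

```python
def intercalate(root, pattern):
    """Fill each 'C' in the pattern with the next root consonant, in order.

    The pattern must contain exactly as many 'C' placeholders as the
    root has consonants.
    """
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)


root = ("k", "t", "m")
print([intercalate(root, p) for p in ("CaCaC", "CiCC", "uCCuC")])
# prints ['katam', 'kitm', 'uktum']
```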

Or something else entirely! Play around, experiment, have fun!

Example use

python vec2word.py model.bin "t,d,n,k,g,q,p,b,m" "i,u,e" "t,n,k,q,p,m" > syllables.txt

This uses the model.bin model to produce "words" on a CVC template and save the results in syllables.txt. For longer templates, just add more command-line arguments, each consisting of a comma-separated list of phonemes/graphemes that are allowed in the slot.
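
The template also caps how many distinct forms can be generated: the product of the slot sizes. The sketch below parses the three slot arguments from the example command and counts the possible forms; `parse_template` is a hypothetical helper, and the real script's argument handling may differ.

```python
import math


def parse_template(args):
    """Split each comma-separated slot argument into a list of phonemes."""
    return [arg.split(",") for arg in args]


slots = parse_template(["t,d,n,k,g,q,p,b,m", "i,u,e", "t,n,k,q,p,m"])
n_forms = math.prod(len(slot) for slot in slots)
print([len(slot) for slot in slots], n_forms)
# prints [9, 3, 6] 162
```

With at most 162 CVC forms, any model with more than 162 words must map several words to the same form, which is why each form comes with a list of candidate glosses rather than a single definition.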

Many pre-built word2vec models suitable for use with this script can be downloaded from the NLPL Word Vectors Repository.
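
For orientation, the word2vec text format is a header line giving the vocabulary size and vector dimension, followed by one line per word: the word, then its vector components, separated by spaces. A minimal reader might look like the sketch below; `read_word2vec_text` is a hypothetical name and is not taken from vec2word.py. Libraries such as gensim provide full loaders (for example `KeyedVectors.load_word2vec_format`) that also handle the binary format.

```python
import io


def read_word2vec_text(stream):
    """Read a word2vec text-format model from an open text stream.

    Returns (words, vectors), where vectors is a list of float lists.
    """
    header = stream.readline().split()
    n_dims = int(header[1])  # header[0] is the vocabulary size
    words, vectors = [], []
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != n_dims + 1:
            continue  # skip blank or malformed lines
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return words, vectors


# Tiny in-memory model: 2 words, 3 dimensions.
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
words, vectors = read_word2vec_text(sample)
print(words, len(vectors[0]))
# prints ['cat', 'dog'] 3
```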
