Fast, database-backed pretrained word embeddings for natural language processing.

Victor Zhong

Last update: Nov 21, 2022

Overview

Embeddings

Embeddings is a Python package that provides pretrained word embeddings for natural language processing and machine learning.

Instead of loading a large file into memory to query for embeddings, embeddings is backed by a database, which makes it fast to load and query:

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300)
100 loops, best of 3: 12.7 ms per loop

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300).emb('canada')
100 loops, best of 3: 12.9 ms per loop

>>> g = GloveEmbedding('common_crawl_840', d_emb=300)

>>> %timeit -n1 g.emb('canada')
1 loop, best of 3: 38.2 µs per loop

Installation

pip install embeddings  # from pypi
pip install git+https://github.com/vzhong/embeddings.git  # from github

Usage

On first use, the embeddings are downloaded to disk in the form of a SQLite database. This may take a long time for large embeddings such as GloVe. Subsequent lookups query the database directly. Embedding databases are stored in the $EMBEDDINGS_ROOT directory (defaults to ~/.embeddings). Note that this location is probably undesirable if your home directory is on NFS, as it would slow down database queries significantly.

from embeddings import GloveEmbedding, FastTextEmbedding, KazumaCharEmbedding, ConcatEmbedding

g = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
f = FastTextEmbedding()
k = KazumaCharEmbedding()
c = ConcatEmbedding([g, f, k])
for w in ['canada', 'vancouver', 'toronto']:
    print('embedding {}'.format(w))
    print(g.emb(w))
    print(f.emb(w))
    print(k.emb(w))
    print(c.emb(w))
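
To check where the databases described above would end up, and whether a download actually produced a non-empty file, a small helper like the following can be used. This is a minimal sketch using only the standard library: embeddings_root and list_databases are hypothetical names, not part of the package, and the package's own lookup of $EMBEDDINGS_ROOT may differ in detail.

```python
import os


def embeddings_root(environ=None):
    """Return the directory where embedding databases would be stored.

    Assumption: mirrors the documented behaviour ($EMBEDDINGS_ROOT,
    falling back to ~/.embeddings); the package's own logic may differ.
    """
    environ = os.environ if environ is None else environ
    default = os.path.join(os.path.expanduser('~'), '.embeddings')
    return environ.get('EMBEDDINGS_ROOT', default)


def list_databases(root):
    """Return sorted (filename, size in bytes) pairs for regular files in root."""
    if not os.path.isdir(root):
        return []
    return sorted(
        (name, os.path.getsize(os.path.join(root, name)))
        for name in os.listdir(root)
        if os.path.isfile(os.path.join(root, name))
    )


root = embeddings_root()
print(root)
for name, size in list_databases(root):
    print('{}\t{} bytes'.format(name, size))
```

If the directory printed is on NFS, pointing $EMBEDDINGS_ROOT at a local disk before running your program avoids the slowdown mentioned above.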

Docker

If you use Docker, an image prepopulated with the Common Crawl 840 GloVe embeddings and Kazuma Hashimoto's character ngram embeddings is available at vzhong/embeddings. To mount volumes from this container, set $EMBEDDINGS_ROOT in your container to /opt/embeddings.

For example:

docker run --volumes-from vzhong/embeddings -e EMBEDDINGS_ROOT='/opt/embeddings' myimage python train.py

Contribution

Pull requests welcome!

Comments

Memory Error

embeddings = [GloveEmbedding(), KazumaCharEmbedding()] generates a memory error on my system, which has 4 GB of installed memory. What are the memory requirements to run embeddings?

opened by Hayat2018 3

Where is the db file on Windows?

I got through the first load, but there are still only 2 files: one called 'common_crawl_840', and the zip I downloaded from Stanford. The common_crawl_840 file is only 0 KB. Is it the generated database file?

opened by kurophali 2

badzipfile

I run the following code

from embeddings import GloveEmbedding
GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)

with python3.6.5, embeddings==0.0.6

and get the error: zipfile.BadZipFile: File is not a zip file

Thanks!

opened by libing125 2

"TypeError: a float is required" in Python 2.7

I tried to use embeddings on Python 2.7 but got the following error:

  File "embeddings/embeddings/glove.py", line 58, in emb
    g = self.lookup(word)
  File "embeddings/embedding.py", line 182, in lookup
    return array('f', q[0]).tolist() if q else None
TypeError: a float is required

The above error does not appear when I run the same code in Python 3.6.

opened by ducalpha 2
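
A likely cause of the error above, stated as an assumption rather than a confirmed diagnosis: under Python 2.7, sqlite3 returns BLOB columns as buffer objects, and array('f', ...) iterates over such an object element by element instead of treating it as packed bytes. Converting the value to a plain byte string first avoids that. The sketch below shows the idea; decode_vector is a hypothetical helper, not the package's code, and the sample data is built with Python 3's tobytes().

```python
from array import array


def decode_vector(blob):
    """Decode a packed float32 blob into a list of Python floats."""
    if blob is None:
        return None
    # bytes(...) copies buffer/memoryview objects into a plain byte string,
    # which array() then interprets as packed floats rather than iterating
    # over the object item by item.
    return array('f', bytes(blob)).tolist()


# Sample data: three floats packed the same way a BLOB column might hold them.
packed = array('f', [0.5, -1.0, 2.0]).tobytes()
print(decode_vector(packed))
print(decode_vector(memoryview(packed)))
```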

Add numberbatch embeddings

This PR adds all currently available versions of the ConceptNet Numberbatch embeddings, as available from https://github.com/commonsense/conceptnet-numberbatch.

I was able to mostly re-use the code from the GloVe embedding implementation and added a few comments here and there.

Let me know if anything is missing or needs to be changed. :)

opened by mspl13 1

Update embedding.py

Small change, but this greatly reduces the amount of time it takes to load embeddings into a database.

I am creating a new class to load an Embedding file with a total of 13.8m embeddings. With the default implementation, this would have taken approx 13h. With these two lines, it was reduced to approx 1h45m.

Thanks for this great package!

opened by mickvanhulst 1
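
The two lines changed in this PR are not shown here. As a general illustration of how bulk loading into SQLite is commonly sped up, the sketch below commits once per batch instead of once per row and relaxes the synchronous setting during the import. The table layout, bulk_insert, and _flush are assumptions for this example, not the package's schema or API.

```python
import sqlite3
from array import array


def _flush(db, batch):
    """Write one batch inside a single transaction and return its size."""
    with db:  # commits once when the block exits
        db.executemany('INSERT OR REPLACE INTO embeddings VALUES (?, ?)', batch)
    return len(batch)


def bulk_insert(db, rows, batch_size=10000):
    """Insert (word, vector) pairs, committing once per batch of rows."""
    db.execute('CREATE TABLE IF NOT EXISTS embeddings '
               '(word TEXT PRIMARY KEY, emb BLOB)')
    batch, total = [], 0
    for word, vector in rows:
        batch.append((word, array('f', vector).tobytes()))
        if len(batch) >= batch_size:
            total += _flush(db, batch)
            batch = []
    if batch:
        total += _flush(db, batch)
    return total


db = sqlite3.connect(':memory:')
# Trades durability for speed; reasonable only if the import can be rerun
# after a crash (an assumption of this sketch).
db.execute('PRAGMA synchronous = OFF')
rows = (('w{}'.format(i), [float(i), 0.0]) for i in range(25))
print(bulk_insert(db, rows, batch_size=10))
```
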
Create database from given embeddings

Hi,

For a project I am working on I need to create a database to increase the efficiency of looking up embeddings. I would like to use your project as a basis for this, so I was wondering if you could give me some pointers on how I can best tackle this given the codebase you provided? If successful, I could send a PR afterwards.

Thanks!

opened by mickvanhulst 0
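
As a rough starting point for the question above, the following sketch loads a text file of embeddings (one word per line, followed by its vector components) into a SQLite table and looks words up again. It uses only the standard library; load_text_embeddings, lookup, and the table layout are hypothetical and not taken from the package, although the lookup line follows the same array('f', ...) pattern that appears in the traceback quoted in an earlier comment.

```python
import io
import sqlite3
from array import array


def load_text_embeddings(db, lines):
    """Load 'word v1 v2 ...' lines into a word -> float32 blob table."""
    db.execute('CREATE TABLE IF NOT EXISTS embeddings '
               '(word TEXT PRIMARY KEY, emb BLOB)')
    with db:  # one transaction for the whole file
        for line in lines:
            parts = line.rstrip().split(' ')
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            vector = array('f', (float(x) for x in parts[1:]))
            db.execute('INSERT OR REPLACE INTO embeddings VALUES (?, ?)',
                       (parts[0], vector.tobytes()))


def lookup(db, word):
    """Return the vector for word as a list of floats, or None if missing."""
    row = db.execute('SELECT emb FROM embeddings WHERE word = ?',
                     (word,)).fetchone()
    return array('f', row[0]).tolist() if row else None


# Made-up sample data standing in for a real embedding file.
sample = io.StringIO('canada 0.5 -1.0 2.0\ntoronto 1.5 0.25 -4.0\n')
db = sqlite3.connect(':memory:')
load_text_embeddings(db, sample)
print(lookup(db, 'toronto'))
print(lookup(db, 'atlantis'))
```

Passing a file path instead of ':memory:' to sqlite3.connect keeps the database on disk, so later runs can skip the loading step.
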
pip list and .__version__ are different

When I type pip list I get embeddings 0.0.8, but in Python I get:

Python 3.8.13 (default, Mar 28 2022, 11:38:47)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import embeddings
>>> embeddings.__version__
'0.0.6'

opened by ymliunlp 0
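
One way to narrow down a mismatch like this is to compare the version recorded in the installed distribution's metadata with the module's own __version__ attribute, and to check which file was actually imported. The sketch below uses importlib.metadata (Python 3.8+); version_report is a hypothetical helper, and it does not by itself say which of the two numbers is wrong.

```python
import importlib
from importlib import metadata


def version_report(dist_name, module_name):
    """Compare installed distribution metadata with the imported module."""
    try:
        dist_version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        dist_version = None
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return {'distribution': dist_version, 'module': None, 'path': None}
    return {
        'distribution': dist_version,                     # what pip list reports
        'module': getattr(module, '__version__', None),   # string in the source
        'path': getattr(module, '__file__', None),        # which copy was imported
    }


print(version_report('embeddings', 'embeddings'))
```

If the path points somewhere unexpected, a stale copy may be shadowing the installed one. If it points at the installed package, the __version__ string in the source may simply not have been updated for that release.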

Owner

Victor Zhong

I am a PhD student at the University of Washington. Formerly Salesforce Research / MetaMind, @stanfordnlp, and ECE at UToronto.

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language ⚖️ The library of Natural Language Processing for Brazilian legal lang

125 Dec 20, 2022

A design of MIDI language for music generation task, specifically for Natural Language Processing (NLP) models.

MIDI Language Introduction Reference Paper: Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions: code This

3 May 25, 2022

A library for Multilingual Unsupervised or Supervised word Embeddings

MUSE: Multilingual Unsupervised and Supervised Embeddings MUSE is a Python library for multilingual word embeddings, whose goal is to provide the comm

3k Jan 6, 2023

A Neural Language Style Transfer framework to transfer natural language text smoothly between fine-grained language styles like formal/casual, active/passive, and many more. Created by Prithiviraj Damodaran. Open to pull requests and other forms of collaboration.

Styleformer A Neural Language Style Transfer framework to transfer natural language text smoothly between fine-grained language styles like formal/cas

431 Dec 19, 2022

Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

4 Jul 1, 2022

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group

8.4k Dec 30, 2022

This is a Python binding to the tokeniser Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).

Ucto for Python This is a Python binding to the tokeniser Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task,

27 Dec 14, 2022

💫 Industrial-strength Natural Language Processing (NLP) in Python

spaCy: Industrial-strength NLP spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest researc

24.9k Jan 2, 2023

🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

21 Aug 12, 2022

Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

2.1k Jan 1, 2023

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing Trankit is a light-weight Transformer-based Pyth

652 Jan 6, 2023

PORORO: Platform Of neuRal mOdels for natuRal language prOcessing

PORORO: Platform Of neuRal mOdels for natuRal language prOcessing pororo performs Natural Language Processing and Speech-related tasks. It is easy to

1.2k Dec 21, 2022

🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.

State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0 🤗 Transformers provides thousands of pretrained models to perform tasks o

77.3k Jan 3, 2023

A very simple framework for state-of-the-art Natural Language Processing (NLP)

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends. IMPORTANT: (30.08.2020) We moved our models

12.3k Dec 31, 2022

State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide

3k Jan 5, 2023

A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

2.9k Jan 2, 2023

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

2.3k Jan 7, 2023