Fast, DB Backed pretrained word embeddings for natural language processing.

Overview

Embeddings

Documentation Status https://travis-ci.org/vzhong/embeddings.svg?branch=master

Embeddings is a python package that provides pretrained word embeddings for natural language processing and machine learning.

Instead of loading a large file to query for embeddings, embeddings is backed by a database and fast to load and query:

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300)
100 loops, best of 3: 12.7 ms per loop

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300).emb('canada')
100 loops, best of 3: 12.9 ms per loop

>>> g = GloveEmbedding('common_crawl_840', d_emb=300)

>>> %timeit -n1 g.emb('canada')
1 loop, best of 3: 38.2 µs per loop

Installation

pip install embeddings  # from pypi
pip install git+https://github.com/vzhong/embeddings.git  # from github

Usage

Upon first use, the embeddings are first downloaded to disk in the form of a SQLite database. This may take a long time for large embeddings such as GloVe. Further usage of the embeddings are directly queried against the database. Embedding databases are stored in the $EMBEDDINGS_ROOT directory (defaults to ~/.embeddings). Note that this location is probably undesirable if your home directory is on NFS, as it would slow down database queries significantly.

from embeddings import GloveEmbedding, FastTextEmbedding, KazumaCharEmbedding, ConcatEmbedding

g = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
f = FastTextEmbedding()
k = KazumaCharEmbedding()
c = ConcatEmbedding([g, f, k])
for w in ['canada', 'vancouver', 'toronto']:
    print('embedding {}'.format(w))
    print(g.emb(w))
    print(f.emb(w))
    print(k.emb(w))
    print(c.emb(w))

Docker

If you use Docker, an image prepopulated with the Common Crawl 840 GloVe embeddings and Kazuma Hashimoto's character ngram embeddings is available at vzhong/embeddings. To mount volumes from this container, set $EMBEDDINGS_ROOT in your container to /opt/embeddings.

For example:

docker run --volumes-from vzhong/embeddings -e EMBEDDINGS_ROOT='/opt/embeddings' myimage python train.py

Contribution

Pull requests welcome!

Comments
  • Memory Error

    Memory Error

    embeddings = [glove embedding(), kazuma char embedding()] generates a memory error on my system having 4Gb on installed memory. What is the memory requirements to run embeedings?

    opened by Hayat2018 3
  • Where is the db file on Windows?

    Where is the db file on Windows?

    I got the 1st load through but there are still only 2 files . One called 'common_crawl_840' the other is the zip I downloaded from stanford. The common_crawl_840 file is only 0kb is it the generated database file?

    opened by kurophali 2
  • badzipfile

    badzipfile

    I run the following code

    from embeddings import GloveEmbedding GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)

    with python3.6.5, embeddings==0.0.6

    and get the error: zipfile.BadZipFile: File is not a zip file

    Thanks!

    opened by libing125 2
  • "TypeError: a float is required" in Python 2.7

    I tried to use embeddings on Python 2.7 but got the following error:

      File embeddings/embeddings/glove.py", line 58, in emb
        g = self.lookup(word)
      File "embeddings/embedding.py", line 182, in lookup
        return array('f', q[0]).tolist() if q else None
    TypeError: a float is required
    

    The above error does not show when I ran in Python 3.6.

    opened by ducalpha 2
  • Add numberbatch embeddings

    Add numberbatch embeddings

    This PR adds all currently available versions of the ConceptNet Numberbatch embeddings, as available from https://github.com/commonsense/conceptnet-numberbatch.

    I was able to mostly re-use the code from the GloVe embedding implementation and added a few comments here and there.

    Let me know if anything is missing or needs to be changed. :)

    opened by mspl13 1
  • Update embedding.py

    Update embedding.py

    Small change, but this greatly reduces the amount of time it takes to load embeddings into a database.

    I am creating a new class to load an Embedding file with a total of 13.8m embeddings. With the default implementation, this would have taken approx 13h. With these two lines, it was reduced to approx 1h45m.

    Thanks for this great package!

    opened by mickvanhulst 1
  • Create database from given embeddings

    Create database from given embeddings

    Hi,

    For a project I am working on I need to create a database to increase the efficiency of looking up embeddings. I would like to use your project as a basis for this, so I was wondering if you could give me some pointers on how I can best tackle this given the codebase you provided? If successful, I could send a PR afterwards.

    Thanks!

    opened by mickvanhulst 0
  • pip list and .__version__ are different

    pip list and .__version__ are different

    When I type pip list I got embeddings 0.0.8 But in python I got Python 3.8.13 (default, Mar 28 2022, 11:38:47) [GCC 7.5.0] :: Anaconda, Inc. on linux Type "help", "copyright", "credits" or "license" for more information. >>> import embeddings >>> embeddings.__version__ '0.0.6'

    opened by ymliunlp 0
Owner
Victor Zhong
I am a PhD student at the University of Washington. Formerly Salesforce Research / MetaMind, @stanfordnlp, and ECE at UToronto.
Victor Zhong
LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language

LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language ⚖️ The library of Natural Language Processing for Brazilian legal lang

Felipe Maia Polo 125 Dec 20, 2022
A design of MIDI language for music generation task, specifically for Natural Language Processing (NLP) models.

MIDI Language Introduction Reference Paper: Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions: code This

Robert Bogan Kang 3 May 25, 2022
A library for Multilingual Unsupervised or Supervised word Embeddings

MUSE: Multilingual Unsupervised and Supervised Embeddings MUSE is a Python library for multilingual word embeddings, whose goal is to provide the comm

Facebook Research 3k Jan 6, 2023
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 1, 2022
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Dec 30, 2022
💫 Industrial-strength Natural Language Processing (NLP) in Python

spaCy: Industrial-strength NLP spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest researc

Explosion 24.9k Jan 2, 2023
🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

Gustavo Rosa 21 Aug 12, 2022
Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

Michael Petrochuk 2.1k Jan 1, 2023
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing Trankit is a light-weight Transformer-based Pyth

null 652 Jan 6, 2023
PORORO: Platform Of neuRal mOdels for natuRal language prOcessing

PORORO: Platform Of neuRal mOdels for natuRal language prOcessing pororo performs Natural Language Processing and Speech-related tasks. It is easy to

Kakao Brain 1.2k Dec 21, 2022
💫 Industrial-strength Natural Language Processing (NLP) in Python

spaCy: Industrial-strength NLP spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest researc

Explosion 19.5k Feb 13, 2021
🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.

State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0 ?? Transformers provides thousands of pretrained models to perform tasks o

Hugging Face 77.3k Jan 3, 2023
A very simple framework for state-of-the-art Natural Language Processing (NLP)

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends. IMPORTANT: (30.08.2020) We moved our models

flair 12.3k Dec 31, 2022
State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide

John Snow Labs 3k Jan 5, 2023
Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

Michael Petrochuk 1.9k Feb 3, 2021
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks

A Deep Learning NLP/NLU library by Intel® AI Lab Overview | Models | Installation | Examples | Documentation | Tutorials | Contributing NLP Architect

Intel Labs 2.9k Jan 2, 2023
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

ASYML 2.3k Jan 7, 2023