🦆 Contextually-keyed word vectors

Explosion

Last update: Dec 25, 2022

sense2vec: Contextually-keyed word vectors

sense2vec (Trask et al., 2015) is a nice twist on word2vec that lets you learn more interesting and detailed word vectors. This library is a simple Python implementation for loading, querying and training sense2vec models. For more details, check out our blog post. To explore the semantic similarities across all Reddit comments of 2015 and 2019, see the interactive demo.

🦆 Version 2.0 (for spaCy v3) out now! Read the release notes here.

✨ Features

Query vectors for multi-word phrases based on part-of-speech tags and entity labels.
spaCy pipeline component and extension attributes.
Fully serializable so you can easily ship your sense2vec vectors with your spaCy model packages.
Optional caching of nearest neighbors for super fast "most similar" queries.
Train your own vectors using a pretrained spaCy model, raw text and GloVe or Word2Vec via fastText (details).
Prodigy annotation recipes for evaluating models, creating lists of similar multi-word phrases and converting them to match patterns, e.g. for rule-based NER or to bootstrap NER annotation (details & examples).

🚀 Quickstart

Standalone usage

from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
query = "natural_language_processing|NOUN"
assert query in s2v
vector = s2v[query]
freq = s2v.get_freq(query)
most_similar = s2v.most_similar(query, n=3)
# [('machine_learning|NOUN', 0.8986967),
#  ('computer_vision|NOUN', 0.8636297),
#  ('deep_learning|NOUN', 0.8573361)]

Usage as a spaCy pipeline component

⚠️ Note that this example describes usage with spaCy v3. For usage with spaCy v2, download sense2vec==1.0.3 and check out the v1.x branch of this repo.

import spacy

nlp = spacy.load("en_core_web_sm")
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("/path/to/s2v_reddit_2015_md")

doc = nlp("A sentence about natural language processing.")
assert doc[3:6].text == "natural language processing"
freq = doc[3:6]._.s2v_freq
vector = doc[3:6]._.s2v_vec
most_similar = doc[3:6]._.s2v_most_similar(3)
# [(('machine learning', 'NOUN'), 0.8986967),
#  (('computer vision', 'NOUN'), 0.8636297),
#  (('deep learning', 'NOUN'), 0.8573361)]

Interactive demos

To try out our pretrained vectors trained on Reddit comments, check out the interactive sense2vec demo.

This repo also includes a Streamlit demo script for exploring vectors and the most similar phrases. After installing streamlit, you can run the script with streamlit run and one or more paths to pretrained vectors as positional arguments on the command line. For example:

pip install streamlit
streamlit run https://raw.githubusercontent.com/explosion/sense2vec/master/scripts/streamlit_sense2vec.py /path/to/vectors

Pretrained vectors

To use the vectors, download the archive(s) and pass the extracted directory to Sense2Vec.from_disk or Sense2VecComponent.from_disk. The vector files are attached to the GitHub release. Large files have been split into multi-part downloads.

Vectors	Size	Description	📥 Download (zipped)
`s2v_reddit_2019_lg`	4 GB	Reddit comments 2019 (01-07)	part 1, part 2, part 3
`s2v_reddit_2015_md`	573 MB	Reddit comments 2015	part 1

To merge the multi-part archives, you can run the following:

cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz
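On systems without `cat` (for example, a stock Windows shell), the same concatenation can be sketched in Python. This is a minimal sketch using only the standard library. The `merge_parts` helper is hypothetical and not part of sense2vec, and it assumes the part files sort into the correct order by file name.

```python
import shutil
from pathlib import Path


def merge_parts(parts_dir, archive_name):
    """Concatenate all archive_name.* part files into one archive_name file."""
    parts_dir = Path(parts_dir)
    # Sort by name so that e.g. .001, .002, .003 are appended in order.
    parts = sorted(parts_dir.glob(archive_name + ".*"))
    if not parts:
        raise FileNotFoundError(f"No parts found for {archive_name} in {parts_dir}")
    target = parts_dir / archive_name
    with target.open("wb") as out_file:
        for part in parts:
            with part.open("rb") as in_file:
                # Stream in chunks so large parts don't have to fit in memory.
                shutil.copyfileobj(in_file, out_file)
    return target
```

Calling `merge_parts(".", "s2v_reddit_2019_lg.tar.gz")` in the download directory would then produce the merged archive next to the parts.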

⏳ Installation & Setup

sense2vec releases are available on pip:

pip install sense2vec

To use pretrained vectors, download one of the vector packages, unpack the .tar.gz archive and point from_disk to the extracted data directory:

from sense2vec import Sense2Vec
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")

👩‍💻 Usage

Usage with spaCy v3

The easiest way to use the library and vectors is to plug it into your spaCy pipeline. The sense2vec package exposes a Sense2VecComponent, which can be initialised with the shared vocab and added to your spaCy pipeline as a custom pipeline component. By default, components are added to the end of the pipeline, which is the recommended position for this component, since it needs access to the dependency parse and, if available, named entities.

import spacy
from sense2vec import Sense2VecComponent

nlp = spacy.load("en_core_web_sm")
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("/path/to/s2v_reddit_2015_md")

The component will add several extension attributes and methods to spaCy's Token and Span objects that let you retrieve vectors and frequencies, as well as most similar terms.

doc = nlp("A sentence about natural language processing.")
assert doc[3:6].text == "natural language processing"
freq = doc[3:6]._.s2v_freq
vector = doc[3:6]._.s2v_vec
most_similar = doc[3:6]._.s2v_most_similar(3)

For entities, the entity labels are used as the "sense" (instead of the token's part-of-speech tag):

doc = nlp("A sentence about Facebook and Google.")
for ent in doc.ents:
    assert ent._.in_s2v
    most_similar = ent._.s2v_most_similar(3)

Available attributes

The following extension attributes are exposed on the Doc object via the ._ property:

Name	Attribute Type	Return Type	Description
`s2v_phrases`	property	list	All sense2vec-compatible phrases in the given `Doc` (noun phrases, named entities).

The following attributes are available via the ._ property of Token and Span objects – for example token._.in_s2v:

Name	Attribute Type	Return Type	Description
`in_s2v`	property	bool	Whether a key exists in the vector map.
`s2v_key`	property	unicode	The sense2vec key of the given object, e.g. `"duck|NOUN"`.
`s2v_vec`	property	`ndarray[float32]`	The vector of the given key.
`s2v_freq`	property	int	The frequency of the given key.
`s2v_other_senses`	property	list	Available other senses, e.g. `"duck|VERB"` for `"duck|NOUN"`.
`s2v_most_similar`	method	list	Get the `n` most similar terms. Returns a list of `((word, sense), score)` tuples.
`s2v_similarity`	method	float	Get the similarity to another `Token` or `Span`.

⚠️ A note on span attributes: Under the hood, entities in doc.ents are Span objects. This is why the pipeline component also adds attributes and methods to spans and not just tokens. However, it's not recommended to use the sense2vec attributes on arbitrary slices of the document, since the model likely won't have a key for the respective text. Span objects also don't have a part-of-speech tag, so if no entity label is present, the "sense" defaults to the root's part-of-speech tag.

Adding sense2vec to a trained pipeline

If you're training and packaging a spaCy pipeline and want to include a sense2vec component in it, you can load in the data via the [initialize] block of the training config:

[initialize.components]

[initialize.components.sense2vec]
data_path = "/path/to/s2v_reddit_2015_md"

Standalone usage

You can also use the underlying Sense2Vec class directly and load in the vectors using the from_disk method. See below for the available API methods.

from sense2vec import Sense2Vec
s2v = Sense2Vec().from_disk("/path/to/reddit_vectors-1.1.0")
most_similar = s2v.most_similar("natural_language_processing|NOUN", n=10)

⚠️ Important note: To look up entries in the vectors table, the keys need to follow the scheme of phrase_text|SENSE (note the _ instead of spaces and the | before the tag or label) – for example, machine_learning|NOUN. Also note that the underlying vector table is case-sensitive.
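To make the key scheme concrete, here is a minimal sketch of building and splitting keys by hand. The two helpers below are simplified stand-ins written for this example, not the library's own key functions, and they assume the plain `phrase_text|SENSE` layout described above.

```python
def make_key(phrase, sense):
    """Build a lookup key like 'machine_learning|NOUN' from a phrase and a sense."""
    # Spaces become underscores; case is left untouched because the
    # vector table is case-sensitive.
    return phrase.replace(" ", "_") + "|" + sense


def split_key(key):
    """Split a key back into (phrase, sense), splitting on the last '|'."""
    text, sense = key.rsplit("|", 1)
    return text.replace("_", " "), sense


key = make_key("machine learning", "NOUN")
print(key)             # machine_learning|NOUN
print(split_key(key))  # ('machine learning', 'NOUN')
```

Note that turning underscores back into spaces is lossy for phrases that contain a literal underscore, so this round trip is only a rough illustration.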

🎛 API

`class` `Sense2Vec`

The standalone Sense2Vec object that holds the vectors, strings and frequencies.

`method` `Sense2Vec.__init__`

Initialize the Sense2Vec object.

Argument	Type	Description
`shape`	tuple	The vector shape. Defaults to `(1000, 128)`.
`strings`	`spacy.strings.StringStore`	Optional string store. Will be created if it doesn't exist.
`senses`	list	Optional list of all available senses. Used in methods that generate the best sense or other senses.
`vectors_name`	unicode	Optional name to assign to the `Vectors` table, to prevent clashes. Defaults to `"sense2vec"`.
`overrides`	dict	Optional custom functions to use, mapped to names registered via the registry, e.g. `{"make_key": "custom_make_key"}`.
RETURNS	`Sense2Vec`	The newly constructed object.

s2v = Sense2Vec(shape=(300, 128), senses=["VERB", "NOUN"])

`method` `Sense2Vec.__len__`

The number of rows in the vectors table.

Argument	Type	Description
RETURNS	int	The number of rows in the vectors table.

s2v = Sense2Vec(shape=(300, 128))
assert len(s2v) == 300

`method` `Sense2Vec.__contains__`

Check if a key is in the vectors table.

Argument	Type	Description
`key`	unicode / int	The key to look up.
RETURNS	bool	Whether the key is in the table.

import numpy
from sense2vec import Sense2Vec

s2v = Sense2Vec(shape=(10, 4))
s2v.add("avocado|NOUN", numpy.asarray([4, 2, 2, 2], dtype=numpy.float32))
assert "avocado|NOUN" in s2v
assert "avocado|VERB" not in s2v

`method` `Sense2Vec.__getitem__`

Retrieve a vector for a given key. Returns None if the key is not in the table.

Argument	Type	Description
`key`	unicode / int	The key to look up.
RETURNS	`numpy.ndarray`	The vector or `None`.

vec = s2v["avocado|NOUN"]

`method` `Sense2Vec.__setitem__`

Set a vector for a given key. Will raise an error if the key doesn't exist. To add a new entry, use Sense2Vec.add.

Argument	Type	Description
`key`	unicode / int	The key.
`vector`	`numpy.ndarray`	The vector to set.

vec = s2v["avocado|NOUN"]
s2v["avocado|NOUN"] = vec

`method` `Sense2Vec.add`

Add a new vector to the table.

Argument	Type	Description
`key`	unicode / int	The key to add.
`vector`	`numpy.ndarray`	The vector to add.
`freq`	int	Optional frequency count. Used to find best matching senses.

vec = s2v["avocado|NOUN"]
s2v.add("🥑|NOUN", vec, 1234)

`method` `Sense2Vec.get_freq`

Get the frequency count for a given key.

Argument	Type	Description
`key`	unicode / int	The key to look up.
`default`	-	Default value to return if no frequency is found.
RETURNS	int	The frequency count.

vec = s2v["avocado|NOUN"]
s2v.add("🥑|NOUN", vec, 1234)
assert s2v.get_freq("🥑|NOUN") == 1234

`method` `Sense2Vec.set_freq`

Set a frequency count for a given key.

Argument	Type	Description
`key`	unicode / int	The key to set the count for.
`freq`	int	The frequency count.

s2v.set_freq("avocado|NOUN", 104294)

`method` `Sense2Vec.__iter__`, `Sense2Vec.items`

Iterate over the entries in the vectors table.

Argument	Type	Description
YIELDS	tuple	String key and vector pairs in the table.

for key, vec in s2v:
    print(key, vec)

for key, vec in s2v.items():
    print(key, vec)

`method` `Sense2Vec.keys`

Iterate over the keys in the table.

Argument	Type	Description
YIELDS	unicode	The string keys in the table.

all_keys = list(s2v.keys())

`method` `Sense2Vec.values`

Iterate over the vectors in the table.

Argument	Type	Description
YIELDS	`numpy.ndarray`	The vectors in the table.

all_vecs = list(s2v.values())

`property` `Sense2Vec.senses`

The available senses in the table, e.g. "NOUN" or "VERB" (added at initialization).

Argument	Type	Description
RETURNS	list	The available senses.

s2v = Sense2Vec(senses=["VERB", "NOUN"])
assert "VERB" in s2v.senses

`property` `Sense2Vec.frequencies`

The frequencies of the keys in the table, in descending order.

Argument	Type	Description
RETURNS	list	The `(key, freq)` tuples by frequency, descending.

most_frequent = s2v.frequencies[:10]
key, score = s2v.frequencies[0]

`method` `Sense2Vec.similarity`

Make a semantic similarity estimate of two keys or two sets of keys. The default estimate is cosine similarity using an average of vectors.

Argument	Type	Description
`keys_a`	unicode / int / iterable	The string or integer key(s).
`keys_b`	unicode / int / iterable	The other string or integer key(s).
RETURNS	float	The similarity score.

keys_a = ["machine_learning|NOUN", "natural_language_processing|NOUN"]
keys_b = ["computer_vision|NOUN", "object_detection|NOUN"]
print(s2v.similarity(keys_a, keys_b))
assert s2v.similarity("machine_learning|NOUN", "machine_learning|NOUN") == 1.0
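The "cosine similarity using an average of vectors" estimate can be illustrated with a small pure-Python sketch. This is not the library's implementation: the function names and the toy two-dimensional vectors are made up for the example, and a plain dict stands in for the vector table.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def cosine(vec_a, vec_b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def set_similarity(table, keys_a, keys_b):
    """Average each set of keys into one vector, then compare with cosine."""
    avg_a = mean_vector([table[key] for key in keys_a])
    avg_b = mean_vector([table[key] for key in keys_b])
    return cosine(avg_a, avg_b)


# Toy vectors, invented for illustration only.
toy_table = {
    "a|NOUN": [1.0, 0.0],
    "b|NOUN": [0.0, 1.0],
    "c|NOUN": [1.0, 1.0],
}
score = set_similarity(toy_table, ["a|NOUN", "b|NOUN"], ["c|NOUN"])
```

In this sketch the average of `a` and `b` points in the same direction as `c`, so the score is 1.0 up to floating-point rounding. When comparing scores like this, a small tolerance is usually safer than exact equality.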

`method` `Sense2Vec.most_similar`

Get the most similar entries in the table. If more than one key is provided, the average of the vectors is used. To make this method faster, see the script for precomputing a cache of the nearest neighbors.

Argument	Type	Description
`keys`	unicode / int / iterable	The string or integer key(s) to compare to.
`n`	int	The number of similar keys to return. Defaults to `10`.
`batch_size`	int	The batch size to use. Defaults to `16`.
RETURNS	list	The `(key, score)` tuples of the most similar vectors.

most_similar = s2v.most_similar("natural_language_processing|NOUN", n=3)
# [('machine_learning|NOUN', 0.8986967),
#  ('computer_vision|NOUN', 0.8636297),
#  ('deep_learning|NOUN', 0.8573361)]
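The idea behind a nearest-neighbor cache is to pay for the expensive comparison once and answer later queries with a dictionary lookup. The sketch below shows that trade-off on toy data. It is an assumption-laden illustration rather than the repo's precompute script: the function names are hypothetical, and it uses brute-force comparison over a plain dict.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def precompute_neighbors(table, n=2):
    """Brute-force the top-n neighbors of every key once, up front."""
    unit = {key: normalize(vec) for key, vec in table.items()}
    cache = {}
    for key, vec in unit.items():
        scores = [
            (other, sum(a * b for a, b in zip(vec, other_vec)))
            for other, other_vec in unit.items()
            if other != key
        ]
        # Highest similarity first; keep only the n best entries.
        scores.sort(key=lambda item: item[1], reverse=True)
        cache[key] = scores[:n]
    return cache


# Toy vectors, invented for illustration only.
toy_table = {
    "x|NOUN": [1.0, 0.0],
    "y|NOUN": [0.9, 0.1],
    "z|NOUN": [0.0, 1.0],
}
cache = precompute_neighbors(toy_table, n=1)
nearest_key, nearest_score = cache["x|NOUN"][0]
```

Building the cache costs a comparison for every pair of entries, so it only pays off when the same table is queried many times.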

`method` `Sense2Vec.get_other_senses`

Find other entries for the same word with a different sense, e.g. "duck|VERB" for "duck|NOUN".

Argument	Type	Description
`key`	unicode / int	The key to check.
`ignore_case`	bool	Check for uppercase, lowercase and titlecase. Defaults to `True`.
RETURNS	list	The string keys of other entries with different senses.

other_senses = s2v.get_other_senses("duck|NOUN")
# ['duck|VERB', 'Duck|ORG', 'Duck|VERB', 'Duck|PERSON', 'Duck|ADJ']

`method` `Sense2Vec.get_best_sense`

Find the best-matching sense for a given word based on the available senses and frequency counts. Returns None if no match is found.

Argument	Type	Description
`word`	unicode	The word to check.
`senses`	list	Optional list of senses to limit the search to. If not set / empty, all senses in the vectors are used.
`ignore_case`	bool	Check for uppercase, lowercase and titlecase. Defaults to `True`.
RETURNS	unicode	The best-matching key or None.

assert s2v.get_best_sense("duck") == "duck|NOUN"
assert s2v.get_best_sense("duck", ["VERB", "ADJ"]) == "duck|VERB"
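As a rough illustration of choosing a sense by frequency, the sketch below picks the most frequent matching key from a dict of counts. It is a simplified stand-in, not the library's code: the `best_sense` function and the toy counts are hypothetical, and details such as tie-breaking are assumptions.

```python
def best_sense(freqs, word, senses=None, ignore_case=True):
    """Return the most frequent 'word|SENSE' key in freqs, or None if none match."""
    # Candidate spellings: as given, plus upper, lower and title case if requested.
    forms = [word]
    if ignore_case:
        forms += [word.upper(), word.lower(), word.title()]
    best_key, best_freq = None, -1
    for key, freq in freqs.items():
        text, _, sense = key.rpartition("|")
        if text not in forms:
            continue
        # An empty or missing senses list means "consider every sense".
        if senses and sense not in senses:
            continue
        if freq > best_freq:
            best_key, best_freq = key, freq
    return best_key


# Toy frequency counts, invented for illustration only.
toy_freqs = {"duck|NOUN": 500, "duck|VERB": 120, "Duck|PERSON": 40}
print(best_sense(toy_freqs, "duck"))                   # duck|NOUN
print(best_sense(toy_freqs, "duck", ["VERB", "ADJ"]))  # duck|VERB
```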

`method` `Sense2Vec.to_bytes`

Serialize a Sense2Vec object to a bytestring.

Argument	Type	Description
`exclude`	list	Names of serialization fields to exclude.
RETURNS	bytes	The serialized `Sense2Vec` object.

s2v_bytes = s2v.to_bytes()

`method` `Sense2Vec.from_bytes`

Load a Sense2Vec object from a bytestring.

Argument	Type	Description
`bytes_data`	bytes	The data to load.
`exclude`	list	Names of serialization fields to exclude.
RETURNS	`Sense2Vec`	The loaded object.

s2v_bytes = s2v.to_bytes()
new_s2v = Sense2Vec().from_bytes(s2v_bytes)

`method` `Sense2Vec.to_disk`

Serialize a Sense2Vec object to a directory.

Argument	Type	Description
`path`	unicode / `Path`	The path.
`exclude`	list	Names of serialization fields to exclude.

s2v.to_disk("/path/to/sense2vec")

`method` `Sense2Vec.from_disk`

Load a Sense2Vec object from a directory.

Argument	Type	Description
`path`	unicode / `Path`	The path to load from.
`exclude`	list	Names of serialization fields to exclude.
RETURNS	`Sense2Vec`	The loaded object.

s2v.to_disk("/path/to/sense2vec")
new_s2v = Sense2Vec().from_disk("/path/to/sense2vec")

`class` `Sense2VecComponent`

The pipeline component to add sense2vec to spaCy pipelines.

`method` `Sense2VecComponent.__init__`

Initialize the pipeline component.

Argument	Type	Description
`vocab`	`Vocab`	The shared `Vocab`. Mostly used for the shared `StringStore`.
`shape`	tuple	The vector shape.
`merge_phrases`	bool	Whether to merge sense2vec phrases into one token. Defaults to `False`.
`lemmatize`	bool	Always look up lemmas if available in the vectors, otherwise default to original word. Defaults to `False`.
`overrides`	dict	Optional custom functions to use, mapped to names registered via the registry, e.g. `{"make_key": "custom_make_key"}`.
RETURNS	`Sense2VecComponent`	The newly constructed object.

s2v = Sense2VecComponent(nlp.vocab)

`classmethod` `Sense2VecComponent.from_nlp`

Initialize the component from an nlp object. Mostly used as the component factory for the entry point (see setup.cfg) and to auto-register via the @spacy.component decorator.

Argument	Type	Description
`nlp`	`Language`	The `nlp` object.
`**cfg`	-	Optional config parameters.
RETURNS	`Sense2VecComponent`	The newly constructed object.

s2v = Sense2VecComponent.from_nlp(nlp)

`method` `Sense2VecComponent.__call__`

Process a Doc object with the component. Typically only called as part of the spaCy pipeline and not directly.

Argument	Type	Description
`doc`	`Doc`	The document to process.
RETURNS	`Doc`	The processed document.

`method` `Sense2VecComponent.init_component`

Register the component-specific extension attributes. This is done here, and only once the component is added to the pipeline and used. Otherwise, tokens would still get the attributes even if the component is only created and not added.

`method` `Sense2VecComponent.to_bytes`

Serialize the component to a bytestring. Also called when the component is added to the pipeline and you run nlp.to_bytes.

Argument	Type	Description
RETURNS	bytes	The serialized component.

`method` `Sense2VecComponent.from_bytes`

Load a component from a bytestring. Also called when you run nlp.from_bytes.

Argument	Type	Description
`bytes_data`	bytes	The data to load.
RETURNS	`Sense2VecComponent`	The loaded object.

`method` `Sense2VecComponent.to_disk`

Serialize the component to a directory. Also called when the component is added to the pipeline and you run nlp.to_disk.

Argument	Type	Description
`path`	unicode / `Path`	The path.

`method` `Sense2VecComponent.from_disk`

Load a component from a directory. Also called when you run nlp.from_disk.

Argument	Type	Description
`path`	unicode / `Path`	The path to load from.
RETURNS	`Sense2VecComponent`	The loaded object.

`class` `registry`

Function registry (powered by catalogue) to easily customize the functions used to generate keys and phrases. Allows you to decorate and name custom functions, swap them out and serialize the custom names when you save out the model. The following registry options are available:

Name	Description
`registry.make_key`	Given a `word` and `sense`, return a string of the key, e.g. `"word
`registry.split_key`	Given a string key, return a `(word, sense)` tuple.
`registry.make_spacy_key`	Given a spaCy object (`Token` or `Span`) and a boolean `prefer_ents` keyword argument (whether to prefer the entity label for single tokens), return a `(word, sense)` tuple. Used in extension attributes to generate a key for tokens and spans.
`registry.get_phrases`	Given a spaCy `Doc`, return a list of `Span` objects used for sense2vec phrases (typically noun phrases and named entities).
`registry.merge_phrases`	Given a spaCy `Doc`, get all sense2vec phrases and merge them into single tokens.

Each registry has a register method that can be used as a function decorator and takes one argument, the name of the custom function.

from sense2vec import registry

@registry.make_key.register("custom")
def custom_make_key(word, sense):
    return f"{word}###{sense}"

@registry.split_key.register("custom")
def custom_split_key(key):
    word, sense = key.split("###")
    return word, sense

When initializing the Sense2Vec object, you can now pass in a dictionary of overrides with the names of your custom registered functions.

overrides = {"make_key": "custom", "split_key": "custom"}
s2v = Sense2Vec(overrides=overrides)

This makes it easy to experiment with different strategies and serializing the strategies as plain strings (instead of having to pass around and/or pickle the functions themselves).

🚂 Training your own sense2vec vectors

The /scripts directory contains command line utilities for preprocessing text and training your own vectors.

Requirements

To train your own sense2vec vectors, you'll need the following:

A very large source of raw text (ideally more than you'd use for word2vec, since the senses make the vocabulary more sparse). We recommend at least 1 billion words.
A pretrained spaCy model that assigns part-of-speech tags, dependencies and named entities, and populates the doc.noun_chunks. If the language you need doesn't provide a built-in syntax iterator for noun phrases, you'll need to write your own. (The doc.noun_chunks and doc.ents are what sense2vec uses to determine what's a phrase.)
GloVe or fastText installed and built. You should be able to clone the repo and run make in the respective directory.

Step-by-step process

The training process is split up into several steps to allow you to resume at any given point. Processing scripts are designed to operate on single files, making it easy to parallelize the work. The scripts in this repo require either GloVe or fastText, which you need to clone and build with make.

For fastText, the scripts will require the path to the created binary file. If you're working on Windows, you can build with cmake, or alternatively use the .exe file from this unofficial repo with fastText binary builds for Windows: https://github.com/xiamx/fastText/releases.

	Script	Description
1.	`01_parse.py`	Use spaCy to parse the raw text and output binary collections of `Doc` objects (see `DocBin`).
2.	`02_preprocess.py`	Load a collection of parsed `Doc` objects produced in the previous step and output text files in the sense2vec format (one sentence per line and merged phrases with senses).
3.	`03_glove_build_counts.py`	Use GloVe to build the vocabulary and counts. Skip this step if you're using Word2Vec via fastText.
4.	`04_glove_train_vectors.py` `04_fasttext_train_vectors.py`	Use GloVe or fastText to train vectors.
5.	`05_export.py`	Load the vectors and frequencies and output a sense2vec component that can be loaded via `Sense2Vec.from_disk`.
6.	`06_precompute_cache.py`	Optional: Precompute nearest-neighbor queries for every entry in the vocab to make `Sense2Vec.most_similar` faster.

For more detailed documentation of the scripts, check out the source or run them with --help. For example, python scripts/01_parse.py --help.
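To give a feel for the training text described in step 2 (one sentence per line, with merged phrases and senses), here is a minimal sketch that renders a hand-tagged sentence in that general shape. It is only an illustration: the `to_s2v_line` helper is hypothetical, the tags are assigned by hand instead of by spaCy, and the exact output of the preprocessing script (for example its handling of casing or punctuation) may differ.

```python
def to_s2v_line(units):
    """Render one sentence as space-separated 'text|SENSE' units."""
    parts = []
    for text, sense in units:
        # Multi-word phrases are merged into a single unit joined by underscores.
        parts.append(text.replace(" ", "_") + "|" + sense)
    return " ".join(parts)


# A hand-tagged sentence; in the real pipeline, phrases and tags come from spaCy.
sentence = [
    ("I", "PRON"),
    ("like", "VERB"),
    ("natural language processing", "NOUN"),
    (".", "PUNCT"),
]
line = to_s2v_line(sentence)
print(line)  # I|PRON like|VERB natural_language_processing|NOUN .|PUNCT
```

Because every unit carries its sense, the same surface word with two different tags becomes two separate vocabulary entries, which is why more raw text is needed than for plain word2vec.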

🍳 Prodigy recipes

This package also seamlessly integrates with the Prodigy annotation tool and exposes recipes for using sense2vec vectors to quickly generate lists of multi-word phrases and bootstrap NER annotations. To use a recipe, sense2vec needs to be installed in the same environment as Prodigy. For an example of a real-world use case, check out this NER project with downloadable datasets.

The following recipes are available – see below for more detailed docs.

Recipe	Description
`sense2vec.teach`	Bootstrap a terminology list using sense2vec.
`sense2vec.to-patterns`	Convert phrases dataset to token-based match patterns.
`sense2vec.eval`	Evaluate a sense2vec model by asking about phrase triples.
`sense2vec.eval-most-similar`	Evaluate a sense2vec model by correcting the most similar entries.
`sense2vec.eval-ab`	Perform an A/B evaluation of two pretrained sense2vec vector models.

`recipe` `sense2vec.teach`

Bootstrap a terminology list using sense2vec. Prodigy will suggest similar terms based on the most similar phrases from sense2vec, and the suggestions will be adjusted as you annotate and accept similar phrases. For each seed term, the best matching sense according to the sense2vec vectors will be used.

prodigy sense2vec.teach [dataset] [vectors_path] [--seeds] [--threshold]
[--n-similar] [--batch-size] [--resume]

Argument	Type	Description
`dataset`	positional	Dataset to save annotations to.
`vectors_path`	positional	Path to pretrained sense2vec vectors.
`--seeds`, `-s`	option	One or more comma-separated seed phrases.
`--threshold`, `-t`	option	Similarity threshold. Defaults to `0.85`.
`--n-similar`, `-n`	option	Number of similar items to get at once.
`--batch-size`, `-b`	option	Batch size for submitting annotations.
`--resume`, `-R`	flag	Resume from an existing phrases dataset.

Example

prodigy sense2vec.teach tech_phrases /path/to/s2v_reddit_2015_md
--seeds "natural language processing, machine learning, artificial intelligence"

`recipe` `sense2vec.to-patterns`

Convert a dataset of phrases collected with sense2vec.teach to token-based match patterns that can be used with spaCy's EntityRuler or recipes like ner.match. If no output file is specified, the patterns are written to stdout. The examples are tokenized so that multi-token terms are represented correctly, e.g.: {"label": "SHOE_BRAND", "pattern": [{ "LOWER": "new" }, { "LOWER": "balance" }]}.

prodigy sense2vec.to-patterns [dataset] [spacy_model] [label] [--output-file]
[--case-sensitive] [--dry]

Argument	Type	Description
`dataset`	positional	Phrase dataset to convert.
`spacy_model`	positional	spaCy model for tokenization.
`label`	positional	Label to apply to all patterns.
`--output-file`, `-o`	option	Optional output file. Defaults to stdout.
`--case-sensitive`, `-CS`	flag	Make patterns case-sensitive.
`--dry`, `-D`	flag	Perform a dry run and don't output anything.

Example

prodigy sense2vec.to-patterns tech_phrases en_core_web_sm TECHNOLOGY
--output-file /path/to/patterns.jsonl
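The shape of the generated patterns can be sketched without Prodigy. The example below is a minimal sketch under stated assumptions: `phrase_to_pattern` is a hypothetical helper, whitespace splitting stands in for the spaCy tokenizer the recipe uses, and using `ORTH` for the case-sensitive variant is an assumption about the attribute name.

```python
import json


def phrase_to_pattern(phrase, label, case_sensitive=False):
    """Turn a phrase into one token-based match pattern entry."""
    # Whitespace splitting stands in for a real tokenizer here.
    tokens = phrase.split()
    if case_sensitive:
        pattern = [{"ORTH": token} for token in tokens]
    else:
        pattern = [{"LOWER": token.lower()} for token in tokens]
    return {"label": label, "pattern": pattern}


entry = phrase_to_pattern("New Balance", "SHOE_BRAND")
# One JSON object per line is the usual layout of a .jsonl patterns file.
print(json.dumps(entry))
```

A real tokenizer matters for phrases with punctuation or hyphens, where whitespace splitting would produce tokens that never match the tokenized text.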

`recipe` `sense2vec.eval`

Evaluate a sense2vec model by asking about phrase triples: is word A more similar to word B, or to word C? If the human mostly agrees with the model, the vectors model is good. The recipe will only ask about vectors with the same sense and supports different example selection strategies.

prodigy sense2vec.eval [dataset] [vectors_path] [--strategy] [--senses]
[--exclude-senses] [--n-freq] [--threshold] [--batch-size] [--eval-whole]
[--eval-only] [--show-scores]

Argument	Type	Description
`dataset`	positional	Dataset to save annotations to.
`vectors_path`	positional	Path to pretrained sense2vec vectors.
`--strategy`, `-st`	option	Example selection strategy: `most_similar` (default), `most_least_similar` or `random`.
`--senses`, `-s`	option	Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used.
`--exclude-senses`, `-es`	option	Comma-separated list of senses to exclude. See `prodigy_recipes.EVAL_EXCLUDE_SENSES` for the defaults.
`--n-freq`, `-f`	option	Number of most frequent entries to limit to.
`--threshold`, `-t`	option	Minimum similarity threshold to consider examples.
`--batch-size`, `-b`	option	Batch size to use.
`--eval-whole`, `-E`	flag	Evaluate the whole dataset instead of the current session.
`--eval-only`, `-O`	flag	Don't annotate, only evaluate the current dataset.
`--show-scores`, `-S`	flag	Show all scores for debugging.

Strategies

Name	Description
`most_similar`	Pick a random word from a random sense and get its most similar entries of the same sense. Ask about the similarity to the last and middle entry from that selection.
`most_least_similar`	Pick a random word from a random sense and get the least similar entry from its most similar entries, and then the last most similar entry of that.
`random`	Pick a random sample of 3 words from the same random sense.

Example

prodigy sense2vec.eval vectors_eval /path/to/s2v_reddit_2015_md
--senses NOUN,ORG,PRODUCT --threshold 0.5
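The "does the human agree with the model" idea behind the triples evaluation can be sketched as a simple agreement rate. This is an illustration only: the `agreement_rate` function, the tuple layout and the numbers are made up for the example, and the recipe's actual scoring may differ.

```python
def agreement_rate(judgements):
    """Fraction of triples where the human picked the option the model scored higher."""
    if not judgements:
        return 0.0
    agreed = 0
    for score_b, score_c, human_choice in judgements:
        # The model "prefers" whichever candidate has the higher similarity to A.
        model_choice = "B" if score_b >= score_c else "C"
        if model_choice == human_choice:
            agreed += 1
    return agreed / len(judgements)


# Each tuple: (model similarity of A to B, model similarity of A to C, human answer).
# The numbers are invented for illustration.
toy_judgements = [
    (0.90, 0.40, "B"),
    (0.30, 0.70, "C"),
    (0.80, 0.60, "C"),
    (0.55, 0.20, "B"),
]
rate = agreement_rate(toy_judgements)
print(rate)  # 0.75
```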

`recipe` `sense2vec.eval-most-similar`

Evaluate a vectors model by looking at the most similar entries it returns for a random phrase and unselecting the mistakes.

prodigy sense2vec.eval-most-similar [dataset] [vectors_path] [--senses] [--exclude-senses]
[--n-freq] [--n-similar] [--batch-size] [--eval-whole] [--eval-only]
[--show-scores]

Argument	Type	Description
`dataset`	positional	Dataset to save annotations to.
`vectors_path`	positional	Path to pretrained sense2vec vectors.
`--senses`, `-s`	option	Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used.
`--exclude-senses`, `-es`	option	Comma-separated list of senses to exclude. See `prodigy_recipes.EVAL_EXCLUDE_SENSES` for the defaults.
`--n-freq`, `-f`	option	Number of most frequent entries to limit to.
`--n-similar`, `-n`	option	Number of similar items to check. Defaults to `10`.
`--batch-size`, `-b`	option	Batch size to use.
`--eval-whole`, `-E`	flag	Evaluate the whole dataset instead of the current session.
`--eval-only`, `-O`	flag	Don't annotate, only evaluate the current dataset.
`--show-scores`, `-S`	flag	Show all scores for debugging.

Example

prodigy sense2vec.eval-most-similar vectors_eval_sim /path/to/s2v_reddit_2015_md
--senses NOUN,ORG,PRODUCT

`recipe` `sense2vec.eval-ab`

Perform an A/B evaluation of two pretrained sense2vec vector models by comparing the most similar entries they return for a random phrase. The UI shows two randomized options with the most similar entries of each model and highlights the phrases that differ. At the end of the annotation session the overall stats and preferred model are shown.

prodigy sense2vec.eval-ab [dataset] [vectors_path_a] [vectors_path_b] [--senses]
[--exclude-senses] [--n-freq] [--n-similar] [--batch-size] [--eval-whole]
[--eval-only] [--show-mapping]

Argument	Type	Description
`dataset`	positional	Dataset to save annotations to.
`vectors_path_a`	positional	Path to pretrained sense2vec vectors.
`vectors_path_b`	positional	Path to pretrained sense2vec vectors.
`--senses`, `-s`	option	Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used.
`--exclude-senses`, `-es`	option	Comma-separated list of senses to exclude. See `prodigy_recipes.EVAL_EXCLUDE_SENSES` for the defaults.
`--n-freq`, `-f`	option	Number of most frequent entries to limit to.
`--n-similar`, `-n`	option	Number of similar items to check. Defaults to `10`.
`--batch-size`, `-b`	option	Batch size to use.
`--eval-whole`, `-E`	flag	Evaluate the whole dataset instead of the current session.
`--eval-only`, `-O`	flag	Don't annotate, only evaluate the current dataset.
`--show-mapping`, `-S`	flag	Show which models are option 1 and option 2 in the UI (for debugging).

Example

prodigy sense2vec.eval-ab vectors_eval_sim /path/to/s2v_reddit_2015_md /path/to/s2v_reddit_2019_lg --senses NOUN,ORG,PRODUCT

Pretrained vectors

The pretrained Reddit vectors support the following "senses", either part-of-speech tags or entity labels. For more details, see spaCy's annotation scheme overview.

Tag	Description	Examples
`ADJ`	adjective	big, old, green
`ADP`	adposition	in, to, during
`ADV`	adverb	very, tomorrow, down, where
`AUX`	auxiliary	is, has (done), will (do)
`CONJ`	conjunction	and, or, but
`DET`	determiner	a, an, the
`INTJ`	interjection	psst, ouch, bravo, hello
`NOUN`	noun	girl, cat, tree, air, beauty
`NUM`	numeral	1, 2017, one, seventy-seven, MMXIV
`PART`	particle	's, not
`PRON`	pronoun	I, you, he, she, myself, somebody
`PROPN`	proper noun	Mary, John, London, NATO, HBO
`PUNCT`	punctuation	, ? ( )
`SCONJ`	subordinating conjunction	if, while, that
`SYM`	symbol	$, %, =, :), 😝
`VERB`	verb	run, runs, running, eat, ate, eating

Entity Label	Description
`PERSON`	People, including fictional.
`NORP`	Nationalities or religious or political groups.
`FACILITY`	Buildings, airports, highways, bridges, etc.
`ORG`	Companies, agencies, institutions, etc.
`GPE`	Countries, cities, states.
`LOC`	Non-GPE locations, mountain ranges, bodies of water.
`PRODUCT`	Objects, vehicles, foods, etc. (Not services.)
`EVENT`	Named hurricanes, battles, wars, sports events, etc.
`WORK_OF_ART`	Titles of books, songs, etc.
`LANGUAGE`	Any named language.

Comments

Integrate Eigen Library to Remove BLAS Dependency

The first step will be to integrate the Eigen library into the codebase and have both BLAS and Eigen paths. When formalized we can then remove the BLAS dependency.

opened by init-random 27

Help Re-writing 04_fasttext_train_vectors.py for Windows 10 Compatibility

The two os.system(cmd) portions of the 04_fasttext_train_vectors.py script on lines 61-67 and lines 75-81 do not work for Windows users. So, I re-wrote lines 61-67 using the FastText Word representations documentation. However, lines 75-81 are proving more difficult to rewrite because I don't know the structure of the vocab_file output file created with lines 76-78, included below. The 3rd-to-last line of my code below uses the save_model function to save the model to a binary file for later loading as shown here. However, this is not the input file format expected in 05_export.py. Could you please provide a sample of what the vocab_file output file looks like? Or, better yet, do you have any suggestions for how to replace lines 75-78 that doesn't involve using os.system, CLI code, or the fasttext_bin.

Lines 76-78 of 04_fasttext_train_vectors.py:

vocab_file = output_path / "vocab.txt"
cmd = f"{fasttext_bin} dump {output_file.with_suffix('.bin')} dict > {vocab_file}"
print(cmd)
vocab_cmd = os.system(cmd)

Here is the code I used in place of 04_fasttext_train_vectors.py to make it Windows compatible:

from pathlib import Path
from wasabi import msg
import fasttext

in_dir = "./corpus_parsed3"
out_dir = "./fasttext_model3"
n_threads = 26
min_count = 50
vector_size = 300
verbose = 2

input_path = Path(in_dir)
output_path = Path(out_dir)
if not input_path.exists() or not input_path.is_dir():
    msg.fail("Not a valid input directory", in_dir, exits=1)
if not output_path.exists():
    output_path.mkdir(parents=True)
    msg.good(f"Created output directory {out_dir}")
output_file = output_path / f"vectors_w2v_{vector_size}dim.bin"

# fastText expects only one input file and only reads from disk and not
# stdin, so we need to create a temporary file that concatenates the inputs
tmp_path = input_path / "s2v_input.tmp"
input_files = [p for p in input_path.iterdir() if p.suffix == ".s2v"]
if not input_files:
    msg.fail("Input directory contains no .s2v files", in_dir, exits=1)
with tmp_path.open("a", encoding="utf8") as tmp_file:
    for input_file in input_files:
        with input_file.open("r", encoding="utf-8") as f:
            tmp_file.write(f.read())
msg.info("Created temporary merged input file", tmp_path)

# Train with the fastText Python bindings instead of the CLI binary
sense2vec_model = fasttext.train_unsupervised(
    str(tmp_path),
    thread=n_threads,
    epoch=5,
    dim=vector_size,
    minn=0,
    maxn=0,
    minCount=min_count,
    verbose=verbose,
)
sense2vec_model.save_model(str(output_file))

tmp_path.unlink()
msg.good("Deleted temporary input file", tmp_path)
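
One possible way to produce a vocab file without the CLI is to take the words and counts from the trained model and write them out by hand. The sketch below is only that: `write_vocab` is a hypothetical helper, and the file layout it writes (a first line with the number of entries, then one `word count word` row per entry) is an assumption about what `fasttext dump ... dict` prints. It should be checked against a real dump and against what 05_export.py parses. The fastText Python bindings provide `get_words(include_freq=True)`, which is shown in a comment so that the example runs without a trained model.

```python
import tempfile
from pathlib import Path


def write_vocab(words, freqs, vocab_file):
    """Write a vocab file from parallel lists of words and counts."""
    vocab_file = Path(vocab_file)
    with vocab_file.open("w", encoding="utf8") as f:
        # Assumed header: number of entries on the first line.
        f.write(f"{len(words)}\n")
        for word, freq in zip(words, freqs):
            # Assumed row layout: "<word> <count> word".
            f.write(f"{word} {freq} word\n")
    return vocab_file


# With a trained model this would be roughly:
#   words, freqs = sense2vec_model.get_words(include_freq=True)
# The two entries below are placeholders, not real corpus counts.
words = ["duck|NOUN", "natural_language_processing|NOUN"]
freqs = [120, 57]
vocab_path = write_vocab(words, freqs, Path(tempfile.mkdtemp()) / "vocab.txt")
print(vocab_path.read_text(encoding="utf8"))
```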

scripts

opened by dshefman1 15

Unable to download reddit_vectors model

Hi @honnibal ,

I am getting the following error when I execute: $ python -m sense2vec.download

File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "main", mod_spec)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/runpy.py", line 85, in run_code
    exec(code, run_globals)
File "/Users/boscoraju/src/sense2vec/sense2vec/download.py", line 38, in
    plac.call(main)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func((args + varargs + extraopts), *_kwargs)
File "/Users/boscoraju/src/sense2vec/sense2vec/download.py", line 26, in main
    package = sputnik.install(about.title, about.version, about.default_model)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sputnik/init.py", line 37, in install
    index.update()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sputnik/index.py", line 84, in update
    index = json.load(session.open(request, 'utf8'))
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sputnik/session.py", line 43, in open
    r = self.opener.open(request)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 466, in open
    response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 484, in _open
    '_open', req)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 1297, in https_open
    context=self._context, check_hostname=self._check_hostname)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 1256, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:645)>

It looks like it is unable to make a connection. Could you point me in the right direction?

Thank you.

opened by boscoraju 14

Error using the most similar method

Following the successful installation of sense2vec, I got the model loaded as described in the response to the issue #3, but I am getting an error when I try to use the most_similar method.

Following is what I entered after loading the model: print vector_map.most_similar("education", topn=10)

Below is the error I receive.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-7f468f5b06ca> in <module>()
----> 1 print vector_map.most_similar("education", topn=10)

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorMap.most_similar (sense2vec/vectors.cpp:3363)()
     66             yield (string, freq, self.data[i])
     67 
---> 68     def most_similar(self, float[:] vector, int n):
     69         indices, scores = self.data.most_similar(vector, n)
     70         return [self.strings[idx] for idx in indices], scores

TypeError: most_similar() takes exactly 2 positional arguments (1 given)

So I understand that the most_similar method wants a float parameter followed by an int parameter. I thought the function would expect arguments similar to those of the most_similar method in gensim's word2vec implementation.

Could you please show me how to use the most_similar method in the sense2vec implementation?
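
The signature in the traceback, `most_similar(self, float[:] vector, int n)`, suggests that this version is queried with a vector and a count rather than with a string and `topn`. The toy below mimics only that calling convention using plain Python lists. `ToyVectorMap` is a stand-in written for this illustration and is not the library's `VectorMap`; how the real class exposes the vector for a given key is not shown in the issue, so the lookup here is an assumption.

```python
import math


class ToyVectorMap:
    """Stand-in store: maps keys to vectors and is queried by vector."""

    def __init__(self):
        self.data = {}

    def add(self, key, vector):
        self.data[key] = vector

    def most_similar(self, vector, n):
        # Rank all stored keys by cosine similarity to the query vector.
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        scored = sorted(
            ((key, cosine(vector, vec)) for key, vec in self.data.items()),
            key=lambda item: item[1],
            reverse=True,
        )
        top = scored[:n]
        # Mirror the (keys, scores) return shape seen in the traceback.
        return [key for key, _ in top], [score for _, score in top]


vm = ToyVectorMap()
vm.add("education|NOUN", [1.0, 0.0])
vm.add("school|NOUN", [0.9, 0.1])
vm.add("banana|NOUN", [0.0, 1.0])

# Look up the vector first, then pass the vector and n positionally.
query_vector = vm.data["education|NOUN"]
keys, scores = vm.most_similar(query_vector, 2)
print(keys)
```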

opened by newterminator 12

Help loading model

I downloaded the trained model from:

https://index.spacy.io/models/reddit_vectors-1.0.1/archive.gz

How can I load this into a VectorMap or a gensim model in order to make similarity queries?

opened by elyase 12

Unable to load model

When I tried to load the model via "model = sense2vec.load()" I get the following error:

RuntimeError("Model not installed. Please run 'python -m " RuntimeError: Model not installed. Please run 'python -m sense2vec.download' to install latest compatible model.

Then I tried to execute the command "python -m sense2vec.download" and I got another error:

File "C:\Users\rg\Anaconda2\lib\runpy.py", line 174, in _run_module_as_main
    "main", fname, loader, pkg_name)
File "C:\Users\rg\Anaconda2\lib\runpy.py", line 72, in run_code
    exec code in run_globals
File "c:\users\rg\src\sense2vec\sense2vec\download.py", line 38, in
    plac.call(main)
File "C:\Users\rg\Anaconda2\lib\site-packages\plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
File "C:\Users\rg\Anaconda2\lib\site-packages\plac_core.py", line 207, in consume
    return cmd, self.func((args + varargs + extraopts), *_kwargs)
File "c:\users\rg\src\sense2vec\sense2vec\download.py", line 20, in main
    sputnik.package(about.title, about.version, about.default_model)
AttributeError: 'module' object has no attribute 'title'

Can you please help me?

opened by reneg117 10
install sense2vec on mojave
Collecting sense2vec==1.0.0a0 Using cached https://files.pythonhosted.org/packages/28/4a/a1d9a28545adc839789c1442e7314cb0c70b8657a885f9e5b287fade7814/sense2vec-1.0.0a0.tar.gz Requirement already satisfied: numpy>=1.7 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from sense2vec==1.0.0a0) (1.15.0) Requirement already satisfied: ujson>=1.35 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from sense2vec==1.0.0a0) (1.35) Requirement already satisfied: preshed<2.0.0,>=1.0.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from sense2vec==1.0.0a0) (1.0.1) Requirement already satisfied: murmurhash<0.29,>=0.28 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from sense2vec==1.0.0a0) (0.28.0) Requirement already satisfied: cymem<1.32,>=1.30 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from sense2vec==1.0.0a0) (1.31.2) Installing collected packages: sense2vec Running setup.py install for sense2vec ... 
error Complete output from command /Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7 -u -c "import setuptools, tokenize;file='/private/var/folders/fl/16v92h311z55cljkhrhlh30m0000gn/T/pip-install-k2nj2n_g/sense2vec/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /private/var/folders/fl/16v92h311z55cljkhrhlh30m0000gn/T/pip-record-7ia78okh/install-record.txt --single-version-externally-managed --compile: running install running build running build_py creating build creating build/lib.macosx-10.9-x86_64-3.7 creating build/lib.macosx-10.9-x86_64-3.7/sense2vec copying sense2vec/init.py -> build/lib.macosx-10.9-x86_64-3.7/sense2vec copying sense2vec/about.py -> build/lib.macosx-10.9-x86_64-3.7/sense2vec copying sense2vec/spacy_pipeline.py -> build/lib.macosx-10.9-x86_64-3.7/sense2vec creating build/lib.macosx-10.9-x86_64-3.7/sense2vec/tests copying sense2vec/tests/conftest.py -> build/lib.macosx-10.9-x86_64-3.7/sense2vec/tests copying sense2vec/tests/test_vectors.py -> build/lib.macosx-10.9-x86_64-3.7/sense2vec/tests copying sense2vec/tests/init.py -> build/lib.macosx-10.9-x86_64-3.7/sense2vec/tests copying sense2vec/tests/test_sense2vec.py -> build/lib.macosx-10.9-x86_64-3.7/sense2vec/tests copying sense2vec/_strings.pyx -> build/lib.macosx-10.9-x86_64-3.7/sense2vec copying sense2vec/vectors.pyx -> build/lib.macosx-10.9-x86_64-3.7/sense2vec copying sense2vec/cfile.pyx -> build/lib.macosx-10.9-x86_64-3.7/sense2vec copying sense2vec/cfile.pxd -> build/lib.macosx-10.9-x86_64-3.7/sense2vec copying sense2vec/vectors.pxd -> build/lib.macosx-10.9-x86_64-3.7/sense2vec copying sense2vec/init.pxd -> build/lib.macosx-10.9-x86_64-3.7/sense2vec copying sense2vec/_strings.pxd -> build/lib.macosx-10.9-x86_64-3.7/sense2vec running build_ext building 'sense2vec.vectors' extension creating build/temp.macosx-10.9-x86_64-3.7 creating build/temp.macosx-10.9-x86_64-3.7/sense2vec 
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m -I/private/var/folders/fl/16v92h311z55cljkhrhlh30m0000gn/T/pip-install-k2nj2n_g/sense2vec/include -I/Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c sense2vec/vectors.cpp -o build/temp.macosx-10.9-x86_64-3.7/sense2vec/vectors.o -O3 -Wno-unused-function -fno-stack-protector In file included from sense2vec/vectors.cpp:508: In file included from /private/var/folders/fl/16v92h311z55cljkhrhlh30m0000gn/T/pip-install-k2nj2n_g/sense2vec/include/numpy/arrayobject.h:15: In file included from /private/var/folders/fl/16v92h311z55cljkhrhlh30m0000gn/T/pip-install-k2nj2n_g/sense2vec/include/numpy/ndarrayobject.h:17: In file included from /private/var/folders/fl/16v92h311z55cljkhrhlh30m0000gn/T/pip-install-k2nj2n_g/sense2vec/include/numpy/ndarraytypes.h:1728: /private/var/folders/fl/16v92h311z55cljkhrhlh30m0000gn/T/pip-install-k2nj2n_g/sense2vec/include/numpy/npy_deprecated_api.h:11:2: warning: "Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings] #warning "Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" ^ sense2vec/vectors.cpp:9861:42: warning: comparison of integers of different signs: 'std::__1::priority_queue<std::__1::pair<float, int>, std::__1::vector<std::__1::pair<float, int>, std::__1::allocator<std::__1::pair<float, int> > >, std::__1::less<std::__1::pair<float, int> > >::size_type' (aka 'unsigned long') and 'int' [-Wsign-compare] __pyx_t_4 = ((__pyx_v_queue.size() > __pyx_v_nr_out) != 0); ~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~ sense2vec/vectors.cpp:9723:21: warning: code will never be executed [-Wunreachable-code] if (1 == 0) abort(); ^~~~~ sense2vec/vectors.cpp:9723:13: note: silence by adding parentheses to mark code as explicitly dead if (1 == 0) 
abort(); ^ /* DISABLES CODE */ ( ) sense2vec/vectors.cpp:31777:24: error: no member named 'exc_type' in '_ts' tmp_type = tstate->exc_type; ~~~~~~ ^ sense2vec/vectors.cpp:31778:25: error: no member named 'exc_value' in '_ts'; did you mean 'curexc_value'? tmp_value = tstate->exc_value; ^~~~~~~~~ curexc_value /Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m/pystate.h:237:15: note: 'curexc_value' declared here PyObject *curexc_value; ^ sense2vec/vectors.cpp:31779:22: error: no member named 'exc_traceback' in '_ts'; did you mean 'curexc_traceback'? tmp_tb = tstate->exc_traceback; ^~~~~~~~~~~~~ curexc_traceback /Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m/pystate.h:238:15: note: 'curexc_traceback' declared here PyObject *curexc_traceback; ^ sense2vec/vectors.cpp:31780:13: error: no member named 'exc_type' in '_ts' tstate->exc_type = *type; ~~~~~~ ^ sense2vec/vectors.cpp:31781:13: error: no member named 'exc_value' in '_ts'; did you mean 'curexc_value'? tstate->exc_value = *value; ^~~~~~~~~ curexc_value /Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m/pystate.h:237:15: note: 'curexc_value' declared here PyObject *curexc_value; ^ sense2vec/vectors.cpp:31782:13: error: no member named 'exc_traceback' in '_ts'; did you mean 'curexc_traceback'? tstate->exc_traceback = *tb; ^~~~~~~~~~~~~ curexc_traceback /Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m/pystate.h:238:15: note: 'curexc_traceback' declared here PyObject *curexc_traceback; ^ sense2vec/vectors.cpp:32757:21: error: no member named 'exc_type' in '_ts' *type = tstate->exc_type; ~~~~~~ ^ sense2vec/vectors.cpp:32758:22: error: no member named 'exc_value' in '_ts'; did you mean 'curexc_value'? 
*value = tstate->exc_value; ^~~~~~~~~ curexc_value /Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m/pystate.h:237:15: note: 'curexc_value' declared here PyObject *curexc_value; ^ sense2vec/vectors.cpp:32759:19: error: no member named 'exc_traceback' in '_ts'; did you mean 'curexc_traceback'? *tb = tstate->exc_traceback; ^~~~~~~~~~~~~ curexc_traceback /Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m/pystate.h:238:15: note: 'curexc_traceback' declared here PyObject *curexc_traceback; ^ sense2vec/vectors.cpp:32766:24: error: no member named 'exc_type' in '_ts' tmp_type = tstate->exc_type; ~~~~~~ ^ sense2vec/vectors.cpp:32767:25: error: no member named 'exc_value' in '_ts'; did you mean 'curexc_value'? tmp_value = tstate->exc_value; ^~~~~~~~~ curexc_value /Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m/pystate.h:237:15: note: 'curexc_value' declared here PyObject *curexc_value; ^ sense2vec/vectors.cpp:32768:22: error: no member named 'exc_traceback' in '_ts'; did you mean 'curexc_traceback'? tmp_tb = tstate->exc_traceback; ^~~~~~~~~~~~~ curexc_traceback /Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m/pystate.h:238:15: note: 'curexc_traceback' declared here PyObject *curexc_traceback; ^ sense2vec/vectors.cpp:32769:13: error: no member named 'exc_type' in '_ts' tstate->exc_type = type; ~~~~~~ ^ sense2vec/vectors.cpp:32770:13: error: no member named 'exc_value' in '_ts'; did you mean 'curexc_value'? tstate->exc_value = value; ^~~~~~~~~ curexc_value /Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m/pystate.h:237:15: note: 'curexc_value' declared here PyObject *curexc_value; ^ sense2vec/vectors.cpp:32771:13: error: no member named 'exc_traceback' in '_ts'; did you mean 'curexc_traceback'? 
tstate->exc_traceback = tb; ^~~~~~~~~~~~~ curexc_traceback /Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m/pystate.h:238:15: note: 'curexc_traceback' declared here PyObject *curexc_traceback; ^ sense2vec/vectors.cpp:32816:24: error: no member named 'exc_type' in '_ts' tmp_type = tstate->exc_type; ~~~~~~ ^ sense2vec/vectors.cpp:32817:25: error: no member named 'exc_value' in '_ts'; did you mean 'curexc_value'? tmp_value = tstate->exc_value; ^~~~~~~~~ curexc_value /Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m/pystate.h:237:15: note: 'curexc_value' declared here PyObject *curexc_value; ^ sense2vec/vectors.cpp:32818:22: error: no member named 'exc_traceback' in '_ts'; did you mean 'curexc_traceback'? tmp_tb = tstate->exc_traceback; ^~~~~~~~~~~~~ curexc_traceback /Library/Frameworks/Python.framework/Versions/3.7/include/python3.7m/pystate.h:238:15: note: 'curexc_traceback' declared here PyObject *curexc_traceback; ^ sense2vec/vectors.cpp:32819:13: error: no member named 'exc_type' in '_ts' tstate->exc_type = local_type; ~~~~~~ ^ fatal error: too many errors emitted, stopping now [-ferror-limit=] 3 warnings and 20 errors generated. error: command 'gcc' failed with exit status 1

----------------------------------------

Command "/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7 -u -c "import setuptools, tokenize;file='/private/var/folders/fl/16v92h311z55cljkhrhlh30m0000gn/T/pip-install-k2nj2n_g/sense2vec/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /private/var/folders/fl/16v92h311z55cljkhrhlh30m0000gn/T/pip-record-7ia78okh/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/fl/16v92h311z55cljkhrhlh30m0000gn/T/pip-install-k2nj2n_g/sense2vec/
install
opened by junyanp 9
Set force=True on all set_extension calls

This avoids errors like "Extension 'in_s2v' already exists on Token" when multiple spaCy + sense2vec pipelines are loaded in the same process (e.g. when creating the spaCy + sense2vec pipeline in a mapPartitions call in Spark).

opened by tolomaus 9

Error while opening own trained vectors file

I was able to train data using train_word2vec.py after preprocessing the data using merge_text.py. Below is the outcome of train_word2vec.py:

[Image: vectors]

Then I passed the vectors.bin to the new version 0.2.0 of sense2vec and got an IOError. This is what I used to load the vectors:

from sense2vec.vectors import VectorMap
vector_map = VectorMap(128)
vector_map.load("/home/noname/Documents/data/vectors")

The error:

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-9-315510f2d9d1> in <module>()
      1 vector_map = VectorMap(128)
----> 2 vector_map.load("/home/noname/Documents/data/vectors")

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorMap.load (sense2vec/vectors.cpp:4870)()
    100 
    101     def load(self, data_dir):
--> 102         self.data.load(path.join(data_dir, 'vectors.bin'))
    103         with open(path.join(data_dir, 'strings.json')) as file_:
    104             self.strings.load(file_)

/home/noname/spacy/src/sense2vec/sense2vec/vectors.pyx in sense2vec.vectors.VectorStore.load (sense2vec/vectors.cpp:7049)()
    200         cdef float[:] cv
    201         for i in range(nr_vector):
--> 202             cfile.read_into(&tmp[0], self.nr_dim, sizeof(tmp[0]))
    203             ptr = &tmp[0]
    204             cv = <float[:128]>ptr

/home/noname/.linuxbrew/Cellar/python/2.7.11/lib/python2.7/site-packages/spacy/cfile.pyx in spacy.cfile.CFile.read_into (spacy/cfile.cpp:1147)()
     25         st = fread(dest, elem_size, number, self.fp)
     26         if st != number:
---> 27             raise IOError
     28 
     29     cdef int write_from(self, void* src, size_t number, size_t elem_size) except -1:

IOError:

I also wanted to ask how to get the relevant freqs.json and strings.json for the trained vectors. For strings.json, I have the batch outputs from merge_text.py, so they need to be mapped to the relevant information in freqs.json. If there is already a function that does this and I missed calling it, please let me know.

Python version: 2.7.11 Spacy version: 0.100.5

opened by newterminator 9

Feature setup config

Hello -- This is a simple config setup that lets you configure the -L/-l BLAS link options for your individual environment. For example, in setup.cfg I needed:

[link_options]
link_dir = /opt/openblas/lib
link_library = openblas

The current setup may not be optimal as more platforms are supported, but it may be a starting point.
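
As a rough illustration of how a build script could consume such a section, the sketch below parses a [link_options] block with the standard library's configparser and turns it into -L/-l flags. How the pull request itself reads the file is not shown here, so `read_link_options` is a hypothetical helper and not the project's actual setup code.

```python
import configparser

SETUP_CFG = """
[link_options]
link_dir = /opt/openblas/lib
link_library = openblas
"""


def read_link_options(cfg_text):
    """Turn a [link_options] section into a list of linker flags."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # Without the section, fall back to no extra flags.
    if not parser.has_section("link_options"):
        return []
    section = parser["link_options"]
    flags = []
    if "link_dir" in section:
        flags.append("-L" + section["link_dir"])
    if "link_library" in section:
        flags.append("-l" + section["link_library"])
    return flags


print(read_link_options(SETUP_CFG))
```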

opened by init-random 9
install sense2vec in Windows
`(base) C:\WINDOWS\system32>pip install sense2vec==1.0.0a0
Collecting sense2vec==1.0.0a0
  Downloading https://files.pythonhosted.org/packages/28/4a/a1d9a28545adc839789c1442e7314cb0c70b8657a885f9e5b287fade7814/sense2vec-1.0.0a0.tar.gz (311kB)
    100% |████████████████████████████████| 317kB 1.5MB/s
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "", line 1, in
      File "C:\Users...\AppData\Local\Temp\pip-install-5_i77ipi\sense2vec\setup.py", line 169, in
        setup_package()
      File "C:\Users...\AppData\Local\Temp\pip-install-5_i77ipi\sense2vec\setup.py", line 107, in setup_package
        readme = f.read()
      File "c:\users...\anaconda3\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 13089: character maps to

    ----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in C:\Users...\AppData\Local\Temp\pip-install-5_i77ipi\sense2vec`
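
The traceback points at a README being read through the cp1252 codec, which has no character for byte 0x9d. The sketch below reproduces that class of failure in isolation and shows that passing the encoding explicitly avoids it. It assumes the file is UTF-8; the file name and contents are made up for the example, and whether this is the change that was applied to setup.py is not stated in the issue.

```python
import tempfile
from pathlib import Path

# A README containing a right double quotation mark (U+201D), whose
# UTF-8 encoding ends in the byte 0x9d.
readme = Path(tempfile.mkdtemp()) / "README.md"
readme.write_bytes("sense2vec: \u201cmost similar\u201d queries".encode("utf8"))

try:
    # Mimics the failing read: cp1252 cannot decode byte 0x9d.
    readme.read_text(encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Passing the encoding explicitly avoids relying on the platform default.
text = readme.read_text(encoding="utf8")
print(cp1252_failed, text)
```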
install
opened by leo-gan 8
Train sense2vec in Chinese

I am trying to train sense2vec on a Chinese Wikipedia corpus, but ran into a problem: The 'noun_chunks' syntax iterator is not implemented for language 'zh'. Does anyone know how to deal with this? How could I write the labels in the noun_chunks function? How can I find the labels I need?

opened by JingxinLee 1
different results on the demo than through the spacy script

Hello, when using the demo with the word "Height", the sense "auto" and the year "2019", the results are amazing, while the results when running through spaCy are completely different.

Example:

import spacy
nlp_spacy = spacy.load("en_core_web_sm")
s2v = nlp_spacy.add_pipe("sense2vec")
s2v.from_disk("s2v_reddit_2019_lg")

doc = nlp_spacy("Height")
assert doc[:].text == "Height"
freq = doc[:]._.s2v_freq
vector = doc[:]._.s2v_vec
most_similar = doc[:]._.s2v_most_similar(3)

Why is that? and how can I get the same results as in the demo? Thanks.

opened by shanihadar 0

Get most similar senses on multiple senses.

Is there a way to get the most similar senses for each item in an array of senses?

I believe the current solution averages over multiple senses. For example, I want to get the most similar senses for each of two or more senses, i.e.

from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("./s2v_reddit_2015_md")
query = ["bot|NOUN", "think|VERB" ]
# gives averaged senses
s2v.most_similar(query, n=10)
[('sup(This|ADV', 0.7275), ('idontbelieveyou.gif|NOUN', 0.7244), ('TLDR|NOUN', 0.6689), ('Original_Post|NOUN', 0.6167)]

# But I want to get separate most similar senses for each sense. Like this.
[[('sup(This|ADV', 0.7137), ('Original_Post|NOUN', 0.6189), ('TLDR|NOUN', 0.6129)], [('but|CONJ', 0.9187), ('obviously|ADV', 0.9084), ('honestly|ADV', 0.9006)]]
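
One way to get a separate result list per sense is to query each key on its own instead of passing the whole list at once. The sketch below does this with a hypothetical helper, `most_similar_each`, that only relies on `key in s2v` and `s2v.most_similar(key, n=...)` as used earlier in this document. `FakeS2V` and its scores are placeholders so that the example runs without downloaded vectors.

```python
def most_similar_each(s2v, keys, n=3):
    """Query each key on its own instead of averaging them."""
    results = []
    for key in keys:
        # Keys missing from the vectors table get an empty result.
        if key not in s2v:
            results.append([])
            continue
        results.append(s2v.most_similar(key, n=n))
    return results


class FakeS2V:
    """Tiny stand-in with placeholder neighbors and scores."""

    neighbors = {
        "bot|NOUN": [("TLDR|NOUN", 0.61)],
        "think|VERB": [("honestly|ADV", 0.90)],
    }

    def __contains__(self, key):
        return key in self.neighbors

    def most_similar(self, key, n=10):
        return self.neighbors[key][:n]


per_sense = most_similar_each(FakeS2V(), ["bot|NOUN", "think|VERB", "missing|X"])
print(per_sense)
```

With real vectors, the same helper would be called as `most_similar_each(s2v, query, n=3)` on a loaded `Sense2Vec` object.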

opened by mirfan899 2

Is there any way to use "doc.spans" in 01_parse.py?

Hi, I am trying to build a sense2vec model with new data. I have made a few changes in 01_parse.py. First, I removed the default ner pipe that comes with "en_core_web_lg". Then I added a new Language.component in which I identify Spans associated with new entities (new labels) in a doc. Sometimes I would like to assign a Span[x, y] to more than one entity, but I cannot. My question: I have read about the new changes in spaCy v3.1. Is there a way to use "doc.spans" (or something similar) in 01_parse.py so that spaCy's internal algorithms take overlapping Spans into account?

@Language.component("name_comp")
def my_component(doc):
    matches = matcher(doc)
    seen_tokens = set()
    new_entities = []
    entities = doc.ents
    for match_id, start, end in matches:
        # check for end - 1 here because boundaries are inclusive
        if start not in seen_tokens and end - 1 not in seen_tokens:
            new_entities.append(Span(doc, start, end, label=match_id))
            entities = [
                e for e in entities if not (e.start < end and e.end > start)
            ]
            seen_tokens.update(range(start, end))
    doc.ents = tuple(entities) + tuple(new_entities)
    return doc

Thanks in advance, Paula

opened by nonstoprunning 0

Plug sense2vec into your spaCy pipeline

I want to add my own sense2vec vectors to my own spaCy model, as described in the documentation.

I added the following to my current pipeline config:

[initialize.components]

[initialize.components.sense2vec]
data_path = "/path/to/s2v"

Then I load the model and add the component:

nlp = spacy.load("../data/ModelV05b/model-best")
nlp.add_pipe("sense2vec")
s2v.from_disk("../data/S2VFasttextV04")

This does not work and fails with:

[E090] Extension '_s2v' already exists on Doc. To overwrite the existing extension, set `force=True` on `Doc.set_extension`.

This is because sense2vec is already in nlp.component_names:

['tok2vec',
 'tagger',
 'parser',
 'ner',
 'attribute_ruler',
 'lemmatizer',
 'sense2vec']

Then I changed the code to load only my model:

nlp = spacy.load("../data/ModelV05b/model-best")

It still does not work. The following code:

doc = nlp2("The testimony of the ages confirms that the motions of the planets are orbicular.")
assert doc[1:2].text == "testimony"
freq = doc[1:2]._.s2v_freq
vector = doc[1:2]._.s2v_vec
most_similar = doc[1:2]._.s2v_most_similar(3)

fails with:

AttributeError: 'NoneType' object has no attribute 'get_freq'

opened by myeghaneh 2

Releases(v2.0.1)

v2.0.1(Dec 8, 2022)
In the sense2vec.teach prodigy recipe: only fail if no seeds are available.

Extend support for wasabi to v1.1.x.

Source code(tar.gz)
Source code(zip)
v2.0.0(Feb 7, 2021)
Update component and internals for spaCy v3.

Source code(tar.gz)
Source code(zip)
v1.0.3(Feb 7, 2021)
Various small fixes and improvements.

Improve training scripts.

Fix issue #102: split binary .spacy files.

Fix issue #118: Fix typo in s2v_other_senses.

Thanks to @ahalterman, @dshefman1 and @Anxo06 for the pull requests!
Source code(tar.gz)
Source code(zip)
v1.0.2(Nov 22, 2019)
🔴 Bug fixes

Add defaults for config if attributes are not included in saved model.

Fix serialization and deserialization of string store in component.

Source code(tar.gz)
Source code(zip)
v1.0.1(Nov 22, 2019)
🔴 Bug fixes

Fix bug that'd cause the scores to not be read correctly from precomputed most_similar caches.

Source code(tar.gz)
Source code(zip)
v1.0.0(Nov 22, 2019)
✨ New features and improvements

Completely rewrite package from scratch.

Replace built-in vector storage with spaCy's Vectors, making this package a pure Python package and allowing easy out-of-the-box serialization of vectors.

Add fully serializable spaCy pipeline component and extension attributes.

Add new methods get_best_sense and get_other_senses and improve most_similar.

Add script for precomputing index of nearest neighbors for super fast "most similar" queries.

Add annotation recipes for Prodigy to easily create word lists and match patterns from similar phrases using sense2vec vectors (like the terms.teach recipe, just with multi-word expressions).

New and more efficient training and preprocessing scripts using GloVe and fastText.

⚠️ Backwards incompatibilities

The sense2vec.load method has been removed. Use Sense2Vec.from_disk instead.

The previous VectorMap and VectorStorage have been removed.

This package now requires Python 3.6+.

This update requires a new vectors format (see attached files).

📖 Documentation and examples

Rewrite README from scratch and include full API docs.

👥 Contributors

Thanks to @kabirkhan for contributing the initial Prodigy recipes!
Source code(tar.gz)
Source code(zip)
s2v_reddit_2015_md.tar.gz(572.62 MB)
s2v_reddit_2019_lg.tar.gz.001(1430.51 MB)
s2v_reddit_2019_lg.tar.gz.002(1430.51 MB)
s2v_reddit_2019_lg.tar.gz.003(736.51 MB)
v1.0.0a9(Nov 21, 2019)

Source code(tar.gz)
Source code(zip)
v1.0.0a10(Nov 21, 2019)

Source code(tar.gz)
Source code(zip)
v1.0.0a8(Nov 19, 2019)

Source code(tar.gz)
Source code(zip)
v1.0.0a7(Nov 19, 2019)

Source code(tar.gz)
Source code(zip)
v1.0.0a6(Nov 3, 2019)

Source code(tar.gz)
Source code(zip)
v1.0.0a5(Nov 2, 2019)

Source code(tar.gz)
Source code(zip)
v1.0.0a4(Nov 2, 2019)

Source code(tar.gz)
Source code(zip)
v1.0.0a3(Nov 2, 2019)

Source code(tar.gz)
Source code(zip)
v1.0.0a2(Oct 31, 2019)
⚠️ This is an alpha release and not yet ready for production. You can download sense2vec via pip by specifying the exact version.

pip install sense2vec==1.0.0a2

The converted Reddit vectors (trained on all comments of 2015) are attached to this release as a .tar.gz file. For more details and usage instructions, see the README.

✨ New features and improvements

Completely rewrite package from scratch.

Replace built-in vector storage with spaCy's Vectors, making this package a pure Python package and allowing easy out-of-the-box serialization of vectors.

Add fully serializable spaCy pipeline component and extension attributes.

Add new methods get_best_sense and get_other_senses and improve most_similar.

Add annotation recipes for Prodigy to easily create word lists and match patterns from similar phrases using sense2vec vectors (like the terms.teach recipe, just with multi-word expressions).

New and more efficient training and preprocessing scripts using GloVe.

⚠️ Backwards incompatibilities

The sense2vec.load method has been removed. Use Sense2Vec.from_disk instead.

The previous VectorMap and VectorStorage have been removed.

This package now requires Python 3.6+.

This update requires a new vectors format (see attached .tar.gz).

📖 Documentation and examples

Rewrite README from scratch and include full API docs.

👥 Contributors

Thanks to @kabirkhan for contributing the Prodigy recipes!
Source code(tar.gz)
Source code(zip)
sense2vec-vectors.zip(572.73 MB)
v1.0.0a1(Sep 12, 2019)
⚠️ This is an alpha release and not yet ready for production. You can download sense2vec via pip by specifying the exact version.

pip install sense2vec==1.0.0a1

Note that the library doesn't depend on spaCy anymore, so you might have to install spaCy and the English model separately. The Reddit vectors (trained on all comments of 2015) are attached to this release as a .tar.gz file. For more details and usage instructions, see the README.

✨ New features and improvements

NEW: Remove spaCy dependency and allow standalone use of the sense2vec library.

NEW: Include spaCy v2.x pipeline component to add sense2vec-compatible token merging and token attributes and methods.

Attach reddit_vectors model to release and make it easier to download and load in models.

📖 Documentation and examples

Rewrite README from scratch and include full API docs.

🚧 Todo

[ ] Replace VectorMap implementation with spaCy's Vectors class.

[ ] Don't merge tokens at runtime and adjust extension attributes accordingly.

[ ] Update training and pre-processing scripts for spaCy v2.x.

[ ] Retrain vectors on more data.

Source code(tar.gz)
Source code(zip)
v1.0.0a0(Apr 8, 2018)
⚠️ This is an alpha release and not yet ready for production. You can download sense2vec via pip by specifying the exact version.

pip install sense2vec==1.0.0a0

Note that the library doesn't depend on spaCy anymore, so you might have to install spaCy and the English model separately. The Reddit vectors (trained on all comments of 2015) are attached to this release as a .tar.gz file. For more details and usage instructions, see the README.

✨ New features and improvements

NEW: Remove spaCy dependency and allow standalone use of the sense2vec library.

NEW: Include spaCy v2.x pipeline component to add sense2vec-compatible token merging and token attributes and methods.

Attach reddit_vectors model to release and make it easier to download and load in models.

📖 Documentation and examples

Rewrite README from scratch and include full API docs.

🚧 Todo

[ ] Update training and pre-processing scripts for spaCy v2.x.

reddit_vectors-1.1.0.tar.gz(560.46 MB)
