Unsupervised text tokenizer for Neural Network-based text generation.

Google

Last update: Jan 1, 2023

Overview

SentencePiece

SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

This is not an official Google product.

Technical highlights

Purely data driven: SentencePiece trains tokenization and detokenization models from sentences. Pre-tokenization (Moses tokenizer/MeCab/KyTea) is not always required.
Language independent: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.
Multiple subword algorithms: BPE [Sennrich et al.] and unigram language model [Kudo.] are supported.
Subword regularization: SentencePiece implements subword sampling for subword regularization and BPE-dropout which help to improve the robustness and accuracy of NMT models.
Fast and lightweight: Segmentation speed is around 50k sentences/sec, and memory footprint is around 6MB.
Self-contained: The same tokenization/detokenization is obtained as long as the same model file is used.
Direct vocabulary id generation: SentencePiece manages vocabulary to id mapping and can directly generate vocabulary id sequences from raw sentences.
NFKC-based normalization: SentencePiece performs NFKC-based text normalization.
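
As a rough illustration of what NFKC normalization does to text, the sketch below uses Python's standard unicodedata module. This is an assumption made for illustration only: SentencePiece applies its own NFKC-based rule set, so its output may differ in some cases from the stock Unicode NFKC shown here. The helper name nfkc is hypothetical.

```python
import unicodedata


def nfkc(text):
    """Apply stock Unicode NFKC normalization (not SentencePiece's own rule set)."""
    return unicodedata.normalize("NFKC", text)


# Full-width letters, a ligature, and a circled digit are folded
# into their plain compatibility equivalents.
samples = ["Ｈｅｌｌｏ", "ﬁne", "①"]
for sample in samples:
    print(repr(sample), "->", repr(nfkc(sample)))
```

Normalizing before segmentation means that visually equivalent strings end up as the same character sequence, so they receive the same pieces.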

Readers unfamiliar with SentencePiece as software or as an algorithm can read a gentle introduction here.

Comparisons with other implementations

Feature	SentencePiece	subword-nmt	WordPiece
Supported algorithm	BPE, unigram, char, word	BPE	BPE*
OSS?	Yes	Yes	Google internal
Subword regularization	Yes	No	No
Python Library (pip)	Yes	No	N/A
C++ Library	Yes	No	N/A
Pre-segmentation required?	No	Yes	Yes
Customizable normalization (e.g., NFKC)	Yes	No	N/A
Direct id generation	Yes	No	N/A

* Note that the BPE algorithm used in WordPiece is slightly different from the original BPE.

Overview

What is SentencePiece?

SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problems in neural machine translation. SentencePiece supports two segmentation algorithms, byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]. Here are the high level differences from other implementations.

The number of unique tokens is predetermined

Neural Machine Translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms, which assume an infinite vocabulary, SentencePiece trains the segmentation model such that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.

Note that SentencePiece specifies the final vocabulary size for training, which is different from subword-nmt that uses the number of merge operations. The number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.
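
To make the relationship between a target vocabulary size and the number of BPE merge operations concrete, here is a toy BPE trainer. It is a minimal sketch under stated assumptions, not SentencePiece's or subword-nmt's implementation: the function names train_toy_bpe and merge_pair are hypothetical, words are treated independently, and special tokens and the whitespace symbol are ignored.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_toy_bpe(words, vocab_size):
    """Merge the most frequent adjacent pair until the vocabulary reaches `vocab_size`."""
    corpus = Counter(tuple(word) for word in words)
    vocab = {symbol for word in corpus for symbol in word}
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        # Tie-break deterministically: highest count, then lexicographic order.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab.add(best[0] + best[1])
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return vocab, merges


words = ["low", "low", "lower", "newest", "newest", "widest"]
vocab, merges = train_toy_bpe(words, vocab_size=13)
print(len(vocab), merges)
```

In this toy setting the corpus has 10 distinct characters, so asking for 13 symbols results in 3 merges. Specifying the final size directly gives the same kind of parameter for any algorithm, whereas a merge count only has meaning for BPE.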

Trains from raw sentences

Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but makes the preprocessing complicated as we have to run language dependent tokenizers in advance. The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese where no explicit spaces exist between words.

Whitespace is treated as a basic symbol

The first step of Natural Language Processing is text tokenization. For example, a standard English tokenizer would segment the text "Hello World." into the following three tokens.

[Hello] [World] [.]

One observation is that the original input and the tokenized sequence are NOT reversibly convertible. For instance, the information that there is no space between “World” and “.” is dropped from the tokenized sequence, since e.g., Tokenize(“World.”) == Tokenize(“World .”)
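
The loss of information can be reproduced with a few lines of Python. The tokenizer below is a hypothetical stand-in for a standard English tokenizer (a single regular expression), used only to show that two different inputs collapse to the same token sequence.

```python
import re


def naive_tokenize(text):
    """Split into runs of word characters and single punctuation marks; whitespace is discarded."""
    return re.findall(r"\w+|[^\w\s]", text)


print(naive_tokenize("Hello World."))
print(naive_tokenize("Hello World ."))
```

Both calls yield the same list, so nothing in the output tells a detokenizer whether a space preceded the period.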

SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows.

Hello▁World.

Then, this text is segmented into small pieces, for example:

[Hello] [▁Wor] [ld] [.]

Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.

  detokenized = ''.join(pieces).replace('▁', ' ')

This feature makes it possible to perform detokenization without relying on language-specific resources.
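
A minimal sketch of the round trip, in plain Python without the SentencePiece library: the helper names are hypothetical, the segmentation is written out by hand, and the real library handles further details that this sketch ignores.

```python
META = "\u2581"  # the meta symbol "▁" (U+2581)


def escape_whitespace(text):
    """Replace each space with the meta symbol so it survives segmentation."""
    return text.replace(" ", META)


def detokenize(pieces):
    """Concatenate the pieces and turn the meta symbol back into a space."""
    return "".join(pieces).replace(META, " ")


# A hand-written segmentation of the escaped text "Hello▁World."
pieces = ["Hello", META + "Wor", "ld", "."]
assert "".join(pieces) == escape_whitespace("Hello World.")
print(detokenize(pieces))
```

Because the pieces concatenate back to the escaped text exactly, any segmentation of it can be reversed with the same two string operations.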

Note that we cannot apply the same lossless conversions when splitting the sentence with standard word segmenters, since they treat the whitespace as a special symbol. Tokenized sequences do not preserve the necessary information to restore the original sentence.

(en) Hello World. → [Hello] [World] [.] (A space between Hello and World)
(ja) こんにちは世界。 → [こんにちは] [世界] [。] (No space between こんにちは and 世界)

Subword regularization and BPE-dropout

Subword regularization [Kudo.] and BPE-dropout [Provilkov et al.] are simple regularization methods that virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as the robustness of NMT models.

To enable subword regularization, integrate the SentencePiece library (C++/Python) into the NMT system so that it samples one segmentation for each parameter update, which differs from standard offline data preparation. Here is an example using the Python library. Note that 'New York' is segmented differently on each SampleEncode (C++) call or each encode call with enable_sampling=True (Python). The sampling parameters are documented in sentencepiece_processor.h.

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']
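
The idea behind the sampling can be sketched without the library. The toy below enumerates every segmentation of a short string under a made-up unigram vocabulary and draws one with probability proportional to P(segmentation) ** alpha. Everything here is an assumption for illustration: the vocabulary, its log probabilities, and the function names are hypothetical, and brute-force enumeration is only feasible for tiny inputs.

```python
import math
import random

# Hypothetical toy vocabulary with made-up log probabilities.
TOY_LOGPROBS = {
    "New": -2.0, "\u2581York": -2.5, "\u2581Y": -5.0,
    "N": -5.0, "e": -4.0, "w": -4.5, "o": -4.0, "r": -4.2, "k": -4.8,
}


def segmentations(text, vocab):
    """Yield every way to split `text` into pieces that are in `vocab`."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for rest in segmentations(text[end:], vocab):
                yield [head] + rest


def sample_segmentation(text, vocab, alpha, rng):
    """Draw one segmentation with probability proportional to P(segmentation) ** alpha."""
    candidates = list(segmentations(text, vocab))
    weights = [math.exp(alpha * sum(vocab[p] for p in seg)) for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
for _ in range(5):
    print(sample_segmentation("New\u2581York", TOY_LOGPROBS, alpha=0.1, rng=rng))
```

A small alpha flattens the distribution so that unlikely segmentations are drawn more often, while a large alpha concentrates the samples on the most probable one.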

Installation

Python module

SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation. You can install the Python binary package of SentencePiece with:

% pip install sentencepiece

For more detail, see Python module

Build and install SentencePiece command line tools from C++ source

The following tools and libraries are required to build SentencePiece:

cmake
C++11 compiler
gperftools library (optional, 10-40% performance improvement can be obtained.)

On Ubuntu, the build tools can be installed with apt-get:

% sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

Then, you can build and install command line tools as follows.

% git clone https://github.com/google/sentencepiece.git 
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v

On OSX/macOS, replace the last command with sudo update_dyld_shared_cache

Build and install using vcpkg

You can download and install sentencepiece using the vcpkg dependency manager:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install sentencepiece

The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

Usage instructions

Train SentencePiece Model

% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>

--input: one-sentence-per-line raw corpus file. No need to run tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes the input with Unicode NFKC. You can pass a comma-separated list of files.
--model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
--vocab_size: vocabulary size, e.g., 8000, 16000, or 32000
--character_coverage: amount of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set, like Japanese or Chinese, and 1.0 for other languages with a small character set.
--model_type: model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using word type.
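
To give an intuition for --character_coverage, the sketch below picks the most frequent characters of a corpus until they account for the requested share of all character occurrences. This illustrates the general idea only and is an assumption, not the library's actual implementation; the function name covered_alphabet is hypothetical.

```python
from collections import Counter


def covered_alphabet(sentences, coverage):
    """Return the most frequent characters needed to reach `coverage` of all character occurrences."""
    counts = Counter(ch for sentence in sentences for ch in sentence)
    total = sum(counts.values())
    kept, running = [], 0
    for ch, n in counts.most_common():
        if running / total >= coverage:
            break  # the requested share is already covered
        kept.append(ch)
        running += n
    return kept


corpus = ["aaaa", "aaab", "aaac"]
print(covered_alphabet(corpus, 0.8))
print(covered_alphabet(corpus, 1.0))
```

With a coverage below 1.0, very rare characters are left out of the alphabet, which keeps large character sets such as those of Japanese or Chinese from filling the vocabulary with symbols seen only a handful of times.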

Use --help flag to display all parameters for training, or see here for an overview.

Encode raw text into sentence pieces/ids

% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output

Use --extra_options flag to insert the BOS/EOS markers or reverse the input sequence.

% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)
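
The effect of these options on a piece sequence can be pictured with a small pure-Python sketch. The function apply_extra_options is hypothetical, and applying the colon-separated options from left to right is an assumption based on the examples above, not a description of the tool's internals.

```python
def apply_extra_options(pieces, options, bos="<s>", eos="</s>"):
    """Apply colon-separated options ("reverse", "bos", "eos") in the order given."""
    result = list(pieces)
    for option in options.split(":"):
        if option == "reverse":
            result.reverse()
        elif option == "bos":
            result.insert(0, bos)
        elif option == "eos":
            result.append(eos)
        elif option:
            raise ValueError(f"unknown option: {option}")
    return result


pieces = ["\u2581I", "\u2581saw"]
print(apply_extra_options(pieces, "eos"))
print(apply_extra_options(pieces, "bos:eos"))
print(apply_extra_options(pieces, "reverse:bos:eos"))
```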

SentencePiece supports nbest segmentation and segmentation sampling with --output_format=(nbest|sample)_(piece|id) flags.

% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output

Decode sentence pieces/ids into raw text

% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output

Use --extra_options flag to decode the text in reverse order.

% spm_decode --extra_options=reverse < input > output

End-to-End Example

% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.

You can find that the original input sentence is restored from the vocabulary id sequence.

Export vocabulary list

% spm_export_vocab --model=<model_file> --output=<output file>

<output file> stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.
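
A short sketch of reading such a file back into a lookup table. It assumes one tab-separated "piece, log probability" pair per line and zero-based line numbering for ids; both are assumptions for illustration, and the function name load_vocab and the sample lines are hypothetical.

```python
def load_vocab(lines):
    """Map each piece to (id, log probability); the id is the zero-based line index."""
    vocab = {}
    for index, line in enumerate(lines):
        piece, logprob = line.rstrip("\n").split("\t")
        vocab[piece] = (index, float(logprob))
    return vocab


# Made-up sample content; a real file would be read with open(path, encoding="utf-8").
sample = ["<unk>\t0", "<s>\t0", "</s>\t0", "\u2581a\t-3.5"]
vocab = load_vocab(sample)
print(vocab["\u2581a"])
```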

Redefine special meta tokens

By default, SentencePiece uses Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in the training phase as follows.

% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...

Setting an id to -1, e.g., bos_id=-1, disables that special token. Note that the unknown id cannot be disabled. We can define an id for padding (<pad>) as --pad_id=3.

If you want to assign other special tokens, please see Use custom symbols.

Vocabulary restriction

spm_encode accepts a --vocabulary and a --vocabulary_threshold option so that spm_encode will only produce symbols which also appear in the vocabulary (with at least some frequency). The background of this feature is described on the subword-nmt page.

The usage is basically the same as that of subword-nmt. Assuming that L1 and L2 are the two languages (source/target languages), train the shared spm model, and get resulting vocabulary for each:

% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2

The shuffle command is used just in case, because spm_train loads the first 10M lines of the corpus by default.

Then segment the train/test corpus with the --vocabulary option.

% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
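
The thresholding step can be pictured with the following sketch. It assumes the generated vocabulary file holds one tab-separated "piece, frequency" pair per line, and it uses a toy fallback that splits any disallowed piece into single characters. Both are assumptions for illustration; the library's actual re-segmentation may differ, and the function names are hypothetical.

```python
def load_restricted_vocab(lines, threshold):
    """Keep the pieces whose frequency is at least `threshold`."""
    allowed = set()
    for line in lines:
        piece, freq = line.rstrip("\n").split("\t")
        if int(freq) >= threshold:
            allowed.add(piece)
    return allowed


def restrict(pieces, allowed):
    """Split any piece outside `allowed` into single characters (toy fallback)."""
    out = []
    for piece in pieces:
        if piece in allowed:
            out.append(piece)
        else:
            out.extend(piece)
    return out


# Made-up frequencies for illustration.
vocab_lines = ["\u2581tele\t120", "scope\t7", "\u2581a\t900"]
allowed = load_restricted_vocab(vocab_lines, threshold=50)
print(restrict(["\u2581a", "\u2581tele", "scope"], allowed))
```

The point of the restriction is that a piece seen rarely, or never, in one language's training data is broken into smaller units the model has seen more often.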

Advanced topics

SentencePiece Experiments
SentencePieceProcessor C++ API
Use custom text normalization rules
Use custom symbols
Python Module
TensorFlow Module
[Segmentation and training algorithms in detail]

Comments

Pip install sentencepiece failure

Hi, pip install sentencepiece fails. This is the log I get:

pip install sentencepiece 7.4.0
Collecting sentencepiece
  Using cached https://files.pythonhosted.org/packages/fd/45/6d0eb609d5cd81df094aab71a867b2ab6b315ffd592e78fb94a625c4d6aa/sentencepiece-0.1.3.tar.gz
    ERROR: Complete output from command python setup.py egg_info:
    ERROR: /bin/sh: 1: pkg-config: not found
    Failed to find sentencepiece pkgconfig
    ----------------------------------------
ERROR: Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-463tj_x8/sentencepiece/

opened by saareliad 32

Compatibility with Tensorflow Serving

Any idea how to best integrate the tensorflow op with tensorflow serving?

Currently, if this op is used in training and the TensorFlow graph is then exported to a servable and run with TensorFlow Serving, a runtime error will occur.

For example, loading a model trained with this op into TensorFlow Serving results in:

Loading servable: {name: xling } failed: Not Found: Op type not registered `SentencepieceEncodeSparse' in binary...
opened by r-wheeler 31

pip install failed on linux cluster

System Info: Linux version 4.14.0-115.7.1.el7a.ppc64le ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC))

I tried both installing from PyPI and installing from source file, but neither of them worked.

When installing from PyPI:

$ pip install sentencepiece
Collecting sentencepiece
  Using cached https://files.pythonhosted.org/packages/1b/87/c3c2fa8cbec61fffe031ca9f0da512747520bec9be7f886f748457daac31/sentencepiece-0.1.83.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-t33o0yz4/sentencepiece/setup.py", line 29, in <module>
        with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
      File "/opt/anaconda3/lib/python3.6/codecs.py", line 897, in open
        file = builtins.open(filename, mode, buffering)
    FileNotFoundError: [Errno 2] No such file or directory: '../VERSION'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-t33o0yz4/sentencepiece/

I then manually downloaded the tar.gz source file, uncompressed it, changed the directory to "./python", and tried to install directly from the setup.py:

$ python setup.py install
Package sentencepiece was not found in the pkg-config search path.
Perhaps you should add the directory containing `sentencepiece.pc'
to the PKG_CONFIG_PATH environment variable
No package 'sentencepiece' found
Failed to find sentencepiece pkgconfig

However pip install . gives a different error message:

$ pip install .
Processing <...>/sentencepiece-0.1.83/python
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-209jgy5x/setup.py", line 29, in <module>
        with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
      File "/opt/anaconda3/lib/python3.6/codecs.py", line 897, in open
        file = builtins.open(filename, mode, buffering)
    FileNotFoundError: [Errno 2] No such file or directory: '../VERSION'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-req-build-209jgy5x/

Does anyone know what might be wrong and how to fix it? Thank you!

execution environment

opened by wendywangwwt 24

undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs

Hi, when I try to import "tf_sentencepiece", I get the following error:

NotFoundError Traceback (most recent call last) in import tf_sentencepiece as tfs

~/.conda/envs/tf_gpu/lib/python3.6/site-packages/tf_sentencepiece/__init__.py in
    from __future__ import print_function
    from tf_sentencepiece.sentencepiece_processor_ops import *

~/.conda/envs/tf_gpu/lib/python3.6/site-packages/tf_sentencepiece/sentencepiece_processor_ops.py in
    _gen_sentencepiece_processor_op = tf.load_op_library(so_file)

~/.conda/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py in load_op_library(library_filename)
    RuntimeError: when unable to load the library or get the python wrappers.
    """
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
    op_list_str = py_tf.TF_GetOpList(lib_handle)

NotFoundError: /home/user/.conda/envs/tf_gpu/lib/python3.6/site-packages/tf_sentencepiece/_sentencepiece_processor_ops.so.1.12.0: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs

Help me out in resolving this issue. Thanks in advance.

opened by ramreddyyasa 21

Add Mac M1 Compatibility

Hi,

Like most Python libraries, SentencePiece won't install on the Mac M1 architecture... "A revolution in data science" they said... what a joke, every data science library is a real pain to install! Do you plan to make a compatible version of SentencePiece?

Thank you!

opened by pierreia 19

Issue in installing.

Python 3.7.3 OS: Redhat

I am getting the following error message while installing.

I already tried installing the wheel, but I get this message:

(tanveer) [ai_u@powcbds tanveer]$ pip install sentencepiece-0.1.85-cp38-cp38-manylinux1_i686.whl
ERROR: sentencepiece-0.1.85-cp38-cp38-manylinux1_i686.whl is not a supported wheel on this platform.

> Using cached sentencepiece-0.1.83.tar.gz (497 kB)
>   ERROR: Command errored out with exit status 1:
>    command: /power8nfs/home/ai_u/.conda/envs/tanveer/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-6kz16kgn/sentencepiece/setup.py'"'"'; __file__='"'"'/tmp/pip-install-6kz16kgn/sentencepiece/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-6kz16kgn/sentencepiece/pip-egg-info
>        cwd: /tmp/pip-install-6kz16kgn/sentencepiece/
>   Complete output (7 lines):
>   Traceback (most recent call last):
>     File "<string>", line 1, in <module>
>     File "/tmp/pip-install-6kz16kgn/sentencepiece/setup.py", line 29, in <module>
>       with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
>     File "/power8nfs/home/ai_u/.conda/envs/tanveer/lib/python3.7/codecs.py", line 904, in open
>       file = builtins.open(filename, mode, buffering)
>   FileNotFoundError: [Errno 2] No such file or directory: '../VERSION'
>   ----------------------------------------
> ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
>

execution environment

opened by tkhan3 19

`sentencepiece==0.1.92` seems breaking something

With the newly released sentencepiece==0.1.92:

Python 3.6.9 (default, Nov  7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import transformers, torch
>>> transformers.__version__
'2.9.1'
>>> torch.__version__
'1.4.0'
>>> torch.rand(3)
Segmentation fault (core dumped)

However, downgrading to sentencepiece==0.1.91 solves this issue.

opened by boy2000-007man 16

terminate called after throwing an instance of 'std::bad_alloc'

I'm running a sentencepiece model and getting an std::bad_alloc error when I increase the training size from 5M to 10M sentences. (it works fine for 5M sentences). Here's how I'm calling the function:

spm_train --input=input.txt --vocab_size=32000 --character_coverage=1.0 \
    --model_type=unigram --input_sentence_size=10000000 --num_threads=32

here's the specific error:

trainer_interface.cc(317) LOG(INFO) Sampled 10000000 sentences from 283087079 sentences.
trainer_interface.cc(321) LOG(INFO) Skipped 209436 too long sentences.
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <s>
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(335) LOG(INFO) Normalizing sentences...
trainer_interface.cc(384) LOG(INFO) all chars count=3460742236
trainer_interface.cc(392) LOG(INFO) Done: 100% characters are covered.
trainer_interface.cc(402) LOG(INFO) Alphabet size=25
trainer_interface.cc(403) LOG(INFO) Final character coverage=1
trainer_interface.cc(435) LOG(INFO) Done! preprocessed 10000000 sentences.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

I've tried compiling SentencePiece with and without gperftools, and get the same error message. Compiled with gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16), in case that matters. (Edit: also tried a more recent gcc 8.2.0 with the same results.) I doubt that it's a RAM limitation, I'm running this on a pretty beefy compute node with 768 GB of memory, and watching memory utilization as the program is running (even at 5M input sentences) I never get close to maxing out. Any thoughts why I might be getting this error message?

opened by pstjohn 15

FileNotFoundError: [Errno 2] No such file or directory: '..\\VERSION'

Hi,

I opened an issue relating to the pytorch-transformers library but was redirected here. For the sake of clarity here's all the relevant info:

OS: Windows10 Python: 3.5.2. Error when trying pip install sentencepiece:

    ERROR: Command errored out with exit status 1:
     command: 'c:\users\pawel.lonca\appdata\local\programs\python\python35\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\PAWEL~1.LON\\AppData\\Local\\Temp\\pip-install-ibsvnyrj\\sentencepiece\\setup.py'"'"'; __file__='"'"'C:\\Users\\PAWEL~1.LON\\AppData\\Local\\Temp\\pip-install-ibsvnyrj\\sentencepiece\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
         cwd: C:\Users\PAWEL~1.LON\AppData\Local\Temp\pip-install-ibsvnyrj\sentencepiece\
    Complete output (7 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\PAWEL~1.LON\AppData\Local\Temp\pip-install-ibsvnyrj\sentencepiece\setup.py", line 29, in <module>
        with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
      File "c:\users\pawel.lonca\appdata\local\programs\python\python35\lib\codecs.py", line 895, in open
        file = builtins.open(filename, mode, buffering)
    FileNotFoundError: [Errno 2] No such file or directory: '..\\VERSION'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

execution environment

opened by balkon16 14

Subword regularization on BPE models

As described by @eric-haibin-lin in https://github.com/google/sentencepiece/issues/335 it is currently not possible to use SampleEncodeAsPieces, SampleEncodeAs{Pieces,Ids} on a BPE model (displays model_interface.h(85) LOG(ERROR) Not implemented. error and returns an empty list).

Do you plan to support it in the near future?

(and thank you for this great tool BTW!)

opened by nicolaspanel 13

Cannot install sentencepiece with Python 3.9 on Windows

Currently adding Python 3.9 support for pytorch/text and ran into an issue installing sentencepiece for Python 3.9 on windows. (CircleCI logs)

  ERROR: Failed building wheel for sentencepiece
    ERROR: Command errored out with exit status 1:
     command: 'C:\Users\circleci\project\env\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\circleci\\AppData\\Local\\Temp\\pip-install-trvw9qva\\sentencepiece_6ae2202249f44bf5b7a3902ec8532c93\\setup.py'"'"'; __file__='"'"'C:\\Users\\circleci\\AppData\\Local\\Temp\\pip-install-trvw9qva\\sentencepiece_6ae2202249f44bf5b7a3902ec8532c93\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\circleci\AppData\Local\Temp\pip-record-xi27zjv8\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\circleci\project\env\Include\sentencepiece'
         cwd: C:\Users\circleci\AppData\Local\Temp\pip-install-trvw9qva\sentencepiece_6ae2202249f44bf5b7a3902ec8532c93\
    Complete output (20 lines):
    running install
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.9
    creating build\lib.win-amd64-3.9\sentencepiece
    copying src\sentencepiece/__init__.py -> build\lib.win-amd64-3.9\sentencepiece
    copying src\sentencepiece/sentencepiece_model_pb2.py -> build\lib.win-amd64-3.9\sentencepiece
    copying src\sentencepiece/sentencepiece_pb2.py -> build\lib.win-amd64-3.9\sentencepiece
    running build_ext
    building 'sentencepiece._sentencepiece' extension
    creating build\temp.win-amd64-3.9
    creating build\temp.win-amd64-3.9\Release
    creating build\temp.win-amd64-3.9\Release\src
    creating build\temp.win-amd64-3.9\Release\src\sentencepiece
    C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\circleci\project\env\include -IC:\Users\circleci\project\env\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include -IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /EHsc /Tpsrc/sentencepiece/sentencepiece_wrap.cxx /Fobuild\temp.win-amd64-3.9\Release\src/sentencepiece/sentencepiece_wrap.obj /MT /I..\build\root\include
    cl : Command line warning D9025 : overriding '/MD' with '/MT'
    sentencepiece_wrap.cxx
    src/sentencepiece/sentencepiece_wrap.cxx(2777): fatal error C1083: Cannot open include file: 'sentencepiece_processor.h': No such file or directory
    error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2

This is a duplicate of #452, but no real solution to building from source seems to have come from that, so I have opened a new issue.

Is there a workaround for getting this dependency?

cc @taku910

opened by seemethere 12

Training a BPE model w/ "identity" normalization rule doesn't add "\n" to the vocab

Training a BPE model w/ the identity normalization rule doesn't add the newline character to the vocab:

#!/bin/bash

../sentencepiece_upstream/build/src/spm_train \
  --input ../europarl-v7.de-en.en,../europarl-v7.de-en.de \
  --input_sentence_size 9999 \
  --model_prefix "bpe.joint" \
  --model_type "bpe" \
  --pad_id 3 \
  --pad_piece "<pad>" \
  --normalization_rule_name "identity" \
  --remove_extra_whitespaces 0

This causes unks when encoding strings w/ \n:

>>> import sentencepiece
>>> x=sentencepiece.SentencePieceProcessor("bpe.joint.model")
>>> x.encode_as_ids("asdf\nasdf\n", add_eos=True, add_bos=True)
[1, 174, 7930, 7936, 0, 41, 7930, 7936, 0, 2]

Without the identity normalization, newlines just get replaced with whitespace, for example:

../sentencepiece_upstream/build/src/spm_train \
  --input ../europarl-v7.de-en.en,../europarl-v7.de-en.de \
  --input_sentence_size 9999 \
  --model_prefix "bpe.joint" \
  --model_type "bpe" \
  --pad_id 3 \
  --pad_piece "<pad>" \
  --remove_extra_whitespaces 0
[...]
>>> x.encode_as_ids("asdf\nasdf\n", add_eos=True, add_bos=True)
[1, 174, 7931, 7937, 174, 7931, 7937, 7921, 2]

opened by pks 0

Not able to install sentencepiece on s390x machine

Hi team, I'm not able to install sentencepiece on my s390x machine. Below is the error. Please help me out with this.

pip install sentencepiece
Collecting sentencepiece
  Downloading sentencepiece-0.1.97.tar.gz (524 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 524.7/524.7 kB 2.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentencepiece
  Building wheel for sentencepiece (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [161 lines of output]
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-s390x-3.8
  creating build/lib.linux-s390x-3.8/sentencepiece
  copying src/sentencepiece/__init__.py -> build/lib.linux-s390x-3.8/sentencepiece
  copying src/sentencepiece/_version.py -> build/lib.linux-s390x-3.8/sentencepiece
  copying src/sentencepiece/sentencepiece_model_pb2.py -> build/lib.linux-s390x-3.8/sentencepiece
  copying src/sentencepiece/sentencepiece_pb2.py -> build/lib.linux-s390x-3.8/sentencepiece
  running build_ext
  Package sentencepiece was not found in the pkg-config search path.
  Perhaps you should add the directory containing `sentencepiece.pc'
  to the PKG_CONFIG_PATH environment variable
  Package 'sentencepiece', required by 'virtual:world', not found
  Cloning into 'sentencepiece'...
  Note: switching to '58f256cf6f01bb86e6fa634a5cc560de5bd1667d'.

  You are in 'detached HEAD' state. You can look around, make experimental
  changes and commit them, and you can discard any commits you make in this
  state without impacting any branches by switching back to a branch.
  
  If you want to create a new branch to retain commits you create, you may
  do so (now or later) by using -c with the switch command. Example:
  
    git switch -c <new-branch-name>
  
  Or undo this operation with:
  
    git switch -
  
  Turn off this advice by setting config variable advice.detachedHead to false
  
  -- VERSION: 0.1.97
  -- The C compiler identification is GNU 8.5.0
  -- The CXX compiler identification is GNU 8.5.0
  -- Detecting C compiler ABI info
  -- Detecting C compiler ABI info - done
  -- Check for working C compiler: /usr/bin/cc - skipped
  -- Detecting C compile features
  -- Detecting C compile features - done
  -- Detecting CXX compiler ABI info
  -- Detecting CXX compiler ABI info - done
  -- Check for working CXX compiler: /usr/bin/c++ - skipped
  -- Detecting CXX compile features
  -- Detecting CXX compile features - done
  -- Looking for pthread.h
  -- Looking for pthread.h - found
  -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
  -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
  -- Looking for pthread_create in pthreads
  -- Looking for pthread_create in pthreads - not found
  -- Looking for pthread_create in pthread
  -- Looking for pthread_create in pthread - found
  -- Found Threads: TRUE
  -- Not Found TCMalloc: TCMALLOC_LIB-NOTFOUND
  -- Configuring done
  -- Generating done
  -- Build files have been written to: /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/bundled
  [  1%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/builder.cc.o
  [  3%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/trainer_interface.cc.o
  [  4%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/unicode_script.cc.o
  [  8%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/unigram_model_trainer.cc.o
  [  8%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/word_model_trainer.cc.o
  [  9%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/char_model_trainer.cc.o
  [ 11%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/trainer_factory.cc.o
  [ 12%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/arena.cc.o
  [ 14%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/arenastring.cc.o
  [ 16%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/bpe_model_trainer.cc.o
  [ 17%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/bytestream.cc.o
  [ 19%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/sentencepiece_trainer.cc.o
  [ 20%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/pretokenizer_for_training.cc.o
  [ 22%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/coded_stream.cc.o
  [ 24%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/common.cc.o
  [ 25%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/extension_set.cc.o
  [ 27%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/generated_enum_util.cc.o
  [ 29%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/generated_message_table_driven_lite.cc.o
  [ 30%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/generated_message_util.cc.o
  [ 32%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/implicit_weak_message.cc.o
  [ 33%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/int128.cc.o
  [ 35%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/io_win32.cc.o
  [ 37%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/message_lite.cc.o
  [ 38%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/parse_context.cc.o
  [ 40%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/repeated_field.cc.o
  [ 41%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/status.cc.o
  [ 43%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/statusor.cc.o
  [ 45%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/stringpiece.cc.o
  [ 46%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/stringprintf.cc.o
  [ 48%] Linking CXX static library libsentencepiece_train.a
  [ 50%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/structurally_valid.cc.o
  [ 50%] Built target sentencepiece_train-static
  [ 51%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/strutil.cc.o
  [ 53%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/time.cc.o
  [ 54%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/wire_format_lite.cc.o
  [ 56%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/zero_copy_stream.cc.o
  [ 58%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/zero_copy_stream_impl.cc.o
  [ 59%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/zero_copy_stream_impl_lite.cc.o
  [ 61%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/builtin_pb/sentencepiece.pb.cc.o
  [ 62%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/builtin_pb/sentencepiece_model.pb.cc.o
  [ 64%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/bpe_model.cc.o
  [ 66%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/char_model.cc.o
  [ 67%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/error.cc.o
  [ 69%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/filesystem.cc.o
  [ 70%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/model_factory.cc.o
  [ 72%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/model_interface.cc.o
  [ 74%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/normalizer.cc.o
  [ 75%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/sentencepiece_processor.cc.o
  [ 77%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/unigram_model.cc.o
  [ 79%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/util.cc.o
  [ 80%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/word_model.cc.o
  /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/sentencepiece/src/normalizer.cc: In member function ‘void sentencepiece::normalizer::Normalizer::Init()’:
  /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/sentencepiece/src/normalizer.cc:54:42: error: ‘precompiled_charsmap_buffer_’ was not declared in this scope
                                           &precompiled_charsmap_buffer_);
                                            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
  [ 82%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/absl/flags/flag.cc.o
  gmake[2]: *** [src/CMakeFiles/sentencepiece-static.dir/build.make:552: src/CMakeFiles/sentencepiece-static.dir/normalizer.cc.o] Error 1
  gmake[2]: *** Waiting for unfinished jobs....
  gmake[1]: *** [CMakeFiles/Makefile2:207: src/CMakeFiles/sentencepiece-static.dir/all] Error 2
  gmake: *** [Makefile:156: all] Error 2
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/setup.py", line 136, in <module>
      setup(
    File "/usr/lib/python3.8/site-packages/setuptools/__init__.py", line 145, in setup
      return distutils.core.setup(**attrs)
    File "/usr/lib64/python3.8/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/usr/lib64/python3.8/distutils/dist.py", line 966, in run_commands
      self.run_command(cmd)
    File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/local/lib/python3.8/site-packages/wheel/bdist_wheel.py", line 290, in run
      self.run_command('build')
    File "/usr/lib64/python3.8/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/lib64/python3.8/distutils/command/build.py", line 135, in run
      self.run_command(cmd_name)
    File "/usr/lib64/python3.8/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 84, in run
      _build_ext.run(self)
    File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
      _build_ext.build_ext.run(self)
    File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 340, in run
      self.build_extensions()
    File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
      _build_ext.build_ext.build_extensions(self)
    File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
      self._build_extensions_serial()
    File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
      self.build_extension(ext)
    File "/tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/setup.py", line 89, in build_extension
      subprocess.check_call(['./build_bundled.sh', __version__])
    File "/usr/lib64/python3.8/subprocess.py", line 364, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['./build_bundled.sh', '0.1.97']' returned non-zero exit status 2.
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for sentencepiece
Running setup.py clean for sentencepiece
Failed to build sentencepiece
Installing collected packages: sentencepiece
Running setup.py install for sentencepiece ... error
error: subprocess-exited-with-error

× Running setup.py install for sentencepiece did not run successfully.
│ exit code: 1
╰─> [77 lines of output]
  running install
  running build
  running build_py
  creating build
  creating build/lib.linux-s390x-3.8
  creating build/lib.linux-s390x-3.8/sentencepiece
  copying src/sentencepiece/__init__.py -> build/lib.linux-s390x-3.8/sentencepiece
  copying src/sentencepiece/_version.py -> build/lib.linux-s390x-3.8/sentencepiece
  copying src/sentencepiece/sentencepiece_model_pb2.py -> build/lib.linux-s390x-3.8/sentencepiece
  copying src/sentencepiece/sentencepiece_pb2.py -> build/lib.linux-s390x-3.8/sentencepiece
  running build_ext
  Package sentencepiece was not found in the pkg-config search path.
  Perhaps you should add the directory containing `sentencepiece.pc'
  to the PKG_CONFIG_PATH environment variable
  Package 'sentencepiece', required by 'virtual:world', not found
  fatal: destination path 'sentencepiece' already exists and is not an empty directory.
  fatal: destination path 'sentencepiece' already exists and is not an empty directory.
  -- VERSION: 0.1.97
  -- Not Found TCMalloc: TCMALLOC_LIB-NOTFOUND
  -- Configuring done
  -- Generating done
  -- Build files have been written to: /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/bundled
  Consolidate compiler generated dependencies of target sentencepiece_train-static
  [ 17%] Built target sentencepiece_train-static
  Consolidate compiler generated dependencies of target sentencepiece-static
  [ 19%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/normalizer.cc.o
  /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/sentencepiece/src/normalizer.cc: In member function ‘void sentencepiece::normalizer::Normalizer::Init()’:
  /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/sentencepiece/src/normalizer.cc:54:42: error: ‘precompiled_charsmap_buffer_’ was not declared in this scope
                                           &precompiled_charsmap_buffer_);
                                            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
  gmake[2]: *** [src/CMakeFiles/sentencepiece-static.dir/build.make:552: src/CMakeFiles/sentencepiece-static.dir/normalizer.cc.o] Error 1
  gmake[1]: *** [CMakeFiles/Makefile2:207: src/CMakeFiles/sentencepiece-static.dir/all] Error 2
  gmake: *** [Makefile:156: all] Error 2
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/setup.py", line 136, in <module>
      setup(
    File "/usr/lib/python3.8/site-packages/setuptools/__init__.py", line 145, in setup
      return distutils.core.setup(**attrs)
    File "/usr/lib64/python3.8/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/usr/lib64/python3.8/distutils/dist.py", line 966, in run_commands
      self.run_command(cmd)
    File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/lib/python3.8/site-packages/setuptools/command/install.py", line 61, in run
      return orig.install.run(self)
    File "/usr/lib64/python3.8/distutils/command/install.py", line 556, in run
      self.run_command('build')
    File "/usr/lib64/python3.8/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/lib64/python3.8/distutils/command/build.py", line 135, in run
      self.run_command(cmd_name)
    File "/usr/lib64/python3.8/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
      cmd_obj.run()
    File "/usr/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 84, in run
      _build_ext.run(self)
    File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
      _build_ext.build_ext.run(self)
    File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 340, in run
      self.build_extensions()
    File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
      _build_ext.build_ext.build_extensions(self)
    File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
      self._build_extensions_serial()
    File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
      self.build_extension(ext)
    File "/tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/setup.py", line 89, in build_extension
      subprocess.check_call(['./build_bundled.sh', __version__])
    File "/usr/lib64/python3.8/subprocess.py", line 364, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['./build_bundled.sh', '0.1.97']' returned non-zero exit status 2.
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> sentencepiece

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

opened by swagaths1 0

Is it allowed to rearrange index/id of each vocabulary?

Thank you for reading my question. I need to rearrange vocabulary ids and assign scores freely to any token. Here is the background.

Background:

First, I want to manually add some tokens to a vocabulary that was trained with the unigram model type. Other pieces should be allowed to contain these tokens, so they cannot be user_defined_symbols. I also want to assign each of them a score manually, so that they can be sampled according to probability.

Second, I want to align the trained vocabulary with another vocabulary, which already assigns indexes to the tokens mentioned above. Tokens common to both vocabularies should have the same index in each. The remaining tokens should be numbered after the last common index.

Could you please give me some advice on how to achieve this? Thank you.
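
The id alignment asked about here can be sketched independently of SentencePiece itself. The following is a minimal sketch in plain Python, assuming the trained vocabulary is available as a list of (piece, score) pairs and the other vocabulary as a token-to-index dictionary. The names `align_vocab`, `trained`, and `reference_ids` are hypothetical, and the sketch only builds an external id mapping; it does not modify the model file, and whether SentencePiece itself supports such reordering is exactly the open question.

```python
from typing import Dict, List, Tuple


def align_vocab(
    trained: List[Tuple[str, float]],
    reference_ids: Dict[str, int],
) -> Dict[str, int]:
    """Return a token -> id mapping in which shared tokens reuse the reference ids."""
    aligned: Dict[str, int] = {}
    # Tokens present in both vocabularies keep the id from the reference vocabulary.
    for piece, _score in trained:
        if piece in reference_ids:
            aligned[piece] = reference_ids[piece]
    # Remaining tokens are numbered after the last shared id, in their trained order.
    next_id = max(aligned.values(), default=-1) + 1
    for piece, _score in trained:
        if piece not in aligned:
            aligned[piece] = next_id
            next_id += 1
    return aligned


# Hypothetical example data, not taken from a real model.
trained = [("<unk>", 0.0), ("▁the", -2.1), ("ing", -3.4), ("▁cat", -5.0)]
reference_ids = {"ing": 0, "▁the": 1, "▁dog": 2}
mapping = align_vocab(trained, reference_ids)
print(mapping)
```

A mapping like this could be applied as a translation layer on top of the ids the tokenizer produces, which leaves the trained model and its scores untouched.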

opened by lsy641 0
tokens listed in user_defined_symbols tokenized as unknowns when using the "word" model_type

When using model_type="word" as an argument to spm.SentencePieceTrainer.train, tokens listed in user_defined_symbols, for example user_defined_symbols=["<s>", "</s>", "."], still seem to be encoded to the unk_id. The BPE and char model types work as expected.

Is this intended for word models?

opened by lintangsutawika 0

Cannot install sentencepiece with Python 3.11 on Windows

The error is back on Windows 10 with Python 3.10.7.

 Attempting uninstall: sentencepiece
    Found existing installation: sentencepiece 0.1.97
    Uninstalling sentencepiece-0.1.97:
      Successfully uninstalled sentencepiece-0.1.97
  Running setup.py install for sentencepiece ... error
  error: subprocess-exited-with-error

  × Running setup.py install for sentencepiece did not run successfully.
  │ exit code: 1
  ╰─> [24 lines of output]
      C:\Python310\lib\site-packages\setuptools\dist.py:771: UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead
        warnings.warn(
      running install
      C:\Python310\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
        warnings.warn(
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-cpython-310
      creating build\lib.win-amd64-cpython-310\sentencepiece
      copying src\sentencepiece/__init__.py -> build\lib.win-amd64-cpython-310\sentencepiece
      copying src\sentencepiece/sentencepiece_model_pb2.py -> build\lib.win-amd64-cpython-310\sentencepiece
      copying src\sentencepiece/sentencepiece_pb2.py -> build\lib.win-amd64-cpython-310\sentencepiece
      running build_ext
      building 'sentencepiece._sentencepiece' extension
      creating build\temp.win-amd64-cpython-310
      creating build\temp.win-amd64-cpython-310\Release
      creating build\temp.win-amd64-cpython-310\Release\src
      creating build\temp.win-amd64-cpython-310\Release\src\sentencepiece
      "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Python310\include -IC:\Python310\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tpsrc/sentencepiece/sentencepiece_wrap.cxx /Fobuild\temp.win-amd64-cpython-310\Release\src/sentencepiece/sentencepiece_wrap.obj /MT /I..\build\root\include
      cl : Command line warning D9025 : overriding '/MD' with '/MT'
      sentencepiece_wrap.cxx
      src/sentencepiece/sentencepiece_wrap.cxx(2809): fatal error C1083: Cannot open include file: 'sentencepiece_processor.h': No such file or directory
      error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.29.30037\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Rolling back uninstall of sentencepiece
  Moving to c:\python310\lib\site-packages\sentencepiece-0.1.97.dist-info\
   from C:\Python310\Lib\site-packages\~entencepiece-0.1.97.dist-info
  Moving to c:\python310\lib\site-packages\sentencepiece\
   from C:\Python310\Lib\site-packages\~entencepiece
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> sentencepiece

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

Edit:
This path: "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\" exists, and cl.exe is there too.
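
Since the compiler itself is found, the C1083 error points at the include search path rather than at cl.exe. As a rough diagnostic sketch, the snippet below checks which of the directories passed via /I or -I actually contain the missing header. The helper name `find_header` is hypothetical, the directory list is abbreviated from the log, and the relative path is resolved against the current working directory, which is an assumption about where the build was run.

```python
from pathlib import Path
from typing import Iterable, List


def find_header(header: str, include_dirs: Iterable[str]) -> List[Path]:
    """Return the include directories that actually contain `header`."""
    return [Path(d) for d in include_dirs if (Path(d) / header).is_file()]


# Directories taken from the /I and -I flags in the log (abbreviated).
include_dirs = [r"..\build\root\include", r"C:\Python310\include"]
hits = find_header("sentencepiece_processor.h", include_dirs)
print(hits if hits else "sentencepiece_processor.h not found in any include directory")
```

If no directory is reported, the header was never placed where the build expects it, for example under the `..\build\root\include` path shown in the compile command.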

Originally posted by @cibernicola in https://github.com/google/sentencepiece/issues/591#issuecomment-1250851548

opened by kbatsuren 1

Build with protobuf in system
When building against the system protobuf library (i.e., SPM_USE_BUILTIN_PROTOBUF=OFF, instead of third_party/protobuf-lite), a hard-coded header include causes an error.

In init.h:21:

#include "third_party/protobuf-lite/google/protobuf/message_lite.h"

It should be:

#include "google/protobuf/message_lite.h"
opened by acane77 1