Unsupervised text tokenizer for Neural Network-based text generation.


SentencePiece


SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabulary size is predetermined prior to the neural model training. SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]) with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.

This is not an official Google product.

Technical highlights

  • Purely data driven: SentencePiece trains tokenization and detokenization models from sentences. Pre-tokenization (Moses tokenizer/MeCab/KyTea) is not always required.
  • Language independent: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.
  • Multiple subword algorithms: BPE [Sennrich et al.] and unigram language model [Kudo.] are supported.
  • Subword regularization: SentencePiece implements subword sampling for subword regularization and BPE-dropout which help to improve the robustness and accuracy of NMT models.
  • Fast and lightweight: Segmentation speed is around 50k sentences/sec, and memory footprint is around 6MB.
  • Self-contained: The same tokenization/detokenization is obtained as long as the same model file is used.
  • Direct vocabulary id generation: SentencePiece manages vocabulary to id mapping and can directly generate vocabulary id sequences from raw sentences.
  • NFKC-based normalization: SentencePiece performs NFKC-based text normalization.

For those unfamiliar with SentencePiece as a software/algorithm, one can read a gentle introduction here.

Comparisons with other implementations

| Feature | SentencePiece | subword-nmt | WordPiece |
|---|---|---|---|
| Supported algorithm | BPE, unigram, char, word | BPE | BPE* |
| OSS? | Yes | Yes | Google internal |
| Subword regularization | Yes | No | No |
| Python Library (pip) | Yes | No | N/A |
| C++ Library | Yes | No | N/A |
| Pre-segmentation required? | No | Yes | Yes |
| Customizable normalization (e.g., NFKC) | Yes | No | N/A |
| Direct id generation | Yes | No | N/A |

Note that the BPE algorithm used in WordPiece is slightly different from the original BPE.

Overview

What is SentencePiece?

SentencePiece is a re-implementation of sub-word units, an effective way to alleviate the open vocabulary problems in neural machine translation. SentencePiece supports two segmentation algorithms, byte-pair-encoding (BPE) [Sennrich et al.] and unigram language model [Kudo.]. Here are the high-level differences from other implementations.

The number of unique tokens is predetermined

Neural Machine Translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms, which assume an infinite vocabulary, SentencePiece trains the segmentation model such that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.

Note that SentencePiece specifies the final vocabulary size for training, which is different from subword-nmt, which uses the number of merge operations. The number of merge operations is a BPE-specific parameter and is not applicable to other segmentation algorithms, including unigram, word, and character.

Trains from raw sentences

Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but it makes the preprocessing complicated as we have to run language-dependent tokenizers in advance. The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese and Japanese, where no explicit spaces exist between words.

Whitespace is treated as a basic symbol

The first step of natural language processing is text tokenization. For example, a standard English tokenizer would segment the text "Hello world." into the following three tokens.

[Hello] [World] [.]

One observation is that the original input and the tokenized sequence are NOT reversibly convertible. For instance, the information that there is no space between “World” and “.” is dropped from the tokenized sequence, since, e.g., Tokenize(“World.”) == Tokenize(“World .”).

SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows.

Hello▁World.

Then, this text is segmented into small pieces, for example:

[Hello] [▁Wor] [ld] [.]

Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.

  detokenized = ''.join(pieces).replace('▁', ' ')

This feature makes it possible to perform detokenization without relying on language-specific resources.
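
The same round trip can be checked with the Python API. The sketch below assumes a trained model file named spm.model (a placeholder); the exact pieces depend on that model.

  import sentencepiece as spm

  # Load a trained model; 'spm.model' is a placeholder path for illustration.
  sp = spm.SentencePieceProcessor(model_file='spm.model')

  pieces = sp.encode('Hello World.', out_type=str)  # e.g. ['▁Hello', '▁World', '.']
  restored = sp.decode(pieces)                      # pieces are joined and '▁' becomes a space

  # Lossless detokenization (assuming NFKC normalization leaves this string unchanged).
  assert restored == 'Hello World.'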

Note that we cannot apply the same lossless conversions when splitting the sentence with standard word segmenters, since they treat the whitespace as a special symbol. Tokenized sequences do not preserve the necessary information to restore the original sentence.

  • (en) Hello world. → [Hello] [World] [.] (A space between Hello and World)
  • (ja) こんにちは世界。 → [こんにちは] [世界] [。] (No space between こんにちは and 世界)

Subword regularization and BPE-dropout

Subword regularization [Kudo.] and BPE-dropout [Provilkov et al.] are simple regularization methods that virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as robustness of NMT models.

To enable subword regularization, you need to integrate the SentencePiece library (C++/Python) into the NMT system to sample one segmentation for each parameter update, which is different from the standard off-line data preparation. Here is an example with the Python library. You can see that 'New York' is segmented differently on each SampleEncode (C++) or encode with enable_sampling=True (Python) call. The details of the sampling parameters are found in sentencepiece_processor.h.

>>> import sentencepiece as spm
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
...     s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
['▁', 'New', '▁York']
['▁', 'New', '▁York']

Installation

Python module

SentencePiece provides a Python wrapper that supports both SentencePiece training and segmentation. You can install the Python binary package of SentencePiece with:

% pip install sentencepiece

For more details, see the Python module documentation.
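
After installation, a quick sanity check can be run from Python (a minimal sketch; recent releases expose a package version attribute).

  import sentencepiece as spm

  # Confirm that the module loads and print the installed version.
  print(spm.__version__)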

Build and install SentencePiece command line tools from C++ source

The following tools and libraries are required to build SentencePiece:

  • cmake
  • C++11 compiler
  • gperftools library (optional, 10-40% performance improvement can be obtained.)

On Ubuntu, the build tools can be installed with apt-get:

% sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

Then, you can build and install command line tools as follows.

% git clone https://github.com/google/sentencepiece.git 
% cd sentencepiece
% mkdir build
% cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v

On OSX/macOS, replace the last command with sudo update_dyld_shared_cache

Build and install using vcpkg

You can download and install sentencepiece using the vcpkg dependency manager:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install sentencepiece

The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

Usage instructions

Train SentencePiece Model

% spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>
  • --input: one-sentence-per-line raw corpus file. No need to run tokenizer, normalizer or preprocessor. By default, SentencePiece normalizes the input with Unicode NFKC. You can pass a comma-separated list of files.
  • --model_prefix: output model name prefix. <model_name>.model and <model_name>.vocab are generated.
  • --vocab_size: vocabulary size, e.g., 8000, 16000, or 32000
  • --character_coverage: amount of characters covered by the model. Good defaults are 0.9995 for languages with a rich character set like Japanese or Chinese, and 1.0 for other languages with a small character set.
  • --model_type: model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using word type.

Use --help flag to display all parameters for training, or see here for an overview.
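
Training can also be invoked from Python with the same parameters. The sketch below is equivalent to the spm_train command above; corpus.txt is a placeholder for a one-sentence-per-line corpus.

  import sentencepiece as spm

  # Writes m.model and m.vocab, mirroring the spm_train flags shown above.
  spm.SentencePieceTrainer.train(
      input='corpus.txt',        # placeholder corpus path
      model_prefix='m',
      vocab_size=8000,
      character_coverage=1.0,
      model_type='unigram',
  )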

Encode raw text into sentence pieces/ids

% spm_encode --model=<model_file> --output_format=piece < input > output
% spm_encode --model=<model_file> --output_format=id < input > output

Use --extra_options flag to insert the BOS/EOS markers or reverse the input sequence.

% spm_encode --extra_options=eos (add </s> only)
% spm_encode --extra_options=bos:eos (add <s> and </s>)
% spm_encode --extra_options=reverse:bos:eos (reverse input and add <s> and </s>)

SentencePiece supports nbest segmentation and segmentation sampling with --output_format=(nbest|sample)_(piece|id) flags.

% spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input > output
% spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input > output
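
The same piece/id encodings are available from the Python wrapper (a minimal sketch; m.model is a placeholder model file and the outputs depend on it).

  import sentencepiece as spm

  sp = spm.SentencePieceProcessor(model_file='m.model')  # placeholder model path

  # Mirrors --output_format=piece and --output_format=id.
  print(sp.encode('I saw a girl with a telescope.', out_type=str))
  print(sp.encode('I saw a girl with a telescope.', out_type=int))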

Decode sentence pieces/ids into raw text

% spm_decode --model=<model_file> --input_format=piece < input > output
% spm_decode --model=<model_file> --input_format=id < input > output

Use --extra_options flag to decode the text in reverse order.

% spm_decode --extra_options=reverse < input > output
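
Decoding works the same way from Python; decode accepts either pieces or ids (a minimal sketch; the ids below are illustrative and model-dependent).

  import sentencepiece as spm

  sp = spm.SentencePieceProcessor(model_file='m.model')  # placeholder model path

  # Either form reconstructs the original text.
  print(sp.decode(['▁I', '▁saw', '▁a', '▁girl']))
  print(sp.decode([9, 459, 11, 939]))  # illustrative ids only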

End-to-End Example

% spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000
unigram_model_trainer.cc(494) LOG(INFO) Starts training with :
input: "../data/botchan.txt"
... <snip>
unigram_model_trainer.cc(529) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4973 num_tokens=37630 num_tokens/piece=34.2091
trainer_interface.cc(272) LOG(INFO) Saving model: m.model
trainer_interface.cc(281) LOG(INFO) Saving vocabs: m.vocab

% echo "I saw a girl with a telescope." | spm_encode --model=m.model
▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .

% echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
9 459 11 939 44 11 4 142 82 8 28 21 132 6

% echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
I saw a girl with a telescope.

You can find that the original input sentence is restored from the vocabulary id sequence.

Export vocabulary list

% spm_export_vocab --model=<model_file> --output=<output file>

<output file> stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.
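
The same id-to-piece mapping and scores can be inspected programmatically (a minimal sketch; m.model is a placeholder model file).

  import sentencepiece as spm

  sp = spm.SentencePieceProcessor(model_file='m.model')  # placeholder model path

  # Vocabulary id i corresponds to line i of the file written by spm_export_vocab.
  for i in range(min(10, sp.get_piece_size())):
      print(i, sp.id_to_piece(i), sp.get_score(i))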

Redefine special meta tokens

By default, SentencePiece uses Unknown (<unk>), BOS (<s>) and EOS (</s>) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in the training phase as follows.

% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...

When an id is set to -1, e.g., --bos_id=-1, the corresponding special token is disabled. Note that the unknown id cannot be disabled. We can define an id for padding (<pad>) with --pad_id=3.

If you want to assign other special tokens, please see Use custom symbols.
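
The same remapping can be done from the Python trainer and verified on the loaded model (a minimal sketch; corpus.txt and the model prefix m are placeholders).

  import sentencepiece as spm

  # Remap the special token ids at training time; corpus.txt is a placeholder path.
  spm.SentencePieceTrainer.train(
      input='corpus.txt', model_prefix='m', vocab_size=8000,
      bos_id=0, eos_id=1, unk_id=5, pad_id=3,
  )

  sp = spm.SentencePieceProcessor(model_file='m.model')
  print(sp.bos_id(), sp.eos_id(), sp.unk_id(), sp.pad_id())  # expected: 0 1 5 3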

Vocabulary restriction

spm_encode accepts a --vocabulary and a --vocabulary_threshold option so that spm_encode will only produce symbols which also appear in the vocabulary (with at least some frequency). The background of this feature is described in the subword-nmt page.

The usage is basically the same as that of subword-nmt. Assuming that L1 and L2 are the two languages (source/target languages), train the shared spm model, and get resulting vocabulary for each:

% cat {train_file}.L1 {train_file}.L2 | shuffle > train
% spm_train --input=train --model_prefix=spm --vocab_size=8000 --character_coverage=0.9995
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2

The shuffle command is used just in case, because spm_train loads only the first 10M lines of the corpus by default.

Then segment the train/test corpus with the --vocabulary option:

% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2

Advanced topics

Comments
  • Pip install sentencepiece failure

    Hi, pip install sentencepiece fails. This is the log I get:

    pip install sentencepiece 7.4.0
    Collecting sentencepiece
      Using cached https://files.pythonhosted.org/packages/fd/45/6d0eb609d5cd81df094aab71a867b2ab6b315ffd592e78fb94a625c4d6aa/sentencepiece-0.1.3.tar.gz
    ERROR: Complete output from command python setup.py egg_info:
    ERROR: /bin/sh: 1: pkg-config: not found
    Failed to find sentencepiece pkgconfig
    ----------------------------------------
    ERROR: Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-463tj_x8/sentencepiece/

    opened by saareliad 32
  • Compatibility with Tensorflow Serving

    Any idea how to best integrate the tensorflow op with tensorflow serving?

    Currently, if this is used to train, when the tensorflow Graph is exported to a servable and run with tensorflow serving, a runtime error will obviously occur.

    For example a model trained with this op trying to be loaded into tensorflow serving will result in:

    Loading servable: {name: xling } failed: Not Found: Op type not registered `SentencepieceEncodeSparse' in binary...
    
    opened by r-wheeler 31
  • pip install failed on linux cluster

    System Info: Linux version 4.14.0-115.7.1.el7a.ppc64le ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC))

    I tried both installing from PyPI and installing from source file, but neither of them worked.

    When installing from PyPI:

    $ pip install sentencepiece
    Collecting sentencepiece
      Using cached https://files.pythonhosted.org/packages/1b/87/c3c2fa8cbec61fffe031ca9f0da512747520bec9be7f886f748457daac31/sentencepiece-0.1.83.tar.gz
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-install-t33o0yz4/sentencepiece/setup.py", line 29, in <module>
            with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
          File "/opt/anaconda3/lib/python3.6/codecs.py", line 897, in open
            file = builtins.open(filename, mode, buffering)
        FileNotFoundError: [Errno 2] No such file or directory: '../VERSION'
    
        ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-t33o0yz4/sentencepiece/
    

    I then manually downloaded the tar.gz source file, uncompressed it, changed the directory to "./python", and tried to install directly from the setup.py:

    $ python setup.py install
    Package sentencepiece was not found in the pkg-config search path.
    Perhaps you should add the directory containing `sentencepiece.pc'
    to the PKG_CONFIG_PATH environment variable
    No package 'sentencepiece' found
    Failed to find sentencepiece pkgconfig
    

    However pip install . gives a different error message:

    $ pip install .
    Processing <...>/sentencepiece-0.1.83/python
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-req-build-209jgy5x/setup.py", line 29, in <module>
            with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
          File "/opt/anaconda3/lib/python3.6/codecs.py", line 897, in open
            file = builtins.open(filename, mode, buffering)
        FileNotFoundError: [Errno 2] No such file or directory: '../VERSION'
    
        ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-req-build-209jgy5x/
    

    Does anyone know what might be wrong and how to fix it? Thank you!

    execution environment 
    opened by wendywangwwt 24
  • undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs

    Hi, when I am trying to import "tf_sentencepiece", I am getting the following error:

     NotFoundError                             Traceback (most recent call last)
     in
         import tf_sentencepiece as tfs

     ~/.conda/envs/tf_gpu/lib/python3.6/site-packages/tf_sentencepiece/__init__.py
         from __future__ import print_function
         from tf_sentencepiece.sentencepiece_processor_ops import *

     ~/.conda/envs/tf_gpu/lib/python3.6/site-packages/tf_sentencepiece/sentencepiece_processor_ops.py
         _gen_sentencepiece_processor_op = tf.load_op_library(so_file)

     ~/.conda/envs/tf_gpu/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py in load_op_library(library_filename)
         RuntimeError: when unable to load the library or get the python wrappers.
         """
         lib_handle = py_tf.TF_LoadLibrary(library_filename)
         op_list_str = py_tf.TF_GetOpList(lib_handle)

     NotFoundError: /home/user/.conda/envs/tf_gpu/lib/python3.6/site-packages/tf_sentencepiece/_sentencepiece_processor_ops.so.1.12.0: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrESs

    Help me out in resolving this issue. Thanks in advance.

    opened by ramreddyyasa 21
  • Add Mac M1 Compatibility

    Hi,

    Like most Python libraries, SentencePiece won't install on the Mac M1 architecture... "A revolution in data science" they said... what a joke, every data science library is a real pain to install! Do you plan to make a compatible version of SentencePiece?

    Thank you!

    opened by pierreia 19
  • Issue in installing.

    Python 3.7.3 OS: Redhat

    I am getting following error message while installing:

    I already tried installing wheel but getting message:

    (tanveer) [ai_u@powcbds tanveer]$ pip install sentencepiece-0.1.85-cp38-cp38-manylinux1_i686.whl
    ERROR: sentencepiece-0.1.85-cp38-cp38-manylinux1_i686.whl is not a supported wheel on this platform.
    
    > Using cached sentencepiece-0.1.83.tar.gz (497 kB)
    >   ERROR: Command errored out with exit status 1:
    >    command: /power8nfs/home/ai_u/.conda/envs/tanveer/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-6kz16kgn/sentencepiece/setup.py'"'"'; __file__='"'"'/tmp/pip-install-6kz16kgn/sentencepiece/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-6kz16kgn/sentencepiece/pip-egg-info
    >        cwd: /tmp/pip-install-6kz16kgn/sentencepiece/
    >   Complete output (7 lines):
    >   Traceback (most recent call last):
    >     File "<string>", line 1, in <module>
    >     File "/tmp/pip-install-6kz16kgn/sentencepiece/setup.py", line 29, in <module>
    >       with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
    >     File "/power8nfs/home/ai_u/.conda/envs/tanveer/lib/python3.7/codecs.py", line 904, in open
    >       file = builtins.open(filename, mode, buffering)
    >   FileNotFoundError: [Errno 2] No such file or directory: '../VERSION'
    >   ----------------------------------------
    > ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    > 
    
    execution environment 
    opened by tkhan3 19
  • `sentencepiece==0.1.92` seems breaking something

    with newly released sentencepiece==0.1.92

    Python 3.6.9 (default, Nov  7 2019, 10:44:02)
    [GCC 8.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import transformers, torch
    >>> transformers.__version__
    '2.9.1'
    >>> torch.__version__
    '1.4.0'
    >>> torch.rand(3)
    Segmentation fault (core dumped)
    

    However, downgrade to sentencepiece==0.1.91 solves this issue

    opened by boy2000-007man 16
  • terminate called after throwing an instance of 'std::bad_alloc'

    I'm running a sentencepiece model and getting an std::bad_alloc error when I increase the training size from 5M to 10M sentences. (it works fine for 5M sentences). Here's how I'm calling the function:

    spm_train --input=input.txt --vocab_size=32000 --character_coverage=1.0
        --model_type=unigram --input_sentence_size=10000000 --num_threads=32
    

    here's the specific error:

    trainer_interface.cc(317) LOG(INFO) Sampled 10000000 sentences from 283087079 sentences.
    trainer_interface.cc(321) LOG(INFO) Skipped 209436 too long sentences.
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <unk>
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <s>
    trainer_interface.cc(330) LOG(INFO) Adding meta_piece: </s>
    trainer_interface.cc(335) LOG(INFO) Normalizing sentences...
    trainer_interface.cc(384) LOG(INFO) all chars count=3460742236
    trainer_interface.cc(392) LOG(INFO) Done: 100% characters are covered.
    trainer_interface.cc(402) LOG(INFO) Alphabet size=25
    trainer_interface.cc(403) LOG(INFO) Final character coverage=1
    trainer_interface.cc(435) LOG(INFO) Done! preprocessed 10000000 sentences.
    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc
    

    I've tried compiling SentencePiece with and without gperftools, and get the same error message. Compiled with gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16), in case that matters. (Edit: also tried a more recent gcc 8.2.0 with the same results.) I doubt that it's a RAM limitation, I'm running this on a pretty beefy compute node with 768 GB of memory, and watching memory utilization as the program is running (even at 5M input sentences) I never get close to maxing out. Any thoughts why I might be getting this error message?

    opened by pstjohn 15
  • FileNotFoundError: [Errno 2] No such file or directory: '..\\VERSION'

    Hi,

    I opened an issue relating to the pytorch-transformers library but was redirected here. For the sake of clarity here's all the relevant info:

    OS: Windows10 Python: 3.5.2. Error when trying pip install sentencepiece:

        ERROR: Command errored out with exit status 1:
         command: 'c:\users\pawel.lonca\appdata\local\programs\python\python35\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\PAWEL~1.LON\\AppData\\Local\\Temp\\pip-install-ibsvnyrj\\sentencepiece\\setup.py'"'"'; __file__='"'"'C:\\Users\\PAWEL~1.LON\\AppData\\Local\\Temp\\pip-install-ibsvnyrj\\sentencepiece\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
             cwd: C:\Users\PAWEL~1.LON\AppData\Local\Temp\pip-install-ibsvnyrj\sentencepiece\
        Complete output (7 lines):
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "C:\Users\PAWEL~1.LON\AppData\Local\Temp\pip-install-ibsvnyrj\sentencepiece\setup.py", line 29, in <module>
            with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
          File "c:\users\pawel.lonca\appdata\local\programs\python\python35\lib\codecs.py", line 895, in open
            file = builtins.open(filename, mode, buffering)
        FileNotFoundError: [Errno 2] No such file or directory: '..\\VERSION'
        ----------------------------------------
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    
    execution environment 
    opened by balkon16 14
  • Subword regularization on BPE models

    As described by @eric-haibin-lin in https://github.com/google/sentencepiece/issues/335, it is currently not possible to use SampleEncodeAs{Pieces,Ids} on a BPE model (it displays a model_interface.h(85) LOG(ERROR) Not implemented. error and returns an empty list).

    Do you plan to support it in the near future?

    (and thank you for this great tool BTW!)

    opened by nicolaspanel 13
  • Cannot install sentencepiece with Python 3.9 on Windows

    Currently adding Python 3.9 support for pytorch/text and ran into an issue installing sentencepiece for Python 3.9 on windows. (CircleCI logs)

      ERROR: Failed building wheel for sentencepiece
        ERROR: Command errored out with exit status 1:
         command: 'C:\Users\circleci\project\env\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\circleci\\AppData\\Local\\Temp\\pip-install-trvw9qva\\sentencepiece_6ae2202249f44bf5b7a3902ec8532c93\\setup.py'"'"'; __file__='"'"'C:\\Users\\circleci\\AppData\\Local\\Temp\\pip-install-trvw9qva\\sentencepiece_6ae2202249f44bf5b7a3902ec8532c93\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\circleci\AppData\Local\Temp\pip-record-xi27zjv8\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\circleci\project\env\Include\sentencepiece'
             cwd: C:\Users\circleci\AppData\Local\Temp\pip-install-trvw9qva\sentencepiece_6ae2202249f44bf5b7a3902ec8532c93\
        Complete output (20 lines):
        running install
        running build
        running build_py
        creating build
        creating build\lib.win-amd64-3.9
        creating build\lib.win-amd64-3.9\sentencepiece
        copying src\sentencepiece/__init__.py -> build\lib.win-amd64-3.9\sentencepiece
        copying src\sentencepiece/sentencepiece_model_pb2.py -> build\lib.win-amd64-3.9\sentencepiece
        copying src\sentencepiece/sentencepiece_pb2.py -> build\lib.win-amd64-3.9\sentencepiece
        running build_ext
        building 'sentencepiece._sentencepiece' extension
        creating build\temp.win-amd64-3.9
        creating build\temp.win-amd64-3.9\Release
        creating build\temp.win-amd64-3.9\Release\src
        creating build\temp.win-amd64-3.9\Release\src\sentencepiece
        C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\circleci\project\env\include -IC:\Users\circleci\project\env\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\ATLMFC\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include -IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /EHsc /Tpsrc/sentencepiece/sentencepiece_wrap.cxx /Fobuild\temp.win-amd64-3.9\Release\src/sentencepiece/sentencepiece_wrap.obj /MT /I..\build\root\include
        cl : Command line warning D9025 : overriding '/MD' with '/MT'
        sentencepiece_wrap.cxx
        src/sentencepiece/sentencepiece_wrap.cxx(2777): fatal error C1083: Cannot open include file: 'sentencepiece_processor.h': No such file or directory
        error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.27.29110\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
    

    This is a duplicate of #452, but no real solution to building from source seems to have come from that so I have opened a new issue

    Is there a workaround for getting this dependency?

    cc @taku910

    opened by seemethere 12
  • Training a BPE model w/ "identity" normalization rule doesn't add "\n" to the vocab

    Training a BPE model w/ the identity normalization rule doesn't add the newline character to the vocab:

    #!/bin/bash
    
    ../sentencepiece_upstream/build/src/spm_train \
      --input ../europarl-v7.de-en.en,../europarl-v7.de-en.de \
      --input_sentence_size 9999 \
      --model_prefix "bpe.joint" \
      --model_type "bpe" \
      --pad_id 3 \
      --pad_piece "<pad>" \
      --normalization_rule_name "identity" \
      --remove_extra_whitespaces 0
    

    This causes unks when encoding strings w/ \n:

    >>> import sentencepiece
    >>> x=sentencepiece.SentencePieceProcessor("bpe.joint.model")
    >>> x.encode_as_ids("asdf\nasdf\n", add_eos=True, add_bos=True)
    [1, 174, 7930, 7936, 0, 41, 7930, 7936, 0, 2]
    

    Without the identity normalization, newlines just get replaced with whitespace, for example:

    ../sentencepiece_upstream/build/src/spm_train \
      --input ../europarl-v7.de-en.en,../europarl-v7.de-en.de \
      --input_sentence_size 9999 \
      --model_prefix "bpe.joint" \
      --model_type "bpe" \
      --pad_id 3 \
      --pad_piece "<pad>" \
      --remove_extra_whitespaces 0
    [...]
    >>> x.encode_as_ids("asdf\nasdf\n", add_eos=True, add_bos=True)
    [1, 174, 7931, 7937, 174, 7931, 7937, 7921, 2]
    
    opened by pks 0
  • Not able to install sentencepiece on s390x machine

    Hi team, I'm not able to install sentencepiece on my s390x machine. Below is the error. Please help me out with this.

     pip install sentencepiece
     Collecting sentencepiece
       Downloading sentencepiece-0.1.97.tar.gz (524 kB)
          ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 524.7/524.7 kB 2.8 MB/s eta 0:00:00
       Preparing metadata (setup.py) ... done
     Building wheels for collected packages: sentencepiece
       Building wheel for sentencepiece (setup.py) ... error
       error: subprocess-exited-with-error

     × python setup.py bdist_wheel did not run successfully.
     │ exit code: 1
     ╰─> [161 lines of output]
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build/lib.linux-s390x-3.8
      creating build/lib.linux-s390x-3.8/sentencepiece
      copying src/sentencepiece/__init__.py -> build/lib.linux-s390x-3.8/sentencepiece
      copying src/sentencepiece/_version.py -> build/lib.linux-s390x-3.8/sentencepiece
      copying src/sentencepiece/sentencepiece_model_pb2.py -> build/lib.linux-s390x-3.8/sentencepiece
      copying src/sentencepiece/sentencepiece_pb2.py -> build/lib.linux-s390x-3.8/sentencepiece
      running build_ext
      Package sentencepiece was not found in the pkg-config search path.
      Perhaps you should add the directory containing `sentencepiece.pc'
      to the PKG_CONFIG_PATH environment variable
      Package 'sentencepiece', required by 'virtual:world', not found
      Cloning into 'sentencepiece'...
      Note: switching to '58f256cf6f01bb86e6fa634a5cc560de5bd1667d'.

      You are in 'detached HEAD' state. You can look around, make experimental
      changes and commit them, and you can discard any commits you make in this
      state without impacting any branches by switching back to a branch.
      
      If you want to create a new branch to retain commits you create, you may
      do so (now or later) by using -c with the switch command. Example:
      
        git switch -c <new-branch-name>
      
      Or undo this operation with:
      
        git switch -
      
      Turn off this advice by setting config variable advice.detachedHead to false
      
      -- VERSION: 0.1.97
      -- The C compiler identification is GNU 8.5.0
      -- The CXX compiler identification is GNU 8.5.0
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /usr/bin/cc - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Looking for pthread.h
      -- Looking for pthread.h - found
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
      -- Looking for pthread_create in pthreads
      -- Looking for pthread_create in pthreads - not found
      -- Looking for pthread_create in pthread
      -- Looking for pthread_create in pthread - found
      -- Found Threads: TRUE
      -- Not Found TCMalloc: TCMALLOC_LIB-NOTFOUND
      -- Configuring done
      -- Generating done
      -- Build files have been written to: /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/bundled
      [  1%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/builder.cc.o
      [  3%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/trainer_interface.cc.o
      [  4%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/unicode_script.cc.o
      [  8%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/unigram_model_trainer.cc.o
      [  8%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/word_model_trainer.cc.o
      [  9%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/char_model_trainer.cc.o
      [ 11%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/trainer_factory.cc.o
      [ 12%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/arena.cc.o
      [ 14%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/arenastring.cc.o
      [ 16%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/bpe_model_trainer.cc.o
      [ 17%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/bytestream.cc.o
      [ 19%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/sentencepiece_trainer.cc.o
      [ 20%] Building CXX object src/CMakeFiles/sentencepiece_train-static.dir/pretokenizer_for_training.cc.o
      [ 22%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/coded_stream.cc.o
      [ 24%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/common.cc.o
      [ 25%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/extension_set.cc.o
      [ 27%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/generated_enum_util.cc.o
      [ 29%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/generated_message_table_driven_lite.cc.o
      [ 30%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/generated_message_util.cc.o
      [ 32%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/implicit_weak_message.cc.o
      [ 33%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/int128.cc.o
      [ 35%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/io_win32.cc.o
      [ 37%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/message_lite.cc.o
      [ 38%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/parse_context.cc.o
      [ 40%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/repeated_field.cc.o
      [ 41%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/status.cc.o
      [ 43%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/statusor.cc.o
      [ 45%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/stringpiece.cc.o
      [ 46%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/stringprintf.cc.o
      [ 48%] Linking CXX static library libsentencepiece_train.a
      [ 50%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/structurally_valid.cc.o
      [ 50%] Built target sentencepiece_train-static
      [ 51%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/strutil.cc.o
      [ 53%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/time.cc.o
      [ 54%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/wire_format_lite.cc.o
      [ 56%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/zero_copy_stream.cc.o
      [ 58%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/zero_copy_stream_impl.cc.o
      [ 59%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/protobuf-lite/zero_copy_stream_impl_lite.cc.o
      [ 61%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/builtin_pb/sentencepiece.pb.cc.o
      [ 62%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/builtin_pb/sentencepiece_model.pb.cc.o
      [ 64%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/bpe_model.cc.o
      [ 66%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/char_model.cc.o
      [ 67%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/error.cc.o
      [ 69%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/filesystem.cc.o
      [ 70%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/model_factory.cc.o
      [ 72%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/model_interface.cc.o
      [ 74%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/normalizer.cc.o
      [ 75%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/sentencepiece_processor.cc.o
      [ 77%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/unigram_model.cc.o
      [ 79%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/util.cc.o
      [ 80%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/word_model.cc.o
      /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/sentencepiece/src/normalizer.cc: In member function ‘void sentencepiece::normalizer::Normalizer::Init()’:
      /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/sentencepiece/src/normalizer.cc:54:42: error: ‘precompiled_charsmap_buffer_’ was not declared in this scope
                                               &precompiled_charsmap_buffer_);
                                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
      [ 82%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/__/third_party/absl/flags/flag.cc.o
      gmake[2]: *** [src/CMakeFiles/sentencepiece-static.dir/build.make:552: src/CMakeFiles/sentencepiece-static.dir/normalizer.cc.o] Error 1
      gmake[2]: *** Waiting for unfinished jobs....
      gmake[1]: *** [CMakeFiles/Makefile2:207: src/CMakeFiles/sentencepiece-static.dir/all] Error 2
      gmake: *** [Makefile:156: all] Error 2
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/setup.py", line 136, in <module>
          setup(
        File "/usr/lib/python3.8/site-packages/setuptools/__init__.py", line 145, in setup
          return distutils.core.setup(**attrs)
        File "/usr/lib64/python3.8/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/usr/lib64/python3.8/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/usr/local/lib/python3.8/site-packages/wheel/bdist_wheel.py", line 290, in run
          self.run_command('build')
        File "/usr/lib64/python3.8/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/usr/lib64/python3.8/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/usr/lib64/python3.8/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/usr/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 84, in run
          _build_ext.run(self)
        File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
          _build_ext.build_ext.run(self)
        File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 340, in run
          self.build_extensions()
        File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
          _build_ext.build_ext.build_extensions(self)
        File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
          self._build_extensions_serial()
        File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
          self.build_extension(ext)
        File "/tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/setup.py", line 89, in build_extension
          subprocess.check_call(['./build_bundled.sh', __version__])
        File "/usr/lib64/python3.8/subprocess.py", line 364, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command '['./build_bundled.sh', '0.1.97']' returned non-zero exit status 2.
      [end of output]
    

    note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for sentencepiece Running setup.py clean for sentencepiece Failed to build sentencepiece Installing collected packages: sentencepiece Running setup.py install for sentencepiece ... error error: subprocess-exited-with-error

    × Running setup.py install for sentencepiece did not run successfully. │ exit code: 1 ╰─> [77 lines of output] running install running build running build_py creating build creating build/lib.linux-s390x-3.8 creating build/lib.linux-s390x-3.8/sentencepiece copying src/sentencepiece/init.py -> build/lib.linux-s390x-3.8/sentencepiece copying src/sentencepiece/version.py -> build/lib.linux-s390x-3.8/sentencepiece copying src/sentencepiece/sentencepiece_model_pb2.py -> build/lib.linux-s390x-3.8/sentencepiece copying src/sentencepiece/sentencepiece_pb2.py -> build/lib.linux-s390x-3.8/sentencepiece running build_ext Package sentencepiece was not found in the pkg-config search path. Perhaps you should add the directory containing `sentencepiece.pc' to the PKG_CONFIG_PATH environment variable Package 'sentencepiece', required by 'virtual:world', not found fatal: destination path 'sentencepiece' already exists and is not an empty directory. fatal: destination path 'sentencepiece' already exists and is not an empty directory. -- VERSION: 0.1.97 -- Not Found TCMalloc: TCMALLOC_LIB-NOTFOUND -- Configuring done -- Generating done -- Build files have been written to: /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/bundled Consolidate compiler generated dependencies of target sentencepiece_train-static [ 17%] Built target sentencepiece_train-static Consolidate compiler generated dependencies of target sentencepiece-static [ 19%] Building CXX object src/CMakeFiles/sentencepiece-static.dir/normalizer.cc.o /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/sentencepiece/src/normalizer.cc: In member function ‘void sentencepiece::normalizer::Normalizer::Init()’: /tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/sentencepiece/src/normalizer.cc:54:42: error: ‘precompiled_charsmap_buffer’ was not declared in this scope &precompiled_charsmap_buffer_); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ gmake[2]: *** [src/CMakeFiles/sentencepiece-static.dir/build.make:552: src/CMakeFiles/sentencepiece-static.dir/normalizer.cc.o] Error 1 gmake[1]: *** [CMakeFiles/Makefile2:207: src/CMakeFiles/sentencepiece-static.dir/all] Error 2 gmake: *** [Makefile:156: all] Error 2 Traceback (most recent call last): File "", line 2, in File "", line 34, in File "/tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/setup.py", line 136, in setup( File "/usr/lib/python3.8/site-packages/setuptools/init.py", line 145, in setup return distutils.core.setup(**attrs) File "/usr/lib64/python3.8/distutils/core.py", line 148, in setup dist.run_commands() File "/usr/lib64/python3.8/distutils/dist.py", line 966, in run_commands self.run_command(cmd) File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command cmd_obj.run() File "/usr/lib/python3.8/site-packages/setuptools/command/install.py", line 61, in run return orig.install.run(self) File "/usr/lib64/python3.8/distutils/command/install.py", line 556, in run self.run_command('build') File "/usr/lib64/python3.8/distutils/cmd.py", line 313, in run_command self.distribution.run_command(command) File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command cmd_obj.run() File "/usr/lib64/python3.8/distutils/command/build.py", line 135, in run self.run_command(cmd_name) File "/usr/lib64/python3.8/distutils/cmd.py", line 313, in run_command self.distribution.run_command(command) File "/usr/lib64/python3.8/distutils/dist.py", line 985, in run_command cmd_obj.run() File 
"/usr/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 84, in run _build_ext.run(self) File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run _build_ext.build_ext.run(self) File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 340, in run self.build_extensions() File "/usr/local/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions _build_ext.build_ext.build_extensions(self) File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 449, in build_extensions self._build_extensions_serial() File "/usr/lib64/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial self.build_extension(ext) File "/tmp/pip-install-5emffrl3/sentencepiece_0fc0e52c5f2b4bfea7160dcf5bae3daf/setup.py", line 89, in build_extension subprocess.check_call(['./build_bundled.sh', version]) File "/usr/lib64/python3.8/subprocess.py", line 364, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['./build_bundled.sh', '0.1.97']' returned non-zero exit status 2. [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip. error: legacy-install-failure

    × Encountered error while trying to install package. ╰─> sentencepiece

    note: This is an issue with the package mentioned above, not pip. hint: See above for output from the failure.

    opened by swagaths1 0
  • Is it allowed to rearrange index/id of each vocabulary?

    Thank you for reading my question. I would like to rearrange vocabulary ids and assign scores freely to any token. Here is the background.

    Background:

    Firstly, I want to manually add some tokens to a vocabulary that was trained with unigram model type. These tokens should allow other pieces to contain these tokens, so they are not user_defined_symbols. I want to manually assign them a score, so they can be sampled according to probability.

    Secondly, I want to align the trained vocabulary with the other vocabulary. The other vocabulary makes indexes for those tokens I mentioned before. I hope the indexes for the common tokens in both vocabularies are of the same values. The indexes of other vocabularies are assigned with numbers after the last common index.

    Could you please give me some advice about how to achieve this goal? Thank you

    opened by lsy641 0
  • tokens listed in user_defined_symbols tokenized as unknowns when using the "word" model_type

    When using model_type="word" as an argument to spm.SentencePieceTrainer.train, it seems that tokens listed in user_defined_symbols, for example user_defined_symbols=["<s>", "</s>", "."], are still encoded to the unk_id. Using BPE and char works.

    Is this intended for word models?

    opened by lintangsutawika 0
  • Cannot install sentencepiece with Python 3.11 on Windows

    Error alive again, Windows 10, Python 3.10.7

     Attempting uninstall: sentencepiece
        Found existing installation: sentencepiece 0.1.97
        Uninstalling sentencepiece-0.1.97:
          Successfully uninstalled sentencepiece-0.1.97
      Running setup.py install for sentencepiece ... error
      error: subprocess-exited-with-error
    
      × Running setup.py install for sentencepiece did not run successfully.
      │ exit code: 1
      ╰─> [24 lines of output]
          C:\Python310\lib\site-packages\setuptools\dist.py:771: UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead
            warnings.warn(
          running install
          C:\Python310\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
            warnings.warn(
          running build
          running build_py
          creating build
          creating build\lib.win-amd64-cpython-310
          creating build\lib.win-amd64-cpython-310\sentencepiece
          copying src\sentencepiece/__init__.py -> build\lib.win-amd64-cpython-310\sentencepiece
          copying src\sentencepiece/sentencepiece_model_pb2.py -> build\lib.win-amd64-cpython-310\sentencepiece
          copying src\sentencepiece/sentencepiece_pb2.py -> build\lib.win-amd64-cpython-310\sentencepiece
          running build_ext
          building 'sentencepiece._sentencepiece' extension
          creating build\temp.win-amd64-cpython-310
          creating build\temp.win-amd64-cpython-310\Release
          creating build\temp.win-amd64-cpython-310\Release\src
          creating build\temp.win-amd64-cpython-310\Release\src\sentencepiece
          "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Python310\include -IC:\Python310\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tpsrc/sentencepiece/sentencepiece_wrap.cxx /Fobuild\temp.win-amd64-cpython-310\Release\src/sentencepiece/sentencepiece_wrap.obj /MT /I..\build\root\include
          cl : Command line warning D9025 : overriding '/MD' with '/MT'
          sentencepiece_wrap.cxx
          src/sentencepiece/sentencepiece_wrap.cxx(2809): fatal error C1083: Cannot open include file: 'sentencepiece_processor.h': No such file or directory
          error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.29.30037\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
          [end of output]
    
      note: This error originates from a subprocess, and is likely not a problem with pip.
      Rolling back uninstall of sentencepiece
      Moving to c:\python310\lib\site-packages\sentencepiece-0.1.97.dist-info\
       from C:\Python310\Lib\site-packages\~entencepiece-0.1.97.dist-info
      Moving to c:\python310\lib\site-packages\sentencepiece\
       from C:\Python310\Lib\site-packages\~entencepiece
    error: legacy-install-failure
    
    × Encountered error while trying to install package.
    ╰─> sentencepiece
    
    note: This is an issue with the package mentioned above, not pip.
    hint: See above for output from the failure`
    
    Edit:
    This path: "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30037\bin\HostX86\x64\" exists, and cl.exe is there too.
    

    Originally posted by @cibernicola in https://github.com/google/sentencepiece/issues/591#issuecomment-1250851548

    opened by kbatsuren 1
  • Build with protobuf in system

    While using the protobuf library installed on the system (i.e., SPM_USE_BUILTIN_PROTOBUF=OFF, instead of third_party/protobuf-lite), a hard-coded header file inclusion causes an error.

    in init.h:21:

    #include "third_party/protobuf-lite/google/protobuf/message_lite.h"
    

    it should be

    #include "google/protobuf/message_lite.h"
    
    opened by acane77 1