pkuseg: a toolkit for multi-domain Chinese word segmentation

Overview

pkuseg: a multi-domain Chinese word segmentation toolkit

pkuseg is a toolkit based on the paper [Luo et al., 2019]. It is simple to use, supports domain-specific segmentation, and improves segmentation accuracy.

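Highlights

pkuseg has the following features:

  1. Multi-domain segmentation. Unlike earlier general-purpose Chinese segmentation tools, this toolkit also aims to provide tailored pre-trained models for data from different domains. Users are free to choose a model according to the domain of the text to be segmented. We currently provide pre-trained segmentation models for the news, web, medicine, and tourism domains, as well as a mixed-domain model. If you know the domain of your text, load the corresponding model. If you cannot determine the domain, we recommend the general model trained on mixed-domain data. Segmentation samples for each domain are available in example.txt.
  2. Higher segmentation accuracy. Compared with other segmentation toolkits, pkuseg achieves higher accuracy when trained and tested on the same data.
  3. User-trained models. Users can train models on entirely new annotated data.
  4. Part-of-speech (POS) tagging.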

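Build and installation

  • Only Python 3 is currently supported.
  • For the best accuracy and speed, we strongly recommend updating to the latest version with pip install.
  1. Install from PyPI (model files included):

    pip3 install pkuseg
    

    Then use it with import pkuseg.

    We recommend updating to the latest version for a better out-of-the-box experience:

    pip3 install -U pkuseg
    
  2. If downloads from the official PyPI index are slow, use a mirror, for example:
    First-time installation:

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg
    

    Update:

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg
    
  3. If you download the code from GitHub instead of installing with pip, run the following command to build it:

    python setup.py build_ext -i
    

    The code on GitHub does not include pre-trained models, so you need to download or train a model yourself; the pre-trained models are listed under release. When using pkuseg, set "model_name" to the model file.

Note: installation methods 1 and 2 currently support only 64-bit Python 3 on Linux (Ubuntu), macOS, and Windows. On other systems, use method 3 to build and install locally.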

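Performance comparison with other segmentation toolkits

We compare pkuseg with jieba and THULAC, two representative Chinese segmentation toolkits; see the experimental setup for the detailed settings.

Domain-specific training and test results

The results on the different datasets are as follows: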

MSRA      Precision  Recall  F-score
jieba     87.01      89.88   88.42
THULAC    95.60      95.91   95.71
pkuseg    96.94      96.81   96.88

WEIBO     Precision  Recall  F-score
jieba     87.79      87.54   87.66
THULAC    93.40      92.40   92.87
pkuseg    93.78      94.65   94.21
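
The F-score column can be checked against the Precision and Recall columns. The sketch below assumes the usual balanced F-score (the harmonic mean of precision and recall); the source does not state the formula, and f_score is a helper name introduced here for illustration. Because the table values are already rounded, recomputing other rows may differ from the table in the last digit.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall (assumed formula)."""
    return 2 * precision * recall / (precision + recall)


# jieba on MSRA, using the precision and recall values from the table above.
print(round(f_score(87.01, 89.88), 2))
```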

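Out-of-the-box results of the default models across domains

Many users try a segmentation toolkit with the model that ships with it. To compare this "initial" performance directly, we also tested the default model of each toolkit on different domains. Note that this comparison only illustrates default behavior and is not necessarily fair.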

Default   MSRA   CTB8   PKU    WEIBO  All Average
jieba     81.45  79.58  81.83  83.56  81.61
THULAC    85.55  87.84  92.29  86.65  88.08
pkuseg    87.29  91.77  92.68  93.43  91.29

Here, All Average is the mean F-score over all test sets.
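
As a quick check of that definition, the sketch below recomputes the All Average column as the plain arithmetic mean of the four per-dataset F-scores in the table. The dictionary and the all_average helper are names introduced here for illustration, not part of pkuseg.

```python
# Per-dataset F-scores (MSRA, CTB8, PKU, WEIBO) copied from the table above.
default_f_scores = {
    "jieba": [81.45, 79.58, 81.83, 83.56],
    "THULAC": [85.55, 87.84, 92.29, 86.65],
    "pkuseg": [87.29, 91.77, 92.68, 93.43],
}


def all_average(scores):
    """Arithmetic mean of the per-dataset F-scores."""
    return sum(scores) / len(scores)


for name, scores in default_f_scores.items():
    print(f"{name}: {all_average(scores):.2f}")
```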

For more detailed comparisons, see the comparison with existing toolkits.

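Usage

Code examples

The following code examples are written for the Python interactive environment.

Example 1: segmentation with the default configuration (if you cannot determine the domain, the default model is recommended)

import pkuseg

seg = pkuseg.pkuseg()           # load the model with the default configuration
text = seg.cut('我爱北京天安门')  # segment the text
print(text)

Example 2: domain-specific segmentation (if you know the domain, the domain-specific model is recommended)

import pkuseg

seg = pkuseg.pkuseg(model_name='medicine')  # the matching domain model is downloaded automatically
text = seg.cut('我爱北京天安门')              # segment the text
print(text)

Example 3: segmentation with POS tagging; the meaning of each tag is described in tags.txt

import pkuseg

seg = pkuseg.pkuseg(postag=True)  # enable POS tagging
text = seg.cut('我爱北京天安门')    # segment and tag the text
print(text)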
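
With postag=True, seg.cut returns a list of (word, tag) pairs, as the interactive session quoted in the Comments section shows. The sketch below filters such a list by tag. words_with_tag is a hypothetical helper, not part of pkuseg, and the tagged list is copied from that session rather than produced by calling the library here.

```python
def words_with_tag(tagged_words, tag):
    """Return the words whose POS tag equals `tag`, in their original order."""
    return [word for word, word_tag in tagged_words if word_tag == tag]


# (word, tag) pairs in the shape printed by seg.cut when postag=True.
tagged = [('我', 'r'), ('爱', 'v'), ('北京', 'ns'), ('天安门', 'ns')]

print(words_with_tag(tagged, 'ns'))  # ['北京', '天安门']
```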

Example 4: segmenting a file

import pkuseg

# segment input.txt and write the result to output.txt
# use 20 processes
pkuseg.test('input.txt', 'output.txt', nthread=20)

For other usage examples, see the detailed code examples.

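Parameters

Model configuration

pkuseg.pkuseg(model_name = "default", user_dict = "default", postag = False)
	model_name		Model path.
				"default": the default; uses our pre-trained mixed-domain model (pip installations only).
				"news": uses the news-domain model.
				"web": uses the web-domain model.
				"medicine": uses the medicine-domain model.
				"tourism": uses the tourism-domain model.
				model_path: loads the model from a user-specified path.
	user_dict		User dictionary.
				"default": the default; uses the dictionary we provide.
				None: no dictionary is used.
				dict_path: uses a user-defined dictionary in addition to the default one. Pass the path to your own dictionary file, formatted with one word per line (if POS tagging is enabled and the word's tag is known, write the word and its tag on that line, separated by a tab character).
	postag			Whether to perform POS tagging.
				False: the default; segmentation only, no POS tagging.
				True: performs POS tagging along with segmentation.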
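
The dictionary format described for user_dict can be sketched as follows: one word per line, optionally followed by a tab and a POS tag. write_user_dict and segment_with_user_dict are hypothetical helpers introduced here, and writing the file as UTF-8 is an assumption that the source does not state. pkuseg is imported lazily so that the file-writing part runs even where pkuseg is not installed.

```python
import os
import tempfile


def write_user_dict(path, entries):
    """Write one entry per line; (word, tag) tuples are joined with a tab."""
    with open(path, "w", encoding="utf-8") as f:  # UTF-8 is an assumption
        for entry in entries:
            line = "\t".join(entry) if isinstance(entry, tuple) else entry
            f.write(line + "\n")


def segment_with_user_dict(text, dict_path):
    """Segment `text` with POS tagging, using the dictionary at `dict_path`."""
    import pkuseg  # lazy import: only needed when segmentation is requested
    seg = pkuseg.pkuseg(user_dict=dict_path, postag=True)
    return seg.cut(text)


# A word without a tag and a word with a known tag.
dict_path = os.path.join(tempfile.mkdtemp(), "user_dict.txt")
write_user_dict(dict_path, ["天安门", ("北京", "ns")])
```

Calling segment_with_user_dict('我爱北京天安门', dict_path) would then load the default dictionary together with this file.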

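Segmenting a file

pkuseg.test(readFile, outputFile, model_name = "default", user_dict = "default", postag = False, nthread = 10)
	readFile		Path to the input file.
	outputFile		Path to the output file.
	model_name		Model path. Same as in pkuseg.pkuseg.
	user_dict		User dictionary. Same as in pkuseg.pkuseg.
	postag			Whether to enable POS tagging. Same as in pkuseg.pkuseg.
	nthread			Number of processes to use during testing.

Model training

pkuseg.train(trainFile, testFile, savedir, train_iter = 20, init_model = None)
	trainFile		Path to the training file.
	testFile		Path to the test file.
	savedir			Directory in which the trained model is saved.
	train_iter		Number of training iterations.
	init_model		Model to initialize from. The default, None, uses the default initialization; you can pass the path of the model you want to start from, e.g. init_model='./models/'.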

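Multi-process segmentation

When the code examples above are placed in a file and run as a script, and multi-process features are involved, be sure to protect global statements with if __name__ == '__main__'; see multi-process segmentation for details.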
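
A minimal sketch of such a script is shown below. segment_file is a hypothetical wrapper introduced here; the early existence check and the choice of nthread=4 are illustrative assumptions, and only pkuseg.test with its documented parameters is taken from this README.

```python
import os
import sys


def segment_file(input_path, output_path, nthread=4):
    """Segment input_path into output_path with pkuseg's multi-process file API."""
    if not os.path.exists(input_path):
        # Check up front so a bad path is reported before any processes start.
        print(f"input file {input_path} does not exist", file=sys.stderr)
        return False
    import pkuseg  # lazy import: only needed once the input has been validated
    pkuseg.test(input_path, output_path, nthread=nthread)
    return True


if __name__ == '__main__':
    # Only the parent process runs this block; worker processes that import
    # the module again skip it instead of re-running the segmentation.
    segment_file('input.txt', 'output.txt', nthread=4)
```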

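Pre-trained models

Users who installed with pip only need to set model_name to the desired domain to use domain-specific segmentation; the corresponding model is downloaded automatically.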
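
The recommendation to use a domain model when the domain is known, and the default mixed-domain model otherwise, can be sketched as a small selection helper. choose_model_name and load_segmenter are hypothetical names introduced here; the set of domain names is the one listed under model_name in this README.

```python
# Domain names accepted by model_name, as listed in this README.
DOMAIN_MODELS = {"news", "web", "medicine", "tourism"}


def choose_model_name(domain=None):
    """Return the domain model name if it is known, otherwise "default"."""
    return domain if domain in DOMAIN_MODELS else "default"


def load_segmenter(domain=None):
    """Load a pkuseg segmenter for `domain`, falling back to the default model."""
    import pkuseg  # lazy import so choose_model_name works without pkuseg
    return pkuseg.pkuseg(model_name=choose_model_name(domain))


print(choose_model_name("medicine"))  # medicine
print(choose_model_name("finance"))   # default
```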

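Users who downloaded the code from GitHub need to download the pre-trained model themselves and set model_name to the path of that model. Pre-trained models can be downloaded from the release section. The pre-trained models are:

  • news: a model trained on MSRA (news corpus).

  • web: a model trained on Weibo (web text corpus).

  • medicine: a model trained on medical-domain data.

  • tourism: a model trained on tourism-domain data.

  • mixed: a general model trained on a mixed dataset. This is the model shipped with the pip package.

Users are welcome to share domain-specific models they have trained.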

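Version history

See the version history.

License

  1. This code is released under the MIT License.
  2. Comments and suggestions on this toolkit are welcome; please email [email protected]

Citation

This toolkit is mainly based on the following research paper. If you use it, please cite: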


@article{pkuseg,
  author = {Luo, Ruixuan and Xu, Jingjing and Zhang, Yi and Ren, Xuancheng and Sun, Xu},
  journal = {CoRR},
  title = {PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation.},
  url = {https://arxiv.org/abs/1906.11455},
  volume = {abs/1906.11455},
  year = 2019
}

Other related papers

  • Xu Sun, Houfeng Wang, Wenjie Li. Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection. ACL. 2012.
  • Jingjing Xu and Xu Sun. Dependency-based gated recursive neural network for Chinese word segmentation. ACL. 2016.
  • Jingjing Xu and Xu Sun. Transfer learning for low-resource Chinese word segmentation with a novel neural network. NLPCC. 2017.

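Frequently asked questions

  1. Why release pkuseg?
  2. What techniques does pkuseg use?
  3. Multi-process segmentation and training do not work and raise RuntimeError and BrokenPipeError.
  4. How was pkuseg compared with other toolkits on domain-specific data?
  5. How does it perform when compared on a black-box test set?
  6. What if I do not know the domain of the text to be segmented?
  7. How should segmentation results on particular examples be interpreted?
  8. What about running speed?
  9. What about multi-process speed?

Acknowledgements

Thanks to Professor Yu Shiwen (俞士汶) of the Institute of Computational Linguistics, Peking University, and Dr. Qiu Likun (邱立坤) for providing the training datasets!

Authors

Ruixuan Luo (罗睿轩), Jingjing Xu (许晶晶), Xuancheng Ren (任宣丞), Yi Zhang (张艺), Bingzhen Wei (位冰镇), Xu Sun (孙栩)

Language Computing and Machine Learning Group, Peking University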

Comments
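  • Isn't the performance comparison with the other segmentation toolkits unfair?

    Were the jieba and THULAC models used in the comparison trained on the corresponding training corpora (MSRA, CTB8)? If they had been, their results should not be that poor. An F-score of around 80% is almost on par with unsupervised segmentation.

    Comparing pkuseg trained on in-domain data against jieba and THULAC that were not trained on that domain's data is clearly unfair. The conclusion that segmentation accuracy is greatly improved cannot be drawn from such an experiment.

    In fact, MSRA segmentation results reported in papers are basically all above 97.5.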

    opened by jiesutd 31
  • A single sentence is enough to settle the contest with jieba

    pkuseg:

    seg = pkuseg.pkuseg()
    print(seg.cut('结婚的和尚未结婚的确实在干扰分词啊'))
    ['结婚', '的', '和尚', '未', '结婚', '的确', '实在', '干扰', '分词', '啊']

    jieba:

    print([i[0] for i in jieba.tokenize('结婚的和尚未结婚的确实在干扰分词啊')])
    ['结婚', '的', '和', '尚未', '结婚', '的', '确实', '在', '干扰', '分词', '啊']

    Three words are segmented wrongly in a single sentence; where does the courage to announce so loudly that it far surpasses jieba come from ......

    opened by mendynew 8
  • undefined symbol: PyFPE_jbuf

    ImportError: /root/anaconda3/envs/NLP/lib/python3.5/site-packages/pkuseg/feature_extractor.cpython-35m-x86_64-linux-gnu.so: undefined symbol: PyFPE_jbuf

    ubuntu, pip install pkuseg any ideas?

    opened by LCorleone 5
  • what is the required encode of input file?

    C:\Python36>python
    Python 3.6.7 (v3.6.7:6ec5cf24b7, Oct 20 2018, 13:35:33) [MSC v.1900 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> import pkuseg
    >>>
    >>> seg = pkuseg.pkuseg()           # load the model with the default configuration
    >>> text = seg.cut('我爱北京天安门')  # segment the text
    >>> print(text)
    ['我', '爱', '北京', '天安门']
    >>> import pkuseg
    >>>
    >>> seg = pkuseg.pkuseg(postag=True)  # enable POS tagging
    Downloading: "https://github.com/lancopku/pkuseg-python/releases/download/v0.0.16/postag.zip" to C:\Users\lutao/.pkuseg\
    postag.zip
    100.0%
    >>> text = seg.cut('我爱北京天安门')    # segment and tag the text
    >>> print(text)
    [('我', 'r'), ('爱', 'v'), ('北京', 'ns'), ('天安门', 'ns')]
    >>> import pkuseg
    >>>
    >>> # segment input.txt and write the result to output.txt
    ... # use 20 processes
    ... pkuseg.test('c:/user/lutao/downloads/0309a.txt', ''c:/user/lutao/downloads/0309a_output.txt', nthread=10)
      File "<stdin>", line 3
        pkuseg.test('c:/user/lutao/downloads/0309a.txt', ''c:/user/lutao/downloads/0309a_output.txt', nthread=10)
                                                           ^
    SyntaxError: invalid syntax
    >>> pkuseg.test('c:/user/lutao/downloads/0309a.txt', 'c:/user/lutao/downloads/0309a_output.txt', nthread=10)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python36\lib\site-packages\pkuseg\__init__.py", line 520, in test
        input_file, output_file, nthread, model_name, user_dict, postag, verbose
      File "C:\Python36\lib\site-packages\pkuseg\__init__.py", line 444, in _test_multi_proc
        raise Exception("input_file {} does not exist.".format(input_file))
    Exception: input_file c:/user/lutao/downloads/0309a.txt does not exist.
    

    I replaced '/' with '\', and the encoding of 0309a.txt is gbk

    >>> pkuseg.test('c:\user\lutao\downloads\0309a.txt', 'c:\user\lutao\downloads\0309a_output.txt', nthread=10)
      File "<stdin>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \uXXXX escape
    

    I saved 0309a.txt as 0309b.txt with utf-8 encoding,

    >>> pkuseg.test('c:\user\lutao\downloads\0309b.txt', 'c:\user\lutao\downloads\0309b_output.txt', nthread=10)
      File "<stdin>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \uXXXX escape
    
    opened by l1t1 5
  • import fails on Python 3.6

    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg
    Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
    Requirement already satisfied: pkuseg in d:\dev_tools\python3.6\lib\site-packages (0.0.14)
    Requirement already satisfied: numpy in d:\dev_tools\python3.6\lib\site-packages (from pkuseg) (1.13.3+mkl)

    python
    Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.

    import pkuseg
    Traceback (most recent call last):
      File "", line 1, in
      File "D:\dev_tools\python3.6\lib\site-packages\pkuseg\__init__.py", line 14, in
        import pkuseg.trainer as trainer
      File "D:\dev_tools\python3.6\lib\site-packages\pkuseg\trainer.py", line 19, in
        import pkuseg.inference as _inf
      File "__init__.pxd", line 918, in init pkuseg.inference
    ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject
    opened by tangchun 5
  • What causes this problem?

    length = 1 : 0
    length = 2 : 2496
    length = 3 : 2642
    length = 4 : 2568
    length = 5 : 1313
    length = 6 : 633
    length = 7 : 249
    length = 8 : 133
    length = 9 : 66
    length = 10 : 16
    length = 11 : 6
    length = 12 : 1
    length = 13 : 1

    start training...

    reading training & test data... done! train/test data sizes: 1/1

    r: 1
    iter0 diff=1.00e+100 train-time(sec)=5.64 f-score=0.06%
    iter1 diff=1.00e+100 train-time(sec)=5.63 f-score=0.00%
    Traceback (most recent call last):
      File "test.py", line 8, in
        pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models', nthread=20)
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/__init__.py", line 324, in train
        trainer.train(config)
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 103, in train
        score_list = trainer.test(testset, i)
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 169, in test
        testset, self.model, writer
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/trainer.py", line 357, in _decode_fscore
        gold_tags, pred_tags, self.idx_to_chunk_tag
      File "/Users/faby/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pkuseg/scorer.py", line 37, in getFscore
        pre = correct_chunk / res_chunk * 100
    ZeroDivisionError: division by zero

    opened by Fabyone 4
  • ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

    Installed with pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg

    import pkuseg
    seg = pkuseg.pkuseg()
    text = "我爱北京天安门"
    cut = seg.cut(text)
    print(cut)

    Traceback (most recent call last):
      File "E:/python/work/spider/bx/piggy.py", line 1, in
        import pkuseg
      File "D:\Program Files (x86)\Python\Anaconda3\lib\site-packages\pkuseg\__init__.py", line 14, in
        import pkuseg.trainer
      File "D:\Program Files (x86)\Python\Anaconda3\lib\site-packages\pkuseg\trainer.py", line 19, in
        import pkuseg.inference as _inf
      File "__init__.pxd", line 918, in init pkuseg.inference
    ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

    opened by xhochipe 4
  • FileNotFoundError: [Errno 2] No such file or directory: '/home/.pkuseg/postag/featureIndex.txt_0'

    I installed pkuseg and, on first use, it needed to download postag.zip. The download failed, so I downloaded the file myself and put it in the folder, but I get the error FileNotFoundError: [Errno 2] No such file or directory: '/home/.pkuseg/postag/featureIndex.txt_0'

    opened by hjing100 3
  • 0.0.25 fails to install on Binder

    0.0.22 installs normally.

    Collecting numpy
      Downloading numpy-1.19.0-cp37-cp37m-manylinux2010_x86_64.whl (14.6 MB)
    Collecting pkuseg
      Downloading pkuseg-0.0.25.tar.gz (48.8 MB)
        ERROR: Command errored out with exit status 1:
         command: /srv/conda/envs/notebook/bin/python3.7 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-5d95j8mq/pkuseg/setup.py'"'"'; __file__='"'"'/tmp/pip-install-5d95j8mq/pkuseg/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-h22vfd4x
             cwd: /tmp/pip-install-5d95j8mq/pkuseg/
        Complete output (5 lines):
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-install-5d95j8mq/pkuseg/setup.py", line 5, in <module>
            import numpy as np
        ModuleNotFoundError: No module named 'numpy'
        ----------------------------------------
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    
    opened by GoooIce 3
  • Installed with pip; error when using a domain-specific model?

    Traceback (most recent call last):
      File "py3_cook_corpus_embedding.py", line 18, in <module>
        seg = pkuseg.pkuseg(model_name='medicine')
      File "/home/work/software/anaconda3/envs/py3myhao/lib/python3.6/site-packages/pkuseg/__init__.py", line 224, in __init__
        self.feature_extractor = FeatureExtractor.load()
      File "pkuseg/feature_extractor.pyx", line 625, in pkuseg.feature_extractor.FeatureExtractor.load
    FileNotFoundError: [Errno 2] No such file or directory: 'medicine/unigram_word.txt'

    Also, when using a domain-specific model, can a custom dictionary be used at the same time?

    opened by kinghmy 3
  • Installation error with WSL2 + pyenv + Python 3.8.5

    (fastApi-env) xiaxichen@DESKTOP-LEBPPCV:/mnt/c/Users/Administrator$ pip install pkuseg
    Looking in indexes: http://mirrors.aliyun.com/pypi/simple
    Collecting pkuseg
      Downloading http://mirrors.aliyun.com/pypi/packages/64/3a/090a533c7f0682d653633cfd2d33e9aab3e671379fb199aeb7fa9bd3c34a/pkuseg-0.0.25.tar.gz (48.8 MB)
         |████████████████████████████████| 48.8 MB 79.6 MB/s
        ERROR: Command errored out with exit status 1:
         command: /home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-hjb0015_/pkuseg/setup.py'"'"'; __file__='"'"'/tmp/pip-install-hjb0015_/pkuseg/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-99zrwcbj
             cwd: /tmp/pip-install-hjb0015_/pkuseg/
        Complete output (36 lines):
        WARNING: The wheel package is not available.
        WARNING: The repository located at mirrors.aliyun.com is not a trusted or secure host and is being ignored. If this repository is available via HTTPS we recommend you use HTTPS instead, otherwise you may silence this warning and allow it anyway with '--trusted-host mirrors.aliyun.com'.
        ERROR: Could not find a version that satisfies the requirement cython (from versions: none)
        ERROR: No matching distribution found for cython
        Traceback (most recent call last):
          File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/installer.py", line 128, in fetch_build_egg
            subprocess.check_call(cmd)
          File "/home/xiaxichen/.pyenv/versions/3.8.5/lib/python3.8/subprocess.py", line 364, in check_call
            raise CalledProcessError(retcode, cmd)
        subprocess.CalledProcessError: Command '['/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmp6illbjjn', '--quiet', 'cython']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-hjb0015_/pkuseg/setup.py", line 63, in <module>
        setup_package()
      File "/tmp/pip-install-hjb0015_/pkuseg/setup.py", line 39, in setup_package
        setuptools.setup(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/__init__.py", line 162, in setup
        _install_setup_requires(attrs)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/__init__.py", line 157, in _install_setup_requires
        dist.fetch_build_eggs(dist.setup_requires)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/dist.py", line 699, in fetch_build_eggs
        resolved_dists = pkg_resources.working_set.resolve(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 779, in resolve
        dist = best[req.key] = env.best_match(
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1064, in best_match
        return self.obtain(req, installer)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1076, in obtain
        return installer(requirement)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/dist.py", line 758, in fetch_build_egg
        return fetch_build_egg(self, req)
      File "/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/lib/python3.8/site-packages/setuptools/installer.py", line 130, in fetch_build_egg
        raise DistutilsError(str(e)) from e
    distutils.errors.DistutilsError: Command '['/home/xiaxichen/.pyenv/versions/3.8.5/envs/fastApi-env/bin/python3.8', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmp6illbjjn', '--quiet', 'cython']' returned non-zero exit status 1.
    ----------------------------------------
    

    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

    opened by xiaxichen 2
  • cannot install in the environment of python 3.9

    Dear Sirs or Madams, the installation to the environment of python 3.9 failed. I check your repository at 'https://pypi.tuna.tsinghua.edu.cn/simple/pkuseg/'. It seems that there are no python 3.9 relevant files there. do you have any plan to support python 3.9? I also saw that in other issues, you suggested the file relevant to python 3.9. I cannot find the file. Your reply is highly appreciated. Tony

    opened by tonydeck0506 2
  • POS tagging works too well

    In theory, good performance is a good thing, but in actual testing it also tags non-existent place names as place names.

    import pkuseg
    seg = pkuseg.pkuseg(postag=True)
    text = seg.cut('广场镇是河北天津衡水冲绳东京的旧地狱和亚特兰斯地吗?')
    for word, flag in text: 
        if flag == 'ns':
            print (word)
    

    The output is:

    广场镇
    河北
    天津
    衡水
    冲绳
    东京
    亚特兰斯
    
    opened by axty666 1
  • TypeError: train() got an unexpected keyword argument 'nthread'

    
    import pkuseg
    
    # the training file is 'train.txt'
    # the test file is 'test.txt'
    # load the model in './pretrained', save the trained model to './models', and train for 10 iterations
    pkuseg.train('train.txt', 'test.txt', './models', train_iter=10, init_model='./pretrained')
    
    
    opened by KangChou 1
Releases(v0.0.25)
Owner
LancoPKU
Language Computing and Machine Learning Group (Xu Sun's group) at Peking University