Python library for processing Chinese text

Rui Wang

Last update: Jan 2, 2023

Related tags

Text Data & NLP snownlp

Overview

SnowNLP: Simplified Chinese Text Processing

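SnowNLP is a library written in Python that makes it easy to process Chinese text. It was inspired by TextBlob. Since most natural language processing libraries today target English, I wrote a library that makes Chinese easy to handle. Unlike TextBlob, it does not use NLTK: all the algorithms are implemented from scratch, and some pre-trained dictionaries are bundled. Note that the library works on unicode throughout, so decode your input to unicode yourself before using it.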

from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')

s.words         # [u'这个', u'东西', u'真心',
                #  u'很', u'赞']

s.tags          # [(u'这个', u'r'), (u'东西', u'n'),
                #  (u'真心', u'd'), (u'很', u'd'),
                #  (u'赞', u'Vg')]

s.sentiments    # 0.9769663402895832, the probability of being positive

s.pinyin        # [u'zhe', u'ge', u'dong', u'xi',
                #  u'zhen', u'xin', u'hen', u'zan']

s = SnowNLP(u'「繁體字」「繁體中文」的叫法在臺灣亦很常見。')

s.han           # u'「繁体字」「繁体中文」的叫法
                # 在台湾亦很常见。'

text = u'''
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。
它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。
自然语言处理是一门融语言学、计算机科学、数学于一体的科学。
因此，这一领域的研究将涉及自然语言，即人们日常使用的语言，
所以它与语言学的研究有着密切的联系，但又有重要的区别。
自然语言处理并不是一般地研究自然语言，
而在于研制能有效地实现自然语言通信的计算机系统，
特别是其中的软件系统。因而它是计算机科学的一部分。
'''

s = SnowNLP(text)

s.keywords(3)   # [u'语言', u'自然', u'计算机']

s.summary(3)    # [u'因而它是计算机科学的一部分',
                #  u'自然语言处理是一门融语言学、计算机科学、
                #    数学于一体的科学',
                #  u'自然语言处理是计算机科学领域与人工智能
                #    领域中的一个重要方向']
s.sentences

s = SnowNLP([[u'这篇', u'文章'],
             [u'那篇', u'论文'],
             [u'这个']])
s.tf
s.idf
s.sim([u'文章'])  # [0.3756070762985226, 0, 0]

Features

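Chinese word segmentation (Character-Based Generative Model)
Part-of-speech tagging (TnT, 3-gram hidden Markov model)
Sentiment analysis (the training data is currently mostly product reviews from shopping sites, so results on other kinds of text may be poor; to be improved)
Text classification (Naive Bayes)
Conversion to pinyin (maximum matching implemented with a trie)
Traditional-to-Simplified Chinese conversion (maximum matching implemented with a trie)
Keyword extraction (TextRank algorithm)
Summary extraction (TextRank algorithm)
tf, idf
Tokenization (splitting into sentences)
Text similarity (BM25)
Python 3 support (thanks to erning)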
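
The "maximum matching implemented with a trie" entries above can be illustrated with a small sketch. It is not SnowNLP's own code: `build_trie`, `translate`, and the four dictionary entries are hypothetical names and data chosen for illustration. At each position the longest dictionary entry wins; characters with no entry are copied through unchanged.

```python
def build_trie(mapping):
    """Build a nested-dict trie; the key '' marks the end of an entry."""
    root = {}
    for source, target in mapping.items():
        node = root
        for ch in source:
            node = node.setdefault(ch, {})
        node[''] = target
    return root


def translate(text, trie):
    """Replace the longest dictionary match found at each position."""
    out = []
    i = 0
    while i < len(text):
        node = trie
        match_end, match_value = None, None
        j = i
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if '' in node:              # a complete entry ends here
                match_end, match_value = j, node['']
        if match_end is None:           # no entry starts at position i
            out.append(text[i])
            i += 1
        else:
            out.append(match_value)
            i = match_end
    return ''.join(out)


# Toy dictionary (hypothetical entries, not the library's bundled data).
toy = build_trie({'繁體': '繁体', '臺灣': '台湾', '見': '见', '體': '体'})
print(translate('繁體字在臺灣很常見', toy))   # 繁体字在台湾很常见
```

The same lookup shape would serve pinyin conversion by mapping words to syllable lists instead of to replacement strings.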

Get It now

$ pip install snownlp

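Training

Training is currently available for word segmentation, part-of-speech tagging, and sentiment analysis, and the raw files I used for training are all provided. Taking segmentation as an example, the segmentation code is in the snownlp/seg directory.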

from snownlp import seg
seg.train('data.txt')
seg.save('seg.marshal')
# from snownlp import tag
# tag.train('199801.txt')
# tag.save('tag.marshal')
# from snownlp import sentiment
# sentiment.train('neg.txt', 'pos.txt')
# sentiment.save('sentiment.marshal')

The trained model is then saved as seg.marshal. Afterwards, change data_path in snownlp/seg/__init__.py to point to the newly trained file.

License

MIT licensed.

Comments

Python 3.5: SnowNLP keyword extraction gives different results on every run

For example, I select 10 keywords from a piece of text:

content = ...  # the text
r = SnowNLP(content)
result = r.keywords(10, False)

First run: ['執', '這', '壺', '年', '版', '星期日', '下午', '收', '也出', '土']
Second run: ['出', '年', '版', '星期日', '下午', '執壺', '種', '級', '低', '岛']
Third run: ['壺', '種', '年', '星期日', '下午', '岛', '也出', '也收藏', '物館', '明這']

I suspect the cause is related to the Python version; see a similar issue in jieba: https://github.com/fxsjy/jieba/issues/228

opened by jiacheliu3 7

Results unchanged after training sentiment?

I trained on my own training set:

def train():
    from snownlp import sentiment
    sentiment.train('filter_neg.txt', 'filter_pos.txt')
    sentiment.save('sentiment.marshal')

Then I used the marshal file to run sentiment analysis and print the results:

from snownlp import SnowNLP, sentiment

sentiment.data_path = 'sentiment.marshal'

with open('test.txt', encoding='utf-8') as f:
    for line in f:
        # Flatten line breaks and tabs before scoring each line.
        line = line.replace('\r', ' ').replace('\n', ' ').replace('\t', ' ')
        snlp = SnowNLP(line)
        print(snlp.sentiments)

But the results are the same as when the new marshal file is not used. Is there something wrong with how I wrote this?
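
One possible explanation, offered as an assumption rather than a confirmed diagnosis: if a module loads its model once at import time, using whatever the path variable held at that moment, then reassigning the variable afterwards changes nothing. The toy below reproduces that pattern in isolation. `ToyClassifier` and `DATA_PATH` are hypothetical stand-ins and do not come from SnowNLP.

```python
DATA_PATH = 'default.marshal'


class ToyClassifier:
    """Hypothetical stand-in for a classifier that reads a model file."""

    def __init__(self):
        self.loaded_from = None

    def load(self, fname=DATA_PATH):   # default is bound once, at definition
        self.loaded_from = fname


classifier = ToyClassifier()
classifier.load()                      # plays the role of the import-time load

DATA_PATH = 'sentiment.marshal'        # later reassignment, as in the report
print(classifier.loaded_from)          # default.marshal

classifier.load('sentiment.marshal')   # an explicit reload does take effect
print(classifier.loaded_from)          # sentiment.marshal
```

If this is what happens, the options are to edit the path inside the package as the Training section describes, or to check whether the module exposes an explicit load call that accepts a file name.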

opened by tangwz 4

TF IDF with letter pairs?

Hi! Thanks for this library. I'm running it on some Chinese sentences that have already been segmented with jieba.

But the tf/idf only seems to work with single characters.

Can it work with full Chinese words? E.g. 接力, not 接 and 力 separately?

sub 妈 接力 活动 还 没 结束 吗 ？
你好 我家 宝宝 牛奶 过敏 一个 加号 ， 亲舒 可以 喝
sentences
['妈 接力 活动 还 没 结束 吗', '你好 我家 宝宝 牛奶 过敏 一个 加号', '亲舒 可以 喝']
tags
[   ('妈', 'n'),
    ('接力', 'vn'),
    ('活动', 'vn'),
    ('还', 'd'),
    ('没', 'd'),
    ('结束', 'v'),
    ('吗', 'y'),
    ('？', 'w'),
    ('你好', 'l'),
    ('我家', 'n'),
    ('宝宝', 'n'),
    ('牛奶', 'n'),
    ('过敏', 'v'),
    ('一个', 'm'),
    ('加号', 'Bg'),
    ('，', 'w'),
    ('亲', 'a'),
    ('舒', 'nr'),
    ('可以', 'v'),
    ('喝', 'v')]
tf
[   {'妈': 1},
    {'力': 1, '接': 1},
    {'动': 1, '活': 1},
    {'还': 1},
    {'没': 1},
    {'束': 1, '结': 1},
    {'吗': 1},
    {'\n': 1, '你': 1, '好': 1, '？': 1},
    {'家': 1, '我': 1},
    {'宝': 2},
    {'奶': 1, '牛': 1},
    {'敏': 1, '过': 1},
    {'一': 1, '个': 1},
    {'加': 1, '号': 1},
    {'，': 1},
    {'亲': 1, '舒': 1},
    {'以': 1, '可': 1},
    {'喝': 1}]
idf
{   '\n': 2.456735772821304,
    '一': 2.456735772821304,
    '个': 2.456735772821304,
    '亲': 2.456735772821304,
    '以': 2.456735772821304,
    '你': 2.456735772821304,
    '力': 2.456735772821304,
    '加': 2.456735772821304,
    '动': 2.456735772821304,
    '可': 2.456735772821304,
    '号': 2.456735772821304,
    '吗': 2.456735772821304,
    '喝': 2.456735772821304,
    '奶': 2.456735772821304,
    '好': 2.456735772821304,
    '妈': 2.456735772821304,
    '宝': 2.456735772821304,
    '家': 2.456735772821304,
    '我': 2.456735772821304,
    '接': 2.456735772821304,
    '敏': 2.456735772821304,
    '束': 2.456735772821304,
    '没': 2.456735772821304,
    '活': 2.456735772821304,
    '牛': 2.456735772821304,
    '结': 2.456735772821304,
    '舒': 2.456735772821304,
    '过': 2.456735772821304,
    '还': 2.456735772821304,
    '，': 2.456735772821304,
    '？': 2.456735772821304}
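
The tf output above has 18 entries, one per word, each counted character by character. That is what one would expect if a flat list of word strings is passed where a list of token lists is expected, because iterating a string yields its characters. The sketch below shows the difference. It assumes an idf of log(D - n + 0.5) - log(n + 0.5), which reproduces the 2.4567 figure for D = 18 and n = 1; `tf_idf` is a hypothetical helper, not the library's function.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return per-document term counts and a BM25-style idf table."""
    tf = [dict(Counter(doc)) for doc in docs]
    df = Counter(term for doc in tf for term in doc)
    total = len(docs)
    idf = {t: math.log(total - n + 0.5) - math.log(n + 0.5)
           for t, n in df.items()}
    return tf, idf


words = ['妈', '接力', '活动']

# A flat list of strings: each string is iterated character by character.
tf_chars, _ = tf_idf(words)
print(tf_chars)   # [{'妈': 1}, {'接': 1, '力': 1}, {'活': 1, '动': 1}]

# A list of token lists: each token is kept whole.
tf_words, _ = tf_idf([words])
print(tf_words)   # [{'妈': 1, '接力': 1, '活动': 1}]

# 18 "documents" and a term seen in one of them: about 2.4567.
print(math.log(18 - 1 + 0.5) - math.log(1 + 0.5))
```

Under that reading, wrapping each pre-segmented sentence in its own list, as the README does with [[u'这篇', u'文章'], ...], should keep words such as 接力 intact.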

opened by dcsan 3

MemoryError

 * Running on http://127.0.0.1:8001/ (Press CTRL+C to quit)
 * Restarting with stat
Traceback (most recent call last):
  File "weibo_sentiment/main.py", line 24, in <module>
    from weibo_sentiment import (
  File "/home/www/weibo_sentiment/weibo_sentiment/weibo_model.py", line 10, in <module>
    from snownlp import SnowNLP
  File "/usr/local/lib/python2.7/dist-packages/snownlp/__init__.py", line 5, in <module>
    from . import seg
  File "/usr/local/lib/python2.7/dist-packages/snownlp/seg/__init__.py", line 12, in <module>
    segger.load(data_path, True)
  File "/usr/local/lib/python2.7/dist-packages/snownlp/seg/seg.py", line 23, in load
    self.segger.load(fname, iszip)
  File "/usr/local/lib/python2.7/dist-packages/snownlp/seg/y09_2047.py", line 51, in load
    data = f.read()
  File "/usr/lib/python2.7/gzip.py", line 254, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 313, in _read
    self._add_read_data( uncompress )
  File "/usr/lib/python2.7/gzip.py", line 331, in _add_read_data
    self.extrabuf = self.extrabuf[offset:] + data
MemoryError

It is running on an Aliyun ECS instance with 1 GB of RAM.

File "/usr/local/lib/python2.7/dist-packages/snownlp/seg/y09_2047.py", line 51, in load
    data = f.read()

Could this be changed to readlines?

opened by zh-h 3

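Trained sentiment model does not take effect

from snownlp import sentiment
sentiment.train('neg.txt', 'pos.txt')
sentiment.save('sentiment.marshal4')

This generates a file named sentiment.marshal4.3. In the snownlp\sentiment directory I renamed it to replace the .marshal file and also replaced the neg and pos files, but the scores are still the same as with the default model. What am I doing wrong?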

opened by dockersky 2
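
Would like to know the detailed principles behind the sentiment analysis

Hello. Judging from your code, and if I have understood it correctly, the algorithm behind the sentiment analysis is based on a Bayes-type model? After looking at TextBlob's implementation, I found that it also uses a Bayes-type model (https://textblob.readthedocs.io/en/dev/advanced_usage.html#sentiment-analyzers).

I have read your code, but I do not fully understand the principles behind it. How do you use a Bayes-type model to perform sentiment analysis? Are there any papers or websites you could point me to?

TextBlob analyzes polarity and subjectivity/objectivity, but your example only returns a positive value. Does that value mean that the closer it is to 1, the more positive the sentiment, and the closer to 0, the more negative?

TextBlob's sentiment analysis is very complete for English, but for Chinese there does not yet seem to be a comprehensive and effective package for polarity (positive/negative) sentiment analysis. You appear to be the only one who has taken the first step (and a big one, thank you), so I would like to understand your work thoroughly first and then see whether I can help fill the gaps in Chinese text mining.
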
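
As background for the question, here is a minimal naive Bayes sketch that returns the probability of the positive class, matching the README's description of s.sentiments as "the probability of being positive". It is a generic textbook formulation with add-one smoothing and a document-count prior. The function names and the four toy documents are assumptions for illustration, not the library's implementation or training data.

```python
import math
from collections import Counter


def train(pos_docs, neg_docs):
    """Count word frequencies and document counts for each class."""
    counts = {'pos': Counter(w for d in pos_docs for w in d),
              'neg': Counter(w for d in neg_docs for w in d)}
    n_docs = {'pos': len(pos_docs), 'neg': len(neg_docs)}
    return counts, n_docs


def positive_probability(words, model):
    """Return P(pos | words) under naive Bayes with add-one smoothing."""
    counts, n_docs = model
    vocab_size = len(set(counts['pos']) | set(counts['neg']))
    log_score = {}
    for label in ('pos', 'neg'):
        total_words = sum(counts[label].values())
        score = math.log(n_docs[label] / sum(n_docs.values()))   # prior
        for w in words:
            score += math.log((counts[label][w] + 1)
                              / (total_words + vocab_size))
        log_score[label] = score
    # Normalize the two joint scores into a posterior for 'pos'.
    return 1.0 / (1.0 + math.exp(log_score['neg'] - log_score['pos']))


# Toy training data (hypothetical).
model = train([['很', '赞'], ['质量', '好']],
              [['很', '差'], ['质量', '差']])
p_good = positive_probability(['赞'], model)
p_bad = positive_probability(['差'], model)
print(p_good, p_bad)   # above 0.5 for 赞, below 0.5 for 差
```

In this formulation a value near 1 means the words are much more likely under the positive class than under the negative one, and a value near 0 means the opposite.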
opened by skydome20 2
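
Confused about the TextRank algorithm; please help, many thanks!

Hello, I think there may be a problem in the TextRank implementation. Taking KeywordTextRank as an example, when computing the rank of node k, your code is:

for k, v in self.words.items():
    m[k] = 1-self.d
    for j in v:
        if k == j or len(self.words[j]) == 0:
            continue
        m[k] += (self.d/len(self.words[j])*self.vertex[j])
    if abs(m[k] - self.vertex[k]) > max_diff:
        max_diff = abs(m[k] - self.vertex[k])

Given the relationship between k and v, v contains the set of all nodes that point to node k. If so, len(self.words[j]) in the inner loop is the in-degree of node j, that is, the number of nodes pointing to j. In the actual TextRank algorithm, however, this should be the out-degree of node j, that is, the number of links going out of j. Could the author say whether the code here is wrong? Thanks!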
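
One point that may help with this question: when the keyword graph is built from symmetric co-occurrence, every edge runs in both directions, so a node's in-degree and out-degree are the same number. The sketch below assumes such an undirected graph. `build_graph`, `textrank`, the window size, and the sample words are all hypothetical, and whether the library builds its graph this way should be checked against its source.

```python
def build_graph(words, window=2):
    """Link every pair of distinct words that co-occur within `window`."""
    graph = {w: set() for w in words}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:i + window]:
            if w1 != w2:
                graph[w1].add(w2)
                graph[w2].add(w1)   # symmetric: the edge goes both ways
    return graph


def textrank(graph, d=0.85, max_iter=200, min_diff=1e-6):
    """Iterate score(k) = (1 - d) + d * sum(score(j) / out_degree(j))."""
    score = {k: 1.0 for k in graph}
    for _ in range(max_iter):
        new = {}
        for k, neighbours in graph.items():
            # len(graph[j]) is j's out-degree and, by symmetry, its in-degree.
            new[k] = (1 - d) + d * sum(score[j] / len(graph[j])
                                       for j in neighbours)
        diff = max(abs(new[k] - score[k]) for k in graph)
        score = new
        if diff <= min_diff:
            break
    return score


graph = build_graph(['自然', '语言', '处理', '语言', '研究'])
scores = textrank(graph)
print(max(scores, key=scores.get))   # 语言
```

With a directed graph the two degrees would differ, and the objection in the question would then apply.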

opened by 37814084 2

Problem under Python 3

After installing via pip,

import snownlp

raises an error indicating a problem with the seg.marshal file.

The full error message is as follows:

/opt/local/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/snownlp/__init__.py in <module>()
      3
      4 from . import normal
----> 5 from . import seg
      6 from . import tag
      7 from . import sentiment

/opt/local/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/snownlp/seg/__init__.py in <module>()
     10                          'seg.marshal')
     11 segger = TnTseg.Seg()
---> 12 segger.load(data_path, True)
     13
     14

/opt/local/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/snownlp/seg/seg.py in load(self, fname, iszip)
     21
     22     def load(self, fname, iszip=True):
---> 23         self.segger.load(fname, iszip)
     24
     25     def train(self, fname):

/opt/local/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/snownlp/seg/y09_2047.py in load(self, fname, iszip)
     45         try:
     46             f = gzip.open(fname, 'rb')
---> 47             d = marshal.loads(f.read())
     48         except IOError:
     49             f = open(fname, 'rb')

ValueError: bad marshal data (unknown type code)

opened by DongChengliang 2
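
Process terminates unexpectedly when deploying to a server

Thank you for sharing; this is a very practical library and I am still learning it. However, I ran into a problem when deploying to a server.

Steps to reproduce

Server environment: CentOS, 2 GB RAM, CPU Intel(R) Xeon(R) E5-2420 1.9GHz

When the module import runs

from snownlp import SnowNLP

the process is killed. According to top, CPU usage exceeds 100% and memory usage is 27%. On a lower-spec personal computer there is no problem. I have not been able to work out the cause; do you have any suggestions?
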
opened by sanguo0023 2

Negative value in similarity

st = SnowNLP([[u'吃饭',u'睡觉',u'打豆豆'], [u'吃饭',u'烤肉',u'懒觉',u'打游戏'], ['吃吃喝喝']])
st.sim([u'吃饭',u'睡觉',u'打豆豆'])

The result is [0.4836218923228314, -0.4170005091967271, 0]. What does a negative similarity mean?
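
A likely reading, given that the feature list names BM25 as the similarity measure: with an idf of log(D - n + 0.5) - log(n + 0.5), any term that occurs in more than half of the documents gets a negative idf. Here 吃饭 occurs in 2 of the 3 documents, so its idf is log(1.5 / 2.5), which is below zero. The sketch below is a plain BM25 scorer written for illustration. The constants k1 = 1.5 and b = 0.75 and the idf form are assumptions, chosen because they reproduce the numbers reported above.

```python
import math


def bm25_scores(docs, query, k1=1.5, b=0.75):
    """Score every tokenized document in `docs` against `query`."""
    total = len(docs)
    avgdl = sum(len(doc) for doc in docs) / total
    freqs = [{w: doc.count(w) for w in doc} for doc in docs]
    df = {}
    for f in freqs:
        for w in f:
            df[w] = df.get(w, 0) + 1
    # Negative whenever a term appears in more than half of the documents.
    idf = {w: math.log(total - n + 0.5) - math.log(n + 0.5)
           for w, n in df.items()}
    scores = []
    for doc, f in zip(docs, freqs):
        score = 0
        for w in query:
            if w in f:
                norm = f[w] + k1 * (1 - b + b * len(doc) / avgdl)
                score += idf[w] * f[w] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [['吃饭', '睡觉', '打豆豆'],
        ['吃饭', '烤肉', '懒觉', '打游戏'],
        ['吃吃喝喝']]
scores = bm25_scores(docs, ['吃饭', '睡觉', '打豆豆'])
print(scores)   # roughly [0.4836, -0.4170, 0]
```

On this reading a negative score carries no special meaning beyond "the only shared term is too common in this small collection to count as evidence". With more documents, or an idf floored at zero, the value would not go negative.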

opened by FranklinBao 0
CWS Standard for the Default Version

In your publication you demonstrate the scores of your segmentation algorithm on various datasets (PKU, MSR, AS etc.). However, these datasets differ in the standards they are based on. What is the default dataset (=> CWS standard) for the default version of snownlp?

opened by Kiryukhasemenov 0

Owner

Rui Wang

GitHub

Chinese real time voice cloning (VC) and Chinese text to speech (TTS).

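Chinese real time voice cloning (VC) and Chinese text to speech (TTS). An easy-to-use Chinese voice cloning and speech synthesis system, comprising a speech encoder, a speech synthesizer, a vocoder, and a visualization module.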

6 Nov 8, 2022

Python library for processing Chinese text

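SnowNLP: Simplified Chinese Text Processing SnowNLP is a library written in Python that makes it easy to process Chinese text. It was inspired by TextBlob. Since most natural language processing libraries today target English, I wrote a library that makes Chinese easy to handle, and unlike TextBlob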

6k Jan 2, 2023

Text to speech is a process to convert any text into voice. Text to speech project takes words on digital devices and convert them into audio. Here I have used Google-text-to-speech library popularly known as gTTS library to convert text file to .mp3 file. Hope you like my project!

Text to speech (using Python) Text to speech is a process to convert any text into voice. Text to speech project takes words on digital devices and co

19 Jun 30, 2022

vits chinese, tts chinese, tts mandarin

vits chinese, tts chinese, tts mandarin The easiest-to-train speech synthesis system with the best audio quality to date

12 Dec 14, 2022

Python-zhuyin - An open source Python library that provides a unified interface for converting between Chinese pinyin and Zhuyin (bopomofo)

2 Dec 29, 2022

A fast Text-to-Speech (TTS) model. Works well for English, Mandarin/Chinese, Japanese, Korean, Russian and Tibetan (tested so far).

Simplified Chinese | English Parallel speech synthesis [TOC] What's new 2021/04/20 Merged the wavegan branch into the main branch and deleted the wavegan branch! 2021/04/13 Created the encoder branch for developing the voice style transfer module! 2021/04/13 The softdtw branch supports using Sof

161 Dec 19, 2022

Easy-to-use CPM for Chinese text generation

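CPM Project description: CPM (Chinese Pretrained Models) is a large-scale Chinese pretrained model released by the Beijing Academy of Artificial Intelligence and Tsinghua University. Three model sizes were officially released, with 109M, 334M, and 2.6B parameters; users must apply and pass review before they can download them. Because the original project has to account for training and using large models, it requires installing fairly complex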

382 Jan 7, 2023

A demo for end-to-end English and Chinese text spotting using ABCNet.

ABCNet_Chinese A demo for end-to-end English and Chinese text spotting using ABCNet. This is an old model that was trained a long ago, which serves as

45 Oct 4, 2022

DomainWordsDict, Chinese words dict that contains more than 68 domains, which can be used as text classification、knowledge enhance task

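DomainWordsDict, a Chinese word dictionary covering more than 68 domains, which can be used for text classification and knowledge enhancement tasks. A specialized dictionary knowledge base covering 68 domains and 9.16 million words in total, usable for natural language processing applications such as text classification, knowledge enhancement, and domain vocabulary expansion.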

357 Dec 24, 2022

2021 AI CUP Competition on Traditional Chinese Scene Text Recognition - Intermediate Contest

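Traditional Chinese scene text recognition: code documentation. Team: 這就是我. Members: 蔣明憲, 唐碩謙, 黃玥菱, 林冠霆, 蕭靖騰. Contents: environment, package installation, folder layout, preprocessing (creating detection training annotation files), preprocessing (creating classification training samples). part.py: crops classification training samples from the JSON. Class.py: sorts the cropped samples into folders by character.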

3 Jan 14, 2022

Chinese segmentation library

What is loso? loso is a Chinese segmentation system written in Python. It was developed by Victor Lin ([email protected]) for Plurk Inc. Copyright &

82 Jun 28, 2022

Text-Summarization-using-NLP - Text Summarization using NLP to fetch BBC News Article and summarize its text and also it includes custom article Summarization

Text-Summarization-using-NLP Text Summarization using NLP to fetch BBC News Arti

21 Aug 6, 2022

