Chinese Characters to Pinyin (pypinyin)

Huang Huang

Last update: Jan 3, 2023

Related tags

Text Processing python python3 pinyin chinese hanzi python2 pypinyin hanzi-pinyin

Overview

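A tool for converting Chinese characters to pinyin (Python version)

Converts Chinese characters to pinyin. Useful for phonetic annotation, sorting, and search (Russian translation).

Based on hotoo/pinyin.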

Documentation: http://pypinyin.rtfd.io/
GitHub: https://github.com/mozillazg/python-pinyin
License: MIT license
PyPI: https://pypi.org/project/pypinyin
Python version: 2.7, pypy, pypy3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9

Contents

Features
Installation
Usage
Documentation
FAQ
Pinyin data
Related Projects

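Features

Picks the most accurate pinyin by matching against known phrases.
Supports heteronyms (characters with more than one reading).
Basic support for Traditional Chinese and for Zhuyin (Bopomofo).
Supports a variety of pinyin and Zhuyin output styles.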

Installation

$ pip install pypinyin

Usage

Python 3 (under Python 2, replace '中心' with u'中心'):

>>> from pypinyin import pinyin, lazy_pinyin, Style
>>> pinyin('中心')
[['zhōng'], ['xīn']]
>>> pinyin('中心', heteronym=True)  # enable heteronym mode
[['zhōng', 'zhòng'], ['xīn']]
>>> pinyin('中心', style=Style.FIRST_LETTER)  # set the pinyin style
[['z'], ['x']]
>>> pinyin('中心', style=Style.TONE2, heteronym=True)
[['zho1ng', 'zho4ng'], ['xi1n']]
>>> pinyin('中心', style=Style.TONE3, heteronym=True)
[['zhong1', 'zhong4'], ['xin1']]
>>> pinyin('中心', style=Style.BOPOMOFO)  # Zhuyin (Bopomofo) style
[['ㄓㄨㄥ'], ['ㄒㄧㄣ']]
>>> lazy_pinyin('中心')  # ignore heteronyms
['zhong', 'xin']
>>> lazy_pinyin('战略', v_to_u=True)  # use ü instead of v
['zhan', 'lüe']
# mark the neutral tone with 5
>>> lazy_pinyin('衣裳', style=Style.TONE3, neutral_tone_with_five=True)
['yi1', 'shang5']

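Notes:

By default the result does not mark which final carries the neutral tone; a neutral-tone final gets no tone mark or digit (pass neutral_tone_with_five=True to mark the neutral tone with 5).
By default, styles without tone marks use v to represent ü (pass v_to_u=True to output ü instead of v).
By default, characters that have no pinyin are output unchanged (see the documentation for how to customize the handling of such characters).

Command-line tool: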

$ pypinyin 音乐
yīn yuè
$ pypinyin -h

Documentation

Full documentation: http://pypinyin.rtfd.io/ .

For questions about developing the project itself, see the developer documentation.

FAQ

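How to reduce memory usage

If pinyin accuracy is not critical, you can save memory by setting the environment variables PYPINYIN_NO_PHRASES and PYPINYIN_NO_DICT_COPY. See the documentation for details.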
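
A minimal sketch of setting these two variables from Python instead of the shell. It assumes that the variables are read when pypinyin is imported (so they must be set before the first import) and that the value 'true' enables them; check the documentation for the exact semantics. The import is wrapped in try/except only so that the snippet also runs where pypinyin is not installed.

```python
import os

# Assumption: both variables are read at import time,
# so they have to be set before pypinyin is first imported.
os.environ['PYPINYIN_NO_PHRASES'] = 'true'
os.environ['PYPINYIN_NO_DICT_COPY'] = 'true'

try:
    from pypinyin import lazy_pinyin
except ImportError:  # pypinyin is not installed in this environment
    lazy_pinyin = None

if lazy_pinyin is not None:
    # Results may be less accurate for heteronyms in this mode.
    print(lazy_pinyin('中心'))
```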

For more, see the FAQ section of the documentation.

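Pinyin data

Pinyin for single characters comes from pinyin-data.
Pinyin for phrases comes from phrase-pinyin-data.
Initials and finals follow the Scheme for the Chinese Phonetic Alphabet (《汉语拼音方案》).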

Related Projects

hotoo/pinyin: Chinese-character-to-pinyin tool, Node.js/JavaScript version.
mozillazg/go-pinyin: Chinese-character-to-pinyin tool, Go version.
mozillazg/rust-pinyin: Chinese-character-to-pinyin tool, Rust version.

Comments

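Simplified pypinyin command-line options
This includes two changes:

Added short flags: -f, -s, -p (think "part"), -e, -m (think "multiple"; it is also the last letter of "heteronym")

Added example-based command-line options for the pinyin style; for the idea, see the Go version: https://github.com/mozillazg/go-pinyin/pull/19

The reasons for these changes: the original command line is hard to remember, and it is tedious to type (especially on a phone). Either could scare users away.

For the four BOPOMOFO and CYRILLIC options I did not know how to design examples, so I copied them as-is. I had also wanted to make the options case-insensitive and indifferent to underscores, but for everyday use this should already be convenient enough.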
opened by wdscxsj 16
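The initials Y and W

For example:

pinyin(u'中心', style=pypinyin.INITIALS)  # set the pinyin style
[['zh'], ['x']]

The initials table in the code

_INITIALS = 'b,p,m,f,d,t,n,l,g,k,h,j,q,x,zh,ch,sh,r,z,c,s,'.split(',')

has no y or w. For a character whose pinyin starts with Y or W, an empty string is returned. For example:

pinyin(u'火影忍者', style=pypinyin.INITIALS)
[[u'h'], [u''], [u'r'], [u'zh']]

From what I have read, some sources say that Y and W are not initials, so this result is expected. But it makes applications awkward to build, and the only workaround is the first-letter style. Could you add a new interface that also returns Y and W, or else document this behavior so that others do not run into it?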
question discuss

opened by ultimate010 13
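
To illustrate the behavior discussed in this issue, here is a small standalone sketch that splits a toneless syllable using the initials table quoted above, with an optional switch that treats y and w as initials. The helper name split_initial and its treat_yw_as_initials flag are hypothetical; this is not pypinyin's implementation.

```python
# Initials from the table quoted in the issue; two-letter initials come first
# so that 'zh' is matched before 'z'.
INITIALS = ['zh', 'ch', 'sh', 'b', 'p', 'm', 'f', 'd', 't', 'n', 'l',
            'g', 'k', 'h', 'j', 'q', 'x', 'r', 'z', 'c', 's']


def split_initial(syllable, treat_yw_as_initials=False):
    """Return (initial, rest) for a toneless pinyin syllable (hypothetical helper)."""
    candidates = INITIALS + (['y', 'w'] if treat_yw_as_initials else [])
    for initial in candidates:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    # No initial matched: the whole syllable is returned as the rest.
    return '', syllable


print(split_initial('zhong'))       # ('zh', 'ong')
print(split_initial('ying'))        # ('', 'ying')
print(split_initial('ying', True))  # ('y', 'ing')
```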
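New feature: when errors is a callable that returns a list, the list's elements are extended into the final result
PR description

Code:

from pypinyin import pinyin
result = pinyin('你好!🙂', errors=lambda x: [i for i in x])
print(result)

Output before the change:

[['nǐ'], ['hǎo'], ['!🙂']]

Output after the change:

[['nǐ'], ['hǎo'], ['!'], ['🙂']]

The other built-in errors handling options are unchanged.

To do

[x] Follows the code style

[x] Unit tests

[x] Documentation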
opened by howl-anderson 8
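
The before/after difference described in this PR can be mimicked with a few lines of plain Python. This is only a sketch of the merging rule (a list return value is spread into separate entries, anything else becomes a single entry); the function name apply_errors_handler is hypothetical and the code is not taken from pypinyin.

```python
def apply_errors_handler(chunks, handler):
    """Merge handler results the way the PR describes (hypothetical helper).

    chunks: runs of characters that have no pinyin.
    handler: callable applied to each run.
    """
    result = []
    for chunk in chunks:
        value = handler(chunk)
        if isinstance(value, list):
            # A list return value contributes one entry per element.
            result.extend([item] for item in value)
        else:
            # Any other return value contributes a single entry.
            result.append([value])
    return result


print(apply_errors_handler(['!🙂'], lambda x: [i for i in x]))  # [['!'], ['🙂']]
print(apply_errors_handler(['!🙂'], lambda x: x))                # [['!🙂']]
```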
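Fix: with `strict=True`, final-related styles did not handle the final `üan` correctly
PR description

Fixes the incorrect handling of the final üan reported in https://github.com/mozillazg/phrase-pinyin-data/commit/cb8423f4e89474144b3498e4a95f187ce9b3eaae#commitcomment-31005953 .

Adds more test cases based on the Scheme for the Chinese Phonetic Alphabet (《汉语拼音方案》) so that similar problems do not recur (?).

To do

[x] Follows the code style

[x] Unit tests

[x] Documentation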

bug
opened by mozillazg 8

Could IronPython be supported?

IronPython lets .NET and Python code call each other. It feels like a standalone Python environment, but it differs from Python 2.7 in some ways, mainly because all strings in IronPython are unicode. See the following code:

# IronPython 2.7.7 (2.7.7.0) on .NET 4.0.30319.42000 (32-bit)
# Type "help", "copyright", "credits" or "license" for more information.
import pypinyin
from pypinyin import pinyin
pinyin('成都')
# raises the following error
#Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "C:\Program Files (x86)\IronPython 2.7\Lib\site-packages\pypinyin\core.py", line 250, in pinyin
#  File "C:\Program Files (x86)\IronPython 2.7\Lib\site-packages\pypinyin\core.py", line 32, in seg
#  File "C:\Program Files (x86)\IronPython 2.7\Lib\site-packages\pypinyin\core.py", line 34, in seg
#  File "C:\Program Files (x86)\IronPython 2.7\Lib\site-packages\pypinyin\utils.py", line 48, in simple_seg
#AssertionError: must be unicode string or [unicode, ...] list

########################################
import sys
print unicode   #output <type 'str'>
print str       #output <type 'str'>
# The two lines above show that in IronPython, unicode and str are exactly the same type
print sys.version_info
#output sys.version_info(major=2, minor=7, micro=7, releaselevel='final', serial=0)
print sys.subversion
#output ('IronPython', '', '')

As shown above, IronPython reports its version as 2.7.7.0, so compat.py sets PY2 to True and bytes_type becomes <type 'str'>. Because unicode and str are both str in IronPython, the bytes_type check on hans in utils.py fails with "must be unicode string or [unicode, ...] list". I hope the PY2 check in compat.py can be extended as follows:

PY2 = sys.version_info < (3, 0) and sys.subversion[0] != 'IronPython'
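
The check proposed above relies on sys.subversion, which exists only on Python 2. A sketch of an equivalent check using platform.python_implementation() from the standard library follows. The name IS_IRONPYTHON is illustrative, and this is not the project's actual compat.py.

```python
import platform
import sys

# platform.python_implementation() returns a string such as
# 'CPython', 'PyPy', 'Jython' or 'IronPython'.
IS_IRONPYTHON = platform.python_implementation() == 'IronPython'

# Treat the interpreter as "Python 2 with a separate bytes type"
# only when it is a 2.x version and not IronPython.
PY2 = sys.version_info < (3, 0) and not IS_IRONPYTHON

print(platform.python_implementation(), PY2)
```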

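enhancement

opened by Lession711 8

"鸟事" is incorrectly converted to "niao sh"
Environment

Operating system (Linux/macOS/Windows): Windows 10

Python version: 3.6.5

pypinyin version: 0.33.2

Problem description

"鸟事" is incorrectly converted to "niao sh".

Steps to reproduce

from pypinyin import lazy_pinyin
lazy_pinyin("鸟事")
==> ['niao', 'sh']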
bug easy fix
opened by ledao 7
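How do I cite this work?
Environment

Operating system (Linux/macOS/Windows):

Python version:

pypinyin version:

Problem description

Hello, this is excellent work and it has helped us a lot. We want to generate speech from text, which first requires converting the text to pinyin. We tried your tool and it works better than the others, so we plan to use it. Our question: if this work ends up in a paper, how should we cite your project?

Steps to reproduce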
question
opened by fuzixiansheng 6
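Re-segment user-supplied, already segmented input to improve accuracy
PR description

Re-segment input that the user has already segmented, to improve accuracy: the user's segments may have no matching phrase data, while the re-segmented pieces may.

Take the sentence 你要重新考虑 as an example:

User segmentation: ['你', '要', '重新考虑']
After re-segmentation: ['你', '要', '重新', '考虑']

There is no phrase pinyin data for 重新考虑, but there is for 重新.

To do

[x] Follows the code style

[x] Unit tests

[x] Documentation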
opened by mozillazg 6
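
The idea behind this PR can be sketched with a toy forward-maximum-matching pass over each user-supplied segment. The function resegment and the tiny phrase set are assumptions made for illustration; pypinyin's real segmentation and phrase data are not shown here.

```python
def resegment(words, known_phrases):
    """Split segments that have no phrase data into known pieces (toy sketch)."""
    result = []
    for word in words:
        if len(word) == 1 or word in known_phrases:
            # Nothing to gain from splitting this segment further.
            result.append(word)
            continue
        i = 0
        while i < len(word):
            # Try the longest remaining piece first; fall back to one character.
            for j in range(len(word), i, -1):
                piece = word[i:j]
                if piece in known_phrases or j == i + 1:
                    result.append(piece)
                    i = j
                    break
    return result


# Hypothetical phrase data: '重新考虑' is absent, its two halves are present.
phrases = {'重新', '考虑'}
print(resegment(['你', '要', '重新考虑'], phrases))
# ['你', '要', '重新', '考虑']
```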
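The character "〇" is not converted correctly
Environment

Operating system (Linux/macOS/Windows): Arch Linux x86_64

Python version: Python 3.6.4

pypinyin version: 0.29.0

Problem description

The character 〇 is not converted correctly.

Steps to reproduce

import pypinyin
print(pypinyin.lazy_pinyin("〇"))  # output: ['〇']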
enhancement
opened by sidgwick 6

The character 「阿」 gets no pinyin inside a phrase

pypinyin 0.26.0

>>> from pypinyin import pinyin, lazy_pinyin, Style
>>> word = '穆彰阿'
>>> pinyin(word,  heteronym=True)
[['mù'], ['zhāng']]  # 阿 is missing here
>>> pinyin(word,  heteronym=False)
[['mù'], ['zhāng']]  # 阿 is missing here
>>> word = '穆彰'
>>> pinyin(word,  heteronym=False)
[['mù'], ['zhāng']]
>>> pinyin(word,  heteronym=True)
[['mù'], ['zhāng']]
>>> word = '彰阿'  # 阿 is not missing here
>>> pinyin(word,  heteronym=True)
[['zhāng'], ['ā', 'ē', 'ě', 'ǎ', 'à', 'a']]

bug

opened by imdreamrunner 6

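Cannot install correctly on Python 2.7.5

pip install pypinyin fails while compiling the source. The last line printed is byte-compiling build/bdist.linux-x86_64/egg/pypinyin/phrases_dict.py to phrases_dict.pyc; no error is reported at that point, the process just shows killed. Installing from a downloaded source archive fails the same way. Any help would be appreciated.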
question

opened by liuxiawei 6
Initial and final of the character [嗯]
Environment

Operating system (Linux/macOS/Windows): Linux (Ubuntu 18.04)

Python version: 3.8.1

pypinyin version: 0.47.0

Problem description

At first I thought the pinyin for [嗯] was wrong; searching the issues, I found #109. But when I tried to get the initial and final of [嗯], it seems they cannot be extracted. With neutral_tone_with_five=True the final is 5, which feels odd.

Steps to reproduce

Python 3.8.1 (default, Jan 8 2020, 22:29:32) [GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> from pypinyin import pinyin, Style
>>> txt = '嗯'
>>> shengmu = pinyin(txt, style=Style.INITIALS, neutral_tone_with_five=True)
>>> yunmu = pinyin(txt, style=Style.FINALS_TONE3, neutral_tone_with_five=True)
>>> shengmu
[['']]
>>> yunmu
[['5']]
>>> yunmu = pinyin(txt, style=Style.FINALS_TONE3)
>>> yunmu
[['']]

discuss
opened by JiaYK 5
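Add a function that outputs the result in groups
Problem description

Add a function that outputs the result in groups (grouped by word, by erhua, and by pinyin joined with an apostrophe separator):

>>> xxx('你好吗？')
[
    {
        "hanzi": "你好",
        "pinyin": ["ní hǎo"],
    },
    {
        "hanzi": "吗",
        "pinyin": ["ma"],
    },
    {
        "hanzi": "？",
        "pinyin": [],
    },
]
>>> xxx('西安')
[
    {
        "hanzi": "西安",
        "pinyin": ["xi'an"],
    },
]
>>> xxx('花儿')
[
    {
        "hanzi": "花儿",
        "pinyin": ["huar"],
    },
]

Use cases:

Displaying pinyin annotations on Chinese characters with the <ruby> tag: 你好 (níhǎo)吗(ma)？ (For this scenario, fetching the pinyin with a JavaScript library on the front end is the more recommended approach.)

Handling the display of erhua and of joined pinyin such as xi'an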

#245
new feature
opened by mozillazg 0
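
To make the <ruby> use case concrete, here is a sketch that renders grouped data in the shape proposed above as HTML. It takes the groups as input rather than computing them, since xxx in the proposal is only a placeholder; the helper name to_ruby is hypothetical.

```python
from html import escape


def to_ruby(groups):
    """Render [{'hanzi': ..., 'pinyin': [...]}, ...] as HTML ruby markup (sketch)."""
    parts = []
    for group in groups:
        hanzi = escape(group['hanzi'])
        if group['pinyin']:
            reading = escape(' '.join(group['pinyin']))
            parts.append('<ruby>{}<rt>{}</rt></ruby>'.format(hanzi, reading))
        else:
            # Punctuation and other items without pinyin are emitted as plain text.
            parts.append(hanzi)
    return ''.join(parts)


# Data copied from the proposal's first example.
groups = [
    {'hanzi': '你好', 'pinyin': ['ní hǎo']},
    {'hanzi': '吗', 'pinyin': ['ma']},
    {'hanzi': '？', 'pinyin': []},
]
print(to_ruby(groups))
```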
Third-tone sandhi errors
Environment

Operating system (Linux/macOS/Windows): Windows

Python version: 3.10

pypinyin version: 0.46.0

Problem description

Some words returned by lazy_pinyin(word, style=Style.TONE3, tone_sandhi=True) have not had third-tone sandhi applied. For example:
lazy_pinyin('永远', style=Style.TONE3, tone_sandhi=True) returns ['yong3', 'yuan3'] (expected yong2yuan3)
lazy_pinyin('两手', style=Style.TONE3, tone_sandhi=True) returns ['liang3', 'shou3'] (expected liang2shou3)
lazy_pinyin('辗转反侧', style=Style.TONE3, tone_sandhi=True) returns ['zhan3', 'zhuan3', 'fan3', 'ce4'] (expected zhan2zhuan2fan3ce4)

Other words, such as '你好' (ni2hao3), come out correctly.

Steps to reproduce

from pypinyin import pinyin, lazy_pinyin, Style
lazy_pinyin('辗转反侧', style=Style.TONE3, tone_sandhi=True)
['zhan3', 'zhuan3', 'fan3', 'ce4']
enhancement
opened by YL-Huo 3
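
The outputs this issue expects match a naive rule: a third tone directly followed by another third tone is read as a second tone. A sketch of that rule on TONE3-style strings follows. It ignores word boundaries and prosody, the helper name third_tone_sandhi is hypothetical, and it is not pypinyin's tone_sandhi implementation.

```python
def third_tone_sandhi(syllables):
    """Apply a naive third-tone sandhi rule to TONE3-style syllables (sketch)."""
    result = list(syllables)
    for i in range(len(syllables) - 1):
        # Decide from the original tones, so a run of third tones
        # changes every syllable except the last one in the run.
        if syllables[i].endswith('3') and syllables[i + 1].endswith('3'):
            result[i] = syllables[i][:-1] + '2'
    return result


print(third_tone_sandhi(['yong3', 'yuan3']))
# ['yong2', 'yuan3']
print(third_tone_sandhi(['zhan3', 'zhuan3', 'fan3', 'ce4']))
# ['zhan2', 'zhuan2', 'fan3', 'ce4']
```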
Is it possible to split Pinyin into consonant and vowel?

The pronunciation of every syllable in Mandarin Chinese can be expressed as a combination of two parts, called the consonant and the vowel. Is it possible to use your great project to split pinyin this way?

opened by shipleyxie 1
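Incorrect pinyin prediction for a single character
Environment

Operating system (Linux/macOS/Windows): CentOS7

Python version: 3.6.5

pypinyin version: 0.4.3

Problem description

The pinyin that pypinyin gives for a whole sentence differs from what it gives character by character. Sometimes the error is severe and is not just a heteronym mistake.

Steps to reproduce

pypinyin.pinyin("淡豆豉")
pypinyin.pinyin("豉") ---> shi4

豉 is not a heteronym and should not have the reading shi4.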
question
opened by JohnHerry 7

Releases(v0.47.1)

v0.47.1(Aug 26, 2022)
v0.47.0(Jul 30, 2022)
v0.46.0(Feb 12, 2022)
v0.45.0(Jan 23, 2022)
v0.44.0(Dec 12, 2021)
v0.43.0(Oct 6, 2021)
v0.42.1(Oct 6, 2021)
v0.42.0(Jun 14, 2021)
v0.41.0(Mar 13, 2021)
v0.40.0(Nov 22, 2020)
v0.39.1(Oct 8, 2020)
v0.39.0(Aug 16, 2020)
v0.38.1(Jul 5, 2020)
v0.38.0(Jun 7, 2020)
v0.37.0(Feb 9, 2020)
v0.36.0(Oct 28, 2019)
v0.35.4(Jul 13, 2019)
v0.35.3(May 11, 2019)
v0.35.2(Apr 6, 2019)
v0.35.1(Mar 2, 2019)
v0.35.0(Feb 24, 2019)
v0.34.1(Dec 30, 2018)
v0.34.0(Dec 26, 2018)
v0.33.2(Dec 26, 2018)
v0.33.1(Dec 26, 2018)
v0.33.0(Dec 26, 2018)
v0.32.0(Dec 26, 2018)
v0.31.0(Dec 26, 2018)
v0.30.1(Dec 26, 2018)
v0.30.0(Dec 26, 2018)

Owner

Huang Huang
