100+ Chinese Word Vectors 上百种预训练中文词向量

Overview

Chinese Word Vectors 中文词向量

This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse), context features (word, ngram, character, and more), and corpora. One can easily obtain pre-trained vectors with different properties and use them for downstream tasks.

Moreover, we provide a Chinese analogical reasoning dataset CA8 and an evaluation toolkit for users to evaluate the quality of their word vectors.

Reference

Please cite the following paper if you use these embeddings or the CA8 dataset.

Shen Li, Zhe Zhao, Renfen Hu, Wensi Li, Tao Liu, Xiaoyong Du, Analogical Reasoning on Chinese Morphological and Semantic Relations, ACL 2018.

@InProceedings{P18-2023,
  author =  "Li, Shen
    and Zhao, Zhe
    and Hu, Renfen
    and Li, Wensi
    and Liu, Tao
    and Du, Xiaoyong",
  title =   "Analogical Reasoning on Chinese Morphological and Semantic Relations",
  booktitle =   "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
  year =  "2018",
  publisher =   "Association for Computational Linguistics",
  pages =   "138--143",
  location =  "Melbourne, Australia",
  url =   "http://aclweb.org/anthology/P18-2023"
}

 

A detailed analysis of the relation between the intrinsic and extrinsic evaluations of Chinese word embeddings is shown in the paper:

Yuanyuan Qiu, Hongzheng Li, Shen Li, Yingdi Jiang, Renfen Hu, Lijiao Yang. Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings. Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, Cham, 2018. 209-221. (CCL & NLP-NABD 2018 Best Paper)

@incollection{qiu2018revisiting,
  title={Revisiting Correlations between Intrinsic and Extrinsic Evaluations of Word Embeddings},
  author={Qiu, Yuanyuan and Li, Hongzheng and Li, Shen and Jiang, Yingdi and Hu, Renfen and Yang, Lijiao},
  booktitle={Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data},
  pages={209--221},
  year={2018},
  publisher={Springer}
}

Format

The pre-trained vector files are in text format. Each line contains a word followed by its vector, with values separated by spaces. The first line holds the metadata: the first number is the number of words in the file and the second is the vector dimension.
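The layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the project's toolkit: the function name `load_dense_vectors` and the in-memory sample are illustrative assumptions, and real files are assumed to be UTF-8 encoded.

```python
import io


def load_dense_vectors(lines):
    """Parse word2vec-style text: a 'count dim' header, then 'word v1 v2 ...' lines."""
    it = iter(lines)
    header = next(it).split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return count, dim, vectors


# Tiny made-up sample in the described format (2 words, 3 dimensions).
sample = io.StringIO("2 3\n中国 0.1 0.2 0.3\n北京 -0.4 0.5 0.6\n")
count, dim, vectors = load_dense_vectors(sample)
print(count, dim, vectors["北京"])
```

For a downloaded file, the same function can be given an open file handle, for example `open(path, encoding="utf-8")`. Because the layout matches the word2vec text format, gensim's `KeyedVectors.load_word2vec_format(path, binary=False)` should also be able to read it.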

Besides dense word vectors (trained with SGNS), we also provide sparse vectors (trained with PPMI). They use the same format as liblinear, where the number before the ":" is the dimension index and the number after the ":" is the value.
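A sparse entry in this index:value style can be parsed as sketched below. This is only an illustration: it assumes each line starts with the word and is followed by whitespace-separated `index:value` pairs, and the helper name `parse_sparse_line` and the sample line are made up.

```python
def parse_sparse_line(line):
    """Split 'word i:v i:v ...' into the word and a {index: value} dict."""
    parts = line.split()
    word, pairs = parts[0], parts[1:]
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return word, features


# Hypothetical line: only three dimensions are non-zero.
word, features = parse_sparse_line("北京 3:1.25 17:0.5 42:2.0")
print(word, features)
```

Storing only the non-zero dimensions in a dictionary keeps memory use proportional to the number of observed features rather than to the full feature space.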

Pre-trained Chinese Word Vectors

Basic Settings

                                       
| Window Size | Dynamic Window | Sub-sampling | Low-Frequency Word | Iteration | Negative Sampling* |
|---|---|---|---|---|---|
| 5 | Yes | 1e-5 | 10 | 5 | 5 |

*Only for SGNS.

Various Domains

Chinese Word Vectors trained with different representations, context features, and corpora.

Word2vec / Skip-Gram with Negative Sampling (SGNS)

| Corpus \ Context Features | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | 300d | 300d | 300d | 300d / PWD: 5555 |
| Wikipedia_zh 中文维基百科 | 300d | 300d | 300d | 300d |
| People's Daily News 人民日报 | 300d | 300d | 300d | 300d |
| Sogou News 搜狗新闻 | 300d | 300d | 300d | 300d |
| Financial News 金融新闻 | 300d | 300d | 300d | 300d |
| Zhihu_QA 知乎问答 | 300d | 300d | 300d | 300d |
| Weibo 微博 | 300d | 300d | 300d | 300d |
| Literature 文学作品 | 300d | 300d / PWD: z5b4 | 300d | 300d / PWD: yenb |
| Complete Library in Four Sections 四库全书* | 300d | 300d | NAN | NAN |
| Mixed-large 综合 (Baidu Netdisk / Google Drive) | 300d / 300d | 300d / 300d | 300d / 300d | 300d / 300d |

Positive Pointwise Mutual Information (PPMI)

| Corpus \ Context Features | Word | Word + Ngram | Word + Character | Word + Character + Ngram |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | Sparse | Sparse | Sparse | Sparse |
| Wikipedia_zh 中文维基百科 | Sparse | Sparse | Sparse | Sparse |
| People's Daily News 人民日报 | Sparse | Sparse | Sparse | Sparse |
| Sogou News 搜狗新闻 | Sparse | Sparse | Sparse | Sparse |
| Financial News 金融新闻 | Sparse | Sparse | Sparse | Sparse |
| Zhihu_QA 知乎问答 | Sparse | Sparse | Sparse | Sparse |
| Weibo 微博 | Sparse | Sparse | Sparse | Sparse |
| Literature 文学作品 | Sparse | Sparse | Sparse | Sparse |
| Complete Library in Four Sections 四库全书* | Sparse | Sparse | NAN | NAN |
| Mixed-large 综合 | Sparse | Sparse | Sparse | Sparse |

*Character embeddings are provided, since most Hanzi are words in their own right in archaic Chinese.

Various Co-occurrence Information

We release word vectors trained on different co-occurrence statistics. Target and context vectors are often called input and output vectors in related papers.

In this part, one can obtain vectors of arbitrary linguistic units beyond words. For example, character vectors are found in the context vectors of the word-character co-occurrence type.

All vectors are trained by SGNS on Baidu Encyclopedia.

                                                       
| Feature | Co-occurrence Type | Target Word Vectors | Context Word Vectors |
|---|---|---|---|
| Word | Word → Word | 300d | 300d |
| Ngram | Word → Ngram (1-2) | 300d | 300d |
| Ngram | Word → Ngram (1-3) | 300d | 300d |
| Ngram | Ngram (1-2) → Ngram (1-2) | 300d | 300d |
| Character | Word → Character (1) | 300d | 300d |
| Character | Word → Character (1-2) | 300d | 300d |
| Character | Word → Character (1-4) | 300d | 300d |
| Radical | Radical | 300d | 300d |
| Position | Word → Word (left/right) | 300d | 300d |
| Position | Word → Word (distance) | 300d | 300d |
| Global | Word → Text | 300d | 300d |
| Syntactic Feature | Word → POS | 300d | 300d |
| Syntactic Feature | Word → Dependency | 300d | 300d |

Representations

Existing word representation methods fall into two classes: dense and sparse representations. The SGNS model (a model in the word2vec toolkit) and the PPMI model are typical methods of these two classes, respectively. SGNS trains low-dimensional real-valued (dense) vectors through a shallow neural network; it is also called a neural embedding method. The PPMI model is a sparse bag-of-features representation weighted by the positive pointwise mutual information (PPMI) scheme.
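To make the PPMI weighting concrete, the standard definition is PPMI(w, c) = max(0, log(P(w, c) / (P(w) P(c)))), with probabilities estimated from co-occurrence counts. The sketch below applies that textbook formula to made-up counts. The function name `ppmi`, the toy counts, and the use of the natural logarithm are assumptions for illustration and do not describe how the released vectors were actually computed.

```python
import math
from collections import Counter


def ppmi(pair_counts):
    """Map (word, context) counts to their positive PMI weights."""
    total = sum(pair_counts.values())
    word_counts, context_counts = Counter(), Counter()
    for (w, c), n in pair_counts.items():
        word_counts[w] += n
        context_counts[c] += n
    weights = {}
    for (w, c), n in pair_counts.items():
        # PMI = log( P(w, c) / (P(w) * P(c)) ), written in terms of raw counts.
        pmi = math.log(n * total / (word_counts[w] * context_counts[c]))
        if pmi > 0:  # keep positive values only, which makes the result sparse
            weights[(w, c)] = pmi
    return weights


# Made-up co-occurrence counts for two words and two contexts.
counts = {("北京", "首都"): 8, ("北京", "的"): 2, ("苹果", "的"): 8, ("苹果", "首都"): 2}
weights = ppmi(counts)
print(weights)
```

Pairs that co-occur less often than chance would predict get no entry at all, which is why PPMI vectors are sparse.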

Context Features

Three context features are commonly used in the word embedding literature: word, ngram, and character. Most word representation methods essentially exploit word-word co-occurrence statistics, that is, they use words as context features (word feature). Inspired by the language modeling problem, we introduce ngram features into the context: both word-word and word-ngram co-occurrence statistics are used for training (ngram feature). In Chinese, characters (Hanzi) often convey strong semantics, so we also use word-word and word-character co-occurrence statistics for learning word vectors. The length of character-level ngrams ranges from 1 to 4 (character feature).
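As an illustration of what character-level ngrams of length 1 to 4 look like, the sketch below enumerates them for a single word. The helper `char_ngrams` is hypothetical. It only shows the units involved, not how ngram2vec pairs them with target words during training.

```python
def char_ngrams(word, min_n=1, max_n=4):
    """List every character ngram of the word with length min_n..max_n."""
    return [word[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(word) - n + 1)]


print(char_ngrams("清华大学"))
```

A four-character word yields ten ngrams: four single characters, three bigrams, two trigrams, and the word itself.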

Besides word, ngram, and character, other features also have a substantial influence on the properties of word vectors. For example, using the entire text as a context feature can introduce more topic information into word vectors, and using dependency parses as context features can add syntactic constraints. 17 co-occurrence types are considered in this project.

Corpus

We made great efforts to collect corpora across various domains. All text data are preprocessed by removing HTML and XML tags. Only the plain text is kept, and HanLP (v_1.5.3) is used for word segmentation. In addition, traditional Chinese characters are converted into simplified characters with Open Chinese Convert (OpenCC). Detailed information on the corpora is listed below:

| Corpus | Size | Tokens | Vocabulary Size | Description |
|---|---|---|---|---|
| Baidu Encyclopedia 百度百科 | 4.1G | 745M | 5422K | Chinese Encyclopedia data from https://baike.baidu.com/ |
| Wikipedia_zh 中文维基百科 | 1.3G | 223M | 2129K | Chinese Wikipedia data from https://dumps.wikimedia.org/ |
| People's Daily News 人民日报 | 3.9G | 668M | 1664K | News data from People's Daily (1946-2017) http://data.people.com.cn/ |
| Sogou News 搜狗新闻 | 3.7G | 649M | 1226K | News data provided by Sogou labs http://www.sogou.com/labs/ |
| Financial News 金融新闻 | 6.2G | 1055M | 2785K | Financial news collected from multiple news websites |
| Zhihu_QA 知乎问答 | 2.1G | 384M | 1117K | Chinese QA data from https://www.zhihu.com/ |
| Weibo 微博 | 0.73G | 136M | 850K | Chinese microblog data provided by NLPIR Lab http://www.nlpir.org/wordpress/download/weibo.7z |
| Literature 文学作品 | 0.93G | 177M | 702K | 8599 modern Chinese literature works |
| Mixed-large 综合 | 22.6G | 4037M | 10653K | We build the large corpus by merging the above corpora. |
| Complete Library in Four Sections 四库全书 | 1.5G | 714M | 21.8K | The largest collection of texts in pre-modern China. |

All words are considered, including low-frequency words.

Toolkits

All word vectors are trained with the ngram2vec toolkit. The ngram2vec toolkit is a superset of the word2vec and fasttext toolkits, supporting arbitrary context features and models.

Chinese Word Analogy Benchmarks

The quality of word vectors is often evaluated with analogy question tasks. In this project, two benchmarks are used for evaluation. The first is CA-translated, where most analogy questions are directly translated from an English benchmark. Although CA-translated has been widely used in Chinese word embedding papers, it contains only three types of semantic questions and covers 134 Chinese words. In contrast, CA8 is specifically designed for the Chinese language. It contains 17813 analogy questions and covers comprehensive morphological and semantic relations. CA-translated, CA8, and their detailed descriptions are provided in the testsets folder.
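An analogy question of the form "a is to b as c is to ?" is commonly answered with the additive rule known as 3CosAdd: pick the word d that maximizes cos(d, b) - cos(d, a) + cos(d, c). The sketch below shows that rule on hand-made toy vectors. The function names and the vectors are illustrative assumptions, and it is not taken from the project's evaluation scripts.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word d maximizing cos(d, b) - cos(d, a) + cos(d, c)."""
    best_word, best_score = None, float("-inf")
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the question words themselves are not valid answers
        score = cosine(vec, vectors[b]) - cosine(vec, vectors[a]) + cosine(vec, vectors[c])
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Toy 3-d vectors: the third dimension plays the role of "is a capital city".
toy = {
    "中国": [1.0, 0.0, 0.0],
    "北京": [1.0, 0.0, 1.0],
    "日本": [0.0, 1.0, 0.0],
    "东京": [0.0, 1.0, 1.0],
    "苹果": [-1.0, -1.0, 0.0],
}
print(solve_analogy(toy, "中国", "北京", "日本"))
```

Accuracy on a benchmark is then the fraction of questions for which the returned word matches the expected answer.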

Evaluation Toolkit

We provide an evaluation toolkit in the evaluation folder.

Run the following commands to evaluate dense vectors.

$ python ana_eval_dense.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_dense.py -v <vector.txt> -a CA8/semantic.txt

Run the following commands to evaluate sparse vectors.

$ python ana_eval_sparse.py -v <vector.txt> -a CA8/morphological.txt
$ python ana_eval_sparse.py -v <vector.txt> -a CA8/semantic.txt