A framework for cleaning Chinese dialog data

Yida

Last update: Dec 20, 2022

Overview

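This project is a multi-threaded framework for cleaning dialog data from sources such as Zhihu, Weibo, and Tieba. It is still fairly rough; bug reports and optimizations are welcome, for example a better regex or a suffix-based algorithm for the function that reduces repeated phrases within a sentence. The code is still being improved, and comments and attributions for some functions are yet to be added.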

Directory structure

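--scripts: run scripts
  ---run.sh: runs run_dist.py with a few rules I selected
--src: main directory of the cleaning framework
  ---inputters: dataloader and utility functions for reading and writing data
  ---rules: rule functions at each level
  ---single_filter.py: main program for a single thread, called by run_dist.py; loads and processes a single piece of data, then saves the filtered data and the dirty data
--tool_data: blacklist dictionaries, one word per line
--run_dist.py: main entry point; loads the dataloader, loads the blacklists, and creates the thread pool
--utils: data statistics and result checking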
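
As a rough illustration of how the pieces above could fit together, the sketch below splits sessions into batches and hands each batch to a worker in a thread pool. It is not the code in run_dist.py: `make_batches`, `clean_batch`, and `run_pool` are hypothetical names, the "empty utterance" rule is a toy stand-in for the real rules, and only the `n_p` and `batch_size` parameter names come from this README.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(sessions, batch_size):
    """Split a list of sessions into consecutive batches of at most batch_size."""
    return [sessions[i:i + batch_size] for i in range(0, len(sessions), batch_size)]


def clean_batch(batch):
    """Hypothetical stand-in for single_filter.py: returns (clean, dirty) sessions."""
    clean, dirty = [], []
    for session in batch:
        # Toy rule for illustration only: a session with a blank utterance is dirty.
        if all(utter.strip() for utter in session):
            clean.append(session)
        else:
            dirty.append(session)
    return clean, dirty


def run_pool(sessions, n_p=2, batch_size=2):
    """Dispatch batches to n_p workers and merge the results in input order."""
    clean_all, dirty_all = [], []
    with ThreadPoolExecutor(max_workers=n_p) as pool:
        # pool.map yields results in the same order as the submitted batches.
        for clean, dirty in pool.map(clean_batch, make_batches(sessions, batch_size)):
            clean_all.extend(clean)
            dirty_all.extend(dirty)
    return clean_all, dirty_all
```

Keeping the clean and dirty outputs separate mirrors the `out_dir` / `dirty_dir` split described under Arguments, so filtered-out data can be inspected later.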

Run and save the log

bash ./scripts/run.sh 2>&1 | tee -a cleaning.log

Rules

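The rules cover most of the cleaning rules used in current papers:

1 Blacklist filtering, including special characters and profanity
2 Emoji
3 Privacy filtering for emails, phone numbers, etc.; person names are replaced with NAME1, NAME2, ...
4 URL filtering
5 Unicode-related fixes
6 Deduplication: shrinking repeated words, filtering out sentences identical to their context, and removing duplicate dialogs
7 Removal of ads and generic responses, as done in Meena and DialoGPT

Noise identified by the rules above is erased from the sentence where possible.
Where it cannot be erased, the sentence is dropped: a single-turn dialog is discarded entirely, while a multi-turn dialog is split at that sentence.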
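
The drop-or-split behaviour described above can be sketched as follows. This is a minimal illustration, not the framework's implementation: `split_dialog`, the `is_bad` predicate, and the `min_turns` cutoff are all assumptions introduced here.

```python
def split_dialog(utterances, is_bad, min_turns=2):
    """Cut a dialog at every utterance flagged by is_bad.

    Flagged utterances are dropped, and only the remaining segments with at
    least min_turns utterances are kept (min_turns is an assumed cutoff).
    """
    segments, current = [], []
    for utter in utterances:
        if is_bad(utter):
            # The bad utterance acts as a split point and is itself discarded.
            if len(current) >= min_turns:
                segments.append(current)
            current = []
        else:
            current.append(utter)
    if len(current) >= min_turns:
        segments.append(current)
    return segments
```

With `min_turns=2`, a single-turn dialog (one context, one response) that contains a bad sentence leaves at most one utterance and is therefore discarded, while a longer dialog yields the usable pieces on either side of the bad sentence.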

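NOTE THAT:
1 When changing a rule, check whether it affects other rules; the rules must be applied in a specific order.
2 Blacklists (person names, special topics, etc.) can be placed under ./tool_data/ as needed. File names are configurable; see the dataloader in ./run_dist.py. Blacklists can be found on GitHub, e.g. https://github.com/fighting41love/funNLP
3 Test examples will be given above each function, and expected outputs below it.
4 The arguments currently used in run.sh are the features I am using myself.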
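
For reference, a blacklist file with one word per line could be loaded and applied roughly as below. The function names are hypothetical and the plain substring match is an assumption; the actual loading logic lives in the dataloader in ./run_dist.py.

```python
def parse_blacklist(lines):
    """Build a set of words from an iterable of lines, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def load_blacklist(path):
    """Read a blacklist file that holds one word per line (UTF-8 assumed)."""
    with open(path, encoding="utf-8") as f:
        return parse_blacklist(f)


def contains_blacklisted(utterance, blacklist):
    """Substring match: True if any blacklisted word occurs in the utterance."""
    return any(word in utterance for word in blacklist)
```

Substring matching is the simplest option and can over-match inside longer words; matching on segmented tokens instead is the kind of trade-off the `no_str_blacklist` and `no_word_blacklist` arguments appear to expose.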

Arguments

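Argument	Description
n_p	Number of processes
batch_size	Maximum number of sessions handled by a single process
tool_dir	Directory containing tool data (e.g. blacklists)
out_dir	Output directory for the cleaned files
raw_dir	Directory containing the files to be processed
dirty_dir	Directory for the dirty data that was filtered out; nothing is stored if empty
:---------------	:-------------------
split_multi_repost	Split Weibo repost data of the form "//@aaa XXXX //@bbb XXX" into multiple sentences
no_utter_dup	Drop the dialog if context == response
re_name	Replace person names with NAME1, NAME2, ...
no_ad	Remove dialogs that may be ads (the same response paired with multiple contexts); borrowed from a paper
de_generic_dialog	Remove generic responses; borrowed from a paper
no_short_response	Remove all overly short responses at the end of a dialog
:---------------	:-------------------
bert_clean	Clean sentences using functions from BertTokenizer
cleantext_clean	Clean with clean-text (phone numbers, emails, unicode errors, etc.)
:---------------	:-------------------
no_short	Remove overly short sentences
no_long	Remove overly long sentences
de_reply_tag	Remove "回复 @XXX:" in Weibo text
de_hashtag	Remove "# XXX#" from a sentence
de_emotion	Remove ": XXX:" from a sentence
de_mention	Remove "@Cindy", "@Bob:", "@Amy:", etc. from a sentence
no_mention	Remove sentences containing @XXX
de_repost	Remove "//XXX" from a sentence
de_duplicated	Reduce repeated phrases within a sentence (to be optimized with a suffix algorithm)
de_emoji	Remove emoji (to be completed)
no_special_topic	Filter out dialogs containing words from the special-topic list
no_str_blacklist	Filter out dialogs containing blacklisted words
no_toupiao	Detect whether the text is a Weibo poll
no_specific_utter	Delete certain specific sentences
contain_zh	Delete sentences that contain no Chinese
de_single_repost_mention	Remove "@XXX:"
de_weibo_url	Remove http:\t.c
de_url	Remove URLs
de_angle	Remove angle-bracketed text whose content XX is non-Chinese
de_alpha_num	Remove long meaningless strings of digits and letters
de_specific	Remove fixed patterns from a sentence
:---------------	:-------------------
de_showall	Remove "...显示全部" in certain files
de_brackets	Remove "[XXX]" in certain files
:---------------	:-------------------
no_word_blacklist	Filter out dialogs containing blacklisted words after word segmentation
no_alpha_noise	Filter out sentences containing letter sequences that do not form English words
check_confuse_word	Save dialogs containing words from the confusion list for recall
yda_dedupl	Drop the sentence if a single word accounts for more than a threshold proportion of it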
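
To make a few of the Weibo-specific rules above concrete, here is a hedged sketch of `de_reply_tag`, `de_hashtag`, `de_mention`, `split_multi_repost`, and a repetition-ratio check in the spirit of `yda_dedupl`. The regular expressions and the 0.5 threshold are assumptions written for this example and will differ from the ones in src/rules; the ratio check also counts characters rather than segmented words to avoid a dependency on a segmenter.

```python
import re
from collections import Counter

# All patterns below are illustrative guesses, not the framework's own regexes.
REPLY_TAG = re.compile(r"回复\s*@[^\s:：]+\s*[:：]\s*")
HASHTAG = re.compile(r"#[^#]+#")
MENTION = re.compile(r"@[^\s:：@]+[:：]?")
REPOST_SEP = re.compile(r"\s*//\s*@[^\s:：]+[:：]?\s*")


def de_reply_tag(text):
    """Remove a Weibo reply prefix such as '回复 @XXX:'."""
    return REPLY_TAG.sub("", text).strip()


def de_hashtag(text):
    """Remove '#topic#' spans."""
    return HASHTAG.sub("", text).strip()


def de_mention(text):
    """Remove '@user' and '@user:' mentions."""
    return MENTION.sub("", text).strip()


def split_multi_repost(text):
    """Split a repost chain on '//@user' markers and drop empty pieces."""
    return [part.strip() for part in REPOST_SEP.split(text) if part.strip()]


def too_repetitive(utterance, threshold=0.5):
    """True if the most frequent character exceeds `threshold` of the utterance."""
    if not utterance:
        return False
    most_common_count = Counter(utterance).most_common(1)[0][1]
    return most_common_count / len(utterance) > threshold
```

Because the rules must be applied in a specific order, a function like `split_multi_repost` would have to run before `de_mention`, which would otherwise strip the `@user` markers that the split relies on.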


