CCF BDCI 2020 房产行业聊天问答匹配赛道 A榜47/2985

Overview

CCF BDCI 2020 房产行业聊天问答匹配 A榜47/2985

赛题描述详见:https://www.datafountain.cn/competitions/474

文件说明

data: 存放训练数据和测试数据以及预处理代码

model_bert.py: 网络模型结构定义

adv_train.py: 对抗训练代码

run_bert_pse_adv.py: 运行bert-wwm + 对抗训练 + 伪标签模型

run_roberta_cls_pse_reinit_adv.py: 运行roberta-large last2embedding_cls + reinit + 对抗训练 + 伪标签模型

个人方案

我的baseline是将query和answer拼接后传入预训练好的bert进行特征提取,之后将提取的特征传入一个全连接层,最后接一个softmax进行分类。

其中尝试的预训练模型有bert(谷歌),bert_wwm(哈工大版本),roberta_large(哈工大版本),xlneternie等,其中效果较好的有bert-wwm和roberta-large。之后在baseline的基础上进行了各种尝试,主要尝试有以下:

模型 线上F1
bert-wwm 0.78
bert-wwm + 对抗训练 0.783
bert-wwm + 对抗训练 + 伪标签 0.7879
roberta-large 0.774
roberta-large + reinit + 对抗训练 0.786
roberta-large + reinit+对抗训练 + 伪标签 0.7871
roberta-large last2embedding_cls + reinit + 对抗训练 + 伪标签 0.7879

对抗训练

其基本的原理呢,就是通过添加扰动构造一些对抗样本,放给模型去训练,以攻为守,提高模型在遇到对抗样本时的鲁棒性,同时一定程度也能提高模型的表现和泛化能力。

参考链接:https://zhuanlan.zhihu.com/p/91269728

伪标签

将测试数据和预测结果进行拼接,之后当成训练数据传入到模型中重新进行训练。为了减少对训练数据的原始分布的影响并增加伪标签的置信度,我只在五个采用不同预训练模型的baseline预测一致的数据中采样了6000条测试数据加入到训练集进行训练。

重新初始化

参考链接:如何让Bert在finetune小数据集时更“稳”一点 https://zhuanlan.zhihu.com/p/148720604

大致思想是靠近底部的层(靠近input)学到的是比较通用的语义方面的信息,比如词性、词法等语言学知识,而靠近顶部的层会倾向于学习到接近下游任务的知识,对于预训练来说就是类似masked word prediction、next sentence prediction任务的相关知识。当使用bert预训练模型finetune其他下游任务(比如序列标注)时,如果下游任务与预训练任务差异较大,那么bert顶层的权重所拥有的知识反而会拖累整体的finetune进程,使得模型在finetune初期产生训练不稳定的问题。

因此,我们可以在finetune时,只保留接近底部的bert权重,对于靠近顶部的层的权重,可以重新随机初始化,从头开始学习。

在本次比赛中,我只对最后roberta-large的最后五层进行重新初始化。在实验中,我发现对于bert,重新初始化会降低效果,而roberta-large则有提升。

bert 不同embedding和cls组合

思路主要是参考 CCF BDCI 2019 互联网新闻情感分析 复赛top1解决方案

参考链接:https://github.com/cxy229/BDCI2019-SENTIMENT-CLASSIFICATION

即对bert不同embedding进行组合后传入全连接层进行分类。该方案尝试时间较晚,只实验last2embedding_cls这种组合,结果也确实有提升。

模型融合

对于单模,我采用五折交叉验证,对每一个单模的五个模型结果,我尝试了相加融合和投票的方式,结果是融合相加的线上f1较高

对于不同模型,我也只是采用的相加融合的方式(由于时间问题没有尝试投票和stacking的方式)。最后a榜效果最好的是bert-wwm + 对抗训练 + 伪标签、roberta-large + reinit+对抗训练 + 伪标签、roberta-large last2embedding_cls + reinit + 对抗训练 + 伪标签 三个模型的融合,线上F1有 0.7908 , 排名47;B榜我尝试只对两个效果最好的模型进行融合,即 bert-wwm + 对抗训练 + 伪标签last2embedding_cls + reinit + 对抗训练 + 伪标签,最终F1为0.80,排名72。

总结

本次参加比赛完全是数据挖掘课程要求,也是我第一次参加大数据比赛。因为我的研究方向是图像,所以基本可以说是从零开始,写这个github只是想记录一下这一个月自己从零开始的参赛经历,也希望对同样参加类似比赛的新人有帮助。最后,希望看到了顺手给star,万分感谢。

You might also like...
Apache Flink 目录遍历漏洞批量检测 (CVE-2020-17519)

使用方法&免责声明 该脚本为Apache Flink 目录遍历漏洞批量检测 (CVE-2020-17519)。 使用方法:Python CVE-2020-17519.py urls.txt urls.txt 中每个url为一行,漏洞地址输出在vul.txt中 影响版本: Apache Flink 1

Code for ACM MM 2020 paper
Code for ACM MM 2020 paper "NOH-NMS: Improving Pedestrian Detection by Nearby Objects Hallucination"

NOH-NMS: Improving Pedestrian Detection by Nearby Objects Hallucination The offical implementation for the "NOH-NMS: Improving Pedestrian Detection by

[ECCV 2020] Reimplementation of 3DDFAv2, including face mesh, head pose, landmarks, and more.
[ECCV 2020] Reimplementation of 3DDFAv2, including face mesh, head pose, landmarks, and more.

Stable Head Pose Estimation and Landmark Regression via 3D Dense Face Reconstruction Reimplementation of (ECCV 2020) Towards Fast, Accurate and Stable

MediaPipeのPythonパッケージのサンプルです。2020/12/11時点でPython実装のある4機能(Hands、Pose、Face Mesh、Holistic)について用意しています。
MediaPipeのPythonパッケージのサンプルです。2020/12/11時点でPython実装のある4機能(Hands、Pose、Face Mesh、Holistic)について用意しています。

mediapipe-python-sample MediaPipeのPythonパッケージのサンプルです。 2020/12/11時点でPython実装のある以下4機能について用意しています。 Hands Pose Face Mesh Holistic Requirement mediapipe 0.

2020 CCF大数据与计算智能大赛-非结构化商业文本信息中隐私信息识别-第7名方案

2020CCF-NER 2020 CCF大数据与计算智能大赛-非结构化商业文本信息中隐私信息识别-第7名方案 bert base + flat + crf + fgm + swa + pu learning策略 + clue数据集 = test1单模0.906 词向量

Learning to Simulate Dynamic Environments with GameGAN (CVPR 2020)
Learning to Simulate Dynamic Environments with GameGAN (CVPR 2020)

Learning to Simulate Dynamic Environments with GameGAN PyTorch code for GameGAN Learning to Simulate Dynamic Environments with GameGAN Seung Wook Kim,

1st Place Solution to ECCV-TAO-2020: Detect and Represent Any Object for Tracking

Instead, two models for appearance modeling are included, together with the open-source BAGS model and the full set of code for inference. With this code, you can achieve around mAP@23 with TAO test set (based on our estimation).

justCTF [*] 2020 challenges sources

justCTF [*] 2020 This repo contains sources for justCTF [*] 2020 challenges hosted by justCatTheFish. TLDR: Run a challenge with ./run.sh (requires Do

AI grand challenge 2020 Repo (Speech Recognition Track)
AI grand challenge 2020 Repo (Speech Recognition Track)

KorBERT를 활용한 한국어 텍스트 기반 위협 상황인지(2020 인공지능 그랜드 챌린지) 본 프로젝트는 ETRI에서 제공된 한국어 korBERT 모델을 활용하여 폭력 기반 한국어 텍스트를 분류하는 다양한 분류 모델들을 제공합니다. 본 개발자들이 참여한 2020 인공지

Learning from Synthetic Shadows for Shadow Detection and Removal [Inoue+, IEEE TCSVT 2020].
Learning from Synthetic Shadows for Shadow Detection and Removal [Inoue+, IEEE TCSVT 2020].

Learning from Synthetic Shadows for Shadow Detection and Removal (IEEE TCSVT 2020) Overview This repo is for the paper "Learning from Synthetic Shadow

Face Identity Disentanglement via Latent Space Mapping [SIGGRAPH ASIA 2020]
Face Identity Disentanglement via Latent Space Mapping [SIGGRAPH ASIA 2020]

Face Identity Disentanglement via Latent Space Mapping Description Official Implementation of the paper Face Identity Disentanglement via Latent Space

PoC for CVE-2020-6207  (Missing Authentication Check in SAP Solution Manager)
PoC for CVE-2020-6207 (Missing Authentication Check in SAP Solution Manager)

PoC for CVE-2020-6207 (Missing Authentication Check in SAP Solution Manager) This script allows to check and exploit missing authentication checks in

🐙 Share your Github stats for 2020 on Twitter
🐙 Share your Github stats for 2020 on Twitter

Year on Github 🐙 Share your Github stats for 2020 on Twitter. This project contains a small web app that let's you share stats about your Github acti

Roadmap to becoming a machine learning engineer in 2020
Roadmap to becoming a machine learning engineer in 2020

Roadmap to becoming a machine learning engineer in 2020, inspired by web-developer-roadmap.

9th place solution in "Santa 2020 - The Candy Cane Contest"

Santa 2020 - The Candy Cane Contest My solution in this Kaggle competition "Santa 2020 - The Candy Cane Contest", 9th place. Basic Strategy In this co

Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)
Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Karate Club is an unsupervised machine learning extension library for NetworkX. Please look at the Documentation, relevant Paper, Promo Video, and Ext

Little Ball of Fur - A graph sampling extension library for NetworKit and NetworkX (CIKM 2020)
Little Ball of Fur - A graph sampling extension library for NetworKit and NetworkX (CIKM 2020)

Little Ball of Fur is a graph sampling extension library for Python. Please look at the Documentation, relevant Paper, Promo video and External Resour

An official implementation of
An official implementation of "SFNet: Learning Object-aware Semantic Correspondence" (CVPR 2019, TPAMI 2020) in PyTorch.

PyTorch implementation of SFNet This is the implementation of the paper "SFNet: Learning Object-aware Semantic Correspondence". For more information,

Unofficial implementation of
Unofficial implementation of "TTNet: Real-time temporal and spatial video analysis of table tennis" (CVPR 2020)

TTNet-Pytorch The implementation for the paper "TTNet: Real-time temporal and spatial video analysis of table tennis" An introduction of the project c

Owner
shuo
shuo
justCTF [*] 2020 challenges sources

justCTF [*] 2020 This repo contains sources for justCTF [*] 2020 challenges hosted by justCatTheFish. TLDR: Run a challenge with ./run.sh (requires Do

justCatTheFish 25 Dec 27, 2022
CCF BDCI BERT系统调优赛题baseline(Pytorch版本)

CCF BDCI BERT系统调优赛题baseline(Pytorch版本) 此版本基于Pytorch后端的huggingface进行实现。由于此实现使用了Oneflow的dataloader作为数据读入的方式,因此也需要安装Oneflow。其它框架的数据读取可以参考OneflowDataloade

Ziqi Zhou 9 Oct 13, 2022
2021 CCF BDCI 全国信息检索挑战杯(CCIR-Cup)智能人机交互自然语言理解赛道第二名参赛解决方案

2021 CCF BDCI 全国信息检索挑战杯(CCIR-Cup) 智能人机交互自然语言理解赛道第二名解决方案 比赛网址: CCIR-Cup-智能人机交互自然语言理解 1.依赖环境: python==3.8 torch==1.7.1+cu110 numpy==1.19.2 transformers=

JinXiang 22 Oct 29, 2022
1st place solution in CCF BDCI 2021 ULSEG challenge

1st place solution in CCF BDCI 2021 ULSEG challenge This is the source code of the 1st place solution for ultrasound image angioma segmentation task (

Chenxu Peng 30 Nov 22, 2022
DNSpooq - dnsmasq cache poisoning (CVE-2020-25686, CVE-2020-25684, CVE-2020-25685)

dnspooq DNSpooq PoC - dnsmasq cache poisoning (CVE-2020-25686, CVE-2020-25684, CVE-2020-25685) For educational purposes only Requirements Docker compo

Teppei Fukuda 80 Nov 28, 2022
UDP++ (ECCVW 2020 Oral), (Winner of COCO 2020 Keypoint Challenge).

UDP-Pose This is the pytorch implementation for UDP++, which won the Fisrt place in COCO Keypoint Challenge at ECCV 2020 Workshop. Top-Down Results on

null 20 Jul 29, 2022
Little Ball of Fur - A graph sampling extension library for NetworKit and NetworkX (CIKM 2020)

Little Ball of Fur is a graph sampling extension library for Python. Please look at the Documentation, relevant Paper, Promo video and External Resour

Benedek Rozemberczki 619 Dec 14, 2022
Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Karate Club is an unsupervised machine learning extension library for NetworkX. Please look at the Documentation, relevant Paper, Promo Video, and Ext

Benedek Rozemberczki 1.8k Jan 7, 2023
Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Karate Club is an unsupervised machine learning extension library for NetworkX. Please look at the Documentation, relevant Paper, Promo Video, and Ext

Benedek Rozemberczki 1.8k Jan 9, 2023
PyTorch implementation of "Conformer: Convolution-augmented Transformer for Speech Recognition" (INTERSPEECH 2020)

PyTorch implementation of Conformer: Convolution-augmented Transformer for Speech Recognition. Transformer models are good at capturing content-based

Soohwan Kim 565 Jan 4, 2023