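2020 CCF Big Data & Computing Intelligence Contest (BDCI) - Privacy Information Recognition in Unstructured Business Text - 7th Place Solution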

Overview

2020CCF-NER

BERT base + FLAT + CRF + FGM + SWA + PU learning strategy + CLUE dataset = 0.906 single-model score on test1

Word embeddings: https://github.com/Embedding/Chinese-Word-Vectors SGNS (Mixed-large 综合)

The loss mask code implements the PU learning strategy.
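The following is a minimal sketch of what a PU-learning loss mask can look like, not the repository's actual code. It assumes per-token losses and a hypothetical rule: tokens with an entity label always contribute to the loss, while unlabeled ("O") tokens, which may hide unannotated entities, contribute only with probability `keep_prob`. The names `build_pu_loss_mask`, `masked_mean_loss`, and `keep_prob` are illustrative assumptions. The model above uses a CRF, whose sequence-level loss would need the mask applied differently.

```python
import random


def build_pu_loss_mask(labels, keep_prob, rng):
    """Return a 0/1 mask over tokens (hypothetical rule, see lead-in).

    Labeled entity tokens are always kept; each unlabeled "O" token is
    kept only with probability keep_prob.
    """
    mask = []
    for label in labels:
        if label != "O":
            mask.append(1)  # labeled positives always contribute
        else:
            mask.append(1 if rng.random() < keep_prob else 0)
    return mask


def masked_mean_loss(token_losses, mask):
    """Average the per-token losses over positions where mask == 1."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(token_losses, mask)) / kept


# Toy example: the labels and losses are made up for illustration.
labels = ["O", "B-name", "I-name", "O", "O"]
token_losses = [0.9, 0.2, 0.4, 0.7, 0.5]
mask = build_pu_loss_mask(labels, keep_prob=0.5, rng=random.Random(0))
loss = masked_mean_loss(token_losses, mask)
```

With `keep_prob=1.0` this reduces to an ordinary mean over all tokens; lowering it reduces how strongly the model is penalized for predicting an entity where the annotation says "O".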

Main module versions:

python 3.6.9

torch 1.1.0

transformers 3.0.2

pytorchcrf 1.2.0

torchcontrib 0.0.2

Comments
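  • Question about how char_word_mask and part_size are computed when TextEncoder processes FLAT input

    In the code, word_mask, part_size, and related values are computed in a somewhat unusual way (the collate_fn_test method in NERModelFitting.py). Is there a rationale for this? If the goal is simply to build standard FLAT input, the result looks wrong to me. In addition, this unusual mask clearly masks out all of the regular text tokens during the Transformer computation in Model. So, if I may ask: is there a reason for handling it this way, did it just happen to produce a good score, or is the uploaded code not the final correct version? I would appreciate any guidance. Thanks.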

    opened by RacleRay 2
  • lattice start/end do not match the BERT ids

    I have a question. Here is a sample generated by the code: "text": "《别告诉我你懂PPT》《不懂项目管理还敢拼职场》《让营销更性感》的作者李治(Liz),《不懂项目管理,还敢拼职场》及《别告诉我你懂PPT》的作者"", "entities": [], "lattice": [["告诉", 2, 3], ["项目", 14, 15], ["管理", 16, 17], ["职场", 21, 22], ["营销", 26, 27], ["性感", 29, 30], ["作者", 33, 34], ["项目", 46, 47], ["管理", 48, 49], ["职场", 54, 55], ["告诉", 60, 61], ["作者", 70, 71]]}

    After bert_tokenizer, the text becomes: [101, 517, 1166, 1440, 6401, 2769, 872, 2743, 8842, 518, 517, 679, 2743, 7555, 4680, 5052, 4415, 6820, 3140, 2894, 5466, 1767, 518, 517, 6375, 5852, 7218, 3291, 2595, 2697, 518, 4638, 868, 5442, 3330, 3780, 8020, 9341, 8253, 8021, 8024, 517, 679, 2743, 7555, 4680, 5052, 4415, 8024, 6820, 3140, 2894, 5466, 1767, 518, 1350, 517, 1166, 1440, 6401, 2769, 872, 2743, 8842, 518, 4638, 868, 5442, 107, 102]

    I noticed that the lattice start and end do not line up with text_ids. For example, 项目 has start 14 and end 15, but positions 14 and 15 of text_ids do not correspond to 项目. Does this affect the results?

    (The cause is that the word PPT is tokenized into a single id.)

    opened by renmada 1
  • AttributeError: 'str' object has no attribute 'detach'

    In encoder.py, vec is a str, so calling vec.detach() raises an error. Should detach() be removed here?

        def get_bert_vec(self, text, text_mask, text_pos=None):
            if text_pos is None:
                _, _, text_vecs = self.bert(text, text_mask)
            else:
                _, _, text_vecs = self.bert(text, text_mask, position_ids=text_pos)
            text_vecs = list(text_vecs)
            if self.detach_ptm_flag:
                for i, vec in enumerate(text_vecs):
                    text_vecs[i] = vec.detach()
            return text_vecs

    opened by FLxuRu 1
  • lattice start/end do not match the text

    I have a question. Here is a sample generated by the code: "text": "《别告诉我你懂PPT》《不懂项目管理还敢拼职场》《让营销更性感》的作者李治(Liz),《不懂项目管理,还敢拼职场》及《别告诉我你懂PPT》的作者"", "entities": [], "lattice": [["告诉", 2, 3], ["项目", 14, 15], ["管理", 16, 17], ["职场", 21, 22], ["营销", 26, 27], ["性感", 29, 30], ["作者", 33, 34], ["项目", 46, 47], ["管理", 48, 49], ["职场", 54, 55], ["告诉", 60, 61], ["作者", 70, 71]]}

    After bert_tokenizer, the text becomes: [101, 517, 1166, 1440, 6401, 2769, 872, 2743, 8842, 518, 517, 679, 2743, 7555, 4680, 5052, 4415, 6820, 3140, 2894, 5466, 1767, 518, 517, 6375, 5852, 7218, 3291, 2595, 2697, 518, 4638, 868, 5442, 3330, 3780, 8020, 9341, 8253, 8021, 8024, 517, 679, 2743, 7555, 4680, 5052, 4415, 8024, 6820, 3140, 2894, 5466, 1767, 518, 1350, 517, 1166, 1440, 6401, 2769, 872, 2743, 8842, 518, 4638, 868, 5442, 107, 102]

    I noticed that the lattice start and end do not line up with text_ids. For example, 项目 has start 14 and end 15, but positions 14 and 15 of text_ids do not correspond to 项目. Does this affect the results?

    (The cause is that the word PPT is tokenized into a single id.)

    opened by renmada 0
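
The offset question raised in the two lattice issues can be illustrated with a small sketch. It is not taken from the repository and makes loud assumptions: the lattice spans are 0-based character offsets, every run of ASCII letters or digits becomes exactly one token, every other character is its own token, and a single [CLS] token is prepended (hence `offset=1`). A real WordPiece tokenizer can split long words into several pieces and drops whitespace, so `char_to_token_index` and `remap_lattice` are hypothetical helpers for illustration only.

```python
def char_to_token_index(text):
    """Map each character position to a token position.

    Assumption: a run of ASCII letters/digits is one token and every
    other character is a token of its own.
    """
    mapping = []
    token_idx = -1
    prev_ascii_alnum = False
    for ch in text:
        is_ascii_alnum = ord(ch) < 128 and ch.isalnum()
        # Start a new token unless we are continuing an ASCII run.
        if not (is_ascii_alnum and prev_ascii_alnum):
            token_idx += 1
        mapping.append(token_idx)
        prev_ascii_alnum = is_ascii_alnum
    return mapping


def remap_lattice(text, lattice, offset=1):
    """Convert character-level [word, start, end] spans to token-level spans.

    offset=1 accounts for an assumed leading [CLS] token.
    """
    mapping = char_to_token_index(text)
    return [[word, mapping[start] + offset, mapping[end] + offset]
            for word, start, end in lattice]


# Prefix of the sample text from the issue, with its first lattice entries.
text = "《别告诉我你懂PPT》《不懂项目管理还敢拼职场》"
lattice = [["告诉", 2, 3], ["项目", 14, 15], ["管理", 16, 17], ["职场", 21, 22]]
remapped = remap_lattice(text, lattice)
```

Under these assumptions 项目 moves from character span (14, 15) to token span (13, 14), which matches ids 7555 and 4680 in the tokenized output quoted in the issue. Whether the model actually needs token-level spans depends on how the repository builds its position inputs, which the issue leaves open.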