2020 CCF Big Data & Computing Intelligence Contest (BDCI), Private Information Recognition in Unstructured Business Text: 7th-place solution

Last update: Oct 19, 2022

Overview

2020CCF-NER

BERT-base + FLAT + CRF + FGM + SWA + PU learning strategy + CLUE dataset = 0.906 on test1 with a single model
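Two of the ingredients listed above, FGM and SWA, reduce to short formulas. The following is a minimal pure-Python sketch of the standard definitions, not the code used in this repository: FGM perturbs the embeddings by the gradient rescaled to L2 norm epsilon, and SWA keeps a running mean of weight snapshots. The function names and the default `epsilon` are assumptions made for illustration.

```python
import math


def fgm_perturbation(grad, epsilon=1.0):
    """FGM step: rescale the embedding gradient to L2 norm `epsilon`.

    `grad` is a flat list of floats standing in for the embedding gradient.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # No gradient signal, so no perturbation is added.
        return [0.0] * len(grad)
    return [epsilon * g / norm for g in grad]


def swa_update(avg_weights, new_weights, n_averaged):
    """SWA step: fold a new weight snapshot into the running mean.

    `n_averaged` is the number of snapshots already averaged into `avg_weights`.
    """
    return [
        (avg * n_averaged + new) / (n_averaged + 1)
        for avg, new in zip(avg_weights, new_weights)
    ]


perturbation = fgm_perturbation([3.0, 4.0], epsilon=1.0)
averaged = swa_update([1.0, 2.0], [3.0, 4.0], n_averaged=1)
```

In adversarial training the perturbation is added to the embedding weights, a second forward and backward pass accumulates gradients, and the embeddings are then restored before the optimizer step.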

Word embeddings: https://github.com/Embedding/Chinese-Word-Vectors, SGNS (Mixed-large 综合)

The loss mask code implements the PU learning strategy.
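As a rough illustration of what a loss mask can do in a PU (positive-unlabeled) setting, the sketch below keeps every annotated entity token in the loss but only a random subset of the unlabeled `O` tokens. This is an assumption about the general idea, not a description of this repository's implementation; `build_loss_mask`, `masked_mean`, `keep_o_prob`, and the `B-name` label are hypothetical names.

```python
import random


def build_loss_mask(labels, keep_o_prob=0.5, seed=0, outside_label="O"):
    """Return 1 for tokens that contribute to the loss and 0 for masked ones.

    Entity tokens are always kept. Unlabeled tokens (outside_label) are kept
    with probability keep_o_prob, since some may be unannotated entities.
    """
    rng = random.Random(seed)
    return [
        1 if label != outside_label or rng.random() < keep_o_prob else 0
        for label in labels
    ]


def masked_mean(token_losses, loss_mask):
    """Average the per-token losses over the unmasked positions only."""
    kept = sum(loss_mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(token_losses, loss_mask)) / kept


labels = ["O", "B-name", "O"]
mask_none = build_loss_mask(labels, keep_o_prob=0.0)  # drop every O token
mask_all = build_loss_mask(labels, keep_o_prob=1.0)   # keep every token
loss = masked_mean([1.0, 2.0, 3.0], [1, 0, 1])
```

With a CRF layer the mask would have to be applied differently, because the CRF loss is computed per sequence rather than per token; the sketch only covers the token-level case.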

Main module versions:

python 3.6.9

torch 1.1.0

transformers 3.0.2

pytorchcrf 1.2.0

torchcontrib 0.0.2

Comments

Question about how char_word_mask and part_size are computed when TextEncoder processes FLAT input

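In the code that computes word_mask, part_size, etc. (the collate_fn_test method in NERModelFitting.py), is there a rationale for this somewhat unusual approach? If the goal is simply to build standard FLAT input, the result looks wrong to me. The unusual mask also clearly masks out the ordinary text tokens during the Transformer computation in Model. So may I ask whether there is a reason for this handling, whether it just happened to score well, or whether the uploaded code is not the final correct version? I would appreciate any guidance. Thanks.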

opened by RacleRay 2
lattice start/end do not match the BERT ids

A question. Here is a sample generated by the code:

"text": "《别告诉我你懂PPT》《不懂项目管理还敢拼职场》《让营销更性感》的作者李治（Liz），《不懂项目管理，还敢拼职场》及《别告诉我你懂PPT》的作者"", "entities": [], "lattice": [["告诉", 2, 3], ["项目", 14, 15], ["管理", 16, 17], ["职场", 21, 22], ["营销", 26, 27], ["性感", 29, 30], ["作者", 33, 34], ["项目", 46, 47], ["管理", 48, 49], ["职场", 54, 55], ["告诉", 60, 61], ["作者", 70, 71]]}

Running the text through bert_tokenizer gives: [101, 517, 1166, 1440, 6401, 2769, 872, 2743, 8842, 518, 517, 679, 2743, 7555, 4680, 5052, 4415, 6820, 3140, 2894, 5466, 1767, 518, 517, 6375, 5852, 7218, 3291, 2595, 2697, 518, 4638, 868, 5442, 3330, 3780, 8020, 9341, 8253, 8021, 8024, 517, 679, 2743, 7555, 4680, 5052, 4415, 8024, 6820, 3140, 2894, 5466, 1767, 518, 1350, 517, 1166, 1440, 6401, 2769, 872, 2743, 8842, 518, 4638, 868, 5442, 107, 102]

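I noticed that the lattice start and end do not correspond to text_ids. For example, for 项目 (14, 15), positions 14 and 15 of text_ids are not 项目. Does this affect the results?

(This happens because the word PPT is tokenized into a single id.)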

opened by renmada 1
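The mismatch reported in the issue above can be reproduced and corrected with a character-to-token map. The sketch below is an illustration only: `toy_tokenize` is a hypothetical stand-in that emits one token per character and merges each run of ASCII letters or digits into a single token, which is not how the real WordPiece tokenizer behaves in general (it can split a run into several pieces). The `offset=1` accounts for the leading [CLS] id.

```python
def is_ascii_alnum(ch):
    """True for ASCII letters and digits only."""
    return ord(ch) < 128 and ch.isalnum()


def toy_tokenize(text):
    """Return (char_start, char_end) spans, inclusive, one per toy token."""
    spans = []
    i = 0
    while i < len(text):
        j = i + 1
        if is_ascii_alnum(text[i]):
            # Merge the whole ASCII run (e.g. "PPT") into one token.
            while j < len(text) and is_ascii_alnum(text[j]):
                j += 1
        spans.append((i, j - 1))
        i = j
    return spans


def char_to_token(text, spans, offset=1):
    """Map each character index to its token index, shifted by `offset`."""
    mapping = [0] * len(text)
    for tok_idx, (start, end) in enumerate(spans):
        for c in range(start, end + 1):
            mapping[c] = tok_idx + offset
    return mapping


def remap_lattice(lattice, mapping):
    """Convert lattice entries from character offsets to token indices."""
    return [[word, mapping[start], mapping[end]] for word, start, end in lattice]


text = "《别告诉我你懂PPT》《不懂项目管理还敢拼职场》"
lattice = [["告诉", 2, 3], ["项目", 14, 15], ["管理", 16, 17], ["职场", 21, 22]]
spans = toy_tokenize(text)
remapped = remap_lattice(lattice, char_to_token(text, spans))
```

For this prefix of the sample, 项目 moves from character offsets (14, 15) to token positions (13, 14), which is where the ids 7555 and 4680 sit in the tokenizer output quoted in the issue. With a real tokenizer, the spans would have to come from the tokenizer's own offset information rather than from a heuristic like this one.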
AttributeError: 'str' object has no attribute 'detach'
In encoder.py, vec is a str, so calling vec.detach() raises an error. Should detach() be removed here?

def get_bert_vec(self, text, text_mask, text_pos=None):
    if text_pos is None:
        _, _, text_vecs = self.bert(text, text_mask)
    else:
        _, _, text_vecs = self.bert(text, text_mask, position_ids=text_pos)
    text_vecs = list(text_vecs)
    if self.detach_ptm_flag:
        for i, vec in enumerate(text_vecs):
            text_vecs[i] = vec.detach()
    return text_vecs
opened by FLxuRu 1
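One possible cause of the error in the issue above, assuming a newer transformers release than the pinned 3.0.2 is installed: newer releases return a dict-like output object by default, and tuple-unpacking a dict yields its keys, which are strings. The pure-Python sketch below mimics both output styles with placeholder values; `get_hidden_states` is a hypothetical helper, and the key names follow the usual BERT output fields.

```python
from collections import OrderedDict


def get_hidden_states(outputs):
    """Return per-layer hidden states from a tuple-style or dict-style output."""
    if isinstance(outputs, dict):
        return list(outputs["hidden_states"])
    # Tuple layout assumed: (last_hidden_state, pooler_output, hidden_states)
    return list(outputs[2])


# Placeholder strings stand in for tensors.
tuple_style = ("last", "pooled", ("layer0", "layer1"))
dict_style = OrderedDict(
    last_hidden_state="last",
    pooler_output="pooled",
    hidden_states=("layer0", "layer1"),
)

# Unpacking a dict iterates over its keys, so the third name is a str.
_, _, unpacked = dict_style

from_tuple = get_hidden_states(tuple_style)
from_dict = get_hidden_states(dict_style)
```

If that is the cause, removing detach() would only hide the problem, because text_vecs would then be a list of characters rather than tensors. Installing the listed transformers version, or asking the model for tuple outputs, would be the options to try first.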
lattice start/end do not match the text


opened by renmada 0