Boostcamp AI Tech 3rd / Basic Paper reading w.r.t Embedding

Soyeon Kim

Last update: Nov 14, 2022

Related tags

Deep Learning BoostcampAITech3-PaperReading-Embedding

Overview

Boostcamp AI Tech 3rd : Basic Paper Reading w.r.t Embedding

TL;DR

This study group covers the foundational papers that form the main line of word/sentence embedding research from 1992 to 2018.

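What to include in a paper presentation

What problem are the authors trying to solve?
How did they try to solve it, and what are the advantages? (If time allows: what methods existed before, and what were their shortcomings?)
The intuition behind the method (without math)
An understanding of the method (mathematically)
The data, metrics, and performance comparisons used to show that the method works
Discussion points: anything you found lacking, ambiguous, or particularly good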

Reading list

Paper (author)	Year	Presenter	File upload	Code explained
Class-Based n-gram Models of Natural Language (Peter F. Brown et al.)	1992	소연	explanation
Efficient Estimation of Word Representations in Vector Space (Tomas Mikolov et al.)	2013	동진	presentation
Distributed Representations of Words and Phrases and their Compositionality (Tomas Mikolov et al.)	2013	나연	explanation	skip-gram, CBOW
Distributed Representations of Sentences and Documents (Quoc V. Le and Tomas Mikolov)	2014	기원	explanation	Doc2Vec
GloVe: Global Vectors for Word Representation (Jeffrey Pennington et al.)	2014	수정	explanation
Skip-Thought Vectors (Ryan Kiros et al.)	2015	기범	explanation
Enriching Word Vectors with Subword Information (Piotr Bojanowski et al.)	2017	은기	explanation
Universal Sentence Encoder (Daniel Cer et al.)	2018

Issues & additional study materials

Dates	Topic	Presenter	File upload
04/14	Using word2vec with gensim	현지	link
04/14	negative sampling & subsampling	나경	link
04/14	hierarchical softmax	소연	link
04/14	noise contrastive estimation (NCE)	수정	link

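Study rules

Study time: Thursdays at 9:30 PM!
Workload: one paper per week! (Let's try to finish before the Product Serving curriculum starts so we can focus on it.)
- Everyone reads the paper and posts at least one question as a GitHub issue (and if you know the answer to someone else's question, please reply!)
Presenter: chosen at random on the day. Presentation materials can be in any format.
- Paper presentation: after presenting, the presenter creates a folder in this repo and uploads their summary. Anyone else who wants to share material can post it in the issues or add a link under File upload (optional).
- Code walkthrough: the presenter of a paper gives a code walkthrough the following week (e.g., which library makes the method easy to use and how to use it, or, if the algorithm is complex, how it is implemented in code; whatever suits the presenter).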

Participants

강나경, 김소연, 김현지, 박기범, 임동진, 임수정, 정기원, 한나연, 김은기

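Reference links

Reference: a GitHub repo with a good template for summarizing papers and good discussion through issues

Reference: the GitHub repo of NLP must-read papers that our reading list is based on

Reference: a Korean NLP review group (some papers overlap with the beginners track of season 1!)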

Comments

[week5] TODO
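How word2vec's CBOW and skip-gram are implemented, and how the implementations differ

Using word2vec through the gensim module, with a brief introduction?

A brief explanation of a doc2vec implementation (reference: https://github.com/inejc/paragraph-vectors)

A detailed explanation of negative sampling & subsampling: how they differ, how they work, and what they ultimately gain

More on hierarchical softmax, plus how it is implemented?

Additional explanation of Noise Contrastive Estimation (NCE)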
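
For the negative sampling & subsampling item, here is a minimal sketch of the two formulas as I read them in the word2vec paper: a word with relative frequency f(w) is discarded with probability 1 - sqrt(t / f(w)), and negatives are drawn from the unigram distribution raised to the 3/4 power. The toy counts and the function names are assumptions made for illustration, not part of any library.

```python
import math

def discard_prob(freq, t=1e-5):
    """Probability of dropping one occurrence of a word with relative frequency `freq`."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, then renormalised to sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: v / z for w, v in weights.items()}

# Toy counts (hypothetical corpus).
counts = {"the": 50_000, "embedding": 40, "skipgram": 10}
total = sum(counts.values())

for word, c in counts.items():
    print(word, "discard prob:", round(discard_prob(c / total), 3))

noise = noise_distribution(counts)
for word, p in noise.items():
    print(word, "unigram:", round(counts[word] / total, 5), "noise:", round(p, 5))
```

Subsampling thins out frequent words in the training text, while the 3/4 power makes rare words somewhat more likely to be picked as negatives than their raw frequency would suggest.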
opened by kimcando 6
[week3] What is Noise Contrastive Estimation (NCE)?
At the top of page 2, in the Introduction, the paper says:

In addition, we present a simplified variant of Noise Contrastive Estimation (NCE) [4] for training the Skip-gram model that results in faster training and better vector representations for frequent words, compared to more complex hierarchical softmax that was used in the prior work [8].

This is where the concept of Noise Contrastive Estimation (NCE) first appears. What is NCE?
opened by TB2715 5
[week2] What does model robustness mean? If noise enters the input of the paper's model, how is it handled? And is the model in the paper a robust model?

In the paper's Intro, robustness is mentioned as one of the reasons earlier NLP models used one-hot encoding. Does this robustness mean robustness to noise in the input? Thinking about it briefly, one-hot encoding seems more robust than a distributed representation: since only one value is 1, recovery should be easy even if noise occurs. But if noise enters the input of a model such as CBOW or Skip-gram and produces a distorted projection, how would the original input be recovered?

opened by kiwon94 4
[week2] What is hierarchical softmax?

The paper says that with a binary tree representation of the vocabulary, the number of output units that need to be evaluated can shrink to about log_2(V). How does that work?

p. 3: 'With binary tree representations of the vocabulary, the number of output units that need to be evaluated can go down to around $log_2(V)$.' 'the term H x V can be efficiently reduced to H x log_2(V) by using hierarchical softmax.'
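
A minimal sketch of where the log_2(V) comes from, assuming a balanced binary tree over a vocabulary whose size is a power of two (the paper uses a Huffman tree; the balanced tree, the heap-style indexing and all names here are my simplifications). Each word is a leaf, and its probability is the product of left/right decisions made at the inner nodes on the path from the root, so only about log_2(V) inner nodes are evaluated per word.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_probability(word, hidden, node_vectors, vocab_size):
    """P(word | hidden) as a product of binary decisions on the root-to-leaf path.

    Heap indexing: inner nodes are 1..V-1, the leaf of `word` is V + word.
    Returns the probability and the number of inner nodes evaluated.
    """
    node = vocab_size + word
    prob, steps = 1.0, 0
    while node > 1:
        parent = node // 2
        score = sum(h * v for h, v in zip(hidden, node_vectors[parent]))
        # Right child (odd index) gets sigmoid(score), left child sigmoid(-score).
        prob *= sigmoid(score) if node % 2 else sigmoid(-score)
        node = parent
        steps += 1
    return prob, steps

random.seed(0)
V, H = 8, 4  # toy vocabulary size and hidden size
node_vectors = {n: [random.gauss(0, 1) for _ in range(H)] for n in range(1, V)}
hidden = [random.gauss(0, 1) for _ in range(H)]

results = [hs_probability(w, hidden, node_vectors, V) for w in range(V)]
total = sum(p for p, _ in results)
print("sum of probabilities:", total)
print("inner nodes evaluated per word:", results[0][1], "log2(V):", math.log2(V))
```

Because sigmoid(x) + sigmoid(-x) = 1 at every inner node, the leaf probabilities sum to 1 without ever computing a V-way normalisation.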

opened by angieKang 3
[Week3] What is a numerical probability?

The paper says Negative Sampling is a simplified form of NCE. It describes the main difference between the two as follows:

The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.

Here, the thing Negative Sampling does not need is the numerical probability, but I don't understand what this numerical probability means. More precisely, I don't understand what is dropped when NCE is simplified to NEG.
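
One way to see the difference is to write both losses for a single (word, context) example. This is a sketch under the usual binary-classification view of NCE with the partition function assumed to be 1; the function names and the toy numbers are hypothetical. In NCE each score is shifted by log(k * P_n(w)), so the noise probability P_n(w) has to be available as an actual number. Negative sampling drops that shift and only needs the sampled words.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def neg_loss(pos_score, neg_scores):
    """Negative sampling: only the scores of the sampled words are needed."""
    return -log_sigmoid(pos_score) - sum(log_sigmoid(-s) for s in neg_scores)

def nce_loss(pos_score, neg_scores, pos_noise_p, neg_noise_ps):
    """NCE: every score is shifted by log(k * P_n(w)), so P_n(w) must be known."""
    k = len(neg_scores)
    loss = -log_sigmoid(pos_score - math.log(k * pos_noise_p))
    for s, p in zip(neg_scores, neg_noise_ps):
        loss -= log_sigmoid(-(s - math.log(k * p)))
    return loss

pos, negs = 2.0, [-1.0, 0.5]  # toy model scores
print("NEG:", neg_loss(pos, negs))
print("NCE:", nce_loss(pos, negs, pos_noise_p=0.01, neg_noise_ps=[0.2, 0.05]))
# With k * P_n(w) == 1 for every word, the shift is zero and the two losses coincide.
print("NCE, k*P_n=1:", nce_loss(pos, negs, pos_noise_p=0.5, neg_noise_ps=[0.5, 0.5]))
```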

opened by sujeongim 2
[week2] What exactly is the projection layer?
Q appears in the paper's formula for training complexity. Q is defined differently for each model, and the projection layer (matrix) shows up frequently in those computations.

I'm not sure exactly what the projection layer (matrix) is.
The closest thing I can think of is an embedding layer (matrix), but the paper doesn't actually use the term embedding...
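
A small sketch of one common reading, which is an assumption on my part rather than a statement from the paper: the projection matrix behaves like an embedding table, because multiplying a one-hot input by a V x D matrix just selects one row. The toy matrix and helper names are made up for illustration.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_times_matrix(vec, matrix):
    """Row vector (length V) times matrix (V x D)."""
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(len(matrix[0]))]

# Toy projection matrix with V=4 words and D=2 dimensions.
projection = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
word_id = 2

projected = vec_times_matrix(one_hot(word_id, 4), projection)
print(projected, projection[word_id])
```

Under this reading, the full product is never needed: looking up N rows costs N x D, which matches the first term of Q.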
opened by cow-coding 2
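[week2] How model complexity changes from model to model
First, I didn't fully understand the sentence describing N (Section 2.1). At what "given time"? And what does "active" mean? Does it mean that only a sequence of N words is fed to the model and only those words are trained at that step?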

As only N inputs are active at any given time, composition of the projection layer is a relatively cheap operation.

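I don't fully understand the computation, or why a particular term is said to be dominant.

NNLM : The second term should be fairly large too, so why is the third term said to dominate? That would require N*D << V. Is N the number of input words, D the dimension each input word is first projected to, and V the full vocabulary size, so that the third term dominates because V is overwhelmingly large?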

Q = (NxD) + (NxDxH) + (HxV)

NxD : input layer

(NxDxH) : hidden layer (does this mean a single layer?)

HxV : output layer(dominant)

Huffman coding and hierarchical softmax are also unfamiliar concepts to me.

RNNLM :

Q = (HxH) + (HxV) where D=H

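HxH : Hmm.. I get that it is recurrent, in that the previous step's data is fed in and affects the next step, and in that sense this term seems to lump the input and hidden layers together. But since N input words have to be consumed, shouldn't there be a term containing N, the input size, as in NNLM?

HxV : same as above, so I roughly get it

CBOW : the hidden layer is removed, output layer sharing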

Q= NxD + Dx log_2(V)

NxD : input layer

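Dx log_2(V) : the computation for the output layer. It seems the reduction from the surrounding n words down to a single word is expressed with log_2, but why? **Why is the reduction expressed as log2?**

Skip-gram

Q=Cx(D+Dxlog_2(V))

I think I'll understand this one once I understand the log_2.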
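
Plugging numbers into the four formulas makes the "dominant term" argument easier to check. The values below (N=10, D=500, H=500, V=1,000,000, C=10) are illustrative assumptions chosen for this sketch, and the function names are hypothetical.

```python
import math

def q_nnlm(N, D, H, V):
    """Per-term cost of the feedforward NNLM: Q = N*D + N*D*H + H*V."""
    return {"N*D": N * D, "N*D*H": N * D * H, "H*V": H * V}

def q_rnnlm(H, V):
    """Per-term cost of the RNNLM: Q = H*H + H*V."""
    return {"H*H": H * H, "H*V": H * V}

def q_cbow(N, D, V):
    """Q = N*D + D*log2(V)."""
    return N * D + D * math.log2(V)

def q_skipgram(C, D, V):
    """Q = C*(D + D*log2(V))."""
    return C * (D + D * math.log2(V))

N, D, H, V, C = 10, 500, 500, 1_000_000, 10  # assumed example sizes

nnlm = q_nnlm(N, D, H, V)
print("NNLM terms:", nnlm)
print("RNNLM terms:", q_rnnlm(H, V))
print("H*log2(V) with hierarchical softmax:", round(H * math.log2(V)))
print("CBOW:", round(q_cbow(N, D, V)))
print("Skip-gram:", round(q_skipgram(C, D, V)))
```

With these sizes H x V is far larger than N x D x H because V is much larger than N x D. Once the output cost is cut to H x log_2(V), the hidden-layer term N x D x H becomes the largest one, which is the term CBOW and Skip-gram remove.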
opened by kimcando 2
[week2] How does the performance gap between CBOW and Skip-gram on the syntactic and semantic tasks relate to how they are trained?
This is the third paragraph of Section 4.3 of the paper.

The CBOW architecture works better than the NNLM on the syntactic tasks, and about the same on the semantic one. Finally, the Skip-gram architecture works slightly worse on the syntactic task than the CBOW model (but still better than the NNLM), and much better on the semantic part of the test than all the other models.

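CBOW outperforms NNLM on the syntactic task but is about the same on the semantic task.

Skip-gram is better than NNLM but worse than CBOW on the syntactic task. On the semantic task, however, it is the best.

CBOW is trained to predict the current word from the surrounding words, while skip-gram is trained to predict the surrounding words from the current word. Does this difference in training lead to the performance gap between the syntactic and semantic tasks?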
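
To make the two training setups concrete, here is a minimal sketch of the examples each one would generate from a sentence. It assumes a fixed window of 2 and ignores subsampling and the randomly shrunk window that real implementations use; the helper names are hypothetical.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding position i itself."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def cbow_examples(tokens, window=2):
    """One example per position: (all context words) -> centre word."""
    return [(context_window(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """One example per (centre, context) pair: centre word -> one context word."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

tokens = "the quick brown fox jumps".split()
cbow = cbow_examples(tokens)
skip = skipgram_examples(tokens)

print(len(cbow), "CBOW examples, e.g.", cbow[2])
print(len(skip), "skip-gram examples, e.g.", skip[:3])
```

CBOW averages the context into one prediction per position, while skip-gram produces a separate update for every (centre, context) pair, so each word receives many more individual updates under skip-gram.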
opened by kimcando 2
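[week1] Does the number of parameters to estimate have to grow as the word sequence gets longer?

It doesn't quite click for me why the number of parameters to estimate has to grow as the word sequence gets longer. If everything is computed from counts anyway, why does it grow? The linked page explains it in a similar way, but it still doesn't fully click.

The larger n is chosen, the lower the chance that a given n-gram can actually be counted in the training corpus, so the sparsity problem becomes increasingly severe. In addition, the model size grows as n grows, because in principle counts must be kept for every n-gram in the corpus.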
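
A toy illustration of both effects, using a made-up ten-word corpus (an assumption for this sketch). Each possible n-gram needs its own probability estimate, so the number of candidate parameters grows on the order of V^n, while the number of n-grams actually observed in the corpus barely changes.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat and the cat ate".split()
vocab = set(tokens)

for n in (1, 2, 3):
    observed = len(ngram_counts(tokens, n))
    possible = len(vocab) ** n
    print(f"n={n}: possible={possible}, observed={observed}, "
          f"coverage={observed / possible:.3f}")
```

The counts are only the tool for estimating the probabilities; what grows with n is the number of distinct probabilities the model has to store and estimate, and most of them have no supporting count at all.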

opened by kimcando 2
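[week 8] Does the character n-gram approach proposed in the paper capture information about prefixes and suffixes?

When building character n-grams, < is attached to the first subword and > to the last subword. I took these symbols to mark the start and end of the set of subwords. Do < and > let the model pick up information about the word's prefix and suffix?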

Each word w is represented as a bag of character n-gram. We add special boundary symbols < and > at the beginning and end of words, allowing to distinguish prefixes and suffixes from other character sequences.

Character n-grams of the word where (n=3)
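
A minimal sketch of the extraction step, assuming a single n for readability (the paper uses n from 3 to 6 and also adds the whole word, e.g. <where>, as one extra sequence). The function name is hypothetical.

```python
def char_ngrams(word, n=3):
    """Character n-grams of `word` after adding the boundary symbols < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(char_ngrams("redo"))   # ['<re', 'red', 'edo', 'do>']
```

The boundary symbols make the same letters map to different n-grams depending on position: "re" at the start of redo becomes <re, while "re" at the end of where becomes re>, so each gets its own vector.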

opened by HanNayeoniee 1
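[Week 7] Why take the log of the objective function?

Not only in this paper's objective but also in other papers and in data analysis, people keep taking logs. What is the benefit? I'm curious what is gained by making the values smaller.

~~It's a rather basic question from someone short on math background, but I'm posting it on the grounds that if not now, when would I ever look into it?~~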
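
One practical reason can be shown in a few lines. A likelihood is a product of many probabilities, which underflows in floating point, whereas the log turns the product into a sum that stays representable. The log is also monotonic, so the parameters that maximise the log-likelihood are the same ones that maximise the likelihood. The toy probabilities below are an assumption for illustration.

```python
import math

probs = [0.01] * 1000  # e.g. 1000 predicted word probabilities

product = 1.0
for p in probs:
    product *= p  # shrinks below the smallest representable float

log_likelihood = sum(math.log(p) for p in probs)

print("product of probabilities:", product)
print("sum of log probabilities:", log_likelihood)
```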

opened by angieKang 1
[week8] What V_c is in the new scoring function

The new scoring function proposed in the paper defines the score as the sum of the dot products of each n-gram vector (z_g), rather than a word vector, with V_c.

I assumed V_c here is the same as V_wc in the original skip-gram model. Does that mean both n-gram vectors and word vectors are used? Or does V_c also have to be expressed as a combination of n-grams?
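
A small numeric sketch of the score as described above, s(w, c) = sum over g of z_g · v_c. It treats v_c as one ordinary vector per context word, which is my reading of how the formula is written and should be taken as an assumption; the toy vectors and names are made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score(ngram_vectors, context_vector):
    """Sum of dot products between each n-gram vector z_g and the context vector v_c."""
    return sum(dot(z, context_vector) for z in ngram_vectors)

# Toy n-gram vectors z_g for one word, and one context vector v_c.
ngram_vectors = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
v_c = [2.0, 3.0]

# Summing the n-gram vectors first gives the word vector, and the same score.
word_vector = [sum(col) for col in zip(*ngram_vectors)]

print(score(ngram_vectors, v_c))  # 10.0
print(dot(word_vector, v_c))      # 10.0
```

Since the dot product is linear, the score equals the dot product of v_c with the sum of the n-gram vectors, which is why the word itself can be represented as that sum.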

opened by sujeongim 1
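[week 8] What if a subword of a word coincides with an actual word, so that their embeddings overlap?

Taking the paper's example where, the n-grams are <wh, whe, her, ere, re>. Here her coincides with the pronoun her. From my reading of the paper, there seems to be no step that treats a subword differently from a word it coincides with (i.e., the 3-gram her in where versus the pronoun her).

If so, is the 3-gram vector for her actually helpful in composing the word vector of where during training? Might it instead act as noise for some words?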

opened by kimcando 0
[Week 7] How exactly was it used together with Naive Bayes?

In 3.5. Classification benchmarks,

Naive Bayes, which was not mentioned for the other tasks, appears, and the combination of combine-skip with Naive Bayes gives the best performance.

However, I don't understand what combining Naive Bayes with combine-skip means (in terms of the formulas or the training procedure), hence this question.

opened by sujeongim 0
[Week 7] The meaning of un-regularized L2 linear regression loss

In 2.2. Vocabulary expansion,

the paper says a matrix W is trained to build a mapping function f : V_w2v -> V_rnn.

It says an un-regularized L2 linear regression loss is used to train W. Can this loss be regarded as similar to the usual RMSE, or is it a completely different loss from RMSE?
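
A one-dimensional sketch of the relationship, under the assumption that "un-regularized L2 linear regression loss" means the plain sum of squared errors with no penalty on W. Here W is reduced to a scalar w and the data are made up, purely for illustration. RMSE is the square root of the mean of the same squared errors, so both are minimised by the same w.

```python
import math

def sse(w, xs, ys):
    """Sum of squared errors of the linear map x -> w * x (no regularisation term)."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

def rmse(w, xs, ys):
    return math.sqrt(sse(w, xs, ys) / len(xs))

def fit_w(xs, ys):
    """Closed-form least-squares solution for a scalar w."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs = [1.0, 2.0, 3.0]   # stand-ins for word2vec-space inputs
ys = [2.1, 3.9, 6.2]   # stand-ins for RNN-space targets

w_star = fit_w(xs, ys)
candidates = [w_star + d for d in (-0.5, -0.1, 0.0, 0.1, 0.5)]
best_by_sse = min(candidates, key=lambda w: sse(w, xs, ys))
best_by_rmse = min(candidates, key=lambda w: rmse(w, xs, ys))

print(w_star, best_by_sse, best_by_rmse)
```

In the matrix case the same idea applies with W as a matrix and the squared error summed over all vector components.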

opened by sujeongim 1
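[Week 7] I'd like to know how the vocab matrix is used and updated during training.

I'm asking because I don't fully understand the training procedure. As I understand it, the encoder stage learns a sentence-level hidden state vector, and the decoder stage starts from <eos> (this should really be a start-of-sentence token, but it is labeled eos) and infers one word at a time using the hidden state vector and the vocab matrix. Is the vocab matrix not used at all in the encoder stage?

And if a word appears that has never been seen before, how is it inferred?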

opened by greenare 0

