Hi, I noticed that the BERT prompt model does not feed the [CLS] token to the linear classification head. I'll try to explain with the following code and toy inputs: say input_ids's shape is [8, 32] and pre_seq_len is 3, so inputs_embeds's shape becomes [8, 35, 768]. I've commented the shapes of the key variables in the code and stated my concern below.
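As a quick sanity check of those toy shapes, here is a minimal standalone sketch with dummy tensors (not the repo's code, just the shape arithmetic):

import torch

batch_size, seq_len, pre_seq_len, hidden = 8, 32, 3, 768
raw_embedding = torch.randn(batch_size, seq_len, hidden)   # token embeddings: [8, 32, 768]
prompts = torch.randn(batch_size, pre_seq_len, hidden)     # soft prompts:     [8, 3, 768]
inputs_embeds = torch.cat((prompts, raw_embedding), dim=1)
print(inputs_embeds.shape)  # torch.Size([8, 35, 768]) -- the prompts sit in front of [CLS]

The relevant part of the model's forward pass: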
class BertPromptForSequenceClassification(BertPreTrainedModel):
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
                position_ids=None, head_mask=None, output_attentions=None,
                output_hidden_states=None, return_dict=None):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        batch_size = input_ids.shape[0]
        raw_embedding = self.embeddings(
            input_ids=input_ids,
            position_ids=position_ids,
            token_type_ids=token_type_ids,
        )
        prompts = self.get_prompt(batch_size=batch_size)
        # prompts: [8, 3, 768], raw_embedding: [8, 32, 768]
        inputs_embeds = torch.cat((prompts, raw_embedding), dim=1)
        # inputs_embeds: [8, 35, 768] -- the 3 soft prompt tokens come first
        prefix_attention_mask = torch.ones(batch_size, self.pre_seq_len).to(self.bert.device)
        attention_mask = torch.cat((prefix_attention_mask, attention_mask), dim=1)
        outputs = self.bert(
            # input_ids,
            attention_mask=attention_mask,
            # token_type_ids=token_type_ids,
            # position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            # past_key_values=past_key_values,
        )
        # Since the BERT pooler takes the first token of the encoder output,
        # the token that actually feeds the classifier head here is the
        # first soft prompt token, not [CLS]!
        pooled_output = outputs[1]
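For context, HuggingFace's BertPooler always pools the first position of the encoder output (hidden_states[:, 0]); the snippet below is how I understand it (paraphrased, so details may differ slightly from the installed version). With the prompts concatenated in front, that first position is a soft prompt token rather than [CLS]:

import torch
from torch import nn

class BertPooler(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states):
        # Pools the hidden state at index 0 along the sequence dimension,
        # which in the prompt model above is the first soft prompt token.
        first_token_tensor = hidden_states[:, 0]
        pooled_output = self.dense(first_token_tensor)
        pooled_output = self.activation(pooled_output)
        return pooled_output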
I wonder: is P-tuning v2 being compared against soft prompt tuning here? If so, the token fed to the classification head for the latter is not [CLS]. Is that expected?