def convert_single_mathqa_example(example, is_training, tokenizer, max_seq_length,
                                  max_program_length, op_list, op_list_size,
                                  const_list, const_list_size,
                                  cls_token, sep_token):
    """Converts a single MathQAExample into an InputFeature."""
    features = []
    question_tokens = example.question_tokens
    if len(question_tokens) > max_seq_length - 2:
        print("too long")
        question_tokens = question_tokens[:max_seq_length - 2]
    tokens = [cls_token] + question_tokens + [sep_token]  # 1. This line adds [cls_token] at the beginning.
    segment_ids = [0] * len(tokens)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)
    for ind, offset in enumerate(example.number_indices):  # 2. Why isn't number_indices offset by 1?
        if offset < len(input_mask):
            input_mask[offset] = 2
        else:
            if is_training == True:
                # invalid example, drop for training
                return features
            # assert is_training == False
Hello, thanks for the great work! However, I am confused by this code. At comment 1, you add [cls_token] in front of the tokens, which means that the index of every token in tokens shifts to the right by 1. At comment 2, you use example.number_indices directly to assign 2 at the positions of the numbers. This is confusing, since input_mask is created from tokens, which contains [cls] at the beginning.

For example, with tokens = [[cls], a, b, 1, c, d], example.number_indices will be [2] (because there is no [cls] at the beginning when example.number_indices is calculated, so the 2 refers to the index of the number "1"), and the initial input_mask will be [1, 1, 1, 1, 1, 1]. After assigning 2 at the positions given by example.number_indices, input_mask will be [1, 1, 2, 1, 1, 1]. However, index 2 in tokens refers to "b", not to the number "1". Could you please explain this? Thanks!
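To make the concern concrete, here is a minimal, self-contained sketch of the situation described above. It assumes (as the question does, and this is not confirmed by the code shown) that number_indices is computed on the question tokens before [cls] is prepended. The helper mark_numbers and its shift parameter are hypothetical names introduced only for this illustration; they are not part of the repository. Unlike the example in the text, the sketch also appends a [SEP] token, as the quoted function does.

```python
def mark_numbers(question_tokens, number_indices, shift=0,
                 cls_token="[CLS]", sep_token="[SEP]"):
    """Hypothetical helper mirroring the masking loop in the quoted code.

    shift=0 reproduces the quoted behavior; shift=1 compensates for the
    prepended cls_token, under the assumption that number_indices refer to
    positions in question_tokens (without cls_token).
    """
    tokens = [cls_token] + question_tokens + [sep_token]
    input_mask = [1] * len(tokens)
    for offset in number_indices:
        position = offset + shift
        if position < len(input_mask):
            input_mask[position] = 2
    return tokens, input_mask


question_tokens = ["a", "b", "1", "c", "d"]
number_indices = [2]  # index of "1" in question_tokens (assumed convention)

tokens, mask_unshifted = mark_numbers(question_tokens, number_indices)
_, mask_shifted = mark_numbers(question_tokens, number_indices, shift=1)

# Collect the tokens that end up marked with 2 in each variant.
marked_unshifted = [t for t, m in zip(tokens, mask_unshifted) if m == 2]
marked_shifted = [t for t, m in zip(tokens, mask_shifted) if m == 2]

print(marked_unshifted)  # ['b']
print(marked_shifted)    # ['1']
```

If number_indices already accounts for the leading [cls] when the example is built, no shift is needed and the quoted loop is correct as written; the sketch only shows what would go wrong under the assumption stated in the question.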