There is a snippet in this that reads:
trainer = Trainer(
model=model,
args=training_args,
train_dataset=lm_datasets["train"][:1000],
eval_dataset=lm_datasets["validation"][:100],
data_collator=data_collator,
)
This may be a version issue on my end: slicing a `datasets` Dataset returns a dict, which apparently can't be passed directly as a dataset, so the above raises an error:
***** Running training *****
Num examples = 3
Num Epochs = 1
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 1
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-113-3435b262f1ae> in <module>()
----> 1 trainer.train()
/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
KeyError: 2
That's because after slicing, len() no longer returns the number of samples but the number of dict keys (3 here), which is why the log shows "Num examples = 3".
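The mismatch is easy to reproduce with a plain dict shaped like the sliced output (toy values, purely for illustration):

```python
# A sliced Dataset degrades into a column-keyed dict like this one.
sliced = {
    "attention_mask": [[1, 1]] * 1000,
    "input_ids": [[101, 102]] * 1000,
    "labels": [[101, 102]] * 1000,
}

# len() counts dict keys, not samples -- hence "Num examples = 3".
print(len(sliced))  # 3

# The DataLoader then indexes by integer position, e.g. dataset[2],
# which fails because the keys are strings.
try:
    sliced[2]
except KeyError as e:
    print("KeyError:", e)
```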
Original data:
Dataset({ features: ['attention_mask', 'input_ids', 'labels'], num_rows: 19240 })
After slicing, a dict:
Type: dict String form: {'attention_mask': [[1, 1, ... (truncated, too long to show)
What we actually want is:
Dataset({ features: ['attention_mask', 'input_ids', 'labels'], num_rows: 1000 })
==================My reference fix===========
from datasets import Dataset
trainer = Trainer(
model=model,
args=training_args,
train_dataset=Dataset.from_dict(lm_datasets["train"][:1000]),
eval_dataset=Dataset.from_dict(lm_datasets["validation"][:100]),
data_collator=data_collator,
)
=======Since this may be version-dependent, here are my versions=====
datasets-1.11.0 fsspec-2021.7.0 huggingface-hub-0.0.12 pyyaml-5.4.1 sacremoses-0.0.45 tokenizers-0.10.3 transformers-4.9.2 xxhash-2.0.2
PyTorch version: 1.9.0+cu102