Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are provided to T5Tokenizer. In this case the additional_special_tokens must include the extra_ids tokens
Package:
transformers

Exception Class:
ValueError
Raise code
# Add extra_ids to the special token list
if extra_ids > 0 and additional_special_tokens is None:
    additional_special_tokens = [f"<extra_id_{i}>" for i in range(extra_ids)]
elif extra_ids > 0 and additional_special_tokens is not None:
    # Check that we have the right number of extra_id special tokens
    extra_tokens = len(set(filter(lambda x: bool("extra_id" in str(x)), additional_special_tokens)))
    if extra_tokens != extra_ids:
        raise ValueError(
            f"Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are provided to T5Tokenizer. "
            "In this case the additional_special_tokens must include the extra_ids tokens"
        )

self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

super().__init__(
    eos_token=eos_token,
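As the raise code shows, the check only fires when both arguments are given. If additional_special_tokens is left out, the tokenizer builds the sentinel list itself from extra_ids; a minimal sketch of that default branch:
# Sketch of the default branch in the raise code above: with
# additional_special_tokens=None, one "<extra_id_i>" sentinel is
# generated per extra id.
extra_ids = 5
additional_special_tokens = [f"<extra_id_{i}>" for i in range(extra_ids)]
print(additional_special_tokens)
# ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>', '<extra_id_4>']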
Links to the raise (2)
https://github.com/huggingface/transformers/blob/bd9871657bb9500a9f4437a873db6df5f1ae6dbb/src/transformers/models/t5/tokenization_t5.py#L127
https://github.com/huggingface/transformers/blob/bd9871657bb9500a9f4437a873db6df5f1ae6dbb/src/transformers/models/t5/tokenization_t5_fast.py#L123
Ways to fix
When using the T5Tokenizer, if the additional_special_tokens parameter is provided, then the extra_ids parameter must equal the number of those additional special tokens whose name contains "extra_id".
e.g.
Let's say additional_special_tokens has the following value:
["<extra_id>_1","<extra_id>_2","<extra_id>_3"]
Then extra_ids should be set to 3, because there are 3 additional special tokens.
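The count the tokenizer compares against is easy to reproduce by hand. Here is a minimal sketch mirroring the filter in the raise code (only tokens whose string contains "extra_id" are counted, with duplicates removed via a set):
# Minimal sketch of the count performed in the raise code.
additional_special_tokens = ["<extra_id>_1", "<extra_id>_2", "<extra_id>_3"]
extra_tokens = len(set(t for t in additional_special_tokens if "extra_id" in str(t)))
print(extra_tokens)  # 3 -> extra_ids must be set to 3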
How to reproduce the error:
$ pipenv install transformers
$ pipenv install sentencepiece
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('t5-small',
    extra_ids=5,  # this number should be 3
    additional_special_tokens=["<extra_id>_1", "<extra_id>_2", "<extra_id>_3"])
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
input = "This is a sample a sentence to to be used"
ids = tokenizer("translate English to German: "+input, return_tensors="pt")
input_ids = ids.input_ids
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)
This will throw the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-30-afe27be3f4de> in <module>()
1 from transformers import T5Tokenizer, T5ForConditionalGeneration
2
----> 3 tokenizer = T5Tokenizer.from_pretrained('t5-small', extra_ids=5,additional_special_tokens=["<extra_id>_1","<extra_id>_2","<extra_id>_3"])
4
5 model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
2 frames
/usr/local/lib/python3.7/dist-packages/transformers/models/t5/tokenization_t5.py in __init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, **kwargs)
126 if extra_tokens != extra_ids:
127 raise ValueError(
--> 128 f"Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are provided to T5Tokenizer. "
129 "In this case the additional_special_tokens must include the extra_ids tokens"
130 )
ValueError: Both extra_ids (5) and additional_special_tokens (['<extra_id>_1', '<extra_id>_2', '<extra_id>_3']) are provided to T5Tokenizer. In this case the additional_special_tokens must include the extra_ids tokens
Working (Fixed) code
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('t5-small',
    extra_ids=3,  # matches the three extra_id tokens below
    additional_special_tokens=["<extra_id>_1", "<extra_id>_2", "<extra_id>_3"])
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
input = "This is a sample a sentence to to be used"
ids = tokenizer("translate English to German: "+input, return_tensors="pt")
input_ids = ids.input_ids # Batch size 1
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)
Output of the fixed code
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Dies ist ein Beispiel für einen Satz, der verwendet werden soll.
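As a quick sanity check (a sketch, not part of the original snippet), the custom tokens should now be registered on the tokenizer:
# Optional check: the three custom tokens are expected to appear in
# the tokenizer's additional special tokens.
print(tokenizer.additional_special_tokens)
# expected: ['<extra_id>_1', '<extra_id>_2', '<extra_id>_3']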