Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are provided to T5Tokenizer. In this case the additional_special_tokens must include the extra_ids tokens
Package:
transformers

Exception Class:
ValueError
Raise code
# Add extra_ids to the special token list
if extra_ids > 0 and additional_special_tokens is None:
    additional_special_tokens = [f"<extra_id_{i}>" for i in range(extra_ids)]
elif extra_ids > 0 and additional_special_tokens is not None:
    # Check that we have the right number of extra_id special tokens
    extra_tokens = len(set(filter(lambda x: bool("extra_id" in str(x)), additional_special_tokens)))
    if extra_tokens != extra_ids:
        raise ValueError(
            f"Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are provided to T5Tokenizer. "
            "In this case the additional_special_tokens must include the extra_ids tokens"
        )

self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

super().__init__(
    eos_
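The check above can be reproduced standalone. The sketch below is a simplified illustration, not the library code, and the function name check_extra_ids is made up for this example. Note that only distinct tokens whose string form contains "extra_id" are counted.

```python
# Simplified sketch of the consistency check T5Tokenizer performs.
def check_extra_ids(extra_ids, additional_special_tokens):
    if extra_ids > 0 and additional_special_tokens is not None:
        # Only distinct tokens containing "extra_id" are counted.
        extra_tokens = len(set(t for t in additional_special_tokens if "extra_id" in str(t)))
        if extra_tokens != extra_ids:
            raise ValueError(
                f"Both extra_ids ({extra_ids}) and additional_special_tokens "
                f"({additional_special_tokens}) are provided. The counts must match."
            )

check_extra_ids(3, ["<extra_id>_1", "<extra_id>_2", "<extra_id>_3"])  # passes silently
# check_extra_ids(5, [...same list...]) would raise ValueError, as in the traceback below
```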
Links to the raise (2)
https://github.com/huggingface/transformers/blob/bd9871657bb9500a9f4437a873db6df5f1ae6dbb/src/transformers/models/t5/tokenization_t5.py#L127
https://github.com/huggingface/transformers/blob/bd9871657bb9500a9f4437a873db6df5f1ae6dbb/src/transformers/models/t5/tokenization_t5_fast.py#L123
Ways to fix
When using the T5Tokenizer, if the additional_special_tokens parameter is provided, then the extra_ids parameter must match the number of extra_id tokens in that list.
For example, suppose additional_special_tokens has the following value:
["<extra_id>_1","<extra_id>_2","<extra_id>_3"]
Then extra_ids should be set to 3, because the list contains 3 extra_id tokens.
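Since the tokenizer's check counts only the distinct entries whose text contains "extra_id", a safe way to keep the two parameters in sync is to compute extra_ids from the list the same way. A minimal sketch, using the example token names from above:

```python
# Count the tokens the check treats as extra_id sentinels: distinct
# entries whose string form contains "extra_id".
additional_special_tokens = ["<extra_id>_1", "<extra_id>_2", "<extra_id>_3"]
extra_ids = len(set(t for t in additional_special_tokens if "extra_id" in str(t)))
print(extra_ids)  # 3
```

Passing this computed value to T5Tokenizer.from_pretrained avoids the mismatch even if the token list changes later.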
How to reproduce the error:
$ pipenv install transformers
$ pipenv install sentencepiece
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('t5-small',
    extra_ids=5,  # this number should be 3
    additional_special_tokens=["<extra_id>_1","<extra_id>_2","<extra_id>_3"])
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
input = "This is a sample a sentence to to be used"
ids = tokenizer("translate English to German: "+input, return_tensors="pt")
input_ids = ids.input_ids
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)
This will throw the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-30-afe27be3f4de> in <module>()
1 from transformers import T5Tokenizer, T5ForConditionalGeneration
2
----> 3 tokenizer = T5Tokenizer.from_pretrained('t5-small', extra_ids=5,additional_special_tokens=["<extra_id>_1","<extra_id>_2","<extra_id>_3"])
4
5 model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
2 frames
/usr/local/lib/python3.7/dist-packages/transformers/models/t5/tokenization_t5.py in __init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, **kwargs)
126 if extra_tokens != extra_ids:
127 raise ValueError(
--> 128 f"Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are provided to T5Tokenizer. "
129 "In this case the additional_special_tokens must include the extra_ids tokens"
130 )
ValueError: Both extra_ids (5) and additional_special_tokens (['<extra_id>_1', '<extra_id>_2', '<extra_id>_3']) are provided to T5Tokenizer. In this case the additional_special_tokens must include the extra_ids tokens
Working (Fixed) code
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('t5-small',
    extra_ids=3,  # matches the 3 additional special tokens below
    additional_special_tokens=["<extra_id>_1","<extra_id>_2","<extra_id>_3"])
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
input = "This is a sample a sentence to to be used"
ids = tokenizer("translate English to German: "+input, return_tensors="pt")
input_ids = ids.input_ids # Batch size 1
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)
Output of the fixed code
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Dies ist ein Beispiel für einen Satz, der verwendet werden soll.