
Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are provided to T5Tokenizer. In this case the additional_special_tokens must include the extra_ids tokens

Package:
Exception Class: ValueError

Raise code

tra_ids to the special token list
        if extra_ids > 0 and additional_special_tokens is None:
            additional_special_tokens = [f"<extra_id_{i}>" for i in range(extra_ids)]
        elif extra_ids > 0 and additional_special_tokens is not None:
            # Check that we have the right number of extra_id special tokens
            extra_tokens = len(set(filter(lambda x: bool("extra_id" in str(x)), additional_special_tokens)))
            if extra_tokens != extra_ids:
                raise ValueError(
                    f"Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are provided to T5Tokenizer. "
                    "In this case the additional_special_tokens must include the extra_ids tokens"
                )

        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

        super().__init__(
            eos_

Ways to fix

When both extra_ids and additional_special_tokens are passed to T5Tokenizer, extra_ids must equal the number of distinct entries in additional_special_tokens that contain the substring "extra_id". Entries without that substring are not counted by the check.

e.g.

Let's say additional_special_tokens has the following value.

["<extra_id>_1","<extra_id>_2","<extra_id>_3"]

Then extra_ids should be set to 3, because three of the additional special tokens contain "extra_id".
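
The counting rule can be tried out without loading a tokenizer. The following is a minimal sketch that copies the filter expression from the raise code above into standalone functions; the names count_extra_id_tokens and is_consistent are hypothetical and are not part of transformers.

```python
def count_extra_id_tokens(additional_special_tokens):
    """Count distinct tokens containing "extra_id", as the filter in the raise code does."""
    return len(set(filter(lambda x: "extra_id" in str(x), additional_special_tokens)))


def is_consistent(extra_ids, additional_special_tokens):
    """Return False in the situation where the raise code would raise ValueError."""
    if extra_ids > 0 and additional_special_tokens is not None:
        return count_extra_id_tokens(additional_special_tokens) == extra_ids
    return True


tokens = ["<extra_id>_1", "<extra_id>_2", "<extra_id>_3"]

print(count_extra_id_tokens(tokens))              # 3
print(is_consistent(5, tokens))                   # False: 5 != 3, the error case
print(is_consistent(3, tokens))                   # True
print(count_extra_id_tokens(tokens + ["<sep>"]))  # 3: "<sep>" is not counted
```

Because only the substring is tested, any spelling that contains "extra_id" is counted, and duplicates collapse to one entry through the set.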

How to reproduce the error:

$ pipenv install transformers

$ pipenv install sentencepiece

from transformers import T5Tokenizer, T5ForConditionalGeneration

# The ValueError is raised by this call: extra_ids is 5, but only 3 of the
# tokens below contain "extra_id".
tokenizer = T5Tokenizer.from_pretrained('t5-small',
                                        extra_ids=5,  # this number should be 3
                                        additional_special_tokens=["<extra_id>_1","<extra_id>_2","<extra_id>_3"])

# The lines below are never reached while the mismatch is present.
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
text = "This is a sample a sentence to to be used"

ids = tokenizer("translate English to German: " + text, return_tensors="pt")
input_ids = ids.input_ids
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)

This will throw the following error:

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-30-afe27be3f4de> in <module>()
      1 from transformers import T5Tokenizer, T5ForConditionalGeneration
      2 
----> 3 tokenizer = T5Tokenizer.from_pretrained('t5-small', extra_ids=5,additional_special_tokens=["<extra_id>_1","<extra_id>_2","<extra_id>_3"])
      4 
      5 model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)


2 frames
/usr/local/lib/python3.7/dist-packages/transformers/models/t5/tokenization_t5.py in __init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, **kwargs)
    126             if extra_tokens != extra_ids:
    127                 raise ValueError(
--> 128                     f"Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are provided to T5Tokenizer. "
    129                     "In this case the additional_special_tokens must include the extra_ids tokens"
    130                 )


ValueError: Both extra_ids (5) and additional_special_tokens (['<extra_id>_1', '<extra_id>_2', '<extra_id>_3']) are provided to T5Tokenizer. In this case the additional_special_tokens must include the extra_ids tokens

Working (Fixed) code

from transformers import T5Tokenizer, T5ForConditionalGeneration

# extra_ids now matches the 3 tokens that contain "extra_id".
tokenizer = T5Tokenizer.from_pretrained('t5-small',
                                        extra_ids=3,
                                        additional_special_tokens=["<extra_id>_1","<extra_id>_2","<extra_id>_3"])

model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)

# Named "text" rather than "input" to avoid shadowing the Python builtin.
text = "This is a sample a sentence to to be used"

ids = tokenizer("translate English to German: " + text, return_tensors="pt")
input_ids = ids.input_ids  # Batch size 1
outputs = model.generate(input_ids)

decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)

Output of the fixed code

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Dies ist ein Beispiel für einen Satz, der verwendet werden soll.
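
If you need your own special tokens alongside the sentinel tokens, one way to keep the two arguments in sync is to generate the list from the same number you pass as extra_ids. This is only a sketch: build_special_tokens is a hypothetical helper, "<sep>" and "<ctx>" are made-up example tokens, and the "<extra_id_{i}>" format is taken from the raise code above.

```python
def build_special_tokens(extra_ids, custom_tokens=()):
    """Return extra_ids sentinel tokens followed by the caller's custom tokens."""
    for tok in custom_tokens:
        # A custom token containing "extra_id" would be counted as a sentinel
        # by the check in the raise code and break the match with extra_ids.
        if "extra_id" in str(tok):
            raise ValueError(f"custom token {tok!r} must not contain 'extra_id'")
    # Same sentinel format the raise code uses when no list is supplied.
    sentinels = [f"<extra_id_{i}>" for i in range(extra_ids)]
    return sentinels + list(custom_tokens)


n = 3
tokens = build_special_tokens(n, ["<sep>", "<ctx>"])
print(tokens)
# ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<sep>', '<ctx>']

# Assumed usage, not executed here: passing both values together should
# satisfy the count check shown in the raise code.
# tokenizer = T5Tokenizer.from_pretrained('t5-small', extra_ids=n,
#                                         additional_special_tokens=tokens)
```

Deriving the list from n means the count of "extra_id" entries cannot drift from extra_ids when either value is changed later.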

Jun 19, 2021 kellemnegasi answer