Both extra_ids and additional_special_tokens are provided to T5Tokenizer. In this case the additional_special_tokens must include the extra_ids tokens.

Package: transformers
Exception Class: ValueError

Raise code

        # Add extra_ids to the special token list
        if extra_ids > 0 and additional_special_tokens is None:
            additional_special_tokens = [f"<extra_id_{i}>" for i in range(extra_ids)]
        elif extra_ids > 0 and additional_special_tokens is not None:
            # Check that we have the right number of extra_id special tokens
            extra_tokens = len(set(filter(lambda x: bool("extra_id" in str(x)), additional_special_tokens)))
            if extra_tokens != extra_ids:
                raise ValueError(
                    f"Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are provided to T5Tokenizer. "
                    "In this case the additional_special_tokens must include the extra_ids tokens"
                )

        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

        super().__init__(
            eos_token=eos_token,
            unk_token=unk_token,
            pad_token=pad_token,
            extra_ids=extra_ids,
            additional_special_tokens=additional_special_tokens,
            sp_model_kwargs=self.sp_model_kwargs,
            **kwargs,
        )
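The first branch of the raise code is also informative: when additional_special_tokens is omitted, the tokenizer builds the sentinel tokens itself from extra_ids. A minimal sketch of that branch in isolation (plain Python, no transformers import needed):

# Replicates the first branch of the raise code above: auto-generated sentinels
extra_ids = 3
additional_special_tokens = [f"<extra_id_{i}>" for i in range(extra_ids)]
print(additional_special_tokens)  # ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>']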

Ways to fix

When using the T5Tokenizer, if the additional_special_tokens parameter is provided, the extra_ids parameter must match the number of extra_id tokens in that list.

e.g.

Let's say additional_special_tokens has the following value.

["<extra_id>_1","<extra_id>_2","<extra_id>_3"]

Then extra_ids should be set to 3, because the list contains 3 tokens with "extra_id" in their names (the sketch below replicates the tokenizer's count).
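You can replicate the tokenizer's count before constructing it. The snippet below mirrors the filter from the raise code; the helper name count_extra_id_tokens is ours, not part of transformers:

def count_extra_id_tokens(tokens):
    # Hypothetical helper mirroring the tokenizer's own "extra_id" filter
    return len(set(t for t in tokens if "extra_id" in str(t)))

print(count_extra_id_tokens(["<extra_id>_1", "<extra_id>_2", "<extra_id>_3"]))  # 3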

How to reproduce the error:

$ pipenv install transformers

$ pipenv install sentencepiece

from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('t5-small', 
                                        extra_ids=5, #this number should be 3
                                        additional_special_tokens=["<extra_id>_1","<extra_id>_2","<extra_id>_3"])

model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
input = "This is a sample a sentence to to be used"


ids = tokenizer("translate English to German: "+input, return_tensors="pt")
input_ids = ids.input_ids 
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)

This will throw the following error:

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-30-afe27be3f4de> in <module>()
      1 from transformers import T5Tokenizer, T5ForConditionalGeneration
      2 
----> 3 tokenizer = T5Tokenizer.from_pretrained('t5-small', extra_ids=5,additional_special_tokens=["<extra_id>_1","<extra_id>_2","<extra_id>_3"])
      4 
      5 model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)


/usr/local/lib/python3.7/dist-packages/transformers/models/t5/tokenization_t5.py in __init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, **kwargs)
    126             if extra_tokens != extra_ids:
    127                 raise ValueError(
--> 128                     f"Both extra_ids ({extra_ids}) and additional_special_tokens ({additional_special_tokens}) are provided to T5Tokenizer. "
    129                     "In this case the additional_special_tokens must include the extra_ids tokens"
    130                 )


ValueError: Both extra_ids (5) and additional_special_tokens (['<extra_id>_1', '<extra_id>_2', '<extra_id>_3']) are provided to T5Tokenizer. In this case the additional_special_tokens must include the extra_ids tokens

Working (Fixed) code

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small',
                                        extra_ids=3,  # matches the 3 extra_id tokens below
                                        additional_special_tokens=["<extra_id>_1","<extra_id>_2","<extra_id>_3"])

model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
input = "This is a sample a sentence to to be used"

ids = tokenizer("translate English to German: "+input, return_tensors="pt")
input_ids = ids.input_ids  # Batch size 1
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)

Output of the fixed code

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Dies ist ein Beispiel für einen Satz, der verwendet werden soll.
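The warning above notes that the custom tokens were newly added to the vocabulary. If you plan to fine-tune with them, a short follow-up sketch using standard transformers calls (run after the fixed code above; the embedding-resizing step is our suggestion, not part of the original answer):

# Confirm the custom tokens were registered with the tokenizer
print(tokenizer.additional_special_tokens)
# Give the newly added token ids trainable embeddings before fine-tuning
model.resize_token_embeddings(len(tokenizer))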

Answered on Jun 19, 2021 by kellemnegasi
