word_ids() is not available when using Python-based tokenizers

Exception Class: ValueError

Raise code


            :obj:`List[Optional[int]]`: A list indicating the word corresponding to each token. Special tokens added by
            the tokenizer are mapped to :obj:`None` and other tokens are mapped to the index of their corresponding
            word (several tokens will be mapped to the same word index if they are parts of that word).
        if not self._encodings:
            raise ValueError("word_ids() is not available when using Python-based tokenizers")
        return self._encodings[batch_index].word_ids

    def token_to_sequence(self, batch_or_token_index: int, token_index: Optional[int] = None) -> int:
        Get the index of the sequence represented by the given token. In the general use case, this method returns
        :obj:`0` for a single sequence or the first sequence of a pair, and :obj:`1` for the second sequence of a pair

Ways to fix


This exception is raised when word_ids() is called on a BatchEncoding instance that holds no Encoding objects. The BatchEncoding constructor takes an optional second parameter, encoding. If it is omitted, the internal _encodings attribute stays empty and word_ids() raises the ValueError shown above. To avoid the exception, pass a dictionary as the first parameter and an Encoding object (or a list of them) as encoding. The Encoding class comes from the tokenizers package. In normal use, fast (Rust-backed) tokenizers fill in this parameter automatically, while slow Python-based tokenizers leave it empty, which is what the error message refers to.

Code to Reproduce the Error (WRONG):

import transformers.tokenization_utils_base as tub
from tokenizers import Encoding

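# No Encoding is passed, so the internal _encodings attribute stays empty
be = tub.BatchEncoding({'a':[1,2,3]})
be.word_ids()  # raises ValueError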

Working Version (Fixed):

import transformers.tokenization_utils_base as tub
from tokenizers import Encoding

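# Pass an Encoding as the second argument so _encodings is populated
en = Encoding()
be = tub.BatchEncoding({'a':[1,2,3]}, en)
be.word_ids()  # no ValueError, because an Encoding was supplied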
Jul 09, 2021 codingcomedyig answer
