
char_to_token() is not available when using Python based tokenizers

Package: transformers
Exception Class: ValueError

Raise code

""" 


        Returns:
            :obj:`int`: Index of the token.
        """

        if not self._encodings:
            raise ValueError("char_to_token() is not available when using Python based tokenizers")
        if char_index is not None:
            batch_index = batch_or_char_index
        else:
            batch_index = 0
            char_index = batch_or_char_index
        return self._encodings[batch_index].char_to_token(char_index, sequence_index)

Ways to fix

The BatchEncoding class is derived from a Python dictionary and can be used like one. In addition, it exposes utility methods that map from word/character space to token space. char_to_token() returns the index of the token in the encoded output that contains a given character of the original string, for one sequence of the batch.
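To make the character-to-token mapping concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the function name char_to_token_sketch and the hand-written offsets list are assumptions made for illustration. Each token is represented by the (start, end) character span it covers in the original string.

```python
from typing import List, Optional, Tuple


def char_to_token_sketch(
    offsets: List[Tuple[int, int]], char_index: int
) -> Optional[int]:
    """Return the index of the token whose span contains char_index.

    offsets holds one (start, end) pair per token, with end exclusive.
    Returns None when the character belongs to no token (e.g. a space).
    """
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


# Hypothetical spans for the string "Hello world" split into two tokens.
offsets = [(0, 5), (6, 11)]

inside_first = char_to_token_sketch(offsets, 1)   # character "e"
inside_second = char_to_token_sketch(offsets, 7)  # character "o" in "world"
on_space = char_to_token_sketch(offsets, 5)       # the space between words
```

A lookup like this needs per-token character offsets, which is why the real method depends on the encodings stored in the BatchEncoding.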

Error code:

from transformers.tokenization_utils_base import BatchEncoding

speech = {"input_values": ["Hello", "Welcome", "Everyone"]}
# No encoding is passed, so self._encodings stays empty
encoded_inputs = BatchEncoding(speech)
encoded_inputs.char_to_token(1)  # ValueError is raised here
print(encoded_inputs)

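BatchEncoding must be given an encoding. This is the essential step: without one, self._encodings is empty and char_to_token() raises the ValueError shown in the raise code. Note that an empty Encoding contains no tokens, so it only gets past the check; the lookup has nothing meaningful to map to.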

Fix code:

from transformers.tokenization_utils_base import BatchEncoding
from tokenizers import Encoding as EncodingFast

speech = {"input_values": ["Hello", "Welcome", "Everyone"]}
# Passing an encoding populates self._encodings, so the check no longer raises
encoded_inputs = BatchEncoding(speech, encoding=EncodingFast())
encoded_inputs.char_to_token(1)
print(encoded_inputs)
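
For a lookup that returns an actual token index, the BatchEncoding has to carry an encoding produced by a Rust-based (fast) tokenizer. The sketch below is one possible way to get one without downloading a model, assuming the tokenizers and transformers packages are installed. The toy vocabulary and the sample text are made up for illustration.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers.tokenization_utils_base import BatchEncoding

# Hypothetical toy vocabulary, for illustration only.
vocab = {"[UNK]": 0, "Hello": 1, "world": 2}
tokenizer = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split the text on whitespace

# encode() returns a tokenizers.Encoding that records character offsets.
encoding = tokenizer.encode("Hello world")

# Passing the real encoding fills self._encodings with usable data.
encoded_inputs = BatchEncoding({"input_ids": [encoding.ids]}, encoding=encoding)

first = encoded_inputs.char_to_token(1)   # character inside "Hello"
second = encoded_inputs.char_to_token(7)  # character inside "world"
print(first, second)
```

In everyday use the same effect usually comes from calling a fast tokenizer directly, since its output is already a BatchEncoding with encodings attached.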
Jul 02, 2021 anonim answer