
char_to_token() is not available when using Python based tokenizers

Package:
transformers
Exception Class:
ValueError

Raise code

""" 


        Returns:
            :obj:`int`: Index of the token.
        """

        if not self._encodings:
            raise ValueError("char_to_token() is not available when using Python based tokenizers")
        if char_index is not None:
            batch_index = batch_or_char_index
        else:
            batch_index = 0
            char_index = batch_or_char_index
        return self._encodings[batch_index].char_to_token(char_index, sequence_index)
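The overloaded first argument in the raise code above can be mirrored in plain Python. This is a simplified, dependency-free sketch of that dispatch (the dict-based stand-in for `_encodings` is invented for illustration, not the real transformers data structure):

```python
def char_to_token(encodings, batch_or_char_index, char_index=None):
    # Empty encodings means a slow (pure-Python) tokenizer was used.
    if not encodings:
        raise ValueError("char_to_token() is not available when using Python based tokenizers")
    if char_index is not None:
        batch_index = batch_or_char_index   # two args: (batch_index, char_index)
    else:
        batch_index = 0                     # one arg: it is the char index, batch defaults to 0
        char_index = batch_or_char_index
    # Stand-in lookup: each "encoding" is a char_index -> token_index dict.
    return encodings[batch_index].get(char_index)

# Hypothetical per-sequence char->token maps standing in for fast-tokenizer encodings
encodings = [{0: 0, 1: 0, 2: 1}]
print(char_to_token(encodings, 2))       # one-argument form -> 1
print(char_to_token(encodings, 0, 2))    # two-argument form -> 1
```

This shows why both `char_to_token(1)` and `char_to_token(0, 1)` are legal calls: with a single argument the batch index silently defaults to 0.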

Ways to fix


The BatchEncoding class is derived from a Python dictionary and can be used like one. In addition, it exposes utility methods that map from word/character space to token space. char_to_token returns the index of the token, in the encoded output, that contains a given character of the original string for a sequence of the batch.
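Conceptually, a fast tokenizer records the character span (offsets) each token covers in the original string, and char_to_token is essentially a lookup over those spans. A minimal sketch of that idea, with an invented offsets list purely for illustration:

```python
def lookup_token(offsets, char_index):
    """Return the index of the token whose (start, end) character span
    contains char_index, or None if no token covers that character."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None

# Invented offsets for the string "Hello world": two tokens
offsets = [(0, 5), (6, 11)]
print(lookup_token(offsets, 7))   # character 'o' in "world" -> 1
print(lookup_token(offsets, 5))   # the space belongs to no token -> None
```

Returning None for uncovered characters matches the behavior described in the docstring excerpt above, where whitespace between tokens maps to no token.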

Error code:

from transformers.tokenization_utils_base import BatchEncoding

speech = {"input_values": ["Hello", "Welcome", "Everyone"]}
encoded_inputs = BatchEncoding(speech)  # no `encoding` passed, so _encodings stays empty
encoded_inputs.char_to_token(1)         # <-- ValueError is raised here
print(encoded_inputs)

We have to pass our encoding when constructing the BatchEncoding. This is the important part: without it, _encodings is empty and char_to_token raises the error above.

Fix code:

from transformers.tokenization_utils_base import BatchEncoding
from tokenizers import Encoding as EncodingFast

speech = {"input_values": ["Hello", "Welcome", "Everyone"]}
encoded_inputs = BatchEncoding(speech, encoding=EncodingFast())  # <-- encoding is passed here
encoded_inputs.char_to_token(1)
print(encoded_inputs)
Jul 02, 2021, answered by anonim
