char_to_token() is not available when using Python based tokenizers
Package:
transformers

Exception Class:
ValueError
Raise code
"""
Returns:
:obj:`int`: Index of the token.
"""
if not self._encodings:
raise ValueError("char_to_token() is not available when using Python based tokenizers")
if char_index is not None:
batch_index = batch_or_char_index
else:
batch_index = 0
char_index = batch_or_char_index
return self._encodings[batch_index].char_to_token(char_index, sequence_index)
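The argument dispatch in the raise code can be sketched in isolation: when only one positional argument is given, it is treated as the character index and the batch index defaults to 0. The sketch below is a simplified stand-alone model, not the library code; the class and mapping are illustrative.

```python
# Simplified stand-alone model of the dispatch in BatchEncoding.char_to_token.
# MiniEncoding and its mapping are invented for illustration only.
class MiniEncoding:
    """Stands in for a single fast-tokenizer Encoding object."""
    def __init__(self, char_to_token_map):
        self._map = char_to_token_map  # char index -> token index

    def char_to_token(self, char_index, sequence_index=0):
        return self._map.get(char_index)

def char_to_token(encodings, batch_or_char_index, char_index=None, sequence_index=0):
    if not encodings:  # Python (slow) tokenizers leave this empty
        raise ValueError(
            "char_to_token() is not available when using Python based tokenizers"
        )
    if char_index is not None:
        batch_index = batch_or_char_index
    else:
        # Only one index given: treat it as the char index, batch defaults to 0
        batch_index = 0
        char_index = batch_or_char_index
    return encodings[batch_index].char_to_token(char_index, sequence_index)

encodings = [MiniEncoding({0: 1, 1: 1, 5: 2})]
print(char_to_token(encodings, 1))     # one argument: char index 1 in batch 0 -> 1
print(char_to_token(encodings, 0, 5))  # two arguments: batch 0, char 5 -> 2
```

With an empty `encodings` list, the same ValueError as in the raise code is produced.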
Links to the raise (1)
https://github.com/huggingface/transformers/blob/bd9871657bb9500a9f4437a873db6df5f1ae6dbb/src/transformers/tokenization_utils_base.py#L547
Ways to fix
The BatchEncoding class is derived from a Python dictionary and can be used like one. In addition, it exposes utility methods that map from word/character space to token space. char_to_token returns the index of the token, in the encoded output, that contains a given character of the original string, for a sequence of the batch.
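To make the character-to-token mapping concrete, here is a hand-built illustration with no tokenizer involved; the text, token spans, and helper function are invented for this example, mimicking what a fast tokenizer's offset mapping records.

```python
# Hand-built illustration of what char_to_token computes (no tokenizer used).
text = "Hello world"
# Each token's (start, end) character span in the original string,
# as a fast tokenizer would record in its offset mapping.
token_offsets = [(0, 5), (6, 11)]  # "Hello" -> chars 0..4, "world" -> chars 6..10

def char_to_token(char_index):
    """Return the index of the token covering char_index, or None."""
    for token_index, (start, end) in enumerate(token_offsets):
        if start <= char_index < end:
            return token_index
    return None  # e.g. the whitespace between tokens

print(char_to_token(1))  # inside "Hello" -> 0
print(char_to_token(6))  # inside "world" -> 1
print(char_to_token(5))  # the space between tokens -> None
```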
Error code:
from transformers.tokenization_utils_base import BatchEncoding

speech = {"input_values": ["Hello", "Welcome", "Everyone"]}
encoded_inputs = BatchEncoding(speech)  # no `encoding` passed, so _encodings stays empty
encoded_inputs.char_to_token(1)  # <---- ValueError is raised here
print(encoded_inputs)
We have to pass our encodings when constructing the BatchEncoding. This is the most important part: otherwise _encodings stays empty and char_to_token raises this error.
Fix code:
from transformers.tokenization_utils_base import BatchEncoding
from tokenizers import Encoding as EncodingFast

speech = {"input_values": ["Hello", "Welcome", "Everyone"]}
encoded_inputs = BatchEncoding(speech, encoding=EncodingFast())  # <--- encoding is passed here
encoded_inputs.char_to_token(1)  # _encodings is now populated, so no ValueError
print(encoded_inputs)
In practice, these encoding objects come from a fast (Rust-based) tokenizer; slow Python tokenizers never produce them, which is exactly what the error message refers to.