char_to_token() is not available when using Python based tokenizers
Package:
transformers

Exception Class:
ValueError
Raise code
"""
Returns:
:obj:`int`: Index of the token.
"""
if not self._encodings:
raise ValueError("char_to_token() is not available when using Python based tokenizers")
if char_index is not None:
batch_index = batch_or_char_index
else:
batch_index = 0
char_index = batch_or_char_index
return self._encodings[batch_index].char_to_token(char_index, sequence_index)
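The positional-argument dispatch in the snippet above is easy to miss: a single argument is treated as a character index into the first batch item, while two arguments select a batch item first. The helper below is a hypothetical, simplified stand-in (the toy "encodings" are plain callables, not real Encoding objects) that reproduces only that dispatch logic:

```python
def char_to_token(encodings, batch_or_char_index, char_index=None):
    # Mirrors the library's dispatch: with one argument, it is the char
    # index and the batch index defaults to 0; with two, the first is
    # the batch index and the second is the char index.
    if not encodings:
        raise ValueError("char_to_token() is not available when using Python based tokenizers")
    if char_index is not None:
        batch_index = batch_or_char_index
    else:
        batch_index = 0
        char_index = batch_or_char_index
    return encodings[batch_index](char_index)

# Toy "encodings": each maps a character index to a token index.
encs = [lambda c: c // 2, lambda c: c // 3]
print(char_to_token(encs, 5))     # batch 0, char 5 -> 2
print(char_to_token(encs, 1, 6))  # batch 1, char 6 -> 2
```

Note that an empty `encodings` list triggers exactly the ValueError from the raise code, which is the situation the error example below runs into.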
Links to the raise (1):
https://github.com/huggingface/transformers/blob/bd9871657bb9500a9f4437a873db6df5f1ae6dbb/src/transformers/tokenization_utils_base.py#L547

Ways to fix
The BatchEncoding class is derived from a Python dictionary and can be used like one. In addition, it exposes utility methods that map from word/character space to token space. char_to_token returns the index of the token, in the encoded output, that contains a given character of the original string for a sequence of the batch. These mappings are only available when the BatchEncoding holds the fast (Rust-based) encodings; if it was built without them, self._encodings is empty and char_to_token raises the ValueError above.
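What char_to_token computes can be sketched with plain character offsets. The helper below is a hypothetical illustration, not the library's implementation: it assumes each token is described by a (start, end) character span in the original string, the same shape a fast tokenizer reports with `return_offsets_mapping=True`.

```python
def char_to_token_from_offsets(offsets, char_index):
    """Return the index of the token whose (start, end) span covers char_index.

    offsets: list of (start, end) character spans, one per token.
    """
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None  # e.g. whitespace between tokens

# "Hello world" tokenized as ["Hello", "world"]:
offsets = [(0, 5), (6, 11)]
print(char_to_token_from_offsets(offsets, 1))  # -> 0 ("Hello")
print(char_to_token_from_offsets(offsets, 6))  # -> 1 ("world")
print(char_to_token_from_offsets(offsets, 5))  # -> None (the space)
```

A Python-based (slow) tokenizer never produces these offset spans, which is why the mapping methods are unavailable for it.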
Error code:
from transformers.tokenization_utils_base import BatchEncoding

speech = {"input_values": ["Hello", "Welcome", "Everyone"]}
encoded_inputs = BatchEncoding(speech)  # no `encoding` passed, so _encodings is empty
encoded_inputs.char_to_token(1)  # <---- ValueError raised here
print(encoded_inputs)
We have to pass the fast encoding(s) when constructing the BatchEncoding; otherwise the character-to-token mapping is unavailable and the error above is raised. (In normal use, encoding with a fast tokenizer, e.g. one loaded via AutoTokenizer with use_fast=True, populates these encodings automatically.)
Fix code:
from transformers.tokenization_utils_base import BatchEncoding
from tokenizers import Encoding as EncodingFast

speech = {"input_values": ["Hello", "Welcome", "Everyone"]}
encoded_inputs = BatchEncoding(speech, encoding=EncodingFast())  # <--- `encoding` is supplied here
encoded_inputs.char_to_token(1)  # no longer raises the ValueError
print(encoded_inputs)