word_ids() is not available when using Python-based tokenizers
Package:
transformers

Exception Class:
ValueError
Raise code
"""
Returns:
:obj:`List[Optional[int]]`: A list indicating the word corresponding to each token. Special tokens added by
the tokenizer are mapped to :obj:`None` and other tokens are mapped to the index of their corresponding
word (several tokens will be mapped to the same word index if they are parts of that word).
"""
if not self._encodings:
raise ValueError("word_ids() is not available when using Python-based tokenizers")
return self._encodings[batch_index].word_ids
def token_to_sequence(self, batch_or_token_index: int, token_index: Optional[int] = None) -> int:
"""
Get the index of the sequence represented by the given token. In the general use case, this method returns
:obj:`0` for a single sequence or the first sequence of a pair, and :obj:`1` for the second sequence of a pair
"""
Links to the raise (1)
https://github.com/huggingface/transformers/blob/bd9871657bb9500a9f4437a873db6df5f1ae6dbb/src/transformers/tokenization_utils_base.py#L347
Ways to fix
Summary:
This exception is raised when word_ids() is called on a BatchEncoding whose internal _encodings list is empty. That happens when the BatchEncoding comes from a slow, Python-based tokenizer, or when it is constructed by hand without the optional encoding parameter. To build a BatchEncoding on which word_ids() works, pass a dictionary of data as the first argument and an Encoding object as the encoding argument. The Encoding class comes from the tokenizers package.
Code to Reproduce the Error (WRONG):
import transformers.tokenization_utils_base as tub

# No `encoding` argument is given, so be._encodings is empty
be = tub.BatchEncoding({'a': [1, 2, 3]})
be.word_ids()  # raises ValueError: word_ids() is not available when using Python-based tokenizers
Working Version (Fixed):
import transformers.tokenization_utils_base as tub
from tokenizers import Encoding

# An Encoding object backs the BatchEncoding, so _encodings is populated
en = Encoding()
be = tub.BatchEncoding({'a': [1, 2, 3]}, encoding=en)
be.word_ids()  # no longer raises; returns the word ids of `en`
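This works because BatchEncoding stores the encoding argument in its _encodings attribute, which word_ids() reads from. Note, though, that an empty Encoding carries no token-to-word information, so in real applications the usual fix is to produce the BatchEncoding with a fast tokenizer (for example via AutoTokenizer.from_pretrained(..., use_fast=True), as sketched above) rather than constructing one by hand.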