text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
Package:
transformers

Exception Class:
ValueError
Raise code
def _is_valid_text_input(t):
    if isinstance(t, str):
        # Strings are fine
        return True
    elif isinstance(t, (list, tuple)):
        # Lists are fine as long as they are empty, a list of strings,
        # or a list of empty lists / lists of strings
        if len(t) == 0:
            return True
        elif isinstance(t[0], str):
            return True
        elif isinstance(t[0], (list, tuple)):
            return len(t[0]) == 0 or isinstance(t[0][0], str)
        else:
            return False
    else:
        return False

if not _is_valid_text_input(text):
    raise ValueError(
        "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
        "or `List[List[str]]` (batch of pretokenized examples)."
    )

if text_pair is not None and not _is_valid_text_input(text_pair):
    raise ValueError(
        "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
        "or `List[List[str]]` (batch of pretokenized examples)."
    )
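The check above means the tokenizer accepts exactly three input shapes. A minimal sketch of the accepted forms (using bert-base-uncased as an example checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# `str`: a single example
tokenizer("a single sentence")
# `List[str]`: a batch of examples...
tokenizer(["first sentence", "second sentence"])
# ... or a single pretokenized example
tokenizer(["a", "single", "pretokenized", "example"], is_split_into_words=True)
# `List[List[str]]`: a batch of pretokenized examples
tokenizer([["first", "example"], ["second", "example"]], is_split_into_words=True)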
Links to the raise (2)
https://github.com/huggingface/transformers/blob/bd9871657bb9500a9f4437a873db6df5f1ae6dbb/src/transformers/tokenization_utils_base.py#L2262
https://github.com/huggingface/transformers/blob/bd9871657bb9500a9f4437a873db6df5f1ae6dbb/src/transformers/tokenization_utils_base.py#L2268

Ways to fix
As pointed out in the official Transformers documentation, the pipeline function contains its own tokenizer that is used to encode the data for the model, so additional encoding of the input is not needed. I.e. the input to the pipeline should be a raw string, as shown in the fixed code below.
Steps to reproduce the error:
- Set up the environment:
$ pip install --user pipenv
$ mkdir test_folder
$ cd test_folder
$ pipenv shell
- Install PyTorch
$ pipenv install torch
- Install Transformers
$ pipenv install transformers
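- (Optional) Verify the installation before running the test code:
$ pipenv run python -c "import torch, transformers; print(transformers.__version__)"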
- Run test code
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
# Tokenizer and model should come from the same checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
feature_extractor = pipeline('feature-extraction', model=model, tokenizer=tokenizer)
text = "This is a sample text a feature should be extracted from"
encoded = tokenizer.encode_plus(
    text=text,
    add_special_tokens=True,
    max_length=64,
    padding='max_length',  # pad_to_max_length is deprecated
    return_attention_mask=True,
    truncation=True,
    return_tensors='pt',
)
feature_extractor(encoded)  # Here the encoded input is supplied to the pipeline, raising the ValueError
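Here `encoded` is a `BatchEncoding` (a dict-like object holding `input_ids`, `attention_mask`, etc.), not a `str` or a list of strings, so the `_is_valid_text_input` check shown above fails and the `ValueError` is raised.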
Fixed version of the code:
A raw string should be given to the pipeline:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
feature_extractor = pipeline('feature-extraction', model=model, tokenizer=tokenizer)
text = "This is a sample text a feature should be extracted from"
feature_extractor(text)  # a raw string is given to the pipeline because it has its own tokenizer
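For reference, the feature-extraction pipeline returns the model's last hidden states as nested Python lists (the exact nesting can vary between transformers versions; this sketch assumes a single string input yielding a [1, sequence_length, hidden_size] structure):

features = feature_extractor(text)
print(len(features[0]))     # number of tokens, including special tokens
print(len(features[0][0]))  # hidden size (768 for bert-base-cased)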
Error code:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('../input/bert-base-uncased')
model = BertModel.from_pretrained("../input/bert-base-uncased")
text = 5
encoded_input = tokenizer(text, return_tensors='pt') #Error here
output = model(**encoded_input)
print(output)
The error is raised because the text we pass to the tokenizer is not a string. As the raise code above shows, the tokenizer checks whether the input is a string, a list, or a tuple.
Fix code:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('../input/bert-base-uncased')
model = BertModel.from_pretrained("../input/bert-base-uncased")
text = ("something")
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
print(output)
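If the input comes from a dataset, the same error often appears because a column contains non-string values (e.g. missing entries read in as floats). A minimal defensive sketch (the example values are made up):

texts = ["first example", 5, None]  # mixed types, as often read from a CSV
clean = [str(t) if t is not None else "" for t in texts]
encoded_input = tokenizer(clean, padding=True, return_tensors='pt')
output = model(**encoded_input)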
You can get the bert-base-uncased dataset here