text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
Package:
transformers

Exception Class:
ValueError
Raise code
def _is_valid_text_input(t):
    if isinstance(t, str):
        return True
    elif isinstance(t, (list, tuple)):
        if len(t) == 0:
            return True
        elif isinstance(t[0], str):
            return True
        elif isinstance(t[0], (list, tuple)):
            return len(t[0]) == 0 or isinstance(t[0][0], str)
        else:
            return False
    else:
        return False

if not _is_valid_text_input(text):
    raise ValueError(
        "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
        "or `List[List[str]]` (batch of pretokenized examples)."
    )

if text_pair is not None and not _is_valid_text_input(text_pair):
    raise ValueError(
        "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
        "or `List[List[str]]` (batch of pretokenized examples)."
    )
Links to the raise (2):
https://github.com/huggingface/transformers/blob/bd9871657bb9500a9f4437a873db6df5f1ae6dbb/src/transformers/tokenization_utils_base.py#L2262
https://github.com/huggingface/transformers/blob/bd9871657bb9500a9f4437a873db6df5f1ae6dbb/src/transformers/tokenization_utils_base.py#L2268
Ways to fix
As pointed out in the official Transformers documentation, the pipeline function contains its own tokenizer, which it uses to encode data for the model. Additional encoding of the input is therefore not needed: the input to the pipeline should be a raw string.
Steps to reproduce the error:
- Set up the environment:
$ pip install --user pipenv
$ mkdir test_folder
$ cd test_folder
$ pipenv shell
- Install PyTorch:
$ pipenv install torch
- Install Transformers:
$ pipenv install transformers
- Run the test code:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# tokenizer and model should come from the same checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
feature_extractor = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

text = "This is a sample text a feature should be extracted from"
encoded = tokenizer.encode_plus(
    text=text,
    add_special_tokens=True,
    max_length=64,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    return_attention_mask=True,
    truncation=True,
    return_tensors='pt',
)
feature_extractor(encoded)  # the already-encoded input is supplied to the pipeline, raising the ValueError
Fixed version of the code:
Raw text should be given to the pipeline:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
feature_extractor = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

text = "This is a sample text a feature should be extracted from"
feature_extractor(text)  # raw text is given to the pipeline because it has its own encoder
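If you do want to control the encoding yourself, skip the pipeline and pass the encoded input directly to the model instead (a sketch; for BERT-style models the extracted features are in last_hidden_state):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

encoded = tokenizer("This is a sample text a feature should be extracted from", return_tensors='pt')
with torch.no_grad():               # inference only, no gradients needed
    output = model(**encoded)
features = output.last_hidden_state  # shape: (batch_size, sequence_length, hidden_size)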
Error code:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('../input/bert-base-uncased')
model = BertModel.from_pretrained("../input/bert-base-uncased")
text = 5
encoded_input = tokenizer(text, return_tensors='pt')  # ValueError raised here: `text` is an int, not a str
output = model(**encoded_input)
print(output)
The error is raised because the `text` passed to the tokenizer is not a string. As the check shown above demonstrates, the tokenizer only accepts a string, a list, or a tuple.
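If the value comes from somewhere else (for example a dataframe column that contains numbers), convert it to str before tokenizing. A sketch, using the Hub checkpoint name instead of the local path:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

value = 5          # e.g. a number read from a CSV
text = str(value)  # the tokenizer only accepts str / List[str] / List[List[str]]
encoded_input = tokenizer(text, return_tensors='pt')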
Fix code:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('../input/bert-base-uncased')
model = BertModel.from_pretrained("../input/bert-base-uncased")
text = ("something")
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
print(output)
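The same call also accepts a batch, as long as every element is a string; with return_tensors='pt' the batch needs padding so all sequences have equal length (a sketch):

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

texts = ["first example", "a second, longer example"]  # List[str]: a batch
encoded_batch = tokenizer(texts, padding=True, return_tensors='pt')
output = model(**encoded_batch)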
The `bert-base-uncased` weights loaded above from the local `../input/bert-base-uncased` path are also available on the Hugging Face Hub under the same name.