text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

Package: transformers
Exception Class: ValueError

Raise code

        def _is_valid_text_input(t):
            if isinstance(t, str):
                # Strings are fine
                return True
            elif isinstance(t, (list, tuple)):
                # Lists are fine as long as they are...
                if len(t) == 0:
                    # ... empty
                    return True
                elif isinstance(t[0], str):
                    # ... a list of strings
                    return True
                elif isinstance(t[0], (list, tuple)):
                    # ... a list with an empty list or with a list of strings
                    return len(t[0]) == 0 or isinstance(t[0][0], str)
                else:
                    return False
            else:
                return False

        if not _is_valid_text_input(text):
            raise ValueError(
                "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
                "or `List[List[str]]` (batch of pretokenized examples)."
            )

        if text_pair is not None and not _is_valid_text_input(text_pair):
            raise ValueError(
                "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
                "or `List[List[str]]` (batch of pretokenized examples)."
            )
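Concretely, all three accepted input shapes pass this check, while anything else (an int, a dict of tensors, etc.) raises the ValueError. A minimal sketch, assuming the bert-base-uncased checkpoint is available:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer("a single example")                # str: OK
tokenizer(["a batch", "of two examples"])    # List[str]: OK
tokenizer([["one", "pretokenized", "example"]],
          is_split_into_words=True)          # List[List[str]]: OK
tokenizer(5)                                 # int: raises the ValueError above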

Ways to fix


As pointed out in the official Transformers documentation, the pipeline function contains its own tokenizer, which it uses to encode data for the model. Additional encoding of the input data is therefore not needed.

In other words, the input to the pipeline should be a raw string, as the fixed code below demonstrates.

Steps to reproduce the error:

  • Setup environment:

$ pip install --user pipenv

$ mkdir test_folder

$ cd test_folder

$ pipenv shell

  • Install PyTorch

pipenv install torch

  • Install transformers

pipenv install transformers

  • Run test code

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# use matching checkpoints for the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

feature_extractor = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

text = "This is a sample text a feature should be extracted from"
encoded = tokenizer.encode_plus(
    text=text,
    add_special_tokens=True,
    max_length=64,
    padding='max_length',  # pad_to_max_length is deprecated
    return_attention_mask=True,
    truncation=True,
    return_tensors='pt',
)

feature_extractor(encoded)  # Error: the encoded input is supplied to the pipeline
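Running this raises the ValueError above: encoded is a BatchEncoding (a dict of tensors), not raw text, so the pipeline's internal tokenizer rejects it.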

Fixed version of the code:

Raw text should be given to the pipeline:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
feature_extractor = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

text = "This is a sample text a feature should be extracted from"

feature_extractor(text)  # raw text is given to the pipeline, which does its own encoding
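The pipeline also accepts a batch of raw strings (List[str]) and encodes each one with its internal tokenizer. A minimal sketch, reusing feature_extractor from above:

texts = ["first raw example", "second raw example"]
features = feature_extractor(texts)  # the pipeline's own tokenizer encodes each string
print(len(features))                 # 2, one feature list per input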
Answered Jun 02, 2021 by kellemnegasi (31.6k)

Error code:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('../input/bert-base-uncased')
model = BertModel.from_pretrained("../input/bert-base-uncased")
text = 5  # not a string
encoded_input = tokenizer(text, return_tensors='pt')  # ValueError raised here
output = model(**encoded_input)
print(output)

The error is raised because the text we pass to the tokenizer is not a string. As you can see in the raise code above, the tokenizer checks whether the input is a string, a list, or a tuple.
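If the input may arrive as a non-string value (as with text = 5 above), one defensive option is to cast it to str before tokenizing. A minimal sketch:

text = 5
if not isinstance(text, str):
    text = str(text)  # coerce to a string so the tokenizer's type check passes
encoded_input = tokenizer(text, return_tensors='pt')  # now encodes "5" without error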

Fix code:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('../input/bert-base-uncased')
model = BertModel.from_pretrained("../input/bert-base-uncased")
text = "something"  # a plain string is a valid input
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
print(output)
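Since the tokenizer also accepts List[str], several examples can be encoded in one call. A minimal sketch, reusing the tokenizer and model above:

batch = tokenizer(["first example", "second example"], padding=True, return_tensors='pt')
output = model(**batch)                 # one output row per input string
print(output.last_hidden_state.shape)   # torch.Size([2, seq_len, hidden_size])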

You can get the bert-base-uncased model files here

Answered Jun 02, 2021 by anonim (13.0k)
