
text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

Package: transformers
Exception Class: ValueError

Raise code

def _is_valid_text_input(t):
    if isinstance(t, str):
        # Strings are fine
        return True
    elif isinstance(t, (list, tuple)):
        # Lists are fine as long as they are...
        if len(t) == 0:
            # ... empty
            return True
        elif isinstance(t[0], str):
            # ... a list of strings
            return True
        elif isinstance(t[0], (list, tuple)):
            # ... a list with an empty list or with a list of strings
            return len(t[0]) == 0 or isinstance(t[0][0], str)
        else:
            return False
    else:
        return False

if not _is_valid_text_input(text):
    raise ValueError(
        "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
        "or `List[List[str]]` (batch of pretokenized examples)."
    )

if text_pair is not None and not _is_valid_text_input(text_pair):
    raise ValueError(
        "text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) "
        "or `List[List[str]]` (batch of pretokenized examples)."
    )
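For reference, here is what each of the accepted input shapes looks like in practice. A minimal sketch (the checkpoint name is just an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Each of these passes _is_valid_text_input:
tokenizer("a single example")                   # str
tokenizer(["first example", "second example"])  # List[str]: a batch of examples
tokenizer(["a", "single", "pretokenized", "example"],
          is_split_into_words=True)             # List[str]: one pretokenized example
tokenizer([["two", "pretokenized"], ["examples", "here"]],
          is_split_into_words=True)             # List[List[str]]: batch of pretokenized examples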

Ways to fix


As pointed out in the official transformers documentation, the pipeline function contains its own tokenizer, which is used to encode the data for the model. Therefore no additional encoding of the input data is needed.

That is, the input to the pipeline should be a raw string.
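You can see this by inspecting the pipeline object itself, which exposes the tokenizer it applies internally. A minimal sketch (the checkpoint name is just an example):

from transformers import pipeline

# The pipeline bundles its own tokenizer, so raw strings are tokenized internally.
nlp = pipeline("feature-extraction", model="bert-base-cased")
print(nlp.tokenizer)  # the tokenizer the pipeline applies to raw string input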

Steps to reproduce the error:

  • Set up the environment:

$ pip install --user pipenv

$ mkdir test_folder

$ cd test_folder

$ pipenv shell

  • Install PyTorch

pipenv install torch

  • Install transformers

pipenv install transformers

  • Run test code

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Use matching checkpoints for the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")

feature_extractor = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

text = "This is a sample text a feature should be extracted from"
encoded = tokenizer.encode_plus(
    text=text,
    add_special_tokens=True,
    max_length=64,
    padding='max_length',
    return_attention_mask=True,
    truncation=True,
    return_tensors='pt',
)

feature_extractor(encoded)  # Error: an already-encoded input is supplied to the pipeline instead of a raw string

Fixed version of the code:

The raw text should be given to the pipeline:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
feature_extractor = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

text = "This is a sample text a feature should be extracted from"

feature_extractor(text)  # raw text is given to the pipeline because it has its own encoder
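The feature-extraction pipeline returns plain nested Python lists. A minimal sketch of reading token-level embeddings, assuming a bare encoder loaded with AutoModel rather than the classification model above:

from transformers import AutoModel, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")
feature_extractor = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

features = feature_extractor("This is a sample text")
# Nested list shaped [batch, tokens, hidden_size] for a base encoder
print(len(features), len(features[0]), len(features[0][0]))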
Jun 02, 2021 kellemnegasi answer
kellemnegasi 30.0k

Error code:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('../input/bert-base-uncased')
model = BertModel.from_pretrained("../input/bert-base-uncased")
text = 5  # not a string, so the tokenizer's input check fails
encoded_input = tokenizer(text, return_tensors='pt')  # ValueError raised here
output = model(**encoded_input)
print(output)

The error is raised because the text we pass to the tokenizer is not a string. As you can see in the documentation (and in the raise code above), the tokenizer checks whether the input is a string, a list, or a tuple.
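If the input may come from elsewhere in your program, it can help to validate it before tokenizing. A minimal sketch (safe_tokenize is a hypothetical helper, not part of transformers):

# Hypothetical helper: validate or coerce the input before tokenizing
def safe_tokenize(tokenizer, text):
    if not isinstance(text, str):
        text = str(text)  # or raise a clearer, application-specific error
    return tokenizer(text, return_tensors='pt')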

Fix code:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('../input/bert-base-uncased')
model = BertModel.from_pretrained("../input/bert-base-uncased")
text = "something"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
print(output)
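The returned output is a model output object; the token embeddings can be read from its last_hidden_state field (a minimal sketch, continuing the code above):

# last_hidden_state holds the token embeddings: [batch, tokens, hidden_size]
print(output.last_hidden_state.shape)  # e.g. torch.Size([1, 3, 768]) for bert-base-uncased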

You can get the bert-base-uncased dataset here

Jun 02, 2021 anonim answer
anonim 13.0k
