How to Use LLMs for Classification Tasks
22 Aug 2025

Understanding the different ways in which large language models (LLMs) are evaluated can be quite confusing at first. This is partly because different sources use models differently at evaluation time, and partly because researchers often omit details about their evaluation methods. In addition, most articles on this topic do not clearly spell out some basic facts about language models that I think are useful for building an intuition here. So, in this article, I’d like to go over the basics of LLM evaluation, with a special emphasis on classification tasks, since these aren’t technically what these models were designed for. Hopefully this will give you a better intuition about how these models produce the reported metrics that they are so often reduced to. I will assume you are familiar with both masked (encoder-only) and causal or autoregressive (decoder-only) language models (if not, you can learn about them here). Also, when I say LLMs, I’ll always be referring to causal language models.
Why Are LLMs Used for Classification?
The short answer is: why not? The long answer is that LLMs are often described as foundational models, due to their ability to perform many tasks, so long as these tasks are turned into text. These tasks can take such forms as summarizing a news article, composing an email or writing code. Despite the many forms that these tasks can take, they can be categorized into two different groups: classification tasks and generation tasks. Classification tasks are typical machine learning tasks that require that models predict one or many labels from a finite and discrete set of labels, such as assigning an email a label from the set {spam, not-spam}. Generation tasks require that models produce text of an undetermined length, such as summarizing a news article or writing a function in Python or in Java. No model would be foundational if it could only handle one of these two types of tasks.
Since causal language models are trained to predict the next token in a given sequence of tokens, it’s arguable that generation tasks are quite at home with LLMs, as this is precisely what seq2seq models were designed for. Such models can always “read” a given article in English and autoregressively produce its translation into French, or they can read instructions for writing some code before producing it autoregressively. This variety of generation tasks is reflected in the different metrics that exist to evaluate them, e.g. ROUGE for summarization, BLEU for machine translation, or exact match more generally.
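Just to make those metrics concrete, here’s a minimal sketch of computing them (I’m assuming the Hugging Face `evaluate` library here; the sentences are made up):

```python
import evaluate  # pip install evaluate rouge_score

predictions = ["The cat sat on the mat."]
references = [["A cat was sitting on the mat."]]  # one (or more) references per prediction

bleu = evaluate.load("bleu")    # n-gram precision, common for machine translation
rouge = evaluate.load("rouge")  # n-gram / longest-common-subsequence overlap, common for summarization

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))
```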
But LLMs also need to be able to handle classification tasks. Many of the tasks that the NLP community has used for decades as benchmarks of model performance are classification tasks, e.g. NLI, sentiment analysis, etc. As a result, to show the various uses of a model, new LLMs are often presented along with their performance on such benchmarks, e.g. MMLU (a classification task). So, at this point it is fair to ask: how do we get an LLM, trained for next-token prediction, to produce one of a few possible labels for a classification task? Unlike generation tasks, this isn’t immediately clear, and if you think about it for a bit, you may come up with several ways to do this. So, before we get into the ways this is actually done in practice, let’s first discuss some of the fundamentals of language modeling, so that we may reason about what we can and cannot do at inference time.
How To Use The Output Of LLMs
Language models are, at the core, functions of the form \(f: T \rightarrow P\), where \(T\) is the set of all possible sequences of tokens from some vocabulary \(V\) and \(P\) is a probability simplex of size \(\vert V\vert\). \(f\) does not say anything about (i) how to interpret or (ii) how to use its outputs \(p\in P\).
The way we interpret \(p\) is determined by what we train the model to do, i.e. how we learn \(f\). Masked language models (MLMs) are trained to predict masked tokens, so we’d interpret each \(p_i\) as the probability of token \(i\) being the missing one. Similarly, causal language models (CLMs) are trained to predict the next token, so we interpret \(p_i\) as the probability that token \(i\) is the next one. In short, the choice of training objective prevents us from freely interpreting \(p\) in any way we want.
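To make this concrete, here’s a minimal sketch of how we’d actually get \(p\) out of a causal language model (I’m assuming the Hugging Face transformers library and GPT-2 purely for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works here; GPT-2 is just a small, convenient example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The movie was really", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, |V|)

# For a CLM, the logits at the last position score the *next* token,
# so p is simply their softmax: a distribution over the whole vocabulary.
p = torch.softmax(logits[0, -1], dim=-1)  # shape: (|V|,)
```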
However, when it comes to how to use \(p\), we are less restricted. In fact, once a language model has provided \(p\), its work is technically done, at least for that time step. And so long as we respect the semantics induced by its training objective, we can do whatever we want with \(p\). In the case of MLMs, this freedom has not really been exercised, likely because the training objective was never really a task of interest, but mostly an artifact to get the models to learn rich contextual representations of tokens. These representations, and not \(p\), were always the goal, so that we could feed them to a classifier that would project them to some label space in order to make predictions for a given downstream task.
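This classic recipe, encoder representations plus a task-specific classification head, looks roughly like this (again with transformers, and with an illustrative BERT checkpoint and label count):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A BERT-style encoder with a (randomly initialized) classification head on top.
# The head projects the pooled representation to the label space, and is trained
# (or the whole model is fine-tuned end-to-end) on labeled data for the task.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3,  # e.g. {positive, negative, neutral}
)

inputs = tokenizer("I loved this movie!", return_tensors="pt")
logits = model(**inputs).logits            # shape: (1, 3), one score per label
prediction = logits.argmax(dim=-1).item()  # meaningful only after training
```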
The idea that “we can do whatever we want with the output of an LLM” really becomes clear when looking at how LLMs have been used for downstream tasks. One easy way to see this is in the many different ways we can use \(p\) to sample the next token during autoregressive generation, e.g. greedy decoding or top-p (if not familiar, see Task 4). Similarly, there is also variety in the way we can get labels for classification from an LLM. One thing we can do is take the same approach as with MLMs. That is, we can use the representations produced by the model as input to some downstream classifier that we train (or fine-tune end-to-end) in a supervised manner for some downstream task of our choice. Before the advent of LLMs, this was the natural choice, and it is what the first GPT model did. But the idea that models can be foundational likely comes from the realization that \(p\) can be used to directly extract labels for many downstream tasks, without the need for a separate classifier or extra labeled data. This was, to my knowledge, first shown by the authors of GPT-2, and further developed with GPT-3 as part of in-context learning (ICL) and the concept of per-token likelihood.
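As an example of that freedom, here’s a rough sketch of two such sampling strategies over a next-token distribution \(p\) like the one computed above (written from the standard definitions, not taken from any particular library):

```python
import torch

def greedy(p: torch.Tensor) -> int:
    """Greedy decoding: always pick the most probable next token."""
    return int(torch.argmax(p))

def top_p(p: torch.Tensor, threshold: float = 0.9) -> int:
    """Nucleus (top-p) sampling: sample from the smallest set of tokens whose
    cumulative probability reaches the threshold."""
    probs, indices = torch.sort(p, descending=True)
    cumulative = torch.cumsum(probs, dim=-1)
    # Keep every token up to and including the first one that crosses the threshold.
    cutoff = int(torch.searchsorted(cumulative, threshold)) + 1
    kept = probs[:cutoff] / probs[:cutoff].sum()  # renormalize the nucleus
    choice = torch.multinomial(kept, num_samples=1)
    return int(indices[choice])
```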
Per-Token Likelihood
The fact that the creators of GPT-2 were able to directly use it for multiple downstream tasks, in combination with the fact that this was seen to work better the larger the model got, is what makes GPT-2 arguably the first LLM (though some would argue that scale is one of two ingredients required for the modern LLM, see here for an introduction). But it was with GPT-3 that this approach was first seen as competitive with the dominant approach of fine-tuning. It was also with GPT-3 that the concept of per-token likelihood materialized as a way to extract classification labels from the output of a causal language model.
To evaluate on classification tasks, the authors of GPT-3 would prompt the model so that the predicted tokens had to be one of the task’s labels. For example, here is a multiple-choice question from the OpenBookQA dataset:
Organisms require energy in order to do what?
- Mature and develop
- Rest soundly
- Absorb light
- Take in nutrients
The authors of GPT-3 presented the model with questions like this, compared the probability of each of the four answers, and picked the one with the highest probability. In other words, they assumed that only those continuations were possible and selected the answer the model deemed most likely. But given that different answers can have a different number of tokens, they normalized the probability of each completion by its number of tokens, so that the selection would not be biased by answer length. As nicely formalized by Leo Gao, the score \(s\) of the answer encoded by tokens \(t_m\) to \(t_n\) is given by:
\[s(t_{m:n}) = \frac{1}{n-m+1} \sum_{i=m}^{n} \log p(t_i\mid t_{1:i - 1}),\]where \(p\) is the output distribution of our model that we’ve been talking about, and \(n - m + 1\) is the number of answer tokens. Note that we can work with \(\log\) probabilities, i.e. we sum them, or we can multiply the actual probabilities; either way, the normalization keeps the selection from being biased by answer length. Let’s visualize this with a toy example (a code sketch follows right after it). Say we are doing sentiment analysis with the following labels and their corresponding tokenized sequences:
- positive: [" positive"]
- negative: [" negative"]
- neutral: [" neu", "tral"]
If we compute the unnormalized scores \(s\) for each answer, i.e. \(s(t_{m:n}) = \sum_{i=m}^{n} \log p(t_i\mid t_{1:i - 1})\), we have:
\[\begin{align} s(\text{positive}) &= \log p(\text{" positive"}\mid t_{1:m-1}) = 0.5, \\ s(\text{negative}) &= \log p(\text{" negative"}\mid t_{1:m-1}) = 0.7, \\ s(\text{neutral}) &= \log p(\text{" neu"}\mid t_{1:m-1}) + \log p(\text{"tral"}\mid t_{1:m-1}, \text{" neu"}) \\ &= 0.3 + 0.5 = 0.8. \end{align}\]Here we’d predict the label neutral, but only because it has two tokens. If we instead normalize by token length, we have:
\[\begin{align} s(\text{positive}) &= \log p(\text{" positive"}\mid t_{1:m-1}) = 0.5, \\ s(\text{negative}) &= \log p(\text{" negative"}\mid t_{1:m-1}) = 0.7, \\ s(\text{neutral}) &= \frac{1}{2} \left(\log p(\text{" neu"}\mid t_{1:m-1}) + \log p(\text{"tral"}\mid t_{1:m-1}, \text{" neu"})\right) \\ &= \frac{1}{2}(0.3 + 0.5) = 0.4, \end{align}\]which would lead us to predict the label negative. In short, per-token likelihood treats a set of labels as possible continuations of the input and picks the most probable one, in a sensible way. Gao also points out that counting tokens makes this metric sensitive to the choice of tokenizer, and shows how the LM Evaluation Harness library supports per-character likelihood, which is the same idea but normalizing by the number of bytes instead of the number of tokens. Note, however, that it’s not common to change tokenizers for a given model, nor is it common to compare probabilities across different models, as these are not necessarily calibrated. The important thing is that the token sequences of the different possible labels are treated equally, which is already achieved by normalizing by token length. It is also not common to use different languages for different labels in the same task, so the issue of byte premium should not be a problem when computing per-character likelihood.
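Here’s the code sketch promised above: a rough implementation of per-token likelihood scoring (I’m again assuming transformers and GPT-2, and the prompt and labels are made up):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_score(prompt: str, answer: str, normalize: bool = True) -> float:
    """Sum (or mean, if normalize=True) of the log-probabilities of the answer
    tokens, each conditioned on the prompt and the preceding answer tokens."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    offset = prompt_ids.shape[1]
    score = 0.0
    for j in range(answer_ids.shape[1]):
        # The logits at position i score the token at position i + 1.
        token_id = input_ids[0, offset + j]
        score += log_probs[0, offset + j - 1, token_id].item()
    if normalize:
        score /= answer_ids.shape[1]
    return score

prompt = "Review: I loved this movie! Sentiment:"
labels = [" positive", " negative", " neutral"]
prediction = max(labels, key=lambda label: answer_score(prompt, label))
```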
Per-token likelihood is a nice way to see how to generally extract classification labels from a causal language model. But let’s not forget we have freedom to do this in any way we want. For example, the very authors of GPT-3 used a slightly different form of normalization for some tasks (see Section 2.4), because this produced better results. Similarly, you may find that for some tasks, using an unnormalized score may produce better results. You are free to make the choice. However, you always want to reason about where the benefits of your choice come from, as it may be an artifact of your tokenizer and/or your dataset and not something that reflects your model’s general performance on the task.
Now, since the prehistoric times of GPT-3, things have changed a bit, to the point that it’s now common to evaluate LLMs on classification tasks by simply using greedy decoding over the entire vocabulary to extract the predicted label. Instruction-tuned models are particularly good at this, but with the right prompt, e.g. a well-crafted ICL example, most modern LLMs will stick to predicting one of the possible labels. But even in such settings, you are still free to make choices that may impact model performance. This brings us to the concept of verbalizers.
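A sketch of this “just generate the label” approach, reusing the same kind of causal LM as before (the prompt template and post-processing are my own illustrative choices):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = (
    "Classify the sentiment of the review as positive, negative or neutral.\n"
    "Review: I loved this movie!\n"
    "Sentiment:"
)
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=3,   # enough to cover the longest label
    do_sample=False,    # greedy decoding over the entire vocabulary
)

# Keep only the newly generated tokens and check which label they spell out.
generated = tokenizer.decode(output_ids[0, inputs.input_ids.shape[1]:]).strip().lower()
labels = ["positive", "negative", "neutral"]
prediction = next((label for label in labels if generated.startswith(label)), None)
```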
Verbalizers
For a given set of labels, e.g. {positive, negative, neutral}, we can always decide which tokens to map each label to. For example, say each label is tokenized as follows:
- positive: [" pos", "itive"]
- negative: [" neg", "ative"]
- neutral: [" neu", "tral"]
We could decide to only look at the first token of each label, which also saves us the extra forward passes required by multi-token labels. This means we’d only be comparing the logits of the tokens “ pos”, “ neg” and “ neu” to make a decision. This idea that we can map labels to tokens of our choice was formalized by Schick and Schütze with the concept of verbalizers. Verbalizers are injective functions of the form \(v: L \rightarrow V\), where \(L\) is a set of labels and \(V\) is the vocabulary of your language model. This means that your choice of \(v\) tells you which tokens in \(V\) you need to compare in order to extract a prediction from the model. For example, in the case above, we went with the following verbalizer:
\[\begin{equation} v = \begin{cases} \text{positive} & \text{" pos"} \\ \text{negative} & \text{" neg"} \\ \text{neutral} & \text{" neu"}. \end{cases} \end{equation}\]We could also map possible answers to numbers (or letters, as in multiple-choice settings), e.g.:
\[\begin{equation} v = \begin{cases} \text{positive} & 0 \\ \text{negative} & 1 \\ \text{neutral} & 2. \end{cases} \end{equation}\]The important thing is that, either via clear instructions or via in-context examples, we make it clear to the model what the possible answers are. In short, verbalizers are a reminder that we can choose which tokens are mapped to each label in our classification task, and per-token likelihood is a general way to compute scores that we can rank to make predictions. And all of this without fine-tuning a model for each task, so great!
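To close the loop, here’s a sketch of the first-token verbalizer from above (the single-token assumption and the exact token strings depend on your tokenizer, so treat these as placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The verbalizer: map each label to a single token of our choice.
verbalizer = {"positive": " pos", "negative": " neg", "neutral": " neu"}
# Assumes each of these strings is exactly one token for this tokenizer.
label_token_ids = {
    label: tokenizer(token, add_special_tokens=False).input_ids[0]
    for label, token in verbalizer.items()
}

prompt = "Review: I loved this movie! Sentiment:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# A single forward pass: only the logits of the verbalized tokens are compared.
prediction = max(
    label_token_ids,
    key=lambda label: next_token_logits[label_token_ids[label]].item(),
)
```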
But, I hear you say, fine-tuning is not expensive, as it typically takes a few hours at most. And if we are concerned about catastrophic forgetting, we can always train different sets of LoRA adapters for different tasks and merge and unmerge them at will. So, what if fine-tuning is on the table?
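(For reference, the adapter route looks roughly like this; I’m assuming the peft library here, and the hyperparameters and target modules are just placeholders, not a recommendation.)

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# One small set of trainable low-rank matrices per task; the base weights stay frozen.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["c_attn"],  # the attention projection in GPT-2
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

# ...train on labeled data for one task, then save just the adapter,
# and repeat with a fresh adapter for each additional task.
model.save_pretrained("adapters/sentiment")
```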
Encoder-Only vs Decoder-Only
The success of LLMs has made decoder-only models by far the most popular architecture for language models. And despite not being designed for classification, they can do it successfully. Encoder-only models like BERT, however, were designed with classification in mind: both their training objective and architectural choices like bidirectional instead of causal self-attention exist so that the model learns rich token representations that we can use as input for downstream applications. So, why hack labels out of decoder-only models instead of using encoder-only models that were designed for this?
Well, even if the required fine-tuning step only takes a few hours (which hopefully we can all spare), encoder-only models still require labeled data to train a downstream model for each separate task. So, the encoder-only route would have to bring clear benefits for us to go that way. And this is indeed the case. Some studies have compared these types of models on downstream tasks. E.g. Ahuja et al. report that fine-tuned encoder-only models were almost always better than prompting LLMs for labels.
But yes, fine-tuning vs. not fine-tuning is not exactly a fair comparison, and this is just one difference. In general, it’s not easy to make this comparison fair, as these models differ in many ways beyond architecture and training objective: model sizes need to be matched, training data need to be the same, and so on.
Recently, Weller et al. took on the challenge of pre-training 5 pairs of MLMs and CLMs of various sizes, keeping all conditions as similar as possible. They found, as prior work had, that encoder-only models are better at classification tasks, and decoder-only models are better at generation tasks. This remained the case even after adapting a decoder-only model into an encoder-only model via continual pre-training, a practice that has recently been promoted as a solution. So, the world makes sense in a way, and there is still a use for encoder-only models, a fact that motivated the creation of modern versions of BERT like ModernBERT or NeoBERT.
Closing Thoughts
Even though LLMs were designed for next-token prediction, there is nothing preventing us from using this ability to predict labels in a classification task. We saw that we have the freedom to do this however we want, so long as we (i) respect the semantics of the output distribution of our model, and (ii) treat all possible labels in the same way. Finally, it’s important to remember that just because we can, doesn’t mean this is our best alternative, as masked (encoder-only) language models are still known to be better than causal (decoder-only) models at classification tasks, at least at the time of writing.