What Are Reasoning LLMs?

This article is adapted from this lecture.

2025 is the year of “reasoning LLMs”, and that is in no small part thanks to the release of DeepSeek-R1 back in January. From the amount of research focused on reasoning models to applications where we can see models “thinking” before they provide an answer, reasoning LLMs have been everywhere this year. But what exactly are reasoning LLMs? And why did DeepSeek-R1 cause so much interest in them? In this article, I’d like to answer these questions by giving a high-level overview of the concept of reasoning LLMs, with a special focus on DeepSeek-R1, as the research behind this model was shared publicly with the research community. I will assume you know about LLMs in general, and about reinforcement learning from human feedback (RLHF) in particular (if not, you can learn about them here). Also, while I will touch on the subject a bit, the focus of this article is on methodology, not on whether these models reason in the same way humans do.

Reasoning is a Vague Term

As with many umbrella terms, such as “artificial intelligence” or “data science”, what constitutes “reasoning” is not always clear or consistent across sources. And while I am sure this has been discussed at length in fields such as philosophy or cognitive science, I will not provide citations from those fields, as they are not mine. What’s easy to say is that, as humans, we are all familiar with different forms of reasoning, e.g. inductive vs. deductive.

Similarly, in the history of AI, reasoning has taken many forms, from traversing a graph to grounding logical formulas. While each of these can be formally defined, what collectively earns them the title of “reasoning” methods is less clear. But we can guess that it has to do with a process that applies some steps to some input data to produce a result that cannot be achieved directly without the application of those steps. For example, given the data:

\[\begin{align} a\quad&\alpha\quad b, \\ b\quad&\alpha\quad c, \\ &\alpha \text{ is a transitive relation}, \end{align}\]

we need to apply the transitive property of \(\alpha\) to reason that \(a\,\alpha\, c\). Reasoning LLMs are similar to other forms of reasoning in this general sense, but the similarities do not go much deeper than that. Instead, to build an intuition about what makes reasoning LLMs different, it’s useful to start with a different concept: test-time compute.
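Before moving on, here is the reasoning step above made explicit in a small sketch: starting from the two given facts, repeatedly applying the transitive rule derives the new fact \(a\,\alpha\, c\), which is not present in the input data. The element and relation names are just placeholders.

```python
# Derive all facts implied by treating a binary relation as transitive.
def transitive_closure(facts: set[tuple[str, str]]) -> set[tuple[str, str]]:
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (x, y) in list(derived):
            for (y2, z) in list(derived):
                if y == y2 and (x, z) not in derived:
                    derived.add((x, z))  # apply the reasoning step
                    changed = True
    return derived

facts = {("a", "b"), ("b", "c")}
print(transitive_closure(facts))  # {('a', 'b'), ('b', 'c'), ('a', 'c')}
```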

The Concept of Test-Time Compute

There’s been a lot of research in late 2024 and 2025 related to the idea of test-time compute. The basic idea is simple: test-time compute refers to computations done at inference time, or more specifically, to extra computations done at inference time. If we assume that we need a single forward pass to produce each output token at inference time, then any number of forward passes that exceeds this baseline can be considered extra computation. This means that something like beam search is a form of test-time compute, as is generating multiple answers and using an external tool to pick the most appropriate one, as done by Cobbe et al. (2021).

These examples show that the idea of test-time compute is much older than reasoning LLMs, and that different motivations can drive the use of extra compute at inference time (note that tool use and LLMs-as-agents are also forms of test-time compute). But beyond the motivation behind each method, they can all be related through one general idea: reaching some answers may require more compute than reaching other answers.
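As an illustration, here is a minimal sketch of the “generate multiple answers and pick one” pattern; the `generate` and `score` functions are placeholders standing in for an LLM call and some external verifier (e.g. a trained reward model or a rule-based checker), not a specific library API.

```python
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate answers and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]            # n extra generations
    return max(candidates, key=lambda ans: score(prompt, ans))   # verifier picks one
```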

A Computational Perspective On Reasoning

Computational costs are typically a function of two things: (i) the resources used, and (ii) the task at hand. Given a fixed task, costs change depending on the resources we use to solve it. For example, we may solve a task in \(\mathcal{O}(n^2)\) when using two for loops, but the cost comes down to \(\mathcal{O}(n)\) when using a single for loop and a dictionary. Another example would be using CPUs vs GPUs for Deep Learning models. Similarly, given fixed resources, the computational costs change depending on the task. This interacts with the resources used, as just discussed, but some problems are known to have solutions in polynomial time while others don’t. In short, some tasks are inherently harder than others. Let’s illustrate this in the context of LLMs.
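As a concrete version of the two-for-loops example, here is a sketch of the same task solved with both resource choices; the task itself (checking whether any pair of numbers sums to a target) is just an illustration, not taken from any of the work discussed here.

```python
def has_pair_sum_quadratic(xs: list[int], target: int) -> bool:
    # O(n^2): compare every pair explicitly
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            if xs[i] + xs[j] == target:
                return True
    return False

def has_pair_sum_linear(xs: list[int], target: int) -> bool:
    # O(n): a single pass, using a set of values seen so far
    seen = set()
    for x in xs:
        if target - x in seen:
            return True
        seen.add(x)
    return False
```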

Say we have the best LLM in the world, and we use it in the following way:

  • Input: a yes-or-no question
  • Output: the predicted answer, decoded as a single token

We can decode in any way we want, e.g. constrained decoding on the {yes, no} tokens, greedy decoding, etc. We then ask the model the following questions:

  • Is Berlin the capital of Germany?
  • Is 2048 the square of 64?
  • Does P equal NP?

For each question, the compute resources we have are the same: a single forward pass. That is, we have fixed resources for different tasks, which clearly require different amounts of compute to answer. Even at a more fine-grained level, the concept of the residual stream suggests that we can apply the same idea to model depth and perhaps see that some tasks require less than a single forward pass to answer. But even so, a single forward pass is certainly an upper bound in this case.
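For concreteness, here is a minimal sketch of this restricted setup using the Hugging Face transformers library; the model name is just a placeholder for whatever causal LM you have at hand.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def answer_yes_no(question: str) -> str:
    prompt = f"Question: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # a single forward pass
    # constrained decoding over {yes, no}; assumes " yes" / " no" are
    # single tokens in the vocabulary (true for GPT-2's tokenizer)
    yes_id = tokenizer(" yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" no", add_special_tokens=False).input_ids[0]
    return "yes" if logits[yes_id] > logits[no_id] else "no"

print(answer_yes_no("Is Berlin the capital of Germany?"))
```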

“Ok, but why would you use an LLM in such a restrictive way?” Great question! The point is that using an LLM without any test-time compute is essentially just as restrictive: we don’t need to directly produce the final output, but can instead take some intermediate steps first. This is one of the main motivations behind reasoning LLMs, as nicely put by Herel et al. (2024).

What are Reasoning LLMs?

While there is no agreed-upon definition in the literature, reasoning LLMs are those that do extra computations before producing a final answer. Not just any type of test-time computation, such as beam search, but a specific form of it: extra computations in the form of intermediate steps that allow the model to reason its way to an answer. These steps take the form of generating intermediate tokens, i.e. extra forward passes, that the model uses for “thinking” about the answer.

A simple example of what inference looks like with reasoning LLMs, and likely a big catalyst for the advent of such models, is chain-of-thought (CoT) prompting.

Figure 1(d) from Kojima et al. (2022)

CoT has been found to work better for some tasks, such as math or coding tasks. We can interpret the success of CoT in two ways. From a language modeling perspective, it seems that autoregressively producing an explicit reasoning path increases the probability of the correct answer in some cases. But we can also take the computational perspective, i.e. that the model is given a less restricted use of resources to reach an answer.
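As a sketch of what zero-shot CoT prompting looks like in practice, in the style of Kojima et al. (2022), the only difference from a direct prompt is a trigger phrase that makes the model spend extra forward passes on intermediate reasoning tokens before the answer is extracted. The `complete` function and the question are placeholders, not a specific API or benchmark item.

```python
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your favorite LLM completion call here")

def answer_directly(question: str) -> str:
    # single-step prompt: the model must jump straight to the answer
    return complete(f"Q: {question}\nA: The answer is")

def answer_with_cot(question: str) -> str:
    # step 1: trigger intermediate reasoning tokens (extra forward passes)
    reasoning = complete(f"Q: {question}\nA: Let's think step by step.")
    # step 2: extract the final answer from the generated reasoning
    return complete(f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
                    f"Therefore, the answer is")
```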

In any case, since CoT, reasoning LLMs have taken the more concrete form of models that have an explicit thinking phase at inference time. This is where models are given more freedom to reason about the answer by “thinking out loud”, so to speak. This thinking phase is usually explicitly denoted by the use of think tokens or <think> tags, as seen in the image below.

Table 1 from DeepSeek-AI (2025)

A natural question then is: is this extra inference cost worth it? Well, yes, depending on the task. As with CoT, knowledge-based questions, e.g. “what is the capital of France?”, don’t typically benefit from extra compute. Conversely, there are other (typically more involved) tasks that clearly benefit from this extra “thinking”. This is what Snell et al. (2024) found in their work, where they suggested that in some cases, focusing less on pre-training compute and more on test-time compute may be more beneficial. More recently, Liu et al. (2025) took this idea further by showing that in some cases, a small 1B model can compete with a much larger 405B one.

How to Get Reasoning LLMs?

Having established that the extra inference costs incurred by reasoning LLMs are sometimes worth it, how do we go from a regular LLM to a reasoning LLM? Do we just prompt it to think, as with CoT? Well, yes, you can do that, but more specialized methods have also been developed for this purpose. These can be broken down into two categories: test-time methods and post-training methods.

Test-time methods are those that do not require fine-tuning the base model so that it can better perform this thinking phase. This would be any CoT-like approach. Some of these methods may still require a bit of additional training because they introduce special tokens, such as THINK tokens that trigger the thinking phase, or a WAIT token that forces the model to recheck its answer before “deciding” that it’s ready. Still other approaches rely on introducing different types of components, such as in the work of Geiping et al. (2025), who introduced an RNN-like component that can be used at inference time to reason iteratively.
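To make the WAIT idea concrete, here is a minimal sketch of one way such an intervention can be implemented at decoding time; `generate_until` is a placeholder for a decoding loop that stops at a given string, and the tag and token names are assumptions for illustration, not the exact tokens used in any particular paper.

```python
def think_with_rechecks(prompt: str, generate_until, num_rechecks: int = 1) -> str:
    text = prompt + "<think>"
    for _ in range(num_rechecks):
        # model generates until it wants to close the thinking phase
        text += generate_until(text, stop="</think>")
        text += " Wait,"  # override the stop and force a recheck
    text += generate_until(text, stop="</think>") + "</think>"
    # finally, let the model produce the visible answer
    return text + generate_until(text, stop=None)
```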

As for post-training methods, these would be those that do fine-tune the base model so it’s better suited to using a thinking phase. Most of these are based on reinforcement learning, which brings us to DeepSeek-R1.

The DeepSeek-R1 Family

DeepSeek-R1 is a model tuned specifically for reasoning. But it wasn’t only its success in reaching the level of the best performing reasoning models that brought it so much attention, but also the way this model was trained to get there. DeepSeek-R1 is based on the DeepSeek-V3 model, which is a mixture-of-experts model with 671B parameters. There have been many innovations coming from the DeepSeek family of models, e.g. multi-head latent attention was introduced with DeepSeek-V2. But here we’re focusing only on their reasoning models, specifically the DeepSeek-R1 family of models, all of which are based on reinforcement learning (RL).

  • DeepSeek-R1-Zero: achieved reasoning with a “pure-RL” approach.
  • DeepSeek-R1: flagship reasoning model built on top of DeepSeek-R1-Zero.
  • DeepSeek-R1-Distill: smaller models distilled from DeepSeek-R1.

RL has been a popular tool for post-training, as seen by the success of RLHF. In the reasoning space, RL was also adopted as a post-training approach to turn models into reasoning LLMs, e.g. Zelikman et al. (2022) proposed such an approach after seeing the success of CoT. More recently, OpenAI has released their “o” reasoning models, which are also based on RL.

DeepSeek-R1-Zero

One key component in the success of RL-based approaches has been the choice of the reward function. For example, in RLHF, the reward model has the goal of mimicking human preferences, and training such a reward model requires collecting a lot of supervised data, which can be a very expensive process. The same is true of the supervised fine-tuning (SFT) phase that typically precedes RL in these pipelines.

One of the main innovations from the DeepSeek-R1 family of models is that DeepSeek-R1-Zero was trained without an SFT phase. The innovation came from realizing that they could directly train the RL policy by using existing out-of-the-box systems as reward signals instead of expensively training a reward model. Specifically, they used a LeetCode-style compiler for coding tasks and rule-based verifiers for math tasks. This was possible for two reasons. First, their goal was not to tune the model for human preferences, but to tune it for reasoning tasks, and the type of tasks that typically benefit from such an approach are coding or math tasks, where answers can be checked automatically. Second, the base model could already be prompted to use thinking tags to conduct a reasoning phase before providing a final answer (and indeed it was only via the prompt that the model was forced into this format during training, see their Table 1 above). This, in combination with group relative policy optimization (GRPO), which was also developed by the DeepSeek team, enabled them to train a base LLM to reach the performance of OpenAI’s best reasoning models at the time (see their Table 2) without the need to collect expensive supervised data.
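As a rough sketch of what such rule-based rewards and GRPO’s group-relative baseline look like, consider the following; the specific reward values, the exact-match check, and the tag format are assumptions for illustration, not the paper’s exact implementation.

```python
import re
import statistics

def rule_based_reward(response: str, reference_answer: str) -> float:
    reward = 0.0
    # format reward: the response should use the <think> ... </think> structure
    if re.search(r"<think>.*</think>", response, flags=re.DOTALL):
        reward += 0.5
    # accuracy reward: compare the final answer against a verifiable reference
    final = response.split("</think>")[-1].strip()
    if final == reference_answer.strip():
        reward += 1.0
    return reward

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # GRPO normalizes each sampled response's reward within its group,
    # rather than relying on a learned value function as a baseline
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]
```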

The authors shared how the base model progressed with more training time.

Figure 2 from DeepSeek-AI (2025)

But what’s perhaps more interesting is that the average length of the model’s responses also grew with more training time.

Figure 3 from DeepSeek-AI (2025)

This is interesting because the model was never explicitly encouraged to give longer answers or to “think” more. Yet their training approach “naturally” led to longer answers, which were apparently necessary for better performance. This serves as (admittedly vague) support for the computational motivation behind reasoning LLMs.

Finally, the authors found some interesting moments during training where the model would stop and reassess its reasoning, again without being explicitly prompted to do so. This is vaguely similar to the type of backtracking that Prolog was designed to do.

Table 3 from DeepSeek-AI (2025)

Overall, DeepSeek-R1-Zero is a very nice example of the beauty of Deep Learning, as the authors developed a relatively cheap way for models to learn a useful skill without introducing a lot of bias into the process.

DeepSeek-R1 and DeepSeek-R1-Distill

Despite the methodological innovation behind DeepSeek-R1-Zero, the model still had issues, such as code switching or the generation of answers that were not easy to read. To remedy this, the authors developed DeepSeek-R1, their flagship model, which was trained further by incorporating an SFT phase as is typically done, though they did use their R1-Zero model as part of this phase as well.

Arguably more interesting were their DeepSeek-R1-Distill models, which were smaller Llama and Qwen models tuned via distillation using DeepSeek-R1 as the teacher. Here it was interesting to see how far a small model could get as a reasoning model when compared to the best models out there. Sure enough, they found that their smaller distilled models were competitive with, or even outperformed, some of the best and largest reasoning models.

Table 5 from DeepSeek-AI (2025)

Do Reasoning LLMs Actually Reason?

There is a lot of interesting work that takes a more critical look at reasoning LLMs, including some of the things discussed above. For example, Yue et al. (2025) questioned whether the RL approach taken by DeepSeek-R1 indeed promotes reasoning in models. They compared the performance on math and coding tasks of base models vs reasoning models tuned with the approach used by DeepSeek-R1. They found that tuned models performed better at pass@1, i.e. when the model is only given a single chance to provide an answer, but base models performed better at pass@60. It may be easy to look at this and say “what is the point of pass@60?” But it’s important to note that, on the whole, the base models were able to solve a larger variety of problems when given more resources (same idea again). A useful interpretation here comes from the typical trade-off between exploration and exploitation. That is, the tuned models were better at exploiting some types of solutions, while the base models were better at exploring the solution space.
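For reference, pass@k numbers like these are usually computed with a standard unbiased estimator: sample n ≥ k answers per problem, count the c correct ones, and average 1 − C(n−c, k)/C(n, k) over problems. This is not specific to Yue et al. (2025), and the sample sizes below are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k answers drawn from n samples (c correct) is correct."""
    if n - c < k:  # not enough incorrect samples to fill all k draws
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=64, c=4, k=1))   # 0.0625
print(pass_at_k(n=64, c=4, k=60))  # ~1.0: with 60 draws it is hard to miss every correct answer
```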

Kim et al. (2025) proposed a new method for test-time compute and found no correlation between output length and performance, suggesting that longer answers do not always lead to better answers.

On whether models actually reason in a way that is akin to human reasoning, there are some works that suggest this isn’t the case. Mirzadeh et al. (2024) created a benchmark that was able to generate different instances of the same problem using both different numerical values and different scales. They found that model performance is sensitive to the numerical values used in a problem, as well as to its scale, suggesting models aren’t really understanding these problems.

More recently, Valmeekam et al. (2025) found that other forms of test-time compute perform just as well as the approach of generating more tokens at test time. And Shojaee et al. (2025) found that models can fail to solve a problem even when explicitly given the steps to solve it.

Closing Thoughts

Reasoning LLMs are LLMs that use a specific form of test-time compute: the generation of extra tokens for “thinking” about a problem before providing an answer. They can be both prompted to behave like this, as in CoT reasoning, and tuned for this purpose, as with DeepSeek-R1 and many other models before that.

W.r.t. base vs reasoning models, research shows the latter are better at some types of tasks, e.g. math and coding, but it’s unclear why this is the case. On whether models actually reason the way humans do, there is some evidence that this is not the case, but it’s possible that when tested, we may find that humans have similar issues in some cases, so it’s not always easy to say. My guess is that they don’t, and perhaps they don’t need to.

References

The following articles by Sebastian Raschka were a useful starting point before I dug into the papers cited in this article:

Contact

Questions? Comments? Drop me a line at daniel@ruffinelli.io