Can LLMs SAT?
Posted on 2026-02-24
In recent years, LLMs have improved significantly in overall performance. When they first became mainstream a couple of years ago, they were already impressive with their seemingly human-like conversational abilities, but their reasoning always fell short. They could describe any sorting algorithm in the style of your favorite author, yet they couldn't consistently perform addition. Since then they have improved considerably, and it's getting harder and harder to find examples where they fail to reason. This created the belief that with enough scaling, LLMs will be able to learn general reasoning.
I wanted to test this claim with SAT problems. Why SAT? Because solving SAT problems requires applying very few rules consistently. The principle stays the same whether you have millions of variables or just a couple, so if you know how to reason properly, any SAT instance is solvable given enough time. It's also easy to generate completely random SAT problems, which makes it less likely that an LLM can solve them through pure pattern recognition. Therefore, I think SAT is a good problem type for testing whether LLMs can generalize basic rules beyond their training data.
What is SAT?
SAT (short for "satisfiability") is a logic problem: given a boolean formula, decide whether there is an assignment of its variables that makes the formula true. An example boolean formula is:
(a || b) && !a
This formula is satisfiable: if we set b to true and a to false, the whole formula is true. All other assignments make the formula false, but that doesn't matter; the formula is satisfiable as long as at least one assignment makes it true.
An unsatisfiable formula is:
b && !b
No matter what you assign to b, the formula is false, since one side of the && operator is always false.
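For formulas this small, satisfiability can be checked by brute force: try every assignment and see whether any makes the formula true. A minimal Python sketch (the lambdas below encode the two example formulas above; real solvers are far smarter than enumerating all 2^n assignments):

```python
from itertools import product

def is_satisfiable(formula, n_vars):
    """formula: a callable taking a tuple of booleans, one per variable."""
    return any(formula(assignment)
               for assignment in product([False, True], repeat=n_vars))

# (a || b) && !a  -- satisfiable (a=false, b=true works)
print(is_satisfiable(lambda v: (v[0] or v[1]) and not v[0], 2))  # True
# b && !b         -- unsatisfiable
print(is_satisfiable(lambda v: v[0] and not v[0], 1))            # False
```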
CNF Form
There is a special form for boolean formulas called "Conjunctive Normal Form" (CNF). A formula in this form consists of clauses connected with AND operators, where each clause contains only variables connected with OR operators. Variables can appear negated, but only individual variables can be negated; something like !(a && b) is not allowed. An example boolean formula in CNF is:
(a || b || c) &&
(!a || d || e) &&
(e || g || x)
SAT solvers usually expect boolean formulas in this form because they are specialized to solve such problems efficiently. I decided to use this form so I could validate the LLM's output with a SAT solver.
Testing Approach
My approach is very simple:
- Generate random SAT instances, both SAT and UNSAT.
- Feed the SAT instance to the LLM.
- Verify the output.
Generating SAT problems
For the test to be fair to LLMs, the SAT instances should be reasonably large, but not too big. I can't just hand over problems with thousands of variables; on the other hand, they shouldn't be too easy either.
I learned that for 4-SAT, if the clause-to-variable ratio is more than 10, the generated problems become difficult to solve, and the likelihood of a formula being SAT is close to 50%. So I generated 3 types of formulas:
- SAT problem with 10 variables and 200 clauses
- SAT problem with 14 variables and 126 clauses
- UNSAT problem with 10 variables and 200 clauses
I used cnfgen to generate SAT instances using the following command:
cnfgen -q randkcnf 4 $VARIABLES $CLAUSES
This command outputs the formula in DIMACS format, a standard format for CNF supported by every SAT solver. This makes it possible to validate the LLM's decision with another program.
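For reference, here is a rough sketch of the kind of instance this produces: m clauses, each over k distinct variables with random signs, printed in DIMACS format. This mimics the output format, not cnfgen's exact sampling:

```python
import random

def random_kcnf_dimacs(k, n_vars, n_clauses, seed=0):
    """Random k-CNF in DIMACS format (illustrative stand-in for cnfgen)."""
    rng = random.Random(seed)
    lines = [f"p cnf {n_vars} {n_clauses}"]          # DIMACS header
    for _ in range(n_clauses):
        variables = rng.sample(range(1, n_vars + 1), k)  # k distinct vars
        literals = [v if rng.random() < 0.5 else -v for v in variables]
        lines.append(" ".join(map(str, literals)) + " 0")  # 0 ends a clause
    return "\n".join(lines)

print(random_kcnf_dimacs(4, 10, 200).splitlines()[0])  # p cnf 10 200
```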
Models
I used https://openrouter.ai to test multiple models without having to register with different LLM providers.
I tested the following models:
- Gemini 3 Pro
- GPT 5.2 Mini
- GPT 5.2
For each model, reasoning was enabled and the reasoning effort was set to high. I included GPT 5.2 because it could be argued that it reasons better than the Mini variant. However, I couldn't test GPT 5.2 as much as the other models because it was too costly. Gemini 3 Pro was costly as well, but it didn't spend as much time reasoning as GPT 5.2, which made it more affordable in my experience.
For each model, I used the same system prompt:
The user will give a CNF in dimacs format.
Determine if it's satisfiable or not WITHOUT USING ANY EXTERNAL TOOLS.
Use your own reasoning. Don't stop even if the formula is too large just try to solve it manually.
After determining the result output a JSON with two fields:
- satisfiable: Boolean. True if the formula is satisfiable
- assignment: Array of booleans. If the formula is satisfiable provide an assignment for each variable from 1 to N. If the formula is not satisfiable this field is null.
EXAMPLE INPUT:
p cnf 10 10
-7 9 10 0
7 8 9 0
-7 -9 10 0
-2 3 5 0
-4 -6 -8 0
-1 3 -5 0
1 -3 -8 0
4 -5 -10 0
4 -7 -8 0
-2 -5 8 0
EXAMPLE JSON OUTPUT:
{
"satisfiable": true,
"assignment": [false, false, false, false, false, false, false, false, false, false]
}
Testing LLM Output
I used the Z3 theorem prover, which is a pretty decent SAT solver, to assess the LLM output. I considered the output successful if the model correctly determined whether the formula was SAT or UNSAT, and, in the SAT case, also provided a valid assignment. Testing an assignment is easy: for each variable, add a unit clause (a clause with a single literal) forcing the assigned value. If the resulting formula is still SAT, the assignment is valid; otherwise the assignment contradicts the formula and is invalid.
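The unit-clause trick is what you'd do inside a solver; without one, the same check is a direct evaluation, since an assignment is valid exactly when every clause contains at least one satisfied literal. A minimal solver-free sketch, with made-up clause and assignment values:

```python
def assignment_is_valid(clauses, assignment):
    """clauses: lists of DIMACS-style signed ints; assignment[i] is var i+1."""
    # A literal is true if its variable's value matches the literal's sign.
    lit_true = lambda lit: assignment[abs(lit) - 1] == (lit > 0)
    # Every clause needs at least one true literal.
    return all(any(lit_true(lit) for lit in clause) for clause in clauses)

# (x1 || !x2 || x3) && (!x1 || x2) && (x2 || !x3) -- made-up example CNF
clauses = [[1, -2, 3], [-1, 2], [2, -3]]
print(assignment_is_valid(clauses, [True, True, False]))   # True
print(assignment_is_valid(clauses, [True, False, False]))  # False
```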
Initially I aimed to test at least 10 formulas per model for each of the SAT/UNSAT cases, but it turned out to be more expensive than I expected, so I tested ~5 formulas per case per model. At first I used the OpenRouter API to automate the process, but responses kept stopping in the middle due to the long reasoning process, so I reverted to using the chat interface (I don't know if this was a problem with the model provider or an OpenRouter issue). For this reason I don't have standardized outputs for each test, but I linked the output for each case mentioned in the results.
Results
All formulas used in this test can be found here.
Gemini 3 Pro
As a frontier flagship model, it was disappointing: it produced no successful outcome. It seemed not to reason thoroughly, even though reasoning was enabled and the effort set to high.
For SAT problems with 10 variables and 200 clauses, it usually output SAT as expected, but the assignment was never valid (examples: first, second). Once it even claimed a SAT formula was UNSAT. For this reason I didn't bother testing the SAT case with more variables.
For UNSAT problems with 10 variables and 200 clauses, it always claimed that the formula was SAT and made up assignments (see this example).
GPT 5.2 Mini
Surprisingly, despite being a smaller model, it performed better than Gemini 3 Pro. It found some valid assignments for SAT formulas, but it had the same issue of making up assignments for UNSAT formulas.
For SAT problems with 10 variables and 200 clauses, it sometimes output UNSAT because it couldn't find any satisfying assignment and finding one would have taken a lot more time. I don't consider this bad reasoning so much as a matter of performance. So I tried it with only 100 clauses, and it successfully found valid assignments.
For UNSAT problems with 10 variables and 200 clauses, it had the same issue as Gemini 3 Pro of making up assignments.
GPT 5.2
This one was a lot better than the others. For every SAT problem with 10 variables and 200 clauses, it found a valid satisfying assignment. So I pushed it to 14 variables and 100 clauses, and it got half correct out of 4 instances (see files with prefix formula14_ here). Half correct sounds like decent performance, but it is equivalent to random guessing.
For UNSAT problems with 10 variables and 200 clauses, it had the same issue as the others: making up assignments.
Conclusion
Testing LLM reasoning abilities with SAT is not an original idea; there is recent research that did thorough testing with models such as GPT-4o and found that for sufficiently hard problems, every model degrades to random guessing. But I couldn't find any research that used models as new as the ones I tested. It would be nice to see more thorough testing done again with newer models.
Even though my dataset is very small, I think it's sufficient to conclude that LLMs can't consistently reason. Their reasoning performance also gets worse as the SAT instance grows, which may be because the context window grows as reasoning progresses, making it harder to recall the original clauses at the top of the context. A friend of mine observed that complex SAT instances are similar to working with many rules in large codebases: as we add more rules, it becomes more and more likely that LLMs forget some of them, which can be insidious. Of course, that doesn't mean LLMs are useless. They can definitely be useful without being able to reason, but due to this lack of reasoning, we can't just write down rules and expect LLMs to always follow them. For critical requirements, some other process needs to be in place to ensure they are met.