Can Alternating Generation‑Reduction Make LLMs Think Faster? Introducing PENCIL
The paper presents PENCIL, a novel alternating generation‑and‑erasure reasoning paradigm that achieves optimal space‑time complexity for chain‑of‑thought tasks, dramatically improves accuracy and efficiency on hard SAT, QBF, and Einstein puzzle benchmarks, and is provably Turing‑complete.
Background and Motivation
Recent large language models such as OpenAI o1/o3 and DeepSeek R1 have shown that test‑time scaling via deeper reasoning can substantially boost problem‑solving ability, but traditional chain‑of‑thought (CoT) suffers from context‑window overflow, difficulty retrieving relevant earlier steps, and reduced generation efficiency as the reasoning chain grows.
Key Idea: PENCIL
PENCIL (Alternating Generation‑Reduction) interleaves token generation with dynamic erasure of intermediate thoughts that are no longer needed for future reasoning. By discarding these obsolete fragments, the method keeps the context short while preserving the information essential for the final answer.
Erasure Mechanism
The system introduces three special tokens, [CALL], [SEP], and [RETURN]. Whenever the generated sequence matches the pattern C [CALL] T [SEP] A [RETURN], a reduction rule fires: the intermediate thought segment T (together with the special tokens) is erased, while the preceding context C and the sub‑task's answer A are kept. This design mirrors rewriting rules in logic and stack‑frame management in functional programming.
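To make the rule concrete, here is a minimal Python sketch of how one reduction step could be applied to a token sequence. The helper name, the toy tokens, and the list representation are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the reduction rule  C [CALL] T [SEP] A [RETURN]  =>  C A.
CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def reduce_once(tokens):
    """Apply one reduction if the sequence currently ends with [RETURN]."""
    if not tokens or tokens[-1] != RETURN:
        return tokens                                    # nothing to erase yet
    sep = len(tokens) - 1 - tokens[::-1].index(SEP)      # position of the last [SEP]
    call = sep - 1 - tokens[:sep][::-1].index(CALL)      # innermost [CALL] before it
    context = tokens[:call]                              # C: everything before the sub-task
    answer = tokens[sep + 1:-1]                          # A: the sub-task's result
    return context + answer                              # T and the special tokens vanish

trace = ["solve", CALL, "try", "x=1", "contradiction", SEP, "x=0", RETURN]
print(reduce_once(trace))                                # ['solve', 'x=0']
```

Because only C and A survive the reduction, the next tokens are generated against a much shorter sequence.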
Supported Reasoning Patterns
Task Decomposition: Use [CALL] to start a sub‑task and [RETURN] to merge its result back, erasing the sub‑task's internal details.
Search and Backtrack: Manage exploration branches with the special tokens and erase failed paths once a branch is ruled out (see the trace sketch after this list).
Summarization: Collapse long thought fragments into concise summaries, analogous to tail recursion.
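To illustrate the decomposition and backtracking patterns together, the hypothetical trace below nests a failed branch inside a sub‑task and applies the reduction whenever [RETURN] appears. All token names are invented for the example, and the helper simply restates the rule from the earlier sketch.

```python
# Hypothetical nested trace: an outer sub-task explores a branch that fails,
# erases it, then returns the successful assignment. Tokens are invented.
CALL, SEP, RETURN = "[CALL]", "[SEP]", "[RETURN]"

def reduce_once(tokens):
    """Erase the innermost completed sub-task (assumes a well-formed trace)."""
    sep = len(tokens) - 1 - tokens[::-1].index(SEP)
    call = sep - 1 - tokens[:sep][::-1].index(CALL)
    return tokens[:call] + tokens[sep + 1:-1]

full_trace = [
    "goal", CALL, "split",
        CALL, "branch", "x=1", "fails", SEP, "x=1:bad", RETURN,  # failed branch, erased
        "so", "x=0", SEP, "x=0", RETURN,                         # sub-task result kept
    "answer", "x=0",
]

context, max_live = [], 0
for tok in full_trace:
    context.append(tok)
    max_live = max(max_live, len(context))     # peak length just before erasure fires
    while context and context[-1] == RETURN:
        context = reduce_once(context)

print(context)                      # ['goal', 'x=0', 'answer', 'x=0']
print(max_live, len(full_trace))    # 10 vs. 17: the live context never sees the full chain
```

The final context keeps only the conclusions; everything the model explored and abandoned along the way has been erased.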
Theoretical Guarantees
The authors prove that a fixed‑size Transformer running PENCIL can simulate any Turing machine using optimal space O(S) and time O(T), where S and T are the space and time of the simulated machine. In contrast, traditional CoT requires context length proportional to T, leading to exponential blow‑up for many NP‑Complete problems.
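Stated as a formula (a restatement of the claim above, using the fact that any machine satisfies S ≤ T):

$$
\underbrace{\max_t \,\lvert \mathrm{context}_t \rvert}_{\text{CoT}} \;=\; \Theta(T)
\qquad\text{vs.}\qquad
\underbrace{\max_t \,\lvert \mathrm{context}_t \rvert}_{\text{PENCIL}} \;=\; O(S).
$$

For search problems such as 3‑SAT, the running time T of an exhaustive procedure is typically exponential in the instance size while its working space S is only polynomial, which is exactly where the exponential gap in context length comes from.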
Training and Experimental Setup
Training uses the same token budget as CoT but computes the loss only on the tokens newly generated after each reduction, with the shortened sequence serving as the prefix; this allows the KV cache for the unchanged part of the context to be reused. Experiments evaluate three hard reasoning tasks: 3‑SAT (NP‑Complete), QBF (PSPACE‑Complete), and Einstein’s Puzzle (natural‑language logic).
Two Transformer sizes are tested (10.6 M and 25.2 M parameters) with identical hyper‑parameters and random initialization.
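As a hedged sketch of how such training instances might be assembled (the function, the segment format, and the -100 ignore label are illustrative conventions, not the paper's code): each reduction yields a shortened context that becomes the prefix of the next instance, and only the freshly generated tokens carry loss.

```python
# Illustrative construction of PENCIL-style training pairs (not the paper's code).
IGNORE = -100   # common "ignore this position" label for cross-entropy loss

def make_training_pairs(segments):
    """segments: list of (context_after_previous_reduction, newly_generated_tokens)."""
    pairs = []
    for context, generated in segments:
        input_ids = list(context) + list(generated)
        # Loss falls only on the freshly generated tokens; the unchanged prefix
        # contributes nothing, so its key/value cache can be reused across segments.
        labels = [IGNORE] * len(context) + list(generated)
        pairs.append((input_ids, labels))
    return pairs

# Toy example with integer token ids: the second segment starts from a context
# that has already been shortened by a reduction.
segments = [([1, 2], [3, 4, 5]),    # generate up to the first [RETURN]
            ([1, 5], [6, 7])]       # continue from the reduced context
for input_ids, labels in make_training_pairs(segments):
    print(input_ids, labels)
```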
Results
Accuracy: PENCIL maintains ≥99% accuracy on SAT and QBF even as problem size grows, while CoT accuracy drops sharply (e.g., ~50% at n = 10 for SAT). On Einstein’s Puzzle, a 25.2 M‑parameter model with PENCIL reaches 97% accuracy versus 25% for CoT.
Computation Efficiency: Under equal FLOPs, PENCIL converges faster, achieving 100% accuracy early in training and requiring far fewer attention computations because the context remains short.
Proof Sketch
The authors introduce the Full‑Access Sequence Processing (FASP) language and show that any FASP program can be expressed by a fixed‑size Transformer. By constructing FASP programs that implement the alternating think‑summarize loop, they demonstrate that PENCIL can realize the optimal‑complexity simulation of arbitrary computations.