What 2023 Taught Us About LLMs and AI‑Guided Optimization
The author reviews a year of rapid progress in large language models, highlighting breakthrough papers such as Positional Interpolation, StreamingLLM, Deja Vu, and RLCD, and discusses how AI‑guided optimization techniques like SurCo, LANCER, and GenCo are reshaping research and industry applications.
Large Language Models (LLMs)
2023 saw several LLM papers attract wide community attention. Positional Interpolation demonstrated that a one-line code change to RoPE, combined with modest fine-tuning, can dramatically extend a model's context window beyond its pre-training length, sparking a surge of open-source long-context models.
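To make the change concrete, here is a minimal sketch of RoPE angle computation with Positional Interpolation; the function name and the sizes in the example are illustrative, not the paper's code.

```python
import torch

def rope_angles(head_dim: int, positions: torch.Tensor,
                base: float = 10000.0, scale: float = 1.0) -> torch.Tensor:
    """Rotary-embedding angles; scale < 1 implements Positional Interpolation.

    With scale = pretrain_len / target_len, positions beyond the original
    window are squeezed back into the range seen during pre-training --
    essentially the one-line change to RoPE.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return (positions.float() * scale)[:, None] * inv_freq[None, :]

# Example: stretching a 2048-token model to 8192 tokens before fine-tuning.
angles = rope_angles(head_dim=128, positions=torch.arange(8192), scale=2048 / 8192)
```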
StreamingLLM (Attention Sink) showed that keeping the first four tokens in the KV cache, alongside a sliding window of recent tokens, lets a model keep generating far past its training context window, enabling "infinite chat" behavior. The method quickly spread to Intel Extension for Transformers, HuggingFace Transformers, and the mobile offline LLM app MLC Chat.
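A minimal sketch of the eviction policy, assuming a plain list of cache positions (the window size here is illustrative, and the real integrations also re-index positions inside the cache):

```python
def streaming_kv_keep(seq_len: int, num_sinks: int = 4, window: int = 1020) -> list[int]:
    """KV-cache entries kept under a StreamingLLM-style policy: the first
    `num_sinks` tokens (the attention sinks) plus a sliding window of the
    most recent tokens; everything in between is evicted."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))
    return list(range(num_sinks)) + list(range(seq_len - window, seq_len))

print(streaming_kv_keep(10_000)[:6])  # [0, 1, 2, 3, 8980, 8981]
```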
The idea behind StreamingLLM originated from the H2O paper, which observed that discarding 80% of the KV cache does not harm next-token perplexity, prompting a closer look at which of the remaining tokens actually carry the influence.
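The heavy-hitter idea can be sketched as keeping the keys and values that have accumulated the most attention mass; this is a simplified illustration, not the paper's exact policy (which also retains the most recent tokens):

```python
import torch

def heavy_hitter_indices(attn: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """H2O-style selection: attn is [num_queries, num_keys]; keep the keys
    that received the most total attention and drop the rest (~80%)."""
    mass = attn.sum(dim=0)                              # attention mass per key
    k = max(1, int(keep_ratio * mass.numel()))
    return torch.topk(mass, k).indices.sort().values    # heavy hitters, in order

attn = torch.softmax(torch.randn(16, 100), dim=-1)
print(heavy_hitter_indices(attn))
```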
Deja Vu (ICML'23 oral) introduced sparsity-based inference acceleration: a lightweight predictor forecasts which MLP neurons and attention heads will be active in upcoming layers, so only those weights are loaded onto the GPU, drastically reducing memory I/O. Follow-up work such as Shanghai Jiao Tong University's PowerInfer combines this contextual sparsity with CPU-GPU hybrid inference.
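A rough sketch of contextual sparsity in an FFN block, assuming a small learned predictor of active neurons; the module layout and sizes are made up for illustration, and the paper trains the predictor offline and overlaps it with the previous layer:

```python
import torch
import torch.nn as nn

class SparseFFN(nn.Module):
    """Deja-Vu-flavored sketch: a cheap low-rank predictor guesses which FFN
    neurons fire for the current hidden state, and only those rows/columns
    of the large FFN weights are gathered and used."""
    def __init__(self, d_model=512, d_ff=2048, rank=64, top_k=256):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)
        self.predictor = nn.Sequential(nn.Linear(d_model, rank), nn.Linear(rank, d_ff))
        self.top_k = top_k

    def forward(self, x):                                         # x: [batch, d_model]
        idx = self.predictor(x).topk(self.top_k, dim=-1).indices  # predicted active neurons
        w1_rows = self.w1.weight[idx]                              # [batch, top_k, d_model]
        h = torch.relu(torch.einsum("bd,bkd->bk", x, w1_rows) + self.w1.bias[idx])
        w2_cols = self.w2.weight.t()[idx]                          # [batch, top_k, d_model]
        return torch.einsum("bk,bkd->bd", h, w2_cols) + self.w2.bias

print(SparseFFN()(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```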
The author also proposed RLCD, a method that generates preference pairs from positive and negative prompts, eliminating manual annotation; the pairs can be used for reward-model training or direct fine-tuning, sidestepping some pitfalls of RLAIF.
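The data-generation step can be sketched in a few lines; the prompt wording and the `generate` callable are placeholders, not the paper's templates:

```python
def make_rlcd_pair(generate, instruction: str) -> dict:
    """RLCD-style pair: prefix the same instruction with a positive and a
    negative prompt; the two completions are labeled by construction, so no
    human (or AI judge) annotation is needed."""
    chosen = generate(f"(Give a helpful, harmless response.)\n{instruction}")
    rejected = generate(f"(Give an unhelpful, harmful response.)\n{instruction}")
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}

# Pairs like this can train a reward model or feed preference-based fine-tuning.
pair = make_rlcd_pair(lambda p: f"<completion for {p!r}>", "How do I stay safe online?")
```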
Two theoretical papers analyzed Transformer training dynamics: Scan&Snap (NeurIPS'23) studied a single-layer setting with linear MLP plus attention, while JoMA extended the analysis to multi-layer nonlinear MLP plus attention, revealing that attention becomes sparse during training before partially densifying, and offering an explanation of why Transformers can learn high-level concepts.
AI‑Guided Optimization
The SurCo framework (ICML'23) learns a linear surrogate cost that is handed to traditional combinatorial solvers, letting them indirectly solve nonlinear combinatorial problems such as embedding-table sharding, optical device design, and nonlinear shortest-path problems. SurCo won the best paper award at the ICML'23 SODS workshop.
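The structure of SurCo can be sketched on a toy problem: search over a linear surrogate cost, hand it to a stand-in linear solver, and score the returned solution with the true nonlinear objective. For simplicity the surrogate here is tuned by random search rather than by differentiating through the solver as the paper does:

```python
import torch

def linear_solver(costs: torch.Tensor, k: int) -> torch.Tensor:
    """Stand-in linear combinatorial solver: pick the k lowest-cost items.
    In SurCo this role is played by an off-the-shelf MILP/graph solver."""
    x = torch.zeros_like(costs)
    x[torch.topk(costs, k, largest=False).indices] = 1.0
    return x

def optimize_surrogate(f_nonlinear, n=20, k=5, iters=500, noise=0.1, seed=0):
    """Search over linear surrogate costs c; each candidate is scored by the
    nonlinear objective of the solver's solution."""
    g = torch.Generator().manual_seed(seed)
    best_c = torch.randn(n, generator=g)
    best_val = f_nonlinear(linear_solver(best_c, k))
    for _ in range(iters):
        cand = best_c + noise * torch.randn(n, generator=g)
        val = f_nonlinear(linear_solver(cand, k))
        if val < best_val:
            best_c, best_val = cand, val
    return linear_solver(best_c, k), best_val

# Toy nonlinear objective with quadratic interactions between selected items.
Q = torch.randn(20, 20)
solution, value = optimize_surrogate(lambda x: x @ (Q @ x))
```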
Subsequent work includes LANCER (NeurIPS'23), which cuts the number of calls to the combinatorial solver, improving efficiency on problems such as portfolio optimization, and GenCo, which generates diverse feasible solutions to nonlinear problems and has been applied to game-level and optical-device design.
Contrastive-learning approaches such as CL-LNS (ICML'23) and its successor ConPAS accelerate Large Neighborhood Search by learning which neighborhoods to destroy and re-solve. The overall conclusion is that fully replacing decades-old combinatorial methods with ML remains difficult; practical systems put an ML policy at the high level and invoke existing solvers when needed, as sketched below.
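That division of labor fits a generic LNS loop in which the learned part only decides which variables to "destroy" and an existing solver re-optimizes them; the toy problem and random destroy policy below are placeholders for the learned policy in CL-LNS/ConPAS:

```python
import itertools
import random

def lns(solution, score, choose_neighborhood, reoptimize, iters=100):
    """Generic Large Neighborhood Search loop: repeatedly unfix a learned (or
    heuristic) subset of variables and let an existing solver re-solve it."""
    best, best_score = solution, score(solution)
    for _ in range(iters):
        cand = reoptimize(best, choose_neighborhood(best))
        if score(cand) < best_score:
            best, best_score = cand, score(cand)
    return best, best_score

# Toy usage: minimize a random quadratic over binary variables.
n = 12
Q = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]
score = lambda x: sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
destroy = lambda x: random.sample(range(n), 4)   # stand-in for an ML destroy policy

def reoptimize(x, subset):                       # brute force plays the "solver" role
    best = list(x)
    for bits in itertools.product([0, 1], repeat=len(subset)):
        cand = list(x)
        for i, b in zip(subset, bits):
            cand[i] = b
        if score(cand) < score(best):
            best = cand
    return best

print(lns([0] * n, score, destroy, reoptimize))
```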
Reflections on the LLM Era
The pace of research has become extremely fast, with major conferences serving more as social gatherings than sources of cutting‑edge results. Real‑time discussions now happen on Discord, X (Twitter), HuggingFace repos, and GitHub issues.
Instances of parallel discovery (e.g., StreamingLLM vs. LM-Infinite) caused frustration, but the author persisted, re-framed the problem, and ran additional experiments (Table 2 of the StreamingLLM paper) that confirmed the existence of the Attention Sink. A similar case of parallel work appeared in ViT analysis.
Rapid iteration rewards hands‑on code exploration; those who write and test code quickly gain deeper understanding and can outpace even venture‑capital‑backed teams. The LLM wave also reshapes research thinking, making tasks that once seemed impossible—such as self‑reflection or executing novel textual instructions—achievable with a few well‑crafted prompts.
Looking ahead, the author expects continued acceleration driven by better hardware, open‑source ecosystems, and evolving researcher mindsets, potentially empowering individuals and small teams to make unique contributions.
References
Positional Interpolation: https://arxiv.org/abs/2306.15595
StreamingLLM: https://arxiv.org/abs/2309.17453
Blog: https://huggingface.co/blog/tomaarsen/attention-sinks
Video: https://www.youtube.com/watch?v=409tNlaByds
Media coverage: https://venturebeat.com/ai/streamingllm-shows-how-one-token-can-keep-ai-models-running-smoothly-indefinitely/
Discussion: https://news.ycombinator.com/item?id=37740932
Intel Extension for Transformers: https://twitter.com/HaihaoShen/status/1715335763032780853
HuggingFace Transformers PR: https://github.com/huggingface/transformers/pull/26681
MLC Chat: https://twitter.com/davidpissarra/status/1735761373261427189
H2O: https://arxiv.org/abs/2306.14048
RLCD: https://arxiv.org/abs/2307.12950
Scan&Snap: https://arxiv.org/abs/2305.16380
JoMA: https://arxiv.org/abs/2310.00535
Hong Kong University talk: https://twitter.com/hkudatascience/status/1706967154887962986
RIKEN talk: https://youtu.be/u05Z74dF0Gg
Remote talk: https://www.youtube.com/watch?v=eXPhvQgAT_I
SurCo: https://arxiv.org/abs/2210.12547
SODS workshop: https://sods-icml2023.github.io/
LANCER: https://arxiv.org/abs/2307.08964
GenCo: https://arxiv.org/abs/2310.02442v1
CL‑LNS: https://arxiv.org/abs/2302.01578
LM‑Infinite: https://arxiv.org/abs/2308.16137
ViT analysis: https://arxiv.org/abs/2309.16588
Online discussion of Positional Interpolation: https://kaiokendev.github.io/til#extending-context-to-8k
Improved RoPE scaling: https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
Twitter question: https://twitter.com/MarkwardtAdam/status/1674425742615269385
Twitter reply: https://twitter.com/tydsh/status/1674436093356421120
GPT‑4 speculation (part 2): https://zhuanlan.zhihu.com/p/622518320