
Code Understanding Technology: Building White-Box Software Knowledge Graph at Baidu

Baidu's white-box code understanding platform combines static, dynamic, non-code, and LLM-based analyses in a three-layer architecture. It speeds up C/C++ processing nearly ninefold, supports multiple languages, and powers applications such as intelligent unit testing, orphan-function cleanup, and AI-driven risk detection. Planned integration with models such as GPT-4 aims to enable multi-turn code Q&A, automated refactoring, and predictive testing.

Baidu Geek Talk

Code understanding is a core technology for building software knowledge graphs. It provides the technical and data foundation for builds, testing, fault localization, and code interpretation, and it serves as the starting point for continuous integration, making the build process targeted and effective.

Code understanding analyzes a software system to extract information about its internal structure and runtime behavior. Common methods include static analysis, dynamic analysis, and non-source-code analysis. With the advent of large language models (LLMs), researchers are also exploring how to apply them to code understanding.

Static Analysis scans program code without running it, using lexical analysis, syntax analysis, control-flow analysis, and data-flow analysis to check the code against coding specifications and against safety, reliability, and maintainability metrics.
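As a small illustration of analyzing code without running it, the sketch below parses a snippet into a syntax tree with Python's standard `ast` module and extracts caller-to-callee edges. The sample source and the `call_edges` helper are assumptions made for this example; the article does not describe Baidu's parsers at this level, and a production analyzer would also resolve method calls, imports, and aliases.

```python
import ast

# Hypothetical sample input; it is parsed, never executed.
SOURCE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def main():
    return parse(load("input.txt"))
'''

def call_edges(source):
    """Map each top-level function to the plain names it calls."""
    tree = ast.parse(source)  # lexical + syntax analysis only
    edges = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            callees = set()
            for inner in ast.walk(node):
                # Only direct calls such as f(x); attribute calls like
                # obj.method() are skipped in this simplified sketch.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    callees.add(inner.func.id)
            edges[node.name] = callees
    return edges

for caller, callees in sorted(call_edges(SOURCE).items()):
    print(caller, "->", sorted(callees))
# load -> ['open']
# main -> ['load', 'parse']
# parse -> []
```

Edges like these are the kind of code relationship a knowledge graph can store as entities and links.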

Dynamic Analysis is the analysis of software system behavior before, during, and after execution in simulated or real environments.
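To contrast with static scanning, here is a minimal sketch of observing behavior during execution, using Python's standard `sys.setprofile` hook to record which function calls which at run time. The `record_calls` helper and the two sample functions are hypothetical and only stand in for whatever instrumentation a real platform uses.

```python
import sys

def record_calls(func, *args):
    """Run func(*args) and return (result, observed caller/callee pairs)."""
    observed = []

    def profiler(frame, event, arg):
        # "call" fires when a Python-level function is entered.
        if event == "call" and frame.f_back is not None:
            observed.append((frame.f_back.f_code.co_name, frame.f_code.co_name))

    sys.setprofile(profiler)
    try:
        result = func(*args)
    finally:
        sys.setprofile(None)  # always remove the hook
    return result, observed

def square(x):
    return x * x

def sum_squares(n):
    total = 0
    for i in range(n):
        total += square(i)
    return total

result, observed = record_calls(sum_squares, 3)
print(result)  # 5
print(observed.count(("sum_squares", "square")))  # 3
```

Unlike the static view, these edges reflect only the paths actually exercised by the chosen input, which is why the two kinds of analysis complement each other.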

Non-Code Analysis correlates non-source files in the repository (data files, configuration files) with the source code, so that the impact of a change to those files on code and functionality can be identified.
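One simple way to picture this correlation is to link each configuration key to the source files that mention it. The sketch below does this with a plain text match; the config contents, file names, and the `config_references` helper are all hypothetical, and the article does not say which matching technique Baidu uses.

```python
import json
import re

# Hypothetical repository contents for illustration only.
CONFIG = '{"timeout_ms": 300, "retry_limit": 3, "legacy_flag": true}'
SOURCES = {
    "client.py": 'wait(conf["timeout_ms"]); retry(conf["retry_limit"])',
    "server.py": 'listen(conf["timeout_ms"])',
}

def config_references(config_text, sources):
    """Map each top-level config key to the source files that mention it."""
    links = {}
    for key in json.loads(config_text):
        pattern = re.compile(r"\b" + re.escape(key) + r"\b")
        links[key] = sorted(
            name for name, text in sources.items() if pattern.search(text)
        )
    return links

print(config_references(CONFIG, SOURCES))
# {'timeout_ms': ['client.py', 'server.py'], 'retry_limit': ['client.py'], 'legacy_flag': []}
```

With a mapping like this, a change to `timeout_ms` points at two files to retest, while a key with no references is a candidate for review.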

LLM-Based Analysis leverages the reasoning and inference capabilities of large models to mine knowledge from a program's static and dynamic data.
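A hedged sketch of how static and dynamic facts might be handed to a model: the facts are assembled into a prompt, and the model is represented by any callable that takes a string and returns text. The prompt wording, `build_prompt`, `ask`, and the stand-in model are assumptions for illustration; no specific model API is implied by the article.

```python
def build_prompt(function_name, static_facts, dynamic_facts, question):
    """Assemble analysis facts and a question into a single prompt string."""
    lines = [
        "You are assisting with code understanding.",
        f"Function under review: {function_name}",
        "Static facts:",
        *[f"- {fact}" for fact in static_facts],
        "Dynamic facts:",
        *[f"- {fact}" for fact in dynamic_facts],
        f"Question: {question}",
    ]
    return "\n".join(lines)

def ask(model, prompt):
    """`model` is a hypothetical callable: prompt string in, answer text out."""
    return model(prompt)

prompt = build_prompt(
    "parse",
    ["called by main", "calls tokenize"],
    ["executed 3 times in the last test run"],
    "Is it safe to remove this function?",
)

# A stand-in model that just reports the prompt size, so the sketch runs offline.
print(ask(lambda p: f"received {len(p.splitlines())} lines", prompt))
# received 8 lines
```

Keeping the model behind a plain callable makes it easy to swap in a real client later without changing how the facts are gathered.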

The article introduces Baidu's three-layer architecture for code understanding: an Infrastructure Layer (multi-language parsers, data storage, caching mechanisms), an Analysis Layer (general analysis capabilities abstracted to characterize code relationships), and a Service Layer (easy-to-use, open data access). The solution supports 3 languages and 10+ code entity data sources, improves C/C++ analysis efficiency by nearly 9 times, and completes incremental analysis in under 200 seconds.
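The layering can be pictured with a toy sketch: an infrastructure layer that parses and caches by content hash (so unchanged files are not re-parsed, which is one plausible basis for fast incremental runs), an analysis layer that derives facts from parse results, and a service layer that exposes them. All class and method names here are hypothetical, and Python's `ast` stands in for the multi-language parsers; the article does not describe the real interfaces.

```python
import ast
import hashlib

class Infrastructure:
    """Infrastructure layer: parsing plus a content-hash cache."""
    def __init__(self):
        self._cache = {}
        self.parse_count = 0  # how many real parses happened

    def parse(self, source):
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if digest not in self._cache:
            self._cache[digest] = ast.parse(source)
            self.parse_count += 1
        return self._cache[digest]

class Analysis:
    """Analysis layer: general capabilities built on parse results."""
    def __init__(self, infra):
        self.infra = infra

    def functions(self, source):
        tree = self.infra.parse(source)
        return sorted(
            n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)
        )

class Service:
    """Service layer: simple data access keyed by file path."""
    def __init__(self, analysis):
        self.analysis = analysis
        self.repo = {}

    def update_file(self, path, source):
        self.repo[path] = source

    def list_functions(self, path):
        return self.analysis.functions(self.repo[path])

infra = Infrastructure()
service = Service(Analysis(infra))
service.update_file("a.py", "def f(): pass\ndef g(): pass\n")
print(service.list_functions("a.py"))  # ['f', 'g']
service.list_functions("a.py")         # served from cache
print(infra.parse_count)               # 1
```

The point of the separation is that callers only see the service layer, while caching and parser choices can change underneath without affecting them.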

Typical applications at Baidu include the following. Intelligent Unit Testing (UT), used in Go language migration, achieves 65% accuracy and surfaces 400+ valid risk issues per quarter. Unused Function Cleanup detects orphan functions with 97% accuracy and has helped remove 76,000+ lines of code across 57 business modules. AI-Static Analysis (AI-SA) identifies risks proactively and, built from scratch, has surfaced 20,000+ valid risk issues.
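Orphan-function detection can be framed as a reachability question on a call graph: any function that cannot be reached from a known entry point is a cleanup candidate. The sketch below shows that idea with a breadth-first search. The graph, the entry point, and the `orphan_functions` helper are hypothetical, and the article does not state that Baidu's tool works this way.

```python
from collections import deque

# Hypothetical call graph: caller -> list of callees.
CALL_GRAPH = {
    "main": ["load", "parse"],
    "load": [],
    "parse": ["tokenize"],
    "tokenize": [],
    "old_export": ["old_format"],
    "old_format": [],
}

def orphan_functions(call_graph, entry_points):
    """Return functions not reachable from any entry point, sorted by name."""
    reachable = set(entry_points)
    queue = deque(entry_points)
    while queue:
        current = queue.popleft()
        for callee in call_graph.get(current, ()):
            if callee not in reachable:
                reachable.add(callee)
                queue.append(callee)
    return sorted(set(call_graph) - reachable)

print(orphan_functions(CALL_GRAPH, ["main"]))
# ['old_export', 'old_format']
```

Note that `old_format` has a caller yet is still reported, because that caller is itself unreachable. In real code, calls made through reflection, function pointers, or external entry points can hide edges, which is one reason a reported accuracy below 100% is expected and candidates are reviewed before deletion.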

Future work will integrate large models (GPT-4, Wenxin Yiyan) to optimize code understanding across the storage, analysis, and model layers, enabling multi-turn Q&A for code comprehension, automated refactoring suggestions, test case design, and risk prediction.

Tags: CI/CD, AST, LLM, software engineering, code analysis, code quality, code understanding, static analysis, Baidu, software knowledge graph
Written by Baidu Geek Talk
