Optimizing the Code Diff System: From JGit to GitLab API and Diff Compensation
This article analyzes performance and concurrency problems in a code‑diff service, compares the original JGit‑based approach with a GitLab‑API solution, addresses new accuracy issues, and presents a compensation strategy using java‑diff‑utils to achieve stable, efficient backend diff processing.
The code‑diff system is a core component for incremental static code scanning, coverage, and interface analysis, and its stability and performance directly affect the entire workflow. As usage grew, especially with the incremental code statistics feature, the existing design revealed scalability and performance bottlenecks.
The original implementation cloned the GitLab repository branch to a local server and used JGit to compute diffs. Cloning large repositories took a long time, and each local working copy needed an exclusive lock, so two requests for the same repository could not be processed concurrently.
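The locking constraint described above can be pictured with a minimal sketch. `WorkingCopyLocks` and its method names are hypothetical and not taken from the original service; the only assumption is that each local clone is guarded by one lock keyed by a string such as the repository path.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** Hypothetical sketch: serializes diff jobs that share one local working copy. */
class WorkingCopyLocks {
    // One lock per key; the key decides how coarse the exclusion is.
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Runs the job while holding the lock that belongs to the given key. */
    <T> T withLock(String key, Supplier<T> job) {
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            // In the original design this is where clone/pull and the JGit diff would run.
            return job.get();
        } finally {
            lock.unlock();
        }
    }
}
```

With a repository-level key such as `"group/project"`, every request for that repository waits for the previous one to finish, including its clone or pull, which is why long pull times translate directly into queueing.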
Optimization proposals:
1. Space‑for‑time: continue cloning locally but store the code per branch, which eliminates cross‑branch locking and requires few logic changes. However, concurrent diffs on the same branch still need a lock, and lock contention grows with pull time.
2. Remove JGit: switch to java‑gitlab‑api and call GitLab’s native diff API, which provides stable performance and concurrency. The initial deployment met the requirements, which raised the question of why this approach had not been adopted earlier.
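To make the second proposal concrete, here is a rough sketch of a call to GitLab's repository compare endpoint (`GET /projects/:id/repository/compare` with `from` and `to`). It uses the JDK's `java.net.http` client rather than java‑gitlab‑api, so it does not show the library calls the article's service uses. The class name `GitLabCompareClient`, the host, and the token handling are assumptions for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a thin client for GitLab's compare endpoint. */
class GitLabCompareClient {
    private final String baseUrl; // assumed host, e.g. "https://gitlab.example.com"
    private final String token;   // assumed personal or project access token

    GitLabCompareClient(String baseUrl, String token) {
        this.baseUrl = baseUrl;
        this.token = token;
    }

    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Builds the compare URI; the project path must be URL-encoded ("/" becomes "%2F"). */
    URI compareUri(String projectPath, String from, String to) {
        return URI.create(baseUrl + "/api/v4/projects/" + enc(projectPath)
                + "/repository/compare?from=" + enc(from) + "&to=" + enc(to));
    }

    /** Fetches the raw JSON response, which carries one diff entry per changed file. */
    String fetchCompareJson(String projectPath, String from, String to)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(compareUri(projectPath, from, to))
                .header("PRIVATE-TOKEN", token)
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("compare failed with HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

Because no local working copy is involved, nothing here needs a lock; concurrency is bounded only by what the GitLab server accepts.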
New issue: Users reported inaccurate diffs because the GitLab API offers no option to ignore whitespace changes, whereas JGit does (e.g., RawTextComparator.WS_IGNORE_ALL). The business logic must ignore pure formatting changes for the incremental statistics to be correct.
Additional solution: Since GitLab’s diff endpoint cannot be configured to ignore whitespace, a compensation layer was added. The diff result from GitLab is re‑processed with the open‑source java‑diff‑utils library, applying the same HistogramDiff algorithm that JGit uses, to keep results consistent.
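The idea behind the compensation step can be sketched without any third‑party dependency. The example below is a simplified stand‑in: it uses a plain longest‑common‑subsequence walk instead of HistogramDiff and makes no java‑diff‑utils calls, so its output may differ from the real service on ambiguous inputs. `WhitespaceInsensitiveDiff` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: reports new-file lines that changed in more than whitespace. */
class WhitespaceInsensitiveDiff {

    /** Drops all whitespace, similar in spirit to JGit's WS_IGNORE_ALL comparator. */
    static String normalize(String line) {
        return line.replaceAll("\\s+", "");
    }

    /** Returns 1-based line numbers of newLines with no whitespace-insensitive match in oldLines. */
    static List<Integer> changedNewLines(List<String> oldLines, List<String> newLines) {
        int n = oldLines.size();
        int m = newLines.size();
        String[] a = new String[n];
        String[] b = new String[m];
        for (int i = 0; i < n; i++) a[i] = normalize(oldLines.get(i));
        for (int j = 0; j < m; j++) b[j] = normalize(newLines.get(j));

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..].
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both files; any new line that is not part of the LCS counts as changed.
        List<Integer> changed = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (j < m) {
            if (i < n && a[i].equals(b[j])) {
                i++;
                j++;
            } else if (i < n && lcs[i + 1][j] >= lcs[i][j + 1]) {
                i++; // old line was deleted
            } else {
                changed.add(j + 1); // new line was added or modified
                j++;
            }
        }
        return changed;
    }
}
```

A line that was only re‑indented or re‑spaced normalizes to the same string on both sides and therefore drops out of the result, which is the behavior the incremental statistics need. The table costs O(n·m) memory, so a sketch like this suits per‑file inputs of moderate size.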
Other problems:
1. Large files sometimes return empty diffs due to GitLab’s diff size limits (see GitLab Diff limits administration).
2. When the number of diff files is large, performance degrades; multithreaded processing was introduced to mitigate this.
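The multithreaded processing mentioned in the second item could look roughly like the sketch below. The article does not say how its thread pool is configured, so the fixed pool, the `threads` parameter, and the `ParallelDiffProcessor` name are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical sketch: runs a per-file diff step on a bounded thread pool. */
class ParallelDiffProcessor {

    /** Applies perFile to every path in parallel and returns results in input order. */
    static <R> List<R> processAll(List<String> paths, Function<String, R> perFile, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per changed file.
            List<CompletableFuture<R>> futures = new ArrayList<>();
            for (String path : paths) {
                futures.add(CompletableFuture.supplyAsync(() -> perFile.apply(path), pool));
            }
            // Collect in submission order; join() rethrows a task failure as CompletionException.
            List<R> results = new ArrayList<>();
            for (CompletableFuture<R> future : futures) {
                results.add(future.join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Each file's compensation step is independent of the others, so the work splits cleanly by file. In a long‑running service the pool would more likely be shared across requests than created per call as it is here.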
Conclusion: After extensive experimentation and several false starts, the GitLab‑API‑based approach with diff compensation now provides a stable and performant foundation for the diff system. Feedback and alternative ideas are welcome.
References:
1. https://github.com/java-diff-utils/java-diff-utils
2. https://docs.gitlab.com/ee/api/