How to Parse Git Diffs into JSON and Highlight Line Changes Efficiently
This article explains how to transform raw Git diff output into a structured JSON format, handle various diff headers, and apply line‑by‑line highlighting using the diff‑match‑patch algorithm, while also offering performance optimizations for long lines.
Parsing Git Diff
To display a diff, the Git diff format must first be parsed into a structured representation such as JSON. The basic diff header contains the two file names being compared, followed by the hash values and file mode (e.g., 100644 for a text file).
Basic Format
diff --git a/f1 b/f1
index 6f8a38c..449b072 100644
--- a/f1
+++ b/f1
@@ -1,7 +1,7 @@
1
2
3
- a
+ b
5
6
7The lines starting with --- and +++ indicate the original and new file, while the @@ line defines the hunk header and the context lines that follow.
Extended Headers
After the first header, Git may include additional lines such as old mode, new mode, deleted file mode, copy from, rename from, similarity index, etc. Each of these can be captured with regular expressions and stored in the JSON object.
File Operations
Git diff distinguishes between added, deleted, copied, and renamed files. For example, an added file includes new file mode and a /dev/null entry, while a deleted file shows deleted file mode and the same /dev/null placeholder.
Binary Files
Binary changes are represented simply as Binary files a/img.png and b/img.png differ without showing the actual byte differences.
Line‑by‑Line Diff and Highlighting
After the initial JSON is built, each line can be further diffed to highlight exact changes. Only lines that were modified (neither pure additions nor deletions) are processed with the diff_match_patch.diff_main function.
const diff = diffMatchPatch.diff_main(oldLine.content, newLine.content);
if (diff && diff.length) {
oldLine.diff = [];
newLine.diff = [];
diff.forEach(item => {
if (item[0] === DiffMatchPatch.DIFF_INSERT) {
newLine.diff.push({ tag: 'added', content: item[1] });
} else if (item[0] === DiffMatchPatch.DIFF_DELETE) {
oldLine.diff.push({ tag: 'deleted', content: item[1] });
} else {
oldLine.diff.push({ tag: '', content: item[1] });
newLine.diff.push({ tag: '', content: item[1] });
}
});
}This produces a diff field for each modified line, indicating inserted or deleted fragments.
Performance Optimizations
Because the classic LCS algorithm is O(M×N), several shortcuts are applied:
Immediate equality check – if the two strings are identical, skip diffing.
Common prefix and suffix detection using binary search (O(log n)).
Detecting pure insertions or deletions without invoking the full algorithm.
Checking for substring containment to avoid unnecessary work.
Skipping line diff for very long lines (e.g., >80 characters) by setting a threshold.
const threshold = 80;
const len = Math.max(oldLine.content.length, newLine.content.length);
if (len <= threshold) {
const diff = diffMatchPatch.diff_main(oldLine.content, newLine.content);
// process diff as shown above
}These heuristics dramatically reduce processing time for large diffs, especially when diffing embedded scripts that can span thousands of characters.
References
Git Diff Documentation
Longest Common Subsequence (LCS)
Neil Fraser’s Diff Algorithm Overview
Optimized Diff Algorithm (PDF)
Google Diff‑Match‑Patch Library
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Taobao Frontend Technology
The frontend landscape is constantly evolving, with rapid innovations across familiar languages. Like us, your understanding of the frontend is continually refreshed. Join us on Taobao, a vibrant, all‑encompassing platform, to uncover limitless potential.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
