Best Practices for Building an Entity‑Relationship Annotation Tool at Laiye AI R&D Center
This article details Laiye Technology’s AI R&D team’s end‑to‑end approach to designing and optimizing a custom entity‑relationship annotation tool, covering data‑labeling challenges, shortcomings of Excel and off‑the‑shelf solutions, architectural requirements, line‑breaking and mark‑position algorithms, performance improvements, and real‑world results.
When interviewing AI algorithm engineers, the most common answer to "what is the most important thing in machine learning?" is "data, data, data". High‑quality labeled data and efficient labeling workflows are therefore critical for any AI team.
This article explains the best practices Laiye Technology’s AI R&D Center followed to implement a proprietary "entity‑relationship annotation tool".
Background: Laiye’s dialogue robot relies heavily on Natural Language Processing (NLP), and annotated entity‑relationship data is essential for improving its NLP capabilities.
The team progressed through three stages of labeling:
Using Excel for annotation.
Adopting commercial annotation tools (e.g., Baidu Brain).
Developing a custom in‑house tool.
Excel drawbacks include cumbersome collaboration, lack of fine‑grained permissions, and difficulty maintaining annotation and review status.
Commercial tools improve on Excel by offering text filtering, shared entity types, visual highlights, and relationship counts, but they still suffer from limitations such as inability to label the same text segment with multiple entities, poor handling of cross‑line annotations, and unclear relationship visualizations.
Custom tool requirements were defined as:
Real‑time maintenance of entity and relationship types on the annotation page.
Support for multiple entities within the same text segment.
Accurate character‑position retrieval.
Entities that span a line break must remain a single annotation rather than being split into separate ones.
To meet these needs, the team introduced an additional line element for each annotation and positioned it absolutely beneath the text, allowing multiple overlapping annotations.
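A minimal sketch of how such an absolutely positioned mark element might be placed beneath the text. The constants and the markStyle function are hypothetical illustrations, not the tool's actual code; the real tool derives its metrics from its own styles. The offsetY value is the stacking level assigned when overlapping marks share a line.

```javascript
// Hypothetical layout constants; the real tool derives these from its styles.
const LINE_HEIGHT = 24;  // vertical distance between text lines, px
const TEXT_HEIGHT = 18;  // height of the text itself within a line, px
const MARK_HEIGHT = 4;   // height of one underline bar, px
const MARK_GAP = 2;      // gap between stacked bars, px

// Compute the absolute CSS position of one mark segment. `lineIndex` is the
// text line the segment sits on; `offsetY` is the stacking level assigned
// when overlapping marks share that line.
function markStyle(lineIndex, offsetY, left, width) {
  return {
    position: 'absolute',
    left: `${left}px`,
    width: `${width}px`,
    top: `${lineIndex * LINE_HEIGHT + TEXT_HEIGHT + offsetY * (MARK_HEIGHT + MARK_GAP)}px`,
    height: `${MARK_HEIGHT}px`,
  };
}
```

Because each mark is its own element rather than a wrapper around the text nodes, any number of marks can cover the same characters without fighting over the DOM structure.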
Key algorithmic components are shown below.
function breakIntoLines(str, width) {
  // Get a test span that inherits the tool's text style
  const span = getTestSpanInstance();
  const lines = [];
  let tokens = '';   // characters accumulated for the current line
  let stIndex = 0;   // start index of the current line within str
  while (str.length) {
    // Tentatively append the next character and measure the result
    span.innerText = tokens + str[0];
    if (span.offsetWidth >= width) {
      // Adding this character would overflow: close the current line
      lines.push({ stIndex, tokens });
      stIndex += tokens.length;
      tokens = str[0];
    } else {
      tokens += str[0];
    }
    str = str.slice(1);
  }
  if (tokens) { lines.push({ stIndex, tokens }); }
  return lines;
}

The original implementation suffered from excessive offsetWidth queries — one forced layout reflow per character — causing layout thrashing, and from an O(l·n²) algorithm for computing the vertical offsets of overlapping marks.
Optimizations applied:
For the first line, directly measure a chunk of ~30 characters and adjust iteratively, then reuse the measured character count as a heuristic for subsequent lines.
Cache line character counts to reduce width measurements.
Refactor the mark‑position algorithm to process marks in start‑index order, assign marks to the current line only, and compute vertical offsets by checking overlap within that line, reducing complexity to near O(l·n).
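The first two optimizations can be sketched as follows. This is a simplified illustration, not the tool's actual code: the measure callback stands in for the test span's offsetWidth read, and breakIntoLinesFast is a hypothetical name. Measuring a whole chunk and adjusting it by single characters replaces the per-character measurement loop, and the previous line's character count seeds the next line's guess.

```javascript
// Sketch of the chunked line-breaking heuristic. `measure(text)` stands in
// for the test span's offsetWidth read; chunk-sized measurements replace the
// original one-query-per-character loop.
function breakIntoLinesFast(str, width, measure) {
  const lines = [];
  let stIndex = 0;
  let guess = 30; // initial chunk size; refined as lines are measured
  while (stIndex < str.length) {
    let count = Math.min(guess, str.length - stIndex);
    // Grow the chunk while one more character still fits
    while (stIndex + count < str.length &&
           measure(str.substr(stIndex, count + 1)) < width) {
      count++;
    }
    // Shrink the chunk until it fits
    while (count > 1 && measure(str.substr(stIndex, count)) >= width) {
      count--;
    }
    lines.push({ stIndex, tokens: str.substr(stIndex, count) });
    stIndex += count;
    guess = count; // reuse this line's count as the next line's guess
  }
  return lines;
}
```

With roughly uniform text, the guess is usually right or off by one or two characters, so each line costs a handful of measurements instead of one per character.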
function getMarkPosition(lines, marks) {
  // Sort a copy so the caller's array is not reordered in place
  const stIndexArr = [...marks].sort((a, b) => a.stIndex - b.stIndex);
  let st = 0;
  let current = []; // marks that intersect the line being processed
  for (let line of lines) {
    // Pull in every mark that starts on or before the end of this line
    while (stIndexArr[st] && stIndexArr[st].stIndex <= line.endIndex) {
      current.push(stIndexArr[st]);
      st++;
    }
    // compute offsetX for each mark in current
    addMarkLines(line, current);
    // Keep only the marks that continue past this line
    current = current.filter(m => m.endIndex > line.endIndex);
  }
}
function addMarkLines(line, marks) {
  // Each inner array is one stacking level of mutually non-overlapping marks
  const markLines = [[]];
  marks.forEach(mark => {
    let placed = false;
    // Greedily drop the mark into the first level it does not overlap
    for (let i = 0; i < markLines.length; i++) {
      if (!markLines[i].some(old => isOverlap(old, mark))) {
        mark.offsetY = i;
        markLines[i].push(mark);
        placed = true;
        break;
      }
    }
    if (!placed) {
      // Every existing level overlaps this mark: open a new one below
      mark.offsetY = markLines.length;
      markLines.push([mark]);
    }
  });
  line.markLines = markLines;
}

Performance tests showed dramatic improvements: rendering 500 characters dropped from 233 ms to 79 ms, and rendering 20 000 characters fell from 11 232 ms to 91 ms.
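The isOverlap check used by addMarkLines is not shown in the listings above; a minimal sketch, assuming each mark carries inclusive stIndex/endIndex character positions as in getMarkPosition:

```javascript
// Two marks overlap if each starts no later than the other ends
// (stIndex/endIndex are inclusive character positions).
function isOverlap(a, b) {
  return a.stIndex <= b.endIndex && b.stIndex <= a.endIndex;
}
```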
In production, the tool has been live for six months, handling over 30 000 annotation documents, increasing labeling efficiency by 35 % and achieving a 96.8 % accuracy rate. The project has been patented and is planned for open‑source release.
Author: Jia Siqi Editor: Liu Tongtong
Laiye Technology Team
Official account of Laiye Technology, featuring its best tech innovations, practical implementations, and cutting‑edge industry insights.