How to Separate Complex Image Foreground from Background Using AI and Classic CV Techniques
This article presents a step‑by‑step solution that combines computer‑vision preprocessing, OCR, CNN classification, shape matching, and inpainting to isolate meaningful foreground elements from images with intricate backgrounds, discussing practical results, limitations, and code implementations.
Background
Previous work introduced a UI‑automation goal: convert a single design image into code. To do this we must first cut out meaningful blocks (text, buttons, product images) from the picture. Traditional cutting treats the whole picture as a single image and loses structural information, especially when the background is complex. Industry solutions often rely on computer‑vision or AI (e.g., FCN+CRF) which achieve only ~80% accuracy, lack pixel‑level edges, require costly labeling, are hard to train, and behave like a black box. We therefore explored a hypothesis that UI foregrounds usually exhibit clear geometric features (regular shapes, presence of text, closed contours) and can be separated without heavy AI.
Practical Results
Extensive testing of many CV algorithms showed that no single method works universally; each works only in specific scenarios and requires different parameters for varying color complexities. A case‑by‑case approach would become unmaintainable.
We therefore built a pipeline that:
1. Detects as many foreground regions as possible.
2. Filters out low‑confidence regions.
3. Assigns foreground‑background layers with a hierarchical allocator.
4. Repairs the image by filling blank background areas.
Below are the foreground‑search process (GIF) and the final layered separation result.
Logic Overview
Text Processing
OCR Rough Position
OCR provides approximate bounding boxes for text. For example, the left image is the Xianyu homepage and the right image shows OCR‑derived white boxes. OCR gives coarse positions but cannot separate individual characters; whole lines may be merged and non‑text elements (e.g., banner slogans) can be mistakenly recognized as text.
Segmentation and CNN Classifier
Each OCR‑detected region is cropped to the smallest possible image and fed to a TensorFlow CNN that decides whether the region is editable text or an image‑based graphic.
"""
UI basic element recognition
"""
import tensorflow as tf  # TensorFlow 1.x API

# Assumed setup: the original snippet does not show how g2 and ui_sess are created
g2 = tf.Graph()
ui_sess = tf.Session(graph=g2)

# Load model (placeholder code)
with ui_sess.as_default():
    with g2.as_default():
        tf.global_variables_initializer().run()
        # Load label file
        ui_label_lines = [line.rstrip() for line in tf.gfile.GFile("AI_models/CNN/ui-elements-NN/tf_files/retrained_labels.txt")]
        # Load graph
        with tf.gfile.FastGFile("AI_models/CNN/ui-elements-NN/tf_files/retrained_graph.pb", 'rb') as f:
            ui_graph_def = tf.GraphDef()
            ui_graph_def.ParseFromString(f.read())
            tf.import_graph_def(ui_graph_def, name='')
        ui_softmax_tensor = ui_sess.graph.get_tensor_by_name('final_result:0')

def ui_classify(image_path):
    image_data = tf.gfile.FastGFile(image_path, 'rb').read()
    predictions = ui_sess.run(ui_softmax_tensor, {'DecodeJpeg/contents:0': image_data})
    # Label indices sorted by descending score
    top_k = predictions[0].argsort()[::-1]
    for node_id in top_k:
        print('%s (score = %s)' % (ui_label_lines[node_id], predictions[0][node_id]))
    # Return the best label; the original returned the last (lowest-scoring) one
    best = top_k[0]
    return ui_label_lines[best], predictions[0][best]
Text Extraction
When the background is uniform, text regions are easy to extract. For complex backgrounds we evaluated Harris corners, Canny edges, SWT, and K‑means; K‑means gave the best results. The following code reshapes the gray region, runs K‑means, and reconstructs the segmented mask.
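import cv2
import numpy as np

# gray_region: grayscale crop of the text region; K: number of clusters
Z = gray_region.reshape((-1, 1))
Z = np.float32(Z)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
ret, label, center = cv2.kmeans(Z, K, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
center = np.uint8(center)
# Replace every pixel with its cluster centre, then restore the 2-D shape
res = center[label.flatten()]
res2 = res.reshape((gray_region.shape))
Foreground Search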
Enhance Edges, Suppress Non‑edges
We convolve the original image with a Laplacian‑style kernel whose weights sum to zero: flat areas map to roughly zero, while edges produce strong responses.
conv_kernel = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1]
]
Denoising
The convolved image is converted to grayscale, binarized, and small noisy components are removed using cv2.connectedComponentsWithStats().
Contour Search Based on Text Position
Using the top‑left corner of each OCR‑detected text block as a seed, we perform flood‑fill to obtain a region, then extract its external contour with cv2.findContours(). We test whether the contour encloses the text (via cv2.pointPolygonTest) to decide if it represents a valid foreground.
Determine Inner/Outer Contours
If the text lies inside a contour, the contour is expanded outward until the border is captured; otherwise the existing border is used directly.
Foreground Classifier
Define Valid Shapes
Three template shapes (square, rectangle, circle) are pre‑loaded. A contour is considered valid if its matchShapes score against any template is below a threshold (empirically < 3) and the contour contains text.
import os
import cv2

class ShapeDetector:  # class name assumed; the original shows only the method bodies
    def __init__(self):
        # Load shape templates
        shape_dir = os.getcwd() + '/fgbgIsolation/utils/shapes/'
        self.circle = self._load_contour(shape_dir + 'circle.png')
        self.square = self._load_contour(shape_dir + 'square.png')
        self.rect = self._load_contour(shape_dir + 'rect.png')

    @staticmethod
    def _load_contour(path):
        template = cv2.imread(path, 0)
        # [-2] selects the contour list under both OpenCV 3.x (3 return values) and 4.x (2)
        contours = cv2.findContours(template, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2]
        return contours[0]

    def detect(self, cnt):
        shape = "unidentified"
        templates = [self.square, self.rect, self.circle]
        names = ['square', 'rect', 'circle']
        for name, template in zip(names, templates):
            score = cv2.matchShapes(template, cnt, cv2.CONTOURS_MATCH_I1, 0.0)  # lower score = more similar
            if score < 3:
                shape = name
                break
        return shape, score
Image Inpainting
Compute Overlap Regions
Only overlapping parts of layered foregrounds need repair. We compute the intersection of the current layer mask with masks of all higher layers using cv2.bitwise_and, then build a combined overlap mask.
# mask: current layer mask; i: index of the current layer
# layers_merge: array holding each pixel's layer index (inferred from the comparison below)
# inpaint_mask: candidate repair mask prepared earlier (not shown here)
# Pixels that belong to any layer above the current one
UPPER_level_mask = np.where(layers_merge > i, 255, 0).astype(np.uint8)
# [-2] selects the contour list under both OpenCV 3.x and 4.x
contours = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2]
overlaps_mask = np.zeros(mask.shape, np.uint8)
for cnt in contours:
    cnt_mask = np.zeros(mask.shape, np.uint8)
    cv2.drawContours(cnt_mask, [cnt], 0, (255, 255, 255), cv2.FILLED, cv2.LINE_AA)
    overlap_mask = cv2.bitwise_and(inpaint_mask, cnt_mask, mask=UPPER_level_mask)
    overlaps_mask = cv2.bitwise_or(overlaps_mask, overlap_mask)
# Assign the computed overlap mask to the inpainting mask
inpaint_mask = overlaps_mask
Inpainting
OpenCV's cv2.INPAINT_TELEA algorithm first restores edge pixels and then propagates inward until the whole masked area is filled.
# img: original image; inpaint_mask: mask from previous step
# dst: repaired image
dst = cv2.inpaint(img, inpaint_mask, 3, cv2.INPAINT_TELEA)
Extension
The presented computer‑vision‑centric, deep‑learning‑assisted pipeline works well for many UI screenshots, but challenging cases with high‑contrast edges, heavy noise, or indistinct contours still leave room for improvement.
