
Applying YOLOv5 Object Detection for Black, Color, and Normal Screen Classification in Video Frames

This article presents a method that replaces traditional manual video frame quality checks with an automated YOLOv5‑based object detection pipeline, detailing data labeling, model training, loss computation, inference code, and experimental results that show higher accuracy than ResNet for classifying black, color‑screen, and normal frames.

360 Tech Engineering

Black-screen and color-screen detection in video frames is a crucial part of video quality assessment, but manual inspection is labor-intensive and inefficient. This article proposes an automated solution that recasts the classification problem as object detection, using YOLOv5.

The workflow starts with a simplified labeling strategy where each whole image is treated as a single target, assigning class 0 to normal screens, 1 to colorful screens, and 2 to black screens. The labeling code is shown below:

import os
import cv2

OBJECT_DICT = {"Normalscreen": 0, "Colorfulscreen": 1, "Blackscreen": 2}

def parse_json_file(image_path):
    # The whole frame is one target, so the box always spans the full image.
    image_name = os.path.basename(image_path).split('.')[0]
    img = cv2.imread(image_path)
    image_height, image_width = img.shape[:2]  # cv2 shape is (height, width, channels)
    # The class name is the parent directory, e.g. ".../Blackscreen/frame.jpg"
    label = os.path.basename(os.path.dirname(image_path))
    label = OBJECT_DICT.get(label)
    xmin, ymin = 0, 0
    xmax, ymax = image_width, image_height
    # Normalized YOLO coordinates: a full-frame box is always (0.5, 0.5, 1.0, 1.0)
    x_center = (xmin + xmax) / 2 / float(image_width)
    y_center = (ymin + ymax) / 2 / float(image_height)
    width = (xmax - xmin) / float(image_width)
    height = (ymax - ymin) / float(image_height)
    label_dict = {label: [str(x_center), str(y_center), str(width), str(height)]}
    return image_name, sorted(label_dict.items(), key=lambda x: x[0])
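The parsed result can then be written out as a YOLO-format `.txt` label file, one per image, with one `class x_center y_center width height` line per target. A minimal sketch (the `labels` output directory and `write_yolo_label` helper name are assumptions, not from the original article):

```python
import os

def write_yolo_label(image_name, label_items, out_dir="labels"):
    """Write one YOLO-format label file: 'class x_center y_center w h' per line."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, image_name + ".txt"), "w") as f:
        for cls, box in label_items:  # label_items is the sorted (class, box) list
            f.write(str(cls) + " " + " ".join(box) + "\n")

# Example: a full-frame box for class 2 (black screen)
write_yolo_label("frame_0001", [(2, ["0.5", "0.5", "1.0", "1.0"])])
```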

The training pipeline follows the standard YOLOv5 workflow: load the dataset configuration, build the model with the right number of classes (three here), set up a cosine learning-rate schedule, and run the training loop. Key steps are illustrated in the following snippet:

# Load data, get train and test paths
with open(opt.data) as f:
    data_dict = yaml.load(f, Loader=yaml.FullLoader)
    with torch_distributed_zero_first(rank):
        check_dataset(data_dict)
train_path = data_dict['train']
test_path = data_dict['val']
Number_class, names = (1, ['item']) if opt.single_cls else (int(data_dict['nc']), data_dict['names'])

# Create model
model = Model(opt.cfg, ch=3, nc=Number_class).to(device)

# Cosine learning-rate schedule, decaying from the base LR toward hyp['lrf']
lf = lambda x: ((1 + math.cos(x * math.pi / epochs)) / 2) * (1 - hyp['lrf']) + hyp['lrf']
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)

# Training loop
for epoch in range(start_epoch, epochs):
    model.train()
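The body of the loop follows the usual PyTorch pattern: forward pass, loss, backward, optimizer step, then one scheduler step per epoch. A self-contained sketch of that structure, with a toy linear model and cross-entropy loss standing in for YOLOv5's model and combined loss:

```python
import math
import torch
import torch.nn as nn
from torch.optim import lr_scheduler

epochs, lrf = 3, 0.2
model = nn.Linear(4, 3)  # stand-in for the YOLOv5 model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
lf = lambda x: ((1 + math.cos(x * math.pi / epochs)) / 2) * (1 - lrf) + lrf
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)

for epoch in range(epochs):
    model.train()
    # One toy batch per epoch; real training iterates over a dataloader
    for imgs, targets in [(torch.randn(2, 4), torch.randint(0, 3, (2,)))]:
        loss = nn.functional.cross_entropy(model(imgs), targets)
        optimizer.zero_grad()
        loss.backward()   # in YOLOv5 this is the combined box/obj/cls loss
        optimizer.step()
    scheduler.step()      # cosine decay, one step per epoch
```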

The loss function combines three components: a bounding-box regression loss (CIoU in the snippet below, a refinement of GIoU), an objectness loss, and a classification loss:

def compute_loss(p, targets, model):
    device = targets.device
    loss_cls = torch.zeros(1, device=device)
    loss_box = torch.zeros(1, device=device)
    loss_obj = torch.zeros(1, device=device)
    tcls, tbox, indices, anchors = build_targets(p, targets, model)
    h = model.hyp
    BCEcls = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([h['cls_pw']])).to(device)
    BCEobj = nn.BCEWithLogitsLoss(pos_weight=torch.Tensor([h['obj_pw']])).to(device)
    cp, cn = smooth_BCE(eps=0.0)  # positive/negative BCE targets (no label smoothing)
    nt = 0
    num_layers = len(p)  # number of detection layers (avoid shadowing numpy's np)
    balance = [4.0, 1.0, 0.4] if num_layers == 3 else [4.0, 1.0, 0.4, 0.1]
    bs = p[0].shape[0]  # batch size
    for i, pi in enumerate(p):
        image, anchor, gridy, gridx = indices[i]
        tobj = torch.zeros_like(pi[..., 0], device=device)
        n = image.shape[0]
        if n:
            nt += n
            ps = pi[image, anchor, gridy, gridx]  # predictions at target cells
            pxy = ps[:, :2].sigmoid() * 2 - 0.5
            pwh = (ps[:, 2:4].sigmoid() * 2) ** 2 * anchors[i]
            predicted_box = torch.cat((pxy, pwh), 1).to(device)
            iou = bbox_iou(predicted_box.T, tbox[i], x1y1x2y2=False, CIoU=True)
            loss_box += (1.0 - iou).mean()
            tobj[image, anchor, gridy, gridx] = (1.0 - model.gr) + model.gr * iou.detach().clamp(0).type(tobj.dtype)
            if model.nc > 1:  # classification loss only with multiple classes
                t = torch.full_like(ps[:, 5:], cn, device=device)
                t[range(n), tcls[i]] = cp
                loss_cls += BCEcls(ps[:, 5:], t)
        loss_obj += BCEobj(pi[..., 4], tobj) * balance[i]  # objectness applies to every cell
    s = 3 / num_layers
    loss_box *= h['giou'] * s
    loss_obj *= h['obj'] * s * (1.4 if num_layers == 4 else 1.0)
    loss_cls *= h['cls'] * s
    loss = loss_box + loss_obj + loss_cls
    return loss * bs, torch.cat((loss_box, loss_obj, loss_cls, loss)).detach()
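The `smooth_BCE` helper referenced above implements label smoothing for the BCE targets: it returns the positive and negative target values for a smoothing factor `eps`. In YOLOv5 it is essentially:

```python
def smooth_BCE(eps=0.1):
    # Positive and negative label values for BCE with label smoothing.
    # With eps=0.0 (as used above) the targets stay hard: cp=1.0, cn=0.0.
    return 1.0 - 0.5 * eps, 0.5 * eps

cp, cn = smooth_BCE(eps=0.0)  # cp=1.0, cn=0.0
```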

During inference, the detection results are post‑processed to extract the class with the highest confidence, effectively turning the object detector into a classifier:

def detect(opt, img):
    weights, imgsz = opt.weights, opt.img_size
    device = select_device(opt.device)
    half = device.type != 'cpu'  # half precision only on GPU
    model = experimental.attempt_load(weights, map_location=device)
    imgsz = check_img_size(imgsz, s=model.stride.max())
    if half:
        model.half()
    im0_shape = img.shape  # keep the original frame shape for coordinate scaling
    img = letterbox(img)[0]
    img = img[:, :, ::-1].transpose(2, 0, 1)  # BGR -> RGB, HWC -> CHW
    img = np.ascontiguousarray(img)
    if device.type != 'cpu':  # warm up the model with a dummy forward pass
        img_warm = torch.zeros((1, 3, imgsz, imgsz), device=device)
        _ = model(img_warm.half() if half else img_warm)
    img = torch.from_numpy(img).to(device)
    img = img.half() if half else img.float()
    img /= 255.0  # normalize to [0, 1]
    if img.ndimension() == 3:
        img = img.unsqueeze(0)  # add batch dimension
    pred = model(img, augment=opt.augment)[0]
    pred = non_max_suppression(pred, opt.conf_thres, opt.iou_thres,
                               classes=opt.classes, agnostic=opt.agnostic_nms)
    for det in pred:
        if det is not None and len(det):
            det[:, :4] = scale_coords(img.shape[2:], det[:, :4], im0_shape).round()
            # Keep the class of the highest-confidence detection
            best = torch.argmax(det[:, 4])
            return int(det[best, -1])
    return None  # no detection above the confidence threshold
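Since `detect` returns an integer class index (or `None` when nothing is detected), mapping the result back to a human-readable label is a simple inversion of the labeling dictionary. A small sketch (the `label_for` helper and `"Unknown"` fallback are illustrative, not from the original article):

```python
CLASS_NAMES = {0: "Normalscreen", 1: "Colorfulscreen", 2: "Blackscreen"}

def label_for(detect_class):
    """Map detect()'s integer result to a label; None (no detection) -> 'Unknown'."""
    return CLASS_NAMES.get(detect_class, "Unknown")
```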

Experimental results on a dataset of 600 labeled frames (200 normal, 200 colorful, 200 black) show that the YOLOv5‑based classifier achieves 97% accuracy, outperforming a ResNet‑based classifier which reaches only 88% and often confuses normal and colorful screens.
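Accuracy figures like these are straightforward to reproduce from per-frame predictions, and a confusion matrix makes class mix-ups (such as normal vs. colorful) visible. A minimal sketch; the labels and predictions below are illustrative, not the article's data:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of frames whose predicted class matches the ground-truth label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion(y_true, y_pred):
    """Counter keyed by (true, predicted) class pairs."""
    return Counter(zip(y_true, y_pred))

# Illustrative example: 0=normal, 1=colorful, 2=black
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]
print(accuracy(y_true, y_pred))  # 5 of 6 frames correct
```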

The conclusion recommends using object‑detection frameworks such as YOLOv5 for classification tasks when the dataset is small or when pure classification models struggle, noting that the approach can be adapted to other detection architectures.

Tags: image classification, Python, deep learning, object detection, video quality, YOLOv5
Written by 360 Tech Engineering

Official tech channel of 360, building the most professional technology aggregation platform for the brand.