How to Build a Real‑Time Virtual Avatar with CNN and face‑api.js

This tutorial explains how to create a simple virtual avatar system by combining convolutional neural networks, the face‑api.js library, and WebRTC, covering CNN fundamentals, face detection, landmark extraction, model selection, and rendering techniques with code examples.


Introduction

During a recent remote meeting, I noticed a participant using a "virtual avatar" feature with a mask and glasses, which inspired me to explore building a simple virtual avatar system.

Convolutional Neural Network (CNN)

The article outlines a CNN model built from three kinds of layers:

Convolution Layer – Feature Extraction

The convolution layer slides a kernel over the entire image, extracting local features. Multiple kernels capture different texture patterns, allowing the network to represent an image as a set of basic textures.

Question: after convolution the output shrinks, and pixels at the image border fall under fewer kernel positions, so edge features are under-extracted. What can address this? The standard remedy is to pad the input (for example with zero-padding) before convolving, so border pixels are covered as often as interior ones.

Pooling Layer – Downsampling

Pooling (often max-pooling) reduces data dimensionality, decreasing computational load and helping prevent overfitting. For example, a 20×20 input pooled with a 10×10 window (stride 10) reduces to a 2×2 feature map, shrinking each dimension tenfold (400 values down to 4).
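
To make that arithmetic concrete, here is a minimal max-pooling sketch in plain JavaScript; maxPool is an illustrative helper, not part of any library, and it assumes the window size divides the input evenly.

// Max-pool a 2D matrix with a size×size window and matching stride.
function maxPool(matrix, size) {
  const out = [];
  for (let i = 0; i < matrix.length; i += size) {
    const row = [];
    for (let j = 0; j < matrix[0].length; j += size) {
      let max = -Infinity;
      for (let di = 0; di < size; di++) {
        for (let dj = 0; dj < size; dj++) {
          max = Math.max(max, matrix[i + di][j + dj]);
        }
      }
      row.push(max);
    }
    out.push(row);
  }
  return out;
}

// A 20×20 input pooled with a 10×10 window yields a 2×2 feature map.
const input = Array.from({ length: 20 }, () =>
  Array.from({ length: 20 }, () => Math.random())
);
console.log(maxPool(input, 10)); // 2×2 array of maxima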

Overfitting occurs when a model performs well on training data but poorly on unseen data because it memorizes training specifics instead of learning general patterns.

Fully Connected Layer – Output

The final pooling output (e.g., 5×5×4 = 100 nodes) is connected to an output layer that produces class probabilities, such as [0.89, 0.1, 0.001] for cat, dog, and snake.
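
Those probabilities typically come from applying a softmax to the fully connected layer's raw outputs; the sketch below shows the standard formula with made-up logits.

// Softmax: turn raw scores (logits) into probabilities that sum to 1.
function softmax(logits) {
  const max = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

console.log(softmax([4.1, 1.9, -2.8])); // ≈ [0.90, 0.10, 0.001]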

Example: Edge Detection

The classic illustration is vertical edge detection: a kernel whose left and right columns have opposite signs responds strongly wherever pixel intensity changes from left to right.
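
The sketch below reproduces that idea in plain JavaScript; conv2d is an illustrative helper (no padding, stride 1), not a library function.

// A 3×3 vertical-edge kernel: opposite signs on the left and right columns.
const kernel = [
  [1, 0, -1],
  [1, 0, -1],
  [1, 0, -1],
];

// Valid (no-padding) 2D convolution with stride 1.
function conv2d(image, kernel) {
  const k = kernel.length;
  const out = [];
  for (let i = 0; i + k <= image.length; i++) {
    const row = [];
    for (let j = 0; j + k <= image[0].length; j++) {
      let sum = 0;
      for (let di = 0; di < k; di++) {
        for (let dj = 0; dj < k; dj++) {
          sum += image[i + di][j + dj] * kernel[di][dj];
        }
      }
      row.push(sum);
    }
    out.push(row);
  }
  return out;
}

// A bright-to-dark vertical boundary produces strong responses at the edge.
const image = [
  [10, 10, 10, 0, 0, 0],
  [10, 10, 10, 0, 0, 0],
  [10, 10, 10, 0, 0, 0],
  [10, 10, 10, 0, 0, 0],
];
console.log(conv2d(image, kernel)); // [[0, 30, 30, 0], [0, 30, 30, 0]]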

face‑api.js Overview

face‑api.js, built on TensorFlow.js, provides pre‑trained models for face detection, 68‑point landmark detection, face recognition, expression analysis, and gender/age estimation, dramatically reducing development effort.

Face Detection Models

SSD Mobilenet V1 – accurate, but large (5.4 MB), so it downloads slowly over weak networks and is heavier at inference time.

The Tiny Face Detector – lightweight (190 KB), real‑time capable, but slightly less accurate on small faces; ideal for mobile or resource‑constrained environments.

Landmark Detection

The 68-point model returns facial landmarks grouped into regions: jaw outline (points 1-17), left eyebrow (18-22), right eyebrow (23-27), nose bridge (28-31), lower nose (32-36), left eye (37-42), right eye (43-48), outer lips (49-60), and inner lips (61-68).
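
In practice you rarely index the raw array: face-api.js exposes a helper for each group on the landmarks object. A short sketch, assuming a successful detection result:

const landmarks = result.landmarks;
const jaw = landmarks.getJawOutline(); // 17 points along the jaw
const leftEye = landmarks.getLeftEye(); // 6 points
const nose = landmarks.getNose(); // bridge plus lower nose
const mouth = landmarks.getMouth(); // outer and inner lips
// Each entry is a point with x/y pixel coordinates.
console.log(jaw[0].x, jaw[0].y);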

Face Recognition

After detection and alignment, a face is passed to a recognition network that outputs a 128-dimensional feature vector (descriptor); the distance between two descriptors determines whether they belong to the same person. face-api.js's FaceMatcher compares descriptors by Euclidean distance, with cosine similarity as a common alternative.
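
A sketch of the comparison in face-api.js, assuming the recognition weights are loaded alongside the detector and landmark models; imgA and imgB are placeholder inputs, and 0.6 is FaceMatcher's default threshold:

// Load the recognition model in addition to the detector and landmark models
await faceApi.nets.faceRecognitionNet.loadFromUri('xxx/weights/');

const a = await faceApi
  .detectSingleFace(imgA, options)
  .withFaceLandmarks()
  .withFaceDescriptor();
const b = await faceApi
  .detectSingleFace(imgB, options)
  .withFaceLandmarks()
  .withFaceDescriptor();

// Descriptors are 128-dimensional; smaller distance means more similar.
const distance = faceApi.euclideanDistance(a.descriptor, b.descriptor);
console.log(distance < 0.6 ? 'same person' : 'different people');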

Virtual Avatar System

Acquiring Video Stream

Modern browsers support WebRTC; getUserMedia obtains audio/video streams based on MediaTrackConstraints.

// Request a 720p camera stream plus audio
const constraints = { audio: true, video: { width: 1280, height: 720 } };
// Attach the stream to a <video> element (videoRef is a React ref here)
const setLocalMediaStream = (mediaStream) => {
    videoRef.current.srcObject = mediaStream;
};
navigator.mediaDevices.getUserMedia(constraints)
    .then(setLocalMediaStream)
    .catch((err) => console.error('getUserMedia failed:', err));

Extracting Face Features

The Tiny Face Detector combined with the 68-point landmark model is used. Adjustable parameters include inputSize (the size at which frames are processed, which must be divisible by 32; smaller values speed up detection but reduce accuracy for small faces) and scoreThreshold (default 0.5; lower it to also return low-confidence detections).

Loading Models

// Load detection model
await faceApi.nets.tinyFaceDetector.loadFromUri('xxx/weights/');
// Load landmark model
await faceApi.nets.faceLandmark68Net.loadFromUri('xxx/weights/');

Switching Detection Model

// Configure the Tiny Face Detector (inputSize and scoreThreshold as above)
const options = new faceApi.TinyFaceDetectorOptions({
  inputSize,
  scoreThreshold,
});
// Detect the most prominent face and attach its 68 landmarks
const result = await faceApi
  .detectSingleFace(videoEl, options)
  .withFaceLandmarks();
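
For a live avatar, detection has to run once per frame, and the later snippets reference a resizedResult obtained by scaling detections back to the display size. A minimal loop sketch, reusing videoEl, options, and canvasRef from the surrounding snippets:

// Match the overlay canvas to the video's display size
const displaySize = { width: videoEl.videoWidth, height: videoEl.videoHeight };
faceApi.matchDimensions(canvasRef.current, displaySize);

const onFrame = async () => {
  const result = await faceApi
    .detectSingleFace(videoEl, options)
    .withFaceLandmarks();
  if (result) {
    // Scale landmark coordinates from the detector's input size to the canvas
    const resizedResult = faceApi.resizeResults(result, displaySize);
    // ...draw the avatar here (see Canvas Rendering below)
  }
  requestAnimationFrame(onFrame);
};
requestAnimationFrame(onFrame);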

Avatar Rendering

Using the 68 landmarks, relative coordinates are computed to place a 256×256 mask image anchored at the two jaw corners: landmark points 1 and 17, i.e. array indices 0 and 16.

Canvas Rendering

// Jaw corners: indices 0 and 16 of the 68-point array
const { positions } = resizedResult.landmarks;
const leftPoint = positions[0];
const rightPoint = positions[16];
// Face width: the distance between the two jaw corners
const length = Math.hypot(
  leftPoint.x - rightPoint.x,
  leftPoint.y - rightPoint.y
);
canvasCtx?.drawImage(
  mask,
  0, 0, 256, 256,           // source rect: the full 256×256 mask image
  leftPoint.x, leftPoint.y, // destination origin: the left jaw corner
  length, length            // destination size: scaled to the face width
);

MediaStream Rendering

The canvas captureStream API provides a MediaStream whose video track can be combined with the tracks of the original camera stream, enabling seamless streaming of the composited avatar.

// Capture the composited canvas as a video stream
const stream = canvasRef.current.captureStream();
// Clone the original camera stream (res[0] here) and add the canvas track
mediaStream = res[0].clone();
mediaStream.addTrack(stream.getVideoTracks()[0]);
videoRef.current.srcObject = mediaStream;

Comparison

Canvas rendering (drawing the composite onto a visible canvas each frame) offers broader browser compatibility, but frame pacing can stutter under network or hardware pressure in real-time communication.

MediaStream rendering (piping the canvas through captureStream) has more limited compatibility but delivers smoother playback in live-stream scenarios.

Practical Results

Using only two landmark points yields modest visual quality; leveraging all 68 points enables richer effects such as full‑face skin replacement.
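
As one illustration of a richer 68-point effect, the jaw outline can clip a face-shaped region before drawing a texture. A rough sketch: skinTexture is a hypothetical preloaded image, and since the jaw outline only traces the lower face, a real implementation would extend the path with the eyebrow points.

// Clip the canvas to the polygon formed by the jaw outline
const outline = resizedResult.landmarks.getJawOutline();
canvasCtx.save();
canvasCtx.beginPath();
outline.forEach((p, i) =>
  i === 0 ? canvasCtx.moveTo(p.x, p.y) : canvasCtx.lineTo(p.x, p.y)
);
canvasCtx.closePath();
canvasCtx.clip();
// Only the clipped face region receives the texture
canvasCtx.drawImage(skinTexture, 0, 0, canvasRef.current.width, canvasRef.current.height);
canvasCtx.restore();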

Extended Thoughts

Online assessment: detect face count and identity to trigger alerts, log anomalies, and allow backend monitoring (a minimal sketch follows this list).

Learning: detect when a user looks away to prompt re-engagement.

Bullet-screen comments (danmaku): detect faces so overlaid comments avoid covering them.
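
A minimal sketch of the assessment idea: count the faces in the current frame and flag anomalies. reportAnomaly is a hypothetical backend call, and options reuses the Tiny Face Detector configuration from earlier.

const detections = await faceApi.detectAllFaces(videoEl, options);
if (detections.length !== 1) {
  // Zero faces (user absent) or several faces (possible onlooker)
  reportAnomaly({ faceCount: detections.length, timestamp: Date.now() });
}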

Conclusion

The guide demonstrates a complete pipeline—from CNN fundamentals and face‑api.js model selection to video capture, landmark extraction, and avatar rendering—providing a foundation for building interactive virtual avatar applications.
