How to Build Real-Time Mouse Gesture Recognition with TensorFlow.js
This article explains how to design, implement, and evaluate a mouse gesture recognition system using machine learning and geometric analysis, covering data preprocessing, model training with TensorFlow.js, cosine‑similarity matching, performance optimizations, and extensions to three‑dimensional VR/AR environments.
TLDR
This article designs and validates a mouse‑gesture recognition solution using machine learning and cosine similarity, and explores extending the approach to three‑dimensional space.
Background
The core of front‑end technology is responding directly to user interactions. On PC, the mouse (or touchpad) is the most important input device after the keyboard, providing events such as click, scroll and drag. While keyboard shortcuts are convenient, mouse gestures offer a complementary shortcut method for scenarios where shortcuts are less practical. Early browsers popularized mouse gestures (e.g., straight line, check mark, circle). Touch‑screen devices further popularized gestures (e.g., swipe back, pull‑to‑refresh). With the rise of VR/AR/MR, three‑dimensional gestures are gaining attention. This article uses the PC scenario to illustrate the core implementation logic and then generalizes it to other platforms.
Goals
Implement the core logic: record and recognize mouse gestures with translation and scale invariance, tolerating natural variation across repetitions of the same gesture.
Encapsulate the logic as a custom event that can be listened to via addEventListener, improving development convenience.
Provide a product‑level experience that allows users to add custom gestures.
Extend the solution to three‑dimensional space.
Problem Analysis
Gesture recognition must handle uncertainty: the user‑drawn path may deviate from a predefined template. For predefined gestures, the task becomes measuring similarity between a deterministic template and an uncertain user path; for custom gestures, both paths are uncertain.
Traditional programming would require precise rule‑based logic. Instead, we can treat the gesture as either a raster image (apply classic machine‑learning classification) or as a vector shape (define a custom similarity metric).
Implementation Plan
Machine‑Learning Approach
Switching from rule‑based to learning‑based programming, we let a neural network learn the discriminative features. Because labeling real gesture data is costly, we substitute the MNIST handwritten‑digit dataset as a proof of concept.
Select an appropriate ML model based on the problem.
Choose a convenient ML framework.
Train the model and obtain a model file.
Deploy and run the model to obtain predictions.
Model Selection
TensorFlow.js provides many pre‑trained models that can be used directly or fine‑tuned.
Convolutional Neural Networks (CNN) excel at processing raster data. They treat pixel grids as input, extract hierarchical features, and output compact feature vectors for classification.
Framework Choice
TensorFlow.js is chosen because it runs entirely in the browser, offers good portability, low latency, and a low learning curve for web developers.
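The snippets below assume a global tf object, e.g. the official CDN bundle (https://cdn.jsdelivr.net/npm/@tensorflow/tfjs). With npm and a bundler you would import it instead, as in this one‑line sketch:
// Installed via: npm install @tensorflow/tfjs
import * as tf from '@tensorflow/tfjs';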
Model Training
The following code demonstrates data loading, model definition, training, and evaluation using TensorFlow.js.
/**
 * @license
 * Copyright 2018 Google LLC. All Rights Reserved.
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 * http://www.apache.org/licenses/LICENSE-2.0
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
const IMAGE_SIZE = 784;
const NUM_CLASSES = 10;
const NUM_DATASET_ELEMENTS = 65000;
const NUM_TRAIN_ELEMENTS = 55000;
const NUM_TEST_ELEMENTS = NUM_DATASET_ELEMENTS - NUM_TRAIN_ELEMENTS;
const MNIST_IMAGES_SPRITE_PATH = 'https://storage.googleapis.com/learnjs-data/model-builder/mnist_images.png';
const MNIST_LABELS_PATH = 'https://storage.googleapis.com/learnjs-data/model-builder/mnist_labels_uint8';
export class MnistData {
constructor() {
this.shuffledTrainIndex = 0;
this.shuffledTestIndex = 0;
}
async load() {
const img = new Image();
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
const imgRequest = new Promise((resolve, reject) => {
img.crossOrigin = '';
img.onload = () => {
img.width = img.naturalWidth;
img.height = img.naturalHeight;
const datasetBytesBuffer = new ArrayBuffer(NUM_DATASET_ELEMENTS * IMAGE_SIZE * 4);
const chunkSize = 5000;
canvas.width = img.width;
canvas.height = chunkSize;
for (let i = 0; i < NUM_DATASET_ELEMENTS / chunkSize; i++) {
const datasetBytesView = new Float32Array(datasetBytesBuffer, i * IMAGE_SIZE * chunkSize * 4, IMAGE_SIZE * chunkSize);
ctx.drawImage(img, 0, i * chunkSize, img.width, chunkSize, 0, 0, img.width, chunkSize);
const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height);
for (let j = 0; j < imageData.data.length / 4; j++) {
datasetBytesView[j] = imageData.data[j * 4] / 255;
}
}
this.datasetImages = new Float32Array(datasetBytesBuffer);
resolve();
};
img.src = MNIST_IMAGES_SPRITE_PATH;
});
const labelsRequest = fetch(MNIST_LABELS_PATH);
const [imgResponse, labelsResponse] = await Promise.all([imgRequest, labelsRequest]);
this.datasetLabels = new Uint8Array(await labelsResponse.arrayBuffer());
this.trainIndices = tf.util.createShuffledIndices(NUM_TRAIN_ELEMENTS);
this.testIndices = tf.util.createShuffledIndices(NUM_TEST_ELEMENTS);
this.trainImages = this.datasetImages.slice(0, IMAGE_SIZE * NUM_TRAIN_ELEMENTS);
this.testImages = this.datasetImages.slice(IMAGE_SIZE * NUM_TRAIN_ELEMENTS);
this.trainLabels = this.datasetLabels.slice(0, NUM_CLASSES * NUM_TRAIN_ELEMENTS);
this.testLabels = this.datasetLabels.slice(NUM_CLASSES * NUM_TRAIN_ELEMENTS);
}
nextTrainBatch(batchSize) {
return this.nextBatch(batchSize, [this.trainImages, this.trainLabels], () => {
this.shuffledTrainIndex = (this.shuffledTrainIndex + 1) % this.trainIndices.length;
return this.trainIndices[this.shuffledTrainIndex];
});
}
nextTestBatch(batchSize) {
return this.nextBatch(batchSize, [this.testImages, this.testLabels], () => {
this.shuffledTestIndex = (this.shuffledTestIndex + 1) % this.testIndices.length;
return this.testIndices[this.shuffledTestIndex];
});
}
nextBatch(batchSize, data, index) {
const batchImagesArray = new Float32Array(batchSize * IMAGE_SIZE);
const batchLabelsArray = new Uint8Array(batchSize * NUM_CLASSES);
for (let i = 0; i < batchSize; i++) {
const idx = index();
const image = data[0].slice(idx * IMAGE_SIZE, idx * IMAGE_SIZE + IMAGE_SIZE);
batchImagesArray.set(image, i * IMAGE_SIZE);
const label = data[1].slice(idx * NUM_CLASSES, idx * NUM_CLASSES + NUM_CLASSES);
batchLabelsArray.set(label, i * NUM_CLASSES);
}
const xs = tf.tensor2d(batchImagesArray, [batchSize, IMAGE_SIZE]);
const labels = tf.tensor2d(batchLabelsArray, [batchSize, NUM_CLASSES]);
return { xs, labels };
}
}

// Simple example using the MNIST dataset
import { MnistData } from './data.js';
let cnnModel = null;
async function run() {
const data = new MnistData();
await data.load();
cnnModel = getModel();
await train(cnnModel, data);
}
function getModel() {
const model = tf.sequential();
const IMAGE_WIDTH = 28;
const IMAGE_HEIGHT = 28;
const IMAGE_CHANNELS = 1;
model.add(tf.layers.conv2d({
inputShape: [IMAGE_WIDTH, IMAGE_HEIGHT, IMAGE_CHANNELS],
kernelSize: 5,
filters: 8,
strides: 1,
activation: 'relu',
kernelInitializer: 'varianceScaling'
}));
model.add(tf.layers.maxPooling2d({ poolSize: [2, 2], strides: [2, 2] }));
model.add(tf.layers.conv2d({ kernelSize: 5, filters: 16, strides: 1, activation: 'relu', kernelInitializer: 'varianceScaling' }));
model.add(tf.layers.maxPooling2d({ poolSize: [2, 2], strides: [2, 2] }));
model.add(tf.layers.flatten());
const NUM_OUTPUT_CLASSES = 10;
model.add(tf.layers.dense({ units: NUM_OUTPUT_CLASSES, kernelInitializer: 'varianceScaling', activation: 'softmax' }));
const optimizer = tf.train.adam();
model.compile({ optimizer: optimizer, loss: 'categoricalCrossentropy', metrics: ['accuracy'] });
return model;
}
async function train(model, data) {
const BATCH_SIZE = 512;
const TRAIN_DATA_SIZE = 5500;
const TEST_DATA_SIZE = 1000;
const [trainXs, trainYs] = tf.tidy(() => {
const d = data.nextTrainBatch(TRAIN_DATA_SIZE);
return [d.xs.reshape([TRAIN_DATA_SIZE, 28, 28, 1]), d.labels];
});
const [testXs, testYs] = tf.tidy(() => {
const d = data.nextTestBatch(TEST_DATA_SIZE);
return [d.xs.reshape([TEST_DATA_SIZE, 28, 28, 1]), d.labels];
});
return model.fit(trainXs, trainYs, { batchSize: BATCH_SIZE, validationData: [testXs, testYs], epochs: 10, shuffle: true });
}Model Deployment and Runtime
// Predict the digit class of the drawing on the canvas
function predict(){
const result = tf.tidy(() => {
// Read the canvas, downscale to the 28x28 MNIST resolution, keep a
// single channel, and normalize pixel values to [0, 1].
const input = tf.image.resizeBilinear(tf.browser.fromPixels(canvas), [28, 28], true)
.slice([0, 0, 0], [28, 28, 1])
.toFloat()
.div(255)
.reshape([1, 28, 28, 1]);
// argMax over the class axis yields the most probable digit;
// tf.tidy disposes the intermediate tensors afterwards.
return cnnModel.predict(input).argMax(1).dataSync()[0];
});
console.log('Prediction:', result);
alert(`Prediction result: ${result}`);
}
document.getElementById('predict-btn').addEventListener('click', predict);
document.getElementById('clear-btn').addEventListener('click', clear);
document.addEventListener('DOMContentLoaded', run);
const canvas = document.querySelector('canvas');
canvas.addEventListener('mousemove', (e) => {
if (e.buttons === 1) {
const ctx = canvas.getContext('2d');
ctx.fillStyle = 'rgb(255,255,255)';
ctx.fillRect(e.offsetX, e.offsetY, 10, 10);
}
});
function clear(){
const ctx = canvas.getContext('2d');
ctx.fillStyle = 'rgb(0,0,0)';
ctx.fillRect(0,0,300,300);
}
clear();
Geometric Analysis Method
We capture the mouse’s spatio‑temporal trajectory, extract position, shape and direction, and represent the path as a series of unit vectors. By normalizing size and orientation, we achieve translation‑ and scale‑invariance.
Each gesture is resampled into a fixed number of direction vectors (128 in the full design; the demo below keeps 13 sample points, i.e., 12 vectors). Similarity between two gestures is computed by summing the cosine similarities of corresponding vector pairs, so with 128 vectors the raw score lies in [-128, 128]; the demo code normalizes this to [0, 1] by averaging and taking the absolute value. A threshold on the score determines a match.
For 2‑D vectors we use the standard cosine similarity formula:
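$$\cos\theta = \frac{\vec{v}_1 \cdot \vec{v}_2}{\lVert \vec{v}_1 \rVert \, \lVert \vec{v}_2 \rVert} = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2}\,\sqrt{x_2^2 + y_2^2}}$$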
Implementation snippets (React + TypeScript) illustrate point collection, sparsification, vector conversion, cosine similarity calculation, and custom event dispatch.
import { useEffect, useState, useRef, useMemo } from 'react';
import throttle from "lodash/throttle";
type Position = {x:number, y:number};
type Vector = [number, number];
// Example predefined gesture vectors
const shapeVectors_v: Vector[] = [[5,16],[13,29],[4,9],[6,9],[8,8],[1,0],[1,0],[1,-2],[0,-3],[7,-11],[21,-34],[10,-19]];
// ... other shape templates omitted for brevity
// Gesture label → template vectors map used by handlePredict below.
const shapeVectors: Record<string, Vector[]> = { v: shapeVectors_v /* , ... */ };
function Gesture(){
const pointsRef = useRef<Position[]>([]);
const sparsedPointsRef = useRef<Position[]>([]);
const vectorsRef = useRef<Vector[]>([]);
const canvasContextRef = useRef<CanvasRenderingContext2D>();
const containerRef = useRef<HTMLDivElement>(null);
const [predictResults, setPredictResults] = useState<{label:string, similarity:number}[]>([]);
const handleMouseMoveThrottled = useMemo(() => throttle(handleMouseMove, 16), []);
useEffect(() => {
const canvasEle = document.getElementById('canvas-ele') as HTMLCanvasElement;
const ctx = canvasEle.getContext('2d')!;
canvasContextRef.current = ctx;
clear();
}, []);
function handleMouseDown(){
containerRef?.current?.addEventListener('mousemove', handleMouseMoveThrottled);
}
function handleMouseUp(){
containerRef?.current?.removeEventListener('mousemove', handleMouseMoveThrottled);
pointsRef.current = [];
}
function drawPoint(x:number, y:number){
canvasContextRef.current!.fillStyle = 'red';
canvasContextRef.current?.fillRect(x, y, 10, 10);
}
function handleMouseMove(e: MouseEvent){
const x = e.offsetX, y = e.offsetY;
drawPoint(x, y);
const newPoints = [...pointsRef.current, {x, y}];
pointsRef.current = newPoints;
const sparsed = sparsePoints(newPoints);
sparsedPointsRef.current = sparsed;
vectorsRef.current = points2Vectors(sparsed);
}
function sparsePoints(points: Position[]){
// Downsample to 13 points, i.e., 12 direction vectors, matching the template length.
const target = 13;
if (points.length <= target) return points;
const step = points.length / target;
const result: Position[] = [];
for (let i = 0; i < target; i++) {
const idx = Math.round(step * i);
result.push(points[idx]);
}
return result;
}
function points2Vectors(points: Position[]): Vector[] {
if (points.length <= 1) return [];
// Each vector is the displacement from one sampled point to the next.
return points.slice(1).map((p, i): Vector => [p.x - points[i].x, p.y - points[i].y]);
}
function cosineSimilarity(v1: Vector[], v2: Vector[]){
if (v1.length !== v2.length) return 0;
// Average the per-pair cosine similarities and take the absolute value,
// normalizing the score to [0, 1].
const sum = v1.reduce((acc, vec, i) => acc + vectorsCos(vec, v2[i]), 0);
return Math.abs(sum / v1.length);
}
function vectorsCos(v1:Vector, v2:Vector){
const len1 = vectorLength(v1);
const len2 = vectorLength(v2);
// A zero-length vector (a repeated sample point) is treated as a match.
if (len1 * len2 === 0) return 1;
return vectorsDotProduct(v1, v2) / (len1 * len2);
}
function vectorsDotProduct(v1:Vector, v2:Vector){ return v1[0]*v2[0] + v1[1]*v2[1]; }
function vectorLength(v:Vector){ return Math.sqrt(v[0]*v[0] + v[1]*v[1]); }
function handlePredict(){
const results = Object.keys(shapeVectors).map(key => ({
label: key,
similarity: cosineSimilarity(shapeVectors[key], vectorsRef.current)
}));
setPredictResults(results);
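// Goal 2 asks for exposing matches as a custom event consumable via
// addEventListener. A minimal sketch; the event name 'gesture' and the
// 0.9 threshold are illustrative, not from the original:
const best = results.reduce((a, b) => (a.similarity > b.similarity ? a : b));
if (best.similarity > 0.9) {
containerRef.current?.dispatchEvent(new CustomEvent('gesture', { detail: best }));
}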
}
function clear(){
canvasContextRef.current!.fillStyle = 'rgb(0,0,0)';
canvasContextRef.current!.fillRect(0,0,500,500);
setPredictResults([]);
}
return (
<div ref={containerRef} onMouseDown={handleMouseDown} onMouseUp={handleMouseUp} style={{width:'500px',height:'500px',background:'grey'}}>
<canvas id='canvas-ele' width='500' height='500'></canvas>
<section>
<button onClick={handlePredict}>Predict</button>
<button onClick={clear}>Clear</button>
</section>
<ul>
{predictResults.map(e => (
<li key={e.label}>Similarity with {e.label}: {e.similarity}</li>
))}
</ul>
</div>
);
}
export default Gesture;
Performance Optimizations
Vector calculations are CPU‑intensive; moving them into a Web Worker keeps the main thread responsive, and batching canvas drawing through requestAnimationFrame avoids redundant paints during fast mouse movement, as sketched below.
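A minimal sketch of that split, assuming a bundler with module‑worker support; the file name gesture-worker.ts and the message shape are assumptions, not from the original:
// gesture-worker.ts: similarity scoring off the main thread.
// cosineSimilarity is the same function shown in the component above,
// copied or imported into the worker bundle.
self.onmessage = (e: MessageEvent<{ user: Vector[]; templates: Record<string, Vector[]> }>) => {
const { user, templates } = e.data;
const results = Object.entries(templates).map(([label, tpl]) => ({
label,
similarity: cosineSimilarity(tpl, user),
}));
(self as unknown as Worker).postMessage(results);
};

// In the component: offload prediction instead of computing inline.
const worker = new Worker(new URL('./gesture-worker.ts', import.meta.url), { type: 'module' });
worker.onmessage = (e) => setPredictResults(e.data);
worker.postMessage({ user: vectorsRef.current, templates: shapeVectors });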
Evaluation
The cosine‑similarity approach is lightweight and works well for a few predefined gestures, but it struggles with complex, multi‑stroke gestures lacking strict stroke order.
Extending to 3‑D Space
To handle VR/MR gestures, each 3‑D vector is projected onto the X‑Y and Y‑Z planes; the cosine similarity of each projection is computed and multiplied to obtain a 3‑D similarity score.
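A sketch of that projection idea, reusing the 2‑D cosineSimilarity from the component above (Vector3 and similarity3D are illustrative names):
type Vector3 = [number, number, number];
// Project each 3-D displacement vector onto the X-Y and Y-Z planes,
// score each projection with the 2-D routine, and multiply the two
// scores into a single 3-D similarity.
function similarity3D(v1: Vector3[], v2: Vector3[]): number {
const xy = (v: Vector3): Vector => [v[0], v[1]];
const yz = (v: Vector3): Vector => [v[1], v[2]];
return cosineSimilarity(v1.map(xy), v2.map(xy)) * cosineSimilarity(v1.map(yz), v2.map(yz));
}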
Comprehensive Solution
We combine both approaches: predefined gestures are trained offline with a CNN model and bundled into the product; custom gestures use the geometric analysis method, requiring three user recordings to establish a reliable template and threshold.
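One way to realize the three‑recording step is sketched below; the vector averaging and the 0.9 safety margin are assumptions for illustration, not prescribed by the article:
// Build a custom-gesture template from recordings that have already been
// resampled to the same vector count.
function buildTemplate(recordings: Vector[][]): { template: Vector[]; threshold: number } {
const n = recordings[0].length;
// Average corresponding vectors across the recordings.
const template: Vector[] = Array.from({ length: n }, (_, i): Vector => [
recordings.reduce((s, r) => s + r[i][0], 0) / recordings.length,
recordings.reduce((s, r) => s + r[i][1], 0) / recordings.length,
]);
// Use the worst recording-vs-template similarity, with a margin, as the
// acceptance threshold.
const threshold = Math.min(...recordings.map(r => cosineSimilarity(template, r))) * 0.9;
return { template, threshold };
}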