Implementing the SIFT Image Matching Algorithm in Java – A Complete Walkthrough
This article explains the four main stages of the SIFT algorithm—scale‑space construction, DoG extrema detection, keypoint orientation assignment, and descriptor generation—while providing a full Java implementation with detailed code snippets and explanations of each processing step.
SIFT (Scale‑Invariant Feature Transform) is a classic algorithm for detecting and matching local image features. The sections below walk through a Java implementation of each stage in turn.
1. Constructing the Scale Space
The scale space consists of Gaussian‑blurred images and their Difference‑of‑Gaussians (DoG). Gaussian pyramids are built by repeatedly blurring and down‑sampling the input gray‑pixel matrix.
public void scaleSpace(GrayPixelMatrix gpm) throws IOException {
    gaussPyramid = new int[nOctaves][gaussLayers][][];
    // Double the image size and apply the initial blur; blur() returns the
    // blurred matrix, so getPixels() extracts the raw int[][]
    gpm = gpm.zoom(2).blur(gauss1DTemplate(Math.sqrt(sigma * sigma - 4 * preSigma * preSigma)));
    gaussPyramid[0][0] = gpm.getPixels();
    for (int i = 1; i < gaussLayers; i++) {
        // Incremental sigma between layer i-1 and layer i of this octave
        double deltaSigma = sigma * Math.sqrt(Math.pow(2.0, i * 2.0 / nOctaveLayers)
                - Math.pow(2.0, (i * 2.0 - 2.0) / nOctaveLayers));
        gaussPyramid[0][i] = gpm.blur(gauss1DTemplate(deltaSigma)).getPixels();
    }
    for (int i = 1; i < nOctaves; i++) {
        // Down-sample the last layer of the previous octave
        gpm = gpm.sample(2);
        gaussPyramid[i][0] = gpm.getPixels();
        for (int j = 1; j < gaussLayers; j++) {
            // The incremental sigma depends on the layer index j, not the octave index
            double deltaSigma = sigma * Math.sqrt(Math.pow(2.0, j * 2.0 / nOctaveLayers)
                    - Math.pow(2.0, (j * 2.0 - 2.0) / nOctaveLayers));
            gaussPyramid[i][j] = gpm.blur(gauss1DTemplate(deltaSigma)).getPixels();
        }
    }
    // Build the DoG pyramid by subtracting adjacent Gaussian layers
    dogPyramid = new int[nOctaves][dogLayers][][];
    for (int i = 0; i < nOctaves; i++) {
        for (int j = 0; j < dogLayers; j++) {
            int[][] g0 = gaussPyramid[i][j];
            int[][] g1 = gaussPyramid[i][j + 1];
            int[][] diff = new int[g0.length][g0[0].length];
            for (int k = 0; k < diff.length; k++) {
                for (int l = 0; l < diff[k].length; l++) {
                    diff[k][l] = g1[k][l] - g0[k][l];
                }
            }
            dogPyramid[i][j] = diff;
        }
    }
}
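The gauss1DTemplate helper used above is not shown in the original listing. A minimal sketch, assuming the blur routine applies the kernel separably along rows and columns; the ±3σ cutoff and normalization are conventional choices, not taken from the original code:

// Build a normalized 1D Gaussian kernel covering roughly ±3 sigma.
private double[] gauss1DTemplate(double sigma) {
    int radius = (int) Math.ceil(3 * sigma);
    double[] kernel = new double[2 * radius + 1];
    double sum = 0;
    for (int i = -radius; i <= radius; i++) {
        kernel[i + radius] = Math.exp(-(i * i) / (2 * sigma * sigma));
        sum += kernel[i + radius];
    }
    for (int i = 0; i < kernel.length; i++) {
        kernel[i] /= sum; // normalize so the weights sum to 1
    }
    return kernel;
}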
2. Computing Gradient Magnitude and Orientation

Before detecting extrema, the gradient magnitude and orientation of each pixel in the Gaussian pyramid are cached to avoid repeated calculations during orientation assignment and descriptor generation.
for (int i = 0; i < nOctaves; i++) {
    for (int j = 0; j < gaussLayers; j++) {
        int[][] curGauss = gaussPyramid[i][j];
        double[][] curGradient = new double[curGauss.length][curGauss[0].length];
        double[][] curTheta = new double[curGauss.length][curGauss[0].length];
        for (int y = 1; y < curGauss.length - 1; y++) {
            for (int x = 1; x < curGauss[0].length - 1; x++) {
                // Central differences along x and y
                double dx = curGauss[y][x + 1] - curGauss[y][x - 1];
                double dy = curGauss[y + 1][x] - curGauss[y - 1][x];
                curGradient[y][x] = Math.sqrt(dx * dx + dy * dy);
                curTheta[y][x] = calcTheta(dy, dx);
            }
        }
        gradientMagnitudeArr[i][j] = curGradient;
        thetaArr[i][j] = curTheta;
    }
}
private double calcTheta(double dy, double dx) {
    // Map atan2's (-pi, pi] range onto [0, 2*pi)
    double theta = Math.atan2(dy, dx);
    return theta < 0 ? theta + 2 * Math.PI : theta;
}

3. Detecting Extrema in the DoG Space
Extrema are points that are either larger or smaller than all 26 neighbours (8 in the current layer, 9 in the layer above, and 9 in the layer below) and satisfy contrast and edge‑response thresholds.
public void extremDetect() {
    for (int i = 0; i < nOctaves; i++) {
        for (int j = 1; j <= nOctaveLayers; j++) {
            int[][] curDog = dogPyramid[i][j];
            int[][] nextDog = dogPyramid[i][j + 1];
            int[][] preDog = dogPyramid[i][j - 1];
            // Standard contrast pre-filter: reject weak responses early
            double threshold = 256 * 0.5 * contrastThreshold / nOctaveLayers;
            for (int y = SIFT_IMG_BORDER; y < curDog.length - SIFT_IMG_BORDER; y++) {
                for (int x = SIFT_IMG_BORDER; x < curDog[y].length - SIFT_IMG_BORDER; x++) {
                    int val = curDog[y][x];
                    if (Math.abs(val) > threshold && isExtremum(val, curDog, nextDog, preDog, y, x)) {
                        // Refine the location with a Taylor expansion (see step 4)
                        if (refineKeypoint(i, j, y, x)) {
                            // keypoint accepted
                        }
                    }
                }
            }
        }
    }
}
private boolean isExtremum(int val, int[][] cur, int[][] next, int[][] pre, int y, int x) {
    // A point is an extremum if it is >= (or <=) all 26 neighbours in the
    // 3x3x3 cube spanned by the previous, current, and next DoG layers
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            if (val > 0) {
                if (val < cur[y + dy][x + dx] || val < pre[y + dy][x + dx] || val < next[y + dy][x + dx]) {
                    return false;
                }
            } else {
                if (val > cur[y + dy][x + dx] || val > pre[y + dy][x + dx] || val > next[y + dy][x + dx]) {
                    return false;
                }
            }
        }
    }
    return true;
}

4. Refining Keypoint Position and Eliminating Edge Responses
The location is refined using a second‑order Taylor expansion of the DoG function. Points with low contrast, or with a strong edge response (detected via the ratio of the squared trace to the determinant of the 2×2 spatial Hessian), are discarded.
// Inside refineKeypoint(...)
// 3x3 Hessian of the DoG in (x, y, sigma), used to solve for the sub-pixel offset
double[][] H = {{dxx, dxy, dxs}, {dxy, dyy, dys}, {dxs, dys, dss}};
double det = dxx * dyy * dss + 2 * dxy * dys * dxs
        - dxx * dys * dys - dyy * dxs * dxs - dss * dxy * dxy;
if (Math.abs(det) < 1e-5) return false; // nearly singular: the offset would be unstable

// Edge-response test on the 2x2 spatial Hessian: reject points where the
// squared-trace / determinant ratio exceeds the edge threshold
double tr = dxx + dyy;
double det2 = dxx * dyy - dxy * dxy;
if (det2 <= 0 || tr * tr * edgeThreshold >= (edgeThreshold + 1) * (edgeThreshold + 1) * det2) {
    return false;
}
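Once these derivatives are in place, the sub-pixel offset is obtained by solving H · x̂ = −∇D. A minimal sketch using Cramer's rule on the 3×3 system; the parameter names for the first derivatives (dDdx, dDdy, dDds) are illustrative, not from the original code:

// Solve H * offset = -grad for the sub-pixel offset (Cramer's rule).
private double[] solveOffset(double[][] H, double dDdx, double dDdy, double dDds) {
    double det = H[0][0] * (H[1][1] * H[2][2] - H[1][2] * H[2][1])
               - H[0][1] * (H[1][0] * H[2][2] - H[1][2] * H[2][0])
               + H[0][2] * (H[1][0] * H[2][1] - H[1][1] * H[2][0]);
    double[] b = {-dDdx, -dDdy, -dDds};
    double[] offset = new double[3];
    for (int col = 0; col < 3; col++) {
        // Replace column `col` of H with b and take the determinant
        double[][] M = new double[3][3];
        for (int r = 0; r < 3; r++) {
            for (int c = 0; c < 3; c++) {
                M[r][c] = (c == col) ? b[r] : H[r][c];
            }
        }
        offset[col] = (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
                     - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
                     + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0])) / det;
    }
    return offset; // if any |offset[k]| > 0.5, re-center on that neighbour and iterate
}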
5. Assigning Orientation to Each Keypoint

For each retained keypoint, a weighted histogram of gradient orientations in a 16×16 window around the keypoint is built from the cached magnitude and orientation arrays. The peak of the histogram defines the main orientation; secondary peaks above 80 % of the maximum are kept as additional orientations, as sketched below.
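A minimal sketch of the histogram construction and peak selection, assuming 36 bins of 10° each and the gradientMagnitudeArr/thetaArr caches from step 2 (the Gaussian weighting of the samples is omitted for brevity; java.util.List and ArrayList are assumed imported):

// Collect a 36-bin orientation histogram around (y, x) and return every
// orientation whose bin reaches at least 80% of the strongest peak.
private List<Double> dominantOrientations(double[][] magnitude, double[][] theta,
                                          int y, int x, int radius) {
    double[] hist = new double[36];
    for (int dy = -radius; dy <= radius; dy++) {
        for (int dx = -radius; dx <= radius; dx++) {
            int yy = y + dy, xx = x + dx;
            if (yy < 0 || yy >= magnitude.length || xx < 0 || xx >= magnitude[0].length) {
                continue;
            }
            int bin = (int) (theta[yy][xx] / (2 * Math.PI) * 36) % 36;
            hist[bin] += magnitude[yy][xx]; // weight each sample by its magnitude
        }
    }
    double max = 0;
    for (double h : hist) max = Math.max(max, h);
    List<Double> orientations = new ArrayList<>();
    for (int b = 0; b < 36; b++) {
        if (hist[b] >= 0.8 * max) {
            orientations.add((b + 0.5) * 2 * Math.PI / 36); // bin centre in radians
        }
    }
    return orientations;
}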
6. Building the Descriptor
The region around the oriented keypoint is divided into a 4×4 grid of 8×8 pixel blocks. For each block, an 8-bin orientation histogram is computed, yielding a 4 × 4 × 8 = 128‑dimensional descriptor that is finally normalized to reduce illumination effects.
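The normalization step can be sketched as follows; the 0.2 clamp on individual components comes from Lowe's original paper and is an assumption about this implementation:

// Normalize a 128-dim descriptor to unit length, clamp large components
// at 0.2 to damp strong gradients, then renormalize.
private void normalizeDescriptor(double[] desc) {
    double norm = 0;
    for (double d : desc) norm += d * d;
    norm = Math.sqrt(norm);
    if (norm == 0) return;
    for (int i = 0; i < desc.length; i++) {
        desc[i] = Math.min(desc[i] / norm, 0.2);
    }
    norm = 0;
    for (double d : desc) norm += d * d;
    norm = Math.sqrt(norm);
    for (int i = 0; i < desc.length; i++) {
        desc[i] /= norm;
    }
}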
7. Matching Features Between Two Images
Descriptors are compared using Euclidean distance. A match is accepted when the distance ratio between the best and the second‑best candidate is below 0.8. RANSAC can be applied afterwards to filter out outliers and estimate the geometric transformation.
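A minimal sketch of this ratio test over brute-force Euclidean distances; storing descriptors as double[][] (one row per keypoint) is an assumption:

// For each descriptor in a, find its two nearest neighbours in b and
// accept the match only if bestDistance < 0.8 * secondBestDistance.
private List<int[]> matchDescriptors(double[][] a, double[][] b) {
    List<int[]> matches = new ArrayList<>();
    for (int i = 0; i < a.length; i++) {
        double best = Double.MAX_VALUE, second = Double.MAX_VALUE;
        int bestIdx = -1;
        for (int j = 0; j < b.length; j++) {
            double d = 0;
            for (int k = 0; k < a[i].length; k++) {
                double diff = a[i][k] - b[j][k];
                d += diff * diff;
            }
            d = Math.sqrt(d);
            if (d < best) {
                second = best;
                best = d;
                bestIdx = j;
            } else if (d < second) {
                second = d;
            }
        }
        if (bestIdx >= 0 && best < 0.8 * second) {
            matches.add(new int[]{i, bestIdx}); // (index in a, index in b)
        }
    }
    return matches;
}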
In the resulting visualization, green dots mark the detected keypoints and the lines connect the matched features between the two images.