Implementing the SIFT Image Matching Algorithm in Java – A Complete Walkthrough
This article explains the four main stages of the SIFT algorithm—scale‑space construction, DoG extrema detection, keypoint orientation assignment, and descriptor generation—while providing a full Java implementation with detailed code snippets and explanations of each processing step.
SIFT (Scale‑Invariant Feature Transform) is a classic algorithm for detecting and matching local image features. The sections below walk through a Java implementation of each stage in turn.
1. Constructing the Scale Space
The scale space consists of Gaussian‑blurred images and their Difference‑of‑Gaussians (DoG). Gaussian pyramids are built by repeatedly blurring and down‑sampling the input gray‑pixel matrix.
public void scaleSpace(GrayPixelMatrix gpm) throws IOException {
    gaussPyramid = new int[nOctaves][gaussLayers][][];
    // Double the image size and apply the initial blur; blur() returns the
    // blurred matrix, so getPixels() extracts the raw int[][]
    gpm = gpm.zoom(2).blur(gauss1DTemplate(Math.sqrt(sigma * sigma - 4 * preSigma * preSigma)));
    gaussPyramid[0][0] = gpm.getPixels();
    for (int i = 1; i < gaussLayers; i++) {
        // Incremental sigma between layer i-1 and layer i of this octave
        double deltaSigma = sigma * Math.sqrt(Math.pow(2.0, i * 2.0 / nOctaveLayers)
                - Math.pow(2.0, (i * 2.0 - 2.0) / nOctaveLayers));
        gaussPyramid[0][i] = gpm.blur(gauss1DTemplate(deltaSigma)).getPixels();
    }
    for (int i = 1; i < nOctaves; i++) {
        // Down-sample the last layer of the previous octave
        gpm = gpm.sample(2);
        gaussPyramid[i][0] = gpm.getPixels();
        for (int j = 1; j < gaussLayers; j++) {
            // The incremental sigma depends on the layer index j, not the octave index
            double deltaSigma = sigma * Math.sqrt(Math.pow(2.0, j * 2.0 / nOctaveLayers)
                    - Math.pow(2.0, (j * 2.0 - 2.0) / nOctaveLayers));
            gaussPyramid[i][j] = gpm.blur(gauss1DTemplate(deltaSigma)).getPixels();
        }
    }
    // Build the DoG pyramid by subtracting adjacent Gaussian layers
    dogPyramid = new int[nOctaves][dogLayers][][];
    for (int i = 0; i < nOctaves; i++) {
        for (int j = 0; j < dogLayers; j++) {
            int[][] g0 = gaussPyramid[i][j];
            int[][] g1 = gaussPyramid[i][j + 1];
            int[][] diff = new int[g0.length][g0[0].length];
            for (int k = 0; k < diff.length; k++) {
                for (int l = 0; l < diff[k].length; l++) {
                    diff[k][l] = g1[k][l] - g0[k][l];
                }
            }
            dogPyramid[i][j] = diff;
        }
    }
}
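The gauss1DTemplate helper used above is not shown in the original listing. A minimal sketch, assuming the blur routine applies the kernel separably along rows and columns; the ±3σ cutoff and normalization are conventional choices, not taken from the original code:

// Build a normalized 1D Gaussian kernel covering roughly ±3 sigma.
private double[] gauss1DTemplate(double sigma) {
    int radius = (int) Math.ceil(3 * sigma);
    double[] kernel = new double[2 * radius + 1];
    double sum = 0;
    for (int i = -radius; i <= radius; i++) {
        kernel[i + radius] = Math.exp(-(i * i) / (2 * sigma * sigma));
        sum += kernel[i + radius];
    }
    for (int i = 0; i < kernel.length; i++) {
        kernel[i] /= sum; // normalize so the weights sum to 1
    }
    return kernel;
}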
2. Computing Gradient Magnitude and Orientation

Before detecting extrema, the gradient magnitude and orientation of each pixel in the Gaussian pyramid are cached to avoid repeated calculations during orientation assignment and descriptor generation.
for (int i = 0; i < nOctaves; i++) {
    for (int j = 0; j < gaussLayers; j++) {
        int[][] curGauss = gaussPyramid[i][j];
        double[][] curGradient = new double[curGauss.length][curGauss[0].length];
        double[][] curTheta = new double[curGauss.length][curGauss[0].length];
        for (int y = 1; y < curGauss.length - 1; y++) {
            for (int x = 1; x < curGauss[0].length - 1; x++) {
                // Central differences along x and y
                double dx = curGauss[y][x + 1] - curGauss[y][x - 1];
                double dy = curGauss[y + 1][x] - curGauss[y - 1][x];
                curGradient[y][x] = Math.sqrt(dx * dx + dy * dy);
                curTheta[y][x] = calcTheta(dy, dx);
            }
        }
        gradientMagnitudeArr[i][j] = curGradient;
        thetaArr[i][j] = curTheta;
    }
}
private double calcTheta(double dy, double dx) {
    // Map atan2's (-pi, pi] range onto [0, 2*pi)
    double theta = Math.atan2(dy, dx);
    return theta < 0 ? theta + 2 * Math.PI : theta;
}

3. Detecting Extrema in the DoG Space
Extrema are points that are either larger or smaller than all 26 neighbours (8 in the current layer, 9 in the layer above, and 9 in the layer below) and satisfy contrast and edge‑response thresholds.
public void extremDetect() {
    for (int i = 0; i < nOctaves; i++) {
        for (int j = 1; j <= nOctaveLayers; j++) {
            int[][] curDog = dogPyramid[i][j];
            int[][] nextDog = dogPyramid[i][j + 1];
            int[][] preDog = dogPyramid[i][j - 1];
            // Standard contrast pre-filter: reject weak responses early
            double threshold = 256 * 0.5 * contrastThreshold / nOctaveLayers;
            for (int y = SIFT_IMG_BORDER; y < curDog.length - SIFT_IMG_BORDER; y++) {
                for (int x = SIFT_IMG_BORDER; x < curDog[y].length - SIFT_IMG_BORDER; x++) {
                    int val = curDog[y][x];
                    if (Math.abs(val) > threshold && isExtremum(val, curDog, nextDog, preDog, y, x)) {
                        // Refine the location with a Taylor expansion (see step 4)
                        if (refineKeypoint(i, j, y, x)) {
                            // keypoint accepted
                        }
                    }
                }
            }
        }
    }
}
private boolean isExtremum(int val, int[][] cur, int[][] next, int[][] pre, int y, int x) {
    // A point is an extremum if it is >= (or <=) all 26 neighbours in the
    // 3x3x3 cube spanned by the previous, current, and next DoG layers
    for (int dy = -1; dy <= 1; dy++) {
        for (int dx = -1; dx <= 1; dx++) {
            if (val > 0) {
                if (val < cur[y + dy][x + dx] || val < pre[y + dy][x + dx] || val < next[y + dy][x + dx]) {
                    return false;
                }
            } else {
                if (val > cur[y + dy][x + dx] || val > pre[y + dy][x + dx] || val > next[y + dy][x + dx]) {
                    return false;
                }
            }
        }
    }
    return true;
}

4. Refining Keypoint Position and Eliminating Edge Responses
The location is refined using a second‑order Taylor expansion of the DoG function. Points with low contrast, or with a strong edge response (detected via the ratio of the squared trace to the determinant of the 2×2 spatial Hessian), are discarded.
// Inside refineKeypoint(...)
// 3x3 Hessian of the DoG in (x, y, sigma), used to solve for the sub-pixel offset
double[][] H = {{dxx, dxy, dxs}, {dxy, dyy, dys}, {dxs, dys, dss}};
double det = dxx * dyy * dss + 2 * dxy * dys * dxs
        - dxx * dys * dys - dyy * dxs * dxs - dss * dxy * dxy;
if (Math.abs(det) < 1e-5) return false; // nearly singular: the offset would be unstable

// Edge-response test on the 2x2 spatial Hessian: reject points where the
// squared-trace / determinant ratio exceeds the edge threshold
double tr = dxx + dyy;
double det2 = dxx * dyy - dxy * dxy;
if (det2 <= 0 || tr * tr * edgeThreshold >= (edgeThreshold + 1) * (edgeThreshold + 1) * det2) {
    return false;
}
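Once these derivatives are in place, the sub-pixel offset is obtained by solving H · x̂ = −∇D. A minimal sketch using Cramer's rule on the 3×3 system; the parameter names for the first derivatives (dDdx, dDdy, dDds) are illustrative, not from the original code:

// Solve H * offset = -grad for the sub-pixel offset (Cramer's rule).
private double[] solveOffset(double[][] H, double dDdx, double dDdy, double dDds) {
    double det = H[0][0] * (H[1][1] * H[2][2] - H[1][2] * H[2][1])
               - H[0][1] * (H[1][0] * H[2][2] - H[1][2] * H[2][0])
               + H[0][2] * (H[1][0] * H[2][1] - H[1][1] * H[2][0]);
    double[] b = {-dDdx, -dDdy, -dDds};
    double[] offset = new double[3];
    for (int col = 0; col < 3; col++) {
        // Replace column `col` of H with b and take the determinant
        double[][] M = new double[3][3];
        for (int r = 0; r < 3; r++) {
            for (int c = 0; c < 3; c++) {
                M[r][c] = (c == col) ? b[r] : H[r][c];
            }
        }
        offset[col] = (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
                     - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
                     + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0])) / det;
    }
    return offset; // if any |offset[k]| > 0.5, re-center on that neighbour and iterate
}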
5. Assigning Orientation to Each Keypoint

For each retained keypoint, a weighted histogram of gradient orientations in a 16×16 window around the keypoint is built from the cached magnitude and orientation arrays. The peak of the histogram defines the main orientation; secondary peaks above 80 % of the maximum are kept as additional orientations, as sketched below.
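A minimal sketch of the histogram construction and peak selection, assuming 36 bins of 10° each and the gradientMagnitudeArr/thetaArr caches from step 2 (the Gaussian weighting of the samples is omitted for brevity; java.util.List and ArrayList are assumed imported):

// Collect a 36-bin orientation histogram around (y, x) and return every
// orientation whose bin reaches at least 80% of the strongest peak.
private List<Double> dominantOrientations(double[][] magnitude, double[][] theta,
                                          int y, int x, int radius) {
    double[] hist = new double[36];
    for (int dy = -radius; dy <= radius; dy++) {
        for (int dx = -radius; dx <= radius; dx++) {
            int yy = y + dy, xx = x + dx;
            if (yy < 0 || yy >= magnitude.length || xx < 0 || xx >= magnitude[0].length) {
                continue;
            }
            int bin = (int) (theta[yy][xx] / (2 * Math.PI) * 36) % 36;
            hist[bin] += magnitude[yy][xx]; // weight each sample by its magnitude
        }
    }
    double max = 0;
    for (double h : hist) max = Math.max(max, h);
    List<Double> orientations = new ArrayList<>();
    for (int b = 0; b < 36; b++) {
        if (hist[b] >= 0.8 * max) {
            orientations.add((b + 0.5) * 2 * Math.PI / 36); // bin centre in radians
        }
    }
    return orientations;
}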
6. Building the Descriptor
The region around the oriented keypoint is divided into a 4×4 grid of 8×8 pixel blocks. For each block, an 8-bin orientation histogram is computed, yielding a 4 × 4 × 8 = 128‑dimensional descriptor that is finally normalized to reduce illumination effects.
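The normalization step can be sketched as follows; the 0.2 clamp on individual components comes from Lowe's original paper and is an assumption about this implementation:

// Normalize a 128-dim descriptor to unit length, clamp large components
// at 0.2 to damp strong gradients, then renormalize.
private void normalizeDescriptor(double[] desc) {
    double norm = 0;
    for (double d : desc) norm += d * d;
    norm = Math.sqrt(norm);
    if (norm == 0) return;
    for (int i = 0; i < desc.length; i++) {
        desc[i] = Math.min(desc[i] / norm, 0.2);
    }
    norm = 0;
    for (double d : desc) norm += d * d;
    norm = Math.sqrt(norm);
    for (int i = 0; i < desc.length; i++) {
        desc[i] /= norm;
    }
}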
7. Matching Features Between Two Images
Descriptors are compared using Euclidean distance. A match is accepted when the distance ratio between the best and the second‑best candidate is below 0.8. RANSAC can be applied afterwards to filter out outliers and estimate the geometric transformation.
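A minimal sketch of this ratio test over brute-force Euclidean distances; storing descriptors as double[][] (one row per keypoint) is an assumption:

// For each descriptor in a, find its two nearest neighbours in b and
// accept the match only if bestDistance < 0.8 * secondBestDistance.
private List<int[]> matchDescriptors(double[][] a, double[][] b) {
    List<int[]> matches = new ArrayList<>();
    for (int i = 0; i < a.length; i++) {
        double best = Double.MAX_VALUE, second = Double.MAX_VALUE;
        int bestIdx = -1;
        for (int j = 0; j < b.length; j++) {
            double d = 0;
            for (int k = 0; k < a[i].length; k++) {
                double diff = a[i][k] - b[j][k];
                d += diff * diff;
            }
            d = Math.sqrt(d);
            if (d < best) {
                second = best;
                best = d;
                bestIdx = j;
            } else if (d < second) {
                second = d;
            }
        }
        if (bestIdx >= 0 && best < 0.8 * second) {
            matches.add(new int[]{i, bestIdx}); // (index in a, index in b)
        }
    }
    return matches;
}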
In the resulting visualization, green dots mark the detected keypoints and the lines connect the matched features between the two images.