Optimizing RANSAC with SSE

When developing computer vision systems that needs to find plane projections from point-to-point correspondences, a RANdom SAmple Consensus (RANSAC) implementation is necessary. The algorithm rejects wrong correspondences (outliers) and find the geometric transformation, usually a homography, explaining inliers. The algorithm randomly picks 4 correspondences, find the corresponding homography, and count how many correspondences it explains. After a fixed number of iteration, the homography with the best support is chosen.

OpenCV offers an implementation in findHomography. It is unfortunately rather slow. A great optimization approach is to leverage SIMD instructions such as SSE or NEON. In theory, SIMD instructions could allow the processor to test 4 homographies in a single pass. However, conditional jumps are forbidden since the execution flow has to be common for the 4 tested homographies.

Computing homographies without conditional jumps

We need a jumpless way of computing a homography from 4 correspondences. Since I’m lazy, I asked maple to generate the function for me. Maple had problem inverting analytically the 8 by 8 matrix to solve the problem. It managed to compute analytically a homography sending the unit square to arbitrary points, though.

Here’s the maple code:

with(linalg);
eq := (x,y) -> <<x[1]| x[2] | 1 | 0| 0| 0| -y[1]*x[1]/1 | -y[1]*x[2]/1>, <0|0|0|x[1]|x[2]|1|- (y[2]/1)*x[1]|- (y[2]/1)*x[2]>>;
mat := < eq([0, 0],x) , eq([0,1],y), eq([1,1],z), eq([1,0],w)>;
inv := inverse(mat);
homo := simplify(evalm(inv &* <x[1],x[2],y[1],y[2],z[1],z[2],w[1],w[2]>));
homo_func := unapply([homo[1],homo[2],homo[3],homo[4],homo[5],homo[6],homo[7],homo[8]],x,y,z,w);
CodeGeneration[C](homo_func, optimize);

This little code produces a rather large function that does not use SIMD instructions at all: it uses double. To compute an arbitrary homography, I just call this function twice, invert one result, and multiply both matrix together. These function do not need conditional jumps.

Thank to the power of C++ (operator overloading in particular), replacing “double” with a SSE type simply amounts to defining a class.

Replacing double with fvec4

Compilers give access to SSE instructions through intrinsics that do not look very friendly. The following example is rather hard to read:

// declares a variable containing "1, 2, 3, 4"
__m128 a = _mm_set_ps(4, 3, 2, 1);
// Adds "7" to all entries of a
a = _mm_add_ps(a, _mm_set1_ps(7));

This version is much easier to read:

fvec4 a(1, 2, 3, 4);
fvec4 a += fvec4(7);

The fvec4 class can be implemented by declaring proper constructors and operators. The only data member is a single __m128 field. It might look like:

struct fvec4 {
    __m128 data;

    fvec4(float a) {data = _mm_set_ps1(a);}
    fvec4(float a, float b, float c, float d) { data = _mm_set_ps(a,b,c,d); }

    fvec4 operator += (const fvec4 &a) {
      this->data = _mm_add_ps(data, a.data);
      return *this;
    }
};

Once we have such a class (see my fvec4 class, used in polyora), we can simply replace “double” by “fvec4″ in the function generated by maple. The result is a function that computes 4 homographies in a single jumpless execution.

Parallel RANSAC loop

We now have everything we need to write our RANSAC loop:

// compute the homography
fvec4 H[3][3];
homography4_from_4corresp(pts1[0], pts1[1], pts1[2], pts1[3],
                          pts2[0], pts2[1], pts2[2], pts2[3],
                          H);
// evaluate support
fvec4 support(0);
for (int i=0; i<n; i++) {
  // Fetch the 4 next correspondences
  p[0] = fvec4( row(i,uv1,stride1)[0] );
  p[1] = fvec4( row(i,uv1,stride1)[1] );
  g[0] = fvec4( row(i,uv2,stride2)[0] );
  g[1] = fvec4( row(i,uv2,stride2)[1] );

  // Projects the points with the homography
  fvec4 t[2];
  homography4_transform(p, H, t);

  // Compute reprojection distances
  fvec4 d = dist2(t,g);

  // Compute support.
  support += (d < threshold) & fvec4(1 - .1 * (d / threshold));
}

And here we go ! See the full implementation here.

Results

Using SSE instructions to find homographies significantly speeds up computation. My implementation is much faster than OpenCV’s findHomography function which is a bit more more accurate, because it uses 64 bits doubles instead of 32 bits floats.