With BUILD 2014, Microsoft released a new preview version of its next-generation JIT compiler "RyuJIT", which, combined with a special SIMD library that can be installed via NuGet, supports SIMD intrinsics (only SSE2 for now, but AVX is in the works).
Finally! I couldn't wait to try out the new bits, so I modified the C# version of my existing XRaySimulator* to make use of SSE2 by implementing a simple packet ray tracing technique, i.e. instead of tracing individual rays, this version traces bundles of 2x2 (SSE2) or 4x2 (AVX) rays. Because the rays are largely "coherent", they typically hit the same objects (cache hit rate!).
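To make the packet idea concrete, here is a minimal C++ sketch of how a bundle of rays could be laid out so that each component maps onto one SIMD register (4 floats for SSE2, 8 for AVX). This is not XRaySimulator's actual code: the names `RayPacket` and `makePrimaryPacket` are hypothetical, and the orthographic camera looking along +z is an assumption made purely for illustration.

```cpp
#include <array>
#include <cstddef>

// Hypothetical type, not taken from XRaySimulator.
// N rays stored as a structure of arrays (SoA): every component array
// holds one value per ray, so it can be loaded into a single SIMD register
// (N = 4 for SSE2, N = 8 for AVX).
template <std::size_t N>
struct RayPacket {
    std::array<float, N> ox, oy, oz;  // ray origins
    std::array<float, N> dx, dy, dz;  // ray directions
};

// Builds the packet for a tile of W x H neighbouring pixels whose top-left
// pixel is (px, py). Assumption: orthographic camera, all rays point along +z.
template <std::size_t W, std::size_t H>
RayPacket<W * H> makePrimaryPacket(float px, float py) {
    RayPacket<W * H> p{};
    for (std::size_t j = 0; j < H; ++j) {
        for (std::size_t i = 0; i < W; ++i) {
            const std::size_t lane = j * W + i;  // row-major lane index
            p.ox[lane] = px + static_cast<float>(i) + 0.5f;  // pixel centre
            p.oy[lane] = py + static_cast<float>(j) + 0.5f;
            p.oz[lane] = 0.0f;
            p.dx[lane] = 0.0f;
            p.dy[lane] = 0.0f;
            p.dz[lane] = 1.0f;
        }
    }
    return p;
}
```

With this layout, `makePrimaryPacket<2, 2>(x, y)` would correspond to the 2x2 SSE2 bundle and `makePrimaryPacket<4, 2>(x, y)` to the 4x2 AVX bundle. Neighbouring pixels end up in neighbouring lanes, which is what keeps the rays of one packet coherent.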
The contenders
Currently there are a total of six different variants of the XRaySimulator:
- "C#": This is the baseline, scalar managed implementation.
- "C# adj. trav.": A further optimized version that exploits the fact that once a ray is inside a volume (finite element) mesh, it must hit a face of an adjacent element (hexahedron).
- "C#/SSE2": Like "C#", but using 2x2 (X-)ray packets; doesn't use "adjacency traversal" due to the high branching factor.
- "C++": A C++11 reimplementation of "C#"; I tried to stay as close as possible to "C#" while still using at least half-way decent, idiomatic C++.
- "C++ adj. trav.": Corresponds to "C# adj. trav."
- "C++/AVX": Vectorized version of "C++" using 4x2 ray bundles thanks to AVX.
Performance analysis
So, who wins? The following figure shows the performance of the different versions in million rays per second (MRay/s) rendering an FE model consisting of 28672 hexahedral elements (344064 triangles) at a resolution of 6400 x 4800 pixels on an Intel Core i7-2600K (3.4 - 4.2 GHz) with 32 GB DDR3 RAM running under Windows 8.1 Pro:
Now, given that SSE2 uses only 128-bit-wide vector registers (4 floats) compared to AVX's generous 256 bits (8 floats), and given the generally much more aggressive optimizer of the Visual C++ compiler, it's not exactly surprising to see an obvious performance difference between the "C++/AVX" and "C#/SSE2" cases. Yet, I still would have expected the speed-up of "C#/SSE2" to reach a value a little closer to 4x instead of 2.5x. What's going on there?
According to Visual Studio's built-in profiler, all of the implementations spend the majority of their time in the intersection routine of the AABB (axis-aligned bounding box) - which is a good thing, because this intersection test is very fast compared to a triangle intersection test. Thus the quality of the generated machine code for this method/function is critical for the overall performance of the renderer.
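For readers unfamiliar with this test, the following is a generic C++ sketch of the widely used "slab" method applied to a whole packet of rays. It is not the XRaySimulator source; all names (`Aabb`, `PacketRays`, `intersectPacket`) are hypothetical, and the inner loop over lanes is written in plain scalar code to stand in for what a single SSE2 or AVX instruction would do across all lanes at once.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical types for illustration only.
struct Aabb {
    float min[3];
    float max[3];
};

// N rays in SoA layout; invDir holds 1/direction per axis and lane.
template <std::size_t N>
struct PacketRays {
    std::array<float, N> origin[3];
    std::array<float, N> invDir[3];
};

// Slab test: a ray hits the box if the parameter intervals in which it lies
// between the two bounding planes of each axis overlap within [0, tMax].
// Returns a bit mask with bit i set if the ray in lane i hits the box.
template <std::size_t N>
std::uint32_t intersectPacket(const Aabb& box, const PacketRays<N>& rays,
                              float tMax) {
    static_assert(N <= 32, "hit mask has only 32 bits");
    std::array<float, N> tNear;
    std::array<float, N> tFar;
    tNear.fill(0.0f);
    tFar.fill(tMax);

    for (int axis = 0; axis < 3; ++axis) {
        // A SIMD version would process all lanes of this loop at once.
        for (std::size_t lane = 0; lane < N; ++lane) {
            const float o = rays.origin[axis][lane];
            const float inv = rays.invDir[axis][lane];
            const float t0 = (box.min[axis] - o) * inv;
            const float t1 = (box.max[axis] - o) * inv;
            tNear[lane] = std::max(tNear[lane], std::min(t0, t1));
            tFar[lane] = std::min(tFar[lane], std::max(t0, t1));
        }
    }

    std::uint32_t mask = 0;
    for (std::size_t lane = 0; lane < N; ++lane) {
        if (tNear[lane] <= tFar[lane]) {
            mask |= (1u << lane);
        }
    }
    return mask;
}
```

The routine consists almost entirely of subtractions, multiplications, `min` and `max`, with no per-ray branching until the final mask is built. That is why it vectorizes so naturally, and why the quality of the code the compiler emits for exactly these few operations matters so much. Note that this sketch does not treat the corner case of a ray origin lying exactly on a slab plane with a zero direction component (which yields NaN); production code usually handles that explicitly.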
The source code of the C#/SSE2 version looks like this:
And here's the source for the C++/AVX version:
(Note: The C++ code uses a hard-coded vector lane width of 8 floats.)
Almost identical; yet, if you compare what both RyuJIT and Visual C++ make of these sources, you'll first notice that the machine code emitted by RyuJIT is much more convoluted and thus longer:
Preliminary conclusions
It seems like Microsoft has finally awakened and is making the long overdue investments in .NET performance. Thanks Google and Apple! Although RyuJIT will still require a lot of optimizations, in particular with respect to the generated SIMD code, Redmond's latest moves are promising. A next-generation JIT, SIMD support, AOT compilation using the Visual C++ optimizer backend... What will come next? GPGPU support? Large arrays? A decent, modern, performant desktop UI framework? True first-class support for F#?
The future is bright!
*XRaySimulator is a visualization tool that renders X-ray-like images of finite element models. It uses a modified ray tracing algorithm to compute the energy absorption within each intersected element based on the element's material properties. A BVH (bounding volume hierarchy) is used to speed up the intersection computation.
Details (German): http://www.uni-ulm.de/fileadmin/website_uni_ulm/uzwr/projekte/p10-2.pdf
**The C# versions of XRaySimulator on BitBucket currently don't support saving the rendered image to a file. In older versions, I used to use Tao.DevIL, but that only works on x86 and the preview releases of RyuJIT only emit x64 machine code. The C++ versions use a custom TGA output filter.