Now I just have to figure out where the larger performance drop at large data sizes comes from, compared to the other methods. memcpy somehow reaches approx. 20 GB/s beyond the L3 cache.
Btw, I have another post on SIMD programming with .NET almost ready, but I can't publish it yet due to problems with the current version of System.Numerics.Vectors in combination with .NET 4.6/VS 2015 RC. I hope that gets fixed soon!