Thanks, this took me lots of tweaking and lots of testing!

Anyway, here's a load of test results, taken from computers around the house.
CPU: AMD Athlon 64 X2 4800+ @ 2.5GHz
---------- testing started ----------
testing x87
Starting matrix multiplication loop...
Elapsed time: 5141 ms
testing sse1
Starting matrix multiplication loop...
Elapsed time: 6203 ms
testing sse2
Starting matrix multiplication loop...
Elapsed time: 5938 ms
---------- testing finished ----------
CPU: Intel Pentium Dual Core E5200 @ 2.5GHz
---------- testing started ----------
testing x87
Starting matrix multiplication loop...
Elapsed time: 4218 ms
testing sse1
Starting matrix multiplication loop...
Elapsed time: 3594 ms
testing sse2
Starting matrix multiplication loop...
Elapsed time: 3172 ms
---------- testing finished ----------
CPU: Intel Core 2 Duo E8400 @ 3.0GHz
---------- testing started ----------
testing x87
Starting matrix multiplication loop...
Elapsed time: 3682 ms
testing sse1
Starting matrix multiplication loop...
Elapsed time: 3074 ms
testing sse2
Starting matrix multiplication loop...
Elapsed time: 2652 ms
---------- testing finished ----------
That's it. Hope it helps.
