After looking over the code, I'd say most of the low-hanging fruit has already been picked with this mod. There are some gains to be made by moving to, let's say, SSE4.x for some of the functions being injected, but without seeing a profile of the exe it's hard to say...
You need to balance two things:
1) Creating an SSE4 version will fragment efforts and create confusion. While it's easy to do CPU capability detection at DLL load and throw a warning or something if SSE4 isn't supported - it's still a different version to maintain.
2) Some of the SSE4+ gains are valid only for Intel Nehalem+ CPUs, which are already brutally powerful.
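To show how small the detection part from point 1 is, here is a rough sketch. It is not taken from the mod: the function name cpu_has_sse41 is made up, and I'm assuming the check would be called from the DLL's load routine. It reads CPUID leaf 1 and tests ECX bit 19, which is the SSE4.1 flag.

```c
#if defined(_MSC_VER)
#include <intrin.h>   /* __cpuid on MSVC */
#else
#include <cpuid.h>    /* __get_cpuid on GCC/Clang */
#endif

/* Hypothetical helper: returns 1 if the CPU reports SSE4.1, else 0.
   SSE4.1 support is CPUID leaf 1, ECX bit 19. */
static int cpu_has_sse41(void)
{
#if defined(_MSC_VER)
    int regs[4];                 /* EAX, EBX, ECX, EDX */
    __cpuid(regs, 1);
    return (regs[2] >> 19) & 1;
#else
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                /* leaf 1 not available */
    return (int)((ecx >> 19) & 1u);
#endif
}
```

On load, the DLL could call this once and either warn the user or fall back to the plain SSE2 code path when it returns 0. The check itself is the cheap part; the second code path is the part that has to be maintained.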
For example, dot product and visibility calculations can benefit greatly from SSE4 instructions (especially dpps), but that instruction only gives a proper speed improvement on Nehalem+ CPUs (it was first implemented in Penryn, the 45nm Core 2 shrink, but was kinda meh there). The question is what share of CPU time is taken by those functions?
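For reference, this is roughly what a dpps-based dot product looks like with intrinsics. It's only a sketch: dot4 is a made-up name, not a function from the game or the mod, and the SSE4.1 path is only compiled when the compiler defines __SSE4_1__ (e.g. GCC/Clang with -msse4.1), otherwise it falls back to plain scalar code.

```c
#if defined(__SSE4_1__)
#include <smmintrin.h>   /* SSE4.1 intrinsics, including _mm_dp_ps (dpps) */
#endif

/* Hypothetical helper: dot product of two 4-float vectors. */
static float dot4(const float *a, const float *b)
{
#if defined(__SSE4_1__)
    __m128 va = _mm_loadu_ps(a);   /* unaligned loads, no alignment assumed */
    __m128 vb = _mm_loadu_ps(b);
    /* Mask 0xF1: multiply all four lanes (high nibble F),
       store the sum in lane 0 only (low nibble 1). */
    return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xF1));
#else
    /* Scalar fallback when SSE4.1 is not enabled at compile time. */
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
#endif
}
```

Whether this beats the existing SSE2 multiply-and-shuffle sequence depends on the CPU, which is exactly why a profile is needed before committing to it.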
One more thing to consider - we should think about the possibility of Bethesda backporting some of these changes to the Skyrim code base, so that the mass of users benefits from them. With Arisu's assembler work that was quite doable - some functions in the original source code were easily replaceable with __asm { } blocks. Currently Alexander appears to be doing the following - disassembling some hot functions with IDA, using the resulting source code to build custom functions with the ICC compiler, and injecting the generated code back - Bethesda can't really backport any of this work.
P.S. This mod community is amazing, 40-50% FPS gains are unheard of.
