Thanks for explaining (and of course for making the awesome mod). Interesting to read. Do you think you will be able to do more optimization fixes like this?
Theoretically, yes, of course. I'd currently estimate that an additional +10% could be achieved in the game executable itself using the same procedure as before. However, to fix a problematic function using only a normal person's tools, it has to be detected and isolated from the call graph, then its assembly code has to be reverse engineered, rewritten and tested until it can be replaced by a better performing variant.
Now that many of the biggest CPU hogs have been softened, there still is a lot of potential gains ahead, but acquiring them would require multiple times the work already invested for a smaller amount of additional performance. It all has to be hand-crafted - there are no tools to automate this. Worse yet, some of the work has to be redone when porting the changes to a different game version. This way, only very simple optimizations are even possible at all, due to the sheer amount of places where a call must be inlined. Most function calls suitable for inlining have a varying amount of setup and cleanup code for that function call and I just can't go through each of the (tens of!) thousands of references to figure out the optimal replacement at each point. A compiler does all this for free, within seconds - and not only for the first dozen of hottest targets, but for the whole game, which consists of millions of lines of assembly code.
My true hope is therefore that this little demo will create a demand that Bethesda must answer. Much like what happened with the LAA patch. This would be the best outcome for all and also the best base for any further optimizations that can only be hand-crafted.
And did I understood you right that you said that if an optimization compiler would have been used by Bethesda from the start, we would have experienced over 100% better performance?
Yeah, pretty sure about that. The current patch doesn't modify even nearly 1% of the code, but still manages to cut the cycles per frame by 30-40%. The whole binary is full of redundant code, it's just way too much to ever do manually, but sums up quickly when eliminated by an optimizing compiler.