I'd say that there wasn't a ton of reason behind pouring resources into these tools in the 80's when the hardware market couldn't use it. Now that it's a viable alternative we should expect to see it becoming worth the investment to develop for. That's when you see breakthroughs, when it's worth it. What good would multithreading have done 80's computers that weren't made for it? It'd be weird to have the software before the hardware, honestly.

Parallel computing has been a subject of research since the fifties. That's 60+ years ago. My countryman http://en.wikipedia.org/wiki/Edsger_W._Dijkstra invented the http://en.wikipedia.org/wiki/Semaphore_%28programming%29 a long time ago. And there have been machines for parallel computing for ages too. Whether it is a multi-core CPU, a motherboard with multiple CPUs, a supercomputer with many CPUs, or a true distributed system (CPUs connected by networks), the underlying algorithms, problems, and solutions share the same basic principles. We have semaphores, message passing, shared memory, and a whole bunch of other primitives that we can use when programming parallel algorithms. But they give us only very basic tools. It's like having a hammer, a saw, and nails, and then having to build a cathedral. We know it can be done, but it takes a lot of thinking, and it stays tricky not to make mistakes.
So there has always been work on parallel programming, both from the research community and from practitioners. And still, adopting the new tools is not as easy as moving from C to C++, or from Pascal to Java. I don't believe that after 60 years of research, easy solutions will suddenly pop up.
You think "It'd be weird to have the software before the hardware, honestly."? I think the exact opposite. Why build hardware if you don't know what to do with it? I think the order has always been: somebody needed to solve a problem. Somebody (else) came up with a way (an algorithm) to solve it. Then somebody had to implement that solution (write a program). Then you need hardware to run your program on. What CPUs offer a programmer hasn't changed that much over the last 50 years: the instruction set is basically still the same (add, multiply, compare, jump, store, fetch, etc). CPUs themselves have changed a lot internally, but what they present to a programmer is basically unchanged. Only faster.
I'd also like to revisit your 4-core vs. 8-core argument that no games use 5 or more cores. I'd like to point out that until a few years ago, no game used more than 2 cores, and before that, no game used more than 1. Technology is progressing, and 8 cores aren't that far from being standard as Intel keeps pushing core counts up. Can't games be made to utilize 4 or 8 cores, or are they tied to one or the other?
A small example of why it isn't that easy.
Suppose you have a CPU with 1 core. It can do 1 billion instructions/second.
Let's ignore the GPU at the moment.
Suppose a program (a game) needs 100 million instructions to render 1 frame.
That means our game will run at 10 frames/second.
Now suppose we have 2 cores. Still 1 billion instructions/second per core.
To make use of that, we need to split our program into 2 threads.
Suppose we can split the program in these threads:
1) render thread: requires 50 million instructions per frame.
2) audio thread: requires 30 million instructions per frame.
3) AI thread: requires 20 million instructions per frame.
We then run thread1 on core-A, and let core-B be used by thread2 and thread3.
As you can see, both core-A and core-B will be done twice as fast.
Instead of needing 100 ms (with 1 core), we are now done after 50 ms.
The frame rate goes up from 10 fps to 20 fps!
Awesome.
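That back-of-the-envelope math can be put in a few lines of Python (purely illustrative; the numbers are the ones from the example above):

```python
CORE_SPEED = 1000  # million instructions per second per core

def frame_time_ms(cores):
    """cores: one list of thread costs (in millions of instructions)
    per core. The frame is done when the slowest core is done."""
    return max(sum(threads) for threads in cores) / CORE_SPEED * 1000

# 1 core running everything: 100M instructions -> 100 ms -> 10 fps.
one_core = [[50, 30, 20]]
# 2 cores: render on core-A, audio + AI on core-B -> 50 ms -> 20 fps.
two_cores = [[50], [30, 20]]

print(frame_time_ms(one_core))   # 100.0
print(frame_time_ms(two_cores))  # 50.0
```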
Now suppose we get a 4-core system.
We need 4 threads. Thread1 (the render thread) is very complicated. We can't split it easily.
But the AI thread can be split. So we get 2 AI threads now, 3A and 3B, each requiring 10 million instructions.
Let's start the 4 threads on our 4 cores.
Thread1 still requires 50 million instructions. It will take 50 milliseconds to finish.
Thread2 still requires 30 million instructions. It will run for 30 milliseconds. And then the core will be idle for 20 milliseconds.
Thread3A requires 10 million instructions. It will run for 10 milliseconds. And then the core will wait for 40 milliseconds.
Thread3B requires 10 million instructions. It will run for 10 milliseconds. And then the core will wait for 40 milliseconds.
Result: the slowest thread (1) still requires 50 milliseconds to finish.
Your fps will again be 20 fps.
We went from 2 cores to 4 cores, and gained no fps. And all 4 cores were in fact in use!
Now we put in a lot of work, and split the render-thread (thread1) into 2 new threads.
Thread1-A will need 30 million instructions, thread1-B 20 million instructions.
We now run thread-1A on core-A, thread-1B on core-B, thread-2 on core-C, and threads 3A and 3B on core-D.
Result: both core-A and core-C will finish after 30 milliseconds; core-B and core-D will finish after 20 milliseconds.
We now have all threads finished after 30 milliseconds. Our fps went up from 20 to 33!
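The same illustrative sketch in Python shows both 4-core schedules: the naive one where the render thread stays whole, and the one where it is split into 30M + 20M pieces (numbers from the example):

```python
CORE_SPEED = 1000  # million instructions per second per core

def frame_time_ms(cores):
    # Frame time is set by the slowest core: the max over cores of
    # the total work (in millions of instructions) scheduled on it.
    return max(sum(threads) for threads in cores) / CORE_SPEED * 1000

# Render thread unsplit: cores carry 50, 30, 10, 10 -> slowest is 50 ms.
naive = [[50], [30], [10], [10]]
# Render split into 30 + 20: cores carry 30, 20, 30, 20 -> slowest is 30 ms.
split_render = [[30], [20], [30], [10, 10]]

print(frame_time_ms(naive))         # 50.0 -> still 20 fps
print(frame_time_ms(split_render))  # 30.0 -> ~33 fps
```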
I hope you get the picture now.
Suppose we get an 8-core CPU.
We need to split our threads again. This will become harder and harder each time.
Suppose we can split the AI threads into 10 little threads, each requiring 2 million instructions. Will this help us? No. Because the slowest two threads (1A and 2) will still finish after 30 milliseconds. So we need to split both thread 1A and thread 2 into smaller threads. Will that be doable? If so, then we are saved. But for some programs, there will be core functionality that maybe cannot be split into smaller threads. What if the render thread 1A cannot be split? We can split all the other threads into smaller ones, but that won't help at all. We will only get higher performance if we can split the heaviest thread(s). If you can't split all heavy threads into roughly equal-sized chunks, then more cores will not help at all.
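This wall is essentially Amdahl's law: no matter how many cores you add, the frame still has to wait for the largest piece you cannot split, and that caps the speedup. A tiny Python sketch, again with the numbers from the example:

```python
def speedup_ceiling(total_work, largest_unsplittable):
    # Upper bound on speedup: even with unlimited cores the frame
    # cannot finish before the biggest unsplittable thread does.
    return total_work / largest_unsplittable

# 100M instructions per frame, with a 30M render chunk that can't be
# split: the best possible speedup is 100/30 ~ 3.3x (10 fps -> ~33 fps),
# regardless of core count.
print(round(speedup_ceiling(100, 30), 2))  # 3.33
```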
I think this is a problem games have. Most games have some functionality that can not easily be split into smaller threads. You can split off AI, you can split off audio-handling. Maybe a few more smaller tasks. But there is some large rendering work that can not easily be split. I guess if you want to do that, you have to re-design your engine from the ground up. And few engines have done that yet. There will be more, but that takes time. And even then, you can only split a program into multiple threads until you run into the wall where there is one thread that you can't split further.
Anyway, this has become a lecture. Time to stop boring you.

