If you haven't already, I would first verify that it's actually CPU bound before changing CFLAGS and recompiling everything. So take a look at top, vmstat, mpstat, etc. when you're noticing slowness. If it is truly CPU bound and you're going to recompile everything, you could consider upgrading to the ~ version of gcc first, on the assumption that its optimizations may be better. However, my gut feeling is that you won't get much, if any, improvement over your current CFLAGS.
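For the CPU-bound check, the quick route is to run something like `vmstat 1 5` while the box feels slow and compare the us/sy columns against wa. If you'd rather have one summary line, here is a rough sketch that samples /proc/stat twice. The function name `cpu_breakdown` and the 2 second interval are my own choices, and it assumes a Linux /proc/stat with the usual field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
#!/bin/sh
# cpu_breakdown BEFORE AFTER
# Hypothetical helper: takes two "cpu ..." summary lines from /proc/stat
# and prints how the time between them was split.
cpu_breakdown() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) prev[i] = $i }
        NR == 2 {
            total = 0
            for (i = 2; i <= 9; i++) { d[i] = $i - prev[i]; total += d[i] }
            if (total == 0) { print "no samples"; exit }
            # busy = user + nice + system + irq + softirq
            busy = d[2] + d[3] + d[4] + d[7] + d[8]
            printf "busy=%d%% iowait=%d%% idle=%d%%\n", \
                busy * 100 / total, d[6] * 100 / total, d[5] * 100 / total
        }'
}

# Take two samples a couple of seconds apart (Linux only).
if [ -r /proc/stat ]; then
    before=$(head -n 1 /proc/stat)
    sleep 2
    after=$(head -n 1 /proc/stat)
    cpu_breakdown "$before" "$after"
fi
```

If busy is pinned high and iowait stays low during the slowness, it probably is CPU bound. If iowait dominates, compiler flags are unlikely to help and I'd look at disk or memory first.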
The i686 and -Os ideas are interesting. See if you can find any benchmarks.
Also, try diffing the kernel .configs. Maybe you missed something important on the slow system.
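A plain diff of two .configs can be noisy when the options are in a different order, so sorting first helps. This is only a sketch: `config_diff` is a name I made up, and the paths in the usage comment are placeholders for wherever your two configs actually live. It keeps the "# CONFIG_FOO is not set" lines, since those matter too.

```shell
#!/bin/sh
# config_diff FAST_CONFIG SLOW_CONFIG
# Hypothetical helper: diff two kernel .config files, ignoring option order
# and dropping lines that do not mention a CONFIG_ symbol.
config_diff() {
    a=$(mktemp) && b=$(mktemp) || return 2
    grep 'CONFIG_' "$1" | sort > "$a"
    grep 'CONFIG_' "$2" | sort > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return $status
}

# Usage (placeholder paths, adjust to your systems):
#   config_diff /path/to/fast-system.config /usr/src/linux/.config
```

Lines starting with `<` exist only in the first file and lines starting with `>` only in the second. The kernel source tree also ships a scripts/diffconfig helper that does a similar job, if you have the sources handy.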