One last thought: it could be a problem with the thermal regulation itself. If you can figure out the fan wiring, you could set the fans so they are always full on; it might help, it might not. You could also put in fans of the same size but with higher current/airflow. As this is a server, and apparently rack mounted, you probably can't do much else about the cooling.
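
If rewiring is more than you want to take on, the same "always full on" experiment can sometimes be tried from software on Linux. The sketch below assumes the fan controller driver exposes the standard hwmon files pwmN and pwmN_enable under /sys/class/hwmon; many boards do not, so treat that as an assumption to check first. The function name set_fans_full and its dry_run parameter are made up for illustration. It needs root to actually write, and with dry_run=True it only lists what it would touch.

```python
import glob
import os
import re


def set_fans_full(root="/sys/class/hwmon", dry_run=True):
    """Set every hwmon PWM output under root to full duty cycle.

    Assumes the standard hwmon sysfs layout: pwmN takes 0-255 and
    pwmN_enable = 1 selects manual control. Returns the pwm files found.
    """
    touched = []
    for pwm in sorted(glob.glob(os.path.join(root, "hwmon*", "pwm*"))):
        # Skip helper files such as pwm1_enable or pwm1_mode.
        if not re.fullmatch(r"pwm\d+", os.path.basename(pwm)):
            continue
        if not dry_run:
            enable = pwm + "_enable"
            if os.path.exists(enable):
                with open(enable, "w") as f:
                    f.write("1")  # 1 = manual fan control
            with open(pwm, "w") as f:
                f.write("255")  # 255 = 100% duty cycle
        touched.append(pwm)
    return touched


if __name__ == "__main__":
    # Dry run by default: only report which outputs were found.
    for path in set_fans_full(dry_run=True):
        print("would set to full speed:", path)
```

If the dry run prints nothing, the driver in use does not expose PWM control this way and the wiring route is what is left.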

mad.scientist.at.large (a good madscientist)
--
Read, Scream, Fight <https://www.eff.org>



20. Apr 2018 06:21 by michaelkintzios@gmail.com:

On Friday, 20 April 2018 12:55:13 BST Corbin Bird wrote:
Oak Ridge National Laboratory uses these processors ( Rhea Cluster ) and
has numerous heat failures.

Due to poor cooling ... surprised?

The cooling is not working right. Something is still wrong.

On 04/19/2018 09:33 PM, R0b0t1 wrote:
> Dell Precision T7600, two 16 thread Xeons, 192GB of RAM, two Quadro
> cards and a Tesla card.
>
> The system is a few years old at this point. Old enough that the
> thermal compound could have hardened, which is why I replaced it.

If the problem started suddenly, rather than getting progressively worse over
time, it may have something to do with kernel drivers, or some change in
firmware.
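
To tell a sudden change from a gradual one, it may help to record temperatures before and after a kernel or firmware change and compare them. A minimal sketch, assuming a Linux system whose sensor drivers publish the usual hwmon files (tempN_input in millidegrees Celsius, with optional name and tempN_label files); read_temps is a name invented here for illustration:

```python
import glob
import os


def _read(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def read_temps(root="/sys/class/hwmon"):
    """Return {"chip/label": degrees_C} for every temp*_input under root."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "temp*_input"))):
        chip_dir = os.path.dirname(path)
        stem = path[: -len("_input")]  # e.g. .../hwmon0/temp1
        chip = _read(os.path.join(chip_dir, "name")) or os.path.basename(chip_dir)
        label = _read(stem + "_label") or os.path.basename(stem)
        raw = _read(path)
        if raw is None:
            continue
        try:
            temps[f"{chip}/{label}"] = int(raw) / 1000.0  # millidegrees -> degrees
        except ValueError:
            continue
    return temps


if __name__ == "__main__":
    for sensor, celsius in read_temps().items():
        print(f"{sensor}: {celsius:.1f} C")
```

Running it from cron under the old and the new kernel, with the same load, gives two sets of numbers to compare.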

If the cause is mechanical, I'd also suggest checking the heat sink contact
surface. Some heat sinks are poorly manufactured and require flattening with
wet-and-dry sandpaper to get a flat enough surface and improve their contact
with the CPU. I've seen a 15°C improvement in a Zalman CPU cooler after excess
metal was removed from copper pipes, which were manufactured proud. Hardcore
overclockers flatten the CPU too, but I'd avoid anything that radical because
it can go badly wrong if you remove more than the surface varnish from the chip.

In the interim, opening the side panel may also help in hot weather.

--
Regards,
Mick