public inbox for gentoo-user@lists.gentoo.org
* [gentoo-user] time to build a new machine ?
@ 2021-09-24  9:58 Philip Webb
  2021-09-24 10:12 ` Andrew Udvare
  0 siblings, 1 reply; 7+ messages in thread
From: Philip Webb @ 2021-09-24  9:58 UTC
  To: Gentoo User

While I was asleep yesterday, my machine reported on all  3  Konsoles :

Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b

Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
: mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a000001000000 

Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
: mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822

-- end of report --

I don't remember seeing this before : how concerned should I be ?

The present machine is  6  years old & has always worked very well ;
its CPU is an AMD.  I plan to build a new machine in the next few months :
should I accelerate my plans ?

-- 
========================,,============================================
SUPPORT     ___________//___,   Philip Webb
ELECTRIC   /] [] [] [] [] []|   Cities Centre, University of Toronto
TRANSIT    `-O----------O---'   purslowatchassdotutorontodotca




* Re: [gentoo-user] time to build a new machine ?
  2021-09-24  9:58 [gentoo-user] time to build a new machine ? Philip Webb
@ 2021-09-24 10:12 ` Andrew Udvare
  2021-09-24 10:48   ` Philip Webb
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Udvare @ 2021-09-24 10:12 UTC
  To: Gentoo Users



> On 2021-09-24, at 05:58, Philip Webb <purslow@ca.inter.net> wrote:
> 
> While I was asleep yesterday, my machine reported on all  3  Konsoles :
> 
> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
> : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
> 
> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
> : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a000001000000 
> 
> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
> : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
> 
> -- end of report --
> 
> I don't remember seeing this before : how concerned should I be ?

From the manpage:

       Most  errors  can be corrected by the CPU by internal error correction mechanisms. Uncorrected
       errors cause machine check exceptions which may kill processes or panic the machine.  A  small
       number  of  corrected errors is usually not a cause for worry, but a large number can indicate
       future failure.

       When an uncorrected machine check error happens that the kernel cannot recover  from  then  it
       will  usually  panic  the  system.   In  this case when there was a warm reset after the panic
       mcelog should pick up the machine check errors after reboot.  This is  not  possible  after  a
       cold reset.

If you are overclocking, try disabling it.




* Re: [gentoo-user] time to build a new machine ?
  2021-09-24 10:12 ` Andrew Udvare
@ 2021-09-24 10:48   ` Philip Webb
  2021-09-24 15:22     ` Andrew Udvare
  0 siblings, 1 reply; 7+ messages in thread
From: Philip Webb @ 2021-09-24 10:48 UTC
  To: gentoo-user

210924 Andrew Udvare wrote:
> On 2021-09-24, at 05:58, Philip Webb <purslow@ca.inter.net> wrote:
>> While I was asleep yesterday, my machine reported on all  3  Konsoles :
>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>> : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>> : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a000001000000 
>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>> : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
>> -- end of report --
> From the manpage:

Which man page is that ?

> Most errors can be corrected by the CPU
> by internal error correction mechanisms.  Uncorrected errors cause
> machine check exceptions which may kill processes or panic the machine.
> A small number of corrected errors is usually not a cause for worry,
> but a large number can indicate future failure.

So it looks as if the above was a correctable error.
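
That reading can be checked against the status word itself. A minimal
sketch in Python, assuming the architectural x86 MCi_STATUS layout
(bit 63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC);
the helper name decode_mci_status is hypothetical, and the
vendor-specific fields of the register are ignored:

```python
# Flag positions in the 64-bit MCi_STATUS register (architectural bits only).
MCI_STATUS_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register contents are valid
    "ADDRV": 58,  # ADDR register contents are valid
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status: int) -> dict:
    """Return each architectural flag as a bool (hypothetical helper)."""
    return {name: bool((status >> bit) & 1) for name, bit in MCI_STATUS_BITS.items()}


if __name__ == "__main__":
    # Status value taken from the "Bank 4" line of the report above.
    flags = decode_mci_status(0x9D0B4C16001D011B)
    for name, is_set in flags.items():
        print(f"{name:5s} {int(is_set)}")
```

For the value reported here, UC and PCC come out clear while VAL, EN,
MISCV and ADDRV are set, which is consistent with a corrected error.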

> When an uncorrected machine check error happens
> that the kernel cannot recover from, then it will usually panic the system.
> In this case when there was a warm reset after the panic,
> mcelog should pick up the machine check errors after reboot.
> This is not possible after a cold reset.

No sign of any other effects : everything went on running.

> If you are overclocking, try disabling it.

No, I never overclock anything (smile).

-- 
========================,,============================================
SUPPORT     ___________//___,   Philip Webb
ELECTRIC   /] [] [] [] [] []|   Cities Centre, University of Toronto
TRANSIT    `-O----------O---'   purslowatchassdotutorontodotca




* Re: [gentoo-user] time to build a new machine ?
  2021-09-24 10:48   ` Philip Webb
@ 2021-09-24 15:22     ` Andrew Udvare
  2021-09-24 16:37       ` Mark Knecht
  0 siblings, 1 reply; 7+ messages in thread
From: Andrew Udvare @ 2021-09-24 15:22 UTC
  To: gentoo-user


On 24/09/2021 06:48, Philip Webb wrote:
> 210924 Andrew Udvare wrote:
>> On 2021-09-24, at 05:58, Philip Webb <purslow@ca.inter.net> wrote:
>>> While I was asleep yesterday, my machine reported on all  3  Konsoles :
>>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>>> : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
>>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>>> : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a000001000000
>>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>>> : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
>>> -- end of report --
>>  From the manpage:
> 
> Which man page is that ?

man mcelog



* Re: [gentoo-user] time to build a new machine ?
  2021-09-24 15:22     ` Andrew Udvare
@ 2021-09-24 16:37       ` Mark Knecht
  2021-09-24 19:55         ` Philip Webb
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Knecht @ 2021-09-24 16:37 UTC
  To: Gentoo User

On Fri, Sep 24, 2021 at 8:23 AM Andrew Udvare <audvare@gmail.com> wrote:
>
> On 24/09/2021 06:48, Philip Webb wrote:
> > 210924 Andrew Udvare wrote:
> >> On 2021-09-24, at 05:58, Philip Webb <purslow@ca.inter.net> wrote:
> > >>> While I was asleep yesterday, my machine reported on all  3  Konsoles :
> > >>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
> > >>> : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
> > >>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
> > >>> : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a000001000000
> > >>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
> > >>> : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
> >>> -- end of report --
> >>  From the manpage:
> >
> > Which man page is that ?
>
> man mcelog

I have no direct experience with this error however I'd suggest it was
most likely an error reading a block of DRAM and not likely the CPU
itself failing. I periodically get mce errors on my i980 machine when
running big PixInsight jobs and I hit thermal limits.

I'd suggest you run extensive memory tests and if you don't see any problems
don't worry too much. Of course, it's always wise to do good backups in case
the problem gets worse.

Good luck,
Mark


* Re: [gentoo-user] time to build a new machine ?
  2021-09-24 16:37       ` Mark Knecht
@ 2021-09-24 19:55         ` Philip Webb
  2021-09-25  0:04           ` Adam Carter
  0 siblings, 1 reply; 7+ messages in thread
From: Philip Webb @ 2021-09-24 19:55 UTC
  To: gentoo-user

210924 Mark Knecht wrote:
> On 2021-09-24, at 05:58, Philip Webb <purslow@ca.inter.net> wrote:
>> While I was asleep yesterday, my machine reported on all  3  Konsoles
>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>> : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>> : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a000001000000
>> Message from syslogd@  at Thu Sep 23 19:38:11 2021 ...
>> : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
MK> I have no direct experience with this error,
> however I'd suggest it was most likely an error
> reading a block of DRAM and not likely the CPU itself failing.
> I periodically get mce errors on my i980 machine
> when running big PixInsight jobs and I hit thermal limits.

I thought you had written "1980 machine" (grin).

> I'd suggest you run extensive memory tests 
> and if you don't see any problems don't worry too much.
> It's always wise to do good backups in case the problem gets worse.

Everything is backed up, incl off-site.

> On Fri, Sep 24, 2021 at 8:23 AM Andrew Udvare <audvare@gmail.com> wrote:
>> man mcelog

'man mcelog' + 'man mce' find nothing.  Does it need to be installed ?

Thanks for the advice so far.

-- 
========================,,============================================
SUPPORT     ___________//___,   Philip Webb
ELECTRIC   /] [] [] [] [] []|   Cities Centre, University of Toronto
TRANSIT    `-O----------O---'   purslowatchassdotutorontodotca




* Re: [gentoo-user] time to build a new machine ?
  2021-09-24 19:55         ` Philip Webb
@ 2021-09-25  0:04           ` Adam Carter
  0 siblings, 0 replies; 7+ messages in thread
From: Adam Carter @ 2021-09-25  0:04 UTC
  To: Gentoo User

>
> >> man mcelog
>
> 'man mcelog' + 'man mce' find nothing.  Does it need to be installed ?
>

Yep and the package is called mcelog.

Did you check for any other messages before/after the mce errors?
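
To line the event up with other log entries, the TIME field in the
report can be turned into wall-clock time. A minimal sketch in Python,
assuming TIME is seconds since the Unix epoch; the helper name
mce_time_to_utc is hypothetical:

```python
from datetime import datetime, timezone


def mce_time_to_utc(epoch_seconds: int) -> datetime:
    """Convert the TIME field of an mce report to an aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    # TIME value taken from the "PROCESSOR" line of the quoted report.
    print(mce_time_to_utc(1632440315).isoformat())
    # prints 2021-09-23T23:38:35+00:00
```

If the machine's local time is UTC-4 (an assumption), that is 19:38
local, the same minute as the syslogd messages, so that is the part
of the logs worth searching.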

Do you also have lm-sensors installed? Running sensord?

Genuine CPU issues seem pretty rare, so I would check for overheating or
power issues, and lm-sensors will help with that.
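
Until lm-sensors is set up, the kernel's hwmon sysfs files can be read
directly. A minimal sketch in Python, assuming the usual
/sys/class/hwmon/hwmon*/temp*_input layout with values in millidegrees
Celsius; read_hwmon_temps is a hypothetical helper, and it returns
an empty dict if no sensor driver is loaded:

```python
from pathlib import Path


def read_hwmon_temps(root: str = "/sys/class/hwmon") -> dict:
    """Map 'chip/sensor' to degrees Celsius for every readable temp*_input."""
    temps = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        # Fall back to the directory name if the driver exposes no 'name' file.
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for sensor in sorted(chip_dir.glob("temp*_input")):
            try:
                millideg = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # skip sensors that cannot be read or parsed
            temps[f"{chip}/{sensor.name}"] = millideg / 1000.0
    return temps


if __name__ == "__main__":
    for label, celsius in read_hwmon_temps().items():
        print(f"{label}: {celsius:.1f} C")
```

Running it under load and again at idle gives a rough idea of whether
heat is a factor; which chip names show up depends on the drivers
built into the kernel.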

