Kitz ADSL Broadband Information

Author Topic: Linux (CentOS) Machine Check Exceptions  (Read 1920 times)

sevenlayermuddle

  • Helpful
  • Addicted Kitizen
  • *
  • Posts: 5252
Linux (CentOS) Machine Check Exceptions
« on: January 25, 2020, 01:10:35 PM »

In my sheltered life I had never heard of Machine Check Exceptions, til this week.  But it seems they are raised in response to hardware faults detected by the CPU.   They typically mean something fundamental has malfunctioned, like a motherboard, memory, or the CPU itself.

So wondering...  does anybody here know anything much about MCE diagnosis?   And in particular on Linux, will the mcelog package help?

Reason for asking... My HP ProLiant MicroServer spontaneously rebooted. After the reboot all was working, except the machine health status LED bar was flashing red. A power cycle restored the normal blue colour. I found nothing of interest in any of the CentOS logs, but logging into HP's iLO firmware diagnostics showed that the issue was immediately preceded by two log entries for "Uncorrectable Machine Check Exceptions".
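For anyone repeating the search of the OS logs, here is a minimal sketch of a filter for MCE-related lines. The patterns are assumptions based on typical Linux kernel wording, which varies between kernel versions, and the sample line is illustrative rather than taken from this machine. An uncorrectable MCE that resets the box may never reach the on-disk logs at all, which would be consistent with finding nothing.

```python
import re

# Assumed patterns: typical wording of kernel messages about machine checks.
# Exact text differs between kernel versions, so treat this as a loose filter.
MCE_PATTERN = re.compile(r"machine check|\bmce:|hardware error", re.IGNORECASE)

def find_mce_lines(lines):
    """Return the lines that look MCE-related, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if MCE_PATTERN.search(line)]

# Illustrative input only; in practice pass an open log file instead,
# e.g. open("/var/log/messages", errors="replace") on CentOS.
sample = [
    "Jan 23 21:19:01 host kernel: mce: [Hardware Error]: Machine check events logged\n",
    "Jan 23 21:20:00 host systemd: Started Session 1 of user root.\n",
]
for hit in find_mce_lines(sample):
    print(hit)
```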

I gather that decoding MCEs manually is hard work, so I have decided to cling to the hope that the fault was a one-off blip.  I found and installed a BIOS update, though no reason to expect a cure from that.   

But I have also installed Linux's mcelog package in the hope it may help if the fault recurs. mcelog --ascii is also meant to be able to decode a string containing the raw codes, but I've not had any success. Pinning my hopes purely on the possibility mcelog might help if there is a next time.

Any guidance appreciated. :)
Logged

roseway

  • Administrator
  • Senior Kitizen
  • *
  • Posts: 39678
  • Penguins CAN fly
    • DSLstats
Re: Linux (CentOS) Machine Check Exceptions
« Reply #1 on: January 25, 2020, 03:32:52 PM »

I've never heard of machine check exceptions, but I would suggest that a likely cause of the event is overheating. Do you have any temperature monitoring on that machine?
Logged
  Eric

sevenlayermuddle

  • Helpful
  • Addicted Kitizen
  • *
  • Posts: 5252
Re: Linux (CentOS) Machine Check Exceptions
« Reply #2 on: January 25, 2020, 06:34:17 PM »

Quote
I've never heard of machine check exceptions, but I would suggest that a likely cause of the event is overheating. Do you have any temperature monitoring on that machine?

Phew, glad it's not just me that's never heard of them. :)

Temperatures are heavily monitored in system firmware and all well within limits. In any case, overheating would (/should) have raised a different error, though now you mention it, Wikipedia does suggest a connection with temperature....

https://en.m.wikipedia.org/wiki/Machine-check_exception

My emphasis...
Quote
A machine-check exception (MCE) is a type of computer hardware error that occurs when a computer's central processing unit detects an unrecoverable hardware error in the processor itself, the memory, the I/O devices, or on the system bus. It is not caused by software.   The error usually occurs due to component failure or the overheating or overclocking of hardware components.

The CPU wasn't doing much at the time, just playing back a TV recording and capturing a couple more, so no real excuse for getting hot. Good idea though, I'll take the lid off sometime and make sure heatsinks and fans aren't too dusty. :)


Logged

roseway

  • Administrator
  • Senior Kitizen
  • *
  • Posts: 39678
  • Penguins CAN fly
    • DSLstats
Re: Linux (CentOS) Machine Check Exceptions
« Reply #3 on: January 25, 2020, 06:59:08 PM »

If it were a hardware failure, it would most likely be permanent, surely? Your machine recovered after power recycling, which seems to suggest some sort of thermal effect.
Logged
  Eric

sevenlayermuddle

  • Helpful
  • Addicted Kitizen
  • *
  • Posts: 5252
Re: Linux (CentOS) Machine Check Exceptions
« Reply #4 on: January 25, 2020, 07:10:09 PM »

Quote
If it were a hardware failure, it would most likely be permanent, surely? Your machine recovered after power recycling, which seems to suggest some sort of thermal effect.

I follow your reasoning, but I've found lots of threads on lots of forums complaining of occasional MCE errors that get worse with time.

One possibility (not likely here) may be lightning damage to chip inputs that leaves them still soldiering on, but leaky. Whisker growth on EU-compliant RoHS solder pads would be another scenario that might lead to intermittencies. I'm sure there are more.

Logged

burakkucat

  • Global Moderator
  • Senior Kitizen
  • *
  • Posts: 31058
  • Over the Rainbow Bridge
    • The ELRepo Project
Re: Linux (CentOS) Machine Check Exceptions
« Reply #5 on: January 25, 2020, 09:26:43 PM »

MCEs -- either thermal events or RAM parity errors
Logged
:cat:  100% Linux and, previously, Unix. Co-founder of the ELRepo Project.

Please consider making a donation to support the running of this site.

sevenlayermuddle

  • Helpful
  • Addicted Kitizen
  • *
  • Posts: 5252
Re: Linux (CentOS) Machine Check Exceptions
« Reply #6 on: January 25, 2020, 10:24:15 PM »

Current temperature readings of same system under similar conditions...

01-Inlet Ambient   Ambient     3    0   OK   23C   Caution: 42C; Critical: 46C
02-CPU             CPU        10    6   OK   40C   Caution: 70C; Critical: N/A
03-P1 DIMM 1-2     Memory     14    7   OK   34C   Caution: 87C; Critical: N/A
05-Chipset         System      4    2   OK   55C   Caution: 105C; Critical: N/A
06-Chipset Zone    System      3    4   OK   42C   Caution: 68C; Critical: 73C
07-VR P1 Zone      System      9   12   OK   48C   Caution: 88C; Critical: 93C
09-iLO Zone        System      7   15   OK   46C   Caution: 72C; Critical: 77C
11-PCI 1 Zone      I/O Board   2   11   OK   39C   Caution: 64C; Critical: 69C
12-Sys Exhaust     Chassis    10   15   OK   43C   Caution: 68C; Critical: 73C
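
To put numbers on "well within limits", here is a minimal sketch that works out how far each reading sits below its Caution threshold. It assumes the columns are separated by two or more spaces and that the last two columns are the reading and the thresholds; the function name is made up for illustration. These readings were taken after the event, so they say nothing direct about temperatures at the moment of the fault.

```python
import re

# Readings copied from the iLO temperature listing in this post.
ILO_TABLE = """\
01-Inlet Ambient    Ambient    3    0     OK    23C    Caution: 42C; Critical: 46C
02-CPU    CPU    10    6     OK    40C    Caution: 70C; Critical: N/A
03-P1 DIMM 1-2    Memory    14    7     OK    34C    Caution: 87C; Critical: N/A
05-Chipset    System    4    2     OK    55C    Caution: 105C; Critical: N/A
06-Chipset Zone    System    3    4     OK    42C    Caution: 68C; Critical: 73C
07-VR P1 Zone    System    9    12     OK    48C    Caution: 88C; Critical: 93C
09-iLO Zone    System    7    15     OK    46C    Caution: 72C; Critical: 77C
11-PCI 1 Zone    I/O Board    2    11     OK    39C    Caution: 64C; Critical: 69C
12-Sys Exhaust    Chassis    10    15     OK    43C    Caution: 68C; Critical: 73C
"""

def caution_margins(table):
    """Map sensor name -> (reading, caution threshold, margin), all in deg C."""
    margins = {}
    for line in table.strip().splitlines():
        # Assumption: columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        sensor, reading, thresholds = fields[0], fields[-2], fields[-1]
        temp = int(reading.rstrip("C"))
        caution = int(re.search(r"Caution:\s*(\d+)C", thresholds).group(1))
        margins[sensor] = (temp, caution, caution - temp)
    return margins

for sensor, (temp, caution, margin) in caution_margins(ILO_TABLE).items():
    print(f"{sensor:<18} {temp:>3}C  caution {caution:>3}C  margin {margin:>3}C")
```

On these figures the smallest margin is 19C, at the inlet sensor, and the CPU sits 30C below its Caution threshold.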

RAM parity is certainly another possibility, although I'm not convinced RAM parity and overheating are the only possibilities.    Somewhere in the detailed log message status fields lies, I suspect, a precise reason that would remove the guesswork.   Full message logs were...

   13    Critical   CPU   01/23/2020 21:19   01/23/2020 21:19   1   Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000002, Bank 0x00000003, Status 0xF2000000'00800400, Address 0x00000000'00000000, Misc 0x00000000'00000000)
   12    Critical   CPU   01/23/2020 21:19   01/23/2020 21:19   1   Uncorrectable Machine Check Exception (Board 0, Processor 1, APIC ID 0x00000000, Bank 0x00000003, Status 0xF2000000'00800400, Address 0x00000000'00000000, Misc 0x00000000'00000000)
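
As a rough sketch, the two entries can be pulled apart field by field to see what they share. The regular expression and function name are illustrative, and the field names are simply those printed in the iLO message.

```python
import re

# The two message bodies from the iLO log above.
ENTRIES = [
    "Uncorrectable Machine Check Exception (Board 0, Processor 1, "
    "APIC ID 0x00000002, Bank 0x00000003, Status 0xF2000000'00800400, "
    "Address 0x00000000'00000000, Misc 0x00000000'00000000)",
    "Uncorrectable Machine Check Exception (Board 0, Processor 1, "
    "APIC ID 0x00000000, Bank 0x00000003, Status 0xF2000000'00800400, "
    "Address 0x00000000'00000000, Misc 0x00000000'00000000)",
]

FIELD = re.compile(
    r"(Board|Processor|APIC ID|Bank|Status|Address|Misc) ([0-9A-Fa-fx']+)"
)

def parse_entry(text):
    """Return {field name: integer value} for one iLO MCE log message."""
    fields = {}
    for name, raw in FIELD.findall(text):
        # iLO prints 64-bit values with an apostrophe between the halves.
        fields[name] = int(raw.replace("'", ""), 0)
    return fields

first, second = (parse_entry(entry) for entry in ENTRIES)
differing = [name for name in first if first[name] != second[name]]
print("Differing fields:", differing)
print(f"Bank {first['Bank']}, Status {first['Status']:#018x}")
```

Only the APIC ID differs between the two entries; the bank and status are identical.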

But I'm not expecting anybody here to be able to decode, it's obviously specialised stuff, probably known only to Intel's internal CPU Gurus.  I was just hoping for any pointers on how to decode myself, or what tools might decode them for me.  I have installed mcelog, which might give more information in event of a recurrence.

https://mcelog.org
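
As a pointer on decoding by hand, here is a minimal sketch. It assumes the "Status" value that iLO reports is the raw IA32_MCi_STATUS register, with the bit layout described in the machine-check chapter of the Intel SDM. The table of simple error codes is partial and the function name is made up.

```python
# Architectural flag bits of IA32_MCi_STATUS (Intel SDM, machine-check chapter).
FLAG_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten (overflow)
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled for this bank
    "MISCV": 59,  # the Misc register holds valid data
    "ADDRV": 58,  # the Address register holds valid data
    "PCC": 57,    # processor context may be corrupt
}

# Partial list of the SDM's "simple" MCA error codes; compound codes
# (cache, TLB, bus, memory controller) are deliberately not handled here.
SIMPLE_CODES = {
    0x0000: "No error",
    0x0001: "Unclassified",
    0x0002: "Microcode ROM parity error",
    0x0003: "External error",
    0x0004: "FRC error",
    0x0005: "Internal parity error",
    0x0400: "Internal timer error",
}

def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    mca_code = status & 0xFFFF
    return {
        "flags": [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1],
        "mca_code": mca_code,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "meaning": SIMPLE_CODES.get(mca_code, "not in this partial table"),
    }

# Status value from the iLO log, with the apostrophe separator removed.
result = decode_status(0xF200000000800400)
for key, value in result.items():
    print(f"{key}: {value}")
```

If that reading of the register is right, the set flags are VAL, OVER, UC, EN and PCC, with ADDRV and MISCV clear, which fits the all-zero Address and Misc fields in the log. The MCA error code is 0x0400, which the SDM's simple-code table labels an internal timer error. That names the class of error, not the failing component.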
Logged

burakkucat

  • Global Moderator
  • Senior Kitizen
  • *
  • Posts: 31058
  • Over the Rainbow Bridge
    • The ELRepo Project
Re: Linux (CentOS) Machine Check Exceptions
« Reply #7 on: January 25, 2020, 10:56:05 PM »

There should be an appropriate RPM package available, for the version of CentOS that you are using.

[bcat ~]$ sudo yum info mcelog
Loaded plugins: fastestmirror, product-id, refresh-packagekit, search-disabled-repos, subscription-manager
Loading mirror speeds from cached hostfile
 * elrepo-kernel: mirrors.coreix.net
Available Packages
Name        : mcelog
Arch        : x86_64
Epoch       : 2
Version     : 128
Release     : 1.c83713fd.el6
Size        : 68 k
Repo        : rhel-6-server-rpms
Summary     : Tool to translate x86-64 CPU Machine Check Exception data.
URL         : https://github.com/andikleen/mcelog.git
License     : GPLv2
Description : mcelog is a daemon that collects and decodes Machine Check Exception data
            : on x86-64 machines.

[bcat ~]$

The above output was obtained on a RHEL6 system.
Logged

sevenlayermuddle

  • Helpful
  • Addicted Kitizen
  • *
  • Posts: 5252
Re: Linux (CentOS) Machine Check Exceptions
« Reply #8 on: January 25, 2020, 11:48:03 PM »

Yes, I 'yum installed' mcelog, but only after the event. Fingers crossed that it helps, should there be a recurrence. Fingers crossed, even more firmly, that there is no recurrence. :D

I still like the idea that a gradual build up of dust on heatsinks and fans might cause high temps. In theory, I think HP's own sensors should have detected such a condition and raised cautions or shut down the system long before an MCE was raised, but there is nothing in the iLO logs to suggest that happened. Still no harm anyway in giving it a good clean, and I will certainly do so. Many thanks to Eric and B'Cat for assistance. I hope I don't come across as ungrateful as actually, I am very grateful. :)

Off topic, the machine's iLO (Integrated Lights Out) management is really very impressive. Accessible via a separate ethernet port on the back panel, it provides access to machine status and event logs, at an OS independent level. Just the ticket for a headless server. Without it in fact, since I had not yet installed mcelog, I'd probably have no idea whatsoever why my machine rebooted on Friday evening. :graduate:
Logged

burakkucat

  • Global Moderator
  • Senior Kitizen
  • *
  • Posts: 31058
  • Over the Rainbow Bridge
    • The ELRepo Project
Re: Linux (CentOS) Machine Check Exceptions
« Reply #9 on: January 26, 2020, 12:22:08 AM »

What I failed to type was that with the mcelog package installed, there should be some documentation installed under the /usr/share/ directory, as well as the usual manual page, etc.

rpm -qd mcelog

(Just in case you have forgotten . . .)
Logged

sevenlayermuddle

  • Helpful
  • Addicted Kitizen
  • *
  • Posts: 5252
Re: Linux (CentOS) Machine Check Exceptions
« Reply #10 on: January 26, 2020, 12:52:13 AM »

Quote
(Just in case you have forgotten . . .)

Trouble with mcelog documentation is, it's actually a bit vague on exactly what help it will provide. I'm by no means certain that it will actually do a great deal to help, or tell me anything more than HP's iLO already told me. :(

From

https://mcelog.org/faq.html#1

Quote
Here is this machine check output. Please tell me what it means

You have to ask your hardware vendor

Quote
How do I decode fatal machine checks?

How do I "run through mcelog --ascii"?

First don't expect too much from decoding them. This often does not actually tell you what really went wrong.

So, not so much a case of having forgotten to read it, more a case of blotting it out, and living in hope. :D
Logged

Weaver

  • Addicted Kitizen
  • *****
  • Posts: 9302
  • Retd sw dev; A&A; 4 × 7km ADSL2; IPv6; Firebrick
Re: Linux (CentOS) Machine Check Exceptions
« Reply #11 on: January 26, 2020, 02:06:53 AM »

I am agreeing here with all that you've said. This is what I would do if it were my box.

I would look at RAM faults first. Does it have ECC RAM? A RAM parity check might have caused it as you say.

It could indeed be a bug in the CPU itself in which case it needs a microcode update. I think you already addressed that with a BIOS update that contains latest fixes for CPU errata.

I need to look things up here: software bugs such as division by zero in ring 0 might be recorded separately under a different category. But if another error of the same type occurs inside the (software) interrupt service routine or trap handler for that exception, then clearly you're dead, as you're in an infinite loop, so a different kind of trap, such as a machine check exception, might be the way to go; I would have to read up on the processor. Corruption of the code inside the kernel, perhaps caused by bad RAM, could also be a culprit.

Temperature, such as a bad processor heatsink (i.e. the thermal bridge to the CPU), failure of a fan, or RAM overheating, would be a good culprit too; also dust. I hoovered out my machines twice a year because the servers sucked in a lot of air and therefore plenty of dust, and I hoovered the room often too. I blasted various areas with a can of compressed air too. I did this for my customers' machines every year as part of a 'service', like a car, as well as checking the WinNT family event log and downloading the updates that had not been installed automatically.

Is there a CPU fan dedicated to sucking air through the CPU heatsink?

I only ever had this class of problem once in one of my machines and that was an overheating Dell. It was positioned badly, in that warm air collected around it. I fixed the problem by getting rid of the machine; I donated it to students for medical experiments.

Apologies for not being very helpful, but my experience with this is luckily limited. Will the manufacturer advise?



Trying to read the thermal figures, I'm assuming they're all better than OK?

I suppose we can reseat the RAM? Blast the RAM with compressed air perhaps?

Has the RAM been upgraded? In that case we point the finger of suspicion at the new RAM.
« Last Edit: January 26, 2020, 02:54:45 AM by Weaver »
Logged

sevenlayermuddle

  • Helpful
  • Addicted Kitizen
  • *
  • Posts: 5252
Re: Linux (CentOS) Machine Check Exceptions
« Reply #12 on: January 26, 2020, 08:39:30 AM »

RAM has never been upgraded. I'm reluctant to reseat the RAM or anything else at this stage, as doing so carries a risk of causing some new instability as well as the possibility of fixing it. Maybe if it starts happening regularly, I may try.

The BIOS update was for the 'recent' Intel processor vulnerabilities. I noticed in the incremental release notes, for an earlier update, a mention of fixing rogue MCEs. Unfortunately the circumstances for such MCEs were quite concisely documented, and emphatically did not apply to my machine.

Temps are all within limits but more to the point, they're monitored by the system firmware. Pretty sure that any transient excesses should have been logged as cautions or critical events in the iLO logs, and the system shut down if appropriate, with eye-catching flashing LEDs on the front panel.

Currently perusing Intel's 64 and IA-32 Architectures Software Developer's Manual, as linked from the Wiki article on MCE. Lots of data in there on MCA, but it is a 676 page document and clearly assumes prior familiarity with things with which I am not remotely familiar. :D
Logged

sevenlayermuddle

  • Helpful
  • Addicted Kitizen
  • *
  • Posts: 5252
Re: Linux (CentOS) Machine Check Exceptions
« Reply #13 on: January 26, 2020, 09:04:39 AM »

This is interesting.   

It's not my machine and not my processor, otherwise it would be downright exciting. But the MCA status that is listed is exactly the same as mine. So, in some circumstances at least, a processor microcode update might be relevant, and might be all that is required to prevent recurrence.

https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=emr_na-a00090359en_us
Logged

Weaver

  • Addicted Kitizen
  • *****
  • Posts: 9302
  • Retd sw dev; A&A; 4 × 7km ADSL2; IPv6; Firebrick
Re: Linux (CentOS) Machine Check Exceptions
« Reply #14 on: January 26, 2020, 09:53:46 AM »

> causing some new instability as well as the possibility of fixing it

Very wise. I had my doubts whilst writing it.
Logged