NUC7PJYH (J5005) - mce: [Hardware Error] #70

Open
fatez opened this issue Jun 29, 2018 · 2 comments


fatez commented Jun 29, 2018

Description of problem:

From time to time the system becomes unstable and several applications report
strange exceptions (system/hardware errors rather than application errors).

dmesg reports this message:

$ dmesg | grep mce
[    0.039982] mce: CPU supports 7 MCE banks
[    0.060928] mce: [Hardware Error]: Machine check events logged
[    0.060932] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: a600000000020408
[    0.060941] mce: [Hardware Error]: TSC 0 ADDR fef4c9e0 
[    0.060949] mce: [Hardware Error]: PROCESSOR 0:706a1 TIME 1530266046 SOCKET 0 APIC 0 microcode 22
$ cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 122
model name	: Intel(R) Pentium(R) Silver J5005 CPU @ 1.50GHz
stepping	: 1
microcode	: 0x22
cpu MHz		: 1997.494
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 24
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti cdp_l2 ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts umip rdpid arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 2995.20
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 122
model name	: Intel(R) Pentium(R) Silver J5005 CPU @ 1.50GHz
stepping	: 1
microcode	: 0x22
cpu MHz		: 1595.178
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 24
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti cdp_l2 ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts umip rdpid arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 2995.20
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 122
model name	: Intel(R) Pentium(R) Silver J5005 CPU @ 1.50GHz
stepping	: 1
microcode	: 0x22
cpu MHz		: 2559.706
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 2
cpu cores	: 4
apicid		: 4
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 24
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti cdp_l2 ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts umip rdpid arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 2995.20
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 122
model name	: Intel(R) Pentium(R) Silver J5005 CPU @ 1.50GHz
stepping	: 1
microcode	: 0x22
cpu MHz		: 2607.391
cache size	: 4096 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 6
initial apicid	: 6
fpu		: yes
fpu_exception	: yes
cpuid level	: 24
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti cdp_l2 ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts umip rdpid arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 2995.20
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

I tried several kernels (4.15.0-23; 4.17.0-041700; 4.17.1-041701; 4.17.2-041702; 4.18.0-041800rc1; 4.18.0-041800rc2), including the latest available:

$ cat /proc/version
Linux version 4.18.0-041800rc2-generic (root@ubuntu) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #201806241430 SMP Fri Jun 29 09:34:50 CEST 2018

Installed microcode files (microcode-20180425):

$ ls /lib/firmware/intel-ucode/
06-03-02  06-06-05  06-08-01  06-0a-01  06-0f-02  06-16-01  06-1c-02  06-26-01  06-3a-09.initramfs  06-3e-06            06-45-01            06-4e-03            06-56-03  06-8e-09  0f-00-0a  0f-02-09  0f-04-04  0f-06-04
06-05-00  06-06-0a  06-08-03  06-0b-01  06-0f-06  06-17-06  06-1c-0a  06-2a-07  06-3c-03            06-3e-07            06-45-01.initramfs  06-4f-01.initramfs  06-56-04  06-8e-0a  0f-01-02  0f-03-02  0f-04-07  0f-06-05
06-05-01  06-06-0d  06-08-06  06-0b-04  06-0f-07  06-17-07  06-1d-01  06-2d-06  06-3c-03.initramfs  06-3f-02            06-46-01            06-55-03            06-56-05  06-9e-09  0f-02-04  0f-03-03  0f-04-08  0f-06-08
06-05-02  06-07-01  06-08-0a  06-0d-06  06-0f-0a  06-17-0a  06-1e-05  06-2d-07  06-3d-04            06-3f-02.initramfs  06-46-01.initramfs  06-55-04            06-5c-09  06-9e-0a  0f-02-05  0f-03-04  0f-04-09
06-05-03  06-07-02  06-09-05  06-0e-08  06-0f-0b  06-1a-04  06-25-02  06-2f-02  06-3d-04.initramfs  06-3f-04            06-47-01            06-56-02            06-5e-03  06-9e-0b  0f-02-06  0f-04-01  0f-04-0a
06-06-00  06-07-03  06-0a-00  06-0e-0c  06-0f-0d  06-1a-05  06-25-05  06-3a-09  06-3e-04            06-3f-04.initramfs  06-47-01.initramfs  06-56-02.initramfs  06-7a-01  0f-00-07  0f-02-07  0f-04-03  0f-06-02
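To see which file in this listing applies to the CPU, the signature from the dmesg line (PROCESSOR 0:706a1) can be split into family, model and stepping. The following is a minimal sketch, assuming the standard CPUID leaf 1 EAX layout and the family-model-stepping naming used in intel-ucode; the function names are made up for illustration.

```python
def decode_signature(sig):
    """Split a CPUID signature (leaf 1 EAX) into (family, model, stepping)."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model only counts for base family 6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


def ucode_filename(sig):
    """Build the intel-ucode style file name (hex family-model-stepping)."""
    family, model, stepping = decode_signature(sig)
    return "{:02x}-{:02x}-{:02x}".format(family, model, stepping)


# Signature taken from the "PROCESSOR 0:706a1" line in the dmesg output.
signature = 0x706A1
print(decode_signature(signature))  # (6, 122, 1)
print(ucode_filename(signature))    # 06-7a-01
```

This agrees with /proc/cpuinfo (family 6, model 122, stepping 1) and points at the 06-7a-01 entry in the listing.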

CPU support check:

mcelog$ if ./mcelog --is-cpu-supported; then echo "CPU is supported!"; else echo "No luck!"; fi
mcelog: Family 6 Model 122 CPU: only decoding architectural errors
CPU is supported!

How reproducible:
Difficult to describe. Some applications seem to upset the system.
The system immediately seems to become unstable, and applications
start showing strange errors indicating that the
system/hardware may have a problem.
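For reference, the Bank 4 status value from the dmesg output (a600000000020408) can be split into its architectural fields by hand. This is only a sketch, assuming the IA32_MCi_STATUS bit layout from the Intel SDM (flag bits 63 to 57, model-specific code in bits 31:16, MCA error code in bits 15:0); it does not interpret any model-specific meaning, and the names are hypothetical.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
STATUS_FLAGS = [
    (63, "VAL"),    # register contents are valid
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # MISC register is valid
    (58, "ADDRV"),  # ADDR register is valid
    (57, "PCC"),    # processor context corrupt
]


def decode_mci_status(status):
    """Return the set flags and the two 16-bit error code fields."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


decoded = decode_mci_status(0xA600000000020408)
print(decoded["flags"])                     # ['VAL', 'UC', 'ADDRV', 'PCC']
print(hex(decoded["model_specific_code"]))  # 0x2
print(hex(decoded["mca_error_code"]))       # 0x408
```

Under that layout the record is a valid, uncorrected error with a valid address and the processor-context-corrupt flag set. What the 0x408 code means on this particular CPU model is not something this sketch can tell.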


vhulagov commented Jul 5, 2018

  1. This is a hardware issue, not an mcelog issue, so questions like this belong on a site such as Server Fault, not here;
  2. Messages with the prefix "mce:" come from the kernel's mcheck component, not from mcelog;
  3. To get detailed information about the Hardware Error, first confirm that the mcelog daemon is running (ps aux|grep mcelog) and that there are messages with an MCE header in /var/log/mcelog or /var/log/syslog


vinc3m1 commented Sep 17, 2018

While I'm seeing the same issue in dmesg/journalctl, I'm not seeing any detailed logs in /var/log/mcelog or /var/log/syslog:

Sep 16 19:51:15 kernel: [    0.043437] mce: CPU supports 7 MCE banks
Sep 16 19:51:15 kernel: [    0.064261] mce: [Hardware Error]: Machine check events logged
Sep 16 19:51:15 kernel: [    0.064265] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: a600000000020408
Sep 16 19:51:15 kernel: [    0.064270] mce: [Hardware Error]: TSC 0 ADDR fef4c9e0 
Sep 16 19:51:15 kernel: [    0.064277] mce: [Hardware Error]: PROCESSOR 0:706a1 TIME 1537152664 SOCKET 0 APIC 0 microcode 28
Sep 16 19:51:15 mcelog[987]: mcelog: Family 6 Model 122 CPU: only decoding architectural errors
Sep 16 19:51:15 mcelog[987]: mcelog: Cannot open /dev/mem for DMI decoding: Operation not permitted

mcelog definitely running:

$ ps aux | grep mcelog
root       987  0.0  0.0  13096  2256 ?        Ss   19:51   0:00 /usr/sbin/mcelog --ignorenodev --daemon --foreground
vince     8902  0.0  0.0  21536  1076 pts/0    S+   20:02   0:00 grep --color=auto mcelog

but no detailed logs are printed:

$ cat /var/log/mcelog 
mcelog: failed to prefill DIMM database from DMI data
mcelog: mcelog server already running

Is this expected?
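To compare the two reports in this thread without relying on mcelog, the kernel lines themselves can be parsed. Below is a rough sketch, assuming the "mce: [Hardware Error]" line format shown in the logs above (hex status, address and microcode, TIME as Unix seconds); the regular expressions and function name are illustrative, not part of any tool.

```python
import re
from datetime import datetime, timezone

BANK_RE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)")
ADDR_RE = re.compile(r"ADDR ([0-9a-f]+)")
PROC_RE = re.compile(r"TIME (\d+) SOCKET \d+ APIC [0-9a-f]+ microcode ([0-9a-f]+)")


def parse_mce_record(lines):
    """Collect the fields of one machine check record from kernel log lines."""
    record = {}
    for line in lines:
        if "[Hardware Error]" not in line:
            continue
        m = BANK_RE.search(line)
        if m:
            record["cpu"] = int(m.group(1))
            record["bank"] = int(m.group(2))
            record["status"] = int(m.group(3), 16)
        m = ADDR_RE.search(line)
        if m:
            record["addr"] = int(m.group(1), 16)
        m = PROC_RE.search(line)
        if m:
            # TIME is assumed to be seconds since the Unix epoch.
            record["time"] = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
            record["microcode"] = int(m.group(2), 16)
    return record


# Lines copied from the two reports in this thread.
june = parse_mce_record([
    "mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: a600000000020408",
    "mce: [Hardware Error]: TSC 0 ADDR fef4c9e0",
    "mce: [Hardware Error]: PROCESSOR 0:706a1 TIME 1530266046 SOCKET 0 APIC 0 microcode 22",
])
sept = parse_mce_record([
    "mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: a600000000020408",
    "mce: [Hardware Error]: TSC 0 ADDR fef4c9e0",
    "mce: [Hardware Error]: PROCESSOR 0:706a1 TIME 1537152664 SOCKET 0 APIC 0 microcode 28",
])

same_error = (june["bank"], june["status"], june["addr"]) == (
    sept["bank"], sept["status"], sept["addr"])
print(same_error)                                          # True
print(hex(june["microcode"]), hex(sept["microcode"]))      # 0x22 0x28
print(june["time"].isoformat())                            # 2018-06-29T09:54:06+00:00
```

Both reports carry the same bank, status and address, while the microcode revision differs (0x22 versus 0x28).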
