#1
|o o| | Ant's Quality Foraged Links: http://aqfl.net \ _ / Nuke ANT from e-mail address: philpi (AT) earthlink (DOT) netANT |
#2
Hello.

Lately, I have been getting random and rare kernel panics on my old Debian/Linux box (tried both kernel versions 2.6.30 and 2.6.32). I couldn't figure out what it was until I discovered mcelog a couple of days ago, and it revealed some interesting, scary data in my dmesg/messages and syslog:

# cat /var/log/messages
...
Mar 7 08:25:24 MyLinuxBox kernel: [ 3299.988026] Machine check events logged
Mar 7 08:25:24 MyLinuxBox mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Mar 7 08:25:24 MyLinuxBox mcelog: Please contact your hardware vendor
Mar 7 08:25:24 MyLinuxBox mcelog: MCE 0
Mar 7 08:25:24 MyLinuxBox mcelog: CPU 1 1 instruction cache
Mar 7 08:25:24 MyLinuxBox mcelog: ADDR c11b6ff0
Mar 7 08:25:24 MyLinuxBox mcelog: TIME 1267979124 Sun Mar 7 08:25:24 2010
Mar 7 08:25:24 MyLinuxBox mcelog: TLB parity error in virtual array
Mar 7 08:25:24 MyLinuxBox mcelog: TLB error 'instruction transaction, level 1'
Mar 7 08:25:24 MyLinuxBox mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 08:25:24 MyLinuxBox mcelog: MCGCAP 105 APICID 1 SOCKETID 0
Mar 7 08:25:24 MyLinuxBox mcelog: CPUID Vendor AMD Family 15 Model 43

I am not familiar with hardware, so I assume this is very bad, but which part(s) is/are bad? Is my old Athlon 64 X2 CPU dying/damaged? I have had it and its motherboard since 12/24/2006, so it is not that old yet. I have the full details on my secondary machine at http://alpha.zimage.com/~ant/antfarm.../computers.txt ...

Although, this might be related to the PSU's death back in early December 2009. My friend and I believe it also took out my EVGA GeForce 8800 GT video card and damaged a 512 MB piece of RAM (tested the 3 GB together and then each piece on its own with memtest86+ v4.00 to narrow it down). http://alpha.zimage.com/~ant/antfarm/about/toys.html has a log of the details of my systems.

I did run memtest86+ again a couple of weeks ago and this morning for 5-6 hours, and got no errors after five full tests (passed). I also do not overclock.

Thank you in advance.
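For anyone who wants to read the STATUS line by hand, here is a minimal sketch of a decoder. It assumes the usual machine-check status layout as I understand it (flag bits 63 down to 57, and a low 16-bit error code where TLB errors follow the pattern 0000 0000 0001 TTLL). The function name `decode_status` and the tables are made up for illustration; this is not mcelog's own code.

```python
# Flag bits in the top byte of a machine-check STATUS value
# (assumed layout; verify against the processor manual).
FLAG_BITS = {
    "VAL": 63,    # the register holds a valid error record
    "OVER": 62,   # an earlier error record was overwritten
    "UC": 61,     # the error was NOT corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # the MISC register holds extra information
    "ADDRV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}

TLB_TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
TLB_LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic"}


def decode_status(status_hex):
    """Decode a STATUS value given as a hex string (hypothetical helper)."""
    status = int(status_hex, 16)
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "error_code": code}
    # TLB error codes look like 0000 0000 0001 TTLL.
    if (code & 0xFFF0) == 0x0010:
        result["kind"] = "TLB"
        result["transaction"] = TLB_TRANSACTION.get((code >> 2) & 0x3, "reserved")
        result["level"] = TLB_LEVEL[code & 0x3]
    return result


info = decode_status("9400000000010011")
print(info)
```

With the value from the log above, the decoded transaction type and level agree with the "instruction transaction, level 1" text that mcelog printed, and the UC flag is not set. If the assumed layout is right, that would mean the hardware logged this particular event as something other than an uncorrected error, though that alone does not say whether the CPU is healthy.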
#3
(Following up on my post from 3/7/2010 8:59 AM PT.)

I also ran sys_basher (http://www.polybus.com/sys_basher_web/) on my Debian box a few times in the past and again just now. No errors or crashes.
#4
http://en.wikipedia.org/wiki/Transla...okaside_buffer

TLB stands for Translation Lookaside Buffer. It translates virtual addresses to physical addresses. And apparently, according to the AMD documentation, it is protected by parity. It is part of the processor.

A question would be: if it was a real error, why weren't there crash symptoms or side effects? If an incorrect mapping from virtual space to physical occurred, you'd think there would be consequences. (Maybe the entry is automatically invalidated and reloaded via a page table walk?)
The AMD processor apparently has BIST, or built-in self test, for memory structures inside the processor. This document is not at all clear on whether you'd have that implemented on a typical desktop motherboard. It is an optional operation that would occur early after power-up. It would allow bad internal memory inside the processor to be detected before the computer boots. There is a bit in a special register that contains the test result, if the test was triggered.

(Section 14.1.1, PDF page 395, "Programmers Manual Vol.2")
http://support.amd.com/us/Processor_TechDocs/24593.pdf
#5
Now, this is over my head. Is there a way to test this with software? Does memtest86+ v4.00 test for this? I already tried compiling, unraring 10+ GB of data, running sys_basher, and memtest86+ v4.00 (passed a few weeks ago + this morning = five tests total). It doesn't seem to be related to stress/overload or temperatures, since most of the kernel panics happened when the machine was mostly idle!
#6
That entry in the manual means there is a way to test that section of the processor. But I'm not aware of any software that does things like that. And because the 24593 document didn't say what triggers the test, I can't comment on whether a motivated person could even write some code to do it. Maybe there are one or more pins on the processor that have to be set up for that. I could see a server motherboard maker, perhaps, going the extra mile (doing a basic test on the processor before completing POST).

The pinout for the AM2 socket isn't publicly available. This site says the document needed is 31117.pdf, but you can't download that. So there is no way to look for any pins with "interesting" names.

http://www.sandpile.org/docs/amd/k8.htm

My guess is that a program like memtest86+ isn't going to specifically target things like the TLB while it tests main memory. It's possible a small number of entries in the TLB were loaded by the BIOS, for perhaps a linear mapping of some sort, and memtest86+ relies on that for what it does. You'd have to look at the source for memtest86+ to see what it does.

I read a claim a couple of days ago that memtest86+ uses PAE, and that should be a mapping trick as well. That is how a 32-bit executable can be used to test system memory totals greater than 4 GB. It could test 4 GB at a time, and change mappings to access a different 4 GB block of memory.
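To make the "one block at a time" idea concrete, here is a rough sketch of the bookkeeping only. It is not taken from the memtest86+ source; the function name `physical_windows` is invented, and the 4 GB window size is just the figure used above (a real tester may well use a smaller window, since its own code and data need address space too).

```python
def physical_windows(total_bytes, window_bytes=4 * 1024**3):
    """Yield (start, end) physical address ranges, one window at a time.

    'end' is exclusive. A real tester would remap its page tables to
    each window before testing it; here we only compute the ranges.
    """
    start = 0
    while start < total_bytes:
        end = min(start + window_bytes, total_bytes)
        yield start, end
        start = end


GIB = 1024**3
# Example: a machine with 6 GB installed is covered in two windows.
for start, end in physical_windows(6 * GIB):
    print(f"map and test 0x{start:09x} .. 0x{end - 1:09x}")
```

The point is only that the tester never needs to see all of memory at once; it walks through it window by window and changes the mapping in between.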
#7
Ah, interesting. Thanks.
#8
(Following up on my reply from 3/8/2010 8:21 AM PT.)

Last night, I ran memtest86+ v4.00's test #9. http://www.memtest86.com/tech.html#descri says:

"Test 9 [Bit fade test, 90 min, 2 patterns]
The bit fade test initializes all of memory with a pattern and then sleeps for 90 minutes. Then memory is examined to see if any memory bits have changed. All ones and all zero patterns are used. This test takes 3 hours to complete. The Bit Fade test is not included in the normal test sequence and must be run manually via the runtime configuration menu."

I ran it for a little over 3.25 hours and it passed (only one pass). Shouldn't this test catch that problem? Or is the TLB somewhere else? Maybe I need to run it longer and more often?

Also, I did a cat /var/log/messages | grep mcelog and posted the long log at http://pastie.org/867602 ... Check out those mcelog errors.

The author of cpuburn told me to try seven and 37 instances of "nice -19 ./burnMMX P &", separately. I ran them for many hours, and no problems. I am starting to notice that the errors and kernel panics seem to occur only when my system is idle (again, not using Cool'n'Quiet).
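For a long log, the grep output can be tallied instead of read line by line. Below is a small sketch that assumes the syslog line format shown in the first post; the names `summarize` and `SAMPLE` are made up here, and the sample lines are copied from that post rather than from the pastie log.

```python
import re
from collections import Counter

# A few lines in the same format as the log in the first post.
SAMPLE = """\
Mar 7 08:25:24 MyLinuxBox kernel: [ 3299.988026] Machine check events logged
Mar 7 08:25:24 MyLinuxBox mcelog: HARDWARE ERROR. This is *NOT* a software problem!
Mar 7 08:25:24 MyLinuxBox mcelog: MCE 0
Mar 7 08:25:24 MyLinuxBox mcelog: CPU 1 1 instruction cache
Mar 7 08:25:24 MyLinuxBox mcelog: TLB parity error in virtual array
Mar 7 08:25:24 MyLinuxBox mcelog: STATUS 9400000000010011 MCGSTATUS 0
Mar 7 08:25:24 MyLinuxBox mcelog: CPUID Vendor AMD Family 15 Model 43
"""

MCELOG_LINE = re.compile(r"mcelog: (.*)$")


def summarize(lines):
    """Count 'CPU ...' and 'STATUS ...' lines among the mcelog entries."""
    cpus = Counter()
    statuses = Counter()
    for line in lines:
        match = MCELOG_LINE.search(line)
        if not match:
            continue  # not an mcelog line (e.g. a kernel line)
        text = match.group(1).strip()
        if text.startswith("CPU "):        # trailing space skips "CPUID ..."
            cpus[text] += 1
        elif text.startswith("STATUS "):
            statuses[text.split()[1]] += 1
    return cpus, statuses


cpus, statuses = summarize(SAMPLE.splitlines())
print(cpus)
print(statuses)
```

To run it on the real file, pass an open file object instead of the sample, e.g. `summarize(open("/var/log/messages"))`. If every event turns out to name the same CPU core and the same STATUS value, that would at least show whether the problem is confined to one core.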
#9
The TLB is part of your processor. It converts virtual addresses into physical addresses, and it involves a small memory to store the entries. To test it, you'd need test software specifically designed to verify that it can hold entries, and that the entries point to the right physical locations. I haven't read of any programs that do that specifically.

The processor memory BIST function is an example of a "structural" test, which is used to verify that a chunk of hardware works. When we talk about running test programs on the computer, those are "functional" tests. It can be much harder to get good test coverage using nothing but functional tests.

If you want a test case that can make the situation worse, you'd need a test program with a random access characteristic, something which causes so many TLB entries to be used that there are lots of page table walks, swapping out of least recently used entries in the TLB, and so on.

You can see someone characterizing a TLB here. I think the program they're using runs under Windows.

http://ixbtlabs.com/articles2/rmma/rmma-dothan.html

RMMA
http://cpu.rightmark.org/download.shtml

I have no idea how a TLB parity error would show up under Windows. A machine check exception? Or something in the Event Viewer?
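As a rough illustration of the "random access characteristic" idea, the sketch below touches one byte in each page of a buffer in shuffled order, so consecutive accesses land on different pages. The function name `touch_pages_randomly` is invented for this example. Some caveats: it exercises data accesses, while the error in the log was reported against the instruction side; Python's interpreter overhead means it is nowhere near as aggressive as a C program would be; and it cannot see a TLB parity error itself, since that would still be reported by the hardware through mcelog.

```python
import mmap
import random


def touch_pages_randomly(num_pages, passes=1, seed=0):
    """Write, then read back, one byte per page in random page order.

    Returns the number of read-back mismatches (normally 0).
    """
    page = mmap.PAGESIZE
    buf = mmap.mmap(-1, num_pages * page)  # anonymous memory mapping
    order = list(range(num_pages))
    rng = random.Random(seed)
    mismatches = 0
    try:
        for _ in range(passes):
            rng.shuffle(order)
            for p in order:
                buf[p * page] = p & 0xFF          # first byte of page p
            rng.shuffle(order)
            for p in order:
                if buf[p * page] != (p & 0xFF):
                    mismatches += 1
    finally:
        buf.close()
    return mismatches


print(touch_pages_randomly(64, passes=2))
```

To have any chance of forcing page table walks, `num_pages` would need to be far larger than the number of entries the TLB holds, and the run would need to go on for a long time while you watch the mcelog output.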
