#1
During MemTest86+ v1.70 (latest with Win98SE boot floppy) for reliability testing in upgrading used RAM memory in a popular, redeployed IBM xSeries 345 (dual Xeon 2.8GHz, 8670-61X, 100MHz FSB, 2002 era with latest v1.21 BIOS, v1.09 ISMP) 2U rack server, the screen hung about 30 min into the test. Then the box would not boot, dark screen, no BIOS.

The RAM totals 4GB (8GB max with 2GB DIMMs and dual PS); 4 pieces of 1GB IBM FRU 09N4308 / 38L4031 184-pin double-sided DIMMs (DDR 266MHz, PC2100 CL2.5 registered ECC, spec 100MHz 2.5v) with Samsung SDRAM memory chips (K4H510638D-TC80), which got quite hot to the touch during this strenuous testing. This was most evident on the older chips with 2002 date codes, compared with the 2003 ones.
#2
Phil <silicontundra (AT) gmail (DOT) com> wrote in part:
> During MemTest86+ v1.70 (latest with Win98SE boot floppy) for reliability testing in upgrading used RAM memory in a popular, redeployed IBM xSeries 345 (dual Xeon 2.8GHz, 8670-61X, 100MHz FSB, 2002 era with latest v1.21 BIOS, v1.09 ISMP) 2U rack server, the screen hung about 30 min into the test.

What happened is memtest stressed the memory; it failed, but the ECC caught it (so memtest never saw it), and after enough of these [silently filling logs], the machine shut itself down and wouldn't repower until fixed (clear CMOS).
#3
During MemTest86+ v1.70 (latest with Win98SE boot floppy) for reliability testing in upgrading used RAM memory in a popular, redeployed IBM xSeries 345 (dual Xeon 2.8GHz, 8670-61X, 100MHz FSB, 2002 era with latest v1.21 BIOS, v1.09 ISMP) 2U rack server, the screen hung about 30 min into the test. Then the box would not boot, dark screen, no BIOS. Box powers on with Power-on green LED on front panel but otherwise appears totally dead. The box has dual IBM 350 watt power supplies with both green LEDs on in rear.

Have not seen this IBM xSeries server issue discussed when googling the newsgroups and Tek-tips, so this long solution is described here with additional questions for enhancing reliability of legacy IBM servers. Details follow.

Regards, Phil

-----

The RAM totals 4GB (8GB max with 2GB DIMMs and dual PS); 4 pieces of 1GB IBM FRU 09N4308 / 38L4031 184-pin double-sided DIMMs (DDR 266MHz, PC2100 CL2.5 registered ECC, spec 100MHz 2.5v) ...

... with Samsung SDRAM memory chips (K4H510638D-TC80) ...

... which got quite hot-to-the-touch during this strenuous testing. This was quite evident with older chips with 2002 date codes, compared to 2003.
#4
On Sat, 28 Apr 2007 17:25:25 GMT, Robert Redelmeier <redelm (AT) ev1 (DOT) net.invalid> put finger to keyboard and composed:

> Phil <silicontundra (AT) gmail (DOT) com> wrote in part:
>> During MemTest86+ v1.70 (latest with Win98SE boot floppy) for reliability testing in upgrading used RAM memory in a popular, redeployed IBM xSeries 345 (dual Xeon 2.8GHz, 8670-61X, 100MHz FSB, 2002 era with latest v1.21 BIOS, v1.09 ISMP) 2U rack server, the screen hung about 30 min into the test.
>
> What happened is memtest stressed the memory; it failed, but the ECC caught it (so memtest never saw it), and after enough of these [silently filling logs], the machine shut itself down and wouldn't repower until fixed (clear CMOS).

I can't imagine that the paltry few bytes of CMOS RAM, most of which are already in use, would be enough to store more than a handful of such errors. In any case, what is the point of ECC if a system dies when its error log becomes full?

My experience of ECC memory in mainframes is that a computer can run forever, albeit with a minor performance penalty, if RAM errors are limited to a single data bit per word. I can't see why PCs would be any different. In fact the OP can easily test your hypothesis by placing insulation tape over one data bit of each RAM stick's edge connector and then subjecting his machine to normal everyday use.

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.
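
To make the single-bit case concrete, here is a toy SECDED (single-error-correct, double-error-detect) sketch in Python. It uses an extended Hamming(8,4) code purely for illustration. Real registered ECC DIMMs protect 64 data bits with 8 check bits, and the exact code used by the x345's chipset is not stated anywhere in this thread. The names `encode` and `decode` are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of bits).

    Index 0 is the overall parity bit; indices 1, 2 and 4 are Hamming
    parity bits; indices 3, 5, 6 and 7 carry the data.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    # Parity bit p covers every position whose index has bit p set.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)  # work on a copy, leave the stored word untouched
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i
    odd_parity = sum(code) % 2  # 1 means an odd number of bits flipped
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        # Single-bit error; the syndrome is the index of the bad bit
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits are wrong.
        return "uncorrectable", None
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


stored = encode(0b1011)
stored[5] = 0          # hard fault: this cell is stuck at 0
print(decode(stored))  # ('corrected', 11)
stored[6] ^= 1         # a soft error lands in the same word
print(decode(stored))  # ('uncorrectable', None)
```

With one bad bit per word every read still returns the right data, which fits the "run forever" experience described above. The second half of the demo shows the other side: once a word already carries a permanent single-bit fault, one more flipped bit in that word can only be detected, not corrected.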
#5
"Franc Zabkar" <fzabkar (AT) iinternode (DOT) on.net> wrote in message news:55p7335e10c568neuf3lgc677ve9o2li48 (AT) 4ax (DOT) com...

> On Sat, 28 Apr 2007 17:25:25 GMT, Robert Redelmeier <redelm (AT) ev1 (DOT) net.invalid> put finger to keyboard and composed:
>
>> Phil <silicontundra (AT) gmail (DOT) com> wrote in part:
>>> During MemTest86+ v1.70 (latest with Win98SE boot floppy) for reliability testing in upgrading used RAM memory in a popular, redeployed IBM xSeries 345 (dual Xeon 2.8GHz, 8670-61X, 100MHz FSB, 2002 era with latest v1.21 BIOS, v1.09 ISMP) 2U rack server, the screen hung about 30 min into the test.
>>
>> What happened is memtest stressed the memory; it failed, but the ECC caught it (so memtest never saw it), and after enough of these [silently filling logs], the machine shut itself down and wouldn't repower until fixed (clear CMOS).
>
> I can't imagine that the paltry few bytes of CMOS RAM, most of which are already in use, would be enough to store more than a handful of such errors. In any case, what is the point of ECC if a system dies when its error log becomes full?
>
> My experience of ECC memory in mainframes is that a computer can run forever, albeit with a minor performance penalty, if RAM errors are limited to a single data bit per word. I can't see why PCs would be any different. In fact the OP can easily test your hypothesis by placing insulation tape over one data bit of each RAM stick's edge connector and then subjecting his machine to normal everyday use.
>
> - Franc Zabkar

Except that it is undesirable so to do. If there is a persistent hard correctable error, those words are running with their protection against soft errors gone, pretty much anyway (there are tricks). So if one has a hard error, at least a block of memory ought to be deallocated. If one has a hard error or the error rate exceeds a threshold, then all blocks with errors should be deallocated.
#6
I can't imagine that the paltry few bytes of CMOS RAM, most

of which are already in use, would be enough to store more than a handful of such errors. In any case, what is the point of ECC if a system dies when its error log becomes full?

My experience of ECC memory in mainframes is that a computer can run forever, albeit with a minor performance penalty,

if RAM errors are limited to a single data bit per word. I can't see why PCs would be any different. In fact the OP can easily test your hypothesis by placing insulation tape over one data bit of each RAM stick's edge connector and then subjecting his machine to normal everyday use.
#7
Franc Zabkar <fzabkar (AT) iinternode (DOT) on.net> wrote in part:
> I can't imagine that the paltry few bytes of CMOS RAM, most

I wasn't aware of any limit to CMOS RAM. Most systems have little, but a very low-level designer could put in more, probably at a different port address. BIOS isn't fixed.

> of which are already in use, would be enough to store more than a handful of such errors. In any case, what is the point of ECC if a system dies when its error log becomes full?

Avoiding error! In many business apps, errors are worse than downtime. Keeping a suspect machine up that could be propagating errors and enshrining them in a database is a DB admin's worst nightmare.

> My experience of ECC memory in mainframes is that a computer can run forever, albeit with a minor performance penalty,

So long as the errors are rare and not localized.

It also depends very much on the calcs. A scientific machine doing iterative calcs could probably tolerate/heal error much better than an accounting package running integers.

> if RAM errors are limited to a single data bit per word. I can't see why PCs would be any different.
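
The deallocation rule suggested earlier in the thread (retire memory on a hard error, or when the correctable-error rate exceeds a threshold) can be sketched roughly as follows. This is only a toy model: the class name, the sliding-window definition of "rate", and the idea that the caller already knows whether an error is hard are all assumptions for illustration, not a description of what the x345's BIOS or service processor actually does.

```python
from collections import deque


class CorrectableErrorPolicy:
    """Toy policy: retire every block that has logged an error once a
    hard error is seen or too many corrections fall inside a time window."""

    def __init__(self, max_errors, window_s):
        self.max_errors = max_errors  # corrections tolerated per window
        self.window_s = window_s      # window length in seconds
        self.recent = deque()         # timestamps of recent corrections
        self.suspect = set()          # blocks that have logged any error
        self.retired = set()          # blocks already taken out of use

    def record(self, t, block, hard=False):
        """Log one corrected error at time t (seconds) in `block`.

        Returns the set of blocks newly retired by this event.
        """
        self.suspect.add(block)
        self.recent.append(t)
        # Drop corrections that have aged out of the window.
        while self.recent and t - self.recent[0] > self.window_s:
            self.recent.popleft()
        if hard or len(self.recent) > self.max_errors:
            newly = self.suspect - self.retired
            self.retired |= newly
            return newly
        return set()


policy = CorrectableErrorPolicy(max_errors=2, window_s=60)
policy.record(0, block=1)
policy.record(10, block=2)
print(policy.record(20, block=3))  # third correction inside 60 s trips the threshold
```

Rare, scattered corrections never trip the threshold and the machine keeps running. A burst of corrections, or a single error flagged as hard, retires every block that has shown trouble, which is the trade-off between uptime and data integrity being argued over here.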
#8
Tested with a slightly older version, MemTest86+ v1.65, on the original IBM x345 server OEM 256MB DDR SDRAM DIMMs in the server's slots 1 and 2. The results were again similar, with the box hanging after about 45 minutes of testing and again requiring an R&R of the CMOS battery before it would boot again. The memory is manufactured by Micron Tech, with two different date codes. Again my conclusion is that there is a thermally related failure mode; the older 2002 date codes fail first, presumably because they were fabricated with older process technology that results in higher power consumption.

My conclusion is that gamer-type RAM coolers (convection heat sinks) are required to reduce memory reliability issues with IBM OEM DIMM memory in their legacy 2002 xSeries servers, even though the 2U servers are quite well designed with dual redundant banks of 4 fans across the cross-section of the chassis (wind-tunnel type design). Any SEs concur?

Regards, Phil

Details follow:

IBM memory P/N 38L4029 FRU 09N4306, 2 sticks of 256MB PC2100 CL2.5 2.5v registered ECC, double sided, organized 32M x 72.

The older Micron Tech PC2100A-25330-M1 DIMM with 18 chips 46V32M4-75A, date code late 2002. This pair hung 32 min into testing cycle with each pass taking 11 min for 512MB, or 2 1/2 passes.

The newer Micron Tech PC2100A-25331-Z DIMM with 18 chips 46V32M4-75B, date code mid 2003. This pair hung 49 min (similar to newer 1GB DIMM) into testing cycle, or just over 4 passes. Hung at Test5, Block move. The chips were almost too hot to touch with the pinky finger.

Samsung parts indeed are K4510638D-TB80, 8ns parts, with sufficient design bandwidth margin. DDR clocking gives the 266MHz operation to the Xeon processors with their 533MHz FSB. My feeling on memory chip power consumption is that it is more than the 1.5 watts spec.

3) DC: not using IBM's ChipKill technology DIMMs, so deallocation of a block of memory space is not effected.

4) FZ#4: again not using IBM ChipKill DIMMs. I'll again refer you to IBM's White paper on ChipKill.

5) FZ, RR, the HW Manual references p34, 94 do not pertain to problem at hand. Same with User Manual, p5, 6.
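
For the 266MHz versus 533MHz FSB remark, the nominal peak rates can be worked out with a few lines of Python. This is back-of-envelope arithmetic only: it assumes a 64-bit (8-byte) data path on both the DIMM and the Xeon front-side bus, and uses the nominal 266 and 533 mega-transfers per second quoted in the thread. Whether the x345 interleaves the two DIMMs of a pair is my assumption, not something stated here.

```python
def peak_mb_per_s(mega_transfers_per_s, bus_width_bytes=8):
    """Nominal peak bandwidth in MB/s for a bus of the given width."""
    return mega_transfers_per_s * bus_width_bytes


dimm = peak_mb_per_s(266)  # one PC2100 DIMM, 64 data bits
fsb = peak_mb_per_s(533)   # Xeon front-side bus, 64 data bits
print(dimm, fsb, fsb / dimm)
```

One PC2100 DIMM comes out at about half the FSB's nominal peak, so a pair read in parallel would roughly match it. That is one reason to suspect the DIMMs are meant to be installed and exercised in pairs, as they were in the tests above.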