HighDots.NET Computer Hardware Forums  

Re: IBM x345 Server goes black during memory test of Samsung DIMMs

Hardware Chips Processor, cache, memory chips, etc. (comp.sys.ibm.pc.hardware.chips)





#1
Robert Redelmeier

Re: IBM x345 Server goes black during memory test of Samsung DIMMs - 04-28-2007, 12:25 PM






Phil <silicontundra (AT) gmail (DOT) com> wrote in part:
Quote:
> During MemTest86+ v1.70 (latest with Win98SE boot floppy) for
> reliability testing in upgrading used RAM memory in a popular,
> redeployed IBM xSeries 345 (dual Xeon 2.8GHz, 8670-61X, 100MHz
> FSB, 2002 era with latest v1.21 BIOS, v1.09 ISMP) 2U rack server,
> the screen hung about 30 min into the test. Then the box would
> [...]

No surprise. Please remember we are talking servers here,
and reliability (correct answers) is often considered more
important than uptime. And maintenance is assumed available.

What happened is that memtest stressed the memory; it failed, but
the ECC caught it (so memtest never saw it) and after enough
of these [silently filling logs], the machine shut itself
down and wouldn't repower until fixed (clear CMOS).

Quote:
> The RAM totals 4GB (8GB max with 2GB DIMMs and dual PS);
> 4 pieces of 1GB IBM FRU 09N4308 / 38L4031 184-pin double-sided
> DIMMs (DDR 266MHz, PC2100 CL2.5 registered ECC, spec 100MHz 2.5v)
> with Samsung SDRAM memory chips (K4H510638D-TC80) which got quite
> hot-to-the-touch during this strenuous testing. This was quite
> evident with older chips with 2002 date codes, compared to 2003.

Are you sure you have clean filters, dusted PSUs and have good
clearances around ducts? You may have some work to duplicate the
design conditions if there was an auxiliary airmover or cabinet.

I don't know the failure mechanism of RAM, but I was surprised
at the number of failures reported recently here. RAM is second
only to HDs, and not by much.

-- Robert




#2
Franc Zabkar

Re: IBM x345 Server goes black during memory test of Samsung DIMMs - 04-28-2007, 07:36 PM






On Sat, 28 Apr 2007 17:25:25 GMT, Robert Redelmeier
<redelm (AT) ev1 (DOT) net.invalid> put finger to keyboard and composed:

Quote:
> Phil <silicontundra (AT) gmail (DOT) com> wrote in part:
>
>> During MemTest86+ v1.70 (latest with Win98SE boot floppy) for
>> reliability testing in upgrading used RAM memory in a popular,
>> redeployed IBM xSeries 345 (dual Xeon 2.8GHz, 8670-61X, 100MHz
>> FSB, 2002 era with latest v1.21 BIOS, v1.09 ISMP) 2U rack server,
>> the screen hung about 30 min into the test.
>
> What happened is memtest stressed the memory, it failed but
> the ECC caught it (so memtest never saw it) and after enough
> of these [silently filling logs], the machine shut itself
> down and wouldn't repower until fixed (clear CMOS).

I can't imagine that the paltry few bytes of CMOS RAM, most of which
are already in use, would be enough to store more than a handful of
such errors. In any case, what is the point of ECC if a system dies
when its error log becomes full?

My experience of ECC memory in mainframes is that a computer can run
forever, albeit with a minor performance penalty, if RAM errors are
limited to a single data bit per word. I can't see why PCs would be
any different. In fact the OP can easily test your hypothesis by
placing insulation tape over one data bit of each RAM stick's edge
connector and then subjecting his machine to normal everyday use.
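
For anyone who wants to see the single-bit claim in action, here is a minimal Python sketch of a single-error-correcting, double-error-detecting (SEC-DED) code. It is a toy (8,4) extended Hamming code chosen for brevity, not the wider code a real ECC DIMM uses, and the function names `secded_encode` and `secded_decode` are made up for this illustration.

```python
def secded_encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    code[0] = sum(code[1:]) % 2             # makes the total number of 1s even
    return code


def secded_decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions holding a 1
    parity_ok = sum(code) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        code[syndrome] ^= 1             # one bad bit; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"    # even parity but non-zero syndrome: two bad bits
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

Forcing one bit position to 0 in every codeword (the "tape over one data bit" experiment) still decodes every value correctly, while any two flipped bits in the same word come back as "uncorrectable". That second case is also why a permanent single-bit fault matters: it uses up the one correction the word has.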

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.


#3
Franc Zabkar

Re: IBM x345 Server goes black during memory test of Samsung DIMMs - 04-28-2007, 07:36 PM



On 27 Apr 2007 14:22:08 -0700, Phil <silicontundra (AT) gmail (DOT) com> put
finger to keyboard and composed:

Quote:
> During MemTest86+ v1.70 (latest with Win98SE boot floppy) for
> reliability testing in upgrading used RAM memory in a popular,
> redeployed IBM xSeries 345 (dual Xeon 2.8GHz, 8670-61X, 100MHz FSB,
> 2002 era with latest v1.21 BIOS, v1.09 ISMP) 2U rack server, the
> screen hung about 30 min into the test. Then the box would not boot,
> dark screen, no BIOS. Box powers on with Power-on green LED on front
> panel but otherwise appears totally dead. The box has dual IBM 350
> watt power supplies with both green LEDs on in rear.
>
> Have not seen this IBM xSeries server issue discussed when googling
> the newsgroups and Tek-tips, so this long solution is described here
> with additional questions for enhancing reliability of legacy IBM
> servers. Details follow. Regards, Phil
>
> -----
>
> The RAM totals 4GB (8GB max with 2GB DIMMs and dual PS); 4 pieces of
> 1GB IBM FRU 09N4308 / 38L4031 184-pin double-sided DIMMs (DDR 266MHz,
> PC2100 CL2.5 registered ECC, spec 100MHz 2.5v) ...

Are you perhaps unintentionally overclocking your RAM? According to
the datasheet, 100MHz is the rated CL2 speed. To achieve 133MHz at CL2
you need "A2" parts.

See pages 3 and 4 of this document:
http://www.datasheetarchive.com/data...rticle=1875962
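
The relation between the datasheet cycle times and the clock figures quoted in this thread is simple arithmetic, sketched below in Python. The helper names are invented for this example, and the 8-byte (64-bit) data bus is the usual DDR DIMM width rather than something stated in the thread.

```python
def clock_mhz(cycle_ns):
    """Clock frequency in MHz for a given minimum cycle time in nanoseconds."""
    return 1000.0 / cycle_ns


def ddr_peak_mb_per_s(clock, bus_bytes=8):
    """Peak DDR transfer rate: two transfers per clock across the data bus."""
    return clock * 2 * bus_bytes


# Cycle times taken from the speed grades quoted from the datasheet in this thread.
for grade, cycle_ns in (("10ns parts", 10.0), ("7.5ns parts", 7.5)):
    mhz = clock_mhz(cycle_ns)
    print(f"{grade}: {mhz:.0f} MHz clock, {ddr_peak_mb_per_s(mhz):.0f} MB/s peak")
```

A 7.5ns cycle works out to roughly 133MHz, which doubled gives the "DDR 266" figure and about 2100 MB/s, hence PC2100. A 10ns cycle corresponds to 100MHz.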

Quote:
> ... with Samsung SDRAM
> memory chips (K4H510638D-TC80) ...

I believe that should be "TCB0", which codes for TSOP package,
commercial temperature, normal power, and a speed of 7.5ns@CL2.5.

The "D" indicates a 5th generation part.

Quote:
> ... which got quite hot-to-the-touch during
> this strenuous testing. This was quite evident with older chips with
> 2002 date codes, compared to 2003.

The datasheet stipulates a maximum power dissipation of 1.5W per chip.

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.


#4
Del Cecchi

Re: IBM x345 Server goes black during memory test of Samsung DIMMs - 04-28-2007, 08:16 PM




"Franc Zabkar" <fzabkar (AT) iinternode (DOT) on.net> wrote

Quote:
> [...]
> I can't imagine that the paltry few bytes of CMOS RAM, most of which
> are already in use, would be enough to store more than a handful of
> such errors. In any case, what is the point of ECC if a system dies
> when its error log becomes full?
>
> My experience of ECC memory in mainframes is that a computer can run
> forever, albeit with a minor performance penalty, if RAM errors are
> limited to a single data bit per word. I can't see why PCs would be
> any different. In fact the OP can easily test your hypothesis by
> placing insulation tape over one data bit of each RAM stick's edge
> connector and then subjecting his machine to normal everyday use.

Except that it is undesirable so to do. If there is a persistent hard
correctable error, those words are running with their protection against
soft errors pretty much gone anyway. (There are tricks.)

So if one has a hard error, at least a block of memory ought to be
deallocated. If one has a hard error or the error rate exceeds a
threshold, then all blocks with errors should be deallocated.
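
One possible reading of that policy, as a minimal Python sketch. Everything here is an assumption made for illustration: the class name, the 4 KB block size, and the use of a simple error count in place of a true error rate. It is not how the x345's service processor is documented to behave.

```python
class ErrorTracker:
    """Toy bookkeeping for correctable memory errors, per block of memory."""

    def __init__(self, block_size=4096, max_errors=8):
        self.block_size = block_size    # hypothetical deallocation granularity
        self.max_errors = max_errors    # hypothetical threshold (count, not rate)
        self.error_blocks = set()       # every block that has reported an error
        self.retired = set()            # blocks taken out of use
        self.total = 0

    def report(self, address, hard):
        """Record one corrected error and return the sorted list of retired blocks."""
        block = address // self.block_size
        self.error_blocks.add(block)
        self.total += 1
        if hard:
            # A hard (persistent) error retires at least its own block.
            self.retired.add(block)
        if self.total > self.max_errors:
            # Past the threshold, retire every block that has shown an error.
            self.retired |= self.error_blocks
        return sorted(self.retired)
```

With `max_errors=2`, a soft error at 0x1000 retires nothing, a hard error at 0x5000 retires block 5, and a third error at 0x9000 pushes the count over the threshold so blocks 1, 5 and 9 are all retired.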




#5
Franc Zabkar

Re: IBM x345 Server goes black during memory test of Samsung DIMMs - 04-30-2007, 01:31 AM



On Sat, 28 Apr 2007 20:16:23 -0500, "Del Cecchi"
<delcecchiofthenorth (AT) gmail (DOT) com> put finger to keyboard and composed:

Quote:
> "Franc Zabkar" <fzabkar (AT) iinternode (DOT) on.net> wrote in message
> news:55p7335e10c568neuf3lgc677ve9o2li48 (AT) 4ax (DOT) com...
>> [...]
>> My experience of ECC memory in mainframes is that a computer can run
>> forever, albeit with a minor performance penalty, if RAM errors are
>> limited to a single data bit per word. I can't see why PCs would be
>> any different. In fact the OP can easily test your hypothesis by
>> placing insulation tape over one data bit of each RAM stick's edge
>> connector and then subjecting his machine to normal everyday use.
>
> Except that it is undesirable so to do. If there is a persistent hard
> correctable error, those words are running with their protection against
> soft errors pretty much gone anyway. (There are tricks.)
>
> So if one has a hard error, at least a block of memory ought to be
> deallocated. If one has a hard error or the error rate exceeds a
> threshold, then all blocks with errors should be deallocated.

This appears to be the HW Maintenance Manual for the OP's machine:

ftp://ftp.software.ibm.com/systems/s...df/48p9718.pdf

AIUI, the Integrated System Management processor ("service processor")
is able to deallocate faulty DIMM banks on the fly (see page 94).

Page 6 of the user manual ...

ftp://ftp.software.ibm.com/systems/s...df/88p9189.pdf

... confirms that the server incorporates "memory scrubbing and
Predictive Failure Analysis".

Furthermore, page 5 of the same manual states that "the memory
controller also provides Chipkill™ memory protection if all DIMMs are
of the type x4. Chipkill memory protection is a technology that
protects the system from a single chip failure on a DIMM."

I notice also that there is a diagnostic LED which reports when the
error log is more than 75% full (see page 34 of the HW manual). Is RR
onto something?

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.


#6
Robert Redelmeier

Re: IBM x345 Server goes black during memory test of Samsung DIMMs - 04-30-2007, 07:43 AM



Franc Zabkar <fzabkar (AT) iinternode (DOT) on.net> wrote in part:
Quote:
> I can't imagine that the paltry few bytes of CMOS RAM, most

I wasn't aware of any limit to CMOS RAM. Most systems have
little, but a very low-level designer could put in more,
probably at a different port address. BIOS isn't fixed.

Quote:
> of which are already in use, would be enough to store more
> than a handful of such errors. In any case, what is the point
> of ECC if a system dies when its error log becomes full?

Avoiding error! In many business apps, errors are worse
than downtime. Keeping a suspect machine up that could be
propagating errors and enshrining them in a database is a
DB admin's worst nightmare.

Quote:
> My experience of ECC memory in mainframes is that a computer
> can run forever, albeit with a minor performance penalty,

So long as the errors are rare and not localized. It also
depends very much on the calcs. A scientific machine doing
iterative calcs could probably tolerate/heal error much
better than an accounting package running integers.

Quote:
> if RAM errors are limited to a single data bit per word. I
> can't see why PCs would be any different. In fact the OP
> can easily test your hypothesis by placing insulation tape
> over one data bit of each RAM stick's edge connector and
> then subjecting his machine to normal everyday use.

Oh, that'd be messy. The connectors have too much pressure
and too little clearance. I'd expect connector damage unless
just the right [Mylar?] tape was used.

-- Robert



#7
Franc Zabkar

Re: IBM x345 Server goes black during memory test of Samsung DIMMs - 05-01-2007, 05:03 PM



On Mon, 30 Apr 2007 12:43:41 GMT, Robert Redelmeier
<redelm (AT) ev1 (DOT) net.invalid> put finger to keyboard and composed:

Quote:
> Franc Zabkar <fzabkar (AT) iinternode (DOT) on.net> wrote in part:
>> I can't imagine that the paltry few bytes of CMOS RAM, most
>
> I wasn't aware of any limit to CMOS RAM. Most systems have
> little, but a very low-level designer could put in more,
> probably at a different port address. BIOS isn't fixed.

Many (most?) systems now have 256 bytes of CMOS RAM. AFAIK, the first
128 bytes are accessed via ports 70/71h, and the next 128 bytes via
ports 72/73h.
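
To make that layout concrete, here is a small Python sketch that maps a CMOS byte offset to its index/data port pair. It performs no actual port I/O. The function name is invented for this example, and the assumption that the second bank's index restarts at 0 reflects common chipsets, not anything confirmed for the x345.

```python
def cmos_ports(offset):
    """Return (index_port, data_port, index) for a CMOS RAM byte offset 0..255.

    Access is a two-step handshake: write `index` to the index port,
    then read or write the byte through the data port.
    """
    if not 0 <= offset < 256:
        raise ValueError("offset must be in the range 0..255")
    if offset < 128:
        return 0x70, 0x71, offset       # first bank
    return 0x72, 0x73, offset - 128     # second bank (index assumed to restart at 0)
```

For example, `cmos_ports(0x10)` gives `(0x70, 0x71, 0x10)` and `cmos_ports(200)` gives `(0x72, 0x73, 72)`. Either way the address space tops out at 256 bytes, which is the point being made about log capacity.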

I suppose it's possible to have more CMOS RAM, but it could also be
that the Integrated System Management Processor has its own RAM or
EEPROM. FWIW, other IBM server products appear to write their error
logs to "NVRAM", which in PC terms usually refers to an EEPROM.

Quote:
>> of which are already in use, would be enough to store more
>> than a handful of such errors. In any case, what is the point
>> of ECC if a system dies when its error log becomes full?
>
> Avoiding error! In many business apps, errors are worse
> than downtime. Keeping a suspect machine up that could be
> propagating errors and enshrining them in a database is a
> DB admin's worst nightmare.

Not in my experience. The performance penalty of a faulty memory bit
usually amounted to no more than an extra clock cycle. Taking a
mainframe out of service for a non-fatal error would have meant that
up to a dozen workstations would have been idle. Furthermore, many
servers run 24/7 doing batch jobs.

The whole point of ECC, especially in servers, is to provide a fault
tolerant system. If the error log is full, then the machine should
alert the operator, but that's all. In fact the OP's machine does
indicate when the log is 75% full.

Quote:
>> My experience of ECC memory in mainframes is that a computer
>> can run forever, albeit with a minor performance penalty,
>
> So long as the errors are rare and not localized.

Not true. With ECC you can have a dead bit at *every* address in
*every* memory module and still have a functioning system. It's only
when you have a multi-bit error that the system can break down.

See the references to ChipKill and "memory scrubbing" in IBM's
documentation.

Quote:
> It also
> depends very much on the calcs. A scientific machine doing
> iterative calcs could probably tolerate/heal error much
> better than an accounting package running integers.

I don't see it.

Quote:
>> if RAM errors are limited to a single data bit per word. I
>> can't see why PCs would be any different.

It may help to know which chipset is detected by memtest86+. I found
one URL which suggests that the OP's chipset may be the Serverworks
Serverset CNB20-HE.

FWIW, the following URL describes a problem with memtest86+ v1.65:

Support for Serverworks Serverset (CNB20HE)?
http://forum.x86-secret.com/archive/...hp/t-4459.html

The author writes:

"This [test failure] seems to happen only with 2x1GB memory strips.
... If I test with 2x512MB everything works fine."

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.


#8
Franc Zabkar

Re: IBM x345 Server goes black during memory test of Samsung DIMMs - 05-02-2007, 01:24 AM



On 1 May 2007 15:31:43 -0700, Phil <silicontundra (AT) gmail (DOT) com> put
finger to keyboard and composed:

Quote:
> Tested with a slightly older version of MemTest86+ v1.65 on the
> original IBM x345 Server OEM 256MB DDR SDRAM memory DIMM sticks in the
> server's slot 1 and 2. The results were again similar, with the
> computer box hanging after about 45 minutes of testing and again
> requiring a R&R of the CMOS battery before it would boot again.
>
> The memory is mfgr'd by Micron Tech with two different date codes.
> Again my conclusion is that there is a thermally related failure mode;
> the older 2002 date codes failing first, presumably fabricated with
> older process technology that results in higher power consumption.

The "D" in the part number (K4H510638D) indicates the "generation" of
manufacture.

http://www.samsung.com/Products/Semi...k4h510638b.pdf
http://www.samsung.com/Products/Semi...k4h510638c.pdf

8. Version
M : 1st Generation
A : 2nd Generation
B : 3rd Generation
C : 4th Generation
* D : 5th Generation
E : 6th Generation

I would think that a newer process technology would have a higher
version number.

Quote:
> My conclusion is that gamer-type RAM coolers (convection heat sinks)
> are required to reduce memory reliability issues with IBM OEM DIMM
> memory in their legacy 2002 xSeries servers, even though the 2U
> servers are quite well designed with dual redundant banks of 4 fans
> across the cross-section of the chassis (wind-tunnel type design). Any
> SEs concur? Regards, Phil
>
> Details follow:
> IBM memory P/N 38L4029 FRU 09N4306, 2 sticks of 256MB PC2100 CL2.5
> 2.5v registered ECC, double sided, organized 32Mb x 72
>
> The older Micron Tech PC2100A-25330-M1 DIMM with 18 chips 46V32M4-75A
> date code late 2002. This pair hung 32 min into testing cycle with
> each pass taking 11 min for 512MB, or 2 1/2 passes.
>
> The newer Micron Tech PC2100A-25331-Z DIMM with 18 chips 46V32M4-75B
> date code mid 2003. This pair hung 49 min (similar to newer 1GB DIMM)
> into testing cycle, or just over 4 passes. Hung at Test5, Block move.
> The chips were almost-too-hot-to-touch with the pinky finger.
>
> Samsung parts indeed are K4510638D-TB80, 8ns parts,

According to the datasheet, the correct suffix is TCB0, which makes
them 7.5ns parts.

9. Package
* T : TSOP2 (400mil x 875mil)

10. Temperature & Power
* C : (Commercial, Normal)
L : (Commercial, Low)

11. Speed
A0 : 10ns@CL2
A2 : 7.5ns@CL2
* B0 : 7.5ns@CL2.5
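
The fields quoted above can be turned into a small lookup, sketched here in Python. The tables contain only the codes listed in this post, so this is not a full decoder for Samsung part numbers. The function name and the assumption that the generation letter is the last character before the hyphen are mine.

```python
# Only the codes quoted from the datasheet in this post.
VERSION = {"M": "1st Generation", "A": "2nd Generation", "B": "3rd Generation",
           "C": "4th Generation", "D": "5th Generation", "E": "6th Generation"}
PACKAGE = {"T": "TSOP2 (400mil x 875mil)"}
TEMP_POWER = {"C": "Commercial, Normal", "L": "Commercial, Low"}
SPEED = {"A0": "10ns@CL2", "A2": "7.5ns@CL2", "B0": "7.5ns@CL2.5"}


def decode_part(part):
    """Decode generation, package, temperature/power and speed from a part number.

    Raises KeyError if any field is not one of the codes listed above.
    """
    body, suffix = part.split("-")
    return {
        "generation": VERSION[body[-1]],   # assumed: last letter before the hyphen
        "package": PACKAGE[suffix[0]],
        "temp_power": TEMP_POWER[suffix[1]],
        "speed": SPEED[suffix[2:]],
    }
```

`decode_part("K4H510638D-TCB0")` reports a 5th generation TSOP2 part at 7.5ns@CL2.5, whereas the suffix "TC80" raises a `KeyError` because "80" is not among the listed speed codes.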

Quote:
> with sufficient design bandwidth margin. DDR clocking gives the 266MHz
> operation to the Xeon processors with their 533MHz FSB. My feeling on
> memory chip power consumption is that it is more than the 1.5 watts
> spec.
> 3) DC: not using IBM's ChipKill technology DIMMs, so deallocation of
> a block of memory space is not effected.
> 4) FZ#4: again not using IBM ChipKill DIMMs. I'll again refer you to
> IBM's White paper on ChipKill.

It's the memory controller that provides the ChipKill functionality,
not the DIMM.

According to IBM's white paper ...

http://www-03.ibm.com/servers/eserve...s/chipkill.pdf

"The memory subsystem design is such that a single chip, no matter
what its data width, would not affect more than one bit in any given
ECC word. For example, if x4 DRAMs were in use, each of the 4 DQs
would feed a different ECC word, that is, a different address of the
memory space. Thus even in the case of an entire chipkill, no single
ECC word will experience more than one bit of bad data -- which is
fixable by the SEC ECC -- and thereby the fault-tolerance of the
memory subsystem is maintained."

Furthermore, your user manual ...

ftp://ftp.software.ibm.com/systems/s...df/88p9189.pdf

... states that "the memory controller also provides Chipkill™ memory
protection if all DIMMs are of the type x4."

Therefore, as Samsung's datasheet states that the organisation of your
DRAMs is "stacked x4" (128Mx4), this suggests to me that your system
*does* support ChipKill.

AIUI your motherboard's memory controller spreads each DRAM chip's
four data bits over four distinct addresses, which means that in the
worst case a faulty chip will give rise to four correctable single-bit
errors rather than a single uncorrectable 4-bit error.
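
The difference between the two layouts can be shown with a toy model in Python. The bit mapping below is purely illustrative and is not IBM's actual wiring; the function names are invented for this sketch.

```python
from collections import Counter


def failed_bits(chip, spread, dq_per_chip=4):
    """List the (ecc_word, bit_position) pairs corrupted when one x4 chip dies."""
    if spread:
        # Each of the chip's DQ lines feeds a different ECC word.
        return [(dq, chip) for dq in range(dq_per_chip)]
    # Naive layout: all four DQ lines land in adjacent bits of one ECC word.
    return [(0, chip * dq_per_chip + dq) for dq in range(dq_per_chip)]


def worst_word(bits):
    """Largest number of bad bits that falls into any single ECC word."""
    counts = Counter(word for word, _ in bits)
    return max(counts.values())


print(worst_word(failed_bits(5, spread=True)))   # bad bits per word when spread
print(worst_word(failed_bits(5, spread=False)))  # bad bits per word when not spread
```

With spreading, the worst-hit word has one bad bit, which single-error correction can fix. Without it, one word takes all four bad bits, which is beyond single-error correction.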

Quote:
> 5) FZ, RR, the HW Manual references p34, 94 do not pertain to problem
> at hand. Same with User Manual, p5, 6.

I'm wondering whether clearing your CMOS RAM is a red herring. If you
allow sufficient time for your machine to cool, does this achieve the
same end?

- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.

