[SGVLUG] Recommendation of an open source hardware diagnostic tool

Matthew Campbell dvdmatt at gmail.com
Sun Mar 2 19:31:57 PST 2014


Yep.  Tried that with the RAM but the Mobo and CPU are the latest and I
don't want to blow another grand on duplicates...

Matt

---------
*Matthew Campbell*
Storage Solution Consultant
Storage Design and Engineering

*Kaiser Permanente*
IMG-Systems Integration
99 S. Oakland
Pasadena, CA 91101

626-564-7228 (office)
8-338-7228 (tie-line)
818-314-9897 (mobile phone)
Green Center 3-North, 031W29
---------
*kp.org/thrive <http://kp.org/thrive>*


On Sun, Mar 2, 2014 at 5:01 PM, Dan Kegel <dank at kegel.com> wrote:

> Swapping out part by part until the problem goes away might be your best
> bet.
> On 02.03.2014 15:24, "Matthew Campbell" <dvdmatt at gmail.com> wrote:
>
>> Does anyone have a hardware diagnostic tool they like, preferably open
>> source?  I have been fighting a host for two weeks now, and after finding
>> and submitting 2 kernel bugs I have begun to suspect that the problems I am
>> running into are being exposed by a hardware failure.
>>
>> The system appears to be running fine, but every 10-15 seconds it zones
>> out for a couple of seconds.  At first I thought it was a BTRFS bug, and
>> the errors I was seeing turned out to be just that.
>>
>> Once they were fixed the freezing kept on.  Further poking uncovered an
>> NFS bug in its interaction with the underlying filesystem, but even after
>> patching the kernel for that, the poor performance continues.
>>
>> Now I'm starting to see errors of this sort in my syslog:
>>
>> 2014-03-02T22:39:00.262Z cpu6:34527)WARNING: LinScsi:
>> SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056
>> Unknown status vmhba0:0:0:0 (driver name: ahci) - Message repeated 4 times
>> 2014-03-02T22:39:00.262Z cpu2:32791)ScsiDeviceIO: 2324:
>> Cmd(0x412e8088eac0) 0x4d, CmdSN 0x784 from world 0 to dev
>> "t10.ATA_____INTEL_SSDSC2BW240A4_____________________CVDA341000752403GN__"
>> failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
>> 2014-03-02T22:39:00.275Z cpu2:32784)ScsiDeviceIO: 2324:
>> Cmd(0x412e80842b00) 0x28, CmdSN 0x51c3 from world 32878 to dev
>> "t10.ATA_____INTEL_SSDSC2BW240A4_____________________CVDA341000752403GN__"
>> failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
>>
>> Could my SSD be failing?  But I just replaced the previous boot disk as
>> it looked like it was failing...
>>
>> Device status code D:0x8 equates to 08h BUSY according to these docs:
>>
>> http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=289902
>>
>> It could be a MOBO issue with the SATA port or even the CPU or RAM.  Ugh.
>>
>> I tried memtest86 and all passed...
>>
>> Any suggestions on a full-system hardware test suite would be much
>> appreciated.
>>
>> Matt
>>
>>
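
For anyone wanting to tally these status triples across a whole log, here is a
minimal sketch in Python. It assumes the H:/D:/P: fields are the host, device,
and plugin status as in the quoted log lines, and that the D: value is a
standard SCSI status byte. The names decode_status and SCSI_DEVICE_STATUS are
made up for illustration and are not part of any VMware tool.

```python
import re

# Standard SCSI status byte values (assumption: the D: field carries one of these).
SCSI_DEVICE_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x04: "CONDITION MET",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
    0x30: "ACA ACTIVE",
    0x40: "TASK ABORTED",
}

# Matches the "H:0x0 D:0x8 P:0x0" triple seen in the quoted log lines.
STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+)\s+D:(0x[0-9a-fA-F]+)\s+P:(0x[0-9a-fA-F]+)"
)


def decode_status(line):
    """Return the host/device/plugin codes found in a log line, or None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host, device, plugin = (int(group, 16) for group in match.groups())
    return {
        "host": host,
        "device": device,
        "plugin": plugin,
        # Fall back to UNKNOWN for values not in the table above.
        "device_name": SCSI_DEVICE_STATUS.get(device, "UNKNOWN"),
    }


if __name__ == "__main__":
    sample = "failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0."
    print(decode_status(sample))
```

Because the log lines wrap in mail, join each entry back into one line before
passing it in. Counting how often D:0x8 shows up per device may help tell a
single flaky drive from a controller or port problem.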