[SGVLUG] Recommendation of an open source hardware diagnostic tool

Matthew Campbell dvdmatt at gmail.com
Mon Mar 3 23:28:58 PST 2014


Yes, but memtest86 came up clean so I don't think it's the RAM.

I have installed the Phoronix suite, which automates most of the tests I am
aware of, but it looks like it's a 2-week run.  I set off the first of 15
tests this morning and it just finished...

Matt


---------
*Matthew Campbell*
Storage Solution Consultant
Storage Design and Engineering

*Kaiser Permanente*
IMG-Systems Integration
99 S. Oakland
Pasadena, CA 91101

626-564-7228 (office)
8-338-7228 (tie-line)
818-314-9897 (mobile phone)
Green Center 3-North, 031W29
---------
*kp.org/thrive <http://kp.org/thrive>*


On Mon, Mar 3, 2014 at 1:08 PM, Jess Bermudes <jbermudes at gmail.com> wrote:

> For the RAM, would memtest86 be something like what you're looking for?
>
>
> On Mon, Mar 3, 2014 at 12:35 PM, Matthew Campbell <dvdmatt at gmail.com> wrote:
>
>> Scott, that is an awesome diagram!  It really points out the right tools
>> to deep dive into each section of the system.  Someone somewhere has
>> already written a script to install, configure, run and interpret the
>> output of the 30 tools.  I don't want to re-invent the wheel ;)
>>
>> Dan, swapping hardware is a good suggestion, but I hope that running a
>> benchmark tool will expose the problem component, if there is one.  It may
>> also cover driver issues, kernel issues, networking interactions, etc.
>> There are several closed source solutions along these lines.  I was hoping
>> someone on this list had experience with an open source product like the
>> Phoronix suite
>>
>> http://www.phoronix-test-suite.com/
>> http://openbenchmarking.org/
>>
>> or any of the multitude of other open source benchmarks.
>>
>> http://en.wikipedia.org/wiki/Benchmark_%28computing%29#Common_benchmarks
>>
>> Matt
>>
>>
>> On Mon, Mar 3, 2014 at 8:49 AM, Dan Kegel <dank at kegel.com> wrote:
>>
>>> I wonder if you could learn anything by swapping out the motherboard
>>> with a cheaper one.
>>>
>>> On Sun, Mar 2, 2014 at 7:31 PM, Matthew Campbell <dvdmatt at gmail.com>
>>> wrote:
>>> > Yep.  Tried that with the RAM but the Mobo and CPU are the latest and I
>>> > don't want to blow another grand on duplicates...
>>> >
>>> > Matt
>>> >
>>> > On Sun, Mar 2, 2014 at 5:01 PM, Dan Kegel <dank at kegel.com> wrote:
>>> >>
>>> >> Swapping out part by part until the problem goes away might be your
>>> >> best bet.
>>> >>
>>> >> On 02.03.2014 15:24, "Matthew Campbell" <dvdmatt at gmail.com> wrote:
>>> >>
>>> >>> Does anyone have a hardware diagnostic tool they like, preferably
>>> >>> open source?  I have been fighting a host for two weeks now and, after
>>> >>> finding and submitting 2 kernel bugs, have begun to suspect that the
>>> >>> problems I am running into are being exposed by a hardware failure.
>>> >>>
>>> >>> The system appears to be running fine, but every 10-15 seconds will
>>> >>> zone out for a couple of seconds.  At first I thought it was a BTRFS
>>> >>> bug, and the errors I was seeing turned out to be just that.
>>> >>>
>>> >>> Once they were fixed the freezing kept on.  Further poking uncovered
>>> >>> an NFS bug in its interaction with the underlying filesystem, but
>>> >>> having also patched the kernel for that, the poor performance continues.
>>> >>>
>>> >>> Now I'm starting to see errors of this sort in my syslog:
>>> >>>
>>> >>> 2014-03-02T22:39:00.262Z cpu6:34527)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba0:0:0:0 (driver name: ahci) - Message repeated 4 times
>>> >>> 2014-03-02T22:39:00.262Z cpu2:32791)ScsiDeviceIO: 2324: Cmd(0x412e8088eac0) 0x4d, CmdSN 0x784 from world 0 to dev "t10.ATA_____INTEL_SSDSC2BW240A4_____________________CVDA341000752403GN__" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
>>> >>> 2014-03-02T22:39:00.275Z cpu2:32784)ScsiDeviceIO: 2324: Cmd(0x412e80842b00) 0x28, CmdSN 0x51c3 from world 32878 to dev "t10.ATA_____INTEL_SSDSC2BW240A4_____________________CVDA341000752403GN__" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
>>> >>>
>>> >>> Could my SSD be failing?  But I just replaced the previous boot disk
>>> >>> as it looked like it was failing...
>>> >>>
>>> >>> Device sense code D:0x8 equates to 08h  BUSY according to these docs:
>>> >>>
>>> >>>
>>> >>> http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=289902
>>> >>>
>>> >>> It could be a MOBO issue with the SATA port or even the CPU or RAM.
>>> >>> Ugh.
>>> >>>
>>> >>> I tried memtest86 and all passed...
>>> >>>
>>> >>> Any suggestions on a full-system hardware test suite would be much
>>> >>> appreciated.
>>> >>>
>>> >>> Matt
>>> >>>
>>> >
>>>
>>>
>>
>
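
For anyone reading the syslog excerpt in the original post: the
"H:0x0 D:0x8 P:0x0" triple on the ScsiDeviceIO lines is the host, device
and plugin status, and the D: value is the SCSI device status that Matt
looks up as 08h BUSY.  Below is a minimal Python sketch for pulling that
triple out of a log line and naming the device status.  The function name
decode_status is made up for illustration, the status table is the standard
set of SCSI status codes rather than anything taken from the VMware article,
and the host and plugin fields are left as raw numbers because their
meanings are not given in this thread.

```python
import re

# Standard SCSI device status codes (SCSI Architecture Model).
# Assumption: the D: field in the log carries one of these values.
SCSI_DEVICE_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
    0x28: "TASK SET FULL",
    0x40: "TASK ABORTED",
}

# Matches the "H:0x0 D:0x8 P:0x0" triple as it appears in the quoted log.
STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)"
)


def decode_status(line):
    """Return host/device/plugin status from a log line, or None if absent."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host, device, plugin = (int(value, 16) for value in match.groups())
    return {
        "host": host,
        "device": device,
        "plugin": plugin,
        # Only the device status is translated to a name here.
        "device_status": SCSI_DEVICE_STATUS.get(device, "UNKNOWN"),
    }


# Tail of one of the ScsiDeviceIO lines from the original post.
sample = "failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0."

if __name__ == "__main__":
    print(decode_status(sample))
```

Running something like this over the whole syslog and counting lines per
device_status would show whether BUSY is the only status being returned,
which may help narrow down whether the SSD or the SATA path is at fault.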