From: bab on
Hello,

I'm running a number of FreeBSD VMs on an ESX cluster using an MD3000i
(Dell iSCSI SAN) for storage. Most of the time these systems all work
fine, however periodically our automated monitoring system reports one
or two of the hosts inaccessible for a brief period (5-10 minutes).
This usually occurs in the early morning so it doesn't affect our
operations, but I'm concerned about the underlying cause. In the logs
of the affected machines are errors like those below. I would assume
this means that for some reason the connection to the SAN is timing
out, but can't seem to find the definition of the mpt_cam_event code
"0x60" anywhere. This occurs both on VMs where I've extended the disk
timeout with "kern.cam.da.retry_count=100" and on VMs where I have not
done this. Any ideas?


kernel: mpt0: attempting to abort req 0xc4298800:11 function 0
kernel: mpt0: mpt_wait_req(1) timed out
kernel: mpt0: mpt_recover_commands: abort timed-out. Resetting controller
kernel: mpt0: mpt_cam_event: 0x60
kernel: mpt0: completing timedout/aborted req 0xc4298800:11
kernel: mpt0: request 0xc4298b70:32 timed out for ccb 0xc47f7800 (req->ccb 0xc47f7800)
kernel: mpt0: request 0xc42965a0:33 timed out for ccb 0xc4283000 (req->ccb 0xc4283000)
kernel: mpt0: request 0xc4297c70:34 timed out for ccb 0xc43e3000 (req->ccb 0xc43e3000)
kernel: mpt0: request 0xc4293f80:35 timed out for ccb 0xc4806000 (req->ccb 0xc4806000)
kernel: mpt0: request 0xc4292c70:36 timed out for ccb 0xc480c800 (req->ccb 0xc480c800)
kernel: mpt0: request 0xc4297220:37 timed out for ccb 0xc43db000 (req->ccb 0xc43db000)
kernel: mpt0: request 0xc429a650:38 timed out for ccb 0xc4810000 (req->ccb 0xc4810000)
kernel: mpt0: request 0xc42923b0:39 timed out for ccb 0xc480e000 (req->ccb 0xc480e000)
kernel: mpt0: request 0xc4299110:40 timed out for ccb 0xc480f000 (req->ccb 0xc480f000)
kernel: mpt0: request 0xc4297590:41 timed out for ccb 0xc468f000 (req->ccb 0xc468f000)
kernel: mpt0: attempting to abort req 0xc4298b70:32 function 0
kernel: mpt0: completing timedout/aborted req 0xc4298b70:32
kernel: mpt0: abort of req 0xc4298b70:0 completed
kernel: mpt0: attempting to abort req 0xc42965a0:33 function 0
kernel: mpt0: completing timedout/aborted req 0xc42965a0:33
kernel: mpt0: abort of req 0xc42965a0:0 completed
kernel: mpt0: attempting to abort req 0xc4297c70:34 function 0
kernel: mpt0: completing timedout/aborted req 0xc4297c70:34
kernel: mpt0: abort of req 0xc4297c70:0 completed
kernel: mpt0: attempting to abort req 0xc4293f80:35 function 0
kernel: mpt0: completing timedout/aborted req 0xc4293f80:35
kernel: mpt0: abort of req 0xc4293f80:0 completed
kernel: mpt0: attempting to abort req 0xc4292c70:36 function 0
kernel: mpt0: completing timedout/aborted req 0xc4292c70:36
kernel: mpt0: abort of req 0xc4292c70:0 completed
kernel: mpt0: attempting to abort req 0xc4297220:37 function 0
kernel: mpt0: completing timedout/aborted req 0xc4297220:37
kernel: mpt0: abort of req 0xc4297220:0 completed
kernel: mpt0: attempting to abort req 0xc429a650:38 function 0
kernel: mpt0: mpt_wait_req(1) timed out
kernel: mpt0: mpt_recover_commands: abort timed-out. Resetting controller
kernel: mpt0: mpt_cam_event: 0x60
kernel: mpt0: completing timedout/aborted req 0xc429a650:38
kernel: mpt0: completing timedout/aborted req 0xc42923b0:39
kernel: mpt0: completing timedout/aborted req 0xc4299110:40
kernel: mpt0: completing timedout/aborted req 0xc4297590:41
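
To line these bursts up against backup windows or other SAN activity, it may help to count the timeouts and controller resets in each log. The following is only a minimal sketch: the function name summarize_mpt_log and the regular expressions are illustrative assumptions based on the message format in the excerpt above, not part of any FreeBSD tool, and other mpt driver versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns are assumptions derived from the log excerpt above.
TIMEOUT_RE = re.compile(r"mpt\d+: request 0x[0-9a-f]+:\d+ timed out")
RESET_RE = re.compile(r"mpt\d+: mpt_recover_commands: abort timed-out")
EVENT_RE = re.compile(r"mpt\d+: mpt_cam_event: (0x[0-9a-f]+)")


def summarize_mpt_log(lines):
    """Count request timeouts, controller resets and event codes."""
    summary = {"timeouts": 0, "resets": 0, "events": Counter()}
    for line in lines:
        if TIMEOUT_RE.search(line):
            summary["timeouts"] += 1
        elif RESET_RE.search(line):
            summary["resets"] += 1
        else:
            match = EVENT_RE.search(line)
            if match:
                # Tally each event code, e.g. "0x60", separately.
                summary["events"][match.group(1)] += 1
    return summary
```

It accepts any iterable of lines, so an open file works, for example summarize_mpt_log(open("/var/log/messages")), assuming the kernel messages are logged there. Running it per day or per host would show whether the resets cluster at one time of night.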

From: Dominic Fandrey on
On 09/05/2010 15:24, bab wrote:
> Most of the time these systems all work
> fine, however periodically our automated monitoring system reports one
> or two of the hosts inaccessible for a brief period (5-10 minutes).
> This usually occurs in the early morning so it doesn't affect our
> operations, but I'm concerned about the underlying cause.

Did you consider physical causes? E.g. the cleaning crew pulls the
plug of the storage system for their vacuum cleaners.

Or maybe the fibres are bent too strongly to transmit the signals.

--
A: Because it fouls the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

From: Torfinn Ingolfsen on
On 05/11/2010 11:05, Dominic Fandrey wrote:
> Did you consider physical causes? E.g. the cleaning crew pulls the
> plug of the storage system for their vacuum cleaners.
>
> Or maybe the fibres are bent too strongly to transmit the signals.

Or could it be other traffic on the SAN switches consuming all the
bandwidth?
Probably not a backup job if it is only 5-10 minutes.
--
Torfinn Ingolfsen,
Norway