No mount after swinging LUN from NetApp to a new host "very big device" [Setup]

Prev: Pleasantly surprised by Linux...for a couple of days...The Short Happy Life of Francis Linux Macomber
Next: Got it working! First post from inside Damn Small Linux! Need more help pls

From: Doug Freyburger on 11 Jun 2010 05:37

Folks,

I've got a 2+ TB NetApp LUN that has Oracle RMAN data as part of a
technology refresh migration. We are currently using other storage on
the NetApp over NFS to do the backup all over again but that will lose
another day compared to using the data on the LUN. It looks like it
will take 5+ hours to write the data into the fallback NFS location.

The source host is old:

Red Hat Enterprise Linux ES release 3 (Taroon Update 8)

Linux source-host 2.4.21-47.ELsmp #1 SMP Wed Jul 5 20:38:41 EDT 2006
i686 i686 i386 GNU/Linux

scsi2 : QLogic QLA2422 PCI to Fibre Channel Host Adapter: bus 3 device
2 irq 27
Firmware version: 4.00.23, Driver version 7.07.05

On the old host, LUN 38 was scanned and partitioned with parted. An ext3
filesystem was created on it and mounted. Oracle RMAN deposited well
over a TB of data. Then I tried to swing the LUN to the new host.

The destination host is new:

Red Hat Enterprise Linux Server release 5.3 (Tikanga)

Linux target-host 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST
2008 x86_64 x86_64 x86_64 GNU/Linux

Hmmm, not that new. It's got all of the patches required for Oracle but
not all of the patches period.

The first hint of a problem was fsck claimed a SCSI reserve problem that
a reboot did not resolve and on the NetApp resetting the status of the
LUN did not help either:

[root@xdb1(newprod) etc]# fsck /dev/sdl1
fsck 1.39 (29-May-2006)
e2fsck 1.39 (29-May-2006)
fsck.ext2: Device or resource busy while trying to open /dev/sdl1
Filesystem mounted or opened exclusively by another program?

Well, at one point it had an exclusive SCSI reservation on the old host
but that didn't seem to be the cause. After the reboot I noticed an
interesting message in /var/log/messages and in dmesg:

sd 3:0:2:34: Attached scsi disk sdk
sd 3:0:2:34: Attached scsi generic sg17 type 0
Vendor: NETAPP Model: LUN Rev: 7320
Type: Direct-Access ANSI SCSI revision: 04
qla2xxx 0000:04:00.0: scsi(3:0:2:38): Enabled tagged queuing, queue depth 32.
sdl : very big device. try to use READ CAPACITY(16).
SCSI device sdl: 5033164800 512-byte hdwr sectors (2576980 MB)
sdl: Write Protect is off
sdl: Mode Sense: bd 00 00 08

That sequence appeared for both of the redundant paths to the LUN but
none of the other LUNs complained. It turns out the other LUNs are
Oracle raw files at 250 GB each but this LUN is an ext3 cooked
filesystem at 2.5 TB. On the new system this is the only device over
2 TB.
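
The "very big device" message is consistent with simple arithmetic: READ CAPACITY(10) reports the last block address in a 32-bit field, so with 512-byte sectors it can describe only about 2 TiB, and a larger device has to be sized with READ CAPACITY(16). A minimal sketch checking the figures from the dmesg output above (the constant names are illustrative, not from any tool):

```python
# Figures copied from the dmesg output above; names are illustrative only.
SECTOR_SIZE = 512
SECTORS = 5033164800

# READ CAPACITY(10) returns the last LBA in a 32-bit field, so it can
# describe at most about 2**32 sectors.
MAX_SECTORS_RC10 = 2**32

size_bytes = SECTORS * SECTOR_SIZE
limit_bytes = MAX_SECTORS_RC10 * SECTOR_SIZE

print(size_bytes)                  # 2576980377600, the size fdisk reports
print(size_bytes // 10**6)         # 2576980, the MB figure in dmesg
print(SECTORS > MAX_SECTORS_RC10)  # True: too big for READ CAPACITY(10)
print(limit_bytes)                 # roughly the 2.2 TB limit
```

So the message by itself looks like the kernel correctly switching to the 16-byte command, not necessarily an error.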

The Red Hat knowledgebase says that Enterprise Server release 4 had that
problem and that it was fixed in Update 1 (U1). Okay, but the old source
system is ES 3 and the new target system is ES 5.

I don't think this is a Qlogic driver problem because "fdisk -l
/dev/sdl" sees the size and complains. It tells me to run parted
instead just like I expect with a device over 2 TB. But that should not
matter because it sees primary partition 1:

WARNING: The size of this disk is 2.6 TB (2576980377600 bytes).
DOS partition table format can not be used on drives for volumes
larger than 2.2 TB (2199023255040 bytes). Use parted(1) and GUID
partition table format (GPT).

Disk /dev/sdl: 2576.9 GB, 2576980377600 bytes
255 heads, 63 sectors/track, 313300 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdl1               1      267349  2147480811   83  Linux

So I see the LUN as a device. That suggests it is not a Qlogic driver
problem. I see the partition table with fdisk. That suggests it is not
a partitioning problem.
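
The numbers in the fdisk output can be cross-checked against each other. A rough sketch, using only the figures printed above (the variable names are made up for illustration), that works out where partition 1 ends relative to the DOS partition table limit and to the end of the LUN:

```python
# All figures are taken from the fdisk output above.
DISK_BYTES = 2576980377600       # size of /dev/sdl
MBR_LIMIT_BYTES = 2199023255040  # limit quoted in the fdisk warning
CYLINDER_BYTES = 16065 * 512     # "Units = cylinders of 16065 * 512"
END_CYLINDER = 267349            # End column for /dev/sdl1
BLOCKS_1K = 2147480811           # Blocks column, in 1 KiB units

part_bytes = BLOCKS_1K * 1024
end_offset = END_CYLINDER * CYLINDER_BYTES

print(part_bytes)                     # 2199020350464
print(end_offset - part_bytes)        # 32256, i.e. 63 sectors before the partition
print(end_offset <= MBR_LIMIT_BYTES)  # True: the partition ends under the limit
print(DISK_BYTES - end_offset)        # bytes of the LUN beyond the partition
```

This is only arithmetic on the posted output. It shows the partition entry stopping just short of the 2.2 TB boundary, with the rest of the LUN outside it, and does not by itself explain why the device is busy.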

Should I be looking at it as a SCSI issue? It seems strange that the
partition table is visible but the filesystem is not. That looks more
ext2/3 than SCSI to me.

I can open a ticket with Red Hat but because there's an NFS backup
running we have a workaround. The NFS backup might complete before I
get a response from Red Hat on a priority 2 ticket. I discussed that
with the client manager and he asked that I limit the Red Hat ticket
effort to an hour while waiting for the NFS backup to complete. That's
a "newsgroups to the rescue" sort of problem.

Help me Obi Wan COL.setup ....

From: Nico Kadel-Garcia on 11 Jun 2010 07:17

On Jun 11, 5:37 am, Doug Freyburger <dfrey...(a)yahoo.com> wrote:
> Folks,
>
> I've got a 2+ TB NetApp LUN that has Oracle RMAN data as part of a
> technology refresh migration. We are currently using other storage on
> the NetApp over NFS to do the backup all over again but that will lose
> another day compared to using the data on the LUN. It looks like it
> will take 5+ hours to write the data into the fallback NFS location.
>
> The source host is old:
>
> Red Hat Enterprise Linux ES release 3 (Taroon Update 8)

Stop *RIGHT* there. RHEL 3? E-w-w-w-w-w-w-w!

>
> Linux source-host 2.4.21-47.ELsmp #1 SMP Wed Jul 5 20:38:41 EDT 2006
> i686 i686 i386 GNU/Linux
>
> scsi2 : QLogic QLA2422 PCI to Fibre Channel Host Adapter: bus 3 device
> 2 irq 27
> Firmware version: 4.00.23, Driver version 7.07.05
>
> On the old host LUN 38 was scanned, partitioned with parted. An ext3
> filesystem was put on it, mounted. Oracle RMAN deposited well over a TB
> of data. Then I tried to swing the LUN to the new host.
>
> The destination host is new:
>
> Red Hat Enterprise Linux Server release 5.3 (Tikanga)

You are two point releases and a stack of patches behind. Update to
RHEL 5.5 if possible.

> Linux target-host 2.6.18-128.el5 #1 SMP Wed Dec 17 11:41:38 EST
> 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> Hmmm, not that new. It's got all of the patches required for Oracle but
> not all of the patches period.
>
> The first hint of a problem was fsck claimed a SCSI reserve problem that
> a reboot did not resolve and on the NetApp resetting the status of the
> LUN did not help either:
>
> [root@xdb1(newprod) etc]# fsck /dev/sdl1
> fsck 1.39 (29-May-2006)
> e2fsck 1.39 (29-May-2006)
> fsck.ext2: Device or resource busy while trying to open /dev/sdl1
> Filesystem mounted or opened exclusively by another program?
>
> Well, at one point it had an exclusive SCSI reservation on the old host
> but that didn't seem to be the cause. After the reboot I noticed an
> interesting message in /var/log/messages and in dmesg:
>
> sd 3:0:2:34: Attached scsi disk sdk
> sd 3:0:2:34: Attached scsi generic sg17 type 0
> Vendor: NETAPP Model: LUN Rev: 7320
> Type: Direct-Access ANSI SCSI revision: 04
> qla2xxx 0000:04:00.0: scsi(3:0:2:38): Enabled tagged queuing, queue depth 32.
> sdl : very big device. try to use READ CAPACITY(16).
> SCSI device sdl: 5033164800 512-byte hdwr sectors (2576980 MB)
> sdl: Write Protect is off
> sdl: Mode Sense: bd 00 00 08
>
> That sequence appeared for both of the redundant paths to the LUN but
> none of the other LUNs complained. It turns out the other LUNs are
> Oracle raw files at 250 GB each but this LUN is ext3 cooked filesystem
> at 2.5 TB. On the new system this is the only device over 2 TB.
>
> Checking the Red Hat knowledgebase it says that Enterprise Server
> release 4 had that problem and it was fixed in Update 1 U1. Okay but
> the old source system is ES 3 and the new target system is ES 5.

There are a number of useful and important fixes from the default RHEL
5.3 kernels available in RHEL 5.4 and 5.5. I'd definitely make the
leap.


From: Doug Freyburger on 14 Jun 2010 12:02

David W. Hodgins wrote:
> Doug Freyburger <dfreybur(a)yahoo.com> wrote:
>
>> WARNING: The size of this disk is 2.6 TB (2576980377600 bytes).
>> DOS partition table format can not be used on drives for volumes
>> larger than 2.2 TB (2199023255040 bytes). Use parted(1) and GUID
>> partition table format (GPT).
>> ...
>> So I see the LUN as a device. That suggests it is not a Qlogic driver
>> problem. I see the partition table with fdisk. That suggests it is not
>> a partitioning problem.
>
> For a device using the GUID partition table format, the mbr should have
> one fake entry of type EE, to protect the space from partitioning
> tools that don't handle GUID.
>
> See
> http://en.wikipedia.org/wiki/GUID_Partition_Table#Legacy_MBR_.28LBA_0.29
>
> Follow the warning to use parted, to see what is on the GUID partition
> table.

Check. parted is the command that was used on the source host to make
the LUN mountable in the first place. It was in root's .bash_history
file on the source host when I checked.

If I understand correctly, using parted on the target host would wipe
the data written by the source host, and the backup data would be lost.
Is my understanding incorrect?
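
One way to tell a GPT disk from a plain DOS-labelled one without writing anything to the LUN is to read the first sector read-only and look at the partition type bytes. A GPT disk should carry a protective entry of type EE, as described in the quoted reply, while the fdisk output earlier in the thread shows Id 83 for partition 1. The following is a minimal sketch, assuming the standard MBR layout (table at offset 446, four 16-byte entries, type byte at offset 4 of each entry, 55 AA signature at offset 510). The function names are hypothetical and the demo uses a synthetic sector, not the real device.

```python
MBR_SIZE = 512
TABLE_OFFSET = 446     # start of the four-entry partition table
ENTRY_SIZE = 16
GPT_PROTECTIVE = 0xEE  # type of the protective entry on a GPT disk


def partition_types(sector):
    """Return the type byte of each of the four MBR partition entries."""
    if len(sector) < MBR_SIZE or sector[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature found")
    types = []
    for i in range(4):
        entry_start = TABLE_OFFSET + i * ENTRY_SIZE
        types.append(sector[entry_start + 4])
    return types


def looks_like_gpt(sector):
    """True if any MBR entry is the GPT protective type."""
    return GPT_PROTECTIVE in partition_types(sector)


def read_first_sector(device_path):
    """Read the first sector; the device is opened read-only."""
    with open(device_path, "rb") as dev:
        return dev.read(MBR_SIZE)


if __name__ == "__main__":
    # Synthetic sector with one type-83 entry, mimicking the fdisk output.
    fake = bytearray(MBR_SIZE)
    fake[TABLE_OFFSET + 4] = 0x83
    fake[510:512] = b"\x55\xaa"
    print([hex(t) for t in partition_types(bytes(fake))])
    print(looks_like_gpt(bytes(fake)))
```

Against the real LUN, the sector would come from something like read_first_sector("/dev/sdl"), run as root. Nothing in the sketch opens the device for writing.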

From: Nico Kadel-Garcia on 14 Jun 2010 22:10

On Jun 14, 12:15 pm, Doug Freyburger <dfrey...(a)yahoo.com> wrote:
> Nico Kadel-Garcia wrote:
> > Doug Freyburger <dfrey...(a)yahoo.com> wrote:
>
> >> The source host is old:
>
> >> Red Hat Enterprise Linux ES release 3 (Taroon Update 8)
>
> > Stop *RIGHT* there. RHEL 3? E-w-w-w-w-w-w-w!
>
> The source host is the source for the migration effort because of that.
> Check. No way am I going to abandon the source of the data without
> migrating just because it's old. Yes way am I going to ask for the old
> host to be trashed after it's all over. The entire source is actually
> two ASM/CRS Oracle clusters of two hosts each.
>
> The other cluster has already been migrated using NFS over the NetApp.
> That followed the standard NetApp experience with NFS, works first time
> every time. Unfortunately trying to use the NetApp as a block service
> also has followed the standard NetApp experience with SAN, no faster
> than NFS but a lot more frustrating.
>
> >> Red Hat Enterprise Linux Server release 5.3 (Tikanga)
>
> > There are a number of useful and important fixes from the default RHEL
> > 5.3 kernels available in RHEL 5.4 and 5.5. I'd definitely make the
> > leap.
>
> Thanks! Since there was frustration with the block service I started
> the backup over again using the NFS service. Time was of the essence
> and it was worth pursuing both avenues in parallel.
>
> At this time the 2+ TB backup over NFS completed. We swung the NFS
> mount to the new system which of course worked first time every time
> (not counting a bad Ethernet cable thrown in by Murphy just for fun).
> The restore over NFS completed and production has been migrated to the
> new machines.
>
> I have added OS upgrades to the recommended maintenance list for all of
> the new hosts at that client. It will proceed on the usual scheduled
> maintenance window pattern.
>
> Thanks for the advice. I now have a plan going forward.

Good. And you have my sympathies: a few years ago I worked very hard on
a migration from a 12-year-old SCO OpenServer system with proprietary
software to RHEL 5, and I definitely feel your discomfort. (I also
asked Richard Stallman for brownie points for doing it, and he said
"well, it's a start"....)
