From: Michael Breuer on
Kernel 2.6.32.4 (git) with the following patches applied:

af_packet.c (tpacket_snd version 3)
sky2.c pskb_may_pull
sky2 fix WARNING at lib/dma-debug.c check_sync

Running with CONFIG_DMAR=n, system is stable.
Running with the exact same source but CONFIG_DMAR=y I get the WARNING
(see below) after about 36 hours of uptime (has varied from about 24 to
about 48):
Smolt profile:
http://smolt.fedoraproject.org/show?uuid=pub_bb05c701-1e47-4b3c-9fab-54f520f39d79+
I'm also attaching dmesg.old (dmesg from the crash).

Subsequent to this the system watchdog reboots the system (it's hung).

Of interest: each and every time this has happened the system was under
heavy RX load (win7 backup to a cifs share hosted on this server). Also,
there is always a dhcp exchange of some sort preceding the event.

It is possible that the event is re creatable without DMAR enabled, but
I have been unsuccessful in doing so.


Jan 22 05:38:36 mail dhcpd: DHCPREQUEST for 10.0.0.54 from
00:1b:78:c8:2b:8e (HPC82B8D) via eth0
Jan 22 05:38:36 mail dhcpd: DHCPACK on 10.0.0.54 to 00:1b:78:c8:2b:8e
(HPC82B8D) via eth0
Jan 22 05:38:41 mail kernel: DRHD: handling fault status reg 2
Jan 22 05:38:41 mail kernel: DMAR:[DMA Read] Request device [06:00.0]
fault addr ffdfdd9fe000
Jan 22 05:38:41 mail kernel: DMAR:[fault reason 06] PTE Read access is
not set
Jan 22 05:38:41 mail kernel: sky2 0000:06:00.0: error interrupt
status=0xc0000000
Jan 22 05:38:41 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010)
Jan 22 05:39:18 mail kernel: ------------[ cut here ]------------
Jan 22 05:39:18 mail kernel: WARNING: at net/sched/sch_generic.c:261
dev_watchdog+0xf3/0x164()
Jan 22 05:39:18 mail kernel: Hardware name: System Product Name
Jan 22 05:39:18 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit
queue 0 timed out
Jan 22 05:39:18 mail kernel: Modules linked in: cpufreq_stats
ip6table_mangle ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat
nf_nat iptable_mangle iptable_raw appletalk psnap llc nfsd lockd nfs_acl
auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq sit tunnel4
ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp nf_conntrack_ipv6
xt_multiport xt_DSCP xt_dscp xt_MARK ipv6 dm_multipath kvm_intel kvm
snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi snd_hda_intel
snd_ac97_codec snd_hda_codec snd_hwdep ac97_bus snd_seq snd_seq_device
firewire_ohci snd_pcm firewire_core crc_itu_t snd_timer snd
gspca_spca505 gspca_main i2c_i801 videodev v4l1_compat sky2 soundcore
v4l2_compat_ioctl32 wmi snd_page_alloc asus_atk0110 hwmon pcspkr
iTCO_wdt iTCO_vendor_support fbcon tileblit font bitblit softcursor
raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy
async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm
drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core
cfbimgblt cfb
Jan 22 05:39:18 mail kernel: fillrect [last unloaded: microcode]
Jan 22 05:39:18 mail kernel: Pid: 0, comm: swapper Tainted: G W
2.6.32.4MMAPDMARAF3SKY2PSKBMAYPULL-00912-g914160d-dirty #6
Jan 22 05:39:18 mail kernel: Call Trace:
Jan 22 05:39:18 mail kernel: <IRQ> [<ffffffff810536ee>]
warn_slowpath_common+0x7c/0x94
Jan 22 05:39:18 mail kernel: [<ffffffff8105375d>]
warn_slowpath_fmt+0x41/0x43
Jan 22 05:39:18 mail kernel: [<ffffffff813e3b6b>] ? netif_tx_lock+0x44/0x6c
Jan 22 05:39:18 mail kernel: [<ffffffff813e3cd3>] dev_watchdog+0xf3/0x164
Jan 22 05:39:18 mail kernel: [<ffffffff8105f3f4>] ? cascade+0x6a/0x84
Jan 22 05:39:18 mail kernel: [<ffffffff8106323f>]
run_timer_softirq+0x1c8/0x270
Jan 22 05:39:18 mail kernel: [<ffffffff8105af0f>] __do_softirq+0xf8/0x1cd
Jan 22 05:39:18 mail kernel: [<ffffffff8107f0ab>] ?
tick_program_event+0x2a/0x2c
Jan 22 05:39:18 mail kernel: [<ffffffff81012e1c>] call_softirq+0x1c/0x30
Jan 22 05:39:18 mail kernel: [<ffffffff810143a3>] do_softirq+0x4b/0xa6
Jan 22 05:39:18 mail kernel: [<ffffffff8105aaef>] irq_exit+0x4a/0x8c
Jan 22 05:39:18 mail kernel: [<ffffffff81470612>]
smp_apic_timer_interrupt+0x86/0x94
Jan 22 05:39:18 mail kernel: [<ffffffff810127e3>]
apic_timer_interrupt+0x13/0x20
Jan 22 05:39:18 mail kernel: <EOI> [<ffffffff812c729a>] ?
acpi_idle_enter_bm+0x256/0x28a
Jan 22 05:39:18 mail kernel: [<ffffffff812c7293>] ?
acpi_idle_enter_bm+0x24f/0x28a
Jan 22 05:39:18 mail kernel: [<ffffffff813a6c3c>] ?
cpuidle_idle_call+0x9e/0xfa
Jan 22 05:39:18 mail kernel: [<ffffffff81010c90>] ? cpu_idle+0xb4/0xf6
Jan 22 05:39:18 mail kernel: [<ffffffff81465ba5>] ?
start_secondary+0x201/0x242
Jan 22 05:39:18 mail kernel: ---[ end trace 57f7151f6a5def07 ]---
Jan 22 05:39:18 mail kernel: sky2 eth0: tx timeout
Jan 22 05:39:18 mail kernel: sky2 eth0: transmit ring 76 .. 35 report=76
done=76
Jan 22 05:39:18 mail kernel: sky2 eth0: disabling interface
Jan 22 05:39:18 mail kernel: sky2 eth0: enabling interface
Jan 22 05:39:21 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full
duplex, flow control both
Jan 22 05:40:06 mail kernel: sky2 eth0: tx timeout
Jan 22 05:40:06 mail kernel: sky2 eth0: transmit ring 3 .. 90 report=3
done=3
Jan 22 05:40:06 mail kernel: sky2 eth0: disabling interface
Jan 22 05:40:06 mail kernel: sky2 eth0: enabling interface
Jan 22 05:40:09 mail kernel: sky2 eth0: Link is up at 1000 Mbps, full
duplex, flow control both
Jan 22 05:40:30 mail abrt: Kerneloops: Reported 1 kernel oopses to Abrt
Jan 22 05:40:30 mail abrtd: Directory 'kerneloops-1264156830-1' creation
detected
Jan 22 05:40:30 mail abrtd: Can't open file
'/var/cache/abrt/kerneloops-1264156830-1/cmdline'
Jan 22 05:40:30 mail abrtd: Corrupted or bad crash, deleting
Jan 22 05:40:33 mail dhcpd: DHCPINFORM from 10.0.0.11 via eth0
Jan 22 05:40:33 mail dhcpd: DHCPACK to 10.0.0.11 (00:1a:92:8d:30:81) via
eth0
Jan 22 05:40:36 mail dhcpd: DHCPINFORM from 10.0.0.11 via eth0
Jan 22 05:40:36 mail dhcpd: DHCPACK to 10.0.0.11 (00:1a:92:8d:30:81) via
eth0
Jan 22 05:40:50 mail named[11245]: error (connection refused) resolving
'173-212-207-86.hostnoc.net/A/IN': 2607:f878:0:3::12#53
Jan 22 05:40:50 mail named[11245]: error (connection refused) resolving
'NS1.HOSTNOC.NET/AAAA/IN': 2607:f878:0:3::10#53
<watchdog reboot>

From: Jarek Poplawski on
On Fri, Jan 22, 2010 at 01:01:15PM -0500, Michael Breuer wrote:
> Kernel 2.6.32.4 (git) with the following patches applied:
>
> af_packet.c (tpacket_snd version 3)
> sky2.c pskb_may_pull
> sky2 fix WARNING at lib/dma-debug.c check_sync

I guess, you meant the "sky2.c receive_copy" patch which you tested
earlier, or at least you managed to crash DMAR with that patch
before crashing it with Stephen's "lib/dma-debug.c check_sync" patch,
right?

> Running with CONFIG_DMAR=n, system is stable.
> Running with the exact same source but CONFIG_DMAR=y I get the
> WARNING (see below) after about 36 hours of uptime (has varied from
> about 24 to about 48):
> Smolt profile: http://smolt.fedoraproject.org/show?uuid=pub_bb05c701-1e47-4b3c-9fab-54f520f39d79+
> I'm also attaching dmesg.old (dmesg from the crash).
>
> Subsequent to this the system watchdog reboots the system (it's hung).
>
> Of interest: each and every time this has happened the system was
> under heavy RX load (win7 backup to a cifs share hosted on this
> server). Also, there is always a dhcp exchange of some sort
> preceding the event.
>
> It is possible that the event is re creatable without DMAR enabled,
> but I have been unsuccessful in doing so.

It would be nice to check now if it's re-creatable without the dhcp
exchange yet, or at least dhcp through the switch and the router,
because I suspect there might be something more than a simple drop
on the switch that affects sky2 stability.

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Michael Breuer on
On 1/22/2010 4:53 PM, Jarek Poplawski wrote:
> On Fri, Jan 22, 2010 at 01:01:15PM -0500, Michael Breuer wrote:
>
>> Kernel 2.6.32.4 (git) with the following patches applied:
>>
>> af_packet.c (tpacket_snd version 3)
>> sky2.c pskb_may_pull
>> sky2 fix WARNING at lib/dma-debug.c check_sync
>>
> I guess, you meant the "sky2.c receive_copy" patch which you tested
> earlier, or at least you managed to crash DMAR with that patch
> before crashing it with Stephen's "lib/dma-debug.c check_sync" patch,
> right?
>
>
Yes - sorry, correct - all three patches were in the last run.
Previously, I've encountered the crash without these patches.
>> Running with CONFIG_DMAR=n, system is stable.
>> Running with the exact same source but CONFIG_DMAR=y I get the
>> WARNING (see below) after about 36 hours of uptime (has varied from
>> about 24 to about 48):
>> Smolt profile: http://smolt.fedoraproject.org/show?uuid=pub_bb05c701-1e47-4b3c-9fab-54f520f39d79+
>> I'm also attaching dmesg.old (dmesg from the crash).
>>
>> Subsequent to this the system watchdog reboots the system (it's hung).
>>
>> Of interest: each and every time this has happened the system was
>> under heavy RX load (win7 backup to a cifs share hosted on this
>> server). Also, there is always a dhcp exchange of some sort
>> preceding the event.
>>
>> It is possible that the event is re creatable without DMAR enabled,
>> but I have been unsuccessful in doing so.
>>
> It would be nice to check now if it's re-creatable without the dhcp
> exchange yet, or at least dhcp through the switch and the router,
> because I suspect there might be something more than a simple drop
> on the switch that affects sky2 stability.
>
> Jarek P.
>
Not sure I can do that. Note that based on the log messages, there were
no errors/dropped packets involving dhcp. Moving the dhcp server off of
the affected machine is not trivial. The dhcp correlation is based on
logged messages preceding each crash. I cannot confirm that they're
related, however it's really suspicious. If it helps, HP replaced my
unmanaged switch with a managed one so I can see whether there were any
switch events logged the next time I have a crash.

At this point, it seems the following is required to trigger the crash:
1) Uptime of 24-36 hours
2) High RX load on server (cifs traffic is what I've triggered it with).
3) Normal DHCP traffic.

Looks like based on the events I've seen that the high RX load has to be
sustained for about 15-20 minutes prior to the dhcp traffic. Crash
follows about a minute later.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Jarek Poplawski on
On Fri, Jan 22, 2010 at 05:14:58PM -0500, Michael Breuer wrote:
> On 1/22/2010 4:53 PM, Jarek Poplawski wrote:
> >On Fri, Jan 22, 2010 at 01:01:15PM -0500, Michael Breuer wrote:
> >>Kernel 2.6.32.4 (git) with the following patches applied:
> >>
> >>af_packet.c (tpacket_snd version 3)
> >>sky2.c pskb_may_pull
> >>sky2 fix WARNING at lib/dma-debug.c check_sync
> >I guess, you meant the "sky2.c receive_copy" patch which you tested
> >earlier, or at least you managed to crash DMAR with that patch
> >before crashing it with Stephen's "lib/dma-debug.c check_sync" patch,
> >right?
> >
> Yes - sorry, correct - all three patches were in the last run.
> Previously, I've encountered the crash without these patches.

OK, thanks for testing - it's really very helpful, and supports
David's opinion that dmar is a different problem.
....
> Not sure I can do that. Note that based on the log messages, there
> were no errors/dropped packets involving dhcp. Moving the dhcp
> server off of the affected machine is not trivial. The dhcp
> correlation is based on logged messages preceding each crash. I
> cannot confirm that they're related, however it's really suspicious.
> If it helps, HP replaced my unmanaged switch with a managed one so I
> can see whether there were any switch events logged the next time I
> have a crash.
>
> At this point, it seems the following is required to trigger the crash:
> 1) Uptime of 24-36 hours
> 2) High RX load on server (cifs traffic is what I've triggered it with).
> 3) Normal DHCP traffic.

Do you mean you got these crashes with the new switch too, and this
switch doesn't drop DHCP at all? (Otherwise, let's try this switch
first.)

Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Michael Breuer on
On 1/22/2010 6:06 PM, Jarek Poplawski wrote:
> On Fri, Jan 22, 2010 at 05:14:58PM -0500, Michael Breuer wrote:
>
>> On 1/22/2010 4:53 PM, Jarek Poplawski wrote:
>>
>>> On Fri, Jan 22, 2010 at 01:01:15PM -0500, Michael Breuer wrote:
>>>
>>>> Kernel 2.6.32.4 (git) with the following patches applied:
>>>>
>>>> af_packet.c (tpacket_snd version 3)
>>>> sky2.c pskb_may_pull
>>>> sky2 fix WARNING at lib/dma-debug.c check_sync
>>>>
>>> I guess, you meant the "sky2.c receive_copy" patch which you tested
>>> earlier, or at least you managed to crash DMAR with that patch
>>> before crashing it with Stephen's "lib/dma-debug.c check_sync" patch,
>>> right?
>>>
>>>
>> Yes - sorry, correct - all three patches were in the last run.
>> Previously, I've encountered the crash without these patches.
>>
> OK, thanks for testing - it's really very helpful, and supports
> David's opinion that dmar is a different problem.
> ...
>
>> Not sure I can do that. Note that based on the log messages, there
>> were no errors/dropped packets involving dhcp. Moving the dhcp
>> server off of the affected machine is not trivial. The dhcp
>> correlation is based on logged messages preceding each crash. I
>> cannot confirm that they're related, however it's really suspicious.
>> If it helps, HP replaced my unmanaged switch with a managed one so I
>> can see whether there were any switch events logged the next time I
>> have a crash.
>>
>> At this point, it seems the following is required to trigger the crash:
>> 1) Uptime of 24-36 hours
>> 2) High RX load on server (cifs traffic is what I've triggered it with).
>> 3) Normal DHCP traffic.
>>
> Do you mean you got these crashes with the new switch too, and this
> switch doesn't drop DHCP at all? (Otherwise, let's try this switch
> first.)
>
> Jarek P.
>
Nope - just got the new switch. Crash was old switch. That said, I don't
think (based on the log messages) that the dhcpoffer packet drop was
happening prior to the crash. I also can't fathom why a DHCPOFFER packet
dropped after leaving the server would have any bearing on the issue.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/