kernel BUG in iwl-agn-rs.c:2076, WAS: iwlagn + some accesspoint == hardlock [Kernel]

Prev: Staging: cxt1e1: Fix build error
Next: [PATCH 2.6.34-rc6] mxc: gadget: Fix bitfield for calculating maximum packet size

From: NilsRadtkelkml on 3 May 2010 15:20

Hello,

Some of the previous messages about this BUG seem lost, so here again in a single message:

When connecting to the local university's wireless APs, rather regularly the kernel crashes.

So, connecting to one of those the system hardlocks. No panic LED blinking, no mouse cursor moving,
no magic sysreq key working. Poweroff is the only way to revive the machine. Until the upcoming
lock-up.

What might serve as a hint: using knemo and kwifimanager (with wpa_supplicant in the bg) it feels like it is
freezing more frequently. Not running kwifimanager seems to reduce the frequency w/o however eliminating it
completely. Sometimes iceweasel has been touched and then the crash occurred.
Maybe some TX (or subsequent RX) provokes the crash? Some idea that also came up was that accessing iwlagn
drivers isn't clean when it's happening from more that one place at the same time
(->wpa_supplicant + kwifimanager)? [edit: this isn't probably it, or not solely]

This behaviour is currently only known to happen in this exact place/geolocation. So one assumption is that it
is somehow correlated to a particular type of access point, firmware respectively. Nonetheless, even
if there's some fw sending kill packets, the kernel should not crash, obviously.

The access points in range are of type: (obviously, fw versions are unknown..)

- one 0:1d:8b PIRELLI BROADBAND SOLUTIONS
- couple of those 0:40:96 Cisco Systems, Inc.
- one 0:1a:70 Cisco-Linksys

This notebook has no serial port, so we haven't been able to get a glimpse on what is happening at that
precise moment. xconsole or dmesg haven't been helpful as the system just freezes and that's it.

Other circumstances leading to a crash (reproducibility):

- RFKILL switch, switching radio on, (obviously) is a precondition. But then, it mainly
happens shortly after switching radio on,
- it happens more frequently and faster after switching radio on when kde + firefox + xy is running

Please find attached an image of the bug. Finally I could catch a glimpse of what happens, I longued
for this quite some time.

Strangely, as stated in the previous message, this bug does only happen in conjunction at a specific
geographical position, so far it only happened there that is. How does this correlate with the designated
code at line 2076 in iwl-agn-rs.c?:

/* Sanity-check TPT calculations */
BUG_ON(window->average_tpt != ((window->success_ratio *
tbl->expected_tpt[index] + 64) / 128));

Furthermore, it seems like there was already an encounter with this bug before. Previously I had no means
of grabbing more information off it, though, so I wrote down some notes:

2010-03-20: iwl-agn-rs.c line 2076: bug
The file where I kept my note for this bug contains:

- crashed during wireless AP scan using wpa_supplicant (after RFKILL switching to on..)
- this happend right after a fresh reboot (after crashing for the same bug)

- afterwards, auth.log contained this (bad, bad where's my data gone?):
Mar 20 18:24:01 localhost CRON[16684]: pam_unix(cron:session): session closed for user root
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@i686; it; rv:1.9.2.1) Gecko/20091211 Firefox/3.5.6 (like Firefox/3.5.11; Debian-3.5.11-1)"

Noticed the line in the bug from 2010-03-20? Yep, the same as in the one of today.

2010-04-30: One crash, two oopses: Msg restored mainly from remainders left in some brain's corner, w/
heavy massaging applied, mainly Code has been written down properly due to outside constrains..
Unfortunately, the proper ordering of lines got lost, so, yes, this is a poor report..
Order is as it's believed it was shown on-screen.

Number 1: nearly lost as subsequent oops made it scroll off-screen, only bottom half visible:

in CPU idle +x3f/x90

EIP rs_tx_status +x8f/x2030 SS: ESP 0668:f749fc4c

Code: 7f 22 69 c7 80 01 00 00 8b 8c 24 fc 01 00 00 8b 84 01 a8 00 00 00 6b 04 b0 64 39 44 24 1c 0f 8e
d2 fe ff ff 31 db e9 cb fe ff ff <0f> 0b fe 8b 54 24 4c 8b 8c 24 fc 01 00 00 8b 82 04 00 00 00

in iwl_rx_queue restock xbd/x130 iwlcore

panic -> oops begin

kernel panic not syncing fatal exec in int

pid 0 comm: swapper tainted GD .33.2

Comments, ideas, fixes welcome... this BUG is highly annoying..

Happy to serve with more info if required (and available..).

Regards,

Nils

From: reinette chatre on 11 May 2010 13:30

On Tue, 2010-05-11 at 08:50 -0700, Christian Borntraeger wrote:
> Am Donnerstag 06 Mai 2010 18:28:48 schrieb reinette chatre:
> > Below seven iwlwifi patches were added after rc4. If you are unable to
> > bisect ... perhaps you can run a while by reverting more and more from
> > this list?
> >
> > f2fa1b015e9c199e45c836c769d94db595150731 iwlwifi: correct 6000 EEPROM regulatory address
> > 88be026490ed89c2ffead81a52531fbac5507e01 iwlwifi: fix scan races
> > 8b9fce77737ae9983f61ec56cd53f52fb738b2c7 iwlwifi: work around bogus active chains detection
> > ece6444c2fe80dab679beb5f0d58b091f1933b00 iwlwifi: need check for valid qos packet before free
> > de0f60ea94e132c858caa64a44b2012e1e8580b0 iwlwifi: avoid Tx queue memory allocation in interface down
> > 04f2dec1c3d375c4072613880f28f43b66524876 iwlwifi: use consistent table for tx data collect
> > dd48744964296b5713032ea1d66eb9e3d990e287 iwlwifi: fix DMA allocation warnings
>
> Just to give you some feedback. Sometimes it takes a while until it crashes.
> My hand made bisect is currently at two remaining patches:
>
> de0f60ea94e132c858caa64a44b2012e1e8580b0
> 8b9fce77737ae9983f61ec56cd53f52fb738b2c7
>
> reverting both solves my hard lockup. I will try to isolate the "bad" patch
> but this takes some more days since I wont be in the "hazardous environment"
> this week.

Thank you for digging into this. It will be very helpful it you can get
us a trace of the crash - any chance that netconsole may work?

Reinette

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Nils Radtke on 12 May 2010 10:40

Hi,

Problem regarding expected tpt calculation: (at site A)
Tested w/ Linux mypole 2.6.33.3 #21 SMP PREEMPT Tue May 11 11:41:07 CEST 2010 i686 GNU/Linux
and iwl recovery patches applied.

[28766.975878] iwlagn 0000:03:00.0: loaded firmware version 228.61.2.24
[28770.306101] eth1: direct probe to AP 00:1a:70:aa:bb:cc (try 1)
[28770.506282] eth1: direct probe to AP 00:1a:70:aa:bb:cc (try 2)
[28770.512298] eth1: direct probe responded
[28770.512308] eth1: authenticate with AP 00:1a:70:aa:bb:cc (try 1)
[28770.712518] eth1: authenticate with AP 00:1a:70:aa:bb:cc (try 2)
[28770.714378] eth1: authenticated
[28770.714419] eth1: associate with AP 00:1a:70:aa:bb:cc (try 1)
[28770.914282] eth1: associate with AP 00:1a:70:aa:bb:cc (try 2)
[28770.917618] eth1: RX AssocResp from 00:1a:70:aa:bb:cc (capab=0x401 status=0 aid=2)
[28770.917626] eth1: associated
[28777.113828] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[28777.991207] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[28778.621611] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now
[28788.508867] iwlagn 0000:03:00.0: expected_tpt should have been calculated by now

Messages of the type of the last line all over the logs, coming in triples, couples or alone..

Then, there are again these msgs: (site A again)

[29001.505362] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:aa:bb:cc tid = 0
[29039.037933] iwlagn 0000:03:00.0: iwl_tx_agg_start on ra = 00:1a:70:aa:bb:cc tid = 0

and so on. Associated w/ poor throughput.

- - -

Problem regarding the deauths/reconnects (sometimes combined w/ bad throughput):

Tested w/ Linux mypole 2.6.33.3 #21 SMP PREEMPT Tue May 11 11:41:07 CEST 2010 i686 GNU/Linux
and iwl recovery patches applied.

RFKILL switch on, wpa_cli stat showed auth.

First auth: conn failed during https login

[28018.602587] eth1: deauthenticated from 00:40:96:aa:bb:cc (Reason: 2)
[28018.900400] iwlagn 0000:03:00.0: TX Power requested while scanning!
[28018.900411] iwlagn 0000:03:00.0: Error sending TX power (-11)
[28018.921465] iwlagn 0000:03:00.0: request scan called when driver not ready.

started knoppix download w/ 20k/s avg, got 88kb when:
[28225.669850] eth1: deauthenticated from 00:40:96:aa:bb:cc (Reason: 2)

next try: no deauth but throughput was: --.-K/s
on abortion of wget and retry: deauth:

[28376.132933] eth1: deauthenticated from 00:40:96:aa:bb:cc (Reason: 2)
[28376.253981] eth1: direct probe to AP 00:23:eb:aa:bb:cc (try 1)
[28376.768958] eth1: direct probe to AP 00:23:eb:aa:bb:cc (try 1)
[28376.771879] eth1: direct probe responded
[28376.771887] eth1: authenticate with AP 00:23:eb:aa:bb:cc (try 1)
[28376.971307] eth1: authenticate with AP 00:23:eb:aa:bb:cc (try 2)
[28376.974411] eth1: authenticated
[28376.974455] eth1: associate with AP 00:23:eb:aa:bb:cc (try 1)
[28377.174290] eth1: associate with AP 00:23:eb:aa:bb:cc (try 2)
[28377.180867] eth1: RX AssocResp from 00:23:eb:aa:bb:cc (capab=0x421 status=0 aid=11)
[28377.180875] eth1: associated

Then again, doing nothing:
[28406.222404] eth1: deauthenticated from 00:40:96:aa:bb:cc (Reason: 2)
[28409.836210] eth1: direct probe to AP 00:40:96:aa:bb:cc (try 1)
[28410.036330] eth1: direct probe to AP 00:40:96:aa:bb:cc (try 2)
[28410.039365] eth1: direct probe responded
[28410.039377] eth1: authenticate with AP 00:40:96:aa:bb:cc (try 1)
[28410.239290] eth1: authenticate with AP 00:40:96:aa:bb:cc (try 2)
[28410.245375] eth1: authenticated
[28410.245419] eth1: associate with AP 00:40:96:aa:bb:cc (try 1)
[28410.445346] eth1: associate with AP 00:40:96:aa:bb:cc (try 2)
[28410.448273] eth1: RX AssocResp from 00:40:96:aa:bb:cc (capab=0x421 status=0 aid=17)
[28410.448282] eth1: associated

and so on:
[28476.948459] eth1: deauthenticated from 00:40:96:aa:bb:cc (Reason: 2)
[28477.059629] eth1: direct probe to AP 00:40:96:aa:bb:cc (try 1)
[28477.590389] eth1: direct probe to AP 00:40:96:aa:bb:cc (try 1)
[28477.607096] eth1: direct probe responded
[28477.607106] eth1: authenticate with AP 00:40:96:aa:bb:cc (try 1)
[28477.807275] eth1: authenticate with AP 00:40:96:aa:bb:cc (try 2)
[28477.812340] eth1: authenticated
[28477.812382] eth1: associate with AP 00:40:96:aa:bb:cc (try 1)
[28478.012397] eth1: associate with AP 00:40:96:aa:bb:cc (try 2)
[28478.014914] eth1: RX AssocResp from 00:40:96:aa:bb:cc (capab=0x421 status=0 aid=19)
[28478.014922] eth1: associated

So, this is a real blocker.

BTW, what is so particular about cisco APs? Again, this is a problem only w/ cisco (different site B,
though):
0:40:96 Cisco Systems, Inc.

Thanks.

Cheers,

Nils

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: reinette chatre on 12 May 2010 19:20

Hi Nils,

On Wed, 2010-05-12 at 07:39 -0700, Nils Radtke wrote:
> Problem regarding the deauths/reconnects (sometimes combined w/ bad throughput):
>
> Tested w/ Linux mypole 2.6.33.3 #21 SMP PREEMPT Tue May 11 11:41:07 CEST 2010 i686 GNU/Linux
> and iwl recovery patches applied.
>

Any chance you can try these same tests with latest code
(compat-wireless may be easiest) to help us determine if this is still
an issue?

> RFKILL switch on, wpa_cli stat showed auth.
>
> First auth: conn failed during https login
>
> [28018.602587] eth1: deauthenticated from 00:40:96:aa:bb:cc (Reason: 2)
> [28018.900400] iwlagn 0000:03:00.0: TX Power requested while scanning!
> [28018.900411] iwlagn 0000:03:00.0: Error sending TX power (-11)
> [28018.921465] iwlagn 0000:03:00.0: request scan called when driver not ready.

We could maybe at least figure out what is going on here if the driver
is loaded with some debugging. Please ensure driver is compiled with
debugging (CONFIG_IWLWIFI_DEBUG) and load the module with "modprobe
iwlagn debug=0x43fff". Could you please reproduce under this
environment?

Reinette

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Nils Radtke on 13 May 2010 06:40

Hi Reinette,

On Wed, 2010-05-12 at 07:39 -0700, Nils Radtke wrote:
> > Problem regarding the deauths/reconnects (sometimes combined w/ bad throughput):
> >
> > Tested w/ Linux mypole 2.6.33.3 #21 SMP PREEMPT Tue May 11 11:41:07 CEST 2010 i686 GNU/Linux
> > and iwl recovery patches applied.
> >
>
> Any chance you can try these same tests with latest code
> (compat-wireless may be easiest) to help us determine if this is still
> an issue?
I'll try those. Have to recompile the kernel though as I used to have config_80211 in instead of M.

> > RFKILL switch on, wpa_cli stat showed auth.
> >
> > First auth: conn failed during https login
> >
> > [28018.602587] eth1: deauthenticated from 00:40:96:aa:bb:cc (Reason: 2)
> > [28018.900400] iwlagn 0000:03:00.0: TX Power requested while scanning!
> > [28018.900411] iwlagn 0000:03:00.0: Error sending TX power (-11)
> > [28018.921465] iwlagn 0000:03:00.0: request scan called when driver not ready.
>
> We could maybe at least figure out what is going on here if the driver
> is loaded with some debugging. Please ensure driver is compiled with
> debugging (CONFIG_IWLWIFI_DEBUG) and load the module with "modprobe
> iwlagn debug=0x43fff". Could you please reproduce under this
> environment?
Ok.

Today it's been a thunderstorm here but right now the sun is shining. Guess, I'll make it for a tour through
the city.. :)

Cheers,

Nils

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

| Next | Last
Pages: 1 2 3 4
Prev: Staging: cxt1e1: Fix build error
Next: [PATCH 2.6.34-rc6] mxc: gadget: Fix bitfield for calculating maximum packet size