From: Felipe W Damasio on
Hi Mr. Miller,

2010/7/10 David Miller <davem(a)davemloft.net>:
> It could be corruption from elsewhere. Those last four hex
> digits (0x5d415d41) are "]A]A" in ascii, but that could just
> be coincidence.

What do you mean by "from elsewhere"? Do you mean elsewhere in the network code?

Since the function that had the problem was tcp_recvmsg and we're
talking about a squid process, we're either talking about a typical
web-server object response, or about an incorrect/faulty HTTP
request from the user.

As I told Mr. Dumazet, the squid logs show this entry:

2010/07/08 14:51:10| clientTryParseRequest: FD 6088
(187.16.240.122:2035) Invalid Request

It appeared only a second before the bug entry in syslog, so I suppose
that this invalid request caused the problem (more of a guess, really).

If you think there's a way I can help reproduce/trigger and fix this
bug, please let me know, since the production machine is down until I
can assure my bosses that this particular crash won't happen again.

Thanks,

Felipe Damasio
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Avi Kivity on
On 07/10/2010 09:17 AM, Eric Dumazet wrote:
>
> Strange thing with your crash report is CR2 value, with unexpected value
> of 000000000b388000 while RAX value is dce8dce85d415d41
>
> Faulting instruction is :
>
> 48 83 b8 b0 00 00 00 00 cmpq $0x0,0xb0(%rax)
>
> So I would have expected CR2 being RAX+0xb0, but it's not.
>

Nothing strange about it. You only get page faults and valid cr2 for
canonical addresses (17 high order bits all equal). In this case
rax+0xb0 is not a canonical address, so you got a general protection
fault instead, with cr2 unchanged.
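
The canonical-address check can be reproduced by hand. The following is only a sketch, assuming 48-bit virtual addresses (which matches the "17 high order bits" rule above); the helper name is_canonical is made up for illustration and is not a kernel function.

```python
MASK64 = (1 << 64) - 1

def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if bits 63..(va_bits - 1) of a 64-bit address are all equal.

    va_bits=48 is an assumption (4-level paging); it gives 17 high-order bits.
    """
    addr &= MASK64
    high_bits = 64 - va_bits + 1           # 17 when va_bits == 48
    top = addr >> (va_bits - 1)            # keep only the high-order bits
    return top == 0 or top == (1 << high_bits) - 1

RAX = 0xdce8dce85d415d41                   # value from the crash report
fault_addr = (RAX + 0xb0) & MASK64         # operand of cmpq $0x0,0xb0(%rax)

print(hex(fault_addr), is_canonical(fault_addr))
# 0xdce8dce85d415df1 False  (non-canonical, so #GP and CR2 is not updated)
print(is_canonical(0x000000000b388000))
# True  (the stale CR2 value is an ordinary canonical address)
```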

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

From: Felipe W Damasio on
2010/7/11 Felipe W Damasio <felipewd(a)gmail.com>:
> The production machine has 8GB of RAM:

I'm sorry, this is not right. The production machine has 16GB of RAM.

Don't know if that matters regarding those proc parameters, though.

Cheers,

Felipe Damasio
From: Eric Dumazet on
On Sunday, 11 July 2010 at 08:19 +0300, Avi Kivity wrote:
> On 07/10/2010 09:17 AM, Eric Dumazet wrote:
> >
> > Strange thing with your crash report is CR2 value, with unexpected value
> > of 000000000b388000 while RAX value is dce8dce85d415d41
> >
> > Faulting instruction is :
> >
> > 48 83 b8 b0 00 00 00 00 cmpq $0x0,0xb0(%rax)
> >
> > So I would have expected CR2 being RAX+0xb0, but it's not.
> >
>
> Nothing strange about it. You only get page faults and valid cr2 for
> canonical addresses (17 high order bits all equal). In this case
> rax+0xb0 is not a canonical address, so you got a general protection
> fault instead, with cr2 unchanged.
>

OK, thanks Avi for this information, as I was not aware of this.

So something overwrote the sk->sk_prot pointer (or the skb->sk pointer)
with some data.

TCP sockets are allocated from a dedicated kmem_cache (because of the
SLAB_DESTROY_RCU attribute). Their sk->sk_prot should never change in
normal operation, since the underlying memory cannot be reused by another
object type in the kernel. It should be NULL or &tcp_prot.

Felipe, please describe your configuration in as much detail as possible.
It might be a driver bug triggered by a special kind of network frame.

lsmod
lspci -v
ethtool -k eth0
ethtool -k eth1 (if applicable)



From: Eric Dumazet on
On Saturday, 10 July 2010 at 12:30 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet(a)gmail.com>
> Date: Sat, 10 Jul 2010 08:17:29 +0200
>
> > Strange thing with your crash report is CR2 value, with unexpected value
> > of 000000000b388000 while RAX value is dce8dce85d415d41
> >
> > Faulting instruction is :
> >
> > 48 83 b8 b0 00 00 00 00 cmpq $0x0,0xb0(%rax)
> >
> > So I would have expected CR2 being RAX+0xb0, but it's not.
>
> It could be corruption from elsewhere. Those last four hex
> digits (0x5d415d41) are "]A]A" in ascii, but that could just
> be coincidence.
>

x86 being little-endian, the string is "A]A]" followed by another "XYXY"
pattern (non-ASCII chars: 0xE8, 0xDC, 0xE8, 0xDC, "èÜèÜ" in ISO 8859-1).
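
For anyone who wants to check the byte order, here is a small sketch that lays the RAX value out as it would sit in memory. It assumes the 8 bytes were copied into the pointer as one little-endian quadword, and uses Latin-1 only to make the high bytes printable.

```python
import struct

RAX = 0xdce8dce85d415d41            # value from the crash report

# "<Q" packs an unsigned 64-bit integer in little-endian order,
# i.e. lowest-addressed byte first, as x86 stores it in memory.
raw = struct.pack("<Q", RAX)
text = raw.decode("latin-1")        # ISO 8859-1 maps every byte to one char

print(raw.hex(" "))                 # 41 5d 41 5d e8 dc e8 dc
print(text)                         # A]A]èÜèÜ
```

So both halves of the corrupted pointer are a two-byte pattern repeated twice, which looks more like payload data than a stray pointer value.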


