From: David Schwartz on
I have a mystery on my hands. I believe I've checked all the obvious
things and it's quite baffling. The symptom is that a machine
sometimes does not reply to pings (or any other traffic) addressed to
an aliased loopback interface. If I log into the machine and force it
to, say, ping the address that can't reach it (using the aliased
loopback address as a source), then magically everything works. For a
while, anyway, then it breaks again.

Here's the setup in more detail: The machine has interfaces in two /24
networks, each of which ultimately connects to the same Internet
gateway. I'll use made up IP addresses:

There are two routers, 1.1.1.1/24 and 1.1.2.1/24. They each handle
their respective /24s. The machine has an interface on each of these
LANs, say 1.1.1.5/24 and 1.1.2.5/24. Both LANs run OSPF, and the
machine has an lo:0 interface numbered 1.1.3.5/32. The machine picks
up a default route from each router over each LAN and installs both
default routes. Each router, in turn, re-advertises its link to
1.1.3.5/32, making the machine reachable through the LAN and the
Internet.
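
To make the addressing concrete, here is a minimal sketch using Python's standard ipaddress module with the made-up addresses above. It only illustrates that 1.1.3.5 lies in neither /24, so anything on the LANs has to reach it through a route (next hop 1.1.1.5 or 1.1.2.5) rather than by treating it as on-link. The function name on_link_interfaces is hypothetical.

```python
import ipaddress

# The two LANs from the description, keyed by the host's interface name.
LANS = {
    "eth0": ipaddress.ip_network("1.1.1.0/24"),
    "eth1": ipaddress.ip_network("1.1.2.0/24"),
}

def on_link_interfaces(address, lans=LANS):
    """Return the interfaces whose subnet contains the given address."""
    addr = ipaddress.ip_address(address)
    return [dev for dev, net in lans.items() if addr in net]

for ip in ("1.1.1.5", "1.1.2.5", "1.1.3.5"):
    # An empty list means the address is only reachable via a route.
    print(ip, on_link_interfaces(ip))
```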

So here's the symptom: A machine (on a remote network) can ping
1.1.1.5 and 1.1.2.5, but pings to 1.1.3.5 get no replies. Routing
looks correct. Running tcpdump on both physical interfaces shows the
pings being received on the 1.1.1.5 interface, as they should be, but
the machine does not appear to be sending replies out either physical
interface.

The second I ping out from that machine (using 1.1.3.5 as the source)
to the machine that can't reach it, the routing works perfectly -- for
a while, anyway.

The machine has a default route out both interfaces to each router at
all times (received by OSPF using Quagga). The route to 1.1.3.5
appears to be advertised correctly across both networks. I've
made sure rp_filter is off on all interfaces.

default proto zebra metric 1
nexthop via 1.1.1.1 dev eth0 weight 1
nexthop via 1.1.2.1 dev eth1 weight 1

1.1.1.0/24 dev eth0 proto kernel scope link src 1.1.1.5
1.1.2.0/24 dev eth1 proto kernel scope link src 1.1.2.5


eth0 Link encap:Ethernet HWaddr <removed>
inet addr:1.1.1.5 Bcast:1.1.1.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

eth1 Link encap:Ethernet HWaddr <removed>
inet addr:1.1.2.5 Bcast:1.1.2.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1

lo:0 Link encap:Local Loopback
inet addr:1.1.3.5 Mask:255.255.255.255
UP LOOPBACK RUNNING MTU:16436 Metric:1

# Controls IP packet forwarding
net.ipv4.ip_forward = 1
# Controls source route verification
net.ipv4.conf.default.rp_filter = 0
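
One thing that may be worth double-checking: the sysctl shown above only sets the "default" entry, which is applied to interfaces created afterwards. For source validation on a given interface, the kernel uses the larger of conf.all.rp_filter and that interface's own value. The sketch below computes that effective value per interface. The sample values are hypothetical (the post does not show conf.all), and the function names are made up for illustration.

```python
from pathlib import Path

def effective_rp_filter(settings):
    """Map each interface to max(all, interface), which is the value the
    kernel uses for source validation. 'all' and 'default' are not
    interfaces, so they are left out of the result."""
    all_value = settings.get("all", 0)
    return {
        name: max(all_value, value)
        for name, value in settings.items()
        if name not in ("all", "default")
    }

def read_rp_filter(base="/proc/sys/net/ipv4/conf"):
    """Read the live values on a Linux box (not called in this demo)."""
    return {
        entry.parent.name: int(entry.read_text().strip())
        for entry in Path(base).glob("*/rp_filter")
    }

# Hypothetical values: 'default' is 0 as in the post, but 'all' is still 1.
sample = {"all": 1, "default": 0, "eth0": 0, "eth1": 0, "lo": 0}
print(effective_rp_filter(sample))
```

On the actual machine, effective_rp_filter(read_rp_filter()) would show whether any interface still ends up with a non-zero value.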

Anyone have a clue what I'm missing?

DS
From: Pascal Hambourg on
Hello,

David Schwartz wrote:
> I have a mystery on my hands. I believe I've checked all the obvious
> things and it's quite baffling. The symptom is that a machine
> sometimes does not reply to pings (or any other traffic) addressed to
> an aliased loopback interface. If I log into the machine and force it
> to, say, ping the address that can't reach it (using the aliased
> loopback address as a source), then magically everything works. For a
> while, anyway, then it breaks again.

Without reading further, I'll venture a guess that the problem may be
related to ARP. The box does not reply to ARP queries for that IP
address, but uses it in its ARP queries, so the other side learns its
MAC address from the received request and stores it in its ARP cache.
Then the ARP cache entry expires and communication fails.
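
The hypothesis above can be sketched as a toy model. This is not how any particular router implements its neighbor cache, only an illustration of the "learned passively, then expired" sequence. The 300-second timeout, the MAC address, and the names ArpCache and router_can_deliver are all assumptions made up for the example.

```python
ARP_TIMEOUT = 300  # seconds; hypothetical, real devices vary

class ArpCache:
    """Toy neighbor cache: entries are learned and later expire."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.entries = {}  # ip -> (mac, time the entry was learned)

    def learn(self, ip, mac, now):
        self.entries[ip] = (mac, now)

    def lookup(self, ip, now):
        entry = self.entries.get(ip)
        if entry is None:
            return None
        mac, learned = entry
        if now - learned > self.timeout:
            del self.entries[ip]  # entry has aged out
            return None
        return mac

def router_can_deliver(cache, ip, now, host_answers_arp):
    """True if the router can resolve a MAC address for ip at time now."""
    if cache.lookup(ip, now) is not None:
        return True
    # Cache miss: the router has to ARP, which only helps if the host replies.
    return host_answers_arp

router = ArpCache(ARP_TIMEOUT)
# At t=0 the host sends an ARP request with 1.1.3.5 as the sender address,
# and the router learns the mapping from that request.
router.learn("1.1.3.5", "00:00:5e:00:53:01", now=0)
for t in (10, 200, 400):
    print(t, router_can_deliver(router, "1.1.3.5", t, host_answers_arp=False))
```

In this model, delivery works while the learned entry is fresh and stops once it ages out, which would match the "works for a while, then breaks" symptom if the hypothesis holds.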
From: David Schwartz on
On Mar 22, 3:56 pm, Pascal Hambourg <boite-a-s...(a)plouf.fr.eu.org>
wrote:

> Not reading further, I take the risk to think that the problem may be
> related to ARP. The box does not reply to ARP queries for that IP
> address,

I would hope not, since that address is not part of any Ethernet
network, and ARP is an Ethernet thing.

> but uses it in its ARP queries, so the other side learns its
> MAC address from the received request and stores it in its ARP cache.

I don't think it's sensible (or even possible) to use an IP address
that's not assigned as an Ethernet interface as the source address in
an ARP query. In any event, I can't think of any scenario in which
this could cause a problem. I know inbound traffic is working. And
outbound traffic is being sent to the router.

> Then the ARP cache entry expires and communication fails.

The ARP entry in the machine or in the router? And the ARP entry for
what destination? I can't think of any possible way this could be an
issue. (But I'll take a look at the ARP queries anyway.)

It certainly could be an ARP issue, but I can't think of how. The
machine only needs the ARP entry for its next hop, which is a
perfectly ordinary ARP case.

DS
From: David Schwartz on
Clearing ARP caches on the routers and this host does not create a
problem. Outbound ARP requests appear to be correct: they are only for
the routers and they are correctly sourced. The ARP table rapidly
fills up with exactly two entries, one for each router.

I don't think it's an ARP issue.

DS
From: David Schwartz on
Well, you were right! It is (somehow!) an ARP issue. The problem
reappeared. I logged into the machine, and I see this:

? (1.1.1.1) at <incomplete> on eth0
? (1.1.2.1) at 00:xx:xx:xx:xx:xx [ether] on eth1

And doing a 'ping 1.1.2.1' fixed the ARP entry and fixed connectivity.
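
Since the problem comes and goes, it may help to catch the bad state automatically. Below is a rough sketch that scans ARP table text in the format pasted above and reports entries still marked <incomplete>. The regular expression is an assumption based only on those two lines, so other arp versions or flags may need adjustments. The name incomplete_neighbors is made up.

```python
import re

# Matches lines such as "? (1.1.1.1) at <incomplete> on eth0" and
# "? (1.1.2.1) at 00:xx:xx:xx:xx:xx [ether] on eth1".
ARP_LINE = re.compile(
    r"^\S+ \((?P<ip>[0-9.]+)\) at (?P<hw>\S+)(?: \[(?P<type>\w+)\])? on (?P<dev>\S+)$"
)

def incomplete_neighbors(arp_output):
    """Return (ip, device) pairs whose hardware address is unresolved."""
    result = []
    for line in arp_output.splitlines():
        match = ARP_LINE.match(line.strip())
        if match and match.group("hw") == "<incomplete>":
            result.append((match.group("ip"), match.group("dev")))
    return result

sample = """\
? (1.1.1.1) at <incomplete> on eth0
? (1.1.2.1) at 00:xx:xx:xx:xx:xx [ether] on eth1
"""
print(incomplete_neighbors(sample))
```

Run periodically against the real table, something like this could log a timestamp whenever a router's entry goes incomplete, which can then be lined up with a tcpdump of ARP traffic.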

Perhaps the problem is the 1.1.1.1 machine not replying to ARPs?
Perhaps the problem is this machine not sending them?

I wonder how OSPF can be stable if I'm losing the ARP entry for that
router.

I'm still baffled, but that's the first forward progress I've made in
some time. Thanks again.

DS