abnormal (excessive) number of arp requests on subnet? [Linux Networking]

From: Rahul on 10 Jul 2010 17:38

I did a tcpdump like so:

tcpdump -c 1000 -ennqti eth3 \( arp or icmp \)

In a one minute period I get 1000 ARP requests. Is this normal? I
reproduce below the traffic in case this helps diagnosis. The network is
static, no new devices are being added or removed. The MAC<->IP
association is also static. Why is there such a lot of ARP traffic or is
this normal?

The network has ~265 servers. There is only a single physical network but
twin subnets: 10.0.x.x (primary traffic) and 172.16.x.x (monitoring).
i.e. each server has a single physical card but it responds to two MAC and
IP addresses.

Would increasing the size of my ARP cache be a solution? I'm a bit
confused because (as I understand ARP caching) my ARP cache size is set
to 512 or 1024 (not sure which) but the actual ARP table seems to have
only 265 entries (values below). Or is my understanding of ARP wrong?

cat /proc/net/arp | wc -l
265

ip neigh | wc -l
264

cat /proc/sys/net/ipv4/neigh/default/gc_thresh2
512

cat /proc/sys/net/ipv4/neigh/default/gc_thresh3
1024

############################
00:26:b9:58:ec:29 > ff:ff:ff:ff:ff:ff, ARP, length 60: arp reply 172.16.2.5 is-at 00:26:b9:58:ec:29
00:26:b9:58:ec:2a > 00:26:b9:58:d7:2f, ARP, length 60: arp who-has 10.0.3.2 tell 10.0.0.11
00:26:b9:58:ec:2c > ff:ff:ff:ff:ff:ff, ARP, length 60: arp reply 172.16.0.11 is-at 00:26:b9:58:ec:2c
00:26:b9:58:ec:48 > 00:26:b9:58:d7:2f, ARP, length 60: arp who-has 10.0.3.2 tell 10.0.1.66
00:26:b9:58:ec:4a > ff:ff:ff:ff:ff:ff, ARP, length 60: arp reply 172.16.1.66 is-at 00:26:b9:58:ec:4a
00:26:b9:58:ec:56 > ff:ff:ff:ff:ff:ff, ARP, length 60: arp reply 172.16.1.12 is-at 00:26:b9:58:ec:56
00:26:b9:58:ec:5a > 00:26:b9:58:d7:2f, ARP, length 60: arp who-has 10.0.3.2 tell 10.0.0.52
################################

--
Rahul

From: Chris Cox on 10 Jul 2010 21:15

Rahul wrote:
> I did a tcpdump like so:
>
> tcpdump -c 1000 -ennqti eth3 \( arp or icmp \)
>
> In a one minute period I get 1000 ARP requests. Is this normal? I
> reproduce below the traffic in case this helps diagnosis. The network is
> static, no new devices are being added or removed. The MAC<->IP
> association is also static. Why is there such a lot of ARP traffic or is
> this normal?

In general, I'd say pretty normal. Things are always making queries... who-has
messages abound, as well as i-have messages.

From: Moe Trin on 11 Jul 2010 14:27

On Sat, 10 Jul 2010, in the Usenet newsgroup comp.os.linux.networking, in
article <Xns9DB1A958CCCBE6650A1FC0D7811DDBC81(a)188.40.43.230>, Rahul wrote:

>In a one minute period I get 1000 ARP requests. Is this normal?

Depends. How "busy" is the network - how many hosts talking to how
many hosts how often?

>The network is static, no new devices are being added or removed. The
>MAC<->IP association is also static. Why is there such a lot of ARP
>traffic or is this normal?

RFC1122 Section 2.3.2.1 ARP Cache Validation

BRIEFLY - ARP is used to resolve IP->MAC. The querying and answering
systems will keep an individual entry for on the order of one minute.
For the Linux kernel, this is NORMALLY a compile-time setting. You
may be able to increase the timeout.
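
As an aside on the timeout: on kernels that expose the neighbour table
under /proc/sys/net/ipv4/neigh/ (the same tree as the gc_thresh2 and
gc_thresh3 files shown earlier in the thread), the timers can usually be
read at run time. A minimal sketch, not a definitive recipe: the function
name show_neigh_tunables and its optional arguments are made up for
illustration, and the file names are the ones documented in arp(7), so
check that they exist on your kernel.

```shell
# Print neighbour-table (ARP cache) tunables.
# Usage: show_neigh_tunables [IFACE] [BASE_DIR]
#   IFACE    defaults to "default" (system-wide defaults)
#   BASE_DIR defaults to /proc/sys/net/ipv4/neigh (assumed location)
show_neigh_tunables() {
    iface=${1:-default}
    base=${2:-/proc/sys/net/ipv4/neigh}
    for name in base_reachable_time_ms gc_stale_time \
                gc_thresh1 gc_thresh2 gc_thresh3; do
        f="$base/$iface/$name"
        # Some files (such as gc_thresh*) may exist only under "default";
        # anything missing or unreadable is silently skipped.
        if [ -r "$f" ]; then
            printf '%s = %s\n' "$name" "$(cat "$f")"
        fi
    done
}

# Example calls:
#   show_neigh_tunables          # system-wide defaults
#   show_neigh_tunables eth3     # per-interface values
```

According to arp(7), gc_thresh2 and gc_thresh3 are the soft and hard
maximum for the garbage collector rather than a target size, which would
be consistent with a table of ~265 entries under limits of 512 and 1024.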

>The network has ~265 servers. There is only a single physical network
>but twin subnets: 10.0.x.x (primary traffic) and 172.16.x.x
>(monitoring), i.e. each server has a single physical card but it
>responds to two MAC and IP addresses.

Overlaying networks rarely serves any useful purpose other than to
increase overhead. Are you sure this is needed? It's bad enough
with 265 hosts in one collision domain, never mind 530. (Our
original subnet mask was 255.255.252.0 allowing 1000 hosts per
segment - in 1994, we installed Etherswitches to break the coax into
segments with no more than 50 hosts per, resulting in significant
improvement in network speed.). Doing a traffic analysis (who is
talking to who) could be a real eye-opener, suggesting a more efficient
layout. Look at RFC0950 (Internet Standard Subnetting Procedure) and
related documents (RFC0917, RFC0925, RFC0932, RFC0936, RFC0940 and even
RFC1027) which should provide useful background.

>Would increasing the size of my ARP cache be a solution? I'm a bit
>confused because (as I understand ARP caching) my ARP cache size is
>set to 512 or 1024 (not sure which) but the actual ARP table seems
>to have only 265 entries (values below). Or is my understanding of ARP
>wrong?

ARP is used when host A wants to talk to host B. If it doesn't need
to talk to B, why should it be caching B's MAC? Also, how is your
network _physically_ connected? Is this coax (10Base2 or 10Base5)
or twisted pairs with a _hub_ junction? Every host on such a
"party line" hears everyone else, and ARP _may_ be configured to
cache ARP replies heard from "other" systems. If the network is
using switches, _broadcast_ packets are heard by all (depending on
the switch), while _unicast_ packets (ARP replies) are heard only
by the "interested" party. If using switches, you need also look at
the timeouts in the individual switches as well.

>00:26:b9:58:ec:29 > ff:ff:ff:ff:ff:ff, ARP, length 60: arp reply 172.16.2.5 is-at 00:26:b9:58:ec:29
>00:26:b9:58:ec:2a > 00:26:b9:58:d7:2f, ARP, length 60: arp who-has 10.0.3.2 tell 10.0.0.11
>00:26:b9:58:ec:2c > ff:ff:ff:ff:ff:ff, ARP, length 60: arp reply 172.16.0.11 is-at 00:26:b9:58:ec:2c

[compton ~]$ etherwhois 00:26:b9
00-26-B9 (hex) Dell Inc
0026B9 (base 16) Dell Inc
One Dell Way, MS RR5-45
Round Rock Texas 78682
UNITED STATES
[compton ~]$

My condolences.

Something is fucked with your capture data. Example - the first line
shows Dull 58:ec:29 _broadcasting_ an ARP reply. That should be a
unicast from Dull 58:ec:29 to the MAC of the querying system. In
the second line, Dull ec:2a sends a _unicast_ query asking who is
"10.0.3.2". That should be a broadcast unless this is a reconfirm.
You may also want to look at RFC0826, which is the specification for
ARP referenced in RFC1122.

0826 Ethernet Address Resolution Protocol: Or Converting Network
Protocol Addresses to 48.bit Ethernet Address for Transmission on
Ethernet Hardware. D. Plummer. November 1982. (Format: TXT=21556
bytes) (Updated by RFC5227, RFC5494) (Also STD0037) (Status:
STANDARD)

1122 Requirements for Internet Hosts - Communication Layers. R.
Braden, Ed.. October 1989. (Format: TXT=295992 bytes) (Updates
RFC0793) (Updated by RFC1349, RFC4379) (Also STD0003) (Status:
STANDARD)
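
To see how widespread the broadcast-reply / unicast-query pattern is
across a whole capture, the saved text can be tallied. This is only a
sketch: classify_arp is a made-up helper name, and it assumes the capture
was taken with tcpdump -e so that each packet is one line starting with
"SRC_MAC > DST_MAC,".

```shell
# Tally ARP packets from "tcpdump -e" text output (one packet per line)
# by Ethernet destination type and ARP operation.
classify_arp() {
    awk '
        /who-has|is-at/ {
            # With -e the line starts: SRC_MAC > DST_MAC, ...
            dst = $3
            sub(/,$/, "", dst)
            cast = (dst == "ff:ff:ff:ff:ff:ff") ? "broadcast" : "unicast"
            op = ($0 ~ /who-has/) ? "request" : "reply"
            count[cast " " op]++
        }
        END { for (k in count) print count[k], k }
    ' "$@" | sort -k2
}

# Example calls (interface name taken from the original post):
#   tcpdump -c 1000 -ennqti eth3 arp | classify_arp
#   classify_arp capture.txt
```

A normally behaving segment would be expected to show mostly "broadcast
request" and "unicast reply"; a large "broadcast reply" count points at
something announcing itself unsolicited.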

If the number of ARP packets is a concern, look at increasing the
ARP timeout, or simply bite the bullet and use permanent entries in
the arp cache (man arp look at the -s and/or -f options).
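
If you go the permanent-entry route, typing 265 "arp -s" lines by hand is
error-prone. A rough sketch that turns the current table into commands you
can review before running them; neigh_to_arp_s is a made-up helper name,
and it assumes the usual "ADDR dev IF lladdr MAC STATE" layout printed by
"ip neigh show".

```shell
# Convert "ip neigh show" output into "arp -s" commands (printed, not run).
neigh_to_arp_s() {
    awk '
        {
            mac = ""
            for (i = 2; i < NF; i++)
                if ($i == "lladdr") mac = $(i + 1)
            # Entries with no resolved MAC (e.g. FAILED or INCOMPLETE)
            # are skipped.
            if (mac != "") print "arp -s", $1, mac
        }
    ' "$@"
}

# Example call: write the commands to a file, inspect it, then run as root.
#   ip neigh show | neigh_to_arp_s > static-arp.sh
```

Keep in mind that a permanent entry has to be updated by hand whenever a
NIC or motherboard is swapped.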

Old guy

From: Rahul on 12 Jul 2010 15:31

ibuprofin(a)painkiller.example.tld.invalid (Moe Trin) wrote in
news:slrni3k385.dip.ibuprofin(a)compton.phx.az.us:

Thanks Moe for a detailed analysis!

> BRIEFLY - ARP is used to resolve IP->MAC. The querying and answering
> systems will keep an individual entry for on the order of one minute.
> For the Linux kernel, this is NORMALLY a compile-time setting. You
> may be able to increase the timeout.
>

Why is the cache maintained on a time basis? Isn't it more logical to
specify the max number of ARP cache entries? Or are the two approaches
identical?

>>In a one minute period I get 1000 ARP requests. Is this normal?
>
> Depends. How "busy" is the network - how many hosts talking to how
> many hosts how often?

I know there are ~265 physical servers and x2 = 530 IP addresses. The
10.0.x.x should be fairly busy. But I have no way to quantify it right
now. In fact, what tool does one use to answer the question you raised:
"How "busy" is the network?"

Maybe the answer is in the RFC's you quoted. I'm reading them now. But if
anyone has pointers as to how to answer the above question please do
tell. I don't have access to the switches, so I can't get any switch-side
stats, unfortunately. All monitoring will have to be server-side.
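
One server-side starting point might be the per-interface packet counters
the kernel keeps. This is only a sketch under assumptions: iface_pps is a
made-up name, and it assumes the counters live under
/sys/class/net/IFACE/statistics/ as they do on current Linux kernels.

```shell
# Report average receive/transmit packet rates for one interface.
# Usage: iface_pps IFACE [SECONDS] [STATS_ROOT]
iface_pps() {
    stats="${3:-/sys/class/net}/$1/statistics"
    secs=${2:-10}
    # Sample the counters twice, SECONDS apart.
    rx1=$(cat "$stats/rx_packets"); tx1=$(cat "$stats/tx_packets")
    sleep "$secs"
    rx2=$(cat "$stats/rx_packets"); tx2=$(cat "$stats/tx_packets")
    # Integer division: rates below 1 pkt/s are reported as 0.
    echo "rx $(( (rx2 - rx1) / secs )) pkt/s, tx $(( (tx2 - tx1) / secs )) pkt/s"
}

# Example call: average over 60 seconds on the interface from this post.
#   iface_pps eth3 60
```
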
>
> Overlaying networks rarely serves any useful purpose other than to
> increase overhead. Are you sure this is needed?

I am not sure. Maybe my design decision was wrong. The situation is that
we have normal traffic as well as IPMI (maintenance mode) traffic
piggybacking over the same physical wire and adapters. Conceptually I
thought it made sense to keep those separate. But I am open to
suggestions if this was a bad idea.

> It's bad enough
> with 265 hosts in one collision domain, never mind 530.

But that is only relevant for broadcast traffic, correct? Unicast traffic
will be intelligently handled by the switch so that the collision domain
is only equal to the number of switch ports? Pardon my networking
ignorance if this is wrong.

> improvement in network speed.). Doing a traffic analysis (who is
> talking to who) could be a real eye-opener, suggesting a more
> efficient layout.

Is tcpdump the tool of choice for this? Or wireshark? Or something else?

> ARP is used when host A wants to talk to host B. If it doesn't need
> to talk to B, why should it be caching B's MAC?

Is there a downside to having a larger ARP cache? I mean sure, it takes
more memory but these days RAM is cheap and anyways a 1000 row IP<->MAC
lookup table is not a big size.

>Also, how is your
> network _physically_ connected? Is this coax (10Base2 or 10Base5)

It's a 1GigE ethernet cable. I think it's CAT5e (1000BASE-T).

> cache ARP replies heard from "other" systems. If the network is
> using switches, _broadcast_ packets are heard by all (depending on

The network is switched. Each switch takes around 48 hosts so we have 6
Cisco-Catalyst switches interconnected with 10GigE fiber links.

> the switch), while _unicast_ packets (ARP replies) are heard only
> by the "interested" party. If using switches, you need also look at
> the timeouts in the individual switches as well.

Ah! Thanks! I didn't realize the switches have an ARP cache timeout too.
Makes sense. I'll ask my networking folks about that.

>
> My condolences.

For using Dell? :) I'm confused.

> Something is fucked with your capture data. Example - the first line
> shows Dull 58:ec:29 _broadcasting_ an ARP reply. That should be a
> unicast from Dull 58:ec:29 to the MAC of the querying system. In
> the second line, Dull ec:2a sends a _unicast_ query asking who is
> "10.0.3.2". That should be a broadcast unless this is a reconfirm.
> You may also want to look at RFC0826, which is the specification for
> ARP referenced in RFC1122.

Wow! You are right. I never noticed this. I will definitely dig deeper
into this. Something is not right.

--
Rahul

From: Rick Jones on 12 Jul 2010 17:03

Rahul <nospam(a)nospam.invalid> wrote:
> Why is the cache maintained on a time basis?

It helps to bound "fail-over" time when an IP is migrated from being
associated with one MAC address to another.

> Is tcpdump the tool of choice for this? Or wireshark? Or something
> else?

If one is a fan of Star Trek "TOS" tcpdump can be thought of as the
mnemonic memory circuits made from stone knives and bearskins. It is
a basic CLI (command-line interface) packet capture utility.
Wireshark adds a gooey and whatnot. They both use libpcap to perform
actual packet capture. The differences would be in what they can
decode and how they display it.
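
For the "who is talking to who" question, the text output of tcpdump can
be boiled down to a list of busiest host pairs. A minimal sketch only:
top_talkers and the TOP variable are made-up names, and it assumes IPv4
lines of the form "IP SRC[.port] > DST[.port]: ..." as printed with
-nn -q -t and without -e.

```shell
# Count packets per source/destination host pair from tcpdump text output.
# TOP (hypothetical variable) limits how many pairs are shown; default 20.
top_talkers() {
    awk '
        # Strip a trailing colon and any port, keeping the dotted quad.
        function host(s,   a) {
            sub(/:$/, "", s)
            split(s, a, /[.]/)
            return a[1] "." a[2] "." a[3] "." a[4]
        }
        $1 == "IP" && $3 == ">" {
            pairs[host($2) " -> " host($4)]++
        }
        END { for (p in pairs) print pairs[p], p }
    ' "$@" | sort -rn | head -n "${TOP:-20}"
}

# Example call (interface name taken from the original post):
#   tcpdump -c 10000 -nnqti eth3 ip | top_talkers
```

This counts packets, not bytes, so a few bulk transfers and a lot of small
chatter can look alike; it is meant as a first look, nothing more.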

> Ah! Thanks! I didn't realize the switches have an ARP cache timeout too.
> Makes sense. I'll ask my networking folks about that.

Indeed, anything with an ARP cache needs to have a way to keep it
up-to-date.

rick jones
--
oxymoron n, commuter in a gas-guzzling luxury SUV with an American flag
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to rick.jones2 in hp.com but NOT BOTH...
