From: Cong Wang on
Octavian Purdila wrote:
> This patch series is based on Amerigo's v2 but it now uses a bitmap
> for port reservation.
>
> I've ran a while (1) { bind(0) } test (with ip_local_port_range
> 1024 65000) to see if there is any performance difference between the
> two approaches (ranges vs bitmap). I could not detect any significant
> difference, both cases scored in 2.76s +/- 0.01 on my setup.
>
> I've based this patch series on current net-next, but it contains a
> significant non networking part. Please let me know if I should handle
> this differently.
>
> Octavian Purdila (3):
> sysctl: refactor integer handling proc code
> sysctl: add proc_dobitmap
> net: reserve ports for applications using fixed port numbers
>
> Documentation/networking/ip-sysctl.txt | 12 +
> drivers/infiniband/core/cma.c | 7 +-
> include/linux/sysctl.h | 2 +
> include/net/ip.h | 6 +
> kernel/sysctl.c | 374 +++++++++++++++++++-------------
> net/ipv4/inet_connection_sock.c | 5 +
> net/ipv4/inet_hashtables.c | 2 +
> net/ipv4/sysctl_net_ipv4.c | 7 +
> net/ipv4/udp.c | 3 +-
> net/sctp/socket.c | 2 +
> 10 files changed, 264 insertions(+), 156 deletions(-)
>

Hey, Octavian, typo in netdev list name...

Could you please fix it and resend? So that this will get more reviews.

Thanks!

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Cong Wang on
Octavian Purdila wrote:
> This patch series is based on Amerigo's v2 but it now uses a bitmap
> for port reservation.
>
> I've ran a while (1) { bind(0) } test (with ip_local_port_range
> 1024 65000) to see if there is any performance difference between the
> two approaches (ranges vs bitmap). I could not detect any significant
> difference, both cases scored in 2.76s +/- 0.01 on my setup.
>
> I've based this patch series on current net-next, but it contains a
> significant non networking part. Please let me know if I should handle
> this differently.
>
> Octavian Purdila (3):
> sysctl: refactor integer handling proc code
> sysctl: add proc_dobitmap
> net: reserve ports for applications using fixed port numbers
>

(Sorry for the delay, we are having Chinese new year here.)

Thanks for your work, Octavian!

Your patches look nice, but I don't have much time to review them today,
I will have a detailed look tomorrow.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Cong Wang on
Octavian Purdila wrote:
> This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports
> (bitmap type) which allows users to reserve ports for third-party
> applications.
>
> The reserved ports will not be used by automatic port assignments
> (e.g. when calling connect() or bind() with port number 0). Explicit
> port allocation behavior is unchanged.
>
> Signed-off-by: Octavian Purdila <opurdila(a)ixiacom.com>
> Signed-off-by: WANG Cong <amwang(a)redhat.com>
> Cc: Neil Horman <nhorman(a)tuxdriver.com>
> Cc: Eric Dumazet <eric.dumazet(a)gmail.com>
> ---
> Documentation/networking/ip-sysctl.txt | 12 ++++++++++++
> drivers/infiniband/core/cma.c | 7 ++++++-
> include/net/ip.h | 6 ++++++
> net/ipv4/af_inet.c | 8 +++++++-
> net/ipv4/inet_connection_sock.c | 6 ++++++
> net/ipv4/inet_hashtables.c | 2 ++
> net/ipv4/sysctl_net_ipv4.c | 17 +++++++++++++++++
> net/ipv4/udp.c | 3 ++-
> net/sctp/socket.c | 2 ++
> 9 files changed, 60 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 2dc7a1d..23be7a4 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -564,6 +564,18 @@ ip_local_port_range - 2 INTEGERS
> (i.e. by default) range 1024-4999 is enough to issue up to
> 2000 connections per second to systems supporting timestamps.
>
> +ip_local_reserved_ports - BITMAP of 65536 ports
> + Specify the ports which are reserved for known third-party
> + applications. These ports will not be used by automatic port assignments
> + (e.g. when calling connect() or bind() with port number 0). Explicit
> + port allocation behavior is unchanged.
> +
> + Reserving ports is done by writing positive numbers in this proc entry,
> + clearing them is done by writing negative numbers (e.g. 8080 reserves
> + port number, -8080 makes it available for automatic assignment again).
> +
> + Default: Empty
> +
> ip_nonlocal_bind - BOOLEAN
> If set, allows processes to bind() to non-local IP addresses,
> which can be quite useful - but may break some applications.
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index cc9b594..8248fc6 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -1979,6 +1979,8 @@ retry:
> /* FIXME: add proper port randomization per like inet_csk_get_port */
> do {
> ret = idr_get_new_above(ps, bind_list, next_port, &port);
> + if (inet_is_reserved_local_port(port))
> + ret = -EAGAIN;
> } while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
>
> if (ret)
> @@ -2997,10 +2999,13 @@ static int __init cma_init(void)
> {
> int ret, low, high, remaining;
>
> - get_random_bytes(&next_port, sizeof next_port);
> inet_get_local_port_range(&low, &high);
> +again:
> + get_random_bytes(&next_port, sizeof next_port);
> remaining = (high - low) + 1;
> next_port = ((unsigned int) next_port % remaining) + low;
> + if (inet_is_reserved_local_port(next_port))
> + goto again;
>
> cma_wq = create_singlethread_workqueue("rdma_cm");
> if (!cma_wq)
> diff --git a/include/net/ip.h b/include/net/ip.h
> index fb63371..2e24256 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -184,6 +184,12 @@ extern struct local_ports {
> } sysctl_local_ports;
> extern void inet_get_local_port_range(int *low, int *high);
>
> +extern unsigned long *sysctl_local_reserved_ports;
> +static inline int inet_is_reserved_local_port(int port)
> +{
> + return test_bit(port, sysctl_local_reserved_ports);
> +}
> +
> extern int sysctl_ip_default_ttl;
> extern int sysctl_ip_nonlocal_bind;
>
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 7d12c6a..06810b0 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1546,9 +1546,13 @@ static int __init inet_init(void)
>
> BUILD_BUG_ON(sizeof(struct inet_skb_parm) > sizeof(dummy_skb->cb));
>
> + sysctl_local_reserved_ports = kzalloc(65536 / 8, GFP_KERNEL);
> + if (!sysctl_local_reserved_ports)
> + goto out;
> +


I think we should also consider the ports in ip_local_port_range,
since we can only reserve the ports in that range.


> rc = proto_register(&tcp_prot, 1);
> if (rc)
> - goto out;
> + goto out_free_reserved_ports;
>
> rc = proto_register(&udp_prot, 1);
> if (rc)
> @@ -1647,6 +1651,8 @@ out_unregister_udp_proto:
> proto_unregister(&udp_prot);
> out_unregister_tcp_proto:
> proto_unregister(&tcp_prot);
> +out_free_reserved_ports:
> + kfree(sysctl_local_reserved_ports);
> goto out;
> }
>
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 8da6429..1acb462 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -37,6 +37,9 @@ struct local_ports sysctl_local_ports __read_mostly = {
> .range = { 32768, 61000 },
> };
>
> +unsigned long *sysctl_local_reserved_ports;
> +EXPORT_SYMBOL(sysctl_local_reserved_ports);
> +
> void inet_get_local_port_range(int *low, int *high)
> {
> unsigned seq;
> @@ -108,6 +111,8 @@ again:
>
> smallest_size = -1;
> do {
> + if (inet_is_reserved_local_port(rover))
> + goto next_nolock;
> head = &hashinfo->bhash[inet_bhashfn(net, rover,
> hashinfo->bhash_size)];
> spin_lock(&head->lock);
> @@ -130,6 +135,7 @@ again:
> break;
> next:
> spin_unlock(&head->lock);
> + next_nolock:
> if (++rover > high)
> rover = low;
> } while (--remaining > 0);
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index 2b79377..d3e160a 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -456,6 +456,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
> local_bh_disable();
> for (i = 1; i <= remaining; i++) {
> port = low + (i + offset) % remaining;
> + if (inet_is_reserved_local_port(port))
> + continue;
> head = &hinfo->bhash[inet_bhashfn(net, port,
> hinfo->bhash_size)];
> spin_lock(&head->lock);
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 7e3712c..ce3597a 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -298,6 +298,13 @@ static struct ctl_table ipv4_table[] = {
> .mode = 0644,
> .proc_handler = ipv4_local_port_range,
> },
> + {
> + .procname = "ip_local_reserved_ports",
> + .data = NULL, /* initialized in sysctl_ipv4_init */
> + .maxlen = 65536,
> + .mode = 0644,
> + .proc_handler = proc_dobitmap,
> + },


Isn't there an off-by-one here?

In patch 2/3, you use 0 to set the fist bit, then how about 65535 which
writes 65536th bit? This is beyond the range of port number.


> #ifdef CONFIG_IP_MULTICAST
> {
> .procname = "igmp_max_memberships",
> @@ -721,6 +728,16 @@ static __net_initdata struct pernet_operations ipv4_sysctl_ops = {
> static __init int sysctl_ipv4_init(void)
> {
> struct ctl_table_header *hdr;
> + struct ctl_table *i;
> +
> + for (i = ipv4_table; i->procname; i++) {
> + if (strcmp(i->procname, "ip_local_reserved_ports") == 0) {
> + i->data = sysctl_local_reserved_ports;
> + break;
> + }
> + }
> + if (!i->procname[0])
> + return -EINVAL;
>
> hdr = register_sysctl_paths(net_ipv4_ctl_path, ipv4_table);
> if (hdr == NULL)
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 608a544..bfd0a6a 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -232,7 +232,8 @@ int udp_lib_get_port(struct sock *sk, unsigned short snum,
> */
> do {
> if (low <= snum && snum <= high &&
> - !test_bit(snum >> udptable->log, bitmap))
> + !test_bit(snum >> udptable->log, bitmap) &&
> + !inet_is_reserved_local_port(snum))
> goto found;
> snum += rand;
> } while (snum != first);
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index f6d1e59..1f839d0 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -5432,6 +5432,8 @@ static long sctp_get_port_local(struct sock *sk, union sctp_addr *addr)
> rover++;
> if ((rover < low) || (rover > high))
> rover = low;
> + if (inet_is_reserved_local_port(rover))
> + continue;
> index = sctp_phashfn(rover);
> head = &sctp_port_hashtable[index];
> sctp_spin_lock(&head->lock);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Cong Wang on
Octavian Purdila wrote:
> On Tuesday 16 February 2010 11:37:04 you wrote:
>>> BUILD_BUG_ON(sizeof(struct inet_skb_parm) > sizeof(dummy_skb->cb));
>>>
>>> + sysctl_local_reserved_ports = kzalloc(65536 / 8, GFP_KERNEL);
>>> + if (!sysctl_local_reserved_ports)
>>> + goto out;
>>> +
>> I think we should also consider the ports in ip_local_port_range,
>> since we can only reserve the ports in that range.
>>
>
> That is subject to changes at runtime, which means we will have to readjust
> the bitmap at runtime which introduces the need for additional synchronization
> operations which I would rather avoid.

Why? As long as the bitmap is global, this will not be hard.

Consider that if one user writes a port number which is beyond
the ip_local_port_range into ip_local_reserved_ports, we should
not accept this, because it doesn't make any sense. But with your
patch, we do.


>
>>> + {
>>> + .procname = "ip_local_reserved_ports",
>>> + .data = NULL, /* initialized in sysctl_ipv4_init */
>>> + .maxlen = 65536,
>>> + .mode = 0644,
>>> + .proc_handler = proc_dobitmap,
>>> + },
>> Isn't there an off-by-one here?
>>
>> In patch 2/3, you use 0 to set the fist bit, then how about 65535 which
>> writes 65536th bit? This is beyond the range of port number.
>>
>
> This seems fine to me, 65535 is the value used by both the port checking
> function and the proc read/write function. And it translates indeed to
> 65536th bit, but that is also bit 65535 if you start counting bits from 0
> instead of 1. The usual computing/natural arithmetic confusion for the meaning
> of first :)
>

Oh, I see.

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Cong Wang on
Octavian Purdila wrote:
> On Tuesday 16 February 2010 15:06:26 you wrote:
>> Octavian Purdila wrote:
>>> On Tuesday 16 February 2010 11:37:04 you wrote:
>>>>> BUILD_BUG_ON(sizeof(struct inet_skb_parm) > sizeof(dummy_skb->cb));
>>>>>
>>>>> + sysctl_local_reserved_ports = kzalloc(65536 / 8, GFP_KERNEL);
>>>>> + if (!sysctl_local_reserved_ports)
>>>>> + goto out;
>>>>> +
>>>> I think we should also consider the ports in ip_local_port_range,
>>>> since we can only reserve the ports in that range.
>>> That is subject to changes at runtime, which means we will have to
>>> readjust the bitmap at runtime which introduces the need for additional
>>> synchronization operations which I would rather avoid.
>> Why? As long as the bitmap is global, this will not be hard.
>>
>
> For the more important point see bellow, but with regard to reallocation, this
> means we need to at least use rcu_read_lock() in the fast path to avoid races
> between freeing the old bitmap and doing a read in progress.
>
> Granted, that is a light operation, but would it makes things so much more
> complicated just so that we save one memory page (assuming the range is the
> default [32000 64000] one).


Why not just allocate the bitmap for all ports? 65535/8 bytes are
needed.

>
>> Consider that if one user writes a port number which is beyond
>> the ip_local_port_range into ip_local_reserved_ports, we should
>> not accept this, because it doesn't make any sense. But with your
>> patch, we do.
>>
>
> I think it should be allowed. I see ip_local_reserved_ports and ip_local_range
> as independent settings that can be change at any time.


According to the original purpose, they are not.

>
> That way I can flag port 8080 even if the current range is [32000, 64000] and
> then later I can expand the range to [1024, 64000] without loosing the 8080
> reservation.

Then its meaning is changed, bind(0) will never have chance to get 8080,
thus reserving 8080 for this purpose fails.

I want to always keep its original meaning, if the local_port_range goes
out, then local_reserved_port should be empty at the same time, you have
to reset it after changing local_port_range.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/