Commit Graph

20 Commits

Author SHA1 Message Date
Ivan Vecera e4d9647420 netlink: fix netlink_diag_dump() return value
JIRA: https://issues.redhat.com/browse/RHEL-62123

commit 6647b338fc5c6741736fe51a25fc2c0bec6398b8
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Feb 22 10:50:12 2024 +0000

    netlink: fix netlink_diag_dump() return value

    __netlink_diag_dump() returns 1 if the dump is not complete,
    zero if no error occurred.

    If err variable is zero, this means the dump is complete:
    We should not return skb->len in this case, but 0.

    This allows NLMSG_DONE to be appended to the skb.
    User space does not have to call us again only to get NLMSG_DONE.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-10-24 16:14:43 +02:00
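The return convention described in the commit above can be modelled in user-space C. This is a minimal sketch, not the kernel code: `fake_skb`, `inner_dump()` and `outer_dump()` are hypothetical stand-ins, each record is modelled as one byte, and `room` is an assumed per-buffer capacity.

```c
#include <stddef.h>

/* Hypothetical stand-in for an skb: only the number of bytes queued. */
struct fake_skb {
	size_t len;
};

/*
 * Models the inner helper's convention from the commit message:
 * returns 1 if the dump is not complete, 0 if everything was written.
 * Each record is modelled as one byte; 'room' is the free space left.
 */
static int inner_dump(struct fake_skb *skb, int *cursor, int total, int room)
{
	while (*cursor < total) {
		if (room == 0)
			return 1;	/* buffer full, more records remain */
		skb->len++;
		(*cursor)++;
		room--;
	}
	return 0;			/* all records written */
}

/*
 * Models the outer dump callback: a positive return (skb->len) asks the
 * caller to invoke it again, while 0 signals that the dump is finished,
 * so the end-of-dump marker can go into the same buffer.
 */
static int outer_dump(struct fake_skb *skb, int *cursor, int total, int room)
{
	int err = inner_dump(skb, cursor, total, room);

	return err <= 0 ? err : (int)skb->len;
}
```

With five records and room for three per buffer, the first call returns 3 (call again) and the second returns 0 even though it wrote two more records. Returning the buffer length in that second case would cost the caller one extra round trip just to learn that the dump is done.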
Petr Oros d16936e50a netlink: fill in missing MODULE_DESCRIPTION()
JIRA: https://issues.redhat.com/browse/RHEL-30145

Upstream commit(s):
commit 016b9332a3346e97a6cacffea0f9dc10e1235a75
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Nov 1 21:57:24 2023 -0700

    netlink: fill in missing MODULE_DESCRIPTION()

    W=1 builds now warn if a module is built without
    a MODULE_DESCRIPTION(). Fill it in for sock_diag.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2024-04-26 17:16:05 +02:00
Ivan Vecera 374f808933 netlink: convert nlk->flags to atomic flags
JIRA: https://issues.redhat.com/browse/RHEL-30656

commit 8fe08d70a2b61b35a0a1235c78cf321e7528351f
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Aug 11 07:22:26 2023 +0000

    netlink: convert nlk->flags to atomic flags

    sk_diag_put_flags(), netlink_setsockopt(), netlink_getsockopt()
    and others use nlk->flags without correct locking.

    Use set_bit(), clear_bit(), test_bit(), assign_bit() to remove
    data-races.

    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-10 09:19:29 +02:00
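To illustrate why the commit above moves to atomic bit operations, here is a user-space analogue built on C11 `<stdatomic.h>`. It is a sketch under stated assumptions: the `demo_*` helpers and the flag names are hypothetical and only mimic the shape of the kernel's `set_bit()`, `clear_bit()`, `test_bit()` and `assign_bit()`; they are not the kernel implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit numbers, for illustration only. */
enum {
	DEMO_F_CB_RUNNING = 0,
	DEMO_F_PKTINFO = 1,
};

/* Atomically set one bit; a plain "flags |= mask" could lose a concurrent update. */
static void demo_set_bit(unsigned int nr, atomic_ulong *flags)
{
	atomic_fetch_or(flags, 1UL << nr);
}

/* Atomically clear one bit. */
static void demo_clear_bit(unsigned int nr, atomic_ulong *flags)
{
	atomic_fetch_and(flags, ~(1UL << nr));
}

/* Atomically read the word and report whether the bit is set. */
static bool demo_test_bit(unsigned int nr, atomic_ulong *flags)
{
	return (atomic_load(flags) >> nr) & 1UL;
}

/* Set or clear the bit depending on 'value'. */
static void demo_assign_bit(unsigned int nr, atomic_ulong *flags, bool value)
{
	if (value)
		demo_set_bit(nr, flags);
	else
		demo_clear_bit(nr, flags);
}
```

Because every update is a single atomic read-modify-write on the flags word, two threads changing different bits cannot overwrite each other's change, which is the class of data race the commit message refers to.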
Ivan Vecera 764a373f7a netlink: Add __sock_i_ino() for __netlink_diag_dump().
JIRA: https://issues.redhat.com/browse/RHEL-30656

commit 25a9c8a4431c364f97f75558cb346d2ad3f53fbb
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Mon Jun 26 09:43:13 2023 -0700

    netlink: Add __sock_i_ino() for __netlink_diag_dump().

    syzbot reported a warning in __local_bh_enable_ip(). [0]

    Commit 8d61f926d420 ("netlink: fix potential deadlock in
    netlink_set_err()") converted read_lock(&nl_table_lock) to
    read_lock_irqsave() in __netlink_diag_dump() to prevent a deadlock.

    However, __netlink_diag_dump() calls sock_i_ino() that uses
    read_lock_bh() and read_unlock_bh().  If CONFIG_TRACE_IRQFLAGS=y,
    read_unlock_bh() finally enables IRQ even though it should stay
    disabled until the following read_unlock_irqrestore().

    Using read_lock() in sock_i_ino() would trigger a lockdep splat
    in another place that was fixed in commit f064af1e50 ("net: fix
    a lockdep splat"), so let's add __sock_i_ino() that would be safe
    to use under BH disabled.

    [0]:
    WARNING: CPU: 0 PID: 5012 at kernel/softirq.c:376 __local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376
    Modules linked in:
    CPU: 0 PID: 5012 Comm: syz-executor487 Not tainted 6.4.0-rc7-syzkaller-00202-g6f68fc395f49 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
    RIP: 0010:__local_bh_enable_ip+0xbe/0x130 kernel/softirq.c:376
    Code: 45 bf 01 00 00 00 e8 91 5b 0a 00 e8 3c 15 3d 00 fb 65 8b 05 ec e9 b5 7e 85 c0 74 58 5b 5d c3 65 8b 05 b2 b6 b4 7e 85 c0 75 a2 <0f> 0b eb 9e e8 89 15 3d 00 eb 9f 48 89 ef e8 6f 49 18 00 eb a8 0f
    RSP: 0018:ffffc90003a1f3d0 EFLAGS: 00010046
    RAX: 0000000000000000 RBX: 0000000000000201 RCX: 1ffffffff1cf5996
    RDX: 0000000000000000 RSI: 0000000000000201 RDI: ffffffff8805c6f3
    RBP: ffffffff8805c6f3 R08: 0000000000000001 R09: ffff8880152b03a3
    R10: ffffed1002a56074 R11: 0000000000000005 R12: 00000000000073e4
    R13: dffffc0000000000 R14: 0000000000000002 R15: 0000000000000000
    FS:  0000555556726300(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000045ad50 CR3: 000000007c646000 CR4: 00000000003506f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <TASK>
     sock_i_ino+0x83/0xa0 net/core/sock.c:2559
     __netlink_diag_dump+0x45c/0x790 net/netlink/diag.c:171
     netlink_diag_dump+0xd6/0x230 net/netlink/diag.c:207
     netlink_dump+0x570/0xc50 net/netlink/af_netlink.c:2269
     __netlink_dump_start+0x64b/0x910 net/netlink/af_netlink.c:2374
     netlink_dump_start include/linux/netlink.h:329 [inline]
     netlink_diag_handler_dump+0x1ae/0x250 net/netlink/diag.c:238
     __sock_diag_cmd net/core/sock_diag.c:238 [inline]
     sock_diag_rcv_msg+0x31e/0x440 net/core/sock_diag.c:269
     netlink_rcv_skb+0x165/0x440 net/netlink/af_netlink.c:2547
     sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:280
     netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
     netlink_unicast+0x547/0x7f0 net/netlink/af_netlink.c:1365
     netlink_sendmsg+0x925/0xe30 net/netlink/af_netlink.c:1914
     sock_sendmsg_nosec net/socket.c:724 [inline]
     sock_sendmsg+0xde/0x190 net/socket.c:747
     ____sys_sendmsg+0x71c/0x900 net/socket.c:2503
     ___sys_sendmsg+0x110/0x1b0 net/socket.c:2557
     __sys_sendmsg+0xf7/0x1c0 net/socket.c:2586
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x63/0xcd
    RIP: 0033:0x7f5303aaabb9
    Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007ffc7506e548 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5303aaabb9
    RDX: 0000000000000000 RSI: 0000000020000180 RDI: 0000000000000003
    RBP: 00007f5303a6ed60 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007f5303a6edf0
    R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
     </TASK>

    Fixes: 8d61f926d420 ("netlink: fix potential deadlock in netlink_set_err()")
    Reported-by: syzbot+5da61cf6a9bc1902d422@syzkaller.appspotmail.com
    Link: https://syzkaller.appspot.com/bug?extid=5da61cf6a9bc1902d422
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20230626164313.52528-1-kuniyu@amazon.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-10 09:19:27 +02:00
Ivan Vecera e385a6038e netlink: fix potential deadlock in netlink_set_err()
JIRA: https://issues.redhat.com/browse/RHEL-30656

commit 8d61f926d42045961e6b65191c09e3678d86a9cf
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 21 15:43:37 2023 +0000

    netlink: fix potential deadlock in netlink_set_err()

    syzbot reported a possible deadlock in netlink_set_err() [1]

    A similar issue was fixed in commit 1d482e666b ("netlink: disable IRQs
    for netlink_lock_table()") in netlink_lock_table()

    This patch adds IRQ safety to netlink_set_err() and __netlink_diag_dump()
    which were not covered by cited commit.

    [1]

    WARNING: possible irq lock inversion dependency detected
    6.4.0-rc6-syzkaller-00240-g4e9f0ec38852 #0 Not tainted

    syz-executor.2/23011 just changed the state of lock:
    ffffffff8e1a7a58 (nl_table_lock){.+.?}-{2:2}, at: netlink_set_err+0x2e/0x3a0 net/netlink/af_netlink.c:1612
    but this lock was taken by another, SOFTIRQ-safe lock in the past:
     (&local->queue_stop_reason_lock){..-.}-{2:2}

    and interrupts could create inverse lock ordering between them.

    other info that might help us debug this:
     Possible interrupt unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(nl_table_lock);
                                   local_irq_disable();
                                   lock(&local->queue_stop_reason_lock);
                                   lock(nl_table_lock);
      <Interrupt>
        lock(&local->queue_stop_reason_lock);

     *** DEADLOCK ***

    Fixes: 1d482e666b ("netlink: disable IRQs for netlink_lock_table()")
    Reported-by: syzbot+a7d200a347f912723e5c@syzkaller.appspotmail.com
    Link: https://syzkaller.appspot.com/bug?extid=a7d200a347f912723e5c
    Link: https://lore.kernel.org/netdev/000000000000e38d1605fea5747e@google.com/T/#u
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Johannes Berg <johannes.berg@intel.com>
    Link: https://lore.kernel.org/r/20230621154337.1668594-1-edumazet@google.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-04-10 09:19:26 +02:00
Thomas Gleixner 09c434b8a0 treewide: Add SPDX license identifier for more missed files
Add SPDX license identifiers to all files which:

 - Have no license information of any form

 - Have MODULE_LICENCE("GPL*") inside which was used in the initial
   scan/conversion to ignore the file

These files fall under the project license, GPL v2 only. The resulting SPDX
license identifier is:

  GPL-2.0-only

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 10:50:45 +02:00
Tom Herbert 97a6ec4ac0 rhashtable: Change rhashtable_walk_start to return void
Most callers of rhashtable_walk_start don't care about a resize event
which is indicated by a return value of -EAGAIN. So calls to
rhashtable_walk_start are wrapped with code to ignore -EAGAIN. Something
like this is common:

       ret = rhashtable_walk_start(rhiter);
       if (ret && ret != -EAGAIN)
               goto out;

Since zero and -EAGAIN are the only possible return values from the
function this check is pointless. The condition never evaluates to true.

This patch changes rhashtable_walk_start to return void. This simplifies
code for the callers that ignore -EAGAIN. For the few cases where the
caller cares about the resize event, particularly where the table can be
walked in multiple parts for netlink or seq file dump, the function
rhashtable_walk_start_check has been added that returns -EAGAIN on a
resize event.

Signed-off-by: Tom Herbert <tom@quantonium.net>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-11 09:58:38 -05:00
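The claim in the commit above, that the quoted check can never fire when the only possible return values are zero and -EAGAIN, can be verified with a small stand-alone sketch. The `demo_*` names are hypothetical; `demo_walk_start_check()` only models the "report a resize" behaviour the message attributes to rhashtable_walk_start_check.

```c
#include <errno.h>
#include <stdbool.h>

/* The condition quoted in the commit message, applied to a return value. */
static bool old_check_fires(int ret)
{
	return ret && ret != -EAGAIN;
}

/* Hypothetical iterator state: only records whether a resize happened. */
struct demo_iter {
	bool resized;
};

/* Models the "check" variant: -EAGAIN on a resize event, 0 otherwise. */
static int demo_walk_start_check(const struct demo_iter *it)
{
	return it->resized ? -EAGAIN : 0;
}
```

For both values the model can produce, `old_check_fires()` is false, so the `goto out` branch guarded by it is dead code and a void return loses nothing for callers that ignore resizes.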
Andrey Vagin 457c79e544 netlink/diag: report flags for netlink sockets
cb_running is reported in /proc/self/net/netlink and it is reported by
the ss tool, when it gets information from the proc files.

sock_diag is a new interface which is used instead of proc files, so it
looks reasonable that this interface has to report no less information
about sockets than proc files.

We use these flags to dump and restore netlink sockets.

Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-05 07:13:56 -07:00
Eric Dumazet 93636d1f1f netlink: netlink_diag_dump() runs without locks
A recent commit removed locking from netlink_diag_dump() but forgot
one error case.

=====================================
[ BUG: bad unlock balance detected! ]
4.9.0-rc3+ #336 Not tainted
-------------------------------------
syz-executor/4018 is trying to release lock (nl_table_lock) at:
[<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
but there are no more locks to release!

other info that might help us debug this:
3 locks held by syz-executor/4018:
 #0:  (sock_diag_mutex){+.+.+.}, at: [<ffffffff82c3873b>] sock_diag_rcv+0x1b/0x40
 #1:  (sock_diag_table_mutex){+.+.+.}, at: [<ffffffff82c38e00>] sock_diag_rcv_msg+0x140/0x3a0
 #2:  (nlk->cb_mutex){+.+.+.}, at: [<ffffffff82db6600>] netlink_dump+0x50/0xac0

stack backtrace:
CPU: 1 PID: 4018 Comm: syz-executor Not tainted 4.9.0-rc3+ #336
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 ffff8800645df688 ffffffff81b46934 ffffffff84eb3e78 ffff88006ad85800
 ffffffff82dc8683 ffffffff84eb3e78 ffff8800645df6b8 ffffffff812043ca
 dffffc0000000000 ffff88006ad85ff8 ffff88006ad85fd0 00000000ffffffff
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff81b46934>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
 [<ffffffff812043ca>] print_unlock_imbalance_bug+0x17a/0x1a0 kernel/locking/lockdep.c:3388
 [<     inline     >] __lock_release kernel/locking/lockdep.c:3512
 [<ffffffff8120cfd8>] lock_release+0x8e8/0xc60 kernel/locking/lockdep.c:3765
 [<     inline     >] __raw_read_unlock ./include/linux/rwlock_api_smp.h:225
 [<ffffffff83fc001a>] _raw_read_unlock+0x1a/0x30 kernel/locking/spinlock.c:255
 [<ffffffff82dc8683>] netlink_diag_dump+0x1a3/0x250 net/netlink/diag.c:182
 [<ffffffff82db6947>] netlink_dump+0x397/0xac0 net/netlink/af_netlink.c:2110

Fixes: ad20207432 ("netlink: Use rhashtable walk interface in diag dump")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-03 16:16:51 -04:00
Andrey Vagin 733ade23de netlink: don't forget to release a rhashtable_iter structure
This bug was detected by kmemleak:
unreferenced object 0xffff8804269cc3c0 (size 64):
  comm "criu", pid 1042, jiffies 4294907360 (age 13.713s)
  hex dump (first 32 bytes):
    a0 32 cc 2c 04 88 ff ff 00 00 00 00 00 00 00 00  .2.,............
    00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de  ................
  backtrace:
    [<ffffffff8184dffa>] kmemleak_alloc+0x4a/0xa0
    [<ffffffff8124720f>] kmem_cache_alloc_trace+0x10f/0x280
    [<ffffffffa02864cc>] __netlink_diag_dump+0x26c/0x290 [netlink_diag]

v2: don't remove a reference on a rhashtable_iter structure to
    release it from netlink_diag_dump_done

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Fixes: ad20207432 ("netlink: Use rhashtable walk interface in diag dump")
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-07 17:29:38 -07:00
Herbert Xu ad20207432 netlink: Use rhashtable walk interface in diag dump
This patch converts the diag dumping code to use the rhashtable
walk code instead of going through rhashtable by hand.  The lock
nl_table_lock is now only taken while we process the multicast
list as it's not needed for the rhashtable walk.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19 14:40:25 -07:00
Florian Westphal d1b4c689d4 netlink: remove mmapped netlink support
mmapped netlink has a number of unresolved issues:

- TX zerocopy support had to be disabled more than a year ago via
  commit 4682a03586 ("netlink: Always copy on mmap TX.")
  because the content of the mmapped area can change after netlink
  attribute validation but before message processing.

- RX support was implemented mainly to speed up nfqueue dumping packet
  payload to userspace.  However, since commit ae08ce0021
  ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
  with the socket-based interface too (via the skb_zerocopy helper).

The other problem is that skbs attached to an mmapped netlink socket
behave differently from normal skbs:

- they don't have a shinfo area, so all functions that use skb_shinfo()
(e.g. skb_clone) cannot be used.

- reserving headroom prevents userspace from seeing the content as
it expects the message to start at skb->head.
See for instance
commit aa3a022094 ("netlink: not trim skb for mmaped socket when dump").

- skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
crash because it needs the sk to check if a tx ring is attached.

Also not obvious, leads to non-intuitive bug fixes such as 7c7bdf359
("netfilter: nfnetlink: use original skbuff when acking batches").

mmapped netlink also didn't play nicely with the skb_zerocopy helper
used by nfqueue and openvswitch.  Daniel Borkmann fixed this via
commit 6bb0fef489 ("netlink, mmap: fix edge-case leakages in nf queue
zero-copy") but at the cost of also needing to provide remaining
length to the allocation function.

nfqueue also has problems when used with mmapped rx netlink:
- mmapped netlink doesn't allow use of nfqueue batch verdict messages.
  Problem is that in the mmap case, the allocation time also determines
  the ordering in which the frame will be seen by userspace (A
  allocating before B means that A is located in an earlier ring slot,
  but this also means that B might get a lower sequence number than A
  since seqno is decided later).  To fix this we would need to extend the
  spinlocked region to also cover the allocation and message setup which
  isn't desirable.
- nfqueue can now be configured to queue large (GSO) skbs to userspace.
  Queueing GSO packets is faster than having to force a software segmentation
  in the kernel, so this is a desirable option.  However, with a mmap based
  ring one has to use 64kb per ring slot element, else mmap has to fall back
  to the socket path (NL_MMAP_STATUS_COPY) for all large packets.

To use the mmap interface, userspace not only has to probe for mmap netlink
support, it also has to implement a recv/socket receive path in order to
handle messages that exceed the size of an rx ring element.

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-18 11:42:18 -05:00
Johannes Berg 053c095a82 netlink: make nlmsg_end() and genlmsg_end() void
Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.

This makes the very common pattern of

  if (genlmsg_end(...) < 0) { ... }

be a whole bunch of dead code. Many places also simply do

  return nlmsg_end(...);

and the caller is expected to deal with it.

This also commonly (at least for me) causes errors, because it is very
common to write

  if (my_function(...))
    /* error condition */

and if my_function() does "return nlmsg_end()" this is of course wrong.

Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.

Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did

-	return nlmsg_end(...);
+	nlmsg_end(...);
+	return 0;

I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with <= 0 in dump functionality, but that could just
be changed to < 0 with no change in behaviour, so I opted for the more
efficient version.

One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for <0 or <=0 and thus broke out of the loop every single time.
I've preserved this since it will (I think) have caused the messages to
userspace to be formatted differently with just a single message for
every SKB returned to userspace. It's possible that this isn't needed
for the tools that actually use this, but I don't even know what they
are so couldn't test that changing this behaviour would be acceptable.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-18 01:03:45 -05:00
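The pitfall described in the commit above, where a caller treats any nonzero return as an error while the callee returns a positive message length, can be shown with a small model. Everything here is hypothetical: the `demo_*` functions stand in for a fill routine before and after the change, and the 16-byte header size is an assumption made for the demo.

```c
#include <stdbool.h>

/*
 * Hypothetical pre-change style: the fill routine ends with the
 * equivalent of "return nlmsg_end(...)", i.e. it returns the message
 * length, which is always positive because a header is always present.
 */
static int demo_fill_old(int payload_len)
{
	const int header_len = 16;	/* assumed header size for the demo */

	return header_len + payload_len;
}

/* Hypothetical post-change style: finish the message, then return 0. */
static int demo_fill_new(int payload_len)
{
	(void)payload_len;
	return 0;
}

/* The common caller idiom: "if (my_function(...)) error". */
static bool caller_sees_error(int ret)
{
	return ret != 0;
}
```

With the old style the caller reports an error on every successful fill; with the new style success maps to 0, so the idiomatic check behaves as expected.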
Ying Xue c5adde9468 netlink: eliminate nl_sk_hash_lock
As rhashtable_lookup_compare_insert() can guarantee the process
of search and insertion is atomic, it's safe to eliminate the
nl_sk_hash_lock. After this, object insertion or removal will
be protected with per bucket lock on write side while object
lookup is guarded with rcu read lock on read side.

Signed-off-by: Ying Xue <ying.xue@windriver.com>
Cc: Thomas Graf <tgraf@suug.ch>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-13 14:01:00 -05:00
Thomas Graf 88d6ed15ac rhashtable: Convert bucket iterators to take table and index
This patch is in preparation to introduce per bucket spinlocks. It
extends all iterator macros to take the bucket table and bucket
index. It also introduces a new rht_dereference_bucket() to
handle protected accesses to buckets.

It introduces a barrier() to the RCU iterators to prevent
the compiler from caching the first element.

The lockdep verifier is introduced as a stub which always succeeds
and is properly implemented in the next patch when the locks are
introduced.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-03 14:32:56 -05:00
Thomas Graf 6c8f7e7083 netlink: hold nl_sock_hash_lock during diag dump
Although RCU protection would be possible during diag dump, doing
so allows for concurrent table mutations which can render the
in-table offset between individual Netlink messages invalid and
thus cause legitimate sockets to be skipped in the dump.

Since the diag dump is relatively low volume and consistency is
more important than performance, the table mutex is held during
dump.

Reported-by: Andrey Wagin <avagin@gmail.com>
Signed-off-by: Thomas Graf <tgraf@suug.ch>
Fixes: e341694e3e ("netlink: Convert netlink_lookup() to use RCU protected hash table")
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06 19:17:44 -07:00
Thomas Graf e341694e3e netlink: Convert netlink_lookup() to use RCU protected hash table
Heavy Netlink users such as Open vSwitch spend a considerable amount of
time in netlink_lookup() due to the read-lock on nl_table_lock. Use of
RCU relieves the lock contention.

Makes use of the new resizable hash table to avoid locking on the
lookup.

The hash table will grow if entries exceed 75% of table size up to a
total table size of 64K. It will automatically shrink if usage falls
below 30%.

Also splits nl_table_lock into a separate mutex to protect hash table
mutations and allow synchronize_rcu() to sleep while waiting for readers
during expansion and shrinking.

Before:
   9.16%  kpktgend_0  [openvswitch]      [k] masked_flow_lookup
   6.42%  kpktgend_0  [pktgen]           [k] mod_cur_headers
   6.26%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
   6.23%  kpktgend_0  [kernel.kallsyms]  [k] memset
   4.79%  kpktgend_0  [kernel.kallsyms]  [k] netlink_lookup
   4.37%  kpktgend_0  [kernel.kallsyms]  [k] memcpy
   3.60%  kpktgend_0  [openvswitch]      [k] ovs_flow_extract
   2.69%  kpktgend_0  [kernel.kallsyms]  [k] jhash2

After:
  15.26%  kpktgend_0  [openvswitch]      [k] masked_flow_lookup
   8.12%  kpktgend_0  [pktgen]           [k] pktgen_thread_worker
   7.92%  kpktgend_0  [pktgen]           [k] mod_cur_headers
   5.11%  kpktgend_0  [kernel.kallsyms]  [k] memset
   4.11%  kpktgend_0  [openvswitch]      [k] ovs_flow_extract
   4.06%  kpktgend_0  [kernel.kallsyms]  [k] _raw_spin_lock
   3.90%  kpktgend_0  [kernel.kallsyms]  [k] jhash2
   [...]
   0.67%  kpktgend_0  [kernel.kallsyms]  [k] netlink_lookup

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 19:49:38 -07:00
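The resize policy stated in the commit above (grow above 75% usage up to a 64K table, shrink below 30%) can be expressed as two predicates. This is a sketch of the stated thresholds only: the `demo_*` names are hypothetical, and the exact arithmetic used by the kernel's hash table code may differ.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_SIZE ((size_t)64 * 1024)	/* "up to a total table size of 64K" */

/* Grow when entries exceed 75% of the table size, unless already at the cap. */
static bool demo_should_grow(size_t nelems, size_t size)
{
	return size < DEMO_MAX_SIZE && nelems * 4 > size * 3;
}

/* Shrink when usage falls below 30% of the table size. */
static bool demo_should_shrink(size_t nelems, size_t size)
{
	return nelems * 10 < size * 3;
}
```

For a 64-slot table this means growing once a 49th entry is present and shrinking once fewer than 20 remain; the gap between the two thresholds keeps the table from resizing back and forth on small fluctuations.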
David S. Miller 3dec2246c2 netlink: Fix build with mmap disabled.
net/netlink/diag.c: In function 'sk_diag_put_rings_cfg':
net/netlink/diag.c:28:17: error: 'struct netlink_sock' has no member named 'pg_vec_lock'
net/netlink/diag.c:29:29: error: 'struct netlink_sock' has no member named 'rx_ring'
net/netlink/diag.c:31:30: error: 'struct netlink_sock' has no member named 'tx_ring'
net/netlink/diag.c:33:19: error: 'struct netlink_sock' has no member named 'pg_vec_lock'

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-23 15:39:03 -04:00
Patrick McHardy 4ae9fbee16 netlink: add RX/TX-ring support to netlink diag
Based on AF_PACKET.

Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19 14:57:58 -04:00
Andrey Vagin eaaa313926 netlink: Diag core and basic socket info dumping (v2)
The netlink_diag can be built as a module, just like it's done in
unix sockets.

The core dumping message carries the basic info about netlink sockets:
family, type and protocol, portid, dst_group, dst_portid, state.

Groups can be received as an optional parameter NETLINK_DIAG_GROUPS.

Netlink sockets can be filtered by protocols.

The socket inode number and cookie are reserved for future per-socket info
retrieval. The per-protocol filtering is also reserved for future use by
requiring the sdiag_protocol to be zero.

The file /proc/net/netlink doesn't provide enough information for
dumping netlink sockets. It doesn't provide dst_group, dst_portid,
groups above 32.

v2: fix NETLINK_DIAG_MAX. Now it's equal to the last constant.

Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-21 12:38:03 -04:00