Commit Graph

2210 Commits

Author SHA1 Message Date
Felix Maurer 1e3ab14088 xdp: Fix spurious packet loss in generic XDP TX path
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2120968

commit 1fd6e5675336daf4747940b4285e84b0c114ae32
Author: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Date:   Tue Jul 5 10:23:45 2022 +0200

    xdp: Fix spurious packet loss in generic XDP TX path

    The byte queue limits (BQL) mechanism is intended to move queuing from
    the driver to the network stack in order to reduce latency caused by
    excessive queuing in hardware. However, when transmitting or redirecting
    a packet using generic XDP, the qdisc layer is bypassed and there are no
    additional queues. Since netif_xmit_stopped() also takes BQL limits into
    account, but without having any alternative queuing, packets are
    silently dropped.

    This patch modifies the drop condition to only consider cases when the
    driver itself cannot accept any more packets. This is analogous to the
    condition in __dev_direct_xmit(). Dropped packets are also counted on
    the device.

    Bypassing the qdisc layer in the generic XDP TX path means that XDP
    packets are able to starve other packets going through a qdisc, and
    DDOS attacks will be more effective. In-driver-XDP use dedicated TX
    queues, so they do not have this starvation issue.

    Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/20220705082345.2494312-1-johan.almbladh@anyfinetworks.com

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-11-30 12:47:11 +02:00
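
Note on 1e3ab14088: the change in drop condition described above can be illustrated with a small user-space model. This is only a sketch. The state bits and every `model_*` name are hypothetical stand-ins, not the kernel's definitions. It assumes that the old check treats a BQL stop like a driver stop, while the new check (analogous to the one in __dev_direct_xmit()) reacts only to the driver or a frozen queue.

```c
#include <stdbool.h>

/* Hypothetical per-queue stop reasons; illustrative only. */
enum model_txq_state {
    MODEL_TXQ_DRV_XOFF   = 1u << 0, /* driver cannot accept more packets */
    MODEL_TXQ_STACK_XOFF = 1u << 1, /* stopped by the stack (BQL limit) */
    MODEL_TXQ_FROZEN     = 1u << 2, /* queue temporarily frozen */
};

struct model_dev {
    unsigned long tx_dropped; /* drops are counted on the device */
};

/* Old condition (assumed): any XOFF reason, including BQL, means "stopped". */
static bool model_xmit_stopped(unsigned int state)
{
    return (state & (MODEL_TXQ_DRV_XOFF | MODEL_TXQ_STACK_XOFF)) != 0;
}

/* New condition (assumed): only the driver or a frozen queue blocks TX. */
static bool model_frozen_or_drv_stopped(unsigned int state)
{
    return (state & (MODEL_TXQ_DRV_XOFF | MODEL_TXQ_FROZEN)) != 0;
}

/* Generic XDP TX in this model: returns true if the packet is transmitted. */
static bool model_generic_xdp_tx(struct model_dev *dev, unsigned int state)
{
    if (model_frozen_or_drv_stopped(state)) {
        dev->tx_dropped++; /* no alternative queue exists, so drop and count */
        return false;
    }
    return true;
}
```

In this model a queue that is stopped only by BQL still transmits, which is the spurious loss the commit removes.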
Felix Maurer b06bbd83be net: Use this_cpu_inc() to increment net->core_stats
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2130850

commit 6510ea973d8d9d4a0cb2fb557b36bd1ab3eb49f6
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Mon Apr 25 18:39:46 2022 +0200

    net: Use this_cpu_inc() to increment net->core_stats

    The macro dev_core_stats_##FIELD##_inc() disables preemption and invokes
    netdev_core_stats_alloc() to return a per-CPU pointer.
    netdev_core_stats_alloc() will allocate memory on its first invocation
    which breaks on PREEMPT_RT because it requires non-atomic context for
    memory allocation.

    This can be avoided by enabling preemption in netdev_core_stats_alloc()
    assuming the caller always disables preemption.

    It might be better to replace local_inc() with this_cpu_inc() now that
    dev_core_stats_##FIELD##_inc() gained a preempt-disable section and does
    not rely on already disabled preemption. This results in fewer
    instructions on x86-64:
    local_inc:
    |          incl %gs:__preempt_count(%rip)  # __preempt_count
    |          movq    488(%rdi), %rax # _1->core_stats, _22
    |          testq   %rax, %rax      # _22
    |          je      .L585   #,
    |          add %gs:this_cpu_off(%rip), %rax        # this_cpu_off, tcp_ptr__
    |  .L586:
    |          testq   %rax, %rax      # _27
    |          je      .L587   #,
    |          incq (%rax)            # _6->a.counter
    |  .L587:
    |          decl %gs:__preempt_count(%rip)  # __preempt_count

    this_cpu_inc(), this patch:
    |         movq    488(%rdi), %rax # _1->core_stats, _5
    |         testq   %rax, %rax      # _5
    |         je      .L591   #,
    | .L585:
    |         incq %gs:(%rax) # _18->rx_dropped

    Use unsigned long as type for the counter. Use this_cpu_inc() to
    increment the counter. Use a plain read of the counter.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/YmbO0pxgtKpCw4SY@linutronix.de
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-11-30 12:47:10 +02:00
Felix Maurer a320271336 net: add per-cpu storage and net->core_stats
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2130850
Conflicts:
- drivers/net/vxlan.c: file is not moved to drivers/net/vxlan/vxlan_core.c
  due to missing 6765393614ea8 ("vxlan: move to its own directory");
  context difference due to missing 4095e0e1328a3 ("drivers: vxlan:
  vnifilter: per vni stats")
- net/core/dev.c: code difference in __netif_receive_skb_core due to
  already applied 9f8ed577c2881 ("net: skb: rename
  SKB_DROP_REASON_PTYPE_ABSENT"). Result is like upstream now.
- net/core/gro_cells.c: context difference due to already applied
  5dcd08cd1991 ("net: Fix data-races around netdev_max_backlog.")

commit 625788b5844511cf4c30cffa7fa0bc3a69cebc82
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Mar 10 21:14:20 2022 -0800

    net: add per-cpu storage and net->core_stats

    Before adding yet another possibly contended atomic_long_t,
    it is time to add per-cpu storage for existing ones:
     dev->tx_dropped, dev->rx_dropped, and dev->rx_nohandler

    Because many devices do not have to increment such counters,
    allocate the per-cpu storage on demand, so that dev_get_stats()
    does not have to spend considerable time folding zero counters.

    Note that some drivers have abused these counters which
    were supposed to be only used by core networking stack.

    v4: should use per_cpu_ptr() in dev_get_stats() (Jakub)
    v3: added a READ_ONCE() in netdev_core_stats_alloc() (Paolo)
    v2: add a missing include (reported by kernel test robot <lkp@intel.com>)
        Change in netdev_core_stats_alloc() (Jakub)

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: jeffreyji <jeffreyji@google.com>
    Reviewed-by: Brian Vazquez <brianvv@google.com>
    Reviewed-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Paolo Abeni <pabeni@redhat.com>
    Link: https://lore.kernel.org/r/20220311051420.2608812-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-11-30 12:47:10 +02:00
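
Note on a320271336: the on-demand per-CPU storage and the folding done at read time can be sketched in user space as follows. This is a simplified model, not the kernel code. `MODEL_NR_CPUS` and all `model_*` names are hypothetical, a plain array stands in for real per-CPU memory, and no concurrency is modeled.

```c
#include <stdlib.h>

#define MODEL_NR_CPUS 4 /* hypothetical CPU count for the model */

struct model_core_stats {
    unsigned long rx_dropped;
    unsigned long tx_dropped;
    unsigned long rx_nohandler;
};

struct model_netdev {
    struct model_core_stats *core_stats; /* NULL until a counter is first bumped */
};

/* Allocate the per-CPU array on first use only. */
static struct model_core_stats *model_core_stats_alloc(struct model_netdev *dev)
{
    if (!dev->core_stats)
        dev->core_stats = calloc(MODEL_NR_CPUS, sizeof(*dev->core_stats));
    return dev->core_stats;
}

/* Increment the counter slot that belongs to the given CPU. */
static void model_rx_dropped_inc(struct model_netdev *dev, int cpu)
{
    struct model_core_stats *p = model_core_stats_alloc(dev);

    if (p)
        p[cpu].rx_dropped++;
}

/* Fold the per-CPU values; devices that never counted cost nothing here. */
static unsigned long model_get_rx_dropped(const struct model_netdev *dev)
{
    unsigned long sum = 0;

    if (!dev->core_stats)
        return 0;
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        sum += dev->core_stats[cpu].rx_dropped;
    return sum;
}
```

A device that never drops a packet keeps `core_stats` at NULL, so reading its statistics skips the fold entirely.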
Frantisek Hrbata a03fbb1743 Merge: CNB: Update TC subsystem to upstream v6.0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1567

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170
Tested: Using self-tests, results present in the BZ
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2133511
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2128185

Commits:
```
b20dc3c68458 ("gtp: Allow to create GTP device without FDs")
9af41cc33471 ("gtp: Implement GTP echo response")
d33bd757d362 ("gtp: Implement GTP echo request")
e3acda7ade0a ("net/sched: Allow flower to match on GTP options")
81dd9849fa49 ("gtp: Add support for checking GTP device type")
02f393381d14 ("gtp: Fix inconsistent indenting")
4c096ea2d67c ("net/sched: matchall: Take verbose flag into account when logging error messages")
11c95317bc1a ("net/sched: flower: Take verbose flag into account when logging error messages")
c2ccf84ecb71 ("net/sched: act_api: Add extack to offload_act_setup() callback")
69642c2ab2f5 ("net/sched: act_gact: Add extack messages for offload failure")
4dcaa50d0292 ("net/sched: act_mirred: Add extack message for offload failure")
bca3821d19d9 ("net/sched: act_mpls: Add extack messages for offload failure")
bf3b99e4f9ce ("net/sched: act_pedit: Add extack message for offload failure")
b50e462bc22d ("net/sched: act_police: Add extack messages for offload failure")
a9c64939b669 ("net/sched: act_skbedit: Add extack messages for offload failure")
ee367d44b936 ("net/sched: act_tunnel_key: Add extack message for offload failure")
f8fab3169464 ("net/sched: act_vlan: Add extack message for offload failure")
c440615ffbcb ("net/sched: cls_api: Add extack message for unsupported action offload")
0cba5c34b8f4 ("net/sched: matchall: Avoid overwriting error messages")
fd23e0e250c6 ("net/sched: flower: Avoid overwriting error messages")
c9a40d1c87e9 ("net_sched: make qdisc_reset() smaller")
7463acfbe52a ("netfilter: Rename ingress hook include file")
17d20784223d ("netfilter: Generalize ingress hook include file")
42df6e1d221d ("netfilter: Introduce egress hook")
2f1e85b1aee4 ("net: sched: use queue_mapping to pick tx queue")
38a6f0865796 ("net: sched: support hash selecting tx queue")
285ba06b0edb ("net/sched: flower: Helper function for vlan ethtype checks")
6ee59e554d33 ("net/sched: flower: Reduce identation after is_key_vlan refactoring")
b40003128226 ("net/sched: flower: Add number of vlan tags filter")
99fdb22bc5e9 ("net/sched: flower: Consider the number of tags for vlan filters")
b57c7e8b76c6 ("selftests: forwarding: tc_actions: allow mirred egress test to run on non-offloaded h2")
70f87de9fa0d ("net_sched: em_meta: add READ_ONCE() in var_sk_bound_if()")
a2b1a5d40bd1 ("net/sched: sch_netem: Fix arithmetic in netem_dump() for 32-bit platforms")
1da9e27415bf ("tc-testing: gitignore, delete plugins directory")
6deb209dc6b0 ("net: Print hashed skb addresses for all net and qdisc events")
76b39b94382f ("net/sched: act_api: Notify user space if any actions were flushed before error")
88153e29c1e0 ("selftests: tc-testing: Add testcases to test new flush behaviour")
837ced3a1a5d ("time64.h: consolidate uses of PSEC_PER_NSEC")
d7be266adbfd ("net: sched: provide shim definitions for taprio_offload_{get,free}")
fc54d9065f90 ("net/sched: act_ct: set 'net' pointer when creating new nf_flow_table")
b038177636f8 ("netfilter: nf_flow_table: count pending offload workqueue tasks")
b06ada6df9cf ("netfilter: flowtable: fix incorrect Kconfig dependencies")
83d85bb06915 ("net: extract port range fields from fl_flow_key")
bc5c8260f411 ("net/sched: remove return value of unregister_tcf_proto_ops")
88b3822cdf2f ("net/sched: sch_cbq: Delete unused delay_timer")
ca0cab119288 ("net/sched: remove qdisc_root_lock() helper")
c0f47c2822aa ("net/sched: cls_api: Fix flow action initialization")
5008750eff5d ("net/sched: flower: Add PPPoE filter")
a482d47d33ac ("net/sched: sch_cbq: change the type of cbq_set_lss to void")
06799a9085e1 ("net: bonding: replace dev_trans_start() with the jiffies of the last ARP/NS")
4873a1b2024d ("net/sched: remove hacks added to dev_trans_start() for bonding to work")
9ad36309e271 ("net_sched: cls_route: remove from list when handle is 0")
02799571714d ("net_sched: cls_route: disallow handle of 0")
b05972f01e7d ("net: sched: tbf: don't call qdisc_put() while holding tree lock")
f612466ebecb ("net/sched: fix netdevice reference leaks in attach_default_qdiscs()")
9efd23297cca ("sch_sfb: Don't assume the skb is still around after enqueueing to child")
2f09707d0c97 ("sch_sfb: Also store skb len before calling child enqueue")
db46e3a88a09 ("net/sched: taprio: avoid disabling offload when it was never enabled")
1461d212ab27 ("net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs")
c2e1cfefcac3 ("net: sched: fix possible refcount leak in tc_new_tfilter()")
6e23ec0ba92d ("net: sched: act_ct: fix possible refcount leak in tcf_ct_init()")
ffdd33dd9c12 ("netfilter: core: Fix clang warnings about unused static inlines")
6316136ec6e3 ("netfilter: egress: avoid a lockdep splat")
d645552e9bd9 ("netfilter: egress: Report interface as outgoing")
af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"")
8bdc2acd420c ("net: sched: Fix use after free in red_enqueue()")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-23 02:46:05 -05:00
Frantisek Hrbata 1269719102 Merge: BPF and XDP rebase to v5.18
Merge conflicts:
-----------------
arch/x86/net/bpf_jit_comp.c
        - bpf_arch_text_poke()
          HEAD(!1464) contains b73b002f7f ("x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline")
          Resolved in favour of !1464, but keep the return statement from !1477

MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1477

Bugzilla: https://bugzilla.redhat.com/2120966

Rebase BPF and XDP to the upstream kernel version 5.18

Patch applied, then reverted:
```
544356 selftests/bpf: switch to new libbpf XDP APIs
0bfb95 selftests, bpf: Do not yet switch to new libbpf XDP APIs
```
Taken in the perf rebase:
```
23fcfc perf: use generic bpf_program__set_type() to set BPF prog type
```
Unsupported arches:
```
5c1011 libbpf: Fix riscv register names
cf0b5b libbpf: Fix accessing syscall arguments on riscv
```
Depends on changes of other subsystems:
```
7fc8c3 s390/bpf: encode register within extable entry
aebfd1 x86/ibt,ftrace: Search for __fentry__ location
589127 x86/ibt,bpf: Add ENDBR instructions to prologue and trampoline
```
Broken selftest:
```
edae34 selftests net: add UDP GRO fraglist + bpf self-tests
cf6783 selftests net: fix bpf build error
7b92aa selftests net: fix kselftest net fatal error
```
Out of scope:
```
baebdf net: dev: Makes sure netif_rx() can be invoked in any context.
5c8166 kbuild: replace $(if A,A,B) with $(or A,B)
1a97ce perf maps: Use a pointer for kmaps
967747 uaccess: remove CONFIG_SET_FS
42b01a s390: always use the packed stack layout
bf0882 flow_dissector: Add support for HSR
d09a30 s390/extable: move EX_TABLE define to asm-extable.h
3d6671 s390/extable: convert to relative table with data
4efd41 s390: raise minimum supported machine generation to z10
f65e58 flow_dissector: Add support for HSRv0
1a6d7a netdevsim: Introduce support for L3 offload xstats
9b1894 selftests: netdevsim: hw_stats_l3: Add a new test
84005b perf ftrace latency: Add -n/--use-nsec option
36c4a7 kasan, arm64: don't tag executable vmalloc allocations
8df013 docs: netdev: move the netdev-FAQ to the process pages
4d4d00 perf tools: Update copy of libbpf's hashmap.c
0df6ad perf evlist: Rename cpus to user_requested_cpus
1b8089 flow_dissector: fix false-positive __read_overflow2_field() warning
0ae065 perf build: Fix check for btf__load_from_kernel_by_id() in libbpf
8994e9 perf test bpf: Skip test if clang is not present
735346 perf build: Fix btf__load_from_kernel_by_id() feature check
f037ac s390/stack: merge empty stack frame slots
335220 docs: netdev: update maintainer-netdev.rst reference
a0b098 s390/nospec: remove unneeded header includes
34513a netdevsim: Fix hwstats debugfs file permissions
```

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>
Approved-by: Jan Stancek <jstancek@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Felix Maurer <fmaurer@redhat.com>
Approved-by: Viktor Malik <vmalik@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-21 05:30:47 -05:00
Frantisek Hrbata 27a89b8946 Merge: tcp: BIG TCP implementation
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1560

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
Tested: Using netperf and veth driver. Results meet the assumptions. See https://bugzilla.redhat.com/show_bug.cgi?id=2139501#c1

The series introduces support for BIG TCP.

- Patch 1-2: Preliminary dependencies
- Patch 3-14: Commits from upstream series 7fa2e481ff2f ("Merge branch 'big-tcp'", 2022-05-16)
- Patch 15-19: Follow-ups

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-15 07:30:55 -05:00
Frantisek Hrbata 6fd36e2149 Merge: CNB: net: drop the weight argument from netif_napi_add
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1577

Bugzilla: https://bugzilla.redhat.com/2139498
Tested: build, boot

Change the API of the netif_napi_add family of functions so that `netif_napi_add` and `netif_napi_add_tx` use weight = NAPI_POLL_WEIGHT by default (as most drivers were already doing in one way or another), and add `netif_napi_add_weight` and `netif_napi_add_tx_weight` for drivers that want to specify a custom NAPI weight.

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Torez Smith <torez@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Tony Camuso <tcamuso@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-14 10:28:04 -05:00
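
Note on 6fd36e2149: the shape of the API change can be sketched with a user-space model. The `model_*` types and functions below are hypothetical and deliberately simplified (for example, they take no net_device argument). The value 64 for the default weight is an assumption based on the upstream definition of NAPI_POLL_WEIGHT.

```c
#define MODEL_NAPI_POLL_WEIGHT 64 /* assumed to match upstream NAPI_POLL_WEIGHT */

struct model_napi {
    int (*poll)(struct model_napi *napi, int budget);
    int weight;
};

/* Variant for drivers that want to pass a custom weight. */
static void model_napi_add_weight(struct model_napi *napi,
                                  int (*poll)(struct model_napi *, int),
                                  int weight)
{
    napi->poll = poll;
    napi->weight = weight;
}

/* Default variant: no weight argument, the default weight is implied. */
static void model_napi_add(struct model_napi *napi,
                           int (*poll)(struct model_napi *, int))
{
    model_napi_add_weight(napi, poll, MODEL_NAPI_POLL_WEIGHT);
}

/* Trivial poll callback for the model: reports that no work was done. */
static int model_poll(struct model_napi *napi, int budget)
{
    (void)napi;
    (void)budget;
    return 0;
}
```

Most callers would use the two-argument form, and only drivers with a specific need would call the `_weight` variant.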
Ivan Vecera f31181025a net: sched: use queue_mapping to pick tx queue
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit 2f1e85b1aee459b7d0fd981839042c6a38ffaf0c
Author: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Date:   Sat Apr 16 00:40:45 2022 +0800

    net: sched: use queue_mapping to pick tx queue

    This patch fixes the following issue:
    * If we install tc filters with act_skbedit in the clsact hook,
      they do not work, because netdev_core_pick_tx() overwrites
      queue_mapping.

      $ tc filter ... action skbedit queue_mapping 1

    And this patch is useful:
    * We can use FQ + EDT to implement efficient policies. Tx queues
      are picked by xps, ndo_select_queue of netdev driver, or skb hash
      in netdev_core_pick_tx(). In fact, the netdev driver, and skb
      hash are _not_ under control. xps uses the CPUs map to select Tx
      queues, but in most cases we can't figure out which task_struct of a
      pod/container is running on this CPU. We can use clsact filters to
      classify the traffic of one pod/container to one Tx queue. Why?

      In a container networking environment, there are two kinds of pod/
      container/net-namespace. For one kind (e.g. P1, P2), high throughput
      is key. To avoid running out of network resources, the outbound
      traffic of these pods is limited, and they use or share dedicated
      Tx queues with an HTB/TBF/FQ Qdisc assigned. For the other kind of
      pods (e.g. Pn), low latency of data access is key, and the traffic is
      not limited. These pods use or share other dedicated Tx queues with a
      FIFO Qdisc assigned.
      This choice provides two benefits. First, contention on the HTB/FQ Qdisc
      lock is significantly reduced since fewer CPUs contend for the same queue.
      More importantly, Qdisc contention can be eliminated completely if each
      CPU has its own FIFO Qdisc for the second kind of pods.

      There must be a mechanism in place to support classifying traffic based on
      pods/container to different Tx queues. Note that clsact is outside of Qdisc
      while Qdisc can run a classifier to select a sub-queue under the lock.

      In general recording the decision in the skb seems a little heavy handed.
      This patch introduces a per-CPU variable, suggested by Eric.

      The xmit.skip_txqueue flag is firstly cleared in __dev_queue_xmit().
      - Tx Qdisc may install that skbedit actions, then xmit.skip_txqueue flag
        is set in qdisc->enqueue() though tx queue has been selected in
        netdev_tx_queue_mapping() or netdev_core_pick_tx(). That flag is cleared
        firstly in __dev_queue_xmit(), is useful:
      - Avoid picking Tx queue with netdev_tx_queue_mapping() in next netdev
        in such case: eth0 macvlan - eth0.3 vlan - eth0 ixgbe-phy:
        For example, eth0, macvlan in pod, which root Qdisc install skbedit
        queue_mapping, send packets to eth0.3, vlan in host. In __dev_queue_xmit() of
        eth0.3, clear the flag, does not select tx queue according to skb->queue_mapping
        because there is no filters in clsact or tx Qdisc of this netdev.
        The same action is taken in eth0, ixgbe in the host.
      - Avoid picking Tx queue for next packet. If we set xmit.skip_txqueue
        in tx Qdisc (qdisc->enqueue()), the proper way to clear it is clearing it
        in __dev_queue_xmit when processing next packets.

      For performance reasons, a static key is used. If the user does not
      enable NET_EGRESS, this code is not compiled.

      +----+      +----+      +----+
      | P1 |      | P2 |      | Pn |
      +----+      +----+      +----+
        |           |           |
        +-----------+-----------+
                    |
                    | clsact/skbedit
                    |      MQ
                    v
        +-----------+-----------+
        | q0        | q1        | qn
        v           v           v
      HTB/FQ      HTB/FQ  ...  FIFO

    Cc: Jamal Hadi Salim <jhs@mojatatu.com>
    Cc: Cong Wang <xiyou.wangcong@gmail.com>
    Cc: Jiri Pirko <jiri@resnulli.us>
    Cc: "David S. Miller" <davem@davemloft.net>
    Cc: Jakub Kicinski <kuba@kernel.org>
    Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Alexander Lobakin <alobakin@pm.me>
    Cc: Paolo Abeni <pabeni@redhat.com>
    Cc: Talal Ahmad <talalahmad@google.com>
    Cc: Kevin Hao <haokexin@gmail.com>
    Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Cc: Antoine Tenart <atenart@kernel.org>
    Cc: Wei Wang <weiwan@google.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:02 +01:00
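
Note on f31181025a: the role of the xmit.skip_txqueue flag can be sketched with a single-CPU user-space model. All `model_*` names are hypothetical, a plain global stands in for the per-CPU variable, and the default selection is reduced to a hash modulo the queue count. It is a sketch of the behavior described above, not the kernel implementation.

```c
#include <stdbool.h>

struct model_skb {
    unsigned int queue_mapping;
    unsigned int hash;
};

/* Stand-in for the per-CPU xmit.skip_txqueue flag (one CPU modeled). */
static bool model_skip_txqueue;

/* What an skbedit queue_mapping action does in this model. */
static void model_skbedit_queue_mapping(struct model_skb *skb, unsigned int q)
{
    skb->queue_mapping = q;
    model_skip_txqueue = true; /* tell the xmit path to keep this choice */
}

/* Default Tx queue selection, reduced here to hash modulo queue count. */
static unsigned int model_core_pick_tx(const struct model_skb *skb,
                                       unsigned int nr_queues)
{
    return skb->hash % nr_queues;
}

/*
 * Transmit path of one device. filter_queue < 0 means the device has no
 * clsact filter installed. Returns the Tx queue index that was chosen.
 */
static unsigned int model_dev_queue_xmit(struct model_skb *skb,
                                         unsigned int nr_queues,
                                         int filter_queue)
{
    model_skip_txqueue = false; /* cleared first, for every packet and device */

    if (filter_queue >= 0)
        model_skbedit_queue_mapping(skb, (unsigned int)filter_queue);

    if (model_skip_txqueue)
        return skb->queue_mapping % nr_queues; /* honor the filter */
    return model_core_pick_tx(skb, nr_queues);
}
```

Because the flag is cleared at the start of every transmit, a lower device without filters falls back to the default selection even though the skb still carries the queue_mapping set on the upper device.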
Ivan Vecera d545c120ec netfilter: Introduce egress hook
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit 42df6e1d221dddc0f2acf2be37e68d553ad65f96
Author: Lukas Wunner <lukas@wunner.de>
Date:   Fri Oct 8 22:06:03 2021 +0200

    netfilter: Introduce egress hook

    Support classifying packets with netfilter on egress to satisfy user
    requirements such as:
    * outbound security policies for containers (Laura)
    * filtering and mangling intra-node Direct Server Return (DSR) traffic
      on a load balancer (Laura)
    * filtering locally generated traffic coming in through AF_PACKET,
      such as local ARP traffic generated for clustering purposes or DHCP
      (Laura; the AF_PACKET plumbing is contained in a follow-up commit)
    * L2 filtering from ingress and egress for AVB (Audio Video Bridging)
      and gPTP with nftables (Pablo)
    * in the future: in-kernel NAT64/NAT46 (Pablo)

    The egress hook introduced herein complements the ingress hook added by
    commit e687ad60af ("netfilter: add netfilter ingress hook after
    handle_ing() under unique static key").  A patch for nftables to hook up
    egress rules from user space has been submitted separately, so users may
    immediately take advantage of the feature.

    Alternatively or in addition to netfilter, packets can be classified
    with traffic control (tc).  On ingress, packets are classified first by
    tc, then by netfilter.  On egress, the order is reversed for symmetry.
    Conceptually, tc and netfilter can be thought of as layers, with
    netfilter layered above tc.

    Traffic control is capable of redirecting packets to another interface
    (man 8 tc-mirred).  E.g., an ingress packet may be redirected from the
    host namespace to a container via a veth connection:
    tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)

    In this case, netfilter egress classifying is not performed when leaving
    the host namespace!  That's because the packet is still on the tc layer.
    If tc redirects the packet to a physical interface in the host namespace
    such that it leaves the system, the packet is never subjected to
    netfilter egress classifying.  That is only logical since it hasn't
    passed through netfilter ingress classifying either.

    Packets can alternatively be redirected at the netfilter layer using
    nft fwd.  Such a packet *is* subjected to netfilter egress classifying
    since it has reached the netfilter layer.

    Internally, the skb->nf_skip_egress flag controls whether netfilter is
    invoked on egress by __dev_queue_xmit().  Because __dev_queue_xmit() may
    be called recursively by tunnel drivers such as vxlan, the flag is
    reverted to false after sch_handle_egress().  This ensures that
    netfilter is applied both on the overlay and underlying network.

    Interaction between tc and netfilter is possible by setting and querying
    skb->mark.

    If netfilter egress classifying is not enabled on any interface, it is
    patched out of the data path by way of a static_key and doesn't make a
    performance difference that is discernible from noise:

    Before:             1537 1538 1538 1537 1538 1537 Mb/sec
    After:              1536 1534 1539 1539 1539 1540 Mb/sec
    Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
    After  + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec
    Before + tc drop:   1620 1619 1619 1619 1620 1620 Mb/sec
    After  + tc drop:   1616 1624 1625 1624 1622 1619 Mb/sec

    When netfilter egress classifying is enabled on at least one interface,
    a minimal performance penalty is incurred for every egress packet, even
    if the interface it's transmitted over doesn't have any netfilter egress
    rules configured.  That is caused by checking dev->nf_hooks_egress
    against NULL.

    Measurements were performed on a Core i7-3615QM.  Commands to reproduce:
    ip link add dev foo type dummy
    ip link set dev foo up
    modprobe pktgen
    echo "add_device foo" > /proc/net/pktgen/kpktgend_3
    samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1

    Accept all traffic with tc:
    tc qdisc add dev foo clsact
    tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'

    Drop all traffic with tc:
    tc qdisc add dev foo clsact
    tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'

    Apply this patch when measuring packet drops to avoid errors in dmesg:
    https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/

    Signed-off-by: Lukas Wunner <lukas@wunner.de>
    Cc: Laura García Liébana <nevola@gmail.com>
    Cc: John Fastabend <john.fastabend@gmail.com>
    Cc: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Thomas Graf <tgraf@suug.ch>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:01 +01:00
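
Note on d545c120ec: the egress ordering (netfilter first, then tc) and the effect of skb->nf_skip_egress can be sketched with a user-space model. The `model_*` names and the trace string are hypothetical. The model assumes that the flag is raised while tc runs, so that a packet redirected by tc stays on the tc layer, and that it is reverted to false afterwards, as the message describes.

```c
#include <stdbool.h>
#include <string.h>

struct model_skb {
    bool nf_skip_egress;
    char trace[64]; /* records which layers classified the packet */
};

/* Append one step to the trace, never overflowing the buffer. */
static void model_note(struct model_skb *skb, const char *step)
{
    strncat(skb->trace, step, sizeof(skb->trace) - strlen(skb->trace) - 1);
}

/*
 * Egress path of one device. If tc_redirects is true, the tc layer
 * forwards the packet to another device, modeled as a recursive call.
 */
static void model_dev_queue_xmit(struct model_skb *skb, bool tc_redirects)
{
    if (!skb->nf_skip_egress)
        model_note(skb, "nf;"); /* netfilter egress runs first */

    skb->nf_skip_egress = true; /* anything tc forwards skips netfilter */
    model_note(skb, "tc;");
    if (tc_redirects) {
        model_dev_queue_xmit(skb, false);
        return;
    }
    skb->nf_skip_egress = false; /* reverted after tc egress handling */
}
```

A packet sent normally is traced as "nf;tc;", while a packet redirected by tc passes only the tc layer on the target device.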
Ivan Vecera 866706749c netfilter: Generalize ingress hook include file
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit 17d20784223d52bf1671f984c9e8d5d9b8ea171b
Author: Lukas Wunner <lukas@wunner.de>
Date:   Fri Oct 8 22:06:02 2021 +0200

    netfilter: Generalize ingress hook include file

    Prepare for addition of a netfilter egress hook by generalizing the
    ingress hook include file.

    No functional change intended.

    Signed-off-by: Lukas Wunner <lukas@wunner.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:01 +01:00
Ivan Vecera 3ccbb377fc netfilter: Rename ingress hook include file
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit 7463acfbe52ae8b7e0ea6890c1886b3f8ba8bddd
Author: Lukas Wunner <lukas@wunner.de>
Date:   Fri Oct 8 22:06:01 2021 +0200

    netfilter: Rename ingress hook include file

    Prepare for addition of a netfilter egress hook by renaming
    <linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.

    The egress hook also necessitates a refactoring of the include file,
    but that is done in a separate commit to ease reviewing.

    No functional change intended.

    Signed-off-by: Lukas Wunner <lukas@wunner.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:01 +01:00
Frantisek Hrbata 0fe0e3e4d8 Merge: CNB: net: HW counters for soft devices
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1580

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2140149
Tested: Using netdevsim hw_stats_l3.sh self-test

Commits:
```
22b67d17194f ("net: rtnetlink: rtnl_stats_get(): Emit an extack for unset filter_mask")
6b524a1d012b ("net: rtnetlink: Namespace functions related to IFLA_OFFLOAD_XSTATS_*")
f6e0fb812988 ("net: rtnetlink: Stop assuming that IFLA_OFFLOAD_XSTATS_* are dev-backed")
46efc97b7306 ("net: rtnetlink: RTM_GETSTATS: Allow filtering inside nests")
05415bccbb09 ("net: rtnetlink: Propagate extack to rtnl_offload_xstats_fill()")
216e690631f5 ("net: rtnetlink: rtnl_fill_statsinfo(): Permit non-EMSGSIZE error returns")
9309f97aef6d ("net: dev: Add hardware stats support")
0e7788fd7622 ("net: rtnetlink: Add UAPI for obtaining L3 offload xstats")
03ba35667091 ("net: rtnetlink: Add RTM_SETSTATS")
5fd0b838efac ("net: rtnetlink: Add UAPI toggle for IFLA_OFFLOAD_XSTATS_L3_STATS")
ba95e7930957 ("selftests: forwarding: hw_stats_l3: Add a new test")
57d29a2935c9 ("net: rtnetlink: fix error handling in rtnl_fill_statsinfo()")
23cfe941b52e ("rtnetlink: Fix handling of disabled L3 stats in RTM_GETSTATS replies")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Petr Oros <poros@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Wander Lairson Costa <wander@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-08 09:08:22 -05:00
Frantisek Hrbata 5ac5a1dfd0 Merge: CNB: net: disambiguate the TSO and GSO limits
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1419

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180
Tested: Using iperf3 and toggling gso/tso offloading knobs

Commits:
```
2106efda785b ("net: remove .ndo_change_proto_down")
2cc6cdd44a16 ("net: unexport a handful of dev_* functions")
6264f58ca0e5 ("net: extract a few internals from netdevice.h")
6df6398f7c8b ("net: add netif_inherit_tso_max()")
14d7b8122fd5 ("net: don't allow user space to lift the device limits")
ee8b7a1156f3 ("net: make drivers set the TSO limit not the GSO limit")
744d49daf8bd ("net: move netif_set_gso_max helpers")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-11-05 02:54:07 -04:00
Ivan Vecera a5a7be252a net: dev: Add hardware stats support
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2140149

commit 9309f97aef6d8250bb484dabeac925c3a7c57716
Author: Petr Machata <petrm@nvidia.com>
Date:   Wed Mar 2 18:31:20 2022 +0200

    net: dev: Add hardware stats support

    Offloading switch device drivers may be able to collect statistics of the
    traffic taking place in the HW datapath that pertains to a certain soft
    netdevice, such as VLAN. Add the necessary infrastructure to allow exposing
    these statistics to the offloaded netdevice in question. The API was shaped
    by the following considerations:

    - Collection of HW statistics is not free: there may be a finite number of
      counters, and the act of counting may have a performance impact. It is
      therefore necessary to allow toggling whether HW counting should be done
      for any particular SW netdevice.

    - As the drivers are loaded and removed, a particular device may get
      offloaded and unoffloaded again. At the same time, the statistics values
      need to stay monotonic (modulo the eventual 64-bit wraparound),
      increasing only to reflect traffic measured in the device.

      To that end, the netdevice keeps around a lazily-allocated copy of struct
      rtnl_link_stats64. Device drivers then contribute to the values kept
      therein at various points. Even as the driver goes away, the struct stays
      around to maintain the statistics values.

    - Different HW devices may be able to count different things. The
      motivation behind this patch in particular is exposure of HW counters on
      Nvidia Spectrum switches, where the only practical approach to counting
      traffic on offloaded soft netdevices currently is to use router interface
      counters, and count L3 traffic. Correspondingly that is the statistics
      suite added in this patch.

      Other devices may be able to measure different kinds of traffic, and for
      that reason, the APIs are built to allow uniform access to different
      statistics suites.

    - Because soft netdevices and offloading drivers are only loosely bound, a
      netdevice uses a notifier chain to communicate with the drivers. Several
      new notifiers, NETDEV_OFFLOAD_XSTATS_*, have been added to carry messages
      to the offloading drivers.

    - Devices can have various conditions for when a particular counter is
      available. As the device is configured and reconfigured, the device
      offload may become or cease being suitable for counter binding. A
      netdevice can use a notifier type NETDEV_OFFLOAD_XSTATS_REPORT_USED to
      ping offloading drivers and determine whether anyone currently implements
      a given statistics suite. This information can then be propagated to user
      space.

      When the driver decides to unoffload a netdevice, it can use a
      newly-added function, netdev_offload_xstats_report_delta(), to record
      outstanding collected statistics, before destroying the HW counter.

    This patch adds a helper, call_netdevice_notifiers_info_robust(), for
    dispatching a notifier with the possibility of unwind when one of the
    consumers bails. Given the wish to eventually get rid of the global
    notifier block altogether, this helper only invokes the per-netns notifier
    block.

    Signed-off-by: Petr Machata <petrm@nvidia.com>
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-04 17:15:40 +01:00
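The lazily-allocated statistics copy and the delta reporting described in the commit above can be pictured with a small userspace sketch. Every name here (`struct hw_stats`, `struct soft_netdev`, `stats_enable()`, `stats_report_delta()`, `stats_release()`) is a hypothetical stand-in, not the kernel API; the sketch only assumes that drivers add increments to storage owned by the netdevice, so the totals outlive the driver.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-in for the per-netdevice statistics copy. */
struct hw_stats {
	uint64_t rx_packets;
	uint64_t rx_bytes;
	uint64_t tx_packets;
	uint64_t tx_bytes;
};

/* Hypothetical soft netdevice: the stats pointer stays NULL until needed. */
struct soft_netdev {
	struct hw_stats *l3_stats;
};

/* Enable counting: allocate zeroed storage on first use only. */
static int stats_enable(struct soft_netdev *dev)
{
	if (dev->l3_stats)
		return 0;
	dev->l3_stats = calloc(1, sizeof(*dev->l3_stats));
	return dev->l3_stats ? 0 : -1;
}

/* Driver side: fold in the traffic counted since the previous report. */
static void stats_report_delta(struct soft_netdev *dev,
			       const struct hw_stats *delta)
{
	if (!dev->l3_stats)
		return;
	dev->l3_stats->rx_packets += delta->rx_packets;
	dev->l3_stats->rx_bytes += delta->rx_bytes;
	dev->l3_stats->tx_packets += delta->tx_packets;
	dev->l3_stats->tx_bytes += delta->tx_bytes;
}

/* Disable counting and drop the storage. */
static void stats_release(struct soft_netdev *dev)
{
	free(dev->l3_stats);
	dev->l3_stats = NULL;
}
```

Because only increments are ever added, the values stay monotonic even if the offloading driver is unloaded and loaded again between reports.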
Íñigo Huguet 4ed32c17b9 netdev: reshuffle netif_napi_add() APIs to allow dropping weight
Bugzilla: https://bugzilla.redhat.com/2139498

commit 58caed3dacb4354a25a1aa8d2febc3e9648ba1f4
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon May 2 16:27:03 2022 -0700

    netdev: reshuffle netif_napi_add() APIs to allow dropping weight
    
    Most drivers should not have to worry about selecting the right
    weight for their NAPI instances and pass NAPI_POLL_WEIGHT.
    It'd be best if we didn't require the argument at all and selected
    the default internally.
    
    This change prepares the ground for such reshuffling, allowing
    for a smooth transition. The following API should remain after
    the next release cycle:
      netif_napi_add()
      netif_napi_add_weight()
      netif_napi_add_tx()
      netif_napi_add_tx_weight()
    Where the _weight() variants take an explicit weight argument.
    I opted for a _weight() suffix rather than a __ prefix, because
    we use __ in places to mean that caller needs to also issue a
    synchronize_net() call.
    
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/20220502232703.396351-1-kuba@kernel.org
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
2022-11-04 16:46:33 +01:00
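The API shape that the commit above aims for can be sketched as a default-weight wrapper around an explicit-weight variant. This is a simplified illustration with made-up `sketch_*` names and a reduced struct; the real helpers also take a `struct net_device` argument and do much more. `NAPI_POLL_WEIGHT` is the kernel's default of 64.

```c
#define NAPI_POLL_WEIGHT 64

struct sketch_napi;
typedef int (*sketch_poll_fn)(struct sketch_napi *napi, int budget);

/* Hypothetical, reduced stand-in for struct napi_struct. */
struct sketch_napi {
	sketch_poll_fn poll;
	int weight;
};

/* The _weight() variant takes the weight explicitly. */
static void sketch_napi_add_weight(struct sketch_napi *napi,
				   sketch_poll_fn poll, int weight)
{
	napi->poll = poll;
	napi->weight = weight;
}

/* The plain variant selects the default weight internally. */
static void sketch_napi_add(struct sketch_napi *napi, sketch_poll_fn poll)
{
	sketch_napi_add_weight(napi, poll, NAPI_POLL_WEIGHT);
}
```

With this split, most drivers would call the plain variant and only the few that need a non-default weight would use the `_weight()` form.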
Ivan Vecera fccce056fa net: allow gro_max_size to exceed 65536
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501

commit 0fe79f28bfaf73b66b7b1562d2468f94aa03bd12
Author: Alexander Duyck <alexanderduyck@fb.com>
Date:   Fri May 13 11:34:03 2022 -0700

    net: allow gro_max_size to exceed 65536

    Allow gro_max_size to exceed 65536.

    There weren't really any external limitations that prevented this other
    than the fact that IPv4 only supports a 16 bit length field. Since we have
    the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to
    exceed this value and for IPv4 and non-TCP flows we can cap things at 65536
    via a constant rather than relying on gro_max_size.

    [edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows.

    Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-02 18:56:09 +01:00
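The capping rule from the commit above can be written down as a minimal sketch. It assumes the two limits named in the message (65536 for IPv4 and non-TCP flows, and the `8 * 65535` overflow guard); the helper `sketch_gro_limit()` and its `ipv6_tcp` flag are hypothetical and do not mirror the actual GRO code paths.

```c
#include <stdbool.h>

#define GRO_LEGACY_MAX_SIZE 65536u        /* cap for IPv4 and non-TCP flows */
#define GRO_MAX_SIZE        (8 * 65535u)  /* overall upper bound */

/*
 * Hypothetical helper: return the largest aggregate size allowed for a
 * flow, given the per-device gro_max_size setting.
 */
static unsigned int sketch_gro_limit(unsigned int gro_max_size, bool ipv6_tcp)
{
	if (gro_max_size > GRO_MAX_SIZE)
		gro_max_size = GRO_MAX_SIZE;
	if (!ipv6_tcp && gro_max_size > GRO_LEGACY_MAX_SIZE)
		return GRO_LEGACY_MAX_SIZE;
	return gro_max_size;
}
```

Only TCP over IPv6 can go past the legacy limit here, because that is the case where a hop-by-hop header can carry the larger length.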
Ivan Vecera d513603ec1 net: allow gso_max_size to exceed 65536
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501

commit 7c4e983c4f3cf94fcd879730c6caa877e0768a4d
Author: Alexander Duyck <alexanderduyck@fb.com>
Date:   Fri May 13 11:33:57 2022 -0700

    net: allow gso_max_size to exceed 65536

    The code for gso_max_size was added originally to allow for debugging and
    workaround of buggy devices that couldn't support TSO with blocks 64K in
    size. The original reason for limiting it to 64K was because that was the
    existing limits of IPv4 and non-jumbogram IPv6 length fields.

    With the addition of Big TCP we can remove this limit and allow the value
    to potentially go up to UINT_MAX and instead be limited by the tso_max_size
    value.

    So in order to support this we need to go through and clean up the
    remaining users of the gso_max_size value so that the values will cap at
    64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
    so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
    limit for GSO_MAX_SIZE.

    v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
                   in a new sk_trim_gso_size() helper.
                   netif_set_tso_max_size() caps the requested TSO size
                   with GSO_MAX_SIZE.

    Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-02 18:55:52 +01:00
Ivan Vecera 017d0aca36 gro: add ability to control gro max packet size
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139501

Conflicts:
- context due to existing backport of 14d7b8122fd5 ("net: don't allow
  user space to lift the device limits")

commit eac1b93c14d645ef147b049ace0d5230df755548
Author: Coco Li <lixiaoyan@google.com>
Date:   Wed Jan 5 02:48:38 2022 -0800

    gro: add ability to control gro max packet size

    Eric Dumazet suggested to allow users to modify max GRO packet size.

    We have seen GRO being disabled by users of appliances (such as
    wifi access points) because of claimed bufferbloat issues,
    or some workarounds in sch_cake, to split GRO/GSO packets.

    Instead of disabling GRO completely, one can choose to limit
    the maximum packet size of GRO packets, depending on their
    latency constraints.

    This patch adds a per device gro_max_size attribute
    that can be changed with ip link command.

    ip link set dev eth0 gro_max_size 16000

    Suggested-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Coco Li <lixiaoyan@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-02 18:55:37 +01:00
Paolo Abeni 022665bacd net: skb: introduce and use a single page frag cache
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1

Upstream commit:
commit dbae2b062824fc2d35ae2d5df2f500626c758e80
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Sep 28 10:43:09 2022 +0200

    net: skb: introduce and use a single page frag cache

    After commit 3226b158e6 ("net: avoid 32 x truesize under-estimation
    for tiny skbs") we are observing 10-20% regressions in performance
    tests with small packets. The perf trace points to high pressure on
    the slab allocator.

    This change tries to improve the allocation scheme for small packets
    using an idea originally suggested by Eric: a new per CPU page frag is
    introduced and used in __napi_alloc_skb to cope with small allocation
    requests.

    To ensure that the above does not lead to excessive truesize
    underestimation, the frag size for small allocation is inflated to 1K
    and all the above is restricted to build with 4K page size.

    Note that we need to update accordingly the run-time check introduced
    with commit fd9ea57f4e95 ("net: add napi_get_frags_check() helper").

    Alex suggested a smart page refcount scheme to reduce the number
    of atomic operations and deal properly with pfmemalloc pages.

    Under small packet UDP flood, I measure a 15% increase in peak throughput.

    Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
    Suggested-by: Alexander H Duyck <alexanderduyck@fb.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
    Link: https://lore.kernel.org/r/6b6f65957c59f86a353fc09a5127e83a32ab5999.1664350652.git.pabeni@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-27 19:12:04 +02:00
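The idea of serving small allocations as 1K fragments of a single 4K page can be sketched as below. This is a deliberately reduced userspace model with hypothetical names (`struct frag_cache`, `frag_cache_init()`, `frag_alloc_1k()`); it leaves out the per-CPU placement, page refcounting, pfmemalloc handling and page refill that the real change has to deal with.

```c
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_FRAG_SIZE 1024u

/* Hypothetical cache: one page plus the number of bytes still unused. */
struct frag_cache {
	unsigned char page[SKETCH_PAGE_SIZE];
	unsigned int remaining;
};

static void frag_cache_init(struct frag_cache *cache)
{
	cache->remaining = SKETCH_PAGE_SIZE;
}

/*
 * Hand out the next 1K fragment, carving from the end of the page.
 * Returns a null pointer once the page is used up; a real
 * implementation would take a fresh page at that point.
 */
static void *frag_alloc_1k(struct frag_cache *cache)
{
	if (cache->remaining < SKETCH_FRAG_SIZE)
		return 0;
	cache->remaining -= SKETCH_FRAG_SIZE;
	return cache->page + cache->remaining;
}
```

Each page yields exactly four fragments in this model, which is why the commit restricts the scheme to builds with a 4K page size.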
Paolo Abeni 7822d83322 net: add napi_get_frags_check() helper
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2133057
Tested: LNST, Tier1

Upstream commit:
commit fd9ea57f4e9514f9d0f0dec505eefd99a8faa148
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jun 8 09:04:38 2022 -0700

    net: add napi_get_frags_check() helper

    This is a follow up of commit 3226b158e6
    ("net: avoid 32 x truesize under-estimation for tiny skbs")

    When/if we increase MAX_SKB_FRAGS, we better make sure
    the old bug will not come back.

    Adding a check in napi_get_frags() would be costly,
    even if using DEBUG_NET_WARN_ON_ONCE().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-27 19:10:48 +02:00
Jiri Benc 2da69cb317 net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally
Bugzilla: https://bugzilla.redhat.com/2120966

Conflicts:
- [minor] context difference in __netif_receive_skb_core due to missing
  42df6e1d221d ("netfilter: Introduce egress hook")

commit cd14e9b7b8d312dfbf75ce1f78552902e51b9045
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:56:22 2022 -0800

    net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally

    The previous patches handled the delivery_time in the ingress path
    before the routing decision is made.  This patch can postpone clearing
    delivery_time in a skb until knowing it is delivered locally and also
    set the (rcv) timestamp if needed.  This patch moves the
    skb_clear_delivery_time() from dev.c to ip_local_deliver_finish()
    and ip6_input_finish().

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:00 +02:00
Jiri Benc e0f797236e net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()
Bugzilla: https://bugzilla.redhat.com/2120966

Conflicts:
- [minor] context difference in __netif_receive_skb_core due to missing
  42df6e1d221d ("netfilter: Introduce egress hook")

commit d98d58a002619b5c165f1eedcd731e2fe2c19088
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:50 2022 -0800

    net: Set skb->mono_delivery_time and clear it after sch_handle_ingress()

    The previous patches handled the delivery_time before sch_handle_ingress().

    This patch can now set the skb->mono_delivery_time to flag that skb->tstamp
    is used as the mono delivery_time (EDT) instead of the (rcv) timestamp
    and also clear it with skb_clear_delivery_time() after
    sch_handle_ingress().  This lets bpf_redirect_*() keep the mono
    delivery_time so that it can be used by a qdisc (fq) of
    the egress-ing interface.

    A later patch will postpone the skb_clear_delivery_time() until the
    stack learns that the skb is being delivered locally and that will
    make other kernel forwarding paths (ip[6]_forward) able to keep
    the delivery_time also.  Thus, like the previous patches on using
    the skb->mono_delivery_time bit, calling skb_clear_delivery_time()
    is not limited within the CONFIG_NET_INGRESS to avoid too many code
    churns among this set.

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:00 +02:00
Jiri Benc e17e09a099 net: Clear mono_delivery_time bit in __skb_tstamp_tx()
Bugzilla: https://bugzilla.redhat.com/2120966

commit d93376f503c7a586707925957592c0f16f4db0b1
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:44 2022 -0800

    net: Clear mono_delivery_time bit in __skb_tstamp_tx()

    __skb_tstamp_tx() may clone the egress skb and queue the clone to
    the sk_error_queue.  The outgoing skb may have the mono delivery_time
    while the (rcv) timestamp is expected for the clone, so the
    skb->mono_delivery_time bit needs to be cleared from the clone.

    This patch adds the skb->mono_delivery_time clearing to the existing
    __net_timestamp() and uses it in __skb_tstamp_tx().
    The __net_timestamp() fast path usage in dev.c is changed to directly
    call ktime_get_real() since the mono_delivery_time bit is not set at
    that point.

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:00 +02:00
Jiri Benc c387356f8d net: Handle delivery_time in skb->tstamp during network tapping with af_packet
Bugzilla: https://bugzilla.redhat.com/2120966

commit 27942a15209f564ed8ee2a9e126cb7b105181355
Author: Martin KaFai Lau <kafai@fb.com>
Date:   Wed Mar 2 11:55:38 2022 -0800

    net: Handle delivery_time in skb->tstamp during network tapping with af_packet

    A later patch will set the skb->mono_delivery_time to flag that skb->tstamp
    is used as the mono delivery_time (EDT) instead of the (rcv) timestamp.
    skb_clear_tstamp() will then keep this delivery_time during forwarding.

    This patch makes network tapping (with af_packet) handle
    the delivery_time stored in skb->tstamp.

    Regardless of tapping at the ingress or egress, the tapped skb is
    received by the af_packet socket, so it is ingress to the af_packet
    socket and it expects the (rcv) timestamp.

    When tapping at egress, dev_queue_xmit_nit() is used.  It already
    expects that skb->tstamp may hold a delivery_time, so it does
    skb_clone()+net_timestamp_set() to ensure the cloned skb has
    the (rcv) timestamp before passing to the af_packet sk.
    This patch only adds clearing of the skb->mono_delivery_time
    bit in net_timestamp_set().

    When tapping at ingress, it currently expects the skb->tstamp is either 0
    or the (rcv) timestamp.  Meaning, the tapping at ingress path
    has already expected the skb->tstamp could be 0 and it will get
    the (rcv) timestamp by ktime_get_real() when needed.

    There are two cases for tapping at ingress:

    One case is af_packet queues the skb to its sk_receive_queue.
    The skb is either not shared or new clone created.  The newly
    added skb_clear_delivery_time() is called to clear the
    delivery_time (if any) and set the (rcv) timestamp if
    needed before the skb is queued to the sk_receive_queue.

    Another case, the ingress skb is directly copied to the rx_ring
    and tpacket_get_timestamp() is used to get the (rcv) timestamp.
    The newly added skb_tstamp() is used in tpacket_get_timestamp()
    to check the skb->mono_delivery_time bit before returning skb->tstamp.
    As mentioned earlier, the tapping@ingress has already expected
    the skb may not have the (rcv) timestamp (because no sk has asked
    for it) and has handled this case by directly calling ktime_get_real().

    Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-10-25 14:58:00 +02:00
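The two helpers mentioned in the commit above, skb_tstamp() and skb_clear_delivery_time(), can be modelled with a reduced struct. This is a sketch under stated assumptions: `struct sketch_skb` and the `sketch_*` functions are hypothetical, `want_rx_stamp` stands for "some socket has asked for receive timestamps", and `now_real` stands in for the value ktime_get_real() would return.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the two sk_buff fields involved. */
struct sketch_skb {
	int64_t tstamp;           /* (rcv) timestamp or mono delivery time */
	bool mono_delivery_time;  /* set when tstamp holds a delivery time */
};

/* Report the (rcv) timestamp, or 0 when tstamp holds a delivery time. */
static int64_t sketch_skb_tstamp(const struct sketch_skb *skb)
{
	return skb->mono_delivery_time ? 0 : skb->tstamp;
}

/*
 * Drop the delivery time (if any) and store the (rcv) timestamp when
 * one is needed, otherwise leave tstamp at 0.
 */
static void sketch_skb_clear_delivery_time(struct sketch_skb *skb,
					   bool want_rx_stamp,
					   int64_t now_real)
{
	if (!skb->mono_delivery_time)
		return;
	skb->mono_delivery_time = false;
	skb->tstamp = want_rx_stamp ? now_real : 0;
}
```

The rx_ring case corresponds to reading through `sketch_skb_tstamp()`, while the sk_receive_queue case corresponds to calling `sketch_skb_clear_delivery_time()` before queueing.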
Frantisek Hrbata fa843be1d1 Merge: net: add skb drop reasons
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1454

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161

Sync skb drop reasons with upstream to improve debuggability and visibility in
the net stack. This MR helps in understanding why a given packet is being
dropped.

One way of retrieving the skb drop reason is to hook to the kfree_skb tracepoint:

```
# perf record -e skb:kfree_skb -a sleep 10
# perf script
         swapper     0 [000] 45483.977088: skb:kfree_skb: skbaddr=0xffffa04859090f00 protocol=34525 location=0xffffffff9bc92940 reason: NOT_SPECIFIED
         swapper     0 [000] 45485.792919: skb:kfree_skb: skbaddr=0xffffa04143757900 protocol=34525 location=0xffffffff9bbd84bb reason: TCP_INVALID_SEQUENCE
```

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Frantisek Hrbata <fhrbata@redhat.com>
2022-10-24 14:27:58 -04:00
Ivan Vecera 4ba4dadfe4 net: make drivers set the TSO limit not the GSO limit
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
* drivers/net/ethernet/marvell/octeontx2/nic/otx2_pf.c
* drivers/net/ethernet/marvell/octeontx2/nic/otx2_vf.c
  - small context conflicts
* drivers/net/usb/ax88179_178a.c
  - hunk removed, the driver does not call netif_set_gso_max_size()
* drivers/net/usb/lan78xx.c
  - modified due to absence of commits d383216a7efe ("lan78xx: Introduce
    Tx URB processing improvements") and 0dd87266c133 ("lan78xx: Remove
    hardware-specific header update")

commit ee8b7a1156f357613646d6c69d07ac5a087a1071
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Thu May 5 19:51:33 2022 -0700

    net: make drivers set the TSO limit not the GSO limit

    Drivers should call the TSO setting helper, GSO is controllable
    by user space.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-18 10:27:21 +02:00
Ivan Vecera 8f95afcecf net: don't allow user space to lift the device limits
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
- small context conflict due to missing eac1b93c14d6 ("gro: add ability
  to control gro max packet size")

commit 14d7b8122fd591693a2388b98563707ba72c6780
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Thu May 5 19:51:32 2022 -0700

    net: don't allow user space to lift the device limits

    Up until commit 46e6b992c2 ("rtnetlink: allow GSO maximums to
    be set on device creation") the gso_max_segs and gso_max_size
    of a device were not controlled from user space.

    The quoted commit added the ability to control them because of
    the following setup:

     netns A  |  netns B
         veth<->veth   eth0

    If eth0 has TSO limitations and user wants to efficiently forward
    traffic between eth0 and the veths they should copy the TSO
    limitations of eth0 onto the veths. This would happen automatically
    for macvlans or ipvlan but veth users are not so lucky (given the
    loose coupling).

    Unfortunately the commit in question allowed users to also override
    the limits on real HW devices.

    It may be useful to control the max GSO size and someone may be using
    that ability (not that I know of any user), so create a separate set
    of knobs to reliably record the TSO limitations. Validate the user
    requests.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-18 10:27:21 +02:00
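The validation of user requests described in the commit above can be sketched as follows. The struct and function names are hypothetical and the check is reduced to the size limit only; the sketch assumes that the device-recorded TSO limit acts as an upper bound for what user space may configure as the GSO size.

```c
#include <errno.h>

/* Hypothetical, reduced device: a driver-set limit and a user-set knob. */
struct sketch_dev {
	unsigned int tso_max_size;  /* recorded by the driver */
	unsigned int gso_max_size;  /* adjustable from user space */
};

/* Reject requests that would lift the limit recorded for the device. */
static int sketch_set_gso_max_size(struct sketch_dev *dev,
				   unsigned int requested)
{
	if (requested > dev->tso_max_size)
		return -EINVAL;
	dev->gso_max_size = requested;
	return 0;
}
```

In the veth setup from the commit message, user space could still lower the GSO size on the veths to match eth0, but could no longer raise it past what the hardware device reported.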
Ivan Vecera f9b471a989 net: add netif_inherit_tso_max()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
- small context conflict due to missing eac1b93c14d6 ("gro: add ability
  to control gro max packet size")

commit 6df6398f7c8b481ce83f28143bc08a5231616deb
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Thu May 5 19:51:31 2022 -0700

    net: add netif_inherit_tso_max()

    To make later patches smaller create a helper for inheriting
    the TSO limitations of a lower device. The TSO in the name
    is not an accident, subsequent patches will replace GSO
    with TSO in more names.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-18 10:27:21 +02:00
Ivan Vecera 5a0eef8003 net: extract a few internals from netdevice.h
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
- slightly modified due to missing 0b5c21bbc01e ("net: ensure
  net_todo_list is processed quickly") and d07b26f5bbea ("dev_addr:
  add a modification check")

commit 6264f58ca0e54e41d63c2d00334a48bac28fbf30
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 6 14:37:54 2022 -0700

    net: extract a few internals from netdevice.h

    There's a number of functions and static variables used
    under net/core/ but not from the outside. We currently
    dump most of them into netdevice.h. That's bad for many
    reasons:
     - netdevice.h is very cluttered, hard to figure out
       what the APIs are;
     - netdevice.h is very long;
     - we have to touch netdevice.h more which causes expensive
       incremental builds.

    Create a header under net/core/ and move some declarations.

    The new header is also a bit of a catch-all but that's
    fine, if we create more specific headers people will
    likely over-think where their declarations fit best
    and end up putting them in netdevice.h, again.

    More work should be done on splitting netdevice.h into more
    targeted headers, but that'd be more time consuming so small
    steps.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-18 10:27:16 +02:00
Antoine Tenart d3b8b917fb net: skb: rename SKB_DROP_REASON_PTYPE_ABSENT
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git
Conflicts:
- In __netif_receive_skb_core due to missing upstream commit
  625788b58445 ("net: add per-cpu storage and net->core_stats") in c9s.

commit 9f8ed577c28813410614b418bad42285840c1a00
Author: Menglong Dong <imagedong@tencent.com>
Date:   Thu Apr 7 14:20:50 2022 +0800

    net: skb: rename SKB_DROP_REASON_PTYPE_ABSENT

    As David Ahern suggested, the reasons for skb drops should be more
    general and not be code based.

    Therefore, rename SKB_DROP_REASON_PTYPE_ABSENT to
    SKB_DROP_REASON_UNHANDLED_PROTO, which is used for the cases of no
    L3 protocol handler, no L4 protocol handler, version extensions, etc.

    From the previous discussion, the aim is now to make these reasons
    more abstract and user-oriented rather than code-based.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Reviewed-by: David Ahern <dsahern@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:24 +02:00
Antoine Tenart 3f421c9474 net: dev: use kfree_skb_reason() for __netif_receive_skb_core()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 6c2728b7c14164928cb7cb9c847dead101b2d503
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:46 2022 +0800

    net: dev: use kfree_skb_reason() for __netif_receive_skb_core()

    Add a reason for skb drops to __netif_receive_skb_core() when no
    packet_type is found to handle the skb. For this purpose, the drop reason
    SKB_DROP_REASON_PTYPE_ABSENT is introduced. Taking Ethernet packets as an
    example, this case mainly happens when the L3 protocol is not supported.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart 4fa8044e89 net: dev: use kfree_skb_reason() for sch_handle_ingress()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit a568aff26ac03ee9eb1482683514914a5ec3b4c3
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:45 2022 +0800

    net: dev: use kfree_skb_reason() for sch_handle_ingress()

    Replace kfree_skb() used in sch_handle_ingress() with
    kfree_skb_reason(). The following drop reason is introduced:

    SKB_DROP_REASON_TC_INGRESS

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart 9c9aa3ee0a net: dev: use kfree_skb_reason() for do_xdp_generic()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 7e726ed81e1ddd5fdc431e02b94fcfe2a9876d42
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:44 2022 +0800

    net: dev: use kfree_skb_reason() for do_xdp_generic()

    Replace kfree_skb() used in do_xdp_generic() with kfree_skb_reason().
    The drop reason SKB_DROP_REASON_XDP is introduced for this case.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart db388f3375 net: dev: use kfree_skb_reason() for enqueue_to_backlog()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 44f0bd40803c0e04f1c8cd59df3c7acce783ae9c
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:43 2022 +0800

    net: dev: use kfree_skb_reason() for enqueue_to_backlog()

    Replace kfree_skb() used in enqueue_to_backlog() with
    kfree_skb_reason(). The skb drop reason SKB_DROP_REASON_CPU_BACKLOG is
    introduced for the case of failing to enqueue the skb to the per CPU
    backlog queue. The underlying cause can be a full backlog queue or the RPS
    flow limitation, and I think we needn't make further distinctions.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart b63c068d65 net: dev: add skb drop reasons to __dev_xmit_skb()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 7faef0547f4c29031a68d058918b031a8e520d49
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:42 2022 +0800

    net: dev: add skb drop reasons to __dev_xmit_skb()

    Add reasons for skb drops to __dev_xmit_skb() by replacing
    kfree_skb_list() with kfree_skb_list_reason(). The drop reason of
    SKB_DROP_REASON_QDISC_DROP is introduced for qdisc enqueue fails.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
Antoine Tenart 694219a303 net: dev: use kfree_skb_reason() for sch_handle_egress()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2059161
Upstream Status: linux.git

commit 98b4d7a4e7374a44c4afd9f08330e72f6ad0d644
Author: Menglong Dong <imagedong@tencent.com>
Date:   Fri Mar 4 14:00:40 2022 +0800

    net: dev: use kfree_skb_reason() for sch_handle_egress()

    Replace kfree_skb() used in sch_handle_egress() with kfree_skb_reason().
    The drop reason SKB_DROP_REASON_TC_EGRESS is introduced. Considering
    the code path of tc egress, we make it distinct from the drop reason
    of SKB_DROP_REASON_QDISC_DROP in the next commit.

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-10-13 14:53:23 +02:00
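The drop reasons introduced by this series can be collected in one place with a small lookup sketch. The reason names come from the commit messages above (with PTYPE_ABSENT under its later name, UNHANDLED_PROTO); the enum type, its ordering and numeric values, and the helper `sketch_drop_reason_name()` are illustrative assumptions and do not match the kernel's own enum.

```c
/* Illustrative ordering only; the kernel's enum has different values. */
enum sketch_drop_reason {
	SKB_DROP_REASON_NOT_SPECIFIED,
	SKB_DROP_REASON_TC_EGRESS,       /* dropped in sch_handle_egress() */
	SKB_DROP_REASON_QDISC_DROP,      /* qdisc enqueue failed */
	SKB_DROP_REASON_CPU_BACKLOG,     /* per CPU backlog enqueue failed */
	SKB_DROP_REASON_XDP,             /* dropped in do_xdp_generic() */
	SKB_DROP_REASON_TC_INGRESS,      /* dropped in sch_handle_ingress() */
	SKB_DROP_REASON_UNHANDLED_PROTO, /* no protocol handler found */
	SKB_DROP_REASON_MAX,
};

static const char *const sketch_drop_names[SKB_DROP_REASON_MAX] = {
	[SKB_DROP_REASON_NOT_SPECIFIED]   = "NOT_SPECIFIED",
	[SKB_DROP_REASON_TC_EGRESS]       = "TC_EGRESS",
	[SKB_DROP_REASON_QDISC_DROP]      = "QDISC_DROP",
	[SKB_DROP_REASON_CPU_BACKLOG]     = "CPU_BACKLOG",
	[SKB_DROP_REASON_XDP]             = "XDP",
	[SKB_DROP_REASON_TC_INGRESS]      = "TC_INGRESS",
	[SKB_DROP_REASON_UNHANDLED_PROTO] = "UNHANDLED_PROTO",
};

/* Map a reason to the short name a tracepoint consumer would print. */
static const char *sketch_drop_reason_name(enum sketch_drop_reason reason)
{
	if ((unsigned int)reason >= SKB_DROP_REASON_MAX)
		return "UNKNOWN";
	return sketch_drop_names[reason];
}
```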
Paolo Abeni 7403d40195 net: Fix a data-race around netdev_unregister_timeout_secs.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1
Conflicts: chunk applied into netdev_wait_allrefs() instead of \
 netdev_wait_allrefs_any() and with different context as rhel-9 \
 lacks the upstream commit faab39f63c1fc ("net: allow out-of-order \
 netdev unregistration")

Upstream commit:
commit 05e49cfc89e4f325eebbc62d24dd122e55f94c23
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:59 2022 -0700

    net: Fix a data-race around netdev_unregister_timeout_secs.

    While reading netdev_unregister_timeout_secs, it can be changed
    concurrently.  Thus, we need to add READ_ONCE() to its reader.

    Fixes: 5aa3afe107 ("net: make unregister netdev warning timeout configurable")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Acked-by: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:04 +02:00
Paolo Abeni 48e48d197a net: Fix a data-race around netdev_budget_usecs.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit fa45d484c52c73f79db2c23b0cdfc6c6455093ad
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:55 2022 -0700

    net: Fix a data-race around netdev_budget_usecs.

    While reading netdev_budget_usecs, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 7acf8a1e8a ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:04 +02:00
Paolo Abeni 3d0c78c5c1 net: Fix a data-race around netdev_budget.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit 2e0c42374ee32e72948559d2ae2f7ba3dc6b977c
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:53 2022 -0700

    net: Fix a data-race around netdev_budget.

    While reading netdev_budget, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its reader.

    Fixes: 51b0bdedb8 ("[NET]: Separate two usages of netdev_max_backlog.")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:04 +02:00
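The READ_ONCE() pattern used throughout this group of data-race fixes can be approximated in userspace as a volatile access that takes a single snapshot of a shared variable. The macros below are a sketch with made-up `SKETCH_` names; they rely on the `__typeof__` extension of GCC and Clang and are not the kernel's definitions. `sketch_budget` and `sketch_remaining_budget()` are hypothetical as well.

```c
/* Force exactly one load or store of x through a volatile access. */
#define SKETCH_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))
#define SKETCH_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical stand-in for a sysctl-backed tunable. */
static int sketch_budget = 300;

/*
 * Take one snapshot of the tunable and use it for both the comparison
 * and the subtraction, so a concurrent writer cannot make the two
 * uses disagree.
 */
static int sketch_remaining_budget(int used)
{
	int budget = SKETCH_READ_ONCE(sketch_budget);

	return used >= budget ? 0 : budget - used;
}
```

Without the snapshot, the compiler would be free to reload the variable between the comparison and the subtraction, and a concurrent sysctl write could then yield a result based on two different values.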
Paolo Abeni 08060d0717 net: Fix data-races around netdev_tstamp_prequeue.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit 61adf447e38664447526698872e21c04623afb8e
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:47 2022 -0700

    net: Fix data-races around netdev_tstamp_prequeue.

    While reading netdev_tstamp_prequeue, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    Fixes: 3b098e2d7c ("net: Consistent skb timestamping")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:03 +02:00
Paolo Abeni 13d50816f6 net: Fix data-races around netdev_max_backlog.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit 5dcd08cd19912892586c6082d56718333e2d19db
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:46 2022 -0700

    net: Fix data-races around netdev_max_backlog.

    While reading netdev_max_backlog, it can be changed concurrently.
    Thus, we need to add READ_ONCE() to its readers.

    While at it, we remove the unnecessary spaces in the doc.

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:03 +02:00
Paolo Abeni 05d6206bdc net: Fix data-races around weight_p and dev_weight_[rt]x_bias.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2134161
Tested: LNST, Tier1

Upstream commit:
commit bf955b5ab8f6f7b0632cdef8e36b14e4f6e77829
Author: Kuniyuki Iwashima <kuniyu@amazon.com>
Date:   Tue Aug 23 10:46:45 2022 -0700

    net: Fix data-races around weight_p and dev_weight_[rt]x_bias.

    While reading weight_p, it can be changed concurrently.  Thus, we need
    to add READ_ONCE() to its reader.

    Also, dev_[rt]x_weight can be read/written at the same time.  So, we
    need to use READ_ONCE() and WRITE_ONCE() for its access.  Moreover, to
    use the same weight_p while changing dev_[rt]x_weight, we add a mutex
    in proc_do_dev_weight().

    Fixes: 3d48b53fb2 ("net: dev_weight: TX/RX orthogonality")
    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-10-13 13:00:03 +02:00
Ivan Vecera 7ca7843425 net: unexport a handful of dev_* functions
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

commit 2cc6cdd44a1655ac5a9863529a2fd6dbed2d092c
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Wed Apr 6 14:37:53 2022 -0700

    net: unexport a handful of dev_* functions

    We have a bunch of functions which are only used under
    net/core/ yet they get exported. Remove the exports.

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-03 17:03:08 +02:00
Ivan Vecera 616826f600 net: remove .ndo_change_proto_down
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2128180

Conflicts:
- small context conflict due to existing backport of 3b89b511ea0c ("net:
  fix IFF_TX_SKB_NO_LINEAR definition")

commit 2106efda785b55a8957efed9a52dfa28ee0d7280
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Nov 22 17:24:47 2021 -0800

    net: remove .ndo_change_proto_down

    .ndo_change_proto_down was added seemingly to enable out-of-tree
    implementations. Over 2.5yrs later we still have no real users
    upstream. Hardwire the generic implementation for now, we can
    revert once real users materialize. (rocker is a test vehicle,
    not a user.)

    We need to drop the optimization on the sysfs side, because
    unlike ndos priv_flags will be changed at runtime, so we'd
    need READ_ONCE/WRITE_ONCE everywhere..

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-10-03 17:02:55 +02:00
Felix Maurer 8611666ff2 xdp: check prog type before updating BPF link
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071620

commit 382778edc8262b7535f00523e9eb22edba1b9816
Author: Toke Høiland-Jørgensen <toke@redhat.com>
Date:   Fri Jan 7 23:11:13 2022 +0100

    xdp: check prog type before updating BPF link

    The bpf_xdp_link_update() function didn't check the program type before
    updating the program, which made it possible to install any program type as
    an XDP program, which is obviously not good. Syzbot managed to trigger this
    by swapping in an LWT program on the XDP hook which would crash in a helper
    call.

    Fix this by adding a check and bailing out if the types don't match.

    Fixes: 026a4c28e1 ("bpf, xdp: Implement LINK_UPDATE for BPF XDP link")
    Reported-by: syzbot+983941aa85af6ded1fd9@syzkaller.appspotmail.com
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/r/20220107221115.326171-1-toke@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-08-24 16:56:03 +02:00
Patrick Talbert 95ad1a9fa6 Merge: CNB: bpf: Let bpf_warn_invalid_xdp_action() report more info
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1070

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073454
Tested: Build, boot.

The commit lets bpf_warn_invalid_xdp_action() report more info.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Corinna Vinschen <vinschen@redhat.com>
Approved-by: Kamal Heib <kheib@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>
Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com>
Approved-by: Mohamed Gamal Morsy <mgamal@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-15 09:40:47 +02:00
Patrick Talbert 5f85d33e47 Merge: net/core: backport fixes from upstream for 9.1 P2
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1057

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278

The latest patch depends on the second-latest patch.

Signed-off-by: Hangbin Liu <haliu@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-14 12:07:49 +02:00
Patrick Talbert c2f72a65cf Merge: CNB: gro: get out of core files
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1066

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789
Tested: Just built - there is no functional change

The series moves GRO-related definitions, declarations and code from core files into net/core/gro.h and include/net/gro.h, and shrinks the oversized files include/linux/netdevice.h and net/core/dev.c. Backporting this series provides <net/gro.h> for NIC drivers and avoids conflicts in future GRO-related backports and fixes.

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Kamal Heib <kheib@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Íñigo Huguet <ihuguet@redhat.com>

Conflicts:
- include/linux/netdevice.h: fuzz.

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-12 10:36:03 +02:00
Patrick Talbert f063b56239 Merge: net: backport netdevice and netns refcount tracking and enable them for debug kernels
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1003

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377
Tested: Basic networking tasks using namespaces, vlans, veths, macvlans etc. with kernel-debug flavor

The upstream kernel recently introduced refcount tracking infrastructure for network devices and namespaces to help avoid resource leaks and use-after-free issues. This infrastructure should help our support teams debug customers' issues.
The series backports the following commits and enables both trackers for kernel debug flavors:

```
95d1d2490c27 ("netdevice: move xdp_rxq within netdev_rx_queue")
2a12ae5d433d ("net: inline sock_prot_inuse_add()")
d477eb900484 ("net: make sock_inuse_add() available")
4199bae10c49 ("net: merge net->core.prot_inuse and net->core.sock_inuse")
b3cb764aa1d7 ("net: drop nopreempt requirement on sock_prot_inuse_add()")
4e66934eaadc ("lib: add reference counting tracking infrastructure")
914a7b5000d0 ("lib: add tests for reference tracker")
4d92b95ff2f9 ("net: add net device refcount tracker infrastructure")
80e8921b2b72 ("net: add net device refcount tracker to struct netdev_rx_queue")
0b688f24b7d6 ("net: add net device refcount tracker to struct netdev_queue")
5ae2195088d0 ("net: add net device refcount tracker to ethtool_phys_id()")
14ed029b5eb5 ("net: add net device refcount tracker to dev_ifsioc()")
4dbd24f65c60 ("drop_monitor: add net device refcount tracker")
9038c320001d ("net: dst: add net device refcount tracking to dst_entry")
fb67510ba9bd ("ipv6: add net device refcount tracker to rt6_probe_deferred()")
c0fd407a0666 ("sit: add net device refcount tracking to ip_tunnel")
56c1c77948ba ("ipv6: add net device refcount tracker to struct ip6_tnl")
85662c9f8cbd ("net: add net device refcount tracker to struct neighbour")
77a23b1f9543 ("net: add net device refcount tracker to struct pneigh_entry")
08d622568e5a ("net: add net device refcount tracker to struct neigh_parms")
f77159a348f2 ("net: add net device refcount tracker to struct netdev_adjacent")
8c727003c4d0 ("ipv6: add net device refcount tracker to struct inet6_dev")
c04438f58d14 ("ipv4: add net device refcount tracker to struct in_device")
606509f27f67 ("net/sched: add net device refcount tracker to struct Qdisc")
63f13937cbe9 ("net: linkwatch: add net device refcount tracker")
095e200f175f ("net: failover: add net device refcount tracker")
42120a864383 ("ipmr, ip6mr: add net device refcount tracker to struct vif_device")
5fa5ae605821 ("netpoll: add net device refcount tracker to struct netpoll")
c0e5e11af12b ("vrf: use dev_replace_track() for better tracking")
08f0b22d731f ("net: eql: add net device refcount tracker")
19c9ebf6ed70 ("vlan: add net device refcount tracker")
b2dcdc7f731d ("net: bridge: add net device refcount tracker")
f12bf6f3f942 ("net: watchdog: add net device refcount tracker")
4fc003fe0313 ("net: switchdev: add net device refcount tracker")
e44b14ebae10 ("inet: add net device refcount tracker to struct fib_nh_common")
66ce07f7802b ("ax25: add net device refcount tracker")
615d069dcf12 ("llc: add net device refcount tracker")
035f1f2b96ae ("pktgen add net device refcount tracker")
b60645248af3 ("net/smc: add net device tracker to struct smc_pnetentry")
e4b8954074f6 ("netlink: add net device refcount tracker to struct ethnl_req_info")
e7c8ab8419d7 ("openvswitch: add net device refcount tracker to struct vport")
ada066b2e02c ("net: sched: act_mirred: add net device refcount tracker")
4177e4960594 ("xfrm: use net device refcount tracker helpers")
9ba74e6c9e9d ("net: add networking namespace refcount tracker")
ffa84b5ffb37 ("net: add netns refcount tracker to struct sock")
04a931e58d19 ("net: add netns refcount tracker to struct seq_net_private")
dbdcda634ce3 ("net: sched: add netns refcount tracker to struct tcf_exts")
285ec2fef4b8 ("l2tp: add netns refcount tracker to l2tp_dfs_seq_data")
11b311a867b6 ("ppp: add netns refcount tracker")
0976b888a150 ("ethtool: fix null-ptr-deref on ref tracker")
e1b539bd73a7 ("xfrm: add net device refcount tracker to struct xfrm_state_offload")
8b40a9d53d4f ("ipv6: use GFP_ATOMIC in rt6_probe()")
1d2f3d3c6268 ("mptcp: adjust to use netns refcount tracker")
123e495ecc25 ("net: linkwatch: be more careful about dev->linkwatch_dev_tracker")
9280ac2e6f19 ("net: dev_replace_track() cleanup")
34ac17ecbf57 ("ethtool: use ethnl_parse_header_dev_put()")
f1d9268e0618 ("net: add net device refcount tracker to struct packet_type")
3bc14ea0d12a ("ethtool: always write dev in ethnl_parse_header_dev_get")
a9382d9389a0 ("netfilter: nfnetlink: add netns refcount tracker to struct nfulnl_instance")
30db406923b9 ("netfilter: nf_nat_masquerade: make async masq_inet6_event handling generic")
7970a19b7104 ("netfilter: nf_nat_masquerade: defer conntrack walk to work queue")
fc0d026a2fad ("netfilter: nf_nat_masquerade: add netns refcount tracker to masq_dev_work")
88248c357c2a ("net/sched: add missing tracker information in qdisc_create()")
2d6ec25539b0 ("netlink: do not allocate a device refcount tracker in ethnl_default_notify()")
bf44077c1b3a ("af_packet: fix tracking issues in packet_do_bind()")
cb963a19d99f ("net: sched: do not allocate a tracker in tcf_exts_init()")
c12837d1bb31 ("ref_tracker: use __GFP_NOFAIL more carefully")
fcfb894d5952 ("net: bridge: fix net device refcount tracking issue in error path")
7b9b1d449a7c ("net/smc: fix possible NULL deref in smc_pnet_add_eth()")
6cdef8a6ee74 ("SUNRPC: add netns refcount tracker to struct svc_xprt")
9b1831e56c7f ("SUNRPC: add netns refcount tracker to struct gss_auth")
b9a0d6d143ec ("SUNRPC: add netns refcount tracker to struct rpc_xprt")
e3ececfe668f ("ref_tracker: implement use-after-free detection")
8fd5522f44dc ("ref_tracker: add a count of untracked references")
4c6c11ea0f7b ("net: refine dev_put()/dev_hold() debugging")
28f922213886 ("net/smc: fix ref_tracker issue in smc_pnet_add()")
94fdd7c02a56 ("net/smc: use GFP_ATOMIC allocation in smc_pnet_add_eth()")
b2309a71c1f2 ("net: add dev->dev_registered_tracker")
3db09e762dc7 ("net/sched: cls_u32: fix netns refcount changes in u32_change()")
ec5b0f605b10 ("net/sched: cls_u32: fix possible leak in u32_init_knode()")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Eelco Chaudron <echaudro@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-07-01 09:17:32 +02:00
Ivan Vecera ca7c7d9c0c bpf: Let bpf_warn_invalid_xdp_action() report more info
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073454

Conflicts:
- N/A hunk for unsupported octeontx2 driver omitted

commit c8064e5b4adac5e1255cf4f3b374e75b5376e7ca
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Tue Nov 30 11:08:07 2021 +0100

    bpf: Let bpf_warn_invalid_xdp_action() report more info

    In non trivial scenarios, the action id alone is not sufficient to
    identify the program causing the warning. Before the previous patch,
    the generated stack-trace pointed out at least the involved device
    driver.

    Let's additionally include the program name and id, and the relevant
    device name.

    If the user needs additional infos, he can fetch them via a kernel
    probe, leveraging the arguments added here.

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/ddb96bb975cbfddb1546cf5da60e77d5100b533c.1638189075.git.pabeni@redhat.com

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-28 16:13:14 +02:00
Ivan Vecera 7ba9ae4395 net: gro: populate net/core/gro.c
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789

Conflicts:
- adjusted due to existing backport of 7881453e4adf ("net: gro: avoid
  re-computing truesize twice on recycle")

commit 587652bbdd06ab38a4c1b85e40f933d2cf4a1147
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 15 09:05:54 2021 -0800

    net: gro: populate net/core/gro.c

    Move gro code and data from net/core/dev.c to net/core/gro.c
    to ease maintenance.

    gro_normal_list() and gro_normal_one() are inlined
    because they are called from both files.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-28 13:28:41 +02:00
Ivan Vecera e9721641ed net:dev: Change napi_gro_complete return type to void
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789

commit 1643771eeb2db9b487cbbde12e2a3f6ed0171490
Author: Gyumin Hwang <hkm73560@gmail.com>
Date:   Sat Oct 2 08:11:36 2021 +0000

    net:dev: Change napi_gro_complete return type to void

    napi_gro_complete always returned the same value, NET_RX_SUCCESS
    And the value was not used anywhere

    Signed-off-by: Gyumin Hwang <hkm73560@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-28 13:28:40 +02:00
Ivan Vecera 2119ff5330 move netdev_boot_setup into Space.c
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101789

commit 5ea2f5ffde39251115ef9a566262fb9e52b91cb7
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Tue Aug 3 13:40:46 2021 +0200

    move netdev_boot_setup into Space.c

    This is now only used by a handful of old ISA drivers,
    and can be moved into the file they already all depend on.

    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-28 13:28:39 +02:00
Hangbin Liu e4c3a2b313 net: fix data-race in dev_isalive()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278
Upstream Status: net.git commit cc26c2661fef

Conflicts: context conflicts due to missing ae68db14b616 ("net: transition
netdev reg state earlier in run_todo") and 86213f80da1b ("net: avoid quadratic
behavior in netdev_wait_allrefs_any()")

commit cc26c2661fefea215f41edb665193324a5f99021
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jun 16 00:34:34 2022 -0700

    net: fix data-race in dev_isalive()

    dev_isalive() is called under RTNL or dev_base_lock protection.

    This means that changes to dev->reg_state should be done with both locks held.

    syzbot reported:

    BUG: KCSAN: data-race in register_netdevice / type_show

    write to 0xffff888144ecf518 of 1 bytes by task 20886 on cpu 0:
    register_netdevice+0xb9f/0xdf0 net/core/dev.c:10050
    lapbeth_new_device drivers/net/wan/lapbether.c:414 [inline]
    lapbeth_device_event+0x4a0/0x6c0 drivers/net/wan/lapbether.c:456
    notifier_call_chain kernel/notifier.c:87 [inline]
    raw_notifier_call_chain+0x53/0xb0 kernel/notifier.c:455
    __dev_notify_flags+0x1d6/0x3a0
    dev_change_flags+0xa2/0xc0 net/core/dev.c:8607
    do_setlink+0x778/0x2230 net/core/rtnetlink.c:2780
    __rtnl_newlink net/core/rtnetlink.c:3546 [inline]
    rtnl_newlink+0x114c/0x16a0 net/core/rtnetlink.c:3593
    rtnetlink_rcv_msg+0x811/0x8c0 net/core/rtnetlink.c:6089
    netlink_rcv_skb+0x13e/0x240 net/netlink/af_netlink.c:2501
    rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:6107
    netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
    netlink_unicast+0x58a/0x660 net/netlink/af_netlink.c:1345
    netlink_sendmsg+0x661/0x750 net/netlink/af_netlink.c:1921
    sock_sendmsg_nosec net/socket.c:714 [inline]
    sock_sendmsg net/socket.c:734 [inline]
    __sys_sendto+0x21e/0x2c0 net/socket.c:2119
    __do_sys_sendto net/socket.c:2131 [inline]
    __se_sys_sendto net/socket.c:2127 [inline]
    __x64_sys_sendto+0x74/0x90 net/socket.c:2127
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x46/0xb0

    read to 0xffff888144ecf518 of 1 bytes by task 20423 on cpu 1:
    dev_isalive net/core/net-sysfs.c:38 [inline]
    netdev_show net/core/net-sysfs.c:50 [inline]
    type_show+0x24/0x90 net/core/net-sysfs.c:112
    dev_attr_show+0x35/0x90 drivers/base/core.c:2095
    sysfs_kf_seq_show+0x175/0x240 fs/sysfs/file.c:59
    kernfs_seq_show+0x75/0x80 fs/kernfs/file.c:162
    seq_read_iter+0x2c3/0x8e0 fs/seq_file.c:230
    kernfs_fop_read_iter+0xd1/0x2f0 fs/kernfs/file.c:235
    call_read_iter include/linux/fs.h:2052 [inline]
    new_sync_read fs/read_write.c:401 [inline]
    vfs_read+0x5a5/0x6a0 fs/read_write.c:482
    ksys_read+0xe8/0x1a0 fs/read_write.c:620
    __do_sys_read fs/read_write.c:630 [inline]
    __se_sys_read fs/read_write.c:628 [inline]
    __x64_sys_read+0x3e/0x50 fs/read_write.c:628
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x46/0xb0

    value changed: 0x00 -> 0x01

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 20423 Comm: udevd Tainted: G W 5.19.0-rc2-syzkaller-dirty #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-06-27 16:39:41 +08:00
Hangbin Liu ca3a0598a6 net: Write lock dev_base_lock without disabling bottom halves.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278
Upstream Status: net.git commit fd888e85fe6b

commit fd888e85fe6b661e78044dddfec0be5271afa626
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Fri Nov 26 17:15:29 2021 +0100

    net: Write lock dev_base_lock without disabling bottom halves.

    The writer acquires dev_base_lock with disabled bottom halves.
    The reader can acquire dev_base_lock without disabling bottom halves
    because there is no writer in softirq context.

    On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts
    as a lock to ensure that resources, that are protected by disabling
    bottom halves, remain protected.
    This leads to a circular locking dependency if the lock acquired with
    disabled bottom halves (as in write_lock_bh()) and somewhere else with
    enabled bottom halves (as by read_lock() in netstat_show()) followed by
    disabling bottom halves (cxgb_get_stats() -> t4_wr_mbox_meat_timeout()
    -> spin_lock_bh()). This is the reverse locking order.

    All read_lock() invocation are from sysfs callback which are not invoked
    from softirq context. Therefore there is no need to disable bottom
    halves while acquiring a write lock.

    Acquire the write lock of dev_base_lock without disabling bottom halves.

    Reported-by: Pei Zhang <pezhang@redhat.com>
    Reported-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-06-27 16:37:14 +08:00
Hangbin Liu 7b9f2507ce net: fix dev_fill_forward_path with pppoe + bridge
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2101278
Upstream Status: net.git commit cf2df74e202d

commit cf2df74e202d81b09f09d84c2d8903e0e87e9274
Author: Felix Fietkau <nbd@nbd.name>
Date:   Mon May 9 14:26:15 2022 +0200

    net: fix dev_fill_forward_path with pppoe + bridge

    When calling dev_fill_forward_path on a pppoe device, the provided destination
    address is invalid. In order for the bridge fdb lookup to succeed, the pppoe
    code needs to update ctx->daddr to the correct value.
    Fix this by storing the address inside struct net_device_path_ctx

    Fixes: f6efc675c9 ("net: ppp: resolve forwarding path for bridge pppoe devices")
    Signed-off-by: Felix Fietkau <nbd@nbd.name>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-06-27 12:26:09 +08:00
Patrick Talbert 164ce13234 Merge: CNB: Update TC subsystem to upstream v5.18
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/971

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410
Depends: https://bugzilla.redhat.com/show_bug.cgi?id=2094002
Tested: Using TC related kernel self-tests

The series rebases the TC subsystem to upstream v5.18.

Commits:
```
f79a3bcb1a50 ("net/sched: Remove unnecessary if statement")
409f386b8e5d ("qdisc: add new field for qdisc_enqueue tracepoint")
56af5e749f20 ("net/sched: act_skbmod: Add SKBMOD_F_ECN option support")
68f9884837c6 ("tc-testing: Add control-plane selftest for skbmod SKBMOD_F_ECN option")
695176bfe5de ("net_sched: refactor TC action init API")
625af9f0298b ("tc-testing: Add control-plane selftests for sch_mq")
a5397d68b2db ("net/sched: cls_api, reset flags on replay")
efe487fce306 ("fix array-index-out-of-bounds in taprio_change")
1e080f17750d ("net: sched: update default qdisc visibility after Tx queue cnt changes")
2e367522ce6b ("netdevsim: add ability to change channel count")
2d6a58996ee2 ("selftests: net: test ethtool -L vs mq")
f7116fb46085 ("net: sched: move and reuse mq_change_real_num_tx()")
b193e15ac69d ("net: prevent user from passing illegal stab size")
69508d43334e ("net_sched: Use struct_size() and flex_array_size() helpers")
129291980f49 ("net: sched: Use struct_size() helper in kvmalloc()")
fbf307c89eb0 ("gen_stats: Add instead Set the value in __gnet_stats_copy_basic().")
448e163f8b9b ("gen_stats: Add gnet_stats_add_queue().")
7361df4606ba ("mq, mqprio: Use gnet_stats_add_queue().")
10940eb746d4 ("gen_stats: Move remaining users to gnet_stats_add_queue().")
f2efdb179289 ("u64_stats: Introduce u64_stats_set()")
67c9e6270f30 ("net: sched: Protect Qdisc::bstats with u64_stats")
f56940daa5a7 ("net: sched: Use _bstats_update/set() instead of raw writes")
50dc9a8572aa ("net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data types")
29cbcd858283 ("net: sched: Remove Qdisc::running sequence counter")
4c57e2fac41c ("net: sched: fix logic error in qdisc_run_begin()")
97604c65bcda ("net: sched: remove one pair of atomic operations")
6b3efbfa4e68 ("net: sch_tbf: Add a graft command")
e22db7bd552f ("net: sched: Allow statistics reads from softirq.")
c5c6e589a8c8 ("net: stats: Read the statistics in ___gnet_stats_copy_basic() instead of adding.")
f25c0515c521 ("net: sched: gred: dynamically allocate tc_gred_qopt_offload")
267463823adb ("net: sch: eliminate unnecessary RCU waits in mini_qdisc_pair_swap()")
85c0c3eb9a66 ("net: sch: simplify condtion for selecting mini_Qdisc_pair buffer")
648a991cf316 ("sch_htb: Add extack messages for EOPNOTSUPP errors")
6de6e46d27ef ("cls_flower: Fix inability to match GRE/IPIP packets")
af0a51113cb7 ("selftests: forwarding: Fix packet matching in mirroring selftests")
cb3ef7b00042 ("net: sched: sch_netem: Refactor code in 4-state loss generator")
bdf1565fe03d ("selftests/tc-testing: match any qdisc type")
b43c2793f5e9 ("netfilter: nfnetlink_queue: silence bogus compiler warning")
43332cf97425 ("net/sched: act_ct: Offload only ASSURED connections")
40bd094d65fc ("flow_offload: fill flags to action structure")
144d4c9e800d ("flow_offload: reject to offload tc actions in offload drivers")
5a9959008fb6 ("flow_offload: add index to flow_action_entry structure")
9c1c0e124ca2 ("flow_offload: rename offload functions with offload instead of flow")
c54e1d920f04 ("flow_offload: add ops to tc_action_ops for flow action setup")
8cbfe939abe9 ("flow_offload: allow user to offload tc action to net device")
7adc57651211 ("flow_offload: add skip_hw and skip_sw to control if offload the action")
bcd64368584b ("flow_offload: rename exts stats update functions with hw")
c7a66f8d8a94 ("flow_offload: add process to update action stats from hardware")
e8cb5bcf6ed6 ("net: sched: save full flags for tc action")
13926d19a11e ("flow_offload: add reoffload process to update hw_count")
c86e0209dc77 ("flow_offload: validate flags of filter and actions")
eb473bac4a4b ("selftests: tc-testing: add action offload selftest for action and filter")
c48c94b0ab75 ("net/sched: use min() macro instead of doing it manually")
963178a06352 ("flow_offload: fix suspicious RCU usage when offloading tc action")
9795ded7f924 ("net/sched: act_ct: Fill offloading tuple iifidx")
b702436a51df ("net: openvswitch: Fill act ct extension")
7d18a07897d0 ("sch_qfq: prevent shift-out-of-bounds in qfq_init_qdisc")
c25af830ab26 ("sch_cake: revise Diffserv docs")
719774377622 ("netfilter: conntrack: convert to refcount_t api")
3fce16493dc1 ("netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook")
285c8a7a5815 ("netfilter: make function op structures const")
6ae7989c9af0 ("netfilter: conntrack: avoid useless indirection during conntrack destruction")
408bdcfce8df ("net: prefer nf_ct_put instead of nf_conntrack_put")
fb80445c438c ("net_sched: restore "mpu xxx" handling")
973bf8fdd12f ("net: sched: Clarify error message when qdisc kind is unknown")
bb62a765b1b5 ("netfilter: conntrack: make all extensions 8-byte alignned")
5f31edc0676b ("netfilter: conntrack: move extension sizes into core")
1bc91a5ddf3e ("netfilter: conntrack: handle ->destroy hook via nat_ops instead")
1015c3de23ee ("netfilter: conntrack: remove extension register api")
34243b9ec856 ("netfilter: nft_ct: fix use after free when attaching zone template")
429c3be8a5e2 ("sch_htb: Fail on unsupported parameters when offload is requested")
98b608629746 ("net: sched: remove psched_tdiff_bounded()")
a459bc9a3a68 ("net: sched: remove qdisc_qlen_cpu()")
04c2a47ffb13 ("net: sched: fix use-after-free in tc_new_tfilter()")
35d39fecbc24 ("net/sched: Enable tc skb ext allocation on chain miss only when needed")
4ddc844eb81d ("net/sched: act_police: more accurate MTU policing")
5891cd5ec46c ("net_sched: add __rcu annotation to netdev->qdisc")
5740d0689096 ("net: sched: limit TC_ACT_REPEAT loops")
2f131de361f6 ("net/sched: act_ct: Fix flow table lookup after ct clear or switching zones")
ecf4a24cf978 ("net: sched: avoid newline at end of message in NL_SET_ERR_MSG_MOD")
b8cd5831c61c ("net: flow_offload: add tc police action parameters")
d97b4b105ce7 ("flow_offload: reject offload for all drivers with invalid police parameters")
fcb6aa86532c ("act_ct: Support GRE offload")
db6140e5e35a ("net/sched: act_ct: Fix flow table lookup failure with no originating ifindex")
d922a99b96d0 ("flow_offload: improve extack msg for user when adding invalid filter")
ab95465cde23 ("net/sched: add vlan push_eth and pop_eth action to the hardware IR")
054d5575cd6e ("net/sched: fix incorrect vlan_push_eth dest field")
bcb74e132a76 ("net/sched: act_ct: fix ref leak when switching zones")
2105f700b53c ("net/sched: flower: fix parsing of ethertype following VLAN header")
e65812fd22eb ("net/sched: fix initialization order when updating chain 0 head")
e8a64bbaaad1 ("net/sched: taprio: Check if socket flags are valid")
3db09e762dc7 ("net/sched: cls_u32: fix netns refcount changes in u32_change()")
ec5b0f605b10 ("net/sched: cls_u32: fix possible leak in u32_init_knode()")
8b796475fd78 ("net/sched: act_pedit: really ensure the skb is writable")
4d42d54a7d6a ("net/sched: act_pedit: sanitize shift argument before usage")
86360030cc51 ("net/sched: act_api: fix error code in tcf_ct_flow_table_fill_tuple_ipv6()")
```

Signed-off-by: Ivan Vecera <ivecera@redhat.com>

Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Corinna Vinschen <vinschen@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Petr Oros <poros@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-06-21 10:07:08 +02:00
Ivan Vecera 056507f0cb net: add dev->dev_registered_tracker
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377

commit b2309a71c1f2fc841feb184195b2e46b2e139bf4
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Feb 7 10:41:07 2022 -0800

    net: add dev->dev_registered_tracker

    Convert one dev_hold()/dev_put() pair in register_netdevice()
    and unregister_netdevice_many() to dev_hold_track()
    and dev_put_track().

    This would allow to detect a rogue dev_put() a bit earlier.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://lore.kernel.org/r/20220207184107.1401096-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-13 18:39:34 +02:00
Ivan Vecera 859ed7a9a3 net: refine dev_put()/dev_hold() debugging
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377

commit 4c6c11ea0f7b00a1894803efe980dfaf3b074886
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Feb 4 14:42:37 2022 -0800

    net: refine dev_put()/dev_hold() debugging

    We are still chasing some syzbot reports where we think a rogue dev_put()
    is called with no corresponding prior dev_hold().
    Unfortunately it eats a reference on dev->dev_refcnt taken by innocent
    dev_hold_track(), meaning that the refcount saturation splat comes
    too late to be useful.

    Make sure that 'not tracked' dev_put() and dev_hold() better use
    CONFIG_NET_DEV_REFCNT_TRACKER=y debug infrastructure:

    Prior patch in the series allowed ref_tracker_alloc() and ref_tracker_free()
    to be called with a NULL @trackerp parameter, and to use a separate refcount
    only to detect too many put() even in the following case:

    dev_hold_track(dev, tracker_1, GFP_ATOMIC);
     dev_hold(dev);
     dev_put(dev);
     dev_put(dev); // Should complain loudly here.
    dev_put_track(dev, tracker_1); // instead of here

    Add clarification about netdev_tracker_alloc() role.

    v2: I replaced the dev_put() in linkwatch_do_dev()
        with __dev_put() because callers called netdev_tracker_free().

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-13 18:39:33 +02:00
Ivan Vecera 6ce56701da net: add net device refcount tracker to struct netdev_adjacent
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377

commit f77159a348f2d6078af7fe4933a60229d7c7aae2
Author: Eric Dumazet <edumazet@google.com>
Date:   Sat Dec 4 20:22:10 2021 -0800

    net: add net device refcount tracker to struct netdev_adjacent

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-13 18:38:19 +02:00
Ivan Vecera f516b70a26 net: add net device refcount tracker infrastructure
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2096377

Conflicts:
- context conflict due to missing commit 5ea2f5ffde392 ("move
  netdev_boot_setup into Space.c")

commit 4d92b95ff2f95f13df9bad0b5a25a9f60e72758d
Author: Eric Dumazet <edumazet@google.com>
Date:   Sat Dec 4 20:21:57 2021 -0800

    net: add net device refcount tracker infrastructure

    Net devices are refcounted. Over the years we have had numerous bugs
    caused by imbalanced dev_hold() and dev_put() calls.

    The general idea is to be able to precisely pair each decrement with
    a corresponding prior increment. Both share a cookie, basically
    a pointer to private data storing stack traces.

    This patch adds dev_hold_track() and dev_put_track().

    To use these helpers, each data structure owning a refcount
    should also use a "netdevice_tracker" to pair the hold and put.

    netdevice_tracker dev_tracker;
    ...
    dev_hold_track(dev, &dev_tracker, GFP_ATOMIC);
    ...
    dev_put_track(dev, &dev_tracker);

    Whenever a leak happens, we will get precise stack traces
    of the point dev_hold_track() happened, at device dismantle phase.

    We will also get a stack trace if too many dev_put_track() for the same
    netdevice_tracker are attempted.
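
The pairing idea described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every `model_*` name is hypothetical, the cookie stores a flag rather than stack traces, and the model simply refuses an unmatched put, which is not necessarily what the kernel does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cookie: the real one stores stack traces, this one a flag. */
struct model_tracker {
	bool held;
};

/* Hypothetical stand-in for the refcount state of a net device. */
struct model_dev {
	int refcnt;      /* plain reference count */
	int outstanding; /* trackers currently holding a reference */
	int bad_puts;    /* puts that had no matching tracked hold */
};

/* Take a reference and remember which tracker owns it. */
static void model_hold_track(struct model_dev *dev, struct model_tracker *t)
{
	dev->refcnt++;
	dev->outstanding++;
	t->held = true;
}

/* Drop a reference only if this tracker actually holds one. */
static void model_put_track(struct model_dev *dev, struct model_tracker *t)
{
	if (!t->held) {
		/* The kernel would report a stack trace here; we just count. */
		dev->bad_puts++;
		return;
	}
	t->held = false;
	dev->outstanding--;
	dev->refcnt--;
}

/* At dismantle time, any tracker still outstanding indicates a leak. */
static int model_leaked(const struct model_dev *dev)
{
	return dev->outstanding;
}
```

In this model, calling `model_put_track()` twice with the same tracker increments `bad_puts` on the second call, and a tracker that is never put shows up in `model_leaked()`.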

    This is guarded by CONFIG_NET_DEV_REFCNT_TRACKER option.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-13 18:36:42 +02:00
Patrick Talbert 0b353d8be8 Merge: CNB: net: consolidate netif_rx() and make it callable from any context
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/968

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703
Tested: basic network tests on lo, tun, veth

This series consolidates netif_rx() and makes it callable from any context.
It is a backport of these upstream series:
 da54d75bebf4d8 ("Merge branch 'netdev-RT'")
 9f9919f73c94ae ("Merge branch 'netif_rx'")
 83b7b77af37a89 ("Merge branch 'netif_rx-conversions-part2'")
 e21af12622c0fb ("Merge branch 'netif_rx-part3'")

Omitted-fix: b903117b48681e12fae38e09c874f38c45186dc6
Omitted-fix: e1f9e434617fb28097223d9484de66218bc0b52d

Signed-off-by: Petr Oros <poros@redhat.com>

Approved-by: Jiri Benc <jbenc@redhat.com>
Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: John W. Linville <linville@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-06-10 09:44:49 +02:00
Ivan Vecera 0cdfbe9c70 net: sched: update default qdisc visibility after Tx queue cnt changes
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410

commit 1e080f17750d1083e8a32f7b350584ae1cd7ff20
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Mon Sep 13 15:53:30 2021 -0700

    net: sched: update default qdisc visibility after Tx queue cnt changes

    mq / mqprio make the default child qdiscs visible. They only do
    so for the qdiscs which are within real_num_tx_queues when the
    device is registered. Depending on order of calls in the driver,
    or if user space changes config via ethtool -L the number of
    qdiscs visible under tc qdisc show will differ from the number
    of queues. This is confusing to users and potentially to system
    configuration scripts which try to make sure qdiscs have the
    right parameters.

    Add a new Qdisc_ops callback and make relevant qdiscs TTRT.

    Note that this uncovers the "shortcut" created by
    commit 1f27cde313 ("net: sched: use pfifo_fast for non real queues")
    The default child qdiscs beyond initial real_num_tx are always
    pfifo_fast, no matter what the sysfs setting is. Fixing this
    gets a little tricky because we'd need to keep a reference
    on whatever the default qdisc was at the time of creation.
    In practice this is likely a non-issue: the qdiscs likely have
    to be configured to non-default settings, so whatever user space
    is doing such configuration can replace the pfifos... now that
    it will see them.

    Reported-by: Matthew Massey <matthewmassey@fb.com>
    Reviewed-by: Dave Taht <dave.taht@gmail.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-06 16:29:55 +02:00
Ivan Vecera bfa8b4c7ce net: add netif_set_real_num_queues() for device reconfig
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2094002

commit 271e5b7d00aeff7c61fb6c5415d14dbedb783b68
Author: Jakub Kicinski <kuba@kernel.org>
Date:   Tue Aug 3 06:05:26 2021 -0700

    net: add netif_set_real_num_queues() for device reconfig

    netif_set_real_num_rx_queues() and netif_set_real_num_tx_queues()
    can fail which breaks drivers trying to implement reconfiguration
    in a way that can't leave the device half-broken. In other words
    those functions are incompatible with prepare/commit approach.

    Luckily setting real number of queues can fail only if the number
    is increased, meaning that if we order operations correctly we
    can guarantee ending up with either new config (success), or
    the old one (on error).

    Provide a helper implementing such logic so that drivers don't
    have to duplicate it.
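
The ordering argument can be sketched as a user-space model. This is not the kernel helper: `struct model_dev`, the `model_*` functions and the `fail_tx_grow` test knob are hypothetical names introduced only for illustration. The model keeps the one property the message relies on, namely that only an increase can fail.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the queue-related fields of a net device. */
struct model_dev {
	unsigned int num_tx, num_rx;   /* allocated queues (upper bounds) */
	unsigned int real_tx, real_rx; /* queues currently in use */
	bool fail_tx_grow;             /* test knob: make a TX increase fail */
};

/* Model setter for RX: never fails in this sketch. */
static int model_set_real_rx(struct model_dev *dev, unsigned int rxq)
{
	dev->real_rx = rxq;
	return 0;
}

/* Model setter for TX: only an increase may fail, and only via the knob. */
static int model_set_real_tx(struct model_dev *dev, unsigned int txq)
{
	if (txq > dev->real_tx && dev->fail_tx_grow)
		return -ENOMEM;
	dev->real_tx = txq;
	return 0;
}

/* Apply both counts so the result is either the new config or the old one. */
static int model_set_real_num_queues(struct model_dev *dev,
				     unsigned int txq, unsigned int rxq)
{
	unsigned int old_rx = dev->real_rx;
	int err;

	if (txq < 1 || txq > dev->num_tx || rxq > dev->num_rx)
		return -EINVAL;

	/* Increases first: they are the only steps that can fail. */
	if (rxq > dev->real_rx) {
		err = model_set_real_rx(dev, rxq);
		if (err)
			return err;
	}
	if (txq > dev->real_tx) {
		err = model_set_real_tx(dev, txq);
		if (err) {
			/* The undo is a decrease or a no-op, so it cannot fail. */
			model_set_real_rx(dev, old_rx);
			return err;
		}
	}

	/* Decreases last: they always succeed. */
	if (rxq < dev->real_rx)
		model_set_real_rx(dev, rxq);
	if (txq < dev->real_tx)
		model_set_real_tx(dev, txq);
	return 0;
}
```

Because every fallible step runs before any infallible one, the error path only ever has to shrink a count back to its previous value.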

    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-06 16:24:58 +02:00
Petr Oros 6f8d815bcf net: dev: Use netif_rx().
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703

Upstream commit(s):
commit ad0a043fc26c17522ede3cc986d559f05ece20f4
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Thu Mar 3 18:15:05 2022 +0100

    net: dev: Use netif_rx().

    Since commit
       baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.")

    the function netif_rx() can be used in preemptible/thread context as
    well as in interrupt context.

    Use netif_rx().

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2022-06-06 11:54:24 +02:00
Petr Oros ee3d25c7a3 net: Correct wrong BH disable in hard-interrupt.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703

Upstream commit(s):
commit 167053f8dd0ed60287858448696b4784d7e1d899
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Wed Feb 16 18:50:46 2022 +0100

    net: Correct wrong BH disable in hard-interrupt.

    I missed the obvious case where netif_rx() is invoked from hard-IRQ
    context.

    Disabling bottom halves is only needed in process context. This ensures
    that the code remains on the current CPU and that the soft-interrupts
    are processed at local_bh_enable() time.
    In hard- and soft-interrupt context this is already the case and the
    soft-interrupts will be processed once the context is left (at irq-exit
    time).

    Disable bottom halves if neither hard-interrupts nor soft-interrupts are
    disabled. Update the kernel-doc, mention that interrupts must be enabled
    if invoked from process context.

    Fixes: baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.")
    Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
    Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
    Link: https://lore.kernel.org/r/Yg05duINKBqvnxUc@linutronix.de
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Petr Oros <poros@redhat.com>
2022-06-06 11:54:20 +02:00
Petr Oros 32c9187bad net: dev: Make rps_lock() disable interrupts.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703

Upstream commit(s):
commit e722db8de6e6932267457ace2657a19015f3db4a
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Sat Feb 12 00:38:39 2022 +0100

    net: dev: Make rps_lock() disable interrupts.

    Disabling interrupts and in the RPS case locking input_pkt_queue is
    split into local_irq_disable() and optional spin_lock().

    This breaks on PREEMPT_RT because the spinlock_t typed lock can not be
    acquired with disabled interrupts.
    The sections in which the lock is acquired are usually short in a sense that
    they are not causing long and unbounded latencies. One exception is the
    skb_flow_limit() invocation which may invoke a BPF program (and may
    require sleeping locks).

    By moving local_irq_disable() + spin_lock() into rps_lock(), we can keep
    interrupts disabled on !PREEMPT_RT and enabled on PREEMPT_RT kernels.
    Without RPS on a PREEMPT_RT kernel, the needed synchronisation happens
    as part of local_bh_disable() on the local CPU.
    ____napi_schedule() is only invoked if sd is from the local CPU. Replace
    it with __napi_schedule_irqoff() which already disables interrupts on
    PREEMPT_RT as needed. Move this call to rps_ipi_queued() and rename the
    function to napi_schedule_rps as suggested by Jakub.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Reviewed-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2022-06-06 11:25:38 +02:00
Petr Oros 56766d1469 net: dev: Makes sure netif_rx() can be invoked in any context.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703

Conflicts:
- drivers/net/amt.c Unmerged because file missing in rhel

Upstream commit(s):
commit baebdf48c360080710f80699eea3affbb13d6c65
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Sat Feb 12 00:38:38 2022 +0100

    net: dev: Makes sure netif_rx() can be invoked in any context.

    Dave suggested a while ago (eleven years by now) "Let's make netif_rx()
    work in all contexts and get rid of netif_rx_ni()". Eric agreed and
    pointed out that modern devices should use netif_receive_skb() to avoid
    the overhead.
    In the meantime someone added another variant, netif_rx_any_context(),
    which behaves as suggested.

    netif_rx() must be invoked with disabled bottom halves to ensure that
    pending softirqs, which were raised within the function, are handled.
    netif_rx_ni() can be invoked only from process context (bottom halves
    must be enabled) because the function handles pending softirqs without
    checking if bottom halves were disabled or not.
    netif_rx_any_context() invokes one of the former functions by checking
    in_interrupt().

    netif_rx() could be taught to handle both cases (disabled and enabled
    bottom halves) by simply disabling bottom halves while invoking
    netif_rx_internal(). The local_bh_enable() invocation will then invoke
    pending softirqs only if the BH-disable counter drops to zero.
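
The nesting behaviour described here can be modelled in user space. This is a sketch only: `struct model_cpu` and all `model_*` functions are hypothetical stand-ins, not kernel APIs, and the "softirq" is reduced to a pending flag and a run counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state: BH-disable nesting depth and pending work. */
struct model_cpu {
	int bh_disable_count;
	bool softirq_pending;
	int softirqs_run;
};

/* Enter a BH-disabled section; sections may nest. */
static void model_local_bh_disable(struct model_cpu *cpu)
{
	cpu->bh_disable_count++;
}

/* Pending softirqs run only when the outermost section is left. */
static void model_local_bh_enable(struct model_cpu *cpu)
{
	if (--cpu->bh_disable_count == 0 && cpu->softirq_pending) {
		cpu->softirq_pending = false;
		cpu->softirqs_run++;
	}
}

/* Stand-in for netif_rx_internal(): queues work and raises a softirq. */
static void model_rx_internal(struct model_cpu *cpu)
{
	cpu->softirq_pending = true;
}

/* Shape of the reworked netif_rx() as described in the message. */
static void model_netif_rx(struct model_cpu *cpu)
{
	model_local_bh_disable(cpu);
	model_rx_internal(cpu);
	model_local_bh_enable(cpu);
}
```

Called with bottom halves enabled, `model_netif_rx()` runs the pending work on exit. Called inside an outer BH-disabled section, the work stays pending until that outer section ends, which is why one function can serve both kinds of caller.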

    Eric is concerned about the overhead of BH-disable+enable especially in
    regard to the loopback driver. As critical as this driver is, it will
    receive a shortcut to avoid the additional overhead which is not needed.

    Add a local_bh_disable() section in netif_rx() to ensure softirqs are
    handled if needed.
    Provide __netif_rx() which does not disable BH and has a lockdep assert
    to ensure that interrupts are disabled. Use this shortcut in the
    loopback driver and in drivers/net/*.c.
    Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they
    can be removed once there are no more users left.

    Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2022-06-06 11:25:37 +02:00
Petr Oros c15df5c592 net: dev: Remove preempt_disable() and get_cpu() in netif_rx_internal().
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2089703

Upstream commit(s):
commit f234ae2947612825686b25cae3e9579188a6ba95
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Sat Feb 12 00:38:37 2022 +0100

    net: dev: Remove preempt_disable() and get_cpu() in netif_rx_internal().

    The preempt_disable() section was introduced in commit
        cece1945bf ("net: disable preemption before call smp_processor_id()")

    and adds it in case this function is invoked from preemptible context and
    because get_cpu() later on has been added.

    The get_cpu() usage was added in commit
        b0e28f1eff ("net: netif_rx() must disable preemption")

    because ip_dev_loopback_xmit() invoked netif_rx() with enabled preemption
    causing a warning in smp_processor_id(). The function netif_rx() should
    only be invoked from an interrupt context which implies disabled
    preemption. The commit
       e30b38c298 ("ip: Fix ip_dev_loopback_xmit()")

    was addressing this and replaced netif_rx() with netif_rx_ni() in
    ip_dev_loopback_xmit().

    Based on the discussion on the list, the former patch (b0e28f1eff)
    should not have been applied, only the latter (e30b38c298).

    Remove get_cpu() and preempt_disable() since the function is supposed to
    be invoked from context with stable per-CPU pointers. Bottom halves have
    to be disabled at this point because the function may raise softirqs
    which need to be processed.

    Link: https://lkml.kernel.org/r/20100415.013347.98375530.davem@davemloft.net
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2022-06-06 11:25:37 +02:00
Patrick Talbert 8c5b3f7fd9 Merge: XDP and networking eBPF rebase to v5.15
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/674

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

Depends: !572

Tested: Using bpf selftests, everything passes.

This rebases XDP and networking eBPF to upstream kernel version 5.15.

Signed-off-by: Jiri Benc <jbenc@redhat.com>

Approved-by: Hangbin Liu <haliu@redhat.com>
Approved-by: Rafael Aquini <aquini@redhat.com>
Approved-by: Toke Høiland-Jørgensen <toke@redhat.com>
Approved-by: Íñigo Huguet <ihuguet@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-06-03 09:26:25 +02:00
Patrick Talbert 092af648a0 Merge: bpf: update to v5.15
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/572

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041365

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>

Approved-by: Rado Vrbovsky <rvrbovsk@redhat.com>
Approved-by: Prarit Bhargava <prarit@redhat.com>
Approved-by: Yauheni Kaliuta <ykaliuta@redhat.com>
Approved-by: Artem Savkov <asavkov@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
2022-05-26 09:27:25 +02:00
Jiri Benc 7e6f15045c net: in_irq() cleanup
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit afa79d08c6c8e1901cb1547591e3ccd3ec6965d9
Author: Changbin Du <changbin.du@intel.com>
Date:   Fri Aug 13 22:57:49 2021 +0800

    net: in_irq() cleanup

    Replace the obsolete and ambiguous macro in_irq() with new
    macro in_hardirq().

    Signed-off-by: Changbin Du <changbin.du@gmail.com>
    Link: https://lore.kernel.org/r/20210813145749.86512-1-changbin.du@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:49 +02:00
Jiri Benc c773bf00b4 net, core: Allow netdev_lower_get_next_private_rcu in bh context
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit 689186699931313c7a42462602bd5c03eef77f9f
Author: Jussi Maki <joamaki@gmail.com>
Date:   Sat Jul 31 05:57:36 2021 +0000

    net, core: Allow netdev_lower_get_next_private_rcu in bh context

    For the XDP bonding slave lookup to work in the NAPI poll context in which
    the redundant rcu_read_lock() has been removed we have to follow the same
    approach as in 694cea395f ("bpf: Allow RCU-protected lookups to happen
    from bh context") and modify the WARN_ON to also check rcu_read_lock_bh_held().

    Signed-off-by: Jussi Maki <joamaki@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/20210731055738.16820-6-joamaki@gmail.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:48 +02:00
Jiri Benc 88b4e5f8ea net, core: Add support for XDP redirection to slave device
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

Conflicts:
- Using lower case __bpf_prog_run in bpf_prog_run_xdp due to out of order
  backport of fb7dd8bca013 ("bpf: Refactor BPF_PROG_RUN into a function")

commit 879af96ffd72706c6e3278ea6b45b0b0e37ec5d7
Author: Jussi Maki <joamaki@gmail.com>
Date:   Sat Jul 31 05:57:33 2021 +0000

    net, core: Add support for XDP redirection to slave device

    This adds the ndo_xdp_get_xmit_slave hook for transforming XDP_TX
    into XDP_REDIRECT after BPF program run when the ingress device
    is a bond slave.

    The dev_xdp_prog_count is exposed so that slave devices can be checked
    for loaded XDP programs in order to avoid the situation where both
    bond master and slave have programs loaded according to xdp_state.

    Signed-off-by: Jussi Maki <joamaki@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Jay Vosburgh <j.vosburgh@gmail.com>
    Cc: Veaceslav Falico <vfalico@gmail.com>
    Cc: Andy Gospodarek <andy@greyhouse.net>
    Link: https://lore.kernel.org/bpf/20210731055738.16820-3-joamaki@gmail.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:47 +02:00
Hangbin Liu b2ce8f1b0b net: initialize init_net earlier
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920
Upstream Status: net.git commit 9c1be1935fb6

Conflicts: context conflicts due to missing commit
41467d2ff4df ("net: net_namespace: Optimize the code")

commit 9c1be1935fb68b2413796cdc03d019b8cf35ab51
Author: Eric Dumazet <edumazet@google.com>
Date:   Sat Feb 5 09:01:25 2022 -0800

    net: initialize init_net earlier

    While testing a patch that will follow later
    ("net: add netns refcount tracker to struct nsproxy")
    I found that devtmpfs_init() was called before init_net
    was initialized.

    This is a bug, because devtmpfs_setup() calls
    ksys_unshare(CLONE_NEWNS);

    This has the effect of increasing init_net refcount,
    which will be later overwritten to 1, as part of setup_net(&init_net)

    We had too many prior patches [1] trying to work around the root cause.

    Really, make sure init_net is in BSS section, and that net_ns_init()
    is called earlier at boot time.

    Note that another patch ("vfs: add netns refcount tracker
    to struct fs_context") also will need net_ns_init() being called
    before vfs_caches_init()

    As a bonus, this patch saves around 4KB in .data section.

    [1]

    f8c46cb390 ("netns: do not call pernet ops for not yet set up init_net namespace")
    b5082df801 ("net: Initialise init_net.count to 1")
    734b65417b ("net: Statically initialize init_net.dev_base_head")

    v2: fixed a build error reported by kernel build bots (CONFIG_NET=n)

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-05 12:26:57 +08:00
Hangbin Liu 970a02e10a net: gro: avoid re-computing truesize twice on recycle
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920
Upstream Status: net.git commit 7881453e4adf

Conflicts: there is no net/core/gro.c due to missing commit
587652bbdd06 ("net: gro: populate net/core/gro.c")

commit 7881453e4adf497cf9109c84fa21eedda9ac6164
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Fri Feb 4 12:28:36 2022 +0100

    net: gro: avoid re-computing truesize twice on recycle

    After commit 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb
    carring sock reference") and commit af352460b465 ("net: fix GRO
    skb truesize update") the truesize of the skb with stolen head is
    properly updated by the GRO engine, so we no longer need to reset
    it at recycle time.

    v1 -> v2:
     - clarify the commit message (Alexander)

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Reviewed-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-05 12:26:41 +08:00
Hangbin Liu 9ef759e929 net: annotate data-races on txq->xmit_lock_owner
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920
Upstream Status: net.git commit 7a10d8c810cf

commit 7a10d8c810cfad3e79372d7d1c77899d86cd6662
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Nov 30 09:01:55 2021 -0800

    net: annotate data-races on txq->xmit_lock_owner

    syzbot found that __dev_queue_xmit() is reading txq->xmit_lock_owner
    without annotations.

    No serious issue there, let's document what is happening there.

    BUG: KCSAN: data-race in __dev_queue_xmit / __dev_queue_xmit

    write to 0xffff888139d09484 of 4 bytes by interrupt on cpu 0:
     __netif_tx_unlock include/linux/netdevice.h:4437 [inline]
     __dev_queue_xmit+0x948/0xf70 net/core/dev.c:4229
     dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265
     macvlan_queue_xmit drivers/net/macvlan.c:543 [inline]
     macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567
     __netdev_start_xmit include/linux/netdevice.h:4987 [inline]
     netdev_start_xmit include/linux/netdevice.h:5001 [inline]
     xmit_one+0x105/0x2f0 net/core/dev.c:3590
     dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606
     sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342
     __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817
     __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194
     dev_queue_xmit+0x13/0x20 net/core/dev.c:4259
     neigh_hh_output include/net/neighbour.h:511 [inline]
     neigh_output include/net/neighbour.h:525 [inline]
     ip6_finish_output2+0x995/0xbb0 net/ipv6/ip6_output.c:126
     __ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
     ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201
     NF_HOOK_COND include/linux/netfilter.h:296 [inline]
     ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224
     dst_output include/net/dst.h:450 [inline]
     NF_HOOK include/linux/netfilter.h:307 [inline]
     ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508
     ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702
     addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898
     call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421
     expire_timers+0x116/0x240 kernel/time/timer.c:1466
     __run_timers+0x368/0x410 kernel/time/timer.c:1734
     run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747
     __do_softirq+0x158/0x2de kernel/softirq.c:558
     __irq_exit_rcu kernel/softirq.c:636 [inline]
     irq_exit_rcu+0x37/0x70 kernel/softirq.c:648
     sysvec_apic_timer_interrupt+0x3e/0xb0 arch/x86/kernel/apic/apic.c:1097
     asm_sysvec_apic_timer_interrupt+0x12/0x20

    read to 0xffff888139d09484 of 4 bytes by interrupt on cpu 1:
     __dev_queue_xmit+0x5e3/0xf70 net/core/dev.c:4213
     dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265
     macvlan_queue_xmit drivers/net/macvlan.c:543 [inline]
     macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567
     __netdev_start_xmit include/linux/netdevice.h:4987 [inline]
     netdev_start_xmit include/linux/netdevice.h:5001 [inline]
     xmit_one+0x105/0x2f0 net/core/dev.c:3590
     dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606
     sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342
     __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817
     __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194
     dev_queue_xmit+0x13/0x20 net/core/dev.c:4259
     neigh_resolve_output+0x3db/0x410 net/core/neighbour.c:1523
     neigh_output include/net/neighbour.h:527 [inline]
     ip6_finish_output2+0x9be/0xbb0 net/ipv6/ip6_output.c:126
     __ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
     ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201
     NF_HOOK_COND include/linux/netfilter.h:296 [inline]
     ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224
     dst_output include/net/dst.h:450 [inline]
     NF_HOOK include/linux/netfilter.h:307 [inline]
     ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508
     ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702
     addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898
     call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421
     expire_timers+0x116/0x240 kernel/time/timer.c:1466
     __run_timers+0x368/0x410 kernel/time/timer.c:1734
     run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747
     __do_softirq+0x158/0x2de kernel/softirq.c:558
     __irq_exit_rcu kernel/softirq.c:636 [inline]
     irq_exit_rcu+0x37/0x70 kernel/softirq.c:648
     sysvec_apic_timer_interrupt+0x8d/0xb0 arch/x86/kernel/apic/apic.c:1097
     asm_sysvec_apic_timer_interrupt+0x12/0x20
     kcsan_setup_watchpoint+0x94/0x420 kernel/kcsan/core.c:443
     folio_test_anon include/linux/page-flags.h:581 [inline]
     PageAnon include/linux/page-flags.h:586 [inline]
     zap_pte_range+0x5ac/0x10e0 mm/memory.c:1347
     zap_pmd_range mm/memory.c:1467 [inline]
     zap_pud_range mm/memory.c:1496 [inline]
     zap_p4d_range mm/memory.c:1517 [inline]
     unmap_page_range+0x2dc/0x3d0 mm/memory.c:1538
     unmap_single_vma+0x157/0x210 mm/memory.c:1583
     unmap_vmas+0xd0/0x180 mm/memory.c:1615
     exit_mmap+0x23d/0x470 mm/mmap.c:3170
     __mmput+0x27/0x1b0 kernel/fork.c:1113
     mmput+0x3d/0x50 kernel/fork.c:1134
     exit_mm+0xdb/0x170 kernel/exit.c:507
     do_exit+0x608/0x17a0 kernel/exit.c:819
     do_group_exit+0xce/0x180 kernel/exit.c:929
     get_signal+0xfc3/0x1550 kernel/signal.c:2852
     arch_do_signal_or_restart+0x8c/0x2e0 arch/x86/kernel/signal.c:868
     handle_signal_work kernel/entry/common.c:148 [inline]
     exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
     exit_to_user_mode_prepare+0x113/0x190 kernel/entry/common.c:207
     __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
     syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:300
     do_syscall_64+0x50/0xd0 arch/x86/entry/common.c:86
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    value changed: 0x00000000 -> 0xffffffff

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 28712 Comm: syz-executor.0 Tainted: G        W         5.16.0-rc1-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Link: https://lore.kernel.org/r/20211130170155.2331929-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-05 12:26:41 +08:00
Hangbin Liu 1928aa8364 net: multicast: calculate csum of looped-back and forwarded packets
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2081920
Upstream Status: net.git commit 9122a70a6333

commit 9122a70a6333705c0c35614ddc51c274ed1d3637
Author: Cyril Strejc <cyril.strejc@skoda.cz>
Date:   Sun Oct 24 22:14:25 2021 +0200

    net: multicast: calculate csum of looped-back and forwarded packets

    While testing a user-space application which transmits UDP
    multicast datagrams and utilizes multicast routing to send the UDP
    datagrams out of defined network interfaces, I've found a multicast
    router does not fill-in UDP checksum into locally produced, looped-back
    and forwarded UDP datagrams, if an original output NIC the datagrams
    are sent to has UDP TX checksum offload enabled.

    The datagrams are sent malformed out of the NIC the datagrams have been
    forwarded to.

    It is because:

    1. If TX checksum offload is enabled on the output NIC, UDP checksum
       is not calculated by kernel and is not filled into skb data.

    2. dev_loopback_xmit(), which is called solely by
       ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY
       unconditionally.

    3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except
       CHECKSUM_COMPLETE"), the ip_summed value is preserved during
       forwarding.

    4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during
       a packet egress.

    The minimum fix in dev_loopback_xmit():

    1. Preserves skb->ip_summed CHECKSUM_PARTIAL. This is the
       case when the original output NIC has TX checksum offload enabled.
       The effects are:

         a) If the forwarding destination interface supports TX checksum
            offloading, the NIC driver is responsible to fill-in the
            checksum.

         b) If the forwarding destination interface does NOT support TX
            checksum offloading, checksums are filled-in by kernel before
            skb is submitted to the NIC driver.

         c) For local delivery, checksum validation is skipped as in the
            case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary().

    2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. It
       means, for CHECKSUM_NONE, the behavior is unmodified and is there
       to skip a looped-back packet local delivery checksum validation.
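
The two-point fix can be sketched as a pure function over the checksum state. This is a user-space model under stated assumptions: the `MODEL_*` enum and the `model_*` functions are hypothetical names mirroring the states named in the message, not the kernel's definitions.

```c
#include <assert.h>

/* Model of the skb->ip_summed states named in the commit message. */
enum model_csum {
	MODEL_CHECKSUM_NONE,
	MODEL_CHECKSUM_UNNECESSARY,
	MODEL_CHECKSUM_COMPLETE,
	MODEL_CHECKSUM_PARTIAL,
};

/* Old behaviour: every looped-back packet is marked as already verified. */
static enum model_csum model_loopback_csum_old(enum model_csum ip_summed)
{
	(void)ip_summed;
	return MODEL_CHECKSUM_UNNECESSARY;
}

/* Fixed behaviour: only NONE is translated, so PARTIAL is preserved. */
static enum model_csum model_loopback_csum_fixed(enum model_csum ip_summed)
{
	if (ip_summed == MODEL_CHECKSUM_NONE)
		return MODEL_CHECKSUM_UNNECESSARY;
	return ip_summed;
}

/* Cause 4 above: egress computes a checksum only for PARTIAL. */
static int model_egress_fills_csum(enum model_csum ip_summed)
{
	return ip_summed == MODEL_CHECKSUM_PARTIAL;
}
```

With the old translation a PARTIAL packet leaves the loopback path as UNNECESSARY, so nothing on egress fills in the checksum. With the fixed translation it stays PARTIAL and the checksum is computed by the NIC or by the kernel.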

    Signed-off-by: Cyril Strejc <cyril.strejc@skoda.cz>
    Reviewed-by: Willem de Bruijn <willemb@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Hangbin Liu <haliu@redhat.com>
2022-05-05 12:26:41 +08:00
Jerome Marchand 850123ac6a bpf: devmap: Implement devmap prog execution for generic XDP
Bugzilla: http://bugzilla.redhat.com/2041365

commit 2ea5eabaf04a1829383aefe98ac38a2e5ae2d698
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Jul 2 16:48:24 2021 +0530

    bpf: devmap: Implement devmap prog execution for generic XDP

    This lifts the restriction on running devmap BPF progs in generic
    redirect mode. To match native XDP behavior, it is invoked right before
    generic_xdp_tx is called, and only supports XDP_PASS/XDP_ABORTED/
    XDP_DROP actions.

    We also return 0 even if devmap program drops the packet, as
    semantically redirect has already succeeded and the devmap prog is the
    last point before TX of the packet to device where it can deliver a
    verdict on the packet.

    This also means it must take care of freeing the skb, as
    xdp_do_generic_redirect callers only do that in case an error is
    returned.

    Since devmap entry prog is supported, remove the check in
    generic_xdp_install entirely.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/20210702111825.491065-5-memxor@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:30 +02:00
Jerome Marchand 01fb58edc6 bpf: cpumap: Implement generic cpumap
Bugzilla: http://bugzilla.redhat.com/2041365

commit 11941f8a85362f612df61f4aaab0e41b64d2111d
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Jul 2 16:48:23 2021 +0530

    bpf: cpumap: Implement generic cpumap

    This change implements CPUMAP redirect support for generic XDP programs.
    The idea is to reuse the cpu map entry's queue that is used to push
    native xdp frames for redirecting skb to a different CPU. This will
    match native XDP behavior (in that RPS is invoked again for packet
    reinjected into networking stack).

    To be able to determine whether the incoming skb is from the driver or
    cpumap, we reuse skb->redirected bit that skips generic XDP processing
    when it is set. To always make use of this, CONFIG_NET_REDIRECT guard on
    it has been lifted and it is always available.

    From the redirect side, we add the skb to ptr_ring with its lowest bit
    set to 1.  This should be safe as skb is not 1-byte aligned. This allows
    kthread to discern between xdp_frames and sk_buff. On consumption of the
    ptr_ring item, the lowest bit is unset.
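The low-bit tagging described above can be sketched in userspace C. All names here (model_skb, tag_as_skb, entry_is_skb, entry_to_skb) are hypothetical and only illustrate the technique; they are not the helpers used in the cpumap code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an skb; only its alignment matters here. */
struct model_skb {
    int len;
};

/* Mark a ring entry as "this is an skb" by setting its lowest bit. */
void *tag_as_skb(struct model_skb *skb)
{
    uintptr_t raw = (uintptr_t)skb;

    /* Tagging is only safe when the low bit is free (alignment > 1). */
    assert((raw & 1u) == 0);
    return (void *)(raw | 1u);
}

/* True if the ring entry carries the skb tag. */
bool entry_is_skb(const void *entry)
{
    return ((uintptr_t)entry & 1u) != 0;
}

/* Clear the tag to recover the original pointer on consumption. */
struct model_skb *entry_to_skb(void *entry)
{
    return (struct model_skb *)((uintptr_t)entry & ~(uintptr_t)1);
}
```

A consumer would call entry_is_skb() on each dequeued item and only strip the tag for entries that carry it; untagged entries are treated as the other object type.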

    In the end, the skb is simply added to the list that kthread is anyway
    going to maintain for xdp_frames converted to skb, and then received
    again by using netif_receive_skb_list.

    Bulking optimization for generic cpumap is left as an exercise for a
    future patch for now.

    Since cpumap entry progs are now supported, also remove check in
    generic_xdp_install for the cpumap.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Link: https://lore.kernel.org/bpf/20210702111825.491065-4-memxor@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:30 +02:00
Jerome Marchand 9d29c832f5 net: core: Split out code to run generic XDP prog
Bugzilla: http://bugzilla.redhat.com/2041365

commit fe21cb91ae7bca1ae7805454be80b6d03bec85f7
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Date:   Fri Jul 2 16:48:21 2021 +0530

    net: core: Split out code to run generic XDP prog

    This helper can later be utilized in code that runs cpumap and devmap
    programs in generic redirect mode and adjust skb based on changes made
    to xdp_buff.

    When returning XDP_REDIRECT/XDP_TX, it invokes __skb_push, so whenever a
    generic redirect path invokes devmap/cpumap prog if set, it must
    __skb_pull again as we expect mac header to be pulled.

    It also drops the skb_reset_mac_len call after do_xdp_generic, as the
    mac_header and network_header are advanced by the same offset, so the
    difference (mac_len) remains constant.

    Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
    Link: https://lore.kernel.org/bpf/20210702111825.491065-2-memxor@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:30 +02:00
Ivan Vecera 85520fc44a net: annotate accesses to dev->gso_max_segs
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2073465

Conflicts:
- small context conflicts in octeontx2 driver

commit 6d872df3e3b91532b142de9044e5b4984017a55f
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Nov 19 07:43:32 2021 -0800

    net: annotate accesses to dev->gso_max_segs

    dev->gso_max_segs is written under RTNL protection, or when the device is
    not yet visible, but is read locklessly.

    Add netif_set_gso_max_segs() helper.

    Add the READ_ONCE()/WRITE_ONCE() pairs, and use netif_set_gso_max_segs()
    where we can to better document what is going on.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-04-08 16:46:08 +02:00
Herton R. Krzesinski 90182f8b73 Merge: ovs: backports P2 for 9.0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/431

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2045048
Tested: Sanity only

A bit large for a P2 backport; but those patches are needed and were
requested by members of the OVS team.

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Davide Caratti <dcaratti@redhat.com>
Approved-by: Eelco Chaudron <echaudro@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-02-15 22:45:52 +00:00
Herton R. Krzesinski 4f893751ba Merge: net: introduce kfree_skb_reason
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/405

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041931
Tested: Instructions in bz

Signed-off-by: Antoine Tenart <atenart@redhat.com>

Approved-by: Florian Westphal <fwestpha@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-01-26 22:28:46 +00:00
Herton R. Krzesinski adc4082e23 Merge: CNB: net: Remove redundant if statements
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/328

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037315

Series moving dev NULL check into dev_put()/dev_hold()

Signed-off-by: Petr Oros <poros@redhat.com>

Approved-by: John W. Linville <linville@redhat.com>
Approved-by: Andrea Claudi <aclaudi@redhat.com>
Approved-by: Corinna Vinschen <vinschen@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-01-26 22:11:25 +00:00
Antoine Tenart b5e24650b7 net/sched: Extend qdisc control block with tc control block
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2045048
Upstream Status: linux.git
Tested: Sanity only

commit ec624fe740b416fb68d536b37fb8eef46f90b5c2
Author: Paul Blakey <paulb@nvidia.com>
Date:   Tue Dec 14 19:24:33 2021 +0200

    net/sched: Extend qdisc control block with tc control block

    BPF layer extends the qdisc control block via struct bpf_skb_data_end
    and because of that there is no more room to add variables to the
    qdisc layer control block without going over the skb->cb size.

    Extend the qdisc control block with a tc control block,
    and move all tc related variables to there as a pre-step for
    extending the tc control block with additional members.

    Signed-off-by: Paul Blakey <paulb@nvidia.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-01-26 16:54:01 +01:00
Antoine Tenart 4a0269b225 net: skb: introduce kfree_skb_reason()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2041931
Upstream Status: linux.git
Tested: Instructions in bz

commit c504e5c2f9648a1e5c2be01e8c3f59d394192bd3
Author: Menglong Dong <imagedong@tencent.com>
Date:   Sun Jan 9 14:36:26 2022 +0800

    net: skb: introduce kfree_skb_reason()

    Introduce the interface kfree_skb_reason(), which is able to pass
    the reason why the skb is dropped to 'kfree_skb' tracepoint.

    Add the 'reason' field to 'trace_kfree_skb', so that users can get
    more detailed information about abnormal skbs with 'drop_monitor' or
    eBPF.

    All drop reasons are defined in the enum 'skb_drop_reason', and
    they will be printed as strings in the 'kfree_skb' tracepoint in the
    format 'reason: XXX'.
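A minimal sketch of the enum-to-string idea, assuming a made-up set of reasons. The enum values, the name table and model_format_drop_reason() are all hypothetical; they are not the kernel's enum skb_drop_reason or its tracepoint code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical reason codes; not the kernel's actual enum skb_drop_reason. */
enum model_drop_reason {
    MODEL_DROP_NOT_SPECIFIED,
    MODEL_DROP_NO_SOCKET,
    MODEL_DROP_BAD_CSUM,
    MODEL_DROP_MAX,
};

/* One printable name per reason, indexed by the enum value. */
static const char *const model_drop_names[MODEL_DROP_MAX] = {
    [MODEL_DROP_NOT_SPECIFIED] = "NOT_SPECIFIED",
    [MODEL_DROP_NO_SOCKET] = "NO_SOCKET",
    [MODEL_DROP_BAD_CSUM] = "BAD_CSUM",
};

/* Format "reason: XXX" into buf; returns whatever snprintf() returns. */
int model_format_drop_reason(char *buf, size_t len,
                             enum model_drop_reason reason)
{
    assert(reason < MODEL_DROP_MAX);
    return snprintf(buf, len, "reason: %s", model_drop_names[reason]);
}
```

Keeping the names in a table indexed by the enum means a new reason needs one enum entry and one string, and the formatting code stays unchanged.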

    ( Maybe the reasons should be defined in a uapi header file, so that
    user space can use them? )

    Signed-off-by: Menglong Dong <imagedong@tencent.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Antoine Tenart <atenart@redhat.com>
2022-01-21 10:05:00 +01:00
Herton R. Krzesinski b8f20958b7 Merge: net: core stable backport for rhel 9.0
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/212

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276
Tested: LNST, Tier1

This includes a few critical bugfixes for the core network stack.

Notably it includes 7f678def99d2 ("skb_expand_head() adjust skb->truesize incorrectly") and a whole series of pre-requisites. The bug addressed there is nasty and present even prior to skb_expand_head() introduction.

commit 719c57197010 ("net: make napi_disable() symmetric with enable") instead has been explicitly excluded, as it's not really a fix, is known to introduce problems and it's still quite new

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Jarod Wilson <jarod@redhat.com>
Approved-by: Antoine Tenart <atenart@redhat.com>
Approved-by: Guillaume Nault <gnault@redhat.com>
Approved-by: Jiri Benc <jbenc@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-01-14 16:53:21 +00:00
Herton R. Krzesinski 911d813798 Merge: net/sched: 9.0 P1 backports from upstream
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/197

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2025552  
Upstream Status: all mainline in net.git  
Conflicts: None  
Tested: boot-tested only  

Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Approved-by: Kamal Heib <kheib@redhat.com>
Approved-by: Ivan Vecera <ivecera@redhat.com>
Approved-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Approved-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-01-12 15:43:04 +00:00
Petr Oros ea6b084bc4 net: Remove redundant if statements
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2037315

Upstream commit(s):
commit 1160dfa178eb848327e9dec39960a735f4dc1685
Author: Yajun Deng <yajun.deng@linux.dev>
Date:   Thu Aug 5 19:55:27 2021 +0800

    net: Remove redundant if statements

    The 'if (dev)' check has already been moved into dev_{put,hold}(), so
    remove the redundant if statements at the call sites.
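The pattern can be sketched as follows: once the helper checks for NULL itself, callers no longer need their own guard. struct model_netdev and the model_dev_*() helpers are hypothetical userspace stand-ins with a plain integer counter, not the kernel's reference counting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device with a plain counter. */
struct model_netdev {
    int refcnt;
};

/* The NULL check lives inside the helper, so callers can pass NULL. */
void model_dev_hold(struct model_netdev *dev)
{
    if (dev)
        dev->refcnt++;
}

void model_dev_put(struct model_netdev *dev)
{
    if (dev) {
        /* Dropping a reference that was never taken is a bug. */
        assert(dev->refcnt > 0);
        dev->refcnt--;
    }
}
```

A call site that previously read "if (dev) model_dev_put(dev);" can then be reduced to a bare "model_dev_put(dev);".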

    Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Petr Oros <poros@redhat.com>
2022-01-10 16:20:08 +01:00
Herton R. Krzesinski adc818bf26 Merge: Replace deprecated CPU-hotplug functions for kernel-rt
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/134
Bugzilla: http://bugzilla.redhat.com/2023079

Depends: https://gitlab.com/redhat/rhel/src/kernel/rhel-8/-/merge_requests/99

The kernel-rt variant requires these changes in order to make future
changes to the RHEL9 kernel.  These changes were found by code inspection
and affect not only kernel-rt but the regular kernel variants as well.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
RH-Acked-by: Rafael Aquini <aquini@redhat.com>
RH-Acked-by: John W. Linville <linville@redhat.com>
RH-Acked-by: David Arcari <darcari@redhat.com>
RH-Acked-by: Vladis Dronov <vdronov@redhat.com>
RH-Acked-by: Jiri Benc <jbenc@redhat.com>
RH-Acked-by: Jarod Wilson <jarod@redhat.com>
RH-Acked-by: Waiman Long <longman@redhat.com>
RH-Acked-by: Phil Auld <pauld@redhat.com>
RH-Acked-by: Wander Lairson Costa <wander@redhat.com>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
2022-01-10 11:46:27 -03:00
Paolo Abeni d27bdebcab sk_buff: avoid potentially clearing 'slow_gro' field
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927
Tested: LNST, Tier1

Upstream commit:
commit a432934a30679c0e3c47b87f13e4901bc1a3fc03
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Fri Jul 30 18:30:53 2021 +0200

    sk_buff: avoid potentially clearing 'slow_gro' field

    If skb_dst_set_noref() is invoked with a NULL dst, the 'slow_gro'
    field is cleared, too. That could lead to wrong behavior if
    the skb later enters the GRO stage.

    Fix the potential issue by preserving a non-zero value of
    the 'slow_gro' field.
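One way to preserve a non-zero flag is to OR the new condition into it instead of assigning it. The sketch below is a userspace model under that assumption; struct gro_skb, struct model_dst and model_skb_dst_set_noref() are hypothetical names, and the real helper does more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a dst entry and the relevant skb fields. */
struct model_dst {
    int unused;
};

struct gro_skb {
    struct model_dst *dst;
    bool slow_gro;
};

/* Attach a (possibly NULL) dst without clearing an already set flag. */
void model_skb_dst_set_noref(struct gro_skb *skb, struct model_dst *dst)
{
    assert(skb != NULL);
    skb->dst = dst;
    /* OR the condition in: a NULL dst leaves a previously set bit alone. */
    skb->slow_gro |= (dst != NULL);
}
```

With a plain assignment, a NULL dst would reset slow_gro to false even if some earlier step had set it for another reason.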

    Additionally, fix a comment typo.

    Reported-by: Sabrina Dubroca <sd@queasysnail.net>
    Reported-by: Jakub Kicinski <kuba@kernel.org>
    Fixes: 8a886b142bd0 ("sk_buff: track dst status in slow_gro")
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Link: https://lore.kernel.org/r/aa42529252dc8bb02bd42e8629427040d1058537.1627662501.git.pabeni@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 18:58:21 +01:00
Paolo Abeni 2bea014388 skbuff: allow 'slow_gro' for skb carring sock reference
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927
Tested: LNST, Tier1

Upstream commit:
commit 5e10da5385d20c4bae587bc2921e5fdd9655d5fc
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Jul 28 18:24:03 2021 +0200

    skbuff: allow 'slow_gro' for skb carring sock reference

    This change leverages the infrastructure introduced by the previous
    patches to allow soft devices to pass owned skbs to the GRO engine
    without impacting the fast-path.

    It's up to the GRO caller to ensure the slow_gro bit validity before
    invoking the GRO engine. The new helper skb_prepare_for_gro() is
    introduced for that goal.

    On slow_gro, skbs are aggregated only with equal sk.
    Additionally, skb truesize on GRO recycle and free is correctly
    updated so that sk wmem is not changed by the GRO processing.

    rfc-> v1:
     - fixed bad truesize on dev_gro_receive NAPI_FREE
     - use the existing state bit

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 18:57:52 +01:00
Paolo Abeni 9ce6ef4e71 net: optimize GRO for the common case.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028927
Tested: LNST, Tier1

Upstream commit:
commit 9efb4b5baf6ce851b247288992b0632cb4d31c17
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Wed Jul 28 18:24:02 2021 +0200

    net: optimize GRO for the common case.

    After the previous patches, at GRO time, skb->slow_gro is
    usually 0, unless the packet comes from some H/W offload
    slowpath or tunnel.

    We can optimize the GRO code assuming !skb->slow_gro is likely.
    This removes multiple conditionals in the most common path, at the
    price of an additional one when we hit the above "slow-paths".

    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 18:57:26 +01:00
Prarit Bhargava 286c7df21b net: Replace deprecated CPU-hotplug functions.
Bugzilla: http://bugzilla.redhat.com/2023079

commit 372bbdd5bb3fc454d9c280dc0914486a3c7419d5
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Tue Aug 3 16:16:06 2021 +0200

    net: Replace deprecated CPU-hotplug functions.

    The functions get_online_cpus() and put_online_cpus() have been
    deprecated during the CPU hotplug rework. They map directly to
    cpus_read_lock() and cpus_read_unlock().

    Replace deprecated CPU-hotplug functions with the official version.
    The behavior remains unchanged.

    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
2021-12-09 09:04:08 -05:00
Davide Caratti bee2c235ef net/sched: store the last executed chain also for clsact egress
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2025552
Upstream Status: net-next.git commit 3aa260559455

commit 3aa2605594556c676fb88744bd9845acae60683d
Author: Davide Caratti <dcaratti@redhat.com>
Date:   Wed Jul 28 20:08:00 2021 +0200

    net/sched: store the last executed chain also for clsact egress

    Currently, only 'ingress' and 'clsact ingress' qdiscs store the tc 'chain
    id' in the skb extension. However, userspace programs (like ovs) are able
    to set up egress rules, and datapath gets confused in case it doesn't find
    the 'chain id' for a packet that's "recirculated" by tc.
    Change tcf_classify() to have the same semantics as tcf_classify_ingress()
    so that a single function can be called in ingress / egress, using the tc
    ingress / egress block respectively.

    Suggested-by: Alaa Hleilel <alaa@nvidia.com>
    Signed-off-by: Davide Caratti <dcaratti@redhat.com>
    Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2021-12-09 12:01:45 +01:00
Paolo Abeni 96d14cbcf2 net: Prevent infinite while loop in skb_tx_hash()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276
Tested: LNST, Tier1

Upstream commit:
commit 0c57eeecc559ca6bc18b8c4e2808bc78dbe769b0
Author: Michael Chan <michael.chan@broadcom.com>
Date:   Mon Oct 25 05:05:28 2021 -0400

    net: Prevent infinite while loop in skb_tx_hash()

    Drivers call netdev_set_num_tc() and then netdev_set_tc_queue()
    to set the queue count and offset for each TC.  So the queue count
    and offset for the TCs may be zero for a short period after dev->num_tc
    has been set.  If a TX packet is being transmitted at this time in the
    code path netdev_pick_tx() -> skb_tx_hash(), skb_tx_hash() may see
    nonzero dev->num_tc but zero qcount for the TC.  The while loop that
    keeps looping while hash >= qcount will not end.

    Fix it by checking the TC's qcount to be nonzero before using it.
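The failure mode and the guard can be modelled in userspace as below. The struct, the function name and the fallback to the full queue range are assumptions of this sketch; the commit message only states that qcount is checked for being nonzero before use.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one traffic class's TX queue range. */
struct model_tc_range {
    uint16_t offset;
    uint16_t count;
};

/* Map a recorded queue index onto a TX queue of the given traffic class. */
uint16_t model_pick_tx_queue(uint16_t hash, struct model_tc_range tc,
                             uint16_t real_num_tx_queues)
{
    uint16_t qoffset = tc.offset;
    uint16_t qcount = tc.count;

    assert(real_num_tx_queues > 0);

    /* Guard: with qcount == 0 the loop below would never terminate. */
    if (qcount == 0) {
        qoffset = 0;
        qcount = real_num_tx_queues; /* assumed fallback for this sketch */
    }

    if (hash >= qoffset)
        hash -= qoffset;
    while (hash >= qcount)
        hash -= qcount;

    return hash + qoffset;
}
```

Without the guard, a class whose range has not been filled in yet (count still zero) makes "hash >= qcount" true for every hash, and subtracting zero never changes it.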

    Fixes: eadec877ce ("net: Add support for subordinate traffic classes to netdev_pick_tx")
    Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
    Signed-off-by: Michael Chan <michael.chan@broadcom.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 10:44:31 +01:00
Paolo Abeni a1950c1dcf napi: fix race inside napi_enable
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2028276
Tested: LNST, Tier1

Upstream commit:
commit 3765996e4f0b8a755cab215a08df744490c76052
Author: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Date:   Sat Sep 18 16:52:32 2021 +0800

    napi: fix race inside napi_enable

    The following sequence leaves NAPI_STATE_SCHED set in napi.state while
    the napi is not in the poll_list, which causes napi_disable() to get
    stuck.

    The prefix "NAPI_STATE_" is removed in the figure below, and
    NAPI_STATE_HASHED is ignored in napi.state.

                          CPU0       |                   CPU1       | napi.state
    ===============================================================================
    napi_disable()                   |                              | SCHED | NPSVC
    napi_enable()                    |                              |
    {                                |                              |
        smp_mb__before_atomic();     |                              |
        clear_bit(SCHED, &n->state); |                              | NPSVC
                                     | napi_schedule_prep()         | SCHED | NPSVC
                                     | napi_poll()                  |
                                     |   napi_complete_done()       |
                                     |   {                          |
                                     |      if (n->state & (NPSVC | | (1)
                                     |               _BUSY_POLL)))  |
                                     |           return false;      |
                                     |     ................         |
                                     |   }                          | SCHED | NPSVC
                                     |                              |
        clear_bit(NPSVC, &n->state); |                              | SCHED
    }                                |                              |
                                     |                              |
    napi_schedule_prep()             |                              | SCHED | MISSED (2)

    (1) Returns directly here because NAPI_STATE_NPSVC is set.
    (2) NAPI_STATE_SCHED is already set, so napi.poll_list is not added to
        sd->poll_list.

    Since NAPI_STATE_SCHED already exists and napi is not in the
    sd->poll_list queue, NAPI_STATE_SCHED cannot be cleared and will always
    exist.

    1. This will cause this queue to no longer receive packets.
    2. If you encounter napi_disable under the protection of rtnl_lock, it
       will cause the entire rtnl_lock to be locked, affecting the overall
       system.

    This patch uses cmpxchg to implement napi_enable(), which ensures that
    there is no race caused by clearing the two bits separately.
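The idea of clearing both bits in one atomic step can be sketched with C11 atomics. This is a userspace model only: the bit values and model_napi_enable() are hypothetical, and a C11 compare-exchange loop stands in for the kernel's cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative bit values; the kernel's NAPI_STATE_* constants differ. */
enum {
    MODEL_SCHED = 1 << 0,
    MODEL_NPSVC = 1 << 1,
    MODEL_HASHED = 1 << 2,
};

/* Clear SCHED and NPSVC together, retrying if the state changed meanwhile. */
void model_napi_enable(atomic_ulong *state)
{
    unsigned long old = atomic_load(state);
    unsigned long new_state;

    do {
        /* Enabling only makes sense on an instance that was disabled. */
        assert(old & MODEL_SCHED);
        new_state = old & ~(unsigned long)(MODEL_SCHED | MODEL_NPSVC);
        /* On failure, 'old' is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak(state, &old, new_state));
}
```

Because both bits disappear in a single successful exchange, no other CPU can observe the intermediate "NPSVC set, SCHED clear" state shown in the diagram above.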

    Fixes: 2d8bff1269 ("netpoll: Close race condition between poll_one_napi and napi_disable")
    Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
    Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2021-12-09 10:44:31 +01:00
David S. Miller 20192d9c9f Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Andrii Nakryiko says:

====================
pull-request: bpf 2021-07-15

The following pull-request contains BPF updates for your *net* tree.

We've added 9 non-merge commits during the last 5 day(s) which contain
a total of 9 files changed, 37 insertions(+), 15 deletions(-).

The main changes are:

1) Fix NULL pointer dereference in BPF_TEST_RUN for BPF_XDP_DEVMAP and
   BPF_XDP_CPUMAP programs, from Xuan Zhuo.

2) Fix use-after-free of net_device in XDP bpf_link, from Xuan Zhuo.

3) Follow-up fix to subprog poke descriptor use-after-free problem, from
   Daniel Borkmann and John Fastabend.

4) Fix out-of-range array access in s390 BPF JIT backend, from Colin Ian King.

5) Fix memory leak in BPF sockmap, from John Fastabend.

6) Fix for sockmap to prevent proc stats reporting bug, from John Fastabend
   and Jakub Sitnicki.

7) Fix NULL pointer dereference in bpftool, from Tobias Klauser.

8) AF_XDP documentation fixes, from Baruch Siach.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15 14:39:45 -07:00
Qitao Xu 70713dddf3 net_sched: introduce tracepoint trace_qdisc_enqueue()
Tracepoint trace_qdisc_enqueue() is introduced to trace skb at
the entrance of TC layer on TX side. This is similar to
trace_qdisc_dequeue():

1. For both we only trace successful cases. The failure cases
   can be traced via trace_kfree_skb().

2. They are called at entrance or exit of TC layer, not for each
   ->enqueue() or ->dequeue(). This is intentional, because
   we want to make trace_qdisc_enqueue() symmetric to
   trace_qdisc_dequeue(), which is easier to use.

The return value of qdisc_enqueue() is not interesting here: some Qdiscs
drop packets in ->dequeue(), so those drops cannot be traced even with the
return value. The only way to trace them is by tracing kfree_skb().

We only add the information we need to the trace ring buffer. If any other
information is needed, it is easy to extend it without breaking ABI,
see commit 3dd344ea84 ("net: tracepoint: exposing sk_family in all
tcp:tracepoints").

Reviewed-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Qitao Xu <qitao.xu@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-15 10:32:38 -07:00
Xuan Zhuo 5acc7d3e8d xdp, net: Fix use-after-free in bpf_xdp_link_release
The problem occurs when dev_xdp_uninstall() runs between dev_get_by_index()
and dev_xdp_attach_link(). In that case the xdp link is not detached
automatically when dev is released. Since link->dev already points to dev,
releasing the xdp link later accesses dev after it has been freed.

dev_get_by_index()        |
link->dev = dev           |
                          |      rtnl_lock()
                          |      unregister_netdevice_many()
                          |          dev_xdp_uninstall()
                          |      rtnl_unlock()
rtnl_lock();              |
dev_xdp_attach_link()     |
rtnl_unlock();            |
                          |      netdev_run_todo() // dev released
bpf_xdp_link_release()    |
    /* access dev.        |
       use-after-free */  |

[   45.966867] BUG: KASAN: use-after-free in bpf_xdp_link_release+0x3b8/0x3d0
[   45.967619] Read of size 8 at addr ffff00000f9980c8 by task a.out/732
[   45.968297]
[   45.968502] CPU: 1 PID: 732 Comm: a.out Not tainted 5.13.0+ #22
[   45.969222] Hardware name: linux,dummy-virt (DT)
[   45.969795] Call trace:
[   45.970106]  dump_backtrace+0x0/0x4c8
[   45.970564]  show_stack+0x30/0x40
[   45.970981]  dump_stack_lvl+0x120/0x18c
[   45.971470]  print_address_description.constprop.0+0x74/0x30c
[   45.972182]  kasan_report+0x1e8/0x200
[   45.972659]  __asan_report_load8_noabort+0x2c/0x50
[   45.973273]  bpf_xdp_link_release+0x3b8/0x3d0
[   45.973834]  bpf_link_free+0xd0/0x188
[   45.974315]  bpf_link_put+0x1d0/0x218
[   45.974790]  bpf_link_release+0x3c/0x58
[   45.975291]  __fput+0x20c/0x7e8
[   45.975706]  ____fput+0x24/0x30
[   45.976117]  task_work_run+0x104/0x258
[   45.976609]  do_notify_resume+0x894/0xaf8
[   45.977121]  work_pending+0xc/0x328
[   45.977575]
[   45.977775] The buggy address belongs to the page:
[   45.978369] page:fffffc00003e6600 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x4f998
[   45.979522] flags: 0x7fffe0000000000(node=0|zone=0|lastcpupid=0x3ffff)
[   45.980349] raw: 07fffe0000000000 fffffc00003e6708 ffff0000dac3c010 0000000000000000
[   45.981309] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[   45.982259] page dumped because: kasan: bad access detected
[   45.982948]
[   45.983153] Memory state around the buggy address:
[   45.983753]  ffff00000f997f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   45.984645]  ffff00000f998000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   45.985533] >ffff00000f998080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   45.986419]                                               ^
[   45.987112]  ffff00000f998100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   45.988006]  ffff00000f998180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[   45.988895] ==================================================================
[   45.989773] Disabling lock debugging due to kernel taint
[   45.990552] Kernel panic - not syncing: panic_on_warn set ...
[   45.991166] CPU: 1 PID: 732 Comm: a.out Tainted: G    B             5.13.0+ #22
[   45.991929] Hardware name: linux,dummy-virt (DT)
[   45.992448] Call trace:
[   45.992753]  dump_backtrace+0x0/0x4c8
[   45.993208]  show_stack+0x30/0x40
[   45.993627]  dump_stack_lvl+0x120/0x18c
[   45.994113]  dump_stack+0x1c/0x34
[   45.994530]  panic+0x3a4/0x7d8
[   45.994930]  end_report+0x194/0x198
[   45.995380]  kasan_report+0x134/0x200
[   45.995850]  __asan_report_load8_noabort+0x2c/0x50
[   45.996453]  bpf_xdp_link_release+0x3b8/0x3d0
[   45.997007]  bpf_link_free+0xd0/0x188
[   45.997474]  bpf_link_put+0x1d0/0x218
[   45.997942]  bpf_link_release+0x3c/0x58
[   45.998429]  __fput+0x20c/0x7e8
[   45.998833]  ____fput+0x24/0x30
[   45.999247]  task_work_run+0x104/0x258
[   45.999731]  do_notify_resume+0x894/0xaf8
[   46.000236]  work_pending+0xc/0x328
[   46.000697] SMP: stopping secondary CPUs
[   46.001226] Dumping ftrace buffer:
[   46.001663]    (ftrace buffer empty)
[   46.002110] Kernel Offset: disabled
[   46.002545] CPU features: 0x00000001,23202c00
[   46.003080] Memory Limit: none

Fixes: aa8d3a716b ("bpf, xdp: Add bpf_link-based XDP attachment API")
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210710031635.41649-1-xuanzhuo@linux.alibaba.com
2021-07-13 08:22:31 -07:00
Antoine Tenart 28b34f01a7 net: do not reuse skbuff allocated from skbuff_fclone_cache in the skb cache
Some socket buffers allocated in the fclone cache (in __alloc_skb) can
end-up in the following path[1]:

napi_skb_finish
  __kfree_skb_defer
    napi_skb_cache_put

The issue is napi_skb_cache_put is not fclone friendly and will put
those skbuff in the skb cache to be reused later, although this cache
only expects skbuff allocated from skbuff_head_cache. When this happens
the skbuff is eventually freed using the wrong origin cache, and we can
see traces similar to:

[ 1223.947534] cache_from_obj: Wrong slab cache. skbuff_head_cache but object is from skbuff_fclone_cache
[ 1223.948895] WARNING: CPU: 3 PID: 0 at mm/slab.h:442 kmem_cache_free+0x251/0x3e0
[ 1223.950211] Modules linked in:
[ 1223.950680] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.13.0+ #474
[ 1223.951587] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-3.fc34 04/01/2014
[ 1223.953060] RIP: 0010:kmem_cache_free+0x251/0x3e0

Leading sometimes to other memory related issues.

Fix this by using __kfree_skb for fclone skbuff, similar to what is done
in the other place __kfree_skb_defer is called.
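The decision can be modelled as a small routing function. Everything here is a hypothetical userspace sketch: the from_fclone_cache flag stands in for however the kernel identifies fclone skbs, and the two enum values stand in for the __kfree_skb() and __kfree_skb_defer() paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which release path an skb should take in this model. */
enum model_free_path {
    MODEL_FREE_TO_ORIGIN_CACHE, /* stands in for __kfree_skb() */
    MODEL_PUT_IN_NAPI_CACHE,    /* stands in for __kfree_skb_defer() */
};

/* Hypothetical skb model: only records which slab cache it came from. */
struct napi_skb {
    bool from_fclone_cache;
};

/* fclone skbs must go back to their own cache, never into the NAPI cache. */
enum model_free_path model_napi_free_path(const struct napi_skb *skb)
{
    assert(skb != NULL);
    if (skb->from_fclone_cache)
        return MODEL_FREE_TO_ORIGIN_CACHE;
    return MODEL_PUT_IN_NAPI_CACHE;
}
```

The point of the split is that the NAPI cache later frees or reuses every entry as if it came from skbuff_head_cache, so only objects from that cache may enter it.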

[1] At least in setups using veth pairs and tunnels. Building a kernel
    with KASAN we can for example see packets allocated in
    sk_stream_alloc_skb hit the above path and later the issue arises
    when the skbuff is reused.

Fixes: 9243adfc31 ("skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing")
Cc: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-09 11:26:27 -07:00
Florian Fainelli 9615fe36b3 skbuff: Fix build with SKB extensions disabled
We will fail to build with CONFIG_SKB_EXTENSIONS disabled after
8550ff8d8c ("skbuff: Release nfct refcount on napi stolen or re-used
skbs") since there is an unconditional use of skb_ext_find() without
an appropriate stub. Simply build the code conditionally and properly
guard against both CONFIG_SKB_EXTENSIONS as well as
CONFIG_NET_TC_SKB_EXT being disabled.

Fixes: 8550ff8d8c ("skbuff: Release nfct refcount on napi stolen or re-used skbs")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-08 00:07:14 -07:00
Paul Blakey 8550ff8d8c skbuff: Release nfct refcount on napi stolen or re-used skbs
When multiple SKBs are merged to a new skb under napi GRO,
or SKB is re-used by napi, if nfct was set for them in the
driver, it will not be released while freeing their stolen
head state or on re-use.

Release nfct on napi's stolen or re-used SKBs, and
in gro_list_prepare, check conntrack metadata diff.

Fixes: 5c6b946047 ("net/mlx5e: CT: Handle misses after executing CT action")
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-06 10:26:29 -07:00
Linus Torvalds dbe69e4337 Networking changes for 5.14.
Core:
 
  - BPF:
    - add syscall program type and libbpf support for generating
      instructions and bindings for in-kernel BPF loaders (BPF loaders
      for BPF), this is a stepping stone for signed BPF programs
    - infrastructure to migrate TCP child sockets from one listener
      to another in the same reuseport group/map to improve flexibility
      of service hand-off/restart
    - add broadcast support to XDP redirect
 
  - allow bypass of the lockless qdisc to improve performance
    (for pktgen: +23% with one thread, +44% with 2 threads)
 
  - add a simpler version of "DO_ONCE()" which does not require
    jump labels, intended for slow-path usage
 
  - virtio/vsock: introduce SOCK_SEQPACKET support
 
  - add getsockopt to retrieve netns cookie
 
  - ip: treat lowest address of an IPv4 subnet as ordinary unicast address
        allowing reclaiming of precious IPv4 addresses
 
  - ipv6: use prandom_u32() for ID generation
 
  - ip: add support for more flexible field selection for hashing
        across multi-path routes (w/ offload to mlxsw)
 
  - icmp: add support for extended RFC 8335 PROBE (ping)
 
  - seg6: add support for SRv6 End.DT46 behavior
 
  - mptcp:
     - DSS checksum support (RFC 8684) to detect middlebox meddling
     - support Connection-time 'C' flag
     - time stamping support
 
  - sctp: Packetization Layer Path MTU Discovery (RFC 8899)
 
  - xfrm: speed up state addition with seq set
 
  - WiFi:
     - hidden AP discovery on 6 GHz and other HE 6 GHz improvements
     - aggregation handling improvements for some drivers
     - minstrel improvements for no-ack frames
     - deferred rate control for TXQs to improve reaction times
     - switch from round robin to virtual time-based airtime scheduler
 
  - add trace points:
     - tcp checksum errors
     - openvswitch - action execution, upcalls
     - socket errors via sk_error_report
 
 Device APIs:
 
  - devlink: add rate API for hierarchical control of max egress rate
             of virtual devices (VFs, SFs etc.)
 
  - don't require RCU read lock to be held around BPF hooks
    in NAPI context
 
  - page_pool: generic buffer recycling
 
 New hardware/drivers:
 
  - mobile:
     - iosm: PCIe Driver for Intel M.2 Modem
     - support for Qualcomm MSM8998 (ipa)
 
  - WiFi: Qualcomm QCN9074 and WCN6855 PCI devices
 
  - sparx5: Microchip SparX-5 family of Enterprise Ethernet switches
 
  - Mellanox BlueField Gigabit Ethernet (control NIC of the DPU)
 
  - NXP SJA1110 Automotive Ethernet 10-port switch
 
  - Qualcomm QCA8327 switch support (qca8k)
 
  - Mikrotik 10/25G NIC (atl1c)
 
 Driver changes:
 
  - ACPI support for some MDIO, MAC and PHY devices from Marvell and NXP
    (our first foray into MAC/PHY description via ACPI)
 
  - HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx
 
  - Mellanox/Nvidia NIC (mlx5)
    - NIC VF offload of L2 bridging
    - support IRQ distribution to Sub-functions
 
  - Marvell (prestera):
     - add flower and match all
     - devlink trap
     - link aggregation
 
  - Netronome (nfp): connection tracking offload
 
  - Intel 1GE (igc): add AF_XDP support
 
  - Marvell DPU (octeontx2): ingress ratelimit offload
 
  - Google vNIC (gve): new ring/descriptor format support
 
  - Qualcomm mobile (rmnet & ipa): inline checksum offload support
 
  - MediaTek WiFi (mt76)
     - mt7915 MSI support
     - mt7915 Tx status reporting
     - mt7915 thermal sensors support
     - mt7921 decapsulation offload
     - mt7921 enable runtime pm and deep sleep
 
  - Realtek WiFi (rtw88)
     - beacon filter support
     - Tx antenna path diversity support
     - firmware crash information via devcoredump
 
  - Qualcomm WiFi (wcn36xx)
     - Wake-on-WLAN support with magic packets and GTK rekeying
 
  - Micrel PHY (ksz886x/ksz8081): add cable test support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:

* tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2168 commits)
  tcp: change ICSK_CA_PRIV_SIZE definition
  tcp_yeah: check struct yeah size at compile time
  gve: DQO: Fix off by one in gve_rx_dqo()
  stmmac: intel: set PCI_D3hot in suspend
  stmmac: intel: Enable PHY WOL option in EHL
  net: stmmac: option to enable PHY WOL with PMT enabled
  net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del}
  net: use netdev_info in ndo_dflt_fdb_{add,del}
  ptp: Set lookup cookie when creating a PTP PPS source.
  net: sock: add trace for socket errors
  net: sock: introduce sk_error_report
  net: dsa: replay the local bridge FDB entries pointing to the bridge dev too
  net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev
  net: dsa: include fdb entries pointing to bridge in the host fdb list
  net: dsa: include bridge addresses which are local in the host fdb list
  net: dsa: sync static FDB entries on foreign interfaces to hardware
  net: dsa: install the host MDB and FDB entries in the master's RX filter
  net: dsa: reference count the FDB addresses at the cross-chip notifier level
  net: dsa: introduce a separate cross-chip notifier type for host FDBs
  net: dsa: reference count the MDB entries at the cross-chip notifier level
  ...
2021-06-30 15:51:09 -07:00
Jakub Kicinski b6df00789e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Trivial conflict in net/netfilter/nf_tables_api.c.

Duplicate fix in tools/testing/selftests/net/devlink_port_split.py
- take the net-next version.

skmsg, and L4 bpf - keep the bpf code but remove the flags
and err params.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-29 15:45:27 -07:00
Tanner Love 127d7355ab net: update netdev_rx_csum_fault() print dump only once
Printing this stack dump multiple times does not provide additional
useful information, and consumes time in the data path. Printing once
is sufficient.
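
The print-once idea can be sketched in userspace as follows. This is a minimal Python model, not the kernel change itself: the `PrintOnce` helper, the `log` list standing in for the kernel log, and the message text are all hypothetical names chosen for illustration.

```python
class PrintOnce:
    """Emit a diagnostic on the first call only; later calls do nothing."""

    def __init__(self, sink):
        self._sink = sink    # list standing in for the kernel log
        self._done = False

    def __call__(self, message):
        if self._done:
            return False     # already reported: skip the costly dump
        self._done = True
        self._sink.append(message)
        return True


log = []
report_csum_fault = PrintOnce(log)
for _ in range(3):           # three faulty packets arrive
    report_csum_fault("hw csum failure")
```

Only the first of the three calls appends to `log`; the other two return early, which is the data-path saving the message describes.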

Changes
  v2: Format indentation properly

Signed-off-by: Tanner Love <tannerlove@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-28 15:54:57 -07:00
Yunsheng Lin c4fef01ba4 net: sched: implement TCQ_F_CAN_BYPASS for lockless qdisc
Currently pfifo_fast has both the TCQ_F_CAN_BYPASS and TCQ_F_NOLOCK
flags set, but queue discipline bypass does not work for a lockless
qdisc because the skb is always enqueued to the qdisc even when the
qdisc is empty, see __dev_xmit_skb().

This patch calls sch_direct_xmit() to transmit the skb directly
to the driver for an empty lockless qdisc, which avoids the enqueuing
and dequeuing operations.

qdisc->empty is not a reliable indication of an empty qdisc because
there is a time window between enqueuing and setting qdisc->empty.
So we use the MISSED state added in commit a90c57f2ce ("net:
sched: fix packet stuck problem for lockless qdisc"), which
indicates that there is lock contention, suggesting that it is better
not to do the qdisc bypass in order to avoid a packet reordering
problem.

For the MISSED state to reliably indicate whether the qdisc is empty,
we need to ensure that testing and clearing of the MISSED state is
done within the protection of qdisc->seqlock; only setting the MISSED
state can be done without the protection of qdisc->seqlock. A MISSED
state test is also added without the protection of qdisc->seqlock to
avoid doing an unnecessary spin_trylock() in the contention case.

As the enqueuing is not within the protection of qdisc->seqlock,
there is still a potential data race as mentioned by Jakub [1]:

      thread1               thread2             thread3
qdisc_run_begin() # true
                        qdisc_run_begin(q)
                             set(MISSED)
pfifo_fast_dequeue
  clear(MISSED)
  # recheck the queue
qdisc_run_end()
                            enqueue skb1
                                             qdisc empty # true
                                          qdisc_run_begin() # true
                                          sch_direct_xmit() # skb2
                         qdisc_run_begin()
                            set(MISSED)

When the above happens, skb1 enqueued by thread2 is transmitted after
skb2 is transmitted by thread3, because setting the MISSED state and
enqueuing are not done under qdisc->seqlock. If qdisc bypass is
disabled, skb1 has a better chance of being transmitted before
skb2.

This patch does not take care of the above data race, because we
view it as similar to the following:
Even if CPU1 and CPU2 write an skb at the same time to two sockets
that both lead to the same qdisc, there is no guarantee as to
which skb will hit the qdisc first, because many factors such as
interrupts, softirqs, cache misses and scheduling affect
that.

The following cases need special handling:
1. When the MISSED state is cleared before another round of dequeuing
   in pfifo_fast_dequeue(), __qdisc_run() might not be able to
   dequeue all skbs in one round and might call __netif_schedule(),
   which might result in a non-empty qdisc without MISSED set. In order
   to avoid this, the MISSED state is set for a lockless qdisc and
   __netif_schedule() is called at the end of qdisc_run_end().

2. The MISSED state also needs to be set for a lockless qdisc instead
   of calling __netif_schedule() directly when requeuing an skb, for
   a similar reason.

3. For the netdev queue stopped case, the MISSED state needs clearing
   while the netdev queue is stopped, otherwise there may be
   unnecessary __netif_schedule() calls. So a new DRAINING state
   is added to indicate this case, which also indicates a non-empty
   qdisc.

4. There is already netif_xmit_frozen_or_stopped() checking in
   dequeue_skb() and sch_direct_xmit(), both of which are within the
   protection of qdisc->seqlock, but the same check in
   __dev_xmit_skb() is done without that protection, which might make
   the empty indication of a lockless qdisc unreliable. So
   remove the check in __dev_xmit_skb(); the checks under
   the protection of qdisc->seqlock seem enough to avoid the CPU
   consumption problem for the netdev queue stopped case.
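
The bypass decision described above can be sketched as a single-threaded Python model. It is an illustration under stated assumptions, not the kernel code: `LocklessQdisc`, `looks_empty()`, `drain()` and `xmit()` are hypothetical names, the `running` flag stands in for holding qdisc->seqlock, and the DRAINING flag is only modelled as "do not bypass".

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LocklessQdisc:
    queue: deque = field(default_factory=deque)
    missed: bool = False    # models MISSED: someone lost the race for the lock
    draining: bool = False  # models DRAINING: netdev queue stopped, skbs pending
    running: bool = False   # models holding qdisc->seqlock

    def looks_empty(self):
        return not (self.missed or self.draining)

    def run_begin(self):
        if self.running:
            self.missed = True   # ask the current owner to recheck the queue
            return False
        self.running = True
        return True

    def run_end(self):
        self.running = False


def drain(q, wire):
    q.missed = False             # cleared only while the lock is held
    while q.queue:
        wire.append(q.queue.popleft())


def xmit(q, skb, wire):
    """Return "bypass" if skb went straight to the driver, else "enqueued"."""
    if q.looks_empty() and q.run_begin():
        if q.looks_empty():      # retest under the lock
            wire.append(skb)     # direct transmit: no enqueue/dequeue
            outcome = "bypass"
        else:
            q.queue.append(skb)
            drain(q, wire)
            outcome = "enqueued"
        q.run_end()
        return outcome
    q.queue.append(skb)          # contention or pending skbs: keep ordering
    if q.run_begin():
        drain(q, wire)
        q.run_end()
    return "enqueued"
```

In this model an idle qdisc takes the bypass path, while a sender that loses the race sets MISSED and enqueues, so the next sender also enqueues and the queued skb leaves first.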

[1] https://lkml.org/lkml/2021/5/29/215

Acked-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # flexcan
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-23 12:17:35 -07:00
Sebastian Andrzej Siewior 2b4cd14fd9 net/netif_receive_skb_core: Use migrate_disable()
The preempt disable around do_xdp_generic() has been introduced in
commit
   bbbe211c29 ("net: rcu lock and preempt disable missing around generic xdp")

For BPF it is enough to use migrate_disable() and the code was updated
as it can be seen in commit
   3c58482a38 ("bpf: Provide bpf_prog_run_pin_on_cpu() helper")

This is a leftover which was not converted.

Use migrate_disable() before invoking do_xdp_generic().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-21 12:08:02 -07:00
Peter Zijlstra 2f064a59a1 sched: Change task_struct::state
Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org
2021-06-18 11:43:09 +02:00
Jakub Kicinski 5ada57a9a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
cdc-wdm: s/kill_urbs/poison_urbs/ to fix build

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-27 09:55:10 -07:00
Yunsheng Lin dcad9ee9e0 net: sched: fix tx action reschedule issue with stopped queue
The netdev queue might be stopped when the byte queue limit has been
reached or the tx hw ring is full. net_tx_action() may still be
rescheduled if STATE_MISSED is set, which consumes unnecessary
cpu without dequeuing and transmitting any skb because the
netdev queue is stopped, see qdisc_run_end().

This patch fixes it by checking the netdev queue state before
calling qdisc_run() and clearing STATE_MISSED if the netdev queue is
stopped during qdisc_run(); net_tx_action() is rescheduled
again when the netdev queue is restarted, see netif_tx_wake_queue().

As there is a time window between the netif_xmit_frozen_or_stopped()
check and the STATE_MISSED clearing, during which STATE_MISSED
may be set by net_tx_action() scheduled by netif_tx_wake_queue(),
set STATE_MISSED again if the netdev queue is restarted.
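
A rough Python model of one tx-action pass with and without the stopped-queue check is shown below. It is a simplification made for illustration: `TxQueue`, `tx_action()` and `wake_queue()` are hypothetical names, the check before qdisc_run() and the clearing of STATE_MISSED are folded into one step, and the race window mentioned in the last paragraph is not modelled.

```python
from dataclasses import dataclass


@dataclass
class TxQueue:
    stopped: bool = False    # netdev queue stopped (BQL limit or full hw ring)
    missed: bool = False     # models STATE_MISSED
    scheduled: bool = False  # models a pending net_tx_action()


def tx_action(q, check_stopped):
    """Model one net_tx_action() pass; return True if it rescheduled itself."""
    q.scheduled = False
    if check_stopped and q.stopped:
        q.missed = False     # nothing can be sent now; the wake-up reschedules
        return False
    # qdisc_run() would go here; with a stopped queue it dequeues nothing.
    if q.missed:             # models the MISSED test in qdisc_run_end()
        q.scheduled = True
    return q.scheduled


def wake_queue(q):
    """Model netif_tx_wake_queue(): restart the queue and schedule tx action."""
    q.stopped = False
    q.scheduled = True
```

Without the check, a pass over a stopped queue with MISSED set reschedules itself although nothing was sent; with the check, the pass returns early and the next run is triggered by the wake-up.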

Fixes: 6b3ba9146f ("net: sched: allow qdiscs to handle locking")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14 15:05:46 -07:00
Yunsheng Lin 102b55ee92 net: sched: fix tx action rescheduling issue during deactivation
Currently qdisc_run() checks the STATE_DEACTIVATED of a lockless
qdisc before calling __qdisc_run(), which ultimately clears
STATE_MISSED when all the skbs are dequeued. If STATE_DEACTIVATED
is set before STATE_MISSED is cleared, net_tx_action() may be
rescheduled at the end of qdisc_run_end(), see below:

CPU0(net_tx_action)  CPU1(__dev_xmit_skb)  CPU2(dev_deactivate)
          .                   .                     .
          .            set STATE_MISSED             .
          .           __netif_schedule()            .
          .                   .           set STATE_DEACTIVATED
          .                   .                qdisc_reset()
          .                   .                     .
          .<---------------   .              synchronize_net()
clear __QDISC_STATE_SCHED  |  .                     .
          .                |  .                     .
          .                |  .            some_qdisc_is_busy()
          .                |  .               return *false*
          .                |  .                     .
  test STATE_DEACTIVATED   |  .                     .
__qdisc_run() *not* called |  .                     .
          .                |  .                     .
   test STATE_MISSED       |  .                     .
 __netif_schedule()--------|  .                     .
          .                   .                     .
          .                   .                     .

__qdisc_run() is not called by net_tx_action() on CPU0 because
CPU2 has set the STATE_DEACTIVATED flag during dev_deactivate(). As
STATE_MISSED is only cleared in __qdisc_run(), __netif_schedule()
is called at the end of qdisc_run_end(), causing the tx action
rescheduling problem.

qdisc_run() called by net_tx_action() runs in softirq context,
which should have the same semantics as the qdisc_run() called by
__dev_xmit_skb() protected by rcu_read_lock_bh(). As there is a
synchronize_net() between the STATE_DEACTIVATED flag being set and
qdisc_reset()/some_qdisc_is_busy() in dev_deactivate(), we can safely
bail out for the deactivated lockless qdisc in net_tx_action(), and
qdisc_reset() will reset all skbs not dequeued yet.

So add the rcu_read_lock() explicitly to protect the qdisc_run()
and do the STATE_DEACTIVATED checking in net_tx_action() before
calling qdisc_run_begin(). Another option is to do the checking in
the qdisc_run_end(), but it will add unnecessary overhead for
non-tx_action case, because __dev_queue_xmit() will not see qdisc
with STATE_DEACTIVATED after synchronize_net(), the qdisc with
STATE_DEACTIVATED can only be seen by net_tx_action() because of
__netif_schedule().

The STATE_DEACTIVATED checking in qdisc_run() is to avoid a race
between net_tx_action() and qdisc_reset(), see:
commit d518d2ed86 ("net/sched: fix race between deactivation
and dequeue for NOLOCK qdisc"). As the bailout added above for the
deactivated lockless qdisc in net_tx_action() provides better
protection against the race without calling qdisc_run() at all,
remove the STATE_DEACTIVATED checking in qdisc_run().

After qdisc_reset(), there is no skb in qdisc to be dequeued, so
clear the STATE_MISSED in dev_reset_queue() too.

Fixes: 6b3ba9146f ("net: sched: allow qdiscs to handle locking")
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
V8: Clearing STATE_MISSED before calling __netif_schedule() has
    avoided the endless rescheduling problem, but there may still
    be an unnecessary rescheduling, so adjust the commit log.
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-14 15:05:46 -07:00
Sebastian Andrzej Siewior 8380c81d5c net: Treat __napi_schedule_irqoff() as __napi_schedule() on PREEMPT_RT
__napi_schedule_irqoff() is an optimized version of __napi_schedule()
which can be used where it is known that interrupts are disabled,
e.g. in interrupt-handlers, spin_lock_irq() sections or hrtimer
callbacks.

On PREEMPT_RT enabled kernels this assumption is not true. Force-threaded
interrupt handlers and spinlocks do not disable interrupts
and the NAPI hrtimer callback is forced into softirq context which runs
with interrupts enabled as well.

Chasing all usage sites of __napi_schedule_irqoff() is a whack-a-mole
game so make __napi_schedule_irqoff() invoke __napi_schedule() for
PREEMPT_RT kernels.

The callers of ____napi_schedule() in the networking core have been
audited and are correct on PREEMPT_RT kernels as well.

Reported-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-13 13:11:19 -07:00
David S. Miller 6876a18d33 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
2021-04-26 12:00:00 -07:00
David S. Miller 5f6c2f536d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-04-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 69 non-merge commits during the last 22 day(s) which contain
a total of 69 files changed, 3141 insertions(+), 866 deletions(-).

The main changes are:

1) Add BPF static linker support for extern resolution of global, from Andrii.

2) Refine retval for bpf_get_task_stack helper, from Dave.

3) Add a bpf_snprintf helper, from Florent.

4) A bunch of miscellaneous improvements from many developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-25 18:02:32 -07:00
Martin Willi 22b6034323 net, xdp: Update pkt_type if generic XDP changes unicast MAC
If a generic XDP program changes the destination MAC address from/to
multicast/broadcast, the skb->pkt_type is updated to properly handle
the packet when passed up the stack. When changing the MAC from/to
the NIC's MAC, PACKET_HOST/OTHERHOST is not updated, though, making
the behavior different from that of native XDP.

Remember the PACKET_HOST/OTHERHOST state before calling the program
in generic XDP, and update pkt_type accordingly if the destination
MAC address has changed. As eth_type_trans() assumes a default
pkt_type of PACKET_HOST, restore that before calling it.
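
The rule above can be sketched in Python as follows. This is a userspace illustration, not the kernel implementation: `classify()` is a simplified stand-in for the pkt_type part of eth_type_trans(), `pkt_type_after_xdp()` is a hypothetical helper, and the MAC addresses in the usage note are made up. The PACKET_* values follow the Linux UAPI numbering.

```python
PACKET_HOST, PACKET_BROADCAST, PACKET_MULTICAST, PACKET_OTHERHOST = 0, 1, 2, 3
BROADCAST_MAC = bytes([0xFF] * 6)


def is_multicast(mac):
    return bool(mac[0] & 0x01)   # group bit of the first octet


def classify(dst, dev_mac):
    """Simplified stand-in for the pkt_type decision made by eth_type_trans()."""
    if is_multicast(dst):
        return PACKET_BROADCAST if dst == BROADCAST_MAC else PACKET_MULTICAST
    return PACKET_HOST if dst == dev_mac else PACKET_OTHERHOST


def pkt_type_after_xdp(pkt_type, dst_before, dst_after, dev_mac):
    """Recompute pkt_type only if the program changed what the dst MAC means."""
    was_host = dst_before == dev_mac
    was_mcast = is_multicast(dst_before)
    if was_host == (dst_after == dev_mac) and was_mcast == is_multicast(dst_after):
        return pkt_type          # host/multicast status unchanged: keep as is
    return classify(dst_after, dev_mac)
```

For example, with `nic = bytes.fromhex("021122334455")` and `peer = bytes.fromhex("02aabbccddee")`, a frame that arrived for `peer` and was rewritten to `nic` goes from PACKET_OTHERHOST to PACKET_HOST, which is the cluster use case the message mentions.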

The use case for this is when an XDP program wants to push received
packets up the stack by rewriting the MAC to the NIC's MAC, for
example by cluster nodes sharing MAC addresses.

Fixes: 2972495699 ("net: fix generic XDP to handle if eth header was mangled")
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210419141559.8611-1-martin@strongswan.org
2021-04-22 23:18:02 +02:00
Alexander Lobakin 7ad18ff644 gro: fix napi_gro_frags() Fast GRO breakage due to IP alignment check
Commit 38ec4944b5 ("gro: ensure frag0 meets IP header alignment")
did the right thing, but missed the fact that the napi_gro_frags() logic
calls skb_gro_reset_offset() *before* pulling the Ethernet header
into the skb linear space.
As a result, the introduced check for the frag0 address being aligned
to 4 always fails there, because the Ethernet header is 14 bytes long
and, with NET_IP_ALIGN, its start is not aligned to 4.

Fix this by adding @nhoff argument to skb_gro_reset_offset() which
tells if an IP header is placed right at the start of frag0 or not.
This restores Fast GRO for napi_gro_frags() that became very slow
after the mentioned commit, and preserves the introduced check to
avoid silent unaligned accesses.
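
The alignment arithmetic behind the fix can be illustrated with a small Python sketch. It assumes a NET_IP_ALIGN of 2 and a frag0 that starts 2 bytes into a 4-byte aligned buffer; `frag0_fast_path_ok()` is a hypothetical helper, not the kernel function, and the mapping of callers to nhoff values (0 when frag0 starts at the IP header, 14 when it starts at the Ethernet header) is this sketch's reading of the message.

```python
ETH_HLEN = 14  # length of an Ethernet header in bytes


def frag0_fast_path_ok(frag0_offset, nhoff, net_ip_align):
    """True if the network header inside frag0 would be 4-byte aligned."""
    if net_ip_align == 0:
        return True          # arch tolerates unaligned access: check is a no-op
    return (frag0_offset + nhoff) % 4 == 0
```

With `frag0_offset=2`, checking the start of frag0 itself (`nhoff=0`) fails because 2 is not a multiple of 4, while accounting for the Ethernet header (`nhoff=ETH_HLEN`) gives 2 + 14 = 16 and the fast path is allowed again.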

From v1 [0]:
 - inline tiny skb_gro_reset_offset() to let the code be optimized
   more effectively (esp. for the !NET_IP_ALIGN case) (Eric);
 - pull in Reviewed-by from Eric.

[0] https://lore.kernel.org/netdev/20210418114200.5839-1-alobakin@pm.me

Fixes: 38ec4944b5 ("gro: ensure frag0 meets IP header alignment")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-19 16:03:32 -07:00
Jakub Kicinski 8203c7ce4e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
 - keep the ZC code, drop the code related to reinit
net/bridge/netfilter/ebtables.c
 - fix build after move to net_generic

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-17 11:08:07 -07:00
Eric Dumazet 38ec4944b5 gro: ensure frag0 meets IP header alignment
After commit 0f6925b3e8 ("virtio_net: Do not pull payload in skb->head")
Guenter Roeck reported one failure in his tests using sh architecture.

After much debugging, we have been able to spot silent unaligned accesses
in inet_gro_receive()

The issue at hand is that upper networking stacks assume their header
is word-aligned. Low level drivers are supposed to reserve NET_IP_ALIGN
bytes before the Ethernet header to make that happen.

This patch hardens skb_gro_reset_offset() to not allow frag0 fast-path
if the fragment is not properly aligned.

Some arches like x86, arm64 and powerpc do not care and define NET_IP_ALIGN
as 0, so this extra check will be a NOP for them.

Note that if frag0 is not used, GRO will call pskb_may_pull()
as many times as needed to pull network and transport headers.

Fixes: 0f6925b3e8 ("virtio_net: Do not pull payload in skb->head")
Fixes: 78a478d0ef ("gro: Inline skb_gro_header and cache frag0 virtual address")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-13 15:09:31 -07:00
Jakub Kicinski 8859a44ea0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

MAINTAINERS
 - keep Chandrasekar
drivers/net/ethernet/mellanox/mlx5/core/en_main.c
 - simple fix + trust the code re-added to param.c in -next is fine
include/linux/bpf.h
 - trivial
include/linux/ethtool.h
 - trivial, fix kdoc while at it
include/linux/skmsg.h
 - move to relevant place in tcp.c, comment re-wrapped
net/core/skmsg.c
 - add the sk = sk // sk = NULL around calls
net/tipc/crypto.c
 - trivial

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 20:48:35 -07:00
Paolo Abeni 27f0ad7169 net: fix hangup on napi_disable for threaded napi
napi_disable() is subject to a hangup when the threaded
mode is enabled and the napi is under heavy traffic.

If the relevant napi has been scheduled and the napi_disable()
kicks in before the next napi_thread_wait() completes, so
that the latter quits due to the napi_disable_pending() condition,
the existing code leaves the NAPI_STATE_SCHED bit set and the
napi_disable() loop waiting for such bit will hang.

This patch addresses the issue by dropping the NAPI_STATE_DISABLE
bit test in napi_thread_wait(). The later napi_threaded_poll()
iteration will take care of clearing the NAPI_STATE_SCHED.

This also addresses a related problem reported by Jakub:
before this patch a napi_disable()/napi_enable() pair killed
the napi thread, effectively disabling the threaded mode.
On the patched kernel napi_disable() simply stops scheduling
the relevant thread.

v1 -> v2:
  - let the main napi_threaded_poll() loop clear the SCHED bit

Reported-by: Jakub Kicinski <kuba@kernel.org>
Fixes: 29863d41bb ("net: implement threaded-able napi poll loop support")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/883923fa22745a9589e8610962b7dc59df09fb1f.1617981844.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-09 12:50:31 -07:00
Andrei Vagin 0854fa82c9 net: remove the new_ifindex argument from dev_change_net_namespace
There is only one place where we want to specify new_ifindex. In all
other cases, callers pass 0 as new_ifindex. It looks reasonable to add a
low-level function with new_ifindex and to convert
dev_change_net_namespace to a static inline wrapper.
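
The wrapper pattern can be sketched in Python as below. The function names `move_with_ifindex()` and `move()` are hypothetical, and two behaviours are assumptions of this sketch rather than statements about the kernel: an explicitly requested index that is already taken is rejected, and a conflicting index is replaced by the lowest free one.

```python
from itertools import count


def move_with_ifindex(ifindex, target_in_use, new_ifindex):
    """Low-level variant: pick the ifindex a device gets in the target namespace."""
    if new_ifindex:
        if new_ifindex in target_in_use:
            raise ValueError("requested ifindex already in use")  # assumed policy
        chosen = new_ifindex
    elif ifindex in target_in_use:
        # Conflict: fall back to the lowest free index (assumed allocation rule).
        chosen = next(i for i in count(1) if i not in target_in_use)
    else:
        chosen = ifindex
    target_in_use.add(chosen)
    return chosen


def move(ifindex, target_in_use):
    """Thin wrapper for the common case: callers that would pass 0."""
    return move_with_ifindex(ifindex, target_in_use, 0)
```

Keeping the extra parameter in the low-level function means that the one caller that needs it can use it directly, while every other call site stays free of a constant 0 argument.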

Fixes: eeb85a14ee ("net: Allow to specify ifindex when device is moved to another namespace")
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-07 14:43:28 -07:00
Andrei Vagin eeb85a14ee net: Allow to specify ifindex when device is moved to another namespace
Currently, we can specify ifindex on link creation. This change allows
specifying ifindex when a device is moved to another network namespace.

Even now, a device ifindex can be changed if there is another device
with the same ifindex in the target namespace. So this change doesn't
introduce completely new behavior, it adds more control to the process.

CRIU users want to restore containers with pre-created network devices.
A user will provide network devices and instructions where they have to
be restored, then CRIU will restore network namespaces and move devices
into them. The problem is that devices have to be restored with the same
indexes that they had before checkpoint/restore (C/R).

Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-05 14:49:40 -07:00
Dmitry Vyukov 6c996e1994 net: change netdev_unregister_timeout_secs min value to 1
netdev_unregister_timeout_secs=0 can lead to printing the
"waiting for dev to become free" message every jiffy.
This is too frequent and unnecessary.
Set the min value to 1 second.

Also fix the merge issue introduced by
"net: make unregister netdev warning timeout configurable":
it changed "refcnt != 1" to "refcnt".

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Fixes: 5aa3afe107 ("net: make unregister netdev warning timeout configurable")
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 17:24:06 -07:00
David S. Miller efd13b71a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-25 15:31:22 -07:00
Pablo Neira Ayuso ddb94eafab net: resolve forwarding path from virtual netdevice and HW destination address
This patch adds dev_fill_forward_path() which resolves the path to reach
the real netdevice from the IP forwarding side. This function takes as
input the netdevice and the destination hardware address and it walks
down the devices calling .ndo_fill_forward_path() for each device until
the real device is found.

For instance, assuming the following topology:

               IP forwarding
              /             \
           br0              eth0
           / \
       eth1  eth2
        .
        .
        .
       ethX
 ab:cd:ef:ab:cd:ef

where eth1 and eth2 are bridge ports and eth0 provides WAN connectivity.
ethX is the interface in another box which is connected to the eth1
bridge port.

For packets going through IP forwarding to br0 whose destination MAC
address is ab:cd:ef:ab:cd:ef, dev_fill_forward_path() provides the
following path:

	br0 -> eth1

.ndo_fill_forward_path for br0 looks up at the FDB for the bridge port
from the destination MAC address to get the bridge port eth1.
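
The walk can be modelled in userspace roughly as follows. This is only a sketch: struct toy_dev, struct toy_fdb_entry, toy_bridge_fill_path() and toy_fill_forward_path() are hypothetical stand-ins, and the real callback signature is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6
#define MAX_PATH 8

struct toy_dev;

/* Hypothetical callback: returns the next lower device towards the
 * destination MAC, or NULL if it cannot be resolved. */
typedef const struct toy_dev *(*fill_path_fn)(const struct toy_dev *dev,
                                              const unsigned char *daddr);

struct toy_fdb_entry {
    unsigned char addr[ETH_ALEN];
    const struct toy_dev *port;
};

struct toy_dev {
    const char *name;
    fill_path_fn fill_forward_path;   /* NULL for a real device */
    const struct toy_fdb_entry *fdb;  /* only used by bridges */
    size_t fdb_len;
};

/* Bridge callback: look up the destination MAC in the FDB. */
static const struct toy_dev *toy_bridge_fill_path(const struct toy_dev *dev,
                                                  const unsigned char *daddr)
{
    for (size_t i = 0; i < dev->fdb_len; i++)
        if (memcmp(dev->fdb[i].addr, daddr, ETH_ALEN) == 0)
            return dev->fdb[i].port;
    return NULL;
}

/* Walk down from 'dev', recording every hop in 'path'. Returns the
 * number of hops, or -1 if a hop is unresolved or the path is too deep. */
static int toy_fill_forward_path(const struct toy_dev *dev,
                                 const unsigned char *daddr,
                                 const struct toy_dev **path)
{
    int n = 0;

    assert(dev && daddr && path);
    while (n < MAX_PATH) {
        path[n++] = dev;
        if (!dev->fill_forward_path)
            return n;                 /* reached a real device */
        dev = dev->fill_forward_path(dev, daddr);
        if (!dev)
            return -1;                /* no lower device found */
    }
    return -1;
}
```

For a bridge br0 whose FDB maps the destination MAC to eth1, the recorded path is br0 followed by eth1, matching the example above.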

This information allows to create a fast path that bypasses the classic
bridge and IP forwarding paths, so packets go directly from the bridge
port eth1 to eth0 (wan interface) and vice versa.

             fast path
      .------------------------.
     /                          \
    |           IP forwarding   |
    |          /             \  \/
    |       br0               eth0
    .       / \
     -> eth1  eth2
        .
        .
        .
       ethX
 ab:cd:ef:ab:cd:ef

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-24 12:48:38 -07:00
Dmitry Vyukov 5aa3afe107 net: make unregister netdev warning timeout configurable
netdev_wait_allrefs() issues a warning if refcount does not drop to 0
after 10 seconds. While 10 second wait generally should not happen
under normal workload in normal environment, it seems to fire falsely
very often during fuzzing and/or in qemu emulation (~10x slower).
At least it's not possible to understand if it's really a false
positive or not. Automated testing generally bumps all timeouts
to very high values to avoid flake failures.
Add net.core.netdev_unregister_timeout_secs sysctl to make
the timeout configurable for automated testing systems.
Lowering the timeout may also be useful for e.g. manual bisection.
The default value matches the current behavior.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=211877
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-23 17:22:50 -07:00
Eric Dumazet add2d73631 net: set initial device refcount to 1
When adding CONFIG_PCPU_DEV_REFCNT, I forgot that the
initial net device refcount was 0.

When CONFIG_PCPU_DEV_REFCNT is not set, this means
the first dev_hold() triggers an illegal refcount
operation (addition on 0)

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 0 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0x128/0x1a4

Fix is to change initial (and final) refcount to be 1.
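
To illustrate why starting at 0 is a problem, here is a simplified userspace model. struct toy_refcount and its helpers are hypothetical; the real refcount_t saturates the counter on this error, which the sketch does not reproduce.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for refcount_t: an increment from 0 is treated as a
 * use-after-free symptom and flagged instead of being performed. */
struct toy_refcount {
    int refs;
    bool warned;
};

static void toy_refcount_set(struct toy_refcount *r, int n)
{
    r->refs = n;
    r->warned = false;
}

static void toy_refcount_inc(struct toy_refcount *r)
{
    if (r->refs == 0) {
        r->warned = true;   /* "addition on 0; use-after-free" */
        return;
    }
    r->refs++;
}

/* Returns true when the caller dropped the last reference. */
static bool toy_refcount_dec_and_test(struct toy_refcount *r)
{
    assert(r->refs > 0);
    return --r->refs == 0;
}
```

A counter initialised to 0 trips the warning on the very first hold, while one initialised to 1 does not; the price is that "no users left" now corresponds to the value 1 rather than 0.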

Also add a missing kerneldoc piece, as reported by
Stephen Rothwell.

Fixes: 919067cc84 ("net: add CONFIG_PCPU_DEV_REFCNT")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Guenter Roeck <groeck@google.com>
Tested-by: Guenter Roeck <groeck@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 16:57:36 -07:00
Vladimir Oltean 5da9ace340 net: make xps_needed and xps_rxqs_needed static
Since their introduction in commit 04157469b7 ("net: Use static_key
for XPS maps"), xps_needed and xps_rxqs_needed were never used outside
net/core/dev.c, so I don't really understand why they were exported as
symbols in the first place.

This is needed in order to silence a "make W=1" warning about these
static keys not being declared as static variables, but not having a
previous declaration in a header file nonetheless.

Cc: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-22 13:13:55 -07:00
Eric Dumazet 919067cc84 net: add CONFIG_PCPU_DEV_REFCNT
I was working on a syzbot issue, claiming one device could not be
dismantled because its refcount was -1

unregister_netdevice: waiting for sit0 to become free. Usage count = -1

It would be nice if syzbot could trigger a warning at the time
this reference count became negative.

This patch adds the CONFIG_PCPU_DEV_REFCNT option, which defaults
to per cpu variables (as before this patch) on SMP builds.

v2: free_dev label in alloc_netdev_mqs() is moved to avoid
    a compiler warning (-Wunused-label), as reported
    by kernel test robot <lkp@intel.com>

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-19 13:38:46 -07:00
Antoine Tenart 75b2758abc net: NULL the old xps map entries when freeing them
In __netif_set_xps_queue, old map entries from the old dev_maps are
freed but their corresponding entries in the old dev_maps aren't NULLed.
Fix this.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart 2d05bf0153 net: fix use after free in xps
When setting up a new dev_maps in __netif_set_xps_queue, we remove and
free maps from unused CPUs/rx-queues near the end of the function, by
calling remove_xps_queue. However it's possible those maps are also part
of the old not-freed-yet dev_maps, which might be used concurrently.
When that happens, a map can be freed while its corresponding entry in
the old dev_maps table isn't NULLed, leading to: "BUG: KASAN:
use-after-free" in different places.

This fixes the map freeing logic for unused CPUs/rx-queues, to also NULL
the map entries from the old dev_maps table.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart 132f743b01 net: improve queue removal readability in __netif_set_xps_queue
Improve the readability of the loop removing tx-queue from unused
CPUs/rx-queues in __netif_set_xps_queue. The change should only be
cosmetic.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart 402fbb992e net: add an helper to copy xps maps to the new dev_maps
This patch adds a helper, xps_copy_dev_maps, to copy maps from dev_maps
to new_dev_maps at a given index. The logic should be the same, with an
improved code readability and maintenance.

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart 044ab86d43 net: move the xps maps to an array
Move the xps maps (xps_cpus_map and xps_rxqs_map) to an array in
net_device. That will simplify the code a lot, removing the need for lots
of if/else conditionals as the correct map will be available using its
offset in the array.

This should not modify the xps maps behaviour in any way.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart 6f36158e05 net: remove the xps possible_mask
Remove the xps possible_mask. It was an optimization but we can just
loop from 0 to nr_ids now that it is embedded in the xps dev_maps. That
simplifies the code a bit.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart 5478fcd0f4 net: embed nr_ids in the xps maps
Embed nr_ids (the number of cpu for the xps cpus map, and the number of
rxqs for the xps rxqs map) in dev_maps. That will help not accessing out
of bound memory if those values change after dev_maps was allocated.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Antoine Tenart 255c04a87f net: embed num_tc in the xps maps
The xps cpus/rxqs map is accessed using dev->num_tc, which is used when
allocating the map. But later updates of dev->num_tc can lead to having
a mismatch between the maps and how they're accessed. In such cases the
map values do not make any sense and out of bound accesses can occur
(that can be easily seen using KASAN).

This patch aims at fixing this by embedding num_tc into the maps, using
the value at the time the map is created. This brings two improvements:
- The maps can be accessed using the embedded num_tc, so we know for
  sure we won't have out of bound accesses.
- Checks can be made before accessing the maps so we know the values
  retrieved will make sense.

We also update __netif_set_xps_queue to conditionally copy old maps from
dev_maps in the new one only if the number of traffic classes from both
maps match.
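
The idea can be sketched in userspace as a map that carries its own dimensions. struct toy_xps_maps, toy_xps_init() and toy_xps_lookup() are hypothetical names and the flattened [id][tc] layout is an assumption made for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ENTRIES 64

/* Toy map that remembers the dimensions it was created with. */
struct toy_xps_maps {
    unsigned int nr_ids;          /* CPUs or rx-queues at creation time */
    int num_tc;                   /* traffic classes at creation time */
    int queue[TOY_MAX_ENTRIES];   /* flattened [id][tc] table */
};

/* Returns 0 on success, -1 if the dimensions do not fit the table. */
static int toy_xps_init(struct toy_xps_maps *m, unsigned int nr_ids, int num_tc)
{
    if (num_tc <= 0 || nr_ids == 0 ||
        nr_ids > TOY_MAX_ENTRIES / (unsigned int)num_tc)
        return -1;
    m->nr_ids = nr_ids;
    m->num_tc = num_tc;
    for (unsigned int i = 0; i < nr_ids * (unsigned int)num_tc; i++)
        m->queue[i] = -1;         /* no queue assigned yet */
    return 0;
}

/* Index with the embedded dimensions, never with the device's current
 * values, so a later change of num_tc cannot cause a stray access. */
static int toy_xps_lookup(const struct toy_xps_maps *m,
                          unsigned int id, int tc, int *out)
{
    assert(out != NULL);
    if (id >= m->nr_ids || tc < 0 || tc >= m->num_tc)
        return -1;
    *out = m->queue[id * (unsigned int)m->num_tc + (unsigned int)tc];
    return 0;
}
```

A lookup with a traffic class that is valid for the device today but not for the map as created is rejected instead of reading past the table.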

Signed-off-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:56:22 -07:00
Jiri Bohac 6c015a2256 net: check all name nodes in __dev_alloc_name
__dev_alloc_name(), when supplied with a name containing '%d',
will search for the first available device number to generate a
unique device name.

Since commit ff92741270 ("net:
introduce name_node struct to be used in hashlist") network
devices may have alternate names.  __dev_alloc_name() does not take
these alternate names into account, possibly generating a name
that is already taken and failing with -ENFILE as a result.

This demonstrates the bug:

    # rmmod dummy 2>/dev/null
    # ip link property add dev lo altname dummy0
    # modprobe dummy numdummies=1
    modprobe: ERROR: could not insert 'dummy': Too many open files in system

Instead of creating a device named dummy1, modprobe fails.

Fix this by checking all the names in the d->name_node list, not just d->name.
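
A rough userspace sketch of the corrected search, assuming the set of taken names is passed in as a flat array that already contains both primary and alternate names. toy_name_taken() and toy_alloc_name() are hypothetical, and the simple linear probing here is not how the kernel scans for a free number:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define TOY_NAMSIZ 16

/* True if 'name' is in use, as a primary or an alternate name. */
static bool toy_name_taken(const char *const *names, size_t n,
                           const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(names[i], name) == 0)
            return true;
    return false;
}

/* Find the lowest unit number so that "<prefix><unit>" is unused.
 * Writes the name to 'out' and returns the unit, or -1 if none is free. */
static int toy_alloc_name(const char *const *names, size_t n,
                          const char *prefix, char *out, int max_units)
{
    assert(out != NULL);
    for (int unit = 0; unit < max_units; unit++) {
        snprintf(out, TOY_NAMSIZ, "%s%d", prefix, unit);
        if (!toy_name_taken(names, n, out))
            return unit;
    }
    return -1;
}
```

With "dummy0" present as an alternate name of lo, the search skips unit 0 and settles on dummy1, which is the outcome the reproducer above expects.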

Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: ff92741270 ("net: introduce name_node struct to be used in hashlist")
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-18 14:40:53 -07:00
Wei Wang cb03835793 net: fix race between napi kthread mode and busy poll
Currently, napi_thread_wait() checks for NAPI_STATE_SCHED bit to
determine if the kthread owns this napi and could call napi->poll() on
it. However, if socket busy poll is enabled, it is possible that the
busy poll thread grabs this SCHED bit (after the previous napi->poll()
invokes napi_complete_done() and clears SCHED bit) and tries to poll
on the same napi. napi_disable() could grab the SCHED bit as well.
This patch tries to fix this race by adding a new bit
NAPI_STATE_SCHED_THREADED in napi->state. This bit gets set in
____napi_schedule() if the threaded mode is enabled, and gets cleared
in napi_complete_done(), and we only poll the napi in kthread if this
bit is set. This helps distinguish the ownership of the napi between
kthread and other scenarios and fixes the race issue.
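
The ownership rule can be modelled with two state bits, as in this userspace sketch using C11 atomics. All toy_* names are hypothetical and the real state machine has more bits and more paths than shown here:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum {
    TOY_STATE_SCHED          = 1u << 0,  /* somebody owns the poll */
    TOY_STATE_SCHED_THREADED = 1u << 1,  /* the owner is the kthread */
};

struct toy_napi {
    atomic_uint state;
    bool threaded;
};

/* Try to take ownership; only one caller can win. */
static bool toy_schedule_prep(struct toy_napi *n)
{
    unsigned int old = atomic_fetch_or(&n->state, TOY_STATE_SCHED);

    return !(old & TOY_STATE_SCHED);
}

/* Scheduling path: mark kthread ownership only in threaded mode. */
static bool toy_napi_schedule(struct toy_napi *n)
{
    if (!toy_schedule_prep(n))
        return false;
    if (n->threaded)
        atomic_fetch_or(&n->state, TOY_STATE_SCHED_THREADED);
    return true;
}

/* Busy poll takes SCHED but never sets the THREADED bit. */
static bool toy_busy_poll_grab(struct toy_napi *n)
{
    return toy_schedule_prep(n);
}

/* Completion releases ownership and clears both bits. */
static void toy_napi_complete(struct toy_napi *n)
{
    assert(atomic_load(&n->state) & TOY_STATE_SCHED);
    atomic_fetch_and(&n->state,
                     ~(unsigned int)(TOY_STATE_SCHED | TOY_STATE_SCHED_THREADED));
}

/* The kthread polls only when the THREADED bit says it is the owner. */
static bool toy_kthread_may_poll(struct toy_napi *n)
{
    return (atomic_load(&n->state) & TOY_STATE_SCHED_THREADED) != 0;
}
```

After a completion, a busy-poll caller that grabs SCHED leaves the THREADED bit clear, so the kthread stays away even though SCHED is set.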

Fixes: 29863d41bb ("net: implement threaded-able napi poll loop support")
Reported-by: Martin Zaharinov <micron10@gmail.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wei Wang <weiwan@google.com>
Cc: Alexander Duyck <alexanderduyck@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-17 14:31:17 -07:00
Martin Willi 3a5ca85707 can: dev: Move device back to init netns on owning netns delete
When a non-initial netns is destroyed, the usual policy is to delete
all virtual network interfaces contained, but move physical interfaces
back to the initial netns. This keeps the physical interface visible
on the system.

CAN devices are somewhat special, as they define rtnl_link_ops even
if they are physical devices. If a CAN interface is moved into a
non-initial netns, destroying that netns lets the interface vanish
instead of moving it back to the initial netns. default_device_exit()
skips CAN interfaces due to having rtnl_link_ops set. Reproducer:

  ip netns add foo
  ip link set can0 netns foo
  ip netns delete foo

WARNING: CPU: 1 PID: 84 at net/core/dev.c:11030 ops_exit_list+0x38/0x60
CPU: 1 PID: 84 Comm: kworker/u4:2 Not tainted 5.10.19 #1
Workqueue: netns cleanup_net
[<c010e700>] (unwind_backtrace) from [<c010a1d8>] (show_stack+0x10/0x14)
[<c010a1d8>] (show_stack) from [<c086dc10>] (dump_stack+0x94/0xa8)
[<c086dc10>] (dump_stack) from [<c086b938>] (__warn+0xb8/0x114)
[<c086b938>] (__warn) from [<c086ba10>] (warn_slowpath_fmt+0x7c/0xac)
[<c086ba10>] (warn_slowpath_fmt) from [<c0629f20>] (ops_exit_list+0x38/0x60)
[<c0629f20>] (ops_exit_list) from [<c062a5c4>] (cleanup_net+0x230/0x380)
[<c062a5c4>] (cleanup_net) from [<c0142c20>] (process_one_work+0x1d8/0x438)
[<c0142c20>] (process_one_work) from [<c0142ee4>] (worker_thread+0x64/0x5a8)
[<c0142ee4>] (worker_thread) from [<c0148a98>] (kthread+0x148/0x14c)
[<c0148a98>] (kthread) from [<c0100148>] (ret_from_fork+0x14/0x2c)

To properly restore physical CAN devices to the initial netns on owning
netns exit, introduce a flag on rtnl_link_ops that can be set by drivers.
For CAN devices setting this flag, default_device_exit() considers them
non-virtual, applying the usual namespace move.
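
The resulting decision can be sketched as below. This is a userspace illustration only: the structs are toy stand-ins, and the flag name netns_refund should be treated as an assumption, since the text above does not name it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_link_ops {
    bool netns_refund;   /* assumed name of the driver-set flag */
};

struct toy_netdev {
    const struct toy_link_ops *link_ops;   /* NULL for most physical devices */
};

enum toy_exit_action {
    TOY_MOVE_TO_INIT_NET,     /* keep the device visible on the system */
    TOY_LEAVE_FOR_DELETION,   /* virtual device, goes away with the netns */
};

/* Devices with link ops are treated as virtual unless the driver
 * opted in to being moved back to the initial namespace. */
static enum toy_exit_action toy_default_device_exit(const struct toy_netdev *dev)
{
    assert(dev != NULL);
    if (dev->link_ops && !dev->link_ops->netns_refund)
        return TOY_LEAVE_FOR_DELETION;
    return TOY_MOVE_TO_INIT_NET;
}
```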

The issue was introduced in the commit mentioned below, as at that time
CAN devices did not have a dellink() operation.

Fixes: e008b5fc8d ("net: Simplfy default_device_exit and improve batching.")
Link: https://lore.kernel.org/r/20210302122423.872326-1-martin@strongswan.org
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-03-16 08:40:04 +01:00
Lorenzo Bianconi 8f64860f8b net: export dev_set_threaded symbol
For wireless devices (e.g. mt76 driver) multiple net_devices belong to
the same wireless phy and the napi object is registered in a dummy
netdevice related to the wireless phy.
Export dev_set_threaded in order to be reused in device drivers enabling
threaded NAPI.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-15 12:35:23 -07:00
Alexander Lobakin d0eed5c325 gro: give 'hash' variable in dev_gro_receive() a less confusing name
'hash' stores not the flow hash, but the index of the GRO bucket
corresponding to it.
Change its name to 'bucket' to avoid confusion while reading lines
like '__set_bit(hash, &napi->gro_bitmask)'.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 14:41:09 -07:00
Alexander Lobakin 9dc2c31337 gro: consistentify napi->gro_hash[x] access in dev_gro_receive()
GRO bucket index doesn't change through the entire function.
Store a pointer to the corresponding bucket instead of its member
and use it consistently through the function.
It is performance-safe since &gro_list->list == gro_list.

Misc: remove superfluous braces around single-line branches.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 14:41:08 -07:00
Alexander Lobakin 0ccf4d50d1 gro: simplify gro_list_prepare()
gro_list_prepare() always returns &napi->gro_hash[bucket].list,
without any variations. Moreover, it uses 'napi' argument only to
have access to this list, and calculates the bucket index for the
second time (firstly it happens at the beginning of
dev_gro_receive()) to do that.
Given that dev_gro_receive() already has an index to the needed
list, just pass it as the first argument to eliminate redundant
calculations, and make gro_list_prepare() return void.
Also, both arguments of gro_list_prepare() can be constified since
this function can only modify the skbs from the bucket list.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-14 14:41:08 -07:00
Gustavo A. R. Silva b1866bfff9 net: core: Fix fall-through warnings for Clang
In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
by explicitly adding a break statement instead of letting the code fall
through to the next case.

Link: https://github.com/KSPP/linux/issues/115
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-10 12:45:15 -08:00
David S. Miller b8af417e4d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-02-16

The following pull-request contains BPF updates for your *net-next* tree.

There's a small merge conflict between 7eeba1706e ("tcp: Add receive timestamp
support for receive zerocopy.") from net-next tree and 9cacf81f81 ("bpf: Remove
extra lock_sock for TCP_ZEROCOPY_RECEIVE") from bpf-next tree. Resolve as follows:

  [...]
                lock_sock(sk);
                err = tcp_zerocopy_receive(sk, &zc, &tss);
                err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
                                                          &zc, &len, err);
                release_sock(sk);
  [...]

We've added 116 non-merge commits during the last 27 day(s) which contain
a total of 156 files changed, 5662 insertions(+), 1489 deletions(-).

The main changes are:

1) Adds support of pointers to types with known size among global function
   args to overcome the limit on max # of allowed args, from Dmitrii Banshchikov.

2) Add bpf_iter for task_vma which can be used to generate information similar
   to /proc/pid/maps, from Song Liu.

3) Enable bpf_{g,s}etsockopt() from all sock_addr related program hooks. Allow
   rewriting bind user ports from BPF side below the ip_unprivileged_port_start
   range, both from Stanislav Fomichev.

4) Prevent recursion on fentry/fexit & sleepable programs and allow map-in-map
   as well as per-cpu maps for the latter, from Alexei Starovoitov.

5) Add selftest script to run BPF CI locally. Also enable BPF ringbuffer
   for sleepable programs, both from KP Singh.

6) Extend verifier to enable variable offset read/write access to the BPF
   program stack, from Andrei Matei.

7) Improve tc & XDP MTU handling and add a new bpf_check_mtu() helper to
   query device MTU from programs, from Jesper Dangaard Brouer.

8) Allow bpf_get_socket_cookie() helper also be called from [sleepable] BPF
   tracing programs, from Florent Revest.

9) Extend x86 JIT to pad JMPs with NOPs for helping image to converge when
   otherwise too many passes are required, from Gary Lin.

10) Verifier fixes on atomics with BPF_FETCH as well as function-by-function
    verification both related to zero-extension handling, from Ilya Leoshkevich.

11) Better kernel build integration of resolve_btfids tool, from Jiri Olsa.

12) Batch of AF_XDP selftest cleanups and small performance improvement
    for libbpf's xsk map redirect for newer kernels, from Björn Töpel.

13) Follow-up BPF doc and verifier improvements around atomics with
    BPF_FETCH, from Brendan Jackman.

14) Permit zero-sized data sections e.g. if ELF .rodata section contains
    read-only data from local variables, from Yonghong Song.

15) veth driver skb bulk-allocation for ndo_xdp_xmit, from Lorenzo Bianconi.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-16 13:14:06 -08:00
Alexander Lobakin 9243adfc31 skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing
napi_frags_finish() and napi_skb_finish() can only be called inside
NAPI Rx context, so we can feed NAPI cache with skbuff_heads that
got NAPI_MERGED_FREE verdict instead of immediate freeing.
Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish()
and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs
to NAPI cache.
As many drivers call napi_alloc_skb()/napi_get_frags() on their
receive path, this becomes especially useful.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13 14:32:04 -08:00
Alexander Lobakin fec6e49b63 skbuff: remove __kfree_skb_flush()
This function isn't much needed as NAPI skb queue gets bulk-freed
anyway when there's no more room, and even may reduce the efficiency
of bulk operations.
It will be even less needed after reusing skb cache on allocation path,
so remove it and this way lighten network softirqs a bit.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-13 14:32:03 -08:00
Jesper Dangaard Brouer 5f7d57280c bpf: Drop MTU check when doing TC-BPF redirect to ingress
The use-case for dropping the MTU check when TC-BPF does redirect to
ingress, is described by Eyal Birger in email[0]. The summary is the
ability to increase packet size (e.g. with IPv6 headers for NAT64) and
ingress redirect packet and let normal netstack fragment packet as needed.

[0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/

V15:
 - missing static for function declaration

V9:
 - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect

V4:
 - Keep net_device "up" (IFF_UP) check.
 - Adjustment to handle bpf_redirect_peer() helper

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/161287790971.790810.11785274340154740591.stgit@firesoul
2021-02-13 01:15:28 +01:00
Cong Wang 3b23a32a63 net: fix dev_ifsioc_locked() race condition
dev_ifsioc_locked() is called with only RCU read lock, so when
there is a parallel writer changing the mac address, it could
get a partially updated mac address, as shown below:

Thread 1			Thread 2
// eth_commit_mac_addr_change()
memcpy(dev->dev_addr, addr->sa_data, ETH_ALEN);
				// dev_ifsioc_locked()
				memcpy(ifr->ifr_hwaddr.sa_data,
					dev->dev_addr,...);

Close this race condition by guarding them with a RW semaphore,
like netdev_get_name(). We can not use seqlock here as it does not
allow blocking. The writers already take RTNL anyway, so this does
not affect the slow path. To avoid bothering existing
dev_set_mac_address() callers in drivers, introduce a new wrapper
just for user-facing callers on ioctl and rtnetlink paths.

Note, bonding also changes slave mac addresses but that requires
a separate patch due to the complexity of bonding code.

Fixes: 3710becf8a ("net: RCU locking for simple ioctl()")
Reported-by: "Gong, Sishuai" <sishuai@purdue.edu>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-11 18:14:19 -08:00
David S. Miller dc9d87581d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
2021-02-10 13:30:12 -08:00
Wei Wang 5fdd2f0e5c net: add sysfs attribute to control napi threaded mode
This patch adds a new sysfs attribute to the network device class.
Said attribute provides a per-device control to enable/disable the
threaded mode for all the napi instances of the given network device,
without the need for a device up/down.
User sets it to 1 or 0 to enable or disable threaded mode.
Note: when switching between threaded and the current softirq based mode
for a napi instance, it will not immediately take effect if the napi is
currently being polled. The mode switch will happen for the next time
napi_schedule() is called.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Co-developed-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-09 15:27:28 -08:00
Wei Wang 29863d41bb net: implement threaded-able napi poll loop support
This patch allows running each napi poll loop inside its own
kernel thread.
The kthread is created during netif_napi_add() if dev->threaded
is set. And threaded mode is enabled in napi_enable(). We will
provide a way to set dev->threaded and enable threaded mode
without a device up/down in the following patch.

Once that threaded mode is enabled and the kthread is
started, napi_schedule() will wake-up such thread instead
of scheduling the softirq.

The threaded poll loop behaves much like net_rx_action,
but it does not have to manipulate local irqs and uses
an explicit scheduling point based on netdev_budget.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-09 15:27:28 -08:00
Felix Fietkau 898f8015ff net: extract napi poll functionality to __napi_poll()
This commit introduces a new function __napi_poll() which does the main
logic of the existing napi_poll() function, and will be called by other
functions in later commits.
This idea and implementation is done by Felix Fietkau <nbd@nbd.name> and
is proposed as part of the patch to move napi work to work_queue
context.
This commit by itself is a code restructure.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-09 15:27:28 -08:00
Eric Dumazet 8dc1c444df net: gro: do not keep too many GRO packets in napi->rx_list
Commit c80794323e ("net: Fix packet reordering caused by GRO and
listified RX cooperation") had the unfortunate effect of adding
latencies in common workloads.

Before the patch, GRO packets were immediately passed to
upper stacks.

After the patch, we can accumulate quite a lot of GRO
packets (depending on NAPI budget).

My fix is counting in napi->rx_count number of segments
instead of number of logical packets.
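
The accounting change can be illustrated with a small userspace model. struct toy_napi_rx and toy_gro_normal_one() are hypothetical names, and the batch threshold is a plain field here rather than a system-wide setting:

```c
#include <assert.h>
#include <stdbool.h>

struct toy_napi_rx {
    unsigned int rx_count;   /* segments currently queued on rx_list */
    unsigned int batch;      /* flush threshold */
    unsigned int flushes;    /* how often the list went up the stack */
};

/* Queue one packet made of 'segs' wire segments (1 if not merged by
 * GRO). Returns true if the list was flushed to the upper stack. */
static bool toy_gro_normal_one(struct toy_napi_rx *n, unsigned int segs)
{
    assert(segs > 0);
    n->rx_count += segs;     /* counting packets would add 1 here */
    if (n->rx_count >= n->batch) {
        n->rx_count = 0;
        n->flushes++;
        return true;
    }
    return false;
}
```

With a batch of 8, a single GRO packet carrying 45 segments triggers a flush straight away, whereas per-packet counting would have held it until seven more packets arrived.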

Fixes: c80794323e ("net: Fix packet reordering caused by GRO and listified RX cooperation")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-by: John Sperbeck <jsperbeck@google.com>
Tested-by: Jian Yang <jianyang@google.com>
Cc: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Link: https://lore.kernel.org/r/20210204213146.4192368-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-05 19:28:01 -08:00
Leon Romanovsky 04f00ab227 net/core: move gro function declarations to separate header
Fix the following compilation warnings:
 1031 | INDIRECT_CALLABLE_SCOPE void udp_v6_early_demux(struct sk_buff *skb)

net/ipv6/ip6_offload.c:182:41: warning: no previous prototype for ‘ipv6_gro_receive’ [-Wmissing-prototypes]
  182 | INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
      |                                         ^~~~~~~~~~~~~~~~
net/ipv6/ip6_offload.c:320:29: warning: no previous prototype for ‘ipv6_gro_complete’ [-Wmissing-prototypes]
  320 | INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
      |                             ^~~~~~~~~~~~~~~~~

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-04 18:37:57 -08:00
Xin Long 62fafcd631 net: support ip generic csum processing in skb_csum_hwoffload_help
NETIF_F_IP|IPV6_CSUM feature flag indicates UDP and TCP csum offload
while NETIF_F_HW_CSUM feature flag indicates ip generic csum offload
for HW, which includes not only for TCP/UDP csum, but also for other
protocols' csum like GRE's.

However, in skb_csum_hwoffload_help() it only checks features against
NETIF_F_CSUM_MASK(NETIF_F_HW|IP|IPV6_CSUM). So if it's a non TCP/UDP
packet and the features don't support NETIF_F_HW_CSUM, but support
NETIF_F_IP|IPV6_CSUM only, it would still return 0 and leave the HW
to do csum.

This patch is to support ip generic csum processing by checking
NETIF_F_HW_CSUM for all protocols, and check (NETIF_F_IP_CSUM |
NETIF_F_IPV6_CSUM) only for TCP and UDP.

Note that we're using skb->csum_offset to check if it's a TCP/UDP
protocol, this might be fragile. However, as Alex said, for now we
only have a few L4 protocols that are requesting Tx csum offload,
we'd better fix this until a new protocol comes with a same csum
offset.
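
A userspace sketch of the check described above. The TOY_F_* bits are made-up stand-ins for the feature flags, and the offsets 16 and 6 are the positions of the checksum field in the TCP and UDP headers:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy feature bits standing in for the NETIF_F_* flags. */
#define TOY_F_IP_CSUM   (1u << 0)
#define TOY_F_IPV6_CSUM (1u << 1)
#define TOY_F_HW_CSUM   (1u << 2)

/* Offset of the checksum field within the L4 header. */
#define TOY_TCP_CSUM_OFFSET 16
#define TOY_UDP_CSUM_OFFSET 6

/* Returns true when the checksum can be left to the hardware, false
 * when software has to compute it before transmission. */
static bool toy_hw_can_checksum(unsigned int features, unsigned int csum_offset)
{
    assert(csum_offset <= 0xffff);   /* modelled as a 16-bit field */

    if (features & TOY_F_HW_CSUM)
        return true;                 /* generic offload: any protocol */
    if (features & (TOY_F_IP_CSUM | TOY_F_IPV6_CSUM))
        return csum_offset == TOY_TCP_CSUM_OFFSET ||
               csum_offset == TOY_UDP_CSUM_OFFSET;
    return false;
}
```

The fragility mentioned above is visible in the sketch: any future protocol whose checksum sits at offset 6 or 16 would be mistaken for UDP or TCP.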

v1->v2:
  - not extend skb->csum_not_inet, but use skb->csum_offset to tell
    if it's an UDP/TCP csum packet.
v2->v3:
  - add a note in the changelog, as Willem suggested.

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 20:39:14 -08:00
Yousuk Seung e7ed11ee94 tcp: add TTL to SCM_TIMESTAMPING_OPT_STATS
This patch adds TCP_NLA_TTL to SCM_TIMESTAMPING_OPT_STATS that exports
the time-to-live or hop limit of the latest incoming packet with
SCM_TSTAMP_ACK. The value exported may not be from the packet that acks
the sequence when incoming packets are aggregated. Exporting the
time-to-live or hop limit value of incoming packets helps to estimate
the hop count of the path of the flow that may change over time.

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210120204155.552275-1-ysseung@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-22 18:20:52 -08:00
wenxu 7baf2429a1 net/sched: cls_flower add CT_FLAGS_INVALID flag support
This patch adds the TCA_FLOWER_KEY_CT_FLAGS_INVALID flag to
match the ct_state with invalid for conntrack.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/1611045110-682-1-git-send-email-wenxu@ucloud.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 21:09:44 -08:00
Jakub Kicinski 0cbe1e57a7 net: inline rollback_registered_many()
Similar to the change for rollback_registered() -
rollback_registered_many() was a part of unregister_netdevice_many()
minus the net_set_todo(), which is no longer needed.

Functionally this patch moves the list_empty() check back after:

	BUG_ON(dev_boot_phase);
	ASSERT_RTNL();

but I can't find any reason why that would be an issue.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 21:04:19 -08:00
Jakub Kicinski bcfe2f1a38 net: move rollback_registered_many()
Move rollback_registered_many() and add a temporary
forward declaration to make merging the code into
unregister_netdevice_many() easier to review.

No functional changes.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 21:04:19 -08:00
Jakub Kicinski 037e56bd96 net: inline rollback_registered()
rollback_registered() is a local helper; it's common for driver
code to call unregister_netdevice_queue(dev, NULL) when it
wants to unregister netdevices under rtnl_lock. Inline
rollback_registered() and adjust the only remaining caller.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 21:04:18 -08:00
Jakub Kicinski 2014beea7e net: move net_set_todo inside rollback_registered()
Commit 93ee31f14f ("[NET]: Fix free_netdev on register_netdev
failure.") moved net_set_todo() outside of rollback_registered()
so that rollback_registered() can be used in the failure path of
register_netdevice() but without risking a double free.

Since commit cf124db566 ("net: Fix inconsistent teardown and
release of private netdev state."), however, we have a better
way of handling that condition, since destructors don't call
free_netdev() directly.

After the change in commit c269a24ce0 ("net: make free_netdev()
more lenient with unregistering devices") we can now move
net_set_todo() back.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 21:04:18 -08:00
Jakub Kicinski 0fe2f273ab Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:

drivers/net/can/dev.c
  commit 03f16c5075 ("can: dev: can_restart: fix use after free bug")
  commit 3e77f70e73 ("can: dev: move driver related infrastructure into separate subdir")

  Code move.

drivers/net/dsa/b53/b53_common.c
 commit 8e4052c32d ("net: dsa: b53: fix an off by one in checking "vlan->vid"")
 commit b7a9e0da2d ("net: switchdev: remove vid_begin -> vid_end range from VLAN objects")

 Field rename.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20 12:16:11 -08:00
Tariq Toukan a3eb4e9d4c net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled
With NETIF_F_HW_TLS_RX packets are decrypted in HW. This cannot be
logically done when RXCSUM offload is off.
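
The dependency described above can be sketched as a small userspace model. This is not the kernel change itself: the bit positions and the function name `fix_tls_rx_feature` are placeholders chosen for illustration, and only the two flag names come from the commit message.

```c
#include <stdint.h>

/* Illustrative bit positions only; the kernel defines the real values. */
#define NETIF_F_RXCSUM    (1ULL << 0)
#define NETIF_F_HW_TLS_RX (1ULL << 1)

/* Hypothetical helper: drop TLS RX offload when RX checksum offload is off. */
uint64_t fix_tls_rx_feature(uint64_t features)
{
	if ((features & NETIF_F_HW_TLS_RX) && !(features & NETIF_F_RXCSUM))
		features &= ~NETIF_F_HW_TLS_RX;
	return features;
}
```

A feature set that requests TLS RX offload without RXCSUM comes back with the TLS bit cleared; any other combination is returned unchanged.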

Fixes: 14136564c8 ("net: Add TLS RX offload feature")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Link: https://lore.kernel.org/r/20210117151538.9411-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19 15:58:05 -08:00
Xin Long fa82117010 net: add inline function skb_csum_is_sctp
This patch defines an inline function, skb_csum_is_sctp(), and
replaces every open-coded check for an SCTP CSUM skb with it.
This function would be used later in many networking drivers in
the following patches.
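
A rough userspace model of such a helper follows. The commit message does not say which field the helper tests; the sketch assumes it wraps a single flag bit named `csum_not_inet`, and both the struct and the `_model` names are stand-ins rather than kernel definitions.

```c
#include <stdbool.h>

/* Stand-in for struct sk_buff, reduced to the one bit this sketch needs. */
struct sk_buff_model {
	unsigned int csum_not_inet : 1; /* assumed: set when the checksum is SCTP's CRC32c */
};

/* Models a helper that hides the raw field access behind a descriptive name. */
bool skb_csum_is_sctp_model(const struct sk_buff_model *skb)
{
	return skb->csum_not_inet;
}
```

Callers then ask the question by name instead of reading the field directly, which is what makes a later change to the underlying representation a one-line edit.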

Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19 14:31:25 -08:00
Tariq Toukan 719a402cf6 net: netdevice: Add operation ndo_sk_get_lower_dev
ndo_sk_get_lower_dev returns the lower netdev that corresponds to
a given socket.
Additionally, we implement a helper netdev_sk_get_lowest_dev() to get
the lowest one in chain.
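
The relationship between the per-device operation and the "lowest in chain" helper can be sketched in userspace as below. The types and `_model` names are hypothetical, the socket argument is omitted, and each device is assumed to report at most one lower device.

```c
#include <stddef.h>

/* Stand-in for struct net_device: a device may sit on top of one lower device. */
struct lower_chain_model {
	const char *name;
	struct lower_chain_model *lower; /* NULL at the bottom of the stack */
};

/* Models ndo_sk_get_lower_dev(): a single step down the stack. */
struct lower_chain_model *sk_get_lower_dev_model(struct lower_chain_model *dev)
{
	return dev->lower;
}

/* Models netdev_sk_get_lowest_dev(): step down until nothing lower is reported. */
struct lower_chain_model *sk_get_lowest_dev_model(struct lower_chain_model *dev)
{
	struct lower_chain_model *lower = sk_get_lower_dev_model(dev);

	while (lower) {
		dev = lower;
		lower = sk_get_lower_dev_model(dev);
	}
	return dev;
}
```

For a device with nothing beneath it, the helper returns the device itself.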

Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18 20:48:39 -08:00
Jakub Kicinski 2d9116be76 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-01-16

1) Extend atomic operations to the BPF instruction set along with x86-64 JIT support,
   that is, atomic{,64}_{xchg,cmpxchg,fetch_{add,and,or,xor}}, from Brendan Jackman.

2) Add support for using kernel module global variables (__ksym externs in BPF
   programs) retrieved via module's BTF, from Andrii Nakryiko.

3) Generalize BPF stackmap's buildid retrieval and add support to have buildid
   stored in mmap2 event for perf, from Jiri Olsa.

4) Various fixes for cross-building BPF selftests out-of-tree which then will
   unblock wider automated testing on ARM hardware, from Jean-Philippe Brucker.

5) Allow to retrieve SOL_SOCKET opts from sock_addr progs, from Daniel Borkmann.

6) Clean up driver's XDP buffer init and split into two helpers to init per-
   descriptor and non-changing fields during processing, from Lorenzo Bianconi.

7) Minor misc improvements to libbpf & bpftool, from Ian Rogers.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits)
  perf: Add build id data in mmap2 event
  bpf: Add size arg to build_id_parse function
  bpf: Move stack_map_get_build_id into lib
  bpf: Document new atomic instructions
  bpf: Add tests for new BPF atomic operations
  bpf: Add bitwise atomic instructions
  bpf: Pull out a macro for interpreting atomic ALU operations
  bpf: Add instructions for atomic_[cmp]xchg
  bpf: Add BPF_FETCH field / create atomic_fetch_add instruction
  bpf: Move BPF_STX reserved field check into BPF_STX verifier code
  bpf: Rename BPF_XADD and prepare to encode other atomics in .imm
  bpf: x86: Factor out a lookup table for some ALU opcodes
  bpf: x86: Factor out emission of REX byte
  bpf: x86: Factor out emission of ModR/M for *(reg + off)
  tools/bpftool: Add -Wall when building BPF programs
  bpf, libbpf: Avoid unused function warning on bpf_tail_call_static
  selftests/bpf: Install btf_dump test cases
  selftests/bpf: Fix installation of urandom_read
  selftests/bpf: Move generated test files to $(TEST_GEN_FILES)
  selftests/bpf: Fix out-of-tree build
  ...
====================

Link: https://lore.kernel.org/r/20210116012922.17823-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15 17:57:26 -08:00
Jakub Kicinski 1d9f03c0a1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 18:34:50 -08:00
Tariq Toukan 25537d71e2 net: Allow NETIF_F_HW_TLS_TX if IP_CSUM && IPV6_CSUM
The patch cited below blocked the TLS TX device offload unless HW_CSUM
is set. This broke devices that use IP_CSUM && IPV6_CSUM.
Fix it here.

Note that the single HW_TLS_TX feature flag indicates support for
both IPv4/6, hence it should still be disabled in case only one of
(IP_CSUM | IPV6_CSUM) is set.
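
The resulting rule can be written out as a small userspace model. The bit positions and the function name `fix_tls_tx_feature` are illustrative assumptions, not the kernel's definitions; the condition itself follows the text above.

```c
#include <stdint.h>

/* Illustrative bit positions only; the kernel defines the real values. */
#define NETIF_F_IP_CSUM   (1ULL << 0)
#define NETIF_F_IPV6_CSUM (1ULL << 1)
#define NETIF_F_HW_CSUM   (1ULL << 2)
#define NETIF_F_HW_TLS_TX (1ULL << 3)

/* Hypothetical helper: keep TLS TX offload only with HW_CSUM, or with both
 * IP_CSUM and IPV6_CSUM, because one TLS flag covers IPv4 and IPv6 together.
 */
uint64_t fix_tls_tx_feature(uint64_t features)
{
	const uint64_t both = NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
	int hw_csum = (features & NETIF_F_HW_CSUM) != 0;
	int ip_csum = (features & both) == both;

	if ((features & NETIF_F_HW_TLS_TX) && !hw_csum && !ip_csum)
		features &= ~NETIF_F_HW_TLS_TX;
	return features;
}
```

Note the comparison against `both`: testing `features & both` for non-zero would wrongly accept a device that offloads checksums for only one address family.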

Fixes: ae0b04b238 ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reported-by: Rohit Maheshwari <rohitm@chelsio.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Link: https://lore.kernel.org/r/20210114151215.7061-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-14 10:56:13 -08:00
Menglong Dong 324cefaf1c net: core: use eth_type_vlan in __netif_receive_skb_core
Replace the check for ETH_P_8021Q and ETH_P_8021AD in
__netif_receive_skb_core with eth_type_vlan.
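
For reference, the check being consolidated amounts to the following. This is a host-byte-order userspace model with a hypothetical `_model` name; the kernel helper operates on a big-endian ethertype, which the sketch leaves out.

```c
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_8021Q  0x8100 /* IEEE 802.1Q VLAN tag */
#define ETH_P_8021AD 0x88A8 /* IEEE 802.1ad service VLAN tag */

/* Returns true when the ethertype names one of the two VLAN tag protocols. */
bool eth_type_vlan_model(uint16_t ethertype)
{
	switch (ethertype) {
	case ETH_P_8021Q:
	case ETH_P_8021AD:
		return true;
	default:
		return false;
	}
}
```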

Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Link: https://lore.kernel.org/r/20210111104221.3451-1-dong.menglong@zte.com.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-13 19:08:54 -08:00
Eric Dumazet 1d11fa6967 net-gro: remove GRO_DROP
GRO_DROP can only be returned from napi_gro_frags()
if the skb has not been allocated by a prior napi_get_frags().

Since drivers must use napi_get_frags() and test its result
before populating the skb with metadata, we can safely remove
GRO_DROP since it offers no practical use.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-09 14:24:26 -08:00
Jakub Kicinski 766b0515d5 net: make sure devices go through netdev_wait_all_refs
If register_netdevice() fails at the very last stage - the
notifier call - some subsystems may have already seen it and
grabbed a reference. struct net_device can't be freed right
away without calling netdev_wait_all_refs().

Now that we have a clean interface in the form of dev->needs_free_netdev
and lenient free_netdev() we can undo what commit 93ee31f14f ("[NET]:
Fix free_netdev on register_netdev failure.") has done and complete
the unregistration path by bringing the net_set_todo() call back.

After registration fails the user is still expected to explicitly
free the net_device, so make sure ->needs_free_netdev is cleared,
otherwise rolling back the registration will cause the old double
free for callers who release rtnl_lock before the free.

This also solves the problem of priv_destructor not being called
on notifier error.

net_set_todo() will be moved back into unregister_netdevice_queue()
in a follow up.

Reported-by: Hulk Robot <hulkci@huawei.com>
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:27:41 -08:00
Jakub Kicinski c269a24ce0 net: make free_netdev() more lenient with unregistering devices
There are two flavors of handling netdev registration:
 - ones called without holding rtnl_lock: register_netdev() and
   unregister_netdev(); and
 - those called with rtnl_lock held: register_netdevice() and
   unregister_netdevice().

While the semantics of the former are pretty clear, the same can't
be said about the latter. The netdev_todo mechanism is utilized to
perform some of the device unregistering tasks and it hooks into
rtnl_unlock() so the locked variants can't actually finish the work.
In general free_netdev() does not mix well with locked calls. Most
drivers operating under rtnl_lock set dev->needs_free_netdev to true
and expect core to make the free_netdev() call some time later.

The part where this becomes most problematic is error paths. There is
no way to unwind the state cleanly after a call to register_netdevice(),
since unreg can't be performed fully without dropping locks.

Make free_netdev() more lenient, and defer the freeing if device
is being unregistered. This allows error paths to simply call
free_netdev() both after register_netdevice() failed, and after
a call to unregister_netdevice() but before dropping rtnl_lock.
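
The deferral can be pictured with a toy state model. Everything here is a simplification for illustration: the struct, the `freed` marker, and the `_model` names are hypothetical, and locking as well as the netdev_todo processing that performs the deferred free are left out.

```c
#include <stdbool.h>

enum reg_state_model {
	REG_UNINITIALIZED,
	REG_REGISTERED,
	REG_UNREGISTERING, /* unregistration started, not yet finished */
	REG_UNREGISTERED
};

struct netdev_state_model {
	enum reg_state_model reg_state;
	bool needs_free_netdev; /* the core frees the device later */
	bool freed;             /* stands in for the memory actually being released */
};

/* Models a lenient free: defer while the device is still being unregistered. */
void free_netdev_model(struct netdev_state_model *dev)
{
	if (dev->reg_state == REG_UNREGISTERING) {
		dev->needs_free_netdev = true;
		return;
	}
	dev->freed = true;
}
```

With this behaviour an error path can call the free function unconditionally, whether or not unregistration has completed.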

Simplify the error paths which are currently doing gymnastics
around free_netdev() handling.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-08 19:27:41 -08:00
Lorenzo Bianconi be9df4aff6 net, xdp: Introduce xdp_prepare_buff utility routine
Introduce xdp_prepare_buff utility routine to initialize per-descriptor
xdp_buff fields (e.g. xdp_buff pointers). Rely on xdp_prepare_buff() in
all XDP capable drivers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/bpf/45f46f12295972a97da8ca01990b3e71501e9d89.1608670965.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08 13:39:24 -08:00
Lorenzo Bianconi 43b5169d83 net, xdp: Introduce xdp_init_buff utility routine
Introduce xdp_init_buff utility routine to initialize xdp_buff fields
const over NAPI iterations (e.g. frame_sz or rxq pointer). Rely on
xdp_init_buff in all XDP capable drivers.
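
The split between the two helpers (xdp_init_buff() here and xdp_prepare_buff() in the entry above) can be sketched in userspace as follows. The struct layout, the parameter lists, and the "data + 1 means no metadata" convention are assumptions made for this sketch; the commit messages only name frame_sz, the rxq pointer, and the xdp_buff pointers.

```c
#include <stdbool.h>
#include <stdint.h>

struct xdp_rxq_model { uint32_t queue_index; };

/* Stand-in for struct xdp_buff. */
struct xdp_buff_model {
	unsigned char *data_hard_start;
	unsigned char *data;
	unsigned char *data_end;
	unsigned char *data_meta;
	struct xdp_rxq_model *rxq;
	uint32_t frame_sz;
};

/* Fields that stay constant over a NAPI poll: set once, before the loop. */
void xdp_init_buff_model(struct xdp_buff_model *xdp, uint32_t frame_sz,
			 struct xdp_rxq_model *rxq)
{
	xdp->frame_sz = frame_sz;
	xdp->rxq = rxq;
}

/* Fields that change per descriptor: set for every received frame. */
void xdp_prepare_buff_model(struct xdp_buff_model *xdp, unsigned char *hard_start,
			    int headroom, int data_len, bool meta_valid)
{
	unsigned char *data = hard_start + headroom;

	xdp->data_hard_start = hard_start;
	xdp->data = data;
	xdp->data_end = data + data_len;
	/* Assumed convention: data_meta > data marks metadata as unsupported. */
	xdp->data_meta = meta_valid ? data : data + 1;
}
```

A driver loop would call the init helper once per poll and the prepare helper once per frame, which keeps the invariant stores out of the hot path.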

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Shay Agroskin <shayagr@amazon.com>
Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
Acked-by: Camelia Groza <camelia.groza@nxp.com>
Acked-by: Marcin Wojtas <mw@semihalf.com>
Link: https://lore.kernel.org/bpf/7f8329b6da1434dc2b05a77f2e800b29628a8913.1608670965.git.lorenzo@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-08 13:39:24 -08:00
Jakub Kicinski 876c4384ae udp_tunnel: hard-wire NDOs to udp_tunnel_nic_*_port() helpers
All drivers use udp_tunnel_nic_*_port() helpers, prepare for
NDO removal by invoking those helpers directly.

The helpers are safe to call on all devices, they check if
device has the UDP tunnel state initialized.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-07 12:53:29 -08:00
Lijun Pan 7061eb8cfa net: core: introduce __netdev_notify_peers
There are some use cases for netdev_notify_peers in contexts
where the rtnl lock is already held. Introduce a lockless version
of netdev_notify_peers so that such callers need not open-code
	call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev);
	call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev);
After that, convert netdev_notify_peers to call the new helper.
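
The locked/lockless pairing can be modelled in userspace as below. A boolean stands in for the rtnl lock and a counter for the notifier calls; all names are hypothetical, and the leading double underscore is replaced by a `_locked` suffix because such identifiers are reserved in ordinary C programs.

```c
#include <assert.h>
#include <stdbool.h>

bool rtnl_held_model;        /* stand-in for the rtnl lock */
int notifications_model;     /* counts notifier events sent */

void rtnl_lock_model(void)   { assert(!rtnl_held_model); rtnl_held_model = true; }
void rtnl_unlock_model(void) { assert(rtnl_held_model);  rtnl_held_model = false; }

/* Lockless variant: the caller must already hold the lock. */
void netdev_notify_peers_locked_model(void)
{
	assert(rtnl_held_model);
	notifications_model += 2; /* NETDEV_NOTIFY_PEERS, then NETDEV_RESEND_IGMP */
}

/* Locking variant: take the lock, then reuse the lockless helper. */
void netdev_notify_peers_model(void)
{
	rtnl_lock_model();
	netdev_notify_peers_locked_model();
	rtnl_unlock_model();
}
```

Code that already holds the lock calls the `_locked` variant directly; everything else keeps using the wrapper.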

Suggested-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-16 11:43:25 -08:00
Linus Torvalds d635a69dd4 Networking updates for 5.11
Core:
 
  - support "prefer busy polling" NAPI operation mode, where we defer softirq
    for some time expecting applications to periodically busy poll
 
  - AF_XDP: improve efficiency by more batching and hindering
            the adjacency cache prefetcher
 
  - af_packet: make packet_fanout.arr size configurable up to 64K
 
  - tcp: optimize TCP zero copy receive in presence of partial or unaligned
         reads making zero copy a performance win for much smaller messages
 
  - XDP: add bulk APIs for returning / freeing frames
 
  - sched: support fragmenting IP packets as they come out of conntrack
 
  - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs
 
 BPF:
 
  - BPF switch from crude rlimit-based to memcg-based memory accounting
 
  - BPF type format information for kernel modules and related tracing
    enhancements
 
  - BPF implement task local storage for BPF LSM
 
  - allow the FENTRY/FEXIT/RAW_TP tracing programs to use bpf_sk_storage
 
 Protocols:
 
  - mptcp: improve multiple xmit streams support, memory accounting and
           many smaller improvements
 
  - TLS: support CHACHA20-POLY1305 cipher
 
  - seg6: add support for SRv6 End.DT4/DT6 behavior
 
  - sctp: Implement RFC 6951: UDP Encapsulation of SCTP
 
  - ppp_generic: add ability to bridge channels directly
 
  - bridge: Connectivity Fault Management (CFM) support as is defined in
            IEEE 802.1Q section 12.14.
 
 Drivers:
 
  - mlx5: make use of the new auxiliary bus to organize the driver internals
 
  - mlx5: more accurate port TX timestamping support
 
  - mlxsw:
    - improve the efficiency of offloaded next hop updates by using
      the new nexthop object API
    - support blackhole nexthops
    - support IEEE 802.1ad (Q-in-Q) bridging
 
  - rtw88: major bluetooth coexistence improvements
 
  - iwlwifi: support new 6 GHz frequency band
 
  - ath11k: Fast Initial Link Setup (FILS)
 
  - mt7915: dual band concurrent (DBDC) support
 
  - net: ipa: add basic support for IPA v4.5
 
 Refactor:
 
  - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej Siewior
 
  - phy: add support for shared interrupts; get rid of multiple driver
         APIs and have the drivers write a full IRQ handler, slight growth
 	of driver code should be compensated by the simpler API which
 	also allows shared IRQs
 
  - add common code for handling netdev per-cpu counters
 
  - move TX packet re-allocation from Ethernet switch tag drivers to
    a central place
 
  - improve efficiency and rename nla_strlcpy
 
  - number of W=1 warning cleanups as we now catch those in a patchwork
    build bot
 
 Old code removal:
 
  - wan: delete the DLCI / SDLA drivers
 
  - wimax: move to staging
 
  - wifi: remove old WDS wifi bridging support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - support "prefer busy polling" NAPI operation mode, where we defer
     softirq for some time expecting applications to periodically busy
     poll

   - AF_XDP: improve efficiency by more batching and hindering the
     adjacency cache prefetcher

   - af_packet: make packet_fanout.arr size configurable up to 64K

   - tcp: optimize TCP zero copy receive in presence of partial or
     unaligned reads making zero copy a performance win for much smaller
     messages

   - XDP: add bulk APIs for returning / freeing frames

   - sched: support fragmenting IP packets as they come out of conntrack

   - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs

  BPF:

   - BPF switch from crude rlimit-based to memcg-based memory accounting

   - BPF type format information for kernel modules and related tracing
     enhancements

   - BPF implement task local storage for BPF LSM

   - allow the FENTRY/FEXIT/RAW_TP tracing programs to use
     bpf_sk_storage

  Protocols:

   - mptcp: improve multiple xmit streams support, memory accounting and
     many smaller improvements

   - TLS: support CHACHA20-POLY1305 cipher

   - seg6: add support for SRv6 End.DT4/DT6 behavior

   - sctp: Implement RFC 6951: UDP Encapsulation of SCTP

   - ppp_generic: add ability to bridge channels directly

   - bridge: Connectivity Fault Management (CFM) support as is defined
     in IEEE 802.1Q section 12.14.

  Drivers:

   - mlx5: make use of the new auxiliary bus to organize the driver
     internals

   - mlx5: more accurate port TX timestamping support

   - mlxsw:
      - improve the efficiency of offloaded next hop updates by using
        the new nexthop object API
      - support blackhole nexthops
      - support IEEE 802.1ad (Q-in-Q) bridging

   - rtw88: major bluetooth coexistence improvements

   - iwlwifi: support new 6 GHz frequency band

   - ath11k: Fast Initial Link Setup (FILS)

   - mt7915: dual band concurrent (DBDC) support

   - net: ipa: add basic support for IPA v4.5

  Refactor:

   - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
     Siewior

   - phy: add support for shared interrupts; get rid of multiple driver
     APIs and have the drivers write a full IRQ handler, slight growth
     of driver code should be compensated by the simpler API which also
     allows shared IRQs

   - add common code for handling netdev per-cpu counters

   - move TX packet re-allocation from Ethernet switch tag drivers to a
     central place

   - improve efficiency and rename nla_strlcpy

   - number of W=1 warning cleanups as we now catch those in a patchwork
     build bot

  Old code removal:

   - wan: delete the DLCI / SDLA drivers

   - wimax: move to staging

   - wifi: remove old WDS wifi bridging support"

* tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
  net: hns3: fix expression that is currently always true
  net: fix proc_fs init handling in af_packet and tls
  nfc: pn533: convert comma to semicolon
  af_vsock: Assign the vsock transport considering the vsock address flags
  af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
  vsock_addr: Check for supported flag values
  vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
  vm_sockets: Add flags field in the vsock address data structure
  net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
  tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
  net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
  nfc: s3fwrn5: Release the nfc firmware
  net: vxget: clean up sparse warnings
  mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
  mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
  mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
  mlxsw: reg: Add Router LPM Cache Enable Register
  mlxsw: reg: Add Router LPM Cache ML Delete Register
  mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
  mlxsw: reg: Add XM Router M Table Register
  ...
2020-12-15 13:22:29 -08:00
Tariq Toukan ae0b04b238 net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
With NETIF_F_HW_TLS_TX packets are encrypted in HW. This cannot be
logically done when HW_CSUM offload is off.

Fixes: 2342a8512a ("net: Add TLS TX offload features")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Boris Pismenny <borisp@nvidia.com>
Link: https://lore.kernel.org/r/20201213143929.26253-1-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-14 19:31:36 -08:00
Linus Torvalds adb35e8dc9 Scheduler updates:
- migrate_disable/enable() support which originates from the RT tree and
    is now a prerequisite for the new preemptible kmap_local() API which aims
    to replace kmap_atomic().
 
  - A fair amount of topology and NUMA related improvements
 
  - Improvements for the frequency invariant calculations
 
  - Enhanced robustness for the global CPU priority tracking and decision
    making
 
  - The usual small fixes and enhancements all over the place

Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler updates from Thomas Gleixner:

 - migrate_disable/enable() support which originates from the RT tree
   and is now a prerequisite for the new preemptible kmap_local() API
   which aims to replace kmap_atomic().

 - A fair amount of topology and NUMA related improvements

 - Improvements for the frequency invariant calculations

 - Enhanced robustness for the global CPU priority tracking and decision
   making

 - The usual small fixes and enhancements all over the place

* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
  sched/fair: Trivial correction of the newidle_balance() comment
  sched/fair: Clear SMT siblings after determining the core is not idle
  sched: Fix kernel-doc markup
  x86: Print ratio freq_max/freq_base used in frequency invariance calculations
  x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
  x86, sched: Calculate frequency invariance for AMD systems
  irq_work: Optimize irq_work_single()
  smp: Cleanup smp_call_function*()
  irq_work: Cleanup
  sched: Limit the amount of NUMA imbalance that can exist at fork time
  sched/numa: Allow a floating imbalance between NUMA nodes
  sched: Avoid unnecessary calculation of load imbalance at clone time
  sched/numa: Rename nr_running and break out the magic number
  sched: Make migrate_disable/enable() independent of RT
  sched/topology: Condition EAS enablement on FIE support
  arm64: Rebuild sched domains on invariance status changes
  sched/topology,schedutil: Wrap sched domains rebuild
  sched/uclamp: Allow to reset a task uclamp constraint value
  sched/core: Fix typos in comments
  Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
  ...
2020-12-14 18:29:11 -08:00
Jakub Kicinski 46d5e62dd3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
xdp_return_frame_bulk() needs to pass an xdp_buff
to __xdp_return().

strlcpy got converted to strscpy but here it makes no
functional difference, so just keep the right code.

Conflicts:
	net/netfilter/nf_tables_api.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-11 22:29:38 -08:00
Toke Høiland-Jørgensen 998f172962 xdp: Remove the xdp_attachment_flags_ok() callback
Since commit 7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF
programs in net_device"), the XDP program attachment info is now maintained
in the core code. This interacts badly with the xdp_attachment_flags_ok()
check that prevents unloading an XDP program with different load flags than
it was loaded with. In practice, two kinds of failures are seen:

- An XDP program loaded without specifying a mode (and which then ends up
  in driver mode) cannot be unloaded if the program mode is specified on
  unload.

- The dev_xdp_uninstall() hook always calls the driver callback with the
  mode set to the type of the program but an empty flags argument, which
  means the flags_ok() check prevents the program from being removed,
  leading to bpf prog reference leaks.

The original reason this check was added was to avoid ambiguity when
multiple programs were loaded. With the way the checks are done in the core
now, this is quite simple to enforce in the core code, so let's add a check
there and get rid of the xdp_attachment_flags_ok() callback entirely.

Fixes: 7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/160752225751.110217.10267659521308669050.stgit@toke.dk
2020-12-09 16:27:42 +01:00
Jakub Kicinski a1dd1d8697 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-12-03

The main changes are:

1) Support BTF in kernel modules, from Andrii.

2) Introduce preferred busy-polling, from Björn.

3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.

4) Memcg-based memory accounting for bpf objects, from Roman.

5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
  selftests/bpf: Fix invalid use of strncat in test_sockmap
  libbpf: Use memcpy instead of strncpy to please GCC
  selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
  selftests/bpf: Add tp_btf CO-RE reloc test for modules
  libbpf: Support attachment of BPF tracing programs to kernel modules
  libbpf: Factor out low-level BPF program loading helper
  bpf: Allow to specify kernel module BTFs when attaching BPF programs
  bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
  selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
  selftests/bpf: Add support for marking sub-tests as skipped
  selftests/bpf: Add bpf_testmod kernel module for testing
  libbpf: Add kernel module BTF support for CO-RE relocations
  libbpf: Refactor CO-RE relocs to not assume a single BTF object
  libbpf: Add internal helper to load BTF data by FD
  bpf: Keep module's btf_data_size intact after load
  bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
  selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
  bpf: Adds support for setting window clamp
  samples/bpf: Fix spelling mistake "recieving" -> "receiving"
  bpf: Fix cold build of test_progs-no_alu32
  ...
====================

Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-04 07:48:12 -08:00
Jakub Kicinski 55fd59b003 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts:
	drivers/net/ethernet/ibm/ibmvnic.c

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-03 15:44:09 -08:00
Vladimir Oltean c214550ff8 net: delete __dev_getfirstbyhwtype
The last user of the RTNL brother of dev_getfirstbyhwtype (the latter
being synchronized under RCU) has been deleted in commit b4db2b35fc
("afs: Use core kernel UUID generation").

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20201129200550.2433401-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-12-01 15:40:01 -08:00
Björn Töpel b02e5a0ebb xsk: Propagate napi_id to XDP socket Rx path
Add napi_id to the xdp_rxq_info structure, and make sure the XDP
socket picks up the napi_id in the Rx path. The napi_id is used to find
the corresponding NAPI structure for socket busy polling.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/bpf/20201130185205.196029-7-bjorn.topel@gmail.com
2020-12-01 00:09:25 +01:00
Björn Töpel 7c951cafc0 net: Add SO_BUSY_POLL_BUDGET socket option
This option lets a user set a per-socket NAPI budget for
busy-polling. If the option is not set, it will use the default of 8.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-3-bjorn.topel@gmail.com
2020-12-01 00:09:25 +01:00
Björn Töpel 7fd3253a7d net: Introduce preferred busy-polling
The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
option or system-wide using the /proc/sys/net/core/busy_read knob, is
opportunistic: if the NAPI context is not scheduled, the
busy-polling logic will poll it. If, after busy-polling, the budget is
exceeded, the busy-polling logic will schedule the NAPI onto the
regular softirq handling.

One implication of the behavior above is that a busy/heavily loaded NAPI
context will never enter/allow for busy-polling. Some applications
prefer that most NAPI processing would be done by busy-polling.

This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
in concert with the napi_defer_hard_irqs and gro_flush_timeout
knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
introduced in commit 6f8b12d661 ("net: napi: add hard irqs deferral
feature"), and allow a user to defer enabling interrupts and
instead schedule the NAPI context from a watchdog timer. When a user
enables the SO_PREFER_BUSY_POLL, again with the other knobs enabled,
and the NAPI context is being processed by a softirq, the softirq NAPI
processing will exit early to allow the busy-polling to be performed.

If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will time out, and
regular softirq handling will resume.

In summary: heavy-traffic applications that prefer busy-polling over
softirq processing should use this option.

Example usage:

  $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
  $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout

Note that the timeout should be larger than the userspace processing
window, otherwise the watchdog will time out and fall back to regular
softirq processing.

Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
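
A minimal userspace sketch of that last step, in Python for brevity. The
option numbers are the asm-generic/socket.h values and are an assumption for
architectures that override them; the helper name and the 50 usec default are
hypothetical, and the budget default of 8 mirrors the default named in the
SO_BUSY_POLL_BUDGET commit.

```python
import socket

# Option numbers from asm-generic/socket.h. Python's socket module may not
# export these names, so they are spelled out here (assumption: an
# architecture that uses the generic values, such as x86 or arm64).
SO_BUSY_POLL = 46
SO_PREFER_BUSY_POLL = 69
SO_BUSY_POLL_BUDGET = 70


def enable_preferred_busy_poll(sock, usecs=50, budget=8):
    """Try to set the three busy-poll options; report which calls succeeded."""
    results = {}
    for name, opt, val in (
        ("SO_BUSY_POLL", SO_BUSY_POLL, usecs),
        ("SO_PREFER_BUSY_POLL", SO_PREFER_BUSY_POLL, 1),
        ("SO_BUSY_POLL_BUDGET", SO_BUSY_POLL_BUDGET, budget),
    ):
        try:
            sock.setsockopt(socket.SOL_SOCKET, opt, val)
            results[name] = True
        except OSError:
            # Older kernel, missing privilege, or a non-Linux host.
            results[name] = False
    return results
```

The helper swallows OSError so that the same code can run on kernels that
predate these options; a real application would probably want to log the
failure instead.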

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
2020-12-01 00:09:25 +01:00
wenxu aadaca9e7c net/sched: fix miss init the mru in qdisc_skb_cb
The mru in the qdisc_skb_cb should be initialized to 0. Only defragmented
packets in act_ct set the value.

Fixes: 038ebb1a71 ("net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-27 14:36:02 -08:00
Heiner Kallweit 1d155dfdf5 net: warn if gso_type isn't set for a GSO SKB
In bug report [0] a warning in r8169 driver was reported that was
caused by an invalid GSO SKB (gso_type was 0). See [1] for a discussion
about this issue. Still the origin of the invalid GSO SKB isn't clear.

It shouldn't be a network driver's task to check for invalid GSO SKBs.
Also, even if issue [0] can be fixed, we can't be sure that a
similar issue doesn't pop up again at another place.
Therefore let gso_features_check() check for such invalid GSO SKBs.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=209423
[1] https://www.spinics.net/lists/netdev/msg690794.html

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/97c78d21-7f0b-d843-df17-3589f224d2cf@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24 14:02:10 -08:00
Björn Töpel 36ccdf8582 net, xsk: Avoid taking multiple skbuff references
Commit 642e450b6b ("xsk: Do not discard packet when NETDEV_TX_BUSY")
addressed the problem that packets were discarded from the Tx AF_XDP
ring, when the driver returned NETDEV_TX_BUSY. Part of the fix was
bumping the skbuff reference count, so that the buffer would not be
freed by dev_direct_xmit(). A reference count larger than one means
that the skbuff is "shared", which is not the case.

If the "shared" skbuff is sent to the generic XDP receive path,
netif_receive_generic_xdp(), and pskb_expand_head() is entered the
BUG_ON(skb_shared(skb)) will trigger.

This patch adds a variant to dev_direct_xmit(), __dev_direct_xmit(),
where a user can select the skbuff free policy. This allows AF_XDP to
avoid bumping the reference count, but still keep the NETDEV_TX_BUSY
behavior.

Fixes: 642e450b6b ("xsk: Do not discard packet when NETDEV_TX_BUSY")
Reported-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20201123175600.146255-1-bjorn.topel@gmail.com
2020-11-24 22:39:56 +01:00
Peter Zijlstra 545b8c8df4 smp: Cleanup smp_call_function*()
Get rid of the __call_single_node union and cleanup the API a little
to avoid external code relying on the structure layout as much.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
2020-11-24 16:47:49 +01:00
Mauro Carvalho Chehab c1639be98b net: datagram: fix some kernel-doc markups
Some identifiers have different names between their prototypes
and the kernel-doc markup.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17 14:15:03 -08:00
Heiner Kallweit a18394269f net: core: add dev_get_tstats64 as a ndo_get_stats64 implementation
It's a frequent pattern to use netdev->stats for the less frequently
accessed counters and per-cpu counters for the frequently accessed
counters (rx/tx bytes/packets). Add a default ndo_get_stats64()
implementation for this use case.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-09 17:50:27 -08:00
Tom Rix 5d867245c4 net: core: remove unneeded semicolon
A semicolon is not needed after a switch statement.

Signed-off-by: Tom Rix <trix@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20201101153647.2292322-1-trix@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-02 17:51:02 -08:00
Jakub Kicinski 1c29d98990 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-29 14:08:40 -07:00
Yi Li 3aefd7d6ea net: core: Use skb_is_gso() in skb_checksum_help()
No functional changes, just minor refactoring.

Signed-off-by: Yi Li <yili@winhong.com>
Link: https://lore.kernel.org/r/20201027055904.2683444-1-yili@winhong.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-27 17:36:11 -07:00
Willy Tarreau 3744741ada random32: add noise from network and scheduling activity
With the removal of the interrupt perturbations in previous random32
change (random32: make prandom_u32() output unpredictable), the PRNG
has become 100% deterministic again. While SipHash is expected to be
way more robust against brute force than the previous Tausworthe LFSR,
there's still the risk that whoever has even one temporary access to
the PRNG's internal state is able to predict all subsequent draws till
the next reseed (roughly every minute). This may happen through a side
channel attack or any data leak.

This patch restores the spirit of commit f227e3ec3b ("random32: update
the net random state on interrupt and activity") in that it will perturb
the internal PRNG's state using externally collected noise, except that
it will not pick that noise from the random pool's bits nor upon
interrupt, but will rather combine a few elements along the Tx path
that are collectively hard to predict, such as dev, skb and txq
pointers, packet length and jiffies values. These ones are combined
using a single round of SipHash into a single long variable that is
mixed with the net_rand_state upon each invocation.

The operation was inlined because it produces very small and efficient
code, typically 3 xor, 2 add and 2 rol. The performance was measured
to be the same as (even very slightly better than) before the switch to
SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC
(i40e), the connection rate dropped from 556k/s to 555k/s while the
SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.

Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
Cc: George Spelvin <lkml@sdf.org>
Cc: Amit Klein <aksecurity@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tytso@mit.edu
Cc: Florian Westphal <fw@strlen.de>
Cc: Marc Plumb <lkml.mplumb@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2020-10-24 20:21:57 +02:00
Taehee Yoo 0e8b8d6a2d net: core: use list_del_init() instead of list_del() in netdev_run_todo()
dev->unlink_list is reused unless dev is deleted.
So, list_del() should not be used.
Due to using list_del(), dev->unlink_list can't be reused so that
dev->nested_level update logic doesn't work.
In order to fix this bug, list_del_init() should be used instead
of list_del().
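
The difference between the two helpers can be illustrated with a small
userspace model, written in Python rather than kernel C. It is a sketch, not
the kernel's list.h: POISON is a stand-in for the kernel's poison pointers,
and the point shown is only that an entry removed with list_del() never looks
like an empty list again, while one removed with list_del_init() can be
tested and re-added.

```python
class ListHead:
    """Minimal model of a circular doubly linked list node."""

    def __init__(self):
        self.next = self
        self.prev = self


POISON = object()  # stand-in for the kernel's list poison values


def list_empty(head):
    return head.next is head


def list_add(new, head):
    # Insert 'new' right after 'head'.
    new.next = head.next
    new.prev = head
    head.next.prev = new
    head.next = new


def _unlink(entry):
    entry.prev.next = entry.next
    entry.next.prev = entry.prev


def list_del(entry):
    _unlink(entry)
    entry.next = entry.prev = POISON  # unusable until re-initialised


def list_del_init(entry):
    _unlink(entry)
    entry.next = entry.prev = entry  # an empty, reusable list again
```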

Test commands:
    ip link add bond0 type bond
    ip link add bond1 type bond
    ip link set bond0 master bond1
    ip link set bond0 nomaster
    ip link set bond1 master bond0
    ip link set bond1 nomaster

Splat looks like:
[  255.750458][ T1030] ============================================
[  255.751967][ T1030] WARNING: possible recursive locking detected
[  255.753435][ T1030] 5.9.0-rc8+ #772 Not tainted
[  255.754553][ T1030] --------------------------------------------
[  255.756047][ T1030] ip/1030 is trying to acquire lock:
[  255.757304][ T1030] ffff88811782a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: dev_mc_sync_multiple+0xc2/0x150
[  255.760056][ T1030]
[  255.760056][ T1030] but task is already holding lock:
[  255.761862][ T1030] ffff88811130a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: bond_enslave+0x3d4d/0x43e0 [bonding]
[  255.764581][ T1030]
[  255.764581][ T1030] other info that might help us debug this:
[  255.766645][ T1030]  Possible unsafe locking scenario:
[  255.766645][ T1030]
[  255.768566][ T1030]        CPU0
[  255.769415][ T1030]        ----
[  255.770259][ T1030]   lock(&dev_addr_list_lock_key/1);
[  255.771629][ T1030]   lock(&dev_addr_list_lock_key/1);
[  255.772994][ T1030]
[  255.772994][ T1030]  *** DEADLOCK ***
[  255.772994][ T1030]
[  255.775091][ T1030]  May be due to missing lock nesting notation
[  255.775091][ T1030]
[  255.777182][ T1030] 2 locks held by ip/1030:
[  255.778299][ T1030]  #0: ffffffffb1f63250 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x2e4/0x8b0
[  255.780600][ T1030]  #1: ffff88811130a280 (&dev_addr_list_lock_key/1){+...}-{2:2}, at: bond_enslave+0x3d4d/0x43e0 [bonding]
[  255.783411][ T1030]
[  255.783411][ T1030] stack backtrace:
[  255.784874][ T1030] CPU: 7 PID: 1030 Comm: ip Not tainted 5.9.0-rc8+ #772
[  255.786595][ T1030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  255.789030][ T1030] Call Trace:
[  255.789850][ T1030]  dump_stack+0x99/0xd0
[  255.790882][ T1030]  __lock_acquire.cold.71+0x166/0x3cc
[  255.792285][ T1030]  ? register_lock_class+0x1a30/0x1a30
[  255.793619][ T1030]  ? rcu_read_lock_sched_held+0x91/0xc0
[  255.794963][ T1030]  ? rcu_read_lock_bh_held+0xa0/0xa0
[  255.796246][ T1030]  lock_acquire+0x1b8/0x850
[  255.797332][ T1030]  ? dev_mc_sync_multiple+0xc2/0x150
[  255.798624][ T1030]  ? bond_enslave+0x3d4d/0x43e0 [bonding]
[  255.800039][ T1030]  ? check_flags+0x50/0x50
[  255.801143][ T1030]  ? lock_contended+0xd80/0xd80
[  255.802341][ T1030]  _raw_spin_lock_nested+0x2e/0x70
[  255.803592][ T1030]  ? dev_mc_sync_multiple+0xc2/0x150
[  255.804897][ T1030]  dev_mc_sync_multiple+0xc2/0x150
[  255.806168][ T1030]  bond_enslave+0x3d58/0x43e0 [bonding]
[  255.807542][ T1030]  ? __lock_acquire+0xe53/0x51b0
[  255.808824][ T1030]  ? bond_update_slave_arr+0xdc0/0xdc0 [bonding]
[  255.810451][ T1030]  ? check_chain_key+0x236/0x5e0
[  255.811742][ T1030]  ? mutex_is_locked+0x13/0x50
[  255.812910][ T1030]  ? rtnl_is_locked+0x11/0x20
[  255.814061][ T1030]  ? netdev_master_upper_dev_get+0xf/0x120
[  255.815553][ T1030]  do_setlink+0x94c/0x3040
[ ... ]

Reported-by: syzbot+4a0f7bc34e3997a6c7df@syzkaller.appspotmail.com
Fixes: 1fc70edb7d ("net: core: add nested_level variable in net_device")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Link: https://lore.kernel.org/r/20201015162606.9377-1-ap420073@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-18 14:50:25 -07:00
Heiner Kallweit 44fa32f008 net: add function dev_fetch_sw_netstats for fetching pcpu_sw_netstats
In several places the same code is used to populate rtnl_link_stats64
fields with data from pcpu_sw_netstats. Therefore factor out this code
to a new function dev_fetch_sw_netstats().

v2:
- constify argument netstats
- don't ignore netstats being NULL or an ERRPTR
- switch to EXPORT_SYMBOL_GPL
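
The factored-out pattern is a per-CPU sum, which a short Python model can
show. This is a sketch under assumptions: the field names follow
pcpu_sw_netstats, the function name is hypothetical, and the model leaves out
the synchronisation the kernel needs to read each CPU's counters
consistently.

```python
from dataclasses import dataclass


@dataclass
class SwNetstats:
    """One CPU's software counters."""
    rx_packets: int = 0
    rx_bytes: int = 0
    tx_packets: int = 0
    tx_bytes: int = 0


FIELDS = ("rx_packets", "rx_bytes", "tx_packets", "tx_bytes")


def fetch_sw_netstats(totals, per_cpu):
    """Add every CPU's counters into the 'totals' dict, in place."""
    for cpu_stats in per_cpu:
        for field in FIELDS:
            totals[field] = totals.get(field, 0) + getattr(cpu_stats, field)
    return totals
```

Because the helper adds to the totals rather than overwriting them, a caller
can pre-fill less frequently updated counters before the per-CPU pass.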

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/6d16a338-52f5-df69-0020-6bc771a7d498@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-13 17:33:48 -07:00
Jakub Kicinski ccdf7fae3a Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-10-12

The main changes are:

1) The BPF verifier improvements to track register allocation pattern, from Alexei and Yonghong.

2) libbpf relocation support for different size load/store, from Andrii.

3) bpf_redirect_peer() helper and support for inner map array with different max_entries, from Daniel.

4) BPF support for per-cpu variables, from Hao.

5) sockmap improvements, from John.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-12 16:16:50 -07:00
Daniel Borkmann 9aa1206e8f bpf: Add redirect_peer helper
Add an efficient ingress to ingress netns switch that can be used out of tc BPF
programs in order to redirect traffic from host ns ingress into a container
veth device ingress without having to go via CPU backlog queue [0]. For local
containers this can also be utilized and path via CPU backlog queue only needs
to be taken once, not twice. On a high level this borrows from ipvlan which does
similar switch in __netif_receive_skb_core() and then iterates via another_round.
This helps to reduce latency for mentioned use cases.

Pod to remote pod with redirect(), TCP_RR [1]:

  # percpu_netperf 10.217.1.33
          RT_LATENCY:         122.450         (per CPU:         122.666         122.401         122.333         122.401 )
        MEAN_LATENCY:         121.210         (per CPU:         121.100         121.260         121.320         121.160 )
      STDDEV_LATENCY:         120.040         (per CPU:         119.420         119.910         125.460         115.370 )
         MIN_LATENCY:          46.500         (per CPU:          47.000          47.000          47.000          45.000 )
         P50_LATENCY:         118.500         (per CPU:         118.000         119.000         118.000         119.000 )
         P90_LATENCY:         127.500         (per CPU:         127.000         128.000         127.000         128.000 )
         P99_LATENCY:         130.750         (per CPU:         131.000         131.000         129.000         132.000 )

    TRANSACTION_RATE:       32666.400         (per CPU:        8152.200        8169.842        8174.439        8169.897 )

Pod to remote pod with redirect_peer(), TCP_RR:

  # percpu_netperf 10.217.1.33
          RT_LATENCY:          44.449         (per CPU:          43.767          43.127          45.279          45.622 )
        MEAN_LATENCY:          45.065         (per CPU:          44.030          45.530          45.190          45.510 )
      STDDEV_LATENCY:          84.823         (per CPU:          66.770          97.290          84.380          90.850 )
         MIN_LATENCY:          33.500         (per CPU:          33.000          33.000          34.000          34.000 )
         P50_LATENCY:          43.250         (per CPU:          43.000          43.000          43.000          44.000 )
         P90_LATENCY:          46.750         (per CPU:          46.000          47.000          47.000          47.000 )
         P99_LATENCY:          52.750         (per CPU:          51.000          54.000          53.000          53.000 )

    TRANSACTION_RATE:       90039.500         (per CPU:       22848.186       23187.089       22085.077       21919.130 )

  [0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf
  [1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net
2020-10-11 10:21:04 -07:00
David S. Miller 8b0308fe31 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Rejecting non-native endian BTF overlapped with the addition
of support for it.

The rest were more simple overlapping changes, except the
renesas ravb binding update, which had to follow a file
move as well as a YAML conversion.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-10-05 18:40:01 -07:00
Sebastian Andrzej Siewior c11171a413 net: Add netif_rx_any_context()
Quite some drivers make conditional decisions based on in_interrupt() to
invoke either netif_rx() or netif_rx_ni().

Conditionals based on in_interrupt() or other variants of preempt count
checks in drivers should not exist for various reasons and Linus clearly
requested to either split the code paths or pass an argument to the
common functions which provides the context.

This is obviously the correct solution, but for some of the affected
drivers this needs a major rewrite due to their convoluted structure.

As in_interrupt() usage in drivers needs to be phased out, provide
netif_rx_any_context() as a stop gap for these drivers.

This confines the in_interrupt() conditional to core code which in turn
allows to remove the access to this check for driver code and provides one
central place to do further modifications once the driver maze is cleaned
up.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-29 14:02:53 -07:00
Taehee Yoo 1fc70edb7d net: core: add nested_level variable in net_device
This patch is to add a new variable 'nested_level' into the net_device
structure.
This variable will be used as a parameter of spin_lock_nested() of
dev->addr_list_lock.

netif_addr_lock() can be called recursively so spin_lock_nested() is
used instead of spin_lock() and dev->lower_level is used as a parameter
of spin_lock_nested().
But, dev->lower_level value can be updated while it is being used.
So, lockdep would warn about a possible deadlock scenario.

When a stacked interface is deleted, netif_{uc | mc}_sync() is
called recursively.
So, spin_lock_nested() is called recursively too.
At this moment, the dev->lower_level variable is used as a parameter of it.
dev->lower_level value is updated when interfaces are being unlinked/linked
immediately.
Thus, after unlinking, dev->lower_level shouldn't be a parameter of
spin_lock_nested().

    A (macvlan)
    |
    B (vlan)
    |
    C (bridge)
    |
    D (macvlan)
    |
    E (vlan)
    |
    F (bridge)

    A->lower_level : 6
    B->lower_level : 5
    C->lower_level : 4
    D->lower_level : 3
    E->lower_level : 2
    F->lower_level : 1

When an interface 'A' is removed, it releases resources.
At this moment, netif_addr_lock() would be called.
Then, netdev_upper_dev_unlink() is called recursively.
Then dev->lower_level is updated.
There is no problem.

But, when the bridge module is removed, 'C' and 'F' interfaces
are removed at once.
If 'F' is removed first, a lower_level value is like below.
    A->lower_level : 5
    B->lower_level : 4
    C->lower_level : 3
    D->lower_level : 2
    E->lower_level : 1
    F->lower_level : 1

Then, 'C' is removed. At this moment, netif_addr_lock() is called
recursively.
The ordering is like this.
C(3)->D(2)->E(1)->F(1)
At this moment, the lower_level values of 'E' and 'F' are the same.
So, lockdep warns about a possible deadlock scenario.
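
The collision can be checked with a few lines of Python. This is only a
model, assuming all the address-list locks share one lock class so that the
nesting subclass is the sole thing telling them apart; the function name is
hypothetical and the levels are the ones listed in this message.

```python
def subclass_collisions(lock_chain):
    """Return pairs of devices that are locked with the same subclass."""
    seen = {}
    clashes = []
    for name, subclass in lock_chain:
        if subclass in seen:
            clashes.append((seen[subclass], name))
        else:
            seen[subclass] = name
    return clashes


# Locking order while 'C' is removed, after 'F' is already gone.
after_f_removed = [("C", 3), ("D", 2), ("E", 1), ("F", 1)]

# Original stack, before any interface is removed.
original = [("A", 6), ("B", 5), ("C", 4), ("D", 3), ("E", 2), ("F", 1)]
```

Running subclass_collisions() on the first chain reports the ('E', 'F')
pair, while the original stack yields no collisions.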

In order to avoid this problem, a new variable 'nested_level' is added.
This value is the same as dev->lower_level - 1.
But this value is updated in rtnl_unlock().
So, this variable can be used as a parameter of spin_lock_nested() safely
in the rtnl context.

Test commands:
   ip link add br0 type bridge vlan_filtering 1
   ip link add vlan1 link br0 type vlan id 10
   ip link add macvlan2 link vlan1 type macvlan
   ip link add br3 type bridge vlan_filtering 1
   ip link set macvlan2 master br3
   ip link add vlan4 link br3 type vlan id 10
   ip link add macvlan5 link vlan4 type macvlan
   ip link add br6 type bridge vlan_filtering 1
   ip link set macvlan5 master br6
   ip link add vlan7 link br6 type vlan id 10
   ip link add macvlan8 link vlan7 type macvlan

   ip link set br0 up
   ip link set vlan1 up
   ip link set macvlan2 up
   ip link set br3 up
   ip link set vlan4 up
   ip link set macvlan5 up
   ip link set br6 up
   ip link set vlan7 up
   ip link set macvlan8 up
   modprobe -rv bridge

Splat looks like:
[   36.057436][  T744] WARNING: possible recursive locking detected
[   36.058848][  T744] 5.9.0-rc6+ #728 Not tainted
[   36.059959][  T744] --------------------------------------------
[   36.061391][  T744] ip/744 is trying to acquire lock:
[   36.062590][  T744] ffff8c4767509280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_set_rx_mode+0x19/0x30
[   36.064922][  T744]
[   36.064922][  T744] but task is already holding lock:
[   36.066626][  T744] ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60
[   36.068851][  T744]
[   36.068851][  T744] other info that might help us debug this:
[   36.070731][  T744]  Possible unsafe locking scenario:
[   36.070731][  T744]
[   36.072497][  T744]        CPU0
[   36.073238][  T744]        ----
[   36.074007][  T744]   lock(&vlan_netdev_addr_lock_key);
[   36.075290][  T744]   lock(&vlan_netdev_addr_lock_key);
[   36.076590][  T744]
[   36.076590][  T744]  *** DEADLOCK ***
[   36.076590][  T744]
[   36.078515][  T744]  May be due to missing lock nesting notation
[   36.078515][  T744]
[   36.080491][  T744] 3 locks held by ip/744:
[   36.081471][  T744]  #0: ffffffff98571df0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x236/0x490
[   36.083614][  T744]  #1: ffff8c4767769280 (&vlan_netdev_addr_lock_key){+...}-{2:2}, at: dev_uc_add+0x1e/0x60
[   36.085942][  T744]  #2: ffff8c476c8da280 (&bridge_netdev_addr_lock_key/4){+...}-{2:2}, at: dev_uc_sync+0x39/0x80
[   36.088400][  T744]
[   36.088400][  T744] stack backtrace:
[   36.089772][  T744] CPU: 6 PID: 744 Comm: ip Not tainted 5.9.0-rc6+ #728
[   36.091364][  T744] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   36.093630][  T744] Call Trace:
[   36.094416][  T744]  dump_stack+0x77/0x9b
[   36.095385][  T744]  __lock_acquire+0xbc3/0x1f40
[   36.096522][  T744]  lock_acquire+0xb4/0x3b0
[   36.097540][  T744]  ? dev_set_rx_mode+0x19/0x30
[   36.098657][  T744]  ? rtmsg_ifinfo+0x1f/0x30
[   36.099711][  T744]  ? __dev_notify_flags+0xa5/0xf0
[   36.100874][  T744]  ? rtnl_is_locked+0x11/0x20
[   36.101967][  T744]  ? __dev_set_promiscuity+0x7b/0x1a0
[   36.103230][  T744]  _raw_spin_lock_bh+0x38/0x70
[   36.104348][  T744]  ? dev_set_rx_mode+0x19/0x30
[   36.105461][  T744]  dev_set_rx_mode+0x19/0x30
[   36.106532][  T744]  dev_set_promiscuity+0x36/0x50
[   36.107692][  T744]  __dev_set_promiscuity+0x123/0x1a0
[   36.108929][  T744]  dev_set_promiscuity+0x1e/0x50
[   36.110093][  T744]  br_port_set_promisc+0x1f/0x40 [bridge]
[   36.111415][  T744]  br_manage_promisc+0x8b/0xe0 [bridge]
[   36.112728][  T744]  __dev_set_promiscuity+0x123/0x1a0
[   36.113967][  T744]  ? __hw_addr_sync_one+0x23/0x50
[   36.115135][  T744]  __dev_set_rx_mode+0x68/0x90
[   36.116249][  T744]  dev_uc_sync+0x70/0x80
[   36.117244][  T744]  dev_uc_add+0x50/0x60
[   36.118223][  T744]  macvlan_open+0x18e/0x1f0 [macvlan]
[   36.119470][  T744]  __dev_open+0xd6/0x170
[   36.120470][  T744]  __dev_change_flags+0x181/0x1d0
[   36.121644][  T744]  dev_change_flags+0x23/0x60
[   36.122741][  T744]  do_setlink+0x30a/0x11e0
[   36.123778][  T744]  ? __lock_acquire+0x92c/0x1f40
[   36.124929][  T744]  ? __nla_validate_parse.part.6+0x45/0x8e0
[   36.126309][  T744]  ? __lock_acquire+0x92c/0x1f40
[   36.127457][  T744]  __rtnl_newlink+0x546/0x8e0
[   36.128560][  T744]  ? lock_acquire+0xb4/0x3b0
[   36.129623][  T744]  ? deactivate_slab.isra.85+0x6a1/0x850
[   36.130946][  T744]  ? __lock_acquire+0x92c/0x1f40
[   36.132102][  T744]  ? lock_acquire+0xb4/0x3b0
[   36.133176][  T744]  ? is_bpf_text_address+0x5/0xe0
[   36.134364][  T744]  ? rtnl_newlink+0x2e/0x70
[   36.135445][  T744]  ? rcu_read_lock_sched_held+0x32/0x60
[   36.136771][  T744]  ? kmem_cache_alloc_trace+0x2d8/0x380
[   36.138070][  T744]  ? rtnl_newlink+0x2e/0x70
[   36.139164][  T744]  rtnl_newlink+0x47/0x70
[ ... ]

Fixes: 845e0ebb44 ("net: change addr_list_lock back to static key")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28 15:00:15 -07:00
Taehee Yoo eff7423365 net: core: introduce struct netdev_nested_priv for nested interface infrastructure
Functions related to nested interface infrastructure such as
netdev_walk_all_{ upper | lower }_dev() pass both private functions
and "data" pointer to handle their own things.
At this point, the data pointer type is void *.
In order to make it easier to expand common variables and functions,
this new netdev_nested_priv structure is added.

In the following patch, a new member variable will be added into this
struct to fix the lockdep issue.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28 15:00:15 -07:00
Taehee Yoo fe8300fd8d net: core: add __netdev_upper_dev_unlink()
The netdev_upper_dev_unlink() has to work differently according to flags.
This idea is the same as __netdev_upper_dev_link().

In the following patches, new flags will be added.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-28 15:00:14 -07:00
Mauro Carvalho Chehab de2b541b3b net: fix a new kernel-doc warning at dev.c
kernel-doc expects the function prototype to be just after
the kernel-doc markup, as otherwise it will get it all wrong:

	./net/core/dev.c:10036: warning: Excess function parameter 'dev' description in 'WAIT_REFS_MIN_MSECS'

Fixes: 0e4be9e57e ("net: use exponential backoff in netdev_wait_allrefs")
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-23 17:42:54 -07:00
David S. Miller 6d772f328d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-09-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 95 non-merge commits during the last 22 day(s) which contain
a total of 124 files changed, 4211 insertions(+), 2040 deletions(-).

The main changes are:

1) Full multi function support in libbpf, from Andrii.

2) Refactoring of function argument checks, from Lorenz.

3) Make bpf_tail_call compatible with functions (subprograms), from Maciej.

4) Program metadata support, from YiFei.

5) bpf iterator optimizations, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-23 13:11:11 -07:00
David S. Miller 3ab0a7a0c3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Two minor conflicts:

1) net/ipv4/route.c, adding a new local variable while
   moving another local variable and removing its
   initial assignment.

2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
   One pretty prints the port mode differently, whilst another
   changes the driver to try and obtain the port mode from
   the port node rather than the switch node.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-22 16:45:34 -07:00
Randy Dunlap 4250b75b40 net: core: delete duplicated words
Drop repeated words in net/core/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 14:12:43 -07:00
Francesco Ruggeri 0e4be9e57e net: use exponential backoff in netdev_wait_allrefs
The combination of aca_free_rcu, introduced in commit 2384d02520
("net/ipv6: Add anycast addresses to a global hashtable"), and
fib6_info_destroy_rcu, introduced in commit 9b0a8da8c4 ("net/ipv6:
respect rcu grace period before freeing fib6_info"), can result in
an extra rcu grace period being needed when deleting an interface,
with the result that netdev_wait_allrefs ends up hitting the msleep(250),
which is considerably longer than the required grace period.
This can result in long delays when deleting a large number of interfaces,
and it can be observed with this script:

ns=dummy-ns
NIFS=100

ip netns add $ns
ip netns exec $ns ip link set lo up
ip netns exec $ns sysctl net.ipv6.conf.default.disable_ipv6=0
ip netns exec $ns sysctl net.ipv6.conf.default.forwarding=1

for ((i=0; i<$NIFS; i++))
do
        if=eth$i
        ip netns exec $ns ip link add $if type dummy
        ip netns exec $ns ip link set $if up
        ip netns exec $ns ip -6 addr add 2021:$i::1/120 dev $if
done

for ((i=0; i<$NIFS; i++))
do
        if=eth$i
        ip netns exec $ns ip link del $if
done

ip netns del $ns

Instead of using a fixed msleep(250), this patch tries an extra
rcu_barrier() followed by an exponential backoff.
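
The shape of such a backoff can be sketched in Python. The bounds are
assumptions chosen for illustration: the 250 ms cap echoes the old
msleep(250), the 1 ms starting point is hypothetical, and the function names
do not come from the patch.

```python
from itertools import islice


def backoff_delays(min_ms=1, max_ms=250):
    """Yield sleep intervals that double from min_ms up to a max_ms cap."""
    wait = min_ms
    while True:
        yield wait
        wait = min(wait * 2, max_ms)


def first_delays(n, min_ms=1, max_ms=250):
    """Convenience wrapper: the first n intervals as a list."""
    return list(islice(backoff_delays(min_ms, max_ms), n))
```

With these bounds the first eight sleeps add up to 255 ms, so a reference
that drops within a few milliseconds is noticed long before a fixed 250 ms
sleep would have returned.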

Time with this patch on a 5.4 kernel:

real	0m7.704s
user	0m0.385s
sys	0m1.230s

Time without this patch:

real    0m31.522s
user    0m0.438s
sys     0m1.156s

v2: use exponential backoff instead of trying to wake up
    netdev_wait_allrefs.
v3: preserve reverse christmas tree ordering of local variables
v4: try an extra rcu_barrier before the backoff, plus some
    cosmetic changes.

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 13:47:31 -07:00
YiFei Zhu 984fe94f94 bpf: Mutex protect used_maps array and count
To support modifying the used_maps array, we use a mutex to protect
the use of the counter and the array. The mutex is initialized right
after the prog aux is allocated, and destroyed right before prog
aux is freed. This way we guarantee it's initialized for both cBPF
and eBPF.

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: YiFei Zhu <zhuyifei1999@gmail.com>
Link: https://lore.kernel.org/bpf/20200915234543.3220146-2-sdf@google.com
2020-09-15 18:28:27 -07:00
Vladimir Oltean b14a9fc452 __netif_receive_skb_core: don't untag vlan from skb on DSA master
A DSA master interface has upper network devices, each representing an
Ethernet switch port attached to it. Demultiplexing the source ports and
setting skb->dev accordingly is done through the catch-all ETH_P_XDSA
packet_type handler. Catch-all because DSA vendors have various header
implementations, which can be placed anywhere in the frame: before the
DMAC, before the EtherType, before the FCS, etc. So, the ETH_P_XDSA
handler acts like an rx_handler more than anything.

It is unlikely for the DSA master interface to have any other upper than
the DSA switch interfaces themselves. Only maybe a bridge upper*, but it
is very likely that the DSA master will have no 8021q upper. So
__netif_receive_skb_core() will try to untag the VLAN, despite the fact
that the DSA switch interface might have an 8021q upper. So the skb will
never reach that.

So far, this hasn't been a problem because most of the possible
placements of the DSA switch header mentioned in the first paragraph
will displace the VLAN header when the DSA master receives the frame, so
__netif_receive_skb_core() will not actually execute any VLAN-specific
code for it. This only becomes a problem when the DSA switch header does
not displace the VLAN header (for example with a tail tag).

What the patch does is it bypasses the untagging of the skb when there
is a DSA switch attached to this net device. So, DSA is the only
packet_type handler which requires seeing the VLAN header. Once skb->dev
will be changed, __netif_receive_skb_core() will be invoked again and
untagging, or delivery to an 8021q upper, will happen in the RX of the
DSA switch interface itself.

*see commit 9eb8eff0cf ("net: bridge: allow enslaving some DSA master
network devices"). This is actually the reason why I prefer keeping DSA
as a packet_type handler of ETH_P_XDSA rather than converting to an
rx_handler. Currently the rx_handler code doesn't support chaining, and
this is a problem because a DSA master might be bridged.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 16:34:18 -07:00
Paolo Abeni 2de79ee27f net: try to avoid unneeded backlog flush
flush_all_backlogs() may cause deadlock on systems
running processes with FIFO scheduling policy.

The above is critical in -RT scenarios, where user-space
specifically ensures no network activity is scheduled on
the CPU running the mentioned FIFO process, but still gets
stuck.

This commit tries to address the problem by checking the
backlog status on the remote CPUs before scheduling the
flush operation. If the backlog is empty, we can skip it.
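
The check described above can be sketched in userspace C roughly as
follows. This is an illustration only, not the kernel code: struct
cpu_backlog, backlog_is_empty() and pick_cpus_to_flush() are hypothetical
names, and a plain 32-bit mask stands in for the kernel's cpumask.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU backlog state. */
struct cpu_backlog {
	unsigned int input_queue_len;	/* packets waiting to be processed */
	unsigned int process_queue_len;	/* packets currently being processed */
};

static int backlog_is_empty(const struct cpu_backlog *b)
{
	return b->input_queue_len == 0 && b->process_queue_len == 0;
}

/*
 * Return a bitmask of CPUs that actually need a flush. The mask starts
 * out cleared on every call, so bits from an earlier call cannot leak in.
 */
static uint32_t pick_cpus_to_flush(const struct cpu_backlog *backlogs, int ncpus)
{
	uint32_t mask = 0;
	int cpu;

	assert(ncpus >= 0 && ncpus <= 32);
	for (cpu = 0; cpu < ncpus; cpu++) {
		if (!backlog_is_empty(&backlogs[cpu]))
			mask |= (uint32_t)1 << cpu;
	}
	return mask;
}
```

Only the CPUs whose bit is set would then have flush work scheduled and
waited for; a CPU with an empty backlog is never touched.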

v1 -> v2:
 - explicitly clear flushed cpu mask - Eric

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 14:39:00 -07:00
Ido Schimmel e1b9efe6ba net: Fix bridge enslavement failure
When a netdev is enslaved to a bridge, its parent identifier is queried.
This is done so that packets that were already forwarded in hardware
will not be forwarded again by the bridge device between netdevs
belonging to the same hardware instance.

The operation fails when the netdev is an upper of netdevs with
different parent identifiers.

Instead of failing the enslavement, have dev_get_port_parent_id() return
'-EOPNOTSUPP' which will signal the bridge to skip the query operation.
Other callers of the function are not affected by this change.
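
A simplified userspace sketch of the behaviour described above, under
stated assumptions: struct sketch_netdev, sketch_get_parent_id() and
sketch_enslave() are hypothetical names, and the lookup is reduced to
comparing integer IDs of the lower devices. It is not the kernel
implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily reduced model of a net device. */
struct sketch_netdev {
	int has_parent_id;		/* non-zero if the device reports an ID itself */
	int parent_id;
	struct sketch_netdev **lowers;	/* lower devices, may be NULL */
	size_t n_lowers;
};

/* Returns 0 and fills *id, or -EOPNOTSUPP if no single ID applies. */
static int sketch_get_parent_id(const struct sketch_netdev *dev, int *id)
{
	int first = 0, have_first = 0;
	size_t i;

	assert(dev != NULL && id != NULL);
	if (dev->has_parent_id) {
		*id = dev->parent_id;
		return 0;
	}
	for (i = 0; i < dev->n_lowers; i++) {
		int lower_id;

		if (sketch_get_parent_id(dev->lowers[i], &lower_id))
			continue;	/* this lower has no ID of its own */
		if (!have_first) {
			first = lower_id;
			have_first = 1;
		} else if (lower_id != first) {
			return -EOPNOTSUPP;	/* lowers disagree */
		}
	}
	if (!have_first)
		return -EOPNOTSUPP;
	*id = first;
	return 0;
}

/* Bridge-style caller: -EOPNOTSUPP means "skip the query", not "fail". */
static int sketch_enslave(const struct sketch_netdev *dev, int *hw_id, int *has_hw_id)
{
	int err = sketch_get_parent_id(dev, hw_id);

	if (err == -EOPNOTSUPP) {
		*has_hw_id = 0;
		return 0;
	}
	if (err)
		return err;
	*has_hw_id = 1;
	return 0;
}
```

With this shape, an upper device stacked on lowers with different parent
identifiers can still be enslaved; it simply carries no hardware
identifier of its own.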

Fixes: 7e1146e8c1 ("net: devlink: introduce devlink_compat_switch_id_get() helper")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reported-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 15:06:48 -07:00
Jakub Kicinski 5251ef8299 net: make sure napi_list is safe for RCU traversal
netpoll needs to traverse dev->napi_list under RCU, make
sure it uses the right iterator and that removal from this
list is handled safely.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:08:46 -07:00
Jakub Kicinski 4d092dd204 net: manage napi add/del idempotence explicitly
To RCUify napi->dev_list we need to replace list_del_init()
with list_del_rcu(). There is no _init() version for RCU for
obvious reasons. Up until now netif_napi_del() was idempotent
so to make sure it remains such add a bit which is set when
NAPI is listed, and cleared when it is removed. Since we don't
expect multiple calls to netif_napi_add() to be correct,
add a warning on that side.

Now that napi_hash_add / napi_hash_del are only called by
napi_add / del we can actually steal its bit. We just need
to make sure hash node is initialized correctly.
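
The "listed" bit idea can be sketched in plain C as follows. This is a
minimal, non-atomic illustration: struct sketch_napi, SKETCH_STATE_LISTED
and the two helpers are hypothetical names, and a counter stands in for
the real list. The kernel uses atomic bit operations and RCU list
primitives instead.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_STATE_LISTED (1u << 0)	/* hypothetical "is listed" bit */

struct sketch_napi {
	unsigned int state;
};

/* Number of instances currently linked; stands in for the real list. */
static int sketch_listed_count;

/* Returns true if the instance was added, false on a repeated add. */
static bool sketch_napi_add(struct sketch_napi *n)
{
	if (n->state & SKETCH_STATE_LISTED)
		return false;	/* the commit adds a warning for this case */
	n->state |= SKETCH_STATE_LISTED;
	sketch_listed_count++;
	return true;
}

/* Safe to call more than once: only the first call unlinks. */
static void sketch_napi_del(struct sketch_napi *n)
{
	if (!(n->state & SKETCH_STATE_LISTED))
		return;
	n->state &= ~SKETCH_STATE_LISTED;
	sketch_listed_count--;
	assert(sketch_listed_count >= 0);
}
```

Because the bit records whether the entry is linked, the delete path no
longer needs the list node to be re-initialized in order to stay
idempotent.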

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:08:46 -07:00
Jakub Kicinski 5198d545db net: remove napi_hash_del() from driver-facing API
We allow drivers to call napi_hash_del() before calling
netif_napi_del() to batch RCU grace periods. This makes
the API asymmetric and leaks internal implementation details.
Soon we will want the grace period to protect more than just
the NAPI hash table.

Restructure the API and have drivers call a new function -
__netif_napi_del() if they want to take care of RCU waits.

Note that only core was checking the return status from
napi_hash_del() so the new helper does not report if the
NAPI was actually deleted.

Some notes on driver oddness:
 - veth observed the grace period before calling netif_napi_del()
   but that should not matter
 - myri10ge observed normal RCU flavor
 - bnx2x and enic did not actually observe the grace period
   (unless they did so implicitly)
 - virtio_net and enic only unhashed Rx NAPIs

The last two points seem to indicate that the calls to
napi_hash_del() were a left over rather than an optimization.
Regardless, it's easy enough to correct them.

This patch may introduce extra synchronize_net() calls for
interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on
free_netdev() to call netif_napi_del(). This seems inevitable
since we want to use RCU for netpoll dev->napi_list traversal,
and almost no drivers set IFF_DISABLE_NETPOLL.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:08:46 -07:00
Jonathan Neuschäfer ee1a4c84a7 net: Add a missing word
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-06 12:13:11 -07:00
Linus Torvalds 3e8d3bdc2a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Use netif_rx_ni() when necessary in batman-adv stack, from Jussi
    Kivilinna.

 2) Fix loss of RTT samples in rxrpc, from David Howells.

 3) Memory leak in hns_nic_dev_probe(), from Dinghao Liu.

 4) ravb module cannot be unloaded, fix from Yuusuke Ashizuka.

 5) We disable BH for too long in sctp_get_port_local(), add a
    cond_resched() here as well, from Xin Long.

 6) Fix memory leak in st95hf_in_send_cmd, from Dinghao Liu.

 7) Out of bound access in bpf_raw_tp_link_fill_link_info(), from
    Yonghong Song.

 8) Missing of_node_put() in mt7530 DSA driver, from Sumera
    Priyadarsini.

 9) Fix crash in bnxt_fw_reset_task(), from Michael Chan.

10) Fix geneve tunnel checksumming bug in hns3, from Yi Li.

11) Memory leak in rxkad_verify_response, from Dinghao Liu.

12) In tipc, don't use smp_processor_id() in preemptible context. From
    Tuong Lien.

13) Fix signedness issue in mlx4 memory allocation, from Shung-Hsi Yu.

14) Missing clk_disable_unprepare() in gemini driver, from Dan Carpenter.

15) Fix ABI mismatch between driver and firmware in nfp, from Louis
    Peens.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (110 commits)
  net/smc: fix sock refcounting in case of termination
  net/smc: reset sndbuf_desc if freed
  net/smc: set rx_off for SMCR explicitly
  net/smc: fix toleration of fake add_link messages
  tg3: Fix soft lockup when tg3_reset_task() fails.
  doc: net: dsa: Fix typo in config code sample
  net: dp83867: Fix WoL SecureOn password
  nfp: flower: fix ABI mismatch between driver and firmware
  tipc: fix shutdown() of connectionless socket
  ipv6: Fix sysctl max for fib_multipath_hash_policy
  drivers/net/wan/hdlc: Change the default of hard_header_len to 0
  net: gemini: Fix another missing clk_disable_unprepare() in probe
  net: bcmgenet: fix mask check in bcmgenet_validate_flow()
  amd-xgbe: Add support for new port mode
  net: usb: dm9601: Add USB ID of Keenetic Plus DSL
  vhost: fix typo in error message
  net: ethernet: mlx4: Fix memory allocation in mlx4_buddy_init()
  pktgen: fix error message with wrong function name
  net: ethernet: ti: am65-cpsw: fix rmii 100Mbit link mode
  cxgb4: fix thermal zone device registration
  ...
2020-09-03 18:50:48 -07:00
Jakub Kicinski 96e97bc07e net: disable netpoll on fresh napis
napi_disable() makes sure to set the NAPI_STATE_NPSVC bit to prevent
netpoll from accessing rings before init is complete. However, the
same is not done for fresh napi instances in netif_napi_add(),
even though we expect NAPI instances to be added as disabled.

This causes crashes during driver reconfiguration (enabling XDP,
changing the channel count) - if there is any printk() after
netif_napi_add() but before napi_enable().

To ensure memory ordering is correct we need to use RCU accessors.

Reported-by: Rob Sherwood <rsher@fb.com>
Fixes: 2d8bff1269 ("netpoll: Close race condition between poll_one_napi and napi_disable")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 16:16:39 -07:00
Gustavo A. R. Silva df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Andrii Nakryiko c8a36f1945 bpf: xdp: Fix XDP mode when no mode flags specified
7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
inadvertently changed which XDP mode is assumed when no mode flags are
specified explicitly. Previously, driver mode was preferred, if driver
supported it. If not, generic SKB mode was chosen. That commit changed default
to SKB mode always. This patch fixes the issue and restores the original
logic.
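
The restored selection logic can be sketched as below. This is an
assumption-laden illustration, not the kernel function: the enum, the
SKETCH_F_* flag values and sketch_pick_mode() are hypothetical, and
driver support is reduced to a single boolean.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_xdp_mode { SKETCH_MODE_SKB, SKETCH_MODE_DRV, SKETCH_MODE_HW };

/* Hypothetical flag values; the real UAPI flags differ. */
#define SKETCH_F_SKB_MODE (1u << 0)
#define SKETCH_F_DRV_MODE (1u << 1)
#define SKETCH_F_HW_MODE  (1u << 2)

static enum sketch_xdp_mode sketch_pick_mode(unsigned int flags, bool driver_has_xdp)
{
	unsigned int m = flags & (SKETCH_F_SKB_MODE | SKETCH_F_DRV_MODE |
				  SKETCH_F_HW_MODE);

	/* The sketch expects the caller to pass at most one mode flag. */
	assert((m & (m - 1)) == 0);

	if (m & SKETCH_F_HW_MODE)
		return SKETCH_MODE_HW;
	if (m & SKETCH_F_DRV_MODE)
		return SKETCH_MODE_DRV;
	if (m & SKETCH_F_SKB_MODE)
		return SKETCH_MODE_SKB;
	/* No mode flag: prefer the driver, fall back to generic SKB mode. */
	return driver_has_xdp ? SKETCH_MODE_DRV : SKETCH_MODE_SKB;
}
```

The regression corresponds to the last line always returning SKB mode;
the fix makes the no-flag case depend on driver support again.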

Fixes: 7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
Reported-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/bpf/20200820052841.1559757-1-andriin@fb.com
2020-08-20 14:27:12 -07:00
David S. Miller 7f9bf6e824 Revert "net: xdp: pull ethernet header off packet after computing skb->protocol"
This reverts commit f8414a8d88.

eth_type_trans() does the necessary pull on the skb.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-17 11:48:05 -07:00
Jason A. Donenfeld f8414a8d88 net: xdp: pull ethernet header off packet after computing skb->protocol
When an XDP program changes the ethernet header protocol field,
eth_type_trans is used to recalculate skb->protocol. In order for
eth_type_trans to work correctly, the ethernet header must actually be
part of the skb data segment, so the code first pushes that onto the
head of the skb. However, it subsequently forgets to pull it back off,
making the behavior of the passed-on packet inconsistent between the
protocol modifying case and the static protocol case. This patch fixes
the issue by simply pulling the ethernet header back off of the skb
head.

Fixes: 2972495699 ("net: fix generic XDP to handle if eth header was mangled")
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 15:28:18 -07:00
Andrii Nakryiko 068d9d1eba bpf: Fix XDP FD-based attach/detach logic around XDP_FLAGS_UPDATE_IF_NOEXIST
Enforce XDP_FLAGS_UPDATE_IF_NOEXIST only if new BPF program to be attached is
non-NULL (i.e., we are not detaching a BPF program).
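
A minimal sketch of the corrected check, with hypothetical names
throughout (struct sketch_prog, SKETCH_F_UPDATE_IF_NOEXIST,
sketch_xdp_set_prog()). The -EBUSY return value is an assumption made
for the illustration and is not stated in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_F_UPDATE_IF_NOEXIST (1u << 0)	/* hypothetical flag value */

struct sketch_prog {
	int id;
};

/*
 * Install new_prog into *slot, or detach when new_prog is NULL.
 * The "only if nothing is attached yet" flag is honoured for attach
 * requests only, so a detach goes through even when the flag is set.
 */
static int sketch_xdp_set_prog(struct sketch_prog **slot,
			       struct sketch_prog *new_prog,
			       unsigned int flags)
{
	assert(slot != NULL);
	if (new_prog && (flags & SKETCH_F_UPDATE_IF_NOEXIST) && *slot)
		return -EBUSY;	/* assumed error code */
	*slot = new_prog;
	return 0;
}
```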

Fixes: d4baa9368a ("bpf, xdp: Extract common XDP program attachment logic")
Reported-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200812022923.1217922-1-andriin@fb.com
2020-08-12 18:00:49 -07:00
David S. Miller 2e7199bd77 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-08-04

The following pull-request contains BPF updates for your *net-next* tree.

We've added 73 non-merge commits during the last 9 day(s) which contain
a total of 135 files changed, 4603 insertions(+), 1013 deletions(-).

The main changes are:

1) Implement bpf_link support for XDP. Also add LINK_DETACH operation for the BPF
   syscall allowing processes with BPF link FD to force-detach, from Andrii Nakryiko.

2) Add BPF iterator for map elements and to iterate all BPF programs for efficient
   in-kernel inspection, from Yonghong Song and Alexei Starovoitov.

3) Separate bpf_get_{stack,stackid}() helpers for perf events in BPF to avoid
   unwinder errors, from Song Liu.

4) Allow cgroup local storage map to be shared between programs on the same
   cgroup. Also extend BPF selftests with coverage, from YiFei Zhu.

5) Add BPF exception tables to ARM64 JIT in order to be able to JIT BPF_PROBE_MEM
   load instructions, from Jean-Philippe Brucker.

6) Follow-up fixes on BPF socket lookup in combination with reuseport group
   handling. Also add related BPF selftests, from Jakub Sitnicki.

7) Allow to use socket storage in BPF_PROG_TYPE_CGROUP_SOCK-typed programs for
   socket create/release as well as bind functions, from Stanislav Fomichev.

8) Fix an info leak in xsk_getsockopt() when retrieving XDP stats via old struct
   xdp_statistics, from Peilin Ye.

9) Fix PT_REGS_RC{,_CORE}() macros in libbpf for MIPS arch, from Jerry Crunchtime.

10) Extend BPF kernel test infra with skb->family and skb->{local,remote}_ip{4,6}
    fields and allow user space to specify skb->dev via ifindex, from Dmitry Yakunin.

11) Fix a bpftool segfault due to missing program type name and make it more robust
    to prevent them in future gaps, from Quentin Monnet.

12) Consolidate cgroup helper functions across selftests and fix a v6 localhost
    resolver issue, from John Fastabend.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-03 18:27:40 -07:00
Andrii Nakryiko 73b11c2ab0 bpf: Add support for forced LINK_DETACH command
Add LINK_DETACH command to force-detach bpf_link without destroying it. It has
the same behavior as auto-detaching of bpf_link due to cgroup dying for
bpf_cgroup_link or net_device being destroyed for bpf_xdp_link. In such case,
bpf_link is still a valid kernel object, but is defunct and doesn't hold BPF
program attached to corresponding BPF hook. This functionality allows users
with enough access rights to manually force-detach attached bpf_link without
killing respective owner process.

This patch implements LINK_DETACH for cgroup, xdp, and netns links, mostly
re-using existing link release handling code.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200731182830.286260-2-andriin@fb.com
2020-08-01 20:38:28 -07:00
Roopa Prabhu 829eb208e8 rtnetlink: add support for protodown reason
netdev protodown is a mechanism that allows protocols to
hold an interface down. It was initially introduced in
the kernel to hold links down by a multihoming protocol.
There was also an attempt to introduce a protodown
reason at the time, but it was rejected. protodown and protodown reason
are supported by almost every switching and routing platform.
It was ok for a while to live without a protodown reason.
But it has become more critical now, given that more than
one protocol may need to keep a link down on a system
at the same time, e.g. vrrp peer node, port security,
multihoming protocol. It's common for network operators and
protocol developers to look for such a reason on a networking
box (it's also known as errDisable by most network operators).

This patch adds support for link protodown reason
attribute. There are two ways to maintain protodown
reasons.
(a) enumerate every possible reason code in kernel
    - A protocol developer has to make a request and
      have that appear in a certain kernel version
(b) provide the bits in the kernel, and allow user-space
    (sysadmin or NOS distributions) to manage the bit-to-reasonname
    map.
    - This makes extending reason codes easier (kind of like
      the iproute2 table to vrf-name map /etc/iproute2/rt_tables.d/)

This patch takes approach (b).

A few things about the patch:
- It treats the protodown reason bits as a counter to indicate
  active protodown users
- Since the protodown attribute is already an exposed UAPI,
  the reason is not enforced on a protodown set. It's a no-op
  if not used.

The patch follows the algorithm below:
  - presence of reason bits set indicates protodown
    is in use
  - user can set protodown and protodown reason in a
    single or multiple setlink operations
  - setlink operation to clear protodown, will return -EBUSY
    if there are active protodown reason bits
  - reason is not included in link dumps if not used
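
The rules listed above can be sketched in userspace C like this. The
names (struct sketch_dev, sketch_change_reason(),
sketch_change_proto_down()) are hypothetical and netlink parsing is left
out entirely; only the -EBUSY rule is taken from the description.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct sketch_dev {
	bool proto_down;
	uint32_t proto_down_reason;	/* one bit per active reason */
};

/* Set or clear a single reason bit (bit numbers 0..31). */
static void sketch_change_reason(struct sketch_dev *dev, unsigned int bit, bool on)
{
	assert(bit < 32);
	if (on)
		dev->proto_down_reason |= (uint32_t)1 << bit;
	else
		dev->proto_down_reason &= ~((uint32_t)1 << bit);
}

/* Clearing protodown is refused while any reason bit is still set. */
static int sketch_change_proto_down(struct sketch_dev *dev, bool on)
{
	if (!on && dev->proto_down_reason)
		return -EBUSY;
	dev->proto_down = on;
	return 0;
}
```

A user that never sets a reason bit sees the old behaviour unchanged,
which matches the "no-op if not used" point.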

example with patched iproute2:
$cat /etc/iproute2/protodown_reasons.d/r.conf
0 mlag
1 evpn
2 vrrp
3 psecurity

$ip link set dev vxlan0 protodown on protodown_reason vrrp on
$ip link set dev vxlan0 protodown_reason mlag on
$ip link show
14: vxlan0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode
DEFAULT group default qlen 1000
    link/ether f6:06:be:17:91:e7 brd ff:ff:ff:ff:ff:ff protodown on <mlag,vrrp>

$ip link set dev vxlan0 protodown_reason mlag off
$ip link set dev vxlan0 protodown off protodown_reason vrrp off

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-31 18:49:16 -07:00
Miaohe Lin 9fc95f50ee net: Pass NULL to skb_network_protocol() when we don't care about vlan depth
When we don't care about vlan depth, we could pass NULL instead of the
address of an unused local variable to skb_network_protocol() as a param.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-31 16:42:30 -07:00
Andrii Nakryiko e8407fdeb9 bpf, xdp: Remove XDP_QUERY_PROG and XDP_QUERY_PROG_HW XDP commands
Now that BPF program/link management is centralized in generic net_device
code, kernel code never queries program id from drivers, so
XDP_QUERY_PROG/XDP_QUERY_PROG_HW commands are unnecessary.

This patch removes all the implementations of those commands in kernel, along
with xdp_attachment_query().

This patch was compile-tested on allyesconfig.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-10-andriin@fb.com
2020-07-25 20:37:02 -07:00
Andrii Nakryiko c1931c9784 bpf: Implement BPF XDP link-specific introspection APIs
Implement XDP link-specific show_fdinfo and link_info to emit ifindex.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-7-andriin@fb.com
2020-07-25 20:37:02 -07:00
Andrii Nakryiko 026a4c28e1 bpf, xdp: Implement LINK_UPDATE for BPF XDP link
Add support for LINK_UPDATE command for BPF XDP link to enable reliable
replacement of underlying BPF program.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-6-andriin@fb.com
2020-07-25 20:37:02 -07:00
Andrii Nakryiko aa8d3a716b bpf, xdp: Add bpf_link-based XDP attachment API
Add bpf_link-based API (bpf_xdp_link) to attach BPF XDP program through
BPF_LINK_CREATE command.

bpf_xdp_link is mutually exclusive with direct BPF program attachment,
previous BPF program should be detached prior to attempting to create a new
bpf_xdp_link attachment (for a given XDP mode). Once BPF link is attached, it
can't be replaced by other BPF program attachment or link attachment. It will
be detached only when the last BPF link FD is closed.

bpf_xdp_link will be auto-detached when net_device is shutdown, similarly to
how other BPF links behave (cgroup, flow_dissector). At that point bpf_link
will become defunct, but won't be destroyed until last FD is closed.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-5-andriin@fb.com
2020-07-25 20:37:02 -07:00
Andrii Nakryiko d4baa9368a bpf, xdp: Extract common XDP program attachment logic
Further refactor XDP attachment code. dev_change_xdp_fd() is split into two
parts: getting bpf_progs from FDs and attachment logic, working with
bpf_progs. This makes attachment logic a bit more straightforward and
prepares code for bpf_xdp_link inclusion, which will share the common logic.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-4-andriin@fb.com
2020-07-25 20:37:02 -07:00
Andrii Nakryiko 7f0a838254 bpf, xdp: Maintain info on attached XDP BPF programs in net_device
Instead of delegating to drivers, maintain information about which BPF
programs are attached in which XDP modes (generic/skb, driver, or hardware)
locally in net_device. This effectively obsoletes XDP_QUERY_PROG command.

Such re-organization simplifies existing code already. But it also allows to
further add bpf_link-based XDP attachments without drivers having to know
about any of this at all, which seems like a good setup.
XDP_SETUP_PROG/XDP_SETUP_PROG_HW are just low-level commands to driver to
install/uninstall active BPF program. All the higher-level concerns about
prog/link interaction will be contained within generic driver-agnostic logic.

All the XDP_QUERY_PROG calls to driver in dev_xdp_uninstall() were removed.
It's not clear to me why dev_xdp_uninstall() was passing previous prog_flags
when resetting installed programs. That seems unnecessary, plus most drivers
don't populate prog_flags anyways. Having XDP_SETUP_PROG vs XDP_SETUP_PROG_HW
should be enough of an indicator of what is required of driver to correctly
reset active BPF program. dev_xdp_uninstall() is also generalized as an
iteration over all three supported modes.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722064603.3350758-3-andriin@fb.com
2020-07-25 20:37:02 -07:00
David S. Miller a57066b1a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The UDP reuseport conflict was a little bit tricky.

The net-next code, via bpf-next, extracted the reuseport handling
into a helper so that the BPF sk lookup code could invoke it.

At the same time, the logic for reuseport handling of unconnected
sockets changed via commit efc6b6f6c3
which changed the logic to carry on the reuseport result into the
rest of the lookup loop if we do not return immediately.

This requires moving the reuseport_has_conns() logic into the callers.

While we are here, get rid of inline directives as they do not belong
in foo.c files.

The other changes were cases of more straightforward overlapping
modifications.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-25 17:49:04 -07:00
Subash Abhinov Kasiviswanathan 7df5cb75cf dev: Defer free of skbs in flush_backlog
IRQs are disabled when freeing skbs in input queue.
Use the IRQ safe variant to free skbs here.

Fixes: 145dd5f9c8 ("net: flush the softnet backlog in process context")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-24 19:59:22 -07:00
Vladimir Oltean 5df5661a13 net: dsa: stop overriding master's ndo_get_phys_port_name
The purpose of this override is to give the user an indication of what
the number of the CPU port is (in DSA, the CPU port is a hardware
implementation detail and not a network interface capable of traffic).

However, it has always failed (by design) at providing this information
to the user in a reliable fashion.

Prior to commit 3369afba1e ("net: Call into DSA netdevice_ops
wrappers"), the behavior was to only override this callback if it was
not provided by the DSA master.

That was its first failure: if the DSA master itself was a DSA port or a
switchdev, then the user would not see the number of the CPU port in
/sys/class/net/eth0/phys_port_name, but the number of the DSA master
port within its respective physical switch.

But that was actually ok in a way. The commit mentioned above changed
that behavior, and now overrides the master's ndo_get_phys_port_name
unconditionally. That comes with problems of its own, which are worse in
a way.

The idea is that it's typical for switchdev users to have udev rules for
consistent interface naming. These are based, among other things, on
the phys_port_name attribute. If we let the DSA switch at the bottom
to start randomly overriding ndo_get_phys_port_name with its own CPU
port, we basically lose any predictability in interface naming, or even
uniqueness, for that matter.

So, there are reasons to let DSA override the master's callback (to
provide a consistent interface, a number which has a clear meaning and
must not be interpreted according to context), and there are reasons to
not let DSA override it (it breaks udev matching for the DSA master).

But, there is an alternative method for users to retrieve the number of
the CPU port of each DSA switch in the system:

  $ devlink port
  pci/0000:00:00.5/0: type eth netdev swp0 flavour physical port 0
  pci/0000:00:00.5/2: type eth netdev swp2 flavour physical port 2
  pci/0000:00:00.5/4: type notset flavour cpu port 4
  spi/spi2.0/0: type eth netdev sw0p0 flavour physical port 0
  spi/spi2.0/1: type eth netdev sw0p1 flavour physical port 1
  spi/spi2.0/2: type eth netdev sw0p2 flavour physical port 2
  spi/spi2.0/4: type notset flavour cpu port 4
  spi/spi2.1/0: type eth netdev sw1p0 flavour physical port 0
  spi/spi2.1/1: type eth netdev sw1p1 flavour physical port 1
  spi/spi2.1/2: type eth netdev sw1p2 flavour physical port 2
  spi/spi2.1/3: type eth netdev sw1p3 flavour physical port 3
  spi/spi2.1/4: type notset flavour cpu port 4

So remove this duplicated, unreliable and troublesome method. From this
patch on, the phys_port_name attribute of the DSA master will only
contain information about itself (if at all). If the users need reliable
information about the CPU port they're probably using devlink anyway.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-23 15:14:58 -07:00
David S. Miller dee72f8a0c Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-07-21

The following pull-request contains BPF updates for your *net-next* tree.

We've added 46 non-merge commits during the last 6 day(s) which contain
a total of 68 files changed, 4929 insertions(+), 526 deletions(-).

The main changes are:

1) Run BPF program on socket lookup, from Jakub.

2) Introduce cpumap, from Lorenzo.

3) s390 JIT fixes, from Ilya.

4) teach riscv JIT to emit compressed insns, from Luke.

5) use build time computed BTF ids in bpf iter, from Yonghong.
====================

Purely independent overlapping changes in both filter.h and xdp.h

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-22 12:35:33 -07:00
Florian Fainelli 3369afba1e net: Call into DSA netdevice_ops wrappers
Make the core net_device code call into our ndo_do_ioctl() and
ndo_get_phys_port_name() functions via the wrappers defined previously.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-20 16:48:22 -07:00
Petr Machata ac5c66f261 Revert "net: sched: Pass root lock to Qdisc_ops.enqueue"
This reverts commit aebe4426cc.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-07-16 16:48:34 -07:00
Lorenzo Bianconi 9216477449 bpf: cpumap: Add the possibility to attach an eBPF program to cpumap
Introduce the capability to attach an eBPF program to cpumap entries.
The idea behind this feature is to add the possibility to define on
which CPU to run the eBPF program if the underlying hw does not support
RSS. Currently supported verdicts are XDP_DROP and XDP_PASS.

This patch has been tested on Marvell ESPRESSObin using xdp_redirect_cpu
sample available in the kernel tree to identify possible performance
regressions. Results show there are no observable differences in
packet-per-second:

$./xdp_redirect_cpu --progname xdp_cpu_map0 --dev eth0 --cpu 1
rx: 354.8 Kpps
rx: 356.0 Kpps
rx: 356.8 Kpps
rx: 356.3 Kpps
rx: 356.6 Kpps
rx: 356.6 Kpps
rx: 356.7 Kpps
rx: 355.8 Kpps
rx: 356.8 Kpps
rx: 356.8 Kpps

Co-developed-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Link: https://lore.kernel.org/bpf/5c9febdf903d810b3415732e5cd98491d7d9067a.1594734381.git.lorenzo@kernel.org
2020-07-16 17:00:32 +02:00
Wei Yongjun ce1e2a776f net: make symbol 'flush_works' static
The sparse tool complains as follows:

net/core/dev.c:5594:1: warning:
 symbol '__pcpu_scope_flush_works' was not declared. Should it be static?

'flush_works' is not used outside of dev.c, so mark
it static.

Fixes: 41852497a9 ("net: batch calls to flush_all_backlogs()")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-13 17:24:52 -07:00
Andrew Lunn 8842500dd0 net: core: kerneldoc fixes
Simple fixes which require no deep knowledge of the code.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-13 17:20:39 -07:00
Petr Machata aebe4426cc net: sched: Pass root lock to Qdisc_ops.enqueue
A following patch introduces qevents, points in qdisc algorithm where
packet can be processed by user-defined filters. Should this processing
lead to a situation where a new packet is to be enqueued on the same port,
holding the root lock would lead to deadlocks. To solve the issue, qevent
handler needs to unlock and relock the root lock when necessary.

To that end, add the root lock argument to the qdisc op enqueue, and
propagate throughout.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-29 17:08:28 -07:00