Commit Graph

435 Commits

Author SHA1 Message Date
Xin Long 1955673054 net: sched: refine software bypass handling in tc_run
JIRA: https://issues.redhat.com/browse/RHEL-60271
JIRA: https://issues.redhat.com/browse/RHEL-61181
Upstream Status: net-next.git
Tested: compile only

commit a12c76a03386e32413ae8eaaefa337e491880632
Author: Xin Long <lucien.xin@gmail.com>
Date:   Wed Jan 15 09:27:54 2025 -0500

    net: sched: refine software bypass handling in tc_run

    This patch addresses issues with filter counting in block (tcf_block),
    particularly for software bypass scenarios, by introducing a more
    accurate mechanism using useswcnt.

    Previously, filtercnt and skipswcnt were introduced by:

      Commit 2081fd3445fe ("net: sched: cls_api: add filter counter") and
      Commit f631ef39d819 ("net: sched: cls_api: add skip_sw counter")

      filtercnt tracked all tp (tcf_proto) objects added to a block, and
      skipswcnt counted tp objects with the skipsw attribute set.

    The problem is: a single tp can contain multiple filters, some with skipsw
    and others without. The current implementation fails in the case:

      When the first filter in a tp has skipsw, both skipswcnt and filtercnt
      are incremented, then adding a second filter without skipsw to the same
      tp does not modify these counters because tp->counted is already set.

      This results in bypass software behavior based solely on skipswcnt
      equaling filtercnt, even when the block includes filters without
      skipsw. Consequently, filters without skipsw are inadvertently bypassed.

    To address this, the patch introduces useswcnt in block to explicitly count
    tp objects containing at least one filter without skipsw. Key changes
    include:

      Whenever a filter without skipsw is added, its tp is marked with usesw
      and counted in useswcnt. tc_run() now uses useswcnt to determine software
      bypass, eliminating reliance on filtercnt and skipswcnt.

      This refined approach prevents software bypass for blocks containing
      mixed filters, ensuring correct behavior in tc_run().

    Additionally, as atomic operations on useswcnt ensure thread safety and
    tp->lock guards access to tp->usesw and tp->counted, the broader lock
    down_write(&block->cb_lock) is no longer required in tc_new_tfilter(),
    and this resolves a performance regression caused by the filter counting
    mechanism during parallel filter insertions.

      The improvement can be demonstrated using the following script:

      # cat insert_tc_rules.sh

        tc qdisc add dev ens1f0np0 ingress
        for i in $(seq 16); do
            taskset -c $i tc -b rules_$i.txt &
        done
        wait

      Each of the rules_$i.txt files above contains 100000 tc filter rules
      for an mlx5 driver NIC, ens1f0np0.

      Without this patch:

      # time sh insert_tc_rules.sh

        real    0m50.780s
        user    0m23.556s
        sys     4m13.032s

      With this patch:

      # time sh insert_tc_rules.sh

        real    0m17.718s
        user    0m7.807s
        sys     3m45.050s

    Fixes: 047f340b36fc ("net: sched: make skip_sw actually skip software")
    Reported-by: Shuang Li <shuali@redhat.com>
    Signed-off-by: Xin Long <lucien.xin@gmail.com>
    Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Reviewed-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
    Tested-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Xin Long <lxin@redhat.com>
2025-01-20 12:01:29 -05:00
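
The counting problem described in the commit above can be reproduced with a toy model. This is a minimal sketch, not kernel code: the `Tp` and `Block` classes, `add_filter()`, `old_bypass()` and `new_bypass()` are hypothetical names, and the bypass conditions are simplified assumptions based only on the commit text (the real tc_run() logic also involves a static key).

```python
from dataclasses import dataclass, field


@dataclass
class Tp:
    """Toy stand-in for a tcf_proto holding per-filter skip_sw flags."""
    filters: list = field(default_factory=list)
    counted: bool = False  # old scheme: tp already accounted in the block
    usesw: bool = False    # new scheme: tp has at least one non-skip_sw filter


@dataclass
class Block:
    filtercnt: int = 0
    skipswcnt: int = 0
    useswcnt: int = 0


def add_filter(block, tp, skip_sw):
    tp.filters.append(skip_sw)
    # Old scheme: the counters are only touched the first time a tp is seen.
    if not tp.counted:
        tp.counted = True
        block.filtercnt += 1
        if skip_sw:
            block.skipswcnt += 1
    # New scheme: count the tp once, as soon as it holds a non-skip_sw filter.
    if not skip_sw and not tp.usesw:
        tp.usesw = True
        block.useswcnt += 1


def old_bypass(block):
    # Assumed old rule: bypass software when every counted tp is skip_sw.
    return block.filtercnt > 0 and block.skipswcnt == block.filtercnt


def new_bypass(block):
    # Assumed new rule: bypass software only when no tp needs software.
    return block.useswcnt == 0


block, tp = Block(), Tp()
add_filter(block, tp, skip_sw=True)   # first filter in the tp: skip_sw
add_filter(block, tp, skip_sw=False)  # second filter, same tp: needs software
print("old scheme bypasses software:", old_bypass(block))
print("new scheme bypasses software:", new_bypass(block))
```

In this model the old counters still report an all-skip_sw block after the second filter is added, while the useswcnt-based check does not.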
CKI Backport Bot f0936762c4 net/sched: sch_api: fix xa_insert() error path in tcf_block_get_ext()
JIRA: https://issues.redhat.com/browse/RHEL-68161
CVE: CVE-2024-53044

commit a13e690191eafc154b3f60afe9ce35aa9b9128b4
Author: Vladimir Oltean <vladimir.oltean@nxp.com>
Date:   Wed Oct 23 13:05:41 2024 +0300

    net/sched: sch_api: fix xa_insert() error path in tcf_block_get_ext()

    This command:

    $ tc qdisc replace dev eth0 ingress_block 1 egress_block 1 clsact
    Error: block dev insert failed: -EBUSY.

    fails because user space requests the same block index to be set for
    both ingress and egress.

    [ side note, I don't think it even failed prior to commit 913b47d3424e
      ("net/sched: Introduce tc block netdev tracking infra"), because this
      is a command from an old set of notes of mine which used to work, but
      alas, I did not scientifically bisect this ]

    The problem is not that it fails, but rather, that the second time
    around, it fails differently (and irrecoverably):

    $ tc qdisc replace dev eth0 ingress_block 1 egress_block 1 clsact
    Error: dsa_core: Flow block cb is busy.

    [ another note: the extack is added by me for illustration purposes.
      the context of the problem is that clsact_init() obtains the same
      &q->ingress_block pointer as &q->egress_block, and since we call
      tcf_block_get_ext() on both of them, "dev" will be added to the
      block->ports xarray twice, thus failing the operation: once through
      the ingress block pointer, and once again through the egress block
      pointer. the problem itself is that when xa_insert() fails, we have
      emitted a FLOW_BLOCK_BIND command through ndo_setup_tc(), but the
      offload never sees a corresponding FLOW_BLOCK_UNBIND. ]

    Even correcting the bad user input, we still cannot recover:

    $ tc qdisc replace dev swp3 ingress_block 1 egress_block 2 clsact
    Error: dsa_core: Flow block cb is busy.

    Basically the only way to recover is to reboot the system, or unbind and
    rebind the net device driver.

    To fix the bug, we need to fill the correct error teardown path which
    was missed during code movement, and call tcf_block_offload_unbind()
    when xa_insert() fails.

    [ last note, fundamentally I blame the label naming convention in
      tcf_block_get_ext() for the bug. The labels should be named after what
      they do, not after the error path that jumps to them. This way, it is
      obviously wrong that two labels pointing to the same code mean
      something is wrong, and checking the code correctness at the goto site
      is also easier ]

    Fixes: 94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()")
    Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Link: https://patch.msgid.link/20241023100541.974362-1-vladimir.oltean@nxp.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>
2024-11-19 20:50:00 +00:00
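
The missing teardown step described in the commit above can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel implementation: `Offload`, `block_get()` and `retry_succeeds()` are hypothetical names, a Python set stands in for the block->ports xarray, and a single `Offload` object stands in for the driver-side state touched by FLOW_BLOCK_BIND and FLOW_BLOCK_UNBIND.

```python
class Offload:
    """Toy driver state: a block callback can only be bound once."""

    def __init__(self):
        self.bound = False

    def bind(self):
        if self.bound:
            raise RuntimeError("Flow block cb is busy")
        self.bound = True

    def unbind(self):
        self.bound = False


def block_get(offload, ports, dev, unwind_on_insert_failure):
    offload.bind()
    if dev in ports:  # models xa_insert() failing with -EBUSY
        if unwind_on_insert_failure:
            offload.unbind()  # the teardown step the fix adds
        raise KeyError("block dev insert failed: -EBUSY")
    ports.add(dev)


def retry_succeeds(unwind_on_insert_failure):
    offload, ports = Offload(), {"eth0"}  # dev already tracked once
    try:
        block_get(offload, ports, "eth0", unwind_on_insert_failure)
    except KeyError:
        pass  # first attempt fails in both variants
    ports.clear()  # the user corrects the bad input
    try:
        block_get(offload, ports, "eth0", unwind_on_insert_failure)
        return True
    except RuntimeError:
        return False  # stale bind left behind by the first attempt


print("retry works without unwind:", retry_succeeds(False))
print("retry works with unwind:", retry_succeeds(True))
```

Without the unwind, the first failure leaves the offload bound and every later attempt hits the "busy" error; with it, the corrected command succeeds.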
Ivan Vecera 66b830ed1a net: sched: cls_api: fix slab-use-after-free in fl_dump_key
JIRA: https://issues.redhat.com/browse/RHEL-57767

commit 2ecd487b670fcbb1ad4893fff1af4aafdecb6023
Author: Jianbo Liu <jianbol@nvidia.com>
Date:   Mon Apr 8 16:48:17 2024 +0300

    net: sched: cls_api: fix slab-use-after-free in fl_dump_key

    The filter counter is updated under the protection of cb_lock in the
    cited commit. While waiting for the lock, the filter may be deleted by
    another thread, which causes a use-after-free (UAF) when it is dumped.

    Fix this issue by moving tcf_block_filter_cnt_update() after
    tfilter_put().

     ==================================================================
     BUG: KASAN: slab-use-after-free in fl_dump_key+0x1d3e/0x20d0 [cls_flower]
     Read of size 4 at addr ffff88814f864000 by task tc/2973

     CPU: 7 PID: 2973 Comm: tc Not tainted 6.9.0-rc2_for_upstream_debug_2024_04_02_12_41 #1
     Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
     Call Trace:
      <TASK>
      dump_stack_lvl+0x7e/0xc0
      print_report+0xc1/0x600
      ? __virt_addr_valid+0x1cf/0x390
      ? fl_dump_key+0x1d3e/0x20d0 [cls_flower]
      ? fl_dump_key+0x1d3e/0x20d0 [cls_flower]
      kasan_report+0xb9/0xf0
      ? fl_dump_key+0x1d3e/0x20d0 [cls_flower]
      fl_dump_key+0x1d3e/0x20d0 [cls_flower]
      ? lock_acquire+0x1c2/0x530
      ? fl_dump+0x172/0x5c0 [cls_flower]
      ? lockdep_hardirqs_on_prepare+0x400/0x400
      ? fl_dump_key_options.part.0+0x10f0/0x10f0 [cls_flower]
      ? do_raw_spin_lock+0x12d/0x270
      ? spin_bug+0x1d0/0x1d0
      fl_dump+0x21d/0x5c0 [cls_flower]
      ? fl_tmplt_dump+0x1f0/0x1f0 [cls_flower]
      ? nla_put+0x15f/0x1c0
      tcf_fill_node+0x51b/0x9a0
      ? tc_skb_ext_tc_enable+0x150/0x150
      ? __alloc_skb+0x17b/0x310
      ? __build_skb_around+0x340/0x340
      ? down_write+0x1b0/0x1e0
      tfilter_notify+0x1a5/0x390
      ? fl_terse_dump+0x400/0x400 [cls_flower]
      tc_new_tfilter+0x963/0x2170
      ? tc_del_tfilter+0x1490/0x1490
      ? print_usage_bug.part.0+0x670/0x670
      ? lock_downgrade+0x680/0x680
      ? security_capable+0x51/0x90
      ? tc_del_tfilter+0x1490/0x1490
      rtnetlink_rcv_msg+0x75e/0xac0
      ? if_nlmsg_stats_size+0x4c0/0x4c0
      ? lockdep_set_lock_cmp_fn+0x190/0x190
      ? __netlink_lookup+0x35e/0x6e0
      netlink_rcv_skb+0x12c/0x360
      ? if_nlmsg_stats_size+0x4c0/0x4c0
      ? netlink_ack+0x15e0/0x15e0
      ? lockdep_hardirqs_on_prepare+0x400/0x400
      ? netlink_deliver_tap+0xcd/0xa60
      ? netlink_deliver_tap+0xcd/0xa60
      ? netlink_deliver_tap+0x1c9/0xa60
      netlink_unicast+0x43e/0x700
      ? netlink_attachskb+0x750/0x750
      ? lock_acquire+0x1c2/0x530
      ? __might_fault+0xbb/0x170
      netlink_sendmsg+0x749/0xc10
      ? netlink_unicast+0x700/0x700
      ? __might_fault+0xbb/0x170
      ? netlink_unicast+0x700/0x700
      __sock_sendmsg+0xc5/0x190
      ____sys_sendmsg+0x534/0x6b0
      ? import_iovec+0x7/0x10
      ? kernel_sendmsg+0x30/0x30
      ? __copy_msghdr+0x3c0/0x3c0
      ? entry_SYSCALL_64_after_hwframe+0x46/0x4e
      ? lock_acquire+0x1c2/0x530
      ? __virt_addr_valid+0x116/0x390
      ___sys_sendmsg+0xeb/0x170
      ? __virt_addr_valid+0x1ca/0x390
      ? copy_msghdr_from_user+0x110/0x110
      ? __delete_object+0xb8/0x100
      ? __virt_addr_valid+0x1cf/0x390
      ? do_sys_openat2+0x102/0x150
      ? lockdep_hardirqs_on_prepare+0x284/0x400
      ? do_sys_openat2+0x102/0x150
      ? __fget_light+0x53/0x1d0
      ? sockfd_lookup_light+0x1a/0x150
      __sys_sendmsg+0xb5/0x140
      ? __sys_sendmsg_sock+0x20/0x20
      ? lock_downgrade+0x680/0x680
      do_syscall_64+0x70/0x140
      entry_SYSCALL_64_after_hwframe+0x46/0x4e
     RIP: 0033:0x7f98e3713367
     Code: 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
     RSP: 002b:00007ffc74a64608 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
     RAX: ffffffffffffffda RBX: 000000000047eae0 RCX: 00007f98e3713367
     RDX: 0000000000000000 RSI: 00007ffc74a64670 RDI: 0000000000000003
     RBP: 0000000000000008 R08: 0000000000000000 R09: 0000000000000000
     R10: 00007f98e360c5e8 R11: 0000000000000246 R12: 00007ffc74a6a508
     R13: 00000000660d518d R14: 0000000000484a80 R15: 00007ffc74a6a50b
      </TASK>

     Allocated by task 2973:
      kasan_save_stack+0x20/0x40
      kasan_save_track+0x10/0x30
      __kasan_kmalloc+0x77/0x90
      fl_change+0x27a6/0x4540 [cls_flower]
      tc_new_tfilter+0x879/0x2170
      rtnetlink_rcv_msg+0x75e/0xac0
      netlink_rcv_skb+0x12c/0x360
      netlink_unicast+0x43e/0x700
      netlink_sendmsg+0x749/0xc10
      __sock_sendmsg+0xc5/0x190
      ____sys_sendmsg+0x534/0x6b0
      ___sys_sendmsg+0xeb/0x170
      __sys_sendmsg+0xb5/0x140
      do_syscall_64+0x70/0x140
      entry_SYSCALL_64_after_hwframe+0x46/0x4e

     Freed by task 283:
      kasan_save_stack+0x20/0x40
      kasan_save_track+0x10/0x30
      kasan_save_free_info+0x37/0x50
      poison_slab_object+0x105/0x190
      __kasan_slab_free+0x11/0x30
      kfree+0x111/0x340
      process_one_work+0x787/0x1490
      worker_thread+0x586/0xd30
      kthread+0x2df/0x3b0
      ret_from_fork+0x2d/0x70
      ret_from_fork_asm+0x11/0x20

     Last potentially related work creation:
      kasan_save_stack+0x20/0x40
      __kasan_record_aux_stack+0x9b/0xb0
      insert_work+0x25/0x1b0
      __queue_work+0x640/0xc90
      rcu_work_rcufn+0x42/0x70
      rcu_core+0x6a9/0x1850
      __do_softirq+0x264/0x88f

     Second to last potentially related work creation:
      kasan_save_stack+0x20/0x40
      __kasan_record_aux_stack+0x9b/0xb0
      __call_rcu_common.constprop.0+0x6f/0xac0
      queue_rcu_work+0x56/0x70
      fl_mask_put+0x20d/0x270 [cls_flower]
      __fl_delete+0x352/0x6b0 [cls_flower]
      fl_delete+0x97/0x160 [cls_flower]
      tc_del_tfilter+0x7d1/0x1490
      rtnetlink_rcv_msg+0x75e/0xac0
      netlink_rcv_skb+0x12c/0x360
      netlink_unicast+0x43e/0x700
      netlink_sendmsg+0x749/0xc10
      __sock_sendmsg+0xc5/0x190
      ____sys_sendmsg+0x534/0x6b0
      ___sys_sendmsg+0xeb/0x170
      __sys_sendmsg+0xb5/0x140
      do_syscall_64+0x70/0x140
      entry_SYSCALL_64_after_hwframe+0x46/0x4e

    Fixes: 2081fd3445fe ("net: sched: cls_api: add filter counter")
    Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
    Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
    Tested-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-09-06 15:01:54 +02:00
Ivan Vecera 8d9ec7bbab net: sched: make skip_sw actually skip software
JIRA: https://issues.redhat.com/browse/RHEL-57767

commit 047f340b36fc550c0fc6a8947fc0a1f8e429e9ab
Author: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Date:   Mon Mar 25 20:47:36 2024 +0000

    net: sched: make skip_sw actually skip software

    TC filters come in 3 variants:
    - no flag (try to process in hardware, but fall back to software)
    - skip_hw (do not process filter by hardware)
    - skip_sw (do not process filter by software)

    However, skip_sw is implemented so that the skip_sw flag can only be
    checked after the filter has been matched.

    IMHO it's common, when using skip_sw, to use it on all rules.

    So if all filters in a block are skip_sw filters, we can bail out
    early and avoid having to match the filters just to check for the
    skip_sw flag.

    This patch adds a bypass for when only TC skip_sw rules are used.
    The bypass is guarded by a static key, to avoid harming other
    workloads.

    There are 3 ways that a packet from a skip_sw ruleset can end up in
    the kernel path. Although the "send packets to a non-existent chain"
    case is only improved by a few percent, I believe it's worth
    optimizing the trap and fall-through use-cases.

     +----------------------------+--------+--------+--------+
     | Test description           | Pre-   | Post-  | Rel.   |
     |                            | kpps   | kpps   | chg.   |
     +----------------------------+--------+--------+--------+
     | basic forwarding + notrack | 3589.3 | 3587.9 |  1.00x |
     | switch to eswitch mode     | 3081.8 | 3094.7 |  1.00x |
     | add ingress qdisc          | 3042.9 | 3063.6 |  1.01x |
     | tc forward in hw / skip_sw |37024.7 |37028.4 |  1.00x |
     | tc forward in sw / skip_hw | 3245.0 | 3245.3 |  1.00x |
     +----------------------------+--------+--------+--------+
     | tests with only skip_sw rules below:                  |
     +----------------------------+--------+--------+--------+
     | 1 non-matching rule        | 2694.7 | 3058.7 |  1.14x |
     | 1 n-m rule, match trap     | 2611.2 | 3323.1 |  1.27x |
     | 1 n-m rule, goto non-chain | 2886.8 | 2945.9 |  1.02x |
     | 5 non-matching rules       | 1958.2 | 3061.3 |  1.56x |
     | 5 n-m rules, match trap    | 1911.9 | 3327.0 |  1.74x |
     | 5 n-m rules, goto non-chain| 2883.1 | 2947.5 |  1.02x |
     | 10 non-matching rules      | 1466.3 | 3062.8 |  2.09x |
     | 10 n-m rules, match trap   | 1444.3 | 3317.9 |  2.30x |
     | 10 n-m rules,goto non-chain| 2883.1 | 2939.5 |  1.02x |
     | 25 non-matching rules      |  838.5 | 3058.9 |  3.65x |
     | 25 n-m rules, match trap   |  824.5 | 3323.0 |  4.03x |
     | 25 n-m rules,goto non-chain| 2875.8 | 2944.7 |  1.02x |
     | 50 non-matching rules      |  488.1 | 3054.7 |  6.26x |
     | 50 n-m rules, match trap   |  484.9 | 3318.5 |  6.84x |
     | 50 n-m rules,goto non-chain| 2884.1 | 2939.7 |  1.02x |
     +----------------------------+--------+--------+--------+

    perf top (25 n-m skip_sw rules - pre patch):
      20.39%  [kernel]  [k] __skb_flow_dissect
      16.43%  [kernel]  [k] rhashtable_jhash2
      10.58%  [kernel]  [k] fl_classify
      10.23%  [kernel]  [k] fl_mask_lookup
       4.79%  [kernel]  [k] memset_orig
       2.58%  [kernel]  [k] tcf_classify
       1.47%  [kernel]  [k] __x86_indirect_thunk_rax
       1.42%  [kernel]  [k] __dev_queue_xmit
       1.36%  [kernel]  [k] nft_do_chain
       1.21%  [kernel]  [k] __rcu_read_lock

    perf top (25 n-m skip_sw rules - post patch):
       5.12%  [kernel]  [k] __dev_queue_xmit
       4.77%  [kernel]  [k] nft_do_chain
       3.65%  [kernel]  [k] dev_gro_receive
       3.41%  [kernel]  [k] check_preemption_disabled
       3.14%  [kernel]  [k] mlx5e_skb_from_cqe_mpwrq_nonlinear
       2.88%  [kernel]  [k] __netif_receive_skb_core.constprop.0
       2.49%  [kernel]  [k] mlx5e_xmit
       2.15%  [kernel]  [k] ip_forward
       1.95%  [kernel]  [k] mlx5e_tc_restore_tunnel
       1.92%  [kernel]  [k] vlan_gro_receive

    Test setup:
     DUT: Intel Xeon D-1518 (2.20GHz) w/ Nvidia/Mellanox ConnectX-6 Dx 2x100G
     Data rate measured on switch (Extreme X690), and DUT connected as
     a router on a stick, with pktgen and pktsink as VLANs.
     Pktgen-dpdk was in range 36.6-37.7 Mpps 64B packets across all tests.
     Full test data at https://files.fiberby.net/ast/2024/tc_skip_sw/v2_tests/

    Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-09-06 15:01:51 +02:00
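
The "Rel. chg." column in the table above is the post-patch rate divided by the pre-patch rate. The short check below recomputes it for the plain non-matching rows, using the kpps figures from the table; the `rows` dictionary and `rel_change()` helper are just illustrative names.

```python
# (pre-patch kpps, post-patch kpps) copied from the benchmark table
rows = {
    "1 non-matching rule": (2694.7, 3058.7),
    "5 non-matching rules": (1958.2, 3061.3),
    "10 non-matching rules": (1466.3, 3062.8),
    "25 non-matching rules": (838.5, 3058.9),
    "50 non-matching rules": (488.1, 3054.7),
}


def rel_change(pre_kpps, post_kpps):
    """Relative change as shown in the table: post / pre, two decimals."""
    return round(post_kpps / pre_kpps, 2)


for name, (pre, post) in rows.items():
    print(f"{name:<24} {rel_change(pre, post):.2f}x")
```

The post-patch rate stays near 3060 kpps regardless of rule count, so the ratio grows as the pre-patch rate falls with each added rule.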
Ivan Vecera ca1d99272f net: sched: cls_api: add filter counter
JIRA: https://issues.redhat.com/browse/RHEL-57767

commit 2081fd3445fec6b9813c20e8b910c2abd6de31cb
Author: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Date:   Mon Mar 25 20:47:35 2024 +0000

    net: sched: cls_api: add filter counter

    Maintain a count of filters per block.

    Counter updates are protected by cb_lock, which is
    also used to protect the offload counters.

    Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-09-06 15:01:49 +02:00
Ivan Vecera eba5754059 net: sched: cls_api: add skip_sw counter
JIRA: https://issues.redhat.com/browse/RHEL-57767

commit f631ef39d81956a2ee69d25039781ceae1162f62
Author: Asbjørn Sloth Tønnesen <ast@fiberby.net>
Date:   Mon Mar 25 20:47:34 2024 +0000

    net: sched: cls_api: add skip_sw counter

    Maintain a count of skip_sw filters.

    This counter is protected by the cb_lock, and is updated
    at the same time as offloadcnt.

    Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net>
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-09-06 15:01:48 +02:00
Ivan Vecera 8db36f9b9d net/sched: Load modules via their alias
JIRA: https://issues.redhat.com/browse/RHEL-57767

commit 2c15a5aee2f32e341d1585fa1867eece76a1edb8
Author: Michal Koutný <mkoutny@suse.com>
Date:   Thu Feb 1 14:09:42 2024 +0100

    net/sched: Load modules via their alias

    The cls_, sch_ and act_ modules may be loaded lazily during network
    configuration, without the user's awareness or control.

    Switch the lazy loading from canonical module names to a module alias.
    This allows finer control over lazy loading, the precedent from
    commit 7f78e03513 ("fs: Limit sys_mount to only request filesystem
    modules.") explains it already:

            Using aliases means user space can control the policy of which
            filesystem^W net/sched modules are auto-loaded by editing
            /etc/modprobe.d/*.conf with blacklist and alias directives.
            Allowing simple, safe, well understood work-arounds to known
            problematic software.

    By default, nothing changes. However, if a specific module is
    blacklisted (its canonical name), it won't be modprobe'd when requested
    under its alias (i.e. kernel auto-loading). It would appear as if the
    given module was unknown.

    The module can still be loaded under its canonical name, which is an
    explicit (privileged) user action.

    Signed-off-by: Michal Koutný <mkoutny@suse.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Link: https://lore.kernel.org/r/20240201130943.19536-4-mkoutny@suse.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-09-06 15:01:42 +02:00
Ivan Vecera b05d770d4b net: sched: track device in tcf_block_get/put_ext() only for clsact binder types
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit e18405d0be8001fa4c5f9e61471f6ffd59c7a1b3
Author: Jiri Pirko <jiri@nvidia.com>
Date:   Fri Jan 12 12:39:30 2024 +0100

    net: sched: track device in tcf_block_get/put_ext() only for clsact binder types

    Clsact/ingress qdisc is not the only one using shared block,
    red is also using it. The device tracking was originally introduced
    by commit 913b47d3424e ("net/sched: Introduce tc block netdev
    tracking infra") for clsact/ingress only. Commit 94e2557d086a ("net:
    sched: move block device tracking into tcf_block_get/put_ext()")
    mistakenly enabled that for red as well.

    Fix that by adding a check for the binder type being clsact when adding
    device to the block->ports xarray.

    Reported-by: Ido Schimmel <idosch@idosch.org>
    Closes: https://lore.kernel.org/all/ZZ6JE0odnu1lLPtu@shredder/
    Fixes: 94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()")
    Signed-off-by: Jiri Pirko <jiri@nvidia.com>
    Tested-by: Ido Schimmel <idosch@nvidia.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Tested-by: Victor Nogueira <victor@mojatatu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:25 +02:00
Ivan Vecera d24cd5c4a6 net/sched: simplify tc_action_load_ops parameters
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit 405cd9fc6f44f7a54505019bea60de83f1c58365
Author: Pedro Tammela <pctammela@mojatatu.com>
Date:   Thu Jan 4 21:38:10 2024 -0300

    net/sched: simplify tc_action_load_ops parameters

    Instead of using two bools derived from the flags passed as an argument
    to the parent function of tc_action_load_ops, just pass the flags
    themselves to tc_action_load_ops to simplify its parameters.

    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:24 +02:00
Ivan Vecera e4251de682 net: sched: move block device tracking into tcf_block_get/put_ext()
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit 94e2557d086ad831027c54bc9c2130d337c72814
Author: Jiri Pirko <jiri@nvidia.com>
Date:   Thu Jan 4 13:58:44 2024 +0100

    net: sched: move block device tracking into tcf_block_get/put_ext()

    Inserting the device into the block xarray in qdisc_create() is not a
    suitable place to do this. As it requires use of the tcf_block()
    callback, it causes multiple issues. It is called for all qdisc types,
    which is incorrect.

    So, instead, move it to a more suitable place, tcf_block_get_ext(),
    and make sure it is only done for qdiscs that use the block
    infrastructure and only for blocks which are shared.

    Symmetrically, alter the cleanup path, move the xarray entry removal
    into tcf_block_put_ext().

    Fixes: 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra")
    Reported-by: Ido Schimmel <idosch@nvidia.com>
    Closes: https://lore.kernel.org/all/ZY1hBb8GFwycfgvd@shredder/
    Reported-by: Kui-Feng Lee <sinquersw@gmail.com>
    Closes: https://lore.kernel.org/all/ce8d3e55-b8bc-409c-ace9-5cf1c4f7c88e@gmail.com/
    Reported-and-tested-by: syzbot+84339b9e7330daae4d66@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/all/0000000000007c85f5060dcc3a28@google.com/
    Reported-and-tested-by: syzbot+806b0572c8d06b66b234@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/all/00000000000082f2f2060dcc3a92@google.com/
    Reported-and-tested-by: syzbot+0039110f932d438130f9@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/all/0000000000007fbc8c060dcc3a5c@google.com/
    Signed-off-by: Jiri Pirko <jiri@nvidia.com>
    Tested-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Victor Nogueira <victor@mojatatu.com>
    Tested-by: Victor Nogueira <victor@mojatatu.com>
    Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:24 +02:00
Ivan Vecera 843e86a641 net/sched: cls_api: complement tcf_tfilter_dump_policy
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit 2ab1efad60ad119b616722b81eeb73060728028c
Author: Lin Ma <linma@zju.edu.cn>
Date:   Thu Dec 28 14:43:58 2023 +0800

    net/sched: cls_api: complement tcf_tfilter_dump_policy

    In function `tc_dump_tfilter`, the attributes array is parsed via
    tcf_tfilter_dump_policy which only describes TCA_DUMP_FLAGS. However,
    the NLA TCA_CHAIN is also accessed with `nla_get_u32`.

    The access to TCA_CHAIN was introduced in commit 5bc1701881 ("net:
    sched: introduce multichain support for filters"), and no nla_policy
    was provided for parsing at that point. Later on,
    tcf_tfilter_dump_policy was introduced in commit f8ab1807a9 ("net:
    sched: introduce terse dump flag"), still ignoring the fact that
    TCA_CHAIN needs a check. This patch complements the policy so that the
    access discussed here is safe, as in the other cases that simply use
    rtm_tca_policy as the parsing policy.

    Signed-off-by: Lin Ma <linma@zju.edu.cn>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:24 +02:00
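
The gap closed by the commit above can be sketched with a toy attribute parser. This is a simplified model, not the netlink implementation: `parse()`, `get_u32()` and `dump()` are hypothetical helpers, the payload sizes in the policies are assumptions made for the sketch, and a `struct.error` stands in for the unsafe read that nla_get_u32 would perform on a short attribute.

```python
import struct

U32_LEN = 4


def parse(attrs, policy):
    """Lenient parse: only attributes named in the policy are length-checked."""
    table = {}
    for name, payload in attrs.items():
        required = policy.get(name)
        if required is not None and len(payload) < required:
            raise ValueError(f"{name}: payload too short")
        table[name] = payload
    return table


def get_u32(payload):
    # Stands in for nla_get_u32(); a short payload models the unsafe read.
    return struct.unpack("=I", payload[:U32_LEN])[0]


def dump(attrs, policy):
    try:
        table = parse(attrs, policy)
    except ValueError:
        return "rejected by policy"
    try:
        return get_u32(table["TCA_CHAIN"])
    except struct.error:
        return "unsafe read"


# Sizes below are assumptions for this toy model.
old_policy = {"TCA_DUMP_FLAGS": 8}
new_policy = {"TCA_DUMP_FLAGS": 8, "TCA_CHAIN": U32_LEN}
malformed = {"TCA_CHAIN": b"\x01\x00"}  # only 2 bytes, not a full u32

print("old policy:", dump(malformed, old_policy))
print("new policy:", dump(malformed, new_policy))
```

With the policy entry in place, the malformed attribute is refused at parse time instead of reaching the code that reads it.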
Ivan Vecera 9601880cb4 net/sched: cls_api: Expose tc block to the datapath
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit a7042cf8f23191c3a460c627c0c39463afb5d335
Author: Victor Nogueira <victor@mojatatu.com>
Date:   Tue Dec 19 15:16:20 2023 -0300

    net/sched: cls_api: Expose tc block to the datapath

    The datapath can now find the block of the port on which the packet
    arrived.

    In the next patch we show a possible usage of this patch in a new
    version of mirred that multicasts to all ports except for the port on
    which the packet arrived.

    Co-developed-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
    Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
    Signed-off-by: Victor Nogueira <victor@mojatatu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:23 +02:00
Ivan Vecera 997564732a net/sched: Introduce tc block netdev tracking infra
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit 913b47d3424e7d99eaf34b798c47dfa840c64a08
Author: Victor Nogueira <victor@mojatatu.com>
Date:   Tue Dec 19 15:16:19 2023 -0300

    net/sched: Introduce tc block netdev tracking infra

    This commit makes tc blocks track which ports have been added to them.
    With that, we'll be able to use this new information to send packets
    to the block's ports, which will be done in patch #3 of this series.

    Suggested-by: Jiri Pirko <jiri@nvidia.com>
    Co-developed-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Co-developed-by: Pedro Tammela <pctammela@mojatatu.com>
    Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
    Signed-off-by: Victor Nogueira <victor@mojatatu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:23 +02:00
Ivan Vecera 25391b6030 net: sched: Add initial TC error skb drop reasons
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit 4cf24dc8934074725042c0bd10b91f4d4b5269bb
Author: Victor Nogueira <victor@mojatatu.com>
Date:   Sat Dec 16 17:44:36 2023 -0300

    net: sched: Add initial TC error skb drop reasons

    Continue expanding Daniel's patch by adding new skb drop reasons that
    are idiosyncratic to TC.

    More specifically:

    - SKB_DROP_REASON_TC_COOKIE_ERROR: An error occurred whilst
      processing a tc ext cookie.

    - SKB_DROP_REASON_TC_CHAIN_NOTFOUND: tc chain lookup failed.

    - SKB_DROP_REASON_TC_RECLASSIFY_LOOP: tc exceeded max reclassify loop
      iterations

    Signed-off-by: Victor Nogueira <victor@mojatatu.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:23 +02:00
Ivan Vecera 0c3f908699 net: sched: Move drop_reason to struct tc_skb_cb
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit fb2780721ca5e9f78bbe4544b819b929a982df9c
Author: Victor Nogueira <victor@mojatatu.com>
Date:   Sat Dec 16 17:44:34 2023 -0300

    net: sched: Move drop_reason to struct tc_skb_cb

    Move drop_reason from struct tcf_result to skb cb - more specifically to
    struct tc_skb_cb. With that, we'll be able to also set the drop reason for
    the remaining qdiscs (aside from clsact) that do not have access to
    tcf_result when time comes to set the skb drop reason.

    Signed-off-by: Victor Nogueira <victor@mojatatu.com>
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:23 +02:00
Ivan Vecera 5dc793543c net/sched: act_api: skip idr replace on bound actions
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit 1dd7f18fc0ed75dad4d5f2ecc84f69c6b62b6a81
Author: Pedro Tammela <pctammela@mojatatu.com>
Date:   Mon Dec 11 15:18:07 2023 -0300

    net/sched: act_api: skip idr replace on bound actions

    tcf_idr_insert_many will replace the allocated -EBUSY pointer in
    tcf_idr_check_alloc with the real action pointer, exposing it
    to all operations. This operation is only needed when the action pointer
    is created (ACT_P_CREATED). For actions which are bound to (returned 0),
    the pointer already resides in the idr making such operation a nop.

    Even though it's a nop, it's still not a cheap operation as internally
    the idr code walks the idr and then does a replace on the appropriate slot.
    So if the action was bound, better skip the idr replace entirely.

    Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
    Link: https://lore.kernel.org/r/20231211181807.96028-3-pctammela@mojatatu.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:23 +02:00
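A rough userspace sketch of the optimization in 5dc793543c above, assuming a plain array in place of the idr. All names here (`struct fake_table`, `fake_insert_many`, `FAKE_ACT_CREATED`) are hypothetical; only the shape of the loop is meant to match the description: the replace step runs for newly created actions and is skipped for bound ones, whose pointer is already published.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the "bound" (0) and ACT_P_CREATED init results. */
enum fake_init_result { FAKE_ACT_BOUND = 0, FAKE_ACT_CREATED = 1 };

struct fake_action {
    unsigned int index;
};

#define FAKE_TABLE_SIZE 8

struct fake_table {
    struct fake_action *slot[FAKE_TABLE_SIZE];
    unsigned int replaces; /* counts how often the costly replace runs */
};

/* Stand-in for the idr walk-and-replace step. */
static void fake_table_replace(struct fake_table *t, struct fake_action *a)
{
    t->slot[a->index % FAKE_TABLE_SIZE] = a;
    t->replaces++;
}

static void fake_insert_many(struct fake_table *t, struct fake_action *acts[],
                             const enum fake_init_result res[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (res[i] != FAKE_ACT_CREATED)
            continue; /* bound action: pointer is already in the table */
        fake_table_replace(t, acts[i]);
    }
}
```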
Ivan Vecera 162a9b1419 net/sched: cls_api: conditional notification of events
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit 93775590b1ee98bf2976b1f4a1ed24e9ff76170f
Author: Pedro Tammela <pctammela@mojatatu.com>
Date:   Fri Dec 8 16:28:47 2023 -0300

    net/sched: cls_api: conditional notification of events

    As of today tc-filter/chain events are unconditionally built and sent to
    RTNLGRP_TC. With the introduction of rtnl_notify_needed we can check
    beforehand whether they are really needed. This will help to alleviate
    system pressure when filters are concurrently added without the rtnl
    lock, as in tc-flower.

    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
    Link: https://lore.kernel.org/r/20231208192847.714940-8-pctammela@mojatatu.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:23 +02:00
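The check-before-build pattern from 162a9b1419 above might look like the following in a simplified userspace form. The struct and function names are hypothetical, and treating "has listeners or echo requested" as the condition is an assumption made for the sketch, not a statement about the exact kernel helper.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct fake_notifier {
    unsigned int listeners; /* subscribers to the multicast group */
    bool echo;              /* requester asked to get the event echoed back */
    unsigned int built;     /* number of messages actually built */
};

/* Assumed condition: someone is listening, or the sender wants an echo. */
static bool fake_notify_needed(const struct fake_notifier *n)
{
    return n->listeners > 0 || n->echo;
}

/* Returns 0 when the event is skipped, else the length of the built message. */
static int fake_notify_filter_event(struct fake_notifier *n, const char *what,
                                    char *buf, size_t len)
{
    if (!fake_notify_needed(n))
        return 0; /* nobody would read it: skip the allocation and fill work */

    n->built++;
    return snprintf(buf, len, "event: %s", what);
}
```

With no listeners and no echo request, the build step is never reached, which is where the saving under concurrent filter insertion would come from.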
Ivan Vecera 0c000768f9 net/sched: cls_api: remove 'unicast' argument from delete notification
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit e522755520ef63b121ddd5808197a370be212e9a
Author: Pedro Tammela <pctammela@mojatatu.com>
Date:   Fri Dec 8 16:28:46 2023 -0300

    net/sched: cls_api: remove 'unicast' argument from delete notification

    No caller ever sets this argument to true, so remove it as there's
    no need for it.

    Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Link: https://lore.kernel.org/r/20231208192847.714940-7-pctammela@mojatatu.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:22 +02:00
Ivan Vecera 9488742a2f net, sched: Fix SKB_NOT_DROPPED_YET splat under debug config
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit 40cb2fdfed342e7e578d551a073687789f698d89
Author: Jamal Hadi Salim <jhs@mojatatu.com>
Date:   Sat Oct 28 13:16:10 2023 -0400

    net, sched: Fix SKB_NOT_DROPPED_YET splat under debug config

    Getting the following splat [1] with CONFIG_DEBUG_NET=y and this
    reproducer [2]. Problem seems to be that classifiers clear 'struct
    tcf_result::drop_reason', thereby triggering the warning in
    __kfree_skb_reason() due to reason being 'SKB_NOT_DROPPED_YET' (0).

    Fixed by disambiguating a legit error from a verdict with a bogus drop_reason.

    [1]
    WARNING: CPU: 0 PID: 181 at net/core/skbuff.c:1082 kfree_skb_reason+0x38/0x130
    Modules linked in:
    CPU: 0 PID: 181 Comm: mausezahn Not tainted 6.6.0-rc6-custom-ge43e6d9582e0 #682
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
    RIP: 0010:kfree_skb_reason+0x38/0x130
    [...]
    Call Trace:
     <IRQ>
     __netif_receive_skb_core.constprop.0+0x837/0xdb0
     __netif_receive_skb_one_core+0x3c/0x70
     process_backlog+0x95/0x130
     __napi_poll+0x25/0x1b0
     net_rx_action+0x29b/0x310
     __do_softirq+0xc0/0x29b
     do_softirq+0x43/0x60
     </IRQ>

    [2]

    ip link add name veth0 type veth peer name veth1
    ip link set dev veth0 up
    ip link set dev veth1 up
    tc qdisc add dev veth1 clsact
    tc filter add dev veth1 ingress pref 1 proto all flower dst_mac 00:11:22:33:44:55 action drop
    mausezahn veth0 -a own -b 00:11:22:33:44:55 -q -c 1

    Ido reported:

      [...] getting the following splat [1] with CONFIG_DEBUG_NET=y and this
      reproducer [2]. Problem seems to be that classifiers clear 'struct
      tcf_result::drop_reason', thereby triggering the warning in
      __kfree_skb_reason() due to reason being 'SKB_NOT_DROPPED_YET' (0). [...]

      [1]
      WARNING: CPU: 0 PID: 181 at net/core/skbuff.c:1082 kfree_skb_reason+0x38/0x130
      Modules linked in:
      CPU: 0 PID: 181 Comm: mausezahn Not tainted 6.6.0-rc6-custom-ge43e6d9582e0 #682
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-1.fc37 04/01/2014
      RIP: 0010:kfree_skb_reason+0x38/0x130
      [...]
      Call Trace:
       <IRQ>
       __netif_receive_skb_core.constprop.0+0x837/0xdb0
       __netif_receive_skb_one_core+0x3c/0x70
       process_backlog+0x95/0x130
       __napi_poll+0x25/0x1b0
       net_rx_action+0x29b/0x310
       __do_softirq+0xc0/0x29b
       do_softirq+0x43/0x60
       </IRQ>

      [2]
      #!/bin/bash

      ip link add name veth0 type veth peer name veth1
      ip link set dev veth0 up
      ip link set dev veth1 up
      tc qdisc add dev veth1 clsact
      tc filter add dev veth1 ingress pref 1 proto all flower dst_mac 00:11:22:33:44:55 action drop
      mausezahn veth0 -a own -b 00:11:22:33:44:55 -q -c 1

    What happens is that inside most classifiers the tcf_result is copied over
    from a filter template e.g. *res = f->res which then implicitly overrides
    the prior SKB_DROP_REASON_TC_{INGRESS,EGRESS} default drop code which was
    set via sch_handle_{ingress,egress}() for kfree_skb_reason().

    Commit text above copied verbatim from Daniel. The general idea of the patch
    is not very different from what Ido originally posted but instead done at the
    cls_api codepath.

    Fixes: 54a59aed395c ("net, sched: Make tc-related drop reason more flexible")
    Reported-by: Ido Schimmel <idosch@idosch.org>
    Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Link: https://lore.kernel.org/netdev/ZTjY959R+AFXf3Xy@shredder
    Reviewed-by: Simon Horman <horms@kernel.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:20 +02:00
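The failure mode described in 9488742a2f above (`*res = f->res` overwriting the default drop code) can be reproduced in a few lines of plain C. The types and names below are hypothetical, and the "preserve" variant is only one illustrative way to keep the default across the copy; it is not presented as the upstream fix, which the commit text says was done in the cls_api codepath.

```c
enum fake_reason { FAKE_NOT_DROPPED_YET = 0, FAKE_TC_INGRESS = 1 };

struct fake_result {
    unsigned long classid;
    enum fake_reason drop_reason;
};

/* Filter template; its drop_reason is left at 0 when the filter is created. */
struct fake_filter {
    struct fake_result res;
};

/* Problem pattern: the whole-struct copy clobbers the caller's default. */
static void fake_classify_clobber(const struct fake_filter *f,
                                  struct fake_result *res)
{
    *res = f->res;
}

/* Illustrative alternative: keep the caller's default across the copy. */
static void fake_classify_preserve(const struct fake_filter *f,
                                   struct fake_result *res)
{
    enum fake_reason saved = res->drop_reason;

    *res = f->res;
    res->drop_reason = saved;
}
```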
Ivan Vecera 0ebceac108 net, sched: Add tcf_set_drop_reason for {__,}tcf_classify
JIRA: https://issues.redhat.com/browse/RHEL-36218

commit 39d08b91646d83e87f7cbcd846b3ef33b1a53b79
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Mon Oct 9 11:26:55 2023 +0200

    net, sched: Add tcf_set_drop_reason for {__,}tcf_classify

    Add an initial user for the newly added tcf_set_drop_reason() helper to set the
    drop reason for internal errors leading to TC_ACT_SHOT inside {__,}tcf_classify().

    Right now this only adds a very basic SKB_DROP_REASON_TC_ERROR as a generic
    fallback indicator to mark drop locations. Where needed, such locations can be
    converted to more specific codes, for example, when hitting the reclassification
    limit, etc.

    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Jamal Hadi Salim <jhs@mojatatu.com>
    Cc: Victor Nogueira <victor@mojatatu.com>
    Link: https://lore.kernel.org/r/20231009092655.22025-2-daniel@iogearbox.net
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2024-05-14 13:13:19 +02:00
Davide Caratti d49c690e6d net/sched: flower: Fix chain template offload
JIRA: https://issues.redhat.com/browse/RHEL-32137
JIRA: https://issues.redhat.com/browse/RHEL-31315
CVE: CVE-2024-26669
Upstream Status: net.git commit 32f2a0afa95fae0d1ceec2ff06e0e816939964b8

commit 32f2a0afa95fae0d1ceec2ff06e0e816939964b8
Author: Ido Schimmel <idosch@nvidia.com>
Date:   Mon Jan 22 15:28:43 2024 +0200

    net/sched: flower: Fix chain template offload

    When a qdisc is deleted from a net device the stack instructs the
    underlying driver to remove its flow offload callback from the
    associated filter block using the 'FLOW_BLOCK_UNBIND' command. The stack
    then continues to replay the removal of the filters in the block for
    this driver by iterating over the chains in the block and invoking the
    'reoffload' operation of the classifier being used. In turn, the
    classifier in its 'reoffload' operation prepares and emits a
    'FLOW_CLS_DESTROY' command for each filter.

    However, the stack does not do the same for chain templates and the
    underlying driver never receives a 'FLOW_CLS_TMPLT_DESTROY' command when
    a qdisc is deleted. This results in a memory leak [1] which can be
    reproduced using [2].

    Fix by introducing a 'tmplt_reoffload' operation and have the stack
    invoke it with the appropriate arguments as part of the replay.
    Implement the operation in the sole classifier that supports chain
    templates (flower) by emitting the 'FLOW_CLS_TMPLT_{CREATE,DESTROY}'
    command based on whether a flow offload callback is being bound to a
    filter block or being unbound from one.

    As far as I can tell, the issue has been present since the cited commit,
    which reordered tcf_block_offload_unbind() before
    tcf_block_flush_all_chains() in __tcf_block_put(). The order cannot be
    reversed as the filter block is expected to be freed after flushing all
    the chains.

    [1]
    unreferenced object 0xffff888107e28800 (size 2048):
      comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s)
      hex dump (first 32 bytes):
        b1 a6 7c 11 81 88 ff ff e0 5b b3 10 81 88 ff ff  ..|......[......
        01 00 00 00 00 00 00 00 e0 aa b0 84 ff ff ff ff  ................
      backtrace:
        [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320
        [<ffffffff81ab374e>] __kmalloc+0x4e/0x90
        [<ffffffff832aec6d>] mlxsw_sp_acl_ruleset_get+0x34d/0x7a0
        [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180
        [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280
        [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340
        [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0
        [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170
        [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0
        [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440
        [<ffffffff83ac6270>] netlink_unicast+0x540/0x820
        [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0
        [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80
        [<ffffffff8379d29a>] ___sys_sendmsg+0x13a/0x1e0
        [<ffffffff8379d50c>] __sys_sendmsg+0x11c/0x1f0
        [<ffffffff843b9ce0>] do_syscall_64+0x40/0xe0
    unreferenced object 0xffff88816d2c0400 (size 1024):
      comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s)
      hex dump (first 32 bytes):
        40 00 00 00 00 00 00 00 57 f6 38 be 00 00 00 00  @.......W.8.....
        10 04 2c 6d 81 88 ff ff 10 04 2c 6d 81 88 ff ff  ..,m......,m....
      backtrace:
        [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320
        [<ffffffff81ab36c1>] __kmalloc_node+0x51/0x90
        [<ffffffff81a8ed96>] kvmalloc_node+0xa6/0x1f0
        [<ffffffff82827d03>] bucket_table_alloc.isra.0+0x83/0x460
        [<ffffffff82828d2b>] rhashtable_init+0x43b/0x7c0
        [<ffffffff832aed48>] mlxsw_sp_acl_ruleset_get+0x428/0x7a0
        [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180
        [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280
        [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340
        [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0
        [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170
        [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0
        [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440
        [<ffffffff83ac6270>] netlink_unicast+0x540/0x820
        [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0
        [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80

    [2]
     # tc qdisc add dev swp1 clsact
     # tc chain add dev swp1 ingress proto ip chain 1 flower dst_ip 0.0.0.0/32
     # tc qdisc del dev swp1 clsact
     # devlink dev reload pci/0000:06:00.0

    Fixes: bbf73830cd ("net: sched: traverse chains in block with tcf_get_next_chain()")
    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2024-04-16 14:29:54 +02:00
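A simplified model of the replay described in d49c690e6d above: when a callback is bound or unbound, each chain that carries a template emits a create or destroy command to the driver. Everything here is a hypothetical stand-in (the driver just counts live templates); the sketch only tries to show why a missing destroy on unbind leaves driver state behind.

```c
#include <stdbool.h>
#include <stddef.h>

enum fake_tmplt_cmd { FAKE_TMPLT_CREATE, FAKE_TMPLT_DESTROY };

struct fake_driver {
    int live_templates; /* resources the driver currently holds */
};

/* Stand-in for the driver's flow offload callback. */
static void fake_driver_cb(struct fake_driver *d, enum fake_tmplt_cmd cmd)
{
    if (cmd == FAKE_TMPLT_CREATE)
        d->live_templates++;
    else
        d->live_templates--;
}

struct fake_chain {
    bool has_template;
};

/* Per-classifier operation: emit CREATE on bind (add) and DESTROY on unbind. */
static void fake_tmplt_reoffload(const struct fake_chain *c, bool add,
                                 struct fake_driver *d)
{
    if (!c->has_template)
        return;
    fake_driver_cb(d, add ? FAKE_TMPLT_CREATE : FAKE_TMPLT_DESTROY);
}

/* Replay over all chains in the block for one driver callback. */
static void fake_replay(const struct fake_chain *chains, size_t n, bool add,
                        struct fake_driver *d)
{
    for (size_t i = 0; i < n; i++)
        fake_tmplt_reoffload(&chains[i], add, d);
}
```

If the unbind replay skipped templates, `live_templates` would stay above zero after the qdisc is gone, which corresponds to the leak reported in [1].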
Ivan Vecera 878a53d83c net: sched: move rtm_tca_policy declaration to include file
JIRA: https://issues.redhat.com/browse/RHEL-1773

commit 886bc7d6ed3357975c5f1d3c784da96000d4bbb4
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Jun 6 11:42:33 2023 +0000

    net: sched: move rtm_tca_policy declaration to include file

    rtm_tca_policy is used from net/sched/sch_api.c and net/sched/cls_api.c,
    thus should be declared in an include file.

    This fixes the following sparse warning:
    net/sched/sch_api.c:1434:25: warning: symbol 'rtm_tca_policy' was not declared. Should it be static?

    Fixes: e331473fee ("net/sched: cls_api: add missing validation of netlink attributes")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-10-13 09:03:10 +02:00
Davide Caratti 0e5ce74663 net/sched: cls_api: remove block_cb from driver_list before freeing
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2219411
Upstream Status: net.git commit da94a7781fc3

commit da94a7781fc3c92e7df7832bc2746f4d39bc624e
Author: Vlad Buslov <vladbu@nvidia.com>
Date:   Wed Apr 26 14:31:11 2023 +0200

    net/sched: cls_api: remove block_cb from driver_list before freeing

    Error handler of tcf_block_bind() frees the whole bo->cb_list on error.
    However, by that time the flow_block_cb instances are already in the driver
    list because driver ndo_setup_tc() callback is called before that up the
    call chain in tcf_block_offload_cmd(). This leaves dangling pointers to
    freed objects in the list and causes use-after-free[0]. Fix it by also
    removing flow_block_cb instances from driver_list before deallocating them.

    [0]:
    [  279.868433] ==================================================================
    [  279.869964] BUG: KASAN: slab-use-after-free in flow_block_cb_setup_simple+0x631/0x7c0
    [  279.871527] Read of size 8 at addr ffff888147e2bf20 by task tc/2963

    [  279.873151] CPU: 6 PID: 2963 Comm: tc Not tainted 6.3.0-rc6+ #4
    [  279.874273] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    [  279.876295] Call Trace:
    [  279.876882]  <TASK>
    [  279.877413]  dump_stack_lvl+0x33/0x50
    [  279.878198]  print_report+0xc2/0x610
    [  279.878987]  ? flow_block_cb_setup_simple+0x631/0x7c0
    [  279.879994]  kasan_report+0xae/0xe0
    [  279.880750]  ? flow_block_cb_setup_simple+0x631/0x7c0
    [  279.881744]  ? mlx5e_tc_reoffload_flows_work+0x240/0x240 [mlx5_core]
    [  279.883047]  flow_block_cb_setup_simple+0x631/0x7c0
    [  279.884027]  tcf_block_offload_cmd.isra.0+0x189/0x2d0
    [  279.885037]  ? tcf_block_setup+0x6b0/0x6b0
    [  279.885901]  ? mutex_lock+0x7d/0xd0
    [  279.886669]  ? __mutex_unlock_slowpath.constprop.0+0x2d0/0x2d0
    [  279.887844]  ? ingress_init+0x1c0/0x1c0 [sch_ingress]
    [  279.888846]  tcf_block_get_ext+0x61c/0x1200
    [  279.889711]  ingress_init+0x112/0x1c0 [sch_ingress]
    [  279.890682]  ? clsact_init+0x2b0/0x2b0 [sch_ingress]
    [  279.891701]  qdisc_create+0x401/0xea0
    [  279.892485]  ? qdisc_tree_reduce_backlog+0x470/0x470
    [  279.893473]  tc_modify_qdisc+0x6f7/0x16d0
    [  279.894344]  ? tc_get_qdisc+0xac0/0xac0
    [  279.895213]  ? mutex_lock+0x7d/0xd0
    [  279.896005]  ? __mutex_lock_slowpath+0x10/0x10
    [  279.896910]  rtnetlink_rcv_msg+0x5fe/0x9d0
    [  279.897770]  ? rtnl_calcit.isra.0+0x2b0/0x2b0
    [  279.898672]  ? __sys_sendmsg+0xb5/0x140
    [  279.899494]  ? do_syscall_64+0x3d/0x90
    [  279.900302]  ? entry_SYSCALL_64_after_hwframe+0x46/0xb0
    [  279.901337]  ? kasan_save_stack+0x2e/0x40
    [  279.902177]  ? kasan_save_stack+0x1e/0x40
    [  279.903058]  ? kasan_set_track+0x21/0x30
    [  279.903913]  ? kasan_save_free_info+0x2a/0x40
    [  279.904836]  ? ____kasan_slab_free+0x11a/0x1b0
    [  279.905741]  ? kmem_cache_free+0x179/0x400
    [  279.906599]  netlink_rcv_skb+0x12c/0x360
    [  279.907450]  ? rtnl_calcit.isra.0+0x2b0/0x2b0
    [  279.908360]  ? netlink_ack+0x1550/0x1550
    [  279.909192]  ? rhashtable_walk_peek+0x170/0x170
    [  279.910135]  ? kmem_cache_alloc_node+0x1af/0x390
    [  279.911086]  ? _copy_from_iter+0x3d6/0xc70
    [  279.912031]  netlink_unicast+0x553/0x790
    [  279.912864]  ? netlink_attachskb+0x6a0/0x6a0
    [  279.913763]  ? netlink_recvmsg+0x416/0xb50
    [  279.914627]  netlink_sendmsg+0x7a1/0xcb0
    [  279.915473]  ? netlink_unicast+0x790/0x790
    [  279.916334]  ? iovec_from_user.part.0+0x4d/0x220
    [  279.917293]  ? netlink_unicast+0x790/0x790
    [  279.918159]  sock_sendmsg+0xc5/0x190
    [  279.918938]  ____sys_sendmsg+0x535/0x6b0
    [  279.919813]  ? import_iovec+0x7/0x10
    [  279.920601]  ? kernel_sendmsg+0x30/0x30
    [  279.921423]  ? __copy_msghdr+0x3c0/0x3c0
    [  279.922254]  ? import_iovec+0x7/0x10
    [  279.923041]  ___sys_sendmsg+0xeb/0x170
    [  279.923854]  ? copy_msghdr_from_user+0x110/0x110
    [  279.924797]  ? ___sys_recvmsg+0xd9/0x130
    [  279.925630]  ? __perf_event_task_sched_in+0x183/0x470
    [  279.926656]  ? ___sys_sendmsg+0x170/0x170
    [  279.927529]  ? ctx_sched_in+0x530/0x530
    [  279.928369]  ? update_curr+0x283/0x4f0
    [  279.929185]  ? perf_event_update_userpage+0x570/0x570
    [  279.930201]  ? __fget_light+0x57/0x520
    [  279.931023]  ? __switch_to+0x53d/0xe70
    [  279.931846]  ? sockfd_lookup_light+0x1a/0x140
    [  279.932761]  __sys_sendmsg+0xb5/0x140
    [  279.933560]  ? __sys_sendmsg_sock+0x20/0x20
    [  279.934436]  ? fpregs_assert_state_consistent+0x1d/0xa0
    [  279.935490]  do_syscall_64+0x3d/0x90
    [  279.936300]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
    [  279.937311] RIP: 0033:0x7f21c814f887
    [  279.938085] Code: 0a 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
    [  279.941448] RSP: 002b:00007fff11efd478 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    [  279.942964] RAX: ffffffffffffffda RBX: 0000000064401979 RCX: 00007f21c814f887
    [  279.944337] RDX: 0000000000000000 RSI: 00007fff11efd4e0 RDI: 0000000000000003
    [  279.945660] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
    [  279.947003] R10: 00007f21c8008708 R11: 0000000000000246 R12: 0000000000000001
    [  279.948345] R13: 0000000000409980 R14: 000000000047e538 R15: 0000000000485400
    [  279.949690]  </TASK>

    [  279.950706] Allocated by task 2960:
    [  279.951471]  kasan_save_stack+0x1e/0x40
    [  279.952338]  kasan_set_track+0x21/0x30
    [  279.953165]  __kasan_kmalloc+0x77/0x90
    [  279.954006]  flow_block_cb_setup_simple+0x3dd/0x7c0
    [  279.955001]  tcf_block_offload_cmd.isra.0+0x189/0x2d0
    [  279.956020]  tcf_block_get_ext+0x61c/0x1200
    [  279.956881]  ingress_init+0x112/0x1c0 [sch_ingress]
    [  279.957873]  qdisc_create+0x401/0xea0
    [  279.958656]  tc_modify_qdisc+0x6f7/0x16d0
    [  279.959506]  rtnetlink_rcv_msg+0x5fe/0x9d0
    [  279.960392]  netlink_rcv_skb+0x12c/0x360
    [  279.961216]  netlink_unicast+0x553/0x790
    [  279.962044]  netlink_sendmsg+0x7a1/0xcb0
    [  279.962906]  sock_sendmsg+0xc5/0x190
    [  279.963702]  ____sys_sendmsg+0x535/0x6b0
    [  279.964534]  ___sys_sendmsg+0xeb/0x170
    [  279.965343]  __sys_sendmsg+0xb5/0x140
    [  279.966132]  do_syscall_64+0x3d/0x90
    [  279.966908]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

    [  279.968407] Freed by task 2960:
    [  279.969114]  kasan_save_stack+0x1e/0x40
    [  279.969929]  kasan_set_track+0x21/0x30
    [  279.970729]  kasan_save_free_info+0x2a/0x40
    [  279.971603]  ____kasan_slab_free+0x11a/0x1b0
    [  279.972483]  __kmem_cache_free+0x14d/0x280
    [  279.973337]  tcf_block_setup+0x29d/0x6b0
    [  279.974173]  tcf_block_offload_cmd.isra.0+0x226/0x2d0
    [  279.975186]  tcf_block_get_ext+0x61c/0x1200
    [  279.976080]  ingress_init+0x112/0x1c0 [sch_ingress]
    [  279.977065]  qdisc_create+0x401/0xea0
    [  279.977857]  tc_modify_qdisc+0x6f7/0x16d0
    [  279.978695]  rtnetlink_rcv_msg+0x5fe/0x9d0
    [  279.979562]  netlink_rcv_skb+0x12c/0x360
    [  279.980388]  netlink_unicast+0x553/0x790
    [  279.981214]  netlink_sendmsg+0x7a1/0xcb0
    [  279.982043]  sock_sendmsg+0xc5/0x190
    [  279.982827]  ____sys_sendmsg+0x535/0x6b0
    [  279.983703]  ___sys_sendmsg+0xeb/0x170
    [  279.984510]  __sys_sendmsg+0xb5/0x140
    [  279.985298]  do_syscall_64+0x3d/0x90
    [  279.986076]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

    [  279.987532] The buggy address belongs to the object at ffff888147e2bf00
                    which belongs to the cache kmalloc-192 of size 192
    [  279.989747] The buggy address is located 32 bytes inside of
                    freed 192-byte region [ffff888147e2bf00, ffff888147e2bfc0)

    [  279.992367] The buggy address belongs to the physical page:
    [  279.993430] page:00000000550f405c refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x147e2a
    [  279.995182] head:00000000550f405c order:1 entire_mapcount:0 nr_pages_mapped:0 pincount:0
    [  279.996713] anon flags: 0x200000000010200(slab|head|node=0|zone=2)
    [  279.997878] raw: 0200000000010200 ffff888100042a00 0000000000000000 dead000000000001
    [  279.999384] raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000
    [  280.000894] page dumped because: kasan: bad access detected

    [  280.002386] Memory state around the buggy address:
    [  280.003338]  ffff888147e2be00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [  280.004781]  ffff888147e2be80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
    [  280.006224] >ffff888147e2bf00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [  280.007700]                                ^
    [  280.008592]  ffff888147e2bf80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
    [  280.010035]  ffff888147e2c000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [  280.011564] ==================================================================

    Fixes: 59094b1e50 ("net: sched: use flow block API")
    Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2023-07-03 17:03:32 +02:00
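The use-after-free in 0e5ce74663 above comes from an object that sits on two lists while the error path unlinks it from only one. A minimal userspace sketch, with a hand-rolled list and hypothetical names (`struct fake_block_cb`, `fake_unwind_bind`), shows the corrected unwind: remove the object from both lists, then free it.

```c
#include <stdlib.h>

struct fake_node {
    struct fake_node *prev, *next;
};

static void fake_list_init(struct fake_node *h)
{
    h->prev = h->next = h;
}

static void fake_list_add(struct fake_node *n, struct fake_node *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

static void fake_list_del(struct fake_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
}

static int fake_list_empty(const struct fake_node *h)
{
    return h->next == h;
}

struct fake_block_cb {
    struct fake_node list;        /* first member: link in the per-bind cb_list */
    struct fake_node driver_list; /* link in the driver's own list */
};

/* Driver side: allocate a callback and put it on both lists. */
static struct fake_block_cb *fake_cb_alloc(struct fake_node *cb_list,
                                           struct fake_node *driver_list)
{
    struct fake_block_cb *cb = malloc(sizeof(*cb));

    if (!cb)
        return NULL;
    fake_list_add(&cb->list, cb_list);
    fake_list_add(&cb->driver_list, driver_list);
    return cb;
}

/* Error path: unlink from BOTH lists before freeing, so no list keeps
 * a pointer to freed memory. */
static void fake_unwind_bind(struct fake_node *cb_list)
{
    while (!fake_list_empty(cb_list)) {
        /* valid cast: 'list' is the first member of struct fake_block_cb */
        struct fake_block_cb *cb = (struct fake_block_cb *)cb_list->next;

        fake_list_del(&cb->list);
        fake_list_del(&cb->driver_list);
        free(cb);
    }
}
```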
Davide Caratti adae2111f2 net/sched: cls_api: Fix lockup on flushing explicitly created chain
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2219411
Upstream Status: net.git commit c9a82bec02c3

commit c9a82bec02c339cdda99b37c5e62b3b71fc4209c
Author: Vlad Buslov <vladbu@nvidia.com>
Date:   Mon Jun 12 11:34:26 2023 +0200

    net/sched: cls_api: Fix lockup on flushing explicitly created chain

    Mingshuai Ren reports:

    When a new chain is added by using tc, a soft lockup alarm is
    generated after deleting the prio 0 filter of the chain. To reproduce
    the problem, perform the following steps:
    (1) tc qdisc add dev eth0 root handle 1: htb default 1
    (2) tc chain add dev eth0
    (3) tc filter del dev eth0 chain 0 parent 1: prio 0
    (4) tc filter add dev eth0 chain 0 parent 1:

    Fix the issue by accounting for additional reference to chains that are
    explicitly created by RTM_NEWCHAIN message as opposed to implicitly by
    RTM_NEWTFILTER message.

    Fixes: 726d061286 ("net: sched: prevent insertion of new classifiers during chain flush")
    Reported-by: Mingshuai Ren <renmingshuai@huawei.com>
    Closes: https://lore.kernel.org/lkml/87legswvi3.fsf@nvidia.com/T/
    Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
    Link: https://lore.kernel.org/r/20230612093426.2867183-1-vladbu@nvidia.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2023-07-03 17:03:31 +02:00
Davide Caratti 2b5d2b606a net: sched: fix possible refcount leak in tc_chain_tmplt_add()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2219411
Upstream Status: net.git commit 44f8baaf230c

commit 44f8baaf230c655c249467ca415b570deca8df77
Author: Hangyu Hua <hbh25y@gmail.com>
Date:   Wed Jun 7 10:23:01 2023 +0800

    net: sched: fix possible refcount leak in tc_chain_tmplt_add()

    try_module_get will be called in tcf_proto_lookup_ops. So module_put needs
    to be called to drop the refcount if ops don't implement the required
    function.

    Fixes: 9f407f1768 ("net: sched: introduce chain templates")
    Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
    Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
2023-07-03 17:03:29 +02:00
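The leak fixed by 2b5d2b606a above follows a common pattern: a lookup takes a reference, and an early error return forgets to drop it. The sketch below models that with a plain counter; `struct fake_ops`, `fake_lookup_ops`, and `fake_put_ops` are hypothetical stand-ins for the ops lookup and the module get/put pair, and the error codes are chosen only for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct fake_ops {
    const char *kind;
    int refcnt;                /* stand-in for the module refcount */
    int (*tmplt_create)(void); /* optional; NULL if templates are unsupported */
};

/* A trivial template-create callback used by the example table. */
static int fake_dummy_create(void)
{
    return 0;
}

/* Lookup takes a reference on success, like a try_module_get() would. */
static struct fake_ops *fake_lookup_ops(struct fake_ops *table, size_t n,
                                        const char *kind)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(table[i].kind, kind) == 0) {
            table[i].refcnt++;
            return &table[i];
        }
    }
    return NULL;
}

static void fake_put_ops(struct fake_ops *ops)
{
    ops->refcnt--;
}

static int fake_tmplt_add(struct fake_ops *table, size_t n, const char *kind)
{
    struct fake_ops *ops = fake_lookup_ops(table, n, kind);

    if (!ops)
        return -ENOENT;
    if (!ops->tmplt_create) {
        fake_put_ops(ops); /* drop the reference taken by the lookup */
        return -EOPNOTSUPP;
    }
    return ops->tmplt_create(); /* reference is kept while the template lives */
}
```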
Ivan Vecera 738c62f76e net/sched: cls_api: Initialize miss_cookie_node when action miss is not used
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172886

commit 2cc8a008d62f3c04eeb7ec6fe59e542802bb8df3
Author: Ivan Vecera <ivecera@redhat.com>
Date:   Thu Apr 20 20:36:33 2023 +0200

    net/sched: cls_api: Initialize miss_cookie_node when action miss is not used

    Function tcf_exts_init_ex() sets the exts->miss_cookie_node pointer only
    when use_action_miss is true, so in the other case it assumes that
    the caller has already set the field to NULL. If the caller has not,
    the field contains garbage and a subsequent tcf_exts_destroy() call
    results in a crash.
    Ensure that the .miss_cookie_node pointer is NULL when the
    use_action_miss parameter is false to avoid this potential scenario.

    Fixes: 80cd22c35c90 ("net/sched: cls_api: Support hardware miss to tc action")
    Signed-off-by: Ivan Vecera <ivecera@redhat.com>
    Reviewed-by: Pedro Tammela <pctammela@mojatatu.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Link: https://lore.kernel.org/r/20230420183634.1139391-1-ivecera@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-05-10 20:48:56 +02:00
Ivan Vecera 2e88db0644 net/sched: clear actions pointer in miss cookie init fail
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172886

commit 338469d677e5d426f5ada88761f16f6d2c7c1981
Author: Pedro Tammela <pctammela@mojatatu.com>
Date:   Sat Apr 15 12:33:09 2023 -0300

    net/sched: clear actions pointer in miss cookie init fail

    Palash reports a UAF when using a modified version of syzkaller[1].

    When 'tcf_exts_miss_cookie_base_alloc()' fails in 'tcf_exts_init_ex()'
    a call to 'tcf_exts_destroy()' is made to free up the tcf_exts
    resources.
    In flower, '__fl_put()' is called when 'tcf_exts_init_ex()' fails.
    This leads to 'tcf_exts_destroy()' being called again, which triggers
    a UAF since the already freed tcf_exts action pointer is lingering in
    the struct.

    Before the offending patch, this was not an issue since there was no
    case where the tcf_exts action pointer could linger. Therefore, restore
    the old semantic by clearing the action pointer in case of a failure to
    initialize the miss_cookie.

    [1] https://github.com/cmu-pasta/linux-kernel-enriched-corpus

    v1->v2: Fix compilation on configs without tc actions (kernel test robot)

    Fixes: 80cd22c35c90 ("net/sched: cls_api: Support hardware miss to tc action")
    Reported-by: Palash Oswal <oswalpalash@gmail.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-05-10 20:48:56 +02:00
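The lingering-pointer problem in 2e88db0644 above reduces to a familiar rule: once an init function has freed a member on its error path, it should also clear the pointer, because callers may still run the normal destroy routine. A small sketch with hypothetical names (`struct fake_exts`, `fail_cookie` as a way to force the error path):

```c
#include <stdlib.h>

struct fake_exts {
    void *actions;
};

/* fail_cookie simulates a failure of the miss cookie allocation. */
static int fake_exts_init(struct fake_exts *e, int fail_cookie)
{
    e->actions = calloc(4, sizeof(void *));
    if (!e->actions)
        return -1;

    if (fail_cookie) {
        free(e->actions);
        e->actions = NULL; /* without this, a later destroy would free twice */
        return -1;
    }
    return 0;
}

/* Callers may invoke this even after a failed init. */
static void fake_exts_destroy(struct fake_exts *e)
{
    free(e->actions); /* free(NULL) is a no-op */
    e->actions = NULL;
}
```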
Ivan Vecera 9062deee24 net/sched: cls_api: Move call to tcf_exts_miss_cookie_base_destroy()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172886

commit 37e1f3acc339b28493eb3dad571c3f01b6af86f6
Author: Nathan Chancellor <nathan@kernel.org>
Date:   Fri Feb 24 11:18:49 2023 -0700

    net/sched: cls_api: Move call to tcf_exts_miss_cookie_base_destroy()

    When CONFIG_NET_CLS_ACT is disabled:

      ../net/sched/cls_api.c:141:13: warning: 'tcf_exts_miss_cookie_base_destroy' defined but not used [-Wunused-function]
        141 | static void tcf_exts_miss_cookie_base_destroy(struct tcf_exts *exts)
            |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Due to the way the code is structured, it is possible for a definition
    of tcf_exts_miss_cookie_base_destroy() to be present without actually
    being used. Its single callsite is in an '#ifdef CONFIG_NET_CLS_ACT'
    block but a definition will always be present in the file. The version
    of tcf_exts_miss_cookie_base_destroy() that actually does something
    depends on CONFIG_NET_TC_SKB_EXT, so the stub function is used in both
    CONFIG_NET_CLS_ACT=n and CONFIG_NET_CLS_ACT=y + CONFIG_NET_TC_SKB_EXT=n
    configurations.

    Move the call to tcf_exts_miss_cookie_base_destroy() in
    tcf_exts_destroy() out of the '#ifdef CONFIG_NET_CLS_ACT', so that it
    always appears used to the compiler, while not changing any behavior
    with any of the various configuration combinations.

    Fixes: 80cd22c35c90 ("net/sched: cls_api: Support hardware miss to tc action")
    Signed-off-by: Nathan Chancellor <nathan@kernel.org>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-05-10 20:48:55 +02:00
Ivan Vecera f7627d50bf net/sched: cls_api: Support hardware miss to tc action
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172886

commit 80cd22c35c9001fe72bf614d29439de41933deca
Author: Paul Blakey <paulb@nvidia.com>
Date:   Sat Feb 18 00:36:14 2023 +0200

    net/sched: cls_api: Support hardware miss to tc action

    For drivers to support partial offload of a filter's action list,
    add support for action miss to specify an action instance to
    continue from in sw.

    CT action in particular can't be fully offloaded, as new connections
    need to be handled in software. This imposes other limitations on
    the actions that can be offloaded together with the CT action, such
    as packet modifications.

    Assign each action on a filter's action list a unique miss_cookie
    which drivers can then use to fill action_miss part of the tc skb
    extension. On getting back this miss_cookie, find the action
    instance with relevant cookie and continue classifying from there.

    Signed-off-by: Paul Blakey <paulb@nvidia.com>
    Reviewed-by: Jiri Pirko <jiri@nvidia.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-05-10 20:48:55 +02:00
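The miss cookie scheme from f7627d50bf above can be pictured as follows: every action in a filter's list gets a unique cookie, hardware reports the cookie of the action it stopped at, and software resumes from that position. The layout used here (a per-filter base in the high 32 bits, the action position in the low 32 bits) and all names are assumptions made for this sketch, not a description of the kernel's data structures.

```c
#include <stddef.h>
#include <stdint.h>

#define FAKE_MAX_ACTIONS 8

struct fake_act_list {
    const char *name[FAKE_MAX_ACTIONS];
    uint64_t cookie[FAKE_MAX_ACTIONS];
    size_t n;
};

/* Assumed layout: high 32 bits = per-filter base, low 32 bits = position. */
static void fake_assign_cookies(struct fake_act_list *l, uint32_t base)
{
    for (size_t i = 0; i < l->n; i++)
        l->cookie[i] = ((uint64_t)base << 32) | (uint64_t)i;
}

/* Returns the position to resume from in software, or -1 if unknown. */
static long fake_resume_index(const struct fake_act_list *l,
                              uint64_t miss_cookie)
{
    for (size_t i = 0; i < l->n; i++) {
        if (l->cookie[i] == miss_cookie)
            return (long)i;
    }
    return -1;
}
```

For a list such as ct, pedit, mirred, a miss reported with the second action's cookie would let software continue at pedit instead of re-running ct.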
Ivan Vecera a9c4a3bda2 net/sched: Rename user cookie and act cookie
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172886

Conflicts:
- hunk for mlx5 was skipped as it is not applicable due to absence of
  commit cca7eac13856 ("net/mlx5e: TC, store tc action cookies per attr")

commit db4b49025c0c7116f1d2dfe8d5bbfc983ac054de
Author: Paul Blakey <paulb@nvidia.com>
Date:   Sat Feb 18 00:36:13 2023 +0200

    net/sched: Rename user cookie and act cookie

    struct tc_action->act_cookie is a user defined cookie,
    and the related struct flow_action_entry->act_cookie is
    used as a handle similar to struct flow_cls_offload->cookie.

    Rename tc_action->act_cookie to user_cookie, and
    flow_action_entry->act_cookie to cookie so their names
    would better fit their usage.

    Signed-off-by: Paul Blakey <paulb@nvidia.com>
    Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-05-10 20:48:54 +02:00
Ivan Vecera 144d78951e net/sched: introduce flow_offload action cookie
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172886

commit d307b2c6f962ad5d83d7a7df71c2e9c9e4106d82
Author: Oz Shlomo <ozsh@nvidia.com>
Date:   Sun Feb 12 15:25:15 2023 +0200

    net/sched: introduce flow_offload action cookie

    Currently a hardware action is uniquely identified by the <id, hw_index>
    tuple. However, the id is set by the flow_act_setup callback and tc core
    cannot enforce this, and it is possible that a future change could break
    this. In addition, <id, hw_index> are not unique across network namespaces.

    Uniquely identify the action by setting an action cookie by the tc core.
    Use the unique action cookie to query the action's hardware stats.

    Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
    Reviewed-by: Simon Horman <simon.horman@corigine.com>
    Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-05-10 20:48:53 +02:00
Ivan Vecera 8f592ab576 sched: add new attr TCA_EXT_WARN_MSG to report tc extact message
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172886

commit 0349b8779cc949ad9e6aced32672ee48cf79b497
Author: Hangbin Liu <liuhangbin@gmail.com>
Date:   Fri Jan 13 11:43:53 2023 +0800

    sched: add new attr TCA_EXT_WARN_MSG to report tc extact message

    We will report extack message if there is an error via netlink_ack(). But
    if the rule is not to be exclusively executed by the hardware, extack is not
    passed along and offloading failures don't get logged.

    In commit 81c7288b17 ("sched: cls: enable verbose logging") Marcelo
    made cls able to log verbose info for offloading failures, which helps
    improve Open vSwitch debuggability when using flower offloading.

    It would also be helpful if userspace monitor tools, like "tc monitor",
    could log this kind of message, as it doesn't require vswitchd log level
    adjustment. Let's add a new tc attribute to report the extack message so
    the monitor program can receive the failures, e.g.:

      # tc monitor
      added chain dev enp3s0f1np1 parent ffff: chain 0
      added filter dev enp3s0f1np1 ingress protocol all pref 49152 flower chain 0 handle 0x1
        ct_state +trk+new
        not_in_hw
              action order 1: gact action drop
               random type none pass val 0
               index 1 ref 1 bind 1

      Warning: mlx5_core: matching on ct_state +new isn't supported.

    In this patch I only report the extack message on add/del operations.
    It doesn't look like we need to report the extack message on get/dump
    operations.

    Note that this message is not only reported to multicast groups; it can
    also be reported via unicast, which may affect the behavior of current
    userspace tools.

    Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Link: https://lore.kernel.org/r/20230113034353.2766735-1-liuhangbin@gmail.com
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-05-10 20:48:49 +02:00
Ivan Vecera 69c91cc775 net/sched: avoid indirect classify functions on retpoline kernels
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172886

commit 9f3101dca3a7c69027c65770ac28803768efefa5
Author: Pedro Tammela <pctammela@mojatatu.com>
Date:   Tue Dec 6 10:55:13 2022 -0300

    net/sched: avoid indirect classify functions on retpoline kernels

    Expose the necessary tc classifier functions and wire up cls_api to use
    direct calls in retpoline kernels.

    Signed-off-by: Pedro Tammela <pctammela@mojatatu.com>
    Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Reviewed-by: Victor Nogueira <victor@mojatatu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-05-10 20:48:49 +02:00
Ivan Vecera a85ba0b65d net/sched: cls_api: remove redundant 0 check in tcf_qevent_init()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172886

commit 2e5fb3223261366d1673c3827190c85a74b1aa56
Author: Zhengchao Shao <shaozhengchao@huawei.com>
Date:   Thu Sep 1 09:16:17 2022 +0800

    net/sched: cls_api: remove redundant 0 check in tcf_qevent_init()

    tcf_qevent_parse_block_index() never returns a zero block_index.
    Therefore, it is unnecessary to check the value of block_index in
    tcf_qevent_init().

    Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
    Link: https://lore.kernel.org/r/20220901011617.14105-1-shaozhengchao@huawei.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-05-10 20:48:40 +02:00
Ivan Vecera 9968b36488 net: sched: remove duplicate check of user rights in qdisc
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2172886

commit ab4850819176a92864f6ebd6c932ed926a337054
Author: Zhengchao Shao <shaozhengchao@huawei.com>
Date:   Fri Aug 19 12:18:54 2022 +0800

    net: sched: remove duplicate check of user rights in qdisc

    In rtnetlink_rcv_msg function, the permission for all user operations
    is checked except the GET operation, which is the same as the checking
    in qdisc. Therefore, remove the user rights check in qdisc.

    Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
    Message-Id: <20220819041854.83372-1-shaozhengchao@huawei.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-05-10 20:48:39 +02:00
Ivan Vecera 8e121a06f3 act_skbedit: skbedit queue mapping for receive queue
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178209

commit 4a6a676f8c16ec17d2f8d69ce3b5d680277ed0d2
Author: Amritha Nambiar <amritha.nambiar@intel.com>
Date:   Fri Oct 21 00:58:39 2022 -0700

    act_skbedit: skbedit queue mapping for receive queue

    Add support for skbedit queue mapping action on receive
    side. This is supported only in hardware, so the skip_sw
    flag is enforced. This enables offloading filters for
    receive queue selection in the hardware using the
    skbedit action. Traffic arrives on the Rx queue requested
    in the skbedit action parameter. A new tc action flag
    TCA_ACT_FLAGS_AT_INGRESS is introduced to identify the
    traffic direction the action queue_mapping is requested
    on during filter addition. This is used to disallow
    offloading the skbedit queue mapping action on transmit
    side.

    Example:
    $tc filter add dev $IFACE ingress protocol ip flower dst_ip $DST_IP\
     action skbedit queue_mapping $rxq_id skip_sw

    Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
    Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2023-03-22 17:51:42 +01:00
Ivan Vecera 52d10c9a8d net: sched: fix possible refcount leak in tc_new_tfilter()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit c2e1cfefcac35e0eea229e148c8284088ce437b5
Author: Hangyu Hua <hbh25y@gmail.com>
Date:   Wed Sep 21 17:27:34 2022 +0800

    net: sched: fix possible refcount leak in tc_new_tfilter()

    tfilter_put needs to be called to put the refcount obtained by
    tp->ops->get to avoid a possible refcount leak when
    chain->tmplt_ops != NULL and chain->tmplt_ops != tp->ops.
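
The get/put balance at stake can be sketched in user-space C. This is an illustrative sketch only: `struct demo_filter`, `demo_get()`, `demo_put()` and `demo_update()` are hypothetical names, and `ops_mismatch` merely stands in for the template-ops check named above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical refcounted object; not a kernel structure. */
struct demo_filter {
    int refcnt;
};

static void demo_get(struct demo_filter *f)
{
    f->refcnt++;
}

static void demo_put(struct demo_filter *f)
{
    assert(f->refcnt > 0);
    f->refcnt--;
}

/*
 * Take a reference, validate, then do the work. Every exit path,
 * including the early error exit, drops the reference it took.
 */
static int demo_update(struct demo_filter *f, int ops_mismatch)
{
    int err = 0;

    demo_get(f);
    if (ops_mismatch) {
        err = -EINVAL;
        goto out_put;   /* without this put the reference would leak */
    }
    /* ... the actual update would happen here ... */
out_put:
    demo_put(f);
    return err;
}
```

After either outcome of `demo_update()`, the count is back at its starting value, which is the property the fix restores for the error path.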

    Fixes: 7d5509fa0d ("net: sched: extend proto ops with 'put' callback")
    Signed-off-by: Hangyu Hua <hbh25y@gmail.com>
    Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
    Link: https://lore.kernel.org/r/20220921092734.31700-1-hbh25y@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:04 +01:00
Ivan Vecera 2a2a92fb9a net/sched: cls_api: Fix flow action initialization
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit c0f47c2822aadeb8b2829f3e4c3792f184c7be33
Author: Oz Shlomo <ozsh@nvidia.com>
Date:   Tue Jul 19 15:24:09 2022 +0300

    net/sched: cls_api: Fix flow action initialization

    The cited commit refactored the flow action initialization sequence to
    use an interface method when translating tc action instances to flow
    offload objects. The refactored version skips the initialization of the
    generic flow action attributes for tc actions, such as pedit, that allocate
    more than one offload entry. This can cause potential issues for drivers
    mapping flow action ids.

    Populate the generic flow action fields for all the flow action entries.

    Fixes: c54e1d920f04 ("flow_offload: add ops to tc_action_ops for flow action setup")
    Signed-off-by: Oz Shlomo <ozsh@nvidia.com>
    Reviewed-by: Roi Dayan <roid@nvidia.com>

    ----
    v1 -> v2:
     - coalesce the generic flow action fields initialization to a single loop
    Reviewed-by: Baowen Zheng <baowen.zheng@corigine.com>

    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:03 +01:00
Ivan Vecera a971d70e1d net/sched: remove return value of unregister_tcf_proto_ops
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit bc5c8260f4114951de3b4ec629650a722ca58a2b
Author: Zhengchao Shao <shaozhengchao@huawei.com>
Date:   Wed Jul 13 09:54:38 2022 +0800

    net/sched: remove return value of unregister_tcf_proto_ops

    Return value of unregister_tcf_proto_ops is unused, remove it.

    Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:03 +01:00
Ivan Vecera 2b642d2994 net/sched: cls_api: Add extack message for unsupported action offload
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit c440615ffbcb9c06975103e5abbcb094589329d1
Author: Ido Schimmel <idosch@nvidia.com>
Date:   Thu Apr 7 10:35:31 2022 +0300

    net/sched: cls_api: Add extack message for unsupported action offload

    For better error reporting to user space, add an extack message when the
    requested action does not support offload.

    Example:

     # echo 1 > /sys/kernel/tracing/events/netlink/netlink_extack/enable

     # tc filter add dev dummy0 ingress pref 1 proto all matchall skip_sw action nat ingress 192.0.2.1 198.51.100.1
     Error: cls_matchall: Failed to setup flow action.
     We have an error talking to the kernel

     # cat /sys/kernel/tracing/trace_pipe
           tc-181     [000] b..1.    88.406093: netlink_extack: msg=Action does not support offload
           tc-181     [000] .....    88.406108: netlink_extack: msg=cls_matchall: Failed to setup flow action

    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Petr Machata <petrm@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:01 +01:00
Ivan Vecera ae16730e38 net/sched: act_api: Add extack to offload_act_setup() callback
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2139170

commit c2ccf84ecb715bb81dc7f51e69d680a95bf055ae
Author: Ido Schimmel <idosch@nvidia.com>
Date:   Thu Apr 7 10:35:22 2022 +0300

    net/sched: act_api: Add extack to offload_act_setup() callback

    The callback is used by various actions to populate the flow action
    structure prior to offload. Pass extack to this callback so that the
    various actions will be able to report accurate error messages to user
    space.

    Signed-off-by: Ido Schimmel <idosch@nvidia.com>
    Reviewed-by: Petr Machata <petrm@nvidia.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-11-13 16:59:00 +01:00
Ivan Vecera da579c6dcc net/sched: fix initialization order when updating chain 0 head
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410

commit e65812fd22eba32f11abe28cb377cbd64cfb1ba0
Author: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date:   Thu Apr 7 11:29:23 2022 -0300

    net/sched: fix initialization order when updating chain 0 head

    Currently, when inserting a new filter that needs to sit at the head
    of chain 0, it will first update the heads pointer on all devices using
    the (shared) block, and only then complete the initialization of the new
    element so that it has a "next" element.

    This can lead to a situation in which the chain 0 head is propagated to
    another CPU before the "next" initialization is done. When this race
    condition is triggered, packets being matched on that CPU will simply
    miss all other filters, and will flow through the stack as if there were
    no other filters installed. If the system is using OVS + TC, such
    packets will get handled by vswitchd via upcall, which results in much
    higher latency and reordering. For other applications it may result in
    packet drops.

    This is reproducible with a tc only setup, but it varies from system to
    system. It could be reproduced with a shared block amongst 10 veth
    tunnels, and an ingress filter mirroring packets to another veth.
    That's because using the veth tunnel added last to the shared block for
    the actual traffic makes the race window bigger and easier to
    trigger.

    The fix is rather simple, to just initialize the next pointer of the new
    filter instance (tp) before propagating the head change.
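
The ordering rule can be sketched with C11 atomics in user space. This is a minimal illustration under stated assumptions, not the kernel code: `struct demo_tp`, `demo_head`, `demo_insert_head()` and `demo_count()` are hypothetical names, a single writer is assumed, and release/acquire atomics stand in for the kernel's RCU pointer publication.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical list element standing in for a filter instance. */
struct demo_tp {
    int prio;
    struct demo_tp *next;
};

/* Head pointer that readers on other threads may load at any time. */
static _Atomic(struct demo_tp *) demo_head;

/* Insert at the head. A single writer at a time is assumed. */
static void demo_insert_head(struct demo_tp *tp)
{
    assert(tp != NULL);
    /* 1. Finish initialization: link the new element to the old head. */
    tp->next = atomic_load_explicit(&demo_head, memory_order_relaxed);
    /* 2. Only then publish it; release orders step 1 before the store. */
    atomic_store_explicit(&demo_head, tp, memory_order_release);
}

/* Reader: walk the list starting from whatever head is visible. */
static size_t demo_count(void)
{
    size_t n = 0;
    struct demo_tp *tp;

    for (tp = atomic_load_explicit(&demo_head, memory_order_acquire);
         tp != NULL; tp = tp->next)
        n++;
    return n;
}
```

If the two steps in `demo_insert_head()` were swapped, a concurrent reader could load the new head while its `next` is still unset and stop after one element, which mirrors the missed-filter symptom described above.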

    The fixes tag is pointing to the original code though this issue should
    only be observed when using it unlocked.

    Fixes: 2190d1d094 ("net: sched: introduce helpers to work with filter chains")
    Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
    Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
    Reviewed-by: Davide Caratti <dcaratti@redhat.com>
    Link: https://lore.kernel.org/r/b97d5f4eaffeeb9d058155bcab63347527261abf.1649341369.git.marcelo.leitner@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-06 16:32:47 +02:00
Ivan Vecera 492340ec09 net_sched: add __rcu annotation to netdev->qdisc
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410

commit 5891cd5ec46c2c2eb6427cb54d214b149635dd0e
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Feb 11 12:06:23 2022 -0800

    net_sched: add __rcu annotation to netdev->qdisc

    syzbot found a data-race [1] which led me to add __rcu
    annotations to netdev->qdisc, and proper accessors
    to get LOCKDEP support.

    [1]
    BUG: KCSAN: data-race in dev_activate / qdisc_lookup_rcu

    write to 0xffff888168ad6410 of 8 bytes by task 13559 on cpu 1:
     attach_default_qdiscs net/sched/sch_generic.c:1167 [inline]
     dev_activate+0x2ed/0x8f0 net/sched/sch_generic.c:1221
     __dev_open+0x2e9/0x3a0 net/core/dev.c:1416
     __dev_change_flags+0x167/0x3f0 net/core/dev.c:8139
     rtnl_configure_link+0xc2/0x150 net/core/rtnetlink.c:3150
     __rtnl_newlink net/core/rtnetlink.c:3489 [inline]
     rtnl_newlink+0xf4d/0x13e0 net/core/rtnetlink.c:3529
     rtnetlink_rcv_msg+0x745/0x7e0 net/core/rtnetlink.c:5594
     netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494
     rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612
     netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
     netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343
     netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919
     sock_sendmsg_nosec net/socket.c:705 [inline]
     sock_sendmsg net/socket.c:725 [inline]
     ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
     ___sys_sendmsg net/socket.c:2467 [inline]
     __sys_sendmsg+0x195/0x230 net/socket.c:2496
     __do_sys_sendmsg net/socket.c:2505 [inline]
     __se_sys_sendmsg net/socket.c:2503 [inline]
     __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    read to 0xffff888168ad6410 of 8 bytes by task 13560 on cpu 0:
     qdisc_lookup_rcu+0x30/0x2e0 net/sched/sch_api.c:323
     __tcf_qdisc_find+0x74/0x3a0 net/sched/cls_api.c:1050
     tc_del_tfilter+0x1c7/0x1350 net/sched/cls_api.c:2211
     rtnetlink_rcv_msg+0x5ba/0x7e0 net/core/rtnetlink.c:5585
     netlink_rcv_skb+0x14e/0x250 net/netlink/af_netlink.c:2494
     rtnetlink_rcv+0x18/0x20 net/core/rtnetlink.c:5612
     netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
     netlink_unicast+0x602/0x6d0 net/netlink/af_netlink.c:1343
     netlink_sendmsg+0x728/0x850 net/netlink/af_netlink.c:1919
     sock_sendmsg_nosec net/socket.c:705 [inline]
     sock_sendmsg net/socket.c:725 [inline]
     ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
     ___sys_sendmsg net/socket.c:2467 [inline]
     __sys_sendmsg+0x195/0x230 net/socket.c:2496
     __do_sys_sendmsg net/socket.c:2505 [inline]
     __se_sys_sendmsg net/socket.c:2503 [inline]
     __x64_sys_sendmsg+0x42/0x50 net/socket.c:2503
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    value changed: 0xffffffff85dee080 -> 0xffff88815d96ec00

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 13560 Comm: syz-executor.2 Not tainted 5.17.0-rc3-syzkaller-00116-gf1baf68e1383-dirty #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 470502de5b ("net: sched: unlock rules update API")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Vlad Buslov <vladbu@mellanox.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Cc: Jamal Hadi Salim <jhs@mojatatu.com>
    Cc: Cong Wang <xiyou.wangcong@gmail.com>
    Cc: Jiri Pirko <jiri@resnulli.us>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-06 16:31:55 +02:00
Ivan Vecera 192c0ad4dd net/sched: Enable tc skb ext allocation on chain miss only when needed
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410

commit 35d39fecbc242150af5587506e58ec1f8541fb68
Author: Paul Blakey <paulb@nvidia.com>
Date:   Thu Feb 3 10:44:30 2022 +0200

    net/sched: Enable tc skb ext allocation on chain miss only when needed

    Currently tc skb extension is used to send miss info from
    tc to ovs datapath module, and driver to tc. For the tc to ovs
    miss it is currently always allocated even if it will not
    be used by ovs datapath (as it depends on a requested feature).

    Export the static key which is used by openvswitch module to
    guard this code path as well, so it will be skipped if ovs
    datapath doesn't need it. Enable this code path once
    ovs datapath needs it.

    Signed-off-by: Paul Blakey <paulb@nvidia.com>
    Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-06 16:31:54 +02:00
Ivan Vecera 9d82e0dd47 net: sched: fix use-after-free in tc_new_tfilter()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410

commit 04c2a47ffb13c29778e2a14e414ad4cb5a5db4b5
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Jan 31 09:20:18 2022 -0800

    net: sched: fix use-after-free in tc_new_tfilter()

    Whenever tc_new_tfilter() jumps back to replay: label,
    we need to make sure @q and @chain local variables are cleared again,
    or risk use-after-free as in [1]

    For consistency, apply the same fix in tc_ctl_chain()
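
The replay hazard can be sketched in user-space C. This is an illustrative model, not the kernel function: `struct demo_qdisc`, `struct demo_env` and `demo_new_filter()` are hypothetical names, and the first pass is forced to retry purely for demonstration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical object that may be released between passes. */
struct demo_qdisc {
    int handle;
};

struct demo_env {
    struct demo_qdisc *available;   /* what a lookup finds right now */
    int passes;                     /* how many times the body ran */
};

/*
 * Returns 0 on success or -ENOENT when no object was found on the final
 * pass; *used reports the object the final pass operated on.
 */
static int demo_new_filter(struct demo_env *env, struct demo_qdisc **used)
{
    struct demo_qdisc *q;
    int err;

    assert(env != NULL && used != NULL);
replay:
    /* Reset per-pass state so nothing from an earlier pass survives. */
    q = NULL;
    err = 0;
    env->passes++;

    /* Lookup that leaves q untouched when nothing is found. */
    if (env->available)
        q = env->available;

    if (env->passes == 1) {
        /* Simulate: the object goes away and the request must retry. */
        env->available = NULL;
        err = -EAGAIN;
    }
    if (err == -EAGAIN)
        goto replay;

    *used = q;
    return q ? 0 : -ENOENT;
}
```

Without the `q = NULL` reset after the label, the second pass would keep the pointer obtained during the first pass and hand back an object that is no longer valid.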

    BUG: KASAN: use-after-free in mini_qdisc_pair_swap+0x1b9/0x1f0 net/sched/sch_generic.c:1581
    Write of size 8 at addr ffff8880985c4b08 by task syz-executor.4/1945

    CPU: 0 PID: 1945 Comm: syz-executor.4 Not tainted 5.17.0-rc1-syzkaller-00495-gff58831fa02d #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
     <TASK>
     __dump_stack lib/dump_stack.c:88 [inline]
     dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
     print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
     __kasan_report mm/kasan/report.c:442 [inline]
     kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
     mini_qdisc_pair_swap+0x1b9/0x1f0 net/sched/sch_generic.c:1581
     tcf_chain_head_change_item net/sched/cls_api.c:372 [inline]
     tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:386
     tcf_chain_tp_insert net/sched/cls_api.c:1657 [inline]
     tcf_chain_tp_insert_unique net/sched/cls_api.c:1707 [inline]
     tc_new_tfilter+0x1e67/0x2350 net/sched/cls_api.c:2086
     rtnetlink_rcv_msg+0x80d/0xb80 net/core/rtnetlink.c:5583
     netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
     netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
     netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
     netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
     sock_sendmsg_nosec net/socket.c:705 [inline]
     sock_sendmsg+0xcf/0x120 net/socket.c:725
     ____sys_sendmsg+0x331/0x810 net/socket.c:2413
     ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
     __sys_sendmmsg+0x195/0x470 net/socket.c:2553
     __do_sys_sendmmsg net/socket.c:2582 [inline]
     __se_sys_sendmmsg net/socket.c:2579 [inline]
     __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2579
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x7f2647172059
    Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007f2645aa5168 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
    RAX: ffffffffffffffda RBX: 00007f2647285100 RCX: 00007f2647172059
    RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000006
    RBP: 00007f26471cc08d R08: 0000000000000000 R09: 0000000000000000
    R10: 9e00000000000000 R11: 0000000000000246 R12: 0000000000000000
    R13: 00007fffb3f7f02f R14: 00007f2645aa5300 R15: 0000000000022000
     </TASK>

    Allocated by task 1944:
     kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
     kasan_set_track mm/kasan/common.c:45 [inline]
     set_alloc_info mm/kasan/common.c:436 [inline]
     ____kasan_kmalloc mm/kasan/common.c:515 [inline]
     ____kasan_kmalloc mm/kasan/common.c:474 [inline]
     __kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:524
     kmalloc_node include/linux/slab.h:604 [inline]
     kzalloc_node include/linux/slab.h:726 [inline]
     qdisc_alloc+0xac/0xa10 net/sched/sch_generic.c:941
     qdisc_create.constprop.0+0xce/0x10f0 net/sched/sch_api.c:1211
     tc_modify_qdisc+0x4c5/0x1980 net/sched/sch_api.c:1660
     rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5592
     netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
     netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
     netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
     netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
     sock_sendmsg_nosec net/socket.c:705 [inline]
     sock_sendmsg+0xcf/0x120 net/socket.c:725
     ____sys_sendmsg+0x331/0x810 net/socket.c:2413
     ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
     __sys_sendmmsg+0x195/0x470 net/socket.c:2553
     __do_sys_sendmmsg net/socket.c:2582 [inline]
     __se_sys_sendmmsg net/socket.c:2579 [inline]
     __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2579
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    Freed by task 3609:
     kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
     kasan_set_track+0x21/0x30 mm/kasan/common.c:45
     kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
     ____kasan_slab_free mm/kasan/common.c:366 [inline]
     ____kasan_slab_free+0x130/0x160 mm/kasan/common.c:328
     kasan_slab_free include/linux/kasan.h:236 [inline]
     slab_free_hook mm/slub.c:1728 [inline]
     slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1754
     slab_free mm/slub.c:3509 [inline]
     kfree+0xcb/0x280 mm/slub.c:4562
     rcu_do_batch kernel/rcu/tree.c:2527 [inline]
     rcu_core+0x7b8/0x1540 kernel/rcu/tree.c:2778
     __do_softirq+0x29b/0x9c2 kernel/softirq.c:558

    Last potentially related work creation:
     kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
     __kasan_record_aux_stack+0xbe/0xd0 mm/kasan/generic.c:348
     __call_rcu kernel/rcu/tree.c:3026 [inline]
     call_rcu+0xb1/0x740 kernel/rcu/tree.c:3106
     qdisc_put_unlocked+0x6f/0x90 net/sched/sch_generic.c:1109
     tcf_block_release+0x86/0x90 net/sched/cls_api.c:1238
     tc_new_tfilter+0xc0d/0x2350 net/sched/cls_api.c:2148
     rtnetlink_rcv_msg+0x80d/0xb80 net/core/rtnetlink.c:5583
     netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
     netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
     netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
     netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
     sock_sendmsg_nosec net/socket.c:705 [inline]
     sock_sendmsg+0xcf/0x120 net/socket.c:725
     ____sys_sendmsg+0x331/0x810 net/socket.c:2413
     ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
     __sys_sendmmsg+0x195/0x470 net/socket.c:2553
     __do_sys_sendmmsg net/socket.c:2582 [inline]
     __se_sys_sendmmsg net/socket.c:2579 [inline]
     __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2579
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    The buggy address belongs to the object at ffff8880985c4800
     which belongs to the cache kmalloc-1k of size 1024
    The buggy address is located 776 bytes inside of
     1024-byte region [ffff8880985c4800, ffff8880985c4c00)
    The buggy address belongs to the page:
    page:ffffea0002617000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x985c0
    head:ffffea0002617000 order:3 compound_mapcount:0 compound_pincount:0
    flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
    raw: 00fff00000010200 0000000000000000 dead000000000122 ffff888010c41dc0
    raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected
    page_owner tracks the page as allocated
    page last allocated via order 3, migratetype Unmovable, gfp_mask 0x1d20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL), pid 1941, ts 1038999441284, free_ts 1033444432829
     prep_new_page mm/page_alloc.c:2434 [inline]
     get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165
     __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389
     alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271
     alloc_slab_page mm/slub.c:1799 [inline]
     allocate_slab mm/slub.c:1944 [inline]
     new_slab+0x28a/0x3b0 mm/slub.c:2004
     ___slab_alloc+0x87c/0xe90 mm/slub.c:3018
     __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105
     slab_alloc_node mm/slub.c:3196 [inline]
     slab_alloc mm/slub.c:3238 [inline]
     __kmalloc+0x2fb/0x340 mm/slub.c:4420
     kmalloc include/linux/slab.h:586 [inline]
     kzalloc include/linux/slab.h:715 [inline]
     __register_sysctl_table+0x112/0x1090 fs/proc/proc_sysctl.c:1335
     neigh_sysctl_register+0x2c8/0x5e0 net/core/neighbour.c:3787
     devinet_sysctl_register+0xb1/0x230 net/ipv4/devinet.c:2618
     inetdev_init+0x286/0x580 net/ipv4/devinet.c:278
     inetdev_event+0xa8a/0x15d0 net/ipv4/devinet.c:1532
     notifier_call_chain+0xb5/0x200 kernel/notifier.c:84
     call_netdevice_notifiers_info+0xb5/0x130 net/core/dev.c:1919
     call_netdevice_notifiers_extack net/core/dev.c:1931 [inline]
     call_netdevice_notifiers net/core/dev.c:1945 [inline]
     register_netdevice+0x1073/0x1500 net/core/dev.c:9698
     veth_newlink+0x59c/0xa90 drivers/net/veth.c:1722
    page last free stack trace:
     reset_page_owner include/linux/page_owner.h:24 [inline]
     free_pages_prepare mm/page_alloc.c:1352 [inline]
     free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404
     free_unref_page_prepare mm/page_alloc.c:3325 [inline]
     free_unref_page+0x19/0x690 mm/page_alloc.c:3404
     release_pages+0x748/0x1220 mm/swap.c:956
     tlb_batch_pages_flush mm/mmu_gather.c:50 [inline]
     tlb_flush_mmu_free mm/mmu_gather.c:243 [inline]
     tlb_flush_mmu+0xe9/0x6b0 mm/mmu_gather.c:250
     zap_pte_range mm/memory.c:1441 [inline]
     zap_pmd_range mm/memory.c:1490 [inline]
     zap_pud_range mm/memory.c:1519 [inline]
     zap_p4d_range mm/memory.c:1540 [inline]
     unmap_page_range+0x1d1d/0x2a30 mm/memory.c:1561
     unmap_single_vma+0x198/0x310 mm/memory.c:1606
     unmap_vmas+0x16b/0x2f0 mm/memory.c:1638
     exit_mmap+0x201/0x670 mm/mmap.c:3178
     __mmput+0x122/0x4b0 kernel/fork.c:1114
     mmput+0x56/0x60 kernel/fork.c:1135
     exit_mm kernel/exit.c:507 [inline]
     do_exit+0xa3c/0x2a30 kernel/exit.c:793
     do_group_exit+0xd2/0x2f0 kernel/exit.c:935
     __do_sys_exit_group kernel/exit.c:946 [inline]
     __se_sys_exit_group kernel/exit.c:944 [inline]
     __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:944
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    Memory state around the buggy address:
     ffff8880985c4a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
     ffff8880985c4a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    >ffff8880985c4b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                          ^
     ffff8880985c4b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
     ffff8880985c4c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

    Fixes: 470502de5b ("net: sched: unlock rules update API")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Vlad Buslov <vladbu@mellanox.com>
    Cc: Jiri Pirko <jiri@mellanox.com>
    Cc: Cong Wang <xiyou.wangcong@gmail.com>
    Reported-by: syzbot <syzkaller@googlegroups.com>
    Link: https://lore.kernel.org/r/20220131172018.3704490-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-06 16:31:54 +02:00
Ivan Vecera 89fae87f10 net/sched: use min() macro instead of doing it manually
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410

commit c48c94b0ab75ef3bbfa539e6e212184e315fd5bd
Author: Yang Li <yang.lee@linux.alibaba.com>
Date:   Tue Dec 21 09:14:55 2021 +0800

    net/sched: use min() macro instead of doing it manually

    Fix following coccicheck warnings:
    ./net/sched/cls_api.c:3333:17-18: WARNING opportunity for min()
    ./net/sched/cls_api.c:3389:17-18: WARNING opportunity for min()
    ./net/sched/cls_api.c:3427:17-18: WARNING opportunity for min()
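
The kind of rewrite such a warning suggests can be sketched as follows. This is a generic user-space illustration, not the code at the reported lines: `demo_min()` is a hypothetical helper standing in for the kernel's `min()` macro, and the function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper; the kernel's min() macro also checks types. */
static inline size_t demo_min(size_t a, size_t b)
{
    return a < b ? a : b;
}

/* Open-coded form: clamp the wanted length to what is available. */
static size_t demo_len_open_coded(size_t want, size_t avail)
{
    size_t len = want;

    if (len > avail)
        len = avail;
    return len;
}

/* Same result expressed through the helper. */
static size_t demo_len_min(size_t want, size_t avail)
{
    return demo_min(want, avail);
}

/* Convenience wrapper that asserts both forms agree. */
static size_t demo_len_checked(size_t want, size_t avail)
{
    size_t len = demo_len_min(want, avail);

    assert(len == demo_len_open_coded(want, avail));
    return len;
}
```

Both forms compute the same value; the helper version states the intent in one expression.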

    Reported-by: Abaci Robot <abaci@linux.alibaba.com>
    Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-06 16:31:26 +02:00
Ivan Vecera 5dc16cd298 flow_offload: validate flags of filter and actions
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410

commit c86e0209dc7725c91583e3c0c78c3da6a28daeb4
Author: Baowen Zheng <baowen.zheng@corigine.com>
Date:   Fri Dec 17 19:16:28 2021 +0100

    flow_offload: validate flags of filter and actions

    Add a step that validates the flags of a filter and its actions when
    adding a tc filter.

    This prevents adding a filter whose flags conflict with those of its
    actions.

    Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
    Signed-off-by: Louis Peens <louis.peens@corigine.com>
    Signed-off-by: Simon Horman <simon.horman@corigine.com>
    Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-06 16:31:26 +02:00
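
The commit message above does not spell out what counts as a conflict between filter flags and action flags. The sketch below assumes one plausible rule: each side may request skip_hw or skip_sw but not both, and the filter and its action must request the same thing. All names (SKETCH_SKIP_HW, SKETCH_SKIP_SW, sketch_flags_valid, sketch_filter_action_flags_ok) are hypothetical and are not the kernel's identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch only. */
#define SKETCH_SKIP_HW (1u << 0) /* do not offload to hardware */
#define SKETCH_SKIP_SW (1u << 1) /* do not process in software */
#define SKETCH_SKIP_MASK (SKETCH_SKIP_HW | SKETCH_SKIP_SW)

/* A flag word that skips both hardware and software would do nothing. */
static bool sketch_flags_valid(uint32_t flags)
{
	return (flags & SKETCH_SKIP_MASK) != SKETCH_SKIP_MASK;
}

/*
 * Assumed rule: a filter and an action attached to it must agree on
 * where they run, so their skip bits have to match exactly.
 */
static bool sketch_filter_action_flags_ok(uint32_t filter_flags,
					  uint32_t action_flags)
{
	if (!sketch_flags_valid(filter_flags) ||
	    !sketch_flags_valid(action_flags))
		return false;

	return (filter_flags & SKETCH_SKIP_MASK) ==
	       (action_flags & SKETCH_SKIP_MASK);
}
```

Under this assumption, a skip_sw filter paired with a skip_hw action is rejected at add time instead of producing a rule that can never run consistently.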
Ivan Vecera 8bace59d89 flow_offload: allow user to offload tc action to net device
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410

commit 8cbfe939abe905280279e84a297b1cb34e0d0ec9
Author: Baowen Zheng <baowen.zheng@corigine.com>
Date:   Fri Dec 17 19:16:22 2021 +0100

    flow_offload: allow user to offload tc action to net device

    Use flow_indr_dev_register/flow_indr_dev_setup_offload to
    offload tc action.

    We need to call tc_cleanup_flow_action to clean up the tc action entry,
    since some actions may hold a dev refcnt in tc_setup_action, especially
    the mirror action.

    Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
    Signed-off-by: Louis Peens <louis.peens@corigine.com>
    Signed-off-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-06 16:31:19 +02:00
Ivan Vecera ad0a1778e8 flow_offload: add ops to tc_action_ops for flow action setup
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410

commit c54e1d920f04d528ab558f09326a78d2ae59e323
Author: Baowen Zheng <baowen.zheng@corigine.com>
Date:   Fri Dec 17 19:16:21 2021 +0100

    flow_offload: add ops to tc_action_ops for flow action setup

    Add a new op to tc_action_ops for flow action setup.

    Refactor function tc_setup_flow_action to use this new op.

    We make this change to facilitate adding standalone action modules.

    We will also use this op to offload actions independently of filters
    in a following patch.

    Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
    Signed-off-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-06 16:31:18 +02:00
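
The refactoring described in the commit above moves per-action setup behind a callback in the action's ops table. The sketch below illustrates that general pattern in plain C: each action type supplies its own setup function, and the common code dispatches through the pointer instead of knowing every action type. Every identifier here (sketch_action_ops, offload_setup, sketch_setup_entry and the rest) is hypothetical and does not reproduce the kernel's structures or signatures.

```c
#include <stddef.h>

/* Hypothetical result of setting up one offload entry. */
enum sketch_entry_id {
	SKETCH_ENTRY_NONE,
	SKETCH_ENTRY_DROP,
	SKETCH_ENTRY_MIRRED,
};

struct sketch_entry {
	enum sketch_entry_id id;
};

struct sketch_action;

/* Per-action ops table; the callback is optional. */
struct sketch_action_ops {
	const char *kind;
	int (*offload_setup)(const struct sketch_action *act,
			     struct sketch_entry *entry);
};

struct sketch_action {
	const struct sketch_action_ops *ops;
};

static int sketch_drop_setup(const struct sketch_action *act,
			     struct sketch_entry *entry)
{
	(void)act;
	entry->id = SKETCH_ENTRY_DROP;
	return 0;
}

static int sketch_mirred_setup(const struct sketch_action *act,
			       struct sketch_entry *entry)
{
	(void)act;
	entry->id = SKETCH_ENTRY_MIRRED;
	return 0;
}

static const struct sketch_action_ops sketch_drop_ops = {
	.kind = "drop",
	.offload_setup = sketch_drop_setup,
};

static const struct sketch_action_ops sketch_mirred_ops = {
	.kind = "mirred",
	.offload_setup = sketch_mirred_setup,
};

/* An action type that does not implement the callback. */
static const struct sketch_action_ops sketch_legacy_ops = {
	.kind = "legacy",
};

/*
 * Common setup path: no switch over action types, only a dispatch
 * through the ops table. Returns -1 (a placeholder error code) when
 * the action type provides no setup callback.
 */
static int sketch_setup_entry(const struct sketch_action *act,
			      struct sketch_entry *entry)
{
	if (!act->ops || !act->ops->offload_setup)
		return -1;

	return act->ops->offload_setup(act, entry);
}
```

With this shape, a new action module only has to fill in its own ops table; the common setup code stays unchanged, which is the benefit the commit message attributes to standalone action modules.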
Ivan Vecera 3c8e8652d7 flow_offload: rename offload functions with offload instead of flow
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090410

commit 9c1c0e124ca25589e6cf040e105ab0857f9e9c3e
Author: Baowen Zheng <baowen.zheng@corigine.com>
Date:   Fri Dec 17 19:16:20 2021 +0100

    flow_offload: rename offload functions with offload instead of flow

    To improve readability, we rename the offload functions to use offload
    instead of flow in their names.

    The term flow is related to exact matches, so we rename these functions
    with offload.

    We make this change to make naming the single action offload functions
    easier.

    Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
    Signed-off-by: Simon Horman <simon.horman@corigine.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Signed-off-by: Ivan Vecera <ivecera@redhat.com>
2022-06-06 16:31:18 +02:00