Commit Graph

475 Commits

Kamal Heib 74ffa37115 RDMA/core: Fix ENODEV error for iWARP test over vlan
JIRA: https://issues.redhat.com/browse/RHEL-77880

commit a4048c83fd87c65657a4acb17d639092d4b6133d
Author: Anumula Murali Mohan Reddy <anumula@chelsio.com>
Date:   Tue Dec 3 19:30:53 2024 +0530

    RDMA/core: Fix ENODEV error for iWARP test over vlan

    If traffic is over a vlan, cma_validate_port() fails to match the
    net_device ifindex with bound_if_index and returns an ENODEV error.
    As the iWARP GID table is static, it contains an entry corresponding to
    only one net device, which is either the real netdev or the vlan netdev
    for cases like siw attached to a vlan interface.
    This patch fixes the issue by assigning the net device index to
    bound_if_index if the real net device obtained from the bound if index
    matches the net device retrieved from the GID table.
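
A toy userspace sketch of the rebinding logic described above (all names are hypothetical, not the kernel API): a vlan netdev sits on top of a "real" lower device, and if the real device behind the bound ifindex matches the real device behind the netdev found in the GID table, the bound ifindex is rebound to the GID entry's ifindex so the later ifindex comparison can succeed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: ifindex plus an optional lower ("real") device. */
struct toy_netdev {
	int ifindex;
	const struct toy_netdev *lower; /* NULL for a real device */
};

static const struct toy_netdev *toy_real_dev(const struct toy_netdev *d)
{
	return d->lower ? d->lower : d;
}

/* If both sides resolve to the same real device, rebind to the GID
 * entry's ifindex; otherwise keep the original bound index. */
static int toy_fixup_bound_if(int bound_if_index,
			      const struct toy_netdev *bound,
			      const struct toy_netdev *gid_ndev)
{
	if (toy_real_dev(bound) == toy_real_dev(gid_ndev))
		return gid_ndev->ifindex;
	return bound_if_index;
}
```

With a vlan device (ifindex 5) stacked on a real device (ifindex 2), binding to the real device still matches the GID entry and rebinds to 5; an unrelated device is left untouched.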

    Fixes: f8ef1be816bf ("RDMA/cma: Avoid GID lookups on iWARP devices")
    Link: https://lore.kernel.org/all/ZzNgdrjo1kSCGbRz@chelsio.com/
    Signed-off-by: Anumula Murali Mohan Reddy <anumula@chelsio.com>
    Signed-off-by: Potnuri Bharat Teja <bharat@chelsio.com>
    Link: https://patch.msgid.link/20241203140052.3985-1-anumula@chelsio.com
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2025-02-06 09:33:43 -05:00
Kamal Heib 0b2751db60 RDMA/cma: Fix kmemleak in rdma_core observed during blktests nvme/rdma use siw
JIRA: https://issues.redhat.com/browse/RHEL-56247

commit 9c0731832d3b7420cbadba6a7f334363bc8dfb15
Author: Zhu Yanjun <yanjun.zhu@linux.dev>
Date:   Fri May 10 23:12:47 2024 +0200

    RDMA/cma: Fix kmemleak in rdma_core observed during blktests nvme/rdma use siw

    When running blktests nvme/rdma, the following kmemleak issue appears.

    kmemleak: Kernel memory leak detector initialized (mempool available:36041)
    kmemleak: Automatic memory scanning thread started
    kmemleak: 2 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
    kmemleak: 8 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
    kmemleak: 17 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
    kmemleak: 4 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

    unreferenced object 0xffff88855da53400 (size 192):
      comm "rdma", pid 10630, jiffies 4296575922
      hex dump (first 32 bytes):
        37 00 00 00 00 00 00 00 c0 ff ff ff 1f 00 00 00  7...............
        10 34 a5 5d 85 88 ff ff 10 34 a5 5d 85 88 ff ff  .4.].....4.]....
      backtrace (crc 47f66721):
        [<ffffffff911251bd>] kmalloc_trace+0x30d/0x3b0
        [<ffffffffc2640ff7>] alloc_gid_entry+0x47/0x380 [ib_core]
        [<ffffffffc2642206>] add_modify_gid+0x166/0x930 [ib_core]
        [<ffffffffc2643468>] ib_cache_update.part.0+0x6d8/0x910 [ib_core]
        [<ffffffffc2644e1a>] ib_cache_setup_one+0x24a/0x350 [ib_core]
        [<ffffffffc263949e>] ib_register_device+0x9e/0x3a0 [ib_core]
        [<ffffffffc2a3d389>] 0xffffffffc2a3d389
        [<ffffffffc2688cd8>] nldev_newlink+0x2b8/0x520 [ib_core]
        [<ffffffffc2645fe3>] rdma_nl_rcv_msg+0x2c3/0x520 [ib_core]
        [<ffffffffc264648c>] rdma_nl_rcv_skb.constprop.0.isra.0+0x23c/0x3a0 [ib_core]
        [<ffffffff9270e7b5>] netlink_unicast+0x445/0x710
        [<ffffffff9270f1f1>] netlink_sendmsg+0x761/0xc40
        [<ffffffff9249db29>] __sys_sendto+0x3a9/0x420
        [<ffffffff9249dc8c>] __x64_sys_sendto+0xdc/0x1b0
        [<ffffffff92db0ad3>] do_syscall_64+0x93/0x180
        [<ffffffff92e00126>] entry_SYSCALL_64_after_hwframe+0x71/0x79

    The root cause: rdma_put_gid_attr is not called when sgid_attr is set
    to ERR_PTR(-ENODEV).
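
A userspace model of that root cause (names are illustrative, not the kernel API): a looked-up, refcounted attribute must be released on every exit path, including the one that reports -ENODEV; the buggy code overwrote the pointer with the error value without dropping the reference.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

static int toy_live_refs; /* stands in for kmemleak's view of the object */

struct toy_gid_attr { int unused; };

static struct toy_gid_attr *toy_get_gid_attr(void)
{
	static struct toy_gid_attr a;

	toy_live_refs++;
	return &a;
}

static void toy_put_gid_attr(struct toy_gid_attr *attr)
{
	(void)attr;
	toy_live_refs--;
}

/* Fixed flow: drop the reference before returning the error. */
static int toy_resolve(int dev_present)
{
	struct toy_gid_attr *attr = toy_get_gid_attr();

	if (!dev_present) {
		toy_put_gid_attr(attr); /* the put the fix adds */
		return -ENODEV;
	}
	toy_put_gid_attr(attr);
	return 0;
}
```

After both the success and the -ENODEV path, no references remain outstanding, which is exactly what the fix restores.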

    Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com>
    Closes: https://lore.kernel.org/all/19bf5745-1b3b-4b8a-81c2-20d945943aaf@linux.dev/T/
    Fixes: f8ef1be816bf ("RDMA/cma: Avoid GID lookups on iWARP devices")
    Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
    Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
    Link: https://lore.kernel.org/r/20240510211247.31345-1-yanjun.zhu@linux.dev
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2024-10-07 11:55:53 -04:00
Kamal Heib 662f008c89 treewide: use get_random_u32_inclusive() when possible
JIRA: https://issues.redhat.com/browse/RHEL-56247
Conflicts:
Include only the hunks under drivers/infiniband.

commit e8a533cbeb79809206f8724e89961e0079508c3c
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sun Oct 9 20:44:02 2022 -0600

    treewide: use get_random_u32_inclusive() when possible

    These cases were done with this Coccinelle:

    @@
    expression H;
    expression L;
    @@
    - (get_random_u32_below(H) + L)
    + get_random_u32_inclusive(L, H + L - 1)

    @@
    expression H;
    expression L;
    expression E;
    @@
      get_random_u32_inclusive(L,
      H
    - + E
    - - E
      )

    @@
    expression H;
    expression L;
    expression E;
    @@
      get_random_u32_inclusive(L,
      H
    - - E
    - + E
      )

    @@
    expression H;
    expression L;
    expression E;
    expression F;
    @@
      get_random_u32_inclusive(L,
      H
    - - E
      + F
    - + E
      )

    @@
    expression H;
    expression L;
    expression E;
    expression F;
    @@
      get_random_u32_inclusive(L,
      H
    - + E
      + F
    - - E
      )

    And then subsequently cleaned up by hand, with several automatic
    cases rejected when they didn't make sense contextually.
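
The first rule's range identity can be checked with userspace stand-ins for the kernel helpers (same semantics, not the kernel implementation): below(H) yields [0, H) and inclusive(L, H) yields [L, H], so (below(H) + L) and inclusive(L, H + L - 1) cover exactly the same range.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for get_random_u32_below(): uniform-ish over [0, ceil). */
static uint32_t toy_u32_below(uint32_t ceil)
{
	return (uint32_t)rand() % ceil;
}

/* Stand-in for get_random_u32_inclusive(): uniform-ish over [lo, hi]. */
static uint32_t toy_u32_inclusive(uint32_t lo, uint32_t hi)
{
	return lo + toy_u32_below(hi - lo + 1);
}
```

Sampling both forms with H = 10 and L = 5 stays within [5, 14] for each, matching the Coccinelle rewrite.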

    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2024-10-07 11:55:50 -04:00
Kamal Heib 3f12c3acf6 treewide: use get_random_u32_below() instead of deprecated function
JIRA: https://issues.redhat.com/browse/RHEL-56247
Conflicts:
Include only the hunks under drivers/infiniband.

commit 8032bf1233a74627ce69b803608e650f3f35971c
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Sun Oct 9 20:44:02 2022 -0600

    treewide: use get_random_u32_below() instead of deprecated function

    This is a simple mechanical transformation done by:

    @@
    expression E;
    @@
    - prandom_u32_max
    + get_random_u32_below
      (E)

    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
    Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
    Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
    Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
    Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2024-10-07 11:55:50 -04:00
Benjamin Coddington a9a5eecfa5 RDMA/cma: Avoid GID lookups on iWARP devices
JIRA: https://issues.redhat.com/browse/RHEL-12457

commit f8ef1be816bf9a0c406c696368c2264a9597a994
Author: Chuck Lever <chuck.lever@oracle.com>
Date:   Mon Jul 17 11:12:32 2023 -0400

    RDMA/cma: Avoid GID lookups on iWARP devices

    We would like to enable the use of siw on top of a VPN that is
    constructed and managed via a tun device. That hasn't worked up
    until now because ARPHRD_NONE devices (such as tun devices) have
    no GID for the RDMA/core to look up.

    But it turns out that the egress device has already been picked for
    us -- no GID is necessary. addr_handler() just has to do the right
    thing with it.

    Link: https://lore.kernel.org/r/168960675257.3007.4737911174148394395.stgit@manet.1015granger.net
    Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
    Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
2024-03-06 14:37:55 -05:00
Benjamin Coddington 1fb1fe9992 RDMA/cma: Deduplicate error flow in cma_validate_port()
JIRA: https://issues.redhat.com/browse/RHEL-12457

commit 700c96497ba9acf1a3554a3cd3ba6c79db3cbcf7
Author: Chuck Lever <chuck.lever@oracle.com>
Date:   Mon Jul 17 11:12:25 2023 -0400

    RDMA/cma: Deduplicate error flow in cma_validate_port()

    Clean up to prepare for the addition of new logic.

    Link: https://lore.kernel.org/r/168960674597.3007.6128252077812202526.stgit@manet.1015granger.net
    Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
2024-03-06 14:37:55 -05:00
Kamal Heib 6d7b4ce420 RDMA/cma: Initialize ib_sa_multicast structure to 0 when join
JIRA: https://issues.redhat.com/browse/RHEL-1030

commit e0fe97efdb00f0f32b038a4836406a82886aec9c
Author: Mark Zhang <markzhang@nvidia.com>
Date:   Wed Sep 27 12:05:11 2023 +0300

    RDMA/cma: Initialize ib_sa_multicast structure to 0 when join

    Initialize the structure to 0 so that its fields won't have random
    values. For example, fields like rec.traffic_class (as well as
    rec.flow_label and rec.sl) are used to generate the user AH through:
      cma_iboe_join_multicast
        cma_make_mc_event
          ib_init_ah_from_mcmember

    And a random traffic_class causes a random IP DSCP in RoCEv2.
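
A generic C illustration of the fix pattern (hypothetical struct, not the ib_sa_multicast layout): an aggregate initializer zeroes every field, so values such as traffic_class no longer start as stack garbage.

```c
#include <assert.h>

/* Hypothetical record with the fields named in the message. */
struct toy_mc_rec {
	unsigned int traffic_class;
	unsigned int flow_label;
	unsigned int sl;
};

static struct toy_mc_rec toy_make_rec(void)
{
	/* Zero-initialize, the equivalent of memset(&rec, 0, sizeof(rec)),
	 * instead of leaving the fields indeterminate. */
	struct toy_mc_rec rec = {0};

	return rec;
}
```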

    Fixes: b5de0c60cc ("RDMA/cma: Fix use after free race in roce multicast join")
    Signed-off-by: Mark Zhang <markzhang@nvidia.com>
    Link: https://lore.kernel.org/r/20230927090511.603595-1-markzhang@nvidia.com
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-11-20 15:13:16 -05:00
Kamal Heib 0ebdc73da1 RDMA/core: Update CMA destination address on rdma_resolve_addr
JIRA: https://issues.redhat.com/browse/RHEL-1029

commit 0e15863015d97c1ee2cc29d599abcc7fa2dc3e95
Author: Shiraz Saleem <shiraz.saleem@intel.com>
Date:   Wed Jul 12 18:41:33 2023 -0500

    RDMA/core: Update CMA destination address on rdma_resolve_addr

    8d037973d48c ("RDMA/core: Refactor rdma_bind_addr") introduces a
    regression on irdma devices in certain tests which use the RDMA CM,
    such as cmtime.

    No connections can be established, as the MAD QP experiences a fatal
    error on the active side.

    The CMA destination address is not updated with the dst_addr when the
    ULP on the active side calls rdma_bind_addr followed by
    rdma_resolve_addr. The id_priv state is 'bound' in resolve_prepare_src
    and the update is skipped.

    This leaves the dgid passed into the irdma driver to create an Address
    Handle (AH) for the MAD QP at 0. The create AH descriptor as well as
    the ARP cache entry is invalid, and the HW throws asynchronous events
    as a result.

    [ 1207.656888] resolve_prepare_src caller: ucma_resolve_addr+0xff/0x170 [rdma_ucm] daddr=200.0.4.28 id_priv->state=7
    [....]
    [ 1207.680362] ice 0000:07:00.1 rocep7s0f1: caller: irdma_create_ah+0x3e/0x70 [irdma] ah_id=0 arp_idx=0 dest_ip=0.0.0.0
    destMAC=00:00:64:ca:b7:52 ipvalid=1 raw=0000:0000:0000:0000:0000:ffff:0000:0000
    [ 1207.682077] ice 0000:07:00.1 rocep7s0f1: abnormal ae_id = 0x401 bool qp=1 qp_id = 1, ae_src=5
    [ 1207.691657] infiniband rocep7s0f1: Fatal error (1) on MAD QP (1)

    Fix this by updating the CMA destination address when the ULP calls
    a resolve address with the CM state already bound.

    Fixes: 8d037973d48c ("RDMA/core: Refactor rdma_bind_addr")
    Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com>
    Link: https://lore.kernel.org/r/20230712234133.1343-1-shiraz.saleem@intel.com
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-09-05 11:13:50 -04:00
Kamal Heib 1b96606f80 RDMA/core: Refactor rdma_bind_addr
JIRA: https://issues.redhat.com/browse/RHEL-1029

commit 8d037973d48c026224ab285e6a06985ccac6f7bf
Author: Patrisious Haddad <phaddad@nvidia.com>
Date:   Wed Jan 4 10:01:38 2023 +0200

    RDMA/core: Refactor rdma_bind_addr

    Refactor the rdma_bind_addr function so that it doesn't require the
    CMA destination address to be changed before calling it.

    It now updates the destination address internally, only when it is
    really needed and after passing all the required checks.

    This in turn results in cleaner and more sensible call and error
    handling flows for the functions that call it directly or indirectly.

    Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
    Reported-by: Wei Chen <harperchen1110@gmail.com>
    Reviewed-by: Mark Zhang <markzhang@nvidia.com>
    Link: https://lore.kernel.org/r/3d0e9a2fd62bc10ba02fed1c7c48a48638952320.1672819273.git.leonro@nvidia.com
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-09-05 11:13:50 -04:00
Kamal Heib 77cad8fa52 RDMA/cma: Remove NULL check before dev_{put, hold}
JIRA: https://issues.redhat.com/browse/RHEL-1029

commit 6735041fd8460a94ae367830ece8ef65f191227a
Author: Yang Li <yang.lee@linux.alibaba.com>
Date:   Wed Jun 14 09:43:28 2023 +0800

    RDMA/cma: Remove NULL check before dev_{put, hold}

    dev_{put, hold} call netdev_{put, hold}, which already checks for
    NULL, so there is no need to check before using dev_{put, hold};
    remove the check to silence the warning:

    ./drivers/infiniband/core/cma.c:4812:2-9: WARNING: NULL check before dev_{put, hold} functions is not needed.

    Link: https://lore.kernel.org/r/20230614014328.14007-1-yang.lee@linux.alibaba.com
    Reported-by: Abaci Robot <abaci@linux.alibaba.com>
    Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5521
    Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-09-05 11:13:50 -04:00
Kamal Heib a411017cd6 RDMA/cma: Always set static rate to 0 for RoCE
JIRA: https://issues.redhat.com/browse/RHEL-956

commit 58030c76cce473b6cfd630bbecb97215def0dff8
Author: Mark Zhang <markzhang@nvidia.com>
Date:   Mon Jun 5 13:33:23 2023 +0300

    RDMA/cma: Always set static rate to 0 for RoCE

    Set the static rate to 0, as it should be discovered by path query
    and has no meaning for RoCE.
    This also avoids using the rtnl lock and the ethtool API, which are
    a bottleneck when trying to set up many rdma-cm connections at the
    same time, especially with multiple processes.

    Fixes: 3c86aa70bf ("RDMA/cm: Add RDMA CM support for IBoE devices")
    Signed-off-by: Mark Zhang <markzhang@nvidia.com>
    Link: https://lore.kernel.org/r/f72a4f8b667b803aee9fa794069f61afb5839ce4.1685960567.git.leon@kernel.org
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-09-05 10:56:05 -04:00
Kamal Heib eb4813aca0 RDMA/cma: Remove NULL check before dev_{put, hold}
JIRA: https://issues.redhat.com/browse/RHEL-956

commit 08ebf57f6e1d73cc1890e1ff1b1c74887c53770b
Author: Yang Li <yang.lee@linux.alibaba.com>
Date:   Fri Mar 31 09:06:33 2023 +0800

    RDMA/cma: Remove NULL check before dev_{put, hold}

    dev_{put, hold} call netdev_{put, hold}, which already checks for
    NULL, so there is no need to check before using dev_{put, hold};
    remove the checks to silence the warnings:

    ./drivers/infiniband/core/cma.c:713:2-9: WARNING: NULL check before dev_{put, hold} functions is not needed.
    ./drivers/infiniband/core/cma.c:2433:2-9: WARNING: NULL check before dev_{put, hold} functions is not needed.

    Reported-by: Abaci Robot <abaci@linux.alibaba.com>
    Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=4668
    Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
    Link: https://lore.kernel.org/r/20230331010633.63261-1-yang.lee@linux.alibaba.com
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-09-05 10:56:05 -04:00
Kamal Heib 53881012d3 RDMA/cma: Allow UD qp_type to join multicast only
JIRA: https://issues.redhat.com/browse/RHEL-956

commit 58e84f6b3e84e46524b7e5a916b53c1ad798bc8f
Author: Mark Zhang <markzhang@nvidia.com>
Date:   Mon Mar 20 12:59:55 2023 +0200

    RDMA/cma: Allow UD qp_type to join multicast only

    As for multicast:
    - SIDR is the only mode that makes sense;
    - Besides PS_UDP, other port spaces like PS_IB are also allowed, as
      they are UD compatible. In this case the qkey also needs to be set [1].

    This patch allows only the UD qp_type to join multicast, and sets the
    qkey to the default if it's not set, to fix an uninit-value error: the
    ib->rec.qkey field is accessed without being initialized.

    =====================================================
    BUG: KMSAN: uninit-value in cma_set_qkey drivers/infiniband/core/cma.c:510 [inline]
    BUG: KMSAN: uninit-value in cma_make_mc_event+0xb73/0xe00 drivers/infiniband/core/cma.c:4570
     cma_set_qkey drivers/infiniband/core/cma.c:510 [inline]
     cma_make_mc_event+0xb73/0xe00 drivers/infiniband/core/cma.c:4570
     cma_iboe_join_multicast drivers/infiniband/core/cma.c:4782 [inline]
     rdma_join_multicast+0x2b83/0x30a0 drivers/infiniband/core/cma.c:4814
     ucma_process_join+0xa76/0xf60 drivers/infiniband/core/ucma.c:1479
     ucma_join_multicast+0x1e3/0x250 drivers/infiniband/core/ucma.c:1546
     ucma_write+0x639/0x6d0 drivers/infiniband/core/ucma.c:1732
     vfs_write+0x8ce/0x2030 fs/read_write.c:588
     ksys_write+0x28c/0x520 fs/read_write.c:643
     __do_sys_write fs/read_write.c:655 [inline]
     __se_sys_write fs/read_write.c:652 [inline]
     __ia32_sys_write+0xdb/0x120 fs/read_write.c:652
     do_syscall_32_irqs_on arch/x86/entry/common.c:114 [inline]
     __do_fast_syscall_32+0x96/0xf0 arch/x86/entry/common.c:180
     do_fast_syscall_32+0x34/0x70 arch/x86/entry/common.c:205
     do_SYSENTER_32+0x1b/0x20 arch/x86/entry/common.c:248
     entry_SYSENTER_compat_after_hwframe+0x4d/0x5c

    Local variable ib.i created at:
    cma_iboe_join_multicast drivers/infiniband/core/cma.c:4737 [inline]
    rdma_join_multicast+0x586/0x30a0 drivers/infiniband/core/cma.c:4814
    ucma_process_join+0xa76/0xf60 drivers/infiniband/core/ucma.c:1479

    CPU: 0 PID: 29874 Comm: syz-executor.3 Not tainted 5.16.0-rc3-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    =====================================================

    [1] https://lore.kernel.org/linux-rdma/20220117183832.GD84788@nvidia.com/

    Fixes: b5de0c60cc ("RDMA/cma: Fix use after free race in roce multicast join")
    Reported-by: syzbot+8fcbb77276d43cc8b693@syzkaller.appspotmail.com
    Signed-off-by: Mark Zhang <markzhang@nvidia.com>
    Link: https://lore.kernel.org/r/58a4a98323b5e6b1282e83f6b76960d06e43b9fa.1679309909.git.leon@kernel.org
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-09-05 10:56:05 -04:00
Kamal Heib dade592238 Revert "RDMA/core: Refactor rdma_bind_addr"
This reverts commit 0a11f5ee03, which
introduced a regression when running librdmacm-utils.

Bugzilla: https://bugzilla.redhat.com/2212559
Upstream status: RHEL-only

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-06-27 13:04:20 -04:00
Kamal Heib 9b338e1995 RDMA/cma: Distinguish between sockaddr_in and sockaddr_in6 by size
Bugzilla: https://bugzilla.redhat.com/2168937

commit 876e480da2f74715fc70e37723e77ca16a631e35
Author: Kees Cook <keescook@chromium.org>
Date:   Wed Feb 8 15:25:53 2023 -0800

    RDMA/cma: Distinguish between sockaddr_in and sockaddr_in6 by size

    Clang can do some aggressive inlining, which provides it with greater
    visibility into the sizes of various objects that are passed into
    helpers. Specifically, compare_netdev_and_ip() can see through the type
    given to the "sa" argument, which means it can generate code for "struct
    sockaddr_in" that would have been passed to ipv6_addr_cmp() (that expects
    to operate on the larger "struct sockaddr_in6"), which would result in a
    compile-time buffer overflow condition detected by memcmp(). Logically,
    this state isn't reachable due to the sa_family assignment two callers
    above and the check in compare_netdev_and_ip(). Instead, provide a
    compile-time check on sizes so the size-mismatched code will be elided
    when inlining. Avoids the following warning from Clang:

    ../include/linux/fortify-string.h:652:4: error: call to '__read_overflow' declared with 'error' attribute: detected read beyond size of object (1st parameter)
                            __read_overflow();
                            ^
    note: In function 'cma_netevent_callback'
    note:   which inlined function 'node_from_ndev_ip'
    1 error generated.

    When the underlying object size is not known (e.g. with GCC and older
    Clang), the result of __builtin_object_size() is SIZE_MAX, which will also
    compile away, leaving the code as it was originally.
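
A sketch of the mechanism (hypothetical structs, not the cma.c code): when the compiler can see the object behind a pointer, `__builtin_object_size()` reports its real size, so a size guard can elide code that would read past the smaller struct; when the size is unknown it returns `(size_t)-1` (SIZE_MAX) and the guard collapses to "always allowed".

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the two differently sized address structs. */
struct toy_in  { char addr[4];  };
struct toy_in6 { char addr[16]; };

/* Compile-time-friendly size guard: permit the access either when the
 * object is known to be big enough, or when its size is unknown
 * (SIZE_MAX), matching the fallback behavior described above. */
#define TOY_FITS(p, need) \
	(__builtin_object_size(p, 0) == (size_t)-1 || \
	 __builtin_object_size(p, 0) >= (need))
```

With a visible `struct toy_in` object, `TOY_FITS(&v4, sizeof(struct toy_in6))` is false, so the IPv6-sized access path can be skipped entirely at compile time.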

    Link: https://lore.kernel.org/r/20230208232549.never.139-kees@kernel.org
    Link: https://github.com/ClangBuiltLinux/linux/issues/1687
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Tested-by: Nathan Chancellor <nathan@kernel.org> # build
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-03-31 14:16:21 -04:00
Kamal Heib 3096760fb3 RDMA/cma: Refactor the inbound/outbound path records process flow
Bugzilla: https://bugzilla.redhat.com/2168937

commit ccae0447af0e471426beea789a52b2b6605663e0
Author: Mark Zhang <markzhang@nvidia.com>
Date:   Wed Jan 4 10:03:41 2023 +0200

    RDMA/cma: Refactor the inbound/outbound path records process flow

    Refactors based on comments [1] of the multiple path records support
    patchset:
    - Return failure if not able to set inbound/outbound PRs;
    - Simplify the flow when receiving the PRs from netlink channel: When
      a good PR response is received, unpack it and call the path_query
      callback directly. This saves two memory allocations;
    - Define RDMA_PRIMARY_PATH_MAX_REC_NUM in a proper place.

    [1] https://lore.kernel.org/linux-rdma/Yyxp9E9pJtUids2o@nvidia.com/

    Signed-off-by: Mark Zhang <markzhang@nvidia.com>
    Reviewed-by: Bart Van Assche <bvanassche@acm.org> #srp
    Link: https://lore.kernel.org/r/7610025d57342b8b6da0f19516c9612f9c3fdc37.1672819376.git.leonro@nvidia.com
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-03-31 14:16:21 -04:00
Kamal Heib 0a11f5ee03 RDMA/core: Refactor rdma_bind_addr
Bugzilla: https://bugzilla.redhat.com/2168937

commit 8d037973d48c026224ab285e6a06985ccac6f7bf
Author: Patrisious Haddad <phaddad@nvidia.com>
Date:   Wed Jan 4 10:01:38 2023 +0200

    RDMA/core: Refactor rdma_bind_addr

    Refactor the rdma_bind_addr function so that it doesn't require the
    CMA destination address to be changed before calling it.

    It now updates the destination address internally, only when it is
    really needed and after passing all the required checks.

    This in turn results in cleaner and more sensible call and error
    handling flows for the functions that call it directly or indirectly.

    Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
    Reported-by: Wei Chen <harperchen1110@gmail.com>
    Reviewed-by: Mark Zhang <markzhang@nvidia.com>
    Link: https://lore.kernel.org/r/3d0e9a2fd62bc10ba02fed1c7c48a48638952320.1672819273.git.leonro@nvidia.com
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-03-31 14:16:21 -04:00
Kamal Heib e9047034d8 RDMA/cma: Change RoCE packet life time from 18 to 16
Bugzilla: https://bugzilla.redhat.com/2168936

commit fb4907f487254375830f135dcfe5dd7e6f8b705f
Author: Chao Leng <lengchao@huawei.com>
Date:   Fri Nov 25 09:00:26 2022 +0800

    RDMA/cma: Change RoCE packet life time from 18 to 16

    The ack timeout retransmission time is affected by the following two
    factors: one is packet life time, another is the HCA processing time.

    Now the default packet lifetime (CMA_IBOE_PACKET_LIFETIME) is 18.

    That means the minimum ack timeout is 2 seconds
    (2^(18+1)*4us = 2.097 seconds). The packet lifetime means the
    maximum transmission time of packets on the network; 2 seconds is
    too long.

    Assume the network is a Clos topology with three layers, so every
    packet will pass through five hops of switches. Assume the buffer of
    every switch is 128 MB and the port transmission rate is 25 Gbit/s;
    then the maximum transmission time of the packet is 200 ms
    (128 MB * 5 / 25 Gbit/s). Adding double redundancy, it is less than
    500 ms.

    So change CMA_IBOE_PACKET_LIFETIME to 16; the maximum transmission
    time of the packet will then be about 500+ms, which is long enough.
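
The arithmetic from the message can be checked directly: the minimum ack timeout is 2^(lifetime + 1) * 4 microseconds, so lifetime 18 gives about 2.097 s and lifetime 16 gives about 0.524 s, the "about 500+ms" in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum ack timeout in microseconds for a given packet lifetime,
 * per the 2^(lifetime+1) * 4us formula quoted in the commit message. */
static uint64_t toy_min_ack_timeout_us(unsigned int lifetime)
{
	return (1ull << (lifetime + 1)) * 4;
}
```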

    Link: https://lore.kernel.org/r/20221125010026.755-1-lengchao@huawei.com
    Signed-off-by: Chao Leng <lengchao@huawei.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-03-31 14:16:05 -04:00
Kamal Heib b3a23ffc17 treewide: use prandom_u32_max() when possible, part 1
Bugzilla: https://bugzilla.redhat.com/2168933
Conflicts:
Include only the RDMA related hunks.

commit 81895a65ec63ee1daec3255dc1a06675d2fbe915
Author: Jason A. Donenfeld <Jason@zx2c4.com>
Date:   Wed Oct 5 16:43:38 2022 +0200

    treewide: use prandom_u32_max() when possible, part 1

    Rather than incurring a division or requesting too many random bytes for
    the given range, use the prandom_u32_max() function, which only takes
    the minimum required bytes from the RNG and avoids divisions. This was
    done mechanically with this coccinelle script:

    @basic@
    expression E;
    type T;
    identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
    typedef u64;
    @@
    (
    - ((T)get_random_u32() % (E))
    + prandom_u32_max(E)
    |
    - ((T)get_random_u32() & ((E) - 1))
    + prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
    |
    - ((u64)(E) * get_random_u32() >> 32)
    + prandom_u32_max(E)
    |
    - ((T)get_random_u32() & ~PAGE_MASK)
    + prandom_u32_max(PAGE_SIZE)
    )

    @multi_line@
    identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
    identifier RAND;
    expression E;
    @@

    -       RAND = get_random_u32();
            ... when != RAND
    -       RAND %= (E);
    +       RAND = prandom_u32_max(E);

    // Find a potential literal
    @literal_mask@
    expression LITERAL;
    type T;
    identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
    position p;
    @@

            ((T)get_random_u32()@p & (LITERAL))

    // Add one to the literal.
    @script:python add_one@
    literal << literal_mask.LITERAL;
    RESULT;
    @@

    value = None
    if literal.startswith('0x'):
            value = int(literal, 16)
    elif literal[0] in '123456789':
            value = int(literal, 10)
    if value is None:
            print("I don't know how to handle %s" % (literal))
            cocci.include_match(False)
    elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
            print("Skipping 0x%x for cleanup elsewhere" % (value))
            cocci.include_match(False)
    elif value & (value + 1) != 0:
            print("Skipping 0x%x because it's not a power of two minus one" % (value))
            cocci.include_match(False)
    elif literal.startswith('0x'):
            coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
    else:
            coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))

    // Replace the literal mask with the calculated result.
    @plus_one@
    expression literal_mask.LITERAL;
    position literal_mask.p;
    expression add_one.RESULT;
    identifier FUNC;
    @@

    -       (FUNC()@p & (LITERAL))
    +       prandom_u32_max(RESULT)

    @collapse_ret@
    type T;
    identifier VAR;
    expression E;
    @@

     {
    -       T VAR;
    -       VAR = (E);
    -       return VAR;
    +       return E;
     }

    @drop_var@
    type T;
    identifier VAR;
    @@

     {
    -       T VAR;
            ... when != VAR
     }
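
The division-free bounding trick the script canonicalizes (the `((u64)(E) * get_random_u32() >> 32)` pattern) can be shown in plain userspace C: for a 32-bit value x, `((u64)x * n) >> 32` scales x into [0, n) with one multiply instead of a modulo.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply-shift bounding: maps a full-range 32-bit value x into
 * [0, n) without a division, the pattern prandom_u32_max() replaces. */
static uint32_t toy_bounded(uint32_t x, uint32_t n)
{
	return (uint32_t)(((uint64_t)x * n) >> 32);
}
```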

    Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Yury Norov <yury.norov@gmail.com>
    Reviewed-by: KP Singh <kpsingh@kernel.org>
    Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
    Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
    Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
    Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-03-31 14:15:53 -04:00
Kamal Heib 6b2dbc867c RDMA/cm: Use DLID from inbound/outbound PathRecords as the datapath DLID
Bugzilla: https://bugzilla.redhat.com/2168933

commit eb8336dbe373edd1ad6061c543e4ba6ea60f6cc9
Author: Mark Zhang <markzhang@nvidia.com>
Date:   Thu Sep 8 13:09:03 2022 +0300

    RDMA/cm: Use DLID from inbound/outbound PathRecords as the datapath DLID

    In inter-subnet cases, when inbound/outbound PRs are available,
    outbound_PR.dlid is used as the requestor's datapath DLID and
    inbound_PR.dlid is used as the responder's DLID. The inbound_PR.dlid
    is passed to responder side with the "ConnectReq.Primary_Local_Port_LID"
    field. With this solution the PERMISSIVE_LID is no longer used in
    Primary Local LID field.

    Signed-off-by: Mark Zhang <markzhang@nvidia.com>
    Reviewed-by: Mark Bloch <mbloch@nvidia.com>
    Link: https://lore.kernel.org/r/b3f6cac685bce9dde37c610be82e2c19d9e51d9e.1662631201.git.leonro@nvidia.com
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-03-31 14:15:53 -04:00
Kamal Heib ec99d37f73 RDMA/cma: Multiple path records support with netlink channel
Bugzilla: https://bugzilla.redhat.com/2168933

commit 5a3749493394276449cfc4efb417ed267edbd480
Author: Mark Zhang <markzhang@nvidia.com>
Date:   Thu Sep 8 13:09:01 2022 +0300

    RDMA/cma: Multiple path records support with netlink channel

    Support receiving inbound and outbound IB path records (along with GMP
    PathRecord) from user-space service through the RDMA netlink channel.
    The LIDs in these 3 PRs can be used in this way:
    1. GMP PR: used as the standard local/remote LIDs;
    2. DLID of outbound PR: Used as the "dlid" field for outbound traffic;
    3. DLID of inbound PR: Used as the "dlid" field for outbound traffic in
       responder side.

    This is aimed at supporting adaptive routing. With the current IB
    routing solution, when a packet goes out it is assigned a fixed DLID
    per target, meaning a fixed router will be used.
    The LIDs in inbound/outbound path records can be used to identify a
    group of routers that allow communication with another subnet's
    entity. With them, packets from an inter-subnet connection may
    travel through any router in the set to reach the target.

    As confirmed with Jason, when sending a netlink request, kernel uses
    LS_RESOLVE_PATH_USE_ALL so that the service knows kernel supports
    multiple PRs.

    Signed-off-by: Mark Zhang <markzhang@nvidia.com>
    Reviewed-by: Mark Bloch <mbloch@nvidia.com>
    Link: https://lore.kernel.org/r/2fa2b6c93c4c16c8915bac3cfc4f27be1d60519d.1662631201.git.leonro@nvidia.com
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-03-31 14:15:53 -04:00
Kamal Heib 9cdbadf668 RDMA/core: Rename rdma_route.num_paths field to num_pri_alt_paths
Bugzilla: https://bugzilla.redhat.com/2168933

commit bf9a9928510a03e445fa4f54bdc0b8e71f4c0067
Author: Mark Zhang <markzhang@nvidia.com>
Date:   Thu Sep 8 13:09:00 2022 +0300

    RDMA/core: Rename rdma_route.num_paths field to num_pri_alt_paths

    This field holds the total number of primary and alternate paths,
    i.e.:
      0 - No primary nor alternate path is available;
      1 - Only primary path is available;
      2 - Both primary and alternate path are available.
    Rename it to avoid confusion, as with the following patches the primary
    path will support multiple path records.

    Signed-off-by: Mark Zhang <markzhang@nvidia.com>
    Reviewed-by: Mark Bloch <mbloch@nvidia.com>
    Link: https://lore.kernel.org/r/cbe424de63a56207870d70c5edce7c68e45f429e.1662631201.git.leonro@nvidia.com
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2023-03-31 14:15:52 -04:00
Kamal Heib 3d372f628b RDMA/cma: Use output interface for net_dev check
Bugzilla: https://bugzilla.redhat.com/2120668

commit eb83f502adb036cd56c27e13b9ca3b2aabfa790b
Author: Håkon Bugge <haakon.bugge@oracle.com>
Date:   Wed Oct 12 16:15:42 2022 +0200

    RDMA/cma: Use output interface for net_dev check

    Commit 27cfde795a96 ("RDMA/cma: Fix arguments order in net device
    validation") swapped the src and dst addresses in the call to
    validate_net_dev().

    As a consequence, the test in validate_ipv4_net_dev() to see if the
    net_dev is the right one, is incorrect for port 1 <-> 2 communication when
    the ports are on the same sub-net. This is fixed by using the
    flowi4_oif (output interface) as the device instead of the incoming one.

    The bug has not been observed using IPv6 addresses.

    Fixes: 27cfde795a96 ("RDMA/cma: Fix arguments order in net device validation")
    Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
    Link: https://lore.kernel.org/r/20221012141542.16925-1-haakon.bugge@oracle.com
    Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-11-29 11:40:49 -05:00
Kamal Heib a176f42b99 RDMA/cma: Fix arguments order in net device validation
Bugzilla: https://bugzilla.redhat.com/2120665

commit 27cfde795a96aef1e859a5480489944b95421e46
Author: Michael Guralnik <michaelgur@nvidia.com>
Date:   Tue Aug 23 13:51:50 2022 +0300

    RDMA/cma: Fix arguments order in net device validation

    Fix the order of source and destination addresses when resolving the
    route between server and client to validate use of correct net device.

    The reverse order we had so far didn't actually validate the net device
    as the server would try to resolve the route to itself, thus always
    getting the server's net device.

    The issue was discovered when running cm applications on a single host
    between 2 interfaces with same subnet and source based routing rules.
    When resolving the reverse route the source based route rules were
    ignored.

    Fixes: f887f2ac87 ("IB/cma: Validate routing of incoming requests")
    Link: https://lore.kernel.org/r/1c1ec2277a131d277ebcceec987fd338d35b775f.1661251872.git.leonro@nvidia.com
    Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-10-06 15:48:08 -04:00
Kamal Heib b3556addd4 RDMA/core: Add a netevent notifier to cma
Bugzilla: https://bugzilla.redhat.com/2120665
Bugzilla: https://bugzilla.redhat.com/2117911

commit 925d046e7e52c71c3531199ce137e141807ef740
Author: Patrisious Haddad <phaddad@nvidia.com>
Date:   Tue Jun 7 14:32:44 2022 +0300

    RDMA/core: Add a netevent notifier to cma

    Add a netevent callback for cma, mainly to catch NETEVENT_NEIGH_UPDATE.

    Previously, when a system with a failover MAC mechanism changed its MAC
    address during a CM connection attempt, the RDMA-CM would take a long time
    to disconnect and time out due to the incorrect MAC address.

    Now, when we get a NETEVENT_NEIGH_UPDATE, we check whether it is due to a
    failover MAC change; if so, we instantly destroy the CM connection and
    notify the user, sparing the unnecessary wait for the timeout.

    Link: https://lore.kernel.org/r/bb255c9e301cd50b905663b8e73f7f5133d0e4c5.1654601342.git.leonro@nvidia.com
    Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
    Reviewed-by: Mark Zhang <markzhang@nvidia.com>
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-10-06 15:46:33 -04:00
Kamal Heib dabf620fc7 RDMA/core: Add an rb_tree that stores cm_ids sorted by ifindex and remote IP
Bugzilla: https://bugzilla.redhat.com/2120665
Bugzilla: https://bugzilla.redhat.com/2117911

commit fc008bdbf1cd02e36bbfe53ea006b258335d908e
Author: Patrisious Haddad <phaddad@nvidia.com>
Date:   Tue Jun 7 14:32:43 2022 +0300

    RDMA/core: Add an rb_tree that stores cm_ids sorted by ifindex and remote IP

    Add to the cma, a tree that keeps track of all rdma_id_private channels
    that were created while in RoCE mode.

    The IDs are sorted first by their netdevice ifindex and then by their
    destination IP. IDs with a matching IP share the same tree node, since
    each node's data is a list of all IDs with that destination IP.

    The tree allows fast and efficient lookup of ids using an ifindex and
    IP address which is useful for identifying relevant net_events promptly.

    Link: https://lore.kernel.org/r/2fac52c86cc918c634ab24b3867d4aed992f54ec.1654601342.git.leonro@nvidia.com
    Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
    Reviewed-by: Mark Zhang <markzhang@nvidia.com>
    Signed-off-by: Leon Romanovsky <leon@kernel.org>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-10-06 15:46:05 -04:00
Kamal Heib adadfc75dc IB/cma: Allow XRC INI QPs to set their local ACK timeout
Bugzilla: http://bugzilla.redhat.com/2056772

commit 748663c8ccf6b2e5a800de19127c2cc1c4423fd2
Author: Håkon Bugge <haakon.bugge@oracle.com>
Date:   Wed Feb 9 16:39:35 2022 +0100

    IB/cma: Allow XRC INI QPs to set their local ACK timeout

    XRC INI QPs should be able to adjust their local ACK timeout.

    Fixes: 2c1619edef ("IB/cma: Define option to set ack timeout and pack tos_set")
    Link: https://lore.kernel.org/r/1644421175-31943-1-git-send-email-haakon.bugge@oracle.com
    Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
    Suggested-by: Avneesh Pant <avneesh.pant@oracle.com>
    Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-05-10 11:45:08 +03:00
Kamal Heib 0443531e9a RDMA/cma: Do not change route.addr.src_addr outside state checks
Bugzilla: http://bugzilla.redhat.com/2056771

commit 22e9f71072fa605cbf033158db58e0790101928d
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Wed Feb 23 11:23:57 2022 -0400

    RDMA/cma: Do not change route.addr.src_addr outside state checks

    If the state is not idle then resolve_prepare_src() should immediately
    fail and no change to global state should happen. However, it
    unconditionally overwrites the src_addr trying to build a temporary any
    address.

    For instance if the state is already RDMA_CM_LISTEN then this will corrupt
    the src_addr and would cause the test in cma_cancel_operation():

               if (cma_any_addr(cma_src_addr(id_priv)) && !id_priv->cma_dev)

    Which would manifest as this trace from syzkaller:

      BUG: KASAN: use-after-free in __list_add_valid+0x93/0xa0 lib/list_debug.c:26
      Read of size 8 at addr ffff8881546491e0 by task syz-executor.1/32204

      CPU: 1 PID: 32204 Comm: syz-executor.1 Not tainted 5.12.0-rc8-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:79 [inline]
       dump_stack+0x141/0x1d7 lib/dump_stack.c:120
       print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:232
       __kasan_report mm/kasan/report.c:399 [inline]
       kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:416
       __list_add_valid+0x93/0xa0 lib/list_debug.c:26
       __list_add include/linux/list.h:67 [inline]
       list_add_tail include/linux/list.h:100 [inline]
       cma_listen_on_all drivers/infiniband/core/cma.c:2557 [inline]
       rdma_listen+0x787/0xe00 drivers/infiniband/core/cma.c:3751
       ucma_listen+0x16a/0x210 drivers/infiniband/core/ucma.c:1102
       ucma_write+0x259/0x350 drivers/infiniband/core/ucma.c:1732
       vfs_write+0x28e/0xa30 fs/read_write.c:603
       ksys_write+0x1ee/0x250 fs/read_write.c:658
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xae

    This is indicating that an rdma_id_private was destroyed without doing
    cma_cancel_listens().

    Instead of trying to re-use the src_addr memory to indirectly create an
    any address derived from the dst build one explicitly on the stack and
    bind to that as any other normal flow would do. rdma_bind_addr() will copy
    it over the src_addr once it knows the state is valid.

    This is similar to commit bc0bdc5afaa7 ("RDMA/cma: Do not change
    route.addr.src_addr.ss_family")

    Link: https://lore.kernel.org/r/0-v2-e975c8fd9ef2+11e-syz_cma_srcaddr_jgg@nvidia.com
    Cc: stable@vger.kernel.org
    Fixes: 732d41c545 ("RDMA/cma: Make the locking for automatic state transition more clear")
    Reported-by: syzbot+c94a3675a626f6333d74@syzkaller.appspotmail.com
    Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-03-23 20:02:42 -04:00
Kamal Heib 678d7a5efe RDMA/cma: Use correct address when leaving multicast group
Bugzilla: http://bugzilla.redhat.com/2056771

commit d9e410ebbed9d091b97bdf45b8a3792e2878dc48
Author: Maor Gottlieb <maorg@nvidia.com>
Date:   Tue Jan 18 09:35:00 2022 +0200

    RDMA/cma: Use correct address when leaving multicast group

    In RoCE we should use cma_iboe_set_mgid() and not cma_set_mgid to generate
    the mgid, otherwise we will generate an IGMP for an incorrect address.

    Fixes: b5de0c60cc ("RDMA/cma: Fix use after free race in roce multicast join")
    Link: https://lore.kernel.org/r/913bc6783fd7a95fe71ad9454e01653ee6fb4a9a.1642491047.git.leonro@nvidia.com
    Signed-off-by: Maor Gottlieb <maorg@nvidia.com>
    Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-03-23 20:02:41 -04:00
Kamal Heib e6a78a3179 RDMA/cma: Let cma_resolve_ib_dev() continue search even after empty entry
Bugzilla: http://bugzilla.redhat.com/2056771

commit 20679094a0161c94faf77e373fa3f7428a8e14bd
Author: Avihai Horon <avihaih@nvidia.com>
Date:   Thu Dec 9 15:16:07 2021 +0200

    RDMA/cma: Let cma_resolve_ib_dev() continue search even after empty entry

    Currently, when cma_resolve_ib_dev() searches for a matching GID it will
    stop searching after encountering the first empty GID table entry. This
    behavior is wrong since neither the IB nor the RoCE spec enforces tightly
    packed
    GID tables.

    For example, when the matching valid GID entry exists at index N, and if a
    GID entry is empty at index N-1, cma_resolve_ib_dev() will fail to find
    the matching valid entry.

    Fix it by making cma_resolve_ib_dev() continue searching even after
    encountering missing entries.

    Fixes: f17df3b0de ("RDMA/cma: Add support for AF_IB to rdma_resolve_addr()")
    Link: https://lore.kernel.org/r/b7346307e3bb396c43d67d924348c6c496493991.1639055490.git.leonro@nvidia.com
    Signed-off-by: Avihai Horon <avihaih@nvidia.com>
    Reviewed-by: Mark Zhang <markzhang@nvidia.com>
    Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-03-23 20:02:32 -04:00
Kamal Heib f8a9008c08 RDMA/cma: Remove open coding of overflow checking for private_data_len
Bugzilla: http://bugzilla.redhat.com/2056771

commit 8d0d2b0f41b1b2add8a30dbd816051a964efa497
Author: Håkon Bugge <haakon.bugge@oracle.com>
Date:   Tue Nov 23 11:06:18 2021 +0100

    RDMA/cma: Remove open coding of overflow checking for private_data_len

    The existing tests are a little hard to comprehend. Use
    check_add_overflow() instead.

    Fixes: 04ded16724 ("RDMA/cma: Verify private data length")
    Link: https://lore.kernel.org/r/1637661978-18770-1-git-send-email-haakon.bugge@oracle.com
    Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
    Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-03-23 20:02:28 -04:00
Kamal Heib 93e15cddc5 RDMA/cma: Split apart the multiple uses of the same list heads
Bugzilla: http://bugzilla.redhat.com/2056770

commit 99cfddb8a8bd57122effa808653dec83408705a6
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Wed Sep 15 13:25:19 2021 -0300

    RDMA/cma: Split apart the multiple uses of the same list heads

    Two list heads in the rdma_id_private are being used for multiple
    purposes, to save a few bytes of memory. Give the different purposes
    different names and union the memory that is clearly exclusive.

    list splits into device_item and listen_any_item. device_item is threaded
    onto the cma_device's list and listen_any goes onto the
    listen_any_list. IDs doing any listen cannot have devices.

    listen_list splits into listen_item and listen_list. listen_list is on the
    parent listen any rdma_id_private and listen_item is on child listen that
    is bound to a specific cma_dev.

    Which name should be used in which case depends on the state and other
    factors of the rdma_id_private. Remap all the confusing references to make
    sense with the new names, so at least there is some hope of matching the
    necessary preconditions with each access.

    Link: https://lore.kernel.org/r/0-v1-a5ead4a0c19d+c3a-cma_list_head_jgg@nvidia.com
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-03-23 19:56:54 -04:00
Kamal Heib 613fc9cd29 RDMA/core/sa_query: Retry SA queries
Bugzilla: http://bugzilla.redhat.com/2056769

commit 5f5a650999d5718af766fc70a120230b04235a6f
Author: Håkon Bugge <haakon.bugge@oracle.com>
Date:   Thu Aug 12 18:12:35 2021 +0200

    RDMA/core/sa_query: Retry SA queries

    A MAD packet is sent as an unreliable datagram (UD). SA requests are sent
    as MAD packets. As such, SA requests or responses may be silently dropped.

    IB Core's MAD layer has a timeout and retry mechanism which, amongst
    others, is used by RDMA CM. But it is not used by SA queries. The lack of
    retries for SA queries means that on packet loss the full specified
    timeout must elapse and an error is returned; the ULP or user-land
    process then has to perform the retry itself.

    Fix this by taking advantage of the MAD layer's retry mechanism.

    First, a check against a zero timeout is added in rdma_resolve_route(). In
    send_mad(), we set the MAD layer timeout to one tenth of the specified
    timeout and the number of retries to 10. The special case when timeout is
    less than 10 is handled.

    With this fix:

     # ucmatose -c 1000 -S 1024 -C 1

    runs stable on an Infiniband fabric. Without this fix, we see an
    intermittent behavior and it errors out with:

    cmatose: event: RDMA_CM_EVENT_ROUTE_ERROR, error: -110

    (110 is ETIMEDOUT)

    Link: https://lore.kernel.org/r/1628784755-28316-1-git-send-email-haakon.bugge@oracle.com
    Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-03-23 19:49:53 -04:00
Kamal Heib abf79a3198 RDMA/cma: Do not change route.addr.src_addr.ss_family
Bugzilla: http://bugzilla.redhat.com/2032069
CVE: CVE-2021-4028

commit bc0bdc5afaa740d782fbf936aaeebd65e5c2921d
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Wed Sep 15 17:21:43 2021 -0300

    RDMA/cma: Do not change route.addr.src_addr.ss_family

    If the state is not idle then rdma_bind_addr() will immediately fail and
    no change to global state should happen.

    For instance if the state is already RDMA_CM_LISTEN then this will corrupt
    the src_addr and would cause the test in cma_cancel_operation():

                    if (cma_any_addr(cma_src_addr(id_priv)) && !id_priv->cma_dev)

    The mangled src_addr, e.g. an IPv6 loopback address with an IPv4
    family, would then fail the test.

    This would manifest as this trace from syzkaller:

      BUG: KASAN: use-after-free in __list_add_valid+0x93/0xa0 lib/list_debug.c:26
      Read of size 8 at addr ffff8881546491e0 by task syz-executor.1/32204

      CPU: 1 PID: 32204 Comm: syz-executor.1 Not tainted 5.12.0-rc8-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:79 [inline]
       dump_stack+0x141/0x1d7 lib/dump_stack.c:120
       print_address_description.constprop.0.cold+0x5b/0x2f8 mm/kasan/report.c:232
       __kasan_report mm/kasan/report.c:399 [inline]
       kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:416
       __list_add_valid+0x93/0xa0 lib/list_debug.c:26
       __list_add include/linux/list.h:67 [inline]
       list_add_tail include/linux/list.h:100 [inline]
       cma_listen_on_all drivers/infiniband/core/cma.c:2557 [inline]
       rdma_listen+0x787/0xe00 drivers/infiniband/core/cma.c:3751
       ucma_listen+0x16a/0x210 drivers/infiniband/core/ucma.c:1102
       ucma_write+0x259/0x350 drivers/infiniband/core/ucma.c:1732
       vfs_write+0x28e/0xa30 fs/read_write.c:603
       ksys_write+0x1ee/0x250 fs/read_write.c:658
       do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
       entry_SYSCALL_64_after_hwframe+0x44/0xae

    Which is indicating that an rdma_id_private was destroyed without doing
    cma_cancel_listens().

    Instead of trying to re-use the src_addr memory to indirectly create an
    any address build one explicitly on the stack and bind to that as any
    other normal flow would do.

    Link: https://lore.kernel.org/r/0-v1-9fbb33f5e201+2a-cma_listen_jgg@nvidia.com
    Cc: stable@vger.kernel.org
    Fixes: 732d41c545 ("RDMA/cma: Make the locking for automatic state transition more clear")
    Reported-by: syzbot+6bb0528b13611047209c@syzkaller.appspotmail.com
    Tested-by: Hao Sun <sunhao.th@gmail.com>
    Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-02-09 11:24:59 +02:00
Kamal Heib 31dd87a92b RDMA/cma: Ensure rdma_addr_cancel() happens before issuing more requests
Bugzilla: http://bugzilla.redhat.com/2036599

commit 305d568b72f17f674155a2a8275f865f207b3808
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Thu Sep 16 15:34:46 2021 -0300

    RDMA/cma: Ensure rdma_addr_cancel() happens before issuing more requests

    The FSM can run in a circle allowing rdma_resolve_ip() to be called twice
    on the same id_priv. While this cannot happen without going through the
    work, it violates the invariant that the same address resolution
    background request cannot be active twice.

           CPU 1                                  CPU 2

    rdma_resolve_addr():
      RDMA_CM_IDLE -> RDMA_CM_ADDR_QUERY
      rdma_resolve_ip(addr_handler)  #1

                             process_one_req(): for #1
                              addr_handler():
                                RDMA_CM_ADDR_QUERY -> RDMA_CM_ADDR_BOUND
                                mutex_unlock(&id_priv->handler_mutex);
                                [.. handler still running ..]

    rdma_resolve_addr():
      RDMA_CM_ADDR_BOUND -> RDMA_CM_ADDR_QUERY
      rdma_resolve_ip(addr_handler)
        !! two requests are now on the req_list

    rdma_destroy_id():
     destroy_id_handler_unlock():
      _destroy_id():
       cma_cancel_operation():
        rdma_addr_cancel()

                              // process_one_req() self removes it
                              spin_lock_bh(&lock);
                               cancel_delayed_work(&req->work);
                               if (!list_empty(&req->list)) == true

          ! rdma_addr_cancel() returns after process_on_req #1 is done

       kfree(id_priv)

                             process_one_req(): for #2
                              addr_handler():
                                mutex_lock(&id_priv->handler_mutex);
                                !! Use after free on id_priv

    rdma_addr_cancel() expects there to be one req on the list and only
    cancels the first one. The self-removal behavior of the work only happens
    after the handler has returned. This yields a situation where the
    req_list can have two reqs for the same "handle" but rdma_addr_cancel()
    only cancels the first one.

    The second req remains active beyond rdma_destroy_id() and will
    use-after-free id_priv once it inevitably triggers.

    Fix this by remembering if the id_priv has called rdma_resolve_ip() and
    always cancel before calling it again. This ensures the req_list never
    gets more than one item in it and doesn't cost anything in the normal flow
    that never uses this strange error path.

    Link: https://lore.kernel.org/r/0-v1-3bc675b8006d+22-syz_cancel_uaf_jgg@nvidia.com
    Cc: stable@vger.kernel.org
    Fixes: e51060f08a ("IB: IP address based RDMA connection manager")
    Reported-by: syzbot+dc3dfba010d7671e05f5@syzkaller.appspotmail.com
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-01-10 08:09:32 +02:00
Kamal Heib b91d339c3f RDMA/cma: Fix listener leak in rdma_cma_listen_on_all() failure
Bugzilla: http://bugzilla.redhat.com/2036599

commit ca465e1f1f9b38fe916a36f7d80c5d25f2337c81
Author: Tao Liu <thomas.liu@ucloud.cn>
Date:   Mon Sep 13 17:33:44 2021 +0800

    RDMA/cma: Fix listener leak in rdma_cma_listen_on_all() failure

    If cma_listen_on_all() fails it leaves the per-device ID still on the
    listen_list but the state is not set to RDMA_CM_ADDR_BOUND.

    When the cmid is eventually destroyed cma_cancel_listens() is not called
    due to the wrong state, however the per-device IDs are still holding the
    refcount preventing the ID from being destroyed, thus deadlocking:

     task:rping state:D stack:   0 pid:19605 ppid: 47036 flags:0x00000084
     Call Trace:
      __schedule+0x29a/0x780
      ? free_unref_page_commit+0x9b/0x110
      schedule+0x3c/0xa0
      schedule_timeout+0x215/0x2b0
      ? __flush_work+0x19e/0x1e0
      wait_for_completion+0x8d/0xf0
      _destroy_id+0x144/0x210 [rdma_cm]
      ucma_close_id+0x2b/0x40 [rdma_ucm]
      __destroy_id+0x93/0x2c0 [rdma_ucm]
      ? __xa_erase+0x4a/0xa0
      ucma_destroy_id+0x9a/0x120 [rdma_ucm]
      ucma_write+0xb8/0x130 [rdma_ucm]
      vfs_write+0xb4/0x250
      ksys_write+0xb5/0xd0
      ? syscall_trace_enter.isra.19+0x123/0x190
      do_syscall_64+0x33/0x40
      entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Ensure that cma_listen_on_all() atomically unwinds its action under the
    lock during error.

    Fixes: c80a0c52d8 ("RDMA/cma: Add missing error handling of listen_id")
    Link: https://lore.kernel.org/r/20210913093344.17230-1-thomas.liu@ucloud.cn
    Signed-off-by: Tao Liu <thomas.liu@ucloud.cn>
    Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-01-04 12:58:47 +02:00
Kamal Heib 9a1c699531 IB/cma: Do not send IGMP leaves for sendonly Multicast groups
Bugzilla: http://bugzilla.redhat.com/2036599

commit 2cc74e1ee31d00393b6698ec80b322fd26523da4
Author: Christoph Lameter <cl@gentwo.de>
Date:   Wed Sep 8 13:43:28 2021 +0200

    IB/cma: Do not send IGMP leaves for sendonly Multicast groups

    ROCE uses IGMP for Multicast instead of the native Infiniband system where
    joins are required in order to post messages on the Multicast group.  On
    Ethernet one can send Multicast messages to arbitrary addresses without
    the need to subscribe to a group.

    So ROCE correctly does not send IGMP joins during rdma_join_multicast().

    For example, in cma_iboe_join_multicast() we see:

       if (addr->sa_family == AF_INET) {
                    if (gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP) {
                            ib.rec.hop_limit = IPV6_DEFAULT_HOPLIMIT;
                            if (!send_only) {
                                    err = cma_igmp_send(ndev, &ib.rec.mgid,
                                                        true);
                            }
                    }
            } else {

    So the IGMP join is suppressed as it is unnecessary.

    However no such check is done in destroy_mc(). And therefore leaving a
    sendonly multicast group will send an IGMP leave.

    This means that the following scenario can lead to a multicast receiver
    unexpectedly being unsubscribed from a MC group:

    1. Sender thread does a sendonly join on MC group X. No IGMP join
       is sent.

    2. Receiver thread does a regular join on the same MC group X.
       IGMP join is sent and the receiver begins to get messages.

    3. Sender thread terminates and destroys MC group X.
       IGMP leave is sent and the receiver no longer receives data.

    This patch adds the same logic for sendonly joins to destroy_mc() that is
    also used in cma_iboe_join_multicast().

    Fixes: ab15c95a17 ("IB/core: Support for CMA multicast join flags")
    Link: https://lore.kernel.org/r/alpine.DEB.2.22.394.2109081340540.668072@gentwo.de
    Signed-off-by: Christoph Lameter <cl@linux.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Signed-off-by: Kamal Heib <kheib@redhat.com>
2022-01-04 12:58:46 +02:00
Mike Marciniszyn db4657afd1 RDMA/cma: Revert INIT-INIT patch
The net/sunrpc/xprtrdma module creates its QP using rdma_create_qp() and
immediately posts receives, implicitly assuming the QP is in the INIT state
and thus valid for ib_post_recv().

The patch noted in Fixes: removed the RESET->INIT modify from
rdma_create_qp(), breaking NFS rdma for verbs providers that fail the
ib_post_recv() for a bad state.

This situation was proven using kprobes in rvt_post_recv() and
rvt_modify_qp(). The traces showed that the rvt_post_recv() failed before
ANY modify QP and that the current state was RESET.

Fix by reverting the patch below.

Fixes: dc70f7c3ed ("RDMA/cma: Remove unnecessary INIT->INIT transition")
Link: https://lore.kernel.org/r/1627583182-81330-1-git-send-email-mike.marciniszyn@cornelisnetworks.com
Cc: Haakon Bugge <haakon.bugge@oracle.com>
Cc: Chuck Lever III <chuck.lever@oracle.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@cornelisnetworks.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-08-02 12:45:22 -03:00
Leon Romanovsky 3d82875442 RDMA/core: Always release restrack object
Change the location of rdma_restrack_del() to fix a bug where the
task_struct was acquired but not released, causing a resource leak.

  ucma_create_id() {
    ucma_alloc_ctx();
    rdma_create_user_id() {
      rdma_restrack_new();
      rdma_restrack_set_name() {
        rdma_restrack_attach_task.part.0(); <--- task_struct was gotten
      }
    }
    ucma_destroy_private_ctx() {
      ucma_put_ctx();
      rdma_destroy_id() {
        _destroy_id()                       <--- id_priv was freed
      }
    }
  }

Fixes: 889d916b6f ("RDMA/core: Don't access cm_id after its destruction")
Link: https://lore.kernel.org/r/073ec27acb943ca8b6961663c47c5abe78a5c8cc.1624948948.git.leonro@nvidia.com
Reported-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-29 19:57:18 -03:00
Gerd Rausch 74f160ead7 RDMA/cma: Fix rdma_resolve_route() memory leak
Fix a memory leak when rdma_resolve_route() is called more than once on
the same "rdma_cm_id".

This is possible if cma_query_handler() triggers the
RDMA_CM_EVENT_ROUTE_ERROR flow which puts the state machine back and
allows rdma_resolve_route() to be called again.

Link: https://lore.kernel.org/r/f6662b7b-bdb7-2706-1e12-47c61d3474b6@oracle.com
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-25 12:00:19 -03:00
Håkon Bugge e84045eab6 RDMA/cma: Fix incorrect Packet Lifetime calculation
An approximation for the PacketLifeTime is half the local ACK timeout.
The encoding for both timers are logarithmic.

If the local ACK timeout is set, but zero, it means the timer is
disabled. In this case, we choose the CMA_IBOE_PACKET_LIFETIME value,
since 50% of infinite makes no sense.

Before this commit, the PacketLifeTime became 255 if local ACK
timeout was zero (not running).

Fixed by explicitly testing for timeout being zero.

Fixes: e1ee1e62be ("RDMA/cma: Use ACK timeout for RoCE packetLifeTime")
Link: https://lore.kernel.org/r/1624371207-26710-1-git-send-email-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-25 10:54:34 -03:00
Håkon Bugge ca0c448d2b RDMA/cma: Protect RMW with qp_mutex
The struct rdma_id_private contains three bit-fields, tos_set,
timeout_set, and min_rnr_timer_set. These are set by accessor functions
without any synchronization. If two or all accessor functions are invoked
in close proximity in time, there will be Read-Modify-Write from several
contexts to the same variable, and the result will be intermittent.

Fixed by protecting the bit-fields by the qp_mutex in the accessor
functions.

The consumer of timeout_set and min_rnr_timer_set is in
rdma_init_qp_attr(), which is called with qp_mutex held for connected
QPs. Explicit locking is added for the consumers of tos and tos_set.

This commit depends on ("RDMA/cma: Remove unnecessary INIT->INIT
transition"), since the call to rdma_init_qp_attr() from
cma_init_conn_qp() does not hold the qp_mutex.

Fixes: 2c1619edef ("IB/cma: Define option to set ack timeout and pack tos_set")
Fixes: 3aeffc46af ("IB/cma: Introduce rdma_set_min_rnr_timer()")
Link: https://lore.kernel.org/r/1624369197-24578-3-git-send-email-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-24 15:29:53 -03:00
Håkon Bugge dc70f7c3ed RDMA/cma: Remove unnecessary INIT->INIT transition
In rdma_create_qp(), a connected QP will be transitioned to the INIT
state.

Afterwards, the QP will be transitioned to the RTR state by the
cma_modify_qp_rtr() function. But this function starts by performing an
ib_modify_qp() to the INIT state again, before another ib_modify_qp() is
performed to transition the QP to the RTR state.

Hence, there is no need to transition the QP to the INIT state in
rdma_create_qp().

Link: https://lore.kernel.org/r/1624369197-24578-2-git-send-email-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-24 15:29:53 -03:00
Shay Drory 889d916b6f RDMA/core: Don't access cm_id after its destruction
restrack should only be attached to a cm_id while the ID has a valid
device pointer. It is set up when the device is first loaded, but not
cleared when the device is removed. There are also two copies of the device
pointer, one private and one in the public API, and these were left out of
sync.

Make everything go to NULL together and manipulate restrack right around
the device assignments.

Found by syzcaller:
BUG: KASAN: wild-memory-access in __list_del include/linux/list.h:112 [inline]
BUG: KASAN: wild-memory-access in __list_del_entry include/linux/list.h:135 [inline]
BUG: KASAN: wild-memory-access in list_del include/linux/list.h:146 [inline]
BUG: KASAN: wild-memory-access in cma_cancel_listens drivers/infiniband/core/cma.c:1767 [inline]
BUG: KASAN: wild-memory-access in cma_cancel_operation drivers/infiniband/core/cma.c:1795 [inline]
BUG: KASAN: wild-memory-access in cma_cancel_operation+0x1f4/0x4b0 drivers/infiniband/core/cma.c:1783
Write of size 8 at addr dead000000000108 by task syz-executor716/334

CPU: 0 PID: 334 Comm: syz-executor716 Not tainted 5.11.0+ #271
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:79 [inline]
 dump_stack+0xbe/0xf9 lib/dump_stack.c:120
 __kasan_report mm/kasan/report.c:400 [inline]
 kasan_report.cold+0x5f/0xd5 mm/kasan/report.c:413
 __list_del include/linux/list.h:112 [inline]
 __list_del_entry include/linux/list.h:135 [inline]
 list_del include/linux/list.h:146 [inline]
 cma_cancel_listens drivers/infiniband/core/cma.c:1767 [inline]
 cma_cancel_operation drivers/infiniband/core/cma.c:1795 [inline]
 cma_cancel_operation+0x1f4/0x4b0 drivers/infiniband/core/cma.c:1783
 _destroy_id+0x29/0x460 drivers/infiniband/core/cma.c:1862
 ucma_close_id+0x36/0x50 drivers/infiniband/core/ucma.c:185
 ucma_destroy_private_ctx+0x58d/0x5b0 drivers/infiniband/core/ucma.c:576
 ucma_close+0x91/0xd0 drivers/infiniband/core/ucma.c:1797
 __fput+0x169/0x540 fs/file_table.c:280
 task_work_run+0xb7/0x100 kernel/task_work.c:140
 exit_task_work include/linux/task_work.h:30 [inline]
 do_exit+0x7da/0x17f0 kernel/exit.c:825
 do_group_exit+0x9e/0x190 kernel/exit.c:922
 __do_sys_exit_group kernel/exit.c:933 [inline]
 __se_sys_exit_group kernel/exit.c:931 [inline]
 __x64_sys_exit_group+0x2d/0x30 kernel/exit.c:931
 do_syscall_64+0x2d/0x40 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 255d0c14b3 ("RDMA/cma: rdma_bind_addr() leaks a cma_dev reference count")
Link: https://lore.kernel.org/r/3352ee288fe34f2b44220457a29bfc0548686363.1620711734.git.leonro@nvidia.com
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-05-18 14:05:26 -03:00
Shay Drory cb5cd0ea4e RDMA/core: Add CM to restrack after successful attachment to a device
The device attach triggers addition of the CM_ID to the restrack DB.
However, when an error occurs, we release this device but defer the CM_ID
release. This leads to a situation where restrack sees a CM_ID that is no
longer valid.

As a solution, add the CM_ID to the resource tracking DB only after the
attachment is finished.

Found by syzcaller:
infiniband syz0: added syz_tun
rdma_rxe: ignoring netdev event = 10 for syz_tun
infiniband syz0: set down
infiniband syz0: ib_query_port failed (-19)
restrack: ------------[ cut here    ]------------
infiniband syz0: BUG: RESTRACK detected leak of resources
restrack: User CM_ID object allocated by syz-executor716 is not freed
restrack: ------------[ cut here    ]------------

Fixes: b09c4d7012 ("RDMA/restrack: Improve readability in task name management")
Link: https://lore.kernel.org/r/ab93e56ba831eac65c322b3256796fa1589ec0bb.1618753862.git.leonro@nvidia.com
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-04-21 20:53:14 -03:00
Parav Pandit 4d51c3d9de RDMA/cma: Skip device which doesn't support CM
A switchdev RDMA device does not support IB CM. When such a device is added
to the RDMA CM's device list and an application invokes rdma_listen(), cma
attempts to listen on that device even though its IB CM attribute is
disabled.

Due to this, the rdma_listen() call fails to listen for other non-switchdev
devices as well.

The error message below can be seen:

infiniband mlx5_0: RDMA CMA: cma_listen_on_dev, error -38

The failing call flow is below:

  cma_listen_on_all()
    cma_listen_on_dev()
      _cma_attach_to_dev()
        rdma_listen() <- fails on a specific switchdev device

This is because rdma_listen() is hardwired to only work with iwarp or IB
CM compatible devices.

Hence, when an IB device supports neither IB CM nor IW CM, avoid adding such
a device to the cma list so rdma_listen() can't even be called on it.

Link: https://lore.kernel.org/r/f9cac00d52864ea7c61295e43fb64cf4db4fdae6.1618753862.git.leonro@nvidia.com
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-04-21 20:27:52 -03:00
Håkon Bugge 3aeffc46af IB/cma: Introduce rdma_set_min_rnr_timer()
Introduce the ability for kernel ULPs to adjust the minimum RNR Retry
timer. The INIT -> RTR transition executed by RDMA CM will be used for
this adjustment. This avoids an additional ib_modify_qp() call.

rdma_set_min_rnr_timer() must be called before the call to rdma_connect()
on the active side and before the call to rdma_accept() on the passive
side.

The default value of the RNR Retry timer is zero, which translates to 655
ms. When the receiver is not ready to accept a send message, it encodes
the RNR Retry timer value in the NAK. The requestor will then wait at
least the specified time value before retrying the send.

The 5-bit value to be supplied to rdma_set_min_rnr_timer() is
documented in IBTA Table 45: "Encoding for RNR NAK Timer Field".

Link: https://lore.kernel.org/r/1617216194-12890-2-git-send-email-haakon.bugge@oracle.com
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Acked-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-04-12 19:51:48 -03:00
Wenpeng Liang b6eb7011f5 RDMA/core: Correct format of braces
Do the following cleanups regarding braces:

- Add the necessary braces to maintain context alignment.
- Fix the open '{' that is not on the same line as "switch".
- Remove braces that are not necessary for single statement blocks.
- Fix "else" that doesn't follow close brace '}'.

Link: https://lore.kernel.org/r/1617783353-48249-6-git-send-email-liweihang@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-04-12 14:56:51 -03:00
Mark Bloch 1fb7f8973f RDMA: Support more than 255 rdma ports
Current code uses many different types when dealing with a port of an RDMA
device: u8, unsigned int and u32. Switch to u32 to clean up the logic.

This allows us to make (at least) the core view consistent and use the
same type. Unfortunately not all places can be converted. Many uverbs
functions expect the port to be u8, so keep those places as-is in order not
to break UAPIs. HW/spec-defined values must also not be changed.

With the switch to u32 we now can support devices with more than 255
ports. U32_MAX is reserved to make control logic a bit easier to deal
with. As a device with U32_MAX ports probably isn't going to happen any
time soon, this seems like a non-issue.

When a device with more than 255 ports is created, uverbs will report the
RDMA device as having 255 ports, as this is the max currently supported.

The verbs interface is not changed yet because the IBTA spec limits the
port size to u8 in too many places, and all applications that rely on verbs
won't be able to cope with this change. At this stage, we are only
extending the interfaces that use the vendor channel.

Once the limitation is lifted, mlx5 in switchdev mode will be able to have
thousands of SFs created by the device. As the only instance of an RDMA
device that reports more than 255 ports will be a representor device, and
it exposes itself as a RAW Ethernet only device, CM/MAD/IPoIB and other
ULPs aren't affected by this change, and their sysfs interfaces that are
exposed to userspace can remain unchanged.

While here, clean up some alignment issues and remove unneeded sanity
checks (mainly in rdmavt).

Link: https://lore.kernel.org/r/20210301070420.439400-1-leon@kernel.org
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-03-26 09:31:21 -03:00
Gal Pressman 871159515c RDMA/cma: Remove unused leftovers in cma code
Commit ee1c60b1bf ("IB/SA: Modify SA to implicitly cache Class Port
info") removed the class_port_info_context struct usage; remove a couple
of leftovers.

Link: https://lore.kernel.org/r/20210314143427.76101-1-galpress@amazon.com
Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-03-22 09:31:28 -03:00