Centos-kernel-stream-9

Commit Graph

Author	SHA1	Message	Date
CKI Backport Bot	29b595faa5	RDMA/mlx5: Move events notifier registration to be after device registration JIRA: https://issues.redhat.com/browse/RHEL-72349 CVE: CVE-2024-53224 commit ede132a5cf559f3ab35a4c28bac4f4a6c20334d8 Author: Patrisious Haddad <phaddad@nvidia.com> Date: Wed Nov 13 13:23:19 2024 +0200 RDMA/mlx5: Move events notifier registration to be after device registration Move pkey change work initialization and cleanup from device resources stage to notifier stage, since this is the stage which handles this work events. Fix a race between the device deregistration and pkey change work by moving MLX5_IB_STAGE_DEVICE_NOTIFIER to be after MLX5_IB_STAGE_IB_REG in order to ensure that the notifier is deregistered before the device during cleanup. Which ensures there are no works that are being executed after the device has already unregistered which can cause the panic below. BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 630071 Comm: kworker/1:2 Kdump: loaded Tainted: G W OE --------- --- 5.14.0-162.6.1.el9_1.x86_64 #1 Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 02/27/2023 Workqueue: events pkey_change_handler [mlx5_ib] RIP: 0010:setup_qp+0x38/0x1f0 [mlx5_ib] Code: ee 41 54 45 31 e4 55 89 f5 53 48 89 fb 48 83 ec 20 8b 77 08 65 48 8b 04 25 28 00 00 00 48 89 44 24 18 48 8b 07 48 8d 4c 24 16 <4c> 8b 38 49 8b 87 80 0b 00 00 4c 89 ff 48 8b 80 08 05 00 00 8b 40 RSP: 0018:ffffbcc54068be20 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff954054494128 RCX: ffffbcc54068be36 RDX: ffff954004934000 RSI: 0000000000000001 RDI: ffff954054494128 RBP: 0000000000000023 R08: ffff954001be2c20 R09: 0000000000000001 R10: ffff954001be2c20 R11: ffff9540260133c0 R12: 0000000000000000 R13: 0000000000000023 R14: 0000000000000000 R15: ffff9540ffcb0905 FS: 0000000000000000(0000) GS:ffff9540ffc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000010625c001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: mlx5_ib_gsi_pkey_change+0x20/0x40 [mlx5_ib] process_one_work+0x1e8/0x3c0 worker_thread+0x50/0x3b0 ? rescuer_thread+0x380/0x380 kthread+0x149/0x170 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x22/0x30 Modules linked in: rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_fwctl(OE) fwctl(OE) ib_uverbs(OE) mlx5_core(OE) mlxdevm(OE) ib_core(OE) mlx_compat(OE) psample mlxfw(OE) tls knem(OE) netconsole nfsv3 nfs_acl nfs lockd grace fscache netfs qrtr rfkill sunrpc intel_rapl_msr intel_rapl_common rapl hv_balloon hv_utils i2c_piix4 pcspkr joydev fuse ext4 mbcache jbd2 sr_mod sd_mod cdrom t10_pi sg ata_generic pci_hyperv pci_hyperv_intf hyperv_drm drm_shmem_helper drm_kms_helper hv_storvsc syscopyarea hv_netvsc sysfillrect sysimgblt hid_hyperv fb_sys_fops scsi_transport_fc hyperv_keyboard drm ata_piix crct10dif_pclmul crc32_pclmul crc32c_intel libata ghash_clmulni_intel hv_vmbus serio_raw [last unloaded: ib_core] CR2: 0000000000000000 ---[ end trace f6f8be4eae12f7bc ]--- Fixes: `7722f47e71` ("IB/mlx5: Create GSI transmission QPs when P_Key table is changed") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/d271ceeff0c08431b3cbbbb3e2d416f09b6d1621.1731496944.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: CKI Backport Bot <cki-ci-bot+cki-gitlab-backport-bot@redhat.com>	2025-01-13 23:46:40 +00:00
Benjamin Poirier	5a175ac333	RDMA/mlx5: Use IB set_netdev and get_netdev functions JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit 8d159eb2117b2e3697a31785662b653938f007cb Author: Chiara Meiohas <cmeiohas@nvidia.com> Date: Mon Sep 9 20:30:23 2024 +0300 RDMA/mlx5: Use IB set_netdev and get_netdev functions The IB layer provides a common interface to store and get net devices associated to an IB device port (ib_device_set_netdev() and ib_device_get_netdev()). Previously, mlx5_ib stored and managed the associated net devices internally. Replace internal net device management in mlx5_ib with ib_device_set_netdev() when attaching/detaching a net device and ib_device_get_netdev() when retrieving the net device. Export ib_device_get_netdev(). For mlx5 representors/PFs/VFs and lag creation we replace the netdev assignments with the IB set/get netdev functions. In active-backup mode lag the active slave net device is stored in the lag itself. To assure the net device stored in a lag bond IB device is the active slave we implement the following: - mlx5_core: when modifying the slave of a bond we send the internal driver event MLX5_DRIVER_EVENT_ACTIVE_BACKUP_LAG_CHANGE_LOWERSTATE. - mlx5_ib: when catching the event call ib_device_set_netdev() This patch also ensures the correct IB events are sent in switchdev lag. While at it, when in multiport eswitch mode, only a single IB device is created for all ports. The said IB device will receive all netdev events of its VFs once loaded, thus to avoid overwriting the mapping of PF IB device to PF netdev, ignore NETDEV_REGISTER events if the ib device has already been mapped to a netdev. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/20240909173025.30422-6-michaelgur@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:09 -05:00
Benjamin Poirier	892c5f7ca0	RDMA/mlx5: Add implicit MR handling to ODP memory scheme JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit 6f2487bfafce5e6cd6f89e7238a82012f7b9f5ac Author: Michael Guralnik <michaelgur@nvidia.com> Date: Mon Sep 9 13:05:03 2024 +0300 RDMA/mlx5: Add implicit MR handling to ODP memory scheme Implicit MRs in ODP memory scheme require allocating a private null mkey and assigning the mkey and va differently in the KSM mkey. The page faults are received on the null mkey so we also add storing the null mkey in the odp_mkey xarray. Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/20240909100504.29797-8-michaelgur@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:09 -05:00
Benjamin Poirier	c8b0960396	net/mlx5: Expand mkey page size to support 6 bits JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit cef7dde8836ab09a3bfe96ada4f18ef2496eacc9 Author: Michael Guralnik <michaelgur@nvidia.com> Date: Mon Sep 9 13:04:57 2024 +0300 net/mlx5: Expand mkey page size to support 6 bits Protect the usage of the 6th bit with the relevant capability to ensure we are using the new page sizes with FW that supports the bit extension. Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/20240909100504.29797-2-michaelgur@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:08 -05:00
Benjamin Poirier	35555eab92	RDMA/mlx5: Fix MR cache temp entries cleanup JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit 7ebb00cea49db641b458edef0ede389f7004821d Author: Michael Guralnik <michaelgur@nvidia.com> Date: Tue Sep 3 14:24:50 2024 +0300 RDMA/mlx5: Fix MR cache temp entries cleanup Fix the cleanup of the temp cache entries that are dynamically created in the MR cache. The cleanup of the temp cache entries is currently scheduled only when a new entry is created. Since in the cleanup of the entries only the mkeys are destroyed and the cache entry stays in the cache, subsequent registrations might reuse the entry and it will eventually be filled with new mkeys without cleanup ever getting scheduled again. On workloads that register and deregister MRs with a wide range of properties we see the cache ends up holding many cache entries, each holding the max number of mkeys that were ever used through it. Additionally, as the cleanup work is scheduled to run over the whole cache, any mkey that is returned to the cache after the cleanup was scheduled will be held for less than the intended 30 seconds timeout. Solve both issues by dropping the existing remove_ent_work and reusing the existing per-entry work to also handle the temp entries cleanup. Schedule the work to run with a 30 seconds delay every time we push an mkey to a clean temp entry. This ensures the cleanup runs on each entry only 30 seconds after the first mkey was pushed to an empty entry. As we have already been distinguishing between persistent and temp entries when scheduling the cache_work_func, it is not being scheduled in any other flows for the temp entries. Another benefit from moving to a per-entry cleanup is we now not required to hold the rb_tree mutex, thus enabling other flow to run concurrently. Fixes: dd1b913fb0d0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/e4fa4bb03bebf20dceae320f26816cd2dde23a26.1725362530.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:08 -05:00
Benjamin Poirier	e74ca62c4c	RDMA/mlx5: Remove two unused declarations JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit 53ffc09a3e6d39d7a9b3758be4a8795fb57a7989 Author: Yue Haibing <yuehaibing@huawei.com> Date: Fri Aug 16 18:13:58 2024 +0800 RDMA/mlx5: Remove two unused declarations Commit `e6fb246cca` ("RDMA/mlx5: Consolidate MR destruction to mlx5_ib_dereg_mr()") removed mlx5_ib_free_implicit_mr() but left the declaration. Commit d98995b4bf98 ("net/mlx5: Reimplement write combining test") left mlx5_ib_test_wc(). Remove the unused declarations. Link: https://patch.msgid.link/r/20240816101358.881247-1-yuehaibing@huawei.com Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:08 -05:00
Benjamin Poirier	35e261e6d5	RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:08 -05:00
Benjamin Poirier	0726e29ee2	RDMA/mlx5: Add the initialization flow to utilize the 'data direct' device JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit 2e8e631d7a41e3a4edc94f3c9dd5cb32c2aa539e Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:12 2024 +0300 RDMA/mlx5: Add the initialization flow to utilize the 'data direct' device Add the NET device initialization flow to utilize the 'data direct' device. When a NET mlx5_ib device is capable of 'data direct', the following sequence of actions will occur: - Find its affiliated 'data direct' VUID via a firmware command. - Create its own private PD and 'data direct' mkey. - Register to be notified when its 'data direct' driver is probed or removed. The DMA device of the affiliated 'data direct' device, including the private PD and the 'data direct' mkey, will be used later during MR registrations that request the data direct functionality. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/b11fa87b2a65bce4db8d40341bb6cee490fa4d06.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:08 -05:00
Benjamin Poirier	d93fd28f29	RDMA/mlx5: Introduce the 'data direct' driver JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit 6910e3660d86c1a5654f742a40181d2c9154f26f Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:11 2024 +0300 RDMA/mlx5: Introduce the 'data direct' driver Introduce the 'data direct' driver for a ConnectX-8 Data Direct device. The 'data direct' driver functions as the affiliated DMA device for one or more capable mlx5_ib devices. This DMA device, as the name suggests, is used exclusively for DMA operations. It can be considered a DMA engine managed by a PF/VF, lacking network capabilities and having minimal overall capabilities. Consequently, the DMA NIC PF will not be exposed to or directly used by software applications. The driver will not have any direct interface or interaction with the firmware (no command interface, no capabilities, etc.). It will operate solely over PCI to enable its DMA functionality. Registration and un-registration of the driver are handled as part of the mlx5_ib initialization and exit processes, as the mlx5_ib devices will effectively be its clients. The driver will serve as the DMA device for accessing another PCI device to achieve optimal performance (both on the same NUMA node, P2P access, etc.). Upon probing, it will read its VUID over PCI to handle mlx5_ib device registrations with the same VUID. Upon removal, it will notify its clients to allow them to clean up the resources that were mmaped with its DMA device. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/b77edecfd476c3f445da96ab6aef499ae47b2829.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:08 -05:00
Benjamin Poirier	7ed9c90b8c	RDMA/mlx5: Support plane device and driver APIs to add and delete it JIRA: https://issues.redhat.com/browse/RHEL-52869 JIRA: https://issues.redhat.com/browse/RHEL-52874 Upstream-status: v6.11-rc1 commit 026a425990af6969a15b57d6d7fa0138a7e21958 Author: Mark Zhang <markzhang@nvidia.com> Date: Sun Jun 16 19:08:39 2024 +0300 RDMA/mlx5: Support plane device and driver APIs to add and delete it This patch supports driver APIs "add_sub_dev" and "del_sub_dev", to add and delete a plane device respectively. A mlx5 plane device is a rdma SMI device; It provides the SMI capability through user MAD for it's parent, the logical multi-plane aggregated device. For a plane port: - It supports QP0 only; - When adding a plane device, all plane ports are added; - For some commands like mad_ifc, both plane_index and native portnum is needed; - When querying or modifying a plane port context, the native portnum must be used, as the query/modify_hca_vport_context command doesn't support plane port. Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/e933cd0562aece181f8657af2ca0f5b387d0f14e.1718553901.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:03 -05:00
Benjamin Poirier	73d6c04f0f	RDMA/mlx5: Add support to multi-plane device and port JIRA: https://issues.redhat.com/browse/RHEL-52869 JIRA: https://issues.redhat.com/browse/RHEL-52874 Upstream-status: v6.11-rc1 commit 2a5db20fa532198639671713c6213f96ff285b85 Author: Mark Zhang <markzhang@nvidia.com> Date: Sun Jun 16 19:08:35 2024 +0300 RDMA/mlx5: Add support to multi-plane device and port When multi-plane is supported, a logical port, which is aggregation of multiple physical plane ports, is exposed for data transmission. Compared with a normal mlx5 IB port, this logical port supports all functionalities except Subnet Management. Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://lore.kernel.org/r/7e37c06c9cb243be9ac79930cd17053903785b95.1718553901.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:03 -05:00
Benjamin Poirier	5b25a87734	RDMA/mlx5: Send UAR page index as ioctl attribute JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.11-rc1 commit 589b844f1bf04850d9fabcaa2e943325dc6768b4 Author: Akiva Goldberger <agoldberger@nvidia.com> Date: Thu Jun 27 21:23:50 2024 +0300 RDMA/mlx5: Send UAR page index as ioctl attribute Add UAR page index as a driver ioctl attribute to increase the number of supported indices, previously limited to 16 bits by mlx5_ib_create_cq struct. Link: https://lore.kernel.org/r/0e18b34d7ec3b1ae02d694b0d545aed7413c0ef7.1719512393.git.leon@kernel.org Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:03 -05:00
Benjamin Poirier	04a3bdaf8f	RDMA/mlx5: Set mkeys for dmabuf at PAGE_SIZE JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.11-rc1 commit a4e540119be565f47c305f295ed43f8e0bc3f5c3 Author: Chiara Meiohas <cmeiohas@nvidia.com> Date: Thu Jun 13 21:01:42 2024 +0300 RDMA/mlx5: Set mkeys for dmabuf at PAGE_SIZE Set the mkey for dmabuf at PAGE_SIZE to support any SGL after a move operation. ib_umem_find_best_pgsz returns 0 on error, so it is incorrect to check the returned page_size against PAGE_SIZE Fixes: `90da7dc820` ("RDMA/mlx5: Support dma-buf based userspace memory region") Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://lore.kernel.org/r/1e2289b9133e89f273a4e68d459057d032cbc2ce.1718301631.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:03 -05:00
Benjamin Poirier	4afc011f2b	IB/mlx5: Allocate resources just before first QP/SRQ is created JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.11-rc1 commit 5895e70f2e6e8dc67b551ca554d6fcde0a7f0467 Author: Jianbo Liu <jianbol@nvidia.com> Date: Mon Jun 3 13:26:39 2024 +0300 IB/mlx5: Allocate resources just before first QP/SRQ is created Previously, all IB dev resources are initialized on driver load. As they are not always used, move the initialization to the time when they are needed. To be more specific, move PD (p0) and CQ (c0) initialization to the time when the first SRQ is created. and move SRQs（s0 and s1) initialization to the time first QP is created. To avoid concurrent creations, two new mutexes are also added. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Link: https://lore.kernel.org/r/98c3e53a8cc0bdfeb6dec6e5bb8b037d78ab00d8.1717409369.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:03 -05:00
Benjamin Poirier	20c8b9560d	IB/mlx5: Create UMR QP just before first reg_mr occurs JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.11-rc1 commit 638420115cc4ad6c3a2683bf46a052b505abb202 Author: Jianbo Liu <jianbol@nvidia.com> Date: Mon Jun 3 13:26:38 2024 +0300 IB/mlx5: Create UMR QP just before first reg_mr occurs UMR QP is not used in some cases, so move QP and its CQ creations from driver load flow to the time first reg_mr occurs, that is when MR interfaces are first called. The initialization of dev->umrc.pd and dev->umrc.lock is still done in driver load because pd is needed for mlx5_mkey_cache_init and the lock is reused to protect against the concurrent creation. When testing 4G bytes memory registration latency with rtool [1] and 8 threads in parallel, there is minor performance degradation (<5% for the max latency) is seen for the first reg_mr with this change. Link: https://github.com/paravmellanox/rtool [1] Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Link: https://lore.kernel.org/r/55d3c4f8a542fd974d8a4c5816eccfb318a59b38.1717409369.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:03 -05:00
Benjamin Poirier	0daec69495	net/mlx5: Reimplement write combining test JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.11-rc1 commit d98995b4bf981519dde4af0a081c393d62474039 Author: Jianbo Liu <jianbol@nvidia.com> Date: Mon Jun 3 13:26:37 2024 +0300 net/mlx5: Reimplement write combining test The test of write combining was added before in mlx5_ib driver. It opens UD QP and posts NOP WQEs, and uses BlueFlame doorbell. When BlueFlame is used, WQEs get written directly to a PCI BAR of the device (in addition to memory) so that the device handles them without having to access memory. In this test, the WQEs written in memory are different from the ones written to the BlueFlame which request CQE update. By checking the completion reports posted on CQ, we can know if BlueFlame succeeds or not. The write combining must be supported if BlueFlame succeeds as its register is written using write combining. This patch reimplements the test in the same way, but using a pair of SQ and CQ only. It is moved to mlx5_core as a general feature used by both mlx5_core and mlx5_ib. Besides, save write combine test result of the PCI function, so that its thousands of child functions such as SF can query without paying the time and resource penalty by itself. The test function is called only after failing to get the cached result. With this enhancement, all thousands of SFs of the PF attached to same driver no longer need to perform WC check explicitly, which is already done in the system. This saves several commands per SF, thereby speeds up SF creation and also saves completion EQ creation. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/4ff5a8cc4c5b5b0d98397baa45a5019bcdbf096e.1717409369.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:32:03 -05:00
Benjamin Poirier	8b9ed7593d	RDMA/mlx5: Delete unused mlx5_ib_copy_pas prototype JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.9-rc1 commit a400073ce3dd3dbdf843e6c9c0a0a7f6ca9f05d7 Author: Alexey Dobriyan <adobriyan@gmail.com> Date: Tue Jan 23 13:35:38 2024 +0300 RDMA/mlx5: Delete unused mlx5_ib_copy_pas prototype mlx5_ib_copy_pas() doesn't exist anymore. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Link: https://lore.kernel.org/r/a2cb861e-d11e-4567-8a73-73763d1dc199@p183 Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-12-05 10:31:58 -05:00
Kamal Heib	bd8a5c9fe8	RDMA: Pass uverbs_attr_bundle as part of '.reg_user_mr_dmabuf' API JIRA: https://issues.redhat.com/browse/RHEL-56245 commit 3aa73c6b795b9aaaf933f3c95495d85fc0de39e3 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:15 2024 +0300 RDMA: Pass uverbs_attr_bundle as part of '.reg_user_mr_dmabuf' API Pass uverbs_attr_bundle as part of '.reg_user_mr_dmabuf' API instead of udata. This enables passing some new ioctl attributes to the drivers, as will be introduced in the next patches for mlx5 driver. Change the involved drivers accordingly. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/9a25b2fc02443f7c36c2d93499ae25252b6afd40.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Kamal Heib <kheib@redhat.com>	2024-10-27 19:32:22 -04:00
Kamal Heib	60620ca443	RDMA: Pass entire uverbs attr bundle to create cq function JIRA: https://issues.redhat.com/browse/RHEL-56247 Conflicts: Drop hunks for none existing drivers. commit dd6d7f8574d7f8b6a0bf1aeef0b285d2706b8c2a Author: Akiva Goldberger <agoldberger@nvidia.com> Date: Thu Jun 27 21:23:49 2024 +0300 RDMA: Pass entire uverbs attr bundle to create cq function Changes the create_cq verb signature by sending the entire uverbs attr bundle as a parameter. This allows drivers to send driver specific attrs through ioctl for the create_cq verb and access them in their driver specific code. Also adds a new enum value for driver specific ioctl attributes for methods already supporting UHW. Link: https://lore.kernel.org/r/ed147343987c0d43fd391c1b2f85e2f425747387.1719512393.git.leon@kernel.org Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Kamal Heib <kheib@redhat.com>	2024-10-07 11:55:54 -04:00
Benjamin Poirier	93e14b10b1	RDMA/mlx5: Change check for cacheable mkeys JIRA: https://issues.redhat.com/browse/RHEL-45365 Upstream-status: v6.10-rc1 commit 8c1185fef68cc603b954fece2a434c9f851d6a86 Author: Or Har-Toov <ohartoov@nvidia.com> Date: Wed Apr 3 13:36:00 2024 +0300 RDMA/mlx5: Change check for cacheable mkeys umem can be NULL for user application mkeys in some cases. Therefore umem can't be used for checking if the mkey is cacheable and it is changed for checking a flag that indicates it. Also make sure that all mkeys which are not returned to the cache will be destroyed. Fixes: dd1b913fb0d0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow") Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Link: https://lore.kernel.org/r/2690bc5c6896bcb937f89af16a1ff0343a7ab3d0.1712140377.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-08-07 09:20:55 -04:00
Benjamin Poirier	ef5b55a3ba	RDMA/mlx5: Uncacheable mkey has neither rb_key or cache_ent JIRA: https://issues.redhat.com/browse/RHEL-45365 Upstream-status: v6.10-rc1 commit 0611a8e8b475fc5230b9a24d29c8397aaab20b63 Author: Or Har-Toov <ohartoov@nvidia.com> Date: Wed Apr 3 13:35:59 2024 +0300 RDMA/mlx5: Uncacheable mkey has neither rb_key or cache_ent As some mkeys can't be modified with UMR due to some UMR limitations, like the size of translation that can be updated, not all user mkeys can be cached. Fixes: dd1b913fb0d0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow") Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Link: https://lore.kernel.org/r/f2742dd934ed73b2d32c66afb8e91b823063880c.1712140377.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-08-07 09:20:55 -04:00
Benjamin Poirier	9edb9be3ae	RDMA/mlx5: Implement mkeys management via LIFO queue JIRA: https://issues.redhat.com/browse/RHEL-24466 Upstream-status: v6.7-rc1 Conflicts: - drivers/infiniband/hw/mlx5/mr.c Due to commit `78e2a0dd38` (tags/kernel-5.14.0-417.el9) which is an OOO backport of a53e215f9007 RDMA/mlx5: Fix mkey cache WQ flush (v6.7-rc1) -> Adjust context commit 57e7071683ef6148c9f5ea0ba84598d2ba681375 Author: Shay Drory <shayd@nvidia.com> Date: Thu Sep 21 11:07:16 2023 +0300 RDMA/mlx5: Implement mkeys management via LIFO queue Currently, mkeys are managed via xarray. This implementation leads to a degradation in cases many MRs are unregistered in parallel, due to xarray internal implementation, for example: deregistration 1M MRs via 64 threads is taking ~15% more time[1]. Hence, implement mkeys management via LIFO queue, which solved the degradation. [1] 2.8us in kernel v5.19 compare to 3.2us in kernel v6.4 Signed-off-by: Shay Drory <shayd@nvidia.com> Link: https://lore.kernel.org/r/fde3d4cfab0f32f0ccb231cd113298256e1502c5.1695283384.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>	2024-07-22 15:33:43 -04:00
Amir Tzin	37765490f3	RDMA/mlx5: Handles RoCE MACsec steering rules addition and deletion JIRA: https://issues.redhat.com/browse/RHEL-22227 Upstream-status: v6.6-rc1 commit 58dbd6428a6819e55a3c52ec60126b5d00804a38 Author: Patrisious Haddad <phaddad@nvidia.com> Date: Thu Apr 13 12:04:59 2023 +0300 RDMA/mlx5: Handles RoCE MACsec steering rules addition and deletion Add RoCE MACsec rules when a gid is added for the MACsec netdevice and handle their cleanup when the gid is removed or the MACsec SA is deleted. Also support alias IP for the MACsec device, as long as we don't have more ips than what the gid table can hold. In addition handle the case where a gid is added but there are still no SAs added for the MACsec device, so the rules are added later on when the SAs are added. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Amir Tzin <atzin@redhat.com>	2024-04-21 13:52:29 +00:00
Amir Tzin	33bd3be29e	RDMA/mlx5: Implement MACsec gid addition and deletion JIRA: https://issues.redhat.com/browse/RHEL-22227 Upstream-status: v6.6-rc1 commit 758ce14aee825f8f3ca8f76c9991c108094cae8b Author: Patrisious Haddad <phaddad@nvidia.com> Date: Tue May 3 08:37:48 2022 +0300 RDMA/mlx5: Implement MACsec gid addition and deletion Handle MACsec IP ambiguity issue, since mlx5 hw can't support programming both the MACsec and the physical gid when they have the same IP address, because it wouldn't know to whom to steer the traffic. Hence in such case we delete the physical gid from the hw gid table, which would then cause all traffic sent over it to fail, and we'll only be able to send traffic over the MACsec gid. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Mark Zhang <markzhang@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Amir Tzin <atzin@redhat.com>	2024-04-21 13:52:28 +00:00
Amir Tzin	9fd4a7a96f	RDMA/mlx5: Reduce QP table exposure JIRA: https://issues.redhat.com/browse/RHEL-22227 Upstream-status: v6.5-rc1 commit 2ecfd946169e7f56534db2a5f6935858be3005ba Author: Leon Romanovsky <leon@kernel.org> Date: Mon Jun 5 13:14:05 2023 +0300 RDMA/mlx5: Reduce QP table exposure driver.h is common header to whole mlx5 code base, but struct mlx5_qp_table is used in mlx5_ib driver only. So move that struct to be under sole responsibility of mlx5_ib. Link: https://lore.kernel.org/r/bec0dc1158e795813b135d1143147977f26bf668.1685953497.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Amir Tzin <atzin@redhat.com>	2024-04-21 13:52:25 +00:00
Mohammad Kabat	60f8094ff7	RDMA/mlx5: Remove not-used cache disable flag JIRA: https://issues.redhat.com/browse/RHEL-882 Upstream-status: v6.6-rc5 commit c99a7457e5bb873914a74307ba2df85f6799203b Author: Leon Romanovsky <leon@kernel.org> Date: Thu Sep 28 20:20:47 2023 +0300 RDMA/mlx5: Remove not-used cache disable flag During execution of mlx5_mkey_cache_cleanup(), there is a guarantee that MR are not registered and/or destroyed. It means that we don't need newly introduced cache disable flag. Fixes: 374012b00457 ("RDMA/mlx5: Fix mkey cache possible deadlock on cleanup") Link: https://lore.kernel.org/r/c7e9c9f98c8ae4a7413d97d9349b29f5b0a23dbe.1695921626.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2024-01-16 09:00:54 +00:00
Mohammad Kabat	bd35524428	RDMA/mlx5: Fix mkey cache possible deadlock on cleanup JIRA: https://issues.redhat.com/browse/RHEL-882 Upstream-status: v6.6-rc5 commit 374012b0045780b7ad498be62e85153009bb7fe9 Author: Shay Drory <shayd@nvidia.com> Date: Tue Sep 12 13:07:45 2023 +0300 RDMA/mlx5: Fix mkey cache possible deadlock on cleanup Fix the deadlock by refactoring the MR cache cleanup flow to flush the workqueue without holding the rb_lock. This adds a race between cache cleanup and creation of new entries which we solve by denied creation of new entries after cache cleanup started. Lockdep: WARNING: possible circular locking dependency detected [ 2785.326074 ] 6.2.0-rc6_for_upstream_debug_2023_01_31_14_02 #1 Not tainted [ 2785.339778 ] ------------------------------------------------------ [ 2785.340848 ] devlink/53872 is trying to acquire lock: [ 2785.341701 ] ffff888124f8c0c8 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0xc8/0x900 [ 2785.343403 ] [ 2785.343403 ] but task is already holding lock: [ 2785.344464 ] ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib] [ 2785.346273 ] [ 2785.346273 ] which lock already depends on the new lock. [ 2785.346273 ] [ 2785.347720 ] [ 2785.347720 ] the existing dependency chain (in reverse order) is: [ 2785.349003 ] [ 2785.349003 ] -> #1 (&dev->cache.rb_lock){+.+.}-{3:3}: [ 2785.350160 ] __mutex_lock+0x14c/0x15c0 [ 2785.350962 ] delayed_cache_work_func+0x2d1/0x610 [mlx5_ib] [ 2785.352044 ] process_one_work+0x7c2/0x1310 [ 2785.352879 ] worker_thread+0x59d/0xec0 [ 2785.353636 ] kthread+0x28f/0x330 [ 2785.354370 ] ret_from_fork+0x1f/0x30 [ 2785.355135 ] [ 2785.355135 ] -> #0 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}: [ 2785.356515 ] __lock_acquire+0x2d8a/0x5fe0 [ 2785.357349 ] lock_acquire+0x1c1/0x540 [ 2785.358121 ] __flush_work+0xe8/0x900 [ 2785.358852 ] __cancel_work_timer+0x2c7/0x3f0 [ 2785.359711 ] mlx5_mkey_cache_cleanup+0xfb/0x250 [mlx5_ib] [ 2785.360781 ] mlx5_ib_stage_pre_ib_reg_umr_cleanup+0x16/0x30 [mlx5_ib] [ 2785.361969 ] __mlx5_ib_remove+0x68/0x120 [mlx5_ib] [ 2785.362960 ] mlx5r_remove+0x63/0x80 [mlx5_ib] [ 2785.363870 ] auxiliary_bus_remove+0x52/0x70 [ 2785.364715 ] device_release_driver_internal+0x3c1/0x600 [ 2785.365695 ] bus_remove_device+0x2a5/0x560 [ 2785.366525 ] device_del+0x492/0xb80 [ 2785.367276 ] mlx5_detach_device+0x1a9/0x360 [mlx5_core] [ 2785.368615 ] mlx5_unload_one_devl_locked+0x5a/0x110 [mlx5_core] [ 2785.369934 ] mlx5_devlink_reload_down+0x292/0x580 [mlx5_core] [ 2785.371292 ] devlink_reload+0x439/0x590 [ 2785.372075 ] devlink_nl_cmd_reload+0xaef/0xff0 [ 2785.372973 ] genl_family_rcv_msg_doit.isra.0+0x1bd/0x290 [ 2785.374011 ] genl_rcv_msg+0x3ca/0x6c0 [ 2785.374798 ] netlink_rcv_skb+0x12c/0x360 [ 2785.375612 ] genl_rcv+0x24/0x40 [ 2785.376295 ] netlink_unicast+0x438/0x710 [ 2785.377121 ] netlink_sendmsg+0x7a1/0xca0 [ 2785.377926 ] sock_sendmsg+0xc5/0x190 [ 2785.378668 ] __sys_sendto+0x1bc/0x290 [ 2785.379440 ] __x64_sys_sendto+0xdc/0x1b0 [ 2785.380255 ] do_syscall_64+0x3d/0x90 [ 2785.381031 ] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 2785.381967 ] [ 2785.381967 ] other info that might help us debug this: [ 2785.381967 ] [ 2785.383448 ] Possible unsafe locking scenario: [ 2785.383448 ] [ 2785.384544 ] CPU0 CPU1 [ 2785.385383 ] ---- ---- [ 2785.386193 ] lock(&dev->cache.rb_lock); [ 2785.386940 ] lock((work_completion)(&(&ent->dwork)->work)); [ 2785.388327 ] lock(&dev->cache.rb_lock); [ 2785.389425 ] lock((work_completion)(&(&ent->dwork)->work)); [ 2785.390414 ] [ 2785.390414 ] * DEADLOCK * [ 2785.390414 ] [ 2785.391579 ] 6 locks held by devlink/53872: [ 2785.392341 ] #0: ffffffff84c17a50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40 [ 2785.393630 ] #1: ffff888142280218 (&devlink->lock_key){+.+.}-{3:3}, at: devlink_get_from_attrs_lock+0x12d/0x2d0 [ 2785.395324 ] #2: ffff8881422d3c38 (&dev->lock_key){+.+.}-{3:3}, at: mlx5_unload_one_devl_locked+0x4a/0x110 [mlx5_core] [ 2785.397322 ] #3: ffffffffa0e59068 (mlx5_intf_mutex){+.+.}-{3:3}, at: mlx5_detach_device+0x60/0x360 [mlx5_core] [ 2785.399231 ] #4: ffff88810e3cb0e8 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x8d/0x600 [ 2785.400864 ] #5: ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib] Fixes: b95845178328 ("RDMA/mlx5: Change the cache structure to an RB-tree") Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2024-01-16 09:00:50 +00:00
Mohammad Kabat	9f9d3cc92b	RDMA/mlx5: Fix affinity assignment JIRA: https://issues.redhat.com/browse/RHEL-882 Upstream-status: v6.4-rc7 commit 617f5db1a626f18d5cbb7c7faf7bf8f9ea12be78 Author: Mark Bloch <mbloch@nvidia.com> Date: Mon Jun 5 13:33:26 2023 +0300 RDMA/mlx5: Fix affinity assignment The cited commit aimed to ensure that Virtual Functions (VFs) assign a queue affinity to a Queue Pair (QP) to distribute traffic when the LAG master creates a hardware LAG. If the affinity was set while the hardware was not in LAG, the firmware would ignore the affinity value. However, this commit unintentionally assigned an affinity to QPs on the LAG master's VPORT even if the RDMA device was not marked as LAG-enabled. In most cases, this was not an issue because when the hardware entered hardware LAG configuration, the RDMA device of the LAG master would be destroyed and a new one would be created, marked as LAG-enabled. The problem arises when a user configures Equal-Cost Multipath (ECMP). In ECMP mode, traffic can be directed to different physical ports based on the queue affinity, which is intended for use by VPORTS other than the E-Switch manager. ECMP mode is supported only if both E-Switch managers are in switchdev mode and the appropriate route is configured via IP. In this configuration, the RDMA device is not destroyed, and we retain the RDMA device that is not marked as LAG-enabled. To ensure correct behavior, Send Queues (SQs) opened by the E-Switch manager through verbs should be assigned strict affinity. This means they will only be able to communicate through the native physical port associated with the E-Switch manager. This will prevent the firmware from assigning affinity and will not allow the SQs to be remapped in case of failover. Fixes: `802dcc7fc5` ("RDMA/mlx5: Support TX port affinity for VF drivers in LAG mode") Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://lore.kernel.org/r/425b05f4da840bc684b0f7e8ebf61aeb5cef09b0.1685960567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2024-01-16 08:58:21 +00:00
Mohammad Kabat	43193c4444	RDMA/mlx5: Create an indirect flow table for steering anchor JIRA: https://issues.redhat.com/browse/RHEL-882 Upstream-status: v6.4-rc7 commit e1f4a52ac171dd863fe89055e749ef5e0a0bc5ce Author: Mark Bloch <mbloch@nvidia.com> Date: Mon Jun 5 13:33:18 2023 +0300 RDMA/mlx5: Create an indirect flow table for steering anchor A misbehaved user can create a steering anchor that points to a kernel flow table and then destroy the anchor without freeing the associated STC. This creates a problem as the kernel can't destroy the flow table since there is still a reference to it. As a result, this can exhaust all available flow table resources, preventing other users from using the RDMA device. To prevent this problem, a solution is implemented where a special flow table with two steering rules is created when a user creates a steering anchor for the first time. The rules include one that drops all traffic and another that points to the kernel flow table. If the steering anchor is destroyed, only the rule pointing to the kernel's flow table is removed. Any traffic reaching the special flow table after that is dropped. Since the special flow table is not destroyed when the steering anchor is destroyed, any issues are prevented from occurring. The remaining resources are only destroyed when the RDMA device is destroyed, which happens after all DEVX objects are freed, including the STCs, thus mitigating the issue. Fixes: 0c6ab0ca9a66 ("RDMA/mlx5: Expose steering anchor to userspace") Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Link: https://lore.kernel.org/r/b4a88a871d651fa4e8f98d552553c1cfe9ba2cd6.1685960567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2024-01-16 08:58:21 +00:00
Mohammad Kabat	ba789aaf26	IB/mlx5: Extend debug control for CC parameters Bugzilla: https://bugzilla.redhat.com/2165364 Upstream-status: v6.3-rc1 commit 66fb1d5df6ace316a4a6e2c31e13fc123ea2b644 Author: Edward Srouji <edwards@nvidia.com> Date: Thu Feb 16 11:13:45 2023 +0200 IB/mlx5: Extend debug control for CC parameters This patch adds rtt_resp_dscp to the current debug controllability of congestion control (CC) parameters. rtt_resp_dscp can be read or written through debugfs. If set, its value overwrites the DSCP of the generated RTT response. Signed-off-by: Edward Srouji <edwards@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Link: https://lore.kernel.org/r/1dcc3440ee53c688f19f579a051ded81a2aaa70a.1676538714.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-07-25 07:41:08 +00:00
Mohammad Kabat	bd6f48e0a0	RDMA/mlx5: Use query_special_contexts for mkeys Bugzilla: https://bugzilla.redhat.com/2165364 Upstream-status: v6.3-rc1 commit 594cac11ab6a1be8022a3c96d181dde7cfb0b8cf Author: Or Har-Toov <ohartoov@nvidia.com> Date: Tue Jan 17 15:14:52 2023 +0200 RDMA/mlx5: Use query_special_contexts for mkeys Use query_sepcial_contexts to get the correct value of mkeys such as null_mkey, terminate_scatter_list_mkey and dump_fill_mkey, as FW will change them in certain configurations. Link: https://lore.kernel.org/r/000236f0a9487d48809f87bcc3620a3964b2d3d3.1673960981.git.leon@kernel.org Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-07-25 07:41:07 +00:00
Mohammad Kabat	dc3bc8db65	RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister Bugzilla: https://bugzilla.redhat.com/2165364 Upstream-status: v6.3-rc1 commit dca55da0a15717dde509d17163946e951bad56c4 Author: Jiri Pirko <jiri@nvidia.com> Date: Tue Nov 1 15:36:01 2022 +0100 RDMA/mlx5: Track netdev to avoid deadlock during netdev notifier unregister When removing a network namespace with mlx5 devlink instance being in it, following callchain is performed: cleanup_net (takes down_read(&pernet_ops_rwsem) devlink_pernet_pre_exit() devlink_reload() mlx5_devlink_reload_down() mlx5_unload_one_devl_locked() mlx5_detach_device() del_adev() mlx5r_remove() __mlx5_ib_remove() mlx5_ib_roce_cleanup() mlx5_remove_netdev_notifier() unregister_netdevice_notifier (takes down_write(&pernet_ops_rwsem) This deadlocks. Resolve this by converting to register_netdevice_notifier_dev_net() which does not take pernet_ops_rwsem and moves the notifier block around according to netdev it takes as arg. Use previously introduced netdev added/removed events to track uplink netdev to be used for register_netdevice_notifier_dev_net() purposes. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-07-25 07:41:04 +00:00
Mohammad Kabat	f578ac6b23	RDMA/mlx5: Remove impossible check of mkey cache cleanup failure Bugzilla: https://bugzilla.redhat.com/2165364 Upstream-status: v6.3-rc1 commit 85f9e38a5ac7d397f9bb5e57901b2d6af4dcc3b9 Author: Leon Romanovsky <leon@kernel.org> Date: Thu Feb 2 11:03:07 2023 +0200 RDMA/mlx5: Remove impossible check of mkey cache cleanup failure mlx5_mkey_cache_cleanup() can't fail and can be changed to be void. Link: https://lore.kernel.org/r/1acd9528995d083114e7dec2a2afc59436406583.1675328463.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-07-25 07:41:01 +00:00
Mohammad Kabat	465d1920e3	RDMA/mlx5: Add work to remove temporary entries from the cache Bugzilla: https://bugzilla.redhat.com/2165364 Upstream-status: v6.3-rc1 commit 627122280c878cf5d3cda2d2c5a0a8f6a7e35cb7 Author: Michael Guralnik <michaelgur@nvidia.com> Date: Thu Jan 26 00:28:07 2023 +0200 RDMA/mlx5: Add work to remove temporary entries from the cache The non-cache mkeys are stored in the cache only to shorten restarting application time. Don't store them longer than needed. Configure cache entries that store non-cache MRs as temporary entries. If 30 seconds have passed and no user reclaimed the temporarily cached mkeys, an asynchronous work will destroy the mkeys entries. Link: https://lore.kernel.org/r/20230125222807.6921-7-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-07-25 07:40:58 +00:00
Mohammad Kabat	31ecbaf6d5	RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow Bugzilla: https://bugzilla.redhat.com/2165364 Upstream-status: v6.3-rc1 commit dd1b913fb0d0e3e6d55e92d2319d954474dd66ac Author: Michael Guralnik <michaelgur@nvidia.com> Date: Thu Jan 26 00:28:06 2023 +0200 RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow Currently, when dereging an MR, if the mkey doesn't belong to a cache entry, it will be destroyed. As a result, the restart of applications with many non-cached mkeys is not efficient since all the mkeys are destroyed and then recreated. This process takes a long time (for 100,000 MRs, it is ~20 seconds for dereg and ~28 seconds for re-reg). To shorten the restart runtime, insert all cacheable mkeys to the cache. If there is no fitting entry to the mkey properties, create a temporary entry that fits it. After a predetermined timeout, the cache entries will shrink to the initial high limit. The mkeys will still be in the cache when consuming them again after an application restart. Therefore, the registration will be much faster (for 100,000 MRs, it is ~4 seconds for dereg and ~5 seconds for re-reg). The temporary cache entries created to store the non-cache mkeys are not exposed through sysfs like the default cache entries. Link: https://lore.kernel.org/r/20230125222807.6921-6-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-07-25 07:40:58 +00:00
Mohammad Kabat	4b769bdd9c	RDMA/mlx5: Introduce mlx5r_cache_rb_key Bugzilla: https://bugzilla.redhat.com/2165364 Upstream-status: v6.3-rc1 commit 73d09b2fe8336f5f37935e46418666ddbcd3c343 Author: Michael Guralnik <michaelgur@nvidia.com> Date: Thu Jan 26 00:28:05 2023 +0200 RDMA/mlx5: Introduce mlx5r_cache_rb_key Switch from using the mkey order to using the new struct as the key to the RB tree of cache entries. The key is all the mkey properties that UMR operations can't modify. Using this key to define the cache entries and to search and create cache mkeys. Link: https://lore.kernel.org/r/20230125222807.6921-5-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-07-25 07:40:58 +00:00
Mohammad Kabat	5ded6a3406	RDMA/mlx5: Change the cache structure to an RB-tree Bugzilla: https://bugzilla.redhat.com/2165364 Upstream-status: v6.3-rc1 commit b9584517832858a0f78d6851d09b697a829514cd Author: Michael Guralnik <michaelgur@nvidia.com> Date: Thu Jan 26 00:28:04 2023 +0200 RDMA/mlx5: Change the cache structure to an RB-tree Currently, the cache structure is a static linear array. Therefore, his size is limited to the number of entries in it and is not expandable. The entries are dedicated to mkeys of size 2^x and no access_flags. Mkeys with different properties are not cacheable. In this patch, we change the cache structure to an RB-tree. This will allow to extend the cache to support more entries with different mkey properties. Link: https://lore.kernel.org/r/20230125222807.6921-4-michaelgur@nvidia.com Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-07-25 07:40:58 +00:00
Mohammad Kabat	a16ac19808	RDMA/mlx5: Don't keep umrable 'page_shift' in cache entries Bugzilla: https://bugzilla.redhat.com/2165364 Upstream-status: v6.3-rc1 commit a2a88b8e22d1b202225d0e40b02ad068afab2ccb Author: Aharon Landau <aharonl@nvidia.com> Date: Thu Jan 26 00:28:02 2023 +0200 RDMA/mlx5: Don't keep umrable 'page_shift' in cache entries mkc.log_page_size can be changed using UMR. Therefore, don't treat it as a cache entry property. Removing it from struct mlx5_cache_ent. All cache mkeys will be created with default PAGE_SHIFT, and updated with the needed page_shift using UMR when passing them to a user. Link: https://lore.kernel.org/r/20230125222807.6921-2-michaelgur@nvidia.com Signed-off-by: Aharon Landau <aharonl@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-07-25 07:40:58 +00:00
Mohammad Kabat	8255cc648b	RDMA/mlx5: Don't set tx affinity when lag is in hash mode Bugzilla: https://bugzilla.redhat.com/2165355 Upstream-status: v6.1-rc1 commit a83bb5df2ac604ab418fbe0a8720f55de46652eb Author: Liu, Changcheng <jerrliu@nvidia.com> Date: Wed Sep 7 16:36:26 2022 -0700 RDMA/mlx5: Don't set tx affinity when lag is in hash mode In hash mode, without setting tx affinity explicitly, the port select flow table decides which port is used for the traffic. If port_select_flow_table_bypass capability is supported and tx affinity is set explicitly for QP/TIS, they will be added into the explicit affinity table in FW to check which port is used for the traffic. 1. The overloaded explicit affinity table may affect performance. To avoid this, do not set tx affinity explicitly by default. 2. The packets of the same flow need to be transmitted on the same port. Because the packets of the same flow use different QPs in slow & fast path, it shouldn't set tx affinity explicitly for these QPs. Signed-off-by: Liu, Changcheng <jerrliu@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-06-29 09:21:37 +00:00
Herton R. Krzesinski	9de1dafa38	Merge: RDMA: Add support of RDMA dmabuf for mlx5 driver MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/1940 Upstream status: v6.1. Bugzilla: https://bugzilla.redhat.com/2123401 Tested: cuda pyverbs tests passed. Add support for DMABUF FD's when creating a devx umem in the RDMA mlx5 driver. This allows applications to create work queues directly on GPU memory where the GPU fully controls the data flow out of the RDMA NIC. Signed-off-by: Kamal Heib <kheib@redhat.com> Approved-by: Íñigo Huguet <ihuguet@redhat.com> Approved-by: José Ignacio Tornos Martínez <jtornosm@redhat.com> Approved-by: Jonathan Toppins <jtoppins@redhat.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com>	2023-02-08 01:35:25 +00:00
Kamal Heib	ee8da65a04	RDMA/mlx5: Enable ATS support for MRs and umems Bugzilla: https://bugzilla.redhat.com/2123401 commit 72b2f7608a59727e7c2e5b11cff2749c2c080fac Author: Jason Gunthorpe <jgg@ziepe.ca> Date: Thu Sep 1 11:20:56 2022 -0300 RDMA/mlx5: Enable ATS support for MRs and umems For mlx5 if ATS is enabled in the PCI config then the device will use ATS requests for only certain DMA operations. This has to be opted in by the SW side based on the mkey or umem settings. ATS slows down the PCI performance, so it should only be set in cases when it is needed. All of these cases revolve around optimizing PCI P2P transfers and avoiding bad cases where the bus just doesn't work. Link: https://lore.kernel.org/r/4-v1-bd147097458e+ede-umem_dmabuf_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Kamal Heib <kheib@redhat.com>	2023-01-24 10:44:39 -05:00
Mohammad Kabat	470a289875	RDMA/mlx5: Fix UMR cleanup on error flow of driver init Bugzilla: https://bugzilla.redhat.com/2112947 Upstream-status: v6.0-rc5 commit 9b7d4be967f16f79a2283b2338709fcc750313ee Author: Maor Gottlieb <maorg@nvidia.com> Date: Mon Aug 29 12:02:29 2022 +0300 RDMA/mlx5: Fix UMR cleanup on error flow of driver init The cited commit removed from the cleanup flow of umr the checks if the resources were created. This could lead to null-ptr-deref in case that we had failure in mlx5_ib_stage_ib_reg_init stage. Fix it by adding new state to the umr that can say if the resources were created or not and check it in the umr cleanup flow before destroying the resources. Fixes: 04876c12c19e ("RDMA/mlx5: Move init and cleanup of UMR to umr.c") Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Link: https://lore.kernel.org/r/4cfa61386cf202e9ce330e8d228ce3b25a36326e.1661763459.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-01-19 10:21:42 +00:00
Mohammad Kabat	3b72ca1f74	RDMA/mlx5: Rename the mkey cache variables and functions Bugzilla: https://bugzilla.redhat.com/2112947 Upstream-status: v6.0-rc1 commit 0113780870b1597ae49f30abfa4957c239f913d3 Author: Aharon Landau <aharonl@nvidia.com> Date: Tue Jul 26 10:19:11 2022 +0300 RDMA/mlx5: Rename the mkey cache variables and functions After replacing the MR cache with an Mkey cache, rename the variables and functions to fit the new meaning. Link: https://lore.kernel.org/r/20220726071911.122765-6-michaelgur@nvidia.com Signed-off-by: Aharon Landau <aharonl@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-01-19 10:21:38 +00:00
Mohammad Kabat	1f1e9871f0	RDMA/mlx5: Store in the cache mkeys instead of mrs Bugzilla: https://bugzilla.redhat.com/2112947 Upstream-status: v6.0-rc1 commit 6b7533869523ae58e2b914551305b0e47cbeb247 Author: Aharon Landau <aharonl@nvidia.com> Date: Tue Jul 26 10:19:10 2022 +0300 RDMA/mlx5: Store in the cache mkeys instead of mrs Currently, the driver stores mlx5_ib_mr struct in the cache entries, although the only use of the cached MR is the mkey. Store only the mkey in the cache. Link: https://lore.kernel.org/r/20220726071911.122765-5-michaelgur@nvidia.com Signed-off-by: Aharon Landau <aharonl@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-01-19 10:21:38 +00:00
Mohammad Kabat	47033b9181	RDMA/mlx5: Store the number of in_use cache mkeys instead of total_mrs Bugzilla: https://bugzilla.redhat.com/2112947 Upstream-status: v6.0-rc1 commit 19591f134c59703dfc272356808e6fe2037d0d40 Author: Aharon Landau <aharonl@nvidia.com> Date: Tue Jul 26 10:19:09 2022 +0300 RDMA/mlx5: Store the number of in_use cache mkeys instead of total_mrs total_mrs is used only to calculate the number of mkeys currently in use. To simplify things, replace it with a new member called "in_use" and directly store the number of mkeys currently in use. Link: https://lore.kernel.org/r/20220726071911.122765-4-michaelgur@nvidia.com Signed-off-by: Aharon Landau <aharonl@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-01-19 10:21:38 +00:00
Mohammad Kabat	0fad9155ff	RDMA/mlx5: Replace cache list with Xarray Bugzilla: https://bugzilla.redhat.com/2112947 Upstream-status: v6.0-rc1 commit 86457a92df1bebdcd8e20afa286427e4b525aa08 Author: Aharon Landau <aharonl@nvidia.com> Date: Tue Jul 26 10:19:08 2022 +0300 RDMA/mlx5: Replace cache list with Xarray The Xarray allows us to store the cached mkeys in memory efficient way. Entries are reserved in the Xarray using xa_cmpxchg before calling to the upcoming callbacks to avoid allocations in interrupt context. The xa_cmpxchg can sleep when using GFP_KERNEL, so we call it in a loop to ensure one reserved entry for each process trying to reserve. Link: https://lore.kernel.org/r/20220726071911.122765-3-michaelgur@nvidia.com Signed-off-by: Aharon Landau <aharonl@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-01-19 10:21:37 +00:00
Mohammad Kabat	52ce7c8519	RDMA/mlx5: Replace ent->lock with xa_lock Bugzilla: https://bugzilla.redhat.com/2112947 Upstream-status: v6.0-rc1 commit 17ae355926ed1832449d52748334b8fa799301f1 Author: Aharon Landau <aharonl@nvidia.com> Date: Tue Jul 26 10:19:07 2022 +0300 RDMA/mlx5: Replace ent->lock with xa_lock In the next patch, ent->list will be replaced with an xarray. The xarray uses an internal lock to protect the indexes. Use it to protect all the entry fields, and get rid of ent->lock. Link: https://lore.kernel.org/r/20220726071911.122765-2-michaelgur@nvidia.com Signed-off-by: Aharon Landau <aharonl@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-01-19 10:21:37 +00:00
Mohammad Kabat	3a331a3b0f	RDMA/mlx5: Expose steering anchor to userspace Bugzilla: https://bugzilla.redhat.com/2112947 Upstream-status: v6.0-rc1 commit 0c6ab0ca9a662d4ca9742d97156bac0d3067d72d Author: Mark Bloch <mbloch@nvidia.com> Date: Sun Jul 3 13:54:07 2022 -0700 RDMA/mlx5: Expose steering anchor to userspace Expose a steering anchor per priority to allow users to re-inject packets back into default NIC pipeline for additional processing. MLX5_IB_METHOD_STEERING_ANCHOR_CREATE returns a flow table ID which a user can use to re-inject packets at a specific priority. A FTE (flow table entry) can be created and the flow table ID used as a destination. When a packet is taken into a RDMA-controlled steering domain (like software steering) there may be a need to insert the packet back into the default NIC pipeline. This exposes a flow table ID to the user that can be used as a destination in a flow table entry. With this new method priorities that are exposed to users via MLX5_IB_METHOD_FLOW_MATCHER_CREATE can be reached from a non-zero UID. As user-created flow tables (via RDMA DEVX) are created with a non-zero UID thus it's impossible to point to a NIC core flow table (core driver flow tables are created with UID value of zero) from userspace. Create flow tables that are exposed to users with the shared UID, this allows users to point to default NIC flow tables. Steering loops are prevented at FW level as FW enforces that no flow table at level X can point to a table at level lower than X. Link: https://lore.kernel.org/all/20220703205407.110890-6-saeed@kernel.org/ Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-01-19 10:21:36 +00:00
Mohammad Kabat	e01810e974	RDMA/mlx5: Add a umr recovery flow Bugzilla: https://bugzilla.redhat.com/2112947 Upstream-status: v6.0-rc1 commit 158e71bb69e368b8b33e8b7c4ac8c111da0c1ae2 Author: Aharon Landau <aharonl@nvidia.com> Date: Sun May 15 07:19:53 2022 +0300 RDMA/mlx5: Add a umr recovery flow When a UMR fails, the UMR QP state changes to an error state. Therefore, all the further UMR operations will fail too. Add a recovery flow to the UMR QP, and repost the flushed WQEs. Link: https://lore.kernel.org/r/6cc24816cca049bd8541317f5e41d3ac659445d3.1652588303.git.leonro@nvidia.com Signed-off-by: Aharon Landau <aharonl@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2023-01-19 10:21:32 +00:00
Mohammad Kabat	f8a2b29fbb	net/mlx5: Lag, expose number of lag ports Bugzilla: https://bugzilla.redhat.com/2112940 Upstream-status: v5.19-rc1 commit 34a30d7635a8e37275a7b63bec09035ed762969b Author: Mark Bloch <mbloch@nvidia.com> Date: Tue Mar 1 15:42:01 2022 +0000 net/mlx5: Lag, expose number of lag ports Downstream patches will add support for hardware lag with more than 2 ports. Add a way for users to query the number of lag ports. Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>	2022-12-18 10:14:09 +00:00

1 2 3 4 5 ...

479 Commits