Centos-kernel-stream-9/drivers/infiniband/hw/mlx5/mr.c

2843 lines
74 KiB
C
Raw Normal View History

/*
* Copyright (c) 2013-2015, Mellanox Technologies. All rights reserved.
* Copyright (c) 2020, Intel Corporation. All rights reserved.
*
* This software is available to you under a choice of one of two
* licenses. You may choose to be licensed under the terms of the GNU
* General Public License (GPL) Version 2, available from the file
* COPYING in the main directory of this source tree, or the
* OpenIB.org BSD license below:
*
* Redistribution and use in source and binary forms, with or
* without modification, are permitted provided that the following
* conditions are met:
*
* - Redistributions of source code must retain the above
* copyright notice, this list of conditions and the following
* disclaimer.
*
* - Redistributions in binary form must reproduce the above
* copyright notice, this list of conditions and the following
* disclaimer in the documentation and/or other materials
* provided with the distribution.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
#include <linux/kref.h>
#include <linux/random.h>
#include <linux/debugfs.h>
#include <linux/export.h>
#include <linux/delay.h>
#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <rdma/ib_umem_odp.h>
#include "dm.h"
#include "mlx5_ib.h"
#include "umr.h"
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
#include "data_direct.h"
enum {
MAX_PENDING_REG_MR = 8,
};
RDMA/mlx5: Limit usage of over-sized mkeys from the MR cache JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit ee6d57a2e13d11ce9050cfc3e3b69ef707a44a63 Author: Michael Guralnik <michaelgur@nvidia.com> Date: Tue Sep 3 14:24:49 2024 +0300 RDMA/mlx5: Limit usage of over-sized mkeys from the MR cache When searching the MR cache for suitable cache entries, don't use mkeys larger than twice the size required for the MR. This should ensure the usage of mkeys closer to the minimal required size and reduce memory waste. On driver init we create entries for mkeys with clear attributes and powers of 2 sizes from 4 to the max supported size. This solves the issue for anyone using mkeys that fit these requirements. In the use case where an MR is registered with different attributes, like an access flag we can't UMR, we'll create a new cache entry to store it upon dereg. Without this fix, any later registration with same attributes and smaller size will use the newly created cache entry and it's mkeys, disregarding the memory waste of using mkeys larger than required. For example, one worst-case scenario can be when registering and deregistering a 1GB mkey with ATS enabled which will cause the creation of a new cache entry to hold those type of mkeys. A user registering a 4k MR with ATS will end up using the new cache entry and an mkey that can support a 1GB MR, thus wasting x250k memory than actually needed in the HW. Additionally, allow all small registration to use the smallest size cache entry that is initialized on driver load even if size is larger than twice the required size. Fixes: 73d09b2fe833 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/8ba3a6e3748aace2026de8b83da03aba084f78f4.1725362530.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
#define MLX5_MR_CACHE_PERSISTENT_ENTRY_MIN_DESCS 4
#define MLX5_UMR_ALIGN 2048
static void
create_mkey_callback(int status, struct mlx5_async_work *context);
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, struct ib_umem *umem,
u64 iova, int access_flags,
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
unsigned int page_size, bool populate,
int access_mode);
static int __mlx5_ib_dereg_mr(struct ib_mr *ibmr);
static void set_mkc_access_pd_addr_fields(void *mkc, int acc, u64 start_addr,
struct ib_pd *pd)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
MLX5_SET(mkc, mkc, a, !!(acc & IB_ACCESS_REMOTE_ATOMIC));
MLX5_SET(mkc, mkc, rw, !!(acc & IB_ACCESS_REMOTE_WRITE));
MLX5_SET(mkc, mkc, rr, !!(acc & IB_ACCESS_REMOTE_READ));
MLX5_SET(mkc, mkc, lw, !!(acc & IB_ACCESS_LOCAL_WRITE));
MLX5_SET(mkc, mkc, lr, 1);
if (acc & IB_ACCESS_RELAXED_ORDERING) {
if (MLX5_CAP_GEN(dev->mdev, relaxed_ordering_write))
MLX5_SET(mkc, mkc, relaxed_ordering_write, 1);
RDMA/mlx5: Allow relaxed ordering read in VFs and VMs JIRA: https://issues.redhat.com/browse/RHEL-882 Upstream-status: v6.4-rc1 commit bd4ba605c4a92b46ab414626a4f969a19103f97a Author: Avihai Horon <avihaih@nvidia.com> Date: Mon Apr 10 16:07:53 2023 +0300 RDMA/mlx5: Allow relaxed ordering read in VFs and VMs According to PCIe spec, Enable Relaxed Ordering value in the VF's PCI config space is wired to 0 and PF relaxed ordering (RO) setting should be applied to the VF. In QEMU (and maybe others), when assigning VFs, the RO bit in PCI config space is not emulated properly and is always set to 0. Therefore, pcie_relaxed_ordering_enabled() always returns 0 for VFs and VMs and thus MKeys can't be created with RO read even if the PF supports it. pcie_relaxed_ordering_enabled() check was added to avoid a syndrome when creating a MKey with relaxed ordering (RO) enabled when the driver's relaxed_ordering_read_pci_enabled HCA capability is out of sync with FW. With the new relaxed_ordering_read capability this can't happen, as it's set regardless of RO value in PCI config space and thus can't change during runtime. Hence, to allow RO read in VFs and VMs, use the new HCA capability relaxed_ordering_read without checking pcie_relaxed_ordering_enabled(). The old capability checks are kept for backward compatibility with older FWs. Allowing RO in VFs and VMs is valuable since it can greatly improve performance on some setups. For example, testing throughput of a VF on an AMD EPYC 7763 and ConnectX-6 Dx setup showed roughly 60% performance improvement. Signed-off-by: Avihai Horon <avihaih@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Link: https://lore.kernel.org/r/e7048640d66c341a8fa0465e099926e7989184bc.1681131553.git.leon@kernel.org Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>
2023-12-13 09:06:28 +00:00
if (MLX5_CAP_GEN(dev->mdev, relaxed_ordering_read) ||
(MLX5_CAP_GEN(dev->mdev,
relaxed_ordering_read_pci_enabled) &&
pcie_relaxed_ordering_enabled(dev->mdev->pdev)))
MLX5_SET(mkc, mkc, relaxed_ordering_read, 1);
}
MLX5_SET(mkc, mkc, pd, to_mpd(pd)->pdn);
MLX5_SET(mkc, mkc, qpn, 0xffffff);
MLX5_SET64(mkc, mkc, start_addr, start_addr);
}
static void assign_mkey_variant(struct mlx5_ib_dev *dev, u32 *mkey, u32 *in)
{
u8 key = atomic_inc_return(&dev->mkey_var);
void *mkc;
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
MLX5_SET(mkc, mkc, mkey_7_0, key);
*mkey = key;
}
static int mlx5_ib_create_mkey(struct mlx5_ib_dev *dev,
struct mlx5_ib_mkey *mkey, u32 *in, int inlen)
{
int ret;
assign_mkey_variant(dev, &mkey->key, in);
ret = mlx5_core_create_mkey(dev->mdev, &mkey->key, in, inlen);
if (!ret)
init_waitqueue_head(&mkey->wait);
return ret;
}
static int mlx5_ib_create_mkey_cb(struct mlx5r_async_create_mkey *async_create)
{
struct mlx5_ib_dev *dev = async_create->ent->dev;
size_t inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
size_t outlen = MLX5_ST_SZ_BYTES(create_mkey_out);
MLX5_SET(create_mkey_in, async_create->in, opcode,
MLX5_CMD_OP_CREATE_MKEY);
assign_mkey_variant(dev, &async_create->mkey, async_create->in);
return mlx5_cmd_exec_cb(&dev->async_ctx, async_create->in, inlen,
async_create->out, outlen, create_mkey_callback,
&async_create->cb_work);
}
static int mkey_cache_max_order(struct mlx5_ib_dev *dev);
static void queue_adjust_cache_locked(struct mlx5_cache_ent *ent);
static int destroy_mkey(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
{
WARN_ON(xa_load(&dev->odp_mkeys, mlx5_base_mkey(mr->mmkey.key)));
return mlx5_core_destroy_mkey(dev->mdev, mr->mmkey.key);
}
static void create_mkey_warn(struct mlx5_ib_dev *dev, int status, void *out)
{
if (status == -ENXIO) /* core driver is not available */
return;
mlx5_ib_warn(dev, "async reg mr failed. status %d\n", status);
if (status != -EREMOTEIO) /* driver specific failure */
return;
/* Failed in FW, print cmd out failure details */
mlx5_cmd_out_err(dev->mdev, MLX5_CMD_OP_CREATE_MKEY, 0, out);
}
static int push_mkey_locked(struct mlx5_cache_ent *ent, u32 mkey)
{
unsigned long tmp = ent->mkeys_queue.ci % NUM_MKEYS_PER_PAGE;
struct mlx5_mkeys_page *page;
lockdep_assert_held(&ent->mkeys_queue.lock);
if (ent->mkeys_queue.ci >=
ent->mkeys_queue.num_pages * NUM_MKEYS_PER_PAGE) {
page = kzalloc(sizeof(*page), GFP_ATOMIC);
if (!page)
return -ENOMEM;
ent->mkeys_queue.num_pages++;
list_add_tail(&page->list, &ent->mkeys_queue.pages_list);
} else {
page = list_last_entry(&ent->mkeys_queue.pages_list,
struct mlx5_mkeys_page, list);
}
page->mkeys[tmp] = mkey;
ent->mkeys_queue.ci++;
return 0;
}
static int pop_mkey_locked(struct mlx5_cache_ent *ent)
{
unsigned long tmp = (ent->mkeys_queue.ci - 1) % NUM_MKEYS_PER_PAGE;
struct mlx5_mkeys_page *last_page;
u32 mkey;
lockdep_assert_held(&ent->mkeys_queue.lock);
last_page = list_last_entry(&ent->mkeys_queue.pages_list,
struct mlx5_mkeys_page, list);
mkey = last_page->mkeys[tmp];
last_page->mkeys[tmp] = 0;
ent->mkeys_queue.ci--;
if (ent->mkeys_queue.num_pages > 1 && !tmp) {
list_del(&last_page->list);
ent->mkeys_queue.num_pages--;
kfree(last_page);
}
return mkey;
}
static void create_mkey_callback(int status, struct mlx5_async_work *context)
{
struct mlx5r_async_create_mkey *mkey_out =
container_of(context, struct mlx5r_async_create_mkey, cb_work);
struct mlx5_cache_ent *ent = mkey_out->ent;
struct mlx5_ib_dev *dev = ent->dev;
unsigned long flags;
if (status) {
create_mkey_warn(dev, status, mkey_out->out);
kfree(mkey_out);
spin_lock_irqsave(&ent->mkeys_queue.lock, flags);
ent->pending--;
WRITE_ONCE(dev->fill_delay, 1);
spin_unlock_irqrestore(&ent->mkeys_queue.lock, flags);
mod_timer(&dev->delay_timer, jiffies + HZ);
return;
}
mkey_out->mkey |= mlx5_idx_to_mkey(
MLX5_GET(create_mkey_out, mkey_out->out, mkey_index));
WRITE_ONCE(dev->cache.last_add, jiffies);
spin_lock_irqsave(&ent->mkeys_queue.lock, flags);
push_mkey_locked(ent, mkey_out->mkey);
ent->pending--;
/* If we are doing fill_to_high_water then keep going. */
queue_adjust_cache_locked(ent);
spin_unlock_irqrestore(&ent->mkeys_queue.lock, flags);
kfree(mkey_out);
}
static int get_mkc_octo_size(unsigned int access_mode, unsigned int ndescs)
{
int ret = 0;
switch (access_mode) {
case MLX5_MKC_ACCESS_MODE_MTT:
ret = DIV_ROUND_UP(ndescs, MLX5_IB_UMR_OCTOWORD /
sizeof(struct mlx5_mtt));
break;
case MLX5_MKC_ACCESS_MODE_KSM:
ret = DIV_ROUND_UP(ndescs, MLX5_IB_UMR_OCTOWORD /
sizeof(struct mlx5_klm));
break;
default:
WARN_ON(1);
}
return ret;
}
static void set_cache_mkc(struct mlx5_cache_ent *ent, void *mkc)
{
set_mkc_access_pd_addr_fields(mkc, ent->rb_key.access_flags, 0,
ent->dev->umrc.pd);
MLX5_SET(mkc, mkc, free, 1);
MLX5_SET(mkc, mkc, umr_en, 1);
MLX5_SET(mkc, mkc, access_mode_1_0, ent->rb_key.access_mode & 0x3);
MLX5_SET(mkc, mkc, access_mode_4_2,
(ent->rb_key.access_mode >> 2) & 0x7);
MLX5_SET(mkc, mkc, ma_translation_mode, !!ent->rb_key.ats);
MLX5_SET(mkc, mkc, translations_octword_size,
get_mkc_octo_size(ent->rb_key.access_mode,
ent->rb_key.ndescs));
MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
}
/* Asynchronously schedule new MRs to be populated in the cache. */
static int add_keys(struct mlx5_cache_ent *ent, unsigned int num)
{
struct mlx5r_async_create_mkey *async_create;
void *mkc;
int err = 0;
int i;
for (i = 0; i < num; i++) {
async_create = kzalloc(sizeof(struct mlx5r_async_create_mkey),
GFP_KERNEL);
if (!async_create)
return -ENOMEM;
mkc = MLX5_ADDR_OF(create_mkey_in, async_create->in,
memory_key_mkey_entry);
set_cache_mkc(ent, mkc);
async_create->ent = ent;
spin_lock_irq(&ent->mkeys_queue.lock);
if (ent->pending >= MAX_PENDING_REG_MR) {
err = -EAGAIN;
goto free_async_create;
}
ent->pending++;
spin_unlock_irq(&ent->mkeys_queue.lock);
err = mlx5_ib_create_mkey_cb(async_create);
if (err) {
mlx5_ib_warn(ent->dev, "create mkey failed %d\n", err);
goto err_create_mkey;
}
}
return 0;
err_create_mkey:
spin_lock_irq(&ent->mkeys_queue.lock);
ent->pending--;
free_async_create:
spin_unlock_irq(&ent->mkeys_queue.lock);
kfree(async_create);
return err;
}
/* Synchronously create a MR in the cache */
static int create_cache_mkey(struct mlx5_cache_ent *ent, u32 *mkey)
{
size_t inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
void *mkc;
u32 *in;
int err;
in = kzalloc(inlen, GFP_KERNEL);
if (!in)
return -ENOMEM;
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
set_cache_mkc(ent, mkc);
err = mlx5_core_create_mkey(ent->dev->mdev, mkey, in, inlen);
if (err)
goto free_in;
WRITE_ONCE(ent->dev->cache.last_add, jiffies);
free_in:
kfree(in);
return err;
}
static void remove_cache_mr_locked(struct mlx5_cache_ent *ent)
{
u32 mkey;
lockdep_assert_held(&ent->mkeys_queue.lock);
if (!ent->mkeys_queue.ci)
return;
mkey = pop_mkey_locked(ent);
spin_unlock_irq(&ent->mkeys_queue.lock);
mlx5_core_destroy_mkey(ent->dev->mdev, mkey);
spin_lock_irq(&ent->mkeys_queue.lock);
}
static int resize_available_mrs(struct mlx5_cache_ent *ent, unsigned int target,
bool limit_fill)
__acquires(&ent->mkeys_queue.lock) __releases(&ent->mkeys_queue.lock)
{
int err;
lockdep_assert_held(&ent->mkeys_queue.lock);
while (true) {
if (limit_fill)
target = ent->limit * 2;
if (target == ent->pending + ent->mkeys_queue.ci)
return 0;
if (target > ent->pending + ent->mkeys_queue.ci) {
u32 todo = target - (ent->pending + ent->mkeys_queue.ci);
spin_unlock_irq(&ent->mkeys_queue.lock);
err = add_keys(ent, todo);
if (err == -EAGAIN)
usleep_range(3000, 5000);
spin_lock_irq(&ent->mkeys_queue.lock);
if (err) {
if (err != -EAGAIN)
return err;
} else
return 0;
} else {
remove_cache_mr_locked(ent);
}
}
}
static ssize_t size_write(struct file *filp, const char __user *buf,
size_t count, loff_t *pos)
{
struct mlx5_cache_ent *ent = filp->private_data;
u32 target;
int err;
err = kstrtou32_from_user(buf, count, 0, &target);
if (err)
return err;
/*
* Target is the new value of total_mrs the user requests, however we
* cannot free MRs that are in use. Compute the target value for stored
* mkeys.
*/
spin_lock_irq(&ent->mkeys_queue.lock);
if (target < ent->in_use) {
err = -EINVAL;
goto err_unlock;
}
target = target - ent->in_use;
if (target < ent->limit || target > ent->limit*2) {
err = -EINVAL;
goto err_unlock;
}
err = resize_available_mrs(ent, target, false);
if (err)
goto err_unlock;
spin_unlock_irq(&ent->mkeys_queue.lock);
return count;
err_unlock:
spin_unlock_irq(&ent->mkeys_queue.lock);
return err;
}
static ssize_t size_read(struct file *filp, char __user *buf, size_t count,
loff_t *pos)
{
struct mlx5_cache_ent *ent = filp->private_data;
char lbuf[20];
int err;
err = snprintf(lbuf, sizeof(lbuf), "%ld\n",
ent->mkeys_queue.ci + ent->in_use);
if (err < 0)
return err;
return simple_read_from_buffer(buf, count, pos, lbuf, err);
}
static const struct file_operations size_fops = {
.owner = THIS_MODULE,
.open = simple_open,
.write = size_write,
.read = size_read,
};
static ssize_t limit_write(struct file *filp, const char __user *buf,
size_t count, loff_t *pos)
{
struct mlx5_cache_ent *ent = filp->private_data;
u32 var;
int err;
err = kstrtou32_from_user(buf, count, 0, &var);
if (err)
return err;
/*
* Upon set we immediately fill the cache to high water mark implied by
* the limit.
*/
spin_lock_irq(&ent->mkeys_queue.lock);
ent->limit = var;
err = resize_available_mrs(ent, 0, true);
spin_unlock_irq(&ent->mkeys_queue.lock);
if (err)
return err;
return count;
}
static ssize_t limit_read(struct file *filp, char __user *buf, size_t count,
loff_t *pos)
{
struct mlx5_cache_ent *ent = filp->private_data;
char lbuf[20];
int err;
err = snprintf(lbuf, sizeof(lbuf), "%d\n", ent->limit);
if (err < 0)
return err;
return simple_read_from_buffer(buf, count, pos, lbuf, err);
}
static const struct file_operations limit_fops = {
.owner = THIS_MODULE,
.open = simple_open,
.write = limit_write,
.read = limit_read,
};
static bool someone_adding(struct mlx5_mkey_cache *cache)
{
struct mlx5_cache_ent *ent;
struct rb_node *node;
bool ret;
mutex_lock(&cache->rb_lock);
for (node = rb_first(&cache->rb_root); node; node = rb_next(node)) {
ent = rb_entry(node, struct mlx5_cache_ent, node);
spin_lock_irq(&ent->mkeys_queue.lock);
ret = ent->mkeys_queue.ci < ent->limit;
spin_unlock_irq(&ent->mkeys_queue.lock);
if (ret) {
mutex_unlock(&cache->rb_lock);
return true;
}
}
mutex_unlock(&cache->rb_lock);
return false;
}
/*
* Check if the bucket is outside the high/low water mark and schedule an async
* update. The cache refill has hysteresis, once the low water mark is hit it is
* refilled up to the high mark.
*/
static void queue_adjust_cache_locked(struct mlx5_cache_ent *ent)
{
lockdep_assert_held(&ent->mkeys_queue.lock);
if (ent->disabled || READ_ONCE(ent->dev->fill_delay) || ent->is_tmp)
return;
if (ent->mkeys_queue.ci < ent->limit) {
ent->fill_to_high_water = true;
mod_delayed_work(ent->dev->cache.wq, &ent->dwork, 0);
} else if (ent->fill_to_high_water &&
ent->mkeys_queue.ci + ent->pending < 2 * ent->limit) {
/*
* Once we start populating due to hitting a low water mark
* continue until we pass the high water mark.
*/
mod_delayed_work(ent->dev->cache.wq, &ent->dwork, 0);
} else if (ent->mkeys_queue.ci == 2 * ent->limit) {
ent->fill_to_high_water = false;
} else if (ent->mkeys_queue.ci > 2 * ent->limit) {
/* Queue deletion of excess entries */
ent->fill_to_high_water = false;
if (ent->pending)
queue_delayed_work(ent->dev->cache.wq, &ent->dwork,
msecs_to_jiffies(1000));
else
mod_delayed_work(ent->dev->cache.wq, &ent->dwork, 0);
}
}
RDMA/mlx5: Fix MR cache temp entries cleanup JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit 7ebb00cea49db641b458edef0ede389f7004821d Author: Michael Guralnik <michaelgur@nvidia.com> Date: Tue Sep 3 14:24:50 2024 +0300 RDMA/mlx5: Fix MR cache temp entries cleanup Fix the cleanup of the temp cache entries that are dynamically created in the MR cache. The cleanup of the temp cache entries is currently scheduled only when a new entry is created. Since in the cleanup of the entries only the mkeys are destroyed and the cache entry stays in the cache, subsequent registrations might reuse the entry and it will eventually be filled with new mkeys without cleanup ever getting scheduled again. On workloads that register and deregister MRs with a wide range of properties we see the cache ends up holding many cache entries, each holding the max number of mkeys that were ever used through it. Additionally, as the cleanup work is scheduled to run over the whole cache, any mkey that is returned to the cache after the cleanup was scheduled will be held for less than the intended 30 seconds timeout. Solve both issues by dropping the existing remove_ent_work and reusing the existing per-entry work to also handle the temp entries cleanup. Schedule the work to run with a 30 seconds delay every time we push an mkey to a clean temp entry. This ensures the cleanup runs on each entry only 30 seconds after the first mkey was pushed to an empty entry. As we have already been distinguishing between persistent and temp entries when scheduling the cache_work_func, it is not being scheduled in any other flows for the temp entries. Another benefit from moving to a per-entry cleanup is we now not required to hold the rb_tree mutex, thus enabling other flow to run concurrently. Fixes: dd1b913fb0d0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/e4fa4bb03bebf20dceae320f26816cd2dde23a26.1725362530.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
static void clean_keys(struct mlx5_ib_dev *dev, struct mlx5_cache_ent *ent)
{
u32 mkey;
spin_lock_irq(&ent->mkeys_queue.lock);
while (ent->mkeys_queue.ci) {
mkey = pop_mkey_locked(ent);
spin_unlock_irq(&ent->mkeys_queue.lock);
mlx5_core_destroy_mkey(dev->mdev, mkey);
spin_lock_irq(&ent->mkeys_queue.lock);
}
ent->tmp_cleanup_scheduled = false;
spin_unlock_irq(&ent->mkeys_queue.lock);
}
static void __cache_work_func(struct mlx5_cache_ent *ent)
{
struct mlx5_ib_dev *dev = ent->dev;
struct mlx5_mkey_cache *cache = &dev->cache;
int err;
spin_lock_irq(&ent->mkeys_queue.lock);
if (ent->disabled)
goto out;
if (ent->fill_to_high_water &&
ent->mkeys_queue.ci + ent->pending < 2 * ent->limit &&
!READ_ONCE(dev->fill_delay)) {
spin_unlock_irq(&ent->mkeys_queue.lock);
err = add_keys(ent, 1);
spin_lock_irq(&ent->mkeys_queue.lock);
if (ent->disabled)
goto out;
if (err) {
/*
* EAGAIN only happens if there are pending MRs, so we
* will be rescheduled when storing them. The only
* failure path here is ENOMEM.
*/
if (err != -EAGAIN) {
mlx5_ib_warn(
dev,
"add keys command failed, err %d\n",
err);
queue_delayed_work(cache->wq, &ent->dwork,
msecs_to_jiffies(1000));
}
}
} else if (ent->mkeys_queue.ci > 2 * ent->limit) {
bool need_delay;
/*
* The remove_cache_mr() logic is performed as garbage
* collection task. Such task is intended to be run when no
* other active processes are running.
*
* The need_resched() will return TRUE if there are user tasks
* to be activated in near future.
*
* In such case, we don't execute remove_cache_mr() and postpone
* the garbage collection work to try to run in next cycle, in
* order to free CPU resources to other tasks.
*/
spin_unlock_irq(&ent->mkeys_queue.lock);
need_delay = need_resched() || someone_adding(cache) ||
!time_after(jiffies,
READ_ONCE(cache->last_add) + 300 * HZ);
spin_lock_irq(&ent->mkeys_queue.lock);
if (ent->disabled)
goto out;
if (need_delay) {
queue_delayed_work(cache->wq, &ent->dwork, 300 * HZ);
goto out;
}
remove_cache_mr_locked(ent);
queue_adjust_cache_locked(ent);
}
out:
spin_unlock_irq(&ent->mkeys_queue.lock);
}
static void delayed_cache_work_func(struct work_struct *work)
{
struct mlx5_cache_ent *ent;
ent = container_of(work, struct mlx5_cache_ent, dwork.work);
RDMA/mlx5: Fix MR cache temp entries cleanup JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit 7ebb00cea49db641b458edef0ede389f7004821d Author: Michael Guralnik <michaelgur@nvidia.com> Date: Tue Sep 3 14:24:50 2024 +0300 RDMA/mlx5: Fix MR cache temp entries cleanup Fix the cleanup of the temp cache entries that are dynamically created in the MR cache. The cleanup of the temp cache entries is currently scheduled only when a new entry is created. Since in the cleanup of the entries only the mkeys are destroyed and the cache entry stays in the cache, subsequent registrations might reuse the entry and it will eventually be filled with new mkeys without cleanup ever getting scheduled again. On workloads that register and deregister MRs with a wide range of properties we see the cache ends up holding many cache entries, each holding the max number of mkeys that were ever used through it. Additionally, as the cleanup work is scheduled to run over the whole cache, any mkey that is returned to the cache after the cleanup was scheduled will be held for less than the intended 30 seconds timeout. Solve both issues by dropping the existing remove_ent_work and reusing the existing per-entry work to also handle the temp entries cleanup. Schedule the work to run with a 30 seconds delay every time we push an mkey to a clean temp entry. This ensures the cleanup runs on each entry only 30 seconds after the first mkey was pushed to an empty entry. As we have already been distinguishing between persistent and temp entries when scheduling the cache_work_func, it is not being scheduled in any other flows for the temp entries. Another benefit from moving to a per-entry cleanup is we now not required to hold the rb_tree mutex, thus enabling other flow to run concurrently. Fixes: dd1b913fb0d0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/e4fa4bb03bebf20dceae320f26816cd2dde23a26.1725362530.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
/* temp entries are never filled, only cleaned */
if (ent->is_tmp)
clean_keys(ent->dev, ent);
else
__cache_work_func(ent);
}
static int cache_ent_key_cmp(struct mlx5r_cache_rb_key key1,
struct mlx5r_cache_rb_key key2)
{
int res;
res = key1.ats - key2.ats;
if (res)
return res;
res = key1.access_mode - key2.access_mode;
if (res)
return res;
res = key1.access_flags - key2.access_flags;
if (res)
return res;
/*
* keep ndescs the last in the compare table since the find function
* searches for an exact match on all properties and only closest
* match in size.
*/
return key1.ndescs - key2.ndescs;
}
static int mlx5_cache_ent_insert(struct mlx5_mkey_cache *cache,
struct mlx5_cache_ent *ent)
{
struct rb_node **new = &cache->rb_root.rb_node, *parent = NULL;
struct mlx5_cache_ent *cur;
int cmp;
/* Figure out where to put new node */
while (*new) {
cur = rb_entry(*new, struct mlx5_cache_ent, node);
parent = *new;
cmp = cache_ent_key_cmp(cur->rb_key, ent->rb_key);
if (cmp > 0)
new = &((*new)->rb_left);
if (cmp < 0)
new = &((*new)->rb_right);
if (cmp == 0)
return -EEXIST;
}
/* Add new node and rebalance tree. */
rb_link_node(&ent->node, parent, new);
rb_insert_color(&ent->node, &cache->rb_root);
return 0;
}
static struct mlx5_cache_ent *
mkey_cache_ent_from_rb_key(struct mlx5_ib_dev *dev,
struct mlx5r_cache_rb_key rb_key)
{
struct rb_node *node = dev->cache.rb_root.rb_node;
struct mlx5_cache_ent *cur, *smallest = NULL;
RDMA/mlx5: Limit usage of over-sized mkeys from the MR cache JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit ee6d57a2e13d11ce9050cfc3e3b69ef707a44a63 Author: Michael Guralnik <michaelgur@nvidia.com> Date: Tue Sep 3 14:24:49 2024 +0300 RDMA/mlx5: Limit usage of over-sized mkeys from the MR cache When searching the MR cache for suitable cache entries, don't use mkeys larger than twice the size required for the MR. This should ensure the usage of mkeys closer to the minimal required size and reduce memory waste. On driver init we create entries for mkeys with clear attributes and powers of 2 sizes from 4 to the max supported size. This solves the issue for anyone using mkeys that fit these requirements. In the use case where an MR is registered with different attributes, like an access flag we can't UMR, we'll create a new cache entry to store it upon dereg. Without this fix, any later registration with same attributes and smaller size will use the newly created cache entry and it's mkeys, disregarding the memory waste of using mkeys larger than required. For example, one worst-case scenario can be when registering and deregistering a 1GB mkey with ATS enabled which will cause the creation of a new cache entry to hold those type of mkeys. A user registering a 4k MR with ATS will end up using the new cache entry and an mkey that can support a 1GB MR, thus wasting x250k memory than actually needed in the HW. Additionally, allow all small registration to use the smallest size cache entry that is initialized on driver load even if size is larger than twice the required size. Fixes: 73d09b2fe833 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/8ba3a6e3748aace2026de8b83da03aba084f78f4.1725362530.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
u64 ndescs_limit;
int cmp;
/*
* Find the smallest ent with order >= requested_order.
*/
while (node) {
cur = rb_entry(node, struct mlx5_cache_ent, node);
cmp = cache_ent_key_cmp(cur->rb_key, rb_key);
if (cmp > 0) {
smallest = cur;
node = node->rb_left;
}
if (cmp < 0)
node = node->rb_right;
if (cmp == 0)
return cur;
}
RDMA/mlx5: Limit usage of over-sized mkeys from the MR cache JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit ee6d57a2e13d11ce9050cfc3e3b69ef707a44a63 Author: Michael Guralnik <michaelgur@nvidia.com> Date: Tue Sep 3 14:24:49 2024 +0300 RDMA/mlx5: Limit usage of over-sized mkeys from the MR cache When searching the MR cache for suitable cache entries, don't use mkeys larger than twice the size required for the MR. This should ensure the usage of mkeys closer to the minimal required size and reduce memory waste. On driver init we create entries for mkeys with clear attributes and powers of 2 sizes from 4 to the max supported size. This solves the issue for anyone using mkeys that fit these requirements. In the use case where an MR is registered with different attributes, like an access flag we can't UMR, we'll create a new cache entry to store it upon dereg. Without this fix, any later registration with same attributes and smaller size will use the newly created cache entry and it's mkeys, disregarding the memory waste of using mkeys larger than required. For example, one worst-case scenario can be when registering and deregistering a 1GB mkey with ATS enabled which will cause the creation of a new cache entry to hold those type of mkeys. A user registering a 4k MR with ATS will end up using the new cache entry and an mkey that can support a 1GB MR, thus wasting x250k memory than actually needed in the HW. Additionally, allow all small registration to use the smallest size cache entry that is initialized on driver load even if size is larger than twice the required size. Fixes: 73d09b2fe833 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/8ba3a6e3748aace2026de8b83da03aba084f78f4.1725362530.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
/*
* Limit the usage of mkeys larger than twice the required size while
* also allowing the usage of smallest cache entry for small MRs.
*/
ndescs_limit = max_t(u64, rb_key.ndescs * 2,
MLX5_MR_CACHE_PERSISTENT_ENTRY_MIN_DESCS);
return (smallest &&
smallest->rb_key.access_mode == rb_key.access_mode &&
smallest->rb_key.access_flags == rb_key.access_flags &&
RDMA/mlx5: Limit usage of over-sized mkeys from the MR cache JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit ee6d57a2e13d11ce9050cfc3e3b69ef707a44a63 Author: Michael Guralnik <michaelgur@nvidia.com> Date: Tue Sep 3 14:24:49 2024 +0300 RDMA/mlx5: Limit usage of over-sized mkeys from the MR cache When searching the MR cache for suitable cache entries, don't use mkeys larger than twice the size required for the MR. This should ensure the usage of mkeys closer to the minimal required size and reduce memory waste. On driver init we create entries for mkeys with clear attributes and powers of 2 sizes from 4 to the max supported size. This solves the issue for anyone using mkeys that fit these requirements. In the use case where an MR is registered with different attributes, like an access flag we can't UMR, we'll create a new cache entry to store it upon dereg. Without this fix, any later registration with same attributes and smaller size will use the newly created cache entry and it's mkeys, disregarding the memory waste of using mkeys larger than required. For example, one worst-case scenario can be when registering and deregistering a 1GB mkey with ATS enabled which will cause the creation of a new cache entry to hold those type of mkeys. A user registering a 4k MR with ATS will end up using the new cache entry and an mkey that can support a 1GB MR, thus wasting x250k memory than actually needed in the HW. Additionally, allow all small registration to use the smallest size cache entry that is initialized on driver load even if size is larger than twice the required size. Fixes: 73d09b2fe833 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/8ba3a6e3748aace2026de8b83da03aba084f78f4.1725362530.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
smallest->rb_key.ats == rb_key.ats &&
smallest->rb_key.ndescs <= ndescs_limit) ?
smallest :
NULL;
}
static struct mlx5_ib_mr *_mlx5_mr_cache_alloc(struct mlx5_ib_dev *dev,
struct mlx5_cache_ent *ent,
int access_flags)
{
struct mlx5_ib_mr *mr;
int err;
mr = kzalloc(sizeof(*mr), GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);
spin_lock_irq(&ent->mkeys_queue.lock);
ent->in_use++;
if (!ent->mkeys_queue.ci) {
queue_adjust_cache_locked(ent);
ent->miss++;
spin_unlock_irq(&ent->mkeys_queue.lock);
err = create_cache_mkey(ent, &mr->mmkey.key);
if (err) {
spin_lock_irq(&ent->mkeys_queue.lock);
ent->in_use--;
spin_unlock_irq(&ent->mkeys_queue.lock);
kfree(mr);
return ERR_PTR(err);
}
} else {
mr->mmkey.key = pop_mkey_locked(ent);
queue_adjust_cache_locked(ent);
spin_unlock_irq(&ent->mkeys_queue.lock);
}
mr->mmkey.cache_ent = ent;
mr->mmkey.type = MLX5_MKEY_MR;
mr->mmkey.rb_key = ent->rb_key;
mr->mmkey.cacheable = true;
init_waitqueue_head(&mr->mmkey.wait);
return mr;
}
static int get_unchangeable_access_flags(struct mlx5_ib_dev *dev,
int access_flags)
{
int ret = 0;
if ((access_flags & IB_ACCESS_REMOTE_ATOMIC) &&
MLX5_CAP_GEN(dev->mdev, atomic) &&
MLX5_CAP_GEN(dev->mdev, umr_modify_atomic_disabled))
ret |= IB_ACCESS_REMOTE_ATOMIC;
if ((access_flags & IB_ACCESS_RELAXED_ORDERING) &&
MLX5_CAP_GEN(dev->mdev, relaxed_ordering_write) &&
!MLX5_CAP_GEN(dev->mdev, relaxed_ordering_write_umr))
ret |= IB_ACCESS_RELAXED_ORDERING;
if ((access_flags & IB_ACCESS_RELAXED_ORDERING) &&
RDMA/mlx5: Allow relaxed ordering read in VFs and VMs JIRA: https://issues.redhat.com/browse/RHEL-882 Upstream-status: v6.4-rc1 commit bd4ba605c4a92b46ab414626a4f969a19103f97a Author: Avihai Horon <avihaih@nvidia.com> Date: Mon Apr 10 16:07:53 2023 +0300 RDMA/mlx5: Allow relaxed ordering read in VFs and VMs According to PCIe spec, Enable Relaxed Ordering value in the VF's PCI config space is wired to 0 and PF relaxed ordering (RO) setting should be applied to the VF. In QEMU (and maybe others), when assigning VFs, the RO bit in PCI config space is not emulated properly and is always set to 0. Therefore, pcie_relaxed_ordering_enabled() always returns 0 for VFs and VMs and thus MKeys can't be created with RO read even if the PF supports it. pcie_relaxed_ordering_enabled() check was added to avoid a syndrome when creating a MKey with relaxed ordering (RO) enabled when the driver's relaxed_ordering_read_pci_enabled HCA capability is out of sync with FW. With the new relaxed_ordering_read capability this can't happen, as it's set regardless of RO value in PCI config space and thus can't change during runtime. Hence, to allow RO read in VFs and VMs, use the new HCA capability relaxed_ordering_read without checking pcie_relaxed_ordering_enabled(). The old capability checks are kept for backward compatibility with older FWs. Allowing RO in VFs and VMs is valuable since it can greatly improve performance on some setups. For example, testing throughput of a VF on an AMD EPYC 7763 and ConnectX-6 Dx setup showed roughly 60% performance improvement. Signed-off-by: Avihai Horon <avihaih@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Link: https://lore.kernel.org/r/e7048640d66c341a8fa0465e099926e7989184bc.1681131553.git.leon@kernel.org Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>
2023-12-13 09:06:28 +00:00
(MLX5_CAP_GEN(dev->mdev, relaxed_ordering_read) ||
MLX5_CAP_GEN(dev->mdev, relaxed_ordering_read_pci_enabled)) &&
!MLX5_CAP_GEN(dev->mdev, relaxed_ordering_read_umr))
ret |= IB_ACCESS_RELAXED_ORDERING;
return ret;
}
struct mlx5_ib_mr *mlx5_mr_cache_alloc(struct mlx5_ib_dev *dev,
int access_flags, int access_mode,
int ndescs)
{
struct mlx5r_cache_rb_key rb_key = {
.ndescs = ndescs,
.access_mode = access_mode,
.access_flags = get_unchangeable_access_flags(dev, access_flags)
};
struct mlx5_cache_ent *ent = mkey_cache_ent_from_rb_key(dev, rb_key);
if (!ent)
return ERR_PTR(-EOPNOTSUPP);
return _mlx5_mr_cache_alloc(dev, ent, access_flags);
}
static void mlx5_mkey_cache_debugfs_cleanup(struct mlx5_ib_dev *dev)
{
if (!mlx5_debugfs_root || dev->is_rep)
return;
debugfs_remove_recursive(dev->cache.fs_root);
dev->cache.fs_root = NULL;
}
static void mlx5_mkey_cache_debugfs_add_ent(struct mlx5_ib_dev *dev,
struct mlx5_cache_ent *ent)
{
int order = order_base_2(ent->rb_key.ndescs);
struct dentry *dir;
if (!mlx5_debugfs_root || dev->is_rep)
return;
if (ent->rb_key.access_mode == MLX5_MKC_ACCESS_MODE_KSM)
order = MLX5_IMR_KSM_CACHE_ENTRY + 2;
sprintf(ent->name, "%d", order);
dir = debugfs_create_dir(ent->name, dev->cache.fs_root);
debugfs_create_file("size", 0600, dir, ent, &size_fops);
debugfs_create_file("limit", 0600, dir, ent, &limit_fops);
debugfs_create_ulong("cur", 0400, dir, &ent->mkeys_queue.ci);
debugfs_create_u32("miss", 0600, dir, &ent->miss);
}
static void mlx5_mkey_cache_debugfs_init(struct mlx5_ib_dev *dev)
{
struct dentry *dbg_root = mlx5_debugfs_get_dev_root(dev->mdev);
struct mlx5_mkey_cache *cache = &dev->cache;
if (!mlx5_debugfs_root || dev->is_rep)
return;
cache->fs_root = debugfs_create_dir("mr_cache", dbg_root);
}
treewide: setup_timer() -> timer_setup() This converts all remaining cases of the old setup_timer() API into using timer_setup(), where the callback argument is the structure already holding the struct timer_list. These should have no behavioral changes, since they just change which pointer is passed into the callback with the same available pointers after conversion. It handles the following examples, in addition to some other variations. Casting from unsigned long: void my_callback(unsigned long data) { struct something *ptr = (struct something *)data; ... } ... setup_timer(&ptr->my_timer, my_callback, ptr); and forced object casts: void my_callback(struct something *ptr) { ... } ... setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr); become: void my_callback(struct timer_list *t) { struct something *ptr = from_timer(ptr, t, my_timer); ... } ... timer_setup(&ptr->my_timer, my_callback, 0); Direct function assignments: void my_callback(unsigned long data) { struct something *ptr = (struct something *)data; ... } ... ptr->my_timer.function = my_callback; have a temporary cast added, along with converting the args: void my_callback(struct timer_list *t) { struct something *ptr = from_timer(ptr, t, my_timer); ... } ... ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback; And finally, callbacks without a data assignment: void my_callback(unsigned long data) { ... } ... setup_timer(&ptr->my_timer, my_callback, 0); have their argument renamed to verify they're unused during conversion: void my_callback(struct timer_list *unused) { ... } ... timer_setup(&ptr->my_timer, my_callback, 0); The conversion is done with the following Coccinelle script: spatch --very-quiet --all-includes --include-headers \ -I ./arch/x86/include -I ./arch/x86/include/generated \ -I ./include -I ./arch/x86/include/uapi \ -I ./arch/x86/include/generated/uapi -I ./include/uapi \ -I ./include/generated/uapi --include ./include/linux/kconfig.h \ --dir . \ --cocci-file ~/src/data/timer_setup.cocci @fix_address_of@ expression e; @@ setup_timer( -&(e) +&e , ...) // Update any raw setup_timer() usages that have a NULL callback, but // would otherwise match change_timer_function_usage, since the latter // will update all function assignments done in the face of a NULL // function initialization in setup_timer(). @change_timer_function_usage_NULL@ expression _E; identifier _timer; type _cast_data; @@ ( -setup_timer(&_E->_timer, NULL, _E); +timer_setup(&_E->_timer, NULL, 0); | -setup_timer(&_E->_timer, NULL, (_cast_data)_E); +timer_setup(&_E->_timer, NULL, 0); | -setup_timer(&_E._timer, NULL, &_E); +timer_setup(&_E._timer, NULL, 0); | -setup_timer(&_E._timer, NULL, (_cast_data)&_E); +timer_setup(&_E._timer, NULL, 0); ) @change_timer_function_usage@ expression _E; identifier _timer; struct timer_list _stl; identifier _callback; type _cast_func, _cast_data; @@ ( -setup_timer(&_E->_timer, _callback, _E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, &_callback, _E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, _callback, (_cast_data)_E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, &_callback, (_cast_data)_E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, (_cast_func)_callback, _E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, (_cast_func)&_callback, _E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E._timer, _callback, (_cast_data)_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, _callback, (_cast_data)&_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, &_callback, (_cast_data)_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, &_callback, (_cast_data)&_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E); +timer_setup(&_E._timer, _callback, 0); | _E->_timer@_stl.function = _callback; | _E->_timer@_stl.function = &_callback; | _E->_timer@_stl.function = (_cast_func)_callback; | _E->_timer@_stl.function = (_cast_func)&_callback; | _E._timer@_stl.function = _callback; | _E._timer@_stl.function = &_callback; | _E._timer@_stl.function = (_cast_func)_callback; | _E._timer@_stl.function = (_cast_func)&_callback; ) // callback(unsigned long arg) @change_callback_handle_cast depends on change_timer_function_usage@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._timer; type _origtype; identifier _origarg; type _handletype; identifier _handle; @@ void _callback( -_origtype _origarg +struct timer_list *t ) { ( ... when != _origarg _handletype *_handle = -(_handletype *)_origarg; +from_timer(_handle, t, _timer); ... when != _origarg | ... when != _origarg _handletype *_handle = -(void *)_origarg; +from_timer(_handle, t, _timer); ... when != _origarg | ... when != _origarg _handletype *_handle; ... when != _handle _handle = -(_handletype *)_origarg; +from_timer(_handle, t, _timer); ... when != _origarg | ... when != _origarg _handletype *_handle; ... when != _handle _handle = -(void *)_origarg; +from_timer(_handle, t, _timer); ... when != _origarg ) } // callback(unsigned long arg) without existing variable @change_callback_handle_cast_no_arg depends on change_timer_function_usage && !change_callback_handle_cast@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._timer; type _origtype; identifier _origarg; type _handletype; @@ void _callback( -_origtype _origarg +struct timer_list *t ) { + _handletype *_origarg = from_timer(_origarg, t, _timer); + ... when != _origarg - (_handletype *)_origarg + _origarg ... when != _origarg } // Avoid already converted callbacks. @match_callback_converted depends on change_timer_function_usage && !change_callback_handle_cast && !change_callback_handle_cast_no_arg@ identifier change_timer_function_usage._callback; identifier t; @@ void _callback(struct timer_list *t) { ... } // callback(struct something *handle) @change_callback_handle_arg depends on change_timer_function_usage && !match_callback_converted && !change_callback_handle_cast && !change_callback_handle_cast_no_arg@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._timer; type _handletype; identifier _handle; @@ void _callback( -_handletype *_handle +struct timer_list *t ) { + _handletype *_handle = from_timer(_handle, t, _timer); ... } // If change_callback_handle_arg ran on an empty function, remove // the added handler. @unchange_callback_handle_arg depends on change_timer_function_usage && change_callback_handle_arg@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._timer; type _handletype; identifier _handle; identifier t; @@ void _callback(struct timer_list *t) { - _handletype *_handle = from_timer(_handle, t, _timer); } // We only want to refactor the setup_timer() data argument if we've found // the matching callback. This undoes changes in change_timer_function_usage. @unchange_timer_function_usage depends on change_timer_function_usage && !change_callback_handle_cast && !change_callback_handle_cast_no_arg && !change_callback_handle_arg@ expression change_timer_function_usage._E; identifier change_timer_function_usage._timer; identifier change_timer_function_usage._callback; type change_timer_function_usage._cast_data; @@ ( -timer_setup(&_E->_timer, _callback, 0); +setup_timer(&_E->_timer, _callback, (_cast_data)_E); | -timer_setup(&_E._timer, _callback, 0); +setup_timer(&_E._timer, _callback, (_cast_data)&_E); ) // If we fixed a callback from a .function assignment, fix the // assignment cast now. @change_timer_function_assignment depends on change_timer_function_usage && (change_callback_handle_cast || change_callback_handle_cast_no_arg || change_callback_handle_arg)@ expression change_timer_function_usage._E; identifier change_timer_function_usage._timer; identifier change_timer_function_usage._callback; type _cast_func; typedef TIMER_FUNC_TYPE; @@ ( _E->_timer.function = -_callback +(TIMER_FUNC_TYPE)_callback ; | _E->_timer.function = -&_callback +(TIMER_FUNC_TYPE)_callback ; | _E->_timer.function = -(_cast_func)_callback; +(TIMER_FUNC_TYPE)_callback ; | _E->_timer.function = -(_cast_func)&_callback +(TIMER_FUNC_TYPE)_callback ; | _E._timer.function = -_callback +(TIMER_FUNC_TYPE)_callback ; | _E._timer.function = -&_callback; +(TIMER_FUNC_TYPE)_callback ; | _E._timer.function = -(_cast_func)_callback +(TIMER_FUNC_TYPE)_callback ; | _E._timer.function = -(_cast_func)&_callback +(TIMER_FUNC_TYPE)_callback ; ) // Sometimes timer functions are called directly. Replace matched args. @change_timer_function_calls depends on change_timer_function_usage && (change_callback_handle_cast || change_callback_handle_cast_no_arg || change_callback_handle_arg)@ expression _E; identifier change_timer_function_usage._timer; identifier change_timer_function_usage._callback; type _cast_data; @@ _callback( ( -(_cast_data)_E +&_E->_timer | -(_cast_data)&_E +&_E._timer | -_E +&_E->_timer ) ) // If a timer has been configured without a data argument, it can be // converted without regard to the callback argument, since it is unused. @match_timer_function_unused_data@ expression _E; identifier _timer; identifier _callback; @@ ( -setup_timer(&_E->_timer, _callback, 0); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, _callback, 0L); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, _callback, 0UL); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E._timer, _callback, 0); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, _callback, 0L); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, _callback, 0UL); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_timer, _callback, 0); +timer_setup(&_timer, _callback, 0); | -setup_timer(&_timer, _callback, 0L); +timer_setup(&_timer, _callback, 0); | -setup_timer(&_timer, _callback, 0UL); +timer_setup(&_timer, _callback, 0); | -setup_timer(_timer, _callback, 0); +timer_setup(_timer, _callback, 0); | -setup_timer(_timer, _callback, 0L); +timer_setup(_timer, _callback, 0); | -setup_timer(_timer, _callback, 0UL); +timer_setup(_timer, _callback, 0); ) @change_callback_unused_data depends on match_timer_function_unused_data@ identifier match_timer_function_unused_data._callback; type _origtype; identifier _origarg; @@ void _callback( -_origtype _origarg +struct timer_list *unused ) { ... when != _origarg } Signed-off-by: Kees Cook <keescook@chromium.org>
2017-10-16 21:43:17 +00:00
static void delay_time_func(struct timer_list *t)
{
treewide: setup_timer() -> timer_setup() This converts all remaining cases of the old setup_timer() API into using timer_setup(), where the callback argument is the structure already holding the struct timer_list. These should have no behavioral changes, since they just change which pointer is passed into the callback with the same available pointers after conversion. It handles the following examples, in addition to some other variations. Casting from unsigned long: void my_callback(unsigned long data) { struct something *ptr = (struct something *)data; ... } ... setup_timer(&ptr->my_timer, my_callback, ptr); and forced object casts: void my_callback(struct something *ptr) { ... } ... setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr); become: void my_callback(struct timer_list *t) { struct something *ptr = from_timer(ptr, t, my_timer); ... } ... timer_setup(&ptr->my_timer, my_callback, 0); Direct function assignments: void my_callback(unsigned long data) { struct something *ptr = (struct something *)data; ... } ... ptr->my_timer.function = my_callback; have a temporary cast added, along with converting the args: void my_callback(struct timer_list *t) { struct something *ptr = from_timer(ptr, t, my_timer); ... } ... ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback; And finally, callbacks without a data assignment: void my_callback(unsigned long data) { ... } ... setup_timer(&ptr->my_timer, my_callback, 0); have their argument renamed to verify they're unused during conversion: void my_callback(struct timer_list *unused) { ... } ... timer_setup(&ptr->my_timer, my_callback, 0); The conversion is done with the following Coccinelle script: spatch --very-quiet --all-includes --include-headers \ -I ./arch/x86/include -I ./arch/x86/include/generated \ -I ./include -I ./arch/x86/include/uapi \ -I ./arch/x86/include/generated/uapi -I ./include/uapi \ -I ./include/generated/uapi --include ./include/linux/kconfig.h \ --dir . \ --cocci-file ~/src/data/timer_setup.cocci @fix_address_of@ expression e; @@ setup_timer( -&(e) +&e , ...) // Update any raw setup_timer() usages that have a NULL callback, but // would otherwise match change_timer_function_usage, since the latter // will update all function assignments done in the face of a NULL // function initialization in setup_timer(). @change_timer_function_usage_NULL@ expression _E; identifier _timer; type _cast_data; @@ ( -setup_timer(&_E->_timer, NULL, _E); +timer_setup(&_E->_timer, NULL, 0); | -setup_timer(&_E->_timer, NULL, (_cast_data)_E); +timer_setup(&_E->_timer, NULL, 0); | -setup_timer(&_E._timer, NULL, &_E); +timer_setup(&_E._timer, NULL, 0); | -setup_timer(&_E._timer, NULL, (_cast_data)&_E); +timer_setup(&_E._timer, NULL, 0); ) @change_timer_function_usage@ expression _E; identifier _timer; struct timer_list _stl; identifier _callback; type _cast_func, _cast_data; @@ ( -setup_timer(&_E->_timer, _callback, _E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, &_callback, _E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, _callback, (_cast_data)_E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, &_callback, (_cast_data)_E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, (_cast_func)_callback, _E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, (_cast_func)&_callback, _E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E._timer, _callback, (_cast_data)_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, _callback, (_cast_data)&_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, &_callback, (_cast_data)_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, &_callback, (_cast_data)&_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E); +timer_setup(&_E._timer, _callback, 0); | _E->_timer@_stl.function = _callback; | _E->_timer@_stl.function = &_callback; | _E->_timer@_stl.function = (_cast_func)_callback; | _E->_timer@_stl.function = (_cast_func)&_callback; | _E._timer@_stl.function = _callback; | _E._timer@_stl.function = &_callback; | _E._timer@_stl.function = (_cast_func)_callback; | _E._timer@_stl.function = (_cast_func)&_callback; ) // callback(unsigned long arg) @change_callback_handle_cast depends on change_timer_function_usage@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._timer; type _origtype; identifier _origarg; type _handletype; identifier _handle; @@ void _callback( -_origtype _origarg +struct timer_list *t ) { ( ... when != _origarg _handletype *_handle = -(_handletype *)_origarg; +from_timer(_handle, t, _timer); ... when != _origarg | ... when != _origarg _handletype *_handle = -(void *)_origarg; +from_timer(_handle, t, _timer); ... when != _origarg | ... when != _origarg _handletype *_handle; ... when != _handle _handle = -(_handletype *)_origarg; +from_timer(_handle, t, _timer); ... when != _origarg | ... when != _origarg _handletype *_handle; ... when != _handle _handle = -(void *)_origarg; +from_timer(_handle, t, _timer); ... when != _origarg ) } // callback(unsigned long arg) without existing variable @change_callback_handle_cast_no_arg depends on change_timer_function_usage && !change_callback_handle_cast@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._timer; type _origtype; identifier _origarg; type _handletype; @@ void _callback( -_origtype _origarg +struct timer_list *t ) { + _handletype *_origarg = from_timer(_origarg, t, _timer); + ... when != _origarg - (_handletype *)_origarg + _origarg ... when != _origarg } // Avoid already converted callbacks. @match_callback_converted depends on change_timer_function_usage && !change_callback_handle_cast && !change_callback_handle_cast_no_arg@ identifier change_timer_function_usage._callback; identifier t; @@ void _callback(struct timer_list *t) { ... } // callback(struct something *handle) @change_callback_handle_arg depends on change_timer_function_usage && !match_callback_converted && !change_callback_handle_cast && !change_callback_handle_cast_no_arg@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._timer; type _handletype; identifier _handle; @@ void _callback( -_handletype *_handle +struct timer_list *t ) { + _handletype *_handle = from_timer(_handle, t, _timer); ... } // If change_callback_handle_arg ran on an empty function, remove // the added handler. @unchange_callback_handle_arg depends on change_timer_function_usage && change_callback_handle_arg@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._timer; type _handletype; identifier _handle; identifier t; @@ void _callback(struct timer_list *t) { - _handletype *_handle = from_timer(_handle, t, _timer); } // We only want to refactor the setup_timer() data argument if we've found // the matching callback. This undoes changes in change_timer_function_usage. @unchange_timer_function_usage depends on change_timer_function_usage && !change_callback_handle_cast && !change_callback_handle_cast_no_arg && !change_callback_handle_arg@ expression change_timer_function_usage._E; identifier change_timer_function_usage._timer; identifier change_timer_function_usage._callback; type change_timer_function_usage._cast_data; @@ ( -timer_setup(&_E->_timer, _callback, 0); +setup_timer(&_E->_timer, _callback, (_cast_data)_E); | -timer_setup(&_E._timer, _callback, 0); +setup_timer(&_E._timer, _callback, (_cast_data)&_E); ) // If we fixed a callback from a .function assignment, fix the // assignment cast now. @change_timer_function_assignment depends on change_timer_function_usage && (change_callback_handle_cast || change_callback_handle_cast_no_arg || change_callback_handle_arg)@ expression change_timer_function_usage._E; identifier change_timer_function_usage._timer; identifier change_timer_function_usage._callback; type _cast_func; typedef TIMER_FUNC_TYPE; @@ ( _E->_timer.function = -_callback +(TIMER_FUNC_TYPE)_callback ; | _E->_timer.function = -&_callback +(TIMER_FUNC_TYPE)_callback ; | _E->_timer.function = -(_cast_func)_callback; +(TIMER_FUNC_TYPE)_callback ; | _E->_timer.function = -(_cast_func)&_callback +(TIMER_FUNC_TYPE)_callback ; | _E._timer.function = -_callback +(TIMER_FUNC_TYPE)_callback ; | _E._timer.function = -&_callback; +(TIMER_FUNC_TYPE)_callback ; | _E._timer.function = -(_cast_func)_callback +(TIMER_FUNC_TYPE)_callback ; | _E._timer.function = -(_cast_func)&_callback +(TIMER_FUNC_TYPE)_callback ; ) // Sometimes timer functions are called directly. Replace matched args. @change_timer_function_calls depends on change_timer_function_usage && (change_callback_handle_cast || change_callback_handle_cast_no_arg || change_callback_handle_arg)@ expression _E; identifier change_timer_function_usage._timer; identifier change_timer_function_usage._callback; type _cast_data; @@ _callback( ( -(_cast_data)_E +&_E->_timer | -(_cast_data)&_E +&_E._timer | -_E +&_E->_timer ) ) // If a timer has been configured without a data argument, it can be // converted without regard to the callback argument, since it is unused. @match_timer_function_unused_data@ expression _E; identifier _timer; identifier _callback; @@ ( -setup_timer(&_E->_timer, _callback, 0); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, _callback, 0L); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, _callback, 0UL); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E._timer, _callback, 0); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, _callback, 0L); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, _callback, 0UL); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_timer, _callback, 0); +timer_setup(&_timer, _callback, 0); | -setup_timer(&_timer, _callback, 0L); +timer_setup(&_timer, _callback, 0); | -setup_timer(&_timer, _callback, 0UL); +timer_setup(&_timer, _callback, 0); | -setup_timer(_timer, _callback, 0); +timer_setup(_timer, _callback, 0); | -setup_timer(_timer, _callback, 0L); +timer_setup(_timer, _callback, 0); | -setup_timer(_timer, _callback, 0UL); +timer_setup(_timer, _callback, 0); ) @change_callback_unused_data depends on match_timer_function_unused_data@ identifier match_timer_function_unused_data._callback; type _origtype; identifier _origarg; @@ void _callback( -_origtype _origarg +struct timer_list *unused ) { ... when != _origarg } Signed-off-by: Kees Cook <keescook@chromium.org>
2017-10-16 21:43:17 +00:00
struct mlx5_ib_dev *dev = from_timer(dev, t, delay_timer);
WRITE_ONCE(dev->fill_delay, 0);
}
static int mlx5r_mkeys_init(struct mlx5_cache_ent *ent)
{
struct mlx5_mkeys_page *page;
page = kzalloc(sizeof(*page), GFP_KERNEL);
if (!page)
return -ENOMEM;
INIT_LIST_HEAD(&ent->mkeys_queue.pages_list);
spin_lock_init(&ent->mkeys_queue.lock);
list_add_tail(&page->list, &ent->mkeys_queue.pages_list);
ent->mkeys_queue.num_pages++;
return 0;
}
static void mlx5r_mkeys_uninit(struct mlx5_cache_ent *ent)
{
struct mlx5_mkeys_page *page;
WARN_ON(ent->mkeys_queue.ci || ent->mkeys_queue.num_pages > 1);
page = list_last_entry(&ent->mkeys_queue.pages_list,
struct mlx5_mkeys_page, list);
list_del(&page->list);
kfree(page);
}
struct mlx5_cache_ent *
mlx5r_cache_create_ent_locked(struct mlx5_ib_dev *dev,
struct mlx5r_cache_rb_key rb_key,
bool persistent_entry)
{
struct mlx5_cache_ent *ent;
int order;
int ret;
ent = kzalloc(sizeof(*ent), GFP_KERNEL);
if (!ent)
return ERR_PTR(-ENOMEM);
ret = mlx5r_mkeys_init(ent);
if (ret)
goto mkeys_err;
ent->rb_key = rb_key;
ent->dev = dev;
ent->is_tmp = !persistent_entry;
INIT_DELAYED_WORK(&ent->dwork, delayed_cache_work_func);
ret = mlx5_cache_ent_insert(&dev->cache, ent);
if (ret)
goto ent_insert_err;
if (persistent_entry) {
if (rb_key.access_mode == MLX5_MKC_ACCESS_MODE_KSM)
order = MLX5_IMR_KSM_CACHE_ENTRY;
else
order = order_base_2(rb_key.ndescs) - 2;
if ((dev->mdev->profile.mask & MLX5_PROF_MASK_MR_CACHE) &&
!dev->is_rep && mlx5_core_is_pf(dev->mdev) &&
mlx5r_umr_can_load_pas(dev, 0))
ent->limit = dev->mdev->profile.mr_cache[order].limit;
else
ent->limit = 0;
mlx5_mkey_cache_debugfs_add_ent(dev, ent);
}
return ent;
ent_insert_err:
mlx5r_mkeys_uninit(ent);
mkeys_err:
kfree(ent);
return ERR_PTR(ret);
}
int mlx5_mkey_cache_init(struct mlx5_ib_dev *dev)
{
struct mlx5_mkey_cache *cache = &dev->cache;
struct rb_root *root = &dev->cache.rb_root;
struct mlx5r_cache_rb_key rb_key = {
.access_mode = MLX5_MKC_ACCESS_MODE_MTT,
};
struct mlx5_cache_ent *ent;
struct rb_node *node;
int ret;
int i;
mutex_init(&dev->slow_path_mutex);
mutex_init(&dev->cache.rb_lock);
dev->cache.rb_root = RB_ROOT;
cache->wq = alloc_ordered_workqueue("mkey_cache", WQ_MEM_RECLAIM);
if (!cache->wq) {
mlx5_ib_warn(dev, "failed to create work queue\n");
return -ENOMEM;
}
mlx5_cmd_init_async_ctx(dev->mdev, &dev->async_ctx);
treewide: setup_timer() -> timer_setup() This converts all remaining cases of the old setup_timer() API into using timer_setup(), where the callback argument is the structure already holding the struct timer_list. These should have no behavioral changes, since they just change which pointer is passed into the callback with the same available pointers after conversion. It handles the following examples, in addition to some other variations. Casting from unsigned long: void my_callback(unsigned long data) { struct something *ptr = (struct something *)data; ... } ... setup_timer(&ptr->my_timer, my_callback, ptr); and forced object casts: void my_callback(struct something *ptr) { ... } ... setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr); become: void my_callback(struct timer_list *t) { struct something *ptr = from_timer(ptr, t, my_timer); ... } ... timer_setup(&ptr->my_timer, my_callback, 0); Direct function assignments: void my_callback(unsigned long data) { struct something *ptr = (struct something *)data; ... } ... ptr->my_timer.function = my_callback; have a temporary cast added, along with converting the args: void my_callback(struct timer_list *t) { struct something *ptr = from_timer(ptr, t, my_timer); ... } ... ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback; And finally, callbacks without a data assignment: void my_callback(unsigned long data) { ... } ... setup_timer(&ptr->my_timer, my_callback, 0); have their argument renamed to verify they're unused during conversion: void my_callback(struct timer_list *unused) { ... } ... timer_setup(&ptr->my_timer, my_callback, 0); The conversion is done with the following Coccinelle script: spatch --very-quiet --all-includes --include-headers \ -I ./arch/x86/include -I ./arch/x86/include/generated \ -I ./include -I ./arch/x86/include/uapi \ -I ./arch/x86/include/generated/uapi -I ./include/uapi \ -I ./include/generated/uapi --include ./include/linux/kconfig.h \ --dir . \ --cocci-file ~/src/data/timer_setup.cocci @fix_address_of@ expression e; @@ setup_timer( -&(e) +&e , ...) // Update any raw setup_timer() usages that have a NULL callback, but // would otherwise match change_timer_function_usage, since the latter // will update all function assignments done in the face of a NULL // function initialization in setup_timer(). @change_timer_function_usage_NULL@ expression _E; identifier _timer; type _cast_data; @@ ( -setup_timer(&_E->_timer, NULL, _E); +timer_setup(&_E->_timer, NULL, 0); | -setup_timer(&_E->_timer, NULL, (_cast_data)_E); +timer_setup(&_E->_timer, NULL, 0); | -setup_timer(&_E._timer, NULL, &_E); +timer_setup(&_E._timer, NULL, 0); | -setup_timer(&_E._timer, NULL, (_cast_data)&_E); +timer_setup(&_E._timer, NULL, 0); ) @change_timer_function_usage@ expression _E; identifier _timer; struct timer_list _stl; identifier _callback; type _cast_func, _cast_data; @@ ( -setup_timer(&_E->_timer, _callback, _E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, &_callback, _E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, _callback, (_cast_data)_E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, &_callback, (_cast_data)_E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, (_cast_func)_callback, _E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, (_cast_func)&_callback, _E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E._timer, _callback, (_cast_data)_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, _callback, (_cast_data)&_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, &_callback, (_cast_data)_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, &_callback, (_cast_data)&_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E); +timer_setup(&_E._timer, _callback, 0); | _E->_timer@_stl.function = _callback; | _E->_timer@_stl.function = &_callback; | _E->_timer@_stl.function = (_cast_func)_callback; | _E->_timer@_stl.function = (_cast_func)&_callback; | _E._timer@_stl.function = _callback; | _E._timer@_stl.function = &_callback; | _E._timer@_stl.function = (_cast_func)_callback; | _E._timer@_stl.function = (_cast_func)&_callback; ) // callback(unsigned long arg) @change_callback_handle_cast depends on change_timer_function_usage@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._timer; type _origtype; identifier _origarg; type _handletype; identifier _handle; @@ void _callback( -_origtype _origarg +struct timer_list *t ) { ( ... when != _origarg _handletype *_handle = -(_handletype *)_origarg; +from_timer(_handle, t, _timer); ... when != _origarg | ... when != _origarg _handletype *_handle = -(void *)_origarg; +from_timer(_handle, t, _timer); ... when != _origarg | ... when != _origarg _handletype *_handle; ... when != _handle _handle = -(_handletype *)_origarg; +from_timer(_handle, t, _timer); ... when != _origarg | ... when != _origarg _handletype *_handle; ... when != _handle _handle = -(void *)_origarg; +from_timer(_handle, t, _timer); ... when != _origarg ) } // callback(unsigned long arg) without existing variable @change_callback_handle_cast_no_arg depends on change_timer_function_usage && !change_callback_handle_cast@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._timer; type _origtype; identifier _origarg; type _handletype; @@ void _callback( -_origtype _origarg +struct timer_list *t ) { + _handletype *_origarg = from_timer(_origarg, t, _timer); + ... when != _origarg - (_handletype *)_origarg + _origarg ... when != _origarg } // Avoid already converted callbacks. @match_callback_converted depends on change_timer_function_usage && !change_callback_handle_cast && !change_callback_handle_cast_no_arg@ identifier change_timer_function_usage._callback; identifier t; @@ void _callback(struct timer_list *t) { ... } // callback(struct something *handle) @change_callback_handle_arg depends on change_timer_function_usage && !match_callback_converted && !change_callback_handle_cast && !change_callback_handle_cast_no_arg@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._timer; type _handletype; identifier _handle; @@ void _callback( -_handletype *_handle +struct timer_list *t ) { + _handletype *_handle = from_timer(_handle, t, _timer); ... } // If change_callback_handle_arg ran on an empty function, remove // the added handler. @unchange_callback_handle_arg depends on change_timer_function_usage && change_callback_handle_arg@ identifier change_timer_function_usage._callback; identifier change_timer_function_usage._timer; type _handletype; identifier _handle; identifier t; @@ void _callback(struct timer_list *t) { - _handletype *_handle = from_timer(_handle, t, _timer); } // We only want to refactor the setup_timer() data argument if we've found // the matching callback. This undoes changes in change_timer_function_usage. @unchange_timer_function_usage depends on change_timer_function_usage && !change_callback_handle_cast && !change_callback_handle_cast_no_arg && !change_callback_handle_arg@ expression change_timer_function_usage._E; identifier change_timer_function_usage._timer; identifier change_timer_function_usage._callback; type change_timer_function_usage._cast_data; @@ ( -timer_setup(&_E->_timer, _callback, 0); +setup_timer(&_E->_timer, _callback, (_cast_data)_E); | -timer_setup(&_E._timer, _callback, 0); +setup_timer(&_E._timer, _callback, (_cast_data)&_E); ) // If we fixed a callback from a .function assignment, fix the // assignment cast now. @change_timer_function_assignment depends on change_timer_function_usage && (change_callback_handle_cast || change_callback_handle_cast_no_arg || change_callback_handle_arg)@ expression change_timer_function_usage._E; identifier change_timer_function_usage._timer; identifier change_timer_function_usage._callback; type _cast_func; typedef TIMER_FUNC_TYPE; @@ ( _E->_timer.function = -_callback +(TIMER_FUNC_TYPE)_callback ; | _E->_timer.function = -&_callback +(TIMER_FUNC_TYPE)_callback ; | _E->_timer.function = -(_cast_func)_callback; +(TIMER_FUNC_TYPE)_callback ; | _E->_timer.function = -(_cast_func)&_callback +(TIMER_FUNC_TYPE)_callback ; | _E._timer.function = -_callback +(TIMER_FUNC_TYPE)_callback ; | _E._timer.function = -&_callback; +(TIMER_FUNC_TYPE)_callback ; | _E._timer.function = -(_cast_func)_callback +(TIMER_FUNC_TYPE)_callback ; | _E._timer.function = -(_cast_func)&_callback +(TIMER_FUNC_TYPE)_callback ; ) // Sometimes timer functions are called directly. Replace matched args. @change_timer_function_calls depends on change_timer_function_usage && (change_callback_handle_cast || change_callback_handle_cast_no_arg || change_callback_handle_arg)@ expression _E; identifier change_timer_function_usage._timer; identifier change_timer_function_usage._callback; type _cast_data; @@ _callback( ( -(_cast_data)_E +&_E->_timer | -(_cast_data)&_E +&_E._timer | -_E +&_E->_timer ) ) // If a timer has been configured without a data argument, it can be // converted without regard to the callback argument, since it is unused. @match_timer_function_unused_data@ expression _E; identifier _timer; identifier _callback; @@ ( -setup_timer(&_E->_timer, _callback, 0); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, _callback, 0L); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E->_timer, _callback, 0UL); +timer_setup(&_E->_timer, _callback, 0); | -setup_timer(&_E._timer, _callback, 0); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, _callback, 0L); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_E._timer, _callback, 0UL); +timer_setup(&_E._timer, _callback, 0); | -setup_timer(&_timer, _callback, 0); +timer_setup(&_timer, _callback, 0); | -setup_timer(&_timer, _callback, 0L); +timer_setup(&_timer, _callback, 0); | -setup_timer(&_timer, _callback, 0UL); +timer_setup(&_timer, _callback, 0); | -setup_timer(_timer, _callback, 0); +timer_setup(_timer, _callback, 0); | -setup_timer(_timer, _callback, 0L); +timer_setup(_timer, _callback, 0); | -setup_timer(_timer, _callback, 0UL); +timer_setup(_timer, _callback, 0); ) @change_callback_unused_data depends on match_timer_function_unused_data@ identifier match_timer_function_unused_data._callback; type _origtype; identifier _origarg; @@ void _callback( -_origtype _origarg +struct timer_list *unused ) { ... when != _origarg } Signed-off-by: Kees Cook <keescook@chromium.org>
2017-10-16 21:43:17 +00:00
timer_setup(&dev->delay_timer, delay_time_func, 0);
mlx5_mkey_cache_debugfs_init(dev);
mutex_lock(&cache->rb_lock);
for (i = 0; i <= mkey_cache_max_order(dev); i++) {
RDMA/mlx5: Limit usage of over-sized mkeys from the MR cache JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit ee6d57a2e13d11ce9050cfc3e3b69ef707a44a63 Author: Michael Guralnik <michaelgur@nvidia.com> Date: Tue Sep 3 14:24:49 2024 +0300 RDMA/mlx5: Limit usage of over-sized mkeys from the MR cache When searching the MR cache for suitable cache entries, don't use mkeys larger than twice the size required for the MR. This should ensure the usage of mkeys closer to the minimal required size and reduce memory waste. On driver init we create entries for mkeys with clear attributes and powers of 2 sizes from 4 to the max supported size. This solves the issue for anyone using mkeys that fit these requirements. In the use case where an MR is registered with different attributes, like an access flag we can't UMR, we'll create a new cache entry to store it upon dereg. Without this fix, any later registration with same attributes and smaller size will use the newly created cache entry and it's mkeys, disregarding the memory waste of using mkeys larger than required. For example, one worst-case scenario can be when registering and deregistering a 1GB mkey with ATS enabled which will cause the creation of a new cache entry to hold those type of mkeys. A user registering a 4k MR with ATS will end up using the new cache entry and an mkey that can support a 1GB MR, thus wasting x250k memory than actually needed in the HW. Additionally, allow all small registration to use the smallest size cache entry that is initialized on driver load even if size is larger than twice the required size. Fixes: 73d09b2fe833 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/8ba3a6e3748aace2026de8b83da03aba084f78f4.1725362530.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
rb_key.ndescs = MLX5_MR_CACHE_PERSISTENT_ENTRY_MIN_DESCS << i;
ent = mlx5r_cache_create_ent_locked(dev, rb_key, true);
if (IS_ERR(ent)) {
ret = PTR_ERR(ent);
goto err;
}
}
ret = mlx5_odp_init_mkey_cache(dev);
if (ret)
goto err;
mutex_unlock(&cache->rb_lock);
for (node = rb_first(root); node; node = rb_next(node)) {
ent = rb_entry(node, struct mlx5_cache_ent, node);
spin_lock_irq(&ent->mkeys_queue.lock);
queue_adjust_cache_locked(ent);
spin_unlock_irq(&ent->mkeys_queue.lock);
}
return 0;
err:
mutex_unlock(&cache->rb_lock);
mlx5_mkey_cache_debugfs_cleanup(dev);
mlx5_ib_warn(dev, "failed to create mkey cache entry\n");
return ret;
}
void mlx5_mkey_cache_cleanup(struct mlx5_ib_dev *dev)
{
struct rb_root *root = &dev->cache.rb_root;
struct mlx5_cache_ent *ent;
struct rb_node *node;
if (!dev->cache.wq)
return;
mutex_lock(&dev->cache.rb_lock);
for (node = rb_first(root); node; node = rb_next(node)) {
ent = rb_entry(node, struct mlx5_cache_ent, node);
spin_lock_irq(&ent->mkeys_queue.lock);
ent->disabled = true;
spin_unlock_irq(&ent->mkeys_queue.lock);
cancel_delayed_work(&ent->dwork);
}
RDMA/mlx5: Fix mkey cache possible deadlock on cleanup JIRA: https://issues.redhat.com/browse/RHEL-882 Upstream-status: v6.6-rc5 commit 374012b0045780b7ad498be62e85153009bb7fe9 Author: Shay Drory <shayd@nvidia.com> Date: Tue Sep 12 13:07:45 2023 +0300 RDMA/mlx5: Fix mkey cache possible deadlock on cleanup Fix the deadlock by refactoring the MR cache cleanup flow to flush the workqueue without holding the rb_lock. This adds a race between cache cleanup and creation of new entries which we solve by denied creation of new entries after cache cleanup started. Lockdep: WARNING: possible circular locking dependency detected [ 2785.326074 ] 6.2.0-rc6_for_upstream_debug_2023_01_31_14_02 #1 Not tainted [ 2785.339778 ] ------------------------------------------------------ [ 2785.340848 ] devlink/53872 is trying to acquire lock: [ 2785.341701 ] ffff888124f8c0c8 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0xc8/0x900 [ 2785.343403 ] [ 2785.343403 ] but task is already holding lock: [ 2785.344464 ] ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib] [ 2785.346273 ] [ 2785.346273 ] which lock already depends on the new lock. [ 2785.346273 ] [ 2785.347720 ] [ 2785.347720 ] the existing dependency chain (in reverse order) is: [ 2785.349003 ] [ 2785.349003 ] -> #1 (&dev->cache.rb_lock){+.+.}-{3:3}: [ 2785.350160 ] __mutex_lock+0x14c/0x15c0 [ 2785.350962 ] delayed_cache_work_func+0x2d1/0x610 [mlx5_ib] [ 2785.352044 ] process_one_work+0x7c2/0x1310 [ 2785.352879 ] worker_thread+0x59d/0xec0 [ 2785.353636 ] kthread+0x28f/0x330 [ 2785.354370 ] ret_from_fork+0x1f/0x30 [ 2785.355135 ] [ 2785.355135 ] -> #0 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}: [ 2785.356515 ] __lock_acquire+0x2d8a/0x5fe0 [ 2785.357349 ] lock_acquire+0x1c1/0x540 [ 2785.358121 ] __flush_work+0xe8/0x900 [ 2785.358852 ] __cancel_work_timer+0x2c7/0x3f0 [ 2785.359711 ] mlx5_mkey_cache_cleanup+0xfb/0x250 [mlx5_ib] [ 2785.360781 ] mlx5_ib_stage_pre_ib_reg_umr_cleanup+0x16/0x30 [mlx5_ib] [ 2785.361969 ] __mlx5_ib_remove+0x68/0x120 [mlx5_ib] [ 2785.362960 ] mlx5r_remove+0x63/0x80 [mlx5_ib] [ 2785.363870 ] auxiliary_bus_remove+0x52/0x70 [ 2785.364715 ] device_release_driver_internal+0x3c1/0x600 [ 2785.365695 ] bus_remove_device+0x2a5/0x560 [ 2785.366525 ] device_del+0x492/0xb80 [ 2785.367276 ] mlx5_detach_device+0x1a9/0x360 [mlx5_core] [ 2785.368615 ] mlx5_unload_one_devl_locked+0x5a/0x110 [mlx5_core] [ 2785.369934 ] mlx5_devlink_reload_down+0x292/0x580 [mlx5_core] [ 2785.371292 ] devlink_reload+0x439/0x590 [ 2785.372075 ] devlink_nl_cmd_reload+0xaef/0xff0 [ 2785.372973 ] genl_family_rcv_msg_doit.isra.0+0x1bd/0x290 [ 2785.374011 ] genl_rcv_msg+0x3ca/0x6c0 [ 2785.374798 ] netlink_rcv_skb+0x12c/0x360 [ 2785.375612 ] genl_rcv+0x24/0x40 [ 2785.376295 ] netlink_unicast+0x438/0x710 [ 2785.377121 ] netlink_sendmsg+0x7a1/0xca0 [ 2785.377926 ] sock_sendmsg+0xc5/0x190 [ 2785.378668 ] __sys_sendto+0x1bc/0x290 [ 2785.379440 ] __x64_sys_sendto+0xdc/0x1b0 [ 2785.380255 ] do_syscall_64+0x3d/0x90 [ 2785.381031 ] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 2785.381967 ] [ 2785.381967 ] other info that might help us debug this: [ 2785.381967 ] [ 2785.383448 ] Possible unsafe locking scenario: [ 2785.383448 ] [ 2785.384544 ] CPU0 CPU1 [ 2785.385383 ] ---- ---- [ 2785.386193 ] lock(&dev->cache.rb_lock); [ 2785.386940 ] lock((work_completion)(&(&ent->dwork)->work)); [ 2785.388327 ] lock(&dev->cache.rb_lock); [ 2785.389425 ] lock((work_completion)(&(&ent->dwork)->work)); [ 2785.390414 ] [ 2785.390414 ] *** DEADLOCK *** [ 2785.390414 ] [ 2785.391579 ] 6 locks held by devlink/53872: [ 2785.392341 ] #0: ffffffff84c17a50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40 [ 2785.393630 ] #1: ffff888142280218 (&devlink->lock_key){+.+.}-{3:3}, at: devlink_get_from_attrs_lock+0x12d/0x2d0 [ 2785.395324 ] #2: ffff8881422d3c38 (&dev->lock_key){+.+.}-{3:3}, at: mlx5_unload_one_devl_locked+0x4a/0x110 [mlx5_core] [ 2785.397322 ] #3: ffffffffa0e59068 (mlx5_intf_mutex){+.+.}-{3:3}, at: mlx5_detach_device+0x60/0x360 [mlx5_core] [ 2785.399231 ] #4: ffff88810e3cb0e8 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x8d/0x600 [ 2785.400864 ] #5: ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib] Fixes: b95845178328 ("RDMA/mlx5: Change the cache structure to an RB-tree") Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>
2023-12-13 09:12:12 +00:00
mutex_unlock(&dev->cache.rb_lock);
/*
* After all entries are disabled and will not reschedule on WQ,
* flush it and all async commands.
*/
flush_workqueue(dev->cache.wq);
mlx5_mkey_cache_debugfs_cleanup(dev);
mlx5_cmd_cleanup_async_ctx(&dev->async_ctx);
RDMA/mlx5: Fix mkey cache possible deadlock on cleanup JIRA: https://issues.redhat.com/browse/RHEL-882 Upstream-status: v6.6-rc5 commit 374012b0045780b7ad498be62e85153009bb7fe9 Author: Shay Drory <shayd@nvidia.com> Date: Tue Sep 12 13:07:45 2023 +0300 RDMA/mlx5: Fix mkey cache possible deadlock on cleanup Fix the deadlock by refactoring the MR cache cleanup flow to flush the workqueue without holding the rb_lock. This adds a race between cache cleanup and creation of new entries which we solve by denied creation of new entries after cache cleanup started. Lockdep: WARNING: possible circular locking dependency detected [ 2785.326074 ] 6.2.0-rc6_for_upstream_debug_2023_01_31_14_02 #1 Not tainted [ 2785.339778 ] ------------------------------------------------------ [ 2785.340848 ] devlink/53872 is trying to acquire lock: [ 2785.341701 ] ffff888124f8c0c8 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0xc8/0x900 [ 2785.343403 ] [ 2785.343403 ] but task is already holding lock: [ 2785.344464 ] ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib] [ 2785.346273 ] [ 2785.346273 ] which lock already depends on the new lock. [ 2785.346273 ] [ 2785.347720 ] [ 2785.347720 ] the existing dependency chain (in reverse order) is: [ 2785.349003 ] [ 2785.349003 ] -> #1 (&dev->cache.rb_lock){+.+.}-{3:3}: [ 2785.350160 ] __mutex_lock+0x14c/0x15c0 [ 2785.350962 ] delayed_cache_work_func+0x2d1/0x610 [mlx5_ib] [ 2785.352044 ] process_one_work+0x7c2/0x1310 [ 2785.352879 ] worker_thread+0x59d/0xec0 [ 2785.353636 ] kthread+0x28f/0x330 [ 2785.354370 ] ret_from_fork+0x1f/0x30 [ 2785.355135 ] [ 2785.355135 ] -> #0 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}: [ 2785.356515 ] __lock_acquire+0x2d8a/0x5fe0 [ 2785.357349 ] lock_acquire+0x1c1/0x540 [ 2785.358121 ] __flush_work+0xe8/0x900 [ 2785.358852 ] __cancel_work_timer+0x2c7/0x3f0 [ 2785.359711 ] mlx5_mkey_cache_cleanup+0xfb/0x250 [mlx5_ib] [ 2785.360781 ] mlx5_ib_stage_pre_ib_reg_umr_cleanup+0x16/0x30 [mlx5_ib] [ 2785.361969 ] __mlx5_ib_remove+0x68/0x120 [mlx5_ib] [ 2785.362960 ] mlx5r_remove+0x63/0x80 [mlx5_ib] [ 2785.363870 ] auxiliary_bus_remove+0x52/0x70 [ 2785.364715 ] device_release_driver_internal+0x3c1/0x600 [ 2785.365695 ] bus_remove_device+0x2a5/0x560 [ 2785.366525 ] device_del+0x492/0xb80 [ 2785.367276 ] mlx5_detach_device+0x1a9/0x360 [mlx5_core] [ 2785.368615 ] mlx5_unload_one_devl_locked+0x5a/0x110 [mlx5_core] [ 2785.369934 ] mlx5_devlink_reload_down+0x292/0x580 [mlx5_core] [ 2785.371292 ] devlink_reload+0x439/0x590 [ 2785.372075 ] devlink_nl_cmd_reload+0xaef/0xff0 [ 2785.372973 ] genl_family_rcv_msg_doit.isra.0+0x1bd/0x290 [ 2785.374011 ] genl_rcv_msg+0x3ca/0x6c0 [ 2785.374798 ] netlink_rcv_skb+0x12c/0x360 [ 2785.375612 ] genl_rcv+0x24/0x40 [ 2785.376295 ] netlink_unicast+0x438/0x710 [ 2785.377121 ] netlink_sendmsg+0x7a1/0xca0 [ 2785.377926 ] sock_sendmsg+0xc5/0x190 [ 2785.378668 ] __sys_sendto+0x1bc/0x290 [ 2785.379440 ] __x64_sys_sendto+0xdc/0x1b0 [ 2785.380255 ] do_syscall_64+0x3d/0x90 [ 2785.381031 ] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 2785.381967 ] [ 2785.381967 ] other info that might help us debug this: [ 2785.381967 ] [ 2785.383448 ] Possible unsafe locking scenario: [ 2785.383448 ] [ 2785.384544 ] CPU0 CPU1 [ 2785.385383 ] ---- ---- [ 2785.386193 ] lock(&dev->cache.rb_lock); [ 2785.386940 ] lock((work_completion)(&(&ent->dwork)->work)); [ 2785.388327 ] lock(&dev->cache.rb_lock); [ 2785.389425 ] lock((work_completion)(&(&ent->dwork)->work)); [ 2785.390414 ] [ 2785.390414 ] *** DEADLOCK *** [ 2785.390414 ] [ 2785.391579 ] 6 locks held by devlink/53872: [ 2785.392341 ] #0: ffffffff84c17a50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40 [ 2785.393630 ] #1: ffff888142280218 (&devlink->lock_key){+.+.}-{3:3}, at: devlink_get_from_attrs_lock+0x12d/0x2d0 [ 2785.395324 ] #2: ffff8881422d3c38 (&dev->lock_key){+.+.}-{3:3}, at: mlx5_unload_one_devl_locked+0x4a/0x110 [mlx5_core] [ 2785.397322 ] #3: ffffffffa0e59068 (mlx5_intf_mutex){+.+.}-{3:3}, at: mlx5_detach_device+0x60/0x360 [mlx5_core] [ 2785.399231 ] #4: ffff88810e3cb0e8 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x8d/0x600 [ 2785.400864 ] #5: ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib] Fixes: b95845178328 ("RDMA/mlx5: Change the cache structure to an RB-tree") Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>
2023-12-13 09:12:12 +00:00
/* At this point all entries are disabled and have no concurrent work. */
mutex_lock(&dev->cache.rb_lock);
node = rb_first(root);
while (node) {
ent = rb_entry(node, struct mlx5_cache_ent, node);
node = rb_next(node);
clean_keys(dev, ent);
rb_erase(&ent->node, root);
mlx5r_mkeys_uninit(ent);
kfree(ent);
}
mutex_unlock(&dev->cache.rb_lock);
destroy_workqueue(dev->cache.wq);
del_timer_sync(&dev->delay_timer);
}
struct ib_mr *mlx5_ib_get_dma_mr(struct ib_pd *pd, int acc)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
struct mlx5_ib_mr *mr;
void *mkc;
u32 *in;
int err;
mr = kzalloc(sizeof(*mr), GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);
in = kzalloc(inlen, GFP_KERNEL);
if (!in) {
err = -ENOMEM;
goto err_free;
}
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_PA);
MLX5_SET(mkc, mkc, length64, 1);
set_mkc_access_pd_addr_fields(mkc, acc | IB_ACCESS_RELAXED_ORDERING, 0,
pd);
MLX5_SET(mkc, mkc, ma_translation_mode, MLX5_CAP_GEN(dev->mdev, ats));
err = mlx5_ib_create_mkey(dev, &mr->mmkey, in, inlen);
if (err)
goto err_in;
kfree(in);
mr->mmkey.type = MLX5_MKEY_MR;
mr->ibmr.lkey = mr->mmkey.key;
mr->ibmr.rkey = mr->mmkey.key;
mr->umem = NULL;
return &mr->ibmr;
err_in:
kfree(in);
err_free:
kfree(mr);
return ERR_PTR(err);
}
static int get_octo_len(u64 addr, u64 len, int page_shift)
{
u64 page_size = 1ULL << page_shift;
u64 offset;
int npages;
offset = addr & (page_size - 1);
npages = ALIGN(len + offset, page_size) >> page_shift;
return (npages + 1) / 2;
}
static int mkey_cache_max_order(struct mlx5_ib_dev *dev)
{
if (MLX5_CAP_GEN(dev->mdev, umr_extended_translation_offset))
return MKEY_CACHE_LAST_STD_ENTRY;
return MLX5_MAX_UMR_SHIFT;
}
static void set_mr_fields(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
u64 length, int access_flags, u64 iova)
{
mr->ibmr.lkey = mr->mmkey.key;
mr->ibmr.rkey = mr->mmkey.key;
mr->ibmr.length = length;
mr->ibmr.device = &dev->ib_dev;
mr->ibmr.iova = iova;
mr->access_flags = access_flags;
}
static unsigned int mlx5_umem_dmabuf_default_pgsz(struct ib_umem *umem,
u64 iova)
{
/*
* The alignment of iova has already been checked upon entering
* UVERBS_METHOD_REG_DMABUF_MR
*/
umem->iova = iova;
return PAGE_SIZE;
}
static struct mlx5_ib_mr *alloc_cacheable_mr(struct ib_pd *pd,
struct ib_umem *umem, u64 iova,
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
int access_flags, int access_mode)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
struct mlx5r_cache_rb_key rb_key = {};
struct mlx5_cache_ent *ent;
struct mlx5_ib_mr *mr;
unsigned int page_size;
if (umem->is_dmabuf)
page_size = mlx5_umem_dmabuf_default_pgsz(umem, iova);
else
page_size = mlx5_umem_mkc_find_best_pgsz(dev, umem, iova);
if (WARN_ON(!page_size))
return ERR_PTR(-EINVAL);
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
rb_key.access_mode = access_mode;
rb_key.ndescs = ib_umem_num_dma_blocks(umem, page_size);
rb_key.ats = mlx5_umem_needs_ats(dev, umem, access_flags);
rb_key.access_flags = get_unchangeable_access_flags(dev, access_flags);
ent = mkey_cache_ent_from_rb_key(dev, rb_key);
/*
2023-05-17 11:54:18 +00:00
* If the MR can't come from the cache then synchronously create an uncached
* one.
*/
2023-05-17 11:54:18 +00:00
if (!ent) {
mutex_lock(&dev->slow_path_mutex);
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
mr = reg_create(pd, umem, iova, access_flags, page_size, false, access_mode);
mutex_unlock(&dev->slow_path_mutex);
if (IS_ERR(mr))
return mr;
2023-05-17 11:54:18 +00:00
mr->mmkey.rb_key = rb_key;
mr->mmkey.cacheable = true;
return mr;
}
RDMA/mlx5: Clarify what the UMR is for when creating MRs Once a mkey is created it can be modified using UMR. This is desirable for performance reasons. However, different hardware has restrictions on what modifications are possible using UMR. Make sense of these checks: - mlx5_ib_can_reconfig_with_umr() returns true if the access flags can be altered. Most cases create MRs using 0 access flags (now made clear by consistent use of set_mkc_access_pd_addr_fields()), but the old logic here was tormented. Make it clear that this is checking if the current access_flags can be modified using UMR to different access_flags. It is always OK to use UMR to change flags that all HW supports. - mlx5_ib_can_load_pas_with_umr() returns true if UMR can be used to enable and update the PAS/XLT. Enabling requires updating the entity size, so UMR ends up completely disabled on this old hardware. Make it clear why it is disabled. FRWR, ODP and cache always requires mlx5_ib_can_load_pas_with_umr(). - mlx5_ib_pas_fits_in_mr() is used to tell if an existing MR can be resized to hold a new PAS list. This only works for cached MR's because we don't store the PAS list size in other cases. To be very clear, arrange things so any pre-created MR's in the cache check the newly requested access_flags before allowing the MR to leave the cache. If UMR cannot set the required access_flags the cache fails to create the MR. This in turn means relaxed ordering and atomic are now correctly blocked early for implicit ODP on older HW. Link: https://lore.kernel.org/r/20200914112653.345244-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-14 11:26:53 +00:00
mr = _mlx5_mr_cache_alloc(dev, ent, access_flags);
if (IS_ERR(mr))
return mr;
mr->ibmr.pd = pd;
mr->umem = umem;
mr->page_shift = order_base_2(page_size);
set_mr_fields(dev, mr, umem->length, access_flags, iova);
return mr;
}
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
static struct ib_mr *
reg_create_crossing_vhca_mr(struct ib_pd *pd, u64 iova, u64 length, int access_flags,
u32 crossed_lkey)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
int access_mode = MLX5_MKC_ACCESS_MODE_CROSSING;
struct mlx5_ib_mr *mr;
void *mkc;
int inlen;
u32 *in;
int err;
if (!MLX5_CAP_GEN(dev->mdev, crossing_vhca_mkey))
return ERR_PTR(-EOPNOTSUPP);
mr = kzalloc(sizeof(*mr), GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);
inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
in = kvzalloc(inlen, GFP_KERNEL);
if (!in) {
err = -ENOMEM;
goto err_1;
}
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
MLX5_SET(mkc, mkc, crossing_target_vhca_id,
MLX5_CAP_GEN(dev->mdev, vhca_id));
MLX5_SET(mkc, mkc, translations_octword_size, crossed_lkey);
MLX5_SET(mkc, mkc, access_mode_1_0, access_mode & 0x3);
MLX5_SET(mkc, mkc, access_mode_4_2, (access_mode >> 2) & 0x7);
/* for this crossing mkey IOVA should be 0 and len should be IOVA + len */
set_mkc_access_pd_addr_fields(mkc, access_flags, 0, pd);
MLX5_SET64(mkc, mkc, len, iova + length);
MLX5_SET(mkc, mkc, free, 0);
MLX5_SET(mkc, mkc, umr_en, 0);
err = mlx5_ib_create_mkey(dev, &mr->mmkey, in, inlen);
if (err)
goto err_2;
mr->mmkey.type = MLX5_MKEY_MR;
set_mr_fields(dev, mr, length, access_flags, iova);
mr->ibmr.pd = pd;
kvfree(in);
mlx5_ib_dbg(dev, "crossing mkey = 0x%x\n", mr->mmkey.key);
return &mr->ibmr;
err_2:
kvfree(in);
err_1:
kfree(mr);
return ERR_PTR(err);
}
/*
* If ibmr is NULL it will be allocated by reg_create.
* Else, the given ibmr will be used.
*/
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, struct ib_umem *umem,
u64 iova, int access_flags,
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
unsigned int page_size, bool populate,
int access_mode)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
struct mlx5_ib_mr *mr;
__be64 *pas;
void *mkc;
int inlen;
u32 *in;
int err;
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
bool pg_cap = !!(MLX5_CAP_GEN(dev->mdev, pg)) &&
(access_mode == MLX5_MKC_ACCESS_MODE_MTT);
bool ksm_mode = (access_mode == MLX5_MKC_ACCESS_MODE_KSM);
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
if (!page_size)
return ERR_PTR(-EINVAL);
mr = kzalloc(sizeof(*mr), GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);
mr->ibmr.pd = pd;
mr->access_flags = access_flags;
mr->page_shift = order_base_2(page_size);
inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
if (populate)
inlen += sizeof(*pas) *
roundup(ib_umem_num_dma_blocks(umem, page_size), 2);
in = kvzalloc(inlen, GFP_KERNEL);
if (!in) {
err = -ENOMEM;
goto err_1;
}
pas = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
RDMA/mlx5: Clarify what the UMR is for when creating MRs Once a mkey is created it can be modified using UMR. This is desirable for performance reasons. However, different hardware has restrictions on what modifications are possible using UMR. Make sense of these checks: - mlx5_ib_can_reconfig_with_umr() returns true if the access flags can be altered. Most cases create MRs using 0 access flags (now made clear by consistent use of set_mkc_access_pd_addr_fields()), but the old logic here was tormented. Make it clear that this is checking if the current access_flags can be modified using UMR to different access_flags. It is always OK to use UMR to change flags that all HW supports. - mlx5_ib_can_load_pas_with_umr() returns true if UMR can be used to enable and update the PAS/XLT. Enabling requires updating the entity size, so UMR ends up completely disabled on this old hardware. Make it clear why it is disabled. FRWR, ODP and cache always requires mlx5_ib_can_load_pas_with_umr(). - mlx5_ib_pas_fits_in_mr() is used to tell if an existing MR can be resized to hold a new PAS list. This only works for cached MR's because we don't store the PAS list size in other cases. To be very clear, arrange things so any pre-created MR's in the cache check the newly requested access_flags before allowing the MR to leave the cache. If UMR cannot set the required access_flags the cache fails to create the MR. This in turn means relaxed ordering and atomic are now correctly blocked early for implicit ODP on older HW. Link: https://lore.kernel.org/r/20200914112653.345244-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-14 11:26:53 +00:00
if (populate) {
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
if (WARN_ON(access_flags & IB_ACCESS_ON_DEMAND || ksm_mode)) {
RDMA/mlx5: Clarify what the UMR is for when creating MRs Once a mkey is created it can be modified using UMR. This is desirable for performance reasons. However, different hardware has restrictions on what modifications are possible using UMR. Make sense of these checks: - mlx5_ib_can_reconfig_with_umr() returns true if the access flags can be altered. Most cases create MRs using 0 access flags (now made clear by consistent use of set_mkc_access_pd_addr_fields()), but the old logic here was tormented. Make it clear that this is checking if the current access_flags can be modified using UMR to different access_flags. It is always OK to use UMR to change flags that all HW supports. - mlx5_ib_can_load_pas_with_umr() returns true if UMR can be used to enable and update the PAS/XLT. Enabling requires updating the entity size, so UMR ends up completely disabled on this old hardware. Make it clear why it is disabled. FRWR, ODP and cache always requires mlx5_ib_can_load_pas_with_umr(). - mlx5_ib_pas_fits_in_mr() is used to tell if an existing MR can be resized to hold a new PAS list. This only works for cached MR's because we don't store the PAS list size in other cases. To be very clear, arrange things so any pre-created MR's in the cache check the newly requested access_flags before allowing the MR to leave the cache. If UMR cannot set the required access_flags the cache fails to create the MR. This in turn means relaxed ordering and atomic are now correctly blocked early for implicit ODP on older HW. Link: https://lore.kernel.org/r/20200914112653.345244-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-14 11:26:53 +00:00
err = -EINVAL;
goto err_2;
}
mlx5_ib_populate_pas(umem, 1UL << mr->page_shift, pas,
pg_cap ? MLX5_IB_MTT_PRESENT : 0);
RDMA/mlx5: Clarify what the UMR is for when creating MRs Once a mkey is created it can be modified using UMR. This is desirable for performance reasons. However, different hardware has restrictions on what modifications are possible using UMR. Make sense of these checks: - mlx5_ib_can_reconfig_with_umr() returns true if the access flags can be altered. Most cases create MRs using 0 access flags (now made clear by consistent use of set_mkc_access_pd_addr_fields()), but the old logic here was tormented. Make it clear that this is checking if the current access_flags can be modified using UMR to different access_flags. It is always OK to use UMR to change flags that all HW supports. - mlx5_ib_can_load_pas_with_umr() returns true if UMR can be used to enable and update the PAS/XLT. Enabling requires updating the entity size, so UMR ends up completely disabled on this old hardware. Make it clear why it is disabled. FRWR, ODP and cache always requires mlx5_ib_can_load_pas_with_umr(). - mlx5_ib_pas_fits_in_mr() is used to tell if an existing MR can be resized to hold a new PAS list. This only works for cached MR's because we don't store the PAS list size in other cases. To be very clear, arrange things so any pre-created MR's in the cache check the newly requested access_flags before allowing the MR to leave the cache. If UMR cannot set the required access_flags the cache fails to create the MR. This in turn means relaxed ordering and atomic are now correctly blocked early for implicit ODP on older HW. Link: https://lore.kernel.org/r/20200914112653.345244-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-14 11:26:53 +00:00
}
/* The pg_access bit allows setting the access flags
* in the page list submitted with the command.
*/
MLX5_SET(create_mkey_in, in, pg_access, !!(pg_cap));
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
set_mkc_access_pd_addr_fields(mkc, access_flags, iova,
populate ? pd : dev->umrc.pd);
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
/* In case a data direct flow, overwrite the pdn field by its internal kernel PD */
if (umem->is_dmabuf && ksm_mode)
MLX5_SET(mkc, mkc, pd, dev->ddr.pdn);
MLX5_SET(mkc, mkc, free, !populate);
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
MLX5_SET(mkc, mkc, access_mode_1_0, access_mode);
MLX5_SET(mkc, mkc, umr_en, 1);
MLX5_SET64(mkc, mkc, len, umem->length);
MLX5_SET(mkc, mkc, bsf_octword_size, 0);
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
if (ksm_mode)
MLX5_SET(mkc, mkc, translations_octword_size,
get_octo_len(iova, umem->length, mr->page_shift) * 2);
else
MLX5_SET(mkc, mkc, translations_octword_size,
get_octo_len(iova, umem->length, mr->page_shift));
MLX5_SET(mkc, mkc, log_page_size, mr->page_shift);
if (mlx5_umem_needs_ats(dev, umem, access_flags))
MLX5_SET(mkc, mkc, ma_translation_mode, 1);
if (populate) {
MLX5_SET(create_mkey_in, in, translations_octword_actual_size,
get_octo_len(iova, umem->length, mr->page_shift));
}
err = mlx5_ib_create_mkey(dev, &mr->mmkey, in, inlen);
if (err) {
mlx5_ib_warn(dev, "create mkey failed\n");
goto err_2;
}
mr->mmkey.type = MLX5_MKEY_MR;
2023-05-17 11:54:18 +00:00
mr->mmkey.ndescs = get_octo_len(iova, umem->length, mr->page_shift);
mr->umem = umem;
set_mr_fields(dev, mr, umem->length, access_flags, iova);
kvfree(in);
mlx5_ib_dbg(dev, "mkey = 0x%x\n", mr->mmkey.key);
return mr;
err_2:
kvfree(in);
err_1:
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
kfree(mr);
return ERR_PTR(err);
}
static struct ib_mr *mlx5_ib_get_dm_mr(struct ib_pd *pd, u64 start_addr,
u64 length, int acc, int mode)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
struct mlx5_ib_mr *mr;
void *mkc;
u32 *in;
int err;
mr = kzalloc(sizeof(*mr), GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);
in = kzalloc(inlen, GFP_KERNEL);
if (!in) {
err = -ENOMEM;
goto err_free;
}
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
MLX5_SET(mkc, mkc, access_mode_1_0, mode & 0x3);
MLX5_SET(mkc, mkc, access_mode_4_2, (mode >> 2) & 0x7);
MLX5_SET64(mkc, mkc, len, length);
set_mkc_access_pd_addr_fields(mkc, acc, start_addr, pd);
err = mlx5_ib_create_mkey(dev, &mr->mmkey, in, inlen);
if (err)
goto err_in;
kfree(in);
set_mr_fields(dev, mr, length, acc, start_addr);
return &mr->ibmr;
err_in:
kfree(in);
err_free:
kfree(mr);
return ERR_PTR(err);
}
int mlx5_ib_advise_mr(struct ib_pd *pd,
enum ib_uverbs_advise_mr_advice advice,
u32 flags,
struct ib_sge *sg_list,
u32 num_sge,
struct uverbs_attr_bundle *attrs)
{
if (advice != IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH &&
advice != IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH_WRITE &&
advice != IB_UVERBS_ADVISE_MR_ADVICE_PREFETCH_NO_FAULT)
return -EOPNOTSUPP;
return mlx5_ib_advise_mr_prefetch(pd, advice, flags,
sg_list, num_sge);
}
struct ib_mr *mlx5_ib_reg_dm_mr(struct ib_pd *pd, struct ib_dm *dm,
struct ib_dm_mr_attr *attr,
struct uverbs_attr_bundle *attrs)
{
struct mlx5_ib_dm *mdm = to_mdm(dm);
struct mlx5_core_dev *dev = to_mdev(dm->device)->mdev;
u64 start_addr = mdm->dev_addr + attr->offset;
int mode;
switch (mdm->type) {
case MLX5_IB_UAPI_DM_TYPE_MEMIC:
if (attr->access_flags & ~MLX5_IB_DM_MEMIC_ALLOWED_ACCESS)
return ERR_PTR(-EINVAL);
mode = MLX5_MKC_ACCESS_MODE_MEMIC;
start_addr -= pci_resource_start(dev->pdev, 0);
break;
case MLX5_IB_UAPI_DM_TYPE_STEERING_SW_ICM:
case MLX5_IB_UAPI_DM_TYPE_HEADER_MODIFY_SW_ICM:
case MLX5_IB_UAPI_DM_TYPE_HEADER_MODIFY_PATTERN_SW_ICM:
case MLX5_IB_UAPI_DM_TYPE_ENCAP_SW_ICM:
if (attr->access_flags & ~MLX5_IB_DM_SW_ICM_ALLOWED_ACCESS)
return ERR_PTR(-EINVAL);
mode = MLX5_MKC_ACCESS_MODE_SW_ICM;
break;
default:
return ERR_PTR(-EINVAL);
}
return mlx5_ib_get_dm_mr(pd, start_addr, attr->length,
attr->access_flags, mode);
}
static struct ib_mr *create_real_mr(struct ib_pd *pd, struct ib_umem *umem,
u64 iova, int access_flags)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
struct mlx5_ib_mr *mr = NULL;
RDMA/mlx5: Clarify what the UMR is for when creating MRs Once a mkey is created it can be modified using UMR. This is desirable for performance reasons. However, different hardware has restrictions on what modifications are possible using UMR. Make sense of these checks: - mlx5_ib_can_reconfig_with_umr() returns true if the access flags can be altered. Most cases create MRs using 0 access flags (now made clear by consistent use of set_mkc_access_pd_addr_fields()), but the old logic here was tormented. Make it clear that this is checking if the current access_flags can be modified using UMR to different access_flags. It is always OK to use UMR to change flags that all HW supports. - mlx5_ib_can_load_pas_with_umr() returns true if UMR can be used to enable and update the PAS/XLT. Enabling requires updating the entity size, so UMR ends up completely disabled on this old hardware. Make it clear why it is disabled. FRWR, ODP and cache always requires mlx5_ib_can_load_pas_with_umr(). - mlx5_ib_pas_fits_in_mr() is used to tell if an existing MR can be resized to hold a new PAS list. This only works for cached MR's because we don't store the PAS list size in other cases. To be very clear, arrange things so any pre-created MR's in the cache check the newly requested access_flags before allowing the MR to leave the cache. If UMR cannot set the required access_flags the cache fails to create the MR. This in turn means relaxed ordering and atomic are now correctly blocked early for implicit ODP on older HW. Link: https://lore.kernel.org/r/20200914112653.345244-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-14 11:26:53 +00:00
bool xlt_with_umr;
int err;
xlt_with_umr = mlx5r_umr_can_load_pas(dev, umem->length);
RDMA/mlx5: Clarify what the UMR is for when creating MRs Once a mkey is created it can be modified using UMR. This is desirable for performance reasons. However, different hardware has restrictions on what modifications are possible using UMR. Make sense of these checks: - mlx5_ib_can_reconfig_with_umr() returns true if the access flags can be altered. Most cases create MRs using 0 access flags (now made clear by consistent use of set_mkc_access_pd_addr_fields()), but the old logic here was tormented. Make it clear that this is checking if the current access_flags can be modified using UMR to different access_flags. It is always OK to use UMR to change flags that all HW supports. - mlx5_ib_can_load_pas_with_umr() returns true if UMR can be used to enable and update the PAS/XLT. Enabling requires updating the entity size, so UMR ends up completely disabled on this old hardware. Make it clear why it is disabled. FRWR, ODP and cache always requires mlx5_ib_can_load_pas_with_umr(). - mlx5_ib_pas_fits_in_mr() is used to tell if an existing MR can be resized to hold a new PAS list. This only works for cached MR's because we don't store the PAS list size in other cases. To be very clear, arrange things so any pre-created MR's in the cache check the newly requested access_flags before allowing the MR to leave the cache. If UMR cannot set the required access_flags the cache fails to create the MR. This in turn means relaxed ordering and atomic are now correctly blocked early for implicit ODP on older HW. Link: https://lore.kernel.org/r/20200914112653.345244-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-14 11:26:53 +00:00
if (xlt_with_umr) {
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
mr = alloc_cacheable_mr(pd, umem, iova, access_flags,
MLX5_MKC_ACCESS_MODE_MTT);
} else {
unsigned int page_size =
mlx5_umem_mkc_find_best_pgsz(dev, umem, iova);
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
mutex_lock(&dev->slow_path_mutex);
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
mr = reg_create(pd, umem, iova, access_flags, page_size,
true, MLX5_MKC_ACCESS_MODE_MTT);
mutex_unlock(&dev->slow_path_mutex);
}
if (IS_ERR(mr)) {
ib_umem_release(umem);
return ERR_CAST(mr);
}
mlx5_ib_dbg(dev, "mkey 0x%x\n", mr->mmkey.key);
atomic_add(ib_umem_num_pages(umem), &dev->mdev->priv.reg_pages);
if (xlt_with_umr) {
RDMA/mlx5: Clarify what the UMR is for when creating MRs Once a mkey is created it can be modified using UMR. This is desirable for performance reasons. However, different hardware has restrictions on what modifications are possible using UMR. Make sense of these checks: - mlx5_ib_can_reconfig_with_umr() returns true if the access flags can be altered. Most cases create MRs using 0 access flags (now made clear by consistent use of set_mkc_access_pd_addr_fields()), but the old logic here was tormented. Make it clear that this is checking if the current access_flags can be modified using UMR to different access_flags. It is always OK to use UMR to change flags that all HW supports. - mlx5_ib_can_load_pas_with_umr() returns true if UMR can be used to enable and update the PAS/XLT. Enabling requires updating the entity size, so UMR ends up completely disabled on this old hardware. Make it clear why it is disabled. FRWR, ODP and cache always requires mlx5_ib_can_load_pas_with_umr(). - mlx5_ib_pas_fits_in_mr() is used to tell if an existing MR can be resized to hold a new PAS list. This only works for cached MR's because we don't store the PAS list size in other cases. To be very clear, arrange things so any pre-created MR's in the cache check the newly requested access_flags before allowing the MR to leave the cache. If UMR cannot set the required access_flags the cache fails to create the MR. This in turn means relaxed ordering and atomic are now correctly blocked early for implicit ODP on older HW. Link: https://lore.kernel.org/r/20200914112653.345244-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-14 11:26:53 +00:00
/*
* If the MR was created with reg_create then it will be
* configured properly but left disabled. It is safe to go ahead
* and configure it again via UMR while enabling it.
*/
err = mlx5r_umr_update_mr_pas(mr, MLX5_IB_UPD_XLT_ENABLE);
if (err) {
mlx5_ib_dereg_mr(&mr->ibmr, NULL);
return ERR_PTR(err);
}
}
return &mr->ibmr;
}
static struct ib_mr *create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length,
u64 iova, int access_flags,
struct ib_udata *udata)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
struct ib_umem_odp *odp;
struct mlx5_ib_mr *mr;
int err;
if (!IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING))
return ERR_PTR(-EOPNOTSUPP);
err = mlx5r_odp_create_eq(dev, &dev->odp_pf_eq);
if (err)
return ERR_PTR(err);
if (!start && length == U64_MAX) {
if (iova != 0)
return ERR_PTR(-EINVAL);
if (!(dev->odp_caps.general_caps & IB_ODP_SUPPORT_IMPLICIT))
return ERR_PTR(-EINVAL);
mr = mlx5_ib_alloc_implicit_mr(to_mpd(pd), access_flags);
if (IS_ERR(mr))
return ERR_CAST(mr);
return &mr->ibmr;
}
/* ODP requires xlt update via umr to work. */
if (!mlx5r_umr_can_load_pas(dev, length))
return ERR_PTR(-EINVAL);
odp = ib_umem_odp_get(&dev->ib_dev, start, length, access_flags,
&mlx5_mn_ops);
if (IS_ERR(odp))
return ERR_CAST(odp);
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
mr = alloc_cacheable_mr(pd, &odp->umem, iova, access_flags,
MLX5_MKC_ACCESS_MODE_MTT);
if (IS_ERR(mr)) {
ib_umem_release(&odp->umem);
return ERR_CAST(mr);
}
RDMA/mlx5: Initialize the ODP xarray when creating an ODP MR Bugzilla: https://bugzilla.redhat.com/2049447 Upstream-status: v5.15 commit 5508546631a0f555d7088203dec2614e41b5106e Author: Aharon Landau <aharonl@nvidia.com> Date: Tue Oct 19 14:30:10 2021 +0300 RDMA/mlx5: Initialize the ODP xarray when creating an ODP MR Normally the zero fill would hide the missing initialization, but an errant set to desc_size in reg_create() causes a crash: BUG: unable to handle page fault for address: 0000000800000000 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 5 PID: 890 Comm: ib_write_bw Not tainted 5.15.0-rc4+ #47 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5_ib_dereg_mr+0x14/0x3b0 [mlx5_ib] Code: 48 63 cd 4c 89 f7 48 89 0c 24 e8 37 30 03 e1 48 8b 0c 24 eb a0 90 0f 1f 44 00 00 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 30 <48> 8b 2f 65 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 8b 87 c8 RSP: 0018:ffff88811afa3a60 EFLAGS: 00010286 RAX: 000000000000001c RBX: 0000000800000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000800000000 RBP: 0000000800000000 R08: 0000000000000000 R09: c0000000fffff7ff R10: ffff88811afa38f8 R11: ffff88811afa38f0 R12: ffffffffa02c7ac0 R13: 0000000000000000 R14: ffff88811afa3cd8 R15: ffff88810772fa00 FS: 00007f47b9080740(0000) GS:ffff88852cd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000800000000 CR3: 000000010761e003 CR4: 0000000000370ea0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: mlx5_ib_free_odp_mr+0x95/0xc0 [mlx5_ib] mlx5_ib_dereg_mr+0x128/0x3b0 [mlx5_ib] ib_dereg_mr_user+0x45/0xb0 [ib_core] ? xas_load+0x8/0x80 destroy_hw_idr_uobject+0x1a/0x50 [ib_uverbs] uverbs_destroy_uobject+0x2f/0x150 [ib_uverbs] uobj_destroy+0x3c/0x70 [ib_uverbs] ib_uverbs_cmd_verbs+0x467/0xb00 [ib_uverbs] ? uverbs_finalize_object+0x60/0x60 [ib_uverbs] ? ttwu_queue_wakelist+0xa9/0xe0 ? pty_write+0x85/0x90 ? file_tty_write.isra.33+0x214/0x330 ? process_echoes+0x60/0x60 ib_uverbs_ioctl+0xa7/0x110 [ib_uverbs] __x64_sys_ioctl+0x10d/0x8e0 ? vfs_write+0x17f/0x260 do_syscall_64+0x3c/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Add the missing xarray initialization and remove the desc_size set. Fixes: a639e66703ee ("RDMA/mlx5: Zero out ODP related items in the mlx5_ib_mr") Link: https://lore.kernel.org/r/a4846a11c9de834663e521770da895007f9f0d30.1634642730.git.leonro@nvidia.com Signed-off-by: Aharon Landau <aharonl@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>
2022-08-11 10:22:34 +00:00
xa_init(&mr->implicit_children);
odp->private = mr;
err = mlx5r_store_odp_mkey(dev, &mr->mmkey);
if (err)
goto err_dereg_mr;
err = mlx5_ib_init_odp_mr(mr);
if (err)
goto err_dereg_mr;
return &mr->ibmr;
err_dereg_mr:
mlx5_ib_dereg_mr(&mr->ibmr, NULL);
return ERR_PTR(err);
}
struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
u64 iova, int access_flags,
struct ib_udata *udata)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
struct ib_umem *umem;
int err;
if (!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM))
return ERR_PTR(-EOPNOTSUPP);
mlx5_ib_dbg(dev, "start 0x%llx, iova 0x%llx, length 0x%llx, access_flags 0x%x\n",
start, iova, length, access_flags);
err = mlx5r_umr_resource_init(dev);
if (err)
return ERR_PTR(err);
if (access_flags & IB_ACCESS_ON_DEMAND)
return create_user_odp_mr(pd, start, length, iova, access_flags,
udata);
umem = ib_umem_get(&dev->ib_dev, start, length, access_flags);
if (IS_ERR(umem))
return ERR_CAST(umem);
return create_real_mr(pd, umem, iova, access_flags);
}
static void mlx5_ib_dmabuf_invalidate_cb(struct dma_buf_attachment *attach)
{
struct ib_umem_dmabuf *umem_dmabuf = attach->importer_priv;
struct mlx5_ib_mr *mr = umem_dmabuf->private;
dma_resv_assert_held(umem_dmabuf->attach->dmabuf->resv);
if (!umem_dmabuf->sgt)
return;
mlx5r_umr_update_mr_pas(mr, MLX5_IB_UPD_XLT_ZAP);
ib_umem_dmabuf_unmap_pages(umem_dmabuf);
}
static struct dma_buf_attach_ops mlx5_ib_dmabuf_attach_ops = {
.allow_peer2peer = 1,
.move_notify = mlx5_ib_dmabuf_invalidate_cb,
};
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
static struct ib_mr *
reg_user_mr_dmabuf(struct ib_pd *pd, struct device *dma_device,
u64 offset, u64 length, u64 virt_addr,
int fd, int access_flags, int access_mode)
{
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
bool pinned_mode = (access_mode == MLX5_MKC_ACCESS_MODE_KSM);
struct mlx5_ib_dev *dev = to_mdev(pd->device);
struct mlx5_ib_mr *mr = NULL;
struct ib_umem_dmabuf *umem_dmabuf;
int err;
err = mlx5r_umr_resource_init(dev);
if (err)
return ERR_PTR(err);
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
if (!pinned_mode)
umem_dmabuf = ib_umem_dmabuf_get(&dev->ib_dev,
offset, length, fd,
access_flags,
&mlx5_ib_dmabuf_attach_ops);
else
umem_dmabuf = ib_umem_dmabuf_get_pinned_with_dma_device(&dev->ib_dev,
dma_device, offset, length,
fd, access_flags);
if (IS_ERR(umem_dmabuf)) {
mlx5_ib_dbg(dev, "umem_dmabuf get failed (%ld)\n",
PTR_ERR(umem_dmabuf));
return ERR_CAST(umem_dmabuf);
}
mr = alloc_cacheable_mr(pd, &umem_dmabuf->umem, virt_addr,
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
access_flags, access_mode);
if (IS_ERR(mr)) {
ib_umem_release(&umem_dmabuf->umem);
return ERR_CAST(mr);
}
mlx5_ib_dbg(dev, "mkey 0x%x\n", mr->mmkey.key);
atomic_add(ib_umem_num_pages(mr->umem), &dev->mdev->priv.reg_pages);
umem_dmabuf->private = mr;
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
if (!pinned_mode) {
err = mlx5r_store_odp_mkey(dev, &mr->mmkey);
if (err)
goto err_dereg_mr;
} else {
mr->data_direct = true;
}
err = mlx5_ib_init_dmabuf_mr(mr);
if (err)
goto err_dereg_mr;
return &mr->ibmr;
err_dereg_mr:
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
__mlx5_ib_dereg_mr(&mr->ibmr);
return ERR_PTR(err);
}
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
static struct ib_mr *
reg_user_mr_dmabuf_by_data_direct(struct ib_pd *pd, u64 offset,
u64 length, u64 virt_addr,
int fd, int access_flags)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
struct mlx5_data_direct_dev *data_direct_dev;
struct ib_mr *crossing_mr;
struct ib_mr *crossed_mr;
int ret = 0;
/* As of HW behaviour the IOVA must be page aligned in KSM mode */
if (!PAGE_ALIGNED(virt_addr) || (access_flags & IB_ACCESS_ON_DEMAND))
return ERR_PTR(-EOPNOTSUPP);
mutex_lock(&dev->data_direct_lock);
data_direct_dev = dev->data_direct_dev;
if (!data_direct_dev) {
ret = -EINVAL;
goto end;
}
/* The device's 'data direct mkey' was created without RO flags to
* simplify things and allow for a single mkey per device.
* Since RO is not a must, mask it out accordingly.
*/
access_flags &= ~IB_ACCESS_RELAXED_ORDERING;
crossed_mr = reg_user_mr_dmabuf(pd, &data_direct_dev->pdev->dev,
offset, length, virt_addr, fd,
access_flags, MLX5_MKC_ACCESS_MODE_KSM);
if (IS_ERR(crossed_mr)) {
ret = PTR_ERR(crossed_mr);
goto end;
}
mutex_lock(&dev->slow_path_mutex);
crossing_mr = reg_create_crossing_vhca_mr(pd, virt_addr, length, access_flags,
crossed_mr->lkey);
mutex_unlock(&dev->slow_path_mutex);
if (IS_ERR(crossing_mr)) {
__mlx5_ib_dereg_mr(crossed_mr);
ret = PTR_ERR(crossing_mr);
goto end;
}
list_add_tail(&to_mmr(crossed_mr)->dd_node, &dev->data_direct_mr_list);
to_mmr(crossing_mr)->dd_crossed_mr = to_mmr(crossed_mr);
to_mmr(crossing_mr)->data_direct = true;
end:
mutex_unlock(&dev->data_direct_lock);
return ret ? ERR_PTR(ret) : crossing_mr;
}
struct ib_mr *mlx5_ib_reg_user_mr_dmabuf(struct ib_pd *pd, u64 offset,
u64 length, u64 virt_addr,
int fd, int access_flags,
struct uverbs_attr_bundle *attrs)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
int mlx5_access_flags = 0;
int err;
if (!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM) ||
!IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING))
return ERR_PTR(-EOPNOTSUPP);
if (uverbs_attr_is_valid(attrs, MLX5_IB_ATTR_REG_DMABUF_MR_ACCESS_FLAGS)) {
err = uverbs_get_flags32(&mlx5_access_flags, attrs,
MLX5_IB_ATTR_REG_DMABUF_MR_ACCESS_FLAGS,
MLX5_IB_UAPI_REG_DMABUF_ACCESS_DATA_DIRECT);
if (err)
return ERR_PTR(err);
}
mlx5_ib_dbg(dev,
"offset 0x%llx, virt_addr 0x%llx, length 0x%llx, fd %d, access_flags 0x%x, mlx5_access_flags 0x%x\n",
offset, virt_addr, length, fd, access_flags, mlx5_access_flags);
/* dmabuf requires xlt update via umr to work. */
if (!mlx5r_umr_can_load_pas(dev, length))
return ERR_PTR(-EINVAL);
if (mlx5_access_flags & MLX5_IB_UAPI_REG_DMABUF_ACCESS_DATA_DIRECT)
return reg_user_mr_dmabuf_by_data_direct(pd, offset, length, virt_addr,
fd, access_flags);
return reg_user_mr_dmabuf(pd, pd->device->dma_device,
offset, length, virt_addr,
fd, access_flags, MLX5_MKC_ACCESS_MODE_MTT);
}
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
/*
* True if the change in access flags can be done via UMR, only some access
* flags can be updated.
*/
static bool can_use_umr_rereg_access(struct mlx5_ib_dev *dev,
unsigned int current_access_flags,
unsigned int target_access_flags)
{
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
unsigned int diffs = current_access_flags ^ target_access_flags;
if (diffs & ~(IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE |
IB_ACCESS_REMOTE_READ | IB_ACCESS_RELAXED_ORDERING |
IB_ACCESS_REMOTE_ATOMIC))
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
return false;
return mlx5r_umr_can_reconfig(dev, current_access_flags,
target_access_flags);
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
}
static bool can_use_umr_rereg_pas(struct mlx5_ib_mr *mr,
struct ib_umem *new_umem,
int new_access_flags, u64 iova,
unsigned long *page_size)
{
struct mlx5_ib_dev *dev = to_mdev(mr->ibmr.device);
/* We only track the allocated sizes of MRs from the cache */
if (!mr->mmkey.cache_ent)
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
return false;
if (!mlx5r_umr_can_load_pas(dev, new_umem->length))
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
return false;
*page_size = mlx5_umem_mkc_find_best_pgsz(dev, new_umem, iova);
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
if (WARN_ON(!*page_size))
return false;
return (mr->mmkey.cache_ent->rb_key.ndescs) >=
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
ib_umem_num_dma_blocks(new_umem, *page_size);
}
static int umr_rereg_pas(struct mlx5_ib_mr *mr, struct ib_pd *pd,
int access_flags, int flags, struct ib_umem *new_umem,
u64 iova, unsigned long page_size)
{
struct mlx5_ib_dev *dev = to_mdev(mr->ibmr.device);
int upd_flags = MLX5_IB_UPD_XLT_ADDR | MLX5_IB_UPD_XLT_ENABLE;
struct ib_umem *old_umem = mr->umem;
int err;
/*
* To keep everything simple the MR is revoked before we start to mess
* with it. This ensure the change is atomic relative to any use of the
* MR.
*/
err = mlx5r_umr_revoke_mr(mr);
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
if (err)
return err;
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
if (flags & IB_MR_REREG_PD) {
mr->ibmr.pd = pd;
upd_flags |= MLX5_IB_UPD_XLT_PD;
}
if (flags & IB_MR_REREG_ACCESS) {
mr->access_flags = access_flags;
upd_flags |= MLX5_IB_UPD_XLT_ACCESS;
}
mr->ibmr.iova = iova;
mr->ibmr.length = new_umem->length;
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
mr->page_shift = order_base_2(page_size);
mr->umem = new_umem;
err = mlx5r_umr_update_mr_pas(mr, upd_flags);
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
if (err) {
/*
* The MR is revoked at this point so there is no issue to free
* new_umem.
*/
mr->umem = old_umem;
return err;
}
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
atomic_sub(ib_umem_num_pages(old_umem), &dev->mdev->priv.reg_pages);
ib_umem_release(old_umem);
atomic_add(ib_umem_num_pages(new_umem), &dev->mdev->priv.reg_pages);
return 0;
}
struct ib_mr *mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start,
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
u64 length, u64 iova, int new_access_flags,
struct ib_pd *new_pd,
struct ib_udata *udata)
{
struct mlx5_ib_dev *dev = to_mdev(ib_mr->device);
struct mlx5_ib_mr *mr = to_mmr(ib_mr);
int err;
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
if (!IS_ENABLED(CONFIG_INFINIBAND_USER_MEM) || mr->data_direct)
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
return ERR_PTR(-EOPNOTSUPP);
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
mlx5_ib_dbg(
dev,
"start 0x%llx, iova 0x%llx, length 0x%llx, access_flags 0x%x\n",
start, iova, length, new_access_flags);
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
if (flags & ~(IB_MR_REREG_TRANS | IB_MR_REREG_PD | IB_MR_REREG_ACCESS))
return ERR_PTR(-EOPNOTSUPP);
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
if (!(flags & IB_MR_REREG_ACCESS))
new_access_flags = mr->access_flags;
if (!(flags & IB_MR_REREG_PD))
new_pd = ib_mr->pd;
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
if (!(flags & IB_MR_REREG_TRANS)) {
struct ib_umem *umem;
/* Fast path for PD/access change */
if (can_use_umr_rereg_access(dev, mr->access_flags,
new_access_flags)) {
err = mlx5r_umr_rereg_pd_access(mr, new_pd,
new_access_flags);
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
if (err)
return ERR_PTR(err);
return NULL;
}
/* DM or ODP MR's don't have a normal umem so we can't re-use it */
if (!mr->umem || is_odp_mr(mr) || is_dmabuf_mr(mr))
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
goto recreate;
/*
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
* Only one active MR can refer to a umem at one time, revoke
* the old MR before assigning the umem to the new one.
*/
err = mlx5r_umr_revoke_mr(mr);
if (err)
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
return ERR_PTR(err);
umem = mr->umem;
mr->umem = NULL;
atomic_sub(ib_umem_num_pages(umem), &dev->mdev->priv.reg_pages);
return create_real_mr(new_pd, umem, mr->ibmr.iova,
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
new_access_flags);
}
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
/*
* DM doesn't have a PAS list so we can't re-use it, odp/dmabuf does
* but the logic around releasing the umem is different
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
*/
if (!mr->umem || is_odp_mr(mr) || is_dmabuf_mr(mr))
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
goto recreate;
if (!(new_access_flags & IB_ACCESS_ON_DEMAND) &&
can_use_umr_rereg_access(dev, mr->access_flags, new_access_flags)) {
struct ib_umem *new_umem;
unsigned long page_size;
new_umem = ib_umem_get(&dev->ib_dev, start, length,
new_access_flags);
if (IS_ERR(new_umem))
return ERR_CAST(new_umem);
/* Fast path for PAS change */
if (can_use_umr_rereg_pas(mr, new_umem, new_access_flags, iova,
&page_size)) {
err = umr_rereg_pas(mr, new_pd, new_access_flags, flags,
new_umem, iova, page_size);
if (err) {
ib_umem_release(new_umem);
return ERR_PTR(err);
}
return NULL;
}
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
return create_real_mr(new_pd, new_umem, iova, new_access_flags);
}
RDMA/mlx5: Fix error unwinds for rereg_mr This is all a giant train wreck of error handling, in many cases the MR is left in some corrupted state where continuing on is going to lead to chaos, or various unwinds/order is missed. rereg had three possible completely different actions, depending on flags and various details about the MR. Split the three actions into three functions, and call the right action from the start. For each action carefully design the error handling to fit the action: - UMR access/PD update is a simple UMR, if it fails the MR isn't changed, so do nothing - PAS update over UMR is multiple UMR operations. To keep everything sane revoke access to the MKey while it is being changed and restore it once the MR is correct. - Recreating the mkey should completely build a parallel MR with a fully loaded PAS then swap and destroy the old one. If it fails the original should be left untouched. This is handled in the core code. Directly call the normal MR creation functions, possibly re-using the existing umem. Add support for working with ODP MRs. The READ/WRITE access flags can be changed by UMR and we can trivially convert to/from ODP MRs using the logic to build a completely new MR. This new logic also fixes various problems with MRs continuing to work while their PAS lists are no longer valid, eg during a page size change. Link: https://lore.kernel.org/r/20201130075839.278575-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-30 07:58:39 +00:00
/*
* Everything else has no state we can preserve, just create a new MR
* from scratch
*/
recreate:
return mlx5_ib_reg_user_mr(new_pd, start, length, iova,
new_access_flags, udata);
}
static int
mlx5_alloc_priv_descs(struct ib_device *device,
struct mlx5_ib_mr *mr,
int ndescs,
int desc_size)
{
struct mlx5_ib_dev *dev = to_mdev(device);
struct device *ddev = &dev->mdev->pdev->dev;
int size = ndescs * desc_size;
int add_size;
int ret;
add_size = max_t(int, MLX5_UMR_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
if (is_power_of_2(MLX5_UMR_ALIGN) && add_size) {
int end = max_t(int, MLX5_UMR_ALIGN, roundup_pow_of_two(size));
add_size = min_t(int, end - size, add_size);
}
mr->descs_alloc = kzalloc(size + add_size, GFP_KERNEL);
if (!mr->descs_alloc)
return -ENOMEM;
mr->descs = PTR_ALIGN(mr->descs_alloc, MLX5_UMR_ALIGN);
mr->desc_map = dma_map_single(ddev, mr->descs, size, DMA_TO_DEVICE);
if (dma_mapping_error(ddev, mr->desc_map)) {
ret = -ENOMEM;
goto err;
}
return 0;
err:
kfree(mr->descs_alloc);
return ret;
}
static void
mlx5_free_priv_descs(struct mlx5_ib_mr *mr)
{
RDMA/mlx5: Fix a WARN during dereg_mr for DM type JIRA: https://issues.redhat.com/browse/RHEL-6641 JIRA: https://issues.redhat.com/browse/RHEL-49958 JIRA: https://issues.redhat.com/browse/RHEL-77115 Upstream-status: git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git for-rc commit abc7b3f1f056d69a8f11d6dceecc0c9549ace770 Author: Yishai Hadas <yishaih@nvidia.com> Date: Mon Feb 3 14:51:43 2025 +0200 RDMA/mlx5: Fix a WARN during dereg_mr for DM type Memory regions (MR) of type DM (device memory) do not have an associated umem. In the __mlx5_ib_dereg_mr() -> mlx5_free_priv_descs() flow, the code incorrectly takes the wrong branch, attempting to call dma_unmap_single() on a DMA address that is not mapped. This results in a WARN [1], as shown below. The issue is resolved by properly accounting for the DM type and ensuring the correct branch is selected in mlx5_free_priv_descs(). [1] WARNING: CPU: 12 PID: 1346 at drivers/iommu/dma-iommu.c:1230 iommu_dma_unmap_page+0x79/0x90 Modules linked in: ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry ovelay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core CPU: 12 UID: 0 PID: 1346 Comm: ibv_rc_pingpong Not tainted 6.12.0-rc7+ #1631 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:iommu_dma_unmap_page+0x79/0x90 Code: 2b 49 3b 29 72 26 49 3b 69 08 73 20 4d 89 f0 44 89 e9 4c 89 e2 48 89 ee 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 07 b8 88 ff <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 66 0f 1f 44 00 RSP: 0018:ffffc90001913a10 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88810194b0a8 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001 RBP: ffff88810194b0a8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f537abdd740(0000) GS:ffff88885fb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f537aeb8000 CR3: 000000010c248001 CR4: 0000000000372eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __warn+0x84/0x190 ? iommu_dma_unmap_page+0x79/0x90 ? report_bug+0xf8/0x1c0 ? handle_bug+0x55/0x90 ? exc_invalid_op+0x13/0x60 ? asm_exc_invalid_op+0x16/0x20 ? iommu_dma_unmap_page+0x79/0x90 dma_unmap_page_attrs+0xe6/0x290 mlx5_free_priv_descs+0xb0/0xe0 [mlx5_ib] __mlx5_ib_dereg_mr+0x37e/0x520 [mlx5_ib] ? _raw_spin_unlock_irq+0x24/0x40 ? wait_for_completion+0xfe/0x130 ? rdma_restrack_put+0x63/0xe0 [ib_core] ib_dereg_mr_user+0x5f/0x120 [ib_core] ? lock_release+0xc6/0x280 destroy_hw_idr_uobject+0x1d/0x60 [ib_uverbs] uverbs_destroy_uobject+0x58/0x1d0 [ib_uverbs] uobj_destroy+0x3f/0x70 [ib_uverbs] ib_uverbs_cmd_verbs+0x3e4/0xbb0 [ib_uverbs] ? __pfx_uverbs_destroy_def_handler+0x10/0x10 [ib_uverbs] ? lock_acquire+0xc1/0x2f0 ? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs] ? ib_uverbs_ioctl+0x116/0x170 [ib_uverbs] ? lock_release+0xc6/0x280 ib_uverbs_ioctl+0xe7/0x170 [ib_uverbs] ? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs] __x64_sys_ioctl+0x1b0/0xa70 do_syscall_64+0x6b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f537adaf17b Code: 0f 1e fa 48 8b 05 1d ad 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ed ac 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffff218f0b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffff218f1d8 RCX: 00007f537adaf17b RDX: 00007ffff218f1c0 RSI: 00000000c0181b01 RDI: 0000000000000003 RBP: 00007ffff218f1a0 R08: 00007f537aa8d010 R09: 0000561ee2e4f270 R10: 00007f537aace3a8 R11: 0000000000000246 R12: 00007ffff218f190 R13: 000000000000001c R14: 0000561ee2e4d7c0 R15: 00007ffff218f450 </TASK> Fixes: f18ec4223117 ("RDMA/mlx5: Use a union inside mlx5_ib_mr") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/2039c22cfc3df02378747ba4d623a558b53fc263.1738587076.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2025-02-25 15:49:18 +00:00
if (!mr->umem && !mr->data_direct &&
mr->ibmr.type != IB_MR_TYPE_DM && mr->descs) {
struct ib_device *device = mr->ibmr.device;
int size = mr->max_descs * mr->desc_size;
struct mlx5_ib_dev *dev = to_mdev(device);
RDMA/mlx5: Fix releasing unallocated memory in dereg MR flow Bugzilla: https://bugzilla.redhat.com/2049449 Upstream-status: v5.16-rc5 commit f0ae4afe3d35e67db042c58a52909e06262b740f Author: Alaa Hleihel <alaa@nvidia.com> Date: Mon Nov 22 13:41:51 2021 +0200 RDMA/mlx5: Fix releasing unallocated memory in dereg MR flow For the case of IB_MR_TYPE_DM the mr does doesn't have a umem, even though it is a user MR. This causes function mlx5_free_priv_descs() to think that it is a kernel MR, leading to wrongly accessing mr->descs that will get wrong values in the union which leads to attempt to release resources that were not allocated in the first place. For example: DMA-API: mlx5_core 0000:08:00.1: device driver tries to free DMA memory it has not allocated [device address=0x0000000000000000] [size=0 bytes] WARNING: CPU: 8 PID: 1021 at kernel/dma/debug.c:961 check_unmap+0x54f/0x8b0 RIP: 0010:check_unmap+0x54f/0x8b0 Call Trace: debug_dma_unmap_page+0x57/0x60 mlx5_free_priv_descs+0x57/0x70 [mlx5_ib] mlx5_ib_dereg_mr+0x1fb/0x3d0 [mlx5_ib] ib_dereg_mr_user+0x60/0x140 [ib_core] uverbs_destroy_uobject+0x59/0x210 [ib_uverbs] uobj_destroy+0x3f/0x80 [ib_uverbs] ib_uverbs_cmd_verbs+0x435/0xd10 [ib_uverbs] ? uverbs_finalize_object+0x50/0x50 [ib_uverbs] ? lock_acquire+0xc4/0x2e0 ? lock_acquired+0x12/0x380 ? lock_acquire+0xc4/0x2e0 ? lock_acquire+0xc4/0x2e0 ? ib_uverbs_ioctl+0x7c/0x140 [ib_uverbs] ? lock_release+0x28a/0x400 ib_uverbs_ioctl+0xc0/0x140 [ib_uverbs] ? ib_uverbs_ioctl+0x7c/0x140 [ib_uverbs] __x64_sys_ioctl+0x7f/0xb0 do_syscall_64+0x38/0x90 Fix it by reorganizing the dereg flow and mlx5_ib_mr structure: - Move the ib_umem field into the user MRs structure in the union as it's applicable only there. - Function mlx5_ib_dereg_mr() will now call mlx5_free_priv_descs() only in case there isn't udata, which indicates that this isn't a user MR. Fixes: f18ec4223117 ("RDMA/mlx5: Use a union inside mlx5_ib_mr") Link: https://lore.kernel.org/r/66bb1dd253c1fd7ceaa9fc411061eefa457b86fb.1637581144.git.leonro@nvidia.com Signed-off-by: Alaa Hleihel <alaa@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Mohammad Kabat <mkabat@redhat.com>
2022-08-11 10:31:32 +00:00
dma_unmap_single(&dev->mdev->pdev->dev, mr->desc_map, size,
DMA_TO_DEVICE);
kfree(mr->descs_alloc);
mr->descs = NULL;
}
}
2023-05-17 11:54:18 +00:00
static int cache_ent_find_and_store(struct mlx5_ib_dev *dev,
struct mlx5_ib_mr *mr)
{
struct mlx5_mkey_cache *cache = &dev->cache;
struct mlx5_cache_ent *ent;
int ret;
2023-05-17 11:54:18 +00:00
if (mr->mmkey.cache_ent) {
spin_lock_irq(&mr->mmkey.cache_ent->mkeys_queue.lock);
2023-05-17 11:54:18 +00:00
mr->mmkey.cache_ent->in_use--;
goto end;
}
mutex_lock(&cache->rb_lock);
ent = mkey_cache_ent_from_rb_key(dev, mr->mmkey.rb_key);
if (ent) {
if (ent->rb_key.ndescs == mr->mmkey.rb_key.ndescs) {
if (ent->disabled) {
mutex_unlock(&cache->rb_lock);
return -EOPNOTSUPP;
}
2023-05-17 11:54:18 +00:00
mr->mmkey.cache_ent = ent;
spin_lock_irq(&mr->mmkey.cache_ent->mkeys_queue.lock);
mutex_unlock(&cache->rb_lock);
2023-05-17 11:54:18 +00:00
goto end;
}
}
ent = mlx5r_cache_create_ent_locked(dev, mr->mmkey.rb_key, false);
mutex_unlock(&cache->rb_lock);
2023-05-17 11:54:18 +00:00
if (IS_ERR(ent))
return PTR_ERR(ent);
mr->mmkey.cache_ent = ent;
spin_lock_irq(&mr->mmkey.cache_ent->mkeys_queue.lock);
2023-05-17 11:54:18 +00:00
end:
ret = push_mkey_locked(mr->mmkey.cache_ent, mr->mmkey.key);
spin_unlock_irq(&mr->mmkey.cache_ent->mkeys_queue.lock);
return ret;
2023-05-17 11:54:18 +00:00
}
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
static int mlx5_ib_revoke_data_direct_mr(struct mlx5_ib_mr *mr)
{
struct mlx5_ib_dev *dev = to_mdev(mr->ibmr.device);
struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(mr->umem);
int err;
lockdep_assert_held(&dev->data_direct_lock);
mr->revoked = true;
err = mlx5r_umr_revoke_mr(mr);
if (WARN_ON(err))
return err;
ib_umem_dmabuf_revoke(umem_dmabuf);
return 0;
}
void mlx5_ib_revoke_data_direct_mrs(struct mlx5_ib_dev *dev)
{
struct mlx5_ib_mr *mr, *next;
lockdep_assert_held(&dev->data_direct_lock);
list_for_each_entry_safe(mr, next, &dev->data_direct_mr_list, dd_node) {
list_del(&mr->dd_node);
mlx5_ib_revoke_data_direct_mr(mr);
}
}
static int mlx5_revoke_mr(struct mlx5_ib_mr *mr)
{
struct mlx5_ib_dev *dev = to_mdev(mr->ibmr.device);
struct mlx5_cache_ent *ent = mr->mmkey.cache_ent;
RDMA/mlx5: Fix MR cache temp entries cleanup JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit 7ebb00cea49db641b458edef0ede389f7004821d Author: Michael Guralnik <michaelgur@nvidia.com> Date: Tue Sep 3 14:24:50 2024 +0300 RDMA/mlx5: Fix MR cache temp entries cleanup Fix the cleanup of the temp cache entries that are dynamically created in the MR cache. The cleanup of the temp cache entries is currently scheduled only when a new entry is created. Since in the cleanup of the entries only the mkeys are destroyed and the cache entry stays in the cache, subsequent registrations might reuse the entry and it will eventually be filled with new mkeys without cleanup ever getting scheduled again. On workloads that register and deregister MRs with a wide range of properties we see the cache ends up holding many cache entries, each holding the max number of mkeys that were ever used through it. Additionally, as the cleanup work is scheduled to run over the whole cache, any mkey that is returned to the cache after the cleanup was scheduled will be held for less than the intended 30 seconds timeout. Solve both issues by dropping the existing remove_ent_work and reusing the existing per-entry work to also handle the temp entries cleanup. Schedule the work to run with a 30 seconds delay every time we push an mkey to a clean temp entry. This ensures the cleanup runs on each entry only 30 seconds after the first mkey was pushed to an empty entry. As we have already been distinguishing between persistent and temp entries when scheduling the cache_work_func, it is not being scheduled in any other flows for the temp entries. Another benefit from moving to a per-entry cleanup is we now not required to hold the rb_tree mutex, thus enabling other flow to run concurrently. Fixes: dd1b913fb0d0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/e4fa4bb03bebf20dceae320f26816cd2dde23a26.1725362530.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
if (mr->mmkey.cacheable && !mlx5r_umr_revoke_mr(mr) && !cache_ent_find_and_store(dev, mr)) {
ent = mr->mmkey.cache_ent;
/* upon storing to a clean temp entry - schedule its cleanup */
spin_lock_irq(&ent->mkeys_queue.lock);
if (ent->is_tmp && !ent->tmp_cleanup_scheduled) {
mod_delayed_work(ent->dev->cache.wq, &ent->dwork,
msecs_to_jiffies(30 * 1000));
ent->tmp_cleanup_scheduled = true;
}
spin_unlock_irq(&ent->mkeys_queue.lock);
return 0;
RDMA/mlx5: Fix MR cache temp entries cleanup JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit 7ebb00cea49db641b458edef0ede389f7004821d Author: Michael Guralnik <michaelgur@nvidia.com> Date: Tue Sep 3 14:24:50 2024 +0300 RDMA/mlx5: Fix MR cache temp entries cleanup Fix the cleanup of the temp cache entries that are dynamically created in the MR cache. The cleanup of the temp cache entries is currently scheduled only when a new entry is created. Since in the cleanup of the entries only the mkeys are destroyed and the cache entry stays in the cache, subsequent registrations might reuse the entry and it will eventually be filled with new mkeys without cleanup ever getting scheduled again. On workloads that register and deregister MRs with a wide range of properties we see the cache ends up holding many cache entries, each holding the max number of mkeys that were ever used through it. Additionally, as the cleanup work is scheduled to run over the whole cache, any mkey that is returned to the cache after the cleanup was scheduled will be held for less than the intended 30 seconds timeout. Solve both issues by dropping the existing remove_ent_work and reusing the existing per-entry work to also handle the temp entries cleanup. Schedule the work to run with a 30 seconds delay every time we push an mkey to a clean temp entry. This ensures the cleanup runs on each entry only 30 seconds after the first mkey was pushed to an empty entry. As we have already been distinguishing between persistent and temp entries when scheduling the cache_work_func, it is not being scheduled in any other flows for the temp entries. Another benefit from moving to a per-entry cleanup is we now not required to hold the rb_tree mutex, thus enabling other flow to run concurrently. Fixes: dd1b913fb0d0 ("RDMA/mlx5: Cache all user cacheable mkeys on dereg MR flow") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/e4fa4bb03bebf20dceae320f26816cd2dde23a26.1725362530.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
}
if (ent) {
spin_lock_irq(&ent->mkeys_queue.lock);
ent->in_use--;
mr->mmkey.cache_ent = NULL;
spin_unlock_irq(&ent->mkeys_queue.lock);
}
return destroy_mkey(dev, mr);
}
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
static int __mlx5_ib_dereg_mr(struct ib_mr *ibmr)
{
struct mlx5_ib_mr *mr = to_mmr(ibmr);
struct mlx5_ib_dev *dev = to_mdev(ibmr->device);
int rc;
/*
* Any async use of the mr must hold the refcount, once the refcount
* goes to zero no other thread, such as ODP page faults, prefetch, any
* UMR activity, etc can touch the mkey. Thus it is safe to destroy it.
*/
if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING) &&
refcount_read(&mr->mmkey.usecount) != 0 &&
xa_erase(&mr_to_mdev(mr)->odp_mkeys, mlx5_base_mkey(mr->mmkey.key)))
mlx5r_deref_wait_odp_mkey(&mr->mmkey);
if (ibmr->type == IB_MR_TYPE_INTEGRITY) {
RDMA/mlx5: Delete right entry from MR signature database The value mr->sig is stored in the entry upon mr allocation, however, ibmr is wrongly entered here as "old", therefore, xa_cmpxchg() does not replace the entry with NULL, which leads to the following trace: WARNING: CPU: 28 PID: 2078 at drivers/infiniband/hw/mlx5/main.c:3643 mlx5_ib_stage_init_cleanup+0x4d/0x60 [mlx5_ib] Modules linked in: nvme_rdma nvme_fabrics nvme_core 8021q garp mrp bonding bridge stp llc rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_tad CPU: 28 PID: 2078 Comm: reboot Tainted: G X --------- --- 5.13.0-0.rc2.19.el9.x86_64 #1 Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 2.9.1 12/07/2018 RIP: 0010:mlx5_ib_stage_init_cleanup+0x4d/0x60 [mlx5_ib] Code: 8d bb 70 1f 00 00 be 00 01 00 00 e8 9d 94 ce da 48 3d 00 01 00 00 75 02 5b c3 0f 0b 5b c3 0f 0b 48 83 bb b0 20 00 00 00 74 d5 <0f> 0b eb d1 4 RSP: 0018:ffffa8db06d33c90 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff97f890a44000 RCX: ffff97f900ec0160 RDX: 0000000000000000 RSI: 0000000080080001 RDI: ffff97f890a44000 RBP: ffffffffc0c189b8 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000300 R12: ffff97f890a44000 R13: ffffffffc0c36030 R14: 00000000fee1dead R15: 0000000000000000 FS: 00007f0d5a8a3b40(0000) GS:ffff98077fb80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555acbf4f450 CR3: 00000002a6f56002 CR4: 00000000001706e0 Call Trace: mlx5r_remove+0x39/0x60 [mlx5_ib] auxiliary_bus_remove+0x1b/0x30 __device_release_driver+0x17a/0x230 device_release_driver+0x24/0x30 bus_remove_device+0xdb/0x140 device_del+0x18b/0x3e0 mlx5_detach_device+0x59/0x90 [mlx5_core] mlx5_unload_one+0x22/0x60 [mlx5_core] shutdown+0x31/0x3a [mlx5_core] pci_device_shutdown+0x34/0x60 device_shutdown+0x15b/0x1c0 __do_sys_reboot.cold+0x2f/0x5b ? vfs_writev+0xc7/0x140 ? handle_mm_fault+0xc5/0x290 ? do_writev+0x6b/0x110 do_syscall_64+0x40/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: e6fb246ccafb ("RDMA/mlx5: Consolidate MR destruction to mlx5_ib_dereg_mr()") Link: https://lore.kernel.org/r/f3f585ea0db59c2a78f94f65eedeafc5a2374993.1623309971.git.leonro@nvidia.com Signed-off-by: Aharon Landau <aharonl@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-06-10 07:34:26 +00:00
xa_cmpxchg(&dev->sig_mrs, mlx5_base_mkey(mr->mmkey.key),
mr->sig, NULL, GFP_KERNEL);
if (mr->mtt_mr) {
rc = mlx5_ib_dereg_mr(&mr->mtt_mr->ibmr, NULL);
if (rc)
return rc;
mr->mtt_mr = NULL;
}
if (mr->klm_mr) {
rc = mlx5_ib_dereg_mr(&mr->klm_mr->ibmr, NULL);
if (rc)
return rc;
mr->klm_mr = NULL;
}
if (mlx5_core_destroy_psv(dev->mdev,
mr->sig->psv_memory.psv_idx))
mlx5_ib_warn(dev, "failed to destroy mem psv %d\n",
mr->sig->psv_memory.psv_idx);
if (mlx5_core_destroy_psv(dev->mdev, mr->sig->psv_wire.psv_idx))
mlx5_ib_warn(dev, "failed to destroy wire psv %d\n",
mr->sig->psv_wire.psv_idx);
kfree(mr->sig);
mr->sig = NULL;
}
/* Stop DMA */
rc = mlx5_revoke_mr(mr);
if (rc)
return rc;
if (mr->umem) {
bool is_odp = is_odp_mr(mr);
if (!is_odp)
atomic_sub(ib_umem_num_pages(mr->umem),
&dev->mdev->priv.reg_pages);
ib_umem_release(mr->umem);
if (is_odp)
mlx5_ib_free_odp_mr(mr);
}
if (!mr->mmkey.cache_ent)
mlx5_free_priv_descs(mr);
kfree(mr);
return 0;
}
RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct JIRA: https://issues.redhat.com/browse/RHEL-52869 Upstream-status: v6.12-rc1 commit de8f847a5114ff7cfcdfc114af8485c431dec703 Author: Yishai Hadas <yishaih@nvidia.com> Date: Thu Aug 1 15:05:16 2024 +0300 RDMA/mlx5: Add support for DMABUF MR registrations with Data-direct Add support for DMABUF MR registrations with Data-direct device. Upon userspace calling to register a DMABUF MR with the data direct bit set, the below algorithm will be followed. 1) Obtain a pinned DMABUF umem from the IB core using the user input parameters (FD, offset, length) and the DMA PF device. The DMA PF device is needed to allow the IOMMU to enable the DMA PF to access the user buffer over PCI. 2) Create a KSM MKEY by setting its entries according to the user buffer VA to IOVA mapping, with the MKEY being the data direct device-crossed MKEY. This KSM MKEY is umrable and will be used as part of the MR cache. The PD for creating it is the internal device 'data direct' kernel one. 3) Create a crossing MKEY that points to the KSM MKEY using the crossing access mode. 4) Manage the KSM MKEY by adding it to a list of 'data direct' MKEYs managed on the mlx5_ib device. 5) Return the crossing MKEY to the user, created with its supplied PD. Upon DMA PF unbind flow, the driver will revoke the KSM entries. The final deregistration will occur under the hood once the application deregisters its MKEY. Notes: - This version supports only the PINNED UMEM mode, so there is no dependency on ODP. - The IOVA supplied by the application must be system page aligned due to HW translations of KSM. - The crossing MKEY will not be umrable or part of the MR cache, as we cannot change its crossed (i.e. KSM) MKEY over UMR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/1f99d8020ed540d9702b9e2252a145a439609ba6.1722512548.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Benjamin Poirier <bpoirier@redhat.com>
2024-12-05 15:32:08 +00:00
static int dereg_crossing_data_direct_mr(struct mlx5_ib_dev *dev,
struct mlx5_ib_mr *mr)
{
struct mlx5_ib_mr *dd_crossed_mr = mr->dd_crossed_mr;
int ret;
ret = __mlx5_ib_dereg_mr(&mr->ibmr);
if (ret)
return ret;
mutex_lock(&dev->data_direct_lock);
if (!dd_crossed_mr->revoked)
list_del(&dd_crossed_mr->dd_node);
ret = __mlx5_ib_dereg_mr(&dd_crossed_mr->ibmr);
mutex_unlock(&dev->data_direct_lock);
return ret;
}
int mlx5_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
{
struct mlx5_ib_mr *mr = to_mmr(ibmr);
struct mlx5_ib_dev *dev = to_mdev(ibmr->device);
if (mr->data_direct)
return dereg_crossing_data_direct_mr(dev, mr);
return __mlx5_ib_dereg_mr(ibmr);
}
static void mlx5_set_umr_free_mkey(struct ib_pd *pd, u32 *in, int ndescs,
int access_mode, int page_shift)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
void *mkc;
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
RDMA/mlx5: Clarify what the UMR is for when creating MRs Once a mkey is created it can be modified using UMR. This is desirable for performance reasons. However, different hardware has restrictions on what modifications are possible using UMR. Make sense of these checks: - mlx5_ib_can_reconfig_with_umr() returns true if the access flags can be altered. Most cases create MRs using 0 access flags (now made clear by consistent use of set_mkc_access_pd_addr_fields()), but the old logic here was tormented. Make it clear that this is checking if the current access_flags can be modified using UMR to different access_flags. It is always OK to use UMR to change flags that all HW supports. - mlx5_ib_can_load_pas_with_umr() returns true if UMR can be used to enable and update the PAS/XLT. Enabling requires updating the entity size, so UMR ends up completely disabled on this old hardware. Make it clear why it is disabled. FRWR, ODP and cache always requires mlx5_ib_can_load_pas_with_umr(). - mlx5_ib_pas_fits_in_mr() is used to tell if an existing MR can be resized to hold a new PAS list. This only works for cached MR's because we don't store the PAS list size in other cases. To be very clear, arrange things so any pre-created MR's in the cache check the newly requested access_flags before allowing the MR to leave the cache. If UMR cannot set the required access_flags the cache fails to create the MR. This in turn means relaxed ordering and atomic are now correctly blocked early for implicit ODP on older HW. Link: https://lore.kernel.org/r/20200914112653.345244-6-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-09-14 11:26:53 +00:00
/* This is only used from the kernel, so setting the PD is OK. */
set_mkc_access_pd_addr_fields(mkc, IB_ACCESS_RELAXED_ORDERING, 0, pd);
MLX5_SET(mkc, mkc, free, 1);
MLX5_SET(mkc, mkc, translations_octword_size, ndescs);
MLX5_SET(mkc, mkc, access_mode_1_0, access_mode & 0x3);
MLX5_SET(mkc, mkc, access_mode_4_2, (access_mode >> 2) & 0x7);
MLX5_SET(mkc, mkc, umr_en, 1);
MLX5_SET(mkc, mkc, log_page_size, page_shift);
if (access_mode == MLX5_MKC_ACCESS_MODE_PA ||
access_mode == MLX5_MKC_ACCESS_MODE_MTT)
MLX5_SET(mkc, mkc, ma_translation_mode, MLX5_CAP_GEN(dev->mdev, ats));
}
static int _mlx5_alloc_mkey_descs(struct ib_pd *pd, struct mlx5_ib_mr *mr,
int ndescs, int desc_size, int page_shift,
int access_mode, u32 *in, int inlen)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
int err;
mr->access_mode = access_mode;
mr->desc_size = desc_size;
mr->max_descs = ndescs;
err = mlx5_alloc_priv_descs(pd->device, mr, ndescs, desc_size);
if (err)
return err;
mlx5_set_umr_free_mkey(pd, in, ndescs, access_mode, page_shift);
err = mlx5_ib_create_mkey(dev, &mr->mmkey, in, inlen);
if (err)
goto err_free_descs;
mr->mmkey.type = MLX5_MKEY_MR;
mr->ibmr.lkey = mr->mmkey.key;
mr->ibmr.rkey = mr->mmkey.key;
return 0;
err_free_descs:
mlx5_free_priv_descs(mr);
return err;
}
static struct mlx5_ib_mr *mlx5_ib_alloc_pi_mr(struct ib_pd *pd,
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
u32 max_num_sg, u32 max_num_meta_sg,
int desc_size, int access_mode)
{
int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
int ndescs = ALIGN(max_num_sg + max_num_meta_sg, 4);
int page_shift = 0;
struct mlx5_ib_mr *mr;
u32 *in;
int err;
mr = kzalloc(sizeof(*mr), GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);
mr->ibmr.pd = pd;
mr->ibmr.device = pd->device;
in = kzalloc(inlen, GFP_KERNEL);
if (!in) {
err = -ENOMEM;
goto err_free;
}
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
if (access_mode == MLX5_MKC_ACCESS_MODE_MTT)
page_shift = PAGE_SHIFT;
err = _mlx5_alloc_mkey_descs(pd, mr, ndescs, desc_size, page_shift,
access_mode, in, inlen);
if (err)
goto err_free_in;
mr->umem = NULL;
kfree(in);
return mr;
err_free_in:
kfree(in);
err_free:
kfree(mr);
return ERR_PTR(err);
}
static int mlx5_alloc_mem_reg_descs(struct ib_pd *pd, struct mlx5_ib_mr *mr,
int ndescs, u32 *in, int inlen)
{
return _mlx5_alloc_mkey_descs(pd, mr, ndescs, sizeof(struct mlx5_mtt),
PAGE_SHIFT, MLX5_MKC_ACCESS_MODE_MTT, in,
inlen);
}
static int mlx5_alloc_sg_gaps_descs(struct ib_pd *pd, struct mlx5_ib_mr *mr,
int ndescs, u32 *in, int inlen)
{
return _mlx5_alloc_mkey_descs(pd, mr, ndescs, sizeof(struct mlx5_klm),
0, MLX5_MKC_ACCESS_MODE_KLMS, in, inlen);
}
static int mlx5_alloc_integrity_descs(struct ib_pd *pd, struct mlx5_ib_mr *mr,
int max_num_sg, int max_num_meta_sg,
u32 *in, int inlen)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
u32 psv_index[2];
void *mkc;
int err;
mr->sig = kzalloc(sizeof(*mr->sig), GFP_KERNEL);
if (!mr->sig)
return -ENOMEM;
/* create mem & wire PSVs */
err = mlx5_core_create_psv(dev->mdev, to_mpd(pd)->pdn, 2, psv_index);
if (err)
goto err_free_sig;
mr->sig->psv_memory.psv_idx = psv_index[0];
mr->sig->psv_wire.psv_idx = psv_index[1];
mr->sig->sig_status_checked = true;
mr->sig->sig_err_exists = false;
/* Next UMR, Arm SIGERR */
++mr->sig->sigerr_count;
mr->klm_mr = mlx5_ib_alloc_pi_mr(pd, max_num_sg, max_num_meta_sg,
sizeof(struct mlx5_klm),
MLX5_MKC_ACCESS_MODE_KLMS);
if (IS_ERR(mr->klm_mr)) {
err = PTR_ERR(mr->klm_mr);
goto err_destroy_psv;
}
mr->mtt_mr = mlx5_ib_alloc_pi_mr(pd, max_num_sg, max_num_meta_sg,
sizeof(struct mlx5_mtt),
MLX5_MKC_ACCESS_MODE_MTT);
if (IS_ERR(mr->mtt_mr)) {
err = PTR_ERR(mr->mtt_mr);
goto err_free_klm_mr;
}
/* Set bsf descriptors for mkey */
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
MLX5_SET(mkc, mkc, bsf_en, 1);
MLX5_SET(mkc, mkc, bsf_octword_size, MLX5_MKEY_BSF_OCTO_SIZE);
err = _mlx5_alloc_mkey_descs(pd, mr, 4, sizeof(struct mlx5_klm), 0,
MLX5_MKC_ACCESS_MODE_KLMS, in, inlen);
if (err)
goto err_free_mtt_mr;
err = xa_err(xa_store(&dev->sig_mrs, mlx5_base_mkey(mr->mmkey.key),
mr->sig, GFP_KERNEL));
if (err)
goto err_free_descs;
return 0;
err_free_descs:
destroy_mkey(dev, mr);
mlx5_free_priv_descs(mr);
err_free_mtt_mr:
mlx5_ib_dereg_mr(&mr->mtt_mr->ibmr, NULL);
mr->mtt_mr = NULL;
err_free_klm_mr:
mlx5_ib_dereg_mr(&mr->klm_mr->ibmr, NULL);
mr->klm_mr = NULL;
err_destroy_psv:
if (mlx5_core_destroy_psv(dev->mdev, mr->sig->psv_memory.psv_idx))
mlx5_ib_warn(dev, "failed to destroy mem psv %d\n",
mr->sig->psv_memory.psv_idx);
if (mlx5_core_destroy_psv(dev->mdev, mr->sig->psv_wire.psv_idx))
mlx5_ib_warn(dev, "failed to destroy wire psv %d\n",
mr->sig->psv_wire.psv_idx);
err_free_sig:
kfree(mr->sig);
return err;
}
static struct ib_mr *__mlx5_ib_alloc_mr(struct ib_pd *pd,
enum ib_mr_type mr_type, u32 max_num_sg,
u32 max_num_meta_sg)
{
struct mlx5_ib_dev *dev = to_mdev(pd->device);
int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
int ndescs = ALIGN(max_num_sg, 4);
struct mlx5_ib_mr *mr;
u32 *in;
int err;
mr = kzalloc(sizeof(*mr), GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);
in = kzalloc(inlen, GFP_KERNEL);
if (!in) {
err = -ENOMEM;
goto err_free;
}
mr->ibmr.device = pd->device;
mr->umem = NULL;
switch (mr_type) {
case IB_MR_TYPE_MEM_REG:
err = mlx5_alloc_mem_reg_descs(pd, mr, ndescs, in, inlen);
break;
case IB_MR_TYPE_SG_GAPS:
err = mlx5_alloc_sg_gaps_descs(pd, mr, ndescs, in, inlen);
break;
case IB_MR_TYPE_INTEGRITY:
err = mlx5_alloc_integrity_descs(pd, mr, max_num_sg,
max_num_meta_sg, in, inlen);
break;
default:
mlx5_ib_warn(dev, "Invalid mr type %d\n", mr_type);
err = -EINVAL;
}
if (err)
goto err_free_in;
kfree(in);
return &mr->ibmr;
err_free_in:
kfree(in);
err_free:
kfree(mr);
return ERR_PTR(err);
}
struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
u32 max_num_sg)
{
return __mlx5_ib_alloc_mr(pd, mr_type, max_num_sg, 0);
}
struct ib_mr *mlx5_ib_alloc_mr_integrity(struct ib_pd *pd,
u32 max_num_sg, u32 max_num_meta_sg)
{
return __mlx5_ib_alloc_mr(pd, IB_MR_TYPE_INTEGRITY, max_num_sg,
max_num_meta_sg);
}
int mlx5_ib_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata)
{
struct mlx5_ib_dev *dev = to_mdev(ibmw->device);
int inlen = MLX5_ST_SZ_BYTES(create_mkey_in);
struct mlx5_ib_mw *mw = to_mmw(ibmw);
unsigned int ndescs;
u32 *in = NULL;
void *mkc;
int err;
struct mlx5_ib_alloc_mw req = {};
struct {
__u32 comp_mask;
__u32 response_length;
} resp = {};
err = ib_copy_from_udata(&req, udata, min(udata->inlen, sizeof(req)));
if (err)
return err;
if (req.comp_mask || req.reserved1 || req.reserved2)
return -EOPNOTSUPP;
if (udata->inlen > sizeof(req) &&
!ib_is_udata_cleared(udata, sizeof(req),
udata->inlen - sizeof(req)))
return -EOPNOTSUPP;
ndescs = req.num_klms ? roundup(req.num_klms, 4) : roundup(1, 4);
in = kzalloc(inlen, GFP_KERNEL);
if (!in)
return -ENOMEM;
mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
MLX5_SET(mkc, mkc, free, 1);
MLX5_SET(mkc, mkc, translations_octword_size, ndescs);
MLX5_SET(mkc, mkc, pd, to_mpd(ibmw->pd)->pdn);
MLX5_SET(mkc, mkc, umr_en, 1);
MLX5_SET(mkc, mkc, lr, 1);
MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_KLMS);
MLX5_SET(mkc, mkc, en_rinval, !!((ibmw->type == IB_MW_TYPE_2)));
MLX5_SET(mkc, mkc, qpn, 0xffffff);
err = mlx5_ib_create_mkey(dev, &mw->mmkey, in, inlen);
if (err)
goto free;
mw->mmkey.type = MLX5_MKEY_MW;
ibmw->rkey = mw->mmkey.key;
mw->mmkey.ndescs = ndescs;
resp.response_length =
min(offsetofend(typeof(resp), response_length), udata->outlen);
if (resp.response_length) {
err = ib_copy_to_udata(udata, &resp, resp.response_length);
if (err)
goto free_mkey;
}
if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) {
err = mlx5r_store_odp_mkey(dev, &mw->mmkey);
if (err)
goto free_mkey;
}
kfree(in);
return 0;
free_mkey:
mlx5_core_destroy_mkey(dev->mdev, mw->mmkey.key);
free:
kfree(in);
return err;
}
int mlx5_ib_dealloc_mw(struct ib_mw *mw)
{
struct mlx5_ib_dev *dev = to_mdev(mw->device);
struct mlx5_ib_mw *mmw = to_mmw(mw);
if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING) &&
xa_erase(&dev->odp_mkeys, mlx5_base_mkey(mmw->mmkey.key)))
/*
* pagefault_single_data_segment() may be accessing mmw
* if the user bound an ODP MR to this MW.
*/
mlx5r_deref_wait_odp_mkey(&mmw->mmkey);
return mlx5_core_destroy_mkey(dev->mdev, mmw->mmkey.key);
}
int mlx5_ib_check_mr_status(struct ib_mr *ibmr, u32 check_mask,
struct ib_mr_status *mr_status)
{
struct mlx5_ib_mr *mmr = to_mmr(ibmr);
int ret = 0;
if (check_mask & ~IB_MR_CHECK_SIG_STATUS) {
pr_err("Invalid status check mask\n");
ret = -EINVAL;
goto done;
}
mr_status->fail_status = 0;
if (check_mask & IB_MR_CHECK_SIG_STATUS) {
if (!mmr->sig) {
ret = -EINVAL;
pr_err("signature status check requested on a non-signature enabled MR\n");
goto done;
}
mmr->sig->sig_status_checked = true;
if (!mmr->sig->sig_err_exists)
goto done;
if (ibmr->lkey == mmr->sig->err_item.key)
memcpy(&mr_status->sig_err, &mmr->sig->err_item,
sizeof(mr_status->sig_err));
else {
mr_status->sig_err.err_type = IB_SIG_BAD_GUARD;
mr_status->sig_err.sig_err_offset = 0;
mr_status->sig_err.key = mmr->sig->err_item.key;
}
mmr->sig->sig_err_exists = false;
mr_status->fail_status |= IB_MR_CHECK_SIG_STATUS;
}
done:
return ret;
}
static int
mlx5_ib_map_pa_mr_sg_pi(struct ib_mr *ibmr, struct scatterlist *data_sg,
int data_sg_nents, unsigned int *data_sg_offset,
struct scatterlist *meta_sg, int meta_sg_nents,
unsigned int *meta_sg_offset)
{
struct mlx5_ib_mr *mr = to_mmr(ibmr);
unsigned int sg_offset = 0;
int n = 0;
mr->meta_length = 0;
if (data_sg_nents == 1) {
n++;
mr->mmkey.ndescs = 1;
if (data_sg_offset)
sg_offset = *data_sg_offset;
mr->data_length = sg_dma_len(data_sg) - sg_offset;
mr->data_iova = sg_dma_address(data_sg) + sg_offset;
if (meta_sg_nents == 1) {
n++;
mr->meta_ndescs = 1;
if (meta_sg_offset)
sg_offset = *meta_sg_offset;
else
sg_offset = 0;
mr->meta_length = sg_dma_len(meta_sg) - sg_offset;
mr->pi_iova = sg_dma_address(meta_sg) + sg_offset;
}
ibmr->length = mr->data_length + mr->meta_length;
}
return n;
}
static int
mlx5_ib_sg_to_klms(struct mlx5_ib_mr *mr,
struct scatterlist *sgl,
unsigned short sg_nents,
unsigned int *sg_offset_p,
struct scatterlist *meta_sgl,
unsigned short meta_sg_nents,
unsigned int *meta_sg_offset_p)
{
struct scatterlist *sg = sgl;
struct mlx5_klm *klms = mr->descs;
unsigned int sg_offset = sg_offset_p ? *sg_offset_p : 0;
u32 lkey = mr->ibmr.pd->local_dma_lkey;
int i, j = 0;
mr->ibmr.iova = sg_dma_address(sg) + sg_offset;
mr->ibmr.length = 0;
for_each_sg(sgl, sg, sg_nents, i) {
if (unlikely(i >= mr->max_descs))
break;
klms[i].va = cpu_to_be64(sg_dma_address(sg) + sg_offset);
klms[i].bcount = cpu_to_be32(sg_dma_len(sg) - sg_offset);
klms[i].key = cpu_to_be32(lkey);
mr->ibmr.length += sg_dma_len(sg) - sg_offset;
sg_offset = 0;
}
if (sg_offset_p)
*sg_offset_p = sg_offset;
mr->mmkey.ndescs = i;
mr->data_length = mr->ibmr.length;
if (meta_sg_nents) {
sg = meta_sgl;
sg_offset = meta_sg_offset_p ? *meta_sg_offset_p : 0;
for_each_sg(meta_sgl, sg, meta_sg_nents, j) {
if (unlikely(i + j >= mr->max_descs))
break;
klms[i + j].va = cpu_to_be64(sg_dma_address(sg) +
sg_offset);
klms[i + j].bcount = cpu_to_be32(sg_dma_len(sg) -
sg_offset);
klms[i + j].key = cpu_to_be32(lkey);
mr->ibmr.length += sg_dma_len(sg) - sg_offset;
sg_offset = 0;
}
if (meta_sg_offset_p)
*meta_sg_offset_p = sg_offset;
mr->meta_ndescs = j;
mr->meta_length = mr->ibmr.length - mr->data_length;
}
return i + j;
}
static int mlx5_set_page(struct ib_mr *ibmr, u64 addr)
{
struct mlx5_ib_mr *mr = to_mmr(ibmr);
__be64 *descs;
if (unlikely(mr->mmkey.ndescs == mr->max_descs))
return -ENOMEM;
descs = mr->descs;
descs[mr->mmkey.ndescs++] = cpu_to_be64(addr | MLX5_EN_RD | MLX5_EN_WR);
return 0;
}
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
static int mlx5_set_page_pi(struct ib_mr *ibmr, u64 addr)
{
struct mlx5_ib_mr *mr = to_mmr(ibmr);
__be64 *descs;
if (unlikely(mr->mmkey.ndescs + mr->meta_ndescs == mr->max_descs))
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
return -ENOMEM;
descs = mr->descs;
descs[mr->mmkey.ndescs + mr->meta_ndescs++] =
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
cpu_to_be64(addr | MLX5_EN_RD | MLX5_EN_WR);
return 0;
}
static int
mlx5_ib_map_mtt_mr_sg_pi(struct ib_mr *ibmr, struct scatterlist *data_sg,
int data_sg_nents, unsigned int *data_sg_offset,
struct scatterlist *meta_sg, int meta_sg_nents,
unsigned int *meta_sg_offset)
{
struct mlx5_ib_mr *mr = to_mmr(ibmr);
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
struct mlx5_ib_mr *pi_mr = mr->mtt_mr;
int n;
pi_mr->mmkey.ndescs = 0;
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
pi_mr->meta_ndescs = 0;
pi_mr->meta_length = 0;
ib_dma_sync_single_for_cpu(ibmr->device, pi_mr->desc_map,
pi_mr->desc_size * pi_mr->max_descs,
DMA_TO_DEVICE);
pi_mr->ibmr.page_size = ibmr->page_size;
n = ib_sg_to_pages(&pi_mr->ibmr, data_sg, data_sg_nents, data_sg_offset,
mlx5_set_page);
if (n != data_sg_nents)
return n;
pi_mr->data_iova = pi_mr->ibmr.iova;
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
pi_mr->data_length = pi_mr->ibmr.length;
pi_mr->ibmr.length = pi_mr->data_length;
ibmr->length = pi_mr->data_length;
if (meta_sg_nents) {
u64 page_mask = ~((u64)ibmr->page_size - 1);
u64 iova = pi_mr->data_iova;
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
n += ib_sg_to_pages(&pi_mr->ibmr, meta_sg, meta_sg_nents,
meta_sg_offset, mlx5_set_page_pi);
pi_mr->meta_length = pi_mr->ibmr.length;
/*
* PI address for the HW is the offset of the metadata address
* relative to the first data page address.
* It equals to first data page address + size of data pages +
* metadata offset at the first metadata page
*/
pi_mr->pi_iova = (iova & page_mask) +
pi_mr->mmkey.ndescs * ibmr->page_size +
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
(pi_mr->ibmr.iova & ~page_mask);
/*
* In order to use one MTT MR for data and metadata, we register
* also the gaps between the end of the data and the start of
* the metadata (the sig MR will verify that the HW will access
* to right addresses). This mapping is safe because we use
* internal mkey for the registration.
*/
pi_mr->ibmr.length = pi_mr->pi_iova + pi_mr->meta_length - iova;
pi_mr->ibmr.iova = iova;
ibmr->length += pi_mr->meta_length;
}
ib_dma_sync_single_for_device(ibmr->device, pi_mr->desc_map,
pi_mr->desc_size * pi_mr->max_descs,
DMA_TO_DEVICE);
return n;
}
static int
mlx5_ib_map_klm_mr_sg_pi(struct ib_mr *ibmr, struct scatterlist *data_sg,
int data_sg_nents, unsigned int *data_sg_offset,
struct scatterlist *meta_sg, int meta_sg_nents,
unsigned int *meta_sg_offset)
{
struct mlx5_ib_mr *mr = to_mmr(ibmr);
struct mlx5_ib_mr *pi_mr = mr->klm_mr;
int n;
pi_mr->mmkey.ndescs = 0;
pi_mr->meta_ndescs = 0;
pi_mr->meta_length = 0;
ib_dma_sync_single_for_cpu(ibmr->device, pi_mr->desc_map,
pi_mr->desc_size * pi_mr->max_descs,
DMA_TO_DEVICE);
n = mlx5_ib_sg_to_klms(pi_mr, data_sg, data_sg_nents, data_sg_offset,
meta_sg, meta_sg_nents, meta_sg_offset);
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
ib_dma_sync_single_for_device(ibmr->device, pi_mr->desc_map,
pi_mr->desc_size * pi_mr->max_descs,
DMA_TO_DEVICE);
/* This is zero-based memory region */
pi_mr->data_iova = 0;
pi_mr->ibmr.iova = 0;
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
pi_mr->pi_iova = pi_mr->data_length;
ibmr->length = pi_mr->ibmr.length;
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
return n;
}
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
int mlx5_ib_map_mr_sg_pi(struct ib_mr *ibmr, struct scatterlist *data_sg,
int data_sg_nents, unsigned int *data_sg_offset,
struct scatterlist *meta_sg, int meta_sg_nents,
unsigned int *meta_sg_offset)
{
struct mlx5_ib_mr *mr = to_mmr(ibmr);
struct mlx5_ib_mr *pi_mr = NULL;
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
int n;
WARN_ON(ibmr->type != IB_MR_TYPE_INTEGRITY);
mr->mmkey.ndescs = 0;
mr->data_length = 0;
mr->data_iova = 0;
mr->meta_ndescs = 0;
mr->pi_iova = 0;
/*
* As a performance optimization, if possible, there is no need to
* perform UMR operation to register the data/metadata buffers.
* First try to map the sg lists to PA descriptors with local_dma_lkey.
* Fallback to UMR only in case of a failure.
*/
n = mlx5_ib_map_pa_mr_sg_pi(ibmr, data_sg, data_sg_nents,
data_sg_offset, meta_sg, meta_sg_nents,
meta_sg_offset);
if (n == data_sg_nents + meta_sg_nents)
goto out;
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
/*
* As a performance optimization, if possible, there is no need to map
* the sg lists to KLM descriptors. First try to map the sg lists to MTT
* descriptors and fallback to KLM only in case of a failure.
* It's more efficient for the HW to work with MTT descriptors
* (especially in high load).
* Use KLM (indirect access) only if it's mandatory.
*/
pi_mr = mr->mtt_mr;
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
n = mlx5_ib_map_mtt_mr_sg_pi(ibmr, data_sg, data_sg_nents,
data_sg_offset, meta_sg, meta_sg_nents,
meta_sg_offset);
if (n == data_sg_nents + meta_sg_nents)
goto out;
pi_mr = mr->klm_mr;
n = mlx5_ib_map_klm_mr_sg_pi(ibmr, data_sg, data_sg_nents,
data_sg_offset, meta_sg, meta_sg_nents,
meta_sg_offset);
if (unlikely(n != data_sg_nents + meta_sg_nents))
return -ENOMEM;
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
out:
/* This is zero-based memory region */
ibmr->iova = 0;
mr->pi_mr = pi_mr;
if (pi_mr)
ibmr->sig_attrs->meta_length = pi_mr->meta_length;
else
ibmr->sig_attrs->meta_length = mr->meta_length;
RDMA/mlx5: Improve PI handover performance In some loads, there is performance degradation when using KLM mkey instead of MTT mkey. This is because KLM descriptor access is via indirection that might require more HW resources and cycles. Using KLM descriptor is not necessary when there are no gaps at the data/metadata sg lists. As an optimization, use MTT mkey whenever it is possible. For that matter, allocate internal MTT mkey and choose the effective pi_mr for in transaction according to the required mapping scheme. The setup of the tested benchmark (using iSER ULP): - 2 servers with 24 cores (1 initiator and 1 target) - ConnectX-4/ConnectX-5 adapters - 24 target sessions with 1 LUN each - ramdisk backstore - PI active Performance results running fio (24 jobs, 128 iodepth) using write_generate=1 and read_verify=1 (w/w.o/baseline): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1262.4K/1243.3K/1147.1K 1732.1K/1725.1K/1423.8K 4k 570902/571233/457874 773982/743293/642080 32k 72086/72388/71933 96164/71789/93249 Using write_generate=0 and read_verify=0 (w/w.o patch): bs IOPS(read) IOPS(write) ---- ---------- ---------- 512 1600.1K/1572.1K/1393.3K 1830.3K/1823.5K/1557.2K 4k 937272/921992/762934 815304/753772/646071 32k 77369/75052/72058 97435/73180/94612 Signed-off-by: Israel Rukshin <israelr@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Max Gurtovoy <maxg@mellanox.com> Suggested-by: Idan Burstein <idanb@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-11 15:52:55 +00:00
return 0;
}
int mlx5_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
unsigned int *sg_offset)
{
struct mlx5_ib_mr *mr = to_mmr(ibmr);
int n;
mr->mmkey.ndescs = 0;
ib_dma_sync_single_for_cpu(ibmr->device, mr->desc_map,
mr->desc_size * mr->max_descs,
DMA_TO_DEVICE);
if (mr->access_mode == MLX5_MKC_ACCESS_MODE_KLMS)
n = mlx5_ib_sg_to_klms(mr, sg, sg_nents, sg_offset, NULL, 0,
NULL);
else
n = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset,
mlx5_set_page);
ib_dma_sync_single_for_device(ibmr->device, mr->desc_map,
mr->desc_size * mr->max_descs,
DMA_TO_DEVICE);
return n;
}