Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160099
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Tested: by IBM
Build-Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=52893145
Conflicts: None
commit 475f9ff63ee8c296aa46c6e9e9ad9bdd301c6bdf
Author: D. Wythe <alibuda@linux.alibaba.com>
Date: Thu Feb 16 14:39:05 2023 +0800
net/smc: fix application data exception
There is a certain probability that the following exceptions will
occur in the wrk benchmark test:
Running 10s test @ http://11.213.45.6:80
8 threads and 64 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.72ms 13.94ms 245.33ms 94.17%
Req/Sec 1.96k 713.67 5.41k 75.16%
155262 requests in 10.10s, 23.10MB read
Non-2xx or 3xx responses: 3
The errors are HTTP 400 responses, which is a serious exception in
our test: it means the application data was corrupted.
Consider the following scenario:
CPU0                            CPU1
buf_desc->used = 0;
                                cmpxchg(buf_desc->used, 0, 1)
                                deal_with(buf_desc)
memset(buf_desc->cpu_addr,0);
This will cause the data received by a victim connection to be cleared,
thus triggering an HTTP 400 error in the server.
This patch swaps the order of clearing used and the memset, and adds a
barrier to ensure memory consistency.
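The corrected ordering can be sketched in user space with C11 atomics. This is only an illustration under stated assumptions: struct buf_desc here is a hypothetical stand-in for the kernel structure, buf_release() and buf_try_claim() are made-up names, and release/acquire ordering stands in for whatever barrier the kernel patch actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's buffer descriptor. */
struct buf_desc {
	atomic_int used;	/* 0 = free, 1 = owned by a connection */
	size_t len;
	unsigned char *cpu_addr;
};

/* Release path: clear the data first, then publish the buffer as free. */
static void buf_release(struct buf_desc *d)
{
	assert(atomic_load(&d->used) == 1);
	memset(d->cpu_addr, 0, d->len);
	/* Release ordering keeps the memset from moving after this store. */
	atomic_store_explicit(&d->used, 0, memory_order_release);
}

/* Reuse path: claim a free buffer; returns 1 on success, 0 if in use. */
static int buf_try_claim(struct buf_desc *d)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(&d->used, &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}
```

With this ordering, a CPU that wins the compare-and-exchange can only observe a buffer whose memset has already completed, so it never has its freshly received data cleared.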
Fixes: 1c5526968e27 ("net/smc: Clear memory when release and reuse buffer")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160099
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Tested: by IBM
Build-Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=52893145
Conflicts: None
commit aff7bfed9097435ea38de919befbe2d7771a3e87
Author: D. Wythe <alibuda@linux.alibaba.com>
Date: Thu Feb 2 16:26:42 2023 +0800
net/smc: replace mutex rmbs_lock and sndbufs_lock with rw_semaphore
It's clear that rmbs_lock and sndbufs_lock aim to protect the
rmbs list and the sndbufs list.
During connection establishment, smc_buf_get_slot() will always
be invoked, and it only performs read semantics on the rmbs list and
the sndbufs list.
Based on the above considerations, we replace the mutex with an
rw_semaphore. Only smc_buf_get_slot() uses down_read(), which allows
smc_buf_get_slot() to run concurrently; the other parts use
down_write() to keep exclusive semantics.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160099
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Tested: by IBM
Build-Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=52893145
Conflicts: None
commit f6421014e88983c5bb7a25c71c01ae6278a01df9
Author: D. Wythe <alibuda@linux.alibaba.com>
Date: Thu Feb 2 16:26:40 2023 +0800
net/smc: use read semaphores to reduce unnecessary blocking in smc_buf_create() & smcr_buf_unuse()
The following is part of an Off-CPU graph during frequent SMC-R
short-lived processing:
process_one_work (51.19%)
smc_close_passive_work (28.36%)
smcr_buf_unuse (28.34%)
rwsem_down_write_slowpath (28.22%)
smc_listen_work (22.83%)
smc_clc_wait_msg (1.84%)
smc_buf_create (20.45%)
smcr_buf_map_usable_links
rwsem_down_write_slowpath (20.43%)
smcr_lgr_reg_rmbs (0.53%)
rwsem_down_write_slowpath (0.43%)
smc_llc_do_confirm_rkey (0.08%)
We can clearly see that during connection establishment, the
connections wait not on IO but on llc_conf_mutex.
More importantly, the core critical areas (smcr_buf_unuse() &
smc_buf_create()) only perform read semantics on links, so we can
easily replace the lock with a read semaphore.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160099
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Tested: by IBM
Build-Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=52893145
Conflicts: None
commit b5dd4d6981717f7e2682c0419fe832328c7441cf
Author: D. Wythe <alibuda@linux.alibaba.com>
Date: Thu Feb 2 16:26:39 2023 +0800
net/smc: llc_conf_mutex refactor, replace it with rw_semaphore
llc_conf_mutex was used to protect links and link-related configurations
in the same link group, for example, adding or deleting links. However,
in most cases, the protected critical area has only read semantics and
no write semantics at all, such as obtaining a usable link or an
available rmb_desc.
This patch is a simple code refactoring: it replaces the mutex with an
rw_semaphore, mutex_lock with down_write, and mutex_unlock with
up_write.
Theoretically, this replacement is equivalent, but after this patch,
we can distinguish lock granularity according to different semantics
of critical areas.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160099
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Tested: by IBM
Build-Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=52893145
Conflicts: None
commit 8c81ba20349daf9f7e58bb05a0c12f4b71813a30
Author: Stefan Raspl <raspl@linux.ibm.com>
Date: Mon Jan 23 19:17:52 2023 +0100
net/smc: De-tangle ism and smc device initialization
The struct device for ISM devices was part of struct smcd_dev. Move to
struct ism_dev, provide a new API call in struct smcd_ops, and convert
existing SMCD code accordingly.
Furthermore, remove struct smcd_dev from struct ism_dev.
This is the final part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160099
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Tested: by IBM
Build-Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=52893145
Conflicts: None
commit 9de4df7b6be1cfca500f8ba21137d53eec45418a
Author: Stefan Raspl <raspl@linux.ibm.com>
Date: Mon Jan 23 19:17:50 2023 +0100
net/smc: Separate SMC-D and ISM APIs
We separate the code implementing the struct smcd_ops API in the ISM
device driver from the functions that may be used by other exploiters of
ISM devices.
Note: We start out small, and don't offer the whole breadth of the ISM
device for public use, as many functions are specific to or likely only
ever used in the context of SMC-D.
This is the third part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160099
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Tested: by IBM
Build-Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=52893145
Conflicts: None
commit 8747716f3942a610efdd12e3655df47269c268ac
Author: Stefan Raspl <raspl@linux.ibm.com>
Date: Mon Jan 23 19:17:49 2023 +0100
net/smc: Register SMC-D as ISM client
Register the smc module with the new ism device driver API.
This is the second part of a bigger overhaul of the interfaces between SMC
and ISM.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160099
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Tested: by IBM
Build-Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=52893145
Conflicts: None
commit bdee15e8c58b450ad736a2b62ef8c7a12548b704
Author: Dan Carpenter <error27@gmail.com>
Date: Fri Oct 14 12:34:36 2022 +0300
net/smc: Fix an error code in smc_lgr_create()
If smc_wr_alloc_lgr_mem() fails then return an error code. Don't return
success.
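The pattern being fixed can be sketched as follows. This is a generic user-space illustration, not the kernel function: alloc_lgr_mem() and lgr_create() are hypothetical names, and the point is only that the failing call's return value must be stored in rc before jumping to the error path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical allocator standing in for the work request allocation. */
static int alloc_lgr_mem(void **mem, size_t size)
{
	*mem = size ? malloc(size) : NULL;
	return *mem ? 0 : -ENOMEM;
}

/* Returns 0 on success or a negative errno value on failure. */
static int lgr_create(void **mem, size_t size)
{
	int rc;

	assert(mem != NULL);
	rc = alloc_lgr_mem(mem, size);	/* capture the error code ... */
	if (rc)
		goto out;		/* ... so the error path does not return 0 */
	/* further setup would go here */
out:
	return rc;
}
```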
Fixes: 8799e310fb3f ("net/smc: add v2 support to the work request layer")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160099
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Tested: by IBM
Build-Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=52893145
Conflicts: None
commit e738455b2c6dcdab03e45d97de36476f93f557d2
Author: Wen Gu <guwen@linux.alibaba.com>
Date: Tue Sep 20 14:43:09 2022 +0800
net/smc: Stop the CLC flow if no link to map buffers on
There might be a potential race between SMC-R buffer map and
link group termination.
smc_smcr_terminate_all() | smc_connect_rdma()
--------------------------------------------------------------
| smc_conn_create()
for links in smcibdev |
schedule links down |
| smc_buf_create()
| \- smcr_buf_map_usable_links()
| \- no usable links found,
| (rmb->mr = NULL)
|
| smc_clc_send_confirm()
| \- access conn->rmb_desc->mr[]->rkey
| (panic)
During reboot and IB device module removal, all links will be set
down and no usable links remain in link groups. In such a situation
smcr_buf_map_usable_links() should return an error and stop the
CLC flow from accessing the uninitialized mr.
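A minimal sketch of the intended behaviour, assuming a simplified link array: struct link, its fields, and the choice of -EINVAL as the error value are illustrative assumptions, not taken from the kernel source.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified link: the real structure is far larger. */
struct link {
	bool usable;
	bool mapped;
};

/*
 * Map a buffer on every usable link. Returns 0 if at least one link was
 * mapped, or -EINVAL (value chosen for illustration) if none was usable.
 */
static int buf_map_usable_links(struct link *links, size_t nlinks)
{
	bool mapped = false;
	size_t i;

	assert(links != NULL || nlinks == 0);
	for (i = 0; i < nlinks; i++) {
		if (!links[i].usable)
			continue;
		links[i].mapped = true;	/* stand-in for the real MR mapping */
		mapped = true;
	}
	return mapped ? 0 : -EINVAL;
}
```

A caller that checks this return value stops before it ever dereferences a memory region that was never set up.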
Fixes: b9247544c1 ("net/smc: convert static link ID instances to support multiple links")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Link: https://lore.kernel.org/r/1663656189-32090-1-git-send-email-guwen@linux.alibaba.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160099
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Tested: by IBM
Build-Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=52893145
Conflicts: None
commit b8d199451c99b3796b840c350eb74b830c5c869b
Author: Wen Gu <guwen@linux.alibaba.com>
Date: Thu Jul 14 17:44:04 2022 +0800
net/smc: Allow virtually contiguous sndbufs or RMBs for SMC-R
On long-running enterprise production servers, high-order contiguous
memory pages are usually very rare and in most cases we can only get
fragmented pages.
When replacing TCP with SMC-R in such production scenarios, attempting
to allocate high-order physically contiguous sndbufs and RMBs may result
in frequent memory compaction, which can cause unexpected hangs
and further stability risks.
So this patch allows an SMC-R link group to use virtually
contiguous sndbufs and RMBs to avoid the potential issues mentioned above.
Whether to use physically or virtually contiguous buffers can be set
by sysctl smcr_buf_type.
Note that using virtually contiguous buffers will bring an acceptable
performance regression, which can be mainly divided into two parts:
1) regression in data path, which is brought by additional address
translation of sndbuf by RNIC in Tx. But in general, translating
address through MTT is fast.
Taking 256KB sndbuf and RMB as an example, the comparisons in qperf
latency and bandwidth test with physically and virtually contiguous
buffers are as follows:
- client:
smc_run taskset -c <cpu> qperf <server> -oo msg_size:1:64K:*2\
-t 5 -vu tcp_{bw|lat}
- server:
smc_run taskset -c <cpu> qperf
[latency]
msgsize tcp smcr smcr-use-virt-buf
1 11.17 us 7.56 us 7.51 us (-0.67%)
2 10.65 us 7.74 us 7.56 us (-2.31%)
4 11.11 us 7.52 us 7.59 us ( 0.84%)
8 10.83 us 7.55 us 7.51 us (-0.48%)
16 11.21 us 7.46 us 7.51 us ( 0.71%)
32 10.65 us 7.53 us 7.58 us ( 0.61%)
64 10.95 us 7.74 us 7.80 us ( 0.76%)
128 11.14 us 7.83 us 7.87 us ( 0.47%)
256 10.97 us 7.94 us 7.92 us (-0.28%)
512 11.23 us 7.94 us 8.20 us ( 3.25%)
1024 11.60 us 8.12 us 8.20 us ( 0.96%)
2048 14.04 us 8.30 us 8.51 us ( 2.49%)
4096 16.88 us 9.13 us 9.07 us (-0.64%)
8192 22.50 us 10.56 us 11.22 us ( 6.26%)
16384 28.99 us 12.88 us 13.83 us ( 7.37%)
32768 40.13 us 16.76 us 16.95 us ( 1.16%)
65536 68.70 us 24.68 us 24.85 us ( 0.68%)
[bandwidth]
msgsize tcp smcr smcr-use-virt-buf
1 1.65 MB/s 1.59 MB/s 1.53 MB/s (-3.88%)
2 3.32 MB/s 3.17 MB/s 3.08 MB/s (-2.67%)
4 6.66 MB/s 6.33 MB/s 6.09 MB/s (-3.85%)
8 13.67 MB/s 13.45 MB/s 11.97 MB/s (-10.99%)
16 25.36 MB/s 27.15 MB/s 24.16 MB/s (-11.01%)
32 48.22 MB/s 54.24 MB/s 49.41 MB/s (-8.89%)
64 106.79 MB/s 107.32 MB/s 99.05 MB/s (-7.71%)
128 210.21 MB/s 202.46 MB/s 201.02 MB/s (-0.71%)
256 400.81 MB/s 416.81 MB/s 393.52 MB/s (-5.59%)
512 746.49 MB/s 834.12 MB/s 809.99 MB/s (-2.89%)
1024 1292.33 MB/s 1641.96 MB/s 1571.82 MB/s (-4.27%)
2048 2007.64 MB/s 2760.44 MB/s 2717.68 MB/s (-1.55%)
4096 2665.17 MB/s 4157.44 MB/s 4070.76 MB/s (-2.09%)
8192 3159.72 MB/s 4361.57 MB/s 4270.65 MB/s (-2.08%)
16384 4186.70 MB/s 4574.13 MB/s 4501.17 MB/s (-1.60%)
32768 4093.21 MB/s 4487.42 MB/s 4322.43 MB/s (-3.68%)
65536 4057.14 MB/s 4735.61 MB/s 4555.17 MB/s (-3.81%)
2) regression in buffer initialization and destruction path, which is
brought by additional MR operations of sndbufs. But thanks to the link
group buffer reuse mechanism, the impact of this kind of regression
decreases as the number of buffer reuses increases.
Taking 256KB sndbuf and RMB as an example, the latencies of some key SMC-R
buffer-related functions obtained by bpftrace are as follows:
Function Phys-bufs Virt-bufs
smcr_new_buf_create() 67154 ns 79164 ns
smc_ib_buf_map_sg() 525 ns 928 ns
smc_ib_get_memory_region() 162294 ns 161191 ns
smc_wr_reg_send() 9957 ns 9635 ns
smc_ib_put_memory_region() 203548 ns 198374 ns
smc_ib_buf_unmap_sg() 508 ns 1158 ns
------------
Test environment notes:
1. The above tests run on 2 VMs within the same host.
2. The NIC is ConnectX-4Lx, using SRIOV and passing through 2 VFs to
each VM respectively.
3. VMs' vCPUs are bound to different physical CPUs, and the bound
physical CPUs are isolated by the `isolcpus=xxx` cmdline.
4. The NICs' queue number is set to 1.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160099
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Tested: by IBM
Build-Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=52893145
Conflicts: None
commit b984f370ed5182d180f92dbf14bdf847ff6ccc04
Author: Wen Gu <guwen@linux.alibaba.com>
Date: Thu Jul 14 17:44:03 2022 +0800
net/smc: Use sysctl-specified types of buffers in new link group
This patch introduces a new SMC-R specific element buf_type
in struct smc_link_group, for recording the value of sysctl
smcr_buf_type when the link group is created.
A newly created link group will create and reuse buffers of the
type specified by buf_type.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160099
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Tested: by IBM
Build-Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=52893145
Conflicts: None
commit 0ef69e788411cba2af017db731a9fc62d255e9ac
Author: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Date: Thu Jul 14 17:44:01 2022 +0800
net/smc: optimize for smc_sndbuf_sync_sg_for_device and smc_rmb_sync_sg_for_cpu
Some CPUs, such as Xeon, can guarantee DMA cache coherency,
so there is no need to use the dma sync APIs to flush the cache on such CPUs.
In order to avoid calling the dma sync APIs on the IO path, use
dma_need_sync to check, when creating an smc_buf_desc, whether it
needs dma sync.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2160099
Upstream status: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Tested: by IBM
Build-Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=52893145
Conflicts: None
commit 6d52e2de6415b7a035b3e8dc4ccffd0da25bbfb9
Author: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Date: Thu Jul 14 17:44:00 2022 +0800
net/smc: remove redundant dma sync ops
smc_ib_sync_sg_for_cpu/device are the ops used for dma memory cache
consistency. SMC sndbufs are dma buffers that the CPU writes data to
and the PCIe device reads data from. So for sndbufs,
smc_ib_sync_sg_for_device is needed and smc_ib_sync_sg_for_cpu is
redundant, as the PCIe device will not write to the buffers. SMC rmbs
are dma buffers that the PCIe device writes data to and the CPU reads
data from. So for rmbs, smc_ib_sync_sg_for_cpu is needed and
smc_ib_sync_sg_for_device is redundant, as the CPU will not write to the buffers.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 4940a1fdf31c39f0806ac831cde333134862030b
Author: D. Wythe <alibuda@linux.alibaba.com>
Date: Wed Mar 2 21:25:12 2022 +0800
net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause by server
The cause of SMC_CLC_DECL_ERR_REGRMB on the server is very clear:
whether a new SMC connection can be accepted depends not only on
the limit of conn nums, but also on the available
entries of rtoken. Since the rtoken release is triggered by the peer, while
the conn nums is decreased locally, many things can happen in this
time difference.
The only thing that needs to be mentioned is that all connection
creations are now completely protected by the smc_server_lgr_pending lock,
so it is enough to check only the available entries in rtokens_used_mask.
Fixes: cd6851f303 ("smc: remote memory buffers (RMBs)")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 0537f0a2151375dcf90c1bbfda6a0aaf57164e89
Author: D. Wythe <alibuda@linux.alibaba.com>
Date: Wed Mar 2 21:25:11 2022 +0800
net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error generated by client
The main reason for this unexpected SMC_CLC_DECL_ERR_REGRMB in the client
is the following execution sequence:
Server Conn A: Server Conn B: Client Conn B:
smc_lgr_unregister_conn
smc_lgr_register_conn
smc_clc_send_accept ->
smc_rtoken_add
smcr_buf_unuse
-> Client Conn A:
smc_rtoken_delete
smc_lgr_unregister_conn() makes the current link available to be assigned
to a new incoming connection while smcr_buf_unuse() has not executed yet,
which means that smc_rtoken_add may fail because of insufficient
rtoken_entry. Reversing their execution order avoids this problem.
Fixes: 3e034725c0 ("net/smc: common functions for RMBs and send buffers")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 6bf536eb5c8ca011d1ff57b5c5f7c57ceac06a37
Author: Dust Li <dust.li@linux.alibaba.com>
Date: Tue Mar 1 17:44:00 2022 +0800
net/smc: correct settings of RMB window update limit
rmbe_update_limit is used to keep receive window updates from
being announced too frequently. RFC7609 requests a minimal
increase in the window size of 10% of the receive buffer
space. But the current implementation used:
min_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2)
and SOCK_MIN_SNDBUF / 2 == 2304 Bytes, which is almost
always less than 10% of the receive buffer space.
This causes the receiver to always send a CDC message to
update its consumer cursor when it consumes more than 2K
of data. As a result, we may encounter something like
"TCP silly window syndrome" when sending 2.5~8K messages.
This patch fixes this using max(rmbe_size / 10, SOCK_MIN_SNDBUF / 2).
With this patch and SMC autocorking enabled, qperf 2K/4K/8K
tcp_bw test shows 45%/75%/40% increase in throughput respectively.
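The arithmetic behind the change can be checked with a small sketch. The helper names are hypothetical, and the constant 2304 is the value of SOCK_MIN_SNDBUF / 2 quoted in the message above.

```c
#include <assert.h>

/* SOCK_MIN_SNDBUF / 2, using the 2304-byte value quoted in the text. */
#define HALF_MIN_SNDBUF 2304

/* Old behaviour: the limit is capped at 2304 bytes. */
static int update_limit_old(int rmbe_size)
{
	int tenth = rmbe_size / 10;

	assert(rmbe_size >= 0);
	return tenth < HALF_MIN_SNDBUF ? tenth : HALF_MIN_SNDBUF;
}

/* New behaviour: the limit is at least 10% of the receive buffer. */
static int update_limit_new(int rmbe_size)
{
	int tenth = rmbe_size / 10;

	assert(rmbe_size >= 0);
	return tenth > HALF_MIN_SNDBUF ? tenth : HALF_MIN_SNDBUF;
}
```

For a 64 KiB receive buffer the old limit is 2304 bytes while the new one is 6553 bytes, so window updates are announced far less often.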
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 56d99e81ecbc997a5f984684d0eeb583992b2072
Author: Wen Gu <guwen@linux.alibaba.com>
Date: Sun Jan 16 15:43:42 2022 +0800
net/smc: Fix hung_task when removing SMC-R devices
A hung_task is observed when removing SMC-R devices. Suppose that
a link group has two active links (lnk_A, lnk_B) associated with two
different SMC-R devices (dev_A, dev_B). When dev_A is removed, the
link group will be removed from smc_lgr_list and added into
lgr_linkdown_list. lnk_A will be cleared and smcibdev(A)->lnk_cnt
will reach zero. However, when dev_B is then removed, the link
group can't be found in smc_lgr_list and lnk_B won't be cleared,
so smcibdev->lnk_cnt never reaches zero, which causes a hung_task.
This patch fixes this issue by restoring the implementation of
smc_smcr_terminate_all() to what it was before commit 349d43127dac
("net/smc: fix kernel panic caused by race of smc_sock"). The original
implementation also satisfies the intention of making sure the QP is
destroyed earlier than the CQ, because we always wait for
smcibdev->lnk_cnt to reach zero, which guarantees the QP has been destroyed.
Fixes: 349d43127dac ("net/smc: fix kernel panic caused by race of smc_sock")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 20c9398d3309d170300d67643b851fd26783af24
Author: Wen Gu <guwen@linux.alibaba.com>
Date: Thu Jan 13 16:36:42 2022 +0800
net/smc: Resolve the race between SMC-R link access and clear
We encountered some crashes caused by the race between SMC-R
link access and link clear that is triggered by abnormal link
group termination, such as a port error.
Here is an example of this kind of crash:
BUG: kernel NULL pointer dereference, address: 0000000000000000
Workqueue: smc_hs_wq smc_listen_work [smc]
RIP: 0010:smc_llc_flow_initiate+0x44/0x190 [smc]
Call Trace:
<TASK>
? __smc_buf_create+0x75a/0x950 [smc]
smcr_lgr_reg_rmbs+0x2a/0xbf [smc]
smc_listen_work+0xf72/0x1230 [smc]
? process_one_work+0x25c/0x600
process_one_work+0x25c/0x600
worker_thread+0x4f/0x3a0
? process_one_work+0x600/0x600
kthread+0x15d/0x1a0
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30
</TASK>
smc_listen_work() __smc_lgr_terminate()
---------------------------------------------------------------
| smc_lgr_free()
| |- smcr_link_clear()
| |- memset(lnk, 0)
smc_listen_rdma_reg() |
|- smcr_lgr_reg_rmbs() |
|- smc_llc_flow_initiate() |
|- access lnk->lgr (panic) |
These crashes are all caused by clearing SMC-R link
resources while some functions are still accessing them.
This patch tries to fix the issue by introducing a reference
count for SMC-R links and ensuring that the sensitive resources
of links won't be cleared until the reference count reaches zero.
The operations on the SMC-R link reference count can be summarized
as follows:
object [hold or initialized as 1] [put]
--------------------------------------------------------------------
links smcr_link_init() smcr_link_clear()
connections smc_conn_create() smc_conn_free()
In this way, SMC-R links are cleared only after all the smc
connections above them have been freed, thus avoiding
unsafe references to SMC-R links.
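The hold/put discipline in the table can be sketched with a plain atomic counter. This is a generic illustration, not the kernel implementation: struct link, link_init(), link_hold(), and link_put() are hypothetical names, and the cleared flag stands in for the actual resource teardown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified link with a reference count. */
struct link {
	atomic_int refcnt;
	bool cleared;
};

/* Link creation starts with one reference, owned by the link itself. */
static void link_init(struct link *l)
{
	atomic_init(&l->refcnt, 1);
	l->cleared = false;
}

/* Each connection that uses the link takes a reference. */
static void link_hold(struct link *l)
{
	atomic_fetch_add(&l->refcnt, 1);
}

/* Resources are cleared only when the last reference is dropped. */
static void link_put(struct link *l)
{
	int old = atomic_fetch_sub(&l->refcnt, 1);

	assert(old > 0);
	if (old == 1)
		l->cleared = true;	/* stand-in for clearing link resources */
}
```

If the link's own reference is dropped while a connection still holds one, nothing is cleared until that connection drops its reference too.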
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit ea89c6c0983c39702a4a52ccaa4702e0cb71179b
Author: Wen Gu <guwen@linux.alibaba.com>
Date: Thu Jan 13 16:36:41 2022 +0800
net/smc: Introduce a new conn->lgr validity check helper
It is no longer suitable to identify whether an smc connection
is registered in a link group by checking whether conn->lgr
is NULL, because conn->lgr won't be reset even when the connection
is unregistered from a link group.
So this patch introduces a new helper smc_conn_lgr_valid() and
replaces all the checks of conn->lgr in the original implementation
with the new helper to judge whether conn->lgr is valid to use.
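One way such a helper could look is sketched below. It assumes that registration is tracked by a nonzero token that is reset on unregistration; the token field and both struct definitions are assumptions made for illustration, not a statement of how smc_conn_lgr_valid() is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct lgr {
	int id;
};

/* Hypothetical, simplified connection. */
struct conn {
	struct lgr *lgr;	/* may stay set after unregistration */
	uint32_t token;		/* assumed: nonzero only while registered */
};

/* The pointer alone is no longer proof of registration. */
static bool conn_lgr_valid(const struct conn *c)
{
	assert(c != NULL);
	return c->lgr != NULL && c->token != 0;
}
```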
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 61f434b0280ed65495831f1b6e1a5c21a90f47c6
Author: Wen Gu <guwen@linux.alibaba.com>
Date: Thu Jan 13 16:36:40 2022 +0800
net/smc: Resolve the race between link group access and termination
We encountered some crashes caused by the race between the access
and the termination of link groups.
Here are some of the panic stacks we met:
1) Race between smc_clc_wait_msg() and __smc_lgr_terminate()
BUG: kernel NULL pointer dereference, address: 00000000000002f0
Workqueue: smc_hs_wq smc_listen_work [smc]
RIP: 0010:smc_clc_wait_msg+0x3eb/0x5c0 [smc]
Call Trace:
<TASK>
? smc_clc_send_accept+0x45/0xa0 [smc]
? smc_clc_send_accept+0x45/0xa0 [smc]
smc_listen_work+0x783/0x1220 [smc]
? finish_task_switch+0xc4/0x2e0
? process_one_work+0x1ad/0x3c0
process_one_work+0x1ad/0x3c0
worker_thread+0x4c/0x390
? rescuer_thread+0x320/0x320
kthread+0x149/0x190
? set_kthread_struct+0x40/0x40
ret_from_fork+0x1f/0x30
</TASK>
smc_listen_work() abnormal case like port error
---------------------------------------------------------------
| __smc_lgr_terminate()
| |- smc_conn_kill()
| |- smc_lgr_unregister_conn()
| |- set conn->lgr = NULL
smc_clc_wait_msg() |
|- access conn->lgr (panic) |
2) Race between smc_setsockopt() and __smc_lgr_terminate()
BUG: kernel NULL pointer dereference, address: 00000000000002e8
RIP: 0010:smc_setsockopt+0x17a/0x280 [smc]
Call Trace:
<TASK>
__sys_setsockopt+0xfc/0x190
__x64_sys_setsockopt+0x20/0x30
do_syscall_64+0x34/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
smc_setsockopt() abnormal case like port error
--------------------------------------------------------------
| __smc_lgr_terminate()
| |- smc_conn_kill()
| |- smc_lgr_unregister_conn()
| |- set conn->lgr = NULL
mod_delayed_work() |
|- access conn->lgr (panic) |
There are some other panic places with a similar cause
as described above: accessing the link
group after termination, and thus getting a NULL pointer or an invalid
resource.
Currently, there seems to be no synchronization between
link group access and its sudden termination. This patch
tries to fix this by introducing a reference count for the link group
and not freeing the link group until the reference count is zero.
A link group might be referred to by links or smc connections, so
the operations on the link group reference count can be summarized
as follows:
object [hold or initialized as 1] [put]
-------------------------------------------------------------------
link group smc_lgr_create() smc_lgr_free()
connections smc_conn_create() smc_conn_free()
links smcr_link_init() smcr_link_clear()
In this way, we extend the life cycle of the link group and
ensure it is longer than the life cycle of the connections and links
above it, so that invalid access to the link group after its
termination is avoided.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 36595d8ad46d9e4c41cc7c48c4405b7c3322deac
Author: Wen Gu <guwen@linux.alibaba.com>
Date: Thu Jan 6 20:42:08 2022 +0800
net/smc: Reset conn->lgr when link group registration fails
SMC connections might fail to be registered in a link group because
no usable link can be found during creation. As a result,
smc_conn_create() will return a failure and most resources related
to the connection won't be allocated or initialized, such as
conn->abort_work or conn->lnk.
If smc_conn_free() is invoked later, it will try to access the
uninitialized resources related to the connection, thus causing
a warning or crash.
This patch tries to fix this by resetting conn->lgr to NULL if an
abnormal exit occurs in smc_lgr_register_conn(), thus avoiding the
access to uninitialized resources in smc_conn_free().
Meanwhile, the newly created link group should be terminated if smc
connections can't be registered in it. So smc_lgr_cleanup_early() is
modified to take care of the link group only, and is invoked by
smc_conn_create() to terminate the unusable link group. The call to
smc_conn_free() is moved out from smc_lgr_cleanup_early() to smc_conn_abort().
Fixes: 56bc3b2094 ("net/smc: assign link to a new connection")
Suggested-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit de2fea7b39bfa1ee9db8726f7b71d54fec385d80
Author: Tony Lu <tonylu@linux.alibaba.com>
Date: Tue Dec 28 21:06:11 2021 +0800
net/smc: Print net namespace in log
This adds the net namespace ID to the kernel log. net_cookie is unique in
the whole system, which is useful in container environments.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 79d39fc503b43b566feae5bc9a57dfcffdf41bd1
Author: Tony Lu <tonylu@linux.alibaba.com>
Date: Tue Dec 28 21:06:10 2021 +0800
net/smc: Add netlink net namespace support
This adds the net namespace ID to the diag of the linkgroup, which helps
us to distinguish different namespaces; net_cookie is unique across the
whole system.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 0237a3a683e4844ddc52782d83d439d6192e11f9
Author: Tony Lu <tonylu@linux.alibaba.com>
Date: Tue Dec 28 21:06:09 2021 +0800
net/smc: Introduce net namespace support for linkgroup
Currently, rdma devices support exclusive net namespace isolation, but
a linkgroup neither knows about nor supports the ibdev net namespace.
Applications in containers don't want to share the nics if rdma
exclusive mode is enabled. Every net namespace should have its own
linkgroups.
This patch introduces a new field, net, for the linkgroup, which stands
for the ibdev net namespace of the linkgroup. The net in the linkgroup
is initialized with the net namespace of the link's ibdev. The net of
the linkgroup is compared with that of the sock or ibdev before the
linkgroup is chosen; if none matches, a new one is created in the
current net namespace. If rdma net namespace exclusive mode is not
enabled, it behaves as before.
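The lookup described above can be sketched like this. It is a minimal
user-space illustration: struct net, struct link_group and find_lgr()
are hypothetical names chosen for the example, not the kernel's
definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct net { unsigned long cookie; };		/* hypothetical namespace handle */
struct link_group { const struct net *net; };	/* namespace the group belongs to */

/* A link group is eligible only if it lives in the caller's namespace. */
static bool lgr_matches_net(const struct link_group *lgr, const struct net *net)
{
	return lgr->net == net;
}

/* Return the first link group in the given namespace, or NULL, in which
 * case the caller is expected to create a new one in that namespace. */
static struct link_group *find_lgr(struct link_group *lgrs, size_t n,
				   const struct net *net)
{
	assert(lgrs || n == 0);
	for (size_t i = 0; i < n; i++)
		if (lgr_matches_net(&lgrs[i], net))
			return &lgrs[i];
	return NULL;
}
```

A NULL result corresponds to the "if none matches, create a new one"
case in the description.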
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 349d43127dac00c15231e8ffbcaabd70f7b0e544
Author: Dust Li <dust.li@linux.alibaba.com>
Date: Tue Dec 28 17:03:25 2021 +0800
net/smc: fix kernel panic caused by race of smc_sock
A crash occurs when smc_cdc_tx_handler() tries to access smc_sock
but smc_release() has already freed it.
[ 4570.695099] BUG: unable to handle page fault for address: 000000002eae9e88
[ 4570.696048] #PF: supervisor write access in kernel mode
[ 4570.696728] #PF: error_code(0x0002) - not-present page
[ 4570.697401] PGD 0 P4D 0
[ 4570.697716] Oops: 0002 [#1] PREEMPT SMP NOPTI
[ 4570.698228] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.16.0-rc4+ #111
[ 4570.699013] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/0
[ 4570.699933] RIP: 0010:_raw_spin_lock+0x1a/0x30
<...>
[ 4570.711446] Call Trace:
[ 4570.711746] <IRQ>
[ 4570.711992] smc_cdc_tx_handler+0x41/0xc0
[ 4570.712470] smc_wr_tx_tasklet_fn+0x213/0x560
[ 4570.712981] ? smc_cdc_tx_dismisser+0x10/0x10
[ 4570.713489] tasklet_action_common.isra.17+0x66/0x140
[ 4570.714083] __do_softirq+0x123/0x2f4
[ 4570.714521] irq_exit_rcu+0xc4/0xf0
[ 4570.714934] common_interrupt+0xba/0xe0
Though smc_cdc_tx_handler() checked the existence of smc connection,
smc_release() may have already dismissed and released the smc socket
before smc_cdc_tx_handler() further visits it.
smc_cdc_tx_handler() |smc_release()
if (!conn) |
|
|smc_cdc_tx_dismiss_slots()
| smc_cdc_tx_dismisser()
|
|sock_put(&smc->sk) <- last sock_put,
| smc_sock freed
bh_lock_sock(&smc->sk) (panic) |
To make sure we won't receive any CDC messages after we free the
smc_sock, add a refcount on the smc_connection for inflight CDC
messages (posted to the QP but the related CQE not yet received), and
don't release the smc_connection until all the inflight CDC messages
have completed, whether they succeeded or failed.
Using a refcount on CDC messages brings another problem: when the link
is going to be destroyed, smcr_link_clear() will reset the QP, which
then removes all the pending CQEs related to the QP from the CQ. To
make sure all the CQEs always come back so the refcount on the
smc_connection can always reach 0, smc_ib_modify_qp_reset() is replaced
by smc_ib_modify_qp_error().
Also remove the timeout in smc_wr_tx_wait_no_pending_sends(), since we
need to wait until all pending WQEs are done, or we may encounter a
use-after-free when handling CQEs.
For the IB device removal routine, we need to wait until all the QPs on
that device have been destroyed before we can destroy the CQs on the
device, or the refcount on smc_connection won't reach 0 and the
smc_sock cannot be released.
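The inflight accounting can be sketched with C11 atomics as below. This
is an illustration only: the struct, the inflight counter and the three
helpers are hypothetical, and the real patch works on smc_connection
inside the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical connection with a counter of posted-but-uncompleted
 * messages. */
struct connection {
	atomic_int inflight;
	bool freed;
};

/* Called when a message is posted to the queue pair. */
static void cdc_post(struct connection *conn)
{
	atomic_fetch_add(&conn->inflight, 1);
}

/* Called from the completion handler, for successful and failed
 * completions alike. Returns true when the last inflight message is
 * done. */
static bool cdc_complete(struct connection *conn)
{
	int prev = atomic_fetch_sub(&conn->inflight, 1);

	assert(prev > 0);
	return prev == 1;
}

/* The connection may only be released once nothing is inflight. */
static bool conn_try_free(struct connection *conn)
{
	if (atomic_load(&conn->inflight) != 0)
		return false;
	conn->freed = true;
	return true;
}
```

The counter only reaches zero if every posted message produces a
completion, which is why the description insists that the CQEs must
always come back.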
Fixes: 5f08318f61 ("smc: connection data control (CDC)")
Reported-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 90cee52f2e780345d3629e278291aea5ac74f40f
Author: Dust Li <dust.li@linux.alibaba.com>
Date: Tue Dec 28 17:03:24 2021 +0800
net/smc: don't send CDC/LLC message if link not ready
We found that smc_llc_send_link_delete_all() sometimes waits
for the 2s timeout when testing with RDMA link up/down.
This is possible when a smc_link is in ACTIVATING state
while the underlying QP is still in RESET or RTR state and
cannot send any messages.
smc_llc_send_link_delete_all() uses smc_link_usable() to
check whether the link is usable. If the QP is still in
RESET or RTR state but the smc_link is in ACTIVATING, this
LLC message will always fail without any CQE entering the
CQ, and we will always wait 2s before timing out.
Since we cannot send any messages through the QP before
the QP enters RTS, add a wrapper smc_link_sendable()
which checks the state of the QP along with the link state,
and replace smc_link_usable() with smc_link_sendable()
in all LLC & CDC message sending routines.
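The difference between "usable" and "sendable" can be sketched as
follows. The enums, struct link and both helpers are simplified,
hypothetical stand-ins for the example and do not reproduce the kernel
definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical link and queue pair states. */
enum link_state { LNK_UNUSED, LNK_INACTIVE, LNK_ACTIVATING, LNK_ACTIVE };
enum qp_state { QP_RESET, QP_INIT, QP_RTR, QP_RTS };

struct link {
	enum link_state state;
	enum qp_state qp_state;
};

/* Usable: the link is being set up or is fully active. */
static bool link_usable(const struct link *lnk)
{
	assert(lnk);
	return lnk->state == LNK_ACTIVATING || lnk->state == LNK_ACTIVE;
}

/* Sendable: usable, and the queue pair has reached ready-to-send. */
static bool link_sendable(const struct link *lnk)
{
	return link_usable(lnk) && lnk->qp_state == QP_RTS;
}
```

A link in ACTIVATING with its QP still in RTR passes the first check
but not the second, which is the case that led to the 2s wait.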
Fixes: 5f08318f61 ("smc: connection data control (CDC)")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 1c5526968e270e4efccfa1da21d211a4915cdeda
Author: Tony Lu <tonylu@linux.alibaba.com>
Date: Fri Dec 3 12:33:31 2021 +0100
net/smc: Clear memory when release and reuse buffer
Currently, buffers are cleared when smc connections are created and
buffers are reused. This slows down the establishment of new
connections. In most cases, applications want to establish
connections as quickly as possible.
This patch moves memset() from the connection creation path to the
release and buffer unuse path, trading a slower release for faster
connection establishment.
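Moving the clearing from the get path to the release path can be
sketched like this. It is a single-threaded illustration with
hypothetical names (struct buf_desc here is not the kernel's
smc_buf_desc), and it ignores the ordering and barrier concerns that
matter when several CPUs race for the same buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_LEN 64

/* Hypothetical buffer descriptor for the example. */
struct buf_desc {
	unsigned char data[BUF_LEN];
	int used;
};

/* Release path: pay for the memset here, then mark the buffer free. */
static void buf_release(struct buf_desc *buf)
{
	assert(buf);
	memset(buf->data, 0, sizeof(buf->data));
	buf->used = 0;
}

/* Reuse path: claim the first free buffer without clearing it again. */
static struct buf_desc *buf_get(struct buf_desc *pool, size_t n)
{
	assert(pool || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!pool[i].used) {
			pool[i].used = 1;
			return &pool[i];
		}
	}
	return NULL;
}
```

A buffer handed out by buf_get() is already zeroed because the previous
owner cleared it on release, so connection setup no longer pays for the
memset.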
Test environments:
- CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4
- socket sndbuf / rcvbuf: 16384 / 131072 bytes
- w/o first round, 5 rounds, avg, 100 conns batch per round
- smc_buf_create() use bpftrace kprobe, introduces extra latency
Latency benchmarks for smc_buf_create():
w/o patch : 19040.0 ns
w/ patch : 1932.6 ns
ratio : 10.2% (-89.8%)
Latency benchmarks for socket create and connect:
w/o patch : 143.3 us
w/ patch : 102.2 us
ratio : 71.3% (-28.7%)
The latency of establishing connections is reduced by 28.7%.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20211203113331.2818873-1-kgraul@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 587acad41f1bc48e16f42bb2aca63bf323380be8
Author: Karsten Graul <kgraul@linux.ibm.com>
Date: Wed Nov 24 13:32:37 2021 +0100
net/smc: Fix NULL pointer dereferencing in smc_vlan_by_tcpsk()
Coverity reports a possible NULL dereferencing problem:
in smc_vlan_by_tcpsk():
6. returned_null: netdev_lower_get_next returns NULL (checked 29 out of 30 times).
7. var_assigned: Assigning: ndev = NULL return value from netdev_lower_get_next.
1623 ndev = (struct net_device *)netdev_lower_get_next(ndev, &lower);
CID 1468509 (#1 of 1): Dereference null return value (NULL_RETURNS)
8. dereference: Dereferencing a pointer that might be NULL ndev when calling is_vlan_dev.
1624 if (is_vlan_dev(ndev)) {
Remove the manual implementation and use netdev_walk_all_lower_dev() to
iterate over the lower devices. While at it, remove an obsolete function
parameter comment.
Fixes: cb9d43f677 ("net/smc: determine vlan_id of stacked net_device")
Suggested-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit cf4f5530bb55ef7d5a91036b26676643b80b1616
Author: Wen Gu <guwen@linux.alibaba.com>
Date: Mon Nov 15 17:45:07 2021 +0800
net/smc: Make sure the link_id is unique
The link_id is supposed to be unique, but smcr_next_link_id() doesn't
skip the used link_id as expected. So the patch fixes this.
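Skipping ids that are already taken can be sketched as follows. The
structs, MAX_LINKS and both helpers are hypothetical and only
illustrate the idea; skipping zero is an assumption of this example,
and the code is not the kernel's smcr_next_link_id().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LINKS 3

/* Hypothetical link and link group for the example. */
struct link { bool in_use; uint8_t link_id; };
struct link_group {
	struct link lnk[MAX_LINKS];
	uint8_t next_link_id;
};

/* True if some active link in the group already carries this id. */
static bool id_in_use(const struct link_group *lgr, uint8_t id)
{
	for (int i = 0; i < MAX_LINKS; i++)
		if (lgr->lnk[i].in_use && lgr->lnk[i].link_id == id)
			return true;
	return false;
}

/* Advance the 8-bit counter until it yields a nonzero id that no
 * active link uses. At most MAX_LINKS ids are taken, so this ends. */
static uint8_t next_link_id(struct link_group *lgr)
{
	uint8_t id;

	assert(lgr);
	do {
		id = ++lgr->next_link_id;	/* wraps from 255 to 0 */
	} while (id == 0 || id_in_use(lgr, id));
	return id;
}
```

Because the counter wraps around, an id handed out long ago may still
belong to a live link, so the in-use check is needed after a wrap.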
Fixes: 026c381fb4 ("net/smc: introduce link_idx for link group array")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit a3a0e81b6fd55745e100735c7667cd99a0650811
Author: Tony Lu <tonylu@linux.alibaba.com>
Date: Mon Nov 1 15:39:16 2021 +0800
net/smc: Introduce tracepoint for smcr link down
The SMC-R link down event is important for finding link issues, so we
should track it, especially in single nic mode, where it means the
upper layer connection will be shut down. The tracepoint helps to find
the direct link-down reason in time: not only the increased counter,
but also the code location that triggered the event.
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit b0539f5eddc2eefd24378bda3ee9cbbca916f58d
Author: Karsten Graul <kgraul@linux.ibm.com>
Date: Sat Oct 16 11:37:51 2021 +0200
net/smc: add netlink support for SMC-Rv2
Implement the netlink support for SMC-Rv2 related attributes that are
provided to user space.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 8799e310fb3f15759824a78b6b93d7e6d5def067
Author: Karsten Graul <kgraul@linux.ibm.com>
Date: Sat Oct 16 11:37:49 2021 +0200
net/smc: add v2 support to the work request layer
In the work request layer define one large v2 buffer for each link group
that is used to transmit and receive large LLC control messages.
Add the completion queue handling for this buffer.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 24fb68111d4509524b483b2577f1b20a24f5fdfd
Author: Karsten Graul <kgraul@linux.ibm.com>
Date: Sat Oct 16 11:37:48 2021 +0200
net/smc: retrieve v2 gid from IB device
In smc_ib.c, scan for RoCE devices that support UDP encapsulation.
Find an eligible device and check that there is a route to the
remote peer.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit e49300a6bf6218c835403545e9356141a6340181
Author: Karsten Graul <kgraul@linux.ibm.com>
Date: Sat Oct 16 11:37:46 2021 +0200
net/smc: add listen processing for SMC-Rv2
Implement the server side of the SMC-Rv2 processing. Process incoming
CLC messages, find eligible devices and check for a valid route to the
remote peer.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 11a26c59fc510091facd0d80236ac848da844830
Author: Karsten Graul <kgraul@linux.ibm.com>
Date: Tue Sep 14 10:35:06 2021 +0200
net/smc: keep static copy of system EID
The system EID is retrieved using a registered ISM device each time
it is needed. This adds some unnecessary complexity at all places where
the system EID is needed, but no ISM device is at hand.
Simplify the code and save the system EID in a static variable in
smc_ism.c.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2044294
Upstream Status: https://github.com/torvalds/linux.git
Tested: by IBM
Build-info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=45951016
Conflicts: None
commit 67161779a9ea926fccee8de047ae66cbd3482b91
Author: Stefan Raspl <raspl@linux.ibm.com>
Date: Mon Aug 9 10:10:14 2021 +0200
net/smc: Allow SMC-D 1MB DMB allocations
Commit a3fe3d01bd ("net/smc: introduce sg-logic for RMBs") introduced
a restriction for RMB allocations as used by SMC-R. However, SMC-D does
not use scatter-gather lists to back its DMBs, yet it was still limited
by this restriction.
This patch exempts SMC-D, but limits allocations to the maximum RMB/DMB
size respectively.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Tobias Huschle <thuschle@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1869652
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Build Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=40812265
Tested: by IBM
Conflicts: None
commit 95f7f3e7dc6bd2e735cb5de11734ea2222b1e05a
Author: Karsten Graul <kgraul@linux.ibm.com>
Date: Thu Oct 7 16:14:40 2021 +0200
net/smc: improved fix wait on already cleared link
Commit 8f3d65c166 ("net/smc: fix wait on already cleared link")
introduced link refcounting to avoid waits on already cleared links.
This patch extends and improves the refcounting to cover all
remaining possible cases for this kind of error situation.
Fixes: 15e1b99aad ("net/smc: no WR buffer wait for terminating link group")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mete Durlu <mdurlu@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1869652
Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Build Info: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=40812265
Tested: by IBM
Conflicts: None
commit a18cee4791b1123d0a6579a7c89f4b87e48abe03
Author: Karsten Graul <kgraul@linux.ibm.com>
Date: Mon Sep 20 21:18:15 2021 +0200
net/smc: fix 'workqueue leaked lock' in smc_conn_abort_work
The abort_work is scheduled when a connection was detected to be
out-of-sync after a link failure. The work calls smc_conn_kill(),
which calls smc_close_active_abort() and that might end up calling
smc_close_cancel_work().
smc_close_cancel_work() cancels any pending close_work and tx_work but
needs to release the sock_lock before doing so and acquires the
sock_lock again afterwards. So when the sock_lock was NOT acquired
before, it may be held after the abort_work completes. That's why the
sock_lock is acquired before the call to smc_conn_kill() in
__smc_lgr_terminate(), but this is missing in smc_conn_abort_work().
Fix that by acquiring the sock_lock first and releasing it after the
call to smc_conn_kill().
Fixes: b286a0651e ("net/smc: handle incoming CDC validation message")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mete Durlu <mdurlu@redhat.com>
SMC clients may be assigned to a different link after the initial
connection between two peers was established. In such a case,
the connection counter was not correctly set.
Update the connection counter correctly when a smc client connection
is assigned to a different smc link.
Fixes: 07d51580ff ("net/smc: Add connection counters for links")
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Tested-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Make the gathered SMC statistics network namespace aware: a separate
set of statistics is collected for each namespace.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the ability to collect SMC statistics information. Per-cpu
variables are used to collect the statistic information for better
performance and for reducing concurrency pitfalls. The code that is
collecting statistic data is implemented in macros to increase code
reuse and readability.
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
smc_lgr_cleanup() calls smcd_unregister_all_dmbs() as part of the link
group termination process. This is a leftover from the times when
smc_lgr_cleanup() scheduled a worker to actually free the link group.
Nowadays smc_lgr_cleanup() directly calls smc_lgr_free() without any
delay so an earlier dmb unregistration is no longer needed.
So remove smcd_unregister_all_dmbs() and the call to it.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Using snprintf() to convert non-null-terminated strings to
null-terminated strings may cause an out-of-bounds read in the source
string. Therefore use memcpy() and terminate the target string with a
null afterwards.
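The memcpy()-and-terminate pattern can be sketched as below.
copy_fixed_field() is a hypothetical helper written for this example,
not a function from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a fixed-width field that may lack a terminating null into dst
 * and terminate it. Exactly src_len bytes are read from src, so the
 * read never runs past the end of the field. */
static void copy_fixed_field(char *dst, size_t dst_size,
			     const char *src, size_t src_len)
{
	assert(dst && src && dst_size > src_len);
	memcpy(dst, src, src_len);
	dst[src_len] = '\0';
}
```

snprintf() with "%s" keeps reading the source until it finds a null
byte, which a fixed-width field does not guarantee. memcpy() with an
explicit length avoids that.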
Fixes: a3db10efcc ("net/smc: Add support for obtaining SMCR device list")
Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
smc_clc_get_hostname() sets the host pointer to a buffer
which is not NULL-terminated (see smc_clc_init()).
Reported-by: syzbot+f4708c391121cfc58396@syzkaller.appspotmail.com
Fixes: 099b990bd1 ("net/smc: Add support for obtaining system information")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>