Commit Graph

133 Commits

Author SHA1 Message Date
Jeff Moyer 44b5887778 io_uring/sqpoll: ensure task state is TASK_RUNNING when running task_work
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 8f7033aa4089fbaf7a33995f0f2ee6c9d7b9ca1b
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu Oct 17 08:31:56 2024 -0600

    io_uring/sqpoll: ensure task state is TASK_RUNNING when running task_work
    
    When the sqpoll is exiting and cancels pending work items, it may need
    to run task_work. If this happens from within io_uring_cancel_generic(),
    then it may be waiting on the io_uring_task waitqueue. This results in
    the below splat from the scheduler, as the ring mutex may be grabbed
    while in a TASK_INTERRUPTIBLE state.
    
    Ensure that the task state is set appropriately for that, just like what
    is done for the other cases in io_run_task_work().
    
    do not call blocking ops when !TASK_RUNNING; state=1 set at [<0000000029387fd2>] prepare_to_wait+0x88/0x2fc
    WARNING: CPU: 6 PID: 59939 at kernel/sched/core.c:8561 __might_sleep+0xf4/0x140
    Modules linked in:
    CPU: 6 UID: 0 PID: 59939 Comm: iou-sqp-59938 Not tainted 6.12.0-rc3-00113-g8d020023b155 #7456
    Hardware name: linux,dummy-virt (DT)
    pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
    pc : __might_sleep+0xf4/0x140
    lr : __might_sleep+0xf4/0x140
    sp : ffff80008c5e7830
    x29: ffff80008c5e7830 x28: ffff0000d93088c0 x27: ffff60001c2d7230
    x26: dfff800000000000 x25: ffff0000e16b9180 x24: ffff80008c5e7a50
    x23: 1ffff000118bcf4a x22: ffff0000e16b9180 x21: ffff0000e16b9180
    x20: 000000000000011b x19: ffff80008310fac0 x18: 1ffff000118bcd90
    x17: 30303c5b20746120 x16: 74657320313d6574 x15: 0720072007200720
    x14: 0720072007200720 x13: 0720072007200720 x12: ffff600036c64f0b
    x11: 1fffe00036c64f0a x10: ffff600036c64f0a x9 : dfff800000000000
    x8 : 00009fffc939b0f6 x7 : ffff0001b6327853 x6 : 0000000000000001
    x5 : ffff0001b6327850 x4 : ffff600036c64f0b x3 : ffff8000803c35bc
    x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000e16b9180
    Call trace:
     __might_sleep+0xf4/0x140
     mutex_lock+0x84/0x124
     io_handle_tw_list+0xf4/0x260
     tctx_task_work_run+0x94/0x340
     io_run_task_work+0x1ec/0x3c0
     io_uring_cancel_generic+0x364/0x524
     io_sq_thread+0x820/0x124c
     ret_from_fork+0x10/0x20
    
    Cc: stable@vger.kernel.org
    Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-12-02 11:14:54 -05:00
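
For the fix above (upstream 8f7033aa4089), the core of the change is restoring
the task state before the SQPOLL thread runs task_work. A minimal, hedged
sketch of the idea, not the verbatim upstream diff:

    /* task_work may take ctx->uring_lock, a sleeping mutex, so we must not
     * still be in TASK_INTERRUPTIBLE left over from prepare_to_wait() */
    __set_current_state(TASK_RUNNING);
    io_run_task_work();
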
Jeff Moyer ac9c2b1e93 io_uring/sqpoll: close race on waiting for sqring entries
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 28aabffae6be54284869a91cd8bccd3720041129
Author: Jens Axboe <axboe@kernel.dk>
Date:   Tue Oct 15 08:58:25 2024 -0600

    io_uring/sqpoll: close race on waiting for sqring entries
    
    When an application uses SQPOLL, it must wait for the SQPOLL thread to
    consume SQE entries, if it fails to get an sqe when calling
    io_uring_get_sqe(). It can do so by calling io_uring_enter(2) with the
    flag value of IORING_ENTER_SQ_WAIT. In liburing, this is generally done
    with io_uring_sqring_wait(). There's a natural expectation that once
    this call returns, a new SQE entry can be retrieved, filled out, and
    submitted. However, the kernel uses the cached sq head to determine if
    the SQRING is full or not. If the SQPOLL thread is currently in the
    process of submitting SQE entries, it may have updated the cached sq
    head, but not yet committed it to the SQ ring. Hence the kernel may find
    that there are SQE entries ready to be consumed, and return successfully
    to the application. If the SQPOLL thread hasn't yet committed the SQ
    ring entries by the time the application returns to userspace and
    attempts to get a new SQE, it will fail getting a new SQE.
    
    Fix this by having io_sqring_full() always use the user visible SQ ring
    head entry, rather than the internally cached one.
    
    Cc: stable@vger.kernel.org # 5.10+
    Link: https://github.com/axboe/liburing/discussions/1267
    Reported-by: Benedek Thaler <thaler@thaler.hu>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-12-02 11:14:53 -05:00
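
A hedged sketch of what the fix above (upstream 28aabffae6be) amounts to:
io_sqring_full() compares against the head index userspace actually sees,
rather than the copy the SQPOLL thread has cached but not yet committed.

    static inline bool io_sqring_full(struct io_ring_ctx *ctx)
    {
        struct io_rings *r = ctx->rings;

        /* use the user-visible head instead of ctx->cached_sq_head, so the
         * ring is not reported as having room before the SQPOLL thread has
         * committed the entries it consumed */
        return READ_ONCE(r->sq.tail) - READ_ONCE(r->sq.head) == ctx->sq_entries;
    }
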
Jeff Moyer ba52121989 io_uring: check for non-NULL file pointer in io_file_can_poll()
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 5fc16fa5f13b3c06fdb959ef262050bd810416a2
Author: Jens Axboe <axboe@kernel.dk>
Date:   Sat Jun 1 12:25:35 2024 -0600

    io_uring: check for non-NULL file pointer in io_file_can_poll()
    
    In earlier kernels, it was possible to trigger a NULL pointer
    dereference off the forced async preparation path, if no file had
    been assigned. The trace leading to that looks as follows:
    
    BUG: kernel NULL pointer dereference, address: 00000000000000b0
    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP
    CPU: 67 PID: 1633 Comm: buf-ring-invali Not tainted 6.8.0-rc3+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 2/2/2022
    RIP: 0010:io_buffer_select+0xc3/0x210
    Code: 00 00 48 39 d1 0f 82 ae 00 00 00 48 81 4b 48 00 00 01 00 48 89 73 70 0f b7 50 0c 66 89 53 42 85 ed 0f 85 d2 00 00 00 48 8b 13 <48> 8b 92 b0 00 00 00 48 83 7a 40 00 0f 84 21 01 00 00 4c 8b 20 5b
    RSP: 0018:ffffb7bec38c7d88 EFLAGS: 00010246
    RAX: ffff97af2be61000 RBX: ffff97af234f1700 RCX: 0000000000000040
    RDX: 0000000000000000 RSI: ffff97aecfb04820 RDI: ffff97af234f1700
    RBP: 0000000000000000 R08: 0000000000200030 R09: 0000000000000020
    R10: ffffb7bec38c7dc8 R11: 000000000000c000 R12: ffffb7bec38c7db8
    R13: ffff97aecfb05800 R14: ffff97aecfb05800 R15: ffff97af2be5e000
    FS:  00007f852f74b740(0000) GS:ffff97b1eeec0000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000000000b0 CR3: 000000016deab005 CR4: 0000000000370ef0
    Call Trace:
     <TASK>
     ? __die+0x1f/0x60
     ? page_fault_oops+0x14d/0x420
     ? do_user_addr_fault+0x61/0x6a0
     ? exc_page_fault+0x6c/0x150
     ? asm_exc_page_fault+0x22/0x30
     ? io_buffer_select+0xc3/0x210
     __io_import_iovec+0xb5/0x120
     io_readv_prep_async+0x36/0x70
     io_queue_sqe_fallback+0x20/0x260
     io_submit_sqes+0x314/0x630
     __do_sys_io_uring_enter+0x339/0xbc0
     ? __do_sys_io_uring_register+0x11b/0xc50
     ? vm_mmap_pgoff+0xce/0x160
     do_syscall_64+0x5f/0x180
     entry_SYSCALL_64_after_hwframe+0x46/0x4e
    RIP: 0033:0x55e0a110a67e
    Code: ba cc 00 00 00 45 31 c0 44 0f b6 92 d0 00 00 00 31 d2 41 b9 08 00 00 00 41 83 e2 01 41 c1 e2 04 41 09 c2 b8 aa 01 00 00 0f 05 <c3> 90 89 30 eb a9 0f 1f 40 00 48 8b 42 20 8b 00 a8 06 75 af 85 f6
    
    because the request is marked forced ASYNC and has a bad file fd, and
    hence takes the forced async prep path.
    
    Current kernels with the request async prep cleaned up can no longer hit
    this issue, but for ease of backporting, let's add this safety check in
    here too as it really doesn't hurt. For both cases, this will inevitably
    end with a CQE posted with -EBADF.
    
    Cc: stable@vger.kernel.org
    Fixes: a76c0b31eef5 ("io_uring: commit non-pollable provided mapped buffers upfront")
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-12-02 11:12:48 -05:00
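
The safety check described above boils down to guarding the file dereference.
A hedged sketch of the helper's shape after the fix (upstream 5fc16fa5f13b);
details may differ from the exact code:

    static inline bool io_file_can_poll(struct io_kiocb *req)
    {
        if (req->flags & REQ_F_CAN_POLL)
            return true;
        /* the fix: req->file may still be NULL on the forced-async prep path */
        if (req->file && file_can_poll(req->file)) {
            req->flags |= REQ_F_CAN_POLL;
            return true;
        }
        return false;
    }
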
Jeff Moyer f8b136892d io_uring/rw: reinstate thread check for retries
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 039a2e800bcd5beb89909d1a488abf3d647642cf
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu Apr 25 09:04:32 2024 -0600

    io_uring/rw: reinstate thread check for retries
    
    Allowing retries for everything is arguably the right thing to do, now
    that every command type is async read from the start. But it's exposed a
    few issues around missing check for a retry (which cca6571381a0 exposed),
    and the fixup commit for that isn't necessarily 100% sound in terms of
    iov_iter state.
    
    For now, just revert these two commits. This unfortunately then re-opens
    the fact that -EAGAIN can get bubbled to userspace for some cases where
    the kernel very well could just sanely retry them. But until we have all
    the conditions covered around that, we cannot safely enable that.
    
    This reverts commit df604d2ad480fcf7b39767280c9093e13b1de952.
    This reverts commit cca6571381a0bdc88021a1f7a4c2349df21279f7.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 17:42:44 -05:00
Jeff Moyer 67a075237e io_uring/rw: ensure retry condition isn't lost
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit df604d2ad480fcf7b39767280c9093e13b1de952
Author: Jens Axboe <axboe@kernel.dk>
Date:   Wed Apr 17 09:23:55 2024 -0600

    io_uring/rw: ensure retry condition isn't lost
    
    A previous commit removed the checking on whether or not it was possible
    to retry a request, since it's now possible to retry any of them. This
    would previously have caused the request to have been ended with an error,
    but now the retry condition can simply get lost instead.
    
    Cleanup the retry handling and always just punt it to task_work, which
    will queue it with io-wq appropriately.
    
    Reported-by: Changhui Zhong <czhong@redhat.com>
    Tested-by: Ming Lei <ming.lei@redhat.com>
    Fixes: cca6571381a0 ("io_uring/rw: cleanup retry path")
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 17:32:44 -05:00
Jeff Moyer 9e9aaab6e3 io_uring: unexport io_req_cqe_overflow()
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit a5bff51850c8d533f3696d45749ab169dd49f8dd
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Wed Apr 10 02:26:51 2024 +0100

    io_uring: unexport io_req_cqe_overflow()
    
    There are no users of io_req_cqe_overflow() apart from io_uring.c, make
    it static.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/f4295eb2f9eb98d5db38c0578f57f0b86bfe0d8c.1712708261.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 17:21:44 -05:00
Jeff Moyer d45024afd9 io_uring: move mapping/allocation helpers to a separate file
JIRA: https://issues.redhat.com/browse/RHEL-64867
Conflicts: RHEL does not have commit 5e0a760b4441 ("mm, treewide:
rename MAX_ORDER to MAX_PAGE_ORDER").

commit f15ed8b4d0ce2c0831232ff85117418740f0c529
Author: Jens Axboe <axboe@kernel.dk>
Date:   Wed Mar 27 14:59:09 2024 -0600

    io_uring: move mapping/allocation helpers to a separate file
    
    Move the related code from io_uring.c into memmap.c. No functional
    changes in this patch, just cleaning it up a bit now that the full
    transition is done.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 17:09:44 -05:00
Jeff Moyer f5d5b2e624 io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 87585b05757dc70545efb434669708d276125559
Author: Jens Axboe <axboe@kernel.dk>
Date:   Tue Mar 12 20:24:21 2024 -0600

    io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring
    
    Rather than use remap_pfn_range() for this and manually free later,
    switch to using vm_insert_page() and have it Just Work.
    
    This requires a bit of effort on the mmap lookup side, as the ctx
    uring_lock, which otherwise protects buffer_lists from being torn down,
    isn't held, and it's not safe to grab it from mmap context as that would
    introduce an ABBA deadlock between the mmap lock and the ctx uring_lock.
    Instead, look up the buffer_list under RCU, as the list is RCU freed
    already. Use the existing reference count to determine whether it's
    possible to safely grab a reference to it (eg if it's not zero already),
    and drop that reference when done with the mapping. If the mmap
    reference is the last one, the buffer_list and the associated memory can
    go away, since the vma insertion has references to the inserted pages at
    that point.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 17:07:44 -05:00
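
The mmap-side lookup described above can be pictured as an RCU lookup plus a
conditional reference grab. The sketch below is illustrative only; the helper
name and field names are assumptions rather than the exact upstream code:

    /* hedged sketch: find a buffer_list from mmap context without uring_lock */
    static struct io_buffer_list *io_pbuf_get_bl(struct io_ring_ctx *ctx,
                                                 unsigned long bgid)
    {
        struct io_buffer_list *bl;
        bool got_ref = false;

        rcu_read_lock();
        bl = xa_load(&ctx->io_bl_xa, bgid);             /* list is RCU freed */
        if (bl)
            got_ref = atomic_inc_not_zero(&bl->refs);   /* skip if already dying */
        rcu_read_unlock();

        return got_ref ? bl : ERR_PTR(-EINVAL);
    }
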
Jeff Moyer 3a48bb39e7 io_uring: get rid of remap_pfn_range() for mapping rings/sqes
JIRA: https://issues.redhat.com/browse/RHEL-64867
Conflicts: RHEL does not have commit 5e0a760b4441 ("mm, treewide:
rename MAX_ORDER to MAX_PAGE_ORDER").

commit 3ab1db3c6039e02a9deb9d5091d28d559917a645
Author: Jens Axboe <axboe@kernel.dk>
Date:   Wed Mar 13 09:56:14 2024 -0600

    io_uring: get rid of remap_pfn_range() for mapping rings/sqes
    
    Rather than use remap_pfn_range() for this and manually free later,
    switch to using vm_insert_pages() and have it Just Work.
    
    If possible, allocate a single compound page that covers the range that
    is needed. If that works, then we can just use page_address() on that
    page. If we fail to get a compound page, allocate single pages and use
    vmap() to map them into the kernel virtual address space.
    
    This just covers the rings/sqes, the other remaining user of the mmap
    remap_pfn_range() user will be converted separately. Once that is done,
    we can kill the old alloc/free code.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 17:02:44 -05:00
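
A hedged sketch of the allocation strategy described above: try one compound
page first, and fall back to single pages stitched together with vmap().
The function name is illustrative; error handling and freeing are elided.

    static void *io_pages_map(struct page **pages, int nr_pages, size_t size)
    {
        struct page *page;
        int i;

        /* best case: one physically contiguous compound allocation */
        page = alloc_pages(GFP_KERNEL | __GFP_COMP | __GFP_NOWARN,
                           get_order(size));
        if (page)
            return page_address(page);

        /* fallback: single pages mapped contiguously into kernel VA space */
        for (i = 0; i < nr_pages; i++)
            pages[i] = alloc_page(GFP_KERNEL);
        return vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
    }
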
Jeff Moyer 5541c9a578 io_uring: drop ->prep_async()
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit e10677a8f6980dbae2e866b8320d90bae07e87ee
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Mar 18 20:48:38 2024 -0600

    io_uring: drop ->prep_async()
    
    It's now unused, drop the code related to it. This includes the
    io_issue_defs->manual alloc field.
    
    While in there, and since ->async_size is now being used a bit more
    frequently and in the issue path, move it to io_issue_defs[].
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 16:55:44 -05:00
Jeff Moyer 4bcf940192 io_uring: clean up io_lockdep_assert_cq_locked
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit c133b3b06b0653036b0c07675c1db0c89467ccdb
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Mon Mar 18 22:00:35 2024 +0000

    io_uring: clean up io_lockdep_assert_cq_locked
    
    Move CONFIG_PROVE_LOCKING checks inside of io_lockdep_assert_cq_locked()
    and kill the else branch.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Tested-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/bbf33c429c9f6d7207a8fe66d1a5866ec2c99850.1710799188.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 16:33:44 -05:00
Jeff Moyer 5fb9e301ac io_uring: refactor io_req_complete_post()
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 0667db14e1f029d56243aa2509ebc5f944388200
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Mon Mar 18 22:00:34 2024 +0000

    io_uring: refactor io_req_complete_post()
    
    Make io_req_complete_post() push all IORING_SETUP_IOPOLL requests to
    task_work; it's much cleaner and is what should normally happen. We
    couldn't do it before because there was a possibility of looping in
    
    complete_post() -> tw -> complete_post() -> ...
    
    Also, unexport the function and inline __io_req_complete_post().
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Tested-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/ea19c032ace3e0dd96ac4d991a063b0188037014.1710799188.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 16:32:44 -05:00
Jeff Moyer ae9b61904d io_uring: get rid of intermediate aux cqe caches
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 902ce82c2aa130bea5e3feca2d4ae62781865da7
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Mon Mar 18 22:00:32 2024 +0000

    io_uring: get rid of intermediate aux cqe caches
    
    io_post_aux_cqe(), which is used for multishot requests, delays
    completions by putting CQEs into a temporary array for the purpose of
    completion lock/flush batching.
    
    DEFER_TASKRUN doesn't need any locking, so for it we can put completions
    directly into the CQ and defer post completion handling with a flag.
    That leaves !DEFER_TASKRUN, which is not that interesting / hot for
    multishot requests, so have conditional locking with deferred flush
    for them.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Tested-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/b1d05a81fd27aaa2a07f9860af13059e7ad7a890.1710799188.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 16:30:44 -05:00
Jeff Moyer 74280b745d io_uring: refactor io_fill_cqe_req_aux
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit e5c12945be5016d681ff305ea7306fef5902219d
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Mon Mar 18 22:00:31 2024 +0000

    io_uring: refactor io_fill_cqe_req_aux
    
    The restriction on multishot execution context disallowing io-wq is
    driven by rules of io_fill_cqe_req_aux(), it should only be called in
    the master task context, either from the syscall path or in task_work.
    Since task_work now always takes the ctx lock implying
    IO_URING_F_COMPLETE_DEFER, we can just assume that the function is
    always called with its defer argument set to true.
    
    Kill the argument. Also rename the function for more consistency, as
    "fill" in CQE related functions usually meant raw interfaces that only
    copy data into the CQ, without any of the locking, user wakeups and
    other accounting that "post" functions take care of.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Tested-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/93423d106c33116c7d06bf277f651aa68b427328.1710799188.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 16:29:44 -05:00
Jeff Moyer 539a8ed2d1 io_uring: remove struct io_tw_state::locked
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 8e5b3b89ecaf6d9295e561c225b35c574a5e0fe7
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Mon Mar 18 22:00:30 2024 +0000

    io_uring: remove struct io_tw_state::locked
    
    ctx is always locked for task_work now, so get rid of struct
    io_tw_state::locked. Note I'm stopping one step before removing
    io_tw_state altogether, which is now empty, because it still serves the
    purpose of indicating which function is a tw callback and forcing users
    not to invoke them carelessly out of a wrong context. The removal can
    always be done later.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Tested-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/e95e1ea116d0bfa54b656076e6a977bc221392a4.1710799188.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 16:28:44 -05:00
Jeff Moyer fa29c3796e io_uring/rw: avoid punting to io-wq directly
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 6e6b8c62120a22acd8cb759304e4cd2e3215d488
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Mon Mar 18 22:00:28 2024 +0000

    io_uring/rw: avoid punting to io-wq directly
    
    kiocb_done() shouldn't need to care about specifically redirecting
    requests to io-wq. Remove the hop to tw that then queues io-wq work;
    return -EAGAIN and let the core io_uring code handle offloading.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Tested-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/413564e550fe23744a970e1783dfa566291b0e6f.1710799188.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 16:26:44 -05:00
Jeff Moyer ee6e5ff7da io_uring/napi: ensure napi polling is aborted when work is available
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 428f13826855e3eea44bf13cedbf33f382ef8794
Author: Jens Axboe <axboe@kernel.dk>
Date:   Wed Feb 14 12:59:36 2024 -0700

    io_uring/napi: ensure napi polling is aborted when work is available
    
    While testing io_uring NAPI with DEFER_TASKRUN, I ran into slowdowns and
    stalls in packet delivery. Turns out that while
    io_napi_busy_loop_should_end() aborts appropriately on regular
    task_work, it does not abort if we have local task_work pending.
    
    Move io_has_work() into the private io_uring.h header, and gate whether
    we should continue polling on that as well. This makes NAPI polling on
    send/receive work as designed with IORING_SETUP_DEFER_TASKRUN as well.
    
    Fixes: 8d0c12a80cde ("io-uring: add napi busy poll support")
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 15:44:44 -05:00
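
Per the commit message above, the fix is essentially one extra gate in the
busy-loop termination check; a hedged sketch of that addition:

    /* in io_napi_busy_loop_should_end(): also stop polling when the ring
     * has pending work, including DEFER_TASKRUN local task_work */
    if (io_has_work(iowq->ctx))
        return true;
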
Jeff Moyer 7abbe65ada io-uring: add napi busy poll support
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 8d0c12a80cdeb80d5e0510e96d38fe551ed8e9b5
Author: Stefan Roesch <shr@devkernel.io>
Date:   Thu Jun 8 09:38:36 2023 -0700

    io-uring: add napi busy poll support
    
    This adds the napi busy polling support in io_uring.c. It adds a new
    napi_list to the io_ring_ctx structure. This list contains the list of
    napi_id's that are currently enabled for busy polling. The list is
    synchronized by the new napi_lock spin lock. The current default napi
    busy polling time is stored in napi_busy_poll_to. If napi busy polling
    is not enabled, the value is 0.
    
    In addition there is also a hash table. The hash table stores the napi
    id and a pointer to the above list nodes. The hash table is used to
    speed up the lookup of the list elements. The hash table is synchronized
    with RCU.
    
    The NAPI_TIMEOUT is stored as a timeout to make sure that the time a
    napi entry is stored in the napi list is limited.
    
    The busy poll timeout is also stored as part of the io_wait_queue. This
    is necessary as for sq polling the poll interval needs to be adjusted
    and the napi callback allows only to pass in one value.
    
    This has been tested with two simple programs from the liburing library
    repository: the napi client and the napi server program. The client
    sends a request, which has a timestamp in its payload and the server
    replies with the same payload. The client calculates the roundtrip time
    and stores it to calculate the results.
    
    The client is running on host1 and the server is running on host 2 (in
    the same rack). The measured times below are roundtrip times. They are
    average times over 5 runs each. Each run measures 1 million roundtrips.
    
                       no rx coal          rx coal: frames=88,usecs=33
    Default              57us                    56us
    
    client_poll=100us    47us                    46us
    
    server_poll=100us    51us                    46us
    
    client_poll=100us+   40us                    40us
    server_poll=100us
    
    client_poll=100us+   41us                    39us
    server_poll=100us+
    prefer napi busy poll on client
    
    client_poll=100us+   41us                    39us
    server_poll=100us+
    prefer napi busy poll on server
    
    client_poll=100us+   41us                    39us
    server_poll=100us+
    prefer napi busy poll on client + server
    
    Signed-off-by: Stefan Roesch <shr@devkernel.io>
    Suggested-by: Olivier Langlois <olivier@trillion01.com>
    Acked-by: Jakub Kicinski <kuba@kernel.org>
    Link: https://lore.kernel.org/r/20230608163839.2891748-5-shr@devkernel.io
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 15:40:44 -05:00
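
A hedged sketch of the tracking state the commit above describes; field names
are illustrative rather than guaranteed to match the upstream structures.

    struct io_napi_entry {
        unsigned int        napi_id;
        struct list_head    list;       /* linked into ctx->napi_list */
        unsigned long       timeout;    /* NAPI_TIMEOUT limit for stale entries */
        struct hlist_node   node;       /* hashed by napi_id for fast lookup */
        struct rcu_head     rcu;        /* hash table entries are RCU freed */
    };

    /* added to struct io_ring_ctx */
    struct list_head    napi_list;          /* napi_ids enabled for busy poll */
    spinlock_t          napi_lock;          /* protects napi_list */
    unsigned int        napi_busy_poll_to;  /* default busy poll time, 0 = off */
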
Jeff Moyer 5f96f592a6 io-uring: move io_wait_queue definition to header file
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit 405b4dc14b10c5bdb3e9a6c3b9596c1597f7974d
Author: Stefan Roesch <shr@devkernel.io>
Date:   Thu Jun 8 09:38:35 2023 -0700

    io-uring: move io_wait_queue definition to header file
    
    This moves the definition of the io_wait_queue structure to the header
    file so it can be also used from other files.
    
    Signed-off-by: Stefan Roesch <shr@devkernel.io>
    Link: https://lore.kernel.org/r/20230608163839.2891748-4-shr@devkernel.io
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 15:36:44 -05:00
Jeff Moyer 7dd91c9415 io_uring/sqpoll: manage task_work privately
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit af5d68f8892f8ee8f137648b79ceb2abc153a19b
Author: Jens Axboe <axboe@kernel.dk>
Date:   Fri Feb 2 10:20:05 2024 -0700

    io_uring/sqpoll: manage task_work privately
    
    Decouple from task_work running, and cap the number of entries we process
    at a time. If we exceed that number, push remaining entries to a retry
    list that we'll process first next time.
    
    We cap the number of entries to process at 8, which is fairly random.
    We just want to get enough per-ctx batching here, while not processing
    endlessly.
    
    Since we manually run PF_IO_WORKER related task_work anyway as the task
    never exits to userspace, with this we no longer need to add an actual
    task_work item to the per-process list.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 15:31:44 -05:00
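
A hedged sketch of the capped, private task_work run described above; the
helper names are assumptions based on related upstream code and the exact
implementation may differ:

    #define IORING_TW_CAP_ENTRIES_VALUE 8   /* the "fairly random" batching cap */

    static unsigned int io_sq_tw(struct llist_node **retry_list, int max_entries)
    {
        struct io_uring_task *tctx = current->io_uring;
        unsigned int count = 0;

        /* drain whatever was pushed to the retry list last time first */
        if (*retry_list) {
            *retry_list = io_handle_tw_list(*retry_list, &count, max_entries);
            if (count >= max_entries)
                return count;
            max_entries -= count;
        }
        /* run new task_work, stashing any overflow for the next iteration */
        *retry_list = tctx_task_work_run(tctx, max_entries, &count);
        return count;
    }
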
Jeff Moyer 5d0f19f327 io_uring: mark the need to lock/unlock the ring as unlikely
JIRA: https://issues.redhat.com/browse/RHEL-64867

commit bfe30bfde279529011161a60e5a7ca4be83de422
Author: Jens Axboe <axboe@kernel.dk>
Date:   Sun Jan 28 20:32:52 2024 -0700

    io_uring: mark the need to lock/unlock the ring as unlikely
    
    Any of the fast paths will already have this locked; this helper only
    exists to deal with io-wq invoking request issue, where we do not have
    the ctx->uring_lock held already. This means that any common or fast
    path will already have this locked, so mark it as such.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-11-28 15:21:44 -05:00
Ming Lei b74c3e7b5b io_uring/cmd: move io_uring_try_cancel_uring_cmd()
JIRA: https://issues.redhat.com/browse/RHEL-56837

commit da12d9ab5889b87429d9375748dcd1485b6241f3
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Mon Mar 18 22:00:23 2024 +0000

    io_uring/cmd: move io_uring_try_cancel_uring_cmd()

    io_uring_try_cancel_uring_cmd() is a part of the cmd handling so let's
    move it closer to all cmd bits into uring_cmd.c

    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Tested-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/43a3937af4933655f0fd9362c381802f804f43de.1710799188.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Ming Lei <ming.lei@redhat.com>
2024-09-27 11:19:16 +08:00
Jeff Moyer c5bed6fb0e io_uring: use the right type for work_llist empty check
JIRA: https://issues.redhat.com/browse/RHEL-27755

commit 22537c9f79417fed70b352d54d01d2586fee9521
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Mar 25 18:53:33 2024 -0600

    io_uring: use the right type for work_llist empty check
    
    io_task_work_pending() uses wq_list_empty() on ctx->work_llist, but it's
    not an io_wq_work_list, it's a struct llist_head. They both have
    ->first as head-of-list, and it turns out the checks are identical. But
    be proper and use the right helper.
    
    Fixes: dac6a0eae793 ("io_uring: ensure iopoll runs local task work as well")
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-07-02 14:33:41 -04:00
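
In code terms the fix above is a one-token change: use the llist helper for a
struct llist_head. A hedged sketch of the resulting check:

    static inline int io_task_work_pending(struct io_ring_ctx *ctx)
    {
        /* ctx->work_llist is a struct llist_head, so llist_empty() is the
         * proper check (wq_list_empty() only happened to work by layout) */
        return task_work_pending(current) || !llist_empty(&ctx->work_llist);
    }
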
Jeff Moyer fddfa72c02 io_uring: add io_file_can_poll() helper
JIRA: https://issues.redhat.com/browse/RHEL-27755
Conflicts: Context differences as we don't have commit 521223d7c229
  ("io_uring/cancel: don't default to setting req->work.cancel_seq").

commit 95041b93e90a06bb613ec4bef9cd4d61570f68e4
Author: Jens Axboe <axboe@kernel.dk>
Date:   Sun Jan 28 20:08:24 2024 -0700

    io_uring: add io_file_can_poll() helper
    
    This adds a flag to avoid dereferencing file and then f_op to
    figure out if the file has a poll handler defined or not. We generally
    call this at least twice for networked workloads, and if using ring
    provided buffers, we do it on every buffer selection. Particularly the
    latter is troublesome, as it's otherwise a very fast operation.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-07-02 14:33:38 -04:00
Jeff Moyer 0c09ea233c io_uring/poll: add requeue return code from poll multishot handling
JIRA: https://issues.redhat.com/browse/RHEL-27755

commit 704ea888d646cb9d715662944cf389c823252ee0
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Jan 29 11:57:11 2024 -0700

    io_uring/poll: add requeue return code from poll multishot handling
    
    Since our poll handling is edge triggered, multishot handlers retry
    internally until they know that no more data is available. In
    preparation for limiting these retries, add an internal return code,
    IOU_REQUEUE, which can be used to inform the poll backend about the
    handler wanting to retry, but that this should happen through a normal
    task_work requeue rather than keep hammering on the issue side for this
    one request.
    
    No functional changes in this patch, nobody is using this return code
    just yet.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-07-02 14:33:37 -04:00
Jeff Moyer 249084f409 io_uring/register: move io_uring_register(2) related code to register.c
JIRA: https://issues.redhat.com/browse/RHEL-27755

commit c43203154d8ac579537aa0c7802b77d463b1f53a
Author: Jens Axboe <axboe@kernel.dk>
Date:   Tue Dec 19 08:54:20 2023 -0700

    io_uring/register: move io_uring_register(2) related code to register.c
    
    Most of this code is basically self contained, move it out of the core
    io_uring file to bring a bit more separation to the registration related
    bits. This moves another ~10% of the code into register.c.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-07-02 14:33:36 -04:00
Jeff Moyer 6d2191d720 io_uring/cmd: inline io_uring_cmd_do_in_task_lazy
JIRA: https://issues.redhat.com/browse/RHEL-27755

commit 6b04a3737057ddfed396c954f9e4be4fe6d53c62
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Dec 1 00:57:36 2023 +0000

    io_uring/cmd: inline io_uring_cmd_do_in_task_lazy
    
    Now that we can easily include io_uring_types.h, move IOU_F_TWQ_LAZY_WAKE
    and inline io_uring_cmd_do_in_task_lazy().
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/2ec9fb31dd192d1c5cf26d0a2dec5657d88a8e48.1701391955.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-07-02 14:32:10 -04:00
Jeff Moyer b3b30700de io_uring/unix: drop usage of io_uring socket
JIRA: https://issues.redhat.com/browse/RHEL-36366
CVE: CVE-2023-52656
Conflicts: Contextual differences in iouring.h.

commit a4104821ad651d8a0b374f0b2474c345bbb42f82
Author: Jens Axboe <axboe@kernel.dk>
Date:   Tue Dec 19 12:30:43 2023 -0700

    io_uring/unix: drop usage of io_uring socket
    
    Since we no longer allow sending io_uring fds over SCM_RIGHTS, move to
    using io_is_uring_fops() to detect whether this is a io_uring fd or not.
    With that done, kill off io_uring_get_socket() as nobody calls it
    anymore.
    
    This is in preparation to yanking out the rest of the core related to
    unix gc with io_uring.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-05-15 13:58:16 -04:00
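
A hedged sketch of the detection approach referenced above: an io_uring file
is recognized by its file_operations rather than by an attached socket.

    bool io_is_uring_fops(struct file *file)
    {
        /* identify io_uring instances by their f_op, no unix socket needed */
        return file->f_op == &io_uring_fops;
    }
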
Jeff Moyer c34b529ae7 io_uring: enable io_mem_alloc/free to be used in other parts
JIRA: https://issues.redhat.com/browse/RHEL-21391

commit edecf1689768452ba1a64b7aaf3a47a817da651a
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Nov 27 20:53:52 2023 -0700

    io_uring: enable io_mem_alloc/free to be used in other parts
    
    In preparation for using these helpers, make them non-static and add
    them to our internal header.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-02-01 12:08:14 -05:00
Jeff Moyer d71b8efd78 io_uring/kbuf: Use slab for struct io_buffer objects
JIRA: https://issues.redhat.com/browse/RHEL-21391

commit b3a4dbc89d4021b3f90ff6a13537111a004f9d07
Author: Gabriel Krisman Bertazi <krisman@suse.de>
Date:   Wed Oct 4 20:05:31 2023 -0400

    io_uring/kbuf: Use slab for struct io_buffer objects
    
    The allocation of struct io_buffer for metadata of provided buffers is
    done through a custom allocator that directly gets pages and
    fragments them.  But, slab would do just fine, as this is not a hot path
    (in fact, it is a deprecated feature) and, by keeping a custom allocator
    implementation we lose benefits like tracking, poisoning,
    sanitizers. Finally, the custom code is more complex and requires
    keeping the list of pages in struct ctx for no good reason.  This patch
    cleans this path up and just uses slab.
    
    I microbenchmarked it by forcing the allocation of a large number of
    objects with the least number of io_uring commands possible (keeping
    nbufs=USHRT_MAX), with and without the patch.  There is a slight
    increase in time spent in the allocation with slab, of course, but even
    when allocating to system resources exhaustion, which is not very
    realistic and happened around 1/2 billion provided buffers for me, it
    wasn't a significant hit in system time.  Especially if we think of a
    real-world scenario, an application doing register/unregister of
    provided buffers will hit ctx->io_buffers_cache more often than actually
    going to slab.
    
    Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
    Link: https://lore.kernel.org/r/20231005000531.30800-4-krisman@suse.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-02-01 12:01:14 -05:00
Jeff Moyer 73c30d9b0c io_uring: ensure io_lockdep_assert_cq_locked() handles disabled rings
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 1658633c04653578429ff5dfc62fdc159203a8f2
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Oct 2 19:51:38 2023 -0600

    io_uring: ensure io_lockdep_assert_cq_locked() handles disabled rings
    
    io_lockdep_assert_cq_locked() checks that locking is correctly done when
    a CQE is posted. If the ring is setup in a disabled state with
    IORING_SETUP_R_DISABLED, then ctx->submitter_task isn't assigned until
    the ring is later enabled. We generally don't post CQEs in this state,
    as no SQEs can be submitted. However it is possible to generate a CQE
    if tagged resources are being updated. If this happens and PROVE_LOCKING
    is enabled, then the locking check helper will dereference
    ctx->submitter_task, which hasn't been set yet.
    
    Fixup io_lockdep_assert_cq_locked() to handle this case correctly. While
    at it, convert it to a static inline as well, so that generated line
    offsets will actually reflect which condition failed, rather than just
    the line offset for io_lockdep_assert_cq_locked() itself.
    
    Reported-and-tested-by: syzbot+efc45d4e7ba6ab4ef1eb@syzkaller.appspotmail.com
    Fixes: f26cc9593581 ("io_uring: lockdep annotate CQ locking")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 17:26:28 -04:00
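
A simplified, hedged sketch of the helper after the fix above; the real
function has more cases, but the point is the static inline form and the NULL
ctx->submitter_task handling for IORING_SETUP_R_DISABLED rings.

    static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
    {
    #if defined(CONFIG_PROVE_LOCKING)
        if (ctx->flags & IORING_SETUP_IOPOLL) {
            lockdep_assert_held(&ctx->uring_lock);
        } else if (!ctx->task_complete) {
            lockdep_assert_held(&ctx->completion_lock);
        } else if (ctx->submitter_task) {
            /* submitter_task may be NULL for IORING_SETUP_R_DISABLED rings:
             * a CQE can still be posted via tagged resource updates before
             * the ring is enabled, so only assert when it is set */
            lockdep_assert(current == ctx->submitter_task);
        }
    #endif
    }
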
Jeff Moyer 41a7b54f2e io_uring: force inline io_fill_cqe_req
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 093a650b757210bc856ca7f5349fb5a4bb9d4bd6
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Aug 24 23:53:30 2023 +0100

    io_uring: force inline io_fill_cqe_req
    
    There are only 2 callers of io_fill_cqe_req left, and one of them is
    extremely hot. Force inline the function.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/ffce4fc5e3521966def848a4d930586dfe33ae11.1692916914.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 17:26:26 -04:00
Jeff Moyer 0f5c36d3ac io_uring: merge iopoll and normal completion paths
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit ec26c225f06f5993f8891fa6c79fab3c92981181
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Aug 24 23:53:29 2023 +0100

    io_uring: merge iopoll and normal completion paths
    
    io_do_iopoll() and io_submit_flush_completions() are pretty similar,
    both filling CQEs and then free a list of requests. Don't duplicate it
    and make iopoll use __io_submit_flush_completions(), which also helps
    with inlining and other optimisations.
    
    For that, we need to first find all completed iopoll requests and splice
    them from the iopoll list and then pass it down. This adds one extra
    list traversal, which should be fine as requests will stay hot in cache.
    
    CQ locking is already conditional, introduce ->lockless_cq and skip
    locking for IOPOLL as it's protected by ->uring_lock.
    
    We also add a wakeup optimisation for IOPOLL to __io_cq_unlock_post(),
    so it works just like io_cqring_ev_posted_iopoll().
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/3840473f5e8a960de35b77292026691880f6bdbc.1692916914.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 17:26:26 -04:00
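
The conditional CQ locking mentioned above can be sketched as follows
(hedged; the field name follows the commit message, the rest is assumed):

    static inline void __io_cq_lock(struct io_ring_ctx *ctx)
    {
        /* IOPOLL completions are already serialized by ctx->uring_lock */
        if (!ctx->lockless_cq)
            spin_lock(&ctx->completion_lock);
    }
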
Jeff Moyer 6353ed9a30 io_uring: optimise extra io_get_cqe null check
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 59fbc409e71649f558fb4578cdbfac67acb824dc
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Aug 24 23:53:27 2023 +0100

    io_uring: optimise extra io_get_cqe null check
    
    If the cached cqe check passes in io_get_cqe*() it already means that
    the cqe we return is valid and non-zero, however the compiler is unable
    to optimise away null checks like the one in io_fill_cqe_req().
    
    Do a bit of trickery, return success/fail boolean from io_get_cqe*()
    and store cqe in the cqe parameter. That makes it do the right thing,
    erasing the check together with the introduced indirection.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/322ea4d3377d3d4efd8ae90ab8ed28a99f518210.1692916914.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 17:26:26 -04:00
Jeff Moyer bc351fb523 io_uring: refactor __io_get_cqe()
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 20d6b633870495fda1d92d283ebf890d80f68ecd
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Aug 24 23:53:26 2023 +0100

    io_uring: refactor __io_get_cqe()
    
    Make __io_get_cqe simpler by not grabbing the cqe from the refilled cache,
    but letting io_get_cqe() do it for us. That's cleaner and removes some
    duplication.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/74dc8fdf2657e438b2e05e1d478a3596924604e9.1692916914.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 17:26:26 -04:00
Jeff Moyer fab40551d6 io_uring: simplify big_cqe handling
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit b24c5d752962fa0970cd7e3d74b1cd0e843358de
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Aug 24 23:53:25 2023 +0100

    io_uring: simplify big_cqe handling
    
    Don't keep big_cqe bits of req in a union with hash_node, find a
    separate space for it. It's a bit safer, but also if we keep it always
    initialised, we can get rid of ugly REQ_F_CQE32_INIT handling.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/447aa1b2968978c99e655ba88db536e903df0fe9.1692916914.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 17:26:26 -04:00
Jeff Moyer 2b187db613 io_uring: improve cqe !tracing hot path
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit a0727c738309a06ef5579c1742f8f0def63aa883
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Aug 24 23:53:23 2023 +0100

    io_uring: improve cqe !tracing hot path
    
    While looking at io_fill_cqe_req()'s asm I stumbled on our trace points
    turning into the chunk below:
    
    trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
                            req->cqe.res, req->cqe.flags,
                            req->extra1, req->extra2);
    
    io_uring/io_uring.c:898:        trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
            movq    232(%rbx), %rdi # req_44(D)->big_cqe.extra2, _5
            movq    224(%rbx), %rdx # req_44(D)->big_cqe.extra1, _6
            movl    84(%rbx), %r9d  # req_44(D)->cqe.D.81184.flags, _7
            movl    80(%rbx), %r8d  # req_44(D)->cqe.res, _8
            movq    72(%rbx), %rcx  # req_44(D)->cqe.user_data, _9
            movq    88(%rbx), %rsi  # req_44(D)->ctx, _10
    ./arch/x86/include/asm/jump_label.h:27:         asm_volatile_goto("1:"
            1:jmp .L1772 # objtool NOPs this        #
            ...
    
    It does a jump_label for actual tracing, but those 6 moves will stay
    there in the hottest io_uring path. As an optimisation, add a
    trace_io_uring_complete_enabled() check, which also uses jump_labels;
    it tricks the compiler into behaving. It removes the junk without
    changing anything else in the hot path.
    
    Note: apparently, it's not only me noticing it, and people are also
    working around it. We should remove the check when it's solved
    generically or rework tracing.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/555d8312644b3776f4be7e23f9b92943875c4bc7.1692916914.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 17:26:26 -04:00
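
The workaround is roughly to gate the call behind the tracepoint's own
static-key check (the trace_*_enabled() helper generated for every
tracepoint); a hedged sketch using the call shown in the message above:

    if (trace_io_uring_complete_enabled())
        trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
                                req->cqe.res, req->cqe.flags,
                                req->extra1, req->extra2);
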
Jeff Moyer 615948488d io_uring: never overflow io_aux_cqe
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit b6b2bb58a75407660f638a68e6e34a07036146d0
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Aug 11 13:53:45 2023 +0100

    io_uring: never overflow io_aux_cqe
    
    Now all callers of io_aux_cqe() set allow_overflow to false, so remove
    the parameter and disallow overflowing auxiliary multishot cqes.
    
    When CQ is full the function callers and all multishot requests in
    general are expected to complete the request. That prevents indefinite
    in-background growth of the overflow list and lets the userspace
    handle the backlog at its own pace.
    
    Resubmitting a request should also be faster than accounting a bunch of
    overflows, so it should be better for perf when it happens, but a well
    behaving userspace should be trying to avoid overflows in any case.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/bb20d14d708ea174721e58bb53786b0521e4dd6d.1691757663.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 17:26:25 -04:00
Jeff Moyer f9df5f9d35 io_uring: remove return from io_req_cqe_overflow()
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 056695bffa4beed5668dd4aa11efb696eacb3ed9
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Aug 11 13:53:44 2023 +0100

    io_uring: remove return from io_req_cqe_overflow()
    
    Nobody checks io_req_cqe_overflow()'s return, make it return void.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/8f2029ad0c22f73451664172d834372608ee0a77.1691757663.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 17:26:25 -04:00
Jeff Moyer b364f1b51d io_uring: open code io_fill_cqe_req()
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 00b0db562485fbb259cd4054346208ad0885d662
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Aug 11 13:53:43 2023 +0100

    io_uring: open code io_fill_cqe_req()
    
    io_fill_cqe_req() is only called from one place, open code it, and
    rename __io_fill_cqe_req().
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/f432ce75bb1c94cadf0bd2add4d6aa510bd1fb36.1691757663.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 17:26:25 -04:00
Jeff Moyer f4278e6959 io_uring: have io_file_put() take an io_kiocb rather than the file
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 17bc28374cd06b7d2d3f1e88470ef89f9cd3a497
Author: Jens Axboe <axboe@kernel.dk>
Date:   Fri Jul 7 11:14:40 2023 -0600

    io_uring: have io_file_put() take an io_kiocb rather than the file
    
    No functional changes in this patch, just a prep patch for needing the
    request in io_file_put().
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 17:26:24 -04:00
Jeff Moyer b5bb2a633e io_uring: fix false positive KASAN warnings
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 569f5308e54352a12181cc0185f848024c5443e8
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Wed Aug 9 13:22:16 2023 +0100

    io_uring: fix false positive KASAN warnings
    
    io_req_local_work_add() peeks into the work list, which can be executed
    in the meanwhile. It's completely fine without KASAN as we're in an RCU
    read section and it's SLAB_TYPESAFE_BY_RCU. With KASAN though it may
    trigger a false positive warning because internal io_uring caches are
    sanitised.
    
    Remove sanitisation from the io_uring request cache for now.
    
    Cc: stable@vger.kernel.org
    Fixes: 8751d15426a31 ("io_uring: reduce scheduling due to tw")
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/c6fbf7a82a341e66a0007c76eefd9d57f2d3ba51.1691541473.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:32:16 -04:00
Jeff Moyer 0af9ccc401 io_uring: make io_cq_unlock_post static
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 0fdb9a196c6728b51e0e7a4f6fa292d9fd5793de
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Jun 23 12:23:30 2023 +0100

    io_uring: make io_cq_unlock_post static
    
    io_cq_unlock_post() is exclusively used in io_uring/io_uring.c, mark it
    static and don't expose to other files.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/3dc8127dda4514e1dd24bb32035faac887c5fa37.1687518903.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:31:56 -04:00
Jeff Moyer 4e1f9a9930 io_uring: remove IOU_F_TWQ_FORCE_NORMAL
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 91c7884ac9a92ffbf78af7fc89603daf24f448a9
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Fri Jun 23 12:23:26 2023 +0100

    io_uring: remove IOU_F_TWQ_FORCE_NORMAL
    
    Extract a function for non-local task_work_add, and use it directly from
    io_move_task_work_from_local(). Now we don't use IOU_F_TWQ_FORCE_NORMAL
    and it can be killed.
    
    As a small positive side effect we don't grab task->io_uring in
    io_req_normal_work_add anymore, which is not needed for
    io_req_local_work_add().
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/2e55571e8ff2927ae3cc12da606d204e2485525b.1687518903.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:31:55 -04:00
Jeff Moyer 9372a586d8 io_uring: remove io_req_ffs_set
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 3beed235d1a1d0a4ab093ab67ea6b2841e9d4fa2
Author: Christoph Hellwig <hch@lst.de>
Date:   Tue Jun 20 13:32:31 2023 +0200

    io_uring: remove io_req_ffs_set
    
    Just checking the flag directly makes it a lot more obvious what is
    going on here.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20230620113235.920399-5-hch@lst.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:31:54 -04:00
Jeff Moyer 1d2aa06cb2 io_uring: cleanup io_aux_cqe() API
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit d86eaed185e9c6052d1ee2ca538f1936ff255887
Author: Jens Axboe <axboe@kernel.dk>
Date:   Wed Jun 7 14:41:20 2023 -0600

    io_uring: cleanup io_aux_cqe() API
    
    Everybody is passing in the request, so get rid of the io_ring_ctx and
    explicit user_data pass-in. Both the ctx and user_data can be deduced
    from the request at hand.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:31:54 -04:00
Jeff Moyer 7803ca2813 io_uring: Add io_uring_setup flag to pre-register ring fd and never install it
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 6e76ac595855db27bbdaef337173294a6fd6eb2c
Author: Josh Triplett <josh@joshtriplett.org>
Date:   Sat Apr 29 01:40:30 2023 +0900

    io_uring: Add io_uring_setup flag to pre-register ring fd and never install it
    
    With IORING_REGISTER_USE_REGISTERED_RING, an application can register
    the ring fd and use it via registered index rather than installed fd.
    This allows using a registered ring for everything *except* the initial
    mmap.
    
    With IORING_SETUP_NO_MMAP, io_uring_setup uses buffers allocated by the
    user, rather than requiring a subsequent mmap.
    
    The combination of the two allows a user to operate *entirely* via a
    registered ring fd, making it unnecessary to ever install the fd in the
    first place. So, add a flag IORING_SETUP_REGISTERED_FD_ONLY to make
    io_uring_setup register the fd and return a registered index, without
    installing the fd.
    
    This allows an application to avoid touching the fd table at all, and
    allows a library to never even momentarily install a file descriptor.
    
    This splits out an io_ring_add_registered_file helper from
    io_ring_add_registered_fd, for use by io_uring_setup.
    
    Signed-off-by: Josh Triplett <josh@joshtriplett.org>
    Link: https://lore.kernel.org/r/bc8f431bada371c183b95a83399628b605e978a3.1682699803.git.josh@joshtriplett.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:31:53 -04:00
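
A hedged userspace-side sketch of the flag combination described above; with
IORING_SETUP_NO_MMAP the application must also hand the ring memory to the
kernel, which (along with the entries value) is elided here:

    struct io_uring_params p = {
        .flags = IORING_SETUP_REGISTERED_FD_ONLY | IORING_SETUP_NO_MMAP,
        /* with NO_MMAP the application also supplies the ring memory */
    };
    int ret = syscall(__NR_io_uring_setup, entries, &p);

    /* on success, ret is a registered ring index, not an installed fd */
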
Jeff Moyer 28532e38e4 io_uring: Create a helper to return the SQE size
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 96c7d4f81db0fea05c0792f7563ae0cb4ad5f022
Author: Breno Leitao <leitao@debian.org>
Date:   Thu May 4 05:18:54 2023 -0700

    io_uring: Create a helper to return the SQE size
    
    Create a simple helper that returns the size of the SQE. The SQE can
    have two sizes, depending on the flags.
    
    If the IORING_SETUP_SQE128 flag is set, the helper returns a double-size
    SQE, otherwise it returns the size of io_uring_sqe (64 bytes).
    
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/20230504121856.904491-2-leitao@debian.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:31:34 -04:00
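
A hedged sketch of such a helper:

    static inline size_t uring_sqe_size(struct io_ring_ctx *ctx)
    {
        if (ctx->flags & IORING_SETUP_SQE128)
            return 2 * sizeof(struct io_uring_sqe);
        return sizeof(struct io_uring_sqe);
    }
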
Jeff Moyer 4a914a7271 io_uring: add irq lockdep checks
JIRA: https://issues.redhat.com/browse/RHEL-12076

commit 8ce4269eeedc5b31f5817f610b42cba8be8fa9de
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Tue Apr 11 12:06:03 2023 +0100

    io_uring: add irq lockdep checks
    
    We don't post CQEs from the IRQ context, add a check catching that.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/f23f7a24dbe8027b3d37873fece2b6488f878b31.1681210788.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:31:32 -04:00
Jeff Moyer 28b3762b5d io_uring: reduce scheduling due to tw
JIRA: https://issues.redhat.com/browse/RHEL-12076
Conflicts: We backported the sysctl patch out of order, which caused the
  patch to not apply cleanly.

commit 8751d15426a31baaf40f7570263c27c3e5d1dc44
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Apr 6 14:20:12 2023 +0100

    io_uring: reduce scheduling due to tw
    
    Every task_work will try to wake the task to be executed, which causes
    excessive scheduling and additional overhead. For some tw it's
    justified, but others won't do much but post a single CQE.
    
    When a task waits for multiple cqes, every such task_work will wake it
    up. Instead, the task may give a hint about how many cqes it waits for,
    io_req_local_work_add() will compare against it and skip wake ups
    if #cqes + #tw is not enough to satisfy the waiting condition. Task_work
    that uses the optimisation should be simple enough and never post more
    than one CQE. It's also ignored for non DEFER_TASKRUN rings.
    
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Link: https://lore.kernel.org/r/d2b77e99d1e86624d8a69f7037d764b739dcd225.1680782017.git.asml.silence@gmail.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2023-11-02 15:31:31 -04:00
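
A hedged sketch of the wake-up skipping described above; cq_wait_nr is the
hint mentioned in the message, the surrounding details are assumptions:

    /* hedged fragment from the add path, nr_tw = pending tw items + this one */
    unsigned nr_wait = atomic_read(&ctx->cq_wait_nr);

    if (nr_tw < nr_wait)
        return;         /* waiter still can't be satisfied, skip the wakeup */
    wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
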