Commit Graph

47 Commits

Author SHA1 Message Date
Jerome Marchand 2de9f2ed84 bpf: Fix iter/task tid filtering
JIRA: https://issues.redhat.com/browse/RHEL-63880

commit 9495a5b731fcaf580448a3438d63601c88367661
Author: Jordan Rome <linux@jordanrome.com>
Date:   Wed Oct 16 14:00:47 2024 -0700

    bpf: Fix iter/task tid filtering

    In userspace, you can add a tid filter by setting
    the "task.tid" field for "bpf_iter_link_info".
    However, `get_pid_task` when called for the
    `BPF_TASK_ITER_TID` type should have been using
    `PIDTYPE_PID` (tid) instead of `PIDTYPE_TGID` (pid).

    Fixes: f0d74c4da1f0 ("bpf: Parameterize task iterators.")
    Signed-off-by: Jordan Rome <linux@jordanrome.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20241016210048.1213935-1-linux@jordanrome.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2025-01-21 11:27:08 +01:00
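
The tid-filtering fix in the entry above (2de9f2ed84) can be illustrated with a small userspace model. This is a sketch, not kernel code: every name in it (toy_task, toy_get_pid_task, the TOY_PIDTYPE_* values) is hypothetical. It assumes that a TGID-type lookup only resolves to a thread-group leader, while a PID-type lookup resolves to any thread by its tid.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model, not kernel code: every name here is hypothetical. */
enum toy_pid_type { TOY_PIDTYPE_PID, TOY_PIDTYPE_TGID };

struct toy_task {
	int tid;	/* thread id */
	int tgid;	/* thread-group id (what userspace calls "pid") */
};

/*
 * A PID-type lookup matches any thread by its tid. A TGID-type lookup
 * only matches a group leader, i.e. a task whose tid equals its tgid.
 */
static const struct toy_task *toy_get_pid_task(const struct toy_task *tasks,
					       size_t n, int nr,
					       enum toy_pid_type type)
{
	assert(tasks != NULL);
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].tid != nr)
			continue;
		if (type == TOY_PIDTYPE_TGID && tasks[i].tid != tasks[i].tgid)
			return NULL;	/* nr names a non-leader thread */
		return &tasks[i];
	}
	return NULL;
}
```

In this model, filtering on the tid of a non-leader thread finds nothing with the TGID-type lookup and finds the thread with the PID-type lookup, which is the behaviour the fix restores for tid filters.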
Viktor Malik b92231f015
bpf: Fix an issue due to uninitialized bpf_iter_task
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit 5f2ae606cb5a90839a9be9d22388c4200f820e75
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sat Feb 17 19:41:51 2024 +0800

    bpf: Fix an issue due to uninitialized bpf_iter_task
    
    Failure to initialize it->pos, coupled with the presence of an invalid
    value in the flags variable, can lead to it->pos referencing an invalid
    task, potentially resulting in a kernel panic. To mitigate this risk, it's
    crucial to ensure proper initialization of it->pos to NULL.
    
    Fixes: ac8148d957f5 ("bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)")
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Link: https://lore.kernel.org/bpf/20240217114152.1623-2-laoar.shao@gmail.com

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 11:07:44 +02:00
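
The initialization issue described in the entry above (b92231f015) can be sketched in userspace. The types, flag values and function names below are hypothetical; the only point carried over from the commit message is that the position pointer is set to NULL before any early return, so a later "next" call on a failed iterator sees NULL rather than stale memory.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model; the structs, flag values and function names are hypothetical. */
struct toy_task {
	int id;
	struct toy_task *next;
};

struct toy_task_iter {
	struct toy_task *pos;
};

enum { TOY_ITER_ALL = 0, TOY_ITER_FLAG_MAX = 1 };

static int toy_task_iter_new(struct toy_task_iter *it, struct toy_task *head,
			     unsigned int flags)
{
	assert(it != NULL);
	/*
	 * Initialize before validating flags, so that a failed "new"
	 * still leaves the iterator in a state "next" can handle.
	 */
	it->pos = NULL;
	if (flags >= TOY_ITER_FLAG_MAX)
		return -EINVAL;
	it->pos = head;
	return 0;
}

static struct toy_task *toy_task_iter_next(struct toy_task_iter *it)
{
	struct toy_task *cur = it->pos;

	if (!cur)
		return NULL;	/* failed or exhausted iterator */
	it->pos = cur->next;
	return cur;
}
```

If the NULL assignment were placed after the flags check instead, the invalid-flags path would return with pos still holding whatever was on the caller's stack.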
Viktor Malik 5adceb952f
bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit ac8148d957f50434411a0c15a2e4f352b5bb4ff2
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Tue Nov 14 17:32:39 2023 +0100

    bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)
    
    This looks more clear and simplifies the code. While at it, remove the
    unnecessary initialization of pos/task at the start of bpf_iter_task_new().
    
    Note that we can even kill kit->task, we can just use pos->group_leader,
    but I don't understand the BUILD_BUG_ON() checks in bpf_iter_task_new().
    
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20231114163239.GA903@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:51:54 +02:00
Viktor Malik 9e20814e76
bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit 5a34f9dabd9aa567e2d37e1aa27a67f80acfaa1c
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Tue Nov 14 17:32:37 2023 +0100

    bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()
    
    Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c
    is the last user and the usage is wrong.
    
    bpf_iter_task_next() can loop forever, "kit->pos == kit->task" can never
    happen if kit->pos execs. Change this code to use __next_thread().
    
    With or without this change the usage of kit->pos/task and next_task()
    doesn't look nice, see the next patch.
    
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20231114163237.GA897@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:51:53 +02:00
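
The "can loop forever" remark in the entry above (9e20814e76) describes a general hazard that a toy model can show: a circular walk that stops only when it sees its starting node again never ends once that node has been unlinked, whereas a walk that stops at a list head always ends. The node type and both walkers below are hypothetical userspace stand-ins, not the kernel's next_thread() or __next_thread(); the step cap exists only so the demonstration terminates.

```c
#include <assert.h>
#include <stddef.h>

/* Toy singly linked ring with a sentinel head; all names are hypothetical. */
struct toy_node {
	struct toy_node *next;
};

/*
 * Circular walk: stop when we are back at `start`.
 * Returns the number of steps taken, or -1 if `cap` steps were exceeded.
 */
static int toy_walk_until_start(const struct toy_node *start, int cap)
{
	const struct toy_node *p = start;
	int steps = 0;

	assert(start != NULL);
	do {
		p = p->next;
		if (++steps > cap)
			return -1;	/* would have looped forever */
	} while (p != start);
	return steps;
}

/*
 * Bounded walk: stop when the sentinel head is reached.
 * Returns the number of nodes visited after `start`, or -1 on overflow.
 */
static int toy_walk_until_head(const struct toy_node *start,
			       const struct toy_node *head, int cap)
{
	int steps = 0;

	assert(start != NULL && head != NULL);
	for (const struct toy_node *p = start->next; p != head; p = p->next)
		if (++steps > cap)
			return -1;
	return steps;
}
```

With a ring head -> a -> b -> c -> head, unlinking a (head now points to b, while a still points to b) makes the first walker hit its cap when started from a, and the second walker still stops after visiting b and c.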
Viktor Malik edf0dce146
bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit 2d1618054f25e11c44d189dbff4a60342a4cfb4b
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Tue Nov 14 17:32:34 2023 +0100

    bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()
    
    Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c
    is the last user and the usage is wrong.
    
    task_group_seq_get_next() can return the group leader twice if it races
    with mt-thread exec which changes the group->leader's pid.
    
    Change the main loop to use __next_thread(), kill "next_tid == common->pid"
    check.
    
    __next_thread() can't loop forever, we can also change this code to retry
    if next_tid == 0.
    
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20231114163234.GA890@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:51:53 +02:00
Viktor Malik 7fafe65db6
bpf: Introduce task_vma open-coded iterator kfuncs
JIRA: https://issues.redhat.com/browse/RHEL-23644

Conflicts: Several commits were previously backported out of order:
           96a4110030fb ("bpf: Introduce css_task open-coded iterator kfuncs")
           e7c7c9dedb42 ("bpf: fix compilation error without CGROUPS")
           391145ba2acc ("bpf: Add __bpf_kfunc_{start,end}_defs macros").
           Updated the commit to match upstream code as much as possible.

commit 4ac4546821584736798aaa9e97da9f6eaf689ea3
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Fri Oct 13 13:44:24 2023 -0700

    bpf: Introduce task_vma open-coded iterator kfuncs

    This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow
    creation and manipulation of struct bpf_iter_task_vma in open-coded
    iterator style. BPF programs can use these kfuncs directly or through
    bpf_for_each macro for natural-looking iteration of all task vmas.

    The implementation borrows heavily from bpf_find_vma helper's locking -
    differing only in that it holds the mmap_read lock for all iterations
    while the helper only executes its provided callback on a maximum of 1
    vma. Aside from locking, struct vma_iterator and vma_next do all the
    heavy lifting.

    A pointer to an inner data struct, struct bpf_iter_task_vma_data, is the
    only field in struct bpf_iter_task_vma. This is because the inner data
    struct contains a struct vma_iterator (not ptr), whose size is likely to
    change under us. If bpf_iter_task_vma_kern contained vma_iterator directly
    such a change would require change in opaque bpf_iter_task_vma struct's
    size. So better to allocate vma_iterator using BPF allocator, and since
    that alloc must already succeed, might as well allocate all iter fields,
    thereby freezing struct bpf_iter_task_vma size.

    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20231013204426.1074286-4-davemarchevsky@fb.com

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:51:33 +02:00
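
The layout argument in the entry above (7fafe65db6) is that the public iterator struct holds only a pointer to separately allocated state, so the state can grow without changing the public struct's size. A minimal userspace sketch of that idea follows. All names are hypothetical, malloc() stands in for the BPF allocator, and an int array stands in for the VMA list.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Public, fixed-size handle: callers only ever see 8 opaque bytes. */
struct toy_vma_iter {
	uint64_t opaque[1];
};

/* Private state; it may grow without changing sizeof(struct toy_vma_iter). */
struct toy_vma_iter_data {
	const int *items;
	size_t len;
	size_t idx;
};

_Static_assert(sizeof(struct toy_vma_iter_data *) <= sizeof(struct toy_vma_iter),
	       "handle must be able to hold one pointer");

/* Load the private-state pointer stored in the opaque bytes. */
static struct toy_vma_iter_data *toy_vma_iter_get(const struct toy_vma_iter *it)
{
	struct toy_vma_iter_data *d;

	memcpy(&d, it->opaque, sizeof(d));
	return d;
}

/* Store the private-state pointer into the opaque bytes. */
static void toy_vma_iter_set(struct toy_vma_iter *it, struct toy_vma_iter_data *d)
{
	memset(it->opaque, 0, sizeof(it->opaque));
	memcpy(it->opaque, &d, sizeof(d));
}

static int toy_vma_iter_new(struct toy_vma_iter *it, const int *items, size_t len)
{
	struct toy_vma_iter_data *d;

	assert(it != NULL);
	toy_vma_iter_set(it, NULL);	/* safe state even if allocation fails */
	d = malloc(sizeof(*d));
	if (!d)
		return -1;
	d->items = items;
	d->len = len;
	d->idx = 0;
	toy_vma_iter_set(it, d);
	return 0;
}

static const int *toy_vma_iter_next(struct toy_vma_iter *it)
{
	struct toy_vma_iter_data *d = toy_vma_iter_get(it);

	if (!d || d->idx >= d->len)
		return NULL;
	return &d->items[d->idx++];
}

static void toy_vma_iter_destroy(struct toy_vma_iter *it)
{
	free(toy_vma_iter_get(it));
	toy_vma_iter_set(it, NULL);
}
```

Adding a field to toy_vma_iter_data leaves sizeof(struct toy_vma_iter) at 8 bytes, which is the property the commit message calls "freezing" the struct size.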
Artem Savkov 1e9cbbe0f6 bpf: Add __bpf_kfunc_{start,end}_defs macros
JIRA: https://issues.redhat.com/browse/RHEL-23643

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Conflicts: missing xdp commits, missing task_vma iterator

commit 391145ba2accc48b596f3d438af1a6255b62a555
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Tue Oct 31 14:56:24 2023 -0700

    bpf: Add __bpf_kfunc_{start,end}_defs macros

    BPF kfuncs are meant to be called from BPF programs. Accordingly, most
    kfuncs are not called from anywhere in the kernel, which the
    -Wmissing-prototypes warning is unhappy about. We've peppered
    __diag_ignore_all("-Wmissing-prototypes", ... everywhere kfuncs are
    defined in the codebase to suppress this warning.

    This patch adds two macros meant to bound one or many kfunc definitions.
    All existing kfunc definitions which use these __diag calls to suppress
    -Wmissing-prototypes are migrated to use the newly-introduced macros.
    A new __diag_ignore_all - for "-Wmissing-declarations" - is added to the
    __bpf_kfunc_start_defs macro based on feedback from Andrii on an earlier
    version of this patch [0] and another recent mailing list thread [1].

    In the future we might need to ignore different warnings or do other
    kfunc-specific things. This change will make it easier to make such
    modifications for all kfunc defs.

      [0]: https://lore.kernel.org/bpf/CAEf4BzaE5dRWtK6RPLnjTW-MW9sx9K3Fn6uwqCTChK2Dcb1Xig@mail.gmail.com/
      [1]: https://lore.kernel.org/bpf/ZT+2qCc%2FaXep0%2FLf@krava/

    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Suggested-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Cc: Jiri Olsa <olsajiri@gmail.com>
    Acked-by: Jiri Olsa <jolsa@kernel.org>
    Acked-by: David Vernet <void@manifault.com>
    Acked-by: Yafang Shao <laoar.shao@gmail.com>
    Link: https://lore.kernel.org/r/20231031215625.2343848-1-davemarchevsky@fb.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 11:23:42 +01:00
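
The bracketing idea from the entry above (1e9cbbe0f6) can be sketched with plain compiler pragmas. The macro names below are hypothetical stand-ins, not the kernel's definitions, and the sketch assumes a GCC- or Clang-compatible compiler that understands `#pragma GCC diagnostic`.

```c
#include <assert.h>

/*
 * Hypothetical stand-ins for a start/end macro pair: everything defined
 * between them has the two prototype-related warnings suppressed.
 */
#define TOY_KFUNC_START_DEFS \
	_Pragma("GCC diagnostic push") \
	_Pragma("GCC diagnostic ignored \"-Wmissing-prototypes\"") \
	_Pragma("GCC diagnostic ignored \"-Wmissing-declarations\"")

#define TOY_KFUNC_END_DEFS \
	_Pragma("GCC diagnostic pop")

TOY_KFUNC_START_DEFS

/*
 * Non-static and without a prior prototype, so it would normally
 * trigger -Wmissing-prototypes when that warning is enabled.
 */
int toy_kfunc_clamp(int v, int lo, int hi)
{
	assert(lo <= hi);
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

TOY_KFUNC_END_DEFS
```

Keeping the suppression list in one macro means a future change to the set of ignored warnings touches a single definition rather than every call site.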
Artem Savkov 534a34437e bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit 0de4f50de25af79c2a46db55d70cdbd8f985c6d1
Author: Chuyi Zhou <zhouchuyi@bytedance.com>
Date:   Tue Nov 7 21:22:03 2023 +0800

    bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg
    
    BTF_TYPE_SAFE_TRUSTED(struct bpf_iter__task) in verifier.c wanted to
    teach BPF verifier that bpf_iter__task -> task is a trusted ptr. But it
    doesn't work well.
    
    The reason is, bpf_iter__task -> task would go through btf_ctx_access()
    which enforces the reg_type of 'task' is ctx_arg_info->reg_type, and in
    task_iter.c, we actually explicitly declare that the
    ctx_arg_info->reg_type is PTR_TO_BTF_ID_OR_NULL.
    
    Actually we have a previous case like this[1] where PTR_TRUSTED is added to
    the arg flag for map_iter.
    
    This patch sets ctx_arg_info->reg_type to PTR_TO_BTF_ID_OR_NULL |
    PTR_TRUSTED in task_reg_info.
    
    Similarly, bpf_cgroup_reg_info -> cgroup is also PTR_TRUSTED since we are
    under the protection of cgroup_mutex and we would check cgroup_is_dead()
    in __cgroup_iter_seq_show().
    
    This patch is to improve the user experience of the newly introduced
    bpf_iter_css_task kfunc before hitting the mainline. The Fixes tag
    points to the commit that introduced the bpf_iter_css_task kfunc.
    
    Link[1]:https://lore.kernel.org/all/20230706133932.45883-3-aspsk@isovalent.com/
    
    Fixes: 9c66dc94b62a ("bpf: Introduce css_task open-coded iterator kfuncs")
    Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20231107132204.912120-2-zhouchuyi@bytedance.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:58 +01:00
Artem Savkov 114d586292 bpf: fix compilation error without CGROUPS
JIRA: https://issues.redhat.com/browse/RHEL-23643

Conflicts: missing task_vma iterator

commit 05670f81d1287c40ec861186e4c4e3401013e7fb
Author: Matthieu Baerts <matttbe@kernel.org>
Date:   Wed Nov 1 19:16:01 2023 +0100

    bpf: fix compilation error without CGROUPS

    Our MPTCP CI complained [1] -- and KBuild too -- that it was no longer
    possible to build the kernel without CONFIG_CGROUPS:

      kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_new':
      kernel/bpf/task_iter.c:919:14: error: 'CSS_TASK_ITER_PROCS' undeclared (first use in this function)
        919 |         case CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED:
            |              ^~~~~~~~~~~~~~~~~~~
      kernel/bpf/task_iter.c:919:14: note: each undeclared identifier is reported only once for each function it appears in
      kernel/bpf/task_iter.c:919:36: error: 'CSS_TASK_ITER_THREADED' undeclared (first use in this function)
        919 |         case CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED:
            |                                    ^~~~~~~~~~~~~~~~~~~~~~
      kernel/bpf/task_iter.c:927:60: error: invalid application of 'sizeof' to incomplete type 'struct css_task_iter'
        927 |         kit->css_it = bpf_mem_alloc(&bpf_global_ma, sizeof(struct css_task_iter));
            |                                                            ^~~~~~
      kernel/bpf/task_iter.c:930:9: error: implicit declaration of function 'css_task_iter_start'; did you mean 'task_seq_start'? [-Werror=implicit-function-declaration]
        930 |         css_task_iter_start(css, flags, kit->css_it);
            |         ^~~~~~~~~~~~~~~~~~~
            |         task_seq_start
      kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_next':
      kernel/bpf/task_iter.c:940:16: error: implicit declaration of function 'css_task_iter_next'; did you mean 'class_dev_iter_next'? [-Werror=implicit-function-declaration]
        940 |         return css_task_iter_next(kit->css_it);
            |                ^~~~~~~~~~~~~~~~~~
            |                class_dev_iter_next
      kernel/bpf/task_iter.c:940:16: error: returning 'int' from a function with return type 'struct task_struct *' makes pointer from integer without a cast [-Werror=int-conversion]
        940 |         return css_task_iter_next(kit->css_it);
            |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      kernel/bpf/task_iter.c: In function 'bpf_iter_css_task_destroy':
      kernel/bpf/task_iter.c:949:9: error: implicit declaration of function 'css_task_iter_end' [-Werror=implicit-function-declaration]
        949 |         css_task_iter_end(kit->css_it);
            |         ^~~~~~~~~~~~~~~~~

    This patch simply wraps the new code requiring cgroups support in an
    #ifdef. It seems enough for the compiler and this is similar to the
    bpf_iter_css_{new,next,destroy}() functions, where no other #ifdefs have
    been added in kernel/bpf/helpers.c and in the selftests.

    Fixes: 9c66dc94b62a ("bpf: Introduce css_task open-coded iterator kfuncs")
    Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/6665206927
    Reported-by: kernel test robot <lkp@intel.com>
    Closes: https://lore.kernel.org/oe-kbuild-all/202310260528.aHWgVFqq-lkp@intel.com/
    Signed-off-by: Matthieu Baerts <matttbe@kernel.org>
    [ added missing ifdefs for BTF_ID cgroup definitions ]
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Link: https://lore.kernel.org/r/20231101181601.1493271-1-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:57 +01:00
Artem Savkov e210a1a549 bpf: Let bpf_iter_task_new accept null task ptr
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit cb3ecf7915a1d7ce5304402f4d8616d9fa5193f7
Author: Chuyi Zhou <zhouchuyi@bytedance.com>
Date:   Wed Oct 18 14:17:44 2023 +0800

    bpf: Let bpf_iter_task_new accept null task ptr
    
    When using task_iter to iterate all threads of a specific task, we enforce
    that the user must pass a valid task pointer to ensure safety. However,
    when iterating all threads/processes in the system, the BPF verifier still
    requires a valid ptr instead of a "nullable" pointer, even though the
    pointer is not used, which is surprising from a usability standpoint. It
    would be nice if we could let that kfunc accept an explicit null pointer
    when we are using BPF_TASK_ITER_ALL_{PROCS, THREADS} and a valid pointer
    when using BPF_TASK_ITER_PROC_THREADS.
    
    Given a trivial kfunc:
    	__bpf_kfunc void FN(struct TYPE_A *obj);
    
    A BPF program passing a nullptr for obj would be rejected. The error info is:
    "arg#x pointer type xx xx must point to scalar, or struct with scalar"
    reported by get_kfunc_ptr_arg_type(). The reg->type is SCALAR_VALUE and
    the btf type of ref_t is not scalar or scalar_struct which leads to the
    rejection of get_kfunc_ptr_arg_type.
    
    This patch adds the "__nullable" annotation:
    	__bpf_kfunc void FN(struct TYPE_A *obj__nullable);
    Here __nullable indicates obj can be optional: the user can pass an explicit
    nullptr or a normal TYPE_A pointer. In get_kfunc_ptr_arg_type(), we will
    detect whether the current arg is optional and the register is null. If so,
    return a new kfunc_ptr_arg_type KF_ARG_PTR_TO_NULL and skip to the next
    arg in check_kfunc_args().
    
    Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20231018061746.111364-7-zhouchuyi@bytedance.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:55 +01:00
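
The "__nullable" rule from the entry above (e210a1a549) boils down to a suffix check on the parameter name plus a check for a null argument. The sketch below models only that decision in userspace. The enum, the helper names and the classification function are hypothetical and are not the verifier's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: does a parameter name end with the given suffix? */
static bool toy_param_has_suffix(const char *name, const char *suffix)
{
	size_t nlen, slen;

	assert(name != NULL && suffix != NULL);
	nlen = strlen(name);
	slen = strlen(suffix);
	if (slen > nlen)
		return false;
	return strcmp(name + nlen - slen, suffix) == 0;
}

/* Hypothetical outcome of checking one pointer argument. */
enum toy_arg_kind {
	TOY_ARG_REJECT,		/* null passed where a valid pointer is required */
	TOY_ARG_PTR_TO_NULL,	/* null passed to an optional parameter */
	TOY_ARG_PTR,		/* ordinary non-null pointer */
};

/* Classify one pointer argument by its parameter name and its value. */
static enum toy_arg_kind toy_classify_arg(const char *param_name,
					  const void *value)
{
	if (value)
		return TOY_ARG_PTR;
	if (toy_param_has_suffix(param_name, "__nullable"))
		return TOY_ARG_PTR_TO_NULL;
	return TOY_ARG_REJECT;
}
```

A null argument is accepted only when the parameter name carries the suffix; a non-null argument is treated the same way whether or not the parameter is marked optional.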
Artem Savkov d68171a511 bpf: Introduce task open coded iterator kfuncs
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit c68a78ffe2cb4207f64fd0f4262818c728c67be0
Author: Chuyi Zhou <zhouchuyi@bytedance.com>
Date:   Wed Oct 18 14:17:41 2023 +0800

    bpf: Introduce task open coded iterator kfuncs
    
    This patch adds kfuncs bpf_iter_task_{new,next,destroy} which allow
    creation and manipulation of struct bpf_iter_task in open-coded iterator
    style. BPF programs can use these kfuncs directly or through the
    bpf_for_each macro to iterate all processes in the system.
    
    The API design is kept consistent with SEC("iter/task"). bpf_iter_task_new()
    accepts a specific task and iterating type which allows:
    
    1. iterating all processes in the system (BPF_TASK_ITER_ALL_PROCS)
    
    2. iterating all threads in the system (BPF_TASK_ITER_ALL_THREADS)
    
    3. iterating all threads of a specific task (BPF_TASK_ITER_PROC_THREADS)
    
    Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
    Link: https://lore.kernel.org/r/20231018061746.111364-4-zhouchuyi@bytedance.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:54 +01:00
Artem Savkov 5d0136ef81 bpf: Introduce css_task open-coded iterator kfuncs
JIRA: https://issues.redhat.com/browse/RHEL-23643

Conflicts: missing task_vma iterator

commit 9c66dc94b62aef23300f05f63404afb8990920b4
Author: Chuyi Zhou <zhouchuyi@bytedance.com>
Date:   Wed Oct 18 14:17:40 2023 +0800

    bpf: Introduce css_task open-coded iterator kfuncs

    This patch adds kfuncs bpf_iter_css_task_{new,next,destroy} which allow
    creation and manipulation of struct bpf_iter_css_task in open-coded
    iterator style. These kfuncs actually wrap css_task_iter_{start,next,
    end}. BPF programs can use these kfuncs through bpf_for_each macro for
    iteration of all tasks under a css.

    css_task_iter_*() would try to get the global spin-lock *css_set_lock*, so
    the bpf side has to be careful about where it allows this iter to be used.
    Currently we only allow it in bpf_lsm and bpf iter-s.

    Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Link: https://lore.kernel.org/r/20231018061746.111364-3-zhouchuyi@bytedance.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:54 +01:00
Artem Savkov 81b09400d5 bpf: task_group_seq_get_next: simplify the "next tid" logic
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit 780aa8dfcb73f4703b1c4be11c21c8dca36502ad
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Tue Sep 5 17:46:56 2023 +0200

    bpf: task_group_seq_get_next: simplify the "next tid" logic
    
    Kill saved_tid. It looks ugly to update *tid and then restore the
    previous value if __task_pid_nr_ns() returns 0. Change this code
    to update *tid and common->pid_visiting once before return.
    
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20230905154656.GA24950@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:46 +01:00
Artem Savkov 9a6a816efa bpf: task_group_seq_get_next: kill next_task
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit 0ee9808b0a211ba1e572073c6afe5897f8300b9c
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Tue Sep 5 17:46:54 2023 +0200

    bpf: task_group_seq_get_next: kill next_task
    
    It only adds unnecessary confusion and complicates the "retry" code.
    
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20230905154654.GA24945@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:46 +01:00
Artem Savkov 3a5139119b bpf: task_group_seq_get_next: fix the skip_if_dup_files check
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit 87abbf7a54f6c9c51374b0701cd7ab47534516ae
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Tue Sep 5 17:46:51 2023 +0200

    bpf: task_group_seq_get_next: fix the skip_if_dup_files check
    
    Unless I am totally confused it is wrong. We are going to return or
    skip next_task so we need to check next_task->files, not task->files.
    
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20230905154651.GA24940@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:46 +01:00
Artem Savkov be009bdc3e bpf: task_group_seq_get_next: cleanup the usage of get/put_task_struct
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit 4981921350452a7639fac9ac8f19be4d25febdca
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Tue Sep 5 17:46:49 2023 +0200

    bpf: task_group_seq_get_next: cleanup the usage of get/put_task_struct
    
    get_pid_task() makes no sense, the code does put_task_struct() soon after.
    Use find_task_by_pid_ns() instead of find_pid_ns + get_pid_task and kill
    put_task_struct(), this allows to do get_task_struct() only once before
    return.
    
    While at it, kill the unnecessary "if (!pid)" check in the "if (!*tid)"
    block, this matches the next usage of find_pid_ns() + get_pid_task() in
    this function.
    
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20230905154649.GA24935@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:46 +01:00
Artem Savkov 96631a33db bpf: task_group_seq_get_next: cleanup the usage of next_thread()
JIRA: https://issues.redhat.com/browse/RHEL-23643

commit 1a00ef57d9f120b711b6b1193d12ba3789d47ec2
Author: Oleg Nesterov <oleg@redhat.com>
Date:   Tue Sep 5 17:46:46 2023 +0200

    bpf: task_group_seq_get_next: cleanup the usage of next_thread()
    
    1. find_pid_ns() + get_pid_task() under rcu_read_lock() guarantees that we
       can safely iterate the task->thread_group list. Even if this task exits
       right after get_pid_task() (or goto retry) and pid_alive() returns 0.
    
       Kill the unnecessary pid_alive() check.
    
    2. next_thread() simply can't return NULL, kill the bogus "if (!next_task)"
       check.
    
    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Acked-by: Yonghong Song <yonghong.song@linux.dev>
    Link: https://lore.kernel.org/r/20230905154646.GA24928@redhat.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2024-03-27 10:27:46 +01:00
Jerome Marchand 6d0e8e7e13 bpf: keep a reference to the mm, in case the task is dead.
Bugzilla: https://bugzilla.redhat.com/2177177

commit 7ff94f276f8ea05df82eb115225e9b26f47a3347
Author: Kui-Feng Lee <kuifeng@meta.com>
Date:   Fri Dec 16 14:18:54 2022 -0800

    bpf: keep a reference to the mm, in case the task is dead.

    Fix the system crash that happens when a task iterator travels through
    the vmas of tasks.

    In task iterators, we used to access mm by following the pointer on
    the task_struct; however, the death of a task will clear the pointer,
    even though we still hold the task_struct.  That can cause an
    unexpected crash for a null pointer when an iterator is visiting a
    task that dies during the visit.  Keeping a reference of mm on the
    iterator ensures we always have a valid pointer to mm.

    Co-developed-by: Song Liu <song@kernel.org>
    Signed-off-by: Song Liu <song@kernel.org>
    Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
    Reported-by: Nathan Slingerland <slinger@meta.com>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/r/20221216221855.4122288-2-kuifeng@meta.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-04-28 11:43:19 +02:00
Artem Savkov 0455ab2df7 bpf: Handle show_fdinfo for the parameterized task BPF iterators
Bugzilla: https://bugzilla.redhat.com/2166911

commit 2c4fe44fb020f3cce904da2ba9e42bb1c118e8a3
Author: Kui-Feng Lee <kuifeng@fb.com>
Date:   Mon Sep 26 11:49:55 2022 -0700

    bpf: Handle show_fdinfo for the parameterized task BPF iterators
    
    Show information of iterators in the respective files under
    /proc/<pid>/fdinfo/.
    
    For example, for a task file iterator with 1723 as the value of tid
    parameter, its fdinfo would look like the following lines.
    
        pos:    0
        flags:  02000000
        mnt_id: 14
        ino:    38
        link_type:      iter
        link_id:        51
        prog_tag:       a590ac96db22b825
        prog_id:        299
        target_name:    task_file
        task_type:      TID
        tid: 1723
    
    This patch adds the last three fields.  task_type is the type of the
    task parameter.  TID means the iterator visits only the thread
    specified by tid.  The value of tid in the above example is 1723.  For
    the case of PID task_type, it means the iterator visits only threads
    of a process and will show the pid value of the process instead of a
    tid.
    
    Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/bpf/20220926184957.208194-4-kuifeng@fb.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:19 +01:00
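
The fdinfo sample in the entry above (0455ab2df7) is line-oriented "key: value" text, so a reader can pull the tid out with a small scanner. The helper below is a hypothetical userspace sketch that assumes the text is already in memory and that the field is spelled "tid:" as in the sample. It does not read /proc itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan fdinfo-style text for a line starting with
 * "tid:". Returns 1 and stores the value in *tid if found, 0 otherwise.
 */
static int toy_fdinfo_tid(const char *text, int *tid)
{
	const char *line = text;

	assert(text != NULL && tid != NULL);
	while (*line) {
		/* Skip indentation in front of the key. */
		while (*line == ' ' || *line == '\t')
			line++;
		if (sscanf(line, "tid: %d", tid) == 1)
			return 1;
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (!line)
			break;
		line++;
	}
	return 0;
}
```

Matching at the start of each line keeps keys such as "mnt_id:" from being mistaken for "tid:" even though they end with the same characters.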
Artem Savkov 48fbcb53ce bpf: Handle bpf_link_info for the parameterized task BPF iterators.
Bugzilla: https://bugzilla.redhat.com/2166911

commit 21fb6f2aa3890b0d0abf88b7756d0098e9367a7c
Author: Kui-Feng Lee <kuifeng@fb.com>
Date:   Mon Sep 26 11:49:54 2022 -0700

    bpf: Handle bpf_link_info for the parameterized task BPF iterators.
    
    Add new fields to bpf_link_info that users can query through
    bpf_obj_get_info_by_fd().
    
    Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/bpf/20220926184957.208194-3-kuifeng@fb.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:19 +01:00
Artem Savkov 758d8b2d3a bpf: Parameterize task iterators.
Bugzilla: https://bugzilla.redhat.com/2166911

commit f0d74c4da1f060d2a66976193712a5e6abd361f5
Author: Kui-Feng Lee <kuifeng@fb.com>
Date:   Mon Sep 26 11:49:53 2022 -0700

    bpf: Parameterize task iterators.
    
    Allow creating an iterator that loops through resources of one
    thread/process.
    
    Previously, people could only create iterators that loop through all
    resources of files, vma, and tasks in the system, even though they were
    interested in only the resources of a specific task or process.  By passing
    the additional parameters, people can now create an iterator to go
    through all resources or only the resources of a task.
    
    Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/bpf/20220926184957.208194-2-kuifeng@fb.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:19 +01:00
Artem Savkov 9a9a473691 bpf: remove VMA linked list
Bugzilla: https://bugzilla.redhat.com/2166911

commit becc8cdb6cb28d9fd3ecf890d1d6e59118a6a53d
Author: Liam R. Howlett <Liam.Howlett@Oracle.com>
Date:   Tue Sep 6 19:48:59 2022 +0000

    bpf: remove VMA linked list
    
    Use vma_next() and remove reference to the start of the linked list
    
    Link: https://lkml.kernel.org/r/20220906194824.2110408-51-Liam.Howlett@oracle.com
    Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
    Tested-by: Yu Zhao <yuzhao@google.com>
    Cc: Catalin Marinas <catalin.marinas@arm.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Sven Schnelle <svens@linux.ibm.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Will Deacon <will@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:18 +01:00
Yauheni Kaliuta 25910dab8b bpf: Remove redundant assignment to meta.seq in __task_seq_show()
Bugzilla: https://bugzilla.redhat.com/2120968

commit aa1b02e674fe69acd04624f5bcdef94928bc8695
Author: Yuntao Wang <ytcoode@gmail.com>
Date:   Sun Apr 10 14:00:19 2022 +0800

    bpf: Remove redundant assignment to meta.seq in __task_seq_show()
    
    The seq argument is assigned to meta.seq twice, the second one is
    redundant, remove it.
    
    This patch also removes a redundant space in bpf_iter_link_attach().
    
    Signed-off-by: Yuntao Wang <ytcoode@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Song Liu <songliubraving@fb.com>
    Link: https://lore.kernel.org/bpf/20220410060020.307283-1-ytcoode@gmail.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-28 16:48:59 +02:00
Artem Savkov 5cebd099b9 bpf: Introduce btf_tracing_ids
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit d19ddb476a539fd78ad1028ae13bb38506286931
Author: Song Liu <songliubraving@fb.com>
Date:   Fri Nov 12 07:02:43 2021 -0800

    bpf: Introduce btf_tracing_ids

    Similar to btf_sock_ids, btf_tracing_ids provides btf ID for task_struct,
    file, and vm_area_struct via easy to understand format like
    btf_tracing_ids[BTF_TRACING_TYPE_[TASK|FILE|VMA]].

    Suggested-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20211112150243.1270987-3-songliubraving@fb.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:37 +02:00
Artem Savkov c083c778ce bpf: Introduce helper bpf_find_vma
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 7c7e3d31e7856a8260a254f8c71db416f7f9f5a1
Author: Song Liu <songliubraving@fb.com>
Date:   Fri Nov 5 16:23:29 2021 -0700

    bpf: Introduce helper bpf_find_vma

    In some profiler use cases, it is necessary to map an address to the
    backing file, e.g., a shared library. bpf_find_vma helper provides a
    flexible way to achieve this. bpf_find_vma maps an address of a task to
    the vma (vm_area_struct) for this address, and feeds the vma to a callback
    BPF function. The callback function is necessary here, as we need to
    ensure mmap_sem is unlocked.

    It is necessary to lock mmap_sem for find_vma. To lock and unlock mmap_sem
    safely when irqs are disabled, we use the same mechanism as stackmap with
    build_id. Specifically, when irqs are disabled, the unlock is postponed
    in an irq_work. Refactor stackmap.c so that the irq_work is shared among
    bpf_find_vma and stackmap helpers.
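
The address-to-vma lookup with a callback can be modeled in userspace
roughly as follows. This is a sketch under assumptions, not the helper's
implementation: toy_vma, toy_find_vma and toy_record_backing are
hypothetical names, the ranges live in a plain array, and no locking or
irq_work handling is modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical address range [start, end), standing in for a vma. */
struct toy_vma {
	unsigned long start;
	unsigned long end;
	const char *backing;   /* e.g. name of the backing file */
};

typedef long (*toy_vma_cb)(const struct toy_vma *vma, void *ctx);

/*
 * Find the range containing addr and hand it to cb. Returns the callback's
 * result, or -1 when no range covers addr. The lock that protects the vma
 * while the callback runs is not modeled here.
 */
static long toy_find_vma(const struct toy_vma *vmas, size_t nr,
			 unsigned long addr, toy_vma_cb cb, void *ctx)
{
	assert(cb);
	for (size_t i = 0; i < nr; i++)
		if (addr >= vmas[i].start && addr < vmas[i].end)
			return cb(&vmas[i], ctx);
	return -1;
}

/* Example callback: remember which backing object was hit. */
static long toy_record_backing(const struct toy_vma *vma, void *ctx)
{
	*(const char **)ctx = vma->backing;
	return 0;
}
```

Passing the vma to a callback, instead of returning it, keeps every use of
the vma inside the region where it is known to be valid.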

    Signed-off-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Tested-by: Hengqi Chen <hengqi.chen@gmail.com>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20211105232330.1936330-2-songliubraving@fb.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:34 +02:00
Jerome Marchand 5ba1f36d0a bpf: Consolidate task_struct BTF_ID declarations
Bugzilla: http://bugzilla.redhat.com/2041365

commit 33c5cb36015ac1034b50b823fae367e908d05147
Author: Daniel Xu <dxu@dxuuu.xyz>
Date:   Mon Aug 23 19:43:47 2021 -0700

    bpf: Consolidate task_struct BTF_ID declarations

    No need to have it defined 5 times. Once is enough.

    Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/6dcefa5bed26fe1226f26683f36819bb53ec19a2.1629772842.git.dxu@dxuuu.xyz

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:47 +02:00
Song Liu 3a7b35b899 bpf: Introduce task_vma bpf_iter
Introduce task_vma bpf_iter to print memory information of a process. It
can be used to print customized information similar to /proc/<pid>/maps.

Currently, /proc/<pid>/maps and /proc/<pid>/smaps provide information about
the vmas of a process. However, this information is not flexible enough to
cover all use cases. For example, if a vma covers a mix of 2MB pages and 4kB
pages (x86_64), there is no easy way to tell which address ranges are
backed by 2MB pages. task_vma solves the problem by enabling the user to
generate customized information based on the vma (and vma->vm_mm,
vma->vm_file, etc.).

To access the vma safely in the BPF program, task_vma iterator holds
target mmap_lock while calling the BPF program. If the mmap_lock is
contended, task_vma unlocks mmap_lock between iterations to unblock the
writer(s). This lock contention avoidance mechanism is similar to the one
used in show_smaps_rollup().

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com
2021-02-12 12:56:53 -08:00
Yonghong Song 04901aab40 bpf: Fix a task_iter bug caused by a merge conflict resolution
Latest bpf tree has a bug for bpf_iter selftest:

  $ ./test_progs -n 4/25
  test_bpf_sk_storage_get:PASS:bpf_iter_bpf_sk_storage_helpers__open_and_load 0 nsec
  test_bpf_sk_storage_get:PASS:socket 0 nsec
  ...
  do_dummy_read:PASS:read 0 nsec
  test_bpf_sk_storage_get:FAIL:bpf_map_lookup_elem map value wasn't set correctly
                          (expected 1792, got -1, err=0)
  #4/25 bpf_sk_storage_get:FAIL
  #4 bpf_iter:FAIL
  Summary: 0/0 PASSED, 0 SKIPPED, 2 FAILED

When doing merge conflict resolution, commit 4bfc471484 failed to
save curr_task to the seq_file private data. The task pointer in the
seq_file private data is passed to the bpf program. As a result, a NULL
task pointer was passed to the bpf program, which returns immediately
upon checking whether the task pointer is NULL.

This patch added back the assignment of curr_task to seq_file private
data and fixed the issue.

Fixes: 4bfc471484 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20201231052418.577024-1-yhs@fb.com
2021-01-03 01:41:32 +01:00
David S. Miller 4bfc471484 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-12-28

The following pull-request contains BPF updates for your *net* tree.

There is a small merge conflict between bpf tree commit 69ca310f34
("bpf: Save correct stopping point in file seq iteration") and net tree
commit 66ed594409 ("bpf/task_iter: In task_file_seq_get_next use
task_lookup_next_fd_rcu"). The get_files_struct() does not exist anymore
in net, so take the hunk in HEAD and add the `info->tid = curr_tid` to
the error path:

  [...]
                curr_task = task_seq_get_next(ns, &curr_tid, true);
                if (!curr_task) {
                        info->task = NULL;
                        info->tid = curr_tid;
                        return NULL;
                }

                /* set info->task and info->tid */
  [...]

We've added 10 non-merge commits during the last 9 day(s) which contain
a total of 11 files changed, 75 insertions(+), 20 deletions(-).

The main changes are:

1) Various AF_XDP fixes such as fill/completion ring leak on failed bind and
   fixing a race in skb mode's backpressure mechanism, from Magnus Karlsson.

2) Fix latency spikes on lockdep enabled kernels by adding a rescheduling
   point to BPF hashtab initialization, from Eric Dumazet.

3) Fix a splat in task iterator by saving the correct stopping point in the
   seq file iteration, from Jonathan Lemon.

4) Fix BPF maps selftest by adding retries in case hashtab returns EBUSY
   errors on update/deletes, from Andrii Nakryiko.

5) Fix BPF selftest error reporting to something more user friendly if the
   vmlinux BTF cannot be found, from Kamal Mostafa.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-12-28 15:26:11 -08:00
Jonathan Lemon a61daaf351 bpf: Use thread_group_leader()
Instead of directly comparing task->tgid and task->pid, use the
thread_group_leader() helper.  This helps with readability, and
there should be no functional change.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201218185032.2464558-3-jonathan.lemon@gmail.com
2020-12-24 02:04:53 +01:00
Jonathan Lemon 69ca310f34 bpf: Save correct stopping point in file seq iteration
On some systems, some variant of the following splat is
repeatedly seen.  The common factor in all traces seems
to be the entry point to task_file_seq_next().  With the
patch, all warnings go away.

    rcu: INFO: rcu_sched self-detected stall on CPU
    rcu:     26-....: (20992 ticks this GP) idle=d7e/1/0x4000000000000002 softirq=81556231/81556231 fqs=4876
        (t=21033 jiffies g=159148529 q=223125)
    NMI backtrace for cpu 26
    CPU: 26 PID: 2015853 Comm: bpftool Kdump: loaded Not tainted 5.6.13-0_fbk4_3876_gd8d1f9bf80bb #1
    Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A12 10/08/2018
    Call Trace:
     <IRQ>
     dump_stack+0x50/0x70
     nmi_cpu_backtrace.cold.6+0x13/0x50
     ? lapic_can_unplug_cpu.cold.30+0x40/0x40
     nmi_trigger_cpumask_backtrace+0xba/0xca
     rcu_dump_cpu_stacks+0x99/0xc7
     rcu_sched_clock_irq.cold.90+0x1b4/0x3aa
     ? tick_sched_do_timer+0x60/0x60
     update_process_times+0x24/0x50
     tick_sched_timer+0x37/0x70
     __hrtimer_run_queues+0xfe/0x270
     hrtimer_interrupt+0xf4/0x210
     smp_apic_timer_interrupt+0x5e/0x120
     apic_timer_interrupt+0xf/0x20
     </IRQ>
    RIP: 0010:get_pid_task+0x38/0x80
    Code: 89 f6 48 8d 44 f7 08 48 8b 00 48 85 c0 74 2b 48 83 c6 55 48 c1 e6 04 48 29 f0 74 19 48 8d 78 20 ba 01 00 00 00 f0 0f c1 50 20 <85> d2 74 27 78 11 83 c2 01 78 0c 48 83 c4 08 c3 31 c0 48 83 c4 08
    RSP: 0018:ffffc9000d293dc8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
    RAX: ffff888637c05600 RBX: ffffc9000d293e0c RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000550 RDI: ffff888637c05620
    RBP: ffffffff8284eb80 R08: ffff88831341d300 R09: ffff88822ffd8248
    R10: ffff88822ffd82d0 R11: 00000000003a93c0 R12: 0000000000000001
    R13: 00000000ffffffff R14: ffff88831341d300 R15: 0000000000000000
     ? find_ge_pid+0x1b/0x20
     task_seq_get_next+0x52/0xc0
     task_file_seq_get_next+0x159/0x220
     task_file_seq_next+0x4f/0xa0
     bpf_seq_read+0x159/0x390
     vfs_read+0x8a/0x140
     ksys_read+0x59/0xd0
     do_syscall_64+0x42/0x110
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f95ae73e76e
    Code: Bad RIP value.
    RSP: 002b:00007ffc02c1dbf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
    RAX: ffffffffffffffda RBX: 000000000170faa0 RCX: 00007f95ae73e76e
    RDX: 0000000000001000 RSI: 00007ffc02c1dc30 RDI: 0000000000000007
    RBP: 00007ffc02c1ec70 R08: 0000000000000005 R09: 0000000000000006
    R10: fffffffffffff20b R11: 0000000000000246 R12: 00000000019112a0
    R13: 0000000000000000 R14: 0000000000000007 R15: 00000000004283c0

If unable to obtain the file structure for the current task,
proceed to the next task number after the one returned from
task_seq_get_next(), instead of the next task number from the
original iterator.

Also, save the stopping task number from task_seq_get_next()
on failure in case of restarts.

Fixes: eaaacd2391 ("bpf: Add task and task/file iterator targets")
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201218185032.2464558-2-jonathan.lemon@gmail.com
2020-12-24 02:04:47 +01:00
Linus Torvalds faf145d6f3 Merge branch 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
 "This set of changes ultimately fixes the interaction of posix file
  lock and exec. Fundamentally most of the change is just moving where
  unshare_files is called during exec, and tweaking the users of
  files_struct so that the count of files_struct is not unnecessarily
  played with.

  Along the way fcheck and related helpers were renamed to more
  accurately reflect what they do.

  There were also many other small changes that fell out, as this is the
  first time in a long time much of this code has been touched.

  Benchmarks haven't turned up any practical issues but Al Viro has
  observed a possibility for a lot of pounding on task_lock. So I have
  some changes in progress to convert put_files_struct to always rcu
  free files_struct. That wasn't ready for the merge window so that will
  have to wait until next time"

* 'exec-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  exec: Move io_uring_task_cancel after the point of no return
  coredump: Document coredump code exclusively used by cell spufs
  file: Remove get_files_struct
  file: Rename __close_fd_get_file close_fd_get_file
  file: Replace ksys_close with close_fd
  file: Rename __close_fd to close_fd and remove the files parameter
  file: Merge __alloc_fd into alloc_fd
  file: In f_dupfd read RLIMIT_NOFILE once.
  file: Merge __fd_install into fd_install
  proc/fd: In fdinfo seq_show don't use get_files_struct
  bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
  proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu
  file: Implement task_lookup_next_fd_rcu
  kcmp: In get_file_raw_ptr use task_lookup_fd_rcu
  proc/fd: In tid_fd_mode use task_lookup_fd_rcu
  file: Implement task_lookup_fd_rcu
  file: Rename fcheck lookup_fd_rcu
  file: Replace fcheck_files with files_lookup_fd_rcu
  file: Factor files_lookup_fd_locked out of fcheck_files
  file: Rename __fcheck_files to files_lookup_fd_raw
  ...
2020-12-15 19:29:43 -08:00
Eric W. Biederman 66ed594409 bpf/task_iter: In task_file_seq_get_next use task_lookup_next_fd_rcu
When discussing[1] exec and posix file locks it was realized that none
of the callers of get_files_struct fundamentally needed to call
get_files_struct, and that by switching them to helper functions
instead it will both simplify their code and remove unnecessary
increments of files_struct.count.  Those unnecessary increments can
result in exec unnecessarily unsharing files_struct, which breaks
posix locks, and they can result in fget_light having to fall back to
fget, reducing system performance.

Using task_lookup_next_fd_rcu simplifies task_file_seq_get_next, by
moving the checking for the maximum file descriptor into the generic
code, and by removing the need for capturing and releasing a reference
on files_struct.  As the reference count of files_struct no longer
needs to be maintained bpf_iter_seq_task_file_info can have its files
member removed and task_file_seq_get_next no longer needs its fstruct
argument.

The curr_fd local variable does need to become unsigned to be used
with fnext_task.  As curr_fd is assigned from and assigned a u32
making curr_fd an unsigned int won't cause problems and might prevent
them.

[1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com
Suggested-by: Oleg Nesterov <oleg@redhat.com>
v1: https://lkml.kernel.org/r/20200817220425.9389-11-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/20201120231441.29911-16-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:42:58 -06:00
Eric W. Biederman f36c294327 file: Replace fcheck_files with files_lookup_fd_rcu
This change renames fcheck_files to files_lookup_fd_rcu.  All of the
remaining callers take the rcu_read_lock before calling this function
so the _rcu suffix is appropriate.  This change also tightens up the
debug check to verify that all callers hold the rcu_read_lock.

All callers that used to call files_check with the files->file_lock
held have now been changed to call files_lookup_fd_locked.

This change of name has helped remind me of which locks and which
guarantees are in place helping me to catch bugs later in the
patchset.

The need for better names became apparent in the last round of
discussion of this set of changes[1].

[1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com
Link: https://lkml.kernel.org/r/20201120231441.29911-9-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-12-10 12:40:03 -06:00
Song Liu 91b2db27d3 bpf: Simplify task_file_seq_get_next()
Simplify task_file_seq_get_next() by removing two in/out arguments: task
and fstruct. Use info->task and info->files instead.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20201120002833.2481110-1-songliubraving@fb.com
2020-11-20 20:36:34 +01:00
Yonghong Song cf83b2d2e2 bpf: Permit cond_resched for some iterators
Commit e679654a70 ("bpf: Fix a rcu_sched stall issue with
bpf task/task_file iterator") tried to fix the rcu stall warning
caused by the bpf task_file iterator when running
"bpftool prog".

      rcu: INFO: rcu_sched self-detected stall on CPU
      rcu:     7-....: (20999 ticks this GP) idle=302/1/0x4000000000000000 softirq=1508852/1508852 fqs=4913
          (t=21031 jiffies g=2534773 q=179750)
      NMI backtrace for cpu 7
      CPU: 7 PID: 184195 Comm: bpftool Kdump: loaded Tainted: G        W         5.8.0-00004-g68bfc7f8c1b4 #6
      Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
      Call Trace:
      <IRQ>
      dump_stack+0x57/0x70
      nmi_cpu_backtrace.cold+0x14/0x53
      ? lapic_can_unplug_cpu.cold+0x39/0x39
      nmi_trigger_cpumask_backtrace+0xb7/0xc7
      rcu_dump_cpu_stacks+0xa2/0xd0
      rcu_sched_clock_irq.cold+0x1ff/0x3d9
      ? tick_nohz_handler+0x100/0x100
      update_process_times+0x5b/0x90
      tick_sched_timer+0x5e/0xf0
      __hrtimer_run_queues+0x12a/0x2a0
      hrtimer_interrupt+0x10e/0x280
      __sysvec_apic_timer_interrupt+0x51/0xe0
      asm_call_on_stack+0xf/0x20
      </IRQ>
      sysvec_apic_timer_interrupt+0x6f/0x80
      ...
      task_file_seq_next+0x52/0xa0
      bpf_seq_read+0xb9/0x320
      vfs_read+0x9d/0x180
      ksys_read+0x5f/0xe0
      do_syscall_64+0x38/0x60
      entry_SYSCALL_64_after_hwframe+0x44/0xa9

That fix limited the number of bpf program runs to
one million. This fixed the problem in most cases. But
we also found that under heavy load, which can increase the wallclock
time of bpf_seq_read(), the warning may still appear.

For example, when bpf_delay() below is called in the "while" loop of
bpf_seq_read() to introduce an artificial delay,
the warning shows up in my qemu run.

  static unsigned q;
  volatile unsigned *p = &q;
  volatile unsigned long long ll;
  static void bpf_delay(void)
  {
         int i, j;

         for (i = 0; i < 10000; i++)
                 for (j = 0; j < 10000; j++)
                         ll += *p;
  }

There are two ways to fix this issue. One is to reduce the above
one million threshold to, say, 100,000 and hope the rcu warning will
not show up any more. Another is to introduce a target feature
which enables bpf_seq_read() to call cond_resched().

This patch took the second approach, as the first approach may cause
more -EAGAIN failures for read() syscalls. Note that not all bpf_iter
targets can permit cond_resched() in bpf_seq_read(), as for some, e.g.,
the netlink seq iterator, the rcu read lock critical section spans
seq_ops->next() -> seq_ops->show() -> seq_ops->next().

For the kernel code with the above hack, "bpftool p" roughly takes
38 seconds to finish on my VM with 184 bpf program runs.
Using the following command, I am able to collect the number of
context switches:
   perf stat -e context-switches -- ./bpftool p >& log
Without this patch,
   69      context-switches
With this patch,
   75      context-switches
This patch added 6 additional context switches, rescheduling roughly
every 6 seconds, to avoid long periods without rescheduling, which may
cause the above RCU warnings.
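
The opt-in idea can be sketched in userspace as a per-target feature bit
that the read loop checks before yielding. This is a model under
assumptions: TOY_ITER_RESCHED, toy_target and toy_read are hypothetical
names, and sched_yield() merely stands in for cond_resched().

```c
#include <assert.h>
#include <sched.h>

#define TOY_ITER_RESCHED (1u << 0)   /* hypothetical feature bit */

/* Hypothetical target registration: only the feature word is modeled. */
struct toy_target {
	unsigned int feature;
};

/*
 * Walk nr objects and yield between objects only if the target opted in.
 * Returns how many times the loop offered to reschedule.
 */
static unsigned long toy_read(const struct toy_target *t, unsigned long nr)
{
	unsigned long yields = 0;

	assert(t);
	for (unsigned long i = 0; i < nr; i++) {
		/* ... run the program on object i ... */
		if (t->feature & TOY_ITER_RESCHED) {
			sched_yield();
			yields++;
		}
	}
	return yields;
}
```

Targets that hold a lock across next()/show() simply leave the bit clear,
so the loop never yields in the middle of their critical section.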

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201028061054.1411116-1-yhs@fb.com
2020-10-28 14:54:31 -07:00
Yonghong Song 203d7b054f bpf: Avoid iterating duplicated files for task_file iterator
Currently, task_file iterator iterates all files from all tasks.
This may potentially visit a lot of duplicated files if there are
many tasks sharing the same files, e.g., typical pthreads
where these pthreads and the main thread are sharing the same files.

This patch changed task_file iterator to skip a particular task
if that task shares the same files as its group_leader (the task
having the same tgid and also task->tgid == task->pid).
This will preserve the same result, visiting all files from all
tasks, and will reduce runtime cost significantly, e.g., if there are
a lot of pthreads and the process has a lot of open files.
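
The skip rule can be sketched as a small predicate. The struct and
function names below are hypothetical and keep only the fields the rule
needs. This is a userspace model, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified task: only the fields the rule needs. */
struct toy_task {
	int pid;
	int tgid;
	const void *files;               /* open-file table, possibly shared */
	const struct toy_task *leader;   /* thread group leader */
};

/* A task is its own group leader when pid == tgid. */
static bool toy_is_leader(const struct toy_task *t)
{
	return t->pid == t->tgid;
}

/* Skip a non-leader thread whose file table is the leader's table. */
static bool toy_skip_task(const struct toy_task *t)
{
	assert(t && t->leader);
	return !toy_is_leader(t) && t->files == t->leader->files;
}
```

A thread that has its own file table is still visited, so no file is lost
from the result.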

Suggested-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/bpf/20200902023112.1672792-1-yhs@fb.com
2020-09-02 16:40:33 +02:00
Yonghong Song e60572b8d4 bpf: Avoid visit same object multiple times
Currently when traversing all tasks, the next tid
is always increased by one. This may result in
visiting the same task multiple times in a
pid namespace.

This patch fixed the issue by setting the next
tid to pid_nr_ns(pid, ns) + 1, similar to
function next_tgid().
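
The effect of the cursor update can be seen in a small userspace model.
toy_find_ge() is a hypothetical stand-in for a "smallest live id >= want"
lookup over a sorted array. The other names are made up as well.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lookup: smallest live id >= want, or -1 if none. */
static int toy_find_ge(const int *ids, size_t nr, int want)
{
	for (size_t i = 0; i < nr; i++)   /* ids[] is sorted ascending */
		if (ids[i] >= want)
			return ids[i];
	return -1;
}

/*
 * Count how many times the walk returns an id. With jump_to_found set,
 * the cursor moves to found + 1 (the fixed behavior); otherwise it only
 * advances by one from the previous cursor (the old behavior).
 */
static int toy_count_visits(const int *ids, size_t nr, int jump_to_found)
{
	int visits = 0, cursor = 0, found;

	assert(ids || nr == 0);
	while ((found = toy_find_ge(ids, nr, cursor)) >= 0) {
		visits++;
		cursor = jump_to_found ? found + 1 : cursor + 1;
	}
	return visits;
}
```

With live ids {1, 5, 9}, the fixed walk returns each id once (3 visits),
while the old walk returns an id for every cursor value from 0 to 9
(10 visits).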

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/bpf/20200818222310.2181500-1-yhs@fb.com
2020-08-18 17:36:23 -07:00
Yonghong Song cf28f3bbfc bpf: Use get_file_rcu() instead of get_file() for task_file iterator
With latest `bpftool prog` command, we observed the following kernel
panic.
    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor instruction fetch in kernel mode
    #PF: error_code(0x0010) - not-present page
    PGD dfe894067 P4D dfe894067 PUD deb663067 PMD 0
    Oops: 0010 [#1] SMP
    CPU: 9 PID: 6023 ...
    RIP: 0010:0x0
    Code: Bad RIP value.
    RSP: 0000:ffffc900002b8f18 EFLAGS: 00010286
    RAX: ffff8883a405f400 RBX: ffff888e46a6bf00 RCX: 000000008020000c
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8883a405f400
    RBP: ffff888e46a6bf50 R08: 0000000000000000 R09: ffffffff81129600
    R10: ffff8883a405f300 R11: 0000160000000000 R12: 0000000000002710
    R13: 000000e9494b690c R14: 0000000000000202 R15: 0000000000000009
    FS:  00007fd9187fe700(0000) GS:ffff888e46a40000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffffffffd6 CR3: 0000000de5d33002 CR4: 0000000000360ee0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <IRQ>
     rcu_core+0x1a4/0x440
     __do_softirq+0xd3/0x2c8
     irq_exit+0x9d/0xa0
     smp_apic_timer_interrupt+0x68/0x120
     apic_timer_interrupt+0xf/0x20
     </IRQ>
    RIP: 0033:0x47ce80
    Code: Bad RIP value.
    RSP: 002b:00007fd9187fba40 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
    RAX: 0000000000000002 RBX: 00007fd931789160 RCX: 000000000000010c
    RDX: 00007fd9308cdfb4 RSI: 00007fd9308cdfb4 RDI: 00007ffedd1ea0a8
    RBP: 00007fd9187fbab0 R08: 000000000000000e R09: 000000000000002a
    R10: 0000000000480210 R11: 00007fd9187fc570 R12: 00007fd9316cc400
    R13: 0000000000000118 R14: 00007fd9308cdfb4 R15: 00007fd9317a9380

After further analysis, the bug is triggered by
Commit eaaacd2391 ("bpf: Add task and task/file iterator targets")
which introduced task_file bpf iterator, which traverses all open file
descriptors for all tasks in the current namespace.
The latest `bpftool prog` calls a task_file bpf program to traverse
all files in the system in order to associate processes with progs/maps, etc.
When traversing files for a given task, rcu read_lock is taken to
access all files in a file_struct. But it used get_file() to grab
a file, which is not right. It is possible file->f_count is 0 and
get_file() will unconditionally increase it.
Later put_file() may cause all kinds of issues, with the above as
one of the symptoms.

The failure can be reproduced with the following steps in a few seconds:
    $ cat t.c
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define N 10000
    int fd[N];
    int main() {
      int i;

      for (i = 0; i < N; i++) {
        fd[i] = open("./note.txt", 'r');
        if (fd[i] < 0) {
           fprintf(stderr, "failed\n");
           return -1;
        }
      }
      for (i = 0; i < N; i++)
        close(fd[i]);

      return 0;
    }
    $ gcc -O2 t.c
    $ cat run.sh
    #!/bin/bash
    for i in {1..100}
    do
      while true; do ./a.out; done &
    done
    $ ./run.sh
    $ while true; do bpftool prog >& /dev/null; done

This patch used get_file_rcu() which only grabs a file if the
file->f_count is not zero. This is to ensure the file pointer
is always valid. The above reproducer did not fail for more
than 30 minutes.
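
The difference between the two ways of taking a reference can be modeled
with C11 atomics. This is a userspace sketch with hypothetical names
(toy_file, toy_get, toy_get_not_zero), not the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object with a reference count, standing in for struct file. */
struct toy_file {
	atomic_long count;
};

/* Unconditional grab: wrong if the count may already have dropped to 0. */
static void toy_get(struct toy_file *f)
{
	atomic_fetch_add(&f->count, 1);
}

/*
 * Conditional grab: take a reference only while the count is still
 * non-zero, otherwise report failure so the caller skips the object.
 */
static bool toy_get_not_zero(struct toy_file *f)
{
	long old = atomic_load(&f->count);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&f->count, &old, old + 1))
			return true;
	}
	return false;
}
```

An object whose count has reached zero is already on its way to being
freed. The unconditional grab would bring it back to 1 and a later put
would tear it down a second time.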

Fixes: eaaacd2391 ("bpf: Add task and task/file iterator targets")
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/bpf/20200817174214.252601-1-yhs@fb.com
2020-08-17 14:42:58 -07:00
Yonghong Song f9c7927295 bpf: Refactor to provide aux info to bpf_iter_init_seq_priv_t
This patch refactored target bpf_iter_init_seq_priv_t callback
function to accept additional information. This will be needed
in later patches for map element targets since a particular
map should be passed to traverse elements for that particular
map. In the future, other information may be passed to target
as well, e.g., pid, cgroup id, etc. to customize the iterator.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184110.590156-1-yhs@fb.com
2020-07-25 20:16:32 -07:00
Yonghong Song 14fc6bd6b7 bpf: Refactor bpf_iter_reg to have separate seq_info member
There is no functionality change for this patch.
Struct bpf_iter_reg is used to register a bpf_iter target,
which includes information for prog_load, link_create
and seq_file creation.

This patch puts fields related to seq_file creation into
a different structure. This will be useful for map
elements iterator where one iterator covers different
map types and different map types may have different
seq_ops, init/fini private_data function and
private_data size.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200723184109.590030-1-yhs@fb.com
2020-07-25 20:16:32 -07:00
Yonghong Song 3f9969f2c0 bpf: Fix pos computation for bpf_iter seq_ops->start()
Currently, the pos pointer in bpf iterator map/task/task_file
seq_ops->start() is always incremented.
This is incorrect. It should be increased only if
*pos is 0 (for SEQ_START_TOKEN) since these start()
functions actually return the first real object.
If *pos is not 0, start() merely finds the object
based on the state in seq->private and does not really
advance *pos. This patch fixed this issue
by only incrementing *pos if it is 0.

Note that the old *pos calculation, although not
correct, does not affect correctness of bpf_iter
as bpf_iter seq_file->read() does not support llseek.

This patch also renamed "mid" in bpf_map iterator
seq_file private data to "map_id" for better clarity.
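
The *pos rule can be sketched in userspace as below. toy_iter and
toy_start are hypothetical names. The struct stands in for the state kept
in seq->private.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the iterator state kept in seq->private. */
struct toy_iter {
	const int *objs;   /* objects to walk */
	size_t nr;         /* number of objects */
	size_t cur;        /* index of the object start() should return */
};

/*
 * Model of a start() callback. It returns the object recorded in the
 * private state and bumps *pos only on the very first call (*pos == 0),
 * because only then does it step past the start token to a real object.
 */
static const int *toy_start(struct toy_iter *it, long long *pos)
{
	assert(it && pos);
	if (it->cur >= it->nr)
		return NULL;
	if (*pos == 0)
		++*pos;
	return &it->objs[it->cur];
}
```

A restart after a partial read calls toy_start() again with a non-zero
*pos and gets the remembered object back without *pos moving.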

Fixes: 6086d29def ("bpf: Add bpf_map iterator")
Fixes: eaaacd2391 ("bpf: Add task and task/file iterator targets")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200722195156.4029817-1-yhs@fb.com
2020-07-25 20:16:32 -07:00
Yonghong Song 951cf368bc bpf: net: Use precomputed btf_id for bpf iterators
One additional field btf_id is added to struct
bpf_ctx_arg_aux to store the precomputed btf_ids.
The btf_id is computed at build time with
BTF_ID_LIST or BTF_ID_LIST_GLOBAL macro definitions.
All existing bpf iterators are changed to use
precomputed btf_ids.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200720163403.1393551-1-yhs@fb.com
2020-07-21 13:26:26 -07:00
Andrii Nakryiko c70f34a8ac bpf: Fix bpf_iter's task iterator logic
task_seq_get_next might stop prematurely if get_pid_task() fails to get
task_struct. Failure to do so doesn't mean that there are no more tasks with
higher pids. Procfs's iteration algorithm (see next_tgid in fs/proc/base.c)
does a retry in such a case. After this fix, instead of stopping prematurely
after about 300 tasks on my server, bpf_iter program now returns >4000, which
sounds much closer to reality.
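
The retry can be modeled in userspace as follows. toy_slot and
toy_next_task are hypothetical names. has_task == false stands in for a
failed task lookup on an id that still exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pid slot: the id exists, but the task may be gone already. */
struct toy_slot {
	int id;
	bool has_task;    /* false models a failed task lookup */
};

/*
 * Return the first id >= *cursor that still has a task, or -1 when the
 * slots are exhausted. A slot without a task is skipped and the search
 * retried at the next id instead of ending the whole walk.
 */
static int toy_next_task(const struct toy_slot *slots, size_t nr, int *cursor)
{
	assert(cursor);
	for (size_t i = 0; i < nr; i++) {   /* slots[] is sorted by id */
		if (slots[i].id < *cursor)
			continue;
		*cursor = slots[i].id + 1;
		if (slots[i].has_task)
			return slots[i].id;
		/* lookup failed: fall through and retry with the next slot */
	}
	return -1;
}
```

Stopping at the first failed lookup would hide every task with a higher
id, which matches the undercount described above.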

Fixes: eaaacd2391 ("bpf: Add task and task/file iterator targets")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200514055137.1564581-1-andriin@fb.com
2020-05-14 18:37:32 -07:00
Yonghong Song 3c32cc1bce bpf: Enable bpf_iter targets registering ctx argument types
Commit b121b341e5 ("bpf: Add PTR_TO_BTF_ID_OR_NULL
support") adds a field btf_id_or_null_non0_off to
bpf_prog->aux structure to indicate that the
first ctx argument is PTR_TO_BTF_ID reg_type and
all others are PTR_TO_BTF_ID_OR_NULL.
This approach does not really scale if we have
other different reg types in the future, e.g.,
a pointer to a buffer.

This patch enables bpf_iter targets registering ctx argument
reg types which may be different from the default one.
For example, for pointers to structures, the default reg_type
is PTR_TO_BTF_ID for tracing program. The target can register
a particular pointer type as PTR_TO_BTF_ID_OR_NULL which can
be used by the verifier to enforce accesses.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180221.2949882-1-yhs@fb.com
2020-05-13 12:30:50 -07:00
Yonghong Song 15172a46fa bpf: net: Refactor bpf_iter target registration
Currently bpf_iter_reg_target takes parameters from target
and allocates memory to save them. This is really not
necessary, especially since in the future we may grow the
information passed from targets to the bpf_iter manager.

The patch refactors the code so target reg_info
becomes static and bpf_iter manager can just take
a reference to it.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200513180219.2949605-1-yhs@fb.com
2020-05-13 12:30:50 -07:00
Yonghong Song eaaacd2391 bpf: Add task and task/file iterator targets
Only the tasks belonging to "current" pid namespace
are enumerated.

For task/file target, the bpf program will have access to
  struct task_struct *task
  u32 fd
  struct file *file
where fd/file is an open file for the task.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175911.2476407-1-yhs@fb.com
2020-05-09 17:05:26 -07:00