Commit Graph

125 Commits

Author SHA1 Message Date
Jerome Marchand 52e91da62b cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
JIRA: https://issues.redhat.com/browse/RHEL-63880

commit 117932eea99b729ee5d12783601a4f7f5fd58a23
Author: Chen Ridong <chenridong@huawei.com>
Date:   Tue Oct 8 11:24:56 2024 +0000

    cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction

    A hung_task problem shown below was found:

    INFO: task kworker/0:0:8 blocked for more than 327 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Workqueue: events cgroup_bpf_release
    Call Trace:
     <TASK>
     __schedule+0x5a2/0x2050
     ? find_held_lock+0x33/0x100
     ? wq_worker_sleeping+0x9e/0xe0
     schedule+0x9f/0x180
     schedule_preempt_disabled+0x25/0x50
     __mutex_lock+0x512/0x740
     ? cgroup_bpf_release+0x1e/0x4d0
     ? cgroup_bpf_release+0xcf/0x4d0
     ? process_scheduled_works+0x161/0x8a0
     ? cgroup_bpf_release+0x1e/0x4d0
     ? mutex_lock_nested+0x2b/0x40
     ? __pfx_delay_tsc+0x10/0x10
     mutex_lock_nested+0x2b/0x40
     cgroup_bpf_release+0xcf/0x4d0
     ? process_scheduled_works+0x161/0x8a0
     ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
     ? process_scheduled_works+0x161/0x8a0
     process_scheduled_works+0x23a/0x8a0
     worker_thread+0x231/0x5b0
     ? __pfx_worker_thread+0x10/0x10
     kthread+0x14d/0x1c0
     ? __pfx_kthread+0x10/0x10
     ret_from_fork+0x59/0x70
     ? __pfx_kthread+0x10/0x10
     ret_from_fork_asm+0x1b/0x30
     </TASK>

    This issue can be reproduced by the following pressure test:
    1. A large number of cpuset cgroups are deleted.
    2. Set cpu on and off repeatedly.
    3. Set watchdog_thresh repeatedly.
    The scripts can be obtained at LINK mentioned above the signature.

    The reason for this issue is that cgroup_mutex and cpu_hotplug_lock are
    acquired in different tasks, which may lead to a deadlock through the
    following steps:
    1. A large number of cpusets are deleted asynchronously, which puts a
       large number of cgroup_bpf_release works into system_wq. The max_active
       of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
       cgroup_bpf_release works, and many cgroup_bpf_release works will be put
       into the inactive queue. As illustrated in the diagram, there are 256 (in
       the active queue) + n (in the inactive queue) works.
    2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
       smp_call_on_cpu work into system_wq. However, step 1 has already filled
       system_wq, so 'sscs.work' is put into the inactive queue. 'sscs.work' has
       to wait until the works that were put into the inactive queue earlier
       have executed (n cgroup_bpf_release works), so it will be blocked for a while.
    3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
    4. Cpusets that were deleted at step 1 put cgroup_release works into
       cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
       When cgroup_mutex is acquired by work at css_killed_work_fn, it will
       call cpuset_css_offline, which needs to acquire cpu_hotplug_lock.read.
       However, cpuset_css_offline will be blocked because of step 3.
    5. At this moment, the 256 works in the active queue are all
       cgroup_bpf_release works attempting to acquire cgroup_mutex, and as a
       result all of them are blocked. Consequently, sscs.work cannot be
       executed. Ultimately, this situation leads to four processes being
       blocked, forming a deadlock.

    system_wq(step1)		WatchDog(step2)			cpu offline(step3)	cgroup_destroy_wq(step4)
    ...
    2000+ cgroups deleted async
    256 actives + n inactives
    				__lockup_detector_reconfigure
    				P(cpu_hotplug_lock.read)
    				put sscs.work into system_wq
    256 + n + 1(sscs.work)
    sscs.work waits to be executed
    				waiting for sscs.work to finish
    								percpu_down_write
    								P(cpu_hotplug_lock.write)
    								...blocking...
    											css_killed_work_fn
    											P(cgroup_mutex)
    											cpuset_css_offline
    											P(cpu_hotplug_lock.read)
    											...blocking...
    256 cgroup_bpf_release
    mutex_lock(&cgroup_mutex);
    ..blocking...

    To fix the problem, place the cgroup_bpf_release works on a dedicated
    workqueue, which breaks the loop and solves the problem. System wqs are
    for misc things which shouldn't create a large number of concurrent work
    items. If something is going to generate >WQ_DFL_ACTIVE(256) concurrent
    work items, it should use its own dedicated workqueue.
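
    A minimal sketch of that approach (names follow the upstream patch; treat
    this as illustrative rather than the exact diff):

        static struct workqueue_struct *cgroup_bpf_destroy_wq;

        static int __init cgroup_bpf_wq_init(void)
        {
        	/* dedicated wq (max_active = 1) for cgroup bpf destruction */
        	cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
        	if (!cgroup_bpf_destroy_wq)
        		panic("Failed to alloc workqueue for cgroup bpf destroy.\n");
        	return 0;
        }
        core_initcall(cgroup_bpf_wq_init);

        /* and in cgroup_bpf_release_fn(), queue on it instead of system_wq: */
        queue_work(cgroup_bpf_destroy_wq, &cgrp->bpf.release_work);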

    Fixes: 4bfc0bb2c6 ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
    Cc: stable@vger.kernel.org # v5.3+
    Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
    Tested-by: Vishal Chourasia <vishalc@linux.ibm.com>
    Signed-off-by: Chen Ridong <chenridong@huawei.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2025-01-21 11:27:07 +01:00
Jerome Marchand b936aa2a1f bpf: Allow bpf_current_task_under_cgroup() with BPF_CGROUP_*
JIRA: https://issues.redhat.com/browse/RHEL-63880

commit 7f6287417baf57754f47687c6ea1a749a0686ab0
Author: Matteo Croce <teknoraver@meta.com>
Date:   Mon Aug 19 18:28:05 2024 +0200

    bpf: Allow bpf_current_task_under_cgroup() with BPF_CGROUP_*

    The helper bpf_current_task_under_cgroup() is currently only allowed for
    tracing programs; allow its usage also in the BPF_CGROUP_* program types.

    Move the code from kernel/trace/bpf_trace.c to kernel/bpf/helpers.c,
    so that it also compiles without CONFIG_BPF_EVENTS.

    This will be used in systemd-networkd to monitor sysctl writes,
    and to filter its own writes from others':
    https://github.com/systemd/systemd/pull/32212
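
    A hedged sketch of how a cgroup program might use the helper once this is
    allowed (illustrative names; the cgroup array map is populated with a
    cgroup fd from userspace):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
        	__uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
        	__uint(max_entries, 1);
        	__uint(key_size, sizeof(__u32));
        	__uint(value_size, sizeof(__u32));
        } cgroups SEC(".maps");

        SEC("cgroup/sysctl")
        int filter_sysctl_writes(struct bpf_sysctl *ctx)
        {
        	/* allow writes from tasks under the cgroup stored at index 0 */
        	if (bpf_current_task_under_cgroup(&cgroups, 0))
        		return 1;
        	return 0;	/* reject everyone else */
        }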

    Signed-off-by: Matteo Croce <teknoraver@meta.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20240819162805.78235-3-technoboy85@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2025-01-21 11:24:24 +01:00
Viktor Malik bfe07a51a2 bpf: Allow helper bpf_get_[ns_]current_pid_tgid() for all prog types
JIRA: https://issues.redhat.com/browse/RHEL-30773

commit eb166e522c77699fc19bfa705652327a1e51a117
Author: Yonghong Song <yonghong.song@linux.dev>
Date:   Fri Mar 15 11:48:54 2024 -0700

    bpf: Allow helper bpf_get_[ns_]current_pid_tgid() for all prog types
    
    Currently bpf_get_current_pid_tgid() is allowed in tracing, cgroup
    and sk_msg progs while bpf_get_ns_current_pid_tgid() is only allowed
    in tracing progs.
    
    We have an internal use case where, for an application running
    in a container (with a pid namespace), the user wants to get
    the pid associated with that pid namespace in a cgroup bpf
    program. Currently, cgroup bpf progs already allow
    bpf_get_current_pid_tgid(). Let us allow bpf_get_ns_current_pid_tgid()
    as well.
    
    Auditing the code shows bpf_get_current_pid_tgid() is also used
    by sk_msg progs. There is no side effect to exposing these two
    helpers to all prog types, since they do not reveal any kernel-specific
    data. The detailed discussion is in [1].
    
    So with this patch, both bpf_get_current_pid_tgid() and bpf_get_ns_current_pid_tgid()
    are put in bpf_base_func_proto(), making them available to all
    program types.
    
      [1] https://lore.kernel.org/bpf/20240307232659.1115872-1-yonghong.song@linux.dev/
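
    For illustration, a hedged sketch of a cgroup program using the helper
    (pidns_dev/pidns_ino are assumptions filled in by the loader, e.g. from
    stat(2) of /proc/self/ns/pid):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        const volatile __u64 pidns_dev;	/* set by the loader */
        const volatile __u64 pidns_ino;

        SEC("cgroup_skb/egress")
        int ns_pid_example(struct __sk_buff *skb)
        {
        	struct bpf_pidns_info ns = {};

        	/* now permitted in cgroup progs, not just tracing */
        	if (!bpf_get_ns_current_pid_tgid(pidns_dev, pidns_ino,
        					 &ns, sizeof(ns)))
        		bpf_printk("pid in ns: %u", ns.pid);
        	return 1;	/* allow the packet */
        }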
    
    Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Jiri Olsa <jolsa@kernel.org>
    Link: https://lore.kernel.org/bpf/20240315184854.2975190-1-yonghong.song@linux.dev

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-11-07 13:58:30 +01:00
Jerome Marchand 485dcac254 bpf: remove check in __cgroup_bpf_run_filter_skb
JIRA: https://issues.redhat.com/browse/RHEL-23649

commit 32e18e7688c6847b0c9db073aafb00639ecf576c
Author: Oliver Crumrine <ozlinuxc@gmail.com>
Date:   Fri Feb 9 14:41:22 2024 -0500

    bpf: remove check in __cgroup_bpf_run_filter_skb

    Originally, this patch removed a redundant check in
    BPF_CGROUP_RUN_PROG_INET_EGRESS, as the check was already being done in
    the function it called, __cgroup_bpf_run_filter_skb. For v2, it was
    recommended that I remove the check from __cgroup_bpf_run_filter_skb,
    and add the checks to the other macro that calls that function,
    BPF_CGROUP_RUN_PROG_INET_INGRESS.

    To sum it up, checking that the socket exists and that it is a full
    socket is now part of both macros BPF_CGROUP_RUN_PROG_INET_EGRESS and
    BPF_CGROUP_RUN_PROG_INET_INGRESS, and it is no longer part of the
    function they call, __cgroup_bpf_run_filter_skb.
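
    Roughly, the ingress macro then takes the same shape as the egress one
    (a hedged sketch, not the verbatim kernel macro):

        #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)		      \
        ({								      \
        	int __ret = 0;						      \
        	/* sk && sk_fullsock(sk) now checked here, not in the */      \
        	/* called function */					      \
        	if (cgroup_bpf_enabled(CGROUP_INET_INGRESS) &&		      \
        	    sk && sk_fullsock(sk))				      \
        		__ret = __cgroup_bpf_run_filter_skb(sk, skb,	      \
        						    CGROUP_INET_INGRESS); \
        	__ret;							      \
        })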

    v3->v4: Fixed weird merge conflict.
    v2->v3: Sent to bpf-next instead of generic patch
    v1->v2: Addressed feedback about where check should be removed.

    Signed-off-by: Oliver Crumrine <ozlinuxc@gmail.com>
    Acked-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/7lv62yiyvmj5a7eozv2iznglpkydkdfancgmbhiptrgvgan5sy@3fl3onchgdz3
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2024-10-15 10:49:10 +02:00
Jerome Marchand d1c16d1138 bpf: Take into account BPF token when fetching helper protos
JIRA: https://issues.redhat.com/browse/RHEL-23649

Conflicts: Context change due to missing commit 9a675ba55a96 ("net,
bpf: Add a warning if NAPI cb missed xdp_do_flush().")

commit bbc1d24724e110b86a1a7c3c1724ce0d62cc1e2e
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Tue Jan 23 18:21:04 2024 -0800

    bpf: Take into account BPF token when fetching helper protos

    Instead of performing unconditional system-wide bpf_capable() and
    perfmon_capable() calls inside the bpf_base_func_proto() function (and
    other similar ones) to determine the eligibility of a given BPF helper
    for a given program, use the BPF token previously recorded during
    BPF_PROG_LOAD command handling to inform the decision.
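
    In rough, hedged form, the capability gate inside the proto lookup
    changes along these lines:

        /* before: a system-wide check, ignoring how the prog was loaded */
        if (!bpf_capable())
        	return NULL;

        /* after: consult the token recorded on the program at load time */
        if (!bpf_token_capable(prog->aux->token, CAP_BPF))
        	return NULL;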

    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20240124022127.2379740-8-andrii@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2024-10-15 10:49:03 +02:00
Jeff Moyer a5d32ca328 bpf: Add sockptr support for setsockopt
JIRA: https://issues.redhat.com/browse/RHEL-27755

commit 3f31e0d14d44ad491a81b7c1f83f32fbc300a867
Author: Breno Leitao <leitao@debian.org>
Date:   Mon Oct 16 06:47:40 2023 -0700

    bpf: Add sockptr support for setsockopt
    
    The whole network stack uses sockptr, and while it doesn't move to
    something more modern, let's use sockptr in the setsockopt BPF hooks
    so that they can be used by other callers.
    
    The main motivation for this change is to use it in the io_uring
    {g,s}etsockopt(), which will use a userspace pointer for *optval but a
    kernel value for optlen.
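
    A hedged sketch of the resulting hook signature and the kind of copy it
    enables (details may differ from the final patch):

        int __cgroup_bpf_run_filter_setsockopt(struct sock *sk, int *level,
        				       int *optname, sockptr_t optval,
        				       int *optlen, char **kernel_optval);

        /* inside, copies no longer care whether optval is a user or
         * kernel pointer: */
        if (copy_from_sockptr(ctx.optval, optval, max_optlen))
        	return -EFAULT;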
    
    Link: https://lore.kernel.org/all/ZSArfLaaGcfd8LH8@gmail.com/
    
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20231016134750.1381153-3-leitao@debian.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-07-02 09:55:34 -04:00
Jeff Moyer 4172e40e17 bpf: Add sockptr support for getsockopt
JIRA: https://issues.redhat.com/browse/RHEL-27755

commit a615f67e1a426f35366b8398c11f31c148e7df48
Author: Breno Leitao <leitao@debian.org>
Date:   Mon Oct 16 06:47:39 2023 -0700

    bpf: Add sockptr support for getsockopt
    
    The whole network stack uses sockptr, and while it doesn't move to
    something more modern, let's use sockptr in the getsockopt BPF hooks
    so that they can be used by other callers.
    
    The main motivation for this change is to use it in the io_uring
    {g,s}etsockopt(), which will use a userspace pointer for *optval but a
    kernel value for optlen.
    
    Link: https://lore.kernel.org/all/ZSArfLaaGcfd8LH8@gmail.com/
    
    Signed-off-by: Breno Leitao <leitao@debian.org>
    Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
    Link: https://lore.kernel.org/r/20231016134750.1381153-2-leitao@debian.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
2024-07-02 09:54:34 -04:00
Jerome Marchand f449721ba0 bpf, cgroup: fix multiple kernel-doc warnings
JIRA: https://issues.redhat.com/browse/RHEL-10691

commit 214bfd267f4929722b374b43fda456c21cd6f016
Author: Randy Dunlap <rdunlap@infradead.org>
Date:   Mon Sep 11 23:08:12 2023 -0700

    bpf, cgroup: fix multiple kernel-doc warnings

    Fix missing or extra function parameter kernel-doc warnings
    in cgroup.c:

    kernel/bpf/cgroup.c:1359: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_skb'
    kernel/bpf/cgroup.c:1359: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_skb'
    kernel/bpf/cgroup.c:1439: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sk'
    kernel/bpf/cgroup.c:1439: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sk'
    kernel/bpf/cgroup.c:1467: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sock_addr'
    kernel/bpf/cgroup.c:1467: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sock_addr'
    kernel/bpf/cgroup.c:1512: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sock_ops'
    kernel/bpf/cgroup.c:1512: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sock_ops'
    kernel/bpf/cgroup.c:1685: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sysctl'
    kernel/bpf/cgroup.c:1685: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sysctl'
    kernel/bpf/cgroup.c:795: warning: Excess function parameter 'type' description in '__cgroup_bpf_replace'
    kernel/bpf/cgroup.c:795: warning: Function parameter or member 'new_prog' not described in '__cgroup_bpf_replace'

    Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
    Cc: Martin KaFai Lau <martin.lau@linux.dev>
    Cc: bpf@vger.kernel.org
    Cc: Alexei Starovoitov <ast@kernel.org>
    Cc: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20230912060812.1715-1-rdunlap@infradead.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2023-12-15 09:29:04 +01:00
Viktor Malik 7304a5353a bpf: Don't EFAULT for {g,s}etsockopt with wrong optlen
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit 29ebbba7d46136cba324264e513a1e964ca16c0a
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu May 11 10:04:53 2023 -0700

    bpf: Don't EFAULT for {g,s}etsockopt with wrong optlen
    
    With the way the hooks are implemented right now, we have a special
    condition: an optval larger than PAGE_SIZE will expose only the first 4k
    into BPF; any modifications to the optval are ignored. If the BPF program
    doesn't handle this condition by resetting optlen to 0,
    the userspace will get EFAULT.
    
    The intention of the EFAULT was to make it apparent to the
    developers that the program is doing something wrong.
    However, this inadvertently might affect production workloads
    with BPF programs that are not too careful (i.e., returning EFAULT
    for perfectly valid setsockopt/getsockopt calls).
    
    Let's try to minimize the chance of a BPF program screwing up userspace
    by ignoring the output of those BPF programs (instead of returning
    EFAULT to the userspace). pr_info_once these cases to
    dmesg to help with figuring out what's going wrong.
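
    The careful-program pattern the text refers to looks roughly like this
    (a hedged sketch of a getsockopt filter):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("cgroup/getsockopt")
        int handle_large_optval(struct bpf_sockopt *ctx)
        {
        	/* only the first 4k is exposed for optval > PAGE_SIZE; tell
        	 * the kernel we didn't touch the buffer instead of risking a
        	 * truncated write */
        	if (ctx->optlen > 4096)
        		ctx->optlen = 0;
        	return 1;	/* allow */
        }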
    
    Fixes: 0d01da6afc ("bpf: implement getsockopt and setsockopt hooks")
    Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20230511170456.1759459-2-sdf@google.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-12 11:41:04 +02:00
Artem Savkov 7cbf3a2981 bpf: Don't EFAULT for getsockopt with optval=NULL
Bugzilla: https://bugzilla.redhat.com/2221599

commit 00e74ae0863827d944e36e56a4ce1e77e50edb91
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Apr 18 15:53:38 2023 -0700

    bpf: Don't EFAULT for getsockopt with optval=NULL
    
    Some socket options do getsockopt with optval=NULL to estimate the size
    of the final buffer (which is returned via optlen). This breaks BPF
    getsockopt assumptions about the permitted optval buffer size. Let's
    enforce these assumptions only when a non-NULL optval is provided.
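
    The userspace pattern being accommodated is the plain two-step probe
    (fd/level/optname and use() are placeholders):

        socklen_t optlen = 0;

        /* first call: optval == NULL, kernel reports the needed size */
        if (getsockopt(fd, level, optname, NULL, &optlen) == 0) {
        	void *buf = malloc(optlen);
        	/* second call: fetch the actual value */
        	if (buf && getsockopt(fd, level, optname, buf, &optlen) == 0)
        		use(buf, optlen);	/* hypothetical consumer */
        	free(buf);
        }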
    
    Fixes: 0d01da6afc ("bpf: implement getsockopt and setsockopt hooks")
    Reported-by: Martin KaFai Lau <martin.lau@kernel.org>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lore.kernel.org/bpf/ZD7Js4fj5YyI2oLd@google.com/T/#mb68daf700f87a9244a15d01d00c3f0e5b08f49f7
    Link: https://lore.kernel.org/bpf/20230418225343.553806-2-sdf@google.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:32 +02:00
Artem Savkov f6cd44c258 cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappers
Bugzilla: https://bugzilla.redhat.com/2221599

Conflicts: missing 0083d27b21dd2 "cgroup: Improve cftype
add/rm error handling"

commit 4cdb91b0dea7d7f59fa84a13c7753cd434fdedcf
Author: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Date:   Fri Mar 3 15:23:10 2023 +0530

    cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappers

    Replace mutex_[un]lock() with cgroup_[un]lock() wrappers to stay
    consistent across cgroup core and other subsystem code, while
    operating on the cgroup_mutex.
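
    The change is mechanical; a hedged before/after for one call site:

        /* before */
        mutex_lock(&cgroup_mutex);
        /* ... operate on cgroup state ... */
        mutex_unlock(&cgroup_mutex);

        /* after */
        cgroup_lock();
        /* ... operate on cgroup state ... */
        cgroup_unlock();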

    Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:20 +02:00
Artem Savkov 59cde3cedb bpf: allow ctx writes using BPF_ST_MEM instruction
Bugzilla: https://bugzilla.redhat.com/2221599

commit 0d80a619c113d0e216dbffa56b2d5ccc079ee520
Author: Eduard Zingerman <eddyz87@gmail.com>
Date:   Sat Mar 4 03:12:45 2023 +0200

    bpf: allow ctx writes using BPF_ST_MEM instruction
    
    Lift verifier restriction to use BPF_ST_MEM instructions to write to
    context data structures. This requires the following changes:
     - verifier.c:do_check() for BPF_ST updated to:
       - no longer forbid writes to registers of type PTR_TO_CTX;
       - track dst_reg type in the env->insn_aux_data[...].ptr_type field
         (same way it is done for BPF_STX and BPF_LDX instructions).
     - verifier.c:convert_ctx_access() and various callbacks invoked by
       it are updated to handle BPF_ST instructions alongside BPF_STX.
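
    For instance, a store of an immediate to a writable ctx field, which the
    verifier previously rejected, now passes (a hedged snippet using the
    kernel's insn macros; assumes a prog type where skb->mark is writable):

        /* new: single-instruction immediate store to ctx */
        BPF_ST_MEM(BPF_W, BPF_REG_1, offsetof(struct __sk_buff, mark), 42),

        /* old two-instruction BPF_STX equivalent, always allowed */
        BPF_MOV64_IMM(BPF_REG_2, 42),
        BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_2,
        	    offsetof(struct __sk_buff, mark)),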
    
    Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
    Link: https://lore.kernel.org/r/20230304011247.566040-2-eddyz87@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:10 +02:00
Artem Savkov 74f2bb6c6c bpf: Make bpf_get_current_[ancestor_]cgroup_id() available for all program types
Bugzilla: https://bugzilla.redhat.com/2221599

commit c501bf55c88b834adefda870c7c092ec9052a437
Author: Tejun Heo <tj@kernel.org>
Date:   Thu Mar 2 09:42:59 2023 -1000

    bpf: Make bpf_get_current_[ancestor_]cgroup_id() available for all program types
    
    These helpers are safe to call from any context and there's no reason to
    restrict access to them. Remove them from bpf_trace and filter lists and add
    to bpf_base_func_proto() under perfmon_capable().
    
    v2: After consulting with Andrii, relocated in bpf_base_func_proto() so that
        they require bpf_capable() but not perfmon_capable(), as they don't read
        from or affect others on the system.
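
    In hedged, rough form, the relocation amounts to new cases in the
    bpf_capable()-gated part of bpf_base_func_proto():

        if (!bpf_capable())
        	return NULL;

        switch (func_id) {
        case BPF_FUNC_get_current_cgroup_id:
        	return &bpf_get_current_cgroup_id_proto;
        case BPF_FUNC_get_current_ancestor_cgroup_id:
        	return &bpf_get_current_ancestor_cgroup_id_proto;
        /* ... */
        }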
    
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/ZAD8QyoszMZiTzBY@slm.duckdns.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-09-22 09:12:10 +02:00
Artem Savkov ca86169c5a bpf: expose bpf_strtol and bpf_strtoul to all program types
Bugzilla: https://bugzilla.redhat.com/2166911

Conflicts: already applied 7d21225e01 "bpf: Gate dynptr API behind CAP_BPF"

commit 8a67f2de9b1dc3cf8b75b4bf589efb1f08e3e9b8
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Aug 23 15:25:53 2022 -0700

    bpf: expose bpf_strtol and bpf_strtoul to all program types

    bpf_strncmp is already exposed everywhere. The motivation is to keep
    those helpers in kernel/bpf/helpers.c. Otherwise it's tempting to move
    them under kernel/bpf/cgroup.c because they are currently only used
    by sysctl prog types.
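
    A hedged example of what this makes possible outside sysctl programs
    (bpf_strtol's flags argument of 0 auto-detects the base):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("tc")
        int parse_number(struct __sk_buff *skb)
        {
        	const char buf[] = "1024";
        	long val = 0;

        	/* previously rejected for non-sysctl prog types */
        	if (bpf_strtol(buf, sizeof(buf) - 1, 0, &val) < 0)
        		return 0;
        	return val > 0;
        }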

    Suggested-by: Martin KaFai Lau <kafai@fb.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20220823222555.523590-4-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:25 +01:00
Artem Savkov 567036485d bpf: Use cgroup_{common,current}_func_proto in more hooks
Bugzilla: https://bugzilla.redhat.com/2166911

commit bed89185af0de0d417e29ca1798df50f161b0231
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Aug 23 15:25:52 2022 -0700

    bpf: Use cgroup_{common,current}_func_proto in more hooks
    
    The following hooks are per-cgroup hooks but they are not
    using cgroup_{common,current}_func_proto, fix it:
    
    * BPF_PROG_TYPE_CGROUP_SKB (cg_skb)
    * BPF_PROG_TYPE_CGROUP_SOCK_ADDR (cg_sock_addr)
    * BPF_PROG_TYPE_CGROUP_SOCK (cg_sock)
    * BPF_PROG_TYPE_LSM+BPF_LSM_CGROUP
    
    Also:
    
    * move common func_proto's into cgroup func_proto handlers
    * make sure bpf_{g,s}et_retval are not accessible from recvmsg,
      getpeername and getsockname (return/errno is ignored in these
      places)
    * as a side effect, expose get_current_pid_tgid, get_current_comm_proto,
      get_current_ancestor_cgroup_id, get_cgroup_classid to more cgroup
      hooks
    
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20220823222555.523590-3-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:03 +01:00
Artem Savkov 47f8ca3b6d bpf: Introduce cgroup_{common,current}_func_proto
Bugzilla: https://bugzilla.redhat.com/2166911

commit dea6a4e17013382b20717664ebf3d7cc405e0952
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Aug 23 15:25:51 2022 -0700

    bpf: Introduce cgroup_{common,current}_func_proto
    
    Split cgroup_base_func_proto into the following:
    
    * cgroup_common_func_proto - common helpers for all cgroup hooks
    * cgroup_current_func_proto - common helpers for all cgroup hooks
      running in the process context (== have meaningful 'current').
    
    Move bpf_{g,s}et_retval and other cgroup-related helpers into
    kernel/bpf/cgroup.c so they are closer to where they are being used.
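
    The intended consumer pattern, sketched under the assumption that each
    per-type func_proto handler consults the shared cgroup ones first
    (cg_hook_func_proto is an illustrative name):

        static const struct bpf_func_proto *
        cg_hook_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
        {
        	const struct bpf_func_proto *fp;

        	fp = cgroup_common_func_proto(func_id, prog);	/* all cgroup hooks */
        	if (fp)
        		return fp;
        	fp = cgroup_current_func_proto(func_id, prog);	/* process ctx only */
        	if (fp)
        		return fp;
        	return bpf_base_func_proto(func_id);		/* generic fallback */
        }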
    
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/r/20220823222555.523590-2-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-03-06 14:54:03 +01:00
Artem Savkov 0bed16f5a1 bpf, cgroup: Reject prog_attach_flags array when effective query
Bugzilla: https://bugzilla.redhat.com/2137876

commit 0e426a3ae030a9e891899370229e117158b35de6
Author: Pu Lehui <pulehui@huawei.com>
Date:   Wed Sep 21 10:46:02 2022 +0000

    bpf, cgroup: Reject prog_attach_flags array when effective query
    
    Attach flags are only valid for progs attached to this layer's cgroup,
    not for effective progs. When querying with the EFFECTIVE flag,
    exporting attach flags does not make sense. So for an effective query,
    we reject the prog_attach_flags array and don't need to populate it.
    We also force attach_flags to output 0 during an effective query.
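
    Roughly (a hedged sketch of the query-path logic, not the verbatim diff):

        /* in __cgroup_bpf_query(), conceptually: */
        if (effective_query && prog_attach_flags)
        	return -EINVAL;		/* per-prog flags are meaningless here */

        flags = effective_query ? 0 : cgrp->bpf.flags[atype];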
    
    Fixes: b79c9fc9551b ("bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP")
    Signed-off-by: Pu Lehui <pulehui@huawei.com>
    Link: https://lore.kernel.org/r/20220921104604.2340580-2-pulehui@huaweicloud.com
    Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:48 +01:00
Artem Savkov c79f6ee2e7 bpf, cgroup: Fix kernel BUG in purge_effective_progs
Bugzilla: https://bugzilla.redhat.com/2137876

commit 7d6620f107bae6ed687ff07668e8e8f855487aa9
Author: Pu Lehui <pulehui@huawei.com>
Date:   Sat Aug 13 21:40:30 2022 +0800

    bpf, cgroup: Fix kernel BUG in purge_effective_progs
    
    Syzkaller reported a triggered kernel BUG as follows:
    
      ------------[ cut here ]------------
      kernel BUG at kernel/bpf/cgroup.c:925!
      invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 1 PID: 194 Comm: detach Not tainted 5.19.0-14184-g69dac8e431af #8
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
      RIP: 0010:__cgroup_bpf_detach+0x1f2/0x2a0
      Code: 00 e8 92 60 30 00 84 c0 75 d8 4c 89 e0 31 f6 85 f6 74 19 42 f6 84
      28 48 05 00 00 02 75 0e 48 8b 80 c0 00 00 00 48 85 c0 75 e5 <0f> 0b 48
      8b 0c5
      RSP: 0018:ffffc9000055bdb0 EFLAGS: 00000246
      RAX: 0000000000000000 RBX: ffff888100ec0800 RCX: ffffc900000f1000
      RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff888100ec4578
      RBP: 0000000000000000 R08: ffff888100ec0800 R09: 0000000000000040
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff888100ec4000
      R13: 000000000000000d R14: ffffc90000199000 R15: ffff888100effb00
      FS:  00007f68213d2b80(0000) GS:ffff88813bc80000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055f74a0e5850 CR3: 0000000102836000 CR4: 00000000000006e0
      Call Trace:
       <TASK>
       cgroup_bpf_prog_detach+0xcc/0x100
       __sys_bpf+0x2273/0x2a00
       __x64_sys_bpf+0x17/0x20
       do_syscall_64+0x3b/0x90
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7f68214dbcb9
      Code: 08 44 89 e0 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 48 89 f8 48 89
      f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
      f0 ff8
      RSP: 002b:00007ffeb487db68 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
      RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f68214dbcb9
      RDX: 0000000000000090 RSI: 00007ffeb487db70 RDI: 0000000000000009
      RBP: 0000000000000003 R08: 0000000000000012 R09: 0000000b00000003
      R10: 00007ffeb487db70 R11: 0000000000000246 R12: 00007ffeb487dc20
      R13: 0000000000000004 R14: 0000000000000001 R15: 000055f74a1011b0
       </TASK>
      Modules linked in:
      ---[ end trace 0000000000000000 ]---
    
    Steps to reproduce:
    
    For the following cgroup tree,
    
      root
       |
      cg1
       |
      cg2
    
      1. attach prog2 to cg2, and then attach prog1 to cg1; both bpf progs'
         attach type is NONE or OVERRIDE.
      2. write 1 to /proc/thread-self/fail-nth for failslab.
      3. detach prog1 from cg1, and then the kernel BUG occurs.
    
    Failslab injection will cause kmalloc to fail and fall back to
    purge_effective_progs. The problem is that cg2 has another prog attached,
    so when going through the cg2 layer, the iteration will advance pos by 1,
    subsequent operations will be skipped by the following condition, and cg
    will end up NULL:

      `if (pos && !(cg->bpf.flags[atype] & BPF_F_ALLOW_MULTI))`

    A NULL cg means no link or prog matched; this is expected and not a bug.
    So just skip the no-match situation here.
    
    Fixes: 4c46091ee985 ("bpf: Fix KASAN use-after-free Read in compute_effective_progs")
    Signed-off-by: Pu Lehui <pulehui@huawei.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220813134030.1972696-1-pulehui@huawei.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:48 +01:00
Artem Savkov 0e3c3201fc bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP
Bugzilla: https://bugzilla.redhat.com/2137876

commit b79c9fc9551b45953a94abf550b7bd3b00e3a0f9
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Jun 28 10:43:08 2022 -0700

    bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP
    
    We have two options:
    1. Treat all BPF_LSM_CGROUP the same, regardless of attach_btf_id
    2. Treat BPF_LSM_CGROUP+attach_btf_id as a separate hook point
    
    I was doing (2) in the original patch, but switching to (1) here:
    
    * bpf_prog_query returns all attached BPF_LSM_CGROUP programs
    regardless of attach_btf_id
    * attach_btf_id is exported via bpf_prog_info
    
    Reviewed-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20220628174314.1216643-6-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:33 +01:00
Artem Savkov ee4f4249cd bpf: minimize number of allocated lsm slots per program
Bugzilla: https://bugzilla.redhat.com/2137876

commit c0e19f2c9a3edd38e4b1bdae98eb44555d02bc31
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Jun 28 10:43:07 2022 -0700

    bpf: minimize number of allocated lsm slots per program
    
    The previous patch adds a 1:1 mapping between all 211 LSM hooks
    and the bpf_cgroup program array. Instead of reserving a slot per
    possible hook, reserve 10 slots per cgroup for lsm programs.
    Those slots are dynamically allocated on demand and reclaimed.
    
    struct cgroup_bpf {
    	struct bpf_prog_array *    effective[33];        /*     0   264 */
    	/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
    	struct hlist_head          progs[33];            /*   264   264 */
    	/* --- cacheline 8 boundary (512 bytes) was 16 bytes ago --- */
    	u8                         flags[33];            /*   528    33 */
    
    	/* XXX 7 bytes hole, try to pack */
    
    	struct list_head           storages;             /*   568    16 */
    	/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
    	struct bpf_prog_array *    inactive;             /*   584     8 */
    	struct percpu_ref          refcnt;               /*   592    16 */
    	struct work_struct         release_work;         /*   608    72 */
    
    	/* size: 680, cachelines: 11, members: 7 */
    	/* sum members: 673, holes: 1, sum holes: 7 */
    	/* last cacheline: 40 bytes */
    };
    
    Reviewed-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20220628174314.1216643-5-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:33 +01:00
Artem Savkov 9a33161b25 bpf: per-cgroup lsm flavor
Bugzilla: https://bugzilla.redhat.com/2137876

Conflicts: already applied 65d9ecfe0ca73 "bpf: Fix ref_obj_id for dynptr
data slices in verifier"

commit 69fd337a975c7e690dfe49d9cb4fe5ba1e6db44e
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Jun 28 10:43:06 2022 -0700

    bpf: per-cgroup lsm flavor

    Allow attaching to lsm hooks in the cgroup context.

    Attaching to per-cgroup LSM works exactly like attaching
    to other per-cgroup hooks. A new BPF_LSM_CGROUP is added
    to trigger the new mode; the actual lsm hook we attach to is
    signaled via the existing attach_btf_id.

    For the hooks that have 'struct socket' or 'struct sock' as their first
    argument, we use the cgroup associated with that socket. For the rest,
    we use the 'current' cgroup (this is all on the default hierarchy == v2 only).
    Note that for some hooks that work on 'struct sock' we still
    take the cgroup from 'current', because some of them work on a socket
    that hasn't been properly initialized yet.

    Behind the scenes, we allocate a shim program that is attached
    to the trampoline and runs cgroup effective BPF programs array.
    This shim has some rudimentary ref counting and can be shared
    between several programs attaching to the same lsm hook from
    different cgroups.

    Note that this patch bloats cgroup size because we add 211
    cgroup_bpf_attach_type(s) for simplicity's sake. This will be
    addressed in the subsequent patch.

    Also note that we only add the non-sleepable flavor for now. To enable
    sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu,
    shim programs have to be freed via trace rcu, cgroup_bpf.effective
    should also be trace-rcu-managed, plus maybe some other changes that
    I'm not aware of.

    Reviewed-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:33 +01:00
Artem Savkov 39e540abc9 bpf: convert cgroup_bpf.progs to hlist
Bugzilla: https://bugzilla.redhat.com/2137876

commit 00442143a2ab7f1da46fbf4d2a99c85df767d49a
Author: Stanislav Fomichev <sdf@google.com>
Date:   Tue Jun 28 10:43:05 2022 -0700

    bpf: convert cgroup_bpf.progs to hlist
    
    This lets us reclaim some space to be used by new cgroup lsm slots.
    
    Before:
    struct cgroup_bpf {
    	struct bpf_prog_array *    effective[23];        /*     0   184 */
    	/* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
    	struct list_head           progs[23];            /*   184   368 */
    	/* --- cacheline 8 boundary (512 bytes) was 40 bytes ago --- */
    	u32                        flags[23];            /*   552    92 */
    
    	/* XXX 4 bytes hole, try to pack */
    
    	/* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */
    	struct list_head           storages;             /*   648    16 */
    	struct bpf_prog_array *    inactive;             /*   664     8 */
    	struct percpu_ref          refcnt;               /*   672    16 */
    	struct work_struct         release_work;         /*   688    32 */
    
    	/* size: 720, cachelines: 12, members: 7 */
    	/* sum members: 716, holes: 1, sum holes: 4 */
    	/* last cacheline: 16 bytes */
    };
    
    After:
    struct cgroup_bpf {
    	struct bpf_prog_array *    effective[23];        /*     0   184 */
    	/* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
    	struct hlist_head          progs[23];            /*   184   184 */
    	/* --- cacheline 5 boundary (320 bytes) was 48 bytes ago --- */
    	u8                         flags[23];            /*   368    23 */
    
    	/* XXX 1 byte hole, try to pack */
    
    	/* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
    	struct list_head           storages;             /*   392    16 */
    	struct bpf_prog_array *    inactive;             /*   408     8 */
    	struct percpu_ref          refcnt;               /*   416    16 */
    	struct work_struct         release_work;         /*   432    72 */
    
    	/* size: 504, cachelines: 8, members: 7 */
    	/* sum members: 503, holes: 1, sum holes: 1 */
    	/* last cacheline: 56 bytes */
    };
    
    Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
    Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
    Reviewed-by: Martin KaFai Lau <kafai@fb.com>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/20220628174314.1216643-3-sdf@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:33 +01:00
Artem Savkov 9831122f17 bpf: Fix KASAN use-after-free Read in compute_effective_progs
Bugzilla: https://bugzilla.redhat.com/2137876

commit 4c46091ee985ae84c60c5e95055d779fcd291d87
Author: Tadeusz Struk <tadeusz.struk@linaro.org>
Date:   Tue May 17 11:04:20 2022 -0700

    bpf: Fix KASAN use-after-free Read in compute_effective_progs
    
    Syzbot found a Use After Free bug in compute_effective_progs().
    The reproducer creates a number of BPF links, and causes a
    fault-injected alloc to fail while calling bpf_link_detach on them.
    Link detach triggers the link to be freed by bpf_link_free(),
    which calls __cgroup_bpf_detach() and update_effective_progs().
    If the memory allocation in this function fails, the function restores
    the pointer to the bpf_cgroup_link on the cgroup list, but the memory
    gets freed just after it returns. After this, every subsequent call to
    update_effective_progs() causes this already deallocated pointer to be
    dereferenced in prog_list_length(), and triggers a KASAN UAF error.
    
    To fix this issue, don't preserve the pointer to the prog or link in the
    list; instead, remove it and replace it with a dummy prog without shrinking
    the table. The subsequent call to __cgroup_bpf_detach() or
    __cgroup_bpf_attach() will correct it.
    
    Fixes: af6eea5743 ("bpf: Implement bpf_link-based cgroup BPF program attachment")
    Reported-by: <syzbot+f264bffdfbd5614f3bb2@syzkaller.appspotmail.com>
    Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Cc: <stable@vger.kernel.org>
    Link: https://syzkaller.appspot.com/bug?id=8ebf179a95c2a2670f7cf1ba62429ec044369db4
    Link: https://lore.kernel.org/bpf/20220517180420.87954-1-tadeusz.struk@linaro.org

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2023-01-05 15:46:29 +01:00
Yauheni Kaliuta a0a7914b2c bpf: Use bpf_prog_run_array_cg_flags everywhere
Bugzilla: https://bugzilla.redhat.com/2120968

commit d9d31cf88702ae071bec033e5c8714048aa71285
Author: Stanislav Fomichev <sdf@google.com>
Date:   Mon Apr 25 15:04:48 2022 -0700

    bpf: Use bpf_prog_run_array_cg_flags everywhere
    
    Rename bpf_prog_run_array_cg_flags to bpf_prog_run_array_cg and
    use it everywhere. check_return_code already enforces sane
    return ranges for all cgroup types. (Only the egress and bind hooks have
    non-canonical return ranges; the rest use [0, 1].)
    
    No functional changes.
    
    v2:
    - 'func_ret & 1' under explicit test (Andrii & Martin)
    
    Suggested-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20220425220448.3669032-1-sdf@google.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-28 16:52:11 +02:00
Yauheni Kaliuta 69950554d0 bpf: Move rcu lock management out of BPF_PROG_RUN routines
Bugzilla: https://bugzilla.redhat.com/2120968

commit 055eb95533273bc334794dbc598400d10800528f
Author: Stanislav Fomichev <sdf@google.com>
Date:   Thu Apr 14 09:12:33 2022 -0700

    bpf: Move rcu lock management out of BPF_PROG_RUN routines
    
    Commit 7d08c2c91171 ("bpf: Refactor BPF_PROG_RUN_ARRAY family of macros
    into functions") switched a bunch of BPF_PROG_RUN macros to inline
    routines. This changed the semantics a bit. Due to the argument
    expansion of macros, it used to be:
    
    	rcu_read_lock();
    	array = rcu_dereference(cgrp->bpf.effective[atype]);
    	...
    
    Now, with inline routines, we have:
    	array_rcu = rcu_dereference(cgrp->bpf.effective[atype]);
    	/* array_rcu can be kfree'd here */
    	rcu_read_lock();
    	array = rcu_dereference(array_rcu);
    
    I'm assuming that in practice the rcu subsystem isn't fast enough to
    trigger this, but let's use the rcu API properly.
    
    Also, rename to lowercase to avoid confusion with the macros.
    Additionally, drop and expand BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY.
    
    See [1] for more context.
    
      [1] https://lore.kernel.org/bpf/CAKH8qBs60fOinFdxiiQikK_q0EcVxGvNTQoWvHLEUGbgcj1UYg@mail.gmail.com/T/#u
    
    v2
    - keep rcu locks inside by passing cgroup_bpf
    
    Fixes: 7d08c2c91171 ("bpf: Refactor BPF_PROG_RUN_ARRAY family of macros into functions")
    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20220414161233.170780-1-sdf@google.com

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-11-28 16:52:09 +02:00
Jerome Marchand c5056baccd bpf: Cleanup comments
Bugzilla: https://bugzilla.redhat.com/2120966

commit c561d11063009323a0e57c528cb1d77b7d2c41e0
Author: Tom Rix <trix@redhat.com>
Date:   Sun Feb 20 10:40:55 2022 -0800

    bpf: Cleanup comments

    Add leading space to spdx tag
    Use // for spdx c file comment

    Replacements
    resereved to reserved
    inbetween to in between
    everytime to every time
    intutivie to intuitive
    currenct to current
    encontered to encountered
    referenceing to referencing
    upto to up to
    exectuted to executed

    Signed-off-by: Tom Rix <trix@redhat.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Song Liu <songliubraving@fb.com>
    Link: https://lore.kernel.org/bpf/20220220184055.3608317-1-trix@redhat.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:51 +02:00
Jerome Marchand 35afa2bc38 cgroup/bpf: fast path skb BPF filtering
Bugzilla: https://bugzilla.redhat.com/2120966

commit 46531a30364bd483bfa1b041c15d42a196e77e93
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Thu Jan 27 14:09:13 2022 +0000

    cgroup/bpf: fast path skb BPF filtering

    Even though there is a static key protecting against cgroup-bpf skb
    filtering overhead when nothing is attached, in many cases it's not
    enough, as registering a filter for one type will ruin the fast
    path for all others. This is observed in production servers I've looked
    at, but also in laptops, where registration is done during init by
    systemd or something else.

    Add a per-socket fast path check guarding against such overhead. This
    affects both the receive and transmit paths of TCP, UDP and other
    protocols. It showed a ~1% tx/s improvement in small-payload UDP
    send benchmarks using a real NIC in a server environment, and the
    number jumps to 2-3% for preemptible kernels.
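
    A hedged sketch of such a per-socket check (close to, but not necessarily
    verbatim, the upstream helper; the empty-array singleton means "nothing
    attached for this type"):

        static inline bool
        cgroup_bpf_sock_enabled(struct sock *sk, enum cgroup_bpf_attach_type type)
        {
        	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
        	struct bpf_prog_array *array;

        	array = rcu_access_pointer(cgrp->bpf.effective[type]);
        	return array != &bpf_empty_prog_array.hdr;
        }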

    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/r/d8c58857113185a764927a46f4b5a058d36d3ec3.1643292455.git.asml.silence@gmail.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:44 +02:00
Jerome Marchand f6f7a1bd40 bpf: Add cgroup helpers bpf_{get,set}_retval to get/set syscall return value
Bugzilla: https://bugzilla.redhat.com/2120966

commit b44123b4a3dcad4664d3a0f72c011ffd4c9c4d93
Author: YiFei Zhu <zhuyifei@google.com>
Date:   Thu Dec 16 02:04:27 2021 +0000

    bpf: Add cgroup helpers bpf_{get,set}_retval to get/set syscall return value

    The helpers continue to use int for retval because all the hooks
    are int-returning rather than long-returning. The return value of
    bpf_set_retval is int for future-proofing, in case in the future
    there may be errors trying to set the retval.

    After the previous patch, if a program rejects a syscall by
    returning 0, an -EPERM will be generated no matter whether the retval
    is already set to -err. This patch changes that: -EPERM is forced only
    if the retval is not already -err. This is because we want to support,
    for example, invoking bpf_set_retval(-EINVAL) and returning 0, and have
    the syscall return value be -EINVAL, not -EPERM.
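
    That supported pattern, as a hedged BPF-side example:

        #include <errno.h>
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("cgroup/setsockopt")
        int reject_with_einval(struct bpf_sockopt *ctx)
        {
        	bpf_set_retval(-EINVAL);
        	return 0;	/* reject; syscall fails with -EINVAL, not -EPERM */
        }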

    For BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY, the prior behavior is
    that, if the return value is NET_XMIT_DROP, the packet is silently
    dropped. We preserve this behavior for backward compatibility
    reasons, so even if an errno is set, the errno does not return to
    the caller. However, setting a non-err retval cannot propagate, so
    this is not allowed and we return -EFAULT in that case.

    Signed-off-by: YiFei Zhu <zhuyifei@google.com>
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/b4013fd5d16bed0b01977c1fafdeae12e1de61fb.1639619851.git.zhuyifei@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:40 +02:00
Jerome Marchand 137f7ac6b3 bpf: Move getsockopt retval to struct bpf_cg_run_ctx
Bugzilla: https://bugzilla.redhat.com/2120966

commit c4dcfdd406aa2167396ac215e351e5e4dfd7efe3
Author: YiFei Zhu <zhuyifei@google.com>
Date:   Thu Dec 16 02:04:26 2021 +0000

    bpf: Move getsockopt retval to struct bpf_cg_run_ctx

    The retval value is moved to struct bpf_cg_run_ctx for ease of access
    in different prog types with different context struct layouts. The
    helper implementation (to be added in a later patch in the series) can
    simply perform a container_of from current->bpf_ctx to retrieve
    bpf_cg_run_ctx.

    Unfortunately, there is no easy way to access the current task_struct
    via the verifier BPF bytecode rewrite, aside from possibly calling a
    helper, so a pointer to current task is added to struct bpf_sockopt_kern
    so that the rewritten BPF bytecode can access struct bpf_cg_run_ctx with
    an indirection.

    For backward compatibility, if a getsockopt program rejects a syscall
    by returning 0, an -EPERM will be generated, by having the
    BPF_PROG_RUN_ARRAY_CG family macros automatically set the retval to
    -EPERM. Unlike prior to this patch, this -EPERM will be visible to
    ctx->retval for any other hooks down the line in the prog array.

    Additionally, the restriction that getsockopt filters can only set
    the retval to 0 is removed, considering that certain getsockopt
    implementations may return optlen. Filters are now able to set the
    value arbitrarily.

    Signed-off-by: YiFei Zhu <zhuyifei@google.com>
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/73b0325f5c29912ccea7ea57ec1ed4d388fc1d37.1639619851.git.zhuyifei@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:40 +02:00
Jerome Marchand 8f26dabbcc bpf: Make BPF_PROG_RUN_ARRAY return -err instead of allow boolean
Bugzilla: https://bugzilla.redhat.com/2120966

commit f10d059661968b01ef61a8b516775f95a18ab8ae
Author: YiFei Zhu <zhuyifei@google.com>
Date:   Thu Dec 16 02:04:25 2021 +0000

    bpf: Make BPF_PROG_RUN_ARRAY return -err instead of allow boolean

    Right now BPF_PROG_RUN_ARRAY and related macros return 1 or 0
    for whether the prog array allows or rejects whatever is being
    hooked. The callers of these macros then return -EPERM or continue
    processing based on the macro's return value. Unfortunately this is
    inflexible, since -EPERM is the only err that can be returned.

    This patch should be a no-op; it prepares for the next patch. The
    returning of the -EPERM is moved inside the macros, so the outer
    functions directly return what the macros returned whenever it is
    non-zero.
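
    In hedged before/after form, a typical caller changes roughly like so:

        /* before: boolean allow/deny, err hardcoded at the call site */
        allow = BPF_PROG_RUN_ARRAY_CG(cgrp->bpf.effective[atype], ctx, run);
        return allow ? 0 : -EPERM;

        /* after: the macro yields 0 or -err (-EPERM for now) directly */
        return BPF_PROG_RUN_ARRAY_CG(cgrp->bpf.effective[atype], ctx, run);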

    Signed-off-by: YiFei Zhu <zhuyifei@google.com>
    Reviewed-by: Stanislav Fomichev <sdf@google.com>
    Link: https://lore.kernel.org/r/788abcdca55886d1f43274c918eaa9f792a9f33b.1639619851.git.zhuyifei@google.com
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:40 +02:00
Artem Savkov 7f76bfc54f bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem.
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 216e3cd2f28dbbf1fe86848e0e29e6693b9f0a20
Author: Hao Luo <haoluo@google.com>
Date:   Thu Dec 16 16:31:51 2021 -0800

    bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem.

    Some helper functions may modify their arguments, for example,
    bpf_d_path, bpf_get_stack etc. Previously, their argument types
    were marked as ARG_PTR_TO_MEM, which is compatible with read-only
    mem types, such as PTR_TO_RDONLY_BUF. Therefore it's legitimate,
    but technically incorrect, to modify read-only memory by passing
    it into one of these helper functions.

    This patch tags the bpf_args compatible with immutable memory with
    the MEM_RDONLY flag. The arguments that don't have this flag will only
    be compatible with mutable memory types, preventing the helper
    from modifying read-only memory. The bpf_args that have
    MEM_RDONLY are compatible with both mutable and immutable
    memory.
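
    Conceptually, in a helper's func_proto (a hedged fragment, not a real
    proto definition):

        /* an arg the helper only reads may now be tagged: */
        .arg2_type = ARG_PTR_TO_MEM | MEM_RDONLY,

        /* an arg the helper may write into (bpf_d_path, bpf_get_stack)
         * keeps the plain, mutable-only type: */
        .arg1_type = ARG_PTR_TO_MEM,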

    Signed-off-by: Hao Luo <haoluo@google.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211217003152.48334-9-haoluo@google.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:50 +02:00
Yauheni Kaliuta fe6bfd72a8 cgroup: bpf: Move wrapper for __cgroup_bpf_*() to kernel/bpf/cgroup.c
Bugzilla: http://bugzilla.redhat.com/2069045

commit 588e5d8766486e52ee332a4bb097b016a355b465
Author: He Fengqing <hefengqing@huawei.com>
Date:   Fri Oct 29 02:39:06 2021 +0000

    cgroup: bpf: Move wrapper for __cgroup_bpf_*() to kernel/bpf/cgroup.c
    
    In commit 324bda9e6c5a ("bpf: multi program support for cgroup+bpf")
    the cgroup_bpf_*() wrappers were called from kernel/bpf/syscall.c, but now
    they are only used in kernel/bpf/cgroup.c, so move these functions to
    kernel/bpf/cgroup.c, like cgroup_bpf_replace().
    
    Signed-off-by: He Fengqing <hefengqing@huawei.com>
    Signed-off-by: Tejun Heo <tj@kernel.org>

Signed-off-by: Yauheni Kaliuta <ykaliuta@redhat.com>
2022-06-03 17:23:49 +03:00
Jiri Benc 73a5352706 bpf: Add support for {set|get} socket options from setsockopt BPF
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071618

commit 2c531639deb5e3ddfd6e8123b82052b2d9fbc6e5
Author: Prankur Gupta <prankgup@fb.com>
Date:   Tue Aug 17 15:42:20 2021 -0700

    bpf: Add support for {set|get} socket options from setsockopt BPF

    Add logic to call bpf_setsockopt() and bpf_getsockopt() from setsockopt BPF
    programs. An example use case: when the user sets the IPV6_TCLASS socket
    option, we would also like to change the tcp-cc for that socket.

    We don't have any use case for calling bpf_setsockopt() from supposedly read-
    only sys_getsockopt(), so it is made available to BPF_CGROUP_SETSOCKOPT only
    at this point.
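
    A hedged sketch of that use case (mirrors the pattern in the selftests;
    the SOL_*/IPV6_*/TCP_* header constants are assumed available):

        SEC("cgroup/setsockopt")
        int tclass_to_cc(struct bpf_sockopt *ctx)
        {
        	char cc[] = "reno";	/* illustrative congestion control */

        	if (ctx->level == SOL_IPV6 && ctx->optname == IPV6_TCLASS)
        		/* now callable from a setsockopt program */
        		bpf_setsockopt(ctx->sk, SOL_TCP, TCP_CONGESTION,
        			       cc, sizeof(cc));
        	return 1;	/* let the original setsockopt proceed */
        }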

    Signed-off-by: Prankur Gupta <prankgup@fb.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Song Liu <songliubraving@fb.com>
    Link: https://lore.kernel.org/bpf/20210817224221.3257826-2-prankgup@fb.com

Signed-off-by: Jiri Benc <jbenc@redhat.com>
2022-05-12 17:29:46 +02:00
Jerome Marchand c6feb8361c bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs
Bugzilla: https://bugzilla.redhat.com/2041365

Conflicts: Minor context change from missing commit eb18b49ea758 ("bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt")

commit 5e0bc3082e2e403ac0753e099c2b01446bb35578
Author: Dmitrii Banshchikov <me@ubique.spb.ru>
Date:   Sat Nov 13 18:22:26 2021 +0400

    bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs

    Use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in tracing
    progs may result in locking issues.

    bpf_ktime_get_coarse_ns() uses the ktime_get_coarse_ns() time accessor,
    which isn't safe in any context:
    ======================================================
    WARNING: possible circular locking dependency detected
    5.15.0-syzkaller #0 Not tainted
    ------------------------------------------------------
    syz-executor.4/14877 is trying to acquire lock:
    ffffffff8cb30008 (tk_core.seq.seqcount){----}-{0:0}, at: ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255

    but task is already holding lock:
    ffffffff90dbf200 (&obj_hash[i].lock){-.-.}-{2:2}, at: debug_object_deactivate+0x61/0x400 lib/debugobjects.c:735

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&obj_hash[i].lock){-.-.}-{2:2}:
           lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
           __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
           _raw_spin_lock_irqsave+0xd1/0x120 kernel/locking/spinlock.c:162
           __debug_object_init+0xd9/0x1860 lib/debugobjects.c:569
           debug_hrtimer_init kernel/time/hrtimer.c:414 [inline]
           debug_init kernel/time/hrtimer.c:468 [inline]
           hrtimer_init+0x20/0x40 kernel/time/hrtimer.c:1592
           ntp_init_cmos_sync kernel/time/ntp.c:676 [inline]
           ntp_init+0xa1/0xad kernel/time/ntp.c:1095
           timekeeping_init+0x512/0x6bf kernel/time/timekeeping.c:1639
           start_kernel+0x267/0x56e init/main.c:1030
           secondary_startup_64_no_verify+0xb1/0xbb

    -> #0 (tk_core.seq.seqcount){----}-{0:0}:
           check_prev_add kernel/locking/lockdep.c:3051 [inline]
           check_prevs_add kernel/locking/lockdep.c:3174 [inline]
           validate_chain+0x1dfb/0x8240 kernel/locking/lockdep.c:3789
           __lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5015
           lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5625
           seqcount_lockdep_reader_access+0xfe/0x230 include/linux/seqlock.h:103
           ktime_get_coarse_ts64+0x25/0x110 kernel/time/timekeeping.c:2255
           ktime_get_coarse include/linux/timekeeping.h:120 [inline]
           ktime_get_coarse_ns include/linux/timekeeping.h:126 [inline]
           ____bpf_ktime_get_coarse_ns kernel/bpf/helpers.c:173 [inline]
           bpf_ktime_get_coarse_ns+0x7e/0x130 kernel/bpf/helpers.c:171
           bpf_prog_a99735ebafdda2f1+0x10/0xb50
           bpf_dispatcher_nop_func include/linux/bpf.h:721 [inline]
           __bpf_prog_run include/linux/filter.h:626 [inline]
           bpf_prog_run include/linux/filter.h:633 [inline]
           BPF_PROG_RUN_ARRAY include/linux/bpf.h:1294 [inline]
           trace_call_bpf+0x2cf/0x5d0 kernel/trace/bpf_trace.c:127
           perf_trace_run_bpf_submit+0x7b/0x1d0 kernel/events/core.c:9708
           perf_trace_lock+0x37c/0x440 include/trace/events/lock.h:39
           trace_lock_release+0x128/0x150 include/trace/events/lock.h:58
           lock_release+0x82/0x810 kernel/locking/lockdep.c:5636
           __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:149 [inline]
           _raw_spin_unlock_irqrestore+0x75/0x130 kernel/locking/spinlock.c:194
           debug_hrtimer_deactivate kernel/time/hrtimer.c:425 [inline]
           debug_deactivate kernel/time/hrtimer.c:481 [inline]
           __run_hrtimer kernel/time/hrtimer.c:1653 [inline]
           __hrtimer_run_queues+0x2f9/0xa60 kernel/time/hrtimer.c:1749
           hrtimer_interrupt+0x3b3/0x1040 kernel/time/hrtimer.c:1811
           local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1086 [inline]
           __sysvec_apic_timer_interrupt+0xf9/0x270 arch/x86/kernel/apic/apic.c:1103
           sysvec_apic_timer_interrupt+0x8c/0xb0 arch/x86/kernel/apic/apic.c:1097
           asm_sysvec_apic_timer_interrupt+0x12/0x20
           __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
           _raw_spin_unlock_irqrestore+0xd4/0x130 kernel/locking/spinlock.c:194
           try_to_wake_up+0x702/0xd20 kernel/sched/core.c:4118
           wake_up_process kernel/sched/core.c:4200 [inline]
           wake_up_q+0x9a/0xf0 kernel/sched/core.c:953
           futex_wake+0x50f/0x5b0 kernel/futex/waitwake.c:184
           do_futex+0x367/0x560 kernel/futex/syscalls.c:127
           __do_sys_futex kernel/futex/syscalls.c:199 [inline]
           __se_sys_futex+0x401/0x4b0 kernel/futex/syscalls.c:180
           do_syscall_x64 arch/x86/entry/common.c:50 [inline]
           do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
           entry_SYSCALL_64_after_hwframe+0x44/0xae

    There is a possible deadlock with bpf_timer_* set of helpers:
    hrtimer_start()
      lock_base();
      trace_hrtimer...()
        perf_event()
          bpf_run()
            bpf_timer_start()
              hrtimer_start()
                lock_base()         <- DEADLOCK

    Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_* helpers in
    BPF_PROG_TYPE_KPROBE, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_PERF_EVENT
    and BPF_PROG_TYPE_RAW_TRACEPOINT prog types.
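
    As a hedged sketch only (not the verbatim upstream diff): the natural
    place to enforce such a restriction is the prog type's func_proto
    callback, since the verifier rejects any program whose helper-proto
    lookup returns NULL. The callback name below is illustrative.

      static const struct bpf_func_proto *
      tracing_prog_func_proto(enum bpf_func_id func_id,
                              const struct bpf_prog *prog)
      {
              switch (func_id) {
              case BPF_FUNC_ktime_get_coarse_ns:
              case BPF_FUNC_timer_init:
              case BPF_FUNC_timer_set_callback:
              case BPF_FUNC_timer_start:
              case BPF_FUNC_timer_cancel:
                      return NULL;    /* helper forbidden in tracing progs */
              default:
                      return bpf_base_func_proto(func_id);
              }
      }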

    Fixes: d055126180 ("bpf: Add bpf_ktime_get_coarse_ns helper")
    Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
    Reported-by: syzbot+43fd005b5a1b4d10781e@syzkaller.appspotmail.com
    Signed-off-by: Dmitrii Banshchikov <me@ubique.spb.ru>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20211113142227.566439-2-me@ubique.spb.ru

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:17:16 +02:00
Jerome Marchand 851147cdc5 bpf: Migrate cgroup_bpf to internal cgroup_bpf_attach_type enum
Bugzilla: http://bugzilla.redhat.com/2041365

commit 6fc88c354f3af83ffa2c285b86e76c759755693f
Author: Dave Marchevsky <davemarchevsky@fb.com>
Date:   Thu Aug 19 02:24:20 2021 -0700

    bpf: Migrate cgroup_bpf to internal cgroup_bpf_attach_type enum

    Add an enum (cgroup_bpf_attach_type) containing only valid cgroup_bpf
    attach types and a function to map bpf_attach_type values to the new
    enum. Inspired by netns_bpf_attach_type.
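
    A minimal sketch of the shape (abridged; the real enum covers every
    valid cgroup attach point, and the mapping returns an invalid marker
    for everything else):

      enum cgroup_bpf_attach_type {
              CGROUP_BPF_ATTACH_TYPE_INVALID = -1,
              CGROUP_INET_INGRESS = 0,
              CGROUP_INET_EGRESS,
              /* ... one entry per valid cgroup attach point ... */
              MAX_CGROUP_BPF_ATTACH_TYPE
      };

      static inline enum cgroup_bpf_attach_type
      to_cgroup_bpf_attach_type(enum bpf_attach_type attach_type)
      {
              switch (attach_type) {
              case BPF_CGROUP_INET_INGRESS:
                      return CGROUP_INET_INGRESS;
              case BPF_CGROUP_INET_EGRESS:
                      return CGROUP_INET_EGRESS;
              /* ... */
              default:
                      return CGROUP_BPF_ATTACH_TYPE_INVALID;
              }
      }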

    Then, migrate cgroup_bpf to use cgroup_bpf_attach_type wherever
    possible.  Functionality is unchanged as attach_type_to_prog_type
    switches in bpf/syscall.c were preventing non-cgroup programs from
    making use of the invalid cgroup_bpf array slots.

    As a result, struct cgroup_bpf uses 504 fewer bytes than when its
    arrays were sized using MAX_BPF_ATTACH_TYPE.

    bpf_cgroup_storage is notably not migrated as struct
    bpf_cgroup_storage_key is part of uapi and contains a bpf_attach_type
    member which is not meant to be opaque. Similarly, bpf_cgroup_link
    continues to report its bpf_attach_type member to userspace via fdinfo
    and bpf_link_info.

    To ease disambiguation, bpf_attach_type variables are renamed from
    'type' to 'atype' when changed to cgroup_bpf_attach_type.

    Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20210819092420.1984861-2-davemarchevsky@fb.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:44 +02:00
Jerome Marchand 5457a99f85 bpf: Remove redundant initialization of variable allow
Bugzilla: http://bugzilla.redhat.com/2041365

commit 8cacfc85b615cc0bae01241593c4b25da6570efc
Author: Colin Ian King <colin.king@intel.com>
Date:   Tue Aug 17 18:08:42 2021 +0100

    bpf: Remove redundant initialization of variable allow

    The variable allow is initialized with a value that is never read;
    it is updated later on. The initialization is therefore redundant and
    can be removed.

    Addresses-Coverity: ("Unused value")

    Signed-off-by: Colin Ian King <colin.king@canonical.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20210817170842.495440-1-colin.king@canonical.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:43 +02:00
Jerome Marchand 88da8360b2 bpf: Refactor BPF_PROG_RUN_ARRAY family of macros into functions
Bugzilla: http://bugzilla.redhat.com/2041365

commit 7d08c2c9117113fee118487425ed55efa50cbfa9
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Sun Aug 15 00:05:55 2021 -0700

    bpf: Refactor BPF_PROG_RUN_ARRAY family of macros into functions

    Similar to BPF_PROG_RUN, turn BPF_PROG_RUN_ARRAY macros into proper functions
    with all the same readability and maintainability benefits. Making them into
    functions required shuffling around bpf_set_run_ctx/bpf_reset_run_ctx
    functions. Also, explicitly specifying the type of the BPF prog run callback
    required adjusting __bpf_prog_run_save_cb() to accept const void *, cast
    internally to const struct sk_buff *.
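
    Sketch of the resulting generic function (abridged: the real version
    also installs a struct bpf_run_ctx and handles a NULL array):

      typedef u32 (*bpf_prog_run_fn)(const struct bpf_prog *prog,
                                     const void *ctx);

      static __always_inline u32
      BPF_PROG_RUN_ARRAY(const struct bpf_prog_array __rcu *array_rcu,
                         const void *ctx, bpf_prog_run_fn run_prog)
      {
              const struct bpf_prog_array_item *item;
              const struct bpf_prog *prog;
              u32 ret = 1;

              migrate_disable();
              rcu_read_lock();
              item = &rcu_dereference(array_rcu)->items[0];
              while ((prog = READ_ONCE(item->prog))) {
                      ret &= run_prog(prog, ctx);
                      item++;
              }
              rcu_read_unlock();
              migrate_enable();
              return ret;
      }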

    Further, split out a cgroup-specific BPF_PROG_RUN_ARRAY_CG and
    BPF_PROG_RUN_ARRAY_CG_FLAGS from the more generic BPF_PROG_RUN_ARRAY due to
    the differences in bpf_run_ctx used for those two different use cases.

    I think BPF_PROG_RUN_ARRAY_CG would benefit from further refactoring to accept
    struct cgroup and enum bpf_attach_type instead of bpf_prog_array, fetching
    cgrp->bpf.effective[type] and RCU-dereferencing it internally. But that
    would require including include/linux/cgroup-defs.h, which I wasn't sure
    is OK with everyone.

    The remaining generic BPF_PROG_RUN_ARRAY function will be extended to
    pass-through user-provided context value in the next patch.

    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20210815070609.987780-3-andrii@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:40 +02:00
Jerome Marchand 5b5312cf91 bpf: Refactor BPF_PROG_RUN into a function
Bugzilla: http://bugzilla.redhat.com/2041365

Conflicts: Missing commit 879af96ffd72 ("net, core: Add support for XDP redirection to slave device")

commit fb7dd8bca0139fd73d3f4a6cd257b11731317ded
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Sun Aug 15 00:05:54 2021 -0700

    bpf: Refactor BPF_PROG_RUN into a function

    Turn BPF_PROG_RUN into a proper always-inlined function. No functional
    or performance changes are intended, but it makes it much easier to
    understand how BPF programs actually get executed, and it's more obvious
    what types and callbacks are expected. Also, the extra () around input
    parameters can be dropped, as well as the `__` variable prefixes used to
    avoid naming collisions, which makes the code simpler to read and write.

    This refactoring also highlighted one extra issue. BPF_PROG_RUN is both
    a macro and an enum value (BPF_PROG_RUN == BPF_PROG_TEST_RUN). Turning
    BPF_PROG_RUN into a function causes a naming-conflict compilation error.
    So rename BPF_PROG_RUN to the lower-case bpf_prog_run(), similar to
    bpf_prog_run_xdp(), bpf_prog_run_pin_on_cpu(), etc. All existing callers of
    BPF_PROG_RUN, the macro, are switched to bpf_prog_run() explicitly.
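
    The new function is essentially a thin always-inlined wrapper; a
    sketch of its shape:

      static __always_inline u32 bpf_prog_run(const struct bpf_prog *prog,
                                              const void *ctx)
      {
              /* stats and dispatcher handling live in __bpf_prog_run() */
              return __bpf_prog_run(prog, ctx, bpf_dispatcher_nop_func);
      }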

    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Link: https://lore.kernel.org/bpf/20210815070609.987780-2-andrii@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:40 +02:00
Jerome Marchand 7c529c7879 bpf: Allow bpf_get_netns_cookie in BPF_PROG_TYPE_CGROUP_SOCKOPT
Bugzilla: http://bugzilla.redhat.com/2041365

commit f1248dee954c2ddb0ece47a13591e5d55d422d22
Author: Stanislav Fomichev <sdf@google.com>
Date:   Fri Aug 13 16:05:29 2021 -0700

    bpf: Allow bpf_get_netns_cookie in BPF_PROG_TYPE_CGROUP_SOCKOPT

    This is similar to existing BPF_PROG_TYPE_CGROUP_SOCK
    and BPF_PROG_TYPE_CGROUP_SOCK_ADDR.
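
    The mechanical shape of such a change is a one-case addition to the
    prog type's func_proto switch; a hedged sketch, with the proto name
    following the pattern of the existing sock/sock_addr variants:

      case BPF_FUNC_get_netns_cookie:
              return &bpf_get_netns_cookie_sockopt_proto;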

    Signed-off-by: Stanislav Fomichev <sdf@google.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Martin KaFai Lau <kafai@fb.com>
    Link: https://lore.kernel.org/bpf/20210813230530.333779-2-sdf@google.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-04-29 18:14:40 +02:00
David S. Miller b8af417e4d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2021-02-16

The following pull-request contains BPF updates for your *net-next* tree.

There's a small merge conflict between 7eeba1706e ("tcp: Add receive timestamp
support for receive zerocopy.") from net-next tree and 9cacf81f81 ("bpf: Remove
extra lock_sock for TCP_ZEROCOPY_RECEIVE") from bpf-next tree. Resolve as follows:

  [...]
                lock_sock(sk);
                err = tcp_zerocopy_receive(sk, &zc, &tss);
                err = BPF_CGROUP_RUN_PROG_GETSOCKOPT_KERN(sk, level, optname,
                                                          &zc, &len, err);
                release_sock(sk);
  [...]

We've added 116 non-merge commits during the last 27 day(s) which contain
a total of 156 files changed, 5662 insertions(+), 1489 deletions(-).

The main changes are:

1) Add support for pointers to types with known size among global function
   args to overcome the limit on the max # of allowed args, from Dmitrii Banshchikov.

2) Add bpf_iter for task_vma which can be used to generate information similar
   to /proc/pid/maps, from Song Liu.

3) Enable bpf_{g,s}etsockopt() from all sock_addr related program hooks. Allow
   rewriting bind user ports from BPF side below the ip_unprivileged_port_start
   range, both from Stanislav Fomichev.

4) Prevent recursion on fentry/fexit & sleepable programs and allow map-in-map
   as well as per-cpu maps for the latter, from Alexei Starovoitov.

5) Add selftest script to run BPF CI locally. Also enable BPF ringbuffer
   for sleepable programs, both from KP Singh.

6) Extend verifier to enable variable offset read/write access to the BPF
   program stack, from Andrei Matei.

7) Improve tc & XDP MTU handling and add a new bpf_check_mtu() helper to
   query device MTU from programs, from Jesper Dangaard Brouer.

8) Allow the bpf_get_socket_cookie() helper to also be called from [sleepable]
   BPF tracing programs, from Florent Revest.

9) Extend x86 JIT to pad JMPs with NOPs to help the image converge when
   otherwise too many passes are required, from Gary Lin.

10) Verifier fixes on atomics with BPF_FETCH as well as function-by-function
    verification both related to zero-extension handling, from Ilya Leoshkevich.

11) Better kernel build integration of resolve_btfids tool, from Jiri Olsa.

12) Batch of AF_XDP selftest cleanups and small performance improvement
    for libbpf's xsk map redirect for newer kernels, from Björn Töpel.

13) Follow-up BPF doc and verifier improvements around atomics with
    BPF_FETCH, from Brendan Jackman.

14) Permit zero-sized data sections e.g. if ELF .rodata section contains
    read-only data from local variables, from Yonghong Song.

15) veth driver skb bulk-allocation for ndo_xdp_xmit, from Lorenzo Bianconi.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-16 13:14:06 -08:00
Stanislav Fomichev 772412176f bpf: Allow rewriting to ports under ip_unprivileged_port_start
At the moment, BPF_CGROUP_INET{4,6}_BIND hooks can rewrite user_port
to a privileged one (< ip_unprivileged_port_start), but the bind will
be rejected later on in __inet_bind or __inet6_bind.

Let's add another return value to indicate that the CAP_NET_BIND_SERVICE
check should be ignored. Use the same idea as we currently use in
cgroup/egress, where bit #1 indicates CN; for cgroup/bind{4,6}, bit #1
instead indicates that CAP_NET_BIND_SERVICE should be bypassed.
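
A hedged sketch of the consumer side in __inet_bind(), per the
description above (flag name as introduced by this change; surrounding
error handling abridged):

  if (snum && inet_port_requires_bind_service(net, snum) &&
      !(flags & BIND_NO_CAP_NET_BIND_SERVICE) &&
      !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
          goto out;       /* EACCES, as before */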

v5:
- rename flags to be less confusing (Andrey Ignatov)
- rework BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY to work on flags
  and accept BPF_RET_SET_CN (no behavioral changes)

v4:
- Add missing IPv6 support (Martin KaFai Lau)

v3:
- Update description (Martin KaFai Lau)
- Fix capability restore in selftest (Martin KaFai Lau)

v2:
- Switch to explicit return code (Martin KaFai Lau)

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/20210127193140.3170382-1-sdf@google.com
2021-01-27 18:18:15 -08:00
Loris Reiff f4a2da755a bpf, cgroup: Fix problematic bounds check
Since ctx.optlen is signed, a value larger than max_value could be
passed; it is later used as unsigned, which triggers a WARN_ON_ONCE
in copy_to_user.
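
Sketched fix, per the description above: validate the signed value
before it is reused as an unsigned length.

  if (ctx.optlen > max_optlen || ctx.optlen < 0) {
          ret = -EFAULT;
          goto out;
  }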

Fixes: 0d01da6afc ("bpf: implement getsockopt and setsockopt hooks")
Signed-off-by: Loris Reiff <loris.reiff@liblor.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20210122164232.61770-2-loris.reiff@liblor.ch
2021-01-22 23:11:47 +01:00
Loris Reiff bb8b81e396 bpf, cgroup: Fix optlen WARN_ON_ONCE toctou
A toctou issue in `__cgroup_bpf_run_filter_getsockopt` can trigger a
WARN_ON_ONCE in a check of `copy_from_user`.

`*optlen` is checked to be non-negative in the individual getsockopt
functions beforehand. Changing `*optlen` in a race to a negative value
will result in a `copy_from_user(ctx.optval, optval, ctx.optlen)` with
`ctx.optlen` being a negative integer.
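
A sketch of the hardened pattern, per the description above: snapshot
the value from userspace once and validate the local copy, so a racing
write cannot slip a negative length between check and use.

  if (get_user(ctx.optlen, optlen))
          return -EFAULT;
  if (ctx.optlen < 0)
          return -EFAULT;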

Fixes: 0d01da6afc ("bpf: implement getsockopt and setsockopt hooks")
Signed-off-by: Loris Reiff <loris.reiff@liblor.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20210122164232.61770-1-loris.reiff@liblor.ch
2021-01-22 23:11:34 +01:00
Stanislav Fomichev a9ed15dae0 bpf: Split cgroup_bpf_enabled per attach type
When we attach any cgroup hook, the rest (even if unused/unattached)
start to contribute a small overhead. In particular, the one we want
to avoid is __cgroup_bpf_run_filter_skb, which does two redirections
to get to the cgroup and pushes/pulls the skb.

Let's split cgroup_bpf_enabled to be per attach type, to make sure
only attach types that are in use trigger.
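
The shape of the split, sketched: one static key per attach type, so
only hooks that are actually attached pay for the branch.

  extern struct static_key_false cgroup_bpf_enabled_key[MAX_BPF_ATTACH_TYPE];
  #define cgroup_bpf_enabled(type) \
          static_branch_unlikely(&cgroup_bpf_enabled_key[type])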

I've dropped some existing high-level cgroup_bpf_enabled checks in
some places because the BPF_PROG_CGROUP_XXX_RUN macros usually have
another cgroup_bpf_enabled check.

I also had to copy-paste BPF_CGROUP_RUN_SA_PROG_LOCK for
GETPEERNAME/GETSOCKNAME because the type for cgroup_bpf_enabled[type]
has to be constant and known at compile time.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210115163501.805133-4-sdf@google.com
2021-01-20 14:23:00 -08:00
Stanislav Fomichev 20f2505fb4 bpf: Try to avoid kzalloc in cgroup/{s,g}etsockopt
When we attach a bpf program to cgroup/getsockopt any other getsockopt()
syscall starts incurring kzalloc/kfree cost.

Let's add a small buffer on the stack and use it for small (the
majority of) {s,g}etsockopt values. The buffer is small enough to fit
into a cache line and covers the majority of simple options (most of
them are 4-byte ints).

It seems natural to do the same for setsockopt, but it's a bit more
involved when the BPF program modifies the data (where we have to
kmalloc). The assumption is that for the majority of setsockopt
calls (which set pure BPF options or apply policy) this
will bring some benefit as well.
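
A hedged sketch of the allocation path (buffer size and struct name as
in this change; the helper signature here is simplified):

  #define BPF_SOCKOPT_KERN_BUF_SIZE 32
  struct bpf_sockopt_buf {
          u8 data[BPF_SOCKOPT_KERN_BUF_SIZE];
  };

  static void *sockopt_alloc_buf(int max_optlen, struct bpf_sockopt_buf *buf)
  {
          if (max_optlen < 0)
                  return NULL;
          if (max_optlen <= (int)sizeof(buf->data))
                  return buf->data;       /* common case: stays on the stack */
          return kzalloc(max_optlen, GFP_USER);
  }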

Without this patch (we remove about 1% __kmalloc):
     3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
            |
             --3.30%--__cgroup_bpf_run_filter_getsockopt
                       |
                        --0.81%--__kmalloc

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210115163501.805133-3-sdf@google.com
2021-01-20 14:23:00 -08:00
Stanislav Fomichev 9cacf81f81 bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE
Add a custom implementation of the getsockopt hook for TCP_ZEROCOPY_RECEIVE.
We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
call in do_tcp_getsockopt using the on-stack data. This removes
3% overhead for locking/unlocking the socket.

Without this patch:
     3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
            |
             --3.30%--__cgroup_bpf_run_filter_getsockopt
                       |
                        --0.81%--__kmalloc

With the patch applied:
     0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern

Note, exporting uapi/tcp.h requires removing netinet/tcp.h
from test_progs.h because those headers have conflicting
definitions.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
2021-01-20 14:23:00 -08:00
Stanislav Fomichev 4be34f3d07 bpf: Don't leak memory in bpf getsockopt when optlen == 0
optlen == 0 indicates that the kernel should ignore the BPF buffer
and use the original one from the user. We, however, forget to free
the temporary buffer that we've allocated for BPF.
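
Sketch of the fix: the optlen == 0 early return must leave through the
common cleanup path so the temporary buffer is freed.

  if (ctx.optlen == 0) {
          ret = 0;
          goto out;       /* 'out' frees the temporary BPF buffer */
  }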

Fixes: d8fe449a9c ("bpf: Don't return EINVAL from {get,set}sockopt when optlen > PAGE_SIZE")
Reported-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210112162829.775079-1-sdf@google.com
2021-01-12 21:05:07 +01:00
Linus Torvalds f56e65dff6 Merge branch 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull initial set_fs() removal from Al Viro:
 "Christoph's set_fs base series + fixups"

* 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Allow a NULL pos pointer to __kernel_read
  fs: Allow a NULL pos pointer to __kernel_write
  powerpc: remove address space overrides using set_fs()
  powerpc: use non-set_fs based maccess routines
  x86: remove address space overrides using set_fs()
  x86: make TASK_SIZE_MAX usable from assembly code
  x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h
  lkdtm: remove set_fs-based tests
  test_bitmap: remove user bitmap tests
  uaccess: add infrastructure for kernel builds with set_fs()
  fs: don't allow splice read/write without explicit ops
  fs: don't allow kernel reads and writes without iter ops
  sysctl: Convert to iter interfaces
  proc: add a read_iter method to proc proc_ops
  proc: cleanup the compat vs no compat file ops
  proc: remove a level of indentation in proc_get_inode
2020-10-22 09:59:21 -07:00
Matthew Wilcox (Oracle) 4bd6a7353e sysctl: Convert to iter interfaces
Using the read_iter/write_iter interfaces allows in-kernel users
to set sysctls without using set_fs(). Also, the buffer is a string,
so give it the real type of 'char *', not void *.

[AV: Christoph's fixup folded in]

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-08 22:20:39 -04:00
Gustavo A. R. Silva df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and their variants
with the new pseudo-keyword macro fallthrough[1]. Also, remove
fall-through markings that are unnecessary.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
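
For illustration, the mechanical shape of the conversion (enum values
hypothetical):

  switch (mode) {
  case MODE_A:
          setup_a();
          fallthrough;    /* was: a fall-through comment */
  case MODE_B:
          setup_b();
          break;
  }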

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00