Commit Graph

76 Commits

Author SHA1 Message Date
Jerome Marchand 77edfcf609 bpf: Preserve param->string when parsing mount options
JIRA: https://issues.redhat.com/browse/RHEL-63880

commit 1f97c03f43fadc407de5b5cb01c07755053e1c22
Author: Hou Tao <houtao1@huawei.com>
Date:   Tue Oct 22 21:01:33 2024 +0800

    bpf: Preserve param->string when parsing mount options

    In bpf_parse_param(), keep the value of param->string intact so it can
    be freed later. Otherwise, the kmalloc area pointed to by param->string
    will be leaked as shown below:

    unreferenced object 0xffff888118c46d20 (size 8):
      comm "new_name", pid 12109, jiffies 4295580214
      hex dump (first 8 bytes):
        61 6e 79 00 38 c9 5c 7e                          any.8.\~
      backtrace (crc e1b7f876):
        [<00000000c6848ac7>] kmemleak_alloc+0x4b/0x80
        [<00000000de9f7d00>] __kmalloc_node_track_caller_noprof+0x36e/0x4a0
        [<000000003e29b886>] memdup_user+0x32/0xa0
        [<0000000007248326>] strndup_user+0x46/0x60
        [<0000000035b3dd29>] __x64_sys_fsconfig+0x368/0x3d0
        [<0000000018657927>] x64_sys_call+0xff/0x9f0
        [<00000000c0cabc95>] do_syscall_64+0x3b/0xc0
        [<000000002f331597>] entry_SYSCALL_64_after_hwframe+0x4b/0x53

    Fixes: 6c1752e0b6ca ("bpf: Support symbolic BPF FS delegation mount options")
    Signed-off-by: Hou Tao <houtao1@huawei.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Jiri Olsa <jolsa@kernel.org>
    Link: https://lore.kernel.org/bpf/20241022130133.3798232-1-houtao@huaweicloud.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2025-01-21 11:27:08 +01:00
Jerome Marchand 8517c3357f bpf: Simplify character output in seq_print_delegate_opts()
JIRA: https://issues.redhat.com/browse/RHEL-63880

commit f157f9cb85b49745754f8c301c59c6ca110e865f
Author: Markus Elfring <elfring@users.sourceforge.net>
Date:   Mon Jul 15 11:12:30 2024 +0200

    bpf: Simplify character output in seq_print_delegate_opts()

    Single characters should be put into a sequence.
    Thus use the corresponding function “seq_putc” for two selected calls.

    This issue was transformed by using the Coccinelle software.

    Suggested-by: Christophe Jaillet <christophe.jaillet@wanadoo.fr>
    Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/abde0992-3d71-44d2-ab27-75b382933a22@web.de

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2025-01-13 17:36:12 +01:00
Rado Vrbovsky c154c6dc53 Merge: fs: backport mnt_idmap type
MR: https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9/-/merge_requests/4324

JIRA: https://issues.redhat.com/browse/RHEL-33888

This MR backports idmapping changes to sync our RHEL-9 kernel with
upstream kernel version 6.3.

Our current kernel has idmapped mounts support, but there have been many
changes since this initial implementation in the base kernel. In
particular, we need the type safety changes, and we have seen difficulty
backporting other requested changes on more than one occasion.

The Jira this MR has been raised for is another example of such a request.

It is needed for a backport of a BPF feature to RHEL 9 which allows BPF
programs to do file verification with LSM and fsverity. To satisfy this
request, changes made in the upstream 6.3 kernel are needed, which is the
reason we have chosen upstream 6.3 as the target release for the MR.

The first fix has been omitted because it appears to be the same as
24b5308cf5ee ("selftests/filesystems: grant executable permission to
run_fat_tests.sh"). In any case the requirement is to make the path
tools/testing/selftests/filesystems/fat/run_fat_tests.sh executable which
is done.

The second and third omitted patches are a straight apply and revert, leaving
the source unchanged.

Omitted-Fix: 1d4beeb4edc7 ("selftests/filesystems: grant executable permission to run_fat_tests.sh")

Omitted-Fix: 4a47c6385bb4 ovl: turn of SB_POSIXACL with idmapped layers temporarily

Omitted-Fix: 7c4d37c269ac Revert "ovl: turn of SB_POSIXACL with idmapped layers temporarily"

Signed-off-by: Ian Kent <ikent@redhat.com>

Approved-by: Scott Mayhew <smayhew@redhat.com>
Approved-by: Chris von Recklinghausen <crecklin@redhat.com>
Approved-by: Xin Long <lxin@redhat.com>
Approved-by: CKI KWF Bot <cki-ci-bot+kwf-gitlab-com@redhat.com>

Merged-by: Rado Vrbovsky <rvrbovsk@redhat.com>
2024-11-11 08:26:30 +00:00
Ian Kent 2171c567b5 fs: port inode_init_owner() to mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: For consistency drop btrfs hunks because it isn't supported in
	CentOS Stream and other backports also drop such hunks.
	CentOS Stream does not have upstream commit 3db1de0e582c3 ("f2fs:
	change the current atomic write way") so there is no call to
	f2fs_get_tmpfile() in f2fs_ioc_start_atomic_write() to change.
	The above patch also adds the definition of f2fs_get_tmpfile()
	to fs/f2fs/f2fs.h so it's not there to change resulting in a
	hunk reject for fs/f2fs/f2fs.h.
	Upstream commit 787caf1bdcd9f ("f2fs: fix to enable compress for
	newly created file if extension matches") is not present in CentOS
	Stream resulting in a number of rejects against fs/f2fs/namei.c,
	manually apply these changes.
	Dropped hunks for ntfs3 because the source is not present in
	the CentOS Stream source tree.
	CentOS Stream commit 892da692fa ("shmem: support idmapped
	mounts for tmpfs") is present, which causes a reject in mm/shmem.c;
	manually apply the hunk (note: taking account of these changes at
	the times they are needed will result in an updated mm/shmem.c once
	this series is completed).
	Update to add incremental changes needed due to CentOS Stream
	commit 469e1d13f6 ("shmem: quota support").

commit f2d40141d5d90b882e2c35b226f9244a63b82b6e
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:25 2023 +0100

    fs: port inode_init_owner() to mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:26 +08:00
Ian Kent 304ec491ee fs: port ->permission() to pass mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: For consistency drop btrfs hunks because it isn't supported in
	CentOS Stream and other backports also drop such hunks.
	CentOS Stream commit 48fa94aacd ("ceph: fscrypt_auth handling
	for ceph") is present which causes fuzz 2 in hunk #1 in
	fs/ceph/super.h.
	Upstream commit 427505ffeaa46 ("exportfs: use pr_debug for
	unreachable debug statements") is not present causing fuzz 2
	in hunk #1 against fs/exportfs/expfs.c.
	Dropped hunks for ksmbd because the source is not present in the
	CentOS Stream source tree.
	Upstream commit 03fa86e9f79d8 ("namei: stash the sampled ->d_seq
	into nameidata") is not present causing a fuzz 1 for hunk #14
	against fs/namei.c.
	CentOS Stream commit c4f3dd0731 ("nfsd: handle failure to
	collect pre/post-op attrs more sanely") is present and causes
	rejects for hunks #4 and #5 against fs/nfsd/vfs.c, apply manually.
	Dropped hunks for ntfs3 because the source is not present in the
	CentOS Stream source tree.
	CentOS Stream commit 98ba731fc7 ("ovl: Move xattr support to
	new xattrs.c file") moves ovl_xattr_set() and ovl_xattr_get()
	from fs/overlayfs/inode.c to fs/overlayfs/xattrs.c which causes
	hunks #4 and #5 to fail, manually apply to fs/overlayfs/xattrs.c.
	CentOS Stream commit 55177e4b83 ("ovl: mark xwhiteouts directory
	with overlay.opaque='x'") and commit d17b324bb6 ("ovl: use
	ovl_numlower() and ovl_lowerstack() accessors") change the first
	and third hunks of fs/overlayfs/namei.c causing them to fail,
	manually apply.
	CentOS Stream commit 98ba731fc7 ("ovl: Move xattr support to
	new xattrs.c file") causes fuzz 2 in hunk #5 of
	fs/overlayfs/overlayfs.h
	CentOS Stream commit 355a9c490a ("ovl: Add an alternative
	type of whiteout") changes ovl_cache_update_ino() to
	ovl_cache_update() in fs/overlayfs/readdir.c, make the change
	manually.
	Upstream commit 217af7e2f4deb ("apparmor: refactor profile
	rules and attachments") is not in CentOS Stream causing hunk #1
	to fail to apply so manually apply the change.

commit 4609e1f18e19c3b302e1eb4858334bca1532f780
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:22 2023 +0100

    fs: port ->permission() to pass mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:20 +08:00
Ian Kent a7750be4f4 fs: port ->mkdir() to pass mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: For consistency drop btrfs hunks because it isn't supported in
	CentOS Stream and other backports also drop such hunks.
	The cifs source has been moved in CentOS Stream so manually
	apply rejected hunks to fs/smb/client/cifsfs.h and
	fs/smb/client/inode.c.
	Dropped hunks for ntfs3 because the source is not present in the
	CentOS Stream source tree.

commit c54bd91e9eaba43f09aadc25b52ea869ff3b5587
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:15 2023 +0100

    fs: port ->mkdir() to pass mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:00 +08:00
Ian Kent 5744ba0ee3 fs: port ->symlink() to pass mnt_idmap
JIRA: https://issues.redhat.com/browse/RHEL-33888
Status: Linus

Conflicts: The cifs source has been moved in CentOS Stream so manually
	apply rejected hunks to fs/smb/client/cifsfs.h and
	fs/smb/client/link.c.
	Dropped hunks for ntfs3 because the source is not present in the
	CentOS Stream source tree.
	CentOS Stream commit f0f830cd7e ("ceph: create symlinks with
	encrypted and base64-encoded targets") is present and resulted
	in fuzz against fs/ceph/dir.c.

commit 7a77db95511c39be4b2db2ceca152ef589adc2dc
Author: Christian Brauner <brauner@kernel.org>
Date:   Fri Jan 13 12:49:14 2023 +0100

    fs: port ->symlink() to pass mnt_idmap

    Convert to struct mnt_idmap.

    Last cycle we merged the necessary infrastructure in
    256c8aed2b42 ("fs: introduce dedicated idmap type for mounts").
    This is just the conversion to struct mnt_idmap.

    Currently we still pass around the plain namespace that was attached to a
    mount. This is in general pretty convenient but it makes it easy to
    conflate namespaces that are relevant on the filesystem with namespaces
    that are relevant on the mount level. Especially for non-vfs developers
    without detailed knowledge in this area this can be a potential source for
    bugs.

    Once the conversion to struct mnt_idmap is done all helpers down to the
    really low-level helpers will take a struct mnt_idmap argument instead of
    two namespace arguments. This way it becomes impossible to conflate the two
    eliminating the possibility of any bugs. All of the vfs and all filesystems
    only operate on struct mnt_idmap.

    Acked-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Signed-off-by: Ian Kent <ikent@redhat.com>
2024-10-16 10:45:00 +08:00
Jerome Marchand 2ed1f08fab bpf: Support symbolic BPF FS delegation mount options
JIRA: https://issues.redhat.com/browse/RHEL-23649

commit 6c1752e0b6ca8c7021d6da3926738d8d88f601a9
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Tue Jan 23 18:21:16 2024 -0800

    bpf: Support symbolic BPF FS delegation mount options

    Besides already supported special "any" value and hex bit mask, support
    string-based parsing of delegation masks based on exact enumerator
    names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
    `enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
    symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
    prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
    BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
    normalized to lower case in mount option output. So "PROG_LOAD",
    "prog_load", and "MAP_create" are all valid values to specify for
    delegate_cmds options, "array" is among supported for map types, etc.

    Besides supporting string values, we also support multiple values
    specified at the same time, using colon (':') separator.

    There are corresponding changes on bpf_show_options side to use known
    values to print them in human-readable format, falling back to hex mask
    printing, if there are any unrecognized bits. This shouldn't be
    necessary when enum BTF information is present, but in general we should
    always be able to fall back to this even if kernel was built without BTF.
    As mentioned, emitted symbolic names are normalized to be all lower case.

    Example below shows various ways to specify delegate_cmds options
    through mount command and how mount options are printed back:

    12/14 14:39:07.604
    vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
    $ mount | rg token

      $ sudo mkdir -p /sys/fs/bpf/token
      $ sudo mount -t bpf bpffs /sys/fs/bpf/token \
                   -o delegate_cmds=prog_load:MAP_CREATE \
                   -o delegate_progs=kprobe \
                   -o delegate_attachs=xdp
      $ mount | grep token
      bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)

    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Link: https://lore.kernel.org/bpf/20240124022127.2379740-20-andrii@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2024-10-15 10:49:04 +02:00
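Note on 2ed1f08fab: the parsing and printing rules described in the commit message can be modeled in a few lines of Python. This is a user-space sketch, not the kernel code. The names `BPF_CMDS`, `parse_delegate_mask`, and `format_delegate_mask` are made up for illustration, the table holds only the first six `enum bpf_cmd` values (the kernel reads the full list from BTF), and the order in which the fallbacks are tried is an assumption based on the description above.

```python
# Illustrative subset of enum bpf_cmd, with the BPF_ prefix stripped and the
# names lower-cased. The kernel obtains the full list from BTF instead.
BPF_CMDS = {
    "map_create": 0,
    "map_lookup_elem": 1,
    "map_update_elem": 2,
    "map_delete_elem": 3,
    "map_get_next_key": 4,
    "prog_load": 5,
}

ANY_MASK = (1 << 64) - 1  # model of an all-ones 64-bit mask


def parse_delegate_mask(value, names):
    """Turn 'prog_load:MAP_CREATE', 'any', or '0x21' into a bit mask."""
    mask = 0
    for item in value.split(":"):          # ':' separates multiple values
        if item == "any":
            mask |= ANY_MASK
        elif item.lower() in names:        # symbolic names ignore case
            mask |= 1 << names[item.lower()]
        else:
            # Numeric fallback. int(..., 0) only approximates the kernel's
            # number parsing and raises ValueError for unknown names.
            mask |= int(item, 0)
    return mask


def format_delegate_mask(mask, names):
    """Print known bits as lower-case names, leftover bits as one hex value."""
    if mask == ANY_MASK:
        return "any"
    parts = []
    for name, bit in sorted(names.items(), key=lambda kv: kv[1]):
        if mask & (1 << bit):
            parts.append(name)
            mask &= ~(1 << bit)
    if mask:                               # unrecognized bits remain
        parts.append(hex(mask))
    return ":".join(parts)
```

With this table, `parse_delegate_mask("prog_load:MAP_CREATE", BPF_CMDS)` yields 0x21, and formatting that mask gives `map_create:prog_load`, the same order as in the mount output quoted in the commit message.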
Jerome Marchand 13b9927298 bpf: Add BPF token support to BPF_PROG_LOAD command
JIRA: https://issues.redhat.com/browse/RHEL-23649

commit caf8f28e036c4ba1e823355da6c0c01c39e70ab9
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Tue Jan 23 18:21:03 2024 -0800

    bpf: Add BPF token support to BPF_PROG_LOAD command

    Add basic support of BPF token to BPF_PROG_LOAD. BPF_F_TOKEN_FD flag
    should be set in prog_flags field when providing prog_token_fd.

    Wire through a set of allowed BPF program types and attach types,
    derived from BPF FS at BPF token creation time. Then make sure we
    perform bpf_token_capable() checks everywhere where it's relevant.

    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20240124022127.2379740-7-andrii@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2024-10-15 10:49:03 +02:00
Jerome Marchand cb1e5415cf bpf: Add BPF token support to BPF_MAP_CREATE command
JIRA: https://issues.redhat.com/browse/RHEL-23649

commit a177fc2bf6fd83704854feaf7aae926b1df4f0b9
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Tue Jan 23 18:21:01 2024 -0800

    bpf: Add BPF token support to BPF_MAP_CREATE command

    Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
    BPF map creation from unprivileged process through delegated BPF token.
    New BPF_F_TOKEN_FD flag is added to specify together with BPF token FD
    for BPF_MAP_CREATE command.

    Wire through a set of allowed BPF map types to BPF token, derived from
    BPF FS at BPF token creation time. This, in combination with allowed_cmds
    allows to create a narrowly-focused BPF token (controlled by privileged
    agent) with a restrictive set of BPF maps that application can attempt
    to create.

    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Link: https://lore.kernel.org/bpf/20240124022127.2379740-5-andrii@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2024-10-15 10:49:03 +02:00
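Note on cb1e5415cf: the check described in the commit message (flag plus token FD, then command and map type bit sets) can be sketched as a small Python model. It is a simplification under stated assumptions, not the kernel implementation. `Token` and `token_honored_for_map_create` are hypothetical names, and the numeric constants mirror the UAPI header values as I understand them, so they should be checked against the headers of the kernel in use.

```python
from dataclasses import dataclass

# Assumed to match include/uapi/linux/bpf.h; verify against your headers.
BPF_F_TOKEN_FD = 1 << 16
BPF_MAP_CREATE = 0          # enum bpf_cmd
BPF_MAP_TYPE_HASH = 1       # enum bpf_map_type
BPF_MAP_TYPE_ARRAY = 2


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a BPF token: two bit sets copied from BPF FS."""
    allowed_cmds: int
    allowed_maps: int


def token_honored_for_map_create(map_flags, token, map_type):
    """Return True if the token would be taken into account for this request."""
    # The token FD is only looked at when the caller sets BPF_F_TOKEN_FD.
    if not (map_flags & BPF_F_TOKEN_FD) or token is None:
        return False
    cmd_ok = bool(token.allowed_cmds & (1 << BPF_MAP_CREATE))
    type_ok = bool(token.allowed_maps & (1 << map_type))
    return cmd_ok and type_ok
```

A token built with `allowed_cmds=1 << BPF_MAP_CREATE` and `allowed_maps=1 << BPF_MAP_TYPE_ARRAY` is honored for array maps only, which is the narrowly focused delegation the commit message describes.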
Jerome Marchand a761731cac bpf: Introduce BPF token object
JIRA: https://issues.redhat.com/browse/RHEL-23649

commit 35f96de04127d332a5c5e8a155d31f452f88c76d
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Tue Jan 23 18:21:00 2024 -0800

    bpf: Introduce BPF token object

    Add new kind of BPF kernel object, BPF token. BPF token is meant to
    allow delegating privileged BPF functionality, like loading a BPF
    program or creating a BPF map, from privileged process to a *trusted*
    unprivileged process, all while having a good amount of control over which
    privileged operations could be performed using provided BPF token.

    This is achieved through mounting BPF FS instance with extra delegation
    mount options, which determine what operations are delegatable, and also
    constraining it to the owning user namespace (as mentioned in the
    previous patch).

    BPF token itself is just a derivative from BPF FS and can be created
    through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
    FS FD, which can be attained through open() API by opening BPF FS mount
    point. Currently, BPF token "inherits" delegated command, map types,
    prog type, and attach type bit sets from BPF FS as is. In the future,
    having a BPF token as a separate object with its own FD, we can allow
    to further restrict BPF token's allowable set of things either at the
    creation time or after the fact, allowing the process to guard itself
    further from unintentionally trying to load undesired kind of BPF
    programs. But for now we keep things simple and just copy bit sets as is.

    When BPF token is created from BPF FS mount, we take reference to the
    BPF super block's owning user namespace, and then use that namespace for
    checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
    capabilities that are normally only checked against init userns (using
    capable()), but now we check them using ns_capable() instead (if BPF
    token is provided). See bpf_token_capable() for details.

    Such setup means that BPF token in itself is not sufficient to grant BPF
    functionality. User namespaced process has to *also* have necessary
    combination of capabilities inside that user namespace. So while
    previously CAP_BPF was useless when granted within user namespace, now
    it gains a meaning and allows container managers and sys admins to have
    a flexible control over which processes can and need to use BPF
    functionality within the user namespace (i.e., container in practice).
    And BPF FS delegation mount options and derived BPF tokens serve as
    a per-container "flag" to grant overall ability to use bpf() (plus further
    restrict on which parts of bpf() syscalls are treated as namespaced).

    Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
    within the BPF FS owning user namespace, rounding up the ns_capable()
    story of BPF token. Also creating BPF token in init user namespace is
    currently not supported, given BPF token doesn't have any effect in init
    user namespace anyways.

    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Christian Brauner <brauner@kernel.org>
    Link: https://lore.kernel.org/bpf/20240124022127.2379740-4-andrii@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2024-10-15 10:49:03 +02:00
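Note on a761731cac: the point that a token alone is not sufficient, and that the process must also hold the capability inside the token's user namespace, can be illustrated with a toy Python model. Everything here is hypothetical (`BpfTokenModel`, `token_permits`, and the dictionary of capabilities per namespace). The real checks live in bpf_token_capable() and involve more than this sketch shows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BpfTokenModel:
    """Toy token: the owning user namespace plus the delegated command bits."""
    userns: str
    allowed_cmds: int


def token_permits(token, cmd, cap, caps_by_userns):
    """Both conditions must hold: command delegated AND capability held
    by the process inside the token's own user namespace."""
    delegated = bool(token.allowed_cmds & (1 << cmd))
    held = caps_by_userns.get(token.userns, frozenset())
    return delegated and cap in held


BPF_PROG_LOAD = 5  # enum bpf_cmd value
token = BpfTokenModel(userns="container", allowed_cmds=1 << BPF_PROG_LOAD)
```

For example, `token_permits(token, BPF_PROG_LOAD, "CAP_BPF", {"container": {"CAP_BPF"}})` is true, while the same call with an empty capability set, or with CAP_BPF held only in some other namespace, is false.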
Jerome Marchand c25d56d157 bpf: Add BPF token delegation mount options to BPF FS
JIRA: https://issues.redhat.com/browse/RHEL-23649

commit 6fe01d3cbb924a72493eb3f4722dfcfd1c194234
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Tue Jan 23 18:20:59 2024 -0800

    bpf: Add BPF token delegation mount options to BPF FS

    Add few new mount options to BPF FS that allow to specify that a given
    BPF FS instance allows creation of BPF token (added in the next patch),
    and what sort of operations are allowed under BPF token. As such, we get
    4 new mount options, each is a bit mask
      - `delegate_cmds` allow to specify which bpf() syscall commands are
        allowed with BPF token derived from this BPF FS instance;
      - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
        a set of allowable BPF map types that could be created with BPF token;
      - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
        a set of allowable BPF program types that could be loaded with BPF token;
      - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
        a set of allowable BPF program attach types that could be loaded with
        BPF token; delegate_progs and delegate_attachs are meant to be used
        together, as full BPF program type is, in general, determined
        through both program type and program attach type.

    Currently, these mount options accept the following forms of values:
      - a special value "any", that enables all possible values of a given
        bit set;
      - numeric value (decimal or hexadecimal, determined by kernel
        automatically) that specifies a bit mask value directly;
      - all the values for a given mount option are combined, if specified
        multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
        delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined
        0x3 mask.

    Ideally, more convenient (for humans) symbolic form derived from
    corresponding UAPI enums would be accepted (e.g., `-o
    delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
    it requires a bunch of UAPI header churn, so I postponed it until this
    feature lands upstream or at least there is a definite consensus that
    this feature is acceptable and is going to make it, just to minimize
    amount of wasted effort and not increase amount of non-essential code to
    be reviewed.

    Attentive reader will notice that BPF FS is now marked as
    FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
    user namespace as long as the process has sufficient *namespaced*
    capabilities within that user namespace. But in reality we still
    restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
    init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
    to allow creating BPF FS context object (i.e., fsopen("bpf")) from
    inside unprivileged process inside non-init userns, to capture that
    userns as the owning userns. It will still be required to pass this
    context object back to privileged process to instantiate and mount it.

    This manipulation is important, because capturing non-init userns as the
    owning userns of BPF FS instance (super block) allows to use that userns
    to constrain BPF token to that userns later on (see next patch). So
    creating BPF FS with delegation inside unprivileged userns will restrict
    derived BPF token objects to only "work" inside that intended userns,
    making it scoped to an intended "container". Also, setting these
    delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
    process cannot set this up without involvement of a privileged process.

    There is a set of selftests at the end of the patch set that simulates
    this sequence of steps and validates that everything works as intended.
    But careful review is requested to make sure there are no missed gaps in
    the implementation and testing.

    This somewhat subtle set of aspects is the result of previous
    discussions ([0]) about various user namespace implications and
    interactions with BPF token functionality and is necessary to contain
    BPF token inside intended user namespace.

      [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/

    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Christian Brauner <brauner@kernel.org>
    Link: https://lore.kernel.org/bpf/20240124022127.2379740-3-andrii@kernel.org

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2024-10-15 10:49:02 +02:00
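Note on c25d56d157: the rule that repeated occurrences of one mount option are combined can be sketched as a bitwise OR over the parsed values. The helper name `combine_delegate_values` is made up for illustration, and Python's `int(value, 0)` is only an approximation of how the kernel decides between decimal and hexadecimal.

```python
ANY_MASK = (1 << 64) - 1  # model of "any": every bit of a 64-bit mask set


def combine_delegate_values(values):
    """OR together every value given for one delegate_* mount option."""
    mask = 0
    for value in values:
        if value == "any":
            mask |= ANY_MASK
        else:
            # Base 0 picks decimal or hexadecimal from the prefix.
            mask |= int(value, 0)
    return mask
```

Passing `["0x1", "0x2"]`, as in the `delegate_maps` example from the commit message, gives the combined mask 0x3.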
Viktor Malik bbe43757ac bpf: Re-support uid and gid when mounting bpffs
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit b08c8fc0411dce0fc44b78ce4d67f1b67c35c196
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Wed Dec 20 14:38:05 2023 +0100

    bpf: Re-support uid and gid when mounting bpffs
    
    For a clean, conflict-free revert of the token-related patches in commit
    d17aff807f84 ("Revert BPF token-related functionality"), the bpf fs commit
    750e785796bb ("bpf: Support uid and gid when mounting bpffs") was undone
    temporarily as well.
    
    This patch manually re-adds the functionality from the original one back
    in 750e785796bb, no other functional changes intended.
    
    Testing:
    
      # mount -t bpf -o uid=65534,gid=65534 bpffs ./foo
      # ls -la . | grep foo
      drwxrwxrwt   2 nobody nogroup          0 Dec 20 13:16 foo
      # mount -t bpf
      bpffs on /root/foo type bpf (rw,relatime,uid=65534,gid=65534)
    
    Also, passing invalid arguments for uid/gid are properly rejected as expected.
    
    Fixes: d17aff807f84 ("Revert BPF token-related functionality")
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Cc: Jie Jiang <jiejiang@chromium.org>
    Cc: Andrii Nakryiko <andrii@kernel.org>
    Cc: linux-fsdevel@vger.kernel.org
    Link: https://lore.kernel.org/bpf/20231220133805.20953-1-daniel@iogearbox.net

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 11:07:34 +02:00
Viktor Malik 9680ef97a0 Revert BPF token-related functionality
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit d17aff807f845cf93926c28705216639c7279110
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Tue Dec 19 07:37:35 2023 -0800

    Revert BPF token-related functionality

    This patch includes the following reverts (one conflicting BPF FS
    patch and three token patch sets, represented by merge commits):
      - revert 0f5d5454c723 "Merge branch 'bpf-fs-mount-options-parsing-follow-ups'";
      - revert 750e785796bb "bpf: Support uid and gid when mounting bpffs";
      - revert 733763285acf "Merge branch 'bpf-token-support-in-libbpf-s-bpf-object'";
      - revert c35919dcce28 "Merge branch 'bpf-token-and-bpf-fs-based-delegation'".

    Link: https://lore.kernel.org/bpf/CAHk-=wg7JuFYwGy=GOMbRCtOL+jwSQsdUaBsRWkDVYbxipbM5A@mail.gmail.com
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 11:07:29 +02:00
Viktor Malik d1c71c4773 bpf: support symbolic BPF FS delegation mount options
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit c5707b2146d229691e193d5158ea70b21b8ba180
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Thu Dec 14 14:50:15 2023 -0800

    bpf: support symbolic BPF FS delegation mount options
    
    Besides already supported special "any" value and hex bit mask, support
    string-based parsing of delegation masks based on exact enumerator
    names. Utilize BTF information of `enum bpf_cmd`, `enum bpf_map_type`,
    `enum bpf_prog_type`, and `enum bpf_attach_type` types to find supported
    symbolic names (ignoring __MAX_xxx guard values and stripping repetitive
    prefixes like BPF_ for cmd and attach types, BPF_MAP_TYPE_ for maps, and
    BPF_PROG_TYPE_ for prog types). The case doesn't matter, but it is
    normalized to lower case in mount option output. So "PROG_LOAD",
    "prog_load", and "MAP_create" are all valid values to specify for
    delegate_cmds options, "array" is among supported for map types, etc.
    
    Besides supporting string values, we also support multiple values
    specified at the same time, using colon (':') separator.
    
    There are corresponding changes on bpf_show_options side to use known
    values to print them in human-readable format, falling back to hex mask
    printing, if there are any unrecognized bits. This shouldn't be
    necessary when enum BTF information is present, but in general we should
    always be able to fall back to this even if kernel was built without BTF.
    As mentioned, emitted symbolic names are normalized to be all lower case.
    
    Example below shows various ways to specify delegate_cmds options
    through mount command and how mount options are printed back:
    
    12/14 14:39:07.604
    vmuser@archvm:~/local/linux/tools/testing/selftests/bpf
    $ mount | rg token
    
      $ sudo mkdir -p /sys/fs/bpf/token
      $ sudo mount -t bpf bpffs /sys/fs/bpf/token \
                   -o delegate_cmds=prog_load:MAP_CREATE \
                   -o delegate_progs=kprobe \
                   -o delegate_attachs=xdp
      $ mount | grep token
      bpffs on /sys/fs/bpf/token type bpf (rw,relatime,delegate_cmds=map_create:prog_load,delegate_progs=kprobe,delegate_attachs=xdp)
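
The parsing rules described above (colon-separated names, case-insensitive matching, bits OR-ed together) can be sketched in user-space C. This is only an illustration, not the kernel implementation: the kernel looks the names up in BTF, whereas the `cmd_names` table and the `parse_delegate_cmds()` helper here are hypothetical and cover just two commands.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <strings.h>

/* Hypothetical lookup table; the kernel derives these names from the BTF
 * of enum bpf_cmd instead of hard-coding them. */
struct name_bit {
	const char *name;
	unsigned int bit;
};

static const struct name_bit cmd_names[] = {
	{ "map_create", 0 },	/* BPF_MAP_CREATE == 0 in enum bpf_cmd */
	{ "prog_load",  5 },	/* BPF_PROG_LOAD == 5 in enum bpf_cmd */
};

/* Parse a colon-separated, case-insensitive list of command names.
 * On success, OR the matching bits into *mask and return 0.
 * Return -1 (leaving *mask untouched) if any name is unknown or empty. */
static int parse_delegate_cmds(const char *value, uint64_t *mask)
{
	const size_t n = sizeof(cmd_names) / sizeof(cmd_names[0]);
	const char *p = value;
	uint64_t acc = 0;

	while (*p) {
		size_t len = strcspn(p, ":");	/* length of the current name */
		int found = 0;
		size_t i;

		for (i = 0; i < n; i++) {
			if (strlen(cmd_names[i].name) == len &&
			    strncasecmp(p, cmd_names[i].name, len) == 0) {
				acc |= 1ULL << cmd_names[i].bit;
				found = 1;
				break;
			}
		}
		if (!found)
			return -1;
		p += len;
		if (*p == ':')
			p++;			/* skip the separator */
	}
	*mask |= acc;
	return 0;
}
```

With this sketch, `prog_load:MAP_CREATE` and `map_create:prog_load` yield the same mask, which is consistent with the normalized order shown in the mount output above.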
    
    Acked-by: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20231214225016.1209867-2-andrii@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:52:29 +02:00
Viktor Malik 764dbb7279
bpf: Support uid and gid when mounting bpffs
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit 750e785796bb72423b97cac21ecd0fa3b3b65610
Author: Jie Jiang <jiejiang@chromium.org>
Date:   Tue Dec 12 09:39:23 2023 +0000

    bpf: Support uid and gid when mounting bpffs
    
    Parse uid and gid in bpf_parse_param() so that they can be passed in as
    the `data` parameter when calling mount() for bpffs. This is useful when
    we want to control which user/group owns the mounted bpffs; otherwise
    a separate chown() call is needed.
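
As a rough illustration of how a caller might build that `data` string, the sketch below formats the two options into a buffer. The helper name `bpffs_owner_opts()` is hypothetical; only the `uid=` and `gid=` option names come from the commit message.

```c
#include <stdio.h>
#include <sys/types.h>

/* Format "uid=<uid>,gid=<gid>" into buf for use as the mount(2) data
 * argument. Returns 0 on success, -1 if the buffer is too small. */
static int bpffs_owner_opts(char *buf, size_t len, uid_t uid, gid_t gid)
{
	int n = snprintf(buf, len, "uid=%u,gid=%u",
			 (unsigned int)uid, (unsigned int)gid);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A privileged process could then pass the buffer as the last argument of `mount("bpffs", "/sys/fs/bpf", "bpf", 0, buf)` from `<sys/mount.h>`, assuming a kernel that includes this change.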
    
    Signed-off-by: Jie Jiang <jiejiang@chromium.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Acked-by: Mike Frysinger <vapier@chromium.org>
    Acked-by: Christian Brauner <brauner@kernel.org>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20231212093923.497838-1-jiejiang@chromium.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:52:23 +02:00
Viktor Malik 1b699c9ae7
bpf: add BPF token support to BPF_PROG_LOAD command
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit e1cef620f598853a90f17701fcb1057a6768f7b8
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Thu Nov 30 10:52:18 2023 -0800

    bpf: add BPF token support to BPF_PROG_LOAD command
    
    Add basic support of BPF token to BPF_PROG_LOAD. Wire through a set of
    allowed BPF program types and attach types, derived from BPF FS at BPF
    token creation time. Then make sure we perform bpf_token_capable()
    checks everywhere where it's relevant.
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20231130185229.2688956-7-andrii@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:52:09 +02:00
Viktor Malik 02d0a61b79
bpf: add BPF token support to BPF_MAP_CREATE command
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit 688b7270b3cb75e8ac78123d719967db40336e5b
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Thu Nov 30 10:52:16 2023 -0800

    bpf: add BPF token support to BPF_MAP_CREATE command
    
    Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
    BPF map creation from unprivileged process through delegated BPF token.
    
    Wire through a set of allowed BPF map types to BPF token, derived from
    BPF FS at BPF token creation time. This, in combination with allowed_cmds
    allows to create a narrowly-focused BPF token (controlled by privileged
    agent) with a restrictive set of BPF maps that application can attempt
    to create.
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20231130185229.2688956-5-andrii@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:52:09 +02:00
Viktor Malik 1cf517d9a1
bpf: introduce BPF token object
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit 4527358b76861dfd64ee34aba45d81648fbc8a61
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Thu Nov 30 10:52:15 2023 -0800

    bpf: introduce BPF token object
    
    Add new kind of BPF kernel object, BPF token. BPF token is meant to
    allow delegating privileged BPF functionality, like loading a BPF
    program or creating a BPF map, from privileged process to a *trusted*
    unprivileged process, all while having a good amount of control over which
    privileged operations could be performed using provided BPF token.
    
    This is achieved through mounting BPF FS instance with extra delegation
    mount options, which determine what operations are delegatable, and also
    constraining it to the owning user namespace (as mentioned in the
    previous patch).
    
    BPF token itself is just a derivative from BPF FS and can be created
    through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
    FS FD, which can be attained through open() API by opening BPF FS mount
    point. Currently, BPF token "inherits" delegated command, map types,
    prog type, and attach type bit sets from BPF FS as is. In the future,
    having a BPF token as a separate object with its own FD, we can allow
    to further restrict BPF token's allowable set of things either at the
    creation time or after the fact, allowing the process to guard itself
    further from unintentionally trying to load undesired kind of BPF
    programs. But for now we keep things simple and just copy bit sets as is.
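
The "copy bit sets as is" behaviour can be modelled in a few lines of user-space C. This is a simplified model under stated assumptions, not kernel code: `struct token_model`, `token_from_mount()` and `bit_allowed()` are hypothetical names, and the model deliberately ignores the user namespace and capability checks described in the following paragraphs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the four delegation bit sets held by a BPF FS
 * mount and inherited unchanged by a token created from it. */
struct token_model {
	uint64_t allowed_cmds;
	uint64_t allowed_maps;
	uint64_t allowed_progs;
	uint64_t allowed_attachs;
};

/* Token creation copies the mount's bit sets as is. */
static struct token_model token_from_mount(const struct token_model *mnt_opts)
{
	return *mnt_opts;
}

/* An operation is delegated only if its enum value is set in the mask. */
static bool bit_allowed(uint64_t set, unsigned int nr)
{
	return nr < 64 && (set & (1ULL << nr)) != 0;
}
```

For example, a mount that delegates only bit 5 (`BPF_PROG_LOAD`) produces a token for which `bit_allowed(tok.allowed_cmds, 5)` holds and `bit_allowed(tok.allowed_cmds, 0)` does not.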
    
    When BPF token is created from BPF FS mount, we take reference to the
    BPF super block's owning user namespace, and then use that namespace for
    checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
    capabilities that are normally only checked against init userns (using
    capable()), but now we check them using ns_capable() instead (if BPF
    token is provided). See bpf_token_capable() for details.
    
    Such setup means that BPF token in itself is not sufficient to grant BPF
    functionality. User namespaced process has to *also* have necessary
    combination of capabilities inside that user namespace. So while
    previously CAP_BPF was useless when granted within user namespace, now
    it gains a meaning and allows container managers and sys admins to have
    a flexible control over which processes can and need to use BPF
    functionality within the user namespace (i.e., container in practice).
    And BPF FS delegation mount options and derived BPF tokens serve as
    a per-container "flag" to grant overall ability to use bpf() (plus further
    restrict on which parts of bpf() syscalls are treated as namespaced).
    
    Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
    within the BPF FS owning user namespace, rounding up the ns_capable()
    story of BPF token.
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20231130185229.2688956-4-andrii@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:52:08 +02:00
Viktor Malik d25fac4236
bpf: add BPF token delegation mount options to BPF FS
JIRA: https://issues.redhat.com/browse/RHEL-23644

commit 40bba140c60fbb3ee8df6203c82fbd3de9f19d95
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Thu Nov 30 10:52:14 2023 -0800

    bpf: add BPF token delegation mount options to BPF FS
    
    Add a few new mount options to BPF FS that allow to specify that a given
    BPF FS instance allows creation of BPF token (added in the next patch),
    and what sort of operations are allowed under BPF token. As such, we get
    4 new mount options, each of which is a bit mask:
      - `delegate_cmds` allows specifying which bpf() syscall commands are
        allowed with BPF token derived from this BPF FS instance;
      - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
        a set of allowable BPF map types that could be created with BPF token;
      - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
        a set of allowable BPF program types that could be loaded with BPF token;
      - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
        a set of allowable BPF program attach types that could be loaded with
        BPF token; delegate_progs and delegate_attachs are meant to be used
        together, as full BPF program type is, in general, determined
        through both program type and program attach type.
    
    Currently, these mount options accept the following forms of values:
      - a special value "any", that enables all possible values of a given
      bit set;
      - numeric value (decimal or hexadecimal, determined by kernel
      automatically) that specifies a bit mask value directly;
      - all the values for a given mount option are combined, if specified
      multiple times. E.g., `mount -t bpf nodev /path/to/mount -o
      delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
      mask.
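
The three value forms listed above can be sketched in user-space C as follows. This is an approximation under stated assumptions: `delegate_mask_add()` is a hypothetical helper, and `strtoull()` with base 0 stands in for the kernel's own number parsing (it also accepts octal and leading whitespace, which may differ from what the kernel accepts).

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Apply one occurrence of a delegate_* option to the running mask.
 * "any" enables every bit; otherwise the value is parsed as a number
 * (a "0x" prefix selects hex) and OR-ed into *mask.
 * Returns 0 on success, -1 on a malformed value. */
static int delegate_mask_add(const char *value, uint64_t *mask)
{
	unsigned long long v;
	char *end;

	if (strcmp(value, "any") == 0) {
		*mask = ~0ULL;
		return 0;
	}

	errno = 0;
	v = strtoull(value, &end, 0);
	if (errno || end == value || *end != '\0')
		return -1;

	*mask |= v;	/* repeated options combine rather than overwrite */
	return 0;
}
```

Calling the helper with `0x1` and then `0x2` leaves a combined mask of `0x3`, matching the `delegate_maps` example in the list above.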
    
    Ideally, more convenient (for humans) symbolic form derived from
    corresponding UAPI enums would be accepted (e.g., `-o
    delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
    it requires a bunch of UAPI header churn, so I postponed it until this
    feature lands upstream or at least there is a definite consensus that
    this feature is acceptable and is going to make it, just to minimize
    amount of wasted effort and not increase amount of non-essential code to
    be reviewed.
    
    Attentive reader will notice that BPF FS is now marked as
    FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
    user namespace as long as the process has sufficient *namespaced*
    capabilities within that user namespace. But in reality we still
    restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
    init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
    to allow creating BPF FS context object (i.e., fsopen("bpf")) from
    inside unprivileged process inside non-init userns, to capture that
    userns as the owning userns. It will still be required to pass this
    context object back to privileged process to instantiate and mount it.
    
    This manipulation is important, because capturing non-init userns as the
    owning userns of BPF FS instance (super block) allows to use that userns
    to constrain BPF token to that userns later on (see next patch). So
    creating BPF FS with delegation inside unprivileged userns will restrict
    derived BPF token objects to only "work" inside that intended userns,
    making it scoped to an intended "container". Also, setting these
    delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
    process cannot set this up without involvement of a privileged process.
    
    There is a set of selftests at the end of the patch set that simulates
    this sequence of steps and validates that everything works as intended.
    But careful review is requested to make sure there are no missed gaps in
    the implementation and testing.
    
    This somewhat subtle set of aspects is the result of previous
    discussions ([0]) about various user namespace implications and
    interactions with BPF token functionality and is necessary to contain
    BPF token inside intended user namespace.
    
      [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
    
    Acked-by: Christian Brauner <brauner@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/r/20231130185229.2688956-3-andrii@kernel.org
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2024-06-25 10:52:08 +02:00
Viktor Malik 9358711447
bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit cb8edce28073a906401c9e421eca7c99f3396da1
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Mon May 15 16:48:06 2023 -0700

    bpf: Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
    
    Current UAPI of BPF_OBJ_PIN and BPF_OBJ_GET commands of bpf() syscall
    forces users to specify pinning location as a string-based absolute or
    relative (to current working directory) path. This has various
    implications related to security (e.g., symlink-based attacks) and forces
    BPF FS to be exposed in the file system, which can cause races with
    other applications.
    
    One of the feedbacks we got from folks working with containers heavily
    was that inability to use purely FD-based location specification was an
    unfortunate limitation and hindrance for BPF_OBJ_PIN and BPF_OBJ_GET
    commands. This patch closes this oversight, adding path_fd field to
    BPF_OBJ_PIN and BPF_OBJ_GET UAPI, following conventions established by
    *at() syscalls for dirfd + pathname combinations.
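
The dirfd + pathname convention referred to here can be illustrated with plain `openat()`, independent of BPF. The sketch below is not the bpf() UAPI itself: `create_relative()` is a hypothetical helper that creates an ordinary file relative to an `O_PATH` directory descriptor, which is the same addressing scheme that `path_fd` plus a relative pathname follows. The `O_PATH` fallback define is only there so the sketch builds where the flag is not exposed.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#ifndef O_PATH
#define O_PATH 0	/* fall back to a regular directory FD */
#endif

/* Create `name` inside directory `dir`, addressing it as
 * (dirfd, relative pathname) rather than by an absolute path.
 * Returns 0 on success, -1 on failure (including "already exists"). */
static int create_relative(const char *dir, const char *name)
{
	int dirfd, fd;

	dirfd = open(dir, O_PATH | O_DIRECTORY | O_CLOEXEC);
	if (dirfd < 0)
		return -1;

	/* The lookup starts at dirfd, so later changes to the path that
	 * led to `dir` cannot redirect it. */
	fd = openat(dirfd, name, O_CREAT | O_EXCL | O_WRONLY | O_CLOEXEC, 0600);
	close(dirfd);
	if (fd < 0)
		return -1;

	close(fd);
	return 0;
}
```

Holding a directory FD in this way is what makes it possible to work with a mount that is never attached to the visible file system tree.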
    
    This now allows interesting possibilities like working with detached BPF
    FS mount (e.g., to perform multiple pinnings without running a risk of
    someone interfering with them), and generally making pinning/getting
    more secure and not prone to any races and/or security attacks.
    
    This is demonstrated by a selftest added in subsequent patch that takes
    advantage of new mount APIs (fsopen, fsconfig, fsmount) to demonstrate
    creating detached BPF FS mount, pinning, and then getting BPF map out of
    it, all while never exposing this private instance of BPF FS to the
    outside world.
    
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Link: https://lore.kernel.org/bpf/20230523170013.728457-4-andrii@kernel.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-26 17:06:14 +02:00
Viktor Malik 75010fc170
bpf: Validate BPF object in BPF_OBJ_PIN before calling LSM
JIRA: https://issues.redhat.com/browse/RHEL-9957

commit e7d85427ef898afe66c4c1b7e06e5659cec6b640
Author: Andrii Nakryiko <andrii@kernel.org>
Date:   Mon May 22 16:29:14 2023 -0700

    bpf: Validate BPF object in BPF_OBJ_PIN before calling LSM
    
    Do a sanity check whether provided file-to-be-pinned is actually a BPF
    object (prog, map, btf) before calling security_path_mknod LSM hook. If
    it's not, LSM hook doesn't have to be triggered, as the operation has no
    chance of succeeding anyways.
    
    Suggested-by: Christian Brauner <brauner@kernel.org>
    Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Link: https://lore.kernel.org/bpf/20230522232917.2454595-2-andrii@kernel.org

Signed-off-by: Viktor Malik <vmalik@redhat.com>
2023-10-26 17:06:14 +02:00
Jerome Marchand e8ea7c6063 bpf: Convert bpf_preload.ko to use light skeleton.
Bugzilla: https://bugzilla.redhat.com/2120966

commit cb80ddc67152e72f28ff6ea8517acdf875d7381d
Author: Alexei Starovoitov <ast@kernel.org>
Date:   Wed Feb 9 15:20:01 2022 -0800

    bpf: Convert bpf_preload.ko to use light skeleton.

    The main change is a move of the single line
      #include "iterators.lskel.h"
    from iterators/iterators.c to bpf_preload_kern.c.
    Which means that generated light skeleton can be used from user space or
    user mode driver like iterators.c or from the kernel module or the kernel itself.
    The direct use of light skeleton from the kernel module simplifies the code,
    since UMD is no longer necessary. The libbpf.a required user space and UMD. The
    CO-RE in the kernel and generated "loader bpf program" used by the light
    skeleton are capable of performing complex loading operations traditionally
    provided by libbpf. In addition UMD approach was launching UMD process
    every time bpffs has to be mounted. With light skeleton in the kernel
    the bpf_preload kernel module loads bpf iterators once and pins them
    multiple times into different bpffs mounts.

    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Yonghong Song <yhs@fb.com>
    Acked-by: Andrii Nakryiko <andrii@kernel.org>
    Link: https://lore.kernel.org/bpf/20220209232001.27490-6-alexei.starovoitov@gmail.com

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
2022-10-25 14:57:49 +02:00
Artem Savkov c088e56c68 bpf: Fix mount source show for bpffs
Bugzilla: https://bugzilla.redhat.com/2069046

Upstream Status: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

commit 1e9d74660d4df625b0889e77018f9e94727ceacd
Author: Yafang Shao <laoar.shao@gmail.com>
Date:   Sat Jan 8 13:46:23 2022 +0000

    bpf: Fix mount source show for bpffs

    We noticed our tc ebpf tools can't start after we upgrade our in-house kernel
    version from 4.19 to 5.10. That is because of the behaviour change in bpffs
    caused by commit d2935de7e4 ("vfs: Convert bpf to use the new mount API").

    In our tc ebpf tools, we do strict environment check. If the environment is
    not matched, we won't allow to start the ebpf progs. One of the checks is whether
    bpffs is properly mounted. The mount information of bpffs in kernel-4.19 and
    kernel-5.10 are as follows:

    - kernel 4.19
    $ mount -t bpf bpffs /sys/fs/bpf
    $ mount -t bpf
    bpffs on /sys/fs/bpf type bpf (rw,relatime)

    - kernel 5.10
    $ mount -t bpf bpffs /sys/fs/bpf
    $ mount -t bpf
    none on /sys/fs/bpf type bpf (rw,relatime)

    The device name in kernel-5.10 is displayed as none instead of bpffs, then our
    environment check fails. Currently we modify the tools to adapt to the kernel
    behaviour change, but I think we'd better change the kernel code to keep the
    behavior consistent.

    After this change, the mount information will be displayed the same as
    in kernel-4.19, for example:

    $ mount -t bpf bpffs /sys/fs/bpf
    $ mount -t bpf
    bpffs on /sys/fs/bpf type bpf (rw,relatime)
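
The commit message does not show the tool's actual check, so the following is only a guess at what a strict check of this kind might look like. `bpf_mount_source_is()` is a hypothetical helper that inspects one line in /proc/mounts format (source, mount point, type, options), which differs from the `mount` output shown above.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if a /proc/mounts-style line describes a mount of type "bpf"
 * whose source field equals `expect`, and 0 otherwise. */
static int bpf_mount_source_is(const char *line, const char *expect)
{
	char src[64], dir[256], type[32];

	if (sscanf(line, "%63s %255s %31s", src, dir, type) != 3)
		return 0;

	return strcmp(type, "bpf") == 0 && strcmp(src, expect) == 0;
}
```

A check written this way passes for a source of `bpffs` and fails for `none`, which is the kind of breakage the report describes.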

    Fixes: d2935de7e4 ("vfs: Convert bpf to use the new mount API")
    Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
    Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
    Cc: David Howells <dhowells@redhat.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Link: https://lore.kernel.org/bpf/20220108134623.32467-1-laoar.shao@gmail.com

Signed-off-by: Artem Savkov <asavkov@redhat.com>
2022-08-24 12:53:53 +02:00
Maciej Żenczykowski 5dec6d96d1 bpf: Fix regression on BPF_OBJ_GET with non-O_RDWR flags
This reverts commit d37300ed18 ("bpf: program: Refuse non-O_RDWR flags
in BPF_OBJ_GET"). It breaks Android userspace which expects to be able to
fetch programs with just read permissions.

See: https://cs.android.com/android/platform/superproject/+/master:frameworks/libs/net/common/native/bpf_syscall_wrappers/include/BpfSyscallWrappers.h;drc=7005c764be23d31fa1d69e826b4a2f6689a8c81e;l=124

Side-note: another option to fix it would be to extend bpf_prog_new_fd()
and to pass in used file mode flags in the same way as we do for maps via
bpf_map_new_fd(). Meaning, they'd end up in anon_inode_getfd() and thus
would be retained for prog fd operations with bpf() syscall. Right now
these flags are not checked with progs since they are immutable for their
lifetime (as opposed to maps which can be updated from user space). In
future this could potentially change with new features, but at that point
it's still fine to do the bpf_prog_new_fd() extension when needed. For a
simple stable fix, a revert is less churn.

Fixes: d37300ed18 ("bpf: program: Refuse non-O_RDWR flags in BPF_OBJ_GET")
Signed-off-by: Maciej Żenczykowski <maze@google.com>
[ Daniel: added side-note to commit message ]
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Acked-by: Greg Kroah-Hartman <gregkh@google.com>
Link: https://lore.kernel.org/bpf/20210618105526.265003-1-zenczykowski@gmail.com
2021-06-22 14:57:43 +02:00
David S. Miller 5f6c2f536d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2021-04-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 69 non-merge commits during the last 22 day(s) which contain
a total of 69 files changed, 3141 insertions(+), 866 deletions(-).

The main changes are:

1) Add BPF static linker support for extern resolution of global, from Andrii.

2) Refine retval for bpf_get_task_stack helper, from Dave.

3) Add a bpf_snprintf helper, from Florent.

4) A bunch of miscellaneous improvements from many developers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-25 18:02:32 -07:00
Muhammad Usama Anjum 957dca3df6 bpf, inode: Remove second initialization of the bpf_preload_lock
bpf_preload_lock is already defined with DEFINE_MUTEX(). There is no
need to initialize it again. Remove the extraneous initialization.

Signed-off-by: Muhammad Usama Anjum <musamaanjum@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210405194904.GA148013@LEGION
2021-04-06 23:39:13 +02:00
Lorenz Bauer d37300ed18 bpf: program: Refuse non-O_RDWR flags in BPF_OBJ_GET
As for bpf_link, refuse creating a non-O_RDWR fd. Since program fds
currently don't allow modifications this is a precaution, not a
straight up bug fix.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210326160501.46234-2-lmb@cloudflare.com
2021-04-01 14:33:48 -07:00
Lorenz Bauer 25fc94b2f0 bpf: link: Refuse non-O_RDWR flags in BPF_OBJ_GET
Invoking BPF_OBJ_GET on a pinned bpf_link checks the path access
permissions based on file_flags, but the returned fd ignores flags.
This means that any user can acquire a "read-write" fd for a pinned
link with mode 0664 by invoking BPF_OBJ_GET with BPF_F_RDONLY in
file_flags. The fd can be used to invoke BPF_LINK_DETACH, etc.

Fix this by refusing non-O_RDWR flags in BPF_OBJ_GET. This works
because OBJ_GET by default returns a read write mapping and libbpf
doesn't expose a way to override this behaviour for programs
and links.

Fixes: 70ed506c3b ("bpf: Introduce pinnable bpf_link abstraction")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210326160501.46234-1-lmb@cloudflare.com
2021-04-01 14:33:14 -07:00
Christian Brauner 549c729771
fs: make helpers idmap mount aware
Extend some inode methods with an additional user namespace argument. A
filesystem that is aware of idmapped mounts will receive the user
namespace the mount has been marked with. This can be used for
additional permission checking and also to enable filesystems to
translate between uids and gids if they need to. We have implemented all
relevant helpers in earlier patches.

As requested we simply extend the existing inode method instead of
introducing new ones. This is a little more code churn but it's mostly
mechanical and doesn't leave us with additional inode methods.

Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:20 +01:00
Christian Brauner 21cb47be6f
inode: make init and permission helpers idmapped mount aware
The inode_owner_or_capable() helper determines whether the caller is the
owner of the inode or is capable with respect to that inode. Allow it to
handle idmapped mounts. If the inode is accessed through an idmapped
mount, it is mapped according to the mount's user namespace. Afterwards the checks
are identical to non-idmapped mounts. If the initial user namespace is
passed nothing changes so non-idmapped mounts will see identical
behavior as before.

Similarly, allow the inode_init_owner() helper to handle idmapped
mounts. It initializes a new inode on idmapped mounts by mapping the
fsuid and fsgid of the caller from the mount's user namespace. If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner 47291baa8d
namei: make permission helpers idmapped mount aware
The two helpers inode_permission() and generic_permission() are used by
the vfs to perform basic permission checking by verifying that the
caller is privileged over an inode. In order to handle idmapped mounts
we extend the two helpers with an additional user namespace argument.
On idmapped mounts the two helpers will make sure to map the inode
according to the mount's user namespace and then perform identical
permission checks to inode_permission() and generic_permission(). If the
initial user namespace is passed nothing changes so non-idmapped mounts
will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
Christian Brauner 02f92b3868
fs: add file and path permissions helpers
Add two simple helpers to check permissions on a file and path
respectively and convert over some callers. It simplifies quite a few
codepaths and also reduces the churn in later patches quite a bit.
Christoph also correctly points out that this makes codepaths (e.g.
ioctls) way easier to follow that would otherwise have to do more
complex argument passing than necessary.

Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24 14:27:16 +01:00
David S. Miller 3ab0a7a0c3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Two minor conflicts:

1) net/ipv4/route.c, adding a new local variable while
   moving another local variable and removing its
   initial assignment.

2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
   One pretty prints the port mode differently, whilst another
   changes the driver to try and obtain the port mode from
   the port node rather than the switch node.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-22 16:45:34 -07:00
Yonghong Song ce880cb825 bpf: Fix a rcu warning for bpffs map pretty-print
Running selftest
  ./test_btf -p
the kernel had the following warning:
  [   51.528185] WARNING: CPU: 3 PID: 1756 at kernel/bpf/hashtab.c:717 htab_map_get_next_key+0x2eb/0x300
  [   51.529217] Modules linked in:
  [   51.529583] CPU: 3 PID: 1756 Comm: test_btf Not tainted 5.9.0-rc1+ #878
  [   51.530346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
  [   51.531410] RIP: 0010:htab_map_get_next_key+0x2eb/0x300
  ...
  [   51.542826] Call Trace:
  [   51.543119]  map_seq_next+0x53/0x80
  [   51.543528]  seq_read+0x263/0x400
  [   51.543932]  vfs_read+0xad/0x1c0
  [   51.544311]  ksys_read+0x5f/0xe0
  [   51.544689]  do_syscall_64+0x33/0x40
  [   51.545116]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The related source code in kernel/bpf/hashtab.c:
  709 static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
  710 {
  711         struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
  712         struct hlist_nulls_head *head;
  713         struct htab_elem *l, *next_l;
  714         u32 hash, key_size;
  715         int i = 0;
  716
  717         WARN_ON_ONCE(!rcu_read_lock_held());

In kernel/bpf/inode.c, bpffs map pretty print calls map->ops->map_get_next_key()
without holding a rcu_read_lock(), hence causing the above warning.
To fix the issue, surround map->ops->map_get_next_key() with an RCU read lock.

Fixes: a26ca7c982 ("bpf: btf: Add pretty print support to the basic arraymap")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200916004401.146277-1-yhs@fb.com
2020-09-15 18:17:39 -07:00
Alexei Starovoitov d71fa5c976 bpf: Add kernel module with user mode driver that populates bpffs.
Add kernel module with user mode driver that populates bpffs with
BPF iterators.

$ mount bpffs /my/bpffs/ -t bpf
$ ls -la /my/bpffs/
total 4
drwxrwxrwt  2 root root    0 Jul  2 00:27 .
drwxr-xr-x 19 root root 4096 Jul  2 00:09 ..
-rw-------  1 root root    0 Jul  2 00:27 maps.debug
-rw-------  1 root root    0 Jul  2 00:27 progs.debug

The user mode driver will load BPF Type Formats, create BPF maps, populate BPF
maps, load two BPF programs, attach them to BPF iterators, and finally send two
bpf_link IDs back to the kernel.
The kernel will pin two bpf_links into newly mounted bpffs instance under
names "progs.debug" and "maps.debug". These two files become human readable.

$ cat /my/bpffs/progs.debug
  id name            attached
  11 dump_bpf_map    bpf_iter_bpf_map
  12 dump_bpf_prog   bpf_iter_bpf_prog
  27 test_pkt_access
  32 test_main       test_pkt_access test_pkt_access
  33 test_subprog1   test_pkt_access_subprog1 test_pkt_access
  34 test_subprog2   test_pkt_access_subprog2 test_pkt_access
  35 test_subprog3   test_pkt_access_subprog3 test_pkt_access
  36 new_get_skb_len get_skb_len test_pkt_access
  37 new_get_skb_ifindex get_skb_ifindex test_pkt_access
  38 new_get_constant get_constant test_pkt_access

The BPF program dump_bpf_prog() in iterators.bpf.c is printing this data about
all BPF programs currently loaded in the system. This information is unstable
and will change from kernel to kernel as ".debug" suffix conveys.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200819042759.51280-4-alexei.starovoitov@gmail.com
2020-08-20 16:02:36 +02:00
Yonghong Song 367ec3e483 bpf: Create file bpf iterator
To produce a file bpf iterator, the fd must be
corresponding to a link_fd associated with a
trace/iter program. When the pinned file is
opened, a seq_file will be generated.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175906.2475893-1-yhs@fb.com
2020-05-09 17:05:26 -07:00
Andrii Nakryiko 70ed506c3b bpf: Introduce pinnable bpf_link abstraction
Introduce bpf_link abstraction, representing an attachment of BPF program to
a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
ownership of attached BPF program, reference counting of a link itself, when
referenced from multiple anonymous inodes, as well as ensures that release
callback will be called from a process context, so that users can safely take
mutex locks and sleep.

Additionally, with a new abstraction it's now possible to generalize pinning
of a link object in BPF FS. Pinning a link in a BPF FS explicitly prevents BPF
program detachment on process exit and lets another, independent process open
the link to keep working with it.

Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
program attachments) into utilizing bpf_link framework, making them pinnable
in BPF FS. More FD-based bpf_links will be added in follow up patches.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
2020-03-02 22:06:27 -08:00
Linus Torvalds c9d35ee049 Merge branch 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs file system parameter updates from Al Viro:
 "Saner fs_parser.c guts and data structures. The system-wide registry
  of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
  the horror switch() in fs_parse() that would have to grow another case
  every time something got added to that system-wide registry.

  New syntax types can be added by filesystems easily now, and their
  namespace is that of functions - not of system-wide enum members. IOW,
  they can be shared or kept private and if some turn out to be widely
  useful, we can make them common library helpers, etc., without having
  to do anything whatsoever to fs_parse() itself.

  And we already get that kind of requests - the thing that finally
  pushed me into doing that was "oh, and let's add one for timeouts -
  things like 15s or 2h". If some filesystem really wants that, let them
  do it. Without somebody having to play gatekeeper for the variants
  blessed by direct support in fs_parse(), TYVM.

  Quite a bit of boilerplate is gone. And IMO the data structures make a
  lot more sense now. -200LoC, while we are at it"
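
The idea of syntax types living in the namespace of functions can be sketched in user space as follows. This is only an illustrative model, not the kernel's fs_parse() interface: param_type_fn, struct param_spec, param_is_oct(), param_is_timeout() and parse_param() are all hypothetical names. It shows a filesystem adding a private "15s or 2h" timeout type by writing one more function, without touching the lookup code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model: a "syntax type" is just a function pointer. */
typedef int (*param_type_fn)(const char *value, long *result);

struct param_spec {
	const char *name;	/* option name, e.g. "mode" */
	param_type_fn type;	/* parser for this option's value */
};

/* Could be a shared helper: octal number such as "755". */
static int param_is_oct(const char *value, long *result)
{
	char *end;

	*result = strtol(value, &end, 8);
	return (end != value && *end == '\0') ? 0 : -1;
}

/* Private helper: timeout such as "15s" or "2h", stored in seconds. */
static int param_is_timeout(const char *value, long *result)
{
	char *end;
	long n = strtol(value, &end, 10);

	if (end == value || n < 0)
		return -1;
	if (strcmp(end, "s") == 0)
		*result = n;
	else if (strcmp(end, "m") == 0)
		*result = n * 60;
	else if (strcmp(end, "h") == 0)
		*result = n * 3600;
	else
		return -1;	/* missing or unknown unit */
	return 0;
}

static const struct param_spec specs[] = {
	{ "mode",    param_is_oct },
	{ "timeout", param_is_timeout },
};

/* Find the key in the table and let its type function parse the value. */
static int parse_param(const char *key, const char *value, long *result)
{
	assert(key && value && result);
	for (size_t i = 0; i < sizeof(specs) / sizeof(specs[0]); i++)
		if (strcmp(specs[i].name, key) == 0)
			return specs[i].type(value, result);
	return -1;	/* unknown parameter */
}
```

In this sketch parse_param() never needs a new case when a type is added; only the table and the new function change.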

* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
  tmpfs: switch to use of invalfc()
  cgroup1: switch to use of errorfc() et.al.
  procfs: switch to use of invalfc()
  hugetlbfs: switch to use of invalfc()
  cramfs: switch to use of errofc() et.al.
  gfs2: switch to use of errorfc() et.al.
  fuse: switch to use errorfc() et.al.
  ceph: use errorfc() and friends instead of spelling the prefix out
  prefix-handling analogues of errorf() and friends
  turn fs_param_is_... into functions
  fs_parse: handle optional arguments sanely
  fs_parse: fold fs_parameter_desc/fs_parameter_spec
  fs_parser: remove fs_parameter_description name field
  add prefix to fs_context->log
  ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
  new primitive: __fs_parse()
  switch rbd and libceph to p_log-based primitives
  struct p_log, variants of warnf() et.al. taking that one instead
  teach logfc() to handle prefices, give it saner calling conventions
  get rid of cg_invalf()
  ...
2020-02-08 13:26:41 -08:00
Al Viro d7167b1499 fs_parse: fold fs_parameter_desc/fs_parameter_spec
The former contains nothing but a pointer to an array of the latter...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:37 -05:00
Eric Sandeen 96cafb9ccb fs_parser: remove fs_parameter_description name field
Unused now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:36 -05:00
Vasily Averin 90435a7891 bpf: map_seq_next should always increase position index
If seq_file .next function does not change position index,
read after some lseek can generate an unexpected output.
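
A toy user-space model of the problem, assuming a simplified iterator rather than the real seq_file interface (struct toy_iter, toy_start(), toy_next_buggy(), toy_next_fixed() and resume_pos() are hypothetical names). The buggy variant leaves the position index untouched when it reaches the end, so a reader that resumes from that index lands on the last element again in this model; the fixed variant always advances it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a seq_file-style iterator over n items. */
struct toy_iter {
	const int *items;
	size_t n;
};

typedef const int *(*toy_next_fn)(const struct toy_iter *it, long *pos);

/* Map a position index to an element, or NULL past the end. */
static const int *toy_start(const struct toy_iter *it, long pos)
{
	assert(pos >= 0);
	return (size_t)pos < it->n ? &it->items[pos] : NULL;
}

/* Buggy: at the last element, returns NULL without advancing *pos. */
static const int *toy_next_buggy(const struct toy_iter *it, long *pos)
{
	if ((size_t)(*pos + 1) >= it->n)
		return NULL;
	++*pos;
	return &it->items[*pos];
}

/* Fixed: always advance *pos, even when the sequence ends. */
static const int *toy_next_fixed(const struct toy_iter *it, long *pos)
{
	++*pos;
	return toy_start(it, *pos);
}

/* Walk from position 0 to the end; return the index a resumed read would use. */
static long resume_pos(const struct toy_iter *it, toy_next_fn next)
{
	long pos = 0;
	const int *v = toy_start(it, pos);

	while (v)
		v = next(it, &pos);
	return pos;
}
```

With three items, resume_pos() yields 2 for the buggy variant (still a valid element) and 3 for the fixed one (past the end).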

See also: https://bugzilla.kernel.org/show_bug.cgi?id=206283

v1 -> v2: removed missed increment in end of function

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/eca84fdd-c374-a154-d874-6c7b55fc3bc4@virtuozzo.com
2020-01-27 10:54:32 +01:00
Al Viro b87121dd3f bpf: don't bother with getname/kern_path - use user_path_at
kernel/bpf/inode.c misuses kern_path...() - it's much simpler (and
more efficient, on top of that) to use user_path...() counterparts
rather than bothering with doing getname() manually.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200120232858.GF8904@ZenIV.linux.org.uk
2020-01-21 23:46:21 +01:00
Andrii Nakryiko 85192dbf4d bpf: Convert bpf_prog refcnt to atomic64_t
Similarly to bpf_map's refcnt/usercnt, convert bpf_prog's refcnt to atomic64
and remove artificial 32k limit. This allows to make bpf_prog's refcounting
non-failing, simplifying logic of users of bpf_prog_add/bpf_prog_inc.

Validated compilation by running allyesconfig kernel build.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-3-andriin@fb.com
2019-11-18 11:41:59 +01:00
Andrii Nakryiko 1e0bd5a091 bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails
92117d8443 ("bpf: fix refcnt overflow") turned refcounting of bpf_map into
potentially failing operation, when refcount reaches BPF_MAX_REFCNT limit
(32k). Due to using 32-bit counter, it's possible in practice to overflow
refcounter and make it wrap around to 0, causing erroneous map free, while
there are still references to it, causing use-after-free problems.

But having failing refcounting operations is problematic in some cases. One
example is mmap() interface. After establishing initial memory-mapping, user
is allowed to arbitrarily map/remap/unmap parts of mapped memory, arbitrarily
splitting it into multiple non-contiguous regions. All this happening without
any control from the users of mmap subsystem. Rather mmap subsystem sends
notifications to original creator of memory mapping through open/close
callbacks, which are optionally specified during initial memory mapping
creation. These callbacks are used to maintain accurate refcount for bpf_map
(see next patch in this series). The problem is that open() callback is not
supposed to fail, because memory-mapped resource is set up and properly
referenced. This is posing a problem for using memory-mapping with BPF maps.

One solution to this is to maintain separate refcount for just memory-mappings
and do single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
There are similar use cases in current work on tcp-bpf, necessitating extra
counter as well. This seems like a rather unfortunate and ugly solution that
doesn't scale well to various new use cases.

Another approach to solve this is to use non-failing refcount_t type, which
uses 32-bit counter internally, but, once reaching overflow state at UINT_MAX,
stays there. This ultimately causes memory leak, but prevents use after free.

But given refcounting is not the most performance-critical operation with BPF
maps (it's not used from running BPF program code), we can also just switch to
64-bit counter that can't overflow in practice, potentially disadvantaging
32-bit platforms a tiny bit. This simplifies semantics and allows above
described scenarios to not worry about failing refcount increment operation.
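
The difference between the two schemes can be sketched in user space with C11 atomics. This is a simplified model, not the kernel code: struct toy_obj32, struct toy_obj64, toy_inc_capped(), toy_inc64() and TOY_MAX_REFCNT are hypothetical names, with the cap chosen to mirror the 32k limit described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_REFCNT 32768	/* models the 32k limit described above */

struct toy_obj32 { atomic_int refcnt; };
struct toy_obj64 { atomic_int_least64_t refcnt; };

/* Capped 32-bit counter: the increment can fail, so every caller must check. */
static bool toy_inc_capped(struct toy_obj32 *o)
{
	assert(o != NULL);
	if (atomic_fetch_add(&o->refcnt, 1) + 1 > TOY_MAX_REFCNT) {
		atomic_fetch_sub(&o->refcnt, 1);	/* undo, report failure */
		return false;
	}
	return true;
}

/* 64-bit counter: cannot overflow in practice, so there is no failure path. */
static void toy_inc64(struct toy_obj64 *o)
{
	assert(o != NULL);
	atomic_fetch_add(&o->refcnt, 1);
}
```

A callback that is not allowed to fail, such as the mmap open() case above, can call the second variant unconditionally, whereas the first forces an error path onto every caller.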

In terms of struct bpf_map size, we are still good and use the same amount of
space:

BEFORE (3 cache lines, 8 bytes of padding at the end):
struct bpf_map {
	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
	struct bpf_map *           inner_map_meta;       /*     8     8 */
	void *                     security;             /*    16     8 */
	enum bpf_map_type  map_type;                     /*    24     4 */
	u32                        key_size;             /*    28     4 */
	u32                        value_size;           /*    32     4 */
	u32                        max_entries;          /*    36     4 */
	u32                        map_flags;            /*    40     4 */
	int                        spin_lock_off;        /*    44     4 */
	u32                        id;                   /*    48     4 */
	int                        numa_node;            /*    52     4 */
	u32                        btf_key_type_id;      /*    56     4 */
	u32                        btf_value_type_id;    /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct btf *               btf;                  /*    64     8 */
	struct bpf_map_memory memory;                    /*    72    16 */
	bool                       unpriv_array;         /*    88     1 */
	bool                       frozen;               /*    89     1 */

	/* XXX 38 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	atomic_t                   refcnt __attribute__((__aligned__(64))); /*   128     4 */
	atomic_t                   usercnt;              /*   132     4 */
	struct work_struct work;                         /*   136    32 */
	char                       name[16];             /*   168    16 */

	/* size: 192, cachelines: 3, members: 21 */
	/* sum members: 146, holes: 1, sum holes: 38 */
	/* padding: 8 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
} __attribute__((__aligned__(64)));

AFTER (same 3 cache lines, no extra padding now):
struct bpf_map {
	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
	struct bpf_map *           inner_map_meta;       /*     8     8 */
	void *                     security;             /*    16     8 */
	enum bpf_map_type  map_type;                     /*    24     4 */
	u32                        key_size;             /*    28     4 */
	u32                        value_size;           /*    32     4 */
	u32                        max_entries;          /*    36     4 */
	u32                        map_flags;            /*    40     4 */
	int                        spin_lock_off;        /*    44     4 */
	u32                        id;                   /*    48     4 */
	int                        numa_node;            /*    52     4 */
	u32                        btf_key_type_id;      /*    56     4 */
	u32                        btf_value_type_id;    /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	struct btf *               btf;                  /*    64     8 */
	struct bpf_map_memory memory;                    /*    72    16 */
	bool                       unpriv_array;         /*    88     1 */
	bool                       frozen;               /*    89     1 */

	/* XXX 38 bytes hole, try to pack */

	/* --- cacheline 2 boundary (128 bytes) --- */
	atomic64_t                 refcnt __attribute__((__aligned__(64))); /*   128     8 */
	atomic64_t                 usercnt;              /*   136     8 */
	struct work_struct work;                         /*   144    32 */
	char                       name[16];             /*   176    16 */

	/* size: 192, cachelines: 3, members: 21 */
	/* sum members: 154, holes: 1, sum holes: 38 */
	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
} __attribute__((__aligned__(64)));
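
The size arithmetic for the last cache line can be checked with a reduced model. The structs below are hypothetical stand-ins that keep only the members starting at offset 128 in the dumps above, with a 32-byte char array in place of struct work_struct and plain integers in place of the atomic types. Widening the two counters moves name from relative offset 40 to 48 and consumes exactly the 8 bytes that used to be tail padding.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the last cache line BEFORE the change. */
struct tail_before {
	alignas(64) int32_t refcnt;	/* relative offset  0, 4 bytes */
	int32_t usercnt;		/* relative offset  4, 4 bytes */
	char work[32];			/* stand-in for struct work_struct */
	char name[16];
};

/* Hypothetical model of the last cache line AFTER the change. */
struct tail_after {
	alignas(64) int64_t refcnt;	/* relative offset  0, 8 bytes */
	int64_t usercnt;		/* relative offset  8, 8 bytes */
	char work[32];			/* stand-in for struct work_struct */
	char name[16];
};

/* 168 - 128 = 40 and 176 - 128 = 48 in the dumps above. */
static_assert(offsetof(struct tail_before, name) == 40, "name offset before");
static_assert(offsetof(struct tail_after, name) == 48, "name offset after");

/* 56 bytes of members + 8 padding versus 64 bytes of members, no padding. */
static_assert(sizeof(struct tail_before) == 64, "one cache line before");
static_assert(sizeof(struct tail_after) == 64, "one cache line after");
```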

This patch, while modifying all users of bpf_map_inc, also cleans up its
interface to match bpf_map_put with separate operations for bpf_map_inc and
bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref,
respectively). Also, given there are no users of bpf_map_inc_not_zero
specifying uref=true, remove uref flag and default to uref=false internally.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com
2019-11-18 11:41:59 +01:00
David Howells d2935de7e4 vfs: Convert bpf to use the new mount API
Convert the bpf filesystem to the new internal mount API as the old
one will be obsoleted and removed.  This allows greater flexibility in
communication of mount parameters between userspace, the VFS and the
filesystem.

See Documentation/filesystems/mount_api.txt for more information.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexei Starovoitov <ast@kernel.org>
cc: Daniel Borkmann <daniel@iogearbox.net>
cc: Martin KaFai Lau <kafai@fb.com>
cc: Song Liu <songliubraving@fb.com>
cc: Yonghong Song <yhs@fb.com>
cc: netdev@vger.kernel.org
cc: bpf@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-18 22:35:31 -04:00
Thomas Gleixner d2912cb15b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
Based on 2 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:55 +02:00
Chenbo Feng e547ff3f80 bpf: relax inode permission check for retrieving bpf program
When the iptables module loads a bpf program from a pinned location, it
only retrieves a loaded program and cannot change the program content, so
requiring write permission for it might not be necessary.
Also, when adding or removing an unrelated iptables rule, it might need to
flush and reload the xt_bpf related rules as well, which triggers the inode
permission check. It might be better to remove the write permission
check for the inode so we won't need to grant write access to all the
processes that flush and restore iptables rules.

Signed-off-by: Chenbo Feng <fengc@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-05-16 11:31:49 -07:00
Al Viro 524845ff9c bpf: switch to ->free_inode()
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-01 22:43:26 -04:00
Daniel Borkmann 1da6c4d914 bpf: fix use after free in bpf_evict_inode
syzkaller was able to generate the following UAF in bpf:

  BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline]
  BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
  Read of size 1 at addr ffff8801c4865c47 by task syz-executor2/9423

  CPU: 0 PID: 9423 Comm: syz-executor2 Not tainted 4.20.0-rc1-next-20181109+
  #110
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
  Google 01/01/2011
  Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x244/0x39d lib/dump_stack.c:113
    print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
    kasan_report_error mm/kasan/report.c:354 [inline]
    kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
    __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
    lookup_last fs/namei.c:2269 [inline]
    path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
    filename_lookup+0x26a/0x520 fs/namei.c:2348
    user_path_at_empty+0x40/0x50 fs/namei.c:2608
    user_path include/linux/namei.h:62 [inline]
    do_mount+0x180/0x1ff0 fs/namespace.c:2980
    ksys_mount+0x12d/0x140 fs/namespace.c:3258
    __do_sys_mount fs/namespace.c:3272 [inline]
    __se_sys_mount fs/namespace.c:3269 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3269
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x457569
  Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
  48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
  ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
  RSP: 002b:00007fde6ed96c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457569
  RDX: 0000000020000040 RSI: 0000000020000000 RDI: 0000000000000000
  RBP: 000000000072bf00 R08: 0000000020000340 R09: 0000000000000000
  R10: 0000000000200000 R11: 0000000000000246 R12: 00007fde6ed976d4
  R13: 00000000004c2c24 R14: 00000000004d4990 R15: 00000000ffffffff

  Allocated by task 9424:
    save_stack+0x43/0xd0 mm/kasan/kasan.c:448
    set_track mm/kasan/kasan.c:460 [inline]
    kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
    __do_kmalloc mm/slab.c:3722 [inline]
    __kmalloc_track_caller+0x157/0x760 mm/slab.c:3737
    kstrdup+0x39/0x70 mm/util.c:49
    bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356
    vfs_symlink+0x37a/0x5d0 fs/namei.c:4127
    do_symlinkat+0x242/0x2d0 fs/namei.c:4154
    __do_sys_symlink fs/namei.c:4173 [inline]
    __se_sys_symlink fs/namei.c:4171 [inline]
    __x64_sys_symlink+0x59/0x80 fs/namei.c:4171
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

  Freed by task 9425:
    save_stack+0x43/0xd0 mm/kasan/kasan.c:448
    set_track mm/kasan/kasan.c:460 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
    kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
    __cache_free mm/slab.c:3498 [inline]
    kfree+0xcf/0x230 mm/slab.c:3817
    bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565
    evict+0x4b9/0x980 fs/inode.c:558
    iput_final fs/inode.c:1550 [inline]
    iput+0x674/0xa90 fs/inode.c:1576
    do_unlinkat+0x733/0xa30 fs/namei.c:4069
    __do_sys_unlink fs/namei.c:4110 [inline]
    __se_sys_unlink fs/namei.c:4108 [inline]
    __x64_sys_unlink+0x42/0x50 fs/namei.c:4108
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

In this scenario path lookup under RCU is racing with the final
unlink in case of symlinks. As Linus puts it in his analysis:

  [...] We actually RCU-delay the inode freeing itself, but
  when we do the final iput(), the "evict()" function is called
  synchronously. Now, the simple fix would seem to just RCU-delay
  the kfree() of the symlink data in bpf_evict_inode(). Maybe
  that's the right thing to do. [...]

Al suggested to piggy-back on the ->destroy_inode() callback in
order to implement RCU deferral there which can then kfree() the
inode->i_link eventually right before putting inode back into
inode cache. By reusing free_inode_nonrcu() from there we can
avoid the need for our own inode cache and just reuse generic
one as we currently do.

And in fact on top of all this we should just get rid of the
bpf_evict_inode() entirely. This means truncate_inode_pages_final()
and clear_inode() will then simply be called by the fs core via
evict(). Dropping the reference should really only be done when
inode is unhashed and nothing reachable anymore, so it's better
also moved into the final ->destroy_inode() callback.

Fixes: 0f98621bef ("bpf, inode: add support for symlinks and fix mtime/ctime")
Reported-by: syzbot+fb731ca573367b7f6564@syzkaller.appspotmail.com
Reported-by: syzbot+a13e5ead792d6df37818@syzkaller.appspotmail.com
Reported-by: syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/lkml/0000000000006946d2057bbd0eef@google.com/T/
2019-03-26 01:38:49 +01:00