linux-kernelorg-stable/include/uapi/linux/fcntl.h

163 lines
5.8 KiB
C
Raw Normal View History

License cleanup: add SPDX license identifier to uapi header files with no license Many user space API headers are missing licensing information, which makes it hard for compliance tools to determine the correct license. By default are files without license information under the default license of the kernel, which is GPLV2. Marking them GPLV2 would exclude them from being included in non GPLV2 code, which is obviously not intended. The user space API headers fall under the syscall exception which is in the kernels COPYING file: NOTE! This copyright does *not* cover user programs that use kernel services by normal system calls - this is merely considered normal use of the kernel, and does *not* fall under the heading of "derived work". otherwise syscall usage would not be possible. Update the files which contain no license information with an SPDX license identifier. The chosen identifier is 'GPL-2.0 WITH Linux-syscall-note' which is the officially assigned identifier for the Linux syscall exception. SPDX license identifiers are a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. See the previous patch in this series for the methodology of how this patch was researched. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 14:08:43 +00:00
/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
#ifndef _UAPI_LINUX_FCNTL_H
#define _UAPI_LINUX_FCNTL_H
#include <asm/fcntl.h>
open: introduce openat2(2) syscall /* Background. */ For a very long time, extending openat(2) with new features has been incredibly frustrating. This stems from the fact that openat(2) is possibly the most famous counter-example to the mantra "don't silently accept garbage from userspace" -- it doesn't check whether unknown flags are present[1]. This means that (generally) the addition of new flags to openat(2) has been fraught with backwards-compatibility issues (O_TMPFILE has to be defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old kernels gave errors, since it's insecure to silently ignore the flag[2]). All new security-related flags therefore have a tough road to being added to openat(2). Userspace also has a hard time figuring out whether a particular flag is supported on a particular kernel. While it is now possible with contemporary kernels (thanks to [3]), older kernels will expose unknown flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during openat(2) time matches modern syscall designs and is far more fool-proof. In addition, the newly-added path resolution restriction LOOKUP flags (which we would like to expose to user-space) don't feel related to the pre-existing O_* flag set -- they affect all components of path lookup. We'd therefore like to add a new flag argument. Adding a new syscall allows us to finally fix the flag-ignoring problem, and we can make it extensible enough so that we will hopefully never need an openat3(2). /* Syscall Prototype. */ /* * open_how is an extensible structure (similar in interface to * clone3(2) or sched_setattr(2)). The size parameter must be set to * sizeof(struct open_how), to allow for future extensions. All future * extensions will be appended to open_how, with their zero value * acting as a no-op default. */ struct open_how { /* ... */ }; int openat2(int dfd, const char *pathname, struct open_how *how, size_t size); /* Description. */ The initial version of 'struct open_how' contains the following fields: flags Used to specify openat(2)-style flags. However, any unknown flag bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR) will result in -EINVAL. In addition, this field is 64-bits wide to allow for more O_ flags than currently permitted with openat(2). mode The file mode for O_CREAT or O_TMPFILE. Must be set to zero if flags does not contain O_CREAT or O_TMPFILE. resolve Restrict path resolution (in contrast to O_* flags they affect all path components). The current set of flags are as follows (at the moment, all of the RESOLVE_ flags are implemented as just passing the corresponding LOOKUP_ flag). RESOLVE_NO_XDEV => LOOKUP_NO_XDEV RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS RESOLVE_BENEATH => LOOKUP_BENEATH RESOLVE_IN_ROOT => LOOKUP_IN_ROOT open_how does not contain an embedded size field, because it is of little benefit (userspace can figure out the kernel open_how size at runtime fairly easily without it). It also only contains u64s (even though ->mode arguably should be a u16) to avoid having padding fields which are never used in the future. Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE is no longer permitted for openat(2). As far as I can tell, this has always been a bug and appears to not be used by userspace (and I've not seen any problems on my machines by disallowing it). If it turns out this breaks something, we can special-case it and only permit it for openat(2) but not openat2(2). After input from Florian Weimer, the new open_how and flag definitions are inside a separate header from uapi/linux/fcntl.h, to avoid problems that glibc has with importing that header. /* Testing. */ In a follow-up patch there are over 200 selftests which ensure that this syscall has the correct semantics and will correctly handle several attack scenarios. In addition, I've written a userspace library[4] which provides convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care must be taken when using RESOLVE_IN_ROOT'd file descriptors with other syscalls). During the development of this patch, I've run numerous verification tests using libpathrs (showing that the API is reasonably usable by userspace). /* Future Work. */ Additional RESOLVE_ flags have been suggested during the review period. These can be easily implemented separately (such as blocking auto-mount during resolution). Furthermore, there are some other proposed changes to the openat(2) interface (the most obvious example is magic-link hardening[5]) which would be a good opportunity to add a way for userspace to restrict how O_PATH file descriptors can be re-opened. Another possible avenue of future work would be some kind of CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace which openat2(2) flags and fields are supported by the current kernel (to avoid userspace having to go through several guesses to figure it out). [1]: https://lwn.net/Articles/588444/ [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com [3]: commit 629e014bb834 ("fs: completely ignore unknown open flags") [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523 [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/ [6]: https://youtu.be/ggD-eb3yPVs Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-01-18 12:07:59 +00:00
#include <linux/openat2.h>
#define F_SETLEASE (F_LINUX_SPECIFIC_BASE + 0)
#define F_GETLEASE (F_LINUX_SPECIFIC_BASE + 1)
/*
* Request nofications on a directory.
* See below for events that may be notified.
*/
#define F_NOTIFY (F_LINUX_SPECIFIC_BASE + 2)
#define F_DUPFD_QUERY (F_LINUX_SPECIFIC_BASE + 3)
fcntl: add F_CREATED_QUERY Systemd has a helper called openat_report_new() that returns whether a file was created anew or it already existed before for cases where O_CREAT has to be used without O_EXCL (cf. [1]). That apparently isn't something that's specific to systemd but it's where I noticed it. The current logic is that it first attempts to open the file without O_CREAT | O_EXCL and if it gets ENOENT the helper tries again with both flags. If that succeeds all is well. If it now reports EEXIST it retries. That works fairly well but some corner cases make this more involved. If this operates on a dangling symlink the first openat() without O_CREAT | O_EXCL will return ENOENT but the second openat() with O_CREAT | O_EXCL will fail with EEXIST. The reason is that openat() without O_CREAT | O_EXCL follows the symlink while O_CREAT | O_EXCL doesn't for security reasons. So it's not something we can really change unless we add an explicit opt-in via O_FOLLOW which seems really ugly. The caller could try and use fanotify() to register to listen for creation events in the directory before calling openat(). The caller could then compare the returned tid to its own tid to ensure that even in threaded environments it actually created the file. That might work but is a lot of work for something that should be fairly simple and I'm uncertain about it's reliability. The caller could use a bpf lsm hook to hook into security_file_open() to figure out whether they created the file. That also seems a bit wild. So let's add F_CREATED_QUERY which allows the caller to check whether they actually did create the file. That has caveats of course but I don't think they are problematic: * In multi-threaded environments a thread can only be sure that it did create the file if it calls openat() with O_CREAT. In other words, it's obviously not enough to just go through it's fdtable and check these fds because another thread could've created the file. * If there's any codepaths where an openat() with O_CREAT would yield the same struct file as that of another thread it would obviously cause wrong results. I'm not aware of any such codepaths from openat() itself. Imho, that would be a bug. * Related to the previous point, calling the new fcntl() on files created and opened via special-purpose system calls or ioctl()s would cause wrong results only if the affected subsystem a) raises FMODE_CREATED and b) may return the same struct file for two different calls. I'm not seeing anything outside of regular VFS code that raises FMODE_CREATED. There is code for b) in e.g., the drm layer where the same struct file is resurfaced but again FMODE_CREATED isn't used and it would be very misleading if it did. Link: https://github.com/systemd/systemd/blob/11d5e2b5fbf9f6bfa5763fd45b56829ad4f0777f/src/basic/fs-util.c#L1078 [1] Link: https://lore.kernel.org/r/20240724-work-fcntl-v1-1-e8153a2f1991@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-24 13:15:35 +00:00
/* Was the file just created? */
#define F_CREATED_QUERY (F_LINUX_SPECIFIC_BASE + 4)
/*
* Cancel a blocking posix lock; internal use only until we expose an
* asynchronous lock api to userspace:
*/
#define F_CANCELLK (F_LINUX_SPECIFIC_BASE + 5)
/* Create a file descriptor with FD_CLOEXEC set. */
#define F_DUPFD_CLOEXEC (F_LINUX_SPECIFIC_BASE + 6)
/*
* Set and get of pipe page size array
*/
#define F_SETPIPE_SZ (F_LINUX_SPECIFIC_BASE + 7)
#define F_GETPIPE_SZ (F_LINUX_SPECIFIC_BASE + 8)
shm: add sealing API If two processes share a common memory region, they usually want some guarantees to allow safe access. This often includes: - one side cannot overwrite data while the other reads it - one side cannot shrink the buffer while the other accesses it - one side cannot grow the buffer beyond previously set boundaries If there is a trust-relationship between both parties, there is no need for policy enforcement. However, if there's no trust relationship (eg., for general-purpose IPC) sharing memory-regions is highly fragile and often not possible without local copies. Look at the following two use-cases: 1) A graphics client wants to share its rendering-buffer with a graphics-server. The memory-region is allocated by the client for read/write access and a second FD is passed to the server. While scanning out from the memory region, the server has no guarantee that the client doesn't shrink the buffer at any time, requiring rather cumbersome SIGBUS handling. 2) A process wants to perform an RPC on another process. To avoid huge bandwidth consumption, zero-copy is preferred. After a message is assembled in-memory and a FD is passed to the remote side, both sides want to be sure that neither modifies this shared copy, anymore. The source may have put sensible data into the message without a separate copy and the target may want to parse the message inline, to avoid a local copy. While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide ways to achieve most of this, the first one is unproportionally ugly to use in libraries and the latter two are broken/racy or even disabled due to denial of service attacks. This patch introduces the concept of SEALING. If you seal a file, a specific set of operations is blocked on that file forever. Unlike locks, seals can only be set, never removed. Hence, once you verified a specific set of seals is set, you're guaranteed that no-one can perform the blocked operations on this file, anymore. An initial set of SEALS is introduced by this patch: - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced in size. This affects ftruncate() and open(O_TRUNC). - GROW: If SEAL_GROW is set, the file in question cannot be increased in size. This affects ftruncate(), fallocate() and write(). - WRITE: If SEAL_WRITE is set, no write operations (besides resizing) are possible. This affects fallocate(PUNCH_HOLE), mmap() and write(). - SEAL: If SEAL_SEAL is set, no further seals can be added to a file. This basically prevents the F_ADD_SEAL operation on a file and can be set to prevent others from adding further seals that you don't want. The described use-cases can easily use these seals to provide safe use without any trust-relationship: 1) The graphics server can verify that a passed file-descriptor has SEAL_SHRINK set. This allows safe scanout, while the client is allowed to increase buffer size for window-resizing on-the-fly. Concurrent writes are explicitly allowed. 2) For general-purpose IPC, both processes can verify that SEAL_SHRINK, SEAL_GROW and SEAL_WRITE are set. This guarantees that neither process can modify the data while the other side parses it. Furthermore, it guarantees that even with writable FDs passed to the peer, it cannot increase the size to hit memory-limits of the source process (in case the file-storage is accounted to the source). The new API is an extension to fcntl(), adding two new commands: F_GET_SEALS: Return a bitset describing the seals on the file. This can be called on any FD if the underlying file supports sealing. F_ADD_SEALS: Change the seals of a given file. This requires WRITE access to the file and F_SEAL_SEAL may not already be set. Furthermore, the underlying file must support sealing and there may not be any existing shared mapping of that file. Otherwise, EBADF/EPERM is returned. The given seals are _added_ to the existing set of seals on the file. You cannot remove seals again. The fcntl() handler is currently specific to shmem and disabled on all files. A file needs to explicitly support sealing for this interface to work. A separate syscall is added in a follow-up, which creates files that support sealing. There is no intention to support this on other file-systems. Semantics are unclear for non-volatile files and we lack any use-case right now. Therefore, the implementation is specific to shmem. Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ryan Lortie <desrt@desrt.ca> Cc: Lennart Poettering <lennart@poettering.net> Cc: Daniel Mack <zonque@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 21:25:27 +00:00
/*
* Set/Get seals
*/
#define F_ADD_SEALS (F_LINUX_SPECIFIC_BASE + 9)
#define F_GET_SEALS (F_LINUX_SPECIFIC_BASE + 10)
/*
* Types of seals
*/
#define F_SEAL_SEAL 0x0001 /* prevent further seals from being set */
#define F_SEAL_SHRINK 0x0002 /* prevent file from shrinking */
#define F_SEAL_GROW 0x0004 /* prevent file from growing */
#define F_SEAL_WRITE 0x0008 /* prevent writes */
mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd Android uses ashmem for sharing memory regions. We are looking forward to migrating all usecases of ashmem to memfd so that we can possibly remove the ashmem driver in the future from staging while also benefiting from using memfd and contributing to it. Note staging drivers are also not ABI and generally can be removed at anytime. One of the main usecases Android has is the ability to create a region and mmap it as writeable, then add protection against making any "future" writes while keeping the existing already mmap'ed writeable-region active. This allows us to implement a usecase where receivers of the shared memory buffer can get a read-only view, while the sender continues to write to the buffer. See CursorWindow documentation in Android for more details: https://developer.android.com/reference/android/database/CursorWindow This usecase cannot be implemented with the existing F_SEAL_WRITE seal. To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal which prevents any future mmap and write syscalls from succeeding while keeping the existing mmap active. A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week where we don't need to modify core VFS structures to get the same behavior of the seal. This solves several side-effects pointed by Andy. self-tests are provided in later patch to verify the expected semantics. [1] https://lore.kernel.org/lkml/20181111173650.GA256781@google.com/ Thanks a lot to Andy for suggestions to improve code. Link: http://lkml.kernel.org/r/20190112203816.85534-2-joel@joelfernandes.org Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: J. Bruce Fields <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc-Andr Lureau <marcandre.lureau@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-05 23:47:54 +00:00
#define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */
mm/memfd: add F_SEAL_EXEC Patch series "mm/memfd: introduce MFD_NOEXEC_SEAL and MFD_EXEC", v8. Since Linux introduced the memfd feature, memfd have always had their execute bit set, and the memfd_create() syscall doesn't allow setting it differently. However, in a secure by default system, such as ChromeOS, (where all executables should come from the rootfs, which is protected by Verified boot), this executable nature of memfd opens a door for NoExec bypass and enables “confused deputy attack”. E.g, in VRP bug [1]: cros_vm process created a memfd to share the content with an external process, however the memfd is overwritten and used for executing arbitrary code and root escalation. [2] lists more VRP in this kind. On the other hand, executable memfd has its legit use, runc uses memfd’s seal and executable feature to copy the contents of the binary then execute them, for such system, we need a solution to differentiate runc's use of executable memfds and an attacker's [3]. To address those above, this set of patches add following: 1> Let memfd_create() set X bit at creation time. 2> Let memfd to be sealed for modifying X bit. 3> A new pid namespace sysctl: vm.memfd_noexec to control the behavior of X bit.For example, if a container has vm.memfd_noexec=2, then memfd_create() without MFD_NOEXEC_SEAL will be rejected. 4> A new security hook in memfd_create(). This make it possible to a new LSM, which rejects or allows executable memfd based on its security policy. This patch (of 5): The new F_SEAL_EXEC flag will prevent modification of the exec bits: written as traditional octal mask, 0111, or as named flags, S_IXUSR | S_IXGRP | S_IXOTH. Any chmod(2) or similar call that attempts to modify any of these bits after the seal is applied will fail with errno EPERM. This will preserve the execute bits as they are at the time of sealing, so the memfd will become either permanently executable or permanently un-executable. Link: https://lkml.kernel.org/r/20221215001205.51969-1-jeffxu@google.com Link: https://lkml.kernel.org/r/20221215001205.51969-2-jeffxu@google.com Signed-off-by: Daniel Verkamp <dverkamp@chromium.org> Co-developed-by: Jeff Xu <jeffxu@google.com> Signed-off-by: Jeff Xu <jeffxu@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-15 00:12:01 +00:00
#define F_SEAL_EXEC 0x0020 /* prevent chmod modifying exec bits */
shm: add sealing API If two processes share a common memory region, they usually want some guarantees to allow safe access. This often includes: - one side cannot overwrite data while the other reads it - one side cannot shrink the buffer while the other accesses it - one side cannot grow the buffer beyond previously set boundaries If there is a trust-relationship between both parties, there is no need for policy enforcement. However, if there's no trust relationship (eg., for general-purpose IPC) sharing memory-regions is highly fragile and often not possible without local copies. Look at the following two use-cases: 1) A graphics client wants to share its rendering-buffer with a graphics-server. The memory-region is allocated by the client for read/write access and a second FD is passed to the server. While scanning out from the memory region, the server has no guarantee that the client doesn't shrink the buffer at any time, requiring rather cumbersome SIGBUS handling. 2) A process wants to perform an RPC on another process. To avoid huge bandwidth consumption, zero-copy is preferred. After a message is assembled in-memory and a FD is passed to the remote side, both sides want to be sure that neither modifies this shared copy, anymore. The source may have put sensible data into the message without a separate copy and the target may want to parse the message inline, to avoid a local copy. While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide ways to achieve most of this, the first one is unproportionally ugly to use in libraries and the latter two are broken/racy or even disabled due to denial of service attacks. This patch introduces the concept of SEALING. If you seal a file, a specific set of operations is blocked on that file forever. Unlike locks, seals can only be set, never removed. Hence, once you verified a specific set of seals is set, you're guaranteed that no-one can perform the blocked operations on this file, anymore. An initial set of SEALS is introduced by this patch: - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced in size. This affects ftruncate() and open(O_TRUNC). - GROW: If SEAL_GROW is set, the file in question cannot be increased in size. This affects ftruncate(), fallocate() and write(). - WRITE: If SEAL_WRITE is set, no write operations (besides resizing) are possible. This affects fallocate(PUNCH_HOLE), mmap() and write(). - SEAL: If SEAL_SEAL is set, no further seals can be added to a file. This basically prevents the F_ADD_SEAL operation on a file and can be set to prevent others from adding further seals that you don't want. The described use-cases can easily use these seals to provide safe use without any trust-relationship: 1) The graphics server can verify that a passed file-descriptor has SEAL_SHRINK set. This allows safe scanout, while the client is allowed to increase buffer size for window-resizing on-the-fly. Concurrent writes are explicitly allowed. 2) For general-purpose IPC, both processes can verify that SEAL_SHRINK, SEAL_GROW and SEAL_WRITE are set. This guarantees that neither process can modify the data while the other side parses it. Furthermore, it guarantees that even with writable FDs passed to the peer, it cannot increase the size to hit memory-limits of the source process (in case the file-storage is accounted to the source). The new API is an extension to fcntl(), adding two new commands: F_GET_SEALS: Return a bitset describing the seals on the file. This can be called on any FD if the underlying file supports sealing. F_ADD_SEALS: Change the seals of a given file. This requires WRITE access to the file and F_SEAL_SEAL may not already be set. Furthermore, the underlying file must support sealing and there may not be any existing shared mapping of that file. Otherwise, EBADF/EPERM is returned. The given seals are _added_ to the existing set of seals on the file. You cannot remove seals again. The fcntl() handler is currently specific to shmem and disabled on all files. A file needs to explicitly support sealing for this interface to work. A separate syscall is added in a follow-up, which creates files that support sealing. There is no intention to support this on other file-systems. Semantics are unclear for non-volatile files and we lack any use-case right now. Therefore, the implementation is specific to shmem. Signed-off-by: David Herrmann <dh.herrmann@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ryan Lortie <desrt@desrt.ca> Cc: Lennart Poettering <lennart@poettering.net> Cc: Daniel Mack <zonque@gmail.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 21:25:27 +00:00
/* (1U << 31) is reserved for signed error codes */
fs: add fcntl() interface for setting/getting write life time hints Define a set of write life time hints: RWH_WRITE_LIFE_NOT_SET No hint information set RWH_WRITE_LIFE_NONE No hints about write life time RWH_WRITE_LIFE_SHORT Data written has a short life time RWH_WRITE_LIFE_MEDIUM Data written has a medium life time RWH_WRITE_LIFE_LONG Data written has a long life time RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time The intent is for these values to be relative to each other, no absolute meaning should be attached to these flag names. Add an fcntl interface for querying these flags, and also for setting them as well: F_GET_RW_HINT Returns the read/write hint set on the underlying inode. F_SET_RW_HINT Set one of the above write hints on the underlying inode. F_GET_FILE_RW_HINT Returns the read/write hint set on the file descriptor. F_SET_FILE_RW_HINT Set one of the above write hints on the file descriptor. The user passes in a 64-bit pointer to get/set these values, and the interface returns 0/-1 on success/error. Sample program testing/implementing basic setting/getting of write hints is below. Add support for storing the write life time hint in the inode flags and in struct file as well, and pass them to the kiocb flags. If both a file and its corresponding inode has a write hint, then we use the one in the file, if available. The file hint can be used for sync/direct IO, for buffered writeback only the inode hint is available. This is in preparation for utilizing these hints in the block layer, to guide on-media data placement. /* * writehint.c: get or set an inode write hint */ #include <stdio.h> #include <fcntl.h> #include <stdlib.h> #include <unistd.h> #include <stdbool.h> #include <inttypes.h> #ifndef F_GET_RW_HINT #define F_LINUX_SPECIFIC_BASE 1024 #define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11) #define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12) #endif static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE", "RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM", "RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" }; int main(int argc, char *argv[]) { uint64_t hint; int fd, ret; if (argc < 2) { fprintf(stderr, "%s: file <hint>\n", argv[0]); return 1; } fd = open(argv[1], O_RDONLY); if (fd < 0) { perror("open"); return 2; } if (argc > 2) { hint = atoi(argv[2]); ret = fcntl(fd, F_SET_RW_HINT, &hint); if (ret < 0) { perror("fcntl: F_SET_RW_HINT"); return 4; } } ret = fcntl(fd, F_GET_RW_HINT, &hint); if (ret < 0) { perror("fcntl: F_GET_RW_HINT"); return 3; } printf("%s: hint %s\n", argv[1], str[hint]); close(fd); return 0; } Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 17:47:04 +00:00
/*
* Set/Get write life time hints. {GET,SET}_RW_HINT operate on the
* underlying inode, while {GET,SET}_FILE_RW_HINT operate only on
* the specific file.
*/
#define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11)
#define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12)
#define F_GET_FILE_RW_HINT (F_LINUX_SPECIFIC_BASE + 13)
#define F_SET_FILE_RW_HINT (F_LINUX_SPECIFIC_BASE + 14)
/*
* Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
* used to clear any hints previously set.
*/
#define RWH_WRITE_LIFE_NOT_SET 0
fs: add fcntl() interface for setting/getting write life time hints Define a set of write life time hints: RWH_WRITE_LIFE_NOT_SET No hint information set RWH_WRITE_LIFE_NONE No hints about write life time RWH_WRITE_LIFE_SHORT Data written has a short life time RWH_WRITE_LIFE_MEDIUM Data written has a medium life time RWH_WRITE_LIFE_LONG Data written has a long life time RWH_WRITE_LIFE_EXTREME Data written has an extremely long life time The intent is for these values to be relative to each other, no absolute meaning should be attached to these flag names. Add an fcntl interface for querying these flags, and also for setting them as well: F_GET_RW_HINT Returns the read/write hint set on the underlying inode. F_SET_RW_HINT Set one of the above write hints on the underlying inode. F_GET_FILE_RW_HINT Returns the read/write hint set on the file descriptor. F_SET_FILE_RW_HINT Set one of the above write hints on the file descriptor. The user passes in a 64-bit pointer to get/set these values, and the interface returns 0/-1 on success/error. Sample program testing/implementing basic setting/getting of write hints is below. Add support for storing the write life time hint in the inode flags and in struct file as well, and pass them to the kiocb flags. If both a file and its corresponding inode has a write hint, then we use the one in the file, if available. The file hint can be used for sync/direct IO, for buffered writeback only the inode hint is available. This is in preparation for utilizing these hints in the block layer, to guide on-media data placement. /* * writehint.c: get or set an inode write hint */ #include <stdio.h> #include <fcntl.h> #include <stdlib.h> #include <unistd.h> #include <stdbool.h> #include <inttypes.h> #ifndef F_GET_RW_HINT #define F_LINUX_SPECIFIC_BASE 1024 #define F_GET_RW_HINT (F_LINUX_SPECIFIC_BASE + 11) #define F_SET_RW_HINT (F_LINUX_SPECIFIC_BASE + 12) #endif static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE", "RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM", "RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" }; int main(int argc, char *argv[]) { uint64_t hint; int fd, ret; if (argc < 2) { fprintf(stderr, "%s: file <hint>\n", argv[0]); return 1; } fd = open(argv[1], O_RDONLY); if (fd < 0) { perror("open"); return 2; } if (argc > 2) { hint = atoi(argv[2]); ret = fcntl(fd, F_SET_RW_HINT, &hint); if (ret < 0) { perror("fcntl: F_SET_RW_HINT"); return 4; } } ret = fcntl(fd, F_GET_RW_HINT, &hint); if (ret < 0) { perror("fcntl: F_GET_RW_HINT"); return 3; } printf("%s: hint %s\n", argv[1], str[hint]); close(fd); return 0; } Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 17:47:04 +00:00
#define RWH_WRITE_LIFE_NONE 1
#define RWH_WRITE_LIFE_SHORT 2
#define RWH_WRITE_LIFE_MEDIUM 3
#define RWH_WRITE_LIFE_LONG 4
#define RWH_WRITE_LIFE_EXTREME 5
/*
* The originally introduced spelling is remained from the first
* versions of the patch set that introduced the feature, see commit
* v4.13-rc1~212^2~51.
*/
#define RWF_WRITE_LIFE_NOT_SET RWH_WRITE_LIFE_NOT_SET
/*
* Types of directory notifications that may be requested.
*/
#define DN_ACCESS 0x00000001 /* File accessed */
#define DN_MODIFY 0x00000002 /* File modified */
#define DN_CREATE 0x00000004 /* File created */
#define DN_DELETE 0x00000008 /* File removed */
#define DN_RENAME 0x00000010 /* File renamed */
#define DN_ATTRIB 0x00000020 /* File changed attibutes */
#define DN_MULTISHOT 0x80000000 /* Don't remove notifier */
uapi: explain how per-syscall AT_* flags should be allocated Unfortunately, the way we have gone about adding new AT_* flags has been a little messy. In the beginning, all of the AT_* flags had generic meanings and so it made sense to share the flag bits indiscriminately. However, we inevitably ran into syscalls that needed their own syscall-specific flags. Due to the lack of a planned out policy, we ended up with the following situations: * Existing syscalls adding new features tended to use new AT_* bits, with some effort taken to try to re-use bits for flags that were so obviously syscall specific that they only make sense for a single syscall (such as the AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID triplet). Given the constraints of bitflags, this works well in practice, but ideally (to avoid future confusion) we would plan ahead and define a set of "per-syscall bits" ahead of time so that when allocating new bits we don't end up with a complete mish-mash of which bits are supposed to be per-syscall and which aren't. * New syscalls dealt with this in several ways: - Some syscalls (like renameat2(2), move_mount(2), fsopen(2), and fspick(2)) created their separate own flag spaces that have no overlap with the AT_* flags. Most of these ended up allocating their bits sequentually. In the case of move_mount(2) and fspick(2), several flags have identical meanings to AT_* flags but were allocated in their own flag space. This makes sense for syscalls that will never share AT_* flags, but for some syscalls this leads to duplication with AT_* flags in a way that could cause confusion (if renameat2(2) grew a RENAME_EMPTY_PATH it seems likely that users could mistake it for AT_EMPTY_PATH since it is an *at(2) syscall). - Some syscalls unfortunately ended up both creating their own flag space while also using bits from other flag spaces. The most obvious example is open_tree(2), where the standard usage ends up using flags from *THREE* separate flag spaces: open_tree(AT_FDCWD, "/foo", OPEN_TREE_CLONE|O_CLOEXEC|AT_RECURSIVE); (Note that O_CLOEXEC is also platform-specific, so several future OPEN_TREE_* bits are also made unusable in one fell swoop.) It's not entirely clear to me what the "right" choice is for new syscalls. Just saying that all future VFS syscalls should use AT_* flags doesn't seem practical. openat2(2) has RESOLVE_* flags (many of which don't make much sense to burn generic AT_* flags for) and move_mount(2) has separate AT_*-like flags for both the source and target so separate flags are needed anyway (though it seems possible that renameat2(2) could grow *_EMPTY_PATH flags at some point, and it's a bit of a shame they can't be reused). But at least for syscalls that _do_ choose to use AT_* flags, we should explicitly state the policy that 0x2ff is currently intended for per-syscall flags and that new flags should err on the side of overlapping with existing flag bits (so we can extend the scope of generic flags in the future if necessary). And add AT_* aliases for the RENAME_* flags to further cement that renameat2(2) is an *at(2) flag, just with its own per-syscall flags. Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-1-10c2c4c16708@cyphar.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28 10:19:42 +00:00
#define AT_FDCWD -100 /* Special value for dirfd used to
indicate openat should use the
current working directory. */
/* Generic flags for the *at(2) family of syscalls. */
/* Reserved for per-syscall flags 0xff. */
#define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic
links. */
/* Reserved for per-syscall flags 0x200 */
#define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */
#define AT_NO_AUTOMOUNT 0x800 /* Suppress terminal automount
traversal. */
#define AT_EMPTY_PATH 0x1000 /* Allow empty relative
pathname to operate on dirfd
directly. */
/*
uapi: explain how per-syscall AT_* flags should be allocated Unfortunately, the way we have gone about adding new AT_* flags has been a little messy. In the beginning, all of the AT_* flags had generic meanings and so it made sense to share the flag bits indiscriminately. However, we inevitably ran into syscalls that needed their own syscall-specific flags. Due to the lack of a planned out policy, we ended up with the following situations: * Existing syscalls adding new features tended to use new AT_* bits, with some effort taken to try to re-use bits for flags that were so obviously syscall specific that they only make sense for a single syscall (such as the AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID triplet). Given the constraints of bitflags, this works well in practice, but ideally (to avoid future confusion) we would plan ahead and define a set of "per-syscall bits" ahead of time so that when allocating new bits we don't end up with a complete mish-mash of which bits are supposed to be per-syscall and which aren't. * New syscalls dealt with this in several ways: - Some syscalls (like renameat2(2), move_mount(2), fsopen(2), and fspick(2)) created their separate own flag spaces that have no overlap with the AT_* flags. Most of these ended up allocating their bits sequentually. In the case of move_mount(2) and fspick(2), several flags have identical meanings to AT_* flags but were allocated in their own flag space. This makes sense for syscalls that will never share AT_* flags, but for some syscalls this leads to duplication with AT_* flags in a way that could cause confusion (if renameat2(2) grew a RENAME_EMPTY_PATH it seems likely that users could mistake it for AT_EMPTY_PATH since it is an *at(2) syscall). - Some syscalls unfortunately ended up both creating their own flag space while also using bits from other flag spaces. The most obvious example is open_tree(2), where the standard usage ends up using flags from *THREE* separate flag spaces: open_tree(AT_FDCWD, "/foo", OPEN_TREE_CLONE|O_CLOEXEC|AT_RECURSIVE); (Note that O_CLOEXEC is also platform-specific, so several future OPEN_TREE_* bits are also made unusable in one fell swoop.) It's not entirely clear to me what the "right" choice is for new syscalls. Just saying that all future VFS syscalls should use AT_* flags doesn't seem practical. openat2(2) has RESOLVE_* flags (many of which don't make much sense to burn generic AT_* flags for) and move_mount(2) has separate AT_*-like flags for both the source and target so separate flags are needed anyway (though it seems possible that renameat2(2) could grow *_EMPTY_PATH flags at some point, and it's a bit of a shame they can't be reused). But at least for syscalls that _do_ choose to use AT_* flags, we should explicitly state the policy that 0x2ff is currently intended for per-syscall flags and that new flags should err on the side of overlapping with existing flag bits (so we can extend the scope of generic flags in the future if necessary). And add AT_* aliases for the RENAME_* flags to further cement that renameat2(2) is an *at(2) flag, just with its own per-syscall flags. Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-1-10c2c4c16708@cyphar.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28 10:19:42 +00:00
* These flags are currently statx(2)-specific, but they could be made generic
* in the future and so they should not be used for other per-syscall flags.
*/
uapi: explain how per-syscall AT_* flags should be allocated Unfortunately, the way we have gone about adding new AT_* flags has been a little messy. In the beginning, all of the AT_* flags had generic meanings and so it made sense to share the flag bits indiscriminately. However, we inevitably ran into syscalls that needed their own syscall-specific flags. Due to the lack of a planned out policy, we ended up with the following situations: * Existing syscalls adding new features tended to use new AT_* bits, with some effort taken to try to re-use bits for flags that were so obviously syscall specific that they only make sense for a single syscall (such as the AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID triplet). Given the constraints of bitflags, this works well in practice, but ideally (to avoid future confusion) we would plan ahead and define a set of "per-syscall bits" ahead of time so that when allocating new bits we don't end up with a complete mish-mash of which bits are supposed to be per-syscall and which aren't. * New syscalls dealt with this in several ways: - Some syscalls (like renameat2(2), move_mount(2), fsopen(2), and fspick(2)) created their separate own flag spaces that have no overlap with the AT_* flags. Most of these ended up allocating their bits sequentually. In the case of move_mount(2) and fspick(2), several flags have identical meanings to AT_* flags but were allocated in their own flag space. This makes sense for syscalls that will never share AT_* flags, but for some syscalls this leads to duplication with AT_* flags in a way that could cause confusion (if renameat2(2) grew a RENAME_EMPTY_PATH it seems likely that users could mistake it for AT_EMPTY_PATH since it is an *at(2) syscall). - Some syscalls unfortunately ended up both creating their own flag space while also using bits from other flag spaces. The most obvious example is open_tree(2), where the standard usage ends up using flags from *THREE* separate flag spaces: open_tree(AT_FDCWD, "/foo", OPEN_TREE_CLONE|O_CLOEXEC|AT_RECURSIVE); (Note that O_CLOEXEC is also platform-specific, so several future OPEN_TREE_* bits are also made unusable in one fell swoop.) It's not entirely clear to me what the "right" choice is for new syscalls. Just saying that all future VFS syscalls should use AT_* flags doesn't seem practical. openat2(2) has RESOLVE_* flags (many of which don't make much sense to burn generic AT_* flags for) and move_mount(2) has separate AT_*-like flags for both the source and target so separate flags are needed anyway (though it seems possible that renameat2(2) could grow *_EMPTY_PATH flags at some point, and it's a bit of a shame they can't be reused). But at least for syscalls that _do_ choose to use AT_* flags, we should explicitly state the policy that 0x2ff is currently intended for per-syscall flags and that new flags should err on the side of overlapping with existing flag bits (so we can extend the scope of generic flags in the future if necessary). And add AT_* aliases for the RENAME_* flags to further cement that renameat2(2) is an *at(2) flag, just with its own per-syscall flags. Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-1-10c2c4c16708@cyphar.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28 10:19:42 +00:00
#define AT_STATX_SYNC_TYPE 0x6000 /* Type of synchronisation required from statx() */
#define AT_STATX_SYNC_AS_STAT 0x0000 /* - Do whatever stat() does */
#define AT_STATX_FORCE_SYNC 0x2000 /* - Force the attributes to be sync'd with the server */
#define AT_STATX_DONT_SYNC 0x4000 /* - Don't sync attributes with the server */
#define AT_RECURSIVE 0x8000 /* Apply to the entire subtree */
/*
* Per-syscall flags for the *at(2) family of syscalls.
*
* These are flags that are so syscall-specific that a user passing these flags
* to the wrong syscall is so "clearly wrong" that we can safely call such
* usage "undefined behaviour".
*
* For example, the constants AT_REMOVEDIR and AT_EACCESS have the same value.
* AT_EACCESS is meaningful only to faccessat, while AT_REMOVEDIR is meaningful
* only to unlinkat. The two functions do completely different things and
* therefore, the flags can be allowed to overlap. For example, passing
* AT_REMOVEDIR to faccessat would be undefined behavior and thus treating it
* equivalent to AT_EACCESS is valid undefined behavior.
*
* Note for implementers: When picking a new per-syscall AT_* flag, try to
* reuse already existing flags first. This leaves us with as many unused bits
* as possible, so we can use them for generic bits in the future if necessary.
*/
/* Flags for renameat2(2) (must match legacy RENAME_* flags). */
#define AT_RENAME_NOREPLACE 0x0001
#define AT_RENAME_EXCHANGE 0x0002
#define AT_RENAME_WHITEOUT 0x0004
/* Flag for faccessat(2). */
#define AT_EACCESS 0x200 /* Test access permitted for
effective IDs, not real IDs. */
uapi: explain how per-syscall AT_* flags should be allocated Unfortunately, the way we have gone about adding new AT_* flags has been a little messy. In the beginning, all of the AT_* flags had generic meanings and so it made sense to share the flag bits indiscriminately. However, we inevitably ran into syscalls that needed their own syscall-specific flags. Due to the lack of a planned out policy, we ended up with the following situations: * Existing syscalls adding new features tended to use new AT_* bits, with some effort taken to try to re-use bits for flags that were so obviously syscall specific that they only make sense for a single syscall (such as the AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID triplet). Given the constraints of bitflags, this works well in practice, but ideally (to avoid future confusion) we would plan ahead and define a set of "per-syscall bits" ahead of time so that when allocating new bits we don't end up with a complete mish-mash of which bits are supposed to be per-syscall and which aren't. * New syscalls dealt with this in several ways: - Some syscalls (like renameat2(2), move_mount(2), fsopen(2), and fspick(2)) created their separate own flag spaces that have no overlap with the AT_* flags. Most of these ended up allocating their bits sequentually. In the case of move_mount(2) and fspick(2), several flags have identical meanings to AT_* flags but were allocated in their own flag space. This makes sense for syscalls that will never share AT_* flags, but for some syscalls this leads to duplication with AT_* flags in a way that could cause confusion (if renameat2(2) grew a RENAME_EMPTY_PATH it seems likely that users could mistake it for AT_EMPTY_PATH since it is an *at(2) syscall). - Some syscalls unfortunately ended up both creating their own flag space while also using bits from other flag spaces. The most obvious example is open_tree(2), where the standard usage ends up using flags from *THREE* separate flag spaces: open_tree(AT_FDCWD, "/foo", OPEN_TREE_CLONE|O_CLOEXEC|AT_RECURSIVE); (Note that O_CLOEXEC is also platform-specific, so several future OPEN_TREE_* bits are also made unusable in one fell swoop.) It's not entirely clear to me what the "right" choice is for new syscalls. Just saying that all future VFS syscalls should use AT_* flags doesn't seem practical. openat2(2) has RESOLVE_* flags (many of which don't make much sense to burn generic AT_* flags for) and move_mount(2) has separate AT_*-like flags for both the source and target so separate flags are needed anyway (though it seems possible that renameat2(2) could grow *_EMPTY_PATH flags at some point, and it's a bit of a shame they can't be reused). But at least for syscalls that _do_ choose to use AT_* flags, we should explicitly state the policy that 0x2ff is currently intended for per-syscall flags and that new flags should err on the side of overlapping with existing flag bits (so we can extend the scope of generic flags in the future if necessary). And add AT_* aliases for the RENAME_* flags to further cement that renameat2(2) is an *at(2) flag, just with its own per-syscall flags. Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-1-10c2c4c16708@cyphar.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28 10:19:42 +00:00
/* Flag for unlinkat(2). */
#define AT_REMOVEDIR 0x200 /* Remove directory instead of
unlinking file. */
uapi: explain how per-syscall AT_* flags should be allocated Unfortunately, the way we have gone about adding new AT_* flags has been a little messy. In the beginning, all of the AT_* flags had generic meanings and so it made sense to share the flag bits indiscriminately. However, we inevitably ran into syscalls that needed their own syscall-specific flags. Due to the lack of a planned out policy, we ended up with the following situations: * Existing syscalls adding new features tended to use new AT_* bits, with some effort taken to try to re-use bits for flags that were so obviously syscall specific that they only make sense for a single syscall (such as the AT_EACCESS/AT_REMOVEDIR/AT_HANDLE_FID triplet). Given the constraints of bitflags, this works well in practice, but ideally (to avoid future confusion) we would plan ahead and define a set of "per-syscall bits" ahead of time so that when allocating new bits we don't end up with a complete mish-mash of which bits are supposed to be per-syscall and which aren't. * New syscalls dealt with this in several ways: - Some syscalls (like renameat2(2), move_mount(2), fsopen(2), and fspick(2)) created their separate own flag spaces that have no overlap with the AT_* flags. Most of these ended up allocating their bits sequentually. In the case of move_mount(2) and fspick(2), several flags have identical meanings to AT_* flags but were allocated in their own flag space. This makes sense for syscalls that will never share AT_* flags, but for some syscalls this leads to duplication with AT_* flags in a way that could cause confusion (if renameat2(2) grew a RENAME_EMPTY_PATH it seems likely that users could mistake it for AT_EMPTY_PATH since it is an *at(2) syscall). - Some syscalls unfortunately ended up both creating their own flag space while also using bits from other flag spaces. The most obvious example is open_tree(2), where the standard usage ends up using flags from *THREE* separate flag spaces: open_tree(AT_FDCWD, "/foo", OPEN_TREE_CLONE|O_CLOEXEC|AT_RECURSIVE); (Note that O_CLOEXEC is also platform-specific, so several future OPEN_TREE_* bits are also made unusable in one fell swoop.) It's not entirely clear to me what the "right" choice is for new syscalls. Just saying that all future VFS syscalls should use AT_* flags doesn't seem practical. openat2(2) has RESOLVE_* flags (many of which don't make much sense to burn generic AT_* flags for) and move_mount(2) has separate AT_*-like flags for both the source and target so separate flags are needed anyway (though it seems possible that renameat2(2) could grow *_EMPTY_PATH flags at some point, and it's a bit of a shame they can't be reused). But at least for syscalls that _do_ choose to use AT_* flags, we should explicitly state the policy that 0x2ff is currently intended for per-syscall flags and that new flags should err on the side of overlapping with existing flag bits (so we can extend the scope of generic flags in the future if necessary). And add AT_* aliases for the RENAME_* flags to further cement that renameat2(2) is an *at(2) flag, just with its own per-syscall flags. Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/r/20240828-exportfs-u64-mount-id-v3-1-10c2c4c16708@cyphar.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28 10:19:42 +00:00
/* Flags for name_to_handle_at(2). */
#define AT_HANDLE_FID 0x200 /* File handle is needed to compare
object identity and may not be
usable with open_by_handle_at(2). */
#define AT_HANDLE_MNT_ID_UNIQUE 0x001 /* Return the u64 unique mount ID. */
#define AT_HANDLE_CONNECTABLE 0x002 /* Request a connectable file handle */
exec: Add a new AT_EXECVE_CHECK flag to execveat(2) Add a new AT_EXECVE_CHECK flag to execveat(2) to check if a file would be allowed for execution. The main use case is for script interpreters and dynamic linkers to check execution permission according to the kernel's security policy. Another use case is to add context to access logs e.g., which script (instead of interpreter) accessed a file. As any executable code, scripts could also use this check [1]. This is different from faccessat(2) + X_OK which only checks a subset of access rights (i.e. inode permission and mount options for regular files), but not the full context (e.g. all LSM access checks). The main use case for access(2) is for SUID processes to (partially) check access on behalf of their caller. The main use case for execveat(2) + AT_EXECVE_CHECK is to check if a script execution would be allowed, according to all the different restrictions in place. Because the use of AT_EXECVE_CHECK follows the exact kernel semantic as for a real execution, user space gets the same error codes. An interesting point of using execveat(2) instead of openat2(2) is that it decouples the check from the enforcement. Indeed, the security check can be logged (e.g. with audit) without blocking an execution environment not yet ready to enforce a strict security policy. LSMs can control or log execution requests with security_bprm_creds_for_exec(). However, to enforce a consistent and complete access control (e.g. on binary's dependencies) LSMs should restrict file executability, or measure executed files, with security_file_open() by checking file->f_flags & __FMODE_EXEC. Because AT_EXECVE_CHECK is dedicated to user space interpreters, it doesn't make sense for the kernel to parse the checked files, look for interpreters known to the kernel (e.g. ELF, shebang), and return ENOEXEC if the format is unknown. Because of that, security_bprm_check() is never called when AT_EXECVE_CHECK is used. It should be noted that script interpreters cannot directly use execveat(2) (without this new AT_EXECVE_CHECK flag) because this could lead to unexpected behaviors e.g., `python script.sh` could lead to Bash being executed to interpret the script. Unlike the kernel, script interpreters may just interpret the shebang as a simple comment, which should not change for backward compatibility reasons. Because scripts or libraries files might not currently have the executable permission set, or because we might want specific users to be allowed to run arbitrary scripts, the following patch provides a dynamic configuration mechanism with the SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE securebits. This is a redesign of the CLIP OS 4's O_MAYEXEC: https://github.com/clipos-archive/src_platform_clip-patches/blob/f5cb330d6b684752e403b4e41b39f7004d88e561/1901_open_mayexec.patch This patch has been used for more than a decade with customized script interpreters. Some examples can be found here: https://github.com/clipos-archive/clipos4_portage-overlay/search?q=O_MAYEXEC Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Kees Cook <keescook@chromium.org> Acked-by: Paul Moore <paul@paul-moore.com> Reviewed-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Jeff Xu <jeffxu@chromium.org> Tested-by: Jeff Xu <jeffxu@chromium.org> Link: https://docs.python.org/3/library/io.html#io.open_code [1] Signed-off-by: Mickaël Salaün <mic@digikod.net> Link: https://lore.kernel.org/r/20241212174223.389435-2-mic@digikod.net Signed-off-by: Kees Cook <kees@kernel.org>
2024-12-12 17:42:16 +00:00
/* Flags for execveat2(2). */
#define AT_EXECVE_CHECK 0x10000 /* Only perform a check if execution
would be allowed. */
#endif /* _UAPI_LINUX_FCNTL_H */