Commit Graph

23 Commits

Author SHA1 Message Date
Carlos Maiolino 972aed5f39 watch_queue: fix pipe accounting mismatch
JIRA: https://issues.redhat.com/browse/RHEL-78249

commit f13abc1e8e1a3b7455511c4e122750127f6bc9b0
Author: Eric Sandeen <sandeen@redhat.com>
Date:   Thu Feb 27 11:41:08 2025 -0600

    watch_queue: fix pipe accounting mismatch

    Currently, watch_queue_set_size() modifies the pipe buffers charged to
    user->pipe_bufs without updating the pipe->nr_accounted on the pipe
    itself, due to the if (!pipe_has_watch_queue()) test in
    pipe_resize_ring(). This means that when the pipe is ultimately freed,
    we decrement user->pipe_bufs by something other than what we had
    charged to it, potentially leading to an underflow. This in turn can
    cause subsequent too_many_pipe_buffers_soft() tests to fail with -EPERM.

    To remedy this, explicitly account for the pipe usage in
    watch_queue_set_size() to match the number set via account_pipe_buffers().

    (It's unclear why watch_queue_set_size() does not update nr_accounted;
    it may be due to intentional overprovisioning in watch_queue_set_size()?)

    Fixes: e95aada4cb93d ("pipe: wakeup wr_wait after setting max_usage")
    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
    Link: https://lore.kernel.org/r/206682a8-0604-49e5-8224-fdbe0c12b460@redhat.com
    Signed-off-by: Christian Brauner <brauner@kernel.org>

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
2025-04-02 10:32:41 +02:00
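The mismatch described in the pipe accounting fix above (972aed5f39) can be illustrated with a small user-space model. This is only a sketch: model_user, model_pipe, model_account() and model_leftover_charge() are hypothetical stand-ins, not the kernel's structures or functions, and the counter is a signed long so that the underflow shows up as a negative number instead of wrapping.

```c
#include <assert.h>

/* Hypothetical stand-ins for the per-user charge and the per-pipe record. */
struct model_user { long pipe_bufs; };
struct model_pipe { struct model_user *user; long nr_accounted; };

/* Move the user's charge from 'old' buffers to 'new_bufs' buffers. */
static void model_account(struct model_user *user, long old, long new_bufs)
{
	user->pipe_bufs += new_bufs - old;
}

/*
 * Create a pipe charged with def_bufs buffers, resize its watch queue to
 * nr_pages, free the pipe, and return what is still charged to the user.
 * With fixed == 0 the resize forgets to update nr_accounted (the bug);
 * with fixed != 0 it records the new charge (the fix).
 */
long model_leftover_charge(long def_bufs, long nr_pages, int fixed)
{
	struct model_user user = { 0 };
	struct model_pipe p = { &user, 0 };

	assert(def_bufs >= 0 && nr_pages >= 0);

	/* Pipe creation: charge the default ring and remember it. */
	model_account(p.user, 0, def_bufs);
	p.nr_accounted = def_bufs;

	/* Watch queue resize: the user's charge always changes... */
	model_account(p.user, p.nr_accounted, nr_pages);
	/* ...but only the fixed variant records it on the pipe. */
	if (fixed)
		p.nr_accounted = nr_pages;

	/* Pipe free: uncharge whatever the pipe believes it accounted. */
	model_account(p.user, p.nr_accounted, 0);
	return user.pipe_bufs;
}
```

In this model a default charge of 16 resized down to 1 page leaves the user at -15 in the buggy variant and at 0 in the fixed one.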
Steve Best d9574de8e7 kernel: watch_queue: copy user-array safely
JIRA: https://issues.redhat.com/browse/RHEL-38238
CVE: CVE-2023-52824

Build Info:  https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=61430154

Tested: Sanity boot testing on an Intel (intel-arrowlake-s-02) system.

commit ca0776571d3163bd03b3e8c9e3da936abfaecbf6
Author: Philipp Stanner <pstanner@redhat.com>
Date:   Wed Sep 20 14:36:11 2023 +0200

    kernel: watch_queue: copy user-array safely

    Currently, there is no overflow-check with memdup_user().

    Use the new function memdup_array_user() instead of memdup_user() for
    duplicating the user-space array safely.

    Suggested-by: David Airlie <airlied@redhat.com>
    Signed-off-by: Philipp Stanner <pstanner@redhat.com>
    Reviewed-by: Kees Cook <keescook@chromium.org>
    Reviewed-by: Zack Rusin <zackr@vmware.com>
    Signed-off-by: Dave Airlie <airlied@redhat.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20230920123612.16914-5-pstanner@redhat.com

Signed-off-by: Steve Best <sbest@redhat.com>
2024-05-23 05:14:57 -04:00
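The overflow check that memdup_array_user() adds over memdup_user() (commit d9574de8e7 above) can be sketched in user space as follows. dup_array_checked() is a hypothetical helper written for this illustration; it copies from ordinary memory with malloc()/memcpy() and does not reproduce the kernel function's signature or its user-space access handling.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate an array of n elements of 'size' bytes each.
 * Returns NULL if n * size would overflow size_t or if allocation fails.
 * The caller owns the returned memory and must free() it.
 */
void *dup_array_checked(const void *src, size_t n, size_t size)
{
	size_t total;
	void *copy;

	assert(src != NULL);

	/* Reject the request before multiplying if the product cannot fit. */
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;

	total = n * size;
	copy = malloc(total ? total : 1);	/* avoid a zero-sized malloc */
	if (copy)
		memcpy(copy, src, total);
	return copy;
}
```

Without the division check, a huge n would wrap the product to a small value, and the caller would go on to index past the end of a buffer that is far smaller than expected.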
Carlos Maiolino fbc63928da watch_queue: Free the page array when watch_queue is dismantled
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2231268
Tested: Syzkaller reproducer within BZ

Commit 7ea1a0124b6d ("watch_queue: Free the alloc bitmap when the
watch_queue is torn down") took care of the bitmap, but not the page
array.

  BUG: memory leak
  unreferenced object 0xffff88810d9bc140 (size 32):
  comm "syz-executor335", pid 3603, jiffies 4294946994 (age 12.840s)
  hex dump (first 32 bytes):
    40 a7 40 04 00 ea ff ff 00 00 00 00 00 00 00 00  @.@.............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
     kmalloc_array include/linux/slab.h:621 [inline]
     kcalloc include/linux/slab.h:652 [inline]
     watch_queue_set_size+0x12f/0x2e0 kernel/watch_queue.c:251
     pipe_ioctl+0x82/0x140 fs/pipe.c:632
     vfs_ioctl fs/ioctl.c:51 [inline]
     __do_sys_ioctl fs/ioctl.c:874 [inline]
     __se_sys_ioctl fs/ioctl.c:860 [inline]
     __x64_sys_ioctl+0xfc/0x140 fs/ioctl.c:860
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]

Reported-by: syzbot+25ea042ae28f3888727a@syzkaller.appspotmail.com
Fixes: c73be61ced ("pipe: Add general notification queue support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20220322004654.618274-1-eric.dumazet@gmail.com/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b490207017ba237d97b735b2aa66dc241ccd18f5)
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
2023-08-17 14:10:34 +02:00
Carlos Maiolino 7a1a6ffa85 watch_queue: Actually free the watch
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2231268
Tested: Syzkaller reproducer within BZ

free_watch() does everything barring actually freeing the watch object.  Fix
this by adding the missing kfree.

kmemleak produces a report something like the following.  Note that as an
address can be seen in the first word, the watch would appear to have gone
through call_rcu().

BUG: memory leak
unreferenced object 0xffff88810ce4a200 (size 96):
  comm "syz-executor352", pid 3605, jiffies 4294947473 (age 13.720s)
  hex dump (first 32 bytes):
    e0 82 48 0d 81 88 ff ff 00 00 00 00 00 00 00 00  ..H.............
    80 a2 e4 0c 81 88 ff ff 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8214e6cc>] kmalloc include/linux/slab.h:581 [inline]
    [<ffffffff8214e6cc>] kzalloc include/linux/slab.h:714 [inline]
    [<ffffffff8214e6cc>] keyctl_watch_key+0xec/0x2e0 security/keys/keyctl.c:1800
    [<ffffffff8214ec84>] __do_sys_keyctl+0x3c4/0x490 security/keys/keyctl.c:2016
    [<ffffffff84493a25>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    [<ffffffff84493a25>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    [<ffffffff84600068>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-and-tested-by: syzbot+6e2de48f06cdb2884bfc@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
(cherry picked from commit 3d8dcf278b1ee1eff1e90be848fa2237db4c07a7)
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
2023-08-17 11:48:35 +02:00
Chris von Recklinghausen 0807303b97 watch_queue: Fix NULL dereference in error cleanup
Bugzilla: https://bugzilla.redhat.com/2229694

commit a635415a064e77bcfbf43da413fd9dfe0bbed9cb
Author: David Howells <dhowells@redhat.com>
Date:   Mon Mar 21 08:11:52 2022 +0000

    watch_queue: Fix NULL dereference in error cleanup

    In watch_queue_set_size(), the error cleanup code doesn't take account of
    the fact that __free_page() can't handle a NULL pointer when trying to free
    up buffer pages that did get allocated.

    Fix this by only calling __free_page() on the pages actually allocated.

    Without the fix, this can lead to something like the following:

    BUG: KASAN: null-ptr-deref in __free_pages+0x1f/0x1b0 mm/page_alloc.c:5473
    Read of size 4 at addr 0000000000000034 by task syz-executor168/3599
    ...
    Call Trace:
     <TASK>
     __dump_stack lib/dump_stack.c:88 [inline]
     dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
     __kasan_report mm/kasan/report.c:446 [inline]
     kasan_report.cold+0x66/0xdf mm/kasan/report.c:459
     check_region_inline mm/kasan/generic.c:183 [inline]
     kasan_check_range+0x13d/0x180 mm/kasan/generic.c:189
     instrument_atomic_read include/linux/instrumented.h:71 [inline]
     atomic_read include/linux/atomic/atomic-instrumented.h:27 [inline]
     page_ref_count include/linux/page_ref.h:67 [inline]
     put_page_testzero include/linux/mm.h:717 [inline]
     __free_pages+0x1f/0x1b0 mm/page_alloc.c:5473
     watch_queue_set_size+0x499/0x630 kernel/watch_queue.c:275
     pipe_ioctl+0xac/0x2b0 fs/pipe.c:632
     vfs_ioctl fs/ioctl.c:51 [inline]
     __do_sys_ioctl fs/ioctl.c:874 [inline]
     __se_sys_ioctl fs/ioctl.c:860 [inline]
     __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x44/0xae

    Fixes: c73be61ced ("pipe: Add general notification queue support")
    Reported-and-tested-by: syzbot+d55757faa9b80590767b@syzkaller.appspotmail.com
    Signed-off-by: David Howells <dhowells@redhat.com>
    Reviewed-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>

Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
2023-08-14 07:51:56 -04:00
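The cleanup pattern from the NULL dereference fix above (0807303b97), freeing only the pages that were actually allocated, can be sketched in user space. model_free_page() and model_alloc_pages() are hypothetical names; malloc() stands in for page allocation, and the fail_at parameter exists only to simulate an allocation failure.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a free routine that, unlike free(), must never see NULL. */
static void model_free_page(void *page, int *released)
{
	assert(page != NULL);
	free(page);
	(*released)++;
}

/*
 * Try to allocate nr pages into pages[]. An allocation failure is simulated
 * at index fail_at (pass a negative value for no simulated failure).
 * Returns 0 on success; on failure, frees only the pages allocated so far,
 * counts them in *released, and returns -1.
 */
int model_alloc_pages(void **pages, int nr, int fail_at, int *released)
{
	int i;

	*released = 0;
	for (i = 0; i < nr; i++) {
		pages[i] = (i == fail_at) ? NULL : malloc(64);
		if (!pages[i])
			goto error;
	}
	return 0;

error:
	/* Walk back from the last successful slot; index i itself is NULL. */
	while (--i >= 0)
		model_free_page(pages[i], released);
	return -1;
}
```

A cleanup loop that instead ran over all nr slots would hand the failed (NULL) entry and any untouched entries to the free routine, which is the situation the KASAN report describes.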
Audra Mitchell baab22ad52 watch-queue: remove spurious double semicolon
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2190216

This patch is a backport of the following upstream commit:
commit 44e29e64cf1ac0cffb152e0532227ea6d002aa28
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Thu Jul 21 10:30:14 2022 -0700

    watch-queue: remove spurious double semicolon

    Sedat Dilek noticed that I had an extraneous semicolon at the end of a
    line in the previous patch.

    It's harmless, but unintentional, and while compilers just treat it as
    an extra empty statement, for all I know some other tooling might warn
    about it. So clean it up before other people notice too ;)

    Fixes: 353f7988dd84 ("watchqueue: make sure to serialize 'wqueue->defunct' properly")
    Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Audra Mitchell <aubaker@redhat.com>
2023-05-22 08:49:50 -04:00
Carlos Maiolino 5ea8439f5b watch_queue: Fix missing locking in add_watch_to_object()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090382
CVE: CVE-2022-1882

If a watch is being added to a queue, it needs to guard against
interference from addition of a new watch, manual removal of a watch and
removal of a watch due to some other queue being destroyed.

KEYCTL_WATCH_KEY guards against this for the same {key,queue} pair by
holding the key->sem writelocked and by holding refs on both the key and
the queue - but that doesn't prevent interaction from other {key,queue}
pairs.

While add_watch_to_object() does take the spinlock on the event queue,
it doesn't take the lock on the source's watch list.  The assumption was
that the caller would prevent that (say by taking key->sem) - but that
doesn't prevent interference from the destruction of another queue.

Fix this by locking the watcher list in add_watch_to_object().

Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: syzbot+03d7b43290037d1f87ca@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: keyrings@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e64ab2dbd882933b65cd82ff6235d705ad65dbb6)

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
2022-08-26 13:59:33 +02:00
Carlos Maiolino 5984ed171a watch_queue: Fix missing rcu annotation
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090382
CVE: CVE-2022-1882

Since __post_watch_notification() walks wlist->watchers with only the
RCU read lock held, we need to use RCU methods to add to the list (we
already use RCU methods to remove from the list).

Fix add_watch_to_object() to use hlist_add_head_rcu() instead of
hlist_add_head() for that list.

Fixes: c73be61ced ("pipe: Add general notification queue support")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e0339f036ef4beb9b20f0b6532a1e0ece7f594c6)

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
2022-08-26 13:59:33 +02:00
Carlos Maiolino 9073bf2948 watchqueue: make sure to serialize 'wqueue->defunct' properly
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2090382
CVE: CVE-2022-1882

When the pipe is closed, we mark the associated watchqueue defunct by
calling watch_queue_clear().  However, while that is protected by the
watchqueue lock, new watchqueue entries aren't actually added under that
lock at all: they use the pipe->rd_wait.lock instead, and looking up
that pipe happens without any locking.

The watchqueue code uses the RCU read-side section to make sure that the
wqueue entry itself hasn't disappeared, but that does not protect the
pipe_info in any way.

So make sure to actually hold the wqueue lock when posting watch events,
properly serializing against the pipe being torn down.

Reported-by: Noam Rathaus <noamr@ssd-disclosure.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 353f7988dd8413c47718f7ca79c030b6fb62cfe5)

Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
2022-08-26 13:59:33 +02:00
David Howells db37f13011 watch_queue: Make comment about setting ->defunct more accurate
Bugzilla: https://bugzilla.redhat.com/2063758

commit 4edc0760412b0c4ecefc7e02cb855b310b122825
Author: David Howells <dhowells@redhat.com>
Date:   Fri Mar 11 13:24:47 2022 +0000

    watch_queue: Make comment about setting ->defunct more accurate

    watch_queue_clear() has a comment stating that setting ->defunct to true
    prevents new additions as well as preventing notifications.  Whilst
    the latter is true, the first bit is superfluous since at the time this
    function is called, the pipe cannot be accessed to add new event
    sources.

    Remove the "new additions" bit from the comment.

    Fixes: c73be61ced ("pipe: Add general notification queue support")
    Reported-by: Jann Horn <jannh@google.com>
    Signed-off-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: David Howells <dhowells@redhat.com>
2022-03-16 16:00:08 +00:00
David Howells 43cf8a7eed watch_queue: Fix lack of barrier/sync/lock between post and read
Bugzilla: https://bugzilla.redhat.com/2063758

commit 2ed147f015af2b48f41c6f0b6746aa9ea85c19f3
Author: David Howells <dhowells@redhat.com>
Date:   Fri Mar 11 13:24:36 2022 +0000

    watch_queue: Fix lack of barrier/sync/lock between post and read

    There's nothing to synchronise post_one_notification() versus
    pipe_read().  Whilst posting is done under pipe->rd_wait.lock, the
    reader only takes pipe->mutex which cannot bar notification posting as
    that may need to be made from contexts that cannot sleep.

    Fix this by setting pipe->head with a barrier in post_one_notification()
    and reading pipe->head with a barrier in pipe_read().

    If that's not sufficient, the rd_wait.lock will need to be taken,
    possibly in a ->confirm() op so that it only applies to notifications.
    The lock would, however, have to be dropped before copy_page_to_iter()
    is invoked.

    Fixes: c73be61ced ("pipe: Add general notification queue support")
    Reported-by: Jann Horn <jannh@google.com>
    Signed-off-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: David Howells <dhowells@redhat.com>
2022-03-16 16:00:08 +00:00
David Howells e6d9d26d14 watch_queue: Free the alloc bitmap when the watch_queue is torn down
Bugzilla: https://bugzilla.redhat.com/2063758

commit 7ea1a0124b6da246b5bc8c66cddaafd36acf3ecb
Author: David Howells <dhowells@redhat.com>
Date:   Fri Mar 11 13:24:29 2022 +0000

    watch_queue: Free the alloc bitmap when the watch_queue is torn down

    Free the watch_queue note allocation bitmap when the watch_queue is
    destroyed.

    Fixes: c73be61ced ("pipe: Add general notification queue support")
    Reported-by: Jann Horn <jannh@google.com>
    Signed-off-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: David Howells <dhowells@redhat.com>
2022-03-16 16:00:08 +00:00
David Howells 47c6349225 watch_queue: Fix the alloc bitmap size to reflect notes allocated
Bugzilla: https://bugzilla.redhat.com/2063758

commit 3b4c0371928c17af03e8397ac842346624017ce6
Author: David Howells <dhowells@redhat.com>
Date:   Fri Mar 11 13:24:22 2022 +0000

    watch_queue: Fix the alloc bitmap size to reflect notes allocated

    Currently, watch_queue_set_size() sets the number of notes available in
    wqueue->nr_notes according to the number of notes allocated, but sets
    the size of the bitmap to the unrounded number of notes originally asked
    for.

    Fix this by setting the bitmap size to the number of notes we're
    actually going to make available (ie. the number allocated).

    Fixes: c73be61ced ("pipe: Add general notification queue support")
    Reported-by: Jann Horn <jannh@google.com>
    Signed-off-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: David Howells <dhowells@redhat.com>
2022-03-16 16:00:07 +00:00
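The rounding that the bitmap size fix above (47c6349225) is concerned with can be sketched as plain arithmetic. This is a model, not the kernel code: the 128-byte note size is an assumption chosen to be consistent with the 127-byte maximum message size mentioned in the log, and both function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MODEL_NOTE_SIZE 128u	/* assumption: bytes reserved per note */

/* Notes actually made available: the request rounded up to whole pages. */
size_t model_notes_allocated(size_t nr_requested, size_t page_size)
{
	size_t per_page, nr_pages;

	assert(page_size >= MODEL_NOTE_SIZE);
	per_page = page_size / MODEL_NOTE_SIZE;
	nr_pages = (nr_requested + per_page - 1) / per_page;
	return nr_pages * per_page;
}

/* Number of unsigned longs needed for a bitmap with one bit per note. */
size_t model_bitmap_longs(size_t nr_bits)
{
	size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;

	return (nr_bits + bits_per_long - 1) / bits_per_long;
}
```

With a 4096-byte page, a request for 1 note yields 32 usable notes in this model, so a bitmap sized for the unrounded request covers fewer notes than are handed out; with larger pages the two sizes can differ by several longs.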
David Howells bdc2289c74 watch_queue: Use the bitmap API when applicable
Bugzilla: https://bugzilla.redhat.com/2063758

commit a66bd7575b5f449ee0ba20cfd21c3bc5b04ef361
Author: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Date:   Fri Mar 11 13:24:15 2022 +0000

    watch_queue: Use the bitmap API when applicable

    Use bitmap_alloc() to simplify code, improve the semantics and reduce
    some open-coded arithmetic in allocator arguments.

    Also change a memset(0xff) into an equivalent bitmap_fill() to keep
    consistency.

    Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
    Signed-off-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: David Howells <dhowells@redhat.com>
2022-03-16 16:00:07 +00:00
David Howells 13d317ccf4 watch_queue: Fix to always request a pow-of-2 pipe ring size
Bugzilla: https://bugzilla.redhat.com/2063758

commit 96a4d8912b28451cd62825fd7caa0e66e091d938
Author: David Howells <dhowells@redhat.com>
Date:   Fri Mar 11 13:24:08 2022 +0000

    watch_queue: Fix to always request a pow-of-2 pipe ring size

    The pipe ring size must always be a power of 2 as the head and tail
    pointers are masked off by AND'ing with the size of the ring - 1.
    watch_queue_set_size(), however, lets you specify any number of notes
    between 1 and 511.  This number is passed through to pipe_resize_ring()
    without checking/forcing its alignment.

    Fix this by rounding the number of slots required up to the nearest
    power of two.  The request is meant to guarantee that at least that many
    notifications can be generated before the queue is full, so rounding
    down isn't an option, but, alternatively, it may be better to give an
    error if we aren't allowed to allocate that much ring space.

    Fixes: c73be61ced ("pipe: Add general notification queue support")
    Reported-by: Jann Horn <jannh@google.com>
    Signed-off-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: David Howells <dhowells@redhat.com>
2022-03-16 16:00:07 +00:00
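Why the ring size in the commit above (13d317ccf4) must be a power of two can be shown with a short user-space sketch. The three functions are hypothetical helpers written for this illustration; the 1 to 511 range is taken from the commit message.

```c
#include <assert.h>

/* Round a request of 1..511 notes up to the next power of two. */
unsigned int model_roundup_pow_of_two(unsigned int n)
{
	unsigned int p = 1;

	assert(n >= 1 && n <= 511);
	while (p < n)
		p <<= 1;
	return p;
}

/* Ring index by masking; only correct when ring_size is a power of two. */
unsigned int model_ring_index(unsigned int pos, unsigned int ring_size)
{
	return pos & (ring_size - 1);
}

/* Count the distinct slots reached by masking positions 0..ring_size-1. */
unsigned int model_reachable_slots(unsigned int ring_size)
{
	unsigned char seen[512] = { 0 };
	unsigned int pos, count = 0;

	assert(ring_size >= 1 && ring_size <= 512);
	for (pos = 0; pos < ring_size; pos++) {
		unsigned int idx = model_ring_index(pos, ring_size);

		if (!seen[idx]) {
			seen[idx] = 1;
			count++;
		}
	}
	return count;
}
```

For a ring of 6 slots the mask is binary 101, so positions 2 and 3 alias onto slots 0 and 1 and only 4 slots are ever used; rounding the request up to 8 makes all 8 slots reachable.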
David Howells b4f9a84bcc watch_queue: Fix to release page in ->release()
Bugzilla: https://bugzilla.redhat.com/2063758

commit c1853fbadcba1497f4907971e7107888e0714c81
Author: David Howells <dhowells@redhat.com>
Date:   Fri Mar 11 13:23:46 2022 +0000

    watch_queue: Fix to release page in ->release()

    When a pipe ring descriptor points to a notification message, the
    refcount on the backing page is incremented by the generic get function,
    but the release function, which marks the bitmap, doesn't drop the page
    ref.

    Fix this by calling generic_pipe_buf_release() at the end of
    watch_queue_pipe_buf_release().

    Fixes: c73be61ced ("pipe: Add general notification queue support")
    Reported-by: Jann Horn <jannh@google.com>
    Signed-off-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: David Howells <dhowells@redhat.com>
2022-03-16 16:00:07 +00:00
David Howells 3374b68c90 watch_queue: Fix filter limit check
Bugzilla: https://bugzilla.redhat.com/2063758
CVE: CVE-2022-0995

commit c993ee0f9f81caf5767a50d1faeba39a0dc82af2
Author: David Howells <dhowells@redhat.com>
Date:   Fri Mar 11 13:23:31 2022 +0000

    watch_queue: Fix filter limit check

    In watch_queue_set_filter(), there are a couple of places where we check
    that the filter type value does not exceed what the type_filter bitmap
    can hold.  One place calculates the number of bits by:

       if (tf[i].type >= sizeof(wfilter->type_filter) * 8)

    which is fine, but the second does:

       if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG)

    which is not.  This can lead to a couple of out-of-bounds writes due to
    a too-large type:

     (1) __set_bit() on wfilter->type_filter
     (2) Writing more elements in wfilter->filters[] than we allocated.

    Fix this by just using the proper WATCH_TYPE__NR instead, which is the
    number of types we actually know about.

    The bug may cause an oops looking something like:

      BUG: KASAN: slab-out-of-bounds in watch_queue_set_filter+0x659/0x740
      Write of size 4 at addr ffff88800d2c66bc by task watch_queue_oob/611
      ...
      Call Trace:
       <TASK>
       dump_stack_lvl+0x45/0x59
       print_address_description.constprop.0+0x1f/0x150
       ...
       kasan_report.cold+0x7f/0x11b
       ...
       watch_queue_set_filter+0x659/0x740
       ...
       __x64_sys_ioctl+0x127/0x190
       do_syscall_64+0x43/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

      Allocated by task 611:
       kasan_save_stack+0x1e/0x40
       __kasan_kmalloc+0x81/0xa0
       watch_queue_set_filter+0x23a/0x740
       __x64_sys_ioctl+0x127/0x190
       do_syscall_64+0x43/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae

      The buggy address belongs to the object at ffff88800d2c66a0
       which belongs to the cache kmalloc-32 of size 32
      The buggy address is located 28 bytes inside of
       32-byte region [ffff88800d2c66a0, ffff88800d2c66c0)

    Fixes: c73be61ced ("pipe: Add general notification queue support")
    Reported-by: Jann Horn <jannh@google.com>
    Signed-off-by: David Howells <dhowells@redhat.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: David Howells <dhowells@redhat.com>
2022-03-16 16:00:07 +00:00
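The difference between the two size expressions quoted in the filter limit fix above (3374b68c90) can be checked with a small user-space model. The two-long bitmap in model_filter is an assumption made for illustration, and the function names are hypothetical. The actual fix compares against WATCH_TYPE__NR; this sketch only contrasts the two expressions from the commit message.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the filter; the array length is an assumption. */
struct model_filter { unsigned long type_filter[2]; };

#define MODEL_FILTER_BYTES  sizeof(((struct model_filter *)0)->type_filter)
#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of bits the bitmap can really hold: bytes times bits per byte. */
size_t model_bitmap_capacity(void)
{
	return MODEL_FILTER_BYTES * CHAR_BIT;
}

/* The mistaken limit: bytes times bits per long, which is far too large. */
size_t model_buggy_limit(void)
{
	return MODEL_FILTER_BYTES * MODEL_BITS_PER_LONG;
}

/* Returns 1 if 'type' passes the chosen range check, 0 otherwise. */
int model_type_accepted(unsigned int type, int buggy)
{
	size_t limit;

	assert(buggy == 0 || buggy == 1);
	limit = buggy ? model_buggy_limit() : model_bitmap_capacity();
	return type < limit;
}
```

On a typical 64-bit build the capacity is 128 bits while the mistaken limit is 1024, so a type value such as 200 passes the second check and would set a bit well past the end of the bitmap.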
Lukas Bulwahn 8f0bfc25c9 watch_queue: rectify kernel-doc for init_watch()
The command './scripts/kernel-doc -none kernel/watch_queue.c'
reported a mismatch in the kernel-doc of init_watch().

Rectify the kernel-doc, such that no issues remain for watch_queue.c.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2021-01-26 11:16:34 +00:00
David Howells 29e44f4535 watch_queue: Limit the number of watches a user can hold
Impose a limit on the number of watches that a user can hold so that
they can't use this mechanism to fill up all the available memory.

This is done by putting a counter in user_struct that's incremented when
a watch is allocated and decreased when it is released.  If the number
exceeds the RLIMIT_NOFILE limit, the watch is rejected with EAGAIN.

This can be tested by the following means:

 (1) Create a watch queue and attach it to fd 5 in the program given - in
     this case, bash:

	keyctl watch_session /tmp/nlog /tmp/gclog 5 bash

 (2) In the shell, set the maximum number of files to, say, 99:

	ulimit -n 99

 (3) Add 200 keyrings:

	for ((i=0; i<200; i++)); do keyctl newring a$i @s || break; done

 (4) Try to watch all of the keyrings:

	for ((i=0; i<200; i++)); do echo $i; keyctl watch_add 5 %:a$i || break; done

     This should fail when the number of watches belonging to the user hits
     99.

 (5) Remove all the keyrings and all of those watches should go away:

	for ((i=0; i<200; i++)); do keyctl unlink %:a$i; done

 (6) Kill off the watch queue by exiting the shell spawned by
     watch_session.

Fixes: c73be61ced ("pipe: Add general notification queue support")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-17 09:39:18 -07:00
Linus Torvalds 6c32978414 Notifications over pipes + Keyring notifications
Merge tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull notification queue from David Howells:
 "This adds a general notification queue concept and adds an event
  source for keys/keyrings, such as linking and unlinking keys and
  changing their attributes.

  Thanks to Debarshi Ray, we do have a pull request to use this to fix a
  problem with gnome-online-accounts - as mentioned last time:

     https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47

  Without this, g-o-a has to constantly poll a keyring-based kerberos
  cache to find out if kinit has changed anything.

  [ There are other notifications pending: mount/sb fsinfo notifications
    for libmount that Karel Zak and Ian Kent have been working on, and
    Christian Brauner would like to use them in lxc, but let's see how
    this one works first ]

  LSM hooks are included:

   - A set of hooks are provided that allow an LSM to rule on whether or
     not a watch may be set. Each of these hooks takes a different
     "watched object" parameter, so they're not really shareable. The
     LSM should use current's credentials. [Wanted by SELinux & Smack]

   - A hook is provided to allow an LSM to rule on whether or not a
     particular message may be posted to a particular queue. This is
     given the credentials from the event generator (which may be the
     system) and the watch setter. [Wanted by Smack]

  I've provided SELinux and Smack with implementations of some of these
  hooks.

  WHY
  ===

  Key/keyring notifications are desirable because if you have your
  kerberos tickets in a file/directory, your Gnome desktop will monitor
  that using something like fanotify and tell you if your credentials
  cache changes.

  However, we also have the ability to cache your kerberos tickets in
  the session, user or persistent keyring so that it isn't left around
  on disk across a reboot or logout. Keyrings, however, cannot currently
  be monitored asynchronously, so the desktop has to poll for it - not
  so good on a laptop. This facility will allow the desktop to avoid the
  need to poll.

  DESIGN DECISIONS
  ================

   - The notification queue is built on top of a standard pipe. Messages
     are effectively spliced in. The pipe is opened with a special flag:

        pipe2(fds, O_NOTIFICATION_PIPE);

     The special flag has the same value as O_EXCL (which doesn't seem
     like it will ever be applicable in this context)[?]. It is given up
     front to make it a lot easier to prohibit splice&co from accessing
     the pipe.

     [?] Should this be done some other way?  I'd rather not use up a new
         O_* flag if I can avoid it - should I add a pipe3() system call
         instead?

     The pipe is then configured:

        ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
        ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);

     Messages are then read out of the pipe using read().

   - It should be possible to allow write() to insert data into the
     notification pipes too, but this is currently disabled as the
     kernel has to be able to insert messages into the pipe *without*
     holding pipe->mutex and the code to make this work needs careful
     auditing.

   - sendfile(), splice() and vmsplice() are disabled on notification
     pipes because of the pipe->mutex issue and also because they
     sometimes want to revert what they just did - but one or more
     notification messages might've been interleaved in the ring.

   - The kernel inserts messages with the wait queue spinlock held. This
     means that pipe_read() and pipe_write() have to take the spinlock
     to update the queue pointers.

   - Records in the buffer are binary, typed and have a length so that
     they can be of varying size.

     This allows multiple heterogeneous sources to share a common
     buffer; there are 16 million types available, of which I've used
     just a few, so there is scope for others to be used. Tags may be
     specified when a watchpoint is created to help distinguish the
     sources.

   - Records are filterable as types have up to 256 subtypes that can be
     individually filtered. Other filtration is also available.

   - Notification pipes don't interfere with each other; each may be
     bound to a different set of watches. Any particular notification
     will be copied to all the queues that are currently watching for it
     - and only those that are watching for it.

   - When recording a notification, the kernel will not sleep, but will
     rather mark a queue as having lost a message if there's
     insufficient space. read() will fabricate a loss notification
     message at an appropriate point later.

   - The notification pipe is created and then watchpoints are attached
     to it, using one of:

        keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
        watch_mount(AT_FDCWD, "/", 0, fd, 0x02);
        watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in each case, fd indicates the queue and the number after is
     a tag between 0 and 255.

   - Watches are removed if either the notification pipe is destroyed or
     the watched object is destroyed. In the latter case, a message will
     be generated indicating the enforced watch removal.

  Things I want to avoid:

   - Introducing features that make the core VFS dependent on the
     network stack or networking namespaces (ie. usage of netlink).

   - Dumping all this stuff into dmesg and having a daemon that sits
     there parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky. Further, dmesg might not exist or might be
     inaccessible inside a container.

   - Letting users see events they shouldn't be able to see.

  TESTING AND MANPAGES
  ====================

   - The keyutils tree has a pipe-watch branch that has keyctl commands
     for making use of notifications. Proposed manual pages can also be
     found on this branch, though a couple of them really need to go to
     the main manpages repository instead.

     If the kernel supports the watching of keys, then running "make
     test" on that branch will cause the testing infrastructure to spawn
     a monitoring process on the side that monitors a notifications pipe
     for all the key/keyring changes induced by the tests and they'll
     all be checked off to make sure they happened.

        https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch

   - A test program is provided (samples/watch_queue/watch_test) that
     can be used to monitor for keyrings, mount and superblock events.
     Information on the notifications is simply logged to stdout."

* tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  smack: Implement the watch_key and post_notification hooks
  selinux: Implement the watch_key security hook
  keys: Make the KEY_NEED_* perms an enum rather than a mask
  pipe: Add notification lossage handling
  pipe: Allow buffers to be marked read-whole-or-error for notifications
  Add sample notification program
  watch_queue: Add a key/keyring notification facility
  security: Add hooks to rule on setting a watch
  pipe: Add general notification queue support
  pipe: Add O_NOTIFICATION_PIPE
  security: Add a hook for the point of notification insertion
  uapi: General notification queue definitions
2020-06-13 09:56:21 -07:00
David Howells e7d553d69c pipe: Add notification lossage handling
Add handling for loss of notifications by having read() insert a
loss-notification message after it has read the pipe buffer that was last
in the ring when the loss occurred.

Lossage can come about either by running out of notification descriptors or
by running out of space in the pipe ring.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19 15:40:28 +01:00
David Howells 8cfba76383 pipe: Allow buffers to be marked read-whole-or-error for notifications
Allow a buffer to be marked such that read() must return the entire buffer
in one go or return ENOBUFS.  Multiple buffers can be amalgamated into a
single read, but a short read will occur if the next "whole" buffer won't
fit.

This is useful for watch queue notifications to make sure we don't split a
notification across multiple reads, especially given that we need to
fabricate an overrun record under some circumstances - and that isn't in
the buffers.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19 15:38:18 +01:00
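The read-whole-or-error rule described in the commit above (8cfba76383) can be modelled as a function over buffer lengths. This is a sketch under stated assumptions: every buffer is treated as marked "whole", model_read_whole() is a hypothetical name, and no data is actually copied.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Given the lengths of queued "whole" buffers and the size of the caller's
 * read buffer, return the number of bytes a read would deliver and store
 * the number of buffers consumed in *consumed.
 * Returns -ENOBUFS if even the first buffer does not fit.
 */
long model_read_whole(const size_t *buf_len, size_t nr_bufs, size_t want,
		      size_t *consumed)
{
	size_t total = 0, i;

	assert(buf_len != NULL && consumed != NULL);

	for (i = 0; i < nr_bufs; i++) {
		if (buf_len[i] > want - total) {
			if (i == 0) {
				*consumed = 0;
				return -ENOBUFS;	/* nothing whole fits */
			}
			break;	/* short read: stop before a partial buffer */
		}
		total += buf_len[i];
	}
	*consumed = i;
	return (long)total;
}
```

With queued buffers of 16, 24 and 40 bytes, a 48-byte read returns the first two buffers (40 bytes) and leaves the third intact for the next call, while an 8-byte read fails with ENOBUFS.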
David Howells c73be61ced pipe: Add general notification queue support
Make it possible to have a general notification queue built on top of a
standard pipe.  Notifications are 'spliced' into the pipe and then read
out.  splice(), vmsplice() and sendfile() are forbidden on pipes used for
notifications as post_one_notification() cannot take pipe->mutex.  This
means that notifications could be posted in between individual pipe
buffers, making iov_iter_revert() difficult to effect.

The way the notification queue is used is:

 (1) An application opens a pipe with a special flag and indicates the
     number of messages it wishes to be able to queue at once (this can
     only be set once):

	pipe2(fds, O_NOTIFICATION_PIPE);
	ioctl(fds[0], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);

 (2) The application then uses poll() and read() as normal to extract data
     from the pipe.  read() will return multiple notifications if the
     buffer is big enough, but it will not split a notification across
     buffers - rather it will return a short read or EMSGSIZE.

     Notification messages include a length in the header so that the
     caller can split them up.

Each message has a header that describes it:

	struct watch_notification {
		__u32	type:24;
		__u32	subtype:8;
		__u32	info;
	};

The type indicates the source (eg. mount tree changes, superblock events,
keyring changes, block layer events) and the subtype indicates the event
type (eg. mount, unmount; EIO, EDQUOT; link, unlink).  The info field
indicates a number of things, including the entry length, an ID assigned to
a watchpoint contributing to this buffer and type-specific flags.

Supplementary data, such as the key ID that generated an event, can be
attached in additional slots.  The maximum message size is 127 bytes.
Messages may not be padded or aligned, so there is no guarantee, for
example, that the notification type will be on a 4-byte boundary.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-19 15:08:24 +01:00
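Splitting a read() buffer into individual notifications using the length carried in each header, as described in the commit above (c73be61ced), might look like the following user-space sketch. It rests on assumptions that are not stated in the log: the type is taken to occupy the low 24 bits of the first word and the subtype the high 8 bits (the usual little-endian bitfield layout), and the entry length is taken to sit in the low 7 bits of info. model_note and model_next_note() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_HEADER_SIZE      8u	/* two 32-bit words */
#define MODEL_INFO_LENGTH_MASK 0x7fu	/* assumption: length in low 7 bits */

struct model_note {
	uint32_t type;
	uint32_t subtype;
	uint32_t len;	/* total record length, header included */
};

/*
 * Decode the header at buf + off. Returns the offset of the next record,
 * or 0 if no complete, well-formed record starts at off.
 * memcpy() is used because records are not guaranteed to be aligned.
 */
size_t model_next_note(const unsigned char *buf, size_t size, size_t off,
		       struct model_note *out)
{
	uint32_t w0, info;

	assert(buf != NULL && out != NULL);

	if (off > size || size - off < MODEL_HEADER_SIZE)
		return 0;
	memcpy(&w0, buf + off, sizeof(w0));
	memcpy(&info, buf + off + sizeof(w0), sizeof(info));

	out->type = w0 & 0xffffffu;	/* assumed bitfield layout */
	out->subtype = w0 >> 24;
	out->len = info & MODEL_INFO_LENGTH_MASK;

	if (out->len < MODEL_HEADER_SIZE || out->len > size - off)
		return 0;
	return off + out->len;
}
```

A caller would start at offset 0 and keep passing the returned offset back in until the function returns 0, handling each decoded record in between.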