bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
|
|
|
/* Copyright (c) 2017 - 2018 Covalent IO, Inc. http://covalent.io */
|
|
|
|
|
|
|
|
#include <linux/skmsg.h>
|
|
|
|
#include <linux/filter.h>
|
|
|
|
#include <linux/bpf.h>
|
|
|
|
#include <linux/init.h>
|
|
|
|
#include <linux/wait.h>
|
2023-03-29 13:38:18 +00:00
|
|
|
#include <linux/util_macros.h>
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
|
|
|
|
#include <net/inet_common.h>
|
bpf: sk_msg, sock{map|hash} redirect through ULP
A sockmap program that redirects through a kTLS ULP enabled socket
will not work correctly because the ULP layer is skipped. This
fixes the behavior to call through the ULP layer on redirect to
ensure any operations required on the data stream at the ULP layer
continue to be applied.
To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
calling the BPF layer on a redirected message. This is
required to avoid calling the BPF layer multiple times (possibly
recursively) which is not the current/expected behavior without
ULPs. In the future we may add a redirect flag if users _do_
want the policy applied again but this would need to work for both
ULP and non-ULP sockets and be opt-in to avoid breaking existing
programs.
Also to avoid polluting the flag space with an internal flag we
reuse the flag space overlapping MSG_SENDPAGE_NOPOLICY with
MSG_WAITFORONE. Here WAITFORONE is specific to recv path and
SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing
to verify is user space API is masked correctly to ensure the flag
can not be set by user. (Note this needs to be true regardless
because we have internal flags already in-use that user space
should not be able to set). But for completeness we have two UAPI
paths into sendpage, sendfile and splice.
In the sendfile case the function do_sendfile() zero's flags,
./fs/read_write.c:
static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
size_t count, loff_t max)
{
...
fl = 0;
#if 0
/*
* We need to debate whether we can enable this or not. The
* man page documents EAGAIN return for the output at least,
* and the application is arguably buggy if it doesn't expect
* EAGAIN on a non-blocking file descriptor.
*/
if (in.file->f_flags & O_NONBLOCK)
fl = SPLICE_F_NONBLOCK;
#endif
file_start_write(out.file);
retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
}
In the splice case the pipe_to_sendpage "actor" is used which
masks flags with SPLICE_F_MORE.
./fs/splice.c:
static int pipe_to_sendpage(struct pipe_inode_info *pipe,
struct pipe_buffer *buf, struct splice_desc *sd)
{
...
more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
...
}
Confirming what we expect that internal flags are in fact internal
to socket side.
Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-20 19:35:35 +00:00
|
|
|
#include <net/tls.h>
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
|
|
|
|
static int bpf_tcp_ingress(struct sock *sk, struct sk_psock *psock,
|
|
|
|
struct sk_msg *msg, u32 apply_bytes, int flags)
|
|
|
|
{
|
|
|
|
bool apply = apply_bytes;
|
|
|
|
struct scatterlist *sge;
|
|
|
|
u32 size, copied = 0;
|
|
|
|
struct sk_msg *tmp;
|
|
|
|
int i, ret = 0;
|
|
|
|
|
|
|
|
tmp = kzalloc(sizeof(*tmp), __GFP_NOWARN | GFP_KERNEL);
|
|
|
|
if (unlikely(!tmp))
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
lock_sock(sk);
|
|
|
|
tmp->sg.start = msg->sg.start;
|
|
|
|
i = msg->sg.start;
|
|
|
|
do {
|
|
|
|
sge = sk_msg_elem(msg, i);
|
|
|
|
size = (apply && apply_bytes < sge->length) ?
|
|
|
|
apply_bytes : sge->length;
|
|
|
|
if (!sk_wmem_schedule(sk, size)) {
|
|
|
|
if (!copied)
|
|
|
|
ret = -ENOMEM;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
sk_mem_charge(sk, size);
|
|
|
|
sk_msg_xfer(tmp, msg, i, size);
|
|
|
|
copied += size;
|
|
|
|
if (sge->length)
|
|
|
|
get_page(sk_msg_page(tmp, i));
|
|
|
|
sk_msg_iter_var_next(i);
|
|
|
|
tmp->sg.end = i;
|
|
|
|
if (apply) {
|
|
|
|
apply_bytes -= size;
|
2023-03-21 10:02:07 +00:00
|
|
|
if (!apply_bytes) {
|
|
|
|
if (sge->length)
|
|
|
|
sk_msg_iter_var_prev(i);
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
break;
|
2023-03-21 10:02:07 +00:00
|
|
|
}
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
}
|
|
|
|
} while (i != msg->sg.end);
|
|
|
|
|
|
|
|
if (!ret) {
|
|
|
|
msg->sg.start = i;
|
|
|
|
sk_psock_queue_msg(psock, tmp);
|
2018-12-20 19:35:33 +00:00
|
|
|
sk_psock_data_ready(sk, psock);
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
} else {
|
|
|
|
sk_msg_free(sk, tmp);
|
|
|
|
kfree(tmp);
|
|
|
|
}
|
|
|
|
|
|
|
|
release_sock(sk);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int tcp_bpf_push(struct sock *sk, struct sk_msg *msg, u32 apply_bytes,
|
|
|
|
int flags, bool uncharge)
|
|
|
|
{
|
|
|
|
bool apply = apply_bytes;
|
|
|
|
struct scatterlist *sge;
|
|
|
|
struct page *page;
|
|
|
|
int size, ret = 0;
|
|
|
|
u32 off;
|
|
|
|
|
|
|
|
while (1) {
|
bpf: sk_msg, sock{map|hash} redirect through ULP
A sockmap program that redirects through a kTLS ULP enabled socket
will not work correctly because the ULP layer is skipped. This
fixes the behavior to call through the ULP layer on redirect to
ensure any operations required on the data stream at the ULP layer
continue to be applied.
To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
calling the BPF layer on a redirected message. This is
required to avoid calling the BPF layer multiple times (possibly
recursively) which is not the current/expected behavior without
ULPs. In the future we may add a redirect flag if users _do_
want the policy applied again but this would need to work for both
ULP and non-ULP sockets and be opt-in to avoid breaking existing
programs.
Also to avoid polluting the flag space with an internal flag we
reuse the flag space overlapping MSG_SENDPAGE_NOPOLICY with
MSG_WAITFORONE. Here WAITFORONE is specific to recv path and
SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing
to verify is user space API is masked correctly to ensure the flag
can not be set by user. (Note this needs to be true regardless
because we have internal flags already in-use that user space
should not be able to set). But for completeness we have two UAPI
paths into sendpage, sendfile and splice.
In the sendfile case the function do_sendfile() zero's flags,
./fs/read_write.c:
static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
size_t count, loff_t max)
{
...
fl = 0;
#if 0
/*
* We need to debate whether we can enable this or not. The
* man page documents EAGAIN return for the output at least,
* and the application is arguably buggy if it doesn't expect
* EAGAIN on a non-blocking file descriptor.
*/
if (in.file->f_flags & O_NONBLOCK)
fl = SPLICE_F_NONBLOCK;
#endif
file_start_write(out.file);
retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
}
In the splice case the pipe_to_sendpage "actor" is used which
masks flags with SPLICE_F_MORE.
./fs/splice.c:
static int pipe_to_sendpage(struct pipe_inode_info *pipe,
struct pipe_buffer *buf, struct splice_desc *sd)
{
...
more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
...
}
Confirming what we expect that internal flags are in fact internal
to socket side.
Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-20 19:35:35 +00:00
|
|
|
bool has_tx_ulp;
|
|
|
|
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
sge = sk_msg_elem(msg, msg->sg.start);
|
|
|
|
size = (apply && apply_bytes < sge->length) ?
|
|
|
|
apply_bytes : sge->length;
|
|
|
|
off = sge->offset;
|
|
|
|
page = sg_page(sge);
|
|
|
|
|
|
|
|
tcp_rate_check_app_limited(sk);
|
|
|
|
retry:
|
bpf: sk_msg, sock{map|hash} redirect through ULP
A sockmap program that redirects through a kTLS ULP enabled socket
will not work correctly because the ULP layer is skipped. This
fixes the behavior to call through the ULP layer on redirect to
ensure any operations required on the data stream at the ULP layer
continue to be applied.
To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
calling the BPF layer on a redirected message. This is
required to avoid calling the BPF layer multiple times (possibly
recursively) which is not the current/expected behavior without
ULPs. In the future we may add a redirect flag if users _do_
want the policy applied again but this would need to work for both
ULP and non-ULP sockets and be opt-in to avoid breaking existing
programs.
Also to avoid polluting the flag space with an internal flag we
reuse the flag space overlapping MSG_SENDPAGE_NOPOLICY with
MSG_WAITFORONE. Here WAITFORONE is specific to recv path and
SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing
to verify is user space API is masked correctly to ensure the flag
can not be set by user. (Note this needs to be true regardless
because we have internal flags already in-use that user space
should not be able to set). But for completeness we have two UAPI
paths into sendpage, sendfile and splice.
In the sendfile case the function do_sendfile() zero's flags,
./fs/read_write.c:
static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
size_t count, loff_t max)
{
...
fl = 0;
#if 0
/*
* We need to debate whether we can enable this or not. The
* man page documents EAGAIN return for the output at least,
* and the application is arguably buggy if it doesn't expect
* EAGAIN on a non-blocking file descriptor.
*/
if (in.file->f_flags & O_NONBLOCK)
fl = SPLICE_F_NONBLOCK;
#endif
file_start_write(out.file);
retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
}
In the splice case the pipe_to_sendpage "actor" is used which
masks flags with SPLICE_F_MORE.
./fs/splice.c:
static int pipe_to_sendpage(struct pipe_inode_info *pipe,
struct pipe_buffer *buf, struct splice_desc *sd)
{
...
more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
...
}
Confirming what we expect that internal flags are in fact internal
to socket side.
Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-20 19:35:35 +00:00
|
|
|
has_tx_ulp = tls_sw_has_ctx_tx(sk);
|
|
|
|
if (has_tx_ulp) {
|
|
|
|
flags |= MSG_SENDPAGE_NOPOLICY;
|
|
|
|
ret = kernel_sendpage_locked(sk,
|
|
|
|
page, off, size, flags);
|
|
|
|
} else {
|
|
|
|
ret = do_tcp_sendpages(sk, page, off, size, flags);
|
|
|
|
}
|
|
|
|
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
if (ret <= 0)
|
|
|
|
return ret;
|
|
|
|
if (apply)
|
|
|
|
apply_bytes -= ret;
|
|
|
|
msg->sg.size -= ret;
|
|
|
|
sge->offset += ret;
|
|
|
|
sge->length -= ret;
|
|
|
|
if (uncharge)
|
|
|
|
sk_mem_uncharge(sk, ret);
|
|
|
|
if (ret != size) {
|
|
|
|
size -= ret;
|
|
|
|
off += ret;
|
|
|
|
goto retry;
|
|
|
|
}
|
|
|
|
if (!sge->length) {
|
|
|
|
put_page(page);
|
|
|
|
sk_msg_iter_next(msg, start);
|
|
|
|
sg_init_table(sge, 1);
|
|
|
|
if (msg->sg.start == msg->sg.end)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (apply && !apply_bytes)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int tcp_bpf_push_locked(struct sock *sk, struct sk_msg *msg,
|
|
|
|
u32 apply_bytes, int flags, bool uncharge)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
lock_sock(sk);
|
|
|
|
ret = tcp_bpf_push(sk, msg, apply_bytes, flags, uncharge);
|
|
|
|
release_sock(sk);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2023-03-21 10:02:06 +00:00
|
|
|
int tcp_bpf_sendmsg_redir(struct sock *sk, bool ingress,
|
|
|
|
struct sk_msg *msg, u32 bytes, int flags)
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
{
|
|
|
|
struct sk_psock *psock = sk_psock_get(sk);
|
|
|
|
int ret;
|
|
|
|
|
2022-09-08 09:58:06 +00:00
|
|
|
if (unlikely(!psock))
|
|
|
|
return -EPIPE;
|
|
|
|
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
ret = ingress ? bpf_tcp_ingress(sk, psock, msg, bytes, flags) :
|
|
|
|
tcp_bpf_push_locked(sk, msg, bytes, flags, false);
|
|
|
|
sk_psock_put(sk, psock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(tcp_bpf_sendmsg_redir);
|
|
|
|
|
2021-02-23 18:49:26 +00:00
|
|
|
#ifdef CONFIG_BPF_SYSCALL
|
2021-06-29 22:45:27 +00:00
|
|
|
static int tcp_msg_wait_data(struct sock *sk, struct sk_psock *psock,
|
|
|
|
long timeo)
|
2021-06-15 02:13:35 +00:00
|
|
|
{
|
|
|
|
DEFINE_WAIT_FUNC(wait, woken_wake_function);
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (sk->sk_shutdown & RCV_SHUTDOWN)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
if (!timeo)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
add_wait_queue(sk_sleep(sk), &wait);
|
|
|
|
sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
|
|
|
|
ret = sk_wait_event(sk, &timeo,
|
|
|
|
!list_empty(&psock->ingress_msg) ||
|
|
|
|
!skb_queue_empty(&sk->sk_receive_queue), &wait);
|
|
|
|
sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
|
|
|
|
remove_wait_queue(sk_sleep(sk), &wait);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619
commit c5d2177a72a1659554922728fc407f59950aa929
Author: John Fastabend <john.fastabend@gmail.com>
Date: Wed Nov 3 13:47:34 2021 -0700
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
A socket in a sockmap may have different combinations of programs attached
depending on configuration. There can be no programs in which case the socket
acts as a sink only. There can be a TX program in this case a BPF program is
attached to sending side, but no RX program is attached. There can be an RX
program only where sends have no BPF program attached, but receives are hooked
with BPF. And finally, both TX and RX programs may be attached. Giving us the
permutations:
None, Tx, Rx, and TxRx
To date most of our use cases have been TX case being used as a fast datapath
to directly copy between local application and a userspace proxy. Or Rx cases
and TxRX applications that are operating an in kernel based proxy. The traffic
in the first case where we hook applications into a userspace application looks
like this:
AppA redirect AppB
Tx <-----------> Rx
| |
+ +
TCP <--> lo <--> TCP
In this case all traffic from AppA (after 3whs) is copied into the AppB
ingress queue and no traffic is ever on the TCP recieive_queue.
In the second case the application never receives, except in some rare error
cases, traffic on the actual user space socket. Instead the send happens in
the kernel.
AppProxy socket pool
sk0 ------------->{sk1,sk2, skn}
^ |
| |
| v
ingress lb egress
TCP TCP
Here because traffic is never read off the socket with userspace recv() APIs
there is only ever one reader on the sk receive_queue. Namely the BPF programs.
However, we've started to introduce a third configuration where the BPF program
on receive should process the data, but then the normal case is to push the
data into the receive queue of AppB.
AppB
recv() (userspace)
-----------------------
tcp_bpf_recvmsg() (kernel)
| |
| |
| |
ingress_msgQ |
| |
RX_BPF |
| |
v v
sk->receive_queue
This is different from the App{A,B} redirect because traffic is first received
on the sk->receive_queue.
Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
queue for any data handled by the BPF rx program and returned with PASS code
so that it was enqueued on the ingress msg queue. Then if no data exists on
that queue it checks the socket receive queue. Unfortunately, this is the same
receive_queue the BPF program is reading data off of. So we get a race. Its
possible for the recvmsg() hook to pull data off the receive_queue before the
BPF hook has a chance to read it. It typically happens when an application is
banging on recv() and getting EAGAINs. Until they manage to race with the RX
BPF program.
To fix this we note that before this patch at attach time when the socket is
loaded into the map we check if it needs a TX program or just the base set of
proto bpf hooks. Then it uses the above general RX hook regardless of if we
have a BPF program attached at rx or not. This patch now extends this check to
handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above
race when an RX program is attached we use a new hook that is nearly identical
to the old one except now we do not let the recv() call skip the RX BPF program.
Now only the BPF program pulls data from sk->receive_queue and recv() only
pulls data from the ingress msgQ post BPF program handling.
With this resolved our AppB from above has been up and running for many hours
without detecting any errors. We do this by correlating counters in RX BPF
events and the AppB to ensure data is never skipping the BPF program. Selftests,
was not able to detect this because we only run them for a short period of time
on well ordered send/recvs so we don't get any of the noise we see in real
application environments.
Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-05-05 08:53:02 +00:00
|
|
|
static int tcp_bpf_recvmsg_parser(struct sock *sk,
|
|
|
|
struct msghdr *msg,
|
|
|
|
size_t len,
|
|
|
|
int nonblock,
|
|
|
|
int flags,
|
|
|
|
int *addr_len)
|
|
|
|
{
|
|
|
|
struct sk_psock *psock;
|
|
|
|
int copied;
|
|
|
|
|
|
|
|
if (unlikely(flags & MSG_ERRQUEUE))
|
|
|
|
return inet_recv_error(sk, msg, len, addr_len);
|
|
|
|
|
bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930
Conflicts:
- net/ipv4/udp_bpf.c: context difference due to missing ec095263a965 ("net:
remove noblock parameter from recvmsg() entities")
- net/ipv4/tcp_bpf.c: context difference due to missing ec095263a965 ("net:
remove noblock parameter from recvmsg() entities")
commit d900f3d20cc3169ce42ec72acc850e662a4d4db2
Author: Liu Jian <liujian56@huawei.com>
Date: Fri Mar 3 16:09:46 2023 +0800
bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser()
When the buffer length of the recvmsg system call is 0, we got the
flollowing soft lockup problem:
watchdog: BUG: soft lockup - CPU#3 stuck for 27s! [a.out:6149]
CPU: 3 PID: 6149 Comm: a.out Kdump: loaded Not tainted 6.2.0+ #30
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:remove_wait_queue+0xb/0xc0
Code: 5e 41 5f c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 57 <41> 56 41 55 41 54 55 48 89 fd 53 48 89 f3 4c 8d 6b 18 4c 8d 73 20
RSP: 0018:ffff88811b5978b8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff88811a7d3780 RCX: ffffffffb7a4d768
RDX: dffffc0000000000 RSI: ffff88811b597908 RDI: ffff888115408040
RBP: 1ffff110236b2f1b R08: 0000000000000000 R09: ffff88811a7d37e7
R10: ffffed10234fa6fc R11: 0000000000000001 R12: ffff88811179b800
R13: 0000000000000001 R14: ffff88811a7d38a8 R15: ffff88811a7d37e0
FS: 00007f6fb5398740(0000) GS:ffff888237180000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000000 CR3: 000000010b6ba002 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
tcp_msg_wait_data+0x279/0x2f0
tcp_bpf_recvmsg_parser+0x3c6/0x490
inet_recvmsg+0x280/0x290
sock_recvmsg+0xfc/0x120
____sys_recvmsg+0x160/0x3d0
___sys_recvmsg+0xf0/0x180
__sys_recvmsg+0xea/0x1a0
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
The logic in tcp_bpf_recvmsg_parser is as follows:
msg_bytes_ready:
copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
if (!copied) {
wait data;
goto msg_bytes_ready;
}
In this case, "copied" always is 0, the infinite loop occurs.
According to the Linux system call man page, 0 should be returned in this
case. Therefore, in tcp_bpf_recvmsg_parser(), if the length is 0, directly
return. Also modify several other functions with the same problem.
Fixes: 1f5be6b3b063 ("udp: Implement udp_bpf_recvmsg() for sockmap")
Fixes: 9825d866ce0d ("af_unix: Implement unix_dgram_bpf_recvmsg()")
Fixes: c5d2177a72a1 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230303080946.1146638-1-liujian56@huawei.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-05-10 13:22:56 +00:00
|
|
|
if (!len)
|
|
|
|
return 0;
|
|
|
|
|
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619
commit c5d2177a72a1659554922728fc407f59950aa929
Author: John Fastabend <john.fastabend@gmail.com>
Date: Wed Nov 3 13:47:34 2021 -0700
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
A socket in a sockmap may have different combinations of programs attached
depending on configuration. There can be no programs in which case the socket
acts as a sink only. There can be a TX program in this case a BPF program is
attached to sending side, but no RX program is attached. There can be an RX
program only where sends have no BPF program attached, but receives are hooked
with BPF. And finally, both TX and RX programs may be attached. Giving us the
permutations:
None, Tx, Rx, and TxRx
To date most of our use cases have been TX case being used as a fast datapath
to directly copy between local application and a userspace proxy. Or Rx cases
and TxRX applications that are operating an in kernel based proxy. The traffic
in the first case where we hook applications into a userspace application looks
like this:
AppA redirect AppB
Tx <-----------> Rx
| |
+ +
TCP <--> lo <--> TCP
In this case all traffic from AppA (after 3whs) is copied into the AppB
ingress queue and no traffic is ever on the TCP recieive_queue.
In the second case the application never receives, except in some rare error
cases, traffic on the actual user space socket. Instead the send happens in
the kernel.
AppProxy socket pool
sk0 ------------->{sk1,sk2, skn}
^ |
| |
| v
ingress lb egress
TCP TCP
Here because traffic is never read off the socket with userspace recv() APIs
there is only ever one reader on the sk receive_queue. Namely the BPF programs.
However, we've started to introduce a third configuration where the BPF program
on receive should process the data, but then the normal case is to push the
data into the receive queue of AppB.
AppB
recv() (userspace)
-----------------------
tcp_bpf_recvmsg() (kernel)
| |
| |
| |
ingress_msgQ |
| |
RX_BPF |
| |
v v
sk->receive_queue
This is different from the App{A,B} redirect because traffic is first received
on the sk->receive_queue.
Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
queue for any data handled by the BPF rx program and returned with PASS code
so that it was enqueued on the ingress msg queue. Then if no data exists on
that queue it checks the socket receive queue. Unfortunately, this is the same
receive_queue the BPF program is reading data off of. So we get a race. Its
possible for the recvmsg() hook to pull data off the receive_queue before the
BPF hook has a chance to read it. It typically happens when an application is
banging on recv() and getting EAGAINs. Until they manage to race with the RX
BPF program.
To fix this we note that before this patch at attach time when the socket is
loaded into the map we check if it needs a TX program or just the base set of
proto bpf hooks. Then it uses the above general RX hook regardless of if we
have a BPF program attached at rx or not. This patch now extends this check to
handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above
race when an RX program is attached we use a new hook that is nearly identical
to the old one except now we do not let the recv() call skip the RX BPF program.
Now only the BPF program pulls data from sk->receive_queue and recv() only
pulls data from the ingress msgQ post BPF program handling.
With this resolved our AppB from above has been up and running for many hours
without detecting any errors. We do this by correlating counters in RX BPF
events and the AppB to ensure data is never skipping the BPF program. Selftests,
was not able to detect this because we only run them for a short period of time
on well ordered send/recvs so we don't get any of the noise we see in real
application environments.
Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-05-05 08:53:02 +00:00
|
|
|
psock = sk_psock_get(sk);
|
|
|
|
if (unlikely(!psock))
|
|
|
|
return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
|
|
|
|
|
|
|
|
lock_sock(sk);
|
|
|
|
msg_bytes_ready:
|
|
|
|
copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
|
|
|
|
if (!copied) {
|
|
|
|
long timeo;
|
|
|
|
int data;
|
|
|
|
|
bpf, sockmap: Fix return codes from tcp_bpf_recvmsg_parser()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619
commit 5b2c5540b8110eea0d67a78fb0ddb9654c58daeb
Author: John Fastabend <john.fastabend@gmail.com>
Date: Tue Jan 4 12:59:18 2022 -0800
bpf, sockmap: Fix return codes from tcp_bpf_recvmsg_parser()
Applications can be confused slightly because we do not always return the
same error code as expected, e.g. what the TCP stack normally returns. For
example on a sock err sk->sk_err instead of returning the sock_error we
return EAGAIN. This usually means the application will 'try again'
instead of aborting immediately. Another example, when a shutdown event
is received we should immediately abort instead of waiting for data when
the user provides a timeout.
These tend to not be fatal, applications usually recover, but introduces
bogus errors to the user or introduces unexpected latency. Before
'c5d2177a72a16' we fell back to the TCP stack when no data was available
so we managed to catch many of the cases here, although with the extra
latency cost of calling tcp_msg_wait_data() first.
To fix lets duplicate the error handling in TCP stack into tcp_bpf so
that we get the same error codes.
These were found in our CI tests that run applications against sockmap
and do longer lived testing, at least compared to test_sockmap that
does short-lived ping/pong tests, and in some of our test clusters
we deploy.
Its non-trivial to do these in a shorter form CI tests that would be
appropriate for BPF selftests, but we are looking into it so we can
ensure this keeps working going forward. As a preview one idea is to
pull in the packetdrill testing which catches some of this.
Fixes: c5d2177a72a16 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220104205918.286416-1-john.fastabend@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-05-05 10:35:55 +00:00
|
|
|
if (sock_flag(sk, SOCK_DONE))
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (sk->sk_err) {
|
|
|
|
copied = sock_error(sk);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (sk->sk_shutdown & RCV_SHUTDOWN)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
if (sk->sk_state == TCP_CLOSE) {
|
|
|
|
copied = -ENOTCONN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619
commit c5d2177a72a1659554922728fc407f59950aa929
Author: John Fastabend <john.fastabend@gmail.com>
Date: Wed Nov 3 13:47:34 2021 -0700
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
A socket in a sockmap may have different combinations of programs attached
depending on configuration. There can be no programs in which case the socket
acts as a sink only. There can be a TX program in this case a BPF program is
attached to sending side, but no RX program is attached. There can be an RX
program only where sends have no BPF program attached, but receives are hooked
with BPF. And finally, both TX and RX programs may be attached. Giving us the
permutations:
None, Tx, Rx, and TxRx
To date most of our use cases have been TX case being used as a fast datapath
to directly copy between local application and a userspace proxy. Or Rx cases
and TxRX applications that are operating an in kernel based proxy. The traffic
in the first case where we hook applications into a userspace application looks
like this:
AppA redirect AppB
Tx <-----------> Rx
| |
+ +
TCP <--> lo <--> TCP
In this case all traffic from AppA (after 3whs) is copied into the AppB
ingress queue and no traffic is ever on the TCP recieive_queue.
In the second case the application never receives, except in some rare error
cases, traffic on the actual user space socket. Instead the send happens in
the kernel.
AppProxy socket pool
sk0 ------------->{sk1,sk2, skn}
^ |
| |
| v
ingress lb egress
TCP TCP
Here because traffic is never read off the socket with userspace recv() APIs
there is only ever one reader on the sk receive_queue. Namely the BPF programs.
However, we've started to introduce a third configuration where the BPF program
on receive should process the data, but then the normal case is to push the
data into the receive queue of AppB.
AppB
recv() (userspace)
-----------------------
tcp_bpf_recvmsg() (kernel)
| |
| |
| |
ingress_msgQ |
| |
RX_BPF |
| |
v v
sk->receive_queue
This is different from the App{A,B} redirect because traffic is first received
on the sk->receive_queue.
Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
queue for any data handled by the BPF rx program and returned with PASS code
so that it was enqueued on the ingress msg queue. Then if no data exists on
that queue it checks the socket receive queue. Unfortunately, this is the same
receive_queue the BPF program is reading data off of. So we get a race. Its
possible for the recvmsg() hook to pull data off the receive_queue before the
BPF hook has a chance to read it. It typically happens when an application is
banging on recv() and getting EAGAINs. Until they manage to race with the RX
BPF program.
To fix this we note that before this patch at attach time when the socket is
loaded into the map we check if it needs a TX program or just the base set of
proto bpf hooks. Then it uses the above general RX hook regardless of if we
have a BPF program attached at rx or not. This patch now extends this check to
handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above
race when an RX program is attached we use a new hook that is nearly identical
to the old one except now we do not let the recv() call skip the RX BPF program.
Now only the BPF program pulls data from sk->receive_queue and recv() only
pulls data from the ingress msgQ post BPF program handling.
With this resolved our AppB from above has been up and running for many hours
without detecting any errors. We do this by correlating counters in RX BPF
events and the AppB to ensure data is never skipping the BPF program. Selftests,
was not able to detect this because we only run them for a short period of time
on well ordered send/recvs so we don't get any of the noise we see in real
application environments.
Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-05-05 08:53:02 +00:00
|
|
|
timeo = sock_rcvtimeo(sk, nonblock);
|
bpf, sockmap: Fix return codes from tcp_bpf_recvmsg_parser()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619
commit 5b2c5540b8110eea0d67a78fb0ddb9654c58daeb
Author: John Fastabend <john.fastabend@gmail.com>
Date: Tue Jan 4 12:59:18 2022 -0800
bpf, sockmap: Fix return codes from tcp_bpf_recvmsg_parser()
Applications can be confused slightly because we do not always return the
same error code as expected, e.g. what the TCP stack normally returns. For
example on a sock err sk->sk_err instead of returning the sock_error we
return EAGAIN. This usually means the application will 'try again'
instead of aborting immediately. Another example, when a shutdown event
is received we should immediately abort instead of waiting for data when
the user provides a timeout.
These tend to not be fatal, applications usually recover, but introduces
bogus errors to the user or introduces unexpected latency. Before
'c5d2177a72a16' we fell back to the TCP stack when no data was available
so we managed to catch many of the cases here, although with the extra
latency cost of calling tcp_msg_wait_data() first.
To fix lets duplicate the error handling in TCP stack into tcp_bpf so
that we get the same error codes.
These were found in our CI tests that run applications against sockmap
and do longer lived testing, at least compared to test_sockmap that
does short-lived ping/pong tests, and in some of our test clusters
we deploy.
Its non-trivial to do these in a shorter form CI tests that would be
appropriate for BPF selftests, but we are looking into it so we can
ensure this keeps working going forward. As a preview one idea is to
pull in the packetdrill testing which catches some of this.
Fixes: c5d2177a72a16 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220104205918.286416-1-john.fastabend@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-05-05 10:35:55 +00:00
|
|
|
if (!timeo) {
|
|
|
|
copied = -EAGAIN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (signal_pending(current)) {
|
|
|
|
copied = sock_intr_errno(timeo);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619
commit c5d2177a72a1659554922728fc407f59950aa929
Author: John Fastabend <john.fastabend@gmail.com>
Date: Wed Nov 3 13:47:34 2021 -0700
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
A socket in a sockmap may have different combinations of programs attached
depending on configuration. There can be no programs in which case the socket
acts as a sink only. There can be a TX program in this case a BPF program is
attached to sending side, but no RX program is attached. There can be an RX
program only where sends have no BPF program attached, but receives are hooked
with BPF. And finally, both TX and RX programs may be attached. Giving us the
permutations:
None, Tx, Rx, and TxRx
To date most of our use cases have been TX case being used as a fast datapath
to directly copy between local application and a userspace proxy. Or Rx cases
and TxRX applications that are operating an in kernel based proxy. The traffic
in the first case where we hook applications into a userspace application looks
like this:
AppA redirect AppB
Tx <-----------> Rx
| |
+ +
TCP <--> lo <--> TCP
In this case all traffic from AppA (after 3whs) is copied into the AppB
ingress queue and no traffic is ever on the TCP recieive_queue.
In the second case the application never receives, except in some rare error
cases, traffic on the actual user space socket. Instead the send happens in
the kernel.
AppProxy socket pool
sk0 ------------->{sk1,sk2, skn}
^ |
| |
| v
ingress lb egress
TCP TCP
Here because traffic is never read off the socket with userspace recv() APIs
there is only ever one reader on the sk receive_queue. Namely the BPF programs.
However, we've started to introduce a third configuration where the BPF program
on receive should process the data, but then the normal case is to push the
data into the receive queue of AppB.
AppB
recv() (userspace)
-----------------------
tcp_bpf_recvmsg() (kernel)
| |
| |
| |
ingress_msgQ |
| |
RX_BPF |
| |
v v
sk->receive_queue
This is different from the App{A,B} redirect because traffic is first received
on the sk->receive_queue.
Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
queue for any data handled by the BPF rx program and returned with PASS code
so that it was enqueued on the ingress msg queue. Then if no data exists on
that queue it checks the socket receive queue. Unfortunately, this is the same
receive_queue the BPF program is reading data off of. So we get a race. Its
possible for the recvmsg() hook to pull data off the receive_queue before the
BPF hook has a chance to read it. It typically happens when an application is
banging on recv() and getting EAGAINs. Until they manage to race with the RX
BPF program.
To fix this we note that before this patch at attach time when the socket is
loaded into the map we check if it needs a TX program or just the base set of
proto bpf hooks. Then it uses the above general RX hook regardless of if we
have a BPF program attached at rx or not. This patch now extends this check to
handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above
race when an RX program is attached we use a new hook that is nearly identical
to the old one except now we do not let the recv() call skip the RX BPF program.
Now only the BPF program pulls data from sk->receive_queue and recv() only
pulls data from the ingress msgQ post BPF program handling.
With this resolved our AppB from above has been up and running for many hours
without detecting any errors. We do this by correlating counters in RX BPF
events and the AppB to ensure data is never skipping the BPF program. Selftests,
was not able to detect this because we only run them for a short period of time
on well ordered send/recvs so we don't get any of the noise we see in real
application environments.
Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-05-05 08:53:02 +00:00
|
|
|
data = tcp_msg_wait_data(sk, psock, timeo);
|
2023-10-23 14:09:13 +00:00
|
|
|
if (data < 0) {
|
|
|
|
copied = data;
|
|
|
|
goto unlock;
|
|
|
|
}
|
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619
commit c5d2177a72a1659554922728fc407f59950aa929
Author: John Fastabend <john.fastabend@gmail.com>
Date: Wed Nov 3 13:47:34 2021 -0700
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
A socket in a sockmap may have different combinations of programs attached
depending on configuration. There can be no programs in which case the socket
acts as a sink only. There can be a TX program in this case a BPF program is
attached to sending side, but no RX program is attached. There can be an RX
program only where sends have no BPF program attached, but receives are hooked
with BPF. And finally, both TX and RX programs may be attached. Giving us the
permutations:
None, Tx, Rx, and TxRx
To date most of our use cases have been TX case being used as a fast datapath
to directly copy between local application and a userspace proxy. Or Rx cases
and TxRX applications that are operating an in kernel based proxy. The traffic
in the first case where we hook applications into a userspace application looks
like this:
AppA redirect AppB
Tx <-----------> Rx
| |
+ +
TCP <--> lo <--> TCP
In this case all traffic from AppA (after 3whs) is copied into the AppB
ingress queue and no traffic is ever on the TCP recieive_queue.
In the second case the application never receives, except in some rare error
cases, traffic on the actual user space socket. Instead the send happens in
the kernel.
AppProxy socket pool
sk0 ------------->{sk1,sk2, skn}
^ |
| |
| v
ingress lb egress
TCP TCP
Here because traffic is never read off the socket with userspace recv() APIs
there is only ever one reader on the sk receive_queue. Namely the BPF programs.
However, we've started to introduce a third configuration where the BPF program
on receive should process the data, but then the normal case is to push the
data into the receive queue of AppB.
AppB
recv() (userspace)
-----------------------
tcp_bpf_recvmsg() (kernel)
| |
| |
| |
ingress_msgQ |
| |
RX_BPF |
| |
v v
sk->receive_queue
This is different from the App{A,B} redirect because traffic is first received
on the sk->receive_queue.
Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
queue for any data handled by the BPF rx program and returned with PASS code
so that it was enqueued on the ingress msg queue. Then if no data exists on
that queue it checks the socket receive queue. Unfortunately, this is the same
receive_queue the BPF program is reading data off of. So we get a race. Its
possible for the recvmsg() hook to pull data off the receive_queue before the
BPF hook has a chance to read it. It typically happens when an application is
banging on recv() and getting EAGAINs. Until they manage to race with the RX
BPF program.
To fix this we note that before this patch at attach time when the socket is
loaded into the map we check if it needs a TX program or just the base set of
proto bpf hooks. Then it uses the above general RX hook regardless of if we
have a BPF program attached at rx or not. This patch now extends this check to
handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above
race when an RX program is attached we use a new hook that is nearly identical
to the old one except now we do not let the recv() call skip the RX BPF program.
Now only the BPF program pulls data from sk->receive_queue and recv() only
pulls data from the ingress msgQ post BPF program handling.
With this resolved our AppB from above has been up and running for many hours
without detecting any errors. We do this by correlating counters in RX BPF
events and the AppB to ensure data is never skipping the BPF program. Selftests,
was not able to detect this because we only run them for a short period of time
on well ordered send/recvs so we don't get any of the noise we see in real
application environments.
Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-05-05 08:53:02 +00:00
|
|
|
if (data && !sk_psock_queue_empty(psock))
|
|
|
|
goto msg_bytes_ready;
|
|
|
|
copied = -EAGAIN;
|
|
|
|
}
|
bpf, sockmap: Fix return codes from tcp_bpf_recvmsg_parser()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619
commit 5b2c5540b8110eea0d67a78fb0ddb9654c58daeb
Author: John Fastabend <john.fastabend@gmail.com>
Date: Tue Jan 4 12:59:18 2022 -0800
bpf, sockmap: Fix return codes from tcp_bpf_recvmsg_parser()
Applications can be confused slightly because we do not always return the
same error code as expected, e.g. what the TCP stack normally returns. For
example on a sock err sk->sk_err instead of returning the sock_error we
return EAGAIN. This usually means the application will 'try again'
instead of aborting immediately. Another example, when a shutdown event
is received we should immediately abort instead of waiting for data when
the user provides a timeout.
These tend to not be fatal, applications usually recover, but introduces
bogus errors to the user or introduces unexpected latency. Before
'c5d2177a72a16' we fell back to the TCP stack when no data was available
so we managed to catch many of the cases here, although with the extra
latency cost of calling tcp_msg_wait_data() first.
To fix lets duplicate the error handling in TCP stack into tcp_bpf so
that we get the same error codes.
These were found in our CI tests that run applications against sockmap
and do longer lived testing, at least compared to test_sockmap that
does short-lived ping/pong tests, and in some of our test clusters
we deploy.
Its non-trivial to do these in a shorter form CI tests that would be
appropriate for BPF selftests, but we are looking into it so we can
ensure this keeps working going forward. As a preview one idea is to
pull in the packetdrill testing which catches some of this.
Fixes: c5d2177a72a16 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220104205918.286416-1-john.fastabend@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-05-05 10:35:55 +00:00
|
|
|
out:
|
2023-10-23 14:09:13 +00:00
|
|
|
unlock:
|
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619
commit c5d2177a72a1659554922728fc407f59950aa929
Author: John Fastabend <john.fastabend@gmail.com>
Date: Wed Nov 3 13:47:34 2021 -0700
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
A socket in a sockmap may have different combinations of programs attached
depending on configuration. There can be no programs in which case the socket
acts as a sink only. There can be a TX program in this case a BPF program is
attached to sending side, but no RX program is attached. There can be an RX
program only where sends have no BPF program attached, but receives are hooked
with BPF. And finally, both TX and RX programs may be attached. Giving us the
permutations:
None, Tx, Rx, and TxRx
To date most of our use cases have been TX case being used as a fast datapath
to directly copy between local application and a userspace proxy. Or Rx cases
and TxRX applications that are operating an in kernel based proxy. The traffic
in the first case where we hook applications into a userspace application looks
like this:
AppA redirect AppB
Tx <-----------> Rx
| |
+ +
TCP <--> lo <--> TCP
In this case all traffic from AppA (after 3whs) is copied into the AppB
ingress queue and no traffic is ever on the TCP recieive_queue.
In the second case the application never receives, except in some rare error
cases, traffic on the actual user space socket. Instead the send happens in
the kernel.
AppProxy socket pool
sk0 ------------->{sk1,sk2, skn}
^ |
| |
| v
ingress lb egress
TCP TCP
Here because traffic is never read off the socket with userspace recv() APIs
there is only ever one reader on the sk receive_queue. Namely the BPF programs.
However, we've started to introduce a third configuration where the BPF program
on receive should process the data, but then the normal case is to push the
data into the receive queue of AppB.
AppB
recv() (userspace)
-----------------------
tcp_bpf_recvmsg() (kernel)
| |
| |
| |
ingress_msgQ |
| |
RX_BPF |
| |
v v
sk->receive_queue
This is different from the App{A,B} redirect because traffic is first received
on the sk->receive_queue.
Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
queue for any data handled by the BPF rx program and returned with PASS code
so that it was enqueued on the ingress msg queue. Then if no data exists on
that queue it checks the socket receive queue. Unfortunately, this is the same
receive_queue the BPF program is reading data off of. So we get a race. Its
possible for the recvmsg() hook to pull data off the receive_queue before the
BPF hook has a chance to read it. It typically happens when an application is
banging on recv() and getting EAGAINs. Until they manage to race with the RX
BPF program.
To fix this we note that before this patch at attach time when the socket is
loaded into the map we check if it needs a TX program or just the base set of
proto bpf hooks. Then it uses the above general RX hook regardless of if we
have a BPF program attached at rx or not. This patch now extends this check to
handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above
race when an RX program is attached we use a new hook that is nearly identical
to the old one except now we do not let the recv() call skip the RX BPF program.
Now only the BPF program pulls data from sk->receive_queue and recv() only
pulls data from the ingress msgQ post BPF program handling.
With this resolved our AppB from above has been up and running for many hours
without detecting any errors. We do this by correlating counters in RX BPF
events and the AppB to ensure data is never skipping the BPF program. Selftests,
was not able to detect this because we only run them for a short period of time
on well ordered send/recvs so we don't get any of the noise we see in real
application environments.
Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-05-05 08:53:02 +00:00
|
|
|
release_sock(sk);
|
|
|
|
sk_psock_put(sk, psock);
|
|
|
|
return copied;
|
|
|
|
}
|
|
|
|
|
2020-03-20 02:34:26 +00:00
|
|
|
static int tcp_bpf_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
|
|
|
|
int nonblock, int flags, int *addr_len)
|
|
|
|
{
|
|
|
|
struct sk_psock *psock;
|
|
|
|
int copied, ret;
|
|
|
|
|
2020-04-26 03:35:15 +00:00
|
|
|
if (unlikely(flags & MSG_ERRQUEUE))
|
|
|
|
return inet_recv_error(sk, msg, len, addr_len);
|
|
|
|
|
bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser()
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2178930
Conflicts:
- net/ipv4/udp_bpf.c: context difference due to missing ec095263a965 ("net:
remove noblock parameter from recvmsg() entities")
- net/ipv4/tcp_bpf.c: context difference due to missing ec095263a965 ("net:
remove noblock parameter from recvmsg() entities")
commit d900f3d20cc3169ce42ec72acc850e662a4d4db2
Author: Liu Jian <liujian56@huawei.com>
Date: Fri Mar 3 16:09:46 2023 +0800
bpf, sockmap: Fix an infinite loop error when len is 0 in tcp_bpf_recvmsg_parser()
When the buffer length of the recvmsg system call is 0, we got the
flollowing soft lockup problem:
watchdog: BUG: soft lockup - CPU#3 stuck for 27s! [a.out:6149]
CPU: 3 PID: 6149 Comm: a.out Kdump: loaded Not tainted 6.2.0+ #30
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:remove_wait_queue+0xb/0xc0
Code: 5e 41 5f c3 cc cc cc cc 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 57 <41> 56 41 55 41 54 55 48 89 fd 53 48 89 f3 4c 8d 6b 18 4c 8d 73 20
RSP: 0018:ffff88811b5978b8 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff88811a7d3780 RCX: ffffffffb7a4d768
RDX: dffffc0000000000 RSI: ffff88811b597908 RDI: ffff888115408040
RBP: 1ffff110236b2f1b R08: 0000000000000000 R09: ffff88811a7d37e7
R10: ffffed10234fa6fc R11: 0000000000000001 R12: ffff88811179b800
R13: 0000000000000001 R14: ffff88811a7d38a8 R15: ffff88811a7d37e0
FS: 00007f6fb5398740(0000) GS:ffff888237180000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000000 CR3: 000000010b6ba002 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
tcp_msg_wait_data+0x279/0x2f0
tcp_bpf_recvmsg_parser+0x3c6/0x490
inet_recvmsg+0x280/0x290
sock_recvmsg+0xfc/0x120
____sys_recvmsg+0x160/0x3d0
___sys_recvmsg+0xf0/0x180
__sys_recvmsg+0xea/0x1a0
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
The logic in tcp_bpf_recvmsg_parser is as follows:
msg_bytes_ready:
copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
if (!copied) {
wait data;
goto msg_bytes_ready;
}
In this case, "copied" always is 0, the infinite loop occurs.
According to the Linux system call man page, 0 should be returned in this
case. Therefore, in tcp_bpf_recvmsg_parser(), if the length is 0, directly
return. Also modify several other functions with the same problem.
Fixes: 1f5be6b3b063 ("udp: Implement udp_bpf_recvmsg() for sockmap")
Fixes: 9825d866ce0d ("af_unix: Implement unix_dgram_bpf_recvmsg()")
Fixes: c5d2177a72a1 ("bpf, sockmap: Fix race in ingress receive verdict with redirect to self")
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20230303080946.1146638-1-liujian56@huawei.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2023-05-10 13:22:56 +00:00
|
|
|
if (!len)
|
|
|
|
return 0;
|
|
|
|
|
2020-03-20 02:34:26 +00:00
|
|
|
psock = sk_psock_get(sk);
|
|
|
|
if (unlikely(!psock))
|
|
|
|
return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
|
|
|
|
if (!skb_queue_empty(&sk->sk_receive_queue) &&
|
2020-04-26 03:35:15 +00:00
|
|
|
sk_psock_queue_empty(psock)) {
|
|
|
|
sk_psock_put(sk, psock);
|
2020-03-20 02:34:26 +00:00
|
|
|
return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
|
2020-04-26 03:35:15 +00:00
|
|
|
}
|
2020-03-20 02:34:26 +00:00
|
|
|
lock_sock(sk);
|
|
|
|
msg_bytes_ready:
|
2021-03-31 02:32:33 +00:00
|
|
|
copied = sk_msg_recvmsg(sk, psock, msg, len, flags);
|
2020-03-20 02:34:26 +00:00
|
|
|
if (!copied) {
|
|
|
|
long timeo;
|
2021-05-17 02:23:48 +00:00
|
|
|
int data;
|
2020-03-20 02:34:26 +00:00
|
|
|
|
|
|
|
timeo = sock_rcvtimeo(sk, nonblock);
|
2021-06-29 22:45:27 +00:00
|
|
|
data = tcp_msg_wait_data(sk, psock, timeo);
|
2023-10-23 14:09:13 +00:00
|
|
|
if (data < 0) {
|
|
|
|
ret = data;
|
|
|
|
goto unlock;
|
|
|
|
}
|
2020-03-20 02:34:26 +00:00
|
|
|
if (data) {
|
|
|
|
if (!sk_psock_queue_empty(psock))
|
|
|
|
goto msg_bytes_ready;
|
|
|
|
release_sock(sk);
|
|
|
|
sk_psock_put(sk, psock);
|
|
|
|
return tcp_recvmsg(sk, msg, len, nonblock, flags, addr_len);
|
|
|
|
}
|
|
|
|
copied = -EAGAIN;
|
|
|
|
}
|
|
|
|
ret = copied;
|
2023-10-23 14:09:13 +00:00
|
|
|
|
|
|
|
unlock:
|
2020-03-20 02:34:26 +00:00
|
|
|
release_sock(sk);
|
|
|
|
sk_psock_put(sk, psock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
static int tcp_bpf_send_verdict(struct sock *sk, struct sk_psock *psock,
|
|
|
|
struct sk_msg *msg, int *copied, int flags)
|
|
|
|
{
|
2023-03-21 10:02:06 +00:00
|
|
|
bool cork = false, enospc = sk_msg_full(msg), redir_ingress;
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
struct sock *sk_redir;
|
2022-11-11 14:24:06 +00:00
|
|
|
u32 tosend, origsize, sent, delta = 0;
|
2023-03-21 10:02:06 +00:00
|
|
|
u32 eval;
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
more_data:
|
2018-11-26 22:16:17 +00:00
|
|
|
if (psock->eval == __SK_NONE) {
|
|
|
|
/* Track delta in msg size to add/subtract it on SK_DROP from
|
|
|
|
* returned to user copied size. This ensures user doesn't
|
|
|
|
* get a positive return code with msg_cut_data and SK_DROP
|
|
|
|
* verdict.
|
|
|
|
*/
|
|
|
|
delta = msg->sg.size;
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
psock->eval = sk_psock_msg_verdict(sk, psock, msg);
|
2020-01-11 06:12:06 +00:00
|
|
|
delta -= msg->sg.size;
|
2018-11-26 22:16:17 +00:00
|
|
|
}
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
|
|
|
|
if (msg->cork_bytes &&
|
|
|
|
msg->cork_bytes > msg->sg.size && !enospc) {
|
|
|
|
psock->cork_bytes = msg->cork_bytes - msg->sg.size;
|
|
|
|
if (!psock->cork) {
|
|
|
|
psock->cork = kzalloc(sizeof(*psock->cork),
|
|
|
|
GFP_ATOMIC | __GFP_NOWARN);
|
|
|
|
if (!psock->cork)
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
memcpy(psock->cork, msg, sizeof(*msg));
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
tosend = msg->sg.size;
|
|
|
|
if (psock->apply_bytes && psock->apply_bytes < tosend)
|
|
|
|
tosend = psock->apply_bytes;
|
2023-03-21 10:02:06 +00:00
|
|
|
eval = __SK_NONE;
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
|
|
|
|
switch (psock->eval) {
|
|
|
|
case __SK_PASS:
|
|
|
|
ret = tcp_bpf_push(sk, msg, tosend, flags, true);
|
|
|
|
if (unlikely(ret)) {
|
|
|
|
*copied -= sk_msg_free(sk, msg);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
sk_msg_apply_bytes(psock, tosend);
|
|
|
|
break;
|
|
|
|
case __SK_REDIRECT:
|
2023-03-21 10:02:06 +00:00
|
|
|
redir_ingress = psock->redir_ingress;
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
sk_redir = psock->sk_redir;
|
|
|
|
sk_msg_apply_bytes(psock, tosend);
|
2022-05-12 15:29:54 +00:00
|
|
|
if (!psock->apply_bytes) {
|
|
|
|
/* Clean up before releasing the sock lock. */
|
|
|
|
eval = psock->eval;
|
|
|
|
psock->eval = __SK_NONE;
|
|
|
|
psock->sk_redir = NULL;
|
|
|
|
}
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
if (psock->cork) {
|
|
|
|
cork = true;
|
|
|
|
psock->cork = NULL;
|
|
|
|
}
|
2022-11-11 14:24:06 +00:00
|
|
|
sk_msg_return(sk, msg, tosend);
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
release_sock(sk);
|
2022-05-12 15:29:54 +00:00
|
|
|
|
2022-11-11 14:24:06 +00:00
|
|
|
origsize = msg->sg.size;
|
2023-03-21 10:02:06 +00:00
|
|
|
ret = tcp_bpf_sendmsg_redir(sk_redir, redir_ingress,
|
|
|
|
msg, tosend, flags);
|
2022-11-11 14:24:06 +00:00
|
|
|
sent = origsize - msg->sg.size;
|
2022-05-12 15:29:54 +00:00
|
|
|
|
|
|
|
if (eval == __SK_REDIRECT)
|
|
|
|
sock_put(sk_redir);
|
|
|
|
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
lock_sock(sk);
|
|
|
|
if (unlikely(ret < 0)) {
|
|
|
|
int free = sk_msg_free_nocharge(sk, msg);
|
|
|
|
|
|
|
|
if (!cork)
|
|
|
|
*copied -= free;
|
|
|
|
}
|
|
|
|
if (cork) {
|
|
|
|
sk_msg_free(sk, msg);
|
|
|
|
kfree(msg);
|
|
|
|
msg = NULL;
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
case __SK_DROP:
|
|
|
|
default:
|
|
|
|
sk_msg_free_partial(sk, msg, tosend);
|
|
|
|
sk_msg_apply_bytes(psock, tosend);
|
2018-11-26 22:16:17 +00:00
|
|
|
*copied -= (tosend + delta);
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
return -EACCES;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (likely(!ret)) {
|
|
|
|
if (!psock->apply_bytes) {
|
|
|
|
psock->eval = __SK_NONE;
|
|
|
|
if (psock->sk_redir) {
|
|
|
|
sock_put(psock->sk_redir);
|
|
|
|
psock->sk_redir = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (msg &&
|
|
|
|
msg->sg.data[msg->sg.start].page_link &&
|
2022-09-08 09:58:06 +00:00
|
|
|
msg->sg.data[msg->sg.start].length) {
|
|
|
|
if (eval == __SK_REDIRECT)
|
2022-11-11 14:24:06 +00:00
|
|
|
sk_mem_charge(sk, tosend - sent);
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
goto more_data;
|
2022-09-08 09:58:06 +00:00
|
|
|
}
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
|
|
|
|
{
|
|
|
|
struct sk_msg tmp, *msg_tx = NULL;
|
|
|
|
int copied = 0, err = 0;
|
|
|
|
struct sk_psock *psock;
|
|
|
|
long timeo;
|
2019-08-08 00:03:59 +00:00
|
|
|
int flags;
|
|
|
|
|
|
|
|
/* Don't let internal do_tcp_sendpages() flags through */
|
|
|
|
flags = (msg->msg_flags & ~MSG_SENDPAGE_DECRYPTED);
|
|
|
|
flags |= MSG_NO_SHARED_FRAGS;
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
|
|
|
|
psock = sk_psock_get(sk);
|
|
|
|
if (unlikely(!psock))
|
|
|
|
return tcp_sendmsg(sk, msg, size);
|
|
|
|
|
|
|
|
lock_sock(sk);
|
|
|
|
timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
|
|
|
|
while (msg_data_left(msg)) {
|
|
|
|
bool enospc = false;
|
|
|
|
u32 copy, osize;
|
|
|
|
|
|
|
|
if (sk->sk_err) {
|
|
|
|
err = -sk->sk_err;
|
|
|
|
goto out_err;
|
|
|
|
}
|
|
|
|
|
|
|
|
copy = msg_data_left(msg);
|
|
|
|
if (!sk_stream_memory_free(sk))
|
|
|
|
goto wait_for_sndbuf;
|
|
|
|
if (psock->cork) {
|
|
|
|
msg_tx = psock->cork;
|
|
|
|
} else {
|
|
|
|
msg_tx = &tmp;
|
|
|
|
sk_msg_init(msg_tx);
|
|
|
|
}
|
|
|
|
|
|
|
|
osize = msg_tx->sg.size;
|
|
|
|
err = sk_msg_alloc(sk, msg_tx, msg_tx->sg.size + copy, msg_tx->sg.end - 1);
|
|
|
|
if (err) {
|
|
|
|
if (err != -ENOSPC)
|
|
|
|
goto wait_for_memory;
|
|
|
|
enospc = true;
|
|
|
|
copy = msg_tx->sg.size - osize;
|
|
|
|
}
|
|
|
|
|
|
|
|
err = sk_msg_memcopy_from_iter(sk, &msg->msg_iter, msg_tx,
|
|
|
|
copy);
|
|
|
|
if (err < 0) {
|
|
|
|
sk_msg_trim(sk, msg_tx, osize);
|
|
|
|
goto out_err;
|
|
|
|
}
|
|
|
|
|
|
|
|
copied += copy;
|
|
|
|
if (psock->cork_bytes) {
|
|
|
|
if (size > psock->cork_bytes)
|
|
|
|
psock->cork_bytes = 0;
|
|
|
|
else
|
|
|
|
psock->cork_bytes -= size;
|
|
|
|
if (psock->cork_bytes && !enospc)
|
|
|
|
goto out_err;
|
|
|
|
/* All cork bytes are accounted, rerun the prog. */
|
|
|
|
psock->eval = __SK_NONE;
|
|
|
|
psock->cork_bytes = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
err = tcp_bpf_send_verdict(sk, psock, msg_tx, &copied, flags);
|
|
|
|
if (unlikely(err < 0))
|
|
|
|
goto out_err;
|
|
|
|
continue;
|
|
|
|
wait_for_sndbuf:
|
|
|
|
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
|
|
|
|
wait_for_memory:
|
|
|
|
err = sk_stream_wait_memory(sk, &timeo);
|
|
|
|
if (err) {
|
|
|
|
if (msg_tx && msg_tx != psock->cork)
|
|
|
|
sk_msg_free(sk, msg_tx);
|
|
|
|
goto out_err;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
out_err:
|
|
|
|
if (err < 0)
|
|
|
|
err = sk_stream_error(sk, msg->msg_flags, err);
|
|
|
|
release_sock(sk);
|
|
|
|
sk_psock_put(sk, psock);
|
tcp_bpf: fix return value of tcp_bpf_sendmsg()
JIRA: https://issues.redhat.com/browse/RHEL-68071
JIRA: https://issues.redhat.com/browse/RHEL-59445
CVE: CVE-2024-46783
Conflicts:
- net/ipv4/tcp_bpf.c: We don't have dc97391e6610 ("sock: Remove
->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)"), therefore we have
both tcp_bpf_sendmsg() and tcp_bpf_sendpage(), where upstream only has
tcp_bpf_sendmsg(). tcp_bpf_sendpage() follows a similar pattern as
described in the patch: copied could become negative through
tcp_bpf_send_verdict() and would be returned directly. Apply the same fix
to only return copied if the value is positive.
commit fe1910f9337bd46a9343967b547ccab26b4b2c6e
Author: Cong Wang <cong.wang@bytedance.com>
Date: Tue Aug 20 20:07:44 2024 -0700
tcp_bpf: fix return value of tcp_bpf_sendmsg()
When we cork messages in psock->cork, the last message triggers the
flushing will result in sending a sk_msg larger than the current
message size. In this case, in tcp_bpf_send_verdict(), 'copied' becomes
negative at least in the following case:
468 case __SK_DROP:
469 default:
470 sk_msg_free_partial(sk, msg, tosend);
471 sk_msg_apply_bytes(psock, tosend);
472 *copied -= (tosend + delta); // <==== HERE
473 return -EACCES;
Therefore, it could lead to the following BUG with a proper value of
'copied' (thanks to syzbot). We should not use negative 'copied' as a
return value here.
------------[ cut here ]------------
kernel BUG at net/socket.c:733!
Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 UID: 0 PID: 3265 Comm: syz-executor510 Not tainted 6.11.0-rc3-syzkaller-00060-gd07b43284ab3 #0
Hardware name: linux,dummy-virt (DT)
pstate: 61400009 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
pc : sock_sendmsg_nosec net/socket.c:733 [inline]
pc : sock_sendmsg_nosec net/socket.c:728 [inline]
pc : __sock_sendmsg+0x5c/0x60 net/socket.c:745
lr : sock_sendmsg_nosec net/socket.c:730 [inline]
lr : __sock_sendmsg+0x54/0x60 net/socket.c:745
sp : ffff800088ea3b30
x29: ffff800088ea3b30 x28: fbf00000062bc900 x27: 0000000000000000
x26: ffff800088ea3bc0 x25: ffff800088ea3bc0 x24: 0000000000000000
x23: f9f00000048dc000 x22: 0000000000000000 x21: ffff800088ea3d90
x20: f9f00000048dc000 x19: ffff800088ea3d90 x18: 0000000000000001
x17: 0000000000000000 x16: 0000000000000000 x15: 000000002002ffaf
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: ffff8000815849c0 x9 : ffff8000815b49c0
x8 : 0000000000000000 x7 : 000000000000003f x6 : 0000000000000000
x5 : 00000000000007e0 x4 : fff07ffffd239000 x3 : fbf00000062bc900
x2 : 0000000000000000 x1 : 0000000000000000 x0 : 00000000fffffdef
Call trace:
sock_sendmsg_nosec net/socket.c:733 [inline]
__sock_sendmsg+0x5c/0x60 net/socket.c:745
____sys_sendmsg+0x274/0x2ac net/socket.c:2597
___sys_sendmsg+0xac/0x100 net/socket.c:2651
__sys_sendmsg+0x84/0xe0 net/socket.c:2680
__do_sys_sendmsg net/socket.c:2689 [inline]
__se_sys_sendmsg net/socket.c:2687 [inline]
__arm64_sys_sendmsg+0x24/0x30 net/socket.c:2687
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x48/0x110 arch/arm64/kernel/syscall.c:49
el0_svc_common.constprop.0+0x40/0xe0 arch/arm64/kernel/syscall.c:132
do_el0_svc+0x1c/0x28 arch/arm64/kernel/syscall.c:151
el0_svc+0x34/0xec arch/arm64/kernel/entry-common.c:712
el0t_64_sync_handler+0x100/0x12c arch/arm64/kernel/entry-common.c:730
el0t_64_sync+0x19c/0x1a0 arch/arm64/kernel/entry.S:598
Code: f9404463 d63f0060 3108441f 54fffe81 (d4210000)
---[ end trace 0000000000000000 ]---
Fixes: 4f738adba30a ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
Reported-by: syzbot+58c03971700330ce14d8@syzkaller.appspotmail.com
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20240821030744.320934-1-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2025-01-08 17:25:09 +00:00
|
|
|
return copied > 0 ? copied : err;
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int tcp_bpf_sendpage(struct sock *sk, struct page *page, int offset,
|
|
|
|
size_t size, int flags)
|
|
|
|
{
|
|
|
|
struct sk_msg tmp, *msg = NULL;
|
|
|
|
int err = 0, copied = 0;
|
|
|
|
struct sk_psock *psock;
|
|
|
|
bool enospc = false;
|
|
|
|
|
|
|
|
psock = sk_psock_get(sk);
|
|
|
|
if (unlikely(!psock))
|
|
|
|
return tcp_sendpage(sk, page, offset, size, flags);
|
|
|
|
|
|
|
|
lock_sock(sk);
|
|
|
|
if (psock->cork) {
|
|
|
|
msg = psock->cork;
|
|
|
|
} else {
|
|
|
|
msg = &tmp;
|
|
|
|
sk_msg_init(msg);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Catch case where ring is full and sendpage is stalled. */
|
|
|
|
if (unlikely(sk_msg_full(msg)))
|
|
|
|
goto out_err;
|
|
|
|
|
|
|
|
sk_msg_page_add(msg, page, size, offset);
|
|
|
|
sk_mem_charge(sk, size);
|
|
|
|
copied = size;
|
|
|
|
if (sk_msg_full(msg))
|
|
|
|
enospc = true;
|
|
|
|
if (psock->cork_bytes) {
|
|
|
|
if (size > psock->cork_bytes)
|
|
|
|
psock->cork_bytes = 0;
|
|
|
|
else
|
|
|
|
psock->cork_bytes -= size;
|
|
|
|
if (psock->cork_bytes && !enospc)
|
|
|
|
goto out_err;
|
|
|
|
/* All cork bytes are accounted, rerun the prog. */
|
|
|
|
psock->eval = __SK_NONE;
|
|
|
|
psock->cork_bytes = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
err = tcp_bpf_send_verdict(sk, psock, msg, &copied, flags);
|
|
|
|
out_err:
|
|
|
|
release_sock(sk);
|
|
|
|
sk_psock_put(sk, psock);
|
tcp_bpf: fix return value of tcp_bpf_sendmsg()
JIRA: https://issues.redhat.com/browse/RHEL-68071
JIRA: https://issues.redhat.com/browse/RHEL-59445
CVE: CVE-2024-46783
Conflicts:
- net/ipv4/tcp_bpf.c: We don't have dc97391e6610 ("sock: Remove
->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)"), therefore we have
both tcp_bpf_sendmsg() and tcp_bpf_sendpage(), where upstream only has
tcp_bpf_sendmsg(). tcp_bpf_sendpage() follows a similar pattern as
described in the patch: copied could become negative through
tcp_bpf_send_verdict() and would be returned directly. Apply the same fix
to only return copied if the value is positive.
commit fe1910f9337bd46a9343967b547ccab26b4b2c6e
Author: Cong Wang <cong.wang@bytedance.com>
Date: Tue Aug 20 20:07:44 2024 -0700
tcp_bpf: fix return value of tcp_bpf_sendmsg()
When we cork messages in psock->cork, the last message triggers the
flushing will result in sending a sk_msg larger than the current
message size. In this case, in tcp_bpf_send_verdict(), 'copied' becomes
negative at least in the following case:
468 case __SK_DROP:
469 default:
470 sk_msg_free_partial(sk, msg, tosend);
471 sk_msg_apply_bytes(psock, tosend);
472 *copied -= (tosend + delta); // <==== HERE
473 return -EACCES;
Therefore, it could lead to the following BUG with a proper value of
'copied' (thanks to syzbot). We should not use negative 'copied' as a
return value here.
------------[ cut here ]------------
kernel BUG at net/socket.c:733!
Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 UID: 0 PID: 3265 Comm: syz-executor510 Not tainted 6.11.0-rc3-syzkaller-00060-gd07b43284ab3 #0
Hardware name: linux,dummy-virt (DT)
pstate: 61400009 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
pc : sock_sendmsg_nosec net/socket.c:733 [inline]
pc : sock_sendmsg_nosec net/socket.c:728 [inline]
pc : __sock_sendmsg+0x5c/0x60 net/socket.c:745
lr : sock_sendmsg_nosec net/socket.c:730 [inline]
lr : __sock_sendmsg+0x54/0x60 net/socket.c:745
sp : ffff800088ea3b30
x29: ffff800088ea3b30 x28: fbf00000062bc900 x27: 0000000000000000
x26: ffff800088ea3bc0 x25: ffff800088ea3bc0 x24: 0000000000000000
x23: f9f00000048dc000 x22: 0000000000000000 x21: ffff800088ea3d90
x20: f9f00000048dc000 x19: ffff800088ea3d90 x18: 0000000000000001
x17: 0000000000000000 x16: 0000000000000000 x15: 000000002002ffaf
x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: ffff8000815849c0 x9 : ffff8000815b49c0
x8 : 0000000000000000 x7 : 000000000000003f x6 : 0000000000000000
x5 : 00000000000007e0 x4 : fff07ffffd239000 x3 : fbf00000062bc900
x2 : 0000000000000000 x1 : 0000000000000000 x0 : 00000000fffffdef
Call trace:
sock_sendmsg_nosec net/socket.c:733 [inline]
__sock_sendmsg+0x5c/0x60 net/socket.c:745
____sys_sendmsg+0x274/0x2ac net/socket.c:2597
___sys_sendmsg+0xac/0x100 net/socket.c:2651
__sys_sendmsg+0x84/0xe0 net/socket.c:2680
__do_sys_sendmsg net/socket.c:2689 [inline]
__se_sys_sendmsg net/socket.c:2687 [inline]
__arm64_sys_sendmsg+0x24/0x30 net/socket.c:2687
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x48/0x110 arch/arm64/kernel/syscall.c:49
el0_svc_common.constprop.0+0x40/0xe0 arch/arm64/kernel/syscall.c:132
do_el0_svc+0x1c/0x28 arch/arm64/kernel/syscall.c:151
el0_svc+0x34/0xec arch/arm64/kernel/entry-common.c:712
el0t_64_sync_handler+0x100/0x12c arch/arm64/kernel/entry-common.c:730
el0t_64_sync+0x19c/0x1a0 arch/arm64/kernel/entry.S:598
Code: f9404463 d63f0060 3108441f 54fffe81 (d4210000)
---[ end trace 0000000000000000 ]---
Fixes: 4f738adba30a ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data")
Reported-by: syzbot+58c03971700330ce14d8@syzkaller.appspotmail.com
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20240821030744.320934-1-xiyou.wangcong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2025-01-08 17:25:09 +00:00
|
|
|
return copied > 0 ? copied : err;
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
enum {
|
|
|
|
TCP_BPF_IPV4,
|
|
|
|
TCP_BPF_IPV6,
|
|
|
|
TCP_BPF_NUM_PROTS,
|
|
|
|
};
|
|
|
|
|
|
|
|
enum {
|
|
|
|
TCP_BPF_BASE,
|
|
|
|
TCP_BPF_TX,
|
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619
commit c5d2177a72a1659554922728fc407f59950aa929
Author: John Fastabend <john.fastabend@gmail.com>
Date: Wed Nov 3 13:47:34 2021 -0700
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
A socket in a sockmap may have different combinations of programs attached
depending on configuration. There can be no programs in which case the socket
acts as a sink only. There can be a TX program in this case a BPF program is
attached to sending side, but no RX program is attached. There can be an RX
program only where sends have no BPF program attached, but receives are hooked
with BPF. And finally, both TX and RX programs may be attached. Giving us the
permutations:
None, Tx, Rx, and TxRx
To date most of our use cases have been TX case being used as a fast datapath
to directly copy between local application and a userspace proxy. Or Rx cases
and TxRX applications that are operating an in kernel based proxy. The traffic
in the first case where we hook applications into a userspace application looks
like this:
AppA redirect AppB
Tx <-----------> Rx
| |
+ +
TCP <--> lo <--> TCP
In this case all traffic from AppA (after 3whs) is copied into the AppB
ingress queue and no traffic is ever on the TCP recieive_queue.
In the second case the application never receives, except in some rare error
cases, traffic on the actual user space socket. Instead the send happens in
the kernel.
AppProxy socket pool
sk0 ------------->{sk1,sk2, skn}
^ |
| |
| v
ingress lb egress
TCP TCP
Here because traffic is never read off the socket with userspace recv() APIs
there is only ever one reader on the sk receive_queue. Namely the BPF programs.
However, we've started to introduce a third configuration where the BPF program
on receive should process the data, but then the normal case is to push the
data into the receive queue of AppB.
AppB
recv() (userspace)
-----------------------
tcp_bpf_recvmsg() (kernel)
| |
| |
| |
ingress_msgQ |
| |
RX_BPF |
| |
v v
sk->receive_queue
This is different from the App{A,B} redirect because traffic is first received
on the sk->receive_queue.
Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
queue for any data handled by the BPF rx program and returned with PASS code
so that it was enqueued on the ingress msg queue. Then if no data exists on
that queue it checks the socket receive queue. Unfortunately, this is the same
receive_queue the BPF program is reading data off of. So we get a race. Its
possible for the recvmsg() hook to pull data off the receive_queue before the
BPF hook has a chance to read it. It typically happens when an application is
banging on recv() and getting EAGAINs. Until they manage to race with the RX
BPF program.
To fix this we note that before this patch at attach time when the socket is
loaded into the map we check if it needs a TX program or just the base set of
proto bpf hooks. Then it uses the above general RX hook regardless of if we
have a BPF program attached at rx or not. This patch now extends this check to
handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above
race when an RX program is attached we use a new hook that is nearly identical
to the old one except now we do not let the recv() call skip the RX BPF program.
Now only the BPF program pulls data from sk->receive_queue and recv() only
pulls data from the ingress msgQ post BPF program handling.
With this resolved our AppB from above has been up and running for many hours
without detecting any errors. We do this by correlating counters in RX BPF
events and the AppB to ensure data is never skipping the BPF program. Selftests,
was not able to detect this because we only run them for a short period of time
on well ordered send/recvs so we don't get any of the noise we see in real
application environments.
Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-05-05 08:53:02 +00:00
|
|
|
TCP_BPF_RX,
|
|
|
|
TCP_BPF_TXRX,
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
TCP_BPF_NUM_CFGS,
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct proto *tcpv6_prot_saved __read_mostly;
|
|
|
|
static DEFINE_SPINLOCK(tcpv6_prot_lock);
|
|
|
|
static struct proto tcp_bpf_prots[TCP_BPF_NUM_PROTS][TCP_BPF_NUM_CFGS];
|
|
|
|
|
|
|
|
static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS],
|
|
|
|
struct proto *base)
|
|
|
|
{
|
|
|
|
prot[TCP_BPF_BASE] = *base;
|
2022-10-31 09:12:45 +00:00
|
|
|
prot[TCP_BPF_BASE].destroy = sock_map_destroy;
|
2020-03-09 11:12:36 +00:00
|
|
|
prot[TCP_BPF_BASE].close = sock_map_close;
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
prot[TCP_BPF_BASE].recvmsg = tcp_bpf_recvmsg;
|
2022-05-12 15:29:53 +00:00
|
|
|
prot[TCP_BPF_BASE].sock_is_readable = sk_msg_is_readable;
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
|
|
|
|
prot[TCP_BPF_TX] = prot[TCP_BPF_BASE];
|
|
|
|
prot[TCP_BPF_TX].sendmsg = tcp_bpf_sendmsg;
|
|
|
|
prot[TCP_BPF_TX].sendpage = tcp_bpf_sendpage;
|
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619
commit c5d2177a72a1659554922728fc407f59950aa929
Author: John Fastabend <john.fastabend@gmail.com>
Date: Wed Nov 3 13:47:34 2021 -0700
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
A socket in a sockmap may have different combinations of programs attached
depending on configuration. There can be no programs in which case the socket
acts as a sink only. There can be a TX program in this case a BPF program is
attached to sending side, but no RX program is attached. There can be an RX
program only where sends have no BPF program attached, but receives are hooked
with BPF. And finally, both TX and RX programs may be attached. Giving us the
permutations:
None, Tx, Rx, and TxRx
To date most of our use cases have been TX case being used as a fast datapath
to directly copy between local application and a userspace proxy. Or Rx cases
and TxRX applications that are operating an in kernel based proxy. The traffic
in the first case where we hook applications into a userspace application looks
like this:
AppA redirect AppB
Tx <-----------> Rx
| |
+ +
TCP <--> lo <--> TCP
In this case all traffic from AppA (after 3whs) is copied into the AppB
ingress queue and no traffic is ever on the TCP recieive_queue.
In the second case the application never receives, except in some rare error
cases, traffic on the actual user space socket. Instead the send happens in
the kernel.
AppProxy socket pool
sk0 ------------->{sk1,sk2, skn}
^ |
| |
| v
ingress lb egress
TCP TCP
Here because traffic is never read off the socket with userspace recv() APIs
there is only ever one reader on the sk receive_queue. Namely the BPF programs.
However, we've started to introduce a third configuration where the BPF program
on receive should process the data, but then the normal case is to push the
data into the receive queue of AppB.
AppB
recv() (userspace)
-----------------------
tcp_bpf_recvmsg() (kernel)
| |
| |
| |
ingress_msgQ |
| |
RX_BPF |
| |
v v
sk->receive_queue
This is different from the App{A,B} redirect because traffic is first received
on the sk->receive_queue.
Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
queue for any data handled by the BPF rx program and returned with PASS code
so that it was enqueued on the ingress msg queue. Then if no data exists on
that queue it checks the socket receive queue. Unfortunately, this is the same
receive_queue the BPF program is reading data off of. So we get a race. Its
possible for the recvmsg() hook to pull data off the receive_queue before the
BPF hook has a chance to read it. It typically happens when an application is
banging on recv() and getting EAGAINs. Until they manage to race with the RX
BPF program.
To fix this we note that before this patch at attach time when the socket is
loaded into the map we check if it needs a TX program or just the base set of
proto bpf hooks. Then it uses the above general RX hook regardless of if we
have a BPF program attached at rx or not. This patch now extends this check to
handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above
race when an RX program is attached we use a new hook that is nearly identical
to the old one except now we do not let the recv() call skip the RX BPF program.
Now only the BPF program pulls data from sk->receive_queue and recv() only
pulls data from the ingress msgQ post BPF program handling.
With this resolved our AppB from above has been up and running for many hours
without detecting any errors. We do this by correlating counters in RX BPF
events and the AppB to ensure data is never skipping the BPF program. Selftests,
was not able to detect this because we only run them for a short period of time
on well ordered send/recvs so we don't get any of the noise we see in real
application environments.
Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-05-05 08:53:02 +00:00
|
|
|
|
|
|
|
prot[TCP_BPF_RX] = prot[TCP_BPF_BASE];
|
|
|
|
prot[TCP_BPF_RX].recvmsg = tcp_bpf_recvmsg_parser;
|
|
|
|
|
|
|
|
prot[TCP_BPF_TXRX] = prot[TCP_BPF_TX];
|
|
|
|
prot[TCP_BPF_TXRX].recvmsg = tcp_bpf_recvmsg_parser;
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
}
|
|
|
|
|
2020-08-21 10:29:43 +00:00
|
|
|
static void tcp_bpf_check_v6_needs_rebuild(struct proto *ops)
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
{
|
2020-08-21 10:29:43 +00:00
|
|
|
if (unlikely(ops != smp_load_acquire(&tcpv6_prot_saved))) {
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
spin_lock_bh(&tcpv6_prot_lock);
|
|
|
|
if (likely(ops != tcpv6_prot_saved)) {
|
|
|
|
tcp_bpf_rebuild_protos(tcp_bpf_prots[TCP_BPF_IPV6], ops);
|
|
|
|
smp_store_release(&tcpv6_prot_saved, ops);
|
|
|
|
}
|
|
|
|
spin_unlock_bh(&tcpv6_prot_lock);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static int __init tcp_bpf_v4_build_proto(void)
|
|
|
|
{
|
|
|
|
tcp_bpf_rebuild_protos(tcp_bpf_prots[TCP_BPF_IPV4], &tcp_prot);
|
|
|
|
return 0;
|
|
|
|
}
|
2021-07-12 19:55:46 +00:00
|
|
|
late_initcall(tcp_bpf_v4_build_proto);
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
|
|
|
|
static int tcp_bpf_assert_proto_ops(struct proto *ops)
|
|
|
|
{
|
|
|
|
/* In order to avoid retpoline, we make assumptions when we call
|
|
|
|
* into ops if e.g. a psock is not present. Make sure they are
|
|
|
|
* indeed valid assumptions.
|
|
|
|
*/
|
|
|
|
return ops->recvmsg == tcp_recvmsg &&
|
|
|
|
ops->sendmsg == tcp_sendmsg &&
|
|
|
|
ops->sendpage == tcp_sendpage ? 0 : -ENOTSUPP;
|
|
|
|
}
|
|
|
|
|
2021-04-07 03:21:11 +00:00
|
|
|
int tcp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
{
|
2020-03-09 11:12:34 +00:00
|
|
|
int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4;
|
|
|
|
int config = psock->progs.msg_parser ? TCP_BPF_TX : TCP_BPF_BASE;
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
|
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2071619
commit c5d2177a72a1659554922728fc407f59950aa929
Author: John Fastabend <john.fastabend@gmail.com>
Date: Wed Nov 3 13:47:34 2021 -0700
bpf, sockmap: Fix race in ingress receive verdict with redirect to self
A socket in a sockmap may have different combinations of programs attached
depending on configuration. There can be no programs in which case the socket
acts as a sink only. There can be a TX program in this case a BPF program is
attached to sending side, but no RX program is attached. There can be an RX
program only where sends have no BPF program attached, but receives are hooked
with BPF. And finally, both TX and RX programs may be attached. Giving us the
permutations:
None, Tx, Rx, and TxRx
To date most of our use cases have been TX case being used as a fast datapath
to directly copy between local application and a userspace proxy. Or Rx cases
and TxRX applications that are operating an in kernel based proxy. The traffic
in the first case where we hook applications into a userspace application looks
like this:
AppA redirect AppB
Tx <-----------> Rx
| |
+ +
TCP <--> lo <--> TCP
In this case all traffic from AppA (after 3whs) is copied into the AppB
ingress queue and no traffic is ever on the TCP recieive_queue.
In the second case the application never receives, except in some rare error
cases, traffic on the actual user space socket. Instead the send happens in
the kernel.
AppProxy socket pool
sk0 ------------->{sk1,sk2, skn}
^ |
| |
| v
ingress lb egress
TCP TCP
Here because traffic is never read off the socket with userspace recv() APIs
there is only ever one reader on the sk receive_queue. Namely the BPF programs.
However, we've started to introduce a third configuration where the BPF program
on receive should process the data, but then the normal case is to push the
data into the receive queue of AppB.
AppB
recv() (userspace)
-----------------------
tcp_bpf_recvmsg() (kernel)
| |
| |
| |
ingress_msgQ |
| |
RX_BPF |
| |
v v
sk->receive_queue
This is different from the App{A,B} redirect because traffic is first received
on the sk->receive_queue.
Now for the issue. The tcp_bpf_recvmsg() handler first checks the ingress_msg
queue for any data handled by the BPF rx program and returned with PASS code
so that it was enqueued on the ingress msg queue. Then if no data exists on
that queue it checks the socket receive queue. Unfortunately, this is the same
receive_queue the BPF program is reading data off of. So we get a race. Its
possible for the recvmsg() hook to pull data off the receive_queue before the
BPF hook has a chance to read it. It typically happens when an application is
banging on recv() and getting EAGAINs. Until they manage to race with the RX
BPF program.
To fix this we note that before this patch at attach time when the socket is
loaded into the map we check if it needs a TX program or just the base set of
proto bpf hooks. Then it uses the above general RX hook regardless of if we
have a BPF program attached at rx or not. This patch now extends this check to
handle all cases enumerated above, TX, RX, TXRX, and none. And to fix above
race when an RX program is attached we use a new hook that is nearly identical
to the old one except now we do not let the recv() call skip the RX BPF program.
Now only the BPF program pulls data from sk->receive_queue and recv() only
pulls data from the ingress msgQ post BPF program handling.
With this resolved our AppB from above has been up and running for many hours
without detecting any errors. We do this by correlating counters in RX BPF
events and the AppB to ensure data is never skipping the BPF program. Selftests,
was not able to detect this because we only run them for a short period of time
on well ordered send/recvs so we don't get any of the noise we see in real
application environments.
Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Jussi Maki <joamaki@gmail.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20211103204736.248403-4-john.fastabend@gmail.com
Signed-off-by: Felix Maurer <fmaurer@redhat.com>
2022-05-05 08:53:02 +00:00
|
|
|
if (psock->progs.stream_verdict || psock->progs.skb_verdict) {
|
|
|
|
config = (config == TCP_BPF_TX) ? TCP_BPF_TXRX : TCP_BPF_RX;
|
|
|
|
}
|
|
|
|
|
2021-03-31 02:32:31 +00:00
|
|
|
if (restore) {
|
|
|
|
if (inet_csk_has_ulp(sk)) {
|
2021-04-10 03:46:01 +00:00
|
|
|
/* TLS does not have an unhash proto in SW cases,
|
|
|
|
* but we need to ensure we stop using the sock_map
|
|
|
|
* unhash routine because the associated psock is being
|
|
|
|
* removed. So use the original unhash handler.
|
|
|
|
*/
|
|
|
|
WRITE_ONCE(sk->sk_prot->unhash, psock->saved_unhash);
|
2021-03-31 02:32:31 +00:00
|
|
|
tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
|
|
|
|
} else {
|
|
|
|
sk->sk_write_space = psock->saved_write_space;
|
|
|
|
/* Pairs with lockless read in sk_clone_lock() */
|
2022-10-26 23:25:57 +00:00
|
|
|
sock_replace_proto(sk, psock->sk_proto);
|
2021-03-31 02:32:31 +00:00
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-08-21 10:29:43 +00:00
|
|
|
if (sk->sk_family == AF_INET6) {
|
|
|
|
if (tcp_bpf_assert_proto_ops(psock->sk_proto))
|
2021-03-31 02:32:31 +00:00
|
|
|
return -EINVAL;
|
2020-03-09 11:12:34 +00:00
|
|
|
|
2020-08-21 10:29:43 +00:00
|
|
|
tcp_bpf_check_v6_needs_rebuild(psock->sk_proto);
|
2020-03-09 11:12:34 +00:00
|
|
|
}
|
|
|
|
|
2021-03-31 02:32:31 +00:00
|
|
|
/* Pairs with lockless read in sk_clone_lock() */
|
2022-10-26 23:25:57 +00:00
|
|
|
sock_replace_proto(sk, &tcp_bpf_prots[family][config]);
|
2021-03-31 02:32:31 +00:00
|
|
|
return 0;
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
}
|
2021-03-31 02:32:31 +00:00
|
|
|
EXPORT_SYMBOL_GPL(tcp_bpf_update_proto);
|
bpf, sockmap: convert to generic sk_msg interface
Add a generic sk_msg layer, and convert current sockmap and later
kTLS over to make use of it. While sk_buff handles network packet
representation from netdevice up to socket, sk_msg handles data
representation from application to socket layer.
This means that sk_msg framework spans across ULP users in the
kernel, and enables features such as introspection or filtering
of data with the help of BPF programs that operate on this data
structure.
Latter becomes in particular useful for kTLS where data encryption
is deferred into the kernel, and as such enabling the kernel to
perform L7 introspection and policy based on BPF for TLS connections
where the record is being encrypted after BPF has run and came to
a verdict. In order to get there, first step is to transform open
coding of scatter-gather list handling into a common core framework
that subsystems can use.
The code itself has been split and refactored into three bigger
pieces: i) the generic sk_msg API which deals with managing the
scatter gather ring, providing helpers for walking and mangling,
transferring application data from user space into it, and preparing
it for BPF pre/post-processing, ii) the plain sock map itself
where sockets can be attached to or detached from; these bits
are independent of i) which can now be used also without sock
map, and iii) the integration with plain TCP as one protocol
to be used for processing L7 application data (later this could
e.g. also be extended to other protocols like UDP). The semantics
are the same with the old sock map code and therefore no change
of user facing behavior or APIs. While pursuing this work it
also helped finding a number of bugs in the old sockmap code
that we've fixed already in earlier commits. The test_sockmap
kselftest suite passes through fine as well.
Joint work with John.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-13 00:45:58 +00:00
|
|
|
|
2020-02-18 17:10:15 +00:00
|
|
|
/* If a child got cloned from a listening socket that had tcp_bpf
|
|
|
|
* protocol callbacks installed, we need to restore the callbacks to
|
|
|
|
* the default ones because the child does not inherit the psock state
|
|
|
|
* that tcp_bpf callbacks expect.
|
|
|
|
*/
|
|
|
|
void tcp_bpf_clone(const struct sock *sk, struct sock *newsk)
|
|
|
|
{
|
|
|
|
struct proto *prot = newsk->sk_prot;
|
|
|
|
|
2023-03-29 13:38:18 +00:00
|
|
|
if (is_insidevar(prot, tcp_bpf_prots))
|
2020-02-18 17:10:15 +00:00
|
|
|
newsk->sk_prot = sk->sk_prot_creator;
|
|
|
|
}
|
2021-02-23 18:49:26 +00:00
|
|
|
#endif /* CONFIG_BPF_SYSCALL */
|