public inbox for [email protected]
 help / color / mirror / Atom feed
* [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring
@ 2021-05-21 21:49 Paul Moore
  2021-05-21 21:49 ` [RFC PATCH 1/9] audit: prepare audit_context for use in calling contexts beyond syscalls Paul Moore
                   ` (10 more replies)
  0 siblings, 11 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-21 21:49 UTC (permalink / raw)
  To: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

Earlier this week Kumar Kartikeya Dwivedi posted a patchset switching
io_uring over to the anonymous inode variant that allows for LSM
controls.  While nice, the patchset left the actual LSM controls as
an exercise for the reader.  The posting can be found using the lore
link below:

https://lore.kernel.org/io-uring/CAHC9VhS=PDxx=MzZnGGNLwo-o5Og-HGZe84=+BBtBCZgaGSn4A@mail.gmail.com/T/#mde8c5120f3b8e34a5a3b18229b8c563a7855fd20

As fate would have it, I had been working on something very similar,
in fact the two patches from Kumar mirrored two in my own patchset.
This patchset, while still a bit crude, does include an attempt at
adding the LSM and audit support necessary to properly implement LSM
based access controls for io_uring.  I've provided the SELinux
implementation, Casey has been nice enough to provide a Smack patch,
and John is working on an AppArmor patch as I write this.  I've
mentioned this work to the other LSM maintainers that I believe might
be affected but I have not heard back from anyone else at this point.
If any of the other LSMs would like to contribute a patch to this
patchset I will happily accept it; I only ask that you post it to the
LSM list and make sure I am on the To/CC line.  I think it would be
nice to try and wrap this up as soon as possible for the obvious
reasons.

The individual patches provide an explanation of the changes involved
so I'm not going to repeat that here, but I will caution you that
these patches are still rather crude, perhaps more than a RFC patchset
should be, but it seemed prudent to move this along so I'm posting
these now.  Any review that you can provide would be helpful.

Also, any pointers to easy-to-run io_uring tests would be helpful.  I
am particularly interested in tests which make use of the personality
option, share urings across process boundaries, and make use of the
sqpoll functionality.

As a point of reference, this patchset is based on v5.13-rc2 and if
you want to follow along via git I'll be making updates to the git
tree/branch below (warning I will be force-pushing on this branch given
the early/rough nature of these patches).

git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git
 (checkout branch "working-io_uring")

Thanks in advance,
-Paul

---

Casey Schaufler (1):
      Smack: Brutalist io_uring support with debug

Paul Moore (8):
      audit: prepare audit_context for use in calling contexts beyond syscalls
      audit,io_uring,io-wq: add some basic audit support to io_uring
      audit: dev/test patch to force io_uring auditing
      audit: add filtering for io_uring records
      fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure()
      io_uring: convert io_uring to the secure anon inode interface
      lsm,io_uring: add LSM hooks to io_uring
      selinux: add support for the io_uring access controls


 fs/anon_inodes.c                    |  29 ++
 fs/io-wq.c                          |   4 +
 fs/io_uring.c                       |  25 +-
 include/linux/anon_inodes.h         |   4 +
 include/linux/audit.h               |  17 +
 include/linux/lsm_hook_defs.h       |   5 +
 include/linux/lsm_hooks.h           |  13 +
 include/linux/security.h            |  16 +
 include/uapi/linux/audit.h          |   4 +-
 kernel/audit.h                      |   7 +-
 kernel/auditfilter.c                |   4 +-
 kernel/auditsc.c                    | 481 ++++++++++++++++++++++------
 security/security.c                 |  12 +
 security/selinux/hooks.c            |  67 ++++
 security/selinux/include/classmap.h |   2 +
 security/smack/smack_lsm.c          |  64 ++++
 16 files changed, 650 insertions(+), 104 deletions(-)

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [RFC PATCH 1/9] audit: prepare audit_context for use in calling contexts beyond syscalls
  2021-05-21 21:49 [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Paul Moore
@ 2021-05-21 21:49 ` Paul Moore
  2021-05-21 21:49 ` [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring Paul Moore
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-21 21:49 UTC (permalink / raw)
  To: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

WARNING - This is a work in progress and should not be merged
anywhere important.  It is almost surely not complete, and while it
probably compiles it likely hasn't been booted and will do terrible
things.  You have been warned.

This patch cleans up some of our audit_context handling by
abstracting out the reset and return code fixup handling to dedicated
functions.  Not only does this help make things easier to read and
inspect, it allows for easier reuse be future patches.  We also
convert the simple audit_context->in_syscall flag into an enum which
can be used to by future patches to indicate a calling context other
than the syscall context.

Thanks to Richard Guy Briggs for review and feedback.

Acked-by: Richard Guy Briggs <[email protected]>
Signed-off-by: Paul Moore <[email protected]>
---
 kernel/audit.h   |    5 +
 kernel/auditsc.c |  255 +++++++++++++++++++++++++++++++++++-------------------
 2 files changed, 167 insertions(+), 93 deletions(-)

diff --git a/kernel/audit.h b/kernel/audit.h
index 1522e100fd17..fba180de5912 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -97,7 +97,10 @@ struct audit_proctitle {
 /* The per-task audit context. */
 struct audit_context {
 	int		    dummy;	/* must be the first element */
-	int		    in_syscall;	/* 1 if task is in a syscall */
+	enum {
+		AUDIT_CTX_UNUSED,	/* audit_context is currently unused */
+		AUDIT_CTX_SYSCALL,	/* in use by syscall */
+	} context;
 	enum audit_state    state, current_state;
 	unsigned int	    serial;     /* serial number for record */
 	int		    major;      /* syscall number */
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 175ef6f3ea4e..cc89e9f9a753 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -909,10 +909,80 @@ static inline void audit_free_aux(struct audit_context *context)
 		context->aux = aux->next;
 		kfree(aux);
 	}
+	context->aux = NULL;
 	while ((aux = context->aux_pids)) {
 		context->aux_pids = aux->next;
 		kfree(aux);
 	}
+	context->aux_pids = NULL;
+}
+
+/**
+ * audit_reset_context - reset a audit_context structure
+ * @ctx: the audit_context to reset
+ *
+ * All fields in the audit_context will be reset to an initial state, all
+ * references held by fields will be dropped, and private memory will be
+ * released.  When this function returns the audit_context will be suitable
+ * for reuse, so long as the passed context is not NULL or a dummy context.
+ */
+static void audit_reset_context(struct audit_context *ctx)
+{
+	if (!ctx)
+		return;
+
+	/* if ctx is non-null, reset the "ctx->state" regardless */
+	ctx->context = AUDIT_CTX_UNUSED;
+	if (ctx->dummy)
+		return;
+
+	/*
+	 * NOTE: It shouldn't matter in what order we release the fields, so
+	 *       release them in the order in which they appear in the struct;
+	 *       this gives us some hope of quickly making sure we are
+	 *       resetting the audit_context properly.
+	 *
+	 *       Other things worth mentioning:
+	 *       - we don't reset "dummy"
+	 *       - we don't reset "state", we do reset "current_state"
+	 *       - we preserver "filterkey" if "state" is AUDIT_RECORD_CONTEXT
+	 *       - much of this is likely overkill, but play it safe for now
+	 *       - we really need to work on improving the audit_context struct
+	 */
+
+	ctx->current_state = ctx->state;
+	ctx->serial = 0;
+	ctx->major = 0;
+	ctx->ctime = (struct timespec64){ .tv_sec = 0, .tv_nsec = 0 };
+	memset(ctx->argv, 0, sizeof(ctx->argv));
+	ctx->return_code = 0;
+	ctx->prio = (ctx->state == AUDIT_RECORD_CONTEXT ? ~0ULL : 0);
+	ctx->return_valid = AUDITSC_INVALID;
+	audit_free_names(ctx);
+	if (ctx->state != AUDIT_RECORD_CONTEXT) {
+		kfree(ctx->filterkey);
+		ctx->filterkey = NULL;
+	}
+	audit_free_aux(ctx);
+	kfree(ctx->sockaddr);
+	ctx->sockaddr = NULL;
+	ctx->sockaddr_len = 0;
+	ctx->pid = ctx->ppid = 0;
+	ctx->uid = ctx->euid = ctx->suid = ctx->fsuid = KUIDT_INIT(0);
+	ctx->gid = ctx->egid = ctx->sgid = ctx->fsgid = KGIDT_INIT(0);
+	ctx->personality = 0;
+	ctx->arch = 0;
+	ctx->target_pid = 0;
+	ctx->target_auid = ctx->target_uid = KUIDT_INIT(0);
+	ctx->target_sessionid = 0;
+	ctx->target_sid = 0;
+	ctx->target_comm[0] = '\0';
+	unroll_tree_refs(ctx, NULL, 0);
+	WARN_ON(!list_empty(&ctx->killed_trees));
+	ctx->type = 0;
+	audit_free_module(ctx);
+	ctx->fds[0] = -1;
+	audit_proctitle_free(ctx);
 }
 
 static inline struct audit_context *audit_alloc_context(enum audit_state state)
@@ -922,6 +992,7 @@ static inline struct audit_context *audit_alloc_context(enum audit_state state)
 	context = kzalloc(sizeof(*context), GFP_KERNEL);
 	if (!context)
 		return NULL;
+	context->context = AUDIT_CTX_UNUSED;
 	context->state = state;
 	context->prio = state == AUDIT_RECORD_CONTEXT ? ~0ULL : 0;
 	INIT_LIST_HEAD(&context->killed_trees);
@@ -947,7 +1018,7 @@ int audit_alloc(struct task_struct *tsk)
 	char *key = NULL;
 
 	if (likely(!audit_ever_enabled))
-		return 0; /* Return if not auditing. */
+		return 0;
 
 	state = audit_filter_task(tsk, &key);
 	if (state == AUDIT_DISABLED) {
@@ -969,14 +1040,10 @@ int audit_alloc(struct task_struct *tsk)
 
 static inline void audit_free_context(struct audit_context *context)
 {
-	audit_free_module(context);
-	audit_free_names(context);
-	unroll_tree_refs(context, NULL, 0);
+	/* resetting is extra work, but it is likely just noise */
+	audit_reset_context(context);
 	free_tree_refs(context);
-	audit_free_aux(context);
 	kfree(context->filterkey);
-	kfree(context->sockaddr);
-	audit_proctitle_free(context);
 	kfree(context);
 }
 
@@ -1479,29 +1546,35 @@ static void audit_log_exit(void)
 
 	context->personality = current->personality;
 
-	ab = audit_log_start(context, GFP_KERNEL, AUDIT_SYSCALL);
-	if (!ab)
-		return;		/* audit_panic has been called */
-	audit_log_format(ab, "arch=%x syscall=%d",
-			 context->arch, context->major);
-	if (context->personality != PER_LINUX)
-		audit_log_format(ab, " per=%lx", context->personality);
-	if (context->return_valid != AUDITSC_INVALID)
-		audit_log_format(ab, " success=%s exit=%ld",
-				 (context->return_valid==AUDITSC_SUCCESS)?"yes":"no",
-				 context->return_code);
-
-	audit_log_format(ab,
-			 " a0=%lx a1=%lx a2=%lx a3=%lx items=%d",
-			 context->argv[0],
-			 context->argv[1],
-			 context->argv[2],
-			 context->argv[3],
-			 context->name_count);
-
-	audit_log_task_info(ab);
-	audit_log_key(ab, context->filterkey);
-	audit_log_end(ab);
+	switch (context->context) {
+	case AUDIT_CTX_SYSCALL:
+		ab = audit_log_start(context, GFP_KERNEL, AUDIT_SYSCALL);
+		if (!ab)
+			return;
+		audit_log_format(ab, "arch=%x syscall=%d",
+				 context->arch, context->major);
+		if (context->personality != PER_LINUX)
+			audit_log_format(ab, " per=%lx", context->personality);
+		if (context->return_valid != AUDITSC_INVALID)
+			audit_log_format(ab, " success=%s exit=%ld",
+					 (context->return_valid == AUDITSC_SUCCESS ?
+					  "yes" : "no"),
+					 context->return_code);
+		audit_log_format(ab,
+				 " a0=%lx a1=%lx a2=%lx a3=%lx items=%d",
+				 context->argv[0],
+				 context->argv[1],
+				 context->argv[2],
+				 context->argv[3],
+				 context->name_count);
+		audit_log_task_info(ab);
+		audit_log_key(ab, context->filterkey);
+		audit_log_end(ab);
+		break;
+	default:
+		BUG();
+		break;
+	}
 
 	for (aux = context->aux; aux; aux = aux->next) {
 
@@ -1591,14 +1664,15 @@ static void audit_log_exit(void)
 		audit_log_name(context, n, NULL, i++, &call_panic);
 	}
 
-	audit_log_proctitle();
+	if (context->context == AUDIT_CTX_SYSCALL)
+		audit_log_proctitle();
 
 	/* Send end of event record to help user space know we are finished */
 	ab = audit_log_start(context, GFP_KERNEL, AUDIT_EOE);
 	if (ab)
 		audit_log_end(ab);
 	if (call_panic)
-		audit_panic("error converting sid to string");
+		audit_panic("error in audit_log_exit()");
 }
 
 /**
@@ -1614,6 +1688,7 @@ void __audit_free(struct task_struct *tsk)
 	if (!context)
 		return;
 
+	/* this may generate CONFIG_CHANGE records */
 	if (!list_empty(&context->killed_trees))
 		audit_kill_trees(context);
 
@@ -1622,7 +1697,8 @@ void __audit_free(struct task_struct *tsk)
 	 * random task_struct that doesn't doesn't have any meaningful data we
 	 * need to log via audit_log_exit().
 	 */
-	if (tsk == current && !context->dummy && context->in_syscall) {
+	if (tsk == current && !context->dummy &&
+	    context->context == AUDIT_CTX_SYSCALL) {
 		context->return_valid = AUDITSC_INVALID;
 		context->return_code = 0;
 
@@ -1636,6 +1712,34 @@ void __audit_free(struct task_struct *tsk)
 	audit_free_context(context);
 }
 
+/**
+ * audit_return_fixup - fixup the return codes in the audit_context
+ * @ctx: the audit_context
+ * @success: true/false value to indicate if the operation succeeded or not
+ * @code: operation return code
+ *
+ * We need to fixup the return code in the audit logs if the actual return
+ * codes are later going to be fixed by the arch specific signal handlers.
+ */
+static void audit_return_fixup(struct audit_context *ctx,
+			       int success, long code)
+{
+	/*
+	 * This is actually a test for:
+	 * (rc == ERESTARTSYS ) || (rc == ERESTARTNOINTR) ||
+	 * (rc == ERESTARTNOHAND) || (rc == ERESTART_RESTARTBLOCK)
+	 *
+	 * but is faster than a bunch of ||
+	 */
+	if (unlikely(code <= -ERESTARTSYS) &&
+	    (code >= -ERESTART_RESTARTBLOCK) &&
+	    (code != -ENOIOCTLCMD))
+		ctx->return_code = -EINTR;
+	else
+		ctx->return_code  = code;
+	ctx->return_valid = (success ? AUDITSC_SUCCESS : AUDITSC_FAILURE);
+}
+
 /**
  * __audit_syscall_entry - fill in an audit record at syscall entry
  * @major: major syscall type (function)
@@ -1661,7 +1765,12 @@ void __audit_syscall_entry(int major, unsigned long a1, unsigned long a2,
 	if (!audit_enabled || !context)
 		return;
 
-	BUG_ON(context->in_syscall || context->name_count);
+	WARN_ON(context->context != AUDIT_CTX_UNUSED);
+	WARN_ON(context->name_count);
+	if (context->context != AUDIT_CTX_UNUSED || context->name_count) {
+		audit_panic("unrecoverable error in audit_syscall_entry()");
+		return;
+	}
 
 	state = context->state;
 	if (state == AUDIT_DISABLED)
@@ -1680,10 +1789,8 @@ void __audit_syscall_entry(int major, unsigned long a1, unsigned long a2,
 	context->argv[1]    = a2;
 	context->argv[2]    = a3;
 	context->argv[3]    = a4;
-	context->serial     = 0;
-	context->in_syscall = 1;
+	context->context = AUDIT_CTX_SYSCALL;
 	context->current_state  = state;
-	context->ppid       = 0;
 	ktime_get_coarse_real_ts64(&context->ctime);
 }
 
@@ -1700,63 +1807,27 @@ void __audit_syscall_entry(int major, unsigned long a1, unsigned long a2,
  */
 void __audit_syscall_exit(int success, long return_code)
 {
-	struct audit_context *context;
+	struct audit_context *context = audit_context();
 
-	context = audit_context();
-	if (!context)
-		return;
+	if (!context || context->dummy ||
+	    context->context != AUDIT_CTX_SYSCALL)
+		goto out;
 
+	/* this may generate CONFIG_CHANGE records */
 	if (!list_empty(&context->killed_trees))
 		audit_kill_trees(context);
 
-	if (!context->dummy && context->in_syscall) {
-		if (success)
-			context->return_valid = AUDITSC_SUCCESS;
-		else
-			context->return_valid = AUDITSC_FAILURE;
-
-		/*
-		 * we need to fix up the return code in the audit logs if the
-		 * actual return codes are later going to be fixed up by the
-		 * arch specific signal handlers
-		 *
-		 * This is actually a test for:
-		 * (rc == ERESTARTSYS ) || (rc == ERESTARTNOINTR) ||
-		 * (rc == ERESTARTNOHAND) || (rc == ERESTART_RESTARTBLOCK)
-		 *
-		 * but is faster than a bunch of ||
-		 */
-		if (unlikely(return_code <= -ERESTARTSYS) &&
-		    (return_code >= -ERESTART_RESTARTBLOCK) &&
-		    (return_code != -ENOIOCTLCMD))
-			context->return_code = -EINTR;
-		else
-			context->return_code  = return_code;
-
-		audit_filter_syscall(current, context);
-		audit_filter_inodes(current, context);
-		if (context->current_state == AUDIT_RECORD_CONTEXT)
-			audit_log_exit();
-	}
+	/* run through both filters to ensure we set the filterkey properly */
+	audit_filter_syscall(current, context);
+	audit_filter_inodes(current, context);
+	if (context->current_state < AUDIT_RECORD_CONTEXT)
+		goto out;
 
-	context->in_syscall = 0;
-	context->prio = context->state == AUDIT_RECORD_CONTEXT ? ~0ULL : 0;
+	audit_return_fixup(context, success, return_code);
+	audit_log_exit();
 
-	audit_free_module(context);
-	audit_free_names(context);
-	unroll_tree_refs(context, NULL, 0);
-	audit_free_aux(context);
-	context->aux = NULL;
-	context->aux_pids = NULL;
-	context->target_pid = 0;
-	context->target_sid = 0;
-	context->sockaddr_len = 0;
-	context->type = 0;
-	context->fds[0] = -1;
-	if (context->state != AUDIT_RECORD_CONTEXT) {
-		kfree(context->filterkey);
-		context->filterkey = NULL;
-	}
+out:
+	audit_reset_context(context);
 }
 
 static inline void handle_one(const struct inode *inode)
@@ -1905,7 +1976,7 @@ void __audit_getname(struct filename *name)
 	struct audit_context *context = audit_context();
 	struct audit_names *n;
 
-	if (!context->in_syscall)
+	if (context->context == AUDIT_CTX_UNUSED)
 		return;
 
 	n = audit_alloc_name(context, AUDIT_TYPE_UNKNOWN);
@@ -1977,7 +2048,7 @@ void __audit_inode(struct filename *name, const struct dentry *dentry,
 	struct list_head *list = &audit_filter_list[AUDIT_FILTER_FS];
 	int i;
 
-	if (!context->in_syscall)
+	if (context->context == AUDIT_CTX_UNUSED)
 		return;
 
 	rcu_read_lock();
@@ -2095,7 +2166,7 @@ void __audit_inode_child(struct inode *parent,
 	struct list_head *list = &audit_filter_list[AUDIT_FILTER_FS];
 	int i;
 
-	if (!context->in_syscall)
+	if (context->context == AUDIT_CTX_UNUSED)
 		return;
 
 	rcu_read_lock();
@@ -2194,7 +2265,7 @@ EXPORT_SYMBOL_GPL(__audit_inode_child);
 int auditsc_get_stamp(struct audit_context *ctx,
 		       struct timespec64 *t, unsigned int *serial)
 {
-	if (!ctx->in_syscall)
+	if (ctx->context == AUDIT_CTX_UNUSED)
 		return 0;
 	if (!ctx->serial)
 		ctx->serial = audit_serial();
@@ -2686,7 +2757,7 @@ void audit_seccomp_actions_logged(const char *names, const char *old_names,
 struct list_head *audit_killed_trees(void)
 {
 	struct audit_context *ctx = audit_context();
-	if (likely(!ctx || !ctx->in_syscall))
+	if (likely(!ctx || ctx->context == AUDIT_CTX_UNUSED))
 		return NULL;
 	return &ctx->killed_trees;
 }


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-21 21:49 [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Paul Moore
  2021-05-21 21:49 ` [RFC PATCH 1/9] audit: prepare audit_context for use in calling contexts beyond syscalls Paul Moore
@ 2021-05-21 21:49 ` Paul Moore
  2021-05-22  0:22   ` Pavel Begunkov
  2021-06-02 17:29   ` [RFC PATCH 2/9] audit, io_uring, io-wq: " Richard Guy Briggs
  2021-05-21 21:50 ` [RFC PATCH 3/9] audit: dev/test patch to force io_uring auditing Paul Moore
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-21 21:49 UTC (permalink / raw)
  To: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

WARNING - This is a work in progress and should not be merged
anywhere important.  It is almost surely not complete, and while it
probably compiles it likely hasn't been booted and will do terrible
things.  You have been warned.

This patch adds basic auditing to io_uring operations, regardless of
their context.  This is accomplished by allocating audit_context
structures for the io-wq worker and io_uring SQPOLL kernel threads
as well as explicitly auditing the io_uring operations in
io_issue_sqe().  The io_uring operations are audited using a new
AUDIT_URINGOP record, an example is shown below:

  % <TODO - insert AUDIT_URINGOP record example>

Thanks to Richard Guy Briggs for review and feedback.

Signed-off-by: Paul Moore <[email protected]>
---
 fs/io-wq.c                 |    4 +
 fs/io_uring.c              |   11 +++
 include/linux/audit.h      |   17 ++++
 include/uapi/linux/audit.h |    1 
 kernel/audit.h             |    2 +
 kernel/auditsc.c           |  173 ++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 208 insertions(+)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 5361a9b4b47b..8af09a3336e0 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -16,6 +16,7 @@
 #include <linux/rculist_nulls.h>
 #include <linux/cpu.h>
 #include <linux/tracehook.h>
+#include <linux/audit.h>
 
 #include "io-wq.h"
 
@@ -535,6 +536,8 @@ static int io_wqe_worker(void *data)
 	snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid);
 	set_task_comm(current, buf);
 
+	audit_alloc_kernel(current);
+
 	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
 		long ret;
 
@@ -573,6 +576,7 @@ static int io_wqe_worker(void *data)
 			raw_spin_unlock_irq(&wqe->lock);
 	}
 
+	audit_free(current);
 	io_worker_exit(worker);
 	return 0;
 }
diff --git a/fs/io_uring.c b/fs/io_uring.c
index e481ac8a757a..e9941d1ad8fd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -78,6 +78,7 @@
 #include <linux/task_work.h>
 #include <linux/pagemap.h>
 #include <linux/io_uring.h>
+#include <linux/audit.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/io_uring.h>
@@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	if (req->work.creds && req->work.creds != current_cred())
 		creds = override_creds(req->work.creds);
 
+	if (req->opcode < IORING_OP_LAST)
+		audit_uring_entry(req->opcode);
+
 	switch (req->opcode) {
 	case IORING_OP_NOP:
 		ret = io_nop(req, issue_flags);
@@ -6211,6 +6215,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 		break;
 	}
 
+	if (req->opcode < IORING_OP_LAST)
+		audit_uring_exit(!ret, ret);
+
 	if (creds)
 		revert_creds(creds);
 
@@ -6827,6 +6834,8 @@ static int io_sq_thread(void *data)
 		set_cpus_allowed_ptr(current, cpu_online_mask);
 	current->flags |= PF_NO_SETAFFINITY;
 
+	audit_alloc_kernel(current);
+
 	mutex_lock(&sqd->lock);
 	/* a user may had exited before the thread started */
 	io_run_task_work_head(&sqd->park_task_work);
@@ -6916,6 +6925,8 @@ static int io_sq_thread(void *data)
 	io_run_task_work_head(&sqd->park_task_work);
 	mutex_unlock(&sqd->lock);
 
+	audit_free(current);
+
 	complete(&sqd->exited);
 	do_exit(0);
 }
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 82b7c1116a85..6a0c013bc7de 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -286,7 +286,10 @@ static inline int audit_signal_info(int sig, struct task_struct *t)
 /* These are defined in auditsc.c */
 				/* Public API */
 extern int  audit_alloc(struct task_struct *task);
+extern int  audit_alloc_kernel(struct task_struct *task);
 extern void __audit_free(struct task_struct *task);
+extern void __audit_uring_entry(u8 op);
+extern void __audit_uring_exit(int success, long code);
 extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1,
 				  unsigned long a2, unsigned long a3);
 extern void __audit_syscall_exit(int ret_success, long ret_value);
@@ -323,6 +326,16 @@ static inline void audit_free(struct task_struct *task)
 	if (unlikely(task->audit_context))
 		__audit_free(task);
 }
+static inline void audit_uring_entry(u8 op)
+{
+	if (unlikely(audit_context()))
+		__audit_uring_entry(op);
+}
+static inline void audit_uring_exit(int success, long code)
+{
+	if (unlikely(audit_context()))
+		__audit_uring_exit(success, code);
+}
 static inline void audit_syscall_entry(int major, unsigned long a0,
 				       unsigned long a1, unsigned long a2,
 				       unsigned long a3)
@@ -554,6 +567,10 @@ static inline int audit_alloc(struct task_struct *task)
 {
 	return 0;
 }
+static inline int audit_alloc_kernel(struct task_struct *task)
+{
+	return 0;
+}
 static inline void audit_free(struct task_struct *task)
 { }
 static inline void audit_syscall_entry(int major, unsigned long a0,
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index cd2d8279a5e4..b26e0c435e8b 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -118,6 +118,7 @@
 #define AUDIT_TIME_ADJNTPVAL	1333	/* NTP value adjustment */
 #define AUDIT_BPF		1334	/* BPF subsystem */
 #define AUDIT_EVENT_LISTENER	1335	/* Task joined multicast read socket */
+#define AUDIT_URINGOP		1336	/* io_uring operation */
 
 #define AUDIT_AVC		1400	/* SE Linux avc denial or grant */
 #define AUDIT_SELINUX_ERR	1401	/* Internal SE Linux Errors */
diff --git a/kernel/audit.h b/kernel/audit.h
index fba180de5912..50de827497ca 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -100,10 +100,12 @@ struct audit_context {
 	enum {
 		AUDIT_CTX_UNUSED,	/* audit_context is currently unused */
 		AUDIT_CTX_SYSCALL,	/* in use by syscall */
+		AUDIT_CTX_URING,	/* in use by io_uring */
 	} context;
 	enum audit_state    state, current_state;
 	unsigned int	    serial;     /* serial number for record */
 	int		    major;      /* syscall number */
+	int		    uring_op;   /* uring operation */
 	struct timespec64   ctime;      /* time of syscall entry */
 	unsigned long	    argv[4];    /* syscall arguments */
 	long		    return_code;/* syscall return code */
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index cc89e9f9a753..729849d41631 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -953,6 +953,7 @@ static void audit_reset_context(struct audit_context *ctx)
 	ctx->current_state = ctx->state;
 	ctx->serial = 0;
 	ctx->major = 0;
+	ctx->uring_op = 0;
 	ctx->ctime = (struct timespec64){ .tv_sec = 0, .tv_nsec = 0 };
 	memset(ctx->argv, 0, sizeof(ctx->argv));
 	ctx->return_code = 0;
@@ -1038,6 +1039,31 @@ int audit_alloc(struct task_struct *tsk)
 	return 0;
 }
 
+/**
+ * audit_alloc_kernel - allocate an audit_context for a kernel task
+ * @tsk: the kernel task
+ *
+ * Similar to the audit_alloc() function, but intended for kernel private
+ * threads.  Returns zero on success, negative values on failure.
+ */
+int audit_alloc_kernel(struct task_struct *tsk)
+{
+	/*
+	 * At the moment we are just going to call into audit_alloc() to
+	 * simplify the code, but there two things to keep in mind with this
+	 * approach:
+	 *
+	 * 1. Filtering internal kernel tasks is a bit laughable in almost all
+	 * cases, but there is at least one case where there is a benefit:
+	 * the '-a task,never' case allows the admin to effectively disable
+	 * task auditing at runtime.
+	 *
+	 * 2. The {set,clear}_task_syscall_work() ops likely have zero effect
+	 * on these internal kernel tasks, but they probably don't hurt either.
+	 */
+	return audit_alloc(tsk);
+}
+
 static inline void audit_free_context(struct audit_context *context)
 {
 	/* resetting is extra work, but it is likely just noise */
@@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void)
 	audit_log_end(ab);
 }
 
+/**
+ * audit_log_uring - generate a AUDIT_URINGOP record
+ * @ctx: the audit context
+ */
+static void audit_log_uring(struct audit_context *ctx)
+{
+	struct audit_buffer *ab;
+	const struct cred *cred;
+
+	/*
+	 * TODO: What do we log here?  I'm tossing in a few things to start the
+	 *       conversation, but additional thought needs to go into this.
+	 */
+
+	ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP);
+	if (!ab)
+		return;
+	cred = current_cred();
+	audit_log_format(ab, "uring_op=%d", ctx->uring_op);
+	if (ctx->return_valid != AUDITSC_INVALID)
+		audit_log_format(ab, " success=%s exit=%ld",
+				 (ctx->return_valid == AUDITSC_SUCCESS ?
+				  "yes" : "no"),
+				 ctx->return_code);
+	audit_log_format(ab,
+			 " items=%d"
+			 " ppid=%d pid=%d auid=%u uid=%u gid=%u"
+			 " euid=%u suid=%u fsuid=%u"
+			 " egid=%u sgid=%u fsgid=%u",
+			 ctx->name_count,
+			 task_ppid_nr(current),
+			 task_tgid_nr(current),
+			 from_kuid(&init_user_ns, audit_get_loginuid(current)),
+			 from_kuid(&init_user_ns, cred->uid),
+			 from_kgid(&init_user_ns, cred->gid),
+			 from_kuid(&init_user_ns, cred->euid),
+			 from_kuid(&init_user_ns, cred->suid),
+			 from_kuid(&init_user_ns, cred->fsuid),
+			 from_kgid(&init_user_ns, cred->egid),
+			 from_kgid(&init_user_ns, cred->sgid),
+			 from_kgid(&init_user_ns, cred->fsgid));
+	audit_log_task_context(ab);
+	audit_log_key(ab, ctx->filterkey);
+	audit_log_end(ab);
+}
+
 static void audit_log_exit(void)
 {
 	int i, call_panic = 0;
@@ -1571,6 +1643,9 @@ static void audit_log_exit(void)
 		audit_log_key(ab, context->filterkey);
 		audit_log_end(ab);
 		break;
+	case AUDIT_CTX_URING:
+		audit_log_uring(context);
+		break;
 	default:
 		BUG();
 		break;
@@ -1740,6 +1815,104 @@ static void audit_return_fixup(struct audit_context *ctx,
 	ctx->return_valid = (success ? AUDITSC_SUCCESS : AUDITSC_FAILURE);
 }
 
+/**
+ * __audit_uring_entry - prepare the kernel task's audit context for io_uring
+ * @op: the io_uring opcode
+ *
+ * This is similar to audit_syscall_entry() but is intended for use by io_uring
+ * operations.
+ */
+void __audit_uring_entry(u8 op)
+{
+	struct audit_context *ctx = audit_context();
+
+	if (!audit_enabled || !ctx || ctx->state == AUDIT_DISABLED)
+		return;
+
+	/*
+	 * NOTE: It's possible that we can be called from the process' context
+	 *       before it returns to userspace, and before audit_syscall_exit()
+	 *       is called.  In this case there is not much to do, just record
+	 *       the io_uring details and return.
+	 */
+	ctx->uring_op = op;
+	if (ctx->context == AUDIT_CTX_SYSCALL)
+		return;
+
+	ctx->dummy = !audit_n_rules;
+	if (!ctx->dummy && ctx->state == AUDIT_BUILD_CONTEXT)
+		ctx->prio = 0;
+
+	ctx->arch = syscall_get_arch(current);
+	ctx->context = AUDIT_CTX_URING;
+	ctx->current_state = ctx->state;
+	ktime_get_coarse_real_ts64(&ctx->ctime);
+}
+
+/**
+ * __audit_uring_exit - wrap up the kernel task's audit context after io_uring
+ * @success: true/false value to indicate if the operation succeeded or not
+ * @code: operation return code
+ *
+ * This is similar to audit_syscall_exit() but is intended for use by io_uring
+ * operations.
+ */
+void __audit_uring_exit(int success, long code)
+{
+	struct audit_context *ctx = audit_context();
+
+	/*
+	 * TODO: At some point we will likely want to filter on io_uring ops
+	 *       and other things similar to what we do for syscalls, but that
+	 *       is something for another day; just record what we can here.
+	 */
+
+	if (!ctx || ctx->dummy)
+		goto out;
+	if (ctx->context == AUDIT_CTX_SYSCALL) {
+		/*
+		 * NOTE: See the note in __audit_uring_entry() about the case
+		 *       where we may be called from process context before we
+		 *       return to userspace via audit_syscall_exit().  In this
+		 *       case we simply emit a URINGOP record and bail, the
+		 *       normal syscall exit handling will take care of
+		 *       everything else.
+		 *       It is also worth mentioning that when we are called,
+		 *       the current process creds may differ from the creds
+		 *       used during the normal syscall processing; keep that
+		 *       in mind if/when we move the record generation code.
+		 */
+
+		/*
+		 * We need to filter on the syscall info here to decide if we
+		 * should emit a URINGOP record.  I know it seems odd but this
+		 * solves the problem where users have a filter to block *all*
+		 * syscall records in the "exit" filter; we want to preserve
+		 * the behavior here.
+		 */
+		audit_filter_syscall(current, ctx);
+		audit_filter_inodes(current, ctx);
+		if (ctx->current_state != AUDIT_RECORD_CONTEXT)
+			return;
+
+		audit_log_uring(ctx);
+		return;
+	}
+
+	/* this may generate CONFIG_CHANGE records */
+	if (!list_empty(&ctx->killed_trees))
+		audit_kill_trees(ctx);
+
+	audit_filter_inodes(current, ctx);
+	if (ctx->current_state != AUDIT_RECORD_CONTEXT)
+		goto out;
+	audit_return_fixup(ctx, success, code);
+	audit_log_exit();
+
+out:
+	audit_reset_context(ctx);
+}
+
 /**
  * __audit_syscall_entry - fill in an audit record at syscall entry
  * @major: major syscall type (function)


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 3/9] audit: dev/test patch to force io_uring auditing
  2021-05-21 21:49 [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Paul Moore
  2021-05-21 21:49 ` [RFC PATCH 1/9] audit: prepare audit_context for use in calling contexts beyond syscalls Paul Moore
  2021-05-21 21:49 ` [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring Paul Moore
@ 2021-05-21 21:50 ` Paul Moore
  2021-05-21 21:50 ` [RFC PATCH 4/9] audit: add filtering for io_uring records Paul Moore
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-21 21:50 UTC (permalink / raw)
  To: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

WARNING - This patch is intended only to aid in the initial dev/test
of the audit/io_uring support, it is not intended to be merged.

With this patch, you can emit io_uring operation audit records with
the following commands (the first clears any blocking rules):

  % auditctl -D
  % auditctl -a exit,always -S io_uring_enter

Signed-off-by: DO NOT COMMIT
---
 kernel/auditsc.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 729849d41631..d8aa2c690bf9 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -1898,6 +1898,10 @@ void __audit_uring_exit(int success, long code)
 		audit_log_uring(ctx);
 		return;
 	}
+#if 1
+	/* XXX - temporary hack to force record generation */
+	ctx->current_state = AUDIT_RECORD_CONTEXT;
+#endif
 
 	/* this may generate CONFIG_CHANGE records */
 	if (!list_empty(&ctx->killed_trees))


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 4/9] audit: add filtering for io_uring records
  2021-05-21 21:49 [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Paul Moore
                   ` (2 preceding siblings ...)
  2021-05-21 21:50 ` [RFC PATCH 3/9] audit: dev/test patch to force io_uring auditing Paul Moore
@ 2021-05-21 21:50 ` Paul Moore
  2021-05-28 22:35   ` Richard Guy Briggs
  2021-05-21 21:50 ` [RFC PATCH 5/9] fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure() Paul Moore
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 71+ messages in thread
From: Paul Moore @ 2021-05-21 21:50 UTC (permalink / raw)
  To: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

WARNING - This is a work in progress and should not be merged
anywhere important.  It is almost surely not complete, and while it
probably compiles it likely hasn't been booted and will do terrible
things.  You have been warned.

This patch adds basic audit io_uring filtering, using as much of the
existing audit filtering infrastructure as possible.  In order to do
this we reuse the audit filter rule's syscall mask for the io_uring
operation and we create a new filter for io_uring operations as
AUDIT_FILTER_URING_EXIT/audit_filter_list[7].

<TODO - provide some additional guidance for the userspace tools>

Signed-off-by: Paul Moore <[email protected]>
---
 include/uapi/linux/audit.h |    3 +-
 kernel/auditfilter.c       |    4 ++-
 kernel/auditsc.c           |   65 ++++++++++++++++++++++++++++++++++----------
 3 files changed, 55 insertions(+), 17 deletions(-)

diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index b26e0c435e8b..621eb3c1076e 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -167,8 +167,9 @@
 #define AUDIT_FILTER_EXCLUDE	0x05	/* Apply rule before record creation */
 #define AUDIT_FILTER_TYPE	AUDIT_FILTER_EXCLUDE /* obsolete misleading naming */
 #define AUDIT_FILTER_FS		0x06	/* Apply rule at __audit_inode_child */
+#define AUDIT_FILTER_URING_EXIT	0x07	/* Apply rule at io_uring op exit */
 
-#define AUDIT_NR_FILTERS	7
+#define AUDIT_NR_FILTERS	8
 
 #define AUDIT_FILTER_PREPEND	0x10	/* Prepend to front of list */
 
diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index db2c6b59dfc3..c21119c00504 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -44,7 +44,8 @@ struct list_head audit_filter_list[AUDIT_NR_FILTERS] = {
 	LIST_HEAD_INIT(audit_filter_list[4]),
 	LIST_HEAD_INIT(audit_filter_list[5]),
 	LIST_HEAD_INIT(audit_filter_list[6]),
-#if AUDIT_NR_FILTERS != 7
+	LIST_HEAD_INIT(audit_filter_list[7]),
+#if AUDIT_NR_FILTERS != 8
 #error Fix audit_filter_list initialiser
 #endif
 };
@@ -56,6 +57,7 @@ static struct list_head audit_rules_list[AUDIT_NR_FILTERS] = {
 	LIST_HEAD_INIT(audit_rules_list[4]),
 	LIST_HEAD_INIT(audit_rules_list[5]),
 	LIST_HEAD_INIT(audit_rules_list[6]),
+	LIST_HEAD_INIT(audit_rules_list[7]),
 };
 
 DEFINE_MUTEX(audit_filter_mutex);
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index d8aa2c690bf9..4f6ab34020fb 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -799,6 +799,35 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
 	return rule->mask[word] & bit;
 }
 
+/**
+ * audit_filter_uring - apply filters to an io_uring operation
+ * @tsk: associated task
+ * @ctx: audit context
+ */
+static void audit_filter_uring(struct task_struct *tsk,
+			       struct audit_context *ctx)
+{
+	struct audit_entry *e;
+	enum audit_state state;
+
+	if (auditd_test_task(tsk))
+		return;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_URING_EXIT],
+				list) {
+		if (audit_in_mask(&e->rule, ctx->uring_op) &&
+		    audit_filter_rules(tsk, &e->rule, ctx, NULL, &state,
+				       false)) {
+			rcu_read_unlock();
+			ctx->current_state = state;
+			return;
+		}
+	}
+	rcu_read_unlock();
+	return;
+}
+
 /* At syscall exit time, this filter is called if the audit_state is
  * not low enough that auditing cannot take place, but is also not
  * high enough that we already know we have to write an audit record
@@ -1754,7 +1783,7 @@ static void audit_log_exit(void)
  * __audit_free - free a per-task audit context
  * @tsk: task whose audit context block to free
  *
- * Called from copy_process and do_exit
+ * Called from copy_process, do_exit, and the io_uring code
  */
 void __audit_free(struct task_struct *tsk)
 {
@@ -1772,15 +1801,21 @@ void __audit_free(struct task_struct *tsk)
 	 * random task_struct that doesn't doesn't have any meaningful data we
 	 * need to log via audit_log_exit().
 	 */
-	if (tsk == current && !context->dummy &&
-	    context->context == AUDIT_CTX_SYSCALL) {
+	if (tsk == current && !context->dummy) {
 		context->return_valid = AUDITSC_INVALID;
 		context->return_code = 0;
-
-		audit_filter_syscall(tsk, context);
-		audit_filter_inodes(tsk, context);
-		if (context->current_state == AUDIT_RECORD_CONTEXT)
-			audit_log_exit();
+		if (context->context == AUDIT_CTX_SYSCALL) {
+			audit_filter_syscall(tsk, context);
+			audit_filter_inodes(tsk, context);
+			if (context->current_state == AUDIT_RECORD_CONTEXT)
+				audit_log_exit();
+		} else if (context->context == AUDIT_CTX_URING) {
+			/* TODO: verify this case is real and valid */
+			audit_filter_uring(tsk, context);
+			audit_filter_inodes(tsk, context);
+			if (context->current_state == AUDIT_RECORD_CONTEXT)
+				audit_log_uring(context);
+		}
 	}
 
 	audit_set_context(tsk, NULL);
@@ -1861,12 +1896,6 @@ void __audit_uring_exit(int success, long code)
 {
 	struct audit_context *ctx = audit_context();
 
-	/*
-	 * TODO: At some point we will likely want to filter on io_uring ops
-	 *       and other things similar to what we do for syscalls, but that
-	 *       is something for another day; just record what we can here.
-	 */
-
 	if (!ctx || ctx->dummy)
 		goto out;
 	if (ctx->context == AUDIT_CTX_SYSCALL) {
@@ -1891,6 +1920,8 @@ void __audit_uring_exit(int success, long code)
 		 * the behavior here.
 		 */
 		audit_filter_syscall(current, ctx);
+		if (ctx->current_state != AUDIT_RECORD_CONTEXT)
+			audit_filter_uring(current, ctx);
 		audit_filter_inodes(current, ctx);
 		if (ctx->current_state != AUDIT_RECORD_CONTEXT)
 			return;
@@ -1899,7 +1930,9 @@ void __audit_uring_exit(int success, long code)
 		return;
 	}
 #if 1
-	/* XXX - temporary hack to force record generation */
+	/* XXX - temporary hack to force record generation, we are leaving this
+	 *       enabled, but if you want to actually test the filtering you
+	 *       need to disable this #if/#endif block */
 	ctx->current_state = AUDIT_RECORD_CONTEXT;
 #endif
 
@@ -1907,6 +1940,8 @@ void __audit_uring_exit(int success, long code)
 	if (!list_empty(&ctx->killed_trees))
 		audit_kill_trees(ctx);
 
+	/* run through both filters to ensure we set the filterkey properly */
+	audit_filter_uring(current, ctx);
 	audit_filter_inodes(current, ctx);
 	if (ctx->current_state != AUDIT_RECORD_CONTEXT)
 		goto out;


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 5/9] fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure()
  2021-05-21 21:49 [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Paul Moore
                   ` (3 preceding siblings ...)
  2021-05-21 21:50 ` [RFC PATCH 4/9] audit: add filtering for io_uring records Paul Moore
@ 2021-05-21 21:50 ` Paul Moore
  2021-05-21 21:50 ` [RFC PATCH 6/9] io_uring: convert io_uring to the secure anon inode interface Paul Moore
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-21 21:50 UTC (permalink / raw)
  To: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

Extending the secure anonymous inode support to other subsystems
requires that we have a secure anon_inode_getfile() variant in
addition to the existing secure anon_inode_getfd() variant.

Thankfully we can reuse the existing __anon_inode_getfile() function
and just wrap it with the proper arguments.

Signed-off-by: Paul Moore <[email protected]>
---
 fs/anon_inodes.c            |   29 +++++++++++++++++++++++++++++
 include/linux/anon_inodes.h |    4 ++++
 2 files changed, 33 insertions(+)

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index a280156138ed..e0c3e33c4177 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -148,6 +148,35 @@ struct file *anon_inode_getfile(const char *name,
 }
 EXPORT_SYMBOL_GPL(anon_inode_getfile);
 
+/**
+ * anon_inode_getfile_secure - Like anon_inode_getfile(), but creates a new
+ *                             !S_PRIVATE anon inode rather than reuse the
+ *                             singleton anon inode and calls the
+ *                             inode_init_security_anon() LSM hook.  This
+ *                             allows for both the inode to have its own
+ *                             security context and for the LSM to enforce
+ *                             policy on the inode's creation.
+ *
+ * @name:    [in]    name of the "class" of the new file
+ * @fops:    [in]    file operations for the new file
+ * @priv:    [in]    private data for the new file (will be file's private_data)
+ * @flags:   [in]    flags
+ * @context_inode:
+ *           [in]    the logical relationship with the new inode (optional)
+ *
+ * The LSM may use @context_inode in inode_init_security_anon(), but a
+ * reference to it is not held.  Returns the newly created file* or an error
+ * pointer.  See the anon_inode_getfile() documentation for more information.
+ */
+struct file *anon_inode_getfile_secure(const char *name,
+				       const struct file_operations *fops,
+				       void *priv, int flags,
+				       const struct inode *context_inode)
+{
+	return __anon_inode_getfile(name, fops, priv, flags,
+				    context_inode, true);
+}
+
 static int __anon_inode_getfd(const char *name,
 			      const struct file_operations *fops,
 			      void *priv, int flags,
diff --git a/include/linux/anon_inodes.h b/include/linux/anon_inodes.h
index 71881a2b6f78..5deaddbd7927 100644
--- a/include/linux/anon_inodes.h
+++ b/include/linux/anon_inodes.h
@@ -15,6 +15,10 @@ struct inode;
 struct file *anon_inode_getfile(const char *name,
 				const struct file_operations *fops,
 				void *priv, int flags);
+struct file *anon_inode_getfile_secure(const char *name,
+				       const struct file_operations *fops,
+				       void *priv, int flags,
+				       const struct inode *context_inode);
 int anon_inode_getfd(const char *name, const struct file_operations *fops,
 		     void *priv, int flags);
 int anon_inode_getfd_secure(const char *name,


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 6/9] io_uring: convert io_uring to the secure anon inode interface
  2021-05-21 21:49 [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Paul Moore
                   ` (4 preceding siblings ...)
  2021-05-21 21:50 ` [RFC PATCH 5/9] fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure() Paul Moore
@ 2021-05-21 21:50 ` Paul Moore
  2021-05-21 21:50 ` [RFC PATCH 7/9] lsm,io_uring: add LSM hooks to io_uring Paul Moore
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-21 21:50 UTC (permalink / raw)
  To: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

Converting io_uring's anonymous inode to the secure anon inode API
enables LSMs to enforce policy on the io_uring anonymous inodes if
they chose to do so.  This is an important first step towards
providing the necessary mechanisms so that LSMs can apply security
policy to io_uring operations.

Signed-off-by: Paul Moore <[email protected]>
---
 fs/io_uring.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e9941d1ad8fd..6ff769c9b7d3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -9562,8 +9562,8 @@ static struct file *io_uring_get_file(struct io_ring_ctx *ctx)
 		return ERR_PTR(ret);
 #endif
 
-	file = anon_inode_getfile("[io_uring]", &io_uring_fops, ctx,
-					O_RDWR | O_CLOEXEC);
+	file = anon_inode_getfile_secure("[io_uring]", &io_uring_fops, ctx,
+					 O_RDWR | O_CLOEXEC, NULL);
 #if defined(CONFIG_UNIX)
 	if (IS_ERR(file)) {
 		sock_release(ctx->ring_sock);


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 7/9] lsm,io_uring: add LSM hooks to io_uring
  2021-05-21 21:49 [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Paul Moore
                   ` (5 preceding siblings ...)
  2021-05-21 21:50 ` [RFC PATCH 6/9] io_uring: convert io_uring to the secure anon inode interface Paul Moore
@ 2021-05-21 21:50 ` Paul Moore
  2021-05-26 14:48   ` Stefan Metzmacher
  2021-05-21 21:50 ` [RFC PATCH 8/9] selinux: add support for the io_uring access controls Paul Moore
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 71+ messages in thread
From: Paul Moore @ 2021-05-21 21:50 UTC (permalink / raw)
  To: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

WARNING - This is a work in progress, this patch, including the
description, may be incomplete or even incorrect.  You have been
warned.

A full expalantion of io_uring is beyond the scope of this commit
description, but in summary it is an asynchronous I/O mechanism
which allows for I/O requests and the resulting data to be queued
in memory mapped "rings" which are shared between the kernel and
userspace.  Optionally, io_uring offers the ability for applications
to spawn kernel threads to dequeue I/O requests from the ring and
submit the requests in the kernel, helping to minimize the syscall
overhead.  Rings are accessed in userspace by memory mapping a file
descriptor provided by the io_uring_setup(2), and can be shared
between applications as one might do with any open file descriptor.
Finally, process credentials can be registered with a given ring
and any process with access to that ring can submit I/O requests
using any of the registered credentials.

While the io_uring functionality is widely recognized as offering a
vastly improved, and high performing asynchronous I/O mechanism, its
ability to allow processes to submit I/O requests with credentials
other than its own presents a challenge to LSMs.  When a process
creates a new io_uring ring the ring's credentials are inhertied
from the calling process; if this ring is shared with another
process operating with different credentials there is the potential
to bypass the LSMs security policy.  Similarly, registering
credentials with a given ring allows any process with access to that
ring to submit I/O requests with those credentials.

In an effort to allow LSMs to apply security policy to io_uring I/O
operations, this patch adds two new LSM hooks.  These hooks, in
conjunction with the LSM anonymous inode support previously
submitted, allow an LSM to apply access control policy to the
sharing of io_uring rings as well as any io_uring credential changes
requested by a process.

The new LSM hooks are described below:

 * int security_uring_override_creds(cred)
   Controls if the current task, executing an io_uring operation,
   is allowed to override it's credentials with @cred.  In cases
   where the current task is a user application, the current
   credentials will be those of the user application.  In cases
   where the current task is a kernel thread servicing io_uring
   requests the current credentials will be those of the io_uring
   ring (inherited from the process that created the ring).

 * int security_uring_sqpoll(void)
   Controls if the current task is allowed to create an io_uring
   polling thread (IORING_SETUP_SQPOLL).  Without a SQPOLL thread
   in the kernel processes must submit I/O requests via
   io_uring_enter(2) which allows us to compare any requested
   credential changes against the application making the request.
   With a SQPOLL thread, we can no longer compare requested
   credential changes against the application making the request,
   the comparison is made against the ring's credentials.

Signed-off-by: Paul Moore <[email protected]>
---
 fs/io_uring.c                 |   10 ++++++++++
 include/linux/lsm_hook_defs.h |    5 +++++
 include/linux/lsm_hooks.h     |   13 +++++++++++++
 include/linux/security.h      |   16 ++++++++++++++++
 security/security.c           |   12 ++++++++++++
 5 files changed, 56 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6ff769c9b7d3..d18a594c4c6e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -79,6 +79,7 @@
 #include <linux/pagemap.h>
 #include <linux/io_uring.h>
 #include <linux/audit.h>
+#include <linux/security.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/io_uring.h>
@@ -6537,6 +6538,11 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		if (!req->work.creds)
 			return -EINVAL;
 		get_cred(req->work.creds);
+		ret = security_uring_override_creds(req->work.creds);
+		if (ret) {
+			put_cred(req->work.creds);
+			return ret;
+		}
 	}
 	state = &ctx->submit_state;
 
@@ -7963,6 +7969,10 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx,
 		struct io_sq_data *sqd;
 		bool attached;
 
+		ret = security_uring_sqpoll();
+		if (ret)
+			return ret;
+
 		sqd = io_get_sq_data(p, &attached);
 		if (IS_ERR(sqd)) {
 			ret = PTR_ERR(sqd);
diff --git a/include/linux/lsm_hook_defs.h b/include/linux/lsm_hook_defs.h
index 04c01794de83..88971b3da3c0 100644
--- a/include/linux/lsm_hook_defs.h
+++ b/include/linux/lsm_hook_defs.h
@@ -403,3 +403,8 @@ LSM_HOOK(void, LSM_RET_VOID, perf_event_free, struct perf_event *event)
 LSM_HOOK(int, 0, perf_event_read, struct perf_event *event)
 LSM_HOOK(int, 0, perf_event_write, struct perf_event *event)
 #endif /* CONFIG_PERF_EVENTS */
+
+#ifdef CONFIG_IO_URING
+LSM_HOOK(int, 0, uring_override_creds, const struct cred *new)
+LSM_HOOK(int, 0, uring_sqpoll, void)
+#endif /* CONFIG_IO_URING */
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 5c4c5c0602cb..0eb0ae95c4c4 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -1557,6 +1557,19 @@
  * 	Read perf_event security info if allowed.
  * @perf_event_write:
  * 	Write perf_event security info if allowed.
+ *
+ * Security hooks for io_uring
+ *
+ * @uring_override_creds:
+ *      Check if the current task, executing an io_uring operation, is allowed
+ *      to override it's credentials with @new.
+ *
+ *      @new: the new creds to use
+ *
+ * @uring_sqpoll:
+ *      Check whether the current task is allowed to spawn a io_uring polling
+ *      thread (IORING_SETUP_SQPOLL).
+ *
  */
 union security_list_options {
 	#define LSM_HOOK(RET, DEFAULT, NAME, ...) RET (*NAME)(__VA_ARGS__);
diff --git a/include/linux/security.h b/include/linux/security.h
index 06f7c50ce77f..263a744c839f 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -2037,4 +2037,20 @@ static inline int security_perf_event_write(struct perf_event *event)
 #endif /* CONFIG_SECURITY */
 #endif /* CONFIG_PERF_EVENTS */
 
+#ifdef CONFIG_IO_URING
+#ifdef CONFIG_SECURITY
+extern int security_uring_override_creds(const struct cred *new);
+extern int security_uring_sqpoll(void);
+#else
+static inline int security_uring_override_creds(const struct cred *new)
+{
+	return 0;
+}
+static inline int security_uring_sqpoll(void)
+{
+	return 0;
+}
+#endif /* CONFIG_SECURITY */
+#endif /* CONFIG_IO_URING */
+
 #endif /* ! __LINUX_SECURITY_H */
diff --git a/security/security.c b/security/security.c
index b38155b2de83..3d6b3a2cacf5 100644
--- a/security/security.c
+++ b/security/security.c
@@ -2624,3 +2624,15 @@ int security_perf_event_write(struct perf_event *event)
 	return call_int_hook(perf_event_write, 0, event);
 }
 #endif /* CONFIG_PERF_EVENTS */
+
+#ifdef CONFIG_IO_URING
+int security_uring_override_creds(const struct cred *new)
+{
+	return call_int_hook(uring_override_creds, 0, new);
+}
+
+int security_uring_sqpoll(void)
+{
+	return call_int_hook(uring_sqpoll, 0);
+}
+#endif /* CONFIG_IO_URING */


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 8/9] selinux: add support for the io_uring access controls
  2021-05-21 21:49 [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Paul Moore
                   ` (6 preceding siblings ...)
  2021-05-21 21:50 ` [RFC PATCH 7/9] lsm,io_uring: add LSM hooks to io_uring Paul Moore
@ 2021-05-21 21:50 ` Paul Moore
  2021-05-21 21:50 ` [RFC PATCH 9/9] Smack: Brutalist io_uring support with debug Paul Moore
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-21 21:50 UTC (permalink / raw)
  To: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

WARNING - This is a work in progress, this patch, including the
description, may be incomplete or even incorrect.  You have been
warned.

This patch implements two new io_uring access controls, specifically
support for controlling the io_uring "personalities" and
IORING_SETUP_SQPOLL.  Controlling the sharing of io_urings themselves
is handled via the normal file/inode labeling and sharing mechanisms.

The io_uring { override_creds } permission restricts which domains
the subject domain can use to override it's own credentials.
Granting a domain the io_uring { override_creds } permission allows
it to impersonate another domain in io_uring operations.

The io_uring { sqpoll } permission restricts which domains can create
asynchronous io_uring polling threads.  This is important from a
security perspective as operations queued by this asynchronous thread
inherit the credentials of the thread creator by default; if an
io_uring is shared across process/domain boundaries this could result
in one domain impersonating another.  Controlling the creation of
sqpoll threads, and the sharing of io_urings across processes, allow
policy authors to restrict the ability of one domain to impersonate
another via io_uring.

As a quick summary, this patch adds a new object class with two
permissions:

 io_uring { override_creds sqpoll }

These permissions can be seen in the two simple policy statements
below:

  allow domA_t domB_t : io_uring { override_creds };
  allow domA_t self : io_uring { sqpoll };

Signed-off-by: Paul Moore <[email protected]>
---
 security/selinux/hooks.c            |   67 +++++++++++++++++++++++++++++++++++
 security/selinux/include/classmap.h |    2 +
 2 files changed, 69 insertions(+)

diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index eaea837d89d1..696130972e4d 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -7115,6 +7115,68 @@ static int selinux_perf_event_write(struct perf_event *event)
 }
 #endif
 
+#ifdef CONFIG_IO_URING
+
+#if 1
+/* XXX - dump_creds() is for debugging only, remove before committing */
+static void dump_creds(const struct cred *cred)
+{
+	u32 secid;
+	char *ctx = NULL;
+	unsigned int ctx_len = 0;
+
+	security_cred_getsecid(cred, &secid);
+	if (secid)
+		security_secid_to_secctx(secid, &ctx, &ctx_len);
+
+	printk(KERN_ERR "CRED: ->security is %s\n", (ctx ? : "(error)"));
+
+	if (ctx)
+		security_release_secctx(ctx, ctx_len);
+}
+#endif
+
+/**
+ * selinux_uring_override_creds - check the requested cred override
+ * @new: the target creds
+ *
+ * Check to see if the current task is allowed to override it's credentials
+ * to service an io_uring operation.
+ */
+int selinux_uring_override_creds(const struct cred *new)
+{
+#if 1
+	/* XXX - debug only code, remove before committing */
+	pr_err("PMD: selinux_uring_override_creds()\n");
+	pr_err("PMD: >>> CURRENT\n");
+	dump_creds(current_cred());
+	pr_err("PMD: >>> NEW\n");
+	dump_creds(new);
+#endif
+	return avc_has_perm(&selinux_state, current_sid(), cred_sid(new),
+			    SECCLASS_IO_URING, IO_URING__OVERRIDE_CREDS, NULL);
+}
+
+/**
+ * selinux_uring_sqpoll - check if a io_uring polling thread can be created
+ *
+ * Check to see if the current task is allowed to create a new io_uring
+ * kernel polling thread.
+ */
+int selinux_uring_sqpoll(void)
+{
+	int sid = current_sid();
+#if 1
+	/* XXX - debug only code, remove before committing */
+	pr_err("PMD: selinux_uring_sqpoll()\n");
+	pr_err("PMD: >>> CURRENT\n");
+	dump_creds(current_cred());
+#endif
+	return avc_has_perm(&selinux_state, sid, sid,
+			    SECCLASS_IO_URING, IO_URING__SQPOLL, NULL);
+}
+#endif /* CONFIG_IO_URING */
+
 /*
  * IMPORTANT NOTE: When adding new hooks, please be careful to keep this order:
  * 1. any hooks that don't belong to (2.) or (3.) below,
@@ -7353,6 +7415,11 @@ static struct security_hook_list selinux_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(perf_event_write, selinux_perf_event_write),
 #endif
 
+#ifdef CONFIG_IO_URING
+	LSM_HOOK_INIT(uring_override_creds, selinux_uring_override_creds),
+	LSM_HOOK_INIT(uring_sqpoll, selinux_uring_sqpoll),
+#endif
+
 	LSM_HOOK_INIT(locked_down, selinux_lockdown),
 
 	/*
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 62d19bccf3de..3314ad72279d 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -252,6 +252,8 @@ struct security_class_mapping secclass_map[] = {
 	  { "integrity", "confidentiality", NULL } },
 	{ "anon_inode",
 	  { COMMON_FILE_PERMS, NULL } },
+	{ "io_uring",
+	  { "override_creds", "sqpoll", NULL } },
 	{ NULL }
   };
 


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 9/9] Smack: Brutalist io_uring support with debug
  2021-05-21 21:49 [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Paul Moore
                   ` (7 preceding siblings ...)
  2021-05-21 21:50 ` [RFC PATCH 8/9] selinux: add support for the io_uring access controls Paul Moore
@ 2021-05-21 21:50 ` Paul Moore
  2021-05-22  0:53 ` [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Tetsuo Handa
  2021-05-26 15:00 ` Jeff Moyer
  10 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-21 21:50 UTC (permalink / raw)
  To: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

From: Casey Schaufler <[email protected]>

Add Smack privilege checks for io_uring. Use CAP_MAC_OVERRIDE
for the override_creds case and CAP_MAC_ADMIN for creating a
polling thread. These choices are based on conjecture regarding
the intent of the surrounding code.

Signed-off-by: Casey Schaufler <[email protected]>
Signed-off-by: Paul Moore <[email protected]>
---
 security/smack/smack_lsm.c |   64 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 223a6da0e6dc..f6423c0096e9 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -4691,6 +4691,66 @@ static int smack_dentry_create_files_as(struct dentry *dentry, int mode,
 	return 0;
 }
 
+#ifdef CONFIG_IO_URING
+/**
+ * smack_uring_override_creds - Is io_uring cred override allowed?
+ * @new: the target creds
+ *
+ * Check to see if the current task is allowed to override it's credentials
+ * to service an io_uring operation.
+ */
+int smack_uring_override_creds(const struct cred *new)
+{
+	struct task_smack *tsp = smack_cred(current_cred());
+	struct task_smack *nsp = smack_cred(new);
+
+#if 1
+	if (tsp->smk_task == nsp->smk_task)
+		pr_info("%s: Smack matches %s\n", __func__,
+			tsp->smk_task->smk_known);
+	else
+		pr_info("%s: Smack override check %s to %s\n", __func__,
+			tsp->smk_task->smk_known, nsp->smk_task->smk_known);
+#endif
+	/*
+	 * Allow the degenerate case where the new Smack value is
+	 * the same as the current Smack value.
+	 */
+	if (tsp->smk_task == nsp->smk_task)
+		return 0;
+
+#if 1
+	pr_info("%s: Smack sqpoll %s\n", __func__,
+		smack_privileged_cred(CAP_MAC_OVERRIDE, current_cred()) ?
+		"ok by Smack" : "disallowed (No CAP_MAC_OVERRIDE)");
+#endif
+	if (smack_privileged_cred(CAP_MAC_OVERRIDE, current_cred()))
+		return 0;
+
+	return -EPERM;
+}
+
+/**
+ * smack_uring_sqpoll - check if a io_uring polling thread can be created
+ *
+ * Check to see if the current task is allowed to create a new io_uring
+ * kernel polling thread.
+ */
+int smack_uring_sqpoll(void)
+{
+#if 1
+	pr_info("%s: Smack new ring %s\n", __func__,
+		smack_privileged_cred(CAP_MAC_ADMIN, current_cred()) ?
+		"ok by Smack" : "disallowed (No CAP_MAC_ADMIN)");
+#endif
+	if (smack_privileged_cred(CAP_MAC_ADMIN, current_cred()))
+		return 0;
+
+	return -EPERM;
+}
+
+#endif /* CONFIG_IO_URING */
+
 struct lsm_blob_sizes smack_blob_sizes __lsm_ro_after_init = {
 	.lbs_cred = sizeof(struct task_smack),
 	.lbs_file = sizeof(struct smack_known *),
@@ -4843,6 +4903,10 @@ static struct security_hook_list smack_hooks[] __lsm_ro_after_init = {
 	LSM_HOOK_INIT(inode_copy_up, smack_inode_copy_up),
 	LSM_HOOK_INIT(inode_copy_up_xattr, smack_inode_copy_up_xattr),
 	LSM_HOOK_INIT(dentry_create_files_as, smack_dentry_create_files_as),
+#ifdef CONFIG_IO_URING
+	LSM_HOOK_INIT(uring_override_creds, smack_uring_override_creds),
+	LSM_HOOK_INIT(uring_sqpoll, smack_uring_sqpoll),
+#endif
 };
 
 


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-21 21:49 ` [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring Paul Moore
@ 2021-05-22  0:22   ` Pavel Begunkov
  2021-05-22  2:36     ` Paul Moore
  2021-06-02 17:29   ` [RFC PATCH 2/9] audit, io_uring, io-wq: " Richard Guy Briggs
  1 sibling, 1 reply; 71+ messages in thread
From: Pavel Begunkov @ 2021-05-22  0:22 UTC (permalink / raw)
  To: Paul Moore, linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On 5/21/21 10:49 PM, Paul Moore wrote:
> WARNING - This is a work in progress and should not be merged
> anywhere important.  It is almost surely not complete, and while it
> probably compiles it likely hasn't been booted and will do terrible
> things.  You have been warned.
> 
> This patch adds basic auditing to io_uring operations, regardless of
> their context.  This is accomplished by allocating audit_context
> structures for the io-wq worker and io_uring SQPOLL kernel threads
> as well as explicitly auditing the io_uring operations in
> io_issue_sqe().  The io_uring operations are audited using a new
> AUDIT_URINGOP record, an example is shown below:
> 
>   % <TODO - insert AUDIT_URINGOP record example>
> 
> Thanks to Richard Guy Briggs for review and feedback.
> 
> Signed-off-by: Paul Moore <[email protected]>
> ---
[...]
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index e481ac8a757a..e9941d1ad8fd 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -78,6 +78,7 @@
>  #include <linux/task_work.h>
>  #include <linux/pagemap.h>
>  #include <linux/io_uring.h>
> +#include <linux/audit.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/io_uring.h>
> @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>  	if (req->work.creds && req->work.creds != current_cred())
>  		creds = override_creds(req->work.creds);
>  
> +	if (req->opcode < IORING_OP_LAST)

always true at this point

> +		audit_uring_entry(req->opcode);

So, it adds two if's with memory loads (i.e. current->audit_context)
per request in one of the hottest functions here... No way, nack

Maybe, if it's dynamically compiled into like kprobes if it's
_really_ used.

> +
>  	switch (req->opcode) {
>  	case IORING_OP_NOP:
>  		ret = io_nop(req, issue_flags);
> @@ -6211,6 +6215,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>  		break;
>  	}
>  
> +	if (req->opcode < IORING_OP_LAST)
> +		audit_uring_exit(!ret, ret);
> +
>  	if (creds)
>  		revert_creds(creds);

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring
  2021-05-21 21:49 [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Paul Moore
                   ` (8 preceding siblings ...)
  2021-05-21 21:50 ` [RFC PATCH 9/9] Smack: Brutalist io_uring support with debug Paul Moore
@ 2021-05-22  0:53 ` Tetsuo Handa
  2021-05-22  2:06   ` Paul Moore
  2021-05-26 15:00 ` Jeff Moyer
  10 siblings, 1 reply; 71+ messages in thread
From: Tetsuo Handa @ 2021-05-22  0:53 UTC (permalink / raw)
  To: Paul Moore, linux-security-module, linux-audit, io-uring
  Cc: selinux, linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On 2021/05/22 6:49, Paul Moore wrote:
> I've provided the SELinux
> implementation, Casey has been nice enough to provide a Smack patch,
> and John is working on an AppArmor patch as I write this.  I've
> mentioned this work to the other LSM maintainers that I believe might
> be affected but I have not heard back from anyone else at this point.

I don't think any change is required for TOMOYO, for TOMOYO does not
use "struct cred"->security where [RFC PATCH 8/9] and [RFC PATCH 9/9]
are addressing, and TOMOYO does not call kernel/audit*.c functions.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring
  2021-05-22  0:53 ` [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Tetsuo Handa
@ 2021-05-22  2:06   ` Paul Moore
  0 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-22  2:06 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-security-module, linux-audit, io-uring, selinux,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On Fri, May 21, 2021 at 8:53 PM Tetsuo Handa
<[email protected]> wrote:
> On 2021/05/22 6:49, Paul Moore wrote:
> > I've provided the SELinux
> > implementation, Casey has been nice enough to provide a Smack patch,
> > and John is working on an AppArmor patch as I write this.  I've
> > mentioned this work to the other LSM maintainers that I believe might
> > be affected but I have not heard back from anyone else at this point.
>
> I don't think any change is required for TOMOYO, for TOMOYO does not
> use "struct cred"->security where [RFC PATCH 8/9] and [RFC PATCH 9/9]
> are addressing, and TOMOYO does not call kernel/audit*.c functions.

Good to know, thank you for checking.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-22  0:22   ` Pavel Begunkov
@ 2021-05-22  2:36     ` Paul Moore
  2021-05-23 20:26       ` Pavel Begunkov
  0 siblings, 1 reply; 71+ messages in thread
From: Paul Moore @ 2021-05-22  2:36 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <[email protected]> wrote:
> On 5/21/21 10:49 PM, Paul Moore wrote:
> > WARNING - This is a work in progress and should not be merged
> > anywhere important.  It is almost surely not complete, and while it
> > probably compiles it likely hasn't been booted and will do terrible
> > things.  You have been warned.
> >
> > This patch adds basic auditing to io_uring operations, regardless of
> > their context.  This is accomplished by allocating audit_context
> > structures for the io-wq worker and io_uring SQPOLL kernel threads
> > as well as explicitly auditing the io_uring operations in
> > io_issue_sqe().  The io_uring operations are audited using a new
> > AUDIT_URINGOP record, an example is shown below:
> >
> >   % <TODO - insert AUDIT_URINGOP record example>
> >
> > Thanks to Richard Guy Briggs for review and feedback.
> >
> > Signed-off-by: Paul Moore <[email protected]>
> > ---
> [...]
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index e481ac8a757a..e9941d1ad8fd 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -78,6 +78,7 @@
> >  #include <linux/task_work.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/io_uring.h>
> > +#include <linux/audit.h>
> >
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/io_uring.h>
> > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
> >       if (req->work.creds && req->work.creds != current_cred())
> >               creds = override_creds(req->work.creds);
> >
> > +     if (req->opcode < IORING_OP_LAST)
>
> always true at this point

I placed the opcode check before the audit call because the switch
statement below which handles the operation dispatching has a 'ret =
-EINVAL' for the default case, implying that there are some paths
where an invalid opcode could be passed into the function.  Obviously
if that is not the case and you can guarantee that req->opcode will
always be valid we can easily drop the check prior to the audit call.

> > +             audit_uring_entry(req->opcode);
>
> So, it adds two if's with memory loads (i.e. current->audit_context)
> per request in one of the hottest functions here... No way, nack
>
> Maybe, if it's dynamically compiled into like kprobes if it's
> _really_ used.

I'm open to suggestions on how to tweak the io_uring/audit
integration, if you don't like what I've proposed in this patchset,
lets try to come up with a solution that is more palatable.  If you
were going to add audit support for these io_uring operations, how
would you propose we do it?  Not being able to properly audit io_uring
operations is going to be a significant issue for a chunk of users, if
it isn't already, we need to work to find a solution to this problem.

Unfortunately I don't think dynamically inserting audit calls is
something that would meet the needs of the audit community (I fear it
would run afoul of the various security certifications), and it
definitely isn't something that we support at present.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-22  2:36     ` Paul Moore
@ 2021-05-23 20:26       ` Pavel Begunkov
  2021-05-24 19:59         ` Paul Moore
  0 siblings, 1 reply; 71+ messages in thread
From: Pavel Begunkov @ 2021-05-23 20:26 UTC (permalink / raw)
  To: Paul Moore
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On 5/22/21 3:36 AM, Paul Moore wrote:
> On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <[email protected]> wrote:
>> On 5/21/21 10:49 PM, Paul Moore wrote:
[...]
>>>
>>> +     if (req->opcode < IORING_OP_LAST)
>>
>> always true at this point
> 
> I placed the opcode check before the audit call because the switch
> statement below which handles the operation dispatching has a 'ret =
> -EINVAL' for the default case, implying that there are some paths
> where an invalid opcode could be passed into the function.  Obviously
> if that is not the case and you can guarantee that req->opcode will
> always be valid we can easily drop the check prior to the audit call.

It is always true at this point, would be completely broken
otherwise

>>> +             audit_uring_entry(req->opcode);
>>
>> So, it adds two if's with memory loads (i.e. current->audit_context)
>> per request in one of the hottest functions here... No way, nack
>>
>> Maybe, if it's dynamically compiled into like kprobes if it's
>> _really_ used.
> 
> I'm open to suggestions on how to tweak the io_uring/audit
> integration, if you don't like what I've proposed in this patchset,
> lets try to come up with a solution that is more palatable.  If you
> were going to add audit support for these io_uring operations, how
> would you propose we do it?  Not being able to properly audit io_uring
> operations is going to be a significant issue for a chunk of users, if
> it isn't already, we need to work to find a solution to this problem.

Who knows. First of all, seems CONFIG_AUDIT is enabled by default
for many popular distributions, so I assume that is not compiled out.

What are use cases for audit? Always running I guess? Putting aside
compatibility problems, it sounds that with the amount of overhead
it adds there is no much profit in using io_uring in the first place.
Is that so?

__audit_uring_exit()
-> audit_filter_syscall()
  -> for (audit_list) if (...) audit_filter_rules()
    -> ...
-> audit_filter_inodes()
  -> ...

> Unfortunately I don't think dynamically inserting audit calls is
> something that would meet the needs of the audit community (I fear it
> would run afoul of the various security certifications), and it
> definitely isn't something that we support at present.

I see

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-23 20:26       ` Pavel Begunkov
@ 2021-05-24 19:59         ` Paul Moore
  2021-05-25  8:27           ` Pavel Begunkov
  2021-05-26  1:11           ` Jens Axboe
  0 siblings, 2 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-24 19:59 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On Sun, May 23, 2021 at 4:26 PM Pavel Begunkov <[email protected]> wrote:
> On 5/22/21 3:36 AM, Paul Moore wrote:
> > On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <[email protected]> wrote:
> >> On 5/21/21 10:49 PM, Paul Moore wrote:
> [...]
> >>>
> >>> +     if (req->opcode < IORING_OP_LAST)
> >>
> >> always true at this point
> >
> > I placed the opcode check before the audit call because the switch
> > statement below which handles the operation dispatching has a 'ret =
> > -EINVAL' for the default case, implying that there are some paths
> > where an invalid opcode could be passed into the function.  Obviously
> > if that is not the case and you can guarantee that req->opcode will
> > always be valid we can easily drop the check prior to the audit call.
>
> It is always true at this point, would be completely broken
> otherwise

Understood, I was just pointing out an oddity in the code.  I just
dropped the checks from my local tree, you'll see it in the next draft
of the patchset.

> >> So, it adds two if's with memory loads (i.e. current->audit_context)
> >> per request in one of the hottest functions here... No way, nack
> >>
> >> Maybe, if it's dynamically compiled into like kprobes if it's
> >> _really_ used.
> >
> > I'm open to suggestions on how to tweak the io_uring/audit
> > integration, if you don't like what I've proposed in this patchset,
> > lets try to come up with a solution that is more palatable.  If you
> > were going to add audit support for these io_uring operations, how
> > would you propose we do it?  Not being able to properly audit io_uring
> > operations is going to be a significant issue for a chunk of users, if
> > it isn't already, we need to work to find a solution to this problem.
>
> Who knows. First of all, seems CONFIG_AUDIT is enabled by default
> for many popular distributions, so I assume that is not compiled out.
>
> What are use cases for audit? Always running I guess?

Audit has been around for quite some time now, and it's goal is to
provide a mechanism for logging "security relevant" information in
such a way that it meets the needs of various security certification
efforts.  Traditional Linux event logging, e.g. syslog and the like,
does not meet these requirements and changing them would likely affect
the usability for those who are not required to be compliant with
these security certifications.  The Linux audit subsystem allows Linux
to be used in places it couldn't be used otherwise (or rather makes it
a *lot* easier).

That said, audit is not for everyone, and we have build time and
runtime options to help make life easier.  Beyond simply disabling
audit at compile time a number of Linux distributions effectively
shortcut audit at runtime by adding a "never" rule to the audit
filter, for example:

 % auditctl -a task,never

> Putting aside compatibility problems, it sounds that with the amount of overhead
> it adds there is no much profit in using io_uring in the first place.
> Is that so?

Well, if audit alone erased all of the io_uring advantages we should
just rip io_uring out of the kernel and people can just disable audit
instead ;)

I believe there are people who would like to use io_uring and are also
required to use a kernel with audit, either due to the need to run a
distribution kernel or the need to capture security information in the
audit stream.  I'm hoping that we can find a solution for these users;
if we don't we are asking this group to choose either io_uring or
audit, and that is something I would like to avoid.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-24 19:59         ` Paul Moore
@ 2021-05-25  8:27           ` Pavel Begunkov
  2021-05-25 14:53             ` Paul Moore
  2021-05-26  1:11           ` Jens Axboe
  1 sibling, 1 reply; 71+ messages in thread
From: Pavel Begunkov @ 2021-05-25  8:27 UTC (permalink / raw)
  To: Paul Moore
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On 5/24/21 8:59 PM, Paul Moore wrote:
> On Sun, May 23, 2021 at 4:26 PM Pavel Begunkov <[email protected]> wrote:
>> On 5/22/21 3:36 AM, Paul Moore wrote:
>>> On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <[email protected]> wrote:
>>>> On 5/21/21 10:49 PM, Paul Moore wrote:
>> [...]
>>>>>
>>>>> +     if (req->opcode < IORING_OP_LAST)
>>>>
>>>> always true at this point
>>>
>>> I placed the opcode check before the audit call because the switch
>>> statement below which handles the operation dispatching has a 'ret =
>>> -EINVAL' for the default case, implying that there are some paths
>>> where an invalid opcode could be passed into the function.  Obviously
>>> if that is not the case and you can guarantee that req->opcode will
>>> always be valid we can easily drop the check prior to the audit call.
>>
>> It is always true at this point, would be completely broken
>> otherwise
> 
> Understood, I was just pointing out an oddity in the code.  I just
> dropped the checks from my local tree, you'll see it in the next draft
> of the patchset.
> 
>>>> So, it adds two if's with memory loads (i.e. current->audit_context)
>>>> per request in one of the hottest functions here... No way, nack
>>>>
>>>> Maybe, if it's dynamically compiled into like kprobes if it's
>>>> _really_ used.
>>>
>>> I'm open to suggestions on how to tweak the io_uring/audit
>>> integration, if you don't like what I've proposed in this patchset,
>>> lets try to come up with a solution that is more palatable.  If you
>>> were going to add audit support for these io_uring operations, how
>>> would you propose we do it?  Not being able to properly audit io_uring
>>> operations is going to be a significant issue for a chunk of users, if
>>> it isn't already, we need to work to find a solution to this problem.
>>
>> Who knows. First of all, seems CONFIG_AUDIT is enabled by default
>> for many popular distributions, so I assume that is not compiled out.
>>
>> What are use cases for audit? Always running I guess?
> 
> Audit has been around for quite some time now, and it's goal is to
> provide a mechanism for logging "security relevant" information in
> such a way that it meets the needs of various security certification
> efforts.  Traditional Linux event logging, e.g. syslog and the like,
> does not meet these requirements and changing them would likely affect
> the usability for those who are not required to be compliant with
> these security certifications.  The Linux audit subsystem allows Linux
> to be used in places it couldn't be used otherwise (or rather makes it
> a *lot* easier).
> 
> That said, audit is not for everyone, and we have build time and
> runtime options to help make life easier.  Beyond simply disabling
> audit at compile time a number of Linux distributions effectively
> shortcut audit at runtime by adding a "never" rule to the audit
> filter, for example:
> 
>  % auditctl -a task,never
> 
>> Putting aside compatibility problems, it sounds that with the amount of overhead
>> it adds there is no much profit in using io_uring in the first place.
>> Is that so?
> 
> Well, if audit alone erased all of the io_uring advantages we should
> just rip io_uring out of the kernel and people can just disable audit
> instead ;)


Hah, if we add a simple idle "feature" like

for (i=0;i<1000000;i+) {;}

and it would destroy all the performance, let's throw useless
Linux kernel then!

If seriously, it's more of an open question to me, how much overhead
it adds if enabled (with typical filters/options/etc).

Btw, do you really need two hooks -- before and right after
execution?

> I believe there are people who would like to use io_uring and are also
> required to use a kernel with audit, either due to the need to run a
> distribution kernel or the need to capture security information in the
> audit stream.  I'm hoping that we can find a solution for these users;
> if we don't we are asking this group to choose either io_uring or
> audit, and that is something I would like to avoid. 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-25  8:27           ` Pavel Begunkov
@ 2021-05-25 14:53             ` Paul Moore
  0 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-25 14:53 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On Tue, May 25, 2021 at 4:27 AM Pavel Begunkov <[email protected]> wrote:
> On 5/24/21 8:59 PM, Paul Moore wrote:
> > On Sun, May 23, 2021 at 4:26 PM Pavel Begunkov <[email protected]> wrote:
> >> On 5/22/21 3:36 AM, Paul Moore wrote:
> >>> On Fri, May 21, 2021 at 8:22 PM Pavel Begunkov <[email protected]> wrote:
> >>>> On 5/21/21 10:49 PM, Paul Moore wrote:
> >> [...]
> >>>>>
> >>>>> +     if (req->opcode < IORING_OP_LAST)
> >>>>
> >>>> always true at this point
> >>>
> >>> I placed the opcode check before the audit call because the switch
> >>> statement below which handles the operation dispatching has a 'ret =
> >>> -EINVAL' for the default case, implying that there are some paths
> >>> where an invalid opcode could be passed into the function.  Obviously
> >>> if that is not the case and you can guarantee that req->opcode will
> >>> always be valid we can easily drop the check prior to the audit call.
> >>
> >> It is always true at this point, would be completely broken
> >> otherwise
> >
> > Understood, I was just pointing out an oddity in the code.  I just
> > dropped the checks from my local tree, you'll see it in the next draft
> > of the patchset.
> >
> >>>> So, it adds two if's with memory loads (i.e. current->audit_context)
> >>>> per request in one of the hottest functions here... No way, nack
> >>>>
> >>>> Maybe, if it's dynamically compiled into like kprobes if it's
> >>>> _really_ used.
> >>>
> >>> I'm open to suggestions on how to tweak the io_uring/audit
> >>> integration, if you don't like what I've proposed in this patchset,
> >>> lets try to come up with a solution that is more palatable.  If you
> >>> were going to add audit support for these io_uring operations, how
> >>> would you propose we do it?  Not being able to properly audit io_uring
> >>> operations is going to be a significant issue for a chunk of users, if
> >>> it isn't already, we need to work to find a solution to this problem.
> >>
> >> Who knows. First of all, seems CONFIG_AUDIT is enabled by default
> >> for many popular distributions, so I assume that is not compiled out.
> >>
> >> What are use cases for audit? Always running I guess?
> >
> > Audit has been around for quite some time now, and it's goal is to
> > provide a mechanism for logging "security relevant" information in
> > such a way that it meets the needs of various security certification
> > efforts.  Traditional Linux event logging, e.g. syslog and the like,
> > does not meet these requirements and changing them would likely affect
> > the usability for those who are not required to be compliant with
> > these security certifications.  The Linux audit subsystem allows Linux
> > to be used in places it couldn't be used otherwise (or rather makes it
> > a *lot* easier).
> >
> > That said, audit is not for everyone, and we have build time and
> > runtime options to help make life easier.  Beyond simply disabling
> > audit at compile time a number of Linux distributions effectively
> > shortcut audit at runtime by adding a "never" rule to the audit
> > filter, for example:
> >
> >  % auditctl -a task,never
> >
> >> Putting aside compatibility problems, it sounds that with the amount of overhead
> >> it adds there is no much profit in using io_uring in the first place.
> >> Is that so?
> >
> > Well, if audit alone erased all of the io_uring advantages we should
> > just rip io_uring out of the kernel and people can just disable audit
> > instead ;)
>
> Hah, if we add a simple idle "feature" like
>
> for (i=0;i<1000000;i+) {;}
>
> and it would destroy all the performance, let's throw useless
> Linux kernel then!
>
> If seriously, it's more of an open question to me, how much overhead
> it adds if enabled (with typical filters/options/etc).

I am very hesitant to define a "typical" audit configuration as it
really is dependent on the user's security requirements as well as the
particular solution/environment.  Any configuration I pick will be
"wrong" for a lot of audit users :)

As I see it, users will likely find themselves in one of three
performance buckets:

* io_uring enabled, CONFIG_AUDIT=n

For those who are already trying to get that last 1% of performance
from their kernel and are willing to give up most everything else to
get it this is the obvious choice.  Needless to say there should not
be any audit related impact in this case (especially since we've
removed that req->opcode checks prior to the audit calls).

* io_uring enabled, CONFIG_AUDIT=y, audit "task,never" runtime config

[side note: I made some tweaks to audit_uring_entry() to move the
audit_enabled check there so we no longer need to call into
__audit_uring_entry() in this fastpath case.  Similarly
audit_uring_exit() now does an audit_dummy_context() check instead of
simply checking audit_context(), this should avoid calling into
__audit_uring_exit() when the io_uring op is not being audited.]

I'm guessing that most distro users will fall into this bucket.  Here
the task's audit_context should always be NULL, both in the syscall
context and sqpoll context, so io_uring would call into the inline
audit_uring_entry() function and return without calling into
__audit_uring_entry() (see the "side note" above).  The
audit_uring_exit() call would behave in a similar manner.

* io_uring enabled, CONFIG_AUDIT=y, some sort of runtime audit config

For obvious reasons this case has the most performance impact and as
mentioned above it can vary quite a bit.  In my opinion this is the
least important bucket from a performance perspective as the user
clearly has a need for the security information that audit provides
and enabling that functionality in io_uring is critical to allowing
the user to take advantage of io_uring.  The performance of io_uring
is impacted, but it should still be more performant than synchronous
I/O and it allows the user to run their existing io_uring
applications/code-paths.

> Btw, do you really need two hooks -- before and right after
> execution?

Yes, the entry/before hook does the setup for the io_uring op (very
similar to a syscall entry point) and the exit/after hook is what does
the audit record generation based on what took place during the
io_uring operation.

> > I believe there are people who would like to use io_uring and are also
> > required to use a kernel with audit, either due to the need to run a
> > distribution kernel or the need to capture security information in the
> > audit stream.  I'm hoping that we can find a solution for these users;
> > if we don't we are asking this group to choose either io_uring or
> > audit, and that is something I would like to avoid.

I'm leaving this old paragraph here because I really think this is
important to the discussion.  If we can't find a solution here we
would need to make io_uring and audit mutually exclusive and I don't
think that is in the best interests of the users, and would surely
create a headache for the distros.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-24 19:59         ` Paul Moore
  2021-05-25  8:27           ` Pavel Begunkov
@ 2021-05-26  1:11           ` Jens Axboe
  2021-05-26  2:04             ` Paul Moore
  1 sibling, 1 reply; 71+ messages in thread
From: Jens Axboe @ 2021-05-26  1:11 UTC (permalink / raw)
  To: Paul Moore, Pavel Begunkov
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On 5/24/21 1:59 PM, Paul Moore wrote:
> That said, audit is not for everyone, and we have build time and
> runtime options to help make life easier.  Beyond simply disabling
> audit at compile time a number of Linux distributions effectively
> shortcut audit at runtime by adding a "never" rule to the audit
> filter, for example:
> 
>  % auditctl -a task,never

As has been brought up, the issue we're facing is that distros have
CONFIG_AUDIT=y and hence the above is the best real world case outside
of people doing custom kernels. My question would then be how much
overhead the above will add, considering it's an entry/exit call per op.
If auditctl is turned off, what is the expectation in turns of overhead?

My gut feeling tells me it's likely going to be too much. Keep in mind
that we're sometimes doing millions of operations per second, per core.

aio never had any audit logging as far as I can tell. I think it'd make
a lot more sense to selectively enable audit logging only for opcodes
that we care about. File open/create/unlink/mkdir etc, that kind of
thing. File level operations that people would care about logging. Would
they care about logging a buffer registration or a polled read from a
device/file? I highly doubt it, and we don't do that for alternative
methods either. Doesn't really make sense for a lot of the other
operations, imho.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26  1:11           ` Jens Axboe
@ 2021-05-26  2:04             ` Paul Moore
  2021-05-26 10:19               ` Pavel Begunkov
  2021-05-26 17:15               ` Jens Axboe
  0 siblings, 2 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-26  2:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Pavel Begunkov, linux-security-module, selinux, linux-audit,
	io-uring, linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
> On 5/24/21 1:59 PM, Paul Moore wrote:
> > That said, audit is not for everyone, and we have build time and
> > runtime options to help make life easier.  Beyond simply disabling
> > audit at compile time a number of Linux distributions effectively
> > shortcut audit at runtime by adding a "never" rule to the audit
> > filter, for example:
> >
> >  % auditctl -a task,never
>
> As has been brought up, the issue we're facing is that distros have
> CONFIG_AUDIT=y and hence the above is the best real world case outside
> of people doing custom kernels. My question would then be how much
> overhead the above will add, considering it's an entry/exit call per op.
> If auditctl is turned off, what is the expectation in turns of overhead?

I commented on that case in my last email to Pavel, but I'll try to go
over it again in a little more detail.

As we discussed earlier in this thread, we can skip the req->opcode
check before both the _entry and _exit calls, so we are left with just
the bare audit calls in the io_uring code.  As the _entry and _exit
functions are small, I've copied them and their supporting functions
below and I'll try to explain what would happen in CONFIG_AUDIT=y,
"task,never" case.

+  static inline struct audit_context *audit_context(void)
+  {
+    return current->audit_context;
+  }

+  static inline bool audit_dummy_context(void)
+  {
+    void *p = audit_context();
+    return !p || *(int *)p;
+  }

+  static inline void audit_uring_entry(u8 op)
+  {
+    if (unlikely(audit_enabled && audit_context()))
+      __audit_uring_entry(op);
+  }

We have one if statement where the conditional checks on two
individual conditions.  The first (audit_enabled) is simply a check to
see if anyone has "turned on" auditing at runtime; historically this
worked rather well, and still does in a number of places, but ever
since systemd has taken to forcing audit on regardless of the admin's
audit configuration it is less useful.  The second (audit_context())
is a check to see if an audit_context has been allocated for the
current task.  In the case of "task,never" current->audit_context will
be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
will never be called.

Worst case here is checking the value of audit_enabled and
current->audit_context.  Depending on which you think is more likely
we can change the order of the check so that the
current->audit_context check is first if you feel that is more likely
to be NULL than audit_enabled is to be false (it may be that way now).

+  static inline void audit_uring_exit(int success, long code)
+  {
+    if (unlikely(!audit_dummy_context()))
+      __audit_uring_exit(success, code);
+  }

The exit call is very similar to the entry call, but in the
"task,never" case it is very simple as the first check to be performed
is the current->audit_context check which we know to be NULL.  The
__audit_uring_exit() slowpath will never be called.

> aio never had any audit logging as far as I can tell. I think it'd make
> a lot more sense to selectively enable audit logging only for opcodes
> that we care about. File open/create/unlink/mkdir etc, that kind of
> thing. File level operations that people would care about logging. Would
> they care about logging a buffer registration or a polled read from a
> device/file? I highly doubt it, and we don't do that for alternative
> methods either. Doesn't really make sense for a lot of the other
> operations, imho.

We would need to check with the current security requirements (there
are distro people on the linux-audit list that keep track of that
stuff), but looking at the opcodes right now my gut feeling is that
most of the opcodes would be considered "security relevant" so
selective auditing might not be that useful in practice.  It would
definitely clutter the code and increase the chances that new opcodes
would not be properly audited when they are merged.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26  2:04             ` Paul Moore
@ 2021-05-26 10:19               ` Pavel Begunkov
  2021-05-26 14:38                 ` Paul Moore
  2021-05-26 17:15               ` Jens Axboe
  1 sibling, 1 reply; 71+ messages in thread
From: Pavel Begunkov @ 2021-05-26 10:19 UTC (permalink / raw)
  To: Paul Moore, Jens Axboe
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On 5/26/21 3:04 AM, Paul Moore wrote:
> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>> That said, audit is not for everyone, and we have build time and
>>> runtime options to help make life easier.  Beyond simply disabling
>>> audit at compile time a number of Linux distributions effectively
>>> shortcut audit at runtime by adding a "never" rule to the audit
>>> filter, for example:
>>>
>>>  % auditctl -a task,never
>>
>> As has been brought up, the issue we're facing is that distros have
>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>> of people doing custom kernels. My question would then be how much
>> overhead the above will add, considering it's an entry/exit call per op.
>> If auditctl is turned off, what is the expectation in turns of overhead?
> 
> I commented on that case in my last email to Pavel, but I'll try to go
> over it again in a little more detail.
> 
> As we discussed earlier in this thread, we can skip the req->opcode
> check before both the _entry and _exit calls, so we are left with just
> the bare audit calls in the io_uring code.  As the _entry and _exit
> functions are small, I've copied them and their supporting functions
> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> "task,never" case.
> 
> +  static inline struct audit_context *audit_context(void)
> +  {
> +    return current->audit_context;
> +  }
> 
> +  static inline bool audit_dummy_context(void)
> +  {
> +    void *p = audit_context();
> +    return !p || *(int *)p;
> +  }
> 
> +  static inline void audit_uring_entry(u8 op)
> +  {
> +    if (unlikely(audit_enabled && audit_context()))
> +      __audit_uring_entry(op);
> +  }

I'd rather agree that it's my cycle-picking. The case I care about
is CONFIG_AUDIT=y (because everybody enable it), and io_uring
tracing _not_ enabled at runtime. If enabled let them suffer
the overhead, it will probably dip down the performance

So, for the case I care about it's two of

if (unlikely(audit_enabled && current->audit_context))

in the hot path. load-test-jump + current, so it will
be around 7x2 instructions. We can throw away audit_enabled
as you say systemd already enables it, that will give
4x2 instructions including 2 conditional jumps.

That's not great at all. And that's why I brought up
the question about need of pre and post hooks and whether
can be combined. Would be just 4 instructions and that is
ok (ish).

> We would need to check with the current security requirements (there
> are distro people on the linux-audit list that keep track of that
> stuff), but looking at the opcodes right now my gut feeling is that
> most of the opcodes would be considered "security relevant" so
> selective auditing might not be that useful in practice.  It would
> definitely clutter the code and increase the chances that new opcodes
> would not be properly audited when they are merged.

I'm curious, why it's enabled by many distros by default? Are there
use cases they use? Tempting to add AUDIT_IOURING=default N, but
won't work I guess

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 10:19               ` Pavel Begunkov
@ 2021-05-26 14:38                 ` Paul Moore
  2021-05-26 15:11                   ` Steve Grubb
  2021-05-26 15:17                   ` Stefan Metzmacher
  0 siblings, 2 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-26 14:38 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jens Axboe, linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On Wed, May 26, 2021 at 6:19 AM Pavel Begunkov <[email protected]> wrote:
> On 5/26/21 3:04 AM, Paul Moore wrote:
> > On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
> >> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>> That said, audit is not for everyone, and we have build time and
> >>> runtime options to help make life easier.  Beyond simply disabling
> >>> audit at compile time a number of Linux distributions effectively
> >>> shortcut audit at runtime by adding a "never" rule to the audit
> >>> filter, for example:
> >>>
> >>>  % auditctl -a task,never
> >>
> >> As has been brought up, the issue we're facing is that distros have
> >> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >> of people doing custom kernels. My question would then be how much
> >> overhead the above will add, considering it's an entry/exit call per op.
> >> If auditctl is turned off, what is the expectation in turns of overhead?
> >
> > I commented on that case in my last email to Pavel, but I'll try to go
> > over it again in a little more detail.
> >
> > As we discussed earlier in this thread, we can skip the req->opcode
> > check before both the _entry and _exit calls, so we are left with just
> > the bare audit calls in the io_uring code.  As the _entry and _exit
> > functions are small, I've copied them and their supporting functions
> > below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> > "task,never" case.
> >
> > +  static inline struct audit_context *audit_context(void)
> > +  {
> > +    return current->audit_context;
> > +  }
> >
> > +  static inline bool audit_dummy_context(void)
> > +  {
> > +    void *p = audit_context();
> > +    return !p || *(int *)p;
> > +  }
> >
> > +  static inline void audit_uring_entry(u8 op)
> > +  {
> > +    if (unlikely(audit_enabled && audit_context()))
> > +      __audit_uring_entry(op);
> > +  }
>
> I'd rather agree that it's my cycle-picking. The case I care about
> is CONFIG_AUDIT=y (because everybody enable it), and io_uring
> tracing _not_ enabled at runtime. If enabled let them suffer
> the overhead, it will probably dip down the performance
>
> So, for the case I care about it's two of
>
> if (unlikely(audit_enabled && current->audit_context))
>
> in the hot path. load-test-jump + current, so it will
> be around 7x2 instructions. We can throw away audit_enabled
> as you say systemd already enables it, that will give
> 4x2 instructions including 2 conditional jumps.

We've basically got it down to the equivalent of two
"current->audit_context != NULL" checks in the case where audit is
built into the kernel but disabled at runtime, e.g. CONFIG_AUDIT=y and
"task,never".  I'm at a loss for how we can lower the overhead any
further, but I'm open to suggestions.

> That's not great at all. And that's why I brought up
> the question about need of pre and post hooks and whether
> can be combined. Would be just 4 instructions and that is
> ok (ish).

As discussed previously in this thread that isn't really an option
from an audit perspective.

> > We would need to check with the current security requirements (there
> > are distro people on the linux-audit list that keep track of that
> > stuff), but looking at the opcodes right now my gut feeling is that
> > most of the opcodes would be considered "security relevant" so
> > selective auditing might not be that useful in practice.  It would
> > definitely clutter the code and increase the chances that new opcodes
> > would not be properly audited when they are merged.
>
> I'm curious, why it's enabled by many distros by default? Are there
> use cases they use?

We've already talked about certain users and environments where audit
is an important requirement, e.g. public sector, health care,
financial institutions, etc.; without audit Linux wouldn't be an
option for these users, at least not without heavy modification,
out-of-tree/ISV patches, etc.  I currently don't have any direct ties
to any distros, "Enterprise" or otherwise, but in the past it has been
my experience that distros much prefer to have a single kernel build
to address the needs of all their users.  In the few cases I have seen
where a second kernel build is supported it is usually for hardware
enablement.  I'm sure there are other cases too, I just haven't seen
them personally; the big distros definitely seem to have a strong
desire to limit the number of supported kernel configs/builds.

> Tempting to add AUDIT_IOURING=default N, but won't work I guess

One of the nice things about audit is that it can give you a history
of what a user did on a system, which is very important for a number
of use cases.  If we selectively disable audit for certain subsystems
we create a blind spot in the audit log, and in the case of io_uring
this can be a very serious blind spot.  I fear that if we can't come
to some agreement here we will need to make io_uring and audit
mutually exclusive at build time which would be awful; forcing many
distros to either make a hard choice or carry out-of-tree patches.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 7/9] lsm,io_uring: add LSM hooks to io_uring
  2021-05-21 21:50 ` [RFC PATCH 7/9] lsm,io_uring: add LSM hooks to io_uring Paul Moore
@ 2021-05-26 14:48   ` Stefan Metzmacher
  2021-05-26 20:45     ` Paul Moore
  0 siblings, 1 reply; 71+ messages in thread
From: Stefan Metzmacher @ 2021-05-26 14:48 UTC (permalink / raw)
  To: Paul Moore, linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

Hi Paul,

>  #define CREATE_TRACE_POINTS
>  #include <trace/events/io_uring.h>
> @@ -6537,6 +6538,11 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
>  		if (!req->work.creds)
>  			return -EINVAL;
>  		get_cred(req->work.creds);
> +		ret = security_uring_override_creds(req->work.creds);
> +		if (ret) {
> +			put_cred(req->work.creds);
> +			return ret;
> +		}

Why are you calling this per requests, shouldn't this be done in
io_register_personality()?

I'm also not sure if this really gains anything as io_register_personality()
only captures the value of get_current_cred(), so the process already has changed to
the credentials (at least once for the io_uring_register(IORING_REGISTER_PERSONALITY)
call).

metze

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring
  2021-05-21 21:49 [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Paul Moore
                   ` (9 preceding siblings ...)
  2021-05-22  0:53 ` [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Tetsuo Handa
@ 2021-05-26 15:00 ` Jeff Moyer
  2021-05-26 18:49   ` Paul Moore
  10 siblings, 1 reply; 71+ messages in thread
From: Jeff Moyer @ 2021-05-26 15:00 UTC (permalink / raw)
  To: Paul Moore
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

Paul Moore <[email protected]> writes:

> Also, any pointers to easy-to-run io_uring tests would be helpful.  I
> am particularly interested in tests which make use of the personality
> option, share urings across process boundaries, and make use of the
> sqpoll functionality.

liburing contains a test suite:
  https://git.kernel.dk/cgit/liburing/

You can run it via 'make runtests'.

Cheers,
Jeff


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 14:38                 ` Paul Moore
@ 2021-05-26 15:11                   ` Steve Grubb
  2021-05-26 15:17                   ` Stefan Metzmacher
  1 sibling, 0 replies; 71+ messages in thread
From: Steve Grubb @ 2021-05-26 15:11 UTC (permalink / raw)
  To: Pavel Begunkov, linux-audit
  Cc: Jens Axboe, selinux, linux-security-module, linux-audit,
	Kumar Kartikeya Dwivedi, linux-fsdevel, io-uring, Alexander Viro,
	Paul Moore

On Wednesday, May 26, 2021 10:38:38 AM EDT Paul Moore wrote:
> > > We would need to check with the current security requirements (there
> > > are distro people on the linux-audit list that keep track of that
> > > stuff),

The requirements generally care about resource access. File open, connect, 
accept, etc. We don't care about read/write itself as that would flood the 
analysis.

> > > but looking at the opcodes right now my gut feeling is that
> > > most of the opcodes would be considered "security relevant" so
> > > selective auditing might not be that useful in practice. 

I'd say maybe a quarter to a third look interesting.

> > > It would
> > > definitely clutter the code and increase the chances that new opcodes
> > > would not be properly audited when they are merged.

There is that...

> > I'm curious, why it's enabled by many distros by default? Are there
> > use cases they use?
> 
> We've already talked about certain users and environments where audit
> is an important requirement, e.g. public sector, health care,
> financial institutions, etc.; without audit Linux wouldn't be an
> option for these users,

People that care about auditing are under regulatory mandates. They care more 
about the audit event than the performance. Imagine you have a system with 
some brand new medical discovery. You want to know anyone who accesses the 
information in case it gets leaked out. You don't care how slow the system 
gets - you simply *have* to know everyone who's looked at the documents.

-Steve



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 14:38                 ` Paul Moore
  2021-05-26 15:11                   ` Steve Grubb
@ 2021-05-26 15:17                   ` Stefan Metzmacher
  2021-05-26 15:49                     ` Richard Guy Briggs
  2021-05-26 15:49                     ` Victor Stewart
  1 sibling, 2 replies; 71+ messages in thread
From: Stefan Metzmacher @ 2021-05-26 15:17 UTC (permalink / raw)
  To: Paul Moore, Pavel Begunkov
  Cc: Jens Axboe, linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro


Am 26.05.21 um 16:38 schrieb Paul Moore:
> On Wed, May 26, 2021 at 6:19 AM Pavel Begunkov <[email protected]> wrote:
>> On 5/26/21 3:04 AM, Paul Moore wrote:
>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>> That said, audit is not for everyone, and we have build time and
>>>>> runtime options to help make life easier.  Beyond simply disabling
>>>>> audit at compile time a number of Linux distributions effectively
>>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>>> filter, for example:
>>>>>
>>>>>  % auditctl -a task,never
>>>>
>>>> As has been brought up, the issue we're facing is that distros have
>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>>> of people doing custom kernels. My question would then be how much
>>>> overhead the above will add, considering it's an entry/exit call per op.
>>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>>
>>> I commented on that case in my last email to Pavel, but I'll try to go
>>> over it again in a little more detail.
>>>
>>> As we discussed earlier in this thread, we can skip the req->opcode
>>> check before both the _entry and _exit calls, so we are left with just
>>> the bare audit calls in the io_uring code.  As the _entry and _exit
>>> functions are small, I've copied them and their supporting functions
>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>>> "task,never" case.
>>>
>>> +  static inline struct audit_context *audit_context(void)
>>> +  {
>>> +    return current->audit_context;
>>> +  }
>>>
>>> +  static inline bool audit_dummy_context(void)
>>> +  {
>>> +    void *p = audit_context();
>>> +    return !p || *(int *)p;
>>> +  }
>>>
>>> +  static inline void audit_uring_entry(u8 op)
>>> +  {
>>> +    if (unlikely(audit_enabled && audit_context()))
>>> +      __audit_uring_entry(op);
>>> +  }
>>
>> I'd rather agree that it's my cycle-picking. The case I care about
>> is CONFIG_AUDIT=y (because everybody enable it), and io_uring
>> tracing _not_ enabled at runtime. If enabled let them suffer
>> the overhead, it will probably dip down the performance
>>
>> So, for the case I care about it's two of
>>
>> if (unlikely(audit_enabled && current->audit_context))
>>
>> in the hot path. load-test-jump + current, so it will
>> be around 7x2 instructions. We can throw away audit_enabled
>> as you say systemd already enables it, that will give
>> 4x2 instructions including 2 conditional jumps.
> 
> We've basically got it down to the equivalent of two
> "current->audit_context != NULL" checks in the case where audit is
> built into the kernel but disabled at runtime, e.g. CONFIG_AUDIT=y and
> "task,never".  I'm at a loss for how we can lower the overhead any
> further, but I'm open to suggestions.
> 
>> That's not great at all. And that's why I brought up
>> the question about need of pre and post hooks and whether
>> can be combined. Would be just 4 instructions and that is
>> ok (ish).
> 
> As discussed previously in this thread that isn't really an option
> from an audit perspective.
> 
>>> We would need to check with the current security requirements (there
>>> are distro people on the linux-audit list that keep track of that
>>> stuff), but looking at the opcodes right now my gut feeling is that
>>> most of the opcodes would be considered "security relevant" so
>>> selective auditing might not be that useful in practice.  It would
>>> definitely clutter the code and increase the chances that new opcodes
>>> would not be properly audited when they are merged.
>>
>> I'm curious, why it's enabled by many distros by default? Are there
>> use cases they use?
> 
> We've already talked about certain users and environments where audit
> is an important requirement, e.g. public sector, health care,
> financial institutions, etc.; without audit Linux wouldn't be an
> option for these users, at least not without heavy modification,
> out-of-tree/ISV patches, etc.  I currently don't have any direct ties
> to any distros, "Enterprise" or otherwise, but in the past it has been
> my experience that distros much prefer to have a single kernel build
> to address the needs of all their users.  In the few cases I have seen
> where a second kernel build is supported it is usually for hardware
> enablement.  I'm sure there are other cases too, I just haven't seen
> them personally; the big distros definitely seem to have a strong
> desire to limit the number of supported kernel configs/builds.
> 
>> Tempting to add AUDIT_IOURING=default N, but won't work I guess
> 
> One of the nice things about audit is that it can give you a history
> of what a user did on a system, which is very important for a number
> of use cases.  If we selectively disable audit for certain subsystems
> we create a blind spot in the audit log, and in the case of io_uring
> this can be a very serious blind spot.  I fear that if we can't come
> to some agreement here we will need to make io_uring and audit
> mutually exclusive at build time which would be awful; forcing many
> distros to either make a hard choice or carry out-of-tree patches.

I'm wondering why it's not enough to have the native auditing just to happen.

E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes
already go via security_socket_{recvmsg,sendmsg,connect}()

IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls
and should also trigger audit_inode() and security_file_open().

So why is there anything special needed for io_uring (now that the native worker threads are used)?

Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall
would do? If so, I think that should just be fixed...

Additional LSM based restrictions could be hooked into the io_check_restriction() path
and setup at io_uring_setup() or early io_uring_register() time.

What do you think?

metze

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 15:17                   ` Stefan Metzmacher
@ 2021-05-26 15:49                     ` Richard Guy Briggs
  2021-05-26 17:22                       ` Jens Axboe
  2021-05-26 15:49                     ` Victor Stewart
  1 sibling, 1 reply; 71+ messages in thread
From: Richard Guy Briggs @ 2021-05-26 15:49 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Paul Moore, Pavel Begunkov, Jens Axboe, selinux,
	linux-security-module, linux-audit, Kumar Kartikeya Dwivedi,
	linux-fsdevel, io-uring, Alexander Viro

On 2021-05-26 17:17, Stefan Metzmacher wrote:
> 
> Am 26.05.21 um 16:38 schrieb Paul Moore:
> > On Wed, May 26, 2021 at 6:19 AM Pavel Begunkov <[email protected]> wrote:
> >> On 5/26/21 3:04 AM, Paul Moore wrote:
> >>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
> >>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>> That said, audit is not for everyone, and we have build time and
> >>>>> runtime options to help make life easier.  Beyond simply disabling
> >>>>> audit at compile time a number of Linux distributions effectively
> >>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>> filter, for example:
> >>>>>
> >>>>>  % auditctl -a task,never
> >>>>
> >>>> As has been brought up, the issue we're facing is that distros have
> >>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>> of people doing custom kernels. My question would then be how much
> >>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>> If auditctl is turned off, what is the expectation in turns of overhead?
> >>>
> >>> I commented on that case in my last email to Pavel, but I'll try to go
> >>> over it again in a little more detail.
> >>>
> >>> As we discussed earlier in this thread, we can skip the req->opcode
> >>> check before both the _entry and _exit calls, so we are left with just
> >>> the bare audit calls in the io_uring code.  As the _entry and _exit
> >>> functions are small, I've copied them and their supporting functions
> >>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>> "task,never" case.
> >>>
> >>> +  static inline struct audit_context *audit_context(void)
> >>> +  {
> >>> +    return current->audit_context;
> >>> +  }
> >>>
> >>> +  static inline bool audit_dummy_context(void)
> >>> +  {
> >>> +    void *p = audit_context();
> >>> +    return !p || *(int *)p;
> >>> +  }
> >>>
> >>> +  static inline void audit_uring_entry(u8 op)
> >>> +  {
> >>> +    if (unlikely(audit_enabled && audit_context()))
> >>> +      __audit_uring_entry(op);
> >>> +  }
> >>
> >> I'd rather agree that it's my cycle-picking. The case I care about
> >> is CONFIG_AUDIT=y (because everybody enable it), and io_uring
> >> tracing _not_ enabled at runtime. If enabled let them suffer
> >> the overhead, it will probably dip down the performance
> >>
> >> So, for the case I care about it's two of
> >>
> >> if (unlikely(audit_enabled && current->audit_context))
> >>
> >> in the hot path. load-test-jump + current, so it will
> >> be around 7x2 instructions. We can throw away audit_enabled
> >> as you say systemd already enables it, that will give
> >> 4x2 instructions including 2 conditional jumps.
> > 
> > We've basically got it down to the equivalent of two
> > "current->audit_context != NULL" checks in the case where audit is
> > built into the kernel but disabled at runtime, e.g. CONFIG_AUDIT=y and
> > "task,never".  I'm at a loss for how we can lower the overhead any
> > further, but I'm open to suggestions.
> > 
> >> That's not great at all. And that's why I brought up
> >> the question about need of pre and post hooks and whether
> >> can be combined. Would be just 4 instructions and that is
> >> ok (ish).
> > 
> > As discussed previously in this thread that isn't really an option
> > from an audit perspective.
> > 
> >>> We would need to check with the current security requirements (there
> >>> are distro people on the linux-audit list that keep track of that
> >>> stuff), but looking at the opcodes right now my gut feeling is that
> >>> most of the opcodes would be considered "security relevant" so
> >>> selective auditing might not be that useful in practice.  It would
> >>> definitely clutter the code and increase the chances that new opcodes
> >>> would not be properly audited when they are merged.
> >>
> >> I'm curious, why it's enabled by many distros by default? Are there
> >> use cases they use?
> > 
> > We've already talked about certain users and environments where audit
> > is an important requirement, e.g. public sector, health care,
> > financial institutions, etc.; without audit Linux wouldn't be an
> > option for these users, at least not without heavy modification,
> > out-of-tree/ISV patches, etc.  I currently don't have any direct ties
> > to any distros, "Enterprise" or otherwise, but in the past it has been
> > my experience that distros much prefer to have a single kernel build
> > to address the needs of all their users.  In the few cases I have seen
> > where a second kernel build is supported it is usually for hardware
> > enablement.  I'm sure there are other cases too, I just haven't seen
> > them personally; the big distros definitely seem to have a strong
> > desire to limit the number of supported kernel configs/builds.
> > 
> >> Tempting to add AUDIT_IOURING=default N, but won't work I guess
> > 
> > One of the nice things about audit is that it can give you a history
> > of what a user did on a system, which is very important for a number
> > of use cases.  If we selectively disable audit for certain subsystems
> > we create a blind spot in the audit log, and in the case of io_uring
> > this can be a very serious blind spot.  I fear that if we can't come
> > to some agreement here we will need to make io_uring and audit
> > mutually exclusive at build time which would be awful; forcing many
> > distros to either make a hard choice or carry out-of-tree patches.
> 
> I'm wondering why it's not enough to have the native auditing just to happen.

The audit context needs to be set up for each event.  This happens in
audit_syslog_entry and audit_syslog_exit.

> E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes
> already go via security_socket_{recvmsg,sendmsg,connect}()
> 
> IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls
> and should also trigger audit_inode() and security_file_open().

These are extra hooks to grab operation-specific (syscall) parameters.

> So why is there anything special needed for io_uring (now that the native worker threads are used)?

Because syscall has been bypassed by a memory-mapped work queue.

> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall
> would do? If so, I think that should just be fixed...

This is by design to speed it up.  This is what Paul's iouring entry and
exit hooks do.

> Additional LSM based restrictions could be hooked into the io_check_restriction() path
> and setup at io_uring_setup() or early io_uring_register() time.
> 
> What do you think?
> 
> metze

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 15:17                   ` Stefan Metzmacher
  2021-05-26 15:49                     ` Richard Guy Briggs
@ 2021-05-26 15:49                     ` Victor Stewart
  2021-05-26 16:38                       ` Casey Schaufler
  1 sibling, 1 reply; 71+ messages in thread
From: Victor Stewart @ 2021-05-26 15:49 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: Paul Moore, Pavel Begunkov, Jens Axboe, linux-security-module,
	selinux, linux-audit, io-uring, linux-fsdevel,
	Kumar Kartikeya Dwivedi, Alexander Viro

> I'm wondering why it's not enough to have the native auditing just to happen.
>
> E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes
> already go via security_socket_{recvmsg,sendmsg,connect}()
>
> IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls
> and should also trigger audit_inode() and security_file_open().
>
> So why is there anything special needed for io_uring (now that the native worker threads are used)?
>
> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall
> would do? If so, I think that should just be fixed...

stefan's points crossed my mind as well.

but assuming iouring buy-in is required, from a design perspective,
rather than inserting these audit conditionals in the hotpath,
wouldn't a layering model work better?
aka enabling auditing changes the function entry point into io_uring
and passes operations through an auditing layer, then back to the main
entry point. then there is no
cost to audit disabled code, and you just force audit to pay whatever
double processing cost that entails.

V

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 15:49                     ` Victor Stewart
@ 2021-05-26 16:38                       ` Casey Schaufler
  0 siblings, 0 replies; 71+ messages in thread
From: Casey Schaufler @ 2021-05-26 16:38 UTC (permalink / raw)
  To: Victor Stewart, Stefan Metzmacher
  Cc: Jens Axboe, selinux, io-uring, linux-security-module, linux-audit,
	Kumar Kartikeya Dwivedi, linux-fsdevel, Pavel Begunkov,
	Alexander Viro, Casey Schaufler

On 5/26/2021 8:49 AM, Victor Stewart wrote:
>> I'm wondering why it's not enough to have the native auditing just to happen.
>>
>> E.g. all (I have checked RECVMSG,SENDMSG,SEND and CONNECT) socket related io_uring opcodes
>> already go via security_socket_{recvmsg,sendmsg,connect}()
>>
>> IORING_OP_OPENAT* goes via do_filp_open() which is in common with the open[at[2]]() syscalls
>> and should also trigger audit_inode() and security_file_open().
>>
>> So why is there anything special needed for io_uring (now that the native worker threads are used)?
>>
>> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall
>> would do? If so, I think that should just be fixed...
> stefan's points crossed my mind as well.
>
> but assuming iouring buy-in is required, from a design perspective,
> rather than inserting these audit conditionals in the hotpath,
> wouldn't a layering model work better?
> aka enabling auditing changes the function entry point into io_uring
> and passes operations through an auditing layer, then back to the main
> entry point. then there is no
> cost to audit disabled code, and you just force audit to pay whatever
> double processing cost that entails.

There seems to be an assumption that the audit code isn't concerned
about processing cost. This is complete nonsense. While an audit system
has to be accurate and complete (hence encompassing io_uring) it also
has to be sufficiently performant for the system to be useful when it
is enabled. It would have been real handy had the audit and LSM
requirements been designed into io_uring. The performance implications
could have been addressed up front. Instead it all has to be retrofit.
io_uring, Audit and LSM need to perform well and the best way to ensure
that the combination works is cooperation between the developers. Any
response that includes the word "just" is unlikely to be helpful.


>
> V
>
> --
> Linux-audit mailing list
> [email protected]
> https://listman.redhat.com/mailman/listinfo/linux-audit
>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26  2:04             ` Paul Moore
  2021-05-26 10:19               ` Pavel Begunkov
@ 2021-05-26 17:15               ` Jens Axboe
  2021-05-26 17:31                 ` Jens Axboe
  1 sibling, 1 reply; 71+ messages in thread
From: Jens Axboe @ 2021-05-26 17:15 UTC (permalink / raw)
  To: Paul Moore
  Cc: Pavel Begunkov, linux-security-module, selinux, linux-audit,
	io-uring, linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On 5/25/21 8:04 PM, Paul Moore wrote:
> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>> That said, audit is not for everyone, and we have build time and
>>> runtime options to help make life easier.  Beyond simply disabling
>>> audit at compile time a number of Linux distributions effectively
>>> shortcut audit at runtime by adding a "never" rule to the audit
>>> filter, for example:
>>>
>>>  % auditctl -a task,never
>>
>> As has been brought up, the issue we're facing is that distros have
>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>> of people doing custom kernels. My question would then be how much
>> overhead the above will add, considering it's an entry/exit call per op.
>> If auditctl is turned off, what is the expectation in turns of overhead?
> 
> I commented on that case in my last email to Pavel, but I'll try to go
> over it again in a little more detail.
> 
> As we discussed earlier in this thread, we can skip the req->opcode
> check before both the _entry and _exit calls, so we are left with just
> the bare audit calls in the io_uring code.  As the _entry and _exit
> functions are small, I've copied them and their supporting functions
> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> "task,never" case.
> 
> +  static inline struct audit_context *audit_context(void)
> +  {
> +    return current->audit_context;
> +  }
> 
> +  static inline bool audit_dummy_context(void)
> +  {
> +    void *p = audit_context();
> +    return !p || *(int *)p;
> +  }
> 
> +  static inline void audit_uring_entry(u8 op)
> +  {
> +    if (unlikely(audit_enabled && audit_context()))
> +      __audit_uring_entry(op);
> +  }
> 
> We have one if statement where the conditional checks on two
> individual conditions.  The first (audit_enabled) is simply a check to
> see if anyone has "turned on" auditing at runtime; historically this
> worked rather well, and still does in a number of places, but ever
> since systemd has taken to forcing audit on regardless of the admin's
> audit configuration it is less useful.  The second (audit_context())
> is a check to see if an audit_context has been allocated for the
> current task.  In the case of "task,never" current->audit_context will
> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> will never be called.
> 
> Worst case here is checking the value of audit_enabled and
> current->audit_context.  Depending on which you think is more likely
> we can change the order of the check so that the
> current->audit_context check is first if you feel that is more likely
> to be NULL than audit_enabled is to be false (it may be that way now).
> 
> +  static inline void audit_uring_exit(int success, long code)
> +  {
> +    if (unlikely(!audit_dummy_context()))
> +      __audit_uring_exit(success, code);
> +  }
> 
> The exit call is very similar to the entry call, but in the
> "task,never" case it is very simple as the first check to be performed
> is the current->audit_context check which we know to be NULL.  The
> __audit_uring_exit() slowpath will never be called.

I actually ran some numbers this morning. The test base is 5.13+, and
CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
test and the test with this series applied. I used your git branch as of
this morning.

The test case is my usual peak perf test, which is random reads at
QD=128 and using polled IO. It's a single core test, not threaded. I ran
two different tests - one was having a thread just do the IO, the other
is using SQPOLL to do the IO for us. The device is capable than more
IOPS than a single core can deliver, so we're CPU limited in this test.
Hence it's a good test case as it does actual work, and shows software
overhead quite nicely. Runs are very stable (less than 0.5% difference
between runs on the same base), yet I did average 4 runs.

Kernel		SQPOLL		IOPS		Perf diff
---------------------------------------------------------
5.13		0		3029872		0.0%
5.13		1		3031056		0.0%
5.13 + audit	0		2894160		-4.5%
5.13 + audit	1		2886168		-4.8%

That's an immediate drop in perf of almost 5%. Looking at a quick
profile of it (nothing fancy, just checking for 'audit' in the profile)
shows this:

+    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
+    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
     0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
     0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit

Note that this is with _no_ rules!

>> aio never had any audit logging as far as I can tell. I think it'd make
>> a lot more sense to selectively enable audit logging only for opcodes
>> that we care about. File open/create/unlink/mkdir etc, that kind of
>> thing. File level operations that people would care about logging. Would
>> they care about logging a buffer registration or a polled read from a
>> device/file? I highly doubt it, and we don't do that for alternative
>> methods either. Doesn't really make sense for a lot of the other
>> operations, imho.
> 
> We would need to check with the current security requirements (there
> are distro people on the linux-audit list that keep track of that
> stuff), but looking at the opcodes right now my gut feeling is that
> most of the opcodes would be considered "security relevant" so
> selective auditing might not be that useful in practice.  It would
> definitely clutter the code and increase the chances that new opcodes
> would not be properly audited when they are merged.

We don't audit read/write from aio, as mentioned. In the past two
decades, I take it that hasn't been a concern? I agree that some opcodes
should _definitely_ be audited. Things like opening a file, closing a
file, removing/creating a file, mount, etc. But normal read/write, I
think that's just utter noise and not useful at all. Auditing on a
per-opcode basis is trivial, io_uring already has provisions for
flagging opcode requirements and this would just be another one.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 15:49                     ` Richard Guy Briggs
@ 2021-05-26 17:22                       ` Jens Axboe
  2021-05-27 17:27                         ` Richard Guy Briggs
  0 siblings, 1 reply; 71+ messages in thread
From: Jens Axboe @ 2021-05-26 17:22 UTC (permalink / raw)
  To: Richard Guy Briggs, Stefan Metzmacher
  Cc: Paul Moore, Pavel Begunkov, selinux, linux-security-module,
	linux-audit, Kumar Kartikeya Dwivedi, linux-fsdevel, io-uring,
	Alexander Viro

On 5/26/21 9:49 AM, Richard Guy Briggs wrote:
>> So why is there anything special needed for io_uring (now that the
>> native worker threads are used)?
> 
> Because syscall has been bypassed by a memory-mapped work queue.

I don't follow this one at all, that's just the delivery mechanism if
you choose to use SQPOLL. If you do, then a thread sibling of the
original task does the actual system call. There's no magic involved
there, and the tasks are related.

So care to expand on that?

>> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall
>> would do? If so, I think that should just be fixed...
> 
> This is by design to speed it up.  This is what Paul's iouring entry and
> exit hooks do.

As far as I can tell, we're doing double logging at that point, if the
syscall helper does the audit as well. We'll get something logging the
io_uring opcode (eg IORING_OP_OPENAT2), then log again when we call the
fs helper. That's _assuming_ that we always hit the logging part when we
call into the vfs, but that's something that can be updated to be true
and kept an eye on for future additions.

Why is the double logging useful? It only tells you that the invocation
was via io_uring as the delivery mechanism rather than the usual system
call, but the effect is the same - the file is opened, for example.

I feel like I'm missing something here, or the other side is. Or both!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 17:15               ` Jens Axboe
@ 2021-05-26 17:31                 ` Jens Axboe
  2021-05-26 17:54                   ` Jens Axboe
  0 siblings, 1 reply; 71+ messages in thread
From: Jens Axboe @ 2021-05-26 17:31 UTC (permalink / raw)
  To: Paul Moore
  Cc: Pavel Begunkov, linux-security-module, selinux, linux-audit,
	io-uring, linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On 5/26/21 11:15 AM, Jens Axboe wrote:
> On 5/25/21 8:04 PM, Paul Moore wrote:
>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>> That said, audit is not for everyone, and we have build time and
>>>> runtime options to help make life easier.  Beyond simply disabling
>>>> audit at compile time a number of Linux distributions effectively
>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>> filter, for example:
>>>>
>>>>  % auditctl -a task,never
>>>
>>> As has been brought up, the issue we're facing is that distros have
>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>> of people doing custom kernels. My question would then be how much
>>> overhead the above will add, considering it's an entry/exit call per op.
>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>
>> I commented on that case in my last email to Pavel, but I'll try to go
>> over it again in a little more detail.
>>
>> As we discussed earlier in this thread, we can skip the req->opcode
>> check before both the _entry and _exit calls, so we are left with just
>> the bare audit calls in the io_uring code.  As the _entry and _exit
>> functions are small, I've copied them and their supporting functions
>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>> "task,never" case.
>>
>> +  static inline struct audit_context *audit_context(void)
>> +  {
>> +    return current->audit_context;
>> +  }
>>
>> +  static inline bool audit_dummy_context(void)
>> +  {
>> +    void *p = audit_context();
>> +    return !p || *(int *)p;
>> +  }
>>
>> +  static inline void audit_uring_entry(u8 op)
>> +  {
>> +    if (unlikely(audit_enabled && audit_context()))
>> +      __audit_uring_entry(op);
>> +  }
>>
>> We have one if statement where the conditional checks on two
>> individual conditions.  The first (audit_enabled) is simply a check to
>> see if anyone has "turned on" auditing at runtime; historically this
>> worked rather well, and still does in a number of places, but ever
>> since systemd has taken to forcing audit on regardless of the admin's
>> audit configuration it is less useful.  The second (audit_context())
>> is a check to see if an audit_context has been allocated for the
>> current task.  In the case of "task,never" current->audit_context will
>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
>> will never be called.
>>
>> Worst case here is checking the value of audit_enabled and
>> current->audit_context.  Depending on which you think is more likely
>> we can change the order of the check so that the
>> current->audit_context check is first if you feel that is more likely
>> to be NULL than audit_enabled is to be false (it may be that way now).
>>
>> +  static inline void audit_uring_exit(int success, long code)
>> +  {
>> +    if (unlikely(!audit_dummy_context()))
>> +      __audit_uring_exit(success, code);
>> +  }
>>
>> The exit call is very similar to the entry call, but in the
>> "task,never" case it is very simple as the first check to be performed
>> is the current->audit_context check which we know to be NULL.  The
>> __audit_uring_exit() slowpath will never be called.
> 
> I actually ran some numbers this morning. The test base is 5.13+, and
> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> test and the test with this series applied. I used your git branch as of
> this morning.
> 
> The test case is my usual peak perf test, which is random reads at
> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> two different tests - one was having a thread just do the IO, the other
> is using SQPOLL to do the IO for us. The device is capable than more
> IOPS than a single core can deliver, so we're CPU limited in this test.
> Hence it's a good test case as it does actual work, and shows software
> overhead quite nicely. Runs are very stable (less than 0.5% difference
> between runs on the same base), yet I did average 4 runs.
> 
> Kernel		SQPOLL		IOPS		Perf diff
> ---------------------------------------------------------
> 5.13		0		3029872		0.0%
> 5.13		1		3031056		0.0%
> 5.13 + audit	0		2894160		-4.5%
> 5.13 + audit	1		2886168		-4.8%
> 
> That's an immediate drop in perf of almost 5%. Looking at a quick
> profile of it (nothing fancy, just checking for 'audit' in the profile)
> shows this:
> 
> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> 
> Note that this is with _no_ rules!

io_uring also supports a NOP command, which basically just measures
reqs/sec through the interface. Ran that as well:

Kernel		SQPOLL		IOPS		Perf diff
---------------------------------------------------------
5.13		0		31.05M		0.0%
5.13 + audit	0		25.31M		-18.5%

and profile for the latter includes:

+    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
+    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
     0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
     0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 17:31                 ` Jens Axboe
@ 2021-05-26 17:54                   ` Jens Axboe
  2021-05-26 18:01                     ` Jens Axboe
  2021-05-26 18:38                     ` Paul Moore
  0 siblings, 2 replies; 71+ messages in thread
From: Jens Axboe @ 2021-05-26 17:54 UTC (permalink / raw)
  To: Paul Moore
  Cc: Pavel Begunkov, linux-security-module, selinux, linux-audit,
	io-uring, linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On 5/26/21 11:31 AM, Jens Axboe wrote:
> On 5/26/21 11:15 AM, Jens Axboe wrote:
>> On 5/25/21 8:04 PM, Paul Moore wrote:
>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>> That said, audit is not for everyone, and we have build time and
>>>>> runtime options to help make life easier.  Beyond simply disabling
>>>>> audit at compile time a number of Linux distributions effectively
>>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>>> filter, for example:
>>>>>
>>>>>  % auditctl -a task,never
>>>>
>>>> As has been brought up, the issue we're facing is that distros have
>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>>> of people doing custom kernels. My question would then be how much
>>>> overhead the above will add, considering it's an entry/exit call per op.
>>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>>
>>> I commented on that case in my last email to Pavel, but I'll try to go
>>> over it again in a little more detail.
>>>
>>> As we discussed earlier in this thread, we can skip the req->opcode
>>> check before both the _entry and _exit calls, so we are left with just
>>> the bare audit calls in the io_uring code.  As the _entry and _exit
>>> functions are small, I've copied them and their supporting functions
>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>>> "task,never" case.
>>>
>>> +  static inline struct audit_context *audit_context(void)
>>> +  {
>>> +    return current->audit_context;
>>> +  }
>>>
>>> +  static inline bool audit_dummy_context(void)
>>> +  {
>>> +    void *p = audit_context();
>>> +    return !p || *(int *)p;
>>> +  }
>>>
>>> +  static inline void audit_uring_entry(u8 op)
>>> +  {
>>> +    if (unlikely(audit_enabled && audit_context()))
>>> +      __audit_uring_entry(op);
>>> +  }
>>>
>>> We have one if statement where the conditional checks on two
>>> individual conditions.  The first (audit_enabled) is simply a check to
>>> see if anyone has "turned on" auditing at runtime; historically this
>>> worked rather well, and still does in a number of places, but ever
>>> since systemd has taken to forcing audit on regardless of the admin's
>>> audit configuration it is less useful.  The second (audit_context())
>>> is a check to see if an audit_context has been allocated for the
>>> current task.  In the case of "task,never" current->audit_context will
>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
>>> will never be called.
>>>
>>> Worst case here is checking the value of audit_enabled and
>>> current->audit_context.  Depending on which you think is more likely
>>> we can change the order of the check so that the
>>> current->audit_context check is first if you feel that is more likely
>>> to be NULL than audit_enabled is to be false (it may be that way now).
>>>
>>> +  static inline void audit_uring_exit(int success, long code)
>>> +  {
>>> +    if (unlikely(!audit_dummy_context()))
>>> +      __audit_uring_exit(success, code);
>>> +  }
>>>
>>> The exit call is very similar to the entry call, but in the
>>> "task,never" case it is very simple as the first check to be performed
>>> is the current->audit_context check which we know to be NULL.  The
>>> __audit_uring_exit() slowpath will never be called.
>>
>> I actually ran some numbers this morning. The test base is 5.13+, and
>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
>> test and the test with this series applied. I used your git branch as of
>> this morning.
>>
>> The test case is my usual peak perf test, which is random reads at
>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
>> two different tests - one was having a thread just do the IO, the other
>> is using SQPOLL to do the IO for us. The device is capable than more
>> IOPS than a single core can deliver, so we're CPU limited in this test.
>> Hence it's a good test case as it does actual work, and shows software
>> overhead quite nicely. Runs are very stable (less than 0.5% difference
>> between runs on the same base), yet I did average 4 runs.
>>
>> Kernel		SQPOLL		IOPS		Perf diff
>> ---------------------------------------------------------
>> 5.13		0		3029872		0.0%
>> 5.13		1		3031056		0.0%
>> 5.13 + audit	0		2894160		-4.5%
>> 5.13 + audit	1		2886168		-4.8%
>>
>> That's an immediate drop in perf of almost 5%. Looking at a quick
>> profile of it (nothing fancy, just checking for 'audit' in the profile)
>> shows this:
>>
>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>>
>> Note that this is with _no_ rules!
> 
> io_uring also supports a NOP command, which basically just measures
> reqs/sec through the interface. Ran that as well:
> 
> Kernel		SQPOLL		IOPS		Perf diff
> ---------------------------------------------------------
> 5.13		0		31.05M		0.0%
> 5.13 + audit	0		25.31M		-18.5%
> 
> and profile for the latter includes:
> 
> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit

As Pavel correctly pointed it, looks like auditing is enabled. And
indeed it was! Hence the above numbers is without having turned off
auditing. Running the NOPs after having turned off audit, we get 30.6M
IOPS, which is down about 1.5% from the baseline. The results for the
polled random read test above did _not_ change from this, they are still
down the same amount.

Note, and I should have included this in the first email, this is not
any kind of argument for or against audit logging. It's purely meant to
be a set of numbers that show how the current series impacts
performance.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 17:54                   ` Jens Axboe
@ 2021-05-26 18:01                     ` Jens Axboe
  2021-05-26 18:44                       ` Paul Moore
  2021-05-26 18:38                     ` Paul Moore
  1 sibling, 1 reply; 71+ messages in thread
From: Jens Axboe @ 2021-05-26 18:01 UTC (permalink / raw)
  To: Paul Moore
  Cc: Pavel Begunkov, linux-security-module, selinux, linux-audit,
	io-uring, linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On 5/26/21 11:54 AM, Jens Axboe wrote:
> On 5/26/21 11:31 AM, Jens Axboe wrote:
>> On 5/26/21 11:15 AM, Jens Axboe wrote:
>>> On 5/25/21 8:04 PM, Paul Moore wrote:
>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>>> That said, audit is not for everyone, and we have build time and
>>>>>> runtime options to help make life easier.  Beyond simply disabling
>>>>>> audit at compile time a number of Linux distributions effectively
>>>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>>>> filter, for example:
>>>>>>
>>>>>>  % auditctl -a task,never
>>>>>
>>>>> As has been brought up, the issue we're facing is that distros have
>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>>>> of people doing custom kernels. My question would then be how much
>>>>> overhead the above will add, considering it's an entry/exit call per op.
>>>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>>>
>>>> I commented on that case in my last email to Pavel, but I'll try to go
>>>> over it again in a little more detail.
>>>>
>>>> As we discussed earlier in this thread, we can skip the req->opcode
>>>> check before both the _entry and _exit calls, so we are left with just
>>>> the bare audit calls in the io_uring code.  As the _entry and _exit
>>>> functions are small, I've copied them and their supporting functions
>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>>>> "task,never" case.
>>>>
>>>> +  static inline struct audit_context *audit_context(void)
>>>> +  {
>>>> +    return current->audit_context;
>>>> +  }
>>>>
>>>> +  static inline bool audit_dummy_context(void)
>>>> +  {
>>>> +    void *p = audit_context();
>>>> +    return !p || *(int *)p;
>>>> +  }
>>>>
>>>> +  static inline void audit_uring_entry(u8 op)
>>>> +  {
>>>> +    if (unlikely(audit_enabled && audit_context()))
>>>> +      __audit_uring_entry(op);
>>>> +  }
>>>>
>>>> We have one if statement where the conditional checks on two
>>>> individual conditions.  The first (audit_enabled) is simply a check to
>>>> see if anyone has "turned on" auditing at runtime; historically this
>>>> worked rather well, and still does in a number of places, but ever
>>>> since systemd has taken to forcing audit on regardless of the admin's
>>>> audit configuration it is less useful.  The second (audit_context())
>>>> is a check to see if an audit_context has been allocated for the
>>>> current task.  In the case of "task,never" current->audit_context will
>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
>>>> will never be called.
>>>>
>>>> Worst case here is checking the value of audit_enabled and
>>>> current->audit_context.  Depending on which you think is more likely
>>>> we can change the order of the check so that the
>>>> current->audit_context check is first if you feel that is more likely
>>>> to be NULL than audit_enabled is to be false (it may be that way now).
>>>>
>>>> +  static inline void audit_uring_exit(int success, long code)
>>>> +  {
>>>> +    if (unlikely(!audit_dummy_context()))
>>>> +      __audit_uring_exit(success, code);
>>>> +  }
>>>>
>>>> The exit call is very similar to the entry call, but in the
>>>> "task,never" case it is very simple as the first check to be performed
>>>> is the current->audit_context check which we know to be NULL.  The
>>>> __audit_uring_exit() slowpath will never be called.
>>>
>>> I actually ran some numbers this morning. The test base is 5.13+, and
>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
>>> test and the test with this series applied. I used your git branch as of
>>> this morning.
>>>
>>> The test case is my usual peak perf test, which is random reads at
>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
>>> two different tests - one was having a thread just do the IO, the other
>>> is using SQPOLL to do the IO for us. The device is capable than more
>>> IOPS than a single core can deliver, so we're CPU limited in this test.
>>> Hence it's a good test case as it does actual work, and shows software
>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
>>> between runs on the same base), yet I did average 4 runs.
>>>
>>> Kernel		SQPOLL		IOPS		Perf diff
>>> ---------------------------------------------------------
>>> 5.13		0		3029872		0.0%
>>> 5.13		1		3031056		0.0%
>>> 5.13 + audit	0		2894160		-4.5%
>>> 5.13 + audit	1		2886168		-4.8%
>>>
>>> That's an immediate drop in perf of almost 5%. Looking at a quick
>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
>>> shows this:
>>>
>>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>>>
>>> Note that this is with _no_ rules!
>>
>> io_uring also supports a NOP command, which basically just measures
>> reqs/sec through the interface. Ran that as well:
>>
>> Kernel		SQPOLL		IOPS		Perf diff
>> ---------------------------------------------------------
>> 5.13		0		31.05M		0.0%
>> 5.13 + audit	0		25.31M		-18.5%
>>
>> and profile for the latter includes:
>>
>> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> 
> As Pavel correctly pointed it, looks like auditing is enabled. And
> indeed it was! Hence the above numbers is without having turned off
> auditing. Running the NOPs after having turned off audit, we get 30.6M
> IOPS, which is down about 1.5% from the baseline. The results for the
> polled random read test above did _not_ change from this, they are still
> down the same amount.
> 
> Note, and I should have included this in the first email, this is not
> any kind of argument for or against audit logging. It's purely meant to
> be a set of numbers that show how the current series impacts
> performance.

And finally, just checking if we make it optional per opcode if we see
any real impact, and the answer is no. Using the below patch which
effectively bypasses audit calls unless the opcode has flagged the need
to do so, I cannot measure any difference in perf (as expected).

To turn this into something useful, my suggestion as a viable path
forward would be:

1) Use something like the below patch and flag request types that we
   want to do audit logging for.

2) As Pavel suggested, eliminate the need for having both and entry/exit
   hook, turning it into just one. That effectively cuts the number of
   checks and calls in half.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index aa065808ddcf..2c7c913b786b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -885,6 +885,8 @@ struct io_op_def {
 	unsigned		needs_async_setup : 1;
 	/* should block plug */
 	unsigned		plug : 1;
+	/* should audit */
+	unsigned		audit : 1;
 	/* size of async data needed, if any */
 	unsigned short		async_size;
 };
@@ -6122,7 +6124,7 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	if (req->work.creds && req->work.creds != current_cred())
 		creds = override_creds(req->work.creds);
 
-	if (req->opcode < IORING_OP_LAST)
+	if (io_op_defs[req->opcode].audit)
 		audit_uring_entry(req->opcode);
 
 	switch (req->opcode) {
@@ -6231,7 +6233,7 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 		break;
 	}
 
-	if (req->opcode < IORING_OP_LAST)
+	if (io_op_defs[req->opcode].audit)
 		audit_uring_exit(!ret, ret);
 
 	if (creds)

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 17:54                   ` Jens Axboe
  2021-05-26 18:01                     ` Jens Axboe
@ 2021-05-26 18:38                     ` Paul Moore
  1 sibling, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-26 18:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Pavel Begunkov, linux-security-module, selinux, linux-audit,
	io-uring, linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On Wed, May 26, 2021 at 1:54 PM Jens Axboe <[email protected]> wrote:
> On 5/26/21 11:31 AM, Jens Axboe wrote:
> > On 5/26/21 11:15 AM, Jens Axboe wrote:
> >> On 5/25/21 8:04 PM, Paul Moore wrote:
> >>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
> >>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>> That said, audit is not for everyone, and we have build time and
> >>>>> runtime options to help make life easier.  Beyond simply disabling
> >>>>> audit at compile time a number of Linux distributions effectively
> >>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>> filter, for example:
> >>>>>
> >>>>>  % auditctl -a task,never
> >>>>
> >>>> As has been brought up, the issue we're facing is that distros have
> >>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>> of people doing custom kernels. My question would then be how much
> >>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>> If auditctl is turned off, what is the expectation in turns of overhead?
> >>>
> >>> I commented on that case in my last email to Pavel, but I'll try to go
> >>> over it again in a little more detail.
> >>>
> >>> As we discussed earlier in this thread, we can skip the req->opcode
> >>> check before both the _entry and _exit calls, so we are left with just
> >>> the bare audit calls in the io_uring code.  As the _entry and _exit
> >>> functions are small, I've copied them and their supporting functions
> >>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>> "task,never" case.
> >>>
> >>> +  static inline struct audit_context *audit_context(void)
> >>> +  {
> >>> +    return current->audit_context;
> >>> +  }
> >>>
> >>> +  static inline bool audit_dummy_context(void)
> >>> +  {
> >>> +    void *p = audit_context();
> >>> +    return !p || *(int *)p;
> >>> +  }
> >>>
> >>> +  static inline void audit_uring_entry(u8 op)
> >>> +  {
> >>> +    if (unlikely(audit_enabled && audit_context()))
> >>> +      __audit_uring_entry(op);
> >>> +  }
> >>>
> >>> We have one if statement where the conditional checks on two
> >>> individual conditions.  The first (audit_enabled) is simply a check to
> >>> see if anyone has "turned on" auditing at runtime; historically this
> >>> worked rather well, and still does in a number of places, but ever
> >>> since systemd has taken to forcing audit on regardless of the admin's
> >>> audit configuration it is less useful.  The second (audit_context())
> >>> is a check to see if an audit_context has been allocated for the
> >>> current task.  In the case of "task,never" current->audit_context will
> >>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> >>> will never be called.
> >>>
> >>> Worst case here is checking the value of audit_enabled and
> >>> current->audit_context.  Depending on which you think is more likely
> >>> we can change the order of the check so that the
> >>> current->audit_context check is first if you feel that is more likely
> >>> to be NULL than audit_enabled is to be false (it may be that way now).
> >>>
> >>> +  static inline void audit_uring_exit(int success, long code)
> >>> +  {
> >>> +    if (unlikely(!audit_dummy_context()))
> >>> +      __audit_uring_exit(success, code);
> >>> +  }
> >>>
> >>> The exit call is very similar to the entry call, but in the
> >>> "task,never" case it is very simple as the first check to be performed
> >>> is the current->audit_context check which we know to be NULL.  The
> >>> __audit_uring_exit() slowpath will never be called.
> >>
> >> I actually ran some numbers this morning. The test base is 5.13+, and
> >> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> >> test and the test with this series applied. I used your git branch as of
> >> this morning.
> >>
> >> The test case is my usual peak perf test, which is random reads at
> >> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> >> two different tests - one was having a thread just do the IO, the other
> >> is using SQPOLL to do the IO for us. The device is capable than more
> >> IOPS than a single core can deliver, so we're CPU limited in this test.
> >> Hence it's a good test case as it does actual work, and shows software
> >> overhead quite nicely. Runs are very stable (less than 0.5% difference
> >> between runs on the same base), yet I did average 4 runs.
> >>
> >> Kernel               SQPOLL          IOPS            Perf diff
> >> ---------------------------------------------------------
> >> 5.13         0               3029872         0.0%
> >> 5.13         1               3031056         0.0%
> >> 5.13 + audit 0               2894160         -4.5%
> >> 5.13 + audit 1               2886168         -4.8%
> >>
> >> That's an immediate drop in perf of almost 5%. Looking at a quick
> >> profile of it (nothing fancy, just checking for 'audit' in the profile)
> >> shows this:
> >>
> >> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >>
> >> Note that this is with _no_ rules!
> >
> > io_uring also supports a NOP command, which basically just measures
> > reqs/sec through the interface. Ran that as well:
> >
> > Kernel                SQPOLL          IOPS            Perf diff
> > ---------------------------------------------------------
> > 5.13          0               31.05M          0.0%
> > 5.13 + audit  0               25.31M          -18.5%
> >
> > and profile for the latter includes:
> >
> > +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> > +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>
> As Pavel correctly pointed it, looks like auditing is enabled. And
> indeed it was! Hence the above numbers is without having turned off
> auditing. Running the NOPs after having turned off audit, we get 30.6M
> IOPS, which is down about 1.5% from the baseline. The results for the
> polled random read test above did _not_ change from this, they are still
> down the same amount.
>
> Note, and I should have included this in the first email, this is not
> any kind of argument for or against audit logging. It's purely meant to
> be a set of numbers that show how the current series impacts
> performance.

Thanks for running some tests Jens, unfortunately the git tree hadn't
been updated to reflect the improved audit_uring_entry() and
audit_uring_exit() functions as we were still discussing things and I
thought there still might be some changes.  I just now updated the
branch with the latest code so if you have the cycles (ha!) to run
another perf test I think those numbers would be more interesting.
For example, I don't believe you should see __audit_uring_entry() or
__audit_uring_exit() at all if you have the audit "task,never" rule
loaded; you will see audit_uring_entry() and audit_uring_exit() but as
we discussed previously those should just be a single
"current->audit_context != NULL" check.

Also, I want to pull back a bit as I think there is confusion about
how audit works and why these changes are necessary.  As everyone
likely knows, there are audit calls sprinkled throughout the kernel
code, e.g. audit_log_format() and similar.  In addition to these calls
that most are aware of, there are setup/teardown audit calls that run
at task creation and destruction as well as at syscall entry and exit.
It is these lesser known calls that are responsible for the filtering,
setup, and general management of the audit context state in the
system; they also handle generation of some audit records such as
SYSCALL, PATH, etc. based on information that is recorded by audit
calls inserted into other places in the kernel.  The
audit_alloc_kernel() and audit_free() calls we are adding to the
io_uring/io-wq code are intended to do similar things to the existing
audit task creation/destruction hooks, but for the io_uring/io-wq
kernel threads.  Similarly the audit_uring_entry() and
audit_uring_exit() calls are intended to act as replacements for the
syscall entry and exit code.  If we want the existing audit calls in
the kernel to work properly, we need these setup/teardown functions.

Hopefully that makes things a bit more clear regarding these calls in
the io_uring/io-wq code.

Another point I wanted to address is the "double audit" (!!!) that has
been mentioned recently in this thread.  Don't get too excited, this
isn't quite what you think it is, it is a side effect of how io_uring
can dispatch certain operations.  As I think the io_uring folks
already know, io_uring can process I/O ops in several different
contexts; one of which is after a syscall completes but before the
kernel returns to userspace.  In this particular case things can get
rather interesting from an audit perspective as we need to handle both
the syscall audit records *and* the io_uring operation records.  If
you look closer at the audit code in this patchset you'll see some of
the fun stuff we need to do to make sure this case is handled
correctly.  If you happen to see a code path that takes you through
audit_syscall_entry() + <syscall_stuff> + audit_uring_entry() +
<io_uring_stuff> + audit_uring_exit() + audit_syscall_exit() rest
assured that is correct :)

Of course there is the normal audit_uring_entry() and
audit_uring_exit() without the audit syscall hooks in other code
paths, e.g. SQPOLL, but those are less dramatic than the "OMG, double
audit!!!" ;)

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 18:01                     ` Jens Axboe
@ 2021-05-26 18:44                       ` Paul Moore
  2021-05-26 18:57                         ` Pavel Begunkov
  2021-05-26 19:44                         ` Jens Axboe
  0 siblings, 2 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-26 18:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Pavel Begunkov, linux-security-module, selinux, linux-audit,
	io-uring, linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On Wed, May 26, 2021 at 2:01 PM Jens Axboe <[email protected]> wrote:
> On 5/26/21 11:54 AM, Jens Axboe wrote:
> > On 5/26/21 11:31 AM, Jens Axboe wrote:
> >> On 5/26/21 11:15 AM, Jens Axboe wrote:
> >>> On 5/25/21 8:04 PM, Paul Moore wrote:
> >>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
> >>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>>> That said, audit is not for everyone, and we have build time and
> >>>>>> runtime options to help make life easier.  Beyond simply disabling
> >>>>>> audit at compile time a number of Linux distributions effectively
> >>>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>>> filter, for example:
> >>>>>>
> >>>>>>  % auditctl -a task,never
> >>>>>
> >>>>> As has been brought up, the issue we're facing is that distros have
> >>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>>> of people doing custom kernels. My question would then be how much
> >>>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>>> If auditctl is turned off, what is the expectation in turns of overhead?
> >>>>
> >>>> I commented on that case in my last email to Pavel, but I'll try to go
> >>>> over it again in a little more detail.
> >>>>
> >>>> As we discussed earlier in this thread, we can skip the req->opcode
> >>>> check before both the _entry and _exit calls, so we are left with just
> >>>> the bare audit calls in the io_uring code.  As the _entry and _exit
> >>>> functions are small, I've copied them and their supporting functions
> >>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>>> "task,never" case.
> >>>>
> >>>> +  static inline struct audit_context *audit_context(void)
> >>>> +  {
> >>>> +    return current->audit_context;
> >>>> +  }
> >>>>
> >>>> +  static inline bool audit_dummy_context(void)
> >>>> +  {
> >>>> +    void *p = audit_context();
> >>>> +    return !p || *(int *)p;
> >>>> +  }
> >>>>
> >>>> +  static inline void audit_uring_entry(u8 op)
> >>>> +  {
> >>>> +    if (unlikely(audit_enabled && audit_context()))
> >>>> +      __audit_uring_entry(op);
> >>>> +  }
> >>>>
> >>>> We have one if statement where the conditional checks on two
> >>>> individual conditions.  The first (audit_enabled) is simply a check to
> >>>> see if anyone has "turned on" auditing at runtime; historically this
> >>>> worked rather well, and still does in a number of places, but ever
> >>>> since systemd has taken to forcing audit on regardless of the admin's
> >>>> audit configuration it is less useful.  The second (audit_context())
> >>>> is a check to see if an audit_context has been allocated for the
> >>>> current task.  In the case of "task,never" current->audit_context will
> >>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> >>>> will never be called.
> >>>>
> >>>> Worst case here is checking the value of audit_enabled and
> >>>> current->audit_context.  Depending on which you think is more likely
> >>>> we can change the order of the check so that the
> >>>> current->audit_context check is first if you feel that is more likely
> >>>> to be NULL than audit_enabled is to be false (it may be that way now).
> >>>>
> >>>> +  static inline void audit_uring_exit(int success, long code)
> >>>> +  {
> >>>> +    if (unlikely(!audit_dummy_context()))
> >>>> +      __audit_uring_exit(success, code);
> >>>> +  }
> >>>>
> >>>> The exit call is very similar to the entry call, but in the
> >>>> "task,never" case it is very simple as the first check to be performed
> >>>> is the current->audit_context check which we know to be NULL.  The
> >>>> __audit_uring_exit() slowpath will never be called.
> >>>
> >>> I actually ran some numbers this morning. The test base is 5.13+, and
> >>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> >>> test and the test with this series applied. I used your git branch as of
> >>> this morning.
> >>>
> >>> The test case is my usual peak perf test, which is random reads at
> >>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> >>> two different tests - one was having a thread just do the IO, the other
> >>> is using SQPOLL to do the IO for us. The device is capable than more
> >>> IOPS than a single core can deliver, so we're CPU limited in this test.
> >>> Hence it's a good test case as it does actual work, and shows software
> >>> overhead quite nicely. Runs are very stable (less than 0.5% difference
> >>> between runs on the same base), yet I did average 4 runs.
> >>>
> >>> Kernel              SQPOLL          IOPS            Perf diff
> >>> ---------------------------------------------------------
> >>> 5.13                0               3029872         0.0%
> >>> 5.13                1               3031056         0.0%
> >>> 5.13 + audit        0               2894160         -4.5%
> >>> 5.13 + audit        1               2886168         -4.8%
> >>>
> >>> That's an immediate drop in perf of almost 5%. Looking at a quick
> >>> profile of it (nothing fancy, just checking for 'audit' in the profile)
> >>> shows this:
> >>>
> >>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >>>
> >>> Note that this is with _no_ rules!
> >>
> >> io_uring also supports a NOP command, which basically just measures
> >> reqs/sec through the interface. Ran that as well:
> >>
> >> Kernel               SQPOLL          IOPS            Perf diff
> >> ---------------------------------------------------------
> >> 5.13         0               31.05M          0.0%
> >> 5.13 + audit 0               25.31M          -18.5%
> >>
> >> and profile for the latter includes:
> >>
> >> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >
> > As Pavel correctly pointed it, looks like auditing is enabled. And
> > indeed it was! Hence the above numbers is without having turned off
> > auditing. Running the NOPs after having turned off audit, we get 30.6M
> > IOPS, which is down about 1.5% from the baseline. The results for the
> > polled random read test above did _not_ change from this, they are still
> > down the same amount.
> >
> > Note, and I should have included this in the first email, this is not
> > any kind of argument for or against audit logging. It's purely meant to
> > be a set of numbers that show how the current series impacts
> > performance.
>
> And finally, just checking if we make it optional per opcode if we see
> any real impact, and the answer is no. Using the below patch which
> effectively bypasses audit calls unless the opcode has flagged the need
> to do so, I cannot measure any difference in perf (as expected).
>
> To turn this into something useful, my suggestion as a viable path
> forward would be:
>
> 1) Use something like the below patch and flag request types that we
>    want to do audit logging for.
>
> 2) As Pavel suggested, eliminate the need for having both and entry/exit
>    hook, turning it into just one. That effectively cuts the number of
>    checks and calls in half.

I suspect the updated working-io_uring branch with HEAD at
1f25193a3f54 (updated a short time ago, see my last email in this
thread) will improve performance.  Also, as has been mention several
times now, for audit to work we need both the _entry and _exit call.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring
  2021-05-26 15:00 ` Jeff Moyer
@ 2021-05-26 18:49   ` Paul Moore
  2021-05-26 19:07     ` Jeff Moyer
  0 siblings, 1 reply; 71+ messages in thread
From: Paul Moore @ 2021-05-26 18:49 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On Wed, May 26, 2021 at 10:59 AM Jeff Moyer <[email protected]> wrote:
> Paul Moore <[email protected]> writes:
>
> > Also, any pointers to easy-to-run io_uring tests would be helpful.  I
> > am particularly interested in tests which make use of the personality
> > option, share urings across process boundaries, and make use of the
> > sqpoll functionality.
>
> liburing contains a test suite:
>   https://git.kernel.dk/cgit/liburing/
>
> You can run it via 'make runtests'.

Thanks Jeff, I'll take a look.  Quick question as I start sifting
through the tests, are there any tests in here which share a single
ring across process boundaries?

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 18:44                       ` Paul Moore
@ 2021-05-26 18:57                         ` Pavel Begunkov
  2021-05-26 19:10                           ` Paul Moore
  2021-05-26 19:44                         ` Jens Axboe
  1 sibling, 1 reply; 71+ messages in thread
From: Pavel Begunkov @ 2021-05-26 18:57 UTC (permalink / raw)
  To: Paul Moore, Jens Axboe
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On 5/26/21 7:44 PM, Paul Moore wrote:
> On Wed, May 26, 2021 at 2:01 PM Jens Axboe <[email protected]> wrote:
>> On 5/26/21 11:54 AM, Jens Axboe wrote:
>>> On 5/26/21 11:31 AM, Jens Axboe wrote:
>>>> On 5/26/21 11:15 AM, Jens Axboe wrote:
>>>>> On 5/25/21 8:04 PM, Paul Moore wrote:
>>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
>>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>>>>> That said, audit is not for everyone, and we have build time and
>>>>>>>> runtime options to help make life easier.  Beyond simply disabling
>>>>>>>> audit at compile time a number of Linux distributions effectively
>>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>>>>>> filter, for example:
>>>>>>>>
>>>>>>>>  % auditctl -a task,never
>>>>>>>
>>>>>>> As has been brought up, the issue we're facing is that distros have
>>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>>>>>> of people doing custom kernels. My question would then be how much
>>>>>>> overhead the above will add, considering it's an entry/exit call per op.
>>>>>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>>>>>
>>>>>> I commented on that case in my last email to Pavel, but I'll try to go
>>>>>> over it again in a little more detail.
>>>>>>
>>>>>> As we discussed earlier in this thread, we can skip the req->opcode
>>>>>> check before both the _entry and _exit calls, so we are left with just
>>>>>> the bare audit calls in the io_uring code.  As the _entry and _exit
>>>>>> functions are small, I've copied them and their supporting functions
>>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>>>>>> "task,never" case.
>>>>>>
>>>>>> +  static inline struct audit_context *audit_context(void)
>>>>>> +  {
>>>>>> +    return current->audit_context;
>>>>>> +  }
>>>>>>
>>>>>> +  static inline bool audit_dummy_context(void)
>>>>>> +  {
>>>>>> +    void *p = audit_context();
>>>>>> +    return !p || *(int *)p;
>>>>>> +  }
>>>>>>
>>>>>> +  static inline void audit_uring_entry(u8 op)
>>>>>> +  {
>>>>>> +    if (unlikely(audit_enabled && audit_context()))
>>>>>> +      __audit_uring_entry(op);
>>>>>> +  }
>>>>>>
>>>>>> We have one if statement where the conditional checks on two
>>>>>> individual conditions.  The first (audit_enabled) is simply a check to
>>>>>> see if anyone has "turned on" auditing at runtime; historically this
>>>>>> worked rather well, and still does in a number of places, but ever
>>>>>> since systemd has taken to forcing audit on regardless of the admin's
>>>>>> audit configuration it is less useful.  The second (audit_context())
>>>>>> is a check to see if an audit_context has been allocated for the
>>>>>> current task.  In the case of "task,never" current->audit_context will
>>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
>>>>>> will never be called.
>>>>>>
>>>>>> Worst case here is checking the value of audit_enabled and
>>>>>> current->audit_context.  Depending on which you think is more likely
>>>>>> we can change the order of the check so that the
>>>>>> current->audit_context check is first if you feel that is more likely
>>>>>> to be NULL than audit_enabled is to be false (it may be that way now).
>>>>>>
>>>>>> +  static inline void audit_uring_exit(int success, long code)
>>>>>> +  {
>>>>>> +    if (unlikely(!audit_dummy_context()))
>>>>>> +      __audit_uring_exit(success, code);
>>>>>> +  }
>>>>>>
>>>>>> The exit call is very similar to the entry call, but in the
>>>>>> "task,never" case it is very simple as the first check to be performed
>>>>>> is the current->audit_context check which we know to be NULL.  The
>>>>>> __audit_uring_exit() slowpath will never be called.
>>>>>
>>>>> I actually ran some numbers this morning. The test base is 5.13+, and
>>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
>>>>> test and the test with this series applied. I used your git branch as of
>>>>> this morning.
>>>>>
>>>>> The test case is my usual peak perf test, which is random reads at
>>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
>>>>> two different tests - one was having a thread just do the IO, the other
>>>>> is using SQPOLL to do the IO for us. The device is capable than more
>>>>> IOPS than a single core can deliver, so we're CPU limited in this test.
>>>>> Hence it's a good test case as it does actual work, and shows software
>>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
>>>>> between runs on the same base), yet I did average 4 runs.
>>>>>
>>>>> Kernel              SQPOLL          IOPS            Perf diff
>>>>> ---------------------------------------------------------
>>>>> 5.13                0               3029872         0.0%
>>>>> 5.13                1               3031056         0.0%
>>>>> 5.13 + audit        0               2894160         -4.5%
>>>>> 5.13 + audit        1               2886168         -4.8%
>>>>>
>>>>> That's an immediate drop in perf of almost 5%. Looking at a quick
>>>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
>>>>> shows this:
>>>>>
>>>>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>>>>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>>>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>>>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>>>>>
>>>>> Note that this is with _no_ rules!
>>>>
>>>> io_uring also supports a NOP command, which basically just measures
>>>> reqs/sec through the interface. Ran that as well:
>>>>
>>>> Kernel               SQPOLL          IOPS            Perf diff
>>>> ---------------------------------------------------------
>>>> 5.13         0               31.05M          0.0%
>>>> 5.13 + audit 0               25.31M          -18.5%
>>>>
>>>> and profile for the latter includes:
>>>>
>>>> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>>>> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>>>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>>>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>>>
>>> As Pavel correctly pointed it, looks like auditing is enabled. And
>>> indeed it was! Hence the above numbers is without having turned off
>>> auditing. Running the NOPs after having turned off audit, we get 30.6M
>>> IOPS, which is down about 1.5% from the baseline. The results for the
>>> polled random read test above did _not_ change from this, they are still
>>> down the same amount.
>>>
>>> Note, and I should have included this in the first email, this is not
>>> any kind of argument for or against audit logging. It's purely meant to
>>> be a set of numbers that show how the current series impacts
>>> performance.
>>
>> And finally, just checking if we make it optional per opcode if we see
>> any real impact, and the answer is no. Using the below patch which
>> effectively bypasses audit calls unless the opcode has flagged the need
>> to do so, I cannot measure any difference in perf (as expected).
>>
>> To turn this into something useful, my suggestion as a viable path
>> forward would be:
>>
>> 1) Use something like the below patch and flag request types that we
>>    want to do audit logging for.
>>
>> 2) As Pavel suggested, eliminate the need for having both and entry/exit
>>    hook, turning it into just one. That effectively cuts the number of
>>    checks and calls in half.
> 
> I suspect the updated working-io_uring branch with HEAD at
> 1f25193a3f54 (updated a short time ago, see my last email in this
> thread) will improve performance.  Also, as has been mention several

See the email you replied to, ~1.5% was basically an overhead of
two `if (io_op_defs[req->opcode].audit)` in case of nops, where at
least once req->opcode is cached. But to be completely fair, misses
unlikely

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring
  2021-05-26 18:49   ` Paul Moore
@ 2021-05-26 19:07     ` Jeff Moyer
  2021-05-26 19:10       ` Paul Moore
  0 siblings, 1 reply; 71+ messages in thread
From: Jeff Moyer @ 2021-05-26 19:07 UTC (permalink / raw)
  To: Paul Moore
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

Paul Moore <[email protected]> writes:

> On Wed, May 26, 2021 at 10:59 AM Jeff Moyer <[email protected]> wrote:
>> Paul Moore <[email protected]> writes:
>>
>> > Also, any pointers to easy-to-run io_uring tests would be helpful.  I
>> > am particularly interested in tests which make use of the personality
>> > option, share urings across process boundaries, and make use of the
>> > sqpoll functionality.
>>
>> liburing contains a test suite:
>>   https://git.kernel.dk/cgit/liburing/
>>
>> You can run it via 'make runtests'.
>
> Thanks Jeff, I'll take a look.  Quick question as I start sifting
> through the tests, are there any tests in here which share a single
> ring across process boundaries?

Yes.  At the very least, this one:

test/across-fork.c

-Jeff


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 18:57                         ` Pavel Begunkov
@ 2021-05-26 19:10                           ` Paul Moore
  0 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-26 19:10 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jens Axboe, linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On Wed, May 26, 2021 at 2:57 PM Pavel Begunkov <[email protected]> wrote:
> On 5/26/21 7:44 PM, Paul Moore wrote:
> > On Wed, May 26, 2021 at 2:01 PM Jens Axboe <[email protected]> wrote:
> >> On 5/26/21 11:54 AM, Jens Axboe wrote:
> >>> On 5/26/21 11:31 AM, Jens Axboe wrote:
> >>>> On 5/26/21 11:15 AM, Jens Axboe wrote:
> >>>>> On 5/25/21 8:04 PM, Paul Moore wrote:
> >>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
> >>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>>>>> That said, audit is not for everyone, and we have build time and
> >>>>>>>> runtime options to help make life easier.  Beyond simply disabling
> >>>>>>>> audit at compile time a number of Linux distributions effectively
> >>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>>>>> filter, for example:
> >>>>>>>>
> >>>>>>>>  % auditctl -a task,never
> >>>>>>>
> >>>>>>> As has been brought up, the issue we're facing is that distros have
> >>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>>>>> of people doing custom kernels. My question would then be how much
> >>>>>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>>>>> If auditctl is turned off, what is the expectation in turns of overhead?
> >>>>>>
> >>>>>> I commented on that case in my last email to Pavel, but I'll try to go
> >>>>>> over it again in a little more detail.
> >>>>>>
> >>>>>> As we discussed earlier in this thread, we can skip the req->opcode
> >>>>>> check before both the _entry and _exit calls, so we are left with just
> >>>>>> the bare audit calls in the io_uring code.  As the _entry and _exit
> >>>>>> functions are small, I've copied them and their supporting functions
> >>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>>>>> "task,never" case.
> >>>>>>
> >>>>>> +  static inline struct audit_context *audit_context(void)
> >>>>>> +  {
> >>>>>> +    return current->audit_context;
> >>>>>> +  }
> >>>>>>
> >>>>>> +  static inline bool audit_dummy_context(void)
> >>>>>> +  {
> >>>>>> +    void *p = audit_context();
> >>>>>> +    return !p || *(int *)p;
> >>>>>> +  }
> >>>>>>
> >>>>>> +  static inline void audit_uring_entry(u8 op)
> >>>>>> +  {
> >>>>>> +    if (unlikely(audit_enabled && audit_context()))
> >>>>>> +      __audit_uring_entry(op);
> >>>>>> +  }
> >>>>>>
> >>>>>> We have one if statement where the conditional checks on two
> >>>>>> individual conditions.  The first (audit_enabled) is simply a check to
> >>>>>> see if anyone has "turned on" auditing at runtime; historically this
> >>>>>> worked rather well, and still does in a number of places, but ever
> >>>>>> since systemd has taken to forcing audit on regardless of the admin's
> >>>>>> audit configuration it is less useful.  The second (audit_context())
> >>>>>> is a check to see if an audit_context has been allocated for the
> >>>>>> current task.  In the case of "task,never" current->audit_context will
> >>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> >>>>>> will never be called.
> >>>>>>
> >>>>>> Worst case here is checking the value of audit_enabled and
> >>>>>> current->audit_context.  Depending on which you think is more likely
> >>>>>> we can change the order of the check so that the
> >>>>>> current->audit_context check is first if you feel that is more likely
> >>>>>> to be NULL than audit_enabled is to be false (it may be that way now).
> >>>>>>
> >>>>>> +  static inline void audit_uring_exit(int success, long code)
> >>>>>> +  {
> >>>>>> +    if (unlikely(!audit_dummy_context()))
> >>>>>> +      __audit_uring_exit(success, code);
> >>>>>> +  }
> >>>>>>
> >>>>>> The exit call is very similar to the entry call, but in the
> >>>>>> "task,never" case it is very simple as the first check to be performed
> >>>>>> is the current->audit_context check which we know to be NULL.  The
> >>>>>> __audit_uring_exit() slowpath will never be called.
> >>>>>
> >>>>> I actually ran some numbers this morning. The test base is 5.13+, and
> >>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> >>>>> test and the test with this series applied. I used your git branch as of
> >>>>> this morning.
> >>>>>
> >>>>> The test case is my usual peak perf test, which is random reads at
> >>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> >>>>> two different tests - one was having a thread just do the IO, the other
> >>>>> is using SQPOLL to do the IO for us. The device is capable than more
> >>>>> IOPS than a single core can deliver, so we're CPU limited in this test.
> >>>>> Hence it's a good test case as it does actual work, and shows software
> >>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
> >>>>> between runs on the same base), yet I did average 4 runs.
> >>>>>
> >>>>> Kernel              SQPOLL          IOPS            Perf diff
> >>>>> ---------------------------------------------------------
> >>>>> 5.13                0               3029872         0.0%
> >>>>> 5.13                1               3031056         0.0%
> >>>>> 5.13 + audit        0               2894160         -4.5%
> >>>>> 5.13 + audit        1               2886168         -4.8%
> >>>>>
> >>>>> That's an immediate drop in perf of almost 5%. Looking at a quick
> >>>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
> >>>>> shows this:
> >>>>>
> >>>>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >>>>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>>>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>>>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >>>>>
> >>>>> Note that this is with _no_ rules!
> >>>>
> >>>> io_uring also supports a NOP command, which basically just measures
> >>>> reqs/sec through the interface. Ran that as well:
> >>>>
> >>>> Kernel               SQPOLL          IOPS            Perf diff
> >>>> ---------------------------------------------------------
> >>>> 5.13         0               31.05M          0.0%
> >>>> 5.13 + audit 0               25.31M          -18.5%
> >>>>
> >>>> and profile for the latter includes:
> >>>>
> >>>> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >>>> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>>>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>>>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >>>
> >>> As Pavel correctly pointed it, looks like auditing is enabled. And
> >>> indeed it was! Hence the above numbers is without having turned off
> >>> auditing. Running the NOPs after having turned off audit, we get 30.6M
> >>> IOPS, which is down about 1.5% from the baseline. The results for the
> >>> polled random read test above did _not_ change from this, they are still
> >>> down the same amount.
> >>>
> >>> Note, and I should have included this in the first email, this is not
> >>> any kind of argument for or against audit logging. It's purely meant to
> >>> be a set of numbers that show how the current series impacts
> >>> performance.
> >>
> >> And finally, just checking if we make it optional per opcode if we see
> >> any real impact, and the answer is no. Using the below patch which
> >> effectively bypasses audit calls unless the opcode has flagged the need
> >> to do so, I cannot measure any difference in perf (as expected).
> >>
> >> To turn this into something useful, my suggestion as a viable path
> >> forward would be:
> >>
> >> 1) Use something like the below patch and flag request types that we
> >>    want to do audit logging for.
> >>
> >> 2) As Pavel suggested, eliminate the need for having both and entry/exit
> >>    hook, turning it into just one. That effectively cuts the number of
> >>    checks and calls in half.
> >
> > I suspect the updated working-io_uring branch with HEAD at
> > 1f25193a3f54 (updated a short time ago, see my last email in this
> > thread) will improve performance.  Also, as has been mention several
>
> See the email you replied to, ~1.5% was basically an overhead of
> two `if (io_op_defs[req->opcode].audit)` in case of nops, where at
> least once req->opcode is cached. But to be completely fair, misses
> unlikely

Maybe.  I remain skeptical that "io_op_defs[req->opcode].audit" has
the same cost as "unlikely(audit_context())".

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring
  2021-05-26 19:07     ` Jeff Moyer
@ 2021-05-26 19:10       ` Paul Moore
  0 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-26 19:10 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On Wed, May 26, 2021 at 3:06 PM Jeff Moyer <[email protected]> wrote:
> Paul Moore <[email protected]> writes:
>
> > On Wed, May 26, 2021 at 10:59 AM Jeff Moyer <[email protected]> wrote:
> >> Paul Moore <[email protected]> writes:
> >>
> >> > Also, any pointers to easy-to-run io_uring tests would be helpful.  I
> >> > am particularly interested in tests which make use of the personality
> >> > option, share urings across process boundaries, and make use of the
> >> > sqpoll functionality.
> >>
> >> liburing contains a test suite:
> >>   https://git.kernel.dk/cgit/liburing/
> >>
> >> You can run it via 'make runtests'.
> >
> > Thanks Jeff, I'll take a look.  Quick question as I start sifting
> > through the tests, are there any tests in here which share a single
> > ring across process boundaries?
>
> Yes.  At the very least, this one:
>
> test/across-fork.c

Great, thanks!

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 18:44                       ` Paul Moore
  2021-05-26 18:57                         ` Pavel Begunkov
@ 2021-05-26 19:44                         ` Jens Axboe
  2021-05-26 20:19                           ` Paul Moore
  1 sibling, 1 reply; 71+ messages in thread
From: Jens Axboe @ 2021-05-26 19:44 UTC (permalink / raw)
  To: Paul Moore
  Cc: Pavel Begunkov, linux-security-module, selinux, linux-audit,
	io-uring, linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On 5/26/21 12:44 PM, Paul Moore wrote:
> On Wed, May 26, 2021 at 2:01 PM Jens Axboe <[email protected]> wrote:
>> On 5/26/21 11:54 AM, Jens Axboe wrote:
>>> On 5/26/21 11:31 AM, Jens Axboe wrote:
>>>> On 5/26/21 11:15 AM, Jens Axboe wrote:
>>>>> On 5/25/21 8:04 PM, Paul Moore wrote:
>>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
>>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
>>>>>>>> That said, audit is not for everyone, and we have build time and
>>>>>>>> runtime options to help make life easier.  Beyond simply disabling
>>>>>>>> audit at compile time a number of Linux distributions effectively
>>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit
>>>>>>>> filter, for example:
>>>>>>>>
>>>>>>>>  % auditctl -a task,never
>>>>>>>
>>>>>>> As has been brought up, the issue we're facing is that distros have
>>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
>>>>>>> of people doing custom kernels. My question would then be how much
>>>>>>> overhead the above will add, considering it's an entry/exit call per op.
>>>>>>> If auditctl is turned off, what is the expectation in turns of overhead?
>>>>>>
>>>>>> I commented on that case in my last email to Pavel, but I'll try to go
>>>>>> over it again in a little more detail.
>>>>>>
>>>>>> As we discussed earlier in this thread, we can skip the req->opcode
>>>>>> check before both the _entry and _exit calls, so we are left with just
>>>>>> the bare audit calls in the io_uring code.  As the _entry and _exit
>>>>>> functions are small, I've copied them and their supporting functions
>>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
>>>>>> "task,never" case.
>>>>>>
>>>>>> +  static inline struct audit_context *audit_context(void)
>>>>>> +  {
>>>>>> +    return current->audit_context;
>>>>>> +  }
>>>>>>
>>>>>> +  static inline bool audit_dummy_context(void)
>>>>>> +  {
>>>>>> +    void *p = audit_context();
>>>>>> +    return !p || *(int *)p;
>>>>>> +  }
>>>>>>
>>>>>> +  static inline void audit_uring_entry(u8 op)
>>>>>> +  {
>>>>>> +    if (unlikely(audit_enabled && audit_context()))
>>>>>> +      __audit_uring_entry(op);
>>>>>> +  }
>>>>>>
>>>>>> We have one if statement where the conditional checks on two
>>>>>> individual conditions.  The first (audit_enabled) is simply a check to
>>>>>> see if anyone has "turned on" auditing at runtime; historically this
>>>>>> worked rather well, and still does in a number of places, but ever
>>>>>> since systemd has taken to forcing audit on regardless of the admin's
>>>>>> audit configuration it is less useful.  The second (audit_context())
>>>>>> is a check to see if an audit_context has been allocated for the
>>>>>> current task.  In the case of "task,never" current->audit_context will
>>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
>>>>>> will never be called.
>>>>>>
>>>>>> Worst case here is checking the value of audit_enabled and
>>>>>> current->audit_context.  Depending on which you think is more likely
>>>>>> we can change the order of the check so that the
>>>>>> current->audit_context check is first if you feel that is more likely
>>>>>> to be NULL than audit_enabled is to be false (it may be that way now).
>>>>>>
>>>>>> +  static inline void audit_uring_exit(int success, long code)
>>>>>> +  {
>>>>>> +    if (unlikely(!audit_dummy_context()))
>>>>>> +      __audit_uring_exit(success, code);
>>>>>> +  }
>>>>>>
>>>>>> The exit call is very similar to the entry call, but in the
>>>>>> "task,never" case it is very simple as the first check to be performed
>>>>>> is the current->audit_context check which we know to be NULL.  The
>>>>>> __audit_uring_exit() slowpath will never be called.
>>>>>
>>>>> I actually ran some numbers this morning. The test base is 5.13+, and
>>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
>>>>> test and the test with this series applied. I used your git branch as of
>>>>> this morning.
>>>>>
>>>>> The test case is my usual peak perf test, which is random reads at
>>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
>>>>> two different tests - one was having a thread just do the IO, the other
>>>>> is using SQPOLL to do the IO for us. The device is capable than more
>>>>> IOPS than a single core can deliver, so we're CPU limited in this test.
>>>>> Hence it's a good test case as it does actual work, and shows software
>>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
>>>>> between runs on the same base), yet I did average 4 runs.
>>>>>
>>>>> Kernel              SQPOLL          IOPS            Perf diff
>>>>> ---------------------------------------------------------
>>>>> 5.13                0               3029872         0.0%
>>>>> 5.13                1               3031056         0.0%
>>>>> 5.13 + audit        0               2894160         -4.5%
>>>>> 5.13 + audit        1               2886168         -4.8%
>>>>>
>>>>> That's an immediate drop in perf of almost 5%. Looking at a quick
>>>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
>>>>> shows this:
>>>>>
>>>>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>>>>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>>>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>>>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>>>>>
>>>>> Note that this is with _no_ rules!
>>>>
>>>> io_uring also supports a NOP command, which basically just measures
>>>> reqs/sec through the interface. Ran that as well:
>>>>
>>>> Kernel               SQPOLL          IOPS            Perf diff
>>>> ---------------------------------------------------------
>>>> 5.13         0               31.05M          0.0%
>>>> 5.13 + audit 0               25.31M          -18.5%
>>>>
>>>> and profile for the latter includes:
>>>>
>>>> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
>>>> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
>>>>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
>>>>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
>>>
>>> As Pavel correctly pointed it, looks like auditing is enabled. And
>>> indeed it was! Hence the above numbers is without having turned off
>>> auditing. Running the NOPs after having turned off audit, we get 30.6M
>>> IOPS, which is down about 1.5% from the baseline. The results for the
>>> polled random read test above did _not_ change from this, they are still
>>> down the same amount.
>>>
>>> Note, and I should have included this in the first email, this is not
>>> any kind of argument for or against audit logging. It's purely meant to
>>> be a set of numbers that show how the current series impacts
>>> performance.
>>
>> And finally, just checking if we make it optional per opcode if we see
>> any real impact, and the answer is no. Using the below patch which
>> effectively bypasses audit calls unless the opcode has flagged the need
>> to do so, I cannot measure any difference in perf (as expected).
>>
>> To turn this into something useful, my suggestion as a viable path
>> forward would be:
>>
>> 1) Use something like the below patch and flag request types that we
>>    want to do audit logging for.
>>
>> 2) As Pavel suggested, eliminate the need for having both and entry/exit
>>    hook, turning it into just one. That effectively cuts the number of
>>    checks and calls in half.
> 
> I suspect the updated working-io_uring branch with HEAD at
> 1f25193a3f54 (updated a short time ago, see my last email in this
> thread) will improve performance.  Also, as has been mention several
> times now, for audit to work we need both the _entry and _exit call.

Pulled in your new stuff, and ran a quick test again (same as before,
the rand reads). No change, we're still down ~5% with auditctl -a
task,never having been run. If I stub out your two audit entry/exit
calls by adding an if (0) before them, then perf is back to normal.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 19:44                         ` Jens Axboe
@ 2021-05-26 20:19                           ` Paul Moore
  2021-05-28 16:02                             ` Paul Moore
  0 siblings, 1 reply; 71+ messages in thread
From: Paul Moore @ 2021-05-26 20:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Pavel Begunkov, linux-security-module, selinux, linux-audit,
	io-uring, linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On Wed, May 26, 2021 at 3:44 PM Jens Axboe <[email protected]> wrote:
> On 5/26/21 12:44 PM, Paul Moore wrote:
> > On Wed, May 26, 2021 at 2:01 PM Jens Axboe <[email protected]> wrote:
> >> On 5/26/21 11:54 AM, Jens Axboe wrote:
> >>> On 5/26/21 11:31 AM, Jens Axboe wrote:
> >>>> On 5/26/21 11:15 AM, Jens Axboe wrote:
> >>>>> On 5/25/21 8:04 PM, Paul Moore wrote:
> >>>>>> On Tue, May 25, 2021 at 9:11 PM Jens Axboe <[email protected]> wrote:
> >>>>>>> On 5/24/21 1:59 PM, Paul Moore wrote:
> >>>>>>>> That said, audit is not for everyone, and we have build time and
> >>>>>>>> runtime options to help make life easier.  Beyond simply disabling
> >>>>>>>> audit at compile time a number of Linux distributions effectively
> >>>>>>>> shortcut audit at runtime by adding a "never" rule to the audit
> >>>>>>>> filter, for example:
> >>>>>>>>
> >>>>>>>>  % auditctl -a task,never
> >>>>>>>
> >>>>>>> As has been brought up, the issue we're facing is that distros have
> >>>>>>> CONFIG_AUDIT=y and hence the above is the best real world case outside
> >>>>>>> of people doing custom kernels. My question would then be how much
> >>>>>>> overhead the above will add, considering it's an entry/exit call per op.
> >>>>>>> If auditctl is turned off, what is the expectation in turns of overhead?
> >>>>>>
> >>>>>> I commented on that case in my last email to Pavel, but I'll try to go
> >>>>>> over it again in a little more detail.
> >>>>>>
> >>>>>> As we discussed earlier in this thread, we can skip the req->opcode
> >>>>>> check before both the _entry and _exit calls, so we are left with just
> >>>>>> the bare audit calls in the io_uring code.  As the _entry and _exit
> >>>>>> functions are small, I've copied them and their supporting functions
> >>>>>> below and I'll try to explain what would happen in CONFIG_AUDIT=y,
> >>>>>> "task,never" case.
> >>>>>>
> >>>>>> +  static inline struct audit_context *audit_context(void)
> >>>>>> +  {
> >>>>>> +    return current->audit_context;
> >>>>>> +  }
> >>>>>>
> >>>>>> +  static inline bool audit_dummy_context(void)
> >>>>>> +  {
> >>>>>> +    void *p = audit_context();
> >>>>>> +    return !p || *(int *)p;
> >>>>>> +  }
> >>>>>>
> >>>>>> +  static inline void audit_uring_entry(u8 op)
> >>>>>> +  {
> >>>>>> +    if (unlikely(audit_enabled && audit_context()))
> >>>>>> +      __audit_uring_entry(op);
> >>>>>> +  }
> >>>>>>
> >>>>>> We have one if statement where the conditional checks on two
> >>>>>> individual conditions.  The first (audit_enabled) is simply a check to
> >>>>>> see if anyone has "turned on" auditing at runtime; historically this
> >>>>>> worked rather well, and still does in a number of places, but ever
> >>>>>> since systemd has taken to forcing audit on regardless of the admin's
> >>>>>> audit configuration it is less useful.  The second (audit_context())
> >>>>>> is a check to see if an audit_context has been allocated for the
> >>>>>> current task.  In the case of "task,never" current->audit_context will
> >>>>>> be NULL (see audit_alloc()) and the __audit_uring_entry() slowpath
> >>>>>> will never be called.
> >>>>>>
> >>>>>> Worst case here is checking the value of audit_enabled and
> >>>>>> current->audit_context.  Depending on which you think is more likely
> >>>>>> we can change the order of the check so that the
> >>>>>> current->audit_context check is first if you feel that is more likely
> >>>>>> to be NULL than audit_enabled is to be false (it may be that way now).
> >>>>>>
> >>>>>> +  static inline void audit_uring_exit(int success, long code)
> >>>>>> +  {
> >>>>>> +    if (unlikely(!audit_dummy_context()))
> >>>>>> +      __audit_uring_exit(success, code);
> >>>>>> +  }
> >>>>>>
> >>>>>> The exit call is very similar to the entry call, but in the
> >>>>>> "task,never" case it is very simple as the first check to be performed
> >>>>>> is the current->audit_context check which we know to be NULL.  The
> >>>>>> __audit_uring_exit() slowpath will never be called.
> >>>>>
> >>>>> I actually ran some numbers this morning. The test base is 5.13+, and
> >>>>> CONFIG_AUDIT=y and CONFIG_AUDITSYSCALL=y is set for both the baseline
> >>>>> test and the test with this series applied. I used your git branch as of
> >>>>> this morning.
> >>>>>
> >>>>> The test case is my usual peak perf test, which is random reads at
> >>>>> QD=128 and using polled IO. It's a single core test, not threaded. I ran
> >>>>> two different tests - one was having a thread just do the IO, the other
> >>>>> is using SQPOLL to do the IO for us. The device is capable than more
> >>>>> IOPS than a single core can deliver, so we're CPU limited in this test.
> >>>>> Hence it's a good test case as it does actual work, and shows software
> >>>>> overhead quite nicely. Runs are very stable (less than 0.5% difference
> >>>>> between runs on the same base), yet I did average 4 runs.
> >>>>>
> >>>>> Kernel              SQPOLL          IOPS            Perf diff
> >>>>> ---------------------------------------------------------
> >>>>> 5.13                0               3029872         0.0%
> >>>>> 5.13                1               3031056         0.0%
> >>>>> 5.13 + audit        0               2894160         -4.5%
> >>>>> 5.13 + audit        1               2886168         -4.8%
> >>>>>
> >>>>> That's an immediate drop in perf of almost 5%. Looking at a quick
> >>>>> profile of it (nothing fancy, just checking for 'audit' in the profile)
> >>>>> shows this:
> >>>>>
> >>>>> +    2.17%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >>>>> +    0.71%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>>>>      0.07%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>>>>      0.02%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >>>>>
> >>>>> Note that this is with _no_ rules!
> >>>>
> >>>> io_uring also supports a NOP command, which basically just measures
> >>>> reqs/sec through the interface. Ran that as well:
> >>>>
> >>>> Kernel               SQPOLL          IOPS            Perf diff
> >>>> ---------------------------------------------------------
> >>>> 5.13         0               31.05M          0.0%
> >>>> 5.13 + audit 0               25.31M          -18.5%
> >>>>
> >>>> and profile for the latter includes:
> >>>>
> >>>> +    5.19%  io_uring  [kernel.vmlinux]  [k] __audit_uring_entry
> >>>> +    4.31%  io_uring  [kernel.vmlinux]  [k] __audit_uring_exit
> >>>>      0.26%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_entry
> >>>>      0.08%  io_uring  [kernel.vmlinux]  [k] __audit_syscall_exit
> >>>
> >>> As Pavel correctly pointed it, looks like auditing is enabled. And
> >>> indeed it was! Hence the above numbers is without having turned off
> >>> auditing. Running the NOPs after having turned off audit, we get 30.6M
> >>> IOPS, which is down about 1.5% from the baseline. The results for the
> >>> polled random read test above did _not_ change from this, they are still
> >>> down the same amount.
> >>>
> >>> Note, and I should have included this in the first email, this is not
> >>> any kind of argument for or against audit logging. It's purely meant to
> >>> be a set of numbers that show how the current series impacts
> >>> performance.
> >>
> >> And finally, just checking if we make it optional per opcode if we see
> >> any real impact, and the answer is no. Using the below patch which
> >> effectively bypasses audit calls unless the opcode has flagged the need
> >> to do so, I cannot measure any difference in perf (as expected).
> >>
> >> To turn this into something useful, my suggestion as a viable path
> >> forward would be:
> >>
> >> 1) Use something like the below patch and flag request types that we
> >>    want to do audit logging for.
> >>
> >> 2) As Pavel suggested, eliminate the need for having both and entry/exit
> >>    hook, turning it into just one. That effectively cuts the number of
> >>    checks and calls in half.
> >
> > I suspect the updated working-io_uring branch with HEAD at
> > 1f25193a3f54 (updated a short time ago, see my last email in this
> > thread) will improve performance.  Also, as has been mention several
> > times now, for audit to work we need both the _entry and _exit call.
>
> Pulled in your new stuff, and ran a quick test again (same as before,
> the rand reads). No change, we're still down ~5% with auditctl -a
> task,never having been run. If I stub out your two audit entry/exit
> calls by adding an if (0) before them, then perf is back to normal.

That's fun.

Considering that we are down to basically a single if-conditional in
the audit code and your other tests with another single conditional
yielded no significant change it looks like we really only have one
option left before we "agree to disagree" and have to mark io_uring
and audit as mutually exclusive in Kconfig.  If we moved the _entry
and _exit calls into the individual operation case blocks (quick
openat example below) so that only certain operations were able to be
audited would that be acceptable assuming the high frequency ops were
untouched?  My initial gut feeling was that this would involve >50% of
the ops, but Steve Grubb seems to think it would be less; it may be
time to look at that a bit more seriously, but if it gets a NACK
regardless it isn't worth the time - thoughts?

  case IORING_OP_OPENAT:
    audit_uring_entry(req->opcode);
    ret = io_openat(req, issue_flags);
    audit_uring_exit(!ret, ret);
    break;

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 7/9] lsm,io_uring: add LSM hooks to io_uring
  2021-05-26 14:48   ` Stefan Metzmacher
@ 2021-05-26 20:45     ` Paul Moore
  0 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-26 20:45 UTC (permalink / raw)
  To: Stefan Metzmacher
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On Wed, May 26, 2021 at 10:48 AM Stefan Metzmacher <[email protected]> wrote:
>
> Hi Paul,

Hi Stefan.

> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/io_uring.h>
> > @@ -6537,6 +6538,11 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
> >               if (!req->work.creds)
> >                       return -EINVAL;
> >               get_cred(req->work.creds);
> > +             ret = security_uring_override_creds(req->work.creds);
> > +             if (ret) {
> > +                     put_cred(req->work.creds);
> > +                     return ret;
> > +             }
>
> Why are you calling this per requests, shouldn't this be done in
> io_register_personality()?

Generally speaking it is more interesting to see when user alice tries
to impersonate bob and not when bob registers his ID as available to
use by others.  We could always add a LSM hook to control when bob
registers his ID, but I think the impersonation is the critical code
path.

However, if I'm misunderstanding how this works in io_uring please correct me.

> I'm also not sure if this really gains anything as io_register_personality()
> only captures the value of get_current_cred(), so the process already has changed to
> the credentials (at least once for the io_uring_register(IORING_REGISTER_PERSONALITY)
> call).
>
> metze

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 17:22                       ` Jens Axboe
@ 2021-05-27 17:27                         ` Richard Guy Briggs
  0 siblings, 0 replies; 71+ messages in thread
From: Richard Guy Briggs @ 2021-05-27 17:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Stefan Metzmacher, Paul Moore, Pavel Begunkov, selinux,
	linux-security-module, linux-audit, Kumar Kartikeya Dwivedi,
	linux-fsdevel, io-uring, Alexander Viro

On 2021-05-26 11:22, Jens Axboe wrote:
> On 5/26/21 9:49 AM, Richard Guy Briggs wrote:
> >> So why is there anything special needed for io_uring (now that the
> >> native worker threads are used)?
> > 
> > Because syscall has been bypassed by a memory-mapped work queue.
> 
> I don't follow this one at all, that's just the delivery mechanism if
> you choose to use SQPOLL. If you do, then a thread sibling of the
> original task does the actual system call. There's no magic involved
> there, and the tasks are related.
> 
> So care to expand on that?

These may be poor examples, but hear me out...

In the case of an open call, I'm guessing that they are mapped 1:1
syscall to io_uring action so that the action can be asynchronous.  So
in this case, there is a record of the action, but we don't see the
result success/failure.  I assume that file paths can only be specified
in the original syscall and not in an SQPOLL action?

In the case of a syscall read/write (which aren't as interesting
generally to audit, but I'm concerned there are other similar situations
that are), the syscall would be called for every buffer that is passed
back and forth user/kernel and vice versa, providing a way to log all
that activity.  In the case of SQPOLL, I understand that a syscall sets
up the action and then io_uring takes over and does the bulk transfer
and we'd not have the visibility of how many times that action was
repeated nor that the result success/failure was due to its asynchrony.

Perhaps I am showing my ignorance, so please correct me if I have it
wrong.

> >> Is there really any io_uring opcode that bypasses the security checks the corresponding native syscall
> >> would do? If so, I think that should just be fixed...
> > 
> > This is by design to speed it up.  This is what Paul's iouring entry and
> > exit hooks do.
> 
> As far as I can tell, we're doing double logging at that point, if the
> syscall helper does the audit as well. We'll get something logging the
> io_uring opcode (eg IORING_OP_OPENAT2), then log again when we call the
> fs helper. That's _assuming_ that we always hit the logging part when we
> call into the vfs, but that's something that can be updated to be true
> and kept an eye on for future additions.
> 
> Why is the double logging useful? It only tells you that the invocation
> was via io_uring as the delivery mechanism rather than the usual system
> call, but the effect is the same - the file is opened, for example.
> 
> I feel like I'm missing something here, or the other side is. Or both!

Paul addressed this in his reply, but let me add a more concrete
example...  There was one corner case I was looking at that showed up
this issue.  Please indicate if I have mischaracterized or
misunderstood.

A syscall would generate records something like this:

	AUDIT_SYSCALL
	AUDIT_...
	AUDIT_EOE

A io_uring SQPOLL event would generate something like this:

	AUDIT_URINGOP
	AUDIT_...
	AUDIT_EOE

The "hybrid" event that is a syscall that starts an io_uring action
would generate something like this:

	AUDIT_URINGOP
	[possible AUDIT_CONFIG_CHANGE (from killed_trees)]
	AUDIT_SYSCALL
	AUDIT_...
	AUDIT_EOE

The AUDIT_... is all the operation-specific records that log parameters
that aren't able to be expressed in the SYSLOG or URINGOP records such
as pathnames, other arguments, and context (pwd, etc...).  So this isn't
"double logging".  It is either introducing an io_uring event, or it is
providing more detail about the io_uring arguments to a syscall event.

> Jens Axboe

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-26 20:19                           ` Paul Moore
@ 2021-05-28 16:02                             ` Paul Moore
  2021-06-02  8:26                               ` Pavel Begunkov
  2021-06-03 15:54                               ` Jens Axboe
  0 siblings, 2 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-28 16:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Pavel Begunkov, linux-security-module, selinux, linux-audit,
	io-uring, linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On Wed, May 26, 2021 at 4:19 PM Paul Moore <[email protected]> wrote:
> ... If we moved the _entry
> and _exit calls into the individual operation case blocks (quick
> openat example below) so that only certain operations were able to be
> audited would that be acceptable assuming the high frequency ops were
> untouched?  My initial gut feeling was that this would involve >50% of
> the ops, but Steve Grubb seems to think it would be less; it may be
> time to look at that a bit more seriously, but if it gets a NACK
> regardless it isn't worth the time - thoughts?
>
>   case IORING_OP_OPENAT:
>     audit_uring_entry(req->opcode);
>     ret = io_openat(req, issue_flags);
>     audit_uring_exit(!ret, ret);
>     break;

I wanted to pose this question again in case it was lost in the
thread, I suspect this may be the last option before we have to "fix"
things at the Kconfig level.  I definitely don't want to have to go
that route, and I suspect most everyone on this thread feels the same,
so I'm hopeful we can find a solution that is begrudgingly acceptable
to both groups.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 4/9] audit: add filtering for io_uring records
  2021-05-21 21:50 ` [RFC PATCH 4/9] audit: add filtering for io_uring records Paul Moore
@ 2021-05-28 22:35   ` Richard Guy Briggs
  2021-05-30 15:26     ` Paul Moore
  0 siblings, 1 reply; 71+ messages in thread
From: Richard Guy Briggs @ 2021-05-28 22:35 UTC (permalink / raw)
  To: Paul Moore
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On 2021-05-21 17:50, Paul Moore wrote:
> WARNING - This is a work in progress and should not be merged
> anywhere important.  It is almost surely not complete, and while it
> probably compiles it likely hasn't been booted and will do terrible
> things.  You have been warned.
> 
> This patch adds basic audit io_uring filtering, using as much of the
> existing audit filtering infrastructure as possible.  In order to do
> this we reuse the audit filter rule's syscall mask for the io_uring
> operation and we create a new filter for io_uring operations as
> AUDIT_FILTER_URING_EXIT/audit_filter_list[7].
> 
> <TODO - provide some additional guidance for the userspace tools>
> 
> Signed-off-by: Paul Moore <[email protected]>
> ---
>  include/uapi/linux/audit.h |    3 +-
>  kernel/auditfilter.c       |    4 ++-
>  kernel/auditsc.c           |   65 ++++++++++++++++++++++++++++++++++----------
>  3 files changed, 55 insertions(+), 17 deletions(-)
> 
> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index b26e0c435e8b..621eb3c1076e 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -167,8 +167,9 @@
>  #define AUDIT_FILTER_EXCLUDE	0x05	/* Apply rule before record creation */
>  #define AUDIT_FILTER_TYPE	AUDIT_FILTER_EXCLUDE /* obsolete misleading naming */
>  #define AUDIT_FILTER_FS		0x06	/* Apply rule at __audit_inode_child */
> +#define AUDIT_FILTER_URING_EXIT	0x07	/* Apply rule at io_uring op exit */
>  
> -#define AUDIT_NR_FILTERS	7
> +#define AUDIT_NR_FILTERS	8
>  
>  #define AUDIT_FILTER_PREPEND	0x10	/* Prepend to front of list */
>  
> diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
> index db2c6b59dfc3..c21119c00504 100644
> --- a/kernel/auditfilter.c
> +++ b/kernel/auditfilter.c
> @@ -44,7 +44,8 @@ struct list_head audit_filter_list[AUDIT_NR_FILTERS] = {
>  	LIST_HEAD_INIT(audit_filter_list[4]),
>  	LIST_HEAD_INIT(audit_filter_list[5]),
>  	LIST_HEAD_INIT(audit_filter_list[6]),
> -#if AUDIT_NR_FILTERS != 7
> +	LIST_HEAD_INIT(audit_filter_list[7]),
> +#if AUDIT_NR_FILTERS != 8
>  #error Fix audit_filter_list initialiser
>  #endif
>  };
> @@ -56,6 +57,7 @@ static struct list_head audit_rules_list[AUDIT_NR_FILTERS] = {
>  	LIST_HEAD_INIT(audit_rules_list[4]),
>  	LIST_HEAD_INIT(audit_rules_list[5]),
>  	LIST_HEAD_INIT(audit_rules_list[6]),
> +	LIST_HEAD_INIT(audit_rules_list[7]),
>  };
>  
>  DEFINE_MUTEX(audit_filter_mutex);
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index d8aa2c690bf9..4f6ab34020fb 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -799,6 +799,35 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
>  	return rule->mask[word] & bit;
>  }
>  
> +/**
> + * audit_filter_uring - apply filters to an io_uring operation
> + * @tsk: associated task
> + * @ctx: audit context
> + */
> +static void audit_filter_uring(struct task_struct *tsk,
> +			       struct audit_context *ctx)
> +{
> +	struct audit_entry *e;
> +	enum audit_state state;
> +
> +	if (auditd_test_task(tsk))
> +		return;

Is this necessary?  auditd and auditctl don't (intentionally) use any
io_uring functionality.  Is it possible it might inadvertantly use some
by virtue of libc or other library calls now or in the future?

> +	rcu_read_lock();
> +	list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_URING_EXIT],
> +				list) {
> +		if (audit_in_mask(&e->rule, ctx->uring_op) &&

While this seems like the most obvious approach given the parallels
between syscalls and io_uring operations, as coded here it won't work
due to the different mappings of syscall numbers and io_uring
operations unless we re-use the auditctl -S field with raw io_uring
operation numbers in the place of syscall numbers.  This should have
been obvious to me when I first looked at this patch.  It became obvious
when I started looking at the userspace auditctl.c.

The easy first step would be to use something like this:
	auditctl -a uring,always -S 18,28 -F key=uring_open
to monitor file open commands only.  The same is not yet possible for
the perm field, but there are so few io_uring ops at this point compared
with syscalls that it might be manageable.  The arch is irrelevant since
io_uring operation numbers are identical across all hardware as far as I
can tell.  Most of the rest of the fields should make sense if they do
for a syscall rule.  Here's a sample of userspace code to support this
patch:
	https://github.com/rgbriggs/audit-userspace/commit/a77baa1651b7ad841a220eb962d4cc92bc07dc96
	https://github.com/linux-audit/audit-userspace/compare/master...rgbriggs:ghau-iouring-filtering.v1.0


If we abuse the syscall infrastructure at first, we'd need a transition
plan to coordinate user and kernel switchover to seperate mechanisms for
the two to work together if the need should arise to have both syscall
and uring filters in the same rule.


I'm still exploring ideas on the best place to put this translation
table, in userspace libaudit, kaudit, io_op_defs[],
include/uapi/linux/io_uring.h, ...

For speed, the best would be in userspace libaudit, but that will be the
least obvious place for any io_uring developer to look when making any
updates to the list of io_uring operations and will most likely result
in additions escaping security logging.

Next best for speed would be in kaudit when instantiating rules, to
translate the syscall numbers to operation numbers and store them
natively in a different mask (audit_in_mask).  This also runs the risk
of additions escaping security logging.

After that, they could reuse the existing audit syscall filter
infrastructure and mask and translate on the fly from the specific op in
use to its equivalent syscall number when checking the existing filter.
This is the least disruptive to the audit code and the most likely to be
updated when new io_uring operations are added, but will incur an extra
step to translate the io_uring operation number to a syscall number in
the hot path.

Given that multiple aspects of security appear to have been a complete
afterthought to this feature, necessitating it to be bolted on after the
fact, it appears that the last option might be the most attractive to
the security folks, making this as easy as possible for fs folks to
update the audit code would be necessary to maintain security.


If there isn't a direct mapping between io_uring operations and
syscalls (the reverse isn't needed), then we'll need to duplicate the
mechanism throughout the stack starting in userspace auditctl and
libaudit.  Currently the bitmap for syscalls is 2k.  The current
io_uring op list appears to be 37.


It might be wise to deliberately not support auditctl "-w" (and the
exported audit_add_watch() function) since that is currently hardcoded
to the AUDIT_FILTER_EXIT list and is the old watch form [replaced by
audit_rule_fieldpair_data()] anyways that is more likely to be
deprecated.  It also appears to make sense not to support autrace (at
least initially).


Does any of this roughly make sense?


> +		    audit_filter_rules(tsk, &e->rule, ctx, NULL, &state,
> +				       false)) {
> +			rcu_read_unlock();
> +			ctx->current_state = state;
> +			return;
> +		}
> +	}
> +	rcu_read_unlock();
> +	return;
> +}
> +
>  /* At syscall exit time, this filter is called if the audit_state is
>   * not low enough that auditing cannot take place, but is also not
>   * high enough that we already know we have to write an audit record
> @@ -1754,7 +1783,7 @@ static void audit_log_exit(void)
>   * __audit_free - free a per-task audit context
>   * @tsk: task whose audit context block to free
>   *
> - * Called from copy_process and do_exit
> + * Called from copy_process, do_exit, and the io_uring code
>   */
>  void __audit_free(struct task_struct *tsk)
>  {
> @@ -1772,15 +1801,21 @@ void __audit_free(struct task_struct *tsk)
>  	 * random task_struct that doesn't doesn't have any meaningful data we
>  	 * need to log via audit_log_exit().
>  	 */
> -	if (tsk == current && !context->dummy &&
> -	    context->context == AUDIT_CTX_SYSCALL) {
> +	if (tsk == current && !context->dummy) {
>  		context->return_valid = AUDITSC_INVALID;
>  		context->return_code = 0;
> -
> -		audit_filter_syscall(tsk, context);
> -		audit_filter_inodes(tsk, context);
> -		if (context->current_state == AUDIT_RECORD_CONTEXT)
> -			audit_log_exit();
> +		if (context->context == AUDIT_CTX_SYSCALL) {
> +			audit_filter_syscall(tsk, context);
> +			audit_filter_inodes(tsk, context);
> +			if (context->current_state == AUDIT_RECORD_CONTEXT)
> +				audit_log_exit();
> +		} else if (context->context == AUDIT_CTX_URING) {
> +			/* TODO: verify this case is real and valid */
> +			audit_filter_uring(tsk, context);
> +			audit_filter_inodes(tsk, context);
> +			if (context->current_state == AUDIT_RECORD_CONTEXT)
> +				audit_log_uring(context);
> +		}
>  	}
>  
>  	audit_set_context(tsk, NULL);
> @@ -1861,12 +1896,6 @@ void __audit_uring_exit(int success, long code)
>  {
>  	struct audit_context *ctx = audit_context();
>  
> -	/*
> -	 * TODO: At some point we will likely want to filter on io_uring ops
> -	 *       and other things similar to what we do for syscalls, but that
> -	 *       is something for another day; just record what we can here.
> -	 */
> -
>  	if (!ctx || ctx->dummy)
>  		goto out;
>  	if (ctx->context == AUDIT_CTX_SYSCALL) {
> @@ -1891,6 +1920,8 @@ void __audit_uring_exit(int success, long code)
>  		 * the behavior here.
>  		 */
>  		audit_filter_syscall(current, ctx);
> +		if (ctx->current_state != AUDIT_RECORD_CONTEXT)
> +			audit_filter_uring(current, ctx);
>  		audit_filter_inodes(current, ctx);
>  		if (ctx->current_state != AUDIT_RECORD_CONTEXT)
>  			return;
> @@ -1899,7 +1930,9 @@ void __audit_uring_exit(int success, long code)
>  		return;
>  	}
>  #if 1
> -	/* XXX - temporary hack to force record generation */
> +	/* XXX - temporary hack to force record generation, we are leaving this
> +	 *       enabled, but if you want to actually test the filtering you
> +	 *       need to disable this #if/#endif block */
>  	ctx->current_state = AUDIT_RECORD_CONTEXT;
>  #endif
>  
> @@ -1907,6 +1940,8 @@ void __audit_uring_exit(int success, long code)
>  	if (!list_empty(&ctx->killed_trees))
>  		audit_kill_trees(ctx);
>  
> +	/* run through both filters to ensure we set the filterkey properly */
> +	audit_filter_uring(current, ctx);
>  	audit_filter_inodes(current, ctx);
>  	if (ctx->current_state != AUDIT_RECORD_CONTEXT)
>  		goto out;
> 
> --
> Linux-audit mailing list
> [email protected]
> https://listman.redhat.com/mailman/listinfo/linux-audit

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 4/9] audit: add filtering for io_uring records
  2021-05-28 22:35   ` Richard Guy Briggs
@ 2021-05-30 15:26     ` Paul Moore
  2021-05-31 13:44       ` Richard Guy Briggs
                         ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Paul Moore @ 2021-05-30 15:26 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On Fri, May 28, 2021 at 6:36 PM Richard Guy Briggs <[email protected]> wrote:
> On 2021-05-21 17:50, Paul Moore wrote:
> > WARNING - This is a work in progress and should not be merged
> > anywhere important.  It is almost surely not complete, and while it
> > probably compiles it likely hasn't been booted and will do terrible
> > things.  You have been warned.
> >
> > This patch adds basic audit io_uring filtering, using as much of the
> > existing audit filtering infrastructure as possible.  In order to do
> > this we reuse the audit filter rule's syscall mask for the io_uring
> > operation and we create a new filter for io_uring operations as
> > AUDIT_FILTER_URING_EXIT/audit_filter_list[7].
> >
> > <TODO - provide some additional guidance for the userspace tools>
> >
> > Signed-off-by: Paul Moore <[email protected]>
> > ---
> >  include/uapi/linux/audit.h |    3 +-
> >  kernel/auditfilter.c       |    4 ++-
> >  kernel/auditsc.c           |   65 ++++++++++++++++++++++++++++++++++----------
> >  3 files changed, 55 insertions(+), 17 deletions(-)

...

> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > index d8aa2c690bf9..4f6ab34020fb 100644
> > --- a/kernel/auditsc.c
> > +++ b/kernel/auditsc.c
> > @@ -799,6 +799,35 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
> >       return rule->mask[word] & bit;
> >  }
> >
> > +/**
> > + * audit_filter_uring - apply filters to an io_uring operation
> > + * @tsk: associated task
> > + * @ctx: audit context
> > + */
> > +static void audit_filter_uring(struct task_struct *tsk,
> > +                            struct audit_context *ctx)
> > +{
> > +     struct audit_entry *e;
> > +     enum audit_state state;
> > +
> > +     if (auditd_test_task(tsk))
> > +             return;
>
> Is this necessary?  auditd and auditctl don't (intentionally) use any
> io_uring functionality.  Is it possible it might inadvertantly use some
> by virtue of libc or other library calls now or in the future?

I think the better question is what harm does it do?  Yes, I'm not
aware of an auditd implementation that currently makes use of
io_uring, but it is also not inconceivable some future implementation
might want to make use of it and given the disjoint nature of kernel
and userspace development I don't want the kernel to block such
developments.  However, if you can think of a reason why having this
check here is bad I'm listening (note: we are already in the slow path
here so having the additional check isn't an issue as far as I'm
concerned).

As a reminder, auditd_test_task() only returns true/1 if the task is
registered with the audit subsystem as an auditd connection, an
auditctl process should not cause this function to return true.

> > +     rcu_read_lock();
> > +     list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_URING_EXIT],
> > +                             list) {
> > +             if (audit_in_mask(&e->rule, ctx->uring_op) &&
>
> While this seems like the most obvious approach given the parallels
> between syscalls and io_uring operations, as coded here it won't work
> due to the different mappings of syscall numbers and io_uring
> operations unless we re-use the auditctl -S field with raw io_uring
> operation numbers in the place of syscall numbers.  This should have
> been obvious to me when I first looked at this patch.  It became obvious
> when I started looking at the userspace auditctl.c.

FWIW, my intention was to treat io_uring opcodes exactly like we treat
syscall numbers.  Yes, this would potentially be an issue if we wanted
to combine syscalls and io_uring opcodes into one filter, but why
would we ever want to do that?  Combining the two into one filter not
only makes the filter lists longer than needed (we will always know if
we are filtering on a syscall or io_uring op) and complicates the
filter rule processing.

Or is there a problem with this that I'm missing?

> The easy first step would be to use something like this:
>         auditctl -a uring,always -S 18,28 -F key=uring_open
> to monitor file open commands only.  The same is not yet possible for
> the perm field, but there are so few io_uring ops at this point compared
> with syscalls that it might be manageable.  The arch is irrelevant since
> io_uring operation numbers are identical across all hardware as far as I
> can tell.  Most of the rest of the fields should make sense if they do
> for a syscall rule.

I've never been a fan of audit's "perm" filtering; I've always felt
there were better ways to handle that so I'm not overly upset that we
are skipping that functionality with this initial support.  If it
becomes a problem in the future we can always add that support at a
later date.

I currently fear that just getting io_uring and audit to coexist is
going to be a large enough problem in the immediate future.

> Here's a sample of userspace code to support this
> patch:
>         https://github.com/rgbriggs/audit-userspace/commit/a77baa1651b7ad841a220eb962d4cc92bc07dc96
>         https://github.com/linux-audit/audit-userspace/compare/master...rgbriggs:ghau-iouring-filtering.v1.0

Great, thank you.  I haven't grabbed a copy yet for testing, but I will.

> If we abuse the syscall infrastructure at first, we'd need a transition
> plan to coordinate user and kernel switchover to seperate mechanisms for
> the two to work together if the need should arise to have both syscall
> and uring filters in the same rule.

See my comments above, I don't currently see why we would ever want
syscall and io_uring filtering to happen in the same rule.  Please
speak up if you can think of a reason why this would either be needed,
or desirable for some reason.

> It might be wise to deliberately not support auditctl "-w" (and the
> exported audit_add_watch() function) since that is currently hardcoded
> to the AUDIT_FILTER_EXIT list and is the old watch form [replaced by
> audit_rule_fieldpair_data()] anyways that is more likely to be
> deprecated.  It also appears to make sense not to support autrace (at
> least initially).

I'm going to be honest with you and simply say that I've run out of my
email/review time in front of the computer on this holiday weekend
(blame the lockdown/bpf/lsm discussion <g>) and I need to go for
today, but this is something I'll take a look it this coming week.
Hopefully the comments above give us something to think/talk about in
the meantime.

Regardless, thanks for your help on the userspace side of the
filtering, that should make testing a lot easier moving forward.

--
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 4/9] audit: add filtering for io_uring records
  2021-05-30 15:26     ` Paul Moore
@ 2021-05-31 13:44       ` Richard Guy Briggs
  2021-06-02  1:40         ` Paul Moore
  2021-05-31 13:44       ` [PATCH 1/2] audit: add filtering for io_uring records, addendum Richard Guy Briggs
  2021-05-31 13:44       ` [PATCH 2/2] audit: block PERM fields being used with io_uring filtering Richard Guy Briggs
  2 siblings, 1 reply; 71+ messages in thread
From: Richard Guy Briggs @ 2021-05-31 13:44 UTC (permalink / raw)
  To: Paul Moore
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On 2021-05-30 11:26, Paul Moore wrote:
> On Fri, May 28, 2021 at 6:36 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2021-05-21 17:50, Paul Moore wrote:
> > > WARNING - This is a work in progress and should not be merged
> > > anywhere important.  It is almost surely not complete, and while it
> > > probably compiles it likely hasn't been booted and will do terrible
> > > things.  You have been warned.
> > >
> > > This patch adds basic audit io_uring filtering, using as much of the
> > > existing audit filtering infrastructure as possible.  In order to do
> > > this we reuse the audit filter rule's syscall mask for the io_uring
> > > operation and we create a new filter for io_uring operations as
> > > AUDIT_FILTER_URING_EXIT/audit_filter_list[7].
> > >
> > > <TODO - provide some additional guidance for the userspace tools>
> > >
> > > Signed-off-by: Paul Moore <[email protected]>
> > > ---
> > >  include/uapi/linux/audit.h |    3 +-
> > >  kernel/auditfilter.c       |    4 ++-
> > >  kernel/auditsc.c           |   65 ++++++++++++++++++++++++++++++++++----------
> > >  3 files changed, 55 insertions(+), 17 deletions(-)
> 
> ...
> 
> > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > > index d8aa2c690bf9..4f6ab34020fb 100644
> > > --- a/kernel/auditsc.c
> > > +++ b/kernel/auditsc.c
> > > @@ -799,6 +799,35 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
> > >       return rule->mask[word] & bit;
> > >  }
> > >
> > > +/**
> > > + * audit_filter_uring - apply filters to an io_uring operation
> > > + * @tsk: associated task
> > > + * @ctx: audit context
> > > + */
> > > +static void audit_filter_uring(struct task_struct *tsk,
> > > +                            struct audit_context *ctx)
> > > +{
> > > +     struct audit_entry *e;
> > > +     enum audit_state state;
> > > +
> > > +     if (auditd_test_task(tsk))
> > > +             return;
> >
> > Is this necessary?  auditd and auditctl don't (intentionally) use any
> > io_uring functionality.  Is it possible it might inadvertantly use some
> > by virtue of libc or other library calls now or in the future?
> 
> I think the better question is what harm does it do?  Yes, I'm not
> aware of an auditd implementation that currently makes use of
> io_uring, but it is also not inconceivable some future implementation
> might want to make use of it and given the disjoint nature of kernel
> and userspace development I don't want the kernel to block such
> developments.  However, if you can think of a reason why having this
> check here is bad I'm listening (note: we are already in the slow path
> here so having the additional check isn't an issue as far as I'm
> concerned).
> 
> As a reminder, auditd_test_task() only returns true/1 if the task is
> registered with the audit subsystem as an auditd connection, an
> auditctl process should not cause this function to return true.

My main concern was overhead, since the whole goal of io_uring is speed.

The chances that audit does use this functionality in the future suggest
to me that it is best to leave this check in.

> > > +     rcu_read_lock();
> > > +     list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_URING_EXIT],
> > > +                             list) {
> > > +             if (audit_in_mask(&e->rule, ctx->uring_op) &&
> >
> > While this seems like the most obvious approach given the parallels
> > between syscalls and io_uring operations, as coded here it won't work
> > due to the different mappings of syscall numbers and io_uring
> > operations unless we re-use the auditctl -S field with raw io_uring
> > operation numbers in the place of syscall numbers.  This should have
> > been obvious to me when I first looked at this patch.  It became obvious
> > when I started looking at the userspace auditctl.c.
> 
> FWIW, my intention was to treat io_uring opcodes exactly like we treat
> syscall numbers.  Yes, this would potentially be an issue if we wanted
> to combine syscalls and io_uring opcodes into one filter, but why
> would we ever want to do that?  Combining the two into one filter not
> only makes the filter lists longer than needed (we will always know if
> we are filtering on a syscall or io_uring op) and complicates the
> filter rule processing.
> 
> Or is there a problem with this that I'm missing?

No, I think you have a good understanding of it.  I'm asking hard
questions to avoid missing something important.  If we can reuse the
syscall infrastructure for this then that is extremely helpful (if not
lazy, which isn't necessarily a bad thing).  It does mean that the
io_uring op dictionary will need to live in userspace audit the way it
is currently implemented, or provide a flag to indicate it is a syscall
number to be translated in the kernel either at the time of rule
addition or translated on the fly on rule check in the kernel adding
overhead to a critical path.

> > The easy first step would be to use something like this:
> >         auditctl -a uring,always -S 18,28 -F key=uring_open
> > to monitor file open commands only.  The same is not yet possible for
> > the perm field, but there are so few io_uring ops at this point compared
> > with syscalls that it might be manageable.  The arch is irrelevant since
> > io_uring operation numbers are identical across all hardware as far as I
> > can tell.  Most of the rest of the fields should make sense if they do
> > for a syscall rule.
> 
> I've never been a fan of audit's "perm" filtering; I've always felt
> there were better ways to handle that so I'm not overly upset that we
> are skipping that functionality with this initial support.  If it
> becomes a problem in the future we can always add that support at a
> later date.

Ok, I don't see a pressing need to add it initially, but should add a
check to block that field from being used to avoid the confusion of
unpredictable behaviour should someone try to add a perm filter to a
io_uring filter.  That should be done protectively in the kernel and
proactively in userspace.

> I currently fear that just getting io_uring and audit to coexist is
> going to be a large enough problem in the immediate future.

Agreed.

> > Here's a sample of userspace code to support this
> > patch:
> >         https://github.com/rgbriggs/audit-userspace/commit/a77baa1651b7ad841a220eb962d4cc92bc07dc96
> >         https://github.com/linux-audit/audit-userspace/compare/master...rgbriggs:ghau-iouring-filtering.v1.0
> 
> Great, thank you.  I haven't grabbed a copy yet for testing, but I will.

I've added a perm filter block as an additional patch in userspace and
updated the tree so that first commit is no longer the top of tree but
the branch name is current.

I'll add a kernel perm filter check.

I just noticed some list checking that is missing in tree and watch in
your patch.

Suggested fixup patches to follow...

> > If we abuse the syscall infrastructure at first, we'd need a transition
> > plan to coordinate user and kernel switchover to seperate mechanisms for
> > the two to work together if the need should arise to have both syscall
> > and uring filters in the same rule.
> 
> See my comments above, I don't currently see why we would ever want
> syscall and io_uring filtering to happen in the same rule.  Please
> speak up if you can think of a reason why this would either be needed,
> or desirable for some reason.

I think they can be seperate rules for now.  Either a syscall rule
catching all io_uring ops can be added, or an io_uring rule can be added
to catch specific ops.  The scenario I was thinking of was catching
syscalls of specific io_uring ops.

> > It might be wise to deliberately not support auditctl "-w" (and the
> > exported audit_add_watch() function) since that is currently hardcoded
> > to the AUDIT_FILTER_EXIT list and is the old watch form [replaced by
> > audit_rule_fieldpair_data()] anyways that is more likely to be
> > deprecated.  It also appears to make sense not to support autrace (at
> > least initially).
> 
> I'm going to be honest with you and simply say that I've run out of my
> email/review time in front of the computer on this holiday weekend
> (blame the lockdown/bpf/lsm discussion <g>) and I need to go for
> today, but this is something I'll take a look it this coming week.
> Hopefully the comments above give us something to think/talk about in
> the meantime.

I wasn't expecting you to work the weekend.  :-)

> Regardless, thanks for your help on the userspace side of the
> filtering, that should make testing a lot easier moving forward.

Standard RFC disclaimers apply.

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH 1/2] audit: add filtering for io_uring records, addendum
  2021-05-30 15:26     ` Paul Moore
  2021-05-31 13:44       ` Richard Guy Briggs
@ 2021-05-31 13:44       ` Richard Guy Briggs
  2021-05-31 16:08         ` kernel test robot
                           ` (2 more replies)
  2021-05-31 13:44       ` [PATCH 2/2] audit: block PERM fields being used with io_uring filtering Richard Guy Briggs
  2 siblings, 3 replies; 71+ messages in thread
From: Richard Guy Briggs @ 2021-05-31 13:44 UTC (permalink / raw)
  To: Paul Moore, linux-audit, io-uring
  Cc: linux-fsdevel, linux-security-module, selinux,
	Kumar Kartikeya Dwivedi, Jens Axboe, Alexander Viro,
	Richard Guy Briggs

The commit ("audit: add filtering for io_uring records") added support for
filtering io_uring operations.

Add checks to the audit io_uring filtering code for directory and path watches,
and to keep the list counts consistent.

Signed-off-by: Richard Guy Briggs <[email protected]>
---
 kernel/audit_tree.c  | 3 ++-
 kernel/audit_watch.c | 3 ++-
 kernel/auditfilter.c | 7 +++++--
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c
index 6c91902f4f45..2be285c2f069 100644
--- a/kernel/audit_tree.c
+++ b/kernel/audit_tree.c
@@ -727,7 +727,8 @@ int audit_make_tree(struct audit_krule *rule, char *pathname, u32 op)
 {
 
 	if (pathname[0] != '/' ||
-	    rule->listnr != AUDIT_FILTER_EXIT ||
+	    (rule->listnr != AUDIT_FILTER_EXIT &&
+	     rule->listnr != AUDIT_FILTER_URING_EXIT) ||
 	    op != Audit_equal ||
 	    rule->inode_f || rule->watch || rule->tree)
 		return -EINVAL;
diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
index 2acf7ca49154..698b62b4a2ec 100644
--- a/kernel/audit_watch.c
+++ b/kernel/audit_watch.c
@@ -183,7 +183,8 @@ int audit_to_watch(struct audit_krule *krule, char *path, int len, u32 op)
 		return -EOPNOTSUPP;
 
 	if (path[0] != '/' || path[len-1] == '/' ||
-	    krule->listnr != AUDIT_FILTER_EXIT ||
+	    (krule->listnr != AUDIT_FILTER_EXIT &&
+	     krule->listnr != AUDIT_FILTER_URING_EXIT) ||
 	    op != Audit_equal ||
 	    krule->inode_f || krule->watch || krule->tree)
 		return -EINVAL;
diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index c21119c00504..bcdedfd1088c 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -153,7 +153,8 @@ char *audit_unpack_string(void **bufp, size_t *remain, size_t len)
 static inline int audit_to_inode(struct audit_krule *krule,
 				 struct audit_field *f)
 {
-	if (krule->listnr != AUDIT_FILTER_EXIT ||
+	if ((krule->listnr != AUDIT_FILTER_EXIT &&
+	     krule->listnr != AUDIT_FILTER_URING_EXIT) ||
 	    krule->inode_f || krule->watch || krule->tree ||
 	    (f->op != Audit_equal && f->op != Audit_not_equal))
 		return -EINVAL;
@@ -250,6 +251,7 @@ static inline struct audit_entry *audit_to_entry_common(struct audit_rule_data *
 		pr_err("AUDIT_FILTER_ENTRY is deprecated\n");
 		goto exit_err;
 	case AUDIT_FILTER_EXIT:
+	case AUDIT_FILTER_URING_EXIT:
 	case AUDIT_FILTER_TASK:
 #endif
 	case AUDIT_FILTER_USER:
@@ -982,7 +984,8 @@ static inline int audit_add_rule(struct audit_entry *entry)
 	}
 
 	entry->rule.prio = ~0ULL;
-	if (entry->rule.listnr == AUDIT_FILTER_EXIT) {
+	if (entry->rule.listnr == AUDIT_FILTER_EXIT ||
+	    entry->rule.listnr == AUDIT_FILTER_URING_EXIT) {
 		if (entry->rule.flags & AUDIT_FILTER_PREPEND)
 			entry->rule.prio = ++prio_high;
 		else
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [PATCH 2/2] audit: block PERM fields being used with io_uring filtering
  2021-05-30 15:26     ` Paul Moore
  2021-05-31 13:44       ` Richard Guy Briggs
  2021-05-31 13:44       ` [PATCH 1/2] audit: add filtering for io_uring records, addendum Richard Guy Briggs
@ 2021-05-31 13:44       ` Richard Guy Briggs
  2 siblings, 0 replies; 71+ messages in thread
From: Richard Guy Briggs @ 2021-05-31 13:44 UTC (permalink / raw)
  To: Paul Moore, linux-audit, io-uring
  Cc: linux-fsdevel, linux-security-module, selinux,
	Kumar Kartikeya Dwivedi, Jens Axboe, Alexander Viro,
	Richard Guy Briggs

The commit ("audit: add filtering for io_uring records") added support for
filtering io_uring operations.  The PERM field is invalid for io_uring
filtering, so block it.

Signed-off-by: Richard Guy Briggs <[email protected]>
---
 kernel/auditfilter.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index bcdedfd1088c..d75acb014ccd 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -336,6 +336,10 @@ static int audit_field_valid(struct audit_entry *entry, struct audit_field *f)
 		if (entry->rule.listnr != AUDIT_FILTER_FS)
 			return -EINVAL;
 		break;
+	case AUDIT_PERM:
+		if (entry->rule.listnr == AUDIT_FILTER_URING_EXIT)
+			return -EINVAL;
+		break;
 	}
 
 	switch (entry->rule.listnr) {
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/2] audit: add filtering for io_uring records, addendum
  2021-05-31 13:44       ` [PATCH 1/2] audit: add filtering for io_uring records, addendum Richard Guy Briggs
@ 2021-05-31 16:08         ` kernel test robot
  2021-05-31 17:38         ` kernel test robot
  2021-06-07 23:15         ` Paul Moore
  2 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2021-05-31 16:08 UTC (permalink / raw)
  To: Richard Guy Briggs, Paul Moore, linux-audit, io-uring
  Cc: kbuild-all, linux-fsdevel, linux-security-module, selinux,
	Kumar Kartikeya Dwivedi, Jens Axboe, Alexander Viro,
	Richard Guy Briggs

[-- Attachment #1: Type: text/plain, Size: 4288 bytes --]

Hi Richard,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on pcmoore-audit/next]
[also build test ERROR on v5.13-rc4 next-20210528]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Richard-Guy-Briggs/audit-add-filtering-for-io_uring-records-addendum/20210531-214941
base:   https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit.git next
config: parisc-randconfig-r015-20210531 (attached as .config)
compiler: hppa-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/07a3e22a2f984838bc98b43b58e8ef08e9353483
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Richard-Guy-Briggs/audit-add-filtering-for-io_uring-records-addendum/20210531-214941
        git checkout 07a3e22a2f984838bc98b43b58e8ef08e9353483
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=parisc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

   kernel/auditfilter.c: In function 'audit_to_inode':
>> kernel/auditfilter.c:155:24: error: 'AUDIT_FILTER_URING_EXIT' undeclared (first use in this function); did you mean 'AUDIT_FILTER_EXIT'?
     155 |       krule->listnr != AUDIT_FILTER_URING_EXIT) ||
         |                        ^~~~~~~~~~~~~~~~~~~~~~~
         |                        AUDIT_FILTER_EXIT
   kernel/auditfilter.c:155:24: note: each undeclared identifier is reported only once for each function it appears in
   kernel/auditfilter.c: In function 'audit_to_entry_common':
   kernel/auditfilter.c:252:7: error: 'AUDIT_FILTER_URING_EXIT' undeclared (first use in this function); did you mean 'AUDIT_FILTER_EXIT'?
     252 |  case AUDIT_FILTER_URING_EXIT:
         |       ^~~~~~~~~~~~~~~~~~~~~~~
         |       AUDIT_FILTER_EXIT
   kernel/auditfilter.c: In function 'audit_add_rule':
   kernel/auditfilter.c:986:28: error: 'AUDIT_FILTER_URING_EXIT' undeclared (first use in this function); did you mean 'AUDIT_FILTER_EXIT'?
     986 |      entry->rule.listnr == AUDIT_FILTER_URING_EXIT) {
         |                            ^~~~~~~~~~~~~~~~~~~~~~~
         |                            AUDIT_FILTER_EXIT
--
   kernel/audit_watch.c: In function 'audit_to_watch':
>> kernel/audit_watch.c:187:24: error: 'AUDIT_FILTER_URING_EXIT' undeclared (first use in this function); did you mean 'AUDIT_FILTER_EXIT'?
     187 |       krule->listnr != AUDIT_FILTER_URING_EXIT) ||
         |                        ^~~~~~~~~~~~~~~~~~~~~~~
         |                        AUDIT_FILTER_EXIT
   kernel/audit_watch.c:187:24: note: each undeclared identifier is reported only once for each function it appears in
--
   kernel/audit_tree.c: In function 'audit_make_tree':
>> kernel/audit_tree.c:731:23: error: 'AUDIT_FILTER_URING_EXIT' undeclared (first use in this function); did you mean 'AUDIT_FILTER_EXIT'?
     731 |       rule->listnr != AUDIT_FILTER_URING_EXIT) ||
         |                       ^~~~~~~~~~~~~~~~~~~~~~~
         |                       AUDIT_FILTER_EXIT
   kernel/audit_tree.c:731:23: note: each undeclared identifier is reported only once for each function it appears in


vim +155 kernel/auditfilter.c

   149	
   150	/* Translate an inode field to kernel representation. */
   151	static inline int audit_to_inode(struct audit_krule *krule,
   152					 struct audit_field *f)
   153	{
   154		if ((krule->listnr != AUDIT_FILTER_EXIT &&
 > 155		     krule->listnr != AUDIT_FILTER_URING_EXIT) ||
   156		    krule->inode_f || krule->watch || krule->tree ||
   157		    (f->op != Audit_equal && f->op != Audit_not_equal))
   158			return -EINVAL;
   159	
   160		krule->inode_f = f;
   161		return 0;
   162	}
   163	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31564 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/2] audit: add filtering for io_uring records, addendum
  2021-05-31 13:44       ` [PATCH 1/2] audit: add filtering for io_uring records, addendum Richard Guy Briggs
  2021-05-31 16:08         ` kernel test robot
@ 2021-05-31 17:38         ` kernel test robot
  2021-06-07 23:15         ` Paul Moore
  2 siblings, 0 replies; 71+ messages in thread
From: kernel test robot @ 2021-05-31 17:38 UTC (permalink / raw)
  To: Richard Guy Briggs, Paul Moore, linux-audit, io-uring
  Cc: kbuild-all, clang-built-linux, linux-fsdevel,
	linux-security-module, selinux, Kumar Kartikeya Dwivedi,
	Jens Axboe, Alexander Viro, Richard Guy Briggs

[-- Attachment #1: Type: text/plain, Size: 3329 bytes --]

Hi Richard,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on pcmoore-audit/next]
[also build test ERROR on v5.13-rc4 next-20210528]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Richard-Guy-Briggs/audit-add-filtering-for-io_uring-records-addendum/20210531-214941
base:   https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit.git next
config: arm64-randconfig-r021-20210531 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project bc6799f2f79f0ae87e9f1ebf9d25ba799fbd25a9)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm64 cross compiling tool for clang build
        # apt-get install binutils-aarch64-linux-gnu
        # https://github.com/0day-ci/linux/commit/07a3e22a2f984838bc98b43b58e8ef08e9353483
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Richard-Guy-Briggs/audit-add-filtering-for-io_uring-records-addendum/20210531-214941
        git checkout 07a3e22a2f984838bc98b43b58e8ef08e9353483
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

>> kernel/auditfilter.c:155:24: error: use of undeclared identifier 'AUDIT_FILTER_URING_EXIT'
                krule->listnr != AUDIT_FILTER_URING_EXIT) ||
                                 ^
   kernel/auditfilter.c:252:7: error: use of undeclared identifier 'AUDIT_FILTER_URING_EXIT'
           case AUDIT_FILTER_URING_EXIT:
                ^
   kernel/auditfilter.c:986:28: error: use of undeclared identifier 'AUDIT_FILTER_URING_EXIT'
               entry->rule.listnr == AUDIT_FILTER_URING_EXIT) {
                                     ^
   3 errors generated.
--
>> kernel/audit_watch.c:187:24: error: use of undeclared identifier 'AUDIT_FILTER_URING_EXIT'
                krule->listnr != AUDIT_FILTER_URING_EXIT) ||
                                 ^
   1 error generated.
--
>> kernel/audit_tree.c:731:23: error: use of undeclared identifier 'AUDIT_FILTER_URING_EXIT'
                rule->listnr != AUDIT_FILTER_URING_EXIT) ||
                                ^
   1 error generated.


vim +/AUDIT_FILTER_URING_EXIT +155 kernel/auditfilter.c

   149	
   150	/* Translate an inode field to kernel representation. */
   151	static inline int audit_to_inode(struct audit_krule *krule,
   152					 struct audit_field *f)
   153	{
   154		if ((krule->listnr != AUDIT_FILTER_EXIT &&
 > 155		     krule->listnr != AUDIT_FILTER_URING_EXIT) ||
   156		    krule->inode_f || krule->watch || krule->tree ||
   157		    (f->op != Audit_equal && f->op != Audit_not_equal))
   158			return -EINVAL;
   159	
   160		krule->inode_f = f;
   161		return 0;
   162	}
   163	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 41736 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 4/9] audit: add filtering for io_uring records
  2021-05-31 13:44       ` Richard Guy Briggs
@ 2021-06-02  1:40         ` Paul Moore
  2021-06-02 15:37           ` Richard Guy Briggs
  0 siblings, 1 reply; 71+ messages in thread
From: Paul Moore @ 2021-06-02  1:40 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On Mon, May 31, 2021 at 9:44 AM Richard Guy Briggs <[email protected]> wrote:
> On 2021-05-30 11:26, Paul Moore wrote:
> > On Fri, May 28, 2021 at 6:36 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2021-05-21 17:50, Paul Moore wrote:

...

> > > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > > > index d8aa2c690bf9..4f6ab34020fb 100644
> > > > --- a/kernel/auditsc.c
> > > > +++ b/kernel/auditsc.c
> > > > @@ -799,6 +799,35 @@ static int audit_in_mask(const struct audit_krule *rule, unsigned long val)
> > > >       return rule->mask[word] & bit;
> > > >  }
> > > >
> > > > +/**
> > > > + * audit_filter_uring - apply filters to an io_uring operation
> > > > + * @tsk: associated task
> > > > + * @ctx: audit context
> > > > + */
> > > > +static void audit_filter_uring(struct task_struct *tsk,
> > > > +                            struct audit_context *ctx)
> > > > +{
> > > > +     struct audit_entry *e;
> > > > +     enum audit_state state;
> > > > +
> > > > +     if (auditd_test_task(tsk))
> > > > +             return;
> > >
> > > Is this necessary?  auditd and auditctl don't (intentionally) use any
> > > io_uring functionality.  Is it possible it might inadvertantly use some
> > > by virtue of libc or other library calls now or in the future?
> >
> > I think the better question is what harm does it do?  Yes, I'm not
> > aware of an auditd implementation that currently makes use of
> > io_uring, but it is also not inconceivable some future implementation
> > might want to make use of it and given the disjoint nature of kernel
> > and userspace development I don't want the kernel to block such
> > developments.  However, if you can think of a reason why having this
> > check here is bad I'm listening (note: we are already in the slow path
> > here so having the additional check isn't an issue as far as I'm
> > concerned).
> >
> > As a reminder, auditd_test_task() only returns true/1 if the task is
> > registered with the audit subsystem as an auditd connection, an
> > auditctl process should not cause this function to return true.
>
> My main concern was overhead, since the whole goal of io_uring is speed.

At the point where this test takes place we are already in the audit
slow path as far as io_uring is concerned.  I understand your concern,
but the advantage of being able to effectively use io_uring in the
future makes this worth keeping in my opinion.

> The chances that audit does use this functionality in the future suggest
> to me that it is best to leave this check in.

Sounds like we are in agreement.  We'll keep it for now.

> > > > +     rcu_read_lock();
> > > > +     list_for_each_entry_rcu(e, &audit_filter_list[AUDIT_FILTER_URING_EXIT],
> > > > +                             list) {
> > > > +             if (audit_in_mask(&e->rule, ctx->uring_op) &&
> > >
> > > While this seems like the most obvious approach given the parallels
> > > between syscalls and io_uring operations, as coded here it won't work
> > > due to the different mappings of syscall numbers and io_uring
> > > operations unless we re-use the auditctl -S field with raw io_uring
> > > operation numbers in the place of syscall numbers.  This should have
> > > been obvious to me when I first looked at this patch.  It became obvious
> > > when I started looking at the userspace auditctl.c.
> >
> > FWIW, my intention was to treat io_uring opcodes exactly like we treat
> > syscall numbers.  Yes, this would potentially be an issue if we wanted
> > to combine syscalls and io_uring opcodes into one filter, but why
> > would we ever want to do that?  Combining the two into one filter not
> > only makes the filter lists longer than needed (we will always know if
> > we are filtering on a syscall or io_uring op) and complicates the
> > filter rule processing.
> >
> > Or is there a problem with this that I'm missing?
>
> No, I think you have a good understanding of it.  I'm asking hard
> questions to avoid missing something important.  If we can reuse the
> syscall infrastructure for this then that is extremely helpful (if not
> lazy, which isn't necessarily a bad thing).  It does mean that the
> io_uring op dictionary will need to live in userspace audit the way it
> is currently implemented ....

Which I currently believe is the right thing to do.

> > > The easy first step would be to use something like this:
> > >         auditctl -a uring,always -S 18,28 -F key=uring_open
> > > to monitor file open commands only.  The same is not yet possible for
> > > the perm field, but there are so few io_uring ops at this point compared
> > > with syscalls that it might be manageable.  The arch is irrelevant since
> > > io_uring operation numbers are identical across all hardware as far as I
> > > can tell.  Most of the rest of the fields should make sense if they do
> > > for a syscall rule.
> >
> > I've never been a fan of audit's "perm" filtering; I've always felt
> > there were better ways to handle that so I'm not overly upset that we
> > are skipping that functionality with this initial support.  If it
> > becomes a problem in the future we can always add that support at a
> > later date.
>
> Ok, I don't see a pressing need to add it initially, but should add a
> check to block that field from being used to avoid the confusion of
> unpredictable behaviour should someone try to add a perm filter to a
> io_uring filter.  That should be done protectively in the kernel and
> proactively in userspace.

Sure, that's reasonable.

> > > Here's a sample of userspace code to support this
> > > patch:
> > >         https://github.com/rgbriggs/audit-userspace/commit/a77baa1651b7ad841a220eb962d4cc92bc07dc96
> > >         https://github.com/linux-audit/audit-userspace/compare/master...rgbriggs:ghau-iouring-filtering.v1.0
> >
> > Great, thank you.  I haven't grabbed a copy yet for testing, but I will.
>
> I've added a perm filter block as an additional patch in userspace and
> updated the tree so that first commit is no longer the top of tree but
> the branch name is current.
>
> I'll add a kernel perm filter check.
>
> I just noticed some list checking that is missing in tree and watch in
> your patch.
>
> Suggested fixup patches to follow...

I see them, thank you, comments will follow over there.  Although to
be honest I'm mostly focusing on the testing right now while we wait
to hear back from Jens on what he is willing to accept regarding audit
calls in io_issue_sqe().  If we can't do the _entry()/_exit() calls
then this work is pretty much dead and we just have to deal with it in
Kconfig.  I might make one last, clean patchset and put it in a branch
for the distros that want to carry the patchset, but it isn't clear to
me that it is something I would want to maintain long term.  Long
running out of tree patches are generally A Bad Idea.

> > > If we abuse the syscall infrastructure at first, we'd need a transition
> > > plan to coordinate user and kernel switchover to seperate mechanisms for
> > > the two to work together if the need should arise to have both syscall
> > > and uring filters in the same rule.
> >
> > See my comments above, I don't currently see why we would ever want
> > syscall and io_uring filtering to happen in the same rule.  Please
> > speak up if you can think of a reason why this would either be needed,
> > or desirable for some reason.
>
> I think they can be seperate rules for now.  Either a syscall rule
> catching all io_uring ops can be added, or an io_uring rule can be added
> to catch specific ops.  The scenario I was thinking of was catching
> syscalls of specific io_uring ops.

Perhaps I'm misunderstand you, but that scenario really shouldn't
exist.  The io_uring ops function independently of syscalls; you can
*submit* io_uring ops via io_uring_enter(), but they are not
guaranteed to be dispatched synchronously (obviously), and given the
cred shenanigans that can happen with io_uring there is no guarantee
the filters would even be applicable.

It isn't an issue of "can" the filters be separate, they *have* to be separate.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-28 16:02                             ` Paul Moore
@ 2021-06-02  8:26                               ` Pavel Begunkov
  2021-06-02 15:46                                 ` Richard Guy Briggs
  2021-06-02 19:46                                 ` Paul Moore
  2021-06-03 15:54                               ` Jens Axboe
  1 sibling, 2 replies; 71+ messages in thread
From: Pavel Begunkov @ 2021-06-02  8:26 UTC (permalink / raw)
  To: Paul Moore, Jens Axboe
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On 5/28/21 5:02 PM, Paul Moore wrote:
> On Wed, May 26, 2021 at 4:19 PM Paul Moore <[email protected]> wrote:
>> ... If we moved the _entry
>> and _exit calls into the individual operation case blocks (quick
>> openat example below) so that only certain operations were able to be
>> audited would that be acceptable assuming the high frequency ops were
>> untouched?  My initial gut feeling was that this would involve >50% of
>> the ops, but Steve Grubb seems to think it would be less; it may be
>> time to look at that a bit more seriously, but if it gets a NACK
>> regardless it isn't worth the time - thoughts?
>>
>>   case IORING_OP_OPENAT:
>>     audit_uring_entry(req->opcode);
>>     ret = io_openat(req, issue_flags);
>>     audit_uring_exit(!ret, ret);
>>     break;
> 
> I wanted to pose this question again in case it was lost in the
> thread, I suspect this may be the last option before we have to "fix"
> things at the Kconfig level.  I definitely don't want to have to go
> that route, and I suspect most everyone on this thread feels the same,
> so I'm hopeful we can find a solution that is begrudgingly acceptable
> to both groups.

May work for me, but have to ask how many, and what is the
criteria? I'd think anything opening a file or manipulating fs:

IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2],
IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN,
IORING_OP_FILES_UPDATE
+ coming mkdirat and others.

IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV?

What about?
IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE,
IORING_OP_FALLOCATE, IORING_OP_STATX,
IORING_OP_FADVISE, IORING_OP_MADVISE,
IORING_OP_EPOLL_CTL


Another question, io_uring may exercise asynchronous paths,
i.e. io_issue_sqe() returns before requests completes.
Shouldn't be the case for open/etc at the moment, but was that
considered?

I don't see it happening, but would prefer to keep it open
async reimplementation in a distant future. Does audit sleep?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 4/9] audit: add filtering for io_uring records
  2021-06-02  1:40         ` Paul Moore
@ 2021-06-02 15:37           ` Richard Guy Briggs
  2021-06-02 17:20             ` Paul Moore
  0 siblings, 1 reply; 71+ messages in thread
From: Richard Guy Briggs @ 2021-06-02 15:37 UTC (permalink / raw)
  To: Paul Moore
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On 2021-06-01 21:40, Paul Moore wrote:
> On Mon, May 31, 2021 at 9:44 AM Richard Guy Briggs <[email protected]> wrote:
> > On 2021-05-30 11:26, Paul Moore wrote:
> > > On Fri, May 28, 2021 at 6:36 PM Richard Guy Briggs <[email protected]> wrote:
> > > > On 2021-05-21 17:50, Paul Moore wrote:
> > > > If we abuse the syscall infrastructure at first, we'd need a transition
> > > > plan to coordinate user and kernel switchover to seperate mechanisms for
> > > > the two to work together if the need should arise to have both syscall
> > > > and uring filters in the same rule.
> > >
> > > See my comments above, I don't currently see why we would ever want
> > > syscall and io_uring filtering to happen in the same rule.  Please
> > > speak up if you can think of a reason why this would either be needed,
> > > or desirable for some reason.
> >
> > I think they can be seperate rules for now.  Either a syscall rule
> > catching all io_uring ops can be added, or an io_uring rule can be added
> > to catch specific ops.  The scenario I was thinking of was catching
> > syscalls of specific io_uring ops.
> 
> Perhaps I'm misunderstand you, but that scenario really shouldn't
> exist.  The io_uring ops function independently of syscalls; you can
> *submit* io_uring ops via io_uring_enter(), but they are not
> guaranteed to be dispatched synchronously (obviously), and given the
> cred shenanigans that can happen with io_uring there is no guarantee
> the filters would even be applicable.

That wasn't my understanding.  There are a number of io_uring calls
starting with at least open that are currently synchronous (but may
become async in future) that we may want to single out which would be a
specific io_uring syscall with a specific io_uring opcode.  I guess
that particular situation would be caught by the io_uring opcode
triggering an event that includes SYSCALL and URINGOP records.

> It isn't an issue of "can" the filters be separate, they *have* to be separate.

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-06-02  8:26                               ` Pavel Begunkov
@ 2021-06-02 15:46                                 ` Richard Guy Briggs
  2021-06-03 10:39                                   ` Pavel Begunkov
  2021-06-02 19:46                                 ` Paul Moore
  1 sibling, 1 reply; 71+ messages in thread
From: Richard Guy Briggs @ 2021-06-02 15:46 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Paul Moore, Jens Axboe, selinux, linux-security-module,
	linux-audit, Kumar Kartikeya Dwivedi, linux-fsdevel, io-uring,
	Alexander Viro

On 2021-06-02 09:26, Pavel Begunkov wrote:
> On 5/28/21 5:02 PM, Paul Moore wrote:
> > On Wed, May 26, 2021 at 4:19 PM Paul Moore <[email protected]> wrote:
> >> ... If we moved the _entry
> >> and _exit calls into the individual operation case blocks (quick
> >> openat example below) so that only certain operations were able to be
> >> audited would that be acceptable assuming the high frequency ops were
> >> untouched?  My initial gut feeling was that this would involve >50% of
> >> the ops, but Steve Grubb seems to think it would be less; it may be
> >> time to look at that a bit more seriously, but if it gets a NACK
> >> regardless it isn't worth the time - thoughts?
> >>
> >>   case IORING_OP_OPENAT:
> >>     audit_uring_entry(req->opcode);
> >>     ret = io_openat(req, issue_flags);
> >>     audit_uring_exit(!ret, ret);
> >>     break;
> > 
> > I wanted to pose this question again in case it was lost in the
> > thread, I suspect this may be the last option before we have to "fix"
> > things at the Kconfig level.  I definitely don't want to have to go
> > that route, and I suspect most everyone on this thread feels the same,
> > so I'm hopeful we can find a solution that is begrudgingly acceptable
> > to both groups.
> 
> May work for me, but have to ask how many, and what is the
> criteria? I'd think anything opening a file or manipulating fs:
> 
> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2],
> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN,
> IORING_OP_FILES_UPDATE
> + coming mkdirat and others.
> 
> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV?
> 
> What about?
> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE,
> IORING_OP_FALLOCATE, IORING_OP_STATX,
> IORING_OP_FADVISE, IORING_OP_MADVISE,
> IORING_OP_EPOLL_CTL
> 
> 
> Another question, io_uring may exercise asynchronous paths,
> i.e. io_issue_sqe() returns before requests completes.
> Shouldn't be the case for open/etc at the moment, but was that
> considered?

This would be why audit needs to monitor a thread until it wraps up, to
wait for the result code.  My understanding is that both sync and async
parts of an op would be monitored.

> I don't see it happening, but would prefer to keep it open
> async reimplementation in a distant future. Does audit sleep?

Some parts do, some parts don't depending on what they are interacting
with in the kernel.  It can be made to not sleep if needed.

> Pavel Begunkov

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 4/9] audit: add filtering for io_uring records
  2021-06-02 15:37           ` Richard Guy Briggs
@ 2021-06-02 17:20             ` Paul Moore
  0 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-06-02 17:20 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On Wed, Jun 2, 2021 at 11:38 AM Richard Guy Briggs <[email protected]> wrote:
> On 2021-06-01 21:40, Paul Moore wrote:
> > On Mon, May 31, 2021 at 9:44 AM Richard Guy Briggs <[email protected]> wrote:
> > > On 2021-05-30 11:26, Paul Moore wrote:
> > > > On Fri, May 28, 2021 at 6:36 PM Richard Guy Briggs <[email protected]> wrote:
> > > > > On 2021-05-21 17:50, Paul Moore wrote:
> > > > > If we abuse the syscall infrastructure at first, we'd need a transition
> > > > > plan to coordinate user and kernel switchover to seperate mechanisms for
> > > > > the two to work together if the need should arise to have both syscall
> > > > > and uring filters in the same rule.
> > > >
> > > > See my comments above, I don't currently see why we would ever want
> > > > syscall and io_uring filtering to happen in the same rule.  Please
> > > > speak up if you can think of a reason why this would either be needed,
> > > > or desirable for some reason.
> > >
> > > I think they can be seperate rules for now.  Either a syscall rule
> > > catching all io_uring ops can be added, or an io_uring rule can be added
> > > to catch specific ops.  The scenario I was thinking of was catching
> > > syscalls of specific io_uring ops.
> >
> > Perhaps I'm misunderstand you, but that scenario really shouldn't
> > exist.  The io_uring ops function independently of syscalls; you can
> > *submit* io_uring ops via io_uring_enter(), but they are not
> > guaranteed to be dispatched synchronously (obviously), and given the
> > cred shenanigans that can happen with io_uring there is no guarantee
> > the filters would even be applicable.
>
> That wasn't my understanding.  There are a number of io_uring calls
> starting with at least open that are currently synchronous (but may
> become async in future) that we may want to single out which would be a
> specific io_uring syscall with a specific io_uring opcode.  I guess
> that particular situation would be caught by the io_uring opcode
> triggering an event that includes SYSCALL and URINGOP records.

The only io_uring syscalls are io_uring_setup(2), io_uring_enter(2),
etc., the stuff that is dispatched in io_issue_sqe() are the io_uring
ops/opcodes/whatever.  They *look* like syscalls but they are not and
we have to treat them differently.

> > It isn't an issue of "can" the filters be separate, they *have* to be separate.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit, io_uring, io-wq: add some basic audit support to io_uring
  2021-05-21 21:49 ` [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring Paul Moore
  2021-05-22  0:22   ` Pavel Begunkov
@ 2021-06-02 17:29   ` Richard Guy Briggs
  2021-06-02 20:46     ` Paul Moore
  1 sibling, 1 reply; 71+ messages in thread
From: Richard Guy Briggs @ 2021-06-02 17:29 UTC (permalink / raw)
  To: Paul Moore
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On 2021-05-21 17:49, Paul Moore wrote:
> WARNING - This is a work in progress and should not be merged
> anywhere important.  It is almost surely not complete, and while it
> probably compiles it likely hasn't been booted and will do terrible
> things.  You have been warned.
> 
> This patch adds basic auditing to io_uring operations, regardless of
> their context.  This is accomplished by allocating audit_context
> structures for the io-wq worker and io_uring SQPOLL kernel threads
> as well as explicitly auditing the io_uring operations in
> io_issue_sqe().  The io_uring operations are audited using a new
> AUDIT_URINGOP record, an example is shown below:
> 
>   % <TODO - insert AUDIT_URINGOP record example>
> 
> Thanks to Richard Guy Briggs for review and feedback.
> 
> Signed-off-by: Paul Moore <[email protected]>
> ---
>  fs/io-wq.c                 |    4 +
>  fs/io_uring.c              |   11 +++
>  include/linux/audit.h      |   17 ++++
>  include/uapi/linux/audit.h |    1 
>  kernel/audit.h             |    2 +
>  kernel/auditsc.c           |  173 ++++++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 208 insertions(+)
> 
> diff --git a/fs/io-wq.c b/fs/io-wq.c
> index 5361a9b4b47b..8af09a3336e0 100644
> --- a/fs/io-wq.c
> +++ b/fs/io-wq.c
> @@ -16,6 +16,7 @@
>  #include <linux/rculist_nulls.h>
>  #include <linux/cpu.h>
>  #include <linux/tracehook.h>
> +#include <linux/audit.h>
>  
>  #include "io-wq.h"
>  
> @@ -535,6 +536,8 @@ static int io_wqe_worker(void *data)
>  	snprintf(buf, sizeof(buf), "iou-wrk-%d", wq->task->pid);
>  	set_task_comm(current, buf);
>  
> +	audit_alloc_kernel(current);
> +
>  	while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
>  		long ret;
>  
> @@ -573,6 +576,7 @@ static int io_wqe_worker(void *data)
>  			raw_spin_unlock_irq(&wqe->lock);
>  	}
>  
> +	audit_free(current);
>  	io_worker_exit(worker);
>  	return 0;
>  }
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index e481ac8a757a..e9941d1ad8fd 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -78,6 +78,7 @@
>  #include <linux/task_work.h>
>  #include <linux/pagemap.h>
>  #include <linux/io_uring.h>
> +#include <linux/audit.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/io_uring.h>
> @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>  	if (req->work.creds && req->work.creds != current_cred())
>  		creds = override_creds(req->work.creds);
>  
> +	if (req->opcode < IORING_OP_LAST)
> +		audit_uring_entry(req->opcode);
> +
>  	switch (req->opcode) {
>  	case IORING_OP_NOP:
>  		ret = io_nop(req, issue_flags);
> @@ -6211,6 +6215,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
>  		break;
>  	}
>  
> +	if (req->opcode < IORING_OP_LAST)
> +		audit_uring_exit(!ret, ret);
> +
>  	if (creds)
>  		revert_creds(creds);
>  
> @@ -6827,6 +6834,8 @@ static int io_sq_thread(void *data)
>  		set_cpus_allowed_ptr(current, cpu_online_mask);
>  	current->flags |= PF_NO_SETAFFINITY;
>  
> +	audit_alloc_kernel(current);
> +
>  	mutex_lock(&sqd->lock);
>  	/* a user may had exited before the thread started */
>  	io_run_task_work_head(&sqd->park_task_work);
> @@ -6916,6 +6925,8 @@ static int io_sq_thread(void *data)
>  	io_run_task_work_head(&sqd->park_task_work);
>  	mutex_unlock(&sqd->lock);
>  
> +	audit_free(current);
> +
>  	complete(&sqd->exited);
>  	do_exit(0);
>  }
> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index 82b7c1116a85..6a0c013bc7de 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -286,7 +286,10 @@ static inline int audit_signal_info(int sig, struct task_struct *t)
>  /* These are defined in auditsc.c */
>  				/* Public API */
>  extern int  audit_alloc(struct task_struct *task);
> +extern int  audit_alloc_kernel(struct task_struct *task);
>  extern void __audit_free(struct task_struct *task);
> +extern void __audit_uring_entry(u8 op);
> +extern void __audit_uring_exit(int success, long code);
>  extern void __audit_syscall_entry(int major, unsigned long a0, unsigned long a1,
>  				  unsigned long a2, unsigned long a3);
>  extern void __audit_syscall_exit(int ret_success, long ret_value);
> @@ -323,6 +326,16 @@ static inline void audit_free(struct task_struct *task)
>  	if (unlikely(task->audit_context))
>  		__audit_free(task);
>  }
> +static inline void audit_uring_entry(u8 op)
> +{
> +	if (unlikely(audit_context()))
> +		__audit_uring_entry(op);
> +}
> +static inline void audit_uring_exit(int success, long code)
> +{
> +	if (unlikely(audit_context()))
> +		__audit_uring_exit(success, code);
> +}
>  static inline void audit_syscall_entry(int major, unsigned long a0,
>  				       unsigned long a1, unsigned long a2,
>  				       unsigned long a3)
> @@ -554,6 +567,10 @@ static inline int audit_alloc(struct task_struct *task)
>  {
>  	return 0;
>  }
> +static inline int audit_alloc_kernel(struct task_struct *task)
> +{
> +	return 0;
> +}
>  static inline void audit_free(struct task_struct *task)
>  { }
>  static inline void audit_syscall_entry(int major, unsigned long a0,
> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index cd2d8279a5e4..b26e0c435e8b 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -118,6 +118,7 @@
>  #define AUDIT_TIME_ADJNTPVAL	1333	/* NTP value adjustment */
>  #define AUDIT_BPF		1334	/* BPF subsystem */
>  #define AUDIT_EVENT_LISTENER	1335	/* Task joined multicast read socket */
> +#define AUDIT_URINGOP		1336	/* io_uring operation */
>  
>  #define AUDIT_AVC		1400	/* SE Linux avc denial or grant */
>  #define AUDIT_SELINUX_ERR	1401	/* Internal SE Linux Errors */
> diff --git a/kernel/audit.h b/kernel/audit.h
> index fba180de5912..50de827497ca 100644
> --- a/kernel/audit.h
> +++ b/kernel/audit.h
> @@ -100,10 +100,12 @@ struct audit_context {
>  	enum {
>  		AUDIT_CTX_UNUSED,	/* audit_context is currently unused */
>  		AUDIT_CTX_SYSCALL,	/* in use by syscall */
> +		AUDIT_CTX_URING,	/* in use by io_uring */
>  	} context;
>  	enum audit_state    state, current_state;
>  	unsigned int	    serial;     /* serial number for record */
>  	int		    major;      /* syscall number */
> +	int		    uring_op;   /* uring operation */
>  	struct timespec64   ctime;      /* time of syscall entry */
>  	unsigned long	    argv[4];    /* syscall arguments */
>  	long		    return_code;/* syscall return code */
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index cc89e9f9a753..729849d41631 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -953,6 +953,7 @@ static void audit_reset_context(struct audit_context *ctx)
>  	ctx->current_state = ctx->state;
>  	ctx->serial = 0;
>  	ctx->major = 0;
> +	ctx->uring_op = 0;
>  	ctx->ctime = (struct timespec64){ .tv_sec = 0, .tv_nsec = 0 };
>  	memset(ctx->argv, 0, sizeof(ctx->argv));
>  	ctx->return_code = 0;
> @@ -1038,6 +1039,31 @@ int audit_alloc(struct task_struct *tsk)
>  	return 0;
>  }
>  
> +/**
> + * audit_alloc_kernel - allocate an audit_context for a kernel task
> + * @tsk: the kernel task
> + *
> + * Similar to the audit_alloc() function, but intended for kernel private
> + * threads.  Returns zero on success, negative values on failure.
> + */
> +int audit_alloc_kernel(struct task_struct *tsk)
> +{
> +	/*
> +	 * At the moment we are just going to call into audit_alloc() to
> +	 * simplify the code, but there two things to keep in mind with this
> +	 * approach:
> +	 *
> +	 * 1. Filtering internal kernel tasks is a bit laughable in almost all
> +	 * cases, but there is at least one case where there is a benefit:
> +	 * the '-a task,never' case allows the admin to effectively disable
> +	 * task auditing at runtime.
> +	 *
> +	 * 2. The {set,clear}_task_syscall_work() ops likely have zero effect
> +	 * on these internal kernel tasks, but they probably don't hurt either.
> +	 */
> +	return audit_alloc(tsk);
> +}
> +
>  static inline void audit_free_context(struct audit_context *context)
>  {
>  	/* resetting is extra work, but it is likely just noise */
> @@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void)
>  	audit_log_end(ab);
>  }
>  
> +/**
> + * audit_log_uring - generate a AUDIT_URINGOP record
> + * @ctx: the audit context
> + */
> +static void audit_log_uring(struct audit_context *ctx)
> +{
> +	struct audit_buffer *ab;
> +	const struct cred *cred;
> +
> +	/*
> +	 * TODO: What do we log here?  I'm tossing in a few things to start the
> +	 *       conversation, but additional thought needs to go into this.
> +	 */
> +
> +	ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP);
> +	if (!ab)
> +		return;
> +	cred = current_cred();

This may need to be req->work.creds.  I haven't been following if the
io_uring thread inherited the user task's creds (and below, comm and
exe).

> +	audit_log_format(ab, "uring_op=%d", ctx->uring_op);

arch is stored below in __audit_uring_entry() and never used in the
AUDIT_CTX_URING case.  That assignment can either be dropped or printed
before uring_op similar to the SYSCALL record.  There aren't really any
arg[0-3] to print.

io_uring_register and io_uring_setup() args are better covered by other
records.  io_uring_enter() has 6 args and the last two aren't covered by
SYSCALL anyways.

> +	if (ctx->return_valid != AUDITSC_INVALID)
> +		audit_log_format(ab, " success=%s exit=%ld",
> +				 (ctx->return_valid == AUDITSC_SUCCESS ?
> +				  "yes" : "no"),
> +				 ctx->return_code);
> +	audit_log_format(ab,
> +			 " items=%d"
> +			 " ppid=%d pid=%d auid=%u uid=%u gid=%u"
> +			 " euid=%u suid=%u fsuid=%u"
> +			 " egid=%u sgid=%u fsgid=%u",
> +			 ctx->name_count,
> +			 task_ppid_nr(current),
> +			 task_tgid_nr(current),
> +			 from_kuid(&init_user_ns, audit_get_loginuid(current)),
> +			 from_kuid(&init_user_ns, cred->uid),
> +			 from_kgid(&init_user_ns, cred->gid),
> +			 from_kuid(&init_user_ns, cred->euid),
> +			 from_kuid(&init_user_ns, cred->suid),
> +			 from_kuid(&init_user_ns, cred->fsuid),
> +			 from_kgid(&init_user_ns, cred->egid),
> +			 from_kgid(&init_user_ns, cred->sgid),
> +			 from_kgid(&init_user_ns, cred->fsgid));

The audit session ID is still important, relevant and qualifies auid.
In keeping with the SYSCALL record format, I think we want to keep
ses=audit_get_sessionid(current) in here.

I'm pretty sure we also want to keep comm= and exe= too, but may have to
reach into req->task to get it.  There are two values for comm possible,
one from the original task and second "iou-sqp-<pid>" set at the top of
io_sq_thread().

I'm reluctant to leave them out now and then have to re-add them in yet
another field order later.

> +	audit_log_task_context(ab);
> +	audit_log_key(ab, ctx->filterkey);
> +	audit_log_end(ab);
> +}
> +
>  static void audit_log_exit(void)
>  {
>  	int i, call_panic = 0;
> @@ -1571,6 +1643,9 @@ static void audit_log_exit(void)
>  		audit_log_key(ab, context->filterkey);
>  		audit_log_end(ab);
>  		break;
> +	case AUDIT_CTX_URING:
> +		audit_log_uring(context);
> +		break;
>  	default:
>  		BUG();
>  		break;
> @@ -1740,6 +1815,104 @@ static void audit_return_fixup(struct audit_context *ctx,
>  	ctx->return_valid = (success ? AUDITSC_SUCCESS : AUDITSC_FAILURE);
>  }
>  
> +/**
> + * __audit_uring_entry - prepare the kernel task's audit context for io_uring
> + * @op: the io_uring opcode
> + *
> + * This is similar to audit_syscall_entry() but is intended for use by io_uring
> + * operations.
> + */
> +void __audit_uring_entry(u8 op)
> +{
> +	struct audit_context *ctx = audit_context();
> +
> +	if (!audit_enabled || !ctx || ctx->state == AUDIT_DISABLED)
> +		return;
> +
> +	/*
> +	 * NOTE: It's possible that we can be called from the process' context
> +	 *       before it returns to userspace, and before audit_syscall_exit()
> +	 *       is called.  In this case there is not much to do, just record
> +	 *       the io_uring details and return.
> +	 */
> +	ctx->uring_op = op;
> +	if (ctx->context == AUDIT_CTX_SYSCALL)
> +		return;
> +
> +	ctx->dummy = !audit_n_rules;
> +	if (!ctx->dummy && ctx->state == AUDIT_BUILD_CONTEXT)
> +		ctx->prio = 0;
> +
> +	ctx->arch = syscall_get_arch(current);
> +	ctx->context = AUDIT_CTX_URING;
> +	ctx->current_state = ctx->state;
> +	ktime_get_coarse_real_ts64(&ctx->ctime);
> +}
> +
> +/**
> + * __audit_uring_exit - wrap up the kernel task's audit context after io_uring
> + * @success: true/false value to indicate if the operation succeeded or not
> + * @code: operation return code
> + *
> + * This is similar to audit_syscall_exit() but is intended for use by io_uring
> + * operations.
> + */
> +void __audit_uring_exit(int success, long code)
> +{
> +	struct audit_context *ctx = audit_context();
> +
> +	/*
> +	 * TODO: At some point we will likely want to filter on io_uring ops
> +	 *       and other things similar to what we do for syscalls, but that
> +	 *       is something for another day; just record what we can here.
> +	 */
> +
> +	if (!ctx || ctx->dummy)
> +		goto out;
> +	if (ctx->context == AUDIT_CTX_SYSCALL) {
> +		/*
> +		 * NOTE: See the note in __audit_uring_entry() about the case
> +		 *       where we may be called from process context before we
> +		 *       return to userspace via audit_syscall_exit().  In this
> +		 *       case we simply emit a URINGOP record and bail, the
> +		 *       normal syscall exit handling will take care of
> +		 *       everything else.
> +		 *       It is also worth mentioning that when we are called,
> +		 *       the current process creds may differ from the creds
> +		 *       used during the normal syscall processing; keep that
> +		 *       in mind if/when we move the record generation code.
> +		 */
> +
> +		/*
> +		 * We need to filter on the syscall info here to decide if we
> +		 * should emit a URINGOP record.  I know it seems odd but this
> +		 * solves the problem where users have a filter to block *all*
> +		 * syscall records in the "exit" filter; we want to preserve
> +		 * the behavior here.
> +		 */
> +		audit_filter_syscall(current, ctx);
> +		audit_filter_inodes(current, ctx);
> +		if (ctx->current_state != AUDIT_RECORD_CONTEXT)
> +			return;
> +
> +		audit_log_uring(ctx);
> +		return;
> +	}
> +
> +	/* this may generate CONFIG_CHANGE records */
> +	if (!list_empty(&ctx->killed_trees))
> +		audit_kill_trees(ctx);
> +
> +	audit_filter_inodes(current, ctx);
> +	if (ctx->current_state != AUDIT_RECORD_CONTEXT)
> +		goto out;
> +	audit_return_fixup(ctx, success, code);
> +	audit_log_exit();
> +
> +out:
> +	audit_reset_context(ctx);
> +}
> +
>  /**
>   * __audit_syscall_entry - fill in an audit record at syscall entry
>   * @major: major syscall type (function)
> 
> --
> Linux-audit mailing list
> [email protected]
> https://listman.redhat.com/mailman/listinfo/linux-audit

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-06-02  8:26                               ` Pavel Begunkov
  2021-06-02 15:46                                 ` Richard Guy Briggs
@ 2021-06-02 19:46                                 ` Paul Moore
  2021-06-03 10:51                                   ` Pavel Begunkov
  1 sibling, 1 reply; 71+ messages in thread
From: Paul Moore @ 2021-06-02 19:46 UTC (permalink / raw)
  To: Pavel Begunkov
  Cc: Jens Axboe, linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On Wed, Jun 2, 2021 at 4:27 AM Pavel Begunkov <[email protected]> wrote:
> On 5/28/21 5:02 PM, Paul Moore wrote:
> > On Wed, May 26, 2021 at 4:19 PM Paul Moore <[email protected]> wrote:
> >> ... If we moved the _entry
> >> and _exit calls into the individual operation case blocks (quick
> >> openat example below) so that only certain operations were able to be
> >> audited would that be acceptable assuming the high frequency ops were
> >> untouched?  My initial gut feeling was that this would involve >50% of
> >> the ops, but Steve Grubb seems to think it would be less; it may be
> >> time to look at that a bit more seriously, but if it gets a NACK
> >> regardless it isn't worth the time - thoughts?
> >>
> >>   case IORING_OP_OPENAT:
> >>     audit_uring_entry(req->opcode);
> >>     ret = io_openat(req, issue_flags);
> >>     audit_uring_exit(!ret, ret);
> >>     break;
> >
> > I wanted to pose this question again in case it was lost in the
> > thread, I suspect this may be the last option before we have to "fix"
> > things at the Kconfig level.  I definitely don't want to have to go
> > that route, and I suspect most everyone on this thread feels the same,
> > so I'm hopeful we can find a solution that is begrudgingly acceptable
> > to both groups.
>
> May work for me, but have to ask how many, and what is the
> criteria? I'd think anything opening a file or manipulating fs:
>
> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2],
> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN,
> IORING_OP_FILES_UPDATE
> + coming mkdirat and others.
>
> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV?
>
> What about?
> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE,
> IORING_OP_FALLOCATE, IORING_OP_STATX,
> IORING_OP_FADVISE, IORING_OP_MADVISE,
> IORING_OP_EPOLL_CTL

Looking quickly at v5.13-rc4 the following seems like candidates for
auditing, there may be a small number of subtractions/additions to
this list as people take a closer look, but it should serve as a
starting point:

IORING_OP_SENDMSG
IORING_OP_RECVMSG
IORING_OP_ACCEPT
IORING_OP_CONNECT
IORING_OP_FALLOCATE
IORING_OP_OPENAT
IORING_OP_CLOSE
IORING_OP_MADVISE
IORING_OP_OPENAT2
IORING_OP_SHUTDOWN
IORING_OP_RENAMEAT
IORING_OP_UNLINKAT

... can you live with that list?

> Another question, io_uring may exercise asynchronous paths,
> i.e. io_issue_sqe() returns before requests completes.
> Shouldn't be the case for open/etc at the moment, but was that
> considered?

Yes, I noticed that when testing the code (and it makes sense when you
look at how io_uring handles things).  Depending on the state of the
system when the io_uring request is submitted I've seen both sync and
async io_uring operations with the associated different calling
contexts.  In the case where io_issue_sqe() needs to defer the
operation to a different context you will see an audit record
indicating that the operation failed and then another audit record
when it completes; it's actually pretty interesting to be able to see
how the system and io_uring are working.

We could always mask out these delayed attempts, but at this early
stage they are helpful, and they may be useful for admins.

> I don't see it happening, but would prefer to keep it open
> async reimplementation in a distant future. Does audit sleep?

The only place in the audit_uring_entry()/audit_uring_exit() code path
that could sleep at present is the call to audit_log_uring() which is
made when the rules dictate that an audit record be generated.  The
offending code is an allocation in audit_log_uring() which is
currently GFP_KERNEL but really should be GFP_ATOMIC, or similar.  It
was a copy-n-paste from the similar syscall function where GFP_KERNEL
is appropriate due to the calling context at the end of the syscall.
I'll change that as soon as I'm done with this email.

Of course if you are calling io_uring_enter(2), or something similar,
then audit may sleep as part of the normal syscall processing (as
mentioned above), but that is due to the fact that io_uring_enter(2)
is a syscall and not because of anything in io_issue_sqe().

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit, io_uring, io-wq: add some basic audit support to io_uring
  2021-06-02 17:29   ` [RFC PATCH 2/9] audit, io_uring, io-wq: " Richard Guy Briggs
@ 2021-06-02 20:46     ` Paul Moore
  2021-08-25  1:21       ` Richard Guy Briggs
  0 siblings, 1 reply; 71+ messages in thread
From: Paul Moore @ 2021-06-02 20:46 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On Wed, Jun 2, 2021 at 1:29 PM Richard Guy Briggs <[email protected]> wrote:
> On 2021-05-21 17:49, Paul Moore wrote:
> > WARNING - This is a work in progress and should not be merged
> > anywhere important.  It is almost surely not complete, and while it
> > probably compiles it likely hasn't been booted and will do terrible
> > things.  You have been warned.
> >
> > This patch adds basic auditing to io_uring operations, regardless of
> > their context.  This is accomplished by allocating audit_context
> > structures for the io-wq worker and io_uring SQPOLL kernel threads
> > as well as explicitly auditing the io_uring operations in
> > io_issue_sqe().  The io_uring operations are audited using a new
> > AUDIT_URINGOP record, an example is shown below:
> >
> >   % <TODO - insert AUDIT_URINGOP record example>
> >
> > Thanks to Richard Guy Briggs for review and feedback.
> >
> > Signed-off-by: Paul Moore <[email protected]>
> > ---
> >  fs/io-wq.c                 |    4 +
> >  fs/io_uring.c              |   11 +++
> >  include/linux/audit.h      |   17 ++++
> >  include/uapi/linux/audit.h |    1
> >  kernel/audit.h             |    2 +
> >  kernel/auditsc.c           |  173 ++++++++++++++++++++++++++++++++++++++++++++
> >  6 files changed, 208 insertions(+)

...

> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index e481ac8a757a..e9941d1ad8fd 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -78,6 +78,7 @@
> >  #include <linux/task_work.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/io_uring.h>
> > +#include <linux/audit.h>
> >
> >  #define CREATE_TRACE_POINTS
> >  #include <trace/events/io_uring.h>
> > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
> >       if (req->work.creds && req->work.creds != current_cred())
> >               creds = override_creds(req->work.creds);
> >
> > +     if (req->opcode < IORING_OP_LAST)
> > +             audit_uring_entry(req->opcode);

Note well the override_creds() call right above the audit code that is
being added, it will be important later in this email.

> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > index cc89e9f9a753..729849d41631 100644
> > --- a/kernel/auditsc.c
> > +++ b/kernel/auditsc.c
> > @@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void)
> >       audit_log_end(ab);
> >  }
> >
> > +/**
> > + * audit_log_uring - generate a AUDIT_URINGOP record
> > + * @ctx: the audit context
> > + */
> > +static void audit_log_uring(struct audit_context *ctx)
> > +{
> > +     struct audit_buffer *ab;
> > +     const struct cred *cred;
> > +
> > +     /*
> > +      * TODO: What do we log here?  I'm tossing in a few things to start the
> > +      *       conversation, but additional thought needs to go into this.
> > +      */
> > +
> > +     ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP);
> > +     if (!ab)
> > +             return;
> > +     cred = current_cred();
>
> This may need to be req->work.creds.  I haven't been following if the
> io_uring thread inherited the user task's creds (and below, comm and
> exe).

Nope, we're good.  See the existing code in io_issue_sqe() :)

> > +     audit_log_format(ab, "uring_op=%d", ctx->uring_op);
>
> arch is stored below in __audit_uring_entry() and never used in the
> AUDIT_CTX_URING case.  That assignment can either be dropped or printed
> before uring_op similar to the SYSCALL record.

Good point, I'll drop the code that records the arch from _entry(); it
is really only useful to give the appropriate context if needed for
other things in the audit stream, and that isn't the case like it is
with syscalls.

> There aren't really any arg[0-3] to print.

Which is why I didn't print them.

> io_uring_register and io_uring_setup() args are better covered by other
> records.  io_uring_enter() has 6 args and the last two aren't covered by
> SYSCALL anyways.

 ???

I think you are confusing the io_uring ops with syscalls; they are
very different things from an audit perspective and the io_uring
auditing is not intended to replace syscall records.  The
io_uring_setup() and io_uring_enter() syscalls will be auditing just
as any other syscalls would be using the existing syscall audit code.

> > +     if (ctx->return_valid != AUDITSC_INVALID)
> > +             audit_log_format(ab, " success=%s exit=%ld",
> > +                              (ctx->return_valid == AUDITSC_SUCCESS ?
> > +                               "yes" : "no"),
> > +                              ctx->return_code);
> > +     audit_log_format(ab,
> > +                      " items=%d"
> > +                      " ppid=%d pid=%d auid=%u uid=%u gid=%u"
> > +                      " euid=%u suid=%u fsuid=%u"
> > +                      " egid=%u sgid=%u fsgid=%u",
> > +                      ctx->name_count,
> > +                      task_ppid_nr(current),
> > +                      task_tgid_nr(current),
> > +                      from_kuid(&init_user_ns, audit_get_loginuid(current)),
> > +                      from_kuid(&init_user_ns, cred->uid),
> > +                      from_kgid(&init_user_ns, cred->gid),
> > +                      from_kuid(&init_user_ns, cred->euid),
> > +                      from_kuid(&init_user_ns, cred->suid),
> > +                      from_kuid(&init_user_ns, cred->fsuid),
> > +                      from_kgid(&init_user_ns, cred->egid),
> > +                      from_kgid(&init_user_ns, cred->sgid),
> > +                      from_kgid(&init_user_ns, cred->fsgid));
>
> The audit session ID is still important, relevant and qualifies auid.
> In keeping with the SYSCALL record format, I think we want to keep
> ses=audit_get_sessionid(current) in here.

This might be another case of syscall/io_uring confusion.  An io_uring
op doesn't necessarily have an audit session ID or an audit UID in the
conventional sense; for example think about SQPOLL works, shared
rings, etc.

> I'm pretty sure we also want to keep comm= and exe= too, but may have to
> reach into req->task to get it.  There are two values for comm possible,
> one from the original task and second "iou-sqp-<pid>" set at the top of
> io_sq_thread().

I think this is more syscall/io_uring confusion.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-06-02 15:46                                 ` Richard Guy Briggs
@ 2021-06-03 10:39                                   ` Pavel Begunkov
  0 siblings, 0 replies; 71+ messages in thread
From: Pavel Begunkov @ 2021-06-03 10:39 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Paul Moore, Jens Axboe, selinux, linux-security-module,
	linux-audit, Kumar Kartikeya Dwivedi, linux-fsdevel, io-uring,
	Alexander Viro

On 6/2/21 4:46 PM, Richard Guy Briggs wrote:
> On 2021-06-02 09:26, Pavel Begunkov wrote:
>> On 5/28/21 5:02 PM, Paul Moore wrote:
>>> On Wed, May 26, 2021 at 4:19 PM Paul Moore <[email protected]> wrote:
>>>> ... If we moved the _entry
>>>> and _exit calls into the individual operation case blocks (quick
>>>> openat example below) so that only certain operations were able to be
>>>> audited would that be acceptable assuming the high frequency ops were
>>>> untouched?  My initial gut feeling was that this would involve >50% of
>>>> the ops, but Steve Grubb seems to think it would be less; it may be
>>>> time to look at that a bit more seriously, but if it gets a NACK
>>>> regardless it isn't worth the time - thoughts?
>>>>
>>>>   case IORING_OP_OPENAT:
>>>>     audit_uring_entry(req->opcode);
>>>>     ret = io_openat(req, issue_flags);
>>>>     audit_uring_exit(!ret, ret);
>>>>     break;
>>>
>>> I wanted to pose this question again in case it was lost in the
>>> thread, I suspect this may be the last option before we have to "fix"
>>> things at the Kconfig level.  I definitely don't want to have to go
>>> that route, and I suspect most everyone on this thread feels the same,
>>> so I'm hopeful we can find a solution that is begrudgingly acceptable
>>> to both groups.
>>
>> May work for me, but have to ask how many, and what is the
>> criteria? I'd think anything opening a file or manipulating fs:
>>
>> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2],
>> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN,
>> IORING_OP_FILES_UPDATE
>> + coming mkdirat and others.
>>
>> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV?
>>
>> What about?
>> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE,
>> IORING_OP_FALLOCATE, IORING_OP_STATX,
>> IORING_OP_FADVISE, IORING_OP_MADVISE,
>> IORING_OP_EPOLL_CTL
>>
>>
>> Another question, io_uring may exercise asynchronous paths,
>> i.e. io_issue_sqe() returns before requests completes.
>> Shouldn't be the case for open/etc at the moment, but was that
>> considered?
> 
> This would be why audit needs to monitor a thread until it wraps up, to
> wait for the result code.  My understanding is that both sync and async
> parts of an op would be monitored.

There may be a misunderstanding

audit_start(req)
ret = io_issue_sqe(req);
audit_end(ret);

io_issue_sqe() may return 0 but leave the request inflight,
which will be completed asynchronously e.g. by IRQ, not going
through io_issue_sqe() or any io_read()/etc helpers again, and
after last audit_end() had already happened.
That's the case with read/write/timeout, but is not true for
open/etc.

>> I don't see it happening, but would prefer to keep it open
>> async reimplementation in a distant future. Does audit sleep?
> 
> Some parts do, some parts don't depending on what they are interacting
> with in the kernel.  It can be made to not sleep if needed.

Ok, good

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-06-02 19:46                                 ` Paul Moore
@ 2021-06-03 10:51                                   ` Pavel Begunkov
  2021-06-03 15:54                                     ` Casey Schaufler
  0 siblings, 1 reply; 71+ messages in thread
From: Pavel Begunkov @ 2021-06-03 10:51 UTC (permalink / raw)
  To: Paul Moore
  Cc: Jens Axboe, linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On 6/2/21 8:46 PM, Paul Moore wrote:
> On Wed, Jun 2, 2021 at 4:27 AM Pavel Begunkov <[email protected]> wrote:
>> On 5/28/21 5:02 PM, Paul Moore wrote:
>>> On Wed, May 26, 2021 at 4:19 PM Paul Moore <[email protected]> wrote:
>>>> ... If we moved the _entry
>>>> and _exit calls into the individual operation case blocks (quick
>>>> openat example below) so that only certain operations were able to be
>>>> audited would that be acceptable assuming the high frequency ops were
>>>> untouched?  My initial gut feeling was that this would involve >50% of
>>>> the ops, but Steve Grubb seems to think it would be less; it may be
>>>> time to look at that a bit more seriously, but if it gets a NACK
>>>> regardless it isn't worth the time - thoughts?
>>>>
>>>>   case IORING_OP_OPENAT:
>>>>     audit_uring_entry(req->opcode);
>>>>     ret = io_openat(req, issue_flags);
>>>>     audit_uring_exit(!ret, ret);
>>>>     break;
>>>
>>> I wanted to pose this question again in case it was lost in the
>>> thread, I suspect this may be the last option before we have to "fix"
>>> things at the Kconfig level.  I definitely don't want to have to go
>>> that route, and I suspect most everyone on this thread feels the same,
>>> so I'm hopeful we can find a solution that is begrudgingly acceptable
>>> to both groups.
>>
>> May work for me, but have to ask how many, and what is the
>> criteria? I'd think anything opening a file or manipulating fs:
>>
>> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2],
>> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN,
>> IORING_OP_FILES_UPDATE
>> + coming mkdirat and others.
>>
>> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV?
>>
>> What about?
>> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE,
>> IORING_OP_FALLOCATE, IORING_OP_STATX,
>> IORING_OP_FADVISE, IORING_OP_MADVISE,
>> IORING_OP_EPOLL_CTL
> 
> Looking quickly at v5.13-rc4 the following seems like candidates for
> auditing, there may be a small number of subtractions/additions to
> this list as people take a closer look, but it should serve as a
> starting point:
> 
> IORING_OP_SENDMSG
> IORING_OP_RECVMSG
> IORING_OP_ACCEPT
> IORING_OP_CONNECT
> IORING_OP_FALLOCATE
> IORING_OP_OPENAT
> IORING_OP_CLOSE
> IORING_OP_MADVISE
> IORING_OP_OPENAT2
> IORING_OP_SHUTDOWN
> IORING_OP_RENAMEAT
> IORING_OP_UNLINKAT
> 
> ... can you live with that list?

it will bloat binary somewhat, but considering it's all in one
place -- io_issue_sqe(), it's workable.

Not nice to have send/recv msg in the list, but I admit they
may do some crazy things. What can be traced for them? Because
at the moment of issue_sqe() not everything is read from the
userspace.

see: io_sendmsg() { ...; io_sendmsg_copy_hdr(); },

will copy header only in io_sendmsg() in most cases, and
then stash it for re-issuing if needed.


>> Another question, io_uring may exercise asynchronous paths,
>> i.e. io_issue_sqe() returns before requests completes.
>> Shouldn't be the case for open/etc at the moment, but was that
>> considered?
> 
> Yes, I noticed that when testing the code (and it makes sense when you
> look at how io_uring handles things).  Depending on the state of the
> system when the io_uring request is submitted I've seen both sync and
> async io_uring operations with the associated different calling
> contexts.  In the case where io_issue_sqe() needs to defer the
> operation to a different context you will see an audit record
> indicating that the operation failed and then another audit record
> when it completes; it's actually pretty interesting to be able to see
> how the system and io_uring are working.

Copying a reply to another message to keep clear out
of misunderstanding.

"io_issue_sqe() may return 0 but leave the request inflight,
which will be completed asynchronously e.g. by IRQ, not going
through io_issue_sqe() or any io_read()/etc helpers again, and
after last audit_end() had already happened.
That's the case with read/write/timeout, but is not true for
open/etc."

And there is interest in async send/recv[msg] as well (via
IRQ as described, callbacks, etc.).
 
> We could always mask out these delayed attempts, but at this early
> stage they are helpful, and they may be useful for admins.
> 
>> I don't see it happening, but would prefer to keep it open
>> async reimplementation in a distant future. Does audit sleep?
> 
> The only place in the audit_uring_entry()/audit_uring_exit() code path
> that could sleep at present is the call to audit_log_uring() which is
> made when the rules dictate that an audit record be generated.  The
> offending code is an allocation in audit_log_uring() which is
> currently GFP_KERNEL but really should be GFP_ATOMIC, or similar.  It
> was a copy-n-paste from the similar syscall function where GFP_KERNEL
> is appropriate due to the calling context at the end of the syscall.
> I'll change that as soon as I'm done with this email.

Ok, depends where it steers, but there may be a requirement to
not sleep for some hooks because of not having a sleepable context.

> 
> Of course if you are calling io_uring_enter(2), or something similar,
> then audit may sleep as part of the normal syscall processing (as
> mentioned above), but that is due to the fact that io_uring_enter(2)
> is a syscall and not because of anything in io_issue_sqe().
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-05-28 16:02                             ` Paul Moore
  2021-06-02  8:26                               ` Pavel Begunkov
@ 2021-06-03 15:54                               ` Jens Axboe
  2021-06-04  5:04                                 ` Paul Moore
  1 sibling, 1 reply; 71+ messages in thread
From: Jens Axboe @ 2021-06-03 15:54 UTC (permalink / raw)
  To: Paul Moore
  Cc: Pavel Begunkov, linux-security-module, selinux, linux-audit,
	io-uring, linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On 5/28/21 10:02 AM, Paul Moore wrote:
> On Wed, May 26, 2021 at 4:19 PM Paul Moore <[email protected]> wrote:
>> ... If we moved the _entry
>> and _exit calls into the individual operation case blocks (quick
>> openat example below) so that only certain operations were able to be
>> audited would that be acceptable assuming the high frequency ops were
>> untouched?  My initial gut feeling was that this would involve >50% of
>> the ops, but Steve Grubb seems to think it would be less; it may be
>> time to look at that a bit more seriously, but if it gets a NACK
>> regardless it isn't worth the time - thoughts?
>>
>>   case IORING_OP_OPENAT:
>>     audit_uring_entry(req->opcode);
>>     ret = io_openat(req, issue_flags);
>>     audit_uring_exit(!ret, ret);
>>     break;
> 
> I wanted to pose this question again in case it was lost in the
> thread, I suspect this may be the last option before we have to "fix"
> things at the Kconfig level.  I definitely don't want to have to go
> that route, and I suspect most everyone on this thread feels the same,
> so I'm hopeful we can find a solution that is begrudgingly acceptable
> to both groups.

Sorry for the lack of response here, but to sum up my order of
preference:

1) It's probably better to just make the audit an opt-out in io_op_defs
   for each opcode, and avoid needing boiler plate code for each op
   handler. The opt-out would ensure that new opcodes get it by default
   it someone doesn't know what it is, and the io_op_defs addition would
   mean that it's in generic code rather then in the handlers. Yes it's
   a bit slower, but it's saner imho.

2) With the above, I'm fine with adding this to io_uring. I don't think
   going the route of mutual exclusion in kconfig helps anyone, it'd
   be counter productive to both sides.

Hope that works and helps move this forward. I'll be mostly out of touch
the next week and a half, but wanted to ensure that I sent out my
(brief) thoughts before going away.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-06-03 10:51                                   ` Pavel Begunkov
@ 2021-06-03 15:54                                     ` Casey Schaufler
  0 siblings, 0 replies; 71+ messages in thread
From: Casey Schaufler @ 2021-06-03 15:54 UTC (permalink / raw)
  To: Pavel Begunkov, Paul Moore
  Cc: Jens Axboe, selinux, linux-security-module, linux-audit,
	Kumar Kartikeya Dwivedi, linux-fsdevel, io-uring, Alexander Viro,
	Casey Schaufler

On 6/3/2021 3:51 AM, Pavel Begunkov wrote:
> On 6/2/21 8:46 PM, Paul Moore wrote:
>> On Wed, Jun 2, 2021 at 4:27 AM Pavel Begunkov <[email protected]> wrote:
>>> On 5/28/21 5:02 PM, Paul Moore wrote:
>>>> On Wed, May 26, 2021 at 4:19 PM Paul Moore <[email protected]> wrote:
>>>>> ... If we moved the _entry
>>>>> and _exit calls into the individual operation case blocks (quick
>>>>> openat example below) so that only certain operations were able to be
>>>>> audited would that be acceptable assuming the high frequency ops were
>>>>> untouched?  My initial gut feeling was that this would involve >50% of
>>>>> the ops, but Steve Grubb seems to think it would be less; it may be
>>>>> time to look at that a bit more seriously, but if it gets a NACK
>>>>> regardless it isn't worth the time - thoughts?
>>>>>
>>>>>   case IORING_OP_OPENAT:
>>>>>     audit_uring_entry(req->opcode);
>>>>>     ret = io_openat(req, issue_flags);
>>>>>     audit_uring_exit(!ret, ret);
>>>>>     break;
>>>> I wanted to pose this question again in case it was lost in the
>>>> thread, I suspect this may be the last option before we have to "fix"
>>>> things at the Kconfig level.  I definitely don't want to have to go
>>>> that route, and I suspect most everyone on this thread feels the same,
>>>> so I'm hopeful we can find a solution that is begrudgingly acceptable
>>>> to both groups.
>>> May work for me, but have to ask how many, and what is the
>>> criteria? I'd think anything opening a file or manipulating fs:
>>>
>>> IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_OPENAT[2],
>>> IORING_OP_RENAMEAT, IORING_OP_UNLINKAT, IORING_OP_SHUTDOWN,
>>> IORING_OP_FILES_UPDATE
>>> + coming mkdirat and others.
>>>
>>> IORING_OP_CLOSE? IORING_OP_SEND IORING_OP_RECV?
>>>
>>> What about?
>>> IORING_OP_FSYNC, IORING_OP_SYNC_FILE_RANGE,
>>> IORING_OP_FALLOCATE, IORING_OP_STATX,
>>> IORING_OP_FADVISE, IORING_OP_MADVISE,
>>> IORING_OP_EPOLL_CTL
>> Looking quickly at v5.13-rc4 the following seems like candidates for
>> auditing, there may be a small number of subtractions/additions to
>> this list as people take a closer look, but it should serve as a
>> starting point:
>>
>> IORING_OP_SENDMSG
>> IORING_OP_RECVMSG
>> IORING_OP_ACCEPT
>> IORING_OP_CONNECT
>> IORING_OP_FALLOCATE
>> IORING_OP_OPENAT
>> IORING_OP_CLOSE
>> IORING_OP_MADVISE
>> IORING_OP_OPENAT2
>> IORING_OP_SHUTDOWN
>> IORING_OP_RENAMEAT
>> IORING_OP_UNLINKAT
>>
>> ... can you live with that list?
> it will bloat binary somewhat, but considering it's all in one
> place -- io_issue_sqe(), it's workable.
>
> Not nice to have send/recv msg in the list, but I admit they
> may do some crazy things. What can be traced for them?

Both SELinux and Smack do access checks on packet operations.
As access may be denied by these checks, audit needs to be available.
This is true for UDS, IP and at least one other protocol family.

>  Because
> at the moment of issue_sqe() not everything is read from the
> userspace.
>
> see: io_sendmsg() { ...; io_sendmsg_copy_hdr(); },
>
> will copy header only in io_sendmsg() in most cases, and
> then stash it for re-issuing if needed.
>
>
>>> Another question, io_uring may exercise asynchronous paths,
>>> i.e. io_issue_sqe() returns before requests completes.
>>> Shouldn't be the case for open/etc at the moment, but was that
>>> considered?
>> Yes, I noticed that when testing the code (and it makes sense when you
>> look at how io_uring handles things).  Depending on the state of the
>> system when the io_uring request is submitted I've seen both sync and
>> async io_uring operations with the associated different calling
>> contexts.  In the case where io_issue_sqe() needs to defer the
>> operation to a different context you will see an audit record
>> indicating that the operation failed and then another audit record
>> when it completes; it's actually pretty interesting to be able to see
>> how the system and io_uring are working.
> Copying a reply to another message to keep clear out
> of misunderstanding.
>
> "io_issue_sqe() may return 0 but leave the request inflight,
> which will be completed asynchronously e.g. by IRQ, not going
> through io_issue_sqe() or any io_read()/etc helpers again, and
> after last audit_end() had already happened.
> That's the case with read/write/timeout, but is not true for
> open/etc."
>
> And there is interest in async send/recv[msg] as well (via
> IRQ as described, callbacks, etc.).
>  
>> We could always mask out these delayed attempts, but at this early
>> stage they are helpful, and they may be useful for admins.
>>
>>> I don't see it happening, but would prefer to keep it open
>>> async reimplementation in a distant future. Does audit sleep?
>> The only place in the audit_uring_entry()/audit_uring_exit() code path
>> that could sleep at present is the call to audit_log_uring() which is
>> made when the rules dictate that an audit record be generated.  The
>> offending code is an allocation in audit_log_uring() which is
>> currently GFP_KERNEL but really should be GFP_ATOMIC, or similar.  It
>> was a copy-n-paste from the similar syscall function where GFP_KERNEL
>> is appropriate due to the calling context at the end of the syscall.
>> I'll change that as soon as I'm done with this email.
> Ok, depends where it steers, but there may be a requirement to
> not sleep for some hooks because of not having a sleepable context.
>
>> Of course if you are calling io_uring_enter(2), or something similar,
>> then audit may sleep as part of the normal syscall processing (as
>> mentioned above), but that is due to the fact that io_uring_enter(2)
>> is a syscall and not because of anything in io_issue_sqe().
>>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring
  2021-06-03 15:54                               ` Jens Axboe
@ 2021-06-04  5:04                                 ` Paul Moore
  0 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-06-04  5:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Pavel Begunkov, linux-security-module, selinux, linux-audit,
	io-uring, linux-fsdevel, Kumar Kartikeya Dwivedi, Alexander Viro

On Thu, Jun 3, 2021 at 11:54 AM Jens Axboe <[email protected]> wrote:
> On 5/28/21 10:02 AM, Paul Moore wrote:
> > On Wed, May 26, 2021 at 4:19 PM Paul Moore <[email protected]> wrote:
> >> ... If we moved the _entry
> >> and _exit calls into the individual operation case blocks (quick
> >> openat example below) so that only certain operations were able to be
> >> audited would that be acceptable assuming the high frequency ops were
> >> untouched?  My initial gut feeling was that this would involve >50% of
> >> the ops, but Steve Grubb seems to think it would be less; it may be
> >> time to look at that a bit more seriously, but if it gets a NACK
> >> regardless it isn't worth the time - thoughts?
> >>
> >>   case IORING_OP_OPENAT:
> >>     audit_uring_entry(req->opcode);
> >>     ret = io_openat(req, issue_flags);
> >>     audit_uring_exit(!ret, ret);
> >>     break;
> >
> > I wanted to pose this question again in case it was lost in the
> > thread, I suspect this may be the last option before we have to "fix"
> > things at the Kconfig level.  I definitely don't want to have to go
> > that route, and I suspect most everyone on this thread feels the same,
> > so I'm hopeful we can find a solution that is begrudgingly acceptable
> > to both groups.
>
> Sorry for the lack of response here, but to sum up my order of
> preference:
>
> 1) It's probably better to just make the audit an opt-out in io_op_defs
>    for each opcode, and avoid needing boiler plate code for each op
>    handler. The opt-out would ensure that new opcodes get it by default
>    it someone doesn't know what it is, and the io_op_defs addition would
>    mean that it's in generic code rather then in the handlers. Yes it's
>    a bit slower, but it's saner imho.
>
> 2) With the above, I'm fine with adding this to io_uring. I don't think
>    going the route of mutual exclusion in kconfig helps anyone, it'd
>    be counter productive to both sides.
>
> Hope that works and helps move this forward. I'll be mostly out of touch
> the next week and a half, but wanted to ensure that I sent out my
> (brief) thoughts before going away.

Thanks Jens.  I'll revise the patchset based on this (basically doing
an opt-out version of what you did on May 26th) and do a v2 post with
the other accumulated fixes/changes.  If there is anything else that
needs discussion/review I'm sure Pavel can help us out, he's been
helpful thus far.

--
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/2] audit: add filtering for io_uring records, addendum
  2021-05-31 13:44       ` [PATCH 1/2] audit: add filtering for io_uring records, addendum Richard Guy Briggs
  2021-05-31 16:08         ` kernel test robot
  2021-05-31 17:38         ` kernel test robot
@ 2021-06-07 23:15         ` Paul Moore
  2021-06-08 12:55           ` Richard Guy Briggs
  2 siblings, 1 reply; 71+ messages in thread
From: Paul Moore @ 2021-06-07 23:15 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: linux-audit, io-uring, linux-fsdevel, linux-security-module,
	selinux, Kumar Kartikeya Dwivedi, Jens Axboe, Alexander Viro

On Mon, May 31, 2021 at 9:45 AM Richard Guy Briggs <[email protected]> wrote:
>
> The commit ("audit: add filtering for io_uring records") added support for
> filtering io_uring operations.
>
> Add checks to the audit io_uring filtering code for directory and path watches,
> and to keep the list counts consistent.
>
> Signed-off-by: Richard Guy Briggs <[email protected]>
> ---
>  kernel/audit_tree.c  | 3 ++-
>  kernel/audit_watch.c | 3 ++-
>  kernel/auditfilter.c | 7 +++++--
>  3 files changed, 9 insertions(+), 4 deletions(-)

Thanks for pointing these omissions out in the original patch.  When a
patch has obvious problems generally people just provide feedback and
the patch author incorporates the fixes; this helps ensure we don't
merge known broken patches, helping preserve `git bisect`.

Do you mind if I incorporate these suggestions, and the one in patch
2/2, into the filtering patch in the original RFC patchset?  I'll add
a 'thank you' comment in the commit description as I did to the other
patch where you provided feedback.  I feel that is the proper way to
handle this.

> diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c
> index 6c91902f4f45..2be285c2f069 100644
> --- a/kernel/audit_tree.c
> +++ b/kernel/audit_tree.c
> @@ -727,7 +727,8 @@ int audit_make_tree(struct audit_krule *rule, char *pathname, u32 op)
>  {
>
>         if (pathname[0] != '/' ||
> -           rule->listnr != AUDIT_FILTER_EXIT ||
> +           (rule->listnr != AUDIT_FILTER_EXIT &&
> +            rule->listnr != AUDIT_FILTER_URING_EXIT) ||
>             op != Audit_equal ||
>             rule->inode_f || rule->watch || rule->tree)
>                 return -EINVAL;
> diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
> index 2acf7ca49154..698b62b4a2ec 100644
> --- a/kernel/audit_watch.c
> +++ b/kernel/audit_watch.c
> @@ -183,7 +183,8 @@ int audit_to_watch(struct audit_krule *krule, char *path, int len, u32 op)
>                 return -EOPNOTSUPP;
>
>         if (path[0] != '/' || path[len-1] == '/' ||
> -           krule->listnr != AUDIT_FILTER_EXIT ||
> +           (krule->listnr != AUDIT_FILTER_EXIT &&
> +            krule->listnr != AUDIT_FILTER_URING_EXIT) ||
>             op != Audit_equal ||
>             krule->inode_f || krule->watch || krule->tree)
>                 return -EINVAL;
> diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
> index c21119c00504..bcdedfd1088c 100644
> --- a/kernel/auditfilter.c
> +++ b/kernel/auditfilter.c
> @@ -153,7 +153,8 @@ char *audit_unpack_string(void **bufp, size_t *remain, size_t len)
>  static inline int audit_to_inode(struct audit_krule *krule,
>                                  struct audit_field *f)
>  {
> -       if (krule->listnr != AUDIT_FILTER_EXIT ||
> +       if ((krule->listnr != AUDIT_FILTER_EXIT &&
> +            krule->listnr != AUDIT_FILTER_URING_EXIT) ||
>             krule->inode_f || krule->watch || krule->tree ||
>             (f->op != Audit_equal && f->op != Audit_not_equal))
>                 return -EINVAL;
> @@ -250,6 +251,7 @@ static inline struct audit_entry *audit_to_entry_common(struct audit_rule_data *
>                 pr_err("AUDIT_FILTER_ENTRY is deprecated\n");
>                 goto exit_err;
>         case AUDIT_FILTER_EXIT:
> +       case AUDIT_FILTER_URING_EXIT:
>         case AUDIT_FILTER_TASK:
>  #endif
>         case AUDIT_FILTER_USER:
> @@ -982,7 +984,8 @@ static inline int audit_add_rule(struct audit_entry *entry)
>         }
>
>         entry->rule.prio = ~0ULL;
> -       if (entry->rule.listnr == AUDIT_FILTER_EXIT) {
> +       if (entry->rule.listnr == AUDIT_FILTER_EXIT ||
> +           entry->rule.listnr == AUDIT_FILTER_URING_EXIT) {
>                 if (entry->rule.flags & AUDIT_FILTER_PREPEND)
>                         entry->rule.prio = ++prio_high;
>                 else
> --
> 2.27.0

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/2] audit: add filtering for io_uring records, addendum
  2021-06-07 23:15         ` Paul Moore
@ 2021-06-08 12:55           ` Richard Guy Briggs
  2021-06-09  2:45             ` Paul Moore
  0 siblings, 1 reply; 71+ messages in thread
From: Richard Guy Briggs @ 2021-06-08 12:55 UTC (permalink / raw)
  To: Paul Moore
  Cc: Jens Axboe, selinux, linux-security-module, linux-audit,
	Kumar Kartikeya Dwivedi, linux-fsdevel, io-uring, Alexander Viro

On 2021-06-07 19:15, Paul Moore wrote:
> On Mon, May 31, 2021 at 9:45 AM Richard Guy Briggs <[email protected]> wrote:
> > The commit ("audit: add filtering for io_uring records") added support for
> > filtering io_uring operations.
> >
> > Add checks to the audit io_uring filtering code for directory and path watches,
> > and to keep the list counts consistent.
> >
> > Signed-off-by: Richard Guy Briggs <[email protected]>
> > ---
> >  kernel/audit_tree.c  | 3 ++-
> >  kernel/audit_watch.c | 3 ++-
> >  kernel/auditfilter.c | 7 +++++--
> >  3 files changed, 9 insertions(+), 4 deletions(-)
> 
> Thanks for pointing these omissions out in the original patch.  When a
> patch has obvious problems generally people just provide feedback and
> the patch author incorporates the fixes; this helps ensure we don't
> merge known broken patches, helping preserve `git bisect`.
> 
> Do you mind if I incorporate these suggestions, and the one in patch
> 2/2, into the filtering patch in the original RFC patchset?  I'll add
> a 'thank you' comment in the commit description as I did to the other
> patch where you provided feedback.  I feel that is the proper way to
> handle this.

I should have been more explicit.  The intent was to have the fixes
incorporated directly into your patches since they aren't committed in
any public tree yet, exactly for bisect reasons.  I didn't add a
"fixes:" tag because it had no public commit hash, but could/should have
instead made a note about it or even used the "fixup:" subject tag.
Posting using the thread as the "in-reply-to:" for this patchset was the
simplest and clearest way to demonstrate the intent and check they were
correct and I was too lazy to add a cover letter explaining that.  There
is also a Co-developed-by: tag that could be used if you feel that is
appropriate, now that you mention it, but that appears to imply a much
stronger equal contribution.

> > diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c
> > index 6c91902f4f45..2be285c2f069 100644
> > --- a/kernel/audit_tree.c
> > +++ b/kernel/audit_tree.c
> > @@ -727,7 +727,8 @@ int audit_make_tree(struct audit_krule *rule, char *pathname, u32 op)
> >  {
> >
> >         if (pathname[0] != '/' ||
> > -           rule->listnr != AUDIT_FILTER_EXIT ||
> > +           (rule->listnr != AUDIT_FILTER_EXIT &&
> > +            rule->listnr != AUDIT_FILTER_URING_EXIT) ||
> >             op != Audit_equal ||
> >             rule->inode_f || rule->watch || rule->tree)
> >                 return -EINVAL;
> > diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
> > index 2acf7ca49154..698b62b4a2ec 100644
> > --- a/kernel/audit_watch.c
> > +++ b/kernel/audit_watch.c
> > @@ -183,7 +183,8 @@ int audit_to_watch(struct audit_krule *krule, char *path, int len, u32 op)
> >                 return -EOPNOTSUPP;
> >
> >         if (path[0] != '/' || path[len-1] == '/' ||
> > -           krule->listnr != AUDIT_FILTER_EXIT ||
> > +           (krule->listnr != AUDIT_FILTER_EXIT &&
> > +            krule->listnr != AUDIT_FILTER_URING_EXIT) ||
> >             op != Audit_equal ||
> >             krule->inode_f || krule->watch || krule->tree)
> >                 return -EINVAL;
> > diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
> > index c21119c00504..bcdedfd1088c 100644
> > --- a/kernel/auditfilter.c
> > +++ b/kernel/auditfilter.c
> > @@ -153,7 +153,8 @@ char *audit_unpack_string(void **bufp, size_t *remain, size_t len)
> >  static inline int audit_to_inode(struct audit_krule *krule,
> >                                  struct audit_field *f)
> >  {
> > -       if (krule->listnr != AUDIT_FILTER_EXIT ||
> > +       if ((krule->listnr != AUDIT_FILTER_EXIT &&
> > +            krule->listnr != AUDIT_FILTER_URING_EXIT) ||
> >             krule->inode_f || krule->watch || krule->tree ||
> >             (f->op != Audit_equal && f->op != Audit_not_equal))
> >                 return -EINVAL;
> > @@ -250,6 +251,7 @@ static inline struct audit_entry *audit_to_entry_common(struct audit_rule_data *
> >                 pr_err("AUDIT_FILTER_ENTRY is deprecated\n");
> >                 goto exit_err;
> >         case AUDIT_FILTER_EXIT:
> > +       case AUDIT_FILTER_URING_EXIT:
> >         case AUDIT_FILTER_TASK:
> >  #endif
> >         case AUDIT_FILTER_USER:
> > @@ -982,7 +984,8 @@ static inline int audit_add_rule(struct audit_entry *entry)
> >         }
> >
> >         entry->rule.prio = ~0ULL;
> > -       if (entry->rule.listnr == AUDIT_FILTER_EXIT) {
> > +       if (entry->rule.listnr == AUDIT_FILTER_EXIT ||
> > +           entry->rule.listnr == AUDIT_FILTER_URING_EXIT) {
> >                 if (entry->rule.flags & AUDIT_FILTER_PREPEND)
> >                         entry->rule.prio = ++prio_high;
> >                 else
> 
> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH 1/2] audit: add filtering for io_uring records, addendum
  2021-06-08 12:55           ` Richard Guy Briggs
@ 2021-06-09  2:45             ` Paul Moore
  0 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-06-09  2:45 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Jens Axboe, selinux, linux-security-module, linux-audit,
	Kumar Kartikeya Dwivedi, linux-fsdevel, io-uring, Alexander Viro

On Tue, Jun 8, 2021 at 8:55 AM Richard Guy Briggs <[email protected]> wrote:
> I should have been more explicit.  The intent was to have the fixes
> incorporated directly into your patches since they aren't committed in
> any public tree yet, exactly for bisect reasons ...

No worries, thanks for the clarification Richard.  I just wanted to
make sure since the contribution format was a bit unusual given the
context :)

Regardless, thanks again for the feedback, I'll get this incorporated.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit, io_uring, io-wq: add some basic audit support to io_uring
  2021-06-02 20:46     ` Paul Moore
@ 2021-08-25  1:21       ` Richard Guy Briggs
  2021-08-25 19:41         ` Paul Moore
  0 siblings, 1 reply; 71+ messages in thread
From: Richard Guy Briggs @ 2021-08-25  1:21 UTC (permalink / raw)
  To: Paul Moore
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On 2021-06-02 13:46, Paul Moore wrote:
> On Wed, Jun 2, 2021 at 1:29 PM Richard Guy Briggs <[email protected]> wrote:
> > On 2021-05-21 17:49, Paul Moore wrote:
> > > WARNING - This is a work in progress and should not be merged
> > > anywhere important.  It is almost surely not complete, and while it
> > > probably compiles it likely hasn't been booted and will do terrible
> > > things.  You have been warned.
> > >
> > > This patch adds basic auditing to io_uring operations, regardless of
> > > their context.  This is accomplished by allocating audit_context
> > > structures for the io-wq worker and io_uring SQPOLL kernel threads
> > > as well as explicitly auditing the io_uring operations in
> > > io_issue_sqe().  The io_uring operations are audited using a new
> > > AUDIT_URINGOP record, an example is shown below:
> > >
> > >   % <TODO - insert AUDIT_URINGOP record example>
> > >
> > > Thanks to Richard Guy Briggs for review and feedback.
> > >
> > > Signed-off-by: Paul Moore <[email protected]>
> > > ---
> > >  fs/io-wq.c                 |    4 +
> > >  fs/io_uring.c              |   11 +++
> > >  include/linux/audit.h      |   17 ++++
> > >  include/uapi/linux/audit.h |    1
> > >  kernel/audit.h             |    2 +
> > >  kernel/auditsc.c           |  173 ++++++++++++++++++++++++++++++++++++++++++++
> > >  6 files changed, 208 insertions(+)
> 
> ...
> 
> > > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > > index e481ac8a757a..e9941d1ad8fd 100644
> > > --- a/fs/io_uring.c
> > > +++ b/fs/io_uring.c
> > > @@ -78,6 +78,7 @@
> > >  #include <linux/task_work.h>
> > >  #include <linux/pagemap.h>
> > >  #include <linux/io_uring.h>
> > > +#include <linux/audit.h>
> > >
> > >  #define CREATE_TRACE_POINTS
> > >  #include <trace/events/io_uring.h>
> > > @@ -6105,6 +6106,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
> > >       if (req->work.creds && req->work.creds != current_cred())
> > >               creds = override_creds(req->work.creds);
> > >
> > > +     if (req->opcode < IORING_OP_LAST)
> > > +             audit_uring_entry(req->opcode);
> 
> Note well the override_creds() call right above the audit code that is
> being added, it will be important later in this email.
> 
> > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> > > index cc89e9f9a753..729849d41631 100644
> > > --- a/kernel/auditsc.c
> > > +++ b/kernel/auditsc.c
> > > @@ -1536,6 +1562,52 @@ static void audit_log_proctitle(void)
> > >       audit_log_end(ab);
> > >  }
> > >
> > > +/**
> > > + * audit_log_uring - generate a AUDIT_URINGOP record
> > > + * @ctx: the audit context
> > > + */
> > > +static void audit_log_uring(struct audit_context *ctx)
> > > +{
> > > +     struct audit_buffer *ab;
> > > +     const struct cred *cred;
> > > +
> > > +     /*
> > > +      * TODO: What do we log here?  I'm tossing in a few things to start the
> > > +      *       conversation, but additional thought needs to go into this.
> > > +      */
> > > +
> > > +     ab = audit_log_start(ctx, GFP_KERNEL, AUDIT_URINGOP);
> > > +     if (!ab)
> > > +             return;
> > > +     cred = current_cred();
> >
> > This may need to be req->work.creds.  I haven't been following if the
> > io_uring thread inherited the user task's creds (and below, comm and
> > exe).
> 
> Nope, we're good.  See the existing code in io_issue_sqe() :)
> 
> > > +     audit_log_format(ab, "uring_op=%d", ctx->uring_op);
> >
> > arch is stored below in __audit_uring_entry() and never used in the
> > AUDIT_CTX_URING case.  That assignment can either be dropped or printed
> > before uring_op similar to the SYSCALL record.
> 
> Good point, I'll drop the code that records the arch from _entry(); it
> is really only useful to give the appropriate context if needed for
> other things in the audit stream, and that isn't the case like it is
> with syscalls.
> 
> > There aren't really any arg[0-3] to print.
> 
> Which is why I didn't print them.
> 
> > io_uring_register and io_uring_setup() args are better covered by other
> > records.  io_uring_enter() has 6 args and the last two aren't covered by
> > SYSCALL anyways.
> 
>  ???
> 
> I think you are confusing the io_uring ops with syscalls; they are
> very different things from an audit perspective and the io_uring
> auditing is not intended to replace syscall records.  The
> io_uring_setup() and io_uring_enter() syscalls will be auditing just
> as any other syscalls would be using the existing syscall audit code.
> 
> > > +     if (ctx->return_valid != AUDITSC_INVALID)
> > > +             audit_log_format(ab, " success=%s exit=%ld",
> > > +                              (ctx->return_valid == AUDITSC_SUCCESS ?
> > > +                               "yes" : "no"),
> > > +                              ctx->return_code);
> > > +     audit_log_format(ab,
> > > +                      " items=%d"
> > > +                      " ppid=%d pid=%d auid=%u uid=%u gid=%u"
> > > +                      " euid=%u suid=%u fsuid=%u"
> > > +                      " egid=%u sgid=%u fsgid=%u",
> > > +                      ctx->name_count,
> > > +                      task_ppid_nr(current),
> > > +                      task_tgid_nr(current),
> > > +                      from_kuid(&init_user_ns, audit_get_loginuid(current)),
> > > +                      from_kuid(&init_user_ns, cred->uid),
> > > +                      from_kgid(&init_user_ns, cred->gid),
> > > +                      from_kuid(&init_user_ns, cred->euid),
> > > +                      from_kuid(&init_user_ns, cred->suid),
> > > +                      from_kuid(&init_user_ns, cred->fsuid),
> > > +                      from_kgid(&init_user_ns, cred->egid),
> > > +                      from_kgid(&init_user_ns, cred->sgid),
> > > +                      from_kgid(&init_user_ns, cred->fsgid));
> >
> > The audit session ID is still important, relevant and qualifies auid.
> > In keeping with the SYSCALL record format, I think we want to keep
> > ses=audit_get_sessionid(current) in here.
> 
> This might be another case of syscall/io_uring confusion.  An io_uring
> op doesn't necessarily have an audit session ID or an audit UID in the
> conventional sense; for example think about SQPOLL works, shared
> rings, etc.

Right, but those syscalls are what instigate io_uring operations, so
whatever process starts that operation, or gets handed that handle
should be tracked with auid and sessionid (the two work together to
track) unless we can easily track io_uring ops to connect them to a
previous setup syscall.  If we see a need to keep the auid, then the
sessionid goes with it.

> > I'm pretty sure we also want to keep comm= and exe= too, but may have to
> > reach into req->task to get it.  There are two values for comm possible,
> > one from the original task and second "iou-sqp-<pid>" set at the top of
> > io_sq_thread().
> 
> I think this is more syscall/io_uring confusion.

I wouldn't call them confusion but rather parallels, attributing a
particular subject to an action.

> paul moore

- RGB

--
Richard Guy Briggs <[email protected]>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 2/9] audit, io_uring, io-wq: add some basic audit support to io_uring
  2021-08-25  1:21       ` Richard Guy Briggs
@ 2021-08-25 19:41         ` Paul Moore
  0 siblings, 0 replies; 71+ messages in thread
From: Paul Moore @ 2021-08-25 19:41 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: linux-security-module, selinux, linux-audit, io-uring,
	linux-fsdevel, Kumar Kartikeya Dwivedi, Jens Axboe,
	Alexander Viro

On Tue, Aug 24, 2021 at 9:21 PM Richard Guy Briggs <[email protected]> wrote:
>
> On 2021-06-02 13:46, Paul Moore wrote:
> > On Wed, Jun 2, 2021 at 1:29 PM Richard Guy Briggs <[email protected]> wrote:
> > > On 2021-05-21 17:49, Paul Moore wrote:
> > > > WARNING - This is a work in progress and should not be merged
> > > > anywhere important.  It is almost surely not complete, and while it
> > > > probably compiles it likely hasn't been booted and will do terrible
> > > > things.  You have been warned.
> > > >
> > > > This patch adds basic auditing to io_uring operations, regardless of
> > > > their context.  This is accomplished by allocating audit_context
> > > > structures for the io-wq worker and io_uring SQPOLL kernel threads
> > > > as well as explicitly auditing the io_uring operations in
> > > > io_issue_sqe().  The io_uring operations are audited using a new
> > > > AUDIT_URINGOP record, an example is shown below:
> > > >
> > > >   % <TODO - insert AUDIT_URINGOP record example>
> > > >
> > > > Thanks to Richard Guy Briggs for review and feedback.
> > > >
> > > > Signed-off-by: Paul Moore <[email protected]>
> > > > ---
> > > >  fs/io-wq.c                 |    4 +
> > > >  fs/io_uring.c              |   11 +++
> > > >  include/linux/audit.h      |   17 ++++
> > > >  include/uapi/linux/audit.h |    1
> > > >  kernel/audit.h             |    2 +
> > > >  kernel/auditsc.c           |  173 ++++++++++++++++++++++++++++++++++++++++++++
> > > >  6 files changed, 208 insertions(+)

...

> > > > +     if (ctx->return_valid != AUDITSC_INVALID)
> > > > +             audit_log_format(ab, " success=%s exit=%ld",
> > > > +                              (ctx->return_valid == AUDITSC_SUCCESS ?
> > > > +                               "yes" : "no"),
> > > > +                              ctx->return_code);
> > > > +     audit_log_format(ab,
> > > > +                      " items=%d"
> > > > +                      " ppid=%d pid=%d auid=%u uid=%u gid=%u"
> > > > +                      " euid=%u suid=%u fsuid=%u"
> > > > +                      " egid=%u sgid=%u fsgid=%u",
> > > > +                      ctx->name_count,
> > > > +                      task_ppid_nr(current),
> > > > +                      task_tgid_nr(current),
> > > > +                      from_kuid(&init_user_ns, audit_get_loginuid(current)),
> > > > +                      from_kuid(&init_user_ns, cred->uid),
> > > > +                      from_kgid(&init_user_ns, cred->gid),
> > > > +                      from_kuid(&init_user_ns, cred->euid),
> > > > +                      from_kuid(&init_user_ns, cred->suid),
> > > > +                      from_kuid(&init_user_ns, cred->fsuid),
> > > > +                      from_kgid(&init_user_ns, cred->egid),
> > > > +                      from_kgid(&init_user_ns, cred->sgid),
> > > > +                      from_kgid(&init_user_ns, cred->fsgid));
> > >
> > > The audit session ID is still important, relevant and qualifies auid.
> > > In keeping with the SYSCALL record format, I think we want to keep
> > > ses=audit_get_sessionid(current) in here.
> >
> > This might be another case of syscall/io_uring confusion.  An io_uring
> > op doesn't necessarily have an audit session ID or an audit UID in the
> > conventional sense; for example think about SQPOLL works, shared
> > rings, etc.
>
> Right, but those syscalls are what instigate io_uring operations, so
> whatever process starts that operation, or gets handed that handle
> should be tracked with auid and sessionid (the two work together to
> track) unless we can easily track io_uring ops to connect them to a
> previous setup syscall.  If we see a need to keep the auid, then the
> sessionid goes with it.

As a reminder, once the io_uring is created appropriately one can
issue io_uring operations without making a syscall.  Further, sharing
a io_uring across process boundaries means that both the audit session
ID and audit login UID used to create the io_uring might not be the
same as the subject which issues operations to the io_uring.

Any io_uring operations that happen synchronously as the result of a
syscall should be associated with the SYSCALL record so the session ID
and login UID will be part of the event.  Asynchronous operations will
not have that information because we don't have a way to get it.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2021-08-25 19:41 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-05-21 21:49 [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Paul Moore
2021-05-21 21:49 ` [RFC PATCH 1/9] audit: prepare audit_context for use in calling contexts beyond syscalls Paul Moore
2021-05-21 21:49 ` [RFC PATCH 2/9] audit,io_uring,io-wq: add some basic audit support to io_uring Paul Moore
2021-05-22  0:22   ` Pavel Begunkov
2021-05-22  2:36     ` Paul Moore
2021-05-23 20:26       ` Pavel Begunkov
2021-05-24 19:59         ` Paul Moore
2021-05-25  8:27           ` Pavel Begunkov
2021-05-25 14:53             ` Paul Moore
2021-05-26  1:11           ` Jens Axboe
2021-05-26  2:04             ` Paul Moore
2021-05-26 10:19               ` Pavel Begunkov
2021-05-26 14:38                 ` Paul Moore
2021-05-26 15:11                   ` Steve Grubb
2021-05-26 15:17                   ` Stefan Metzmacher
2021-05-26 15:49                     ` Richard Guy Briggs
2021-05-26 17:22                       ` Jens Axboe
2021-05-27 17:27                         ` Richard Guy Briggs
2021-05-26 15:49                     ` Victor Stewart
2021-05-26 16:38                       ` Casey Schaufler
2021-05-26 17:15               ` Jens Axboe
2021-05-26 17:31                 ` Jens Axboe
2021-05-26 17:54                   ` Jens Axboe
2021-05-26 18:01                     ` Jens Axboe
2021-05-26 18:44                       ` Paul Moore
2021-05-26 18:57                         ` Pavel Begunkov
2021-05-26 19:10                           ` Paul Moore
2021-05-26 19:44                         ` Jens Axboe
2021-05-26 20:19                           ` Paul Moore
2021-05-28 16:02                             ` Paul Moore
2021-06-02  8:26                               ` Pavel Begunkov
2021-06-02 15:46                                 ` Richard Guy Briggs
2021-06-03 10:39                                   ` Pavel Begunkov
2021-06-02 19:46                                 ` Paul Moore
2021-06-03 10:51                                   ` Pavel Begunkov
2021-06-03 15:54                                     ` Casey Schaufler
2021-06-03 15:54                               ` Jens Axboe
2021-06-04  5:04                                 ` Paul Moore
2021-05-26 18:38                     ` Paul Moore
2021-06-02 17:29   ` [RFC PATCH 2/9] audit, io_uring, io-wq: " Richard Guy Briggs
2021-06-02 20:46     ` Paul Moore
2021-08-25  1:21       ` Richard Guy Briggs
2021-08-25 19:41         ` Paul Moore
2021-05-21 21:50 ` [RFC PATCH 3/9] audit: dev/test patch to force io_uring auditing Paul Moore
2021-05-21 21:50 ` [RFC PATCH 4/9] audit: add filtering for io_uring records Paul Moore
2021-05-28 22:35   ` Richard Guy Briggs
2021-05-30 15:26     ` Paul Moore
2021-05-31 13:44       ` Richard Guy Briggs
2021-06-02  1:40         ` Paul Moore
2021-06-02 15:37           ` Richard Guy Briggs
2021-06-02 17:20             ` Paul Moore
2021-05-31 13:44       ` [PATCH 1/2] audit: add filtering for io_uring records, addendum Richard Guy Briggs
2021-05-31 16:08         ` kernel test robot
2021-05-31 17:38         ` kernel test robot
2021-06-07 23:15         ` Paul Moore
2021-06-08 12:55           ` Richard Guy Briggs
2021-06-09  2:45             ` Paul Moore
2021-05-31 13:44       ` [PATCH 2/2] audit: block PERM fields being used with io_uring filtering Richard Guy Briggs
2021-05-21 21:50 ` [RFC PATCH 5/9] fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure() Paul Moore
2021-05-21 21:50 ` [RFC PATCH 6/9] io_uring: convert io_uring to the secure anon inode interface Paul Moore
2021-05-21 21:50 ` [RFC PATCH 7/9] lsm,io_uring: add LSM hooks to io_uring Paul Moore
2021-05-26 14:48   ` Stefan Metzmacher
2021-05-26 20:45     ` Paul Moore
2021-05-21 21:50 ` [RFC PATCH 8/9] selinux: add support for the io_uring access controls Paul Moore
2021-05-21 21:50 ` [RFC PATCH 9/9] Smack: Brutalist io_uring support with debug Paul Moore
2021-05-22  0:53 ` [RFC PATCH 0/9] Add LSM access controls and auditing to io_uring Tetsuo Handa
2021-05-22  2:06   ` Paul Moore
2021-05-26 15:00 ` Jeff Moyer
2021-05-26 18:49   ` Paul Moore
2021-05-26 19:07     ` Jeff Moyer
2021-05-26 19:10       ` Paul Moore

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox