* [RFC] struct filename, io_uring and audit troubles @ 2024-09-22 0:49 Al Viro 2024-09-22 4:10 ` Al Viro 2024-09-23 1:50 ` Al Viro 0 siblings, 2 replies; 23+ messages in thread From: Al Viro @ 2024-09-22 0:49 UTC (permalink / raw) To: linux-fsdevel; +Cc: audit, io-uring Looks like things like async unlink might fuck the audit very badly. io_uring does getname() in originating thread and uses the result at the time of operation, which can happen in a different thread. Moreover, by that time the original syscall might have very well returned. The trouble is, getname() establishes linkage between the struct filename and struct audit_name; filename->aname and audit_name->name respectively. struct filename can get moved from one thread to another; struct audit_name is very much tied to audit_context, which is per-thread - first few (5, currently) audit_name instances are embedded into audit_context. The rest gets allocated dynamically, but all of them are placed on audit_context::names_list. At audit_free_names() they are all wiped out - references back to filename instances are dropped, dynamically allocated ones are freed, and while embedded ones survive, they are zeroed out on reuse by audit_alloc_name(). audit_free_names() is called on each audit_reset_context(), which is done by __audit_syscall_exit() and (in states other than AUDIT_SYSCALL_CONTEXT) __audit_uring_exit(). Linkage from filename to audit_name is used by __audit_inode(). It definitely expects the reference back to filename to be stable. And in situation when io_uring has offloaded a directory operation to helper thread, that is not guaranteed. Another fun bit is that both audit_inode() and audit_inode_child() may bump the refcount on struct filename. Which can get really fishy if they get called by helper thread while the originator is exiting the syscall - putname() from audit_free_names() in originator vs. refcount increment in helper is Not Nice(tm), what with the refcount not being atomic. Potential solutions: * don't bother with audit_name creation and linkage in getname(); do that when we start using the sucker. Doing that from __set_nameidata() will catch the majority of the stuff that ever gets audit_inode* called for it (the only exceptions are mq_open(2) and mq_unlink(2)). Unfortunately, each audit_name instance gets spewed into logs, so we would need to bring the rest of that shite in, including the things like symlink bodies (note that for io_uring-originating symlink we'd need that done in do_symlinkat()), etc. Unpleasant, that. * make refcount atomic, add a pointer to audit_context or even task_struct in audit_name, have the "use name->aname if the type is acceptable" logics in audit_inode dependent upon the name->aname->owner matching what we want. With some locking to make the check itself safe. * make refcount atomic, get rid of ->aname and have audit_inode() scan the names_list for entries with matching ->name and type - and that before the existing scan with ->name->name comparisons. * something else? Suggestions _not_ involving creative uses of TARDIS would be welcome. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-22 0:49 [RFC] struct filename, io_uring and audit troubles Al Viro @ 2024-09-22 4:10 ` Al Viro 2024-09-22 15:09 ` Al Viro 2024-09-23 1:50 ` Al Viro 1 sibling, 1 reply; 23+ messages in thread From: Al Viro @ 2024-09-22 4:10 UTC (permalink / raw) To: linux-fsdevel; +Cc: audit, io-uring On Sun, Sep 22, 2024 at 01:49:01AM +0100, Al Viro wrote: > * don't bother with audit_name creation and linkage in getname(); do that > when we start using the sucker. Doing that from __set_nameidata() will > catch the majority of the stuff that ever gets audit_inode* called for it > (the only exceptions are mq_open(2) and mq_unlink(2)). Unfortunately, > each audit_name instance gets spewed into logs, so we would need to > bring the rest of that shite in, including the things like symlink > bodies (note that for io_uring-originating symlink we'd need that done > in do_symlinkat()), etc. Unpleasant, that. BTW, how much is exact order and number of PATH items in audit logs cast in stone? For example, char s[2][20] = {"/tmp/blah/x", "/tmp/blah/x"}; rename(s[0], s[1]); rename(s[0], s[0]); produces 2 items (both for /tmp/blah) on the first call, and only 1 on the second. Does anything care about preserving that distinction? And what in audit_inode{,_child}() behaviour can be changed? I mean, does anything care about the loop directions when we pick the candidate audit_name for conversion, etc.? It's been a long time since I've touched audit, and I have done my best to purge memories of the specifications ;-/ ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-22 4:10 ` Al Viro @ 2024-09-22 15:09 ` Al Viro 0 siblings, 0 replies; 23+ messages in thread From: Al Viro @ 2024-09-22 15:09 UTC (permalink / raw) To: linux-fsdevel; +Cc: audit, io-uring On Sun, Sep 22, 2024 at 05:10:06AM +0100, Al Viro wrote: > On Sun, Sep 22, 2024 at 01:49:01AM +0100, Al Viro wrote: > > > * don't bother with audit_name creation and linkage in getname(); do that > > when we start using the sucker. Doing that from __set_nameidata() will > > catch the majority of the stuff that ever gets audit_inode* called for it > > (the only exceptions are mq_open(2) and mq_unlink(2)). Unfortunately, > > each audit_name instance gets spewed into logs, so we would need to > > bring the rest of that shite in, including the things like symlink > > bodies (note that for io_uring-originating symlink we'd need that done > > in do_symlinkat()), etc. Unpleasant, that. > > BTW, how much is exact order and number of PATH items in audit logs cast > in stone? > > For example, > char s[2][20] = {"/tmp/blah/x", "/tmp/blah/x"}; > rename(s[0], s[1]); > rename(s[0], s[0]); > > produces 2 items (both for /tmp/blah) on the first call, and only 1 on > the second. Does anything care about preserving that distinction? > > And what in audit_inode{,_child}() behaviour can be changed? I mean, does > anything care about the loop directions when we pick the candidate > audit_name for conversion, etc.? > > It's been a long time since I've touched audit, and I have done my best > to purge memories of the specifications ;-/ Speaking of which, is there anything in specs that would require -F obj_uid=0 to give a match on symlink("foo", "bar") done by non-root in non-root-owned directory, but not on symlink("foo", "foo") in the same conditions? Put it another way, could we possibly make the predicates that refer to inode state *not* evaluate to true on audit_name instances that had never been associated with any inodes? Currently, symlink body has such an audit_name instance, except when that instance had been picked by audit_inode_parent() for conversion (or had been shared from the very beginning in case of symlink(2) that had identical pointers passed in both arguments). Al, carefully abstaining from any comments on the sanity of said specifications - it's a government document, after all, so those should be obvious... ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-22 0:49 [RFC] struct filename, io_uring and audit troubles Al Viro 2024-09-22 4:10 ` Al Viro @ 2024-09-23 1:50 ` Al Viro 2024-09-23 6:30 ` Jens Axboe 1 sibling, 1 reply; 23+ messages in thread From: Al Viro @ 2024-09-23 1:50 UTC (permalink / raw) To: linux-fsdevel; +Cc: audit, io-uring On Sun, Sep 22, 2024 at 01:49:01AM +0100, Al Viro wrote: > Another fun bit is that both audit_inode() and audit_inode_child() > may bump the refcount on struct filename. Which can get really fishy > if they get called by helper thread while the originator is exiting the > syscall - putname() from audit_free_names() in originator vs. refcount > increment in helper is Not Nice(tm), what with the refcount not being > atomic. *blink* OK, I really wonder which version had I been reading at the time; refcount is, indeed, atomic these days. Other problems (->aname pointing to other thread's struct audit_names and outliving reuse of those, as well as insane behaviour of audit predicates on symlink(2)) are, unfortunately, quite real - on the current mainline. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-23 1:50 ` Al Viro @ 2024-09-23 6:30 ` Jens Axboe 2024-09-23 12:54 ` Paul Moore 2024-09-23 15:07 ` Al Viro 0 siblings, 2 replies; 23+ messages in thread From: Jens Axboe @ 2024-09-23 6:30 UTC (permalink / raw) To: Al Viro, linux-fsdevel; +Cc: audit, io-uring On 9/22/24 7:50 PM, Al Viro wrote: > On Sun, Sep 22, 2024 at 01:49:01AM +0100, Al Viro wrote: > >> Another fun bit is that both audit_inode() and audit_inode_child() >> may bump the refcount on struct filename. Which can get really fishy >> if they get called by helper thread while the originator is exiting the >> syscall - putname() from audit_free_names() in originator vs. refcount >> increment in helper is Not Nice(tm), what with the refcount not being >> atomic. > > *blink* > > OK, I really wonder which version had I been reading at the time; refcount > is, indeed, atomic these days. > > Other problems (->aname pointing to other thread's struct audit_names > and outliving reuse of those, as well as insane behaviour of audit predicates > on symlink(2)) are, unfortunately, quite real - on the current mainline. Traveling but took a quick look. As far as I can tell, for the "reuse someone elses aname", we could do either: 1) Just don't reuse the entry. Then we can drop the struct filename->aname completely as well. Yes that might incur an extra alloc for the odd case of audit_enabled and being deep enough that the preallocated names have been used, but doesn't anyone really care? It'll be noise in the overhead anyway. Side note - that would unalign struct filename again. Would be nice to drop audit_names from a core fs struct... 2) Add a ref to struct audit_names, RCU kfree it when it drops to zero. This would mean dropping struct audit_context->preallocated_names, as otherwise we'd run into trouble there if a context gets blown away while someone else has a ref to that audit_names struct. We could do this without a ref as well, as long as we can store an audit_context pointer in struct audit_names and be able to validate it under RCU. If ctx doesn't match, don't use it. And probably other ways too, those were just the two immediate ones I thought it. Seems like option 1 is simpler and just fine? Quick hack: diff --git a/fs/namei.c b/fs/namei.c index 891b169e38c9..11263f779b96 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -206,7 +206,6 @@ getname_flags(const char __user *filename, int flags) atomic_set(&result->refcnt, 1); result->uptr = filename; - result->aname = NULL; audit_getname(result); return result; } @@ -254,7 +253,6 @@ getname_kernel(const char * filename) } memcpy((char *)result->name, filename, len); result->uptr = NULL; - result->aname = NULL; atomic_set(&result->refcnt, 1); audit_getname(result); diff --git a/include/linux/fs.h b/include/linux/fs.h index 0df3e5f0dd2b..859244c877b4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2685,10 +2685,8 @@ struct filename { const char *name; /* pointer to actual string */ const __user char *uptr; /* original userland pointer */ atomic_t refcnt; - struct audit_names *aname; const char iname[]; }; -static_assert(offsetof(struct filename, iname) % sizeof(long) == 0); static inline struct mnt_idmap *file_mnt_idmap(const struct file *file) { diff --git a/kernel/auditsc.c b/kernel/auditsc.c index cd57053b4a69..09caf8408225 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -2240,7 +2240,6 @@ void __audit_getname(struct filename *name) n->name = name; n->name_len = AUDIT_NAME_FULL; - name->aname = n; atomic_inc(&name->refcnt); } @@ -2325,22 +2324,6 @@ void __audit_inode(struct filename *name, const struct dentry *dentry, if (!name) goto out_alloc; - /* - * If we have a pointer to an audit_names entry already, then we can - * just use it directly if the type is correct. - */ - n = name->aname; - if (n) { - if (parent) { - if (n->type == AUDIT_TYPE_PARENT || - n->type == AUDIT_TYPE_UNKNOWN) - goto out; - } else { - if (n->type != AUDIT_TYPE_PARENT) - goto out; - } - } - list_for_each_entry_reverse(n, &context->names_list, list) { if (n->ino) { /* valid inode number, use that for the comparison */ -- Jens Axboe ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-23 6:30 ` Jens Axboe @ 2024-09-23 12:54 ` Paul Moore 2024-09-23 14:48 ` Al Viro 2024-09-23 15:07 ` Al Viro 1 sibling, 1 reply; 23+ messages in thread From: Paul Moore @ 2024-09-23 12:54 UTC (permalink / raw) To: Jens Axboe, Al Viro; +Cc: linux-fsdevel, audit, io-uring On Mon, Sep 23, 2024 at 2:31 AM Jens Axboe <[email protected]> wrote: > On 9/22/24 7:50 PM, Al Viro wrote: > > On Sun, Sep 22, 2024 at 01:49:01AM +0100, Al Viro wrote: > > > >> Another fun bit is that both audit_inode() and audit_inode_child() > >> may bump the refcount on struct filename. Which can get really fishy > >> if they get called by helper thread while the originator is exiting the > >> syscall - putname() from audit_free_names() in originator vs. refcount > >> increment in helper is Not Nice(tm), what with the refcount not being > >> atomic. > > > > *blink* > > > > OK, I really wonder which version had I been reading at the time; refcount > > is, indeed, atomic these days. > > > > Other problems (->aname pointing to other thread's struct audit_names > > and outliving reuse of those, as well as insane behaviour of audit predicates > > on symlink(2)) are, unfortunately, quite real - on the current mainline. > > Traveling but took a quick look. As far as I can tell, for the "reuse > someone elses aname", we could do either: > > 1) Just don't reuse the entry. Then we can drop the struct > filename->aname completely as well. Yes that might incur an extra > alloc for the odd case of audit_enabled and being deep enough that > the preallocated names have been used, but doesn't anyone really > care? It'll be noise in the overhead anyway. Side note - that would > unalign struct filename again. Would be nice to drop audit_names from > a core fs struct... > > 2) Add a ref to struct audit_names, RCU kfree it when it drops to zero. > This would mean dropping struct audit_context->preallocated_names, as > otherwise we'd run into trouble there if a context gets blown away > while someone else has a ref to that audit_names struct. We could do > this without a ref as well, as long as we can store an audit_context > pointer in struct audit_names and be able to validate it under RCU. > If ctx doesn't match, don't use it. > > And probably other ways too, those were just the two immediate ones I > thought it. Seems like option 1 is simpler and just fine? Quick hack: [Sorry for the delay, between flying back home, and just not wanting to think about the kernel for a day, I took the weekend "off".] Jens and I have talked about similar issues in the past, and I think the only real solution to ensure the correctness of the audit records and provide some consistency between the io_uring approach and traditional syscalls, is to introduce a mechanism where we create/clone an audit_context in the io_uring prep stage to capture things like PATH records, stash that audit_context in the io_kiocb struct, and then restore it later when io_uring does/finishes the operation. I'm reasonably confident that we don't need to do it for all of the io_uring ops, just the !audit_skip case. I'm always open to ideas, but everything else I can think of is either far too op-specific to be maintainable long term, a performance nightmare, or just plain wrong with respect to the audit records. I keep hoping to have some time to code it up properly, but so far this year has been an exercise in "I'll just put this fire over here with the other fire". Believe it or not, this is at the top of my TODO list, perhaps this week I can dedicate some time to this. -- paul-moore.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-23 12:54 ` Paul Moore @ 2024-09-23 14:48 ` Al Viro 2024-09-23 16:14 ` Paul Moore 0 siblings, 1 reply; 23+ messages in thread From: Al Viro @ 2024-09-23 14:48 UTC (permalink / raw) To: Paul Moore; +Cc: Jens Axboe, linux-fsdevel, audit, io-uring On Mon, Sep 23, 2024 at 08:54:03AM -0400, Paul Moore wrote: > [Sorry for the delay, between flying back home, and just not wanting > to think about the kernel for a day, I took the weekend "off".] > > Jens and I have talked about similar issues in the past, and I think > the only real solution to ensure the correctness of the audit records > and provide some consistency between the io_uring approach and > traditional syscalls, is to introduce a mechanism where we > create/clone an audit_context in the io_uring prep stage to capture > things like PATH records, stash that audit_context in the io_kiocb > struct, and then restore it later when io_uring does/finishes the > operation. I'm reasonably confident that we don't need to do it for > all of the io_uring ops, just the !audit_skip case. > > I'm always open to ideas, but everything else I can think of is either > far too op-specific to be maintainable long term, a performance > nightmare, or just plain wrong with respect to the audit records. > > I keep hoping to have some time to code it up properly, but so far > this year has been an exercise in "I'll just put this fire over here > with the other fire". Believe it or not, this is at the top of my > TODO list, perhaps this week I can dedicate some time to this. What are the requirements regarding the order of audit_names in the ->names_list? I really don't like the idea of having struct filename tied to audit_context - io_uring is not the only context where it might make sense to treat struct filename as first-class citizens. And having everything that passed through getname()/getname_kernel() shoved into ->names_list leads to very odd behaviour, especially with audit_names conversions in audit_inode()/audit_inode_child(). Look at the handling of AUDIT_DEV{MAJOR,MINOR} or AUDIT_OBJ_{UID,GID} or AUDIT_COMPARE_..._TO_OBJ; should they really apply to audit_names resulting from copying the symlink body into the kernel? And if they should be applied to audit_names instance that had never been associated with any inode, should that depend upon the string in those being equal to another argument of the same syscall? I'm going through the kernel/auditsc.c right now, but it's more of a "document what it does" - I don't have the specs and I certainly don't remember such details. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-23 14:48 ` Al Viro @ 2024-09-23 16:14 ` Paul Moore 2024-09-23 18:17 ` Al Viro 2024-09-23 20:36 ` Al Viro 0 siblings, 2 replies; 23+ messages in thread From: Paul Moore @ 2024-09-23 16:14 UTC (permalink / raw) To: Al Viro; +Cc: Jens Axboe, linux-fsdevel, audit, io-uring On Mon, Sep 23, 2024 at 10:48 AM Al Viro <[email protected]> wrote: > On Mon, Sep 23, 2024 at 08:54:03AM -0400, Paul Moore wrote: > > [Sorry for the delay, between flying back home, and just not wanting > > to think about the kernel for a day, I took the weekend "off".] > > > > Jens and I have talked about similar issues in the past, and I think > > the only real solution to ensure the correctness of the audit records > > and provide some consistency between the io_uring approach and > > traditional syscalls, is to introduce a mechanism where we > > create/clone an audit_context in the io_uring prep stage to capture > > things like PATH records, stash that audit_context in the io_kiocb > > struct, and then restore it later when io_uring does/finishes the > > operation. I'm reasonably confident that we don't need to do it for > > all of the io_uring ops, just the !audit_skip case. > > > > I'm always open to ideas, but everything else I can think of is either > > far too op-specific to be maintainable long term, a performance > > nightmare, or just plain wrong with respect to the audit records. > > > > I keep hoping to have some time to code it up properly, but so far > > this year has been an exercise in "I'll just put this fire over here > > with the other fire". Believe it or not, this is at the top of my > > TODO list, perhaps this week I can dedicate some time to this. > > What are the requirements regarding the order of audit_names in > the ->names_list? Uncertain. As things currently stand there isn't really an explicit ordering between the PATH records and the syscall, there is the implicit order in which the PATH records appear in the event, but I don't know that I would read too much into that. From my point of view, stuff like that is largely driven by enterprise distros chasing 3rd party security certifications so they can sell products/services to a certain class of users. These are the same enterprise distros that haven't really bothered to contribute a lot to the upstream Linux kernel's audit subsystem lately so I'm not going to worry too much about them at this point. > I really don't like the idea of having struct filename > tied to audit_context - io_uring is not the only context where it might > make sense to treat struct filename as first-class citizens. From an audit perspective this is larger than just path walks, if I recall previous conversations correctly, there was also an issue with sockaddrs (slightly different problem, same solution). I'm not opposed to seeing better support for struct filename, I think in a lot of cases that would make things easier for audit (especially where I would like to take audit ... eventually). Assuming your ideas for struct filename don't significantly break audit you can consider me supportive so long as we still have a way to take a struct filename reference inside the audit_context; we need to keep that ref until syscall/io_uring-op exit time as we can't be certain if we need to log the PATH until we know the success/fail status of the operation (among other things). > And having everything that passed through getname()/getname_kernel() > shoved into ->names_list leads to very odd behaviour, especially with > audit_names conversions in audit_inode()/audit_inode_child(). > > Look at the handling of AUDIT_DEV{MAJOR,MINOR} or AUDIT_OBJ_{UID,GID} > or AUDIT_COMPARE_..._TO_OBJ; should they really apply to audit_names > resulting from copying the symlink body into the kernel? And if they > should be applied to audit_names instance that had never been associated > with any inode, should that depend upon the string in those being > equal to another argument of the same syscall? > > I'm going through the kernel/auditsc.c right now, but it's more of > a "document what it does" - I don't have the specs and I certainly > don't remember such details. My approach to audit is "do what makes sense for a normal person", if somebody needs silly behavior to satisfy some security cert then they can get involved in upstream development and send me patches that don't suck. -- paul-moore.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-23 16:14 ` Paul Moore @ 2024-09-23 18:17 ` Al Viro 2024-09-23 23:49 ` Paul Moore 2024-09-23 20:36 ` Al Viro 1 sibling, 1 reply; 23+ messages in thread From: Al Viro @ 2024-09-23 18:17 UTC (permalink / raw) To: Paul Moore; +Cc: Jens Axboe, linux-fsdevel, audit, io-uring On Mon, Sep 23, 2024 at 12:14:29PM -0400, Paul Moore wrote: > > And having everything that passed through getname()/getname_kernel() > > shoved into ->names_list leads to very odd behaviour, especially with > > audit_names conversions in audit_inode()/audit_inode_child(). > > > > Look at the handling of AUDIT_DEV{MAJOR,MINOR} or AUDIT_OBJ_{UID,GID} > > or AUDIT_COMPARE_..._TO_OBJ; should they really apply to audit_names > > resulting from copying the symlink body into the kernel? And if they > > should be applied to audit_names instance that had never been associated > > with any inode, should that depend upon the string in those being > > equal to another argument of the same syscall? > > > > I'm going through the kernel/auditsc.c right now, but it's more of > > a "document what it does" - I don't have the specs and I certainly > > don't remember such details. > > My approach to audit is "do what makes sense for a normal person", if > somebody needs silly behavior to satisfy some security cert then they > can get involved in upstream development and send me patches that > don't suck. root@kvm1:/tmp# auditctl -l -a always,exit -S all -F dir=/tmp/blah -F perm=rwxa -F obj_uid=0 root@kvm1:/tmp# su - al al@kvm1:~$ cd /tmp/blah al@kvm1:/tmp/blah$ ln -s a a al@kvm1:/tmp/blah$ ln -s c b al@kvm1:/tmp/blah$ ls -l total 0 lrwxrwxrwx 1 al al 1 Sep 23 13:44 a -> a lrwxrwxrwx 1 al al 1 Sep 23 13:44 b -> c al@kvm1:/tmp/blah$ ls -ld drwxr-xr-x 2 al al 4096 Sep 23 13:44 . resulting in type=USER_START msg=audit(1727113434.684:497): pid=3433 uid=0 auid=0 ses=1 subj=unconfined msg='op=PAM:session_open grantors=pam_key init,pam_env,pam_env,pam_mail,pam_limits,pam_permit,pam_unix,pam_ecryptfs acct="al" exe="/usr/bin/su" hostname=? addr=? terminal=/de v/pts/0 res=success'^]UID="root" AUID="root" type=SYSCALL msg=audit(1727113466.464:498): arch=c000003e syscall=266 success=yes exit=0 a0=7ffc441ad843 a1=ffffff9c a2=7ffc441ad845 a3=7f10e4816a90 items=3 ppid=3434 pid=3449 auid=0 uid=1000 gid=1000 euid=1000 suid=1000 fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=pts0 ses=1 comm="ln" exe="/usr/bin/ln" subj=unconfined key=(null)^]ARCH=x86_64 SYSCALL=symlinkat AUID="root" UID="al" GID="al" EUID="al" SUID="al" FSUID="al" EGID="al" SGID="al" FSGID="al" type=CWD msg=audit(1727113466.464:498): cwd="/tmp/blah" type=PATH msg=audit(1727113466.464:498): item=0 name="/tmp/blah" inode=135460 dev=08:14 mode=041777 ouid=1000 ogid=1000 rdev=00:00 nametype=PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0^]OUID="al" OGID="al" type=PATH msg=audit(1727113466.464:498): item=1 name="c" nametype=UNKNOWN cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0 type=PATH msg=audit(1727113466.464:498): item=2 name="b" inode=135497 dev=08:14 mode=0120777 ouid=1000 ogid=1000 rdev=00:00 nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0^]OUID="al" OGID="al" type=PROCTITLE msg=audit(1727113466.464:498): proctitle=6C6E002D7300630062 type=USER_END msg=audit(1727113560.852:499): pid=3433 uid=0 auid=0 ses=1 subj=unconfined msg='op=PAM:session_close grantors=pam_keyinit,pam_env,pam_env,pam_mail,pam_limits,pam_permit,pam_unix,pam_ecryptfs acct="al" exe="/usr/bin/su" hostname=? addr=? terminal=/dev/pts/0 res=success'^]UID="root" AUID="root" Note that * none of the filesystem objects involved have UID 0 * of two calls of symlinkat(), only the second one triggers the rule ... and removing the -F obj_uid=0 from the rule catches both (as expected), with the first one getting *two* items instead of 3 - there's PARENT (identical to the second call), there's CREATE (for "a" instead of "b", obviously) and there's no UNKNOWN with the symlink body. Does the behaviour above make sense for a normal person, by your definition? What happens is that we start with two UNKNOWN (for oldname and newname arguments of symlinkat(); incidentally, for the order is undefined - it's up to compiler which argument of do_symlinkat(getname(oldname), newdfd, getname(newname)) gets evaluated first). When we get through the call of __filename_parentat(), we get the newname's audit_names instance hit by audit_inode(...., AUDIT_INODE_PARENT). That converts the one for newname from UNKNOWN to PARENT. Then, in may_create(), we hit it with audit_inode_child() for AUDIT_TYPE_CHILD_CREATING. Which finds the parent (newname's audit_names, converted to PARENT by that point) and goes looking for child. That's where the execution paths diverge - for symlink("a", "a") the oldname's audit_names instance looks like a valid candidate and gets converted to CHILD_CREATING. For symlink("c", "b") such candidate is there, so we get a new item allocated - CHILD_CREATING for "b". Name reference gets cloned from PARENT. Now, ->symlink() is done, it succeeds and fsnotify_create(dir, dentry) gets called. We get an identical audit_inode_child() call, except that now dentry is positive. It finds PARENT instance, finds CHILD_CREATING one, and slaps the inode metadata into it, setting ->uid to non-zero. As the result, in one case you are left with PARENT, UNKNOWN, CHILD_CREATING and in another - PARENT, CHILD_CREATING. Modulo reordering first two items if compiler decides to evaluate the first argument of do_symlinkat() call before the third one... In any case, AUDIT_OBJ_UID(0) gives a match on UNKNOWN - if it's there. Since both PARENT and CHILD_CREATING refer to non-root-owned inodes, AUDIT_OBJ_UID(0) does not match either... ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-23 18:17 ` Al Viro @ 2024-09-23 23:49 ` Paul Moore 0 siblings, 0 replies; 23+ messages in thread From: Paul Moore @ 2024-09-23 23:49 UTC (permalink / raw) To: Al Viro; +Cc: Jens Axboe, linux-fsdevel, audit, io-uring On Mon, Sep 23, 2024 at 2:17 PM Al Viro <[email protected]> wrote: > On Mon, Sep 23, 2024 at 12:14:29PM -0400, Paul Moore wrote: > > > > And having everything that passed through getname()/getname_kernel() > > > shoved into ->names_list leads to very odd behaviour, especially with > > > audit_names conversions in audit_inode()/audit_inode_child(). > > > > > > Look at the handling of AUDIT_DEV{MAJOR,MINOR} or AUDIT_OBJ_{UID,GID} > > > or AUDIT_COMPARE_..._TO_OBJ; should they really apply to audit_names > > > resulting from copying the symlink body into the kernel? And if they > > > should be applied to audit_names instance that had never been associated > > > with any inode, should that depend upon the string in those being > > > equal to another argument of the same syscall? > > > > > > I'm going through the kernel/auditsc.c right now, but it's more of > > > a "document what it does" - I don't have the specs and I certainly > > > don't remember such details. > > > > My approach to audit is "do what makes sense for a normal person", if > > somebody needs silly behavior to satisfy some security cert then they > > can get involved in upstream development and send me patches that > > don't suck. > > root@kvm1:/tmp# auditctl -l > -a always,exit -S all -F dir=/tmp/blah -F perm=rwxa -F obj_uid=0 > root@kvm1:/tmp# su - al > al@kvm1:~$ cd /tmp/blah > al@kvm1:/tmp/blah$ ln -s a a > al@kvm1:/tmp/blah$ ln -s c b > al@kvm1:/tmp/blah$ ls -l > total 0 > lrwxrwxrwx 1 al al 1 Sep 23 13:44 a -> a > lrwxrwxrwx 1 al al 1 Sep 23 13:44 b -> c > al@kvm1:/tmp/blah$ ls -ld > drwxr-xr-x 2 al al 4096 Sep 23 13:44 . ... > Note that > * none of the filesystem objects involved have UID 0 > * of two calls of symlinkat(), only the second one triggers the rule > ... and removing the -F obj_uid=0 from the rule catches both (as expected), > with the first one getting *two* items instead of 3 - there's PARENT (identical > to the second call), there's CREATE (for "a" instead of "b", obviously) and > there's no UNKNOWN with the symlink body. > > Does the behaviour above make sense for a normal person, by your definition? I never said that there aren't plenty of bugs in the current implementation. Really. Audit is a train wreck for so many reasons, most of my time spent with audit is spent making sure it doesn't panic, or some other subsystem hasn't wrecked it even further. If you're looking for someone to justify audit's behavior here, you'll need to keep looking. Personally I've got plans to burn the whole thing down if I ever get enough time to work on it, but in the meantime I want to try and make it as usable as possible. Patches are always welcome. -- paul-moore.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-23 16:14 ` Paul Moore 2024-09-23 18:17 ` Al Viro @ 2024-09-23 20:36 ` Al Viro 2024-09-24 0:11 ` Paul Moore 2024-09-24 21:40 ` Al Viro 1 sibling, 2 replies; 23+ messages in thread From: Al Viro @ 2024-09-23 20:36 UTC (permalink / raw) To: Paul Moore; +Cc: Jens Axboe, linux-fsdevel, audit, io-uring On Mon, Sep 23, 2024 at 12:14:29PM -0400, Paul Moore wrote: [ordering and number of PATH items for syscall] > >From my point of view, stuff like that is largely driven by enterprise > distros chasing 3rd party security certifications so they can sell > products/services to a certain class of users. These are the same > enterprise distros that haven't really bothered to contribute a lot to > the upstream Linux kernel's audit subsystem lately so I'm not going to > worry too much about them at this point. Umm... IIRC, sgrubb had been involved in the spec-related horrors, but that was a long time ago... > where I would like to take audit ... eventually). Assuming your ideas > for struct filename don't significantly break audit you can consider > me supportive so long as we still have a way to take a struct filename > reference inside the audit_context; we need to keep that ref until > syscall/io_uring-op exit time as we can't be certain if we need to log > the PATH until we know the success/fail status of the operation (among > other things). OK... As for what I would like to do: * go through the VFS side of things and make sure we have a consistent set of helpers that would take struct filename * - *not* the ad-hoc mix we have right now, when that's basically driven by io_uring borging them in one by one - or duplicates them without bothering to share helpers. E.g. mkdirat(2) does getname() and passes it to do_mkdirat(), which consumes the sucker; so does mknodat(2), but do_mknodat() is static. OTOH, path_setxattr() does setxattr_copy(), then retry_estale loop with user_path_at() + mnt_want_write() + do_setxattr() + mnt_drop_write() + path_put() as a body, while on io_uring side we have retry_estale loop with filename_lookup() + (io_uring helper that does mnt_want_write() + do_setxattr() + mnt_drop_write()) + path_put(). Sure, that user_path_at() call is getname() + filename_lookup() + putname(), so they are equivalent, but keeping that shite in sync is going to be trouble. * get rid of the "repeated getname() on the same address is going to give you the same object" - that can't be relied upon without audit, for one thing and for another... having a syscall that takes two pathnames that gives different audit log (if not predicate evaluation) in cases when those are identical pointers vs. strings with identical contenst is, IMO, somewhat undesirable. That kills filename->uaddr. * looking at the users of that stuff, I would probably prefer to separate getname*() from insertion into audit context. It's not that tricky - __set_nameidata() catches *everything* that uses nd->name (i.e. all that audit_inode() calls in fs/namei.c use). What remains is do_symlinkat() for symlink body fs_index() on the argument (if we want to bother - it's a part of weird Missed'em'V sysfs(2) syscall; I sincerely doubt that there's anybody who'd use it) fsconfig(2) FSCONFIG_SET_PATH handling. mq_open(2) and mq_unlink(2) - those bypass the normal pathwalk logics, so __set_nameidata() won't catch them. _maybe_ alpha osf_mount(2) devname argument; or we could get rid of that stupidity and have it use copy_mount_string() like mount(2) does, instead of messing with getname(). That's all it takes. With that done, we can kill ->aname; just look in the ->names_list for the first entry with given ->name - as in, given struct filename * value, no need to look inside. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-23 20:36 ` Al Viro @ 2024-09-24 0:11 ` Paul Moore 2024-09-24 7:01 ` Al Viro 2024-09-25 20:44 ` Al Viro 2024-09-24 21:40 ` Al Viro 1 sibling, 2 replies; 23+ messages in thread From: Paul Moore @ 2024-09-24 0:11 UTC (permalink / raw) To: Al Viro; +Cc: Jens Axboe, linux-fsdevel, audit, io-uring On Mon, Sep 23, 2024 at 4:37 PM Al Viro <[email protected]> wrote: > On Mon, Sep 23, 2024 at 12:14:29PM -0400, Paul Moore wrote: > > [ordering and number of PATH items for syscall] > > > >From my point of view, stuff like that is largely driven by enterprise > > distros chasing 3rd party security certifications so they can sell > > products/services to a certain class of users. These are the same > > enterprise distros that haven't really bothered to contribute a lot to > > the upstream Linux kernel's audit subsystem lately so I'm not going to > > worry too much about them at this point. > > Umm... IIRC, sgrubb had been involved in the spec-related horrors, but > that was a long time ago... Yep, he was. Last I spoke to Steve a year or so ago, audit was no longer part of his job description; Steve still maintains his userspace audit tools, but that is a nights/weekends job as far as I understand. The last time I was involved in any audit/CC spec related work was well over a decade ago now, and all of those CC protection profiles have long since expired and been replaced. > > where I would like to take audit ... eventually). Assuming your ideas > > for struct filename don't significantly break audit you can consider > > me supportive so long as we still have a way to take a struct filename > > reference inside the audit_context; we need to keep that ref until > > syscall/io_uring-op exit time as we can't be certain if we need to log > > the PATH until we know the success/fail status of the operation (among > > other things). > > OK... As for what I would like to do: > > * go through the VFS side of things and make sure we have a consistent > set of helpers that would take struct filename * - *not* the ad-hoc mix we > have right now, when that's basically driven by io_uring borging them in > one by one - or duplicates them without bothering to share helpers. > E.g. mkdirat(2) does getname() and passes it to do_mkdirat(), which > consumes the sucker; so does mknodat(2), but do_mknodat() is static. OTOH, > path_setxattr() does setxattr_copy(), then retry_estale loop with > user_path_at() + mnt_want_write() + do_setxattr() + mnt_drop_write() + path_put() > as a body, while on io_uring side we have retry_estale loop with filename_lookup() + > (io_uring helper that does mnt_want_write() + do_setxattr() + mnt_drop_write()) + > path_put(). > Sure, that user_path_at() call is getname() + filename_lookup() + putname(), > so they are equivalent, but keeping that shite in sync is going to be trouble. I obviously trust you to do the right thing with the VFS bits, and having a well defined struct filename interface sounds like a good thing from an audit perspective. I don't believe it completely solves the audit/io_uring issue, but it should make things easier and hopefully will result in less chance of breakage in the future. > * get rid of the "repeated getname() on the same address is going to > give you the same object" - that can't be relied upon without audit, for one > thing and for another... having a syscall that takes two pathnames that gives > different audit log (if not predicate evaluation) in cases when those are > identical pointers vs. strings with identical contenst is, IMO, somewhat > undesirable. That kills filename->uaddr. /uaddr/uptr/ if I'm following you correctly, but yeah, that all seems good. > * looking at the users of that stuff, I would probably prefer to > separate getname*() from insertion into audit context. It's not that > tricky - __set_nameidata() catches *everything* that uses nd->name (i.e. > all that audit_inode() calls in fs/namei.c use). That should be a pretty significant simplification, that sounds good to me. > ... What remains is > do_symlinkat() for symlink body > fs_index() on the argument (if we want to bother - it's a part > of weird Missed'em'V sysfs(2) syscall; I sincerely doubt that there's > anybody who'd use it) We probably should bother, folks that really care about audit don't like blind spots. Perhaps make it a separate patch if it isn't too ugly to split it out. > fsconfig(2) FSCONFIG_SET_PATH handling. > mq_open(2) and mq_unlink(2) - those bypass the normal pathwalk > logics, so __set_nameidata() won't catch them. > _maybe_ alpha osf_mount(2) devname argument; or we could get rid > of that stupidity and have it use copy_mount_string() like mount(2) does, > instead of messing with getname(). > That's all it takes. With that done, we can kill ->aname; > just look in the ->names_list for the first entry with given ->name - > as in, given struct filename * value, no need to look inside. Seems reasonable to me. I can't imagine these special cases being any worse than what we have now in fs/namei.c, and if nothing else having a single catch point for the bulk of the VFS lookups makes it worth it as far as I'm concerned. -- paul-moore.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-24 0:11 ` Paul Moore @ 2024-09-24 7:01 ` Al Viro 2024-09-24 23:17 ` Paul Moore 2024-09-25 20:44 ` Al Viro 1 sibling, 1 reply; 23+ messages in thread From: Al Viro @ 2024-09-24 7:01 UTC (permalink / raw) To: Paul Moore; +Cc: Jens Axboe, linux-fsdevel, audit, io-uring On Mon, Sep 23, 2024 at 08:11:51PM -0400, Paul Moore wrote: > > Umm... IIRC, sgrubb had been involved in the spec-related horrors, but > > that was a long time ago... > > Yep, he was. Last I spoke to Steve a year or so ago, audit was no > longer part of his job description; Steve still maintains his > userspace audit tools, but that is a nights/weekends job as far as I > understand. > > The last time I was involved in any audit/CC spec related work was > well over a decade ago now, and all of those CC protection profiles > have long since expired and been replaced. Interesting... I guess eparis would be the next victim^Wpossible source of information. > > * get rid of the "repeated getname() on the same address is going to > > give you the same object" - that can't be relied upon without audit, for one > > thing and for another... having a syscall that takes two pathnames that gives > > different audit log (if not predicate evaluation) in cases when those are > > identical pointers vs. strings with identical contenst is, IMO, somewhat > > undesirable. That kills filename->uaddr. > > /uaddr/uptr/ if I'm following you correctly, but yeah, that all seems good. *nod* > > * looking at the users of that stuff, I would probably prefer to > > separate getname*() from insertion into audit context. It's not that > > tricky - __set_nameidata() catches *everything* that uses nd->name (i.e. > > all that audit_inode() calls in fs/namei.c use). > > That should be a pretty significant simplification, that sounds good to me. > > > ... What remains is > > do_symlinkat() for symlink body > > fs_index() on the argument (if we want to bother - it's a part > > of weird Missed'em'V sysfs(2) syscall; I sincerely doubt that there's > > anybody who'd use it) > > We probably should bother, folks that really care about audit don't > like blind spots. Perhaps make it a separate patch if it isn't too > ugly to split it out. Heh... I suggest you to look at the manpage of that thing. sysfs(1, "ext2") => echo $((`sed -ne "/\text2$/=" </proc/filesystems` - 1)) sysfs(2, 10) => sed -ne "11s/.*\t//p" </proc/filesystems sysfs(3) => wc -l </proc/filesystems Yes, really - find position of filesystem type in the list of registered filesystems by name (0-based numeration), find the name of filesystem type by position and find the number of registered filesystem types. Missed'em'V had no synthetic filesystems... And the string is, of course, not a pathname of any sort, so I'd argue that spewing PATH record into audit log is a bug. Not that the number you get from sysfs(1, something) had been usable for anything other than passing it to sysfs(2, number) and getting the same string back - you can't pass that number to mount(2) or anything of that kind. I suspect that the only reason this syscall exists is some binary emulation - introduced in 1.1.11, not supported by glibc at least since 2014 (and almost certainly way before that). > > That's all it takes. With that done, we can kill ->aname; > > just look in the ->names_list for the first entry with given ->name - > > as in, given struct filename * value, no need to look inside. > > Seems reasonable to me. I can't imagine these special cases being any > worse than what we have now in fs/namei.c, and if nothing else having > a single catch point for the bulk of the VFS lookups makes it worth it > as far as I'm concerned. Huh? Right now we allocate audit_names at getname_flags()/getname_kernel() time; grep for audit_getname() - that's as centralized as it gets. What I want to do is somewhat _de_centralize it; that way they would not go anywhere other than audit_context of the thread actually doing the work. There is a lot of calls of audit_inode(), but I'm not planning to touch any of those. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-24 7:01 ` Al Viro @ 2024-09-24 23:17 ` Paul Moore 0 siblings, 0 replies; 23+ messages in thread From: Paul Moore @ 2024-09-24 23:17 UTC (permalink / raw) To: Al Viro; +Cc: Jens Axboe, linux-fsdevel, audit, io-uring On Tue, Sep 24, 2024 at 3:01 AM Al Viro <[email protected]> wrote: > On Mon, Sep 23, 2024 at 08:11:51PM -0400, Paul Moore wrote: > > > Umm... IIRC, sgrubb had been involved in the spec-related horrors, but > > > that was a long time ago... > > > > Yep, he was. Last I spoke to Steve a year or so ago, audit was no > > longer part of his job description; Steve still maintains his > > userspace audit tools, but that is a nights/weekends job as far as I > > understand. > > > > The last time I was involved in any audit/CC spec related work was > > well over a decade ago now, and all of those CC protection profiles > > have long since expired and been replaced. > > Interesting... I guess eparis would be the next victim^Wpossible source > of information. Eric was the one who dumped the audit subsystem in my lap ~10 years ago so he could run off and play with containers. Fortunately for Eric I was much more trusting back then and didn't read all the fine print before agreeing to look after audit. > > > * looking at the users of that stuff, I would probably prefer to > > > separate getname*() from insertion into audit context. It's not that > > > tricky - __set_nameidata() catches *everything* that uses nd->name (i.e. > > > all that audit_inode() calls in fs/namei.c use). > > > > That should be a pretty significant simplification, that sounds good to me. > > > > > ... What remains is > > > do_symlinkat() for symlink body > > > fs_index() on the argument (if we want to bother - it's a part > > > of weird Missed'em'V sysfs(2) syscall; I sincerely doubt that there's > > > anybody who'd use it) > > > > We probably should bother, folks that really care about audit don't > > like blind spots. Perhaps make it a separate patch if it isn't too > > ugly to split it out. > > Heh... I suggest you to look at the manpage of that thing. Ooof. That's something isn't it? Yeah, that's ugly, and since it doesn't really return any user or sensitive system information I think it's safe to skip - my mistake. > > > That's all it takes. With that done, we can kill ->aname; > > > just look in the ->names_list for the first entry with given ->name - > > > as in, given struct filename * value, no need to look inside. > > > > Seems reasonable to me. I can't imagine these special cases being any > > worse than what we have now in fs/namei.c, and if nothing else having > > a single catch point for the bulk of the VFS lookups makes it worth it > > as far as I'm concerned. > > Huh? Right now we allocate audit_names at getname_flags()/getname_kernel() > time; grep for audit_getname() - that's as centralized as it gets. > What I want to do is somewhat _de_centralize it; that way they would not > go anywhere other than audit_context of the thread actually doing the > work. > > There is a lot of calls of audit_inode(), but I'm not planning to touch > any of those. Sorry. I'm a bit foggy from traveling, and trying to play catch-up back at home. I'm sure it will make more sense once I see the patches. -- paul-moore.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-24 0:11 ` Paul Moore 2024-09-24 7:01 ` Al Viro @ 2024-09-25 20:44 ` Al Viro 2024-09-25 20:58 ` Paul Moore 1 sibling, 1 reply; 23+ messages in thread From: Al Viro @ 2024-09-25 20:44 UTC (permalink / raw) To: Paul Moore; +Cc: Jens Axboe, linux-fsdevel, audit, io-uring On Mon, Sep 23, 2024 at 08:11:51PM -0400, Paul Moore wrote: > > * get rid of the "repeated getname() on the same address is going to > > give you the same object" - that can't be relied upon without audit, for one > > thing and for another... having a syscall that takes two pathnames that gives > > different audit log (if not predicate evaluation) in cases when those are > > identical pointers vs. strings with identical contenst is, IMO, somewhat > > undesirable. That kills filename->uaddr. > > /uaddr/uptr/ if I'm following you correctly, but yeah, that all seems good. BTW, what should we do when e.g. mkdir(2) manages to get to the parent, calls audit_inode() to memorize that one and then gets -ESTALE from nfs_mkdir()? We repeat the pathwalk, this time with LOOKUP_REVAL (i.e. make sure to ask the server about each NFS directory we are visiting, even if it had been seen recently) and arrive to a different directory, which is not stale and where subdirectory creation succeeds. The thing is, we call audit_inode(...., AUDIT_INODE_PARENT) twice. With the same name, but with different inodes. Should we log both, or should the latter call cannibalize the audit_names instance from the earlier? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-25 20:44 ` Al Viro @ 2024-09-25 20:58 ` Paul Moore 0 siblings, 0 replies; 23+ messages in thread From: Paul Moore @ 2024-09-25 20:58 UTC (permalink / raw) To: Al Viro; +Cc: Jens Axboe, linux-fsdevel, audit, io-uring On Wed, Sep 25, 2024 at 4:44 PM Al Viro <[email protected]> wrote: > On Mon, Sep 23, 2024 at 08:11:51PM -0400, Paul Moore wrote: > > > > * get rid of the "repeated getname() on the same address is going to > > > give you the same object" - that can't be relied upon without audit, for one > > > thing and for another... having a syscall that takes two pathnames that gives > > > different audit log (if not predicate evaluation) in cases when those are > > > identical pointers vs. strings with identical contenst is, IMO, somewhat > > > undesirable. That kills filename->uaddr. > > > > /uaddr/uptr/ if I'm following you correctly, but yeah, that all seems good. > > BTW, what should we do when e.g. mkdir(2) manages to get to the parent, calls > audit_inode() to memorize that one and then gets -ESTALE from nfs_mkdir()? > We repeat the pathwalk, this time with LOOKUP_REVAL (i.e. make sure to ask > the server about each NFS directory we are visiting, even if it had been seen > recently) and arrive to a different directory, which is not stale and where > subdirectory creation succeeds. Ah, that's fun. I'm guessing we could run into similar issues with other network filesystems, or is this specific to NFS? > The thing is, we call audit_inode(...., AUDIT_INODE_PARENT) twice. With the > same name, but with different inodes. Should we log both, or should the > latter call cannibalize the audit_names instance from the earlier? I think the proper behavior is to have the second call cannibalize the state from the first. The intent of logging is to capture the state when/where the new directory is created, since we never created a directory off the -ESTALE path I don't see why we would need to log it. -- paul-moore.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-23 20:36 ` Al Viro 2024-09-24 0:11 ` Paul Moore @ 2024-09-24 21:40 ` Al Viro 2024-09-25 6:01 ` Jens Axboe 1 sibling, 1 reply; 23+ messages in thread From: Al Viro @ 2024-09-24 21:40 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-fsdevel, io-uring On Mon, Sep 23, 2024 at 09:36:59PM +0100, Al Viro wrote: > * go through the VFS side of things and make sure we have a consistent > set of helpers that would take struct filename * - *not* the ad-hoc mix we > have right now, when that's basically driven by io_uring borging them in > one by one - or duplicates them without bothering to share helpers. > E.g. mkdirat(2) does getname() and passes it to do_mkdirat(), which > consumes the sucker; so does mknodat(2), but do_mknodat() is static. OTOH, > path_setxattr() does setxattr_copy(), then retry_estale loop with > user_path_at() + mnt_want_write() + do_setxattr() + mnt_drop_write() + path_put() > as a body, while on io_uring side we have retry_estale loop with filename_lookup() + > (io_uring helper that does mnt_want_write() + do_setxattr() + mnt_drop_write()) + > path_put(). > Sure, that user_path_at() call is getname() + filename_lookup() + putname(), > so they are equivalent, but keeping that shite in sync is going to be trouble. BTW, re mess around xattr: static int __io_getxattr_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { ... ix->ctx.cvalue = u64_to_user_ptr(READ_ONCE(sqe->addr2)); ix->ctx.size = READ_ONCE(sqe->len); ... ret = strncpy_from_user(ix->ctx.kname->name, name, sizeof(ix->ctx.kname->name)); } int io_fgetxattr(struct io_kiocb *req, unsigned int issue_flags) { ... ret = do_getxattr(file_mnt_idmap(req->file), req->file->f_path.dentry, &ix->ctx); ... } ssize_t do_getxattr(struct mnt_idmap *idmap, struct dentry *d, struct xattr_ctx *ctx) { ... if (error > 0) { if (ctx->size && copy_to_user(ctx->value, ctx->kvalue, error)) ... } and we have struct xattr_ctx { /* Value of attribute */ union { const void __user *cvalue; void __user *value; }; ... } Undefined behaviour aside, there's something odd going on here: why do we bother with copy-in in ->prep() when we do copy-out in ->issue() anyway? ->issue() does run with initiator's ->mm in use, right? IOW, what's the io_uring policy on what gets copied in ->prep() vs. in ->issue()? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-24 21:40 ` Al Viro @ 2024-09-25 6:01 ` Jens Axboe 2024-09-25 17:39 ` Al Viro 2024-09-26 3:56 ` Al Viro 0 siblings, 2 replies; 23+ messages in thread From: Jens Axboe @ 2024-09-25 6:01 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, io-uring On 9/24/24 3:40 PM, Al Viro wrote: > On Mon, Sep 23, 2024 at 09:36:59PM +0100, Al Viro wrote: > >> * go through the VFS side of things and make sure we have a consistent >> set of helpers that would take struct filename * - *not* the ad-hoc mix we >> have right now, when that's basically driven by io_uring borging them in >> one by one - or duplicates them without bothering to share helpers. >> E.g. mkdirat(2) does getname() and passes it to do_mkdirat(), which >> consumes the sucker; so does mknodat(2), but do_mknodat() is static. OTOH, >> path_setxattr() does setxattr_copy(), then retry_estale loop with >> user_path_at() + mnt_want_write() + do_setxattr() + mnt_drop_write() + path_put() >> as a body, while on io_uring side we have retry_estale loop with filename_lookup() + >> (io_uring helper that does mnt_want_write() + do_setxattr() + mnt_drop_write()) + >> path_put(). >> Sure, that user_path_at() call is getname() + filename_lookup() + putname(), >> so they are equivalent, but keeping that shite in sync is going to be trouble. > > BTW, re mess around xattr: > static int __io_getxattr_prep(struct io_kiocb *req, > const struct io_uring_sqe *sqe) > { > ... > ix->ctx.cvalue = u64_to_user_ptr(READ_ONCE(sqe->addr2)); > ix->ctx.size = READ_ONCE(sqe->len); > ... > ret = strncpy_from_user(ix->ctx.kname->name, name, > sizeof(ix->ctx.kname->name)); > > } > > int io_fgetxattr(struct io_kiocb *req, unsigned int issue_flags) > { > ... > ret = do_getxattr(file_mnt_idmap(req->file), > req->file->f_path.dentry, > &ix->ctx); > ... > } > > ssize_t > do_getxattr(struct mnt_idmap *idmap, struct dentry *d, > struct xattr_ctx *ctx) > { > ... > if (error > 0) { > if (ctx->size && copy_to_user(ctx->value, ctx->kvalue, error)) > ... > } > > and we have > struct xattr_ctx { > /* Value of attribute */ > union { > const void __user *cvalue; > void __user *value; > }; > ... > } > > Undefined behaviour aside, there's something odd going on here: > why do we bother with copy-in in ->prep() when we do copy-out in > ->issue() anyway? ->issue() does run with initiator's ->mm in use, > right? > > IOW, what's the io_uring policy on what gets copied in ->prep() vs. > in ->issue()? The normal policy is that anything that is read-only should remain stable after ->prep() has been called, so that ->issue() can use it. That means the application can keep it on-stack as long as it's valid until io_uring_submit() returns. For structs/buffers that are copied to after IO, those the application obviously need to keep around until they see a completion for that request. So yes, for the xattr cases where the struct is copied to at completion time, those do not need to be stable after ->prep(), could be handled purely on the ->issue() side. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-25 6:01 ` Jens Axboe @ 2024-09-25 17:39 ` Al Viro 2024-09-25 17:58 ` Jens Axboe 2024-09-26 3:56 ` Al Viro 1 sibling, 1 reply; 23+ messages in thread From: Al Viro @ 2024-09-25 17:39 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-fsdevel, io-uring On Wed, Sep 25, 2024 at 12:01:01AM -0600, Jens Axboe wrote: > The normal policy is that anything that is read-only should remain > stable after ->prep() has been called, so that ->issue() can use it. > That means the application can keep it on-stack as long as it's valid > until io_uring_submit() returns. For structs/buffers that are copied to > after IO, those the application obviously need to keep around until they > see a completion for that request. So yes, for the xattr cases where the > struct is copied to at completion time, those do not need to be stable > after ->prep(), could be handled purely on the ->issue() side. Hmm... Nothing in xattr is copied in both directions, actually. AFAICS, the only copy-in you leave to ->issue() is the data for write and sendmsg and ->msg_control for sendmsg. Wait, there's that ioctl-like mess you've got, so anything that feels like doing (seems to include at least setsockopt)... Oh, well... ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-25 17:39 ` Al Viro @ 2024-09-25 17:58 ` Jens Axboe 0 siblings, 0 replies; 23+ messages in thread From: Jens Axboe @ 2024-09-25 17:58 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, io-uring On 9/25/24 11:39 AM, Al Viro wrote: > On Wed, Sep 25, 2024 at 12:01:01AM -0600, Jens Axboe wrote: > >> The normal policy is that anything that is read-only should remain >> stable after ->prep() has been called, so that ->issue() can use it. >> That means the application can keep it on-stack as long as it's valid >> until io_uring_submit() returns. For structs/buffers that are copied to >> after IO, those the application obviously need to keep around until they >> see a completion for that request. So yes, for the xattr cases where the >> struct is copied to at completion time, those do not need to be stable >> after ->prep(), could be handled purely on the ->issue() side. > > Hmm... Nothing in xattr is copied in both directions, actually. Even if it's just copied out at the end, it does not need to remain stable after ->prep(). But most things will still do prep appropriate things at prep time, just to keep it consistent across different op types. Regardless, other things read from the sqe need to be stable after prep anyway, so it's just the best place to do copy-in of structs too, even if they are written at completion time and could defer reading it in until issue. The ->issue() side is mostly just about that, issuing a request and posting a completion - or deferring a retry if it's not ready for issue just yet. > AFAICS, the only copy-in you leave to ->issue() is the data for write > and sendmsg and ->msg_control for sendmsg. Wait, there's that ioctl-like > mess you've got, so anything that feels like doing (seems to include > at least setsockopt)... Oh, well... Right, things that obviously write to buffers will do that there, as that's the context where the completion happens anyway. The setsockopt parts should just copy out at issue time. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-25 6:01 ` Jens Axboe 2024-09-25 17:39 ` Al Viro @ 2024-09-26 3:56 ` Al Viro 1 sibling, 0 replies; 23+ messages in thread From: Al Viro @ 2024-09-26 3:56 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-fsdevel, io-uring On Wed, Sep 25, 2024 at 12:01:01AM -0600, Jens Axboe wrote: > The normal policy is that anything that is read-only should remain > stable after ->prep() has been called, so that ->issue() can use it. > That means the application can keep it on-stack as long as it's valid > until io_uring_submit() returns. For structs/buffers that are copied to > after IO, those the application obviously need to keep around until they > see a completion for that request. So yes, for the xattr cases where the > struct is copied to at completion time, those do not need to be stable > after ->prep(), could be handled purely on the ->issue() side. Looks like io_fsetxattr() was missing audit_file()... Anyway, in a local branch I've added two helpers - int file_setxattr(struct file *file, struct xattr_ctx *ctx); int filename_setxattr(int dfd, struct filename *filename, struct xattr_ctx *ctx, unsigned int lookup_flags); and converted both fs/xattr.c and io_uring/xattr.c to those. Completely untested delta follows; it's _not_ in the final form, it misses getxattr side, etc. BTW, I think fs/internal.h is a very wrong place for that, as well as for do_mkdirat() et.al. include/linux/marshalled_syscalls.h, perhaps? diff --git a/fs/internal.h b/fs/internal.h index 8cf42b327e5e..e39f80201ff8 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -285,8 +285,9 @@ ssize_t do_getxattr(struct mnt_idmap *idmap, struct xattr_ctx *ctx); int setxattr_copy(const char __user *name, struct xattr_ctx *ctx); -int do_setxattr(struct mnt_idmap *idmap, struct dentry *dentry, - struct xattr_ctx *ctx); +int file_setxattr(struct file *file, struct xattr_ctx *ctx); +int filename_setxattr(int dfd, struct filename *filename, + struct xattr_ctx *ctx, unsigned int lookup_flags); int may_write_xattr(struct mnt_idmap *idmap, struct inode *inode); #ifdef CONFIG_FS_POSIX_ACL diff --git a/fs/xattr.c b/fs/xattr.c index 0fc813cb005c..fc6409181c46 100644 --- a/fs/xattr.c +++ b/fs/xattr.c @@ -619,7 +619,7 @@ int setxattr_copy(const char __user *name, struct xattr_ctx *ctx) return error; } -int do_setxattr(struct mnt_idmap *idmap, struct dentry *dentry, +static int do_setxattr(struct mnt_idmap *idmap, struct dentry *dentry, struct xattr_ctx *ctx) { if (is_posix_acl_xattr(ctx->kname->name)) @@ -630,32 +630,31 @@ int do_setxattr(struct mnt_idmap *idmap, struct dentry *dentry, ctx->kvalue, ctx->size, ctx->flags); } -static int path_setxattr(const char __user *pathname, - const char __user *name, const void __user *value, - size_t size, int flags, unsigned int lookup_flags) +int file_setxattr(struct file *f, struct xattr_ctx *ctx) +{ + int error = mnt_want_write_file(f); + + if (!error) { + audit_file(f); + error = do_setxattr(file_mnt_idmap(f), f->f_path.dentry, ctx); + mnt_drop_write_file(f); + } + return error; +} + +int filename_setxattr(int dfd, struct filename *filename, + struct xattr_ctx *ctx, unsigned int lookup_flags) { - struct xattr_name kname; - struct xattr_ctx ctx = { - .cvalue = value, - .kvalue = NULL, - .size = size, - .kname = &kname, - .flags = flags, - }; struct path path; int error; - error = setxattr_copy(name, &ctx); - if (error) - return error; - retry: - error = user_path_at(AT_FDCWD, pathname, lookup_flags, &path); + error = filename_lookup(dfd, filename, lookup_flags, &path, NULL); if (error) goto out; error = mnt_want_write(path.mnt); if (!error) { - error = do_setxattr(mnt_idmap(path.mnt), path.dentry, &ctx); + error = do_setxattr(mnt_idmap(path.mnt), path.dentry, ctx); mnt_drop_write(path.mnt); } path_put(&path); @@ -665,6 +664,30 @@ static int path_setxattr(const char __user *pathname, } out: + putname(filename); + return error; +} + +static int path_setxattr(const char __user *pathname, + const char __user *name, const void __user *value, + size_t size, int flags, unsigned int lookup_flags) +{ + struct xattr_name kname; + struct xattr_ctx ctx = { + .cvalue = value, + .kvalue = NULL, + .size = size, + .kname = &kname, + .flags = flags, + }; + int error; + + error = setxattr_copy(name, &ctx); + if (error) + return error; + + error = filename_setxattr(AT_FDCWD, getname(pathname), + &ctx, lookup_flags); kvfree(ctx.kvalue); return error; } @@ -700,17 +723,11 @@ SYSCALL_DEFINE5(fsetxattr, int, fd, const char __user *, name, if (fd_empty(f)) return -EBADF; - audit_file(fd_file(f)); error = setxattr_copy(name, &ctx); if (error) return error; - error = mnt_want_write_file(fd_file(f)); - if (!error) { - error = do_setxattr(file_mnt_idmap(fd_file(f)), - fd_file(f)->f_path.dentry, &ctx); - mnt_drop_write_file(fd_file(f)); - } + error = file_setxattr(fd_file(f), &ctx); kvfree(ctx.kvalue); return error; } diff --git a/io_uring/xattr.c b/io_uring/xattr.c index 13e8d7d2cdc2..702d5981fd63 100644 --- a/io_uring/xattr.c +++ b/io_uring/xattr.c @@ -203,28 +203,14 @@ int io_fsetxattr_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) return __io_setxattr_prep(req, sqe); } -static int __io_setxattr(struct io_kiocb *req, unsigned int issue_flags, - const struct path *path) -{ - struct io_xattr *ix = io_kiocb_to_cmd(req, struct io_xattr); - int ret; - - ret = mnt_want_write(path->mnt); - if (!ret) { - ret = do_setxattr(mnt_idmap(path->mnt), path->dentry, &ix->ctx); - mnt_drop_write(path->mnt); - } - - return ret; -} - int io_fsetxattr(struct io_kiocb *req, unsigned int issue_flags) { + struct io_xattr *ix = io_kiocb_to_cmd(req, struct io_xattr); int ret; WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK); - ret = __io_setxattr(req, issue_flags, &req->file->f_path); + ret = file_setxattr(req->file, &ix->ctx); io_xattr_finish(req, ret); return IOU_OK; } @@ -232,22 +218,11 @@ int io_fsetxattr(struct io_kiocb *req, unsigned int issue_flags) int io_setxattr(struct io_kiocb *req, unsigned int issue_flags) { struct io_xattr *ix = io_kiocb_to_cmd(req, struct io_xattr); - unsigned int lookup_flags = LOOKUP_FOLLOW; - struct path path; int ret; WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK); -retry: - ret = filename_lookup(AT_FDCWD, ix->filename, lookup_flags, &path, NULL); - if (!ret) { - ret = __io_setxattr(req, issue_flags, &path); - path_put(&path); - if (retry_estale(ret, lookup_flags)) { - lookup_flags |= LOOKUP_REVAL; - goto retry; - } - } + ret = filename_setxattr(AT_FDCWD, ix->filename, &ix->ctx, LOOKUP_FOLLOW); io_xattr_finish(req, ret); return IOU_OK; ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-23 6:30 ` Jens Axboe 2024-09-23 12:54 ` Paul Moore @ 2024-09-23 15:07 ` Al Viro 2024-09-24 11:15 ` Jens Axboe 1 sibling, 1 reply; 23+ messages in thread From: Al Viro @ 2024-09-23 15:07 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-fsdevel, audit, io-uring On Mon, Sep 23, 2024 at 12:30:48AM -0600, Jens Axboe wrote: > 1) Just don't reuse the entry. Then we can drop the struct > filename->aname completely as well. Yes that might incur an extra > alloc for the odd case of audit_enabled and being deep enough that > the preallocated names have been used, but doesn't anyone really > care? It'll be noise in the overhead anyway. Side note - that would > unalign struct filename again. Would be nice to drop audit_names from > a core fs struct... You'll get different output in logs, though. Whether that breaks userland setups/invalidates certifications/etc.... fuck knows. If anything, a loop through the list, searching for matching entry would be safer in that respect. Order of the items... might or might not be an issue - see above. > 2) Add a ref to struct audit_names, RCU kfree it when it drops to zero. > This would mean dropping struct audit_context->preallocated_names, as Costly, that. > otherwise we'd run into trouble there if a context gets blown away > while someone else has a ref to that audit_names struct. We could do > this without a ref as well, as long as we can store an audit_context > pointer in struct audit_names and be able to validate it under RCU. > If ctx doesn't match, don't use it. That's one of the variants I mentioned upthread... ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC] struct filename, io_uring and audit troubles 2024-09-23 15:07 ` Al Viro @ 2024-09-24 11:15 ` Jens Axboe 0 siblings, 0 replies; 23+ messages in thread From: Jens Axboe @ 2024-09-24 11:15 UTC (permalink / raw) To: Al Viro; +Cc: linux-fsdevel, audit, io-uring On 9/23/24 9:07 AM, Al Viro wrote: > On Mon, Sep 23, 2024 at 12:30:48AM -0600, Jens Axboe wrote: > >> 1) Just don't reuse the entry. Then we can drop the struct >> filename->aname completely as well. Yes that might incur an extra >> alloc for the odd case of audit_enabled and being deep enough that >> the preallocated names have been used, but doesn't anyone really >> care? It'll be noise in the overhead anyway. Side note - that would >> unalign struct filename again. Would be nice to drop audit_names from >> a core fs struct... > > You'll get different output in logs, though. Whether that breaks userland > setups/invalidates certifications/etc.... fuck knows. No idea about that... But I'd say without strong evidence that this breaks userland for something as odd as audit, well... And honestly really a layering problem that struct filename has an audit link in there. > If anything, a loop through the list, searching for matching entry would > be safer in that respect. Order of the items... might or might not be > an issue - see above. > >> 2) Add a ref to struct audit_names, RCU kfree it when it drops to zero. >> This would mean dropping struct audit_context->preallocated_names, as > > Costly, that. For sure. And you could keep preallocated_names if you rcu free the context too. But I strongly believe that approach #1 is, by far, the cheaper alternative. If we can tolerate the ordering potentially changing. >> otherwise we'd run into trouble there if a context gets blown away >> while someone else has a ref to that audit_names struct. We could do >> this without a ref as well, as long as we can store an audit_context >> pointer in struct audit_names and be able to validate it under RCU. >> If ctx doesn't match, don't use it. > > That's one of the variants I mentioned upthread... Sorry, still away on travels and conferences, so haven't been keeping up on replies. -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2024-09-26 3:56 UTC | newest] Thread overview: 23+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-09-22 0:49 [RFC] struct filename, io_uring and audit troubles Al Viro 2024-09-22 4:10 ` Al Viro 2024-09-22 15:09 ` Al Viro 2024-09-23 1:50 ` Al Viro 2024-09-23 6:30 ` Jens Axboe 2024-09-23 12:54 ` Paul Moore 2024-09-23 14:48 ` Al Viro 2024-09-23 16:14 ` Paul Moore 2024-09-23 18:17 ` Al Viro 2024-09-23 23:49 ` Paul Moore 2024-09-23 20:36 ` Al Viro 2024-09-24 0:11 ` Paul Moore 2024-09-24 7:01 ` Al Viro 2024-09-24 23:17 ` Paul Moore 2024-09-25 20:44 ` Al Viro 2024-09-25 20:58 ` Paul Moore 2024-09-24 21:40 ` Al Viro 2024-09-25 6:01 ` Jens Axboe 2024-09-25 17:39 ` Al Viro 2024-09-25 17:58 ` Jens Axboe 2024-09-26 3:56 ` Al Viro 2024-09-23 15:07 ` Al Viro 2024-09-24 11:15 ` Jens Axboe
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox