public inbox for [email protected]
 help / color / mirror / Atom feed
From: Jann Horn <[email protected]>
To: Al Viro <[email protected]>, Jens Axboe <[email protected]>,
	Stefan Roesch <[email protected]>
Cc: Linus Torvalds <[email protected]>,
	io-uring <[email protected]>,
	linux-fsdevel <[email protected]>,
	Pavel Begunkov <[email protected]>
Subject: Re: [PATCH v7 0/3] io_uring: add getdents64 support
Date: Mon, 3 Jan 2022 08:03:51 +0100	[thread overview]
Message-ID: <CAG48ez1O9VxSuWuLXBjke23YxUA8EhMP+6RCHo5PNQBf3B0pDQ@mail.gmail.com> (raw)
In-Reply-To: <[email protected]>

On Sat, Jan 1, 2022 at 8:59 PM Al Viro <[email protected]> wrote:
> On Fri, Dec 31, 2021 at 11:15:55PM +0000, Al Viro wrote:
> > On Tue, Dec 21, 2021 at 09:15:24AM -0800, Linus Torvalds wrote:
> > > On Tue, Dec 21, 2021 at 8:40 AM Stefan Roesch <[email protected]> wrote:
> > > >
> > > > This series adds support for getdents64 in liburing. The intent is to
> > > > provide a more complete I/O interface for io_uring.
> > >
> > > Ack, this series looks much more natural to me now.
> > >
> > > I think some of the callers of "iterate_dir()" could probably be
> > > cleaned up with the added argument, but for this series I prefer that
> > > mindless approach of just doing "&(arg1)->f_pos" as the third argument
> > > that is clearly a no-op.
> > >
> > > So the end result is perhaps not as beautiful as it could be, but I
> > > think the patch series DTRT.
> >
> > It really does not.  Think what happens if you pass e.g. an odd position
> > to that on e.g. ext2/3/4.  Or just use it on tmpfs, for that matter.
>
> [A bit of a braindump follows; my apologies for reminding of some
> unpleasant parts of history]
>
> The real problem here is the userland ABI along the lines of pread for
> directories.  There's a good reason why we (as well as everybody else,
> AFAIK) do not have pgetdents(2).
>
> Handling of IO positions for directories had been causing trouble
> ever since the directory layouts had grown support for long names.
> Originally a directory had been an array of fixed-sized entries; back
> then ls(1) simply used read(2).  No special logics for handling offsets,
> other than "each entry is 16 bytes, so you want to read a multiple of
> 16 starting from offset that is a multiple of 16".
>
> As soon as FFS had introduced support for names longer than 14 characters,
> the things got more complicated - there's no predicting if given position
> is an entry boundary.  Worse, what used to have been an entry boundary
> might very well come to point to the middle of a name - all it takes is
> unlink()+creat() done since the time the position used to be OK.
>
> Details are filesystem-dependent; e.g. for original FFS all multiples of
> 512 are valid offsets, and each entry has its length stored in bytes 4
> and 5, so one needs to read the relevant 512 bytes of contents and walk
> the list of entries until they get to (or past) the position that needs
> to be validated.  For ext2 it's a bit more unpleasant, since the chunk
> size is not 512 bytes - it's a block size, i.e. might easily by 4Kb.
> For more complex layouts it gets even nastier.
>
> Having to do that kind of validation on each directory entry access would
> be too costly.  That's somewhat mitigated since the old readdir(2) is no
> longer used, and we return multiple entries per syscall (getdents(2)).
> With the exclusion between directory reads and modifications, that allows
> to limit validation to the first entry returned on that syscall.
>
> It's still too costly, though.  The next part of mitigation is to use
> the fact that valid position will remain valid until somebody modifies a
> directory, so we don't need to validate if directory hadn't been changed
> since the last time getdents(2) had gotten to this position.  Of course,
> explicit change of position since that last getdents(2) means that we
> can't skip validation.

And for this validation caching to work properly, AFAIU you need to
hold the file->f_pos_lock (or have exclusive access to the "struct
file"), which happens in the normal getdents() path via fdget_pos().
This guarantees that the readdir handler has exclusive access to the
file's ->f_version, which has to stay in sync with the position.

By the way, this makes me wonder: If I open() an ext4 directory and
then concurrently modify the directory, call readdir() on it, and
issue IORING_OP_READV SQEs with ->off == -1, can that cause ext4 to
think that filesystem corruption has occurred?

io_uring has some dodgy code that seems to be reading and writing
file->f_pos without holding the file->f_pos_lock. And even if the file
doesn't have an f_op->read or f_op->read_iter handler, I think you
might be able to read ->f_pos of an ext4 directory and write the value
back later, unless I'm missing a check somewhere?

io_prep_rw() grabs file->f_pos; then later, io_read() calls
io_iter_do_read() (which will fail with -EINVAL), and then the error
path goes through kiocb_done(), which writes the position back to
req->file->f_pos. So I think the following race might work:

First, open an FD referencing a directory on an ext4 filesystem. Read
part of the directory's entries from the FD with getdents(). Then
modify the directory such that the FD's ->f_version is now stale. Then
do the race:

task A: start a IORING_OP_READV op on the FD and let it grab the stale
offset from file->f_pos
task B: run getdents() on the FD. after this, ->f_version is
up-to-date and ->f_pos points to the start of a dentry
task A: continue handling the IORING_OP_READV: io_iter_do_read()
fails, kiocb_done() overwrites the valid ->f_pos with the stale value
while leaving ->f_version up-to-date

Afterwards, the file might have an ->f_pos pointing in the middle of a
dentry while ->f_version indicates that it is valid (meaning it's
supposed to point to the start of a dentry). If you then do another
getdents() call on the FD, I think you might get garbage back and/or
make the kernel think that the filesystem is corrupted (depending on
what's stored at that offset)?

  reply	other threads:[~2022-01-03  7:04 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-21 16:40 [PATCH v7 0/3] io_uring: add getdents64 support Stefan Roesch
2021-12-21 16:40 ` [PATCH v7 1/3] fs: add offset parameter to iterate_dir function Stefan Roesch
2021-12-21 16:40 ` [PATCH v7 2/3] fs: split off vfs_getdents function of getdents64 syscall Stefan Roesch
2021-12-21 16:40 ` [PATCH v7 3/3] io_uring: add support for getdents64 Stefan Roesch
2021-12-21 17:15 ` [PATCH v7 0/3] io_uring: add getdents64 support Linus Torvalds
2021-12-31 23:15   ` Al Viro
2022-01-01 19:59     ` Al Viro
2022-01-03  7:03       ` Jann Horn [this message]
2022-01-03 15:00         ` Jens Axboe
2022-01-03 18:55         ` Linus Torvalds
2022-01-03 21:12         ` Al Viro
2021-12-21 19:17 ` Jens Axboe
2021-12-31 23:14 ` Al Viro
2023-04-16 22:06 ` Dominique Martinet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAG48ez1O9VxSuWuLXBjke23YxUA8EhMP+6RCHo5PNQBf3B0pDQ@mail.gmail.com \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox