From: Miklos Szeredi <[email protected]>
To: Bernd Schubert <[email protected]>
Cc: Bernd Schubert <[email protected]>,
	Amir Goldstein <[email protected]>,
	[email protected],
	Andrew Morton <[email protected]>,
	[email protected],
	Ingo Molnar <[email protected]>,
	Peter Zijlstra <[email protected]>,
	Andrei Vagin <[email protected]>,
	[email protected],
	Kent Overstreet <[email protected]>
Subject: Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring
Date: Wed, 12 Jun 2024 09:39:56 +0200	[thread overview]
Message-ID: <CAJfpegsq06UZSPCDB=0Q3OPoH+c3is4A_d2oFven3Ebou8XPOw@mail.gmail.com> (raw)
In-Reply-To: <[email protected]>

On Tue, 11 Jun 2024 at 19:37, Bernd Schubert <[email protected]> wrote:

> > So I don't think it matters to performance whether there's a combined
> > WRITEV + READV (or COMMIT + FETCH) op or separate ops.
>
> This has to be proven with performance numbers and is by no means what
> I'm seeing. How should io-uring improve performance if you have the
> same number of system calls?

The ops can be queued together and submitted together.  Two separate
(but possibly linked) ops should result in exactly the same number of
syscalls as a single combined op.
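
For illustration, the userspace side of two separate-but-linked ops
could look roughly like this with liburing (a sketch only; dev_fd,
reply_buf and req_buf are placeholders, not names from this series):

    #include <liburing.h>

    /* dev_fd, reply_buf/reply_len, req_buf/req_buf_size: the server's
     * (cloned) fuse fd and its buffers */
    struct io_uring ring;
    io_uring_queue_init(64, &ring, 0);

    /* queue the reply/commit for the previous request ... */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, dev_fd, reply_buf, reply_len, 0);
    sqe->flags |= IOSQE_IO_LINK;    /* ... and link the fetch behind it */

    /* ... then queue the read that fetches the next request */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, dev_fd, req_buf, req_buf_size, 0);

    /* a single io_uring_enter() submits both SQEs */
    io_uring_submit(&ring);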

> Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would
> change in the fuse kernel code? I.e. IOs would go via fuse_dev_read(),
> and we would not have encoded in the request which queue it belongs to?

The original idea was to use the cloned /dev/fuse fd to sort requests
into separate queues.  That was only half finished: the input queue is
currently shared by all clones, but once a request is read by the
server from a particular clone it is put into a separate processing
queue.   Adding separate input queues to each clone should also be
possible.
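
For reference, the clone mechanism itself is just an ioctl on a freshly
opened /dev/fuse fd (sketch only; session_fd stands for the fd of the
already mounted connection):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/fuse.h>

    int clone_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC);
    uint32_t oldfd = session_fd;    /* fd of the existing connection */
    ioctl(clone_fd, FUSE_DEV_IOC_CLONE, &oldfd);
    /* clone_fd now serves the same connection and can be handed to a
     * dedicated worker thread (or, per the above, its own input queue) */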

I'm not saying this is definitely the direction we should be taking,
but it's something to consider.

> > The advantage of separate ops is more flexibility and less complexity
> > (do only one thing in an op)
>
> Did you look at patch 12/19? It just does
> fuse_uring_req_end_and_get_next(). That part isn't complex, imho.

That function name indicates that this is too complex: it is doing two
independent things (ending one request and fetching the next).

Fine if it's a valid optimization, but I'm saying that it likely isn't.

> > The major difference between your idea of a fuse_uring and the
> > io_uring seems to be that you place not only the request on the shared
> > buffer, but the data as well.   I don't think this is a good idea,
> > since it will often incur one more memory copy.  Otherwise the idea
> > itself seems sound.
>
> Could you explain what you mean by "one more memory copy"?

If the filesystem is providing the result of a READ request as a
pointer to a buffer (which can be the case with fuse_reply_data()),
then that buffer will need to be copied to the shared buffer, and from
the shared buffer to the read destination.

That first copy is unnecessary if the kernel receives the pointer to
the userspace buffer and copies the data directly to the destination.
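
To make the two paths concrete (illustration only, for a READ reply):

    reply data placed in a shared ring:
        server buffer --userspace memcpy--> ring --kernel copy--> destination
    reply passed as a pointer:
        server buffer -----------kernel copy--------------------> destination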

> > So I think either better integration with io_uring is needed with
> > support for "reverse submission" or a new interface.
>
> Well, that is exactly what IORING_OP_URING_CMD is for, afaik. And
> ublk_drv also works exactly that way. I had pointed it out before:
> initially I had considered writing a reverse io-uring myself, and then
> exactly at that time ublk came up.

I'm just looking for answers as to why this architecture is the best.
Maybe it is, but I find it too complex and can't explain why it's going
to perform better than a much simpler single-ring model.

> The interface of that 'reverse io' to io-uring is really simple.
>
> 1) Userspace sends a IORING_OP_URING_CMD SQE
> 2) That CMD gets handled/queued by struct file_operations::uring_cmd /
> fuse_uring_cmd(). fuse_uring_cmd() returns -EIOCBQUEUED and queues the
> request
> 3) When fuse client has data to complete the request, it calls
> io_uring_cmd_done() and fuse server receives a CQE with the fuse request.
>
> Personally I don't see anything twisted here; one just needs to
> understand that IORING_OP_URING_CMD was written for that reverse order.

That's just my gut feeling.  fuse/dev_uring.c is 1233 lines in this
RFC, and that's just the queuing.
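
The bare uring_cmd plumbing is indeed small; roughly (a sketch with
made-up helper names, not code from this series):

    static int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
    {
            /* remember the cmd so the request can be completed later */
            fuse_queue_uring_cmd(cmd);      /* placeholder helper */
            return -EIOCBQUEUED;            /* SQE consumed, CQE comes later */
    }

    /* later, when there is a request/result for the server: */
    io_uring_cmd_done(cmd, res, 0, issue_flags);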

> (A slight twist came up when io-uring introduced issue_flags - that is
> part of the discussion of patch 19/19 with Jens in the series. Jens
> suggested working on io-uring improvements once the main series is
> merged, i.e. patch 19/19 will be dropped in RFCv3 and I'm going to ask
> Jens for help once the other parts are merged. Right now that is easy
> to work around by always submitting with an io-uring task.)
>
> Also, that simplicity is the reason why I'm hesitating a bit to work on
> Kent's new ring, as io-uring already has everything we need, with a
> rather simple interface.

I'm in favor of using io_uring, if possible.

I'm also in favor of a single shared buffer (ring) if possible.  Using
cloned fd + plain READV / WRITEV ops is one possibility.

But I'm not opposed to IORING_OP_URING_CMD either.  Btw, the fuse reply
could in the majority of cases be inlined into that 80-byte free space
in the sqe.  We might also consider an extended cqe mode, where a short
fuse request could be inlined as well (e.g. IORING_SETUP_CQE128 -> 112
byte payload).
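
Rough sketch of where that 80-byte space comes from (assumes
IORING_SETUP_SQE128 and manual SQE setup; not code from the series):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    sqe->opcode = IORING_OP_URING_CMD;
    sqe->fd     = dev_fd;               /* placeholder fuse-uring fd */
    /* With IORING_SETUP_SQE128 each SQE is 128 bytes and the uring_cmd
     * payload (->cmd) starts at offset 48, leaving 80 bytes for an
     * inline fuse reply: */
    memcpy(sqe->cmd, &small_fuse_reply, sizeof(small_fuse_reply));

(The extended cqe mode would need a new setup flag; today io_uring only
has IORING_SETUP_CQE32, i.e. 16 extra payload bytes per CQE.)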

> To be honest, I wonder how you worked around scheduler issues when waking
> up the application thread. Did you core-bind application threads as well
> (I mean besides fuse server threads)? We now have this (unexported)
> wake_on_current_cpu. Last year that still wasn't working perfectly well,
> and Hillf Danton suggested the 'seesaw' approach; with that the
> scheduler was working very well. You could get the same with application
> core binding, but with 512 CPUs that is certainly not done manually
> anymore. Did you use a script to bind application threads, or did you
> core-bind from within the application?

Probably, I don't remember anymore.

Thanks,
Miklos
