Re: [PATCH] io_uring/sqpoll: Increase task_work submission batch size

public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed

From: Gabriel Krisman Bertazi <krisman@suse.de>
To: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
Subject: Re: [PATCH] io_uring/sqpoll: Increase task_work submission batch size
Date: Thu, 03 Apr 2025 21:18:15 -0400	[thread overview]
Message-ID: <87friod8rs.fsf@mailhost.krisman.be> (raw)
In-Reply-To: <94da2142-d7c1-46bb-bc35-05d0d1c28182@kernel.dk> (Jens Axboe's message of "Thu, 3 Apr 2025 14:26:40 -0600")

Jens Axboe <axboe@kernel.dk> writes:

> On 4/3/25 1:56 PM, Gabriel Krisman Bertazi wrote:
>> diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
>>  #define IORING_SQPOLL_CAP_ENTRIES_VALUE 8
>> -#define IORING_TW_CAP_ENTRIES_VALUE	8
>> +#define IORING_TW_CAP_ENTRIES_VALUE	1024
>
> That's a huge bump! This should not be a submission side thing, it's
> purely running the task work. For this test case, I'm assuming you don't
> see any io-wq activity, and hence everything is done purely inline from
> the SQPOLL thread?
> This confuses me a bit, as this should not be driving
> the queue depth at all, as submissions would be done by
> __io_sq_thread().

Indeed, the submission happens fully inside __io_sq_thread, and I can
confirm that from the profile.  What is interesting is that, once I lift
the cap, we end up spending more time inside io_submit_sqes, which means
it is able to drive more requests.

Let me share the profile in case it rings a bell:

This is perf-record on a slow kernel:

   - 49.30% io_sq_thread
      - 41.86% io_submit_sqes
         - 20.57% io_issue_sqe
            - 19.89% io_read
               - __io_read
                  - 18.19% blkdev_read_iter
                     - 17.84% blkdev_direct_IO.part.21
                        + 7.25% submit_bio_noacct_nocheck
                        + 6.49% bio_iov_iter_get_pages
                        + 1.80% bio_alloc_bioset
                          1.27% bio_set_pages_dirty
                  + 0.78% security_file_permission
         - 10.88% blk_finish_plug
            - __blk_flush_plug
               - 10.80% blk_mq_flush_plug_list.part.88
                  + 10.69% null_queue_rqs
           0.83% io_prep_rw
      - 4.11% io_sq_tw
         - 3.62% io_handle_tw_list
            - 2.76% ctx_flush_and_put.isra.72
                 2.67% __io_submit_flush_completions
              0.58% io_req_rw_complete
      + 1.15% io_sq_update_worktime.isra.9
        1.05% mutex_unlock
      + 1.05% getrusage

After my patch:

   - 50.07% io_sq_thread
      - 47.22% io_submit_sqes
         - 38.04% io_issue_sqe
            - 37.19% io_read
               - 37.10% __io_read
                  - 34.79% blkdev_read_iter
                     - 34.34% blkdev_direct_IO.part.21
                        + 21.01% submit_bio_noacct_nocheck
                        + 8.30% bio_iov_iter_get_pages
                        + 2.21% bio_alloc_bioset
                        + 1.52% bio_set_pages_dirty
                  + 1.19% security_file_permission
         - 3.29% blk_finish_plug
            - __blk_flush_plug
               - 3.27% blk_mq_flush_plug_list.part.88
                  - 3.25% null_queue_rqs
                     + null_queue_rq
           1.16% io_prep_rw
      - 2.25% io_sq_tw
         - tctx_task_work_run
            - 2.00% io_handle_tw_list
               - 1.08% ctx_flush_and_put.isra.72
                    1.07% __io_submit_flush_completions
                 0.68% io_req_rw_complete

> And that part only caps when there is more than a
> single ctx in there, which your case would not have. IOW, it should
> submit everything that's there and hence this change should not change
> the submission/queueing side of things. It only really deals with
> running the task_work that will post the completion.
>
> Maybe we should just not submit more until we've depleted the tw list?
>
> In any case, we can _probably_ make this 32 or something without
> worrying too much about it, though I would like to fully understand why
> it's slower. Maybe it's the getrusage() that we do for every loop? You
> could try and disable that just to see if it makes a difference?

While the overhead of the usage accounting is very visible in the
profile, my first test when I got this bug was to drop that code, and it
had very little impact on throughput (around 1%).  The main difference
really seems to be around the number of ios we queue per iteration. In
fact, looking at iostat, I can see a very noticeable difference in
aqu-sz between both kernels.

A lower limit should work, yes, but I'm also quite curious how the tw
affects the submission. But also, what is the reason to cap it in the
first place?  io_handle_tw_list does a cond_reesched() on each
iteration, so it wont hog to the cpu and, if we drop the cap, we'll have
the behavior of not submitting more until the tw list is empty, as you
suggested.

-- 
Gabriel Krisman Bertazi

next prev parent reply	other threads:[~2025-04-04  1:18 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-03 19:56 [PATCH] io_uring/sqpoll: Increase task_work submission batch size Gabriel Krisman Bertazi
2025-04-03 20:26 ` Jens Axboe
2025-04-04  1:18   ` Gabriel Krisman Bertazi [this message]
2025-04-07 15:47     ` Gabriel Krisman Bertazi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87friod8rs.fsf@mailhost.krisman.be \
    --to=krisman@suse.de \
    --cc=axboe@kernel.dk \
    --cc=io-uring@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox