From: Xiaoguang Wang <[email protected]>
To: Pavel Begunkov <[email protected]>, [email protected]
Cc: [email protected], [email protected], [email protected]
Subject: Re: [PATCH] io_uring: create percpu io sq thread when IORING_SETUP_SQ_AFF is flagged
Date: Thu, 28 May 2020 15:56:46 +0800	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

hi Pavel,

> On 22/05/2020 11:33, Xiaoguang Wang wrote:
>> hi Pavel,
>>
>> First thanks for reviewing!
> 
> sure, you're welcome!
> 
>>> On 20/05/2020 14:56, Xiaoguang Wang wrote:
>>>>
>>>> To fix this issue, when an io_uring instance uses IORING_SETUP_SQ_AFF to
>>>> specify a cpu, we create a percpu io sq_thread to handle multiple io_uring
>>>> instances' io requests serially. With this patch, in the same environment,
>>>> we get a huge improvement:
>>>
>>> Consider that a user can issue a long sequence of requests, say 2^15 of them,
>>> and all of them will happen in a single call to io_submit_sqes(). Just preparing
>>> them would take much time, apart from that it can do some effective work for
>>> each of them, e.g. copying a page. And now there is another io_uring, caring
>>> much about latencies and nevertheless waiting for _very_ long for its turn.
>> Indeed I had thought about this case before and don't clearly know how to
>> optimize it yet. But I think if a user has the above usage scenario, they
>> probably shouldn't make io_uring instances share the same cpu core. I'd like
>> to explain more about our usage scenarios.
> 
> IIRC, you create _globally_ available per-cpu threads, so 2 totally independent
> users may greatly affect each other, e.g. 2 docker containers in a hosted
> server, and that would be nasty. But if it's limited to a "single user", it'd be
> fine by me, see below for more details.
> 
> BTW, out of curiosity, what's the performance/latency impact of disabling SQPOLL
> at all for your app? Say, comparing with non contended case.
> 
>> Say a physical machine has 96 cores and runs multiple cgroups; every cgroup
>> runs the same application and monopolizes 16 cpu cores. This application
>> creates 16 io threads, every io thread creates an io_uring instance, and
>> every thread is bound to a different cpu core; these io threads receive io
>> requests.
>> If we enable SQPOLL for these io threads, we allocate at most one or two cpu
>> cores for these io_uring instances, so they must share the allocated cpu
>> cores. It is a total disaster that some io_uring instances' busy loop during
>> their sq_thread_idle period will impact other io_uring instances which have
>> io requests to handle.
>>
>>>
>>> Another problem is that with it a user can't even guess when its SQ would be
>>> emptied, and would need to constantly poll.
>> In this patch, in every iteration we only handle io requests that are already
>> queued; we will not constantly poll.
> 
> I was talking about a user polling _from userspace_ its full SQ, to understand
> when it can submit more. That's if it doesn't want to wait for CQEs yet for some
> reason (e.g. useful for net apps).
> 
>>
>>>
>>> In essence, the problem is in bypassing thread scheduling, and re-implementing
>>> poor man's version of it (round robin by io_uring).
>> Yes :) Currently I use a round-robin strategy to handle multiple io_uring
>> instances in every iteration.
>>
>>> The idea and the reasoning are compelling, but I think we need to do something
>>> about unrelated io_uring instances obstructing each other. At least not making
>>> it mandatory behaviour.
>>>
>>> E.g. it's totally fine by me, if a sqpoll kthread is shared between specified
>>> bunch of io_urings -- the user would be responsible for binding them and not
>>> screwing up latencies/whatever. Most probably there won't be many (privileged)
>>> users using SQPOLL, and they will all be part of a single app, e.g. with
>>> multiple/per-thread io_urings.
>> Did you read my patch? In this patch, I have implemented this idea :)
> 
> Took a glance, may have overlooked things. I meant to do as in your patch, but
> not sharing threads between ALL io_uring in the system, but rather between a
> specified set of them. In other words, making yours @percpu_threads not global,
> but rather binding to a set of io_urings.
> 
> e.g. create 2 independent per-cpu sets of threads. 1st one for {a1,a2,a3}, 2nd
> for {b1,b2,b3}.
> 
> a1 = create_uring()
> a2 = create_uring(shared_sq_threads=a1)
> a3 = create_uring(shared_sq_threads=a1)
> 
> b1 = create_uring()
> b2 = create_uring(shared_sq_threads=b1)
> b3 = create_uring(shared_sq_threads=b1)
Thinking about your suggestion further, it seems you are suggesting the implementation
method io-wq adopts, which requires a valid wq_fd. That means we can only share an io-wq
between io_uring instances that belong to the same process, and the same holds for
io_sq_thread. With this method it is hard to implement io_sq_thread sharing between
processes.
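
(For illustration only, not part of the patch: a rough userspace sketch, assuming
liburing, of the existing io-wq sharing mentioned above. The second ring passes the
first ring's fd as wq_fd together with IORING_SETUP_ATTACH_WQ; since wq_fd must be an
io_uring fd owned by the calling process, the same mechanism does not readily extend to
sharing an io_sq_thread across processes.)

/*
 * Rough sketch only (assumes liburing): two rings in the same process
 * share one backend io-wq by passing the first ring's fd as wq_fd
 * together with IORING_SETUP_ATTACH_WQ.
 */
#include <stdio.h>
#include <string.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring_a, ring_b;
	struct io_uring_params p;
	int ret;

	/* First ring owns the backend io-wq. */
	ret = io_uring_queue_init(64, &ring_a, 0);
	if (ret < 0) {
		fprintf(stderr, "queue_init a: %s\n", strerror(-ret));
		return 1;
	}

	/*
	 * Second ring attaches to the first ring's io-wq instead of
	 * creating its own.  wq_fd must be an io_uring fd owned by this
	 * same process, hence the per-process limitation.
	 */
	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_ATTACH_WQ;
	p.wq_fd = ring_a.ring_fd;
	ret = io_uring_queue_init_params(64, &ring_b, &p);
	if (ret < 0) {
		fprintf(stderr, "queue_init b: %s\n", strerror(-ret));
		io_uring_queue_exit(&ring_a);
		return 1;
	}

	printf("ring_b shares ring_a's io-wq (wq_fd=%u)\n", p.wq_fd);

	io_uring_queue_exit(&ring_b);
	io_uring_queue_exit(&ring_a);
	return 0;
}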

As I said in the previous mail, I guess io_uring instances which have SQPOLL enabled but
do not specify a valid sq_thread_cpu are not likely to be used in a real production
environment, so I'd like to preserve the current design in my patch, which is simple and
only affects the cpu-pinned case.

What do you think? Thanks.

Regards,
Xiaoguang Wang

> 
> And then:
> - it somehow solves the problem. As long as it doesn't affect other users much,
> it's ok to let userspace screw itself by submitting 2^16 requests.
> 
> - there is still a problem with a simple round robin. E.g. >100 io_urings per
> such set. Even though a user may decide for itself, it's worth thinking about. I
> don't want another scheduling framework here. E.g. first round-robin, then
> weighted one, etc.
> 
> - it's actually not a set of threads (i.e. per-cpu) the API should operate on,
> but just binding io_urings to a single SQPOLL thread.
> 
> - there is no need to restrict it to cpu-pinned case.
> 
>>>
>>> Another way would be to switch between io_urings faster, e.g. after processing
>>> not all requests but 1 or some N of them. But that's very thin ice, and I
>>> already see other bag of issues.
>> Sounds good, could you please list the detailed issues? Thanks.
> 
> Sounds terrible, TBH. Especially with arbitrary length links.
> 
> 

