* "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) @ 2020-12-17 8:19 Dmitry Kadashev 2020-12-17 8:26 ` Norman Maurer 2020-12-18 15:26 ` Jens Axboe 0 siblings, 2 replies; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-17 8:19 UTC (permalink / raw) To: io-uring, Jens Axboe Hi, We've ran into something that looks like a memory accounting problem in the kernel / io_uring code. We use multiple rings per process, and generally it works fine. Until it does not - new ring creation just fails with ENOMEM. And at that point it fails consistently until the box is rebooted. More details: we use multiple rings per process, typically they are initialized on the process start (not necessarily, but that is not important here, let's just assume all are initialized on the process start). On a freshly booted box everything works fine. But after a while - and some process restarts - io_uring_queue_init() starts to fail with ENOMEM. Sometimes we see it fail, but then subsequent ones succeed (in the same process), but over time it gets worse, and eventually no ring can be initialized. And once that happens the only way to fix the problem is to restart the box. Most of the mentioned restarts are graceful: a new process is started and then the old one is killed, possibly with the KILL signal if it does not shut down in time. Things work fine for some time, but eventually we start getting those errors. Originally we've used 5.6.6 kernel, but given the fact quite a few accounting issues were fixed in io_uring in 5.8, we've tried 5.9.5 as well, but the issue is not gone. Just in case, everything else seems to be working fine, it just falls back to the thread pool instead of io_uring, and then everything continues to work just fine. I was not able to spot anything suspicious in the /proc/meminfo. We have RLIMIT_MEMLOCK set to infinity. And on a box that currently experiences the problem /proc/meminfo shows just 24MB as locked. Any pointers to how can we debug this? Thanks, Dmitry ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 8:19 "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) Dmitry Kadashev @ 2020-12-17 8:26 ` Norman Maurer 2020-12-17 8:36 ` Dmitry Kadashev 2020-12-18 15:26 ` Jens Axboe 1 sibling, 1 reply; 52+ messages in thread From: Norman Maurer @ 2020-12-17 8:26 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: io-uring, Jens Axboe I wonder if this is also related to one of the bug-reports we received: https://github.com/netty/netty-incubator-transport-io_uring/issues/14 > On 17. Dec 2020, at 09:19, Dmitry Kadashev <[email protected]> wrote: > > Hi, > > We've ran into something that looks like a memory accounting problem in the > kernel / io_uring code. We use multiple rings per process, and generally it > works fine. Until it does not - new ring creation just fails with ENOMEM. And at > that point it fails consistently until the box is rebooted. > > More details: we use multiple rings per process, typically they are initialized > on the process start (not necessarily, but that is not important here, let's > just assume all are initialized on the process start). On a freshly booted box > everything works fine. But after a while - and some process restarts - > io_uring_queue_init() starts to fail with ENOMEM. Sometimes we see it fail, but > then subsequent ones succeed (in the same process), but over time it gets worse, > and eventually no ring can be initialized. And once that happens the only way to > fix the problem is to restart the box. Most of the mentioned restarts are > graceful: a new process is started and then the old one is killed, possibly with > the KILL signal if it does not shut down in time. Things work fine for some > time, but eventually we start getting those errors. > > Originally we've used 5.6.6 kernel, but given the fact quite a few accounting > issues were fixed in io_uring in 5.8, we've tried 5.9.5 as well, but the issue > is not gone. > > Just in case, everything else seems to be working fine, it just falls back to > the thread pool instead of io_uring, and then everything continues to work just > fine. > > I was not able to spot anything suspicious in the /proc/meminfo. We have > RLIMIT_MEMLOCK set to infinity. And on a box that currently experiences the > problem /proc/meminfo shows just 24MB as locked. > > Any pointers to how can we debug this? > > Thanks, > Dmitry ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 8:26 ` Norman Maurer @ 2020-12-17 8:36 ` Dmitry Kadashev 2020-12-17 8:40 ` Dmitry Kadashev 0 siblings, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-17 8:36 UTC (permalink / raw) To: Norman Maurer; +Cc: io-uring, Jens Axboe On Thu, Dec 17, 2020 at 3:27 PM Norman Maurer <[email protected]> wrote: > > I wonder if this is also related to one of the bug-reports we received: > > https://github.com/netty/netty-incubator-transport-io_uring/issues/14 That is curious. This ticket mentions Shmem though, and in our case it does not look suspicious at all. E.g. on a box that has the problem at the moment: Shmem: 41856 kB. The box has 256GB of RAM. But I'd (given my lack of knowledge) expect the issues to be related anyway. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 8:36 ` Dmitry Kadashev @ 2020-12-17 8:40 ` Dmitry Kadashev 2020-12-17 10:38 ` Josef 0 siblings, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-17 8:40 UTC (permalink / raw) To: Norman Maurer; +Cc: io-uring, Jens Axboe On Thu, Dec 17, 2020 at 3:36 PM Dmitry Kadashev <[email protected]> wrote: > > On Thu, Dec 17, 2020 at 3:27 PM Norman Maurer > <[email protected]> wrote: > > > > I wonder if this is also related to one of the bug-reports we received: > > > > https://github.com/netty/netty-incubator-transport-io_uring/issues/14 > > That is curious. This ticket mentions Shmem though, and in our case it does > not look suspicious at all. E.g. on a box that has the problem at the moment: > Shmem: 41856 kB. The box has 256GB of RAM. > > But I'd (given my lack of knowledge) expect the issues to be related anyway. One common thing here is the ticket OP mentions kill -9, and we do use that as well at least in some circumstances. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 8:40 ` Dmitry Kadashev @ 2020-12-17 10:38 ` Josef 2020-12-17 11:10 ` Dmitry Kadashev 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-17 10:38 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: Norman Maurer, io-uring, Jens Axboe > > That is curious. This ticket mentions Shmem though, and in our case it does > not look suspicious at all. E.g. on a box that has the problem at the moment: > Shmem: 41856 kB. The box has 256GB of RAM. > > But I'd (given my lack of knowledge) expect the issues to be related anyway. what about mapped? mapped is pretty high 1GB on my machine, I'm still reproduce that in C...however the user process is killed but not the io_wq_worker kernel processes, that's also the reason why the server socket still listening(even if the user process is killed), the bug only occurs(in netty) with a high number of operations and using eventfd_write to unblock io_uring_enter(IORING_ENTER_GETEVENTS) (tested on kernel 5.9 and 5.10) -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 10:38 ` Josef @ 2020-12-17 11:10 ` Dmitry Kadashev 2020-12-17 13:43 ` Victor Stewart 0 siblings, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-17 11:10 UTC (permalink / raw) To: Josef; +Cc: Norman Maurer, io-uring, Jens Axboe On Thu, Dec 17, 2020 at 5:38 PM Josef <[email protected]> wrote: > > > > That is curious. This ticket mentions Shmem though, and in our case it does > > not look suspicious at all. E.g. on a box that has the problem at the moment: > > Shmem: 41856 kB. The box has 256GB of RAM. > > > > But I'd (given my lack of knowledge) expect the issues to be related anyway. > > what about mapped? mapped is pretty high 1GB on my machine, I'm still > reproduce that in C...however the user process is killed but not the > io_wq_worker kernel processes, that's also the reason why the server > socket still listening(even if the user process is killed), the bug > only occurs(in netty) with a high number of operations and using > eventfd_write to unblock io_uring_enter(IORING_ENTER_GETEVENTS) > > (tested on kernel 5.9 and 5.10) Stats from another box with this problem (still 256G of RAM): Mlocked: 17096 kB Mapped: 171480 kB Shmem: 41880 kB Does not look suspicious at a glance. Number of io_wq* processes is 23-31. Uptime is 27 days, 24 rings per process, process was restarted 4 times, 3 out of these four the old instance was killed with SIGKILL. On the last process start 18 rings failed to initialize, but after that 6 more were initialized successfully. It was before the old instance was killed. Maybe it's related to the load and number of io-wq processes, e.g. some of them exited and a few more rings were initialized successfully. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 11:10 ` Dmitry Kadashev @ 2020-12-17 13:43 ` Victor Stewart 2020-12-18 9:20 ` Dmitry Kadashev 0 siblings, 1 reply; 52+ messages in thread From: Victor Stewart @ 2020-12-17 13:43 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: Josef, Norman Maurer, io-uring, Jens Axboe On Thu, Dec 17, 2020 at 11:12 AM Dmitry Kadashev <[email protected]> wrote: > > On Thu, Dec 17, 2020 at 5:38 PM Josef <[email protected]> wrote: > > > > > > That is curious. This ticket mentions Shmem though, and in our case it does > > > not look suspicious at all. E.g. on a box that has the problem at the moment: > > > Shmem: 41856 kB. The box has 256GB of RAM. > > > > > > But I'd (given my lack of knowledge) expect the issues to be related anyway. > > > > what about mapped? mapped is pretty high 1GB on my machine, I'm still > > reproduce that in C...however the user process is killed but not the > > io_wq_worker kernel processes, that's also the reason why the server > > socket still listening(even if the user process is killed), the bug > > only occurs(in netty) with a high number of operations and using > > eventfd_write to unblock io_uring_enter(IORING_ENTER_GETEVENTS) > > > > (tested on kernel 5.9 and 5.10) > > Stats from another box with this problem (still 256G of RAM): > > Mlocked: 17096 kB > Mapped: 171480 kB > Shmem: 41880 kB > > Does not look suspicious at a glance. Number of io_wq* processes is 23-31. > > Uptime is 27 days, 24 rings per process, process was restarted 4 times, 3 out of > these four the old instance was killed with SIGKILL. On the last process start > 18 rings failed to initialize, but after that 6 more were initialized > successfully. It was before the old instance was killed. Maybe it's related to > the load and number of io-wq processes, e.g. some of them exited and a few more > rings were initialized successfully. have you tried using IORING_SETUP_ATTACH_WQ? https://lkml.org/lkml/2020/1/27/763 > > -- > Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 13:43 ` Victor Stewart @ 2020-12-18 9:20 ` Dmitry Kadashev 2020-12-18 17:22 ` Jens Axboe 0 siblings, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-18 9:20 UTC (permalink / raw) To: Victor Stewart; +Cc: Josef, Norman Maurer, io-uring, Jens Axboe On Thu, Dec 17, 2020 at 8:43 PM Victor Stewart <[email protected]> wrote: > > On Thu, Dec 17, 2020 at 11:12 AM Dmitry Kadashev <[email protected]> wrote: > > > > On Thu, Dec 17, 2020 at 5:38 PM Josef <[email protected]> wrote: > > > > > > > > That is curious. This ticket mentions Shmem though, and in our case it does > > > > not look suspicious at all. E.g. on a box that has the problem at the moment: > > > > Shmem: 41856 kB. The box has 256GB of RAM. > > > > > > > > But I'd (given my lack of knowledge) expect the issues to be related anyway. > > > > > > what about mapped? mapped is pretty high 1GB on my machine, I'm still > > > reproduce that in C...however the user process is killed but not the > > > io_wq_worker kernel processes, that's also the reason why the server > > > socket still listening(even if the user process is killed), the bug > > > only occurs(in netty) with a high number of operations and using > > > eventfd_write to unblock io_uring_enter(IORING_ENTER_GETEVENTS) > > > > > > (tested on kernel 5.9 and 5.10) > > > > Stats from another box with this problem (still 256G of RAM): > > > > Mlocked: 17096 kB > > Mapped: 171480 kB > > Shmem: 41880 kB > > > > Does not look suspicious at a glance. Number of io_wq* processes is 23-31. > > > > Uptime is 27 days, 24 rings per process, process was restarted 4 times, 3 out of > > these four the old instance was killed with SIGKILL. On the last process start > > 18 rings failed to initialize, but after that 6 more were initialized > > successfully. It was before the old instance was killed. Maybe it's related to > > the load and number of io-wq processes, e.g. some of them exited and a few more > > rings were initialized successfully. > > have you tried using IORING_SETUP_ATTACH_WQ? > > https://lkml.org/lkml/2020/1/27/763 No, I have not, but while using that might help to slow down progression of the issue, it won't fix it - at least if I understand correctly. The problem is not that those rings can't be created at all - there is no problem with that on a freshly booted box, but rather that after some (potentially abrupt) owning process terminations under load kernel gets into a state where - eventually - no new rings can be created at all. Not a single one. In the above example the issue just haven't progressed far enough yet. In other words, there seems to be a leak / accounting problem in the io_uring code that is triggered by abrupt process termination under load (just no io_uring_queue_exit?) - this is not a usage problem. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-18 9:20 ` Dmitry Kadashev @ 2020-12-18 17:22 ` Jens Axboe 0 siblings, 0 replies; 52+ messages in thread From: Jens Axboe @ 2020-12-18 17:22 UTC (permalink / raw) To: Dmitry Kadashev, Victor Stewart; +Cc: Josef, Norman Maurer, io-uring On 12/18/20 2:20 AM, Dmitry Kadashev wrote: > On Thu, Dec 17, 2020 at 8:43 PM Victor Stewart <[email protected]> wrote: >> >> On Thu, Dec 17, 2020 at 11:12 AM Dmitry Kadashev <[email protected]> wrote: >>> >>> On Thu, Dec 17, 2020 at 5:38 PM Josef <[email protected]> wrote: >>>> >>>>>> That is curious. This ticket mentions Shmem though, and in our case it does >>>> > not look suspicious at all. E.g. on a box that has the problem at the moment: >>>> > Shmem: 41856 kB. The box has 256GB of RAM. >>>> > >>>> > But I'd (given my lack of knowledge) expect the issues to be related anyway. >>>> >>>> what about mapped? mapped is pretty high 1GB on my machine, I'm still >>>> reproduce that in C...however the user process is killed but not the >>>> io_wq_worker kernel processes, that's also the reason why the server >>>> socket still listening(even if the user process is killed), the bug >>>> only occurs(in netty) with a high number of operations and using >>>> eventfd_write to unblock io_uring_enter(IORING_ENTER_GETEVENTS) >>>> >>>> (tested on kernel 5.9 and 5.10) >>> >>> Stats from another box with this problem (still 256G of RAM): >>> >>> Mlocked: 17096 kB >>> Mapped: 171480 kB >>> Shmem: 41880 kB >>> >>> Does not look suspicious at a glance. Number of io_wq* processes is 23-31. >>> >>> Uptime is 27 days, 24 rings per process, process was restarted 4 times, 3 out of >>> these four the old instance was killed with SIGKILL. On the last process start >>> 18 rings failed to initialize, but after that 6 more were initialized >>> successfully. It was before the old instance was killed. Maybe it's related to >>> the load and number of io-wq processes, e.g. some of them exited and a few more >>> rings were initialized successfully. >> >> have you tried using IORING_SETUP_ATTACH_WQ? >> >> https://lkml.org/lkml/2020/1/27/763 > > No, I have not, but while using that might help to slow down progression of the > issue, it won't fix it - at least if I understand correctly. The problem is not > that those rings can't be created at all - there is no problem with that on a > freshly booted box, but rather that after some (potentially abrupt) owning > process terminations under load kernel gets into a state where - eventually - no > new rings can be created at all. Not a single one. In the above example the > issue just haven't progressed far enough yet. > > In other words, there seems to be a leak / accounting problem in the io_uring > code that is triggered by abrupt process termination under load (just no > io_uring_queue_exit?) - this is not a usage problem. Right, I don't think that's related at all. Might be a good idea in general depending on your use case, but it won't really have any bearing on the particular issue at hand. -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 8:19 "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) Dmitry Kadashev 2020-12-17 8:26 ` Norman Maurer @ 2020-12-18 15:26 ` Jens Axboe 2020-12-18 17:21 ` Josef 1 sibling, 1 reply; 52+ messages in thread From: Jens Axboe @ 2020-12-18 15:26 UTC (permalink / raw) To: Dmitry Kadashev, io-uring On 12/17/20 1:19 AM, Dmitry Kadashev wrote: > Hi, > > We've ran into something that looks like a memory accounting problem > in the kernel / io_uring code. We use multiple rings per process, and > generally it works fine. Until it does not - new ring creation just > fails with ENOMEM. And at that point it fails consistently until the > box is rebooted. > > More details: we use multiple rings per process, typically they are > initialized on the process start (not necessarily, but that is not > important here, let's just assume all are initialized on the process > start). On a freshly booted box everything works fine. But after a > while - and some process restarts - io_uring_queue_init() starts to > fail with ENOMEM. Sometimes we see it fail, but then subsequent ones > succeed (in the same process), but over time it gets worse, and > eventually no ring can be initialized. And once that happens the only > way to fix the problem is to restart the box. Most of the mentioned > restarts are graceful: a new process is started and then the old one > is killed, possibly with the KILL signal if it does not shut down in > time. Things work fine for some time, but eventually we start getting > those errors. > > Originally we've used 5.6.6 kernel, but given the fact quite a few > accounting issues were fixed in io_uring in 5.8, we've tried 5.9.5 as > well, but the issue is not gone. > > Just in case, everything else seems to be working fine, it just falls > back to the thread pool instead of io_uring, and then everything > continues to work just fine. > > I was not able to spot anything suspicious in the /proc/meminfo. We > have RLIMIT_MEMLOCK set to infinity. And on a box that currently > experiences the problem /proc/meminfo shows just 24MB as locked. > > Any pointers to how can we debug this? I've read through this thread, but haven't had time to really debug it yet. I did try a few test cases, and wasn't able to trigger anything. The signal part is interesting, as it would cause parallel teardowns potentially. And I did post a patch for that yesterday, where I did spot a race in the user mm accounting. I don't think this is related to this one, but would still be useful if you could test with this applied: https://lore.kernel.org/io-uring/[email protected]/T/#u just in case... -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-18 15:26 ` Jens Axboe @ 2020-12-18 17:21 ` Josef 2020-12-18 17:23 ` Jens Axboe 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-18 17:21 UTC (permalink / raw) To: Jens Axboe; +Cc: Dmitry Kadashev, io-uring, Norman Maurer > I've read through this thread, but haven't had time to really debug it > yet. I did try a few test cases, and wasn't able to trigger anything. > The signal part is interesting, as it would cause parallel teardowns > potentially. And I did post a patch for that yesterday, where I did spot > a race in the user mm accounting. I don't think this is related to this > one, but would still be useful if you could test with this applied: > > https://lore.kernel.org/io-uring/[email protected]/T/#u as you expected it didn't work, unfortunately I couldn't reproduce that in C..I'll try to debug in netty/kernel -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-18 17:21 ` Josef @ 2020-12-18 17:23 ` Jens Axboe 2020-12-19 2:49 ` Josef 0 siblings, 1 reply; 52+ messages in thread From: Jens Axboe @ 2020-12-18 17:23 UTC (permalink / raw) To: Josef; +Cc: Dmitry Kadashev, io-uring, Norman Maurer On 12/18/20 10:21 AM, Josef wrote: >> I've read through this thread, but haven't had time to really debug it >> yet. I did try a few test cases, and wasn't able to trigger anything. >> The signal part is interesting, as it would cause parallel teardowns >> potentially. And I did post a patch for that yesterday, where I did spot >> a race in the user mm accounting. I don't think this is related to this >> one, but would still be useful if you could test with this applied: >> >> https://lore.kernel.org/io-uring/[email protected]/T/#u > > as you expected it didn't work, unfortunately I couldn't reproduce > that in C..I'll try to debug in netty/kernel I'm happy to run _any_ reproducer, so please do let us know if you manage to find something that I can run with netty. As long as it includes instructions for exactly how to run it :-) -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-18 17:23 ` Jens Axboe @ 2020-12-19 2:49 ` Josef 2020-12-19 16:13 ` Jens Axboe 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-19 2:49 UTC (permalink / raw) To: Jens Axboe; +Cc: Dmitry Kadashev, io-uring, Norman Maurer > I'm happy to run _any_ reproducer, so please do let us know if you > manage to find something that I can run with netty. As long as it > includes instructions for exactly how to run it :-) cool :) I just created a repo for that: https://github.com/1Jo1/netty-io_uring-kernel-debugging.git - install jdk 1.8 - to run netty: ./mvnw compile exec:java -Dexec.mainClass="uring.netty.example.EchoUringServer" - to run the echo test: cargo run --release -- --address "127.0.0.1:2022" --number 200 --duration 20 --length 300 (https://github.com/haraldh/rust_echo_bench.git) - process kill -9 async flag is enabled and these operation are used: OP_READ, OP_WRITE, OP_POLL_ADD, OP_CLOSE, OP_ACCEPT (btw you can change the port in EchoUringServer.java) -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 2:49 ` Josef @ 2020-12-19 16:13 ` Jens Axboe 2020-12-19 16:29 ` Jens Axboe 0 siblings, 1 reply; 52+ messages in thread From: Jens Axboe @ 2020-12-19 16:13 UTC (permalink / raw) To: Josef; +Cc: Dmitry Kadashev, io-uring, Norman Maurer On 12/18/20 7:49 PM, Josef wrote: >> I'm happy to run _any_ reproducer, so please do let us know if you >> manage to find something that I can run with netty. As long as it >> includes instructions for exactly how to run it :-) > > cool :) I just created a repo for that: > https://github.com/1Jo1/netty-io_uring-kernel-debugging.git > > - install jdk 1.8 > - to run netty: ./mvnw compile exec:java > -Dexec.mainClass="uring.netty.example.EchoUringServer" > - to run the echo test: cargo run --release -- --address > "127.0.0.1:2022" --number 200 --duration 20 --length 300 > (https://github.com/haraldh/rust_echo_bench.git) > - process kill -9 > > async flag is enabled and these operation are used: OP_READ, > OP_WRITE, OP_POLL_ADD, OP_CLOSE, OP_ACCEPT > > (btw you can change the port in EchoUringServer.java) This is great! Not sure this is the same issue, but what I see here is that we have leftover workers when the test is killed. This means the rings aren't gone, and the memory isn't freed (and unaccounted), which would ultimately lead to problems of course, similar to just an accounting bug or race. The above _seems_ to be related to IOSQE_ASYNC. Trying to narrow it down... -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 16:13 ` Jens Axboe @ 2020-12-19 16:29 ` Jens Axboe 2020-12-19 17:11 ` Jens Axboe 0 siblings, 1 reply; 52+ messages in thread From: Jens Axboe @ 2020-12-19 16:29 UTC (permalink / raw) To: Josef; +Cc: Dmitry Kadashev, io-uring, Norman Maurer On 12/19/20 9:13 AM, Jens Axboe wrote: > On 12/18/20 7:49 PM, Josef wrote: >>> I'm happy to run _any_ reproducer, so please do let us know if you >>> manage to find something that I can run with netty. As long as it >>> includes instructions for exactly how to run it :-) >> >> cool :) I just created a repo for that: >> https://github.com/1Jo1/netty-io_uring-kernel-debugging.git >> >> - install jdk 1.8 >> - to run netty: ./mvnw compile exec:java >> -Dexec.mainClass="uring.netty.example.EchoUringServer" >> - to run the echo test: cargo run --release -- --address >> "127.0.0.1:2022" --number 200 --duration 20 --length 300 >> (https://github.com/haraldh/rust_echo_bench.git) >> - process kill -9 >> >> async flag is enabled and these operation are used: OP_READ, >> OP_WRITE, OP_POLL_ADD, OP_CLOSE, OP_ACCEPT >> >> (btw you can change the port in EchoUringServer.java) > > This is great! Not sure this is the same issue, but what I see here is > that we have leftover workers when the test is killed. This means the > rings aren't gone, and the memory isn't freed (and unaccounted), which > would ultimately lead to problems of course, similar to just an > accounting bug or race. > > The above _seems_ to be related to IOSQE_ASYNC. Trying to narrow it > down... Further narrowed down, it seems to be related to IOSQE_ASYNC on the read requests. I'm guessing there are cases where we end up not canceling them on ring close, hence the ring stays active, etc. If I just add a hack to clear IOSQE_ASYNC on IORING_OP_READ, then the test terminates fine on the kill -9. -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 16:29 ` Jens Axboe @ 2020-12-19 17:11 ` Jens Axboe 2020-12-19 17:34 ` Norman Maurer 2020-12-21 10:31 ` Dmitry Kadashev 0 siblings, 2 replies; 52+ messages in thread From: Jens Axboe @ 2020-12-19 17:11 UTC (permalink / raw) To: Josef; +Cc: Dmitry Kadashev, io-uring, Norman Maurer On 12/19/20 9:29 AM, Jens Axboe wrote: > On 12/19/20 9:13 AM, Jens Axboe wrote: >> On 12/18/20 7:49 PM, Josef wrote: >>>> I'm happy to run _any_ reproducer, so please do let us know if you >>>> manage to find something that I can run with netty. As long as it >>>> includes instructions for exactly how to run it :-) >>> >>> cool :) I just created a repo for that: >>> https://github.com/1Jo1/netty-io_uring-kernel-debugging.git >>> >>> - install jdk 1.8 >>> - to run netty: ./mvnw compile exec:java >>> -Dexec.mainClass="uring.netty.example.EchoUringServer" >>> - to run the echo test: cargo run --release -- --address >>> "127.0.0.1:2022" --number 200 --duration 20 --length 300 >>> (https://github.com/haraldh/rust_echo_bench.git) >>> - process kill -9 >>> >>> async flag is enabled and these operation are used: OP_READ, >>> OP_WRITE, OP_POLL_ADD, OP_CLOSE, OP_ACCEPT >>> >>> (btw you can change the port in EchoUringServer.java) >> >> This is great! Not sure this is the same issue, but what I see here is >> that we have leftover workers when the test is killed. This means the >> rings aren't gone, and the memory isn't freed (and unaccounted), which >> would ultimately lead to problems of course, similar to just an >> accounting bug or race. >> >> The above _seems_ to be related to IOSQE_ASYNC. Trying to narrow it >> down... > > Further narrowed down, it seems to be related to IOSQE_ASYNC on the > read requests. I'm guessing there are cases where we end up not > canceling them on ring close, hence the ring stays active, etc. > > If I just add a hack to clear IOSQE_ASYNC on IORING_OP_READ, then > the test terminates fine on the kill -9. And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd file descriptor. You probably don't want/mean to do that as it's pollable, I guess it's done because you just set it on all reads for the test? In any case, it should of course work. This is the leftover trace when we should be exiting, but an io-wq worker is still trying to get data from the eventfd: $ sudo cat /proc/2148/stack [<0>] eventfd_read+0x160/0x260 [<0>] io_iter_do_read+0x1b/0x40 [<0>] io_read+0xa5/0x320 [<0>] io_issue_sqe+0x23c/0xe80 [<0>] io_wq_submit_work+0x6e/0x1a0 [<0>] io_worker_handle_work+0x13d/0x4e0 [<0>] io_wqe_worker+0x2aa/0x360 [<0>] kthread+0x130/0x160 [<0>] ret_from_fork+0x1f/0x30 which will never finish at this point, it should have been canceled. -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 17:11 ` Jens Axboe @ 2020-12-19 17:34 ` Norman Maurer 2020-12-19 17:38 ` Jens Axboe 2020-12-21 10:31 ` Dmitry Kadashev 1 sibling, 1 reply; 52+ messages in thread From: Norman Maurer @ 2020-12-19 17:34 UTC (permalink / raw) To: Jens Axboe; +Cc: Josef, Dmitry Kadashev, io-uring Thanks a lot ... we can just workaround this than in netty . Bye Norman > Am 19.12.2020 um 18:11 schrieb Jens Axboe <[email protected]>: > > On 12/19/20 9:29 AM, Jens Axboe wrote: >>> On 12/19/20 9:13 AM, Jens Axboe wrote: >>> On 12/18/20 7:49 PM, Josef wrote: >>>>> I'm happy to run _any_ reproducer, so please do let us know if you >>>>> manage to find something that I can run with netty. As long as it >>>>> includes instructions for exactly how to run it :-) >>>> >>>> cool :) I just created a repo for that: >>>> https://github.com/1Jo1/netty-io_uring-kernel-debugging.git >>>> >>>> - install jdk 1.8 >>>> - to run netty: ./mvnw compile exec:java >>>> -Dexec.mainClass="uring.netty.example.EchoUringServer" >>>> - to run the echo test: cargo run --release -- --address >>>> "127.0.0.1:2022" --number 200 --duration 20 --length 300 >>>> (https://github.com/haraldh/rust_echo_bench.git) >>>> - process kill -9 >>>> >>>> async flag is enabled and these operation are used: OP_READ, >>>> OP_WRITE, OP_POLL_ADD, OP_CLOSE, OP_ACCEPT >>>> >>>> (btw you can change the port in EchoUringServer.java) >>> >>> This is great! Not sure this is the same issue, but what I see here is >>> that we have leftover workers when the test is killed. This means the >>> rings aren't gone, and the memory isn't freed (and unaccounted), which >>> would ultimately lead to problems of course, similar to just an >>> accounting bug or race. >>> >>> The above _seems_ to be related to IOSQE_ASYNC. Trying to narrow it >>> down... >> >> Further narrowed down, it seems to be related to IOSQE_ASYNC on the >> read requests. I'm guessing there are cases where we end up not >> canceling them on ring close, hence the ring stays active, etc. >> >> If I just add a hack to clear IOSQE_ASYNC on IORING_OP_READ, then >> the test terminates fine on the kill -9. > > And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd > file descriptor. You probably don't want/mean to do that as it's > pollable, I guess it's done because you just set it on all reads for the > test? > > In any case, it should of course work. This is the leftover trace when > we should be exiting, but an io-wq worker is still trying to get data > from the eventfd: > > $ sudo cat /proc/2148/stack > [<0>] eventfd_read+0x160/0x260 > [<0>] io_iter_do_read+0x1b/0x40 > [<0>] io_read+0xa5/0x320 > [<0>] io_issue_sqe+0x23c/0xe80 > [<0>] io_wq_submit_work+0x6e/0x1a0 > [<0>] io_worker_handle_work+0x13d/0x4e0 > [<0>] io_wqe_worker+0x2aa/0x360 > [<0>] kthread+0x130/0x160 > [<0>] ret_from_fork+0x1f/0x30 > > which will never finish at this point, it should have been canceled. > > -- > Jens Axboe > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 17:34 ` Norman Maurer @ 2020-12-19 17:38 ` Jens Axboe 2020-12-19 20:51 ` Josef 0 siblings, 1 reply; 52+ messages in thread From: Jens Axboe @ 2020-12-19 17:38 UTC (permalink / raw) To: Norman Maurer; +Cc: Josef, Dmitry Kadashev, io-uring On 12/19/20 10:34 AM, Norman Maurer wrote: >> Am 19.12.2020 um 18:11 schrieb Jens Axboe <[email protected]>: >> >> On 12/19/20 9:29 AM, Jens Axboe wrote: >>>> On 12/19/20 9:13 AM, Jens Axboe wrote: >>>> On 12/18/20 7:49 PM, Josef wrote: >>>>>> I'm happy to run _any_ reproducer, so please do let us know if you >>>>>> manage to find something that I can run with netty. As long as it >>>>>> includes instructions for exactly how to run it :-) >>>>> >>>>> cool :) I just created a repo for that: >>>>> https://github.com/1Jo1/netty-io_uring-kernel-debugging.git >>>>> >>>>> - install jdk 1.8 >>>>> - to run netty: ./mvnw compile exec:java >>>>> -Dexec.mainClass="uring.netty.example.EchoUringServer" >>>>> - to run the echo test: cargo run --release -- --address >>>>> "127.0.0.1:2022" --number 200 --duration 20 --length 300 >>>>> (https://github.com/haraldh/rust_echo_bench.git) >>>>> - process kill -9 >>>>> >>>>> async flag is enabled and these operation are used: OP_READ, >>>>> OP_WRITE, OP_POLL_ADD, OP_CLOSE, OP_ACCEPT >>>>> >>>>> (btw you can change the port in EchoUringServer.java) >>>> >>>> This is great! Not sure this is the same issue, but what I see here is >>>> that we have leftover workers when the test is killed. This means the >>>> rings aren't gone, and the memory isn't freed (and unaccounted), which >>>> would ultimately lead to problems of course, similar to just an >>>> accounting bug or race. >>>> >>>> The above _seems_ to be related to IOSQE_ASYNC. Trying to narrow it >>>> down... >>> >>> Further narrowed down, it seems to be related to IOSQE_ASYNC on the >>> read requests. I'm guessing there are cases where we end up not >>> canceling them on ring close, hence the ring stays active, etc. >>> >>> If I just add a hack to clear IOSQE_ASYNC on IORING_OP_READ, then >>> the test terminates fine on the kill -9. >> >> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >> file descriptor. You probably don't want/mean to do that as it's >> pollable, I guess it's done because you just set it on all reads for the >> test? >> >> In any case, it should of course work. This is the leftover trace when >> we should be exiting, but an io-wq worker is still trying to get data >> from the eventfd: >> >> $ sudo cat /proc/2148/stack >> [<0>] eventfd_read+0x160/0x260 >> [<0>] io_iter_do_read+0x1b/0x40 >> [<0>] io_read+0xa5/0x320 >> [<0>] io_issue_sqe+0x23c/0xe80 >> [<0>] io_wq_submit_work+0x6e/0x1a0 >> [<0>] io_worker_handle_work+0x13d/0x4e0 >> [<0>] io_wqe_worker+0x2aa/0x360 >> [<0>] kthread+0x130/0x160 >> [<0>] ret_from_fork+0x1f/0x30 >> >> which will never finish at this point, it should have been canceled. > > Thanks a lot ... we can just workaround this than in netty . That probably should be done in any case, since I don't think IOSQE_ASYNC is useful on the eventfd read for you. But I'm trying to narrow down _why_ it fails, it could be a general issue in how cancelations are processed for sudden exit. Which would explain why it only shows up for the kill -9 case. Anyway, digging into it :-) -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 17:38 ` Jens Axboe @ 2020-12-19 20:51 ` Josef 2020-12-19 21:54 ` Jens Axboe 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-19 20:51 UTC (permalink / raw) To: Jens Axboe; +Cc: Norman Maurer, Dmitry Kadashev, io-uring > And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd > file descriptor. You probably don't want/mean to do that as it's > pollable, I guess it's done because you just set it on all reads for the > test? yes exactly, eventfd fd is blocking, so it actually makes no sense to use IOSQE_ASYNC I just tested eventfd without the IOSQE_ASYNC flag, it seems to work in my tests, thanks a lot :) > In any case, it should of course work. This is the leftover trace when > we should be exiting, but an io-wq worker is still trying to get data > from the eventfd: interesting, btw what kind of tool do you use for kernel debugging? -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 20:51 ` Josef @ 2020-12-19 21:54 ` Jens Axboe 2020-12-19 23:13 ` Jens Axboe 0 siblings, 1 reply; 52+ messages in thread From: Jens Axboe @ 2020-12-19 21:54 UTC (permalink / raw) To: Josef; +Cc: Norman Maurer, Dmitry Kadashev, io-uring On 12/19/20 1:51 PM, Josef wrote: >> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >> file descriptor. You probably don't want/mean to do that as it's >> pollable, I guess it's done because you just set it on all reads for the >> test? > > yes exactly, eventfd fd is blocking, so it actually makes no sense to > use IOSQE_ASYNC Right, and it's pollable too. > I just tested eventfd without the IOSQE_ASYNC flag, it seems to work > in my tests, thanks a lot :) > >> In any case, it should of course work. This is the leftover trace when >> we should be exiting, but an io-wq worker is still trying to get data >> from the eventfd: > > interesting, btw what kind of tool do you use for kernel debugging? Just poking at it and thinking about it, no hidden magic I'm afraid... -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 21:54 ` Jens Axboe @ 2020-12-19 23:13 ` Jens Axboe 2020-12-19 23:42 ` Josef 2020-12-19 23:42 ` Pavel Begunkov 0 siblings, 2 replies; 52+ messages in thread From: Jens Axboe @ 2020-12-19 23:13 UTC (permalink / raw) To: Josef; +Cc: Norman Maurer, Dmitry Kadashev, io-uring On 12/19/20 2:54 PM, Jens Axboe wrote: > On 12/19/20 1:51 PM, Josef wrote: >>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >>> file descriptor. You probably don't want/mean to do that as it's >>> pollable, I guess it's done because you just set it on all reads for the >>> test? >> >> yes exactly, eventfd fd is blocking, so it actually makes no sense to >> use IOSQE_ASYNC > > Right, and it's pollable too. > >> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work >> in my tests, thanks a lot :) >> >>> In any case, it should of course work. This is the leftover trace when >>> we should be exiting, but an io-wq worker is still trying to get data >>> from the eventfd: >> >> interesting, btw what kind of tool do you use for kernel debugging? > > Just poking at it and thinking about it, no hidden magic I'm afraid... Josef, can you try with this added? Looks bigger than it is, most of it is just moving one function below another. diff --git a/fs/io_uring.c b/fs/io_uring.c index f3690dfdd564..96f6445ab827 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8735,10 +8735,43 @@ static void io_cancel_defer_files(struct io_ring_ctx *ctx, } } +static void __io_uring_cancel_task_requests(struct io_ring_ctx *ctx, + struct task_struct *task) +{ + while (1) { + struct io_task_cancel cancel = { .task = task, .files = NULL, }; + enum io_wq_cancel cret; + bool ret = false; + + cret = io_wq_cancel_cb(ctx->io_wq, io_cancel_task_cb, &cancel, true); + if (cret != IO_WQ_CANCEL_NOTFOUND) + ret = true; + + /* SQPOLL thread does its own polling */ + if (!(ctx->flags & IORING_SETUP_SQPOLL)) { + while (!list_empty_careful(&ctx->iopoll_list)) { + io_iopoll_try_reap_events(ctx); + ret = true; + } + } + + ret |= io_poll_remove_all(ctx, task, NULL); + ret |= io_kill_timeouts(ctx, task, NULL); + if (!ret) + break; + io_run_task_work(); + cond_resched(); + } +} + static void io_uring_cancel_files(struct io_ring_ctx *ctx, struct task_struct *task, struct files_struct *files) { + /* files == NULL, task is exiting. Cancel all that match task */ + if (!files) + __io_uring_cancel_task_requests(ctx, task); + while (!list_empty_careful(&ctx->inflight_list)) { struct io_task_cancel cancel = { .task = task, .files = files }; struct io_kiocb *req; @@ -8772,35 +8805,6 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, } } -static void __io_uring_cancel_task_requests(struct io_ring_ctx *ctx, - struct task_struct *task) -{ - while (1) { - struct io_task_cancel cancel = { .task = task, .files = NULL, }; - enum io_wq_cancel cret; - bool ret = false; - - cret = io_wq_cancel_cb(ctx->io_wq, io_cancel_task_cb, &cancel, true); - if (cret != IO_WQ_CANCEL_NOTFOUND) - ret = true; - - /* SQPOLL thread does its own polling */ - if (!(ctx->flags & IORING_SETUP_SQPOLL)) { - while (!list_empty_careful(&ctx->iopoll_list)) { - io_iopoll_try_reap_events(ctx); - ret = true; - } - } - - ret |= io_poll_remove_all(ctx, task, NULL); - ret |= io_kill_timeouts(ctx, task, NULL); - if (!ret) - break; - io_run_task_work(); - cond_resched(); - } -} - /* * We need to iteratively cancel requests, in case a request has dependent * hard links. 
These persist even for failure of cancelations, hence keep -- Jens Axboe ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 23:13 ` Jens Axboe @ 2020-12-19 23:42 ` Josef 2020-12-19 23:42 ` Pavel Begunkov 1 sibling, 0 replies; 52+ messages in thread From: Josef @ 2020-12-19 23:42 UTC (permalink / raw) To: Jens Axboe; +Cc: Norman Maurer, Dmitry Kadashev, io-uring > Josef, can you try with this added? Looks bigger than it is, most of it > is just moving one function below another. yeah sure, sorry stupid question which branch is the patch based on? (last commit?) -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 23:13 ` Jens Axboe 2020-12-19 23:42 ` Josef @ 2020-12-19 23:42 ` Pavel Begunkov 2020-12-20 0:25 ` Jens Axboe 1 sibling, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-19 23:42 UTC (permalink / raw) To: Jens Axboe, Josef; +Cc: Norman Maurer, Dmitry Kadashev, io-uring On 19/12/2020 23:13, Jens Axboe wrote: > On 12/19/20 2:54 PM, Jens Axboe wrote: >> On 12/19/20 1:51 PM, Josef wrote: >>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >>>> file descriptor. You probably don't want/mean to do that as it's >>>> pollable, I guess it's done because you just set it on all reads for the >>>> test? >>> >>> yes exactly, eventfd fd is blocking, so it actually makes no sense to >>> use IOSQE_ASYNC >> >> Right, and it's pollable too. >> >>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work >>> in my tests, thanks a lot :) >>> >>>> In any case, it should of course work. This is the leftover trace when >>>> we should be exiting, but an io-wq worker is still trying to get data >>>> from the eventfd: >>> >>> interesting, btw what kind of tool do you use for kernel debugging? >> >> Just poking at it and thinking about it, no hidden magic I'm afraid... > > Josef, can you try with this added? Looks bigger than it is, most of it > is just moving one function below another. Hmm, which kernel revision are you poking? Seems it doesn't match io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with NULL files. if (!files) __io_uring_cancel_task_requests(ctx, task); else io_uring_cancel_files(ctx, task, files); > diff --git a/fs/io_uring.c b/fs/io_uring.c > index f3690dfdd564..96f6445ab827 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -8735,10 +8735,43 @@ static void io_cancel_defer_files(struct io_ring_ctx *ctx, [...] > static void io_uring_cancel_files(struct io_ring_ctx *ctx, > struct task_struct *task, > struct files_struct *files) > { > + /* files == NULL, task is exiting. Cancel all that match task */ > + if (!files) > + __io_uring_cancel_task_requests(ctx, task); > + For 5.11 I believe it should look like diff --git a/fs/io_uring.c b/fs/io_uring.c index f3690dfdd564..38fb351cc1dd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8822,9 +8822,8 @@ static void io_uring_cancel_task_requests(struct io_ring_ctx *ctx, io_cqring_overflow_flush(ctx, true, task, files); io_ring_submit_unlock(ctx, (ctx->flags & IORING_SETUP_IOPOLL)); - if (!files) - __io_uring_cancel_task_requests(ctx, task); - else + __io_uring_cancel_task_requests(ctx, task); + if (files) io_uring_cancel_files(ctx, task, files); if ((ctx->flags & IORING_SETUP_SQPOLL) && ctx->sq_data) { -- Pavel Begunkov ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 23:42 ` Pavel Begunkov @ 2020-12-20 0:25 ` Jens Axboe 2020-12-20 0:55 ` Pavel Begunkov 2020-12-20 1:57 ` Pavel Begunkov 0 siblings, 2 replies; 52+ messages in thread From: Jens Axboe @ 2020-12-20 0:25 UTC (permalink / raw) To: Pavel Begunkov, Josef; +Cc: Norman Maurer, Dmitry Kadashev, io-uring On 12/19/20 4:42 PM, Pavel Begunkov wrote: > On 19/12/2020 23:13, Jens Axboe wrote: >> On 12/19/20 2:54 PM, Jens Axboe wrote: >>> On 12/19/20 1:51 PM, Josef wrote: >>>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >>>>> file descriptor. You probably don't want/mean to do that as it's >>>>> pollable, I guess it's done because you just set it on all reads for the >>>>> test? >>>> >>>> yes exactly, eventfd fd is blocking, so it actually makes no sense to >>>> use IOSQE_ASYNC >>> >>> Right, and it's pollable too. >>> >>>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work >>>> in my tests, thanks a lot :) >>>> >>>>> In any case, it should of course work. This is the leftover trace when >>>>> we should be exiting, but an io-wq worker is still trying to get data >>>>> from the eventfd: >>>> >>>> interesting, btw what kind of tool do you use for kernel debugging? >>> >>> Just poking at it and thinking about it, no hidden magic I'm afraid... >> >> Josef, can you try with this added? Looks bigger than it is, most of it >> is just moving one function below another. > > Hmm, which kernel revision are you poking? Seems it doesn't match > io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with > NULL files. > > if (!files) > __io_uring_cancel_task_requests(ctx, task); > else > io_uring_cancel_files(ctx, task, files); Yeah, I think I messed up. If files == NULL, then the task is going away. So we should cancel all requests that match 'task', not just ones that match task && files. Not sure I have much more time to look into this before next week, but something like that. The problem case is the async worker being queued, long before the task is killed and the contexts go away. But from exit_files(), we're only concerned with canceling if we have inflight. Doesn't look right to me. -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 0:25 ` Jens Axboe @ 2020-12-20 0:55 ` Pavel Begunkov 2020-12-21 10:35 ` Dmitry Kadashev 2020-12-20 1:57 ` Pavel Begunkov 1 sibling, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-20 0:55 UTC (permalink / raw) To: Jens Axboe, Josef; +Cc: Norman Maurer, Dmitry Kadashev, io-uring On 20/12/2020 00:25, Jens Axboe wrote: > On 12/19/20 4:42 PM, Pavel Begunkov wrote: >> On 19/12/2020 23:13, Jens Axboe wrote: >>> On 12/19/20 2:54 PM, Jens Axboe wrote: >>>> On 12/19/20 1:51 PM, Josef wrote: >>>>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >>>>>> file descriptor. You probably don't want/mean to do that as it's >>>>>> pollable, I guess it's done because you just set it on all reads for the >>>>>> test? >>>>> >>>>> yes exactly, eventfd fd is blocking, so it actually makes no sense to >>>>> use IOSQE_ASYNC >>>> >>>> Right, and it's pollable too. >>>> >>>>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work >>>>> in my tests, thanks a lot :) >>>>> >>>>>> In any case, it should of course work. This is the leftover trace when >>>>>> we should be exiting, but an io-wq worker is still trying to get data >>>>>> from the eventfd: >>>>> >>>>> interesting, btw what kind of tool do you use for kernel debugging? >>>> >>>> Just poking at it and thinking about it, no hidden magic I'm afraid... >>> >>> Josef, can you try with this added? Looks bigger than it is, most of it >>> is just moving one function below another. >> >> Hmm, which kernel revision are you poking? Seems it doesn't match >> io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with >> NULL files. >> >> if (!files) >> __io_uring_cancel_task_requests(ctx, task); >> else >> io_uring_cancel_files(ctx, task, files); > > Yeah, I think I messed up. If files == NULL, then the task is going away. > So we should cancel all requests that match 'task', not just ones that > match task && files. > > Not sure I have much more time to look into this before next week, but > something like that. > > The problem case is the async worker being queued, long before the task > is killed and the contexts go away. But from exit_files(), we're only > concerned with canceling if we have inflight. Doesn't look right to me. In theory all that should be killed in io_ring_ctx_wait_and_kill(), of course that's if the ring itself is closed. Guys, do you share rings between processes? Explicitly like sending io_uring fd over a socket, or implicitly e.g. sharing fd tables (threads), or cloning with copying fd tables (and so taking a ref to a ring). In other words, if you kill all your io_uring applications, does it go back to normal? -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 0:55 ` Pavel Begunkov @ 2020-12-21 10:35 ` Dmitry Kadashev 2020-12-21 10:49 ` Dmitry Kadashev 2020-12-21 11:00 ` Dmitry Kadashev 0 siblings, 2 replies; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-21 10:35 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Sun, Dec 20, 2020 at 7:59 AM Pavel Begunkov <[email protected]> wrote: > > On 20/12/2020 00:25, Jens Axboe wrote: > > On 12/19/20 4:42 PM, Pavel Begunkov wrote: > >> On 19/12/2020 23:13, Jens Axboe wrote: > >>> On 12/19/20 2:54 PM, Jens Axboe wrote: > >>>> On 12/19/20 1:51 PM, Josef wrote: > >>>>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd > >>>>>> file descriptor. You probably don't want/mean to do that as it's > >>>>>> pollable, I guess it's done because you just set it on all reads for the > >>>>>> test? > >>>>> > >>>>> yes exactly, eventfd fd is blocking, so it actually makes no sense to > >>>>> use IOSQE_ASYNC > >>>> > >>>> Right, and it's pollable too. > >>>> > >>>>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work > >>>>> in my tests, thanks a lot :) > >>>>> > >>>>>> In any case, it should of course work. This is the leftover trace when > >>>>>> we should be exiting, but an io-wq worker is still trying to get data > >>>>>> from the eventfd: > >>>>> > >>>>> interesting, btw what kind of tool do you use for kernel debugging? > >>>> > >>>> Just poking at it and thinking about it, no hidden magic I'm afraid... > >>> > >>> Josef, can you try with this added? Looks bigger than it is, most of it > >>> is just moving one function below another. > >> > >> Hmm, which kernel revision are you poking? Seems it doesn't match > >> io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with > >> NULL files. > >> > >> if (!files) > >> __io_uring_cancel_task_requests(ctx, task); > >> else > >> io_uring_cancel_files(ctx, task, files); > > > > Yeah, I think I messed up. If files == NULL, then the task is going away. > > So we should cancel all requests that match 'task', not just ones that > > match task && files. > > > > Not sure I have much more time to look into this before next week, but > > something like that. > > > > The problem case is the async worker being queued, long before the task > > is killed and the contexts go away. But from exit_files(), we're only > > concerned with canceling if we have inflight. Doesn't look right to me. > > In theory all that should be killed in io_ring_ctx_wait_and_kill(), > of course that's if the ring itself is closed. > > Guys, do you share rings between processes? Explicitly like sending > io_uring fd over a socket, or implicitly e.g. sharing fd tables > (threads), or cloning with copying fd tables (and so taking a ref > to a ring). We do not share rings between processes. Our rings are accessible from different threads (under locks), but nothing fancy. > In other words, if you kill all your io_uring applications, does it > go back to normal? I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an affected box and double check just in case. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-21 10:35 ` Dmitry Kadashev @ 2020-12-21 10:49 ` Dmitry Kadashev 2020-12-21 11:00 ` Dmitry Kadashev 1 sibling, 0 replies; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-21 10:49 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Mon, Dec 21, 2020 at 5:35 PM Dmitry Kadashev <[email protected]> wrote: > > On Sun, Dec 20, 2020 at 7:59 AM Pavel Begunkov <[email protected]> wrote: > > > > On 20/12/2020 00:25, Jens Axboe wrote: > > > On 12/19/20 4:42 PM, Pavel Begunkov wrote: > > >> On 19/12/2020 23:13, Jens Axboe wrote: > > >>> On 12/19/20 2:54 PM, Jens Axboe wrote: > > >>>> On 12/19/20 1:51 PM, Josef wrote: > > >>>>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd > > >>>>>> file descriptor. You probably don't want/mean to do that as it's > > >>>>>> pollable, I guess it's done because you just set it on all reads for the > > >>>>>> test? > > >>>>> > > >>>>> yes exactly, eventfd fd is blocking, so it actually makes no sense to > > >>>>> use IOSQE_ASYNC > > >>>> > > >>>> Right, and it's pollable too. > > >>>> > > >>>>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work > > >>>>> in my tests, thanks a lot :) > > >>>>> > > >>>>>> In any case, it should of course work. This is the leftover trace when > > >>>>>> we should be exiting, but an io-wq worker is still trying to get data > > >>>>>> from the eventfd: > > >>>>> > > >>>>> interesting, btw what kind of tool do you use for kernel debugging? > > >>>> > > >>>> Just poking at it and thinking about it, no hidden magic I'm afraid... > > >>> > > >>> Josef, can you try with this added? Looks bigger than it is, most of it > > >>> is just moving one function below another. > > >> > > >> Hmm, which kernel revision are you poking? Seems it doesn't match > > >> io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with > > >> NULL files. > > >> > > >> if (!files) > > >> __io_uring_cancel_task_requests(ctx, task); > > >> else > > >> io_uring_cancel_files(ctx, task, files); > > > > > > Yeah, I think I messed up. If files == NULL, then the task is going away. > > > So we should cancel all requests that match 'task', not just ones that > > > match task && files. > > > > > > Not sure I have much more time to look into this before next week, but > > > something like that. > > > > > > The problem case is the async worker being queued, long before the task > > > is killed and the contexts go away. But from exit_files(), we're only > > > concerned with canceling if we have inflight. Doesn't look right to me. > > > > In theory all that should be killed in io_ring_ctx_wait_and_kill(), > > of course that's if the ring itself is closed. > > > > Guys, do you share rings between processes? Explicitly like sending > > io_uring fd over a socket, or implicitly e.g. sharing fd tables > > (threads), or cloning with copying fd tables (and so taking a ref > > to a ring). > > We do not share rings between processes. Our rings are accessible from different > threads (under locks), but nothing fancy. Actually, I'm wrong about the locks part, forgot how it works. In our case it works like this: a parent thread creates a ring, and passes it to a worker thread, which does all of the work with it, no locks are involved. On (clean) termination the parent notifies the worker, waits for it to exit and then calls io_uring_queue_exit. Not sure if that counts as sharing rings between the threads or not. 
As I've mentioned in some other email, I'll try (again) to make a reproducer. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-21 10:35 ` Dmitry Kadashev 2020-12-21 10:49 ` Dmitry Kadashev @ 2020-12-21 11:00 ` Dmitry Kadashev 2020-12-21 15:36 ` Pavel Begunkov 2020-12-22 3:35 ` Pavel Begunkov 1 sibling, 2 replies; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-21 11:00 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Mon, Dec 21, 2020 at 5:35 PM Dmitry Kadashev <[email protected]> wrote: > > On Sun, Dec 20, 2020 at 7:59 AM Pavel Begunkov <[email protected]> wrote: > > > > On 20/12/2020 00:25, Jens Axboe wrote: > > > On 12/19/20 4:42 PM, Pavel Begunkov wrote: > > >> On 19/12/2020 23:13, Jens Axboe wrote: > > >>> On 12/19/20 2:54 PM, Jens Axboe wrote: > > >>>> On 12/19/20 1:51 PM, Josef wrote: > > >>>>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd > > >>>>>> file descriptor. You probably don't want/mean to do that as it's > > >>>>>> pollable, I guess it's done because you just set it on all reads for the > > >>>>>> test? > > >>>>> > > >>>>> yes exactly, eventfd fd is blocking, so it actually makes no sense to > > >>>>> use IOSQE_ASYNC > > >>>> > > >>>> Right, and it's pollable too. > > >>>> > > >>>>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work > > >>>>> in my tests, thanks a lot :) > > >>>>> > > >>>>>> In any case, it should of course work. This is the leftover trace when > > >>>>>> we should be exiting, but an io-wq worker is still trying to get data > > >>>>>> from the eventfd: > > >>>>> > > >>>>> interesting, btw what kind of tool do you use for kernel debugging? > > >>>> > > >>>> Just poking at it and thinking about it, no hidden magic I'm afraid... > > >>> > > >>> Josef, can you try with this added? Looks bigger than it is, most of it > > >>> is just moving one function below another. > > >> > > >> Hmm, which kernel revision are you poking? Seems it doesn't match > > >> io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with > > >> NULL files. > > >> > > >> if (!files) > > >> __io_uring_cancel_task_requests(ctx, task); > > >> else > > >> io_uring_cancel_files(ctx, task, files); > > > > > > Yeah, I think I messed up. If files == NULL, then the task is going away. > > > So we should cancel all requests that match 'task', not just ones that > > > match task && files. > > > > > > Not sure I have much more time to look into this before next week, but > > > something like that. > > > > > > The problem case is the async worker being queued, long before the task > > > is killed and the contexts go away. But from exit_files(), we're only > > > concerned with canceling if we have inflight. Doesn't look right to me. > > > > In theory all that should be killed in io_ring_ctx_wait_and_kill(), > > of course that's if the ring itself is closed. > > > > Guys, do you share rings between processes? Explicitly like sending > > io_uring fd over a socket, or implicitly e.g. sharing fd tables > > (threads), or cloning with copying fd tables (and so taking a ref > > to a ring). > > We do not share rings between processes. Our rings are accessible from different > threads (under locks), but nothing fancy. > > > In other words, if you kill all your io_uring applications, does it > > go back to normal? > > I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an > affected box and double check just in case. So, I've just tried stopping everything that uses io-uring. No io_wq* processes remained: $ ps ax | grep wq 9 ? 
I< 0:00 [mm_percpu_wq] 243 ? I< 0:00 [tpm_dev_wq] 246 ? I< 0:00 [devfreq_wq] 27922 pts/4 S+ 0:00 grep --colour=auto wq $ But not a single ring (with size 1024) can be created afterwards anyway. Apparently the problem netty hit and this one are different? -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-21 11:00 ` Dmitry Kadashev @ 2020-12-21 15:36 ` Pavel Begunkov 2020-12-22 3:35 ` Pavel Begunkov 1 sibling, 0 replies; 52+ messages in thread From: Pavel Begunkov @ 2020-12-21 15:36 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On 21/12/2020 11:00, Dmitry Kadashev wrote: > On Mon, Dec 21, 2020 at 5:35 PM Dmitry Kadashev <[email protected]> wrote: >> >> On Sun, Dec 20, 2020 at 7:59 AM Pavel Begunkov <[email protected]> wrote: >>> >>> On 20/12/2020 00:25, Jens Axboe wrote: >>>> On 12/19/20 4:42 PM, Pavel Begunkov wrote: >>>>> On 19/12/2020 23:13, Jens Axboe wrote: >>>>>> On 12/19/20 2:54 PM, Jens Axboe wrote: >>>>>>> On 12/19/20 1:51 PM, Josef wrote: >>>>>>>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >>>>>>>>> file descriptor. You probably don't want/mean to do that as it's >>>>>>>>> pollable, I guess it's done because you just set it on all reads for the >>>>>>>>> test? >>>>>>>> >>>>>>>> yes exactly, eventfd fd is blocking, so it actually makes no sense to >>>>>>>> use IOSQE_ASYNC >>>>>>> >>>>>>> Right, and it's pollable too. >>>>>>> >>>>>>>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work >>>>>>>> in my tests, thanks a lot :) >>>>>>>> >>>>>>>>> In any case, it should of course work. This is the leftover trace when >>>>>>>>> we should be exiting, but an io-wq worker is still trying to get data >>>>>>>>> from the eventfd: >>>>>>>> >>>>>>>> interesting, btw what kind of tool do you use for kernel debugging? >>>>>>> >>>>>>> Just poking at it and thinking about it, no hidden magic I'm afraid... >>>>>> >>>>>> Josef, can you try with this added? Looks bigger than it is, most of it >>>>>> is just moving one function below another. >>>>> >>>>> Hmm, which kernel revision are you poking? Seems it doesn't match >>>>> io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with >>>>> NULL files. >>>>> >>>>> if (!files) >>>>> __io_uring_cancel_task_requests(ctx, task); >>>>> else >>>>> io_uring_cancel_files(ctx, task, files); >>>> >>>> Yeah, I think I messed up. If files == NULL, then the task is going away. >>>> So we should cancel all requests that match 'task', not just ones that >>>> match task && files. >>>> >>>> Not sure I have much more time to look into this before next week, but >>>> something like that. >>>> >>>> The problem case is the async worker being queued, long before the task >>>> is killed and the contexts go away. But from exit_files(), we're only >>>> concerned with canceling if we have inflight. Doesn't look right to me. >>> >>> In theory all that should be killed in io_ring_ctx_wait_and_kill(), >>> of course that's if the ring itself is closed. >>> >>> Guys, do you share rings between processes? Explicitly like sending >>> io_uring fd over a socket, or implicitly e.g. sharing fd tables >>> (threads), or cloning with copying fd tables (and so taking a ref >>> to a ring). >> >> We do not share rings between processes. Our rings are accessible from different >> threads (under locks), but nothing fancy. >> >>> In other words, if you kill all your io_uring applications, does it >>> go back to normal? >> >> I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an >> affected box and double check just in case. > > So, I've just tried stopping everything that uses io-uring. No io_wq* processes > remained: > > $ ps ax | grep wq > 9 ? I< 0:00 [mm_percpu_wq] > 243 ? 
I< 0:00 [tpm_dev_wq] > 246 ? I< 0:00 [devfreq_wq] > 27922 pts/4 S+ 0:00 grep --colour=auto wq > $ > > But not a single ring (with size 1024) can be created afterwards anyway. > > Apparently the problem netty hit and this one are different? Yep, looks like it -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-21 11:00 ` Dmitry Kadashev 2020-12-21 15:36 ` Pavel Begunkov @ 2020-12-22 3:35 ` Pavel Begunkov 2020-12-22 4:07 ` Pavel Begunkov 1 sibling, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-22 3:35 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On 21/12/2020 11:00, Dmitry Kadashev wrote: [snip] >> We do not share rings between processes. Our rings are accessible from different >> threads (under locks), but nothing fancy. >> >>> In other words, if you kill all your io_uring applications, does it >>> go back to normal? >> >> I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an >> affected box and double check just in case. I can't spot any misaccounting, but I wonder if it can be that your memory is getting fragmented enough to be unable make an allocation of 16 __contiguous__ pages, i.e. sizeof(sqe) * 1024 That's how it's allocated internally: static void *io_mem_alloc(size_t size) { gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | __GFP_NORETRY; return (void *) __get_free_pages(gfp_flags, get_order(size)); } What about smaller rings? Can you check io_uring of what SQ size it can allocate? That can be a different program, e.g. modify a bit liburing/test/nop. Also, can you allocate it if you switch a user (preferably to non-root) after it happens? > > So, I've just tried stopping everything that uses io-uring. No io_wq* processes > remained: > > $ ps ax | grep wq > 9 ? I< 0:00 [mm_percpu_wq] > 243 ? I< 0:00 [tpm_dev_wq] > 246 ? I< 0:00 [devfreq_wq] > 27922 pts/4 S+ 0:00 grep --colour=auto wq > $ > > But not a single ring (with size 1024) can be created afterwards anyway. > > Apparently the problem netty hit and this one are different? -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-22 3:35 ` Pavel Begunkov @ 2020-12-22 4:07 ` Pavel Begunkov 2020-12-22 11:04 ` Dmitry Kadashev 0 siblings, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-22 4:07 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On 22/12/2020 03:35, Pavel Begunkov wrote: > On 21/12/2020 11:00, Dmitry Kadashev wrote: > [snip] >>> We do not share rings between processes. Our rings are accessible from different >>> threads (under locks), but nothing fancy. >>> >>>> In other words, if you kill all your io_uring applications, does it >>>> go back to normal? >>> >>> I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an >>> affected box and double check just in case. > > I can't spot any misaccounting, but I wonder if it can be that your memory is > getting fragmented enough to be unable make an allocation of 16 __contiguous__ > pages, i.e. sizeof(sqe) * 1024 > > That's how it's allocated internally: > > static void *io_mem_alloc(size_t size) > { > gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | > __GFP_NORETRY; > > return (void *) __get_free_pages(gfp_flags, get_order(size)); > } > > What about smaller rings? Can you check io_uring of what SQ size it can allocate? > That can be a different program, e.g. modify a bit liburing/test/nop. Even better to allocate N smaller rings, where N = 1024 / SQ_size static int try_size(int sq_size) { int ret = 0, i, n = 1024 / sq_size; static struct io_uring rings[128]; for (i = 0; i < n; ++i) { if (io_uring_queue_init(sq_size, &rings[i], 0) < 0) { ret = -1; break; } } for (i -= 1; i >= 0; i--) io_uring_queue_exit(&rings[i]); return ret; } int main() { int size; for (size = 1024; size >= 2; size /= 2) { if (!try_size(size)) { printf("max size %i\n", size); return 0; } } printf("can't allocate %i\n", size); return 0; } > Also, can you allocate it if you switch a user (preferably to non-root) after it > happens? > >> >> So, I've just tried stopping everything that uses io-uring. No io_wq* processes >> remained: >> >> $ ps ax | grep wq >> 9 ? I< 0:00 [mm_percpu_wq] >> 243 ? I< 0:00 [tpm_dev_wq] >> 246 ? I< 0:00 [devfreq_wq] >> 27922 pts/4 S+ 0:00 grep --colour=auto wq >> $ >> >> But not a single ring (with size 1024) can be created afterwards anyway. >> >> Apparently the problem netty hit and this one are different? > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-22 4:07 ` Pavel Begunkov @ 2020-12-22 11:04 ` Dmitry Kadashev 2020-12-22 11:06 ` Dmitry Kadashev 2020-12-22 16:33 ` Pavel Begunkov 0 siblings, 2 replies; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-22 11:04 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Tue, Dec 22, 2020 at 11:11 AM Pavel Begunkov <[email protected]> wrote: > > On 22/12/2020 03:35, Pavel Begunkov wrote: > > On 21/12/2020 11:00, Dmitry Kadashev wrote: > > [snip] > >>> We do not share rings between processes. Our rings are accessible from different > >>> threads (under locks), but nothing fancy. > >>> > >>>> In other words, if you kill all your io_uring applications, does it > >>>> go back to normal? > >>> > >>> I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an > >>> affected box and double check just in case. > > > > I can't spot any misaccounting, but I wonder if it can be that your memory is > > getting fragmented enough to be unable make an allocation of 16 __contiguous__ > > pages, i.e. sizeof(sqe) * 1024 > > > > That's how it's allocated internally: > > > > static void *io_mem_alloc(size_t size) > > { > > gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | > > __GFP_NORETRY; > > > > return (void *) __get_free_pages(gfp_flags, get_order(size)); > > } > > > > What about smaller rings? Can you check io_uring of what SQ size it can allocate? > > That can be a different program, e.g. modify a bit liburing/test/nop. > > Even better to allocate N smaller rings, where N = 1024 / SQ_size > > static int try_size(int sq_size) > { > int ret = 0, i, n = 1024 / sq_size; > static struct io_uring rings[128]; > > for (i = 0; i < n; ++i) { > if (io_uring_queue_init(sq_size, &rings[i], 0) < 0) { > ret = -1; > break; > } > } > for (i -= 1; i >= 0; i--) > io_uring_queue_exit(&rings[i]); > return ret; > } > > int main() > { > int size; > > for (size = 1024; size >= 2; size /= 2) { > if (!try_size(size)) { > printf("max size %i\n", size); > return 0; > } > } > > printf("can't allocate %i\n", size); > return 0; > } Unfortunately I've rebooted the box I've used for tests yesterday, so I can't try this there. Also I was not able to come up with an isolated reproducer for this yet. The good news is I've found a relatively easy way to provoke this on a test VM using our software. Our app runs with "admin" user perms (plus some capabilities), it bumps RLIMIT_MEMLOCK to infinity on start. I've also created an user called 'ioutest' to run the check for ring sizes using a different user. I've modified the test program slightly, to show the number of rings successfully created on each iteration and the actual error message (to debug a problem I was having with it, but I've kept this after that). 
Here is the output: # sudo -u admin bash -c 'ulimit -a' | grep locked max locked memory (kbytes, -l) 1024 # sudo -u ioutest bash -c 'ulimit -a' | grep locked max locked memory (kbytes, -l) 1024 # sudo -u admin ./iou-test1 Failed after 0 rings with 1024 size: Cannot allocate memory Failed after 0 rings with 512 size: Cannot allocate memory Failed after 0 rings with 256 size: Cannot allocate memory Failed after 0 rings with 128 size: Cannot allocate memory Failed after 0 rings with 64 size: Cannot allocate memory Failed after 0 rings with 32 size: Cannot allocate memory Failed after 0 rings with 16 size: Cannot allocate memory Failed after 0 rings with 8 size: Cannot allocate memory Failed after 0 rings with 4 size: Cannot allocate memory Failed after 0 rings with 2 size: Cannot allocate memory can't allocate 1 # sudo -u ioutest ./iou-test1 max size 1024 # ps ax | grep wq 8 ? I< 0:00 [mm_percpu_wq] 121 ? I< 0:00 [tpm_dev_wq] 124 ? I< 0:00 [devfreq_wq] 20593 pts/1 S+ 0:00 grep --color=auto wq -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
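The modified iou-test1 that produced the output above is not posted in the thread; a sketch of what an equivalent probe might look like, based on Pavel's version plus the described changes (report how many rings were created before the failure and print the errno string), assuming liburing and built with something like "gcc -o iou-test1 iou-test1.c -luring", purely illustrative:

#include <stdio.h>
#include <string.h>
#include <liburing.h>

static int try_size(int sq_size)
{
        /* 512 slots covers the worst case of 1024 / 2 rings */
        static struct io_uring rings[512];
        int ret = 0, i, n = 1024 / sq_size;

        for (i = 0; i < n; ++i) {
                ret = io_uring_queue_init(sq_size, &rings[i], 0);
                if (ret < 0) {
                        printf("Failed after %d rings with %d size: %s\n",
                               i, sq_size, strerror(-ret));
                        break;
                }
        }
        for (i -= 1; i >= 0; i--)
                io_uring_queue_exit(&rings[i]);
        return ret;
}

int main(void)
{
        int size;

        for (size = 1024; size >= 2; size /= 2) {
                if (try_size(size) >= 0) {
                        printf("max size %d\n", size);
                        return 0;
                }
        }
        printf("can't allocate %d\n", size);
        return 0;
}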
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-22 11:04 ` Dmitry Kadashev @ 2020-12-22 11:06 ` Dmitry Kadashev 2020-12-22 13:13 ` Dmitry Kadashev 2020-12-22 16:33 ` Pavel Begunkov 1 sibling, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-22 11:06 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Tue, Dec 22, 2020 at 6:04 PM Dmitry Kadashev <[email protected]> wrote: > > On Tue, Dec 22, 2020 at 11:11 AM Pavel Begunkov <[email protected]> wrote: > > > > On 22/12/2020 03:35, Pavel Begunkov wrote: > > > On 21/12/2020 11:00, Dmitry Kadashev wrote: > > > [snip] > > >>> We do not share rings between processes. Our rings are accessible from different > > >>> threads (under locks), but nothing fancy. > > >>> > > >>>> In other words, if you kill all your io_uring applications, does it > > >>>> go back to normal? > > >>> > > >>> I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an > > >>> affected box and double check just in case. > > > > > > I can't spot any misaccounting, but I wonder if it can be that your memory is > > > getting fragmented enough to be unable make an allocation of 16 __contiguous__ > > > pages, i.e. sizeof(sqe) * 1024 > > > > > > That's how it's allocated internally: > > > > > > static void *io_mem_alloc(size_t size) > > > { > > > gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | > > > __GFP_NORETRY; > > > > > > return (void *) __get_free_pages(gfp_flags, get_order(size)); > > > } > > > > > > What about smaller rings? Can you check io_uring of what SQ size it can allocate? > > > That can be a different program, e.g. modify a bit liburing/test/nop. > > > > Even better to allocate N smaller rings, where N = 1024 / SQ_size > > > > static int try_size(int sq_size) > > { > > int ret = 0, i, n = 1024 / sq_size; > > static struct io_uring rings[128]; > > > > for (i = 0; i < n; ++i) { > > if (io_uring_queue_init(sq_size, &rings[i], 0) < 0) { > > ret = -1; > > break; > > } > > } > > for (i -= 1; i >= 0; i--) > > io_uring_queue_exit(&rings[i]); > > return ret; > > } > > > > int main() > > { > > int size; > > > > for (size = 1024; size >= 2; size /= 2) { > > if (!try_size(size)) { > > printf("max size %i\n", size); > > return 0; > > } > > } > > > > printf("can't allocate %i\n", size); > > return 0; > > } > > Unfortunately I've rebooted the box I've used for tests yesterday, so I can't > try this there. Also I was not able to come up with an isolated reproducer for > this yet. > > The good news is I've found a relatively easy way to provoke this on a test VM > using our software. Our app runs with "admin" user perms (plus some > capabilities), it bumps RLIMIT_MEMLOCK to infinity on start. I've also created > an user called 'ioutest' to run the check for ring sizes using a different user. > > I've modified the test program slightly, to show the number of rings > successfully > created on each iteration and the actual error message (to debug a problem I was > having with it, but I've kept this after that). 
Here is the output: > > # sudo -u admin bash -c 'ulimit -a' | grep locked > max locked memory (kbytes, -l) 1024 > > # sudo -u ioutest bash -c 'ulimit -a' | grep locked > max locked memory (kbytes, -l) 1024 > > # sudo -u admin ./iou-test1 > Failed after 0 rings with 1024 size: Cannot allocate memory > Failed after 0 rings with 512 size: Cannot allocate memory > Failed after 0 rings with 256 size: Cannot allocate memory > Failed after 0 rings with 128 size: Cannot allocate memory > Failed after 0 rings with 64 size: Cannot allocate memory > Failed after 0 rings with 32 size: Cannot allocate memory > Failed after 0 rings with 16 size: Cannot allocate memory > Failed after 0 rings with 8 size: Cannot allocate memory > Failed after 0 rings with 4 size: Cannot allocate memory > Failed after 0 rings with 2 size: Cannot allocate memory > can't allocate 1 > > # sudo -u ioutest ./iou-test1 > max size 1024 > > # ps ax | grep wq > 8 ? I< 0:00 [mm_percpu_wq] > 121 ? I< 0:00 [tpm_dev_wq] > 124 ? I< 0:00 [devfreq_wq] > 20593 pts/1 S+ 0:00 grep --color=auto wq This was on kernel 5.6.7, I'm going to try this on 5.10.1 now. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-22 11:06 ` Dmitry Kadashev @ 2020-12-22 13:13 ` Dmitry Kadashev 0 siblings, 0 replies; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-22 13:13 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Tue, Dec 22, 2020 at 6:06 PM Dmitry Kadashev <[email protected]> wrote: > > On Tue, Dec 22, 2020 at 6:04 PM Dmitry Kadashev <[email protected]> wrote: > > > > On Tue, Dec 22, 2020 at 11:11 AM Pavel Begunkov <[email protected]> wrote: > > > > > > On 22/12/2020 03:35, Pavel Begunkov wrote: > > > > On 21/12/2020 11:00, Dmitry Kadashev wrote: > > > > [snip] > > > >>> We do not share rings between processes. Our rings are accessible from different > > > >>> threads (under locks), but nothing fancy. > > > >>> > > > >>>> In other words, if you kill all your io_uring applications, does it > > > >>>> go back to normal? > > > >>> > > > >>> I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an > > > >>> affected box and double check just in case. > > > > > > > > I can't spot any misaccounting, but I wonder if it can be that your memory is > > > > getting fragmented enough to be unable make an allocation of 16 __contiguous__ > > > > pages, i.e. sizeof(sqe) * 1024 > > > > > > > > That's how it's allocated internally: > > > > > > > > static void *io_mem_alloc(size_t size) > > > > { > > > > gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | > > > > __GFP_NORETRY; > > > > > > > > return (void *) __get_free_pages(gfp_flags, get_order(size)); > > > > } > > > > > > > > What about smaller rings? Can you check io_uring of what SQ size it can allocate? > > > > That can be a different program, e.g. modify a bit liburing/test/nop. > > > > > > Even better to allocate N smaller rings, where N = 1024 / SQ_size > > > > > > static int try_size(int sq_size) > > > { > > > int ret = 0, i, n = 1024 / sq_size; > > > static struct io_uring rings[128]; > > > > > > for (i = 0; i < n; ++i) { > > > if (io_uring_queue_init(sq_size, &rings[i], 0) < 0) { > > > ret = -1; > > > break; > > > } > > > } > > > for (i -= 1; i >= 0; i--) > > > io_uring_queue_exit(&rings[i]); > > > return ret; > > > } > > > > > > int main() > > > { > > > int size; > > > > > > for (size = 1024; size >= 2; size /= 2) { > > > if (!try_size(size)) { > > > printf("max size %i\n", size); > > > return 0; > > > } > > > } > > > > > > printf("can't allocate %i\n", size); > > > return 0; > > > } > > > > Unfortunately I've rebooted the box I've used for tests yesterday, so I can't > > try this there. Also I was not able to come up with an isolated reproducer for > > this yet. > > > > The good news is I've found a relatively easy way to provoke this on a test VM > > using our software. Our app runs with "admin" user perms (plus some > > capabilities), it bumps RLIMIT_MEMLOCK to infinity on start. I've also created > > an user called 'ioutest' to run the check for ring sizes using a different user. > > > > I've modified the test program slightly, to show the number of rings > > successfully > > created on each iteration and the actual error message (to debug a problem I was > > having with it, but I've kept this after that). 
Here is the output: > > > > # sudo -u admin bash -c 'ulimit -a' | grep locked > > max locked memory (kbytes, -l) 1024 > > > > # sudo -u ioutest bash -c 'ulimit -a' | grep locked > > max locked memory (kbytes, -l) 1024 > > > > # sudo -u admin ./iou-test1 > > Failed after 0 rings with 1024 size: Cannot allocate memory > > Failed after 0 rings with 512 size: Cannot allocate memory > > Failed after 0 rings with 256 size: Cannot allocate memory > > Failed after 0 rings with 128 size: Cannot allocate memory > > Failed after 0 rings with 64 size: Cannot allocate memory > > Failed after 0 rings with 32 size: Cannot allocate memory > > Failed after 0 rings with 16 size: Cannot allocate memory > > Failed after 0 rings with 8 size: Cannot allocate memory > > Failed after 0 rings with 4 size: Cannot allocate memory > > Failed after 0 rings with 2 size: Cannot allocate memory > > can't allocate 1 > > > > # sudo -u ioutest ./iou-test1 > > max size 1024 > > > > # ps ax | grep wq > > 8 ? I< 0:00 [mm_percpu_wq] > > 121 ? I< 0:00 [tpm_dev_wq] > > 124 ? I< 0:00 [devfreq_wq] > > 20593 pts/1 S+ 0:00 grep --color=auto wq > > This was on kernel 5.6.7, I'm going to try this on 5.10.1 now. Curious. It seems to be much harder to reproduce on 5.9 and 5.10. I'm 100% sure it still happens on 5.9 though, since it did happen on production quite a few times. But the way I've used to reproduce it on 5.6 worked two times there, and quite quickly. And with 5.9 and 5.10 the same approach does not seem to be working. I'll give it some more time and also will keep trying to come up with a synthetic reproducer. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-22 11:04 ` Dmitry Kadashev 2020-12-22 11:06 ` Dmitry Kadashev @ 2020-12-22 16:33 ` Pavel Begunkov 2020-12-23 8:39 ` Dmitry Kadashev 1 sibling, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-22 16:33 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On 22/12/2020 11:04, Dmitry Kadashev wrote: > On Tue, Dec 22, 2020 at 11:11 AM Pavel Begunkov <[email protected]> wrote: [...] >>> What about smaller rings? Can you check io_uring of what SQ size it can allocate? >>> That can be a different program, e.g. modify a bit liburing/test/nop. > Unfortunately I've rebooted the box I've used for tests yesterday, so I can't > try this there. Also I was not able to come up with an isolated reproducer for > this yet. > > The good news is I've found a relatively easy way to provoke this on a test VM > using our software. Our app runs with "admin" user perms (plus some > capabilities), it bumps RLIMIT_MEMLOCK to infinity on start. I've also created > an user called 'ioutest' to run the check for ring sizes using a different user. > > I've modified the test program slightly, to show the number of rings > successfully > created on each iteration and the actual error message (to debug a problem I was > having with it, but I've kept this after that). Here is the output: > > # sudo -u admin bash -c 'ulimit -a' | grep locked > max locked memory (kbytes, -l) 1024 > > # sudo -u ioutest bash -c 'ulimit -a' | grep locked > max locked memory (kbytes, -l) 1024 > > # sudo -u admin ./iou-test1 > Failed after 0 rings with 1024 size: Cannot allocate memory > Failed after 0 rings with 512 size: Cannot allocate memory > Failed after 0 rings with 256 size: Cannot allocate memory > Failed after 0 rings with 128 size: Cannot allocate memory > Failed after 0 rings with 64 size: Cannot allocate memory > Failed after 0 rings with 32 size: Cannot allocate memory > Failed after 0 rings with 16 size: Cannot allocate memory > Failed after 0 rings with 8 size: Cannot allocate memory > Failed after 0 rings with 4 size: Cannot allocate memory > Failed after 0 rings with 2 size: Cannot allocate memory > can't allocate 1 > > # sudo -u ioutest ./iou-test1 > max size 1024 Then we screw that specific user. Interestingly, if it has CAP_IPC_LOCK capability we don't even account locked memory. btw, do you use registered buffers? > > # ps ax | grep wq > 8 ? I< 0:00 [mm_percpu_wq] > 121 ? I< 0:00 [tpm_dev_wq] > 124 ? I< 0:00 [devfreq_wq] > 20593 pts/1 S+ 0:00 grep --color=auto wq > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-22 16:33 ` Pavel Begunkov @ 2020-12-23 8:39 ` Dmitry Kadashev 2020-12-23 9:38 ` Dmitry Kadashev 0 siblings, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-23 8:39 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Tue, Dec 22, 2020 at 11:37 PM Pavel Begunkov <[email protected]> wrote: > > On 22/12/2020 11:04, Dmitry Kadashev wrote: > > On Tue, Dec 22, 2020 at 11:11 AM Pavel Begunkov <[email protected]> wrote: > [...] > >>> What about smaller rings? Can you check io_uring of what SQ size it can allocate? > >>> That can be a different program, e.g. modify a bit liburing/test/nop. > > Unfortunately I've rebooted the box I've used for tests yesterday, so I can't > > try this there. Also I was not able to come up with an isolated reproducer for > > this yet. > > > > The good news is I've found a relatively easy way to provoke this on a test VM > > using our software. Our app runs with "admin" user perms (plus some > > capabilities), it bumps RLIMIT_MEMLOCK to infinity on start. I've also created > > an user called 'ioutest' to run the check for ring sizes using a different user. > > > > I've modified the test program slightly, to show the number of rings > > successfully > > created on each iteration and the actual error message (to debug a problem I was > > having with it, but I've kept this after that). Here is the output: > > > > # sudo -u admin bash -c 'ulimit -a' | grep locked > > max locked memory (kbytes, -l) 1024 > > > > # sudo -u ioutest bash -c 'ulimit -a' | grep locked > > max locked memory (kbytes, -l) 1024 > > > > # sudo -u admin ./iou-test1 > > Failed after 0 rings with 1024 size: Cannot allocate memory > > Failed after 0 rings with 512 size: Cannot allocate memory > > Failed after 0 rings with 256 size: Cannot allocate memory > > Failed after 0 rings with 128 size: Cannot allocate memory > > Failed after 0 rings with 64 size: Cannot allocate memory > > Failed after 0 rings with 32 size: Cannot allocate memory > > Failed after 0 rings with 16 size: Cannot allocate memory > > Failed after 0 rings with 8 size: Cannot allocate memory > > Failed after 0 rings with 4 size: Cannot allocate memory > > Failed after 0 rings with 2 size: Cannot allocate memory > > can't allocate 1 > > > > # sudo -u ioutest ./iou-test1 > > max size 1024 > > Then we screw that specific user. Interestingly, if it has CAP_IPC_LOCK > capability we don't even account locked memory. We do have some capabilities, but not CAP_IPC_LOCK. Ours are: CAP_NET_ADMIN, CAP_NET_BIND_SERVICE, CAP_SYS_RESOURCE, CAP_KILL, CAP_DAC_READ_SEARCH. The latter was necessary for integration with some third-party thing that we do not really use anymore, so we can try building without it, but it'd require some time, mostly because I'm not sure how quickly I'd be able to provoke the issue. > btw, do you use registered buffers? No, we do not use neither registered buffers nor registered files (nor anything else). Also, I just tried the test program on a real box (this time one instance of our program is still running - can repeat the check with it dead, but I expect the results to be pretty much the same, at least after a few more restarts). This box runs 5.9.5. 
# sudo -u admin bash -c 'ulimit -l' 1024 # sudo -u admin ./iou-test1 Failed after 0 rings with 1024 size: Cannot allocate memory Failed after 0 rings with 512 size: Cannot allocate memory Failed after 0 rings with 256 size: Cannot allocate memory Failed after 0 rings with 128 size: Cannot allocate memory Failed after 0 rings with 64 size: Cannot allocate memory Failed after 0 rings with 32 size: Cannot allocate memory Failed after 0 rings with 16 size: Cannot allocate memory Failed after 0 rings with 8 size: Cannot allocate memory Failed after 0 rings with 4 size: Cannot allocate memory Failed after 0 rings with 2 size: Cannot allocate memory can't allocate 1 # sudo -u dmitry bash -c 'ulimit -l' 1024 # sudo -u dmitry ./iou-test1 max size 1024 -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-23 8:39 ` Dmitry Kadashev @ 2020-12-23 9:38 ` Dmitry Kadashev 2020-12-23 11:48 ` Dmitry Kadashev 0 siblings, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-23 9:38 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Wed, Dec 23, 2020 at 3:39 PM Dmitry Kadashev <[email protected]> wrote: > > On Tue, Dec 22, 2020 at 11:37 PM Pavel Begunkov <[email protected]> wrote: > > > > On 22/12/2020 11:04, Dmitry Kadashev wrote: > > > On Tue, Dec 22, 2020 at 11:11 AM Pavel Begunkov <[email protected]> wrote: > > [...] > > >>> What about smaller rings? Can you check io_uring of what SQ size it can allocate? > > >>> That can be a different program, e.g. modify a bit liburing/test/nop. > > > Unfortunately I've rebooted the box I've used for tests yesterday, so I can't > > > try this there. Also I was not able to come up with an isolated reproducer for > > > this yet. > > > > > > The good news is I've found a relatively easy way to provoke this on a test VM > > > using our software. Our app runs with "admin" user perms (plus some > > > capabilities), it bumps RLIMIT_MEMLOCK to infinity on start. I've also created > > > an user called 'ioutest' to run the check for ring sizes using a different user. > > > > > > I've modified the test program slightly, to show the number of rings > > > successfully > > > created on each iteration and the actual error message (to debug a problem I was > > > having with it, but I've kept this after that). Here is the output: > > > > > > # sudo -u admin bash -c 'ulimit -a' | grep locked > > > max locked memory (kbytes, -l) 1024 > > > > > > # sudo -u ioutest bash -c 'ulimit -a' | grep locked > > > max locked memory (kbytes, -l) 1024 > > > > > > # sudo -u admin ./iou-test1 > > > Failed after 0 rings with 1024 size: Cannot allocate memory > > > Failed after 0 rings with 512 size: Cannot allocate memory > > > Failed after 0 rings with 256 size: Cannot allocate memory > > > Failed after 0 rings with 128 size: Cannot allocate memory > > > Failed after 0 rings with 64 size: Cannot allocate memory > > > Failed after 0 rings with 32 size: Cannot allocate memory > > > Failed after 0 rings with 16 size: Cannot allocate memory > > > Failed after 0 rings with 8 size: Cannot allocate memory > > > Failed after 0 rings with 4 size: Cannot allocate memory > > > Failed after 0 rings with 2 size: Cannot allocate memory > > > can't allocate 1 > > > > > > # sudo -u ioutest ./iou-test1 > > > max size 1024 > > > > Then we screw that specific user. Interestingly, if it has CAP_IPC_LOCK > > capability we don't even account locked memory. > > We do have some capabilities, but not CAP_IPC_LOCK. Ours are: > > CAP_NET_ADMIN, CAP_NET_BIND_SERVICE, CAP_SYS_RESOURCE, CAP_KILL, > CAP_DAC_READ_SEARCH. > > The latter was necessary for integration with some third-party thing that we do > not really use anymore, so we can try building without it, but it'd require some > time, mostly because I'm not sure how quickly I'd be able to provoke the issue. > > > btw, do you use registered buffers? > > No, we do not use neither registered buffers nor registered files (nor anything > else). > > Also, I just tried the test program on a real box (this time one instance of our > program is still running - can repeat the check with it dead, but I expect the > results to be pretty much the same, at least after a few more restarts). This > box runs 5.9.5. 
> > # sudo -u admin bash -c 'ulimit -l' > 1024 > > # sudo -u admin ./iou-test1 > Failed after 0 rings with 1024 size: Cannot allocate memory > Failed after 0 rings with 512 size: Cannot allocate memory > Failed after 0 rings with 256 size: Cannot allocate memory > Failed after 0 rings with 128 size: Cannot allocate memory > Failed after 0 rings with 64 size: Cannot allocate memory > Failed after 0 rings with 32 size: Cannot allocate memory > Failed after 0 rings with 16 size: Cannot allocate memory > Failed after 0 rings with 8 size: Cannot allocate memory > Failed after 0 rings with 4 size: Cannot allocate memory > Failed after 0 rings with 2 size: Cannot allocate memory > can't allocate 1 > > # sudo -u dmitry bash -c 'ulimit -l' > 1024 > > # sudo -u dmitry ./iou-test1 > max size 1024 Please ignore the results from the real box above (5.9.5). The memlock limit interfered with this, since our app was running in the background and it had a few rings running (most failed to be created, but not all). I'll try to make it fully stuck and repeat the test with the app dead. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-23 9:38 ` Dmitry Kadashev @ 2020-12-23 11:48 ` Dmitry Kadashev 2020-12-23 12:27 ` Pavel Begunkov 0 siblings, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-23 11:48 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Wed, Dec 23, 2020 at 4:38 PM Dmitry Kadashev <[email protected]> wrote: > > On Wed, Dec 23, 2020 at 3:39 PM Dmitry Kadashev <[email protected]> wrote: > > > > On Tue, Dec 22, 2020 at 11:37 PM Pavel Begunkov <[email protected]> wrote: > > > > > > On 22/12/2020 11:04, Dmitry Kadashev wrote: > > > > On Tue, Dec 22, 2020 at 11:11 AM Pavel Begunkov <[email protected]> wrote: > > > [...] > > > >>> What about smaller rings? Can you check io_uring of what SQ size it can allocate? > > > >>> That can be a different program, e.g. modify a bit liburing/test/nop. > > > > Unfortunately I've rebooted the box I've used for tests yesterday, so I can't > > > > try this there. Also I was not able to come up with an isolated reproducer for > > > > this yet. > > > > > > > > The good news is I've found a relatively easy way to provoke this on a test VM > > > > using our software. Our app runs with "admin" user perms (plus some > > > > capabilities), it bumps RLIMIT_MEMLOCK to infinity on start. I've also created > > > > an user called 'ioutest' to run the check for ring sizes using a different user. > > > > > > > > I've modified the test program slightly, to show the number of rings > > > > successfully > > > > created on each iteration and the actual error message (to debug a problem I was > > > > having with it, but I've kept this after that). Here is the output: > > > > > > > > # sudo -u admin bash -c 'ulimit -a' | grep locked > > > > max locked memory (kbytes, -l) 1024 > > > > > > > > # sudo -u ioutest bash -c 'ulimit -a' | grep locked > > > > max locked memory (kbytes, -l) 1024 > > > > > > > > # sudo -u admin ./iou-test1 > > > > Failed after 0 rings with 1024 size: Cannot allocate memory > > > > Failed after 0 rings with 512 size: Cannot allocate memory > > > > Failed after 0 rings with 256 size: Cannot allocate memory > > > > Failed after 0 rings with 128 size: Cannot allocate memory > > > > Failed after 0 rings with 64 size: Cannot allocate memory > > > > Failed after 0 rings with 32 size: Cannot allocate memory > > > > Failed after 0 rings with 16 size: Cannot allocate memory > > > > Failed after 0 rings with 8 size: Cannot allocate memory > > > > Failed after 0 rings with 4 size: Cannot allocate memory > > > > Failed after 0 rings with 2 size: Cannot allocate memory > > > > can't allocate 1 > > > > > > > > # sudo -u ioutest ./iou-test1 > > > > max size 1024 > > > > > > Then we screw that specific user. Interestingly, if it has CAP_IPC_LOCK > > > capability we don't even account locked memory. > > > > We do have some capabilities, but not CAP_IPC_LOCK. Ours are: > > > > CAP_NET_ADMIN, CAP_NET_BIND_SERVICE, CAP_SYS_RESOURCE, CAP_KILL, > > CAP_DAC_READ_SEARCH. > > > > The latter was necessary for integration with some third-party thing that we do > > not really use anymore, so we can try building without it, but it'd require some > > time, mostly because I'm not sure how quickly I'd be able to provoke the issue. > > > > > btw, do you use registered buffers? > > > > No, we do not use neither registered buffers nor registered files (nor anything > > else). 
> > > > Also, I just tried the test program on a real box (this time one instance of our > > program is still running - can repeat the check with it dead, but I expect the > > results to be pretty much the same, at least after a few more restarts). This > > box runs 5.9.5. > > > > # sudo -u admin bash -c 'ulimit -l' > > 1024 > > > > # sudo -u admin ./iou-test1 > > Failed after 0 rings with 1024 size: Cannot allocate memory > > Failed after 0 rings with 512 size: Cannot allocate memory > > Failed after 0 rings with 256 size: Cannot allocate memory > > Failed after 0 rings with 128 size: Cannot allocate memory > > Failed after 0 rings with 64 size: Cannot allocate memory > > Failed after 0 rings with 32 size: Cannot allocate memory > > Failed after 0 rings with 16 size: Cannot allocate memory > > Failed after 0 rings with 8 size: Cannot allocate memory > > Failed after 0 rings with 4 size: Cannot allocate memory > > Failed after 0 rings with 2 size: Cannot allocate memory > > can't allocate 1 > > > > # sudo -u dmitry bash -c 'ulimit -l' > > 1024 > > > > # sudo -u dmitry ./iou-test1 > > max size 1024 > > Please ignore the results from the real box above (5.9.5). The memlock limit > interfered with this, since our app was running in the background and it had a > few rings running (most failed to be created, but not all). I'll try to make it > fully stuck and repeat the test with the app dead. I've experimented with the 5.9 live boxes that were showing signs of the problem a bit more, and I'm not entirely sure they get stuck until reboot anymore. I'm pretty sure it is the case with 5.6, but probably a bug was fixed since then - the fact that 5.8 in particular had quite a few fixes that seemed relevant is the reason we've tried 5.9 in the first place. And on 5.9 we might be seeing fragmentation issues indeed. I shouldn't have been mixing my kernel versions :) Also, I did not realize a ring of size=1024 requires 16 contiguous pages. We will experiment and observe a bit more, and meanwhile let's consider the case closed. If the issue surfaces again I'll update this thread. Thanks a *lot* Pavel for helping to debug this issue. And sorry for the false alarm / noise everyone. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
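On the fragmentation theory: whether a box still has order-4 (16 contiguous pages) blocks free is visible in /proc/buddyinfo, where each column is the count of free blocks of 2^order pages per zone. A zero or near-zero count at order 4 and above in the Normal zone would line up with the ENOMEM seen here. A rough sketch that sums those columns (plain "cat /proc/buddyinfo" shows the same data; this is just an illustration, not something used in the thread):

#include <stdio.h>

int main(void)
{
        char node[16], zone[16];
        FILE *f = fopen("/proc/buddyinfo", "r");

        if (!f) {
                perror("/proc/buddyinfo");
                return 1;
        }
        /* Each line: "Node N, zone NAME c0 c1 c2 ..." where cK is the number
         * of free blocks of 2^K contiguous pages in that zone. */
        while (fscanf(f, " Node %15[^,], zone %15s", node, zone) == 2) {
                long count, order4_plus = 0;
                int order = 0;

                while (fscanf(f, "%ld", &count) == 1) {
                        if (order >= 4) /* blocks big enough for a 1024-entry SQ */
                                order4_plus += count;
                        order++;
                }
                printf("node %s, zone %-8s: %ld free blocks of order >= 4\n",
                       node, zone, order4_plus);
        }
        fclose(f);
        return 0;
}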
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-23 11:48 ` Dmitry Kadashev @ 2020-12-23 12:27 ` Pavel Begunkov 0 siblings, 0 replies; 52+ messages in thread From: Pavel Begunkov @ 2020-12-23 12:27 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On 23/12/2020 11:48, Dmitry Kadashev wrote: > On Wed, Dec 23, 2020 at 4:38 PM Dmitry Kadashev <[email protected]> wrote: >> Please ignore the results from the real box above (5.9.5). The memlock limit >> interfered with this, since our app was running in the background and it had a >> few rings running (most failed to be created, but not all). I'll try to make it >> fully stuck and repeat the test with the app dead. > > I've experimented with the 5.9 live boxes that were showing signs of the problem > a bit more, and I'm not entirely sure they get stuck until reboot anymore. > > I'm pretty sure it is the case with 5.6, but probably a bug was fixed since > then - the fact that 5.8 in particular had quite a few fixes that seemed > relevant is the reason we've tried 5.9 in the first place. > > And on 5.9 we might be seeing fragmentation issues indeed. I shouldn't have been > mixing my kernel versions :) Also, I did not realize a ring of size=1024 > requires 16 contiguous pages. We will experiment and observe a bit more, and > meanwhile let's consider the case closed. If the issue surfaces again I'll > update this thread. If fragmentation is to blame, it's still a problem. Let us know if you find out anything. And thanks for keeping debugging -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 0:25 ` Jens Axboe 2020-12-20 0:55 ` Pavel Begunkov @ 2020-12-20 1:57 ` Pavel Begunkov 2020-12-20 7:13 ` Josef 1 sibling, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-20 1:57 UTC (permalink / raw) To: Jens Axboe, Josef; +Cc: Norman Maurer, Dmitry Kadashev, io-uring On 20/12/2020 00:25, Jens Axboe wrote: > On 12/19/20 4:42 PM, Pavel Begunkov wrote: >> On 19/12/2020 23:13, Jens Axboe wrote: >>> On 12/19/20 2:54 PM, Jens Axboe wrote: >>>> On 12/19/20 1:51 PM, Josef wrote: >>>>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >>>>>> file descriptor. You probably don't want/mean to do that as it's >>>>>> pollable, I guess it's done because you just set it on all reads for the >>>>>> test? >>>>> >>>>> yes exactly, eventfd fd is blocking, so it actually makes no sense to >>>>> use IOSQE_ASYNC >>>> >>>> Right, and it's pollable too. >>>> >>>>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work >>>>> in my tests, thanks a lot :) >>>>> >>>>>> In any case, it should of course work. This is the leftover trace when >>>>>> we should be exiting, but an io-wq worker is still trying to get data >>>>>> from the eventfd: >>>>> >>>>> interesting, btw what kind of tool do you use for kernel debugging? >>>> >>>> Just poking at it and thinking about it, no hidden magic I'm afraid... >>> >>> Josef, can you try with this added? Looks bigger than it is, most of it >>> is just moving one function below another. >> >> Hmm, which kernel revision are you poking? Seems it doesn't match >> io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with >> NULL files. >> >> if (!files) >> __io_uring_cancel_task_requests(ctx, task); >> else >> io_uring_cancel_files(ctx, task, files); > > Yeah, I think I messed up. If files == NULL, then the task is going away. > So we should cancel all requests that match 'task', not just ones that > match task && files. > > Not sure I have much more time to look into this before next week, but > something like that. > > The problem case is the async worker being queued, long before the task > is killed and the contexts go away. But from exit_files(), we're only > concerned with canceling if we have inflight. Doesn't look right to me. Josef, can you test the patch below instead? Following Jens' idea it cancels more aggressively when a task is killed or exits. It's based on [1] but would probably apply fine to for-next. 
[1] git://git.kernel.dk/linux-block branch io_uring-5.11, commit dd20166236953c8cd14f4c668bf972af32f0c6be diff --git a/fs/io_uring.c b/fs/io_uring.c index f3690dfdd564..3a98e6dd71c0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8919,8 +8919,6 @@ void __io_uring_files_cancel(struct files_struct *files) struct io_ring_ctx *ctx = file->private_data; io_uring_cancel_task_requests(ctx, files); - if (files) - io_uring_del_task_file(file); } atomic_dec(&tctx->in_idle); @@ -8960,6 +8958,8 @@ static s64 tctx_inflight(struct io_uring_task *tctx) void __io_uring_task_cancel(void) { struct io_uring_task *tctx = current->io_uring; + struct file *file; + unsigned long index; DEFINE_WAIT(wait); s64 inflight; @@ -8986,6 +8986,9 @@ void __io_uring_task_cancel(void) finish_wait(&tctx->wait, &wait); atomic_dec(&tctx->in_idle); + + xa_for_each(&tctx->xa, index, file) + io_uring_del_task_file(file); } static int io_uring_flush(struct file *file, void *data) diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index 35b2d845704d..54925c74aa88 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -48,7 +48,7 @@ static inline void io_uring_task_cancel(void) static inline void io_uring_files_cancel(struct files_struct *files) { if (current->io_uring && !xa_empty(¤t->io_uring->xa)) - __io_uring_files_cancel(files); + __io_uring_task_cancel(); } static inline void io_uring_free(struct task_struct *tsk) { -- Pavel Begunkov ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 1:57 ` Pavel Begunkov @ 2020-12-20 7:13 ` Josef 2020-12-20 13:00 ` Pavel Begunkov 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-20 7:13 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring > Guys, do you share rings between processes? Explicitly like sending > io_uring fd over a socket, or implicitly e.g. sharing fd tables > (threads), or cloning with copying fd tables (and so taking a ref > to a ring). no in netty we don't share ring between processes > In other words, if you kill all your io_uring applications, does it > go back to normal? no at all, the io-wq worker thread is still running, I literally have to restart the vm to go back to normal(as far as I know is not possible to kill kernel threads right?) > Josef, can you test the patch below instead? Following Jens' idea it > cancels more aggressively when a task is killed or exits. It's based > on [1] but would probably apply fine to for-next. it works, I run several tests with eventfd read op async flag enabled, thanks a lot :) you are awesome guys :) -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 7:13 ` Josef @ 2020-12-20 13:00 ` Pavel Begunkov 2020-12-20 14:19 ` Pavel Begunkov 2020-12-20 16:14 ` Jens Axboe 0 siblings, 2 replies; 52+ messages in thread From: Pavel Begunkov @ 2020-12-20 13:00 UTC (permalink / raw) To: Josef; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring On 20/12/2020 07:13, Josef wrote: >> Guys, do you share rings between processes? Explicitly like sending >> io_uring fd over a socket, or implicitly e.g. sharing fd tables >> (threads), or cloning with copying fd tables (and so taking a ref >> to a ring). > > no in netty we don't share ring between processes > >> In other words, if you kill all your io_uring applications, does it >> go back to normal? > > no at all, the io-wq worker thread is still running, I literally have > to restart the vm to go back to normal(as far as I know is not > possible to kill kernel threads right?) > >> Josef, can you test the patch below instead? Following Jens' idea it >> cancels more aggressively when a task is killed or exits. It's based >> on [1] but would probably apply fine to for-next. > > it works, I run several tests with eventfd read op async flag enabled, > thanks a lot :) you are awesome guys :) Thanks for testing and confirming! Either we forgot something in io_ring_ctx_wait_and_kill() and it just can't cancel some requests, or we have a dependency that prevents release from happening. BTW, apparently that patch causes hangs for unrelated but known reasons, so better to not use it, we'll merge something more stable. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 13:00 ` Pavel Begunkov @ 2020-12-20 14:19 ` Pavel Begunkov 2020-12-20 15:56 ` Josef 2020-12-20 16:14 ` Jens Axboe 1 sibling, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-20 14:19 UTC (permalink / raw) To: Josef; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring On 20/12/2020 13:00, Pavel Begunkov wrote: > On 20/12/2020 07:13, Josef wrote: >>> Guys, do you share rings between processes? Explicitly like sending >>> io_uring fd over a socket, or implicitly e.g. sharing fd tables >>> (threads), or cloning with copying fd tables (and so taking a ref >>> to a ring). >> >> no in netty we don't share ring between processes >> >>> In other words, if you kill all your io_uring applications, does it >>> go back to normal? >> >> no at all, the io-wq worker thread is still running, I literally have >> to restart the vm to go back to normal(as far as I know is not >> possible to kill kernel threads right?) >> >>> Josef, can you test the patch below instead? Following Jens' idea it >>> cancels more aggressively when a task is killed or exits. It's based >>> on [1] but would probably apply fine to for-next. >> >> it works, I run several tests with eventfd read op async flag enabled, >> thanks a lot :) you are awesome guys :) > > Thanks for testing and confirming! Either we forgot something in > io_ring_ctx_wait_and_kill() and it just can't cancel some requests, > or we have a dependency that prevents release from happening. > > BTW, apparently that patch causes hangs for unrelated but known > reasons, so better to not use it, we'll merge something more stable. I'd really appreciate if you can try one more. I want to know why the final cleanup doesn't cope with it. diff --git a/fs/io_uring.c b/fs/io_uring.c index 941fe9b64fd9..d38fc819648e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8614,6 +8614,10 @@ static int io_remove_personalities(int id, void *p, void *data) return 0; } +static void io_cancel_defer_files(struct io_ring_ctx *ctx, + struct task_struct *task, + struct files_struct *files); + static void io_ring_exit_work(struct work_struct *work) { struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, @@ -8627,6 +8631,8 @@ static void io_ring_exit_work(struct work_struct *work) */ do { io_iopoll_try_reap_events(ctx); + io_poll_remove_all(ctx, NULL, NULL); + io_kill_timeouts(ctx, NULL, NULL); } while (!wait_for_completion_timeout(&ctx->ref_comp, HZ/20)); io_ring_ctx_free(ctx); } @@ -8641,6 +8647,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) io_cqring_overflow_flush(ctx, true, NULL, NULL); mutex_unlock(&ctx->uring_lock); + io_cancel_defer_files(ctx, NULL, NULL); io_kill_timeouts(ctx, NULL, NULL); io_poll_remove_all(ctx, NULL, NULL); -- Pavel Begunkov ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 14:19 ` Pavel Begunkov @ 2020-12-20 15:56 ` Josef 2020-12-20 15:58 ` Pavel Begunkov 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-20 15:56 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring > I'd really appreciate if you can try one more. I want to know why > the final cleanup doesn't cope with it. yeah sure, which kernel version? it seems to be that this patch doesn't match io_uring-5.11 and io_uring-5.10 On Sun, 20 Dec 2020 at 15:22, Pavel Begunkov <[email protected]> wrote: > > On 20/12/2020 13:00, Pavel Begunkov wrote: > > On 20/12/2020 07:13, Josef wrote: > >>> Guys, do you share rings between processes? Explicitly like sending > >>> io_uring fd over a socket, or implicitly e.g. sharing fd tables > >>> (threads), or cloning with copying fd tables (and so taking a ref > >>> to a ring). > >> > >> no in netty we don't share ring between processes > >> > >>> In other words, if you kill all your io_uring applications, does it > >>> go back to normal? > >> > >> no at all, the io-wq worker thread is still running, I literally have > >> to restart the vm to go back to normal(as far as I know is not > >> possible to kill kernel threads right?) > >> > >>> Josef, can you test the patch below instead? Following Jens' idea it > >>> cancels more aggressively when a task is killed or exits. It's based > >>> on [1] but would probably apply fine to for-next. > >> > >> it works, I run several tests with eventfd read op async flag enabled, > >> thanks a lot :) you are awesome guys :) > > > > Thanks for testing and confirming! Either we forgot something in > > io_ring_ctx_wait_and_kill() and it just can't cancel some requests, > > or we have a dependency that prevents release from happening. > > > > BTW, apparently that patch causes hangs for unrelated but known > > reasons, so better to not use it, we'll merge something more stable. > > I'd really appreciate if you can try one more. I want to know why > the final cleanup doesn't cope with it. > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index 941fe9b64fd9..d38fc819648e 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -8614,6 +8614,10 @@ static int io_remove_personalities(int id, void *p, void *data) > return 0; > } > > +static void io_cancel_defer_files(struct io_ring_ctx *ctx, > + struct task_struct *task, > + struct files_struct *files); > + > static void io_ring_exit_work(struct work_struct *work) > { > struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, > @@ -8627,6 +8631,8 @@ static void io_ring_exit_work(struct work_struct *work) > */ > do { > io_iopoll_try_reap_events(ctx); > + io_poll_remove_all(ctx, NULL, NULL); > + io_kill_timeouts(ctx, NULL, NULL); > } while (!wait_for_completion_timeout(&ctx->ref_comp, HZ/20)); > io_ring_ctx_free(ctx); > } > @@ -8641,6 +8647,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) > io_cqring_overflow_flush(ctx, true, NULL, NULL); > mutex_unlock(&ctx->uring_lock); > > + io_cancel_defer_files(ctx, NULL, NULL); > io_kill_timeouts(ctx, NULL, NULL); > io_poll_remove_all(ctx, NULL, NULL); > > -- > Pavel Begunkov -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 15:56 ` Josef @ 2020-12-20 15:58 ` Pavel Begunkov 0 siblings, 0 replies; 52+ messages in thread From: Pavel Begunkov @ 2020-12-20 15:58 UTC (permalink / raw) To: Josef; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring On 20/12/2020 15:56, Josef wrote: >> I'd really appreciate if you can try one more. I want to know why >> the final cleanup doesn't cope with it. > > yeah sure, which kernel version? it seems to be that this patch > doesn't match io_uring-5.11 and io_uring-5.10 It's io_uring-5.11 but I had some patches on top. I regenerated it below for up to date Jens' io_uring-5.11 diff --git a/fs/io_uring.c b/fs/io_uring.c index f3690dfdd564..4e1fb4054516 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8620,6 +8620,10 @@ static int io_remove_personalities(int id, void *p, void *data) return 0; } +static void io_cancel_defer_files(struct io_ring_ctx *ctx, + struct task_struct *task, + struct files_struct *files); + static void io_ring_exit_work(struct work_struct *work) { struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, @@ -8633,6 +8637,8 @@ static void io_ring_exit_work(struct work_struct *work) */ do { io_iopoll_try_reap_events(ctx); + io_poll_remove_all(ctx, NULL, NULL); + io_kill_timeouts(ctx, NULL, NULL); } while (!wait_for_completion_timeout(&ctx->ref_comp, HZ/20)); io_ring_ctx_free(ctx); } @@ -8647,6 +8653,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) io_cqring_overflow_flush(ctx, true, NULL, NULL); mutex_unlock(&ctx->uring_lock); + io_cancel_defer_files(ctx, NULL, NULL); io_kill_timeouts(ctx, NULL, NULL); io_poll_remove_all(ctx, NULL, NULL); -- Pavel Begunkov ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 13:00 ` Pavel Begunkov 2020-12-20 14:19 ` Pavel Begunkov @ 2020-12-20 16:14 ` Jens Axboe 2020-12-20 16:59 ` Josef 1 sibling, 1 reply; 52+ messages in thread From: Jens Axboe @ 2020-12-20 16:14 UTC (permalink / raw) To: Pavel Begunkov, Josef; +Cc: Norman Maurer, Dmitry Kadashev, io-uring On 12/20/20 6:00 AM, Pavel Begunkov wrote: > On 20/12/2020 07:13, Josef wrote: >>> Guys, do you share rings between processes? Explicitly like sending >>> io_uring fd over a socket, or implicitly e.g. sharing fd tables >>> (threads), or cloning with copying fd tables (and so taking a ref >>> to a ring). >> >> no in netty we don't share ring between processes >> >>> In other words, if you kill all your io_uring applications, does it >>> go back to normal? >> >> no at all, the io-wq worker thread is still running, I literally have >> to restart the vm to go back to normal(as far as I know is not >> possible to kill kernel threads right?) >> >>> Josef, can you test the patch below instead? Following Jens' idea it >>> cancels more aggressively when a task is killed or exits. It's based >>> on [1] but would probably apply fine to for-next. >> >> it works, I run several tests with eventfd read op async flag enabled, >> thanks a lot :) you are awesome guys :) > > Thanks for testing and confirming! Either we forgot something in > io_ring_ctx_wait_and_kill() and it just can't cancel some requests, > or we have a dependency that prevents release from happening. Just a guess - Josef, is the eventfd for the ring fd itself? BTW, the io_wq_cancel_all() in io_ring_ctx_wait_and_kill() needs to go. We should just use targeted cancelation - that's cleaner, and the cancel all will impact ATTACH_WQ as well. Separate thing to fix, though. -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 16:14 ` Jens Axboe @ 2020-12-20 16:59 ` Josef 2020-12-20 18:23 ` Josef 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-20 16:59 UTC (permalink / raw) To: Jens Axboe; +Cc: Pavel Begunkov, Norman Maurer, Dmitry Kadashev, io-uring > Just a guess - Josef, is the eventfd for the ring fd itself? yes via eventfd_write we want to wake up/unblock io_uring_enter(IORING_ENTER_GETEVENTS), and the read eventfd event is submitted every time each ring fd in netty has one eventfd On Sun, 20 Dec 2020 at 17:14, Jens Axboe <[email protected]> wrote: > > On 12/20/20 6:00 AM, Pavel Begunkov wrote: > > On 20/12/2020 07:13, Josef wrote: > >>> Guys, do you share rings between processes? Explicitly like sending > >>> io_uring fd over a socket, or implicitly e.g. sharing fd tables > >>> (threads), or cloning with copying fd tables (and so taking a ref > >>> to a ring). > >> > >> no in netty we don't share ring between processes > >> > >>> In other words, if you kill all your io_uring applications, does it > >>> go back to normal? > >> > >> no at all, the io-wq worker thread is still running, I literally have > >> to restart the vm to go back to normal(as far as I know is not > >> possible to kill kernel threads right?) > >> > >>> Josef, can you test the patch below instead? Following Jens' idea it > >>> cancels more aggressively when a task is killed or exits. It's based > >>> on [1] but would probably apply fine to for-next. > >> > >> it works, I run several tests with eventfd read op async flag enabled, > >> thanks a lot :) you are awesome guys :) > > > > Thanks for testing and confirming! Either we forgot something in > > io_ring_ctx_wait_and_kill() and it just can't cancel some requests, > > or we have a dependency that prevents release from happening. > > Just a guess - Josef, is the eventfd for the ring fd itself? > > BTW, the io_wq_cancel_all() in io_ring_ctx_wait_and_kill() needs to go. > We should just use targeted cancelation - that's cleaner, and the > cancel all will impact ATTACH_WQ as well. Separate thing to fix, though. > > -- > Jens Axboe > -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 16:59 ` Josef @ 2020-12-20 18:23 ` Josef 2020-12-20 18:41 ` Pavel Begunkov 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-20 18:23 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring > It's io_uring-5.11 but I had some patches on top. > I regenerated it below for up to date Jens' io_uring-5.11 Pavel I just tested your patch, it works :) On Sun, 20 Dec 2020 at 17:59, Josef <[email protected]> wrote: > > > Just a guess - Josef, is the eventfd for the ring fd itself? > > yes via eventfd_write we want to wake up/unblock > io_uring_enter(IORING_ENTER_GETEVENTS), and the read eventfd event is > submitted every time > each ring fd in netty has one eventfd > > On Sun, 20 Dec 2020 at 17:14, Jens Axboe <[email protected]> wrote: > > > > On 12/20/20 6:00 AM, Pavel Begunkov wrote: > > > On 20/12/2020 07:13, Josef wrote: > > >>> Guys, do you share rings between processes? Explicitly like sending > > >>> io_uring fd over a socket, or implicitly e.g. sharing fd tables > > >>> (threads), or cloning with copying fd tables (and so taking a ref > > >>> to a ring). > > >> > > >> no in netty we don't share ring between processes > > >> > > >>> In other words, if you kill all your io_uring applications, does it > > >>> go back to normal? > > >> > > >> no at all, the io-wq worker thread is still running, I literally have > > >> to restart the vm to go back to normal(as far as I know is not > > >> possible to kill kernel threads right?) > > >> > > >>> Josef, can you test the patch below instead? Following Jens' idea it > > >>> cancels more aggressively when a task is killed or exits. It's based > > >>> on [1] but would probably apply fine to for-next. > > >> > > >> it works, I run several tests with eventfd read op async flag enabled, > > >> thanks a lot :) you are awesome guys :) > > > > > > Thanks for testing and confirming! Either we forgot something in > > > io_ring_ctx_wait_and_kill() and it just can't cancel some requests, > > > or we have a dependency that prevents release from happening. > > > > Just a guess - Josef, is the eventfd for the ring fd itself? > > > > BTW, the io_wq_cancel_all() in io_ring_ctx_wait_and_kill() needs to go. > > We should just use targeted cancelation - that's cleaner, and the > > cancel all will impact ATTACH_WQ as well. Separate thing to fix, though. > > > > -- > > Jens Axboe > > > > > -- > Josef -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 18:23 ` Josef @ 2020-12-20 18:41 ` Pavel Begunkov 2020-12-21 8:22 ` Josef 0 siblings, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-20 18:41 UTC (permalink / raw) To: Josef; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring On 20/12/2020 18:23, Josef wrote: >> It's io_uring-5.11 but I had some patches on top. >> I regenerated it below for up to date Jens' io_uring-5.11 > > Pavel I just tested your patch, it works :) Interesting, thanks a lot! Not sure how exactly it's related to eventfd, but maybe just because it was dragged through internal polling asynchronously or somewhat like that, and io_ring_ctx_wait_and_kill() haven't found it at first. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 18:41 ` Pavel Begunkov @ 2020-12-21 8:22 ` Josef 2020-12-21 15:30 ` Pavel Begunkov 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-21 8:22 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring Pavel I'm sorry...my kernel build process was wrong...the same kernel patch(the first one) was used...I run different load tests on all 3 patches several times your first patch works great and unfortunately second and third patch doesn't work Here the patch summary: first patch works: [1] git://git.kernel.dk/linux-block branch io_uring-5.11, commit dd20166236953c8cd14f4c668bf972af32f0c6be diff --git a/fs/io_uring.c b/fs/io_uring.c index f3690dfdd564..3a98e6dd71c0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8919,8 +8919,6 @@ void __io_uring_files_cancel(struct files_struct *files) struct io_ring_ctx *ctx = file->private_data; io_uring_cancel_task_requests(ctx, files); - if (files) - io_uring_del_task_file(file); } atomic_dec(&tctx->in_idle); @@ -8960,6 +8958,8 @@ static s64 tctx_inflight(struct io_uring_task *tctx) void __io_uring_task_cancel(void) { struct io_uring_task *tctx = current->io_uring; + struct file *file; + unsigned long index; DEFINE_WAIT(wait); s64 inflight; @@ -8986,6 +8986,9 @@ void __io_uring_task_cancel(void) finish_wait(&tctx->wait, &wait); atomic_dec(&tctx->in_idle); + + xa_for_each(&tctx->xa, index, file) + io_uring_del_task_file(file); } static int io_uring_flush(struct file *file, void *data) diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index 35b2d845704d..54925c74aa88 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -48,7 +48,7 @@ static inline void io_uring_task_cancel(void) static inline void io_uring_files_cancel(struct files_struct *files) { if (current->io_uring && !xa_empty(¤t->io_uring->xa)) - __io_uring_files_cancel(files); + __io_uring_task_cancel(); } static inline void io_uring_free(struct task_struct *tsk) { second patch: diff --git a/fs/io_uring.c b/fs/io_uring.c index f3690dfdd564..4e1fb4054516 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8620,6 +8620,10 @@ static int io_remove_personalities(int id, void *p, void *data) return 0; } +static void io_cancel_defer_files(struct io_ring_ctx *ctx, + struct task_struct *task, + struct files_struct *files); + static void io_ring_exit_work(struct work_struct *work) { struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, @@ -8633,6 +8637,8 @@ static void io_ring_exit_work(struct work_struct *work) */ do { io_iopoll_try_reap_events(ctx); + io_poll_remove_all(ctx, NULL, NULL); + io_kill_timeouts(ctx, NULL, NULL); } while (!wait_for_completion_timeout(&ctx->ref_comp, HZ/20)); io_ring_ctx_free(ctx); } @@ -8647,6 +8653,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) io_cqring_overflow_flush(ctx, true, NULL, NULL); mutex_unlock(&ctx->uring_lock); + io_cancel_defer_files(ctx, NULL, NULL); io_kill_timeouts(ctx, NULL, NULL); io_poll_remove_all(ctx, NULL, NULL); third patch you already sent which is similar to the second one: https://lore.kernel.org/io-uring/[email protected]/T/#t -- Josef ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-21 8:22 ` Josef @ 2020-12-21 15:30 ` Pavel Begunkov 0 siblings, 0 replies; 52+ messages in thread From: Pavel Begunkov @ 2020-12-21 15:30 UTC (permalink / raw) To: Josef; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring On 21/12/2020 08:22, Josef wrote: > Pavel I'm sorry...my kernel build process was wrong...the same kernel > patch(the first one) was used...I run different load tests on all 3 > patches several times No worries, thanks for letting know. At least clears up contradiction of this patch with that it's eventfd related. > your first patch works great and unfortunately second and third patch > doesn't work -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 17:11 ` Jens Axboe 2020-12-19 17:34 ` Norman Maurer @ 2020-12-21 10:31 ` Dmitry Kadashev 1 sibling, 0 replies; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-21 10:31 UTC (permalink / raw) To: Jens Axboe; +Cc: Josef, io-uring, Norman Maurer On Sun, Dec 20, 2020 at 12:11 AM Jens Axboe <[email protected]> wrote: > > On 12/19/20 9:29 AM, Jens Axboe wrote: > > On 12/19/20 9:13 AM, Jens Axboe wrote: > >> On 12/18/20 7:49 PM, Josef wrote: > >>>> I'm happy to run _any_ reproducer, so please do let us know if you > >>>> manage to find something that I can run with netty. As long as it > >>>> includes instructions for exactly how to run it :-) > >>> > >>> cool :) I just created a repo for that: > >>> https://github.com/1Jo1/netty-io_uring-kernel-debugging.git > >>> > >>> - install jdk 1.8 > >>> - to run netty: ./mvnw compile exec:java > >>> -Dexec.mainClass="uring.netty.example.EchoUringServer" > >>> - to run the echo test: cargo run --release -- --address > >>> "127.0.0.1:2022" --number 200 --duration 20 --length 300 > >>> (https://github.com/haraldh/rust_echo_bench.git) > >>> - process kill -9 > >>> > >>> async flag is enabled and these operation are used: OP_READ, > >>> OP_WRITE, OP_POLL_ADD, OP_CLOSE, OP_ACCEPT > >>> > >>> (btw you can change the port in EchoUringServer.java) > >> > >> This is great! Not sure this is the same issue, but what I see here is > >> that we have leftover workers when the test is killed. This means the > >> rings aren't gone, and the memory isn't freed (and unaccounted), which > >> would ultimately lead to problems of course, similar to just an > >> accounting bug or race. > >> > >> The above _seems_ to be related to IOSQE_ASYNC. Trying to narrow it > >> down... > > > > Further narrowed down, it seems to be related to IOSQE_ASYNC on the > > read requests. I'm guessing there are cases where we end up not > > canceling them on ring close, hence the ring stays active, etc. > > > > If I just add a hack to clear IOSQE_ASYNC on IORING_OP_READ, then > > the test terminates fine on the kill -9. > > And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd > file descriptor. In our case - unlike netty - we use io_uring only for disk IO, no eventfd. And we do not use IOSQE_ASYNC (we've tried, but this coincided with some kernel crashes, so we've disabled it for now - not 100% sure if it's related or not yet). I'll try (again) to build a simpler reproducer for our issue, which is probably different from the netty one. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread