* "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) @ 2020-12-17 8:19 Dmitry Kadashev 2020-12-17 8:26 ` Norman Maurer 2020-12-18 15:26 ` Jens Axboe 0 siblings, 2 replies; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-17 8:19 UTC (permalink / raw) To: io-uring, Jens Axboe Hi, We've ran into something that looks like a memory accounting problem in the kernel / io_uring code. We use multiple rings per process, and generally it works fine. Until it does not - new ring creation just fails with ENOMEM. And at that point it fails consistently until the box is rebooted. More details: we use multiple rings per process, typically they are initialized on the process start (not necessarily, but that is not important here, let's just assume all are initialized on the process start). On a freshly booted box everything works fine. But after a while - and some process restarts - io_uring_queue_init() starts to fail with ENOMEM. Sometimes we see it fail, but then subsequent ones succeed (in the same process), but over time it gets worse, and eventually no ring can be initialized. And once that happens the only way to fix the problem is to restart the box. Most of the mentioned restarts are graceful: a new process is started and then the old one is killed, possibly with the KILL signal if it does not shut down in time. Things work fine for some time, but eventually we start getting those errors. Originally we've used 5.6.6 kernel, but given the fact quite a few accounting issues were fixed in io_uring in 5.8, we've tried 5.9.5 as well, but the issue is not gone. Just in case, everything else seems to be working fine, it just falls back to the thread pool instead of io_uring, and then everything continues to work just fine. I was not able to spot anything suspicious in the /proc/meminfo. We have RLIMIT_MEMLOCK set to infinity. And on a box that currently experiences the problem /proc/meminfo shows just 24MB as locked. Any pointers to how can we debug this? Thanks, Dmitry ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 8:19 "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) Dmitry Kadashev @ 2020-12-17 8:26 ` Norman Maurer 2020-12-17 8:36 ` Dmitry Kadashev 2020-12-18 15:26 ` Jens Axboe 1 sibling, 1 reply; 52+ messages in thread From: Norman Maurer @ 2020-12-17 8:26 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: io-uring, Jens Axboe I wonder if this is also related to one of the bug-reports we received: https://github.com/netty/netty-incubator-transport-io_uring/issues/14 > On 17. Dec 2020, at 09:19, Dmitry Kadashev <[email protected]> wrote: > > Hi, > > We've ran into something that looks like a memory accounting problem in the > kernel / io_uring code. We use multiple rings per process, and generally it > works fine. Until it does not - new ring creation just fails with ENOMEM. And at > that point it fails consistently until the box is rebooted. > > More details: we use multiple rings per process, typically they are initialized > on the process start (not necessarily, but that is not important here, let's > just assume all are initialized on the process start). On a freshly booted box > everything works fine. But after a while - and some process restarts - > io_uring_queue_init() starts to fail with ENOMEM. Sometimes we see it fail, but > then subsequent ones succeed (in the same process), but over time it gets worse, > and eventually no ring can be initialized. And once that happens the only way to > fix the problem is to restart the box. Most of the mentioned restarts are > graceful: a new process is started and then the old one is killed, possibly with > the KILL signal if it does not shut down in time. Things work fine for some > time, but eventually we start getting those errors. > > Originally we've used 5.6.6 kernel, but given the fact quite a few accounting > issues were fixed in io_uring in 5.8, we've tried 5.9.5 as well, but the issue > is not gone. > > Just in case, everything else seems to be working fine, it just falls back to > the thread pool instead of io_uring, and then everything continues to work just > fine. > > I was not able to spot anything suspicious in the /proc/meminfo. We have > RLIMIT_MEMLOCK set to infinity. And on a box that currently experiences the > problem /proc/meminfo shows just 24MB as locked. > > Any pointers to how can we debug this? > > Thanks, > Dmitry ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 8:26 ` Norman Maurer @ 2020-12-17 8:36 ` Dmitry Kadashev 2020-12-17 8:40 ` Dmitry Kadashev 0 siblings, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-17 8:36 UTC (permalink / raw) To: Norman Maurer; +Cc: io-uring, Jens Axboe On Thu, Dec 17, 2020 at 3:27 PM Norman Maurer <[email protected]> wrote: > > I wonder if this is also related to one of the bug-reports we received: > > https://github.com/netty/netty-incubator-transport-io_uring/issues/14 That is curious. This ticket mentions Shmem though, and in our case it does not look suspicious at all. E.g. on a box that has the problem at the moment: Shmem: 41856 kB. The box has 256GB of RAM. But I'd (given my lack of knowledge) expect the issues to be related anyway. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 8:36 ` Dmitry Kadashev @ 2020-12-17 8:40 ` Dmitry Kadashev 2020-12-17 10:38 ` Josef 0 siblings, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-17 8:40 UTC (permalink / raw) To: Norman Maurer; +Cc: io-uring, Jens Axboe On Thu, Dec 17, 2020 at 3:36 PM Dmitry Kadashev <[email protected]> wrote: > > On Thu, Dec 17, 2020 at 3:27 PM Norman Maurer > <[email protected]> wrote: > > > > I wonder if this is also related to one of the bug-reports we received: > > > > https://github.com/netty/netty-incubator-transport-io_uring/issues/14 > > That is curious. This ticket mentions Shmem though, and in our case it does > not look suspicious at all. E.g. on a box that has the problem at the moment: > Shmem: 41856 kB. The box has 256GB of RAM. > > But I'd (given my lack of knowledge) expect the issues to be related anyway. One common thing here is the ticket OP mentions kill -9, and we do use that as well at least in some circumstances. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 8:40 ` Dmitry Kadashev @ 2020-12-17 10:38 ` Josef 2020-12-17 11:10 ` Dmitry Kadashev 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-17 10:38 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: Norman Maurer, io-uring, Jens Axboe > > That is curious. This ticket mentions Shmem though, and in our case it does > not look suspicious at all. E.g. on a box that has the problem at the moment: > Shmem: 41856 kB. The box has 256GB of RAM. > > But I'd (given my lack of knowledge) expect the issues to be related anyway. what about mapped? mapped is pretty high 1GB on my machine, I'm still reproduce that in C...however the user process is killed but not the io_wq_worker kernel processes, that's also the reason why the server socket still listening(even if the user process is killed), the bug only occurs(in netty) with a high number of operations and using eventfd_write to unblock io_uring_enter(IORING_ENTER_GETEVENTS) (tested on kernel 5.9 and 5.10) -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 10:38 ` Josef @ 2020-12-17 11:10 ` Dmitry Kadashev 2020-12-17 13:43 ` Victor Stewart 0 siblings, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-17 11:10 UTC (permalink / raw) To: Josef; +Cc: Norman Maurer, io-uring, Jens Axboe On Thu, Dec 17, 2020 at 5:38 PM Josef <[email protected]> wrote: > > > > That is curious. This ticket mentions Shmem though, and in our case it does > > not look suspicious at all. E.g. on a box that has the problem at the moment: > > Shmem: 41856 kB. The box has 256GB of RAM. > > > > But I'd (given my lack of knowledge) expect the issues to be related anyway. > > what about mapped? mapped is pretty high 1GB on my machine, I'm still > reproduce that in C...however the user process is killed but not the > io_wq_worker kernel processes, that's also the reason why the server > socket still listening(even if the user process is killed), the bug > only occurs(in netty) with a high number of operations and using > eventfd_write to unblock io_uring_enter(IORING_ENTER_GETEVENTS) > > (tested on kernel 5.9 and 5.10) Stats from another box with this problem (still 256G of RAM): Mlocked: 17096 kB Mapped: 171480 kB Shmem: 41880 kB Does not look suspicious at a glance. Number of io_wq* processes is 23-31. Uptime is 27 days, 24 rings per process, process was restarted 4 times, 3 out of these four the old instance was killed with SIGKILL. On the last process start 18 rings failed to initialize, but after that 6 more were initialized successfully. It was before the old instance was killed. Maybe it's related to the load and number of io-wq processes, e.g. some of them exited and a few more rings were initialized successfully. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 11:10 ` Dmitry Kadashev @ 2020-12-17 13:43 ` Victor Stewart 2020-12-18 9:20 ` Dmitry Kadashev 0 siblings, 1 reply; 52+ messages in thread From: Victor Stewart @ 2020-12-17 13:43 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: Josef, Norman Maurer, io-uring, Jens Axboe On Thu, Dec 17, 2020 at 11:12 AM Dmitry Kadashev <[email protected]> wrote: > > On Thu, Dec 17, 2020 at 5:38 PM Josef <[email protected]> wrote: > > > > > > That is curious. This ticket mentions Shmem though, and in our case it does > > > not look suspicious at all. E.g. on a box that has the problem at the moment: > > > Shmem: 41856 kB. The box has 256GB of RAM. > > > > > > But I'd (given my lack of knowledge) expect the issues to be related anyway. > > > > what about mapped? mapped is pretty high 1GB on my machine, I'm still > > reproduce that in C...however the user process is killed but not the > > io_wq_worker kernel processes, that's also the reason why the server > > socket still listening(even if the user process is killed), the bug > > only occurs(in netty) with a high number of operations and using > > eventfd_write to unblock io_uring_enter(IORING_ENTER_GETEVENTS) > > > > (tested on kernel 5.9 and 5.10) > > Stats from another box with this problem (still 256G of RAM): > > Mlocked: 17096 kB > Mapped: 171480 kB > Shmem: 41880 kB > > Does not look suspicious at a glance. Number of io_wq* processes is 23-31. > > Uptime is 27 days, 24 rings per process, process was restarted 4 times, 3 out of > these four the old instance was killed with SIGKILL. On the last process start > 18 rings failed to initialize, but after that 6 more were initialized > successfully. It was before the old instance was killed. Maybe it's related to > the load and number of io-wq processes, e.g. some of them exited and a few more > rings were initialized successfully. have you tried using IORING_SETUP_ATTACH_WQ? https://lkml.org/lkml/2020/1/27/763 > > -- > Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 13:43 ` Victor Stewart @ 2020-12-18 9:20 ` Dmitry Kadashev 2020-12-18 17:22 ` Jens Axboe 0 siblings, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-18 9:20 UTC (permalink / raw) To: Victor Stewart; +Cc: Josef, Norman Maurer, io-uring, Jens Axboe On Thu, Dec 17, 2020 at 8:43 PM Victor Stewart <[email protected]> wrote: > > On Thu, Dec 17, 2020 at 11:12 AM Dmitry Kadashev <[email protected]> wrote: > > > > On Thu, Dec 17, 2020 at 5:38 PM Josef <[email protected]> wrote: > > > > > > > > That is curious. This ticket mentions Shmem though, and in our case it does > > > > not look suspicious at all. E.g. on a box that has the problem at the moment: > > > > Shmem: 41856 kB. The box has 256GB of RAM. > > > > > > > > But I'd (given my lack of knowledge) expect the issues to be related anyway. > > > > > > what about mapped? mapped is pretty high 1GB on my machine, I'm still > > > reproduce that in C...however the user process is killed but not the > > > io_wq_worker kernel processes, that's also the reason why the server > > > socket still listening(even if the user process is killed), the bug > > > only occurs(in netty) with a high number of operations and using > > > eventfd_write to unblock io_uring_enter(IORING_ENTER_GETEVENTS) > > > > > > (tested on kernel 5.9 and 5.10) > > > > Stats from another box with this problem (still 256G of RAM): > > > > Mlocked: 17096 kB > > Mapped: 171480 kB > > Shmem: 41880 kB > > > > Does not look suspicious at a glance. Number of io_wq* processes is 23-31. > > > > Uptime is 27 days, 24 rings per process, process was restarted 4 times, 3 out of > > these four the old instance was killed with SIGKILL. On the last process start > > 18 rings failed to initialize, but after that 6 more were initialized > > successfully. It was before the old instance was killed. Maybe it's related to > > the load and number of io-wq processes, e.g. some of them exited and a few more > > rings were initialized successfully. > > have you tried using IORING_SETUP_ATTACH_WQ? > > https://lkml.org/lkml/2020/1/27/763 No, I have not, but while using that might help to slow down progression of the issue, it won't fix it - at least if I understand correctly. The problem is not that those rings can't be created at all - there is no problem with that on a freshly booted box, but rather that after some (potentially abrupt) owning process terminations under load kernel gets into a state where - eventually - no new rings can be created at all. Not a single one. In the above example the issue just haven't progressed far enough yet. In other words, there seems to be a leak / accounting problem in the io_uring code that is triggered by abrupt process termination under load (just no io_uring_queue_exit?) - this is not a usage problem. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-18 9:20 ` Dmitry Kadashev @ 2020-12-18 17:22 ` Jens Axboe 0 siblings, 0 replies; 52+ messages in thread From: Jens Axboe @ 2020-12-18 17:22 UTC (permalink / raw) To: Dmitry Kadashev, Victor Stewart; +Cc: Josef, Norman Maurer, io-uring On 12/18/20 2:20 AM, Dmitry Kadashev wrote: > On Thu, Dec 17, 2020 at 8:43 PM Victor Stewart <[email protected]> wrote: >> >> On Thu, Dec 17, 2020 at 11:12 AM Dmitry Kadashev <[email protected]> wrote: >>> >>> On Thu, Dec 17, 2020 at 5:38 PM Josef <[email protected]> wrote: >>>> >>>>>> That is curious. This ticket mentions Shmem though, and in our case it does >>>> > not look suspicious at all. E.g. on a box that has the problem at the moment: >>>> > Shmem: 41856 kB. The box has 256GB of RAM. >>>> > >>>> > But I'd (given my lack of knowledge) expect the issues to be related anyway. >>>> >>>> what about mapped? mapped is pretty high 1GB on my machine, I'm still >>>> reproduce that in C...however the user process is killed but not the >>>> io_wq_worker kernel processes, that's also the reason why the server >>>> socket still listening(even if the user process is killed), the bug >>>> only occurs(in netty) with a high number of operations and using >>>> eventfd_write to unblock io_uring_enter(IORING_ENTER_GETEVENTS) >>>> >>>> (tested on kernel 5.9 and 5.10) >>> >>> Stats from another box with this problem (still 256G of RAM): >>> >>> Mlocked: 17096 kB >>> Mapped: 171480 kB >>> Shmem: 41880 kB >>> >>> Does not look suspicious at a glance. Number of io_wq* processes is 23-31. >>> >>> Uptime is 27 days, 24 rings per process, process was restarted 4 times, 3 out of >>> these four the old instance was killed with SIGKILL. On the last process start >>> 18 rings failed to initialize, but after that 6 more were initialized >>> successfully. It was before the old instance was killed. Maybe it's related to >>> the load and number of io-wq processes, e.g. some of them exited and a few more >>> rings were initialized successfully. >> >> have you tried using IORING_SETUP_ATTACH_WQ? >> >> https://lkml.org/lkml/2020/1/27/763 > > No, I have not, but while using that might help to slow down progression of the > issue, it won't fix it - at least if I understand correctly. The problem is not > that those rings can't be created at all - there is no problem with that on a > freshly booted box, but rather that after some (potentially abrupt) owning > process terminations under load kernel gets into a state where - eventually - no > new rings can be created at all. Not a single one. In the above example the > issue just haven't progressed far enough yet. > > In other words, there seems to be a leak / accounting problem in the io_uring > code that is triggered by abrupt process termination under load (just no > io_uring_queue_exit?) - this is not a usage problem. Right, I don't think that's related at all. Might be a good idea in general depending on your use case, but it won't really have any bearing on the particular issue at hand. -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-17 8:19 "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) Dmitry Kadashev 2020-12-17 8:26 ` Norman Maurer @ 2020-12-18 15:26 ` Jens Axboe 2020-12-18 17:21 ` Josef 1 sibling, 1 reply; 52+ messages in thread From: Jens Axboe @ 2020-12-18 15:26 UTC (permalink / raw) To: Dmitry Kadashev, io-uring On 12/17/20 1:19 AM, Dmitry Kadashev wrote: > Hi, > > We've ran into something that looks like a memory accounting problem > in the kernel / io_uring code. We use multiple rings per process, and > generally it works fine. Until it does not - new ring creation just > fails with ENOMEM. And at that point it fails consistently until the > box is rebooted. > > More details: we use multiple rings per process, typically they are > initialized on the process start (not necessarily, but that is not > important here, let's just assume all are initialized on the process > start). On a freshly booted box everything works fine. But after a > while - and some process restarts - io_uring_queue_init() starts to > fail with ENOMEM. Sometimes we see it fail, but then subsequent ones > succeed (in the same process), but over time it gets worse, and > eventually no ring can be initialized. And once that happens the only > way to fix the problem is to restart the box. Most of the mentioned > restarts are graceful: a new process is started and then the old one > is killed, possibly with the KILL signal if it does not shut down in > time. Things work fine for some time, but eventually we start getting > those errors. > > Originally we've used 5.6.6 kernel, but given the fact quite a few > accounting issues were fixed in io_uring in 5.8, we've tried 5.9.5 as > well, but the issue is not gone. > > Just in case, everything else seems to be working fine, it just falls > back to the thread pool instead of io_uring, and then everything > continues to work just fine. > > I was not able to spot anything suspicious in the /proc/meminfo. We > have RLIMIT_MEMLOCK set to infinity. And on a box that currently > experiences the problem /proc/meminfo shows just 24MB as locked. > > Any pointers to how can we debug this? I've read through this thread, but haven't had time to really debug it yet. I did try a few test cases, and wasn't able to trigger anything. The signal part is interesting, as it would cause parallel teardowns potentially. And I did post a patch for that yesterday, where I did spot a race in the user mm accounting. I don't think this is related to this one, but would still be useful if you could test with this applied: https://lore.kernel.org/io-uring/[email protected]/T/#u just in case... -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-18 15:26 ` Jens Axboe @ 2020-12-18 17:21 ` Josef 2020-12-18 17:23 ` Jens Axboe 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-18 17:21 UTC (permalink / raw) To: Jens Axboe; +Cc: Dmitry Kadashev, io-uring, Norman Maurer > I've read through this thread, but haven't had time to really debug it > yet. I did try a few test cases, and wasn't able to trigger anything. > The signal part is interesting, as it would cause parallel teardowns > potentially. And I did post a patch for that yesterday, where I did spot > a race in the user mm accounting. I don't think this is related to this > one, but would still be useful if you could test with this applied: > > https://lore.kernel.org/io-uring/[email protected]/T/#u as you expected it didn't work, unfortunately I couldn't reproduce that in C..I'll try to debug in netty/kernel -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-18 17:21 ` Josef @ 2020-12-18 17:23 ` Jens Axboe 2020-12-19 2:49 ` Josef 0 siblings, 1 reply; 52+ messages in thread From: Jens Axboe @ 2020-12-18 17:23 UTC (permalink / raw) To: Josef; +Cc: Dmitry Kadashev, io-uring, Norman Maurer On 12/18/20 10:21 AM, Josef wrote: >> I've read through this thread, but haven't had time to really debug it >> yet. I did try a few test cases, and wasn't able to trigger anything. >> The signal part is interesting, as it would cause parallel teardowns >> potentially. And I did post a patch for that yesterday, where I did spot >> a race in the user mm accounting. I don't think this is related to this >> one, but would still be useful if you could test with this applied: >> >> https://lore.kernel.org/io-uring/[email protected]/T/#u > > as you expected it didn't work, unfortunately I couldn't reproduce > that in C..I'll try to debug in netty/kernel I'm happy to run _any_ reproducer, so please do let us know if you manage to find something that I can run with netty. As long as it includes instructions for exactly how to run it :-) -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-18 17:23 ` Jens Axboe @ 2020-12-19 2:49 ` Josef 2020-12-19 16:13 ` Jens Axboe 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-19 2:49 UTC (permalink / raw) To: Jens Axboe; +Cc: Dmitry Kadashev, io-uring, Norman Maurer > I'm happy to run _any_ reproducer, so please do let us know if you > manage to find something that I can run with netty. As long as it > includes instructions for exactly how to run it :-) cool :) I just created a repo for that: https://github.com/1Jo1/netty-io_uring-kernel-debugging.git - install jdk 1.8 - to run netty: ./mvnw compile exec:java -Dexec.mainClass="uring.netty.example.EchoUringServer" - to run the echo test: cargo run --release -- --address "127.0.0.1:2022" --number 200 --duration 20 --length 300 (https://github.com/haraldh/rust_echo_bench.git) - process kill -9 async flag is enabled and these operation are used: OP_READ, OP_WRITE, OP_POLL_ADD, OP_CLOSE, OP_ACCEPT (btw you can change the port in EchoUringServer.java) -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 2:49 ` Josef @ 2020-12-19 16:13 ` Jens Axboe 2020-12-19 16:29 ` Jens Axboe 0 siblings, 1 reply; 52+ messages in thread From: Jens Axboe @ 2020-12-19 16:13 UTC (permalink / raw) To: Josef; +Cc: Dmitry Kadashev, io-uring, Norman Maurer On 12/18/20 7:49 PM, Josef wrote: >> I'm happy to run _any_ reproducer, so please do let us know if you >> manage to find something that I can run with netty. As long as it >> includes instructions for exactly how to run it :-) > > cool :) I just created a repo for that: > https://github.com/1Jo1/netty-io_uring-kernel-debugging.git > > - install jdk 1.8 > - to run netty: ./mvnw compile exec:java > -Dexec.mainClass="uring.netty.example.EchoUringServer" > - to run the echo test: cargo run --release -- --address > "127.0.0.1:2022" --number 200 --duration 20 --length 300 > (https://github.com/haraldh/rust_echo_bench.git) > - process kill -9 > > async flag is enabled and these operation are used: OP_READ, > OP_WRITE, OP_POLL_ADD, OP_CLOSE, OP_ACCEPT > > (btw you can change the port in EchoUringServer.java) This is great! Not sure this is the same issue, but what I see here is that we have leftover workers when the test is killed. This means the rings aren't gone, and the memory isn't freed (and unaccounted), which would ultimately lead to problems of course, similar to just an accounting bug or race. The above _seems_ to be related to IOSQE_ASYNC. Trying to narrow it down... -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 16:13 ` Jens Axboe @ 2020-12-19 16:29 ` Jens Axboe 2020-12-19 17:11 ` Jens Axboe 0 siblings, 1 reply; 52+ messages in thread From: Jens Axboe @ 2020-12-19 16:29 UTC (permalink / raw) To: Josef; +Cc: Dmitry Kadashev, io-uring, Norman Maurer On 12/19/20 9:13 AM, Jens Axboe wrote: > On 12/18/20 7:49 PM, Josef wrote: >>> I'm happy to run _any_ reproducer, so please do let us know if you >>> manage to find something that I can run with netty. As long as it >>> includes instructions for exactly how to run it :-) >> >> cool :) I just created a repo for that: >> https://github.com/1Jo1/netty-io_uring-kernel-debugging.git >> >> - install jdk 1.8 >> - to run netty: ./mvnw compile exec:java >> -Dexec.mainClass="uring.netty.example.EchoUringServer" >> - to run the echo test: cargo run --release -- --address >> "127.0.0.1:2022" --number 200 --duration 20 --length 300 >> (https://github.com/haraldh/rust_echo_bench.git) >> - process kill -9 >> >> async flag is enabled and these operation are used: OP_READ, >> OP_WRITE, OP_POLL_ADD, OP_CLOSE, OP_ACCEPT >> >> (btw you can change the port in EchoUringServer.java) > > This is great! Not sure this is the same issue, but what I see here is > that we have leftover workers when the test is killed. This means the > rings aren't gone, and the memory isn't freed (and unaccounted), which > would ultimately lead to problems of course, similar to just an > accounting bug or race. > > The above _seems_ to be related to IOSQE_ASYNC. Trying to narrow it > down... Further narrowed down, it seems to be related to IOSQE_ASYNC on the read requests. I'm guessing there are cases where we end up not canceling them on ring close, hence the ring stays active, etc. If I just add a hack to clear IOSQE_ASYNC on IORING_OP_READ, then the test terminates fine on the kill -9. -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 16:29 ` Jens Axboe @ 2020-12-19 17:11 ` Jens Axboe 2020-12-19 17:34 ` Norman Maurer 2020-12-21 10:31 ` Dmitry Kadashev 0 siblings, 2 replies; 52+ messages in thread From: Jens Axboe @ 2020-12-19 17:11 UTC (permalink / raw) To: Josef; +Cc: Dmitry Kadashev, io-uring, Norman Maurer On 12/19/20 9:29 AM, Jens Axboe wrote: > On 12/19/20 9:13 AM, Jens Axboe wrote: >> On 12/18/20 7:49 PM, Josef wrote: >>>> I'm happy to run _any_ reproducer, so please do let us know if you >>>> manage to find something that I can run with netty. As long as it >>>> includes instructions for exactly how to run it :-) >>> >>> cool :) I just created a repo for that: >>> https://github.com/1Jo1/netty-io_uring-kernel-debugging.git >>> >>> - install jdk 1.8 >>> - to run netty: ./mvnw compile exec:java >>> -Dexec.mainClass="uring.netty.example.EchoUringServer" >>> - to run the echo test: cargo run --release -- --address >>> "127.0.0.1:2022" --number 200 --duration 20 --length 300 >>> (https://github.com/haraldh/rust_echo_bench.git) >>> - process kill -9 >>> >>> async flag is enabled and these operation are used: OP_READ, >>> OP_WRITE, OP_POLL_ADD, OP_CLOSE, OP_ACCEPT >>> >>> (btw you can change the port in EchoUringServer.java) >> >> This is great! Not sure this is the same issue, but what I see here is >> that we have leftover workers when the test is killed. This means the >> rings aren't gone, and the memory isn't freed (and unaccounted), which >> would ultimately lead to problems of course, similar to just an >> accounting bug or race. >> >> The above _seems_ to be related to IOSQE_ASYNC. Trying to narrow it >> down... > > Further narrowed down, it seems to be related to IOSQE_ASYNC on the > read requests. I'm guessing there are cases where we end up not > canceling them on ring close, hence the ring stays active, etc. > > If I just add a hack to clear IOSQE_ASYNC on IORING_OP_READ, then > the test terminates fine on the kill -9. And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd file descriptor. You probably don't want/mean to do that as it's pollable, I guess it's done because you just set it on all reads for the test? In any case, it should of course work. This is the leftover trace when we should be exiting, but an io-wq worker is still trying to get data from the eventfd: $ sudo cat /proc/2148/stack [<0>] eventfd_read+0x160/0x260 [<0>] io_iter_do_read+0x1b/0x40 [<0>] io_read+0xa5/0x320 [<0>] io_issue_sqe+0x23c/0xe80 [<0>] io_wq_submit_work+0x6e/0x1a0 [<0>] io_worker_handle_work+0x13d/0x4e0 [<0>] io_wqe_worker+0x2aa/0x360 [<0>] kthread+0x130/0x160 [<0>] ret_from_fork+0x1f/0x30 which will never finish at this point, it should have been canceled. -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 17:11 ` Jens Axboe @ 2020-12-19 17:34 ` Norman Maurer 2020-12-19 17:38 ` Jens Axboe 2020-12-21 10:31 ` Dmitry Kadashev 1 sibling, 1 reply; 52+ messages in thread From: Norman Maurer @ 2020-12-19 17:34 UTC (permalink / raw) To: Jens Axboe; +Cc: Josef, Dmitry Kadashev, io-uring Thanks a lot ... we can just workaround this than in netty . Bye Norman > Am 19.12.2020 um 18:11 schrieb Jens Axboe <[email protected]>: > > On 12/19/20 9:29 AM, Jens Axboe wrote: >>> On 12/19/20 9:13 AM, Jens Axboe wrote: >>> On 12/18/20 7:49 PM, Josef wrote: >>>>> I'm happy to run _any_ reproducer, so please do let us know if you >>>>> manage to find something that I can run with netty. As long as it >>>>> includes instructions for exactly how to run it :-) >>>> >>>> cool :) I just created a repo for that: >>>> https://github.com/1Jo1/netty-io_uring-kernel-debugging.git >>>> >>>> - install jdk 1.8 >>>> - to run netty: ./mvnw compile exec:java >>>> -Dexec.mainClass="uring.netty.example.EchoUringServer" >>>> - to run the echo test: cargo run --release -- --address >>>> "127.0.0.1:2022" --number 200 --duration 20 --length 300 >>>> (https://github.com/haraldh/rust_echo_bench.git) >>>> - process kill -9 >>>> >>>> async flag is enabled and these operation are used: OP_READ, >>>> OP_WRITE, OP_POLL_ADD, OP_CLOSE, OP_ACCEPT >>>> >>>> (btw you can change the port in EchoUringServer.java) >>> >>> This is great! Not sure this is the same issue, but what I see here is >>> that we have leftover workers when the test is killed. This means the >>> rings aren't gone, and the memory isn't freed (and unaccounted), which >>> would ultimately lead to problems of course, similar to just an >>> accounting bug or race. >>> >>> The above _seems_ to be related to IOSQE_ASYNC. Trying to narrow it >>> down... >> >> Further narrowed down, it seems to be related to IOSQE_ASYNC on the >> read requests. I'm guessing there are cases where we end up not >> canceling them on ring close, hence the ring stays active, etc. >> >> If I just add a hack to clear IOSQE_ASYNC on IORING_OP_READ, then >> the test terminates fine on the kill -9. > > And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd > file descriptor. You probably don't want/mean to do that as it's > pollable, I guess it's done because you just set it on all reads for the > test? > > In any case, it should of course work. This is the leftover trace when > we should be exiting, but an io-wq worker is still trying to get data > from the eventfd: > > $ sudo cat /proc/2148/stack > [<0>] eventfd_read+0x160/0x260 > [<0>] io_iter_do_read+0x1b/0x40 > [<0>] io_read+0xa5/0x320 > [<0>] io_issue_sqe+0x23c/0xe80 > [<0>] io_wq_submit_work+0x6e/0x1a0 > [<0>] io_worker_handle_work+0x13d/0x4e0 > [<0>] io_wqe_worker+0x2aa/0x360 > [<0>] kthread+0x130/0x160 > [<0>] ret_from_fork+0x1f/0x30 > > which will never finish at this point, it should have been canceled. > > -- > Jens Axboe > ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 17:34 ` Norman Maurer @ 2020-12-19 17:38 ` Jens Axboe 2020-12-19 20:51 ` Josef 0 siblings, 1 reply; 52+ messages in thread From: Jens Axboe @ 2020-12-19 17:38 UTC (permalink / raw) To: Norman Maurer; +Cc: Josef, Dmitry Kadashev, io-uring On 12/19/20 10:34 AM, Norman Maurer wrote: >> Am 19.12.2020 um 18:11 schrieb Jens Axboe <[email protected]>: >> >> On 12/19/20 9:29 AM, Jens Axboe wrote: >>>> On 12/19/20 9:13 AM, Jens Axboe wrote: >>>> On 12/18/20 7:49 PM, Josef wrote: >>>>>> I'm happy to run _any_ reproducer, so please do let us know if you >>>>>> manage to find something that I can run with netty. As long as it >>>>>> includes instructions for exactly how to run it :-) >>>>> >>>>> cool :) I just created a repo for that: >>>>> https://github.com/1Jo1/netty-io_uring-kernel-debugging.git >>>>> >>>>> - install jdk 1.8 >>>>> - to run netty: ./mvnw compile exec:java >>>>> -Dexec.mainClass="uring.netty.example.EchoUringServer" >>>>> - to run the echo test: cargo run --release -- --address >>>>> "127.0.0.1:2022" --number 200 --duration 20 --length 300 >>>>> (https://github.com/haraldh/rust_echo_bench.git) >>>>> - process kill -9 >>>>> >>>>> async flag is enabled and these operation are used: OP_READ, >>>>> OP_WRITE, OP_POLL_ADD, OP_CLOSE, OP_ACCEPT >>>>> >>>>> (btw you can change the port in EchoUringServer.java) >>>> >>>> This is great! Not sure this is the same issue, but what I see here is >>>> that we have leftover workers when the test is killed. This means the >>>> rings aren't gone, and the memory isn't freed (and unaccounted), which >>>> would ultimately lead to problems of course, similar to just an >>>> accounting bug or race. >>>> >>>> The above _seems_ to be related to IOSQE_ASYNC. Trying to narrow it >>>> down... >>> >>> Further narrowed down, it seems to be related to IOSQE_ASYNC on the >>> read requests. I'm guessing there are cases where we end up not >>> canceling them on ring close, hence the ring stays active, etc. >>> >>> If I just add a hack to clear IOSQE_ASYNC on IORING_OP_READ, then >>> the test terminates fine on the kill -9. >> >> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >> file descriptor. You probably don't want/mean to do that as it's >> pollable, I guess it's done because you just set it on all reads for the >> test? >> >> In any case, it should of course work. This is the leftover trace when >> we should be exiting, but an io-wq worker is still trying to get data >> from the eventfd: >> >> $ sudo cat /proc/2148/stack >> [<0>] eventfd_read+0x160/0x260 >> [<0>] io_iter_do_read+0x1b/0x40 >> [<0>] io_read+0xa5/0x320 >> [<0>] io_issue_sqe+0x23c/0xe80 >> [<0>] io_wq_submit_work+0x6e/0x1a0 >> [<0>] io_worker_handle_work+0x13d/0x4e0 >> [<0>] io_wqe_worker+0x2aa/0x360 >> [<0>] kthread+0x130/0x160 >> [<0>] ret_from_fork+0x1f/0x30 >> >> which will never finish at this point, it should have been canceled. > > Thanks a lot ... we can just workaround this than in netty . That probably should be done in any case, since I don't think IOSQE_ASYNC is useful on the eventfd read for you. But I'm trying to narrow down _why_ it fails, it could be a general issue in how cancelations are processed for sudden exit. Which would explain why it only shows up for the kill -9 case. Anyway, digging into it :-) -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 17:38 ` Jens Axboe @ 2020-12-19 20:51 ` Josef 2020-12-19 21:54 ` Jens Axboe 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-19 20:51 UTC (permalink / raw) To: Jens Axboe; +Cc: Norman Maurer, Dmitry Kadashev, io-uring > And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd > file descriptor. You probably don't want/mean to do that as it's > pollable, I guess it's done because you just set it on all reads for the > test? yes exactly, eventfd fd is blocking, so it actually makes no sense to use IOSQE_ASYNC I just tested eventfd without the IOSQE_ASYNC flag, it seems to work in my tests, thanks a lot :) > In any case, it should of course work. This is the leftover trace when > we should be exiting, but an io-wq worker is still trying to get data > from the eventfd: interesting, btw what kind of tool do you use for kernel debugging? -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 20:51 ` Josef @ 2020-12-19 21:54 ` Jens Axboe 2020-12-19 23:13 ` Jens Axboe 0 siblings, 1 reply; 52+ messages in thread From: Jens Axboe @ 2020-12-19 21:54 UTC (permalink / raw) To: Josef; +Cc: Norman Maurer, Dmitry Kadashev, io-uring On 12/19/20 1:51 PM, Josef wrote: >> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >> file descriptor. You probably don't want/mean to do that as it's >> pollable, I guess it's done because you just set it on all reads for the >> test? > > yes exactly, eventfd fd is blocking, so it actually makes no sense to > use IOSQE_ASYNC Right, and it's pollable too. > I just tested eventfd without the IOSQE_ASYNC flag, it seems to work > in my tests, thanks a lot :) > >> In any case, it should of course work. This is the leftover trace when >> we should be exiting, but an io-wq worker is still trying to get data >> from the eventfd: > > interesting, btw what kind of tool do you use for kernel debugging? Just poking at it and thinking about it, no hidden magic I'm afraid... -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 21:54 ` Jens Axboe @ 2020-12-19 23:13 ` Jens Axboe 2020-12-19 23:42 ` Josef 2020-12-19 23:42 ` Pavel Begunkov 0 siblings, 2 replies; 52+ messages in thread From: Jens Axboe @ 2020-12-19 23:13 UTC (permalink / raw) To: Josef; +Cc: Norman Maurer, Dmitry Kadashev, io-uring On 12/19/20 2:54 PM, Jens Axboe wrote: > On 12/19/20 1:51 PM, Josef wrote: >>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >>> file descriptor. You probably don't want/mean to do that as it's >>> pollable, I guess it's done because you just set it on all reads for the >>> test? >> >> yes exactly, eventfd fd is blocking, so it actually makes no sense to >> use IOSQE_ASYNC > > Right, and it's pollable too. > >> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work >> in my tests, thanks a lot :) >> >>> In any case, it should of course work. This is the leftover trace when >>> we should be exiting, but an io-wq worker is still trying to get data >>> from the eventfd: >> >> interesting, btw what kind of tool do you use for kernel debugging? > > Just poking at it and thinking about it, no hidden magic I'm afraid... Josef, can you try with this added? Looks bigger than it is, most of it is just moving one function below another. diff --git a/fs/io_uring.c b/fs/io_uring.c index f3690dfdd564..96f6445ab827 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8735,10 +8735,43 @@ static void io_cancel_defer_files(struct io_ring_ctx *ctx, } } +static void __io_uring_cancel_task_requests(struct io_ring_ctx *ctx, + struct task_struct *task) +{ + while (1) { + struct io_task_cancel cancel = { .task = task, .files = NULL, }; + enum io_wq_cancel cret; + bool ret = false; + + cret = io_wq_cancel_cb(ctx->io_wq, io_cancel_task_cb, &cancel, true); + if (cret != IO_WQ_CANCEL_NOTFOUND) + ret = true; + + /* SQPOLL thread does its own polling */ + if (!(ctx->flags & IORING_SETUP_SQPOLL)) { + while (!list_empty_careful(&ctx->iopoll_list)) { + io_iopoll_try_reap_events(ctx); + ret = true; + } + } + + ret |= io_poll_remove_all(ctx, task, NULL); + ret |= io_kill_timeouts(ctx, task, NULL); + if (!ret) + break; + io_run_task_work(); + cond_resched(); + } +} + static void io_uring_cancel_files(struct io_ring_ctx *ctx, struct task_struct *task, struct files_struct *files) { + /* files == NULL, task is exiting. Cancel all that match task */ + if (!files) + __io_uring_cancel_task_requests(ctx, task); + while (!list_empty_careful(&ctx->inflight_list)) { struct io_task_cancel cancel = { .task = task, .files = files }; struct io_kiocb *req; @@ -8772,35 +8805,6 @@ static void io_uring_cancel_files(struct io_ring_ctx *ctx, } } -static void __io_uring_cancel_task_requests(struct io_ring_ctx *ctx, - struct task_struct *task) -{ - while (1) { - struct io_task_cancel cancel = { .task = task, .files = NULL, }; - enum io_wq_cancel cret; - bool ret = false; - - cret = io_wq_cancel_cb(ctx->io_wq, io_cancel_task_cb, &cancel, true); - if (cret != IO_WQ_CANCEL_NOTFOUND) - ret = true; - - /* SQPOLL thread does its own polling */ - if (!(ctx->flags & IORING_SETUP_SQPOLL)) { - while (!list_empty_careful(&ctx->iopoll_list)) { - io_iopoll_try_reap_events(ctx); - ret = true; - } - } - - ret |= io_poll_remove_all(ctx, task, NULL); - ret |= io_kill_timeouts(ctx, task, NULL); - if (!ret) - break; - io_run_task_work(); - cond_resched(); - } -} - /* * We need to iteratively cancel requests, in case a request has dependent * hard links. 
These persist even for failure of cancelations, hence keep -- Jens Axboe ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 23:13 ` Jens Axboe @ 2020-12-19 23:42 ` Josef 2020-12-19 23:42 ` Pavel Begunkov 1 sibling, 0 replies; 52+ messages in thread From: Josef @ 2020-12-19 23:42 UTC (permalink / raw) To: Jens Axboe; +Cc: Norman Maurer, Dmitry Kadashev, io-uring > Josef, can you try with this added? Looks bigger than it is, most of it > is just moving one function below another. yeah sure, sorry stupid question which branch is the patch based on? (last commit?) -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 23:13 ` Jens Axboe 2020-12-19 23:42 ` Josef @ 2020-12-19 23:42 ` Pavel Begunkov 2020-12-20 0:25 ` Jens Axboe 1 sibling, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-19 23:42 UTC (permalink / raw) To: Jens Axboe, Josef; +Cc: Norman Maurer, Dmitry Kadashev, io-uring On 19/12/2020 23:13, Jens Axboe wrote: > On 12/19/20 2:54 PM, Jens Axboe wrote: >> On 12/19/20 1:51 PM, Josef wrote: >>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >>>> file descriptor. You probably don't want/mean to do that as it's >>>> pollable, I guess it's done because you just set it on all reads for the >>>> test? >>> >>> yes exactly, eventfd fd is blocking, so it actually makes no sense to >>> use IOSQE_ASYNC >> >> Right, and it's pollable too. >> >>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work >>> in my tests, thanks a lot :) >>> >>>> In any case, it should of course work. This is the leftover trace when >>>> we should be exiting, but an io-wq worker is still trying to get data >>>> from the eventfd: >>> >>> interesting, btw what kind of tool do you use for kernel debugging? >> >> Just poking at it and thinking about it, no hidden magic I'm afraid... > > Josef, can you try with this added? Looks bigger than it is, most of it > is just moving one function below another. Hmm, which kernel revision are you poking? Seems it doesn't match io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with NULL files. if (!files) __io_uring_cancel_task_requests(ctx, task); else io_uring_cancel_files(ctx, task, files); > diff --git a/fs/io_uring.c b/fs/io_uring.c > index f3690dfdd564..96f6445ab827 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -8735,10 +8735,43 @@ static void io_cancel_defer_files(struct io_ring_ctx *ctx, [...] > static void io_uring_cancel_files(struct io_ring_ctx *ctx, > struct task_struct *task, > struct files_struct *files) > { > + /* files == NULL, task is exiting. Cancel all that match task */ > + if (!files) > + __io_uring_cancel_task_requests(ctx, task); > + For 5.11 I believe it should look like diff --git a/fs/io_uring.c b/fs/io_uring.c index f3690dfdd564..38fb351cc1dd 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8822,9 +8822,8 @@ static void io_uring_cancel_task_requests(struct io_ring_ctx *ctx, io_cqring_overflow_flush(ctx, true, task, files); io_ring_submit_unlock(ctx, (ctx->flags & IORING_SETUP_IOPOLL)); - if (!files) - __io_uring_cancel_task_requests(ctx, task); - else + __io_uring_cancel_task_requests(ctx, task); + if (files) io_uring_cancel_files(ctx, task, files); if ((ctx->flags & IORING_SETUP_SQPOLL) && ctx->sq_data) { -- Pavel Begunkov ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 23:42 ` Pavel Begunkov @ 2020-12-20 0:25 ` Jens Axboe 2020-12-20 0:55 ` Pavel Begunkov 2020-12-20 1:57 ` Pavel Begunkov 0 siblings, 2 replies; 52+ messages in thread From: Jens Axboe @ 2020-12-20 0:25 UTC (permalink / raw) To: Pavel Begunkov, Josef; +Cc: Norman Maurer, Dmitry Kadashev, io-uring On 12/19/20 4:42 PM, Pavel Begunkov wrote: > On 19/12/2020 23:13, Jens Axboe wrote: >> On 12/19/20 2:54 PM, Jens Axboe wrote: >>> On 12/19/20 1:51 PM, Josef wrote: >>>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >>>>> file descriptor. You probably don't want/mean to do that as it's >>>>> pollable, I guess it's done because you just set it on all reads for the >>>>> test? >>>> >>>> yes exactly, eventfd fd is blocking, so it actually makes no sense to >>>> use IOSQE_ASYNC >>> >>> Right, and it's pollable too. >>> >>>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work >>>> in my tests, thanks a lot :) >>>> >>>>> In any case, it should of course work. This is the leftover trace when >>>>> we should be exiting, but an io-wq worker is still trying to get data >>>>> from the eventfd: >>>> >>>> interesting, btw what kind of tool do you use for kernel debugging? >>> >>> Just poking at it and thinking about it, no hidden magic I'm afraid... >> >> Josef, can you try with this added? Looks bigger than it is, most of it >> is just moving one function below another. > > Hmm, which kernel revision are you poking? Seems it doesn't match > io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with > NULL files. > > if (!files) > __io_uring_cancel_task_requests(ctx, task); > else > io_uring_cancel_files(ctx, task, files); Yeah, I think I messed up. If files == NULL, then the task is going away. So we should cancel all requests that match 'task', not just ones that match task && files. Not sure I have much more time to look into this before next week, but something like that. The problem case is the async worker being queued, long before the task is killed and the contexts go away. But from exit_files(), we're only concerned with canceling if we have inflight. Doesn't look right to me. -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 0:25 ` Jens Axboe @ 2020-12-20 0:55 ` Pavel Begunkov 2020-12-21 10:35 ` Dmitry Kadashev 2020-12-20 1:57 ` Pavel Begunkov 1 sibling, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-20 0:55 UTC (permalink / raw) To: Jens Axboe, Josef; +Cc: Norman Maurer, Dmitry Kadashev, io-uring On 20/12/2020 00:25, Jens Axboe wrote: > On 12/19/20 4:42 PM, Pavel Begunkov wrote: >> On 19/12/2020 23:13, Jens Axboe wrote: >>> On 12/19/20 2:54 PM, Jens Axboe wrote: >>>> On 12/19/20 1:51 PM, Josef wrote: >>>>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >>>>>> file descriptor. You probably don't want/mean to do that as it's >>>>>> pollable, I guess it's done because you just set it on all reads for the >>>>>> test? >>>>> >>>>> yes exactly, eventfd fd is blocking, so it actually makes no sense to >>>>> use IOSQE_ASYNC >>>> >>>> Right, and it's pollable too. >>>> >>>>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work >>>>> in my tests, thanks a lot :) >>>>> >>>>>> In any case, it should of course work. This is the leftover trace when >>>>>> we should be exiting, but an io-wq worker is still trying to get data >>>>>> from the eventfd: >>>>> >>>>> interesting, btw what kind of tool do you use for kernel debugging? >>>> >>>> Just poking at it and thinking about it, no hidden magic I'm afraid... >>> >>> Josef, can you try with this added? Looks bigger than it is, most of it >>> is just moving one function below another. >> >> Hmm, which kernel revision are you poking? Seems it doesn't match >> io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with >> NULL files. >> >> if (!files) >> __io_uring_cancel_task_requests(ctx, task); >> else >> io_uring_cancel_files(ctx, task, files); > > Yeah, I think I messed up. If files == NULL, then the task is going away. > So we should cancel all requests that match 'task', not just ones that > match task && files. > > Not sure I have much more time to look into this before next week, but > something like that. > > The problem case is the async worker being queued, long before the task > is killed and the contexts go away. But from exit_files(), we're only > concerned with canceling if we have inflight. Doesn't look right to me. In theory all that should be killed in io_ring_ctx_wait_and_kill(), of course that's if the ring itself is closed. Guys, do you share rings between processes? Explicitly like sending io_uring fd over a socket, or implicitly e.g. sharing fd tables (threads), or cloning with copying fd tables (and so taking a ref to a ring). In other words, if you kill all your io_uring applications, does it go back to normal? -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 0:55 ` Pavel Begunkov @ 2020-12-21 10:35 ` Dmitry Kadashev 2020-12-21 10:49 ` Dmitry Kadashev 2020-12-21 11:00 ` Dmitry Kadashev 0 siblings, 2 replies; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-21 10:35 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Sun, Dec 20, 2020 at 7:59 AM Pavel Begunkov <[email protected]> wrote: > > On 20/12/2020 00:25, Jens Axboe wrote: > > On 12/19/20 4:42 PM, Pavel Begunkov wrote: > >> On 19/12/2020 23:13, Jens Axboe wrote: > >>> On 12/19/20 2:54 PM, Jens Axboe wrote: > >>>> On 12/19/20 1:51 PM, Josef wrote: > >>>>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd > >>>>>> file descriptor. You probably don't want/mean to do that as it's > >>>>>> pollable, I guess it's done because you just set it on all reads for the > >>>>>> test? > >>>>> > >>>>> yes exactly, eventfd fd is blocking, so it actually makes no sense to > >>>>> use IOSQE_ASYNC > >>>> > >>>> Right, and it's pollable too. > >>>> > >>>>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work > >>>>> in my tests, thanks a lot :) > >>>>> > >>>>>> In any case, it should of course work. This is the leftover trace when > >>>>>> we should be exiting, but an io-wq worker is still trying to get data > >>>>>> from the eventfd: > >>>>> > >>>>> interesting, btw what kind of tool do you use for kernel debugging? > >>>> > >>>> Just poking at it and thinking about it, no hidden magic I'm afraid... > >>> > >>> Josef, can you try with this added? Looks bigger than it is, most of it > >>> is just moving one function below another. > >> > >> Hmm, which kernel revision are you poking? Seems it doesn't match > >> io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with > >> NULL files. > >> > >> if (!files) > >> __io_uring_cancel_task_requests(ctx, task); > >> else > >> io_uring_cancel_files(ctx, task, files); > > > > Yeah, I think I messed up. If files == NULL, then the task is going away. > > So we should cancel all requests that match 'task', not just ones that > > match task && files. > > > > Not sure I have much more time to look into this before next week, but > > something like that. > > > > The problem case is the async worker being queued, long before the task > > is killed and the contexts go away. But from exit_files(), we're only > > concerned with canceling if we have inflight. Doesn't look right to me. > > In theory all that should be killed in io_ring_ctx_wait_and_kill(), > of course that's if the ring itself is closed. > > Guys, do you share rings between processes? Explicitly like sending > io_uring fd over a socket, or implicitly e.g. sharing fd tables > (threads), or cloning with copying fd tables (and so taking a ref > to a ring). We do not share rings between processes. Our rings are accessible from different threads (under locks), but nothing fancy. > In other words, if you kill all your io_uring applications, does it > go back to normal? I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an affected box and double check just in case. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-21 10:35 ` Dmitry Kadashev @ 2020-12-21 10:49 ` Dmitry Kadashev 2020-12-21 11:00 ` Dmitry Kadashev 1 sibling, 0 replies; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-21 10:49 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Mon, Dec 21, 2020 at 5:35 PM Dmitry Kadashev <[email protected]> wrote: > > On Sun, Dec 20, 2020 at 7:59 AM Pavel Begunkov <[email protected]> wrote: > > > > On 20/12/2020 00:25, Jens Axboe wrote: > > > On 12/19/20 4:42 PM, Pavel Begunkov wrote: > > >> On 19/12/2020 23:13, Jens Axboe wrote: > > >>> On 12/19/20 2:54 PM, Jens Axboe wrote: > > >>>> On 12/19/20 1:51 PM, Josef wrote: > > >>>>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd > > >>>>>> file descriptor. You probably don't want/mean to do that as it's > > >>>>>> pollable, I guess it's done because you just set it on all reads for the > > >>>>>> test? > > >>>>> > > >>>>> yes exactly, eventfd fd is blocking, so it actually makes no sense to > > >>>>> use IOSQE_ASYNC > > >>>> > > >>>> Right, and it's pollable too. > > >>>> > > >>>>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work > > >>>>> in my tests, thanks a lot :) > > >>>>> > > >>>>>> In any case, it should of course work. This is the leftover trace when > > >>>>>> we should be exiting, but an io-wq worker is still trying to get data > > >>>>>> from the eventfd: > > >>>>> > > >>>>> interesting, btw what kind of tool do you use for kernel debugging? > > >>>> > > >>>> Just poking at it and thinking about it, no hidden magic I'm afraid... > > >>> > > >>> Josef, can you try with this added? Looks bigger than it is, most of it > > >>> is just moving one function below another. > > >> > > >> Hmm, which kernel revision are you poking? Seems it doesn't match > > >> io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with > > >> NULL files. > > >> > > >> if (!files) > > >> __io_uring_cancel_task_requests(ctx, task); > > >> else > > >> io_uring_cancel_files(ctx, task, files); > > > > > > Yeah, I think I messed up. If files == NULL, then the task is going away. > > > So we should cancel all requests that match 'task', not just ones that > > > match task && files. > > > > > > Not sure I have much more time to look into this before next week, but > > > something like that. > > > > > > The problem case is the async worker being queued, long before the task > > > is killed and the contexts go away. But from exit_files(), we're only > > > concerned with canceling if we have inflight. Doesn't look right to me. > > > > In theory all that should be killed in io_ring_ctx_wait_and_kill(), > > of course that's if the ring itself is closed. > > > > Guys, do you share rings between processes? Explicitly like sending > > io_uring fd over a socket, or implicitly e.g. sharing fd tables > > (threads), or cloning with copying fd tables (and so taking a ref > > to a ring). > > We do not share rings between processes. Our rings are accessible from different > threads (under locks), but nothing fancy. Actually, I'm wrong about the locks part, forgot how it works. In our case it works like this: a parent thread creates a ring, and passes it to a worker thread, which does all of the work with it, no locks are involved. On (clean) termination the parent notifies the worker, waits for it to exit and then calls io_uring_queue_exit. Not sure if that counts as sharing rings between the threads or not. 
As I've mentioned in some other email, I'll try (again) to make a reproducer. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-21 10:35 ` Dmitry Kadashev 2020-12-21 10:49 ` Dmitry Kadashev @ 2020-12-21 11:00 ` Dmitry Kadashev 2020-12-21 15:36 ` Pavel Begunkov 2020-12-22 3:35 ` Pavel Begunkov 1 sibling, 2 replies; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-21 11:00 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Mon, Dec 21, 2020 at 5:35 PM Dmitry Kadashev <[email protected]> wrote: > > On Sun, Dec 20, 2020 at 7:59 AM Pavel Begunkov <[email protected]> wrote: > > > > On 20/12/2020 00:25, Jens Axboe wrote: > > > On 12/19/20 4:42 PM, Pavel Begunkov wrote: > > >> On 19/12/2020 23:13, Jens Axboe wrote: > > >>> On 12/19/20 2:54 PM, Jens Axboe wrote: > > >>>> On 12/19/20 1:51 PM, Josef wrote: > > >>>>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd > > >>>>>> file descriptor. You probably don't want/mean to do that as it's > > >>>>>> pollable, I guess it's done because you just set it on all reads for the > > >>>>>> test? > > >>>>> > > >>>>> yes exactly, eventfd fd is blocking, so it actually makes no sense to > > >>>>> use IOSQE_ASYNC > > >>>> > > >>>> Right, and it's pollable too. > > >>>> > > >>>>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work > > >>>>> in my tests, thanks a lot :) > > >>>>> > > >>>>>> In any case, it should of course work. This is the leftover trace when > > >>>>>> we should be exiting, but an io-wq worker is still trying to get data > > >>>>>> from the eventfd: > > >>>>> > > >>>>> interesting, btw what kind of tool do you use for kernel debugging? > > >>>> > > >>>> Just poking at it and thinking about it, no hidden magic I'm afraid... > > >>> > > >>> Josef, can you try with this added? Looks bigger than it is, most of it > > >>> is just moving one function below another. > > >> > > >> Hmm, which kernel revision are you poking? Seems it doesn't match > > >> io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with > > >> NULL files. > > >> > > >> if (!files) > > >> __io_uring_cancel_task_requests(ctx, task); > > >> else > > >> io_uring_cancel_files(ctx, task, files); > > > > > > Yeah, I think I messed up. If files == NULL, then the task is going away. > > > So we should cancel all requests that match 'task', not just ones that > > > match task && files. > > > > > > Not sure I have much more time to look into this before next week, but > > > something like that. > > > > > > The problem case is the async worker being queued, long before the task > > > is killed and the contexts go away. But from exit_files(), we're only > > > concerned with canceling if we have inflight. Doesn't look right to me. > > > > In theory all that should be killed in io_ring_ctx_wait_and_kill(), > > of course that's if the ring itself is closed. > > > > Guys, do you share rings between processes? Explicitly like sending > > io_uring fd over a socket, or implicitly e.g. sharing fd tables > > (threads), or cloning with copying fd tables (and so taking a ref > > to a ring). > > We do not share rings between processes. Our rings are accessible from different > threads (under locks), but nothing fancy. > > > In other words, if you kill all your io_uring applications, does it > > go back to normal? > > I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an > affected box and double check just in case. So, I've just tried stopping everything that uses io-uring. No io_wq* processes remained: $ ps ax | grep wq 9 ? 
I< 0:00 [mm_percpu_wq] 243 ? I< 0:00 [tpm_dev_wq] 246 ? I< 0:00 [devfreq_wq] 27922 pts/4 S+ 0:00 grep --colour=auto wq $ But not a single ring (with size 1024) can be created afterwards anyway. Apparently the problem netty hit and this one are different? -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-21 11:00 ` Dmitry Kadashev @ 2020-12-21 15:36 ` Pavel Begunkov 2020-12-22 3:35 ` Pavel Begunkov 1 sibling, 0 replies; 52+ messages in thread From: Pavel Begunkov @ 2020-12-21 15:36 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On 21/12/2020 11:00, Dmitry Kadashev wrote: > On Mon, Dec 21, 2020 at 5:35 PM Dmitry Kadashev <[email protected]> wrote: >> >> On Sun, Dec 20, 2020 at 7:59 AM Pavel Begunkov <[email protected]> wrote: >>> >>> On 20/12/2020 00:25, Jens Axboe wrote: >>>> On 12/19/20 4:42 PM, Pavel Begunkov wrote: >>>>> On 19/12/2020 23:13, Jens Axboe wrote: >>>>>> On 12/19/20 2:54 PM, Jens Axboe wrote: >>>>>>> On 12/19/20 1:51 PM, Josef wrote: >>>>>>>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >>>>>>>>> file descriptor. You probably don't want/mean to do that as it's >>>>>>>>> pollable, I guess it's done because you just set it on all reads for the >>>>>>>>> test? >>>>>>>> >>>>>>>> yes exactly, eventfd fd is blocking, so it actually makes no sense to >>>>>>>> use IOSQE_ASYNC >>>>>>> >>>>>>> Right, and it's pollable too. >>>>>>> >>>>>>>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work >>>>>>>> in my tests, thanks a lot :) >>>>>>>> >>>>>>>>> In any case, it should of course work. This is the leftover trace when >>>>>>>>> we should be exiting, but an io-wq worker is still trying to get data >>>>>>>>> from the eventfd: >>>>>>>> >>>>>>>> interesting, btw what kind of tool do you use for kernel debugging? >>>>>>> >>>>>>> Just poking at it and thinking about it, no hidden magic I'm afraid... >>>>>> >>>>>> Josef, can you try with this added? Looks bigger than it is, most of it >>>>>> is just moving one function below another. >>>>> >>>>> Hmm, which kernel revision are you poking? Seems it doesn't match >>>>> io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with >>>>> NULL files. >>>>> >>>>> if (!files) >>>>> __io_uring_cancel_task_requests(ctx, task); >>>>> else >>>>> io_uring_cancel_files(ctx, task, files); >>>> >>>> Yeah, I think I messed up. If files == NULL, then the task is going away. >>>> So we should cancel all requests that match 'task', not just ones that >>>> match task && files. >>>> >>>> Not sure I have much more time to look into this before next week, but >>>> something like that. >>>> >>>> The problem case is the async worker being queued, long before the task >>>> is killed and the contexts go away. But from exit_files(), we're only >>>> concerned with canceling if we have inflight. Doesn't look right to me. >>> >>> In theory all that should be killed in io_ring_ctx_wait_and_kill(), >>> of course that's if the ring itself is closed. >>> >>> Guys, do you share rings between processes? Explicitly like sending >>> io_uring fd over a socket, or implicitly e.g. sharing fd tables >>> (threads), or cloning with copying fd tables (and so taking a ref >>> to a ring). >> >> We do not share rings between processes. Our rings are accessible from different >> threads (under locks), but nothing fancy. >> >>> In other words, if you kill all your io_uring applications, does it >>> go back to normal? >> >> I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an >> affected box and double check just in case. > > So, I've just tried stopping everything that uses io-uring. No io_wq* processes > remained: > > $ ps ax | grep wq > 9 ? I< 0:00 [mm_percpu_wq] > 243 ? 
I< 0:00 [tpm_dev_wq] > 246 ? I< 0:00 [devfreq_wq] > 27922 pts/4 S+ 0:00 grep --colour=auto wq > $ > > But not a single ring (with size 1024) can be created afterwards anyway. > > Apparently the problem netty hit and this one are different? Yep, looks like it -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-21 11:00 ` Dmitry Kadashev 2020-12-21 15:36 ` Pavel Begunkov @ 2020-12-22 3:35 ` Pavel Begunkov 2020-12-22 4:07 ` Pavel Begunkov 1 sibling, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-22 3:35 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On 21/12/2020 11:00, Dmitry Kadashev wrote: [snip] >> We do not share rings between processes. Our rings are accessible from different >> threads (under locks), but nothing fancy. >> >>> In other words, if you kill all your io_uring applications, does it >>> go back to normal? >> >> I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an >> affected box and double check just in case. I can't spot any misaccounting, but I wonder if it can be that your memory is getting fragmented enough to be unable make an allocation of 16 __contiguous__ pages, i.e. sizeof(sqe) * 1024 That's how it's allocated internally: static void *io_mem_alloc(size_t size) { gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | __GFP_NORETRY; return (void *) __get_free_pages(gfp_flags, get_order(size)); } What about smaller rings? Can you check io_uring of what SQ size it can allocate? That can be a different program, e.g. modify a bit liburing/test/nop. Also, can you allocate it if you switch a user (preferably to non-root) after it happens? > > So, I've just tried stopping everything that uses io-uring. No io_wq* processes > remained: > > $ ps ax | grep wq > 9 ? I< 0:00 [mm_percpu_wq] > 243 ? I< 0:00 [tpm_dev_wq] > 246 ? I< 0:00 [devfreq_wq] > 27922 pts/4 S+ 0:00 grep --colour=auto wq > $ > > But not a single ring (with size 1024) can be created afterwards anyway. > > Apparently the problem netty hit and this one are different? -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-22 3:35 ` Pavel Begunkov @ 2020-12-22 4:07 ` Pavel Begunkov 2020-12-22 11:04 ` Dmitry Kadashev 0 siblings, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-22 4:07 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On 22/12/2020 03:35, Pavel Begunkov wrote: > On 21/12/2020 11:00, Dmitry Kadashev wrote: > [snip] >>> We do not share rings between processes. Our rings are accessible from different >>> threads (under locks), but nothing fancy. >>> >>>> In other words, if you kill all your io_uring applications, does it >>>> go back to normal? >>> >>> I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an >>> affected box and double check just in case. > > I can't spot any misaccounting, but I wonder if it can be that your memory is > getting fragmented enough to be unable make an allocation of 16 __contiguous__ > pages, i.e. sizeof(sqe) * 1024 > > That's how it's allocated internally: > > static void *io_mem_alloc(size_t size) > { > gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | > __GFP_NORETRY; > > return (void *) __get_free_pages(gfp_flags, get_order(size)); > } > > What about smaller rings? Can you check io_uring of what SQ size it can allocate? > That can be a different program, e.g. modify a bit liburing/test/nop. Even better to allocate N smaller rings, where N = 1024 / SQ_size static int try_size(int sq_size) { int ret = 0, i, n = 1024 / sq_size; static struct io_uring rings[128]; for (i = 0; i < n; ++i) { if (io_uring_queue_init(sq_size, &rings[i], 0) < 0) { ret = -1; break; } } for (i -= 1; i >= 0; i--) io_uring_queue_exit(&rings[i]); return ret; } int main() { int size; for (size = 1024; size >= 2; size /= 2) { if (!try_size(size)) { printf("max size %i\n", size); return 0; } } printf("can't allocate %i\n", size); return 0; } > Also, can you allocate it if you switch a user (preferably to non-root) after it > happens? > >> >> So, I've just tried stopping everything that uses io-uring. No io_wq* processes >> remained: >> >> $ ps ax | grep wq >> 9 ? I< 0:00 [mm_percpu_wq] >> 243 ? I< 0:00 [tpm_dev_wq] >> 246 ? I< 0:00 [devfreq_wq] >> 27922 pts/4 S+ 0:00 grep --colour=auto wq >> $ >> >> But not a single ring (with size 1024) can be created afterwards anyway. >> >> Apparently the problem netty hit and this one are different? > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-22 4:07 ` Pavel Begunkov @ 2020-12-22 11:04 ` Dmitry Kadashev 2020-12-22 11:06 ` Dmitry Kadashev 2020-12-22 16:33 ` Pavel Begunkov 0 siblings, 2 replies; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-22 11:04 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Tue, Dec 22, 2020 at 11:11 AM Pavel Begunkov <[email protected]> wrote: > > On 22/12/2020 03:35, Pavel Begunkov wrote: > > On 21/12/2020 11:00, Dmitry Kadashev wrote: > > [snip] > >>> We do not share rings between processes. Our rings are accessible from different > >>> threads (under locks), but nothing fancy. > >>> > >>>> In other words, if you kill all your io_uring applications, does it > >>>> go back to normal? > >>> > >>> I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an > >>> affected box and double check just in case. > > > > I can't spot any misaccounting, but I wonder if it can be that your memory is > > getting fragmented enough to be unable make an allocation of 16 __contiguous__ > > pages, i.e. sizeof(sqe) * 1024 > > > > That's how it's allocated internally: > > > > static void *io_mem_alloc(size_t size) > > { > > gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | > > __GFP_NORETRY; > > > > return (void *) __get_free_pages(gfp_flags, get_order(size)); > > } > > > > What about smaller rings? Can you check io_uring of what SQ size it can allocate? > > That can be a different program, e.g. modify a bit liburing/test/nop. > > Even better to allocate N smaller rings, where N = 1024 / SQ_size > > static int try_size(int sq_size) > { > int ret = 0, i, n = 1024 / sq_size; > static struct io_uring rings[128]; > > for (i = 0; i < n; ++i) { > if (io_uring_queue_init(sq_size, &rings[i], 0) < 0) { > ret = -1; > break; > } > } > for (i -= 1; i >= 0; i--) > io_uring_queue_exit(&rings[i]); > return ret; > } > > int main() > { > int size; > > for (size = 1024; size >= 2; size /= 2) { > if (!try_size(size)) { > printf("max size %i\n", size); > return 0; > } > } > > printf("can't allocate %i\n", size); > return 0; > } Unfortunately I've rebooted the box I've used for tests yesterday, so I can't try this there. Also I was not able to come up with an isolated reproducer for this yet. The good news is I've found a relatively easy way to provoke this on a test VM using our software. Our app runs with "admin" user perms (plus some capabilities), it bumps RLIMIT_MEMLOCK to infinity on start. I've also created an user called 'ioutest' to run the check for ring sizes using a different user. I've modified the test program slightly, to show the number of rings successfully created on each iteration and the actual error message (to debug a problem I was having with it, but I've kept this after that). 
Here is the output: # sudo -u admin bash -c 'ulimit -a' | grep locked max locked memory (kbytes, -l) 1024 # sudo -u ioutest bash -c 'ulimit -a' | grep locked max locked memory (kbytes, -l) 1024 # sudo -u admin ./iou-test1 Failed after 0 rings with 1024 size: Cannot allocate memory Failed after 0 rings with 512 size: Cannot allocate memory Failed after 0 rings with 256 size: Cannot allocate memory Failed after 0 rings with 128 size: Cannot allocate memory Failed after 0 rings with 64 size: Cannot allocate memory Failed after 0 rings with 32 size: Cannot allocate memory Failed after 0 rings with 16 size: Cannot allocate memory Failed after 0 rings with 8 size: Cannot allocate memory Failed after 0 rings with 4 size: Cannot allocate memory Failed after 0 rings with 2 size: Cannot allocate memory can't allocate 1 # sudo -u ioutest ./iou-test1 max size 1024 # ps ax | grep wq 8 ? I< 0:00 [mm_percpu_wq] 121 ? I< 0:00 [tpm_dev_wq] 124 ? I< 0:00 [devfreq_wq] 20593 pts/1 S+ 0:00 grep --color=auto wq -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
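The modified iou-test1 that produced the output above is not posted in the thread; a sketch of what an equivalent probe might look like, based on Pavel's version plus the described changes (report how many rings were created before the failure and print the errno string), assuming liburing and built with something like "gcc -o iou-test1 iou-test1.c -luring", purely illustrative:

#include <stdio.h>
#include <string.h>
#include <liburing.h>

static int try_size(int sq_size)
{
        /* 512 slots covers the worst case of 1024 / 2 rings */
        static struct io_uring rings[512];
        int ret = 0, i, n = 1024 / sq_size;

        for (i = 0; i < n; ++i) {
                ret = io_uring_queue_init(sq_size, &rings[i], 0);
                if (ret < 0) {
                        printf("Failed after %d rings with %d size: %s\n",
                               i, sq_size, strerror(-ret));
                        break;
                }
        }
        for (i -= 1; i >= 0; i--)
                io_uring_queue_exit(&rings[i]);
        return ret;
}

int main(void)
{
        int size;

        for (size = 1024; size >= 2; size /= 2) {
                if (try_size(size) >= 0) {
                        printf("max size %d\n", size);
                        return 0;
                }
        }
        printf("can't allocate %d\n", size);
        return 0;
}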
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-22 11:04 ` Dmitry Kadashev @ 2020-12-22 11:06 ` Dmitry Kadashev 2020-12-22 13:13 ` Dmitry Kadashev 2020-12-22 16:33 ` Pavel Begunkov 1 sibling, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-22 11:06 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Tue, Dec 22, 2020 at 6:04 PM Dmitry Kadashev <[email protected]> wrote: > > On Tue, Dec 22, 2020 at 11:11 AM Pavel Begunkov <[email protected]> wrote: > > > > On 22/12/2020 03:35, Pavel Begunkov wrote: > > > On 21/12/2020 11:00, Dmitry Kadashev wrote: > > > [snip] > > >>> We do not share rings between processes. Our rings are accessible from different > > >>> threads (under locks), but nothing fancy. > > >>> > > >>>> In other words, if you kill all your io_uring applications, does it > > >>>> go back to normal? > > >>> > > >>> I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an > > >>> affected box and double check just in case. > > > > > > I can't spot any misaccounting, but I wonder if it can be that your memory is > > > getting fragmented enough to be unable make an allocation of 16 __contiguous__ > > > pages, i.e. sizeof(sqe) * 1024 > > > > > > That's how it's allocated internally: > > > > > > static void *io_mem_alloc(size_t size) > > > { > > > gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | > > > __GFP_NORETRY; > > > > > > return (void *) __get_free_pages(gfp_flags, get_order(size)); > > > } > > > > > > What about smaller rings? Can you check io_uring of what SQ size it can allocate? > > > That can be a different program, e.g. modify a bit liburing/test/nop. > > > > Even better to allocate N smaller rings, where N = 1024 / SQ_size > > > > static int try_size(int sq_size) > > { > > int ret = 0, i, n = 1024 / sq_size; > > static struct io_uring rings[128]; > > > > for (i = 0; i < n; ++i) { > > if (io_uring_queue_init(sq_size, &rings[i], 0) < 0) { > > ret = -1; > > break; > > } > > } > > for (i -= 1; i >= 0; i--) > > io_uring_queue_exit(&rings[i]); > > return ret; > > } > > > > int main() > > { > > int size; > > > > for (size = 1024; size >= 2; size /= 2) { > > if (!try_size(size)) { > > printf("max size %i\n", size); > > return 0; > > } > > } > > > > printf("can't allocate %i\n", size); > > return 0; > > } > > Unfortunately I've rebooted the box I've used for tests yesterday, so I can't > try this there. Also I was not able to come up with an isolated reproducer for > this yet. > > The good news is I've found a relatively easy way to provoke this on a test VM > using our software. Our app runs with "admin" user perms (plus some > capabilities), it bumps RLIMIT_MEMLOCK to infinity on start. I've also created > an user called 'ioutest' to run the check for ring sizes using a different user. > > I've modified the test program slightly, to show the number of rings > successfully > created on each iteration and the actual error message (to debug a problem I was > having with it, but I've kept this after that). 
Here is the output: > > # sudo -u admin bash -c 'ulimit -a' | grep locked > max locked memory (kbytes, -l) 1024 > > # sudo -u ioutest bash -c 'ulimit -a' | grep locked > max locked memory (kbytes, -l) 1024 > > # sudo -u admin ./iou-test1 > Failed after 0 rings with 1024 size: Cannot allocate memory > Failed after 0 rings with 512 size: Cannot allocate memory > Failed after 0 rings with 256 size: Cannot allocate memory > Failed after 0 rings with 128 size: Cannot allocate memory > Failed after 0 rings with 64 size: Cannot allocate memory > Failed after 0 rings with 32 size: Cannot allocate memory > Failed after 0 rings with 16 size: Cannot allocate memory > Failed after 0 rings with 8 size: Cannot allocate memory > Failed after 0 rings with 4 size: Cannot allocate memory > Failed after 0 rings with 2 size: Cannot allocate memory > can't allocate 1 > > # sudo -u ioutest ./iou-test1 > max size 1024 > > # ps ax | grep wq > 8 ? I< 0:00 [mm_percpu_wq] > 121 ? I< 0:00 [tpm_dev_wq] > 124 ? I< 0:00 [devfreq_wq] > 20593 pts/1 S+ 0:00 grep --color=auto wq This was on kernel 5.6.7, I'm going to try this on 5.10.1 now. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-22 11:06 ` Dmitry Kadashev @ 2020-12-22 13:13 ` Dmitry Kadashev 0 siblings, 0 replies; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-22 13:13 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Tue, Dec 22, 2020 at 6:06 PM Dmitry Kadashev <[email protected]> wrote: > > On Tue, Dec 22, 2020 at 6:04 PM Dmitry Kadashev <[email protected]> wrote: > > > > On Tue, Dec 22, 2020 at 11:11 AM Pavel Begunkov <[email protected]> wrote: > > > > > > On 22/12/2020 03:35, Pavel Begunkov wrote: > > > > On 21/12/2020 11:00, Dmitry Kadashev wrote: > > > > [snip] > > > >>> We do not share rings between processes. Our rings are accessible from different > > > >>> threads (under locks), but nothing fancy. > > > >>> > > > >>>> In other words, if you kill all your io_uring applications, does it > > > >>>> go back to normal? > > > >>> > > > >>> I'm pretty sure it does not, the only fix is to reboot the box. But I'll find an > > > >>> affected box and double check just in case. > > > > > > > > I can't spot any misaccounting, but I wonder if it can be that your memory is > > > > getting fragmented enough to be unable make an allocation of 16 __contiguous__ > > > > pages, i.e. sizeof(sqe) * 1024 > > > > > > > > That's how it's allocated internally: > > > > > > > > static void *io_mem_alloc(size_t size) > > > > { > > > > gfp_t gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP | > > > > __GFP_NORETRY; > > > > > > > > return (void *) __get_free_pages(gfp_flags, get_order(size)); > > > > } > > > > > > > > What about smaller rings? Can you check io_uring of what SQ size it can allocate? > > > > That can be a different program, e.g. modify a bit liburing/test/nop. > > > > > > Even better to allocate N smaller rings, where N = 1024 / SQ_size > > > > > > static int try_size(int sq_size) > > > { > > > int ret = 0, i, n = 1024 / sq_size; > > > static struct io_uring rings[128]; > > > > > > for (i = 0; i < n; ++i) { > > > if (io_uring_queue_init(sq_size, &rings[i], 0) < 0) { > > > ret = -1; > > > break; > > > } > > > } > > > for (i -= 1; i >= 0; i--) > > > io_uring_queue_exit(&rings[i]); > > > return ret; > > > } > > > > > > int main() > > > { > > > int size; > > > > > > for (size = 1024; size >= 2; size /= 2) { > > > if (!try_size(size)) { > > > printf("max size %i\n", size); > > > return 0; > > > } > > > } > > > > > > printf("can't allocate %i\n", size); > > > return 0; > > > } > > > > Unfortunately I've rebooted the box I've used for tests yesterday, so I can't > > try this there. Also I was not able to come up with an isolated reproducer for > > this yet. > > > > The good news is I've found a relatively easy way to provoke this on a test VM > > using our software. Our app runs with "admin" user perms (plus some > > capabilities), it bumps RLIMIT_MEMLOCK to infinity on start. I've also created > > an user called 'ioutest' to run the check for ring sizes using a different user. > > > > I've modified the test program slightly, to show the number of rings > > successfully > > created on each iteration and the actual error message (to debug a problem I was > > having with it, but I've kept this after that). 
Here is the output: > > > > # sudo -u admin bash -c 'ulimit -a' | grep locked > > max locked memory (kbytes, -l) 1024 > > > > # sudo -u ioutest bash -c 'ulimit -a' | grep locked > > max locked memory (kbytes, -l) 1024 > > > > # sudo -u admin ./iou-test1 > > Failed after 0 rings with 1024 size: Cannot allocate memory > > Failed after 0 rings with 512 size: Cannot allocate memory > > Failed after 0 rings with 256 size: Cannot allocate memory > > Failed after 0 rings with 128 size: Cannot allocate memory > > Failed after 0 rings with 64 size: Cannot allocate memory > > Failed after 0 rings with 32 size: Cannot allocate memory > > Failed after 0 rings with 16 size: Cannot allocate memory > > Failed after 0 rings with 8 size: Cannot allocate memory > > Failed after 0 rings with 4 size: Cannot allocate memory > > Failed after 0 rings with 2 size: Cannot allocate memory > > can't allocate 1 > > > > # sudo -u ioutest ./iou-test1 > > max size 1024 > > > > # ps ax | grep wq > > 8 ? I< 0:00 [mm_percpu_wq] > > 121 ? I< 0:00 [tpm_dev_wq] > > 124 ? I< 0:00 [devfreq_wq] > > 20593 pts/1 S+ 0:00 grep --color=auto wq > > This was on kernel 5.6.7, I'm going to try this on 5.10.1 now. Curious. It seems to be much harder to reproduce on 5.9 and 5.10. I'm 100% sure it still happens on 5.9 though, since it did happen on production quite a few times. But the way I've used to reproduce it on 5.6 worked two times there, and quite quickly. And with 5.9 and 5.10 the same approach does not seem to be working. I'll give it some more time and also will keep trying to come up with a synthetic reproducer. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-22 11:04 ` Dmitry Kadashev 2020-12-22 11:06 ` Dmitry Kadashev @ 2020-12-22 16:33 ` Pavel Begunkov 2020-12-23 8:39 ` Dmitry Kadashev 1 sibling, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-22 16:33 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On 22/12/2020 11:04, Dmitry Kadashev wrote: > On Tue, Dec 22, 2020 at 11:11 AM Pavel Begunkov <[email protected]> wrote: [...] >>> What about smaller rings? Can you check io_uring of what SQ size it can allocate? >>> That can be a different program, e.g. modify a bit liburing/test/nop. > Unfortunately I've rebooted the box I've used for tests yesterday, so I can't > try this there. Also I was not able to come up with an isolated reproducer for > this yet. > > The good news is I've found a relatively easy way to provoke this on a test VM > using our software. Our app runs with "admin" user perms (plus some > capabilities), it bumps RLIMIT_MEMLOCK to infinity on start. I've also created > an user called 'ioutest' to run the check for ring sizes using a different user. > > I've modified the test program slightly, to show the number of rings > successfully > created on each iteration and the actual error message (to debug a problem I was > having with it, but I've kept this after that). Here is the output: > > # sudo -u admin bash -c 'ulimit -a' | grep locked > max locked memory (kbytes, -l) 1024 > > # sudo -u ioutest bash -c 'ulimit -a' | grep locked > max locked memory (kbytes, -l) 1024 > > # sudo -u admin ./iou-test1 > Failed after 0 rings with 1024 size: Cannot allocate memory > Failed after 0 rings with 512 size: Cannot allocate memory > Failed after 0 rings with 256 size: Cannot allocate memory > Failed after 0 rings with 128 size: Cannot allocate memory > Failed after 0 rings with 64 size: Cannot allocate memory > Failed after 0 rings with 32 size: Cannot allocate memory > Failed after 0 rings with 16 size: Cannot allocate memory > Failed after 0 rings with 8 size: Cannot allocate memory > Failed after 0 rings with 4 size: Cannot allocate memory > Failed after 0 rings with 2 size: Cannot allocate memory > can't allocate 1 > > # sudo -u ioutest ./iou-test1 > max size 1024 Then we screw that specific user. Interestingly, if it has CAP_IPC_LOCK capability we don't even account locked memory. btw, do you use registered buffers? > > # ps ax | grep wq > 8 ? I< 0:00 [mm_percpu_wq] > 121 ? I< 0:00 [tpm_dev_wq] > 124 ? I< 0:00 [devfreq_wq] > 20593 pts/1 S+ 0:00 grep --color=auto wq > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-22 16:33 ` Pavel Begunkov @ 2020-12-23 8:39 ` Dmitry Kadashev 2020-12-23 9:38 ` Dmitry Kadashev 0 siblings, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-23 8:39 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Tue, Dec 22, 2020 at 11:37 PM Pavel Begunkov <[email protected]> wrote: > > On 22/12/2020 11:04, Dmitry Kadashev wrote: > > On Tue, Dec 22, 2020 at 11:11 AM Pavel Begunkov <[email protected]> wrote: > [...] > >>> What about smaller rings? Can you check io_uring of what SQ size it can allocate? > >>> That can be a different program, e.g. modify a bit liburing/test/nop. > > Unfortunately I've rebooted the box I've used for tests yesterday, so I can't > > try this there. Also I was not able to come up with an isolated reproducer for > > this yet. > > > > The good news is I've found a relatively easy way to provoke this on a test VM > > using our software. Our app runs with "admin" user perms (plus some > > capabilities), it bumps RLIMIT_MEMLOCK to infinity on start. I've also created > > an user called 'ioutest' to run the check for ring sizes using a different user. > > > > I've modified the test program slightly, to show the number of rings > > successfully > > created on each iteration and the actual error message (to debug a problem I was > > having with it, but I've kept this after that). Here is the output: > > > > # sudo -u admin bash -c 'ulimit -a' | grep locked > > max locked memory (kbytes, -l) 1024 > > > > # sudo -u ioutest bash -c 'ulimit -a' | grep locked > > max locked memory (kbytes, -l) 1024 > > > > # sudo -u admin ./iou-test1 > > Failed after 0 rings with 1024 size: Cannot allocate memory > > Failed after 0 rings with 512 size: Cannot allocate memory > > Failed after 0 rings with 256 size: Cannot allocate memory > > Failed after 0 rings with 128 size: Cannot allocate memory > > Failed after 0 rings with 64 size: Cannot allocate memory > > Failed after 0 rings with 32 size: Cannot allocate memory > > Failed after 0 rings with 16 size: Cannot allocate memory > > Failed after 0 rings with 8 size: Cannot allocate memory > > Failed after 0 rings with 4 size: Cannot allocate memory > > Failed after 0 rings with 2 size: Cannot allocate memory > > can't allocate 1 > > > > # sudo -u ioutest ./iou-test1 > > max size 1024 > > Then we screw that specific user. Interestingly, if it has CAP_IPC_LOCK > capability we don't even account locked memory. We do have some capabilities, but not CAP_IPC_LOCK. Ours are: CAP_NET_ADMIN, CAP_NET_BIND_SERVICE, CAP_SYS_RESOURCE, CAP_KILL, CAP_DAC_READ_SEARCH. The latter was necessary for integration with some third-party thing that we do not really use anymore, so we can try building without it, but it'd require some time, mostly because I'm not sure how quickly I'd be able to provoke the issue. > btw, do you use registered buffers? No, we do not use neither registered buffers nor registered files (nor anything else). Also, I just tried the test program on a real box (this time one instance of our program is still running - can repeat the check with it dead, but I expect the results to be pretty much the same, at least after a few more restarts). This box runs 5.9.5. 
# sudo -u admin bash -c 'ulimit -l' 1024 # sudo -u admin ./iou-test1 Failed after 0 rings with 1024 size: Cannot allocate memory Failed after 0 rings with 512 size: Cannot allocate memory Failed after 0 rings with 256 size: Cannot allocate memory Failed after 0 rings with 128 size: Cannot allocate memory Failed after 0 rings with 64 size: Cannot allocate memory Failed after 0 rings with 32 size: Cannot allocate memory Failed after 0 rings with 16 size: Cannot allocate memory Failed after 0 rings with 8 size: Cannot allocate memory Failed after 0 rings with 4 size: Cannot allocate memory Failed after 0 rings with 2 size: Cannot allocate memory can't allocate 1 # sudo -u dmitry bash -c 'ulimit -l' 1024 # sudo -u dmitry ./iou-test1 max size 1024 -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-23 8:39 ` Dmitry Kadashev @ 2020-12-23 9:38 ` Dmitry Kadashev 2020-12-23 11:48 ` Dmitry Kadashev 0 siblings, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-23 9:38 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Wed, Dec 23, 2020 at 3:39 PM Dmitry Kadashev <[email protected]> wrote: > > On Tue, Dec 22, 2020 at 11:37 PM Pavel Begunkov <[email protected]> wrote: > > > > On 22/12/2020 11:04, Dmitry Kadashev wrote: > > > On Tue, Dec 22, 2020 at 11:11 AM Pavel Begunkov <[email protected]> wrote: > > [...] > > >>> What about smaller rings? Can you check io_uring of what SQ size it can allocate? > > >>> That can be a different program, e.g. modify a bit liburing/test/nop. > > > Unfortunately I've rebooted the box I've used for tests yesterday, so I can't > > > try this there. Also I was not able to come up with an isolated reproducer for > > > this yet. > > > > > > The good news is I've found a relatively easy way to provoke this on a test VM > > > using our software. Our app runs with "admin" user perms (plus some > > > capabilities), it bumps RLIMIT_MEMLOCK to infinity on start. I've also created > > > an user called 'ioutest' to run the check for ring sizes using a different user. > > > > > > I've modified the test program slightly, to show the number of rings > > > successfully > > > created on each iteration and the actual error message (to debug a problem I was > > > having with it, but I've kept this after that). Here is the output: > > > > > > # sudo -u admin bash -c 'ulimit -a' | grep locked > > > max locked memory (kbytes, -l) 1024 > > > > > > # sudo -u ioutest bash -c 'ulimit -a' | grep locked > > > max locked memory (kbytes, -l) 1024 > > > > > > # sudo -u admin ./iou-test1 > > > Failed after 0 rings with 1024 size: Cannot allocate memory > > > Failed after 0 rings with 512 size: Cannot allocate memory > > > Failed after 0 rings with 256 size: Cannot allocate memory > > > Failed after 0 rings with 128 size: Cannot allocate memory > > > Failed after 0 rings with 64 size: Cannot allocate memory > > > Failed after 0 rings with 32 size: Cannot allocate memory > > > Failed after 0 rings with 16 size: Cannot allocate memory > > > Failed after 0 rings with 8 size: Cannot allocate memory > > > Failed after 0 rings with 4 size: Cannot allocate memory > > > Failed after 0 rings with 2 size: Cannot allocate memory > > > can't allocate 1 > > > > > > # sudo -u ioutest ./iou-test1 > > > max size 1024 > > > > Then we screw that specific user. Interestingly, if it has CAP_IPC_LOCK > > capability we don't even account locked memory. > > We do have some capabilities, but not CAP_IPC_LOCK. Ours are: > > CAP_NET_ADMIN, CAP_NET_BIND_SERVICE, CAP_SYS_RESOURCE, CAP_KILL, > CAP_DAC_READ_SEARCH. > > The latter was necessary for integration with some third-party thing that we do > not really use anymore, so we can try building without it, but it'd require some > time, mostly because I'm not sure how quickly I'd be able to provoke the issue. > > > btw, do you use registered buffers? > > No, we do not use neither registered buffers nor registered files (nor anything > else). > > Also, I just tried the test program on a real box (this time one instance of our > program is still running - can repeat the check with it dead, but I expect the > results to be pretty much the same, at least after a few more restarts). This > box runs 5.9.5. 
> > # sudo -u admin bash -c 'ulimit -l' > 1024 > > # sudo -u admin ./iou-test1 > Failed after 0 rings with 1024 size: Cannot allocate memory > Failed after 0 rings with 512 size: Cannot allocate memory > Failed after 0 rings with 256 size: Cannot allocate memory > Failed after 0 rings with 128 size: Cannot allocate memory > Failed after 0 rings with 64 size: Cannot allocate memory > Failed after 0 rings with 32 size: Cannot allocate memory > Failed after 0 rings with 16 size: Cannot allocate memory > Failed after 0 rings with 8 size: Cannot allocate memory > Failed after 0 rings with 4 size: Cannot allocate memory > Failed after 0 rings with 2 size: Cannot allocate memory > can't allocate 1 > > # sudo -u dmitry bash -c 'ulimit -l' > 1024 > > # sudo -u dmitry ./iou-test1 > max size 1024 Please ignore the results from the real box above (5.9.5). The memlock limit interfered with this, since our app was running in the background and it had a few rings running (most failed to be created, but not all). I'll try to make it fully stuck and repeat the test with the app dead. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-23 9:38 ` Dmitry Kadashev @ 2020-12-23 11:48 ` Dmitry Kadashev 2020-12-23 12:27 ` Pavel Begunkov 0 siblings, 1 reply; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-23 11:48 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On Wed, Dec 23, 2020 at 4:38 PM Dmitry Kadashev <[email protected]> wrote: > > On Wed, Dec 23, 2020 at 3:39 PM Dmitry Kadashev <[email protected]> wrote: > > > > On Tue, Dec 22, 2020 at 11:37 PM Pavel Begunkov <[email protected]> wrote: > > > > > > On 22/12/2020 11:04, Dmitry Kadashev wrote: > > > > On Tue, Dec 22, 2020 at 11:11 AM Pavel Begunkov <[email protected]> wrote: > > > [...] > > > >>> What about smaller rings? Can you check io_uring of what SQ size it can allocate? > > > >>> That can be a different program, e.g. modify a bit liburing/test/nop. > > > > Unfortunately I've rebooted the box I've used for tests yesterday, so I can't > > > > try this there. Also I was not able to come up with an isolated reproducer for > > > > this yet. > > > > > > > > The good news is I've found a relatively easy way to provoke this on a test VM > > > > using our software. Our app runs with "admin" user perms (plus some > > > > capabilities), it bumps RLIMIT_MEMLOCK to infinity on start. I've also created > > > > an user called 'ioutest' to run the check for ring sizes using a different user. > > > > > > > > I've modified the test program slightly, to show the number of rings > > > > successfully > > > > created on each iteration and the actual error message (to debug a problem I was > > > > having with it, but I've kept this after that). Here is the output: > > > > > > > > # sudo -u admin bash -c 'ulimit -a' | grep locked > > > > max locked memory (kbytes, -l) 1024 > > > > > > > > # sudo -u ioutest bash -c 'ulimit -a' | grep locked > > > > max locked memory (kbytes, -l) 1024 > > > > > > > > # sudo -u admin ./iou-test1 > > > > Failed after 0 rings with 1024 size: Cannot allocate memory > > > > Failed after 0 rings with 512 size: Cannot allocate memory > > > > Failed after 0 rings with 256 size: Cannot allocate memory > > > > Failed after 0 rings with 128 size: Cannot allocate memory > > > > Failed after 0 rings with 64 size: Cannot allocate memory > > > > Failed after 0 rings with 32 size: Cannot allocate memory > > > > Failed after 0 rings with 16 size: Cannot allocate memory > > > > Failed after 0 rings with 8 size: Cannot allocate memory > > > > Failed after 0 rings with 4 size: Cannot allocate memory > > > > Failed after 0 rings with 2 size: Cannot allocate memory > > > > can't allocate 1 > > > > > > > > # sudo -u ioutest ./iou-test1 > > > > max size 1024 > > > > > > Then we screw that specific user. Interestingly, if it has CAP_IPC_LOCK > > > capability we don't even account locked memory. > > > > We do have some capabilities, but not CAP_IPC_LOCK. Ours are: > > > > CAP_NET_ADMIN, CAP_NET_BIND_SERVICE, CAP_SYS_RESOURCE, CAP_KILL, > > CAP_DAC_READ_SEARCH. > > > > The latter was necessary for integration with some third-party thing that we do > > not really use anymore, so we can try building without it, but it'd require some > > time, mostly because I'm not sure how quickly I'd be able to provoke the issue. > > > > > btw, do you use registered buffers? > > > > No, we do not use neither registered buffers nor registered files (nor anything > > else). 
> > > > Also, I just tried the test program on a real box (this time one instance of our > > program is still running - can repeat the check with it dead, but I expect the > > results to be pretty much the same, at least after a few more restarts). This > > box runs 5.9.5. > > > > # sudo -u admin bash -c 'ulimit -l' > > 1024 > > > > # sudo -u admin ./iou-test1 > > Failed after 0 rings with 1024 size: Cannot allocate memory > > Failed after 0 rings with 512 size: Cannot allocate memory > > Failed after 0 rings with 256 size: Cannot allocate memory > > Failed after 0 rings with 128 size: Cannot allocate memory > > Failed after 0 rings with 64 size: Cannot allocate memory > > Failed after 0 rings with 32 size: Cannot allocate memory > > Failed after 0 rings with 16 size: Cannot allocate memory > > Failed after 0 rings with 8 size: Cannot allocate memory > > Failed after 0 rings with 4 size: Cannot allocate memory > > Failed after 0 rings with 2 size: Cannot allocate memory > > can't allocate 1 > > > > # sudo -u dmitry bash -c 'ulimit -l' > > 1024 > > > > # sudo -u dmitry ./iou-test1 > > max size 1024 > > Please ignore the results from the real box above (5.9.5). The memlock limit > interfered with this, since our app was running in the background and it had a > few rings running (most failed to be created, but not all). I'll try to make it > fully stuck and repeat the test with the app dead. I've experimented with the 5.9 live boxes that were showing signs of the problem a bit more, and I'm not entirely sure they get stuck until reboot anymore. I'm pretty sure it is the case with 5.6, but probably a bug was fixed since then - the fact that 5.8 in particular had quite a few fixes that seemed relevant is the reason we've tried 5.9 in the first place. And on 5.9 we might be seeing fragmentation issues indeed. I shouldn't have been mixing my kernel versions :) Also, I did not realize a ring of size=1024 requires 16 contiguous pages. We will experiment and observe a bit more, and meanwhile let's consider the case closed. If the issue surfaces again I'll update this thread. Thanks a *lot* Pavel for helping to debug this issue. And sorry for the false alarm / noise everyone. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread
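On the fragmentation theory: whether a box still has order-4 (16 contiguous pages) blocks free is visible in /proc/buddyinfo, where each column is the count of free blocks of 2^order pages per zone. A zero or near-zero count at order 4 and above in the Normal zone would line up with the ENOMEM seen here. A rough sketch that sums those columns (plain "cat /proc/buddyinfo" shows the same data; this is just an illustration, not something used in the thread):

#include <stdio.h>

int main(void)
{
        char node[16], zone[16];
        FILE *f = fopen("/proc/buddyinfo", "r");

        if (!f) {
                perror("/proc/buddyinfo");
                return 1;
        }
        /* Each line: "Node N, zone NAME c0 c1 c2 ..." where cK is the number
         * of free blocks of 2^K contiguous pages in that zone. */
        while (fscanf(f, " Node %15[^,], zone %15s", node, zone) == 2) {
                long count, order4_plus = 0;
                int order = 0;

                while (fscanf(f, "%ld", &count) == 1) {
                        if (order >= 4) /* blocks big enough for a 1024-entry SQ */
                                order4_plus += count;
                        order++;
                }
                printf("node %s, zone %-8s: %ld free blocks of order >= 4\n",
                       node, zone, order4_plus);
        }
        fclose(f);
        return 0;
}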
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-23 11:48 ` Dmitry Kadashev @ 2020-12-23 12:27 ` Pavel Begunkov 0 siblings, 0 replies; 52+ messages in thread From: Pavel Begunkov @ 2020-12-23 12:27 UTC (permalink / raw) To: Dmitry Kadashev; +Cc: Jens Axboe, Josef, Norman Maurer, io-uring On 23/12/2020 11:48, Dmitry Kadashev wrote: > On Wed, Dec 23, 2020 at 4:38 PM Dmitry Kadashev <[email protected]> wrote: >> Please ignore the results from the real box above (5.9.5). The memlock limit >> interfered with this, since our app was running in the background and it had a >> few rings running (most failed to be created, but not all). I'll try to make it >> fully stuck and repeat the test with the app dead. > > I've experimented with the 5.9 live boxes that were showing signs of the problem > a bit more, and I'm not entirely sure they get stuck until reboot anymore. > > I'm pretty sure it is the case with 5.6, but probably a bug was fixed since > then - the fact that 5.8 in particular had quite a few fixes that seemed > relevant is the reason we've tried 5.9 in the first place. > > And on 5.9 we might be seeing fragmentation issues indeed. I shouldn't have been > mixing my kernel versions :) Also, I did not realize a ring of size=1024 > requires 16 contiguous pages. We will experiment and observe a bit more, and > meanwhile let's consider the case closed. If the issue surfaces again I'll > update this thread. If fragmentation is to blame, it's still a problem. Let us know if you find out anything. And thanks for keeping debugging -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 0:25 ` Jens Axboe 2020-12-20 0:55 ` Pavel Begunkov @ 2020-12-20 1:57 ` Pavel Begunkov 2020-12-20 7:13 ` Josef 1 sibling, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-20 1:57 UTC (permalink / raw) To: Jens Axboe, Josef; +Cc: Norman Maurer, Dmitry Kadashev, io-uring On 20/12/2020 00:25, Jens Axboe wrote: > On 12/19/20 4:42 PM, Pavel Begunkov wrote: >> On 19/12/2020 23:13, Jens Axboe wrote: >>> On 12/19/20 2:54 PM, Jens Axboe wrote: >>>> On 12/19/20 1:51 PM, Josef wrote: >>>>>> And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd >>>>>> file descriptor. You probably don't want/mean to do that as it's >>>>>> pollable, I guess it's done because you just set it on all reads for the >>>>>> test? >>>>> >>>>> yes exactly, eventfd fd is blocking, so it actually makes no sense to >>>>> use IOSQE_ASYNC >>>> >>>> Right, and it's pollable too. >>>> >>>>> I just tested eventfd without the IOSQE_ASYNC flag, it seems to work >>>>> in my tests, thanks a lot :) >>>>> >>>>>> In any case, it should of course work. This is the leftover trace when >>>>>> we should be exiting, but an io-wq worker is still trying to get data >>>>>> from the eventfd: >>>>> >>>>> interesting, btw what kind of tool do you use for kernel debugging? >>>> >>>> Just poking at it and thinking about it, no hidden magic I'm afraid... >>> >>> Josef, can you try with this added? Looks bigger than it is, most of it >>> is just moving one function below another. >> >> Hmm, which kernel revision are you poking? Seems it doesn't match >> io_uring-5.10, and for 5.11 io_uring_cancel_files() is never called with >> NULL files. >> >> if (!files) >> __io_uring_cancel_task_requests(ctx, task); >> else >> io_uring_cancel_files(ctx, task, files); > > Yeah, I think I messed up. If files == NULL, then the task is going away. > So we should cancel all requests that match 'task', not just ones that > match task && files. > > Not sure I have much more time to look into this before next week, but > something like that. > > The problem case is the async worker being queued, long before the task > is killed and the contexts go away. But from exit_files(), we're only > concerned with canceling if we have inflight. Doesn't look right to me. Josef, can you test the patch below instead? Following Jens' idea it cancels more aggressively when a task is killed or exits. It's based on [1] but would probably apply fine to for-next. 
[1] git://git.kernel.dk/linux-block branch io_uring-5.11, commit dd20166236953c8cd14f4c668bf972af32f0c6be diff --git a/fs/io_uring.c b/fs/io_uring.c index f3690dfdd564..3a98e6dd71c0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8919,8 +8919,6 @@ void __io_uring_files_cancel(struct files_struct *files) struct io_ring_ctx *ctx = file->private_data; io_uring_cancel_task_requests(ctx, files); - if (files) - io_uring_del_task_file(file); } atomic_dec(&tctx->in_idle); @@ -8960,6 +8958,8 @@ static s64 tctx_inflight(struct io_uring_task *tctx) void __io_uring_task_cancel(void) { struct io_uring_task *tctx = current->io_uring; + struct file *file; + unsigned long index; DEFINE_WAIT(wait); s64 inflight; @@ -8986,6 +8986,9 @@ void __io_uring_task_cancel(void) finish_wait(&tctx->wait, &wait); atomic_dec(&tctx->in_idle); + + xa_for_each(&tctx->xa, index, file) + io_uring_del_task_file(file); } static int io_uring_flush(struct file *file, void *data) diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index 35b2d845704d..54925c74aa88 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -48,7 +48,7 @@ static inline void io_uring_task_cancel(void) static inline void io_uring_files_cancel(struct files_struct *files) { if (current->io_uring && !xa_empty(¤t->io_uring->xa)) - __io_uring_files_cancel(files); + __io_uring_task_cancel(); } static inline void io_uring_free(struct task_struct *tsk) { -- Pavel Begunkov ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 1:57 ` Pavel Begunkov @ 2020-12-20 7:13 ` Josef 2020-12-20 13:00 ` Pavel Begunkov 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-20 7:13 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring > Guys, do you share rings between processes? Explicitly like sending > io_uring fd over a socket, or implicitly e.g. sharing fd tables > (threads), or cloning with copying fd tables (and so taking a ref > to a ring). no in netty we don't share ring between processes > In other words, if you kill all your io_uring applications, does it > go back to normal? no at all, the io-wq worker thread is still running, I literally have to restart the vm to go back to normal(as far as I know is not possible to kill kernel threads right?) > Josef, can you test the patch below instead? Following Jens' idea it > cancels more aggressively when a task is killed or exits. It's based > on [1] but would probably apply fine to for-next. it works, I run several tests with eventfd read op async flag enabled, thanks a lot :) you are awesome guys :) -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 7:13 ` Josef @ 2020-12-20 13:00 ` Pavel Begunkov 2020-12-20 14:19 ` Pavel Begunkov 2020-12-20 16:14 ` Jens Axboe 0 siblings, 2 replies; 52+ messages in thread From: Pavel Begunkov @ 2020-12-20 13:00 UTC (permalink / raw) To: Josef; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring On 20/12/2020 07:13, Josef wrote: >> Guys, do you share rings between processes? Explicitly like sending >> io_uring fd over a socket, or implicitly e.g. sharing fd tables >> (threads), or cloning with copying fd tables (and so taking a ref >> to a ring). > > no in netty we don't share ring between processes > >> In other words, if you kill all your io_uring applications, does it >> go back to normal? > > no at all, the io-wq worker thread is still running, I literally have > to restart the vm to go back to normal(as far as I know is not > possible to kill kernel threads right?) > >> Josef, can you test the patch below instead? Following Jens' idea it >> cancels more aggressively when a task is killed or exits. It's based >> on [1] but would probably apply fine to for-next. > > it works, I run several tests with eventfd read op async flag enabled, > thanks a lot :) you are awesome guys :) Thanks for testing and confirming! Either we forgot something in io_ring_ctx_wait_and_kill() and it just can't cancel some requests, or we have a dependency that prevents release from happening. BTW, apparently that patch causes hangs for unrelated but known reasons, so better to not use it, we'll merge something more stable. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 13:00 ` Pavel Begunkov @ 2020-12-20 14:19 ` Pavel Begunkov 2020-12-20 15:56 ` Josef 2020-12-20 16:14 ` Jens Axboe 1 sibling, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-20 14:19 UTC (permalink / raw) To: Josef; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring On 20/12/2020 13:00, Pavel Begunkov wrote: > On 20/12/2020 07:13, Josef wrote: >>> Guys, do you share rings between processes? Explicitly like sending >>> io_uring fd over a socket, or implicitly e.g. sharing fd tables >>> (threads), or cloning with copying fd tables (and so taking a ref >>> to a ring). >> >> no in netty we don't share ring between processes >> >>> In other words, if you kill all your io_uring applications, does it >>> go back to normal? >> >> no at all, the io-wq worker thread is still running, I literally have >> to restart the vm to go back to normal(as far as I know is not >> possible to kill kernel threads right?) >> >>> Josef, can you test the patch below instead? Following Jens' idea it >>> cancels more aggressively when a task is killed or exits. It's based >>> on [1] but would probably apply fine to for-next. >> >> it works, I run several tests with eventfd read op async flag enabled, >> thanks a lot :) you are awesome guys :) > > Thanks for testing and confirming! Either we forgot something in > io_ring_ctx_wait_and_kill() and it just can't cancel some requests, > or we have a dependency that prevents release from happening. > > BTW, apparently that patch causes hangs for unrelated but known > reasons, so better to not use it, we'll merge something more stable. I'd really appreciate if you can try one more. I want to know why the final cleanup doesn't cope with it. diff --git a/fs/io_uring.c b/fs/io_uring.c index 941fe9b64fd9..d38fc819648e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8614,6 +8614,10 @@ static int io_remove_personalities(int id, void *p, void *data) return 0; } +static void io_cancel_defer_files(struct io_ring_ctx *ctx, + struct task_struct *task, + struct files_struct *files); + static void io_ring_exit_work(struct work_struct *work) { struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, @@ -8627,6 +8631,8 @@ static void io_ring_exit_work(struct work_struct *work) */ do { io_iopoll_try_reap_events(ctx); + io_poll_remove_all(ctx, NULL, NULL); + io_kill_timeouts(ctx, NULL, NULL); } while (!wait_for_completion_timeout(&ctx->ref_comp, HZ/20)); io_ring_ctx_free(ctx); } @@ -8641,6 +8647,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) io_cqring_overflow_flush(ctx, true, NULL, NULL); mutex_unlock(&ctx->uring_lock); + io_cancel_defer_files(ctx, NULL, NULL); io_kill_timeouts(ctx, NULL, NULL); io_poll_remove_all(ctx, NULL, NULL); -- Pavel Begunkov ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 14:19 ` Pavel Begunkov @ 2020-12-20 15:56 ` Josef 2020-12-20 15:58 ` Pavel Begunkov 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-20 15:56 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring > I'd really appreciate if you can try one more. I want to know why > the final cleanup doesn't cope with it. yeah sure, which kernel version? it seems to be that this patch doesn't match io_uring-5.11 and io_uring-5.10 On Sun, 20 Dec 2020 at 15:22, Pavel Begunkov <[email protected]> wrote: > > On 20/12/2020 13:00, Pavel Begunkov wrote: > > On 20/12/2020 07:13, Josef wrote: > >>> Guys, do you share rings between processes? Explicitly like sending > >>> io_uring fd over a socket, or implicitly e.g. sharing fd tables > >>> (threads), or cloning with copying fd tables (and so taking a ref > >>> to a ring). > >> > >> no in netty we don't share ring between processes > >> > >>> In other words, if you kill all your io_uring applications, does it > >>> go back to normal? > >> > >> no at all, the io-wq worker thread is still running, I literally have > >> to restart the vm to go back to normal(as far as I know is not > >> possible to kill kernel threads right?) > >> > >>> Josef, can you test the patch below instead? Following Jens' idea it > >>> cancels more aggressively when a task is killed or exits. It's based > >>> on [1] but would probably apply fine to for-next. > >> > >> it works, I run several tests with eventfd read op async flag enabled, > >> thanks a lot :) you are awesome guys :) > > > > Thanks for testing and confirming! Either we forgot something in > > io_ring_ctx_wait_and_kill() and it just can't cancel some requests, > > or we have a dependency that prevents release from happening. > > > > BTW, apparently that patch causes hangs for unrelated but known > > reasons, so better to not use it, we'll merge something more stable. > > I'd really appreciate if you can try one more. I want to know why > the final cleanup doesn't cope with it. > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index 941fe9b64fd9..d38fc819648e 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -8614,6 +8614,10 @@ static int io_remove_personalities(int id, void *p, void *data) > return 0; > } > > +static void io_cancel_defer_files(struct io_ring_ctx *ctx, > + struct task_struct *task, > + struct files_struct *files); > + > static void io_ring_exit_work(struct work_struct *work) > { > struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, > @@ -8627,6 +8631,8 @@ static void io_ring_exit_work(struct work_struct *work) > */ > do { > io_iopoll_try_reap_events(ctx); > + io_poll_remove_all(ctx, NULL, NULL); > + io_kill_timeouts(ctx, NULL, NULL); > } while (!wait_for_completion_timeout(&ctx->ref_comp, HZ/20)); > io_ring_ctx_free(ctx); > } > @@ -8641,6 +8647,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) > io_cqring_overflow_flush(ctx, true, NULL, NULL); > mutex_unlock(&ctx->uring_lock); > > + io_cancel_defer_files(ctx, NULL, NULL); > io_kill_timeouts(ctx, NULL, NULL); > io_poll_remove_all(ctx, NULL, NULL); > > -- > Pavel Begunkov -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 15:56 ` Josef @ 2020-12-20 15:58 ` Pavel Begunkov 0 siblings, 0 replies; 52+ messages in thread From: Pavel Begunkov @ 2020-12-20 15:58 UTC (permalink / raw) To: Josef; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring On 20/12/2020 15:56, Josef wrote: >> I'd really appreciate if you can try one more. I want to know why >> the final cleanup doesn't cope with it. > > yeah sure, which kernel version? it seems to be that this patch > doesn't match io_uring-5.11 and io_uring-5.10 It's io_uring-5.11 but I had some patches on top. I regenerated it below for up to date Jens' io_uring-5.11 diff --git a/fs/io_uring.c b/fs/io_uring.c index f3690dfdd564..4e1fb4054516 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8620,6 +8620,10 @@ static int io_remove_personalities(int id, void *p, void *data) return 0; } +static void io_cancel_defer_files(struct io_ring_ctx *ctx, + struct task_struct *task, + struct files_struct *files); + static void io_ring_exit_work(struct work_struct *work) { struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, @@ -8633,6 +8637,8 @@ static void io_ring_exit_work(struct work_struct *work) */ do { io_iopoll_try_reap_events(ctx); + io_poll_remove_all(ctx, NULL, NULL); + io_kill_timeouts(ctx, NULL, NULL); } while (!wait_for_completion_timeout(&ctx->ref_comp, HZ/20)); io_ring_ctx_free(ctx); } @@ -8647,6 +8653,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) io_cqring_overflow_flush(ctx, true, NULL, NULL); mutex_unlock(&ctx->uring_lock); + io_cancel_defer_files(ctx, NULL, NULL); io_kill_timeouts(ctx, NULL, NULL); io_poll_remove_all(ctx, NULL, NULL); -- Pavel Begunkov ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 13:00 ` Pavel Begunkov 2020-12-20 14:19 ` Pavel Begunkov @ 2020-12-20 16:14 ` Jens Axboe 2020-12-20 16:59 ` Josef 1 sibling, 1 reply; 52+ messages in thread From: Jens Axboe @ 2020-12-20 16:14 UTC (permalink / raw) To: Pavel Begunkov, Josef; +Cc: Norman Maurer, Dmitry Kadashev, io-uring On 12/20/20 6:00 AM, Pavel Begunkov wrote: > On 20/12/2020 07:13, Josef wrote: >>> Guys, do you share rings between processes? Explicitly like sending >>> io_uring fd over a socket, or implicitly e.g. sharing fd tables >>> (threads), or cloning with copying fd tables (and so taking a ref >>> to a ring). >> >> no in netty we don't share ring between processes >> >>> In other words, if you kill all your io_uring applications, does it >>> go back to normal? >> >> no at all, the io-wq worker thread is still running, I literally have >> to restart the vm to go back to normal(as far as I know is not >> possible to kill kernel threads right?) >> >>> Josef, can you test the patch below instead? Following Jens' idea it >>> cancels more aggressively when a task is killed or exits. It's based >>> on [1] but would probably apply fine to for-next. >> >> it works, I run several tests with eventfd read op async flag enabled, >> thanks a lot :) you are awesome guys :) > > Thanks for testing and confirming! Either we forgot something in > io_ring_ctx_wait_and_kill() and it just can't cancel some requests, > or we have a dependency that prevents release from happening. Just a guess - Josef, is the eventfd for the ring fd itself? BTW, the io_wq_cancel_all() in io_ring_ctx_wait_and_kill() needs to go. We should just use targeted cancelation - that's cleaner, and the cancel all will impact ATTACH_WQ as well. Separate thing to fix, though. -- Jens Axboe ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 16:14 ` Jens Axboe @ 2020-12-20 16:59 ` Josef 2020-12-20 18:23 ` Josef 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-20 16:59 UTC (permalink / raw) To: Jens Axboe; +Cc: Pavel Begunkov, Norman Maurer, Dmitry Kadashev, io-uring > Just a guess - Josef, is the eventfd for the ring fd itself? yes via eventfd_write we want to wake up/unblock io_uring_enter(IORING_ENTER_GETEVENTS), and the read eventfd event is submitted every time each ring fd in netty has one eventfd On Sun, 20 Dec 2020 at 17:14, Jens Axboe <[email protected]> wrote: > > On 12/20/20 6:00 AM, Pavel Begunkov wrote: > > On 20/12/2020 07:13, Josef wrote: > >>> Guys, do you share rings between processes? Explicitly like sending > >>> io_uring fd over a socket, or implicitly e.g. sharing fd tables > >>> (threads), or cloning with copying fd tables (and so taking a ref > >>> to a ring). > >> > >> no in netty we don't share ring between processes > >> > >>> In other words, if you kill all your io_uring applications, does it > >>> go back to normal? > >> > >> no at all, the io-wq worker thread is still running, I literally have > >> to restart the vm to go back to normal(as far as I know is not > >> possible to kill kernel threads right?) > >> > >>> Josef, can you test the patch below instead? Following Jens' idea it > >>> cancels more aggressively when a task is killed or exits. It's based > >>> on [1] but would probably apply fine to for-next. > >> > >> it works, I run several tests with eventfd read op async flag enabled, > >> thanks a lot :) you are awesome guys :) > > > > Thanks for testing and confirming! Either we forgot something in > > io_ring_ctx_wait_and_kill() and it just can't cancel some requests, > > or we have a dependency that prevents release from happening. > > Just a guess - Josef, is the eventfd for the ring fd itself? > > BTW, the io_wq_cancel_all() in io_ring_ctx_wait_and_kill() needs to go. > We should just use targeted cancelation - that's cleaner, and the > cancel all will impact ATTACH_WQ as well. Separate thing to fix, though. > > -- > Jens Axboe > -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 16:59 ` Josef @ 2020-12-20 18:23 ` Josef 2020-12-20 18:41 ` Pavel Begunkov 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-20 18:23 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring > It's io_uring-5.11 but I had some patches on top. > I regenerated it below for up to date Jens' io_uring-5.11 Pavel I just tested your patch, it works :) On Sun, 20 Dec 2020 at 17:59, Josef <[email protected]> wrote: > > > Just a guess - Josef, is the eventfd for the ring fd itself? > > yes via eventfd_write we want to wake up/unblock > io_uring_enter(IORING_ENTER_GETEVENTS), and the read eventfd event is > submitted every time > each ring fd in netty has one eventfd > > On Sun, 20 Dec 2020 at 17:14, Jens Axboe <[email protected]> wrote: > > > > On 12/20/20 6:00 AM, Pavel Begunkov wrote: > > > On 20/12/2020 07:13, Josef wrote: > > >>> Guys, do you share rings between processes? Explicitly like sending > > >>> io_uring fd over a socket, or implicitly e.g. sharing fd tables > > >>> (threads), or cloning with copying fd tables (and so taking a ref > > >>> to a ring). > > >> > > >> no in netty we don't share ring between processes > > >> > > >>> In other words, if you kill all your io_uring applications, does it > > >>> go back to normal? > > >> > > >> no at all, the io-wq worker thread is still running, I literally have > > >> to restart the vm to go back to normal(as far as I know is not > > >> possible to kill kernel threads right?) > > >> > > >>> Josef, can you test the patch below instead? Following Jens' idea it > > >>> cancels more aggressively when a task is killed or exits. It's based > > >>> on [1] but would probably apply fine to for-next. > > >> > > >> it works, I run several tests with eventfd read op async flag enabled, > > >> thanks a lot :) you are awesome guys :) > > > > > > Thanks for testing and confirming! Either we forgot something in > > > io_ring_ctx_wait_and_kill() and it just can't cancel some requests, > > > or we have a dependency that prevents release from happening. > > > > Just a guess - Josef, is the eventfd for the ring fd itself? > > > > BTW, the io_wq_cancel_all() in io_ring_ctx_wait_and_kill() needs to go. > > We should just use targeted cancelation - that's cleaner, and the > > cancel all will impact ATTACH_WQ as well. Separate thing to fix, though. > > > > -- > > Jens Axboe > > > > > -- > Josef -- Josef ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 18:23 ` Josef @ 2020-12-20 18:41 ` Pavel Begunkov 2020-12-21 8:22 ` Josef 0 siblings, 1 reply; 52+ messages in thread From: Pavel Begunkov @ 2020-12-20 18:41 UTC (permalink / raw) To: Josef; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring On 20/12/2020 18:23, Josef wrote: >> It's io_uring-5.11 but I had some patches on top. >> I regenerated it below for up to date Jens' io_uring-5.11 > > Pavel I just tested your patch, it works :) Interesting, thanks a lot! Not sure how exactly it's related to eventfd, but maybe just because it was dragged through internal polling asynchronously or somewhat like that, and io_ring_ctx_wait_and_kill() haven't found it at first. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-20 18:41 ` Pavel Begunkov @ 2020-12-21 8:22 ` Josef 2020-12-21 15:30 ` Pavel Begunkov 0 siblings, 1 reply; 52+ messages in thread From: Josef @ 2020-12-21 8:22 UTC (permalink / raw) To: Pavel Begunkov; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring Pavel I'm sorry...my kernel build process was wrong...the same kernel patch(the first one) was used...I run different load tests on all 3 patches several times your first patch works great and unfortunately second and third patch doesn't work Here the patch summary: first patch works: [1] git://git.kernel.dk/linux-block branch io_uring-5.11, commit dd20166236953c8cd14f4c668bf972af32f0c6be diff --git a/fs/io_uring.c b/fs/io_uring.c index f3690dfdd564..3a98e6dd71c0 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8919,8 +8919,6 @@ void __io_uring_files_cancel(struct files_struct *files) struct io_ring_ctx *ctx = file->private_data; io_uring_cancel_task_requests(ctx, files); - if (files) - io_uring_del_task_file(file); } atomic_dec(&tctx->in_idle); @@ -8960,6 +8958,8 @@ static s64 tctx_inflight(struct io_uring_task *tctx) void __io_uring_task_cancel(void) { struct io_uring_task *tctx = current->io_uring; + struct file *file; + unsigned long index; DEFINE_WAIT(wait); s64 inflight; @@ -8986,6 +8986,9 @@ void __io_uring_task_cancel(void) finish_wait(&tctx->wait, &wait); atomic_dec(&tctx->in_idle); + + xa_for_each(&tctx->xa, index, file) + io_uring_del_task_file(file); } static int io_uring_flush(struct file *file, void *data) diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index 35b2d845704d..54925c74aa88 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -48,7 +48,7 @@ static inline void io_uring_task_cancel(void) static inline void io_uring_files_cancel(struct files_struct *files) { if (current->io_uring && !xa_empty(¤t->io_uring->xa)) - __io_uring_files_cancel(files); + __io_uring_task_cancel(); } static inline void io_uring_free(struct task_struct *tsk) { second patch: diff --git a/fs/io_uring.c b/fs/io_uring.c index f3690dfdd564..4e1fb4054516 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -8620,6 +8620,10 @@ static int io_remove_personalities(int id, void *p, void *data) return 0; } +static void io_cancel_defer_files(struct io_ring_ctx *ctx, + struct task_struct *task, + struct files_struct *files); + static void io_ring_exit_work(struct work_struct *work) { struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx, @@ -8633,6 +8637,8 @@ static void io_ring_exit_work(struct work_struct *work) */ do { io_iopoll_try_reap_events(ctx); + io_poll_remove_all(ctx, NULL, NULL); + io_kill_timeouts(ctx, NULL, NULL); } while (!wait_for_completion_timeout(&ctx->ref_comp, HZ/20)); io_ring_ctx_free(ctx); } @@ -8647,6 +8653,7 @@ static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) io_cqring_overflow_flush(ctx, true, NULL, NULL); mutex_unlock(&ctx->uring_lock); + io_cancel_defer_files(ctx, NULL, NULL); io_kill_timeouts(ctx, NULL, NULL); io_poll_remove_all(ctx, NULL, NULL); third patch you already sent which is similar to the second one: https://lore.kernel.org/io-uring/[email protected]/T/#t -- Josef ^ permalink raw reply related [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-21 8:22 ` Josef @ 2020-12-21 15:30 ` Pavel Begunkov 0 siblings, 0 replies; 52+ messages in thread From: Pavel Begunkov @ 2020-12-21 15:30 UTC (permalink / raw) To: Josef; +Cc: Jens Axboe, Norman Maurer, Dmitry Kadashev, io-uring On 21/12/2020 08:22, Josef wrote: > Pavel I'm sorry...my kernel build process was wrong...the same kernel > patch(the first one) was used...I run different load tests on all 3 > patches several times No worries, thanks for letting know. At least clears up contradiction of this patch with that it's eventfd related. > your first patch works great and unfortunately second and third patch > doesn't work -- Pavel Begunkov ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: "Cannot allocate memory" on ring creation (not RLIMIT_MEMLOCK) 2020-12-19 17:11 ` Jens Axboe 2020-12-19 17:34 ` Norman Maurer @ 2020-12-21 10:31 ` Dmitry Kadashev 1 sibling, 0 replies; 52+ messages in thread From: Dmitry Kadashev @ 2020-12-21 10:31 UTC (permalink / raw) To: Jens Axboe; +Cc: Josef, io-uring, Norman Maurer On Sun, Dec 20, 2020 at 12:11 AM Jens Axboe <[email protected]> wrote: > > On 12/19/20 9:29 AM, Jens Axboe wrote: > > On 12/19/20 9:13 AM, Jens Axboe wrote: > >> On 12/18/20 7:49 PM, Josef wrote: > >>>> I'm happy to run _any_ reproducer, so please do let us know if you > >>>> manage to find something that I can run with netty. As long as it > >>>> includes instructions for exactly how to run it :-) > >>> > >>> cool :) I just created a repo for that: > >>> https://github.com/1Jo1/netty-io_uring-kernel-debugging.git > >>> > >>> - install jdk 1.8 > >>> - to run netty: ./mvnw compile exec:java > >>> -Dexec.mainClass="uring.netty.example.EchoUringServer" > >>> - to run the echo test: cargo run --release -- --address > >>> "127.0.0.1:2022" --number 200 --duration 20 --length 300 > >>> (https://github.com/haraldh/rust_echo_bench.git) > >>> - process kill -9 > >>> > >>> async flag is enabled and these operation are used: OP_READ, > >>> OP_WRITE, OP_POLL_ADD, OP_CLOSE, OP_ACCEPT > >>> > >>> (btw you can change the port in EchoUringServer.java) > >> > >> This is great! Not sure this is the same issue, but what I see here is > >> that we have leftover workers when the test is killed. This means the > >> rings aren't gone, and the memory isn't freed (and unaccounted), which > >> would ultimately lead to problems of course, similar to just an > >> accounting bug or race. > >> > >> The above _seems_ to be related to IOSQE_ASYNC. Trying to narrow it > >> down... > > > > Further narrowed down, it seems to be related to IOSQE_ASYNC on the > > read requests. I'm guessing there are cases where we end up not > > canceling them on ring close, hence the ring stays active, etc. > > > > If I just add a hack to clear IOSQE_ASYNC on IORING_OP_READ, then > > the test terminates fine on the kill -9. > > And even more so, it's IOSQE_ASYNC on the IORING_OP_READ on an eventfd > file descriptor. In our case - unlike netty - we use io_uring only for disk IO, no eventfd. And we do not use IOSQE_ASYNC (we've tried, but this coincided with some kernel crashes, so we've disabled it for now - not 100% sure if it's related or not yet). I'll try (again) to build a simpler reproducer for our issue, which is probably different from the netty one. -- Dmitry Kadashev ^ permalink raw reply [flat|nested] 52+ messages in thread