* IOSQE_IO_LINK vs. short send of SOCK_STREAM
From: Ming Lei @ 2023-01-11 15:26 UTC
To: io-uring, Pavel Begunkov, Jens Axboe
Cc: ming.lei, Stefan Metzmacher, David Ahern

Hello guys,

As I understand it, a short send on SOCK_STREAM should terminate the
remainder of the SQE chain built with IOSQE_IO_LINK.

But from what I observe, this does not hold when using io_sendmsg or
io_sendmsg_zc on a TCP socket: the rest of the chain can still complete
after a short send occurs. MSG_WAITALL is off.

For SOCK_STREAM, IOSQE_IO_LINK is probably the only way in io_uring to
send data correctly in a batch. However, that depends on the assumption
that a short send terminates the chain.

I would appreciate any comments.

Thanks,
Ming

* Re: IOSQE_IO_LINK vs. short send of SOCK_STREAM
From: Jens Axboe @ 2023-01-11 15:49 UTC
To: Ming Lei, io-uring, Pavel Begunkov
Cc: Stefan Metzmacher, David Ahern

On 1/11/23 8:26 AM, Ming Lei wrote:
> Per my understanding, a short send on SOCK_STREAM should terminate the
> remainder of the SQE chain built by IOSQE_IO_LINK.
>
> But from my observation, this point isn't true when using io_sendmsg or
> io_sendmsg_zc on TCP socket, and the other remainder of the chain still
> can be completed after one short send is found. MSG_WAITALL is off.
>
> For SOCK_STREAM, IOSQE_IO_LINK probably is the only way of io_uring for
> sending data correctly in batch. However, it depends on the assumption
> of chain termination by short send.

That is the intended behavior. Maybe there are some cases where it's not
being set and req_set_fail() is not being called? Do you have a test case
that I can try? If not, it might be easier if you poke at
io_uring/net.c:io_sendmsg(). If we send less than what was asked for and
we don't retry, req_set_fail() should be called.

-- 
Jens Axboe

* Re: IOSQE_IO_LINK vs. short send of SOCK_STREAM
From: Stefan Metzmacher @ 2023-01-11 16:32 UTC
To: Ming Lei, io-uring, Pavel Begunkov, Jens Axboe
Cc: David Ahern

Hi Ming,

> Per my understanding, a short send on SOCK_STREAM should terminate the
> remainder of the SQE chain built by IOSQE_IO_LINK.
>
> But from my observation, this point isn't true when using io_sendmsg or
> io_sendmsg_zc on TCP socket, and the other remainder of the chain still
> can be completed after one short send is found. MSG_WAITALL is off.

This is due to legacy reasons: you need to pass MSG_WAITALL explicitly
in order to get a retry or an error on a short write.
It should work for send, sendmsg, sendmsg_zc, recv and recvmsg.

For recv and recvmsg, MSG_WAITALL also fails the link for MSG_TRUNC and
MSG_CTRUNC.

metze

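A minimal sketch of the pattern Stefan describes; this is not code from the thread. Each send in the chain carries MSG_WAITALL, and every SQE except the last has IOSQE_IO_LINK set. The helper name prep_send_chain() and the plain SQE array are assumptions made for illustration; real code would take each SQE from the ring instead of an array.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: fill n SQEs so that iov[0..n) is sent on sockfd
 * as one linked chain of IORING_OP_SEND requests.
 */
static void prep_send_chain(struct io_uring_sqe *sqes, unsigned n,
                            int sockfd, const struct iovec *iov)
{
    assert(sqes && iov);

    for (unsigned i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = &sqes[i];

        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_SEND;
        sqe->fd = sockfd;
        sqe->addr = (uint64_t)(uintptr_t)iov[i].iov_base;
        sqe->len = (unsigned)iov[i].iov_len;
        /* Without MSG_WAITALL a short send does not fail the link. */
        sqe->msg_flags = MSG_WAITALL;
        sqe->user_data = i;
        /* Link every request to its successor; the last one ends the chain. */
        if (i + 1 < n)
            sqe->flags |= IOSQE_IO_LINK;
    }
}
```

With liburing, the same setup would typically be io_uring_prep_send(sqe, fd, buf, len, MSG_WAITALL) followed by setting IOSQE_IO_LINK in sqe->flags.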
* Re: IOSQE_IO_LINK vs. short send of SOCK_STREAM
From: Jens Axboe @ 2023-01-11 16:36 UTC
To: Stefan Metzmacher, Ming Lei, io-uring, Pavel Begunkov
Cc: David Ahern

On 1/11/23 9:32 AM, Stefan Metzmacher wrote:
> This is due to legacy reasons, you need pass MSG_WAITALL explicitly
> in order to a retry or an error on a short write...
> It should work for send, sendmsg, sendmsg_zc, recv and recvmsg.

Dylan and I were just discussing this OOB and were hoping you'd chime in,
as I had some recollection that you were involved with this one. We
should probably ensure this is adequately documented, as it isn't
immediately obvious that you'd need WAITALL for links to work with
receives.

> For recv and recvmsg MSG_WAITALL also fails the link for MSG_TRUNC and MSG_CTRUNC.
>
> metze

-- 
Jens Axboe

* Re: IOSQE_IO_LINK vs. short send of SOCK_STREAM
From: Ming Lei @ 2023-01-12 3:27 UTC
To: Stefan Metzmacher
Cc: io-uring, Pavel Begunkov, Jens Axboe, David Ahern, ming.lei

Hi Stefan and Jens,

Thanks for the help.

BTW, the issue was observed while I was writing ublk-nbd:

https://github.com/ming1/ubdsrv/commits/nbd

It isn't complete yet (multiple send SQE chains are not serialized yet);
the issue is triggered when writing big chunks of data to ublk-nbd.

On Wed, Jan 11, 2023 at 05:32:00PM +0100, Stefan Metzmacher wrote:
> This is due to legacy reasons, you need pass MSG_WAITALL explicitly
> in order to a retry or an error on a short write...
> It should work for send, sendmsg, sendmsg_zc, recv and recvmsg.

It turns out there was another application bug, in which a recv SQE could
cut into the send SQE chain.

With that issue fixed, if MSG_WAITALL is set, short sends are no longer
observed. But if MSG_WAITALL isn't set, short sends can be observed and
the send chain still isn't terminated.

So if MSG_WAITALL is set, is io_uring responsible for retrying after a
short send, so that the application needn't take care of it?

> For recv and recvmsg MSG_WAITALL also fails the link for MSG_TRUNC and MSG_CTRUNC.

OK, thanks for sharing the recvmsg MSG_WAITALL behavior.

Thanks,
Ming

* Re: IOSQE_IO_LINK vs. short send of SOCK_STREAM
From: Jens Axboe @ 2023-01-12 3:40 UTC
To: Ming Lei, Stefan Metzmacher
Cc: io-uring, Pavel Begunkov, David Ahern

On 1/11/23 8:27 PM, Ming Lei wrote:
> BTW, the issue is observed when I write ublk-nbd:
>
> https://github.com/ming1/ubdsrv/commits/nbd
>
> and it isn't completed yet(multiple send sqe chains not serialized
> yet), the issue is triggered when writing big chunk data to ublk-nbd.

Gotcha

> After the issue is fixed, if MSG_WAITALL is set, short send can't be
> observed any more. But if MSG_WAITALL isn't set, short send can be
> observed and the send io chain still won't be terminated.

Right, if MSG_WAITALL is set, then the whole thing will be written. If
we get a short send, it's retried appropriately. Unless an error occurs,
it should send the whole thing.

> So if MSG_WAITALL is set, will io_uring be responsible for retry in case
> of short send, and application needn't to take care of it?

Correct. I did add a note about that in the liburing man pages after
your email earlier:

https://git.kernel.dk/cgit/liburing/commit/?id=8d056db7c0e58f45f7c474a6627f83270bb8f00e

since that wasn't documented as far as I can tell.

-- 
Jens Axboe

* Re: IOSQE_IO_LINK vs. short send of SOCK_STREAM
From: Stefan Metzmacher @ 2023-01-12 7:35 UTC
To: Jens Axboe, Ming Lei
Cc: io-uring, Pavel Begunkov, David Ahern

On 12.01.23 at 04:40, Jens Axboe wrote:
> On 1/11/23 8:27 PM, Ming Lei wrote:
>> So if MSG_WAITALL is set, will io_uring be responsible for retry in case
>> of short send, and application needn't to take care of it?

With new kernels yes, but the application should be prepared to have retry
logic in order to be compatible with older kernels.

The retry was added for recv* in 5.18 and for send* in 5.19.

The MSG_WAITALL logic for failing links was added with 5.12
(and backported to v5.10.28).

As the 5.15 code was backported to v5.10.162, it's safe to assume
the link-failing logic is available whenever IORING_FEAT_NATIVE_WORKERS
is set.

> Correct. I did add a note about that in the liburing man pages after
> your email earlier:
>
> https://git.kernel.dk/cgit/liburing/commit/?id=8d056db7c0e58f45f7c474a6627f83270bb8f00e
>
> since that wasn't documented as far as I can tell.

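The retry logic Stefan recommends for older kernels needs to work out what is still unsent after a short send. A minimal sketch of that bookkeeping, assuming the application keeps its own iovec array; the helper name iov_advance() is hypothetical and this is not code from the thread or from ubdsrv:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: skip `sent` bytes at the front of iov[0..*cnt).
 * Returns the first iovec that still has unsent data and updates *cnt
 * to the number of remaining iovecs (0 when everything was sent).
 * The array is modified in place.
 */
static struct iovec *iov_advance(struct iovec *iov, int *cnt, size_t sent)
{
    assert(iov && cnt && *cnt >= 0);

    /* Drop iovecs that were transmitted completely. */
    while (*cnt > 0 && sent >= iov->iov_len) {
        sent -= iov->iov_len;
        iov++;
        (*cnt)--;
    }
    /* Trim the partially transmitted iovec, if any. */
    if (*cnt > 0) {
        iov->iov_base = (char *)iov->iov_base + sent;
        iov->iov_len -= sent;
    }
    return iov;
}
```

After a completion with 0 < cqe->res < total length, the application would call this with sent = cqe->res and resubmit a send for the returned iovecs.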
* Re: IOSQE_IO_LINK vs. short send of SOCK_STREAM
From: Ming Lei @ 2023-01-13 10:12 UTC
To: Stefan Metzmacher
Cc: Jens Axboe, io-uring, Pavel Begunkov, David Ahern, ming.lei

Hello,

On Thu, Jan 12, 2023 at 08:35:36AM +0100, Stefan Metzmacher wrote:
> > > So if MSG_WAITALL is set, will io_uring be responsible for retry in case
> > > of short send, and application needn't to take care of it?
>
> With new kernels yes, but the application should be prepared to have retry
> logic in order to be compatible with older kernels.

Now ublk-nbd is usable: mkfs/mount and fio start to work.

But a short send can still be observed sometimes when sending an nbd write
request. That is done by sendmsg(), and the message includes two vectors
(the 1st is the nbd_request, the other is the data to be written to disk).

The short send is reported by a CQE in which cqe->res is always 28, which
is the size of 'struct nbd_request' and also the length of the 1st io vec.
No send CQE failure message is seen.

And MSG_WAITALL is actually set for all ublk-nbd io.

Steps to reproduce:

1) install liburing 2.0+

2) build ublk & reproduce the issue:

- git clone https://github.com/ming1/ubdsrv.git -b nbd

- cd ubdsrv

- edit build_with_liburing_src and set LIBURING_DIR to your liburing dir

- ./build_with_liburing_src && make -j4

3) run the nbd test

- cd ubdsrv
- make test T=nbd

Sometimes the test hangs, and the following log can be observed
in syslog:

nbd_send_req_done: short send/receive tag 2 op 1 8000000000800002, len 524316 written 28 cqe flags 0
...

Thanks,
Ming

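For reference, the check behind a "short send" log line like the one above can be sketched as follows. The enum and function names are hypothetical, not taken from ubdsrv. The logged length 524316 equals 28 + 524288, which would match a 28-byte header plus 512 KiB of data; that payload size is an inference from the numbers, not something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a send completion. */
enum send_result { SEND_ERROR, SEND_SHORT, SEND_COMPLETE };

/*
 * res is cqe->res: a negative errno on failure, otherwise the number of
 * bytes sent. expected is the total length the application asked for.
 */
static enum send_result classify_send(int res, size_t expected)
{
    assert(expected > 0);

    if (res < 0)
        return SEND_ERROR;
    return (size_t)res < expected ? SEND_SHORT : SEND_COMPLETE;
}
```

Note that such a check is only meaningful if `expected` is taken from the same request that produced the CQE.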
* Re: IOSQE_IO_LINK vs. short send of SOCK_STREAM
From: Jens Axboe @ 2023-01-13 17:51 UTC
To: Ming Lei, Stefan Metzmacher
Cc: io-uring, Pavel Begunkov, David Ahern

On 1/13/23 3:12 AM, Ming Lei wrote:
> But short send still can be observed sometimes when sending nbd write
> request, which is done by sendmsg(), and the message includes two vectors,
> (the 1st is the nbd_request, another one is the data to be written to disk).
>
> Short send is reported by cqe in which cqe->res is always 28, which is
> size of 'struct nbd_request', also the length of the 1st io vec. And not
> see send cqe failure message.
>
> And MSG_WAITALL is set for all ublk-nbd io actually.
>
> Sometimes the test hangs, and the following log can be observed
> in syslog:
>
> nbd_send_req_done: short send/receive tag 2 op 1 8000000000800002, len 524316 written 28 cqe flags 0
> ...

I can reproduce this, but it's a SEND that ends up being triggered,
not a SENDMSG. Shouldn't the payload-carrying op be a SENDMSG? I'm
assuming two vecs for that one.

-- 
Jens Axboe

* Re: IOSQE_IO_LINK vs. short send of SOCK_STREAM
From: Jens Axboe @ 2023-01-13 18:01 UTC
To: Ming Lei, Stefan Metzmacher
Cc: io-uring, Pavel Begunkov, David Ahern

On 1/13/23 10:51 AM, Jens Axboe wrote:
> On 1/13/23 3:12 AM, Ming Lei wrote:
>> nbd_send_req_done: short send/receive tag 2 op 1 8000000000800002, len 524316 written 28 cqe flags 0
>> ...
>
> I can reproduce this, but it's a SEND that ends up being triggered,
> not a SENDMSG. Should the payload carrying op not be a SENDMSG? I'm
> assuming two vecs for that one.

Added some debug and it looks like the request was indeed submitted
as IORING_OP_SEND and that 28 is what was requested.
But the completion side seems to think it's a SENDMSG and that we
should've sent more?

I think this needs a bit of debugging on the userspace side first.

-- 
Jens Axboe

* Re: IOSQE_IO_LINK vs. short send of SOCK_STREAM
From: Ming Lei @ 2023-01-14 0:27 UTC
To: Jens Axboe
Cc: Stefan Metzmacher, io-uring, Pavel Begunkov, David Ahern, ming.lei

On Fri, Jan 13, 2023 at 11:01:51AM -0700, Jens Axboe wrote:
> Added some debug and it looks like the request was indeed send up
> and is using IORING_OP_SEND and that the 28 is what was requested.
> But the completion side seems to think it's a SENDMSG and we should've
> received more?
>
> I think this needs a bit of debugging on the userspace side first.

Yeah, it turns out it is indeed a userspace bug: IOSQE_IO_LINK was
cleared incorrectly. With the following fix the issue can no longer be
triggered:

https://github.com/ming1/ubdsrv/commit/175ffd14ae2f8fa562134edfd4ac949f8050c108

Thanks,
Ming

* Re: IOSQE_IO_LINK vs. short send of SOCK_STREAM
From: Ming Lei @ 2023-01-14 1:39 UTC
To: Jens Axboe
Cc: Stefan Metzmacher, io-uring, Pavel Begunkov, David Ahern

On Sat, Jan 14, 2023 at 08:27:37AM +0800, Ming Lei wrote:
> Yeah, turns out it is indeed one userspace bug, IOSQE_IO_LINK is cleared
> wrong, and now the issue can't be triggered with the following fix:
>
> https://github.com/ming1/ubdsrv/commit/175ffd14ae2f8fa562134edfd4ac949f8050c108

It turns out these are two different issues. It is understandable that
the above commit fixes the io hang, but I just checked syslog and the
short send warning is still logged. I will investigate further.

Thanks,
Ming

* Re: IOSQE_IO_LINK vs. short send of SOCK_STREAM
2023-01-14 0:27 ` Ming Lei
2023-01-14 1:39 ` Ming Lei
@ 2023-01-14 2:12 ` Ming Lei
2023-01-14 3:47 ` Jens Axboe
1 sibling, 1 reply; 14+ messages in thread
From: Ming Lei @ 2023-01-14 2:12 UTC
To: Jens Axboe; +Cc: Stefan Metzmacher, io-uring, Pavel Begunkov, David Ahern

On Sat, Jan 14, 2023 at 08:27:37AM +0800, Ming Lei wrote:
> On Fri, Jan 13, 2023 at 11:01:51AM -0700, Jens Axboe wrote:
> > On 1/13/23 10:51 AM, Jens Axboe wrote:
> > > On 1/13/23 3:12 AM, Ming Lei wrote:

[...]

> > >> Sometimes the test hangs, and the following log can be observed
> > >> in syslog:
> > >>
> > >> nbd_send_req_done: short send/receive tag 2 op 1 8000000000800002, len 524316 written 28 cqe flags 0
> > >> ...
> > >>
> > >
> > > I can reproduce this, but it's a SEND that ends up being triggered,
> > > not a SENDMSG. Should the payload carrying op not be a SENDMSG? I'm
> > > assuming two vecs for that one.
> >
> > Added some debug and it looks like the request was indeed send up
> > and is using IORING_OP_SEND and that the 28 is what was requested.
> > But the completion side seems to think it's a SENDMSG and we should've
> > received more?
> >
> > I think this needs a bit of debugging on the userspace side first.
>
> Yeah, turns out it is indeed one userspace bug, IOSQE_IO_LINK is cleared
> wrong, and now the issue can't be triggered with the following fix:
>
> https://github.com/ming1/ubdsrv/commit/175ffd14ae2f8fa562134edfd4ac949f8050c108

Figured it out, it is still a userspace issue.

For an nbd request sent to the server, the send cqe can arrive after the
ublk io request has already been completed (that completion is triggered
by the nbd reply from the server). If a new ublk io request is then
submitted to the same slot, nbd_send_req_done() reads the new request's
data length and op code, and the warning is triggered.

Thanks,
Ming
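One way to guard against the slot reuse described above is to carry a per-slot generation counter in the 64-bit `user_data` of each SQE, so that a late send completion is never compared against a newer request's length. The sketch below is an assumption about how this could be done, not the fix that went into ubdsrv: `struct io_slot`, `slot_start()`, `send_is_short()` and `NR_SLOTS` are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_SLOTS 64 /* hypothetical queue depth */

/* Hypothetical per-tag bookkeeping, not the ubdsrv data structure. */
struct io_slot {
    uint32_t gen;      /* bumped every time the slot is reused */
    uint32_t send_len; /* bytes the current send was asked to transfer */
};

static struct io_slot slots[NR_SLOTS];

/* Pack tag (low 16 bits) and generation into the SQE/CQE user_data. */
static uint64_t make_user_data(uint16_t tag, uint32_t gen)
{
    return ((uint64_t)gen << 16) | tag;
}

/* Start a new request in a slot; returns user_data for its send SQE. */
static uint64_t slot_start(uint16_t tag, uint32_t send_len)
{
    struct io_slot *s = &slots[tag % NR_SLOTS];

    assert(send_len > 0); /* a zero-length send can never be short */
    s->gen++;
    s->send_len = send_len;
    return make_user_data(tag, s->gen);
}

/*
 * True if a send completion is short for the request that issued it.
 * A completion from an older generation is stale: the slot now describes
 * a newer request, so its length must not be used for the comparison.
 */
static bool send_is_short(uint64_t user_data, int32_t res)
{
    uint16_t tag = (uint16_t)(user_data & 0xffff);
    uint32_t gen = (uint32_t)(user_data >> 16);
    const struct io_slot *s = &slots[tag % NR_SLOTS];

    if (gen != s->gen)
        return false; /* stale CQE, nothing valid to compare against */
    return res >= 0 && (uint32_t)res < s->send_len;
}
```

In the scenario from this thread, a late completion of 28 bytes for an earlier 28-byte send would be recognised as stale instead of being checked against the 524316 bytes expected by the request that reused the slot.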
* Re: IOSQE_IO_LINK vs. short send of SOCK_STREAM
2023-01-14 2:12 ` Ming Lei
@ 2023-01-14 3:47 ` Jens Axboe
0 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2023-01-14 3:47 UTC
To: Ming Lei; +Cc: Stefan Metzmacher, io-uring, Pavel Begunkov, David Ahern

On 1/13/23 7:12 PM, Ming Lei wrote:
> On Sat, Jan 14, 2023 at 08:27:37AM +0800, Ming Lei wrote:

[...]

>> Yeah, turns out it is indeed one userspace bug, IOSQE_IO_LINK is cleared
>> wrong, and now the issue can't be triggered with the following fix:
>>
>> https://github.com/ming1/ubdsrv/commit/175ffd14ae2f8fa562134edfd4ac949f8050c108
>
> Figured it out, it is still a userspace issue.
>
> For an nbd request sent to the server, the send cqe can arrive after the
> ublk io request has already been completed (that completion is triggered
> by the nbd reply from the server). If a new ublk io request is then
> submitted to the same slot, nbd_send_req_done() reads the new request's
> data length and op code, and the warning is triggered.

Figured it was some kind of data reuse issue, as it is consistent with
that. Glad you got it figured out.

-- 
Jens Axboe
Thread overview: 14+ messages
2023-01-11 15:26 IOSQE_IO_LINK vs. short send of SOCK_STREAM Ming Lei
2023-01-11 15:49 ` Jens Axboe
2023-01-11 16:32 ` Stefan Metzmacher
2023-01-11 16:36 ` Jens Axboe
2023-01-12 3:27 ` Ming Lei
2023-01-12 3:40 ` Jens Axboe
2023-01-12 7:35 ` Stefan Metzmacher
2023-01-13 10:12 ` Ming Lei
2023-01-13 17:51 ` Jens Axboe
2023-01-13 18:01 ` Jens Axboe
2023-01-14 0:27 ` Ming Lei
2023-01-14 1:39 ` Ming Lei
2023-01-14 2:12 ` Ming Lei
2023-01-14 3:47 ` Jens Axboe