Date: Sat, 14 Jan 2023 08:27:37 +0800
From: Ming Lei
To: Jens Axboe
Cc: Stefan Metzmacher, io-uring@vger.kernel.org, Pavel Begunkov, David Ahern, ming.lei@redhat.com
Subject: Re: IOSQE_IO_LINK vs. short send of SOCK_STREAM
References: <24a5eb97-92be-2441-13a2-9ebf098caf55@kernel.dk> <9eca9d42-e8ab-3e2b-888a-cd41722cce7a@samba.org> <6e237718-e09b-03ca-bd23-de94cdefa7fc@kernel.dk>
In-Reply-To: <6e237718-e09b-03ca-bd23-de94cdefa7fc@kernel.dk>
X-Mailing-List: io-uring@vger.kernel.org

On Fri, Jan 13, 2023 at 11:01:51AM -0700, Jens Axboe wrote:
> On 1/13/23 10:51 AM, Jens Axboe wrote:
> > On 1/13/23 3:12 AM, Ming Lei wrote:
> >> Hello,
> >>
> >> On Thu, Jan 12, 2023 at 08:35:36AM +0100, Stefan Metzmacher wrote:
> >>> On 12.01.23 at 04:40, Jens Axboe wrote:
> >>>> On 1/11/23 8:27 PM, Ming Lei wrote:
> >>>>> Hi Stefan and Jens,
> >>>>>
> >>>>> Thanks for the help.
> >>>>>
> >>>>> BTW, the issue is observed when I write ublk-nbd:
> >>>>>
> >>>>> https://github.com/ming1/ubdsrv/commits/nbd
> >>>>>
> >>>>> and it isn't completed yet(multiple send sqe chains not serialized
> >>>>> yet), the issue is triggered when writing big chunk data to ublk-nbd.
> >>>>
> >>>> Gotcha
> >>>>
> >>>>> On Wed, Jan 11, 2023 at 05:32:00PM +0100, Stefan Metzmacher wrote:
> >>>>>> Hi Ming,
> >>>>>>
> >>>>>>> Per my understanding, a short send on SOCK_STREAM should terminate the
> >>>>>>> remainder of the SQE chain built by IOSQE_IO_LINK.
> >>>>>>>
> >>>>>>> But from my observation, this point isn't true when using io_sendmsg or
> >>>>>>> io_sendmsg_zc on TCP socket, and the other remainder of the chain still
> >>>>>>> can be completed after one short send is found. MSG_WAITALL is off.
> >>>>>>
> >>>>>> This is due to legacy reasons, you need pass MSG_WAITALL explicitly
> >>>>>> in order to a retry or an error on a short write...
> >>>>>> It should work for send, sendmsg, sendmsg_zc, recv and recvmsg.
> >>>>>
> >>>>> Turns out there is another application bug in which recv sqe may cut in the
> >>>>> send sqe chain.
> >>>>>
> >>>>> After the issue is fixed, if MSG_WAITALL is set, short send can't be
> >>>>> observed any more. But if MSG_WAITALL isn't set, short send can be
> >>>>> observed and the send io chain still won't be terminated.
> >>>>
> >>>> Right, if MSG_WAITALL is set, then the whole thing will be written. If
> >>>> we get a short send, it's retried appropriately. Unless an error occurs,
> >>>> it should send the whole thing.
> >>>>
> >>>>> So if MSG_WAITALL is set, will io_uring be responsible for retry in case
> >>>>> of short send, and application needn't to take care of it?
> >>>
> >>> With new kernels yes, but the application should be prepared to have retry
> >>> logic in order to be compatible with older kernels.
> >>
> >> Now ublk-nbd can be played, mkfs/mount and fio starts to work.
> >>
> >> But short send still can be observed sometimes when sending nbd write
> >> request, which is done by sendmsg(), and the message includes two vectors,
> >> (the 1st is the nbd_request, another one is the data to be written to disk).
> >>
> >> Short send is reported by cqe in which cqe->res is always 28, which is
> >> size of 'struct nbd_request', also the length of the 1st io vec. And not
> >> see send cqe failure message.
> >>
> >> And MSG_WAITALL is set for all ublk-nbd io actually.
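
On the retry logic mentioned above for older kernels, here is a minimal sketch, assuming a blocking stream socket and plain sendmsg() rather than io_uring. The helper names iov_advance() and send_all_iov() are made up for illustration and do not come from ubdsrv or liburing. The same bookkeeping would apply to a short cqe->res: skip the bytes already sent, then resubmit the remaining vectors.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: account for `done` bytes that were already sent.
 * Fully sent vectors get length 0, a partially sent vector is trimmed in
 * place. Returns the index of the first vector that still has data, or
 * iovcnt if nothing is left.
 */
static int iov_advance(struct iovec *iov, int iovcnt, size_t done)
{
	int i = 0;

	while (i < iovcnt && done >= iov[i].iov_len) {
		done -= iov[i].iov_len;
		iov[i].iov_len = 0;
		i++;
	}
	/* The caller must not claim more bytes than the vectors hold. */
	assert(i < iovcnt || done == 0);
	if (i < iovcnt) {
		iov[i].iov_base = (char *)iov[i].iov_base + done;
		iov[i].iov_len -= done;
	}
	return i;
}

/*
 * Hypothetical helper: keep calling sendmsg() until every vector has been
 * sent. Returns the total number of bytes sent, or -1 with errno set.
 * Note that the iovec array is modified.
 */
static ssize_t send_all_iov(int fd, struct iovec *iov, int iovcnt)
{
	size_t total = 0;
	int first = iov_advance(iov, iovcnt, 0); /* skip empty vectors */

	while (first < iovcnt) {
		struct msghdr msg = {
			.msg_iov = iov + first,
			.msg_iovlen = iovcnt - first,
		};
		ssize_t n = sendmsg(fd, &msg, MSG_NOSIGNAL);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		total += (size_t)n;
		/* A short send leaves first < iovcnt, so the loop resends the rest. */
		first += iov_advance(iov + first, iovcnt - first, (size_t)n);
	}
	return (ssize_t)total;
}
```

With the two-vector nbd write described here (a 28-byte header followed by the data), a result of 28 makes iov_advance() return 1, so the resend starts at the data vector.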
> >>
> >> Follows the steps:
> >>
> >> 1) install liburing 2.0+
> >>
> >> 2) build ublk & reproduce the issue:
> >>
> >> - git clone https://github.com/ming1/ubdsrv.git -b nbd
> >>
> >> - cd ubdsrv
> >>
> >> - vim build_with_liburing_src && set LIBURING_DIR to your liburing dir
> >>
> >> - ./build_with_liburing_src&& make -j4
> >>
> >> 3) run the nbd test
> >> - cd ubdsrv
> >> - make test T=nbd
> >>
> >> Sometimes the test hangs, and the following log can be observed
> >> in syslog:
> >>
> >> nbd_send_req_done: short send/receive tag 2 op 1 8000000000800002, len 524316 written 28 cqe flags 0
> >> ...
> >>
> >
> > I can reproduce this, but it's a SEND that ends up being triggered,
> > not a SENDMSG. Should the payload carrying op not be a SENDMSG? I'm
> > assuming two vecs for that one.
>
> Added some debug and it looks like the request was indeed send up
> and is using IORING_OP_SEND and that the 28 is what was requested.
> But the completion side seems to think it's a SENDMSG and we should've
> received more?
>
> I think this needs a bit of debugging on the userspace side first.

Yeah, it turns out this is indeed a userspace bug: IOSQE_IO_LINK was
cleared incorrectly. With the following fix, the issue can no longer be
triggered:

https://github.com/ming1/ubdsrv/commit/175ffd14ae2f8fa562134edfd4ac949f8050c108

Thanks,
Ming
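
For readers who do not want to open the commit: the sketch below is not the ubdsrv fix itself, only an illustration of the invariant that a link chain depends on. Every SQE except the last one in a chain carries IOSQE_IO_LINK, and the chain ends at the first SQE without it. The struct sketch_sqe type and the link_chain() and chain_len() helpers are hypothetical stand-ins so that the example builds without liburing; only the flag value (bit 2) matches <linux/io_uring.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as IOSQE_IO_LINK in <linux/io_uring.h> (bit 2). */
#define SKETCH_IOSQE_IO_LINK (1U << 2)

/* Hypothetical stand-in for struct io_uring_sqe; only the flags byte is modelled. */
struct sketch_sqe {
	uint8_t flags;
};

/* Link sqes[0..n-1] into one chain: all but the last SQE carry the link flag. */
static void link_chain(struct sketch_sqe *sqes, size_t n)
{
	assert(n > 0);
	for (size_t i = 0; i + 1 < n; i++)
		sqes[i].flags |= SKETCH_IOSQE_IO_LINK;
	/* The last SQE must not be linked, or the chain runs into the next submission. */
	sqes[n - 1].flags &= (uint8_t)~SKETCH_IOSQE_IO_LINK;
}

/* Length of the chain starting at sqes[0]: it ends at the first SQE without the flag. */
static size_t chain_len(const struct sketch_sqe *sqes, size_t n)
{
	size_t i = 0;

	while (i < n && (sqes[i].flags & SKETCH_IOSQE_IO_LINK))
		i++;
	return i < n ? i + 1 : n;
}
```

Clearing the flag on an SQE in the middle of the chain shortens the chain to end at that SQE, so the requests behind it are no longer ordered after it.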