From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2A96915D5D9 for ; Thu, 12 Sep 2024 10:49:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1726138191; cv=none; b=egfNkBwHnWUtPF2xxB+6Y0LCFQRyTLzrgAx2gZurE8fYcwVpVglVgISFUae69r1txdOIwGKsl599x9+bg48pXpsN0YmM6PF8WaF+lgOth5AdL9WXdVkxbwjgYhgpGIo2hsxf8/8SBRQQrhgQUp+VNyLbFLDAwek9i32zgsCqqLs= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1726138191; c=relaxed/simple; bh=tQvBKw04/nkyBK5qBAFq3uvl86EzrS7291cF20ua9RQ=; h=From:To:Cc:Subject:Date:Message-ID:MIME-Version; b=pNpoENt4REZ92aT8MpmEF8CrwuNPJuRZ6SfCZQOtagU4tnNLsH17e8vObOpVdRUDTYbdFDIT0zcUzIZ2+iouelpSezAr7s+lBQKfKQXbPSlEN/Zw3gkkWyKLpas5NMxa8tTIfU5wfWz5k7y0EZhAZiCav5+uxK6iFlLTrHhr7ts= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=K5ecM8KT; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="K5ecM8KT" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1726138189; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding; bh=WfDnloVL27js4DeKz4WRuLdIcS0vYQGz+8EzQxXGU6c=; b=K5ecM8KTB3115/iN7ksKzm665SwsGSpTGIutgtqnZoGr3NZBW+MxicCYLLo+CXLr95JIN+ sZoiWkkzwhAKD25AMY+F13woCYRBQd0iUg0uRYeu1cUt8lEgRvY/ilP/oK25W61FFEHNnJ Qt7dDUtBTc7pJtofKT2rIAJ/p2+5iOI= Received: from mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (ec2-54-186-198-63.us-west-2.compute.amazonaws.com [54.186.198.63]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-608-bg3FrMVcNJ2VKPPrtu3IPQ-1; Thu, 12 Sep 2024 06:49:44 -0400 X-MC-Unique: bg3FrMVcNJ2VKPPrtu3IPQ-1 Received: from mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com [10.30.177.17]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mx-prod-mc-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTPS id B54051955DB2; Thu, 12 Sep 2024 10:49:42 +0000 (UTC) Received: from localhost (unknown [10.72.116.81]) by mx-prod-int-05.mail-002.prod.us-west-2.aws.redhat.com (Postfix) with ESMTP id 5BB5C19560A3; Thu, 12 Sep 2024 10:49:40 +0000 (UTC) From: Ming Lei To: Jens Axboe , io-uring@vger.kernel.org, Pavel Begunkov Cc: linux-block@vger.kernel.org, Ming Lei Subject: [PATCH V6 0/8] io_uring: support sqe group and provide group kbuf Date: Thu, 12 Sep 2024 18:49:20 +0800 Message-ID: <20240912104933.1875409-1-ming.lei@redhat.com> Precedence: bulk X-Mailing-List: io-uring@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 3.0 on 10.30.177.17 Hello, The 1st 3 patches are cleanup, and prepare for adding sqe group. The 4th patch supports generic sqe group which is like link chain, but allows each sqe in group to be issued in parallel and the group shares same IO_LINK & IO_DRAIN boundary, so N:M dependency can be supported with sqe group & io link together. sqe group changes nothing on IOSQE_IO_LINK. The 5th patch supports one variant of sqe group: allow members to depend on group leader, so that kernel resource lifetime can be aligned with group leader or group, then any kernel resource can be shared in this sqe group, and can be used in generic device zero copy. The 6th & 7th patches supports providing sqe group buffer via the sqe group variant. The 8th patch supports ublk zero copy based on io_uring providing sqe group buffer. Tests: 1) pass liburing test - make runtests 2) write/pass sqe group test case and sqe provide buffer case: https://github.com/axboe/liburing/compare/master...ming1:liburing:sqe_group_v3 https://github.com/ming1/liburing/tree/sqe_group_v3 - covers related sqe flags combination and linking groups, both nop and one multi-destination file copy. - cover failure handling test: fail leader IO or member IO in both single group and linked groups, which is done in each sqe flags combination test - covers IORING_PROVIDE_GROUP_KBUF by adding ublk-loop-zc 3) ublksrv zero copy: ublksrv userspace implements zero copy by sqe group & provide group kbuf: git clone https://github.com/ublk-org/ublksrv.git -b group-provide-buf_v3 make test T=loop/009:nbd/061 #ublk zc tests When running 64KB/512KB block size test on ublk-loop('ublk add -t loop --buffered_io -f $backing'), it is observed that perf is doubled. V6: - follow Pavel's suggestion to disallow IOSQE_CQE_SKIP_SUCCESS & LINK_TIMEOUT - kill __io_complete_group_member() (Pavel) - simplify link failure handling (Pavel) - move members' queuing out of completion lock (Pavel) - cleanup group io complete handler - add more comment - add ublk zc into liburing test for covering IOSQE_SQE_GROUP & IORING_PROVIDE_GROUP_KBUF V5: - follow Pavel's suggestion to minimize change on io_uring fast code path: sqe group code is called in by single 'if (unlikely())' from both issue & completion code path - simplify & re-write group request completion avoid to touch io-wq code by completing group leader via tw directly, just like ->task_complete re-write group member & leader completion handling, one simplification is always to free leader via the last member simplify queueing group members, not support issuing leader and members in parallel - fail the whole group if IO_*LINK & IO_DRAIN is set on group members, and test code to cover this change - misc cleanup V4: - address most comments from Pavel - fix request double free - don't use io_req_commit_cqe() in io_req_complete_defer() - make members' REQ_F_INFLIGHT discoverable - use common assembling check in submission code path - drop patch 3 and don't move REQ_F_CQE_SKIP out of io_free_req() - don't set .accept_group_kbuf for net send zc, in which members need to be queued after buffer notification is got, and can be enabled in future - add .grp_leader field via union, and share storage with .grp_link - move .grp_refs into one hole of io_kiocb, so that one extra cacheline isn't needed for io_kiocb - cleanup & document improvement V3: - add IORING_FEAT_SQE_GROUP - simplify group completion, and minimize change on io_req_complete_defer() - simplify & cleanup io_queue_group_members() - fix many failure handling issues - cover failure handling code in added liburing tests - remove RFC V2: - add generic sqe group, suggested by Kevin Wolf - add REQ_F_SQE_GROUP_DEP which is based on IOSQE_SQE_GROUP, for sharing kernel resource in group wide, suggested by Kevin Wolf - remove sqe ext flag, and use the last bit for IOSQE_SQE_GROUP(Pavel), in future we still can extend sqe flags with one uring context flag - initialize group requests via submit state pattern, suggested by Pavel - all kinds of cleanup & bug fixes Ming Lei (8): io_uring: add io_link_req() helper io_uring: add io_submit_fail_link() helper io_uring: add helper of io_req_commit_cqe() io_uring: support SQE group io_uring: support sqe group with members depending on leader io_uring: support providing sqe group buffer io_uring/uring_cmd: support provide group kernel buffer ublk: support provide io buffer drivers/block/ublk_drv.c | 160 +++++++++++++- include/linux/io_uring/cmd.h | 7 + include/linux/io_uring_types.h | 54 +++++ include/uapi/linux/io_uring.h | 11 +- include/uapi/linux/ublk_cmd.h | 7 +- io_uring/io_uring.c | 370 ++++++++++++++++++++++++++++++--- io_uring/io_uring.h | 16 ++ io_uring/kbuf.c | 60 ++++++ io_uring/kbuf.h | 13 ++ io_uring/net.c | 23 +- io_uring/opdef.c | 4 + io_uring/opdef.h | 2 + io_uring/rw.c | 20 +- io_uring/timeout.c | 6 + io_uring/uring_cmd.c | 28 +++ 15 files changed, 735 insertions(+), 46 deletions(-) -- 2.42.0