On 19/12/25 12:08AM, Ming Lei wrote:
>On Thu, Dec 18, 2025 at 08:46:47PM +0530, Nitesh Shetty wrote:
>> On 18/12/25 05:45PM, Ming Lei wrote:
>> > On Thu, Dec 18, 2025 at 01:37:37AM -0800, Christoph Hellwig wrote:
>> > > On Thu, Dec 18, 2025 at 05:31:42PM +0800, Ming Lei wrote:
>> > > > ->bi_vcnt doesn't make sense for a cloned bio, yet a cloned bio
>> > > > may perfectly well be passed to bio_may_need_split().
>> > > >
>> > > > So fix bio_may_need_split() by not reading ->bi_vcnt directly;
>> > > > instead check with the help of the bio size and bvec->len.
>> > > >
>> > > > Meanwhile, retrieve the 1st bvec via __bvec_iter_bvec().
>> > >
>> > > That totally misses the point. The ->bi_vcnt is a fast and loose
>> > > check to see if we need the fairly expensive iterators to do the
>> > > real check.
>> >
>> > It is just __bvec_iter_bvec(); either way, it should be in cache
>> > sooner or later.
>>
>> Functionality-wise the overall patch looks fine to me, but as
>> Christoph stated there is a slight performance (IOPS) penalty.
>> Here are my benchmarking numbers [1]; I suspect Jens' setup might
>> show more regression.
>>
>> Regards,
>> Nitesh
>>
>> [1]
>> ===============================
>> a. two optane nvme device setup:
>>
>> ----------
>> base case:
>> ----------
>> sudo taskset -c 0,1 /home/nitesh/src/private/fio/t/io_uring -b512 \
>> -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
>> submitter=0, tid=206586, file=/dev/nvme0n1, nfiles=1, node=-1
>> submitter=1, tid=206587, file=/dev/nvme1n1, nfiles=1, node=-1
>> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> IOPS=6.45M, BW=3.15GiB/s, IOS/call=32/31
>> IOPS=6.47M, BW=3.16GiB/s, IOS/call=32/32
>> IOPS=6.47M, BW=3.16GiB/s, IOS/call=32/32
>> Exiting on timeout
>> Maximum IOPS=6.47M
>>
>> ----------------
>> with this patch:
>> ----------------
>> sudo taskset -c 0,1 /home/nitesh/src/private/fio/t/io_uring -b512 \
>> -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
>> submitter=0, tid=6352, file=/dev/nvme0n1, nfiles=1, node=-1
>> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> submitter=1, tid=6353, file=/dev/nvme1n1, nfiles=1, node=-1
>> IOPS=6.30M, BW=3.08GiB/s, IOS/call=32/31
>> IOPS=6.35M, BW=3.10GiB/s, IOS/call=32/31
>> IOPS=6.37M, BW=3.11GiB/s, IOS/call=32/32
>> Exiting on timeout
>> Maximum IOPS=6.37M
>>
>> =============================
>> b. two null-blk device setup:
>> ------------------
>> null device setup:
>> ------------------
>> sudo modprobe null_blk queue_mode=2 gb=10 bs=512 nr_devices=2 irqmode=2 \
>> completion_nsec=1000000 hw_queue_depth=256 memory_backed=0 discard=0 \
>> use_per_node_hctx=1
>>
>> ----------
>> base case:
>> ----------
>> sudo taskset -c 0,1 /home/nitesh/src/private/fio/t/io_uring -b512 \
>> -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nullb0 /dev/nullb1
>> submitter=0, tid=6743, file=/dev/nullb0, nfiles=1, node=-1
>> submitter=1, tid=6744, file=/dev/nullb1, nfiles=1, node=-1
>> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> IOPS=7.89M, BW=3.85GiB/s, IOS/call=32/31
>> IOPS=7.96M, BW=3.89GiB/s, IOS/call=32/32
>> IOPS=7.99M, BW=3.90GiB/s, IOS/call=32/32
>> Exiting on timeout
>> Maximum IOPS=7.99M
>>
>> -------------------
>> with this patchset:
>> -------------------
>> sudo taskset -c 0,1 /home/nitesh/src/private/fio/t/io_uring -b512 \
>> -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nullb0 /dev/nullb1
>> submitter=0, tid=35633, file=/dev/nullb0, nfiles=1, node=-1
>> submitter=1, tid=35634, file=/dev/nullb1, nfiles=1, node=-1
>> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> IOPS=7.79M, BW=3.80GiB/s, IOS/call=32/31
>> IOPS=7.86M, BW=3.84GiB/s, IOS/call=32/32
>> IOPS=7.89M, BW=3.85GiB/s, IOS/call=32/32
>> Exiting on timeout
>> Maximum IOPS=7.89M
>
>Thanks for the perf test!
>
>This patch only adds bio->bi_iter to the memory footprint, which is
>expected to hit in L1. Maybe the regression is because `bi_io_vec`
>sits in the 2nd cacheline; can you see any difference with the
>following change?
>
>
>diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>index 5dc061d318a4..1c4570b37436 100644
>--- a/include/linux/blk_types.h
>+++ b/include/linux/blk_types.h
>@@ -240,6 +240,7 @@ struct bio {
> 		/* for plugged zoned writes only: */
> 		unsigned int __bi_nr_segments;
> 	};
>+	struct bio_vec *bi_io_vec;	/* the actual vec list */
> 	bio_end_io_t *bi_end_io;
> 	void *bi_private;
> #ifdef CONFIG_BLK_CGROUP
>@@ -275,8 +276,6 @@ struct bio {
> 
> 	atomic_t __bi_cnt;	/* pin count */
> 
>-	struct bio_vec *bi_io_vec;	/* the actual vec list */
>-
> 	struct bio_set *bi_pool;
> };
>

With the above patch the perf numbers match the base case.

Thanks,
Nitesh Shetty
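
For readers who don't have the patch in front of them: below is a minimal
sketch of what bio_may_need_split() reworked along the lines Ming describes
at the top of the thread could look like. It is reconstructed purely from
the quoted description, not taken from the actual patch; in particular,
using bi_bvec_done to account for a partially advanced iterator of a
cloned bio is an assumption on my part.

/*
 * Sketch only: decide "may need split" from the iterator instead of
 * ->bi_vcnt, which is not meaningful for cloned bios. Fetch the bvec
 * the iterator currently points at via __bvec_iter_bvec() and compare
 * the bytes remaining in it against the total bio size.
 */
static inline bool bio_may_need_split(struct bio *bio,
		const struct queue_limits *lim)
{
	const struct bio_vec *bv = __bvec_iter_bvec(bio->bi_io_vec,
						    bio->bi_iter);

	if (lim->chunk_sectors)
		return true;
	/* the bio spans more than the remainder of the current bvec */
	if (bio->bi_iter.bi_size > bv->bv_len - bio->bi_iter.bi_bvec_done)
		return true;
	/* the single-bvec fast path still requires a single-page bvec */
	return bv->bv_len + bv->bv_offset > PAGE_SIZE;
}

Christoph's objection is visible in this shape: the cheap ->bi_vcnt
comparison becomes a lookup through bi_iter and bi_io_vec, which is
exactly the extra memory traffic the benchmarks above measure.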
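
On the cacheline point: the struct bio layout can be inspected with
`pahole -C bio vmlinux`, which annotates each 64-byte cacheline boundary.
As a hypothetical build-time check (not part of the patch), one could
assert that the two fields the split check touches now share a cacheline:

/*
 * Hypothetical assertion, assuming the reordered layout above:
 * bi_iter and bi_io_vec should land in the same cacheline.
 */
static_assert(offsetof(struct bio, bi_iter) / SMP_CACHE_BYTES ==
	      offsetof(struct bio, bi_io_vec) / SMP_CACHE_BYTES);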