* [PATCH RFC 5.13 0/2] adaptive sqpoll and its wakeup optimization
@ 2021-04-28 13:32 Hao Xu
  2021-04-28 13:32 ` [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle Hao Xu
  2021-04-28 13:32 ` [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread Hao Xu
  0 siblings, 2 replies; 40+ messages in thread
From: Hao Xu @ 2021-04-28 13:32 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Patch 1/2 adds an option to set up sq_thread_idle with nanosecond
granularity. Patch 2/2 cuts down the IO latency seen while the sqthread
is waking up.

This is an RFC, especially 2/2. There is more work to do, such as adding
a REGISTER op so that applications can adjust sq_thread_idle at runtime,
since a ring may see both high and low IO pressure over its lifetime.
Under low IO pressure, patch 1/2 saves CPU while keeping latency
reasonable. A liburing tweak is ready as well, but for now I'd like to
post just this for comments.

Hao Xu (2):
  io_uring: add support for ns granularity of io_sq_thread_idle
  io_uring: submit sqes in the original context when waking up sqthread

 fs/io_uring.c                 | 88 ++++++++++++++++++++++++++++++++-----------
 include/uapi/linux/io_uring.h |  4 +-
 2 files changed, 70 insertions(+), 22 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 40+ messages in thread
* [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle
  2021-04-28 13:32 [PATCH RFC 5.13 0/2] adaptive sqpoll and its wakeup optimization Hao Xu
@ 2021-04-28 13:32 ` Hao Xu
  2021-04-28 14:07   ` Pavel Begunkov
  2021-04-28 13:32 ` [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread Hao Xu
  1 sibling, 1 reply; 40+ messages in thread
From: Hao Xu @ 2021-04-28 13:32 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Currently the unit of io_sq_thread_idle is a millisecond; the smallest
value is 1ms, which means that for IOPS > 1000 the sqthread will very
likely run at 100% CPU usage. That is not necessary in some cases: users
may not care much about latency under low IO pressure (say
1000 < IOPS < 20000), but CPU consumption does matter. So offer an
option of nanosecond granularity for io_sq_thread_idle. Some test
results from fio below:

uring average latency (us):
iops\idle 10us   60us   110us  160us  210us  260us  310us  360us  410us  460us  510us
2k        10.93  10.68  10.72  10.7   10.79  10.52  10.59  10.54  10.47  10.39  8.4
4k        10.55  10.48  10.51  10.42  10.35  8.34
6k        10.82  10.5   10.39  8.4
8k        10.44  10.45  10.34  8.39
10k       10.45  10.39  8.33

uring cpu usage of sqthread:
iops\idle 10us    60us    110us   160us   210us   260us   310us   360us   410us   460us   510us
2k        4%      14%     24%     34.70%  44.70%  55%     65.10%  75.40%  85.40%  95.70%  100%
4k        7.70%   28.20%  48.50%  69%     90%     100%
6k        11.30%  42%     73%     100%
8k        15.30%  56.30%  97%     100%
10k       19%     70%     100%

aio average latency (us):
iops  latency  99th lat  cpu
2k    13.34    14.272    3%
4k    13.195   14.016    7%
6k    13.29    14.656    9.70%
8k    13.2     14.656    12.70%
10k   13.2     15        17%

fio config is:
./run_fio.sh
fio \
  --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \
  --direct=1 --rw=randread --time_based=1 --runtime=300 \
  --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \
  --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \
  --io_sq_thread_idle=${2}

At 2k IOPS, if a latency of 10.93us is acceptable for an application, it
gets a 100% - 4% = 96% reduction in CPU usage, while the latency is
still lower than with aio (10.93us vs 13.34us).
Signed-off-by: Hao Xu <[email protected]> --- fs/io_uring.c | 59 ++++++++++++++++++++++++++++++++----------- include/uapi/linux/io_uring.h | 3 ++- 2 files changed, 46 insertions(+), 16 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 63ff70587d4f..1871fad48412 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -279,7 +279,8 @@ struct io_sq_data { struct task_struct *thread; struct wait_queue_head wait; - unsigned sq_thread_idle; + u64 sq_thread_idle; + bool idle_mode; int sq_cpu; pid_t task_pid; pid_t task_tgid; @@ -362,7 +363,7 @@ struct io_ring_ctx { unsigned cached_sq_head; unsigned sq_entries; unsigned sq_mask; - unsigned sq_thread_idle; + u64 sq_thread_idle; unsigned cached_sq_dropped; unsigned cached_cq_overflow; unsigned long sq_check_overflow; @@ -6797,18 +6798,42 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries) static void io_sqd_update_thread_idle(struct io_sq_data *sqd) { struct io_ring_ctx *ctx; - unsigned sq_thread_idle = 0; + u64 sq_thread_idle = 0; - list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) - sq_thread_idle = max(sq_thread_idle, ctx->sq_thread_idle); + sqd->idle_mode = false; + list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) { + u64 tmp_idle = ctx->sq_thread_idle; + + if (!(ctx->flags & IORING_SETUP_IDLE_NS)) + tmp_idle = jiffies64_to_nsecs(tmp_idle); + else if (!sqd->idle_mode) + sqd->idle_mode = true; + + if (sq_thread_idle < tmp_idle) + sq_thread_idle = tmp_idle; + } + + if (!sqd->idle_mode) + sq_thread_idle = nsecs_to_jiffies64(sq_thread_idle); sqd->sq_thread_idle = sq_thread_idle; } +static inline u64 io_current_time(bool idle_mode) +{ + return idle_mode ? ktime_get_ns() : get_jiffies_64(); +} + +static inline bool io_time_after(bool idle_mode, u64 timeout) +{ + return idle_mode ? ktime_get_ns() > timeout : + time_after64(get_jiffies_64(), timeout); +} + static int io_sq_thread(void *data) { struct io_sq_data *sqd = data; struct io_ring_ctx *ctx; - unsigned long timeout = 0; + u64 timeout = 0; char buf[TASK_COMM_LEN]; DEFINE_WAIT(wait); @@ -6842,7 +6867,7 @@ static int io_sq_thread(void *data) break; io_run_task_work(); io_run_task_work_head(&sqd->park_task_work); - timeout = jiffies + sqd->sq_thread_idle; + timeout = io_current_time(sqd->idle_mode) + sqd->sq_thread_idle; continue; } sqt_spin = false; @@ -6859,11 +6884,11 @@ static int io_sq_thread(void *data) sqt_spin = true; } - if (sqt_spin || !time_after(jiffies, timeout)) { + if (sqt_spin || !io_time_after(sqd->idle_mode, timeout)) { io_run_task_work(); cond_resched(); if (sqt_spin) - timeout = jiffies + sqd->sq_thread_idle; + timeout = io_current_time(sqd->idle_mode) + sqd->sq_thread_idle; continue; } @@ -6896,7 +6921,7 @@ static int io_sq_thread(void *data) finish_wait(&sqd->wait, &wait); io_run_task_work_head(&sqd->park_task_work); - timeout = jiffies + sqd->sq_thread_idle; + timeout = io_current_time(sqd->idle_mode) + sqd->sq_thread_idle; } io_uring_cancel_sqpoll(sqd); @@ -7940,7 +7965,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx, if (ctx->flags & IORING_SETUP_SQPOLL) { struct task_struct *tsk; struct io_sq_data *sqd; - bool attached; + bool attached, idle_ns; sqd = io_get_sq_data(p, &attached); if (IS_ERR(sqd)) { @@ -7950,9 +7975,13 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx, ctx->sq_creds = get_current_cred(); ctx->sq_data = sqd; - ctx->sq_thread_idle = msecs_to_jiffies(p->sq_thread_idle); + + idle_ns = ctx->flags & IORING_SETUP_IDLE_NS; + if (!idle_ns) + p->sq_thread_idle = nsecs_to_jiffies64(p->sq_thread_idle * NSEC_PER_MSEC); + 
ctx->sq_thread_idle = p->sq_thread_idle; if (!ctx->sq_thread_idle) - ctx->sq_thread_idle = HZ; + ctx->sq_thread_idle = idle_ns ? NSEC_PER_SEC : HZ; io_sq_thread_park(sqd); list_add(&ctx->sqd_list, &sqd->ctx_list); @@ -7990,7 +8019,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx, wake_up_new_task(tsk); if (ret) goto err; - } else if (p->flags & IORING_SETUP_SQ_AFF) { + } else if (p->flags & (IORING_SETUP_SQ_AFF | IORING_SETUP_IDLE_NS)) { /* Can't have SQ_AFF without SQPOLL */ ret = -EINVAL; goto err; @@ -9680,7 +9709,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params) if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ | - IORING_SETUP_R_DISABLED)) + IORING_SETUP_R_DISABLED | IORING_SETUP_IDLE_NS)) return -EINVAL; return io_uring_create(entries, &p, params); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index e1ae46683301..311532ff6ce3 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -98,6 +98,7 @@ enum { #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ enum { IORING_OP_NOP, @@ -259,7 +260,7 @@ struct io_uring_params { __u32 cq_entries; __u32 flags; __u32 sq_thread_cpu; - __u32 sq_thread_idle; + __u64 sq_thread_idle; __u32 features; __u32 wq_fd; __u32 resv[3]; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
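For illustration, this is how an application might request the nanosecond
idle mode through the raw setup syscall, taking the RFC's
IORING_SETUP_IDLE_NS bit as proposed in the hunk above. The flag is not
part of any released uAPI, so the fallback #define is purely an
assumption of this sketch.

#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef IORING_SETUP_IDLE_NS
#define IORING_SETUP_IDLE_NS	(1U << 7)	/* RFC-only, from this patch */
#endif

/* Create an SQPOLL ring whose idle period is given in nanoseconds. */
static int setup_sqpoll_ring_idle_ns(unsigned entries, unsigned idle_ns)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_IDLE_NS;
	p.sq_thread_idle = idle_ns;	/* interpreted as ns, not ms */

	return syscall(__NR_io_uring_setup, entries, &p);
}

An idle_ns of, say, 60000 would then correspond to the 60us column in the
tables in the commit message above.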
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-28 13:32 ` [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle Hao Xu @ 2021-04-28 14:07 ` Pavel Begunkov 2021-04-28 14:16 ` Jens Axboe 2021-04-29 3:28 ` Hao Xu 0 siblings, 2 replies; 40+ messages in thread From: Pavel Begunkov @ 2021-04-28 14:07 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/28/21 2:32 PM, Hao Xu wrote: > currently unit of io_sq_thread_idle is millisecond, the smallest value > is 1ms, which means for IOPS > 1000, sqthread will very likely take > 100% cpu usage. This is not necessary in some cases, like users may > don't care about latency much in low IO pressure > (like 1000 < IOPS < 20000), but cpu resource does matter. So we offer > an option of nanosecond granularity of io_sq_thread_idle. Some test > results by fio below: If numbers justify it, I don't see why not do it in ns, but I'd suggest to get rid of all the mess and simply convert to jiffies during ring creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged. Or is there a reason for having it high precision, i.e. ktime()? > uring average latency:(us) > iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us > 2k 10.93 10.68 10.72 10.7 10.79 10.52 10.59 10.54 10.47 10.39 8.4 > 4k 10.55 10.48 10.51 10.42 10.35 8.34 > 6k 10.82 10.5 10.39 8.4 > 8k 10.44 10.45 10.34 8.39 > 10k 10.45 10.39 8.33 > > uring cpu usage of sqthread: > iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us > 2k 4% 14% 24% 34.70% 44.70% 55% 65.10% 75.40% 85.40% 95.70% 100% > 4k 7.70% 28.20% 48.50% 69% 90% 100% > 6k 11.30% 42% 73% 100% > 8k 15.30% 56.30% 97% 100% > 10k 19% 70% 100% > > aio average latency:(us) > iops latency 99th lat cpu > 2k 13.34 14.272 3% > 4k 13.195 14.016 7% > 6k 13.29 14.656 9.70% > 8k 13.2 14.656 12.70% > 10k 13.2 15 17% > > fio config is: > ./run_fio.sh > fio \ > --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ > --direct=1 --rw=randread --time_based=1 --runtime=300 \ > --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ > --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ > --io_sq_thread_idle=${2} > > in 2k IOPS, if latency of 10.93us is acceptable for an application, > then they get 100% - 4% = 96% reduction of cpu usage, while the latency > is smaller than aio(10.93us vs 13.34us). > > Signed-off-by: Hao Xu <[email protected]> > --- [snip] > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h > index e1ae46683301..311532ff6ce3 100644 > --- a/include/uapi/linux/io_uring.h > +++ b/include/uapi/linux/io_uring.h > @@ -98,6 +98,7 @@ enum { > #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ > #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ > #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ > +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ > > enum { > IORING_OP_NOP, > @@ -259,7 +260,7 @@ struct io_uring_params { > __u32 cq_entries; > __u32 flags; > __u32 sq_thread_cpu; > - __u32 sq_thread_idle; > + __u64 sq_thread_idle; breaks userspace API > __u32 features; > __u32 wq_fd; > __u32 resv[3]; > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-28 14:07 ` Pavel Begunkov @ 2021-04-28 14:16 ` Jens Axboe 2021-04-28 14:53 ` Pavel Begunkov 2021-04-29 3:41 ` Hao Xu 2021-04-29 3:28 ` Hao Xu 1 sibling, 2 replies; 40+ messages in thread From: Jens Axboe @ 2021-04-28 14:16 UTC (permalink / raw) To: Pavel Begunkov, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 8:07 AM, Pavel Begunkov wrote: >> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >> index e1ae46683301..311532ff6ce3 100644 >> --- a/include/uapi/linux/io_uring.h >> +++ b/include/uapi/linux/io_uring.h >> @@ -98,6 +98,7 @@ enum { >> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >> >> enum { >> IORING_OP_NOP, >> @@ -259,7 +260,7 @@ struct io_uring_params { >> __u32 cq_entries; >> __u32 flags; >> __u32 sq_thread_cpu; >> - __u32 sq_thread_idle; >> + __u64 sq_thread_idle; > > breaks userspace API And I don't think we need to. If you're using IDLE_NS, then the value should by definition be small enough that it'd fit in 32-bits. If you need higher timeouts, don't set it and it's in usec instead. So I'd just leave this one alone. -- Jens Axboe ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-28 14:16 ` Jens Axboe @ 2021-04-28 14:53 ` Pavel Begunkov 2021-04-28 14:54 ` Jens Axboe 2021-04-29 3:41 ` Hao Xu 1 sibling, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-28 14:53 UTC (permalink / raw) To: Jens Axboe, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 3:16 PM, Jens Axboe wrote: > On 4/28/21 8:07 AM, Pavel Begunkov wrote: >>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >>> index e1ae46683301..311532ff6ce3 100644 >>> --- a/include/uapi/linux/io_uring.h >>> +++ b/include/uapi/linux/io_uring.h >>> @@ -98,6 +98,7 @@ enum { >>> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >>> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >>> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >>> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >>> >>> enum { >>> IORING_OP_NOP, >>> @@ -259,7 +260,7 @@ struct io_uring_params { >>> __u32 cq_entries; >>> __u32 flags; >>> __u32 sq_thread_cpu; >>> - __u32 sq_thread_idle; >>> + __u64 sq_thread_idle; >> >> breaks userspace API > > And I don't think we need to. If you're using IDLE_NS, then the value > should by definition be small enough that it'd fit in 32-bits. If you > need higher timeouts, don't set it and it's in usec instead. > > So I'd just leave this one alone. Sounds good u64 time_ns = p->sq_thread_idle; if (!IDLE_NS) time_ns *= NSEC_PER_MSEC; idel = ns_to_jiffies(time_ns); -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-28 14:53 ` Pavel Begunkov @ 2021-04-28 14:54 ` Jens Axboe 0 siblings, 0 replies; 40+ messages in thread From: Jens Axboe @ 2021-04-28 14:54 UTC (permalink / raw) To: Pavel Begunkov, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 8:53 AM, Pavel Begunkov wrote: > On 4/28/21 3:16 PM, Jens Axboe wrote: >> On 4/28/21 8:07 AM, Pavel Begunkov wrote: >>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >>>> index e1ae46683301..311532ff6ce3 100644 >>>> --- a/include/uapi/linux/io_uring.h >>>> +++ b/include/uapi/linux/io_uring.h >>>> @@ -98,6 +98,7 @@ enum { >>>> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >>>> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >>>> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >>>> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >>>> >>>> enum { >>>> IORING_OP_NOP, >>>> @@ -259,7 +260,7 @@ struct io_uring_params { >>>> __u32 cq_entries; >>>> __u32 flags; >>>> __u32 sq_thread_cpu; >>>> - __u32 sq_thread_idle; >>>> + __u64 sq_thread_idle; >>> >>> breaks userspace API >> >> And I don't think we need to. If you're using IDLE_NS, then the value >> should by definition be small enough that it'd fit in 32-bits. If you >> need higher timeouts, don't set it and it's in usec instead. >> >> So I'd just leave this one alone. > > Sounds good > > u64 time_ns = p->sq_thread_idle; > if (!IDLE_NS) > time_ns *= NSEC_PER_MSEC; > idel = ns_to_jiffies(time_ns); Precisely! With the overlap being there, there's no need to make it bigger. And having nsec granularity if your idle time is in the msecs doesn't make a lot of sense. -- Jens Axboe ^ permalink raw reply [flat|nested] 40+ messages in thread
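Written out, the conversion sketched in the two mails above could look
like the following at ring creation time. This is a sketch only:
nsecs_to_jiffies64() stands in for the pseudocode's "ns_to_jiffies", the
fields come from the patch context, and the caveat noted in the comment
is what the rest of this sub-thread ends up discussing.

	/*
	 * Sketch: keep sq_thread_idle as __u32 in the uAPI and only
	 * reinterpret it when IORING_SETUP_IDLE_NS is set. Note that
	 * nsecs_to_jiffies64() truncates anything below one tick to 0,
	 * which is why the posted patch keeps a separate ns mode instead.
	 */
	u64 time_ns = p->sq_thread_idle;

	if (!(p->flags & IORING_SETUP_IDLE_NS))
		time_ns *= NSEC_PER_MSEC;	/* legacy millisecond input */

	ctx->sq_thread_idle = nsecs_to_jiffies64(time_ns);
	if (!ctx->sq_thread_idle)
		ctx->sq_thread_idle = HZ;	/* existing 1s default */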
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-28 14:16 ` Jens Axboe 2021-04-28 14:53 ` Pavel Begunkov @ 2021-04-29 3:41 ` Hao Xu 2021-04-29 9:11 ` Pavel Begunkov 1 sibling, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-04-29 3:41 UTC (permalink / raw) To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, Joseph Qi 在 2021/4/28 下午10:16, Jens Axboe 写道: > On 4/28/21 8:07 AM, Pavel Begunkov wrote: >>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >>> index e1ae46683301..311532ff6ce3 100644 >>> --- a/include/uapi/linux/io_uring.h >>> +++ b/include/uapi/linux/io_uring.h >>> @@ -98,6 +98,7 @@ enum { >>> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >>> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >>> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >>> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >>> >>> enum { >>> IORING_OP_NOP, >>> @@ -259,7 +260,7 @@ struct io_uring_params { >>> __u32 cq_entries; >>> __u32 flags; >>> __u32 sq_thread_cpu; >>> - __u32 sq_thread_idle; >>> + __u64 sq_thread_idle; >> >> breaks userspace API > > And I don't think we need to. If you're using IDLE_NS, then the value > should by definition be small enough that it'd fit in 32-bits. If you I make it u64 since I thought users may want a full flexibility to set idle in nanosecond granularity(eg. (1e6 + 10) ns cannot be set by millisecond granularity). But I'm not sure if this deserve changing the userspace API. > need higher timeouts, don't set it and it's in usec instead. > > So I'd just leave this one alone. > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-29 3:41 ` Hao Xu @ 2021-04-29 9:11 ` Pavel Begunkov 2021-05-05 14:07 ` Hao Xu 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-29 9:11 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/29/21 4:41 AM, Hao Xu wrote: > 在 2021/4/28 下午10:16, Jens Axboe 写道: >> On 4/28/21 8:07 AM, Pavel Begunkov wrote: >>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >>>> index e1ae46683301..311532ff6ce3 100644 >>>> --- a/include/uapi/linux/io_uring.h >>>> +++ b/include/uapi/linux/io_uring.h >>>> @@ -98,6 +98,7 @@ enum { >>>> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >>>> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >>>> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >>>> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >>>> enum { >>>> IORING_OP_NOP, >>>> @@ -259,7 +260,7 @@ struct io_uring_params { >>>> __u32 cq_entries; >>>> __u32 flags; >>>> __u32 sq_thread_cpu; >>>> - __u32 sq_thread_idle; >>>> + __u64 sq_thread_idle; >>> >>> breaks userspace API >> >> And I don't think we need to. If you're using IDLE_NS, then the value >> should by definition be small enough that it'd fit in 32-bits. If you > I make it u64 since I thought users may want a full flexibility to set > idle in nanosecond granularity(eg. (1e6 + 10) ns cannot be set by It's a really weird user requiring such a precision. u32 allows up to ~1s, and if more is needed users can switch to ms mode, so in the worst case the precision is 1/1000 of the desired value, more than enough. > millisecond granularity). But I'm not sure if this deserve changing the > userspace API. That's not about deserve or not, we can't break ABI. Can be worked around, e.g. by taking resv fields, but don't see a reason >> need higher timeouts, don't set it and it's in usec instead. >> >> So I'd just leave this one alone. >> > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-29 9:11 ` Pavel Begunkov @ 2021-05-05 14:07 ` Hao Xu 2021-05-05 17:40 ` Pavel Begunkov 0 siblings, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-05-05 14:07 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/4/29 下午5:11, Pavel Begunkov 写道: > On 4/29/21 4:41 AM, Hao Xu wrote: >> 在 2021/4/28 下午10:16, Jens Axboe 写道: >>> On 4/28/21 8:07 AM, Pavel Begunkov wrote: >>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >>>>> index e1ae46683301..311532ff6ce3 100644 >>>>> --- a/include/uapi/linux/io_uring.h >>>>> +++ b/include/uapi/linux/io_uring.h >>>>> @@ -98,6 +98,7 @@ enum { >>>>> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >>>>> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >>>>> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >>>>> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >>>>> enum { >>>>> IORING_OP_NOP, >>>>> @@ -259,7 +260,7 @@ struct io_uring_params { >>>>> __u32 cq_entries; >>>>> __u32 flags; >>>>> __u32 sq_thread_cpu; >>>>> - __u32 sq_thread_idle; >>>>> + __u64 sq_thread_idle; >>>> >>>> breaks userspace API >>> >>> And I don't think we need to. If you're using IDLE_NS, then the value >>> should by definition be small enough that it'd fit in 32-bits. If you >> I make it u64 since I thought users may want a full flexibility to set >> idle in nanosecond granularity(eg. (1e6 + 10) ns cannot be set by > > It's a really weird user requiring such a precision. u32 allows up to > ~1s, and if more is needed users can switch to ms mode, so in the worst > case the precision is 1/1000 of the desired value, more than enough. > >> millisecond granularity). But I'm not sure if this deserve changing the >> userspace API. > > That's not about deserve or not, we can't break ABI. Can be worked around, Is it for compatibility reason? > e.g. by taking resv fields, but don't see a reason > >>> need higher timeouts, don't set it and it's in usec instead. >>> >>> So I'd just leave this one alone. >>> >> > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-05-05 14:07 ` Hao Xu @ 2021-05-05 17:40 ` Pavel Begunkov 0 siblings, 0 replies; 40+ messages in thread From: Pavel Begunkov @ 2021-05-05 17:40 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 5/5/21 3:07 PM, Hao Xu wrote: > 在 2021/4/29 下午5:11, Pavel Begunkov 写道: >> On 4/29/21 4:41 AM, Hao Xu wrote: >>> 在 2021/4/28 下午10:16, Jens Axboe 写道: >>>> On 4/28/21 8:07 AM, Pavel Begunkov wrote: >>>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >>>>>> index e1ae46683301..311532ff6ce3 100644 >>>>>> --- a/include/uapi/linux/io_uring.h >>>>>> +++ b/include/uapi/linux/io_uring.h >>>>>> @@ -98,6 +98,7 @@ enum { >>>>>> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >>>>>> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >>>>>> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >>>>>> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >>>>>> enum { >>>>>> IORING_OP_NOP, >>>>>> @@ -259,7 +260,7 @@ struct io_uring_params { >>>>>> __u32 cq_entries; >>>>>> __u32 flags; >>>>>> __u32 sq_thread_cpu; >>>>>> - __u32 sq_thread_idle; >>>>>> + __u64 sq_thread_idle; >>>>> >>>>> breaks userspace API >>>> >>>> And I don't think we need to. If you're using IDLE_NS, then the value >>>> should by definition be small enough that it'd fit in 32-bits. If you >>> I make it u64 since I thought users may want a full flexibility to set >>> idle in nanosecond granularity(eg. (1e6 + 10) ns cannot be set by >> >> It's a really weird user requiring such a precision. u32 allows up to >> ~1s, and if more is needed users can switch to ms mode, so in the worst >> case the precision is 1/1000 of the desired value, more than enough. >> >>> millisecond granularity). But I'm not sure if this deserve changing the >>> userspace API. >> That's not about deserve or not, we can't break ABI. Can be worked around, > Is it for compatibility reason? Right, all binaries should work without recompilation, including those not using liburing. >> e.g. by taking resv fields, but don't see a reason >> >>>> need higher timeouts, don't set it and it's in usec instead. >>>> >>>> So I'd just leave this one alone. >>>> >>> >> > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-28 14:07 ` Pavel Begunkov 2021-04-28 14:16 ` Jens Axboe @ 2021-04-29 3:28 ` Hao Xu 2021-04-29 22:15 ` Pavel Begunkov 1 sibling, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-04-29 3:28 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/4/28 下午10:07, Pavel Begunkov 写道: > On 4/28/21 2:32 PM, Hao Xu wrote: >> currently unit of io_sq_thread_idle is millisecond, the smallest value >> is 1ms, which means for IOPS > 1000, sqthread will very likely take >> 100% cpu usage. This is not necessary in some cases, like users may >> don't care about latency much in low IO pressure >> (like 1000 < IOPS < 20000), but cpu resource does matter. So we offer >> an option of nanosecond granularity of io_sq_thread_idle. Some test >> results by fio below: > > If numbers justify it, I don't see why not do it in ns, but I'd suggest > to get rid of all the mess and simply convert to jiffies during ring > creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged. 1) here I keep millisecond mode for compatibility 2) I saw jiffies is calculated by HZ, and HZ could be large enough (like HZ = 1000) to make nsecs_to_jiffies64() = 0: u64 nsecs_to_jiffies64(u64 n) { #if (NSEC_PER_SEC % HZ) == 0 /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */ return div_u64(n, NSEC_PER_SEC / HZ); #elif (HZ % 512) == 0 /* overflow after 292 years if HZ = 1024 */ return div_u64(n * HZ / 512, NSEC_PER_SEC / 512); #else /* ¦* Generic case - optimized for cases where HZ is a multiple of 3. ¦* overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc. ¦*/ return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ); #endif } say HZ = 1000, then nsec_to_jiffies64(1us) = 1e3 / (1e9 / 1e3) = 0 iow, nsec_to_jiffies64() doesn't work for n < (1e9 / HZ). > > Or is there a reason for having it high precision, i.e. ktime()? > >> uring average latency:(us) >> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >> 2k 10.93 10.68 10.72 10.7 10.79 10.52 10.59 10.54 10.47 10.39 8.4 >> 4k 10.55 10.48 10.51 10.42 10.35 8.34 >> 6k 10.82 10.5 10.39 8.4 >> 8k 10.44 10.45 10.34 8.39 >> 10k 10.45 10.39 8.33 >> >> uring cpu usage of sqthread: >> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >> 2k 4% 14% 24% 34.70% 44.70% 55% 65.10% 75.40% 85.40% 95.70% 100% >> 4k 7.70% 28.20% 48.50% 69% 90% 100% >> 6k 11.30% 42% 73% 100% >> 8k 15.30% 56.30% 97% 100% >> 10k 19% 70% 100% >> >> aio average latency:(us) >> iops latency 99th lat cpu >> 2k 13.34 14.272 3% >> 4k 13.195 14.016 7% >> 6k 13.29 14.656 9.70% >> 8k 13.2 14.656 12.70% >> 10k 13.2 15 17% >> >> fio config is: >> ./run_fio.sh >> fio \ >> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >> --io_sq_thread_idle=${2} >> >> in 2k IOPS, if latency of 10.93us is acceptable for an application, >> then they get 100% - 4% = 96% reduction of cpu usage, while the latency >> is smaller than aio(10.93us vs 13.34us). 
>> >> Signed-off-by: Hao Xu <[email protected]> >> --- > [snip] >> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >> index e1ae46683301..311532ff6ce3 100644 >> --- a/include/uapi/linux/io_uring.h >> +++ b/include/uapi/linux/io_uring.h >> @@ -98,6 +98,7 @@ enum { >> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >> >> enum { >> IORING_OP_NOP, >> @@ -259,7 +260,7 @@ struct io_uring_params { >> __u32 cq_entries; >> __u32 flags; >> __u32 sq_thread_cpu; >> - __u32 sq_thread_idle; >> + __u64 sq_thread_idle; > > breaks userspace API > >> __u32 features; >> __u32 wq_fd; >> __u32 resv[3]; >> > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-29 3:28 ` Hao Xu @ 2021-04-29 22:15 ` Pavel Begunkov 2021-09-26 10:00 ` Hao Xu 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-29 22:15 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/29/21 4:28 AM, Hao Xu wrote: > 在 2021/4/28 下午10:07, Pavel Begunkov 写道: >> On 4/28/21 2:32 PM, Hao Xu wrote: >>> currently unit of io_sq_thread_idle is millisecond, the smallest value >>> is 1ms, which means for IOPS > 1000, sqthread will very likely take >>> 100% cpu usage. This is not necessary in some cases, like users may >>> don't care about latency much in low IO pressure >>> (like 1000 < IOPS < 20000), but cpu resource does matter. So we offer >>> an option of nanosecond granularity of io_sq_thread_idle. Some test >>> results by fio below: >> >> If numbers justify it, I don't see why not do it in ns, but I'd suggest >> to get rid of all the mess and simply convert to jiffies during ring >> creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged. > 1) here I keep millisecond mode for compatibility > 2) I saw jiffies is calculated by HZ, and HZ could be large enough > (like HZ = 1000) to make nsecs_to_jiffies64() = 0: > > u64 nsecs_to_jiffies64(u64 n) > { > #if (NSEC_PER_SEC % HZ) == 0 > /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */ > return div_u64(n, NSEC_PER_SEC / HZ); > #elif (HZ % 512) == 0 > /* overflow after 292 years if HZ = 1024 */ > return div_u64(n * HZ / 512, NSEC_PER_SEC / 512); > #else > /* > ¦* Generic case - optimized for cases where HZ is a multiple of 3. > ¦* overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc. > ¦*/ > return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ); > #endif > } > > say HZ = 1000, then nsec_to_jiffies64(1us) = 1e3 / (1e9 / 1e3) = 0 > iow, nsec_to_jiffies64() doesn't work for n < (1e9 / HZ). Agree, apparently jiffies precision fractions of a second, e.g. 0.001s But I'd much prefer to not duplicate all that. So, jiffies won't do, ktime() may be ok but a bit heavier that we'd like it to be... Jens, any chance you remember something in the middle? Like same source as ktime() but without the heavy correction it does. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-29 22:15 ` Pavel Begunkov @ 2021-09-26 10:00 ` Hao Xu 2021-09-28 10:51 ` Pavel Begunkov 0 siblings, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-09-26 10:00 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/4/30 上午6:15, Pavel Begunkov 写道: > On 4/29/21 4:28 AM, Hao Xu wrote: >> 在 2021/4/28 下午10:07, Pavel Begunkov 写道: >>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>> currently unit of io_sq_thread_idle is millisecond, the smallest value >>>> is 1ms, which means for IOPS > 1000, sqthread will very likely take >>>> 100% cpu usage. This is not necessary in some cases, like users may >>>> don't care about latency much in low IO pressure >>>> (like 1000 < IOPS < 20000), but cpu resource does matter. So we offer >>>> an option of nanosecond granularity of io_sq_thread_idle. Some test >>>> results by fio below: >>> >>> If numbers justify it, I don't see why not do it in ns, but I'd suggest >>> to get rid of all the mess and simply convert to jiffies during ring >>> creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged. >> 1) here I keep millisecond mode for compatibility >> 2) I saw jiffies is calculated by HZ, and HZ could be large enough >> (like HZ = 1000) to make nsecs_to_jiffies64() = 0: >> >> u64 nsecs_to_jiffies64(u64 n) >> { >> #if (NSEC_PER_SEC % HZ) == 0 >> /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */ >> return div_u64(n, NSEC_PER_SEC / HZ); >> #elif (HZ % 512) == 0 >> /* overflow after 292 years if HZ = 1024 */ >> return div_u64(n * HZ / 512, NSEC_PER_SEC / 512); >> #else >> /* >> ¦* Generic case - optimized for cases where HZ is a multiple of 3. >> ¦* overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc. >> ¦*/ >> return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ); >> #endif >> } >> >> say HZ = 1000, then nsec_to_jiffies64(1us) = 1e3 / (1e9 / 1e3) = 0 >> iow, nsec_to_jiffies64() doesn't work for n < (1e9 / HZ). > > Agree, apparently jiffies precision fractions of a second, e.g. 0.001s > But I'd much prefer to not duplicate all that. So, jiffies won't do, > ktime() may be ok but a bit heavier that we'd like it to be... > > Jens, any chance you remember something in the middle? Like same source > as ktime() but without the heavy correction it does. I'm gonna pick this one up again, currently this patch with ktime_get_ns() works good on our productions. This patch makes the latency a bit higher than before, but still lower than aio. I haven't gotten a faster alternate for ktime_get_ns(), any hints? Regards, Hao > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-26 10:00 ` Hao Xu @ 2021-09-28 10:51 ` Pavel Begunkov 2021-09-29 7:52 ` Hao Xu 2021-09-29 9:24 ` Hao Xu 0 siblings, 2 replies; 40+ messages in thread From: Pavel Begunkov @ 2021-09-28 10:51 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 9/26/21 11:00 AM, Hao Xu wrote: > 在 2021/4/30 上午6:15, Pavel Begunkov 写道: >> On 4/29/21 4:28 AM, Hao Xu wrote: >>> 在 2021/4/28 下午10:07, Pavel Begunkov 写道: >>>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>>> currently unit of io_sq_thread_idle is millisecond, the smallest value >>>>> is 1ms, which means for IOPS > 1000, sqthread will very likely take >>>>> 100% cpu usage. This is not necessary in some cases, like users may >>>>> don't care about latency much in low IO pressure >>>>> (like 1000 < IOPS < 20000), but cpu resource does matter. So we offer >>>>> an option of nanosecond granularity of io_sq_thread_idle. Some test >>>>> results by fio below: >>>> >>>> If numbers justify it, I don't see why not do it in ns, but I'd suggest >>>> to get rid of all the mess and simply convert to jiffies during ring >>>> creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged. >>> 1) here I keep millisecond mode for compatibility >>> 2) I saw jiffies is calculated by HZ, and HZ could be large enough >>> (like HZ = 1000) to make nsecs_to_jiffies64() = 0: >>> >>> u64 nsecs_to_jiffies64(u64 n) >>> { >>> #if (NSEC_PER_SEC % HZ) == 0 >>> /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */ >>> return div_u64(n, NSEC_PER_SEC / HZ); >>> #elif (HZ % 512) == 0 >>> /* overflow after 292 years if HZ = 1024 */ >>> return div_u64(n * HZ / 512, NSEC_PER_SEC / 512); >>> #else >>> /* >>> ¦* Generic case - optimized for cases where HZ is a multiple of 3. >>> ¦* overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc. >>> ¦*/ >>> return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ); >>> #endif >>> } >>> >>> say HZ = 1000, then nsec_to_jiffies64(1us) = 1e3 / (1e9 / 1e3) = 0 >>> iow, nsec_to_jiffies64() doesn't work for n < (1e9 / HZ). >> >> Agree, apparently jiffies precision fractions of a second, e.g. 0.001s >> But I'd much prefer to not duplicate all that. So, jiffies won't do, >> ktime() may be ok but a bit heavier that we'd like it to be... >> >> Jens, any chance you remember something in the middle? Like same source >> as ktime() but without the heavy correction it does. > I'm gonna pick this one up again, currently this patch > with ktime_get_ns() works good on our productions. This > patch makes the latency a bit higher than before, but > still lower than aio. > I haven't gotten a faster alternate for ktime_get_ns(), > any hints? Good, I'd suggest to look through Documentation/core-api/timekeeping.rst In particular coarse variants may be of interest. https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access Off topic: it sounds that you're a long user of SQPOLL. Interesting to ask how do you find it in general. i.e. does it help much with latency? Performance? Anything else? -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
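Concretely, the choice being discussed is between the precise and the
coarse monotonic accessors. A minimal sketch of switching the sqpoll
timeout source between them follows; the helper name io_sqd_now_ns() is
made up for illustration and is not from the posted patch.

#include <linux/ktime.h>
#include <linux/timekeeping.h>

/*
 * Sketch: ktime_get_ns() reads the clocksource and is precise but
 * heavier; ktime_get_coarse_ns() returns the value cached at the last
 * timer tick, so it is cheap but only roughly jiffies-accurate.
 */
static inline u64 io_sqd_now_ns(bool precise)
{
	return precise ? ktime_get_ns() : ktime_get_coarse_ns();
}

Whether the coarse variant is accurate enough for sub-millisecond idle
values is exactly the question raised in the replies below.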
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-28 10:51 ` Pavel Begunkov @ 2021-09-29 7:52 ` Hao Xu 2021-09-29 9:24 ` Hao Xu 1 sibling, 0 replies; 40+ messages in thread From: Hao Xu @ 2021-09-29 7:52 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/9/28 下午6:51, Pavel Begunkov 写道: > On 9/26/21 11:00 AM, Hao Xu wrote: >> 在 2021/4/30 上午6:15, Pavel Begunkov 写道: >>> On 4/29/21 4:28 AM, Hao Xu wrote: >>>> 在 2021/4/28 下午10:07, Pavel Begunkov 写道: >>>>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>>>> currently unit of io_sq_thread_idle is millisecond, the smallest value >>>>>> is 1ms, which means for IOPS > 1000, sqthread will very likely take >>>>>> 100% cpu usage. This is not necessary in some cases, like users may >>>>>> don't care about latency much in low IO pressure >>>>>> (like 1000 < IOPS < 20000), but cpu resource does matter. So we offer >>>>>> an option of nanosecond granularity of io_sq_thread_idle. Some test >>>>>> results by fio below: >>>>> >>>>> If numbers justify it, I don't see why not do it in ns, but I'd suggest >>>>> to get rid of all the mess and simply convert to jiffies during ring >>>>> creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged. >>>> 1) here I keep millisecond mode for compatibility >>>> 2) I saw jiffies is calculated by HZ, and HZ could be large enough >>>> (like HZ = 1000) to make nsecs_to_jiffies64() = 0: >>>> >>>> u64 nsecs_to_jiffies64(u64 n) >>>> { >>>> #if (NSEC_PER_SEC % HZ) == 0 >>>> /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */ >>>> return div_u64(n, NSEC_PER_SEC / HZ); >>>> #elif (HZ % 512) == 0 >>>> /* overflow after 292 years if HZ = 1024 */ >>>> return div_u64(n * HZ / 512, NSEC_PER_SEC / 512); >>>> #else >>>> /* >>>> ¦* Generic case - optimized for cases where HZ is a multiple of 3. >>>> ¦* overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc. >>>> ¦*/ >>>> return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ); >>>> #endif >>>> } >>>> >>>> say HZ = 1000, then nsec_to_jiffies64(1us) = 1e3 / (1e9 / 1e3) = 0 >>>> iow, nsec_to_jiffies64() doesn't work for n < (1e9 / HZ). >>> >>> Agree, apparently jiffies precision fractions of a second, e.g. 0.001s >>> But I'd much prefer to not duplicate all that. So, jiffies won't do, >>> ktime() may be ok but a bit heavier that we'd like it to be... >>> >>> Jens, any chance you remember something in the middle? Like same source >>> as ktime() but without the heavy correction it does. >> I'm gonna pick this one up again, currently this patch >> with ktime_get_ns() works good on our productions. This >> patch makes the latency a bit higher than before, but >> still lower than aio. >> I haven't gotten a faster alternate for ktime_get_ns(), >> any hints? > > Good, I'd suggest to look through Documentation/core-api/timekeeping.rst > In particular coarse variants may be of interest. > https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access > > > Off topic: it sounds that you're a long user of SQPOLL. Interesting to > ask how do you find it in general. i.e. does it help much with > latency? Performance? Anything else? It helps with the latency and iops(can not surely recall the number now..) It is useful when many user threads offload IO to just one sqthread. > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-28 10:51 ` Pavel Begunkov 2021-09-29 7:52 ` Hao Xu @ 2021-09-29 9:24 ` Hao Xu 2021-09-29 11:37 ` Pavel Begunkov 1 sibling, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-09-29 9:24 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/9/28 下午6:51, Pavel Begunkov 写道: > On 9/26/21 11:00 AM, Hao Xu wrote: >> 在 2021/4/30 上午6:15, Pavel Begunkov 写道: >>> On 4/29/21 4:28 AM, Hao Xu wrote: >>>> 在 2021/4/28 下午10:07, Pavel Begunkov 写道: >>>>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>>>> currently unit of io_sq_thread_idle is millisecond, the smallest value >>>>>> is 1ms, which means for IOPS > 1000, sqthread will very likely take >>>>>> 100% cpu usage. This is not necessary in some cases, like users may >>>>>> don't care about latency much in low IO pressure >>>>>> (like 1000 < IOPS < 20000), but cpu resource does matter. So we offer >>>>>> an option of nanosecond granularity of io_sq_thread_idle. Some test >>>>>> results by fio below: >>>>> >>>>> If numbers justify it, I don't see why not do it in ns, but I'd suggest >>>>> to get rid of all the mess and simply convert to jiffies during ring >>>>> creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged. >>>> 1) here I keep millisecond mode for compatibility >>>> 2) I saw jiffies is calculated by HZ, and HZ could be large enough >>>> (like HZ = 1000) to make nsecs_to_jiffies64() = 0: >>>> >>>> u64 nsecs_to_jiffies64(u64 n) >>>> { >>>> #if (NSEC_PER_SEC % HZ) == 0 >>>> /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */ >>>> return div_u64(n, NSEC_PER_SEC / HZ); >>>> #elif (HZ % 512) == 0 >>>> /* overflow after 292 years if HZ = 1024 */ >>>> return div_u64(n * HZ / 512, NSEC_PER_SEC / 512); >>>> #else >>>> /* >>>> ¦* Generic case - optimized for cases where HZ is a multiple of 3. >>>> ¦* overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc. >>>> ¦*/ >>>> return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ); >>>> #endif >>>> } >>>> >>>> say HZ = 1000, then nsec_to_jiffies64(1us) = 1e3 / (1e9 / 1e3) = 0 >>>> iow, nsec_to_jiffies64() doesn't work for n < (1e9 / HZ). >>> >>> Agree, apparently jiffies precision fractions of a second, e.g. 0.001s >>> But I'd much prefer to not duplicate all that. So, jiffies won't do, >>> ktime() may be ok but a bit heavier that we'd like it to be... >>> >>> Jens, any chance you remember something in the middle? Like same source >>> as ktime() but without the heavy correction it does. >> I'm gonna pick this one up again, currently this patch >> with ktime_get_ns() works good on our productions. This >> patch makes the latency a bit higher than before, but >> still lower than aio. >> I haven't gotten a faster alternate for ktime_get_ns(), >> any hints? > > Good, I'd suggest to look through Documentation/core-api/timekeeping.rst > In particular coarse variants may be of interest. > https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access > The coarse functions seems to be like jiffies, because they use the last timer tick(from the explanation in that doc, it seems the timer tick is in the same frequency as jiffies update). So I believe it is just another format of jiffies which is low accurate. > > Off topic: it sounds that you're a long user of SQPOLL. Interesting to > ask how do you find it in general. i.e. does it help much with > latency? Performance? Anything else? 
> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-29 9:24 ` Hao Xu @ 2021-09-29 11:37 ` Pavel Begunkov 2021-09-29 12:13 ` Hao Xu 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-09-29 11:37 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 9/29/21 10:24 AM, Hao Xu wrote: > 在 2021/9/28 下午6:51, Pavel Begunkov 写道: >> On 9/26/21 11:00 AM, Hao Xu wrote: [...] >>> I'm gonna pick this one up again, currently this patch >>> with ktime_get_ns() works good on our productions. This >>> patch makes the latency a bit higher than before, but >>> still lower than aio. >>> I haven't gotten a faster alternate for ktime_get_ns(), >>> any hints? >> >> Good, I'd suggest to look through Documentation/core-api/timekeeping.rst >> In particular coarse variants may be of interest. >> https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access >> > The coarse functions seems to be like jiffies, because they use the last > timer tick(from the explanation in that doc, it seems the timer tick is > in the same frequency as jiffies update). So I believe it is just > another format of jiffies which is low accurate. I haven't looked into the details, but it seems that unlike jiffies for the coarse mode 10ms (or whatever) is the worst case, but it _may_ be much better on average and feasible for your case, but can't predict if that's really the case in a real system and what will be the relative error comparing to normal ktime_ns(). -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-29 11:37 ` Pavel Begunkov @ 2021-09-29 12:13 ` Hao Xu 2021-09-30 8:51 ` Pavel Begunkov 0 siblings, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-09-29 12:13 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/9/29 下午7:37, Pavel Begunkov 写道: > On 9/29/21 10:24 AM, Hao Xu wrote: >> 在 2021/9/28 下午6:51, Pavel Begunkov 写道: >>> On 9/26/21 11:00 AM, Hao Xu wrote: > [...] >>>> I'm gonna pick this one up again, currently this patch >>>> with ktime_get_ns() works good on our productions. This >>>> patch makes the latency a bit higher than before, but >>>> still lower than aio. >>>> I haven't gotten a faster alternate for ktime_get_ns(), >>>> any hints? >>> >>> Good, I'd suggest to look through Documentation/core-api/timekeeping.rst >>> In particular coarse variants may be of interest. >>> https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access >>> >> The coarse functions seems to be like jiffies, because they use the last >> timer tick(from the explanation in that doc, it seems the timer tick is >> in the same frequency as jiffies update). So I believe it is just >> another format of jiffies which is low accurate. > > I haven't looked into the details, but it seems that unlike jiffies for > the coarse mode 10ms (or whatever) is the worst case, but it _may_ be Maybe I'm wrong, but for jiffies, 10ms uis also the worst case, no? (say HZ = 100, then jiffies updated by 1 every 10ms) > much better on average and feasible for your case, but can't predict > if that's really the case in a real system and what will be the > relative error comparing to normal ktime_ns(). > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-29 12:13 ` Hao Xu @ 2021-09-30 8:51 ` Pavel Begunkov 2021-09-30 12:04 ` Pavel Begunkov 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-09-30 8:51 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 9/29/21 1:13 PM, Hao Xu wrote: > 在 2021/9/29 下午7:37, Pavel Begunkov 写道: >> On 9/29/21 10:24 AM, Hao Xu wrote: >>> 在 2021/9/28 下午6:51, Pavel Begunkov 写道: >>>> On 9/26/21 11:00 AM, Hao Xu wrote: >> [...] >>>>> I'm gonna pick this one up again, currently this patch >>>>> with ktime_get_ns() works good on our productions. This >>>>> patch makes the latency a bit higher than before, but >>>>> still lower than aio. >>>>> I haven't gotten a faster alternate for ktime_get_ns(), >>>>> any hints? >>>> >>>> Good, I'd suggest to look through Documentation/core-api/timekeeping.rst >>>> In particular coarse variants may be of interest. >>>> https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access >>>> >>> The coarse functions seems to be like jiffies, because they use the last >>> timer tick(from the explanation in that doc, it seems the timer tick is >>> in the same frequency as jiffies update). So I believe it is just >>> another format of jiffies which is low accurate. >> >> I haven't looked into the details, but it seems that unlike jiffies for >> the coarse mode 10ms (or whatever) is the worst case, but it _may_ be > Maybe I'm wrong, but for jiffies, 10ms uis also the worst case, no? > (say HZ = 100, then jiffies updated by 1 every 10ms) I'm speculating, but it sounds it's updated on every call to ktime_ns() in the system, so if someone else calls ktime_ns() every 1us, than the resolution will be 1us, where with jiffies the update interval is strictly 10ms when HZ=100. May be not true, need to see the code. >> much better on average and feasible for your case, but can't predict >> if that's really the case in a real system and what will be the >> relative error comparing to normal ktime_ns(). >> > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-30 8:51 ` Pavel Begunkov @ 2021-09-30 12:04 ` Pavel Begunkov 2021-10-05 15:00 ` Hao Xu 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-09-30 12:04 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 9/30/21 9:51 AM, Pavel Begunkov wrote: > On 9/29/21 1:13 PM, Hao Xu wrote: >> 在 2021/9/29 下午7:37, Pavel Begunkov 写道: >>> On 9/29/21 10:24 AM, Hao Xu wrote: >>>> 在 2021/9/28 下午6:51, Pavel Begunkov 写道: >>>>> On 9/26/21 11:00 AM, Hao Xu wrote: >>> [...] >>>>>> I'm gonna pick this one up again, currently this patch >>>>>> with ktime_get_ns() works good on our productions. This >>>>>> patch makes the latency a bit higher than before, but >>>>>> still lower than aio. >>>>>> I haven't gotten a faster alternate for ktime_get_ns(), >>>>>> any hints? >>>>> >>>>> Good, I'd suggest to look through Documentation/core-api/timekeeping.rst >>>>> In particular coarse variants may be of interest. >>>>> https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access >>>>> >>>> The coarse functions seems to be like jiffies, because they use the last >>>> timer tick(from the explanation in that doc, it seems the timer tick is >>>> in the same frequency as jiffies update). So I believe it is just >>>> another format of jiffies which is low accurate. >>> >>> I haven't looked into the details, but it seems that unlike jiffies for >>> the coarse mode 10ms (or whatever) is the worst case, but it _may_ be >> Maybe I'm wrong, but for jiffies, 10ms uis also the worst case, no? >> (say HZ = 100, then jiffies updated by 1 every 10ms) > > I'm speculating, but it sounds it's updated on every call to ktime_ns() > in the system, so if someone else calls ktime_ns() every 1us, than the > resolution will be 1us, where with jiffies the update interval is strictly > 10ms when HZ=100. May be not true, need to see the code. Taking a second quick look, doesn't seem to be the case indeed. And it's limited to your feature anyway, so the overhead of ktime_get() shouldn't matter much. >>> much better on average and feasible for your case, but can't predict >>> if that's really the case in a real system and what will be the >>> relative error comparing to normal ktime_ns(). -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-30 12:04 ` Pavel Begunkov @ 2021-10-05 15:00 ` Hao Xu 0 siblings, 0 replies; 40+ messages in thread From: Hao Xu @ 2021-10-05 15:00 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/9/30 下午8:04, Pavel Begunkov 写道: > On 9/30/21 9:51 AM, Pavel Begunkov wrote: >> On 9/29/21 1:13 PM, Hao Xu wrote: >>> 在 2021/9/29 下午7:37, Pavel Begunkov 写道: >>>> On 9/29/21 10:24 AM, Hao Xu wrote: >>>>> 在 2021/9/28 下午6:51, Pavel Begunkov 写道: >>>>>> On 9/26/21 11:00 AM, Hao Xu wrote: >>>> [...] >>>>>>> I'm gonna pick this one up again, currently this patch >>>>>>> with ktime_get_ns() works good on our productions. This >>>>>>> patch makes the latency a bit higher than before, but >>>>>>> still lower than aio. >>>>>>> I haven't gotten a faster alternate for ktime_get_ns(), >>>>>>> any hints? >>>>>> >>>>>> Good, I'd suggest to look through Documentation/core-api/timekeeping.rst >>>>>> In particular coarse variants may be of interest. >>>>>> https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access >>>>>> >>>>> The coarse functions seems to be like jiffies, because they use the last >>>>> timer tick(from the explanation in that doc, it seems the timer tick is >>>>> in the same frequency as jiffies update). So I believe it is just >>>>> another format of jiffies which is low accurate. >>>> >>>> I haven't looked into the details, but it seems that unlike jiffies for >>>> the coarse mode 10ms (or whatever) is the worst case, but it _may_ be >>> Maybe I'm wrong, but for jiffies, 10ms uis also the worst case, no? >>> (say HZ = 100, then jiffies updated by 1 every 10ms) >> >> I'm speculating, but it sounds it's updated on every call to ktime_ns() >> in the system, so if someone else calls ktime_ns() every 1us, than the >> resolution will be 1us, where with jiffies the update interval is strictly >> 10ms when HZ=100. May be not true, need to see the code. > > Taking a second quick look, doesn't seem to be the case indeed. And it's > limited to your feature anyway, so the overhead of ktime_get() shouldn't > matter much. Thanks, I'll continue on this when return from the holiday. > >>>> much better on average and feasible for your case, but can't predict >>>> if that's really the case in a real system and what will be the >>>> relative error comparing to normal ktime_ns(). > ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 13:32 [PATCH RFC 5.13 0/2] adaptive sqpoll and its wakeup optimization Hao Xu 2021-04-28 13:32 ` [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle Hao Xu @ 2021-04-28 13:32 ` Hao Xu 2021-04-28 14:12 ` Jens Axboe ` (2 more replies) 1 sibling, 3 replies; 40+ messages in thread From: Hao Xu @ 2021-04-28 13:32 UTC (permalink / raw) To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi sqes are submitted by sqthread when it is leveraged, which means there is IO latency when waking up sqthread. To wipe it out, submit limited number of sqes in the original task context. Tests result below: 99th latency: iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us with this patch: 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 without this patch: 2k 15 14 15 15 15 14 15 14 14 13 11.84 fio config: ./run_fio.sh fio \ --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ --direct=1 --rw=randread --time_based=1 --runtime=300 \ --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ --io_sq_thread_idle=${2} Signed-off-by: Hao Xu <[email protected]> --- fs/io_uring.c | 29 +++++++++++++++++++++++------ include/uapi/linux/io_uring.h | 1 + 2 files changed, 24 insertions(+), 6 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 1871fad48412..f0a01232671e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *link = io_prep_linked_timeout(req); - struct io_uring_task *tctx = req->task->io_uring; + struct io_uring_task *tctx = NULL; + + if (ctx->sq_data && ctx->sq_data->thread) + tctx = ctx->sq_data->thread->io_uring; + else + tctx = req->task->io_uring; BUG_ON(!tctx); BUG_ON(!tctx->io_wq); @@ -9063,9 +9068,10 @@ static void io_uring_try_cancel(struct files_struct *files) xa_for_each(&tctx->xa, index, node) { struct io_ring_ctx *ctx = node->ctx; - /* sqpoll task will cancel all its requests */ - if (!ctx->sq_data) - io_uring_try_cancel_requests(ctx, current, files); + /* + * for sqpoll ctx, there may be requests in task_works etc. 
+ */ + io_uring_try_cancel_requests(ctx, current, files); } } @@ -9271,7 +9277,8 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz io_run_task_work(); if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP | - IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG))) + IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG | + IORING_ENTER_SQ_DEPUTY))) return -EINVAL; f = fdget(fd); @@ -9304,8 +9311,18 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz if (unlikely(ctx->sq_data->thread == NULL)) { goto out; } - if (flags & IORING_ENTER_SQ_WAKEUP) + if (flags & IORING_ENTER_SQ_WAKEUP) { wake_up(&ctx->sq_data->wait); + if ((flags & IORING_ENTER_SQ_DEPUTY) && + !(ctx->flags & IORING_SETUP_IOPOLL)) { + ret = io_uring_add_task_file(ctx); + if (unlikely(ret)) + goto out; + mutex_lock(&ctx->uring_lock); + io_submit_sqes(ctx, min(to_submit, 8U)); + mutex_unlock(&ctx->uring_lock); + } + } if (flags & IORING_ENTER_SQ_WAIT) { ret = io_sqpoll_wait_sq(ctx); if (ret) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 311532ff6ce3..b1130fec2b7d 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -251,6 +251,7 @@ struct io_cqring_offsets { #define IORING_ENTER_SQ_WAKEUP (1U << 1) #define IORING_ENTER_SQ_WAIT (1U << 2) #define IORING_ENTER_EXT_ARG (1U << 3) +#define IORING_ENTER_SQ_DEPUTY (1U << 4) /* * Passed in for io_uring_setup(2). Copied back with updated info on success -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
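For context, a rough userspace-side sketch of how the new enter flag would be used (raw syscall rather than liburing; the helper name and the plain volatile read are illustrative simplifications, and IORING_ENTER_SQ_DEPUTY comes from the patched uapi header above): the application only enters the kernel when the SQ ring says the sqpoll thread needs a wakeup, and with this patch that same call can also submit a few sqes inline.

#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/* sq_flags points at the mmap'ed SQ ring flags word; real code should
 * use a proper acquire/atomic read instead of a plain volatile load. */
static int kick_sqpoll(int ring_fd, unsigned int to_submit,
		       const unsigned int *sq_flags)
{
	unsigned int flags = 0;

	if (*(volatile const unsigned int *)sq_flags & IORING_SQ_NEED_WAKEUP)
		flags = IORING_ENTER_SQ_WAKEUP | IORING_ENTER_SQ_DEPUTY;
	if (!flags)
		return 0;	/* sqpoll thread is still polling; nothing to do */
	return syscall(__NR_io_uring_enter, ring_fd, to_submit, 0, flags,
		       NULL, 0);
}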
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 13:32 ` [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread Hao Xu @ 2021-04-28 14:12 ` Jens Axboe 2021-04-29 4:12 ` Hao Xu 2021-04-28 14:34 ` Pavel Begunkov 2021-04-29 22:02 ` Pavel Begunkov 2 siblings, 1 reply; 40+ messages in thread From: Jens Axboe @ 2021-04-28 14:12 UTC (permalink / raw) To: Hao Xu; +Cc: io-uring, Pavel Begunkov, Joseph Qi On 4/28/21 7:32 AM, Hao Xu wrote: > sqes are submitted by sqthread when it is leveraged, which means there > is IO latency when waking up sqthread. To wipe it out, submit limited > number of sqes in the original task context. > Tests result below: > > 99th latency: > iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us > with this patch: > 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 > without this patch: > 2k 15 14 15 15 15 14 15 14 14 13 11.84 > > fio config: > ./run_fio.sh > fio \ > --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ > --direct=1 --rw=randread --time_based=1 --runtime=300 \ > --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ > --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ > --io_sq_thread_idle=${2} Interesting concept! One question: > @@ -9304,8 +9311,18 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz > if (unlikely(ctx->sq_data->thread == NULL)) { > goto out; > } > - if (flags & IORING_ENTER_SQ_WAKEUP) > + if (flags & IORING_ENTER_SQ_WAKEUP) { > wake_up(&ctx->sq_data->wait); > + if ((flags & IORING_ENTER_SQ_DEPUTY) && > + !(ctx->flags & IORING_SETUP_IOPOLL)) { > + ret = io_uring_add_task_file(ctx); > + if (unlikely(ret)) > + goto out; > + mutex_lock(&ctx->uring_lock); > + io_submit_sqes(ctx, min(to_submit, 8U)); > + mutex_unlock(&ctx->uring_lock); > + } > + } Do we want to wake the sqpoll thread _post_ submitting these ios? The idea being that if we're submitting now after a while (since the thread is sleeping), then we're most likely going to be submitting more than just this single batch. And the wakeup would do the same if done after the submit, it'd just not interfere with this submit. You could imagine a scenario where we do the wake and the sqpoll thread beats us to the submit, and now we're just stuck waiting for the uring_lock and end up doing nothing. Maybe you guys already tested this? Also curious if you did, what kind of requests are being submitted? That can have quite a bit of effect on how quickly the submit is done. -- Jens Axboe ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:12 ` Jens Axboe @ 2021-04-29 4:12 ` Hao Xu 0 siblings, 0 replies; 40+ messages in thread From: Hao Xu @ 2021-04-29 4:12 UTC (permalink / raw) To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi 在 2021/4/28 下午10:12, Jens Axboe 写道: > On 4/28/21 7:32 AM, Hao Xu wrote: >> sqes are submitted by sqthread when it is leveraged, which means there >> is IO latency when waking up sqthread. To wipe it out, submit limited >> number of sqes in the original task context. >> Tests result below: >> >> 99th latency: >> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >> with this patch: >> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >> without this patch: >> 2k 15 14 15 15 15 14 15 14 14 13 11.84 >> >> fio config: >> ./run_fio.sh >> fio \ >> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >> --io_sq_thread_idle=${2} > > Interesting concept! One question: > >> @@ -9304,8 +9311,18 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz >> if (unlikely(ctx->sq_data->thread == NULL)) { >> goto out; >> } >> - if (flags & IORING_ENTER_SQ_WAKEUP) >> + if (flags & IORING_ENTER_SQ_WAKEUP) { >> wake_up(&ctx->sq_data->wait); >> + if ((flags & IORING_ENTER_SQ_DEPUTY) && >> + !(ctx->flags & IORING_SETUP_IOPOLL)) { >> + ret = io_uring_add_task_file(ctx); >> + if (unlikely(ret)) >> + goto out; >> + mutex_lock(&ctx->uring_lock); >> + io_submit_sqes(ctx, min(to_submit, 8U)); >> + mutex_unlock(&ctx->uring_lock); >> + } >> + } > > Do we want to wake the sqpoll thread _post_ submitting these ios? The > idea being that if we're submitting now after a while (since the thread > is sleeping), then we're most likely going to be submitting more than > just this single batch. And the wakeup would do the same if done after yes, prediction that it will likely submit more than just that single batch is the idea behind this patch. > the submit, it'd just not interfere with this submit. You could imagine > a scenario where we do the wake and the sqpoll thread beats us to the > submit, and now we're just stuck waiting for the uring_lock and end up > doing nothing. true, I think trylock may be good to address this. > > Maybe you guys already tested this? Also curious if you did, what kind > of requests are being submitted? That can have quite a bit of effect on > how quickly the submit is done. we currently just have put it on fio benchmark and liburing tests, this patch definitely needs more thinking. > ^ permalink raw reply [flat|nested] 40+ messages in thread
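To make the two review points above concrete, one possible reshaping of the io_uring_enter() hunk (a sketch only, not a posted revision): do the inline submit first, take uring_lock with a trylock so the caller never blocks behind the sqpoll thread, and only then issue the wakeup.

	if (flags & IORING_ENTER_SQ_WAKEUP) {
		if ((flags & IORING_ENTER_SQ_DEPUTY) &&
		    !(ctx->flags & IORING_SETUP_IOPOLL)) {
			ret = io_uring_add_task_file(ctx);
			if (unlikely(ret))
				goto out;
			/* if the sqpoll thread already holds uring_lock it
			 * is submitting anyway, so don't wait behind it */
			if (mutex_trylock(&ctx->uring_lock)) {
				io_submit_sqes(ctx, min(to_submit, 8U));
				mutex_unlock(&ctx->uring_lock);
			}
		}
		/* wake only after the inline submit so the two don't race */
		wake_up(&ctx->sq_data->wait);
	}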
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 13:32 ` [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread Hao Xu 2021-04-28 14:12 ` Jens Axboe @ 2021-04-28 14:34 ` Pavel Begunkov 2021-04-28 14:37 ` Pavel Begunkov ` (2 more replies) 2021-04-29 22:02 ` Pavel Begunkov 2 siblings, 3 replies; 40+ messages in thread From: Pavel Begunkov @ 2021-04-28 14:34 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/28/21 2:32 PM, Hao Xu wrote: > sqes are submitted by sqthread when it is leveraged, which means there > is IO latency when waking up sqthread. To wipe it out, submit limited > number of sqes in the original task context. > Tests result below: Frankly, it can be a nest of corner cases if not now then in the future, leading to a high maintenance burden. Hence, if we consider the change, I'd rather want to limit the userspace exposure, so it can be removed if needed. A noticeable change of behaviour here, as Hao recently asked, is that the ring can be passed to a task from a completely another thread group, and so the feature would execute from that context, not from the original/sqpoll one. Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be ignored if the previous point is addressed. > > 99th latency: > iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us > with this patch: > 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 > without this patch: > 2k 15 14 15 15 15 14 15 14 14 13 11.84 Not sure the second nine describes it well enough, please can you add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. Btw, how happened that only some of the numbers have fractional part? Can't believe they all but 3 were close enough to integer values. > fio config: > ./run_fio.sh > fio \ > --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ > --direct=1 --rw=randread --time_based=1 --runtime=300 \ > --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ > --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ > --io_sq_thread_idle=${2} > > Signed-off-by: Hao Xu <[email protected]> > --- > fs/io_uring.c | 29 +++++++++++++++++++++++------ > include/uapi/linux/io_uring.h | 1 + > 2 files changed, 24 insertions(+), 6 deletions(-) > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index 1871fad48412..f0a01232671e 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) > { > struct io_ring_ctx *ctx = req->ctx; > struct io_kiocb *link = io_prep_linked_timeout(req); > - struct io_uring_task *tctx = req->task->io_uring; > + struct io_uring_task *tctx = NULL; > + > + if (ctx->sq_data && ctx->sq_data->thread) > + tctx = ctx->sq_data->thread->io_uring; without park it's racy, sq_data->thread may become NULL and removed, as well as its ->io_uring. > + else > + tctx = req->task->io_uring; > > BUG_ON(!tctx); > BUG_ON(!tctx->io_wq); [snip] -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
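A sketch of the pattern the race comment above is pointing at (illustrative, not a posted fix; it assumes the existing io_sq_thread_park()/io_sq_thread_unpark() helpers in fs/io_uring.c and a made-up function name): sqd->thread, and with it its ->io_uring, can be torn down concurrently, so it may only be dereferenced while the sqpoll thread is parked.

/* Illustrative only: queue work through the sqpoll thread's tctx
 * without racing against sqpoll teardown. */
static void io_queue_via_sqpoll_tctx(struct io_sq_data *sqd,
				     struct io_wq_work *work)
{
	io_sq_thread_park(sqd);		/* pins sqd->thread while parked */
	if (sqd->thread)
		io_wq_enqueue(sqd->thread->io_uring->io_wq, work);
	io_sq_thread_unpark(sqd);
}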
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:34 ` Pavel Begunkov @ 2021-04-28 14:37 ` Pavel Begunkov 2021-04-29 4:37 ` Hao Xu 2021-04-28 14:39 ` Jens Axboe 2021-04-29 8:44 ` Hao Xu 2 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-28 14:37 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/28/21 3:34 PM, Pavel Begunkov wrote: > On 4/28/21 2:32 PM, Hao Xu wrote: >> sqes are submitted by sqthread when it is leveraged, which means there >> is IO latency when waking up sqthread. To wipe it out, submit limited >> number of sqes in the original task context. >> Tests result below: > > Frankly, it can be a nest of corner cases if not now then in the future, > leading to a high maintenance burden. Hence, if we consider the change, > I'd rather want to limit the userspace exposure, so it can be removed > if needed. > > A noticeable change of behaviour here, as Hao recently asked, is that > the ring can be passed to a task from a completely another thread group, > and so the feature would execute from that context, not from the > original/sqpoll one. So maybe something like: if (same_thread_group()) { /* submit */ } > > Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be > ignored if the previous point is addressed. I'd question whether it'd be better with the flag or without doing this feature by default. > >> >> 99th latency: >> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >> with this patch: >> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >> without this patch: >> 2k 15 14 15 15 15 14 15 14 14 13 11.84 > > Not sure the second nine describes it well enough, please can you > add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. > > Btw, how happened that only some of the numbers have fractional part? > Can't believe they all but 3 were close enough to integer values. > >> fio config: >> ./run_fio.sh >> fio \ >> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >> --io_sq_thread_idle=${2} >> >> Signed-off-by: Hao Xu <[email protected]> >> --- >> fs/io_uring.c | 29 +++++++++++++++++++++++------ >> include/uapi/linux/io_uring.h | 1 + >> 2 files changed, 24 insertions(+), 6 deletions(-) >> >> diff --git a/fs/io_uring.c b/fs/io_uring.c >> index 1871fad48412..f0a01232671e 100644 >> --- a/fs/io_uring.c >> +++ b/fs/io_uring.c >> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >> { >> struct io_ring_ctx *ctx = req->ctx; >> struct io_kiocb *link = io_prep_linked_timeout(req); >> - struct io_uring_task *tctx = req->task->io_uring; >> + struct io_uring_task *tctx = NULL; >> + >> + if (ctx->sq_data && ctx->sq_data->thread) >> + tctx = ctx->sq_data->thread->io_uring; > > without park it's racy, sq_data->thread may become NULL and removed, > as well as its ->io_uring. > >> + else >> + tctx = req->task->io_uring; >> >> BUG_ON(!tctx); >> BUG_ON(!tctx->io_wq); > > [snip] > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
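A sketch of the guard being suggested (illustrative, not a posted patch; the helper name is made up and the use of sqd->task_tgid is an assumption about the existing io_sq_data bookkeeping): one way to realize the same_thread_group() idea without dereferencing sqd->thread is to compare against the tgid recorded when the sqpoll setup was created, so a ring passed to a foreign process silently falls back to a plain wakeup instead of submitting from that alien context.

static bool io_deputy_submit_allowed(struct io_ring_ctx *ctx)
{
	struct io_sq_data *sqd = ctx->sq_data;

	if (!sqd || (ctx->flags & IORING_SETUP_IOPOLL))
		return false;
	/* task_tgid is recorded when the sqpoll data is set up */
	return sqd->task_tgid == task_tgid_nr(current);
}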
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:37 ` Pavel Begunkov @ 2021-04-29 4:37 ` Hao Xu 2021-04-29 9:28 ` Pavel Begunkov 0 siblings, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-04-29 4:37 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/4/28 下午10:37, Pavel Begunkov 写道: > On 4/28/21 3:34 PM, Pavel Begunkov wrote: >> On 4/28/21 2:32 PM, Hao Xu wrote: >>> sqes are submitted by sqthread when it is leveraged, which means there >>> is IO latency when waking up sqthread. To wipe it out, submit limited >>> number of sqes in the original task context. >>> Tests result below: >> >> Frankly, it can be a nest of corner cases if not now then in the future, >> leading to a high maintenance burden. Hence, if we consider the change, >> I'd rather want to limit the userspace exposure, so it can be removed >> if needed. >> >> A noticeable change of behaviour here, as Hao recently asked, is that >> the ring can be passed to a task from a completely another thread group, >> and so the feature would execute from that context, not from the >> original/sqpoll one. > > So maybe something like: > if (same_thread_group()) { > /* submit */ > }I thought this case(cross independent processes) for some time, Pavel, could you give more hints about how this may trigger errors? > >> >> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >> ignored if the previous point is addressed. > > I'd question whether it'd be better with the flag or without doing > this feature by default. Just like what Jens said, the flag here is to allow users to do their decision, there may be cases like a application wants to offload as much as possible IO related work to sqpoll, so that it can be dedicated to computation work etc. > >> >>> >>> 99th latency: >>> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >>> with this patch: >>> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >>> without this patch: >>> 2k 15 14 15 15 15 14 15 14 14 13 11.84 >> >> Not sure the second nine describes it well enough, please can you >> add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. Sure, I will. >> >> Btw, how happened that only some of the numbers have fractional part? >> Can't believe they all but 3 were close enough to integer values. This confused me a little bit too, but it is indeed what fio outputs. 
>> >>> fio config: >>> ./run_fio.sh >>> fio \ >>> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >>> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >>> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >>> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >>> --io_sq_thread_idle=${2} >>> >>> Signed-off-by: Hao Xu <[email protected]> >>> --- >>> fs/io_uring.c | 29 +++++++++++++++++++++++------ >>> include/uapi/linux/io_uring.h | 1 + >>> 2 files changed, 24 insertions(+), 6 deletions(-) >>> >>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>> index 1871fad48412..f0a01232671e 100644 >>> --- a/fs/io_uring.c >>> +++ b/fs/io_uring.c >>> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >>> { >>> struct io_ring_ctx *ctx = req->ctx; >>> struct io_kiocb *link = io_prep_linked_timeout(req); >>> - struct io_uring_task *tctx = req->task->io_uring; >>> + struct io_uring_task *tctx = NULL; >>> + >>> + if (ctx->sq_data && ctx->sq_data->thread) >>> + tctx = ctx->sq_data->thread->io_uring; >> >> without park it's racy, sq_data->thread may become NULL and removed, >> as well as its ->io_uring. >> >>> + else >>> + tctx = req->task->io_uring; >>> >>> BUG_ON(!tctx); >>> BUG_ON(!tctx->io_wq); >> >> [snip] >> > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-29 4:37 ` Hao Xu @ 2021-04-29 9:28 ` Pavel Begunkov 2021-05-05 11:20 ` Hao Xu 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-29 9:28 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/29/21 5:37 AM, Hao Xu wrote: > 在 2021/4/28 下午10:37, Pavel Begunkov 写道: >> On 4/28/21 3:34 PM, Pavel Begunkov wrote: >>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>> sqes are submitted by sqthread when it is leveraged, which means there >>>> is IO latency when waking up sqthread. To wipe it out, submit limited >>>> number of sqes in the original task context. >>>> Tests result below: >>> >>> Frankly, it can be a nest of corner cases if not now then in the future, >>> leading to a high maintenance burden. Hence, if we consider the change, >>> I'd rather want to limit the userspace exposure, so it can be removed >>> if needed. >>> >>> A noticeable change of behaviour here, as Hao recently asked, is that >>> the ring can be passed to a task from a completely another thread group, >>> and so the feature would execute from that context, not from the >>> original/sqpoll one. >> >> So maybe something like: >> if (same_thread_group()) { >> /* submit */ >> }I thought this case(cross independent processes) for some time, Pavel, > could you give more hints about how this may trigger errors? Currently? We need to audit cancellation, but don't think it's a problem. But as said, it's about the future. Your patch adds a new quirky userspace behaviour (submitting from alien context as described), and once commited it can't be removed and should be maintained. I can easily imagine it either over-complicating cancellations (and we had enough of troubles with it), or just preventing more important optimisations, or anything else often happening with new features. >> >>> >>> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >>> ignored if the previous point is addressed. >> >> I'd question whether it'd be better with the flag or without doing >> this feature by default. > Just like what Jens said, the flag here is to allow users to do their > decision, there may be cases like a application wants to offload as much > as possible IO related work to sqpoll, so that it can be dedicated to > computation work etc. >> >>> >>>> >>>> 99th latency: >>>> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >>>> with this patch: >>>> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >>>> without this patch: >>>> 2k 15 14 15 15 15 14 15 14 14 13 11.84 >>> >>> Not sure the second nine describes it well enough, please can you >>> add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. > Sure, I will. Forgot but it's important, should compared with non-sqpoll as well because the feature is taking the middle ground between them. >>> >>> Btw, how happened that only some of the numbers have fractional part? >>> Can't believe they all but 3 were close enough to integer values. > This confused me a little bit too, but it is indeed what fio outputs. That's just always when I see such, something tells me that data has been manipulated. Even if it's fio, it's really weird and suspicious, and worth to look what's wrong with it. 
>>> >>>> fio config: >>>> ./run_fio.sh >>>> fio \ >>>> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >>>> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >>>> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >>>> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >>>> --io_sq_thread_idle=${2} >>>> >>>> Signed-off-by: Hao Xu <[email protected]> >>>> --- >>>> fs/io_uring.c | 29 +++++++++++++++++++++++------ >>>> include/uapi/linux/io_uring.h | 1 + >>>> 2 files changed, 24 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>>> index 1871fad48412..f0a01232671e 100644 >>>> --- a/fs/io_uring.c >>>> +++ b/fs/io_uring.c >>>> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >>>> { >>>> struct io_ring_ctx *ctx = req->ctx; >>>> struct io_kiocb *link = io_prep_linked_timeout(req); >>>> - struct io_uring_task *tctx = req->task->io_uring; >>>> + struct io_uring_task *tctx = NULL; >>>> + >>>> + if (ctx->sq_data && ctx->sq_data->thread) >>>> + tctx = ctx->sq_data->thread->io_uring; >>> >>> without park it's racy, sq_data->thread may become NULL and removed, >>> as well as its ->io_uring. >>> >>>> + else >>>> + tctx = req->task->io_uring; >>>> BUG_ON(!tctx); >>>> BUG_ON(!tctx->io_wq); >>> >>> [snip] >>> >> > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-29 9:28 ` Pavel Begunkov @ 2021-05-05 11:20 ` Hao Xu 0 siblings, 0 replies; 40+ messages in thread From: Hao Xu @ 2021-05-05 11:20 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/4/29 下午5:28, Pavel Begunkov 写道: > On 4/29/21 5:37 AM, Hao Xu wrote: >> 在 2021/4/28 下午10:37, Pavel Begunkov 写道: >>> On 4/28/21 3:34 PM, Pavel Begunkov wrote: >>>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>>> sqes are submitted by sqthread when it is leveraged, which means there >>>>> is IO latency when waking up sqthread. To wipe it out, submit limited >>>>> number of sqes in the original task context. >>>>> Tests result below: >>>> >>>> Frankly, it can be a nest of corner cases if not now then in the future, >>>> leading to a high maintenance burden. Hence, if we consider the change, >>>> I'd rather want to limit the userspace exposure, so it can be removed >>>> if needed. >>>> >>>> A noticeable change of behaviour here, as Hao recently asked, is that >>>> the ring can be passed to a task from a completely another thread group, >>>> and so the feature would execute from that context, not from the >>>> original/sqpoll one. >>> >>> So maybe something like: >>> if (same_thread_group()) { >>> /* submit */ >>> }I thought this case(cross independent processes) for some time, Pavel, >> could you give more hints about how this may trigger errors? > > Currently? We need to audit cancellation, but don't think it's a problem. > > But as said, it's about the future. Your patch adds a new quirky > userspace behaviour (submitting from alien context as described), and > once commited it can't be removed and should be maintained. > > I can easily imagine it either over-complicating cancellations (and > we had enough of troubles with it), or just preventing more important > optimisations, or anything else often happening with new features. > > >>> >>>> >>>> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >>>> ignored if the previous point is addressed. >>> >>> I'd question whether it'd be better with the flag or without doing >>> this feature by default. >> Just like what Jens said, the flag here is to allow users to do their >> decision, there may be cases like a application wants to offload as much >> as possible IO related work to sqpoll, so that it can be dedicated to >> computation work etc. >>> >>>> >>>>> >>>>> 99th latency: >>>>> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >>>>> with this patch: >>>>> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >>>>> without this patch: >>>>> 2k 15 14 15 15 15 14 15 14 14 13 11.84 >>>> >>>> Not sure the second nine describes it well enough, please can you >>>> add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. >> Sure, I will. > > Forgot but it's important, should compared with non-sqpoll as well > because the feature is taking the middle ground between them. Hi Pavel, I found that the previous tests are not correct since when I did the test I enabled IOPOLL and in io_uring_enter(), I didn't add a if: if (IOPOLL disabled) submit at most 8 sqes in original context this optimization cannot run under SQPOLL+IOPOLL. So I did another tests with my new code, under 2k IOPS, against an Intel optane device. 
The fio command is: echo " [global] ioengine=io_uring sqthread_poll=1 thread=1 bs=4k direct=1 rw=randread time_based=1 runtime=160 group_reporting=1 filename=/dev/nvme1n1 sqthread_poll_cpu=30 randrepeat=0 [job0] cpus_allowed=35 iodepth=128 rate_iops=${1} io_sq_thread_idle=${2}" | ./fio/fio - the result is: ( 1. big numbers in 'th' are in ns, others in us, 2. for no sqpoll mode, sq_idle doesn't affect the metrics, so they're just repetitive tests. ) 2k+idle_submit_min8 metrics\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us min_lat 9 9 9 9 9 9 9 9 9 9 8 max_lat 2233 2802 3000 3518 3078 2052 2583 2914 4176 2255 3275 avg_lat 11.26 11.25 11.24 11.29 11.29 11.21 11.24 11.21 11.23 11.27 9.53 50th 7648 7648 7648 7648 7648 7648 7648 7584 7584 7648 9152 90th 7840 7840 7840 7840 7840 7840 7840 7776 7840 7904 9408 99th 10944 11200 11712 11328 12096 12352 11584 12096 11712 11072 12992 99.99th 73216 38144 56576 38656 38144 70144 40192 38144 40704 38144 39680 fio(usr) 3.66% 3.53% 3.80% 3.60% 3.41% 3.93% 3.55% 3.57% 3.58% 3.75% 3.82% fio(sys) 0.77% 0.70% 0.64% 0.65% 0.82% 0.71% 0.69% 0.69% 0.68% 0.72% 0.07% fio(usr+sys) 4.43% 4.23% 4.44% 4.25% 4.23% 4.64% 4.24% 4.26% 4.26% 4.47% 3.89% 2k+idle_without_submit_min8 metrics\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us min_lat 11 8 8 8 8 8 8 8 8 8 8 max_lat 12643 11703 11617 3050 8662 3402 5179 4307 2897 4204 3017 avg_lat 12.75 12.77 12.5 12.43 12.5 12.39 12.39 12.33 12.16 12.1 9.44 50th 11 11 11 11 11 11 11 11 11 11 9024 90th 11 11 11 11 11 11 11 11 11 11 9280 99th 18 17 15 16 16 16 16 16 15 15 13120 99.99th 930 1090 635 635 693 652 717 635 510 330 39680 fio(usr) 4.59% 3.29% 4.06% 4.10% 4.18% 4.30% 4.73% 4.12% 4.11% 3.83% 3.80% fio(sys) 0.39% 0.45% 0.46% 0.42% 0.34% 0.39% 0.37% 0.37% 0.33% 0.39% 0.07% fio(usr+sys) 4.98% 3.74% 4.52% 4.52% 4.52% 4.69% 5.10% 4.49% 4.44% 4.22% 3.87% 2k+without_sqpoll metrics\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us min_lat 8 8 8 7 8 8 8 8 8 8 8 max_lat 623 602 273 388 271 290 891 495 519 396 245 avg_lat 13.12 13.14 13.15 13.1 13.16 13.06 13.06 13.15 13.12 13.14 13.1 50th 10560 10560 10560 10560 10560 10432 10560 10560 10560 10560 10560 90th 10816 10816 10944 10816 10816 10816 10816 10944 10816 10944 10816 99th 14144 14656 14272 13888 14400 14016 14400 14272 14400 14528 14272 99.99th 42752 50432 43246 41216 56576 42240 43264 42240 47360 58624 42240 fio(usr) 2.31% 2.33% 2.81% 2.30% 2.34% 2.34% 2.12% 2.26% 2.09% 2.36% 2.34% fio(sys) 0.82% 0.79% 0.93% 0.81% 0.79% 0.77% 0.79% 0.86% 1.02% 0.76% 0.79% fio(usr+sys) 3.13% 3.12% 3.74% 3.11% 3.13% 3.11% 2.91% 3.12% 3.11% 3.12% 3.13% Cpu usage of sqthread hasn't been recorded, but I think the trending would be like the result in the previous email. From the data, the 'submit in original context' change reduce the avg latency and 'th' latency, meanwhile costs too much cpu usage of the original context. may be 'submit min(to_submit, 8)' is too much in 2k IOPS. I'll try other combination of IOPS and idle time. Thanks, Hao > >>>> >>>> Btw, how happened that only some of the numbers have fractional part? >>>> Can't believe they all but 3 were close enough to integer values. >> This confused me a little bit too, but it is indeed what fio outputs. > > That's just always when I see such, something tells me that data has > been manipulated. Even if it's fio, it's really weird and suspicious, > and worth to look what's wrong with it. Could you tell me what it is. 
> >>>> >>>>> fio config: >>>>> ./run_fio.sh >>>>> fio \ >>>>> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >>>>> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >>>>> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >>>>> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >>>>> --io_sq_thread_idle=${2} >>>>> >>>>> Signed-off-by: Hao Xu <[email protected]> >>>>> --- >>>>> fs/io_uring.c | 29 +++++++++++++++++++++++------ >>>>> include/uapi/linux/io_uring.h | 1 + >>>>> 2 files changed, 24 insertions(+), 6 deletions(-) >>>>> >>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>>>> index 1871fad48412..f0a01232671e 100644 >>>>> --- a/fs/io_uring.c >>>>> +++ b/fs/io_uring.c >>>>> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >>>>> { >>>>> struct io_ring_ctx *ctx = req->ctx; >>>>> struct io_kiocb *link = io_prep_linked_timeout(req); >>>>> - struct io_uring_task *tctx = req->task->io_uring; >>>>> + struct io_uring_task *tctx = NULL; >>>>> + >>>>> + if (ctx->sq_data && ctx->sq_data->thread) >>>>> + tctx = ctx->sq_data->thread->io_uring; >>>> >>>> without park it's racy, sq_data->thread may become NULL and removed, >>>> as well as its ->io_uring. >>>> >>>>> + else >>>>> + tctx = req->task->io_uring; >>>>> BUG_ON(!tctx); >>>>> BUG_ON(!tctx->io_wq); >>>> >>>> [snip] >>>> >>> >> > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:34 ` Pavel Begunkov 2021-04-28 14:37 ` Pavel Begunkov @ 2021-04-28 14:39 ` Jens Axboe 2021-04-28 14:50 ` Pavel Begunkov 2021-04-29 4:43 ` Hao Xu 2021-04-29 8:44 ` Hao Xu 2 siblings, 2 replies; 40+ messages in thread From: Jens Axboe @ 2021-04-28 14:39 UTC (permalink / raw) To: Pavel Begunkov, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 8:34 AM, Pavel Begunkov wrote: > On 4/28/21 2:32 PM, Hao Xu wrote: >> sqes are submitted by sqthread when it is leveraged, which means there >> is IO latency when waking up sqthread. To wipe it out, submit limited >> number of sqes in the original task context. >> Tests result below: > > Frankly, it can be a nest of corner cases if not now then in the future, > leading to a high maintenance burden. Hence, if we consider the change, > I'd rather want to limit the userspace exposure, so it can be removed > if needed. > > A noticeable change of behaviour here, as Hao recently asked, is that > the ring can be passed to a task from a completely another thread group, > and so the feature would execute from that context, not from the > original/sqpoll one. > > Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be > ignored if the previous point is addressed. I mostly agree on that. The problem I see is that for most use cases, the "submit from task itself if we need to enter the kernel" is perfectly fine, and would probably be preferable. But there are also uses cases that absolutely do not want to spend any extra cycles doing submit, they are isolating the submission to sqpoll exclusively and that is part of the win there. Based on that, I don't think it can be an automatic kind of feature. I do think the naming is kind of horrible. IORING_ENTER_SQ_SUBMIT_IDLE would likely be better, or maybe even more verbose as IORING_ENTER_SQ_SUBMIT_ON_IDLE. On top of that, I don't think an extra submit flag is a huge deal, I don't imagine we'll end up with a ton of them. In fact, two have been added related to sqpoll since the inception, out of the 3 total added flags. This is all independent of implementation detail and needed fixes to the patch. -- Jens Axboe ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:39 ` Jens Axboe @ 2021-04-28 14:50 ` Pavel Begunkov 2021-04-28 14:53 ` Jens Axboe 2021-04-29 4:43 ` Hao Xu 1 sibling, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-28 14:50 UTC (permalink / raw) To: Jens Axboe, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 3:39 PM, Jens Axboe wrote: > On 4/28/21 8:34 AM, Pavel Begunkov wrote: >> On 4/28/21 2:32 PM, Hao Xu wrote: >>> sqes are submitted by sqthread when it is leveraged, which means there >>> is IO latency when waking up sqthread. To wipe it out, submit limited >>> number of sqes in the original task context. >>> Tests result below: >> >> Frankly, it can be a nest of corner cases if not now then in the future, >> leading to a high maintenance burden. Hence, if we consider the change, >> I'd rather want to limit the userspace exposure, so it can be removed >> if needed. >> >> A noticeable change of behaviour here, as Hao recently asked, is that >> the ring can be passed to a task from a completely another thread group, >> and so the feature would execute from that context, not from the >> original/sqpoll one. >> >> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >> ignored if the previous point is addressed. > > I mostly agree on that. The problem I see is that for most use cases, > the "submit from task itself if we need to enter the kernel" is > perfectly fine, and would probably be preferable. But there are also > uses cases that absolutely do not want to spend any extra cycles doing > submit, they are isolating the submission to sqpoll exclusively and that > is part of the win there. Based on that, I don't think it can be an > automatic kind of feature. Reasonable. > I do think the naming is kind of horrible. IORING_ENTER_SQ_SUBMIT_IDLE > would likely be better, or maybe even more verbose as > IORING_ENTER_SQ_SUBMIT_ON_IDLE. > > On top of that, I don't think an extra submit flag is a huge deal, I > don't imagine we'll end up with a ton of them. In fact, two have been > added related to sqpoll since the inception, out of the 3 total added > flags. I don't care about the flag itself, nor about performance as it's nicely under the SQPOLL check, but I rather want to leave a way to ignore the feature if we would (ever) need to disable it, either with flag or without it. > This is all independent of implementation detail and needed fixes to the > patch. > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:50 ` Pavel Begunkov @ 2021-04-28 14:53 ` Jens Axboe 2021-04-28 14:56 ` Pavel Begunkov 0 siblings, 1 reply; 40+ messages in thread From: Jens Axboe @ 2021-04-28 14:53 UTC (permalink / raw) To: Pavel Begunkov, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 8:50 AM, Pavel Begunkov wrote: > On 4/28/21 3:39 PM, Jens Axboe wrote: >> On 4/28/21 8:34 AM, Pavel Begunkov wrote: >>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>> sqes are submitted by sqthread when it is leveraged, which means there >>>> is IO latency when waking up sqthread. To wipe it out, submit limited >>>> number of sqes in the original task context. >>>> Tests result below: >>> >>> Frankly, it can be a nest of corner cases if not now then in the future, >>> leading to a high maintenance burden. Hence, if we consider the change, >>> I'd rather want to limit the userspace exposure, so it can be removed >>> if needed. >>> >>> A noticeable change of behaviour here, as Hao recently asked, is that >>> the ring can be passed to a task from a completely another thread group, >>> and so the feature would execute from that context, not from the >>> original/sqpoll one. >>> >>> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >>> ignored if the previous point is addressed. >> >> I mostly agree on that. The problem I see is that for most use cases, >> the "submit from task itself if we need to enter the kernel" is >> perfectly fine, and would probably be preferable. But there are also >> uses cases that absolutely do not want to spend any extra cycles doing >> submit, they are isolating the submission to sqpoll exclusively and that >> is part of the win there. Based on that, I don't think it can be an >> automatic kind of feature. > > Reasonable. > >> I do think the naming is kind of horrible. IORING_ENTER_SQ_SUBMIT_IDLE >> would likely be better, or maybe even more verbose as >> IORING_ENTER_SQ_SUBMIT_ON_IDLE. >> >> On top of that, I don't think an extra submit flag is a huge deal, I >> don't imagine we'll end up with a ton of them. In fact, two have been >> added related to sqpoll since the inception, out of the 3 total added >> flags. > > I don't care about the flag itself, nor about performance as it's > nicely under the SQPOLL check, but I rather want to leave a way to > ignore the feature if we would (ever) need to disable it, either > with flag or without it. I think we just return -EINVAL for that case, just like we'd do now if you attempted to use the flag as we don't grok it. As it should be functionally equivalent if we do the submit inline or not, we could also argue that we simply ignore the flag if it isn't feasible to submit inline. -- Jens Axboe ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:53 ` Jens Axboe @ 2021-04-28 14:56 ` Pavel Begunkov 2021-04-28 15:09 ` Jens Axboe 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-28 14:56 UTC (permalink / raw) To: Jens Axboe, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 3:53 PM, Jens Axboe wrote: > On 4/28/21 8:50 AM, Pavel Begunkov wrote: >> On 4/28/21 3:39 PM, Jens Axboe wrote: >>> On 4/28/21 8:34 AM, Pavel Begunkov wrote: >>>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>>> sqes are submitted by sqthread when it is leveraged, which means there >>>>> is IO latency when waking up sqthread. To wipe it out, submit limited >>>>> number of sqes in the original task context. >>>>> Tests result below: >>>> >>>> Frankly, it can be a nest of corner cases if not now then in the future, >>>> leading to a high maintenance burden. Hence, if we consider the change, >>>> I'd rather want to limit the userspace exposure, so it can be removed >>>> if needed. >>>> >>>> A noticeable change of behaviour here, as Hao recently asked, is that >>>> the ring can be passed to a task from a completely another thread group, >>>> and so the feature would execute from that context, not from the >>>> original/sqpoll one. >>>> >>>> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >>>> ignored if the previous point is addressed. >>> >>> I mostly agree on that. The problem I see is that for most use cases, >>> the "submit from task itself if we need to enter the kernel" is >>> perfectly fine, and would probably be preferable. But there are also >>> uses cases that absolutely do not want to spend any extra cycles doing >>> submit, they are isolating the submission to sqpoll exclusively and that >>> is part of the win there. Based on that, I don't think it can be an >>> automatic kind of feature. >> >> Reasonable. >> >>> I do think the naming is kind of horrible. IORING_ENTER_SQ_SUBMIT_IDLE >>> would likely be better, or maybe even more verbose as >>> IORING_ENTER_SQ_SUBMIT_ON_IDLE. >>> >>> On top of that, I don't think an extra submit flag is a huge deal, I >>> don't imagine we'll end up with a ton of them. In fact, two have been >>> added related to sqpoll since the inception, out of the 3 total added >>> flags. >> >> I don't care about the flag itself, nor about performance as it's >> nicely under the SQPOLL check, but I rather want to leave a way to >> ignore the feature if we would (ever) need to disable it, either >> with flag or without it. > > I think we just return -EINVAL for that case, just like we'd do now if > you attempted to use the flag as we don't grok it. As it should be > functionally equivalent if we do the submit inline or not, we could also > argue that we simply ignore the flag if it isn't feasible to submit > inline. Yeah, no-brainer if we limit context to the original thread group, as I described in the first reply. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:56 ` Pavel Begunkov @ 2021-04-28 15:09 ` Jens Axboe 0 siblings, 0 replies; 40+ messages in thread From: Jens Axboe @ 2021-04-28 15:09 UTC (permalink / raw) To: Pavel Begunkov, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 8:56 AM, Pavel Begunkov wrote: > On 4/28/21 3:53 PM, Jens Axboe wrote: >> On 4/28/21 8:50 AM, Pavel Begunkov wrote: >>> On 4/28/21 3:39 PM, Jens Axboe wrote: >>>> On 4/28/21 8:34 AM, Pavel Begunkov wrote: >>>>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>>>> sqes are submitted by sqthread when it is leveraged, which means there >>>>>> is IO latency when waking up sqthread. To wipe it out, submit limited >>>>>> number of sqes in the original task context. >>>>>> Tests result below: >>>>> >>>>> Frankly, it can be a nest of corner cases if not now then in the future, >>>>> leading to a high maintenance burden. Hence, if we consider the change, >>>>> I'd rather want to limit the userspace exposure, so it can be removed >>>>> if needed. >>>>> >>>>> A noticeable change of behaviour here, as Hao recently asked, is that >>>>> the ring can be passed to a task from a completely another thread group, >>>>> and so the feature would execute from that context, not from the >>>>> original/sqpoll one. >>>>> >>>>> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >>>>> ignored if the previous point is addressed. >>>> >>>> I mostly agree on that. The problem I see is that for most use cases, >>>> the "submit from task itself if we need to enter the kernel" is >>>> perfectly fine, and would probably be preferable. But there are also >>>> uses cases that absolutely do not want to spend any extra cycles doing >>>> submit, they are isolating the submission to sqpoll exclusively and that >>>> is part of the win there. Based on that, I don't think it can be an >>>> automatic kind of feature. >>> >>> Reasonable. >>> >>>> I do think the naming is kind of horrible. IORING_ENTER_SQ_SUBMIT_IDLE >>>> would likely be better, or maybe even more verbose as >>>> IORING_ENTER_SQ_SUBMIT_ON_IDLE. >>>> >>>> On top of that, I don't think an extra submit flag is a huge deal, I >>>> don't imagine we'll end up with a ton of them. In fact, two have been >>>> added related to sqpoll since the inception, out of the 3 total added >>>> flags. >>> >>> I don't care about the flag itself, nor about performance as it's >>> nicely under the SQPOLL check, but I rather want to leave a way to >>> ignore the feature if we would (ever) need to disable it, either >>> with flag or without it. >> >> I think we just return -EINVAL for that case, just like we'd do now if >> you attempted to use the flag as we don't grok it. As it should be >> functionally equivalent if we do the submit inline or not, we could also >> argue that we simply ignore the flag if it isn't feasible to submit >> inline. > > Yeah, no-brainer if we limit context to the original thread group, as > I described in the first reply. Yep, that's a requirement for any kind of sanity there. -- Jens Axboe ^ permalink raw reply [flat|nested] 40+ messages in thread
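On the userspace side, the -EINVAL behaviour discussed above gives a natural probe-and-fallback pattern; a sketch follows (illustrative, not liburing API; the helper name is made up):

#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/* Kernels that don't know IORING_ENTER_SQ_DEPUTY reject it with
 * -EINVAL, so probe once and retry with a plain wakeup. */
static int enter_wakeup_deputy(int ring_fd, unsigned int to_submit)
{
	static int have_deputy = 1;
	unsigned int flags = IORING_ENTER_SQ_WAKEUP;
	int ret;

	if (have_deputy)
		flags |= IORING_ENTER_SQ_DEPUTY;
	ret = syscall(__NR_io_uring_enter, ring_fd, to_submit, 0, flags,
		      NULL, 0);
	if (ret < 0 && errno == EINVAL && have_deputy) {
		have_deputy = 0;	/* old kernel: drop the flag */
		ret = syscall(__NR_io_uring_enter, ring_fd, to_submit, 0,
			      IORING_ENTER_SQ_WAKEUP, NULL, 0);
	}
	return ret;
}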
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:39 ` Jens Axboe 2021-04-28 14:50 ` Pavel Begunkov @ 2021-04-29 4:43 ` Hao Xu 1 sibling, 0 replies; 40+ messages in thread From: Hao Xu @ 2021-04-29 4:43 UTC (permalink / raw) To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, Joseph Qi 在 2021/4/28 下午10:39, Jens Axboe 写道: > On 4/28/21 8:34 AM, Pavel Begunkov wrote: >> On 4/28/21 2:32 PM, Hao Xu wrote: >>> sqes are submitted by sqthread when it is leveraged, which means there >>> is IO latency when waking up sqthread. To wipe it out, submit limited >>> number of sqes in the original task context. >>> Tests result below: >> >> Frankly, it can be a nest of corner cases if not now then in the future, >> leading to a high maintenance burden. Hence, if we consider the change, >> I'd rather want to limit the userspace exposure, so it can be removed >> if needed. >> >> A noticeable change of behaviour here, as Hao recently asked, is that >> the ring can be passed to a task from a completely another thread group, >> and so the feature would execute from that context, not from the >> original/sqpoll one. >> >> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >> ignored if the previous point is addressed. > > I mostly agree on that. The problem I see is that for most use cases, > the "submit from task itself if we need to enter the kernel" is > perfectly fine, and would probably be preferable. But there are also > uses cases that absolutely do not want to spend any extra cycles doing > submit, they are isolating the submission to sqpoll exclusively and that > is part of the win there. Based on that, I don't think it can be an > automatic kind of feature. > > I do think the naming is kind of horrible. IORING_ENTER_SQ_SUBMIT_IDLE > would likely be better, or maybe even more verbose as > IORING_ENTER_SQ_SUBMIT_ON_IDLE. Agree 😂 > > On top of that, I don't think an extra submit flag is a huge deal, I > don't imagine we'll end up with a ton of them. In fact, two have been > added related to sqpoll since the inception, out of the 3 total added > flags. > > This is all independent of implementation detail and needed fixes to the > patch. > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:34 ` Pavel Begunkov 2021-04-28 14:37 ` Pavel Begunkov 2021-04-28 14:39 ` Jens Axboe @ 2021-04-29 8:44 ` Hao Xu 2021-04-29 22:10 ` Pavel Begunkov 2 siblings, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-04-29 8:44 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/4/28 下午10:34, Pavel Begunkov 写道: > On 4/28/21 2:32 PM, Hao Xu wrote: >> sqes are submitted by sqthread when it is leveraged, which means there >> is IO latency when waking up sqthread. To wipe it out, submit limited >> number of sqes in the original task context. >> Tests result below: > > Frankly, it can be a nest of corner cases if not now then in the future, > leading to a high maintenance burden. Hence, if we consider the change, > I'd rather want to limit the userspace exposure, so it can be removed > if needed. > > A noticeable change of behaviour here, as Hao recently asked, is that > the ring can be passed to a task from a completely another thread group, > and so the feature would execute from that context, not from the > original/sqpoll one. > > Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be > ignored if the previous point is addressed. > >> >> 99th latency: >> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >> with this patch: >> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >> without this patch: >> 2k 15 14 15 15 15 14 15 14 14 13 11.84 > > Not sure the second nine describes it well enough, please can you > add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. > > Btw, how happened that only some of the numbers have fractional part? > Can't believe they all but 3 were close enough to integer values. > >> fio config: >> ./run_fio.sh >> fio \ >> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >> --io_sq_thread_idle=${2} >> >> Signed-off-by: Hao Xu <[email protected]> >> --- >> fs/io_uring.c | 29 +++++++++++++++++++++++------ >> include/uapi/linux/io_uring.h | 1 + >> 2 files changed, 24 insertions(+), 6 deletions(-) >> >> diff --git a/fs/io_uring.c b/fs/io_uring.c >> index 1871fad48412..f0a01232671e 100644 >> --- a/fs/io_uring.c >> +++ b/fs/io_uring.c >> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >> { >> struct io_ring_ctx *ctx = req->ctx; >> struct io_kiocb *link = io_prep_linked_timeout(req); >> - struct io_uring_task *tctx = req->task->io_uring; >> + struct io_uring_task *tctx = NULL; >> + >> + if (ctx->sq_data && ctx->sq_data->thread) >> + tctx = ctx->sq_data->thread->io_uring; > > without park it's racy, sq_data->thread may become NULL and removed, > as well as its ->io_uring. I now think that it's ok to queue async work to req->task->io_uring. I look through all the OPs, seems only have to take care of async_cancel: io_async_cancel(req) { cancel from req->task->io_uring; cancel from ctx->tctx_list } Given req->task is 'original context', the req to be cancelled may in ctx->sq_data->thread->io_uring's iowq. So search the req from sqthread->io_uring is needed here. This avoids overload in main code path. Did I miss something else? 
> >> + else >> + tctx = req->task->io_uring; >> >> BUG_ON(!tctx); >> BUG_ON(!tctx->io_wq); > > [snip] > ^ permalink raw reply [flat|nested] 40+ messages in thread
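A sketch of the extra lookup discussed above (illustrative, not a posted patch; it assumes the 5.12-era io_async_cancel_one(tctx, user_data, ctx) helper and its signature): with deputy submission a request may sit in the sqpoll thread's io-wq, so async cancel would also have to probe that tctx, with the sqpoll thread parked so sqd->thread cannot disappear mid-lookup.

static int io_async_cancel_sqpoll(struct io_ring_ctx *ctx, u64 user_data)
{
	struct io_sq_data *sqd = ctx->sq_data;
	int ret = -ENOENT;

	if (!sqd)
		return ret;
	io_sq_thread_park(sqd);
	if (sqd->thread)
		ret = io_async_cancel_one(sqd->thread->io_uring,
					  user_data, ctx);
	io_sq_thread_unpark(sqd);
	return ret;
}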
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-29 8:44 ` Hao Xu @ 2021-04-29 22:10 ` Pavel Begunkov 2021-05-05 13:10 ` Hao Xu 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-29 22:10 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/29/21 9:44 AM, Hao Xu wrote: > 在 2021/4/28 下午10:34, Pavel Begunkov 写道: >> On 4/28/21 2:32 PM, Hao Xu wrote: >>> sqes are submitted by sqthread when it is leveraged, which means there >>> is IO latency when waking up sqthread. To wipe it out, submit limited >>> number of sqes in the original task context. >>> Tests result below: >> >> Frankly, it can be a nest of corner cases if not now then in the future, >> leading to a high maintenance burden. Hence, if we consider the change, >> I'd rather want to limit the userspace exposure, so it can be removed >> if needed. >> >> A noticeable change of behaviour here, as Hao recently asked, is that >> the ring can be passed to a task from a completely another thread group, >> and so the feature would execute from that context, not from the >> original/sqpoll one. >> >> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >> ignored if the previous point is addressed. >> >>> >>> 99th latency: >>> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >>> with this patch: >>> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >>> without this patch: >>> 2k 15 14 15 15 15 14 15 14 14 13 11.84 >> >> Not sure the second nine describes it well enough, please can you >> add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. >> >> Btw, how happened that only some of the numbers have fractional part? >> Can't believe they all but 3 were close enough to integer values. >> >>> fio config: >>> ./run_fio.sh >>> fio \ >>> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >>> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >>> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >>> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >>> --io_sq_thread_idle=${2} >>> >>> Signed-off-by: Hao Xu <[email protected]> >>> --- >>> fs/io_uring.c | 29 +++++++++++++++++++++++------ >>> include/uapi/linux/io_uring.h | 1 + >>> 2 files changed, 24 insertions(+), 6 deletions(-) >>> >>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>> index 1871fad48412..f0a01232671e 100644 >>> --- a/fs/io_uring.c >>> +++ b/fs/io_uring.c >>> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >>> { >>> struct io_ring_ctx *ctx = req->ctx; >>> struct io_kiocb *link = io_prep_linked_timeout(req); >>> - struct io_uring_task *tctx = req->task->io_uring; >>> + struct io_uring_task *tctx = NULL; >>> + >>> + if (ctx->sq_data && ctx->sq_data->thread) >>> + tctx = ctx->sq_data->thread->io_uring; >> >> without park it's racy, sq_data->thread may become NULL and removed, >> as well as its ->io_uring. > I now think that it's ok to queue async work to req->task->io_uring. I > look through all the OPs, seems only have to take care of async_cancel: > > io_async_cancel(req) { > cancel from req->task->io_uring; > cancel from ctx->tctx_list > } > > Given req->task is 'original context', the req to be cancelled may in > ctx->sq_data->thread->io_uring's iowq. So search the req from > sqthread->io_uring is needed here. This avoids overload in main code > path. > Did I miss something else? It must be req->task->io_uring, otherwise cancellations will be broken. 
And using it should be fine in theory because io-wq/etc. should be set up by io_uring_add_task_file() One more problem to the pile is io_req_task_work_add() and notify mode it choses. Look for IORING_SETUP_SQPOLL in the function. Also, IOPOLL+SQPOLL io_uring_try_cancel_requests() looks like may fail (didn't double check it). Look again for IORING_SETUP_SQPOLL. I'd rather recommend to go over all uses of IORING_SETUP_SQPOLL and think whether it's flawed. > > >> >>> + else >>> + tctx = req->task->io_uring; >>> BUG_ON(!tctx); >>> BUG_ON(!tctx->io_wq); >> >> [snip] >> > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-29 22:10 ` Pavel Begunkov @ 2021-05-05 13:10 ` Hao Xu 2021-05-05 17:44 ` Pavel Begunkov 0 siblings, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-05-05 13:10 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/4/30 上午6:10, Pavel Begunkov 写道: > On 4/29/21 9:44 AM, Hao Xu wrote: >> 在 2021/4/28 下午10:34, Pavel Begunkov 写道: >>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>> sqes are submitted by sqthread when it is leveraged, which means there >>>> is IO latency when waking up sqthread. To wipe it out, submit limited >>>> number of sqes in the original task context. >>>> Tests result below: >>> >>> Frankly, it can be a nest of corner cases if not now then in the future, >>> leading to a high maintenance burden. Hence, if we consider the change, >>> I'd rather want to limit the userspace exposure, so it can be removed >>> if needed. >>> >>> A noticeable change of behaviour here, as Hao recently asked, is that >>> the ring can be passed to a task from a completely another thread group, >>> and so the feature would execute from that context, not from the >>> original/sqpoll one. >>> >>> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >>> ignored if the previous point is addressed. >>> >>>> >>>> 99th latency: >>>> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >>>> with this patch: >>>> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >>>> without this patch: >>>> 2k 15 14 15 15 15 14 15 14 14 13 11.84 >>> >>> Not sure the second nine describes it well enough, please can you >>> add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. >>> >>> Btw, how happened that only some of the numbers have fractional part? >>> Can't believe they all but 3 were close enough to integer values. >>> >>>> fio config: >>>> ./run_fio.sh >>>> fio \ >>>> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >>>> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >>>> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >>>> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >>>> --io_sq_thread_idle=${2} >>>> >>>> Signed-off-by: Hao Xu <[email protected]> >>>> --- >>>> fs/io_uring.c | 29 +++++++++++++++++++++++------ >>>> include/uapi/linux/io_uring.h | 1 + >>>> 2 files changed, 24 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>>> index 1871fad48412..f0a01232671e 100644 >>>> --- a/fs/io_uring.c >>>> +++ b/fs/io_uring.c >>>> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >>>> { >>>> struct io_ring_ctx *ctx = req->ctx; >>>> struct io_kiocb *link = io_prep_linked_timeout(req); >>>> - struct io_uring_task *tctx = req->task->io_uring; >>>> + struct io_uring_task *tctx = NULL; >>>> + >>>> + if (ctx->sq_data && ctx->sq_data->thread) >>>> + tctx = ctx->sq_data->thread->io_uring; >>> >>> without park it's racy, sq_data->thread may become NULL and removed, >>> as well as its ->io_uring. >> I now think that it's ok to queue async work to req->task->io_uring. I >> look through all the OPs, seems only have to take care of async_cancel: >> >> io_async_cancel(req) { >> cancel from req->task->io_uring; >> cancel from ctx->tctx_list >> } >> >> Given req->task is 'original context', the req to be cancelled may in >> ctx->sq_data->thread->io_uring's iowq. So search the req from >> sqthread->io_uring is needed here. 
This avoids overload in main code >> path. >> Did I miss something else? > > It must be req->task->io_uring, otherwise cancellations will > be broken. And using it should be fine in theory because io-wq/etc. > should be set up by io_uring_add_task_file() > > > One more problem to the pile is io_req_task_work_add() and notify > mode it choses. Look for IORING_SETUP_SQPOLL in the function. How about: notify = TWA_SIGNAL if ( (is sq mode) and (sqd->thread == NULL or == req->task)) notify = TWA_NONE; > > Also, IOPOLL+SQPOLL io_uring_try_cancel_requests() looks like > may fail (didn't double check it). Look again for IORING_SETUP_SQPOLL. > I've excluded IOPOLL. This change will only affect SQPOLL mode. > I'd rather recommend to go over all uses of IORING_SETUP_SQPOLL > and think whether it's flawed. I'm working on this. (No obvious problems spotted by inspection; I'll put the code change through more tests.) > >> >> >>> >>>> + else >>>> + tctx = req->task->io_uring; >>>> BUG_ON(!tctx); >>>> BUG_ON(!tctx->io_wq); >>> >>> [snip] >>> >> > ^ permalink raw reply [flat|nested] 40+ messages in thread
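A minimal sketch of the cancellation idea above, using hypothetical helper names (try_cancel_in_tctx() does not exist; the real io_async_cancel() path differs): if requests may sit on the sqpoll thread's io-wq, cancellation would have to search that context as well as the submitting task's. As Pavel points out earlier in the thread, dereferencing sq_data->thread without parking the sqpoll thread is racy, so a real version would need extra synchronisation.

/* Sketch only; try_cancel_in_tctx() is a hypothetical helper. */
static int io_try_cancel_everywhere(struct io_kiocb *req, u64 sqe_addr)
{
        struct io_ring_ctx *ctx = req->ctx;
        int ret;

        /* First search the submitting ("original") task's io-wq. */
        ret = try_cancel_in_tctx(req->task->io_uring, sqe_addr);
        if (ret != -ENOENT)
                return ret;

        /*
         * Then search the sqpoll thread's io-wq, if any. Racy without
         * parking the sqpoll thread, as noted in this thread.
         */
        if (ctx->sq_data && ctx->sq_data->thread)
                ret = try_cancel_in_tctx(ctx->sq_data->thread->io_uring, sqe_addr);
        return ret;
}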
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-05-05 13:10 ` Hao Xu @ 2021-05-05 17:44 ` Pavel Begunkov 0 siblings, 0 replies; 40+ messages in thread From: Pavel Begunkov @ 2021-05-05 17:44 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 5/5/21 2:10 PM, Hao Xu wrote: > 在 2021/4/30 上午6:10, Pavel Begunkov 写道: >> On 4/29/21 9:44 AM, Hao Xu wrote: >>> 在 2021/4/28 下午10:34, Pavel Begunkov 写道: >>>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>>> sqes are submitted by sqthread when it is leveraged, which means there >>>>> is IO latency when waking up sqthread. To wipe it out, submit limited >>>>> number of sqes in the original task context. >>>>> Tests result below: >>>> >>>> Frankly, it can be a nest of corner cases if not now then in the future, >>>> leading to a high maintenance burden. Hence, if we consider the change, >>>> I'd rather want to limit the userspace exposure, so it can be removed >>>> if needed. >>>> >>>> A noticeable change of behaviour here, as Hao recently asked, is that >>>> the ring can be passed to a task from a completely another thread group, >>>> and so the feature would execute from that context, not from the >>>> original/sqpoll one. >>>> >>>> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >>>> ignored if the previous point is addressed. >>>> >>>>> >>>>> 99th latency: >>>>> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >>>>> with this patch: >>>>> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >>>>> without this patch: >>>>> 2k 15 14 15 15 15 14 15 14 14 13 11.84 >>>> >>>> Not sure the second nine describes it well enough, please can you >>>> add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. >>>> >>>> Btw, how happened that only some of the numbers have fractional part? >>>> Can't believe they all but 3 were close enough to integer values. >>>> >>>>> fio config: >>>>> ./run_fio.sh >>>>> fio \ >>>>> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >>>>> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >>>>> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >>>>> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >>>>> --io_sq_thread_idle=${2} >>>>> >>>>> Signed-off-by: Hao Xu <[email protected]> >>>>> --- >>>>> fs/io_uring.c | 29 +++++++++++++++++++++++------ >>>>> include/uapi/linux/io_uring.h | 1 + >>>>> 2 files changed, 24 insertions(+), 6 deletions(-) >>>>> >>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>>>> index 1871fad48412..f0a01232671e 100644 >>>>> --- a/fs/io_uring.c >>>>> +++ b/fs/io_uring.c >>>>> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >>>>> { >>>>> struct io_ring_ctx *ctx = req->ctx; >>>>> struct io_kiocb *link = io_prep_linked_timeout(req); >>>>> - struct io_uring_task *tctx = req->task->io_uring; >>>>> + struct io_uring_task *tctx = NULL; >>>>> + >>>>> + if (ctx->sq_data && ctx->sq_data->thread) >>>>> + tctx = ctx->sq_data->thread->io_uring; >>>> >>>> without park it's racy, sq_data->thread may become NULL and removed, >>>> as well as its ->io_uring. >>> I now think that it's ok to queue async work to req->task->io_uring. I >>> look through all the OPs, seems only have to take care of async_cancel: >>> >>> io_async_cancel(req) { >>> cancel from req->task->io_uring; >>> cancel from ctx->tctx_list >>> } >>> >>> Given req->task is 'original context', the req to be cancelled may in >>> ctx->sq_data->thread->io_uring's iowq. 
So search the req from >>> sqthread->io_uring is needed here. This avoids overload in main code >>> path. >>> Did I miss something else? >> >> It must be req->task->io_uring, otherwise cancellations will >> be broken. And using it should be fine in theory because io-wq/etc. >> should be set up by io_uring_add_task_file() >> >> >> One more problem to the pile is io_req_task_work_add() and notify >> mode it choses. Look for IORING_SETUP_SQPOLL in the function. > How about: > notify = TWA_SIGNAL > if ( (is sq mode) and (sqd->thread == NULL or == req->task)) > notify = TWA_NONE; notify = (sqd && tsk == sqd->thread) ? TWA_NONE : TWA_SIGNAL; Like that? Should work >> Also, IOPOLL+SQPOLL io_uring_try_cancel_requests() looks like >> may fail (didn't double check it). Look again for IORING_SETUP_SQPOLL. >> > I've excluded IOPOLL. This change will only affect SQPOLL mode. >> I'd rather recommend to go over all uses of IORING_SETUP_SQPOLL >> and think whether it's flawed. > I'm working on this. (no obvious problem through eyes, will put the code > change on more tests) >> >>> >>> >>>> >>>>> + else >>>>> + tctx = req->task->io_uring; >>>>> BUG_ON(!tctx); >>>>> BUG_ON(!tctx->io_wq); >>>> >>>> [snip] >>>> >>> >> > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
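Spelling out the selection Pavel sketches above, assuming tsk is the task the work is queued to, sqd is ctx->sq_data, and req/ret are in scope as they would be inside io_req_task_work_add():

        /*
         * Sketch only: skip signal-based notification only when the target
         * really is the sqpoll thread; any "deputy" task still gets
         * TWA_SIGNAL so its task_work is guaranteed to run.
         */
        struct io_sq_data *sqd = ctx->sq_data;
        enum task_work_notify_mode notify;

        notify = (sqd && tsk == sqd->thread) ? TWA_NONE : TWA_SIGNAL;
        ret = task_work_add(tsk, &req->task_work, notify);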
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 13:32 ` [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread Hao Xu 2021-04-28 14:12 ` Jens Axboe 2021-04-28 14:34 ` Pavel Begunkov @ 2021-04-29 22:02 ` Pavel Begunkov 2 siblings, 0 replies; 40+ messages in thread From: Pavel Begunkov @ 2021-04-29 22:02 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/28/21 2:32 PM, Hao Xu wrote: > sqes are submitted by sqthread when it is leveraged, which means there > is IO latency when waking up sqthread. To wipe it out, submit limited > number of sqes in the original task context. > Tests result below: Synchronising needs some patching in case of IOPOLL, as io_iopoll_req_issued() won't wake up sqpoll task. Added in this patch wake_up() doesn't help with that because it's placed before submission > > 99th latency: > iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us > with this patch: > 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 > without this patch: > 2k 15 14 15 15 15 14 15 14 14 13 11.84 > > fio config: > ./run_fio.sh > fio \ > --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ > --direct=1 --rw=randread --time_based=1 --runtime=300 \ > --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ > --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ > --io_sq_thread_idle=${2} > > Signed-off-by: Hao Xu <[email protected]> > --- > fs/io_uring.c | 29 +++++++++++++++++++++++------ > include/uapi/linux/io_uring.h | 1 + > 2 files changed, 24 insertions(+), 6 deletions(-) > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index 1871fad48412..f0a01232671e 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) > { > struct io_ring_ctx *ctx = req->ctx; > struct io_kiocb *link = io_prep_linked_timeout(req); > - struct io_uring_task *tctx = req->task->io_uring; > + struct io_uring_task *tctx = NULL; > + > + if (ctx->sq_data && ctx->sq_data->thread) > + tctx = ctx->sq_data->thread->io_uring; > + else > + tctx = req->task->io_uring; > > BUG_ON(!tctx); > BUG_ON(!tctx->io_wq); > @@ -9063,9 +9068,10 @@ static void io_uring_try_cancel(struct files_struct *files) > xa_for_each(&tctx->xa, index, node) { > struct io_ring_ctx *ctx = node->ctx; > > - /* sqpoll task will cancel all its requests */ > - if (!ctx->sq_data) > - io_uring_try_cancel_requests(ctx, current, files); > + /* > + * for sqpoll ctx, there may be requests in task_works etc. 
> + */ > + io_uring_try_cancel_requests(ctx, current, files); > } > } > > @@ -9271,7 +9277,8 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz > io_run_task_work(); > > if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP | > - IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG))) > + IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG | > + IORING_ENTER_SQ_DEPUTY))) > return -EINVAL; > > f = fdget(fd); > @@ -9304,8 +9311,18 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz > if (unlikely(ctx->sq_data->thread == NULL)) { > goto out; > } > - if (flags & IORING_ENTER_SQ_WAKEUP) > + if (flags & IORING_ENTER_SQ_WAKEUP) { > wake_up(&ctx->sq_data->wait); > + if ((flags & IORING_ENTER_SQ_DEPUTY) && > + !(ctx->flags & IORING_SETUP_IOPOLL)) { > + ret = io_uring_add_task_file(ctx); > + if (unlikely(ret)) > + goto out; > + mutex_lock(&ctx->uring_lock); > + io_submit_sqes(ctx, min(to_submit, 8U)); > + mutex_unlock(&ctx->uring_lock); > + } > + } > if (flags & IORING_ENTER_SQ_WAIT) { > ret = io_sqpoll_wait_sq(ctx); > if (ret) > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h > index 311532ff6ce3..b1130fec2b7d 100644 > --- a/include/uapi/linux/io_uring.h > +++ b/include/uapi/linux/io_uring.h > @@ -251,6 +251,7 @@ struct io_cqring_offsets { > #define IORING_ENTER_SQ_WAKEUP (1U << 1) > #define IORING_ENTER_SQ_WAIT (1U << 2) > #define IORING_ENTER_EXT_ARG (1U << 3) > +#define IORING_ENTER_SQ_DEPUTY (1U << 4) > > /* > * Passed in for io_uring_setup(2). Copied back with updated info on success > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
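To make the proposed interface concrete, here is a hedged userspace sketch of how an application might use the flag if this RFC were applied. IORING_ENTER_SQ_DEPUTY is only the value proposed in this patch ((1U << 4)), not a mainline io_uring flag; when set alongside IORING_ENTER_SQ_WAKEUP, the patched kernel submits up to 8 SQEs inline in the caller's context while waking the sqpoll thread. In practice the application would only take this path when the SQ ring flags report IORING_SQ_NEED_WAKEUP.

/* Userspace sketch, assuming ring_fd was set up with IORING_SETUP_SQPOLL. */
#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef IORING_ENTER_SQ_DEPUTY
#define IORING_ENTER_SQ_DEPUTY (1U << 4)   /* value proposed in this patch */
#endif

static int wake_sqthread_with_deputy(int ring_fd, unsigned int to_submit)
{
        /*
         * Wake the sqpoll thread and ask the (patched) kernel to submit a
         * few SQEs inline in this task's context to cut wakeup latency.
         */
        return syscall(__NR_io_uring_enter, ring_fd, to_submit, 0,
                       IORING_ENTER_SQ_WAKEUP | IORING_ENTER_SQ_DEPUTY,
                       NULL, 0);
}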
end of thread. Thread overview: 40+ messages -- 2021-04-28 13:32 [PATCH RFC 5.13 0/2] adaptive sqpoll and its wakeup optimization Hao Xu 2021-04-28 13:32 ` [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle Hao Xu 2021-04-28 14:07 ` Pavel Begunkov 2021-04-28 14:16 ` Jens Axboe 2021-04-28 14:53 ` Pavel Begunkov 2021-04-28 14:54 ` Jens Axboe 2021-04-29 3:41 ` Hao Xu 2021-04-29 9:11 ` Pavel Begunkov 2021-05-05 14:07 ` Hao Xu 2021-05-05 17:40 ` Pavel Begunkov 2021-04-29 3:28 ` Hao Xu 2021-04-29 22:15 ` Pavel Begunkov 2021-09-26 10:00 ` Hao Xu 2021-09-28 10:51 ` Pavel Begunkov 2021-09-29 7:52 ` Hao Xu 2021-09-29 9:24 ` Hao Xu 2021-09-29 11:37 ` Pavel Begunkov 2021-09-29 12:13 ` Hao Xu 2021-09-30 8:51 ` Pavel Begunkov 2021-09-30 12:04 ` Pavel Begunkov 2021-10-05 15:00 ` Hao Xu 2021-04-28 13:32 ` [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread Hao Xu 2021-04-28 14:12 ` Jens Axboe 2021-04-29 4:12 ` Hao Xu 2021-04-28 14:34 ` Pavel Begunkov 2021-04-28 14:37 ` Pavel Begunkov 2021-04-29 4:37 ` Hao Xu 2021-04-29 9:28 ` Pavel Begunkov 2021-05-05 11:20 ` Hao Xu 2021-04-28 14:39 ` Jens Axboe 2021-04-28 14:50 ` Pavel Begunkov 2021-04-28 14:53 ` Jens Axboe 2021-04-28 14:56 ` Pavel Begunkov 2021-04-28 15:09 ` Jens Axboe 2021-04-29 4:43 ` Hao Xu 2021-04-29 8:44 ` Hao Xu 2021-04-29 22:10 ` Pavel Begunkov 2021-05-05 13:10 ` Hao Xu 2021-05-05 17:44 ` Pavel Begunkov 2021-04-29 22:02 ` Pavel Begunkov