* [PATCH RFC 5.13 0/2] adaptive sqpoll and its wakeup optimization
@ 2021-04-28 13:32 Hao Xu
  2021-04-28 13:32 ` [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle Hao Xu
  2021-04-28 13:32 ` [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread Hao Xu
  0 siblings, 2 replies; 40+ messages in thread
From: Hao Xu @ 2021-04-28 13:32 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Patch 1/2 adds an option to set up sq_thread_idle with nanosecond
granularity. Patch 2/2 cuts down the IO latency seen while the sqthread
is waking up.

This is an RFC, especially 2/2. There is more work to do, such as adding
a REGISTER op so that applications can adjust sq_thread_idle at runtime,
since a ring may see both high and low IO pressure over its lifetime.
Under low IO pressure, patch 1/2 saves CPU while keeping latency
reasonable. A liburing tweak is ready as well, but for now I'd like to
post just this for comments.

Hao Xu (2):
  io_uring: add support for ns granularity of io_sq_thread_idle
  io_uring: submit sqes in the original context when waking up sqthread

 fs/io_uring.c                 | 88 ++++++++++++++++++++++++++++++++-----------
 include/uapi/linux/io_uring.h |  4 +-
 2 files changed, 70 insertions(+), 22 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 40+ messages in thread
* [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle
  2021-04-28 13:32 [PATCH RFC 5.13 0/2] adaptive sqpoll and its wakeup optimization Hao Xu
@ 2021-04-28 13:32 ` Hao Xu
  2021-04-28 14:07   ` Pavel Begunkov
  2021-04-28 13:32 ` [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread Hao Xu
  1 sibling, 1 reply; 40+ messages in thread
From: Hao Xu @ 2021-04-28 13:32 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi

Currently the unit of io_sq_thread_idle is a millisecond; the smallest
value is 1ms, which means that for IOPS > 1000 the sqthread will very
likely run at 100% CPU usage. That is not necessary in some cases: users
may not care much about latency under low IO pressure (say
1000 < IOPS < 20000), but CPU consumption does matter. So offer an
option of nanosecond granularity for io_sq_thread_idle. Some test
results from fio below:

uring average latency (us):
iops\idle 10us   60us   110us  160us  210us  260us  310us  360us  410us  460us  510us
2k        10.93  10.68  10.72  10.7   10.79  10.52  10.59  10.54  10.47  10.39  8.4
4k        10.55  10.48  10.51  10.42  10.35  8.34
6k        10.82  10.5   10.39  8.4
8k        10.44  10.45  10.34  8.39
10k       10.45  10.39  8.33

uring cpu usage of sqthread:
iops\idle 10us    60us    110us   160us   210us   260us   310us   360us   410us   460us   510us
2k        4%      14%     24%     34.70%  44.70%  55%     65.10%  75.40%  85.40%  95.70%  100%
4k        7.70%   28.20%  48.50%  69%     90%     100%
6k        11.30%  42%     73%     100%
8k        15.30%  56.30%  97%     100%
10k       19%     70%     100%

aio average latency (us):
iops  latency  99th lat  cpu
2k    13.34    14.272    3%
4k    13.195   14.016    7%
6k    13.29    14.656    9.70%
8k    13.2     14.656    12.70%
10k   13.2     15        17%

fio config is:
./run_fio.sh
fio \
  --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \
  --direct=1 --rw=randread --time_based=1 --runtime=300 \
  --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \
  --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \
  --io_sq_thread_idle=${2}

At 2k IOPS, if a latency of 10.93us is acceptable for an application, it
gets a 100% - 4% = 96% reduction in CPU usage, while the latency is
still lower than with aio (10.93us vs 13.34us).
Signed-off-by: Hao Xu <[email protected]> --- fs/io_uring.c | 59 ++++++++++++++++++++++++++++++++----------- include/uapi/linux/io_uring.h | 3 ++- 2 files changed, 46 insertions(+), 16 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 63ff70587d4f..1871fad48412 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -279,7 +279,8 @@ struct io_sq_data { struct task_struct *thread; struct wait_queue_head wait; - unsigned sq_thread_idle; + u64 sq_thread_idle; + bool idle_mode; int sq_cpu; pid_t task_pid; pid_t task_tgid; @@ -362,7 +363,7 @@ struct io_ring_ctx { unsigned cached_sq_head; unsigned sq_entries; unsigned sq_mask; - unsigned sq_thread_idle; + u64 sq_thread_idle; unsigned cached_sq_dropped; unsigned cached_cq_overflow; unsigned long sq_check_overflow; @@ -6797,18 +6798,42 @@ static int __io_sq_thread(struct io_ring_ctx *ctx, bool cap_entries) static void io_sqd_update_thread_idle(struct io_sq_data *sqd) { struct io_ring_ctx *ctx; - unsigned sq_thread_idle = 0; + u64 sq_thread_idle = 0; - list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) - sq_thread_idle = max(sq_thread_idle, ctx->sq_thread_idle); + sqd->idle_mode = false; + list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) { + u64 tmp_idle = ctx->sq_thread_idle; + + if (!(ctx->flags & IORING_SETUP_IDLE_NS)) + tmp_idle = jiffies64_to_nsecs(tmp_idle); + else if (!sqd->idle_mode) + sqd->idle_mode = true; + + if (sq_thread_idle < tmp_idle) + sq_thread_idle = tmp_idle; + } + + if (!sqd->idle_mode) + sq_thread_idle = nsecs_to_jiffies64(sq_thread_idle); sqd->sq_thread_idle = sq_thread_idle; } +static inline u64 io_current_time(bool idle_mode) +{ + return idle_mode ? ktime_get_ns() : get_jiffies_64(); +} + +static inline bool io_time_after(bool idle_mode, u64 timeout) +{ + return idle_mode ? ktime_get_ns() > timeout : + time_after64(get_jiffies_64(), timeout); +} + static int io_sq_thread(void *data) { struct io_sq_data *sqd = data; struct io_ring_ctx *ctx; - unsigned long timeout = 0; + u64 timeout = 0; char buf[TASK_COMM_LEN]; DEFINE_WAIT(wait); @@ -6842,7 +6867,7 @@ static int io_sq_thread(void *data) break; io_run_task_work(); io_run_task_work_head(&sqd->park_task_work); - timeout = jiffies + sqd->sq_thread_idle; + timeout = io_current_time(sqd->idle_mode) + sqd->sq_thread_idle; continue; } sqt_spin = false; @@ -6859,11 +6884,11 @@ static int io_sq_thread(void *data) sqt_spin = true; } - if (sqt_spin || !time_after(jiffies, timeout)) { + if (sqt_spin || !io_time_after(sqd->idle_mode, timeout)) { io_run_task_work(); cond_resched(); if (sqt_spin) - timeout = jiffies + sqd->sq_thread_idle; + timeout = io_current_time(sqd->idle_mode) + sqd->sq_thread_idle; continue; } @@ -6896,7 +6921,7 @@ static int io_sq_thread(void *data) finish_wait(&sqd->wait, &wait); io_run_task_work_head(&sqd->park_task_work); - timeout = jiffies + sqd->sq_thread_idle; + timeout = io_current_time(sqd->idle_mode) + sqd->sq_thread_idle; } io_uring_cancel_sqpoll(sqd); @@ -7940,7 +7965,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx, if (ctx->flags & IORING_SETUP_SQPOLL) { struct task_struct *tsk; struct io_sq_data *sqd; - bool attached; + bool attached, idle_ns; sqd = io_get_sq_data(p, &attached); if (IS_ERR(sqd)) { @@ -7950,9 +7975,13 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx, ctx->sq_creds = get_current_cred(); ctx->sq_data = sqd; - ctx->sq_thread_idle = msecs_to_jiffies(p->sq_thread_idle); + + idle_ns = ctx->flags & IORING_SETUP_IDLE_NS; + if (!idle_ns) + p->sq_thread_idle = nsecs_to_jiffies64(p->sq_thread_idle * NSEC_PER_MSEC); + 
ctx->sq_thread_idle = p->sq_thread_idle; if (!ctx->sq_thread_idle) - ctx->sq_thread_idle = HZ; + ctx->sq_thread_idle = idle_ns ? NSEC_PER_SEC : HZ; io_sq_thread_park(sqd); list_add(&ctx->sqd_list, &sqd->ctx_list); @@ -7990,7 +8019,7 @@ static int io_sq_offload_create(struct io_ring_ctx *ctx, wake_up_new_task(tsk); if (ret) goto err; - } else if (p->flags & IORING_SETUP_SQ_AFF) { + } else if (p->flags & (IORING_SETUP_SQ_AFF | IORING_SETUP_IDLE_NS)) { /* Can't have SQ_AFF without SQPOLL */ ret = -EINVAL; goto err; @@ -9680,7 +9709,7 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params) if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ | - IORING_SETUP_R_DISABLED)) + IORING_SETUP_R_DISABLED | IORING_SETUP_IDLE_NS)) return -EINVAL; return io_uring_create(entries, &p, params); diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index e1ae46683301..311532ff6ce3 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -98,6 +98,7 @@ enum { #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ enum { IORING_OP_NOP, @@ -259,7 +260,7 @@ struct io_uring_params { __u32 cq_entries; __u32 flags; __u32 sq_thread_cpu; - __u32 sq_thread_idle; + __u64 sq_thread_idle; __u32 features; __u32 wq_fd; __u32 resv[3]; -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
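For illustration, this is how an application might request the nanosecond
idle mode through the raw setup syscall, taking the RFC's
IORING_SETUP_IDLE_NS bit as proposed in the hunk above. The flag is not
part of any released uAPI, so the fallback #define is purely an
assumption of this sketch.

#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef IORING_SETUP_IDLE_NS
#define IORING_SETUP_IDLE_NS	(1U << 7)	/* RFC-only, from this patch */
#endif

/* Create an SQPOLL ring whose idle period is given in nanoseconds. */
static int setup_sqpoll_ring_idle_ns(unsigned entries, unsigned idle_ns)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_IDLE_NS;
	p.sq_thread_idle = idle_ns;	/* interpreted as ns, not ms */

	return syscall(__NR_io_uring_setup, entries, &p);
}

An idle_ns of, say, 60000 would then correspond to the 60us column in the
tables in the commit message above.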
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-28 13:32 ` [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle Hao Xu @ 2021-04-28 14:07 ` Pavel Begunkov 2021-04-28 14:16 ` Jens Axboe 2021-04-29 3:28 ` Hao Xu 0 siblings, 2 replies; 40+ messages in thread From: Pavel Begunkov @ 2021-04-28 14:07 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/28/21 2:32 PM, Hao Xu wrote: > currently unit of io_sq_thread_idle is millisecond, the smallest value > is 1ms, which means for IOPS > 1000, sqthread will very likely take > 100% cpu usage. This is not necessary in some cases, like users may > don't care about latency much in low IO pressure > (like 1000 < IOPS < 20000), but cpu resource does matter. So we offer > an option of nanosecond granularity of io_sq_thread_idle. Some test > results by fio below: If numbers justify it, I don't see why not do it in ns, but I'd suggest to get rid of all the mess and simply convert to jiffies during ring creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged. Or is there a reason for having it high precision, i.e. ktime()? > uring average latency:(us) > iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us > 2k 10.93 10.68 10.72 10.7 10.79 10.52 10.59 10.54 10.47 10.39 8.4 > 4k 10.55 10.48 10.51 10.42 10.35 8.34 > 6k 10.82 10.5 10.39 8.4 > 8k 10.44 10.45 10.34 8.39 > 10k 10.45 10.39 8.33 > > uring cpu usage of sqthread: > iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us > 2k 4% 14% 24% 34.70% 44.70% 55% 65.10% 75.40% 85.40% 95.70% 100% > 4k 7.70% 28.20% 48.50% 69% 90% 100% > 6k 11.30% 42% 73% 100% > 8k 15.30% 56.30% 97% 100% > 10k 19% 70% 100% > > aio average latency:(us) > iops latency 99th lat cpu > 2k 13.34 14.272 3% > 4k 13.195 14.016 7% > 6k 13.29 14.656 9.70% > 8k 13.2 14.656 12.70% > 10k 13.2 15 17% > > fio config is: > ./run_fio.sh > fio \ > --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ > --direct=1 --rw=randread --time_based=1 --runtime=300 \ > --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ > --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ > --io_sq_thread_idle=${2} > > in 2k IOPS, if latency of 10.93us is acceptable for an application, > then they get 100% - 4% = 96% reduction of cpu usage, while the latency > is smaller than aio(10.93us vs 13.34us). > > Signed-off-by: Hao Xu <[email protected]> > --- [snip] > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h > index e1ae46683301..311532ff6ce3 100644 > --- a/include/uapi/linux/io_uring.h > +++ b/include/uapi/linux/io_uring.h > @@ -98,6 +98,7 @@ enum { > #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ > #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ > #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ > +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ > > enum { > IORING_OP_NOP, > @@ -259,7 +260,7 @@ struct io_uring_params { > __u32 cq_entries; > __u32 flags; > __u32 sq_thread_cpu; > - __u32 sq_thread_idle; > + __u64 sq_thread_idle; breaks userspace API > __u32 features; > __u32 wq_fd; > __u32 resv[3]; > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-28 14:07 ` Pavel Begunkov @ 2021-04-28 14:16 ` Jens Axboe 2021-04-28 14:53 ` Pavel Begunkov 2021-04-29 3:41 ` Hao Xu 2021-04-29 3:28 ` Hao Xu 1 sibling, 2 replies; 40+ messages in thread From: Jens Axboe @ 2021-04-28 14:16 UTC (permalink / raw) To: Pavel Begunkov, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 8:07 AM, Pavel Begunkov wrote: >> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >> index e1ae46683301..311532ff6ce3 100644 >> --- a/include/uapi/linux/io_uring.h >> +++ b/include/uapi/linux/io_uring.h >> @@ -98,6 +98,7 @@ enum { >> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >> >> enum { >> IORING_OP_NOP, >> @@ -259,7 +260,7 @@ struct io_uring_params { >> __u32 cq_entries; >> __u32 flags; >> __u32 sq_thread_cpu; >> - __u32 sq_thread_idle; >> + __u64 sq_thread_idle; > > breaks userspace API And I don't think we need to. If you're using IDLE_NS, then the value should by definition be small enough that it'd fit in 32-bits. If you need higher timeouts, don't set it and it's in usec instead. So I'd just leave this one alone. -- Jens Axboe ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-28 14:16 ` Jens Axboe @ 2021-04-28 14:53 ` Pavel Begunkov 2021-04-28 14:54 ` Jens Axboe 2021-04-29 3:41 ` Hao Xu 1 sibling, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-28 14:53 UTC (permalink / raw) To: Jens Axboe, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 3:16 PM, Jens Axboe wrote: > On 4/28/21 8:07 AM, Pavel Begunkov wrote: >>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >>> index e1ae46683301..311532ff6ce3 100644 >>> --- a/include/uapi/linux/io_uring.h >>> +++ b/include/uapi/linux/io_uring.h >>> @@ -98,6 +98,7 @@ enum { >>> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >>> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >>> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >>> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >>> >>> enum { >>> IORING_OP_NOP, >>> @@ -259,7 +260,7 @@ struct io_uring_params { >>> __u32 cq_entries; >>> __u32 flags; >>> __u32 sq_thread_cpu; >>> - __u32 sq_thread_idle; >>> + __u64 sq_thread_idle; >> >> breaks userspace API > > And I don't think we need to. If you're using IDLE_NS, then the value > should by definition be small enough that it'd fit in 32-bits. If you > need higher timeouts, don't set it and it's in usec instead. > > So I'd just leave this one alone. Sounds good u64 time_ns = p->sq_thread_idle; if (!IDLE_NS) time_ns *= NSEC_PER_MSEC; idel = ns_to_jiffies(time_ns); -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-28 14:53 ` Pavel Begunkov @ 2021-04-28 14:54 ` Jens Axboe 0 siblings, 0 replies; 40+ messages in thread From: Jens Axboe @ 2021-04-28 14:54 UTC (permalink / raw) To: Pavel Begunkov, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 8:53 AM, Pavel Begunkov wrote: > On 4/28/21 3:16 PM, Jens Axboe wrote: >> On 4/28/21 8:07 AM, Pavel Begunkov wrote: >>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >>>> index e1ae46683301..311532ff6ce3 100644 >>>> --- a/include/uapi/linux/io_uring.h >>>> +++ b/include/uapi/linux/io_uring.h >>>> @@ -98,6 +98,7 @@ enum { >>>> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >>>> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >>>> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >>>> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >>>> >>>> enum { >>>> IORING_OP_NOP, >>>> @@ -259,7 +260,7 @@ struct io_uring_params { >>>> __u32 cq_entries; >>>> __u32 flags; >>>> __u32 sq_thread_cpu; >>>> - __u32 sq_thread_idle; >>>> + __u64 sq_thread_idle; >>> >>> breaks userspace API >> >> And I don't think we need to. If you're using IDLE_NS, then the value >> should by definition be small enough that it'd fit in 32-bits. If you >> need higher timeouts, don't set it and it's in usec instead. >> >> So I'd just leave this one alone. > > Sounds good > > u64 time_ns = p->sq_thread_idle; > if (!IDLE_NS) > time_ns *= NSEC_PER_MSEC; > idel = ns_to_jiffies(time_ns); Precisely! With the overlap being there, there's no need to make it bigger. And having nsec granularity if your idle time is in the msecs doesn't make a lot of sense. -- Jens Axboe ^ permalink raw reply [flat|nested] 40+ messages in thread
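Written out, the conversion sketched in the two mails above could look
like the following at ring creation time. This is a sketch only:
nsecs_to_jiffies64() stands in for the pseudocode's "ns_to_jiffies", the
fields come from the patch context, and the caveat noted in the comment
is what the rest of this sub-thread ends up discussing.

	/*
	 * Sketch: keep sq_thread_idle as __u32 in the uAPI and only
	 * reinterpret it when IORING_SETUP_IDLE_NS is set. Note that
	 * nsecs_to_jiffies64() truncates anything below one tick to 0,
	 * which is why the posted patch keeps a separate ns mode instead.
	 */
	u64 time_ns = p->sq_thread_idle;

	if (!(p->flags & IORING_SETUP_IDLE_NS))
		time_ns *= NSEC_PER_MSEC;	/* legacy millisecond input */

	ctx->sq_thread_idle = nsecs_to_jiffies64(time_ns);
	if (!ctx->sq_thread_idle)
		ctx->sq_thread_idle = HZ;	/* existing 1s default */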
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-28 14:16 ` Jens Axboe 2021-04-28 14:53 ` Pavel Begunkov @ 2021-04-29 3:41 ` Hao Xu 2021-04-29 9:11 ` Pavel Begunkov 1 sibling, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-04-29 3:41 UTC (permalink / raw) To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, Joseph Qi 在 2021/4/28 下午10:16, Jens Axboe 写道: > On 4/28/21 8:07 AM, Pavel Begunkov wrote: >>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >>> index e1ae46683301..311532ff6ce3 100644 >>> --- a/include/uapi/linux/io_uring.h >>> +++ b/include/uapi/linux/io_uring.h >>> @@ -98,6 +98,7 @@ enum { >>> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >>> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >>> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >>> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >>> >>> enum { >>> IORING_OP_NOP, >>> @@ -259,7 +260,7 @@ struct io_uring_params { >>> __u32 cq_entries; >>> __u32 flags; >>> __u32 sq_thread_cpu; >>> - __u32 sq_thread_idle; >>> + __u64 sq_thread_idle; >> >> breaks userspace API > > And I don't think we need to. If you're using IDLE_NS, then the value > should by definition be small enough that it'd fit in 32-bits. If you I make it u64 since I thought users may want a full flexibility to set idle in nanosecond granularity(eg. (1e6 + 10) ns cannot be set by millisecond granularity). But I'm not sure if this deserve changing the userspace API. > need higher timeouts, don't set it and it's in usec instead. > > So I'd just leave this one alone. > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-29 3:41 ` Hao Xu @ 2021-04-29 9:11 ` Pavel Begunkov 2021-05-05 14:07 ` Hao Xu 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-29 9:11 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/29/21 4:41 AM, Hao Xu wrote: > 在 2021/4/28 下午10:16, Jens Axboe 写道: >> On 4/28/21 8:07 AM, Pavel Begunkov wrote: >>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >>>> index e1ae46683301..311532ff6ce3 100644 >>>> --- a/include/uapi/linux/io_uring.h >>>> +++ b/include/uapi/linux/io_uring.h >>>> @@ -98,6 +98,7 @@ enum { >>>> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >>>> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >>>> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >>>> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >>>> enum { >>>> IORING_OP_NOP, >>>> @@ -259,7 +260,7 @@ struct io_uring_params { >>>> __u32 cq_entries; >>>> __u32 flags; >>>> __u32 sq_thread_cpu; >>>> - __u32 sq_thread_idle; >>>> + __u64 sq_thread_idle; >>> >>> breaks userspace API >> >> And I don't think we need to. If you're using IDLE_NS, then the value >> should by definition be small enough that it'd fit in 32-bits. If you > I make it u64 since I thought users may want a full flexibility to set > idle in nanosecond granularity(eg. (1e6 + 10) ns cannot be set by It's a really weird user requiring such a precision. u32 allows up to ~1s, and if more is needed users can switch to ms mode, so in the worst case the precision is 1/1000 of the desired value, more than enough. > millisecond granularity). But I'm not sure if this deserve changing the > userspace API. That's not about deserve or not, we can't break ABI. Can be worked around, e.g. by taking resv fields, but don't see a reason >> need higher timeouts, don't set it and it's in usec instead. >> >> So I'd just leave this one alone. >> > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-29 9:11 ` Pavel Begunkov @ 2021-05-05 14:07 ` Hao Xu 2021-05-05 17:40 ` Pavel Begunkov 0 siblings, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-05-05 14:07 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/4/29 下午5:11, Pavel Begunkov 写道: > On 4/29/21 4:41 AM, Hao Xu wrote: >> 在 2021/4/28 下午10:16, Jens Axboe 写道: >>> On 4/28/21 8:07 AM, Pavel Begunkov wrote: >>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >>>>> index e1ae46683301..311532ff6ce3 100644 >>>>> --- a/include/uapi/linux/io_uring.h >>>>> +++ b/include/uapi/linux/io_uring.h >>>>> @@ -98,6 +98,7 @@ enum { >>>>> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >>>>> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >>>>> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >>>>> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >>>>> enum { >>>>> IORING_OP_NOP, >>>>> @@ -259,7 +260,7 @@ struct io_uring_params { >>>>> __u32 cq_entries; >>>>> __u32 flags; >>>>> __u32 sq_thread_cpu; >>>>> - __u32 sq_thread_idle; >>>>> + __u64 sq_thread_idle; >>>> >>>> breaks userspace API >>> >>> And I don't think we need to. If you're using IDLE_NS, then the value >>> should by definition be small enough that it'd fit in 32-bits. If you >> I make it u64 since I thought users may want a full flexibility to set >> idle in nanosecond granularity(eg. (1e6 + 10) ns cannot be set by > > It's a really weird user requiring such a precision. u32 allows up to > ~1s, and if more is needed users can switch to ms mode, so in the worst > case the precision is 1/1000 of the desired value, more than enough. > >> millisecond granularity). But I'm not sure if this deserve changing the >> userspace API. > > That's not about deserve or not, we can't break ABI. Can be worked around, Is it for compatibility reason? > e.g. by taking resv fields, but don't see a reason > >>> need higher timeouts, don't set it and it's in usec instead. >>> >>> So I'd just leave this one alone. >>> >> > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-05-05 14:07 ` Hao Xu @ 2021-05-05 17:40 ` Pavel Begunkov 0 siblings, 0 replies; 40+ messages in thread From: Pavel Begunkov @ 2021-05-05 17:40 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 5/5/21 3:07 PM, Hao Xu wrote: > 在 2021/4/29 下午5:11, Pavel Begunkov 写道: >> On 4/29/21 4:41 AM, Hao Xu wrote: >>> 在 2021/4/28 下午10:16, Jens Axboe 写道: >>>> On 4/28/21 8:07 AM, Pavel Begunkov wrote: >>>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >>>>>> index e1ae46683301..311532ff6ce3 100644 >>>>>> --- a/include/uapi/linux/io_uring.h >>>>>> +++ b/include/uapi/linux/io_uring.h >>>>>> @@ -98,6 +98,7 @@ enum { >>>>>> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >>>>>> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >>>>>> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >>>>>> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >>>>>> enum { >>>>>> IORING_OP_NOP, >>>>>> @@ -259,7 +260,7 @@ struct io_uring_params { >>>>>> __u32 cq_entries; >>>>>> __u32 flags; >>>>>> __u32 sq_thread_cpu; >>>>>> - __u32 sq_thread_idle; >>>>>> + __u64 sq_thread_idle; >>>>> >>>>> breaks userspace API >>>> >>>> And I don't think we need to. If you're using IDLE_NS, then the value >>>> should by definition be small enough that it'd fit in 32-bits. If you >>> I make it u64 since I thought users may want a full flexibility to set >>> idle in nanosecond granularity(eg. (1e6 + 10) ns cannot be set by >> >> It's a really weird user requiring such a precision. u32 allows up to >> ~1s, and if more is needed users can switch to ms mode, so in the worst >> case the precision is 1/1000 of the desired value, more than enough. >> >>> millisecond granularity). But I'm not sure if this deserve changing the >>> userspace API. >> That's not about deserve or not, we can't break ABI. Can be worked around, > Is it for compatibility reason? Right, all binaries should work without recompilation, including those not using liburing. >> e.g. by taking resv fields, but don't see a reason >> >>>> need higher timeouts, don't set it and it's in usec instead. >>>> >>>> So I'd just leave this one alone. >>>> >>> >> > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-28 14:07 ` Pavel Begunkov 2021-04-28 14:16 ` Jens Axboe @ 2021-04-29 3:28 ` Hao Xu 2021-04-29 22:15 ` Pavel Begunkov 1 sibling, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-04-29 3:28 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/4/28 下午10:07, Pavel Begunkov 写道: > On 4/28/21 2:32 PM, Hao Xu wrote: >> currently unit of io_sq_thread_idle is millisecond, the smallest value >> is 1ms, which means for IOPS > 1000, sqthread will very likely take >> 100% cpu usage. This is not necessary in some cases, like users may >> don't care about latency much in low IO pressure >> (like 1000 < IOPS < 20000), but cpu resource does matter. So we offer >> an option of nanosecond granularity of io_sq_thread_idle. Some test >> results by fio below: > > If numbers justify it, I don't see why not do it in ns, but I'd suggest > to get rid of all the mess and simply convert to jiffies during ring > creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged. 1) here I keep millisecond mode for compatibility 2) I saw jiffies is calculated by HZ, and HZ could be large enough (like HZ = 1000) to make nsecs_to_jiffies64() = 0: u64 nsecs_to_jiffies64(u64 n) { #if (NSEC_PER_SEC % HZ) == 0 /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */ return div_u64(n, NSEC_PER_SEC / HZ); #elif (HZ % 512) == 0 /* overflow after 292 years if HZ = 1024 */ return div_u64(n * HZ / 512, NSEC_PER_SEC / 512); #else /* ¦* Generic case - optimized for cases where HZ is a multiple of 3. ¦* overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc. ¦*/ return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ); #endif } say HZ = 1000, then nsec_to_jiffies64(1us) = 1e3 / (1e9 / 1e3) = 0 iow, nsec_to_jiffies64() doesn't work for n < (1e9 / HZ). > > Or is there a reason for having it high precision, i.e. ktime()? > >> uring average latency:(us) >> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >> 2k 10.93 10.68 10.72 10.7 10.79 10.52 10.59 10.54 10.47 10.39 8.4 >> 4k 10.55 10.48 10.51 10.42 10.35 8.34 >> 6k 10.82 10.5 10.39 8.4 >> 8k 10.44 10.45 10.34 8.39 >> 10k 10.45 10.39 8.33 >> >> uring cpu usage of sqthread: >> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >> 2k 4% 14% 24% 34.70% 44.70% 55% 65.10% 75.40% 85.40% 95.70% 100% >> 4k 7.70% 28.20% 48.50% 69% 90% 100% >> 6k 11.30% 42% 73% 100% >> 8k 15.30% 56.30% 97% 100% >> 10k 19% 70% 100% >> >> aio average latency:(us) >> iops latency 99th lat cpu >> 2k 13.34 14.272 3% >> 4k 13.195 14.016 7% >> 6k 13.29 14.656 9.70% >> 8k 13.2 14.656 12.70% >> 10k 13.2 15 17% >> >> fio config is: >> ./run_fio.sh >> fio \ >> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >> --io_sq_thread_idle=${2} >> >> in 2k IOPS, if latency of 10.93us is acceptable for an application, >> then they get 100% - 4% = 96% reduction of cpu usage, while the latency >> is smaller than aio(10.93us vs 13.34us). 
>> >> Signed-off-by: Hao Xu <[email protected]> >> --- > [snip] >> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >> index e1ae46683301..311532ff6ce3 100644 >> --- a/include/uapi/linux/io_uring.h >> +++ b/include/uapi/linux/io_uring.h >> @@ -98,6 +98,7 @@ enum { >> #define IORING_SETUP_CLAMP (1U << 4) /* clamp SQ/CQ ring sizes */ >> #define IORING_SETUP_ATTACH_WQ (1U << 5) /* attach to existing wq */ >> #define IORING_SETUP_R_DISABLED (1U << 6) /* start with ring disabled */ >> +#define IORING_SETUP_IDLE_NS (1U << 7) /* unit of thread_idle is nano second */ >> >> enum { >> IORING_OP_NOP, >> @@ -259,7 +260,7 @@ struct io_uring_params { >> __u32 cq_entries; >> __u32 flags; >> __u32 sq_thread_cpu; >> - __u32 sq_thread_idle; >> + __u64 sq_thread_idle; > > breaks userspace API > >> __u32 features; >> __u32 wq_fd; >> __u32 resv[3]; >> > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-29 3:28 ` Hao Xu @ 2021-04-29 22:15 ` Pavel Begunkov 2021-09-26 10:00 ` Hao Xu 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-29 22:15 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/29/21 4:28 AM, Hao Xu wrote: > 在 2021/4/28 下午10:07, Pavel Begunkov 写道: >> On 4/28/21 2:32 PM, Hao Xu wrote: >>> currently unit of io_sq_thread_idle is millisecond, the smallest value >>> is 1ms, which means for IOPS > 1000, sqthread will very likely take >>> 100% cpu usage. This is not necessary in some cases, like users may >>> don't care about latency much in low IO pressure >>> (like 1000 < IOPS < 20000), but cpu resource does matter. So we offer >>> an option of nanosecond granularity of io_sq_thread_idle. Some test >>> results by fio below: >> >> If numbers justify it, I don't see why not do it in ns, but I'd suggest >> to get rid of all the mess and simply convert to jiffies during ring >> creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged. > 1) here I keep millisecond mode for compatibility > 2) I saw jiffies is calculated by HZ, and HZ could be large enough > (like HZ = 1000) to make nsecs_to_jiffies64() = 0: > > u64 nsecs_to_jiffies64(u64 n) > { > #if (NSEC_PER_SEC % HZ) == 0 > /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */ > return div_u64(n, NSEC_PER_SEC / HZ); > #elif (HZ % 512) == 0 > /* overflow after 292 years if HZ = 1024 */ > return div_u64(n * HZ / 512, NSEC_PER_SEC / 512); > #else > /* > ¦* Generic case - optimized for cases where HZ is a multiple of 3. > ¦* overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc. > ¦*/ > return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ); > #endif > } > > say HZ = 1000, then nsec_to_jiffies64(1us) = 1e3 / (1e9 / 1e3) = 0 > iow, nsec_to_jiffies64() doesn't work for n < (1e9 / HZ). Agree, apparently jiffies precision fractions of a second, e.g. 0.001s But I'd much prefer to not duplicate all that. So, jiffies won't do, ktime() may be ok but a bit heavier that we'd like it to be... Jens, any chance you remember something in the middle? Like same source as ktime() but without the heavy correction it does. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-04-29 22:15 ` Pavel Begunkov @ 2021-09-26 10:00 ` Hao Xu 2021-09-28 10:51 ` Pavel Begunkov 0 siblings, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-09-26 10:00 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/4/30 上午6:15, Pavel Begunkov 写道: > On 4/29/21 4:28 AM, Hao Xu wrote: >> 在 2021/4/28 下午10:07, Pavel Begunkov 写道: >>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>> currently unit of io_sq_thread_idle is millisecond, the smallest value >>>> is 1ms, which means for IOPS > 1000, sqthread will very likely take >>>> 100% cpu usage. This is not necessary in some cases, like users may >>>> don't care about latency much in low IO pressure >>>> (like 1000 < IOPS < 20000), but cpu resource does matter. So we offer >>>> an option of nanosecond granularity of io_sq_thread_idle. Some test >>>> results by fio below: >>> >>> If numbers justify it, I don't see why not do it in ns, but I'd suggest >>> to get rid of all the mess and simply convert to jiffies during ring >>> creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged. >> 1) here I keep millisecond mode for compatibility >> 2) I saw jiffies is calculated by HZ, and HZ could be large enough >> (like HZ = 1000) to make nsecs_to_jiffies64() = 0: >> >> u64 nsecs_to_jiffies64(u64 n) >> { >> #if (NSEC_PER_SEC % HZ) == 0 >> /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */ >> return div_u64(n, NSEC_PER_SEC / HZ); >> #elif (HZ % 512) == 0 >> /* overflow after 292 years if HZ = 1024 */ >> return div_u64(n * HZ / 512, NSEC_PER_SEC / 512); >> #else >> /* >> ¦* Generic case - optimized for cases where HZ is a multiple of 3. >> ¦* overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc. >> ¦*/ >> return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ); >> #endif >> } >> >> say HZ = 1000, then nsec_to_jiffies64(1us) = 1e3 / (1e9 / 1e3) = 0 >> iow, nsec_to_jiffies64() doesn't work for n < (1e9 / HZ). > > Agree, apparently jiffies precision fractions of a second, e.g. 0.001s > But I'd much prefer to not duplicate all that. So, jiffies won't do, > ktime() may be ok but a bit heavier that we'd like it to be... > > Jens, any chance you remember something in the middle? Like same source > as ktime() but without the heavy correction it does. I'm gonna pick this one up again, currently this patch with ktime_get_ns() works good on our productions. This patch makes the latency a bit higher than before, but still lower than aio. I haven't gotten a faster alternate for ktime_get_ns(), any hints? Regards, Hao > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-26 10:00 ` Hao Xu @ 2021-09-28 10:51 ` Pavel Begunkov 2021-09-29 7:52 ` Hao Xu 2021-09-29 9:24 ` Hao Xu 0 siblings, 2 replies; 40+ messages in thread From: Pavel Begunkov @ 2021-09-28 10:51 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 9/26/21 11:00 AM, Hao Xu wrote: > 在 2021/4/30 上午6:15, Pavel Begunkov 写道: >> On 4/29/21 4:28 AM, Hao Xu wrote: >>> 在 2021/4/28 下午10:07, Pavel Begunkov 写道: >>>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>>> currently unit of io_sq_thread_idle is millisecond, the smallest value >>>>> is 1ms, which means for IOPS > 1000, sqthread will very likely take >>>>> 100% cpu usage. This is not necessary in some cases, like users may >>>>> don't care about latency much in low IO pressure >>>>> (like 1000 < IOPS < 20000), but cpu resource does matter. So we offer >>>>> an option of nanosecond granularity of io_sq_thread_idle. Some test >>>>> results by fio below: >>>> >>>> If numbers justify it, I don't see why not do it in ns, but I'd suggest >>>> to get rid of all the mess and simply convert to jiffies during ring >>>> creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged. >>> 1) here I keep millisecond mode for compatibility >>> 2) I saw jiffies is calculated by HZ, and HZ could be large enough >>> (like HZ = 1000) to make nsecs_to_jiffies64() = 0: >>> >>> u64 nsecs_to_jiffies64(u64 n) >>> { >>> #if (NSEC_PER_SEC % HZ) == 0 >>> /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */ >>> return div_u64(n, NSEC_PER_SEC / HZ); >>> #elif (HZ % 512) == 0 >>> /* overflow after 292 years if HZ = 1024 */ >>> return div_u64(n * HZ / 512, NSEC_PER_SEC / 512); >>> #else >>> /* >>> ¦* Generic case - optimized for cases where HZ is a multiple of 3. >>> ¦* overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc. >>> ¦*/ >>> return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ); >>> #endif >>> } >>> >>> say HZ = 1000, then nsec_to_jiffies64(1us) = 1e3 / (1e9 / 1e3) = 0 >>> iow, nsec_to_jiffies64() doesn't work for n < (1e9 / HZ). >> >> Agree, apparently jiffies precision fractions of a second, e.g. 0.001s >> But I'd much prefer to not duplicate all that. So, jiffies won't do, >> ktime() may be ok but a bit heavier that we'd like it to be... >> >> Jens, any chance you remember something in the middle? Like same source >> as ktime() but without the heavy correction it does. > I'm gonna pick this one up again, currently this patch > with ktime_get_ns() works good on our productions. This > patch makes the latency a bit higher than before, but > still lower than aio. > I haven't gotten a faster alternate for ktime_get_ns(), > any hints? Good, I'd suggest to look through Documentation/core-api/timekeeping.rst In particular coarse variants may be of interest. https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access Off topic: it sounds that you're a long user of SQPOLL. Interesting to ask how do you find it in general. i.e. does it help much with latency? Performance? Anything else? -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
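Concretely, the choice being discussed is between the precise and the
coarse monotonic accessors. A minimal sketch of switching the sqpoll
timeout source between them follows; the helper name io_sqd_now_ns() is
made up for illustration and is not from the posted patch.

#include <linux/ktime.h>
#include <linux/timekeeping.h>

/*
 * Sketch: ktime_get_ns() reads the clocksource and is precise but
 * heavier; ktime_get_coarse_ns() returns the value cached at the last
 * timer tick, so it is cheap but only roughly jiffies-accurate.
 */
static inline u64 io_sqd_now_ns(bool precise)
{
	return precise ? ktime_get_ns() : ktime_get_coarse_ns();
}

Whether the coarse variant is accurate enough for sub-millisecond idle
values is exactly the question raised in the replies below.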
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-28 10:51 ` Pavel Begunkov @ 2021-09-29 7:52 ` Hao Xu 2021-09-29 9:24 ` Hao Xu 1 sibling, 0 replies; 40+ messages in thread From: Hao Xu @ 2021-09-29 7:52 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/9/28 下午6:51, Pavel Begunkov 写道: > On 9/26/21 11:00 AM, Hao Xu wrote: >> 在 2021/4/30 上午6:15, Pavel Begunkov 写道: >>> On 4/29/21 4:28 AM, Hao Xu wrote: >>>> 在 2021/4/28 下午10:07, Pavel Begunkov 写道: >>>>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>>>> currently unit of io_sq_thread_idle is millisecond, the smallest value >>>>>> is 1ms, which means for IOPS > 1000, sqthread will very likely take >>>>>> 100% cpu usage. This is not necessary in some cases, like users may >>>>>> don't care about latency much in low IO pressure >>>>>> (like 1000 < IOPS < 20000), but cpu resource does matter. So we offer >>>>>> an option of nanosecond granularity of io_sq_thread_idle. Some test >>>>>> results by fio below: >>>>> >>>>> If numbers justify it, I don't see why not do it in ns, but I'd suggest >>>>> to get rid of all the mess and simply convert to jiffies during ring >>>>> creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged. >>>> 1) here I keep millisecond mode for compatibility >>>> 2) I saw jiffies is calculated by HZ, and HZ could be large enough >>>> (like HZ = 1000) to make nsecs_to_jiffies64() = 0: >>>> >>>> u64 nsecs_to_jiffies64(u64 n) >>>> { >>>> #if (NSEC_PER_SEC % HZ) == 0 >>>> /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */ >>>> return div_u64(n, NSEC_PER_SEC / HZ); >>>> #elif (HZ % 512) == 0 >>>> /* overflow after 292 years if HZ = 1024 */ >>>> return div_u64(n * HZ / 512, NSEC_PER_SEC / 512); >>>> #else >>>> /* >>>> ¦* Generic case - optimized for cases where HZ is a multiple of 3. >>>> ¦* overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc. >>>> ¦*/ >>>> return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ); >>>> #endif >>>> } >>>> >>>> say HZ = 1000, then nsec_to_jiffies64(1us) = 1e3 / (1e9 / 1e3) = 0 >>>> iow, nsec_to_jiffies64() doesn't work for n < (1e9 / HZ). >>> >>> Agree, apparently jiffies precision fractions of a second, e.g. 0.001s >>> But I'd much prefer to not duplicate all that. So, jiffies won't do, >>> ktime() may be ok but a bit heavier that we'd like it to be... >>> >>> Jens, any chance you remember something in the middle? Like same source >>> as ktime() but without the heavy correction it does. >> I'm gonna pick this one up again, currently this patch >> with ktime_get_ns() works good on our productions. This >> patch makes the latency a bit higher than before, but >> still lower than aio. >> I haven't gotten a faster alternate for ktime_get_ns(), >> any hints? > > Good, I'd suggest to look through Documentation/core-api/timekeeping.rst > In particular coarse variants may be of interest. > https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access > > > Off topic: it sounds that you're a long user of SQPOLL. Interesting to > ask how do you find it in general. i.e. does it help much with > latency? Performance? Anything else? It helps with the latency and iops(can not surely recall the number now..) It is useful when many user threads offload IO to just one sqthread. > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-28 10:51 ` Pavel Begunkov 2021-09-29 7:52 ` Hao Xu @ 2021-09-29 9:24 ` Hao Xu 2021-09-29 11:37 ` Pavel Begunkov 1 sibling, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-09-29 9:24 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/9/28 下午6:51, Pavel Begunkov 写道: > On 9/26/21 11:00 AM, Hao Xu wrote: >> 在 2021/4/30 上午6:15, Pavel Begunkov 写道: >>> On 4/29/21 4:28 AM, Hao Xu wrote: >>>> 在 2021/4/28 下午10:07, Pavel Begunkov 写道: >>>>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>>>> currently unit of io_sq_thread_idle is millisecond, the smallest value >>>>>> is 1ms, which means for IOPS > 1000, sqthread will very likely take >>>>>> 100% cpu usage. This is not necessary in some cases, like users may >>>>>> don't care about latency much in low IO pressure >>>>>> (like 1000 < IOPS < 20000), but cpu resource does matter. So we offer >>>>>> an option of nanosecond granularity of io_sq_thread_idle. Some test >>>>>> results by fio below: >>>>> >>>>> If numbers justify it, I don't see why not do it in ns, but I'd suggest >>>>> to get rid of all the mess and simply convert to jiffies during ring >>>>> creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged. >>>> 1) here I keep millisecond mode for compatibility >>>> 2) I saw jiffies is calculated by HZ, and HZ could be large enough >>>> (like HZ = 1000) to make nsecs_to_jiffies64() = 0: >>>> >>>> u64 nsecs_to_jiffies64(u64 n) >>>> { >>>> #if (NSEC_PER_SEC % HZ) == 0 >>>> /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */ >>>> return div_u64(n, NSEC_PER_SEC / HZ); >>>> #elif (HZ % 512) == 0 >>>> /* overflow after 292 years if HZ = 1024 */ >>>> return div_u64(n * HZ / 512, NSEC_PER_SEC / 512); >>>> #else >>>> /* >>>> ¦* Generic case - optimized for cases where HZ is a multiple of 3. >>>> ¦* overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc. >>>> ¦*/ >>>> return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ); >>>> #endif >>>> } >>>> >>>> say HZ = 1000, then nsec_to_jiffies64(1us) = 1e3 / (1e9 / 1e3) = 0 >>>> iow, nsec_to_jiffies64() doesn't work for n < (1e9 / HZ). >>> >>> Agree, apparently jiffies precision fractions of a second, e.g. 0.001s >>> But I'd much prefer to not duplicate all that. So, jiffies won't do, >>> ktime() may be ok but a bit heavier that we'd like it to be... >>> >>> Jens, any chance you remember something in the middle? Like same source >>> as ktime() but without the heavy correction it does. >> I'm gonna pick this one up again, currently this patch >> with ktime_get_ns() works good on our productions. This >> patch makes the latency a bit higher than before, but >> still lower than aio. >> I haven't gotten a faster alternate for ktime_get_ns(), >> any hints? > > Good, I'd suggest to look through Documentation/core-api/timekeeping.rst > In particular coarse variants may be of interest. > https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access > The coarse functions seems to be like jiffies, because they use the last timer tick(from the explanation in that doc, it seems the timer tick is in the same frequency as jiffies update). So I believe it is just another format of jiffies which is low accurate. > > Off topic: it sounds that you're a long user of SQPOLL. Interesting to > ask how do you find it in general. i.e. does it help much with > latency? Performance? Anything else? 
> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-29 9:24 ` Hao Xu @ 2021-09-29 11:37 ` Pavel Begunkov 2021-09-29 12:13 ` Hao Xu 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-09-29 11:37 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 9/29/21 10:24 AM, Hao Xu wrote: > 在 2021/9/28 下午6:51, Pavel Begunkov 写道: >> On 9/26/21 11:00 AM, Hao Xu wrote: [...] >>> I'm gonna pick this one up again, currently this patch >>> with ktime_get_ns() works good on our productions. This >>> patch makes the latency a bit higher than before, but >>> still lower than aio. >>> I haven't gotten a faster alternate for ktime_get_ns(), >>> any hints? >> >> Good, I'd suggest to look through Documentation/core-api/timekeeping.rst >> In particular coarse variants may be of interest. >> https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access >> > The coarse functions seems to be like jiffies, because they use the last > timer tick(from the explanation in that doc, it seems the timer tick is > in the same frequency as jiffies update). So I believe it is just > another format of jiffies which is low accurate. I haven't looked into the details, but it seems that unlike jiffies for the coarse mode 10ms (or whatever) is the worst case, but it _may_ be much better on average and feasible for your case, but can't predict if that's really the case in a real system and what will be the relative error comparing to normal ktime_ns(). -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-29 11:37 ` Pavel Begunkov @ 2021-09-29 12:13 ` Hao Xu 2021-09-30 8:51 ` Pavel Begunkov 0 siblings, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-09-29 12:13 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/9/29 下午7:37, Pavel Begunkov 写道: > On 9/29/21 10:24 AM, Hao Xu wrote: >> 在 2021/9/28 下午6:51, Pavel Begunkov 写道: >>> On 9/26/21 11:00 AM, Hao Xu wrote: > [...] >>>> I'm gonna pick this one up again, currently this patch >>>> with ktime_get_ns() works good on our productions. This >>>> patch makes the latency a bit higher than before, but >>>> still lower than aio. >>>> I haven't gotten a faster alternate for ktime_get_ns(), >>>> any hints? >>> >>> Good, I'd suggest to look through Documentation/core-api/timekeeping.rst >>> In particular coarse variants may be of interest. >>> https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access >>> >> The coarse functions seems to be like jiffies, because they use the last >> timer tick(from the explanation in that doc, it seems the timer tick is >> in the same frequency as jiffies update). So I believe it is just >> another format of jiffies which is low accurate. > > I haven't looked into the details, but it seems that unlike jiffies for > the coarse mode 10ms (or whatever) is the worst case, but it _may_ be Maybe I'm wrong, but for jiffies, 10ms uis also the worst case, no? (say HZ = 100, then jiffies updated by 1 every 10ms) > much better on average and feasible for your case, but can't predict > if that's really the case in a real system and what will be the > relative error comparing to normal ktime_ns(). > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-29 12:13 ` Hao Xu @ 2021-09-30 8:51 ` Pavel Begunkov 2021-09-30 12:04 ` Pavel Begunkov 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-09-30 8:51 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 9/29/21 1:13 PM, Hao Xu wrote: > 在 2021/9/29 下午7:37, Pavel Begunkov 写道: >> On 9/29/21 10:24 AM, Hao Xu wrote: >>> 在 2021/9/28 下午6:51, Pavel Begunkov 写道: >>>> On 9/26/21 11:00 AM, Hao Xu wrote: >> [...] >>>>> I'm gonna pick this one up again, currently this patch >>>>> with ktime_get_ns() works good on our productions. This >>>>> patch makes the latency a bit higher than before, but >>>>> still lower than aio. >>>>> I haven't gotten a faster alternate for ktime_get_ns(), >>>>> any hints? >>>> >>>> Good, I'd suggest to look through Documentation/core-api/timekeeping.rst >>>> In particular coarse variants may be of interest. >>>> https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access >>>> >>> The coarse functions seems to be like jiffies, because they use the last >>> timer tick(from the explanation in that doc, it seems the timer tick is >>> in the same frequency as jiffies update). So I believe it is just >>> another format of jiffies which is low accurate. >> >> I haven't looked into the details, but it seems that unlike jiffies for >> the coarse mode 10ms (or whatever) is the worst case, but it _may_ be > Maybe I'm wrong, but for jiffies, 10ms uis also the worst case, no? > (say HZ = 100, then jiffies updated by 1 every 10ms) I'm speculating, but it sounds it's updated on every call to ktime_ns() in the system, so if someone else calls ktime_ns() every 1us, than the resolution will be 1us, where with jiffies the update interval is strictly 10ms when HZ=100. May be not true, need to see the code. >> much better on average and feasible for your case, but can't predict >> if that's really the case in a real system and what will be the >> relative error comparing to normal ktime_ns(). >> > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-30 8:51 ` Pavel Begunkov @ 2021-09-30 12:04 ` Pavel Begunkov 2021-10-05 15:00 ` Hao Xu 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-09-30 12:04 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 9/30/21 9:51 AM, Pavel Begunkov wrote: > On 9/29/21 1:13 PM, Hao Xu wrote: >> 在 2021/9/29 下午7:37, Pavel Begunkov 写道: >>> On 9/29/21 10:24 AM, Hao Xu wrote: >>>> 在 2021/9/28 下午6:51, Pavel Begunkov 写道: >>>>> On 9/26/21 11:00 AM, Hao Xu wrote: >>> [...] >>>>>> I'm gonna pick this one up again, currently this patch >>>>>> with ktime_get_ns() works good on our productions. This >>>>>> patch makes the latency a bit higher than before, but >>>>>> still lower than aio. >>>>>> I haven't gotten a faster alternate for ktime_get_ns(), >>>>>> any hints? >>>>> >>>>> Good, I'd suggest to look through Documentation/core-api/timekeeping.rst >>>>> In particular coarse variants may be of interest. >>>>> https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access >>>>> >>>> The coarse functions seems to be like jiffies, because they use the last >>>> timer tick(from the explanation in that doc, it seems the timer tick is >>>> in the same frequency as jiffies update). So I believe it is just >>>> another format of jiffies which is low accurate. >>> >>> I haven't looked into the details, but it seems that unlike jiffies for >>> the coarse mode 10ms (or whatever) is the worst case, but it _may_ be >> Maybe I'm wrong, but for jiffies, 10ms uis also the worst case, no? >> (say HZ = 100, then jiffies updated by 1 every 10ms) > > I'm speculating, but it sounds it's updated on every call to ktime_ns() > in the system, so if someone else calls ktime_ns() every 1us, than the > resolution will be 1us, where with jiffies the update interval is strictly > 10ms when HZ=100. May be not true, need to see the code. Taking a second quick look, doesn't seem to be the case indeed. And it's limited to your feature anyway, so the overhead of ktime_get() shouldn't matter much. >>> much better on average and feasible for your case, but can't predict >>> if that's really the case in a real system and what will be the >>> relative error comparing to normal ktime_ns(). -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle 2021-09-30 12:04 ` Pavel Begunkov @ 2021-10-05 15:00 ` Hao Xu 0 siblings, 0 replies; 40+ messages in thread From: Hao Xu @ 2021-10-05 15:00 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/9/30 下午8:04, Pavel Begunkov 写道: > On 9/30/21 9:51 AM, Pavel Begunkov wrote: >> On 9/29/21 1:13 PM, Hao Xu wrote: >>> 在 2021/9/29 下午7:37, Pavel Begunkov 写道: >>>> On 9/29/21 10:24 AM, Hao Xu wrote: >>>>> 在 2021/9/28 下午6:51, Pavel Begunkov 写道: >>>>>> On 9/26/21 11:00 AM, Hao Xu wrote: >>>> [...] >>>>>>> I'm gonna pick this one up again, currently this patch >>>>>>> with ktime_get_ns() works good on our productions. This >>>>>>> patch makes the latency a bit higher than before, but >>>>>>> still lower than aio. >>>>>>> I haven't gotten a faster alternate for ktime_get_ns(), >>>>>>> any hints? >>>>>> >>>>>> Good, I'd suggest to look through Documentation/core-api/timekeeping.rst >>>>>> In particular coarse variants may be of interest. >>>>>> https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access >>>>>> >>>>> The coarse functions seems to be like jiffies, because they use the last >>>>> timer tick(from the explanation in that doc, it seems the timer tick is >>>>> in the same frequency as jiffies update). So I believe it is just >>>>> another format of jiffies which is low accurate. >>>> >>>> I haven't looked into the details, but it seems that unlike jiffies for >>>> the coarse mode 10ms (or whatever) is the worst case, but it _may_ be >>> Maybe I'm wrong, but for jiffies, 10ms uis also the worst case, no? >>> (say HZ = 100, then jiffies updated by 1 every 10ms) >> >> I'm speculating, but it sounds it's updated on every call to ktime_ns() >> in the system, so if someone else calls ktime_ns() every 1us, than the >> resolution will be 1us, where with jiffies the update interval is strictly >> 10ms when HZ=100. May be not true, need to see the code. > > Taking a second quick look, doesn't seem to be the case indeed. And it's > limited to your feature anyway, so the overhead of ktime_get() shouldn't > matter much. Thanks, I'll continue on this when return from the holiday. > >>>> much better on average and feasible for your case, but can't predict >>>> if that's really the case in a real system and what will be the >>>> relative error comparing to normal ktime_ns(). > ^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 13:32 [PATCH RFC 5.13 0/2] adaptive sqpoll and its wakeup optimization Hao Xu 2021-04-28 13:32 ` [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle Hao Xu @ 2021-04-28 13:32 ` Hao Xu 2021-04-28 14:12 ` Jens Axboe ` (2 more replies) 1 sibling, 3 replies; 40+ messages in thread From: Hao Xu @ 2021-04-28 13:32 UTC (permalink / raw) To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi sqes are submitted by sqthread when it is leveraged, which means there is IO latency when waking up sqthread. To wipe it out, submit limited number of sqes in the original task context. Tests result below: 99th latency: iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us with this patch: 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 without this patch: 2k 15 14 15 15 15 14 15 14 14 13 11.84 fio config: ./run_fio.sh fio \ --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ --direct=1 --rw=randread --time_based=1 --runtime=300 \ --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ --io_sq_thread_idle=${2} Signed-off-by: Hao Xu <[email protected]> --- fs/io_uring.c | 29 +++++++++++++++++++++++------ include/uapi/linux/io_uring.h | 1 + 2 files changed, 24 insertions(+), 6 deletions(-) diff --git a/fs/io_uring.c b/fs/io_uring.c index 1871fad48412..f0a01232671e 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) { struct io_ring_ctx *ctx = req->ctx; struct io_kiocb *link = io_prep_linked_timeout(req); - struct io_uring_task *tctx = req->task->io_uring; + struct io_uring_task *tctx = NULL; + + if (ctx->sq_data && ctx->sq_data->thread) + tctx = ctx->sq_data->thread->io_uring; + else + tctx = req->task->io_uring; BUG_ON(!tctx); BUG_ON(!tctx->io_wq); @@ -9063,9 +9068,10 @@ static void io_uring_try_cancel(struct files_struct *files) xa_for_each(&tctx->xa, index, node) { struct io_ring_ctx *ctx = node->ctx; - /* sqpoll task will cancel all its requests */ - if (!ctx->sq_data) - io_uring_try_cancel_requests(ctx, current, files); + /* + * for sqpoll ctx, there may be requests in task_works etc. 
+ */ + io_uring_try_cancel_requests(ctx, current, files); } } @@ -9271,7 +9277,8 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz io_run_task_work(); if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP | - IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG))) + IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG | + IORING_ENTER_SQ_DEPUTY))) return -EINVAL; f = fdget(fd); @@ -9304,8 +9311,18 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz if (unlikely(ctx->sq_data->thread == NULL)) { goto out; } - if (flags & IORING_ENTER_SQ_WAKEUP) + if (flags & IORING_ENTER_SQ_WAKEUP) { wake_up(&ctx->sq_data->wait); + if ((flags & IORING_ENTER_SQ_DEPUTY) && + !(ctx->flags & IORING_SETUP_IOPOLL)) { + ret = io_uring_add_task_file(ctx); + if (unlikely(ret)) + goto out; + mutex_lock(&ctx->uring_lock); + io_submit_sqes(ctx, min(to_submit, 8U)); + mutex_unlock(&ctx->uring_lock); + } + } if (flags & IORING_ENTER_SQ_WAIT) { ret = io_sqpoll_wait_sq(ctx); if (ret) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 311532ff6ce3..b1130fec2b7d 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -251,6 +251,7 @@ struct io_cqring_offsets { #define IORING_ENTER_SQ_WAKEUP (1U << 1) #define IORING_ENTER_SQ_WAIT (1U << 2) #define IORING_ENTER_EXT_ARG (1U << 3) +#define IORING_ENTER_SQ_DEPUTY (1U << 4) /* * Passed in for io_uring_setup(2). Copied back with updated info on success -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 40+ messages in thread
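For context, a rough userspace-side sketch of how the new enter flag would be used (raw syscall rather than liburing; the helper name and the plain volatile read are illustrative simplifications, and IORING_ENTER_SQ_DEPUTY comes from the patched uapi header above): the application only enters the kernel when the SQ ring says the sqpoll thread needs a wakeup, and with this patch that same call can also submit a few sqes inline.

#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/* sq_flags points at the mmap'ed SQ ring flags word; real code should
 * use a proper acquire/atomic read instead of a plain volatile load. */
static int kick_sqpoll(int ring_fd, unsigned int to_submit,
		       const unsigned int *sq_flags)
{
	unsigned int flags = 0;

	if (*(volatile const unsigned int *)sq_flags & IORING_SQ_NEED_WAKEUP)
		flags = IORING_ENTER_SQ_WAKEUP | IORING_ENTER_SQ_DEPUTY;
	if (!flags)
		return 0;	/* sqpoll thread is still polling; nothing to do */
	return syscall(__NR_io_uring_enter, ring_fd, to_submit, 0, flags,
		       NULL, 0);
}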
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 13:32 ` [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread Hao Xu @ 2021-04-28 14:12 ` Jens Axboe 2021-04-29 4:12 ` Hao Xu 2021-04-28 14:34 ` Pavel Begunkov 2021-04-29 22:02 ` Pavel Begunkov 2 siblings, 1 reply; 40+ messages in thread From: Jens Axboe @ 2021-04-28 14:12 UTC (permalink / raw) To: Hao Xu; +Cc: io-uring, Pavel Begunkov, Joseph Qi On 4/28/21 7:32 AM, Hao Xu wrote: > sqes are submitted by sqthread when it is leveraged, which means there > is IO latency when waking up sqthread. To wipe it out, submit limited > number of sqes in the original task context. > Tests result below: > > 99th latency: > iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us > with this patch: > 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 > without this patch: > 2k 15 14 15 15 15 14 15 14 14 13 11.84 > > fio config: > ./run_fio.sh > fio \ > --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ > --direct=1 --rw=randread --time_based=1 --runtime=300 \ > --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ > --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ > --io_sq_thread_idle=${2} Interesting concept! One question: > @@ -9304,8 +9311,18 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz > if (unlikely(ctx->sq_data->thread == NULL)) { > goto out; > } > - if (flags & IORING_ENTER_SQ_WAKEUP) > + if (flags & IORING_ENTER_SQ_WAKEUP) { > wake_up(&ctx->sq_data->wait); > + if ((flags & IORING_ENTER_SQ_DEPUTY) && > + !(ctx->flags & IORING_SETUP_IOPOLL)) { > + ret = io_uring_add_task_file(ctx); > + if (unlikely(ret)) > + goto out; > + mutex_lock(&ctx->uring_lock); > + io_submit_sqes(ctx, min(to_submit, 8U)); > + mutex_unlock(&ctx->uring_lock); > + } > + } Do we want to wake the sqpoll thread _post_ submitting these ios? The idea being that if we're submitting now after a while (since the thread is sleeping), then we're most likely going to be submitting more than just this single batch. And the wakeup would do the same if done after the submit, it'd just not interfere with this submit. You could imagine a scenario where we do the wake and the sqpoll thread beats us to the submit, and now we're just stuck waiting for the uring_lock and end up doing nothing. Maybe you guys already tested this? Also curious if you did, what kind of requests are being submitted? That can have quite a bit of effect on how quickly the submit is done. -- Jens Axboe ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:12 ` Jens Axboe @ 2021-04-29 4:12 ` Hao Xu 0 siblings, 0 replies; 40+ messages in thread From: Hao Xu @ 2021-04-29 4:12 UTC (permalink / raw) To: Jens Axboe; +Cc: io-uring, Pavel Begunkov, Joseph Qi 在 2021/4/28 下午10:12, Jens Axboe 写道: > On 4/28/21 7:32 AM, Hao Xu wrote: >> sqes are submitted by sqthread when it is leveraged, which means there >> is IO latency when waking up sqthread. To wipe it out, submit limited >> number of sqes in the original task context. >> Tests result below: >> >> 99th latency: >> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >> with this patch: >> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >> without this patch: >> 2k 15 14 15 15 15 14 15 14 14 13 11.84 >> >> fio config: >> ./run_fio.sh >> fio \ >> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >> --io_sq_thread_idle=${2} > > Interesting concept! One question: > >> @@ -9304,8 +9311,18 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz >> if (unlikely(ctx->sq_data->thread == NULL)) { >> goto out; >> } >> - if (flags & IORING_ENTER_SQ_WAKEUP) >> + if (flags & IORING_ENTER_SQ_WAKEUP) { >> wake_up(&ctx->sq_data->wait); >> + if ((flags & IORING_ENTER_SQ_DEPUTY) && >> + !(ctx->flags & IORING_SETUP_IOPOLL)) { >> + ret = io_uring_add_task_file(ctx); >> + if (unlikely(ret)) >> + goto out; >> + mutex_lock(&ctx->uring_lock); >> + io_submit_sqes(ctx, min(to_submit, 8U)); >> + mutex_unlock(&ctx->uring_lock); >> + } >> + } > > Do we want to wake the sqpoll thread _post_ submitting these ios? The > idea being that if we're submitting now after a while (since the thread > is sleeping), then we're most likely going to be submitting more than > just this single batch. And the wakeup would do the same if done after yes, prediction that it will likely submit more than just that single batch is the idea behind this patch. > the submit, it'd just not interfere with this submit. You could imagine > a scenario where we do the wake and the sqpoll thread beats us to the > submit, and now we're just stuck waiting for the uring_lock and end up > doing nothing. true, I think trylock may be good to address this. > > Maybe you guys already tested this? Also curious if you did, what kind > of requests are being submitted? That can have quite a bit of effect on > how quickly the submit is done. we currently just have put it on fio benchmark and liburing tests, this patch definitely needs more thinking. > ^ permalink raw reply [flat|nested] 40+ messages in thread
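To make the two review points above concrete, one possible reshaping of the io_uring_enter() hunk (a sketch only, not a posted revision): do the inline submit first, take uring_lock with a trylock so the caller never blocks behind the sqpoll thread, and only then issue the wakeup.

	if (flags & IORING_ENTER_SQ_WAKEUP) {
		if ((flags & IORING_ENTER_SQ_DEPUTY) &&
		    !(ctx->flags & IORING_SETUP_IOPOLL)) {
			ret = io_uring_add_task_file(ctx);
			if (unlikely(ret))
				goto out;
			/* if the sqpoll thread already holds uring_lock it
			 * is submitting anyway, so don't wait behind it */
			if (mutex_trylock(&ctx->uring_lock)) {
				io_submit_sqes(ctx, min(to_submit, 8U));
				mutex_unlock(&ctx->uring_lock);
			}
		}
		/* wake only after the inline submit so the two don't race */
		wake_up(&ctx->sq_data->wait);
	}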
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 13:32 ` [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread Hao Xu 2021-04-28 14:12 ` Jens Axboe @ 2021-04-28 14:34 ` Pavel Begunkov 2021-04-28 14:37 ` Pavel Begunkov ` (2 more replies) 2021-04-29 22:02 ` Pavel Begunkov 2 siblings, 3 replies; 40+ messages in thread From: Pavel Begunkov @ 2021-04-28 14:34 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/28/21 2:32 PM, Hao Xu wrote: > sqes are submitted by sqthread when it is leveraged, which means there > is IO latency when waking up sqthread. To wipe it out, submit limited > number of sqes in the original task context. > Tests result below: Frankly, it can be a nest of corner cases if not now then in the future, leading to a high maintenance burden. Hence, if we consider the change, I'd rather want to limit the userspace exposure, so it can be removed if needed. A noticeable change of behaviour here, as Hao recently asked, is that the ring can be passed to a task from a completely another thread group, and so the feature would execute from that context, not from the original/sqpoll one. Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be ignored if the previous point is addressed. > > 99th latency: > iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us > with this patch: > 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 > without this patch: > 2k 15 14 15 15 15 14 15 14 14 13 11.84 Not sure the second nine describes it well enough, please can you add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. Btw, how happened that only some of the numbers have fractional part? Can't believe they all but 3 were close enough to integer values. > fio config: > ./run_fio.sh > fio \ > --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ > --direct=1 --rw=randread --time_based=1 --runtime=300 \ > --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ > --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ > --io_sq_thread_idle=${2} > > Signed-off-by: Hao Xu <[email protected]> > --- > fs/io_uring.c | 29 +++++++++++++++++++++++------ > include/uapi/linux/io_uring.h | 1 + > 2 files changed, 24 insertions(+), 6 deletions(-) > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index 1871fad48412..f0a01232671e 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) > { > struct io_ring_ctx *ctx = req->ctx; > struct io_kiocb *link = io_prep_linked_timeout(req); > - struct io_uring_task *tctx = req->task->io_uring; > + struct io_uring_task *tctx = NULL; > + > + if (ctx->sq_data && ctx->sq_data->thread) > + tctx = ctx->sq_data->thread->io_uring; without park it's racy, sq_data->thread may become NULL and removed, as well as its ->io_uring. > + else > + tctx = req->task->io_uring; > > BUG_ON(!tctx); > BUG_ON(!tctx->io_wq); [snip] -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
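A sketch of the pattern the race comment above is pointing at (illustrative, not a posted fix; it assumes the existing io_sq_thread_park()/io_sq_thread_unpark() helpers in fs/io_uring.c and a made-up function name): sqd->thread, and with it its ->io_uring, can be torn down concurrently, so it may only be dereferenced while the sqpoll thread is parked.

/* Illustrative only: queue work through the sqpoll thread's tctx
 * without racing against sqpoll teardown. */
static void io_queue_via_sqpoll_tctx(struct io_sq_data *sqd,
				     struct io_wq_work *work)
{
	io_sq_thread_park(sqd);		/* pins sqd->thread while parked */
	if (sqd->thread)
		io_wq_enqueue(sqd->thread->io_uring->io_wq, work);
	io_sq_thread_unpark(sqd);
}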
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:34 ` Pavel Begunkov @ 2021-04-28 14:37 ` Pavel Begunkov 2021-04-29 4:37 ` Hao Xu 2021-04-28 14:39 ` Jens Axboe 2021-04-29 8:44 ` Hao Xu 2 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-28 14:37 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/28/21 3:34 PM, Pavel Begunkov wrote: > On 4/28/21 2:32 PM, Hao Xu wrote: >> sqes are submitted by sqthread when it is leveraged, which means there >> is IO latency when waking up sqthread. To wipe it out, submit limited >> number of sqes in the original task context. >> Tests result below: > > Frankly, it can be a nest of corner cases if not now then in the future, > leading to a high maintenance burden. Hence, if we consider the change, > I'd rather want to limit the userspace exposure, so it can be removed > if needed. > > A noticeable change of behaviour here, as Hao recently asked, is that > the ring can be passed to a task from a completely another thread group, > and so the feature would execute from that context, not from the > original/sqpoll one. So maybe something like: if (same_thread_group()) { /* submit */ } > > Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be > ignored if the previous point is addressed. I'd question whether it'd be better with the flag or without doing this feature by default. > >> >> 99th latency: >> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >> with this patch: >> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >> without this patch: >> 2k 15 14 15 15 15 14 15 14 14 13 11.84 > > Not sure the second nine describes it well enough, please can you > add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. > > Btw, how happened that only some of the numbers have fractional part? > Can't believe they all but 3 were close enough to integer values. > >> fio config: >> ./run_fio.sh >> fio \ >> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >> --io_sq_thread_idle=${2} >> >> Signed-off-by: Hao Xu <[email protected]> >> --- >> fs/io_uring.c | 29 +++++++++++++++++++++++------ >> include/uapi/linux/io_uring.h | 1 + >> 2 files changed, 24 insertions(+), 6 deletions(-) >> >> diff --git a/fs/io_uring.c b/fs/io_uring.c >> index 1871fad48412..f0a01232671e 100644 >> --- a/fs/io_uring.c >> +++ b/fs/io_uring.c >> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >> { >> struct io_ring_ctx *ctx = req->ctx; >> struct io_kiocb *link = io_prep_linked_timeout(req); >> - struct io_uring_task *tctx = req->task->io_uring; >> + struct io_uring_task *tctx = NULL; >> + >> + if (ctx->sq_data && ctx->sq_data->thread) >> + tctx = ctx->sq_data->thread->io_uring; > > without park it's racy, sq_data->thread may become NULL and removed, > as well as its ->io_uring. > >> + else >> + tctx = req->task->io_uring; >> >> BUG_ON(!tctx); >> BUG_ON(!tctx->io_wq); > > [snip] > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
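A sketch of the guard being suggested (illustrative, not a posted patch; the helper name is made up and the use of sqd->task_tgid is an assumption about the existing io_sq_data bookkeeping): one way to realize the same_thread_group() idea without dereferencing sqd->thread is to compare against the tgid recorded when the sqpoll setup was created, so a ring passed to a foreign process silently falls back to a plain wakeup instead of submitting from that alien context.

static bool io_deputy_submit_allowed(struct io_ring_ctx *ctx)
{
	struct io_sq_data *sqd = ctx->sq_data;

	if (!sqd || (ctx->flags & IORING_SETUP_IOPOLL))
		return false;
	/* task_tgid is recorded when the sqpoll data is set up */
	return sqd->task_tgid == task_tgid_nr(current);
}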
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:37 ` Pavel Begunkov @ 2021-04-29 4:37 ` Hao Xu 2021-04-29 9:28 ` Pavel Begunkov 0 siblings, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-04-29 4:37 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/4/28 下午10:37, Pavel Begunkov 写道: > On 4/28/21 3:34 PM, Pavel Begunkov wrote: >> On 4/28/21 2:32 PM, Hao Xu wrote: >>> sqes are submitted by sqthread when it is leveraged, which means there >>> is IO latency when waking up sqthread. To wipe it out, submit limited >>> number of sqes in the original task context. >>> Tests result below: >> >> Frankly, it can be a nest of corner cases if not now then in the future, >> leading to a high maintenance burden. Hence, if we consider the change, >> I'd rather want to limit the userspace exposure, so it can be removed >> if needed. >> >> A noticeable change of behaviour here, as Hao recently asked, is that >> the ring can be passed to a task from a completely another thread group, >> and so the feature would execute from that context, not from the >> original/sqpoll one. > > So maybe something like: > if (same_thread_group()) { > /* submit */ > }I thought this case(cross independent processes) for some time, Pavel, could you give more hints about how this may trigger errors? > >> >> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >> ignored if the previous point is addressed. > > I'd question whether it'd be better with the flag or without doing > this feature by default. Just like what Jens said, the flag here is to allow users to do their decision, there may be cases like a application wants to offload as much as possible IO related work to sqpoll, so that it can be dedicated to computation work etc. > >> >>> >>> 99th latency: >>> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >>> with this patch: >>> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >>> without this patch: >>> 2k 15 14 15 15 15 14 15 14 14 13 11.84 >> >> Not sure the second nine describes it well enough, please can you >> add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. Sure, I will. >> >> Btw, how happened that only some of the numbers have fractional part? >> Can't believe they all but 3 were close enough to integer values. This confused me a little bit too, but it is indeed what fio outputs. 
>> >>> fio config: >>> ./run_fio.sh >>> fio \ >>> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >>> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >>> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >>> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >>> --io_sq_thread_idle=${2} >>> >>> Signed-off-by: Hao Xu <[email protected]> >>> --- >>> fs/io_uring.c | 29 +++++++++++++++++++++++------ >>> include/uapi/linux/io_uring.h | 1 + >>> 2 files changed, 24 insertions(+), 6 deletions(-) >>> >>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>> index 1871fad48412..f0a01232671e 100644 >>> --- a/fs/io_uring.c >>> +++ b/fs/io_uring.c >>> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >>> { >>> struct io_ring_ctx *ctx = req->ctx; >>> struct io_kiocb *link = io_prep_linked_timeout(req); >>> - struct io_uring_task *tctx = req->task->io_uring; >>> + struct io_uring_task *tctx = NULL; >>> + >>> + if (ctx->sq_data && ctx->sq_data->thread) >>> + tctx = ctx->sq_data->thread->io_uring; >> >> without park it's racy, sq_data->thread may become NULL and removed, >> as well as its ->io_uring. >> >>> + else >>> + tctx = req->task->io_uring; >>> >>> BUG_ON(!tctx); >>> BUG_ON(!tctx->io_wq); >> >> [snip] >> > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-29 4:37 ` Hao Xu @ 2021-04-29 9:28 ` Pavel Begunkov 2021-05-05 11:20 ` Hao Xu 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-29 9:28 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/29/21 5:37 AM, Hao Xu wrote: > 在 2021/4/28 下午10:37, Pavel Begunkov 写道: >> On 4/28/21 3:34 PM, Pavel Begunkov wrote: >>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>> sqes are submitted by sqthread when it is leveraged, which means there >>>> is IO latency when waking up sqthread. To wipe it out, submit limited >>>> number of sqes in the original task context. >>>> Tests result below: >>> >>> Frankly, it can be a nest of corner cases if not now then in the future, >>> leading to a high maintenance burden. Hence, if we consider the change, >>> I'd rather want to limit the userspace exposure, so it can be removed >>> if needed. >>> >>> A noticeable change of behaviour here, as Hao recently asked, is that >>> the ring can be passed to a task from a completely another thread group, >>> and so the feature would execute from that context, not from the >>> original/sqpoll one. >> >> So maybe something like: >> if (same_thread_group()) { >> /* submit */ >> }I thought this case(cross independent processes) for some time, Pavel, > could you give more hints about how this may trigger errors? Currently? We need to audit cancellation, but don't think it's a problem. But as said, it's about the future. Your patch adds a new quirky userspace behaviour (submitting from alien context as described), and once commited it can't be removed and should be maintained. I can easily imagine it either over-complicating cancellations (and we had enough of troubles with it), or just preventing more important optimisations, or anything else often happening with new features. >> >>> >>> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >>> ignored if the previous point is addressed. >> >> I'd question whether it'd be better with the flag or without doing >> this feature by default. > Just like what Jens said, the flag here is to allow users to do their > decision, there may be cases like a application wants to offload as much > as possible IO related work to sqpoll, so that it can be dedicated to > computation work etc. >> >>> >>>> >>>> 99th latency: >>>> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >>>> with this patch: >>>> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >>>> without this patch: >>>> 2k 15 14 15 15 15 14 15 14 14 13 11.84 >>> >>> Not sure the second nine describes it well enough, please can you >>> add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. > Sure, I will. Forgot but it's important, should compared with non-sqpoll as well because the feature is taking the middle ground between them. >>> >>> Btw, how happened that only some of the numbers have fractional part? >>> Can't believe they all but 3 were close enough to integer values. > This confused me a little bit too, but it is indeed what fio outputs. That's just always when I see such, something tells me that data has been manipulated. Even if it's fio, it's really weird and suspicious, and worth to look what's wrong with it. 
>>> >>>> fio config: >>>> ./run_fio.sh >>>> fio \ >>>> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >>>> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >>>> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >>>> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >>>> --io_sq_thread_idle=${2} >>>> >>>> Signed-off-by: Hao Xu <[email protected]> >>>> --- >>>> fs/io_uring.c | 29 +++++++++++++++++++++++------ >>>> include/uapi/linux/io_uring.h | 1 + >>>> 2 files changed, 24 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>>> index 1871fad48412..f0a01232671e 100644 >>>> --- a/fs/io_uring.c >>>> +++ b/fs/io_uring.c >>>> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >>>> { >>>> struct io_ring_ctx *ctx = req->ctx; >>>> struct io_kiocb *link = io_prep_linked_timeout(req); >>>> - struct io_uring_task *tctx = req->task->io_uring; >>>> + struct io_uring_task *tctx = NULL; >>>> + >>>> + if (ctx->sq_data && ctx->sq_data->thread) >>>> + tctx = ctx->sq_data->thread->io_uring; >>> >>> without park it's racy, sq_data->thread may become NULL and removed, >>> as well as its ->io_uring. >>> >>>> + else >>>> + tctx = req->task->io_uring; >>>> BUG_ON(!tctx); >>>> BUG_ON(!tctx->io_wq); >>> >>> [snip] >>> >> > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-29 9:28 ` Pavel Begunkov @ 2021-05-05 11:20 ` Hao Xu 0 siblings, 0 replies; 40+ messages in thread From: Hao Xu @ 2021-05-05 11:20 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/4/29 下午5:28, Pavel Begunkov 写道: > On 4/29/21 5:37 AM, Hao Xu wrote: >> 在 2021/4/28 下午10:37, Pavel Begunkov 写道: >>> On 4/28/21 3:34 PM, Pavel Begunkov wrote: >>>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>>> sqes are submitted by sqthread when it is leveraged, which means there >>>>> is IO latency when waking up sqthread. To wipe it out, submit limited >>>>> number of sqes in the original task context. >>>>> Tests result below: >>>> >>>> Frankly, it can be a nest of corner cases if not now then in the future, >>>> leading to a high maintenance burden. Hence, if we consider the change, >>>> I'd rather want to limit the userspace exposure, so it can be removed >>>> if needed. >>>> >>>> A noticeable change of behaviour here, as Hao recently asked, is that >>>> the ring can be passed to a task from a completely another thread group, >>>> and so the feature would execute from that context, not from the >>>> original/sqpoll one. >>> >>> So maybe something like: >>> if (same_thread_group()) { >>> /* submit */ >>> }I thought this case(cross independent processes) for some time, Pavel, >> could you give more hints about how this may trigger errors? > > Currently? We need to audit cancellation, but don't think it's a problem. > > But as said, it's about the future. Your patch adds a new quirky > userspace behaviour (submitting from alien context as described), and > once commited it can't be removed and should be maintained. > > I can easily imagine it either over-complicating cancellations (and > we had enough of troubles with it), or just preventing more important > optimisations, or anything else often happening with new features. > > >>> >>>> >>>> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >>>> ignored if the previous point is addressed. >>> >>> I'd question whether it'd be better with the flag or without doing >>> this feature by default. >> Just like what Jens said, the flag here is to allow users to do their >> decision, there may be cases like a application wants to offload as much >> as possible IO related work to sqpoll, so that it can be dedicated to >> computation work etc. >>> >>>> >>>>> >>>>> 99th latency: >>>>> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >>>>> with this patch: >>>>> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >>>>> without this patch: >>>>> 2k 15 14 15 15 15 14 15 14 14 13 11.84 >>>> >>>> Not sure the second nine describes it well enough, please can you >>>> add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. >> Sure, I will. > > Forgot but it's important, should compared with non-sqpoll as well > because the feature is taking the middle ground between them. Hi Pavel, I found that the previous tests are not correct since when I did the test I enabled IOPOLL and in io_uring_enter(), I didn't add a if: if (IOPOLL disabled) submit at most 8 sqes in original context this optimization cannot run under SQPOLL+IOPOLL. So I did another tests with my new code, under 2k IOPS, against an Intel optane device. 
The fio command is: echo " [global] ioengine=io_uring sqthread_poll=1 thread=1 bs=4k direct=1 rw=randread time_based=1 runtime=160 group_reporting=1 filename=/dev/nvme1n1 sqthread_poll_cpu=30 randrepeat=0 [job0] cpus_allowed=35 iodepth=128 rate_iops=${1} io_sq_thread_idle=${2}" | ./fio/fio - the result is: ( 1. big numbers in 'th' are in ns, others in us, 2. for no sqpoll mode, sq_idle doesn't affect the metrics, so they're just repetitive tests. ) 2k+idle_submit_min8 metrics\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us min_lat 9 9 9 9 9 9 9 9 9 9 8 max_lat 2233 2802 3000 3518 3078 2052 2583 2914 4176 2255 3275 avg_lat 11.26 11.25 11.24 11.29 11.29 11.21 11.24 11.21 11.23 11.27 9.53 50th 7648 7648 7648 7648 7648 7648 7648 7584 7584 7648 9152 90th 7840 7840 7840 7840 7840 7840 7840 7776 7840 7904 9408 99th 10944 11200 11712 11328 12096 12352 11584 12096 11712 11072 12992 99.99th 73216 38144 56576 38656 38144 70144 40192 38144 40704 38144 39680 fio(usr) 3.66% 3.53% 3.80% 3.60% 3.41% 3.93% 3.55% 3.57% 3.58% 3.75% 3.82% fio(sys) 0.77% 0.70% 0.64% 0.65% 0.82% 0.71% 0.69% 0.69% 0.68% 0.72% 0.07% fio(usr+sys) 4.43% 4.23% 4.44% 4.25% 4.23% 4.64% 4.24% 4.26% 4.26% 4.47% 3.89% 2k+idle_without_submit_min8 metrics\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us min_lat 11 8 8 8 8 8 8 8 8 8 8 max_lat 12643 11703 11617 3050 8662 3402 5179 4307 2897 4204 3017 avg_lat 12.75 12.77 12.5 12.43 12.5 12.39 12.39 12.33 12.16 12.1 9.44 50th 11 11 11 11 11 11 11 11 11 11 9024 90th 11 11 11 11 11 11 11 11 11 11 9280 99th 18 17 15 16 16 16 16 16 15 15 13120 99.99th 930 1090 635 635 693 652 717 635 510 330 39680 fio(usr) 4.59% 3.29% 4.06% 4.10% 4.18% 4.30% 4.73% 4.12% 4.11% 3.83% 3.80% fio(sys) 0.39% 0.45% 0.46% 0.42% 0.34% 0.39% 0.37% 0.37% 0.33% 0.39% 0.07% fio(usr+sys) 4.98% 3.74% 4.52% 4.52% 4.52% 4.69% 5.10% 4.49% 4.44% 4.22% 3.87% 2k+without_sqpoll metrics\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us min_lat 8 8 8 7 8 8 8 8 8 8 8 max_lat 623 602 273 388 271 290 891 495 519 396 245 avg_lat 13.12 13.14 13.15 13.1 13.16 13.06 13.06 13.15 13.12 13.14 13.1 50th 10560 10560 10560 10560 10560 10432 10560 10560 10560 10560 10560 90th 10816 10816 10944 10816 10816 10816 10816 10944 10816 10944 10816 99th 14144 14656 14272 13888 14400 14016 14400 14272 14400 14528 14272 99.99th 42752 50432 43246 41216 56576 42240 43264 42240 47360 58624 42240 fio(usr) 2.31% 2.33% 2.81% 2.30% 2.34% 2.34% 2.12% 2.26% 2.09% 2.36% 2.34% fio(sys) 0.82% 0.79% 0.93% 0.81% 0.79% 0.77% 0.79% 0.86% 1.02% 0.76% 0.79% fio(usr+sys) 3.13% 3.12% 3.74% 3.11% 3.13% 3.11% 2.91% 3.12% 3.11% 3.12% 3.13% Cpu usage of sqthread hasn't been recorded, but I think the trending would be like the result in the previous email. From the data, the 'submit in original context' change reduce the avg latency and 'th' latency, meanwhile costs too much cpu usage of the original context. may be 'submit min(to_submit, 8)' is too much in 2k IOPS. I'll try other combination of IOPS and idle time. Thanks, Hao > >>>> >>>> Btw, how happened that only some of the numbers have fractional part? >>>> Can't believe they all but 3 were close enough to integer values. >> This confused me a little bit too, but it is indeed what fio outputs. > > That's just always when I see such, something tells me that data has > been manipulated. Even if it's fio, it's really weird and suspicious, > and worth to look what's wrong with it. Could you tell me what it is. 
> >>>> >>>>> fio config: >>>>> ./run_fio.sh >>>>> fio \ >>>>> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >>>>> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >>>>> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >>>>> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >>>>> --io_sq_thread_idle=${2} >>>>> >>>>> Signed-off-by: Hao Xu <[email protected]> >>>>> --- >>>>> fs/io_uring.c | 29 +++++++++++++++++++++++------ >>>>> include/uapi/linux/io_uring.h | 1 + >>>>> 2 files changed, 24 insertions(+), 6 deletions(-) >>>>> >>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>>>> index 1871fad48412..f0a01232671e 100644 >>>>> --- a/fs/io_uring.c >>>>> +++ b/fs/io_uring.c >>>>> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >>>>> { >>>>> struct io_ring_ctx *ctx = req->ctx; >>>>> struct io_kiocb *link = io_prep_linked_timeout(req); >>>>> - struct io_uring_task *tctx = req->task->io_uring; >>>>> + struct io_uring_task *tctx = NULL; >>>>> + >>>>> + if (ctx->sq_data && ctx->sq_data->thread) >>>>> + tctx = ctx->sq_data->thread->io_uring; >>>> >>>> without park it's racy, sq_data->thread may become NULL and removed, >>>> as well as its ->io_uring. >>>> >>>>> + else >>>>> + tctx = req->task->io_uring; >>>>> BUG_ON(!tctx); >>>>> BUG_ON(!tctx->io_wq); >>>> >>>> [snip] >>>> >>> >> > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:34 ` Pavel Begunkov 2021-04-28 14:37 ` Pavel Begunkov @ 2021-04-28 14:39 ` Jens Axboe 2021-04-28 14:50 ` Pavel Begunkov 2021-04-29 4:43 ` Hao Xu 2021-04-29 8:44 ` Hao Xu 2 siblings, 2 replies; 40+ messages in thread From: Jens Axboe @ 2021-04-28 14:39 UTC (permalink / raw) To: Pavel Begunkov, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 8:34 AM, Pavel Begunkov wrote: > On 4/28/21 2:32 PM, Hao Xu wrote: >> sqes are submitted by sqthread when it is leveraged, which means there >> is IO latency when waking up sqthread. To wipe it out, submit limited >> number of sqes in the original task context. >> Tests result below: > > Frankly, it can be a nest of corner cases if not now then in the future, > leading to a high maintenance burden. Hence, if we consider the change, > I'd rather want to limit the userspace exposure, so it can be removed > if needed. > > A noticeable change of behaviour here, as Hao recently asked, is that > the ring can be passed to a task from a completely another thread group, > and so the feature would execute from that context, not from the > original/sqpoll one. > > Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be > ignored if the previous point is addressed. I mostly agree on that. The problem I see is that for most use cases, the "submit from task itself if we need to enter the kernel" is perfectly fine, and would probably be preferable. But there are also uses cases that absolutely do not want to spend any extra cycles doing submit, they are isolating the submission to sqpoll exclusively and that is part of the win there. Based on that, I don't think it can be an automatic kind of feature. I do think the naming is kind of horrible. IORING_ENTER_SQ_SUBMIT_IDLE would likely be better, or maybe even more verbose as IORING_ENTER_SQ_SUBMIT_ON_IDLE. On top of that, I don't think an extra submit flag is a huge deal, I don't imagine we'll end up with a ton of them. In fact, two have been added related to sqpoll since the inception, out of the 3 total added flags. This is all independent of implementation detail and needed fixes to the patch. -- Jens Axboe ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:39 ` Jens Axboe @ 2021-04-28 14:50 ` Pavel Begunkov 2021-04-28 14:53 ` Jens Axboe 2021-04-29 4:43 ` Hao Xu 1 sibling, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-28 14:50 UTC (permalink / raw) To: Jens Axboe, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 3:39 PM, Jens Axboe wrote: > On 4/28/21 8:34 AM, Pavel Begunkov wrote: >> On 4/28/21 2:32 PM, Hao Xu wrote: >>> sqes are submitted by sqthread when it is leveraged, which means there >>> is IO latency when waking up sqthread. To wipe it out, submit limited >>> number of sqes in the original task context. >>> Tests result below: >> >> Frankly, it can be a nest of corner cases if not now then in the future, >> leading to a high maintenance burden. Hence, if we consider the change, >> I'd rather want to limit the userspace exposure, so it can be removed >> if needed. >> >> A noticeable change of behaviour here, as Hao recently asked, is that >> the ring can be passed to a task from a completely another thread group, >> and so the feature would execute from that context, not from the >> original/sqpoll one. >> >> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >> ignored if the previous point is addressed. > > I mostly agree on that. The problem I see is that for most use cases, > the "submit from task itself if we need to enter the kernel" is > perfectly fine, and would probably be preferable. But there are also > uses cases that absolutely do not want to spend any extra cycles doing > submit, they are isolating the submission to sqpoll exclusively and that > is part of the win there. Based on that, I don't think it can be an > automatic kind of feature. Reasonable. > I do think the naming is kind of horrible. IORING_ENTER_SQ_SUBMIT_IDLE > would likely be better, or maybe even more verbose as > IORING_ENTER_SQ_SUBMIT_ON_IDLE. > > On top of that, I don't think an extra submit flag is a huge deal, I > don't imagine we'll end up with a ton of them. In fact, two have been > added related to sqpoll since the inception, out of the 3 total added > flags. I don't care about the flag itself, nor about performance as it's nicely under the SQPOLL check, but I rather want to leave a way to ignore the feature if we would (ever) need to disable it, either with flag or without it. > This is all independent of implementation detail and needed fixes to the > patch. > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:50 ` Pavel Begunkov @ 2021-04-28 14:53 ` Jens Axboe 2021-04-28 14:56 ` Pavel Begunkov 0 siblings, 1 reply; 40+ messages in thread From: Jens Axboe @ 2021-04-28 14:53 UTC (permalink / raw) To: Pavel Begunkov, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 8:50 AM, Pavel Begunkov wrote: > On 4/28/21 3:39 PM, Jens Axboe wrote: >> On 4/28/21 8:34 AM, Pavel Begunkov wrote: >>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>> sqes are submitted by sqthread when it is leveraged, which means there >>>> is IO latency when waking up sqthread. To wipe it out, submit limited >>>> number of sqes in the original task context. >>>> Tests result below: >>> >>> Frankly, it can be a nest of corner cases if not now then in the future, >>> leading to a high maintenance burden. Hence, if we consider the change, >>> I'd rather want to limit the userspace exposure, so it can be removed >>> if needed. >>> >>> A noticeable change of behaviour here, as Hao recently asked, is that >>> the ring can be passed to a task from a completely another thread group, >>> and so the feature would execute from that context, not from the >>> original/sqpoll one. >>> >>> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >>> ignored if the previous point is addressed. >> >> I mostly agree on that. The problem I see is that for most use cases, >> the "submit from task itself if we need to enter the kernel" is >> perfectly fine, and would probably be preferable. But there are also >> uses cases that absolutely do not want to spend any extra cycles doing >> submit, they are isolating the submission to sqpoll exclusively and that >> is part of the win there. Based on that, I don't think it can be an >> automatic kind of feature. > > Reasonable. > >> I do think the naming is kind of horrible. IORING_ENTER_SQ_SUBMIT_IDLE >> would likely be better, or maybe even more verbose as >> IORING_ENTER_SQ_SUBMIT_ON_IDLE. >> >> On top of that, I don't think an extra submit flag is a huge deal, I >> don't imagine we'll end up with a ton of them. In fact, two have been >> added related to sqpoll since the inception, out of the 3 total added >> flags. > > I don't care about the flag itself, nor about performance as it's > nicely under the SQPOLL check, but I rather want to leave a way to > ignore the feature if we would (ever) need to disable it, either > with flag or without it. I think we just return -EINVAL for that case, just like we'd do now if you attempted to use the flag as we don't grok it. As it should be functionally equivalent if we do the submit inline or not, we could also argue that we simply ignore the flag if it isn't feasible to submit inline. -- Jens Axboe ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:53 ` Jens Axboe @ 2021-04-28 14:56 ` Pavel Begunkov 2021-04-28 15:09 ` Jens Axboe 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-28 14:56 UTC (permalink / raw) To: Jens Axboe, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 3:53 PM, Jens Axboe wrote: > On 4/28/21 8:50 AM, Pavel Begunkov wrote: >> On 4/28/21 3:39 PM, Jens Axboe wrote: >>> On 4/28/21 8:34 AM, Pavel Begunkov wrote: >>>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>>> sqes are submitted by sqthread when it is leveraged, which means there >>>>> is IO latency when waking up sqthread. To wipe it out, submit limited >>>>> number of sqes in the original task context. >>>>> Tests result below: >>>> >>>> Frankly, it can be a nest of corner cases if not now then in the future, >>>> leading to a high maintenance burden. Hence, if we consider the change, >>>> I'd rather want to limit the userspace exposure, so it can be removed >>>> if needed. >>>> >>>> A noticeable change of behaviour here, as Hao recently asked, is that >>>> the ring can be passed to a task from a completely another thread group, >>>> and so the feature would execute from that context, not from the >>>> original/sqpoll one. >>>> >>>> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >>>> ignored if the previous point is addressed. >>> >>> I mostly agree on that. The problem I see is that for most use cases, >>> the "submit from task itself if we need to enter the kernel" is >>> perfectly fine, and would probably be preferable. But there are also >>> uses cases that absolutely do not want to spend any extra cycles doing >>> submit, they are isolating the submission to sqpoll exclusively and that >>> is part of the win there. Based on that, I don't think it can be an >>> automatic kind of feature. >> >> Reasonable. >> >>> I do think the naming is kind of horrible. IORING_ENTER_SQ_SUBMIT_IDLE >>> would likely be better, or maybe even more verbose as >>> IORING_ENTER_SQ_SUBMIT_ON_IDLE. >>> >>> On top of that, I don't think an extra submit flag is a huge deal, I >>> don't imagine we'll end up with a ton of them. In fact, two have been >>> added related to sqpoll since the inception, out of the 3 total added >>> flags. >> >> I don't care about the flag itself, nor about performance as it's >> nicely under the SQPOLL check, but I rather want to leave a way to >> ignore the feature if we would (ever) need to disable it, either >> with flag or without it. > > I think we just return -EINVAL for that case, just like we'd do now if > you attempted to use the flag as we don't grok it. As it should be > functionally equivalent if we do the submit inline or not, we could also > argue that we simply ignore the flag if it isn't feasible to submit > inline. Yeah, no-brainer if we limit context to the original thread group, as I described in the first reply. -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:56 ` Pavel Begunkov @ 2021-04-28 15:09 ` Jens Axboe 0 siblings, 0 replies; 40+ messages in thread From: Jens Axboe @ 2021-04-28 15:09 UTC (permalink / raw) To: Pavel Begunkov, Hao Xu; +Cc: io-uring, Joseph Qi On 4/28/21 8:56 AM, Pavel Begunkov wrote: > On 4/28/21 3:53 PM, Jens Axboe wrote: >> On 4/28/21 8:50 AM, Pavel Begunkov wrote: >>> On 4/28/21 3:39 PM, Jens Axboe wrote: >>>> On 4/28/21 8:34 AM, Pavel Begunkov wrote: >>>>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>>>> sqes are submitted by sqthread when it is leveraged, which means there >>>>>> is IO latency when waking up sqthread. To wipe it out, submit limited >>>>>> number of sqes in the original task context. >>>>>> Tests result below: >>>>> >>>>> Frankly, it can be a nest of corner cases if not now then in the future, >>>>> leading to a high maintenance burden. Hence, if we consider the change, >>>>> I'd rather want to limit the userspace exposure, so it can be removed >>>>> if needed. >>>>> >>>>> A noticeable change of behaviour here, as Hao recently asked, is that >>>>> the ring can be passed to a task from a completely another thread group, >>>>> and so the feature would execute from that context, not from the >>>>> original/sqpoll one. >>>>> >>>>> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >>>>> ignored if the previous point is addressed. >>>> >>>> I mostly agree on that. The problem I see is that for most use cases, >>>> the "submit from task itself if we need to enter the kernel" is >>>> perfectly fine, and would probably be preferable. But there are also >>>> uses cases that absolutely do not want to spend any extra cycles doing >>>> submit, they are isolating the submission to sqpoll exclusively and that >>>> is part of the win there. Based on that, I don't think it can be an >>>> automatic kind of feature. >>> >>> Reasonable. >>> >>>> I do think the naming is kind of horrible. IORING_ENTER_SQ_SUBMIT_IDLE >>>> would likely be better, or maybe even more verbose as >>>> IORING_ENTER_SQ_SUBMIT_ON_IDLE. >>>> >>>> On top of that, I don't think an extra submit flag is a huge deal, I >>>> don't imagine we'll end up with a ton of them. In fact, two have been >>>> added related to sqpoll since the inception, out of the 3 total added >>>> flags. >>> >>> I don't care about the flag itself, nor about performance as it's >>> nicely under the SQPOLL check, but I rather want to leave a way to >>> ignore the feature if we would (ever) need to disable it, either >>> with flag or without it. >> >> I think we just return -EINVAL for that case, just like we'd do now if >> you attempted to use the flag as we don't grok it. As it should be >> functionally equivalent if we do the submit inline or not, we could also >> argue that we simply ignore the flag if it isn't feasible to submit >> inline. > > Yeah, no-brainer if we limit context to the original thread group, as > I described in the first reply. Yep, that's a requirement for any kind of sanity there. -- Jens Axboe ^ permalink raw reply [flat|nested] 40+ messages in thread
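On the userspace side, the -EINVAL behaviour discussed above gives a natural probe-and-fallback pattern; a sketch follows (illustrative, not liburing API; the helper name is made up):

#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/* Kernels that don't know IORING_ENTER_SQ_DEPUTY reject it with
 * -EINVAL, so probe once and retry with a plain wakeup. */
static int enter_wakeup_deputy(int ring_fd, unsigned int to_submit)
{
	static int have_deputy = 1;
	unsigned int flags = IORING_ENTER_SQ_WAKEUP;
	int ret;

	if (have_deputy)
		flags |= IORING_ENTER_SQ_DEPUTY;
	ret = syscall(__NR_io_uring_enter, ring_fd, to_submit, 0, flags,
		      NULL, 0);
	if (ret < 0 && errno == EINVAL && have_deputy) {
		have_deputy = 0;	/* old kernel: drop the flag */
		ret = syscall(__NR_io_uring_enter, ring_fd, to_submit, 0,
			      IORING_ENTER_SQ_WAKEUP, NULL, 0);
	}
	return ret;
}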
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:39 ` Jens Axboe 2021-04-28 14:50 ` Pavel Begunkov @ 2021-04-29 4:43 ` Hao Xu 1 sibling, 0 replies; 40+ messages in thread From: Hao Xu @ 2021-04-29 4:43 UTC (permalink / raw) To: Jens Axboe, Pavel Begunkov; +Cc: io-uring, Joseph Qi 在 2021/4/28 下午10:39, Jens Axboe 写道: > On 4/28/21 8:34 AM, Pavel Begunkov wrote: >> On 4/28/21 2:32 PM, Hao Xu wrote: >>> sqes are submitted by sqthread when it is leveraged, which means there >>> is IO latency when waking up sqthread. To wipe it out, submit limited >>> number of sqes in the original task context. >>> Tests result below: >> >> Frankly, it can be a nest of corner cases if not now then in the future, >> leading to a high maintenance burden. Hence, if we consider the change, >> I'd rather want to limit the userspace exposure, so it can be removed >> if needed. >> >> A noticeable change of behaviour here, as Hao recently asked, is that >> the ring can be passed to a task from a completely another thread group, >> and so the feature would execute from that context, not from the >> original/sqpoll one. >> >> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >> ignored if the previous point is addressed. > > I mostly agree on that. The problem I see is that for most use cases, > the "submit from task itself if we need to enter the kernel" is > perfectly fine, and would probably be preferable. But there are also > uses cases that absolutely do not want to spend any extra cycles doing > submit, they are isolating the submission to sqpoll exclusively and that > is part of the win there. Based on that, I don't think it can be an > automatic kind of feature. > > I do think the naming is kind of horrible. IORING_ENTER_SQ_SUBMIT_IDLE > would likely be better, or maybe even more verbose as > IORING_ENTER_SQ_SUBMIT_ON_IDLE. Agree 😂 > > On top of that, I don't think an extra submit flag is a huge deal, I > don't imagine we'll end up with a ton of them. In fact, two have been > added related to sqpoll since the inception, out of the 3 total added > flags. > > This is all independent of implementation detail and needed fixes to the > patch. > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 14:34 ` Pavel Begunkov 2021-04-28 14:37 ` Pavel Begunkov 2021-04-28 14:39 ` Jens Axboe @ 2021-04-29 8:44 ` Hao Xu 2021-04-29 22:10 ` Pavel Begunkov 2 siblings, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-04-29 8:44 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/4/28 下午10:34, Pavel Begunkov 写道: > On 4/28/21 2:32 PM, Hao Xu wrote: >> sqes are submitted by sqthread when it is leveraged, which means there >> is IO latency when waking up sqthread. To wipe it out, submit limited >> number of sqes in the original task context. >> Tests result below: > > Frankly, it can be a nest of corner cases if not now then in the future, > leading to a high maintenance burden. Hence, if we consider the change, > I'd rather want to limit the userspace exposure, so it can be removed > if needed. > > A noticeable change of behaviour here, as Hao recently asked, is that > the ring can be passed to a task from a completely another thread group, > and so the feature would execute from that context, not from the > original/sqpoll one. > > Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be > ignored if the previous point is addressed. > >> >> 99th latency: >> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >> with this patch: >> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >> without this patch: >> 2k 15 14 15 15 15 14 15 14 14 13 11.84 > > Not sure the second nine describes it well enough, please can you > add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. > > Btw, how happened that only some of the numbers have fractional part? > Can't believe they all but 3 were close enough to integer values. > >> fio config: >> ./run_fio.sh >> fio \ >> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >> --io_sq_thread_idle=${2} >> >> Signed-off-by: Hao Xu <[email protected]> >> --- >> fs/io_uring.c | 29 +++++++++++++++++++++++------ >> include/uapi/linux/io_uring.h | 1 + >> 2 files changed, 24 insertions(+), 6 deletions(-) >> >> diff --git a/fs/io_uring.c b/fs/io_uring.c >> index 1871fad48412..f0a01232671e 100644 >> --- a/fs/io_uring.c >> +++ b/fs/io_uring.c >> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >> { >> struct io_ring_ctx *ctx = req->ctx; >> struct io_kiocb *link = io_prep_linked_timeout(req); >> - struct io_uring_task *tctx = req->task->io_uring; >> + struct io_uring_task *tctx = NULL; >> + >> + if (ctx->sq_data && ctx->sq_data->thread) >> + tctx = ctx->sq_data->thread->io_uring; > > without park it's racy, sq_data->thread may become NULL and removed, > as well as its ->io_uring. I now think that it's ok to queue async work to req->task->io_uring. I look through all the OPs, seems only have to take care of async_cancel: io_async_cancel(req) { cancel from req->task->io_uring; cancel from ctx->tctx_list } Given req->task is 'original context', the req to be cancelled may in ctx->sq_data->thread->io_uring's iowq. So search the req from sqthread->io_uring is needed here. This avoids overload in main code path. Did I miss something else? 
> >> + else >> + tctx = req->task->io_uring; >> >> BUG_ON(!tctx); >> BUG_ON(!tctx->io_wq); > > [snip] > ^ permalink raw reply [flat|nested] 40+ messages in thread
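A sketch of the extra lookup discussed above (illustrative, not a posted patch; it assumes the 5.12-era io_async_cancel_one(tctx, user_data, ctx) helper and its signature): with deputy submission a request may sit in the sqpoll thread's io-wq, so async cancel would also have to probe that tctx, with the sqpoll thread parked so sqd->thread cannot disappear mid-lookup.

static int io_async_cancel_sqpoll(struct io_ring_ctx *ctx, u64 user_data)
{
	struct io_sq_data *sqd = ctx->sq_data;
	int ret = -ENOENT;

	if (!sqd)
		return ret;
	io_sq_thread_park(sqd);
	if (sqd->thread)
		ret = io_async_cancel_one(sqd->thread->io_uring,
					  user_data, ctx);
	io_sq_thread_unpark(sqd);
	return ret;
}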
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-29 8:44 ` Hao Xu @ 2021-04-29 22:10 ` Pavel Begunkov 2021-05-05 13:10 ` Hao Xu 0 siblings, 1 reply; 40+ messages in thread From: Pavel Begunkov @ 2021-04-29 22:10 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/29/21 9:44 AM, Hao Xu wrote: > 在 2021/4/28 下午10:34, Pavel Begunkov 写道: >> On 4/28/21 2:32 PM, Hao Xu wrote: >>> sqes are submitted by sqthread when it is leveraged, which means there >>> is IO latency when waking up sqthread. To wipe it out, submit limited >>> number of sqes in the original task context. >>> Tests result below: >> >> Frankly, it can be a nest of corner cases if not now then in the future, >> leading to a high maintenance burden. Hence, if we consider the change, >> I'd rather want to limit the userspace exposure, so it can be removed >> if needed. >> >> A noticeable change of behaviour here, as Hao recently asked, is that >> the ring can be passed to a task from a completely another thread group, >> and so the feature would execute from that context, not from the >> original/sqpoll one. >> >> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >> ignored if the previous point is addressed. >> >>> >>> 99th latency: >>> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >>> with this patch: >>> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >>> without this patch: >>> 2k 15 14 15 15 15 14 15 14 14 13 11.84 >> >> Not sure the second nine describes it well enough, please can you >> add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. >> >> Btw, how happened that only some of the numbers have fractional part? >> Can't believe they all but 3 were close enough to integer values. >> >>> fio config: >>> ./run_fio.sh >>> fio \ >>> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >>> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >>> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >>> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >>> --io_sq_thread_idle=${2} >>> >>> Signed-off-by: Hao Xu <[email protected]> >>> --- >>> fs/io_uring.c | 29 +++++++++++++++++++++++------ >>> include/uapi/linux/io_uring.h | 1 + >>> 2 files changed, 24 insertions(+), 6 deletions(-) >>> >>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>> index 1871fad48412..f0a01232671e 100644 >>> --- a/fs/io_uring.c >>> +++ b/fs/io_uring.c >>> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >>> { >>> struct io_ring_ctx *ctx = req->ctx; >>> struct io_kiocb *link = io_prep_linked_timeout(req); >>> - struct io_uring_task *tctx = req->task->io_uring; >>> + struct io_uring_task *tctx = NULL; >>> + >>> + if (ctx->sq_data && ctx->sq_data->thread) >>> + tctx = ctx->sq_data->thread->io_uring; >> >> without park it's racy, sq_data->thread may become NULL and removed, >> as well as its ->io_uring. > I now think that it's ok to queue async work to req->task->io_uring. I > look through all the OPs, seems only have to take care of async_cancel: > > io_async_cancel(req) { > cancel from req->task->io_uring; > cancel from ctx->tctx_list > } > > Given req->task is 'original context', the req to be cancelled may in > ctx->sq_data->thread->io_uring's iowq. So search the req from > sqthread->io_uring is needed here. This avoids overload in main code > path. > Did I miss something else? It must be req->task->io_uring, otherwise cancellations will be broken. 
And using it should be fine in theory because io-wq/etc. should be set up by io_uring_add_task_file() One more problem to the pile is io_req_task_work_add() and notify mode it choses. Look for IORING_SETUP_SQPOLL in the function. Also, IOPOLL+SQPOLL io_uring_try_cancel_requests() looks like may fail (didn't double check it). Look again for IORING_SETUP_SQPOLL. I'd rather recommend to go over all uses of IORING_SETUP_SQPOLL and think whether it's flawed. > > >> >>> + else >>> + tctx = req->task->io_uring; >>> BUG_ON(!tctx); >>> BUG_ON(!tctx->io_wq); >> >> [snip] >> > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-29 22:10 ` Pavel Begunkov @ 2021-05-05 13:10 ` Hao Xu 2021-05-05 17:44 ` Pavel Begunkov 0 siblings, 1 reply; 40+ messages in thread From: Hao Xu @ 2021-05-05 13:10 UTC (permalink / raw) To: Pavel Begunkov, Jens Axboe; +Cc: io-uring, Joseph Qi 在 2021/4/30 上午6:10, Pavel Begunkov 写道: > On 4/29/21 9:44 AM, Hao Xu wrote: >> 在 2021/4/28 下午10:34, Pavel Begunkov 写道: >>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>> sqes are submitted by sqthread when it is leveraged, which means there >>>> is IO latency when waking up sqthread. To wipe it out, submit limited >>>> number of sqes in the original task context. >>>> Tests result below: >>> >>> Frankly, it can be a nest of corner cases if not now then in the future, >>> leading to a high maintenance burden. Hence, if we consider the change, >>> I'd rather want to limit the userspace exposure, so it can be removed >>> if needed. >>> >>> A noticeable change of behaviour here, as Hao recently asked, is that >>> the ring can be passed to a task from a completely another thread group, >>> and so the feature would execute from that context, not from the >>> original/sqpoll one. >>> >>> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >>> ignored if the previous point is addressed. >>> >>>> >>>> 99th latency: >>>> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >>>> with this patch: >>>> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >>>> without this patch: >>>> 2k 15 14 15 15 15 14 15 14 14 13 11.84 >>> >>> Not sure the second nine describes it well enough, please can you >>> add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. >>> >>> Btw, how happened that only some of the numbers have fractional part? >>> Can't believe they all but 3 were close enough to integer values. >>> >>>> fio config: >>>> ./run_fio.sh >>>> fio \ >>>> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >>>> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >>>> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >>>> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >>>> --io_sq_thread_idle=${2} >>>> >>>> Signed-off-by: Hao Xu <[email protected]> >>>> --- >>>> fs/io_uring.c | 29 +++++++++++++++++++++++------ >>>> include/uapi/linux/io_uring.h | 1 + >>>> 2 files changed, 24 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>>> index 1871fad48412..f0a01232671e 100644 >>>> --- a/fs/io_uring.c >>>> +++ b/fs/io_uring.c >>>> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >>>> { >>>> struct io_ring_ctx *ctx = req->ctx; >>>> struct io_kiocb *link = io_prep_linked_timeout(req); >>>> - struct io_uring_task *tctx = req->task->io_uring; >>>> + struct io_uring_task *tctx = NULL; >>>> + >>>> + if (ctx->sq_data && ctx->sq_data->thread) >>>> + tctx = ctx->sq_data->thread->io_uring; >>> >>> without park it's racy, sq_data->thread may become NULL and removed, >>> as well as its ->io_uring. >> I now think that it's ok to queue async work to req->task->io_uring. I >> look through all the OPs, seems only have to take care of async_cancel: >> >> io_async_cancel(req) { >> cancel from req->task->io_uring; >> cancel from ctx->tctx_list >> } >> >> Given req->task is 'original context', the req to be cancelled may in >> ctx->sq_data->thread->io_uring's iowq. So search the req from >> sqthread->io_uring is needed here. 
This avoids overload in main code >> path. >> Did I miss something else? > > It must be req->task->io_uring, otherwise cancellations will > be broken. And using it should be fine in theory because io-wq/etc. > should be set up by io_uring_add_task_file() > > > One more problem to the pile is io_req_task_work_add() and notify > mode it choses. Look for IORING_SETUP_SQPOLL in the function. How about: notify = TWA_SIGNAL if ( (is sq mode) and (sqd->thread == NULL or == req->task)) notify = TWA_NONE; > > Also, IOPOLL+SQPOLL io_uring_try_cancel_requests() looks like > may fail (didn't double check it). Look again for IORING_SETUP_SQPOLL. > I've excluded IOPOLL. This change will only affect SQPOLL mode. > I'd rather recommend to go over all uses of IORING_SETUP_SQPOLL > and think whether it's flawed. I'm working on this. (No obvious problems spotted by inspection; I'll put the code change through more tests.) > >> >> >>> >>>> + else >>>> + tctx = req->task->io_uring; >>>> BUG_ON(!tctx); >>>> BUG_ON(!tctx->io_wq); >>> >>> [snip] >>> >> > ^ permalink raw reply [flat|nested] 40+ messages in thread
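A minimal sketch of the cancellation idea above, using hypothetical helper names (try_cancel_in_tctx() does not exist; the real io_async_cancel() path differs): if requests may sit on the sqpoll thread's io-wq, cancellation would have to search that context as well as the submitting task's. As Pavel points out earlier in the thread, dereferencing sq_data->thread without parking the sqpoll thread is racy, so a real version would need extra synchronisation.

/* Sketch only; try_cancel_in_tctx() is a hypothetical helper. */
static int io_try_cancel_everywhere(struct io_kiocb *req, u64 sqe_addr)
{
        struct io_ring_ctx *ctx = req->ctx;
        int ret;

        /* First search the submitting ("original") task's io-wq. */
        ret = try_cancel_in_tctx(req->task->io_uring, sqe_addr);
        if (ret != -ENOENT)
                return ret;

        /*
         * Then search the sqpoll thread's io-wq, if any. Racy without
         * parking the sqpoll thread, as noted in this thread.
         */
        if (ctx->sq_data && ctx->sq_data->thread)
                ret = try_cancel_in_tctx(ctx->sq_data->thread->io_uring, sqe_addr);
        return ret;
}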
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-05-05 13:10 ` Hao Xu @ 2021-05-05 17:44 ` Pavel Begunkov 0 siblings, 0 replies; 40+ messages in thread From: Pavel Begunkov @ 2021-05-05 17:44 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 5/5/21 2:10 PM, Hao Xu wrote: > 在 2021/4/30 上午6:10, Pavel Begunkov 写道: >> On 4/29/21 9:44 AM, Hao Xu wrote: >>> 在 2021/4/28 下午10:34, Pavel Begunkov 写道: >>>> On 4/28/21 2:32 PM, Hao Xu wrote: >>>>> sqes are submitted by sqthread when it is leveraged, which means there >>>>> is IO latency when waking up sqthread. To wipe it out, submit limited >>>>> number of sqes in the original task context. >>>>> Tests result below: >>>> >>>> Frankly, it can be a nest of corner cases if not now then in the future, >>>> leading to a high maintenance burden. Hence, if we consider the change, >>>> I'd rather want to limit the userspace exposure, so it can be removed >>>> if needed. >>>> >>>> A noticeable change of behaviour here, as Hao recently asked, is that >>>> the ring can be passed to a task from a completely another thread group, >>>> and so the feature would execute from that context, not from the >>>> original/sqpoll one. >>>> >>>> Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be >>>> ignored if the previous point is addressed. >>>> >>>>> >>>>> 99th latency: >>>>> iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us >>>>> with this patch: >>>>> 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 >>>>> without this patch: >>>>> 2k 15 14 15 15 15 14 15 14 14 13 11.84 >>>> >>>> Not sure the second nine describes it well enough, please can you >>>> add more data? Mean latency, 50%, 90%, 99%, 99.9%, t-put. >>>> >>>> Btw, how happened that only some of the numbers have fractional part? >>>> Can't believe they all but 3 were close enough to integer values. >>>> >>>>> fio config: >>>>> ./run_fio.sh >>>>> fio \ >>>>> --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ >>>>> --direct=1 --rw=randread --time_based=1 --runtime=300 \ >>>>> --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ >>>>> --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ >>>>> --io_sq_thread_idle=${2} >>>>> >>>>> Signed-off-by: Hao Xu <[email protected]> >>>>> --- >>>>> fs/io_uring.c | 29 +++++++++++++++++++++++------ >>>>> include/uapi/linux/io_uring.h | 1 + >>>>> 2 files changed, 24 insertions(+), 6 deletions(-) >>>>> >>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>>>> index 1871fad48412..f0a01232671e 100644 >>>>> --- a/fs/io_uring.c >>>>> +++ b/fs/io_uring.c >>>>> @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) >>>>> { >>>>> struct io_ring_ctx *ctx = req->ctx; >>>>> struct io_kiocb *link = io_prep_linked_timeout(req); >>>>> - struct io_uring_task *tctx = req->task->io_uring; >>>>> + struct io_uring_task *tctx = NULL; >>>>> + >>>>> + if (ctx->sq_data && ctx->sq_data->thread) >>>>> + tctx = ctx->sq_data->thread->io_uring; >>>> >>>> without park it's racy, sq_data->thread may become NULL and removed, >>>> as well as its ->io_uring. >>> I now think that it's ok to queue async work to req->task->io_uring. I >>> look through all the OPs, seems only have to take care of async_cancel: >>> >>> io_async_cancel(req) { >>> cancel from req->task->io_uring; >>> cancel from ctx->tctx_list >>> } >>> >>> Given req->task is 'original context', the req to be cancelled may in >>> ctx->sq_data->thread->io_uring's iowq. 
So search the req from >>> sqthread->io_uring is needed here. This avoids overload in main code >>> path. >>> Did I miss something else? >> >> It must be req->task->io_uring, otherwise cancellations will >> be broken. And using it should be fine in theory because io-wq/etc. >> should be set up by io_uring_add_task_file() >> >> >> One more problem to the pile is io_req_task_work_add() and notify >> mode it choses. Look for IORING_SETUP_SQPOLL in the function. > How about: > notify = TWA_SIGNAL > if ( (is sq mode) and (sqd->thread == NULL or == req->task)) > notify = TWA_NONE; notify = (sqd && tsk == sqd->thread) ? TWA_NONE : TWA_SIGNAL; Like that? Should work >> Also, IOPOLL+SQPOLL io_uring_try_cancel_requests() looks like >> may fail (didn't double check it). Look again for IORING_SETUP_SQPOLL. >> > I've excluded IOPOLL. This change will only affect SQPOLL mode. >> I'd rather recommend to go over all uses of IORING_SETUP_SQPOLL >> and think whether it's flawed. > I'm working on this. (no obvious problem through eyes, will put the code > change on more tests) >> >>> >>> >>>> >>>>> + else >>>>> + tctx = req->task->io_uring; >>>>> BUG_ON(!tctx); >>>>> BUG_ON(!tctx->io_wq); >>>> >>>> [snip] >>>> >>> >> > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
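Spelling out the selection Pavel sketches above, assuming tsk is the task the work is queued to, sqd is ctx->sq_data, and req/ret are in scope as they would be inside io_req_task_work_add():

        /*
         * Sketch only: skip signal-based notification only when the target
         * really is the sqpoll thread; any "deputy" task still gets
         * TWA_SIGNAL so its task_work is guaranteed to run.
         */
        struct io_sq_data *sqd = ctx->sq_data;
        enum task_work_notify_mode notify;

        notify = (sqd && tsk == sqd->thread) ? TWA_NONE : TWA_SIGNAL;
        ret = task_work_add(tsk, &req->task_work, notify);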
* Re: [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread 2021-04-28 13:32 ` [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread Hao Xu 2021-04-28 14:12 ` Jens Axboe 2021-04-28 14:34 ` Pavel Begunkov @ 2021-04-29 22:02 ` Pavel Begunkov 2 siblings, 0 replies; 40+ messages in thread From: Pavel Begunkov @ 2021-04-29 22:02 UTC (permalink / raw) To: Hao Xu, Jens Axboe; +Cc: io-uring, Joseph Qi On 4/28/21 2:32 PM, Hao Xu wrote: > sqes are submitted by sqthread when it is leveraged, which means there > is IO latency when waking up sqthread. To wipe it out, submit limited > number of sqes in the original task context. > Tests result below: Synchronising needs some patching in case of IOPOLL, as io_iopoll_req_issued() won't wake up sqpoll task. Added in this patch wake_up() doesn't help with that because it's placed before submission > > 99th latency: > iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us > with this patch: > 2k 13 13 12 13 13 12 12 11 11 10.304 11.84 > without this patch: > 2k 15 14 15 15 15 14 15 14 14 13 11.84 > > fio config: > ./run_fio.sh > fio \ > --ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \ > --direct=1 --rw=randread --time_based=1 --runtime=300 \ > --group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \ > --randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \ > --io_sq_thread_idle=${2} > > Signed-off-by: Hao Xu <[email protected]> > --- > fs/io_uring.c | 29 +++++++++++++++++++++++------ > include/uapi/linux/io_uring.h | 1 + > 2 files changed, 24 insertions(+), 6 deletions(-) > > diff --git a/fs/io_uring.c b/fs/io_uring.c > index 1871fad48412..f0a01232671e 100644 > --- a/fs/io_uring.c > +++ b/fs/io_uring.c > @@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req) > { > struct io_ring_ctx *ctx = req->ctx; > struct io_kiocb *link = io_prep_linked_timeout(req); > - struct io_uring_task *tctx = req->task->io_uring; > + struct io_uring_task *tctx = NULL; > + > + if (ctx->sq_data && ctx->sq_data->thread) > + tctx = ctx->sq_data->thread->io_uring; > + else > + tctx = req->task->io_uring; > > BUG_ON(!tctx); > BUG_ON(!tctx->io_wq); > @@ -9063,9 +9068,10 @@ static void io_uring_try_cancel(struct files_struct *files) > xa_for_each(&tctx->xa, index, node) { > struct io_ring_ctx *ctx = node->ctx; > > - /* sqpoll task will cancel all its requests */ > - if (!ctx->sq_data) > - io_uring_try_cancel_requests(ctx, current, files); > + /* > + * for sqpoll ctx, there may be requests in task_works etc. 
> + */ > + io_uring_try_cancel_requests(ctx, current, files); > } > } > > @@ -9271,7 +9277,8 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz > io_run_task_work(); > > if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP | > - IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG))) > + IORING_ENTER_SQ_WAIT | IORING_ENTER_EXT_ARG | > + IORING_ENTER_SQ_DEPUTY))) > return -EINVAL; > > f = fdget(fd); > @@ -9304,8 +9311,18 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz > if (unlikely(ctx->sq_data->thread == NULL)) { > goto out; > } > - if (flags & IORING_ENTER_SQ_WAKEUP) > + if (flags & IORING_ENTER_SQ_WAKEUP) { > wake_up(&ctx->sq_data->wait); > + if ((flags & IORING_ENTER_SQ_DEPUTY) && > + !(ctx->flags & IORING_SETUP_IOPOLL)) { > + ret = io_uring_add_task_file(ctx); > + if (unlikely(ret)) > + goto out; > + mutex_lock(&ctx->uring_lock); > + io_submit_sqes(ctx, min(to_submit, 8U)); > + mutex_unlock(&ctx->uring_lock); > + } > + } > if (flags & IORING_ENTER_SQ_WAIT) { > ret = io_sqpoll_wait_sq(ctx); > if (ret) > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h > index 311532ff6ce3..b1130fec2b7d 100644 > --- a/include/uapi/linux/io_uring.h > +++ b/include/uapi/linux/io_uring.h > @@ -251,6 +251,7 @@ struct io_cqring_offsets { > #define IORING_ENTER_SQ_WAKEUP (1U << 1) > #define IORING_ENTER_SQ_WAIT (1U << 2) > #define IORING_ENTER_EXT_ARG (1U << 3) > +#define IORING_ENTER_SQ_DEPUTY (1U << 4) > > /* > * Passed in for io_uring_setup(2). Copied back with updated info on success > -- Pavel Begunkov ^ permalink raw reply [flat|nested] 40+ messages in thread
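To make the proposed interface concrete, here is a hedged userspace sketch of how an application might use the flag if this RFC were applied. IORING_ENTER_SQ_DEPUTY is only the value proposed in this patch ((1U << 4)), not a mainline io_uring flag; when set alongside IORING_ENTER_SQ_WAKEUP, the patched kernel submits up to 8 SQEs inline in the caller's context while waking the sqpoll thread. In practice the application would only take this path when the SQ ring flags report IORING_SQ_NEED_WAKEUP.

/* Userspace sketch, assuming ring_fd was set up with IORING_SETUP_SQPOLL. */
#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef IORING_ENTER_SQ_DEPUTY
#define IORING_ENTER_SQ_DEPUTY (1U << 4)   /* value proposed in this patch */
#endif

static int wake_sqthread_with_deputy(int ring_fd, unsigned int to_submit)
{
        /*
         * Wake the sqpoll thread and ask the (patched) kernel to submit a
         * few SQEs inline in this task's context to cut wakeup latency.
         */
        return syscall(__NR_io_uring_enter, ring_fd, to_submit, 0,
                       IORING_ENTER_SQ_WAKEUP | IORING_ENTER_SQ_DEPUTY,
                       NULL, 0);
}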
end of thread. Thread overview: 40+ messages -- 2021-04-28 13:32 [PATCH RFC 5.13 0/2] adaptive sqpoll and its wakeup optimization Hao Xu 2021-04-28 13:32 ` [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle Hao Xu 2021-04-28 14:07 ` Pavel Begunkov 2021-04-28 14:16 ` Jens Axboe 2021-04-28 14:53 ` Pavel Begunkov 2021-04-28 14:54 ` Jens Axboe 2021-04-29 3:41 ` Hao Xu 2021-04-29 9:11 ` Pavel Begunkov 2021-05-05 14:07 ` Hao Xu 2021-05-05 17:40 ` Pavel Begunkov 2021-04-29 3:28 ` Hao Xu 2021-04-29 22:15 ` Pavel Begunkov 2021-09-26 10:00 ` Hao Xu 2021-09-28 10:51 ` Pavel Begunkov 2021-09-29 7:52 ` Hao Xu 2021-09-29 9:24 ` Hao Xu 2021-09-29 11:37 ` Pavel Begunkov 2021-09-29 12:13 ` Hao Xu 2021-09-30 8:51 ` Pavel Begunkov 2021-09-30 12:04 ` Pavel Begunkov 2021-10-05 15:00 ` Hao Xu 2021-04-28 13:32 ` [PATCH RFC 5.13 2/2] io_uring: submit sqes in the original context when waking up sqthread Hao Xu 2021-04-28 14:12 ` Jens Axboe 2021-04-29 4:12 ` Hao Xu 2021-04-28 14:34 ` Pavel Begunkov 2021-04-28 14:37 ` Pavel Begunkov 2021-04-29 4:37 ` Hao Xu 2021-04-29 9:28 ` Pavel Begunkov 2021-05-05 11:20 ` Hao Xu 2021-04-28 14:39 ` Jens Axboe 2021-04-28 14:50 ` Pavel Begunkov 2021-04-28 14:53 ` Jens Axboe 2021-04-28 14:56 ` Pavel Begunkov 2021-04-28 15:09 ` Jens Axboe 2021-04-29 4:43 ` Hao Xu 2021-04-29 8:44 ` Hao Xu 2021-04-29 22:10 ` Pavel Begunkov 2021-05-05 13:10 ` Hao Xu 2021-05-05 17:44 ` Pavel Begunkov 2021-04-29 22:02 ` Pavel Begunkov