From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 370FCC433E0 for ; Tue, 7 Jul 2020 16:21:19 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0829420675 for ; Tue, 7 Jul 2020 16:21:19 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=kernel-dk.20150623.gappssmtp.com header.i=@kernel-dk.20150623.gappssmtp.com header.b="NPb157AW" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727936AbgGGQVS (ORCPT ); Tue, 7 Jul 2020 12:21:18 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41626 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727886AbgGGQVS (ORCPT ); Tue, 7 Jul 2020 12:21:18 -0400 Received: from mail-il1-x141.google.com (mail-il1-x141.google.com [IPv6:2607:f8b0:4864:20::141]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 622BEC061755 for ; Tue, 7 Jul 2020 09:21:18 -0700 (PDT) Received: by mail-il1-x141.google.com with SMTP id t27so31548551ill.9 for ; Tue, 07 Jul 2020 09:21:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel-dk.20150623.gappssmtp.com; s=20150623; h=subject:from:to:cc:references:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=ylXLDPwjFwmNUX021iadtYVQcvfBwbfAN12x96EMekU=; b=NPb157AWz4yddSc2cQeFATh1/6cybV/M0QCmEGk0NxwpuHaVc5hllxortFUJd74Vcn iRPSimRdxYRRCaake0SrQs6UQw2jGRPN30vnoLaZ5S5HiXhwpOe+UCyBvdn5qk2a7Bz0 04VzLnlHM713YBqYzXmg4aPVa3pUMfxUm8fsu2nqDA9qY3ujRtNbZd+3tK7djhfH9rYS kERRAQjAjQ4qOj4fxGX8ttzS6GBj/IpsjBElPYmFMYsCN3pBV8+kVJ1pmRjS4IQ8RigW zAF2MEEDgpWpkrO5avoTIP55zh40zWcXE5vihJOPXut3d9puKFimAcMy85IIJxej//N/ hWCQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:from:to:cc:references:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=ylXLDPwjFwmNUX021iadtYVQcvfBwbfAN12x96EMekU=; b=tqfHfezGwrYDCPCUMK2Enb66f2QWjNREOZK9rq/ClPbpMkbRx5f6dH6faz7NEzAroo hqLTOpByFZlyOafixxVQ3XHpTaUOG2lpXy3LhP3f29RBBMSbhYeZJP070KzVpFTAGJ1Y +21bx7a46vA7+WKeNRA2lqdjI6vvmCnqrqOHULUCJ7sSUnIuv2tUzs6yNe8U6StFN8wz YS9JfBJpowdSY2lGCmGaFEhxZxgYl7crZbTkB/HFP6DwCWP+MaATXs4Y8ddfR7hns4VF d7XT7U1N3g1+h6+kgCsMvHNj8Z2jJc8xeT+k+4Fqm+TXdG26YO6uVEQgeLipePAJzdbK wcNw== X-Gm-Message-State: AOAM5315W1NRxTpdgnHTRhBFb6RJfo74itAKYQZs4qMMEXqT4ziq9KVl m1qll1sx9+lEomAlfjF0PV2RQQi2yYMGPQ== X-Google-Smtp-Source: ABdhPJz+o+5s4fpFpyd4V3BzCzj0BfD1OfQjpW7MIYidUlHO+gl4nK4YdEgo0YWjzYwD8VfEMZi9Xw== X-Received: by 2002:a05:6e02:46:: with SMTP id i6mr35283991ilr.91.1594138876538; Tue, 07 Jul 2020 09:21:16 -0700 (PDT) Received: from [192.168.1.58] ([65.144.74.34]) by smtp.gmail.com with ESMTPSA id t14sm13642133ilk.17.2020.07.07.09.21.15 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 07 Jul 2020 09:21:15 -0700 (PDT) Subject: Re: [PATCH] io_uring: export cq overflow status to userspace From: Jens Axboe To: Xiaoguang Wang , io-uring@vger.kernel.org Cc: joseph.qi@linux.alibaba.com References: <20200707132420.2007-1-xiaoguang.wang@linux.alibaba.com> <0ebded37-3660-e3c0-aa51-d3d7e56d634c@kernel.dk> Message-ID: Date: Tue, 7 Jul 2020 10:21:14 -0600 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.8.0 MIME-Version: 1.0 In-Reply-To: <0ebded37-3660-e3c0-aa51-d3d7e56d634c@kernel.dk> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: io-uring-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org On 7/7/20 8:28 AM, Jens Axboe wrote: > On 7/7/20 7:24 AM, Xiaoguang Wang wrote: >> For those applications which are not willing to use io_uring_enter() >> to reap and handle cqes, they may completely rely on liburing's >> io_uring_peek_cqe(), but if cq ring has overflowed, currently because >> io_uring_peek_cqe() is not aware of this overflow, it won't enter >> kernel to flush cqes, below test program can reveal this bug: >> >> static void test_cq_overflow(struct io_uring *ring) >> { >> struct io_uring_cqe *cqe; >> struct io_uring_sqe *sqe; >> int issued = 0; >> int ret = 0; >> >> do { >> sqe = io_uring_get_sqe(ring); >> if (!sqe) { >> fprintf(stderr, "get sqe failed\n"); >> break;; >> } >> ret = io_uring_submit(ring); >> if (ret <= 0) { >> if (ret != -EBUSY) >> fprintf(stderr, "sqe submit failed: %d\n", ret); >> break; >> } >> issued++; >> } while (ret > 0); >> assert(ret == -EBUSY); >> >> printf("issued requests: %d\n", issued); >> >> while (issued) { >> ret = io_uring_peek_cqe(ring, &cqe); >> if (ret) { >> if (ret != -EAGAIN) { >> fprintf(stderr, "peek completion failed: %s\n", >> strerror(ret)); >> break; >> } >> printf("left requets: %d\n", issued); >> continue; >> } >> io_uring_cqe_seen(ring, cqe); >> issued--; >> printf("left requets: %d\n", issued); >> } >> } >> >> int main(int argc, char *argv[]) >> { >> int ret; >> struct io_uring ring; >> >> ret = io_uring_queue_init(16, &ring, 0); >> if (ret) { >> fprintf(stderr, "ring setup failed: %d\n", ret); >> return 1; >> } >> >> test_cq_overflow(&ring); >> return 0; >> } >> >> To fix this issue, export cq overflow status to userspace, then >> helper functions() in liburing, such as io_uring_peek_cqe, can be >> aware of this cq overflow and do flush accordingly. > > Is there any way we can accomplish the same without exporting > another set of flags? Would it be enough for the SQPOLl thread to set > IORING_SQ_NEED_WAKEUP if we're in overflow condition? That should > result in the app entering the kernel when it's flushed the user CQ > side, and then the sqthread could attempt to flush the pending > events as well. > > Something like this, totally untested... OK, took a closer look at this, it's a generic thing, not just SQPOLL related. My bad! Anyway, my suggestion would be to add IORING_SQ_CQ_OVERFLOW to the existing flags, and then make a liburing change almost identical to what you had. Hence kernel side: diff --git a/fs/io_uring.c b/fs/io_uring.c index d37d7ea5ebe5..af9fd5cefc51 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -1234,11 +1234,12 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) struct io_uring_cqe *cqe; struct io_kiocb *req; unsigned long flags; + bool ret = true; LIST_HEAD(list); if (!force) { if (list_empty_careful(&ctx->cq_overflow_list)) - return true; + goto done; if ((ctx->cached_cq_tail - READ_ONCE(rings->cq.head) == rings->cq_ring_entries)) return false; @@ -1284,7 +1285,11 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force) io_put_req(req); } - return cqe != NULL; + ret = cqe != NULL; +done: + if (ret) + ctx->rings->sq_flags &= ~IORING_SQ_CQ_OVERFLOW; + return ret; } static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags) @@ -5933,10 +5938,13 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr, int i, submitted = 0; /* if we have a backlog and couldn't flush it all, return BUSY */ - if (test_bit(0, &ctx->sq_check_overflow)) { + if (unlikely(test_bit(0, &ctx->sq_check_overflow))) { if (!list_empty(&ctx->cq_overflow_list) && - !io_cqring_overflow_flush(ctx, false)) + !io_cqring_overflow_flush(ctx, false)) { + ctx->rings->sq_flags |= IORING_SQ_CQ_OVERFLOW; + smp_mb(); return -EBUSY; + } } /* make sure SQ entry isn't read before tail */ diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 92c22699a5a7..9c7e028beda5 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -197,6 +197,7 @@ struct io_sqring_offsets { * sq_ring->flags */ #define IORING_SQ_NEED_WAKEUP (1U << 0) /* needs io_uring_enter wakeup */ +#define IORING_SQ_CQ_OVERFLOW (1U << 1) /* app needs to enter kernel */ struct io_cqring_offsets { __u32 head; and then this for the liburing side: diff --git a/src/include/liburing/io_uring.h b/src/include/liburing/io_uring.h index 6a73522..e4314ed 100644 --- a/src/include/liburing/io_uring.h +++ b/src/include/liburing/io_uring.h @@ -202,6 +202,7 @@ struct io_sqring_offsets { * sq_ring->flags */ #define IORING_SQ_NEED_WAKEUP (1U << 0) /* needs io_uring_enter wakeup */ +#define IORING_SQ_CQ_OVERFLOW (1U << 1) struct io_cqring_offsets { __u32 head; diff --git a/src/queue.c b/src/queue.c index 88e0294..1f00251 100644 --- a/src/queue.c +++ b/src/queue.c @@ -32,6 +32,11 @@ static inline bool sq_ring_needs_enter(struct io_uring *ring, return false; } +static inline bool cq_ring_needs_flush(struct io_uring *ring) +{ + return IO_URING_READ_ONCE(*ring->sq.kflags) & IORING_SQ_CQ_OVERFLOW; +} + static int __io_uring_peek_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr) { @@ -67,22 +72,26 @@ int __io_uring_get_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr, int ret = 0, err; do { + bool cq_overflow_flush = false; unsigned flags = 0; err = __io_uring_peek_cqe(ring, &cqe); if (err) break; if (!cqe && !to_wait && !submit) { - err = -EAGAIN; - break; + if (!cq_ring_needs_flush(ring)) { + err = -EAGAIN; + break; + } + cq_overflow_flush = true; } if (wait_nr && cqe) wait_nr--; - if (wait_nr) + if (wait_nr || cq_overflow_flush) flags = IORING_ENTER_GETEVENTS; if (submit) sq_ring_needs_enter(ring, submit, &flags); - if (wait_nr || submit) + if (wait_nr || submit || cq_overflow_flush) ret = __sys_io_uring_enter(ring->ring_fd, submit, wait_nr, flags, sigmask); if (ret < 0) { If you agree with this approach, could you test this and resubmit the two patches? -- Jens Axboe