Subject: Re: [PATCH] io_uring: export cq overflow status to userspace
From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
To: Jens Axboe
Cc: io-uring@vger.kernel.org, joseph.qi@linux.alibaba.com
Date: Wed, 8 Jul 2020 13:29:35 +0800
Message-ID: <83a3a0eb-8dea-20ad-ffc5-619226b47998@linux.alibaba.com>

hi,

>
>> On Jul 7, 2020, at 9:25 PM, Xiaoguang Wang wrote:
>>
>> hi,
>>
>>> On 7/7/20 8:28 AM, Jens Axboe wrote:
>>>> On 7/7/20 7:24 AM, Xiaoguang Wang wrote:
>>>>> Applications that are not willing to use io_uring_enter() to reap
>>>>> and handle cqes may rely completely on liburing's
>>>>> io_uring_peek_cqe(), but if the cq ring has overflowed,
>>>>> io_uring_peek_cqe() is currently not aware of the overflow, so it
>>>>> won't enter the kernel to flush cqes. The test program below can
>>>>> reveal this bug:
>>>>>
>>>>> static void test_cq_overflow(struct io_uring *ring)
>>>>> {
>>>>>         struct io_uring_cqe *cqe;
>>>>>         struct io_uring_sqe *sqe;
>>>>>         int issued = 0;
>>>>>         int ret = 0;
>>>>>
>>>>>         do {
>>>>>                 sqe = io_uring_get_sqe(ring);
>>>>>                 if (!sqe) {
>>>>>                         fprintf(stderr, "get sqe failed\n");
>>>>>                         break;
>>>>>                 }
>>>>>                 ret = io_uring_submit(ring);
>>>>>                 if (ret <= 0) {
>>>>>                         if (ret != -EBUSY)
>>>>>                                 fprintf(stderr, "sqe submit failed: %d\n", ret);
>>>>>                         break;
>>>>>                 }
>>>>>                 issued++;
>>>>>         } while (ret > 0);
>>>>>         assert(ret == -EBUSY);
>>>>>
>>>>>         printf("issued requests: %d\n", issued);
>>>>>
>>>>>         while (issued) {
>>>>>                 ret = io_uring_peek_cqe(ring, &cqe);
>>>>>                 if (ret) {
>>>>>                         if (ret != -EAGAIN) {
>>>>>                                 fprintf(stderr, "peek completion failed: %s\n",
>>>>>                                         strerror(ret));
>>>>>                                 break;
>>>>>                         }
>>>>>                         printf("left requests: %d\n", issued);
>>>>>                         continue;
>>>>>                 }
>>>>>                 io_uring_cqe_seen(ring, cqe);
>>>>>                 issued--;
>>>>>                 printf("left requests: %d\n", issued);
>>>>>         }
>>>>> }
>>>>>
>>>>> int main(int argc, char *argv[])
>>>>> {
>>>>>         int ret;
>>>>>         struct io_uring ring;
>>>>>
>>>>>         ret = io_uring_queue_init(16, &ring, 0);
>>>>>         if (ret) {
>>>>>                 fprintf(stderr, "ring setup failed: %d\n", ret);
>>>>>                 return 1;
>>>>>         }
>>>>>
>>>>>         test_cq_overflow(&ring);
>>>>>         return 0;
>>>>> }
>>>>>
>>>>> To fix this issue, export the cq overflow status to userspace;
>>>>> helper functions in liburing, such as io_uring_peek_cqe(), can then
>>>>> be aware of the cq overflow and flush accordingly.
>>>>
>>>> Is there any way we can accomplish the same without exporting
>>>> another set of flags? Would it be enough for the SQPOLL thread to set
>>>> IORING_SQ_NEED_WAKEUP if we're in an overflow condition? That should
>>>> result in the app entering the kernel when it's flushed the user CQ
>>>> side, and then the sqthread could attempt to flush the pending
>>>> events as well.
>>>>
>>>> Something like this, totally untested...
>>>
>>> OK, took a closer look at this, it's a generic thing, not just
>>> SQPOLL related. My bad!
>>>
>>> Anyway, my suggestion would be to add IORING_SQ_CQ_OVERFLOW to the
>>> existing flags, and then make a liburing change almost identical to
>>> what you had.
>>>
>>> Hence kernel side:
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index d37d7ea5ebe5..af9fd5cefc51 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -1234,11 +1234,12 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
>>>  	struct io_uring_cqe *cqe;
>>>  	struct io_kiocb *req;
>>>  	unsigned long flags;
>>> +	bool ret = true;
>>>  	LIST_HEAD(list);
>>>  
>>>  	if (!force) {
>>>  		if (list_empty_careful(&ctx->cq_overflow_list))
>>> -			return true;
>>> +			goto done;
>>>  		if ((ctx->cached_cq_tail - READ_ONCE(rings->cq.head) ==
>>>  		    rings->cq_ring_entries))
>>>  			return false;
>>> @@ -1284,7 +1285,11 @@ static bool io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
>>>  		io_put_req(req);
>>>  	}
>>>  
>>> -	return cqe != NULL;
>>> +	ret = cqe != NULL;
>>> +done:
>>> +	if (ret)
>>> +		ctx->rings->sq_flags &= ~IORING_SQ_CQ_OVERFLOW;
>>> +	return ret;
>>>  }
>>>  
>>>  static void __io_cqring_fill_event(struct io_kiocb *req, long res, long cflags)
>>> @@ -5933,10 +5938,13 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
>>>  	int i, submitted = 0;
>>>  
>>>  	/* if we have a backlog and couldn't flush it all, return BUSY */
>>> -	if (test_bit(0, &ctx->sq_check_overflow)) {
>>> +	if (unlikely(test_bit(0, &ctx->sq_check_overflow))) {
>>>  		if (!list_empty(&ctx->cq_overflow_list) &&
>>> -		    !io_cqring_overflow_flush(ctx, false))
>>> +		    !io_cqring_overflow_flush(ctx, false)) {
>>> +			ctx->rings->sq_flags |= IORING_SQ_CQ_OVERFLOW;
>>> +			smp_mb();
>>>  			return -EBUSY;
>>> +		}
>>>  	}
>>>  
>>>  	/* make sure SQ entry isn't read before tail */
>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>> index 92c22699a5a7..9c7e028beda5 100644
>>> --- a/include/uapi/linux/io_uring.h
>>> +++ b/include/uapi/linux/io_uring.h
>>> @@ -197,6 +197,7 @@ struct io_sqring_offsets {
>>>   * sq_ring->flags
>>>   */
>>>  #define IORING_SQ_NEED_WAKEUP	(1U << 0) /* needs io_uring_enter wakeup */
>>> +#define IORING_SQ_CQ_OVERFLOW	(1U << 1) /* app needs to enter kernel */
>>>  
>>>  struct io_cqring_offsets {
>>>  	__u32 head;
>>
>> I think the above code is still not correct: you only set
>> IORING_SQ_CQ_OVERFLOW in io_submit_sqes(), but if the cq ring has
>> overflowed and the application doesn't submit any more io, just calls
>> io_uring_peek_cqe() continuously, then the application still won't be
>> aware of the cq ring overflow.
>
> With the associated liburing change in that same email, the peek will
> enter the kernel just to flush the overflow. So not sure I see your
> concern?

I modified the above test program a bit (the headers below are the ones
the program needs; the original include list was mangled in transit):

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <assert.h>
#include "liburing.h"

static void test_cq_overflow(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;
        struct io_uring_sqe *sqe;
        int issued = 0;
        int ret = 0;
        int i;

        for (i = 0; i < 33; i++) {
                sqe = io_uring_get_sqe(ring);
                if (!sqe) {
                        fprintf(stderr, "get sqe failed\n");
                        break;
                }
                ret = io_uring_submit(ring);
                if (ret <= 0) {
                        if (ret != -EBUSY)
                                fprintf(stderr, "sqe submit failed: %d\n", ret);
                        break;
                }
                issued++;
        }

        printf("issued requests: %d\n", issued);

        while (issued) {
                ret = io_uring_peek_cqe(ring, &cqe);
                if (ret) {
                        if (ret != -EAGAIN) {
                                fprintf(stderr, "peek completion failed: %s\n",
                                        strerror(ret));
                                break;
                        }
                        printf("left requests: %d %d\n", issued,
                                IO_URING_READ_ONCE(*ring->sq.kflags));
                        continue;
                }
                io_uring_cqe_seen(ring, cqe);
                issued--;
                printf("left requests: %d\n", issued);
        }
}

int main(int argc, char *argv[])
{
        int ret;
        struct io_uring ring;

        ret = io_uring_queue_init(16, &ring, 0);
        if (ret) {
                fprintf(stderr, "ring setup failed: %d\n", ret);
                return 1;
        }

        test_cq_overflow(&ring);
        return 0;
}

Even with your patches applied, we still cannot peek the last cqe. This
test program only issues 33 sqes, so it won't get an -EBUSY error.

Regards,
Xiaoguang Wang

>
>> We could set IORING_SQ_CQ_OVERFLOW in __io_cqring_fill_event() when
>> setting cq_check_overflow. In non-sqpoll mode this will be safe, but in
>> sqpoll mode there may be concurrent modifications to sq_flags, which is
>> a race condition and may need an extra lock to protect
>> IORING_SQ_NEED_WAKEUP and IORING_SQ_CQ_OVERFLOW.
>
> That's why I wanted to keep it in the shared submission path, as that's
> properly synchronized.
>
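The liburing-side change referenced throughout this thread is not shown
above. A minimal sketch of what the peek-side flush could look like,
assuming the IORING_SQ_CQ_OVERFLOW flag from the kernel patch above;
peek_cqe_flushing() and cq_ring_needs_flush() are illustrative names,
not liburing's public API:

#include <errno.h>
#include <stdbool.h>
#include <unistd.h>
#include <sys/syscall.h>
#include "liburing.h"

/*
 * Sketch only: assumes the IORING_SQ_CQ_OVERFLOW flag from the kernel
 * patch above. The helper names here are illustrative.
 */
static inline bool cq_ring_needs_flush(struct io_uring *ring)
{
        /* the kernel patch sets this bit in the shared sq flags word */
        return IO_URING_READ_ONCE(*ring->sq.kflags) & IORING_SQ_CQ_OVERFLOW;
}

static int peek_cqe_flushing(struct io_uring *ring,
                             struct io_uring_cqe **cqe_ptr)
{
        int ret;

        ret = io_uring_peek_cqe(ring, cqe_ptr);
        if (ret != -EAGAIN || !cq_ring_needs_flush(ring))
                return ret;

        /*
         * The cq ring overflowed and completions are parked on the
         * kernel's overflow list; enter the kernel with GETEVENTS so
         * io_cqring_overflow_flush() gets a chance to run, then retry.
         */
        if (syscall(__NR_io_uring_enter, ring->ring_fd, 0, 0,
                    IORING_ENTER_GETEVENTS, NULL, 0) < 0)
                return -errno;
        return io_uring_peek_cqe(ring, cqe_ptr);
}

With a check like this folded into io_uring_peek_cqe() itself, the
modified test program above would be able to reap the last cqe: the
-EAGAIN peek would trigger an io_uring_enter() that lets the kernel
flush the parked completion into the CQ ring.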