From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-17.7 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,NICE_REPLY_A, SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 93FACC433DB for ; Thu, 14 Jan 2021 21:18:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2E02A23A5E for ; Thu, 14 Jan 2021 21:18:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727685AbhANVSM (ORCPT ); Thu, 14 Jan 2021 16:18:12 -0500 Received: from userp2130.oracle.com ([156.151.31.86]:52206 "EHLO userp2130.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727680AbhANVSM (ORCPT ); Thu, 14 Jan 2021 16:18:12 -0500 Received: from pps.filterd (userp2130.oracle.com [127.0.0.1]) by userp2130.oracle.com (8.16.0.42/8.16.0.42) with SMTP id 10EL9F7M109750; Thu, 14 Jan 2021 21:17:26 GMT DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com; h=subject : from : to : references : message-id : date : mime-version : in-reply-to : content-type : content-transfer-encoding; s=corp-2020-01-29; bh=m+KCOLCvhAFM0WTasJLBsHDXP3YFF1lidDFrnj35bxE=; b=W0CxSN+VffjzdZsabZ5vFnKNxxIJIziHkniTIqQ2FM0NjyVX22sxX8HFEVg5YRP1mG9P 5dX57Q0GbZFIVHSfc5nNOxu4Nis3eo9yKtFBVp54p3HxsCutG9PAdffNXYGpg4xr0f9l w5KOWCyCeTSeUFrI/djWrZXiejwQ3vC3SQoxJrtcm9kjGfiZxT0zL9ySZTjUXamvZd70 0Qd08dzClSPcoCkJYwfWDSfZ0rRk9r6EcC6+LJRgF88bmijWbWMnM7n35HCWIvgM8uHE BRdplTLFwgOht5AZpgOAgJQm9BvNOzgOVlBMUpbhH9eFjkzyoQrfWYKBFBST+rdNUdr7 9A== Received: from userp3030.oracle.com (userp3030.oracle.com [156.151.31.80]) by userp2130.oracle.com with ESMTP id 360kvka7j2-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 14 Jan 2021 21:17:26 +0000 Received: from pps.filterd (userp3030.oracle.com [127.0.0.1]) by userp3030.oracle.com (8.16.0.42/8.16.0.42) with SMTP id 10ELAtHI058784; Thu, 14 Jan 2021 21:17:26 GMT Received: from aserv0121.oracle.com (aserv0121.oracle.com [141.146.126.235]) by userp3030.oracle.com with ESMTP id 360kentybk-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 14 Jan 2021 21:17:26 +0000 Received: from abhmp0001.oracle.com (abhmp0001.oracle.com [141.146.116.7]) by aserv0121.oracle.com (8.14.4/8.13.8) with ESMTP id 10ELHNTN029573; Thu, 14 Jan 2021 21:17:23 GMT Received: from [10.154.113.143] (/10.154.113.143) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Thu, 14 Jan 2021 13:17:23 -0800 Subject: Re: [PATCH v5 13/13] io_uring: support buffer registration sharing From: Bijan Mottahedeh To: axboe@kernel.dk, asml.silence@gmail.com, io-uring@vger.kernel.org References: <1610487193-21374-1-git-send-email-bijan.mottahedeh@oracle.com> <1610487193-21374-14-git-send-email-bijan.mottahedeh@oracle.com> Message-ID: <5ff5ebc2-7b38-9899-06e9-8747e6562b70@oracle.com> Date: Thu, 14 Jan 2021 13:17:23 -0800 User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.6.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit X-Antivirus: Avast (VPS 210113-0, 01/12/2021), Outbound message X-Antivirus-Status: Clean X-Proofpoint-Virus-Version: vendor=nai engine=6000 definitions=9864 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 mlxscore=0 phishscore=0 spamscore=0 malwarescore=0 suspectscore=0 mlxlogscore=999 adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2009150000 definitions=main-2101140123 X-Proofpoint-Virus-Version: vendor=nai engine=6000 definitions=9864 signatures=668683 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 mlxlogscore=999 phishscore=0 lowpriorityscore=0 bulkscore=0 priorityscore=1501 malwarescore=0 clxscore=1015 impostorscore=0 spamscore=0 mlxscore=0 suspectscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2009150000 definitions=main-2101140123 Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org On 1/13/2021 6:01 PM, Bijan Mottahedeh wrote: > On 1/12/2021 1:33 PM, Bijan Mottahedeh wrote: >> Implement buffer sharing among multiple rings. >> >> A ring shares its (future) buffer registrations at setup time with >> IORING_SETUP_SHARE_BUF. A ring attaches to another ring's buffer >> registration at setup time with IORING_SETUP_ATTACH_BUF, after >> authenticating with the buffer registration owner's fd. Any updates to >> the owner's buffer registrations become immediately available to the >> attached rings. >> >> Signed-off-by: Bijan Mottahedeh >> >> Conflicts: >>     fs/io_uring.c >> --- >>   fs/io_uring.c                 | 85 >> +++++++++++++++++++++++++++++++++++++++++-- >>   include/uapi/linux/io_uring.h |  2 + >>   2 files changed, 83 insertions(+), 4 deletions(-) >> >> diff --git a/fs/io_uring.c b/fs/io_uring.c >> index 37639b9..856a570b 100644 >> --- a/fs/io_uring.c >> +++ b/fs/io_uring.c >> @@ -8439,6 +8439,13 @@ static void io_buffers_map_free(struct >> io_ring_ctx *ctx) >>       ctx->nr_user_bufs = 0; >>   } >> +static void io_detach_buf_data(struct io_ring_ctx *ctx) >> +{ >> +    percpu_ref_put(&ctx->buf_data->refs); >> +    ctx->buf_data = NULL; >> +    ctx->nr_user_bufs = 0; >> +} >> + >>   static int io_sqe_buffers_unregister(struct io_ring_ctx *ctx) >>   { >>       struct fixed_rsrc_data *data = ctx->buf_data; >> @@ -8447,6 +8454,11 @@ static int io_sqe_buffers_unregister(struct >> io_ring_ctx *ctx) >>       if (!data) >>           return -ENXIO; >> +    if (ctx->flags & IORING_SETUP_ATTACH_BUF) { >> +        io_detach_buf_data(ctx); >> +        return 0; >> +    } >> + >>       ret = io_rsrc_ref_quiesce(data, ctx, alloc_fixed_buf_ref_node); >>       if (ret) >>           return ret; >> @@ -8690,9 +8702,13 @@ static struct fixed_rsrc_data >> *io_buffers_map_alloc(struct io_ring_ctx *ctx, >>       if (!nr_args || nr_args > IORING_MAX_FIXED_BUFS) >>           return ERR_PTR(-EINVAL); >> -    buf_data = alloc_fixed_rsrc_data(ctx); >> -    if (IS_ERR(buf_data)) >> -        return buf_data; >> +    if (ctx->buf_data) { >> +        buf_data = ctx->buf_data; >> +    } else { >> +        buf_data = alloc_fixed_rsrc_data(ctx); >> +        if (IS_ERR(buf_data)) >> +            return buf_data; >> +    } >>       nr_tables = DIV_ROUND_UP(nr_args, IORING_MAX_BUFS_TABLE); >>       buf_data->table = kcalloc(nr_tables, sizeof(*buf_data->table), >> @@ -8757,9 +8773,17 @@ static int io_sqe_buffers_register(struct >> io_ring_ctx *ctx, void __user *arg, >>       if (ctx->nr_user_bufs) >>           return -EBUSY; >> +    if (ctx->flags & IORING_SETUP_ATTACH_BUF) { >> +        if (!ctx->buf_data) >> +            return -EFAULT; >> +        ctx->nr_user_bufs = ctx->buf_data->ctx->nr_user_bufs; >> +        return 0; >> +    } >> + >>       buf_data = io_buffers_map_alloc(ctx, nr_args); >>       if (IS_ERR(buf_data)) >>           return PTR_ERR(buf_data); >> +    ctx->buf_data = buf_data; >>       for (i = 0; i < nr_args; i++, ctx->nr_user_bufs++) { >>           struct io_mapped_ubuf *imu; >> @@ -8783,7 +8807,6 @@ static int io_sqe_buffers_register(struct >> io_ring_ctx *ctx, void __user *arg, >>               break; >>       } >> -    ctx->buf_data = buf_data; >>       if (ret) { >>           io_sqe_buffers_unregister(ctx); >>           return ret; >> @@ -9831,6 +9854,55 @@ static struct file *io_uring_get_file(struct >> io_ring_ctx *ctx) >>       return file; >>   } >> +static int io_attach_buf_data(struct io_ring_ctx *ctx, >> +                  struct io_uring_params *p) >> +{ >> +    struct io_ring_ctx *ctx_attach; >> +    struct fd f; >> + >> +    f = fdget(p->wq_fd); >> +    if (!f.file) >> +        return -EBADF; >> +    if (f.file->f_op != &io_uring_fops) { >> +        fdput(f); >> +        return -EINVAL; >> +    } >> + >> +    ctx_attach = f.file->private_data; >> +    if (!ctx_attach->buf_data) { >> +        fdput(f); >> +        return -EINVAL; >> +    } >> +    ctx->buf_data = ctx_attach->buf_data; >> + >> +    percpu_ref_get(&ctx->buf_data->refs); >> +    fdput(f); >> +    return 0; >> +} >> + >> +static int io_init_buf_data(struct io_ring_ctx *ctx, struct >> io_uring_params *p) >> +{ >> +    if ((p->flags & (IORING_SETUP_SHARE_BUF | >> IORING_SETUP_ATTACH_BUF)) == >> +        (IORING_SETUP_SHARE_BUF | IORING_SETUP_ATTACH_BUF)) >> +        return -EINVAL; >> + >> +    if (p->flags & IORING_SETUP_SHARE_BUF) { >> +        struct fixed_rsrc_data *buf_data; >> + >> +        buf_data = alloc_fixed_rsrc_data(ctx); >> +        if (IS_ERR(buf_data)) >> +            return PTR_ERR(buf_data); >> + >> +        ctx->buf_data = buf_data; >> +        return 0; >> +    } >> + >> +    if (p->flags & IORING_SETUP_ATTACH_BUF) >> +        return io_attach_buf_data(ctx, p); >> + >> +    return 0; >> +} >> + >>   static int io_uring_create(unsigned entries, struct io_uring_params *p, >>                  struct io_uring_params __user *params) >>   { >> @@ -9948,6 +10020,10 @@ static int io_uring_create(unsigned entries, >> struct io_uring_params *p, >>       if (ret) >>           goto err; >> +    ret = io_init_buf_data(ctx, p); >> +    if (ret) >> +        goto err; >> + >>       ret = io_sq_offload_create(ctx, p); >>       if (ret) >>           goto err; >> @@ -10028,6 +10104,7 @@ static long io_uring_setup(u32 entries, struct >> io_uring_params __user *params) >>       if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL | >>               IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE | >>               IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ | >> +            IORING_SETUP_SHARE_BUF | IORING_SETUP_ATTACH_BUF | >>               IORING_SETUP_R_DISABLED)) >>           return -EINVAL; >> diff --git a/include/uapi/linux/io_uring.h >> b/include/uapi/linux/io_uring.h >> index b289ef8..3ad786a 100644 >> --- a/include/uapi/linux/io_uring.h >> +++ b/include/uapi/linux/io_uring.h >> @@ -98,6 +98,8 @@ enum { >>   #define IORING_SETUP_CLAMP    (1U << 4)    /* clamp SQ/CQ ring sizes */ >>   #define IORING_SETUP_ATTACH_WQ    (1U << 5)    /* attach to existing >> wq */ >>   #define IORING_SETUP_R_DISABLED    (1U << 6)    /* start with ring >> disabled */ >> +#define IORING_SETUP_SHARE_BUF    (1U << 7)    /* share buffer >> registration */ >> +#define IORING_SETUP_ATTACH_BUF    (1U << 8)    /* attach buffer >> registration */ >>   enum { >>       IORING_OP_NOP, >> > > I recreated the deadlock scenario you had raised: > > > Ok, now the original io_uring instance will wait until the attached > > once get rid of their references. That's a versatile ground to have > > in kernel deadlocks. > > > > task1: uring1 = create() > > task2: uring2 = create() > > task1: uring3 = create(share=uring2); > > task2: uring4 = create(share=uring1); > > > > task1: io_sqe_buffers_unregister(uring1) > > task2: io_sqe_buffers_unregister(uring2) > > > > If I skimmed through the code right, that should hang unkillably. > > with the following test: > > +static int test_deadlock(void) > +{ > +       int i, pid, ret; > +       struct io_uring rings[4]; > +       struct io_uring_params p = {}; > + > +       p.flags = IORING_SETUP_SHARE_BUF; > + > +       for (i = 0; i < 2; i++) { > +               ret = io_uring_queue_init_params(1, &rings[i], &p); > +               if (ret) { > +                       verror("queue_init share"); > +                       return ret; > +               } > +       } > + > +       p.flags = IORING_SETUP_ATTACH_BUF; > + > +       pid = fork(); > +       if (pid) { > +               p.wq_fd = rings[1].ring_fd; > +               ret = io_uring_queue_init_params(1, &rings[3], &p); ^^^ 2 > +       } else { > +               p.wq_fd = rings[0].ring_fd; > +               ret = io_uring_queue_init_params(1, &rings[4], &p); ^^^ 3 > +       } > + > +       if (ret) { > +               verror("queue_init attach"); > +               return ret; > +       } > + > + > +       vinfo(V1, "unregister\n"); > + > +       if (pid) { > +               close(rings[1].ring_fd); > +               ret = io_uring_unregister_buffers(&rings[0]); > +       } else { > +               close(rings[0].ring_fd); > +               ret = io_uring_unregister_buffers(&rings[1]); > +       } > + > +       vinfo(V1, "unregister done\n"); > + > +       if (ret) > +               verror("unregister"); > + > +       return ret; > +} > > > The two processe in the test hang but can be interrupted. > > I checked that > > ret = wait_for_completion_interruptible(&data->done); > > in io_rsrc_ref_quiesce() returns -ERESTARTSYS (-512) after hitting ^C > and that > > ret = io_run_task_work_sig(); > > returns -EINTR (-4) > > Finally in > > io_ring_ctx_free() > -> io_sqe_buffers_unregister() >    -> io_rsrc_ref_quiesce() > > ret = wait_for_completion_interruptible(&data->done); > > returns 0. > > So it looks like the unkillable hang is not there. > > However, when I take out the io_uring_unregister_buffers() calls from > the test, one of the processes gets a segfault with the following trace > and I'm not sure what the cause is. Bug in test itself. Wrong index passed to io_uring_queue_init_params() above.