To: Christian Dietrich, io-uring
Cc: Horst Schirmeier, "Franz-B. Tuneke"
Tuneke" References: <9b3a8815-9a47-7895-0f4d-820609c15e9b@gmail.com> <4a553a51-50ff-e986-acf0-da9e266d97cd@gmail.com> <46229c8c-7e9d-9232-1e97-d1716dfc3056@gmail.com> <0468c1d5-9d0a-f8c0-618c-4a40b4677099@gmail.com> From: Pavel Begunkov Subject: Re: [RFC] Programming model for io_uring + eBPF Message-ID: <49df117c-dcbd-9b91-e181-e5b2757ae6aa@gmail.com> Date: Wed, 2 Jun 2021 11:47:50 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.10.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org On 5/27/21 12:12 PM, Christian Dietrich wrote: > Pavel Begunkov [21. May 2021]: > >>> The problem that I see is that eBPF in io_uring breaks this fine >>> synchronization as eBPF SQE submission and userspace SQE submission can >>> run in parallel. >> >> It definitely won't be a part of ABI, but they actually do serialise >> at the moment. > > They serialize because they are executed by the same worker thread, > right? No, because all submissions are currently under ctx->uring_lock mutex and also executed in the submitting task context (no io-wq). I hope to get rid of the second restriction, i.e. prior upstreaming, and maybe do something about the first one in far future if there will be a need. > Perhaps that is the solution to my synchronization problem. If/when > io_uring supports more than one eBPF executioners, we should make the > number of executors configurable at setup time. Thereby, the user can > implicitly manage serialization of eBPF execution. I wouldn't keep it a part of ABI, but may be good enough for experiments. So, it'll be UB and may break unless we figure something out. ... actually it sounds like a sane idea to do lock grabbing lazy, i.e. only if it actually submits requests. That may make sense if you control several requests by BPF, e.g. keeping QD + batching. >>> But going back to my original wish: I wanted to ensure that I can >>> serialize eBPF-SQEs such that I'm sure that they do not run in parallel. >>> My idea was to use synchronization groups as a generalization of >>> SQE linking in order to make it also useful for others (not only for eBPF). >> >> So, let's dissect it a bit more, why do you need serialising as >> such? What use case you have in mind, and let's see if it's indeed >> can't be implemented efficiently with what we have. > > What I want to do is to manipulate (read-calculate-update) user memory > from eBPF without the need to synchronize between eBPF invocations. > > As eBPF invocations have a run-to-completion semantic, it feels bad to > use lock-based synchronization. Besides waiting for user-memory to be > swapped in, they will be usually short and plug together results and > newly emitted CQEs. swapping can't be done under spinlocks, that's a reason why userspace read/write are available only to sleepable BPF. btw, I was thinking about adding atomic ops with userspace memory. I can code an easy solution, but may have too much of additional overhead. Same situation with normal load/stores being slow mentioned below. > >> To recap: BPFs don't share SQ with userspace at all, and may have >> separate CQs to reap events from. You may post an event and it's >> wait synchronised, so may act as a message-based synchronisation, >> see test3 in the recently posted v2 for example. I'll also be >> adding futex support (bpf + separate requests), it might >> play handy for some users. 
>
> I'm sure that it is possible to use those mechanisms for synchronizing,
> but I assume that explicit synchronization (locks, passing tokens
> around) is more expensive than serializing requests (implicit
> synchronization) before starting to execute them.

As you know, it depends on the work/synchronisation ratio, but I'm also
concerned about scalability. Consider the same argument applied to
serializing normal user threads (e.g. pthreads). Not quite of the same
extent, but it shows the idea.

> But probably, we need some benchmarks to see what performs better.

Would be great to have some in any case. Userspace reads/writes may be
slow. That can be solved (if it turns out to be a problem), but it would
require work on the BPF kernel side.

>>> My reasoning for not doing this serialization in userspace is that I
>>> want to use the SQPOLL mode and execute long chains of
>>> IO/computation-SQEs without leaving kernelspace.
>>
>> btw, "in userspace" is now more vague as it can be done by BPF
>> as well. For some use cases I'd expect BPF acting as a reactor,
>> e.g. reacting to completions of previous CQEs and submitting new
>> requests in response, and so keeping everything in kernel space until
>> it has anything to tell the userspace, e.g. by posting
>> into the main CQ.
>
> Yes, exactly that is my motivation. But I also think that it is a useful
> pattern to have many eBPF callbacks pending (e.g. one for each
> connection).

Good case. One more consideration is how much generality such cases need.
E.g. there are some data structures in BPF, I don't remember the names,
but something like perf buffers or perf rings.

>
> With one pending invocation per connection, synchronization with a fixed
> number of additional CQEs might be problematic: for example, for a
> per-connection barrier synchronization with the CQ-reap approach, one
> needs one CQ for each connection.
>
>>> The problem that I had when thinking about the implementation is that
>>> the IO_LINK semantic works in the wrong direction: link the next SQE,
>>> whenever it comes, to this SQE. If it were the other way around
>>> ("link this SQE to the previous one") it would be much easier, as the
>>> cost would only arise if we actually request linking. But compatibility..
>>
>> Stack vs queue style linking? If I understand what you mean right, that's
>> because this is how the SQ is parsed and so that's the most efficient way.
>
> No, I did not want to argue about the ordering within the link chain,
> but about the semantics of the link flag. I thought that it might have been
> beneficial to let the flag indicate that the SQE should be linked to the
> previous one. However, after thinking this through in more detail, I now
> believe that it does not make any difference for the submission path.
>
>
>>> Ok, but what happens if the last SQE in an io_submit_sqes() call
>>> requests linking? Is it envisioned that the first SQE that comes with
>>> the next io_submit_sqes() is linked to that one?
>>
>> No, it doesn't leave the submission boundary (e.g. a single syscall).
>> In theory it may be left there _not_ submitted, but I don't see much
>> profit in it.
>>
>>> If this is not supported, what happens if I use the SQPOLL mode, where
>>> the poller thread can partition my submitted SQEs at an arbitrary
>>> point into multiple io_submit_sqes() calls?
>>
>> It's not arbitrary, submission is atomic in nature: first you fill the
>> SQEs in memory, but they are not visible to SQPOLL in the meanwhile,
>> and then you "commit" them by overwriting the SQ tail pointer.
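A small liburing-based sketch of the two points above (not from the thread; assumes a stock liburing install and trims error handling): a chain is built by setting IOSQE_IO_LINK on every SQE except the last, and nothing becomes visible to the kernel, or to an SQPOLL thread, until io_uring_submit() publishes the new SQ tail, which is exactly why a link chain cannot span two submissions.

```c
#include <liburing.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	/* First request: IOSQE_IO_LINK means the next SQE of this
	 * submission only starts after this one completes. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_nop(sqe);
	sqe->flags |= IOSQE_IO_LINK;

	/* Last request of the chain: no link flag. If it carried one,
	 * the chain would simply end at the submission boundary below;
	 * it never spills into the next io_submit_sqes() call. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_nop(sqe);

	/* The "commit": io_uring_submit() publishes the new SQ tail with
	 * a release store. Only now does the kernel (or an SQPOLL thread)
	 * see both SQEs, and it sees them together, so the chain cannot
	 * be split at an arbitrary point. */
	io_uring_submit(&ring);

	for (int i = 0; i < 2; i++) {
		io_uring_wait_cqe(&ring, &cqe);
		printf("cqe res=%d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}
```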
>>
>> A not-so-great exception to that is the shared SQPOLL task, but it
>> just waits for someone to take care of the case.
>>
>> 	if (cap_entries && to_submit > 8)
>> 		to_submit = 8;
>>
>>> If this is supported, link.head has to point to the last submitted SQE after
>>> the first io_submit_sqes() call. Isn't appending SQEs in the
>>> second io_submit_sqes() call then racy with the completion side (with the
>>> same problems that I tried to solve)?
>>
>> Exactly why it's not supported.
>
> Thank you for this detailed explanation. I now understand the design
> decision behind the SQE linking much better, and why delayed linking of
> SQEs introduces coupling between the completion and submission side that
> is undesirable.

You're welcome, hope we'll get the API into shape in no time

-- 
Pavel Begunkov