From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8C69FC433EF for ; Thu, 18 Nov 2021 22:02:30 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7186461361 for ; Thu, 18 Nov 2021 22:02:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231176AbhKRWFa (ORCPT ); Thu, 18 Nov 2021 17:05:30 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35544 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229472AbhKRWFa (ORCPT ); Thu, 18 Nov 2021 17:05:30 -0500 Received: from mail-pg1-x52c.google.com (mail-pg1-x52c.google.com [IPv6:2607:f8b0:4864:20::52c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9D7CAC061574; Thu, 18 Nov 2021 14:02:29 -0800 (PST) Received: by mail-pg1-x52c.google.com with SMTP id g28so6710492pgg.3; Thu, 18 Nov 2021 14:02:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=8fh25Uj6YoBvqabkRqtjeDWkpKjWk2JP2GTPGCVb4fw=; b=ddSjjAwKtkJ5FN0JMAPArzeYHb8DoBJo54IAiteYJVVOz1gPw92lhRM3IwDvvfub4c 8eNobr+J17FSrQo1JAnACftnyQzw1tfOUDzeMY152J+fKowx13UCZc1QFTzj9dGQsSQI Hfoo38oWz+tx2j58HLzGBIRLvPqNAvhXxXFW//qttykdMr5sKz23xozcTe+69e4qwL1P Aqfxa6AF+pkIfrYhUfSEmmTgR+WO+YPYzFuoWcg0OlIQ6vBbpXuRR1dF+n/DJc2dYVPJ DhYglKkB2PPssdpPR3HgGG4R0ZsTN7ADEe+O8v22autDHeVUJOVfhDmwMhAbAfiS0S3g 9JBw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=8fh25Uj6YoBvqabkRqtjeDWkpKjWk2JP2GTPGCVb4fw=; b=mnG7RGiqUkaXrdrjfVP/FHyKvbGXotXqivQLQuNfxDkquh3UoIdBMaz93BY+tC2CAY 7H2OrxX1k3UphfrVUYBPR7l+q/tdQDTZ/P7n1VoyrMSRoqUl3ojRPU/MzoOmKpnfDlEO o5H664fDSqqXyJ03+6NKd7e3ISfZ4mmErh7XJfUta7SMhywlKZAkpVNlGOfczQjIF0aA uU75kp4fUSjrSMmKctXtNTKZ1ZDxxpofpHLLdBEwoP98PIppKn9k/Yql2SaQwycwcMq6 5SYOZ3iDfjlN+PGvgheG/B6GEGDu5xI4mjHodjgGgGCCKcLkHZsfTyqH2ZLaiQtW3X5E trlg== X-Gm-Message-State: AOAM531ggxln1SOFZ1OHI/Mf1wp2+Rmq4FPNd9cEf+Gtk+QHiWi6LKP0 lhrggqZg274CqjYoVDvcVtd5iljCus0= X-Google-Smtp-Source: ABdhPJwSxHjZbxBxsDpQ3Yv8kgFa69flZgG3R8Q2pPP2B89P9cbEqO8gbVUygG5a0cKU1S3+FRKO9w== X-Received: by 2002:a63:6c44:: with SMTP id h65mr13775544pgc.423.1637272949121; Thu, 18 Nov 2021 14:02:29 -0800 (PST) Received: from ast-mbp.dhcp.thefacebook.com ([2620:10d:c090:500::1:6926]) by smtp.gmail.com with ESMTPSA id h6sm587662pfi.174.2021.11.18.14.02.27 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 18 Nov 2021 14:02:28 -0800 (PST) Date: Thu, 18 Nov 2021 14:02:26 -0800 From: Alexei Starovoitov To: Kumar Kartikeya Dwivedi Cc: bpf@vger.kernel.org, Jens Axboe , Pavel Begunkov , io-uring@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Pavel Emelyanov , Alexander Mihalicyn , Andrei Vagin , criu@openvz.org, linux-fsdevel@vger.kernel.org Subject: Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers Message-ID: <20211118220226.ritjbjeh5s4yw7hl@ast-mbp.dhcp.thefacebook.com> References: <20211116054237.100814-1-memxor@gmail.com> <20211116054237.100814-2-memxor@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20211116054237.100814-2-memxor@gmail.com> Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org On Tue, Nov 16, 2021 at 11:12:30AM +0530, Kumar Kartikeya Dwivedi wrote: > This change adds eBPF iterator for buffers registered in io_uring ctx. > It gives access to the ctx, the index of the registered buffer, and a > pointer to the io_uring_ubuf itself. This allows the iterator to save > info related to buffers added to an io_uring instance, that isn't easy > to export using the fdinfo interface (like exact struct page composing > the registered buffer). > > The primary usecase this is enabling is checkpoint/restore support. > > Note that we need to use mutex_trylock when the file is read from, in > seq_start functions, as the order of lock taken is opposite of what it > would be when io_uring operation reads the same file. We take > seq_file->lock, then ctx->uring_lock, while io_uring would first take > ctx->uring_lock and then seq_file->lock for the same ctx. > > This can lead to a deadlock scenario described below: > > CPU 0 CPU 1 > > vfs_read > mutex_lock(&seq_file->lock) io_read > mutex_lock(&ctx->uring_lock) > mutex_lock(&ctx->uring_lock) # switched to mutex_trylock > mutex_lock(&seq_file->lock) > > The trylock also protects the case where io_uring tries to read from > iterator attached to itself (same ctx), where the order of locks would > be: > io_uring_enter > mutex_lock(&ctx->uring_lock) <-----------. > io_read \ > seq_read \ > mutex_lock(&seq_file->lock) / > mutex_lock(&ctx->uring_lock) # deadlock-` > > In both these cases (recursive read and contended uring_lock), -EDEADLK > is returned to userspace. > > In the future, this iterator will be extended to directly support > iteration of bvec Flexible Array Member, so that when there is no > corresponding VMA that maps to the registered buffer (e.g. if VMA is > destroyed after pinning pages), we are able to reconstruct the > registration on restore by dumping the page contents and then replaying > them into a temporary mapping used for registration later. All this is > out of scope for the current series however, but builds upon this > iterator. >From BPF infra perspective these new iterators fit very well and I don't see any issues maintaining this interface while kernel keeps changing, but this commit log and shallowness of the selftests makes me question feasibility of this approach in particular with io_uring. Is it even possible to scan all internal bits of io_uring and reconstruct it later? The bpf iter is only the read part. Don't you need the write part for CRIU ? Even for reads only... io_uring has complex inner state. Like bpf itself which cannot be realistically CRIU-ed. I don't think we can merge this in pieces. We need to wait until there is full working CRIU framework that uses these new iterators.