From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: io_uring networking performance degradation
To: Michael Stoler
Cc: io-uring@vger.kernel.org, Jens Axboe
References: <6d9d2610-e7c0-ad29-fd30-85da8222ab83@gmail.com>
From: Pavel Begunkov
Message-ID: <285e496e-a921-ce16-1fb9-e914362439b9@gmail.com>
Date: Mon, 26 Apr 2021 12:49:12 +0100
X-Mailing-List: io-uring@vger.kernel.org

On 4/25/21 10:52 AM, Michael Stoler wrote:
> Because perf works unreliably on the AWS VM, I re-ran the test on a
> physical machine: Ubuntu 20.04, 5.8.0-50-generic kernel, AMD EPYC 7272
> 12-Core Processor @ 3200MHz, BogoMIPS 5789.39, Mellanox 5 NIC,
> Speed: 25000Mb/s Full Duplex.
> On the physical machine the performance degradation is much less
> pronounced, but it still exists:
> io_uring-echo-server  Speed: 143081 request/sec, 143081 response/sec
> epoll-echo-server     Speed: 150692 request/sec, 150692 response/sec
> epoll-echo-server is 5% faster

I have to note that I haven't checked the userspace programs, so I'm
not sure it's a fair comparison (it may or may not be). With that said:

1) The last report showed a lot of idle time, so for that run it may be
a question of latency rather than throughput.

2) Did you do proper pinning to a CPU/core, e.g. with taskset or cset?
Also, did the workload saturate the CPU/core you used in the most
recent post? (See the pinning sketch below.)

3) __skb_datagram_iter taking ~1% suggests there are other tasks taking
a relatively large share of CPU/NIC resources. What is that datagram
traffic? UDP on the same NIC? Is something else using your
NIC/interface?

4) I don't see anything even remotely related to io_uring in the recent
run, and it was only a small fraction in the previous ones. So it's
definitely not the overhead of submit/complete.

If there is an io_uring problem, it could be a difference in polling /
io-wq punting compared with epoll. It may be interesting to look into.
A related thing I'm curious about is comparing FAST_POLL requests with
io_uring multi-shot polling + send/recv; a rough sketch of the latter
follows below.
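For the pinning question, something along these lines (untested, and
core 0 is only a placeholder) pins the server thread from inside the
program, which is roughly what running it under taskset does from the
outside:

/* Untested sketch: pin the calling thread to one CPU core, roughly
 * what `taskset -c 0 ./server` does externally. Core 0 is just an
 * example; a core on the NUMA node handling the NIC interrupts is
 * usually the better choice. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* pid 0 == the calling thread */
	if (sched_setaffinity(0, sizeof(set), &set) < 0) {
		perror("sched_setaffinity");
		exit(1);
	}
}

Either way (taskset, cset or the above), the point is to keep the
server on one known core so the saturation numbers are comparable.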
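And for the multi-shot poll + recv comparison, a rough, untested sketch
of what I mean (this is not what the echo servers above do; it assumes
a kernel with IORING_POLL_ADD_MULTI and a liburing that provides
io_uring_prep_poll_multishot(), and echo_loop()/TAG_POLL are just
placeholder names):

/* Untested sketch: arm a multi-shot poll on a connected socket and do
 * a plain non-blocking recv()/send() when each poll CQE arrives,
 * instead of submitting IORING_OP_RECV and relying on FAST_POLL. */
#include <liburing.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/socket.h>

#define TAG_POLL ((void *)1UL)

static void arm_multishot_poll(struct io_uring *ring, int fd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_poll_multishot(sqe, fd, POLLIN);
	io_uring_sqe_set_data(sqe, TAG_POLL);
	io_uring_submit(ring);
}

/* fd is an already accepted, connected socket */
static void echo_loop(struct io_uring *ring, int fd)
{
	char buf[2048];
	struct io_uring_cqe *cqe;

	arm_multishot_poll(ring, fd);
	while (io_uring_wait_cqe(ring, &cqe) == 0) {
		if (io_uring_cqe_get_data(cqe) == TAG_POLL && cqe->res > 0) {
			ssize_t n = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);

			if (n > 0)
				send(fd, buf, n, MSG_DONTWAIT);
			/* IORING_CQE_F_MORE means the poll stays armed;
			 * otherwise it has to be re-armed. */
			if (!(cqe->flags & IORING_CQE_F_MORE))
				arm_multishot_poll(ring, fd);
		}
		io_uring_cqe_seen(ring, cqe);
	}
}

The FAST_POLL side would just submit io_uring_prep_recv() /
io_uring_prep_send() directly; comparing the two should show whether
the gap comes from how readiness polling is driven.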
>
> "perf top" with io_uring-echo-server:
>    PerfTop: 16481 irqs/sec  kernel:98.5%  exact: 99.8%  lost: 0/0  drop: 0/0  [4000Hz cycles],  (all, 24 CPUs)
> -------------------------------------------------------------------------------
>      8.66%  [kernel]   [k] __x86_indirect_thunk_rax
>      8.49%  [kernel]   [k] copy_user_generic_string
>      5.57%  [kernel]   [k] memset
>      2.81%  [kernel]   [k] tcp_rate_skb_sent
>      2.32%  [kernel]   [k] __alloc_skb
>      2.16%  [kernel]   [k] __check_object_size
>      1.44%  [unknown]  [k] 0xffffffffc100c296
>      1.28%  [kernel]   [k] tcp_write_xmit
>      1.22%  [kernel]   [k] iommu_dma_map_page
>      1.16%  [kernel]   [k] kmem_cache_free
>      1.14%  [kernel]   [k] __softirqentry_text_start
>      1.06%  [unknown]  [k] 0xffffffffc1008a7e
>      1.03%  [kernel]   [k] __skb_datagram_iter
>      0.97%  [kernel]   [k] __dev_queue_xmit
>      0.86%  [kernel]   [k] ipv4_mtu
>      0.85%  [kernel]   [k] tcp_schedule_loss_probe
>      0.80%  [kernel]   [k] tcp_release_cb
>      0.78%  [unknown]  [k] 0xffffffffc100c290
>      0.77%  [unknown]  [k] 0xffffffffc100c295
>      0.76%  perf       [.] __symbols__insert
>
> "perf top" with epoll-echo-server:
>    PerfTop: 24255 irqs/sec  kernel:98.3%  exact: 99.6%  lost: 0/0  drop: 0/0  [4000Hz cycles],  (all, 24 CPUs)
> -------------------------------------------------------------------------------
>      8.77%  [kernel]   [k] __x86_indirect_thunk_rax
>      7.50%  [kernel]   [k] copy_user_generic_string
>      4.10%  [kernel]   [k] memset
>      2.70%  [kernel]   [k] tcp_rate_skb_sent
>      2.18%  [kernel]   [k] __check_object_size
>      2.09%  [kernel]   [k] __alloc_skb
>      1.61%  [unknown]  [k] 0xffffffffc100c296
>      1.47%  [kernel]   [k] __virt_addr_valid
>      1.40%  [kernel]   [k] iommu_dma_map_page
>      1.37%  [unknown]  [k] 0xffffffffc1008a7e
>      1.22%  [kernel]   [k] tcp_poll
>      1.16%  [kernel]   [k] __softirqentry_text_start
>      1.15%  [kernel]   [k] tcp_stream_memory_free
>      1.07%  [kernel]   [k] tcp_write_xmit
>      1.06%  [kernel]   [k] kmem_cache_free
>      1.03%  [kernel]   [k] tcp_release_cb
>      0.96%  [kernel]   [k] syscall_return_via_sysret
>      0.90%  [kernel]   [k] __lock_text_start
>      0.82%  [kernel]   [k] __copy_skb_header
>      0.81%  [kernel]   [k] amd_iommu_map
>
> Regards
> Michael Stoler
>
> On Tue, Apr 20, 2021 at 1:44 PM Michael Stoler wrote:
>>
>> Hi, perf data and tops for linux-5.8 are here:
>> http://rdxdownloads.rdxdyn.com/michael_stoler_perf_data.tgz
>>
>> Regards
>> Michael Stoler
>>
>> On Mon, Apr 19, 2021 at 5:27 PM Michael Stoler wrote:
>>>
>>> 1) linux-5.12-rc8 shows generally the same picture:
>>>
>>> average load, 70-85% CPU core usage, 128-byte packets
>>> echo_bench --address '172.22.150.170:7777' --number 10 --duration 60 --length 128
>>> epoll-echo-server:    Speed: 71513 request/sec, 71513 response/sec
>>> io_uring_echo_server: Speed: 64091 request/sec, 64091 response/sec
>>> epoll-echo-server is 11% faster
>>>
>>> high load, 95-100% CPU core usage, 128-byte packets
>>> echo_bench --address '172.22.150.170:7777' --number 20 --duration 60 --length 128
>>> epoll-echo-server:    Speed: 130186 request/sec, 130186 response/sec
>>> io_uring_echo_server: Speed: 109793 request/sec, 109793 response/sec
>>> epoll-echo-server is 18% faster
>>>
>>> average load, 70-85% CPU core usage, 2048-byte packets
>>> echo_bench --address '172.22.150.170:7777' --number 10 --duration 60 --length 2048
>>> epoll-echo-server:    Speed: 63082 request/sec, 63082 response/sec
>>> io_uring_echo_server: Speed: 59449 request/sec, 59449 response/sec
>>> epoll-echo-server is 6% faster
>>>
>>> high load, 95-100% CPU core usage, 2048-byte packets
>>> echo_bench --address '172.22.150.170:7777' --number 20 --duration 60 --length 2048
>>> epoll-echo-server:    Speed: 110402 request/sec, 110402 response/sec
>>> io_uring_echo_server: Speed: 88718 request/sec, 88718 response/sec
>>> epoll-echo-server is 24% faster
>>>
>>> 2-3) "perf top" does not run reliably on Ubuntu over AWS. It keeps
>>> printing errors: "Uhhuh. NMI received for unknown reason", "Do you
>>> have a strange power saving mode enabled?", "Dazed and confused, but
>>> trying to continue".
>>>
>>> Regards
>>> Michael Stoler
>>>
>>> On Mon, Apr 19, 2021 at 1:20 PM Pavel Begunkov wrote:
>>>>
>>>> On 4/19/21 10:13 AM, Michael Stoler wrote:
>>>>> We are trying to reproduce the results reported at
>>>>> https://github.com/frevib/io_uring-echo-server/blob/master/benchmarks/benchmarks.md
>>>>> in a more realistic environment:
>>>>> 1. Internode networking in an AWS cluster with i3.16xlarge nodes
>>>>> (25 Gigabit network connection between client and server)
>>>>> 2. 128- and 2048-byte packet sizes, to simulate typical payloads
>>>>> 3. 10 clients, to reach 75-95% CPU utilization on the server and
>>>>> simulate normal server load
>>>>> 4. 20 clients, to reach 100% CPU utilization on the server and
>>>>> simulate heavy server load
>>>>>
>>>>> Software:
>>>>> 1. OS: Ubuntu 20.04.2 LTS HWE with the 5.8.0-45-generic kernel and the latest liburing
>>>>> 2. io_uring-echo-server: https://github.com/frevib/io_uring-echo-server
>>>>> 3. epoll-echo-server: https://github.com/frevib/epoll-echo-server
>>>>> 4. benchmark: https://github.com/haraldh/rust_echo_bench
>>>>> 5. all commands run with "hwloc-bind os=eth1"
>>>>>
>>>>> The results are confusing: epoll-echo-server shows a stable advantage
>>>>> over io_uring-echo-server, despite the reported advantage of
>>>>> io_uring-echo-server:
>>>>>
>>>>> 128-byte packet size, 10 clients, 75-95% CPU core utilization on the server:
>>>>> echo_bench --address '172.22.117.67:7777' -c 10 -t 60 -l 128
>>>>> epoll_echo_server:    Speed: 80999 request/sec, 80999 response/sec
>>>>> io_uring_echo_server: Speed: 74488 request/sec, 74488 response/sec
>>>>> epoll_echo_server is 8% faster
>>>>>
>>>>> 128-byte packet size, 20 clients, 100% CPU core utilization on the server:
>>>>> echo_bench --address '172.22.117.67:7777' -c 20 -t 60 -l 128
>>>>> epoll_echo_server:    Speed: 129063 request/sec, 129063 response/sec
>>>>> io_uring_echo_server: Speed: 102681 request/sec, 102681 response/sec
>>>>> epoll_echo_server is 25% faster
>>>>>
>>>>> 2048-byte packet size, 10 clients, 75-95% CPU core utilization on the server:
>>>>> echo_bench --address '172.22.117.67:7777' -c 10 -t 60 -l 2048
>>>>> epoll_echo_server:    Speed: 74421 request/sec, 74421 response/sec
>>>>> io_uring_echo_server: Speed: 66510 request/sec, 66510 response/sec
>>>>> epoll_echo_server is 11% faster
>>>>>
>>>>> 2048-byte packet size, 20 clients, 100% CPU core utilization on the server:
>>>>> echo_bench --address '172.22.117.67:7777' -c 20 -t 60 -l 2048
>>>>> epoll_echo_server:    Speed: 108704 request/sec, 108704 response/sec
>>>>> io_uring_echo_server: Speed: 85536 request/sec, 85536 response/sec
>>>>> epoll_echo_server is 27% faster
>>>>>
>>>>> Why does io_uring show a consistent performance degradation? What is going wrong?
>>>>
>>>> 5.8 is pretty old, and I'm not sure all of the performance problems
>>>> were addressed there. Apart from missing common optimisations, as you
>>>> may have seen in the thread, it looks to me like it doesn't have the
>>>> sighand(?) lock hammering fix. Jens knows better whether it has been
>>>> backported or not.
>>>>
>>>> So, things you can do:
>>>> 1) try out 5.12
>>>> 2) attach `perf top` output or some other profiling for your 5.8
>>>> 3) to get a more complete picture, do 2) with 5.12
>>>> --
>>>> Pavel Begunkov