FYI, fsnotify contention with aio and io

public inbox for [email protected]
 help / color / mirror / Atom feed

* FYI, fsnotify contention with aio and io_uring.
@ 2023-08-04 17:47 Pierre Labat
  2023-08-07 20:11 ` Jeff Moyer
  0 siblings, 1 reply; 8+ messages in thread
From: Pierre Labat @ 2023-08-04 17:47 UTC (permalink / raw)
  To: '[email protected]'

Hi,

This is FYI, may be you already knows about that, but in case you don't....

I was pushing the limit of the number of nvme read IOPS,  the  FIO + the Linux OS can handle. For that, I have something special under the Linux nvme driver. As a consequence I am not limited by whatever the NVME SSD max IOPS or IO latency would be.

As I cranked the number of system cores and FIO jobs doing direct 4k random read on /dev/nvme0n1, I hit a wall. The IOPS scaling slows (less than linear) and around 15 FIO jobs on 15 core threads, the overall IOPS, in fact, goes down as I add more FIO jobs. For example on a system with 24 cores/48 threads, when I goes beyond 15 FIO jobs, the overall IOPS starts to go down.

This happens the same for io_uring and aio. Was using kernel version 6.3.9. Using one namespace (/dev/nvme0n1).

Did some profiling to know why. On a 24 cores/48 threads with FIO 48 jobs, I got for the io_uring case:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 1858618550304
#
# Overhead  Command          Shared Object                 Symbol                                     
# ........  ...............  ............................  ...........................................
#
    39.46%  fio              [kernel.vmlinux]              [k] lockref_get_not_zero
            |
            ---lockref_get_not_zero
               dget_parent
               __fsnotify_parent
               io_read
               io_issue_sqe
               io_submit_sqes
               __do_sys_io_uring_enter
               do_syscall_64
               entry_SYSCALL_64
               syscall
.
.
.
    36.03%  fio              [kernel.vmlinux]              [k] lockref_put_return
            |
            ---lockref_put_return
               dput
               __fsnotify_parent
               io_read
               io_issue_sqe
               io_submit_sqes
               __do_sys_io_uring_enter
               do_syscall_64
               entry_SYSCALL_64
               syscall
.
.

As you can see 76% of the cpu on the box is sucked up by lockref_get_not_zero() and lockref_put_return().
Looking at the code, there is contention when IO_uring call fsnotify_access().
The filesystem code fsnotify_access() ends up calling dget_parent() and later dput() to take a reference on the parent directory (that would be /dev/ in our case), and later release the reference.
This is done (get+put) for each IO. 

The dget increments a unique/same counter (for the /dev/ directory)  and dput decrements this same counter.

As a consequence we have 24 cores/48 threads fighting to get the same counter in their cache to modify it. At a rate of M of iops. That is disastrous.

To work around that problem, and continue my scalability testing, I acked io_uring and aio to set the flag FMODE_NONOTIFY in the struct file->f_mode of the file on which IOs are done.
Doing that forces fsnotify to do nothing. The iops immediately went up more than 4X and the fsnotify trashing disappeared. 

May be it would be a good idea to add an option to FIO to disable fsnotify on the file[s] on which IOs are issued?
Or to take a reference on the file parent directory only once when fio starts?

Regards,

Pierre

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: FYI, fsnotify contention with aio and io_uring.
  2023-08-04 17:47 FYI, fsnotify contention with aio and io_uring Pierre Labat
@ 2023-08-07 20:11 ` Jeff Moyer
  2023-08-08 21:41   ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Moyer @ 2023-08-07 20:11 UTC (permalink / raw)
  To: Pierre Labat; +Cc: '[email protected]'

Hi, Pierre,

Pierre Labat <[email protected]> writes:

> Hi,
>
> This is FYI, may be you already knows about that, but in case you don't....
>
> I was pushing the limit of the number of nvme read IOPS, the FIO + the
> Linux OS can handle. For that, I have something special under the
> Linux nvme driver. As a consequence I am not limited by whatever the
> NVME SSD max IOPS or IO latency would be.
>
> As I cranked the number of system cores and FIO jobs doing direct 4k
> random read on /dev/nvme0n1, I hit a wall. The IOPS scaling slows
> (less than linear) and around 15 FIO jobs on 15 core threads, the
> overall IOPS, in fact, goes down as I add more FIO jobs. For example
> on a system with 24 cores/48 threads, when I goes beyond 15 FIO jobs,
> the overall IOPS starts to go down.
>
> This happens the same for io_uring and aio. Was using kernel version 6.3.9. Using one namespace (/dev/nvme0n1).

[snip]

> As you can see 76% of the cpu on the box is sucked up by
> lockref_get_not_zero() and lockref_put_return().  Looking at the code,
> there is contention when IO_uring call fsnotify_access().

Is there a FAN_MODIFY fsnotify watch set on /dev?  If so, it might be a
good idea to find out what set it and why.

> The filesystem code fsnotify_access() ends up calling dget_parent()
> and later dput() to take a reference on the parent directory (that
> would be /dev/ in our case), and later release the reference.  This is
> done (get+put) for each IO.
>
> The dget increments a unique/same counter (for the /dev/ directory)
> and dput decrements this same counter.
>
> As a consequence we have 24 cores/48 threads fighting to get the same
> counter in their cache to modify it. At a rate of M of iops. That is
> disastrous.
>
> To work around that problem, and continue my scalability testing, I
> acked io_uring and aio to set the flag FMODE_NONOTIFY in the struct
> file->f_mode of the file on which IOs are done.  Doing that forces
> fsnotify to do nothing. The iops immediately went up more than 4X and
> the fsnotify trashing disappeared.
>
> May be it would be a good idea to add an option to FIO to disable
> fsnotify on the file[s] on which IOs are issued?

Maybe I'm wrong, but that sounds like an abuse of the FMODE_NONOTIFY
flag.

> Or to take a reference on the file parent directory only once when fio
> starts?

Let's decide on whether or not the application is following best
practices, first, starting with answering the questions I asked above.

Cheers,
Jeff


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: FYI, fsnotify contention with aio and io_uring.
  2023-08-07 20:11 ` Jeff Moyer
@ 2023-08-08 21:41   ` Jens Axboe
  2023-08-09 16:33     ` [EXT] " Pierre Labat
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2023-08-08 21:41 UTC (permalink / raw)
  To: Jeff Moyer, Pierre Labat; +Cc: '[email protected]'

On 8/7/23 2:11?PM, Jeff Moyer wrote:
> Hi, Pierre,
> 
> Pierre Labat <[email protected]> writes:
> 
>> Hi,
>>
>> This is FYI, may be you already knows about that, but in case you don't....
>>
>> I was pushing the limit of the number of nvme read IOPS, the FIO + the
>> Linux OS can handle. For that, I have something special under the
>> Linux nvme driver. As a consequence I am not limited by whatever the
>> NVME SSD max IOPS or IO latency would be.
>>
>> As I cranked the number of system cores and FIO jobs doing direct 4k
>> random read on /dev/nvme0n1, I hit a wall. The IOPS scaling slows
>> (less than linear) and around 15 FIO jobs on 15 core threads, the
>> overall IOPS, in fact, goes down as I add more FIO jobs. For example
>> on a system with 24 cores/48 threads, when I goes beyond 15 FIO jobs,
>> the overall IOPS starts to go down.
>>
>> This happens the same for io_uring and aio. Was using kernel version 6.3.9. Using one namespace (/dev/nvme0n1).
> 
> [snip]
> 
>> As you can see 76% of the cpu on the box is sucked up by
>> lockref_get_not_zero() and lockref_put_return().  Looking at the code,
>> there is contention when IO_uring call fsnotify_access().
> 
> Is there a FAN_MODIFY fsnotify watch set on /dev?  If so, it might be a
> good idea to find out what set it and why.

This would be my guess too, some distros do seem to do that. The
notification bits scale horribly, nobody should use it for anything high
performance...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
  2023-08-08 21:41   ` Jens Axboe
@ 2023-08-09 16:33     ` Pierre Labat
  2023-08-09 17:14       ` Jeff Moyer
  0 siblings, 1 reply; 8+ messages in thread
From: Pierre Labat @ 2023-08-09 16:33 UTC (permalink / raw)
  To: Jens Axboe, Jeff Moyer; +Cc: '[email protected]'

Micron Confidential

Hi Jeff and Jens,

About "FAN_MODIFY fsnotify watch set on /dev".

Was using Fedora34 distro (with 6.3.9 kernel), and fio. Without any particular/specific setting.
I tried to see what could watch /dev but failed at that.
I used the inotify-info tool, but that display watchers using the inotify interface. And nothing was watching /dev via inotify.
Need to figure out how to do the same but for the fanotify interface.
I'll look at it again and let you know.

Regards,

Pierre



Micron Confidential
> -----Original Message-----
> From: Jens Axboe <[email protected]>
> Sent: Tuesday, August 8, 2023 2:41 PM
> To: Jeff Moyer <[email protected]>; Pierre Labat <[email protected]>
> Cc: '[email protected]' <[email protected]>
> Subject: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
>
> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless you
> recognize the sender and were expecting this message.
>
>
> On 8/7/23 2:11?PM, Jeff Moyer wrote:
> > Hi, Pierre,
> >
> > Pierre Labat <[email protected]> writes:
> >
> >> Hi,
> >>
> >> This is FYI, may be you already knows about that, but in case you
> don't....
> >>
> >> I was pushing the limit of the number of nvme read IOPS, the FIO +
> >> the Linux OS can handle. For that, I have something special under the
> >> Linux nvme driver. As a consequence I am not limited by whatever the
> >> NVME SSD max IOPS or IO latency would be.
> >>
> >> As I cranked the number of system cores and FIO jobs doing direct 4k
> >> random read on /dev/nvme0n1, I hit a wall. The IOPS scaling slows
> >> (less than linear) and around 15 FIO jobs on 15 core threads, the
> >> overall IOPS, in fact, goes down as I add more FIO jobs. For example
> >> on a system with 24 cores/48 threads, when I goes beyond 15 FIO jobs,
> >> the overall IOPS starts to go down.
> >>
> >> This happens the same for io_uring and aio. Was using kernel version
> 6.3.9. Using one namespace (/dev/nvme0n1).
> >
> > [snip]
> >
> >> As you can see 76% of the cpu on the box is sucked up by
> >> lockref_get_not_zero() and lockref_put_return().  Looking at the
> >> code, there is contention when IO_uring call fsnotify_access().
> >
> > Is there a FAN_MODIFY fsnotify watch set on /dev?  If so, it might be
> > a good idea to find out what set it and why.
>
> This would be my guess too, some distros do seem to do that. The
> notification bits scale horribly, nobody should use it for anything high
> performance...
>
> --
> Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
  2023-08-09 16:33     ` [EXT] " Pierre Labat
@ 2023-08-09 17:14       ` Jeff Moyer
  2023-08-14 16:30         ` Pierre Labat
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Moyer @ 2023-08-09 17:14 UTC (permalink / raw)
  To: Pierre Labat; +Cc: Jens Axboe, '[email protected]'

Pierre Labat <[email protected]> writes:

> Micron Confidential
>
> Hi Jeff and Jens,
>
> About "FAN_MODIFY fsnotify watch set on /dev".
>
> Was using Fedora34 distro (with 6.3.9 kernel), and fio. Without any particular/specific setting.
> I tried to see what could watch /dev but failed at that.
> I used the inotify-info tool, but that display watchers using the
> inotify interface. And nothing was watching /dev via inotify.
> Need to figure out how to do the same but for the fanotify interface.
> I'll look at it again and let you know.

You wouldn't happen to be running pipewire, would you?

https://gitlab.freedesktop.org/pipewire/pipewire/-/commit/88f0dbd6fcd0a412fc4bece22afdc3ba0151e4cf

-Jeff

>
> Regards,
>
> Pierre
>
>
>
> Micron Confidential
>> -----Original Message-----
>> From: Jens Axboe <[email protected]>
>> Sent: Tuesday, August 8, 2023 2:41 PM
>> To: Jeff Moyer <[email protected]>; Pierre Labat <[email protected]>
>> Cc: '[email protected]' <[email protected]>
>> Subject: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
>>
>> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless you
>> recognize the sender and were expecting this message.
>>
>>
>> On 8/7/23 2:11?PM, Jeff Moyer wrote:
>> > Hi, Pierre,
>> >
>> > Pierre Labat <[email protected]> writes:
>> >
>> >> Hi,
>> >>
>> >> This is FYI, may be you already knows about that, but in case you
>> don't....
>> >>
>> >> I was pushing the limit of the number of nvme read IOPS, the FIO +
>> >> the Linux OS can handle. For that, I have something special under the
>> >> Linux nvme driver. As a consequence I am not limited by whatever the
>> >> NVME SSD max IOPS or IO latency would be.
>> >>
>> >> As I cranked the number of system cores and FIO jobs doing direct 4k
>> >> random read on /dev/nvme0n1, I hit a wall. The IOPS scaling slows
>> >> (less than linear) and around 15 FIO jobs on 15 core threads, the
>> >> overall IOPS, in fact, goes down as I add more FIO jobs. For example
>> >> on a system with 24 cores/48 threads, when I goes beyond 15 FIO jobs,
>> >> the overall IOPS starts to go down.
>> >>
>> >> This happens the same for io_uring and aio. Was using kernel version
>> 6.3.9. Using one namespace (/dev/nvme0n1).
>> >
>> > [snip]
>> >
>> >> As you can see 76% of the cpu on the box is sucked up by
>> >> lockref_get_not_zero() and lockref_put_return().  Looking at the
>> >> code, there is contention when IO_uring call fsnotify_access().
>> >
>> > Is there a FAN_MODIFY fsnotify watch set on /dev?  If so, it might be
>> > a good idea to find out what set it and why.
>>
>> This would be my guess too, some distros do seem to do that. The
>> notification bits scale horribly, nobody should use it for anything high
>> performance...
>>
>> --
>> Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
  2023-08-09 17:14       ` Jeff Moyer
@ 2023-08-14 16:30         ` Pierre Labat
  2023-08-29 21:54           ` Pierre Labat
  0 siblings, 1 reply; 8+ messages in thread
From: Pierre Labat @ 2023-08-14 16:30 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: Jens Axboe, '[email protected]'

Hi Jeff,

Indeed, by default, in my configuration, pipewire is running.
When I can re-test, I'll disabled it and see if that remove the problem.
Thanks for the hint!

Pierre

> -----Original Message-----
> From: Jeff Moyer <[email protected]>
> Sent: Wednesday, August 9, 2023 10:15 AM
> To: Pierre Labat <[email protected]>
> Cc: Jens Axboe <[email protected]>; '[email protected]' <io-
> [email protected]>
> Subject: Re: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
> 
> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless you
> recognize the sender and were expecting this message.
> 
> 
> Pierre Labat <[email protected]> writes:
> 
> > Micron Confidential
> >
> > Hi Jeff and Jens,
> >
> > About "FAN_MODIFY fsnotify watch set on /dev".
> >
> > Was using Fedora34 distro (with 6.3.9 kernel), and fio. Without any
> particular/specific setting.
> > I tried to see what could watch /dev but failed at that.
> > I used the inotify-info tool, but that display watchers using the
> > inotify interface. And nothing was watching /dev via inotify.
> > Need to figure out how to do the same but for the fanotify interface.
> > I'll look at it again and let you know.
> 
> You wouldn't happen to be running pipewire, would you?
> 
> https://urldefense.com/v3/__https://gitlab.freedesktop.org/pipewire/pipewir
> e/-
> /commit/88f0dbd6fcd0a412fc4bece22afdc3ba0151e4cf__;!!KZTdOCjhgt4hgw!6E063jj
> -_XK1NceWzms7DaYacILy4cKmeNVA3xalNwkd0zrYTX-IouUnvJ8bZs-RG3YSdk5XpFoo$
> 
> -Jeff
> 
> >
> > Regards,
> >
> > Pierre
> >
> >
> >
> > Micron Confidential
> >> -----Original Message-----
> >> From: Jens Axboe <[email protected]>
> >> Sent: Tuesday, August 8, 2023 2:41 PM
> >> To: Jeff Moyer <[email protected]>; Pierre Labat <[email protected]>
> >> Cc: '[email protected]' <[email protected]>
> >> Subject: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
> >>
> >> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments
> >> unless you recognize the sender and were expecting this message.
> >>
> >>
> >> On 8/7/23 2:11?PM, Jeff Moyer wrote:
> >> > Hi, Pierre,
> >> >
> >> > Pierre Labat <[email protected]> writes:
> >> >
> >> >> Hi,
> >> >>
> >> >> This is FYI, may be you already knows about that, but in case you
> >> don't....
> >> >>
> >> >> I was pushing the limit of the number of nvme read IOPS, the FIO +
> >> >> the Linux OS can handle. For that, I have something special under
> >> >> the Linux nvme driver. As a consequence I am not limited by
> >> >> whatever the NVME SSD max IOPS or IO latency would be.
> >> >>
> >> >> As I cranked the number of system cores and FIO jobs doing direct
> >> >> 4k random read on /dev/nvme0n1, I hit a wall. The IOPS scaling
> >> >> slows (less than linear) and around 15 FIO jobs on 15 core
> >> >> threads, the overall IOPS, in fact, goes down as I add more FIO
> >> >> jobs. For example on a system with 24 cores/48 threads, when I
> >> >> goes beyond 15 FIO jobs, the overall IOPS starts to go down.
> >> >>
> >> >> This happens the same for io_uring and aio. Was using kernel
> >> >> version
> >> 6.3.9. Using one namespace (/dev/nvme0n1).
> >> >
> >> > [snip]
> >> >
> >> >> As you can see 76% of the cpu on the box is sucked up by
> >> >> lockref_get_not_zero() and lockref_put_return().  Looking at the
> >> >> code, there is contention when IO_uring call fsnotify_access().
> >> >
> >> > Is there a FAN_MODIFY fsnotify watch set on /dev?  If so, it might
> >> > be a good idea to find out what set it and why.
> >>
> >> This would be my guess too, some distros do seem to do that. The
> >> notification bits scale horribly, nobody should use it for anything
> >> high performance...
> >>
> >> --
> >> Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
  2023-08-14 16:30         ` Pierre Labat
@ 2023-08-29 21:54           ` Pierre Labat
  2023-09-14 19:11             ` Jeff Moyer
  0 siblings, 1 reply; 8+ messages in thread
From: Pierre Labat @ 2023-08-29 21:54 UTC (permalink / raw)
  To: 'Jeff Moyer'
  Cc: 'Jens Axboe', '[email protected]'

Hi,

Had some time to re-do some testing.

1) Pipewire (its wireplumber deamon) set a watch on the children of the directory /dev via inotify.
I removed that (disabled pipewire), but still had the fsnotify overhead when using aio/io_ring at high IOPS across several threads on several cores.

2) I then noticed that udev set a watch (via inotify) on the files in /dev.
This is due to a rule in /usr/lib/udev/rules.d/60-block.rules
# watch metadata changes, caused by tools closing the device node which was opened for writing
ACTION!="remove", SUBSYSTEM=="block", \
  KERNEL=="loop*|mmcblk*[0-9]|msblk*[0-9]|mspblk*[0-9]|nvme*|sd*|vd*|xvd*|bcache*|cciss*|dasd*|ubd*|ubi*|scm*|pmem*|nbd*|zd*", \
  OPTIONS+="watch"
I removed "nvme*" from this rule (I am testing on /dev/nvme0n1), then finally the fsnotify overhead disappeared.

3) I think there is nothing wrong with Pipewire and udev, they simply want to watch what is going on in /dev.
I don't think they are interested in (and it is not the goal/charter of fsnotify) quantifying millions of read/write accesses/sec to a file they watch. There are other tools for that, that are optimized for that task.

I think to avoid the overhead, the fsnotify subsystem should be refined to factor high frequency read/write file access.
Or piece of code (like aio/io_uring) doing high frequency fsnotify should do the factoring themselves.
Or the user should be given a way to turn off fsnotify calls for read/write on specific file.


Now, the only way to work around the cpu overhead without hacking, is to disable services watching /dev.
That means people can't use these services anymore. Doesn't seem right.

Regards,

Pierre


> -----Original Message-----
> From: Pierre Labat
> Sent: Monday, August 14, 2023 9:31 AM
> To: Jeff Moyer <[email protected]>
> Cc: Jens Axboe <[email protected]>; '[email protected]' <io-
> [email protected]>
> Subject: RE: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
> 
> Hi Jeff,
> 
> Indeed, by default, in my configuration, pipewire is running.
> When I can re-test, I'll disabled it and see if that remove the problem.
> Thanks for the hint!
> 
> Pierre
> 
> > -----Original Message-----
> > From: Jeff Moyer <[email protected]>
> > Sent: Wednesday, August 9, 2023 10:15 AM
> > To: Pierre Labat <[email protected]>
> > Cc: Jens Axboe <[email protected]>; '[email protected]' <io-
> > [email protected]>
> > Subject: Re: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
> >
> > CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless
> > you recognize the sender and were expecting this message.
> >
> >
> > Pierre Labat <[email protected]> writes:
> >
> > > Micron Confidential
> > >
> > > Hi Jeff and Jens,
> > >
> > > About "FAN_MODIFY fsnotify watch set on /dev".
> > >
> > > Was using Fedora34 distro (with 6.3.9 kernel), and fio. Without any
> > particular/specific setting.
> > > I tried to see what could watch /dev but failed at that.
> > > I used the inotify-info tool, but that display watchers using the
> > > inotify interface. And nothing was watching /dev via inotify.
> > > Need to figure out how to do the same but for the fanotify interface.
> > > I'll look at it again and let you know.
> >
> > You wouldn't happen to be running pipewire, would you?
> >
> > https://urldefense.com/v3/__https://gitlab.freedesktop.org/pipewire/pi
> > pewir
> > e/-
> > /commit/88f0dbd6fcd0a412fc4bece22afdc3ba0151e4cf__;!!KZTdOCjhgt4hgw!6E
> > 063jj
> > -_XK1NceWzms7DaYacILy4cKmeNVA3xalNwkd0zrYTX-IouUnvJ8bZs-RG3YSdk5XpFoo$
> >
> > -Jeff
> >
> > >
> > > Regards,
> > >
> > > Pierre
> > >
> > >
> > >
> > > Micron Confidential
> > >> -----Original Message-----
> > >> From: Jens Axboe <[email protected]>
> > >> Sent: Tuesday, August 8, 2023 2:41 PM
> > >> To: Jeff Moyer <[email protected]>; Pierre Labat
> > >> <[email protected]>
> > >> Cc: '[email protected]' <[email protected]>
> > >> Subject: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
> > >>
> > >> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments
> > >> unless you recognize the sender and were expecting this message.
> > >>
> > >>
> > >> On 8/7/23 2:11?PM, Jeff Moyer wrote:
> > >> > Hi, Pierre,
> > >> >
> > >> > Pierre Labat <[email protected]> writes:
> > >> >
> > >> >> Hi,
> > >> >>
> > >> >> This is FYI, may be you already knows about that, but in case
> > >> >> you
> > >> don't....
> > >> >>
> > >> >> I was pushing the limit of the number of nvme read IOPS, the FIO
> > >> >> + the Linux OS can handle. For that, I have something special
> > >> >> under the Linux nvme driver. As a consequence I am not limited
> > >> >> by whatever the NVME SSD max IOPS or IO latency would be.
> > >> >>
> > >> >> As I cranked the number of system cores and FIO jobs doing
> > >> >> direct 4k random read on /dev/nvme0n1, I hit a wall. The IOPS
> > >> >> scaling slows (less than linear) and around 15 FIO jobs on 15
> > >> >> core threads, the overall IOPS, in fact, goes down as I add more
> > >> >> FIO jobs. For example on a system with 24 cores/48 threads, when
> > >> >> I goes beyond 15 FIO jobs, the overall IOPS starts to go down.
> > >> >>
> > >> >> This happens the same for io_uring and aio. Was using kernel
> > >> >> version
> > >> 6.3.9. Using one namespace (/dev/nvme0n1).
> > >> >
> > >> > [snip]
> > >> >
> > >> >> As you can see 76% of the cpu on the box is sucked up by
> > >> >> lockref_get_not_zero() and lockref_put_return().  Looking at the
> > >> >> code, there is contention when IO_uring call fsnotify_access().
> > >> >
> > >> > Is there a FAN_MODIFY fsnotify watch set on /dev?  If so, it
> > >> > might be a good idea to find out what set it and why.
> > >>
> > >> This would be my guess too, some distros do seem to do that. The
> > >> notification bits scale horribly, nobody should use it for anything
> > >> high performance...
> > >>
> > >> --
> > >> Jens Axboe


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
  2023-08-29 21:54           ` Pierre Labat
@ 2023-09-14 19:11             ` Jeff Moyer
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff Moyer @ 2023-09-14 19:11 UTC (permalink / raw)
  To: Pierre Labat; +Cc: 'Jens Axboe', '[email protected]'

[-- Attachment #1: Type: text/plain, Size: 7386 bytes --]

Pierre Labat <[email protected]> writes:

> Hi,
>
> Had some time to re-do some testing.
>
> 1) Pipewire (its wireplumber deamon) set a watch on the children of the directory /dev via inotify.
> I removed that (disabled pipewire), but still had the fsnotify
> overhead when using aio/io_ring at high IOPS across several threads on
> several cores.
>
> 2) I then noticed that udev set a watch (via inotify) on the files in /dev.
> This is due to a rule in /usr/lib/udev/rules.d/60-block.rules
> # watch metadata changes, caused by tools closing the device node which was opened for writing
> ACTION!="remove", SUBSYSTEM=="block", \
>   KERNEL=="loop*|mmcblk*[0-9]|msblk*[0-9]|mspblk*[0-9]|nvme*|sd*|vd*|xvd*|bcache*|cciss*|dasd*|ubd*|ubi*|scm*|pmem*|nbd*|zd*",
> \
>   OPTIONS+="watch"
> I removed "nvme*" from this rule (I am testing on /dev/nvme0n1), then finally the fsnotify overhead disappeared.

Interesting.  I don't see that behavior.  I setup a null block device
with the following parameters:

modprobe null_blk submit_queues=96 queue_mode=2 gb=350 bs=4096 completion_nsec=0 hw_queue_depth=1024

And I ran the following fio job:

---
[global]
ioengine=io_uring
iodepth=8
direct=1
rw=read
filename=/dev/nullb0
cpus_allowed=0-95
cpus_allowed_policy=split
size=1g
offset_increment=10g

[32thread]
numjobs=32
---

If there are no watches on /dev or /dev/nullb0, then I see 70-79GiB/s
throughput from this job.  If I add a watch on /dev/nullb0, there
appears to be a small performance hit, but it is within the run-to-run
variation.  If I instead add a watch to /dev, the throughput drops to
~10GiB/s.  So, I think this matches your initial report (the perf top
output closely matched yours).

Can you run the attached script to verify that nothing is watching /dev
when you have udev configured to watch the nvme device, and report back?
Run it with the path as the argument, so "inotify-watchers.sh /dev".
If there is no watch on /dev, and you still see a performance problem,
then we'll need to start investigating that.  A good starting point
would be details of how you are testing, along with new perf top output.

-Jeff

> 3) I think there is nothing wrong with Pipewire and udev, they simply want to watch what is going on in /dev.
> I don't think they are interested in (and it is not the goal/charter
> of fsnotify) quantifying millions of read/write accesses/sec to a file
> they watch. There are other tools for that, that are optimized for
> that task.
>
> I think to avoid the overhead, the fsnotify subsystem should be
> refined to factor high frequency read/write file access.
> Or piece of code (like aio/io_uring) doing high frequency fsnotify should do the factoring themselves.
> Or the user should be given a way to turn off fsnotify calls for read/write on specific file.
> Now, the only way to work around the cpu overhead without hacking, is
> to disable services watching /dev.  That means people can't use these
> services anymore. Doesn't seem right.
>
> Regards,
>
> Pierre
>
>
>> -----Original Message-----
>> From: Pierre Labat
>> Sent: Monday, August 14, 2023 9:31 AM
>> To: Jeff Moyer <[email protected]>
>> Cc: Jens Axboe <[email protected]>; '[email protected]' <io-
>> [email protected]>
>> Subject: RE: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
>> 
>> Hi Jeff,
>> 
>> Indeed, by default, in my configuration, pipewire is running.
>> When I can re-test, I'll disabled it and see if that remove the problem.
>> Thanks for the hint!
>> 
>> Pierre
>> 
>> > -----Original Message-----
>> > From: Jeff Moyer <[email protected]>
>> > Sent: Wednesday, August 9, 2023 10:15 AM
>> > To: Pierre Labat <[email protected]>
>> > Cc: Jens Axboe <[email protected]>; '[email protected]' <io-
>> > [email protected]>
>> > Subject: Re: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
>> >
>> > CAUTION: EXTERNAL EMAIL. Do not click links or open attachments unless
>> > you recognize the sender and were expecting this message.
>> >
>> >
>> > Pierre Labat <[email protected]> writes:
>> >
>> > > Micron Confidential
>> > >
>> > > Hi Jeff and Jens,
>> > >
>> > > About "FAN_MODIFY fsnotify watch set on /dev".
>> > >
>> > > Was using Fedora34 distro (with 6.3.9 kernel), and fio. Without any
>> > particular/specific setting.
>> > > I tried to see what could watch /dev but failed at that.
>> > > I used the inotify-info tool, but that display watchers using the
>> > > inotify interface. And nothing was watching /dev via inotify.
>> > > Need to figure out how to do the same but for the fanotify interface.
>> > > I'll look at it again and let you know.
>> >
>> > You wouldn't happen to be running pipewire, would you?
>> >
>> > https://urldefense.com/v3/__https://gitlab.freedesktop.org/pipewire/pi
>> > pewir
>> > e/-
>> > /commit/88f0dbd6fcd0a412fc4bece22afdc3ba0151e4cf__;!!KZTdOCjhgt4hgw!6E
>> > 063jj
>> > -_XK1NceWzms7DaYacILy4cKmeNVA3xalNwkd0zrYTX-IouUnvJ8bZs-RG3YSdk5XpFoo$
>> >
>> > -Jeff
>> >
>> > >
>> > > Regards,
>> > >
>> > > Pierre
>> > >
>> > >
>> > >
>> > > Micron Confidential
>> > >> -----Original Message-----
>> > >> From: Jens Axboe <[email protected]>
>> > >> Sent: Tuesday, August 8, 2023 2:41 PM
>> > >> To: Jeff Moyer <[email protected]>; Pierre Labat
>> > >> <[email protected]>
>> > >> Cc: '[email protected]' <[email protected]>
>> > >> Subject: [EXT] Re: FYI, fsnotify contention with aio and io_uring.
>> > >>
>> > >> CAUTION: EXTERNAL EMAIL. Do not click links or open attachments
>> > >> unless you recognize the sender and were expecting this message.
>> > >>
>> > >>
>> > >> On 8/7/23 2:11?PM, Jeff Moyer wrote:
>> > >> > Hi, Pierre,
>> > >> >
>> > >> > Pierre Labat <[email protected]> writes:
>> > >> >
>> > >> >> Hi,
>> > >> >>
>> > >> >> This is FYI, may be you already knows about that, but in case
>> > >> >> you
>> > >> don't....
>> > >> >>
>> > >> >> I was pushing the limit of the number of nvme read IOPS, the FIO
>> > >> >> + the Linux OS can handle. For that, I have something special
>> > >> >> under the Linux nvme driver. As a consequence I am not limited
>> > >> >> by whatever the NVME SSD max IOPS or IO latency would be.
>> > >> >>
>> > >> >> As I cranked the number of system cores and FIO jobs doing
>> > >> >> direct 4k random read on /dev/nvme0n1, I hit a wall. The IOPS
>> > >> >> scaling slows (less than linear) and around 15 FIO jobs on 15
>> > >> >> core threads, the overall IOPS, in fact, goes down as I add more
>> > >> >> FIO jobs. For example on a system with 24 cores/48 threads, when
>> > >> >> I goes beyond 15 FIO jobs, the overall IOPS starts to go down.
>> > >> >>
>> > >> >> This happens the same for io_uring and aio. Was using kernel
>> > >> >> version
>> > >> 6.3.9. Using one namespace (/dev/nvme0n1).
>> > >> >
>> > >> > [snip]
>> > >> >
>> > >> >> As you can see 76% of the cpu on the box is sucked up by
>> > >> >> lockref_get_not_zero() and lockref_put_return().  Looking at the
>> > >> >> code, there is contention when IO_uring call fsnotify_access().
>> > >> >
>> > >> > Is there a FAN_MODIFY fsnotify watch set on /dev?  If so, it
>> > >> > might be a good idea to find out what set it and why.
>> > >>
>> > >> This would be my guess too, some distros do seem to do that. The
>> > >> notification bits scale horribly, nobody should use it for anything
>> > >> high performance...
>> > >>
>> > >> --
>> > >> Jens Axboe


[-- Attachment #2: inotify-watchers.sh --]
[-- Type: application/x-sh, Size: 795 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-09-14 19:09 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-04 17:47 FYI, fsnotify contention with aio and io_uring Pierre Labat
2023-08-07 20:11 ` Jeff Moyer
2023-08-08 21:41   ` Jens Axboe
2023-08-09 16:33     ` [EXT] " Pierre Labat
2023-08-09 17:14       ` Jeff Moyer
2023-08-14 16:30         ` Pierre Labat
2023-08-29 21:54           ` Pierre Labat
2023-09-14 19:11             ` Jeff Moyer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox