public inbox for [email protected]
* Possible io_uring related race leads to btrfs data csum mismatch
@ 2023-08-16  6:52 Qu Wenruo
  2023-08-16 14:33 ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2023-08-16  6:52 UTC (permalink / raw)
  To: [email protected], Linux FS Devel, io-uring

Hi,

Recently I've been digging into a very rare failure during btrfs/06[234567],
where btrfs scrub detects unrepairable data corruption.

After days of digging, I have a much smaller reproducer:

```
fail()
{
        echo "!!! FAILED !!!"
        exit 1
}

workload()
{
        mkfs.btrfs -f -m single -d single --csum sha256 $dev1
        mount $dev1 $mnt
        # There are around 10 more combinations with different
        # seed and -p/-n parameters, but this is the smallest one
        # I found so far.
        $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt
        umount $mnt
        btrfs check --check-data-csum $dev1 || fail
}
runtime=1024
for (( i = 0; i < $runtime; i++ )); do
        echo "=== $i / $runtime ==="
        workload
done
```

At least here, with a 6-core VM (the host has 8C/16T) and fast enough
storage (PCIe 4.0 NVMe, with unsafe cache mode), there is roughly a 1/100
chance of hitting the error.

Checking the fsstress verbose log against the failed file, the offending
write turns out to be an io_uring write.

And with uring_write disabled in fsstress, I can no longer reproduce the
csum mismatch, even with much larger -n and -p parameters.

However, I didn't see any io_uring-related callback inside the btrfs code,
so any advice on the io_uring side would be appreciated.

Thanks,
Qu


* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-16  6:52 Possible io_uring related race leads to btrfs data csum mismatch Qu Wenruo
@ 2023-08-16 14:33 ` Jens Axboe
  2023-08-16 14:49   ` Jens Axboe
  2023-08-16 21:46   ` Qu Wenruo
  0 siblings, 2 replies; 22+ messages in thread
From: Jens Axboe @ 2023-08-16 14:33 UTC (permalink / raw)
  To: Qu Wenruo, [email protected], Linux FS Devel, io-uring

On 8/16/23 12:52 AM, Qu Wenruo wrote:
> Hi,
> 
> Recently I'm digging into a very rare failure during btrfs/06[234567],
> where btrfs scrub detects unrepairable data corruption.
> 
> After days of digging, I have a much smaller reproducer:
> 
> ```
> fail()
> {
>         echo "!!! FAILED !!!"
>         exit 1
> }
> 
> workload()
> {
>         mkfs.btrfs -f -m single -d single --csum sha256 $dev1
>         mount $dev1 $mnt
>     # There are around 10 more combinations with different
>         # seed and -p/-n parameters, but this is the smallest one
>     # I found so far.
>     $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt
>     umount $mnt
>     btrfs check --check-data-csum $dev1 || fail
> }
> runtime=1024
> for (( i = 0; i < $runtime; i++ )); do
>         echo "=== $i / $runtime ==="
>         workload
> done
> ```

Tried to reproduce this, both in a VM and on a real host, and no luck so
far. I've got a few follow-up questions, as your report is missing some
important info:

1) What kernel are you running?
2) What's the .config you are using?

> At least here, with a VM with 6 cores (host has 8C/16T), fast enough
> storage (PCIE4.0 NVME, with unsafe cache mode), it has the chance around
> 1/100 to hit the error.

What does "unsafe cche mode" mean? Is that write back caching enabled?
Write back caching with volatile write cache? For your device, can you
do:

$ grep . /sys/block/$dev/queue/*

> Checking the fsstress verbose log against the failed file, it turns out
> to be an io_uring write.

Any more details on what the write looks like?

> And with uring_write disabled in fsstress, I have no longer reproduced
> the csum mismatch, even with much larger -n and -p parameters.

Is it more likely to reproduce with larger -n/-p in general?

> However I didn't see any io_uring related callback inside btrfs code,
> any advice on the io_uring part would be appreciated.

io_uring doesn't do anything special here, it uses the normal page cache
read/write parts for buffered IO. But you may get extra parallelism
with io_uring here. For example, with the buffered write that this most
likely is, libaio would be exactly the same as a pwrite(2) on the file.
If this would've blocked, io_uring would offload this to a helper
thread. Depending on the workload, you could have multiple of those in
progress at the same time.
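
For illustration, a buffered io_uring write boils down to something like the
sketch below (liburing assumed; the file name and queue depth are arbitrary).
The request goes through the filesystem's regular ->write_iter() path, and
io-wq only gets involved if the inline issue would block:

```c
/* Minimal sketch of a buffered io_uring write, roughly what the
 * uring_write op in fsstress ends up doing.
 * Build with: gcc uring_bufwrite.c -o uring_bufwrite -luring */
#include <liburing.h>
#include <fcntl.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd, ret;

	memset(buf, 0xaa, sizeof(buf));

	/* No O_DIRECT, so this is a plain buffered write through the page cache. */
	fd = open("testfile", O_RDWR | O_CREAT, 0644);
	if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_writev(sqe, fd, &iov, 1, 0);
	io_uring_submit(&ring);

	/* The completion carries the same result writev(2) would have returned. */
	ret = io_uring_wait_cqe(&ring, &cqe);
	if (!ret) {
		printf("write res=%d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}
```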

-- 
Jens Axboe



* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-16 14:33 ` Jens Axboe
@ 2023-08-16 14:49   ` Jens Axboe
  2023-08-16 21:46   ` Qu Wenruo
  1 sibling, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2023-08-16 14:49 UTC (permalink / raw)
  To: Qu Wenruo, [email protected], Linux FS Devel, io-uring

On 8/16/23 8:33 AM, Jens Axboe wrote:
>> However I didn't see any io_uring related callback inside btrfs code,
>> any advice on the io_uring part would be appreciated.
> 
> io_uring doesn't do anything special here, it uses the normal page cache
> read/write parts for buffered IO. But you may get extra parallellism
> with io_uring here. For example, with the buffered write that this most
> likely is, libaio would be exactly the same as a pwrite(2) on the file.
> If this would've blocked, io_uring would offload this to a helper
> thread. Depending on the workload, you could have multiple of those in
> progress at the same time.

I poked a bit at fsstress, and it's a bit odd imho. For example, any aio
read/write seems to hardcode O_DIRECT. The io_uring side will be
buffered. Not sure why there are those differences and why buffered/dio
isn't a variable. But this does mean that these are certainly buffered
writes with io_uring.

Are any of the writes overlapping? You could have a situation where
writeA and writeB overlap, and writeA will get punted to io-wq for
execution and writeB will complete inline. In other words, writeA is
issued, writeB is issued. writeA goes to io-wq, writeB now completes
inline, and now writeA is done and completed. It may be exposing issues
in btrfs. You can try the below patch, which should serialize all the
writes to a given file. If this makes a difference for you, then I'd
strongly suspect that the issue is deeper than the delivery mechanism of
the write.

diff --git a/ltp/fsstress.c b/ltp/fsstress.c
index 6641a525fe5d..034cbba27c6e 100644
--- a/ltp/fsstress.c
+++ b/ltp/fsstress.c
@@ -2317,6 +2317,7 @@ do_uring_rw(opnum_t opno, long r, int flags)
 		off %= maxfsize;
 		memset(buf, nameseq & 0xff, len);
 		io_uring_prep_writev(sqe, fd, &iovec, 1, off);
+		sqe->flags |= IOSQE_ASYNC;
 	} else {
 		off = (off64_t)(lr % stb.st_size);
 		io_uring_prep_readv(sqe, fd, &iovec, 1, off);

-- 
Jens Axboe



* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-16 14:33 ` Jens Axboe
  2023-08-16 14:49   ` Jens Axboe
@ 2023-08-16 21:46   ` Qu Wenruo
  2023-08-16 22:28     ` Jens Axboe
  2023-08-16 22:36     ` Jens Axboe
  1 sibling, 2 replies; 22+ messages in thread
From: Qu Wenruo @ 2023-08-16 21:46 UTC (permalink / raw)
  To: Jens Axboe, [email protected], Linux FS Devel, io-uring



On 2023/8/16 22:33, Jens Axboe wrote:
> On 8/16/23 12:52 AM, Qu Wenruo wrote:
>> Hi,
>>
>> Recently I'm digging into a very rare failure during btrfs/06[234567],
>> where btrfs scrub detects unrepairable data corruption.
>>
>> After days of digging, I have a much smaller reproducer:
>>
>> ```
>> fail()
>> {
>>          echo "!!! FAILED !!!"
>>          exit 1
>> }
>>
>> workload()
>> {
>>          mkfs.btrfs -f -m single -d single --csum sha256 $dev1
>>          mount $dev1 $mnt
>>      # There are around 10 more combinations with different
>>          # seed and -p/-n parameters, but this is the smallest one
>>      # I found so far.
>>      $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt
>>      umount $mnt
>>      btrfs check --check-data-csum $dev1 || fail
>> }
>> runtime=1024
>> for (( i = 0; i < $runtime; i++ )); do
>>          echo "=== $i / $runtime ==="
>>          workload
>> done
>> ```
>
> Tried to reproduce this, both on a vm and on a real host, and no luck so
> far. I've got a few followup questions as your report is missing some
> important info:

You may want to try much higher -p/-n numbers.

For verification purposes, I normally go with -p 10 -n 10000, which has a
much higher chance of hitting the problem, but is definitely too noisy for
debugging.

I just tried a run with "$fsstress -p 10 -n 10000 -w -d $mnt" as the
workload; it failed at 21/1024.

>
> 1) What kernel are you running?

David's misc-next branch, i.e. the latest upstream tag plus some btrfs
patches for the next merge window.

Although I have had some internal reports showing this problem from quite
some time ago.

> 2) What's the .config you are using?

Pretty common config, no heavy debug options (KASAN etc).

>
>> At least here, with a VM with 6 cores (host has 8C/16T), fast enough
>> storage (PCIE4.0 NVME, with unsafe cache mode), it has the chance around
>> 1/100 to hit the error.
>
> What does "unsafe cche mode" mean?

The libvirt cache option "unsafe", which mostly ignores flush/FUA commands
and fully relies on the host filesystem's cache (in my case the image is
file backed).

> Is that write back caching enabled?
> Write back caching with volatile write cache? For your device, can you
> do:
>
> $ grep . /sys/block/$dev/queue/*
>
>> Checking the fsstress verbose log against the failed file, it turns out
>> to be an io_uring write.
>
> Any more details on what the write looks like?

For the involved file, the log shows the following operations with the
minimal reproducer seed/-p/-n combination:

```
0/24: link d0/f2 d0/f3 0
0/29: fallocate(INSERT_RANGE) d0/f3 [276 2 0 0 176 481971]t 884736 585728 95
0/30: uring_write d0/f3[276 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
0/31: writev d0/f3[276 2 0 0 296 1457078] [709121,8,964] 0
0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 320
1457078] return 25, fallback to stat()
0/34: dwrite d0/f3[276 2 308134 1763236 320 1457078] [589824,16384] 0
0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 496
1457078] return 25, fallback to stat()
0/38: dwrite d0/f3[276 2 308134 1763236 496 1457078] [2084864,36864] 0
0/40: fallocate(ZERO_RANGE) d0/f3 [276 2 308134 1763236 688 2809139]t
3512660 81075 0
0/43: splice d0/f5[289 1 0 0 1872 2678784] [552619,59420] -> d0/f3[276 2
308134 1763236 856 3593735] [5603798,59420] 0
0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [276 1 308134 1763236 976
5663218]t 1361821 480392 0
0/49: clonerange d0/f3[276 1 308134 1763236 856 5663218] [2461696,53248]
-> d0/f5[289 1 0 0 1872 2678784] [942080,53248]
```

>
>> And with uring_write disabled in fsstress, I have no longer reproduced
>> the csum mismatch, even with much larger -n and -p parameters.
>
> Is it more likely to reproduce with larger -n/-p in general?

Yes, but I use that specific combination as the minimal reproducer for
debug purposes.

>
>> However I didn't see any io_uring related callback inside btrfs code,
>> any advice on the io_uring part would be appreciated.
>
> io_uring doesn't do anything special here, it uses the normal page cache
> read/write parts for buffered IO. But you may get extra parallellism
> with io_uring here. For example, with the buffered write that this most
> likely is, libaio would be exactly the same as a pwrite(2) on the file.
> If this would've blocked, io_uring would offload this to a helper
> thread. Depending on the workload, you could have multiple of those in
> progress at the same time.

My biggest concern is: would io_uring modify the page while it's still
under writeback?
In that case, it's going to cause a csum mismatch, as btrfs relies on the
pages under writeback staying unchanged.

Thanks,
Qu

>


* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-16 21:46   ` Qu Wenruo
@ 2023-08-16 22:28     ` Jens Axboe
  2023-08-17  1:05       ` Qu Wenruo
  2023-08-16 22:36     ` Jens Axboe
  1 sibling, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2023-08-16 22:28 UTC (permalink / raw)
  To: Qu Wenruo, [email protected], Linux FS Devel, io-uring

On 8/16/23 3:46 PM, Qu Wenruo wrote:
> 
> 
> On 2023/8/16 22:33, Jens Axboe wrote:
>> On 8/16/23 12:52 AM, Qu Wenruo wrote:
>>> Hi,
>>>
>>> Recently I'm digging into a very rare failure during btrfs/06[234567],
>>> where btrfs scrub detects unrepairable data corruption.
>>>
>>> After days of digging, I have a much smaller reproducer:
>>>
>>> ```
>>> fail()
>>> {
>>>          echo "!!! FAILED !!!"
>>>          exit 1
>>> }
>>>
>>> workload()
>>> {
>>>          mkfs.btrfs -f -m single -d single --csum sha256 $dev1
>>>          mount $dev1 $mnt
>>>      # There are around 10 more combinations with different
>>>          # seed and -p/-n parameters, but this is the smallest one
>>>      # I found so far.
>>>      $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt
>>>      umount $mnt
>>>      btrfs check --check-data-csum $dev1 || fail
>>> }
>>> runtime=1024
>>> for (( i = 0; i < $runtime; i++ )); do
>>>          echo "=== $i / $runtime ==="
>>>          workload
>>> done
>>> ```
>>
>> Tried to reproduce this, both on a vm and on a real host, and no luck so
>> far. I've got a few followup questions as your report is missing some
>> important info:
> 
> You may want to try much higher -p/-n numbers.
> 
> For verification purpose, I normally go with -p 10 -n 10000, which has a
> much higher chance to hit, but definitely too noisy for debug.
> 
> I just tried a run with "$fsstress -p 10 -n 10000 -w -d $mnt" as the
> workload, it failed at 21/1024.

OK I'll try that.

>> 1) What kernel are you running?
> 
> David's misc-next branch, aka, lastest upstream tags plus some btrfs
> patches for the next merge window.
> 
> Although I have some internal reports showing this problem quite some
> time ago.

That's what I was getting at, whether this is a new problem or not.

>> 2) What's the .config you are using?
> 
> Pretty common config, no heavy debug options (KASAN etc).

Please just send the .config, I'd rather not have to guess. Things like
preempt etc may make a difference in reproducing this.

>>> At least here, with a VM with 6 cores (host has 8C/16T), fast enough
>>> storage (PCIE4.0 NVME, with unsafe cache mode), it has the chance around
>>> 1/100 to hit the error.
>>
>> What does "unsafe cche mode" mean?
> 
> Libvirt cache option "unsafe"
> 
> Which is mostly ignoring flush/fua commands and fully rely on host fs
> (in my case it's file backed) cache.

Gotcha

>> Is that write back caching enabled?
>> Write back caching with volatile write cache? For your device, can you
>> do:
>>
>> $ grep . /sys/block/$dev/queue/*
>>
>>> Checking the fsstress verbose log against the failed file, it turns out
>>> to be an io_uring write.
>>
>> Any more details on what the write looks like?
> 
> For the involved file, it shows the following operations for the minimal
> reproducible seed/-p/-n combination:
> 
> ```
> 0/24: link d0/f2 d0/f3 0
> 0/29: fallocate(INSERT_RANGE) d0/f3 [276 2 0 0 176 481971]t 884736 585728 95
> 0/30: uring_write d0/f3[276 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
> 0/31: writev d0/f3[276 2 0 0 296 1457078] [709121,8,964] 0
> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 320
> 1457078] return 25, fallback to stat()
> 0/34: dwrite d0/f3[276 2 308134 1763236 320 1457078] [589824,16384] 0
> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 496
> 1457078] return 25, fallback to stat()
> 0/38: dwrite d0/f3[276 2 308134 1763236 496 1457078] [2084864,36864] 0
> 0/40: fallocate(ZERO_RANGE) d0/f3 [276 2 308134 1763236 688 2809139]t
> 3512660 81075 0
> 0/43: splice d0/f5[289 1 0 0 1872 2678784] [552619,59420] -> d0/f3[276 2
> 308134 1763236 856 3593735] [5603798,59420] 0
> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [276 1 308134 1763236 976
> 5663218]t 1361821 480392 0
> 0/49: clonerange d0/f3[276 1 308134 1763236 856 5663218] [2461696,53248]
> -> d0/f5[289 1 0 0 1872 2678784] [942080,53248]
> ```

And just to be sure, this is not mixing dio and buffered, right?

>>> However I didn't see any io_uring related callback inside btrfs code,
>>> any advice on the io_uring part would be appreciated.
>>
>> io_uring doesn't do anything special here, it uses the normal page cache
>> read/write parts for buffered IO. But you may get extra parallellism
>> with io_uring here. For example, with the buffered write that this most
>> likely is, libaio would be exactly the same as a pwrite(2) on the file.
>> If this would've blocked, io_uring would offload this to a helper
>> thread. Depending on the workload, you could have multiple of those in
>> progress at the same time.
> 
> My biggest concern is, would io_uring modify the page when it's still
> under writeback?

No, of course not. Like I mentioned, io_uring doesn't do anything that
the normal read/write path isn't already doing - it's using the same
->read_iter() and ->write_iter() that everything else is, there's no
page cache code in io_uring.

> In that case, it's going to cause csum mismatch as btrfs relies on the
> page under writeback to be unchanged.

Sure, I'm aware of the stable page requirements.

See my followup email as well on a patch to test as well.

-- 
Jens Axboe



* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-16 21:46   ` Qu Wenruo
  2023-08-16 22:28     ` Jens Axboe
@ 2023-08-16 22:36     ` Jens Axboe
  2023-08-17  0:40       ` Qu Wenruo
  1 sibling, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2023-08-16 22:36 UTC (permalink / raw)
  To: Qu Wenruo, [email protected], Linux FS Devel, io-uring

On 8/16/23 3:46 PM, Qu Wenruo wrote:
>> Is that write back caching enabled?
>> Write back caching with volatile write cache? For your device, can you
>> do:
>>
>> $ grep . /sys/block/$dev/queue/*

You didn't answer this one either.

-- 
Jens Axboe



* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-16 22:36     ` Jens Axboe
@ 2023-08-17  0:40       ` Qu Wenruo
  0 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2023-08-17  0:40 UTC (permalink / raw)
  To: Jens Axboe, [email protected], Linux FS Devel, io-uring



On 2023/8/17 06:36, Jens Axboe wrote:
> On 8/16/23 3:46 PM, Qu Wenruo wrote:
>>> Is that write back caching enabled?
>>> Write back caching with volatile write cache? For your device, can you
>>> do:
>>>
>>> $ grep . /sys/block/$dev/queue/*
>
> You didn't answer this one either.
>
Sorry, here is the output:

$ grep . /sys/block/vdb/queue/*
/sys/block/vdb/queue/add_random:0
/sys/block/vdb/queue/chunk_sectors:0
/sys/block/vdb/queue/dax:0
/sys/block/vdb/queue/discard_granularity:512
/sys/block/vdb/queue/discard_max_bytes:2147483136
/sys/block/vdb/queue/discard_max_hw_bytes:2147483136
/sys/block/vdb/queue/discard_zeroes_data:0
/sys/block/vdb/queue/dma_alignment:511
/sys/block/vdb/queue/fua:0
/sys/block/vdb/queue/hw_sector_size:512
/sys/block/vdb/queue/io_poll:0
/sys/block/vdb/queue/io_poll_delay:-1
/sys/block/vdb/queue/iostats:1
/sys/block/vdb/queue/logical_block_size:512
/sys/block/vdb/queue/max_discard_segments:1
/sys/block/vdb/queue/max_hw_sectors_kb:2147483647
/sys/block/vdb/queue/max_integrity_segments:0
/sys/block/vdb/queue/max_sectors_kb:1280
/sys/block/vdb/queue/max_segment_size:4294967295
/sys/block/vdb/queue/max_segments:254
/sys/block/vdb/queue/minimum_io_size:512
/sys/block/vdb/queue/nomerges:0
/sys/block/vdb/queue/nr_requests:256
/sys/block/vdb/queue/nr_zones:0
/sys/block/vdb/queue/optimal_io_size:0
/sys/block/vdb/queue/physical_block_size:512
/sys/block/vdb/queue/read_ahead_kb:128
/sys/block/vdb/queue/rotational:1
/sys/block/vdb/queue/rq_affinity:1
/sys/block/vdb/queue/scheduler:[none] mq-deadline kyber bfq
/sys/block/vdb/queue/stable_writes:0
/sys/block/vdb/queue/throttle_sample_time:100
/sys/block/vdb/queue/virt_boundary_mask:0
/sys/block/vdb/queue/wbt_lat_usec:75000
/sys/block/vdb/queue/write_cache:write back
/sys/block/vdb/queue/write_same_max_bytes:0
/sys/block/vdb/queue/write_zeroes_max_bytes:2147483136
/sys/block/vdb/queue/zone_append_max_bytes:0
/sys/block/vdb/queue/zone_write_granularity:0
/sys/block/vdb/queue/zoned:none

Thanks,
Qu


* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-16 22:28     ` Jens Axboe
@ 2023-08-17  1:05       ` Qu Wenruo
  2023-08-17  1:12         ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2023-08-17  1:05 UTC (permalink / raw)
  To: Jens Axboe, [email protected], Linux FS Devel, io-uring

[-- Attachment #1: Type: text/plain, Size: 4068 bytes --]



On 2023/8/17 06:28, Jens Axboe wrote:
[...]
>
>>> 2) What's the .config you are using?
>>
>> Pretty common config, no heavy debug options (KASAN etc).
>
> Please just send the .config, I'd rather not have to guess. Things like
> preempt etc may make a difference in reproducing this.

Sure, please see the attached config.gz

>
>>>> At least here, with a VM with 6 cores (host has 8C/16T), fast enough
>>>> storage (PCIE4.0 NVME, with unsafe cache mode), it has the chance around
>>>> 1/100 to hit the error.
>>>
>>> What does "unsafe cche mode" mean?
>>
>> Libvirt cache option "unsafe"
>>
>> Which is mostly ignoring flush/fua commands and fully rely on host fs
>> (in my case it's file backed) cache.
>
> Gotcha
>
>>> Is that write back caching enabled?
>>> Write back caching with volatile write cache? For your device, can you
>>> do:
>>>
>>> $ grep . /sys/block/$dev/queue/*
>>>
>>>> Checking the fsstress verbose log against the failed file, it turns out
>>>> to be an io_uring write.
>>>
>>> Any more details on what the write looks like?
>>
>> For the involved file, it shows the following operations for the minimal
>> reproducible seed/-p/-n combination:
>>
>> ```
>> 0/24: link d0/f2 d0/f3 0
>> 0/29: fallocate(INSERT_RANGE) d0/f3 [276 2 0 0 176 481971]t 884736 585728 95
>> 0/30: uring_write d0/f3[276 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
>> 0/31: writev d0/f3[276 2 0 0 296 1457078] [709121,8,964] 0
>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 320
>> 1457078] return 25, fallback to stat()
>> 0/34: dwrite d0/f3[276 2 308134 1763236 320 1457078] [589824,16384] 0
>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[276 2 308134 1763236 496
>> 1457078] return 25, fallback to stat()
>> 0/38: dwrite d0/f3[276 2 308134 1763236 496 1457078] [2084864,36864] 0
>> 0/40: fallocate(ZERO_RANGE) d0/f3 [276 2 308134 1763236 688 2809139]t
>> 3512660 81075 0
>> 0/43: splice d0/f5[289 1 0 0 1872 2678784] [552619,59420] -> d0/f3[276 2
>> 308134 1763236 856 3593735] [5603798,59420] 0
>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [276 1 308134 1763236 976
>> 5663218]t 1361821 480392 0
>> 0/49: clonerange d0/f3[276 1 308134 1763236 856 5663218] [2461696,53248]
>> -> d0/f5[289 1 0 0 1872 2678784] [942080,53248]
>> ```
>
> And just to be sure, this is not mixing dio and buffered, right?

I'd say it's mixing; there are dwrite() and writev() calls for the same
file, but at least they are not overlapping with this particular seed, nor
are they concurrent (they are all issued sequentially inside the same
process).

But considering that the problem no longer reproduces when only uring_write
is disabled, there must be some untested btrfs path triggered by uring_write.

>
>>>> However I didn't see any io_uring related callback inside btrfs code,
>>>> any advice on the io_uring part would be appreciated.
>>>
>>> io_uring doesn't do anything special here, it uses the normal page cache
>>> read/write parts for buffered IO. But you may get extra parallellism
>>> with io_uring here. For example, with the buffered write that this most
>>> likely is, libaio would be exactly the same as a pwrite(2) on the file.
>>> If this would've blocked, io_uring would offload this to a helper
>>> thread. Depending on the workload, you could have multiple of those in
>>> progress at the same time.
>>
>> My biggest concern is, would io_uring modify the page when it's still
>> under writeback?
>
> No, of course not. Like I mentioned, io_uring doesn't do anything that
> the normal read/write path isn't already doing - it's using the same
> ->read_iter() and ->write_iter() that everything else is, there's no
> page cache code in io_uring.
>
>> In that case, it's going to cause csum mismatch as btrfs relies on the
>> page under writeback to be unchanged.
>
> Sure, I'm aware of the stable page requirements.
>
> See my followup email as well on a patch to test as well.
>

Applied and tested, using "-p 10 -n 1000" as the fsstress workload; it
failed at the 23rd run.

Thanks,
Qu

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 35693 bytes --]


* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-17  1:05       ` Qu Wenruo
@ 2023-08-17  1:12         ` Jens Axboe
  2023-08-17  1:19           ` Qu Wenruo
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2023-08-17  1:12 UTC (permalink / raw)
  To: Qu Wenruo, [email protected], Linux FS Devel, io-uring

On 8/16/23 7:05 PM, Qu Wenruo wrote:
> 
> 
> On 2023/8/17 06:28, Jens Axboe wrote:
> [...]
>>
>>>> 2) What's the .config you are using?
>>>
>>> Pretty common config, no heavy debug options (KASAN etc).
>>
>> Please just send the .config, I'd rather not have to guess. Things like
>> preempt etc may make a difference in reproducing this.
> 
> Sure, please see the attached config.gz

Thanks

>> And just to be sure, this is not mixing dio and buffered, right?
> 
> I'd say it's mixing, there are dwrite() and writev() for the same file,
> but at least not overlapping using this particular seed, nor they are
> concurrent (all inside the same process sequentially).
> 
> But considering if only uring_write is disabled, then no more reproduce,
> thus there must be some untested btrfs path triggered by uring_write.

That would be one conclusion, another would be that the timing is just
different and that triggers an issue. Or it could of course be a bug in
io_uring, perhaps a short write that gets retried or something like
that. I've run the tests for hours here and don't hit anything; I've
pulled in the for-next branch for btrfs to see if that'll make a
difference. I'll check your .config too.

Might not be a bad idea to have the writes contain known data, and when
you hit the csum verification failure, dump the data where the csum says
it's wrong and figure out at what offset, with what content, etc. it is.
If that can get correlated to the log of what happened, that might shed
some light on this.
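
As a sketch of that idea (the helper name and tag layout below are made up),
the write buffers could be stamped with the destination file offset plus an
op sequence number, so a block that later fails csum verification can be
dumped and matched back to the fsstress log entry that wrote it:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative only: stamp every 16 bytes of a write buffer with the file
 * offset it is headed for plus an op sequence number, so a corrupted block
 * identifies the write (and the fsstress op) that produced it.
 */
static void fill_known_pattern(void *buf, size_t len,
			       uint64_t file_off, uint64_t op_seq)
{
	unsigned char *p = buf;
	size_t i;

	for (i = 0; i < len; i += 16) {
		uint64_t tag[2] = { file_off + i, op_seq };
		size_t n = len - i < sizeof(tag) ? len - i : sizeof(tag);

		memcpy(p + i, tag, n);
	}
}
```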

>>>>> However I didn't see any io_uring related callback inside btrfs code,
>>>>> any advice on the io_uring part would be appreciated.
>>>>
>>>> io_uring doesn't do anything special here, it uses the normal page cache
>>>> read/write parts for buffered IO. But you may get extra parallellism
>>>> with io_uring here. For example, with the buffered write that this most
>>>> likely is, libaio would be exactly the same as a pwrite(2) on the file.
>>>> If this would've blocked, io_uring would offload this to a helper
>>>> thread. Depending on the workload, you could have multiple of those in
>>>> progress at the same time.
>>>
>>> My biggest concern is, would io_uring modify the page when it's still
>>> under writeback?
>>
>> No, of course not. Like I mentioned, io_uring doesn't do anything that
>> the normal read/write path isn't already doing - it's using the same
>> ->read_iter() and ->write_iter() that everything else is, there's no
>> page cache code in io_uring.
>>
>>> In that case, it's going to cause csum mismatch as btrfs relies on the
>>> page under writeback to be unchanged.
>>
>> Sure, I'm aware of the stable page requirements.
>>
>> See my followup email as well on a patch to test as well.
>>
> 
> Applied and tested, using "-p 10 -n 1000" as fsstress workload, failed
> at 23rd run.

OK, that rules out the multiple-writers theory.

-- 
Jens Axboe



* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-17  1:12         ` Jens Axboe
@ 2023-08-17  1:19           ` Qu Wenruo
  2023-08-17  1:23             ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2023-08-17  1:19 UTC (permalink / raw)
  To: Jens Axboe, [email protected], Linux FS Devel, io-uring



On 2023/8/17 09:12, Jens Axboe wrote:
> On 8/16/23 7:05 PM, Qu Wenruo wrote:
>>
>>
>> On 2023/8/17 06:28, Jens Axboe wrote:
>> [...]
>>>
>>>>> 2) What's the .config you are using?
>>>>
>>>> Pretty common config, no heavy debug options (KASAN etc).
>>>
>>> Please just send the .config, I'd rather not have to guess. Things like
>>> preempt etc may make a difference in reproducing this.
>>
>> Sure, please see the attached config.gz
>
> Thanks
>
>>> And just to be sure, this is not mixing dio and buffered, right?
>>
>> I'd say it's mixing, there are dwrite() and writev() for the same file,
>> but at least not overlapping using this particular seed, nor they are
>> concurrent (all inside the same process sequentially).
>>
>> But considering if only uring_write is disabled, then no more reproduce,
>> thus there must be some untested btrfs path triggered by uring_write.
>
> That would be one conclusion, another would be that timing is just
> different and that triggers and issue. Or it could of course be a bug in
> io_uring, perhaps a short write that gets retried or something like
> that. I've run the tests for hours here and don't hit anything, I've
> pulled in the for-next branch for btrfs and see if that'll make a
> difference. I'll check your .config too.

Just to mention, the problem itself was pretty hard to hit before when
using any heavy debug kernel configs.

Not sure why, but later I switched both my CPUs (from a desktop i7-13700K,
though limited to 160W power, to a laptop 7940HS) and dropped all the heavy
debug kernel configs, and now it's 100% reproducible here.

So I guess a faster CPU is also one factor?

>
> Might not be a bad idea to have the writes contain known data, and when
> you hit the failure to verify the csum, dump the data where the csum
> says it's wrong and figure out at what offset, what content, etc it is?
> If that can get correlated to the log of what happened, that might shed
> some light on this.
>
Thanks for the advice, I'll definitely try this method and keep you
updated when I find something valuable.

Thanks,
Qu


* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-17  1:19           ` Qu Wenruo
@ 2023-08-17  1:23             ` Jens Axboe
  2023-08-17  1:31               ` Qu Wenruo
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2023-08-17  1:23 UTC (permalink / raw)
  To: Qu Wenruo, [email protected], Linux FS Devel, io-uring

On 8/16/23 7:19 PM, Qu Wenruo wrote:
> On 2023/8/17 09:12, Jens Axboe wrote:
>> On 8/16/23 7:05 PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2023/8/17 06:28, Jens Axboe wrote:
>>> [...]
>>>>
>>>>>> 2) What's the .config you are using?
>>>>>
>>>>> Pretty common config, no heavy debug options (KASAN etc).
>>>>
>>>> Please just send the .config, I'd rather not have to guess. Things like
>>>> preempt etc may make a difference in reproducing this.
>>>
>>> Sure, please see the attached config.gz
>>
>> Thanks
>>
>>>> And just to be sure, this is not mixing dio and buffered, right?
>>>
>>> I'd say it's mixing, there are dwrite() and writev() for the same file,
>>> but at least not overlapping using this particular seed, nor they are
>>> concurrent (all inside the same process sequentially).
>>>
>>> But considering if only uring_write is disabled, then no more reproduce,
>>> thus there must be some untested btrfs path triggered by uring_write.
>>
>> That would be one conclusion, another would be that timing is just
>> different and that triggers and issue. Or it could of course be a bug in
>> io_uring, perhaps a short write that gets retried or something like
>> that. I've run the tests for hours here and don't hit anything, I've
>> pulled in the for-next branch for btrfs and see if that'll make a
>> difference. I'll check your .config too.
> 
> Just to mention, the problem itself was pretty hard to hit before if
> using any debug kernel configs.

The kernels I'm testing with don't have any debug options enabled,
outside of the basic cheap stuff. I do notice you have all the btrfs debug
stuff enabled; I'll try and do that too.

> Not sure why but later I switched both my CPUs (from a desktop i7-13700K
> but with limited 160W power, to a laptop 7940HS), dropping all heavy
> debug kernel configs, then it's 100% reproducible here.
> 
> So I guess a faster CPU is also one factor?

I've run this in KVM on an Apple M1 Max, no luck there. Ran it on a
7950X, no luck there. Now fiddling with config options on the 7950 and
booting up the two-socket 7763 box. Both that and the 7950 are using gen4
Optane, which should be plenty beefy. But if it's timing related, well...

>> Might not be a bad idea to have the writes contain known data, and when
>> you hit the failure to verify the csum, dump the data where the csum
>> says it's wrong and figure out at what offset, what content, etc it is?
>> If that can get correlated to the log of what happened, that might shed
>> some light on this.
>>
> Thanks for the advice, would definitely try this method, would keep you
> updated when I found something valuable.

If I can't reproduce this, then this seems like the best way forward
indeed.

-- 
Jens Axboe



* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-17  1:23             ` Jens Axboe
@ 2023-08-17  1:31               ` Qu Wenruo
  2023-08-17  1:32                 ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2023-08-17  1:31 UTC (permalink / raw)
  To: Jens Axboe, [email protected], Linux FS Devel, io-uring



On 2023/8/17 09:23, Jens Axboe wrote:
> On 8/16/23 7:19 PM, Qu Wenruo wrote:
>> On 2023/8/17 09:12, Jens Axboe wrote:
>>> On 8/16/23 7:05 PM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2023/8/17 06:28, Jens Axboe wrote:
>>>> [...]
>>>>>
>>>>>>> 2) What's the .config you are using?
>>>>>>
>>>>>> Pretty common config, no heavy debug options (KASAN etc).
>>>>>
>>>>> Please just send the .config, I'd rather not have to guess. Things like
>>>>> preempt etc may make a difference in reproducing this.
>>>>
>>>> Sure, please see the attached config.gz
>>>
>>> Thanks
>>>
>>>>> And just to be sure, this is not mixing dio and buffered, right?
>>>>
>>>> I'd say it's mixing, there are dwrite() and writev() for the same file,
>>>> but at least not overlapping using this particular seed, nor they are
>>>> concurrent (all inside the same process sequentially).
>>>>
>>>> But considering if only uring_write is disabled, then no more reproduce,
>>>> thus there must be some untested btrfs path triggered by uring_write.
>>>
>>> That would be one conclusion, another would be that timing is just
>>> different and that triggers and issue. Or it could of course be a bug in
>>> io_uring, perhaps a short write that gets retried or something like
>>> that. I've run the tests for hours here and don't hit anything, I've
>>> pulled in the for-next branch for btrfs and see if that'll make a
>>> difference. I'll check your .config too.
>>
>> Just to mention, the problem itself was pretty hard to hit before if
>> using any debug kernel configs.
>
> The kernels I'm testing with don't have any debug options enabled,
> outside of the basic cheap stuff. I do notice you have all btrfs debug
> stuff enabled, I'll try and do that too.
>
>> Not sure why but later I switched both my CPUs (from a desktop i7-13700K
>> but with limited 160W power, to a laptop 7940HS), dropping all heavy
>> debug kernel configs, then it's 100% reproducible here.
>>
>> So I guess a faster CPU is also one factor?
>
> I've run this on kvm on an apple m1 max, no luck there. Ran it on a
> 7950X, no luck there. Fiddling config options on the 7950 and booting up
> the 7763 two socket box. Both that and the 7950 are using gen4 optane,
> should be plenty beefy. But if it's timing related, well...

Just to mention, the following progs are involved:

- btrfs-progs v6.3.3
   In theory anything newer than 5.15 should be fine; it's only some
   default settings that changed.

- fsstress from the xfstests project
   Thus it's not the one directly from LTP.

Hope this helps you reproduce the bug.

Thanks,
Qu

>
>>> Might not be a bad idea to have the writes contain known data, and when
>>> you hit the failure to verify the csum, dump the data where the csum
>>> says it's wrong and figure out at what offset, what content, etc it is?
>>> If that can get correlated to the log of what happened, that might shed
>>> some light on this.
>>>
>> Thanks for the advice, would definitely try this method, would keep you
>> updated when I found something valuable.
>
> If I can't reproduce this, then this seems like the best way forward
> indeed.
>


* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-17  1:31               ` Qu Wenruo
@ 2023-08-17  1:32                 ` Jens Axboe
  2023-08-19 23:59                   ` Qu Wenruo
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2023-08-17  1:32 UTC (permalink / raw)
  To: Qu Wenruo, [email protected], Linux FS Devel, io-uring

On 8/16/23 7:31 PM, Qu Wenruo wrote:
> 
> 
> On 2023/8/17 09:23, Jens Axboe wrote:
>> On 8/16/23 7:19 PM, Qu Wenruo wrote:
>>> On 2023/8/17 09:12, Jens Axboe wrote:
>>>> On 8/16/23 7:05 PM, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2023/8/17 06:28, Jens Axboe wrote:
>>>>> [...]
>>>>>>
>>>>>>>> 2) What's the .config you are using?
>>>>>>>
>>>>>>> Pretty common config, no heavy debug options (KASAN etc).
>>>>>>
>>>>>> Please just send the .config, I'd rather not have to guess. Things like
>>>>>> preempt etc may make a difference in reproducing this.
>>>>>
>>>>> Sure, please see the attached config.gz
>>>>
>>>> Thanks
>>>>
>>>>>> And just to be sure, this is not mixing dio and buffered, right?
>>>>>
>>>>> I'd say it's mixing, there are dwrite() and writev() for the same file,
>>>>> but at least not overlapping using this particular seed, nor they are
>>>>> concurrent (all inside the same process sequentially).
>>>>>
>>>>> But considering if only uring_write is disabled, then no more reproduce,
>>>>> thus there must be some untested btrfs path triggered by uring_write.
>>>>
>>>> That would be one conclusion, another would be that timing is just
>>>> different and that triggers and issue. Or it could of course be a bug in
>>>> io_uring, perhaps a short write that gets retried or something like
>>>> that. I've run the tests for hours here and don't hit anything, I've
>>>> pulled in the for-next branch for btrfs and see if that'll make a
>>>> difference. I'll check your .config too.
>>>
>>> Just to mention, the problem itself was pretty hard to hit before if
>>> using any debug kernel configs.
>>
>> The kernels I'm testing with don't have any debug options enabled,
>> outside of the basic cheap stuff. I do notice you have all btrfs debug
>> stuff enabled, I'll try and do that too.
>>
>>> Not sure why but later I switched both my CPUs (from a desktop i7-13700K
>>> but with limited 160W power, to a laptop 7940HS), dropping all heavy
>>> debug kernel configs, then it's 100% reproducible here.
>>>
>>> So I guess a faster CPU is also one factor?
>>
>> I've run this on kvm on an apple m1 max, no luck there. Ran it on a
>> 7950X, no luck there. Fiddling config options on the 7950 and booting up
>> the 7763 two socket box. Both that and the 7950 are using gen4 optane,
>> should be plenty beefy. But if it's timing related, well...
> 
> Just to mention, the following progs are involved:
> 
> - btrfs-progs v6.3.3
>   In theory anything newer than 5.15 should be fine, it's some default
>   settings change.

axboe@r7525 ~> apt show btrfs-progs
Package: btrfs-progs
Version: 6.3.2-1

is what I have.

> - fsstress from xfstests project
>   Thus it's not the one directly from LTP

That's what I'm using too.

> Hopes this could help you to reproduce the bug.

So far, not really :-)

-- 
Jens Axboe



* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-17  1:32                 ` Jens Axboe
@ 2023-08-19 23:59                   ` Qu Wenruo
  2023-08-20  0:22                     ` Qu Wenruo
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2023-08-19 23:59 UTC (permalink / raw)
  To: Jens Axboe, [email protected], Linux FS Devel, io-uring

Hi Jens

I tried more on my side to debug the situation, and found a very weird
write behavior:

	Some unexpected direct IO happened, without any corresponding
	fsstress operation.

The workload is:

	$fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress

With it I can reliably reproduce the problem locally, around a 1/50
probability.
In my particular case, it results in data corruption at root 5 inode 283
offset 8192.

Then I added some trace points for the following functions:

- btrfs_do_write_iter()
   Two trace points, one before btrfs_direct_write(), and one
   before btrfs_buffered_write(), outputting the aligned and unaligned
   write range, root/inode number, type of the write (buffered or
   direct).

- btrfs_finish_one_ordered()
   This is where btrfs inserts its ordered extent into the subvolume
   tree.
   This happens when a range of pages finishes its writeback.

Then here comes the fsstress log for inode 283 (fsstress does not log the
btrfs root number):

0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
[307200,0]
0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
[1058819,0]
0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0
0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736 585728 95
0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320
1457078] return 25, fallback to stat()
0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320
1457078] return 25, fallback to stat()
0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496
1457078] return 25, fallback to stat()
0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0
0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0
0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t
3512660 81075 0
0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2
308134 1763236 856 3593735] [5603798,59420] 0
0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976
5663218]t 1361821 480392 0
0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248]
-> d0/f5[293 1 0 0 1872 2678784] [942080,53248]

Note one thing: there is no direct or buffered write into inode 283 offset
8192 in that log.

But from the trace events for root 5 inode 283:

  btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644)
len=90112(88327)
  btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622)
len=61440(56456)
  btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121)
len=12288(7712)

  btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192)
len=73728(73728) <<<<<

  btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824)
len=16384(16384)
  btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
  btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
  btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864)
len=36864(36864)
  btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
  btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000)
len=61440(60139)
  btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798)
len=61440(59420)
  btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
  btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
  btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
  btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
  btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
  btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440

Note that phantom direct IO call, which is in the corrupted range.

When paired with the fsstress log, that phantom write happens between these
two operations:

0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0

I'll keep digging, but a phantom write that is not properly logged by
fsstress is already a concern to me.

Or maybe I'm missing some fixes in fsstress?

Thanks,
Qu


* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-19 23:59                   ` Qu Wenruo
@ 2023-08-20  0:22                     ` Qu Wenruo
  2023-08-20 13:26                       ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2023-08-20  0:22 UTC (permalink / raw)
  To: Jens Axboe, [email protected], Linux FS Devel, io-uring



On 2023/8/20 07:59, Qu Wenruo wrote:
> Hi Jens
>
> I tried more on my side to debug the situation, and found a very weird
> write behavior:
>
>      Some unexpected direct IO happened, without corresponding
>      fsstress workload.
>
> The workload is:
>
>      $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress
>
> Which I can reliably reproduce the problem locally, around 1/50
> possibility.
> In my particular case, it results data corruption at root 5 inode 283
> offset 8192.
>
> Then I added some trace points for the following functions:
>
> - btrfs_do_write_iter()
>    Two trace points, one before btrfs_direct_write(), and one
>    before btrfs_buffered_write(), outputting the aligned and unaligned
>    write range, root/inode number, type of the write (buffered or
>    direct).
>
> - btrfs_finish_one_ordered()
>    This is where btrfs inserts its ordered extent into the subvolume
>    tree.
>    This happens when a range of pages finishes its writeback.
>
> Then here comes the fsstress log for inode 283 (no btrfs root number):
>
> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
> [307200,0]
> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
> [1058819,0]
> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0
> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736
> 585728 95
> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320
> 1457078] return 25, fallback to stat()
> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320
> 1457078] return 25, fallback to stat()
> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496
> 1457078] return 25, fallback to stat()
> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0
> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0
> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t
> 3512660 81075 0
> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2
> 308134 1763236 856 3593735] [5603798,59420] 0
> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976
> 5663218]t 1361821 480392 0
> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248]
> -> d0/f5[293 1 0 0 1872 2678784] [942080,53248]
>
> Note one thing, there is no direct/buffered write into inode 283 offset
> 8192.
>
> But from the trace events for root 5 inode 283:
>
>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644)
> len=90112(88327)
>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622)
> len=61440(56456)
>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121)
> len=12288(7712)
>
>   btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192)
> len=73728(73728) <<<<<
>
>   btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824)
> len=16384(16384)
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
>   btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864)
> len=36864(36864)
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000)
> len=61440(60139)
>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798)
> len=61440(59420)
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
>   btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440
>
> Note that phantom direct IO call, which is in the corrupted range.
>
> If paired with fsstress, that phantom write happens between the two
> operations:
>
> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0

Just to be more accurate, there is a 0/33 operation, which is:

0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320
1457078] return 25, fallback to stat()
0/33: awrite - io_getevents failed -4

The failed one doesn't have an inode number, thus it didn't get caught by grep.

Return value -4 means -EINTR; I'm not sure who sent the interruption.
But if this interruption happens before the IO has finished, we can call
free() on the buffer, and if we're unlucky enough, the freed memory can
be reallocated for some other use, thus modifying the pages before the
writeback has finished.

I think this is the direct cause of the data corruption: page
modification before the direct IO has finished.

But unfortunately I still don't understand why the interruption can happen,
nor how we should handle such an interruption.
(I guess just retry?)
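
If retrying is the right answer, I guess it would look something like this
for the io_getevents() wait (a sketch assuming libaio, and assuming the
interruption is harmless and the IO is still in flight):

```c
#include <libaio.h>
#include <errno.h>

/*
 * Sketch of the "just retry" idea: treat a spurious -EINTR from
 * io_getevents() as "nothing completed yet" and wait again, instead of
 * bailing out (and freeing the buffer) while the IO is still in flight.
 */
static int wait_one_event(io_context_t ctx, struct io_event *ev)
{
	int ret;

	do {
		ret = io_getevents(ctx, 1, 1, ev, NULL);
	} while (ret == -EINTR);

	return ret;	/* 1 on success, negative errno on real errors */
}
```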

Thanks,
Qu
>
> I'll keep digging, but the phantom writes which is not properly loggeg
> from fsstress is already a concern to me.
>
> Or maybe I'm missing some fixes in fsstress?
>
> Thanks,
> Qu


* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-20  0:22                     ` Qu Wenruo
@ 2023-08-20 13:26                       ` Jens Axboe
  2023-08-20 14:11                         ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2023-08-20 13:26 UTC (permalink / raw)
  To: Qu Wenruo, [email protected], Linux FS Devel, io-uring

On 8/19/23 6:22 PM, Qu Wenruo wrote:
> 
> 
> On 2023/8/20 07:59, Qu Wenruo wrote:
>> Hi Jens
>>
>> I tried more on my side to debug the situation, and found a very weird
>> write behavior:
>>
>>      Some unexpected direct IO happened, without corresponding
>>      fsstress workload.
>>
>> The workload is:
>>
>>      $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress
>>
>> Which I can reliably reproduce the problem locally, around 1/50
>> possibility.
>> In my particular case, it results data corruption at root 5 inode 283
>> offset 8192.
>>
>> Then I added some trace points for the following functions:
>>
>> - btrfs_do_write_iter()
>>    Two trace points, one before btrfs_direct_write(), and one
>>    before btrfs_buffered_write(), outputting the aligned and unaligned
>>    write range, root/inode number, type of the write (buffered or
>>    direct).
>>
>> - btrfs_finish_one_ordered()
>>    This is where btrfs inserts its ordered extent into the subvolume
>>    tree.
>>    This happens when a range of pages finishes its writeback.
>>
>> Then here comes the fsstress log for inode 283 (no btrfs root number):
>>
>> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
>> [307200,0]
>> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
>> [1058819,0]
>> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0
>> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736
>> 585728 95
>> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320
>> 1457078] return 25, fallback to stat()
>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320
>> 1457078] return 25, fallback to stat()
>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496
>> 1457078] return 25, fallback to stat()
>> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0
>> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0
>> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t
>> 3512660 81075 0
>> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2
>> 308134 1763236 856 3593735] [5603798,59420] 0
>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976
>> 5663218]t 1361821 480392 0
>> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248]
>> -> d0/f5[293 1 0 0 1872 2678784] [942080,53248]
>>
>> Note one thing, there is no direct/buffered write into inode 283 offset
>> 8192.
>>
>> But from the trace events for root 5 inode 283:
>>
>>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644)
>> len=90112(88327)
>>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622)
>> len=61440(56456)
>>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121)
>> len=12288(7712)
>>
>>   btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192)
>> len=73728(73728) <<<<<
>>
>>   btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824)
>> len=16384(16384)
>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
>>   btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864)
>> len=36864(36864)
>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
>>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000)
>> len=61440(60139)
>>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798)
>> len=61440(59420)
>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440
>>
>> Note that phantom direct IO call, which is in the corrupted range.
>>
>> If paired with fsstress, that phantom write happens between the two
>> operations:
>>
>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
> 
> Just to be more accurate, there is a 0/33 operation, which is:
> 
> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320
> 1457078] return 25, fallback to stat()
> 0/33: awrite - io_getevents failed -4
> 
> The failed one doesn't have inode number thus it didn't get caught by grep.
> 
> Return value -4 means -INTR, not sure who sent the interruption.
> But if this interruption happens before the IO finished, we can call
> free() on the buffer, and if we're unlucky enough, the freed memory can
> be re-allocated for some other usage, thus modifying the pages before
> the writeback finished.
> 
> I think this is the direct cause of the data corruption, page
> modification before direct IO finished.
> 
> But unfortunately I still didn't get why the interruption can happen,
> nor how can we handle such interruption?
> (I guess just retry?)

It's because you are mixing aio/io_uring, and the default setting for
io_uring is to use signal-based notifications for queueing task_work.
This then causes a spurious -EINTR, which stops your io_getevents()
wait. This looks like a bug in fsstress: it should just retry the wait
if that happens. You can also configure the ring to not use signal-based
notifications, but that bug needs fixing regardless.
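
For reference, that second option looks roughly like the liburing sketch
below. This is only an illustration, not code from fsstress: the helper
name and queue depth are made up, and IORING_SETUP_COOP_TASKRUN needs a
newer kernel (5.19+).

```c
#include <liburing.h>
#include <string.h>

/*
 * Set up a ring that runs task_work cooperatively (on the next
 * transition into the kernel) instead of interrupting the task with a
 * signal, so unrelated waits such as io_getevents() no longer see a
 * spurious -EINTR.
 */
static int setup_ring_coop(struct io_uring *ring)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_COOP_TASKRUN;

	/* returns 0 on success, -errno on failure */
	return io_uring_queue_init_params(64, ring, &p);
}
```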

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-20 13:26                       ` Jens Axboe
@ 2023-08-20 14:11                         ` Jens Axboe
  2023-08-20 18:18                           ` Matthew Wilcox
  2023-08-21  0:38                           ` Qu Wenruo
  0 siblings, 2 replies; 22+ messages in thread
From: Jens Axboe @ 2023-08-20 14:11 UTC (permalink / raw)
  To: Qu Wenruo, [email protected], Linux FS Devel, io-uring

On 8/20/23 7:26 AM, Jens Axboe wrote:
> On 8/19/23 6:22 PM, Qu Wenruo wrote:
>>
>>
>> On 2023/8/20 07:59, Qu Wenruo wrote:
>>> Hi Jens
>>>
>>> I tried more on my side to debug the situation, and found a very weird
>>> write behavior:
>>>
>>>      Some unexpected direct IO happened, without corresponding
>>>      fsstress workload.
>>>
>>> The workload is:
>>>
>>>      $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress
>>>
>>> Which I can reliably reproduce the problem locally, around 1/50
>>> possibility.
>>> In my particular case, it results data corruption at root 5 inode 283
>>> offset 8192.
>>>
>>> Then I added some trace points for the following functions:
>>>
>>> - btrfs_do_write_iter()
>>>    Two trace points, one before btrfs_direct_write(), and one
>>>    before btrfs_buffered_write(), outputting the aligned and unaligned
>>>    write range, root/inode number, type of the write (buffered or
>>>    direct).
>>>
>>> - btrfs_finish_one_ordered()
>>>    This is where btrfs inserts its ordered extent into the subvolume
>>>    tree.
>>>    This happens when a range of pages finishes its writeback.
>>>
>>> Then here comes the fsstress log for inode 283 (no btrfs root number):
>>>
>>> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
>>> [307200,0]
>>> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
>>> [1058819,0]
>>> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0
>>> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736
>>> 585728 95
>>> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320
>>> 1457078] return 25, fallback to stat()
>>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320
>>> 1457078] return 25, fallback to stat()
>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496
>>> 1457078] return 25, fallback to stat()
>>> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0
>>> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0
>>> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t
>>> 3512660 81075 0
>>> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2
>>> 308134 1763236 856 3593735] [5603798,59420] 0
>>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976
>>> 5663218]t 1361821 480392 0
>>> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248]
>>> -> d0/f5[293 1 0 0 1872 2678784] [942080,53248]
>>>
>>> Note one thing, there is no direct/buffered write into inode 283 offset
>>> 8192.
>>>
>>> But from the trace events for root 5 inode 283:
>>>
>>>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644)
>>> len=90112(88327)
>>>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622)
>>> len=61440(56456)
>>>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121)
>>> len=12288(7712)
>>>
>>>   btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192)
>>> len=73728(73728) <<<<<
>>>
>>>   btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824)
>>> len=16384(16384)
>>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
>>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
>>>   btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864)
>>> len=36864(36864)
>>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
>>>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000)
>>> len=61440(60139)
>>>   btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798)
>>> len=61440(59420)
>>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
>>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
>>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
>>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
>>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
>>>   btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440
>>>
>>> Note that phantom direct IO call, which is in the corrupted range.
>>>
>>> If paired with fsstress, that phantom write happens between the two
>>> operations:
>>>
>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>>
>> Just to be more accurate, there is a 0/33 operation, which is:
>>
>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320
>> 1457078] return 25, fallback to stat()
>> 0/33: awrite - io_getevents failed -4
>>
>> The failed one doesn't have inode number thus it didn't get caught by grep.
>>
>> Return value -4 means -INTR, not sure who sent the interruption.
>> But if this interruption happens before the IO finished, we can call
>> free() on the buffer, and if we're unlucky enough, the freed memory can
>> be re-allocated for some other usage, thus modifying the pages before
>> the writeback finished.
>>
>> I think this is the direct cause of the data corruption, page
>> modification before direct IO finished.
>>
>> But unfortunately I still didn't get why the interruption can happen,
>> nor how can we handle such interruption?
>> (I guess just retry?)
> 
> It's because you are mixing aio/io_uring, and the default settings for
> io_uring is to use signal based notifications for queueing task_work.
> This then causes a spurious -EINTR, which stops your io_getevents()
> wait. Looks like this is a bug in fsstress, it should just retry the
> wait if this happens. You can also configure the ring to not use signal
> based notifications, but that bug needs fixing regardless.

Something like this will probably fix it.


diff --git a/ltp/fsstress.c b/ltp/fsstress.c
index 6641a525fe5d..05fbfd3f8cf8 100644
--- a/ltp/fsstress.c
+++ b/ltp/fsstress.c
@@ -2072,6 +2072,23 @@ void inode_info(char *str, size_t sz, struct stat64 *s, int verbose)
 			 (long long) s->st_blocks, (long long) s->st_size);
 }
 
+static int io_get_single_event(struct io_event *event)
+{
+	int ret;
+
+	do {
+		/*
+		 * We can get -EINTR if competing with io_uring using signal
+		 * based notifications. For that case, just retry the wait.
+		 */
+		ret = io_getevents(io_ctx, 1, 1, event, NULL);
+		if (ret != -EINTR)
+			break;
+	} while (1);
+
+	return ret;
+}
+
 void
 afsync_f(opnum_t opno, long r)
 {
@@ -2111,7 +2128,7 @@ afsync_f(opnum_t opno, long r)
 		close(fd);
 		return;
 	}
-	if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) {
+	if ((e = io_get_single_event(&event)) != 1) {
 		if (v)
 			printf("%d/%lld: afsync - io_getevents failed %d\n",
 			       procid, opno, e);
@@ -2220,10 +2237,10 @@ do_aio_rw(opnum_t opno, long r, int flags)
 	if ((e = io_submit(io_ctx, 1, iocbs)) != 1) {
 		if (v)
 			printf("%d/%lld: %s - io_submit failed %d\n",
-			       procid, opno, iswrite ? "awrite" : "aread", e);
+		 	       procid, opno, iswrite ? "awrite" : "aread", e);
 		goto aio_out;
 	}
-	if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) {
+	if ((e = io_get_single_event(&event)) != 1) {
 		if (v)
 			printf("%d/%lld: %s - io_getevents failed %d\n",
 			       procid, opno, iswrite ? "awrite" : "aread", e);

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-20 14:11                         ` Jens Axboe
@ 2023-08-20 18:18                           ` Matthew Wilcox
  2023-08-20 18:40                             ` Jens Axboe
  2023-08-21  0:38                           ` Qu Wenruo
  1 sibling, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2023-08-20 18:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Qu Wenruo, [email protected], Linux FS Devel, io-uring

On Sun, Aug 20, 2023 at 08:11:04AM -0600, Jens Axboe wrote:
> +static int io_get_single_event(struct io_event *event)
> +{
> +	int ret;
> +
> +	do {
> +		/*
> +		 * We can get -EINTR if competing with io_uring using signal
> +		 * based notifications. For that case, just retry the wait.
> +		 */
> +		ret = io_getevents(io_ctx, 1, 1, event, NULL);
> +		if (ret != -EINTR)
> +			break;
> +	} while (1);
> +
> +	return ret;
> +}

Is there a reason to prefer this style over:

	do {
		ret = io_getevents(io_ctx, 1, 1, event, NULL);
	} while (ret == -1 && errno == EINTR);

(we need to check errno here, right? Or is io_getevents() special
somehow?)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-20 18:18                           ` Matthew Wilcox
@ 2023-08-20 18:40                             ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2023-08-20 18:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Qu Wenruo, [email protected], Linux FS Devel, io-uring

On 8/20/23 12:18 PM, Matthew Wilcox wrote:
> On Sun, Aug 20, 2023 at 08:11:04AM -0600, Jens Axboe wrote:
>> +static int io_get_single_event(struct io_event *event)
>> +{
>> +	int ret;
>> +
>> +	do {
>> +		/*
>> +		 * We can get -EINTR if competing with io_uring using signal
>> +		 * based notifications. For that case, just retry the wait.
>> +		 */
>> +		ret = io_getevents(io_ctx, 1, 1, event, NULL);
>> +		if (ret != -EINTR)
>> +			break;
>> +	} while (1);
>> +
>> +	return ret;
>> +}
> 
> Is there a reason to prefer this style over:
> 
> 	do {
> 		ret = io_getevents(io_ctx, 1, 1, event, NULL);
> 	} while (ret == -1 && errno == EINTR);
> 
> (we need to check errno, here, right?  Or is io_getevents() special
> somehow?)

Honestly, I don't really care about the style, I mostly care about getting
the bug fixed. io_getevents() returns the number of events claimed, or
-errno directly, so there's no separate errno to check.
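
With that return convention, the retry can also be written in the style
Matthew suggested, checking the return value instead of errno. A sketch of
the same helper in that form (io_ctx is passed explicitly here just to
keep the snippet self-contained; fsstress uses a global):

```c
#include <errno.h>
#include <libaio.h>

/* Same semantics as io_get_single_event() in the patch above. */
static int io_get_single_event(io_context_t io_ctx, struct io_event *event)
{
	int ret;

	do {
		ret = io_getevents(io_ctx, 1, 1, event, NULL);
	} while (ret == -EINTR);	/* libaio returns -errno, not -1/errno */

	return ret;
}
```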

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-20 14:11                         ` Jens Axboe
  2023-08-20 18:18                           ` Matthew Wilcox
@ 2023-08-21  0:38                           ` Qu Wenruo
  2023-08-21 14:57                             ` Jens Axboe
  1 sibling, 1 reply; 22+ messages in thread
From: Qu Wenruo @ 2023-08-21  0:38 UTC (permalink / raw)
  To: Jens Axboe, [email protected], Linux FS Devel, io-uring



On 2023/8/20 22:11, Jens Axboe wrote:
> On 8/20/23 7:26 AM, Jens Axboe wrote:
>> On 8/19/23 6:22 PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2023/8/20 07:59, Qu Wenruo wrote:
>>>> Hi Jens
>>>>
>>>> I tried more on my side to debug the situation, and found a very weird
>>>> write behavior:
>>>>
>>>>       Some unexpected direct IO happened, without corresponding
>>>>       fsstress workload.
>>>>
>>>> The workload is:
>>>>
>>>>       $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress
>>>>
>>>> Which I can reliably reproduce the problem locally, around 1/50
>>>> possibility.
>>>> In my particular case, it results data corruption at root 5 inode 283
>>>> offset 8192.
>>>>
>>>> Then I added some trace points for the following functions:
>>>>
>>>> - btrfs_do_write_iter()
>>>>     Two trace points, one before btrfs_direct_write(), and one
>>>>     before btrfs_buffered_write(), outputting the aligned and unaligned
>>>>     write range, root/inode number, type of the write (buffered or
>>>>     direct).
>>>>
>>>> - btrfs_finish_one_ordered()
>>>>     This is where btrfs inserts its ordered extent into the subvolume
>>>>     tree.
>>>>     This happens when a range of pages finishes its writeback.
>>>>
>>>> Then here comes the fsstress log for inode 283 (no btrfs root number):
>>>>
>>>> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
>>>> [307200,0]
>>>> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
>>>> [1058819,0]
>>>> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0
>>>> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736
>>>> 585728 95
>>>> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
>>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320
>>>> 1457078] return 25, fallback to stat()
>>>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320
>>>> 1457078] return 25, fallback to stat()
>>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>>>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496
>>>> 1457078] return 25, fallback to stat()
>>>> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0
>>>> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0
>>>> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t
>>>> 3512660 81075 0
>>>> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2
>>>> 308134 1763236 856 3593735] [5603798,59420] 0
>>>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976
>>>> 5663218]t 1361821 480392 0
>>>> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248]
>>>> -> d0/f5[293 1 0 0 1872 2678784] [942080,53248]
>>>>
>>>> Note one thing, there is no direct/buffered write into inode 283 offset
>>>> 8192.
>>>>
>>>> But from the trace events for root 5 inode 283:
>>>>
>>>>    btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644)
>>>> len=90112(88327)
>>>>    btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622)
>>>> len=61440(56456)
>>>>    btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121)
>>>> len=12288(7712)
>>>>
>>>>    btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192)
>>>> len=73728(73728) <<<<<
>>>>
>>>>    btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824)
>>>> len=16384(16384)
>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
>>>>    btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864)
>>>> len=36864(36864)
>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
>>>>    btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000)
>>>> len=61440(60139)
>>>>    btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798)
>>>> len=61440(59420)
>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440
>>>>
>>>> Note that phantom direct IO call, which is in the corrupted range.
>>>>
>>>> If paired with fsstress, that phantom write happens between the two
>>>> operations:
>>>>
>>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>>>
>>> Just to be more accurate, there is a 0/33 operation, which is:
>>>
>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320
>>> 1457078] return 25, fallback to stat()
>>> 0/33: awrite - io_getevents failed -4
>>>
>>> The failed one doesn't have inode number thus it didn't get caught by grep.
>>>
>>> Return value -4 means -INTR, not sure who sent the interruption.
>>> But if this interruption happens before the IO finished, we can call
>>> free() on the buffer, and if we're unlucky enough, the freed memory can
>>> be re-allocated for some other usage, thus modifying the pages before
>>> the writeback finished.
>>>
>>> I think this is the direct cause of the data corruption, page
>>> modification before direct IO finished.
>>>
>>> But unfortunately I still didn't get why the interruption can happen,
>>> nor how can we handle such interruption?
>>> (I guess just retry?)
>>
>> It's because you are mixing aio/io_uring, and the default settings for
>> io_uring is to use signal based notifications for queueing task_work.
>> This then causes a spurious -EINTR, which stops your io_getevents()
>> wait. Looks like this is a bug in fsstress, it should just retry the
>> wait if this happens. You can also configure the ring to not use signal
>> based notifications, but that bug needs fixing regardless.
>
> Something like this will probably fix it.
>
>
> diff --git a/ltp/fsstress.c b/ltp/fsstress.c
> index 6641a525fe5d..05fbfd3f8cf8 100644
> --- a/ltp/fsstress.c
> +++ b/ltp/fsstress.c
> @@ -2072,6 +2072,23 @@ void inode_info(char *str, size_t sz, struct stat64 *s, int verbose)
>   			 (long long) s->st_blocks, (long long) s->st_size);
>   }
>
> +static int io_get_single_event(struct io_event *event)
> +{
> +	int ret;
> +
> +	do {
> +		/*
> +		 * We can get -EINTR if competing with io_uring using signal
> +		 * based notifications. For that case, just retry the wait.
> +		 */
> +		ret = io_getevents(io_ctx, 1, 1, event, NULL);
> +		if (ret != -EINTR)
> +			break;
> +	} while (1);
> +
> +	return ret;
> +}
> +
>   void
>   afsync_f(opnum_t opno, long r)
>   {
> @@ -2111,7 +2128,7 @@ afsync_f(opnum_t opno, long r)
>   		close(fd);
>   		return;
>   	}
> -	if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) {
> +	if ((e = io_get_single_event(&event)) != 1) {
>   		if (v)
>   			printf("%d/%lld: afsync - io_getevents failed %d\n",
>   			       procid, opno, e);
> @@ -2220,10 +2237,10 @@ do_aio_rw(opnum_t opno, long r, int flags)
>   	if ((e = io_submit(io_ctx, 1, iocbs)) != 1) {
>   		if (v)
>   			printf("%d/%lld: %s - io_submit failed %d\n",
> -			       procid, opno, iswrite ? "awrite" : "aread", e);
> +		 	       procid, opno, iswrite ? "awrite" : "aread", e);
>   		goto aio_out;
>   	}
> -	if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) {
> +	if ((e = io_get_single_event(&event)) != 1) {
>   		if (v)
>   			printf("%d/%lld: %s - io_getevents failed %d\n",
>   			       procid, opno, iswrite ? "awrite" : "aread", e);
>
Exactly what I sent for fsstress:
https://lore.kernel.org/linux-btrfs/[email protected]/T/#u

Thanks,
Qu

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-21  0:38                           ` Qu Wenruo
@ 2023-08-21 14:57                             ` Jens Axboe
  2023-08-21 21:42                               ` Qu Wenruo
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2023-08-21 14:57 UTC (permalink / raw)
  To: Qu Wenruo, [email protected], Linux FS Devel, io-uring

On 8/20/23 6:38 PM, Qu Wenruo wrote:
> 
> 
> On 2023/8/20 22:11, Jens Axboe wrote:
>> On 8/20/23 7:26 AM, Jens Axboe wrote:
>>> On 8/19/23 6:22 PM, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2023/8/20 07:59, Qu Wenruo wrote:
>>>>> Hi Jens
>>>>>
>>>>> I tried more on my side to debug the situation, and found a very weird
>>>>> write behavior:
>>>>>
>>>>>       Some unexpected direct IO happened, without corresponding
>>>>>       fsstress workload.
>>>>>
>>>>> The workload is:
>>>>>
>>>>>       $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress
>>>>>
>>>>> Which I can reliably reproduce the problem locally, around 1/50
>>>>> possibility.
>>>>> In my particular case, it results data corruption at root 5 inode 283
>>>>> offset 8192.
>>>>>
>>>>> Then I added some trace points for the following functions:
>>>>>
>>>>> - btrfs_do_write_iter()
>>>>>     Two trace points, one before btrfs_direct_write(), and one
>>>>>     before btrfs_buffered_write(), outputting the aligned and unaligned
>>>>>     write range, root/inode number, type of the write (buffered or
>>>>>     direct).
>>>>>
>>>>> - btrfs_finish_one_ordered()
>>>>>     This is where btrfs inserts its ordered extent into the subvolume
>>>>>     tree.
>>>>>     This happens when a range of pages finishes its writeback.
>>>>>
>>>>> Then here comes the fsstress log for inode 283 (no btrfs root number):
>>>>>
>>>>> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
>>>>> [307200,0]
>>>>> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
>>>>> [1058819,0]
>>>>> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0
>>>>> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736
>>>>> 585728 95
>>>>> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
>>>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>>>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320
>>>>> 1457078] return 25, fallback to stat()
>>>>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320
>>>>> 1457078] return 25, fallback to stat()
>>>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>>>>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496
>>>>> 1457078] return 25, fallback to stat()
>>>>> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0
>>>>> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0
>>>>> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t
>>>>> 3512660 81075 0
>>>>> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2
>>>>> 308134 1763236 856 3593735] [5603798,59420] 0
>>>>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976
>>>>> 5663218]t 1361821 480392 0
>>>>> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248]
>>>>> -> d0/f5[293 1 0 0 1872 2678784] [942080,53248]
>>>>>
>>>>> Note one thing, there is no direct/buffered write into inode 283 offset
>>>>> 8192.
>>>>>
>>>>> But from the trace events for root 5 inode 283:
>>>>>
>>>>>    btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644)
>>>>> len=90112(88327)
>>>>>    btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622)
>>>>> len=61440(56456)
>>>>>    btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121)
>>>>> len=12288(7712)
>>>>>
>>>>>    btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192)
>>>>> len=73728(73728) <<<<<
>>>>>
>>>>>    btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824)
>>>>> len=16384(16384)
>>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
>>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
>>>>>    btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864)
>>>>> len=36864(36864)
>>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
>>>>>    btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000)
>>>>> len=61440(60139)
>>>>>    btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798)
>>>>> len=61440(59420)
>>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
>>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
>>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
>>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
>>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
>>>>>    btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440
>>>>>
>>>>> Note that phantom direct IO call, which is in the corrupted range.
>>>>>
>>>>> If paired with fsstress, that phantom write happens between the two
>>>>> operations:
>>>>>
>>>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>>>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>>>>
>>>> Just to be more accurate, there is a 0/33 operation, which is:
>>>>
>>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320
>>>> 1457078] return 25, fallback to stat()
>>>> 0/33: awrite - io_getevents failed -4
>>>>
>>>> The failed one doesn't have inode number thus it didn't get caught by grep.
>>>>
>>>> Return value -4 means -INTR, not sure who sent the interruption.
>>>> But if this interruption happens before the IO finished, we can call
>>>> free() on the buffer, and if we're unlucky enough, the freed memory can
>>>> be re-allocated for some other usage, thus modifying the pages before
>>>> the writeback finished.
>>>>
>>>> I think this is the direct cause of the data corruption, page
>>>> modification before direct IO finished.
>>>>
>>>> But unfortunately I still didn't get why the interruption can happen,
>>>> nor how can we handle such interruption?
>>>> (I guess just retry?)
>>>
>>> It's because you are mixing aio/io_uring, and the default settings for
>>> io_uring is to use signal based notifications for queueing task_work.
>>> This then causes a spurious -EINTR, which stops your io_getevents()
>>> wait. Looks like this is a bug in fsstress, it should just retry the
>>> wait if this happens. You can also configure the ring to not use signal
>>> based notifications, but that bug needs fixing regardless.
>>
>> Something like this will probably fix it.
>>
>>
>> diff --git a/ltp/fsstress.c b/ltp/fsstress.c
>> index 6641a525fe5d..05fbfd3f8cf8 100644
>> --- a/ltp/fsstress.c
>> +++ b/ltp/fsstress.c
>> @@ -2072,6 +2072,23 @@ void inode_info(char *str, size_t sz, struct stat64 *s, int verbose)
>>                (long long) s->st_blocks, (long long) s->st_size);
>>   }
>>
>> +static int io_get_single_event(struct io_event *event)
>> +{
>> +    int ret;
>> +
>> +    do {
>> +        /*
>> +         * We can get -EINTR if competing with io_uring using signal
>> +         * based notifications. For that case, just retry the wait.
>> +         */
>> +        ret = io_getevents(io_ctx, 1, 1, event, NULL);
>> +        if (ret != -EINTR)
>> +            break;
>> +    } while (1);
>> +
>> +    return ret;
>> +}
>> +
>>   void
>>   afsync_f(opnum_t opno, long r)
>>   {
>> @@ -2111,7 +2128,7 @@ afsync_f(opnum_t opno, long r)
>>           close(fd);
>>           return;
>>       }
>> -    if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) {
>> +    if ((e = io_get_single_event(&event)) != 1) {
>>           if (v)
>>               printf("%d/%lld: afsync - io_getevents failed %d\n",
>>                      procid, opno, e);
>> @@ -2220,10 +2237,10 @@ do_aio_rw(opnum_t opno, long r, int flags)
>>       if ((e = io_submit(io_ctx, 1, iocbs)) != 1) {
>>           if (v)
>>               printf("%d/%lld: %s - io_submit failed %d\n",
>> -                   procid, opno, iswrite ? "awrite" : "aread", e);
>> +                    procid, opno, iswrite ? "awrite" : "aread", e);
>>           goto aio_out;
>>       }
>> -    if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) {
>> +    if ((e = io_get_single_event(&event)) != 1) {
>>           if (v)
>>               printf("%d/%lld: %s - io_getevents failed %d\n",
>>                      procid, opno, iswrite ? "awrite" : "aread", e);
>>
> Exactly what I sent for fsstress:
> https://lore.kernel.org/linux-btrfs/[email protected]/T/#u

It's not really the same, as you only fixed the one io_getevents() call.
What happens if the other one (the afsync wait) gets -EINTR and aborts?
Now we do a rw operation and the first event returned is the stale one
from the fsync.

You should not just fix up the one call site that you happened to hit,
fix up both of them.
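
To make the hazard concrete: every one of these waits shares the same
io_ctx, so a completion abandoned by one call site gets handed to
whichever wait comes next. Neither patch in this thread goes this far,
but a defensive sketch that only accepts the completion for the iocb it
just submitted (and retries past anything stale) could look like:

```c
#include <errno.h>
#include <libaio.h>

/*
 * Sketch only: wait for @iocb specifically on @ctx, retrying across
 * spurious -EINTR and skipping stale events (e.g. an abandoned fsync
 * completion). Error handling trimmed for brevity.
 */
static int io_wait_for_iocb(io_context_t ctx, struct iocb *iocb,
			    struct io_event *event)
{
	int ret;

	for (;;) {
		ret = io_getevents(ctx, 1, 1, event, NULL);
		if (ret == -EINTR)
			continue;	/* spurious signal, retry the wait */
		if (ret != 1)
			return ret;	/* no event or a real error */
		if (event->obj == iocb)
			return 1;	/* the completion we submitted */
		/* stale event left behind by an earlier, aborted wait */
	}
}
```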

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: Possible io_uring related race leads to btrfs data csum mismatch
  2023-08-21 14:57                             ` Jens Axboe
@ 2023-08-21 21:42                               ` Qu Wenruo
  0 siblings, 0 replies; 22+ messages in thread
From: Qu Wenruo @ 2023-08-21 21:42 UTC (permalink / raw)
  To: Jens Axboe, [email protected], Linux FS Devel, io-uring



On 2023/8/21 22:57, Jens Axboe wrote:
> On 8/20/23 6:38 PM, Qu Wenruo wrote:
>>
>>
>> On 2023/8/20 22:11, Jens Axboe wrote:
>>> On 8/20/23 7:26 AM, Jens Axboe wrote:
>>>> On 8/19/23 6:22 PM, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2023/8/20 07:59, Qu Wenruo wrote:
>>>>>> Hi Jens
>>>>>>
>>>>>> I tried more on my side to debug the situation, and found a very weird
>>>>>> write behavior:
>>>>>>
>>>>>>        Some unexpected direct IO happened, without corresponding
>>>>>>        fsstress workload.
>>>>>>
>>>>>> The workload is:
>>>>>>
>>>>>>        $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress
>>>>>>
>>>>>> Which I can reliably reproduce the problem locally, around 1/50
>>>>>> possibility.
>>>>>> In my particular case, it results data corruption at root 5 inode 283
>>>>>> offset 8192.
>>>>>>
>>>>>> Then I added some trace points for the following functions:
>>>>>>
>>>>>> - btrfs_do_write_iter()
>>>>>>      Two trace points, one before btrfs_direct_write(), and one
>>>>>>      before btrfs_buffered_write(), outputting the aligned and unaligned
>>>>>>      write range, root/inode number, type of the write (buffered or
>>>>>>      direct).
>>>>>>
>>>>>> - btrfs_finish_one_ordered()
>>>>>>      This is where btrfs inserts its ordered extent into the subvolume
>>>>>>      tree.
>>>>>>      This happens when a range of pages finishes its writeback.
>>>>>>
>>>>>> Then here comes the fsstress log for inode 283 (no btrfs root number):
>>>>>>
>>>>>> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
>>>>>> [307200,0]
>>>>>> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
>>>>>> [1058819,0]
>>>>>> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0
>>>>>> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736
>>>>>> 585728 95
>>>>>> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
>>>>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>>>>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320
>>>>>> 1457078] return 25, fallback to stat()
>>>>>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320
>>>>>> 1457078] return 25, fallback to stat()
>>>>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>>>>>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496
>>>>>> 1457078] return 25, fallback to stat()
>>>>>> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0
>>>>>> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0
>>>>>> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t
>>>>>> 3512660 81075 0
>>>>>> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2
>>>>>> 308134 1763236 856 3593735] [5603798,59420] 0
>>>>>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976
>>>>>> 5663218]t 1361821 480392 0
>>>>>> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248]
>>>>>> -> d0/f5[293 1 0 0 1872 2678784] [942080,53248]
>>>>>>
>>>>>> Note one thing, there is no direct/buffered write into inode 283 offset
>>>>>> 8192.
>>>>>>
>>>>>> But from the trace events for root 5 inode 283:
>>>>>>
>>>>>>     btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644)
>>>>>> len=90112(88327)
>>>>>>     btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622)
>>>>>> len=61440(56456)
>>>>>>     btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121)
>>>>>> len=12288(7712)
>>>>>>
>>>>>>     btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192)
>>>>>> len=73728(73728) <<<<<
>>>>>>
>>>>>>     btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824)
>>>>>> len=16384(16384)
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
>>>>>>     btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864)
>>>>>> len=36864(36864)
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
>>>>>>     btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000)
>>>>>> len=61440(60139)
>>>>>>     btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798)
>>>>>> len=61440(59420)
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
>>>>>>     btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440
>>>>>>
>>>>>> Note that phantom direct IO call, which is in the corrupted range.
>>>>>>
>>>>>> If paired with fsstress, that phantom write happens between the two
>>>>>> operations:
>>>>>>
>>>>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>>>>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>>>>>
>>>>> Just to be more accurate, there is a 0/33 operation, which is:
>>>>>
>>>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320
>>>>> 1457078] return 25, fallback to stat()
>>>>> 0/33: awrite - io_getevents failed -4
>>>>>
>>>>> The failed one doesn't have inode number thus it didn't get caught by grep.
>>>>>
>>>>> Return value -4 means -INTR, not sure who sent the interruption.
>>>>> But if this interruption happens before the IO finished, we can call
>>>>> free() on the buffer, and if we're unlucky enough, the freed memory can
>>>>> be re-allocated for some other usage, thus modifying the pages before
>>>>> the writeback finished.
>>>>>
>>>>> I think this is the direct cause of the data corruption, page
>>>>> modification before direct IO finished.
>>>>>
>>>>> But unfortunately I still didn't get why the interruption can happen,
>>>>> nor how can we handle such interruption?
>>>>> (I guess just retry?)
>>>>
>>>> It's because you are mixing aio/io_uring, and the default settings for
>>>> io_uring is to use signal based notifications for queueing task_work.
>>>> This then causes a spurious -EINTR, which stops your io_getevents()
>>>> wait. Looks like this is a bug in fsstress, it should just retry the
>>>> wait if this happens. You can also configure the ring to not use signal
>>>> based notifications, but that bug needs fixing regardless.
>>>
>>> Something like this will probably fix it.
>>>
>>>
>>> diff --git a/ltp/fsstress.c b/ltp/fsstress.c
>>> index 6641a525fe5d..05fbfd3f8cf8 100644
>>> --- a/ltp/fsstress.c
>>> +++ b/ltp/fsstress.c
>>> @@ -2072,6 +2072,23 @@ void inode_info(char *str, size_t sz, struct stat64 *s, int verbose)
>>>                 (long long) s->st_blocks, (long long) s->st_size);
>>>    }
>>>
>>> +static int io_get_single_event(struct io_event *event)
>>> +{
>>> +    int ret;
>>> +
>>> +    do {
>>> +        /*
>>> +         * We can get -EINTR if competing with io_uring using signal
>>> +         * based notifications. For that case, just retry the wait.
>>> +         */
>>> +        ret = io_getevents(io_ctx, 1, 1, event, NULL);
>>> +        if (ret != -EINTR)
>>> +            break;
>>> +    } while (1);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>    void
>>>    afsync_f(opnum_t opno, long r)
>>>    {
>>> @@ -2111,7 +2128,7 @@ afsync_f(opnum_t opno, long r)
>>>            close(fd);
>>>            return;
>>>        }
>>> -    if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) {
>>> +    if ((e = io_get_single_event(&event)) != 1) {
>>>            if (v)
>>>                printf("%d/%lld: afsync - io_getevents failed %d\n",
>>>                       procid, opno, e);
>>> @@ -2220,10 +2237,10 @@ do_aio_rw(opnum_t opno, long r, int flags)
>>>        if ((e = io_submit(io_ctx, 1, iocbs)) != 1) {
>>>            if (v)
>>>                printf("%d/%lld: %s - io_submit failed %d\n",
>>> -                   procid, opno, iswrite ? "awrite" : "aread", e);
>>> +                    procid, opno, iswrite ? "awrite" : "aread", e);
>>>            goto aio_out;
>>>        }
>>> -    if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) {
>>> +    if ((e = io_get_single_event(&event)) != 1) {
>>>            if (v)
>>>                printf("%d/%lld: %s - io_getevents failed %d\n",
>>>                       procid, opno, iswrite ? "awrite" : "aread", e);
>>>
>> Exactly what I sent for fsstress:
>> https://lore.kernel.org/linux-btrfs/[email protected]/T/#u
>
> It's not really, as you only did the one case of io_getevents(). What
> happens if the other one gets EINTR and aborts, now we do a rw operation
> and the first event returned is the one from the fsync?

Oh my bad, I forgot there is another one.

I'll fix it in the next update.

Thanks,
Qu
>
> You should not just fix up the one that you happened to hit, fix up both
> of them.
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-08-21 21:42 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-08-16  6:52 Possible io_uring related race leads to btrfs data csum mismatch Qu Wenruo
2023-08-16 14:33 ` Jens Axboe
2023-08-16 14:49   ` Jens Axboe
2023-08-16 21:46   ` Qu Wenruo
2023-08-16 22:28     ` Jens Axboe
2023-08-17  1:05       ` Qu Wenruo
2023-08-17  1:12         ` Jens Axboe
2023-08-17  1:19           ` Qu Wenruo
2023-08-17  1:23             ` Jens Axboe
2023-08-17  1:31               ` Qu Wenruo
2023-08-17  1:32                 ` Jens Axboe
2023-08-19 23:59                   ` Qu Wenruo
2023-08-20  0:22                     ` Qu Wenruo
2023-08-20 13:26                       ` Jens Axboe
2023-08-20 14:11                         ` Jens Axboe
2023-08-20 18:18                           ` Matthew Wilcox
2023-08-20 18:40                             ` Jens Axboe
2023-08-21  0:38                           ` Qu Wenruo
2023-08-21 14:57                             ` Jens Axboe
2023-08-21 21:42                               ` Qu Wenruo
2023-08-16 22:36     ` Jens Axboe
2023-08-17  0:40       ` Qu Wenruo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox