From: Damien Le Moal <[email protected]>
To: "[email protected]" <[email protected]>
Cc: Kanchan Joshi <[email protected]>,
"[email protected]" <[email protected]>,
"[email protected]" <[email protected]>,
"[email protected]" <[email protected]>,
"[email protected]" <[email protected]>,
"[email protected]" <[email protected]>,
"[email protected]" <[email protected]>,
"[email protected]" <[email protected]>,
"[email protected]" <[email protected]>,
"[email protected]" <[email protected]>,
"[email protected]" <[email protected]>,
"[email protected]" <[email protected]>,
"[email protected]" <[email protected]>,
"[email protected]" <[email protected]>
Subject: Re: [PATCH v2 0/2] zone-append support in io-uring and aio
Date: Fri, 26 Jun 2020 06:56:17 +0000
Message-ID: <CY4PR04MB375154780F0B8073AB83DA9CE7930@CY4PR04MB3751.namprd04.prod.outlook.com>
In-Reply-To: [email protected]
On 2020/06/26 15:37, [email protected] wrote:
> On 26.06.2020 03:11, Damien Le Moal wrote:
>> On 2020/06/26 2:18, Kanchan Joshi wrote:
>>> [Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox]
>>>
>>> This patchset enables zone-append using io-uring/linux-aio, on block IO path.
>>> Purpose is to provide zone-append consumption ability to applications which are
>>> using zoned-block-device directly.
>>>
>>> The application may specify RWF_ZONE_APPEND flag with write when it wants to
>>> send zone-append. RWF_* flags work with a certain subset of APIs e.g. uring,
>>> aio, and pwritev2. An error is reported if zone-append is requested using
>>> pwritev2. It is not in the scope of this patchset to support pwritev2 or any
>>> other sync write API for reasons described later.
>>>
>>> Zone-append completion result --->
>>> With zone-append, where write took place can only be known after completion.
>>> So apart from the usual return value of write, an additional means is needed to obtain
>>> the actual written location.
>>>
>>> In aio, this is returned to application using res2 field of io_event -
>>>
>>> struct io_event {
>>> __u64 data; /* the data field from the iocb */
>>> __u64 obj; /* what iocb this event came from */
>>> __s64 res; /* result code for this event */
>>> __s64 res2; /* secondary result */
>>> };
>>>
>>> In io-uring, cqe->flags is repurposed for zone-append result.
>>>
>>> struct io_uring_cqe {
>>> __u64 user_data; /* sqe->data submission passed back */
>>> __s32 res; /* result code for this event */
>>> __u32 flags;
>>> };
>>>
>>> Since 32 bit flags is not sufficient, we choose to return zone-relative offset
>>> in sector/512b units. This can cover zone-size represented by chunk_sectors.
>>> Applications will have the trouble to combine this with zone start to know
>>> disk-relative offset. But if more bits are obtained by pulling from res field
>>> that too would compel application to interpret res field differently, and it
>>> seems more painstaking than the former option.
>>> To keep uniformity, even with aio, zone-relative offset is returned.
>>
>> I am really not a fan of this, to say the least. The input is byte offset, the
>> output is 512B relative sector count... Arg... We really cannot do better than
>> that ?
>>
>> At the very least, byte relative offset ? The main reason is that this is
>> _somewhat_ acceptable for raw block device accesses since the "sector"
>> abstraction has a clear meaning, but once we add iomap/zonefs async zone append
>> support, we really will want to have byte unit as the interface is regular
>> files, not block device file. We could argue that 512B sector unit is still
>> around even for files (e.g. block counts in file stat). But the different unit
>> for input and output of one operation is really ugly. This is not nice for the user.
>>
>
> You can refer to the discussion with Jens, Pavel and Alex on the uring
> interface. With the bits we have and considering the maximum zone size
> supported, there is no space for a byte relative offset. We can take
> some bits from cqe->res, but we were afraid this is not very
> future-proof. Do you have a better idea?

If you can take 8 bits from cqe->res, that gives you 40 bits together with the
32 bits of cqe->flags, enough to support byte relative offsets for any zone
size defined as a number of 512B sectors using an unsigned int. Max zone size
is 2^31 sectors in that case, so 2^40 bytes. Unless I am already too tired and
my math is failing me...

Zone size is defined by chunk_sectors, which is used for raid and software
raids too. This has been an unsigned int forever. I do not see the need for
changing this to a 64-bit value anytime soon, if ever. A raid with a stripe
size larger than 1TB does not really make any sense. Same for zone size...
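
To make the arithmetic above concrete, here is a minimal sketch of how
userspace could rebuild a 40-bit zone-relative byte offset from 8 borrowed
bits plus the 32 bits of cqe->flags. The helper names, the macros, and the
assumption that the 8 high bits are handed over as a separate value are all
hypothetical: nothing in this thread fixes where in cqe->res those bits would
live, and no such helper exists in liburing.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT		9			/* 512B sectors */
#define ZONE_MAX_SECTORS	(UINT64_C(1) << 31)	/* max zone size cited above */
#define ZONE_MAX_BYTES		(ZONE_MAX_SECTORS << SECTOR_SHIFT)	/* 2^40 */

/*
 * Hypothetical helper: combine 8 high bits (assumed to be borrowed from
 * cqe->res) with the 32 low bits (assumed to be carried in cqe->flags)
 * into a zone-relative byte offset.
 */
static inline uint64_t zone_append_offset(uint8_t hi8, uint32_t lo32)
{
	return ((uint64_t)hi8 << 32) | lo32;
}

/*
 * Hypothetical inverse, i.e. what the producing side would do: split a
 * zone-relative byte offset into its 8 high bits and 32 low bits.
 */
static inline void zone_append_split(uint64_t off, uint8_t *hi8, uint32_t *lo32)
{
	assert(off < ZONE_MAX_BYTES);	/* must fit in 40 bits */
	*hi8 = (uint8_t)(off >> 32);
	*lo32 = (uint32_t)off;
}
```

With these definitions, the last byte of the largest possible zone
(ZONE_MAX_BYTES - 1) splits into hi8 = 0xff and lo32 = 0xffffffff, so every
byte offset inside such a zone fits in the 40 bits.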
>
>
>>>
>>> Append using io_uring fixed-buffer --->
>>> This is flagged as not-supported at the moment. Reason being, for fixed-buffer
>>> io-uring sends iov_iter of bvec type. But current append-infra in block-layer
>>> does not support such iov_iter.
>>>
>>> Block IO vs File IO --->
>>> For now, the user zone-append interface is supported only for zoned-block-device.
>>> Regular files/block-devices are not supported. Regular file-system (e.g. F2FS)
>>> will not need this anyway, because zone peculiarities are abstracted within FS.
>>> At this point, ZoneFS also likes to use append implicitly rather than explicitly.
>>> But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check
>>> allowing-only-block-device should be changed.
>>
>> Sure, but I think the interface is still a problem. I am not super happy about
>> the 512B sector unit. Zonefs will be the only file system that will be impacted
>> since other normal POSIX file system will not have zone append interface for
>> users. So this is a limited problem. Still, even for raw block device files
>> accesses, POSIX system calls use Byte unit everywhere. Let's try to use that.
>>
>> For aio, it is easy since res2 is unsigned long long. For io_uring, as discussed
>> already, we can still take 8 bits from the cqe res. All you need is to add a small
>> helper function in userspace iouring.h to simplify the work of the application
>> to get that result.
>
> Ok. See above. We can do this.
>
> Jens: Do you see this as a problem in the future?
>
> [...]
>
> Javier
>
--
Damien Le Moal
Western Digital Research