public inbox for [email protected]
 help / color / mirror / Atom feed
From: Hannes Reinecke <[email protected]>
To: John Garry <[email protected]>,
	[email protected], [email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected]
Cc: [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected],
	[email protected], [email protected], [email protected],
	Himanshu Madhani <[email protected]>
Subject: Re: [PATCH v7 4/9] block: Add core atomic write support
Date: Mon, 3 Jun 2024 11:26:46 +0200	[thread overview]
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

On 6/2/24 16:09, John Garry wrote:
> Add atomic write support, as follows:
> - add helper functions to get request_queue atomic write limits
> - report request_queue atomic write support limits to sysfs and update Doc
> - support to safely merge atomic writes
> - deal with splitting atomic writes
> - misc helper functions
> - add a per-request atomic write flag
> 
> New request_queue limits are added, as follows:
> - atomic_write_hw_max is set by the block driver and is the maximum length
>    of an atomic write which the device may support. It is not
>    necessarily a power-of-2.
> - atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
>    max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
>    and atomic_write_max_sectors would be the limit on a merged atomic write
>    request size. This value is not capped at max_sectors, as the value in
>    max_sectors can be controlled from userspace, and it would only cause
>    trouble if userspace could limit atomic_write_unit_max_bytes and the
>    other atomic write limits.
> - atomic_write_hw_unit_{min,max} are set by the block driver and are the
>    min/max length of an atomic write unit which the device may support. They
>    both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
>    the same value as atomic_write_hw_max.
> - atomic_write_unit_{min,max} are derived from
>    atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
>    Both min and max values must be a power-of-2.
> - atomic_write_hw_boundary is set by the block driver. If non-zero, it
>    indicates an LBA space boundary at which an atomic write straddles no
>    longer is atomically executed by the disk. The value must be a
>    power-of-2. Note that it would be acceptable to enforce a rule that
>    atomic_write_hw_boundary_sectors is a multiple of
>    atomic_write_hw_unit_max, but the resultant code would be more
>    complicated.
> 
> All atomic writes limits are by default set 0 to indicate no atomic write
> support. Even though it is assumed by Linux that a logical block can always
> be atomically written, we ignore this as it is not of particular interest.
> Stacked devices are just not supported either for now.
> 
> An atomic write must always be submitted to the block driver as part of a
> single request. As such, only a single BIO must be submitted to the block
> layer for an atomic write. When a single atomic write BIO is submitted, it
> cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
> by the maximum guaranteed BIO size which will not be required to be split.
> This max size is calculated by request_queue max segments and the number
> of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
> issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
> segment containing PAGE_SIZE of data, apart from the first+last, which each
> can fit logical block size of data. The first+last will be LBS
> length/aligned as we rely on direct IO alignment rules also.
> 
> New sysfs files are added to report the following atomic write limits:
> - atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
> 				bytes
> - atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
> 				bytes
> - atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
> 				bytes
> - atomic_write_max_bytes      - same as atomic_write_max_sectors in bytes
> 
> Atomic writes may only be merged with other atomic writes and only under
> the following conditions:
> - total resultant request length <= atomic_write_max_bytes
> - the merged write does not straddle a boundary
> 
> Helper function bdev_can_atomic_write() is added to indicate whether
> atomic writes may be issued to a bdev. If a bdev is a partition, the
> partition start must be aligned with both atomic_write_unit_min_sectors
> and atomic_write_hw_boundary_sectors.
> 
> FSes will rely on the block layer to validate that an atomic write BIO
> submitted will be of valid size, so add blk_validate_atomic_write_op_size()
> for this purpose. Userspace expects an atomic write which is of invalid
> size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
> BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
> invalid size BIO.
> 
> Flag REQ_ATOMIC is used for indicating an atomic write.
> 
> Co-developed-by: Himanshu Madhani <[email protected]>
> Signed-off-by: Himanshu Madhani <[email protected]>
> Signed-off-by: John Garry <[email protected]>
> ---
>   Documentation/ABI/stable/sysfs-block | 53 ++++++++++++++++
>   block/blk-core.c                     | 19 ++++++
>   block/blk-merge.c                    | 95 +++++++++++++++++++++++++++-
>   block/blk-settings.c                 | 52 +++++++++++++++
>   block/blk-sysfs.c                    | 33 ++++++++++
>   block/blk.h                          |  3 +
>   include/linux/blk_types.h            |  8 ++-
>   include/linux/blkdev.h               | 54 ++++++++++++++++
>   8 files changed, 315 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index 831f19a32e08..cea8856f798d 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -21,6 +21,59 @@ Description:
>   		device is offset from the internal allocation unit's
>   		natural alignment.
>   
> +What:		/sys/block/<disk>/atomic_write_max_bytes
> +Date:		February 2024
> +Contact:	Himanshu Madhani <[email protected]>
> +Description:
> +		[RO] This parameter specifies the maximum atomic write
> +		size reported by the device. This parameter is relevant
> +		for merging of writes, where a merged atomic write
> +		operation must not exceed this number of bytes.
> +		This parameter may be greater than the value in
> +		atomic_write_unit_max_bytes as
> +		atomic_write_unit_max_bytes will be rounded down to a
> +		power-of-two and atomic_write_unit_max_bytes may also be
> +		limited by some other queue limits, such as max_segments.
> +		This parameter - along with atomic_write_unit_min_bytes
> +		and atomic_write_unit_max_bytes - will not be larger than
> +		max_hw_sectors_kb, but may be larger than max_sectors_kb.
> +
> +
> +What:		/sys/block/<disk>/atomic_write_unit_min_bytes
> +Date:		February 2024
> +Contact:	Himanshu Madhani <[email protected]>
> +Description:
> +		[RO] This parameter specifies the smallest block which can
> +		be written atomically with an atomic write operation. All
> +		atomic write operations must begin at a
> +		atomic_write_unit_min boundary and must be multiples of
> +		atomic_write_unit_min. This value must be a power-of-two.
> +
> +
> +What:		/sys/block/<disk>/atomic_write_unit_max_bytes
> +Date:		February 2024
> +Contact:	Himanshu Madhani <[email protected]>
> +Description:
> +		[RO] This parameter defines the largest block which can be
> +		written atomically with an atomic write operation. This
> +		value must be a multiple of atomic_write_unit_min and must
> +		be a power-of-two. This value will not be larger than
> +		atomic_write_max_bytes.
> +
> +
> +What:		/sys/block/<disk>/atomic_write_boundary_bytes
> +Date:		February 2024
> +Contact:	Himanshu Madhani <[email protected]>
> +Description:
> +		[RO] A device may need to internally split an atomic write I/O
> +		which straddles a given logical block address boundary. This
> +		parameter specifies the size in bytes of the atomic boundary if
> +		one is reported by the device. This value must be a
> +		power-of-two and at least the size as in
> +		atomic_write_unit_max_bytes.
> +		Any attempt to merge atomic write I/Os must not result in a
> +		merged I/O which crosses this boundary (if any).
> +
>   
>   What:		/sys/block/<disk>/diskseq
>   Date:		February 2021
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 82c3ae22d76d..d9f58fe71758 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -174,6 +174,8 @@ static const struct {
>   	/* Command duration limit device-side timeout */
>   	[BLK_STS_DURATION_LIMIT]	= { -ETIME, "duration limit exceeded" },
>   
> +	[BLK_STS_INVAL]		= { -EINVAL,	"invalid" },
> +
>   	/* everything else not covered above: */
>   	[BLK_STS_IOERR]		= { -EIO,	"I/O" },
>   };
> @@ -739,6 +741,18 @@ void submit_bio_noacct_nocheck(struct bio *bio)
>   		__submit_bio_noacct(bio);
>   }
>   
> +static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
> +						 struct bio *bio)
> +{
> +	if (bio->bi_iter.bi_size > queue_atomic_write_unit_max_bytes(q))
> +		return BLK_STS_INVAL;
> +
> +	if (bio->bi_iter.bi_size % queue_atomic_write_unit_min_bytes(q))
> +		return BLK_STS_INVAL;
> +
> +	return BLK_STS_OK;
> +}
> +
>   /**
>    * submit_bio_noacct - re-submit a bio to the block device layer for I/O
>    * @bio:  The bio describing the location in memory and on the device.
> @@ -797,6 +811,11 @@ void submit_bio_noacct(struct bio *bio)
>   	switch (bio_op(bio)) {
>   	case REQ_OP_READ:
>   	case REQ_OP_WRITE:
> +		if (bio->bi_opf & REQ_ATOMIC) {
> +			status = blk_validate_atomic_write_op_size(q, bio);
> +			if (status != BLK_STS_OK)
> +				goto end_io;
> +		}
>   		break;
>   	case REQ_OP_FLUSH:
>   		/*
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 8957e08e020c..ad07759ca147 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -18,6 +18,46 @@
>   #include "blk-rq-qos.h"
>   #include "blk-throttle.h"
>   
> +/*
> + * rq_straddles_atomic_write_boundary - check for boundary violation
> + * @rq: request to check
> + * @front: data size to be appended to front
> + * @back: data size to be appended to back
> + *
> + * Determine whether merging a request or bio into another request will result
> + * in a merged request which straddles an atomic write boundary.
> + *
> + * The value @front_adjust is the data which would be appended to the front of
> + * @rq, while the value @back_adjust is the data which would be appended to the
> + * back of @rq. Callers will typically only have either @front_adjust or
> + * @back_adjust as non-zero.
> + *
> + */
> +static bool rq_straddles_atomic_write_boundary(struct request *rq,
> +					unsigned int front_adjust,
> +					unsigned int back_adjust)
> +{
> +	unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
> +	u64 mask, start_rq_pos, end_rq_pos;
> +
> +	if (!boundary)
> +		return false;
> +
> +	start_rq_pos = blk_rq_pos(rq) << SECTOR_SHIFT;
> +	end_rq_pos = start_rq_pos + blk_rq_bytes(rq) - 1;
> +
> +	start_rq_pos -= front_adjust;
> +	end_rq_pos += back_adjust;
> +
> +	mask = ~(boundary - 1);
> +
> +	/* Top bits are different, so crossed a boundary */
> +	if ((start_rq_pos & mask) != (end_rq_pos & mask))
> +		return true;
> +
> +	return false;
> +}

But isn't that precisely what 'chunk_sectors' is doing?
IE ensuring that requests never cross that boundary?

Q1: Shouldn't we rather use/modify/adapt chunk_sectors for this thing?
Q2: If we don't, shouldn't we align the atomic write boundary to the 
chunk_sectors setting to ensure both match up?

Cheers,

Hannes


  reply	other threads:[~2024-06-03  9:26 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-02 14:09 [PATCH v7 0/9] block atomic writes John Garry
2024-06-02 14:09 ` [PATCH v7 1/9] block: Pass blk_queue_get_max_sectors() a request pointer John Garry
2024-06-02 14:09 ` [PATCH v7 2/9] fs: Initial atomic write support John Garry
2024-06-05  8:30   ` Christoph Hellwig
2024-06-05 10:48     ` John Garry
2024-06-06  5:41       ` Christoph Hellwig
2024-06-06  6:38         ` John Garry
2024-06-02 14:09 ` [PATCH v7 3/9] fs: Add initial atomic write support info to statx John Garry
2024-06-02 14:09 ` [PATCH v7 4/9] block: Add core atomic write support John Garry
2024-06-03  9:26   ` Hannes Reinecke [this message]
2024-06-03 11:38     ` John Garry
2024-06-03 12:31       ` Hannes Reinecke
2024-06-03 13:29         ` John Garry
2024-06-05  8:32           ` Christoph Hellwig
2024-06-05 11:21             ` John Garry
2024-06-06  5:44               ` Christoph Hellwig
2024-06-05  8:31         ` Christoph Hellwig
2024-06-02 14:09 ` [PATCH v7 5/9] block: Add atomic write support for statx John Garry
2024-06-02 14:09 ` [PATCH v7 6/9] block: Add fops atomic write support John Garry
2024-06-02 14:09 ` [PATCH v7 7/9] scsi: sd: Atomic " John Garry
2024-06-02 14:09 ` [PATCH v7 8/9] scsi: scsi_debug: " John Garry
2024-06-02 14:09 ` [PATCH v7 9/9] nvme: " John Garry
2024-06-07  6:16 ` [PATCH v7 0/9] block atomic writes John Garry

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    [email protected] \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox