* [PATCH v2] io_uring: add support for probing opcodes
@ 2020-01-17 2:58 Jens Axboe
2020-01-17 7:42 ` Mark Papadakis
2020-01-17 17:15 ` Jeff Moyer
0 siblings, 2 replies; 6+ messages in thread
From: Jens Axboe @ 2020-01-17 2:58 UTC (permalink / raw)
To: io-uring; +Cc: Stefan Metzmacher
The application currently has no way of knowing if a given opcode is
supported or not without having to try and issue one and see if we get
-EINVAL or not. And even this approach is fraught with peril, as maybe
we're getting -EINVAL due to some fields being missing, or maybe it's
just not that easy to issue that particular command without doing some
other leg work in terms of setup first.
This adds IORING_REGISTER_PROBE, which fills in a structure with info
on what it supported or not. This will work even with sparse opcode
fields, which may happen in the future or even today if someone
backports specific features to older kernels.
Signed-off-by: Jens Axboe <[email protected]>
---
Changes since v1:
- Use io_op_def->not_supported flag. Makes it easier for backports
since they only need to change one location to ensure everything
is sane, and io_op_def always needs changing anyway.
- Cap 'nr_args' at OP_LAST
diff --git a/fs/io_uring.c b/fs/io_uring.c
index ee14a0fcd59f..b20587bda5d4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -561,6 +561,8 @@ struct io_op_def {
unsigned hash_reg_file : 1;
/* unbound wq insertion if file is a non-regular file */
unsigned unbound_nonreg_file : 1;
+ /* opcode is not supported by this kernel */
+ unsigned not_supported : 1;
};
static const struct io_op_def io_op_defs[] = {
@@ -6554,6 +6556,45 @@ SYSCALL_DEFINE2(io_uring_setup, u32, entries,
return io_uring_setup(entries, params);
}
+static int io_probe(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args)
+{
+ struct io_uring_probe *p;
+ size_t size;
+ int i, ret;
+
+ size = struct_size(p, ops, nr_args);
+ if (size == SIZE_MAX)
+ return -EOVERFLOW;
+ p = kzalloc(size, GFP_KERNEL);
+ if (!p)
+ return -ENOMEM;
+
+ ret = -EFAULT;
+ if (copy_from_user(p, arg, size))
+ goto out;
+ ret = -EINVAL;
+ if (memchr_inv(p, 0, size))
+ goto out;
+
+ p->last_op = IORING_OP_LAST - 1;
+ if (nr_args > IORING_OP_LAST)
+ nr_args = IORING_OP_LAST;
+
+ for (i = 0; i < nr_args; i++) {
+ p->ops[i].op = i;
+ if (!io_op_defs[i].not_supported)
+ p->ops[i].flags = IO_URING_OP_SUPPORTED;
+ }
+ p->ops_len = i;
+
+ ret = 0;
+ if (copy_to_user(arg, p, size))
+ ret = -EFAULT;
+out:
+ kfree(p);
+ return ret;
+}
+
static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
void __user *arg, unsigned nr_args)
__releases(ctx->uring_lock)
@@ -6570,7 +6611,8 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
return -ENXIO;
if (opcode != IORING_UNREGISTER_FILES &&
- opcode != IORING_REGISTER_FILES_UPDATE) {
+ opcode != IORING_REGISTER_FILES_UPDATE &&
+ opcode != IORING_REGISTER_PROBE) {
percpu_ref_kill(&ctx->refs);
/*
@@ -6632,6 +6674,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_eventfd_unregister(ctx);
break;
+ case IORING_REGISTER_PROBE:
+ ret = -EINVAL;
+ if (!arg || nr_args > 256)
+ break;
+ ret = io_probe(ctx, arg, nr_args);
+ break;
default:
ret = -EINVAL;
break;
@@ -6639,7 +6687,8 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
if (opcode != IORING_UNREGISTER_FILES &&
- opcode != IORING_REGISTER_FILES_UPDATE) {
+ opcode != IORING_REGISTER_FILES_UPDATE &&
+ opcode != IORING_REGISTER_PROBE) {
/* bring the ctx back to life */
percpu_ref_reinit(&ctx->refs);
out:
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index fea7da182851..955fd477e530 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -194,6 +194,7 @@ struct io_uring_params {
#define IORING_UNREGISTER_EVENTFD 5
#define IORING_REGISTER_FILES_UPDATE 6
#define IORING_REGISTER_EVENTFD_ASYNC 7
+#define IORING_REGISTER_PROBE 8
struct io_uring_files_update {
__u32 offset;
@@ -201,4 +202,21 @@ struct io_uring_files_update {
__aligned_u64 /* __s32 * */ fds;
};
+#define IO_URING_OP_SUPPORTED (1U << 0)
+
+struct io_uring_probe_op {
+ __u8 op;
+ __u8 resv;
+ __u16 flags; /* IO_URING_OP_* flags */
+ __u32 resv2;
+};
+
+struct io_uring_probe {
+ __u8 last_op; /* last opcode supported */
+ __u8 ops_len; /* length of ops[] array below */
+ __u16 resv;
+ __u32 resv2[3];
+ struct io_uring_probe_op ops[0];
+};
+
#endif
--
Jens Axboe
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH v2] io_uring: add support for probing opcodes
2020-01-17 2:58 [PATCH v2] io_uring: add support for probing opcodes Jens Axboe
@ 2020-01-17 7:42 ` Mark Papadakis
2020-01-17 15:15 ` Jens Axboe
2020-01-17 17:20 ` Jeff Moyer
2020-01-17 17:15 ` Jeff Moyer
1 sibling, 2 replies; 6+ messages in thread
From: Mark Papadakis @ 2020-01-17 7:42 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, Stefan Metzmacher
I 've been thinking about this earlier.
I think the most realistic solution would be to have kind of website/page(libiouring.org?), which lists all SQE OPs, the kernel release that first implemented support for it, and (if necessary) notes about compatibility.
- There will be, hopefully, a lot more such OPS implemented in the future
- By having this list readily available, one can determine the lowest Linux Kernel release required(target) for a specific set of OPs they need for their program. If I want support for readv, writev, accept, and connect - say - then I should be able to quickly figure out that e.g 5.5 is the minimum LK release I must require
- Subtle issues may be discovered, or other such important specifics may be to be called out -- e.g readv works for < 5.5 for disk I/O but (e.g) "broken" for 5.4.3. This should be included in that table
Testing against specific SQE OPs support alone won't be enough, and it will likely also get convoluted fast.
liburing could provide a simple utility function that returns the (major, minor) LK release for convenience.
@markpapadakis
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] io_uring: add support for probing opcodes
2020-01-17 7:42 ` Mark Papadakis
@ 2020-01-17 15:15 ` Jens Axboe
2020-01-17 17:20 ` Jeff Moyer
1 sibling, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2020-01-17 15:15 UTC (permalink / raw)
To: Mark Papadakis; +Cc: io-uring, Stefan Metzmacher
On 1/17/20 12:42 AM, Mark Papadakis wrote:
> I 've been thinking about this earlier. I think the most realistic
> solution would be to have kind of website/page(libiouring.org?), which
> lists all SQE OPs, the kernel release that first implemented support
> for it, and (if necessary) notes about compatibility.
>
> - There will be, hopefully, a lot more such OPS implemented in the
> future - By having this list readily available, one can determine the
> lowest Linux Kernel release required(target) for a specific set of OPs
> they need for their program. If I want support for readv, writev,
> accept, and connect - say - then I should be able to quickly figure
> out that e.g 5.5 is the minimum LK release I must require - Subtle
> issues may be discovered, or other such important specifics may be to
> be called out -- e.g readv works for < 5.5 for disk I/O but (e.g)
> "broken" for 5.4.3. This should be included in that table
The problem with this approach is that io_uring is (mostly) a one man
show so far, and maintaining a web page is not part of my core skill
set, nor is it something I want to take on. It'd be awesome if someone
else would step up on that front. Might be easier to simply keep the
liburing man pages up-to-date and maintain the information in there.
I think a lot of the issues you mention above are early teething issues,
hopefully won't be much of a concern going forward as things solidy and
stabilize on all fronts. So we're left with mostly "is this supported in
the kernel I'm on" kind of questions, which would hopefully be request
specific.
One thing that has been brought up is that we could add an opcode
version. There's an u8 reserved field next to the opcode, that could be
turned into
__u8 version;
in the future, which would allow us to differentiate types of supported
for an individual opcode. Your readv example would work with that, for
instance.
> Testing against specific SQE OPs support alone won't be enough, and it
> will likely also get convoluted fast. liburing could provide a simple
> utility function that returns the (major, minor) LK release for
> convenience.
I'm not a huge fan of versioning, exactly because it requires some other
source of information to then cross check a version number with
features. Then the application needs to maintain it to. It also totally
breaks down for backports, where you may only selectively backport
features to an older kernel. You can't represent that with a major.minor
that is sytem wide. The table can.
--
Jens Axboe
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] io_uring: add support for probing opcodes
2020-01-17 7:42 ` Mark Papadakis
2020-01-17 15:15 ` Jens Axboe
@ 2020-01-17 17:20 ` Jeff Moyer
1 sibling, 0 replies; 6+ messages in thread
From: Jeff Moyer @ 2020-01-17 17:20 UTC (permalink / raw)
To: Mark Papadakis; +Cc: Jens Axboe, io-uring, Stefan Metzmacher
Mark Papadakis <[email protected]> writes:
> I 've been thinking about this earlier.
> I think the most realistic solution would be to have kind of
> website/page(libiouring.org?), which lists all SQE OPs, the kernel
> release that first implemented support for it, and (if necessary)
> notes about compatibility.
>
> - There will be, hopefully, a lot more such OPS implemented in the future
> - By having this list readily available, one can determine the lowest
> Linux Kernel release required(target) for a specific set of OPs they
> need for their program. If I want support for readv, writev, accept,
> and connect - say - then I should be able to quickly figure out that
> e.g 5.5 is the minimum LK release I must require
This falls apart when you start looking at distro kernels. RHEL and
SuSe routinely backport features and fixes, and there may be subsets of
functionality available. Feature testing really is the best way.
> - Subtle issues may be discovered, or other such important specifics
> may be to be called out -- e.g readv works for < 5.5 for disk I/O but
> (e.g) "broken" for 5.4.3. This should be included in that table
That's true. I had wondered whether you should be able to specify an
fd, to see if an operation was supported for that particular thing
(file, socket, whatever). However, I'm not sure how easy that would be
to implement.
One other thing that might be useful is to ask about a specific op.
The way this is implemented, you have to get an entire table, and then
look up the op in that table. I don't think it's a big deal, though.
> Testing against specific SQE OPs support alone won't be enough, and it
> will likely also get convoluted fast. liburing could provide a simple
> utility function that returns the (major, minor) LK release for
> convenience.
Again, that model doesn't work for all kernels.
Cheers,
Jeff
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] io_uring: add support for probing opcodes
2020-01-17 2:58 [PATCH v2] io_uring: add support for probing opcodes Jens Axboe
2020-01-17 7:42 ` Mark Papadakis
@ 2020-01-17 17:15 ` Jeff Moyer
2020-01-17 17:36 ` Jens Axboe
1 sibling, 1 reply; 6+ messages in thread
From: Jeff Moyer @ 2020-01-17 17:15 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring, Stefan Metzmacher
Jens Axboe <[email protected]> writes:
> The application currently has no way of knowing if a given opcode is
> supported or not without having to try and issue one and see if we get
> -EINVAL or not. And even this approach is fraught with peril, as maybe
> we're getting -EINVAL due to some fields being missing, or maybe it's
> just not that easy to issue that particular command without doing some
> other leg work in terms of setup first.
>
> This adds IORING_REGISTER_PROBE, which fills in a structure with info
> on what it supported or not. This will work even with sparse opcode
> fields, which may happen in the future or even today if someone
> backports specific features to older kernels.
This looks pretty good to me. You can call it with 0 args to get the
total number of ops, then allocate an array with that number and
re-issue the syscall. I also like that you've allowed for backporting
subsets of functionality.
I have one question below:
> @@ -6632,6 +6674,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
> break;
> ret = io_eventfd_unregister(ctx);
> break;
> + case IORING_REGISTER_PROBE:
> + ret = -EINVAL;
> + if (!arg || nr_args > 256)
> + break;
> + ret = io_probe(ctx, arg, nr_args);
> + break;
Why 256? If it's just arbitrary, please add a comment.
Otherwise looks good!
Reviewed-by: Jeff Moyer <[email protected]>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2] io_uring: add support for probing opcodes
2020-01-17 17:15 ` Jeff Moyer
@ 2020-01-17 17:36 ` Jens Axboe
0 siblings, 0 replies; 6+ messages in thread
From: Jens Axboe @ 2020-01-17 17:36 UTC (permalink / raw)
To: Jeff Moyer; +Cc: io-uring, Stefan Metzmacher
On 1/17/20 10:15 AM, Jeff Moyer wrote:
> Jens Axboe <[email protected]> writes:
>
>> The application currently has no way of knowing if a given opcode is
>> supported or not without having to try and issue one and see if we get
>> -EINVAL or not. And even this approach is fraught with peril, as maybe
>> we're getting -EINVAL due to some fields being missing, or maybe it's
>> just not that easy to issue that particular command without doing some
>> other leg work in terms of setup first.
>>
>> This adds IORING_REGISTER_PROBE, which fills in a structure with info
>> on what it supported or not. This will work even with sparse opcode
>> fields, which may happen in the future or even today if someone
>> backports specific features to older kernels.
>
> This looks pretty good to me. You can call it with 0 args to get the
> total number of ops, then allocate an array with that number and
> re-issue the syscall. I also like that you've allowed for backporting
> subsets of functionality.
Right, this is similar to how most hardware commands work when you don't
know what the max size would be. Since this is pretty small, I would
expect applications to just use 256 as the value and get all of them.
But if they want to probe and use that method, that'll work just fine.
> I have one question below:
>
>> @@ -6632,6 +6674,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
>> break;
>> ret = io_eventfd_unregister(ctx);
>> break;
>> + case IORING_REGISTER_PROBE:
>> + ret = -EINVAL;
>> + if (!arg || nr_args > 256)
>> + break;
>> + ret = io_probe(ctx, arg, nr_args);
>> + break;
>
> Why 256? If it's just arbitrary, please add a comment.
We can't have more than 256 opcodes, as it's a byte for the opcode.
> Otherwise looks good!
>
> Reviewed-by: Jeff Moyer <[email protected]>
Thanks!
--
Jens Axboe
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2020-01-17 17:36 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-01-17 2:58 [PATCH v2] io_uring: add support for probing opcodes Jens Axboe
2020-01-17 7:42 ` Mark Papadakis
2020-01-17 15:15 ` Jens Axboe
2020-01-17 17:20 ` Jeff Moyer
2020-01-17 17:15 ` Jeff Moyer
2020-01-17 17:36 ` Jens Axboe
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox