public inbox for [email protected]
 help / color / mirror / Atom feed
* [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg
@ 2020-11-07 13:20 Pavel Begunkov
  2020-11-07 13:46 ` Stefan Metzmacher
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Pavel Begunkov @ 2020-11-07 13:20 UTC (permalink / raw)
  To: Jens Axboe, io-uring

We don't even allow not plain data msg_control, which is disallowed in
__sys_{send,revb}msg_sock(). So no need in fs for IORING_OP_SENDMSG and
IORING_OP_RECVMSG. fs->lock is less contanged not as much as before, but
there are cases that can be, e.g. IOSQE_ASYNC.

Signed-off-by: Pavel Begunkov <[email protected]>
---
 fs/io_uring.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 2e435b336927..8d721a652d61 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -849,8 +849,7 @@ static const struct io_op_def io_op_defs[] = {
 		.pollout		= 1,
 		.needs_async_data	= 1,
 		.async_size		= sizeof(struct io_async_msghdr),
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG |
-						IO_WQ_WORK_FS,
+		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_RECVMSG] = {
 		.needs_file		= 1,
@@ -859,8 +858,7 @@ static const struct io_op_def io_op_defs[] = {
 		.buffer_select		= 1,
 		.needs_async_data	= 1,
 		.async_size		= sizeof(struct io_async_msghdr),
-		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG |
-						IO_WQ_WORK_FS,
+		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
 	},
 	[IORING_OP_TIMEOUT] = {
 		.needs_async_data	= 1,
-- 
2.24.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg
  2020-11-07 13:20 [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg Pavel Begunkov
@ 2020-11-07 13:46 ` Stefan Metzmacher
  2020-11-07 16:02   ` Pavel Begunkov
  2020-11-15 13:07 ` Pavel Begunkov
  2020-11-16 16:31 ` Jens Axboe
  2 siblings, 1 reply; 10+ messages in thread
From: Stefan Metzmacher @ 2020-11-07 13:46 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe, io-uring


[-- Attachment #1.1: Type: text/plain, Size: 559 bytes --]

Hi Pavel,

> We don't even allow not plain data msg_control, which is disallowed in __sys_{send,revb}msg_sock().

Can't we better remove these checks and allow msg_control?
For me it's a limitation that I would like to be removed.

If there's a cost using IO_WQ_WORK_FS, would it be possible to use IO_WQ_WORK_FS only it msg_control is actually used?

  if (msg->msg_control || msg->msg_controllen)
      static const struct io_op_def sendmsg_control_op_def = {
         ...
      };

      something = &sendmsg_control_op_def;
  }

metze


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg
  2020-11-07 13:46 ` Stefan Metzmacher
@ 2020-11-07 16:02   ` Pavel Begunkov
  2020-11-07 16:07     ` Pavel Begunkov
  0 siblings, 1 reply; 10+ messages in thread
From: Pavel Begunkov @ 2020-11-07 16:02 UTC (permalink / raw)
  To: Stefan Metzmacher, Jens Axboe, io-uring

On 07/11/2020 13:46, Stefan Metzmacher wrote:
> Hi Pavel,
> 
>> We don't even allow not plain data msg_control, which is disallowed in __sys_{send,revb}msg_sock().
> 
> Can't we better remove these checks and allow msg_control?
> For me it's a limitation that I would like to be removed.

We can grab fs only in specific situations as you mentioned, by e.g.
adding a switch(opcode) in io_prep_async_work(), but that's the easy
part. All msg_control should be dealt one by one as they do different
things. And it's not the fact that they ever require fs.

> 
> If there's a cost using IO_WQ_WORK_FS, would it be possible to use IO_WQ_WORK_FS only it msg_control is actually use> 
>   if (msg->msg_control || msg->msg_controllen) 
>       static const struct io_op_def sendmsg_control_op_def = {
>          ...
>       };
> 
>       something = &sendmsg_control_op_def;
>   }

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg
  2020-11-07 16:02   ` Pavel Begunkov
@ 2020-11-07 16:07     ` Pavel Begunkov
  2020-11-18 16:27       ` Stefan Metzmacher
  0 siblings, 1 reply; 10+ messages in thread
From: Pavel Begunkov @ 2020-11-07 16:07 UTC (permalink / raw)
  To: Stefan Metzmacher, Jens Axboe, io-uring

On 07/11/2020 16:02, Pavel Begunkov wrote:
> On 07/11/2020 13:46, Stefan Metzmacher wrote:
>> Hi Pavel,
>>
>>> We don't even allow not plain data msg_control, which is disallowed in __sys_{send,revb}msg_sock().
>>
>> Can't we better remove these checks and allow msg_control?
>> For me it's a limitation that I would like to be removed.
> 
> We can grab fs only in specific situations as you mentioned, by e.g.
> adding a switch(opcode) in io_prep_async_work(), but that's the easy
> part. All msg_control should be dealt one by one as they do different
> things. And it's not the fact that they ever require fs.

BTW, Jens mentioned that there is a queued patch that allows plain
data msg_control. Are those not enough?

> 
>>
>> If there's a cost using IO_WQ_WORK_FS, would it be possible to use IO_WQ_WORK_FS only it msg_control is actually use> 
>>   if (msg->msg_control || msg->msg_controllen) 
>>       static const struct io_op_def sendmsg_control_op_def = {
>>          ...
>>       };
>>
>>       something = &sendmsg_control_op_def;
>>   }
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg
  2020-11-07 13:20 [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg Pavel Begunkov
  2020-11-07 13:46 ` Stefan Metzmacher
@ 2020-11-15 13:07 ` Pavel Begunkov
  2020-11-16 16:31 ` Jens Axboe
  2 siblings, 0 replies; 10+ messages in thread
From: Pavel Begunkov @ 2020-11-15 13:07 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 07/11/2020 13:20, Pavel Begunkov wrote:
> We don't even allow not plain data msg_control, which is disallowed in
> __sys_{send,revb}msg_sock(). So no need in fs for IORING_OP_SENDMSG and
> IORING_OP_RECVMSG. fs->lock is less contanged not as much as before, but
> there are cases that can be, e.g. IOSQE_ASYNC.

This one is still good to go. If anyone needs fs, etc. for msg_control,
IMHO it should be done in a different way not penalising others. 
i.e. grabbing it in io_prep_async_work().

> 
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
>  fs/io_uring.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 2e435b336927..8d721a652d61 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -849,8 +849,7 @@ static const struct io_op_def io_op_defs[] = {
>  		.pollout		= 1,
>  		.needs_async_data	= 1,
>  		.async_size		= sizeof(struct io_async_msghdr),
> -		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG |
> -						IO_WQ_WORK_FS,
> +		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
>  	},
>  	[IORING_OP_RECVMSG] = {
>  		.needs_file		= 1,
> @@ -859,8 +858,7 @@ static const struct io_op_def io_op_defs[] = {
>  		.buffer_select		= 1,
>  		.needs_async_data	= 1,
>  		.async_size		= sizeof(struct io_async_msghdr),
> -		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG |
> -						IO_WQ_WORK_FS,
> +		.work_flags		= IO_WQ_WORK_MM | IO_WQ_WORK_BLKCG,
>  	},
>  	[IORING_OP_TIMEOUT] = {
>  		.needs_async_data	= 1,
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg
  2020-11-07 13:20 [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg Pavel Begunkov
  2020-11-07 13:46 ` Stefan Metzmacher
  2020-11-15 13:07 ` Pavel Begunkov
@ 2020-11-16 16:31 ` Jens Axboe
  2 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2020-11-16 16:31 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 11/7/20 6:20 AM, Pavel Begunkov wrote:
> We don't even allow not plain data msg_control, which is disallowed in
> __sys_{send,revb}msg_sock(). So no need in fs for IORING_OP_SENDMSG and
> IORING_OP_RECVMSG. fs->lock is less contanged not as much as before, but
> there are cases that can be, e.g. IOSQE_ASYNC.

Applied, thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg
  2020-11-07 16:07     ` Pavel Begunkov
@ 2020-11-18 16:27       ` Stefan Metzmacher
  2020-11-18 16:57         ` Stefan Metzmacher
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Metzmacher @ 2020-11-18 16:27 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe, io-uring


[-- Attachment #1.1: Type: text/plain, Size: 1281 bytes --]

Am 07.11.20 um 17:07 schrieb Pavel Begunkov:
> On 07/11/2020 16:02, Pavel Begunkov wrote:
>> On 07/11/2020 13:46, Stefan Metzmacher wrote:
>>> Hi Pavel,
>>>
>>>> We don't even allow not plain data msg_control, which is disallowed in __sys_{send,revb}msg_sock().
>>>
>>> Can't we better remove these checks and allow msg_control?
>>> For me it's a limitation that I would like to be removed.
>>
>> We can grab fs only in specific situations as you mentioned, by e.g.
>> adding a switch(opcode) in io_prep_async_work(), but that's the easy
>> part. All msg_control should be dealt one by one as they do different
>> things. And it's not the fact that they ever require fs.
> 
> BTW, Jens mentioned that there is a queued patch that allows plain
> data msg_control. Are those not enough?

You mean the PROTO_CMSG_DATA_ONLY check?

It's not perfect, but better than nothing for a start.

But as far as I can see this is only in the recvmsg path, I'd need it
for sendmsg. Can this be fixed? It would also be great to have a way to
detect support for this from userspace.

It would also be great to somehow fill in .msg_iocb and handle
-EIOCBQUEUED from a socket and let a iocb->ki_complete() function to be called
in order to handle completions.

metze


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg
  2020-11-18 16:27       ` Stefan Metzmacher
@ 2020-11-18 16:57         ` Stefan Metzmacher
  2020-11-18 19:50           ` Pavel Begunkov
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Metzmacher @ 2020-11-18 16:57 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe, io-uring


[-- Attachment #1.1: Type: text/plain, Size: 1917 bytes --]

Am 18.11.20 um 17:27 schrieb Stefan Metzmacher:
> Am 07.11.20 um 17:07 schrieb Pavel Begunkov:
>> On 07/11/2020 16:02, Pavel Begunkov wrote:
>>> On 07/11/2020 13:46, Stefan Metzmacher wrote:
>>>> Hi Pavel,
>>>>
>>>>> We don't even allow not plain data msg_control, which is disallowed in __sys_{send,revb}msg_sock().
>>>>
>>>> Can't we better remove these checks and allow msg_control?
>>>> For me it's a limitation that I would like to be removed.
>>>
>>> We can grab fs only in specific situations as you mentioned, by e.g.
>>> adding a switch(opcode) in io_prep_async_work(), but that's the easy
>>> part. All msg_control should be dealt one by one as they do different
>>> things. And it's not the fact that they ever require fs.
>>
>> BTW, Jens mentioned that there is a queued patch that allows plain
>> data msg_control. Are those not enough?
> 
> You mean the PROTO_CMSG_DATA_ONLY check?
> 
> It's not perfect, but better than nothing for a start.

What actually have in mind for my smbdirect socket driver [1]:

- I have a pipe that got filled by IORING_OP_SPLICE
- The data in the pipe need to be "spliced" into a remote RDMA buffers,
  but I can't use IORING_OP_SPLICE again, because the RDMA buffer descriptor [2]
  array needs to be passed too.
- I'd like to use IORING_OP_SENDMSG with MSG_OOB and msg_control.
  msg_control would get the RDMA buffer descriptor array and the pipe fd.

The reverse operation (splicing data from remote RDMA buffers into a pipe)
would be implemented with IORING_OP_RECVMSG with MSG_OOB and msg_control.

I guess my smbdirect socket driver would not qualify to be marked as PROTO_CMSG_DATA_ONLY, correct?

[1] https://git.samba.org/?p=metze/linux/smbdirect.git;a=blob;f=smbdirect_socket.c;h=a738854462b198e#l2076
[2] https://docs.microsoft.com/ru-ru/openspecs/windows_protocols/ms-smbd/bee890cb-48f0-42a3-ba62-f1a3a19b0edc

metze


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg
  2020-11-18 16:57         ` Stefan Metzmacher
@ 2020-11-18 19:50           ` Pavel Begunkov
  2020-11-19  9:17             ` Stefan Metzmacher
  0 siblings, 1 reply; 10+ messages in thread
From: Pavel Begunkov @ 2020-11-18 19:50 UTC (permalink / raw)
  To: Stefan Metzmacher, Jens Axboe, io-uring

On 18/11/2020 16:57, Stefan Metzmacher wrote:
> Am 18.11.20 um 17:27 schrieb Stefan Metzmacher:
>> Am 07.11.20 um 17:07 schrieb Pavel Begunkov:
>>> On 07/11/2020 16:02, Pavel Begunkov wrote:
>>>> On 07/11/2020 13:46, Stefan Metzmacher wrote:
>>>>> Hi Pavel,
>>>>>
>>>>>> We don't even allow not plain data msg_control, which is disallowed in __sys_{send,revb}msg_sock().
>>>>>
>>>>> Can't we better remove these checks and allow msg_control?
>>>>> For me it's a limitation that I would like to be removed.
>>>>
>>>> We can grab fs only in specific situations as you mentioned, by e.g.
>>>> adding a switch(opcode) in io_prep_async_work(), but that's the easy
>>>> part. All msg_control should be dealt one by one as they do different
>>>> things. And it's not the fact that they ever require fs.
>>>
>>> BTW, Jens mentioned that there is a queued patch that allows plain
>>> data msg_control. Are those not enough?
>>
>> You mean the PROTO_CMSG_DATA_ONLY check?
>>
>> It's not perfect, but better than nothing for a start.
> 
> What actually have in mind for my smbdirect socket driver [1]:
> 
> - I have a pipe that got filled by IORING_OP_SPLICE
> - The data in the pipe need to be "spliced" into a remote RDMA buffers,
>   but I can't use IORING_OP_SPLICE again, because the RDMA buffer descriptor [2]
>   array needs to be passed too.
> - I'd like to use IORING_OP_SENDMSG with MSG_OOB and msg_control.
>   msg_control would get the RDMA buffer descriptor array and the pipe fd.

If I get you right, you can't splice again because there is an RDMA header
that should go before payload data. Is that correct?

So you would need to do like in the pseudo-code below

payload = pipe.get_buffers();
iov[] = {&header, payload};
sendmsg(iov);

> The reverse operation (splicing data from remote RDMA buffers into a pipe)
> would be implemented with IORING_OP_RECVMSG with MSG_OOB and msg_control.
> 
> I guess my smbdirect socket driver would not qualify to be marked as PROTO_CMSG_DATA_ONLY, correct?
> 
> [1] https://git.samba.org/?p=metze/linux/smbdirect.git;a=blob;f=smbdirect_socket.c;h=a738854462b198e#l2076
> [2] https://docs.microsoft.com/ru-ru/openspecs/windows_protocols/ms-smbd/bee890cb-48f0-42a3-ba62-f1a3a19b0edc

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg
  2020-11-18 19:50           ` Pavel Begunkov
@ 2020-11-19  9:17             ` Stefan Metzmacher
  0 siblings, 0 replies; 10+ messages in thread
From: Stefan Metzmacher @ 2020-11-19  9:17 UTC (permalink / raw)
  To: Pavel Begunkov, Jens Axboe, io-uring


[-- Attachment #1.1: Type: text/plain, Size: 4229 bytes --]

Am 18.11.20 um 20:50 schrieb Pavel Begunkov:
> On 18/11/2020 16:57, Stefan Metzmacher wrote:
>> Am 18.11.20 um 17:27 schrieb Stefan Metzmacher:
>>> Am 07.11.20 um 17:07 schrieb Pavel Begunkov:
>>>> On 07/11/2020 16:02, Pavel Begunkov wrote:
>>>>> On 07/11/2020 13:46, Stefan Metzmacher wrote:
>>>>>> Hi Pavel,
>>>>>>
>>>>>>> We don't even allow not plain data msg_control, which is disallowed in __sys_{send,revb}msg_sock().
>>>>>>
>>>>>> Can't we better remove these checks and allow msg_control?
>>>>>> For me it's a limitation that I would like to be removed.
>>>>>
>>>>> We can grab fs only in specific situations as you mentioned, by e.g.
>>>>> adding a switch(opcode) in io_prep_async_work(), but that's the easy
>>>>> part. All msg_control should be dealt one by one as they do different
>>>>> things. And it's not the fact that they ever require fs.
>>>>
>>>> BTW, Jens mentioned that there is a queued patch that allows plain
>>>> data msg_control. Are those not enough?
>>>
>>> You mean the PROTO_CMSG_DATA_ONLY check?
>>>
>>> It's not perfect, but better than nothing for a start.
>>
>> What actually have in mind for my smbdirect socket driver [1]:
>>
>> - I have a pipe that got filled by IORING_OP_SPLICE
>> - The data in the pipe need to be "spliced" into a remote RDMA buffers,
>>   but I can't use IORING_OP_SPLICE again, because the RDMA buffer descriptor [2]
>>   array needs to be passed too.
>> - I'd like to use IORING_OP_SENDMSG with MSG_OOB and msg_control.
>>   msg_control would get the RDMA buffer descriptor array and the pipe fd.
> 
> If I get you right, you can't splice again because there is an RDMA header
> that should go before payload data. Is that correct?

No.

> So you would need to do like in the pseudo-code below
> 
> payload = pipe.get_buffers();
> iov[] = {&header, payload};
> sendmsg(iov);

This would be for the TCP case, there I use IORING_OP_SENDMSG with MSG_MORE
followed by a IORING_OP_SPLICE in order to put the SMB2 headers before the
payload buffer, while both result after each other in the byte stream.

With SMB-Direct (a transport for SMB over RDMA) there's basically a bi-directional
byte stream similar to TCP, but using RDMA_SEND pdus use via ib_post_send(IB_WR_SEND) on the
sender and ib_post_recv() on the receiver.

But there are also out of band commands to do direct data placement using
RDMA_READ and RDMA_WRITE using ib_post_send(IB_WR_RDMA_READ) and ib_post_send(IB_WR_RDMA_WRITE),
these verbs require a descriptor for the remote memory, 1. a steering tag (which is some kind of temporary cookie/identifier)
for a memory registration on the remote peer, 2. offset, 3. length.
These are completely independent of the byte stream, but they use the same RDMA connection.

This presentation contains illustrations on pages 19, 20 and 22:
https://www.snia.org/sites/default/files/files2/files2/SDC2011/presentations/tuesday/TomTalpey_GregKramer_SMB%202-2_Over_RDMA.pdf

Typically the client registers a memory region(s) and transfers the descriptor(s) within
the native SMB2 protocol using the "stream" of the SMB-Direct transport.
The server reads or writes from/to that clients memory. In order to do that the server
creates a temporary local memory registration, then it needs to pass the local memory descriptor,
but also the remote memory descriptor to the raw RDMA_READ/WRITE verbs and tell the hardware to
transfer the memory.

What I need is a way to trigger these out of band transfers, the simple approach
would be that userspace pass a buffer (iov) together with the remote memory descriptor
to the kernel. For now a use an ioctl() for that case.

But as io_uring doesn't support generic ioctls, my idea was to use sendmsg(MSG_OOM) instead
and pass the remote memory descriptors via msg_control and the buffer via msg_iov.

In order to avoid memory copies I'd like to use a pipe instead of buffer (iovs),
so my idea would be passing the pipe fd via an additional msg_control element
and use msg_iovlen=0, in order to simulate splice with additional meta data (that's only
needed at the local socket layer).

Do you understand this now, or is it still unclear?

metze


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-11-19  9:17 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-11-07 13:20 [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg Pavel Begunkov
2020-11-07 13:46 ` Stefan Metzmacher
2020-11-07 16:02   ` Pavel Begunkov
2020-11-07 16:07     ` Pavel Begunkov
2020-11-18 16:27       ` Stefan Metzmacher
2020-11-18 16:57         ` Stefan Metzmacher
2020-11-18 19:50           ` Pavel Begunkov
2020-11-19  9:17             ` Stefan Metzmacher
2020-11-15 13:07 ` Pavel Begunkov
2020-11-16 16:31 ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox