From: Stefan Metzmacher <[email protected]>
To: Pavel Begunkov <[email protected]>,
Jens Axboe <[email protected]>,
[email protected]
Subject: Re: [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg
Date: Thu, 19 Nov 2020 10:17:06 +0100
Message-ID: <[email protected]>
In-Reply-To: <[email protected]>
On 18.11.20 at 20:50, Pavel Begunkov wrote:
> On 18/11/2020 16:57, Stefan Metzmacher wrote:
>> On 18.11.20 at 17:27, Stefan Metzmacher wrote:
>>> On 07.11.20 at 17:07, Pavel Begunkov wrote:
>>>> On 07/11/2020 16:02, Pavel Begunkov wrote:
>>>>> On 07/11/2020 13:46, Stefan Metzmacher wrote:
>>>>>> Hi Pavel,
>>>>>>
>>>>>>> We don't even allow msg_control that is not plain data; it is disallowed in __sys_{send,recv}msg_sock().
>>>>>>
>>>>>> Can't we better remove these checks and allow msg_control?
>>>>>> For me it's a limitation that I would like to be removed.
>>>>>
>>>>> We can grab fs only in specific situations as you mentioned, by e.g.
>>>>> adding a switch(opcode) in io_prep_async_work(), but that's the easy
>>>>> part. All msg_control types should be dealt with one by one as they do
>>>>> different things. And it's not a given that they ever require fs.
>>>>
>>>> BTW, Jens mentioned that there is a queued patch that allows plain
>>>> data msg_control. Are those not enough?
>>>
>>> You mean the PROTO_CMSG_DATA_ONLY check?
>>>
>>> It's not perfect, but better than nothing for a start.
>>
>> What I actually have in mind for my smbdirect socket driver [1]:
>>
>> - I have a pipe that got filled by IORING_OP_SPLICE
>> - The data in the pipe needs to be "spliced" into remote RDMA buffers,
>> but I can't use IORING_OP_SPLICE again, because the RDMA buffer descriptor [2]
>> array needs to be passed too.
>> - I'd like to use IORING_OP_SENDMSG with MSG_OOB and msg_control.
>> msg_control would get the RDMA buffer descriptor array and the pipe fd.
>
> If I get you right, you can't splice again because there is an RDMA header
> that should go before payload data. Is that correct?
No.
> So you would need to do like in the pseudo-code below
>
> payload = pipe.get_buffers();
> iov[] = {&header, payload};
> sendmsg(iov);
This would be for the TCP case. There I use IORING_OP_SENDMSG with MSG_MORE
followed by an IORING_OP_SPLICE in order to put the SMB2 headers before the
payload buffer, so that both end up directly after each other in the byte stream.
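
For illustration, the TCP ordering described above can be sketched with the
plain syscall equivalents of the two io_uring operations (send() with MSG_MORE,
then splice()). The helper name send_header_then_splice() is made up for this
sketch, and an io_uring user would queue the two operations as SQEs instead of
calling the syscalls directly.

```c
#define _GNU_SOURCE             /* for splice() and SPLICE_F_MOVE */
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: send hdr with MSG_MORE, then splice payload_len
 * bytes from the read end of a pipe into the same socket, so that the
 * header is followed directly by the payload in the byte stream.
 * Returns the total number of bytes queued, or -1 on error.
 */
static ssize_t send_header_then_splice(int sock, const void *hdr, size_t hdr_len,
                                       int pipe_rd, size_t payload_len)
{
    assert(sock >= 0 && pipe_rd >= 0);

    /* MSG_MORE tells the stack that more data follows. */
    ssize_t sent = send(sock, hdr, hdr_len, MSG_MORE);
    if (sent < 0 || (size_t)sent != hdr_len)
        return -1;      /* a short send is treated as an error here */

    size_t left = payload_len;
    while (left > 0) {
        /* Move data from the pipe to the socket without a userspace copy. */
        ssize_t n = splice(pipe_rd, NULL, sock, NULL, left, SPLICE_F_MOVE);
        if (n <= 0)
            return -1;
        left -= (size_t)n;
    }
    return (ssize_t)(hdr_len + payload_len);
}
```

This only works because header and payload belong to the same byte stream,
which is exactly what is not true for the RDMA_READ/RDMA_WRITE case.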

With SMB-Direct (a transport for SMB over RDMA) there's basically a bi-directional
byte stream similar to TCP, but it uses RDMA_SEND PDUs, sent via ib_post_send(IB_WR_SEND) on the
sender and received via ib_post_recv() on the receiver.

But there are also out-of-band commands to do direct data placement using
RDMA_READ and RDMA_WRITE via ib_post_send(IB_WR_RDMA_READ) and ib_post_send(IB_WR_RDMA_WRITE).
These verbs require a descriptor for the remote memory:
1. a steering tag (some kind of temporary cookie/identifier) for a memory registration on the remote peer,
2. an offset,
3. a length.
These are completely independent of the byte stream, but they use the same RDMA connection.
This presentation contains illustrations on pages 19, 20 and 22:
https://www.snia.org/sites/default/files/files2/files2/SDC2011/presentations/tuesday/TomTalpey_GregKramer_SMB%202-2_Over_RDMA.pdf
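
As a minimal sketch, the remote memory descriptor mentioned above (steering
tag, offset, length) could look like this from userspace. The struct name, the
field names, the field widths and the helper rdma_desc_total_length() are
assumptions for illustration only, not the layout used by any existing driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace view of one remote memory descriptor. */
struct rdma_remote_desc {
    uint64_t offset;    /* offset within the remote memory registration */
    uint32_t token;     /* steering tag identifying the remote registration */
    uint32_t length;    /* number of bytes to read or write */
};

/* With these field widths the struct has no padding. */
static_assert(sizeof(struct rdma_remote_desc) == 16, "unexpected padding");

/* Sum of the lengths of all descriptors in an array of n entries. */
static uint64_t rdma_desc_total_length(const struct rdma_remote_desc *d, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += d[i].length;
    return total;
}
```

The total length is what a single out-of-band transfer covering the whole
descriptor array would have to move.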

Typically the client registers one or more memory regions and transfers the descriptor(s) within
the native SMB2 protocol using the "stream" of the SMB-Direct transport.
The server reads from or writes to that client's memory. In order to do that the server
creates a temporary local memory registration, then it needs to pass the local memory descriptor
as well as the remote memory descriptor to the raw RDMA_READ/WRITE verbs and tell the hardware to
transfer the memory.

What I need is a way to trigger these out-of-band transfers. The simple approach
would be that userspace passes a buffer (iov) together with the remote memory descriptor
to the kernel. For now I use an ioctl() for that case.
But as io_uring doesn't support generic ioctls, my idea was to use sendmsg(MSG_OOB) instead
and pass the remote memory descriptors via msg_control and the buffer via msg_iov.

In order to avoid memory copies I'd like to use a pipe instead of a buffer (iovs),
so my idea would be to pass the pipe fd via an additional msg_control element
and use msg_iovlen=0, in order to simulate splice with additional metadata (that's only
needed at the local socket layer).
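
A rough sketch of how such a message could be assembled in userspace, using
only the standard CMSG_*() macros. The cmsg level and type constants, the
struct layout and the helper build_oob_msg() are all hypothetical; nothing
like them exists in the kernel today, and the code only builds the msghdr, it
does not send it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical constants, chosen only for this sketch. */
#define SOL_SMBDIRECT_SKETCH        0x7fff
#define SMBDIRECT_CMSG_REMOTE_DESC  1
#define SMBDIRECT_CMSG_PIPE_FD      2

/* Hypothetical remote memory descriptor (steering tag, offset, length). */
struct rdma_remote_desc {
    uint64_t offset;
    uint32_t token;
    uint32_t length;
};

/*
 * Fill msg with two control messages (ndesc descriptors, then the pipe fd)
 * and no iovecs. ctrl must be suitably aligned for struct cmsghdr.
 * Returns 0 on success, -1 if there are no descriptors or ctrl is too small.
 */
static int build_oob_msg(struct msghdr *msg, void *ctrl, size_t ctrl_len,
                         const struct rdma_remote_desc *desc, size_t ndesc,
                         int pipe_fd)
{
    size_t desc_bytes = ndesc * sizeof(*desc);
    size_t need = CMSG_SPACE(desc_bytes) + CMSG_SPACE(sizeof(int));
    struct cmsghdr *cmsg;

    if (ndesc == 0 || ctrl_len < need)
        return -1;

    memset(msg, 0, sizeof(*msg));   /* leaves msg_iov NULL, msg_iovlen 0 */
    memset(ctrl, 0, need);          /* CMSG_NXTHDR() may look at the next header */
    msg->msg_control = ctrl;
    msg->msg_controllen = need;

    /* First element: the array of remote memory descriptors. */
    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = SOL_SMBDIRECT_SKETCH;
    cmsg->cmsg_type = SMBDIRECT_CMSG_REMOTE_DESC;
    cmsg->cmsg_len = CMSG_LEN(desc_bytes);
    memcpy(CMSG_DATA(cmsg), desc, desc_bytes);

    /* Second element: the pipe fd that holds the payload. */
    cmsg = CMSG_NXTHDR(msg, cmsg);
    assert(cmsg != NULL);
    cmsg->cmsg_level = SOL_SMBDIRECT_SKETCH;
    cmsg->cmsg_type = SMBDIRECT_CMSG_PIPE_FD;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &pipe_fd, sizeof(int));
    return 0;
}
```

The resulting msghdr would then be passed to sendmsg() (or IORING_OP_SENDMSG)
with MSG_OOB, which requires that msg_control is allowed for that socket type.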
Do you understand this now, or is it still unclear?
metze
Thread overview: 10+ messages
2020-11-07 13:20 [PATCH 5.11] io_uring: don't take fs for recvmsg/sendmsg Pavel Begunkov
2020-11-07 13:46 ` Stefan Metzmacher
2020-11-07 16:02 ` Pavel Begunkov
2020-11-07 16:07 ` Pavel Begunkov
2020-11-18 16:27 ` Stefan Metzmacher
2020-11-18 16:57 ` Stefan Metzmacher
2020-11-18 19:50 ` Pavel Begunkov
2020-11-19 9:17 ` Stefan Metzmacher [this message]
2020-11-15 13:07 ` Pavel Begunkov
2020-11-16 16:31 ` Jens Axboe