From: Ming Lei <[email protected]>
To: Ziyang Zhang <[email protected]>
Cc: [email protected],
Harris James R <[email protected]>,
[email protected], [email protected],
Gabriel Krisman Bertazi <[email protected]>,
Xiaoguang Wang <[email protected]>,
Stefan Hajnoczi <[email protected]>,
Jens Axboe <[email protected]>,
[email protected]
Subject: Re: [PATCH V3 1/1] ublk: add io_uring based userspace block driver
Date: Thu, 30 Jun 2022 20:33:20 +0800
Message-ID: <Yr2YEIoBPOLxq6NB@T590>
In-Reply-To: <[email protected]>
On Thu, Jun 30, 2022 at 07:35:11PM +0800, Ziyang Zhang wrote:
> On 2022/6/29 00:08, Ming Lei wrote:
>
> [...]
>
> > +#define UBLK_MAX_PIN_PAGES 32
> > +
> > +static inline void ublk_release_pages(struct ublk_queue *ubq, struct page **pages,
> > + int nr_pages)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr_pages; i++)
> > + put_page(pages[i]);
> > +}
> > +
> > +static inline int ublk_pin_user_pages(struct ublk_queue *ubq, u64 start_vm,
> > + unsigned int nr_pages, unsigned int gup_flags,
> > + struct page **pages)
> > +{
> > + return get_user_pages_fast(start_vm, nr_pages, gup_flags, pages);
> > +}
>
> > +
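> > +/*
> > + * Copy between one bvec segment and the currently mapped user page,
> > + * advancing both sides; returns the number of bytes copied.
> > + */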
> > +static inline unsigned ublk_copy_bv(struct bio_vec *bv, void **bv_addr,
> > + void *pg_addr, unsigned int *pg_off,
> > + unsigned int *pg_len, bool to_bv)
> > +{
> > + unsigned len = min_t(unsigned, bv->bv_len, *pg_len);
> > +
> > + if (*bv_addr == NULL)
> > + *bv_addr = kmap_local_page(bv->bv_page);
> > +
> > + if (to_bv)
> > + memcpy(*bv_addr + bv->bv_offset, pg_addr + *pg_off, len);
> > + else
> > + memcpy(pg_addr + *pg_off, *bv_addr + bv->bv_offset, len);
> > +
> > + bv->bv_offset += len;
> > + bv->bv_len -= len;
> > + *pg_off += len;
> > + *pg_len -= len;
> > +
> > + if (!bv->bv_len) {
> > + kunmap_local(*bv_addr);
> > + *bv_addr = NULL;
> > + }
> > +
> > + return len;
> > +}
> > +
> > +/* copy rq pages to ublksrv vm address pointed by io->addr */
> > +static int ublk_copy_pages(struct ublk_queue *ubq, struct request *rq, bool to_rq,
> > + unsigned int max_bytes)
> > +{
> > + unsigned int gup_flags = to_rq ? 0 : FOLL_WRITE;
> > + struct ublk_io *io = &ubq->ios[rq->tag];
> > + struct page *pgs[UBLK_MAX_PIN_PAGES];
> > + struct req_iterator req_iter;
> > + struct bio_vec bv;
> > + const unsigned int rq_bytes = min(blk_rq_bytes(rq), max_bytes);
> > + unsigned long start = io->addr, left = rq_bytes;
> > + unsigned int idx = 0, pg_len = 0, pg_off = 0;
> > + int nr_pin = 0;
> > + void *pg_addr = NULL;
> > + struct page *curr = NULL;
> > +
> > + rq_for_each_segment(bv, rq, req_iter) {
> > + unsigned len, bv_off = bv.bv_offset, bv_len = bv.bv_len;
> > + void *bv_addr = NULL;
> > +
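> > + /*
> > + * Map the next pinned user page, re-pinning a fresh batch of up to
> > + * UBLK_MAX_PIN_PAGES pages once the current batch is consumed.
> > + */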
> > +refill:
> > + if (pg_len == 0) {
> > + unsigned int off = 0;
> > +
> > + if (pg_addr) {
> > + kunmap_local(pg_addr);
> > + if (!to_rq)
> > + set_page_dirty_lock(curr);
> > + pg_addr = NULL;
> > + }
> > +
> > + /* refill pages */
> > + if (idx >= nr_pin) {
> > + unsigned int max_pages;
> > +
> > + ublk_release_pages(ubq, pgs, nr_pin);
> > +
> > + off = start & (PAGE_SIZE - 1);
> > + max_pages = min_t(unsigned, (off + left +
> > + PAGE_SIZE - 1) >> PAGE_SHIFT,
> > + UBLK_MAX_PIN_PAGES);
> > + nr_pin = ublk_pin_user_pages(ubq, start,
> > + max_pages, gup_flags, pgs);
> > + if (nr_pin < 0)
> > + goto exit;
> > + idx = 0;
> > + }
> > + pg_off = off;
> > + pg_len = min(PAGE_SIZE - off, left);
> > + off = 0;
> > + curr = pgs[idx++];
> > + pg_addr = kmap_local_page(curr);
> > + }
> > +
> > + len = ublk_copy_bv(&bv, &bv_addr, pg_addr, &pg_off, &pg_len,
> > + to_rq);
> > + /* either one of the two has been consumed */
> > + WARN_ON_ONCE(bv.bv_len && pg_len);
> > + start += len;
> > + left -= len;
> > +
> > + /* overflow */
> > + WARN_ON_ONCE(left > rq_bytes);
> > + WARN_ON_ONCE(bv.bv_len > bv_len);
> > + if (bv.bv_len)
> > + goto refill;
> > +
> > + bv.bv_len = bv_len;
> > + bv.bv_offset = bv_off;
> > + }
> > + if (pg_addr) {
> > + kunmap_local(pg_addr);
> > + if (!to_rq)
> > + set_page_dirty_lock(curr);
> > + }
> > + ublk_release_pages(ubq, pgs, nr_pin);
> > +
> > +exit:
> > + return rq_bytes - left;
> > +}
> > +
>
> Hi Ming,
>
> I note that you pin the user buffer's pages, memcpy(), and then release them immediately.
>
> 1) I think copy_page_from_iter() may be another choice for copying the user buffer
> to the biovecs, since copy_page_from_iter() does not pin pages (but it may raise
> page faults).
copy_page_from_iter()/copy_page_to_iter() need the userspace page anyway, and
then copy between the userspace page and the bvec_iter pages; all they do is
kmap/copy/kunmap.
I don't see how that would be useful here.
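Just to illustrate what is being discussed: below is a rough, untested sketch
of an iov_iter-based copy (ublk_copy_pages_iter() is a made-up name; the rest
follows the code in this patch). It trades the pin/unpin cost for possible
page faults during the copy, so it has to run in a context that can sleep:

static int ublk_copy_pages_iter(struct ublk_queue *ubq, struct request *rq,
		bool to_rq, unsigned int max_bytes)
{
	const unsigned int rq_bytes = min(blk_rq_bytes(rq), max_bytes);
	struct ublk_io *io = &ubq->ios[rq->tag];
	struct req_iterator req_iter;
	struct iov_iter iter;
	struct iovec iov;
	struct bio_vec bv;
	size_t done = 0;

	/* no pinning: the copy helpers fault userspace pages in on demand */
	if (import_single_range(to_rq ? WRITE : READ,
				u64_to_user_ptr(io->addr), rq_bytes,
				&iov, &iter))
		return 0;

	rq_for_each_segment(bv, rq, req_iter) {
		size_t ret;

		if (to_rq)
			/* userspace buffer -> request pages */
			ret = copy_page_from_iter(bv.bv_page, bv.bv_offset,
					bv.bv_len, &iter);
		else
			/* request pages -> userspace buffer */
			ret = copy_page_to_iter(bv.bv_page, bv.bv_offset,
					bv.bv_len, &iter);

		done += ret;
		if (ret != bv.bv_len)
			break;
	}
	return done;
}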
>
> 2) Or will you design some mechanism, such as an LRU, to manage these pinned pages?
> For example, keep pages that are frequently required pinned for a long time, and
> release pages that have not been used for a while.
> I remember you have talked about this LRU for pinned pages?
Let me explain it a bit.

When I worked on v1/v2, 'perf report' showed get_user_pages_fast() as one of
the top samples. That turned out to be a bug, which is fixed in
https://github.com/ming1/linux/commit/3c9fd476951759858cc548dee4cedc074194d0b0

After the issue was fixed, get_user_pages_fast() no longer showed up as a hot
spot. I actually implemented one patch which pins all pages for the whole
lifetime of the ubd device, but did not see an obvious improvement, so I gave
up on the idea.
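For the record, that experiment was shaped roughly like the following (a
reconstruction, not the actual dropped patch; ublk_pin_io_pages() and
ublk_unpin_io_pages() are made-up names): pin the whole ublksrv buffer once at
queue setup with FOLL_LONGTERM, so the per-IO path only does kmap/copy/kunmap.

/* pin the whole ublksrv buffer once when the queue is set up */
static int ublk_pin_io_pages(struct ublk_io *io, unsigned int len,
		struct page ***ppages)
{
	int nr_pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
	struct page **pages;
	int ret;

	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* FOLL_LONGTERM because the pin outlives any single request */
	ret = pin_user_pages_fast(io->addr, nr_pages,
			FOLL_WRITE | FOLL_LONGTERM, pages);
	if (ret < nr_pages) {
		if (ret > 0)
			unpin_user_pages(pages, ret);
		kfree(pages);
		return ret < 0 ? ret : -EFAULT;
	}

	*ppages = pages;
	return nr_pages;
}

/* undo it once at queue teardown */
static void ublk_unpin_io_pages(struct page **pages, int nr_pages)
{
	unpin_user_pages(pages, nr_pages);
	kfree(pages);
}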
In the test VM on my laptop, a single-job ubd/null randwrite can reach 700K IOPS.
>
> Which one do you think is better: copy_page_from_iter() or pinning pages with an LRU?
> Maybe it depends on the user's workload?
So far, in the enablement stage, I think the current approach is just fine,
but we can still improve it in the future.
Thanks,
Ming