summaryrefslogtreecommitdiffstats
path: root/drivers/vhost/net.c
Commit message (Collapse)AuthorAgeFilesLines
* vhost_net: fix possible infinite loopJason Wang2019-08-041-16/+13Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit e2412c07f8f3040593dfb88207865a3cd58680c0 upstream. When the rx buffer is too small for a packet, we will discard the vq descriptor and retry it for the next packet: while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk, &busyloop_intr))) { ... /* On overrun, truncate and discard */ if (unlikely(headcount > UIO_MAXIOV)) { iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1); err = sock->ops->recvmsg(sock, &msg, 1, MSG_DONTWAIT | MSG_TRUNC); pr_debug("Discarded rx packet: len %zd\n", sock_len); continue; } ... } This makes it possible to trigger a infinite while..continue loop through the co-opreation of two VMs like: 1) Malicious VM1 allocate 1 byte rx buffer and try to slow down the vhost process as much as possible e.g using indirect descriptors or other. 2) Malicious VM2 generate packets to VM1 as fast as possible Fixing this by checking against weight at the end of RX and TX loop. This also eliminate other similar cases when: - userspace is consuming the packets in the meanwhile - theoretical TOCTOU attack if guest moving avail index back and forth to hit the continue after vhost find guest just add new buffers This addresses CVE-2019-3900. Fixes: d8316f3991d20 ("vhost: fix total length when packets are too short") Fixes: 3a4d5c94e9593 ("vhost_net: a kernel-level virtio server") Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> [jwang: backport to 4.19] Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* vhost: introduce vhost_exceeds_weight()Jason Wang2019-08-041-16/+6Star
| | | | | | | | | | | | | | | | | | | | | | commit e82b9b0727ff6d665fff2d326162b460dded554d upstream. We used to have vhost_exceeds_weight() for vhost-net to: - prevent vhost kthread from hogging the cpu - balance the time spent between TX and RX This function could be useful for vsock and scsi as well. So move it to vhost.c. Device must specify a weight which counts the number of requests, or it can also specific a byte_weight which counts the number of bytes that has been processed. Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> [jwang: backport to 4.19, fix conflict in net.c] Signed-off-by: Jack Wang <jinpu.wang@cloud.ionos.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* vhost_net: disable zerocopy by defaultJason Wang2019-07-261-1/+1
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit 098eadce3c622c07b328d0a43dda379b38cf7c5e ] Vhost_net was known to suffer from HOL[1] issues which is not easy to fix. Several downstream disable the feature by default. What's more, the datapath was split and datacopy path got the support of batching and XDP support recently which makes it faster than zerocopy part for small packets transmission. It looks to me that disable zerocopy by default is more appropriate. It cold be enabled by default again in the future if we fix the above issues. [1] https://patchwork.kernel.org/patch/3787671/ Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
* vhost: fix OOB in get_rx_bufs()Jason Wang2019-02-061-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit b46a0bf78ad7b150ef5910da83859f7f5a514ffd ] After batched used ring updating was introduced in commit e2b3b35eb989 ("vhost_net: batch used ring update in rx"). We tend to batch heads in vq->heads for more than one packet. But the quota passed to get_rx_bufs() was not correctly limited, which can result a OOB write in vq->heads. headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx, vhost_len, &in, vq_log, &log, likely(mergeable) ? UIO_MAXIOV : 1); UIO_MAXIOV was still used which is wrong since we could have batched used in vq->heads, this will cause OOB if the next buffer needs more than 960 (1024 (UIO_MAXIOV) - 64 (VHOST_NET_BATCH)) heads after we've batched 64 (VHOST_NET_BATCH) heads: Acked-by: Stefan Hajnoczi <stefanha@redhat.com> ============================================================================= BUG kmalloc-8k (Tainted: G B ): Redzone overwritten ----------------------------------------------------------------------------- INFO: 0x00000000fd93b7a2-0x00000000f0713384. First byte 0xa9 instead of 0xcc INFO: Allocated in alloc_pd+0x22/0x60 age=3933677 cpu=2 pid=2674 kmem_cache_alloc_trace+0xbb/0x140 alloc_pd+0x22/0x60 gen8_ppgtt_create+0x11d/0x5f0 i915_ppgtt_create+0x16/0x80 i915_gem_create_context+0x248/0x390 i915_gem_context_create_ioctl+0x4b/0xe0 drm_ioctl_kernel+0xa5/0xf0 drm_ioctl+0x2ed/0x3a0 do_vfs_ioctl+0x9f/0x620 ksys_ioctl+0x6b/0x80 __x64_sys_ioctl+0x11/0x20 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Slab 0x00000000d13e87af objects=3 used=3 fp=0x (null) flags=0x200000000010201 INFO: Object 0x0000000003278802 @offset=17064 fp=0x00000000e2e6652b Fixing this by allocating UIO_MAXIOV + VHOST_NET_BATCH iovs for vhost-net. This is done through set the limitation through vhost_dev_init(), then set_owner can allocate the number of iov in a per device manner. This fixes CVE-2018-16880. Fixes: e2b3b35eb989 ("vhost_net: batch used ring update in rx") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* vhost: log dirty page correctlyJason Wang2019-01-311-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4 ] Vhost dirty page logging API is designed to sync through GPA. But we try to log GIOVA when device IOTLB is enabled. This is wrong and may lead to missing data after migration. To solve this issue, when logging with device IOTLB enabled, we will: 1) reuse the device IOTLB translation result of GIOVA->HVA mapping to get HVA, for writable descriptor, get HVA through iovec. For used ring update, translate its GIOVA to HVA 2) traverse the GPA->HVA mapping to get the possible GPA and log through GPA. Pay attention this reverse mapping is not guaranteed to be unique, so we should log each possible GPA in this case. This fix the failure of scp to guest during migration. In -next, we will probably support passing GIOVA->GPA instead of GIOVA->HVA. Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API") Reported-by: Jintack Lim <jintack@cs.columbia.edu> Cc: Jintack Lim <jintack@cs.columbia.edu> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* vhost: switch to use new message formatJason Wang2018-08-061-0/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We use to have message like: struct vhost_msg { int type; union { struct vhost_iotlb_msg iotlb; __u8 padding[64]; }; }; Unfortunately, there will be a hole of 32bit in 64bit machine because of the alignment. This leads a different formats between 32bit API and 64bit API. What's more it will break 32bit program running on 64bit machine. So fixing this by introducing a new message type with an explicit 32bit reserved field after type like: struct vhost_msg_v2 { __u32 type; __u32 reserved; union { struct vhost_iotlb_msg iotlb; __u8 padding[64]; }; }; We will have a consistent ABI after switching to use this. To enable this capability, introduce a new ioctl (VHOST_SET_BAKCEND_FEATURE) for userspace to enable this feature (VHOST_BACKEND_F_IOTLB_V2). Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: batch update used ring for datacopy TXJason Wang2018-07-221-15/+25
| | | | | | | | | | | | | Like commit e2b3b35eb989 ("vhost_net: batch used ring update in rx"), this patches implements batch used ring update for datacopy TX (zerocopy has already done some kind of batching). Testpmd transmission from guest to host (XDP_DROP on tap) shows 25.8% improvement (from ~3.1Mpps to ~3.9Mpps) on Broadwell i7-5600U CPU @ 2.60GHz machine. Netperf TCP tests does not show obvious differences. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCHJason Wang2018-07-221-4/+4
| | | | | | | A more generic name which could be used for TX as well. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: rename vhost_rx_signal_used() to vhost_net_signal_used()Jason Wang2018-07-221-4/+4
| | | | | | | Rename for reusing this for TX. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: split out datacopy logicJason Wang2018-07-221-20/+90
| | | | | | | | | Instead of mixing zerocopy and datacopy logics, this patch tries to split datacopy logic out. This results for a more compact code and ad-hoc optimization could be done on top more easily. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: introduce tx_can_batch()Jason Wang2018-07-221-2/+7
| | | | | | | | Introduce tx_can_batch() to determine whether TX could be batched. This will help to reduce the code duplication in the future. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: introduce get_tx_bufs()Jason Wang2018-07-221-17/+32
| | | | | | | | | Factor out logic of getting tx buffer and iov iter initialization. This will be used for reducing codes duplication in the future. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: introduce vhost_exceeds_weight()Jason Wang2018-07-221-5/+8
| | | | | Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: introduce helper to initialize tx iov iterJason Wang2018-07-221-9/+17
| | | | | | | Introduce init_iov_iter() in order to be reused by future patch. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: drop unnecessary parameterJason Wang2018-07-221-4/+2Star
| | | | | Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: Avoid rx vring kicks during busyloopToshiaki Makita2018-07-041-3/+7
| | | | | | | | | | We may run out of avail rx ring descriptor under heavy load but busypoll did not detect it so busypoll may have exited prematurely. Avoid this by checking rx ring full during busypoll. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: Avoid rx queue wake-ups during busypollToshiaki Makita2018-07-041-7/+19
| | | | | | | | | | | | | We may run handle_rx() while rx work is queued. For example a packet can push the rx work during the window before handle_rx calls vhost_net_disable_vq(). In that case busypoll immediately exits due to vhost_has_work() condition and enables vq again. This can lead to another unnecessary rx wake-ups, so poll rx work instead of enabling the vq. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: Avoid tx vring kicks during busyloopToshiaki Makita2018-07-041-13/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Under heavy load vhost busypoll may run without suppressing notification. For example tx zerocopy callback can push tx work while handle_tx() is running, then busyloop exits due to vhost_has_work() condition and enables notification but immediately reenters handle_tx() because the pushed work was tx. In this case handle_tx() tries to disable notification again, but when using event_idx it by design cannot. Then busyloop will run without suppressing notification. Another example is the case where handle_tx() tries to enable notification but avail idx is advanced so disables it again. This case also leads to the same situation with event_idx. The problem is that once we enter this situation busyloop does not work under heavy load for considerable amount of time, because notification is likely to happen during busyloop and handle_tx() immediately enables notification after notification happens. Specifically busyloop detects notification by vhost_has_work() and then handle_tx() calls vhost_enable_notify(). Because the detected work was the tx work, it enters handle_tx(), and enters busyloop without suppression again. This is likely to be repeated, so with event_idx we are almost not able to suppress notification in this case. To fix this, poll the work instead of enabling notification when busypoll is interrupted by something. IMHO vhost_has_work() is kind of interruption rather than a signal to completely cancel the busypoll, so let's run busypoll after the necessary work is done. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: Rename local variables in vhost_net_rx_peek_head_lenToshiaki Makita2018-07-041-17/+17
| | | | | | | | So we can easily see which variable is for which, tx or rx. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: validate sock before trying to put its fdJason Wang2018-06-231-1/+2
| | | | | | | | | | | | Sock will be NULL if we pass -1 to vhost_net_set_backend(), but when we meet errors during ubuf allocation, the code does not check for NULL before calling sockfd_put(), this will lead NULL dereferencing. Fixing by checking sock pointer before. Fixes: bab632d69ee4 ("vhost: vhost TX zero-copy support") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* treewide: kmalloc() -> kmalloc_array()Kees Cook2018-06-131-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kmalloc() function has a 2-factor argument form, kmalloc_array(). This patch replaces cases of: kmalloc(a * b, gfp) with: kmalloc_array(a * b, gfp) as well as handling cases of: kmalloc(a * b * c, gfp) with: kmalloc(array3_size(a, b, c), gfp) as it's slightly less ugly than: kmalloc_array(array_size(a, b), c, gfp) This does, however, attempt to ignore constant size factors like: kmalloc(4 * 1024, gfp) though any constants defined via macros get caught up in the conversion. Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant. The tools/ directory was manually excluded, since it has its own implementation of kmalloc(). The Coccinelle script used for this was: // Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kmalloc( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kmalloc( - (sizeof(THING)) * E + sizeof(THING) * E , ...) ) // Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kmalloc( - sizeof(u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(__u8) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kmalloc( - sizeof(u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(__u8) * COUNT + COUNT , ...) | kmalloc( - sizeof(char) * COUNT + COUNT , ...) | kmalloc( - sizeof(unsigned char) * COUNT + COUNT , ...) ) // 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) ) // 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kmalloc + kmalloc_array ( - SIZE * COUNT + COUNT, SIZE , ...) // 3-factor product with 1 sizeof(type) or sizeof(expression), with // redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kmalloc( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kmalloc( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) ) // 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kmalloc( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kmalloc( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) ) // 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kmalloc( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kmalloc( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) ) // Any remaining multi-factor products, first at least 3-factor products, // when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kmalloc(C1 * C2 * C3, ...) | kmalloc( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kmalloc( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kmalloc( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) ) // And then all remaining 2 factors products when they're not all constants, // keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kmalloc(sizeof(THING) * C2, ...) | kmalloc(sizeof(TYPE) * C2, ...) | kmalloc(C1 * C2 * C3, ...) | kmalloc(C1 * C2, ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kmalloc + kmalloc_array ( - (E1) * E2 + E1, E2 , ...) | - kmalloc + kmalloc_array ( - (E1) * (E2) + E1, E2 , ...) | - kmalloc + kmalloc_array ( - E1 * E2 + E1, E2 , ...) ) Signed-off-by: Kees Cook <keescook@chromium.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-06-031-13/+24
|\ | | | | | | | | | | | | | | Filling in the padding slot in the bpf structure as a bug fix in 'ne' overlapped with actually using that padding area for something in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
| * vhost_net: flush batched heads before trying to busy pollingJason Wang2018-05-301-13/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit e2b3b35eb989 ("vhost_net: batch used ring update in rx"), we tend to batch updating used heads. But it doesn't flush batched heads before trying to do busy polling, this will cause vhost to wait for guest TX which waits for the used RX. Fixing by flush batched heads before busy loop. 1 byte TCP_RR performance recovers from 13107.83 to 50402.65. Fixes: e2b3b35eb989 ("vhost_net: batch used ring update in rx") Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | vhost_net: use packet weight for rx handler, tooPaolo Abeni2018-04-241-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to commit a2ac99905f1e ("vhost-net: set packet weight of tx polling to 2 * vq size"), we need a packet-based limit for handler_rx, too - elsewhere, under rx flood with small packets, tx can be delayed for a very long time, even without busypolling. The pkt limit applied to handle_rx must be the same applied by handle_tx, or we will get unfair scheduling between rx and tx. Tying such limit to the queue length makes it less effective for large queue length values and can introduce large process scheduler latencies, so a constant valued is used - likewise the existing bytes limit. The selected limit has been validated with PVP[1] performance test with different queue sizes: queue size 256 512 1024 baseline 366 354 362 weight 128 715 723 670 weight 256 740 745 733 weight 512 600 460 583 weight 1024 423 427 418 A packet weight of 256 gives peek performances in under all the tested scenarios. No measurable regression in unidirectional performance tests has been detected. [1] https://developers.redhat.com/blog/2017/06/05/measuring-and-comparing-open-vswitch-performance/ Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tun: convert to use generic xdp_frame and xdp_return_frame APIJesper Dangaard Brouer2018-04-171-3/+4
|/ | | | | | | | | | | | | | | The tuntap driver invented it's own driver specific way of queuing XDP packets, by storing the xdp_buff information in the top of the XDP frame data. Convert it over to use the more generic xdp_frame structure. The main problem with the in-driver method is that the xdp_rxq_info pointer cannot be trused/used when dequeueing the frame. V3: Remove check based on feedback from Jason Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost-net: set packet weight of tx polling to 2 * vq sizehaibinzhang(张海斌)2018-04-091-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | handle_tx will delay rx for tens or even hundreds of milliseconds when tx busy polling udp packets with small length(e.g. 1byte udp payload), because setting VHOST_NET_WEIGHT takes into account only sent-bytes but no single packet length. Ping-Latencies shown below were tested between two Virtual Machines using netperf (UDP_STREAM, len=1), and then another machine pinged the client: vq size=256 Packet-Weight Ping-Latencies(millisecond) min avg max Origin 3.319 18.489 57.303 64 1.643 2.021 2.552 128 1.825 2.600 3.224 256 1.997 2.710 4.295 512 1.860 3.171 4.631 1024 2.002 4.173 9.056 2048 2.257 5.650 9.688 4096 2.093 8.508 15.943 vq size=512 Packet-Weight Ping-Latencies(millisecond) min avg max Origin 6.537 29.177 66.245 64 2.798 3.614 4.403 128 2.861 3.820 4.775 256 3.008 4.018 4.807 512 3.254 4.523 5.824 1024 3.079 5.335 7.747 2048 3.944 8.201 12.762 4096 4.158 11.057 19.985 Seems pretty consistent, a small dip at 2 VQ sizes. Ring size is a hint from device about a burst size it can tolerate. Based on benchmarks, set the weight to 2 * vq size. To evaluate this change, another tests were done using netperf(RR, TX) between two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was tweaked through qemu. Results shown below does not show obvious changes. vq size=256 TCP_RR vq size=512 TCP_RR size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize% 1/ 1/ -7%/ -2% 1/ 1/ 0%/ -2% 1/ 4/ +1%/ 0% 1/ 4/ +1%/ 0% 1/ 8/ +1%/ -2% 1/ 8/ 0%/ +1% 64/ 1/ -6%/ 0% 64/ 1/ +7%/ +3% 64/ 4/ 0%/ +2% 64/ 4/ -1%/ +1% 64/ 8/ 0%/ 0% 64/ 8/ -1%/ -2% 256/ 1/ -3%/ -4% 256/ 1/ -4%/ -2% 256/ 4/ +3%/ +4% 256/ 4/ +1%/ +2% 256/ 8/ +2%/ 0% 256/ 8/ +1%/ -1% vq size=256 UDP_RR vq size=512 UDP_RR size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize% 1/ 1/ -5%/ +1% 1/ 1/ -3%/ -2% 1/ 4/ +4%/ +1% 1/ 4/ -2%/ +2% 1/ 8/ -1%/ -1% 1/ 8/ -1%/ 0% 64/ 1/ -2%/ -3% 64/ 1/ +1%/ +1% 64/ 4/ -5%/ -1% 64/ 4/ +2%/ 0% 64/ 8/ 0%/ -1% 64/ 8/ -2%/ +1% 256/ 1/ +7%/ +1% 256/ 1/ -7%/ 0% 256/ 4/ +1%/ +1% 256/ 4/ -3%/ -4% 256/ 8/ +2%/ +2% 256/ 8/ +1%/ +1% vq size=256 TCP_STREAM vq size=512 TCP_STREAM size/sessions/+thu%/+normalize% size/sessions/+thu%/+normalize% 64/ 1/ 0%/ -3% 64/ 1/ 0%/ 0% 64/ 4/ +3%/ -1% 64/ 4/ -2%/ +4% 64/ 8/ +9%/ -4% 64/ 8/ -1%/ +2% 256/ 1/ +1%/ -4% 256/ 1/ +1%/ +1% 256/ 4/ -1%/ -1% 256/ 4/ -3%/ 0% 256/ 8/ +7%/ +5% 256/ 8/ -3%/ 0% 512/ 1/ +1%/ 0% 512/ 1/ -1%/ -1% 512/ 4/ +1%/ -1% 512/ 4/ 0%/ 0% 512/ 8/ +7%/ -5% 512/ 8/ +6%/ -1% 1024/ 1/ 0%/ -1% 1024/ 1/ 0%/ +1% 1024/ 4/ +3%/ 0% 1024/ 4/ +1%/ 0% 1024/ 8/ +8%/ +5% 1024/ 8/ -1%/ 0% 2048/ 1/ +2%/ +2% 2048/ 1/ -1%/ 0% 2048/ 4/ +1%/ 0% 2048/ 4/ 0%/ -1% 2048/ 8/ -2%/ 0% 2048/ 8/ 5%/ -1% 4096/ 1/ -2%/ 0% 4096/ 1/ -2%/ 0% 4096/ 4/ +2%/ 0% 4096/ 4/ 0%/ 0% 4096/ 8/ +9%/ -2% 4096/ 8/ -5%/ -1% Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Haibin Zhang <haibinzhang@tencent.com> Signed-off-by: Yunfang Tai <yunfangtai@tencent.com> Signed-off-by: Lidong Chen <lidongchen@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-04-021-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c, we had some overlapping changes: 1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE --> MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE 2) In 'net-next' params->log_rq_size is renamed to be params->log_rq_mtu_frames. 3) In 'net-next' params->hard_mtu is added. Signed-off-by: David S. Miller <davem@davemloft.net>
| * vhost_net: add missing lock nesting notationJason Wang2018-03-261-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | We try to hold TX virtqueue mutex in vhost_net_rx_peek_head_len() after RX virtqueue mutex is held in handle_rx(). This requires an appropriate lock nesting notation to calm down deadlock detector. Fixes: 0308813724606 ("vhost_net: basic polling support") Reported-by: syzbot+7f073540b1384a614e09@syzkaller.appspotmail.com Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-03-231-3/+5
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fun set of conflict resolutions here... For the mac80211 stuff, these were fortunately just parallel adds. Trivially resolved. In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the function phy_disable_interrupts() earlier in the file, whilst in 'net-next' the phy_error() call from this function was removed. In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the 'rt_table_id' member of rtable collided with a bug fix in 'net' that added a new struct member "rt_mtu_locked" which needs to be copied over here. The mlxsw driver conflict consisted of net-next separating the span code and definitions into separate files, whilst a 'net' bug fix made some changes to that moved code. The mlx5 infiniband conflict resolution was quite non-trivial, the RDMA tree's merge commit was used as a guide here, and here are their notes: ==================== Due to bug fixes found by the syzkaller bot and taken into the for-rc branch after development for the 4.17 merge window had already started being taken into the for-next branch, there were fairly non-trivial merge issues that would need to be resolved between the for-rc branch and the for-next branch. This merge resolves those conflicts and provides a unified base upon which ongoing development for 4.17 can be based. Conflicts: drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f9524 (IB/mlx5: Fix cleanup order on unload) added to for-rc and commit b5ca15ad7e61 (IB/mlx5: Add proper representors support) add as part of the devel cycle both needed to modify the init/de-init functions used by mlx5. To support the new representors, the new functions added by the cleanup patch needed to be made non-static, and the init/de-init list added by the representors patch needed to be modified to match the init/de-init list changes made by the cleanup patch. Updates: drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function prototypes added by representors patch to reflect new function names as changed by cleanup patch drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init stage list to match new order from cleanup patch ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * vhost_net: examine pointer types during un-producingJason Wang2018-03-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | After commit fc72d1d54dd9 ("tuntap: XDP transmission"), we can actually queueing XDP pointers in the pointer ring, so we should examine the pointer type before freeing the pointer. Fixes: fc72d1d54dd9 ("tuntap: XDP transmission") Reported-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * vhost_net: keep private_data and rx_ring syncedJason Wang2018-03-091-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | We get pointer ring from the exported sock, this means we should keep rx_ring and vq->private synced during both vq stop and backend set, otherwise we may see stale rx_ring. Fixes: c67df11f6e480 ("vhost_net: try batch dequing from skb array") Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * vhost_net: initialize rx_ring in vhost_net_open()Alexander Potapenko2018-03-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KMSAN reported a use of uninit memory in vhost_net_buf_unproduce() while trying to access n->vqs[VHOST_NET_VQ_TX].rx_ring: ================================================================== BUG: KMSAN: use of uninitialized memory in vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vho et.c:170 CPU: 0 PID: 3021 Comm: syz-fuzzer Not tainted 4.16.0-rc4+ #3853 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x1f0 mm/kmsan/kmsan.c:1093 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 vhost_net_buf_unproduce+0x7bb/0x9a0 drivers/vhost/net.c:170 vhost_net_stop_vq drivers/vhost/net.c:974 [inline] vhost_net_stop+0x146/0x380 drivers/vhost/net.c:982 vhost_net_release+0xb1/0x4f0 drivers/vhost/net.c:1015 __fput+0x49f/0xa00 fs/file_table.c:209 ____fput+0x37/0x40 fs/file_table.c:243 task_work_run+0x243/0x2c0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:191 [inline] exit_to_usermode_loop arch/x86/entry/common.c:166 [inline] prepare_exit_to_usermode+0x349/0x3b0 arch/x86/entry/common.c:196 syscall_return_slowpath+0xf3/0x6d0 arch/x86/entry/common.c:265 do_syscall_64+0x34d/0x450 arch/x86/entry/common.c:292 ... origin: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:303 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:213 kmsan_kmalloc_large+0x6f/0xd0 mm/kmsan/kmsan.c:392 kmalloc_large_node_hook mm/slub.c:1366 [inline] kmalloc_large_node mm/slub.c:3808 [inline] __kmalloc_node+0x100e/0x1290 mm/slub.c:3818 kmalloc_node include/linux/slab.h:554 [inline] kvmalloc_node+0x1a5/0x2e0 mm/util.c:419 kvmalloc include/linux/mm.h:541 [inline] vhost_net_open+0x64/0x5f0 drivers/vhost/net.c:921 misc_open+0x7b5/0x8b0 drivers/char/misc.c:154 chrdev_open+0xc28/0xd90 fs/char_dev.c:417 do_dentry_open+0xccb/0x1430 fs/open.c:752 vfs_open+0x272/0x2e0 fs/open.c:866 do_last fs/namei.c:3378 [inline] path_openat+0x49ad/0x6580 fs/namei.c:3519 do_filp_open+0x267/0x640 fs/namei.c:3553 do_sys_open+0x6ad/0x9c0 fs/open.c:1059 SYSC_openat+0xc7/0xe0 fs/open.c:1086 SyS_openat+0x63/0x90 fs/open.c:1080 do_syscall_64+0x2f1/0x450 arch/x86/entry/common.c:287 ================================================================== Fixes: c67df11f6e480 ("vhost_net: try batch dequing from skb array") Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: make getname() functions return length rather than use int* parameterDenys Vlasenko2018-02-121-4/+3Star
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Changes since v1: Added changes in these files: drivers/infiniband/hw/usnic/usnic_transport.c drivers/staging/lustre/lnet/lnet/lib-socket.c drivers/target/iscsi/iscsi_target_login.c drivers/vhost/net.c fs/dlm/lowcomms.c fs/ocfs2/cluster/tcp.c security/tomoyo/network.c Before: All these functions either return a negative error indicator, or store length of sockaddr into "int *socklen" parameter and return zero on success. "int *socklen" parameter is awkward. For example, if caller does not care, it still needs to provide on-stack storage for the value it does not need. None of the many FOO_getname() functions of various protocols ever used old value of *socklen. They always just overwrite it. This change drops this parameter, and makes all these functions, on success, return length of sockaddr. It's always >= 0 and can be differentiated from an error. Tests in callers are changed from "if (err)" to "if (err < 0)", where needed. rpc_sockname() lost "int buflen" parameter, since its only use was to be passed to kernel_getsockname() as &buflen and subsequently not used in any way. Userspace API is not changed. text data bss dec hex filename 30108430 2633624 873672 33615726 200ef6e vmlinux.before.o 30108109 2633612 873672 33615393 200ee21 vmlinux.o Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> CC: David S. Miller <davem@davemloft.net> CC: linux-kernel@vger.kernel.org CC: netdev@vger.kernel.org CC: linux-bluetooth@vger.kernel.org CC: linux-decnet-user@lists.sourceforge.net CC: linux-wireless@vger.kernel.org CC: linux-rdma@vger.kernel.org CC: linux-sctp@vger.kernel.org CC: linux-nfs@vger.kernel.org CC: linux-x25@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
* vfs: do bulk POLL* -> EPOLL* replacementLinus Torvalds2018-02-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | This is the mindless scripted replacement of kernel use of POLL* variables as described by Al, done by this script: for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'` for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done done with de-mangling cleanups yet to come. NOTE! On almost all architectures, the EPOLL* constants have the same values as the POLL* constants do. But they keyword here is "almost". For various bad reasons they aren't the same, and epoll() doesn't actually work quite correctly in some cases due to this on Sparc et al. The next patch from Al will sort out the final differences, and we should be all done. Scripted-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds2018-02-081-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull virtio/vhost updates from Michael Tsirkin: "virtio, vhost: fixes, cleanups, features This includes the disk/cache memory stats for for the virtio balloon, as well as multiple fixes and cleanups" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost: don't hold onto file pointer for VHOST_SET_LOG_FD vhost: don't hold onto file pointer for VHOST_SET_VRING_ERR vhost: don't hold onto file pointer for VHOST_SET_VRING_CALL ringtest: ring.c malloc & memset to calloc virtio_vop: don't kfree device on register failure virtio_pci: don't kfree device on register failure virtio: split device_register into device_initialize and device_add vhost: remove unused lock check flag in vhost_dev_cleanup() vhost: Remove the unused variable. virtio_blk: print capacity at probe time virtio: make VIRTIO a menuconfig to ease disabling it all virtio/ringtest: virtio_ring: fix up need_event math virtio/ringtest: fix up need_event math virtio: virtio_mmio: make of_device_ids const. firmware: Use PTR_ERR_OR_ZERO() virtio-mmio: Use PTR_ERR_OR_ZERO() vhost/scsi: Improve a size determination in four functions virtio_balloon: include disk/file caches memory statistics
| * vhost: remove unused lock check flag in vhost_dev_cleanup()夷则(Caspar)2018-02-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | In commit ea5d404655ba ("vhost: fix release path lockdep checks"), Michael added a flag to check whether we should hold a lock in vhost_dev_cleanup(), however, in commit 47283bef7ed3 ("vhost: move memory pointer to VQs"), RCU operations have been replaced by mutex, we can remove the no-longer-used `locked' parameter now. Signed-off-by: Caspar Zhang <jinli.zjl@alibaba-inc.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds2018-01-311-24/+44
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: 1) Significantly shrink the core networking routing structures. Result of http://vger.kernel.org/~davem/seoul2017_netdev_keynote.pdf 2) Add netdevsim driver for testing various offloads, from Jakub Kicinski. 3) Support cross-chip FDB operations in DSA, from Vivien Didelot. 4) Add a 2nd listener hash table for TCP, similar to what was done for UDP. From Martin KaFai Lau. 5) Add eBPF based queue selection to tun, from Jason Wang. 6) Lockless qdisc support, from John Fastabend. 7) SCTP stream interleave support, from Xin Long. 8) Smoother TCP receive autotuning, from Eric Dumazet. 9) Lots of erspan tunneling enhancements, from William Tu. 10) Add true function call support to BPF, from Alexei Starovoitov. 11) Add explicit support for GRO HW offloading, from Michael Chan. 12) Support extack generation in more netlink subsystems. From Alexander Aring, Quentin Monnet, and Jakub Kicinski. 13) Add 1000BaseX, flow control, and EEE support to mvneta driver. From Russell King. 14) Add flow table abstraction to netfilter, from Pablo Neira Ayuso. 15) Many improvements and simplifications to the NFP driver bpf JIT, from Jakub Kicinski. 16) Support for ipv6 non-equal cost multipath routing, from Ido Schimmel. 17) Add resource abstration to devlink, from Arkadi Sharshevsky. 18) Packet scheduler classifier shared filter block support, from Jiri Pirko. 19) Avoid locking in act_csum, from Davide Caratti. 20) devinet_ioctl() simplifications from Al viro. 21) More TCP bpf improvements from Lawrence Brakmo. 22) Add support for onlink ipv6 route flag, similar to ipv4, from David Ahern. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1925 commits) tls: Add support for encryption using async offload accelerator ip6mr: fix stale iterator net/sched: kconfig: Remove blank help texts openvswitch: meter: Use 64-bit arithmetic instead of 32-bit tcp_nv: fix potential integer overflow in tcpnv_acked r8169: fix RTL8168EP take too long to complete driver initialization. qmi_wwan: Add support for Quectel EP06 rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK ipmr: Fix ptrdiff_t print formatting ibmvnic: Wait for device response when changing MAC qlcnic: fix deadlock bug tcp: release sk_frag.page in tcp_disconnect ipv4: Get the address of interface correctly. net_sched: gen_estimator: fix lockdep splat net: macb: Handle HRESP error net/mlx5e: IPoIB, Fix copy-paste bug in flow steering refactoring ipv6: addrconf: break critical section in addrconf_verify_rtnl() ipv6: change route cache aging logic i40e/i40evf: Update DESC_NEEDED value to reflect larger value bnxt_en: cleanup DIM work on device shutdown ...
| * | vhost_net: stop device during reset ownerJason Wang2018-01-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't stop device before reset owner, this means we could try to serve any virtqueue kick before reset dev->worker. This will result a warn since the work was pending at llist during owner resetting. Fix this by stopping device during owner reset. Reported-by: syzbot+eb17c6162478cc50632c@syzkaller.appspotmail.com Fixes: 3a4d5c94e9593 ("vhost_net: a kernel-level virtio server") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | vhost_net: batch used ring update in rxJason Wang2018-01-101-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch tries to batched used ring update during RX. This is pretty fit for the case when guest is much faster (e.g dpdk based backend). In this case, used ring is almost empty: - we may get serious cache line misses/contending on both used ring and used idx. - at most 1 packet could be dequeued at one time, batching in guest does not make much effect. Update used ring in a batch can help since guest won't access the used ring until used idx was advanced for several descriptors and since we advance used ring for every N packets, guest will only need to access used idx for every N packet since it can cache the used idx. To have a better interaction for both batch dequeuing and dpdk batching, VHOST_RX_BATCH was used as the maximum number of descriptors that could be batched. Test were done between two machines with 2.40GHz Intel(R) Xeon(R) CPU E5-2630 connected back to back through ixgbe. Traffic were generated on one remote ixgbe through MoonGen and measure the RX pps through testpmd in guest when do xdp_redirect_map from local ixgbe to tap. RX pps were increased from 3.05 Mpps to 4.00 Mpps (about 31% improvement). One possible concern for this is the implications for TCP (especially latency sensitive workload). Result[1] does not show obvious changes for most of the netperf test (RR, TX, and RX). And we do get some improvements for RX on some specific size. Guest RX: size/sessions/+thu%/+normalize% 64/ 1/ +2%/ +2% 64/ 2/ +2%/ -1% 64/ 4/ +1%/ +1% 64/ 8/ 0%/ 0% 256/ 1/ +6%/ -3% 256/ 2/ -3%/ +2% 256/ 4/ +11%/ +11% 256/ 8/ 0%/ 0% 512/ 1/ +4%/ 0% 512/ 2/ +2%/ +2% 512/ 4/ 0%/ -1% 512/ 8/ -8%/ -8% 1024/ 1/ -7%/ -17% 1024/ 2/ -8%/ -7% 1024/ 4/ +1%/ 0% 1024/ 8/ 0%/ 0% 2048/ 1/ +30%/ +14% 2048/ 2/ +46%/ +40% 2048/ 4/ 0%/ 0% 2048/ 8/ 0%/ 0% 4096/ 1/ +23%/ +22% 4096/ 2/ +26%/ +23% 4096/ 4/ 0%/ +1% 4096/ 8/ 0%/ 0% 16384/ 1/ -2%/ -3% 16384/ 2/ +1%/ -4% 16384/ 4/ -1%/ -3% 16384/ 8/ 0%/ -1% 65535/ 1/ +15%/ +7% 65535/ 2/ +4%/ +7% 65535/ 4/ 0%/ +1% 65535/ 8/ 0%/ 0% TCP_RR: size/sessions/+thu%/+normalize% 1/ 1/ 0%/ +1% 1/ 25/ +2%/ +1% 1/ 50/ +4%/ +1% 64/ 1/ 0%/ -4% 64/ 25/ +2%/ +1% 64/ 50/ 0%/ -1% 256/ 1/ 0%/ 0% 256/ 25/ 0%/ 0% 256/ 50/ +4%/ +2% Guest TX: size/sessions/+thu%/+normalize% 64/ 1/ +4%/ -2% 64/ 2/ -6%/ -5% 64/ 4/ +3%/ +6% 64/ 8/ 0%/ +3% 256/ 1/ +15%/ +16% 256/ 2/ +11%/ +12% 256/ 4/ +1%/ 0% 256/ 8/ +5%/ +5% 512/ 1/ -1%/ -6% 512/ 2/ 0%/ -8% 512/ 4/ -2%/ +4% 512/ 8/ +6%/ +9% 1024/ 1/ +3%/ +1% 1024/ 2/ +3%/ +9% 1024/ 4/ 0%/ +7% 1024/ 8/ 0%/ +7% 2048/ 1/ +8%/ +2% 2048/ 2/ +3%/ -1% 2048/ 4/ -1%/ +11% 2048/ 8/ +3%/ +9% 4096/ 1/ +8%/ +8% 4096/ 2/ 0%/ -7% 4096/ 4/ +4%/ +4% 4096/ 8/ +2%/ +5% 16384/ 1/ -3%/ +1% 16384/ 2/ -1%/ -12% 16384/ 4/ -1%/ +5% 16384/ 8/ 0%/ +1% 65535/ 1/ 0%/ -3% 65535/ 2/ +5%/ +16% 65535/ 4/ +1%/ +2% 65535/ 8/ +1%/ -1% Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tuntap: XDP transmissionJason Wang2018-01-091-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements XDP transmission for TAP. Since we can't create new queues for TAP during XDP set, exist ptr_ring was reused for queuing XDP buffers. To differ xdp_buff from sk_buff, TUN_XDP_FLAG (0x1UL) was encoded into lowest bit of xpd_buff pointer during ptr_ring_produce, and was decoded during consuming. XDP metadata was stored in the headroom of the packet which should work in most of cases since driver usually reserve enough headroom. Very minor changes were done for vhost_net: it just need to peek the length depends on the type of pointer. Tests were done on two Intel E5-2630 2.40GHz machines connected back to back through two 82599ES. Traffic were generated/received through MoonGen/testpmd(rxonly). It reports ~20% improvements when xdp_redirect_map is doing redirection from ixgbe to TAP (from 2.50Mpps to 3.05Mpps) Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | tun/tap: use ptr_ring instead of skb_arrayJason Wang2018-01-091-19/+20
| |/ | | | | | | | | | | | | | | | | This patch switches to use ptr_ring instead of skb_array. This will be used to enqueue different types of pointers by encoding type into lower bits. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'misc.poll' of ↵Linus Torvalds2018-01-311-1/+1
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull poll annotations from Al Viro: "This introduces a __bitwise type for POLL### bitmap, and propagates the annotations through the tree. Most of that stuff is as simple as 'make ->poll() instances return __poll_t and do the same to local variables used to hold the future return value'. Some of the obvious brainos found in process are fixed (e.g. POLLIN misspelled as POLL_IN). At that point the amount of sparse warnings is low and most of them are for genuine bugs - e.g. ->poll() instance deciding to return -EINVAL instead of a bitmap. I hadn't touched those in this series - it's large enough as it is. Another problem it has caught was eventpoll() ABI mess; select.c and eventpoll.c assumed that corresponding POLL### and EPOLL### were equal. That's true for some, but not all of them - EPOLL### are arch-independent, but POLL### are not. The last commit in this series separates userland POLL### values from the (now arch-independent) kernel-side ones, converting between them in the few places where they are copied to/from userland. AFAICS, this is the least disruptive fix preserving poll(2) ABI and making epoll() work on all architectures. As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and it will trigger only on what would've triggered EPOLLWRBAND on other architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered at all on sparc. With this patch they should work consistently on all architectures" * 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits) make kernel-side POLL... arch-independent eventpoll: no need to mask the result of epi_item_poll() again eventpoll: constify struct epoll_event pointers debugging printk in sg_poll() uses %x to print POLL... bitmap annotate poll(2) guts 9p: untangle ->poll() mess ->si_band gets POLL... bitmap stored into a user-visible long field ring_buffer_poll_wait() return value used as return value of ->poll() the rest of drivers/*: annotate ->poll() instances media: annotate ->poll() instances fs: annotate ->poll() instances ipc, kernel, mm: annotate ->poll() instances net: annotate ->poll() instances apparmor: annotate ->poll() instances tomoyo: annotate ->poll() instances sound: annotate ->poll() instances acpi: annotate ->poll() instances crypto: annotate ->poll() instances block: annotate ->poll() instances x86: annotate ->poll() instances ...
| * the rest of drivers/*: annotate ->poll() instancesAl Viro2017-11-281-1/+1
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | vhost: fix skb leak in handle_rx()Wei Xu2017-12-031-10/+10
|/ | | | | | | | | | | | | | | | | | | | | | | Matthew found a roughly 40% tcp throughput regression with commit c67df11f(vhost_net: try batch dequing from skb array) as discussed in the following thread: https://www.mail-archive.com/netdev@vger.kernel.org/msg187936.html Eventually we figured out that it was a skb leak in handle_rx() when sending packets to the VM. This usually happens when a guest can not drain out vq as fast as vhost fills in, afterwards it sets off the traffic jam and leaks skb(s) which occurs as no headcount to send on the vq from vhost side. This can be avoided by making sure we have got enough headcount before actually consuming a skb from the batched rx array while transmitting, which is simply done by moving checking the zero headcount a bit ahead. Signed-off-by: Wei Xu <wexu@redhat.com> Reported-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: conditionally enable tx pollingJason Wang2017-11-151-0/+2
| | | | | | | | | | | | | | | | | We always poll tx for socket, this is sub optimal since this will slightly increase the waitqueue traversing time and more important, vhost could not benefit from commit 9e641bdcfa4e ("net-tun: restructure tun_do_read for better sleep/wakeup efficiency") even if we've stopped rx polling during handle_rx(), tx poll were still left in the waitqueue. Pktgen from a remote host to VM over mlx4 on two 2.00GHz Xeon E5-2650 shows 11.7% improvements on rx PPS. (from 1.28Mpps to 1.44Mpps) Cc: Wei Xu <wexu@redhat.com> Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: do not stall on zerocopy depletionWillem de Bruijn2017-10-101-9/+3Star
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Vhost-net has a hard limit on the number of zerocopy skbs in flight. When reached, transmission stalls. Stalls cause latency, as well as head-of-line blocking of other flows that do not use zerocopy. Instead of stalling, revert to copy-based transmission. Tested by sending two udp flows from guest to host, one with payload of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The large flow is redirected to a netem instance with 1MBps rate limit and deep 1000 entry queue. modprobe ifb ip link set dev ifb0 up tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit tc qdisc add dev tap0 ingress tc filter add dev tap0 parent ffff: protocol ip \ u32 match ip dport 8000 0xffff \ action mirred egress redirect dev ifb0 Before the delay, both flows process around 80K pps. With the delay, before this patch, both process around 400. After this patch, the large flow is still rate limited, while the small reverts to its original rate. See also discussion in the first link, below. Without rate limiting, {1, 10, 100}x TCP_STREAM tests continued to send at 100% zerocopy. The limit in vhost_exceeds_maxpend must be carefully chosen. With vq->num >> 1, the flows remain correlated. This value happens to correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller fractions and ensure correctness also for much smaller values of vq->num, by testing the min() of both explicitly. See also the discussion in the second link below. Changes v1 -> v2 - replaced min with typed min_t - avoid unnecessary whitespace change Link:http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3ThTXa6BnZkygP3tgVpjbp93g@mail.gmail.com Link:http://lkml.kernel.org/r/20170819064129.27272-1-den@klaipeden.com Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* vhost_net: correctly check tx avail during rx busy pollingJason Wang2017-09-051-1/+6
| | | | | | | | | | | | | | | | | | We check tx avail through vhost_enable_notify() in the past which is wrong since it only checks whether or not guest has filled more available buffer since last avail idx synchronization which was just done by vhost_vq_avail_empty() before. What we really want is checking pending buffers in the avail ring. Fix this by calling vhost_vq_avail_empty() instead. This issue could be noticed by doing netperf TCP_RR benchmark as client from guest (but not host). With this fix, TCP_RR from guest to localhost restores from 1375.91 trans per sec to 55235.28 trans per sec on my laptop (Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz). Fixes: 030881372460 ("vhost_net: basic polling support") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: convert (struct ubuf_info)->refcnt to refcount_tEric Dumazet2017-09-021-1/+1
| | | | | | | | | | | | | | | refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. v2: added the change in drivers/vhost/net.c as spotted by Willem. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sock: enable MSG_ZEROCOPYWillem de Bruijn2017-08-041-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with skb_zerocopy_clone() wherever needed due to skb split, merge, resize or clone. Split skb_orphan_frags into two variants. The split, merge, .. paths support reference counted zerocopy buffers, so do not do a deep copy. Add skb_orphan_frags_rx for paths that may loop packets to receive sockets. That is not allowed, as it may cause unbounded latency. Deep copy all zerocopy copy buffers, ref-counted or not, in this path. The exact locations to modify were chosen by exhaustively searching through all code that might modify skb_frag references and/or the the SKBTX_DEV_ZEROCOPY tx_flags bit. The changes err on the safe side, in two ways. (1) legacy ubuf_info paths virtio and tap are not modified. They keep a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags still call skb_copy_ubufs and thus copy frags in this case. (2) not all copies deep in the stack are addressed yet. skb_shift, skb_split and skb_try_coalesce can be refined to avoid copying. These are not in the hot path and this patch is hairy enough as is, so that is left for future refinement. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful ↵Michal Hocko2017-07-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | semantic __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>