router: poor performance over real NICs, despite decent performance over veth #4593
The ultimate performance would come from some XDP-based approach. The DPDK framework (https://doc.dpdk.org/guides/index.html) seems like less work than doing it from scratch; using it from Go code isn't trivial either, but it appears feasible (https://pkg.go.dev/github.com/yerden/go-dpdk, https://pkg.go.dev/github.com/millken/dpdk-go).

This article makes a series of less involved suggestions that do not require XDP/DPDK: https://medium.com/@pavel.odintsov/capturing-packets-in-linux-at-a-speed-of-millions-of-packets-per-second-without-using-third-party-ef782fe8959d Maybe that's a worthy first step. There is a significant drawback: the ring API is meant for traffic sniffing, so packets whose destination matches the local host are duplicated: one copy goes to the sniffer's ring and the other to the kernel's network stack. To prevent that we'd need to play games, possibly games that nullify the benefits, but maybe not. One way is to filter the traffic with an ingress qdisc (https://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.adv-qdisc.ingress.html; here's a tutorial: https://www.dasblinkenlichten.com/working-with-tc-on-linux-systems/). If my understanding is correct, the following sequence would suppress all traffic from eth0 to the kernel stack, leaving only the ring's copy:
... most likely the packet still gets copied before being dropped. Too bad.
I did a little more research to clarify the (fast-changing) state of affairs with the Linux kernel and high-performance networking. There seem to be, at the moment, three main avenues:
The transition from AF_PACKET v4 to AF_XDP is originally described by its author (Björn Töpel) here: https://lwn.net/Articles/745934/ or, in a more polished version, here: https://www.kernel.org/doc/html/v4.18/networking/af_xdp.html The AF_XDP API seems reasonably accessible. It has several levels of acceleration available depending on the driver's capabilities and will make the best of what is there. It does seem to require loading an eBPF program; however, that program is a canned, well-known one that some libraries (including Go ones: https://pkg.go.dev/github.com/liheng562653799/xdp) can deal with for you (or maybe that has been automated away by now; checking was on the to-do list).

The DPDK framework will also use AF_XDP behind the scenes, unless it has a user-mode driver for the device being used, in which case it switches to a very different approach and practically maps the device into user space. Using the framework seems a lot more involved, so it is only worth it if we want user-mode driver performance. There are Go libraries that expose the DPDK framework, though, as mentioned previously.

I found that this doc provided some interesting insight into performance expectations: https://fast.dpdk.org/doc/perf/DPDK_23_03_Intel_NIC_performance_report.pdf

The relative performance of the various approaches is the subject of contradictory reports, but roughly:
In order to preserve portability to non-Linux platforms, we'll need to keep the plain IP/UDP socket code. Likewise, it is possible that the AF_XDP code isn't as portable as the AF_PACKET-with-rings code, so we probably want to keep both if AF_XDP is much better. Refer to the updated action plan in the description.
…4651) WriteTo needs to be given the destination address explicitly. WriteBatch, on the other hand, can either find it in each packet structure or rely on the connection's destination. WriteTo is only used to send BFD packets, and it turns out that BFD packets can just as easily be sent via the regular forwarders, which use WriteBatch. The motivation for doing that is to simplify the interface between the dataplane and the forwarders, in view of supporting multiple underlays with possibly very different interfaces. The channels around the processor tasks seem like a good universal interface. In passing, also removed a duplicate field; slightly off-topic, but still in the spirit of noise abatement. As this facilitates the necessary refactoring... Contributes to #4593
In every case where the router modified packets, it would serialize updated headers to a temporary buffer and then copy that to the packet buffer. To avoid this extra copy, replaced gopacket.serializeBuffer with a custom implementation that writes to a given buffer; in this case, the packet's raw buffer. There is one remaining copy for some SCMP messages, because we have to move the existing packet to the payload. That too could be avoided, but it's for another PR; it would require leaving headroom in received packets. The performance impact is very small since, on the critical path, it only avoids copying one SCION header per packet, but it is a simplification. It also pays back the copy added by a previous simplification of the BFD code. As such... Contributes to #4593
The goal is to eventually remove all underlay-specific code from the main router code and place it in a plugin, so that there can then be multiple underlay plugins. This PR isn't attempting to fully accomplish that: it avoids moving some code so that the diff is easier to read. The price to pay is a bit of extra spaghetti left between the main code and the plugin. The code that should eventually move to the underlay is:
* runReceiver and runForwarder,
* likely some of the BFD code,
* the opening of connections (currently done by connector.go).

Other changes being planned:
* Stop reusing the internal connection for sibling links (so we can take advantage of bound connections).
* Add knowledge of multiple underlays to the configurator.
* Make underlay addresses opaque to the router.
* Demultiplex to links on ingest, so that ifID and srcAddress (when they are defined by the link) are obtained in the most efficient way (for example, directly from the link's fields).

Contributes to: #4593
In DPDK, vendors regularly publish their performance reports: https://core.dpdk.org/perf-reports/ Well... some of the vendors (Nvidia does that for almost every release; Broadcom and Intel every now and then). For Intel, 23.03 is the latest performance report, but for some reason they haven't included their best available NICs there.
The absolute best for DPDK is 260 Mpps per single NIC, reported by Nvidia for their flagship ConnectX-7 NIC.
There is one fundamental problem here: to make an educated choice about framework/library/approach performance for a matter as complicated as a NIC, you need to do a bottleneck analysis. You would then discover one thing: on the test systems, DPDK is mostly limited by the NIC's performance. It scales pretty much linearly with CPU cores and CPU frequency (unless you cross sub-NUMA borders, etc.), and it is not unheard of to get more than 1 Gpps of forwarding performance from a single machine running DPDK-based software (and that is for forwarding, i.e. some work with packets, not just send/receive). AF_XDP is still relatively young, and your bottleneck would likely be the kernel.

My advice, if you want to make a better guess: set up a small lab. You would need two relatively old Xeon machines (E5v2 is enough) and a couple of NICs, connect them directly with DACs, set up a load generator (I was using Cisco TRex in my experiments; it is open source, a bit funky, but works), and experiment with different drivers. You'll spend maybe 2-3 weeks writing simple AF_XDP and DPDK apps, but you'll get numbers and you'll understand how hard each of them is to work with.

And I understand the concerns about compatibility and ease of use from Go code, but keep in mind that the industry developed DPDK and other projects (e.g. VPP) for a reason, and if you go the "we'll use plain AF_XDP" route you will need to reimplement half of those concepts from scratch. To the extent that it might be easier to rewrite your whole stack in C++ than to reimplement all of that in Go.

P.S. Keep in mind that, because of how Go's GC works, this will require special handling. At some thousands of pps it doesn't matter, but at 30+ Mpps it will.
* Moved the sender and receiver tasks to the underlay. Metrics and a few other things had to change a bit as a result.
* Got sick of maintaining the fiction that a zero-valued Dataplane makes sense, so created constructors and removed the bug-prone and merit-less lazy initializations. While at it, unexported the dataPlane struct altogether so the constructors cannot be bypassed in the future.
* Moved the udpip provider to its own subdirectory.
* Adjusted tests as needed.

Contributes to: #4593
The router code itself can demonstrably forward 800K small packets per second, and 10Gb/s of traffic in larger (2K) packets, when benchmarked over veth. However, the observed performance is less than half of that when using real NICs, including 10GigE NICs.
Since the processing code isn't the bottleneck, it has to be either the effect of real NIC activity on the overall system (e.g. interrupt-processing overhead), or the impact of the API used by the router (a regular UDP socket) on real I/O versus virtual I/O.
Creating this work item to track investigation and resolution.
Current action plan: