OpenBSD Journal

n2k13 update: Hardware VLAN tagging/stripping and performance enhancements for vr(4)

Contributed by pitrh from the tag the puffy dept.

Darren Tucker (dtucker@) writes in with an n2k13 hackathon report with details of his vr(4) driver work:

I intended to start the hackathon by finishing off a diff to add hardware VLAN tagging/stripping support for VT6105M chips in vr(4), then moving on to something else. Although I'm not a kernel or hardware hacker, I already had some mostly working code, the data sheet and a test device. How long could this take?

The VT6105M is one of the last revisions of the reasonably simple VIA Rhine family of 10/100 Ethernet chips. It's used in, amongst other things, the PCEngines ALIX and Soekris net5501 devices. It's capable of doing 802.1Q VLAN tagging and untagging in hardware; however, OpenBSD's driver did not support that, and neither did any of the other BSDs.

Background

The VIA Rhine chips have a bunch of configuration registers to set up the chip, plus some "descriptors" representing the Ethernet frames being sent and received. Each descriptor is 16 bytes and contains a bunch of flags describing the packet (two 32-bit words) and two pointers, one to the Ethernet frame data and one to the next descriptor (another two 32-bit words). In the OpenBSD driver, the descriptors are arranged into two rings of 128 descriptors, one for transmit and one for receive, and the driver and the chip fill and empty the rings. There's a bit in each descriptor indicating whether the chip or the driver currently owns a given descriptor.
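
To make that concrete, each descriptor boils down to a C struct like this (a sketch modelled on the driver's if_vrreg.h; treat the field names as illustrative):

        /* One 16-byte VIA Rhine DMA descriptor. All fields are
         * little-endian 32-bit words shared with the chip. */
        struct vr_desc {
                u_int32_t       vr_status;      /* packet flags, incl. OWN bit */
                u_int32_t       vr_ctl;         /* control flags, frame length */
                u_int32_t       vr_data;        /* phys. address of frame data */
                u_int32_t       vr_next;        /* phys. address of next descriptor */
        };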

Hardware VLAN tagging

Figuring out how to send VLAN-tagged frames was relatively straightforward: the data sheet shows how to set the VLAN ID (and the related priority bits) by setting bits 16-28 in the first word of the TX descriptor ("TDES0"). Observing the emitted packets via tcpdump on another machine, I was puzzled to see that while they were indeed tagged, they were all in VLAN 0. This turned out to be due to a somewhat odd construction in the driver, where it changed the ownership bit to turn the descriptor over to the hardware:

#define VR_TXSTAT_OWN   0x80000000
#define VR_TXOWN(x)     x->vr_ptr->vr_status

        VR_TXOWN(cur_tx) = htole32(VR_TXSTAT_OWN);
which expands to:
        cur_tx->vr_ptr->vr_status = htole32(0x80000000);
The VLAN ID is in the same word as the owner bit, so that effectively zeroed my carefully populated VLAN bits. Whoops. After changing that to a bit operation, I could see correctly tagged VLAN frames. Chris Cappuccio had cleaned that up while reworking the driver, so it was already fixed by the time of the hackathon.
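
Put together, the corrected sequence looks roughly like this (vr_tag is a hypothetical variable holding the tag bits; a sketch, not the committed code):

        /* Populate the VLAN bits in TDES0 first... */
        cur_tx->vr_ptr->vr_status |= htole32(vr_tag << 16);
        /* ...then hand the descriptor to the chip by OR-ing in the
         * OWN bit instead of overwriting the whole status word. */
        cur_tx->vr_ptr->vr_status |= htole32(VR_TXSTAT_OWN);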

While figuring out from the data sheet how to transmit tagged packets was straightforward, figuring out how to receive them was not. There's a bit in the RX descriptor that tells you whether or not a given frame was tagged; however, there's nothing in the data sheet that describes where the VLAN ID is actually stored. Fortunately, the Linux driver already supported hardware VLAN tagging, and it has a nice comment describing where to find it:

 * If hardware VLAN tag extraction is enabled and the chip indicates a 802.1Q
 * packet, the extracted 802.1Q header (2 bytes TPID + 2 bytes TCI) is 4-byte
 * aligned following the CRC.

Why did they do this? My guess is that it's because they'd run out of spare bits in the RX descriptor. Why didn't they at least include this information in the data sheet? Beats me.
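
Whatever the reason, once you know where to look the extraction is simple. A hedged sketch, where buf and len are hypothetical names for the receive buffer and the frame length including the 4-byte CRC:

        /* The 4-byte 802.1Q header (TPID + TCI) sits at the first
         * 4-byte boundary past the end of the frame's CRC. */
        u_int8_t *tag = buf + ((len + 3) & ~3);
        u_int16_t tci = (tag[2] << 8) | tag[3];  /* priority, CFI and VLAN ID */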

I'd done most of this before the hackathon, so after some cleanup I had working code, and a bit later, after feedback from various folks, it was tidied up and committed. Job done.

A minor optimization

Well, not quite. While poking around in the guts of the driver, I noticed a small possible optimization in the transmit path: when the chip's queue is full, the driver would try to add more packets (which would fail), but then poke the chip to tell it to start anyway, which was unnecessary since nothing had changed. Keeping a local counter of packets added to the queue allowed us to avoid a PCI bus write, which helped a little (about 0.5% lower CPU usage in my tests, which is admittedly within the margin of error). That went in too. It turns out FreeBSD already did something similar.
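
The shape of that change is roughly the following (a sketch with illustrative names such as vr_next_packet; the committed code differs):

        int queued = 0;

        while ((m = vr_next_packet(ifp)) != NULL) {
                if (vr_encap(sc, cur_tx, m) != 0)
                        break;          /* TX ring full, stop trying */
                queued++;
        }

        if (queued > 0)
                /* Only pay for the PCI bus write if we queued something. */
                VR_SETBIT16(sc, VR_COMMAND, VR_CMD_TX_GO);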

A major optimization

The OpenBSD driver requests an interrupt for each packet transmitted or received. Interrupts are expensive, so this per-packet overhead is significant.

FreeBSD has implemented interrupt reduction on the transmit path: instead of requesting an interrupt for every packet, they request one every eight packets by setting the "interrupt control" bit (TDES1 bit 23) on only every eighth packet. Chris had previously tried this and seen no improvement, but suggested that I have a try. I did, based on what FreeBSD did, and like Chris I saw no change on my ALIX.

Being stubborn, I spent the next couple of days poking around in the driver, building and booting kernels, browsing the data sheet and running benchmarks. Around this time I realised that my "baseline" numbers were from a kernel built without POOL_DEBUG while the test kernels had it enabled, which invalidated the comparison and forced me to re-run a number of tests.

Eventually, I noticed the following entry in the data sheet for TDES3, which is the pointer to the next descriptor in the ring:

Bit 0: TDCTL[0]. Interrupt Control.
0 = issue interrupt for this packet
1 = no interrupt generated

Wait, what? That seems a lot like the bit we're already using (TDES1 bit 23):

Bit 23: IC. Interrupt Control
0: No interrupt when Transmit OK
1: Interrupt when Transmit OK

Why are there two bits doing what seems to be the same thing (although in opposite directions) in one 128-bit descriptor? Beats me. And why is one of them in the low bits of a pointer? Beats me too (although since the descriptors have to be aligned to at least a 4-byte boundary, I guess the chip can ignore the least significant bits of the address and get away with it).
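
However the chip ended up this way, using the two bits together looks roughly like this (VR_TXCTL_FINT, VR_TXNEXT_INTDISABLE and vr_tx_pkts are illustrative names, not the committed identifiers):

        /* Suppress the TX-complete interrupt (TDES3 bit 0) on most
         * packets, but still request one (TDES1 bit 23) on every
         * eighth so the TX ring gets reclaimed promptly. */
        if ((++sc->vr_tx_pkts % 8) == 0)
                cur_tx->vr_ptr->vr_ctl |= htole32(VR_TXCTL_FINT);
        else
                cur_tx->vr_ptr->vr_next |= htole32(VR_TXNEXT_INTDISABLE);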

I changed the code along those lines, setting the "interrupt disable" bit on most of the packets while keeping the "interrupt request" bit on every eighth packet, and all of a sudden things started to look better! Here's a summary of what systat showed while pushing about 85Mbit/s from userspace on my ALIX, before:


                                                                 Interrupts
  35.7%Int  37.2%Sys   0.8%Usr   0.0%Nic  26.4%Idle             10555 total
|    |    |    |    |    |    |    |    |    |    |             10323 vr0
||||||||||||||||||==================> 
                                                                 7215 IPKTS
                                                                14423 OPKTS
and after:

                                                                Interrupts
  29.2%Int  33.1%Sys   0.0%Usr   0.0%Nic  37.7%Idle             4599 total
|    |    |    |    |    |    |    |    |    |    |             4370 vr0
|||||||||||||||================
                                                                7204 IPKTS
                                                               14403 OPKTS

And similarly for routing 85Mbit/s of TCP through it, before:


                                                                Interrupts
  66.2%Int   0.0%Sys   0.8%Usr   0.0%Nic  33.1%Idle            18469 total
     |    |    |    |    |    |    |    |    |    |            10241 vr0
|||||||||||||||||||||||||||||||||                               8001 vr1
                       
                                                               11069 IPKTS
                                                               11062 OPKTS
and after:

                                                                Interrupts
  45.7%Int   0.0%Sys   0.0%Usr   0.0%Nic  54.3%Idle            12012 total
     |    |    |    |    |    |    |    |    |    |             7709 vr0
|||||||||||||||||||||||                                         4072 vr1
 
                                                               11011 IPKTS
                                                               10992 OPKTS

A useful improvement: a 15% to 30% reduction in CPU usage for the same workload. Since the change only affects the transmit path, the number of interrupts for the packets received (both the data and the TCP ACKs) stays the same.

I was unable to get more than about 85Mbit/s out of a single interface on my ALIX; however, it'd happily route that. I was able to get 72Mbit/s from userspace out of two different interfaces for a total of 144Mbit/s. Even so, freeing up the CPU for other things (such as running PF, since these devices are often used as firewalls) is still useful.

Receive-side interrupt mitigation

To mitigate interrupts on the receive side, a chip would normally have a "holdoff timer" that delays interrupting the CPU for some amount of time after a packet is received, in case more packets arrive shortly afterward. This adds some latency, but reduces the interrupt overhead significantly. Unfortunately, as far as I can tell, the VT6105M does not support this feature, and I spent the rest of my time at the hackathon fiddling with the chip's programmable interval timer, trying unsuccessfully to provide some mitigation on the receive side.

Conclusion

And that's how I spent most of a week tweaking the driver for a 10-cent Ethernet chip.

As a complete kernel n00b, I found being able to get help from the folks who know this stuff extremely useful, and face-to-face has a lot less turnaround time than email. I'd like to thank brad, chris, dlg, jsing, mikeb and sthen for putting up with my questions (and blunders) with good humour.

I usually work on userspace software for fun or large-scale systems for work, so flipping individual bits on actual hardware was a change for me, and quite interesting, although at times frustrating. I'd like to thank the University of Otago and in particular Jim Cheetham for making it possible.

Comments
  1. By Matthieu Herrb (mherrb) matthieu@openbsd.org on

    Hmm isn't that n2k13 (rather than n2k12) ?

    Comments
    1. By Janne Johansson (jj) on http://www.inet6.se

      > Hmm isn't that n2k13 (rather than n2k12) ?

      Yes, fixed. Thanks for pointing it out.

  2. By Darren Tucker (dtucker) dtucker@openbsd.org on

    Note that other chips may also benefit from the code added for the VT6105M chips.

    The bit that the VT6105M needed is not in all chipset revisions' data sheets, and I don't have devices to test it.

    If you'd like to try it, compare baseline performance vs. adding VR_Q_INTDISABLE to the vr_quirks field of your device's entry in the vr_devices table in /usr/src/sys/dev/pci/if_vr.c, as in the sketch below. If you do, please let me know what the result was.
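
    A minimal sketch of that edit (the entry shown is illustrative; find the line matching your own device):

        /* In the vr_devices[] table in if_vr.c, OR the quirk flag
         * into the vr_quirks column for your chip, e.g.: */
        { PCI_VENDOR_VIATECH, PCI_PRODUCT_VIATECH_VT6105,
            VR_Q_NEEDALIGN | VR_Q_INTDISABLE },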
