OpenBSD Journal

Developer Blog: dlg: making ami(4) better

Contributed by dlg from the kernel-hackers dept.

I finished a major reworking of the ami(4) driver about two weeks ago. The major goal was to streamline the code paths that take a command from the operating system onto the hardware and off again. I've already described how those paths have been split up from one huge, generic, and hairy path into several lightweight and specific ones. However, just this morning marco and I figured out how to improve the interactivity of your system when running with a MegaRAID controller.

ami(4) is a SCSI driver, so its job is to translate requests from the SCSI midlayer (ie, the scsibus driver) in the OpenBSD kernel into commands that go onto the hardware. The SCSI midlayer gets requests from sd, which gets requests from the block layer, which gets requests from the filesystem, which gets requests from applications running in userspace. The problem here is that userland apps can generate a large number of requests that are eventually turned into a metric buttload of io commands that ami has to deal with. The MegaRAID in my box at home can deal with 126 commands at once, and using iogen (or even find) it isn't hard to generate enough io requests to do just that.

So for the fun of it just imagine that the midlayer is trying to push 126 requests onto ami all at once.

In the old ami code what would happen is that for every command that came into the driver we would busy wait till the hardware was ready to accept a new command, then push it onto the hardware, then return to the midlayer. The midlayer would then immediately issue a new command to ami, which would then busy wait again, and so on. Imagine doing that 126 times in a row.

Now realise that all of this is happening in the kernel, meaning that nothing else can run until it's done. The result is that you get really noticeable pauses while ami puts all these commands on the hardware, and the main culprit is the busy waiting ami does until the hardware becomes ready for a new command. In the worst cases I've seen ami lock up the machine for 3 or 4 seconds when this happens.

Getting rid of these busy waits has been the main reason I've been reworking the ami driver. So here is what happens now when the midlayer tries to put the same 126 commands on the hardware.

The new code will take a request from the SCSI midlayer and put it on a worklist. If the hardware is too busy to take that command right now, ami will schedule a timeout to be run at the next tick of the clock interrupt, and try it again from there. After scheduling the timeout it returns to the midlayer which pushes another command into the driver. Each time the midlayer gives ami a command, we just add it to the list and try to put it on the hardware. Eventually, the midlayer will stop giving ami commands and it will return control to the rest of the system.

Now ami has a list of commands that it has to put on the hardware. At the next clock interrupt it tries to submit commands from the work list onto the hardware. If the hardware gets busy again and won't take more commands from the work list, we schedule another timeout and try them again from there.

Notice that we're no longer busy waiting for the hardware to become ready? Instead we try to use the hardware every time the clock ticks. This means that other stuff can run in between the clock ticks, which results in an improvement in the interactivity of the system.

Most of this work was finished about two weeks ago, but unfortunately I misunderstood how the timeout API worked. I was accidentally scheduling the retries to happen after 0 ticks of the clock, which meant that the current clock interrupt would keep running the new scheduling attempts. This in turn led to the same behaviour I was trying to avoid, because we were basically looping until all the commands had been put on the hardware. It wasn't until marco@ read the code that he spotted what I'd done wrong. He committed the fix this morning.

So now ami(4) is streamlined and doesn't busy wait.



Comments
  1. By Anonymous Coward (193.63.217.208) on

    Always good to see these blog posts giving an insight into development. I am curious as to what happens when the commands come in faster than the MegaRAID can handle them. Does the worklist have an upper bound to its length, so that we end up busy waiting on the worklist rather than the hardware?

    Comments
    1. By David Gwynne (220.245.180.133) loki@animata.net on

      Previously the commands couldn't come from the midlayer faster than the hardware could handle them simply because we were always waiting for the hardware to catch up. Currently we just keep taking what the midlayer gives us and stick it on the worklist to be dealt with out of the timeouts. So yes, it is possible we can queue up commands way faster than the hardware itself can be expected to deal with them.

      When a scsi controller is attached in OpenBSD it advertises how many outstanding commands it can deal with for each device on the scsibus. That advertisement is the upper bound on the length of the worklist.

  2. By edgars (62.85.49.190) on

    so, it means, that ami compatible devices now are evilfast?

    Comments
    1. By Bastiaan Jacques (80.60.243.33) on

      No, it means the system is (much) more responsive whilst ami works.

  3. By Jason Wright (65.202.219.66) jason@thought.net on

    Hmm, if the pending queue is non-empty when a command is completed (presumably from an interrupt), why not attempt submission then? No timeouts needed. This is how network drivers work.

    Ie, a command comes in, you immediately put it on a software queue, and call a routine that tries to queue it on the hardware (leaving it on the sw queue if that fails).

    When an interrupt comes along, do your normal stuff and finally reprocess the software queue as above: no timeouts needed. The timeouts only increase the potential latency.

    Comments
    1. By Jason L. Wright (65.202.219.66) jason@thought.net on

      Actually, a good example of this technique is ubsec(4):ubsec_feed()

    2. By Anonymous Coward (67.64.89.177) on

      The frequency of interrupts is too low for that model to work effectively.

      Comments
      1. By tedu (69.12.168.114) on

        how can a slot open up without generating an interrupt?

        Comments
        1. By David Gwynne (130.102.78.195) loki@animata.net on

          There are two things involved here: available slots on the hardware, and whether the hardware is ready to accept new commands to fill those slots. An interrupt is generated when a command finishes, so the slot opens up, but that doesn't guarantee the hardware is ready to receive a new command to fill the slot again.

          Operations like large sequential reads and writes (eg, dd) generally only use one or two commands at a time. So what happens is the first time you put a command on the hardware, it works cos the hardware is idle and able to take the command. Later on the command completes, an interrupt is generated, and the handler takes the command off the hardware and returns it to the midlayer. The midlayer then takes this opportunity to push a new command down for you to run. However, because the hardware is busy at the moment it doesn't get submitted, it simply gets added to the worklist to be tried later. But until a new command is submitted by the midlayer or a new interrupt is generated, the command on the worklist won't be retried and your sequential io basically stalls.

          That's a description of the worst case scenario. In practice this doesn't actually happen much because you're always doing io and things eventually move along. Under workloads with lots of random readers and writers (eg, iogen) you won't notice at all, because the interrupts and the midlayer generate so many opportunities for commands to be run.

          This is demonstrated by the fact that there is no speed difference between ami with the timeout and without when using iogen. However, without the timeout dd runs at about half the speed as it does when the timeout is used.

          Using the timeouts basically guarantees that any workload will have its commands retried without relying on other io to generate the opportunities for command submission.

          Comments
          1. By Jason L. Wright (24.254.95.239) jason@thought.net on

            I'm thus enlightened. Tell ya what... I'll stick to ethernet/serial/crypto... Disk stuff is just black magic.

            Comments
            1. By David Gwynne (130.102.78.195) loki@animata.net on

              This busy thing is a "feature" of the MegaRAIDs I think, hence all the timeout fun. Otherwise the driver is fairly standard.

              You do serial stuff eh? The whole tty vs cua state machine makes my head hurt. I have a usb serial device that needs some love, but last time I played in ucom stuff I broke something.

  4. By Amir Mesry (66.23.227.241) on

    You guys have the latest Sata LSI cards?

    Comments
    1. By David Gwynne (130.102.78.195) loki@animata.net on

      No, but I'd love one.

  5. By Venture37 (217.22.88.124) venture37 # hotmail com on www.geeklan.co.uk

    Bah!
    & you spent all that time doing all that work, when you could have worked on some kind of .dll wrapper so the windows driver would work on openbsd. tut tut

    *ducks*
    ;)

    :D
