Duff's Device: loop unrolling for interpreted languages

In 1983, Tom Duff invented a really strange way to use the C language's switch and case statements for the in code "unrolling" optimization of large loops. As an example, he described a simple loop that copies an array to an output register:

	send(to, from, count)
	register short *to, *from;
	register count;
	{
		do
			*to = *from++;
		while(--count>0);
	}

On every iteration of the loop, in addition to the copy that is being performed, the count variable is decremented and a comparison is done against 0. Duff's Device allows you to achieve the same result, but with only an 8th of the overhead:

	send(to, from, count)
	register short *to, *from;
	register count;
	{
		register n=(count+7)/8;
		switch(count%8){
		case 0:	do{	*to = *from++;
		case 7:		*to = *from++;
		case 6:		*to = *from++;
		case 5:		*to = *from++;
		case 4:		*to = *from++;
		case 3:		*to = *from++;
		case 2:		*to = *from++;
		case 1:		*to = *from++;
			}while(--n>0);
		}
	}

What happens is that the loop is unrolled 8 times, so each iteration of the loop runs the internal code 8 times over without the comparison. The genius of Duff's Device is that it takes advantage of the way the C switch/case structure works. The first time through, if the iterations don't divide evenly by 8, the loop code is executed enough times to equal the remainder of iterations/8. It's a little bizarre, because the "do" statement occurs within the switch, but there are "case" statements within the "do". Nevertheless, it's valid C.

Before someone cries foul, remember that this is only really suitable for speeding up the performance of inner loops when no suitable, better algorithm is available. If you code C, most modern compilers are smart enough to automatically optimize your code and unroll loops for you.

For interpreted languages like PHP or Javascript, however, you sometimes need to do a little optimization on your own if you want to squeeze out some extra performance. Luckily, both languages have a c-style switch statement.

Andy King wrote about a slightly altered version of this loop unrolling algorithm which ditches the switch statement and breaks the normal Duff's Device into two separate loops, one for the remainder and one for the unrolling. In Javascript, it performs a simple addition loop in only a third of the time of a normal for loop (testVal++ is the normal loop's interior):

function duffFasterLoop8(iterations) {
  var testVal=0;	
  var n = iterations % 8;
  if (n>0) {
    do 
    {
      testVal++;
    }
    while (--n); // n must be greater than 0 here
  }
    
  n = parseInt(iterations / 8);
  do 
  {
    testVal++;
    testVal++;
    testVal++;
    testVal++;
    testVal++;
    testVal++;
    testVal++;
    testVal++;
  }
  while (--n);
}

It's not as syntactically clever as Duff's Device, but it's a good way to manually unroll an inner loop and get the best performance for your extra effort.

Duff's Device
Andy King's Optimizing JavaScript For Execution Speed


Recent Entries

Comments

Oldest comments listed first.

Posted by: anti on May 21, 2008 at 7:48 AM

I wonder why these stupid things keep coming up every 5 to 10 years.

Ever thought about cache behaviour ?
Or the fact that an "decrease, compare and jump" is basically free ?

Please take a look at the assembler that gets generated from "Duff's Device" and then tell me again that it's a good idea to use it.

Please you young and innocent out there:
Forget that you ever saw this, there is a good reason why it has been buried long time ago.

:)


Posted by: Anonymous on May 21, 2008 at 8:49 AM

function duffFasterLoop8(iterations) {
var testVal=0;

for(n = iterations; n > 8; n -= 8)
{
testVal++;
testVal++;
testVal++;
testVal++;
testVal++;
testVal++;
testVal++;
testVal++;
}

do
{
testVal++;
}
while (--n);

}


Posted by: Anonymous on September 5, 2008 at 3:02 AM

I'm one of those who re-discovered same optimization several years ago. And it works. Benchmarks show real speed improvements. And it probably worked even better in 1983, when C compilers were really bad.
What about code size and cache issues? As long as code completely fits into L1, who cares?


Posted by: Shantanu Goel on February 13, 2009 at 3:10 PM

Nice explanation. People don't understand that micro-optimizations are still useful. Not every compiler in the world is gcc-like. You may have to use different compilers for different projects, different devices and many compilers need hand-holding to produce better results, even gcc performs better if you provide more info to it. Here is another article about getting compilers to produce better code for loops:
http://www.safercode.com/blog/2009/01/27/tweak-your-code-for-speed-unroll-those-loops-part-1.html


Leave a comment


Subscribe to MAKE!Subscribe to MAKE Magazine!

Subscribe today, save 42% and get web access to MAKE free. MAKE Digital Edition is available only to subscribers.

$34.95 / 1 year
(4 Quarterly Issues)

Subscribe now


Void your warranty, violate a user agreement, fry a circuit, blow a fuse, poke an eye out. Make: The risk-takers, the doers, the makers of things... Welcome to Make: Online!


CRAFT Maker Shed Maker Faire MAKE television
MAKE: en EspaƱol MAKE: Japan


Check out all of the episodes of Make: television

Make: Science Room

Connect with MAKE

Be a MAKE fan on Facebook MAKE on Facebook
Visit our Facebook page and become a fan of MAKE!
MAKE on Twitter MAKE on Twitter
Follow our MAKE tweets!
MAKE Flickr Pool MAKE on Flickr
Join our MAKE Flickr Pool!
    make_tips on Twitter

    MAKE's RSS feed is here.
    Add MAKE to iGoogle - GoogleGoogle.
    How to add MAKE to your RSS reader - Real simple.
    Add MAKE on FriendFeed




    Maker SHED

    Advertise here with FM.

    Why advertise on MAKE?
    Read what folks are saying about us!

    Click here to advertise on MAKE!



    Subscribe to MAKE Magazine!

    Make: Online authors!

    Gareth BranwynGareth Branwyn
    Senior Editor


    Phillip TorronePhillip Torrone
    Senior Editor
    | AIM | Twitter


    Becky SternBecky Stern
    Associate Editor
    | AIM | Twitter


    Marc de VinckMarc de Vinck
    Contributing Writer
    | AIM | Twitter


    John ParkJohn Park
    Contributing Writer
    | Twitter


    Sean RaganSean Ragan
    Contributing Writer
    | Twitter


    Matt MetsMatt Mets
    Contributing Writer
    | AIM | Twitter


    Dale DoughertyDale Dougherty
    Editor & Publisher
    | Twitter


    Shawn ConnallyShawn Connally
    Managing Editor
    | Twitter


    Goli MohammadiGoli Mohammadi
    Associate Managing Editor

    Kip KayKip Kay
    Weekend Projects
    | AIM | Twitter


    Collin CunninghamCollin Cunningham
    Contributing Writer
    | AIM | Twitter

    Adam FlahertyAdam Flaherty
    Contributing Writer
    | AIM | Twitter



    More contributors: Mark Frauenfelder (Editor-in-Chief, MAKE magazine), Kipp Bradford (Technical Consultant/Writer), Chris Connors (Education), Diana Eng (Guest Author), Peter Horvath (Intern), Brian Jepson (O'Reilly Media), Robert Bruce Thompson (Science Room)

    Suggest a Site!

    Current Podcast

    itunesdl.gif Weekend Project: Making Char Cloth Learn how to make a cheap and effective fire starter made from an old t-shirt. To download The Char Cloth video click here and subscribe in iTunes. See Char Cloth in action with the Fire Piston from William Gurstelle.... More...

    Get the Make: Online sent via email
    Enter your email to receive Make: Online each day:



    MAKE Fascination video series brought to you by Dow

    Make: Education

    Important please read


    Subscribe to MAKE Magazine!

    Recent Posts from the Craft: Blog