walkaway.software.meta
Name
Subject
Comment
Link URI
File
Captcha captcha
Solution
Actions:
Secret:

brainfuck optimization strategies Mats Linander >>47f25e15e72789964f [Reply]
[delete]

In almost every computation a great variety of arrangements for the 
succession of the processes is possible, and various considerations 
must influence the selection amongst them for the purposes of a 
Calculating Engine. One essential object is to choose that 
arrangement which shall tend to reduce to a minimum the time 
necessary for completing the calculation.” 

  Ada Lovelace, 1843 


Lovelace’s words still ring true, almost two centuries later. One way 
to choose that arrangement is to use an optimizing compiler. In this 
post, we’ll have a look some important compiler optimization 
strategies for the brainfuck programming language and we’ll examine 
how they affect the run time of a handful of non-trivial brainfuck 
programs. The chart below illustrates the impact of the techniques 
covered. 



Prerequisites 

If you’re new to the brainfuck programming language, then head on over 
to the previous post where we cover 
both the basics of the language as well as some of the issues encountered 
when writing a brainfuck-to-java compiler in brainfuck. With that 
said, most of this post should be grokkable even without prior 
experience of the language. 

Programs worth optimizing 

In order to meaningfully evaluate the impact different optimization 
techniques can have, we need a set of brainfuck 
programs that are sufficiently non-trivial for optimization to make 
sense. Unsurprisingly, most brainfuck programs out there are pretty 
darn trivial, but there are some exceptions. In this post we’ll be 
looking at six such programs. 

The interested reader is encouraged to head over to this 
repository for a more 
detailed description of these programs. 

An intermediary representation 

Our first step is to define an intermediary representation (IR) on 
which an optimizer can operate. Working with an IR enables 
optimizations that are platform independent and brings more 
expressiveness than brainfuck itself. To further illustrate how the IR 
functions, we’ll also give C implementations for each operation. 

In the table below we find all 8 brainfuck instructions, their IR 
counterparts and an implementation of the IR in C. This basic mapping 
provides a very simple and somewhat naive way of compiling brainfuck. 



 BF 
 IR 
 C 


 + 
 Add 
 mem[p]++; 


 - 
 Sub 
 mem[p]--; 


 >
 Right 
 p++; 


 <
 Left 
 p--; 


 . 
 Out 
 putchar(mem[p]); 


 , 
 In 
 mem[p] = getchar(); 


 [ 
 Open 
 while(mem[p]) { 


 ] 
 Add 
 } 



The C code depends on the memory area and pointer having been set up 
properly, e.g. like this: 

char mem[65536] = {0}; 
int p = 0; 


At this point the IR is no more expressive than brainfuck itself but 
we will extend it later on. 

Making things faster 

Throughout the remainder of this post we’ll look at how different 
optimizations affect the execution time of our 6 sample programs. The 
brainfuck code is compiled to C code which in turn is compiled with 
gcc 4.8.2. We use optimization level 0 (-O0) to make sure we benefit 
as little as possible from gcc’s optimization engine. Run time is then 
measured as average real time over 10 runs on a Lenovo x240 running 
Ubuntu Trusty. 

Contraction 

Brainfuck code is often riddled with long sequences of +, 
-, < and >. In our naive 
mapping, every single one of these instructions will result in a line 
of C code. Consider for instance the following (somewhat contrived) 
brainfuck snippet: 

+++++[->>>++<<<]>>>. 


This program will output an ascii newline character (ascii 10, 
‘\n’). The first sequence of + stores the number 5 in the current 
cell. After that we have a loop which in each iteration subtracts 1 
from the current cell and adds the number 2 to another cell 3 steps to 
the right. The loop will run 5 times, so the cell at offset 3 will be 
set to 10 (2 times 5). Finally, this cell is printed as output. 

Using our naive mapping to C produces the following code: 

mem[p]++; 
mem[p]++; 
mem[p]++; 
mem[p]++; 
mem[p]++; 
while (mem[p]) { 
    mem[p]--; 
    p++; 
    p++; 
    p++; 
    mem[p]++; 
    mem[p]++; 
    p--; 
    p--; 
    p--; 
} 
p++; 
p++; 
p++; 
putchar(mem[p]); 


We can clearly do better. Let’s extend these four IR operations so 
that they accept a single argument x indicating that the 
operation should be applied x times: 



 IR 
 C 


 Add(x) 
 mem[p] += x; 


 Sub(x) 
 mem[p] -= x; 


 Right(x) 
 p += x; 


 Left(x) 
 p -= x; 


 Out 
 putchar(mem[p]); 


 In 
 mem[p] = getchar(); 


 Open 
 while(mem[p]) { 


 Add 
 } 


 Clear 
 mem[p] = 0; 



Applying this to our snippet produces a much more compact piece of C: 

mem[p] += 5; 
while (mem[p]) { 
    mem[p] -= 1; 
    p += 3; 
    mem[p] += 2; 
    p -= 3; 
} 
p += 3; 
putchar(mem[p]); 


Analyzing the code of our six sample programs is encouraging. For 
instance, about 75% of the instructions in factor.b, hanoi.b and 
mandelbrot.b are contractable, i.e. part of a sequence of at least 2 
immediately adjacent >, +, 
< or -. Lingering around 40% we have
dbfi.b and long.b, while awib-0.4.b is at 60%. Still, we shouldn’t 
stare ourselves blind at these figures; it could be that these 
sequences are only executed rarely. Let’s look at an actual 
measurement of the speedup. 



The impact is impressive overall and especially so for mandelbrot.b 
and factor.b. On the other end of the spectrum we have dbfi.b. All in 
all, contraction appears to be a straightforward and effective 
optimization that all brainfuck compilers should consider 
implementing. 

Clear loops 

A common idiom in brainfuck is the clear loop: [-]. This 
loop subtracts 1 from the current cell and keeps iterating until the 
cell reaches zero. Executed naively, the clear loop’s run time is 
potentially proportional to the maximum value that a cell can hold 
(commonly 255). 

We can do better. Let’s introduce a Clear operation to 
the IR. 



 IR 
 C 


 Add 
 mem[p]++; 


 Sub 
 mem[p]--; 


 Right 
 p++; 


 Left 
 p--; 


 Out 
 putchar(mem[p]); 


 In 
 mem[p] = getchar(); 


 Open 
 while(mem[p]) { 


 Add 
 } 


 Clear 
 mem[p] = 0; 



In addition to compiling all occurrences of [-] to 
Clear, we can also do the same for [+]. This 
works since (all sane implementations of) brainfuck’s memory model 
provides cells that wrap around to 0 when the max value is exceeded. 

Inspecting our sample programs reveals that roughly 8% of the 
instructions in hanoi.b and long.b are part of a clear loop, while the 
corresponding figure for the other programs is around 2%. 



As expected, the impact of the clear loop optimization is modest for 
all but long.b and hanoi.b, but the speedup for these two is 
impressive. Conclusion: the clear loop optimization is simple to 
implement and will in some cases have significant impact. Thumbs up. 

Copy loops 

Another common construct is the copy loop. Consider for instance this 
little snippet: [->+>+<<]. Just like the 
clear loop, the body subtracts 1 from the current cell and iterates 
until it reaches 0. But there’s a side effect. For each iteration the 
body will add 1 to the two cells above the current one, effectively 
clearing the current cell while adding its original value to the 
other two cells. 

Let’s introduce another IR operation called Copy(x) that 
adds a copy of the current cell to the cell at offset x. 



 IR 
 C 


 Add(x) 
 mem[p] += x; 


 Sub(x) 
 mem[p] -= x; 


 Right(x) 
 p += x; 


 Left(x) 
 p -= x; 


 Out 
 putchar(mem[p]); 


 In 
 mem[p] = getchar(); 


 Open 
 while(mem[p]) { 


 Add 
 } 


 Clear 
 mem[p] = 0; 


 Copy(x) 
 mem[p+x] += mem[p]; 



Applying this to our copy loop example ([->+>+<<]) 
results in three IR operations, Copy(1), Copy(2), Clear, 
which in turn compiles into the following C code: 

mem[p+1] += mem[p]; 
mem[p+2] += mem[p]; 
mem[p] = 0; 


This will execute in constant time while the naive implementation of 
the loop would iterate as many times as the value held in 
mem[p]. 



Improvement across the board, except for dbfi.b. 

Multiplication loops 

Continuing in the same vein, copy loops can be generalized into 
multiplication loops. A piece of brainfuck like 
[->+++>+++++++<<] behaves a bit like an 
ordinary copy loop, but introduces a multiplicative factor to the 
copies of the current cell. It could be compiled into this: 

mem[p+1] += mem[p] * 3; 
mem[p+2] += mem[p] * 7; 
mem[p] = 0 


Let’s replace the Copy operation with a more general 
Mul operation and have a look at what it does to our 
sample programs. 



 IR 
 C 


 Add(x) 
 mem[p] += x; 


 Sub(x) 
 mem[p] -= x; 


 Right(x) 
 p += x; 


 Left(x) 
 p -= x; 


 Out 
 putchar(mem[p]); 


 In 
 mem[p] = getchar(); 


 Open 
 while(mem[p]) { 


 Add 
 } 


 Clear 
 mem[p] = 0; 


 Mul(x, y) 
 mem[p+x] += mem[p] * y; 





While most programs benefit slightly from the shift from 
Copy to Mul, long.b improves significantly 
and hanoi.b not at all. The explanation is simple: long.b 
contains no copy loops but one deeply nested multiplication loop; 
hanoi.b has no multiplication loops but many copy loops. 

Scan loops 

We have so far been able to improve the performance of most of our 
sample programs, but dbfi.b remains elusive. Examining the source code 
reveals something interesting: out of dbfi.b’s 429 instructions, a 
whopping 8% can be found in loops like [<] and 
[>]. Contrast that with awib-0.4 being at 0.8% and the 
other programs having no occurrences of this construct at all. These 
pieces of code operate by repeatedly moving the pointer (left or 
right, respectively) until a zero cell is reached. We saw a nice 
example in the 
previous post of how they can be utilized. 

It should be mentioned that while scanning past runs of non-zero cells 
is a common and important use case for these loops, it is not the only 
one. For instance, the same construct can appear in snippets like 
+<[>-]>[>]<, the exact meaning of which is 
left as an exercise for the reader to figure out. With that said, the 
scanning case can be very time consuming when traversing large memory 
areas and is arguably what we should optimize for. 

Fortunately for us, the problem of efficiently searching a memory area 
for occurrences of a particular byte is mostly solved by the C 
standard library’s memchr() function. Scanning 
“backwards” is in turn handled by the GNU extension 
memrchr(). In a nutshell, these functions operate by 
loading full memory words (typically 32 or 64 bits) into a CPU 
register and checking the individual 8-bit components in 
parallel. This proves to be much more efficient than loading and 
inspecting bytes one at a time. 

Let’s extend the IR by adding two new operations: 
ScanLeft and ScanRight. 



 IR 
 C 


 Add(x) 
 mem[p] += x; 


 Sub(x) 
 mem[p] -= x; 


 Right(x) 
 p += x; 


 Left(x) 
 p -= x; 


 Out 
 putchar(mem[p]); 


 In 
 mem[p] = getchar(); 


 Open 
 while(mem[p]) { 


 Add 
 } 


 Clear 
 mem[p] = 0; 


 Mul(x, y) 
 mem[p+x] += mem[p] * y; 


 ScanLeft 
 p -= (long)((void *)(mem + p) - memrchr(mem, 0, p + 1)); 


 ScanRight 
 p += (long)(memchr(mem + p, 0, sizeof(mem)) - (void *)(mem + p)); 



An unfortunate consequence of using array indexes instead of direct 
pointer manipulation is that the C versions of the scan operations 
look pretty horrid. Well, such is life. 



More than a 3x speedup for dbfi.b while the other programs see little 
to no improvement. There is of course no guarantee that 
memchr() will perform as well on other libc 
implementations or on a different architecture, but there are 
efficient and portable memchr() and 
memrchr() implementations out there so these functions 
can easily be bundled by a compiler. 

Operation offsets 

Both the copy loop and multiplication loop optimizations share an 
interesting trait: they perform an arithmetic operation at an offset 
from the current cell. In brainfuck we often find long sequences of 
non-loop operations and these sequences typically contain a fair 
number of < and >. Why waste time 
moving the pointer around? What if we precalculate offsets for the 
non-loop instructions and only update the pointer at the end of these 
sequences? 



 IR 
 C 


 Add(x, off) 
 mem[p+off] += x; 


 Sub(x, off) 
 mem[p+off] -= x; 


 Right(x) 
 p += x; 


 Left(x) 
 p -= x; 


 Out(off) 
 putchar(mem[p+off]); 


 In(off) 
 mem[p+off] = getchar(); 


 Clear(off) 
 mem[p+off] = 0; 


 Mul(x, y, off) 
 mem[p+x+off] += mem[p+off] * y; 


 ScanLeft 
 p -= (long)((void *)(mem + p) - memrchr(mem, 0, p + 1)); 


 ScanRight 
 p += (long)(memchr(mem + p, 0, sizeof(mem)) - (void *)(mem + p)); 



Note that while Clear and Mul are included 
in this table, the test was run with the operation offset optimization 
in isolation, so these two operations were never emitted by the 
compiler. 



Tangible results, albeit somewhat meager. 

Applying all optimizations 

To wrap up, let’s have a look at what we can accomplish by applying all 
these optimizations to our sample programs. 



Significant improvement across the board, ranging from about 130x in 
the case of hanoi.b to the 2.4x speedup of awib-0.4. 

Summary 

Fairly simple techniques proved to significantly improve the 
performance of our sample programs. Brainfuck compiler writers should 
consider implementing at least some of them. Now, spending time and 
effort on such an esoteric language as brainfuck may seem like a 
pointless exercise, but we hope and believe that some of the concepts 
can carry over to more real settings. Also, mucking around with 
brainfuck is just plain fun. 

That was all.