Introduction: BrainF Polyglot Transpiler

BrainF is an esoteric, Turing-complete programming language consisting of only 8 single-character instructions. Programs run by moving a pointer back and forth along a one-dimensional array of cells and manipulating the value stored in the cell currently pointed to. This project is a tiny BrainF transpiler, which reads in a source BrainF file, parses it, runs some optimizations, and then produces an equivalent program in one of five target languages (C, JavaScript, Ada, Fortran90, or Rust). The code is all written in C++, but it is hopefully intelligible to anyone with some programming experience. This guide includes a decent amount of code, though not the full codebase; it is more of an overview of the design and functionality. You can, however, access all the code and documentation for this project at https://github.com/Xavier-Maruff/bft.

A Crash Course in BrainF

The eight instructions used in BrainF are:

  • " < " , Used to decrement the pointer
  • " > " , Used to increment the pointer
  • " + " , Used to increment the value currently pointed to
  • " - " , Used to decrement the value currently pointed to
  • " . " , Output the value currently pointed to to stdout
  • " , " , Read from stdin into the memory address currently pointed to
  • " [ " , Begin loop
  • " ] " , If value pointed to > 0, return to loop start

The following is a "hello world" program written in BrainF:

++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.
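That's really all there is to the language. To make the semantics concrete, here is a minimal reference interpreter (not part of this project - the tape size, cell type, and the assumption of balanced brackets are just arbitrary choices for this sketch):

//minimal BrainF interpreter, for reference only - not part of the bft codebase
//assumes the brackets are balanced and the program stays within the tape
#include <cstdio>
#include <string>
#include <vector>

void run_bf(const std::string& src) {
    std::vector<unsigned char> tape(30000, 0); //30,000-cell tape is an arbitrary choice
    size_t ptr = 0;
    for (size_t ip = 0; ip < src.size(); ip++) {
        switch (src[ip]) {
            case '>': ptr++; break;
            case '<': ptr--; break;
            case '+': tape[ptr]++; break;
            case '-': tape[ptr]--; break;
            case '.': std::putchar(tape[ptr]); break;
            case ',': tape[ptr] = static_cast<unsigned char>(std::getchar()); break; //EOF handling simplified
            case '[': //if the cell is zero, jump forward past the matching ]
                if (tape[ptr] == 0) {
                    for (int depth = 1; depth != 0; ) {
                        ip++;
                        if (src[ip] == '[') depth++;
                        else if (src[ip] == ']') depth--;
                    }
                }
                break;
            case ']': //if the cell is nonzero, jump back to the matching [
                if (tape[ptr] != 0) {
                    for (int depth = 1; depth != 0; ) {
                        ip--;
                        if (src[ip] == ']') depth++;
                        else if (src[ip] == '[') depth--;
                    }
                }
                break;
            default: break; //every other character is treated as a comment
        }
    }
}

int main() {
    //the hello world program from above - prints "Hello World!"
    run_bf("++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.");
    return 0;
}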

Supplies

Environment

This project uses CMake as a build system generator, with Make as the default build system. Additionally, the Program Options component of Boost is used for command-line option parsing, so CMake will need to be able to find Boost for the project to compile. All the build and test scripts are written in shell script, and as such will only work on Linux, macOS, or WSL on Windows. They can, however, be easily ported to Batch for native Windows compatibility (but I would suggest just using WSL, it's a lot less painful).

Summarised dependencies

Build dependencies (required):

  • CMake
  • Make (default, other build systems can be used)
  • Boost
  • A C++ compiler

Testing dependencies (optional):

  • GNAT for Ada tests
  • GCC for C tests
  • GCC for Fortran90 tests
  • NodeJS for JavaScript tests
  • Rustc for Rust tests

Step 1: The Command Line Interface

To create the CLI, we will use the Boost Program Options library, because it is pretty simple and has a nice API. The CLI takes a source file, an output path, a target language, and an optimization level, and we are also including --dump_ir and --verbose options, which will be helpful in debugging. We can create this interface by constructing a boost::program_options::options_description object and adding each option using the options_description::add_options() method. The parser can then be run over the argc and argv supplied to the program, and the result used to populate a map containing the value of each option.

//note: po is an alias declared earlier as: namespace po = boost::program_options;
po::options_description desc("\nBFT - a multi-target optimizing brainf*ck transpiler");
    desc.add_options()
        ("help,h", "show this help message")
        ("verbose,v", "increase verbosity")
        ("dump_ir,d", "dump the ASC IR to the output file (overrides target)")
        ("optimize,O", po::value<int>(), "optimization level (0 - 3)")
        ("output,o", po::value<std::string>(), "output file path")
        ("target,t", po::value<std::string>(), "the target language (c, js, ada, fortran, rust)")
        ("source,s", po::value<std::string>(), "source bf file path")
    ;

    po::positional_options_description pos_desc;
    pos_desc.add("source", -1);

    po::variables_map vm;
    try{
        po::store(po::command_line_parser(argc, argv).options(desc).positional(pos_desc).run(), vm);
        po::notify(vm);
    } catch(po::error& e){
        stdlog.err() << "Invalid command line arguments - " << e.what() << std::endl;
        throw CLI_INVALID_ARGS_ERR;
    }

We can then check that we have the correct arguments, and also store the CLI information in a struct that is more portable than the hefty boost::program_options::variables_map container:

typedef enum compile_target {
    target_js,
    target_c,
    target_rust,
    target_ada,
    target_fortran
} compile_target;

typedef struct cli_args {
    bool verbose;
    bool dump_ir;
    uint8_t optimization_level;
    compile_target target;
    std::string source_filename;
    std::string output_filename;

    cli_args();
    ~cli_args();
} cli_args;

Relevant fields of this struct can then be relayed to various components of our program, like the parser, the logger, and the code generator.
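A rough sketch of what that hand-off could look like (this glue code is not from the project - it relies on the declarations above, and the validation, defaults, and target-string mapping here are just assumptions):

//hypothetical glue code, not taken from the repo - validates the parsed
//options and copies them from the variables_map into the cli_args struct
cli_args parse_cli_args(const po::variables_map& vm) {
    if (vm.count("source") == 0) {
        stdlog.err() << "No source file supplied" << std::endl;
        throw CLI_INVALID_ARGS_ERR;
    }

    //the target-string mapping here is assumed; the real table lives in the repo
    static const std::map<std::string, compile_target> target_map = {
        {"c", target_c}, {"js", target_js}, {"rust", target_rust},
        {"ada", target_ada}, {"fortran", target_fortran}
    };

    cli_args args;
    args.verbose = vm.count("verbose") > 0;
    args.dump_ir = vm.count("dump_ir") > 0;
    args.optimization_level = vm.count("optimize")
        ? static_cast<uint8_t>(vm["optimize"].as<int>()) : 0;
    args.source_filename = vm["source"].as<std::string>();
    if (vm.count("output")) args.output_filename = vm["output"].as<std::string>();

    std::string target_str = vm.count("target") ? vm["target"].as<std::string>() : "c";
    auto target_iter = target_map.find(target_str);
    if (target_iter == target_map.end()) {
        stdlog.err() << "Unknown target language: " << target_str << std::endl;
        throw CLI_INVALID_ARGS_ERR;
    }
    args.target = target_iter->second;
    return args;
}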

Step 2: The Parser

The parser has the job of reading a BrainF source file and converting it to something that the program can understand, an "IR" (intermediate representation). The parser also applies various optimizations to this IR, before piping it through a code generator backend. Because every BrainF instruction is a single character, the lexer (which splits the source code into "tokens") and the parser (which constructs an "abstract syntax tree") are luckily pretty easy to write.

To lex, or tokenize, the source BrainF code, we just need to iterate over every character, and match it with the instruction associated with that character. The following code does just this, constructing a linked list of abstract syntax tree (or AST) nodes which represent our BrainF source code. (Note: you'll notice the AST referred to as "asc" in the code for this project. That's because what we construct is actually a linked list and not a tree, due to BrainF's simplicity, so it's known in the code as an "abstract syntax chain" or ASC)
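The asc_node type itself isn't reproduced in this guide, but based on how it is used in the snippets below it looks roughly like this (the exact definitions in the repo may differ):

//approximate shape of the instruction enum and an ASC node, inferred from usage below
#include <cstddef> //for size_t
#include <memory>  //for std::unique_ptr

//includes the synthetic no_op (root) and zero_assign (optimizer) values;
//the exact names beyond those used in the snippets are assumptions
enum bf_instr {
    no_op, inc_ptr, dec_ptr, inc_val, dec_val,
    loop_start, loop_end, put_char, get_char, zero_assign
};

struct asc_node {
    bf_instr node_type;              //which instruction this node represents
    size_t location;                 //position in the source, for error reporting
    size_t iterations = 1;           //how many times to apply the op (used by the optimizer)
    std::unique_ptr<asc_node> next;  //owning pointer to the next node in the chain

    asc_node(bf_instr type, size_t loc) : node_type(type), location(loc) {}
};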

std::map<char, bf_instr> token_map = {
    {'>', inc_ptr},
    {'<', dec_ptr},
    {'+', inc_val},
    {'-', dec_val},
    {'[', loop_start},
    {']', loop_end},
    {'.', put_char},
    {',', get_char},
};

//tokenizes stream of characters
void parser::tokenize(std::istream* input_stream) {
    stdlog() << "Constructing ASC" << std::endl;

    size_t location = 0;
    root_node = std::make_unique<asc_node>(no_op, location);
    asc_node* next_node = root_node.get();

    //signed, so that a stray ] can push the count below zero
    int unmatched_loop = 0;
    int token;

    while((token = input_stream->get()) != EOF){
        auto bf_instr_iter = token_map.find(token);
        if(bf_instr_iter != token_map.end()) {
            switch(bf_instr_iter->second){
                case loop_start:
                loop_stack.push_back(loop_stack.size());
                unmatched_loop++;
                break;

                case loop_end:
                unmatched_loop--;
                break;

                default:
                break;
            }
            next_node->next = std::make_unique<asc_node>(bf_instr_iter->second, ++location);
            next_node = next_node->next.get();
        }
    }

    if(unmatched_loop < 0) {
        stdlog.err() << ERR_PREFIX << "Unexpected loop ending" << std::endl;
        throw PARSE_ERR;
    }
    else if(unmatched_loop > 0){
        stdlog.err() << ERR_PREFIX << "Unmatched loop start" << std::endl;
        throw PARSE_ERR;
    }
}

We could theoretically just leave it there, and pipe the AST directly through one of our codegen backends, but direct BrainF code is not super efficient, so it's better to run a few optimizations on our AST first.

Step 3: Optimizing the AST

To optimize our BrainF AST, we will be doing all of the following:

  • Squishing repeating operations together
  • Cancelling opposing operations
  • Flattening zero assign loops (explained below)

Contracting repeating operations

If we are given some BrainF code like this: ">>>>>>>>", the naive way to run it is to run the increment pointer operation 8 times. However, this is very inefficient - we could just run one operation which increments the pointer by 8, in one go. To do this, we will iterate over all the nodes in the AST, and if the current node is of the same type as the previous node, squish them both together into one node, with an "iterations" attribute equal to the sum of both nodes' individual "iterations" attributes. This "iterations" attribute can be interpreted by our codegen backends later on.

void parser::contract_repeating_nodes(){
    asc_node* target_node = root_node.get();
    if(!target_node) {
        stdlog.err() << "Null ASC" << std::endl;
        throw OPTIMIZE_ERR;
    }
    asc_node* next_node = target_node->next.get();
    bf_instr target_node_type = target_node->node_type;
    while(next_node){
        if(next_node->node_type == target_node_type){
            target_node->iterations += next_node->iterations;
            target_node->next = std::move(next_node->next);
        }
        else {
            target_node = target_node->next.get();
            if(!target_node) break;
            target_node_type = target_node->node_type;

            while(target_node->node_type == loop_start || target_node->node_type == loop_end){
                target_node = target_node->next.get();
                if(!target_node) break;
                target_node_type = target_node->node_type;
            }
            if(!target_node) break;

        }
        next_node = target_node->next.get();
    }
}

Cancelling opposing operations

Next up, we'll cancel opposing operations. If some BrainF source code has operations like this: "<<>>>+++--", it can be simplified down to ">+", because a lot of the operations cancel each other out. This can also be applied to the AST, by traversing it and finding adjacent nodes which oppose each other, then resolving these into one node.

void parser::cancel_opposing_ops(){
    asc_node* prior_node = root_node.get();
    asc_node* target_node = root_node.get();
    if(!target_node) {
        stdlog.err() << "Null ASC" << std::endl;
        throw OPTIMIZE_ERR;
    }
    asc_node* next_node = target_node->next.get();
    bf_instr target_node_type = target_node->node_type;
    bool opposing_ops = false;
    while(next_node){
        //check if opposing operations (+/-, >/<)
        opposing_ops =
           (target_node->node_type == dec_ptr && next_node->node_type == inc_ptr)
        || (target_node->node_type == inc_ptr && next_node->node_type == dec_ptr)
        || (target_node->node_type == dec_val && next_node->node_type == inc_val)
        || (target_node->node_type == inc_val && next_node->node_type == dec_val);
        if(opposing_ops){
            opposing_ops = false;
            //if the iterations fully cancel, ditch the target and next nodes
            if(target_node->iterations == next_node->iterations){
                if(!next_node->next) break;
                prior_node->next = std::move(next_node->next);
                target_node = prior_node;
                next_node = prior_node->next.get();
            }
            //subtract the iterations
            else if(target_node->iterations > next_node->iterations){
                target_node->iterations -= next_node->iterations;
            }
            //defer to the next node, subtract target iterations
            else {
                target_node->iterations = next_node->iterations-target_node->iterations;
                target_node->location = next_node->location;
                target_node->node_type = next_node->node_type;
            }
            if(!next_node->next) break;
            //ditch the next node
            target_node->next = std::move(next_node->next);
        }

        //move along the asc
        prior_node = target_node;
        target_node = target_node->next.get();
        if(!target_node || !target_node->next) break;
        next_node = target_node->next.get();
    }
}

Flattening zero assign loops

In BrainF, if you want to set a cell's value to 0, you run this code snippet: "[-]". This translates to "While the current cell value is greater than 0, decrement the current cell value". As you can probably tell, this is super inefficient and results in a bunch more comparison, jump, and subtract operations than necessary. To combat this, we can create a new type of AST node which we'll call "zero_assign", which will recognise this pattern and tell the given codegen backend to skip all the malarkey and just set the currently pointed to value to 0.

//reduces three asc nodes to the first node, with a target type
inline bool reduce_instr_triple(asc_node** prior_node, asc_node** current_node, asc_node** next_node, bf_instr reductant){
    (*prior_node)->node_type = reductant;
    //move next_node->next into prior_node->next
    (*prior_node)->next = std::move((*next_node)->next);
    //current node is next_node->next
    (*current_node) = (*prior_node)->next.get();
    if(!(*current_node)) return false;
    //next node is current_node->next
    (*next_node) = (*current_node)->next.get();
    if(!(*next_node)) return false;
    return true;
}

//flattens zero assignment loops, i.e. [-] or [+] -> *bf_ptr = 0
void parser::zero_loop(){
    asc_node* prior_node = root_node.get();
    if(!prior_node) {
        stdlog.err() << "Null ASC" << std::endl;
        throw OPTIMIZE_ERR;
    }
    asc_node* current_node = prior_node->next.get();
    if(!current_node) return;
    asc_node* next_node = current_node->next.get();
    bool end_of_chain = false;
    //jump size will be utilized when zero scan loop optimization is added
    size_t jump_size = 0;
    while(next_node){
        //if chain context matches three node [*] pattern
        if(prior_node->node_type == loop_start && next_node->node_type == loop_end){
            switch(current_node->node_type){
                //reduce chain context to one zero assign instruction
                //(reduce_instr_triple returns false once it runs out of nodes)
                case dec_val:
                case inc_val:
                end_of_chain = !reduce_instr_triple(&prior_node, &current_node, &next_node, zero_assign);
                break;
                
                //not a zero loop
                default:
                prior_node = current_node;
                current_node = next_node;
                next_node = current_node->next.get();
                break;
            }
            if(end_of_chain){
                end_of_chain = false;
                break;
            }
        }
        else {
            prior_node = current_node;
            current_node = next_node;
            next_node = current_node->next.get();
        }
    }
}

Improvement

These optimizations drastically decrease the running time of a compiled BrainF program - in my testing, often by more than 70%!

Step 4: Generating the Code

The codegen backends accept the optimized AST as an input, and output valid code in a target language. In the project I have so far implemented a backend for C, Ada, Rust, Fortran90, and JavaScript. For the sake of time, I'll just go over the C backend, but they are all implemented in the same fashion, as subclasses of an abstract "codegen" class.
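The shared interface is pretty small. Roughly sketched (the method names beyond generate(), and the helpers here, are guesses based on how the C backend below uses them - the real class in the repo will differ in its details):

//rough sketch of the abstract codegen interface - each backend overrides the same hooks
#include <cstddef>
#include <sstream>
#include <string>

struct asc_node; //defined by the parser, as sketched earlier

class codegen {
public:
    virtual ~codegen() = default;

    virtual void generate_header() = 0;        //emit the target-language prologue ("skeleton" header)
    virtual void generate(asc_node* node_) = 0; //emit code for a single ASC node
    virtual void generate_footer() = 0;        //emit the epilogue

    std::string result() const { return code_stream.str(); }

protected:
    //assumed helper behind the code_stream_fmt() calls below:
    //writes the current indentation, then hands back the output stream
    std::ostream& code_stream_fmt() {
        for (size_t i = 0; i < indent_level; i++) code_stream << '\t';
        return code_stream;
    }

    std::ostringstream code_stream;
    size_t indent_level = 1; //starting inside main(), so one level of indentation
};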

The codegen backends all have a "skeleton", a predefined header and footer which sets up all the necessary definitions and declarations. For the C backend, the header includes stdio, and creates the BrainF array and pointer:

#include <stdio.h>

int main(int argc, char* argv[]){
	int bf_array[BF_ARRAY_SIZE] = {0};
	int* bf_ptr = bf_array;

The AST is then traversed, and individual operations are added to the output C code:

void codegen_c::generate(asc_node* node_) {
    switch(node_->node_type){
        case inc_ptr:
        code_stream_fmt() << "bf_ptr += " << node_->iterations << ";\n";
        break;

        case dec_ptr:
        code_stream_fmt() << "bf_ptr -= " << node_->iterations << ";\n";
        break;

        case inc_val:
        code_stream_fmt() << "*bf_ptr += " << node_->iterations << ";\n";
        break;

        case dec_val:
        code_stream_fmt() << "*bf_ptr -= " << node_->iterations << ";\n";
        break;

        case loop_start:
        if(open_loop_stack.size() >= open_loop_stack.max_size()){
            stdlog.err() << "Deque stack passed max size (" << open_loop_stack.max_size() << ")" << std::endl;
            throw MEM_ERR;
        }
        open_loop_stack.push_back(loop_stack.back());
        loop_stack.pop_back();
        code_stream_fmt() << "while(*bf_ptr){\n";
        indent_level++;
        break;

        case loop_end:
        indent_level--;
        code_stream_fmt() << "}\n";
        open_loop_stack.pop_back();
        break;

        case put_char:
        for(size_t i = 0; i < node_->iterations; i++)
            code_stream_fmt() << "putchar(*bf_ptr);\n";
        break;

        case get_char:
        for(size_t i = 0; i < node_->iterations; i++)
            code_stream_fmt() << "*bf_ptr = getchar();\n";
        break;

        case zero_assign:
        code_stream_fmt() << "*bf_ptr = 0;\n";
        break;

        default:
        break;
    }
}

After this, the second part of the skeleton is added to the output C file:

	return 0;
}

And voila! We have a BrainF source file being parsed, optimized, and transpiled to C!
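
To make that concrete, here is roughly what the whole pipeline produces for the tiny input "+++--[-]>><." once all three optimization passes have run (the exact formatting from the real backend may differ slightly):

#include <stdio.h>

int main(int argc, char* argv[]){
	int bf_array[BF_ARRAY_SIZE] = {0};
	int* bf_ptr = bf_array;
	*bf_ptr += 1;
	*bf_ptr = 0;
	bf_ptr += 1;
	putchar(*bf_ptr);
	return 0;
}

The contraction pass merges the repeated runs, the cancellation pass collapses "+++--" to a single increment and ">><" to a single pointer move, and the zero-loop pass turns "[-]" into the direct assignment.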

Step 5: Closing

Writing a BrainF transpiler is a fun, quick way to learn a little bit about how compilers work - the lexing, parsing, optimization, and codegen processes. I highly recommend you download the code for this project from https://github.com/Xavier-Maruff/bft, and have a mess around with it to see what you can come up with. There is a pretty cool Mandelbrot set generating BrainF program written by Erik Bosman which is included in the test suite, and I'd be interested to see what other optimizations can be made :)
