Emacs Plugins in OCaml: Brain-what?! (part 2)

Let’s write the core of our plugin, a brainfuck interpreter. We’ll make it a library so it can be easily compiled into our plugin later, but this step actually doesn’t involve Emacs or Ecaml at all.

If you’re not super interested in the interpreter code and just want to get to the Emacs/Ecaml part, skim the interface briefly and then move on to the next post.

This post is part 2 of a series (prev/next). The full code is available on GitHub.

1 What the brainfuck?

Brainfuck is an esoteric programming language created in 1993 by Urban Müller, and notable for its extreme minimalism.

– Wikipedia

(The rest of this section is also based off of the brainfuck Wikipedia page).

The language has only eight commands, each denoted by a single character. All commands operate on a single, infinite array of zero-initialized bytes. There is also a single data pointer that points to some position in the array, plus an instruction pointer that keeps track of where we are in the program.

1.1 Commands

The commands are shown in the table below. Any characters in the source file other than the 8 commands are treated as comments and ignored.

(“Array element” refers to the element currently pointed to by the data pointer.)

character meaning
> increment the data pointer
< decrement the data pointer
+ increment the array element
- decrement the array element
. print out the array element (byte)
, read into the array element (byte)
[ skip to ] if byte is zero
] jump back to [ if byte is nonzero

Reading and printing are based on bytes. For example, if the data pointer currently points to an array element of (decimal) value 97, and the program executes command ., the byte 0x61 will be printed, and it will show up in your terminal as LATIN SMALL LETTER A.

1.2 Let’s C that again?

If you’re familiar with C, you may find the following equivalence between brainfuck and C helpful.

We’ll assume the following main():

int main() {
  uint8_t memory[QUITE_BIG] = {0};
  uint8_t *ptr = &memory[0];

  // brainfuck code goes here
}
brainfuck C
> ++ptr;
< --ptr;
+ ++*ptr;
- --*ptr;
. putchar(*ptr);
, *ptr = getchar();
[ while (*ptr) {
] }

2 An interpreter

For our simple implementation, we’ll assume that our array has a size of 106 bytes (1MB), and that the data pointer starts on the leftmost side of the array.

Let’s set up our workspace…

mkdir -p ~/ecaml-bf/lib/bf_lang
cd ~/ecaml-bf/lib/bf_lang
touch bf_lang.ml{,i}
vi jbuild

I’m using Jane Street’s Core_kernel standard library overlay since I’m familiar with its idioms, and it’ll already be installed since Ecaml depends on it. Here’s our jbuild:

(jbuild_version 1)

(library
 ((name      bf_lib)
  (synopsis  "An interpreter library for brainfuck.")
  (libraries (core_kernel))))

2.1 The interface

Here’s what our library’s interface will look like:

 1: open! Core_kernel
 2: 
 3: module Input : sig
 4:   (** Stateful input source. *)
 5:   type t
 6: 
 7:   val of_string : string -> t
 8: 
 9:   val of_chan : In_channel.t -> t
10: end
11: 
12: module Output : sig
13:   (** Stateful output sink. *)
14:   type t
15: 
16:   val of_buffer : Buffer.t -> t
17: 
18:   val of_chan : Out_channel.t -> t
19: end
20: 
21: module Memory : sig
22:   (** A big byte array. *)
23:   type t = Bigstring.t
24: 
25:   (** [create_fresh n] creates a zero-initialized memory of size [n] bytes. *)
26:   val create_fresh : int -> t
27: end
28: 
29: module Program : sig
30:   (** An immutable representation of a brainfuck program. *)
31:   type t
32: 
33:   (** "Compile" a program from its source code. *)
34:   val parse : Input.t -> t
35: 
36:   (** [run] mutates [memory], consumes [input], and writes to [output]. *)
37:   val run
38:     :  t
39:     -> memory : Memory.t
40:     -> input  : Input.t
41:     -> output : Output.t
42:     -> unit
43: 
44:   (** A simple interface for [run]. *)
45:   val run'
46:     :  program : string
47:     -> input   : string
48:     -> string                   (* output *)
49: end

Input.t and Output.t represent possible sources of input and output for our programs (and for the source of the program text). The of_* functions convert other types into things our program can read from. For example, the return value of Input.of_string s returns a value that keeps track of how far into the string we’ve read.

Memory.t represents the working memory of the programs, i.e., a very large byte array.

Finally, Program, the most exciting module of our interpreter, can be used in one of two ways:

  1. run, the lower-level interface, gives the user full control over the input and output source of the program. It also allows the user to parse the program just once and reuse the result multiple times.
  2. run', a simplified interface, simply accepts strings as input and returns a string as output.

2.2 Memory

Memory needs to be a large array of bytes. While this sounds like a job for type bytes, we’ll actually use Bigstring.t, which doesn’t have the same size limitations1 as bytes.

Bigstring.t is a type alias for (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t from the bigarray library. You can think of it as basically being a pointer to a big contiguous area of malloc()’d memory.

We’ll also add a new function for creating a zero-initialized memory of a given length. Bigstring.create by itself just allocates the memory, but doesn’t initialize it, so it could have arbitrary contents.

35: module Memory = struct
36:   type t = Bigstring.t
37: 
38:   let create_fresh n =
39:     let memory = Bigstring.create n in
40:     Bigarray.Array1.fill memory '\000';
41:     memory
42:   ;;
43: end

2.3 Input and output

Since we want our library to be fairly general-purpose, we’ll accept input and produce output to a generic interface, with adapters for common input/output sources/sinks such as in_channel or Buffer.t.

Input will be a closure that returns a char option (it returns None at the end of input). We’ll also define an iter function that repeatedly calls its argument with the next character from the input until the input is exhausted.

 3: module Input = struct
 4:   (* [t]'s return None at the end of input. *)
 5:   type t = unit -> char option
 6: 
 7:   let of_string str =
 8:     let i = ref 0 in
 9:     fun () ->
10:       if !i >= String.length str
11:       then None
12:       else (
13:         let char = str.[ !i ] in
14:         incr i;
15:         Some char)
16:   ;;
17: 
18:   let of_chan chan = fun () -> In_channel.input_char chan
19: 
20:   let rec iter t ~f =
21:     match t () with
22:     | None   -> ()
23:     | Some c -> f c; iter t ~f
24:   ;;
25: end

Output will be represented as a function that accepts a char. It’s a bit simpler since we don’t have to keep track of any state (Buffer and Out_channel do it for us).

27: module Output = struct
28:   type t = char -> unit
29: 
30:   let of_buffer buf = fun char -> Buffer.add_char buf char
31: 
32:   let of_chan chan = Pervasives.output_char chan
33: end

(of_buffer could have been written as simply Buffer.add_char, and similarly for of_chan, but I wrote them out all the way2 for clarity.)

2.4 Lexer

Writing the lexer is straightforward. We can simply scan each character and produce either a command token or None to ignore a non-command character.

46: module Command = struct
47:   type t =
48:     | Left
49:     | Right
50:     | Decrement
51:     | Increment
52:     | Input
53:     | Output
54:     | Loop_begin
55:     | Loop_end
56: 
57:   let of_char = function
58:     | '<' -> Some Left
59:     | '>' -> Some Right
60:     | '-' -> Some Decrement
61:     | '+' -> Some Increment
62:     | ',' -> Some Input
63:     | '.' -> Some Output
64:     | '[' -> Some Loop_begin
65:     | ']' -> Some Loop_end
66:     | _   -> None
67:   ;;
68: end

2.5 Program representation

A brainfuck program is simply a sequence of commands. Since the looping commands require the interpreter to jump to the matching bracket, we’ll record the positions of corresponding left and right brackets in a jump table. This is a hash table that maps the position of a bracket to the position of its partner, and vice versa.

70: type t =
71:   { commands   : Command.t array   (* just the sequence of commands *)
72:   ; jump_table : int Int.Table.t } (* maps matching brackets' positions to
73:                                       each others' *)

2.6 Parser

The only job of the parser is to determine the positions of matching pairs of brackets. It should croak if it finds unmatched bracket pairs in the program text. We’ll define two functions to do that:

75: let extra_left = invalid_argf "unmatched left bracket at command index %d"
76: let extra_right = invalid_argf "unmatched right bracket at command index %d"

invalid_argf is a printf-like function; in this case, extra_left and extra_right have types int -> unit -> 'a, i.e., they accept the current position and a unit value, raising an exception with the location of a syntax error. (“Command index” means the index of the character in program text representing the unmatched bracket in question, if we ignore all non-command characters.)

Here’s the whole parser:

 78: let parse source_code =
 79:   (* a stack *)
 80:   let open_brackets = ref [] in
 81:   let commands      = ref [] in
 82:   let current_pos   = ref 0  in
 83:   let jump_table = Int.Table.create () in
 84:   let add_command cmd =
 85:     commands := cmd :: !commands;
 86:     incr current_pos
 87:   in
 88:   let push_left_bracket () =
 89:     open_brackets := !current_pos :: !open_brackets;
 90:     add_command Command.Loop_begin
 91:   in
 92:   let pop_left_bracket () =
 93:     match !open_brackets with
 94:     | []     -> extra_right !current_pos ()
 95:     | hd::tl ->
 96:       Hashtbl.add_exn jump_table ~key:!current_pos ~data:hd;
 97:       Hashtbl.add_exn jump_table ~key:hd ~data:!current_pos;
 98:       open_brackets := tl;
 99:       add_command Command.Loop_end
100:   in
101:   Input.iter source_code ~f:(fun char ->
102:     match Command.of_char char with
103:     (* ignore non-command characters *)
104:     | None -> ()
105: 
106:     (* regular commands just get appended *)
107:     | Some (Left | Right | Decrement | Increment | Input | Output as cmd)
108:       -> add_command cmd
109: 
110:     (* record loop beginning positions *)
111:     | Some Loop_begin -> push_left_bracket ()
112: 
113:     (* record entries in jump table *)
114:     | Some Loop_end -> pop_left_bracket ());
115: 
116:   (* check for any remaining open left brackets *)
117:   (match !open_brackets with
118:    | [] -> ()
119:    | hd :: _ -> extra_left hd ());
120:   { commands = Array.of_list_rev !commands
121:   ; jump_table }
122: ;;

Most of the code is fairly straightforward; most of the code is dedicated to matching bracket pairs, and other commands are just passed straight through.

2.7 Runner

The runner steps through the program text, executing commands on the given memory, input, and output. It basically simulates the abstract state machine that brainfuck code is meant to run on.

124: let run t ~memory ~input ~output =
125:   (* get/set bytes in memory; bytes are unsigned, wrap around on overflow *)
126:   let get pos   = Bigstring.get_uint8 memory ~pos           in
127:   let set pos x = Bigstring.set_uint8 memory ~pos (x % 256) in
128: 
129:   (* we can actually always jump one past the matching bracket *)
130:   let jump pc = 1 + Hashtbl.find_exn t.jump_table pc in
131: 
132:   (* pc  :: program counter
133:      pos :: data pointer *)
134:   let rec loop ~pc ~pos =
135:     match t.commands.( pc ) with
136:     | Left  -> loop ~pc:(pc + 1) ~pos:(pos - 1)
137:     | Right -> loop ~pc:(pc + 1) ~pos:(pos + 1)
138: 
139:     (* parenthesized for clarity *)
140:     | Decrement -> set pos ((get pos) - 1); loop ~pc:(pc + 1) ~pos
141:     | Increment -> set pos ((get pos) + 1); loop ~pc:(pc + 1) ~pos
142: 
143:     | Input ->
144:       (match input () with
145:        | None   -> ()         (* do nothing if we run [,] at EOF *)
146:        | Some c -> set pos (Char.to_int c));
147:       loop ~pc:(pc + 1) ~pos
148: 
149:     | Output ->
150:       output (Char.of_int_exn (get pos));
151:       loop ~pc:(pc + 1) ~pos
152: 
153:     | Loop_begin ->
154:       if get pos = 0
155:       then (loop ~pc:(jump pc) ~pos)
156:       else (loop ~pc:(pc + 1)  ~pos)
157: 
158:     | Loop_end ->
159:       if get pos <> 0
160:       then (loop ~pc:(jump pc) ~pos)
161:       else (loop ~pc:(pc + 1)  ~pos)
162: 
163:     | exception _ -> ()       (* reached end of commands, so we terminate *)
164:   in
165:   loop ~pc:0 ~pos:0
166: ;;

A couple decisions we’ve made above about edge cases:

  1. We represent bytes as unsigned. This doesn’t make that much of a difference, since programs can’t observe that fact, given that…
  2. Increment and decrement operations that would underflow or overflow an unsigned byte wrap around instead, modulo 256.
  3. If we encounter the , (input) command when we have no remaining input, we simply ignore it (i.e., do nothing). Other common choices including setting the current cell to 0 or -1.

2.8 Wrapper

We also define a simplified interface for running programs. This’ll be less awkward to use for simple cases. create_memory allocates an array of 106 bytes and fills the buffer with 0x00’s. run' is simply a wrapper around run that creates a fresh memory and accepts and returns everything as string’s.

69: let create_memory () =
70:   let memory_size = 1_000_000 in
71:   let memory = Bigstring.create memory_size in
72:   Bigarray.Array1.fill memory '\000';
73:   memory
74: ;;
168: let run' ~program ~input =
169:   let t = parse (Input.of_string program) in
170:   let buffer = Buffer.create 16 in
171:   let memory = create_memory () in
172:   let input = Input.of_string input in
173:   let output = Output.of_buffer buffer in
174:   run t ~memory ~input ~output;
175:   Buffer.contents buffer
176: ;;

3 Trial run

Let’s try out our interpreter. jbuilder has an awesome utop command which starts a toplevel3 with access to the library we’ve just built (--dev enables stricter compilation flags).

# chdir to project root
jbuilder utop lib/bf_lib --dev
utop # Bf_lib.Program.run' ~input:"" ~program:"++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.";;
- : string = "Hello World!\n"

Awesome. We have a working brainfuck interpreter library! See Wikipedia for more example programs to try out.

4 To be continued

In the next post, we’ll see how to use Ecaml’s API to interact with various components of Emacs, such as buffers, files, key bindings, and more. Stay tuned!

Footnotes:

1

The maximum length of a string or bytes is given in OCaml by Sys.max_string_length. On 64-bit systems, it’s 144115188075855863, which is surely big enough, but on 32-bit systems it’s 16777211 (about 16 MB), and we might want to give a program a bigger memory than that in the future.

2

Or, if you prefer, eta-expanded.

3

OCaml’s term for a REPL (Read-Eval-Print Loop), i.e., interactive prompt. Similar to python or irb.

Share

Comments