Emacs Plugins in OCaml: Brain-what?! (part 2)
Published: Last updated:· Tags: ecaml emacs ocaml · Category: ecaml-getting-started
This post is part 2 of a series (prev/next). The full code is available on GitHub.
Let’s write the core of our plugin, a brainfuck interpreter. We’ll make it a library so it can be easily compiled into our plugin later, but this step actually doesn’t involve Emacs or Ecaml at all.
If you’re not super interested in the interpreter code and just want to get to the Emacs/Ecaml part, skim the interface briefly and then move on to the next post.
Table of Contents
1 What the brainfuck?
Brainfuck is an esoteric programming language created in 1993 by Urban Müller, and notable for its extreme minimalism.
– Wikipedia
(The rest of this section is also based off of the brainfuck Wikipedia page).
The language has only eight commands, each denoted by a single character. All commands operate on a single, infinite array of zero-initialized bytes. There is also a single data pointer that points to some position in the array, plus an instruction pointer that keeps track of where we are in the program.
1.1 Commands
The commands are shown in the table below. Any characters in the source file other than the 8 commands are treated as comments and ignored.
(“Array element” refers to the element currently pointed to by the data pointer.)
| character | meaning |
|---|---|
> |
increment the data pointer |
< |
decrement the data pointer |
+ |
increment the array element |
- |
decrement the array element |
. |
print out the array element (byte) |
, |
read into the array element (byte) |
[ |
skip to ] if byte is zero |
] |
jump back to [ if byte is nonzero |
Reading and printing are based on bytes. For example, if the data pointer
currently points to an array element of (decimal) value 97, and the program
executes command ., the byte 0x61 will be printed, and it will show up in
your terminal as LATIN SMALL LETTER A.
1.2 Let’s C that again?
If you’re familiar with C, you may find the following equivalence between brainfuck and C helpful.
We’ll assume the following main():
int main() { uint8_t memory[QUITE_BIG] = {0}; uint8_t *ptr = &memory[0]; // brainfuck code goes here }
| brainfuck | C |
|---|---|
> |
++ptr; |
< |
--ptr; |
+ |
++*ptr; |
- |
--*ptr; |
. |
putchar(*ptr); |
, |
*ptr = getchar(); |
[ |
while (*ptr) { |
] |
} |
2 An interpreter
For our simple implementation, we’ll assume that our array has a size of 106 bytes (1MB), and that the data pointer starts on the leftmost side of the array.
Let’s set up our workspace…
mkdir -p ~/ecaml-bf/lib/bf_lang cd ~/ecaml-bf/lib/bf_lang touch bf_lang.ml{,i} vi jbuild
I’m using Jane Street’s Core_kernel standard library overlay since I’m
familiar with its idioms, and it’ll already be installed since Ecaml depends
on it. Here’s our jbuild:
(jbuild_version 1) (library ((name bf_lib) (synopsis "An interpreter library for brainfuck.") (libraries (core_kernel))))
2.1 The interface
Here’s what our library’s interface will look like:
1: open! Core_kernel 2: 3: module Input : sig 4: (** Stateful input source. *) 5: type t 6: 7: val of_string : string -> t 8: 9: val of_chan : In_channel.t -> t 10: end 11: 12: module Output : sig 13: (** Stateful output sink. *) 14: type t 15: 16: val of_buffer : Buffer.t -> t 17: 18: val of_chan : Out_channel.t -> t 19: end 20: 21: module Memory : sig 22: (** A big byte array. *) 23: type t = Bigstring.t 24: 25: (** [create_fresh n] creates a zero-initialized memory of size [n] bytes. *) 26: val create_fresh : int -> t 27: end 28: 29: module Program : sig 30: (** An immutable representation of a brainfuck program. *) 31: type t 32: 33: (** "Compile" a program from its source code. *) 34: val parse : Input.t -> t 35: 36: (** [run] mutates [memory], consumes [input], and writes to [output]. *) 37: val run 38: : t 39: -> memory : Memory.t 40: -> input : Input.t 41: -> output : Output.t 42: -> unit 43: 44: (** A simple interface for [run]. *) 45: val run' 46: : program : string 47: -> input : string 48: -> string (* output *) 49: end
Input.t and Output.t represent possible sources of input and output for
our programs (and for the source of the program text). The of_* functions
convert other types into things our program can read from. For example, the
return value of Input.of_string s returns a value that keeps track of how
far into the string we’ve read.
Memory.t represents the working memory of the programs, i.e., a very large
byte array.
Finally, Program, the most exciting module of our interpreter, can be used
in one of two ways:
run, the lower-level interface, gives the user full control over the input and output source of the program. It also allows the user to parse the program just once and reuse the result multiple times.run', a simplified interface, simply accepts strings as input and returns a string as output.
2.2 Memory
Memory needs to be a large array of bytes. While this sounds like a job for
type bytes, we’ll actually use Bigstring.t, which doesn’t have the same
size limitations1 as bytes.
Bigstring.t is a type alias for (char, Bigarray.int8_unsigned_elt,
Bigarray.c_layout) Bigarray.Array1.t from the bigarray library. You can
think of it as basically being a pointer to a big contiguous area of
malloc()’d memory.
We’ll also add a new function for creating a zero-initialized memory of a
given length. Bigstring.create by itself just allocates the memory, but
doesn’t initialize it, so it could have arbitrary contents.
35: module Memory = struct 36: type t = Bigstring.t 37: 38: let create_fresh n = 39: let memory = Bigstring.create n in 40: Bigarray.Array1.fill memory '\000'; 41: memory 42: ;; 43: end
2.3 Input and output
Since we want our library to be fairly general-purpose, we’ll accept input
and produce output to a generic interface, with adapters for common
input/output sources/sinks such as in_channel or Buffer.t.
Input will be a closure that returns a char option (it returns None at
the end of input). We’ll also define an iter function that repeatedly calls
its argument with the next character from the input until the input is
exhausted.
3: module Input = struct 4: (* [t]'s return None at the end of input. *) 5: type t = unit -> char option 6: 7: let of_string str = 8: let i = ref 0 in 9: fun () -> 10: if !i >= String.length str 11: then None 12: else ( 13: let char = str.[ !i ] in 14: incr i; 15: Some char) 16: ;; 17: 18: let of_chan chan = fun () -> In_channel.input_char chan 19: 20: let rec iter t ~f = 21: match t () with 22: | None -> () 23: | Some c -> f c; iter t ~f 24: ;; 25: end
Output will be represented as a function that accepts a char. It’s a bit
simpler since we don’t have to keep track of any state (Buffer and
Out_channel do it for us).
27: module Output = struct 28: type t = char -> unit 29: 30: let of_buffer buf = fun char -> Buffer.add_char buf char 31: 32: let of_chan chan = Pervasives.output_char chan 33: end
(of_buffer could have been written as simply Buffer.add_char, and
similarly for of_chan, but I wrote them out all the way2 for clarity.)
2.4 Lexer
Writing the lexer is straightforward. We can simply scan each character and
produce either a command token or None to ignore a non-command character.
46: module Command = struct 47: type t = 48: | Left 49: | Right 50: | Decrement 51: | Increment 52: | Input 53: | Output 54: | Loop_begin 55: | Loop_end 56: 57: let of_char = function 58: | '<' -> Some Left 59: | '>' -> Some Right 60: | '-' -> Some Decrement 61: | '+' -> Some Increment 62: | ',' -> Some Input 63: | '.' -> Some Output 64: | '[' -> Some Loop_begin 65: | ']' -> Some Loop_end 66: | _ -> None 67: ;; 68: end
2.5 Program representation
A brainfuck program is simply a sequence of commands. Since the looping commands require the interpreter to jump to the matching bracket, we’ll record the positions of corresponding left and right brackets in a jump table. This is a hash table that maps the position of a bracket to the position of its partner, and vice versa.
70: type t = 71: { commands : Command.t array (* just the sequence of commands *) 72: ; jump_table : int Int.Table.t } (* maps matching brackets' positions to 73: each others' *)
2.6 Parser
The only job of the parser is to determine the positions of matching pairs of brackets. It should croak if it finds unmatched bracket pairs in the program text. We’ll define two functions to do that:
75: let extra_left = invalid_argf "unmatched left bracket at command index %d" 76: let extra_right = invalid_argf "unmatched right bracket at command index %d"
invalid_argf is a printf-like function; in this case, extra_left and
extra_right have types int -> unit -> 'a, i.e., they accept the current
position and a unit value, raising an exception with the location of a syntax
error. (“Command index” means the index of the character in program text
representing the unmatched bracket in question, if we ignore all non-command
characters.)
Here’s the whole parser:
78: let parse source_code = 79: (* a stack *) 80: let open_brackets = ref [] in 81: let commands = ref [] in 82: let current_pos = ref 0 in 83: let jump_table = Int.Table.create () in 84: let add_command cmd = 85: commands := cmd :: !commands; 86: incr current_pos 87: in 88: let push_left_bracket () = 89: open_brackets := !current_pos :: !open_brackets; 90: add_command Command.Loop_begin 91: in 92: let pop_left_bracket () = 93: match !open_brackets with 94: | [] -> extra_right !current_pos () 95: | hd::tl -> 96: Hashtbl.add_exn jump_table ~key:!current_pos ~data:hd; 97: Hashtbl.add_exn jump_table ~key:hd ~data:!current_pos; 98: open_brackets := tl; 99: add_command Command.Loop_end 100: in 101: Input.iter source_code ~f:(fun char -> 102: match Command.of_char char with 103: (* ignore non-command characters *) 104: | None -> () 105: 106: (* regular commands just get appended *) 107: | Some (Left | Right | Decrement | Increment | Input | Output as cmd) 108: -> add_command cmd 109: 110: (* record loop beginning positions *) 111: | Some Loop_begin -> push_left_bracket () 112: 113: (* record entries in jump table *) 114: | Some Loop_end -> pop_left_bracket ()); 115: 116: (* check for any remaining open left brackets *) 117: (match !open_brackets with 118: | [] -> () 119: | hd :: _ -> extra_left hd ()); 120: { commands = Array.of_list_rev !commands 121: ; jump_table } 122: ;;
Most of the code is fairly straightforward; most of the code is dedicated to matching bracket pairs, and other commands are just passed straight through.
2.7 Runner
The runner steps through the program text, executing commands on the given memory, input, and output. It basically simulates the abstract state machine that brainfuck code is meant to run on.
124: let run t ~memory ~input ~output = 125: (* get/set bytes in memory; bytes are unsigned, wrap around on overflow *) 126: let get pos = Bigstring.get_uint8 memory ~pos in 127: let set pos x = Bigstring.set_uint8 memory ~pos (x % 256) in 128: 129: (* we can actually always jump one past the matching bracket *) 130: let jump pc = 1 + Hashtbl.find_exn t.jump_table pc in 131: 132: (* pc :: program counter 133: pos :: data pointer *) 134: let rec loop ~pc ~pos = 135: match t.commands.( pc ) with 136: | Left -> loop ~pc:(pc + 1) ~pos:(pos - 1) 137: | Right -> loop ~pc:(pc + 1) ~pos:(pos + 1) 138: 139: (* parenthesized for clarity *) 140: | Decrement -> set pos ((get pos) - 1); loop ~pc:(pc + 1) ~pos 141: | Increment -> set pos ((get pos) + 1); loop ~pc:(pc + 1) ~pos 142: 143: | Input -> 144: (match input () with 145: | None -> () (* do nothing if we run [,] at EOF *) 146: | Some c -> set pos (Char.to_int c)); 147: loop ~pc:(pc + 1) ~pos 148: 149: | Output -> 150: output (Char.of_int_exn (get pos)); 151: loop ~pc:(pc + 1) ~pos 152: 153: | Loop_begin -> 154: if get pos = 0 155: then (loop ~pc:(jump pc) ~pos) 156: else (loop ~pc:(pc + 1) ~pos) 157: 158: | Loop_end -> 159: if get pos <> 0 160: then (loop ~pc:(jump pc) ~pos) 161: else (loop ~pc:(pc + 1) ~pos) 162: 163: | exception _ -> () (* reached end of commands, so we terminate *) 164: in 165: loop ~pc:0 ~pos:0 166: ;;
A couple decisions we’ve made above about edge cases:
- We represent bytes as unsigned. This doesn’t make that much of a difference, since programs can’t observe that fact, given that…
- Increment and decrement operations that would underflow or overflow an unsigned byte wrap around instead, modulo 256.
- If we encounter the
,(input) command when we have no remaining input, we simply ignore it (i.e., do nothing). Other common choices including setting the current cell to0or-1.
2.8 Wrapper
We also define a simplified interface for running programs. This’ll be less
awkward to use for simple cases. create_memory allocates an array of 106
bytes and fills the buffer with 0x00’s. run' is simply a wrapper around
run that creates a fresh memory and accepts and returns everything as
string’s.
69: let create_memory () = 70: let memory_size = 1_000_000 in 71: let memory = Bigstring.create memory_size in 72: Bigarray.Array1.fill memory '\000'; 73: memory 74: ;;
168: let run' ~program ~input = 169: let t = parse (Input.of_string program) in 170: let buffer = Buffer.create 16 in 171: let memory = create_memory () in 172: let input = Input.of_string input in 173: let output = Output.of_buffer buffer in 174: run t ~memory ~input ~output; 175: Buffer.contents buffer 176: ;;
3 Trial run
Let’s try out our interpreter. jbuilder has an awesome utop command which
starts a toplevel3 with access to the library we’ve just built (--dev
enables stricter compilation flags).
# chdir to project root jbuilder utop lib/bf_lib --dev
utop # Bf_lib.Program.run' ~input:"" ~program:"++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.";; - : string = "Hello World!\n"
Awesome. We have a working brainfuck interpreter library! See Wikipedia for more example programs to try out.
4 To be continued
In the next post, we’ll see how to use Ecaml’s API to interact with various components of Emacs, such as buffers, files, key bindings, and more. Stay tuned!
Footnotes:
The maximum length of a string or bytes is given in OCaml by
Sys.max_string_length. On 64-bit systems, it’s 144115188075855863, which is
surely big enough, but on 32-bit systems it’s 16777211 (about 16 MB), and we
might want to give a program a bigger memory than that in the future.
Or, if you prefer, eta-expanded.
OCaml’s term for a REPL (Read-Eval-Print Loop), i.e., interactive prompt.
Similar to python or irb.