A compiler for ChocoPy.
My main goal is to (nearly) fully implement ChocoPy as defined in the language reference. In addition, I'd like to experiment with some additional features, which are described below.
Build the compiler first:
cargo build --release
Run target/release/cocopy.exe
:
cocopy
A compiler for ChocoPy
USAGE:
cocopy.exe [OPTIONS] <SUBCOMMAND>
OPTIONS:
-h, --help Print help information
-v, --verbose <VERBOSE> [default: 1]
SUBCOMMANDS:
check Check a program for errors
compile Compile a program
help Print this message or the help of the given subcommand(s)
run Compile and run
To check, compile, or run a file, specify it as the first argument:
cocopy run ./examples/simple_add_program.py
The compilation step generates an out/out.exe
file on Windows, assuming you
have a Visual Studio 2022 installation with the C++ workload installed.
Otherwise, you may need tweak and run build_win64.ps1
instead, which will
manually link out.obj
into an executable.
On Linux, the compiled executable is written to out/out
. Note that you must
have nasm
and gcc
installed. As with Windows, you can also use the
build_linux64.sh
script to assemble and link the executable.
Currently, the compiler is capable of compiling basic ChocoPy programs to assembly. The following steps are taken during compilation.
Start here.
a:int = 0
b:int = 9
a = b + 10 * 3
b = a * 100
if b > 10:
print(b)
The lexer emits a token stream indicating the type of each token (keyword,
identifier, symbol, literal, etc..) and its value, if present. It also detects
indentation changes, emitting special indent
and dedent
tokens whenever
the indentation level changes.
"a" ":" "int" "=" "0" "\n"
"b" ":" "int" "=" "9" "\n"
"a" "=" "b" "+" "10" "*" "3" "\n"
"b" "=" "a" "*" "100" "\n"
"if" "b" ">" "10" ":" "\n"
<indent>
"print" "(" "b" ")" "\n"
<dedent>
The parser combines tokens into expressions and statements, which together form an abstract syntax tree. The type checker verifies that all operations evaluate to the correct types.
The syntax tree is represented like so:
program:
var_defs:
- name: a
type_spec: int
value: integer(0)
- name: b
type_spec: int
value: integer(9)
statements:
- assign:
target: identifier(a)
value:
binary:
lhs: identifier(b)
op: add
rhs:
binary:
lhs:
literal: integer(10)
op: multiply
rhs:
literal: integer(3)
- assign:
target: identifier(b)
value:
binary:
lhs: identifier(a)
op: multiply
rhs:
literal: integer(100)
- if:
condition:
binary:
lhs: identifier(b)
op: greater_than
rhs:
literal: integer(10)
body:
statements:
- evaluate:
expression:
function_call:
name: print
args:
- identifier(b)
The IL generator converts the syntax tree into three-address code.
Complex expressions are flattened, generating lots of intermediates (%
).
To prevent variable reuse, variables receive a new subscript (^
)
every time they're reassigned. This will make it easier for the optimiser to
reason about program flow.
function main
a^1 = 0
b^1 = 9
%1 = 10 * 3
%2 = b^1 + %1
a^2 = %2
%3 = a^2 * 100
b^2 = %3
if b^2 <= 10 goto if_end_1
%4 = call print (b^2)
if_end_1:
The optimiser finds statements that can be removed or simplified, and rewrites the intermediate code.
function main
%1 = 10 * 3
%2 = 9 + %1
%3 = %2 * 100
if %3 <= 10 goto if_end_1
call print (%3)
if_end_1:
Each instruction is converted to assembly. Operations may be broken up
further into a store
and apply
instruction when they cannot be converted
to a single assembly instruction. The register allocator assigns registers to
variables, keeping track of where registers are used, and deallocating them
as soon as their value is no longer needed.
main:
push rbp ; store base pointer
; %1 = 10 * 3
mov rax, 10
imul rax, 3
; %2 = 9 + %1
mov rbx, 9
add rbx, rax
; %3 = %2 * 100
mov rcx, rbx
imul rcx, 100
; if %3 <= 10 goto if_end_1
mov rax, 10
cmp rcx, rax
jle .if_end_1
; call print (%3)
push rcx
sub rsp, 8
call print
add rsp, 8
pop rcx
.if_end_1:
xor rax, rax ; return 0
pop rbp ; restore previous base pointer
ret ; return to caller
- Token lexing
- Keywords
- Symbols
- Structure
- Newline
- Tab-based indentation
- Space-based indentation
- Identifiers
- Integer literals
- String literals
- Comments
- Structural normalisation (emitting final newlines and dedents)
- Source code positions
- Literals
- Variables
- Type annotations
- Functions
- Function definition
- Function body
-
global
-
nonlocal
- Classes
- Class body
- Expressions
- Literal
- Identifier
- Unary
- Binary
- Ternary
- Function/method call
- Operator precedence and associativity
- Handle assignment targets differently from binary expressions
- Statements
- Expression evaluation
- Assignment
-
if
- Body
-
else
-
elif
-
while
-
for
-
pass
-
return
- Blocks
- Contextual error reporting
- Split assignment target and expression
- Create AST nodes
- Create assignment target parsing functions
- Literals
- Variables
- Initialisation
- Statements
- Expression evaluation
- Assignment
-
if
- Body
-
else
-
elif
-
while
-
for
-
pass
-
return
- Expressions
- Unary
- Negate
- Not
- Binary
- Arithmetic
- Integer comparison
- Boolean comparison
- Boolean combination
- String operations
-
is
- Ternary
- Function application
- Method call
- Object construction
- List display
- List operators
- Attribute access
- Multiple assignment
- Function application
- Unary
- Function definitions
- Class definitions
- Global environment
- Overload resolution
- Literals
- Statements
- Expression evaluation
- Assignment
-
if
- Special-case comparison statements
- Body
-
else
-
elif
-
while
-
for
-
pass
-
return
- Expressions
- Literal
- Identifier
- Binary
- Procedure call
- Object construction
- List display
- List operators
- Attribute access
- Multiple assignment
- Function definitions
- Class definitions
- Remove unused assignments
- Inline assigned variables when posssible
- Remove φ-functions
- When unifying two names, prefer variables over temporaries
- Proof-of-concept
- Generate assembly code
- Assemble object files with
nasm
- Linking with
gcc
orlink.exe
- Convert intermediate code to assembly
- Integer arithmetic
- Procedure calls
- Procedure return
- Boolean operations
- Goto
- Conditional goto
- Array lookup
- Pointer dereference
- Everything else
- Register allocation
- Naive register allocation
- Spill to stack when registers are exhausted
- Reuse registers after final reference
- Free registers when required by a different operation
- Save and restore registers between function calls
- Everything
- Multiple backends (native, LLVM)
- Using Hindley-Milner type inference to make all type annotations optional
- Good error reporting
- More features?
- My First Language Frontend with LLVM
- Lecture Notes on Static Single Assignment Form
- x64 software conventions (Windows)
- What Every Computer Scientist Should Know About Floating-Point Arithmetic
- What Every Programmer Should Know About Memory
- Understanding Windows x64 Assembly
- x86 instruction reference
- Register Allocation (via graph coloring)