The goal of this project was to gain concrete, hands-on experience with the MLIR ecosystem. It uses an MLIR builder, with lowering to LLVM IR, incorporating a custom dialect (silly) along with several existing MLIR dialects (scf, arith, memref, etc.).
There are two front ends: one using the original ANTLR4 grammar (requiring LLVM to be built without -fno-rtti), and another using an experimental Bison grammar (not yet feature complete, as measured by the test suite).
My interest in MLIR was driven by work with proprietary AST walkers in both Clang and proprietary commercial compilers. In some cases, AST walking infrastructure can be used to extend the compiler itself by adding user-defined semantics tailored to a specific customer base, and is immensely powerful and valuable. Despite Clang's AST walk and rewrite API not allowing for user-defined language extension plugins, having a structured representation of source code that can be queried programmatically is amazing and very useful.
What does that have to do with MLIR? I had seen MLIR in action in a prototype project at a previous job, and that prototype suggested that it was a natural way to avoid hand-coding an AST. An MLIR representation of a program could:
- drive a semantic analysis pass,
- perform code generation,
- enable in-language transformation passes, and
- allow for cross-dialect (cross-language) transformations.
In short, I was intrigued enough to explore MLIR personally, even without a work-related justification, and this project was born.
Initially, I used MLIR to build a simple symbolic calculator that supported double-like variable declarations (DCL, DECLARE -- now removed), assignments, unary and binary arithmetic operations, and output display.
This used a dialect initially called toy, but that dialect was renamed to silly to avoid confusion with the MLIR tutorial toy dialect.
As noted earlier, the primary goal of the project was not calculation itself, but to gain concrete, hands-on experience with the MLIR ecosystem.
That initial implementation has evolved into a silly language and its compiler. There's no good reason to use this language, nor the compiler, but it was fun to build.
The language and compiler now support the following features:
Floating-point types (FLOAT32, FLOAT64):
FLOAT32 pi = 3.14;
FLOAT64 e{2.7182};
FLOAT64 a[3]; // uninitialized array
Integer types (INT8, INT16, INT32, INT64):
INT32 answer = 42;
INT64 bignum{12345};
INT16 values[3]{10, 20, 30}; // initialized array
Boolean type (BOOL):
BOOL flag = TRUE;
BOOL bits[2]{TRUE, FALSE};
PRINT — outputs to stdout:
PRINT "Hello, world!";
PRINT "x = ", x;
PRINT arr[0], ", ", arr[1];
PRINT 3.14 CONTINUE; // suppress trailing newline
ERROR — outputs to stderr:
ERROR "Warning: value out of range";
ERROR "Got: ", invalid_value;
GET — reads from stdin:
GET x; // read into scalar
GET arr[i]; // read into array element
Single-line comments:
// This is a comment
INT32 x = 42; // inline comment
IF/ELIF/ELSE — conditional execution:
IF (x < 0) {
PRINT "negative";
} ELIF (x EQ 0) {
PRINT "zero";
} ELSE {
PRINT "positive";
};
FOR — range-based loop with optional step:
// Basic loop from 1 to 10 (inclusive)
FOR (INT32 i : (1, 10)) {
PRINT i;
};
// With step size
FOR (INT32 j : (0, 100, 5)) {
PRINT j; // prints 0, 5, 10, ..., 100
};
The induction variable is privately scoped to the loop.
Function definition with typed parameters and optional return type:
// Function with return value
FUNCTION add(INT32 a, INT32 b) : INT32 {
RETURN a + b;
};
// Void function
FUNCTION greet(INT32 count) {
FOR (INT32 i : (1, count)) {
PRINT "Hello #", i;
};
RETURN;
};
// Recursion supported
FUNCTION factorial(INT32 n) : INT32 {
INT32 result = 1;
IF (n > 1) {
result = n * CALL factorial(n - 1);
};
RETURN result;
};
Function calls using the CALL keyword:
INT32 sum = CALL add(10, 20);
CALL greet(3);
// CALL can be used in expressions
INT32 x = 1 + CALL add(2, 3) * 4;
Generalized expressions with full operator precedence, parentheses, and unary chaining:
INT32 result = ((a + b) * c - d) / e;
FLOAT64 x = -CALL abs(y) + 3.14;
BOOL valid = (x > 0) AND (x < 100);
Supported operators (by category):
- Arithmetic: +, -, *, /
- Comparison: <, >, <=, >=, EQ, NE
- Logical/Bitwise: AND, OR, XOR, NOT
- Unary: +, -, NOT
All operators support proper precedence and associativity rules.
Assignment operator (=) for scalars and array elements:
x = 42;
arr[0] = x * 2;
y = a + b * c; // expression assignment
Numeric literals (integer and floating-point):
INT32 i = 42;
FLOAT64 pi = 3.14159;
FLOAT32 sci = 2.5E-3;
Boolean literals:
BOOL yes = TRUE;
BOOL no = FALSE;
String literals:
INT8 msg[6] = "Hello"; // STRING is alias for INT8[]
PRINT "Embedded string";
Array support including declaration, initialization, access, and element-wise operations:
INT32 arr[5]; // uninitialized
INT32 init[3]{1,2,3}; // initialized
arr[i] = 42; // assignment
PRINT arr[2]; // element access
EXIT arr[0]; // use in expressions
EXIT — explicit program exit with optional status code:
EXIT; // exit with status 0
EXIT 0; // explicit success
EXIT 1; // exit with error code
EXIT status; // exit with variable value
ABORT — emergency termination with error message:
IF (error_condition) {
ERROR "Fatal error detected";
ABORT; // prints location and aborts
};
DWARF instrumentation sufficient for:
- Line stepping in gdb/lldb
- Breakpoints
- Variable inspection
- Call stack unwinding
Variable modification is likely supported but untested.
There is lots of room to add further language elements to make the compiler and language more interesting. Some ideas for improvements (as well as bug fixes) can be found in the TODO.
- Like scripted languages, there is an implicit main in this silly language. No user-defined main function is allowed.
- Functions can be defined anywhere (including in other sources), but must be declared (i.e.: prototype or IMPORT) if called before the definition.
- The EXIT statement, if specified, currently must be at the end of the program. EXIT without a numeric value is equivalent to EXIT 0, as is a program with no explicit EXIT.
- The RETURN statement must be at the end of a function. It is currently mandatory.
- See the TODO for a long list of nice-to-have features that I haven't gotten around to yet, and may never.
- GET into a BOOL value will abort if the input value is not 0 or 1. This is inconsistent with assignment to a BOOL variable, which will truncate without raising a runtime error.
- The storage requirement of BOOL is currently one byte per element, even for BOOL arrays. Array BOOL values may use a packed bitmask representation in the future.
- Negative FOR loop step sizes currently have implementation defined behaviour. FOR loops have no BREAK nor CONTINUE support.
- The ANTLR4 grammar for the silly language.
- Tablegen definition for the silly MLIR dialect. This is the compiler's internal view of all grammar elements.
- The Compiler driver. This parses and handles command-line options, opens output files, and orchestrates all lower-level actions (parse tree walk + MLIR builder, lowering to LLVM IR, assembly printing, and linker invocation.)
- The ANTLR4 parse tree walker and MLIR builder.
- The LLVM IR lowering classes used to transform silly dialect operators to LLVM-IR.
- Sample programs in Samples/. These serve both as samples, and as test cases.
- A build script that runs both cmake and ninja, setting various options.
- A mlir-opt wrapper, that pre-loads the silly dialect shared object.
- A set of lit tests (tests/dialect/, ...) that are used to unit test the dialect verify functions, driver, syntax, ...
Once built, the compiler driver can be run with build/bin/silly with the following options:
- -g: Show MLIR location info in dumps and lowered LLVM IR
- -O[0123]: Optimization level (standard)
- -c: Compile only, producing .o (don't link)
- -c --emit-mlir: Compile only (to text .mlir) (don't link, or create object)
- -c --emit-mlirbc: Compile only (to binary .mlirbc) (don't link, or create object)
- -c --emit-llvm: Compile only (to text LLVM-IR .ll) (don't link, or create object)
- -c --emit-llvmbc: Compile only (to binary LLVM-IR .bc) (don't link, or create object)
- -o: Name of the object (w/ -c) or executable.
- --init-fill nnn: Set fill character for stack variables (numeric value ≤ 255). Default is zero-initialized.
- --output-directory: Specify output directory for generated files (.mlir, .ll, .o, executable)
- --no-color-errors: If stderr is output to a TTY, error messages will be in color by default. This disables that color output.
- --imports mod1.silly: Experimental IMPORT statement support.

- --emit-llvm: Emit LLVM-IR files (text format)
- --emit-llvmbc: Emit LLVM-IR files (byte code format)
- --emit-mlir: Emit MLIR files (text format)
- --emit-mlirbc: Emit MLIR files (binary format)
- --debug: Enable MLIR debug output (built-in LLVM option)
- -debug-only=silly-driver: Enable driver-specific debug output
- -debug-only=silly-lowering: Enable lowering-specific debug output
- --debug-mlir: Enable MLIR-specific debugging
- --verbose-link: Show the link command. This is implicit if the link fails.
- --keep-temp: Do not delete temporary .o files (and give a message showing the name.)
- --no-verbose-parse-error: The default parse error message is noisy and lists a number of grammar specific tokens, which requires expected test output updates every time the grammar is changed. This option inhibits that volatile output, which is still on by default.
- --no-abort-path: If ABORT is called, omit any path components from the output message. Implemented for test code, so that the output doesn't depend on user specific paths (i.e.: fatal.silly).
silly --output-directory out f.silly -g --emit-llvm --emit-mlir --debug
silly f.silly -O2
silly f.sir -O2
silly -c f.silly -g ; silly -o foo f.o
silly mymain.silly mymodule.silly -g -o bar
silly -c mymain.silly -g ; silly mymain.o mymodule.silly -g -o pgm

The compiler can consume:
- silly sources (with .silly suffix),
- MLIR silly-dialect sources (with .mlir, .sir, or .mlirbc suffixes).
- LLVM-IR sources (with .ll or .bc suffixes).
Specifying a MLIR silly-dialect source means that the compiler will bypass the front end (parser/builder) and go straight to lowering.
Specifying a LLVM-IR source means that the compiler will go straight to the assembly printer.
Module imports (--imports ...) may only pass .silly or MLIR sources (IMPORT processing uses the mlir::func::FuncOp objects from that internal representation to generate prototypes in the caller module.)
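The suffix rules above amount to a small lookup from file suffix to the first pipeline stage that runs. The following Python sketch is illustrative only: the stage names and the functions `entry_stage` and `is_valid_import` are hypothetical, not taken from the driver source, and the `.o` entry is inferred from the `silly -o foo f.o` usage example rather than stated explicitly.

```python
import os

# Hypothetical stage names; the real driver's internal naming may differ.
ENTRY_STAGE = {
    ".silly": "parse",     # front end: parser + MLIR builder
    ".mlir": "lower",      # textual silly-dialect MLIR
    ".sir": "lower",       # textual silly-dialect MLIR
    ".mlirbc": "lower",    # binary silly-dialect MLIR
    ".ll": "codegen",      # textual LLVM-IR, goes to the assembly printer
    ".bc": "codegen",      # binary LLVM-IR
    ".o": "link",          # assumption: object files go straight to the linker
}


def entry_stage(filename: str) -> str:
    """Return the first pipeline stage that applies to an input file."""
    suffix = os.path.splitext(filename)[1]
    if suffix not in ENTRY_STAGE:
        raise ValueError(f"unrecognized input suffix: {filename!r}")
    return ENTRY_STAGE[suffix]


def is_valid_import(filename: str) -> bool:
    """--imports accepts only .silly or MLIR inputs, not LLVM-IR or objects."""
    return entry_stage(filename) in ("parse", "lower")
```

Under these assumptions, `entry_stage("f.sir")` gives `"lower"`, and `is_valid_import("callee.ll")` is false because an LLVM-IR file no longer carries the function objects that IMPORT needs.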
Example:
cd Samples/types
testit -j loadstore.silly
silly-opt --pretty --source out/loadstore.mlir

By default, silly-opt output goes to stdout.
Run silly-opt --help for all available options.
sudo apt-get install libantlr4-runtime-dev antlr4 dwarfdump

This assumes that the ANTLR4 runtime, after installation, is version 4.10.
On WSL2/Ubuntu, the installed runtime version may not match the generator version. Workaround:
wget https://www.antlr.org/download/antlr-4.10-complete.jar

sudo dnf -y install antlr4-runtime antlr4 antlr4-cpp-runtime antlr4-cpp-runtime-devel \
cmake clang-tools-extra g++ ninja cscope clang++ ccache libdwarf-tools

On both Ubuntu and Fedora, I needed a custom build of LLVM/MLIR, as I didn't find a package that included the MLIR TableGen files.
As it turned out, a custom LLVM/MLIR build was also required to specifically enable RTTI, as ANTLR4 uses dynamic_cast<>.
The -fno-rtti flag, required by default to avoid typeinfo symbol link errors, explicitly breaks the ANTLR4 header files.
This could be avoided by separating the ANTLR4 listener class from the MLIR builder, but the MLIR builder effectively
provides an AST, which I don't need to build separately if I construct it directly from the listener.
If you attempt to include an MLIR file with -frtti specified, you'll get link errors like:
undefined reference to `typeinfo for mlir::ConversionPattern'
Conversely, if you attempt to use -fno-rtti and include an ANTLR4 header file, you'll get errors like:
In file included from /usr/include/antlr4-runtime/antlr4-runtime.h:30:
In file included from /usr/include/antlr4-runtime/InterpreterRuleContext.h:8:
In file included from /usr/include/antlr4-runtime/ParserRuleContext.h:9:
/usr/include/antlr4-runtime/support/CPPUtils.h:58:11: error: use of typeid requires -frtti
58 | ss << typeid(o).name() << "@" << std::hex << reinterpret_cast<uintptr_t>(&o);
| ^
as well as specific errors wherever dynamic_cast<> is used. Example:
/home/pjoot/toycalculator/build/src/antlr4Grammar/SillyParser/SillyParser.cpp:1104:25: error: use of dynamic_cast requires -frtti
1104 | auto parserListener = dynamic_cast<SillyListener *>(listener);
| ^
See bin/buildllvm for how I built and deployed the LLVM+MLIR installation used for this project.
It doesn't appear that the llvm-project has any sort of generic lexer/parser. Clang/Flang/etc., look like they each roll their own. Having used ANTLR4 for previous prototyping (also generating C++ listeners), it made sense to use what I knew, but this does not interoperate well with the LLVM ecosystem.
- V5,V6,V7,V8,V9 -- require llvm-project version >= 21.1.0-rc3, and llvm-project version < 22.1
- V10+ -- requires llvm-project 22.1.0+
. ./bin/env
build

The build script (which may or may not also run cscope, doxygen and ctags by default, depending on my mood) currently assumes that I'm the one building it. This build script is probably not sufficiently general for other people to use, and will surely break as I upgrade the systems I build on.
This project has been built in a few (Linux-only) configurations:
- Fedora 42/ARM (running on a UTM VM, hosted on a macbook)
- Fedora 42/X64 (on a dual-boot Windows 11/Linux laptop)
- WSL Ubuntu 24/X64 (same laptop)
- Armbian (Ubuntu), running on a Raspberry Pi.
and currently requires LLVM 22.1.0+. The V9 tag (aka: V0.9.0) was last built on LLVM 21.1.8.
As an experiment, I've implemented an incomplete Bison/Flex front end and grammar, factoring out enough of Antlr4ParseListener.cpp into Builder.cpp so that this front end can handle most operations.
See Docs/TODO.md for some of what is left to make this a first class front end.
I was able to minimize the locations where I required -frtti for ANTLR4, but that still clashes with the LLVM (default) requirement for -fno-rtti in a few key locations (i.e.: the Listener class.) This inspired me to look at other grammar/parser combinations, and to explore just how closely I'd coupled this project with ANTLR4. It turns out that it was possible to logically separate out a lot of the MLIR/silly-dialect specific builder code, thinning out the ANTLR4 parser/builder and making it a much lighter weight entity. This refactoring was desirable, even if I end up throwing out or abandoning the Bison front end.
To build with Bison instead of ANTLR4, configure with cmake -DUSE_BISON_GRAMMAR=1.
The Bison front end passes all the non-debug/syntax-error tests in the testsuite.
Testing is now all llvm-lit based. Examples:
$HOME/build-llvm/bin/llvm-lit -v tests/*/driver-* # Run the driver tests
$HOME/build-llvm/bin/llvm-lit -v -j3 tests/ # Run all the tests

ctest can also be used (from the build/ dir) to run these tests.
The following describes the operations and statements supported by the Silly language, as defined by the Silly.g4 ANTLR4 grammar.
It is intended as a language-level reference rather than a grammar walkthrough.
A Silly main program consists of zero or more statements and comments, optionally followed by an explicit EXIT statement.
Each statement is terminated by a semicolon (;).
Blocks use { ... }, and expressions use parentheses ( ... ).
A Silly program may define supplementary functions in a different source from the Silly main source file.
Such a source may only have comments, FUNCTION declarations, FUNCTION definitions, IMPORT statements, and must use the MODULE keyword.
A non-MODULE source may optionally use the MAIN keyword, but it is implied if MODULE is not specified (provided for symmetry).
Here is an example of a silly main program:
MAIN;
PRINT "Hello Silly World!";
EXIT 0;
and an example of a silly module:
// mymodule.silly
MODULE;
IMPORT foo; // import all the prototypes from foo.silly
// prototypes:
FUNCTION name2(...) ...;
FUNCTION dcl2(...) ...;
// functions (can call each other if ordered appropriately, or prototyped.)
FUNCTION name1(...) ...
{
}
FUNCTION name2(...) ...
{
}
An IMPORT mymodule; statement in some other silly source will insert prototypes for name1 and name2.
When building a multiple source silly program, only one source may omit a MODULE statement.
The expression grammar supports generalized expressions with proper precedence, associativity, and parentheses.
| Precedence | Operators | Associativity | Notes |
|---|---|---|---|
| 1 (highest) | NOT, unary +, unary - | right | Unary operators chain right-to-left |
| 2 | \*, /, % | left | Multiplicative |
| 3 | +, - | left | Additive |
| 4 | <, >, <=, >= | left | Relational (non-associative in practice) |
| 5 | EQ, NE | left | Equality |
| 6 | AND | left | Bitwise/logical AND |
| 7 | XOR | left | Bitwise/logical XOR |
| 8 (lowest) | OR | left | Bitwise/logical OR |
Parentheses override default precedence:
x = (a + b) * c;
// Arithmetic
result = 2 + 3 * 4; // = 14
result = (2 + 3) * 4; // = 20
// Comparisons
valid = x > 0 AND x < 100;
// Unary chaining
y = - - x; // double negation
flag = NOT NOT condition;
// Complex expression
z = (a + b) * (c - d) / e;
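To make the grouping implied by the precedence table concrete, here is a small precedence-climbing parser in Python that turns an expression into nested tuples. It is a sketch under stated assumptions, not the project's ANTLR4 or Bison grammar: the tokenizer handles only integer literals and identifiers, and the names `parse`, `parse_binary`, and `parse_unary` are hypothetical.

```python
import re

# Binary operators and their levels, copied from the table (1 binds tightest).
BINARY_LEVEL = {
    "*": 2, "/": 2, "%": 2,
    "+": 3, "-": 3,
    "<": 4, ">": 4, "<=": 4, ">=": 4,
    "EQ": 5, "NE": 5,
    "AND": 6, "XOR": 7, "OR": 8,
}
UNARY = {"NOT", "+", "-"}
LOWEST = 8
TOKEN = re.compile(r"\s*(<=|>=|[-+*/%<>()]|\d+|[A-Za-z_]\w*)")


def tokenize(text):
    """Split an expression into operator, number, and identifier tokens."""
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    """Return a nested-tuple tree such as ('+', 2, ('*', 3, 4))."""
    tree, rest = parse_binary(tokenize(text), LOWEST)
    if rest:
        raise SyntaxError(f"unexpected token {rest[0]!r}")
    return tree


def parse_binary(tokens, level):
    """Parse left-associative binary operators at this level or tighter."""
    if level == 1:
        return parse_unary(tokens)
    left, tokens = parse_binary(tokens, level - 1)
    while tokens and BINARY_LEVEL.get(tokens[0]) == level:
        op = tokens[0]
        right, tokens = parse_binary(tokens[1:], level - 1)
        left = (op, left, right)
    return left, tokens


def parse_unary(tokens):
    """Parse right-to-left unary chains, parentheses, and operands."""
    if not tokens:
        raise SyntaxError("unexpected end of expression")
    tok = tokens[0]
    if tok in UNARY:
        operand, rest = parse_unary(tokens[1:])
        return (tok, operand), rest
    if tok == "(":
        inner, rest = parse_binary(tokens[1:], LOWEST)
        if not rest or rest[0] != ")":
            raise SyntaxError("missing ')'")
        return inner, rest[1:]
    if tok == ")" or tok in BINARY_LEVEL:
        raise SyntaxError(f"unexpected token {tok!r}")
    return (int(tok) if tok.isdigit() else tok), tokens[1:]
```

For example, `parse("2 + 3 * 4")` groups the multiplication first, while `parse("x > 0 AND x < 100")` places both comparisons under a single AND node, which is why that expression needs no parentheses.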
Variables are declared with an explicit type and an optional initializer or assignment. Array variables may only use the initializer syntax.
TYPE name;
TYPE name = initializer;
TYPE name{initializer};
TYPE name{};
An un-initialized variable gets a default value from the --init-fill command line option.
That value defaults to zero.
Specification of an empty initializer expression results in zero initialization (not the --init-fill value.)
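The difference between the three declaration forms can be modelled for an integer scalar as follows. This Python sketch rests on an assumption that the text above does not confirm: that the --init-fill value is replicated into every byte of the variable's storage. The function name `initial_int_value` and its `form` argument are hypothetical.

```python
def initial_int_value(width_bits: int, form: str,
                      init_fill: int = 0, value: int = 0) -> int:
    """Model the initial value of a signed integer scalar.

    form is one of:
      "none"  : TYPE name;                  (uses the --init-fill byte)
      "empty" : TYPE name{};                (always zero)
      "value" : TYPE name = v; or TYPE name{v};
    """
    if not 0 <= init_fill <= 255:
        raise ValueError("--init-fill must be a numeric value <= 255")
    if form == "none":
        # Assumption: the fill byte is repeated across the whole variable.
        raw = bytes([init_fill]) * (width_bits // 8)
        return int.from_bytes(raw, "little", signed=True)
    if form == "empty":
        return 0  # not affected by --init-fill
    if form == "value":
        return value
    raise ValueError(f"unknown declaration form: {form!r}")
```

With that assumption, an uninitialized INT32 under `--init-fill 255` would read as -1, whereas `INT32 x{};` would still read as 0.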
TYPE name[size];
TYPE name[size]{val1, val2, ...};
- Arrays must have a compile-time constant size
- Initializer lists can be shorter than the array (rest are explicitly zero-initialized, and do not use the --init-fill value.)
- Excess initializers are an error
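The array initializer rules can be sketched in a few lines of Python. This only models the rules as listed, not the compiler's implementation; `initializer_fits` and `expand_initializer` are hypothetical names.

```python
def initializer_fits(size: int, values: list) -> bool:
    """Excess initializers are an error; shorter lists are allowed."""
    return len(values) <= size


def expand_initializer(size: int, values: list) -> list:
    """Return the full element list, zero-filling any missing tail."""
    if not initializer_fits(size, values):
        raise ValueError(
            f"{len(values)} initializers for an array of size {size}")
    # The tail is explicitly zero, independent of --init-fill.
    return list(values) + [0] * (size - len(values))
```

For instance, an initializer list `{1, 2}` for a five-element array expands to `[1, 2, 0, 0, 0]`, and a four-element list for a three-element array is rejected.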
| Type | Description | Width |
|---|---|---|
| BOOL | Boolean | 8 bits |
| INT8 | Signed integer | 8 bits |
| INT16 | Signed integer | 16 bits |
| INT32 | Signed integer | 32 bits |
| INT64 | Signed integer | 64 bits |
| FLOAT32 | Floating-point | 32 bits |
| FLOAT64 | Floating-point | 64 bits |
INT32 x;
FLOAT64 pi = 3.14159;
BOOL flags[10];
INT32 values[3]{1, 2, 3};
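As a quick reference for what the widths in the type table imply, the following Python sketch computes the representable range of each signed integer type and the storage used by an array. It assumes two's-complement integers and tightly packed array elements; the helper names `int_range` and `array_storage_bytes` are hypothetical.

```python
# Widths in bits, as listed in the type table.
WIDTH_BITS = {
    "BOOL": 8,
    "INT8": 8, "INT16": 16, "INT32": 32, "INT64": 64,
    "FLOAT32": 32, "FLOAT64": 64,
}


def int_range(type_name: str) -> tuple:
    """Return (min, max) for a signed integer type (two's complement)."""
    if not type_name.startswith("INT"):
        raise ValueError(f"{type_name} is not a signed integer type")
    bits = WIDTH_BITS[type_name]
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def array_storage_bytes(type_name: str, count: int) -> int:
    """Storage for an array, assuming tightly packed elements."""
    return (WIDTH_BITS[type_name] // 8) * count
```

Under these assumptions, `BOOL flags[10];` occupies 10 bytes (one byte per element) and `INT32 values[3]` occupies 12.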
Assigns a value to a variable or array element.
variable = expression;
array[index] = expression;
Right-hand sides may be:
- Literals
- Variables or array elements
- Unary expressions
- Binary expressions
- Function calls
- Allowed combinations of the above, using parentheses where desired.
42
-7
3.14
2.0E-3
TRUE
FALSE
(Boolean literals may also be represented numerically as 0 or 1.)
"hello world"
Unary operators apply to scalar values or array elements.
| Operator | Meaning |
|---|---|
| + | Unary plus |
| - | Unary negation |
| NOT | Boolean negation |
x = -y;
flag = NOT flag;
Binary arithmetic operators work on numeric operands.
| Operator | Meaning |
|---|---|
| + | Addition |
| - | Subtraction |
| \* | Multiplication |
| / | Division |
| % | Modulus |
x = a + b;
y = x * 3;
Comparison operators produce boolean values.
| Operator | Meaning |
|---|---|
| < | Less than |
| > | Greater than |
| <= | Less than or equal |
| >= | Greater than or equal |
| EQ | Equal |
| NE | Not equal |
IF (x < 10) { PRINT x; };
These operators work across any combination of floating-point and integer types (including BOOL).
Boolean operators may be logical or bitwise depending on operand types.
| Operator | Meaning |
|---|---|
| AND | Boolean AND |
| OR | Boolean OR |
| XOR | Boolean XOR |
flag = a AND b;
Integer bitwise operators (OR, AND, XOR) are applicable only to integer types (including BOOL).
Conditional execution using boolean expressions.
IF (x < 0) {
PRINT "negative";
} ELIF (x EQ 0) {
ERROR "zero";
ABORT;
} ELSE {
PRINT "positive";
};
Range-based iteration with optional step size (must be positive).
Note: There is currently no checking that the step size is positive or non-zero. Use of a negative or zero step has undefined behavior.
FOR (INT32 i : (1, 10)) {
PRINT i;
};
// With expressions for range and step
INT32 a = 0;
INT32 b = -20;
INT32 c = 2;
INT32 z = 0;
FOR (INT32 i : (+a, -b, c + z)) {
PRINT i;
};
Semantically equivalent to:
{
int i = start;
while (i <= end) {
...
i += step;
}
}

Constraints:
- The induction variable name must not be used by any variable in the function
- It cannot shadow any induction variable of the same name in an outer FOR loop
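The inclusive-range behaviour of FOR can be checked with a direct Python transliteration of the equivalent while loop. This is a model of the documented semantics only; the generator name `for_range` is hypothetical. Unlike the compiler, which performs no step checking, the sketch rejects non-positive steps so that it always terminates.

```python
def for_range(start: int, end: int, step: int = 1):
    """Yield each value the induction variable takes: start .. end inclusive."""
    if step <= 0:
        # Sketch-only guard; the compiler does not check the step.
        raise ValueError("this sketch only models positive step sizes")
    i = start
    while i <= end:
        yield i
        i += step
```

So `FOR (INT32 j : (0, 100, 5))` visits 21 values ending at exactly 100, and a range whose start exceeds its end runs the body zero times.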
Defines a function with typed parameters and optional return type.
FUNCTION add(INT32 a, INT32 b) : INT32 {
RETURN a + b;
};
FUNCTION void(INT32 a, INT32 b) {
PRINT a + b;
RETURN;
};
Functions may be declared, and then defined later. Example:
FUNCTION bar ( INT16 w );
CALL bar( 3 );
FUNCTION bar ( INT16 w )
{
PRINT w;
RETURN;
};
Notes:
- Parameters and return types must be scalar types
- A RETURN statement is required and must be the last statement
- Recursion is supported
- Nested functions are not currently supported
Functions are invoked using the CALL keyword.
x = CALL add(2, 3);
y = CALL sum(a[1], a[2]);
CALL expressions can be part of larger expressions:
x = 1 + v * CALL foo();
x = - CALL foo();
Outputs one or more expressions (variables, literals, array elements, expressions) with a trailing newline (unless CONTINUE is specified).
PRINT x, " is ", y;
PRINT 3.14 CONTINUE;
PRINT "Hello: ", v;
PRINT arr[3];
PRINT "hi", s, 40 + 2, ", ", -x, ", ", f[0], ", ", CALL foo();
The ERROR statement is equivalent to PRINT, but outputs to stderr instead of stdout.
ERROR "Unexpected value: " CONTINUE;
ERROR v;
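The relationship between PRINT, ERROR, and CONTINUE can be modelled in Python as below. Two assumptions are made that the text does not state: items are concatenated with no separator (inferred from examples that embed their own spacing, such as "x = "), and numbers are formatted with Python's `str`, which may not match the runtime's actual formatting. The names `render`, `silly_print`, and `silly_error` are hypothetical.

```python
import sys


def render(items, cont: bool = False) -> str:
    """Concatenate the items; append a newline unless CONTINUE was given."""
    text = "".join(str(item) for item in items)
    return text if cont else text + "\n"


def silly_print(*items, cont: bool = False) -> None:
    """Model of PRINT: writes to stdout."""
    sys.stdout.write(render(items, cont))


def silly_error(*items, cont: bool = False) -> None:
    """Model of ERROR: same rendering, but writes to stderr."""
    sys.stderr.write(render(items, cont))
```

In this model the two-statement ERROR example produces a single line on stderr, because the first statement suppresses its newline.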
Reads input into a scalar or array element.
GET x;
GET arr[2];
Explicitly terminates program execution, optionally returning a value.
EXIT;
EXIT 0;
EXIT 39 + 3;
EXIT status;
EXIT arr[0];
Prints a message like <file>:<line>:ERROR: aborting to stderr, then aborts.
ABORT;
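The message format, together with the effect of the --no-abort-path driver option, can be sketched in Python as follows. The function `abort_message` is hypothetical, and whether the compiler records an absolute or relative path by default is not stated here, so the sketch simply echoes whatever path it is given.

```python
import posixpath


def abort_message(path: str, line: int, no_abort_path: bool = False) -> str:
    """Build the '<file>:<line>:ERROR: aborting' text written to stderr."""
    # With --no-abort-path, only the final path component is shown.
    shown = posixpath.basename(path) if no_abort_path else path
    return f"{shown}:{line}:ERROR: aborting"
```

Stripping the directory is what lets test expectations stay stable across machines: the same ABORT reports `fatal.silly:7:ERROR: aborting` regardless of where the source tree lives.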
Single-line comments begin with // and extend to the end of the line.
// This is a comment
The silly language now supports a primitive but functional multi-module system using the IMPORT statement.
- A module file (marked with the MODULE; keyword at the top) can define functions that are imported and called from the main program or from other sources.
- Any source without MODULE is a MAIN. A program can have at most one MAIN. The MAIN keyword is implicit, but provided for symmetry.
- Cross-file function calls were possible before via manual function prototypes, but this was error-prone, as manual prototypes could get out of sync with the definitions. Now, IMPORT automates prototype generation, making the function implementations themselves the "source of truth".
// callee.silly
MODULE;
FUNCTION bar(INT16 w) {
PRINT w;
RETURN;
};
FUNCTION foo() : INT32 {
CALL bar(3);
RETURN 42;
};
// callmod.silly
IMPORT callee;
PRINT "In main";
CALL bar(3);
PRINT "Back in main";
INT32 rc = CALL foo();
PRINT "Back in main again, rc = ", rc;
silly --imports callee.silly callmod.silly -o program
# or with separate compilation steps:
silly -c --emit-mlirbc callee.silly
silly --imports callee.mlirbc callmod.silly -o program

- A MODULE source has only functions, no variables and no "main".
- Every IMPORT module must be specified in a --imports <file> command line option.
- The imported file can be .silly source, textual .mlir, or binary .mlirbc
- IMPORT creates function prototypes automatically from all FUNCTION definitions in the imported module
- No qualified names, namespaces, visibility control (public/private), or cycle detection yet
- No automatic search paths — the exact file must be named on the command line
- Imported modules are compiled in two phases: prototypes first (for name resolution), bodies later (after main code is processed)
- If multiple imported modules define functions with the same name, behavior is currently undefined (may result in last-one-wins resolution, link-time errors, or silent overrides). Qualified names or visibility modifiers are planned for future versions.
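Conceptually, IMPORT derives one prototype per function definition in the imported module. The Python sketch below shows that idea at the source level. It is not how the compiler does it (the driver works from mlir::func::FuncOp objects rather than text), and the `FunctionDef` class, the helper names, and the exact spelling of a prototype with a return type are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FunctionDef:
    """Hypothetical summary of one FUNCTION definition in a module."""
    name: str
    params: List[Tuple[str, str]] = field(default_factory=list)  # (type, name)
    return_type: Optional[str] = None


def prototype(fn: FunctionDef) -> str:
    """Render a declaration; the return-type syntax is an assumption."""
    params = ", ".join(f"{ptype} {pname}" for ptype, pname in fn.params)
    ret = f" : {fn.return_type}" if fn.return_type else ""
    return f"FUNCTION {fn.name}({params}){ret};"


def import_prototypes(module_functions: List[FunctionDef]) -> List[str]:
    """One prototype per definition, in definition order."""
    return [prototype(fn) for fn in module_functions]


# The two functions defined in callee.silly above.
CALLEE = [
    FunctionDef("bar", [("INT16", "w")]),
    FunctionDef("foo", [], "INT32"),
]
```

Applied to `CALLEE`, this yields declarations for bar and foo, which is roughly what a caller would otherwise have had to write by hand before calling into the module.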
Ideas for future refinement include support for multiple imports, filesystem-based module discovery, and more robust dependency handling. See TODO, and module design notes for additional details.
See tests/driver/callmod.silly + callee.silly for a small working demo.
- Variable declaration (implicit-float and explicit types), scalars or arrays
- Scalar and array element assignment
- String variables and literals
- General expressions with standard precedence, associativity, and parentheses
- Boolean logic and comparisons
- Conditional execution (IF/ELIF/ELSE)
- Range-based FOR loops
- Functions and calls (with recursion support)
- Input (GET) and output (PRINT, ERROR)
- Explicit program termination (EXIT, ABORT)
- Module support with IMPORT and MODULE/MAIN statements.