21st November 2018

Using ANTLR to parse and calculate expressions (Part I)

In an upcoming series of blog posts, we are going to talk about how we have developed and integrated a simple Domain Specific Language using ANTLR 4 with some of our Visual Studio projects on .NET and C#. We will also show how we’ve used the generated code to evaluate expressions at runtime for various mathematical calculations.

In this first entry, we’ll discuss the initial steps, such as creating a grammar, visualizing the parsed expression tree and moving on to code generation as well as its inclusion in the .NET projects. We’ll take advantage of some of the features in the new C# project structure (csproj) that comes with Visual Studio 2017 to ensure that the latest version of our grammar is always parsed, the code generated and included in our project structure.

In later posts, we’ll see how we can stream updates to a client whenever a component’s value changes. Using EventStore and Reactive Extensions, we can turn this into a push-based model. But more on that later.

For the uninitiated, ANTLR, or “ANother Tool for Language Recognition”, can be used (among other things) to build languages. More info is available on the official ANTLR4 website.

The Grammar for our new Domain Specific Language

The first step is understanding the problem and writing a simple grammar to solve it. We need a way to accept custom expressions or ‘formulas’, parse them, and evaluate them at runtime once we have all the necessary information. Some of the variables in our grammar are constant values and some are fed into our application by an external source; at ‘evaluation’ time, we substitute them and calculate a result.
An example of this could be the following:

FXRate('EURUSD') * UomConvert('ST','MT') * 100
In this formula, we need the EURUSD foreign exchange rate, the conversion factor between Short Tons and Metric Tons and finally, we multiply that by 100.
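
To make the arithmetic concrete, here is a minimal C# sketch of what evaluating this formula amounts to once the inputs are known. The EURUSD rate of 1.14 is a made-up value, and the FormulaExample class and its lookup tables are hypothetical stand-ins for the external sources; only the short ton to metric ton factor (0.90718474) is a real constant.

```csharp
using System;
using System.Collections.Generic;

public static class FormulaExample
{
    // Hypothetical FX rates; in the real application these come from an external source.
    private static readonly Dictionary<string, decimal> FxRates =
        new Dictionary<string, decimal> { { "EURUSD", 1.14m } };

    // Conversion factors keyed as "FROM>TO"; 1 short ton = 0.90718474 metric tons.
    private static readonly Dictionary<string, decimal> UomFactors =
        new Dictionary<string, decimal> { { "ST>MT", 0.90718474m } };

    public static decimal FXRate(string currencyPair) => FxRates[currencyPair];

    public static decimal UomConvert(string from, string to) => UomFactors[from + ">" + to];

    // Evaluates FXRate('EURUSD') * UomConvert('ST','MT') * 100 by hand.
    public static decimal Evaluate() => FXRate("EURUSD") * UomConvert("ST", "MT") * 100;

    public static void Main()
    {
        Console.WriteLine(Evaluate());
    }
}
```

The point of the grammar is to get from the formula text to this kind of computation without hard-coding each formula.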

It is a simple example, but it represents the basic operations that we need to support.
Parsing can be done with regular expressions. However, the pattern needed to accommodate these requirements would be complex, difficult to understand and error-prone. A better solution should be testable and extensible, as well as easy for a newcomer to understand. This is where ANTLR can be beneficial. Having a grammar that defines all our supported operations makes the code more readable and more maintainable than a very complex regular expression.
We start with a grammar file that is fed into the ANTLR binary, which then generates the C# classes we need to work with the operations we want to support.

The grammar file defines every single element of our language, from the building blocks (a digit, a number, alphanumerical characters etc.) to the functions we want to support and finally the full set of allowed operations.
For the purposes of this post, we will define a simple set of rules and some basic expressions that allow our language to parse formulas like the one above. Note that in this simplified grammar the function arguments are written without quotes, for example FXRate(EURUSD):

grammar MyGrammar;
/*
 * Parser Rules
 */
number: INT | FLOAT;
fromUomCode: NAME | IDENTIFIER;
toUomCode: NAME | IDENTIFIER;
fxRateFunc: 'FXRate' '(' currencyPair ')';
currencyPair: NAME;
uomConvertFunc: 'UomConvert' '(' fromUomCode ',' toUomCode ')';
// Once one alternative is labeled, ANTLR requires a label on every alternative,
// so both 'number' and the parenthesized expression carry #num.
expr: expr op=(MUL | DIV) expr   #mulDiv
    | expr op=(ADD | SUB) expr   #addSub
    | number                     #num
    | '(' expr ')'               #num
    | fxRateFunc                 #fxRate
    | uomConvertFunc             #uomFactor
    ;
/*
 * Lexer Rules
 */
fragment DIGIT: [0-9];
fragment LETTER: [a-zA-Z];
INT: DIGIT+;
FLOAT: DIGIT+ '.' DIGIT+;
STRING_LITERAL: '"' .*? '"';
NAME: LETTER (LETTER | DIGIT)*;
IDENTIFIER: [a-zA-Z0-9]+;
MUL: '*';
DIV: '/';
ADD: '+';
SUB: '-';
WS: [ \t\r\n]+ -> skip;

Without going into too much detail, we can see the basic components of our language. The lexer rules define what a digit, a letter and an integer are. Other tokens such as FLOAT, NAME and STRING_LITERAL are built by combining these fragments, and the basic arithmetic operators are specified as well. As we can see, the lexer rules for digits, letters and even string literals use a notation very similar to regular expressions. ANTLR keeps each pattern small and named, so we never have to maintain one large, complex regular expression ourselves.

The more interesting part of the grammar is the parser rules. These define the structure of the operations that we want our parser to support. In our case, we want to support the following:

UomConvert(From, To): This expression receives two units of measure and (given all the information it needs) converts the ‘From’ unit of measure to the ‘To’ unit of measure. The conversion itself is implemented in C#. We’ll go into more detail in the next post of this series.
FXRate(currencyPair): This looks up the foreign exchange rate for the given currency pair, for example EURUSD.
IDENTIFIER: As it is possible to have a unit of measure like Bushels.56, the Identifier is used to define one of the possible parameters to the UomConvert function. Therefore, we have the fromUomCode defined as “NAME | IDENTIFIER”. Note that ‘currencyPair’ is just a name because so far, we have no currencies that contain numbers in them.
Number: This can be either an Integer or a floating-point number, hence the “INT | FLOAT” definition in our parser rule.

Finally, as we want to support not just the above two expressions but any valid mathematical operation on them, we create the ‘expr’ rule and recursively allow all the valid combinations:

expr: expr op=(MUL | DIV) expr   #mulDiv
    | expr op=(ADD | SUB) expr   #addSub
    | number                     #num
    | '(' expr ')'               #num
    | fxRateFunc                 #fxRate
    | uomConvertFunc             #uomFactor
    ;

When we implement all these functions later in our C# code, we will have to specify what to do for each case as we’re visiting the parse tree.

The file is saved with a .g4 extension and we’re ready to use it.

Note: the # operator used here labels each alternative of the rule. ANTLR uses these labels to name the methods it generates, one per labeled alternative, which we will implement later. We will take a more in-depth look at this in an upcoming blog post, once we get to using the generated code in C#.
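
To illustrate what the labels give us: with the C# target, each label becomes a VisitXxx method on the generated base visitor, and each context exposes the sub-expressions and the op token. The following is only a rough sketch. It assumes the classes MyGrammarBaseVisitor and MyGrammarParser that ANTLR generates from MyGrammar.g4; the namespace and the CalculatorVisitor name are hypothetical, and it compiles only alongside the generated code.

```csharp
// Assumption: MyGrammarBaseVisitor<T> and MyGrammarParser are generated by ANTLR
// from MyGrammar.g4 with the -visitor option; the namespace below is hypothetical.
using MyProject.Expressions.Generated;

public class CalculatorVisitor : MyGrammarBaseVisitor<decimal>
{
    // Called for alternatives labeled #mulDiv.
    public override decimal VisitMulDiv(MyGrammarParser.MulDivContext context)
    {
        decimal left = Visit(context.expr(0));
        decimal right = Visit(context.expr(1));
        return context.op.Type == MyGrammarParser.MUL ? left * right : left / right;
    }

    // Called for alternatives labeled #addSub.
    public override decimal VisitAddSub(MyGrammarParser.AddSubContext context)
    {
        decimal left = Visit(context.expr(0));
        decimal right = Visit(context.expr(1));
        return context.op.Type == MyGrammarParser.ADD ? left + right : left - right;
    }

    // #num, #fxRate and #uomFactor would get their own overrides in the same way.
}
```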

A look at the tree

Once we have our grammar set up, there are a few ways to visualize what is happening behind the scenes. I found the most convenient way is to set up the ANTLR plugin for Visual Studio Code which can be installed from the Marketplace.

After creating a launch configuration for Visual Studio Code, we’ll have all we need to be able to parse, generate and visualize our grammar’s parse tree.

Here is a launch configuration we can use for VSCode:
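
A minimal sketch of such a configuration, assuming the plugin in question is the “ANTLR4 grammar syntax support” extension and its antlr-debug launch type. The file names are placeholders for this post's grammar and test input, and expr is used as the start rule:

```json
{
    "version": "2.0.0",
    "configurations": [
        {
            "name": "Debug MyGrammar",
            "type": "antlr-debug",
            "request": "launch",
            "input": "input.txt",
            "grammar": "MyGrammar.g4",
            "startRule": "expr",
            "printParseTree": true,
            "visualParseTree": true
        }
    ]
}
```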

To test the above grammar file, we’ll create a simple input text file with the following:

UomConvert(MT, ST)

The generated parse tree looks as follows:

Fig. 1: Parse tree generated for UomConvert(MT, ST)

Other more complex expressions can be parsed:

2 / UomConvert(MT, ST) * FXRate(EURUSD)

Fig. 2: Parse tree generated for 2 / UomConvert(MT, ST) * FXRate(EURUSD)

Generating C# we can use…
The latest version of a Visual Studio C# project file has been massively simplified by Microsoft. Not only is the file easier to understand, but editing and changing things is much quicker without needing to unload and reload the project. It all just works on the fly.
To have a consistent working set of generated C# classes, and to avoid any issues in any possible development environment as well as the CI/CD pipeline, we wanted to have the following steps as part of the build:
NOTE: Generated files go under the $(ProjectDir)Expressions\Generated folder

Remove all previously generated *.cs files in the Generated folder from the compilation.
Delete the Generated folder.
Call the ANTLR4 binary using java -jar as a pre-build step, setting it to output all files to the Generated folder.
Include *.cs under Generated.
Compile!
Profit?

To do the above steps, we used the following project file: (comments provided in each line about what it’s doing)

<Project Sdk="Microsoft.NET.Sdk" ToolsVersion="15.0">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net471</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <!-- Include Antlr4.Runtime -->
    <PackageReference Include="Antlr4.Runtime.Standard" Version="4.7.1.1" />
  </ItemGroup>

  <Target Name="PreBuild" BeforeTargets="PreBuildEvent">
    <ItemGroup>
      <!-- Use Compile Remove to drop previously generated files from the compilation -->
      <Compile Remove="Expressions\Generated\*.cs" />
    </ItemGroup>
    <!-- Remove Generated dir. -->
    <RemoveDir Directories="$(ProjectDir)Expressions\Generated" />
    <!-- Run ANTLR on the grammar -->
    <Exec Command="java -jar $(SolutionDir)tools\antlr-4.7.1-complete.jar $(ProjectDir)Expressions\Grammar\MyGrammar.g4 -o $(ProjectDir)Expressions\Generated -Dlanguage=CSharp -no-listener -visitor -package $(ProjectName).Expressions.Generated" />
    <ItemGroup>
      <!-- Include generated C# files -->
      <Compile Include="Expressions\Generated\*.cs" />
    </ItemGroup>
  </Target>
</Project>

As a prerequisite for this step, we need the ANTLR4 binary. We’ve decided to use the Java version of ANTLR4, even though there’s a C# port which also works. Under the solution root directory, we’ve created a “tools” folder which contains the ANTLR4 jar. Java also needs to be installed on the build servers for this step to work in CI/CD toolchains.

Let’s have a look at some of the command line arguments for the ANTLR4 step that we used.

$(ProjectDir)Expressions\Grammar\MyGrammar.g4: this is the path to our grammar file.
-o: specifies the output directory.
-Dlanguage=CSharp: sets the target language. This is the beauty of ANTLR: it generates a C# visitor class structure (including interfaces and base classes that we can extend).
-no-listener: don’t generate the parse tree listener. We don’t really need it for what we want to do, and it is enabled by default.
-visitor: generates the tree visitor. This will allow us to implement our behavior later, performing the proper action as each node of the tree is visited. We’ll need to get our information from external sources to substitute variables with actual values.
-package: specifies a package/namespace for the generated code.

After the build succeeds we should have all the output files added to our solution automatically.
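
As a quick check that the generated classes are usable, a small sketch like the following should parse one of the earlier test inputs and print its tree in text form. It assumes the generated MyGrammarLexer and MyGrammarParser classes and the Antlr4.Runtime.Standard package referenced above; the namespace and the ParseExample class are hypothetical.

```csharp
using System;
using Antlr4.Runtime;
using Antlr4.Runtime.Tree;
// Hypothetical namespace; it is whatever was passed to -package in the build step.
using MyProject.Expressions.Generated;

public static class ParseExample
{
    public static void Main()
    {
        // Feed the formula text through the generated lexer and parser.
        var input = new AntlrInputStream("2 / UomConvert(MT, ST) * FXRate(EURUSD)");
        var lexer = new MyGrammarLexer(input);
        var tokens = new CommonTokenStream(lexer);
        var parser = new MyGrammarParser(tokens);

        // 'expr' is the top-level rule of the grammar in this post.
        IParseTree tree = parser.expr();

        // Print a LISP-style text rendering of the parse tree.
        Console.WriteLine(tree.ToStringTree(parser));
    }
}
```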

Summary on how we developed and integrated a simple DSL using ANTLR

This was a learning process. We realized that regular expressions could get tricky: if we ever needed to add more variations to the required inputs, the expressions could become extremely hard to follow. We preferred to spend some time learning how ANTLR works and what benefits we could get from it.

Turns out, it’s not very difficult to get it up and running. So far, we’ve been using this solution (or one very similar to it) and it has proven that the code can be very extensible. It might occasionally need some tweaking here and there, but our grammar files have stayed the same since we first tried this approach. The grammar really is the main driver for all of this. Getting it right in the beginning can help avoid a lot of headaches and it will rarely ever need to be changed (unless requirements change, of course).
A copy of the code can be found on our GitHub repository with the details of this blog post.

The second part of this series will focus more on the generated files and how we use them. The Visitor pattern is the main driver for the next part of the process.

Written by an Adaptive Consultant.
