DMS Parsers
The DMS Software Reengineering Toolkit is designed to allow the "domain" (language) engineer to specify new langauges to process, quickly and accurately, so that she may spend most of her attention on the actual program analysis or transformation of interest. (To give a sense of this, SD was asked in 2014 to define an Erlang front end to DMS. We were lucky to have a well-defined grammar for Erlang's parser generator from Ericsson, so there was no guesswork as to what the grammar really was. Still, SD built a complete working parser, including lexer and prettyprinter, in 8 hours and integrated with an SD tool, CloneDR. We think that's impressive.)
At DMS Domain Specifications, we give some background on the necessary elements to build a DMS language processing domain. DMS Lexers shows how one can succinctly specify to DMS how to produce a stream of lexemes. On this page, we show what DMS grammars look like, and discuss why they are easy to write and debug even for complex languages.
We use Nicholas Wirth's Oberon language as an example, as it is a real, practical language yet simple enough so the entire parser and prettyprinter definitions can be easily be displayed and understood here.
DMS BNF Grammar Specifications
DMS defines grammars in a remarkably simple BNF format. All grammar rules are written as follows:
LHStoken = RHStoken1 ... RHStokenN ; SemanticPredicate
That's it. Nothing else. There are no grammar rule restrictions; you can write left recursive or right recursive grammar rules. Rules can have indefinite lookahead. They can be ambiguous. DMS's GLR parser takes care of all of these cases. You simply don't have the headaches of LL(k) (e.g., ANTLR, JavaCC, ...) or LALR(1) (Yacc, Bison, la de dah...); try defining C++14, as we have with DMS, with any of these.
This means the parser engineer can focus on writing simple rules, and getting the grammar right rather than fighting with her parser generator tool. (The SemanticPredicate can be used to eliminate some unwanted parses, if that makes sense, but is rarely used, usually when there is a severe context-sensitivity in the grammar, such as with COBOL level declarations.)
Other grammar systems use extended BNF (EBNF), with all kinds of additional alternation, sequencing and Kleene plus/star operators, with embedded actions for building tree nodes or checking semantics. (Ira Baxter, the architect of DMS, did that for 30 years. Then he got smart.)
It is easy to code the other EBNF ideas with this notation. Alternatives? Write multiple rules with the same LHS. Optionals? That's an alternative that has an empty RHS. Lists? Write two rules with same LHS, one of which starts with the same LHS. Weirdly, you simply don't miss the EBNFisms. It turns out they complicate other descriptive formalisms that DMS uses, so losing them actually ends as a win for more complex tasks!
A second major win for the domain/parser engineer is that she doesn't have to invent fancy or complicated rules to build Abstract Syntax Trees (ASTs). DMS will build an AST directly from the rules without any addtional information. Automatic AST construction is enormously helpful when you have a complex grammar (SD's IBM COBOL grammar has 6000+ rules!), or one that is changing rapidly while it is under development or design. Change the grammar: the generated tree changes automatically. The AST produced is isomorphic to the grammar, so the grammar is the documentation for the AST. Nothing else needed.
The constructed tree can be entirely concrete; the domain engineer can specify that all nodes are retained. Or, he may at parser generation time, specify that non-value carrying leaves are dropped; unary production nodes are eliminated; and that list-forming nodes be compressed into compact sequence nodes in the tree.
DMS's parsers even captures comments and attaches them to tree nodes. Comments are normally attached to the nearest concrete node; if such nodes are eliminated, the comments for that are lifted to a parent (retained) node and marked accordingly. Comments thus remain in the tree as if the leaf nodes still existed.
So you want a parser? Write a set of context-free grammar rules. Done.
A Grammar for Wirth's Oberon
Here is the complete grammar for the Oberon language, covering the original design by Wirth and the 2007 dialect. Explanation of some aspects of the grammar can be found below.
-- Oberon.atg: DMS Grammar Definition for Oberon -- Copyright (C) 2014-2015 Semantic Designs, Inc.; All Rights Reserved -- ------------------------------------------------------------------------------------------ ---(1)------------ Module ------------------------------------------------------------------------------------------ module = 'MODULE' ident ';' DeclarationSequence 'END' ident '.' ; module = 'MODULE' ident ';' DeclarationSequence ProgramBody ident '.' ; module = 'MODULE' ident ';' 'IMPORT' ImportList ';' DeclarationSequence 'END' ident '.' ; module = 'MODULE' ident ';' 'IMPORT' ImportList ';' DeclarationSequence ProgramBody ident '.' ; ProgramBody = 'BEGIN' StatementSequence 'END' ; ImportList = import ; ImportList = ImportList ',' import ; import = ident ; import = ident ':=' ident ; #IF Oberon -- Original version of the language as it was introduced in 1986. DeclarationSequence = ConstTypeVarDeclSequence ProcedureDeclSequence; ConstTypeVarDeclSequence = ; ConstTypeVarDeclSequence = ConstTypeVarDeclSequence 'CONST' ConstantDeclarationList ; ConstTypeVarDeclSequence = ConstTypeVarDeclSequence 'TYPE' TypeDeclarationList ; ConstTypeVarDeclSequence = ConstTypeVarDeclSequence 'VAR' VariableDeclarationList ; ProcedureDeclSequence = ; ProcedureDeclSequence = ProcedureDeclSequence ForwardDeclaration ';' ; ProcedureDeclSequence = ProcedureDeclSequence ProcedureDeclaration ';' ; #ELSIF Oberon07 -- Dialect of the original language. Revision 1.10.2013/10.3.2014. DeclarationSequence = ; DeclarationSequence = ConstTypeVarDeclaration ; DeclarationSequence = DeclarationSequence ProcedureDeclaration ';' ; ConstTypeVarDeclaration = 'VAR' VariableDeclarationList ; ConstTypeVarDeclaration = 'TYPE' TypeDeclarationList ; ConstTypeVarDeclaration = 'TYPE' TypeDeclarationList 'VAR' VariableDeclarationList ; ConstTypeVarDeclaration = 'CONST' ConstantDeclarationList ; ConstTypeVarDeclaration = 'CONST' ConstantDeclarationList 'VAR' VariableDeclarationList ; ConstTypeVarDeclaration = 'CONST' ConstantDeclarationList 'TYPE' TypeDeclarationList ; ConstTypeVarDeclaration = 'CONST' ConstantDeclarationList 'TYPE' TypeDeclarationList 'VAR' VariableDeclarationList ; #ELSE #FAIL #ENDIF ------------------------------------------------------------------------------------------ ---(2)------------ Declarations ------------------------------------------------------------------------------------------ ConstantDeclarationList = ; ConstantDeclarationList = ConstantDeclarationList IdentDef '=' ConstantExpression ';' ; #IF Oberon TypeDeclarationList = ; TypeDeclarationList = TypeDeclarationList IdentDef '=' Type ';' ; #ELSIF Oberon07 TypeDeclarationList = ; TypeDeclarationList = TypeDeclarationList IdentDef '=' StructType ';' ; #ELSE #FAIL #ENDIF Type = QualIdent ; Type = StructType ; StructType = ArrayType ; StructType = RecordType ; StructType = PointerType ; StructType = ProcedureType ; ArrayType = 'ARRAY' LengthList 'OF' Type ; LengthList = ConstantExpression ; LengthList = LengthList ',' ConstantExpression ; RecordType = 'RECORD' 'END' ; RecordType = 'RECORD' FieldListSequence 'END' ; RecordType = 'RECORD' '(' BaseType ')' 'END' ; RecordType = 'RECORD' '(' BaseType ')' FieldListSequence 'END' ; BaseType = QualIdent ; FieldListSequence = FieldList ; FieldListSequence = FieldListSequence ';' FieldList ; FieldList = FieldNameList ':' Type ; FieldNameList = IdentDef ; FieldNameList = FieldNameList ',' IdentDef ; PointerType = 'POINTER' 'TO' Type ; ProcedureType = 'PROCEDURE' ; ProcedureType = 'PROCEDURE' FormalParameters ; VariableDeclarationList = ; VariableDeclarationList = VariableDeclarationList VariableDefList ':' Type ';' ; VariableDefList = IdentDef ; VariableDefList = VariableDefList ',' IdentDef ; QualIdent = ident ; QualIdent = ident '.' ident ; IdentDef = ident ; IdentDef = ident '*' ; ------------------------------------------------------------------------------------------ ---(3)------------ Procedures ------------------------------------------------------------------------------------------ ProcedureDeclaration = ProcedureHeading ';' ProcedureBody ident ; ProcedureHeading = 'PROCEDURE' IdentDef ; ProcedureHeading = 'PROCEDURE' IdentDef FormalParameters ; #IF Oberon ProcedureHeading = 'PROCEDURE' '*' IdentDef ; ProcedureHeading = 'PROCEDURE' '*' IdentDef FormalParameters ; #ELSIF Oberon07 #ELSE #FAIL #ENDIF FormalParameters = '(' ')' ; FormalParameters = '(' ')' ':' QualIdent ; FormalParameters = '(' FormalParameterList ')' ; FormalParameters = '(' FormalParameterList ')' ':' QualIdent ; FormalParameterList = FormalParameterSection ; FormalParameterList = FormalParameterList ';' FormalParameterSection ; FormalParameterSection = FormalParameterNameList ':' FormalParameterType ; FormalParameterSection = 'VAR' FormalParameterNameList ':' FormalParameterType ; FormalParameterNameList = ident ; FormalParameterNameList = FormalParameterNameList ',' ident ; FormalParameterType = QualIdent ; FormalParameterType = ArrayOfSequence QualIdent ; #IF Oberon FormalParameterType = ProcedureType ; FormalParameterType = ArrayOfSequence ProcedureType ; #ELSIF Oberon07 #ELSE #FAIL #ENDIF ArrayOfSequence = 'ARRAY' 'OF' ; ArrayOfSequence = ArrayOfSequence 'ARRAY' 'OF' ; ProcedureBody = DeclarationSequence 'END' ; ProcedureBody = DeclarationSequence 'BEGIN' StatementSequence 'END' ; #IF Oberon ForwardDeclaration = 'PROCEDURE' '^' ident ; ForwardDeclaration = 'PROCEDURE' '^' ident FormalParameters ; ForwardDeclaration = 'PROCEDURE' '^' ident '*' ; ForwardDeclaration = 'PROCEDURE' '^' ident '*' FormalParameters ; #ELSIF Oberon07 ProcedureBody = DeclarationSequence 'RETURN' Expression 'END' ; ProcedureBody = DeclarationSequence 'BEGIN' StatementSequence 'RETURN' Expression 'END' ; #ELSE #FAIL #ENDIF ------------------------------------------------------------------------------------------ ---(4)------------ Statements ------------------------------------------------------------------------------------------ StatementSequence = Statement ; StatementSequence = StatementSequence ';' Statement ; Statement = ; Statement = Assignment ; Statement = ProcedureCall ; Statement = IfStatement ; Statement = CaseStatement ; Statement = WhileStatement ; Statement = RepeatStatement ; #IF Oberon Statement = LoopStatement ; Statement = WithStatement ; Statement = 'EXIT' ; Statement = 'RETURN' ; Statement = 'RETURN' Expression ; #ELSIF Oberon07 Statement = ForStatement ; #ELSE #FAIL #ENDIF Assignment = designator ':=' Expression ; ProcedureCall = designator ; ProcedureCall = designator ActualParameters ; IfStatement = 'IF' Expression 'THEN' StatementSequence 'END' ; IfStatement = 'IF' Expression 'THEN' StatementSequence 'ELSE' StatementSequence 'END' ; IfStatement = 'IF' Expression 'THEN' StatementSequence ElseIfSequence 'END' ; IfStatement = 'IF' Expression 'THEN' StatementSequence ElseIfSequence 'ELSE' StatementSequence 'END' ; ElseIfSequence = 'ELSIF' Expression 'THEN' StatementSequence ; ElseIfSequence = ElseIfSequence 'ELSIF' Expression 'THEN' StatementSequence ; CaseStatement = 'CASE' Expression 'OF' CaseSequence 'END' ; #IF Oberon CaseStatement = 'CASE' Expression 'OF' CaseSequence 'ELSE' StatementSequence 'END' ; #ELSIF Oberon07 #ELSE #FAIL #ENDIF CaseSequence = CaseCase ; CaseSequence = CaseSequence '|' CaseCase ; CaseCase = ; CaseCase = CaseLabelList ':' StatementSequence ; CaseLabelList = CaseLabels ; CaseLabelList = CaseLabels ',' CaseLabels ; CaseLabels = label ; CaseLabels = label '..' label ; #IF Oberon label = ConstantExpression ; #ELSIF Oberon07 label = integer ; label = hexint ; label = string ; label = ident ; #ELSE #FAIL #ENDIF WhileStatement = 'WHILE' Expression 'DO' StatementSequence 'END' ; #IF Oberon #ELSIF Oberon07 WhileStatement = 'WHILE' Expression 'DO' StatementSequence ElseIfWhileSequence 'END' ; ElseIfWhileSequence = 'ELSIF' Expression 'DO' StatementSequence ; ElseIfWhileSequence = ElseIfWhileSequence 'ELSIF' Expression 'DO' StatementSequence ; #ELSE #FAIL #ENDIF RepeatStatement = 'REPEAT' StatementSequence 'UNTIL' Expression ; #IF Oberon LoopStatement = 'LOOP' StatementSequence 'END' ; WithStatement = 'WITH' QualIdent ':' QualIdent 'DO' StatementSequence 'END' ; #ELSIF Oberon07 ForStatement = 'FOR' ident ':=' Expression 'TO' Expression 'DO' StatementSequence 'END' ; ForStatement = 'FOR' ident ':=' Expression 'TO' Expression 'BY' ConstantExpression 'DO' StatementSequence 'END' ; #ELSE #FAIL #ENDIF ------------------------------------------------------------------------------------------ ---(5)------------ Expressions ------------------------------------------------------------------------------------------ Expression = SimpleExpression ; Expression = Expression RelationOperator SimpleExpression ; ConstantExpression = Expression ; RelationOperator = '=' ; RelationOperator = '#' ; RelationOperator = '<' ; RelationOperator = '<=' ; RelationOperator = '>' ; RelationOperator = '>=' ; RelationOperator = 'IN' ; RelationOperator = 'IS' ; SimpleExpression = term ; SimpleExpression = '+' term ; SimpleExpression = '-' term ; SimpleExpression = SimpleExpression AddOperator term ; AddOperator = '+' ; AddOperator = '-' ; AddOperator = 'OR' ; term = factor ; term = term MulOperator factor ; MulOperator = '*' ; MulOperator = '/' ; MulOperator = 'MOD' ; MulOperator = 'DIV' ; MulOperator = '&' ; factor = number ; factor = string ; factor = 'NIL' ; #IF Oberon factor = charconst ; #ELSIF Oberon07 factor = 'TRUE' ; factor = 'FALSE' ; #ELSE #FAIL #ENDIF factor = Set ; factor = designator ; factor = designator ActualParameters ; factor = '(' Expression ')' ; factor = '~' factor ; number = integer ; number = hexint ; number = real ; Set = '{' '}' ; Set = '{' SetElementList '}' ; SetElementList = SetElement ; SetElementList = SetElementList ',' SetElement ; SetElement = Expression ; SetElement = Expression '..' Expression ; designator = QualIdent ; designator = designator '.' ident ; designator = designator '[' IndexList ']' ; designator = designator '(' QualIdent ')' ; designator = designator '^' ; IndexList = Expression ; IndexList = IndexList ',' Expression ; ActualParameters = '(' ')' ; ActualParameters = '(' ExpressionList ')' ; ExpressionList = Expression ; ExpressionList = ExpressionList ',' Expression ;
Description of the Oberon Grammar
The grammar rules themselves should be pretty explanatory; they follow the simple grammar rule formalism defined by DMS. The names of all terminals are defined by the Oberon lexical description.. Non-value carrying terminals such as operators =, self-designating literal values TRUE, and keywords LOOP are specified as quoted literals, e.g., '=', 'TRUE' and 'LOOP'. Value carrying terminals are given arbitrary (but self-documenting) names such as integer and string.
Rules with alternatives (e.g., Statement simply have one rule per alternative. Rules with options are specified as pairs of rules with the same LHS, with one of the rules right hand side simply being empty. List are defined implicitly with pairs of rules with the same LHS, with one rule mentioning the LHS as its starting token, followed by a non-value carrying terminal; see VariableDefList
The #IF, #ELSIF, #ELSE, #ENDIF are dialect conditionals, that control whether some aspects of a grammar are present in a specific dialects. Oberon07 added a ForStatement; it is included inside an Oberon07 dialect condition.
Oberon parsing example
Given a lexer and grammar definition, DMS can be asked to parse source files defined by that notation. It will do so, and complain about (and repair if practical) any syntax errors. DMS will also construct a tree data structure in memory accessible by the rest of DMS and/or PARLANSE.
S-expression style Abstract syntax tree for Oberon "Buffer" program
For the same small Oberon program used as a lexing example, we show a bit of the generated parse tree (and the full parse tree) as printed by the DMS parsing debugging tools. This parse tree was constructed entirely by DMS using just the grammar specification.
The parse tree is shown in Lisp S-expression style, e.g., a nested set of parentheses
with each parentheses representing a single node: (node child1 child2 ... childn)node.
In detail, each tree node is printed as:
(token_name@domain=token_number#ID^parentcount#parentID:childcount ... [value]file source position
If the node does not carry value, then the [value] portion is not included. Comments ("precomments" or "postcomments")
are included in the node contents if present.
(module@Oberon~Oberon07=2#1defc40^0 Line 4 Column 1 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon ('MODULE'@Oberon~Oberon07=169#1de89a0^1#1defc40:1[Keyword:0] Line 4 Column 1 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon precomment 0:1 `(* *)')'MODULE' (ident@Oberon~Oberon07=170#1de8a20^1#1defc40:2[`Buffer'] Line 4 Column 8 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon)ident (';'@Oberon~Oberon07=171#1de8a60^1#1defc40:3[Keyword:0] Line 4 Column 14 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon)';' (DeclarationSequence@Oberon~Oberon07=12#1def000^1#1defc40:4 Line 6 Column 3 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon (DeclarationSequence@Oberon~Oberon07=12#1ded3a0^1#1def000:1 Line 6 Column 3 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon (DeclarationSequence@Oberon~Oberon07=11#1de9760^1#1ded3a0:1 Line 6 Column 3 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon |(ConstTypeVarDeclaration@Oberon~Oberon07=17#1de97c0^1#1de9760:1 Line 6 Column 3 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon | ('CONST'@Oberon~Oberon07=180#1de8aa0^1#1de97c0:1[Keyword:0] Line 6 Column 3 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon)'CONST' | (ConstantDeclarationList@Oberon~Oberon07=21#1de8e40^1#1de97c0:2 Line 6 Column 9 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon | (ConstantDeclarationList@Oberon~Oberon07=20#1de8b00^1#1de8e40:1 Line 6 Column 9 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon)ConstantDeclarationList | (IdentDef@Oberon~Oberon07=52#1de8b80^1#1de8e40:2 Line 6 Column 9 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon | (ident@Oberon~Oberon07=170#1de8ae0^1#1de8b80:1[`N'] Line 6 Column 9 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon)ident | )IdentDef#1de8b80 | ('='@Oberon~Oberon07=181#1de8b60^1#1de8e40:3[Keyword:0] Line 6 Column 11 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon)'=' | (ConstantExpression@Oberon~Oberon07=116#1de8dc0^1#1de8e40:4 Line 6 Column 13 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon | (Expression@Oberon~Oberon07=114#1de8d80^1#1de8dc0:1 Line 6 Column 13 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon | |(SimpleExpression@Oberon~Oberon07=125#1de8d20^1#1de8d80:1 Line 6 Column 13 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon | | (term@Oberon~Oberon07=132#1de8ce0^1#1de8d20:1 Line 6 Column 13 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon | | (factor@Oberon~Oberon07=139#1de8ca0^1#1de8ce0:1 Line 6 Column 13 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon | | (number@Oberon~Oberon07=149#1de8c60^1#1de8ca0:1 Line 6 Column 13 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon | | |(integer@Oberon~Oberon07=200#1de8c20^1#1de8c60:1[+100] Line 6 Column 13 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon)integer | | )number#1de8c60 | | )factor#1de8ca0 | | )term#1de8ce0 | |)SimpleExpression#1de8d20 | )Expression#1de8d80 | )ConstantExpression#1de8dc0 | (';'@Oberon~Oberon07=171#1de8c40^1#1de8e40:5[Keyword:0] Line 6 Column 16 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon)';' ... |)Assignment#1defb40 )Statement#1defb60 )StatementSequence#1defb80 ('END'@Oberon~Oberon07=172#1defaa0^1#1defc00:3[Keyword:0] Line 35 Column 1 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon)'END' )ProgramBody#1defc00 (ident@Oberon~Oberon07=170#1defbc0^1#1defc40:6[`Buffer'] Line 35 Column 5 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon)ident ('.'@Oberon~Oberon07=173#1defc20^1#1defc40:7[Keyword:0] Line 35 Column 11 File C:/DMS/Domains/Oberon/Examples/Buffer.Oberon)'.' )module#1defc40
The full AST is shown below.
This graph is drawn by DOT, from a debug output of the DMS parser. Click the image to see a large version.
Next: DMS PrettyPrinters.