DMS Patterns and Rewrite Rules
The DMS Software Reengineering Toolkit allows a "domain" (language) engineer describe/specify languages of interest quickly and accurately, so that she may spend most of her attention on a specific program analysis or transformation of interest.
At DMS Domain Specifications, we give some background on the necessary elements to build a DMS language processing domain. We have already described how to easily specify a grammar to DMS, which automatically generates a parser and syntax tree builder, and tree-to-text prettyprinter.
On this page, we show how DMS can work with parameterized code fragments, written, for readability, using the (concrete) syntax of such a specified grammar In particular, we show
- patterns (for matching existing, or creating (instantiating) new code)
- and rewrite rules (for transforming code)
DMS tools are implemented as metaprograms that invoke various aspects of DMS, such as pattern matching or rewriting. It is important to remember that these activities all occur on compiler data structures and not on raw text; for patterns and rewriting, this is largely (Abstract) Syntax Trees. This ensures that analyses and transforms are easier to build, run faster at scale, and are much more reliable than some ad hoc text-string manipulation.
We use Nicholas Wirth's Oberon language for examples, as it is a real, practical language yet simple enough so we can provide interesting pattern and rewrite examples.
Pattern Concepts
A DMS pattern allows the domain engineer to specify a code fragment to be used for either pattern-matching against DMS ASTs (acting like if you see this...), or to be used to instantiate ASTs that match the fragment (acting like instantiate this), or both. Sophisticated tools built using DMS may use many patterns, each representing a code structure of interest. (Patterns are also used to build rules; see below).
Such patterns are written using the surface syntax (e.g, the lexical and grammar rules) of a specifically chosen domain (notation) of interest (e.g., Java, SQL, Verilog). Because DMS is multi-lingual, in order for such a pattern to be interpreted, the domain must be explicitly specified (usually as a language~dialect):
default pattern domain PHP~PHP5;
A single DMS tool may have patterns (and rewrite rules) written for several languages that the tool has been configured to handle; the domain for each pattern (or rule) must be specified before the pattern (or rule) is defined.
A DMS pattern has the following form (free format):
pattern pattern_name(pattern_variable1:syntax_category1, ... pattern_variableN:syntax_categoryN) : pattern_syntax_category = " pattern_body " if additional_condition_on_pattern_match;
The elements of the pattern are:
- keyword pattern
- the pattern_name, referenceable by other patterns, rules, or by the rest of a DMS application
- optional pattern_variables, specifying pattern variable names and their corresponding syntax_category. Syntax categories are defined by the left-hand-side tokens of the DMS BNF grammar rules for the currently selected pattern domain.
- the pattern_syntax_category that the entire pattern (metaquoted text) satisifies
- metaquoted (double-quoted) pattern text (pattern_body). The pattern_body is a valid fragment of text from the currently specified domain, that is an instance of the specified pattern syntax category. The pattern body may include references to pattern variables metaescaped with a prefix \, and other metaescaped pattern expressions.
- an optional if additional_condition_on_pattern_match which provides additional conditions based on the values of the matched parameters.
The pattern_body is essentially what will be matched or instantiated, depending on the pattern usage, and normally contains a code fragment, written in the syntax from the target language of interest ("surface syntax"). The metaquotes around the pattern_body are needed to delimit the pattern code fragment in the target language from the RSL pattern-language definition syntax, because the pattern (rule) language generally has very different syntax from the target domain.
The pattern_body is interpreted as specification for a tree fragment with pattern-variable leaves using the same parsing engine that reads the target source code. The DMS pattern matcher can find where this tree fragment matches a parse tree; this is not a string-match. The DMS instantiation engine can construct an instance of this tree given the AST subtree bindings for the pattern variables, thereby providing a way to build an arbitrary tree fragments from (possibly multiple) patterns.
A meta-reference to another pattern may be written in the pattern_body. Such a reference looks like:
\pattern_name\(pattern_arg1\,...pattern_argN\)
This is in effect a subpattern function call with parameters consisting of additional pattern_args. (Note the metaescaped \(, \, and \) syntax, to distinguish from similar syntax in the target domain). One may write domain-text or other (recursively!) additional meta-references in the pattern_args. Such meta-references act as if the other named pattern were incorpated into the referencing pattern at the point of reference. Such metareferences can pass metavariables to enable matching with consistent bindings. This allows one to write extremely complex patterns in chunks which are easy to understand, and for a set of complex patterns to share building-block subpatterns.
Pattern Conditionals
Conditions are boolean expressions that allow checking whether bound pattern variables have fixed syntactic properties, or represent something that can be transformed to a canonical value by a set of rewrites (see below), or some arbitrarily complex semantic property possibly derived by inspecting other parts of the tree matched by a pattern, or inspecting auxiliary data structures (e.g., symbol tables, control flow, data flow, special inference procedures). A key idea is that conditionals connect analyzers implemented in other parts of DMS to the patterns and rewrite rules.
One might write a series of increasingly sophisticated patterns to match useless while statements. This pattern:
pattern useless_while(b:body) : statement = " while false do \b"
is equivalent to:
pattern useless_while(c:condition, b:body) : statement = " while \c do \b" if c == "false";
However, this second pattern actually succeeds in finding a tree with any conditional in a while statement, but then checks that (pattern-match bound) condition is equal to a literal constant false. A generalization of the (above purely syntactic) check) which is more "semantic" would be
pattern useless_while(c:condition, b:body) : statement = " while \c do \b" if c\|/ AlgebraSimplifications == false ;
This pattern if the matched condition c can be rewritten (\|/) using a set of rewrite rules called AlgebraSimplifications (not shown here), to match false. Even more "semantic" would be:
pattern useless_while(c:condition,t:then_clause) : statement = " while \c then \t" if can_determine_is_false(c);
which uses a custom (PARLANSE-coded) predicate can_determine_is_false that consults other DMS-derivable results to decide the instantiated condition is always false at the point where it is found (the matched tree c carries its source position with it, so the context is implied). One might build control and data flow for the desired language, and use the resulting dataflow to determine values that arrive at the condition c, which may then always turn out to be "false" in a nontrivial way.
Some Patterns for Oberon
Here we provide some example patterns that could be useful in matching/generating Oberon code (grammar for reference).
default pattern domain Oberon~Oberon07; default condition domain Oberon~Oberon07; pattern float_declaration(x: ident): VariableDeclarationList = " \x: FLOAT ; "; pattern double(f: Factor): Term = " \f * 2 "; pattern optimizable_divide(f: Factor): Term = " \f / \f " if is_not_zero(f); pattern assignment_plus_1(lhs: ident, e: SimpleExpression): Statement = " \lhs := \e + 1 "; pattern if_variable_equals_conditional_statement(i: ident, t: Term, s:Statement) :Statement = " IF \i = \t THEN \s END "; pattern procedure(i: ident, p: FormalParameterList, b: StatementSequence) : ProcedureDeclaration = " PROCEDURE \i ( \p ) ; BEGIN \b END \i "; pattern special_algorithm(x: ident, y: ident): StatementSequence = " WHILE \y > 0 DO \y := \y-1; \x := \x + \double\(\y\); END \assignment_plus_1\(\y\,\x\) "; external condition is_not_zero(f: Factor) = 'Misc/Oberon_IsNotZero';
Pattern float_declaration specifies a declaration statement that declares some identifier as a float-type. double defines an expression multiplied by the constant two. optimizable_divide finds a quotient that might be simplified (to one), because the numerator is identical to the denominator, and verifies that the divisor is not a zero value by inspecting the tree "around" the bound pattern variable. The fact that the same pattern variable \f is mentioned twice in the pattern, causes the pattern matcher to check that all instances have the same exact tree shape. assignment_plus1 describes a statement that assigns a value, plus one, to a target variable. if_variable_equals_conditional_statement describes a statement executed under a condition that checks a variable for equality to some value. procedure describes a rather abstract form of a procedure; all the pieces need to be filled in. Patterns can be composed; special_algorithm shows a complex set of statements, that mix surface syntax with invocations of other patterns. Condition is_not_zero is a custom analzer that inspects its tree to verify that the value of that tree is "not zero" in the context from which the tree was taken.
A DMS tool may use a pattern to locate code of interest in any AST it has acquired. The tool metaprogram may call a DMS library function, Pattern:Match with a reference to the optimizable_divide pattern, and a reference to the AST for this code:
y := 1; x := 3; q := (y-1)/(y-1); p := (x-4)/(x-4);
The matcher will attempt to match the (y-1)/(y-1) term, but the additional semantic check is_not_zero (implemented by the domain engineer) will determine that y-1==0, violating the semantic condition on the pattern. The matcher will find the (x-4)/(x-4) term, and the analyzer will decide that x-4<0, so it is not zero, and the pattern will match, binding the metavariable t to the subtree for x-4. The DMS tool will obtain a reference to the divide node as the location of the match, and a reference to the - ("subtract") node as the binding of the pattern variable. (Since there are two matching subtrees, the first (leftmost) one is returned).
A DMS tool may use a pattern to synthesize code, too. If the tool
metaprogram calls the DMS library function
Pattern:Instantiate with the pattern special_algorithm
and trees for p and q for pattern variables x and y
respectively, the instantior will build an DMS AST that would prettyprint as:
WHILE q > 0
DO q := q-1;
p := p + q * 2
END
q := p + 1
That result tree in turn may be used as a parameter to another pattern, to assemble a larger tree.
In this way an entire program can be generated from "templates" that is syntactically correct.
If the metaprogram then instantiates pattern procedure with that tree for the parameter
b, with trees foobar and ( q: integer ) for parameters i and f
respectively, the instantiator will build an AST that would prettyprint as:
PROCEDURE foobar(q: integer);
BEGIN
WHILE q > 0
DO q := q-1;
p := p + q * 2
END
q := p + 1
END
Assembling trees in this manner allows the code generator to build them in any order that makes sense, not just left-to-right. Futhermore, building ASTs allow the DMS Rewriting machinery to be applied to assembled trees, enabling post-tree-composition optimization.
Rewrite Rule Concepts
DMS also provides the ability to encode source-to-source transformations ("rewrites") using the surface syntax of the source and target (if different than source) languages, of the form, if you see this, replace it by that. These rewrites are applied to the parse trees to provide revised trees; a prettyprinter ("anti-parser") provided by DMS can then regenerate source code for the target language.
DMS rewrite rules take the form (free format):
rule rule_name(pattern_variable1:syntax_category1, ... pattern_variableN:syntax_categoryN) : pattern_syntax_category -> replacement_syntax_category = " pattern_body " -> " replacement_body " if additional_condition_on_pattern_match;
The first part of the rule is essentially a pattern (if you see this), and have virtually identical syntax and semantics. Rewrite rules add a replacement part (... then replace it by that). Rewrite rule elements are:
- keyword rule
- the rule_name, referenceable by other patterns, rules, or by the rest of a DMS application
- syntax pattern_variables, specifying pattern variable names and a corresponding syntax_category.
- the pattern_syntax_category that the pattern_body satisifies
- The token -> seperating pattern_syntax_category from replacement_syntax_category. This token is read as "(category) will be replaced by", and occurs here in sympathy with a second -> token in the rule.
- the replacement_syntax_category that the replacement_body satisifies
- metaquoted pattern text (pattern_body), including references to pattern variables metaescaped with a prefix \, and other metaescaped pattern expressions
- A second -> token between pattern_body and replacement_body, read as "(content) will be replaced by".
- metaquoted pattern text (replacement_body), including references to pattern variables metaescaped with a prefix \, and other metaescaped pattern expressions.
- a optional additional_condition_on_pattern_match which provides an additional check based on the value of matched parameters.
When transforming from one langauge to the same language, the pattern_syntax_category and the replement_syntax_category are almost always identical; you can't replace a "declaration" by an "expression" and end up with a syntactically well-formed program. When transforming form one language to a different language, the syntax categories (and their semantics) across the two languages can vary wildly.
What a DMS rewrite rule does, when applied to a specific place on an AST, is pattern-match against that AST. If the match fails, the rewrite fails (and tells this to the tool metaprogram; it may be able to use that knowledge). If the match succeeds, the metavariable are bound to appropriate places in the subtrees of the AST exactly as plain pattern matching would do, and the the replacement_body is instantiated, as described under pattern matching above.
As with patterns, meta-pattern references found in the replacement_body cause the specified pattern to be instantiated with the arguments provided, at the point referenced. Again this allows construction of very complex rewrite rules, using other subpatterns as building blocks.
Individual rules can also be grouped into a ruleset, which can then be used as a group.
Applications of Rewrite rules
Since the rewrites transform one tree into another tree, one can apply an arbitrarily long/complex set of transformation rules (usually one or more rulesets!) repeatedly to the entire tree. This gives DMS rewrites the power of a Post (string [generalized to tree]) rewriting system, thus Turing capable. You can technically produce any arbitrary transformation of the original tree with sufficient transforms. Usually one uses other features of DMS to make writing the transforms a bit easier; for instance, a rewrite rule may consult the result of a particular attribute grammar computation in order to "easily" use information from "far away in the tree" (individual rewrites rules always have a fixed, maximum "radius").
DMS Rewrite rules can also apply to associative and commutative operators. This is extremely hard to do with pure source-to-source rewrites; as one issue, you would have to specify these laws, and then somehow prevent the rewrite engine from getting stuck repeatedly applying the commutative law. Instead, the DMS Rewrite engine has built-in associative and commutative application machinery that handle this. You write the basic algebraic law, you decorate the grammer rules which have these properties, and DMS applies your rules automatically. This is very handy when writing expression simplification procedures.
DMS provides a lot of additional support machinery, to help one construct control flow graphs and/or compute dataflow with efficient parallel solvers; DMS rewrites can consult these artifacts also. We note that one can also fall back when needed on procedural metaprogramming to implement transforms not easily expressed by rewrites, such as LR table generation. DMS allows both source-to-source and procedural rewrites to mix easily and arbitrarily. (Most program transformation systems offer some type of source-to-source rewriting capability based on pattern (tree) matching/replacement somewhat like DMS, but most of them have no fall back for procedural rewrites. We think this is a fundamental mistake. There are other source-to-source systems such as Clang, that offer only procedural rewrites, but not pattern matching/rewrites. We this this is also a mistake.)
If one limits oneself to rewrites that exactly tile the original AST one gets "syntax-directed translation".
Here's a set of possible applications of rewrite rules:
- Expression simplification (both numeric and boolean)
- Dead code removal
- Code optimization
- Aspect insertion
- Refactoring tools
- Code instrumentation
- Refining abstract concepts into concrete code
- Code generation from models
- Language translation/Migration
- Modernization: Replacement of legacy techology constructs or APIs by modern constructs
- Fixing field-width errors (Y2K or similar) or other pervasive broken coding idiom
- Semantically deep code obfuscation (by scrambling control and data flow)
- Reverse engineering concrete code to abstractions
Some Rewrite Rules for Oberon
The following rules can be used to simplify Oberon code (grammar for reference):default base domain Oberon~Oberon07; -------------------------------------------------------------------------------------- rule simplify_boolean_not_not(e: Factor): Factor -> Factor = " ~ ~ \e " -> " \e "; rule simplify_boolean_and_true(e: Factor): Term -> Term = " \e & TRUE " -> " \e "; rule simplify_boolean_and_false(e: Factor): Term -> Term = " \e & FALSE " -> " FALSE " if no_side_effects("\e"); rule simplify_boolean_and_self(e: Factor): Term -> Term = " \e & \e " -> " \e " if no_side_effects("\e"); rule simplify_boolean_or_true(e: Term): SimpleExpression -> SimpleExpression = " \e OR TRUE " -> " TRUE " if no_side_effects("\e"); rule simplify_boolean_or_false(e: Factor): SimpleExpression -> SimpleExpression = " \e OR FALSE " -> " \e "; rule simplify_boolean_or_self(e: Term): SimpleExpression -> SimpleExpression = " \e OR \e " -> " \e " if no_side_effects("\e"); -------------------------------------------------------------------------------------- rule reduce_strength_exponentation(e: Expression):Factor -> Term = " power(\e,2) " -> " (\e) * (\e) " if no_side_effects(e); rule reduce_strength_multiplication(e: Factor):Term -> SimpleExpression = " \e * 2 " -> " \e + \e " if no_side_effects("\e"); -------------------------------------------------------------------------------------- rule simplify_if_true_stmt(ss: StatementSequence) : StatementSequence -> StatementSequence = " IF TRUE THEN \ss END" -> " \ss "; rule simplify_if_false_stmt(ss: StatementSequence): StatementSequence -> StatementSequence = " IF FALSE THEN \ss END " -> " ; "; rule simplify_if_true_stmt_stmt(ss1: StatementSequence, ss2: StatementSequence) : StatementSequence -> StatementSequence = " IF TRUE THEN \ss1 ELSE \ss2 END " -> " \ss2 "; rule simplify_if_false_stmt_stmt(ss1: StatementSequence, ss2: StatementSequence) : IfStatement -> StatementSequence = " IF FALSE THEN \ss1 ELSE \ss2 END " -> " \ss2 "; rule simplify_if_cond_empty_stmt(c: Expression, ss: StatementSequence) : Statement -> Statement = " IF \c THEN ; ELSE \ss END " -> " IF ~(\c) THEN \ss END "; rule simplify_if_cond_stmt_empty(c: Expression, ss: StatementSequence) : Statement -> Statement = " IF \c THEN \ss ELSE ; END " -> " IF \c THEN \ss END "; -------------------------------------------------------------------------------------- rule remove_empty_statement(ss: StatementSequence): StatementSequence -> StatementSequence = " \ss ; " -> " \ss " ; -------------------------------------------------------------------------------------- rule zombie_removal_enablement(config_variable:ident): Factor->Factor = " \config_variable " -> " FALSE " if is_dead_feature(config_variable); -------------------------------------------------------------------------------------- external condition no_side_effects(e: Expression) = 'Misc/Oberon_NoSideEffects'; external condition is_dead_feature(cv: ident) = 'Misc/Oberon_IsDeadFeature'; public ruleset GeneralSimplifications = { simplify_boolean_not_not, simplify_boolean_and_true, simplify_boolean_and_false, simplify_boolean_and_self, simplify_boolean_or_true, simplify_boolean_or_false, simplify_boolean_or_self, reduce_strength_exponentation, reduce_strength_multiplication, simplify_if_true_stmt, simplify_if_false_stmt, simplify_if_true_stmt_stmt, simplify_if_false_stmt_stmt, simplify_if_cond_empty_stmt, simplify_if_cond_stmt_empty, remove_empty_statement, }; public ruleset RemoveDeadCode = GeneralSimplifications + zombie_removal_enablement;
Note that some general simplification rules are grouped into a single ruleset GeneralSimplifications. When applied, this ruleset simplifies boolean conditions where possible, and then removes or cleans up IF statements where boolean expression collapse to TRUE or FALSE. A metaprogram might use this ruleset by itself to clean up Oberon code.
An additional ruleset RemoveDeadCode is defined that includes (+) the simplification rules and the special zombie handling rule. This set of rules is designed to do dead code removal. The zombie_removal_enablement rule converts a configuration variable to FALSE if an "oracle" dead_feature (controlled by the software engineer) says that the configuration variable will henceforth always be false. This can trigger a wave of boolean simplifications and then conditional statement removals, from the general simplifications. The effect is to delete features no longer used by application (zombie code).
Note that the simplify_boolean_... rules only express rewrites over very simple expessions, and not even the commutative variant. DMS's associative/commutative rewrite engine handles the rest. One can write more rules to handle numeric expression simplifications, including constant folding.
Note the semantic conditional no_side_effect; this is additional DMS metaprogramming provided by the tool engineer, not shown here.
The seemingly peculiar rule remove_if_false rewrites a useless if statement to a trivial statement " ; " rather than to some kind of "empty statement" because Oberon's syntax does not allow the latter. The additional rule remove_empty_statement makes the trivial statement vanish when it is found in a set of other statements. There are obvious extensions of the IF statement simplifications to other Oberon control constructs.
More complex transformations are often implemented as a set of individual helper transformations such as these, combined by metaprogramming to guide the transformation process, and to handle context issues (such as scopes, dataflows, ...)
Given the following Oberon program (actual DMS input):
MODULE UserFunctionModule; PROCEDURE user_function( key: string; h: DB_connection); VAR s: STRING; BEGIN; IF feature1 THEN s := READ(h,k) IF desired_property1(s) THEN WRITE(h,k,revise1(s)) END END IF feature2 THEN s := READ(h,k) IF desired_property2(s) & feature1 THEN WRITE(h,k,revise2(s) ELSE WRITE(h,k,revise3(s) END END END user_function; END UserFunctionModule.
and applying the transforms above with dead_feature(feature1)==true (intending to remove zombie functionality feature1) results in the following Oberon program (actual DMS output):
MODULE UserFunctionModule; PROCEDURE user_function( key : string; h : DB_connection); VAR s : STRING; BEGIN; IF feature2 THEN s := READ(h, k) WRITE(h, k, revise3(s) END END user_function; END UserFunctionModule.
The useless (zombie) feature and its support has been eliminated, automatically. Back to DMS overview