DMS compared to other Analysis and Transformation Technologies
The DMS Software Reengineering Toolkit ("DMS") is used to automate program analyses and transformations. It consequently shares a number of properties with other compiler technologies, but often goes beyond them in a variety of ways. This section explores other technologies that are available in practice and compares DMS to them. It is important to remember that DMS and other technologies often serve different purposes, so a difference isn't necessarily bad; different tools meet different needs.
What we hope you will note is that DMS functionally outperforms, or is competitive with, an amazing variety of available language processing tools. The implication is that you can have a mountain of incompatible tools, or DMS: your choice. Having only one tool minimizes your learning curve. In addition, DMS comes with the support that goes with a commercial product.
We compare DMS to the following technologies:
- Lexer Generators (e.g., Flex)
- Parser Generators (e.g., YACC, Bison, JavaCC, CUP, ANTLR/PCCTS)
- Program Transformation Systems (e.g., TXL, ASF+SDF, Stratego, Fermat, CodeWorker)
- C/C++ Language Front Ends (e.g., Edison Design Group, GCC, Columbus, GCCXML, ... )
- C++ Transformation Frameworks (e.g. CodeBoost, ROSE, Puma, OpenC++, Clang)
- Compiler Frameworks (e.g. LCC, GCC, SUIF, SGI Open64, Clang, ...)
- DSL (domain-specific language) compilers
- Aspect Languages (e.g., AspectJ, AspectC, AspectC++)
- Programming Languages (e.g., Lisp, Java, C#, C++, Fortran, ...)
- Graph Rewrite Systems
- Modelling Tools (UML, Models, ...)
- Application Code Generators (UML, Models, ...)
- Machine Code Generators (Burg, ...)
- Migration Engines
- Quality, Security, Source Code Scanning tools (e.g., ... )
- Reverse Engineering Tools
- Compilers (e.g. GNU, Ada, Prolog)
- Rewriting Engines
- Tree Walkers
The list is rather long. We believe we have characterized these other systems fairly. If you think a description is inaccurate, or could be improved, please send a note to webmaster@semdesigns.com.
Lexer Generators
Lexer Generators are used to construct finite state automata for recognizing the words of a language. Relevant issues include:
- Expressibility of the regular expressions
- Lexical format and comment capture
- Lexical mode management
- Character sets handled
- Conversion actions performed by lexer
- Support for include files
- Multiplicity
- Error reporting and recovery
- Performance
- Target language
Expressibility refers to what can be said in the regular expression language; most lexer generators allow character sets, Kleene star and plus, and alternation. DMS is unusual in providing complements ("not") of regular expressions. DMS also allows a token definition to extend what has been collected, or to look ahead an arbitrary distance for the presence of some trailing indicator.
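To see why a complement operator matters, consider recognizing a C-style block comment: the intent is "`/*`, then anything not containing `*/`, then `*/`", which a complement operator states directly. Without one, the pattern must be carefully "tempered", as in this illustrative Python sketch (DMS's actual lexer notation differs):

```python
import re

# Without a complement operator, a C block comment needs a "tempered"
# pattern: runs of non-stars, or star-runs not followed by '/'.
# With complement one could state the intent directly as:
#   "/*" not(anything containing "*/") "*/"
BLOCK_COMMENT = re.compile(r"/\*([^*]|\*+[^*/])*\*+/")

text = "int x; /* a comment ** with stars */ int y;"
print(BLOCK_COMMENT.search(text).group(0))
# -> /* a comment ** with stars */
```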
Lexical format and comment capture refers to what happens to the "shape" of the processed text. Most lexers simply toss away this information. DMS distinguishes white-space to be ignored, from comment text that must be captured and retained with the lexemes. DMS also captures lexeme formatting information such as radix of numbers, type of quotes on strings, file/line/column information, etc. This information is useful for error reporting, and for regenerating the original text around a place that DMS has transformed.
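As an illustration of the kind of information being captured, here is a hypothetical token record sketched in Python; the field names are our own, not DMS's actual data structure:

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    kind: str                  # e.g., "NUMBER", "STRING", "IDENTIFIER"
    value: object              # converted native value
    radix: int | None = None   # 10, 16, ... for numeric literals
    quote: str | None = None   # '"' or "'" for string literals
    file: str = ""
    line: int = 0
    column: int = 0
    comments: list[str] = field(default_factory=list)  # captured comments

# A hex literal "0x1F" from main.c, with an attached comment:
tok = Token("NUMBER", 31, radix=16, file="main.c", line=12, column=8,
            comments=["/* mask */"])
print(tok)
```

Retaining details like the radix and quote style is what allows a transformed program to be printed back looking like what the programmer originally wrote.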
Lexical mode management refers to how the lexer handles switching from lexing with one set of rules to another. Consider what happens if you have C++ with embedded assembly code: the C++ code is likely to be lexed according to one set of rules, and the assembly code according to another. Lexical mode management indicates how control can be passed from one mode to another. Many lexers offer multiple modes, with simple transfers from one mode state to another. Complicated languages such as COBOL and HTML with embedded scripts often require many, many modes with complex interactions and embeddings.
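A minimal sketch of mode management using a mode stack, assuming goto/gosub/pop-style transfers like those DMS offers (the class and method names here are illustrative, not DMS's API):

```python
# Illustrative mode-stack lexer skeleton: the top of the stack names
# the rule set currently used to recognize tokens.
class ModedLexer:
    def __init__(self, start_mode):
        self.modes = [start_mode]

    def goto(self, mode):        # replace the current mode
        self.modes[-1] = mode

    def gosub(self, mode):       # push a nested mode
        self.modes.append(mode)

    def pop(self):               # return to the enclosing mode
        self.modes.pop()

lexer = ModedLexer("CPP")
lexer.gosub("ASM")   # e.g., on seeing `asm {` in C++ source
print(lexer.modes)   # ['CPP', 'ASM']
lexer.pop()          # on seeing the closing `}`
print(lexer.modes)   # ['CPP']
```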
Each lexer must operate with some input character set. Most lexers are restricted to 7-bit ASCII, possibly with a simple extension to ISO-8859-1 because of the ubiquity of 8-bit bytes in European languages. Lexing the 65,536 characters of Unicode's Basic Multilingual Plane is harder, because one cannot simply have tables indexed by an 8-bit code, but in times of globalization it is important to handle everybody's language with aplomb.
After recognizing a string, it is often useful to process it into the target system's native representation (e.g., convert "5.73E-27" into its equivalent IEEE float). Conversion support indicates what the lexer provides to help with this. In many languages, a keyword may be an identifier in certain contexts; DMS allows the lexer to ask the parser if this is possible, and thus correctly classifies the name as keyword or identifier.
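A trivial Python illustration of the conversion concern (DMS performs such conversions inside the lexer; exactness matters, since naive conversion rounds):

```python
from decimal import Decimal

text = "5.73E-27"
print(float(text))    # nearest IEEE double
print(Decimal(text))  # exact decimal value, before any rounding
```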
For complex inputs, one file will often "include" another. Include-file support indicates what the lexer provides for handling such nested input streams.
Multiplicity indicates if the lexing engine can handle multiple lexer instances (of the same lexer, or of different language lexers) in a single program execution.
Performance has to do with the rate at which the lexer processes characters. For very big inputs, this matters. For small inputs, this is irrelevant. For most source code analysis and processing tasks, very fast lexing isn't necessary; most of the time will be spent in the analysis or the transformation of the code.
Finally, each lexer generator emits its lexers in some particular target implementation language.
Lexer Generator | Expressibility | Format Capture | Mode Management | Character Sets | Literal Conversions | Include files | Multiplicity | Error Recovery | Performance | Target Language |
---|---|---|---|---|---|---|---|---|---|---|
DMS | standard, "not" (complement) of regexps, recognized-lexeme extension, arbitrary scan-ahead, Unicode character groups | comments, radix, lead zeros, quote type, keyword spelling, file/line/col | goto, gosub, push, pop | 7 bit ASCII, ISO8859-1 to -16, UTF-8, UTF-16, EBCDIC with 80 column breaks, Windows code pages 1250-1258, Japanese Shift-JIS, custom | character, integer, precise float, rationals, string, escaped string, keyword/identifier fill lexeme | nested | Yes | Auto | Medium; MMAP file read | PARLANSE |
Flex | standard | No | goto | ASCII, ISO8859-1 | none | none | Yes | Auto | Fast | C |
Parser Generators
Parser Generators accept grammars and produce parsers. Key issues are:
- Notation Expressiveness
- Grammars Available
- Class of language parseable
- Semantic Conditions
- Semantic Actions
- Substring/Pattern Parsing
- Tree Building
- Error reporting and recovery
- Performance
- Target language
Notation Expressiveness refers to the type of extended BNF allowed by the parsing engine. Most parser generators allow a left-hand side to be tied to right-hand sides having at least token-sequence alternation, often Kleene star/plus, sometimes operator precedence, possibly in-lined predicates. Many allow inlined actions to enable custom code for processing on phrase recognition (e.g., tree building). The DMS BNF grammar can actually accept all of these, but for most DMS work, DMS uses a remarkably simple form of BNF which meshes extremely well with the attribute-grammar evaluators that DMS can build on top of this syntax. DMS can be used to transform a more complex EBNF into the standard DMS BNF.
Some parser generators are used for small custom languages driving an application. DMS emphasizes arbitrary languages, with a focus on analysis and transformation. Often, one wishes to process application source code in standard programming languages such as C, COBOL, etc. What matters here is having a robust language definition, handling the variety of dialects that these languages really have, and supporting the additional infrastructure required to really "read" these languages, such as include files, macros, preprocessors, class file readers, etc. DMS is distinguished by having a very large list of languages and dialects that have been validated against large volumes of production code, notably C++ with full preprocessing capabilities; uniquely, the DMS C++ preprocessor has the ability to NOT expand preprocessor directives, enabling transformation of the source code as seen by the programmer.
Most parser generators support a limited class of languages. YACC and CUP support LALR(1); JavaCC supports LL(k). The good news about such parser generators is that they have solid formal characterizations of the class of languages they can process. The bad news is that most real languages do not fall into these categories. LL parser generators are often characterized as somewhat easier to work with, but they suffer from the problem that they cannot handle left-recursive BNF rules. Worse, real languages have ambiguous syntax, meaning several parses of some phrases are possible; since many of these parsing technologies commit to the accidental "first available parse", they typically cannot choose the one necessary parse unless backtracking is built in (which it often is not).
With such parser generators, the grammar author must twist his grammar, his lexer, his brain, and code odd custom semantic conditions and actions to get a real working grammar. One consequence is that "grammar" definitions for individual parser generators are almost always difficult to transfer to another parser generator. DMS parses arbitrary context-free grammars, by virtue of using GLR technology. This means that DMS grammars are shaped by author convenience, not by the parsing technology, and so tend to be considerably closer to what is found in the standards documents describing the languages. One example: DMS is able to parse C++ with essentially the ISO grammar rules; this is widely considered to be impossible with LALR or LL technology. It also means that DMS is extremely agile, able to accept new language definitions fast. The proof is in the wide variety of languages that DMS accepts.
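To see why committing to the "first available parse" loses, consider the ambiguous grammar E → E '-' E | 'a': the input a - a - a has two parses. A GLR parser keeps all of them (compactly, as a shared forest) rather than committing early. This Python sketch simply enumerates them:

```python
# Enumerate all parses of the ambiguous grammar  E -> E '-' E | 'a'.
# A GLR parser would represent these compactly as a shared forest.
def parses(tokens):
    if tokens == ["a"]:
        return [("a",)]
    out = []
    for i, t in enumerate(tokens):
        if t == "-":
            for left in parses(tokens[:i]):
                for right in parses(tokens[i + 1:]):
                    out.append(("-", left, right))
    return out

for p in parses(["a", "-", "a", "-", "a"]):
    print(p)   # the left-nested and right-nested alternatives
```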
Often, semantic conditions are used to eliminate potential phrase parses, to avoid ambiguity. Most conventional parser generators depend heavily on these, especially top-down tools such as JavaCC and ANTLR. DMS can simply accept all the ambiguous parses and remove the inappropriate ones later, after type information is available. In those cases where it is easy to acquire information that can eliminate incorrect parses (such as matching loop-end to loop-head by line number in FORTRAN), DMS can use semantic predicates to eliminate inappropriate parses during parsing.
Semantic actions are often attached to grammars to carry out the key work (trivial code generation, command execution, or tree building). Virtually all parser generators provide a way to write such semantic actions, which of course means you must write them. Since the goal of DMS parsing is only to produce trees, DMS doesn't have semantic actions, so you don't have to write them. Less code means time saved and bugs not chased.
Most parsers are useful only for parsing complete source documents. For analysis and reengineering tasks, it is extremely useful to parse "patterns", which are code fragments with placeholders. DMS automatically constructs a pattern parser from the main parser. Once you've coded the main parser, you are ready to process code fragments. None of the conventional parser generators do this.
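The payoff is the ability to match fragments with placeholders against parse trees. The following Python sketch is illustrative only: tuples stand in for parse trees, and "\e" echoes the backslash-escaped placeholders used in DMS patterns:

```python
# Match a pattern tree containing placeholders (strings starting with
# a backslash) against a subject tree, collecting placeholder bindings.
def match(pattern, tree, bindings):
    if isinstance(pattern, str) and pattern.startswith("\\"):
        bindings[pattern[1:]] = tree      # placeholder binds the subtree
        return True
    if isinstance(pattern, tuple) and isinstance(tree, tuple):
        return (pattern[0] == tree[0] and len(pattern) == len(tree)
                and all(match(p, t, bindings)
                        for p, t in zip(pattern[1:], tree[1:])))
    return pattern == tree

# Pattern for the fragment "\e + 0", parsed with the same grammar as code:
pattern = ("add", "\\e", ("int", 0))
tree = ("add", ("var", "x"), ("int", 0))
bindings = {}
print(match(pattern, tree, bindings), bindings)
# -> True {'e': ('var', 'x')}
```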
Most parser generators don't construct trees, although they provide support for constructing trees by library calls from semantic actions. DMS simply automatically constructs trees. The trees are abstract in the sense that terminals and useless unary productions can be eliminated, but are otherwise isomorphic to the grammar, which is in turn a clean implementation of the language standard, so the DMS engineer "knows" what his trees look like by referencing the standards documents. This is very useful when you have languages like C++ with hundreds of grammar rules. In constructing the trees, DMS automatically attaches format information from lexemes to the value-carrying leaves of the tree, and automatically attaches comments collected by the lexer to the appropriate place in the tree. Finally, each tree node is stamped with the domain (language) that produced it (this allows multiple languages in a single image, and indeed, in a single tree), the tree node type, and complete information about the source file/line/column that produced it. All this information enables reconstruction of the source code, and/or reporting accurately where analysis finds a problem or where transforms are made, as appropriate.
When processing lots of source, one invariably encounters malformed source code. Most parser generators provide various ad hoc means for recovering from such errors; YACC inserts an "error" token and the grammar engineer must write productions to swallow it. DMS provides automatic error recovery: it attempts to find a token it can insert that will let the parse proceed, or, failing that, deletes the offending token and moves on. So the grammar engineer need do nothing at all to get useful error recovery.
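The insert-or-delete idea can be sketched as follows (a minimal illustration, not DMS's actual algorithm):

```python
# Try inserting each candidate token at the failure point; if nothing
# lets the parse proceed, delete the offending token instead.
def recover(tokens, i, parses_ok, candidates):
    for t in candidates:
        fixed = tokens[:i] + [t] + tokens[i:]
        if parses_ok(fixed):
            return fixed                    # insertion repaired the parse
    return tokens[:i] + tokens[i + 1:]      # fall back to deletion

# Toy "parser": statements must alternate with ';' separators.
def parses_ok(toks):
    return (all(t == ";" for t in toks[1::2])
            and ";" not in toks[::2])

toks = ["stmt", "stmt", ";", "stmt"]        # error detected at index 1
print(recover(toks, 1, parses_ok, [";"]))
# -> ['stmt', ';', 'stmt', ';', 'stmt']
```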
Performance is always an issue when one tries to read a large amount of source (a 2000 line C++ compilation unit may implicitly read 250,000 lines of include file support!). Many of the conventional parser generators produce very fast parsers (although the typical bottleneck is lexing tokens and eliminating white space!). The DMS parsers are only moderately fast, because of the GLR property, automated tree building, and automated format/comment capture. This is fine for reengineering tasks; you don't actually process a million lines of code with DMS until you are ready, and then it doesn't matter if it takes a few minutes.
Parser generators typically produce parsers in some favored target language. YACC and Bison produce C-style code. JavaCC produces only Java. ANTLR produces C++ and Java. DMS parsers are really designed for the language underlying DMS, PARLANSE; however, DMS transformations can easily be applied to DMS grammars to produce recursive descent parsers in any of the languages for which DMS has a front end.
Parser Generator | EBNF Notation | Grammars Available | Language Class | Semantic Conditions | Semantic Actions | Substring/ Pattern Parsing | Tree Building | Error Recovery | Performance | Pretty Printer/ Fidelity Printer | Target Language |
---|---|---|---|---|---|---|---|---|---|---|---|
DMS | Simple BNF; Token Precedence; Auto-List Formation | Production Quality preprocessor/C/C++0x, C#,Java 1.5, Fortran 77/95, Pascal, Ada, COBOL, Natural, JCL, JavaScript, PHP4/5, VB.net,VBScript,VB6, Python 2/3, HTML, XML, SQL 2011,PLSQL, Verilog, VHDL, SystemC | Full Context Free via GLR | PARLANSE; Check Reduced-To Tree or Left Context | None; (Implicit Tree Building) | Yes/Yes | Auto Build; Auto Lists; Unary elimination; Terminal Elimination; Ambiguity Handling with Shared Subtrees | Auto Token Insert or Delete | Medium (includes tree building); pretty fast DMS lexer | Yes/Yes | PARLANSE |
YACC | Standard | C; but not C++ Many others scattered across web | LALR(1)+ | No | C code | No/No | Hand-coded | Skip to Statement End?; Optional Error Token | Fast, usually because of fast lexer | No | C |
BTYACC (built on YACC) | Standard | All available for YACC Not C++ | Superset of LALR(1); Choose first of ambiguous alternatives | Yes? | C code | No/No | Hand-coded | Skip to Statement End; Optional Error Token | Little slower than YACC | No | C
Bison | Standard | C (via GNU) Older C++ (via old GCC) Many others | LALR(1)+; GLR option | No | C code | No/No | Hand-coded | Skip to Statement End; Optional Error Token | Fast (usually because of fast lexer) | No | C
ANTLR | Standard | Java C++ (draft) Many others but not production | LL(*) | Java | Java | No/No | Hand-coded | Ad hoc | Fast, mostly due to integrated lexer | No | Java |
JavaCC | Standard | Java Not C++ Many others varying quality | LL(k)? | Java | Java | No/No | Hand-coded | Ad hoc | Medium (JIT-compiled code, even for lexer) | No | Java
XT Scannerless GLR | Standard | Java Not C++ Academic quality | Context free | C | C (usually tree building) | No/Yes via ASF+SDF | Hand-coded | Unknown | Slow; parses using characters as tokens | Yes, via ASF+SDF | C |
Elkhound | Standard | ANSI? C++; C via C++ subset Academic quality | Context free | C++ | C++ (usually tree building) | No/No | Hand-coded | Unknown | Fast (usually because of fast lexer) | No | C++ |
Program Transformation Systems
Transformation systems are intended to process source code to produce improved code in the same or different languages. Often they support some kind of analysis (to find and validate places that need improvement), and so they are often also used simply as analysis tools. Issues include:
- What kind of language, graphical or textual, does the tool process?
- What language can the transformation system process?
- What is the implementation language for the transformation system, and why was it chosen?
- How are transforms encoded?
- How are analyses encoded?
- How are transforms and analyses sequenced to achieve the desired result? ("Metaprogramming")
- How scalable is the tool (along various axes of scalability)?
- Has the tool been used for production purposes?
Program transformations come in two forms: procedural (arbitrary functions applied to compiler data structures) and source-to-source (maps between concrete surface-syntax forms of code). Procedural transformations have enjoyed enormous success for some 50 years in classical compilers [AhoUllman], for translating source code to binary and for optimizing code generators. Traditionally, transformations have been considered interesting only if they were correctness preserving. Compilers also traditionally construct and cache crucial auxiliary information such as symbol tables and control- and data-flow analysis results to support complex context-dependent transformations that preserve correctness.
More recently, procedural transforms have been proposed to support mechanical refactoring [12] as a technique for restructuring programs. Manual versions of this have been popularized [7] as a methodology with suggestions for tool support. Tools for refactoring Smalltalk [18] and Java have started to appear. The better Java tools [23,24] do some sophisticated refactorings such as ExtractMethod; others in the market require some manual validation of the steps. Some experimental work has been done in refactoring C++ [21], and some commercial C++ refactoring tools are available [XRefactory]. All of the above tools can only apply predefined sets of transforms. OpenC++ [ ] uses a custom recursive-descent C++ parser front end that builds ASTs with customizable procedural transforms to enable some simple transforms on C++ syntax. The results of these tools must be manually (re)validated, because most of them do not construct the auxiliary information needed to ensure that the transforms are truly correctness preserving.
Many tools have been constructed that analyze source code by parsing to compiler data structures and applying procedurally encoded analyses. More recently, sophisticated reengineering tools such as the ATX Reengineering environment [11], based on parsing to compiler data structures and applying predefined procedural transforms, have started to appear.
Source-to-source program transformations were originally conceived as a method of program generation in the 1970s [2] and led to the notion of domain analysis to support transformational domain-specific program generation in the seminal Draco system [12], which accepted language ("domain") definitions and source-to-source transformations for each language, using recursive-descent parsing. The technology has continued to develop, moving into scientific computing [10], and is used commercially for sophisticated financial instruments [Kant Synapse 2005].
The idea that program transformations could be used for software maintenance and evolution by changing a specification and re-synthesizing was suggested in the early 80s [Balzer85 need paper ref]. Porting software and carrying out changes using transformations were suggested and demonstrated in the late 80s [1]. Simple source-to-source "evolution" transforms used to change specification code were suggested by Feather [9]. Evolution transforms led to the key idea of using correctness-preserving transforms to change nonfunctional properties of software, and non-correctness-preserving transforms to change functional properties of a software system. Theory about how to modify programs incrementally using "maintenance delta" transformations and previously captured design information was suggested in 1990 [3][4]. This work defined an array of transformation support technology, which inspired the DMS Software Reengineering Toolkit.
Source-to-source transformation as a serious tool for software evolution is largely unrealized in practice, although there have been some very interesting applications. Following the Draco lead, Refine [6,17] was a ground-breaking software transformation engine originally intended for program generation from high-level specifications, but it found modest success when used for some black-box commercial automated migration projects on legacy code [Newcomb85]. Refine tended to be limited by its LALR parser generator, which does not handle most legacy languages very well.
TXL [Cordy early, CordyLCT04?] was used to find and repair Y2K flaws in reportedly 4 billion lines of legacy code, much of it COBOL, and is still available [ref]. TXL uses a backtracking parser, which makes it somewhat more robust for parsing legacy languages, but it does not handle ambiguities well. The TXL transformation philosophy gets around this partly by partitioning massive transformations into separate stages, each operating on an abstracted, and therefore less complex, grammar specific to that stage.
The ASF+SDF system [??] enables the specification of arbitrary languages, using GLR parsing technology, which can handle arbitrary context-free grammars. ASF+SDF parsers work at the character level, allowing convenient composition of arbitrary sets of grammar rules. Transformations are specified not in terms of concrete surface syntax, but rather in terms of abstract syntax. Stratego [Visser] augments ASF+SDF with a domain-specific language for specifying complex traversals of the abstract syntax tree and sequencing of transforms. Both ASF+SDF and Stratego have been used as porting and mass source-code change tools in research projects, but apparently not commercially.
Two key stumbling blocks for transforming C++ are the problem of parsing it and the even more difficult engineering task of constructing an accurate symbol table, both because of the complexity of the language semantics and because of the differences in the C++ dialects in use. CodeBoost [ ] combines the Edison Design Group (EDG) commercial front-end C++ parser with Stratego to research library-call optimizing transforms applied to C++ code; we believe its transforms are specified in terms of abstract syntax. CodeBoost does not appear to have been used to produce production code. Rose [ Quinlan ] combines the EDG front end with recursive calls on that front end to enable writing C++ source patterns used by C++ procedural transforms, to help optimize scientific codes written in C++. Besides being a production C++ parser, the EDG front end provides a symbol table, crucial to implementing sophisticated context-sensitive transforms. AspectC++, built on top of Puma, combines a hand-constructed C++ parser and symbol table adequate for research tasks, to enable customized procedural transforms that implement AspectJ-like transforms on C++ code. AspectC++ is not used for general C++ transformation.
The only available production-quality C++ parsers known to us are the GNU hand-coded parser, the EDG parser, and the one for DMS; all three are successful, in our opinion, not only because of the ability to parse, but especially because of the invested effort to engineer production-quality symbol tables. While GNU is very good as a compiler, it is not designed as a general program transformation system. EDG's front end is just that: it builds an AST and a symbol table, but provides nothing beyond that point to support transformation. ASF+SDF's parsing machinery, like DMS's, is GLR and can arguably be used to parse C++ directly, but that does not produce a symbol table. Of all of these systems, only DMS can accept true surface-syntax transforms and apply them. Rose comes close, but works only with C++, whereas DMS's source-to-source engine is independent of the particular grammar yet can still use that grammar's surface syntax. The DMS parsers, being reengineering parsers, capture macros, preprocessor conditionals, include files, comments, and lexical formatting information in a regenerable way, which none of the other tools accomplish.
One atypical aspect of the DMS Toolkit is the explicit encouragement to mix both source to source and procedural transformation. While this might sound awkward, combining the succinctness of source-to-source transforms with procedural control and the more direct capability of procedural transforms to manipulate complex structures is remarkably powerful.
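As an illustration of the mix (all names here are our own, not DMS's API), the following Python sketch pairs a declarative rewrite, the rule \e + 0 → \e expressed on trees, with a procedural driver that decides where and how often to apply it:

```python
# Declarative piece: a single rewrite rule on tree shapes ("\e + 0 -> \e").
def rewrite_add_zero(tree):
    if isinstance(tree, tuple) and tree[0] == "add" and tree[2] == ("int", 0):
        return tree[1]
    return tree

# Procedural piece: a metaprogram choosing the traversal order and extent.
def apply_bottom_up(rule, tree):
    if isinstance(tree, tuple):
        tree = tuple(apply_bottom_up(rule, c) if isinstance(c, tuple) else c
                     for c in tree)
    return rule(tree)

expr = ("add", ("add", ("var", "x"), ("int", 0)), ("int", 0))
print(apply_bottom_up(rewrite_add_zero, expr))   # -> ('var', 'x')
```

The rewrite stays succinct while the driver handles the navigation, which is exactly the combination the paragraph above describes.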
We believe the Boeing Migration tool (REF) would not have been practical without the robust DMS C++ front end, the reengineering parsing machinery, and the combined power of the mixed rewrites.
Transformation System | Status | Grammars Available | Transformation Style | Analyzers | Scale | Performance | Implementation Language |
---|---|---|---|---|---|---|---|
DMS | Production; Commercial; Active enhancement | Production preprocessor/C/C++0x, C#, Java1.5, Fortran 77/95, Pascal, Ada, COBOL, Natural, JCL, JavaScript, PHP4/5, VB.net,VBScript,VB6, Python 2/3, HTML, XML, SQL 2011,PLSQL, Matlab M, Simulink, Verilog, VHDL, SystemC; Extensible to others | Source-to-Source and/or Procedural; Procedural Metaprograms | Parallel Attribute Grammars; Control Flow; Data Flow; Local Points-to; Global Points-to; Call Graph; Symbolic Range; Custom PARLANSE Code | Multilingual; Tens of thousands of files; Parallel implementation | Excellent; Uses SMP | PARLANSE (parallel) |
TXL | Very stable; Academic; Used commercially in past | Java but not C++; Others as partial "island" grammars; Extensible to others | Source-to-Source; Functional style | Implement using transforms | Big single files; Multilingual via union grammars | Good | Turing |
ASF+SDF | Stable; Academic | Java but not C++ Many university quality; Extensible to others | Abstract Syntax to Abstract Syntax | Implement via transforms; Escape to C | Unknown | Excellent for rewrites? Modest for parsing due to Scannerless GLR | C |
Stratego | Stable; Academic; Active enhancement; Built on ASF+SDF | Java but not C++ Many university quality; Extensible to others | Abstract Syntax to Abstract Syntax; Dynamic Rules; Metaprogramming via explicit Strategies | Implement via transforms; Escape to C | Unknown | Excellent for rewriting; Modest for parsing due to Scannerless GLR | C
CodeBoost | Stable; Academic; Built on OpenC++ and Stratego | C++ via OpenC++ | Stratego style | Implement via transforms; Escape to C? | Unknown | Unknown; Stratego transforms likely excellent | C |
OpenC++ | Stable; Open Source | Most of ANSI C++ No preprocessor; Limited template handling | Procedural analysis/transformation in C++ via Visitor pattern | Ad hoc coded in C++ | Single source file plus definitions from includes? | Unknown | C++ |
ROSE | Stable?; Research; Active enhancement | C/C++ via EDG front end; Fortran90?; Not extensible | Source-patterns for matching and instantiating; Procedural transforms in C++ | Procedural in C++ | Large numbers of files | Unknown | C++ |
Fermat | Stable; Research; Applied commercially by SMLtd | C, COBOL with university quality; Not easily extended because of requirement to map to WSL | Encodes programs in WSL language; WSL source-patterns for matching and instantiating; Procedural transforms in MetaWSL | Procedural in MetaWSL | Tens of thousands of files (assembler) | Several thousand lines per minute | MetaWSL, PERL
Tom | Stable; Active enhancement | No conventional languages; Handles small DSLs well | Tom patterns/rewrite rules embedded in host language Java | Tom rewrite rules; Escape to Java? | Unknown | Fast | OCAML? |
CodeWorker | Stable; One man show; Active enhancement | Java? but not C++ Several partial grammars; Arguably extensible to others | Procedural in "CodeWorker" language | Procedural in "CodeWorker" | Unknown | Unknown | Java? |
XML/XSLT | Production; Active enhancement | JavaML; GCCXML (incomplete); Extensible to others by using some parser generator with ad hoc XML export | XSLT functional transformation | Weak analysis via functional transformation | Poor because of XML document sizes | Poor | C/C++/Java |
Rascal/MP | Academic; Experimental; Active enhancement | Several DSLs; Java?; Extensible to others by using Rascal GLL parser generator | Mixed source-to-Source with arbitrary Rascal procedural code | Custom coded in Rascal relational calculus | Poor (see performance) | Interpreted; reported "very slow" by implementers | Java? |
C++ Front Ends
C++ parsers ("front ends") are intended to read C++ source code and build compiler data structures representing that code, with the expectation that a custom tool will be build to process the compiler data structures, for analysis or C++ transformation.
Issues for C++ front ends are:
- What dialects of C/C++ are handled?
- Is there an integrated preprocessor?
- Does the front end capture comments and lexical format information?
- What compiler data structures are built? (AST, Symbol tables, ...?)
- How can those structures be processed?
- What kind of support for analysis is available?
- Is the tool designed to process just one file, or entire systems of files?
- Can the front end operate with other languages (SQL, ...?)
C++ Front Ends | Status | C/C++ Dialects Available | Post-Parsing Capability | Scale | Performance | Implementation Language |
---|---|---|---|---|---|---|
DMS | Production; Commercial; Active enhancement; Integrated Preprocessor with "no expand" option | Production; C & C++; C++0x; ANSI,GCC, MSVC6, MS Studio 2005, SystemC; Many other languages | Procedural tree processing; Name/Type resolution with or without preprocessing; Source-to-Source Transformation; Attribute Evaluation; Access to DMS control, data, points-to analyzers; Regeneration of source with comments and preprocessor directives | Tens of thousands of source files; Multilingual processing | Parallel implementation | PARLANSE (parallel) |
EDG | Production; Commercial; Active enhancement; Integrated Preprocessor | Production; Most C/C++ dialects | Procedural tree processing; Name/Type Resolution after preprocessing | Single source file, includes processed but lost? | Fast, hand tuned | C |
GCC | Production; Open Source; Active Enhancement; Integrated Preprocessor | Production; GCC, ANSI | Arbitrary procedural analysis and transformation; but tangled up in GCC implementation | One source file plus includes | Fast, hand tuned | GCC C |
GCCXML (built on GCC) | Stable?; Academic; Integrated Preprocessor | Production; GCC, ANSI | XML export of declarations; No information about function bodies | Declarations from single source file and its includes | Poor; XML documents are huge | C
Elsa (built on Elkhound) | Stable?; No preprocessor; Academic | C as C++ subset; ANSI? C++; Site claims some limits on template support | Name/Type resolution after preprocessing; Procedural tree processing in C++ | Single file plus includes | Unknown | C++ |
Columbus | Stable?; No preprocessor; Academic/Commercial? | ANSI C?/ANSI C++; Limited template support? | Procedural tree processing | Single file plus includes | Unknown | C++? |
OpenC++ | Stable; No preprocessor; Open Source | Most of ANSI C++; Limited template handling | Procedural analysis/transformation in C++ via Visitor pattern; Regeneration of preprocessed source | Single source file plus definitions from includes? | Unknown | C++
ANTLR | ANSI; No preprocessor; Stale C++ definition; Not used anywhere in practice | C++ (draft); Some name/type resolution while parsing | Procedural tree processing | Single preprocessed file? | Moderate? | Java
CLang/LLVM | Stable; Open Source; Active enhancement; Integrated preprocessor | ANSI C++; GCC dialect? | Name/type resolution while parsing; Procedural tree processing; Regeneration of source (w/ comments? preprocessor?); Native code generation via LLVM; Varying flow analyses via LLVM | Single preprocessed file? | Fast, hand tuned | C++
Compiler Infrastructures
Compiler infrastructures are intended to make it easier to construct compilers or compiler-like tools such as source code instrumenters.
Issues for compiler infrastructures are:
- What languages can the infrastructure process?
- What compiler data structures are built? (AST, Symbol tables, ...?)
- Does the front end capture comments and lexical format information?
- How can those structures be processed?
- What kind of support for analysis is available?
- How hard is it to customize the infrastructure for the new task?
- Is the tool designed to process just one file, or entire systems of files?
- Can the front end operate with other languages (SQL, ...?)
Compiler Infrastructures | Status | Languages Available | Post-Parsing Customization Capability | Scale | Performance | Implementation Language |
---|---|---|---|---|---|---|
DMS | Production; Commercial; Active Enhancement | Production preprocessor/C/C++0x, C#, Java1.5, Fortran 77/95, Pascal, Ada, COBOL, Natural, JCL, JavaScript, PHP4/5, VB.net,VBScript,VB6, Python 2/3, HTML, XML, SQL 2011,PLSQL, Matlab M, Simulink, Verilog, VHDL, SystemC; Extensible to others (e.g. DSP assembly); Easily extended to include DSLs for graphs, code generation, etc. | Procedural tree processing; Name/Type resolution with or without preprocessing; Source-to-Source Transformation; Attribute Evaluation; Regeneration of source with comments and preprocessor directives; Control, dataflow, and global points-to analyzers; Used to build a wide variety of source-analysis or instrumentation tools, including code generators and vectorizers. | Tens of thousands of source files; Multilingual processing | Parallel implementation | PARLANSE (parallel) |
GCC | Production; Open Source; Active Enhancement | GCC; ANSI | Arbitrary procedural analysis and transformation; Code-generation focused flow analysis; Tangled in "wants to be a compiler for every possible machine" | One source file plus includes | Fast, hand tuned | GCC C |
LCC | Stable; Academic | ANSI C | Arbitrary procedural analysis and transformation; Tangled in "wants to be a compiler" | One source file plus includes | Fast, hand-tuned | C
SUIF | Stable; Academic; No preprocessor | C and FORTRAN 90? | Procedural tree processing; Name/Type Resolution after preprocessing; Data flow analyses; Regeneration of preprocessed source? | Multiple files? | Unknown | C++?
Open64 | Stable; Open Source | ANSI C, C++?; FORTRAN 90? | Arbitrary procedural analysis and transformation; Data flow analysis; Regeneration of preprocessed source?; Tangled in "wants to be an SGI compiler" | Single source file plus definitions from includes? | Likely fast, used as production compiler? | C++
CLang/LLVM | Stable; Open Source; Active enhancement; Integrated preprocessor | ANSI C++; GCC dialect? | Name/type resolution while parsing; Procedural tree processing; Regeneration of source (w/ comments? preprocessor?); Native x86 code generation via LLVM; Varying flow analyses via LLVM; Custom analyzers in C++ | Single preprocessed file? | Fast, hand tuned | C++
Note: This page is under construction.
ToDo: References. Links. CodeGenerators.