Life After Parsing™:
Got My Grammar... uh, now what?
It is astonishing how often people think that the key to building a tool to process a computer (or domain-specific) language is to get a parser.
It's true... in the same sense that playing poker is all about putting the ante into the pot. If you believe this, the other poker players are going to eat you alive. Yes, you have to ante. No, that's not where the game is played.
Only the most trivial computer languages are easy to analyze and transform. The compiler community has been working on this problem for 50 years, and one ignores what they have learned at one's peril. They tell us the following technologies are needed to process computer languages realistically:
- Lexing and Parsing, ... of course. But there are some really nasty problems here, including preprocessors.
- Representation Capture: A way to represent and capture program structure
- Symbol Tables: A mapping between identifiers used in the program, and their meaning (and scope).
- Inference Methods: A way to understand the meaning and consequences of the written code:
- Simple Fact Collection: Inferences that are easy to extract from the program structure.
- Control Flow Analysis: Data about the order in which activities occur in a program.
- Data Flow Analysis: Data about how information flows from one part of the program to another.
- Symbolic Reasoning: Method to compute complex consequences of semantics and various information flows.
- Pattern Matching and Transformation Capability: Means to find program points that look like places of interest, as well as means to convert one program representation into another, to support optimization or translation to a different representation.
- Pretty-Printing: If the representation is modified, one needs to be able to regenerate valid source code from it.
- Multi-module and multi-lingual Integration: Inclusion of other programs in the same or another notation to allow one to process systems of code.
Parsing and Program Analysis using only Regular Expressions is hopeless
Many people start by trying to use regular expressions to parse source code. Regular expressions are useful for scanning many kinds of strings with simple structure, but they cannot handle the nested structures typical of programs. This Stack Overflow answer hints at how badly regexes fail at parsing. They are simply not an answer for parsing.
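To see why, here is a small illustrative sketch (ours, not taken from any particular tool): a regular expression can only describe nesting up to some fixed depth, while even a trivial stateful scan handles arbitrary nesting.

```python
import re

# A regex can only describe nesting up to a fixed depth; each added level
# needs a longer pattern. This one handles at most two levels of parentheses.
two_levels = re.compile(r'^[^()]*(\([^()]*(\([^()]*\)[^()]*)*\)[^()]*)*$')

def balanced(text: str) -> bool:
    """Arbitrary-depth check: needs a counter (state), which a pure regex lacks."""
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

print(bool(two_levels.match("f(g(x))")))      # True: two levels, the regex copes
print(bool(two_levels.match("f(g(h(x)))")))   # False: three levels defeat it
print(balanced("f(g(h(x)))"))                 # True: the counter scales to any depth
```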
Parsing is hard ... but just a distraction
Tools like YACC, JavaCC, ANTLR, and their variants are all about parsing and a little bit about AST building. It is a big effort to choose the "right" (lexer and) parser generator, find or create a grammar, wrestle it into a form acceptable to the parser generator (lookaheads? ambiguities? context sensitivities?), and decide how to build a tree. Then there's the often surprisingly big struggle to make the parser match what the compilers really accept, as opposed to what everybody "knows" to be true without further thought (Hey, C source code doesn't have any complications, it's a simple language, right? Can't be that hard... oops, what's a trigraph?). Putting together a working parser is enough work that people tend to focus on just getting that right, paying zero attention to what is needed afterward. And that makes them lose sight of the full problem.
But success in parsing and tree building is such a small part of the problem that such a victory means very little. Climbing 11,000 feet to the foothills of Everest requires only basic technology: some camping gear and good hiking boots. Any determined clod can get this equipment and climb the foothills. To climb to the peak, one clearly needs a completely different set of very sophisticated technology (oxygen tanks are pretty different from hiking boots) and a much different climbing process. Those who attempt the remaining climb with just the basic equipment simply die. Those who have only a parser never really get a useful tool.
This insight led us to build the DMS Software Reengineering Toolkit, to provide as much as possible of that advanced technology. Semantic Designs has spent over 17 years building and enhancing that technology, and proving it on real languages such as COBOL, Java and C++, so that would-be tool builders don't have to reinvent the infrastructure, and can get on with building the tool they really wanted. (We don't expect the enhancement process to ever really stop or even slow down much). Along the way, we found that even the parsing process could be significantly enhanced.
Climbing Mt. Everest for Tool Builders
Let us consider these additional mechanisms further. First, what drives the need for them? Our programming languages, like our computers, used to be smaller and simpler: FORTRAN66, COBOL, RPG, BASIC, and K&R C ran on small machines and were all written in ASCII text in a small number of files. Lots of complications have occurred since then:
- Big, complex languages: C99, Enterprise COBOL, Ada2005, Java7, C#4, C++11, Fortran2005, HTML5, each in a variety of dialects... even C isn't small (check out GCC and Microsoft's variants), and Perl is an amazing mess. Language committees can't resist adding new goodies; there tends to be a "me too" race (Java and C# seem to co-evolve furiously). And companies providing compilers have a vested interest in adding features to create customer lock-in.
- Values in a wide variety of forms and precision: IEEE Floating point, decimal numbers, infinite precision rationals, ...
- Preprocessors, with include files, macros, and the ability to conditionally include/not include program fragments
- Locale-specific character sets, Unicode in several flavors, EBCDIC
- Embeddings of language fragments into other languages (JavaScript and PHP in HTML, SQL in COBOL, ...)
- Generics, reflection, and metaprogramming, which are used ever more frequently to construct software systems
Let's examine the necessary tool-supporting mechanisms in a bit more detail. What are they, and why are they needed? (This discussion parallels the opening list):
- Lexing and Parsing (still necessary!)
- Lexical Capture: It is classic to perform lexical analysis of the source code, i.e., to break up the stream of input characters into tokens that represent the conceptual elements of a program: identifiers, numbers, keywords, comments, strings, even whitespace. That idea still generally holds. It is also classic to build a hand-coded lexer, or use Flex to accomplish this, and process the ASCII characters in the files that made up all the languages we knew and loved, and that in fact still dominate Western programming. But for highly evolved legacy languages, and modern languages, you need to process possibly many character representations (Katakana in comments in C, anyone?) including Unicode, convert numbers to floating point without losing precision, capture numbers with possibly arbitrarily large precision, handle keywords that might be identifiers (PL/1 is famous: every keyword can be an identifier!), etc.
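As a concrete (and deliberately tiny) illustration of two of these issues, the following Python sketch of a toy lexer keeps comments as tokens rather than discarding them, and defers the keyword-versus-identifier decision to the parser. The toy language, token classes, and KEYWORDS set are hypothetical; this is nothing like a production lexer.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    kind: str
    text: str
    line: int
    col: int

KEYWORDS = {"if", "then", "else"}   # hypothetical toy language

TOKEN_RE = re.compile(r"""
    (?P<comment>\#[^\n]*)          # keep comments: transformations must re-emit them
  | (?P<number>\d+)
  | (?P<word>[A-Za-z_]\w*)
  | (?P<op>[=+\-*/();])
  | (?P<newline>\n)
  | (?P<space>[ \t]+)
""", re.VERBOSE)

def lex(src: str):
    line, col = 1, 1
    for m in TOKEN_RE.finditer(src):
        kind, text = m.lastgroup, m.group()
        if kind == "word":
            # Don't hard-classify: a word may be a keyword *or* an identifier
            # (the PL/1 problem); record the ambiguity and let the parser decide.
            kind = "keyword_or_ident" if text in KEYWORDS else "ident"
        if kind not in ("space", "newline"):
            yield Token(kind, text, line, col)
        if kind == "newline":
            line, col = line + 1, 1
        else:
            col += len(text)

for tok in lex("if if = 1 then x = if  # 'if' used as a variable\n"):
    print(tok)
```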
DMS provides libraries and tools to handle all of these issues. The DMS lexer handles full Unicode, has support for opening nested streams (to handle include files), etc., and can easily build multi-mode lexers to handle the different lexing needs in the possibly different parts of the language. DMS's lexers also capture comments, which are extremely useful in program analysis and fundamental for program transformation; users will reject perfectly transformed code that has lost their comments. Similarly, DMS will capture lexical details, such as the radix of numbers and string quote style; programmers reject modified code in which these are lost or changed, too. We don't know of other production lexer generators that have any or even most of these properties out of the box. (If your foundations aren't right, how are you going to construct a large building?)
- Parsing: It is surprising that the choice of parser is often made without regard to the language being parsed; often it is whatever parser generator is convenient to download (e.g., Bison, ANTLR, ...). If your language requires a lot of lookahead, an LL(1) parser is going to be a very painful choice; if your language has ambiguities, a parser generator that can only produce one parse will be trouble (consider parsing C++). But when parsing real languages, it is hard to know what issues the parser faces until you are done building the parser, because you don't know what the grammar will be until it can basically parse everything. And the manuals lie about what is legal. What this suggests is that you should simply use a very strong parsing technology, to avoid as much of this pain as you can.
DMS uses a GLR parser, which means it handles arbitrary lookahead and ambiguous grammars with aplomb. Arbitrary semantic reduction conditions allow DMS parsers to eliminate many ambiguous choices while parsing, or the ambiguities can be retained and eliminated later using semantic information. You write a context-free grammar; DMS's GLR parser can parse it. So, in general, parsing is hard; with DMS, parsing is a lot easier.
DMS parsers can be configured to collect preprocessor definitions and conditionals without expanding them. This makes it possible to reason about the code as seen by the programmer, with all the preprocessor directives in place. No other tool known to us can do this.
- Representation Capture: Real tools require more than just "parsing"; the structure of the program must be captured to enable further processing. Usually an Abstract Syntax Tree (AST) is needed. Typical parser generators force the grammar engineer to specify not only the grammar, but also to explicitly specify how to procedurally build the AST. This essentially (more than) doubles the work, and it makes maintaining the grammar hard because any grammar change requires corresponding changes to the AST-building code. In contrast, DMS automatically builds a syntax tree, either a concrete tree ("CST", mainly for debugging, containing all the language tokens), or what amounts to an AST (for production, in which non-value-carrying terminals are eliminated, useless unary productions are removed, and lists are formed). Changes to the grammar automatically cause the CST/AST to change correspondingly, and the tree is always correct with respect to the grammar. It includes precise source position information (file/line/column) as well as comments passed to it by the lexer, attached to the appropriate tree nodes, and it can include preprocessor directives.
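To make "representation capture" concrete, here is a hypothetical Python sketch of the kind of information a tree node needs to carry. The field names are ours, for illustration only, and do not reflect DMS's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SourcePosition:
    filename: str
    line: int
    column: int

@dataclass
class AstNode:
    kind: str                        # grammar nonterminal or token class
    children: List["AstNode"] = field(default_factory=list)
    value: Optional[str] = None      # literal text for value-carrying leaves
    position: Optional[SourcePosition] = None
    comments: List[str] = field(default_factory=list)   # kept for regeneration

# "x = a + b   # update x" might come back from a parser roughly as:
tree = AstNode("assign",
    children=[
        AstNode("ident", value="x", position=SourcePosition("demo.src", 1, 1)),
        AstNode("add", children=[
            AstNode("ident", value="a", position=SourcePosition("demo.src", 1, 5)),
            AstNode("ident", value="b", position=SourcePosition("demo.src", 1, 9)),
        ]),
    ],
    position=SourcePosition("demo.src", 1, 1),
    comments=["# update x"],
)
```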
- Symbol Tables: A mapping between identifiers used in the program, and their meaning (and scope). In most programming languages it is only the grammar that is context-free; the meaning of identifiers is usually context ("scope") and semantics dependent. Generally, to analyze, reason about, or transform programs, one must know what each identifier instance represents, and where and how that identifier is defined. This is traditionally done with a symbol table, which maps code regions to sets of identifiers and their definitions. For real languages a symbol table is often relatively complex, because different code regions implicitly designate different sets of identifiers (e.g., local scopes, lexical scopes, namespaces, parameter names, object slots, etc.). So the symbol table must model the relationship of the various "scopes" to enable symbols to be correctly found. Such a symbol table has to be constructed from the program representation; the grammar gives no clue about this.
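As an illustration of the general idea (not of DMS's symbol table machinery), the following Python sketch models scopes as "symbol spaces" with links to enclosing or inherited scopes, and resolves an identifier by searching them innermost-first. The class and field names are hypothetical.

```python
from typing import Dict, Optional

class Scope:
    """One 'symbol space': a local mapping plus links to enclosing scopes."""
    def __init__(self, name: str, *parents: "Scope"):
        self.name = name
        self.parents = list(parents)          # several parents ~ multiple inheritance
        self.symbols: Dict[str, dict] = {}

    def define(self, ident: str, **definition) -> None:
        self.symbols[ident] = definition

    def lookup(self, ident: str) -> Optional[dict]:
        if ident in self.symbols:
            return self.symbols[ident]
        for parent in self.parents:           # search enclosing/inherited scopes
            found = parent.lookup(ident)
            if found is not None:
                return found
        return None

global_scope = Scope("global")
global_scope.define("counter", type="int")
body = Scope("function f", global_scope)
body.define("counter", type="float")          # shadows the global definition
print(body.lookup("counter"))                 # {'type': 'float'}  (innermost wins)
print(global_scope.lookup("counter"))         # {'type': 'int'}
```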
DMS provides a symbol table structure with local mappings ("symbol spaces") of identifiers to definitions to model scopes. These provide multiple inheritance support, and have proven strong enough to handle C++ and the many other languages that DMS can presently process. DMS has a library of routines for symbol-definition insertion and fast hashed symbol lookup with overload resolution, which can automatically navigate the symbol "spaces" and find the proper definition. DMS also provides means to help build symbol tables (see "attribute grammars", below). No parser generator system known to us provides symbol table structure support or help in building symbol tables; the language engineer typically has to write all of this machinery herself.
- Inference Methods: A way to understand the meaning and consequences of the written code, e.g., program analysis. Every program analysis or modification task starts with understanding the code, which requires collecting various kinds of facts about the code structures and how code elements interact. We call such determination of facts "inference". Some inferences are easy to extract from the program structure, e.g., the AST. One can programmatically traverse an AST with recursive function calls, and almost any parser generator will provide programmatic access to the nodes in an AST, but this is inconvenient for a large or complex grammar. DMS of course provides such programmatic access, but more importantly provides a variety of other inference mechanisms. (Parser generators never provide any capability for this; the language engineer has to build it all herself):
- Simple Fact Collection: One way to collect information is attribute grammars, which enable easy computation of arbitrary properties of syntax instances as annotations on the grammar rules themselves (a domain-specific language). This makes it very easy to collect complexity metrics. A more interesting example is collection of symbol table data. Since scopes for identifier names tend to follow the program structure, and this structure is defined by the grammar, there are almost always grammar rules which define lexical scopes. So attribute grammars are a nearly ideal way to collect the symbol table information. Coupled with the symbol-insertion capability of the symbol table library, attribute grammars make it relatively easy to build symbol tables from parse trees.
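The following Python sketch mimics the attribute-grammar idea without any DMS machinery: each "rule" contributes a local value, and the evaluator combines children's results bottom-up into a synthesized attribute (here, a toy decision-point count). The tree encoding and rule names are invented for the example.

```python
# Each grammar rule gets a small rule-specific computation; the evaluator
# combines children's results bottom-up (a synthesized attribute).
# Trees are (rule_name, *children); leaves are plain strings.

def complexity(node) -> int:
    """Toy metric: count decision points, the way an attribute grammar
    would annotate 'if' and 'while' rules with +1."""
    if isinstance(node, str):
        return 0
    rule, *children = node
    local = 1 if rule in ("if", "while") else 0
    return local + sum(complexity(child) for child in children)

tree = ("block",
        ("if", ("cond", "a"), ("assign", "x", "1"),
               ("while", ("cond", "b"), ("assign", "y", "2"))),
        ("assign", "z", "3"))
print(complexity(tree))   # 2: one 'if' plus one 'while'
```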
- Control Flow Analysis: To reason about most (especially procedural) languages, one needs to know in what order and under what conditions actions occur. This requires control flow analysis. DMS implements this using a language-engineer-specified attribute grammar analyzer that understands the control flow semantics of the language syntax, and a library that constructs a control flow graph from the information collected by that analyzer. The resulting control flow information encodes a "flowchart" of the code in a function or method, including blocks of actions, conditional branches and exception branches, and operations that occur in unspecified ("parallel"/fork-join) order, and it provides information such as dominators and control dependences. Additional support allows one to structure this control flow graph into regions, even if the original program is full of unstructured blocks (e.g., using uncontrolled "goto"). DMS also provides tools to build call graphs (even in the face of indirect pointers, as found in C, and implicitly in OO programs as "objects", which are really just pointers), using data flow analysis and points-to information (see below).
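A control flow graph itself is a simple structure; the hard part is computing it from real language semantics. The following hypothetical Python sketch shows only the representation, for a tiny if-then-else "flowchart".

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class BasicBlock:
    name: str
    statements: List[str] = field(default_factory=list)
    successors: List[str] = field(default_factory=list)   # edge = possible next block

# Flowchart for:  x = 0; if (c) x = 1; else x = 2; print(x)
cfg: Dict[str, BasicBlock] = {
    "entry": BasicBlock("entry", ["x = 0", "branch on c"], ["then", "else"]),
    "then":  BasicBlock("then",  ["x = 1"], ["join"]),
    "else":  BasicBlock("else",  ["x = 2"], ["join"]),
    "join":  BasicBlock("join",  ["print(x)"], []),
}

def predecessors(cfg: Dict[str, BasicBlock], name: str) -> List[str]:
    return [b.name for b in cfg.values() if name in b.successors]

print(predecessors(cfg, "join"))   # ['then', 'else']
```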
- Data Flow Analysis: How information flows through the program is fundamental to understanding it. What is needed is knowledge about how data flows from one point in the code to another. DMS provides a variety of support for computing data flow information, using built-in iterative solvers operating over the (possibly parallel) control flow graph, operating on arbitrary domains (including bit vectors), using identifier access and update information as determined by another language-engineer-specified attribute grammar. Using these solvers, DMS can determine reaching definitions and use-def chains, and consequently points-to analysis. DMS has been used to compute global points-to analysis for very large code bases. Using the explicitly captured data flow information, it is relatively easy to implement forward and backward program slice tracing.
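To illustrate what such an iterative solver does (in toy form, on a small diamond-shaped CFG like the one sketched above, with hand-written gen/kill sets rather than ones derived from an attribute grammar), here is a classic reaching-definitions worklist computation in Python.

```python
# Iterative forward data flow: which assignments ("definitions") can reach
# each block? Classic reaching definitions over a tiny if-then-else CFG.
cfg_succ = {"entry": ["then", "else"], "then": ["join"], "else": ["join"], "join": []}
gen  = {"entry": {"x@entry"}, "then": {"x@then"}, "else": {"x@else"}, "join": set()}
kill = {"entry": {"x@then", "x@else"}, "then": {"x@entry", "x@else"},
        "else": {"x@entry", "x@then"}, "join": set()}

pred = {b: [p for p, ss in cfg_succ.items() if b in ss] for b in cfg_succ}
IN  = {b: set() for b in cfg_succ}
OUT = {b: set() for b in cfg_succ}

worklist = list(cfg_succ)
while worklist:
    b = worklist.pop()
    IN[b] = set().union(*(OUT[p] for p in pred[b])) if pred[b] else set()
    new_out = gen[b] | (IN[b] - kill[b])
    if new_out != OUT[b]:
        OUT[b] = new_out
        worklist.extend(cfg_succ[b])      # successors must be revisited

print(IN["join"])   # both 'x@then' and 'x@else' reach the join
```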
- Symbolic Reasoning: More sophisticated analyzers are useful to reason about code, including methods to compute complex consequences of the semantics and the various information flows. At present, DMS provides abstract interpretation to compute symbolic range information for the values of variables in terms of other program variables. We have recently added Binary Decision Diagram support (with an extension to handle "finite domains") to allow fast reasoning about complex boolean conditions over discrete choices (e.g., enums), which is especially of interest when reasoning about preprocessor conditionals.
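The following Python toy hints at what range reasoning by abstract interpretation looks like: values are tracked as intervals and merged at control flow joins. It is a sketch of the general technique only, not of DMS's analyzer; the variables and bounds are invented.

```python
# Toy abstract interpretation: track a (low, high) interval per variable and
# push it through assignments, instead of running the code on concrete values.

def add(a, b):      # interval addition
    return (a[0] + b[0], a[1] + b[1])

def join(a, b):     # merge information arriving from two branches
    return (min(a[0], b[0]), max(a[1], b[1]))

env = {"n": (0, 100)}                            # assume: 0 <= n <= 100
env["i"] = add(env["n"], (1, 1))                 # i = n + 1        -> (1, 101)
then_env = {**env, "j": add(env["i"], (2, 2))}   # branch 1: j = i + 2
else_env = {**env, "j": (0, 0)}                  # branch 2: j = 0
env["j"] = join(then_env["j"], else_env["j"])    # after the if: j in (0, 103)
print(env["i"], env["j"])                        # (1, 101) (0, 103)
```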
- Pattern Matching and Transformation Capability: It is useful to be able to match arbitrary program fragments, or to generate arbitrary program fragments. DMS provides a variety of support for this (a small sketch of the underlying match-and-rewrite idea follows this list):
- Surface-syntax pattern matching: DMS does this with surface-syntax patterns, in which one writes code fragments in the target notation with placeholders. DMS can match such patterns against ASTs it has parsed, binding placeholders to corresponding subtrees. All one has to do to get this capability is define a parser to DMS. The patterns can also rely entirely or partly on custom-coded tree traversals.
- Associative/Commutative matching: Patterns can match on associative and commutative operators that are declared in the grammar. (No other available program transformation system known to us does this.)
- Pattern-directed code generation: The same patterns can be used to generate ASTs that instantiate the patterns with pre-supplied subtrees, allowing one to build arbitrarily complicated tree fragments easily, and compose them into even larger trees.
- Source to source transformations: One can also write optimizations or transforms on code, using rewrite rules consisting of pattern pairs, one pattern for matching and one for replacement when a match is found.
- Cross language translation: Since rewrite rules can use surface syntax of one language in the match-pattern, and surface syntax of another in the replace-pattern, one can straightforwardly write language-to-language transformations.
- Data flow pattern matching: Nearing release, we are adding the ability to match patterns over dataflows, enabling a semantic matching capability. This is extremely useful in recognizing abstract code idioms that are scattered about in real implementation code. It can recognize variants produced by alternative design choices whose presence is only implicit in the code, as well as identifying the specific design choice configuration that enables a match.
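Here is the sketch promised above: a toy Python version of placeholder matching and rule-based rewriting over trees. DMS's rules are written in the surface syntax of the target language and operate on its parsed ASTs; this sketch only shows the underlying match-bind-instantiate idea, with an invented tuple encoding of trees.

```python
# Trees are (operator, *children); strings starting with "?" are placeholders.
def match(pattern, tree, bindings=None):
    bindings = {} if bindings is None else bindings
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:                 # repeated placeholder must bind equally
            return bindings if bindings[pattern] == tree else None
        return {**bindings, pattern: tree}
    if isinstance(pattern, str) or isinstance(tree, str):
        return bindings if pattern == tree else None
    if pattern[0] != tree[0] or len(pattern) != len(tree):
        return None
    for p, t in zip(pattern[1:], tree[1:]):
        bindings = match(p, t, bindings)
        if bindings is None:
            return None
    return bindings

def instantiate(pattern, bindings):
    if isinstance(pattern, str):
        return bindings.get(pattern, pattern)
    return (pattern[0],) + tuple(instantiate(p, bindings) for p in pattern[1:])

# Rewrite rule "x + 0 -> x", written as a pattern pair:
lhs, rhs = ("add", "?x", ("num", "0")), "?x"
code = ("assign", "y", ("add", ("mul", "a", "b"), ("num", "0")))
b = match(lhs, code[2])
print(instantiate(rhs, b))   # ('mul', 'a', 'b')
```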
- Pretty-Printing: If code transformations are applied to improve the code, one needs to be able to regenerate valid source code from the modified representations. DMS provides this ability in the form of prettyprinting rules, which specify how the AST should be regenerated, either in terms of its original layout ("fidelity printing") or in terms of nested, indented text boxes ("pretty printing"). These capabilities can also be used to print smaller subtrees. The prettyprinter regenerates accurate numeric values (especially floating point or infinite-precision values), lexical formatting details (string quotes, radix, character escape codes, etc.), as well as comments, preprocessor macro declarations, includes, ifdefs, and macro calls.
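The following Python toy regenerates text from a tree using one small layout rule per node kind, which is the essence of pretty printing. DMS expresses such rules declaratively in a box-composition notation; this sketch, with its invented node kinds, is purely illustrative.

```python
# Regenerate source text from a tree: each node kind has a small layout rule.
# Trees are (kind, *children); leaves are strings.

def pretty(node, indent=0):
    pad = "    " * indent
    if isinstance(node, str):
        return node
    kind, *kids = node
    if kind == "block":
        return "\n".join(pretty(k, indent) for k in kids)
    if kind == "if":
        cond, then = kids
        return f"{pad}if ({pretty(cond)}) {{\n{pretty(then, indent + 1)}\n{pad}}}"
    if kind == "assign":
        lhs, rhs = kids
        return f"{pad}{lhs} = {pretty(rhs)};"
    if kind == "add":
        a, b = kids
        return f"{pretty(a)} + {pretty(b)}"
    raise ValueError(f"no layout rule for {kind!r}")

tree = ("block", ("assign", "x", ("add", "a", "b")),
                 ("if", "x", ("block", ("assign", "y", "0"))))
print(pretty(tree))
# x = a + b;
# if (x) {
#     y = 0;
# }
```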
- Multi-module and multi-lingual Integration: Real software systems are not coded entirely in one language (Java? What, no SQL or HTML?). As a practical matter to process large systems of code, a good tool must be able to parse all the sublanguages and tie information together across the elements. DMS can host multiple language front ends/symbol tables/flow analyzers/prettyprinters simultaneously.
Bottom line: A parser simply isn't enough to do serious work. DMS is designed to be enough. The payoff is the ability to build real, useful, and robust tools with modest resources and time, as opposed to the common tool failure of "succeeding in just building a parser".
Bonus: Pretested Language Front Ends
DMS has many predefined language front ends, which have been developed and honed over the last decade. These front ends include grammars, prettyprinters, and, to varying degrees, symbol tables, flow analysis, etc. For real languages such as COBOL, Java, C#, and C++, you are much better off getting one (or more) of SD's tested language front ends than trying to define the grammar, let alone the full front end, yourself, or attempting to bend another one that hasn't been used for serious tasks.
Double Bonus: Faster answers through Parallel foundations
Programs are getting bigger (millions of lines of code), and people (programmers and their managers) want answers fast. One way to get faster answers is to use parallelism. Cycles are an engineer's best friend.
All of the DMS machinery is implemented in a parallel programming language, PARLANSE, which is designed for multicore, irregular ("task") parallel computation. The DMS implementation provides thread-safe access and update to all of its structures, enabling multiple analyses and transformations to run concurrently. This allows DMS to scalably process very large systems of code, using the multiple CPUs available in virtually every desktop workstation that programmers have. We have not seen any other parser generator, program analysis, or program transformation tool that provides this capability.
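As a rough illustration of the general payoff (in Python with a process pool, which is unrelated to PARLANSE or DMS internals), independent per-file analyses can be fanned out across all available cores and their results merged; the stand-in "analysis" here is just a line count.

```python
# Run an independent per-file analysis on every core; merge results at the end.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def analyze(path: Path):
    """Stand-in analysis: count non-blank lines in one source file."""
    lines = path.read_text(errors="replace").splitlines()
    return str(path), sum(1 for line in lines if line.strip())

def analyze_codebase(root: str) -> dict:
    files = list(Path(root).rglob("*.py"))           # pick any extension here
    with ProcessPoolExecutor() as pool:              # one worker per core by default
        return dict(pool.map(analyze, files))

if __name__ == "__main__":                           # guard required for process pools
    results = analyze_codebase(".")
    print(f"{len(results)} files, {sum(results.values())} non-blank lines")
```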
It should be obvious that building a full set of tools for processing languages is difficult. Building such a set on parallel foundations is something few language engineers would even consider attempting. But it is available, built into DMS.