Therefore all progress depends on the unreasonable man.
-- George Bernard Shaw

Large software systems (think millions of lines of code, multiple languages) have surprising troubles. Semantic Designs believes that automated analysis and transformation tools such as DMS can tackle problems not previously solved. Such problems require some significant engineering but have big potential payoff in terms of cost, development time, reliability, and the ability to achieve new capabilities, for the client owning the large code base.

We suggest some ideas here. If you find these ideas interesting, or have a vision of your own, we invite you to contact us to discuss what might be possible.

Automated Extraction of Regression Tests

Writing software is hard. Writing tests to validate the software is also hard, and consumes about 40% of the overall effort in a well-run project. If one could cut the cost of writing tests, it could have a major impact on the cost of building and maintaining software. Unfortunately, a tool cannot know what the functionality of a system should be when it has nothing but the source code to go on. So one cannot automate the generation of tests against the intended functionality.

But a running system represents working functionality (modulo its known bugs). If one had tests that verified the running system operated as it should, those tests could be used to verify that changes to the running system, which occur continuously, do not damage the part of the system not changed. A solution that extracted tests from running software would be an enormous benefit to organizations with legacy software.

Semantic Designs believes that it is possible to extract such tests from existing code. The essential idea is to instrument the running application to capture test case data based on data from its daily operation, and use that to generate unit tests on program elements. For those tests to be effective, the running context of the program must be re-established during testing. A puppetization process would install controls in the application to enable it to operate in parts exactly as the original, and in parts to force it down paths to the point where particular unit tests would be applied. SD has the technology to instrument applications in many languages (as examples, consider our test coverage and profiling tools) and capture data. It has the technology to puppetize code. What remains is to put the pieces together into a working system.
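The capture half of this idea can be illustrated with a minimal sketch. It assumes a Python function standing in for a legacy program element; the names `capture`, `replay`, `CAPTURED` and `shipping_cost` are hypothetical and are not part of any SD tool. It shows only recording and replaying of calls, and does not attempt the puppetization needed to re-establish the running context.

```python
import functools

# Hypothetical in-memory log of observed calls; a real tool would persist these.
CAPTURED = []

def capture(func):
    """Record the arguments and result of every call made to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        CAPTURED.append((func.__name__, args, kwargs, result))
        return result
    return wrapper

@capture
def shipping_cost(weight_kg, express=False):
    # Stand-in for legacy business logic whose intent is not documented.
    base = 5.0 + 1.5 * weight_kg
    return base * 2 if express else base

def replay(captured, functions):
    """Re-run each recorded call; return the calls whose result has changed."""
    failures = []
    for name, args, kwargs, expected in captured:
        actual = functions[name](*args, **kwargs)
        if actual != expected:
            failures.append((name, args, kwargs, expected, actual))
    return failures

# "Daily operation": ordinary calls are recorded as a side effect.
shipping_cost(2)
shipping_cost(4, express=True)

# Later, replay the recorded calls against the (possibly modified) code.
# __wrapped__ is set by functools.wraps and bypasses the recorder.
failures = replay(list(CAPTURED), {"shipping_cost": shipping_cost.__wrapped__})
print(failures)
```

Each recorded tuple acts as a regression test: it asserts nothing about what the code should do, only that it still does what it did.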

Unifying Forked Source Code Bases into a Product Line

Many organizations find themselves with a very large application that has been forked into multiple versions, and are now doing updates on the multiple versions at a correspondingly high price. An ideal solution would combine the multiple versions into a single golden code base with configuration parameters that could be used to generate the multiple versions. Then maintenance and updates happen on the golden code base, which is delivered to multiple sites according to their corresponding configurations.

To do this, one must discover what the versions have in common, and where they differ. The common part can be extracted, and the differences added to the common section conditionally controlled by configuration parameters.
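As a rough illustration of that step, the sketch below merges two forks of a file into one text with conditionals, then regenerates either variant from it. It is a line-based stand-in built on Python's `difflib`, not a description of how SD's tools work (they operate on program structure rather than raw lines). The flag name `SITE_B`, the helper names `unify` and `generate`, and the sample lines are all assumptions, and the sketch does not handle source lines that themselves begin with `#if`, `#else` or `#endif`.

```python
import difflib

def unify(version_a, version_b, flag="SITE_B"):
    """Merge two lists of lines, guarding each difference with a flag."""
    merged = []
    matcher = difflib.SequenceMatcher(a=version_a, b=version_b, autojunk=False)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            # Common part: emitted once, unconditionally.
            merged.extend(version_a[a1:a2])
            continue
        # Differing part: keep both sides under a configuration conditional.
        merged.append(f"#if {flag}")
        merged.extend(version_b[b1:b2])
        merged.append("#else")
        merged.extend(version_a[a1:a2])
        merged.append("#endif")
    return merged

def generate(merged, flag_on):
    """Regenerate one variant from the merged text for a given flag setting."""
    out, emitting = [], True
    for line in merged:
        if line.startswith("#if "):
            emitting = flag_on
        elif line == "#else":
            emitting = not flag_on
        elif line == "#endif":
            emitting = True
        elif emitting:
            out.append(line)
    return out

site_a = ["int rate = 5;", "charge(rate);", "log();"]
site_b = ["int rate = 7;", "charge(rate);", "audit();", "log();"]

golden = unify(site_a, site_b)
print("\n".join(golden))
```

Generating with the flag off reproduces the first fork and with the flag on reproduces the second, which is the property a real product line must preserve.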

Semantic Designs has tools for discovering common code and differences (e.g., our Clone Detection and Smart Differencing tools) across many languages (see our supported languages list). We have the ability to transform code to insert configuration conditionals of many kinds (preprocessor conditionals, procedural or macro abstraction, objects with inheritance, generics, whole-file replacement). The result is a product line, which can be used to generate the instance variants as needed. Development on a shared code base makes common updates easily shared, and changes in configured code clearly specialized to the variant. You can read a bit more about this in a Dagstuhl Research Report, in the section on Refactoring to Product Lines.

Automated Code Update from Data Schema Changes

Every application is driven by a model of the world, realized as an instance of a data schema that can hold the necessary details. The schema may be implicit (e.g., as in hierarchical databases or flat files) or explicit (as in relational data models or XML schemas). No matter how the schema is defined, the program contains code that implicitly knows what the schema is. The obvious value is that the program knows how to manipulate data in that schema. The problem is that the organization's needs, and the world, both evolve, requiring the data schema to change, and the program to change in response.

Some of that change is in terms of new functionality that harnesses the new types and relations of data in the new schema. But much of the change is just to accommodate changes in the schema. As an example, almost every new data field requires something to create, read, update, and delete new data field instances ("CRUD"). Knowledge of how the program uses the data schema, and of the changes being made to the schema, could be used to automate the mundane part of updating the program, allowing software engineers to work on the interesting functionality.

Semantic Designs tools can process data schema descriptions (SQL, XML Schemas, ...) and source code. Changes made to a data schema can be detected by SD's Smart Differencers. Such changes could be used to automate much of the mundane part of code base changes.
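A toy version of this pipeline might look like the following. It assumes schemas have already been parsed into plain dictionaries of the form `{table: {column: type}}`, and that every table has a key column named `id`; both are simplifying assumptions, and `diff_schema` and `crud_statements` are hypothetical helpers rather than SD APIs. A real tool would also have to find and update the program code that uses each table.

```python
def diff_schema(old, new):
    """Return the columns added to and removed from each changed table."""
    changes = {}
    for table in sorted(set(old) | set(new)):
        before, after = old.get(table, {}), new.get(table, {})
        added = sorted(set(after) - set(before))
        removed = sorted(set(before) - set(after))
        if added or removed:
            changes[table] = {"added": added, "removed": removed}
    return changes

def crud_statements(table, column, key="id"):
    """Generate parameterized SQL templates for a newly added column.

    The key column name is an assumption of this sketch.
    """
    return {
        "create": f"INSERT INTO {table} ({key}, {column}) VALUES (?, ?)",
        "read": f"SELECT {column} FROM {table} WHERE {key} = ?",
        "update": f"UPDATE {table} SET {column} = ? WHERE {key} = ?",
        "delete": f"UPDATE {table} SET {column} = NULL WHERE {key} = ?",
    }

old_schema = {"customer": {"id": "INTEGER", "name": "TEXT"}}
new_schema = {"customer": {"id": "INTEGER", "name": "TEXT", "email": "TEXT"}}

changes = diff_schema(old_schema, new_schema)
for table, change in changes.items():
    for column in change["added"]:
        for operation, sql in crud_statements(table, column).items():
            print(operation, "->", sql)
```

The generated statements cover only the mundane part of the change; deciding what the application should do with the new field remains the engineer's job.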

Integration of Two Applications by Data Model Unification

Application integration allows a company to provide more sophisticated responses, often with less effort, even to the point of driving corporate mergers. But too often, integration fails because the data models of the two applications are not aligned, and because one cannot easily make the changes in one application that are induced by aligning its data with the other application's model. Thus the synergies of integration are not achieved, or are long delayed. Being able to unify two data models, and push the resulting changes into both applications, is key to application integration. One needs to be able to align data model elements.

We suggested above how changes in one model could be partly automated using tools. What is different here is aligning two data models first. The changes required to align the models can be used to drive model changes into each application. Semantic Designs thinks that semantic description technology (e.g., OWL, description logics, specification algebras) can be used to provide precise semantics to data elements and their relations, and to compute new relations from old. Thus an algebraic means for unifying the schemas is suggested, which might both guide the unification process and provide additional semantics for the programs.

Semantic Designs tools can process "semantic description" languages, and thus can be used to enable modifications to schemas and to check that the resulting schemas are aligned (to some degree; semantic reasoning in general is Turing hard, and most schemas are incomplete in the semantics of the modelled facts). But any help here is of enormous value, because the cost of making the changes incorrectly is very high.

Basel II Compliance: What's the source of that datum?

Banks and other large financial institutions are becoming increasingly regulated in terms of delivered results and processes required. One set of standards to be met by such institutions is the Basel II agreements. Any Basel II solution considered to be "best practice" should be transparent and auditable. It should provide complete traceability of computed numbers down to the source data, with the appropriate audit trail. How is one to achieve this, in the face of large-scale information systems in enormous organizations?

As financial processes become increasingly automated, this information flows through computer programs owned by the financial institution. One way to solve the tracking problem is then to literally trace the data going into reports, back into the databases that produce it, and from there into other financial processes, repeating until one arrives at source data acquired from some outside agent. (Even then one might wish to dive further, but that is subject to the outside agent cooperating on a massive scale.)

Semantic Designs builds data flow analyzers that compute this data for individual programs; we've handled individual systems of 26 million lines of code. One can imagine scaling this up to trace data across processes and databases owned by the institution, to provide a documented trace of information sources. One would need tools to enable financial engineers to explore this trace. But questions about sources of information would then be answerable.
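The tracing step itself can be pictured as a backward walk over a graph of "computed from" facts. The sketch below assumes such facts have already been extracted by a flow analysis and are available as a dictionary; the data names in `DERIVED_FROM` and the helper `trace_sources` are invented for illustration and do not reflect any real institution or SD interface.

```python
from collections import deque

# Hypothetical flow facts: each datum maps to the data it is computed from.
DERIVED_FROM = {
    "risk_report.capital_ratio": ["db.capital", "db.risk_weighted_assets"],
    "db.capital": ["ledger.equity"],
    "db.risk_weighted_assets": ["loans.exposure", "ratings_feed.weight"],
}

def trace_sources(datum, derived_from):
    """Walk backward from a datum; return its ultimate sources and the trace.

    A source is any datum with no recorded inputs. The trace is a list of
    (input, output) edges, which serves as the audit trail.
    """
    sources, edges = set(), []
    seen = {datum}
    queue = deque([datum])
    while queue:
        current = queue.popleft()
        inputs = derived_from.get(current, [])
        if not inputs:
            sources.add(current)
        for item in inputs:
            edges.append((item, current))
            if item not in seen:  # guards against cycles and repeated work
                seen.add(item)
                queue.append(item)
    return sources, edges

sources, trail = trace_sources("risk_report.capital_ratio", DERIVED_FROM)
print(sorted(sources))
for origin, target in trail:
    print(f"{origin} -> {target}")
```

The hard part in practice is producing the flow facts across programs, processes and databases; once they exist, answering "where did this number come from?" is a simple graph query.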

An odd side effect of this process is probably cleaning of data. Consider the notion of "profit". It ought to be that the profits of a company are the sum of the profits of its divisions. However, if such profits are measured in different ways (annual, cost-adjusted, ...) adding them may in fact produce nonsense. A full dataflow analysis would find where such profits are added. Adding type checking would verify that the composition was valid. One might not get a valid composition in all parts of an organization, but the organization should at least know where data is combined inappropriately.
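The type-checking idea can be sketched by attaching the basis of measurement to each value and refusing to add values whose bases differ. The class name `Amount` and the basis labels are assumptions made for this illustration; a real analysis would infer such types from the code and data rather than require programmers to write them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Amount:
    """A monetary value tagged with how it was measured (hypothetical type)."""
    value: float
    basis: str  # e.g. "annual" or "cost-adjusted"

    def __add__(self, other):
        if not isinstance(other, Amount):
            return NotImplemented
        if self.basis != other.basis:
            # The composition is invalid: report it rather than add nonsense.
            raise TypeError(f"cannot add {self.basis} and {other.basis} amounts")
        return Amount(self.value + other.value, self.basis)

division_a = Amount(10.0, "annual")
division_b = Amount(5.0, "annual")
division_c = Amount(3.0, "cost-adjusted")

company_total = division_a + division_b
print(company_total)

try:
    division_a + division_c
except TypeError as error:
    print("inappropriate combination:", error)
```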

Design Recovery

Most large applications exist only as source code (sometimes not even that). Any actual design knowledge may be hidden in some engineer's brain or, more usually, completely lost. A consequence is that the continuous changes to code demanded of working systems always require rediscovery of the concepts and code organization of the software. Thus programmers spend 50% of their time just staring at code, trying to understand what it does. They are hampered by having only the low-level source code, perhaps some hints in the comments and, rarely, software entities that are well named with respect to purpose. Tools that can rediscover common concepts for the application, and where those concepts are implemented, could shorten the understanding time and therefore delivery considerably, and could raise the quality of changes that are made.

Code concepts are realized by data structures and idioms for manipulating those data structures in ways that achieve the application purpose. Once the data structures are defined, the idioms to achieve purpose tend to be similar because they must process that data as defined. Semantic Designs has the technology to find data values flowing through code (e.g., data structure instances) and match idioms that manipulate such data structures. One can "tile" the code base with recognized concepts, and make those tiles visible to new programmers who have not seen the code before, enabling them to understand and decide what to do more efficiently. (We are presently doing a version of this for Dow Chemical.)
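To give a flavor of idiom matching, the sketch below uses Python's standard `ast` module to find one very simple idiom, a loop that accumulates into a variable, and reports where it occurs. This is only an analogy under stated assumptions: it matches syntax in Python source, whereas the approach described above also relies on data flow and works across many languages. The sample code in `SOURCE` and the function `find_accumulations` are invented for the example.

```python
import ast

# Invented sample code to be "tiled".
SOURCE = '''
def total_balance(accounts):
    total = 0
    for account in accounts:
        total += account.balance
    return total

def names(accounts):
    return [a.name for a in accounts]
'''

def find_accumulations(source):
    """Return (function, line, variable) for each 'for ...: x += ...' loop."""
    tiles = []
    tree = ast.parse(source)
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        for node in ast.walk(func):
            if not isinstance(node, ast.For):
                continue
            for stmt in node.body:
                # The idiom: an augmented '+=' assignment to a plain name
                # directly inside the loop body.
                if (isinstance(stmt, ast.AugAssign)
                        and isinstance(stmt.op, ast.Add)
                        and isinstance(stmt.target, ast.Name)):
                    tiles.append((func.name, node.lineno, stmt.target.id))
    return tiles

tiles = find_accumulations(SOURCE)
for function, line, variable in tiles:
    print(f"{function} (line {line}): accumulates into '{variable}'")
```

Each match is a candidate tile; labelling it with an application concept (here, perhaps "compute total balance") is what would make it useful to a programmer new to the code.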

Design Traceability from Specifications to Code

The holy grail of program development is not to recover design information that has been lost. Rather, it is to not lose that design information as it is generated, thus avoiding the expensive and error-prone process of trying to rediscover it. One needs to record the abstract concepts, the program purpose, the implementation choices and the final code to do this "right".

Semantic Designs' flagship product, DMS, was designed from this perspective. SD has a vision of how such design information might be captured and incrementally updated as changes are made. This would be especially valuable for capturing the structure of complex, expensive artifacts such as chip designs, software with safety requirements, or simply large applications. You can read a technical paper on formalizing and mechanizing this.


Bring us your poor, your tired, your huddled fantasies of massive software engineering using automated tools, and let us set them free.


For more information:    Follow us at Twitter: @SemanticDesigns