
Embedding functional simulators in compilers for debugging and profiling

Thibault Delavallée∗, Philippe Manet∗, Hans Vandierendonck† and Jean-Didier Legat∗
∗Electrical Engineering, ICTEAM Institute, Université catholique de Louvain, Place du Levant 3, Louvain-la-Neuve

{thibault.delavallee, philippe.manet, jean-didier.legat}@uclouvain.be
†Dept. of Electronics and Information Systems, Ghent University, Sint-Pietersnieuwstraat 41, Ghent

[email protected]

Abstract—In embedded systems, achieving good performance for signal processing applications is crucial for power management. Good compilation is required to make maximal use of the available processing capabilities. Compiling for communication-exposed architectures such as ADRES, TRIPS and WaveScalar is, however, a complex task: dataflow graphs are mapped onto grids of execution units in order to increase instruction-level parallelism while minimizing communication. Complex algorithms and the large number of code optimizations make debugging hard for the developer. Moreover, iterative approaches are used to optimize the quality of the compiled code.

This paper proposes to embed functional simulators in compilers in order to enable debugging and profiling-driven iterative compilation. Optimization passes are debugged by running both the original code and the transformed code through functional simulators; comparing intermediate and output values verifies the correctness of the optimization pass. The embedded simulators also make it possible to extract code and execution characteristics that are useful for iterative compilation. We present the mechanisms required to control these simulators. A case study based on the TRIPS processor demonstrates the usefulness of our approach.

Index Terms—compiler, debug, embedded systems, profiling, simulation

I. INTRODUCTION

Over the years, compilers have grown into large and complex software systems. Many individual code optimizations have been implemented, and they are often individually callable by the user. This places a large burden on the developer, who has to verify that all optimizations work correctly and that they actually bring a benefit. Good compiler performance is crucial to fully use the processing capabilities of the target processor, and therefore to save energy. Debugging compilers is notoriously hard. Furthermore, selecting the optimal sequence of optimizations and their associated parameters is far from trivial [1]. To address this problem, iterative compilation techniques have been proposed that repeatedly compile a program with different optimization sequences and measure the performance of each sequence.

To address these problems, we propose to embed functional simulators in compilers for two reasons: (i) to facilitate the debugging of compiler optimization passes and (ii) to achieve fast turn-around times in iterative compilation.

Optimization passes can be debugged by running the original code and the transformed code through a functional simulator. The optimization pass is correct if both code versions produce equivalent results; otherwise, a failure is reported to the developer. This approach makes it possible to run test cases very quickly during the compilation process. While it does not replace running extensive test suites, it has the advantage that a compiler bug can be pinpointed very quickly: the failing compiler pass is identified immediately, as is the code region that triggers the bug.
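As an illustration, the following is a minimal sketch of this per-pass check in a toy setting where a code region is modelled as a Python callable over its live-in values; the names simulate and check_pass are hypothetical and stand in for the embedded simulators and the compiler hook, not an actual interface.

```python
# Minimal sketch of the per-pass equivalence check (toy model: a code
# region is a Python callable over its live-in values; all names are
# hypothetical, not the actual compiler interface).

def simulate(region, inputs):
    # Stand-in for a functional simulator run of one code region.
    return region(*inputs)

def check_pass(region, optimization, test_inputs):
    """Apply one optimization pass and verify it on a few input sets."""
    transformed = optimization(region)
    for inputs in test_inputs:
        expected = simulate(region, inputs)       # original code
        actual = simulate(transformed, inputs)    # transformed code
        if expected != actual:
            # Pinpoints both the failing pass and the code region.
            raise AssertionError(
                f"pass broke this region on inputs {inputs}: "
                f"{expected} != {actual}")
    return transformed

# Example: a strength-reduction-like pass rewriting x * 2 as x << 1.
double = lambda x: x * 2
strength_reduce = lambda region: (lambda x: x << 1)
check_pass(double, strength_reduce, [(0,), (3,), (-5,)])
```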

Embedding simulators in compilers can speed up iterative compilation because it allows the optimization of individual code regions to be fine-tuned. Focusing the optimization process on individual code regions has several benefits, such as selecting a different optimization sequence for each code region and iterating longer over hot code regions. Furthermore, by executing the code regions in a functional simulator, it is possible to measure code properties that steer the compilation sequence in the right direction. A potential limitation is that performance may also depend on the interaction between code regions, which this model does not capture; in those cases, full-program iterative compilation remains necessary as well.

The main motivation for this work is the complexity of compiling for communication-exposed architectures such as ADRES [2], TRIPS [3] and WaveScalar [4]. These architectures aim to increase instruction-level parallelism (ILP) by executing instruction blocks on grids of ALUs in which the communication between the ALUs is exposed to the compiler [5]. They target embedded systems applications, in which a good use of the available computing capabilities is crucial for power management. Compiling for such architectures consists of mapping data flow graphs onto the grid of ALUs in such a way that concurrency (ILP) is maximized while communication overheads are minimized. We have found the transformation of sequential code to ISAs for communication-exposed architectures to be very error-prone. Furthermore, compilation for these architectures uses iterative approaches such as place-and-route algorithms [6] and simulated annealing [5].

This paper presents the techniques that are necessary to enable the debugging and iterative compilation techniques described above. The remainder of this paper is structured as follows. Section II presents applications in the fields of debugging and iterative compilation. Section III presents our method, which is evaluated in Section IV with a case study. Related work is discussed in Section V. Section VI concludes the paper.



II. APPLICATIONS

A. Compile flow

Compile flows for communication-exposed architectures involve several passes. Control flow graph creation is the basis of code region formation. During this pass, the original code is modified with block-formation techniques such as loop unrolling, instruction promotion and instruction moving. Once the code regions have been shaped, the compiler uses their execution graphs to schedule instructions onto the available resources: execution units, communication resources, the memory system, and so on. Static scheduling of instructions is done in an out-of-order way, in order to maximize the exposed parallelism under communication and resource-availability constraints. The debugging and iterative compilation techniques based on embedded functional simulators relieve part of this compilation complexity by helping developers.
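To make the structure concrete, here is an illustrative skeleton of such a flow with an embedded verification hook after each pass (Section III); the pass bodies are placeholders, not the actual algorithms.

```python
# Illustrative compile-flow skeleton with a per-pass verification hook;
# the pass bodies below are placeholders, not real algorithms.

def form_blocks(ir):
    return ir   # placeholder: unrolling, promotion, instruction moving

def schedule_on_grid(ir):
    return ir   # placeholder: out-of-order static scheduling on the grid

PASSES = [("block-formation", form_blocks),
          ("grid-scheduling", schedule_on_grid)]

def compile_flow(ir, verify):
    """Run the passes, checking each transformation with the embedded
    simulators via the caller-supplied verify() callback."""
    for name, run_pass in PASSES:
        new_ir = run_pass(ir)
        verify(name, ir, new_ir)   # equivalence check of old vs. new IR
        ir = new_ir
    return ir
```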

B. Debugging

Debugging makes it possible to verify optimization passes independently. Moreover, reporting the error together with its context can help the developer. One application is to verify that control flows are still correct after major changes in the code organization due to block formation. Another application is debugging function control flow, by verifying the program behavior around procedure calls. When instructions are manipulated by promotion, moving or scheduling, verifying the outputs of the code region ensures the correctness of the modifications and, notably, exposes dependence violations due to wrong instruction placement.

C. Iterative Compilation

Iterative compilation needs precise information to decide whether to issue a new iteration, depending on some criteria. Using our proposed technique, profiling information can be extracted from the compile flow itself: profiling is done during compilation instead of on the final binary. For example, the TRIPS architecture puts severe constraints on code region formation, such as the number of register accesses and the physical location of resources, so extracting profiles of the code regions right after their formation is essential. Moreover, architectural exploration is facilitated by fast profiling results.
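The following is a sketch of such a profile-driven iteration loop over a single code region. The cost returned by profile() (e.g. a critical-path estimate from the embedded simulator) is an assumption for illustration, not a fixed metric of our infrastructure.

```python
# Sketch of a profile-driven iteration loop over one code region.
# profile() is an assumed callback that scores an IR with the embedded
# functional simulator instead of running a final binary.

def best_sequence(region, candidate_sequences, profile, inputs):
    """Try each optimization sequence on one region and keep the one
    with the lowest profiled cost."""
    best_cost, best_seq = None, None
    for seq in candidate_sequences:
        ir = region
        for opt in seq:
            ir = opt(ir)                 # apply the sequence pass by pass
        cost = profile(ir, inputs)       # embedded-simulator profiling
        if best_cost is None or cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq
```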

III. METHOD

Linking compilation to functional simulators brings advantages to the developer in terms of debugging and profiling. Although the basic technique is similar for both, each has its specificities.

A. Debugging

When targeting debugging, the goal is to test the equivalence of two code fragments that are supposed to have the same functionality. The main focus is on testing the correctness of compiler passes that make significant changes, e.g. when transforming code to a timing-dependent representation.

Figure 1. General setup of compiler and simulators.

This occurs, for instance, when scheduling instructions on Explicit Data Graph Execution (EDGE) processors [3].

Debugging is performed on small code regions to increase the precision with which errors can be diagnosed. When compiling for block-structured architectures such as TRIPS, we debug each execution block separately, but it is also possible to run all the code affected by an optimization pass, e.g. a function body. In order to debug the optimization of a code region, the compiler sends the original code in the source internal representation and the transformed code in the target internal representation to the online debugger (Figure 1). The online debugger then instructs the corresponding simulators to execute the code regions. Afterwards, the resulting program states are compared for equivalence.
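The sketch below reduces this protocol to a toy model in which a simulator's state is a dictionary and a region is a function from state to state; the class and method names are illustrative only, and the real simulators interpret the compiler's IRs.

```python
# Toy model of the online debugger of Figure 1: state is a dict, a
# region is a function from state to state; all names are illustrative.

class ToySimulator:
    def __init__(self):
        self.state = {}

    def run(self, region):
        self.state = region(dict(self.state))

class OnlineDebugger:
    def __init__(self, ref_sim, target_sim):
        self.ref_sim, self.target_sim = ref_sim, target_sim

    def check_region(self, source_ir, target_ir, out_vars):
        self.ref_sim.run(source_ir)      # original code, source IR
        self.target_sim.run(target_ir)   # transformed code, target IR
        # Compare the resulting program states on the observable outputs.
        return all(self.ref_sim.state.get(v) == self.target_sim.state.get(v)
                   for v in out_vars)

dbg = OnlineDebugger(ToySimulator(), ToySimulator())
assert dbg.check_region(lambda s: {**s, "x": 2 * 21},
                        lambda s: {**s, "x": 21 << 1}, ["x"])
```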

The first step in preparing the execution of a code region is to set up the reference simulator. The reference simulator simulates the whole program, starting from user-supplied input sets and program flags. Calls to library functions are emulated by the simulator, while calls to functions defined in other compilation modules are skipped. While it is possible to simulate the program from the start for each separate code region, it is much more efficient to first compile all code regions and then verify their correctness, as this avoids reinitializing the reference simulator again and again.

Once the reference simulator has arrived at the correct program point, the target simulator is initialized to the corresponding state by probing the state of the reference simulator. To this end, the compiler maintains a map that describes the relationship between variables in the source internal representation and in the target representation. The target simulator loads every input variable in the map by querying its value from the reference simulator and storing it in the appropriate register or memory location. Note that the map is not necessarily one-to-one, both for variables and for code labels, because a given variable or code label can have a different number of copies in the different internal representations. Furthermore, the map also covers variables that live in the communication nodes, in order to verify intermediate values.
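A sketch of this map and of the state-transfer step follows. Because the map is not one-to-one, each source variable lists all of its target locations; the location names (g0, E0.op0, ...) are illustrative, loosely following the case study of Section IV.

```python
# Sketch of the variable map and the state-transfer step. Each source
# variable may map to several target locations; the names below are
# illustrative only.

var_map = {
    "x": ["g0"],                 # lives in one TRIPS register
    "y": ["g1"],
    "t1": ["E0.op0", "E2.op1"],  # intermediate value held in two
}                                # operand buffers (communication nodes)

def init_target_state(ref_state, var_map):
    """Initialize the target simulator by querying every mapped input
    variable from the reference simulator."""
    target_state = {}
    for src_var, locations in var_map.items():
        value = ref_state[src_var]       # probe the reference simulator
        for loc in locations:
            target_state[loc] = value    # register or buffer slot
    return target_state

print(init_target_state({"x": 7, "y": 9, "t1": 16}, var_map))
```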

Next, both simulators execute their code fragments. When they have finished, the target simulator again uses the map to query the correct value of all output variables from the reference simulator and compares them with its own. The contents of all memory state modified by either of the simulators are also compared to check equivalence. When all state is equal, the target simulator notifies the online debugger that the test succeeded.



While debugging, it is often interesting to know not only the cause of a failure, but also the context in which it appeared. An execution trace can be kept in order to give access to this context when the error pops up. This trace contains information about input values and the history of control flow.
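A minimal sketch of such a trace is shown below: a bounded history of input values and control-flow decisions that a simulator records as it runs. The class and method names are assumptions for illustration.

```python
# Minimal sketch of the execution trace kept for error context: a
# bounded history of input values and control-flow decisions.

from collections import deque

class ExecutionTrace:
    def __init__(self, depth=1024):
        self.events = deque(maxlen=depth)   # keep only recent history

    def record_input(self, var, value):
        self.events.append(("input", var, value))

    def record_branch(self, pc, taken):
        self.events.append(("branch", pc, taken))

    def dump(self):
        # Handed to the online debugger when a comparison fails.
        return list(self.events)
```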

B. Profiling

When targeting profiling, one functional simulator runs over the internal representation of the compiler. Profiling uses synchronization and communication mechanisms similar to those used for debugging. The simulation grain varies depending on the information to extract. For example, extracting flow information does not require executing every instruction: only the branch path, i.e. the instructions on which the branch depends, has to be executed in order to compute the target address. Iterative compilation then takes its decisions based on the profiling results, in order to measure the performance of different optimization sequences.
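The sketch below illustrates this selective execution by computing the branch path as a backward slice of the block's final branch; the dst/srcs instruction format is a toy assumption.

```python
# Sketch of coarse-grained flow profiling: execute only the backward
# slice feeding the block's final branch. The dst/srcs instruction
# format is a toy assumption, not the compiler's IR.

def branch_path(block):
    """Return the instructions the final branch depends on."""
    branch = block[-1]
    needed = set(branch["srcs"])
    path = [branch]
    for instr in reversed(block[:-1]):
        if instr["dst"] in needed:          # feeds the branch
            needed |= set(instr["srcs"])
            path.append(instr)
    return list(reversed(path))

block = [
    {"dst": "a", "srcs": ["x"]},
    {"dst": "b", "srcs": ["y"]},            # dead for the branch
    {"dst": "c", "srcs": ["a"]},
    {"dst": None, "srcs": ["c"]},           # the branch itself
]
assert [i["dst"] for i in branch_path(block)] == ["a", "c", None]
```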

C. Partial simulations

Partial simulations occur frequently, e.g. when debugging partially compiled code or when optimizing only a subset of the code regions. In these cases, the simulators' internal states must still be initialized, for which checkpointing mechanisms are used. One way to checkpoint the two functional simulators during debugging is to keep checkpoints for only one of them; its restored state is then used to initialize the state of the other simulator. This limits checkpointing to a single functional simulator.
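A sketch of this single-simulator checkpointing scheme follows: snapshots are kept for the reference simulator only, and the target simulator is then re-initialized from a restored snapshot (e.g. through the variable map of Section III-A). Class and label names are illustrative.

```python
# Sketch of the single-simulator checkpointing scheme: snapshots are
# kept for the reference simulator only; the target simulator is later
# initialized from a restored snapshot through the variable map.

import copy

class CheckpointedSimulator:
    def __init__(self):
        self.state = {}
        self._snapshots = {}

    def save(self, label):
        self._snapshots[label] = copy.deepcopy(self.state)

    def restore(self, label):
        self.state = copy.deepcopy(self._snapshots[label])

ref = CheckpointedSimulator()
ref.state = {"x": 7}
ref.save("before-region-12")
ref.state["x"] = 99                  # simulate past the region
ref.restore("before-region-12")      # rewind for a partial simulation
assert ref.state == {"x": 7}
```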

IV. CASE STUDY

This section evaluates the proposed technique in a case study: the debugging of instruction scheduling on the TRIPS processor.

A. Target Processor

Figure 2 shows an abstract representation of the TRIPS processor. It has a 4×4 grid of execution units (labeled E). Banks of registers (R) are at the top of the processor, while cache memories (D) are on the left. For clarity, instruction caches are not represented in the figure. Each execution unit has two operand buffers, plus instruction buffers holding the instructions that will execute on that particular unit. It also has a small router and output ports for data communication.

B. Debugging Case Study

A code fragment in the sequential internal representation (IR) (Figure 3(a)) is transformed into an execution block in the EDGE IR (Figure 3(b)), which treats register file accesses explicitly as operations. The code manipulates two live variables, x and y. The map (Figure 3(c)) shows the relationship between the program variables in each representation. The values x and y correspond to the TRIPS registers g0 and g1. The intermediate values ti correspond to intermediate results of the basic block that stay inside the processor and are not placed into registers.

Figure 2. Abstract representation of the TRIPS processor

(a) Sequential IR (b) EDGE IR (c) Register map

Figure 3. Mapping a sequential internal representation to EDGE IR.

The internal state of the TRIPS functional simulator holds the contents of the register banks and of the memory, as well as the values held in the operand buffers. The location of data in the communication network is thereby taken into account. This mapping between the simulators' internal states is necessary to perform state comparisons between the different IRs.
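To make the comparison concrete, here is a sketch of the state-equality check for this case, assuming the simulator states are exposed as plain dictionaries; the field layout is an illustrative assumption.

```python
# Sketch of the state comparison between the sequential-IR simulator
# and the TRIPS simulator, using the register map of Figure 3(c).
# The dictionary layout of the states is an illustrative assumption.

def states_equal(ref_state, trips_state, reg_map):
    # Live variables: x -> g0, y -> g1 in the example of Figure 3(c).
    for var, reg in reg_map.items():
        if ref_state["vars"][var] != trips_state["regs"][reg]:
            return False
    # All modified memory must match as well.
    return ref_state["mem"] == trips_state["mem"]

ref   = {"vars": {"x": 7, "y": 9}, "mem": {0x100: 7}}
trips = {"regs": {"g0": 7, "g1": 9}, "mem": {0x100: 7}}
assert states_equal(ref, trips, {"x": "g0", "y": "g1"})
```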

Debugging is performed as explained in Section III. The reference simulator executes the code fragment in the sequential representation, while the target simulator executes the EDGE code block. Figure 4(a) shows the code fragment in EDGE IR placed on the TRIPS processor, as well as the routing resources used for data communication.

In this case study, the scheduling algorithm produced a wrong schedule, as depicted in Figure 4(b). This leads to wrong memory accesses, which are reported to the online debugger at the end of the code block's execution. The online debugger is notified that the comparison failed, together with the precise location of the failing code block, which enables precise debugging when compiling larger programs. A trace of the code block's execution in both simulators can also be sent to the debugger, which can then locate the error by walking back through the dataflow graph of the code block to pinpoint it precisely.

(a) Correct scheduling (b) Wrong scheduling

Figure 4. Scheduling of code fragment onto the TRIPS processor



V. RELATED WORK

Much research has been performed on automatically generating test cases [7], [8], [9]. Black-box approaches randomly generate input data sets that test different paths through a program [9]. White-box approaches do much the same, although they utilize additional program-specific information [8]. It is also possible to generate tests that specifically target null-pointer dereferences using theorem-proving techniques [7]. This paper differs strongly from the cited work, as it applies specifically to the issue of compiler bugs. Here, the problem is not so much that of exercising many control flow paths or testing memory accesses; rather, we wish to automatically verify the algorithmic correctness of the compiler.

Comparison checking is another technique for compiler debugging [10]. In this approach, debugging is performed dynamically by checking that the semantics of the modified code do not change: for a given input, the control flows, memory modifications and output values must be the same. Translation validation proposes to statically validate the original and transformed code of compiler passes using a validator [11]. The validator is a tool that uses rules such as symbolic execution to verify that the transformed code behaves like the original code. A validator has, for example, been proposed for the ORC compiler [12]. Symbolic execution has been applied as a validation rule to verify the execution integrity of basic blocks, as well as the validity of code moved across control flows [13]. In our approach, the use of functional simulators makes it easy to validate code modifications and moves, as in comparison checking, but it also offers interesting opportunities for iterative compilation.

OptiScope uses binary instrumentation to collect characteristics of binary code, in order to evaluate the effects of new optimizations and to help choose the best optimization passes [14]. In our approach, profiling results are obtained without having to go all the way to binary execution, which speeds up compilation.

VI. CONCLUSION

This paper argues for embedding simulation infrastructure in compilers, with two potential uses: debugging compiler optimization passes and profiling compiled code fragments in iterative compilation settings. We decided to develop such an infrastructure in the context of compiling for communication-exposed architectures targeting embedded systems, such as ADRES, TRIPS and WaveScalar, as the nature of these architectures makes it quite complex to add simple fault-checking logic to the compiler. Furthermore, published scheduling algorithms for these architectures are iterative in nature and benefit strongly from a profile-driven iterative compilation infrastructure. The motivation of this work is to obtain the best possible compilation performance in order to fully use the computational capabilities of those architectures, which is important for power efficiency.

We discuss the mechanics necessary for steering the simulation: maintaining a map between the different internal representations of the compiler, setting up the inputs to a code fragment, simulating the internal representation, validating the simulator's state, etc. A case study demonstrates the usefulness of the proposed approach.

In future work, we will develop the profile-driven iterative compilation infrastructure and validate the benefits of selecting different optimization sequences for different code regions.

ACKNOWLEDGMENTS

Thibault Delavallée and Philippe Manet are funded by the Walloon Region of Belgium. Hans Vandierendonck is a Postdoctoral Fellow of the Research Foundation - Flanders.

REFERENCES

[1] P. M. W. Knijnenburg, T. Kisuki, and M. F. P. O'Boyle, "Iterative compilation," in Embedded Processor Design Challenges: Systems, Architectures, Modeling, and Simulation - SAMOS, (London, UK), pp. 171–187, Springer-Verlag, 2002.

[2] B. Bougard, B. De Sutter, D. Verkest, L. Van der Perre, and R. Lauwereins, "A coarse-grained array accelerator for software-defined radio baseband processing," IEEE Micro, vol. 28, no. 4, pp. 41–50, 2008.

[3] D. Burger, S. W. Keckler, K. S. McKinley, M. Dahlin, L. K. John, C. Lin, C. R. Moore, J. Burrill, R. G. McDonald, W. Yoder, and the TRIPS Team, "Scaling to the end of silicon with EDGE architectures," IEEE Computer, vol. 37, no. 7, pp. 44–55, 2004.

[4] S. Swanson, A. Schwerin, M. Mercaldi, A. Petersen, A. Putnam, K. Michelson, M. Oskin, and S. J. Eggers, "The WaveScalar architecture," ACM Trans. Comput. Syst., vol. 25, no. 2, pp. 1–54, 2007.

[5] K. E. Coons, X. Chen, D. Burger, K. S. McKinley, and S. K. Kushwaha, "A spatial path scheduling algorithm for EDGE architectures," in ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 129–140, 2006.

[6] B. De Sutter, P. Coene, T. Vander Aa, and B. Mei, "Placement-and-routing-based register allocation for coarse-grained reconfigurable arrays," in Proc. of the 2010 ACM SIGPLAN Conference on Languages, Compilers and Tools for Embedded Systems (LCTES'10), pp. 151–160, 2010.

[7] C. Cadar, D. Dunbar, and D. Engler, "KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs," in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI'08, (Berkeley, CA, USA), pp. 209–224, USENIX Association, 2008.

[8] V. Ganesh, T. Leek, and M. Rinard, "Taint-based directed whitebox fuzzing," in Proceedings of the 31st International Conference on Software Engineering, ICSE '09, (Washington, DC, USA), pp. 474–484, IEEE Computer Society, 2009.

[9] P. Godefroid, N. Klarlund, and K. Sen, "DART: directed automated random testing," in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '05, (New York, NY, USA), pp. 213–223, ACM, 2005.

[10] C. S. Jaramillo, R. Gupta, and M. L. Soffa, "Verifying optimizers through comparison checking," Electronic Notes in Theoretical Computer Science (COCV'02, Compiler Optimization Meets Compiler Verification), vol. 65, Apr. 2002.

[11] G. C. Necula, "Translation validation for an optimizing compiler," in Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI '00, (New York, NY, USA), pp. 83–94, ACM, 2000.

[12] C. Barrett, Y. Fang, B. Goldberg, Y. Hu, A. Pnueli, and L. Zuck, "TVOC: A translation validator for optimizing compilers," 2005.

[13] J. Tristan and X. Leroy, "Formal verification of translation validators: a case study on instruction scheduling optimizations," in Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '08, (New York, NY, USA), pp. 17–27, ACM, 2008.

[14] T. Moseley, D. Grunwald, and R. Peri, "OptiScope: performance accountability for optimizing compilers," in Proceedings of the 7th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '09, (Washington, DC, USA), pp. 254–264, IEEE Computer Society, 2009.
