Source code profilers for Win32

David Adams - 1997-11-19 16:06 (updated 2024-05-23 08:48)

A review of the following source code profilers for Win32 software development: Intel VTune 2.5, Microsoft Visual C++ 5.0 profiler, Rational Visual Quantify 4.0, TracePoint HiProf 2.0, TracePoint Visual Coverage 1.0, Watcom C++ 11.0 profiler, Win32 SDK profiling tools.

A version of this article appeared in Dr. Dobb's Journal of February 1998.

Introduction
Product overview
Summary of features and test results
Addresses
Case studies
Sidebar: Profiling methods

Introduction

As Jon Bentley already pointed out years ago, programmers cannot ignore efficiency. However, knowing that the 80/20 rule also applies to software performance is a small comfort if you are confronted with tons of source code, half of which isn't even your own, and are wondering how to find those 20% that are causing most of the trouble. Not by intuition, let me tell you that from experience. You need to measure. Here's the good news: I have reviewed the latest profilers, and some are really good. In fact, the good ones are so good that no serious programmer should be without them.

Performance improvements come in varieties. The greatest gains are normally obtained through changes in the algorithm - the obvious example being the replacement of bubble sort by quicksort or heap sort. This is where O(n²) versus O(n log n) space and time bounds are mentioned. The next step is the actual source code implementation, and your choices here include things like automatic versus heap-based memory, working set and virtual memory behavior, the number of times results are recalculated versus being stored and remembered, the structure of loops and branches, and so on. In this phase, the (normally suppressed) constants in front of the O(n log n) bounds are largely determined. As the final performance-tuning step, you'll find processor-specific optimizations that take things like the caching behavior, internal parallelism, branch predictions, etc. into account. Improvements in this area may be reflected in the organization of the source code, or they may use specialized instructions (e.g. MMX intrinsics) or use specialized library primitives.

You need different sorts of information for the different stages in the optimization process. At the algorithm level, high-level overviews of caller/callee relationships and intensity, e.g. in the form of annotated call graphs, are invaluable. At the source code level, you want function-by-function or line-by-line timings and counts. Finally, at the processor level, you need instruction breakdowns annotated with the relevant processor behavior. The middle area is what traditional profilers used to provide; tools for the other two areas are fairly new.

Product overview

For the review, I selected profiling tools that target C and C++ development for Win32 platforms. Some of the tools support other languages or platforms as well, but I'll mention those variations in a sidebar. Here is the roundup, alphabetically:

Intel VTune 2.5
Microsoft Visual C++ 5.0 (profiling tools only)
Rational Visual Quantify 4.0
TracePoint HiProf 2.0
Watcom C++ 11.0 (profiling tools only)
Win32 SDK (profiling tools only)

As an extra, I also tried out TracePoint Visual Coverage 1.0. Although not a profiler in the strictest sense, it uses similar techniques to show which parts and paths of a program are executed, and which are not. This makes it an invaluable tool during testing.

I have used all tools for several weeks as part of my own development process, and have tried to select a number of test scenarios that I considered representative of a fairly broad range of applications (see the sidebar Case studies). All test programs are written in C or C++ and compiled by the Microsoft C++ 4.2 or 5.0 compilers, or by the Watcom C++ 11.0 compiler if I wanted to test its profiling tools. The tests were run on a Windows NT 4.0 Workstation system with a 120 MHz Pentium processor and 64 Mbytes of RAM.

Intel VTune 2.5

Intel makes microprocessors and the VTune profiler shows that. The basic profiler uses sampling to obtain measurements of the program under test, but it also includes a variety of static and dynamic (assembly) code analysis tools that help you make the most of the Intel processors at, shall I say, a painful level of detail. If you are interested in pairing issues, instruction penalties, and processor cache misses, you are in for a treat here. As of version 2.5, VTune also supports profiling Java programs, but I have not tested that capability.

The operation of the program is quite simple: using either the Project Wizard or a property dialog, you specify which program to profile, what options to use for the profiler and the program, and off you go. VTune executes the program, collects samples, and when it's done it displays the Modules Report, the first of a number of bar charts showing the activity in your program and the rest of the system. From there, you can drill down into areas of interest for further graphics, or obtain annotated source code listings.

As an alternative to sampling, the Code Analyzer performs a static code analysis of your program and gives you information about the expected performance and low-level behavior of the various Intel processors. The same information can be obtained through Dynamic Assembly Analysis, which analyses small sections of your program in great detail by actually running the entire program and simulating the performance of the area of interest instruction-by-instruction (as opposed to using the sampling method applied elsewhere). All methods have in common that they can show your source code (where available) interspersed with assembly code, and annotated with remarks about processor performance issues for the Intel processors.

To help you translate this information back to the source code level, the C and FORTRAN Code Coaches are designed to point out source-level improvements. They require preprocessed source code to do so, and accept a Makefile, a command line, or a ready-made preprocessed source file as their input. Selecting an area in this file (normally a function or a possibly nested loop) and invoking the Code Coach results in specific advice on the optimization opportunities that the Code Coach recognizes, most of which have to do with loops and branches.

In addition to the actual profiler, the VTune package also contains the Performance Toolset with C/C++ and FORTRAN compilers that can be plugged into the Microsoft Developer Studio environment, several numerics and signal processing libraries, and a wealth of reference information about the Intel processors. The annoying thing is, however, that the access program is a Win16 application that doesn't recognize long filenames and as a result couldn't start Acrobat Reader located in "C:\Program Files\Acrobat3".

So how useful was VTune during my development? Given that I don't develop high-performance numerical codes or design code generators for a compiler, the level of detail offered by VTune was well beyond my needs. As a C and C++ programmer, I don't have very fine control over the eventual instructions that are fed to the CPU, and after the first amazement over all these processor intricacies, I had very little practical use for them. Things might have been different if the Code Coach had worked properly, but try as I might, I never got beyond VTune's message that it had encountered an error while parsing my (preprocessed) source code files.

The fact that VTune uses sampling as its data collection method means that execution speed is excellent, but that the granularity (which is modifiable) is not always sufficient to capture all required information. In the Constrained Optimization test case, for example, VTune completely missed the index operator that caused the initial performance hit - presumably because the operator itself didn't take long to execute per call.

Then there are a few other matters that hamper effective use of the profiler. First and foremost, it is inconvenient that the profiler samples for a predefined amount of time. It does not stop sampling when your program terminates; you'll have to interrupt the sampling session manually in that case. Conversely, if your program runs longer, you'll have to adjust the length of the sampling session and try again. Second, there was no really good way to restrict profiling to specific areas - in fact, the samples would include all system activities, whether relevant or not. Intel advertises this as a feature, but I'm inclined to see it otherwise. Third, the bars in the various graphics were usually awfully thin, say a pixel or so. Since they are the main means of navigating through the profile data, I had a very difficult time (even with my 20" monitor) pointing with the mouse cursor and selecting the correct one each time. The fact that each drill-down action brings up a new top-level window with yet more bars doesn't help either: I tended to lose track of all the windows and bars rather quickly. Finally, creating new views is rather slow (probably because all sampling data is stored in a Microsoft Access database), and then there is some background activity which causes my hard disk to rattle every few seconds; this got quite unnerving and prompted me to keep each session as short as possible.

To sum up: Intel's VTune excels when it comes to processor-level optimizations, but I found it less useful for general-purpose work. Its ease of use is below that of several of the others, the sampling makes for a fairly coarse granularity, and the information provided is too detailed for common usage. However, static and dynamic analyses give a lot of insight into processor behavior, and the electronic documentation on low-level optimizations and the Intel processors is very valuable even if you don't need to wring out every cycle of performance.

Microsoft Visual C++ 5.0 Profiler

Inside the Microsoft Visual C++ 5.0 (and earlier) box, you'll find a profiler. It isn't much advertised, but it's there nevertheless. This profiler can be used to obtain line or function-level timings and counts, and it can also be used as a simple coverage tool at the line or function level. The profiler tools consist of three console applications. The first modifies the executable or DLL under test by thunking function calls and inserting breakpoints to divert the flow of control to the recording part of the profiler. The second is the actual recorder, and the third creates output lists in a variety of formats. Normally, a batch file controls the operation of the profiler tools and the Visual/Developer Studio contains a command that runs this batch file for you with the correct executable name filled in. In that case, the output listing appears in one of the tabs of the Output Window. You should be aware that this only works for GUI applications; to profile console applications, you have to revert to the command line and manually start the correct batch file.

The profiling tools are fairly versatile. They allow you to fine-tune both the instrumentation process (function level or line level, timings, counts, or coverage), and the recording process (determining start and stop points of the profile, including or excluding specific functions or modules), and let you specify how the output listing should appear (sorted according to some criterion, or in tab-delimited format for use by Microsoft Excel and other tools). If you want, you can also merge data from different runs to obtain an averaged effect. However, the whole process is purely command line and batch file-based (with some help from TOOLS.INI), and is definitely not comparable to the GUI-based competition. Moreover, if you want graphical output, you'll have to use Excel or another tool. Without them, you are looking at (sorted) lists of function and line timings.

In actual use, these tools aren't too bad. In fact, they are the sort of profiling tools that have been around for many years on most platforms. If you are prepared to spend some time learning the command line options, understanding the profiling process, and working your way through the output listings, they will get you most of the way. The standard cases are all handled by a set of straightforward batch files. The things I missed most, apart from ease of use, were the ability to establish caller/callee relationships and an easy means to annotate your source code with the profiling information. In general, the edit-compile-run-analyze cycle tends to be longer with this profiler than with the best GUI-based ones. In addition, the profiling overhead was noticeably larger than with other instrumentation-based profilers (as a group, they are much slower than sampling profilers are anyway).

In summary, the profiling tools that come with the Microsoft C++ compiler are definitely useful, but not as easy to use or as complete in their analysis options as the best of the flock. On the other hand, they are free once you have the compiler.

Rational Visual Quantify 4.0

Building on the same instrumentation techniques that they use in Purify/NT, Rational has introduced a profiler for C/C++, Java, and Visual Basic 5 (I only examined C/C++ programs). The product comes with its own environment, from where you load the program to be profiled. Visual Quantify instruments the program and all the DLLs that it uses (saving the instrumented versions under a different name), then runs it to collect profiling data. When finished, Visual Quantify displays the call graph (with the critical path highlighted), function list, and session summary windows for the run. From here, you can work your way to further information in the form of function details (showing callers and descendants of each function) and annotated source code.

There are many ways to customize Visual Quantify's mode of operation. Before the instrumentation takes place, you can choose between instruction counting (at the line or the function level) or function timing as the measurement method (see the sidebar Profiling methods). Regardless of what you choose for your own modules, certain Windows system modules are always timed rather than instruction counted. For the run itself, you can specify the options to the program under test (although output redirection is not supported). Finally, when the results are displayed, the Filter Manager lets you hide or delete module or function data from the views, thus making the remaining data more prominent. For even further control, a small API is defined that lets your program take control over its own profiling - starting and stopping data collection, clearing the buffers, etc. Obviously, this requires modifications to your source code and recompilation, which is not necessary if you stick to the GUI environment of Visual Quantify proper.

Visual Quantify also offers the ability to work with multiple profiling runs in a given project. With a few simple commands, data from different runs can be merged to obtain an averaging effect for different test cases or diff'ed to see changes in performance. The latter facility is particularly helpful and uses the color coding in the usual call graph and function views to indicate where performance has improved (green) or deteriorated (red). If you run a program with multiple threads, the profile shows the combined timings of all threads. By selecting a thread in the Call Graph and focusing on its subtree, the profile reduces to just that thread.
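The merge and diff operations described above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not Visual Quantify's implementation: the Profile type and the functions mergeAverage and diffProfiles are hypothetical names, and a profile is reduced to a single time value per function.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical profile representation: function name -> time in milliseconds.
using Profile = std::map<std::string, double>;

// Average several runs; a function missing from a run counts as 0 ms there.
inline Profile mergeAverage(const std::vector<Profile>& runs) {
    assert(!runs.empty());
    Profile merged;
    for (const Profile& run : runs)
        for (const auto& entry : run)
            merged[entry.first] += entry.second;
    for (auto& entry : merged)
        entry.second /= static_cast<double>(runs.size());
    return merged;
}

// Per-function change from `before` to `after`:
// negative = improved (less time), positive = deteriorated (more time).
inline Profile diffProfiles(const Profile& before, const Profile& after) {
    Profile delta;
    for (const auto& entry : after) delta[entry.first] += entry.second;
    for (const auto& entry : before) delta[entry.first] -= entry.second;
    return delta;
}
```

A viewer could then color each function by the sign of its delta, in the way the text describes green for improvements and red for deteriorations.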

In day-to-day development, Visual Quantify is a pleasure to work with. Its features are well thought out, its user interface is intuitive and helpful, and the various views help to analyze the profiling data in several ways, from the high-level overviews and call patterns to detailed breakdowns per function and source code line. With the information presented as it is, you are likely to emerge with a better understanding of your program's behavior than you thought possible. However, there were a few quirks during testing. For large profiles, Visual Quantify requires prodigious amounts of virtual memory; the README file advises reserving 200 Mbytes! Furthermore, C++ function names that had a 'bool' parameter were not unmangled. Surprisingly, in view of all the information available, the annotated source code view does not show line counts; it only displays line timings. Finally, Visual Quantify is limited to Windows NT and Microsoft compilers.

TracePoint HiProf 2.0

The introduction of TracePoint's HiProf 1.0 profiler broke new ground for Win32 profiling tools and HiProf 2.0 has several improvements over its predecessor, including support for Visual Basic 5. The product has its own GUI-based workbench from where applications are loaded, instrumented, and run. The instrumentation process prepares your executable and its modules for the data collection run, but skips over any modules it considers as "system modules". Instead of instrumenting these modules, HiProf uses Call Site instrumentation - basically adding timing code to the callers of non-instrumented module functions. This cleverly sidesteps any problems that might arise from modifying system modules, but it also prevents data collection on functions only used inside those modules. After instrumentation, the program is run under HiProf's control and afterwards the results are displayed in a variety of formats: the Function View (a list of functions), the Hierarchical View (a pie chart breakdown of function callers and descendants), the Critical Edge list, and the Critical Chain. Further views include the Source View and a navigator view with tabs for profiles and modules. Wherever sensible, views are linked so that navigation in one view is tracked by the other views.
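The idea behind call-site instrumentation can be illustrated at the source level. HiProf works on binaries, so the following is only a hand-written analogy under that assumption: timedCall, CallSiteStats, averageNanoseconds, and systemFunction are hypothetical names invented for this sketch. The timing code sits in the caller, and the callee is left untouched.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical per-call-site record.
struct CallSiteStats {
    long calls = 0;
    std::chrono::nanoseconds elapsed{0};
};

// Time a call to a function we cannot (or do not want to) instrument.
template <typename F, typename... Args>
auto timedCall(CallSiteStats& stats, F&& callee, Args&&... args)
    -> decltype(std::forward<F>(callee)(std::forward<Args>(args)...)) {
    // The guard records the elapsed time when the call returns,
    // which also works for callees that return void.
    struct Guard {
        CallSiteStats& stats;
        std::chrono::steady_clock::time_point start;
        ~Guard() {
            ++stats.calls;
            stats.elapsed += std::chrono::duration_cast<std::chrono::nanoseconds>(
                std::chrono::steady_clock::now() - start);
        }
    } guard{stats, std::chrono::steady_clock::now()};
    return std::forward<F>(callee)(std::forward<Args>(args)...);
}

// Average time per call at this call site.
inline double averageNanoseconds(const CallSiteStats& stats) {
    assert(stats.calls > 0);
    return static_cast<double>(stats.elapsed.count()) /
           static_cast<double>(stats.calls);
}

// Stand-in for a function inside an uninstrumented "system module".
inline int systemFunction(int x) { return x + 1; }
```

Because only the call site is measured, anything systemFunction calls internally stays invisible, which mirrors the limitation mentioned above for functions used only inside system modules.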

HiProf's operation is subject to a number of settings. The most important is the actual measurement method: instruction counting (with the ability to specify a processor model) or time stamps (see the sidebar Profiling methods), with the proviso that calls to functions in system modules are always timed. Runtime options determine the command line arguments to the program under test (command line redirection is supported) and the use of HiProf's console, which is a small control unit to pause and resume the profiling process, and to store snapshots or clear the profiling data. For finer control, HiProf offers Tracepoints: a kind of breakpoint in your program that causes HiProf to execute some action, typically starting or stopping data collection, or storing a snapshot. Tracepoints can only be set on entries or exits of functions, but they are ideal for concentrating on specific parts of your program without any changes to the source code.

A given project may contain many snapshots, either from different runs or from different stages within a run. However, to compare two snapshots, you'll either have to start a second instance of HiProf and arrange the views side by side, or use a command line utility to merge or diff the data from different snapshots. Neither approach is ideal and this is probably the weakest point in an otherwise excellent product. For multi-threaded programs each thread is shown separately, which is a good default, but to obtain the combined data for the program, you'll have to use the command line tool once again.

In actual use, HiProf 2.0 matches Visual Quantify in features and ease of use. With the introduction of new views in version 2.0, the data analysis views are on a par with Visual Quantify. They give excellent information from a lot of different perspectives, and the internal synchronization among the views means that you waste no time coordinating different sorts of information. In combination with its tracepoints, HiProf allows you to tailor both data collection and data presentation without having to touch your source code. In fact, HiProf made profiling and optimizing an application almost addictive - a far cry from the laborious process that profiling used to be.

TracePoint Visual Coverage 1.0

Standing a bit apart from the crowd, Visual Coverage is a tool to determine which parts and paths of your program have been reached during a particular run. After instrumenting your program and running it through one or more test cases similar to TracePoint's HiProf profiler, data is displayed on various coverage metrics. You can choose from function coverage (the percentage of functions reached by the test runs), line coverage (ditto, for source code lines), code coverage (for CPU instructions), edge coverage (for branches), and call-pair coverage (for call sites). Visual Coverage uses almost the same views as HiProf 1.0 does for displaying data, so you'll find a function list, a hierarchical view, a source view, and, unique to Visual Coverage, a function distribution view that displays a bar chart with coverage percentages. Where appropriate, views can be switched to display the different forms of coverage data. Finally, since coverage information is usually collected over several test runs, both the combined and the separate data can be viewed.
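To make the simplest of these metrics concrete, here is a minimal sketch of function coverage combined over several test runs. It assumes each run is already available as a set of reached function names; FunctionSet, combineRuns, functionCoverage, and unreached are hypothetical names, and nothing here reflects how Visual Coverage stores its data.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

using FunctionSet = std::set<std::string>;

// Union of the functions reached in any of the runs.
inline FunctionSet combineRuns(const std::vector<FunctionSet>& runs) {
    FunctionSet reached;
    for (const FunctionSet& run : runs) reached.insert(run.begin(), run.end());
    return reached;
}

// Percentage of the known functions that were reached.
inline double functionCoverage(const FunctionSet& all, const FunctionSet& reached) {
    assert(!all.empty());
    std::size_t hit = 0;
    for (const std::string& fn : all)
        if (reached.count(fn) != 0) ++hit;
    return 100.0 * static_cast<double>(hit) / static_cast<double>(all.size());
}

// Functions never reached: candidates for new test cases, or dead code.
inline FunctionSet unreached(const FunctionSet& all, const FunctionSet& reached) {
    FunctionSet missing;
    for (const std::string& fn : all)
        if (reached.count(fn) == 0) missing.insert(fn);
    return missing;
}
```

Line, edge, and call-pair coverage follow the same pattern with a different kind of item in the sets; the hard part, which this sketch leaves out, is collecting those items from an instrumented run.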

Coverage tools are intended as an aid during testing. In particular, they help to find out which parts of a program have been exercised and which have not. To some extent, profilers can be used for the same purpose, but information like edge and call-pair coverage is difficult to obtain without special measurements. In addition, Visual Coverage sorts out all data for you, and as with the profilers, a clear presentation of the data is half the battle as far as analysis is concerned. Visual Coverage does an admirable job here; apart from the views mentioned, several higher level selection options filter the data before they get to the views. To this end, the navigator window contains several predefined categories that isolate dead or unused functions, functions organized per class, or functions organized per module. If desired, further categories can be defined based on coverage type and cut-off percentages for filtering.

In daily use, I found that using a coverage tool requires more discipline than a profiler does. While a profiler provides instant gratification when you see the performance improvements, a coverage tool sits there as an administrator and points out that you still haven't tested all your code. To be of any real use, therefore, you need to be systematic in your approach to test cases and you must be prepared to spend considerable time studying the coverage information and building new test cases. For console applications, this is not normally a problem; for GUI applications, however, this means that you have to instrument the application first, then use Rational's Visual Test or a similar test harness to run the instrumented version of the application through different test scenarios. The resulting coverage data can then be viewed with the Visual Coverage GUI.

Watcom C++ 11.0 Profiler

Similar to Microsoft, the Watcom C/C++ compiler comes with its own profiler. This one is based on sampling and is packaged as two separate programs: one to collect the samples, the other to present the results. Operation is simple: from within an open project in the Watcom IDE, you issue the Sample command (possibly after setting options to control the sampling process and the executable under test) and when it is finished, the Profile command will list the results as a bar chart showing images (=sample run snapshots) and modules. Drilling down leads you to individual functions and finally to source code annotated with bar graphs that indicate the percentage of time spent in that function. If you want, you can export the sample data to DIF or comma-separated text files.

The whole process has very few frills. Unfortunately, not much information is obtained either. The sampling process only records hits per function, and all you get (be it in graphical format, or as an exported data file) are the names and the hits per function. Although this does give a coarse picture of the program's behavior, I found it insufficient for any sensible optimizations. For example, the MkDep test case spends most of its time in the Windows function ReadFile(), and the Watcom profile never gave me any clue to suspect that function or its callers. Likewise, the index operator problem in the constrained optimization sample was missed completely. I assume that this very basic approach to profiling is caused by Watcom's desire to implement the profiling tools on all platforms they support (and any other profiling method would mean much more platform-specific adaptations), but I'm not so happy with the end result.

On the whole, you will probably find the Watcom profiler insufficient for any serious profiling needs. All is not lost, however: the Win32 SDK that is included with the Watcom compiler package contains some profiling tools that are much more useful, at least when it comes to profiling Win32 programs.

Win32 SDK Profilers

Often overlooked in the mass of tools and information that current C/C++ compilers provide, the Win32 SDK contains a lot of useful programs. For profiling purposes, I found no fewer than five tools that would give information about an application's performance, and that is not counting PVIEW or PERFMON. Some of the tools are very specialized, but at least one has features that make it a viable alternative to the commercial competition - if you are willing to spend some time learning it. Also, please note that several of these tools assume specialized compiler or linker options (or compatibility) which may not be available in all compilers.

APIMON: API Monitor

APIMON is a stand-alone GUI-based program that runs other applications and keeps track of the Win32 API functions they call (complete with parameter and return values), how much time they spend there, and a variety of other things such as heap checking and page faults. It is very simple to operate and gives yet another view on your application's behavior. However, there is no way to relate the information to any specific locations inside your program, and the tool does not provide for arguments to the program under test, which severely limited serious testing of command line applications.

CAP: Call Attributed Profiler

This is the tool that most closely resembles the commercial profilers. It requires you to compile your code with a special compiler option (/Gh for Microsoft C/C++) that inserts _penter() hooks at the start of each function. These hooks are resolved by linking with the CAP.LIB import library, and at runtime the associated CAP.DLL module will be loaded and used to record the time spent per function. With the aid of some further programs, most notably CAPVIEW, the captured profile is then displayed as an annotated call tree or as a list of function counts and timings (with time per function and time spent in descendants separated). Colors are used to mark the most critical functions. I was really surprised by the usefulness of this program. Although it is not as versatile as the command line profiler that comes with the Microsoft compiler, it does provide a fairly accurate picture of the most critical performance data, and it does so in a format that is very usable.
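For readers who have not seen entry hooks at work, the effect can be approximated by hand in portable C++. The sketch below is not how CAP is implemented: instead of compiler-inserted _penter() hooks it uses an explicit object at the top of each function, and ScopedProbe, FunctionStats, profileTable, leaf, and parent are all hypothetical names. It records call counts and inclusive time (the function plus its descendants) per function.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical per-function record: call count and inclusive elapsed time.
struct FunctionStats {
    long calls = 0;
    std::chrono::nanoseconds inclusive{0};
};

// Global table keyed by function name (illustrative only; not thread-safe).
inline std::map<std::string, FunctionStats>& profileTable() {
    static std::map<std::string, FunctionStats> table;
    return table;
}

// Construction marks function entry, destruction marks function exit.
class ScopedProbe {
public:
    explicit ScopedProbe(const char* name)
        : name_(name), start_(std::chrono::steady_clock::now()) {
        assert(name != nullptr);
    }
    ~ScopedProbe() {
        FunctionStats& stats = profileTable()[name_];
        ++stats.calls;
        stats.inclusive += std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - start_);
    }
    ScopedProbe(const ScopedProbe&) = delete;
    ScopedProbe& operator=(const ScopedProbe&) = delete;

private:
    const char* name_;
    std::chrono::steady_clock::time_point start_;
};

// Two functions instrumented by hand.
inline int leaf(int x) {
    ScopedProbe probe("leaf");
    return x * 2;
}

inline int parent(int n) {
    ScopedProbe probe("parent");
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += leaf(i);
    return sum;
}
```

Separating time spent in a function itself from time spent in its descendants, as CAPVIEW does, would additionally require tracking the current call stack; that is left out here to keep the sketch short.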

FIOSAP: File I/O and Synchronization Win32 API Profiler

This one is designed to help identify I/O and synchronization bottlenecks in multi-threaded programs, although you can also use it to monitor file I/O activity in single-threaded programs. It uses a small helper program to patch your executable and reroute all KERNEL32 calls to its own FERNEL32 module, which collects data and forwards the call to the appropriate KERNEL32 function. The data collection comprises file I/O functions and operations on synchronization primitives such as semaphores, events, and mutexes. The information can be used to assess the amount of time spent waiting on the various operations. However, the results are summed over all threads in an application, and it is up to you to find out what the actual causes of performance loss in this area are.

PROFILE: Win32 Sampling Profiler

(Do not confuse this one with the Microsoft Visual C++ profiler of the same name.) It operates essentially the same as the Watcom profiler and runs your program while taking periodic samples of the instruction pointer's location. The result is a text file which indicates the number of hits per function. While it operates a lot faster than CAP, I don't find the information thus gathered very useful.
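The bookkeeping behind such a hits-per-function listing is simple, which also explains why the output is so sparse. The following sketch assumes the sampled instruction-pointer values and a symbol table sorted by start address are already available; Symbol, startsBefore, and countHits are hypothetical names, and the sampling itself (which is platform-specific) is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Hypothetical symbol-table entry: a function's start address and name.
struct Symbol {
    std::uintptr_t start;
    std::string name;
};

inline bool startsBefore(const Symbol& a, const Symbol& b) {
    return a.start < b.start;
}

// Attribute each sample to the function with the greatest start address
// that is <= the sampled address. `symbols` must be sorted by start address.
inline std::map<std::string, int> countHits(
        const std::vector<Symbol>& symbols,
        const std::vector<std::uintptr_t>& samples) {
    assert(std::is_sorted(symbols.begin(), symbols.end(), startsBefore));
    std::map<std::string, int> hits;
    for (std::uintptr_t ip : samples) {
        // Find the first symbol whose start is strictly greater than ip.
        auto it = std::upper_bound(
            symbols.begin(), symbols.end(), ip,
            [](std::uintptr_t addr, const Symbol& s) { return addr < s.start; });
        if (it == symbols.begin()) {
            ++hits["<unknown>"];  // sample lies below the first known function
        } else {
            ++hits[std::prev(it)->name];
        }
    }
    return hits;
}
```

Note what is missing from the result: there are no call counts and no caller information, only a tally of where the program happened to be when each sample was taken.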

WST: Working Set Tuner

Its purpose in life is to help you reduce the working set of your program. Similar to CAP, WST requires recompilation with the insertion of _penter() hook functions and resolves these in the WST.LIB import library. At runtime, the WST.DLL module takes frequent snapshots of which functions were called during the period of time preceding the snapshot. The result shows which functions are used close together in time, and the WSTUNE program applies this information to produce a packing list for the linker that places functions that are used close together in time physically close together as well, thus reducing the working set of your application. The whole process is something that you want to do when your application is almost ready for shipment, because during development the constant addition and removal of functions invalidates any WST results rather quickly. Nevertheless, WST can give your applications the final touch when it comes to performance.
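As a rough illustration of how snapshots can be turned into a packing list, the sketch below orders functions by the first snapshot in which they appear. This first-touch heuristic is an assumption made for the example and is almost certainly simpler than what WSTUNE actually does; Snapshot and packingOrder are hypothetical names.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// One snapshot = the functions that were called during one time interval.
using Snapshot = std::vector<std::string>;

// Produce a link order in which functions first used in the same snapshot
// end up next to each other (first-touch ordering).
inline std::vector<std::string> packingOrder(const std::vector<Snapshot>& snapshots) {
    std::vector<std::string> order;
    std::set<std::string> placed;
    for (const Snapshot& snapshot : snapshots) {
        for (const std::string& fn : snapshot) {
            assert(!fn.empty());
            // insert() reports whether the name was new; place it only once.
            if (placed.insert(fn).second) order.push_back(fn);
        }
    }
    return order;
}
```

Functions that never show up in any snapshot are simply absent from the list; a real tool would place them last, away from the pages that are touched during normal operation.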

Summary of features and test results

Legend: ++=best (in the case of runtime overhead, this means least overhead), +=good, 0=reasonable, -=marginal, --=worst.

Feature	VTune 2.5	Microsoft Visual C++ 5.0	Visual Quantify 4.0	HiProf 2.0	Watcom C++ 11.0	Win32 SDK tools
Languages supported	C, C++, Fortran, Java	C, C++	C, C++, Java, Visual Basic 5	C, C++, Visual Basic 5	C, C++	C, C++
Profiling method(s)	Sampling, event-based sampling, code analysis	Timing, line counting	Timing, instruction counting	Timing, instruction counting	Sampling	Sampling, timing, API counting
Accuracy	- (sampling) ++ (analysis)	+	++	++	-	- to +
Multi-threaded information	-	-	++	+	-	- to +
Presentation and analysis views	0	-	++	++	0	- to +
View options	0	-	++	++	0	- to +
Merging & diff'ing	0	0	++	+	-	-
Export formats	Access, tab-delimited	Text, CSV	Tab-delimited	Excel*	DIF, tab-delimited	Text, CSV
Runtime overhead	++	--	-	-	++	-- to +
Ease of use	+	-	++	++	+	-- to 0

* Note: requires clipboard copy & paste.

Addresses

Intel VTune:	Intel Corporation.
Microsoft Visual C++:	Microsoft Corporation.
Visual Quantify:	Rational, Inc.
HiProf & Visual Coverage:	TracePoint, Inc.
Watcom C++:	Sybase, Inc.

Case studies

The test cases were selected as a representative sampling of profiling tasks. All applications are written in C or C++ and run on the Win32 platforms.

MkDep

MkDep is a console application that reads C and C++ source files and generates a list of #include dependencies suitable for use in a Makefile. On the whole, the program is I/O-bound, but the question was, where are the bottlenecks? My hunch before I got to profiling was that the file I/O would be the problem, and I had spent quite some time optimizing this area with special buffering and things. The benchmark I used throughout testing was the dependency list for MFC 4.2. In the original version of the program, it took about 5:00 (mins:secs) on my system to process all those files. In the version as it currently stands, this has been reduced to 0:25, and I expect some further improvements (in the order of 25%).

The improvements came from reducing I/O traffic, but not in the way I expected. It turned out that the original program was spending 75% of its time in the _access() function I used to look for header files along the INCLUDE paths. I never did suspect that function until I profiled. Still, all that function did under Win32 was call GetFileAttributes(), and there is little room for improvement there. Fortunately, the profiles showed that it wasn't just the time spent per call, but also the number of times it was called. To cut a long story short: I implemented a caching scheme (in effect storing the entire dependency tree of a MkDep run internally), simplified file I/O, and trimmed down the program in some other areas as well. As it stands, the program processes each file in the entire run exactly once, using one ReadFile() call per file to do so. This is of course optimal (unless you start using heuristics to avoid processing some files, but that might lose information), and it accounts for 75% of the current runtime. The remainder is taken by _access() calls that fail while searching the INCLUDE path, and the final I/O for the actual dependency lists. There is still room for improvement but it can be at most 25%, which is good to know because it sets realistic expectations and helps to gauge how much effort should go into further optimizations.
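The general principle, paying for an expensive lookup only once per distinct argument, can be shown with a small cache. This is a simpler scheme than the dependency-tree cache described above and is not taken from MkDep: ExistenceCache is a hypothetical class, and the probe function is injected so that the sketch does not depend on _access() or any other particular API.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Caches the result of an expensive existence check per path.
class ExistenceCache {
public:
    explicit ExistenceCache(std::function<bool(const std::string&)> probe)
        : probe_(std::move(probe)) {
        assert(probe_ != nullptr);
    }

    bool exists(const std::string& path) {
        auto it = cache_.find(path);
        if (it != cache_.end()) return it->second;  // answered from memory
        ++probeCalls_;
        bool found = probe_(path);
        cache_.emplace(path, found);  // negative results are cached as well
        return found;
    }

    // Number of times the expensive probe was actually invoked.
    long probeCalls() const { return probeCalls_; }

private:
    std::function<bool(const std::string&)> probe_;
    std::unordered_map<std::string, bool> cache_;
    long probeCalls_ = 0;
};
```

Caching negative results matters in this setting, because a header search along several INCLUDE directories fails in most of them. A cache like this is only valid as long as the files do not change during the run, which is a reasonable assumption for a single dependency scan.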

Constrained optimization

Constrained optimization programs tend to do a lot of internal processing and are therefore mostly CPU-bound. This test program is another console application that attempts to solve an optimization problem using clever searching techniques. In the process, it uses large graph-like data structures that display poor locality of reference. To really improve performance in this sort of NP-hard application (and we are talking about performance improvements on the order of 10¹⁰ or more compared to just blind searching) you need to improve the search methods, but I used the profilers to see if I could make the operational side of things just mildly faster (say, a factor of 2 or 3, which is peanuts to researchers in this field, but still worth some effort).

Before the profiling sessions, I didn't really have a good idea about how time was spent in the program. I suspected that memory allocations might play a role (there are a lot of dynamic data structures), but otherwise I was more or less clueless. Surprise: the profiling sessions revealed that most of the time was spent inside a lowly overloaded index operator in one of the array classes used in the program, and in the dynamic_cast<> operator used in several critical places. The index operator itself wasn't particularly complicated, but it did bounds checking on each call and was called very often - well over 43 million times to process just 100,000 nodes in the search tree. The dynamic_cast<> operator was used for a downcast somewhere in the program to obtain application-specific information from a generic tree node. It too was called often (1.4 million times).
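
To make the two hot spots concrete, here is a minimal sketch of one common way to cheapen them. It is not necessarily what was done in the test program, and the Array, Node and SearchNode classes are hypothetical. The index operator keeps its bounds check only in debug builds, and a type tag stored in the base class allows the downcast to be done with a comparison and static_cast<> rather than dynamic_cast<>.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical array class: operator[] verifies the index only in debug
// builds (assert compiles away when NDEBUG is defined); at() always checks.
template <typename T>
class Array {
public:
    explicit Array(std::size_t n) : data_(n) {}

    T& operator[](std::size_t i) {
        assert(i < data_.size());  // no cost in release builds
        return data_[i];
    }

    T& at(std::size_t i) {
        if (i >= data_.size()) throw std::out_of_range("Array index");
        return data_[i];
    }

    std::size_t size() const { return data_.size(); }

private:
    std::vector<T> data_;
};

// Hypothetical node hierarchy: the tag in the base class identifies the
// concrete type without consulting run-time type information.
struct Node {
    enum Kind { Generic, Search };
    explicit Node(Kind k) : kind(k) {}
    virtual ~Node() {}
    const Kind kind;
};

struct SearchNode : Node {
    explicit SearchNode(int c) : Node(Search), cost(c) {}
    int cost;  // stands in for the application-specific information
};

// Returns the node as a SearchNode, or nullptr if it is some other kind.
// Only valid as long as every SearchNode sets its tag correctly.
inline SearchNode* asSearchNode(Node* n) {
    return (n != nullptr && n->kind == Node::Search)
               ? static_cast<SearchNode*>(n)
               : nullptr;
}
```

The trade-off is that the tag must be maintained by hand for every new node type, whereas dynamic_cast<> is always right; that is a price worth paying only where a profile shows the cast to be a hot spot.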

After removing these bottlenecks, the program was about 1.24 times faster than the original one, and things became more complicated because the cycle eaters were more evenly distributed. In the end I achieved a speedup of almost a factor of 2; further improvements would have required specialized memory allocators or changes to the way information was stored. Still, without the profiling information, it wouldn't have occurred to me to look at those particular functions - even though I was the one who designed and implemented both the data structures and the searching algorithm. It has been said before, but it's worth repeating: a programmer's intuition is of hardly any help in locating hot spots in a program.

Simulation model calculations

This case is part of a Win32 GUI program that uses MFC as its application framework. The application itself is a business simulation that processes decisions taken by various "companies" staffed with management trainees. The program displays the results in different textual and graphical formats. The challenge was to find the bottlenecks in the actual model calculations, since they seemed to be slower than necessary. The problem here is the interactive nature of the application: how do you make sure that the model calculation timings aren't drowned in all the message processing and surrounding activities of the program?

As it turned out, the calculations themselves were no problem, but the code also contained calls to logging functions which created an audit trail of the simulation model's decisions. These logging functions were the real time eaters, and I could only partially remedy that problem. However, only a few profilers allowed me to isolate the calculations from the rest of the program and make the analysis obvious. HiProf's tracepoints came in wonderfully handy here; for Visual Quantify I inserted API calls to start and stop the profiling at the correct locations. All other profilers left me wading through long lists of irrelevant information, although with extra effort I could have configured the Microsoft Visual C++ profiler to exclude nearly everything, except for the functions I was interested in. Still, this would have meant a fair bit of work, which would have to be repeated if I had turned to other areas of the program.
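
The idea of bracketing a region of interest can be imitated by hand when no profiler is available. The sketch below is a hand-rolled substitute, not the HiProf tracepoint mechanism or the Visual Quantify API: RegionTimes and ScopedRegion are hypothetical names, and the timer is the standard C++ steady clock. A guard object starts timing when it is created and records the elapsed wall clock time when it goes out of scope, so the message loop and other surrounding activity are simply never measured.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical region timer: accumulates wall clock time per named region.
class RegionTimes {
public:
    using Clock = std::chrono::steady_clock;

    void add(const std::string& region, Clock::duration elapsed) {
        assert(elapsed >= Clock::duration::zero());  // steady_clock never runs backwards
        Entry& e = totals_[region];
        e.total += elapsed;
        ++e.calls;
    }

    // Total time spent in the region, in seconds (0 if never entered).
    double seconds(const std::string& region) const {
        auto it = totals_.find(region);
        if (it == totals_.end()) return 0.0;
        return std::chrono::duration<double>(it->second.total).count();
    }

    // Number of times the region was entered.
    unsigned long calls(const std::string& region) const {
        auto it = totals_.find(region);
        return it == totals_.end() ? 0UL : it->second.calls;
    }

private:
    struct Entry {
        Clock::duration total = Clock::duration::zero();
        unsigned long calls = 0;
    };
    std::map<std::string, Entry> totals_;
};

// RAII guard: starts timing on construction, records on destruction.
class ScopedRegion {
public:
    ScopedRegion(RegionTimes& times, std::string name)
        : times_(times), name_(std::move(name)),
          start_(RegionTimes::Clock::now()) {}
    ~ScopedRegion() { times_.add(name_, RegionTimes::Clock::now() - start_); }

    ScopedRegion(const ScopedRegion&) = delete;
    ScopedRegion& operator=(const ScopedRegion&) = delete;

private:
    RegionTimes& times_;
    std::string name_;
    RegionTimes::Clock::time_point start_;
};
```

Placing one guard around the model calculation and a second one inside the logging function would be enough to show how the time divides between the two, although a real profiler gives far more detail for far less effort.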

Multi-threaded record processing

The final test is a very simple multi-threaded program that I borrowed from the HiProf examples to see how the profilers dealt with multi-threaded programs. I did not attempt to optimize anything here; I was just interested in the information that would be obtained from this program.

I was slightly disappointed. With the exception of HiProf and Visual Quantify, all profilers just lump together their timing or sampling information and never show how the time was distributed across threads. Visual Quantify has the most convenient way of separating and combining per-thread information through its Call Graph. HiProf shows all threads separately, but its somewhat involved merging procedure makes it hard to view the combined information.


Sidebar: Profiling methods

The profilers in this review employ one or more of the following methods to obtain timing information from the program under test.

Sampling	The simplest method is sampling. In this method, the program is interrupted frequently (say, every fraction of a millisecond) and a note is made of the location of its instruction pointer. With the aid of a map file or debug information, this location is translated back to a function name or source code position. Normally, all samples within a single function are lumped together. The advantages of this method are its simplicity and the fact that it has comparatively little impact on the execution speed of the program under test; the primary drawback is the coarseness of the information. Even at the best of times the samples are only an approximation of where the time is spent in a program, and for a variety of reasons (granularity, resonance) the collected samples can be downright misleading.
Event-based sampling	Supported only by Pentium Pro processors and later, event-based sampling uses internal counters in these processors to collect information about performance-related events such as cache misses. VTune can use this sampling mode as an alternative to standard sampling, and will collect both the event data and the regular instruction pointer samples by interrupting the processor at frequent intervals.
Timing	A better method is to actually use a timer to measure the time spent in a function. This requires that the profiler be notified when a function is entered and when it is left (accomplished through instrumentation or the insertion of hook functions), and an accurate timer. The latter may be provided by the operating system or by the processor; for example, the Pentium and later CPUs from Intel contain a high-resolution counter that is used by several profilers for this purpose. Information collected this way is basically a record of wall clock time, which means that it also takes into account processor behavior such as cache misses, and background activity in the rest of the system. Depending on the purpose, this could be an advantage or a disadvantage, but in any case it is not so repeatable as instruction counting is.
Line counting	Since the overhead of timing is usually too large to make it practical at the line level, line counting is often used to obtain an indication of the program's performance at the source code line level. It requires some form of breakpoints at the line level and introduces a large amount of overhead. In addition, unless coupled with instruction counting, it may give little information about the actual time spent in some section of code.
Instruction counting	The final method is instruction counting. In essence, the profiler counts how often a particular instruction (or more typically, a particular basic block in the program) is executed, then multiplies that by the number of clock cycles required for each instruction. Obviously, this requires detailed information about the underlying processor and is often dependent on the exact version of the processor as well. Furthermore, the number of clock cycles per instruction may vary greatly with the dynamic behavior of the program and its environment (considering things such as cache and pairing behavior, branch prediction, etc.), so the profiler must make some assumptions here. Usually, profilers assume optimal execution. Instruction counting therefore does not include any ambient effects, which makes it more reproducible and in some sense "purer", but it tends to give a somewhat optimistic view of the program's speed. Even so, it is used as the most accurate method in the best profilers.
Code analysis	Intel's VTune provides code analysis, a different method that uses very detailed information about processor behavior to indicate performance problems at the instruction level. This is not profiling in a strict sense, but it does give performance information. In this method, knowledge about the processor's behavior is applied to the instruction sequences found in the program and results in reports of penalties incurred, pairing issues, and expected cache behavior. A companion Code Coach will sometimes advise on rearrangements of the source code that reduce the performance penalties.
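
The bookkeeping behind the sampling method described above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation of any profiler in this review: the Symbol structure and the resolve and histogram functions are hypothetical, and the symbol table is assumed to have been read from a map file and sorted by start address. Each sampled instruction pointer is attributed to the last function that starts at or below it, and the samples are then lumped together per function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One entry of a hypothetical symbol table, as might be read from a map file.
struct Symbol {
    std::uint64_t start;  // address of the function's first instruction
    std::string name;
};

// Maps a sampled instruction pointer to the function that contains it.
// `symbols` must be sorted by ascending start address. Since no end
// addresses are stored, anything past the last symbol is attributed to it.
std::string resolve(const std::vector<Symbol>& symbols, std::uint64_t ip) {
    auto it = std::upper_bound(
        symbols.begin(), symbols.end(), ip,
        [](std::uint64_t addr, const Symbol& s) { return addr < s.start; });
    if (it == symbols.begin()) return "<unknown>";  // below the first symbol
    return (it - 1)->name;
}

// Lumps all samples together per function, as sampling profilers normally do.
std::map<std::string, std::size_t> histogram(
        const std::vector<Symbol>& symbols,
        const std::vector<std::uint64_t>& samples) {
    assert(std::is_sorted(symbols.begin(), symbols.end(),
                          [](const Symbol& a, const Symbol& b) {
                              return a.start < b.start;
                          }));
    std::map<std::string, std::size_t> counts;
    for (std::uint64_t ip : samples) ++counts[resolve(symbols, ip)];
    return counts;
}
```

A function's share of the samples is then only an estimate of its share of the run time, which is where the granularity and resonance problems mentioned above come from.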
