Destroy or destruct?

A discussion of the use of virtual destructors in C++, versus an alternative approach for polymorphic object destruction. Includes empirical data.

Introduction

Some time ago, Michael van Rooyen [1] kicked up some dust when he described an alternative to the use of virtual destructors to accomplish the effect of virtual destruction. He argued that his alternative would impose less of a penalty than an (inline) virtual destructor, while still achieving the desired polymorphic effect. Subsequently, Irfan Pyarali [2] and Tim Peierls [3] discussed some related issues. Taken together, there seem to be two points emerging from the discussion.

  1. Is it worthwhile to circumvent virtual destructors (by introducing another mechanism for polymorphic object destruction)?
  2. Can we find a way to reliably detect or mark dynamic objects, even in the presence of multi-threading and complicated initialization schemes such as presented by Tim Peierls?

My short initial answers to these two questions are "no" and "not without the help of the compiler", respectively. Nevertheless, I will do the following:

  A. Gather empirical data on the runtime costs involved in Van Rooyen’s problem and his solution.
  B. Discuss other ways of determining the "heapness" of objects.

Issue A will be dealt with in this article, while issue B is taken up in its companion "On the heap?".

Virtual destructors versus destroy()

Van Rooyen effectively proposes to replace virtual destructors by a supposedly equivalent construction consisting of a non-virtual destructor and a virtual destroy() member function (see [1] for details). The result would also allow polymorphic destruction, albeit through a call to destroy() rather than to operator delete. The benefits of this alternative would be reduced code size and/or increased speed of destruction. These benefits would arise from a combination of two factors: the destructor, being non-virtual, could be made inline (increasing runtime speed) without unduly increasing object code size. Van Rooyen argues that this same combination could not be obtained by an inline virtual destructor, because in that case, there would either be an extra indirect function call, or a duplication of object code, or both.

To dispense with at least part of the assumed overhead right away: the extra indirect function call can only be a problem if the compiler cannot unambiguously determine the static type of the object involved. As a result, only destruction through object reference variables (i.e., pointers or references) can incur this overhead; static and automatic objects do not suffer from it. This holds regardless of whether the object’s destructor (or any other member function, for that matter) is virtual.
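To make this concrete, here is a minimal sketch (the class names are mine and do not appear in the benchmark):

class Base {
public:
  virtual ~Base() {}        // virtual destructor
};

class Derived: public Base {
public:
  ~Derived() {}
};

void example()
{
  Derived d;                // automatic object: the destructor call at the end
                            // of the scope is resolved at compile time (and may
                            // be inlined), even though the destructor is virtual

  Base *p = new Derived;
  delete p;                 // only here is the indirect call through the vtable
                            // needed, because the static type (Base *) differs
                            // from the dynamic type (Derived)
}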

Benchmark program

To check the runtime costs of the two approaches, I devised a small benchmark program that measures their runtime under various conditions. The source code for the benchmark program is available electronically, so you may check my assumptions and results, and possibly extend them to other platforms.

Benchmark classes

The benchmark program implements the Shared and Foo classes of Van Rooyen’s article and compares them to a more conventional approach using virtual destructors. Since the whole issue is largely moot without inheritance, each approach uses a base class and some derived classes. Throughout, I have tried to keep the two alternatives faithful to Van Rooyen’s intent and to conventional usage, respectively, and otherwise identical, with the obvious exception of the destruction method.

Van Rooyen’s article describes the Shared and Foo classes, but for ease of reference I will repeat them here.

// Shared is the base class for destroy() classes
class Shared {
public:
  // MODIFIED: placement new operator for the class
  static void *operator new(size_t, void *);

  Shared();
  ~Shared();

  virtual void destroy();
  int isdynamic() const { return dynamic; }

protected:
  // MODIFIED: protected operator delete -- use destroy()
  static void operator delete(void *);

private:
  static void *allocated;
  unsigned short dynamic: 1;
  unsigned short referenced: 15;
};

void *Shared::allocated;

INLINE Shared::Shared()
{
  dynamic = (unsigned short)(allocated == this);
  referenced = 1;
}

INLINE Shared::~Shared() {}

void Shared::destroy()
{
  if (isdynamic())
    delete this;
}

void *Shared::operator new(size_t, void *ptr)
{ return allocated = ptr; }

void Shared::operator delete(void *) {} // No-op

I have left out the hold() and release() member functions as they are not used in the test, and have replaced the operators new and delete by placement and no-op versions, respectively. This was done to exclude the time spent in memory allocation, because this could very well dominate the overall timing without contributing to the comparison proper. Instead, new and delete are used solely to invoke the constructors and destructors of the objects without incurring any further overhead, and a preallocated buffer is used to hold the dynamically created objects. Finally, I made operator delete protected instead of private, so that derived classes can reuse it in their destroy() implementations and don’t have to define their own operator delete.

The Foo class has been implemented as follows.

// Foo is a derived class with polymorphic destruction
class Foo: public Shared {
public:
  Foo();
  ~Foo();

  virtual void destroy();
};

INLINE Foo::Foo() { /* some code -- see text */ }
INLINE Foo::~Foo() { /* some code -- see text */ }

void Foo::destroy()
{
  if (isdynamic())
    delete this;
}

Foo (and any other derived class) requires its own destroy() implementation in Van Rooyen’s approach. The implementation code is the same in all cases, but the destructor call inside destroy() resolves to a different (non-virtual, remember!) destructor in each class. You might also want to redeclare operator delete for each derived class, but this is not strictly necessary: the one inherited from class Shared is already protected, and hence inaccessible to client code.
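As an aside, the protected operator delete also steers client code towards destroy(): a plain delete-expression on a Shared (or Foo) pointer will not compile outside the class hierarchy. A small sketch of the intended usage (the client function is hypothetical):

void client(Shared *sptr)
{
  sptr->destroy();    // OK: virtual call, resolves to the proper destroy()

  // delete sptr;     // would not compile: Shared::operator delete is protected,
                      // so the delete-expression cannot access it from here
}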

In the actual benchmark program, there are three Foo classes, differing only in the content of their constructor and destructor code. The simplest version does nothing visible (but vtable setup will occur behind the scenes, of course), while the other two initialize and destruct a single integer variable, and a dynamically allocated character array, respectively. These variations serve to assess the overhead of the constructor and destructor calls relative to the work that goes on inside their bodies.
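The listings above do not show the bodies of these variations. Based on the description, they could look roughly like the sketch below (the member names and the array size are my own; the actual benchmark source may differ in detail):

// Foo2: single integer assignment in constructor and destructor
class Foo2: public Shared {
public:
  Foo2();
  ~Foo2();

  virtual void destroy();

private:
  int value;
};

INLINE Foo2::Foo2()  { value = 1; }
INLINE Foo2::~Foo2() { value = 0; }

// Foo3: character array allocated and deleted in constructor and destructor
class Foo3: public Shared {
public:
  Foo3();
  ~Foo3();

  virtual void destroy();

private:
  char *buffer;
};

INLINE Foo3::Foo3()  { buffer = new char[16]; }
INLINE Foo3::~Foo3() { delete [] buffer; }

// Each derived class repeats the same destroy() implementation
void Foo2::destroy() { if (isdynamic()) delete this; }
void Foo3::destroy() { if (isdynamic()) delete this; }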

The counterparts to Shared and Foo are the Any classes (sorry; couldn’t come up with a better name). Their implementation goes like this:

// Global placement new operator
void *operator new(size_t, void *ptr) { return ptr; }

// Any: regular base class with virtual destructor
class Any {
public:
  Any();
  virtual ~Any();

  // operator delete is no-op for the class and derivations
  static void operator delete(void *);
};

INLINE Any::Any() {}
INLINE Any::~Any() {}

void Any::operator delete(void *) {}

// AnyX: derived class
class AnyX: public Any {
public:
  AnyX();
  virtual ~AnyX();
};

INLINE AnyX::AnyX() { /* some code -- see text */ }
INLINE AnyX::~AnyX() { /* some code -- see text */ }

The Any classes are similar to Shared and Foo, except for the presence of a virtual destructor (and absence of destroy()) and some details regarding operator new and delete. The code that goes inside the constructor and destructor bodies is varied in the same way as that of class Foo. We therefore have three AnyX classes as well (actually, Any1-3).

In all, the following classes were tested subject to the variations discussed below.

  • Foo1: a Foo class with empty constructor and destructor bodies.
  • Foo2: a Foo class with a single integer assignment in both the constructor and the destructor.
  • Foo3: a Foo class with a character array allocation and deletion, respectively, in the constructor and the destructor.
  • Likewise, Any1 to Any3 to match the Foo classes.

Each class was tested both with dynamically allocated instances (using pointers) and with automatic instances (local variables). Therefore, the tests are designated as follows:

1a. Foo1 or Any1 with dynamic objects.
1b. Foo1 or Any1 with automatic objects.
2a/2b. Foo2 or Any2 with dynamic/automatic objects.
3a/3b. Foo3 or Any3 with dynamic/automatic objects.

So, each run consisted of 12 subtests (6 for Foo, 6 for Any). Each subtest, in turn, consisted of a for loop that executed its body a large number of times (typically 2,000,000 iterations). The actual code looks like this for the dynamic cases:

// Code used for Foo classes:
for (int32 i = 0; i < itercnt; ++i)
{
  Shared *sptr = new (memblock) Foo1;
  sptr->destroy();
}

// Code used for Any classes:
for (int32 i = 0; i < itercnt; ++i)
{
  Any *aptr = new (memblock) Any1;
  delete aptr;
}

The variable "memblock" refers to a preallocated buffer used for placement new. In the cases of automatic variables, a variable of the appropriate class is declared inside the body of the for loop.

Benchmark variations

To cater for circumstances that might affect the outcome of the benchmark in different ways, I tested the classes presented above with various combinations of optimization settings, inline member functions, and presence or absence of exception handling. To top it off, I compiled the tests with three different C++ compilers for Windows NT. Here goes:

  • Optimization: either none, or for speed.
  • Inline member functions: either none, or those marked INLINE in the previous listings (a sketch of one possible INLINE definition follows this list). Note that some C++ compilers will inline any suitable function when asked to optimize for speed, regardless of the presence or absence of the "inline" keyword.
  • Exception handling: either with or without support for exception handling. In the former case, destructor cleanup was also enabled if necessary.
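
The INLINE marker used in the listings suggests a build-time switch; one plausible definition (my assumption, not taken from the benchmark source) is:

#ifdef USE_INLINE
#define INLINE inline   // "inline" builds: marked definitions become inline
#else
#define INLINE          // "outline" builds: ordinary out-of-line definitions
#endif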

All other compilation aspects were kept constant across the tests, as far as possible. For example, structure alignment (relevant to some of the Foo and Any variations) was set to 8-byte boundaries throughout, code generation targeted the Intel i386 (rather than, say, Pentium-optimized scheduling), and so on.

The compilers I used were, in alphabetical order:

  • Borland C++ 5.0B
  • Microsoft Visual C++ 4.1
  • Symantec C++ 7.21

Benchmark execution

The resulting eight variations of the benchmark program were compiled with each of the three compilers, and each of the 24 executables was then run 10 times on a PC running Windows NT 3.51 Workstation. The PC is equipped with a 120 MHz Intel Pentium processor and 32 Mbytes of RAM. To minimize disturbance from other tasks running on the same computer, I disabled network access, screen savers, and any other background tasks that could safely be killed.

The actual timing measurements were done through the GetProcessTimes() function from Windows NT (this function is part of the Win32 API, but is not implemented on Windows 95 or Win32s). Theoretically, this function yields only the times spent by the threads of the process (as opposed to wall clock time), but I found no significant differences between these times and the times returned by the ANSI C function clock(). For further verification, a subset of the tests was run on a PC with the same processor type, 40 Mbytes of RAM, and running Windows 95. I found no significant differences here either.
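For the curious, here is a sketch of how such a measurement can be obtained with GetProcessTimes(); this is not the benchmark’s own code, and the helper function name is mine:

#include <windows.h>

// Combined kernel + user CPU time of the current process, in milliseconds.
static unsigned long processTimeMs()
{
  FILETIME creationTime, exitTime, kernelTime, userTime;
  if (!GetProcessTimes(GetCurrentProcess(),
                       &creationTime, &exitTime, &kernelTime, &userTime))
    return 0;

  // FILETIME counts 100-nanosecond intervals; combine the 32-bit halves
  // and convert to milliseconds.
  ULARGE_INTEGER k, u;
  k.LowPart = kernelTime.dwLowDateTime;  k.HighPart = kernelTime.dwHighDateTime;
  u.LowPart = userTime.dwLowDateTime;    u.HighPart = userTime.dwHighDateTime;
  return (unsigned long)((k.QuadPart + u.QuadPart) / 10000);
}

A benchmark would call such a function before and after the timed loop and take the difference.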

The collected results were stored in a Microsoft Excel workbook and averaged over the ten runs per test. To make comparisons across all class and benchmark variations more meaningful, I then normalized the outcomes: for each combination of class and benchmark variation, the Foo time was set to 1 and the corresponding Any value to (Any time)/(Foo time). For example, if a Foo subtest took 500 ms and the matching Any subtest 400 ms, the table entry for Any is 400/500 = 0.80. Hence, a value lower than 1 indicates a speed advantage for the Any version, while a value higher than 1 means a Foo advantage.

The results were quite interesting. Among other things, they showed (sometimes unexpectedly) considerable differences among the compilers in the lineup, even to the extent that the direction of the outcome differed. However, I will restrict the discussion to a summary of the normalized Foo/Any comparison; the full results, including the actual timings in ms, are available electronically.

Results

The normalized results are summarized in the tables below; the normalization is relative to Foo = 1, so lower numbers indicate better relative performance of the Any variant.

Borland C++ 5.0B                     Dynamic             Automatic           Average
                                     1a    2a    3a      1b    2b    3b
Standard   Outline   No exceptions   0.71  0.71  0.89    1.06  1.31  0.94    0.94
Standard   Outline   Exceptions      0.96  0.97  0.98    1.04  1.20  0.94    1.01
Standard   Inline    No exceptions   0.71  0.69  1.21    0.96  2.10  0.93    1.10
Standard   Inline    Exceptions      1.01  1.09  1.00    0.90  1.38  0.93    1.05
Optimized  Outline   No exceptions   0.71  0.71  0.89    1.06  1.31  0.94    0.94
Optimized  Outline   Exceptions      0.97  0.98  0.99    1.06  1.22  0.94    1.03
Optimized  Inline    No exceptions   0.71  0.70  1.08    1.01  2.31  0.92    1.12
Optimized  Inline    Exceptions      1.01  1.07  0.98    0.89  1.44  0.93    1.05
Average                              0.85  0.86  1.00    1.00  1.53  0.93    1.03

Microsoft Visual C++ 4.1             Dynamic             Automatic           Average
                                     1a    2a    3a      1b    2b    3b
Standard   Outline   No exceptions   0.73  0.73  0.79    0.83  0.85  0.74    0.78
Standard   Outline   Exceptions      0.76  0.76  0.72    0.88  0.89  0.77    0.80
Standard   Inline    No exceptions   0.74  0.70  0.86    0.85  0.85  0.98    0.83
Standard   Inline    Exceptions      0.76  0.72  1.31    0.87  0.89  1.48    1.01
Optimized  Outline   No exceptions   0.68  0.61  0.97    0.48  0.50  0.97    0.70
Optimized  Outline   Exceptions      0.78  0.80  1.38    0.73  0.76  1.16    0.94
Optimized  Inline    No exceptions   0.58  0.63  0.71    0.00  0.00  0.97    0.48
Optimized  Inline    Exceptions      1.01  0.97  1.02    0.64  0.69  1.01    0.89
Average                              0.76  0.74  0.97    0.66  0.68  1.01    0.80

Symantec C++ 7.21                    Dynamic             Automatic           Average
                                     1a    2a    3a      1b    2b    3b
Standard   Outline   No exceptions   0.65  0.66  0.85    0.50  0.53  0.82    0.67
Standard   Outline   Exceptions      0.71  0.72  0.88    0.59  0.60  0.89    0.73
Standard   Inline    No exceptions   0.73  0.73  0.96    0.44  0.47  1.02    0.73
Standard   Inline    Exceptions      0.75  0.75  0.93    0.51  0.54  0.92    0.73
Optimized  Outline   No exceptions   0.78  0.78  0.95    1.00  0.73  0.97    0.87
Optimized  Outline   Exceptions      0.86  0.85  0.92    0.80  0.77  0.92    0.85
Optimized  Inline    No exceptions   0.64  0.65  0.94    0.46  0.59  0.93    0.70
Optimized  Inline    Exceptions      1.23  1.22  1.09    0.80  0.98  0.97    1.05
Average                              0.79  0.80  0.94    0.64  0.65  0.93    0.79

How to read these tables: shown are the runtimes of the Any tests, normalized relative to the runtimes of the corresponding Foo tests. Therefore, numbers < 1 indicate a performance advantage of Any over Foo, while numbers > 1 indicate the reverse. In the averages, I have assigned equal weight to each combination of test case and compiler settings. In practice, you might want to put more emphasis on versions that are intended for release (probably the ones with optimizations and inline code enabled), but for the present discussion I’ll cover the whole range equally.

Discussion

The picture that emerges from the tables above is somewhat mixed. A lot depends on the combination of compiler, compilation settings, and the actual test being run. There are also a few anomalies (e.g., the Borland C++ timings for test 2b with inline code and no exceptions) where an unexpectedly large difference occurs. However, a few observations can be made (I’ll leave the rest for you to ponder):

  • Only the Borland compiler generates code that is, on the average, favorable to the Foo case. It should be noted, though, that this is in large part due to the behavior of the code on test 2b, which is strikingly different from the other tests (and from the other compilers on the same test, for that matter).
  • With the exception of the Borland compiler, it appears that the Any case allows for better optimizations when used with automatic (i.e. statically typed) variables. This is quite reasonable, as in this case the destructor call does not require the virtual indirection, and the constructor of the Any base class is simpler than the one of the Shared base class. The Microsoft compiler even managed to optimize the entire for loop of tests 1b and 2b away in one of the Any cases.
  • Provisions for exception handling tend to even out the differences between Foo and Any. This is understandable, since the compiler inserts extra code in this case (including calls to helper functions), which adds approximately equal overhead to both the Foo and the Any variants.
  • The results of tests 3a and 3b lie closer together because a relatively large share of the time in those tests is spent in the memory allocation and deallocation performed by the constructor and destructor bodies.

It should also be noted that the data shown above are not the whole picture: the Foo approach might conceivably need additional code to support the "isdynamic" marking of the Foo objects. The data above include the technique described by Van Rooyen, but as pointed out by Pyarali and Peierls, and also in my companion article "On the heap?", a more complex scheme might be needed. In the presence of multi-threading, this could go as far as introducing locks to guard against concurrent access. Needless to say, this would introduce even more overhead in the Foo case, which is absent from the Any case.

Conclusion

It is probably safe to say that on the whole, and across a wide variety of circumstances, the (standard) Any variant is quicker than the supposedly improved Foo variant, and by quite a margin in some cases. Reasons why this should be so include:

  • A misunderstanding by Van Rooyen about the use of the virtual call mechanism for statically typed variables (it is not used in that case).
  • An execution path for dynamically typed variables that is essentially the same between the two approaches, or even longer in the Foo case. In particular, both involve a virtual function call (to Foo::destroy() or to the Any destructor, respectively), followed by some code that either is inlined by definition (the body of the Any destructor) or amenable to inlining by the compiler (the call plus body of the Foo destructor).

In addition:

  • In the presence of multi-threading, additional overhead would be introduced to safeguard the Foo scheme against concurrent access during construction. This is not required in the Any case.

I therefore conclude that Van Rooyen’s Foo approach introduces implementation complexity and opportunities for mistakes and misuse, without any gain in performance. Quite the contrary: in many cases the Foo approach causes a performance loss.

References

[1] Michael van Rooyen, "Alternative C++", C++ Report, April 1996, pp. 39-42.

[2] Irfan Pyarali, comments on "Alternative C++", C++ Report, May 1996, Letters section.

[3] Tim Peierls, comments on "Alternative C++", C++ Report, June 1996, Letters section.