How much to grow buffer in a StringBuilder-like C module?

In C, I’m working on a “class” that manages a byte buffer, allowing arbitrary data to be appended to the end. I’m now looking into automatic resizing as the underlying array fills up using calls to realloc. This should make sense to anyone who’s ever used Java or C# StringBuilder. I understand how to go about the resizing. But does anyone have any suggestions, with rationale provided, on how much to grow the buffer with each resize?

Obviously, there’s a trade-off to be made between wasted space and excessive realloc calls (which could lead to excessive copying). I’ve seen some tutorials/articles that suggest doubling. That seems wasteful if the user manages to supply a good initial guess. Is it worth trying to round to some power of two or a multiple of the alignment size on a platform?

Does anyone know what Java or C# does under the hood?

Answer

In C# the strategy used to grow the internal buffer used by a StringBuilder has changed over time.

There are three basic strategies for solving this problem, and they have different performance characteristics.

The first basic strategy is:

  • Make an array of characters
  • When you run out of room, create a new array with k more characters, for some constant k.
  • Copy the old array to the new array, and orphan the old array.

This strategy has a number of problems, the most obvious of which is that it is O(n^2) in time if the string being built is extremely large. Let’s say that k is a thousand characters and the final string is a million characters. You end up reallocating the string at 1000, 2000, 3000, 4000, … and therefore copying 1000 + 2000 + 3000 + 4000 + … + 999000 characters, which sums to about 500 million characters copied, roughly 500 times the length of the final string!

This strategy has the nice property that the amount of “wasted” memory is bounded by k.

In practice this strategy is seldom used because of that n-squared problem.
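
To check that arithmetic, here is a small C sketch that counts how many characters get copied under this first strategy. The function name copied_fixed_step is made up for illustration, and it assumes the buffer is completely full each time it is reallocated.

```c
#include <assert.h>

/* Total characters copied while growing a buffer by a fixed step k
 * until its capacity reaches final_len. Each reallocation copies the
 * whole old array, so copies happen at capacities k, 2k, 3k, ...
 * (hypothetical helper, assumes the buffer is full at every resize). */
unsigned long long copied_fixed_step(unsigned long long k,
                                     unsigned long long final_len)
{
    assert(k > 0);
    unsigned long long capacity = k;  /* initial allocation: nothing to copy */
    unsigned long long copied = 0;

    while (capacity < final_len) {
        copied += capacity;           /* copy old contents into the new array */
        capacity += k;                /* new array has k more characters */
    }
    return copied;
}
```

Calling copied_fixed_step(1000, 1000000) returns 499500000, which is 1000 × (1 + 2 + … + 999). Multiplying the final length by ten multiplies the result by roughly a hundred, which is the quadratic behaviour in action.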

The second basic strategy is:

  • Make an array
  • When you run out of room, create a new array with k% more characters, for some constant k.
  • Copy the old array to the new array, and orphan the old array.

k% is usually 100%; if it is then this is called the “double when full” strategy.

This strategy has the nice property that the total copying cost is O(n), which works out to an amortized O(1) per appended character. Suppose again the final string is a million characters and you start with a thousand. You make copies at 1000, 2000, 4000, 8000, … and end up copying 1000 + 2000 + 4000 + 8000 + … + 512000 characters, which sums to about a million characters copied; much better.

The strategy has the property that the total cost stays linear no matter what percentage you choose; a smaller percentage only increases the constant factor.
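
The same kind of count can be done for percentage growth. This is a sketch under the same assumption as the text (the buffer is full at every resize); the function name copied_percent_growth and its parameters are invented for illustration.

```c
#include <assert.h>

/* Total characters copied while growing a buffer by `percent` percent
 * of its current capacity each time it fills, starting from `initial`
 * and stopping once capacity reaches final_len.
 * (hypothetical helper, assumes the buffer is full at every resize). */
unsigned long long copied_percent_growth(unsigned long long initial,
                                         unsigned percent,
                                         unsigned long long final_len)
{
    assert(initial > 0 && percent > 0);
    unsigned long long capacity = initial;
    unsigned long long copied = 0;

    while (capacity < final_len) {
        copied += capacity;                         /* copy old array */
        unsigned long long extra = capacity * percent / 100;
        if (extra == 0)
            extra = 1;                              /* always make progress */
        capacity += extra;
    }
    return copied;
}
```

Calling copied_percent_growth(1000, 100, 1000000) returns 1023000, which is 1000 × (1 + 2 + 4 + … + 512). With a smaller percentage such as 50 the total is larger, but it is still bounded by a small constant multiple of the final length.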

This strategy has a couple of downsides: an individual copy operation is occasionally extremely expensive, and you can be wasting up to k% of the final string length in unused memory.
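
Since the question is about a C module built on realloc, here is a minimal sketch of the double-when-full strategy. The ByteBuf type, the function names, and the starting capacity of 16 are all assumptions made for illustration, not anything taken from Java or .NET.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer; zero-initialize before first use. */
typedef struct {
    unsigned char *data;
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes allocated */
} ByteBuf;

/* Append n bytes, doubling the capacity until the data fits.
 * Returns 0 on success, -1 on overflow or allocation failure
 * (on failure the existing contents are left untouched). */
int bytebuf_append(ByteBuf *b, const void *src, size_t n)
{
    assert(b != NULL && (src != NULL || n == 0));
    if (n > SIZE_MAX - b->len)
        return -1;                          /* len + n would overflow */

    size_t needed = b->len + n;
    if (needed > b->cap) {
        size_t new_cap = b->cap ? b->cap : 16;  /* arbitrary initial size */
        while (new_cap < needed) {
            if (new_cap > SIZE_MAX / 2) {   /* cannot double any further */
                new_cap = needed;
                break;
            }
            new_cap *= 2;                   /* k = 100%: double when full */
        }
        unsigned char *p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;                      /* old block is still valid */
        b->data = p;
        b->cap = new_cap;
    }
    if (n > 0)
        memcpy(b->data + b->len, src, n);
    b->len = needed;
    return 0;
}

/* Release the storage and reset the buffer to its empty state. */
void bytebuf_free(ByteBuf *b)
{
    free(b->data);
    b->data = NULL;
    b->len = b->cap = 0;
}
```

Starting from ByteBuf b = {0}, appending 5 bytes allocates 16, and appending 23 more grows the capacity to 32. Changing new_cap *= 2 to something like new_cap += new_cap / 2 gives a 50% policy, trading more frequent copies for less unused space.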

The third strategy is to make a linked list of arrays, each array of size k. When you overflow an existing array, a new one is allocated and appended to the end of the list.

This strategy has the nice property that no operation is particularly expensive, the total wasted memory is bounded by k, and you don’t need to be able to locate large blocks in the heap on a regular basis. It has the downside that finally turning the thing into a string can be expensive as the arrays in the linked list might have poor locality.
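
A rough C sketch of the linked-list-of-arrays idea follows. The names ChunkBuf, Chunk, and CHUNK_SIZE (standing in for k, with an arbitrary value of 4096) are assumptions for illustration; this is not how any particular runtime implements it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 4096   /* the constant k; value chosen arbitrarily */

typedef struct Chunk {
    struct Chunk *next;
    size_t used;                      /* bytes filled in this chunk */
    unsigned char data[CHUNK_SIZE];
} Chunk;

/* Hypothetical chunked buffer; zero-initialize before first use. */
typedef struct {
    Chunk *head, *tail;
    size_t total;                     /* bytes stored across all chunks */
} ChunkBuf;

/* Append n bytes, allocating a new fixed-size chunk whenever the tail
 * is full. Existing data is never copied. Returns 0 on success, -1 if
 * an allocation fails (bytes appended before the failure are kept). */
int chunkbuf_append(ChunkBuf *b, const void *src, size_t n)
{
    const unsigned char *p = src;
    assert(b != NULL && (src != NULL || n == 0));
    while (n > 0) {
        if (b->tail == NULL || b->tail->used == CHUNK_SIZE) {
            Chunk *c = malloc(sizeof *c);
            if (c == NULL)
                return -1;
            c->next = NULL;
            c->used = 0;
            if (b->tail) b->tail->next = c; else b->head = c;
            b->tail = c;
        }
        size_t room = CHUNK_SIZE - b->tail->used;
        size_t take = n < room ? n : room;
        memcpy(b->tail->data + b->tail->used, p, take);
        b->tail->used += take;
        b->total += take;
        p += take;
        n -= take;
    }
    return 0;
}

/* Copy everything into one contiguous malloc'd array of b->total bytes.
 * This is the potentially expensive final step; the caller frees it. */
unsigned char *chunkbuf_flatten(const ChunkBuf *b)
{
    unsigned char *out = malloc(b->total ? b->total : 1);
    if (out == NULL)
        return NULL;
    size_t off = 0;
    for (const Chunk *c = b->head; c != NULL; c = c->next) {
        memcpy(out + off, c->data, c->used);
        off += c->used;
    }
    return out;
}

/* Free every chunk and reset the buffer to its empty state. */
void chunkbuf_free(ChunkBuf *b)
{
    Chunk *c = b->head;
    while (c != NULL) {
        Chunk *next = c->next;
        free(c);
        c = next;
    }
    b->head = b->tail = NULL;
    b->total = 0;
}
```

Appending never moves existing bytes, so the cost of each append is bounded by the size of the data being added plus at most one chunk allocation per CHUNK_SIZE bytes. The price is paid in chunkbuf_flatten, which has to walk the whole list once.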

The string builder in the .NET framework used to use a double-when-full strategy; it now uses a linked-list-of-blocks strategy.



Source: stackoverflow