Vector reallocation strategies

August 11th, 2011

I was interested in understanding a bit better how C++ standard library implementations reallocate vectors once their capacity (using the default allocator) is reached, so I decided to build a small experiment. All I wanted to know was the new capacity after each reallocation, and it turns out (unsurprisingly) that it is computed from the size of the vector that just became full.

The small experiment is simple:

#include<iostream>
#include<vector>
using namespace std;

int main() {
    vector<int> v;
    size_t cap = v.capacity(); // track capacity instead of comparing invalidated iterators
    for(int i=0; i<1000000; i++) {
        v.push_back(i);
        if(v.capacity() != cap) { // capacity grew, so a reallocation happened
            cout << "Reallocation occurred. i = " << i << endl;
            cap = v.capacity();
        }
    }
    }
    return 0;
}

In g++ 4.6.1 on Arch Linux 64-bit I get:

Reallocation occurred. i = 0
Reallocation occurred. i = 1
Reallocation occurred. i = 2
Reallocation occurred. i = 4
Reallocation occurred. i = 8
Reallocation occurred. i = 16
Reallocation occurred. i = 32
Reallocation occurred. i = 64
Reallocation occurred. i = 128
Reallocation occurred. i = 256
Reallocation occurred. i = 512
Reallocation occurred. i = 1024
Reallocation occurred. i = 2048
Reallocation occurred. i = 4096
Reallocation occurred. i = 8192
Reallocation occurred. i = 16384
Reallocation occurred. i = 32768
Reallocation occurred. i = 65536
Reallocation occurred. i = 131072
Reallocation occurred. i = 262144
Reallocation occurred. i = 524288

We can clearly notice the familiar 2^x pattern that gcc uses: if a vector of capacity n is full and a push_back() is performed, the implementation simply allocates space for 2n elements. That is simple and requires very few allocations, at the cost of potentially wasting a lot of space if the user ends up using only a small portion of the newly allocated block.
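
Based only on the output above, gcc's growth rule could be sketched like this (an inference from the observed numbers, not from the libstdc++ source):

size_t next_capacity_gcc(size_t n) {
    return n == 0 ? 1 : 2 * n; // an empty vector gets 1, everything else doubles
}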

On the other hand, Visual Studio 2010 Ultimate Build 10.0.30319.1 RTMRel on Windows 7 (Release mode) does something considerably different:

Reallocation occurred. i = 0
Reallocation occurred. i = 1
Reallocation occurred. i = 2
Reallocation occurred. i = 3
Reallocation occurred. i = 4
Reallocation occurred. i = 6
Reallocation occurred. i = 9
Reallocation occurred. i = 13
Reallocation occurred. i = 19
Reallocation occurred. i = 28
Reallocation occurred. i = 42
Reallocation occurred. i = 63
Reallocation occurred. i = 94
Reallocation occurred. i = 141
Reallocation occurred. i = 211
Reallocation occurred. i = 316
Reallocation occurred. i = 474
Reallocation occurred. i = 711
Reallocation occurred. i = 1066
Reallocation occurred. i = 1599
Reallocation occurred. i = 2398
Reallocation occurred. i = 3597
Reallocation occurred. i = 5395
Reallocation occurred. i = 8092
Reallocation occurred. i = 12138
Reallocation occurred. i = 18207
Reallocation occurred. i = 27310
Reallocation occurred. i = 40965
Reallocation occurred. i = 61447
Reallocation occurred. i = 92170
Reallocation occurred. i = 138255
Reallocation occurred. i = 207382
Reallocation occurred. i = 311073
Reallocation occurred. i = 466609
Reallocation occurred. i = 699913

So it performs a lot more reallocations, but it can potentially save some space. It also allows an allocation to succeed in the rare situation where the process does not have room for a 2n block but does have room for a smaller one. The problem is that reallocations are expensive, and this rare benefit is probably not worth the considerably greater number of reallocations.

But if you pay attention to the behavior of the numbers above, it's not so easy to derive a simple closed formula such as 2^x. After some thought and observation I think the sequence behaves as (sorry, no TeX plugin):

a(n+1) = { a(n) + 1               if floor(a(n)/2) = 0
         { a(n) + floor(a(n)/2)   otherwise

In C++ that would be:

void next(int &a) {
    int half = a/2;
    if(half)
        a = a+half;
    else
        a++;
}
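
As a quick check (a hypothetical driver, not in the original program), starting from 1 and iterating next() reproduces the capacities at which the reallocations above occur:

int a = 1;
for(int k = 0; k < 10; k++) {
    cout << a << ' '; // prints 1 2 3 4 6 9 13 19 28 42
    next(a);
}
cout << endl;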

But really, even though the behaviors of these two sequences seem very different, both are still exponential after the initial values (once half stops being 0). The difference is in the growth factor: gcc doubles the capacity on each reallocation (it adds n, a factor of 2), while VC10 adds n/2, a factor of 1.5.

Of course we can call reserve() up front (or wrap the container ourselves) to guarantee the performance we expect instead of relying on the defaults provided by the implementations (which could change at any time). Note that a custom allocator alone does not help here, since the growth factor is chosen by the vector implementation, not by the allocator. Still, understanding the essentials of these reallocations (e.g. that neither gcc nor VC10 grows by a constant chunk or in a linear fashion) helps us build more reliable and predictable systems.
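
When the final number of elements is known in advance, reserve() sidesteps the growth policy entirely. A minimal sketch for the experiment above:

vector<int> v;
v.reserve(1000000); // a single allocation up front
for(int i = 0; i < 1000000; i++)
    v.push_back(i); // never reallocates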

shared_ptr’s constructor vs make_shared()

July 20th, 2011

First of all, please watch Stephan T. Lavavej's Advanced STL, Part 1.

This post shows a small comparison between the shared_ptr constructor that takes a raw pointer and the utility function make_shared(). If you haven't watched the video mentioned above, the difference is basically that the raw-pointer constructor of shared_ptr must perform two independent allocations (one for the object, one for the reference-count control block), while make_shared() can potentially fuse them into a single allocation.
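
In code, the two competitors look like this (a minimal illustration, separate from the benchmark below):

shared_ptr<int> a(new int(42)); // two allocations: the int, then the control block
auto b = make_shared<int>(42);  // one allocation holding both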

I've written a small program to test this and see if it actually makes a measurable difference. I wanted to time the construction of the smart pointers and confirm the number of allocations. For the timing I'm using UNIX's timespec from the time.h header. I wish it had some built-in function to calculate the elapsed time between two timespecs, but I had to write my own. For the number of allocations I used valgrind, which was good enough. Here's the code:

#include<iostream>
#include<memory>
#include<time.h> // clock_gettime() and timespec are declared here
using namespace std;

const int AMOUNT = 100000;
const int BILLION = 1000000000; // 10^9
const int BIG_NUM = 999999;

/* Returns a timespec representing the elapsed time between
 * start and end.
 * Assumption: end is bigger (later) than start. */
timespec elapsed(timespec &start, timespec &end) {
    timespec result;
    if(end.tv_nsec < start.tv_nsec) { // borrow from tv_sec
        result.tv_sec = (end.tv_sec - 1) - start.tv_sec;
        result.tv_nsec = (end.tv_nsec + BILLION) - start.tv_nsec;
    } else {
        result.tv_sec = end.tv_sec - start.tv_sec;
        result.tv_nsec = end.tv_nsec - start.tv_nsec;
    }
    return result;
}

int main() {
    timespec start, end, delta1, delta2;

    clock_gettime(CLOCK_REALTIME, &start);
    for(int i=0; i<AMOUNT; i++)
        shared_ptr<int> sp(new int(BIG_NUM)); // AMOUNT*2 allocations
    clock_gettime(CLOCK_REALTIME, &end);
    delta1 = elapsed(start, end);

    clock_gettime(CLOCK_REALTIME, &start);
    for(int i=0; i<AMOUNT; i++)
        auto sp2 = make_shared<int>(BIG_NUM); // AMOUNT allocations
    clock_gettime(CLOCK_REALTIME, &end);
    delta2 = elapsed(start, end);

    cout << delta1.tv_sec << '.' << delta1.tv_nsec << endl;
    cout << delta2.tv_sec << '.' << delta2.tv_nsec << endl;

    return 0;
}

Compile:

g++ shared_ptr.cpp -o shared_ptr -lrt -std=c++0x

The surprising thing is that when compiled without optimization (i.e. without the -O flag, as demonstrated above), GCC produced an executable in which make_shared() was SLOWER than the plain shared_ptr constructor, even though the latter makes twice as many allocations. I was getting something like 2.7s against 1.6s. The timings were consistent, so make_shared() was almost 70% slower!

I was actually puzzled by these measurements, and I didn’t understand what was going on until I turned on the -O flag. Both times went down to 0.4s and 0.7s, at which point I realized that make_shared() was indeed considerably faster than the normal shared_ptr constructor.
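
In other words, the optimized build is just the same command as before with the -O flag added:

g++ shared_ptr.cpp -o shared_ptr -lrt -O -std=c++0x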

For both tests, valgrind reported that there were no memory leaks (i.e. the implementations of the shared_ptr constructor and make_shared() are correct) and that there were 300,000 allocations in total (200,000 from the shared_ptr constructor and 100,000 from make_shared()), which confirms the expected values. I'm still puzzled about why make_shared() is slower when no optimization is performed, even though it performs fewer allocations. Maybe the overhead of the extra layers of function calls inside make_shared(), which only get inlined away when optimizing, plays a role here; I'm not sure.

Threading in C++0x

July 18th, 2011

I think it's awesome that C++0x supports threading in the language itself instead of relying on OS-specific libraries and/or extensions. I played with it yesterday and wrote a simple program to calculate an approximation of the exponential function using 3 separate threads. The math behind it is just:

exp(x) = x^0/0! + x^1/1! + x^2/2! + x^3/3! + ...

Obviously, if I wanted this program to finish I had to put a limit on the number of terms in the above summation, so I put that in a constant set to a big number like 1000. The bigger, the better the approximation, but also the slower the program. Alternatively we could keep the program running forever until an interrupt occurs, updating the value as we go, but I didn't do it that way.

Anyway, we should be dealing with threading and not with the best possible approximation of e^x. The first thread (t1) calculates the x^i numerator, the second thread (t2) calculates the i! denominator, and the third thread (t3) just divides them to obtain a decimal number that is accumulated into a variable. We could have maintained exact fractions, but it's simpler to code and deal with decimal objects. GMP handles these big numbers.
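
For reference, here is what a plain single-threaded version of the same summation looks like (a hypothetical sketch using doubles instead of GMP, handy for sanity-checking the threaded result):

#include<iostream>
using namespace std;

double exp_series(double x, unsigned int terms) {
    double term = 1.0, sum = 0.0;  // term holds x^i / i!
    for(unsigned int i = 0; i < terms; i++) {
        sum += term;
        term = term * x / (i + 1); // x^(i+1)/(i+1)! from x^i/i!
    }
    return sum;
}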

#include<iostream>
#include<thread>
#include<mutex>
#include<gmp.h>
using namespace std;

int main() {
    mutex m1, m2;
    const unsigned int iterations = 1000;
    mpf_t res1, res2, result, x;
    mpf_inits(res1, res2, result, x, nullptr);
    mpf_set_ui(res1, 1);
    mpf_set_ui(res2, 1);
    mpf_set_ui(result, 0);

    double x_cin;
    cout << "Enter x: ";
    cin >> x_cin;
    mpf_set_d(x, x_cin);

    // t1 calculates x^i
    thread t1([&m1, &res1, iterations](mpf_t x) {
        for(unsigned int i = 0; i < iterations; ) {
            m1.lock();
            if(mpf_cmp_ui(res1, 0) < 0) { // there's work to do
                // res1 = (-res1)*x
                mpf_ui_sub(res1, 0, res1);
                mpf_mul(res1, res1, x);
                i++;
            }
            m1.unlock();
        }
    }, x);

    // t2 calculates i!
    thread t2([&m2, &res2, iterations]() {
        for(unsigned int i = 1; i <= iterations; ) {
            m2.lock();
            if(mpf_cmp_ui(res2, 0) < 0) { // there's work to do
                // res2 = (-res2)*i
                mpf_ui_sub(res2, 0, res2);
                mpf_mul_ui(res2, res2, i);
                i++;
            }
            m2.unlock();
        }
    });

    // t3 accumulates res1/res2
    thread t3([&m1, &m2, &res1, &res2, &result, iterations]() {
        mpf_t tmp;
        mpf_init(tmp);
        for(unsigned int i = 0; i < iterations; ) {
            m1.lock();
            if(mpf_cmp_ui(res1, 0) > 0) { // t1's value is ready
                m2.lock();
                if(mpf_cmp_ui(res2, 0) > 0) {
                    mpf_div(tmp, res1, res2);
                    mpf_add(result, result, tmp);
                    mpf_ui_sub(res1, 0, res1); // tell t1 to work
                    mpf_ui_sub(res2, 0, res2); // tell t2 to work
                    i++;
                }
                m2.unlock();
            }
            m1.unlock();
        }
    });

    t1.join();
    t2.join();
    t3.join();
    cout << "e^x ~= " << result << endl;

    return 0;
}

You can compile with a command like this (assuming your program is called e_thread.cpp):

g++ e_thread.cpp -o e_thread -std=c++0x -pthread -lgmp -lgmpxx

Just don't expect very good performance from this program, because it is mostly busy-waiting for a mutex to be unlocked instead of actually calculating what's necessary. We could have used O(iterations) memory to store each value from t1 and t2 separately, reducing the time considerably, but the above solution uses O(1) memory. Anyway, it was a good exercise for me to understand one way of natively using threads in C++.
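
If I ever revisit this, a condition_variable would let the threads sleep until there is work instead of spinning on a mutex. A minimal sketch of that handshake (hypothetical, not what the program above does):

#include<thread>
#include<mutex>
#include<condition_variable>
#include<iostream>
using namespace std;

int main() {
    mutex m;
    condition_variable cv;
    bool ready = false;
    int value = 0;

    thread producer([&]() {
        {
            lock_guard<mutex> lock(m);
            value = 42;  // do the actual work here
            ready = true;
        }
        cv.notify_one(); // wake the consumer exactly when needed
    });

    thread consumer([&]() {
        unique_lock<mutex> lock(m);
        cv.wait(lock, [&]{ return ready; }); // sleeps, no busy-waiting
        cout << "got " << value << endl;
    });

    producer.join();
    consumer.join();
    return 0;
}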