EmbeddedRelated.com
Forums

Benchmark: STL's list vs. hand-coded one

Started by Jyrki Saarinen November 20, 2007
Hello everyone, in the compiler thread someone (was it Dijkstra?) 
claimed that M$ STL implementation of the doubly linked list was 50x 
slower than a hand-coded implementation. Now, I get the metrics that 
they are equally fast.

OS: Ubuntu 7.10
CPU: Athlon XP 2600+
compiler: gcc 4.1.3
opts: -O3 -march=athlon-xp

Results:

Custom impl. took 4.000000 seconds of CPU time.
STL impl. took 4.000000 seconds of CPU time.

(I couldn't find a method of more accurate timing in glibc)

Here's the source:

#include <list>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>

#define N 4000000

struct Node
{
  unsigned int data;
  struct Node *prev, *next;
};

struct List
{
  struct Node *head, *tail;
};

/**
 * Allocates a new doubly linked list.
 */
static List *list_New()
{
  List *list = (struct List *)malloc(sizeof(struct List));
  if (list == NULL)
    perror("malloc() failed");
  list->head = list->tail = NULL;
  return list;
}

/**
 * Frees a node.
 */
static void node_Delete(struct Node *node)
{
  free(node);
}

/**
 * Frees resources allocated to a list
 */
static void list_Delete(struct List *list)
{
  assert(list != NULL);
  struct Node *node = list->head;
  while (node != NULL)
  {
    struct Node *current = node;
    node_Delete(current);
    node = node->next;
  }
  free(list);
}

static struct Node *list_GetHead(const struct List *list)
{
  assert(list != NULL);
  return list->head;
}

static struct Node *node_GetNext(const struct Node *node)
{
  assert(node != NULL);
  return node->next;
}

/**
 * Allocates a new node
 */
static struct Node *node_New()
{
  struct Node *node = (struct Node *)malloc(sizeof(struct Node));
  if (node == NULL)
    perror("malloc() failed.");
  node->prev = node->next = NULL;
  return node;
}

/**
 * "Accessor"
 */
static void node_SetData(struct Node *node, int data)
{
  assert(node != NULL);
  node->data = data;
}

/**
 * "Accessor".
 */
static int node_GetData(const struct Node *node)
{
  assert(node != NULL);
  return node->data;
}

/**
 * Adds a node to the end of the list.
 */
static void list_Add(struct List *list, struct Node *node)
{
  assert(list != NULL);
  assert(node != NULL);
  assert(node->prev == NULL && node->next == NULL);
  node->prev = list->tail;
  if (list->head == NULL)
    list->head = node;
  if (list->tail != NULL)
    list->tail->next = node;
  list->tail = node;
}

int main()
{
  time_t t1, t2;
  t1 = time(NULL);
  struct List *list = list_New();
  int sum1 = 0;
  for (unsigned int i = 0; i < N; i++)
  {
    struct Node *node = node_New(); 
    node_SetData(node, i);
    list_Add(list, node);
  }
  struct Node *node = list_GetHead(list);
  for (int i = 0; i < N; i++)
  {
    sum1 += node_GetData(node);    
    node = node_GetNext(node);
  }
  list_Delete(list);
  t2 = time(NULL);
  double dt = t2 - t1;
  printf("Custom impl. took %f seconds of CPU time.\n", dt);
  
  t1 = time(NULL);
  std::list<unsigned int> *stlList = new std::list<unsigned int>;
  int sum2 = 0;
  for (unsigned int i = 0; i < N; i++)
  {
    stlList->push_back(i);
  }
  std::list<unsigned int>::const_iterator i1 = stlList->begin();
  std::list<unsigned int>::const_iterator i2 = stlList->end();

  while (i1 != i2)
  {
    sum2 += *i1;
    i1++;
  }
  delete stlList;
  t2 = time(NULL);
  dt = t2 - t1;
  printf("STL impl. took %f seconds of CPU time.\n", dt);

  // do prevent optimizer to remove all code as dead
  return sum1 + sum2;
}

-- 
Jyrki Saarinen
http://koti.welho.com/jsaari88/

Jyrki Saarinen wrote:
> Hello everyone, in the compiler thread someone (was it Dijkstra?) > claimed that M$ STL implementation of the doubly linked list was 50x > slower than a hand-coded implementation. Now, I get the metrics that > they are equally fast. > > OS: Ubuntu 7.10 > CPU: Athlon XP 2600+ > compiler: gcc 4.1.3 > opts: -O3 -march=athlon-xp > > Results: > > Custom impl. took 4.000000 seconds of CPU time. > STL impl. took 4.000000 seconds of CPU time. > > (I couldn't find a method of more accurate timing in glibc) > <snip>
On most platforms the clock() function in <time.h> gives more accurate timing information than the time() function.
Patrick de Zeester <invalid@invalid.invalid> wrote:

>> (I couldn't find a method of more accurate timing in glibc) >> <snip> > > On most platforms the clock() function in <time.h> gives more accurate > timing information than the time() function.
On Linux, it returns 0 for some reason, perhaps it is not implemented. -- Jyrki Saarinen http://koti.welho.com/jsaari88/
Jyrki Saarinen wrote:
> Patrick de Zeester <invalid@invalid.invalid> wrote: > >>> (I couldn't find a method of more accurate timing in glibc) >>> <snip> >> On most platforms the clock() function in <time.h> gives more accurate >> timing information than the time() function. > > On Linux, it returns 0 for some reason, perhaps it is not implemented.
On the version of Linux you are running that is, on the version of Linux I use the clock() function works just fine.
Patrick de Zeester <invalid@invalid.invalid> wrote:

>> On Linux, it returns 0 for some reason, perhaps it is not implemented. > > On the version of Linux you are running that is, on the version of Linux > I use the clock() function works just fine.
What distribution, kernel and glibc versions are you using? -- Jyrki Saarinen http://koti.welho.com/jsaari88/
Jyrki Saarinen wrote:
> Patrick de Zeester <invalid@invalid.invalid> wrote: > >>> On Linux, it returns 0 for some reason, perhaps it is not implemented. >> On the version of Linux you are running that is, on the version of Linux >> I use the clock() function works just fine. > > What distribution, kernel and glibc versions are you using?
I checked it on a very old Mandrake distro: Linux version 2.6.8.1-12mdk (quintela@n5.mandrakesoft.com) (gcc version 3.4.1 (Mandrakelinux (Alpha 3.4.1-3mdk)) #1 Fri Oct 1 12:53:41 CEST 2004 glibc-2.3.3-23.1.101mdk
Patrick de Zeester <invalid@invalid.invalid> wrote:

> I checked it on a very old Mandrake distro: > Linux version 2.6.8.1-12mdk (quintela@n5.mandrakesoft.com) (gcc version > 3.4.1 (Mandrakelinux (Alpha 3.4.1-3mdk)) #1 Fri Oct 1 12:53:41 CEST 2004 > glibc-2.3.3-23.1.101mdk
Ok, gettimeofday() did the trick. Now the times are for 4000000 nodes: Custom impl. took 0.946796 seconds of CPU time. STL impl. took 0.972560 seconds of CPU time. -- Jyrki Saarinen http://koti.welho.com/jsaari88/
"Jyrki Saarinen" <jyrki@NOSPAMwelho.com.invalid> wrote in message news:fhvaou$4iq$1@nyytiset.pp.htv.fi...
> Hello everyone, in the compiler thread someone (was it Dijkstra?) > claimed that M$ STL implementation of the doubly linked list was 50x > slower than a hand-coded implementation. Now, I get the metrics that > they are equally fast.
I didn't claim that, nor did anyone else for that matter. I was talking about STL strings and in particular string concatenation. Anyway, there are some obvious flaws here. For starters few would use a linked list if they had to store such a large number of integers. Applications that do use large linked lists typically do something very different. Your implementation uses 24 bytes for just one node (16 bytes if the heap is 4-byte aligned) compared to just over 4 for a specialized implementation. But worst of all, do you have any idea where over 99% of the time is spent? Let me give you a hint - it is NOT the linked list code... Wilco
Wilco Dijkstra <Wilco_dot_Dijkstra@ntlworld.com> wrote:

>> Hello everyone, in the compiler thread someone (was it Dijkstra?) >> claimed that M$ STL implementation of the doubly linked list was 50x >> slower than a hand-coded implementation. Now, I get the metrics that >> they are equally fast. > > I didn't claim that, nor did anyone else for that matter. I was talking about > STL strings and in particular string concatenation.
If this is the case, of course this micro-benchmark is irrelevant. I have to read your post again (if I can find it still..), and perhaps do another benchmark. Easier would be if you could describe the problem with STL string concatenation again?
> Anyway, there are some obvious flaws here. For starters few would > use a linked list if they had to store such a large number of integers.
Well - int was just chosen, it could be anything else. I'll do the same test with a struct that contains a larger amount of data.
> different. Your implementation uses 24 bytes for just one node (16 bytes > if the heap is 4-byte aligned) compared to just over 4 for a specialized > implementation.
What kind of an implementation is this with 4 bytes for a node?
> But worst of all, do you have any idea where over 99% > of the time is spent? Let me give you a hint - it is NOT the linked list code...
I didn't profile, but very likely it is the memory allocation. -- Jyrki Saarinen http://koti.welho.com/jsaari88/
"Jyrki Saarinen" <jyrki@NOSPAMwelho.com.invalid> wrote in message news:fi0v23$qfg$1@nyytiset.pp.htv.fi...
> Wilco Dijkstra <Wilco_dot_Dijkstra@ntlworld.com> wrote: > >>> Hello everyone, in the compiler thread someone (was it Dijkstra?) >>> claimed that M$ STL implementation of the doubly linked list was 50x >>> slower than a hand-coded implementation. Now, I get the metrics that >>> they are equally fast. >> >> I didn't claim that, nor did anyone else for that matter. I was talking about >> STL strings and in particular string concatenation. > > If this is the case, of course this micro-benchmark is irrelevant. I have > to read your post again (if I can find it still..), and perhaps do another > benchmark. Easier would be if you could describe the problem with STL > string concatenation again?
The problem is as follows: you have 4 small character strings (char *), and create a temporary string by concatenating them with spaces between each string. Implementing this in the obvious way gives a factor of 50 times in VC++ between strcat/strcpy vs STL.
>> Anyway, there are some obvious flaws here. For starters few would >> use a linked list if they had to store such a large number of integers. > > Well - int was just chosen, it could be anything else. I'll do the same > test with a struct that contains a larger amount of data.
It wouldn't change the result much, you still spend most of the time in malloc/free. What matters is not the size of the data but the usage pattern - since the benchmark doesn't do anything but create/destroy a long list, it only measures the (high) overhead of memory allocation.
>> different. Your implementation uses 24 bytes for just one node (16 bytes >> if the heap is 4-byte aligned) compared to just over 4 for a specialized >> implementation. > > What kind of an implementation is this with 4 bytes for a node?
A linked list of arrays.
>> But worst of all, do you have any idea where over 99% >> of the time is spent? Let me give you a hint - it is NOT the linked list code... > > I didn't profile, but very likely it is the memory allocation.
Correct. This is why big applications use custom allocators that are much faster than malloc/free - think 2 orders of magnitude. Wilco