strdup/free vs atomic reference counting
As I was working on a little C project of mine, I realized that I often had functions returning strings and couldn't simply return the pointer directly (especially in a multi-threaded app), since the caller needs to be able to use the string safely.
Typically, what is done is to simply return a copy of the string, which the caller must free once it's done with it. But this would happen a lot in my application, and I wondered whether it wouldn't be better to save all those strdup/free calls by using reference counting instead.
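For illustration, here is a minimal sketch of that usual copy-and-free pattern; the get_name()/internal_name names are made up for the example, not code from my project:

#include <glib.h>

static const gchar *internal_name = "foobar";

/* usual approach: hand out a private copy; the caller owns it and must free it */
static gchar *
get_name (void)
{
    return g_strdup (internal_name);
}

int
main (void)
{
    gchar *name = get_name ();
    g_print ("%s\n", name);
    /* every call pays for a malloc + memcpy here, and a free there */
    g_free (name);
    return 0;
}

This is essentially what the non-refcounting branch of the test program below does.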
So I implemented the reference-counted version, and then recently found myself reading about the topic again, coming across claims both that avoiding those malloc/memcpy calls through reference counting is a good thing, and that atomic operations (which I used to implement reference counting) can be slow and one shouldn't be afraid of using malloc/free.
And I started wondering...
Since I couldn't really find a definitive answer, I went ahead and made a little test app to see the difference between the two methods. For those interested, here are the results.
Test application
First, the code used for those little tests, compiled simply with:
gcc m.c $(pkg-config --cflags --libs glib-2.0)
m.c
#include <glib.h>

/* toggle these to pick the variant being measured */
#define REFCOUNT    1
#define ATOMIC      1
#define NB_THREADS  5

#if REFCOUNT
typedef struct
{
    const gchar *str;
    guint ref;
#if ATOMIC
#else
    GMutex mutex;
#endif
} ss_t;

static ss_t *
ref (ss_t *ss)
{
#if ATOMIC
    g_atomic_int_inc (&ss->ref);
#else
    g_mutex_lock (&ss->mutex);
    ++ss->ref;
    g_mutex_unlock (&ss->mutex);
#endif
    return ss;
}

static void
unref (ss_t *ss)
{
#if ATOMIC
    if (g_atomic_int_dec_and_test (&ss->ref))
        g_free (ss);
#else
    g_mutex_lock (&ss->mutex);
    if (--ss->ref == 0)
    {
        g_mutex_clear (&ss->mutex);
        g_free (ss);
    }
    else
        g_mutex_unlock (&ss->mutex);
#endif
}

static ss_t *str;

/* refcounting: hand out a new reference to the shared string */
static ss_t *
get_string (void)
{
    return ref (str);
}

#else

static const gchar *str = "foobar";

/* strdup: hand out a copy the caller must free */
static gchar *
get_string (void)
{
    return g_strdup (str);
}
#endif

static gpointer
worker (gpointer data)
{
    int i;
#if REFCOUNT
    ss_t *s;
#else
    gchar *s;
#endif

    for (i = 0; i < 100000; ++i)
    {
        s = get_string ();
#if REFCOUNT
        unref (s);
#else
        g_free (s);
#endif
    }
    return NULL;
}

int
main (void)
{
    int i;
    GThread *thread[NB_THREADS];

#if REFCOUNT
    str = g_new (ss_t, 1);
    str->str = "foobar";
    str->ref = 1;
#if ATOMIC
#else
    g_mutex_init (&str->mutex);
#endif
#endif
    for (i = 0; i < NB_THREADS; ++i)
        thread[i] = g_thread_new ("worker", worker, NULL);
    for (i = 0; i < NB_THREADS; ++i)
        g_thread_join (thread[i]);
#if REFCOUNT
    unref (str);
#endif
    return 0;
}
Results
First, I ran these with 8 threads. For each variant I give the binary size and average time, valgrind's heap/leak summary, and the individual timings of five runs (times in seconds):
Reference counting, using mutex
Binary was 8,652 bytes; Average time: 0.442
==17320== HEAP SUMMARY:
==17320== in use at exit: 5,476 bytes in 12 blocks
==17320== total heap usage: 38 allocs, 26 frees, 8,092 bytes allocated
==17320==
==17320== LEAK SUMMARY:
==17320== definitely lost: 0 bytes in 0 blocks
==17320== indirectly lost: 0 bytes in 0 blocks
==17320== possibly lost: 2,016 bytes in 2 blocks
==17320== still reachable: 3,460 bytes in 10 blocks
==17320== suppressed: 0 bytes in 0 blocks
./a.out 1.32s user 0.47s system 345% cpu 0.520 total
./a.out 1.29s user 0.37s system 335% cpu 0.495 total
./a.out 1.35s user 0.40s system 334% cpu 0.524 total
./a.out 1.11s user 0.38s system 334% cpu 0.444 total
./a.out 1.37s user 0.53s system 357% cpu 0.531 total
Reference counting, atomic operations
Binary was 8,053 bytes; Average time: 0.199
==17441== HEAP SUMMARY:
==17441== in use at exit: 5,476 bytes in 12 blocks
==17441== total heap usage: 37 allocs, 25 frees, 8,044 bytes allocated
==17441==
==17441== LEAK SUMMARY:
==17441== definitely lost: 0 bytes in 0 blocks
==17441== indirectly lost: 0 bytes in 0 blocks
==17441== possibly lost: 2,016 bytes in 2 blocks
==17441== still reachable: 3,460 bytes in 10 blocks
==17441== suppressed: 0 bytes in 0 blocks
./a.out 0.76s user 0.00s system 357% cpu 0.214 total
./a.out 0.65s user 0.00s system 356% cpu 0.183 total
./a.out 0.83s user 0.00s system 372% cpu 0.225 total
./a.out 0.72s user 0.00s system 365% cpu 0.197 total
./a.out 0.61s user 0.00s system 349% cpu 0.175 total
Duplicating strings
Binary was 7,761 bytes; Average time: 0.039
==17601== HEAP SUMMARY:
==17601== in use at exit: 5,476 bytes in 12 blocks
==17601== total heap usage: 800,036 allocs, 800,024 frees, 5,608,028 bytes allocated
==17601==
==17601== LEAK SUMMARY:
==17601== definitely lost: 0 bytes in 0 blocks
==17601== indirectly lost: 0 bytes in 0 blocks
==17601== possibly lost: 2,016 bytes in 2 blocks
==17601== still reachable: 3,460 bytes in 10 blocks
==17601== suppressed: 0 bytes in 0 blocks
./a.out 0.10s user 0.00s system 292% cpu 0.035 total
./a.out 0.11s user 0.00s system 252% cpu 0.042 total
./a.out 0.10s user 0.01s system 269% cpu 0.040 total
./a.out 0.10s user 0.01s system 273% cpu 0.039 total
./a.out 0.10s user 0.00s system 273% cpu 0.038 total
After looking at these results, I worried that implementing reference counting might have done more harm than good in the end, slowing things down here by a factor of 5. Of course, this is a pretty extreme case (8 threads each doing 100,000 operations). And let's not forget the difference in memory usage either: over 800,000 allocations when duplicating strings, versus 37 with reference counting.
Down to 5 threads
Atomic reference counting
Average time: 0.117
==19689== HEAP SUMMARY:
==19689== in use at exit: 5,476 bytes in 12 blocks
==19689== total heap usage: 28 allocs, 16 frees, 7,087 bytes allocated
==19689==
==19689== LEAK SUMMARY:
==19689== definitely lost: 0 bytes in 0 blocks
==19689== indirectly lost: 0 bytes in 0 blocks
==19689== possibly lost: 2,016 bytes in 2 blocks
==19689== still reachable: 3,460 bytes in 10 blocks
==19689== suppressed: 0 bytes in 0 blocks
./a.out 0.43s user 0.00s system 355% cpu 0.122 total
./a.out 0.40s user 0.00s system 346% cpu 0.117 total
./a.out 0.38s user 0.00s system 350% cpu 0.110 total
./a.out 0.37s user 0.00s system 338% cpu 0.108 total
./a.out 0.45s user 0.00s system 350% cpu 0.127 total
Duplicating strings
Average time: 0.028
==19584== HEAP SUMMARY:
==19584== in use at exit: 5,476 bytes in 12 blocks
==19584== total heap usage: 500,027 allocs, 500,015 frees, 3,507,071 bytes allocated
==19584==
==19584== LEAK SUMMARY:
==19584== definitely lost: 0 bytes in 0 blocks
==19584== indirectly lost: 0 bytes in 0 blocks
==19584== possibly lost: 2,016 bytes in 2 blocks
==19584== still reachable: 3,460 bytes in 10 blocks
==19584== suppressed: 0 bytes in 0 blocks
./a.out 0.07s user 0.00s system 253% cpu 0.026 total
./a.out 0.06s user 0.01s system 239% cpu 0.028 total
./a.out 0.06s user 0.00s system 250% cpu 0.027 total
./a.out 0.07s user 0.00s system 222% cpu 0.030 total
./a.out 0.07s user 0.00s system 214% cpu 0.031 total
Atomic operations are still slower, but only by a factor of 4.
Down to 2 threads
Things change quite a bit with only two threads, where (apart from memory usage, obviously) the results were pretty much the same:
Reference counting (atomic)
Average time: 0.018
==19473== HEAP SUMMARY:
==19473== in use at exit: 5,476 bytes in 12 blocks
==19473== total heap usage: 19 allocs, 7 frees, 6,130 bytes allocated
==19473==
==19473== LEAK SUMMARY:
==19473== definitely lost: 0 bytes in 0 blocks
==19473== indirectly lost: 0 bytes in 0 blocks
==19473== possibly lost: 2,016 bytes in 2 blocks
==19473== still reachable: 3,460 bytes in 10 blocks
==19473== suppressed: 0 bytes in 0 blocks
./a.out 0.02s user 0.00s system 149% cpu 0.016 total
./a.out 0.02s user 0.00s system 137% cpu 0.017 total
./a.out 0.03s user 0.00s system 170% cpu 0.018 total
./a.out 0.03s user 0.00s system 168% cpu 0.018 total
./a.out 0.04s user 0.00s system 162% cpu 0.022 total
Duplicating strings
Average time: 0.017
==19381== HEAP SUMMARY:
==19381== in use at exit: 5,476 bytes in 12 blocks
==19381== total heap usage: 200,018 allocs, 200,006 frees, 1,406,114 bytes allocated
==19381==
==19381== LEAK SUMMARY:
==19381== definitely lost: 0 bytes in 0 blocks
==19381== indirectly lost: 0 bytes in 0 blocks
==19381== possibly lost: 2,016 bytes in 2 blocks
==19381== still reachable: 3,460 bytes in 10 blocks
==19381== suppressed: 0 bytes in 0 blocks
./a.out 0.03s user 0.00s system 171% cpu 0.016 total
./a.out 0.03s user 0.00s system 160% cpu 0.017 total
./a.out 0.03s user 0.00s system 164% cpu 0.016 total
./a.out 0.03s user 0.00s system 156% cpu 0.019 total
./a.out 0.03s user 0.00s system 145% cpu 0.018 total
Reference counting might not be a bad thing after all. Yes, atomic operations might be "slow", or at least slower, but they remain pretty fast; reference counting is definitely better as far as memory usage goes, and will probably only slow things down in extreme cases, if at all.
As a last try, I ran those tests again, but with each of the 8 threads doing only 1,000 operations instead of the 100,000 they did so far. In this case, again except for memory usage of course, the results are pretty much exactly the same, averaging 0.003 both when duplicating strings and when using atomic reference counting.
Atomic reference counting
==20666== HEAP SUMMARY:
==20666== in use at exit: 5,476 bytes in 12 blocks
==20666== total heap usage: 37 allocs, 25 frees, 8,044 bytes allocated
==20666==
==20666== LEAK SUMMARY:
==20666== definitely lost: 0 bytes in 0 blocks
==20666== indirectly lost: 0 bytes in 0 blocks
==20666== possibly lost: 2,016 bytes in 2 blocks
==20666== still reachable: 3,460 bytes in 10 blocks
==20666== suppressed: 0 bytes in 0 blocks
./a.out 0.00s user 0.00s system 79% cpu 0.004 total
./a.out 0.00s user 0.00s system 0% cpu 0.002 total
./a.out 0.00s user 0.00s system 0% cpu 0.002 total
./a.out 0.00s user 0.00s system 0% cpu 0.002 total
./a.out 0.00s user 0.00s system 69% cpu 0.005 total
Duplicating strings
==20634== HEAP SUMMARY:
==20634== in use at exit: 5,476 bytes in 12 blocks
==20634== total heap usage: 8,036 allocs, 8,024 frees, 64,028 bytes allocated
==20634==
==20634== LEAK SUMMARY:
==20634== definitely lost: 0 bytes in 0 blocks
==20634== indirectly lost: 0 bytes in 0 blocks
==20634== possibly lost: 2,016 bytes in 2 blocks
==20634== still reachable: 3,460 bytes in 10 blocks
==20634== suppressed: 0 bytes in 0 blocks
./a.out 0.00s user 0.00s system 0% cpu 0.003 total
./a.out 0.00s user 0.00s system 0% cpu 0.003 total
./a.out 0.00s user 0.00s system 0% cpu 0.002 total
./a.out 0.00s user 0.00s system 0% cpu 0.003 total
./a.out 0.00s user 0.00s system 0% cpu 0.003 total
So in the end, I'm gonna stick with atomic reference counting in my app: it should have little to no impact on speed, and will definitely help as far as memory usage goes.