Jeesdev

A generic hashtable in C

Kuvis

2.3.2019 23:58:15

When ever I've implemented hashtables in C in the past, the way I've made them "generic" has been to create macros that declare new types and sets of functions for the tables wanted. For example, taken from the MUTA source code:

#define DYNAMIC_HASH_TABLE_DEFINITION(table_name, value_type, key_type, \
   hash_type, hash_func, bucket_sz) \
   ... \

...

DYNAMIC_HASH_TABLE_DEFINITION(str_int_table, int, char *, uint32,
   fnv_hash32_from_str, 4);

The above macro would declare a new type, str_int_table_t, and functions such as str_int_table_init(), str_int_table_insert() and str_int_table_erase().

With such macros, it's a bother to have to call the macro before using any new type of a hashtable. However, for example the stb library implements easier-to-use, generic dynamic arrays with macro trickery, so why not try the same with hashtables?

Turns out working with a more complex structure like a hashtable takes a little more work than dynamic arrays. We cannot automatically create versions of a single function for different data types in C like we could with C++ templates, so the main problem is passing compile-time data such as struct member types and sizes to functions. And we do want to use functions because we will need complex statements such as loops - such statements do not (very cleanly at least) fit inside plain macros.

The above paragraph might not have made immediate sense, so I'll attempt to clarify my point in practice. Below is the new hashtable declaration macro.

#define hashtable(key_type, value_type) \
   struct { \
       struct { \
           struct { \
               key_type    key; \
               value_type value; \
               size_t      hash; \
               uint8_t     reserved; \
           } items[HASHTABLE_BUCKET_SIZE]; \
       } *buckets; \
       size_t num_buckets; \
       size_t num_values; \
   }

Below is how it's used.

/* Declare a hashtable called 'table' whose key type is char * and value type
* is int. */
hashtable(char *, int) table;

As may be observed, the type of the resulting struct is anonymous, meaning it cannot even be passed to a struct as a void * and then cast back to its original type. So how do we do what we need to do? Below is the hashtable_insert_ext() macro and the signature of the _hashtable_insert() function the macro calls.

#define hashtable_insert_ext(table_, pair_, compare_keys_func_, \
   copy_key_func_, ret_err_) \
   (table_.buckets = _hashtable_insert((ret_err_), \
       (uint8_t*)table_.buckets, &table_.num_buckets, &table_.num_values,        \
       sizeof(*table_.buckets), sizeof(table_.buckets[0].items[0]), \
       _hashtable_ptr_offset(&table_.buckets[0].items[0], \
           &table_.buckets[0]), \
       _hashtable_ptr_offset(&table_.buckets[0].items[0].key, \
           &table_.buckets[0].items[0]), \
       _hashtable_ptr_offset(&table_.buckets[0].items[0].value, \
           &table_.buckets[0].items[0]), \
       _hashtable_ptr_offset(&table_.buckets[0].items[0].hash, \
           &table_.buckets[0].items[0]), \
       _hashtable_ptr_offset(&table_.buckets[0].items[0].reserved, \
           &table_.buckets[0].items[0]), \
       &pair_.key, sizeof(pair_.key), pair_.hash, &pair_.value, \
       sizeof(pair_.value), compare_keys_func_, copy_key_func_))

void *_hashtable_insert(int *ret_err, uint8_t *buckets, size_t *num_buckets,
   size_t *num_values, size_t bucket_size, size_t item_size,
   size_t items_offset, size_t key_off1, size_t value_off1, size_t    hash_off1,
   size_t reserved_off1, void *key, size_t key_size, size_t hash, void    *value,
   size_t value_size,
   int (*compare_keys)(const void *a, const void *b, size_t size),
   int (*copy_key)(void *dst, const void *src, size_t size));

Yep, instead of passing in a struct to _hashtable_insert(), we pass pointers to each of the struct's relevant members. Sizes of some data types are also passed, and finally, struct offsets using the macro _hashtable_ptr_offset(). That allows us to modify everything at the byte level, which we do inside _hashtable_insert(). As a side note, the function callbacks passed are there to allow for dynamic types such as copyable strings.

One may wonder if this is an efficient way to do things. I would say it probably isn't, not in terms of runtime efficiency anyway. But in terms of programmer efficiency, I think it might be the right choice until I come across a better solution. That's because with the this implementation, doing things such as the following is super easy compared to what I used to have to do.

hashtable(uint32_t, int) table;
hashtable_init(table, 8, 0);

struct {
   uint32_t    key;
   int         value;
   size_t      hash;
} insert_data = {123, 5, hash_u32(123)};
hashtable_insert(table, insert_data);

struct {
   uint32_t    key;
   size_t      hash;
} find_data = {123, hash_u32(123)};
int *value = hashtable_find(table, find_data);
if (value) {
   ...
}

Okay, maybe it isn't the most intuitive API ever, but I like it far more than having to write type/function declaration macros like I used to. And for my hashtable use-cases, I'm willing to pay for the potential runtime-overhead.

Also, I didn't actually have a hashtable that stored the actual keys in MUTA before. While not storing keys is fine for assets and similar items to save on memory, for runtime-generated ID's a table like this will be required.

I've been working on this over the weekend and it likely still has some bugs in it, but in case you're curious, the Git repository can be found here. I've also uploaded the current header and source files here and here, respectively. I'll keep on working on it and start using it in MUTA very soon.

Edit 3.3.2019
After a night of thinking I decided to the change the API so that it would not require special structs created by the user. The reason I used them originally was an attempt to force certain type-safety, namely preventing accidental pointer key casting to void * (since that can happen implicitly in C). An example of such a mistake would be when you try to pass in the address of a char *, but accidentally pass in the pointer's value rather than it's address.

char *key = ...;
hashtable_insert(
   table,
   key, // Should be &key but no error is generated due to implicit cast to void * inside the macro
   &value,
   NULL );

I concluded that similar "type-safety" could be achieved by having the user pass references to variables directly instead of their addresses. The macros will instead fetch the addresses and we should get a compilation error when certain mistakes happen. For example, if the user passes in a key as '&key', the macro would format it to '&&key', which would result in an error. The API change should reduce the amount of lines required to achieve the same result, and I think could also prove out to be more clear. Below is an example of the new API in use.

hashtable(uint64_t, uint32_t) table;
...
uint64_t    key     = 12345;
uint32_t    value   = 54321;
size_t      hash    = hashtable_hash(&key, sizeof(key));
hashtable_insert(table, key, hash, value, NULL);

I've also uploaded the newer source files here and here.

We fulfill power fantasies

A generic hashtable in C