Best practices for multi-character character constants

Multi-character character constants such as

	'abcd' 

have long been permissible in C, and are now standardized, although their use is discouraged, for very good reasons. They are very useful in certain circumstances, however, and are used in practice.

For brevity, here “multi-character integer character constants” will be called multi-chars.

The best standard to date is

INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:TC3
Final version of the C99 standard with corrigenda TC1, TC2, and TC3 included

In subsection 6.4.4.4 (Character constants), the standard states that a multi-char always has type int, but that its value is “implementation-defined”. That is, different compilers may resolve the same multi-char to different integers. This is a portability problem, and it is one of the reasons multi-chars are discouraged.

However, when programming in some rather common data structure environments, multi-chars can help to make readable code.

This page details the issues involved, presents some options, and is intended as a “best practices” guide for coding with multi-chars.

Common applications

Many data structures contain 4-byte fields meant to be interpreted as four ASCII letters. Examples go by names such as “resource ID”, “tag”, etc. When coding for these structures, it is very convenient to use the character representation directly as constants in code, as in a switch:

	switch( tag ) {
	case 'TAG1': ...; break;
	case 'TAG2': ...; break;
	}

Here, a multi-char is very clear and convenient and potentially more efficient than some of the alternatives.
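As a hedged illustration of such a field, the sketch below reads a 4-byte tag from a raw buffer into a 32-bit integer, with the first byte in the most significant position. The function name read_tag_be and the chosen byte order are assumptions made for illustration; a real format dictates its own order.

```c
#include <stdint.h>

/* Hypothetical helper: combine four bytes into one 32-bit tag,
   placing the first byte in the most significant position. */
uint32_t read_tag_be(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
}
```

On a compiler that packs 'TAG1' with 'T' in the most significant byte, the integer produced from the bytes T, A, G, 1 can be compared directly against the multi-char.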

There are also some applications that use 2- or 8-byte IDs, but it is uncommon for the lengths to be mixed.

Options

C string literals

First, one might think of initializing integers using C string literals. However, this approach can only be used at run time.

The problem is not how to pack bytes into an integer; the problem is how to get the bytes out of a string. In C, any expression that extracts a character from a string (or an element from an array) is not an integer constant expression. It therefore can’t be used as a case label in a switch, and can’t be used to initialize an object with static storage duration.
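To illustrate the point, here is a minimal sketch of the run-time approach. The names tag_from_string and classify_tag are hypothetical. The packing function works, but a call to it is not a constant expression, so comparisons have to be written as if chains instead of a switch.

```c
#include <stdint.h>

/* Hypothetical helper: pack the first four characters of s into an
   integer, first character most significant. The caller must pass
   a string of at least four characters. */
uint32_t tag_from_string(const char *s)
{
    return ((uint32_t)(unsigned char)s[0] << 24)
         | ((uint32_t)(unsigned char)s[1] << 16)
         | ((uint32_t)(unsigned char)s[2] <<  8)
         |  (uint32_t)(unsigned char)s[3];
}

/* The tag values are only known at run time, so they cannot be case
   labels. Returns 1 or 2 for the two example tags, 0 otherwise. */
int classify_tag(uint32_t tag)
{
    if (tag == tag_from_string("TAG1")) return 1;
    if (tag == tag_from_string("TAG2")) return 2;
    return 0;
}
```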

In C++, objects initialized by strings might be more natural, and it would also be possible to produce a hash implementation for efficient searches. Similar measures could be taken in C as well, but would be messy.

preprocessor macro

A preprocessor macro

#define LE_CHR(a,b,c,d) ( ((a)<<24) | ((b)<<16) | ((c)<<8) | (d) )

is portable in the sense that it puts the right-most character of the multi-char into the least significant position in the resulting integer, in little-endian fashion.

On the other hand, the character sequence ‘TAG1’ is much harder to read and search for in code using the macro:

	LE_CHR( 'T', 'A', 'G', '1' )

In this code, the data structure issue upstages the data itself.
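For comparison, the following sketch shows the macro used for case labels. This works because the macro's result is an integer constant expression when its arguments are plain character constants. The function tag_index is hypothetical, and the sketch assumes a 32-bit int and ASCII characters; LE_CHR is the macro from the text, repeated so the fragment is self-contained.

```c
/* Same macro as in the text. */
#define LE_CHR(a,b,c,d) ( ((a)<<24) | ((b)<<16) | ((c)<<8) | (d) )

/* Hypothetical dispatcher: returns 1 or 2 for the two example tags,
   0 for anything else. */
int tag_index(int tag)
{
    switch (tag) {
    case LE_CHR('T', 'A', 'G', '1'): return 1;
    case LE_CHR('T', 'A', 'G', '2'): return 2;
    default:                         return 0;
    }
}
```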

Problems

endian-ness

Typically, on little-endian architectures, the rightmost character of a multi-char will become the least significant byte of the resulting integer. On big-endian architectures, it is the other way around.

In principle the compiler could also switch the order.

Endian-ness is easy to detect and handle though.

incomplete multi-chars and padding

Compilers differ in how they handle incomplete multi-chars, that is, the case where a multi-char doesn’t wholly specify an int, such as

	'abc'

Some compilers pad on the left, some on the right, regardless of endian-ness! Some compilers may not pad at all! Not padding wouldn’t contradict the standard, but it would result in code that might behave differently from one run to the next!

Unfortunately, I know of no robust programmatic way to detect whether a compiler pads with zeros, or whether a given multi-char properly fills an int. Therefore, on compilers that don’t pad, a typo 'abc' can produce errors that are erratic and hard to diagnose.

It is the author’s opinion that compilers that don’t pad multi-chars with zero are broken.

There are other issues having to do with readability of escape sequences and wide characters in multi-chars. Some are described in the same section of the C99 standard (6.4.4.4), under EXAMPLES. These issues rarely arise in the application being discussed here, except for one: it is very easy for a typo to result in an incomplete multi-char, or worse, an incomplete multi-char that looks like a complete one.
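One partial safeguard against such typos, offered here only as a sketch, is to count the characters of a multi-char as written in the source by stringizing it. The macro names MC_SOURCE_LEN and MC4 are hypothetical. The count is only meaningful when the multi-char contains no escape sequences, and it says nothing about how the compiler pads.

```c
/* Stringizing 'abcd' yields the string "'abcd'": two quotes, the
   characters, and a terminating NUL, hence the subtraction of 3. */
#define MC_SOURCE_LEN(mc) ( sizeof(#mc) - 3 )

/* Evaluates to the multi-char itself, but fails to compile (negative
   array size) unless exactly four source characters were written. */
#define MC4(mc) \
    ( sizeof(char[MC_SOURCE_LEN(mc) == 4 ? 1 : -1]) ? (mc) : 0 )
```

With these definitions, MC4('TAG1') can still be used as a case label, while a typo such as MC4('TAG') is rejected at compile time.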

Best practices

First, don’t use multi-chars without due consideration. They are not portable.

When writing toward a little-endian structure for an architecture expected also to be little-endian, robust code can be written by taking a few precautions. A check that the architecture is indeed little-endian should suffice.

But even in more complex scenarios, simply wrapping all multi-chars in macros should provide enough flexibility to do necessary checking and possible swapping of bytes.
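A minimal sketch of such a wrapper follows. The macro name MC, the configuration switch TAGS_NEED_SWAP, and the function tag_is_known are all hypothetical. The point is only that every multi-char passes through one place where swapping or checking can be added. Compilers may warn about the multi-chars themselves.

```c
#include <stdint.h>

/* Hypothetical configuration switch: define TAGS_NEED_SWAP when the
   compiler's packing order is the reverse of the data structure's. */
#ifdef TAGS_NEED_SWAP
#define MC(x) ( (((uint32_t)(x) & 0x000000FFu) << 24) \
              | (((uint32_t)(x) & 0x0000FF00u) <<  8) \
              | (((uint32_t)(x) & 0x00FF0000u) >>  8) \
              | (((uint32_t)(x) & 0xFF000000u) >> 24) )
#else
#define MC(x) ( (uint32_t)(x) )
#endif

/* The wrapped multi-chars remain constant expressions, so they can
   still serve as case labels. Returns 1 for a known tag, else 0. */
int tag_is_known(uint32_t tag)
{
    switch (tag) {
    case MC('TAG1'):
    case MC('TAG2'):
        return 1;
    default:
        return 0;
    }
}
```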

Incomplete multi-chars are a tricky problem. If they can be avoided altogether, then robust code can be written. Otherwise, there are measures that can be taken, but the current standards simply don't say anything helpful on the subject, and absolute certainty is difficult to achieve.

tests for endian-ness

There are many ways to check endian-ness at run time:

	#include <assert.h>
	...
	/* Returns nonzero if the least significant byte of an int is stored first. */
	int IS_LITTLE_ENDIAN(void) {
		static const int NL_AT_END = 0x000A;
		return ((const char*)(const void*)&NL_AT_END)[0] == '\n';
	}
	...
	assert( IS_LITTLE_ENDIAN() );

Given that multi-chars are already being used in the code, endian-ness is easy to check for at compile time:

	#if( 'q\0\0\0' & 'q' )
		#error( "architecture is big-endian" )
	#endif

Since the standard is silent as to how characters in a multi-char are packed into an integer, it may be best to check that they are packed as expected:

	#if( 'abcd' != LE_CHR( 'a', 'b', 'c', 'd' ) )
		#error( "unexpected multi-character packing" )
	#endif
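The standard also leaves it implementation-defined whether a character constant in a #if directive has the same value as the identical constant in an ordinary expression. As a hedged complement to the preprocessor check, the sketch below repeats the test in the compiler proper, assuming a C11 compiler that supports _Static_assert. The function name multichar_packing_ok is hypothetical.

```c
#include <limits.h>

/* Same macro as in the text. */
#define LE_CHR(a,b,c,d) ( ((a)<<24) | ((b)<<16) | ((c)<<8) | (d) )

/* Compile-time checks evaluated by the compiler, not the preprocessor. */
_Static_assert(sizeof(int) * CHAR_BIT == 32, "int is not 32 bits wide");
_Static_assert('abcd' == LE_CHR('a', 'b', 'c', 'd'),
               "unexpected multi-character packing");

/* Run-time counterpart, for code that must build without C11. */
int multichar_packing_ok(void)
{
    return 'abcd' == LE_CHR('a', 'b', 'c', 'd');
}
```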

padding test

Similarly, there are tests for the side on which incomplete multi-chars are padded:

	#if( '\0abc' != 'abc' )
		#error( "compiler not padding multi-chars on the left" )
	#endif

Unfortunately, this is not a robust test if the compiler does not pad at all. It could conceivably pass and fail erratically on such a compiler.

reversing endian-ness

When code that handles a little-endian data structure is meant to run on arbitrary architectures, it is useful to be able to reverse the bytes of an int. This macro performs a compile-time reversal of bytes in a 4-byte int:

#define REV_BYTES(q) ( ( (q) & 0x000000FF ) << 24 | ( (q) & 0x0000FF00 ) <<  8 \
                     | ( (q) & 0x00FF0000 ) >>  8 | ( (q) & 0xFF000000 ) >> 24 )
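When the argument is a signed int whose low byte is 0x80 or greater, the left shift by 24 in the macro above can overflow outside of a #if directive. The following sketch is a variant that converts to uint32_t first. The names REV_BYTES_U32 and swap_tag are hypothetical.

```c
#include <stdint.h>

/* Hypothetical variant of REV_BYTES that works on uint32_t throughout,
   so no shift can overflow a signed int. */
#define REV_BYTES_U32(q) ( (((uint32_t)(q) & 0x000000FFu) << 24) \
                         | (((uint32_t)(q) & 0x0000FF00u) <<  8) \
                         | (((uint32_t)(q) & 0x00FF0000u) >>  8) \
                         | (((uint32_t)(q) & 0xFF000000u) >> 24) )

/* Run-time use: convert a tag that was read in the opposite byte order. */
uint32_t swap_tag(uint32_t tag)
{
    return REV_BYTES_U32(tag);
}
```

Applying the reversal twice returns the original value, which makes for a simple sanity check.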

compiler notes

gcc

The GNU C compiler gcc implements multi-chars in its preprocessor stage, cpp.

The documentation for The C Preprocessor (under Implementation-defined behavior) says that it:

Wishes

Given a world divided by endian-ness, and given that the multi-char exists in the language, it would have been best if facilities were provided to specify or detect the compiler’s behavior.

Summary