Best practices for multi-character character constants

Multi-character character constants such as

	'abcd'

have long been permissible in C, and are now standardized, although their use is discouraged for very good reasons. They are nevertheless very useful in certain circumstances, and they are used in practice.

For brevity, here “multi-character integer character constants” will be called multi-chars.

The best standard to date is

INTERNATIONAL STANDARD ©ISO/IEC ISO/IEC 9899:TC3
Final version of the C99 standard with corrigenda TC1, TC2, and TC3 included

In subsection 6.4.4.4 (Character constants), the standard states that a multi-char always has type int, but that its value is “implementation-defined”. That is, different compilers may resolve the same multi-char to different integers. This is a portability problem, and it is one of the reasons multi-chars are discouraged.

However, when working with some rather common data structures, multi-chars can help to make code readable.

This page details the issues involved and some options, and is intended as a “best practices” guide for coding with multi-chars.

Common applications

Many data structures contain 4-byte fields meant to be interpreted as four ASCII letters. Examples go by names such as “resource ID”, “tag”, etc. When coding for these structures, it is very convenient to write the constants in their character form, as in a switch:

	switch( tag ) {
	case 'TAG1': ...;
	case 'TAG2': ...;
	}

Here, a multi-char is very clear and convenient and potentially more efficient than some of the alternatives.

There are also some applications that use 2- or 8-byte IDs, but it is uncommon for the lengths to be mixed.

Options

C string literals

First one might think of initializing integers using C string literals. This approach can only be used at run-time however.

The problem is not how to pack bytes into an integer; the problem is how to get the bytes out of a string. In C, any attempt to extract a character from a string (or an item of an array) is a run-time operation, and therefore not a constant expression. It can’t be used in a case label of a switch, and it can’t be used to initialize a const variable at file scope.
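To make the limitation concrete, here is a minimal sketch of the run-time approach. The helper name tag_from_string is hypothetical, and the packing order (first character in the most significant byte) is an assumption chosen to match the LE_CHR macro described in the next subsection.

```c
#include <stdint.h>

/* Hypothetical helper: pack the first four bytes of s into a 32-bit
   integer, first character in the most significant byte.
   The caller must supply at least four characters. */
static uint32_t tag_from_string( const char *s )
{
	return ( (uint32_t)(unsigned char)s[0] << 24 )
	     | ( (uint32_t)(unsigned char)s[1] << 16 )
	     | ( (uint32_t)(unsigned char)s[2] <<  8 )
	     |   (uint32_t)(unsigned char)s[3];
}

/* Fine at run time:
       if( tag == tag_from_string( "TAG1" ) ) ...
   Not valid C, because a function call is not a constant expression:
       case tag_from_string( "TAG1" ): ...                            */
```

The function works on any compiler, but every comparison costs a call (or an inlined sequence of shifts), and the result can never appear as a case label.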

In C++, objects initialized by strings might be more natural, and it would also be possible to produce a hash implementation for efficient searches. Similar measures could be taken in C as well, but would be messy.

preprocessor macro

A preprocessor macro

#define LE_CHR(a,b,c,d) ( ((a)<<24) | ((b)<<16) | ((c)<<8) | (d) )

is portable in the sense that it puts the right-most character of the multi-char into the least significant position in the resulting integer, in little-endian fashion.

On the other hand, the character sequence ‘TAG1’ is much harder to read and search for in code using the macro:

	LE_CHR( 'T', 'A', 'G', '1' )

In this code, the data structure issue upstages the data itself.
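One way to limit the damage, sketched here under the assumption of a 32-bit int and plain ASCII tag characters, is to spell each tag out exactly once and refer to it by name afterwards. The enumerator names and the tag_name function are hypothetical.

```c
#define LE_CHR(a,b,c,d) ( ((a)<<24) | ((b)<<16) | ((c)<<8) | (d) )

/* Each tag is spelled out once; the trailing comment keeps the
   four-letter form searchable. Tag names are hypothetical. */
enum {
	TAG_TAG1 = LE_CHR( 'T', 'A', 'G', '1' ),	/* TAG1 */
	TAG_TAG2 = LE_CHR( 'T', 'A', 'G', '2' )		/* TAG2 */
};

/* The macro expands to an integer constant expression, so the
   enumerators are valid case labels. */
static const char *tag_name( int tag )
{
	switch( tag ) {
	case TAG_TAG1: return "TAG1";
	case TAG_TAG2: return "TAG2";
	default:       return "unknown";
	}
}
```

This keeps the switch readable, at the cost of a separate table of definitions that has to be maintained alongside the code.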

Problems

endian-ness

Typically, on little-endian architectures, the rightmost character of a multi-char will become the least significant byte of the resulting integer. On big-endian architectures, it is the other way around.

In principle the compiler could also switch the order.

Endian-ness is easy to detect and handle though.

incomplete multi-chars and padding

Compilers differ in how they handle incomplete multi-chars, that is, the case where a multi-char doesn’t wholly specify an int, such as

	'abc'

Some compilers pad on the left, some on the right, regardless of endian-ness! Some compilers may not pad at all! It wouldn't contradict the standard, but would result in code that might behave differently from one run to the next!

Unfortunately, I know of no robust programmatic way to detect whether a compiler pads with zero, or whether a given multi-char properly fills an int. Therefore, on compilers that don’t pad, a typo such as 'abc' can produce errors that are erratic and hard to diagnose.
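For exploring a particular compiler, a small diagnostic can at least report what one incomplete multi-char evaluates to. The sketch below assumes a 32-bit int and ASCII; the enum and function names are hypothetical. It is not a robust test: on a compiler that leaves the unused byte unspecified, the answer means nothing and could change between builds.

```c
/* Hypothetical diagnostic; these names are not from any standard. */
enum mc_padding { MC_PAD_LEFT, MC_PAD_RIGHT, MC_PAD_OTHER };

static enum mc_padding probe_padding( void )
{
	/* Assumes a 32-bit int and 8-bit ASCII characters. */
	unsigned int v = (unsigned int)'abc';

	if( v == 0x00616263u ) return MC_PAD_LEFT;	/* 0, 'a', 'b', 'c' */
	if( v == 0x61626300u ) return MC_PAD_RIGHT;	/* 'a', 'b', 'c', 0 */
	return MC_PAD_OTHER;	/* reversed order, no padding, or something else */
}
```

A result of MC_PAD_LEFT or MC_PAD_RIGHT is only evidence about this one constant in this one build, not a guarantee about the compiler.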

It is the author’s opinion that compilers that don’t pad multi-chars with zero are broken.

There are other issues having to do with the readability of escape sequences and wide characters in multi-chars. Some are described in the same section of the C99 standard, under EXAMPLES. These issues rarely arise in the application being discussed here, however, with one exception: it is very easy for a typo to result in an incomplete multi-char, or worse, an incomplete multi-char that looks like a complete one.

Best practices

First, don’t use multi-chars without due consideration. They are not portable.

When writing toward a little-endian structure for an architecture that is also expected to be little-endian, robust code can be written by taking a few precautions. A check that the architecture is indeed little-endian should suffice.

But even in more complex scenarios, simply wrapping all multi-chars in macros should provide enough flexibility to do necessary checking and possible swapping of bytes.
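As a rough sketch of such a wrapper, the following selects a definition at compile time from the compiler's actual packing. It reuses the LE_CHR and REV_BYTES macros given elsewhere on this page; the names TAG and is_known_tag are hypothetical, and a 32-bit int and ASCII tag characters are assumed.

```c
#define LE_CHR(a,b,c,d) ( ((a)<<24) | ((b)<<16) | ((c)<<8) | (d) )
#define REV_BYTES(q) ( ( (q) & 0x000000FF ) << 24 | ( (q) & 0x0000FF00 ) <<  8 \
                     | ( (q) & 0x00FF0000 ) >>  8 | ( (q) & 0xFF000000 ) >> 24 )

/* Pick a definition of TAG() from how this compiler packs 'abcd'. */
#if 'abcd' == LE_CHR( 'a', 'b', 'c', 'd' )
	#define TAG(mc) (mc)			/* already in the expected order */
#elif 'abcd' == LE_CHR( 'd', 'c', 'b', 'a' )
	#define TAG(mc) REV_BYTES(mc)		/* reversed: swap the bytes */
#else
	#error "unexpected multi-character packing"
#endif

/* All multi-chars in the code go through the wrapper. */
static int is_known_tag( int tag )
{
	switch( tag ) {
	case TAG( 'TAG1' ):
	case TAG( 'TAG2' ):
		return 1;
	default:
		return 0;
	}
}
```

One caveat: the standard leaves it implementation-defined whether a character constant has the same value inside #if as it has in an ordinary expression, so this check is only as trustworthy as the compiler's agreement with its own preprocessor.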

Incomplete multi-chars are a tricky problem. If they can be avoided altogether, then robust code can be written. Otherwise, there are measures that can be taken, but the current standards simply don't say anything helpful on the subject, and absolute certainty is difficult to achieve.

tests for endian-ness

There are many ways to check endian-ness at run time:

	int IS_LITTLE_ENDIAN( void ) {
		/* the low-order byte comes first in memory only on little-endian machines */
		static const int NL_AT_END = 0x000A;
		return ((const char*)(const void*)&NL_AT_END)[0] == '\n';
	}
	...
	#include <assert.h>
	...
	assert( IS_LITTLE_ENDIAN() );

Given that multi-chars are already being used in the code, endian-ness is easy to check for at compile time:

	#if( 'q\0\0\0' & 'q' )
		#error( "architecture is big-endian" )
	#endif

Since the standard is silent as to how characters in a multi-char are packed into an integer, it may be best to check that they are packed as expected:

	#if( 'abcd' != LE_CHR( 'a', 'b', 'c', 'd' ) )
		#error( "unexpected multi-character packing" )
	#endif

padding test

Similarly, one can test on which side incomplete multi-chars are padded:

	#if( '\0abc' != 'abc' )
		#error( "compiler not padding multi-chars on the left" )
	#endif

Unfortunately, this is not a robust test if the compiler does not pad at all. It could conceivably pass and fail erratically on such a compiler.
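On a compiler that is known to zero-pad on the left (as gcc documents), a guard against accidental three-character typos becomes possible. The following is a minimal sketch assuming C11 (for _Static_assert) and a 32-bit int; the macro name MC_IS_FULL4 is hypothetical. It offers nothing on a compiler that does not pad.

```c
/* Hypothetical guard. Assumes C11, a 32-bit int, and a compiler that
   zero-pads incomplete multi-chars on the left. */
#define MC_IS_FULL4(mc) ( ( (unsigned int)(mc) & 0xFF000000u ) != 0 )

/* One assertion per tag used in the program. */
_Static_assert( MC_IS_FULL4( 'TAG1' ), "incomplete multi-char" );

/* On such a compiler a typo like the following stops compilation:
       _Static_assert( MC_IS_FULL4( 'TAG' ), "incomplete multi-char" );  */
```

The guard also rejects a four-character constant whose first character is '\0', which is indistinguishable from a padded three-character one.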

reversing endian-ness

When writing code for a little-endian data structure that is meant to run on arbitrary architectures, it is useful to reverse the bytes of an int. This macro performs a compile-time reversal of bytes in a 4-byte int:

#define REV_BYTES(q) ( ( (q) & 0x000000FF ) << 24 | ( (q) & 0x0000FF00 ) <<  8 \
                     | ( (q) & 0x00FF0000 ) >>  8 | ( (q) & 0xFF000000 ) >> 24 )
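The macro is intended for constants. For values that only arrive at run time, such as a tag read from a file, a function using unsigned arithmetic is a possible counterpart, since it never shifts a bit into the sign position. The name rev_bytes_u32 is hypothetical.

```c
#include <stdint.h>

/* Hypothetical run-time counterpart of REV_BYTES for 32-bit values. */
static uint32_t rev_bytes_u32( uint32_t q )
{
	return ( q & 0x000000FFu ) << 24
	     | ( q & 0x0000FF00u ) <<  8
	     | ( q & 0x00FF0000u ) >>  8
	     | ( q & 0xFF000000u ) >> 24;
}
```

Applying the function twice returns the original value, which makes for an easy sanity check.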

compiler notes

gcc

The GNU C compiler gcc implements multi-chars in its preprocessor stage, cpp.

The documentation for The C Preprocessor (under Implementation-defined behavior) says that it:

performs the little-endian interpretation of multi-chars
pads incomplete multi-chars on the left with zero
has warning switches (see also the gcc man page) -Wmultichar -Wno-multichar
warns of overfull multi-chars:
warning: character constant too long for its type

Wishes

Given a world divided by endian-ness, and given that the multi-char exists in the language, it would have been best if facilities were provided to specify or detect the compiler’s behavior.

A standard macro to indicate endian-ness would be very helpful in writing portable code. The Gnulib header endian.h does provide BYTE_ORDER, LITTLE_ENDIAN, BIG_ENDIAN.
It would be nice if compilers had a switch that would set the endian-ness of their interpretation of multi-chars.

Some compilers have warnings for multi-chars. It would be good to have a separate warning for multi-chars that don’t wholly fill an int.
It should be an error for a multi-char to specify too many bytes for it to be represented as an int.
It should have been standardized that multi-chars that don’t wholly fill an int should be zero-padded.

In whatever way they do it, from one instance to the next, a given multi-char should resolve to the same integer.
One might expect that (assuming zero padding), the characters in a multi-char that doesn’t fill an int would be padded on the left, so that
```
	'abc' == '\0abc'

	( 'ab' & 'b' ) == 'b'

	( 'abcd' & 'cd' ) == 'cd'
```
The standard says nothing about this, and unfortunately some compilers do things such as putting the first byte of the multi-char as the most significant byte of the result. This is a very unfortunate inconsistency. (In applications I’ve seen, padding on the left is always the intent of these constructs.) Note this padding question is independent of endian-ness, or generally how the integer value is produced.

Summary

Multi-chars are problematic and non-portable, and best avoided.
In those situations where multi-chars are really helpful for code readability, there are three main issues. One is a matter of programming discipline. The other two are matters for standardization or compiler warnings.

1) endian-ness. This can be handled programmatically.

2) incomplete multi-chars. There is no robust programmatic solution for this. They could be avoided by compiler warnings, or made more useful by further standardization.

3) over-full multi-chars. This should be an error, although the standard doesn’t say so.