Endianness
Endianness is the convention for byte order that is needed whenever integers are represented with multiple bytes. In such situations, there are different ways those bytes can be arranged in memory or in transmission over some medium. The two main types of endianness are termed big-endian and little-endian; other arrangements are referred to as middle-endian. When specifically talking about bytes, endianness is also referred to as byte order or byte sex.
More generally, whenever a sequence of small units is used to form a larger ordinal value (e.g., the parts of a date or the digits in a decimal number), there must be a rule as to which order those smaller units are placed in. This could be considered similar to the situation in different written languages, where some are written left-to-right, while others (such as Arabic and Hebrew) are written right-to-left. However, there are many subtle but important differences between the concepts of endianness and text direction, most importantly when line wrapping becomes involved.
There seems to be no significant advantage in using one way over the other, and both have remained common. Generally the byte (octet) is considered an atomic unit from the point of view of storage and of all but the lowest levels of network protocols. Therefore sequences based around single bytes (e.g., text in ASCII or one of the ISO-8859-n encodings) are not generally affected by endian issues. While variable-width text encodings using the byte as their base unit could be considered to have an inbuilt endianness, this is (at least in all commonly used ones) fixed by the encoding's design. However, strings encoded with Unicode UTF-16 or UTF-32 are affected by endianness, because each character is built from units of two or four bytes.
= Endianness in computers =
==Detailed description==
: Note: numerical values such as 4A3B2C1D in this section are in hexadecimal notation.
When some computers store a 32-bit integer value in memory, for example 4A3B2C1D at address 100, they store the bytes within the address range 100 through 103 in the following order:
Big-endian: 4A at address 100, 3B at 101, 2C at 102, 1D at 103.
That is, the most significant byte (also known as the MSB, which is 4A in our example) is stored at the memory location with the lowest address, the next byte in significance, 3B, is stored at the next memory location and so on.
Architectures that follow this rule are called big-endian (mnemonic: big end first).
Other computers store the value 4A3B2C1D in the following order:
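Little-endian: 1D at address 100, 2C at 101, 3B at 102, 4A at 103.
That is, the least significant byte (also known as the LSB, which is 1D in our example) is stored at the memory location with the lowest address, the next byte in significance, 2C, is stored at the next memory location and so on. Architectures that follow this rule are called little-endian (mnemonic: little end first).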
In other words, contrary to what you might first think, endianness does not denote what the value ends with when stored in memory, but rather which end it begins with.
Note that the stated mnemonics are not the origin of the terms; see the section Discussion, background, etymology.
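The difference between the two orders can be sketched in C by interpreting the same four bytes both ways. This is a minimal illustration only; the function names from_big_endian and from_little_endian are made up for this example and are not part of any standard library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interprets four bytes, lowest address first, as a big-endian value:
   the first byte is the most significant one. */
uint32_t from_big_endian(const unsigned char bytes[4])
{
    assert(bytes != NULL);
    return ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8)  |  (uint32_t)bytes[3];
}

/* Interprets the same four bytes as a little-endian value:
   the first byte is the least significant one. */
uint32_t from_little_endian(const unsigned char bytes[4])
{
    assert(bytes != NULL);
    return ((uint32_t)bytes[3] << 24) | ((uint32_t)bytes[2] << 16) |
           ((uint32_t)bytes[1] << 8)  |  (uint32_t)bytes[0];
}
```

Given the bytes 4A, 3B, 2C, 1D at increasing addresses, the first function yields 4A3B2C1D and the second yields 1D2C3B4A. Because both functions use shifts rather than pointer casts, their results do not depend on the endianness of the machine that runs them.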
Some architectures can be configured either way; these include ARM, PowerPC (but not the PowerPC 970), DEC Alpha, MIPS, PA-RISC and IA-64. The word bytesexual or bi-endian, said of hardware, denotes the ability to compute or pass data in either big-endian or little-endian format (depending, presumably, on a mode bit somewhere). Many of these architectures can be switched via software to default to a specific endian format (usually when the computer starts up); on some, however, the default endianness is selected by hardware on the motherboard and sometimes cannot be changed by software at all (e.g., the DEC Alpha, which runs only in big-endian mode on the Cray T3E).
Still other (generally older) architectures, called middle-endian, may have a more complicated ordering such that the bytes within a 16-bit unit are ordered differently from the 16-bit units within a 32-bit word. For instance, 4A3B2C1D is stored as:
Middle-endian (PDP-11 ordering): 3B at address 100, 4A at 101, 1D at 102, 2C at 103.
Middle-endian architectures include the PDP-11 family of processors. The format for double-precision floating-point numbers on the VAX is also middle-endian. In general, these complex orderings are more confusing to work with than consistent big- or little-endianness.
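One such mixed ordering can be sketched as follows. The sketch assumes the PDP-11 convention for 32-bit values, in which the more significant 16-bit half comes first and each half is stored little-endian; the function name pdp_endian_bytes is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stores a 32-bit value in the assumed PDP-11 order:
   high 16-bit half first, each half with its low byte first. */
void pdp_endian_bytes(uint32_t value, unsigned char out[4])
{
    uint16_t high = (uint16_t)(value >> 16);
    uint16_t low  = (uint16_t)(value & 0xFFFFu);

    assert(out != NULL);
    out[0] = (unsigned char)(high & 0xFFu);  /* low byte of high half  */
    out[1] = (unsigned char)(high >> 8);     /* high byte of high half */
    out[2] = (unsigned char)(low & 0xFFu);   /* low byte of low half   */
    out[3] = (unsigned char)(low >> 8);      /* high byte of low half  */
}
```

For the value 4A3B2C1D this produces the byte sequence 3B, 4A, 1D, 2C, which matches neither the big-endian nor the little-endian layout.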
Endianness also applies to the numbering of the bits within a byte or word. In a consistently big-endian architecture the bits are numbered from the left, so that in a byte bit zero is the most significant bit and bit 7 is the least significant bit. The favored bit endianness depends somewhat on where users expect the binary point to be located in a number. If the byte is taken to represent an integer, little-endian numbering seems most intuitive, because the bit number then equals the exponent of the bit's numeric weight (bit n has weight 2 to the power n). However, if the byte is taken to represent a binary fraction, with the binary point to the left of the most significant bit, then the big-endian numbering convention is more convenient.
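The two bit-numbering conventions for an 8-bit byte holding an integer can be sketched as below. The function names are invented for this illustration.

```c
#include <assert.h>

/* Little-endian bit numbering: bit 0 is the least significant bit,
   so bit n has the numeric weight 2^n. */
unsigned bit_weight_le(unsigned n)
{
    assert(n < 8);
    return 1u << n;
}

/* Big-endian bit numbering of a byte: bit 0 is the most significant
   bit, so bit n has the numeric weight 2^(7 - n). */
unsigned bit_weight_be(unsigned n)
{
    assert(n < 8);
    return 1u << (7u - n);
}
```

Under little-endian numbering bit 0 has weight 1 and bit 7 has weight 128; under big-endian numbering the weights are reversed.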
To summarize, here are the default endian-formats of some common computer architectures:
C function to check whether a system is big- or little-endian (it assumes that int is larger than char and cannot detect a middle-endian system):
 #define LITTLE_ENDIAN 0
 #define BIG_ENDIAN    1
 
 /* Returns LITTLE_ENDIAN or BIG_ENDIAN for the machine this runs on. */
 int machineEndianness(void)
 {
     int i = 1;
     char *p = (char *) &i;  /* view the int as individual bytes */
     if (p[0] == 1)          /* lowest address contains the least significant byte */
         return LITTLE_ENDIAN;
     else
         return BIG_ENDIAN;
 }
==Portability issues==
Endianness has serious implications for software portability. For example, when data stored in a binary format are read into memory and a bitmask is applied to them, the endianness matters: the same mask selects different bytes of the stored value on machines with different byte orders, so the results differ.
Writing binary data from software to a common format raises the same concern. For example, the Windows bitmap format requires little-endian integers; if the data are stored using big-endian integers, the file will be corrupt because it does not match the format.
The OPENSTEP operating system has software that swaps the bytes of integers and other C datatypes in order to preserve the correct endianness, since software running on OPENSTEP for PA-RISC is intended to be portable to OPENSTEP running on Mach/i386.
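One common way to avoid this problem is to write each integer byte by byte in the order the format requires, instead of copying it from memory. The following is a minimal sketch for 32-bit little-endian fields; put_le32 and get_le32 are hypothetical helper names, not functions of any bitmap library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes value into buf[0..3] in little-endian order,
   whatever the byte order of the host. */
void put_le32(unsigned char *buf, uint32_t value)
{
    assert(buf != NULL);
    buf[0] = (unsigned char)(value & 0xFFu);
    buf[1] = (unsigned char)((value >> 8) & 0xFFu);
    buf[2] = (unsigned char)((value >> 16) & 0xFFu);
    buf[3] = (unsigned char)((value >> 24) & 0xFFu);
}

/* Reads a little-endian 32-bit value back from buf[0..3]. */
uint32_t get_le32(const unsigned char *buf)
{
    assert(buf != NULL);
    return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
           ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
}
```

Because the shifts operate on the numeric value rather than on its memory image, the bytes written are the same on big-endian and little-endian hosts.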
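Byte swapping of this kind can be sketched in a few lines of C. The functions below are illustrative only and are not the routines shipped with OPENSTEP; their names are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Reverses the two bytes of a 16-bit value. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Reverses the four bytes of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    uint32_t r = ((v & 0x000000FFu) << 24) |
                 ((v & 0x0000FF00u) << 8)  |
                 ((v & 0x00FF0000u) >> 8)  |
                 ((v & 0xFF000000u) >> 24);
    assert(r != v || (v >> 24) == (v & 0xFFu)); /* sanity check on the result */
    return r;
}
```

Applying swap32 to 4A3B2C1D gives 1D2C3B4A, and applying it twice returns the original value.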
In Unicode, a byte order mark (BOM) of between 2 and 4 bytes, depending on the encoding, is sometimes used at the beginning of a string to denote its endianness.
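A BOM check might look like the sketch below. It relies on the standard BOM byte sequences (EF BB BF for UTF-8, FE FF and FF FE for UTF-16, 00 00 FE FF and FF FE 00 00 for UTF-32); the type bom_kind and the function detect_bom are names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE
} bom_kind;

/* Inspects the first bytes of a buffer for a byte order mark.
   UTF-32 is tested before UTF-16 because the UTF-32 little-endian
   mark (FF FE 00 00) begins with the UTF-16 little-endian mark. */
bom_kind detect_bom(const unsigned char *s, size_t len)
{
    static const unsigned char utf32be[4] = {0x00, 0x00, 0xFE, 0xFF};
    static const unsigned char utf32le[4] = {0xFF, 0xFE, 0x00, 0x00};
    static const unsigned char utf8[3]    = {0xEF, 0xBB, 0xBF};
    static const unsigned char utf16be[2] = {0xFE, 0xFF};
    static const unsigned char utf16le[2] = {0xFF, 0xFE};

    assert(s != NULL || len == 0);
    if (len >= 4 && memcmp(s, utf32be, 4) == 0) return BOM_UTF32_BE;
    if (len >= 4 && memcmp(s, utf32le, 4) == 0) return BOM_UTF32_LE;
    if (len >= 3 && memcmp(s, utf8, 3) == 0)    return BOM_UTF8;
    if (len >= 2 && memcmp(s, utf16be, 2) == 0) return BOM_UTF16_BE;
    if (len >= 2 && memcmp(s, utf16le, 2) == 0) return BOM_UTF16_LE;
    return BOM_NONE;
}
```

The check is a heuristic: a UTF-16 little-endian string whose first character after the BOM is U+0000 would be reported as UTF-32 little-endian, so a caller that already knows the encoding family should not rely on this function alone.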
= Endianness in communications =
In general, the NUXI problem is the problem of transferring data between computers with differing byte order. For example, the string UNIX, packed two bytes per 16-bit integer, might look like NUXI on a machine with a different byte sex. The problem was first discovered when an early version of Unix was ported from the PDP-11 (a middle-endian architecture) to an IBM Series 1 minicomputer (a big-endian architecture): on startup, the computer printed NUXI instead of UNIX.
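The effect can be reproduced by swapping the two bytes of every 16-bit unit in a character buffer. This is a minimal sketch; swap_byte_pairs is a hypothetical name chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Swaps the two bytes inside each 16-bit unit of buf.
   A trailing odd byte, if any, is left unchanged. */
void swap_byte_pairs(char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i + 1 < len; i += 2) {
        char tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}
```

Applied to the four characters of UNIX, the function turns U N into N U and I X into X I, producing NUXI.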
The Internet Protocol defines a standard big-endian network byte order, where binary values are in general encoded into packets, and sent out over the network, most significant byte first. This occurs regardless of the native endianness of the host CPU.
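Encoding fields in network byte order can be sketched as follows. The helper names put_be16 and put_be32 are assumptions made for this example; on POSIX systems the standard functions htons and htonl serve a similar purpose by converting host integers to network byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes a 16-bit value into buf[0..1], most significant byte first. */
void put_be16(unsigned char *buf, uint16_t value)
{
    assert(buf != NULL);
    buf[0] = (unsigned char)(value >> 8);
    buf[1] = (unsigned char)(value & 0xFFu);
}

/* Writes a 32-bit value into buf[0..3], most significant byte first. */
void put_be32(unsigned char *buf, uint32_t value)
{
    assert(buf != NULL);
    buf[0] = (unsigned char)((value >> 24) & 0xFFu);
    buf[1] = (unsigned char)((value >> 16) & 0xFFu);
    buf[2] = (unsigned char)((value >> 8) & 0xFFu);
    buf[3] = (unsigned char)(value & 0xFFu);
}
```

A sender on any host can fill a packet buffer with these helpers and obtain the same bytes on the wire, since the byte positions are computed from the numeric value and not from the host's memory layout.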
Serial communications devices also have bit-endianness: the bits in a byte can be sent little-endian (least significant bit first) or big-endian (most significant bit first). This decision is made at the very bottom of the data link layer of the OSI model.
= Endianness in date formats =
Endianness is simply illustrated by the different ways in which countries format calendar dates. For example, in the United States and a few other countries, dates are commonly formatted as Month; Day; Year (e.g. May 24th, 2006 or 5/24/2006). This is a middle-endian order.
In most of the world's countries, including all of Europe except Sweden, Latvia and Hungary, dates are formatted as Day; Month; Year (e.g. 24th May, 2006 or 24/5/2006 or 24/5-2006). This is little-endian.
China, Japan and the ISO 8601 international standard order dates as Year; Month; Day (e.g. 2006 May 24th or, more properly, 2006-05-24). This is big-endian.
The ISO 8601 ordering scheme lends itself to straightforward sorting of dates in lexicographical (dictionary) order. This means that the sorting algorithm does not need to treat the numeric parts of the date string any differently from a string of non-numeric characters, and the dates will still be sorted into chronological order. Note, however, that for this to work there must always be four digits for the year, two for the month and two for the day, so single-digit days, for example, must be padded with a zero, yielding 01, 02, ..., 09.
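The sorting property can be sketched with the C standard library's qsort and strcmp, which compare the date strings as plain text. The function name sort_iso_dates is an assumption made for this example, and the sketch presumes every string is a zero-padded date of the form YYYY-MM-DD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort callback: compares two strings character by character. */
static int compare_strings(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Sorts zero-padded YYYY-MM-DD strings; plain lexicographical
   order coincides with chronological order for this format. */
void sort_iso_dates(const char **dates, size_t count)
{
    assert(dates != NULL || count == 0);
    qsort(dates, count, sizeof dates[0], compare_strings);
}
```

No date parsing is involved: the most significant field comes first in the string, so a text comparison already decides on the year before the month and on the month before the day.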
= Discussion, background, etymology =
Big-endian numbers are easier to read when debugging a program. Some think they are less intuitive because the most significant byte is at the smaller address. Others think they are less confusing because the significance order is the same as the order of normal textual character strings in the computer, just as in non-computer text (see below). A person's preference is usually based on which convention was studied first and on which one the person's mental models were built.
== Origin of the term ==
The choice of big-endian vs. little-endian has been the subject of a lot of flame wars. Emphasizing the futility of this argument, the very terms big-endian and little-endian were taken from the Big-Endians and Little-Endians of Jonathan Swift's satirical novel Gulliver's Travels, where in Lilliput and Blefuscu Gulliver finds two groups of people in conflict over which end of an egg to crack.
See the [http://www.rdrop.com/~cary/html/endian_faq.html Endian FAQ], including the significant essay [http://www.rdrop.com/~cary/html/endian_faq.html#danny_cohen On Holy Wars and a Plea for Peace] by Danny Cohen (1980).
The written system of Arabic numerals is used worldwide and is such that the most significant digits are always written to the left of the less significant ones. In languages that write text left to right, this system is therefore big-endian. In languages that write right to left, this numeral system is also big-endian, because the number itself is a separate domain from the right-to-left language and must be read in its own order. To illustrate this point, if a number appears in text, whether the text is written left to right or right to left, a number too long to display on one line is broken so that the most significant digits are displayed on the first line.
The spoken numeral systems of German and Dutch are also mainly big-endian, except that the units are spoken before the tens: 376 is pronounced Dreihundertsechsundsiebzig and driehonderd zes en zeventig respectively, i.e. three hundred six-and-seventy.
Little-endian ordering has been used in compiling reverse dictionaries, where the entries begin, for example, with a, aa, baa, ... and end, for example, with ... buzz, abuzz, fuzz. An actual example is the pronouncing dictionary for Standard Cantonese (ISBN 9629485095) which begins with a, ba, da, dza, and ends with , tyt, tsyt, m, .
There seems to be some confusion about how the word endianness should be spelled. The two major variants are endianness and endianess, and some documents even contain both. While neither form appears in current (non-computing) dictionaries, the former follows the pattern of similar words such as barren and barrenness. Thus, endianness is generally more accepted and is used in this article.
= External links =
*[http://www.rdrop.com/~cary/html/endian_faq.html David Cary's Endian FAQ], including the paper On Holy Wars and a Plea for Peace by Danny Cohen, 1 April 1980
*[http://www.cs.princeton.edu/~kazad/resources/cs/endian.htm Kalid's Endian issues page at Princeton], with examples in the C programming language