On ASCII's Design
Today I sat in the library and studied the USA Standard Code for Information Interchange, USAS X3.4-1968 (USASCII).
The first part of the document, the actual standard, is pretty compact and can be read within half an hour.
What drew my attention was the appendix, which also takes up most of the pages.
There the authors explain some of the criteria, motivations and design choices that led to the arrangement of the ASCII table as we know it today.
8 columns, 16 rows
“In discussing the set structure it is convenient to divide the set into 8 columns and 16 rows as indicated in this standard.” Appendix A4.1
Why is that convenient? We can now visually create subsets.
Have a look at column 3. It represents a 4-bit subset with prefix 0b011 which includes all digits (besides some other characters).
        b7  0    0    0    0    1    1    1    1
  bit   b6  0    0    1    1    0    0    1    1
        b5  0    1    0    1    0    1    0    1
bbbb        --------------------------------------
4321   |r/c 0    1    2    3    4    5    6    7
       |    --------------------------------------
0b0000 | 0: NUL  DLE  SP   0    @    P    `    p
0b0001 | 1: SOH  DC1  !    1    A    Q    a    q
0b0010 | 2: STX  DC2  "    2    B    R    b    r
0b0011 | 3: ETX  DC3  #    3    C    S    c    s
0b0100 | 4: EOT  DC4  $    4    D    T    d    t
0b0101 | 5: ENQ  NAK  %    5    E    U    e    u
0b0110 | 6: ACK  SYN  &    6    F    V    f    v
0b0111 | 7: BEL  ETB  '    7    G    W    g    w
0b1000 | 8: BS   CAN  (    8    H    X    h    x
0b1001 | 9: HT   EM   )    9    I    Y    i    y
0b1010 | A: LF   SUB  *    :    J    Z    j    z
0b1011 | B: VT   ESC  +    ;    K    [    k    {
0b1100 | C: FF   FS   ,    <    L    \    l    |
0b1101 | D: CR   GS   -    =    M    ]    m    }
0b1110 | E: SO   RS   .    >    N    ^    n    ~
0b1111 | F: SI   US   /    ?    O    _    o    DEL
This 4-bit subset encodes each digit character (0 to 9) such that its four least significant bits are the binary representation of the digit itself.
So the character '5' is 0b0110101: the prefix 0b011 followed by 0b0101, which is 5 in the decimal system.
>>> # binary representation of the ASCII character '5'
>>> bin(ord('5'))
'0b110101'
>>> # integer value of its 4 least significant bits
>>> ord('5') & 0b1111
5
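The same shift-and-mask idea recovers the table position of any character. The following is only a minimal sketch; the helper name `column_row` is my own and does not come from the standard.

```python
def column_row(ch):
    """Return (column, row) of an ASCII character in the 8x16 table."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    # bits b7..b5 select the column, bits b4..b1 select the row
    return code >> 4, code & 0b1111

# every digit sits in column 3, and its row equals its value
digits_in_column_3 = all(column_row(d) == (3, int(d)) for d in "0123456789")

print(column_row("5"))      # (3, 5)
print(digits_in_column_3)   # True
```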
The existence of such a 4-bit encodable digit subset was an explicit design criterion of the authors. Likewise, there was a criterion for a 5-bit encodable (single-case) alphabet, which you can find in columns 4 and 5 (upper case) or 6 and 7 (lower case). Note that you can switch between an upper-case and a lower-case character by changing only one bit (b6).
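To illustrate the single-bit case switch and the 5-bit alphabet, here is a small sketch. The names `CASE_BIT`, `toggle_case` and `letter_index` are made up for this example.

```python
CASE_BIT = 0b0100000  # b6, the only bit that differs between 'A' and 'a'

def toggle_case(ch):
    """Flip the case of an ASCII letter by flipping b6; other characters pass through."""
    if not (ch.isascii() and ch.isalpha()):
        return ch
    return chr(ord(ch) ^ CASE_BIT)

def letter_index(ch):
    """Position of an ASCII letter in the alphabet (A=1 ... Z=26), read from its 5 low bits."""
    return ord(ch) & 0b11111

print(toggle_case("a"), toggle_case("Q"))     # A q
print(letter_index("A"), letter_index("z"))   # 1 26
```

Because the five low bits are identical for both cases, `letter_index` gives the same result for 'a' and 'A'.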
There are also the control character subset (columns 0 and 1) and the specials subset (column 2). In general, the first two columns are control characters. The other six are so-called ‘graphics’, i.e. visible characters.
Ordering
A strong design motivation is the logical ordering of the characters within a subset.
Simple binary comparison should provide this “natural” ordering: “a” < “b”, because 0x61 < 0x62.
While this only applies within a subset, the authors also tried to distribute the characters across the whole ASCII table such that some characters deliberately collate ahead of others.
Most prominent is the space character, which is the very first graphic in the table. Other special characters that are often used as a prefix or suffix are also placed early in the table, so that comparing strings yields “natural” results. The standard gives an example: “JOHNS” should collate ahead of “JOHNSON”, and “JOHNS,A” should collate before “JOHNSON”.
Furthermore, all special symbols should collate ahead of digits and letters. However, this design goal was not fully achieved (e.g. :, ;, ~, …).
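The collation behaviour can be checked by sorting on the raw codes. This is just a sketch: it uses Python's `string.punctuation` as a stand-in for the special symbols, which is my assumption and not the standard's own definition.

```python
import string

# sort by the raw ASCII codes, i.e. plain binary comparison
names = ["JOHNSON", "JOHNS,A", "JOHNS"]
by_code = sorted(names, key=lambda s: s.encode("ascii"))
print(by_code)   # ['JOHNS', 'JOHNS,A', 'JOHNSON']

# special symbols that do NOT collate ahead of the digits
late_specials = [c for c in string.punctuation if c > "9"]
print("".join(late_specials))   # :;<=>?@[\]^_`{|}~
```

The comma (0x2C) sorts before the letter O (0x4F), which is why “JOHNS,A” ends up ahead of “JOHNSON”.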
It is a bit unfortunate that the upper case alphabet collates before the lower case. I don’t know the reason behind this design decision.
Miscellaneous facts
- DEL is not a control character.
- SO (Shift Out), SI (Shift In) and SUB (Substitute) can be used to code characters outside ASCII.
- Bit is the contraction of binary digit.
- Ever wondered about the placement of special characters in the keyboard layout? Compare columns 2 and 3, e.g. 1 <-> !, 5 <-> %. Just one bit, set by the shift key!
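The keyboard observation can be reproduced by flipping bit b5, which is what separates the digits from the specials in the neighbouring column. A minimal sketch, with `SHIFT_BIT` and `shifted` being names I picked for illustration:

```python
SHIFT_BIT = 0b0010000  # b5, the bit that differs between '1' and '!'

def shifted(ch):
    """Return the character whose code differs from ch only in bit b5."""
    return chr(ord(ch) ^ SHIFT_BIT)

for digit in "12345":
    print(digit, "<->", shifted(digit))
# 1 <-> !
# 2 <-> "
# 3 <-> #
# 4 <-> $
# 5 <-> %
```

Not every pair survives on today's keyboards: a modern US layout, for instance, puts @ rather than " above the 2.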