Q: What are the disadvantages of Unicode?

Best Answer

Text encoded in Unicode requires more storage space than text encoded using another system that supports fewer languages.

Wiki User · 11y ago
More answers
Wiki User · 9y ago

ASCII (American Standard Code for Information Interchange) ensures that every system maps the same 7-bit ASCII code to the same symbol (known as a glyph) regardless of which code page is being used. That is, code 65 (0x41) always maps to the glyph representing the upper case letter 'A' in all code pages on all systems, the only practical difference being the typeface.
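By way of illustration, here is a minimal Python sketch of that fixed mapping (Python is simply a convenient demonstration language; any language will show the same thing):

    # ASCII guarantees code 65 (0x41) is 'A' on every system.
    print(ord('A'))       # 65
    print(hex(ord('A')))  # 0x41
    print(chr(0x41))      # A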

7-bit ASCII encoding allows a maximum of 128 symbols in the decimal range 0 to 127 (0x00 to 0x7F). However, the first 32 characters (0x00 to 0x1F) are control codes (non-printing characters) and the last (0x7F) is the delete (DEL) character, so there are really just 95 printable symbols. The Latin alphabet consumes 52 symbols (both upper and lower case) and the digits 0 through 9 take another ten, which leaves just 33 for everything else, including the space. The symbols that were finally chosen are those found in virtually every programming language today, including the arithmetic operators (plus, minus, multiply, divide and modulus), logic operators (not, and, or, xor and complement), punctuation (period, comma, semi-colon and colon), bracket pairs (parentheses, braces and brackets) and quotes (single and double). This left precious few codes to cater for the myriad symbols used in written language: there is no £ symbol, degree symbol or copyright symbol, never mind accented characters.
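A quick Python sketch can verify that layout (the counts, not any particular API, are the point here):

    # 33 non-printing codes (0x00-0x1F plus DEL) and 95 printable symbols.
    control = [c for c in range(128) if c < 0x20 or c == 0x7F]
    printable = [chr(c) for c in range(0x20, 0x7F)]  # space through '~'
    print(len(control), len(printable))  # 33 95
    print(''.join(printable))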

7-bit ASCII text is normally transmitted using an 8-bit byte, so the high-order bit (bit 7) is never used in ASCII-encoded text. When bit 7 is set, the encoding maps to a character within an extended character set, which provides an additional 128 symbols over and above the ASCII character set (256 in total). In order to interpret an extended character correctly, you have to use the same code page that was used to encode the symbol in the first place. However, even with 256 symbols at your disposal, this is still woefully inadequate to cater for every possible language: Chinese script alone has over 10,000 symbols!
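For example, a minimal Python sketch decoding the same high-bit byte under two different code pages (Windows-1252 versus the original IBM PC code page 437):

    # Bit 7 is set, so the meaning depends entirely on the code page.
    b = b'\xa3'
    print(b.decode('cp1252'))  # £ (Windows-1252)
    print(b.decode('cp437'))   # ú (IBM PC)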

In order to cater for all symbols in all languages worldwide we use Unicode, which was developed in conjunction with the Universal Character Set (UCS). The now-defunct UCS-2 used two 8-bit bytes to encode each character, but this only caters for a maximum of 65,536 unique symbols. Unicode's UTF-32 uses 4 bytes per character, so in principle it could address 4,294,967,296 values, more than enough to cater for every symbol or glyph in every language and every script worldwide. In practice, Unicode restricts the range to 0x000000 through 0x10FFFF, which caters for up to 1,114,112 code points. However, the code points are divided up into separate planes and blocks, so not all of them have a symbol assigned. There are in fact just over 110,000 assigned characters, but that's still more than enough to cater for every script and still leave room for future expansion.
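A short Python sketch of the fixed 4-byte cost and the upper limit (utf-32-le is chosen simply to avoid the byte-order mark):

    # Three code points of very different magnitude, 4 bytes each.
    s = 'A' + '\u20ac' + '\U00010348'   # A, euro sign, Gothic hwair
    print(len(s.encode('utf-32-le')))   # 12
    print(hex(ord('\U0010FFFF')))       # 0x10ffff, the highest valid code point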

The Unicode standard is maintained by the Unicode Consortium; the current version (at the time of writing) is Unicode 7.0.

Of course, the problem with UTF-32 is that every symbol consumes 32 bits, so converting an ASCII file to UTF-32 results in a file four times larger than the original. To cater for this, two variable-width encodings were introduced: UTF-16 and UTF-8. These are also known as multi-byte character sets (MBCS). UTF-8 is by far the most common encoding scheme in use today. As well as catering for 8-bit environments (which are by far the most common), Unicode also caters for 7-bit environments through UTF-7.
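A minimal Python comparison of the three encoding forms on pure ASCII text (again using the little-endian variants to exclude the byte-order mark):

    text = 'Hello'   # 5 ASCII characters
    for enc in ('utf-8', 'utf-16-le', 'utf-32-le'):
        print(enc, len(text.encode(enc)))   # 5, 10 and 20 bytes respectively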

Converting from ASCII to UTF-8 requires no conversion whatsoever; the two are byte-for-byte identical in the 7-bit range, so UTF-8 has no additional overhead compared to ASCII in terms of memory consumption. Thus UTF-8 is suitable for encoding programming language source code at no additional cost and allows the use of foreign-language literals within the code at minimal cost. UTF-8 is also used to encode markup files such as HTML and XML.
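This identity is easy to demonstrate with a one-line Python sketch:

    # ASCII text is byte-for-byte valid UTF-8.
    print('Hello, world!'.encode('ascii') == 'Hello, world!'.encode('utf-8'))  # True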

UTF-8 works by encoding each code point as a sequence of 1 to 4 bytes. Code points in the range 0x00 through 0x7F occupy a single byte (identical to ASCII), 0x80 through 0x7FF are two bytes long, 0x800 through 0xFFFF are three bytes long and 0x10000 through 0x10FFFF are four bytes long.
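A final Python sketch confirms those sequence lengths for sample code points in each range:

    # UTF-8 byte length per code point: 1, 2, 2, 3, 3, 4, 4.
    for cp in (0x41, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
        print(hex(cp), len(chr(cp).encode('utf-8')))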
