Which charset for Chinese




















Q: How can I recognize from the 32-bit value of a Unicode character whether it is a Chinese, Korean or Japanese character? A: It's basically impossible and largely meaningless. It's the equivalent of asking if "a" is an English letter or a French one. For some characters one can guess, based on the source information in the Unihan Database, that they are traditional Chinese, simplified Chinese, Japanese, Korean, or Vietnamese, but there are too many exceptions to make this really reliable.

The phonetic data in the Unihan Database should not be used for this purpose. A blank in the phonetic data means that nobody's supplied a reading, not that a reading doesn't exist. Because updating the Unihan Database is an ongoing process, these fields will be increasingly filled out as time goes on, but they should never be taken as absolutely complete.

In particular, there are obscure characters where it is known that there is a reading, but since the character does not occur in standard dictionaries, we are unable to supply it.

A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean. The only proper mechanism, as for determining whether "chat" is spelled correctly in English or French, is to use a higher-level protocol. This is a complicated question. For answers, see How are Chinese characters input?
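The whole-text heuristic described above (kana suggests Japanese, hangul suggests Korean) can be sketched in a few lines of Python. This is only an illustration: the function name `guess_cjk_language`, the 5% threshold, and the choice of code point ranges are assumptions made for this sketch, not anything defined by Unicode, and falling back to "zh" when only Han characters are present is itself a guess.

```python
def guess_cjk_language(text, threshold=0.05):
    """Guess 'ja', 'ko', 'zh' or None from the mix of scripts in text.

    Hypothetical helper: the threshold and ranges are illustrative only.
    """
    kana = hangul = han = 0
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:
            # Hiragana and Katakana blocks
            kana += 1
        elif 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
            # Hangul syllables and Hangul Jamo
            hangul += 1
        elif 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
            # CJK Unified Ideographs and Extension A
            han += 1
    total = kana + hangul + han
    if total == 0:
        return None  # no CJK text at all
    if kana / total >= threshold:
        return "ja"
    if hangul / total >= threshold:
        return "ko"
    return "zh"  # Han only: probably Chinese, but this remains a guess


print(guess_cjk_language("これは日本語のテキストです"))
print(guess_cjk_language("한국어 텍스트입니다"))
print(guess_cjk_language("这是中文文本"))
```

A short Japanese text written entirely in kanji would be reported as "zh" by this sketch, which is exactly why a higher-level protocol (language tagging) is the only reliable mechanism.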

Q: Why is Unicode missing some characters from the Big Five character set? A: The "Big Five" character set is an industrial standard commonly used for traditional Chinese. There are, however, several versions of the Big Five in common use, generally representing extensions of the formal standard. The initial, un-extended Big Five was the standard version of the character set at the time that the Unicode Standard, Version 1. This is reflected in the data files supplied by the Unicode Consortium.

Some vendors provide vendor-specific tables showing mapping data for their custom Big Five extensions and Unicode. The Unicode Consortium does not, however, provide data on every known dialect of the Big Five, so it is possible that a particular dialect of the Big Five is not included in the tables provided by Unicode.

Q: I hear that certain characters from the GB encoding are not mapped to any code points in Unicode, and need to be mapped to characters in the Private Use Area instead. Is this true? And if so, is the issue being dealt with in the near future? A: That used to be true, as of Unicode 4. However, to avoid having to map characters to the PUA for support of GB, the missing characters were added as of Unicode 4.

You can find the characters in question in Annex C p. All now have regular Unicode characters. Q: Isn't it true that some Japanese can't write their own names in Unicode? A: There are some situations where an individual prefers their name be written with a specific glyph, as in the West we have John and Jon, Mark and Marc, Cathy and Kathy. In most cases, variation sequences in the UTS 37 Unicode Ideographic Variation Database can be used to provide the required representation in plain text.

In other cases, the variant forms have been encoded in Unicode as distinct characters. The IRG also may consider where the encoding of new variant characters is justified. It should be noted that this is not a problem of Han unification per se, as it is often represented. Unicode is a superset of the major Japanese character encoding standards. A: The Unihan database covers only the ideographs in the Unicode Standard.

EACC also includes characters such as Japanese kana and Korean hangul that are outside the scope of the Unihan database. It was established in January , then revised in February. It enumerates 11, characters, which extends the 4, characters of the JIS X standard.

It consists of 10, Kanji (ideographic) characters and 1, non-Kanji (non-ideographic) characters. These characters are arranged in two planes of a row-by-cell matrix. The language and character set names will appear under Character Set or Encoding in the View menu of your browser even though the fonts have not been downloaded. See an example page with the Traditional Chinese Big5 character set. This should work if you have downloaded the character set and selected it in preferences.

Simplified characters are now used in China and Singapore. Traditional characters are used in Taiwan, Hong Kong, and most overseas communities. ISO is the universal multi-octet character set defined by ISO; we feel that in the future it may become the preferred technology for Chinese documents and electronic mail when it is widely available. Specification 1. Designations define the Chinese character sets used in the text. A designation overrides any previous designation for subsequent bytes in the text.

Shift functions specify how to interpret the subsequent bytes. Example: the hex sequence 1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f represents the Chinese word for "Interchange" (jiao huan) twice: the first time in simplified form using GB (the 3d 3b 3b 3b sequence above), and the second time in traditional form using CNS (the 47 28 5f 50 sequence above).
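To make the roles of designations and shift functions concrete, here is a rough Python sketch that walks through the example byte sequence and labels each two-byte character with the set that was in effect. It is a simplified illustration, not a conforming decoder: it knows only the two designations that occur in the example, it ignores line-end resets and single-shift functions, and the character set labels in the table are my reading of the escape sequences ESC $ ) A and ESC $ ) G rather than text taken from this page. The name `tokenize` is hypothetical.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Only the designations used in the example; labels are assumptions.
DESIGNATIONS = {
    bytes([0x1B, 0x24, 0x29, 0x41]): "GB 2312",            # ESC $ ) A
    bytes([0x1B, 0x24, 0x29, 0x47]): "CNS 11643 plane 1",  # ESC $ ) G
}


def tokenize(data):
    """Return (charset, hex-or-char) pairs for a 7-bit shifted byte stream."""
    tokens = []
    charset = None    # set by the most recent designation
    shifted = False   # True between SO and SI
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            seq = data[i:i + 4]
            if seq not in DESIGNATIONS:
                raise ValueError("unsupported escape sequence: " + seq.hex())
            # A designation overrides any previous designation.
            charset = DESIGNATIONS[seq]
            i += 4
        elif b == SO:
            shifted = True
            i += 1
        elif b == SI:
            shifted = False
            i += 1
        elif shifted:
            if charset is None or i + 1 >= len(data):
                raise ValueError("malformed two-byte character at offset %d" % i)
            tokens.append((charset, data[i:i + 2].hex()))
            i += 2
        else:
            tokens.append(("ASCII", chr(b)))
            i += 1
    return tokens


example = bytes.fromhex("1b 24 29 41 0e 3d 3b 3b 3b 1b 24 29 47 47 28 5f 50 0f")
for charset, value in tokenize(example):
    print(charset, value)
```

Note that the second designation arrives while the stream is still shifted out, so the bytes 47 28 5f 50 are interpreted in the newly designated set without a second SO.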

These GB sets shall only be used once these final characters are assigned. This name is intended to be used as the "charset" parameter in MIME messages.

By the "common part" we mean the part that is not specific to any Big5 vendor, consisting of more frequently used characters in Big5 range 0xAxC67E, less frequently used characters in Big5 range 0xCxF9D5, and other symbols in Big5 range 0xAxA3E0, as defined in Institute for Information Industry's III technical report C see also [Big5]. The appendix of this document presents a conversion table for converting Big5 into CNS, including specific extensions of some popular vendors.

Public domain software binary or C source code for conversion between Big5 and CNS is available on many Internet sites. Otherwise, an 8-bit message that passes through a 7-bit mailer is likely to have the 8th bit truncated, resulting in an unreadable message. Although "just send 8-bit data" has been common practice in the past, it is incorrect according to the Internet standards and causes interoperability problems.

If the character is from GB , the MSB bit-8 of each byte is set to 1, and therefore becomes a 8-bit character. This constructs a character set named "GB Internal Code". This method is also adopted in the. There are also character sets that can only be used with other GB sets. Note: There are some supplementary character sets in GB, i.
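The bit-8 convention described above, and the damage a 7-bit mailer does to it, can be illustrated with a small Python sketch. It assumes that Python's built-in "gb2312" codec implements this 8-bit form (the codec is commonly described as EUC-CN); the helper names `to_internal_code` and `strip_high_bit` are made up for this example.

```python
def to_internal_code(seven_bit: bytes) -> bytes:
    """Set the MSB (bit 8) of every byte of a 7-bit two-byte sequence."""
    return bytes(b | 0x80 for b in seven_bit)


def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit transport that truncates bit 8 of every byte."""
    return bytes(b & 0x7F for b in data)


# The 7-bit pairs 3d 3b and 3b 3b from the "jiao huan" example.
seven_bit = bytes.fromhex("3d3b3b3b")

internal = to_internal_code(seven_bit)
print(internal.hex())            # bdbbbbbb

# Decode the 8-bit form with the standard-library codec (assumption above).
text = internal.decode("gb2312")
print(text)

# What a receiver sees after a 7-bit mailer has dropped the high bits:
print(strip_high_bit(text.encode("gb2312")))   # b'=;;;'
```

Once the high bit is gone, the receiver cannot tell these bytes apart from ordinary ASCII punctuation, which is why 8-bit text needs a transfer encoding or a 7-bit scheme with explicit shift functions.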

Normally, they won't be used independently without using GB or GB, so they do not necessarily need to be registered.



