Base source code collection: The New C Standard document, P4 (ppt)

C90
This definition is new in C99.
229
in this International Standard the term does not include other characters that are letters in other alphabets.
Commentary
All implementations are required to support the basic source character set to which this terminology applies.
Annex D lists those universal character names that can appear in identifiers. However, they are not referred
to as letters (although they may well be regarded as such in their native language).
The term letter assumes that the orthography (writing system) of a language has an alphabet. Some
orthographies, for instance Japanese, don't have an alphabet as such (let alone the concept of upper- and
lowercase letters). Even when the orthography of a language does include characters that are considered
to be matching upper- and lowercase letters by speakers of that language (e.g., æ and Æ, å and Å), the C
Standard does not define these characters to be letters.
C++
The definition used in the C++ Standard, 17.3.2.1.3 (the footnote applies to C90 only), implies this is also
true in C++.
Coding Guidelines
The term letter has a common usage meaning in a number of different languages. Developers do not often
use this term in its C Standard sense. Perhaps the safest approach for coding guideline documents to take is
to avoid use of this term completely.
230
The universal character name construct provides a way to name other characters.
Commentary
In theory, all characters on planet Earth and beyond. In practice, those defined in ISO 10646.
C90
Support for universal character names is new in C99.
Other Languages
Other language standards are slowly moving to support ISO 10646. Java supports a similar concept.
Common Implementations
Support for these characters is relatively new. It will take time before similarities between implementations
become apparent.
231
Forward references: universal character names (6.4.3), character constants (6.4.4.4), preprocessing
directives (6.10), string literals (6.4.5), comments (6.4.9), string (7.1.1).
5.2.1.1 Trigraph sequences
232
Before any other processing takes place, each occurrence of one of the following sequences of three
characters (called trigraph sequences; see footnote 12) is replaced with the corresponding single character.
Commentary
Trigraphs were an invention of the C committee. They are a method of supporting the input (into source files,
not executing programs) and the printing of some C source characters in countries whose alphabets, and
keyboards, do not include them in their national character set. Digraphs, discussed elsewhere, are another
sequence of characters that are replaced by a corresponding single character.
The \? escape sequence was introduced to allow sequences of ?s to occur within string literals.
The wording was changed by the response to DR #309.
June 24, 2009 v 1.2
Other Languages
Until recently many computer languages did not attempt to be as worldly as C, requiring what might be called
an Ascii keyboard. Pascal speciﬁes what it calls lexical alternatives for some lexical tokens. The character
sequences making up these lexical alternatives are only recognized in a context where they can form a single,
complete token.
Common Implementations
On the Apple Macintosh host, the notation '????' is used to denote the unknown file type. Translators in
this environment often disable trigraphs by default to prevent unintended replacements from occurring.
233
??=  #     ??)  ]     ??!  |
??(  [     ??'  ^     ??>  }
??/  \     ??<  {     ??-  ~
Commentary
The above sequences were chosen to minimize the likelihood of breaking any existing, conforming, C source
code.
Other Languages
Many languages use a small subset, or none, of these problematic source characters, reducing the potential
severity of the problem. The Pascal standard specifies (. and .) as alternative lexical representations of
[ and ] respectively.
Common Implementations
Recognizing trigraph sequences entails a check against every character read in by the translator. Performance
profiling of translators has shown that a large percentage of time is spent in the lexer. A study by Waite [1469]
found that 41% of total translation time was spent in a handcrafted lexer (with little code optimization performed
by the translator). An automatically produced lexer (the lex tool was used) consumed 3 to 5 times as much time.
One vendor, Borland, who used to take pride in, and was known for, the speed at which their translators
operated, did not include trigraph processing in the main translator program. A stand-alone utility was
provided to perform trigraph processing. Those few programs that used trigraphs needed to be processed by
this utility, generating a temporary file that was processed by the main translator program. While using this
pre-preprocessor was a large overhead for programs that used trigraphs, performance was not degraded for
source code that did not contain them.
Usage
There are insufficient trigraphs in the visible form of the .c files to enable any meaningful analysis of the
usage of different trigraphs to be made.
234
No other trigraph sequences exist.
Commentary
The set of characters for which trigraphs were created to provide an alternative spelling is known, and
unlikely to be extended.
Coding Guidelines
Although no other trigraph sequences exist, sequences of two adjacent question marks in string literals
may lead to confusion. Developers may be unsure about whether they represent a trigraph or not. Using the
escape sequence \? on at least one of these question marks can help clarify the intent.
Example
char *unknown_trigraph = "??++";
char *cannot_be_trigraph = "?\? ";
Usage
The visible form of the .c files contained 593 (.h 10) instances of two question marks (i.e., ??) in string
literals that were not followed by a character that would have created a trigraph sequence.
235
Each ? that does not begin one of the trigraphs listed above is not changed.
Commentary
Two ?s followed by any character other than those listed above do not form a trigraph.
Common Implementations
No implementation is known to deﬁne any other sequence of ?s to be replaced by other characters.
Coding Guidelines
No other trigraph sequences are defined by the standard, have been proposed for future addition to the standard,
or are used in known implementations. Placing restrictions on other sequences of ?s provides no benefit.
236
EXAMPLE 1
??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
becomes
#define arraycheck(a,b) a[b] || b[a]
Commentary
This example was added by the response to DR #310 and is intended to show a common trigraph usage.
237
EXAMPLE 2 The following source line
printf("Eh???/n");
becomes (after replacement of the trigraph sequence ??/)
printf("Eh?\n");
Commentary
This illustrates the sometimes surprising consequences of trigraph processing.
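The replacement described above can be sketched as a small function. This is only an illustration of the mapping, not how any particular translator implements it; the names replace_trigraphs, trigraph_third, and trigraph_repl are invented for this sketch. The table is split into two strings so that the sketch itself never contains a trigraph.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Third characters of the nine trigraphs and, at the same index, the
   single character each trigraph stands for (hypothetical names). */
static const char trigraph_third[] = "=()/'<>!-";
static const char trigraph_repl[]  = "#[]\\^{}|~";

/* Copies src to dst, replacing every trigraph by its single character.
   dst must have room for strlen(src) + 1 bytes.
   Returns the length of the result. */
size_t replace_trigraphs(const char *src, char *dst)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        /* Two question marks followed by one of the nine characters. */
        if (src[0] == '?' && src[1] == '?' && src[2] != '\0') {
            const char *hit = strchr(trigraph_third, src[2]);
            if (hit != NULL) {
                dst[out++] = trigraph_repl[hit - trigraph_third];
                src += 3;
                continue;
            }
        }
        /* Any other character, including a lone question mark, is
           copied unchanged. */
        dst[out++] = *src++;
    }
    dst[out] = '\0';
    return out;
}
```

Applied to the text of EXAMPLE 2, the first of the three question marks is copied unchanged and the remaining two, together with the /, become a backslash. A caller has to write test input with \? escapes to stop the translator replacing the trigraphs first.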
5.2.1.2 Multibyte characters
238
The source character set may contain multibyte characters, used to represent members of the extended
character set.
Commentary
The mapping from physical source file multibyte characters to the source character set occurs in translation
phase 1. Whether multibyte characters are mapped to UCNs, single characters (if possible), or remain as
multibyte characters depends on the model used by the implementation.
C++
The representations used for multibyte characters, in source code, invariably involve at least one character
that is not in the basic source character set:
2.1p1
Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name
that designates that character.
The C++ Standard does not discuss the issue of a translator having to process multibyte characters during
translation. However, implementations may choose to replace such characters with a corresponding
universal-character-name.
Other Languages
Most programming languages do not contain the concept of multibyte characters.
Common Implementations
Support for multibyte characters in identifiers, using a shift state encoding, is sometimes seen as an
extension. Support for multibyte characters in this context using UCNs is new in C99. The most common
implementations have been created to support the various Japanese character sets.
Coding Guidelines
The standard does not deﬁne how multibyte characters are to be represented. Any program that contains
them is dependent on a particular implementation to do the right thing. Converting programs that existed
before support for universal character names became available may not be economically viable.
Some coding guideline documents recommend against the use of characters that are not speciﬁed in the C
Standard. Simply prohibiting multibyte characters because they rely on implementation-deﬁned behavior
ignores the cost/beneﬁt issues applicable to the developers who need to read the source. These are complex
issues for which your author has insufﬁcient experience with which to frame any applicable guideline
recommendations.
239
The execution character set may also contain multibyte characters, which need not have the same encoding
as for the source character set.
Commentary
Multibyte characters could be read from a ﬁle during program execution, or even created by assigning byte
values to contiguous array elements. These multibyte sequences could then be interpreted by various library
functions as representing certain (wide) characters.
The execution character set need not be fixed at translation time. A program's locale can be changed
at execution time (by a call to the setlocale function). Such a change of locale can alter how multibyte
characters are interpreted by a library function.
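As a rough illustration of the dependence on locale, the following sketch counts the multibyte characters in a string using the standard mbrlen function, which interprets the bytes according to the current LC_CTYPE category. The function name count_mb_chars is invented for this sketch. Which locale names are accepted by setlocale, other than "C" and "", is implementation-defined, so no particular name is assumed here.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Counts the multibyte characters in s, as interpreted by the current
   LC_CTYPE locale. Returns (size_t)-1 if s contains a sequence that is
   invalid or incomplete in that locale. (Hypothetical helper.) */
size_t count_mb_chars(const char *s)
{
    mbstate_t state;
    size_t remaining;
    size_t count = 0;

    assert(s != NULL);
    memset(&state, 0, sizeof state);   /* begin in the initial shift state */
    remaining = strlen(s);

    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;         /* invalid or incomplete sequence */
        if (n == 0)
            break;                     /* null character reached */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

The same sequence of bytes may give a different count, or be rejected, after a call to setlocale has selected a different locale.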
C++
There is no explicit statement about such behavior being permitted in the C++ Standard. The C header
<wchar.h> (specified in Amendment 1 to C90) is included by reference and so the support it defines for
multibyte characters needs to be provided by C++ implementations.
Other Languages
Most languages do not include library functions for handling multibyte characters.
Coding Guidelines
Use of multibyte characters during program execution is an applications issue that is outside the scope of
these coding guidelines.
240
For both character sets, the following shall hold:
Commentary
This is a set of requirements that applies to an implementation. It is the minimum set of guaranteed
requirements that a program can rely on.
Coding Guidelines
The set of requirements listed in the following C-sentences is fairly general. Dealing with implementations
that do not meet the requirements listed in these sentences is outside the scope of these coding guidelines.
241
— The basic character set shall be present and each character shall be encoded as a single byte.
Commentary
This is a requirement on the implementation. It prevents an implementation from being purely multibyte-based.
The members of the basic character set are guaranteed to always be available and fit in a byte.
Common Implementations
An implementation that includes support for an extended character set might choose to define CHAR_BIT to
be 16 (most of the commonly used characters in ISO 10646 are each representable in 16 bits in UTF-16; at
least those likely to be encountered outside of academic research and the traditional Chinese written in Hong
Kong). Alternatively, an implementation may use an encoding where the members of the basic character set
are representable in a byte, but some members of the extended character set require more than one byte for
their encoding. One such representation is UTF-8.
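The way UTF-8 keeps the basic character set in one byte while using more bytes for other characters can be seen in a minimal encoder. This is a sketch of the encoding rule only; the function name utf8_encode is invented here, and surrogate code points are not rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Encodes the code point cp as UTF-8 into out, which must have room for
   4 bytes. Returns the number of bytes written, or 0 if cp is greater
   than 0x10FFFF. Surrogates are not checked in this sketch. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* one byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {                   /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Every member of the basic character set has a code point below 0x80 in ISO 10646 and so takes the one-byte path; LATIN SMALL LETTER O WITH CIRCUMFLEX (U+00F4) needs two bytes and the hiragana letter KA (U+304B) needs three.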
242
— The presence, meaning, and representation of any additional members is locale-speciﬁc.
Commentary
On program startup the execution locale is the "C" locale. During execution it can be set under program
control. The standard is silent on what the translation time locale might be.
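A program can query the current locale without changing it by passing a null pointer to setlocale. The following sketch (the function name in_c_locale is invented here) checks for the startup state described above.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Returns 1 if the current locale, for all categories, is the "C"
   locale, and 0 otherwise. (Hypothetical helper.) */
int in_c_locale(void)
{
    /* A null second argument queries the locale without changing it. */
    const char *name = setlocale(LC_ALL, NULL);

    assert(name != NULL);
    return strcmp(name, "C") == 0;
}
```

Called before any other use of setlocale, this is expected to return 1. After a successful call such as setlocale(LC_ALL, ""), the result depends on the host environment.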
Common Implementations
The full Ascii character set is used by a large number of implementations.
Coding Guidelines
It often comes as a surprise to developers to learn what characters the C Standard does not require to be
provided by an implementation. Source code readability could be affected if any of these additional members
appear within comments and cannot be meaningfully displayed. Balancing the beneﬁts of using additional
members against the likelihood of not being able to display them is a management issue.
The use of any additional members during the execution of a program will be driven by the user require-
ments of the application. This issue is outside the scope of these coding guidelines.
243
— A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte
characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte
characters are encountered in the sequence.
Commentary
State-dependent encodings are essentially finite state machines. When a state-dependent encoding, or any
multibyte encoding, is being used, the number of characters in a string literal is not the same as the number of bytes
encountered before the null character. There is no requirement that the sequence of shift states and characters
representing an extended character be unique.
There are situations where the visual appearance of two or more characters is considered to be a single
character. For instance (using ISO 10646 as the example encoding), the two characters LATIN SMALL
LETTER O (U+006F) followed by COMBINING CIRCUMFLEX ACCENT (U+0302) represent the grapheme
cluster (the ISO 10646 term [334] for what might be considered a user character) ô, not the two characters
o ^. Some languages use grapheme clusters that require more than one combining character, for instance
ô̄. Unicode (not ISO 10646) defines a canonical accent ordering to handle sequences of these combining
characters. The so-called combining characters are defined to combine with the character that comes
immediately before them in the character stream. For backwards compatibility with other character encodings,
and ease of conversion, the ISO 10646 Standard provides explicit codes for some accent characters; for
instance, LATIN SMALL LETTER O WITH CIRCUMFLEX (U+00F4) also denotes ô.
A character that is capable of standing alone, the o above, is known as a base character. A character that
modifies a base character, the ^ above, is known as a combining character (the visible forms of some combining
characters are called diacritic characters). Most character encodings do not contain any combining characters,
and those that do contain them rarely specify whether they should occur before or after the modified base
character. Claims that a particular standard requires the combining character to occur before the base character
it modifies may be based on a misunderstanding. For instance, ISO/IEC 6937 specifies a single-byte
encoding for base characters and a double-byte encoding for some visual combinations of (diacritic + base)
Latin letters. These double-byte encodings are precomposed in the sense that they represent a single character;
there is no single-byte encoding for the diacritic character, and the representation of the second byte happens
to be the same as that of the single-byte representation of the corresponding base character (e.g., 0xC14F
represents LATIN CAPITAL LETTER O WITH GRAVE and 0xC16F represents LATIN SMALL LETTER O
WITH GRAVE).
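The difference between the precomposed and the combining representation of ô can be made concrete by counting code points. The sketch below assumes UTF-8 as the encoding (one of several possible encodings of ISO 10646) and well-formed input; the function name utf8_code_points is invented here.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the code points in a null-terminated UTF-8 string by counting
   every byte that is not a continuation byte (bit pattern 10xxxxxx).
   The input is assumed to be well-formed; nothing is validated. */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

The precomposed form U+00F4 is the two bytes 0xC3 0xB4 and counts as one code point. The base character o followed by U+0302 is the three bytes 0x6F 0xCC 0x82 and counts as two code points, although both forms denote the same user character.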
C90
The C90 Standard specified implementation-defined shift states rather than locale-specific shift states.
C++
The definition of multibyte character, 1.3.8, says nothing about encoding issues (other than that more than
one byte may be used). The definition of multibyte strings, 17.3.2.1.3.2, requires the multibyte characters to
begin and end in the initial shift state.
Common Implementations
Most methods for state-dependent encoding are based on ISO/IEC 2022:1994 (identical to the standard
ECMA-35 "Character Code Structure and Extension Techniques", freely available from their Web site,
http://www.ecma.ch). This uses a different structure than that specified in ISO/IEC 10646–1. The
encoding method defined by ISO 2022 supports both 7-bit and 8-bit codes. It divides these codes up into
control characters (known as C0 and C1) and graphics characters (known as G0, G1, G2, and G3). In the
initial shift state the C0 and G0 characters are in effect.
Table 243.1: Commonly seen ISO 2022 Control Characters. The alternative values for SS2 and SS3 are only
available for 8-bit codes.

Name             Acronym  Code Value         Meaning
Escape           ESC      0x1b               Escape
Shift-In         SI       0x0f               Shift to the G0 set
Shift-Out        SO       0x0e               Shift to the G1 set
Locking-Shift 2  LS2      ESC 0x6e           Shift to the G2 set
Locking-Shift 3  LS3      ESC 0x6f           Shift to the G3 set
Single-Shift 2   SS2      ESC 0x4e, or 0x8e  Next character only is in G2
Single-Shift 3   SS3      ESC 0x4f, or 0x8f  Next character only is in G3
Some of the control codes and their values are listed in Table 243.1. The codes SI, SO, LS2, and LS3 are
known as locking shifts. They cause a change of state that lasts until the next control code is encountered. A
stream that uses locking shifts is said to use stateful encoding.
ISO 2022 specifies an encoding method: it does not specify what the values within the range used for
graphic characters represent. This role is filled by other standards, such as ISO 8859. A C implementation
that supports a state-dependent encoding chooses which character sets are available in each state that it
supports (the C Standard only defines the character set for the initial shift state).
Table 243.2: An implementation where G1 is ISO 8859–1, and G2 is ISO 8859–7 (Greek).

Encoded values     0x61  0x62  0x63  0x0e  0xe6  0x1b 0x6e  0xe1  0xe2  0xe3  0x0f
Control character                    SO          LS2                          SI
Graphic character  a     b     c           æ                α     β     γ
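The locking shifts of Table 243.1 can be modeled as a small finite state machine. The following is a sketch under stated assumptions: only SI, SO, LS2, and LS3 are recognized, single shifts and character set designations are ignored, and every other byte is treated as a graphic byte. The names active_sets and the enumeration constants are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

enum { ESC = 0x1b, SI = 0x0f, SO = 0x0e };

/* Walks a byte stream and stores, for each graphic byte, the number
   (0 to 3) of the G set in effect when it is read. sets must have room
   for len entries. Returns the number of graphic bytes seen. */
size_t active_sets(const unsigned char *bytes, size_t len, int *sets)
{
    int state = 0;                 /* initial shift state: G0 */
    size_t i = 0;
    size_t graphic = 0;

    assert(bytes != NULL && sets != NULL);
    while (i < len) {
        unsigned char b = bytes[i];

        if (b == SI) {                     /* shift to G0 */
            state = 0;
            i++;
        } else if (b == SO) {              /* shift to G1 */
            state = 1;
            i++;
        } else if (b == ESC && i + 1 < len && bytes[i + 1] == 0x6e) {
            state = 2;                     /* LS2: shift to G2 */
            i += 2;
        } else if (b == ESC && i + 1 < len && bytes[i + 1] == 0x6f) {
            state = 3;                     /* LS3: shift to G3 */
            i += 2;
        } else {
            sets[graphic++] = state;       /* graphic byte: record set */
            i++;
        }
    }
    return graphic;
}
```

For the byte sequence 0x61 0x62 0x63 SO 0xe6 LS2 0xe1 0xe2 0xe3 SI the recorded sets are 0, 0, 0, 1, 2, 2, 2, which is the information a decoder needs to select the character set used to interpret each graphic byte.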
Having to rely on implicit knowledge of what character set is intended to be used for G1, G2, and so on, is
not always satisfactory. A method of specifying the character sets in the sequence of bytes is needed. The
ESC control code provides this functionality by using two or more following bytes to specify the character
set (ISO maintains a registry of coded character sets). It is possible to change between character sets without
any intervening characters. Table 243.3 lists some of the commonly used Japanese character sets.
C source code written by Japanese developers probably has the highest usage of shift sequences. There are
several JIS (Japanese Industrial Standard) documents specifying representations for such sequences. Shift
JIS (developed by Microsoft) belies its name and does not involve shift sequences that use a state-dependent
encoding.
Table 243.3: ESC codes for some of the character sets used in Japanese.

Character Set        Byte Encoding      Visible Ascii Representation
JIS C 6226–1978      1B 24 40           <ESC> $ @
JIS X 0208–1983      1B 24 42           <ESC> $ B
JIS X 0208–1990      1B 26 40 1B 24 42  <ESC> & @ <ESC> $ B
JIS X 0212–1990      1B 24 28 44        <ESC> $ ( D
JIS-Roman            1B 28 4A           <ESC> ( J
Ascii                1B 28 42           <ESC> ( B
Half width Katakana  1B 28 49           <ESC> ( I
Table 243.4: A JIS encoding of the character sequence かな漢字 ("kana and kanji").

Encoded values     0x1b 0x24 0x42  0x242b  0x244a  0x3441  0x3b7a  0x1b 0x28 0x4a
Control character  <ESC> $ B                                       <ESC> ( J
Graphic character                  か      な      漢      字
Ascii characters                   $+      $J      4A      ;z
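The rows of Table 243.3 can be held in a lookup table that maps an escape sequence to the character set it designates. This is an illustrative sketch only; the structure and function names (designation, designations, designated_set) are invented here, and only the seven sequences of the table are known to it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct designation {
    const char *name;          /* character set name, as in Table 243.3 */
    unsigned char bytes[6];    /* escape sequence */
    size_t len;                /* number of bytes in the sequence */
};

static const struct designation designations[] = {
    { "JIS C 6226-1978",     { 0x1b, 0x24, 0x40 }, 3 },
    { "JIS X 0208-1983",     { 0x1b, 0x24, 0x42 }, 3 },
    { "JIS X 0208-1990",     { 0x1b, 0x26, 0x40, 0x1b, 0x24, 0x42 }, 6 },
    { "JIS X 0212-1990",     { 0x1b, 0x24, 0x28, 0x44 }, 4 },
    { "JIS-Roman",           { 0x1b, 0x28, 0x4a }, 3 },
    { "Ascii",               { 0x1b, 0x28, 0x42 }, 3 },
    { "Half width Katakana", { 0x1b, 0x28, 0x49 }, 3 },
};

/* Returns the name of the character set designated by the escape
   sequence at the start of p (n bytes are available), or NULL if the
   bytes do not start with any sequence in the table. */
const char *designated_set(const unsigned char *p, size_t n)
{
    size_t i;

    assert(p != NULL);
    for (i = 0; i < sizeof designations / sizeof designations[0]; i++) {
        const struct designation *d = &designations[i];

        if (d->len <= n && memcmp(p, d->bytes, d->len) == 0)
            return d->name;
    }
    return NULL;
}
```

Given the first bytes of the sequence in Table 243.4 the function returns "JIS X 0208-1983", and given its last three bytes it returns "JIS-Roman".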
Coding Guidelines
Developers do not need to remember the numerical values for extended characters. The editor, or program
development environment, used to create the source code invariably looks after the details (generating any
escape sequences and the appropriate byte values for the extended character selected by the developer). How
these tools decide to encode multibyte character sequences is outside the scope of these coding guidelines.
It is usually possible to express an extended character in a minimal number of bytes using a particular
state-dependent encoding. The extent to which developers might create fixed-length data structures on the
assumption that multibyte characters will not contain any redundant shift sequences is outside the scope of
this book. The value of the MB_LEN_MAX macro places an upper limit on the number of possible redundant
shift sequences.
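A common use of MB_LEN_MAX is to size a buffer that is guaranteed to hold any single multibyte character, whatever shift sequences the encoding attaches to it. The sketch below uses the standard wcrtomb function; the wrapper name wide_to_mb is invented here.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Converts one wide character to its multibyte form in the current
   locale. buf must have room for MB_LEN_MAX bytes, which is enough for
   any single multibyte character in any supported locale.
   Returns the number of bytes stored, or (size_t)-1 if wc has no
   multibyte representation in the current locale. */
size_t wide_to_mb(wchar_t wc, char buf[MB_LEN_MAX])
{
    mbstate_t state;

    assert(buf != NULL);
    memset(&state, 0, sizeof state);   /* begin in the initial shift state */
    return wcrtomb(buf, wc, &state);
}
```

In the "C" locale a member of the basic character set such as L'A' is expected to produce a single byte.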
Example
#include <stdio.h>

char *p1 = "^[$B$3$l$OF|K\8lI=8=^[(J"; /* ^[$BF|K\8lJ8;zNs^[(J */
char *p2 = "^[$B$3$l$OF|1Q^[(Jmixed^[$BJ8;zNs^[(J"; /* Ascii + ^[$BF|K\8l^[(J */
char *p3 = "^[$B$3$l$OH>3Q^[(J^N6@6E^O^[$B$H^[(JASCII^[$B:.9g^[(J";

int main(void)
{
    printf("%s^[$B$H^[(J%s^[$B$H^[(J%s\n", p1, p2, p3);
}
244
While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the
shift state.
Commentary
The implementation of a stateful encoding has to pick a special character, which is not in the basic character
set, to indicate the start of a shift sequence. When not in the initial shift state, it is very unlikely that single
bytes will be interpreted the same way as when in the initial shift state.
C++
The C++ Standard does not explicitly specify this requirement.
Common Implementations
The ESC character, 0x1b, is commonly used to indicate the start of a shift sequence.
245
12) The trigraph sequences enable the input of characters that are not defined in the Invariant Code Set as
described in ISO/IEC 646, which is a subset of the seven-bit US ASCII code set.
Commentary
When trigraphs are used, it is possible to write C source code that contains only those characters that are in
the Invariant Code Set of ISO/IEC 646.
C90
The C90 Standard explicitly referred to the 1983 version of ISO/IEC 646 standard.
246
The interpretation for subsequent bytes in the sequence is a function of the current shift state.
Commentary
This wording is really a suggestion for the design of multibyte shift states (it is effectively describing the
processing performed by ﬁnite state machines, which is what a shift state encoding is). Being able to interpret
a byte independent of the current shift state would indicate that the sequence of bytes that resulted in the
current state were redundant.
The specification of the macro MB_LEN_MAX requires that the maximum number of bytes needed to handle
a supported multibyte character be provided. It may, or may not, be possible to represent some redundant
shift sequence within the available bytes. The standard does not explicitly require or prohibit support for
redundant shift sequences.
C++
A set of virtual functions for handling state-dependent encodings, during program execution, is discussed in
Clause 22, Localization library. But, this requirement is not specified.
Common Implementations
Implementations usually use a simple finite state machine, often automatically generated, to handle the
mapping of shift states into their execution character value. The extent to which redundant shift
sequences are supported will depend on the implementation.
Coding Guidelines
The sequence of bytes in a shift sequence is usually generated via some automated process. For this reason
a guideline recommending against the use of redundant shift sequences is unlikely to be enforceable, and
none is given.
247
— A byte with all bits zero shall be interpreted as a null character independent of shift state.
Commentary
This is a requirement on the implementation. This requirement makes it possible to search for the end of
a string without needing any knowledge of the encoding that has been used. For instance, string-handling
functions can copy multibyte characters without interpreting their contents.
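As an illustration, the bytes of Table 243.4 can be copied and measured with the ordinary string functions, which know nothing about the encoding. None of the bytes in the escape sequences or in the two-byte characters is zero, so the first zero byte found is the terminating null character. The names kana_kanji and copy_mb_string are invented for this sketch.

```c
#include <assert.h>
#include <string.h>

/* The byte sequence of Table 243.4 written as a C string. */
static const char kana_kanji[] =
    "\x1b\x24\x42"                       /* <ESC> $ B */
    "\x24\x2b\x24\x4a\x34\x41\x3b\x7a"   /* four two-byte characters */
    "\x1b\x28\x4a";                      /* <ESC> ( J */

/* Copies a multibyte string without interpreting its contents. dst must
   have room for strlen(src) + 1 bytes. Returns the number of bytes
   copied, not counting the null character. */
size_t copy_mb_string(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    strcpy(dst, src);
    return strlen(dst);
}
```

The result is a count of bytes (14 here), not of characters; the string holds four graphic characters.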
C++
2.2p3
. . . , plus a null character (respectively, null wide character), whose representation has all zero bits.
While the C++ Standard does not rule out the possibility of all bits zero having another interpretation in other
contexts, other requirements (17.3.2.1.3.1p1 and 17.3.2.1.3.2p1) restrict these other contexts, as do existing
character set encodings.
248
— Such a byte shall not occur as part of any other multibyte character.
Commentary
This is a requirement on the implementation. The effect of this requirement is that partial multibyte characters
cannot be created (otherwise the behavior is undefined). A null character can only exist outside of the
sequence of bytes making up a multibyte character. For source files this requirement follows from the
requirement to end in the initial shift state. During program execution this requirement means that library
functions processing multibyte characters do not need to concern themselves with handling partial multibyte
characters at the end of a string.
The wording was changed by the response to DR #278 (it is a requirement on the implementation that
forbids a two-byte character from having a first, or any, byte that is zero).
C++
This requirement can be deduced from the definition of null terminated byte strings, 17.3.2.1.3.1p1, and null
terminated multibyte strings, 17.3.2.1.3.2p1.
249
For source ﬁles, the following shall hold:
Commentary
These C-sentences specify requirements on a program. A program that violates them exhibits undefined
behavior.
Use of multibyte characters can involve locale-specific and implementation-defined behaviors. A source
file does not affect the conformance status of any program built using it, provided its use of multibyte
characters either involves locale-specific behavior or the implementation-defined behavior does not affect
program output (e.g., they appear in comments).
Coding Guidelines
The creation of multibyte characters within source files is usually handled by an editor. The developer's
involvement in the process is limited to the selection of the appropriate character. In such an environment the
developer has no control over the byte sequences used. A guideline recommending against such usage is
likely to be impractical to implement and none is given.
250
— An identifier, comment, string literal, character constant, or header name shall begin and end in the initial
shift state.
Commentary
These are the only tokens that can meaningfully contain a multibyte character. A token containing a multibyte
character should not affect the processing of subsequent tokens. Without this requirement a token that did
not end in the initial shift state would be likely to affect the processing of subsequent tokens.
C90
Support for multibyte characters in identiﬁers is new in C99.
C++
In C++ all characters are mapped to the source character set in translation phase 1. Any shift state encoding
will not exist after translation phase 1, so the C requirement is not applicable to C++ source files.
Coding Guidelines
The fact that many multibyte sequences are created automatically, by an editor, can make it very difﬁcult for
a developer to meet this requirement. A developer is unlikely to intentionally end a preprocessing token,
created using a multibyte sequence, in other than the initial state. A coding guideline is unlikely to be of
beneﬁt.
251
— An identiﬁer, comment, string literal, character constant, or header name shall consist of a sequence of
valid multibyte characters.
Commentary
What is a valid multibyte character? This decision can only be made by a translator, should it choose to accept
multibyte characters.
In C90 it was relatively easy to lexically process a source ﬁle containing multibyte characters. The
context in which these characters occurred often meant that a lexer simply had to look for the character that
terminated the kind of token being processed (unless that character occurred as part of a multibyte character).
Identiﬁer tokens do not have a single termination character. This means that it is not possible to generalise
support for multibyte characters in identiﬁers across all translators. It is possible that source containing a
multibyte character identiﬁer supported by one translator will cause another translator to issue a diagnostic.
C90
Support for multibyte characters in identiﬁers is new in C99.
C++
In C++ all characters are mapped to the source character set in translation phase 1. Any shift state encoding
will not exist after translation phase 1, so the C requirement is not applicable to C++ source files.
Coding Guidelines
In some cases source ﬁles can contain multibyte characters and be translated by translators that have no
knowledge of the structure of these multibyte characters. The developer is relying on the translator ignoring
them in comments containing their native language, or simply copying the character sequence in a string
literal into the program image. In other cases, for instance identiﬁers, knowledge of the encoding used for
the multibyte character set is likely to be needed by a translator.
Ensuring that a translator capable of handling any multibyte characters occurring in the source is used, is a
conﬁguration-management issue that is outside the scope of these coding guidelines.
5.2.2 Character display semantics
Commentary
There is no guarantee that a character display will exist on any hosted implementation. If such a device is
supported by an implementation, this clause specifies its attributes.
C++
Clause 18 mentions "display as a wstring" in Notes:. But, there is no other mention of display semantics
anywhere in the standard.
Common Implementations
Most Unix-based environments contain a database of terminal capabilities, the so-called termcap database [1332].
This database provides information to the host on a large number of terminal capabilities and characteristics.
Knowing the display device currently being used (this usually relies on the user setting an environment
variable) enables the database to be queried for device attribute information. This information can then be
used by an application to handle its output to display devices. There is a similar database of information on
printer characteristics.

Friday, February 28, 2014
