System and method for processing graphic language characters
5802482
Abstract
A system and method are described for processing foreign language characters. An input processor parses input language data into combinations dictated by a set of combining rules. The combining rules are application and language dependent. The data structures generated by the input processor comprise a header and one or more characters or character strings. The header further comprises a layout field that identifies the relative position of the one or more characters or character strings.
Claims
What is claimed is:
Description
BACKGROUND OF THE INVENTION
TABLE 1
______________________________________
Position Permissible Categories
______________________________________
Top Tone Mark, Diacritic
Above Vowel, Tone Mark
Base Consonant
Below Vowel, Diacritic
______________________________________
Character subunit input data received from file system 102 or keyboard 104 can be represented by single byte indexes into the character library. The role of input processor 106 is to combine these single byte indexes into data structures that represent valid Thai characters with respect to combining rules 108. For the Thai language of FIG. 2, exemplary combining rules can be represented by the rules listed in Table 2.
TABLE 2
______________________________________
SEQ  Position 1 (320)  Position 2 (330)       Position 3 (340)
______________________________________
1    BAS Cons.         BLW/ABV Vowel          BLW/ABV/TOP Tone Mark
2    BAS Cons.         BLW/ABV Vowel          BLW/TOP Diacritic
3    BAS Cons.         BLW/TOP Diacritic      N/A
4    BAS Cons.         BLW/ABV/TOP Tone Mark  N/A
5    BAS Cons.         N/A                    N/A
______________________________________
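The category part of the Table 2 rules can be sketched as a small lookup. This is only an illustrative sketch, not the patent's implementation: the category names, the `COMBINING_RULES` table, and the `match_rule` function are hypothetical, and the display positions (BLW/ABV/TOP) listed in Table 2 are deliberately left out to keep the example short.

```python
# Hypothetical category labels for character subunits (not from the patent).
CONSONANT, VOWEL, TONE, DIACRITIC = "consonant", "vowel", "tone_mark", "diacritic"

# Table 2, categories only. None stands for "N/A" (no subunit in that slot).
COMBINING_RULES = {
    1: (CONSONANT, VOWEL, TONE),
    2: (CONSONANT, VOWEL, DIACRITIC),
    3: (CONSONANT, DIACRITIC, None),
    4: (CONSONANT, TONE, None),
    5: (CONSONANT, None, None),
}


def match_rule(categories):
    """Return the Table 2 sequence number that the categories match, or None."""
    # Pad short inputs with None so they line up with the three rule slots.
    padded = tuple(categories) + (None,) * (3 - len(categories))
    if len(padded) != 3:
        return None  # more than three subunits can never match
    for seq, rule in COMBINING_RULES.items():
        if rule == padded:
            return seq
    return None
```

For instance, `match_rule([CONSONANT, VOWEL, TONE])` matches the first row, while consonant-vowel-vowel, consonant-tone mark-vowel, and consonant-diacritic-tone mark all return `None`.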
Each of the rules in Table 2 defines a permissible input sequence. For example, consider character subunit sequence 1. If an input data stream from file system 102 or keyboard 104 includes the sequence of a base consonant (labeled BAS Cons.), an above vowel (labeled ABV Vowel), and a top tone mark (labeled TOP Tone Mark), input processor 106 will conclude that the sequence is valid. Examples of incorrect sequences include (1) consonant-vowel-vowel, (2) consonant-tone mark-vowel, and (3) consonant-diacritic-tone mark.
Once a valid input sequence is identified, a data structure can be formed for that particular Thai character. FIG. 3 illustrates an example of a Thai character data structure 300. Data structure 300 comprises header 310 and character subunits 320, 330 and 340. Header 310 further comprises fields 302, 304, and 306. Field 302 is a 1-bit field that identifies whether the data structure represents a combined character or a combined string. Combined strings are described in greater detail below. For a combined character, field 302 is set to 0; for a combined string, field 302 is set to 1. Field 304 is a 3-bit field that identifies a relative display position of character subunits 320, 330, and 340. The bits in field 306 are unused in this example. The number of bits in field 304 can vary based upon the number of character subunits in the data structure and the number of possible display positions for those character subunits.
For this particular example, the combining rules dictate that character subunit 320 is occupied by a consonant. That is, a valid input sequence has a consonant as the first subunit. Since consonants are restricted to the base position (see Table 1), character subunit 320 is always assigned to the base position.
Character subunit 330, on the other hand, can be occupied by (1) a vowel in the below or above position, (2) a diacritic in the below or top position, (3) a tone mark in the below, above, or top position, or (4) nothing at all (see Table 2). In sum, character subunit 330 can be placed in the below, above, or top position. Finally, if character subunit 330 contains a vowel, character subunit 340 can be occupied by (1) a tone mark in the below, above, or top position or (2) a diacritic in the below or top position. In the same manner as character subunit 330, character subunit 340 can be placed in the below, above, or top position. To unambiguously identify all possible display positions, field 304 contains at least 3 bits. Table 3 illustrates character subunit positions according to an embodiment of the present invention.
TABLE 3
______________________________________
Field 304 Character Positions
______________________________________
000 BAS = 1
001 BAS = 1, BLW = 2
010 BAS = 1, ABV = 2
011 BAS = 1, TOP = 2
100 BAS = 1, BLW = 2, ABV = 3
101 BAS = 1, BLW = 2, TOP = 3
110 BAS = 1, ABV = 2, BLW = 3
111 BAS = 1, ABV = 2, TOP = 3
______________________________________
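As a rough illustration of how header 310 and Table 3 fit together, the sketch below packs field 302 and field 304 into a single byte and decodes the layout. The one-byte header, the bit placement (field 302 in bit 7, field 304 in bits 6-4, field 306 in bits 3-0), and all function names are assumptions made for this example; the text specifies only the field widths.

```python
# Table 3: field 304 value -> display position of each subunit, in input order.
LAYOUTS = {
    0b000: ("BAS",),
    0b001: ("BAS", "BLW"),
    0b010: ("BAS", "ABV"),
    0b011: ("BAS", "TOP"),
    0b100: ("BAS", "BLW", "ABV"),
    0b101: ("BAS", "BLW", "TOP"),
    0b110: ("BAS", "ABV", "BLW"),
    0b111: ("BAS", "ABV", "TOP"),
}


def pack_header(is_string, layout):
    """Pack field 302 (1 bit) and field 304 (3 bits) into an assumed 1-byte header."""
    if layout not in LAYOUTS:
        raise ValueError("field 304 must be a 3-bit value")
    # Bits 3-0 stand in for the unused field 306 and are left as zero.
    return (int(bool(is_string)) << 7) | (layout << 4)


def unpack_header(header):
    """Return (is_string, layout) from the assumed 1-byte header."""
    return bool((header >> 7) & 1), (header >> 4) & 0b111


def place_subunits(header, subunits):
    """Map each subunit of a combined character to its display position."""
    is_string, layout = unpack_header(header)
    slots = LAYOUTS[layout]
    if is_string or len(slots) != len(subunits):
        raise ValueError("header does not describe these subunits")
    return dict(zip(slots, subunits))
```

With a layout value of `0b111`, `place_subunits` assigns the first subunit to the base position, the second to the above position, and the third to the top position.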
In the exemplary data structure of FIG. 3, field 304 is specified as 111. As listed in Table 3, the value of 111 places character subunit 320 at the base position, character subunit 330 at the above position, and character subunit 340 at the top position. This relative positioning is illustrated in FIG. 2 as positions 202, 204, and 206, respectively.
Having identified the contents of a data structure, the process of generating a data structure is now described with reference to the flow chart of FIG. 6. In step 602, input processor 106 receives a character from either file system 102 or keyboard 104. In one embodiment, the character is in the form of a single (or multiple) byte index. Next, in step 604, input processor 106 determines whether the received character is part of a valid sequence. In the Thai example described above, the combining rules specify that the first character in a sequence is the base consonant (see Table 2). Thus, if the first character in a sequence is not a consonant, input processor 106 knows that an error has occurred. In step 606, input processor 106 can signal an error or prompt a user to reenter a character at keyboard 104. If the received character is part of a valid sequence, it is placed in a register (not shown) within input processor 106. This process is represented by step 608. In step 610, the next character is received from file system 102 or keyboard 104. After the next character is received, a determination is made in step 612 whether a combined character sequence has ended. In the Thai example described above, the end of the sequence can be identified by a receipt of a second consonant. This would indicate that a new combined character sequence has started. In other embodiments, the end of a character sequence can be identified by an arbitrary control character.
If input processor 106 determines, in step 612, that a combined character sequence has not ended, the process returns to step 604 where the character is validated with respect to combining rules 108. For example, if the next character is a vowel that follows a previous consonant-vowel sequence, the character is rejected as invalid. Generally, if the character is validated in step 604, the process then proceeds in a similar manner through steps 608, 610, and 612. If input processor 106 determines, in step 612, that a combined character sequence has ended, the process continues to step 614 where input processor 106 generates a header for the data structure. This header (e.g., header 310) includes information that defines the relative position of the character subunits within the combined character. Finally, in step 616, input processor 106 stores the generated data structure in memory 110.
As illustrated by the flow chart of FIG. 7, input processor 106 can also edit data structures that have been previously stored in memory 110. This editing process begins in step 702 where input processor 106 loads one or more characters of a data structure into one or more registers. In step 704, input processor 106 deletes one or more of the characters stored in the registers based upon the control of a user. In a preferred embodiment, this deletion process occurs sequentially. In other words, the characters are deleted in a reverse order of the input process. For example, in FIG. 2, the characters 202, 204 and 206 would be input in that order (see sequence 2 of Table 2). If character 204 is sought to be changed, character 206 and character 204 are deleted in that order. After the deletion process is completed, the user, in step 706, provides one or more characters that are sought to be inserted. These new characters are checked with respect to the combining rules in the same manner as illustrated in FIG. 6.
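The input flow of FIG. 6 (steps 602 through 616), which the editing flow reuses for validation, can be sketched roughly as follows. This is a minimal sketch under stated assumptions rather than the patented implementation: each incoming character is modeled as a hypothetical `(category, position, index)` tuple, a Python list stands in for memory 110, the end of input is treated as ending the last sequence, and the header is a plain dictionary instead of packed bits.

```python
# Table 2 by category; a received character is valid if the sequence so far
# is a prefix of one of these rules (step 604 validates one character at a time).
RULES = [
    ("cons", "vowel", "tone"),
    ("cons", "vowel", "diac"),
    ("cons", "diac"),
    ("cons", "tone"),
    ("cons",),
]

# Table 3 inverted: subunit positions in input order -> field 304 value.
LAYOUT_CODES = {
    ("BAS",): 0b000,
    ("BAS", "BLW"): 0b001,
    ("BAS", "ABV"): 0b010,
    ("BAS", "TOP"): 0b011,
    ("BAS", "BLW", "ABV"): 0b100,
    ("BAS", "BLW", "TOP"): 0b101,
    ("BAS", "ABV", "BLW"): 0b110,
    ("BAS", "ABV", "TOP"): 0b111,
}


def is_valid_prefix(categories):
    """True if the categories received so far begin some Table 2 rule."""
    cats = tuple(categories)
    return any(rule[:len(cats)] == cats for rule in RULES)


def build_combined_characters(stream):
    """Group (category, position, index) tuples into combined-character records."""
    memory = []    # stands in for memory 110
    register = []  # stands in for the register inside input processor 106

    def flush():
        layout = LAYOUT_CODES.get(tuple(pos for _, pos, _ in register))
        if layout is None:
            raise ValueError("positions not listed in Table 3")
        header = {"is_string": 0, "layout": layout}             # step 614
        memory.append({"header": header,
                       "subunits": [idx for _, _, idx in register]})  # step 616
        register.clear()

    for cat, pos, idx in stream:                 # steps 602 and 610
        if register and cat == "cons":           # step 612: a second consonant
            flush()                              # ends the current sequence
        cats = [c for c, _, _ in register] + [cat]
        if not is_valid_prefix(cats):            # step 604
            raise ValueError("invalid sequence: %r" % (cats,))  # step 606
        register.append((cat, pos, idx))         # step 608
    if register:
        flush()  # assumption: end of input also ends the last sequence
    return memory
```

A stream of consonant, above vowel, top tone mark, consonant yields two records: one with layout `0b111` and one with layout `0b000`. A stream that starts with a vowel, or that contains consonant-vowel-vowel, raises `ValueError`.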
If the new character sequence is validated, a new header is generated in step 708. Finally, the edited data structure is stored in memory 110.
As the Thai language example illustrates, a data structure can represent a foreign language character that can be divided into individual character subunits. As one can readily appreciate, the present invention is not confined to a specific number or type of character subunit. Moreover, the present invention is not confined to the subunit display positions (i.e., top, above, base, below) of FIG. 2. Various other positions can be defined in the context of the demands of a specific foreign language.
As the next example illustrates, the present invention can be extended beyond the display of single foreign language characters. In particular, the present invention can be applied to displays of foreign language characters and their associated pronunciations. For example, consider the Japanese language which includes Kanji, Hiragana, and Katakana characters. Unlike Hiragana and Katakana characters, Kanji characters are not phonetically based. Accordingly, the Kanji character set numbers in the thousands. One problem that exists in the processing of Kanji characters is the existence of multiple pronunciations for a single Kanji character. For example, consider Kanji character 402 in FIG. 4. Kanji character 402 has two pronunciations. The first pronunciation, "ya-ma", is represented by Hiragana characters 412 and 414. The second pronunciation, "san", is represented by Hiragana characters 422 and 424. In many publishing applications, the pronunciation of the Kanji character is used for indexing or sorting. To support this function, the Hiragana or Katakana characters are stored and/or displayed with the Kanji character. In the present invention, the Kanji character and its Hiragana or Katakana pronunciation are stored as a combined string. An exemplary data structure 500 representing the combined string is illustrated in FIG. 5.
Data structure 500 comprises header 510 and strings 520 and 530. String 520 includes the multi-byte representation for Kanji character 402. String 530 includes the multi-byte representation for Hiragana characters 412 and 414. Header 510 further comprises information fields 511-516. Information field 511 identifies data structure 500 as either a combined character or a combined string. As noted above, field 511 is set to 1 for combined strings. Information field 512 identifies the relative position of strings 520 and 530. If a simple combining rule is chosen such that a Kanji character is received first, followed by a string of Hiragana or Katakana characters, 2 bits can be used to identify the relative display positions of the character strings. These relative display positions can be in a left-right or top-bottom orientation. This simple listing of positions is illustrated in Table 4. In representing the "ya-ma" pronunciation of Kanji character 402, information field 512 is set to 00. That is, Kanji character 402 is at the base and Hiragana characters 412, 414 are at the top.
TABLE 4
______________________________________
Field 512 String Positions
______________________________________
00 BAS = 1, TOP = 2
01 BAS = 1, BTM = 2
10 BAS = 1, RGT = 2
11 BAS = 1, LFT = 2
______________________________________
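The header of a combined string can be sketched in the same spirit, using the position codes of Table 4 together with the per-string byte counts that the following paragraph describes for fields 513-515. Everything about the packing is an assumption for illustration: a 16-bit header with field 511 in bit 15, the 2-bit position field in bits 14-13, three 4-bit byte counts in bits 12-1, and bit 0 standing in for unused field 516. The function names, the choice of Shift JIS as a two-byte encoding, and the use of 山 with the reading やま as the example are likewise hypothetical.

```python
# Table 4: 2-bit position code -> display position of string 1 and string 2.
STRING_LAYOUTS = {
    0b00: ("BAS", "TOP"),
    0b01: ("BAS", "BTM"),
    0b10: ("BAS", "RGT"),
    0b11: ("BAS", "LFT"),
}


def pack_string_header(layout, byte_counts):
    """Pack a combined-string header into an assumed 16-bit integer."""
    if layout not in STRING_LAYOUTS:
        raise ValueError("position code must be a 2-bit value")
    counts = list(byte_counts) + [0] * (3 - len(byte_counts))
    if len(counts) != 3 or any(not 0 <= n <= 0b1111 for n in counts):
        raise ValueError("expected up to three byte counts of 4 bits each")
    header = 1 << 15           # field 511 = 1: combined string
    header |= layout << 13     # relative display position (Table 4)
    header |= counts[0] << 9   # field 513: bytes in string 1
    header |= counts[1] << 5   # field 514: bytes in string 2
    header |= counts[2] << 1   # field 515: bytes in string 3
    return header              # bit 0 (field 516) stays unused


def unpack_string_header(header):
    """Decode the assumed 16-bit header back into its fields."""
    return {
        "is_string": (header >> 15) & 1,
        "layout": STRING_LAYOUTS[(header >> 13) & 0b11],
        "byte_counts": [(header >> shift) & 0b1111 for shift in (9, 5, 1)],
    }


def build_combined_string(base, reading, layout=0b00, encoding="shift_jis"):
    """Encode a base string and its reading, and build the matching header."""
    strings = [base.encode(encoding), reading.encode(encoding)]
    return pack_string_header(layout, [len(s) for s in strings]), strings
```

Under these assumptions, `build_combined_string("山", "やま")` produces byte counts of 2 and 4, which correspond to the 0010 and 0100 values given for the example data structure.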
Information fields 513-515 identify the number of bytes of each of the strings. Specifically, field 513 identifies the number of bytes in string 1, field 514 identifies the number of bytes in string 2, and field 515 identifies the number of bytes in string 3 (unused in this example). For exemplary data structure 500, field 513 is set to 0010 (i.e., string 520 contains a single double-byte index) and field 514 is set to 0100 (i.e., string 530 contains two double-byte indexes). This assumes that each of the Kanji and Hiragana characters requires a two-byte representation. Finally, in this example, information field 516 is unused.
As the Japanese language example illustrates, a data structure can be used to represent one or more characters having a specified relation (e.g., pronunciation). As one can readily appreciate, the present invention is not confined to a specific number or type of character strings. Moreover, the present invention is not confined to the character string display positions (e.g., top and bottom) of FIG. 4. Various other positions can be defined in the context of the demands of a specific foreign language application. Generally, the creation and editing of combined string data structures follows the processes illustrated in FIGS. 6 and 7.
Referring back to FIG. 1, the generated data structures representing combined characters or strings are stored in memory 110. These data structures can be retrieved from memory 110 by output processor 112 or draw processor 114. Both output processor 112 and draw processor 114 render the characters or strings in the order defined by the header. Draw processor 114 outputs the rendered characters or strings to display 118 while output processor 112 outputs the rendered characters or strings to file 116.
The process of generating a single bit-mapped representation based upon the character units or strings in the data structures would be apparent to one of ordinary skill in the relevant art and is not described in greater detail.
In one embodiment, the invention is directed to a computer system operating as discussed herein. An exemplary computer system 802 is shown in FIG. 8. The computer system 802 includes one or more processors, such as processor 804. The processor 804 is connected to a communication bus 806. The computer system 802 also includes a main memory 808, preferably random access memory (RAM), and a secondary memory 810. The secondary memory 810 includes, for example, a hard disk drive 812 and/or a removable storage drive 814, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, etc. The removable storage drive 814 reads from and/or writes to a removable storage unit 818 in a well known manner. Removable storage unit 818, also called a program storage device or a computer program product, represents a floppy disk, magnetic tape, compact disk, etc. As will be appreciated, the removable storage unit 818 includes a computer usable storage medium having stored therein computer software and/or data. Computer programs (also called computer control logic) are stored in main memory 808 and/or the secondary memory 810. Such computer programs, when executed, enable the computer system 802 to perform the features of graphical language character processing as discussed herein. In particular, the computer programs, when executed, enable the processor 804 to perform the features of the present invention. Accordingly, such computer programs represent controllers of the computer system 802.
In another embodiment, the invention is directed to a computer program product comprising a computer readable medium having control logic (computer software) stored therein.
The control logic, when executed by the processor 804, causes the processor 804 to perform the functions of the invention as described herein. In another embodiment, the invention is implemented primarily in hardware using, for example, a hardware state machine. Implementation of the hardware state machine so as to perform the functions described herein will be apparent to persons skilled in the relevant art(s). While the invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the relevant art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.
