Method, system and computer program product for sorting text strings
6389386
Abstract
A multi-field text string contains display characters in a first field and sort characters in a second field. For ideographic languages such as Japanese, the display characters may be Kanji symbols for the text string while the sort characters are phonetic syllabary representations of the Kanji symbols. A plurality of such multi-field text strings may then be sorted by the contents of the second field rather than the contents of the first. Despite both the multiple pronunciations or meanings associated with the same Kanji symbols in Japanese and the unsorted ordering of Kanji symbols within the Unicode character set for Japanese, a culturally correct sort order is achieved for the multi-field text strings. Additionally, the contents of the second field may be altered to artificially promote a specific item within the sort order, while displaying the sorted text strings utilizing the contents of the first field. The mechanism for promoting particular text strings within the sort order does not interfere with user viewing of the displayed text strings.
Claims
What is claimed is:
Description
BACKGROUND OF THE INVENTION
TABLE I
Field            Type          Data
baseString       Java String   The user's text
sortString       Java String   Language/locale dependent
altString        Java String   Language/locale dependent
sourceLocale     Java String   ISO-3166 code, example "US"
sourceLanguage   Java String   ISO-639 code, example "en"
sourceVariant    Java String   Variant code
targetLocale     Java String   ISO-3166 code, example "JP"
targetLanguage   Java String   ISO-639 code, example "ja"
targetVariant    Java String   Variant code
A Java constructor for a new, empty IString class object 202 where the contents are independent of language or locale may be:

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p>Allocate a new IString containing no characters in the default
     * locale. </p>
     ***************************************************/
    public IString() {
        this.baseString = new String();
        this.sortString = new String();
        this.altString = new String();
        init();
    }

To allow objects of the IString class 202 datatype to be stored in an Object Database (ODB), however, and to permit manipulation of IString data by Common Object Request Broker Architecture (CORBA) applications, an Interface Definition Language (IDL) class should be defined:
struct IString{
string baseString; //base text String
string sortString; //related text String for collation
string altString; //related alternate text String (pronunciation)
string sourceLocale; //source locale as an ISO-3166 code
string sourceLanguage; //source language as an ISO-639 code
string sourceVariant; //source variant code
string targetLocale; //target locale as an ISO-3166 code
string targetLanguage; //target language as an ISO-639 code
string targetVariant; //target variant code
};
The contents of baseString 204, sortString 206, and altString 208 are preferably but not necessarily Unicode text entered by data entry methods 210 within IString class 202. Data entry methods 210, and thus the contents of baseString 204, sortString 206, and altString 208, may depend at least in part on language and locale parameters defined by sourceLocale field 212, sourceLanguage field 214, targetLocale field 216, and targetLanguage field 218.

Because data entry methods 210 are dependent on the locale and/or language employed by the underlying host system, creation of a new IString object 202 preferably results in the locale and language properties of the host system in which the IString object 202 is created being placed in sourceLocale field 212 and sourceLanguage field 214. A constructor for allocating a new, empty IString for a specified locale and language determined from the host system in which the IString class object 202 is being created may be:

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p>Allocate a new IString containing no characters in the
     * specified locale. </p>
     ***************************************************/
    public IString(Locale loc) {
        this.baseString = new String();
        this.sortString = new String();
        this.altString = new String();
        this.sourceLocale = loc.getLocale();
        this.sourceLanguage = loc.getLanguage();
        init();
    }

Input of data into an IString class 202 object is preferably locale- or language-dependent. The sourceLanguage and targetLanguage properties 214 and 218 control how data is input into an IString class object 202 by data input methods 210. The sourceLanguage property 214 may be set to the language property of the host system on which the IString class object is created. The targetLanguage property 218 may also be set to that language, or may alternatively be set to a common, "universal" language such as English.
Data input methods 210 compare sourceLanguage and targetLanguage properties 214 and 218 to determine what is entered into baseString 204, sortString 206, and altString 208 in an IString class object 202. Character strings are entered into the baseString 204, sortString 206, and altString 208 fields by data input methods 210 for IString class 202, which may selectively utilize data from the user's direct entry or specification, from transliteration engine 220, or from the Input Method Editor (IME) 224.

Where the targetLanguage property 218 is set to English as a default, data entry methods 210 determine the contents of baseString 204, sortString 206, and altString 208 fields based upon the character set employed by the language in which data is entered by the user (sourceLanguage property 214). For languages which employ the latin character set, the user input is placed by data entry methods 210 into all three fields (baseString 204, sortString 206, and altString 208) of the IString class 202. A suitable constructor may be:

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p>Allocate a new IString which contains the same sequence of
     * characters as the string argument in the specified locale. </p>
     ***************************************************/
    public IString(String str, Locale loc) {
        this.baseString = new String(str);
        this.sortString = new String(str);
        this.altString = new String(str);
        this.sourceLocale = loc.getLocale();
        this.sourceLanguage = loc.getLanguage();
        init();
    }

For most locales and languages, the entered string will be input into all three fields of the IString object 202.
If targetLanguage property 218 were not set to English, data entry methods 210 would input the user-entered text into all three fields whenever the languages identified in sourceLanguage and targetLanguage properties 214 and 218 employ a common character set (e.g., both employ latin characters, as in the case of Spanish and Afrikaans). Table II illustrates how data is entered into IString class 202 fields where the host language and locale utilize the latin character set.
TABLE II
Field            Type          Data
baseString       Java String   Hetherington
sortString       Java String   Hetherington
altString        Java String   Hetherington
sourceLocale     Java String   US
sourceLanguage   Java String   en
targetLocale     Java String   US
targetLanguage   Java String   en
If desired, the fields may be individually edited and the object artificially promoted for sorting purposes by inserting a string having a lower sort value (e.g., "AAA_Hetherington") into sortString 206.

For languages which do not employ the latin character set, but which utilize a character set which may be sound-mapped to the latin character set, the user input is entered by data entry methods 210 into baseString 204 and sortString 206, but a transliterated, phonetic representation of the input is placed in altString 208. An internal method within the transliteration engine 220 is employed to sound-map the passed string to a phonetic, latin character representation for altString 208, transliterating the entered characters into characters understandable to people who are not familiar with the character set of the original language.

To generate the contents of altString 208, transliteration engine 220 selects an appropriate Java resource file 222 containing a mapping table to create the alternate text to be placed in altString 208. The selection of the particular resource file which is employed is based on the combination of source and target languages. Java resource files 222 are named for the combination of languages for which the mapping is being performed. In the example shown in FIG. 2, ru--en_class is for mapping Russian (Cyrillic characters) to English (Latin characters). The structure of resource file 222 is a table with associated entries for foreign language characters and corresponding latin characters. A suitable constructor for an IString object in which altString 208 is transliterated from the passed string may be:

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p>Allocate a new IString. The baseString and sortString are the
     * passed string, the altString is transliterated into the target
     * language. </p>
     ***************************************************/
    public IString(String str) {
        this.baseString = new String(str);
        this.sortString = new String(str);
        if (isSameLanguage())
            this.altString = new String(str);
        else
            this.altString = transmogrify(str, this.sourceLanguage, this.targetLanguage);
    }

The "transmogrify" method is the internal method within transliteration engine 220 which was described above. The character set into which the entered characters are transliterated is determined from the targetLanguage property 218, which in the exemplary embodiment is assumed to be set to English. Given an appropriate resource file 222, however, characters may be transliterated between any two languages for which characters in one language sound-map to one or more characters in the other. Table III illustrates how data is entered into IString class 202 by data entry methods 210 where the language utilizes a non-latin character set which maps to the latin character set, such as Russian Cyrillic.
TABLE III
Field            Type          Data
baseString       Java String   {character pullout}
sortString       Java String   {character pullout}
altString        Java String   David Kumhyr
sourceLocale     Java String   RU
sourceLanguage   Java String   ru
targetLocale     Java String   US
targetLanguage   Java String   en
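The table-driven sound mapping described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name TransliterationSketch, the five-entry map, and the sample input are hypothetical, and the map stands in for the contents of a resource file 222 rather than reproducing it.

```java
import java.util.Map;

// Hypothetical stand-in for a Russian-to-English mapping table of the kind
// the transliteration engine is described as loading from a resource file.
class TransliterationSketch {
    // Tiny illustrative subset; a real table would cover the whole alphabet.
    static final Map<Character, String> RU_TO_EN = Map.of(
        'Д', "D", 'а', "a", 'в', "v", 'и', "i", 'д', "d");

    // Replace each mapped character; pass unmapped characters through unchanged.
    static String transliterate(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            out.append(RU_TO_EN.getOrDefault(c, String.valueOf(c)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String base = "Давид"; // illustrative input, not taken from Table III
        System.out.println(base + " -> " + transliterate(base));
    }
}
```

In this sketch the original string would remain in baseString and sortString, and only the returned value would be placed in altString.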
In the example shown, the text entered by the user is inserted into both baseString 204 and sortString 206, but the text entered into altString 208 is selected by transliteration engine 220 utilizing a resource table of Russian Cyrillic to English character sound mappings. The phonetic representation of the baseString 204 is thus entered into altString 208 as a pronunciation key for users unfamiliar with the Cyrillic character set.

For languages which do not employ the latin character set or a character set which may be sound-mapped to the latin character set, data entry methods 210 input data into the baseString 204, sortString 206, and altString 208 fields which is derived from the input method editor (IME) 224. IME 224 may be either a customized input method editor or the input method editor which is integrated into Asian versions of the Windows NT operating system available from Microsoft Corporation of Redmond, Washington. If the Windows NT input method editor is employed, the appropriate data must be extracted from the Windows NT input method editor internal data storage. Table IV illustrates how data is entered into IString class 202 by data entry methods 210 for logosyllabic languages, such as Japanese, which employ neither the latin character set nor a character set which may be sound-mapped to the latin character set.
TABLE IV
Field            Type          Data
baseString       Java String   <Kanji>
sortString       Java String   {character pullout}
altString        Java String   hayashi
sourceLocale     Java String   JP
sourceLanguage   Java String   ja
targetLocale     Java String   US
targetLanguage   Java String   en
Logosyllabic languages do not have alphabets, but instead have very large character sets with symbols ("ideographs") corresponding to concepts and objects rather than simple sounds. For instance, the Joyo Kanji List (Kanji for Daily Use) adopted for the Japanese language in 1981 includes 1945 symbols. Normal computer keyboards cannot contain enough separate keys to have one for each symbol in the language, so input is accomplished phonetically utilizing keystroke combinations to select characters from one of two phonetic syllabaries, hiragana or katakana, and dictionary lookup for Kanji symbol creation. The process is implemented in the Windows NT input method editor identified above.

For logosyllabic or ideographic languages, therefore, the data entered into altString 208 is the latin characters typed by the user to compose the desired ideograph. The data entered into sortString 206 are the syllabary characters phonetically spelling the desired ideograph, providing an intermediate representation of the ideograph. The data entered into baseString 204 is the final ideograph selected by the user. As with transliteration of non-latin characters as described above, non-latin characters may be entered into altString 208 if the targetLanguage property is set to a language other than English and IME 224 supports composition of the ideographs by phonetic spelling in a language other than English. For instance, an IString object 202 might contain Japanese Kanji in baseString 204, hiragana in sortString 206, and Cyrillic characters in altString 208 if IME 224 permits composition of Japanese Kanji characters by phonetic spelling in Russian. A suitable constructor for receiving baseString 204, sortString 206 and altString 208 from IME 224 via data entry methods 210 for entry into an IString object 202 may be:

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p> Allocate a new IString. The baseString, sortString and
     * altString are entered from the IME utilizing the default language and
     * locale. </p>
     ***************************************************/
    public IString(String base, String sort, String alt, Locale src, Locale tgt) {
        this.baseString = base;
        this.sortString = sort;
        this.altString = alt;
        this.sourceLocale = src.getLocale();
        this.sourceLanguage = src.getLanguage();
        this.targetLocale = tgt.getLocale();
        this.targetLanguage = tgt.getLanguage();
        init();
    }

The contents of baseString 204, sortString 206 and altString 208 are entered into the respective fields from data derived from IME 224, while the contents of sourceLocale 212 and sourceLanguage 214 are entered from the default locale and language properties specified by the host system in which data is being entered into IString object 202. The contents of targetLocale 216 and targetLanguage 218 will typically be a locale/language code for a language utilizing the latin character set such as "en_US" (English--United States).

Regardless of the language in which text is entered into an IString class object 202, the data automatically entered into each of the baseString 204, sortString 206, and altString 208 by data entry methods 210 may be overridden or altered using other methods. The fields of an IString object 202 may preferably be individually and independently edited, allowing artificial promotion within sortString field 206 as described above, replacement of an erroneously selected ideograph in baseString field 204, or correction of a phonetic spelling within altString field 208.

While the above-described methods assumed that the source and target languages were taken from host system defaults, data may alternatively be entered into baseString 204, sortString 206 and altString 208 for specified source and target languages utilizing the constructor:

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p>Allocate a new IString. The baseString, sortString and
     * altString are entered from the IME for specified target and source
     * language and locale. </p>
     ***************************************************/
    public IString(String base, String sort, String alt, String srcLanguage,
                   String srcLocale, String tgtLanguage, String tgtLocale) {
        this.baseString = base;
        this.sortString = sort;
        this.altString = alt;
        this.sourceLocale = srcLocale;
        this.sourceLanguage = srcLanguage;
        this.targetLocale = tgtLocale;
        this.targetLanguage = tgtLanguage;
        init();
    }

In this constructor, the source and target language and locale which are employed to select the characters entered into baseString 204, sortString 206 and altString 208 may be specified. This latter constructor may be employed to create an IString object 202 in other than the host system default language, or in host systems where data for the IString object 202 is received from another system and a local instance is created.

It should be noted that transliteration engine 220 and messaging methods 226 need not necessarily be implemented within an IString class 202 as depicted in FIG. 2, and that IME method 224 need not be implemented separately. Transliteration engine 220 and messaging methods 226 may instead be implemented within separate subclasses which are appropriately constructed and/or invoked by IString class 202 as necessary, while IME 224 may be implemented as a method within IString class 202. Transliteration engine 220 and IME 224 are only required by data entry methods 210 to gather input data for IString class 202 objects under certain locale and language property settings. Otherwise, data may be programmatically input into baseString 204, sortString 206, and altString 208 by invoking the proper constructor.
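As a rough illustration of programmatic input, the sketch below builds an object shaped like Table IV through a seven-argument constructor. The class IStringSketch is a hypothetical stand-in that models only the fields the constructor sets, and the Kanji and hiragana literals are assumed spellings for "hayashi", since the source does not reproduce those characters.

```java
// Minimal stand-in for the IString data type; only the seven-argument
// constructor and the fields it sets are modeled (assumption).
class IStringSketch {
    final String baseString, sortString, altString;
    final String sourceLanguage, sourceLocale, targetLanguage, targetLocale;

    IStringSketch(String base, String sort, String alt,
                  String srcLanguage, String srcLocale,
                  String tgtLanguage, String tgtLocale) {
        this.baseString = base;
        this.sortString = sort;
        this.altString = alt;
        this.sourceLanguage = srcLanguage;
        this.sourceLocale = srcLocale;
        this.targetLanguage = tgtLanguage;
        this.targetLocale = tgtLocale;
    }

    // Builds an object shaped like Table IV. The Kanji and hiragana values
    // are illustrative guesses; the source elides the actual characters.
    static IStringSketch tableFourExample() {
        return new IStringSketch("林", "はやし", "hayashi", "ja", "JP", "en", "US");
    }

    public static void main(String[] args) {
        IStringSketch s = tableFourExample();
        System.out.println(s.baseString + " / " + s.sortString + " / " + s.altString);
    }
}
```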
The methods which may be invoked by programs at runtime to programmatically get and set fields within IString 202 include:

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p> Get the IString baseString. </p>
     *
     * @returns str String containing the base string
     ***************************************************/
    public String getBaseString() {
        return this.baseString;
    }

This method returns the contents of baseString 204 for an IString object 202. Similar methods return the contents of sortString 206 and altString 208:

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p> Get the IString sortString. </p>
     *
     * @returns str String containing the sort string
     ***************************************************/
    public String getSortString() {
        return this.sortString;
    }

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p> Get the IString altString. </p>
     *
     * @returns str String containing the alt string
     ***************************************************/
    public String getAltString() {
        return this.altString;
    }

The methods also include setting baseString 204:

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p> Set the IString baseString. </p>
     *
     * @param str String containing the base string
     ***************************************************/
    public void setBaseString(String sBase) {
        this.baseString = sBase;
    }

as well as sortString 206 and altString 208:

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p> Set the IString sortString. </p>
     *
     * @param str String containing the sort string
     ***************************************************/
    public void setSortString(String sSrt) {
        this.sortString = sSrt;
    }

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p> Set the IString altString. </p>
     *
     * @param str String containing the alt string
     ***************************************************/
    public void setAltString(String sAlt) {
        this.altString = sAlt;
    }

In addition to getting and setting baseString 204, sortString 206, and altString 208 for an IString object 202, programs may need to get or set the display locale or language of an IString object 202. Accordingly, other methods are provided to permit a program to get and/or set the locale or language properties of IString data:

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p> Get the locale of the IString data. </p>
     *
     * @returns loc Locale containing the locale of the data
     ***************************************************/
    public Locale getLocale() {
        Locale loc = new Locale(this.sourceLanguage, this.sourceLocale);
        return loc;
    }

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p> Set the locale of the IString data. </p>
     *
     * @param loc Locale of the data
     ***************************************************/
    public void setLocale(Locale loc) {
        this.sourceLocale = loc.getLocale();
        this.sourceLanguage = loc.getLanguage();
    }

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p> Get the display language of the IString data. </p>
     *
     * @returns Display language of the data
     ***************************************************/
    public String getDisplayLanguage() {
        Locale loc = new Locale(this.sourceLanguage, this.sourceLocale);
        return loc.getDisplayLanguage();
    }

    /***************************************************
     * <P> </P>
     * <dt> <b> Description: </b> <dd>
     * <p> Get the display locale of the IString data. </p>
     *
     * @returns Display locale of the data
     ***************************************************/
    public String getDisplayLocale() {
        if (this.sourceLanguage == null && this.sourceLocale == null)
            return null;
        else {
            Locale loc = new Locale(this.sourceLanguage, this.sourceLocale);
            return loc.getDisplayLocale();
        }
    }

While these methods are available, IString class 202 preferably exhibits a "black box" behavior such that the programmer/user need not know anything about the methods implemented for IString class 202. IString class 202 simply appears as a data type which encapsulates extra information about baseString 204 and also includes some methods for transforming characters from one character set to another. For special cases where the sortString field 206 or altString field 208 are to be exposed to the user in addition to or in lieu of baseString 204, either for editing or for display only, a separate set of controls may be provided.

In the present invention, IString class 202 is employed to effectively transfer human language data across systems employing incongruous languages. The contents of baseString 204 provide a native representation of the text in the default language of the system originating the IString object 202. However, for each system participating in the exchange of data with other systems running in different human languages, the targetLocale property 216 and targetLanguage property 218 of an IString object 202 are preferably set to a common value (e.g., targetLocale="US", targetLanguage="en").
The contents of altString 208 will thus contain a common, cross-language representation of the text string. In systems where the default language of a system receiving an object differs from the language of the contents of baseString 204, IString class object 202 may automatically switch to presenting the contents of altString 208 as the text string to be displayed or processed.

Referring to FIG. 3, a high level flowchart for a process of employing a multi-field text string class to sort text strings in accordance with a preferred embodiment of the present invention is illustrated. FIG. 3 is intended to be read in conjunction with FIG. 2. Normally text strings are sorted alphanumerically by the text contained within each respective string. With the three-field text class 202 of the present invention, objects may be artificially promoted by inserting extra, low-sort-value characters before the text in the sortString field 206 (e.g., "AAA_Frank Moss") without those additional characters appearing in the display when the default baseString field 204 is displayed.

The three-field text class 202 of the present invention also provides another avenue for supporting alternative sort orders for different cultures. A group of IString objects 202 may be sorted by the Unicode value in the baseString field 204. However, since ideographs having multiple meanings and/or pronunciations may not be sorted in a culturally correct order without knowledge of the associated pronunciation, sorting IString objects 202 may be based on the Unicode characters within the sortString field 206. While the Unicode character stored in the baseString field 204 of an IString class object 202 may provide no information as to the correct pronunciation, the characters within the sortString field 206 will provide culturally correct sort order information for the IString class object 202.
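The idea of comparing on one field while displaying another can be sketched as follows. The class SortFieldSketch, its Item type, and all sample names other than "Frank Moss" are hypothetical; the Kanji and hiragana pairs are illustrative Japanese surnames chosen for this sketch. The comparison is plain String ordering by UTF-16 code unit, which a real implementation might replace with a locale-aware collator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortFieldSketch {
    // Reduced to the two fields that matter for this illustration.
    static final class Item {
        final String baseString; // what is displayed
        final String sortString; // what is compared
        Item(String baseString, String sortString) {
            this.baseString = baseString;
            this.sortString = sortString;
        }
    }

    // Sort by sortString (code-unit order), then return what the user would see.
    static List<String> displayOrder(List<Item> items) {
        List<Item> sorted = new ArrayList<>(items);
        sorted.sort(Comparator.comparing((Item i) -> i.sortString));
        List<String> shown = new ArrayList<>();
        for (Item i : sorted) {
            shown.add(i.baseString);
        }
        return shown;
    }

    // Artificial promotion: the "AAA_" prefix lives only in sortString.
    static List<String> promotionDemo() {
        return displayOrder(List.of(
            new Item("Alice Adams", "Alice Adams"),
            new Item("Frank Moss", "AAA_Frank Moss"),
            new Item("Bob Baker", "Bob Baker")));
    }

    // Kanji display strings ordered by their hiragana readings.
    static List<String> kanaDemo() {
        return displayOrder(List.of(
            new Item("林", "はやし"),
            new Item("青木", "あおき"),
            new Item("田中", "たなか"),
            new Item("小林", "こばやし")));
    }

    public static void main(String[] args) {
        System.out.println(promotionDemo());
        System.out.println(kanaDemo());
    }
}
```

In the first demo "Frank Moss" is listed first although the prefix never appears in the output; in the second, the Kanji names come out in the order of their readings (aoki, kobayashi, tanaka, hayashi) rather than in code-point order of the Kanji.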
IString objects, therefore, may be sorted by employing the sortString field 206 as the sort key rather than the baseString field 204. This allows, for example, Japanese to be sorted in a culturally correct order despite the Unicode ordering of the Kanji character set and despite the fact that a particular ideographic symbol may have several different pronunciations and/or meanings. Since the hiragana or katakana representation of the word is captured by IME 224 in the sortString field 206, IString objects 202 may be sorted by sortString 206, or first sorted by baseString 204 and, for subgroups of multiple objects having identical characters in the baseString field 204, by sortString 206 within such subgroups. The former approach would be preferable for Japanese, since the Unicode ordering is culturally incorrect. The latter approach may be preferable in other circumstances.

A high level flowchart for a process of sorting three-field text class objects in accordance with the present invention is illustrated in FIG. 3. The process begins at step 302, which depicts a sort of IString class objects being initiated. The process then passes to step 304, which illustrates a determination of whether a sort key (baseString 204, sortString 206, or altString 208) has been specified. If so, the process proceeds to step 306, which depicts sorting the subject IString objects utilizing the specified sort key. The process then passes to step 316, which illustrates the process becoming idle until another sort of IString objects is initiated.

Referring again to step 304, if no sort key is specified, the process proceeds instead to step 308, which depicts checking the language and locale properties of the system in which the sorting is being performed. The process next passes to step 310, which illustrates a determination of whether alternate key sorting is employed for the language or locale specified.
If not, the process proceeds to step 312, which depicts sorting the subject IString class objects by the default sort key for languages or locales which do not employ an alternate sort key, which is baseString 204 in the exemplary embodiment. The process then passes to step 316. If the language or locale specified by the language and locale properties employs an alternate sort key, the process proceeds from step 310 to step 314, which illustrates sorting the subject IString class objects utilizing the alternate sort key, which would typically be sortString 206. Alternatively, the sorting mechanism may sort first by a default sort key, such as baseString 204, and then perform a secondary sort within objects having the same contents within baseString 204 by the alternate sort key, such as sortString 206. The process then passes to step 316.

It should be noted that employing sortString 206 for sorting purposes does not require the subject IString objects to be displayed utilizing sortString 206. The objects may be sorted utilizing the contents of one field, but represented in the display by the contents of a different field. When integrated with the language and locale properties, this permits IString objects containing strings in languages such as Japanese to be automatically sorted in a culturally correct order, despite the order of the Unicode characters. This also permits artificially promoted IString objects to be displayed without displaying the mechanism by which the sort order was changed.
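The key selection of steps 304 through 314, together with the two-level variant, might be sketched as below. Everything named here is hypothetical: the class KeySelectionSketch, the SortKey enum, and in particular the ALTERNATE_KEY_LANGUAGES set, which merely assumes that Japanese ("ja") is configured for alternate-key sorting. The sample reuses one Kanji with two assumed readings to show the secondary sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

class KeySelectionSketch {
    enum SortKey { BASE, SORT, ALT }

    static final class Item {
        final String baseString, sortString, altString;
        Item(String base, String sort, String alt) {
            this.baseString = base;
            this.sortString = sort;
            this.altString = alt;
        }
    }

    // Assumption: languages that use alternate-key sorting are configuration data.
    static final Set<String> ALTERNATE_KEY_LANGUAGES = Set.of("ja");

    // Steps 304-314: an explicit key wins; otherwise the system language decides.
    static SortKey chooseKey(SortKey specified, String systemLanguage) {
        if (specified != null) {
            return specified;                       // step 306
        }
        return ALTERNATE_KEY_LANGUAGES.contains(systemLanguage)
            ? SortKey.SORT                          // step 314
            : SortKey.BASE;                         // step 312
    }

    static Comparator<Item> comparatorFor(SortKey key) {
        switch (key) {
            case SORT: return Comparator.comparing((Item i) -> i.sortString);
            case ALT:  return Comparator.comparing((Item i) -> i.altString);
            default:   return Comparator.comparing((Item i) -> i.baseString);
        }
    }

    // Variant: primary sort on baseString, ties broken by sortString.
    static Comparator<Item> baseThenSort() {
        return Comparator.comparing((Item i) -> i.baseString)
                         .thenComparing((Item i) -> i.sortString);
    }

    // Returns altString values so that items sharing a baseString stay distinguishable.
    static List<String> altStringsInOrder(List<Item> items, Comparator<Item> order) {
        List<Item> sorted = new ArrayList<>(items);
        sorted.sort(order);
        List<String> out = new ArrayList<>();
        for (Item i : sorted) {
            out.add(i.altString);
        }
        return out;
    }

    static List<Item> sample() {
        return List.of(
            new Item("東", "ひがし", "higashi"),
            new Item("林", "はやし", "hayashi"),
            new Item("東", "あずま", "azuma"));
    }

    public static void main(String[] args) {
        SortKey key = chooseKey(null, "ja");
        System.out.println(key + " " + altStringsInOrder(sample(), comparatorFor(key)));
        System.out.println(altStringsInOrder(sample(), baseThenSort()));
    }
}
```

Passing a non-null key to chooseKey corresponds to the branch through step 306, so a caller can still force, say, altString ordering on a Japanese system.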
It is important to note that while the present invention has been described in the context of a fully functional data processing system and/or network, those skilled in the art will appreciate that the mechanism of the present invention is capable of being distributed in the form of a computer usable medium of instructions in a variety of forms, and that the present invention applies equally regardless of the particular type of signal bearing medium used to actually carry out the distribution. Examples of computer usable mediums include: nonvolatile, hard-coded type mediums such as read only memories (ROMs) or electrically erasable programmable read only memories (EEPROMs), recordable type mediums such as floppy disks, hard disk drives and CD-ROMs, and transmission type mediums such as digital and analog communication links.

While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.
