Electronic digital system and method for reproducing languages using the Arabic-Farsi script3938099Abstract A system for mechanically reproducing language characters in a cursive form in accordance with the natural style calligraphy of the language. Written letters are characterized by "links" with preceding and following characters, and mathematical rules describe the cursive script in terms of the form each letter takes dependent upon the preceding and following characters. The system includes input means for inserting characters, one at a time, and for providing coded representations of the characters. The coded representations are fed to decoder means which has as an output a selected combination of concatenation properties applicable to the character. Analyzer means analyzes variables dependent on the concatentation properties of a successive string of characters which comprise a character under consideration, a preceding character and a following character. The analyzer means then provides a further coded representation of a particular concatenation property applicable to the character under consideration when the character under consideration is preceded by the preceding character and followed by the following character. The coded representation and the further coded representation are combined in a combining means to provide a composite coded representation containing information relative to a character and to its applicable concatenation properties. Means are provided for converting the composite code to a code suitable for driving output means. Claims I claim: Description BACKGROUND OF THE INVENTION
A.sub.ij = 1 .omega..sub.i,j.sub.' exists
= 0 .omega..sub.i,j.sub.' does not exist.
The availability matrix is implemented in a Read Only Memory, and plays an important role in the hardware design as will be described later with reference to a script processor design. It should be noted that Urdu is written from right to left. Consider the concatenation properties of an Urdu character .omega..sub.i. Let A, B and C be three Boolean variables which describe the following concatenation properties. i. A = o symbol concatenates on both sides. A = 1 symbol does not concatenate on at least one side. It is isolated or initial or terminal. ii. B = o links down to the left B = 1 links up to the left iii. C = o links down from the right 1 links up from the right The properties are summarized in Table I which follows. 8
Table 1
______________________________________
Link Table
A B C Min-term Comment
______________________________________
0 0 0 P.sub.0 Links down L
Links down R
Concatenates in both directions.
0 0 1 P.sub.1 Links down L
Links up R
Concatenates in both directions.
0 1 0 P.sub.2 Links up L
Links down R
Concatenates in both direction.
0 1 1 P.sub.3 Links up L
Links up R
Concatenates in both directions
1 0 0 P.sub.4 Links down R
Terminates on L.
1 0 1 P.sub.5 Links up R
Terminates on L.
1 1 0 P.sub.6 Links up or down at L.
Initial. No links on R.
1 1 1 P.sub.7 Does not links on L or R
Isolated symbol.
______________________________________
We assign to j in .omega..sub.ij the suffix of the corresponding Min-term The English characters A, B, D, J, for example will have the following associated graphic shapes and names in the Urdu writing system.
Table 2
__________________________________________________________________________
Shapes of symbols A, B, D & J
Letter P-term / .omega..sub.ij / graphic shape
__________________________________________________________________________
English
Urdu P.sub.0
P.sub.1
P.sub.2
P.sub.3
P.sub.4
P.sub.5
P.sub.6
P.sub.7
__________________________________________________________________________
A .omega..sub.A
-- -- -- -- .omega..sub.A5
.omega..sub.A6
.omega..sub.A7
B .omega..sub.B
-- .omega..sub.B1
-- .omega..sub.B3
-- .omega..sub.B5
.omega..sub.B6
.omega..sub.B7
D .omega..sub.D
-- -- -- -- -- .omega..sub.D5
.omega..sub.D6
.omega..sub.D7
J .omega..sub.J
-- -- .omega..sub.J2
-- .omega..sub.J4
-- .omega..sub.J6
.omega..sub.J7
__________________________________________________________________________
The domains for graphic shapes .omega..sub.Ci in Urdu for the English character C.sub.i are: .omega..sub.A = {.omega..sub.A5, .omega..sub.A6, .omega..sub. .omega..sub.B = {.omega..sub.B1, .omega..sub.B3, .omega..sub.B5, .omega..sub.B6, .omega..sub.B7 } .omega..sub.D = {.omega..sub.D5, .omega..sub.D6, .omega..sub.D7 } .omega..sub.J = {.omega..sub.J2, .omega..sub.J4, .omega..sub.J6, .omega..sub.J7} The first two rows of the availability matrix A.sub.ij would then be 0 0 0 0 0 1 1 1 A.sub.ij = .vertline.0 1 0 1 0 1 1 1 .vertline. As mentioned earlier, the set of the total alphabet V.sub.A is partitioned into four groups such that the characters having the same architectural characteristics in their Urdu form and similar concatenation properties constitute the same class of the partition. V.sub.A = {V.sub.S, V.sub.U, V.sub.D, V.sub.X } For the purpose of illustration, let V.sub.E = {V.sub.S ', V.sub.U ', V.sub.D' } where V.sub.S ' V.sub.S, V.sub.U ' V.sub.U and V.sub.D ' V.sub.D. V.sub.s' the characters in this partition V.sub.S '={.omega..sub.A, .omega..sub.R, .omega..sub.D, .omega..sub.O } have the property that they do not concatenate with the successor. V.sub.d' the right link (connecting with the precedecessor) of the characters points downwards. For example characters of the type .omega..sub.i0, .omega..sub.i2 and .omega..sub.i4 would be included in this partition. V.sub.u' the right link of the characters points upwards. Urdu graphics or the type .omega..sub.i1, .omega..sub.i3, and .omega..sub.i5 would be included in this partition. V.sub.x This partition which includes numerals etc... has been described earlier. It is assumed that the four partitions do not contain any common elements. In the current design V.sub.S ' ={.omega..sub.A, .omega..sub.R, .omega..sub.D, .omega..sub.o } V.sub.D ' ={.omega..sub.H, .omega..sub.J, .omega..sub.M } V.sub.U ' ={V.sub.E ' - V.sub.U ' - V.sub.s '} As stated earlier the choice of characters in a partition is based on the applicant's understanding of the script. It could vary depending on the language, the country and the user. The following description relates to the details of a transformational grammar, which accepts characters in their input sequence and performs a forward scan for the analysis. For the sake of completeness some basic definitions are reviewed. A grammar G = (V.sub.T, V.sub.N, P, .sigma.) is a 4-tuple that consists of V.sub.T a terminal vocabulary V.sub.N a non-terminal vocabulary P a set of production rules .sigma. a sentence symbol which is member of V.sub.N. If each production is of the form .phi. .xi. .psi. .fwdarw. .phi. .omega. .psi. where .phi. and .psi. are in (V.sub.T U V.sub.N)* and .omega. is in (V.sub.T U V.sub.N) - {.epsilon.}, where {.epsilon.} is the empty word, then the grammer G is called context sensitive. It should be noted that .phi. and .psi. may be null, and .omega. may not be empty. Specifically V.sub.N = V.sub.A U .theta., and V.sub.T = {.omega..sub.ij .vertline. i .epsilon. {1...., 35}, a.sub.ij .noteq.0} U {.music-sharp.} U {V.sub.X } } is the set of terminal Urdu character graphics augments by the delimiter .music-sharp., and the set V.sub.x. It is recalled that the symbols in V.sub.x are printed without modification. The grammar described below transforms words written in Urdu characters, i.e. strings over V.sub.O * , into words written in well-formed Urdu script graphics, i.e. strings over V.sub.T * . It is assumed that a sufficient number of production rules of the form .sigma..fwdarw..BECAUSE. .alpha. .music-sharp. exists, where .alpha. is a word writen with Urdu characters (.alpha. .epsilon. V.sub.o *). These rules generate the language, e.g. Arabic or Farsi, and are different for each language. They are of no concern to the theory of the invention. The rules which transform the word of a language to its written form are context sensitive, and are given below as:
R0: This is a large set of production rules of the form
.sigma..fwdarw.# S.sub.1, ... S.sub.n #, where S.sub.1, ...,
S.sub.n .epsilon. V.sub.0 and S.sub.1, ... S.sub.n
is the pseudo-English representation of an Urdu word.
R1: S.sub.i S.sub.j .fwdarw..omega..sub.i7 S.sub.j for S.sub.i, S.sub.j
.epsilon. V.sub.x U #
R2: S.sub.i C.sub.j .fwdarw..omega..sub.i7 C.sub.j for S.sub.i
.epsilon. {V.sub.x U #} and C.sub.j .epsilon. V.sub.0
R3: .omega..sub.kl C.sub.i C.sub.j .fwdarw..omega..sub.kl
.omega..sub.i7 C.sub.j for C.sub.i .epsilon. V.sub.S
and l .epsilon. {4, 5, 7}
R4: .omega..sub.kl C.sub.i C.sub.j .fwdarw..omega..sub.kl
.omega..sub.i6 C.sub.j for C.sub.j .epsilon. V.sub.D U V.sub.U
UV.sub.s
and l .epsilon. {4, 5, 7}
These rules formally express the tradition of writing the Urdu language. This is a new idea, and forms an important and integral part of the hardware design of the present invention. The theory and logical design of the machine which performs the syntactic transformation described previously are given below. It is well known that a context sensitive language is accepted by a linear bounded automaton. However, in this case, while the grammar is context sensitive, the requirement is to find a transducer that would both accept and transform. It appeared reasonable to find a finite state deterministic automaton. The production rules of the grammar of script generation may be re-stated as under: The string (actually written from right to left in Urdu) .omega..sub.kl C.sub.i C.sub.j and its concatenation characteristics are expressed in terms of four new Boolean variables E.sub.d, E.sub.g, R.sub.i, and R.sub.j. They are described below: E.sub.d The character C.sub.k that had been previously transformed to .omega..sub.kl is replaced by E.sub.d, such that
0, if l .epsilon. {4, 5, 7}, and
E.sub.d =
1 otherwise
E.sub.g It describes the contatenation characteristics of the two characters C.sub.i (undergoing analysis) and C.sub.j (last input), as follows:
0 if C.sub.i .epsilon. V.sub.S U V.sub.x or C.sub.j .epsilon.
V.sub.x, and
E.sub.g =
1 otherwise
R.sub.i and R.sub.j These Boolean variables, R.sub.i and R.sub.j, describe the right link properties of the characters C.sub.i and C.sub.j respectively.
0 right link down
R.sub.i, R.sub.j =
1 right link up
Next, the new output Boolean variables S.sub.0, S.sub.1, S.sub.2 are defined, which help in code translation from the input variables E.sub.g, E.sub.d, R.sub.i and R.sub.j. The following table may be easily constructed from the production rules described earlier.
Table 3.
______________________________________
Code translation Table
R.sub.j
R.sub.i
E.sub.g
E.sub.d
S.sub.0
S.sub.1
S.sub.2
Output Rule
______________________________________
-- -- 0 0 1 1 1 7 3,13
-- 0 0 1 1 0 0 4 11
-- 1 0 1 1 0 1 5 12
-- 0 0 1 1 0 0 4 6
-- 1 0 1 1 0 1 5 5
-- -- 1 0 1 1 0 6 4
0 0 1 1 0 0 0 0 9
0 1 1 1 0 0 1 1 10
1 0 1 1 0 1 0 2 8
1 1 1 1 0 1 1 3 7
______________________________________
By simplification the Boolean variables S.sub.0, S.sub.1, S.sub.2 may be obtained in terms of the variables E.sub.g, E.sub.d, R.sub.i, and R.sub.j as follows: S.sub.0 = E.sub.g + E.sub.d (1) S.sub.1 = E.sub.g .sup.. E.sub.d .sup.. R.sub.j + E.sub.d (2) and S.sub.2 = E.sub.g .sup.. E.sub.d + E.sub.d .sup.. R.sub.i (3) The above represents a code translation scheme .tau.: {0,1}.sup.m {0,1}.sup.n, m.gtoreq.n where m, n are the dimensions of the Boolean spaces (4 and 3 in this case) of the input and output respectively. Thus, the variables S.sub.0, S.sub.1, S.sub.2 give the representation of the form of the Urdu graphic .omega..sub.im corresponding to the character C.sub.i in the string C.sub.k C.sub.i C.sub.j, in terms of the concatenation and linking properties of the characters in the string. The operation will now be described. The analysis of the character string is performed in a uniform manner, no distinction being made between characters in different partitions of V.sub.A, i.e. V.sub.U, V.sub.D, V.sub.S and V.sub.X. The output follows the input with a one symbol delay. This mode of operation results in a simple design, by minimizing the problems of synchronization, timing and control. In a communication system where two Teletype like devices are linked to each other, the method proposed here eliminates the impression of erratic functioning on the user, who anticipates and receives a continuous message, not being aware of the delay. To the sender, inspite of the one symbol delay, this method with the feature of continuous output is equally attractive. For the purpose of illustration let us recall the process of analysing the string .omega..sub.kl C.sub.i C.sub.j. It is noted that the previous symbol C.sub.k had been analysed as the Urdu graphic .omega..sub.kl, C.sub.i is the symbol under analysis, and C.sub.j is the last symbol received. The overall design of the script processor shown in the drawing will now be described with reference to the processing of the string .omega..sub.kl C.sub.i C.sub.j. As mentioned earlier, the theory described forms the basis of the hardware design of the present invention. A preferred form of the hardware design is shown with regard to the drawings. Referring to FIG. 1 of the drawings, 1 is a keyboard having alphanumeric characters on the keys. The keyboard provides, at its output, an eight bit code representative of the character of a key which is depressed. Such keyboards are well known in the art, and, as is well known, the eight bit binary code is a standardized code for use in such keyboards. The keyboard could comprise, for example, the keyboard of a KSR.33 Teletype system. The output of the keyboard is fed, in parallel, to eight bit register 2. The eight bit register can comprise a series of eight flip-flops or any other similar means well known in the art. The output of the eight bit register 2 is fed, again in parallel form, to decoder 3. The decoder is of the well known type which receives a coded binary input and provides an output at only one of a plurality of outputs depending on the code at the input. A memory decoder, for example a Texas Instrument SN74154, which receives a 4 bit input and provides an output at any one of 16 outputs, can be used to fabricate the decoder 3. In one embodiment of the invention, 35 output lines are required. Thus, it would be necessary to use four SN74154's to make a decoder to be used in this embodiment. (It will, of course, be appreciated that such an arrangement will provide 256 outputs. Only 35 are used). The output of the decoder is fed to a Read Only Memory (ROM) 5. The ROM is a well known matrix and can consist of, for example, a plurality of diodes connected across the input and output as shown in the drawings. It is of course understood that only a small number of the total number of diodes are shown in the drawings. However, the ROM does not have to constitute this particular type of matrix and any other matrix which will serve the function can serve in its place. The input to the ROM consists of a plurality of leads corresponding in number at least to the plurality of leads at the ouput of the decoder. Each lead at the output of the decoder is connected to a separate lead at the input to the ROM. The output of the ROM is eight leads which provides an eight bit code in binary form. The ROM is the physical implementation of the availability matrix discussed above. As will be appreciated, the availability matrix will be different for different scripts or for different interpretations of the same script. However, in accordance with the inventive system, any one of these scripts or different interpretations of scripts can be implemented by the mere substitution of an ROM containing the appropriate availability matrix. The output of the ROM is fed to availability register 6 which again comprises an eight bit register. Status register 11, which will be more fully discussed below, receives inputs from both the availability register 6 and the decoder 3 as will be more fully discussed below. The status register, in turn, provides outputs to the analyzer module 7 which is described in more detail with regard to the description concerning FIG. 2 of the drawings. The output of the eight bit register 2 is fed, in a parallel path, to eight bit register 8. Outputs from the register 8 and from the analyzer module 7 are fed to an 11 bit register 10 which contains the 8 bit of a character from register 8, and a 3 bit code of a particular shape, i.e., one of the eight of Table 1, as received from the analyzer module 7. The 11 bit code is decoded by a decoder 13 to drive the printer 12. The decoder 13 can comprise a series of logic circuits, including AND gates, OR gates, shift registers etc., which will convert the 11 bit code to, for example, an eight bit code to drive the printer. The printer 12 is a standard printer which is driven by an eight bit binary signal and is well known in the art and could comprise for example, a printer of the Teletype system discussed above. Decoder 3 also provides an output to the input of control unit 9 whose output is fed both to the eight bit register 8 and the analyzer module 7. As will be seen, the ouput of the control unit 9 is fed to the clock terminals comprising the units 7 and 8 to advance these units without an analysis by the analyzer module. Synchronizer 4 provides a clock signal to the clocked units of the system in synchronism with the operation of the keyboard to thereby synchronize the entire system with the keyboard. The function of the analyzer module is to implement the Boolean equations 1, 2 and 3 disclosed above. Boolean equations are of course, most easily implemented with a series of logic elements. A form of the analyzer module is shown in FIG. 2 of the drawings. Referring to FIG. 2, output from the availability register 6 is fed to OR gate 21. The output of OR gate 21 is fed to flip-flop 23 and to AND gate 30. Equation (1) is implemented by OR gate 25 which receives its input from the NOT terminals of state register 11. Equation (2) is implemented with the combination of AND gate 27 and OR gate 29. AND gate 27 is fed from the terminals of state register 11 as well as from the output of flip-flop 23. The input to OR gate 29 comprises the output of AND gate 27 as well as one of the NOT terminals from state register 11. Equation (3) is implemented with the combination of AND gate 30, AND gate 31 and OR gate 33. The inputs to these gates and their interconnection is easily seen in the drawings. The operation of the entire logic circuitry comprising the analyzer module is self-evident and requires no further description here. Details of the state register 11 are shown in FIG. 3. As can be seen from the description of the variable E.sub.g, the Boolean equation for determining E.sub.g and E.sub.g is as shown in FIG. 3. The state register consists of the OR gate 41 which receives input V.sub.xj V.sub.sj from the decoder 3 as described with relation to FIG. 1. According to the terminology developed above, V.sub.x is a character in the partition including numerals etc. As can be seen in FIG. 1, when decoder 3 decodes such a character, it provides an output on a selected one of its output leads. As C.sub.j refers to the character following the character C.sub.i under consideration, V.sub.xj is the signal at the selected output of 3 when C.sub.j is in the partition V.sub.x. C.sub.j becomes C.sub.i when a further character (following C.sub.j) is keyed in. At the onset, V.sub.xj + V.sub.sj is stored in flip-flop 43. When the further character is keyed in, 43 is clocked and its output is V.sub.xi + V.sub.si. In a like manner V.sub.sj is a selected output on decoder 3 when the input is a character of the partition V.sub.s. The output of OR gate 41 is stored in flip-flop 43 to provide a time delay so that it is fed to the analzyer module when the next character is being considered. The V.sub.xj input is also fed, through inverter 42, to one terminal of AND gate 47. The other input to AND gate 47 is fed from the NOT terminal of flip-flop 43. The E.sub.d value is obtained from the combination of OR gate 49 and flip-flops 51 and 53. The OR gate is fed from the availability register 6, and flip-flops 51 and 53 merely provide the required time delay for anlysis. In operation, the system operates as follows: When a key on the keyboard 1 is depressed, the keyboard will provide an eight bit code word representative of that character. As will be appreciated, each of the characters will be represented by a different code word. The code word is stored in the register 2 until the next key is depressed. When the next key is depressed, it will energize the synchronizer to clock the register 2 so that the code representative of the first character will be passed on to both the decoder 3 and the register 8. The character is then decoded in the decoder and the next step in the process will depend on which of the four partitions the character falls into. Should the character in the decoder fall into the partition V.sub.s or V.sub.x, then the decoder 3 will provide an output to the control unit 9 which will then clock the register 8 to move the eight bit word down to the register 10 and thence to decoder 13 where it will be decoded to an eight bit printing code for printing that character. At the same time, the control unit 9 will provide a signal to the analyzer module 7 so that the analyzer module will not perform an analysis. When the character falls within the partitions V.sub.d or V.sub.u, then the decoder will provide an output on only one of its 35 output lines. As will be appreciated, each one of the output lines is associated with a different character. The signal on the decoder output line will be applied to its appropriate input of the ROM 5 and then passed to the 8 bit register 6 and, subsequently, to both the status register 11 and the analyzer module 7. As will be appreciated, a character inserted via the keyboard 1 will not be printed on the printer until the next character has been inserted via the keyboard 1. After the next character has been inserted, the analyzer module will perform an analysis of the character under consideration, the character preceding the character under consideration, and the character following the character under consideration, to solve the equations (1), (2) and (3) to thereby provide values for S.sub.0, S.sub.1 and S.sub.2. These values are provided to the register 10 so that the register will receive an eleven bit word which fully describes both the appropriate shape of a character and its linking characteristics taking into consideration the preceding and succeeding characters. The variables S.sub.0, S.sub.1 and S.sub.2 determine the concatenation properties of the character under consideration in accordance with Table 1. Thus, if S.sub.0, S.sub.1, S.sub.2 is 011, then the concatenation properties of the character will be that it links up to the left as links up from the right as per P.sub.3 of the table. For the purpose of testing the processor shown in the drawing, the Teletype output was modified to simulate Urdu writing with appropriate linkages. In this representation markers are printed around each character, i.e. before and after, to indicate its linkages if they exist. The method is shown below:
link up forward (right in English, left in Urdu).
link down forward (right in English, left in Urdu).
link up backward
link down backward
initial
Independent surrounded by blanks
Terminal down, up backward.
As an example, let us consider the word JOAB, which means "answer" in the Farsi language, and is printed on line 2 of Table 4. The analysis follows as under.
______________________________________
Rule
.sigma. #JOAB#
R O
______________________________________
The string .music-sharp.w.sub.J6 w.sub.O5 w.sub.A7 w.sub.B7 .music-sharp. is printed on the Teletype as J O A B. In addition to the above example, other words are printed by the processor in pseudo-Urdu showing their correct linkage and are shown in Table 4, which is the actual output produced by the system on a KSR.33 Teletype.
Table 4
______________________________________
PSUEDO-URDU OUTPUT PRODUCED BY THE PROCESSOR
______________________________________
G!'O R A
J!'O A B
B!'O L
B!'R B!'G''E
A G!'A
J!'A N
A B!'A
G!'A N
B!'B''A
K!'O F!'B''A
K!'E''A R E
A M!'E
K!'E''A R
A D R
D A R
R D A
F!'D A
F!'A D
J!'O C
A M!'D B!'D
______________________________________
|
Same subclass Same class Consider this |
||||||||||
