Language parsing device and method for same5649215Abstract A method for parsing of a language defined by context free grammar includes the steps of extracting groups corresponding to words of a sentence in an order of word arrangement of the sentence, each of the groups being formed by a terminal symbol, a start-position number, and an end-position number of a corresponding one of the words; creating a state list by using the groups, the context free grammar, and an LR table derived from the context free grammar, the state list corresponding to position numbers indicating positions of the words and including state numbers indicating state of the parsing; and carrying out the parsing while creating the state list. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
TABLE 1
______________________________________
Word Terminal Symbol
______________________________________
bunka N
ga P
kita N and V
kara P
tsutawatta V
______________________________________
Also, in order to express positional relations between the terminal symbols, position numbers are assigned as identifications to each space between the words as shown in TABLE 2.
TABLE 2
______________________________________
Bunka ga kita kara tsutawatta.
______________________________________
position 1 2 3 4 5 6
number
______________________________________
Using those positioning numbers, the input unit 1 of FIG. 1 stores information shown in FIG. 2. In FIG. 2, a start-position number and an end-position number are two position numbers assigned to a space preceding a word corresponding to a terminal symbol and to a space following that word, respectively. Terminal symbols are stored in an ascending order of numbers. As shown in FIG. 2 and TABLE 2, a start-position number for the first terminal symbol N is a positioning number which is assigned to the space in the left of (preceding) the corresponding word `bunka`, i.e., the start-position number is 1. Here, a symbol `$` is a terminal symbol indicating the end of the sentence, and is added to the end. The grammar J is stored in the grammar unit 3 of FIG. 1 as shown in FIG. 3. As shown in FIG. 3, the grammar J is stored such that the left-hand side and the right-hand side are stored separately along with the grammar numbers serving as a grammar identification. An LR table derived from the grammar J is stored in the operation table unit 4 of FIG. 1 as shown in FIG. 4. Contents of the LR table is the same as those of a conventional LR table. Namely, the LR table stores actions to be taken in response to a resulting state of the parsing. Here, states are identified by state numbers. There are three types of actions to be taken, i.e., a state transition, an application of grammar, and an acceptance. A state transition from a state indicated by a state number n to a state indicated by a state number m via a given terminal symbol is indicated by `sh m` stored in the ACTION part in a row of the state number n and in a column of the given terminal symbol number. A state transition via a given non-terminal symbol from a state indicated by a state number n to a state indicated by a state number m is indicated by `m` stored in the GOTO part in a row of the state number n and in a column of that given non-terminal symbol number. Applying a grammar rule indicated by a grammar number g to a next terminal symbol at a current state number n is indicated by `re g` stored in the ACTION part in a row of the state number n and in a column of the next terminal symbol number. In the case of accepting the sentence when a current state is n and a next terminal symbol is $, `acceptance` is stored in the ACTION part in a row of the state number n and in a column of $. Schematics of an operation of the parsing unit 6 in FIG. 1 will be described below. Terminal symbols stored in the input unit 1 are taken out one by one from the top, and are subject to operations shown in FIG. 5 to FIGS. 7A and 7B. Results of those operations are stored in the chart unit 2 in a manner shown in FIG. 8 and stored in the state list unit 5 in a manner shown in FIG. 9. The fact that the sentence is accepted in the end means that the sentence is in accordance with the grammar. In the chart unit 2 are stored all phrase structures which are evaluated in accordance with the grammar. FIG. 5 to FIGS. 7A and 7B show flow charts for explaining operations of the parsing unit 6. In the following, each flow chart will be explained step by step. FIG. 5 is a flow chart of a procedure PROC1. At a step S1, a start state number is stored at a position number 1 indicating the beginning of the sentence as shown in TABLE 3. The start state number which is created at the time of creating an LR table is set to 0 in this case.
TABLE 3
______________________________________
bunka ga kita kara tsutawatta.
______________________________________
Input Unit
position 1 2 3 4 5 6 7
number
terminal N P N P V $
symbol V
State List Unit
position 1
number
state 0
number
______________________________________
At a step S2, groups each formed by a terminal symbol and its start and end-position numbers are taken out from the input unit 1 in an ascending order of the start-position numbers. At a step S3, a procedure PROC2 is carried out. At a step S4, a check is made whether all the groups have been processed. If they have, this is the end of the procedure PROC1. If all the groups are not processed, the procedure goes back to the step S2. FIG. 6 is a flow chart of the procedure PROC2. The first group taken out from the input unit 1 at the step S2 of the procedure PROC1 is (N, 1, Thus, PROC2(N, 1, 2) is carried out first. At a step S5, a check is made whether the group (N, 1, 2) is stored in the chart unit 2 as a phase structure. If it is, this is the end of the procedure PROC2 at a step S6. If the group (N, 1, 2) is not stored in the chart unit 2 (this is surely the case in this example since this group is the first group), a procedure PROC3(N, 1, 2) is carried out at a step S7. Also, at the step S7, a check is made whether the procedure PROC3 is successfully finished. If it is not, the procedure PROC2 is finished in failure at a step S9. If the procedure PROC3 succeeds, the procedure PROC2 proceeds to a step S8. At the step S8, the group being processed is stored in the chart unit 2. At a step S10, the procedure is completed in success. FIGS. 7A and 7B are a flow chart of the procedure PROC3. Here, the first group (N, 1, 2) is processed first. At a step S11, a state list of a position number 1 is taken out. As shown in TABLE 3, the state number of the position number 1 is 0, which is thus taken out. Then, a state transition for the terminal symbol N and the state number 0 is looked up in the operation table of FIG. 4. In this case, the state transition is indicated by `sh 4` so that a transition destination is a state number 4. Thus, 4 is added to a list L, resulting in L=[4] At a step S12, a flag 0 is set in a variable Ret. At a step S13, a state transition from the state number 4 of the list L via a terminal symbol P having a start-position number 2 is looked up in the operation table of FIG. 4. In this case, the state transition is possible as indicated by `sh 7`, so that the state number 4 is stored in the state list unit 5 at a position number 2. Also, a flag 1 is set in a variable Ret. At a step S14, a grammar number applicable to the next terminal symbol P having a start-position number 2 when the state number of the list L is 4 is looked up in the operation table of FIG. 4. In this case, there is no applicable grammar. Thus, a RL list becomes a null list [ ]. At a step S15, no action is taken since the list RL is null. At a step S16, a check is made whether the variable Ret is 1. In this case, it is 1, so that the procedure PROC3(N, 1, 2) is successfully finished at a step S17. This means returning to the procedure PROC2. At this point, the state list unit 5 is as shown in TABLE 4.
TABLE 4
______________________________________
Bunka ga kita kara tsutawatta.
______________________________________
Input Unit
position 1 2 3 4 5 6 7
number
terminal N P N P V $
symbol V
State List Unit
position 1 2
number
state 0 4
number
______________________________________
With reference to FIG. 6 again, since the procedure PROC3(N, 1, 2) turns out to be successful at the step S7, the group (N, 1, 2) is stored in the chart unit 2 at the step S8. At the step S10, the procedure PROC2(N, 1, 2) is successfully finished. At this point, the state list unit 5 and the chart unit 2 are as shown in TABLE 5.
TABLE 5
______________________________________
Bunka ga kita kara tsutawatta.
______________________________________
Input Unit
position 1 2 3 4 5 6 7
number
terminal N P N P V $
symbol V
State List Unit
position 1 2
number
state 0 4
number
Chart Unit
(symbol, start, end)
(N, 1, 2)
______________________________________
With reference to FIG. 5 again, since it turns out at the step S4 that all the groups are not yet processed, the procedure PROC1 goes back to the step S2. At the step S2, the next group (P, 2, 3) is taken out. At the step S3, a procedure PROC2(P, 2, 3) is carried out. With reference to FIG. 6 again, a check is made at the step S5 whether the group (P, 2, 3) is already stored in the chart unit 2 as a phrase structure. Since it is not stored, a procedure PROC3(P, 2, 3) is carried out at the step S7. With reference to FIGS. 7A and 7B again, at the step S11, a state list of a position number 2 is taken out. As shown in TABLE 3, the state number of the position number 2 is 4, which is thus taken out. Then, a state transition for the terminal symbol P and the state number 4 is looked up in the operation table of FIG. 4. In this case, the state transition is indicated by `sh 7` so that a transition destination is a state number 7. Thus, 7 is added to a list L, resulting in L=[7] At a step S12, a flag 0 is set in a variable Ret. At a step S13, a state transition from the state number 7 of the list L via terminal symbols V and N having a start-position number 3 is looked up in the operation table of FIG. 4. In this case, there is no possible state transition for either V or N, so that no further action is taken. At a step S14, a grammar number applicable to the next terminal symbols V and N having a start-position number 3 when the state number of the list L is 7 is looked up in the operation table of FIG. 4. In this case, `re 3` is found as an applicable grammar for both V and N, thus resulting in the RL list being RL=[3]. At a step 15, a grammar rule (PP, [N, P]) which is indicated by the grammar number 3 of the RL list is taken out from the grammar unit 3. (Here, the left hand side in a grammar rule is indicated by Lh and the right hand side by Rh.) Then, a search is made in the phase structures stored in the chart unit 2 for Rh'=[N], which is a remaining part after removing the most right element P from Rh=[N, P]. The search is made from the position number 2 to the beginning of the sentence. As a result, [N] is found in the phase structure (N, 1, 2). It turns out that the found grammar symbol N has a start-position number 1. Thus, a procedure PROC2(PP, 1, 3) is carried out. With reference to FIG. 6 again, since it is checked at the step S5 that the group (PP, 1, 3) is not stored in the chart unit 2, a procedure PROC3(PP, 1, 3) is carried out at the step S7. With reference to FIGS. 7A and 7B again, at a step 11, a state list of a position number 1 is taken out. As shown in TABLE 5, the state number of the position number 1 is 0, which is thus taken out. Then, a state transition for the non-terminal symbol PP and the state number 0 is looked up in the operation table of FIG. 4. In this case, the state transition is indicated by `2` in the GOTO part, so that a transition destination is a state number 2. Thus, 2 is added to a list L, resulting in L=[2] At a step S12, a flag 0 is set in a variable Ret. At a step S13, a state transition from the state number 2 of the list L via a terminal symbol V and N having a start-position number 3 is looked up in the operation table of FIG. 4. In this case, the state transition is possible as indicated by `sh 3` and `sh 4` for V and N, respectively. Thus, the state number 2 is stored in the state list unit 5 at a position number 3. Also, a flag 1 is set in a variable Ret. At a step S14, a grammar number applicable to the next terminal symbols V and N having a start-position number 3 when the state number of the list L is 2 is looked up in the operation table of FIG. 4. In this case, there is no applicable grammar. Thus, a RL list becomes a null list [ ]. At a step 15, no action is taken since the list RL is null. At a step S16, a check is made whether the variable Ret is 1. In this case, it is 1, so that the procedure PROC3(PP, 1, 3) is successfully finished at a step S17. This means returning to the procedure PROC2(P, 2, 3). With reference back to FIG. 6, since the procedure PROC3(PP, 1, 3) is successfully finished, the group (PP, 1, 3) is stored in the chart unit 2 at the step S8. At the step S10, the procedure PROC2(PP, 1, 3) is completed in success. This means returning to the procedure PROC3(P, 2, 3). Referring to FIGS. 7A and 7B again, at the step S15, the procedure PROC2(PP, 1, 3) turns out to be successful so that the variable Ret is set to 1. At the step S16, it is checked that the variable Ret is 1. At the step S17, the procedure PROC3(P, 2, 3) is finished successfully. This means returning to the procedure PROC2(P, 2, 3). At this point, the state list unit 5 and the chart unit 2 are as shown in TABLE 6.
TABLE 6
______________________________________
Bunka ga kita kara tsutawatta.
______________________________________
Input Unit
position 1 2 3 4 5 6 7
number
terminal N P N P V $
symbol V
State List Unit
position 1 2 3
number
state 0 4 2
number
Chart Unit
(symbol, start, end)
(N, 1, 2)
(PP, 1, 3)
______________________________________
With reference to FIG. 6 again, since the procedure PROC3(P, 2, 3) turns out to be successful at the step S7, the group (P, 2, 3) is stored in the chart unit 2 at the step S8. At the step S10, the procedure PROC2(P, 2, 3) is successfully finished. At this point, the state list unit 5 and the chart unit 2 are as shown in TABLE 7.
TABLE 7
______________________________________
Bunka ga kita kara tsutawatta.
______________________________________
Input Unit
position 1 2 3 4 5 6 7
number
terminal N P N P V $
symbol V
State List Unit
position 1 2 3
number
state 0 4 2
number
Chart Unit
(symbol, start, end)
(N, 1, 2)
(PP, 1, 3)
(P, 2, 3)
______________________________________
For the remaining groups stored in the input unit 1, the same procedure is carried out. As a consequence, the state list unit 5 and the chart unit shown in TABLE 8 are obtained with the acceptance of the sentence.
TABLE 8
______________________________________
Bunka ga kita kara tsutawatta.
______________________________________
Input Unit
position 1 2 3 4 5 6 7
number
terminal N P N P V $
symbol V
State List Unit
position 1 2 3 4 5 6
number
state 0 4 2 6 2 1
number 4
1
Chart Unit
(symbol, start, end)
(N, 1, 2)
(PP, 1, 3)
(P, 2, 3)
(S, 3, 4)
(V, 3, 4)
(N, 3, 4)
(PP, 3, 5)
(P, 4, 5)
(S, 1, 4)
(S, 3, 4)
(PP, 1, 5)
(S, 1, 6)
(S, 3, 6)
(S, 5, 6)
______________________________________
As a final result, a total number of fourteen phrase structures are extracted, and stored in the chart unit 2 as shown in TABLE 8. FIG. 10 shows a block diagram of a second embodiment of a language parsing device according to the present invention. In the figure, the language parsing device includes a control device 10 such as a CPU (central processing unit), a RAM 11, and a ROM 12. The control device 10 carries out the syntactic parsing described above by using control programs stored in the ROM 12. The RAM 11 stores the same data as those of the first embodiment. Namely, the RAM 11 stores the groups each formed from a terminal symbol, a start position number, and an end position number. Also, the RAM 11 stores the context free grammar J, the state list and the constituents (phrase structures) as results of the parsing, and the LR table derived from the grammar J. Thus, the second embodiment can carry out the same syntactic parsing as that of the first embodiment. The above embodiments have been described by taking a Japanese sentence as an example. However, the present invention can be applied to other natural languages such as English and French and programming languages such as C. Although the above embodiments are concerned with a syntactic parsing, it is effective for such parsing as a morphological parsing or a semantic parsing. In short, the present invention can be applied to the various parsing of languages described by context free grammar. In the following, context free grammar used for a morphological parsing of the English language is shown as an example. S.fwdarw.WS PRD WS.fwdarw.WS DLM Word WS.fwdarw.Word Here, S represents a sentence, WS represents a word series, PRD represents a period (end-of-sentence), DLM represents a delimiter, and Word represents a word. The context free grammar shown above describes that an English sentence is a series of words separated by delimiters. In the above description, the present invention has been described by taking as an example its application to a syntactic parsing of a natural language. However, the present invention can be applied to the various parsing of languages defined by context free grammar. Although the above embodiments have been described with regard to an parsing made from the top of a sentence to the end, an parsing can be made from the end to the top of a sentence by modifying the LR table used in the above embodiments. As shown in the above description, the present invention carries out parsing by using the LR table, and, thus, is capable of a more efficient parsing than the Earley's algorithm or the chart parsing. Also, since one-dimensional lists of numbers are used, the mechanism of the present invention is simpler in comparison with use of a graph-structured stack, which is a multi-dimensional graph. Furthermore, an amount of memory volume required for the parsing is smaller than with the use of a graph-structured stack, because memory use in the present invention is only limited to one-dimensional lists of numbers. Further, the present invention is not limited to these embodiments, but various variations and modifications may be made without departing from the scope of the present invention.
|
Same subclass Same class Consider this |
||||||||||
