Knowledge-based molecular retrieval system and method using a hierarchy of molecular structures in the knowledge base5418944Abstract A molecular retrieval system and method answering to similarity queries for retrieving molecular structures, stored into a source database (22) and having a required similarity with an input structure, which can contain a set of property regions. A target database (23) stores the molecular structures described in hierarchical way and a knowledge base (24) stores well-known molecular fragments at different levels of description together with a set of physical and chemical properties associated to each fragment. A fragment recognizer (21) analyzes the fragments of the input structure to represent them in a hierarchical way. A query analyzer (25) analyzes the similarity queries and selects the appropriate level of molecular representation on the basis of the required similarity. Matching means (26, 27, 29), when called by the query analyzer (25), perform a matching of the representation of the input structure against the representations of the molecular structures stored into the target database (23) at the selected level of molecular representation. Claims We claim: Description The invention relates to the Computer Aided Molecular Design (CAMD) field and, more particularly, to those systems and methods that automatically allow the retrieval of molecules previously snored into a molecular database.
TABLE 1
______________________________________
*Hydrogen Bond
Yes (donor, acceptor)
No
*Nucleophilicity
Yes (+, -)
No
*Electrophilicity
Yes (+, -)
No
*Acidity
Yes
No
*Basicity
Yes
No
______________________________________
Structural matching module 26 and model matching module 27 are both used to test whether there are one or more molecules into target database 23 that satisfy the user query. Each of them is selected according to the query type entered by the user (see below). Graph matching module 29, finally, is a-graph matching subroutine used to match molecules described at any level of representation one onto the other. The operation of the subsystem 2b made of modules 24, 25, 26, 27, 28, 29 is here described. Whenever a query is entered into module 28, it is processed in order to reformulate it in a standard form (the query statement). Such a step is necessary insofar as the user is normally supposed to enter the query in a graphical mode (see, for instance, FIG. 4). Then, the query statement is presented to query analyzer 25 in order to activate the appropriate routines for satisfying the user query. Firstly, it analyzes the input query to understand the query type. Query analyzer 25, according to the input query, will select a number of subroutines implemented to deal with the given query. Module 25 has been designed to deal with four query types. The operation of the retrieval system is described in connection with each query type. 1. Query for molecules structurally similar to an input one. The similarity is established through the maximum number of atoms allowed to form the difference. The end user can also require in the query that the difference must be from 1 up to N connected regions anywhere located. For this query type, module 5 activates fragment recognizer module 21 to analyze the input molecule so as to represent it in the hierarchical way described above, in connection with FIG. 3. The RS-graph and the FG-graph are generated in order to be associated with the representation of the input molecule (the AT-graph). Then, query analyzer module 25 calls structural matching module 26 to match the hierarchical description of the input molecule against the ones of the molecules stored into target database 23. The analysis of the query carried out from query analyzer 25 reduces the complexity of the matching procedure. Let us suppose, for instance, that the user is interested in molecules different from the input one only because of one region. Then module 26, during the matching process, will discard all the molecules that have more than one residue different with respect to the input molecule whenever such residues do not make a connected region. In this way, a high-level matching accomplished at the residue level will be sufficient to discard a great number of molecules useless for answering to the query under consideration. Specifically, structural matching module 26 takes as input two molecules or compounds C1 and C2 both represented at the same hierarchical level previously described and produces as output an association, whenever it exists, among the nodes of C1 and C2. Table 4 details the functioning of the structural matching routine. FIG. 6 is a flow chart of the operation of the structural matching routine. In its operation, module 26 recursively calls graph matching module 29 in order to firstly match the RS-graphs, then the FG-graphs and finally the AT-graphs. Graph matching module 29 takes as input two graphs G1 and G2 in which each node is identified by a label, such as, for instance, H1, and a type such as, for instance, H and each arc is identified by a pair of node labels, such as, for instance, arc(H1, H2). In this formalism, H1 must be understood as an instance of H. The output is an association, whenever it exists, of the nodes of the same type in G1 and G2 and the set of different nodes (regions) in G1 and G2. Specifically the graph
TABLE 4
__________________________________________________________________________
The structural matching routine
__________________________________________________________________________
CALL.sub.-- GRAPH MATCHING (RS-graph of C1, RS-graph of C2).
RESULTS:
*a set of corresponding residues in C1 and C2
identified as of the same type;
*a set of corresponding different residues in C1
and C2, identified as of different types.
FOR.sub.-- EACH pair of different residues in C1 and C2:
a. reformulate both the residues in terms of FG-graphs
using the description of C1 and C2;
b. CALL GRAPH.sub.-- MATCHING (single difference in C1 and C2 in
terms of FG-graphs).
RESULTS:
*a set of corresponding functional groups in
these parts of C1 and C2 identified as of the
same type;
*a set of corresponding different functional
groups in these parts of C1 and C2, identified
as of different types.
c. FOR.sub.-- EACH pair of different functional groups C1 and C2:
1) reformulate both the functional groups in terms of
AT-graphs using the description of C1 and C2;
2) CALL GRAPH.sub.-- MATCHING (single difference in C1 and C2
in terms of AT-graphs).
RESULTS:
*a set of corresponding atom types in
these parts of C1 and C2;
*a set of corresponding different atom
types in these parts of C1 and C2.
__________________________________________________________________________
matching subroutine makes use of the notion of "arc" as the matching elementary unit where an arc of G1 can be matched onto an arc of G2 widen the corresponding terminal nodes are of the same type. Table 5 details the graph matching subroutine.
TABLE 5
__________________________________________________________________________
The graph matching subroutine
__________________________________________________________________________
1. DETERMINE the degree of each node. The degree of a node
X is the number of arcs containing X as a terminal node.
2. CONSIDER a node N1 in the graph G1 with the maximal degree.
3. FIND into the graph G2 a node N2 of the same type of N1 and
maximal degree. STORE in the matching.sub.-- list the pair
(N1, N2).
4. FOR.sub.-- EACH arc in G1 containing N1, FIND a corresponding arc
in G2 containing N2. IF the terminal nodes of the two arcs
in G1 and G2 are of the same type, THEN STORE the new
matched pair of nodes into the matching.sub.--list and delete the
arcs from G1 and G2.
5. FOR.sub.-- EACH new pair of matched nodes (N1, N2), recursively
FIND a matching for all the new arcs containing N1 and N2:
GOTO 4.
6. WHEN a connected region of arcs of G1 is maximally matched
onto G2 (no more matching arcs can be found in the step 4
and 5), THEN GOTO 2 (the graph G1 and G2 are now reduced by
the already matched arcs).
7. WHEN no more arcs in G1 and G2 are matchable, THEN FIND.sub.--ALL
the connected regions of unmatched arcs in G1 and G2 and
DETERMINE a matching between each unmatched region of G1
and each unmatched region of G2.
__________________________________________________________________________
FIG. 5 illustrates the output of a call to the graph matching subroutine. The fragments circled highlight the uncommon regions between the input molecule and a structurally similar one. An extension to the query type described above involves searching for pairs of molecules with minimal difference with respect to their structures and maximal difference with respect to any given property. Such a query is very useful when dealing with the Structure Activity Relationship problem in order to locate those fragments responsible for a major increase of the desired activity. The user is supposed to start his/her search from a subset of the molecules stored into target database 23 by using one of the query types here described. At this point, query analyzer module 25 calls structural matching module 26 to select all the pairs into such a subset that have the highest difference with respect to the required property and differ in only one residue (the activity value with respect to the given property is considered a property of the molecule as a whole as, for instance, the molecular weight). Whenever such molecules have been identified, structural matching module 26 will recursively call itself in the attempt at reformulating such a difference in terms of functional groups. Yet, the reformulation step will be accomplished only when satisfied the condition that the functional groups that actually make the difference also make a connected region. The same holds as concerns the functional group level. In this case, when the difference is expressed in terms of at most one functional group, module 26 again will recursively call itself in order to reformulate such a difference in terms of atom types on the condition above quoted. 2. Query for molecules containing a given property region, i.e. a substructure of the molecule that verifies a user defined combination of chemical and physical properties. Such a query, for instance, could ask for all the molecules that include a hydrophobic region with a volume greater than 20 cubic centimeters/mole and lesser than 40 cubic centimeters/mole. Firstly, query analyzer 25 looks into each of the dictionaries of the knowledge base 24 in order to select those fragments (residues functional groups atom types) that satisfy the user defined combination of chemical and physical properties. Following completion of this step, module 25 calls the structural matching module 26 to search in target database 23 for all the molecules including one of such fragments. Whenever volume is recognized as one of the properties used to define the region, query analyzer 25 will generate all the combinations of the fragments previously selected in order to identify those that fall into the range of the query. If the number of solutions is not too large, structural matching module 26 will search in target database 23 for all the molecules including one of the fragments or combinations of them previously selected. Otherwise, when the number of solutions is too large, query analyzer 25 will call model matching module 27 to search in target database 23 for all the molecules satisfying the user query (see below). 3. Query for molecules matching a user defined model represented in terms of fragments (namely: residues, functional groups, atom types) and/or property regions. Specifically, a model M is represented as a graph-whose nodes can be: * residues, * functional groups, * atom types, * property regions. FIG. 4 illustrates an example of user defined model. Each Ri is a user defined property region with a number of properties associated to it. The `*` symbol stands for whatever fragment. The user is also allowed to define a measure of distance between subunits of the molecule in terms of number of bonds. Such a measure can span over a user defined interval. The task of model matching module 27 is to find an association, whenever it exists, between the nodes of the model M and the nodes of the compound C selected from target database 23 by query analyzer 25. Specifically, each node of the model M is selected in a preferred order (firstly residues, secondly functional groups, then atom types, finally property regions) in order to be matched onto the corresponding node of the compound C. Model matching module 27 takes as input the model M and the compound C represented in the hierarchical way previously described and produces as output an association, whenever it exists, among the nodes of the same type in M and in C. In order to match the property region nodes of the model M, module 27 calls knowledge base module 24 to check whether the corresponding fragment of the compound C verifies the properties appended at the node of the model M. Table 6 gives details about the model matching routine. FIG. 7 is a flow chart of the operation of the model matching routine.
TABLE 6
__________________________________________________________________________
The model matching routine
__________________________________________________________________________
1. FIND a RS-, FG- or AT-node of the model that can be matched
onto a node of the same type in the compound C.
2. DETERMINE the near nodes of the model M and the compound C,
that is the set of nodes that are connected, by a given arc,
to the just matched nodes of the model and the compound. All
the RS-, FG- and AT-graphs of the compound C are used in
order to extract these near nodes.
3. FOR.sub.-- EACH near node N of the model:
*IF N is a residue node, THEN FIND the corresponding near
node in the RS near nodes of the compound C.
*IF N is a functional group node, THEN FIND the
corresponding near node in the FG near nodes of the
compound C.
*IF N is an atom type node, THEN FIND the corresponding
near node in the AT near nodes of the compound C.
*IF N is a property region node, THEN FIND in all the RS,
FG and AT near nodes of the compound C a node (or a set
of nodes) that verifies the properties characterizing the
property region N. In this case, the KB of chemical and
physical properties is consulted in order to analyze the
properties of the nodes of the compound C.
4. IF the model M is not completely matched, THEN GOTO step
__________________________________________________________________________
2.
For all the query types previously described, the answer to the query is dispatched by query analyzer 25 to user interface 28. The invention was described with no reference to a specific computer language; therefore, it will be understood that it can be implemented in any computer language.
|
Same subclass Same class Consider this |
||||||||||
