Priority scheduling

Multi processor sorting network for sorting while transmitting concurrently presented messages by message content to deliver a highest priority message

5276899

Abstract

A multiprocessor system intercouples the processors with an active logic network having a plurality of priority determining nodes. Messages applied concurrently to the network in groups are sorted, using the data content of the messages, to a single or common priority message which is distributed to all the processors with a predetermined total network delay time. Losing messages are again retried concurrently in groups at a later time. Message routing is determined by local acceptance or rejection of messages at the processors, based upon destination data in the messages. All messages occupy places in a coherent priority scheme and are transferred in contending groups with prioritization on the network. Using data, status, control and response messages, and different multiprocessor modes, the system is particularly suited for configuration in a relational data base machine having capability for maintaining an extended data base and handling complex queries.


Claims

What is claimed is:

1. A multi-processor system having a plurality of processors and improved facility for intercommunication of data packets between processors, comprising:

a sort network, intercoupling the processors such that each of the processors is coupled to all others of the processors, for sorting the data packets into priority order based on at least one criteria comprising command code, data content, and data length;

means at each of the processors for concurrently presenting data packets to the sort network; and

means at each of the processors responsive to received data packets from the sort network for accepting packets based on the data content thereof.

2. A method of communicating between processors to confirm in a minimum number of steps that all processors intended to receive a given task message have correctly received the message, the method comprising the steps of:

concurrently transmitting competing task messages having a given order of priority from the processors;

sorting the task messages, while transferring them, to derive a priority task message delivered concurrently to all processors, wherein the sorting is based on at least one criteria comprising command code, data content, and data length;

concurrently responding to the priority task message at all processors with response messages having a given order of priority, a "correct reception" response message having a lower order of priority than a "rejection" response message for an intended processor, and the response messages having priority over the task messages;

sorting the response messages, while transferring them, to derive a single priority response message delivered to all intended processors, wherein the sorting is based on at least one criteria comprising command code, data content, and data length; and

accepting the priority task message at the intended processors if the priority response message is a "correct reception" response message.

3. The method as set forth in claim 2 above, wherein the "correct reception" response message comprises an "acknowledgement" response message and the "rejection" response message comprises a "non-acknowledgment" response message, and wherein processors for which the priority task message is not intended transmit a "not applicable" response message which is of lower priority than the "acknowledgment" response message.

4. The method as set forth in claim 3 above, wherein the "non-acknowledgment" response message indicates an Overrun, Transaction Number error or Lock condition.

5. The method as set forth in claim 4 above, further including the steps of locally checking a transaction number state to ascertain if a processor is in a "non-participant" or "initial" state, and sending a "not-applicable" response message otherwise.

6. The method as set forth in claim 2 above, wherein task messages which are not accepted are repeatedly transmitted and responded to until they are correctly received by all intended processors.

7. The method as set forth in claim 2 above, wherein the task messages are primary data messages and the response messages are locally generated by all processors and interspersed in other messages.

8. The method as set forth in claim 2 above, wherein the task messages terminate in "end of message" codes and the responses messages are transmitted by all processors concurrently after the "end of message" code.

9. A method for assuring, in a minimum number of interchanges, that all relevant processors in a multi-processor system accept or commit to a task, the method comprising the steps of:

broadcasting a message intended to be accepted by the relevant processors;

concurrently responding to the broadcast message with response messages indicating by data content thereof whether the broadcast message has been accepted at the relevant processors; and

arbitrating the response messages based on at least one criteria comprising command code, data content, and data length, to broadcast a single priority response message indicative of acceptance by all of the relevant processors, unless at least one relevant processor has indicated non-acceptance.

10. The method as set forth in claim 9 above, wherein each of the processors transmits response messages whose data contents vary to indicate a state of readiness of the processor, and wherein the arbitration step gives highest priority to a response message indicating a lowest state of readiness.

11. The method as set forth in claim 10 above, wherein each of the processors that is not relevant to the task responds with a low priority response message that loses in the arbitration step.

12. The method as set forth in claim 11 above, further including the step of generating, at the processors, concurrent competing messages which contain the tasks, and arbitrating among the competing messages to provide a single priority message for broadcast.

13. In a multi-processor system, a method of confirming successful concurrent acceptance of messages by different relevant processors to a given process in arbitrarily controlled groupings, comprising the steps of:

broadcasting a message to all of the processors;

recognizing, at each of the processors, the message's relevance to the processor;

sending response messages concurrently from all processors, the response messages having data contents varying in accordance with a priority rule to denote that the broadcast message is not relevant, is relevant and reception is confirmed, or is relevant but reception is not confirmed; and

prioritizing the concurrent response messages based on at least one criteria comprising command code, data content, and data length, to establish whether all relevant processors have confirmed reception, such that one concurrent broadcast message to, and one concurrent response message from, each of the processors assures confirmation from all relevant processors.

14. A method of transmitting and receiving primary data messages in a multi-processor, the method comprising the steps of:

a) simultaneously broadcasting a primary data message to a plurality of processors;

b) determining at each individual processor if the broadcast primary data message is intended for the individual processor;

c) generating a response at each intended individual processor, the response indicating each individual processor's ability to process the primary data message;

d) transmitting the generated response for each intended individual processor through a priority determining network to prioritize all the generated responses based on at least one criteria comprising command code, data content, and data length;

e) granting highest priority within the priority determining network to any generated response indicating any intended individual processor's inability to process the primary data message; and

f) granting lowest priority within the priority determining network to any generated response acknowledging an intended individual processor's ability to process the primary data message, thereby causing the priority determining network to create an output signal indicating acknowledgment of the primary data message only when intended individual processor indicates an ability to process the primary data message.

15. The method of transmitting and receiving primary data messages as set forth in claim 14, further comprising the step of detecting, at each individual processor, an "end of message" indicator residing within the primary data message.

16. The method of transmitting and receiving primary data messages as set forth in claim 15, wherein each individual processor initially determines if the primary data message is intended for the individual processor by identifying a destination select word residing within the primary data message.

17. The method of transmitting and receiving primary data messages as set forth in claim 16, wherein each individual processor finally determines if the primary data message is intended for the individual processor by identifying a transaction number residing within the primary data message.

18. A method of processing primary data messages in a multi-processor system, the method comprising the steps of:

a) transmitting the primary data messages to a network coupled to a plurality of processors;

b) sorting the primary data messages, by the network, into priority order based on at least one criteria comprising command code, data content, and data length to obtain a highest priority primary data message;

c) receiving the highest priority primary data message at the processors, the processors each independently:

i) determining when the reception of the highest priority primary data message is completed;

ii) determining if the processor is a member of a group of processors intended for the highest priority primary data message;

iii) determining if the intended processor is in a state to process the highest priority primary data message;

iv) generating a responsive message indicating if the intended processor will process the highest priority primary data message;

v) transmitting the responsive message to the network;

d) determining within the network if every intended processor will process the highest priority primary data message; and

e) completing the processing of the highest priority primary data message only if the network indicates that every intended processor will process the highest priority primary data message.

19. The method of claim 18, wherein the transmission of the primary data message to the processors occurs simultaneously.

20. The method of claim 19 above, wherein each of the processors determining when the primary data message is completed by identifying an "end of message" bit, the "end of message" bit residing within the primary data message.

21. The method of claim 20 above, wherein the intended processors will generate a responsive message that does not acknowledge the primary data message if the intended processors are occupied so as to prevent processing the primary data message.

22. The method of claim 21 above, wherein each of the processors determines if the processor is a member of the group of processors intended for the primary messages by performing the steps of:

a) identifying a destination selection word residing within the primary data message;

b) comparing the destination selection word to a destination map residing within a processor interface;

c) generating a responsive message indicating that the primary message is not applicable to the processor if the destination map does not contain the destination selection word; and

d) continuing to receive the primary data message if the destination map contains the destination selection word.

23. The method of claim 22 above, wherein each of the processors determines if the processor is in a state to process the primary data message by performing the steps of:

a) identifying a transaction number residing in a directory within the processor interface; and

b) determining if the transaction number corresponds to a "ready to process" state.

24. A method of processing tasks in different processors of a computer system and merging message packets into a sorted order, the method comprising the steps of:

transmitting messages to the different processors for identifying individual portions of the tasks to be performed;

asynchronously carrying out the individual portions of the task in the different processors to generate processed message packets;

locally chaining together the processed message packets at each of the different processors into a sorted order, thereby creating a plurality of sorted message packets;

transmitting a "start merge" instruction to the different processors concurrently;

concurrently transmitting a highest priority one of the sorted message packets from each of the different processors while dynamically prioritizing a single message packet based on at least one criteria comprising command code, data content, and data length; and

repeating the concurrent transmission and prioritizing with losing and new message packets until all message packets pertaining to the task have been transmitted in sorted order.

25. The method as set forth in claim 24 above, including the steps of identifying the task for the different processors by a transaction identity, and starting and stopping the merging of messages with control communications.

26. The method as set forth in claim 25 above, further including the step of using command messages to suspend and resume merge operations so that a number of copending merge operations can be performed simultaneously in the computer system.

27. The method as set forth in claim 26 above, further including the steps of determining the readiness of the different processors, passing control of the merge operations to a least ready of the processors, and the least ready of the processors ascertaining whether the merge operations can begin when it has completed its portion of the task.

28. The method as set forth in claim 27 above, further including the steps of transmitting "end of file" data from each of the different processors which has completed transmission of its message packets, and transmitting "end of merge" data when all of the different processors have sent "end of file" data.

29. The method as set forth in claim 28 above, further including the step of storing processed message packets together with next message vectors to designate the sequence of the locally chained processed message packets.

30. The method as set forth in claim 29 above, wherein a transaction identity other than zero establishes a merge mode of operation and a transaction identity of zero establishes a non-merge mode.

31. The method as set forth in claim 30 above, wherein the step of transmitting messages to the different processors comprises broadcasting the messages to all of the different processors in a non-merge mode and recognizing an applicability of the messages at each of the different processors.

32. The method as set forth in claim 31 above, wherein the computer system further comprises a data base and further including the step of updating the data base in accordance with the processed messages packets.

33. A method of placing messages from a plurality of individual processors in a sorted order, the messages being generated during the completion of asynchronous processing tasks, the method comprising the steps of:

assembling, at the individual processors, sequences of the messages, the messages having sorting criteria therein;

synchronously transmitting high priority messages according to the sorting criteria from each of the individual processors;

dynamically prioritizing competing high priority messages based on at least one criteria comprising command code, data content, and data length to reject all but a highest priority message; and

synchronously repeating the transmission of remaining high priority messages until all the messages have been completely transmitted.

34. The method as set forth in claim 33 above, further comprising the steps of starting and stopping the method by first transmitting start and stop commands of higher priority in competition with other messages, dynamically prioritizing the start and stop commands, and delivering the start and stop commands to all processors concurrently.

35. The method as set forth in claim 34 above, further comprising the steps of assembling messages in sorted sub-orders at the processors, dynamically prioritizing all messages pertaining to one sub-order, and then repeating the steps with subsequent sub-orders.

36. The method as set forth in claim 35 above, further comprising the steps of delivering the priority messages to all processors concurrently, responding to the priority message from all processors concurrently, and dynamically prioritizing the responses before synchronously transmitting further messages.

37. A method of multiplexing a multi-processing computer system so that the processors may function as to a member of task requests, the method comprising the steps of:

assigning each of the processors to a sort/merge stream for each task request with which the processor is to be active;

asynchronously assembling, at each of the processors, sorted subsets of messages pertaining to the relevant task requests for that processor;

synchronously transmitting high priority messages from each subset pertaining to a given task request;

dynamically sort/merging the competing high priority messages to select a highest priority message wherein the high priority messages are sorted based on at least one criteria comprising command code data content and data length; and

assembling the sort/merge stream by repeated synchronous transmissions and selection of successive high priority messages.

38. The method as set forth in claim 37 above, further comprising the steps of changing the assembly of sort/merge streams by transmitting start and stop commands of a higher priority relating to different task requests, dynamically sort/merging the commands and broadcasting the commands to all of the processors simultaneously.

39. The method as set forth in claim 38 above, including the steps of identifying task requests with transaction numbers, using a specific transaction number as a non-merge command, and transmitting"start merge" commands, so that global merge operations and broadcast transmissions can be intermingled as to different task requests to enable global coordination of system resources.


Description

BACKGROUND OF THE INVENTION

Since the advent of the electronic computer in reliable form, workers in tile art have given much consideration to systems employing a number of computers functioning together in interrelated fashion to accomplish a given overall task. In some of these multiprocessor systems a large computer utilizes its superior speed and capacity to perform the complex portions of a program, but assigns smaller and slower satellite processors the less complicated and less urgent tasks in order to reduce the load and demands upon the large computer. The large computer is required to undertake the responsibilities of assigning subtasks, making sure that the smaller processors are kept occupied, ascertaining the availability and performance of the smaller processors, and providing a unified result.

Other multiprocessor systems utilize a different approach, employing multiple processors and a common bus system, with the processors leaving essential equality of function. In this type of system, separate control computers or control systems are often used to monitor the availability and capability of an individual processor for a given subtask, and to control the routing of tasks and information between processors. The processors may be arranged and operated so that they themselves monitor the status and availability of the other processors and determine the routing of messages and programs. The common and substantial drawback of these systems is that the software and operating time required for overhead and maintenance functions interfere with the performance of the principal objectives. Problems of routing and monitoring may increase quadratically in relation to the number of processors involved, so that ultimately a disproportionate amount of effort is spent in overhead functions.

The following are some patents that are illustrative of the state of the art:

    ______________________________________
    3,962,685         Belle Isle
    3,962,706         Dennis et al
    4,096,566         Borie et al
    4,096,567         Millard et al
    4,130,865         Heart et al
    4,136,386         Annunziata et al
    4,145,739         Dunning et al
    4,151,592         Suzuki et al
    ______________________________________


Since the days of the early "Binac" (two parallel processors) and comparable systems it has been recognized that a multiprocessor provides a redundant capability that can substantially improve the overall reliability of an operating system. Actual installations of multiprocessor systems have until recently been quite limited, largely due to the extensive software problems involved. Nonetheless, the advantages of multiprocessor operation for real time applications and other situations in which system down time cannot be tolerated have led to the development of systems which are successful in operation but which nevertheless involve significant commitments to overhead software and operating time. Illustrative of these are U.S. Pat. Nos. 31445,822; 3,566,363 and 3,593,300, all relating to a system in which multiple computers access a single shared main memory, and in which capabilities and requirements are compared in order to assign tasks optimally to individual processors.

Another example of the prior art is U.S. Pat. No. 4,099,233, in which a number of processors share a single bus and a control unit incorporating a buffer register is used in the transfer of data blocks between a transmitting miniprocessor and a receiving miniprocessor. This concept has been employed in a distributed mail sorting system in Europe.

U.S. Pat. No. 4,228,496 pertains to a commercially successful multiprocessor system in which buses between processors are coupled to bus controllers which monitor transmissions and determine the priority of data transfers between processors, each of which can be coupled in to control a certain part of a number of peripheral devices.

The "Ethernet" system (U.S. Pat. Nos. 4,063,220 and 4,099,024) being jointly promoted by Xerox Hewlett-Packard and Intel evidences another approach to the problem of intercommunicating between different processors and peripherals. All units are coupled to a common multiple access network and compete for priority. Collision detection is based upon time priority, which in turn means that global capabilities cannot readily be controlled, coordinated or given specificity.

Details of these complex systems can only be fully appreciated by close analysis of the patents and any related publications. However, review will show in each instance that the prioritizing of data transfer and the selection of processors requires extensive intercommunication and supervisory control if tasks are to be shared. Expansion of the systems to include additional processors does not present identical problems with these different systems, but in each instance substantially complicates system software, applications programming, hardware, or all three. Analysis will show that inherent constraints on multiprocessor system size and capability are imposed by the usage of one or two logically passive ohmic busses. While different techniques can be employed to facilitate intercommunication, such as the grouping of subsystems into global resources evidenced in recent U.S. Pat. No. 4,240,143, the amount of useful traffic must reach a limit and variable delays impose insuperable problems when large numbers of processors are used. Situations can arise in which one or more processors become locked out or deadlocked, and these circumstances in turn require added circuitry and software to resolve the problems. The impracticality of substantially extending the number of processors, say to 1024, thus becomes evident.

It is desirable for many applications to depart from the constraints of these existing approaches and to utilize modern technology to best advantage. The lowest cost technology available today is based upon mass produced microprocessors, and high capacity rotating disk memories, such as Winchester technology devices. It is desirable to be able to expand a multiprocessor system without disproportionate or even concomitant software complexity. It is desirable further to be able to handle computer problems that maybe characterized as having a distributed structure, in which an overall function can be dynamically subdivided into limited or iterative processing tasks. Virtually all data base machines fall into this category, which also includes such other typical examples as sorting, pattern recognition and correlation, digital filtering, large matrix computations, simulation of physical systems and the like. In all of these situations there is a requirement for widely dispersed, relatively straightforward individual processing tasks with a high instantaneous task load. This situation unduly burdens prior art multiprocessor systems because it tends to increase the time and software involved in overhead, and because practical difficulties arise in implementation of the systems. Using a shared passive bus, for example, propagation rates and data transfer times introduce an absolute barrier as to the rate at which transactions can be processed.

Data base machines thus provide a good example of the need for improved multiprocessor systems. Three basic approaches, namely the hierarchical, network, and relational, have been proposed for the implementation of large scale data base machines. The relational data base machine, which permits easier user access to given data in a complex system by using tables of relationships, has been recognized as having powerful potential. Typical publications, such ad an article entitled "Relational Data Base Machines", published by D. C. P. Smith and J. M. Smith, in the March 1979 issue of IEEE Computer magazine, p. 28, U.S. Pat. No. 4,221,003 and articles cited therein illustrate the state of the art.

Sorting machines also provide an example of the need for improved computing architecture. A review of sorting machine theory can be found in Searching and Sorting by D. E. Knuth, pp. 220-246, published (1973) by Addison-Wesley Publishing Co., Reading, Mass. A number of networks and algorithms are disclosed that must be studied in detail to appreciate their limitations, but it is generally true that they are typically complex schemes having only specific sorting purposes. Another example is provided by I,. A. Mollaar in an article entitled "A Design for a List Merging Network", in the IEEE Transactions on Computers, Vol. C-28 No. 6, Jun. 1979 at pp. 406-413. The network proposed utilizes external control of network merge elements and requires programming to perform specific functions.

Various workers in the art have considered and are considering specialized memory and system approaches that are intended to improve access to and maintenance of information in a relational data base. These approaches evidence the general recognition of the desirability of the relational data base machine. In their present forms, however, they violate the principle of utilizing the most advantageous cost per bit technology that is presently available, because they inherently require development of futuristic systems of ultimately unknown performance and economic viability. Furthermore, these proposals are so preliminary in nature that they cannot for some time confront the practical difficulties involved with a working data base machine, in which data must not Only be accessed, but must further be updated, corrected as necessary, sorted, merged, rolled back, recovered, and otherwise manipulated to meet the user's requirements. The incorporation of other features, such as a capability for expansion of the system, would tend to further delay practical usage of such system.

significant recent work on,.relational data base machines has been concerned with responding interactively to ever more complex queries. However, the ability to answer high level and sophisticated queries and the resultant ease of use and user productivity should not impose penalties oft the user in terms of throughput and response time. It is also evident that, where a large data base has been accumulated in an organization, the needs of different activities seeking information from the data base can vary widely, and thus to meet all the needs satisfactorily requires extensive knowledge of the system. Although some systems have been devices that perform all of the needed functions, they do so only for small data bases and at great expense.

It is highly desirable for many organizations to be able to utilize a given large main frame system, while obtaining the further cost and reliability advantages of a multiprocessor. If this can be done, all of the organization's existing software and hardware can continue to be used and the effort required to convert to a relational data base system will be minimized and continuity of day-today operations will be assured.

SUMMARY OF THE INVENTION

Systems and methods in accordance with the invention utilize a novel architecture and organization in which multiple processors are intercoupled by an active bidirectional network. The bidirectional network is arranged in a hierarchy of precedence determining nodes, each of which can concurrently resolve contentions for priority between competing pairs of messages. It also broadcasts to all processors, from an apex node at the highest tier in the hierarchy, that message packet having priority. Tasks to be performed by the individual processors are accepted and responsive message packets are returned, again via the bidirectional network.

The network serves in one direction as a high speed decision making tree whose active circuit nodes function in the time and space domains to make a prioritized sort. Priority between contending message packets is determined in accordance with predetermined rules and based upon the data content in the message packets themselves. Messages of lower priority that lose in contention within the network are again retried when the prior transmission is completed.

The priority scheme pertains as well to acknowledgment messages, status and control messages and special communications. Employing coherent priority relationships, and timing the application of messages to the network so that they are entered concurrently, the system eliminates the need for extensive prefatory and confirmatory exchanges. A message gaining priority on the network is delivered concurrently to all processors, and the messages that lose in contention may substantially immediately vie again for transmission.

The delay introduced by the network is balanced, in the sense that it is the same for all processors, and is dependent only on the number of node .levels in the hierarchical network. The delay therefore increases only by one increment for each doubling of the number of processors. In consequence of such factors, the minimization of support functions, and the fact that prioritizing is done without interruption of message flow, transfers on the network contain a very high proportion of data messages. Systems and methods in accordance with the invention can be advantageously configured to stand alone or to interface to the I/O subsystems of existing computer such as a large or small applications processing machine (referred to herein as the "host" or "main frame" computer). They also permit existing operating systems software and applications software on the "host" which do not use the invention to be used without modification.

The multiprocessor system utilizes highly cost effective microprocessors. For a data base system, some microprocessors may be characterized as interface processors and others of which may be characterized as access module processors. Both processor types are coupled to the base tier of the bidirectional network. The access module processors individually control different secondary storages, such as large capacity disk memories, each of which contains a portion of the relational data base arranged in scatter storage fashion. Each secondary storage has both primary and backup storage portions that are unique and nonredundant portions of the data base.

When a host computer generates a request, it communicates it via its I/O channel to an interface processor. The interface processor may determine that information stored by the access module processors must be retrieved or otherwise manipulated to satisfy the request.

In all applications of the-invention, requests for processing are communicated by a processor to other processors via packets on the active logic network. The network delivers such requests on a prioritized basis and is capable of directing the request to either the specific processor(s) or to the class of all processors which have the information or capabilities needed to process the packet. Those processor(s) then perform the indicated activity and may generate message packets in answer to the assigned task.

The processed message packets may be either routed back to the processor which originated the request or may be distributed to other processors for further processing. In either case the bidirectional active logic network means has the capability of accepting for contention packets responsive to the request from a plurality of responding processors and causing them to be merged in a sequential order by outputting them in successive packet contention time intervals. The interface processor completes processing of the host's request by returning response information via the I/O channel.

The high speed network continually receives synchronous message packet tries from the processors that are on line, determining priority at some initial or intermediate hierarchical level within the network and communicating the preferred message packet to its destinations but at a later time also and without significant delay transmitting the lower priority message packets to their appropriate destinations. Acknowledgment signals are locally generated and interspersed in the data messages, using the network merge and prioritize capability ill the same way. Control signals are generated to govern the pacing of both the external host system and other processors within the system.

An important aspect of the invention is that in the general distribution mode of operation there is automatic and unsupervised activity at a given processor in answer to a message packet that refers to that portion of the data base for which the processor is responsible. The message packets may include destination indications that comprise hash values used to define placement in the relational data base. Interface circuits at each processor then execute a table lookup or hashing function to determine whether the data is in range for that particular processor. Although the hash value may be computed by the sending or by the receiving processor, the use of hash mapping in conjunction with concurrent message broadcasting is of significant value in this relational data base machine.

The invention thus improves upon the state of the art of multiprocessor computer systems by providing means to:

--distribute work to a plurality of processors

--collect the results of work from a plurality of, processors

--ascertain the status of each resource (processor) in the network

--provide redundancy

--provide capability to readily expand the total processing capacity of the system

with a minimum of complexity and overhead in operating system and applications software.

The system architecture and operation permit expansion by the addition of further processors without requiring any modification or addition of supervisory software. The nature of the network is such that the individual transmission line connection sets the maximum distance between one processor and another. In contrast, for an ohmic bus the bus length determines the maximum distance in which all processors must be accommodated, and consequently severely limits maximum system size. Thus, systems in accordance with this invention are freed from a fundamental geometrical constraint on the size of multiprocessor systems that confronts prior art systems. Further, merge and sort functions may be carried out by using the inherent priority determination within the network in conjunction with individual microprocessor operation. The microprocessor can first determine message priority between the outgoing messages in the random access buffer memory, and the network can then select messages from different processors in proper order.

In a typical data base system, dual networks are employed. The system remains modular in character and can be expanded simply by adding access module processors with proper insertion of primary and backup stores in the data base. There is thus no single point of failure within the system and because of the use of scatter storage with both primary and backup copies, failure of even two units has a minimal effect on the reliability of the system and the recoverability of data.

BRIEF DESCRIPTION OF THE DRAWINGS

A better understanding of the invention may be had by reference to the following description, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram of a system in accordance with the invention including a novel bidirectional network;

FIG. 2 and FIGS. 2A to 2J comprising a set of sequential diagrams showing the transmission of data and control signals in a simplified example of the network of FIG. 1;

FIG. 3 is a graphical representation of the organization of a message packet used in the system of FIG. 1;

FIG. 4 is a block diagram showing further details of the novel bidirectional network of FIG. 1 as to the active logic nodes and clock circuits employed therein;

FIG. 5 is a state diagram depicting various conditions of operation in the active logic nodes;

FIG. 6 is a timing diagram useful in explaining end of message detection in the active logic nodes;

FIG. 7 is a diagram of timing waveforms useful in explaining the operation of the clock circuits of FIG. 4;

FIG. 8 is a block diagram of a processor module, including a high speed random access memory, that may be employed in the system of FIG. 1;

FIG. 9 is a diagram of address allocation in the main RAM of a microprocessor system as shown in FIG. 8;

FIG. 10 is a block diagram of the arrangement of data within one reference portion of the high speed random access memory of FIG. 8;

FIG. 11 is a chart showing the message priority protocol used in the system;

FIG. 12 is a graphical representation of transaction number word formats;

FIG. 13 comprises two sheets (FIGS. 13 and 13A) containing block diagrams of interface circuits employed with each processor module in the system of FIGS. 1 and 8 that are to be placed together with FIG. 13A on the right;

FIG. 14 is a timing diagram showing various clock and phase waveforms used in the interface circuits of FIG. 13;

FIG. 15 is a block diagram showing further details of organization of memory and a system for mapping in accordance with destination selection words;

FIG. 16 is a simplified flow diagram depicting status changes on reception of an input data message;

FIG. 17 and FIG. 17A are flow diagrams depicting status changes in receiving a message to be abutted with FIG. 17 on the top;

FIG. 18 is a matrix diagram showing the relationship between primary messages and generated responses and between primary messages and responsive actions;

FIG. 19 and FIG. 19A are flow diagrams depicting status changes in sending a message to be abutted with FIG. 19 on the top;

FIG. 20 is a block diagram of a stand alone system in accordance with the invention;

FIGS. 21A and 21B are diagrams of messages stored in the high speed random access memory; and

FIG. 22 is a simplified diagram of one way in which parts of a data base may be distributed among different processes in a data base system.

DETAILED DESCRIPTION OF THE INVENTION

Data Base Management System

The system depicted generally in FIG. 1 is illustrative of usage of the concepts of the invention in a data base-management application. Specifically, the system is configured for cooperation with one or more host-computer systems 10, 12, such as one of the IBM 370 family or DEC PDP-11 family, which advantageously, for purposes of this example, operate with existing and conventional operating system and applications software. In the IBM terminology the principal intercommunication networks between the host and the data base computer are referred to as, channels while in the DEC terminology the equivalent is referred to as a "UNIBUS" or "MASSBUS" or some other variant. Whether one of these systems or the main frame computer of another manufacturer is used, the channel or bus is an ohmic or logically passive pathway on which data base tasks and subtasks are presented.

The example of FIG. 1 shows a backend processor complex in association with host systems 10, 12. The system accepts tasks and subtasks from the host system, references the appropriate part or parts of an extensive data base storage, and returns appropriate processed or responsive messages in such fashion that no more than menial software management is required of the host systems, irrespective of the configuration of the backend processor complex. Consequently, the user's data base can be structured in a new multiprocessor system in which the data are organized in relational data base files of broadly expandable capacity; expansion can occur without any need to change the operating system or existing applications software resident in the user's host system. An example of a stand alone system is described below in conjunction with FIG. 20.

It will be recognized by those skilled in the art that relational data base management involves an overall function which is divisible into processing tasks that are separable, at least temporarily, because the data entries in storage are not interdependently linked by-address pointers. It will also be recognized that many other data processing situations exist in which dynamic subdivision and independent processing of limited or iterative tasks can be used. Thus while this example of the invention is described in terms of the demanding and widely encountered data base management processing problem, the novel methods and implementations disclosed herein are of broad applicability elsewhere as well.

A large data management system involves both potential advantages and the inherent difficulties when multiple processors are to be used. Vast numbers of entries, ranging into the hundreds of millions, must be held conveniently and rapidly accessible in storage. With the relational data base format, a wide range of data entry and information retrieval functions can be carried out concurrently.

In the great majority of data base systems, however, it is as important to maintain the integrity of the data base as it is to process transaction data rapidly. Integrity of data must be preserved across hardware failures, power outages, and other operational mishaps. Further, the data base system must be capable of restoring the data base to a known state to recover from user errors which can include bugs in applications software code. However, the data cannot be lost or entered erroneously, and all parts of the data base that relate to a specific entry must be changed responsively, whether the event involves new data, corrections for past errors or revision of a portion of a data base.

Integrity therefore implies that a degree of redundancy is required in the data base system, along with data roll back and recovery operations, error detection and correction, and detection of and compensation for changes in status of individual parts of the system. The system may have to be used in a number of different specific modes to accomplish these objectives.

It is further required of modern systems to be able to accept discretionary queries, that can be complex in form and to respond if necessary in an interactive fashion. Those who seek access to the system should not, despite the complexity of the query, be required to be experts in the system. Examples of discretionary queries that may be generated pertaining to a large production operation include the following:

A. A production manager might ask not merely for an item in inventory but for an inventory aging that identifies all parts inventories which are in excess of the monthly production rate for a part whose production rate is at least 10% less than in the comparable month for the prior year.

B. A marketing manager might ask, not merely if a particular account is 90 days overdue, but for all 90 day receivables as to customers from a particularly depressed geographic area who have exceeded 120 days in the past.

C. A personnel director might ask not merely for a listing of all employees having in excess of two weeks sick leave for a given year but for a listing of all employees with more than ten years longevity who were on sick leave for more than one week during frogging season in more than two of the prior five years.

In all of these examples, the user seeks to gain an insight into a real problem confronting him professionally by correlating, in unanticipated ways, information stored in the computer. The user's experience with his own problem areas and thus his intuition and imagination permits the non-computer-trained professional to make facile use of a data base system that is capable of handling complex queries.

Modern multiprocessor systems seek to satisfy these many and often conflicting requirements through the use of elaborate overhead and maintenance software systems, which inherently militate against easy expansion of the system. Expandability, however, is a highly desirable concept, because any growing business or operation inherently wishes to enlarge and retain its existing data base management system and not to be forced into the adoption of a new system and software.

The multiprocessor array--In FIG. 1, a typical system in accordance with the invention includes multiple microprocessors of two principal types, herein designated the interface processor (IFP) and the access module processor (AMP). Two IFPs 14, 16, are depicted, each coupled to the I/O system of a different host computer 10 or 12. A number of access module processors 18-23 inclusive are also incorporated in.,what may be termed a multiprocessor array. The term "array" is used in the conventional sense of referring to a set, collection or number of processor units disposed in a generally ordered linear or matrix fashion, and does not connote what has come to be referred to as an array processor. Although only eight microprocessors have been depicted as a simplified example of the system concept, many more IFPs and AMPs can and typically will be used.

The IFPs 14, 16 and AMPs 18-23 incorporate Intel 8086 16 bit microprocessors having an internal bus and a main memory with direct memory access for peripheral device controllers. Any of a wide variety of microprocessors and microprocessor system products of different manufacturers may be utilized. The "microprocessor" is merely a specific example of one type of computer or processor that may be used in the array, because the system concept can be used to advantage with minicomputers or large computer systems where the application demands such computing power. The 16 bit microprocessor is an advantageous example of a low cost unit having substantial data processing power and a standard replaceable configuration with a wide range of available hardware and software options.

The IFPs and AMPs utilize similar active logic, control logic and interface circuitry; microprocessors; memories; and internal busses, as described below in conjunction with FIGS. 1 and 8 respectively. These two processor types differ, however, in the nature of and control logic for their associated peripheral devices. Those skilled in the art will readily appreciate that other processor types with different peripheral controllers and functional assignments can be readily incorporated into this invention.

Each microprocessor has associated therewith a high speed random access memory 26 (described in conjunction with FIG. 8) which not only provides buffering of input and output messages but also coacts uniquely with other parts of the system to provide message management. Briefly, the high speed random access memories 26 function as a circular buffer for variable length input (called "receive") messages, provide sequential message output (called "send") storage, incorporate a table lookup portion for use in hash mapping and other modes, and store control information for orderly and sequenced handling of receive and send messages. The memories 26 are further used to fulfill unique roles in multiprocessor mode selection and in handling data, status, control and response message traffic. As described in detail hereafter, they are also arranged such that, based upon transaction identities in the messages, local and global status determinations and control functions are processed and communicated in highly efficient fashion. Control logic 28 (described below in conjunction with FIG. 13) at each IFP 14, 16 and AMP 18-23 is used in data transfer within the module and in the performance of overhead functions.

The IFPs 14, 16 each include an interface control 30 coupling the IFP to the channel or bus of the associated host computer 10 or 12. In the AMPs 18-23, however, the comparable unit is a disk controller 32 which may be of conventional configuration and which is employed to interface the AMPS 18-23 respectively with individually associated magnetic disk drives 38-43 respectively.

The magnetic disk drives 38-43 provide the secondary or mass storage for the data base,management system. In the present example, they comprise proven commercial products, such as Winchester technology, to provide high capacity and high reliability storage with extremely low cost per byte.

The relational data base is stored on these disk drives 38-43 in scatter storage fashion as shown in simplified form in FIG. 22. Each processor and associated disk drive is assigned a disjoint primary subset of the records comprising the complete data base, so that each of n storages has .sup.1 /nth of the data base. Further, each processor also is assigned disjoint backup data subsets making up .sup.1 /nth of the data base. As seen in FIG. 22, each primary file is duplicated by a backup file at a different processor,.giving two complete data bases distributed in different ways. This redundant arrangement of the primary and backup data subsets protects the integrity of the data base, because no large blocks of data or groups of relations can be substantially affected by a single failure.

Distribution of the data base is interrelated, as also shown in FIG. 22, with hashing of the various files, and incorporation of hash mapping data in the messages. The files at each processor are designated by simplified hash buckets shown as groups of binary series. The relations and tuples in a relational data base system thus can be located by the tables of relationships defined by the buckets. Hashing algorithms are used to derive the bucket assignments from keys in the relational data base system, so that expansion and modification of the data base system are readily feasible.

Selection of storage capacity is dependent upon data base management needs, transaction volume, and the processing power of the associated microprocessors. While a number of disk drives may be coupled to a single AMP, or a single disk file coupled to more than one AMP, such modifications will usually be limited to special applications. Extension of the data base typically is achieved by expanding the number of processors (and associated disk drives) in the multiprocessor array.

Active logic network --The objectives of providing orderly message packet flow and facilitating task performance are met by the use of a unique system architecture and message organization, centered upon a novel active logic network structure 50, This structure comprises, relative to the outputs of the microprocessors, a converging, ascending hierarchy of bidirectional active logic nodes 54. The nodes 54 comprise three port bidirectional circuits which may be described as forming a tree network, with couplings to the microprocessors 14, 16 and 18-23 being made at the base of the tree.

It will be recognized by those skilled in the art that nodes may be constructed where the number of logical sources is greater than 2, say 4 or 8, where the greater number of source inputs may also be resolved in the same time by the addition of more combinatorial logic.

For convenience of reference, all nodes (N) in the first tier are designated by the prefix I, in the second tier by the prefix II, and so forth. Individual nodes within a tier are designated by the subscripts .sub.1, 2 . . . , so that, for example, the fourth node in the first tier may be referred to as IN.sub.4. At the up-tree (or upstream) side, there is a single port, called the C port, which is coupled to one of the two down-tree ports, called the A and B ports, of a node in the next higher tier. The tiers converge to an uppermost or apex node 54a representing a convergence and recirculating means which directs upstream (up-tree) messages back in the downstream (down-tree) direction. Two tree networks 50a, 50b are utilized, the nodes and interconnections of the two networks being disposed in parallel to provide the redundancy desired for a large scale system. Inasmuch as the nodes 54 and the networks are identical, only one need be described.

To aid in visualization, it should be understood first that multiple message packets in tile form of serial signal trains are or can be launched concurrently into the active logic network 50 on the couplings for many of the microprocessors. The active logic nodes 54 each function in binary fashion in determining priority between two colliding packets, using the data contents of the message packets themselves. Further, all nodes 54 in a network are under the command of a clock source 56 arranged with the nodes 54 in such fashion as to synchronously advance the message packet trains toward the apex node 54a. In this manner each succeeding byte or other incremental segment of a serial train progresses to the next tier at the same time as the corresponding bytes of other messages also advance along other paths in the network 50.

A prioritized sort of competing signal trains takes place for message packets moving up-tree, ultimately to select a single message train to be redirected from the apex node 54a downstream. Because of the system organization, the decision as to ultimate priority need not occur at any particular point in the message packet, so that message transmission can be carried forward without requiring any more than the binary decisions between two colliding packets that are being made at the individual nodes 54. As a result the system provides message selection and data transfer in the space and time domains but does not delay-message transmissions for purposes of gaining control of the bus, identifying sending or receiving processors, or performing handshaking operations between processors.

Further, it is important to recognize that when several processors send identical packets at the same time, if successful, it will appear that all such sending processors were successful. This property is extremely useful in exercising efficient control of a large multiprocessor complex, because of the savings in time and overhead.

The nodes 54 also operate in bidirectional fashion to enable unimpeded downstream distribution of the message packets. At a given node 54 downstream messages received at the port C on the up-tree side are distributed to both ports A and B on the down-tree side and then transmitted on to both associated nodes at the next lower tier. Under the control of the common clock circuit 56, the message packet advances synchronously down-tree to be broadcast to all microprocessors simultaneously, enabling one or many of the processors to carry out the desired processing task or to accept a response.

The network 50 has a high data transfer rate in comparison to the data transfer rates of the microprocessors, typically being a multiple greater than two. In this particular example the network 50 has a byte clock interval of 120 nanoseconds and the data transfer rate is five times that of the microprocessor. Each node 54 is coupled, at each of its three ports, to the associated node port in the next tier, or to the microprocessor, by a set of data lines (here 10 in number) and by control lines (here 2 in number) and devoted to clock and collision signals respectively. The data and clock lines run in pairs, with separate lines for the uptree and downtree directions. The collision line propagates down tree only. The connections form a full duplex data path, with no delay needed to "turn around" the drive sense of any line.

Referring now to FIG. 3, the 10 data lines comprise an 8 bit byte, designated as bits 0-7 inclusive, occupying 8 of the 10 data lines. Another line, designated C, is a control line, carrying a control sequence that is used to characterize different parts of the message packet in particular ways. The 10th bit is used for odd parity in the present example. Practitioners skilled in the art will recognize that the system can readily be operated with more or fewer bits in the data path.

The byte sequences are arranged in successive fields, basically divided into command, key, destination selection, and data fields. As is discussed further below, a message may utilize only a single field, and concludes with a detectable End of Message code. An intervening idle field between messages is designated by an unbroken sequence of 1's on the C line, as well as on lines 0-7, and is transmitted whenever no message packet is available. The parity line is also employed in a unique fashion to communicate a change of status of an individual processor.

The idle state is an intermediate state and is not a part of the message packet, which typically begins with a 2 byte command word that includes a tag in the form of a transaction number (TN) for data messages or an originating processor ID (OPID) for response messages. The transaction number has many levels of significance in the system and serves as the basis for a number of functional communications and controls. The packet may thereafter contain any or all of a variable length key field and a fixed length destination selection word (DSW) as the first part of a variable length data field. The key field serves the purpose of providing sorting criteria, where messages are otherwise substantially identical. The DSW provides the basis for a number of special functions and also merits-particular attention, along with the TN.

The system operates with the interfaces in word synchronism, so that the first bytes of the command words are provided to the network 50 concurrently by all processors which have a packet to transmit. The data contents of the successive fields are used by the network in sorting on a binary basis at each node, with the lowest numerical value being given priority. Taking bit C as the largest quantity and bit 0 as the smallest in the successive data bits, the sorting priority order is:

1. first arrival at the network 50;

2. lowest command code (word);

3. lowest key field;

4. shortest key field;

5. lowest data field (including the destination selection word);

6. shortest data field.

For purposes of this general overview it should be noted primarily that when a priority decision has been made at a node 54, a collision indication (referred to as A.sub.col or B.sub.col) is returned along the path from which the losing transmission was received. This indication enables the transmitting microprocessor to recognize that the network 50 is busy with a higher priority transmission so that the transmission is terminated and must be retried again at a later time.

A simplified example is shown in the various representations of FIG. 2 of the manner in which the network 50 operates with the high speed random access memories in a tree using four different microprocessors, specifically an IFP 14 and three AMPs 18, 19 and 20. Ten subfigures 2A, 2B . . . 2J each correspond to one of ten successive time samples, from t=0 to t=9, to show the distribution of different simplified (four character) serial messages from each of the microprocessors within the network at each of these points in time, and the communications between ports and microprocessors at the different times. The diagram labeled simply FIG. 2 shows the state of the system prior to the beginning of signal transmission. In these separate views, the null or idle state requires a transmission designated .quadrature.. With the convention of lowest data content having priority, the message packet "EDDV" from AMP 19 in FIG. 2A should be the first to be transmitted through the system. These messages are retained, as described in greater detail below, in the high speed random access memories (sometimes H. S. RAM) 26 in the microprocessors. The H. S. RAMs 26 have input and output subdivisions that are depicted generally in FIG. 2, with the packets being arranged in FIFO vertical order in the output portion at t=0, thus being available for transmission, as indicated by the cursor arrow in H. S. RAM 26. At this point in time all transmissions in the network 50 indicate the null or idle state .quadrature..

At t=1, however, as designated in FIG. 2B, the first byte of each of the message packets is launched into the network 50 concurrently, with all-nodes 54 still returning the idle state indications and all transmissions above the first tier also being in the idle state. In the first clock interval, the initial bytes of the messages are set into tile lowest tier nodes, IN.sub.1 and IN.sub.2, so that at t=2 (FIG. 2C) contentions have been resolved and both upstream and downstream transmissions continue. Node IN.sub.1 has received an "E" on both input ports and is transmitting this upstream to the next tier, indicating the undecided state downstream to both sending processors. At the same tier, however, node IN.sub.2 has determined collision priority between the "E" from processor 19 and the "P" from microprocessor 20, in favor of the former, thus coupling port A to up-tree port C and providing the B.sub.col signal back to microprocessor 20. As the B.sub.col signal is returned toward the microprocessor 20, the IN.sub.2 node in effect locks the A input port to the C output port, so that the serial train from microprocessor 19 is transmitted on to the apex node IIN.sub.1.

At the IN.sub.1 node the first two characters are both "ED", and no decision can be made at this node at time t=2, as shown in FIG. 2C. At t=3 (FIG. 2D), moreover, the common initial character "E" from the three microprocessors 14, 18, and 19 reaches the IIN.sub.1 apex node and is redirected toward the downstream direction, as the second character "D", also common to all messages, is transmitted toward apex node IIN.sub.1. Node IN.sub.1 cannot yet make a decision at this time, but the third characters, "F", "E" and "D" from the successive microprocessors 14, 18, and 19 respectively are in the course of transmission to that node. Reception of the B.sub.col signal at the microprocessor 20 designates that its contention for priority has been lost, and it then and thereafter transmits only the idle indication .quadrature.. The cursor, arrows in the output buffers show that the microprocessor 20 has been returned to its initial state but that the other microprocessors continue to send successive characters. Thus at t=4 (FIG. 2E) the significant events are the decision for the port at node IN.sub.1 and the return transmission toward the first node tier of the initial character ("E") on all lines. The next collision is indicated at t=5 (FIG. 2F), with the B port of node IIN.sub.1 winning contention and A.sub.col being generated.

Broadcasting of the serial signal train in the downstream direction continues through successive clock times, and at time t=6 (FIG. 2G) the initial message character is set into the input portions of all H. S. RAMs 26. Concurrently it should be noted that the earlier priority determination at node IN.sub.1 is now overridden by the A.sub.col indication from the higher tier node IIN.sub.1 when the third character ("E") from the microprocessor 18 loses in contention with the third character ("D") from the microprocessor 19. As the cursor arrows show in FIG. 2H, microprocessors 14, 18 and 20 have returned to their initial states and winning microprocessor 19 previously completed its full transmission at time t=4. All input buffers are successively loaded with the priority message "EDDV" as seen in,FIGS. 2B, 2I and 2J. At t=8 (FIG. 2I), the message has run out of the first tier and the apex node IIN.sub.1 has been reset at t=7, because only idle signals are in contention, as the last downstream character is transferred to the microprocessors, At t=9 (FIG. 2J) the nodes IN.sub.1 and IN.sub.2 in the first tier are reset, and all the losing microprocessors 14, 18 and 20 now contend again for priority on the network by omitting the first message character when the network is again indicating idle. In practice, as described hereafter, acknowledgment signals are transmitted to the winning microprocessor(s), but this is not required for the most general case of the invention.

The message, once broadcast to all microprocessors in this manner, may be used by any or all as required. This depends upon the mode of operation and the functions being performed, which include many variations.

GLOBAL INTERCOMMUNICATION AND CONTROL

The foregoing example of the manner in which the network prioritizes a given message out of a group of contending messages pertains to the transfer of primary data messages. A complex multiprocessor system must however use many other types of communications and commands to have the efficiency and versatility now required. The principal functions to be supplied encompass, in addition to primary data transfer, what may broadly be termed multiprocessor modes, message acknowledgments, status indications and control signals. The following section provides a general overview from the global, or multiprocessor system, standpoint as to how different modes and messages coact with the prioritized sorting and communication network. Reference should be made to FIGS. 8 and 13 and the accompanying descriptions hereafter for a more detailed understanding.

In the general distribution or broadcasting mode, messages are simultaneously delivered to all processors without, specific delineation of one or more recipients. This mode is typically used for responses, status queries, commands, and control functions.

Where there is to be a delineation as to the recipient, the destination selection information within the message packet itself provides criteria for local acceptance or rejection of the packet. For example, interface logic in the receiving processor modules identifies whether the data is in range for their particular processor according to map information stored in the high speed RAM 26. A variety of selection criteria can be readily implemented by means of various settings of the map bits in the high speed RAM, including selection of a specific recipient processor, portion of the data base stored ("hashed"), logical process type ("class"), etc. The use of broadcasting with local access control is of particular benefit for a data base management system, inasmuch as minimal overhead software is needed to gain access to any part of the widely dispersed relational data base or to dispersed local copies of any of a number of globally known logical processes. The system is thus capable of specifically selecting a single destination or a class of resources as the destination for a message.

Also, high level data base inquiries often require cross-referencing between different portions of the data base, and consistent reference to a given task. The TN incorporated in the messages provides this global transaction identity and reference, among other features. Many tasks may be worked on concurrently by the asynchronous local processor modules, and each task or subtask has its appropriate TN. Using various combinations of the TN, DSW and commands, virtually infinite flexibility is achieved. An extended sort/merge operation can be undertaken on a large number of tasks that are asynchronously assigned and processed. TNs can be assigned and relinquished and merges can be started and stopped. Certain messages, such as continuations, can have priority over other transmissions. Using the TNs and local processor updating of status as to the TNs, one query can determine the status of the global resource as to a given TN. A distributed update can also be accomplished in one communication. The present system enables all these functions to be performed without extending the software or materially increasing the overhead load.

As a consequence of the invention, multiprocessor systems with much larger numbers of processors than feasible in the prior art can be operated with high effectiveness against problem tasks. Because of the present low cost of microprocessors, systems of high performance in a problem domain, and not just in "raw" power, can be realized at low cost.

A coherent priority protocol that encompasses all message types, and various subtypes, embraces all the different messages that are applied to the network. Although responses, status and control messages are of a different form than the primary data messages they also use the contention/merge operation of the network and thus are prioritized during transfer. Response messages in the present system are positive acknowledgment (ACK), negative acknowledgment (NAK), or all indication that the processor does not have the resources to process the message meaningfully ("not applicable processor" - NAP). The NAK response may be any of several different types, indicating a locked, error or overrun condition. Because an originating Processor or processors require such responses after termination of a message transmission, the responses have a higher priority level than primary data messages.

The present system also employs SACK (status acknowledgment) messages that denote the readiness state of a local processor with respect to a particular task or transaction. Such SACK responses are locally updated and held accessible to the network. They provide, in conjunction with the merge operation of the network, a single query global status report for a given task or transaction. Because the status responses are in accord with the priority protocol, the lowest data content response automatically gains priority and establishes the least ready status as the global system state for a transaction number in one uninterruptible operation. The SACK indications also are used in conjunction with certain primary messages to implement various protocols, such as system initialization and lockout operations.

The priority protocol definition for the various message types begins with the command code, using the initial 6 bits of the command word that starts each message and response, as shown in FIG. 11. An adequate range of distinctions as to message types, and subtypes, is available although more could be used. Here the SACK response differentiates seven different status levels (and provides a basis for prioritizing as well), as can be seen by reference to FIG. 11. For responses these first 6 bits are followed by the tag in the form of a 10 bit OPID (see FIG. 3). Both the TN and OPID can serve as further sorting criteria because of their differing data contents, within the tag.

After each primary message has been transmitted through the network, the interface section of ever processor generates a response message, even if it is only a NAP. The responses also contend on the network, and the single or common winning response message is broadcast to all processors. Losing message packets are retried later, synchronously, after a minimal delay so that the network is substantially constantly in use. Where a number of processors provide an ACK response, the responses are sorted by the OPID.

As a consequence of the invention, tasks can be started, stopped, controlled, and interrogated in synchronism in a very large number of physical processors with a minimum of overhead. This permits the raw power of a large number of processors to be effectively applied to problem state processing with a minimal diversion of that power to coordination and control. The overhead of coordination and control is a fundamental limitation on the efficacy of any distributed processing system.

Different types of control communications are employed where the purpose is globally (i.e. network) oriented. Thus, Stop Merge, Status Request, Start Merge, and certain task assignment and relinquishment messages have the same format as data messages and are also referred to herein as primary messages. These control messages also include the TN and have their places in the priority protocol, as is later discussed relative to FIGS. 10 and 11.

The term "global semaphore buffer system" has been adopted to connote the fact that the high speed random access memory 26 and control logic 28 shown in FIG. 1 also play a significant role in both multiprocessor mode selection and bidirectional communication of status and control indications. The global semaphore buffer system provides duality of access, in that both the high speed network structure 50 and the slower speed microprocessors can reference a message, response, control or status indication in the memory 26 without delay or requiring direct communication with each other. To this end the control logic 28 time multiplexes the memory 26 to network 50 and to the microprocessor in interleaved word cycles, creating in effect different ports having common access to the memory 26. The global resource or network 50 and microprocessors can use the transaction number as an address locater to a portion of the memory 26 devoted to transaction status. At the local level, the status of a subtask relating to a given transaction covering all useful states, is updated in the memory 26 under control of the microprocessor and locked by the control logic 28 at the buffer system. One of seven different readiness states is used, the entries conveniently being derived from a different dedicated portion of the memory 26. Upon a query from the network, the status of the processors is communicated (the "semaphore" is read) and prioritized in the network with the least complete readiness state taking priority. This arrangement provides an immediate hardware response from all processors to a query. Thus it can be known without delay or the use of software whether all of the distributed subtasks of a given task have been accomplished, In the instant system, moreover, any communicating processor module can assign an available transaction number for use with the messages and in each global semaphore buffer system.

A good example of this integrated use of transaction identity and status indication is presented by a complex merge operation, in which each of a number of processors is called upon to place in order all messages pertaining to a given criterion. In prior art systems each processor would have to receive and complete its tasks and then communicate the results to some "master" processor, which would do the final merge operation. That master processor thus constitutes a significant bottleneck to system throughput.

When the global readiness state establishes that all affected processors are ready, the messages of highest priority from the memory 26 at each processor are entered on the network concurrently and prioritized during merger as previously described. Successive retries with groups of messages generate a serial train of messages of descending priority, ending with the lowest for that transaction number. Specific command messages provide the system with the ability to suspend and resume the merge operation partway through so that the network 50 can be shared by a number of copending merge operations and thus make most effective utilization of the resources of the systems.

At any given time, therefore, all active processors connected to the network 50 can be working asynchronously on messages pertaining to different transaction numbers. When referenced to the same or "present" transaction number by a status query, all respond synchronously with one of the available status levels. For example, The START MERGE message tests the global semaphore represented by a particular transaction number, and if the global stale is ready (SEND READY or RECEIVE READY), the present transaction number (PTN) is set to the TN conveyed in the START MERGE message. (If the global state is not ready, the PTN reverts to a value of TN0).

A STOP MERGE message also resets the present transaction number to 0. TN0 is thus utilized as the "default" transaction number used for single processor to single processor (point-to-point) messages. In another sense, it identifies the "non-merge" mode of operation.

The global intercommunication system uses the message organization shown in FIGS. 3 and 11, and the high speed random access memory 26 organization shown in FIGS. 8 and 10. More detailed examinations are made below in conjunction with FIGS. 5, 7; 9 and 13.

In FIG. 3 and FIG. 11 it can be seen that command codes for the responses range from 00 to 0F (hexadecimal) and that those for primary messages range from 10 (hexadecimal) to some higher value. Thus, responses take priority over primary messages, lowest value first, in the sequence shown in FIG. 11.

One dedicated section of storage in the high speed RAM memory 26" (FIG. 8), (designated "transaction numbers") is used, for storage of the word formats (the seven readiness states, an Assign TN and an Unassigned TN state) of FIG. 12. Other dedicated portions of the memory 26" include a circular buffer for input (receive messages) and an output message space. Another separate section of the memory 26" is used as a message complete vector section, in which pointers can be placed to completed output messages so that output message space can be used efficiently.

It should be appreciated, therefore, that while the queuing and data buffering functions of the memory 26 and control logic 28 are of importance, the multiple coactions by which global transactions are dispersed and manipulated in relation to the individual processors are uniquely significant.

ACTIVE LOGIC NODES

The active logic nodes 54 of FIG. 1 are alike, in both of the redundant networks, except that the recirculation node 54a at the apex of each network has no upstream port, but merely a signal recirculation path that returns to the downstream direction. As shown in FIG. 4, each node 54 may be broadly divided into functional groupings, one of which pertains to message and collision signal transmissions and the other of which pertains generation and retransmission of the common clock signal. Clock signals are synchronized such that there is zero skew between them at the different nodes. These two functional groupings are not separate, inasmuch as the zero skew clock circuits form important parts of the signal transmission system. Both a word clock (two serial bytes) and a byte clock are utilized. Note should be taken of the fact that external control of the active logic nodes 54 is not required or utilized, whether to establish or reset the node's state or to set up different modes of operation. Furthermore, the identity between the nodes 54 enables them to be made in quantity using modern IC techniques, substantially reducing cost while improving reliability.

The A, B and C "ports" previously referred to each have ten input data lines and ten output data lines. Taking the A port as an example, the input lines are designated AI and the output AO. The single "collision" line is used at each port (e.g. A.sub.col for the A port), along with upstream and downstream clock lines. The data lines from the A and B ports are applied to a multiplexer 60 which switches the priority word of two competing words, or the common word (if both words are alike) to an up register 62 coupled to the upstream port (C), as the C0 data signals. Concurrently, the downstream data received at the C port from a higher tier node is shifted into and out of a down register 64, appearing as output at both of the A and B ports.

Although one upstream byte serial signal train may be blocked, no added upstream or downstream delay is introduced arid words are advanced in unbroken sequence through the up register 62 and down register 64 under control of tile word and byte clocks.

Competing bytes concurrently applied at the A and B ports are supplied to first and second parity detectors 66, 67 and also to a comparator 70 which determines priority on the basis of the eight data bits and one control bit, with lowest data content having priority. The "idle" or no message signal in this protocol is an unbroken sequence of 1's. Parity errors can occur due to typical causes, such as the presence of excessive noise or some other factor affecting signal transmission or circuit operation. In the present system, however, an important additional use is made of parity error indications. Each transition of a microprocessor to an inoperative state is marked by all output lines, including the parity line, going high (or 1 valued), thus establishing an odd parity error. This indication is transferred through the network once upon the presence of an error, as a marker which enables the system to identify a change in global resources and initiate procedures to determine the nature of the change.

The pair of parity detectors 66, 67 and the comparator 70 feed control circuits 72 that include priority message switching circuits 74, responsive to the comparator 70 for locking the multiplexer 60 in one state or the other if priority is determined, and for generating and propagating the downstream collision signals. Transitional parity error propagation circuits 76 are so called because they force the one-time all 1's parity error state along the network. Reset circuits 78 for returning the node to its initial state include an end of message (EOM) detector 80.

It will be appreciated that the functions described above and hereinafter may be accomplished at each active logic node by the use of a microprocessor chip, but they may even more readily be implemented in accordance with the state diagram of FIG. 5 and the logic equations set out below. In the state diagram, the state SO represents the idle state, and also the state in which competing messages are equal, so that no decision is made to favor one port against another. The S1 and S2 states are the states favoring the A port and B port respectively. Thus the A port is favored (S1 state is established) if the data content of BI is greater than AI and there is no parity error on AI or if there is a parity error on BI (these conditions being designated AIPE and BIPE respectively and represented by flip-flop states). The converse logic conditions as to AI and BI exist for the system to go into the S2 state. Any indication from a higher tier node that a collision has occurred at the tier is reflected back in a downstream signal as COL.sub.IN. Whether the system is in the S0, S1 or S2 states, it goes into the S3 state, transferring the collision signal downstream as A.sub.col and B.sub.col. In the S1 and S2 states, with the node having made a decision, the collision signal is sent downstream to the lower tier nodes in like fashion, with the priority message switching circuits 74 locked to the A port or B port as the case may be.

The reset circuits 78 include the EOM detector 80, used to reset the node from S3 to SO (FIG. 5). A first reset mode uses the end of message (EOM) field that concludes the data field in a primary message, as shown in FIG. 6. A group of flip flops and,gates are used to establish the logic:

URINC.multidot.URC.multidot.URCDLY

where URC represents the control bit in the up register, URINC represents the control bit value in the up register input and URCDLY represents the C value in an up register delay flip flop.

As seen in FIG. 6, control bit sequence pairs establish certain fields and transitions between them. For example, a transition from the all 1's used during idle to a 0, 1 bit sequence defines the start of a field. The same 0, 1 sequence is used to identify the start of the data field. Successive 1, 0 control bit strings denote the internal field or subfield, and the end of message (EOM) is identified by the 0, 0 control bit pair. The condition in which the string of 1, 0 pairs is followed by the 0, 0 pair is unique and readily identified. The URINC, URC and URCDLY signals are ANDed together, with each having a 1 byte clock delay from the other. The result is a,waveform that is high until the start of the message packet, and at which point it goes low and stays low through the data. It returns high 2 byte clocks following the EOM occurrence. This positive-going transition in the waveform URINC.multidot.URC.multidot.URCDLY is the EOM detection. It triggers, as shown by the legend in FIG. 5, a return from S1 or S2 to S0.

A higher node tier that is reset goes to COL.sub.IN indicating that the collision state has been removed. This logic state initiates a return from S3 back to the base state, S0. Note that the COL.sub.IN state will propagate down the tiers of the network as the end of the message "runs out". The nodes are thus self-resetting no matter how long or short tile message. Also note that no matter the state in which the network starts out, all nodes will be reset to the S0 state by the idle signals.

Collision signals are returned to the processor modules, which store the collision state information and revert to the transmission of the idle sequence as the winning processor continues to transmit. A processor may begin a new transmission as soon as it detects the transition from COL.sub.IN to COL.sub.IN. In addition, a processor may begin a new transmission after receiving idles for 2N byte times, where N is the number of tiers in the network, as this also indicates that the network is clear of any prior transmissions. This latter method of enabling new transmissions permits a processor entering a network for the first time to get into message synchronism with the network under conditions of light traffic, so that it need not wait for a poll from another processor in order to start interchange with other processors on the network.

Parity error states have been noted in the state diagram of FIG. 5 and are established pursuant to the following logic:

PESIG=AIPE.multidot.AIPEDLY+BIPE.multidot.BIPEDLY

If PESIG then (URIN 0:7, C, P=1 . . . 1, 1, 1) To implement this logic, the transitional parity error propagation circuits 76 comprise an AIPE, or A input parity error flip flop and a delay flip flop (AIREDLY). The latter is set 1 byte clock later in accordance with the AIPE setting. For A inputs, the PESIG value thus goes high for 1 byte clock when the AIPE flip flop is set by a parity error, so that the PESIG signal is propagated once, at the first indication of the parity error. The same condition arises when all of the data bits, control and parity bit are 1 values, which occurs at the previously noted, transition in the state of the global resources. All lines then go high, forcing all 1's and establishing an even total (odd parity) so that the AIPE and AIREDLY flip flops are set as previously described, to denote the parity error. This system operates in the same way when the message packet received on the B port contains a parity error or a forced parity indication of change of status.

Parity errors arising because of noise effects or other variables will typically not affect processor operation because of the redundant networks. For monitoring and maintenance purposes, indicator lights (not shown) are utilized to indicate the occurrence of parity error. The once-propagated parity error denoting change of status, however, initiates routines for assessing the significance of the change.

The clocking system used in the node 54, as shown in FIG. 4, provides a unique means for maintaining zero skew between the clocks at all the node elements, despite the number of tiers used in the network. The clock circuits 86 include first and second EXCLUSIVE OR gates 88, 89 respectively, the outputs of which, designated A and B, respectively, are subtractively combined (in the B-A sense) by a summing circuit 92, the output of which is passed through a low pass filter 94 to control the phase of the output from a phase locked loop or oscillator 96. The inputs to the first gate 88 are the output of the PLO 96 and a downstream clock passed from the next higher tier node element through an isolating driver 97. Tins line is designated as the word clock, and is derived from the next higher tier after a certain known delay, .lambda., the same signal being returned through another isolating driver 98 to the node at the next higher tier. The inputs to the second gate 89 comprise the word clock and a clock feedback from the next lower tier, which also receives a signal from the PLO 96.

The word clock line feeds the two inputs of a third gate 100, both directly and through a .lambda..sub.c delay line 101, to derive a byte clock signal at twice the frequency of the word clock and in timed relation to it.

The functioning of the clock circuits 86 may be better understood by reference to the timing diagram of FIG. 7. The clock out signal is the output of PLO 96. Inasmuch as a paramount objective of the system is to maintain a zero time skew between these outputs for all nodes in the network, it is clear that they must also have the same nominal frequency. The transmission line delay, .lambda., between nodes is kept substantially, constant, but can be long. Using the presently disclosed technique the length could be as long as 28 feet, with the network and node byte clock rates (nominally 120 ns.) used in a practical system. Those skilled in the art will recognize that lengths which are integer multiples of 28 feet can readily be obtained by adding tiers to the network which are not fully populated with the maximum possible number of processor modules. There will be a corresponding increase in the latency or transmit time through the network.

The word clock derived from the next higher tier, ad shown by the next lower waveform, is a similar waveform but delayed by .lambda.. The word clock constitutes the basic timing reference throughout all the nodes, and this is made possible because the leading edge of each clock out signal is controllable within the circuit and can be made to lead the word clock so that all nodes can be held in synchronism. Referring to the waveforms A and B, it can be seen that the first gate 88 generates a pulse A which terminates with the leading edge of tile word clock, while the second gate 89 generates a pulse B whose leading edge is coincident with the word clock. The trailing edge of the B pulse is defined by the initiation of the feedback pulse from the next lower tier mode, which is delayed by .lambda., so that the B pulse is of fixed duration. The clock circuits 86 function to keep the pulse A of the same duration as pulse B, because the summed signal, B-A, tends toward a null,.as the PLO 96 is advanced in phase so as to establish synchronism. In effect, the leading edge of the A signal, which may lead or lag the desired position as shown by dotted lines, is adjusted to precede the leading edge of the word clock by the interval .lambda.. When the leading edge of the clock out signal is in this desired nominal position at all the nodes, there is zero skew between the word clocks. For this reason the processors coupled to the network are freed from any constraints as to the total length of the path between one processor and another, because additive delays and differential propagation times are eliminated.

To produce the double frequency byte clock, the word clock signal is replicated at a delay .lambda..sub.c by the delay line 101, which also feeds the gate 100. Thus, as seen in the waveform labeled byte clock in FIG. 7, at either edge of the word clock, a byte clock pulse is produced having a duration .lambda..sub.c. This occurs twice each word clock interval and in synchronism with the word clock throughout all the nodes. It is implicit in the prior description that the delays introduced by the transmission lines between nodes are nearly identical in both directions between tiers so that in effect all word clocks and byte clocks within the system are held in stable phase relationship. The locally generated byte clocks therefore provide clocking at each node for the individual bytes in the 2 byte words of the messages.

The active logic nodes are of potential benefit wherever a competition between concurrent message packets is to be resolved on the basis of data content. Most known systems, as exemplified by U.S. Pat. No. 4,251,879 issued Feb. 17, 1981 on a "Speed Independent Arbiter Switch for Digital Communication Networks" are directed toward determining the first signal received in time, and utilize external processing or control circuits.

PROCESSOR MODULES

The individual processors in the overall system diagram of FIG. 1 are identified as examples of interface processors (IFPs) 14 and 16 and access module processors (AMPS) 18 to 23 respectively, and are broadly subdivided into principal elements. A more specific example of the organization of the processor modules shows the correspondence to the broad functional subdivisions of FIG. 1 but also reveals a substantial number of further subdivisions. As used herein, the term "processor module" refers to the entire assembly shown in FIG. 8, which with the optional features that are noted can serve either as an IFP or an AMP. The term "microprocessor system" refers to a system 103 that incorporates a microprocessor 105 such as a 16 bit microprocessor of the Intel 8086 type. The address and data busses of the microprocessor 105 are coupled within the microprocessor system 103 to conventional peripheral systems, such as the main RAM 107, and a peripheral controller 109. The peripheral controller 109 exemplifies what may be used when the processor module is an AMP and the peripheral unit is a disk drive 111. As shown in the dotted line rectangle, however, this controller or interface may alternatively be a channel interface if the processor module is to serve as an IFP. In this instance the channel interface would communicate with the channel or bus of a host system. Inasmuch as conventional controllers and interfaces may be used in the microprocessor system 103,they need not be further described.

It is noteworthy that it can be shown to be advantageous to use one disk drive per microprocessor, in terms of both cost and performance. This is true as to data base machines in general, even though there may be benefit at times in arranging one microprocessor so as to have access to a number of secondary storages. The diagram omits, for purposes of brevity, the incorporation of other subsystems that would typically be used, such as interrupt controllers that are supplied by semiconductor manufacturers for use in conjunction with their systems. Those skilled in the art will recognize the importance of a suitable scheme for distribution of electrical power to the processor modules to attainment of the full degree of redundancy and reliability the invention can provide.

The peripheral controller 109 and the channel interface depicted as an option in the microprocessor system 103 correspond to the IFP interface and disk controller in FIG. 1. The high speed RAM 26 of FIG. 1, however, actually comprises first and second H. S. RAMs 26', 26" respectively, each of which through time multiplexing is effectively a three-port device coupled to the microprocessor bus system at one of its ports (designated C). Each of the H. S. RAMs 26', 26" cooperates respectively with a first or second network interface 120, 120', providing communication with the first and second networks 50a and 50b (not shown in FIG. 8) respectively at an input (receive) port A and output (send) port B. With these redundant systems, only the second network interface 120' and the second H. S. RAM 26" need be described in detail. The network interfaces 120, 120' are further shown and described in conjunction with FIG. 13, but can be generally subdivided into four principal parts:

Input register array and control circuits 122 coupling the ten input lines from the second network 50b bus and address bus.

An output register array and control 124 coupling the output lines of the second network 50b to the interface data and address busses and to the B port of the second H. S. RAM 26".

A microprocessor bus interface and control 126 coupled to the interface address and data busses and to

the A and B ports of the H. S. RAM 26".

A clock generator 128 that receives the word clock from the network and generates synchronized, properly phased clocks for controlling the interface 120'.

The second network interface 120' and H. S. RAM 26" cooperate with the microprocessor system 103 in coordinating data transfers between the high speed network and the relatively, slower speed microprocessor and also provide queuing of messages between these different systems. The microprocessor bus interface and control 126 may be referred to as performing read/write (R/W) functions with the microprocessor system which (at least with the Intel 8086) has the capability of writing directly into and receiving data from the H. S. RAM 26".

Although the IFP and AMP systems are functionally alike, there can be a substantial disparity in the sizes of the incoming message storage and outgoing message storage in the H. S. RAM 26", as between the IFP and the AMP. In a relational data base system, the IFP has a large incoming message space in the H. S. RAM 26", in order to receive new messages from the high speed network so that the needs of the host computer may be serviced by constant usage of the network. In the AMP the reverse is true, because more storage space should be available to send processed message packets to the high speed network. The H. S. RAM 26" functions with the main RAM 107 in the microprocessor system 103, which has message buffer sections for each network.

The allocation of system address space in the main RAM 107 for the microprocessor system 103 is shown in FIG. 9, to which reference should briefly be made. It is conventional in having addresses devoted to system random access functions to provide an expansion space for use in the event the random access capacity is increased, an I/O address space and an address space reserved for ROM and PROM (including EPROM) functions. In addition, portions of the system address space are reserved for message packets from and to the first and second high speed RAMs 26', 26" respectively. This provides greater flexibility in the system operation, inasmuch as even though the microprocessor 105 can address the H. S. RAMs 26", the main RAMs 101 assure greater freedom from software and hardware interdependence.

It has been stated, referring again to FIG. 8, that the bidirectionally accessible H. S. RAMs 26" are organized in such fashion that they perform central functions in multiprocessor mode control, distributed updating and the management of message packet flow. For these and other purposes, the H. S. RAM 26" is divided into a number of different internal sectors. The relative disposition of the different sectors shown in FIG. 8 is used throughout the different processor modules in the system, and the specific addresses that designate the limits of the sectors refer to those used in an actual system, It will be appreciated that the sizes of these sectors of memory and their relative disposition are widely variable dependent on the specific system context. Sixteen bit memory words are employed in this example. The selection map and response directory are dedicated lookup tables of the type that may be written in once during initialization, while the transaction number section provides a dynamically revisable lookup table.

The selection map section of memory starts with location 0 but is based upon the use of four different maps used in interrelated fashion within the memory section. The destination selection word (DSW) that is contained within the message packet is used cooperatively with the dedicated selection maps in the H. S. RAM 26". The destination selection word, comprising 16 total bits, has a map address in 12 bit positions, and map selection data in the four other bits. Each of the first 1024 16 bit memory words of the H. S. RAM contains four map address values. The address value specified by the DSW provides, with a single memory access to the H. S. RAM, map bits for all four maps, while the map selection bits in the DSW determine which map is to be used.

FIG. 15 shows the conceptual organization of the map section as if each map had physically separate 4096-by-1-bit RAM. As a matter of implementation convenience all map data is stored in a single portion of the H. S. RAM, as shown by FIG. 8. The DSW Management Section 196 (FIG. 13) controls multiplexing of four bits from each of the four maps of FIG. 15 from one 16-bit word of H. S. RAM. Those skilled in the art will recognize the advantage of the scheme in that the maps can be initialized by the processor by the same means as used to access other parts of the H. S. RAM.

There are also three different classes of destination selection word that are used, and the selection map locations are correspondingly divided into a hash selection portion, a class selection portion and a destination processor identification (DPID) selection portion.. The DPID specifies whether the processor 105 is the specific one for which the message packet is intended, whereas the class selection portion specifies whether or not the processor is one of the number of processors in a particular process class that is to receive the message packet. The hash values are stored in accordance with the manner in which the data base is distributed throughout the relational data base system, following a predetermined algorithm for the particular relations and method of scatter storage that are employed. The hash value in this instance can designate the processor either as having primary or backup responsibility for the data. Thus the selection maps provide a technique for directly addressing the H. S. RAM 26" so as to determine processor destination. This function complements the broadcasting of prioritized messages to all network interfaces 120' and enables local accessing without interruption of microprocessor 105 status.

A separate section of H. S. RAM 26" serves as a pivotal means for checking and controlling globally distributed activities. Transaction numbers (TNs) are assigned to various ones of the processes that are sent on and received from the network 50b, as discussed above and shown in FIG. 3. TNs within messages are retained as global transaction identities as each microprocessor system 103 independently performs the subtasks accepted by it. The block within the H. S. RAM 26" that is dedicated to a number of available transaction number addresses contains status entries that are locally controlled and updated by the microprocessor system 103 as these subtasks are performed. The TN is used in a number of different ways, both locally and globally, in performing intercommunication functions. The transaction number is used to identify subtasks, to call forth data, to provide commands, to control message flow and to characterize the dynamics of a global process. Transaction numbers may be assigned, relinquished and changed in tile course of global communication. These aspects are explained more fully in the following description.

The most complex, but perhaps the most dramatic, aspect of the TN is the capability it affords, with the sort network, for distributed updating of the status of local processors as to a given control process. Each control process (i.e., task or multiprocessor activity) has its own TN.

Readiness state values are held in the transaction number section of H. S. RAM 26" and are locally modified under control of the microprocessor system 103. The microprocessor 103 can initialize the appropriate entry (e.g. SACK/Busy) in the response directory (address 050D (hex)) of FIG. 10, and enter the SACK/Busy status by transferring the exact image thus reproduced to the H. S. RAM 26". An entry at a TN address is accessible to the network 50b via the interface 120', at the A and B ports of the H. S. RAM 26". Queries are made using a Status Request message containing the status request command code (see FIG. 11) and TN. The interface 120' uses the content of the designated TN to reference the response directory which contains a properly formatted response message. A global status query as to a given TN, when received at the second network interface 120' elicits a direct response that is solely hardware controlled. No prefatory communication is needed, and the microprocessor system 103 is not interrupted or affected. However, the microprocessor 103 can assure against interruption when setting the status by transmitting a LOCK indication to the interface 120', which communicates the Lock word derived from 0501 (hex) until removed at a later time.

The word format of the readiness states is shown by the seven states from "busy" to "initial" in FIG. 12, which depict the useful variants employed in a practical system, More or fewer status variants may be used, but these seven states provide versatile and comprehensive control. It is the responsibility of the microprocessor system continually to update status levels for different TNs in the H. S. RAM 26' to reflect availability for, or progress toward the completion of a subtask. Such updates are made simply by writing into the TN address in the H. S. RAM. 26", using the formats shown in FIG. 12.

In FIG. 10 each status response is accompanied by an initiating status acknowledgment command code (SACK), from 05 to 0D (hexadecimal). The SACK responses sent to the network are essentially the command codes of FIG. 10, the numeric portions of the word formats of FIG. 12, and an originating processor ID (OPID), as seen in FIG. 11. The SACK responses thus define a consecutive priority subgrouping within the overall coherent priority scheme shown in FIG. 11. The OPID is significant in the priority scheme, because if a number of processors are working on a TN but are "Busy", the highest priority message that is broadcast is determined by the OPID. Transfers and system coordination can be based on this data.

The SACK message priority schedule, the simultaneous responses from a number of microprocessor systems 103, and the dynamic prioritizing in the network 50b, enable the status of global resource as to a given task to be determined in a vastly improved manner in comparison to prior art systems. The response given is unambiguous, cannot represent an invalid state and requires no software or local processor time. Thus Deadlock cannot arise due to repeated requests interfering with task performance, for example. Numerous multiprocessor options can be used at the different status levels. It is unique that the local processors can continue to operate independently and that a single query derives a global, prioritized, response.

Some specific discussion of the successive states depicted in FIG. 12 may be useful. The "busy" and "waiting" states connote successively more complete phases as to the assigned subtask, the latter identifying a condition in which a further communication or event is required. These states exemplify the "elevation" of the status of the TN until it reaches a level at which a message packet corresponding to the TN can be sent or received.

When a message packet is to be sent or received, however, a different feature of the TN comes into play, namely its capability for message control. When the microprocessor system 103 has a message for transmission, the status indication becomes "send ready". The microprocessor system 103 not only updates status but it also enters a "next message vector" value in the H. S. RAM 26", using the word format of FIG. 12. This entry defines the location at which the corresponding output message may be fetched from H. S. RAM 26". This vector is used internally in the network interface 120' in chaining together output messages pertaining to a specified TN.

A related function is performed during the "receive ready" state, in which the TN storage location retains an input message count from the microprocessor system 103 as to the number of messages that may be received pertaining to a given TN. This count can be decremented until reduced to zero as successive input messages are transferred. At zero, no more messages can be received and an overrun condition can be indicated. This enables the TN to be used in pacing transfer between the network 50b and the microprocessor system 103.

Locally, at each processor the TN is retained in the send and receive messages during processing, as a constant and uniform reference throughout the system. The TN0 or default state also provides a local command to identify the fact that a message is to be used in a non-merge mode.

From the global standpoint, moreover, the distinction between TN0 and TN>0 values establishes one of the command functions for which the TN is used. The merge/non-merge characterization thus inherent in each message packet, provides a valuable systems approach to prioritizing and sorting messages. Similarly, the "Assigned", "Unassigned", "Non-Participant" and "Initial" status are used to fulfill global intercommunication and control functions. The "Unassigned" state is one in which the processor has previously relinquished a TN, so that it must receive a new primary message reactivating the TN. If the processor indicates "Unassigned" when it should be "Assigned", this establishes that the TN was not properly entered and that corrective action should be taken. When a TN is "Assigned" where it should be "Unassigned", this may indicate a faulty transfer or a competition between two processors for a new TN. Neither "Assigned" nor "Unassigned" is treated as a readiness state, inasmuch as the processor has not yet undertaken work on the TN at these stages.

The "Initial" and "Non-Participant" states are also significant in global resource terms. A processor which comes on line and which therefore must be brought into the system is in the "Initial" state, which indicates that administrative steps are needed to bring the processor on line. Those processors which are "Non-Participants" in a given task do not need to do any processing locally, but must keep track of the TN so as not to inadvertently use it in an erroneous manner.

The dedicated directory or reference section of the H. S. RAM 26", referring again to FIG. 10, also includes other types of prioritized messages for use in generation of responses by hardware. NA (not assigned) entries are held available for future use. Three different types of NAK responses (Overrun; TN Error; Locked) are of lowest data content and at highest priority levels because they identify error conditions. The SACK responses are followed, in order of decreasing priority, by the ACK response and the NAP (not applicable processor) response. In the present implementation, two response command codes are not assigned (NA) and are available for future use. This directory can be initialized by software and is used by hardware to quickly and flexibly,generate any of the range of response message texts.

A separate portion of this directory is used to store TOP, GET, PUT, and BOTTOM addresses or pointers related to the functioning of circular buffers for input messages and completed output message pointers. These pointers function in conjunction with the dedicated sectors of H. S. RAM 26" devoted to input message management and output message management respectively. For incoming messages, a circular buffer scheme is used, with "TOP", stored in the directory section of the H. S. RAM 26", being a variable address defining the upper position for incoming messages. The PUT address, also stored in the directory section, defines where the circuits are to store the next message that is received. The GET address is provided and kept updated by software to enable the hardware to recognize the location at which the software is emptying the buffer.

The incoming message buffer is managed by setting PUT at: the bottom of the buffer and starting with the GET address equal to TOP. The rule of operation assured by software is that GET is not set equal to PUT, which would create an ambiguous condition. As messages are entered into the incoming message buffer in H. S. RAM 26", the message length-value in the message itself establishes the beginning point of the next message and the PUT address stored in the directory is then changed to indicate where the next succeeding message is to be received in the buffer. Incoming messages can thus be fetched by the microprocessor system 103 as its capabilities permit.

Data in the output message space within the H. S. RAM 26" is utilized in conjunction with the output message complete vectors held in a separate circular buffer and the next message vector in the H. S. RAM 26". Messages can be assembled and stored in arbitrary locations, and related messages can be chained together for transmission on the network. In the directory section of the H. S. RAM 26", TOP, BOTTOM, PUT and GET addresses are entered and updated as previously described to maintain active present references to locations within the output message complete buffer. The message complete vectors constitute addresses which reference messages in the output message space that have been successfully transmitted as indicated by a response received. As described below, the system enables the microprocessor system 103 to enter output messages readily, but to handle complex linkage vector sequences in orderly fashion so that output message space is efficiently used and message chains can be transmitted.

The protocol of FIG. 11, which was previously discussed as to responses, is continuous with respect to the primary messages as well. Response messages are given in sequence, the hexadecimal command codes being stated in ascending order. In the primary message grouping, the stop merge (also the base or non-merge control) message is of lowest data content and highest priority. This message constitutes a control communication that terminates merge mode within the network and at the processor modules.

A substantial number of-different types of primary data messages can be used in ascending priority, and can be categorized in priority order based upon application and system requirements. As mentioned, continuation messages,can have higher priority so as to maintain continuity with the preceding message packet to which they relate.

The last grouping of four primary messages in FIG. 11 comprise, in descending order of priority, the status request message, which is the only type of status message needed to obtain the status response, control messages calling for "relinquish TN" and "assign TN", and, of lower priority, a "start merge" control message.

This system permits versatile operation, as will be evident from the more detailed examples given hereafter. A processor module operates on a present transaction number (PTN hereafter), whether this is externally commanded from the network or generated internally in the course of successive operations. When merge operations are being carried out, the processor modules utilize the global reference or transaction identity defined by the TN. Starting, stopping and restarting of merge operations utilizes only simple message changes. When subtasks do not require messages to be merged, or message packets are generated that have no specific relationship to other messages, they are queued to an output against TN0 and are transmitted when the base or default condition defined by Present Transaction Number (being 0), holds true. TN0 condition enables messages to be queued for transmission whenever a merge mode is not utilized.

NETWORK INTERFACE SYSTEM

FIG. 13, to which reference is now made, depicts in further detail one example of interface circuits useful in systems in accordance with the invention. This section of the description includes a number of detailed features that are not necessary to an understanding of the invention but are embedded in a practical system example and are therefore included to place the examples more firmly in context. Specific gating arrangements and details not the subject matter of the invention and involving well known expedients for which many alternatives are available have been omitted or abbreviated. FIG. 13 is an elaboration of the second network interface 120' and the H. S. RAM 26" from FIG. 8. The interfaces 120 for both networks function in like fashion and thus description of one will suffice.

In FIG. 13, inputs from the particular active logic network 50 associated with the interface are applied at network message management circuits 140 via a multiplexer 142 and known parity check circuit 144. The multiplexer 142 is also coupled to the microprocessor system data bus, enabling access to the message management circuits 140 via the bus, This feature permits the microprocessor system to operate the interface in a step-by-step test mode, transferring data as if the interface were on line to the network. Inputs from the networks are applied to a receive network data register 146, both directly in a first section and through a receive byte buffer 148 which thereafter enters the byte into a different section of the receive network data register 146. Consequently, both bytes of each word received are entered into and held available at the receive network data register 146.

Output messages for transmission are entered into a send network data register 150, while a parity bit is added in a conventional parity generator 132. Messages are sent to the associated network from the network message management unit 140, or (where the test mode is to be used) to the microprocessor system data bus. For message management purposes within the interface, the send message format in the random access memory 68 comprises identifying data as well as message data. As seen in FIG. 21A, command, tag, key and DSW can all be incorporated along with primary data that is to be transmitted.

The organization shown in FIG. 13 is essentially the same as that shown in FIG. 8, which illustrates the interface data bus and address bus as separately coupled to input ports A and B at the H. S. RAM 26', while the address and data busses of the microprocessor system 103 are illustrated as coupled to a separate C port. In actuality, as seen in FIG. 13, this separate bidirectional access is achieved by time division multiplexing of input and output address functions within the interface and at the H. S. RAM 26". The microprocessor data and address busses are coupled to the interface busses via gates 145, 149 respectively so that the microprocessor can operate asynchronously on its own internal clock.

The timing system used is based upon clock pulses, phase control waveforms and phase subdivision waveforms generated by interface clock circuits 156 (FIG. 13) and having the timing relationships shown in FIG. 14, to which reference is also made. The interface clock circuits 156 receive the network word clock from the nearest node and a phase locked clock source 157 includes means for maintaining zero time skew as previously described in conjunction with FIG. 4. The nominal network word clock rate of 240 ns. in the network is subdivided in time in the interface clock circuits 156 because a frequency multiplier (not shown in detail) held in phase locked relationship provides a faster clock, defining reference period of 40 ns. duration (shown in FIG. 14 as PLCLK). The basic word period is defined by the opposite-going half cycles of a cyclic signal designated CLKSRA having a total 240 ns. duration. Two other signals of like frequency and duration are generated from PLCLK by frequency dividers 158 at times delayed by one and two cycles of PLCLK respectively from CLKSRA, and these are designated CLKSRB and CLKSRC respectively.

From these signals, control logic 159 develops timing waveforms, designated IO GATE, RECV GATE and SEND GATE, denoting successive equal thirds of the word period. These intervals are appropriately.,referred to as IO, receive and send phases. The phases defined by the gate signals are each further subdivided into,two equal half intervals by IO CLK, RECV CLK and SEND CLK signals which define the last half of each of the phases. Byte clocking functions are governed by BYTE CTRL and BYTE CLK signals.

The IO, RECV and SEND phases provide the basis for time division multiplexed operation of the random access memory 168 and the microprocessor system busses. The interface can receive or send no more than one word per word period from or to the high speed network, and receiving and sending are obviously never concurrent. Transfer rates to and from the microprocessor system are substantially lower, but even if equal the capacity of the interface circuits would not be taxed. The interface system arrangement hinges in large part on direct access to the random access memory 168, and substantially obviates the need for internal processing or software. Thus as the system cycles through the successive phases in each word period, words are successively advanced in non-conflicting progressions along their predetermined signal paths for the different functions taking place. For example, sending a message to the bus may be interleaved with reception of a message from the microprocessor, each being interchanged using a different portion of the memory 168.

Intercommunication between the data bus for the microprocessor system and the network interface is effected in IO (which may also be referred to as Read/Write) management circuits 160. A write gate 162, for words from the microprocessor system, and a system read register 164, for transferring words to the microprocessor system provide the coupling between the microprocessor's bus and the bus interface to the network interface.

A memory address register 165 and parity generator and check circuits 166 are also incorporated in the network interface subsystem. In this example the high speed storage comprises a 4K word.times.17 bit random access memory 168, the internal subdivision of which and the use of dedicated memory portions within which have previously, been described. The size of the random access memory can readily be reduced or expanded to meet the needs of a particular application.

Receive message buffer management circuits 170 are coupled to the data bus of the microprocessor and in turn are coupled to the address bus for the memory 168. The term "received messages" refers both to the incoming messages from the network for entry into the circular buffer at a location referred to as PUT and to the subsequent transfer of that message to the microprocessor system when a GET value specifies where the system is to sequence in extracting a received message for transfer to the microprocessor system. Address values for accessing the random access memory 168 are entered into a GET register 172, a TOP register 174, a PUT counter 175 and a BOTTOM register 176 respectively. The PUT counter 175 is updated by incrementing from an initial position defined by the BOTTOM register 176. The TOP register 174 provides an opposite limit reference. Both TOP and BOTTOM may be manipulated by software control to modify both the size of the receive buffer and the absolute location with H. S. RAM. When the contents of the PUT register equal the contents of the TOP register, the PUT register is reset to the contents of the BOTTOM register, thus effecting the circular nature of usage of the buffer. The GET, TOP and BOTTOM registers and the PUT counter are shared In managing both the incoming message and output message complete circular buffers.

Entries are made in the GET register 172 under software control, because the length of the then active message in the buffer determines the next address. Comparison circuits 178, 179 coupled to the outputs of the GET register 172, the PUT counter 175 and the TOP register 174 are used in detecting and indicating overrun conditions. An overrun exists when the GET and PUT settings are set equal or when GET is attempted to be set