Priority scheduling

Fault-tolerant multi-computer system

4356546

Abstract

A Fault-Tolerant Multi-Computer System for control applications is disclosed. The system has a plurality of Computers (10a-10n), each having an assigned set of tasks which it is capable of executing. No one Computer in the system acts as a master and no one Computer executes all of the tasks. Communication between the Computers is by individual communication links (16, 18, 20) over which each Computer sends information directly to all other Computers in the system. Each Computer comprises an Applications Computer (100) and an Operations Controller (200). The Operations Controller receives messages over the communication links and selects, from the assigned tasks, the tasks to be performed by the associated Applications Computer. Each Operations Controller includes a fault handler which checks the messages received from the other Computers. The fault handlers send and receive error messages, over the communication links, to assist in the identification of a faulty Computer. Subsequent messages from the Computers deemed to be faulty are ignored, and the tasks assigned to the faulty Computer are executed by alternate Computers in the system.


Claims

What is claimed is:

1. A fault-tolerant multi-computer system architecture, responsive to intercomputer messages and to inputs from external sources for executing a predetermined set of tasks to produce an output to at least one external device, comprising:

a plurality of computers for collectively executing the predetermined set of tasks in a coordinated manner to produce outputs to the at least one external device in response to the inputs from the external sources and the intercomputer messages, each of said computers having an assigned subset of the tasks which it is capable of selecting and executing in a predetermined order of priority, each task in said predetermined set of tasks being included in more than one of said assigned subset so that each task is capable of being selected and executed by more than one computer; and

a like plurality of communication links, one associated with each computer, each communication link transmitting only the intercomputer messages sent by the associated computer to all of the computers which require any message generated by the associated computer; and

wherein each of said plurality of computers comprises:

operations controller means for controlling the operation of its own computer in coordination with like operations controllers in the other computers, each operations controller including: receiver means for receiving intercomputer messages, fault handler means for checking said intercomputer messages to detect the faulty operation of any computer in the system, and to exclude from further processing the messages received from faulty computers, scheduler means responsive to the receipt of all the data variables for the execution of at least one of its assigned tasks for selecting from its assigned subset the tasks to be executed, task communicator means responsive to the task selected by the scheduler means for assembling the data variables required for the execution of the selected task and transmitter means responsive to said fault handler means, said scheduler means and said task communicator means for sending intercomputer messages to all of the computers in the system, said messages containing an identification of the faulty computers, identification of the tasks it has selected, and the values of the data variables resulting from the execution of the selected tasks required for the execution of a subsequent task; and

applications computer means for executing the tasks selected by said scheduler means using the data assembled by said task communicator means.

2. The fault-tolerant multi-computer system of claim 1 further including combiner/voter means for combining the corresponding outputs from said plurality of computers to produce a single combined output, and to produce a voted output indicative of the output of the majority of computers when more than two computers execute the tasks generating said output to the at least one external device.

3. The fault-tolerant multi-computer system of claim 1 wherein each of said applications computer means comprises:

program memory means for storing a set of instructions for the execution of each task in said assigned subset of tasks;

central processing means, responsive to the task selected by said scheduler means, for executing the set of instructions stored in said program memory means corresponding to the selected task using values of said data variables assembled by said task communicator means, said central processing means generating values for said data variables through the execution of each task;

data storage means for storing intermediate variable and fixed data values as required by said sets of instructions for each task in said assigned subset of tasks; and

input/output means for transferring to said central processing means the inputs from the external sources required for the execution of the selected tasks and for outputting to the at least one external device the data values resulting from the executed tasks, when instructed by the central processor means.

4. The fault-tolerant multi-computer system of claim 1 wherein said receiver means comprises a plurality of receivers, each receiver connected to one communication link and each receiver receiving only the messages sent by the computer associated with the communication link.

5. The fault-tolerant multi-computer system of claim 4 wherein said receiver means further includes a receiver connected to the communication link associated with its own computer and receiving only the messages sent by its own computer.

6. The fault-tolerant multi-computer system of claim 1 wherein said fault handler means includes:

checker means for checking each message received by said receiver means and for generating an error signal identifying each received message containing an error and the computer which sent the message; and

fault tolerator means responsive to the error signals generated by said checker means and messages received from other computers identifying which computers are considered to be faulty, for passing on for further processing only error free messages received from non-faulty computers, said fault tolerator means further including means responsive to said error signal for enabling said transmitter means to send messages identifying the computers deemed to be faulty.

7. The fault-tolerant multi-computer system of claim 1 wherein said fault tolerator means further includes means for cancelling the fault status of a computer previously identified as faulty when said computer sends error free messages for a predetermined period of time.

8. The fault-tolerant multi-computer system of claim 1 wherein said scheduler means comprises:

status table means for storing current status information for each task capable of being executed by the computer, said status table means storing said tasks in their order of execution priority;

means for recording in said status table means the tasks selected, and the values of the data variables contained in messages received from all the computers in the system, said means for recording further including means for generating a dispatch signal when all the data required for the execution of a task has been recorded as received;

system status monitor means responsive to said messages identifying which computers are considered to be faulty for recording in said status table means as unselected each task previously recorded as being selected by the computer identified as faulty, said system status monitor means further including means for generating a dispatch signal in response to the unselecting of a task in said status table;

scheduling status table means for storing the tasks selected in the order in which they are selected; and

task selector means responsive to said dispatch signals for selecting from the status table means the highest priority task ready for execution, said task selector means further including means for recording said selected task in said scheduling status table means, means for sending a dispatch task signal identifying the selected task to said task communicator means and means for enabling said transmitter means to send a message to all of the computers identifying the task selected.

9. The fault-tolerant multi-computer system of claim 6 wherein each of said messages has a predetermined format containing information identifying the message type and the computer than sent the message, said checker means includes a message format checker for checking the format of each received message to generate said error signal when an error in the message format is detected.

10. The fault-tolerant multi-computer system of claim 9 wherein each of said operations controller means further sends started/completed messages identify the tasks completed and the new task started by its applications computer means, said checker means further includes execution time checker means responsive to said completed/started message for checking the execution time of each task executed by each applications computer means in the system to generate said error signal when the execution time exceeds predetermined limits.

11. The fault-tolerant multi-computer system of claim 10 wherein said intercomputer messages is a task data value message containing the value of a data variable, said checker means further includes reasonable limits checker means for checking the value of the data variable contained in each task data value message against predetermined maximum and minimum values to generate said error signal when the value of the data variable in the received task data value message is outside said predetermined maximum and minimum values.

12. The fault-tolerant multi-computer of claim 10 wherein said intercomputer messages include a redundant data value message containing a value for data variable redundantly computed by more than two computers, said checker means further includes redundant value voter means responsive to the redundant data value messages received from said more than two computers for finding a voted data value when the value of the data variable contained in a predetermined number of redundant data value messages agree to generate said error signal identifying each computer which sent a redundant data value message containing a value which did not agree with the voted data value.

13. The fault-tolerant multi-computer system of claim 10 wherein said intercomputer messages include a task selected message containing an identification of the task selected by the computer sending the message, said checker means further includes message sequence checker means responsive to task selected and completed/started messages for checking the proper sequence in which the tasks are selected and executed by each computer to generate said error signal when the selection and execution of tasks does not follow a correct sequence.

14. The fault-tolerant multi-computer system of claim 13 wherein each computer is assigned a priority for the execution of tasks and wherein said scheduler further includes means for unselecting a previously selected task selected by a higher priority computer and selecting a new task in response to a message received from a computer having a higher priority identifying that the higher priority computer has selected the same task and means for sending a task unselected/selected message identifying the task unselected and the new task selected, said message sequence checker is responsive to said task unselected/selected message and said completed/started message.

15. The fault-tolerant multi-computer system of claim 1 wherein said fault handler means further includes synchronizer means responsive to intercomputer messages for synchronizing the operation of its own computer with all of the other computers in the system.

16. The fault-tolerant multi-computer system of claim 14 wherein said fault handler further includes synchronizer means responsive to intercomputer messages for synchronizing the operation of its own computer with all of the other computers in the system.

17. The fault-tolerant multi-computer system of claims 15 or 16 wherein said synchronizer means comprises:

timer means for repetitively generating sampling periods and a sampling number indicative of the current sampling period;

means for enabling said transmitter means to send sampling number messages at the end of each sampling period containing said sampling number indicative of said current sampling period;

voter means for finding a voted sampling number indicative of a sampling number contained in a predetermined number of sampling number messages received from like synchronizer means in the other computers;

means responsive to the finding of a voted sampling number for correcting said generated sampling number to agree with said voted sampling number; and

means responsive to time within each sampling period when said voted sampling number is found for correcting the duration of the sampling period currently being generated to cause the end of the current sampling period to coincide with the end of the sampling periods in said like synchronizers in the other computers.

18. The fault-tolerant multi-computer system of claim 17 wherein said synchronizer further includes means response to the sampling numbers contained in the sampling number messages received from the other computers for generating said error signal identifying each computer which sent a sampling number message containing a sampling number which disagrees with said voted sampling number.

19. The fault-tolerant multi-computer system of claim 8 wherein said applications computer generates a task done signal each time it completes the execution of a task, said scheduler means further includes task releaser means for sending a release task signal to the task communicator means in response to said task done signal, said release task signal identifying the selected task stored in said scheduling status table means.

20. The fault-tolerant multi-computer system of claim 17 wherein said fault tolerator means further includes means for cancelling the fault status of a computer previously identified as faulty when said computer sends error free messages for a predetermined period of time.

21. The fault-tolerant multi-computer system of claim 19 wherein each task communicator means includes means for enabling said transmitter means to send a task completed-started message identifying the task completed and the new task started by its applications computer each time the execution of a task is completed and a new task started, said scheduler means further includes completed task recorder means for recording in said status table means the completed status of the tasks identified in said task completed/started messages.

22. The fault-tolerant multi-computer system of claim 1 wherein said task communicator means comprises:

a data values table storing with respect to each data variable the values of that data variable;

store data value means for recording in said data values table the values of the data variables contained in the error-free messages received from non-faulty computers passed on by said fault handler means;

task input means for communicating from said store data values table to said applications computer means the values of the data variables required for the execution of the task selected by said scheduler means;

task results sender means for enabling said transmitter to send data value messages to all of the computers containing the values of the data variables resulting from the task executed by the applications computer means; and

task output means for communicating from said applications computer means to said task results sender means the values of the data variables resulting from the execution of each task.

23. The fault-tolerant multi-computer system of claim 22 wherein said task input means comprises:

a task input table for storing the values of the data variables required for the execution of tasks;

task dispatcher means responsive to the task selected by said scheduler means for recording in said task input table the data values stored in said data values table required for execution of the selected task; and

task releaser means responsive to said applications computer means completing the execution of a preceeding task for communicating from said task input table to said applications computer means the values of the data variables stored in said task input table.

24. The fault-tolerant multi-computer system of claim 16 wherein each task communicator means includes means for enabling said transmitter means to send a task completed/started message identifying the task completed and the new task started by its applications computer means, said scheduler means comprises:

status table means for storing current status information for each task capable of being executed by said applications computer, said status table means storing said tasks in their order of priority;

record data value means receiving said data value messages for recording in said status table that the data identified in the messages is currently available for execution, said record data value means further including means for generating a dispatch signal when all the data required for the execution of a task has been recorded as received;

completed task recorder means responsive to said task completed/started messages for recording in said status table means that the task identified as completed in the received completed/started message has been completed;

25. The fault-tolerant multi-computer system of claim 24 wherein said applications computer means generates a task done signal each time it completes the execution of a task, said scheduler means further includes task releaser means for sending a release task signal to the task communicator means in response to said task done signal, said release task message identifying the task stored in said scheduling status table means which has been selected for execution.

26. The fault-tolerant multi-computer system of claim 25 wherein said system has a normal mode of operation, and at least one degraded mode of operation, said system status monitor means further generates a mode signal indicative of the system's current mode of operation determined from the number of computers considered to be faulty and wherein said status table means comprises a plurality of task status tables, each task status table corresponding to one of said modes of operation, said mode signal identifying for said task selector means the task status table from which the tasks are to be selected.

27. The fault-tolerant multi-computer system of claims 22 or 23 wherein said task output means comprises a task output table storing the value of the data variables resulting from the task currently being executed; and

wherein said task results message sender means enables said transmitter to send said messages containing the values of the data variables stored in said task output table after the execution of the current task is completed.

28. The fault-tolerant multi-computer system of claim 25 wherein said task communicator means comprises:

a data values table storing the values of the data variables;

store data values means for recording in said data values table the value of the data variables contained in the messages received from all the computers;

task input means for communicating from data values table to said applications computer means the values of the data variables required for the execution of the task identified in the release task signal;

task results sender means responsive to said release task signal for enabling said transmitter means to send to all of the computers data value messages containing the values of the data variables resulting from the execution of the task completed by said applications computer means; and

task output means for communication from said applications computer to said task results sender means the values of the data variables resulting from the execution of each task.

29. The fault-tolerant multi-computer system of claim 28 wherein said tank input means comprises:

a task input table for storing the values of the data variables required for the execution of at least two tasks;

task dispatcher means responsive to said release task signal for recording in said task input table the values of the data variables stored in said data values table required for the execution of the identified task; and

task releaser means responsive to said release task signal for communicating to said applications computer means the values of the data variables stored in said task input table recorded in response to the preceeding release task signal.

30. The fault-tolerant multi-computer system of claim 29 wherein said task output means comprises a task output table storing the values of the data variables resulting from the task currently being executed by the applications computer means; and

wherein said task results message sender means enables said transmitter to send said data value messages containing the values stored in said task output table in response to said release task signal.

31. The fault-tolerant multi-computer system of claims 22 or 28 wherein said task communicator means further includes watch-dog timer means responsive to said applications computer completing the execution of each task for monitoring the execution time of each task executed by said applications computer to generate an error signal when the execution time for a task is not within predetermined limits.

32. The fault-tolerant multi-computer system of claims 22 or 28 wherein said messages containing values of the data variables further contain sequence numbers indicative of the sequence in which the values were generated to distinguish data values resulting from the execution of the same task at different times and using different values of the data variables, said data values table stores values of each data variable for a predetermined number of different sequence numbers, said task dispatcher means includes means for computing the sequence numbers of the data variables required for the execution of the selected task to communicate only the values of the data variables to the applications computer means having the computed sequence numbers, and wherein said messages containing values of the data variables sent by said task results message sender means further include a sequence number corresponding to the execution number of the executed task.

33. A method for controlling the operation of each computer in a fault tolerant multiple computer system wherein the system includes a communication network whereby each computer can send error, data value, task selection and task completed/started messages to every other computer in the system and each computer has an operations controller and applications computer and wherein each computer further has a set of tasks it is capable of selecting and executing, comprising the steps of:

checking for error with said operations controller all messages received by each computer from all the computers in system to generate an error signal identifying each computer which sent the message containing an error;

sending error messages to all of the other computers in response to said error signals identifying as a faulty computer each computer which sent a message containing an error;

recording in said operations controller as faulty each computer which sent a message containing an error in response to said error signals and each computer identified in error messages received from a predetermined number of other computers to generate a fault status table;

discarding all messages containing errors and messages received from computers recorded as faulty in said fault status table;

recording in a status table contained within said operations controller the status information contained in the task data value, task selection and task completed/started messages which were not discarded, said status table listing said tasks and their associated status information in their order of execution priority;

detecting when all the information required for the execution of any task is recorded in said status table to generate an dispatch signal;

recording as unselected in said status table, each task which was previously recorded as selected by a computer which has subsequently been recorded as faulty in said fault status table;

generating by said operations controller a dispatch signal to signify the tasks unselected are ready for execution;

selecting by said operations controller from said status table the highest priority task ready for execution and not selected by another computer in response to said dispatch signal;

recording in said operations controller the selected task as selected in said task status table;

sending to all of the computers a task selected message containing the identity of the task selected;

generating in said operations controller a release task signal containing the identify of the selected task in response to the computer signifying it has completed the execution of a preceeding task;

recording in a data values table contained within said operations controller the value of the data variable contained in each non discarded data value message received from all the computers;

communicating from said data values table to the applications computer the value of the data variables required for the execution of the selected task in response to said release task signal;

executing in the applications computer the selected task using the communicated values of the data variables to generate values for new data variables;

sending to all of the computers by said operations controller, data value messages containing the values of the new data variables received from the applications computer; and

sending to all of the computers by said operations controller a task completed/started message identifying the computer, the task completed, and the new task started by the identified computer after the execution of the task is completed.

34. The method of claim 33 wherein said messages sent between the computers include sampling number messages containing a sampling number identifying the current sampling period of that computer, said method further includes the steps of:

sequentially generating with a sampling period timer sampling periods having a predetermined time duration;

storing a current sampling number identifying the current sampling period;

comparing the sampling numbers contained in the sampling number messages received from all the computers to find a voted sampling number having a value which is the same as the value of the sampling numbers contained in a predetermined number of sampling number messages;

recording said voted sampling number as said current sampling number;

comparing the time remaining in the current sampling period, when said voted sampling number is found, with predetermined maximum and minimum values to determine if the sampling period timer is synchronized with like sampling period timers in the computer which sent the sampling number messages from which the voted sampling number was found;

correcting the remaining time in the sampling period timer when the current remaining time is outside said maximum and minimum values to synchronize the sampling period timer with like sampling period timers in the other computers;

recording in said status table as ready predetermined tasks to be executed once during each sampling period in response to the finding of said voted sampling number;

incrementing by one the stored current sampling number and restarting the sampling period timer at the end of each sampling period;

sending sampling number messages containing the current sampling number incremented by one; to all of the other computers at the end of each sampling period in which a voted sampling number is obtained;

generating a restart signal at the end of each sampling period in which no voted sampling number is obtained; and

initializing the current sampling number to a predetermined number in response to said restart message.

35. The method of claim 34 wherein said method further includes the steps of:

comparing the voted sampling number with the sampling numbers contained in the sampling number messages received from the other computers to generate an error signal identifying each computer which sent a message containing a sampling number which disagrees with the voted sampling number;

recording in the fault status table as faulty in response to said error signals each computer which sent a sampling number message containing a sampling number which disagreed with the voted sampling number; and

sending error messages to all the computers in response to said error signals identifying each computer which sent a sampling number message containing a sampling number which disagreed with said voted sampling number.

36. The method of claim 34 wherein the computer is assigned at least one predetermined input task which samples the input data being received from at least one external source or at least one output task which outputs data to an external device, and wherein said predetermined input and output tasks are to be executed once during each sampling period said method further includes the step of generating an initiate input/output task signal recording said predetermined input and output tasks as ready in said status table.

37. The method of claim 33 wherein the computer is assigned at least one start-up task which performs predetermined functions required to facilitate the starting of the computer system when the whole system is being started or when the particular computer is being restarted said method further includes the step of detecting the first finding of a voted sampling number following the start up of the computer or the generation of a restart signal to generate an initiate start-up task signal recording said predetermined start-up task as ready in said status table.

38. The method of claim 33 wherein each computer has a plurality of receivers and each receiver can only receive messages from one associated computer, and wherein each message has a predetermined format which includes a message type code identifying the message type and a computer identification code identifying the computer which sent the message, said step of checking includes the step of checking the format of each received message, to generate an error signal when the message type code contained in the message is not a valid message type code and to generate said error signals when the computer associated with the receiver which received the message is not the same as the computer identified by the computer identification code contained in the message.

39. The method of claim 33 wherein said message type includes task data value messages said step of checking includes the step of comparing the value of the data variable contained in each task data value message with at least one predetermined limit value to generate said error signal when the value of the data variable contained in the message does not have a predetermined relationship to said predetermined limit value.

40. The method of claim 39 wherein said step of comparing compares the value of the data variable contained in each task data value message with predetermined maximum and minimum values to generate said error signal when the value of the data variable contained in the message is outside said predetermined maximum and minimum values.

41. The method of claim 33 wherein said messages include redundant data value messages containing values for data variables redundantly computed by more than two computers, said step of checking includes the steps of:

comparing the values contained in said redundant data value messages with the values received from other computers for the same data variable to find a voted data value when the values contained in a predetermined number redundant data value messages agree; and

comparing the voted data value with the value received in each redundant data value message containing a value for the same data variable, to generate said error signal identifying each computer which sent a message containing a value for the same redundantly computed data variable which disagrees with the voted data value.

42. The method of claim 33 wherein said step of checking includes the steps of:

monitoring the time required by each computer to execute the tasks identified in sequentially received task completed/started messages from the same computer; and

comparing the monitored execution time with at least one predetermined time to generate said error signal when the execution time exceeds said predetermine time.

43. The method of claim 42 wherein said messages include sampling number messages containing a sampling number identifying the current sampling period of the computer which sent the message, said method further includes the step of:

sequentially generating sampling periods having a predetermined time duration;

storing a current sampling number identifying the current sampling period;

comparing the sampling numbers contained in the sampling number messages received from all the computers to find a voted sampling number having a value which is the same as the value of the sampling numbers contained in a predetermined number of sampling number messages received from other computers;

recording said voted sampling number as said current sampling number;

comparing the time remaining in the current sampling period, when said voted sampling number is found, with predetermined maximum and minimum values to determine if the sampling period is synchronized with like sampling periods in the computers which sent the sampling number messages from which the voted sampling number was found;

correcting the remaining time in the sampling period when the current remaining time is outside said maximum and minimum values to synchronize the sampling period with like sampling periods in the other computers;

recording in said status table as ready predetermined tasks which must be executed in approximate synchronization with the other computers in the system in response to finding said voted sampling number;

comparing the voted sampling number with the sampling numbers contained in the sampling number message received from the other computers to generate said error signals identifying each computer which sent a sampling number message containing a sampling number which disagrees with the voted sampling number;

recording in the fault status table as faulty in response to said error signals each computer which sent a sampling number message containing a sampling number which disagreed with the voted sampling number;

sending error messages to all of the computers in response to said error signals identifying each computer which sent a sampling number message which disagreed with the voted sampling number;

incrementing by one the stored current sampling number and starting the sampling period at the end of each sampling period;

sending sampling number messages containing the current sampling number incremented by one to all of the other computers at the end of each sampling period in which a voted sampling number is obtained;

generating a restart signal at the end of each sampling period in which no voted sampling number is obtained; and

initializing the current sampling number to a predetermined number, in response to said restart message.

44. The method of claims 33 or 43 wherein said step of recording faulty computers in the fault status table further includes the step of storing for each computer recorded as faulty an elapsed time indicative of the time since the computer was last detected to be faulty; and

recording in the fault status table as no longer faulty each computer previously recorded as faulty whose elapsed time exceeds a predetermined time.

45. The method of claim 33 wherein said step of checking includes the step of checking the sequence in which the tasks are selected, started, and completed by each computer as identified in sequentially received task selected and task completed/started messages to generate said error signal identifying each computer whose scheduling sequence does not follow a correct sequence.

46. The method of claim 43 wherein the data variables required for the execution of any task may be generated at different times, and several values for the same data variable may be generated before the task is executed requiring an earlier generated value for that data variable, and wherein said step of sending task data value messages further includes sending a task data value message containing the value of the data variable and a sequence number indicative of the sequential order in which the value of the data variable was generated, and wherein said status table includes a plurality of entries for each task and each entry for the same task further storing a different execution number for that task, said step of recording in said status table includes the steps of:

identifying each task which requires the value of the data variables contained in each message;

computing an execution number for each identified task from the sequence number contained in the message;

searching the status table to find an entry for the identified task having the same execution number as the computed execution number;

recording in the entry for the identified task having the computed execution number that the value of the data variable contained in the received message has been received when the computed execution number is found;

searching when the computed execution number is not found the entries for the identified task to find the oldest execution number;

recording the computed execution number in the entry having the oldest execution number when the oldest execution number found is older than the computed execution number; and

recording that the value of the data variable contained in the message has been received in the entry in which the computed execution number is recorded; and

repeating the above steps for each identified task until the reception of the value of the data variable is recorded for every task which requires the value of the data variable contained in the data value message;

and wherein said data values table includes a plurality of entries for each data variable and each entry stores the value of the data variable and the sequence number contained in received data value messages, said step of recording in said data values table includes the steps of;

searching said data values table to find an entry for the data variable identified in the received message having the same sequence number as the sequence number contained in the received message;

recording the value of the data variable contained in the message in the data variable entry having the same sequence number when a data variable entry having the same sequence number is found;

searching when the sequence number contained in the message is not found the entries for the data variable identified in the message to find the oldest sequence number;

recording the sequence number contained in the message in the entry having the oldest sequence number when the sequence number contained in the message is newer than the oldest sequence number; and

recording the value of the data variable contained in the message in the entry in which the newer sequence number was recorded;

said step of communicating the values of the data variables to said applications computer comprising the steps of:

identifying from the task identified in said release task signal the data variables required for the execution of that task;

computing from the execution number of the identified task the sequence number of each data variable required for the execution of the identified task;

communicating the value of the data variable from the entry in the data values table having the computed sequence number to said applications computer;

repeating the above steps until the value of each identified data variable is communicated to the applications computer.

47. The method of claim 46 wherein said step of communicating the value of the data variable having the computed sequence number comprises the steps of:

copying the value of the data variable from the entry in the data values table having the computed sequence number into a task input table storing the values of data variables for at least two tasks in response to said release task signal; and

releasing to the applications computer the values of the data variables copied into the task input table by the preceeding release task signal in response to said release task signal;

wherein while the values of the data variables required for the execution of the task identified in the release task signal are being copied into the task input table the values of the data variables for the preceeding task simultaneously being released to said applications computer.

48. The method of claim 33 wherein each computer is assigned a priority for the execution of its assigned tasks said step of recording in said status table further includes the steps of:

comparing the task identified as selected in each task selected message with task selected but not started by the computer receiving the message;

comparing the priority of the computer which sent in the task selected message with the priority of the computer which received the message when both computers have selected the same task to identify which computer has the higher priority;

recording in said status table that the selected task is unselected by the receiving computer when the computer which sent the message has a higher priority than the computer receiving the message; and

selecting from said status table the highest priority task ready for execution and not selected by any computer, when the computer which sent the message has a higher priority than the computer receiving the message;

and wherein said step of sending a task selected message sends in response to the unselection of a task and the selection of a new task a task unselected/selected message further identifying both the task unselected and the new task selected by that computer.

49. The method of claim 47 wherein said step of copying the values of data variables in to the task input table further includes the step of copying into said task input table, from a private table storing the starting address in the applications computer's program memory where each tasks begins, the starting address for the task identified in said release task signal to generate a task input message containing both the starting address of the identified task and the values of the data variables; and

wherein said step of releasing releases said task input message to said applications computer.

50. The method of claim 33 or 49 wherein the system has a normal mode of operation, and at least one degraded mode of operation said method further includes the step of computing from the number of computers recorded as faulty in said fault status table a mode signal indicative of the system current mode of operation;

and wherein said status table comprises a plurality of task status tables, each task status table corresponding to one of said modes of operation, said step of selecting from said status table is further responsive to said mode signal and selects said tasks from the task status table corresponding to the mode of operation identified by said mode signal.

51. The method of claim 50 further including the step of monitoring the execution time of each task executed by its own applications computer to generate an error signal when the execution time for a task is not within a predetermined time limit.

52. The method of claim 51 wherein predetermined data variables are redundantly computed by more than two tasks executed by more than two computers, said step of sending data value messages sends two different types of data value messages, task data messages containing the values of the data variables singularly computed, and redundant data value messages containing the values of the data variables which are redundantly computed.


Description

CROSS REFERENCE

The disclosed invention is related to the commonly assigned co-pending applications Applications, Ser. Nos. 118,691, now U.S. Pat. No. 4,330,820, 118,693, now U.S. Pat. No. 4,323,966, 118,694, now U.S. Pat. No. 4,342,083, 118,811, now U.S. Pat. No. 4,321,666, 118,812, now U.S. Pat. No. 4,318,173, and 118,813 now U.S. Pat. No. 4,333,144 filed concurrently herewith.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention is related to Multiple Computer Systems, and in particular to Fault-Tolerant Multiple Computer Systems not having multiple Computers performing each system function.

2. Prior Art

The earliest attempts to produce Fault-Tolerant Control Systems provided redundant computers in which each computer simultaneously executed every task required for the control operation. Voting circuits monitoring the outputs of the multiple computers determined the "correct" system output, the "correct" system output being the output produced by the majority of computers. When a faulty computer produces an output which differs from the "voted" output, the differing output is discarded and does not affect the "voted" or "correct" output of the control system. In this type of Fault-Tolerant System, the failure of a computer may or may not be detected and that computer may or may not be turned "off."

This method, though highly successful, is expensive since it requires multiple equivalent computers, each simultaneously performing the same function. These systems require relatively powerful computers, since each computer has to perform every task required for the operation of the system.

As an alternative, a master-slave concept was introduced in which the operation of several computers was coordinated through a master control. The master designated which tasks were to be executed by the individual computers. This reduced the execution time of the control operation since the good computers no longer were required to execute each and every task. When a fault was detected in the operation of one of the computers, that computer was disconnected and the master distributed the tasks among the good or operative computers. The master-slave concept is dependent upon the continued operation of the master and if the master failed, the system failed. This situation may be rectified by using redundant masters, however, the increased cost of redundant masters limit the applicability of these types of systems to situations where the user is willing to pay for the added reliability, such as in space exploration, nuclear energy facilities, or any other situation where failure of the system would endanger lives.

Recent efforts to improve upon master-slave and redundant execution Fault-Tolerant Multiple Computer Systems are exemplified in the October, 1978 Proceedings of the IEEE, Volume 66, No. 10, which is dedicated to fault-tolerant control systems. Of particular interest are the papers entitled "Pluribus: An Operational Fault-Tolerant Multiprocessor" by D. Katsuki et al., pp. 1146-1159 and "SIFT: The Design and Analysis of A Fault Tolerant Computer for Aircraft Control" by J. H. Wensley et al., pp. 1240-1255. The Pluribus and SIFT control systems are believed to represent the present state of the art. The SIFT system uses redundant execution of each system task, and of the master control functions. The Pluribus system has a single "master" copy of most current information, which can be lost when a fault occurs. Such loss of current information can cause interruption of system operation for several seconds or minutes.

SUMMARY OF THE INVENTION

The invention is a fault-tolerant multiple computer system for executing a set of tasks. Each computer has an assigned subset of the set of tasks which it is capable of executing, and each task is included in more than one subset. No one computer in the system acts as a master and no one computer need be capable of executing all of the tasks. Communications between the individual computers is by means of individual communication links, one link associated with each computer, by means of which only the associated computer can send messages to all other computers in the system.

Each computer comprises an applications computer and an operations controller. The operations controller receives the messages from the other computers over the communication links. It also selects the tasks which its computer will execute, from its assigned subset of tasks. The applications computer executes the tasks selected by its associated operations controller. The results of the executed tasks are either applied to output devices or transmitted via its operations controller and its associated communication link to all of the other computers in the system. The results are transmitted when required for a task to be subsequently executed by itself or another computer.

The operations controller comprises a fault handler, a scheduler, a task communicator, a transmitter and requisite receivers.

The fault handler checks each message received from the other computers, and determines which computer(s) are faulty when a failure occurs. The fault handler discards messages received from computers deemed to be faulty and forwards to the scheduler only messages received from non-faulty computers. The fault handler sends messages to all of the other computers in the system identifying each computer it has deemed to be faulty.

The scheduler maintains a status table in which it stores the current status of each task in its assigned subset. The status table is continuously updated by the messages forwarded by the fault handler. The scheduler selects from the status table the highest priority task which is ready for execution and which has not been selected by another computer. The task selected by the scheduler is forwarded to the task communicator, and a message is sent informing all of the other computers that the computer has selected the identified task.

The task communicator assembles the data variable values required for the execution of the selected task and makes them available to the applications computer. After the execution of the selected task by the applications computer, the task communicator sends messages containing the value of the data variables computed by the completed task. It then sends a message to all of the other computers identifying the task just completed and the next tast started.

The object of the invention is a fault-tolerant multiple computer system for cooperatively executing a set of tasks in a fault tolerant manner. Each computer in the system selects and executes tasks so that each task required to be executed by the system is executed in a correct sequence. Each computer detects which, if any, computers in the system are faulty, and ignores the selection and execution of tasks by the faulty computer(s), thereby permitting the tasks to be selected and executed by the remaining non-faulty computers.

One advantage of the system is that each computer selects its own tasks, from its assigned subset of tasks, without the intervention of a master or controlling scheduling device or computer. Another advantage is that each task is assigned to more than one computer so that, in the event a computer fails, the tasks assigned to the failed computer may be selected and executed by an alternate good or operational computer. Another advantage is that no one computer is required to execute all of the tasks, reducing the memory size requirements of each computer and permitting the use of lower cost microcomputers. Still another advantage is that each computer decides which computers in the system are faulty, and effectively excludes the faulty computer from participating in the operation of the system.

These and other advantages will become apparent from reading the detailed description in conjunction with the drawings and tables.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing the basic architecture of the Fault-Tolerant Multiple Computer System.

FIG. 2 is a block diagram of the Fault-tolerant Multiple Computer System showing further detail of the system.

FIG. 3 is a block diagram of the Applications Computer.

FIG. 4 is a block diagram of the Operations Controller.

FIG. 5 is a block diagram of the Fault Handler.

FIGS. 6A and 6B are a flow diagram for the Message Format Checker.

FIG. 7 is a circuit implementation of the Message Format Checker.

FIG. 8 is a flow diagram for the Reasonable Limits Checker.

FIG. 9 is a circuit implementation of the Reasonable Limits Checker.

FIG. 10 shows the waveforms of the timing signals used in the discussion of the Message Format Checker and Reasonable Limits Checker.

FIG. 11 is a flow diagram for the Redundant Value Voter.

FIG. 12 is a flow diagram for the "Check Agreement" subroutine of the Redundant Value Voter.

FIG. 13 is a flow diagram for the "Find Values That Agree" subroutine of the Redundant Value Voter.

FIG. 14 is a flow diagram for the "Record Voted Value" subroutine of the Redundant Value Voter.

FIG. 15 is a block diagram of the Message Sequence Checker.

FIG. 16 is a block diagram of the Execution Time Checker.

FIG. 17 is a block diagram of the Synchronizer.

FIGS. 18, 19 and 20 are time-sequence charts used in the discussion of the Synchronizer.

FIG. 21 is a time-sequence chart showing the sequence of events during normal operation of the Synchronizer.

FIG. 22 is a time-sequence chart showing the sequence of events during a start or restart of the Synchronizer.

FIG. 23 is a block diagram of the Fault Tolerator.

FIG. 24 is a block diagram of the Scheduler.

FIG. 25 is a schematic showing the arrangement of the data and subtables of the Status Table.

FIG. 26 is a block diagram of the Task Communicator.

FIG. 27 is a block diagram of the Internal Watch-Dog Timer.

BRIEF DESCRIPTION OF THE TABLES

    ______________________________________
    ARCHITECTURE OF THE FAULT-TOLERANT
    MULTIPLE COMPUTER SYSTEM
    TABLE        DESCRIPTION
    ______________________________________
    Table I   Tables used in the System
    Messages
    Table II-A
              *Inter-Computer Messages
    Table II-B
               Internal Messages
    Fault Handler
    Table III-A
               Message Format Checker
    Table III-B
               Reasonable Limits Checker
    Table III-C
               Redundant Data Table
    Table III-D
              *Redundant Value Voter
    Table III-E
              *Check Agreement
    Table III-F
              *Find Values That Agree
    Table III-G
              *Record Voted Value
    Table III-H
              *Task Unselected/Selected Message Module
    Table III-I
              *Task Completed/Started Message Module
    Table III-J
              *Watch Dog Timer Checker
    Table III-K
              *Start Watch Dog Timer Module
    Table III-L
               Sampling Data Table
    Table III-M
              *Start Synchronizer Module
    Table III-N
              *Check Sampling Timer Module
    Table III-O
              *Find Sampling Number Agreement Module
    Table III-P
              *Find Computers That Agree
    Table III-Q
              *Reset Sampling Timer
    Table III-R
              *Record Voted Sampling Number
    Table III-S
               Fault State Table
    Table III-T
              *Send Good Message Module
    Table III-U
              *End Time Period Module
    Table III-V
              *Check Error Message Agreement Module
    Table III-W
              *Record Error Module
    Table III-X
              *Display Faulty Computer
    Table III-Y
              *Start Fault Handler Module
    Scheduler
    Table IV-A
               Task Status Table
    Table IV-B
               Task Index Table
    Table IV-C
               Scheduling Status Table
    Table IV-D
               Awaiting Task Table
    Table IV-E
              *Record Data Ready
    Table IV-F
              *Find Awaiting Execution Number
    Table IV-G
              *Test If Health Check Selected
    Table IV-H
               Special Tasks Table
    Table IV-I
              *Record Special Tasks
    Table IV-J
              *Task Selector
    Table IV-K
              *Record Task Selected By Own Computer
    Table IV-L
              *Completed Task Recorder
    Table IV-M
              *Test If Last Completed Task
    Table IV-N
              *Unselected/Selected Task Recorder
    Table IV-O
              *Record Task Selected
    Table IV-P
              *Test If Selected Task
    Table IV-Q
              *Task Unselector
    Table IV-R
              *Task Releaser
    Table IV-S
              *System Status Monitor
    Table IV-T
              *Start Scheduler Module
    Task Communicator
    Table V-A  Data Values Table
    Table V-B  Task Input Table
    Table V-C  Task Output Table
    Table V-D *Store Data Value Module
    Table V-E  Task Data Table
    Table V-F *Task Dispatcher
    Table V-G *Release Task Module
    Table V-H *Task Results Message Sender
    Table V-I *Starter
    Table V-J *Counter
    Table V-K *Start Task Communicator
    Applications Computer
    Table VI  *Applications Computer Executive Program
    Microprocessor Based Implementation of Operations Controller
    Table VII-A
              *General Executive Program
    Table VII-B
               Conditions for Module Execution
    Table VII-C
              *General "Task" Program
    Table VII-D
              *Fault Handler Executive Program
    Table VII-E
               Module Modifications
    ______________________________________
     The tables indicated with asterisks (*) are pseudo code programs


DETAILED DESCRIPTION OF THE INVENTION

ARCHITECTURE OF THE FAULT-TOLERANT MULTI-COMPUTER SYSTEM

The architecture of the disclosed Fault-Tolerant Multi-Computer System is illustrated in FIG. 1. The system comprises a plurality of Computers 10 connected by means of input lines 12 to various sensors and manual inputs collectively represented by block 14.

The outputs of Computers 10 are transmitted by means of output lines 22 to a Combiner/Voter Network 24, which selects and/or combines the output data generated by the various Computers. The Combiner/Voter Network 24 distributes this data, by designated line 26, to the appropriate actuators and displays collectively represented by block 28.

Each Computer 10 has its own private communication link, such as Communication Links 16, 18, and 20, over which it can transmit messages containing data to every other Computer. For example, messages originating in Computer A are transmitted to all the other Computers via Communication Link 16. All the other Computers connected to Communication Link 16 can only receive messages over Communication Link 16. To transmit a message back to Computer A, they must use their own communication link, i.e., Computer B would use Communication Link 18 and Computer N would use Communication Link 20. The messages and data sent over the communication links are sent in serial form; therefore, each link may be a single pair of wires or other serial transmission medium such as an optical fiber. Each communication link is also connected back to the transmitting Computer, permitting verification that the message sent on the communication link is correct. This is part of the fault detection features of the system to be discussed later.

Each Computer such as Computer 10a through 10n consists of one or more computers (or processors), depending upon the number of tasks to be executed by that particular Computer for a particular application and upon the fault-tolerant sophistication of the system. Each Computer 10a through 10n is hereinafter referred to as Computer 10 without the identifying subscript.

Each Computer has an assigned set of tasks which it is capable of executing, where the set of tasks assigned to each Computer 10 is less than the total set of tasks to be executed by the system. One feature of the system, however, is that each task to be executed is assigned to at least two different Computers. Certain tasks critical to the operation of the system are assigned to several or possibly all of the Computers. Each Computer in the system is capable of individually executing each assigned task.

For example, consider a relatively simple system having three Computers and required to execute fifteen (15) different tasks, of which the tasks designated Tasks 7 and 11 are critical to the operation of the system. Further, consider each Computer in the system to be capable of executing at least eleven (11) of the required tasks. In this example, Computer A may be assigned Tasks 1 through 11, Computer B assigned Tasks 5 through 15 and Computer C assigned Tasks 11 through 15, Tasks 1 through 5, and Task 7. In the example, Tasks 7 and 11 are assigned to each Computer; however, in a system having more than three Computers, Tasks 7 and 11 would have been assigned to more than two Computers but not necessarily to all of them.

The execution of each assigned task in each of the Computers 10 is data driven, i.e., when all of the data required for the execution of a particular task is available, each Computer to which the task is assigned is capable of selecting and executing the task. The data is usually the results of one or more previously executed tasks. When execution of a task is completed, the task results are communicated to each of the computers by sending messages via the communication link. When the task results are received by Computers which require the particular data for the performance of a subsequent assigned task, the receiving Computer will store the received data. The Computers which have no assigned task requiring the particular data may discard the received data.

The selection of each task to be executed by each Computer is made dynamically by each Computer. This is done in such a manner that all Computers assigned a task will not necessarily proceed to execute that task. Stated alternatively, a Computer may not execute all tasks which are assigned to that Computer. Each Computer makes its own decisions, based upon knowledge of previous decisions by all Computers, as communicated in messages received via the communication links.

The task selection is performed by a Schedular described in detail hereinafter with reference to FIG. 24. Briefly, the selection of each task to be executed by each Computer is made dynamically on a priority basis. To this end, a priority number is assigned to each Computer and a priority number is assigned to each task within a given Computer. When a given Computer needs to select a task, the task status information is scanned to determine which of the assigned tasks are ready for execution. A task is ready for execution when all of the data necessary for the execution of the task is available. The Computer selects the ready task having the highest task priority and sends out a message on its communication link signifying to the other Computers that it has selected the task. When a Computer receives a message indicating that another Computer has selected a task, the selected task is removed from the ready status in all of the other Computers capable of performing the same task.

In the time interval between task selection and starting the execution of the selected task, the computer checks to determine if another Computer has selected the same task. If the Computer which selected a task does not receive a message indicating that another Computer has selected the same task, the Computer initiates the execution of the selected task.

In the event another Computer selects the same task before the first Computer initiates the execution of the task, the priority of each Computer which selected the task is analyzed, and the task remains selected by the Computer having the highest priority. The remaining Computers unselect the previously selected task, and proceed to select the next highest priority task ready for execution. When it is desirable that certain identified tasks be executed by more than one Computer, the same functional task is duplicated for scheduling purposes, one copy for each execution desired.

Fault detection in the system is accomplished by a combination of methods. Faults may be detected by comparing the results of each executed task with stored range limits, by comparing the results of the same task executed by two or more different computers, by error detecting codes on information communicated, by analyzing the scheduling sequence or by the use of watch-dog timers. The system may embody all five fault detection methods, any lesser number of the above methods in combination, or in special applications, any one of the above listed methods.

The messages sent by a Computer are received and analyzed in every Computer in the system to determine if an error exists. If an error is detected, each Computer detecting the error sends out an error message via its communication link to all of the other Computers. An error message signals the detection of an error and identifies the Computer which made the error. The error messages received are analyzed in each Computer. When a Computer receives error messages from two or more Computers, the computer which is identified as making the error is assumed to be faulty.

When a Computer is deemed faulty by another Computer, the messages signaling task selection and containing data results of any task executed by the faulty Computer are discarded or ignored. The receipt of messages signifying an error detected by two or more Computers also reinstates the ready status of the tasks presently selected and being executed by the Computer which is deemed to be faulty. The tasks selected and being executed by the faulty Computer are subsequently re-selected and re-executed by other Computers capable of executing those tasks.

In the disclosed system, a Computer determined to be faulty is not turned off or disabled, but is permitted to remain active and to continue to execute each of the assigned tasks, if it can. The remaining Computers continue to check the messages sent by the faulty Computer to determine if the malfunction is temporary or permanent. If the malfunction is temporary, the faulty Computer will eventually return to normal operation and the results of the tasks executed in that Computer will be correct. After a Computer deemed to be faulty correctly executes its assigned tasks for a predetermined period of time, the malfunction is assumed to have been temporary and the excluded Computer is restored to full participation in task selection and task execution.

If a faulty Computer sends incorrect information to Actuators and Displays, the faulty information is corrected by the Combiner/Voter Network 24. The output or task results of any task used for actuator activation or display purposes are generated by the Computers 10 to which the specific tasks are assigned. The output of each Computer 10 is transmitted on lines 22 to the Combiner/Voter Network 24. The Combiner/Voter Network 24 combines the appropriate output data for actuator activation or display purposes as required. When duplicate outputs are provided by multiple Computers, the output data to be used is selected by a voting process.

FIG. 2 shows in greater detail the architecture of the multi-computer system, shown in FIG. 1. Each Computer 10 comprises an Applications Computer 100, such as Applications Computers 100a through 100n, and an Operations Controller 200, such as Operations Controllers 200a through 200n. Each Applications Computer 100 and its associated Operations Controller 200 are interconnected by a buss 30, as indicated by busses 30a through 30n.

The data from the sensors and manual controls, indicated by block 14, are received directly by the Applications Computers 100. Similarly, the data to the Actuators and Displays 28, via the Combiner/Voter Network 24, are obtained directly from the outputs of the Applications Computers. Only the Operations Controllers are interconnected by the communication links 16 through 20.

The Applications Computers 100 are of conventional architecture as shown on FIG. 3. Each Applications Computer comprises a Power Supply 102, a Central Processing Unit (CPU) 104, a Memory 106, and an Input-Output Network 108. The Operations Controllers 200 each comprise a plurality of Receivers 202, a Fault Handler 204, a Scheduler 206, a Task Communicator 208, and a Transmitter 212, as shown on FIG. 4.

The structure of the Operations Controllers 200a through 200n shown in FIG. 2, including the interconnecting communication links 16 through 20, represent the novel aspects of the disclosed Fault-Tolerant MultiComputer System. The system does not contain a master controller to determine or control which Computer 10 will execute a designated task. Further, the system is not a fully redundant system wherein each Computer 10 is capable of executing and does execute every task.

APPLICATIONS COMPUTER

FIG. 3 shows the structure of a typical Applications Computer 100. Each Applications Computer 100 has a Power Supply 102, which supplies electrical power to the Central Processing Unit 104, the Memory 106, the Input-Output Network 108, and the associated Operations Controller 200, as indicated. The Central Processing Unit 104, Memory 106 and Input-Output Network 108 are connected by the buss 30. The Input-Output Network 108 is further connected to the Sensors and Manual Controls 14 by line 12, and to the Combiner/Voter Network by line 22. As previously indicated, the Operations Controller 200 is also connected to the buss 30.

The Central Processing Unit 104 may comprise one or more microcomputers, such as Central Processor 8086 manufactured by Intel Corporation of Santa Clara, Calif. The Memory 106 may comprise one or more read only memories, such as Erasable PROM 8708 also manufactured by Intel Corporation, which store the programs to be executed by the Central Processing Unit. The Memory 106 may also include one or more read-write (RAM) memories, such as Static RAM 8102A also manufactured by Intel Corporation. The Input-Output Network 108 may comprise one or more commercially available integrated circuits, such as Programmable Peripheral Interface 8255A manufactured by Intel Corporation, with attendant A/D and D/A converters. Alternatively, the Central Processing Unit 104, Memory 106 and Input-Output Network 108 may be incorporated in a single integrated circuit such as Microcomputer 8748 manufactured by Intel Corporation.

The operation of the Applications Computer 100 is as follows. The Operations Controller 200 generates a task signal indicative of the task to be executed by the Central Processing Unit 104, and sends the task signal along with the requisite data to the Central Processing Unit over Buss 30. The Central Processing Unit 104 responds to the task signal, accesses the appropriate program in the Memory 106, executes the task with the provided data, and outputs the results on Buss 30. Data from the Sensors and Manual Controls are received over lines 12 at the input of the Input-Output Network 208, which makes the received data available for use in the execution of an assigned task. The results of those executed tasks which are used for actuator control or display are output to the appropriate actuator and/or display through the Input-Output Network 108. Other results from a task executed by the Central Processing Unit 104, which are required for further computation within the system, are transmitted to all other Computers via the Operations Controller 200 and the communications link. When the execution of the task is completed, the Central Processing Unit initiates execution of the next task selected by the Operations Controller.

OPERATIONS CONTROLLER

FIG. 4 shows the structure of the Operations Controller 200 in block diagram form. The Operations Controller 200 has a plurality of Receivers 202a through 202k, each connected to a communication link associated with one of the Computers 10. There may be as many Receivers 202 as there are Computers 10, or there may be fewer Receivers if the computer associated with this Operations Controller has no need to receive communications from one or more other Computers in the system, i.e., the results of none of their tasks are needed by this Computer for the execution of its assigned tasks.

The input to one of the receivers, designated Receiver 202k, is connected by means of line 214 to the output of Transmitter 212, which sends the messages and data over the communication link from the associated Operations Controller. This feedback connection between the Transmitter 212 and Receiver 202k is part of the fault detection system to check the message sent over the communication link, and also permits the task results to be input back into the generating Computer for subsequent task execution. This feedback connection may be direct, or preferably in the form of a loop connection from Transmitter 212 to the appropriate Receivers in each other Computer and finally back to Receiver 202k in the same Computer.

The Receivers 202a through 202k receive the messages serially transmitted over the communication links and convert them to a parallel format for subsequent utilization in the Operations Controller and Application Computer. As used hereinafter, the term "messages" will include all messages, such as Task Completed/Started, Task Unselected/Selected, Error, Task Data Values, and other messages communicating information between the Computers via the communication links.

The Receivers 202 also include circuits which establish message protocol and perform other necessary format conversions. The Transmitter 212 performs the reverse function, receiving parallel data and converting it to a serial format for transmission over the communication link. The format conversion may strip or add carriers, provide padding or add special codes for transmission error control as is known in the art.

The Receivers 202 and Transmitter 212 each contain buffers permitting a message to be received some time before it can be output. Each buffer is capable of holding more than one message; for example, each buffer may be capable of holding up to ten (10) messages. Receivers 202 and Transmitter 212 may be commercially available integrated circuits incorporating both a receiver and transmitter, such as the Programmable Communications Interface (PCI) 2651 manufactured by Signetics of Sunnyvale, Calif. or the SDLC Protocol Controller 8273 manufactured by Intel Corporation of Santa Clara, Calif. These circuits are supplemented with additional buffering using commercially available integrated circuits such as FIFO 33512 manufactured by Fairchild Corporation of Mountain View, Calif.

The parallel data outputs of the Receivers 202 are transmitted to a Fault Handler 204, where each received message is analyzed to determine if it is good or faulty. The Fault Hander 204 may be a micro-computer having storage capabilities, a part of a micro-computer, or a special purpose circuit. The Fault Handler 204 performs one or more of the following fault detection checks:

1. Compare the received data value with predetermined limit values to determine if it is reasonable, i.e., has a value between predetermined minimum and maximum values.

2. Compare the received data with the results of other Computers performing the same task, to determine the most probable value, and to identify Computers providing values which differ significantly from the most probable value,

3. Determine if the scheduling information was received in a proper sequence,

4. Determine by means of watch-dog timers if the task execution was completed within a predetermined time period after the execution was started, or

5. Check error detecting codes, determined over other information communicated and included in each message.

In addition to performing fault detection checks, the Fault Handler 204 also performs the following functions:

1. Transmits an Error message to the Transmitter 212 when an error is detected.

2. Stores Error messages received from all Computers.

3. Decides if one or more of the Computers is faulty.

4. Discards all messages received from the Computers determined to be faulty.

5. Transmits to the Scheduler 206 error-free messages from non-faulty Computers.

6. Generates a fault display indicating the Computers which have been determined to be faulty.

7. Decides that a Computer is no longer faulty and readmits the Computer previously determined faulty, after the faulty Computer sends good messages for a predetermined period of time,

8. Generates the required input/output sampling commands, and

9. Initiates startup of the Applications Computer and Operations Controller, when the Computer is first turned on or power is returned after a temporary power failure.

The function of the Scheduler 206 is to schedule the task to be executed by its own Applications Computer 100. The Scheduler performs the following functions:

1. Keeps track of the status of all assigned tasks and determines which of the tasks are ready for execution, i.e., all the data needed for execution is available.

2. Selects the ready task having the highest task priority for next execution and generates a signal indicative of the task selected, and

3. Unselects the selected task and selects the next highest priority task when it receives a task selection message for the same task from another Computer having a higher assigned Computer priority.

The Scheduler 206 may be implemented by means of a micro computer or a part thereof, or with special purpose hardware, depending upon the number of assigned tasks and complexity of the system.

The Task Communicator 208 stores the current values of the data required for the execution of each task assigned to the associated Applications Computer 100. The Task Communicator responds to each task signal generated by the Scheduler 206 and makes available to the associated Applications Computer 100 the data required for the execution of the task identified by the task signal. Upon completion of each task, data values produced by the executed task, or an error message if an error was detected in the execution of the task, are sent by the Task Communicator 208 to the Transmitter 212.

The Transmitter 212 also receives the Task Complete/Started messages from the Task Communicator, Task Unselected/Selected messages from the Scheduler 206, and Sampling Number and Error messages from the Fault Handler 204. The Transmitter 212 converts the received messages to a serial format which is sent to the other computers via the associated communication link.

The data sent over the associated communication link is also received by Receiver 202k over line 214. The messages received by the Fault Handler 204 from Receiver 202k are treated in the same way as any other message received from the other Computers in the system. In this way, the data generated by the associated Applications Computer, required for the execution of a subsequent task, is communicated to and stored in the associated Task Communicator 208 of each Computer 10.

The operation of the Operations Controller 200 requires the maintenance of various tables of information. These tables store the recent actions of all Computers, including itself. Table I below lists the various tables used in the system and the elements to which these tables are assigned.

                  TABLE I
    ______________________________________
    TABLES USED IN THE SYSTEM
    TABLE              ELEMENT
    ______________________________________
    Redundant Data     Fault handler 204
    Computer Status    Fault Handler 204
    Sampling Data      Fault Handler 204
    Fault State        Fault handler 204
    Scheduling Status  Scheduler 206
    Task Status        Scheduler 206
    Data Values        Task Communicator 208
    Internal Watch-Dog Timer
                       Task Communicator 208
    ______________________________________


MESSAGES

The operation of the Fault-Tolerant Multi-Computer System requires that various items of information be transmitted in messages between the multiple Computers in the system. Table II-A is a tabulation of the message types used in the following description of the system. Each message is assumed to comprise a fixed integer number of 8-bit bytes or characters. It is recognized that the various items of information in the messages listed in Table II-A may be presented in various other ways and may use different numbers of bytes and/or bits. The message types given in Table II-A, and their contents, represent a specific format that may be used.

                  TABLE II-A
    ______________________________________
    INTER-COMPUTER MESSAGES
    Message Type Byte No.    Byte Contents
    ______________________________________
    Task Data Value
                 1           Message Type
                 2           Sending Computer
                 3           Data I.D.
                 4           Sequence number
                  5-12       Data Value
                 13-14       Error Detecting Code
    Redundant Data
                 1           Message Type
    Value        2           Sending Computer
                 3           Data I.D.
                 4           Sequence Number
                  5-12       Data Value
                 13-14       Error Detecting Code
    Task Completed/
                 1           Message Type
    Started      2           Sending Computer
                 3           Completed Task
                 4           Completed Execution
                             Number
                 5           Started Task
                 6           Started Execution
                             Number
                 7-8         Error Detecting Code
    Task Unselected/
                 1           Message Type
    Selected     2           Sending Computer
                 3           Unselected Task
                 4           Unselected Execution
                             Number
                 5           Selected Task
                 6           Selected Execution
                             Number
                 7-8         Error Detecting Code
    Error        1           Message Type
                 2           Sending Computer
                 3           Faulty Computer
                 4           Error Type Code
                 5-6         Null (not used)
                 7-8         Error Detecting Code
    Sampling Number
                 1           Message Type
                 2           Sending Computer
                 3           Sampling Number
                 4           Starting Flag
                 5-6         Excluded Bits
                 7-8         Error Detecting Code
    ______________________________________


The first two and last two bytes of all the intercomputer messages listed on Table II-A contain similar information. The first and second bytes of each message identify the message type and sending Computer respectively. The last two bytes are an error detecting code determined and checked over all other bytes of the message. The form of error detecting code used depends upon the communication link protocol selected; a 16 bit Cyclic Redundancy Check (CRC) code or any other code having similar error detection coverage may be used. In addition to these error detecting code bytes, each byte or character may be transmitted with additional bits which are used solely for error detection and/or correction. The error detecting bits and bytes are generated by the Transmitter 212 and checked by the Receivers 202, and are not passed along with the rest of the message for subsequent handling in the Operations Controller.

Task Data Value and Redundant Data Value messages differ only in whether or not the data values contained in the messages are redundantly computed by more than one Computer, and thus must be processed by majority voting as discussed hereinafter. Task Data Value Messages and Redundant Data Value Messages are sent by a Computer after completing the execution of a task, in which new values for some task data variables have been computed.

A Task Data Value or Redundant Data Value message comprises 14 bytes as indicated on Table II-A. The first byte identifies the message as a Task Data Value or Redundant Data Value message, which contains a new data variable value. The second byte identifies the Computer in which the new data value was computed. The third byte identifies the particular data variable for which a new value was computed by the sending Computer. The fourth byte provides the sequence number of the new data value. The sequence number distinguishes this particular value of the data variable from previous and subsequent values of the same data variable, computed by the same Computer or by any other Computer in the system. The sequence numbers are assigned sequentially (0 to 255 decimal) in circular fashion, i.e., 0 follows 255. The next 8 bytes, bytes 5 through 12, contain the new value for the data variable. The final two bytes contain the error detecting code.

The Task Completed/Started message is sent after a task has been completed, and follows the Task Data Value and Redundant Data Value messages from the completed task. The Task Completed/Started message informs the other Computers in the system that the sending Computer has completed the execution of the task identified in Byte 3, and identifies the new task started in Byte 5. Bytes 4 and 6 give the execution numbers of the completed and started tasks, respectively. Each execution number distinguishes the particular execution of a task from previous and subsequent executions of the same task. The execution number corresponds to the sequence number of the data values being used or being computed in the execution of the task.

The Task Unselected/Selected message is sent when the Scheduler has selected the next task to be executed by the Applications Computer. Bytes 5 and 6 of the Task Unselected/Selected message identify the newly selected task and its execution number. Bytes 3 and 4 identify the previously selected task and its execution number; this task is now unselected and replaced by the selected task.

When a Computer starts executing its selected task, it tentatively selects a known, fixed task, namely the Health Check task, so that a task is always selected. The selection of this Health Check task is not explicitly communicated to other Computers; its selection is assumed by all Computers when a Task Completed/Started message is received. Later, if the Computer selects another task, it sends out a Task Unselected/Selected message. Bytes 3 and 4 identify the unselected Health Check task, and bytes 5 and 6 identify the task selected in place of the Health Check task.

If prior to initiating the execution of the selected task, the Operations Controller receives a Task Unselected/Selected message from another Computer having a higher priority, indicating that it also has selected the same task (not Health Check) with the same execution number, the Operations Controller of the lower priority Computer unselects the task and selects a new task. The Operations Controller then generates a Task Unselected/Selected message informing all of the other Operations Controllers that it has unselected the previously selected task and identifying the newly selected task and its execution number.

An Error message is generated when an Operations Controller detects an error in a message received from another Computer, or detects an error committed by its own Computer. The first byte identifies the message as an Error message. The second byte identifies the Computer which detected the error, while the third byte identifies the Computer from which the erroneous message originated. The fourth byte contains an error type code which identifies the type of error detected. The fifth and sixth bytes contain null codes (not used). As previously indicated, bytes 7 and 8 contain an error detecting code. It should be noted that null bytes are included in some messages so that most message types are the same length and thus simplify message handling. Alternately, these null bytes could be omitted from the messages.

A Sampling Number message is sent by each Computer at the end of each sampling period. The first byte identifies the message type, and the second byte identifies the Computer sending the message. The third byte provides the new sampling number, which distinguishes the present sampling period from previous and subsequent sampling periods. Like the data value sequence numbers and task execution numbers, the sampling numbers are assigned sequentially (from 0 to 255 decimal) in circular fashion, i.e., 0 follows 255. The fourth byte is a starting flag signifying if the sending Computer is starting or restarting operation. The fifth and sixth bytes contain one bit for each possible Computer in the system, and indicate if the Computer associated with each bit is currently excluded by the sending Computer or not. The seventh and eighth bytes contain the error detecting code.

As previously stated, these messages are transmitted between the multiple Computers of the system. The same messages are also transmitted between some subsystems of the Operations Controller. Within one Operations Controller, not all bytes of a message may be transmitted. In particular, the error detecting code bytes are not communicated beyond the receivers.

Within each Operations Controller, additional internal messages are used to communicate information between the subsystems or modules of the Operations Controller. These messages are listed in Table II-B and will be discussed in conjunction with the modules that produce and/or use such internal messages.

                                      TABLE II-B
    __________________________________________________________________________
    INTERNAL MESSAGES
    MESSAGE TYPE   BYTE NO.
                          BYTE CONTENTS
    __________________________________________________________________________
    EXCLUDE COMPUTER
                   1     MESSAGE TYPE
                   2     EXCLUDED COMPUTER
                   3-4   EXCLUDED BITS
    INITIATE SPECIAL TASKS
                   1     MESSAGE TYPE
                   2     TASK TYPE
                   3     EXECUTION NUMBER
    RESTART        1     MESSAGE TYPE
    DISPATCH TASK  1     MESSAGE TYPE
                   2     TASK
                   3     EXECUTION NUMBER
    RELEASE TASK   1     MESSAGE TYPE
                   2     COMPLETED TASK
                   3     COMPLETED EXECUTION
                          NUMBER
                   4     STARTED TASK
                   5     STARTED EXECUTION
                          NUMBER
    TASK DONE      1     MESSAGE TYPE
                   2     TASK
    RECORD ERROR   1     MESSAGE TYPE
                   2     NEW FAULTY COMPUTER
                   3     ERROR INDICATOR
    TASK INPUT     1     MESSAGE TYPE
                   2-3   TASK ADDRESS
           THE FOLLOWING SET OF BYTES ARE REPEATED FOR
           EACH DATA VARIABLE USED AS A TASK INPUT. SEE
           TASK COMMUNICATOR DISCUSSION FOR MORE DETAIL.
                    4-11 INPUT VALUE
                   12    ACTUAL DELAY INTEGER
    TASK OUTPUT    1     MESSAGE TYPE
           THE FOLLOWING SET OF BYTES ARE REPEATED FOR
           EACH DATA VARIABLE COMPUTED AS A TASK OUTPUT.
           SEE TASK COMMUNICATOR DISCUSSION FOR MORE DETAIL.
                   2     DATA I.D.
                   3     REDUNDANT DATA
                    4-11 OUTPUT VALUE
    __________________________________________________________________________


FAULT HANDLER

The details of the Fault Handler 204 are shown in FIG. 5. The Fault Handler 204 comprises a Message Format Checker 216, Reasonable Limits Checker 218, Redundant Value Voter 220, Message Sequence Checker 222, Execution Time Checker 224, Synchronizer 226, Fault Tolerator 228, Fault Status Display Panel 230, and Start Fault Handler Module 231.

The Message Format Checker 216 receives the outputs from the Receivers 202a through 202k, merges the messages received into a single stream of data, and performs selected message format checks. The Message Format Checker 216 checks each received message to determine if the message type is valid, if the sending Computer identified in the message corresponds to the Receiver that received the message, and if the error detecting code is correct (checked in conjunction with the Receivers). A Record Error message is sent to a Fault Tolerator 228 when the message type is not valid, when the Computer identified in the message does not correspond to the Receiver receiving the message, or when an error is detected through use of the error detecting code.

The error-free messages passed by the Message Format Checker are received by one of a plurality of error detection modules or checkers, such as the Reasonable Limits Checker 218, Redundant Value Voter 220, Message Sequence Checker 222 or Execution Time Checker 224. The error detection module to which a message is communicated is determined by the message type; each message is usually further checked for errors by only one of the error detection modules.

The Reasonable Limits Checker 218 checks if the data value of a Task Data Value message is between predetermined minimum and maximum limits. It generates a Record Error message when the data value is outside the predetermined limits. Error-free Task Data Value messages are forwarded to the Fault Tolerator 228.

The Redundant Value Voter 220 receives the Redundant Data Value messages and generates a "voted data value" when a predetermined number of Redundant Data Value messages are received having the same sequence number and same data value for a given task data variable. The "voted data value" is the value of that data variable that will be used in the execution of any subsequent task requiring this data. The "voted data value" obtained is communicated in a Redundant Data Value message forwarded to the Task Communicator via the Fault Tolerator and Scheduler. After the "voted data value" is determined, a Record Error message is generated for any received message having a data value which does not agree with the "voted data value" for that sequence number of that data variable.

The Execution Time checker 224 comprises a plurality of "watch-dog timers", one for each Computer 10. Each "watch-dog timer" is started in response to a Task Completed/Started message received from the associated Computer. The "watch-dog timer" monitors the execution time of the task started by that Computer. A Record Error message is generated when the "watch-dog timer" expires before a subsequent Task Completed/Started message is received, which indicates that the previously started task has been completed and another task has been started. Expiration of the watch-dog timer indicates that the task was improperly executed. The Task Completed/Started messages are always forwarded to the Message Sequence Checker 222.

The Message Sequence Checker 222 checks that the Task Completed/Started and Task Unselected/Selected messages are received from each Computer in a correct sequential order. For example, a Task Completed/Started message, indicating that a particular task has been started, should have been preceded by a Task Unselected/Selected message from the same Computer indicating that the same task with the same execution number had been selected. In a like manner, a Task Completed/Started message should be preceded by a Task Completed/Started message from the same Computer in which the started task and execution number of the first message are the same as the completed task and execution number in the subsequent message. If the task numbers or execution numbers do not agree, a Record Error message is generated. Error-free Task Unselected/Selected and Task Completed/Started messages are forwarded to the Fault Tolerator.

Each Record Error message generated by the various fault detection modules is sent to the Fault Tolerator 228. Each Record Error message includes the identity of the Computer 10 which sent the message, and an identification of the particular error detected.

The error-free Sampling Number messages, after passing through the Message Format Checker, are received by the Synchronizer 226. The Synchronizer generates "initiate input/output tasks" messages in synchronization with the Synchronizer modules in other Computers in the system. At the end of each sampling period, the Synchronizer generates a Sampling Number message containing the current sampling number of the associated Computer. The Sampling Number messages are sent to all of the Computers in the system via the Transmitter 212, and are used to synchronize operations of like Synchronizers 226 in the other Computers 10.

In the event the Synchronizer's own Computer is starting after a momentary power interruption or other failure, the Synchronizer will also generate an "initiate start-up task" message and "initiate fail-safe task" messages. The "initiate input/output tasks", "initiate start-up task" and "initiate fail-safe task" messages are internal messages used by the Synchronizer's own Operations Controller. These messages are sent to the Scheduler 206 and are not communicated to the other Computers. Each of these messages is a particular version of the Initiate Special Tasks message listed in Table II-B.

The "initiate input/output tasks" message is sent to the Scheduler 206 to initiate scheduling of the input/output tasks assigned to its own Computer, in synchronization with all of the other Computers in the system. These input/output tasks perform sampling of system inputs and outputs, where the sampling must be synchronized between Computers. The sampling number generated by the Synchronizer becomes the execution number of the input/output tasks.

The "initiate start-up task" message initiates scheduling of the system start-up task(s) assigned to its own Computer, in synchronization with all the other Computers in the system. These start-up tasks perform any functions needed to properly start the operation of the other application tasks.

Finally, the "initiate fail-safe task" message initiates scheduling of the fail-safe task or tasks assigned to the Synchronizer's own Computer. The fail-safe tasks send out "safe" data values during a start or restart, to all actuators and displays connected to the Computer.

In addition, the Synchronizer 226 and Fault Tolerator 228 generate Restart messages when operation of the associated Operations Controller needs to be restarted. The Restart messages initiate start-up procedures within the Scheduler 206, Task Communicator 208, and Fault Handler 204, which initializes the variable data used within those units. Within the Fault Handler, the Restart messages are sent to the Start Fault Handler Module 231, which initialize variable data within the checkers and the Fault Tolerator 228.

The error-free Task Data Value messages, the Redundant Data Value messages which convey a "voted data value", the Task Completed/Started messages, the Task Unselected/Selected messages, and the Error messages are received by the Fault Tolerator 228. The Fault Tolerator also receives the Record Error messages generated by the Message Format Checker 216, Reasonable Limits Checker 218, Redundant Value Voter 220, Execution Time Checker 224, Message Sequence Checker 222, and Synchronizer 226.

The function of the Fault Tolerator 228 is to pass on to the Scheduler 206 only those error-free messages received from Computers which are not deemed to be faulty. The Fault Tolerator maintains, for each Computer in the system, an indication of whether or not that Computer is currently deemed to be faulty. Whenever an error-free message is received from a Computer which is not considered faulty, that message is forwarded to the Scheduler. Messages from faulty Computers and erroneous messages are discarded. These actions are performed for Task Data Value, Task Completed/Started, and Task Unselected/Selected messages. Redundant Data Value messages which convey a "voted data value" are always forwarded to the Scheduler, even though the sending Computer may be deemed faulty. Error and Record Error messages are used and not forwarded by the Fault Tolerator.

When a Record Error message is received from the Message Format Checker 216, Reasonable Limits Checker 218, Redundant Value Voter 220, Message Sequence Checker 222, Execution Time Checker 224, or Synchronizer 226, the Computer which sent the erroneous message is recorded as being faulty, and an Error message is generated identifying the Computer which sent the message. The Error message is sent out to all Computers via the Transmitter 212. An internal Exclude Computer message identifying the faulty Computer is sent to the Scheduler 206.

The Fault Tolerator 228 also responds to the Error messages received from other Computers, and will conclude that a Computer is faulty when a predetermined number of Computers have sent Error messages identifying that particular Computer as faulty, even though an error has not been detected by an error detection module in its own Computer. As before, when the Fault Tolerator decides that a Computer is now faulty, it sends an Exclude Computer message to the Scheduler.

If the number of Computers sending Error messages identifying a particular Computer as faulty is less than the predetermined number, the Computer is assumed to be healthy since the received Error message(s) may be the result of malfunctions in the Computers sending the Error messages or their associated communication links. The Computer or Computers which sent these Error messages will discard messages from the Computer deemed faulty; however, the remaining Computers will treat that same Computer as healthy and will accept the messages as if no Error messages were received. In all cases where one of the Computer's own checkers or the Synchronizer send an internal Record Error message indicating a detected error or fault, that Computer will deem the Computer faulty and will discard all messages received from that Computer; this continues until it is concluded that the fault was temporary and the faulty Computer has recovered.

Although the Fault Tolerator 228 will discard messages received from Computers deemed to be faulty, the Message Format Checker 216, Reasonable Limits Checker 218, Redundant Value Voter 220, Message Sequence Checker 222, Execution Time Checker 224, and Synchronizer 226 will continue to check each message received from all Computers. The Fault Tolerator continues to monitor the messages received from the Computer deemed to be faulty. The Fault Tolerator will decided that a Computer is no longer faulty when, during a predetermined time period, its own checkers do not detect an error and simultaneously the number of Computers generating Error messages identifying the faulty Computer is less than the required predetermined number. When it is determined that a Computer is no longer faulty, the Fault Tolerator will generate an "Exclude Computer" message which shows that the previously excluded Computer is no longer excluded. The "Exclude Computer" message is communicated to the Scheduler 206, where it cancels the current exclusion status of the identified Computer, and the previously excluded Computer is thus readmitted to full participation in the system.

The Fault Tolerator 228 further generates signals activating a Fault Status Display Panel 230 identifying the Computers deemed to be faulty or excluded. The Fault Status Display Panel 230 may be an externally mounted display panel readily visible to the operator, and/or may be placed inside the Computer cabinet adjacent to the particular Operations Controller hardware. Each Computer in the system has its own display panel, and each display panel has at least two lamps or indicators for each Computer in the system. Both of the lamps are activated when the corresponding Computer has been deemed to be faulty by the Operations Controller associated with the particular display, and the faulty Computer is presently excluded from the system. The first lamp is turned "off" when the Computer is readmitted; however, the second lamp is left on indicating that the Computer had previously been excluded. The in-cabinet mounting of the display panel is desirable, since the display will be conveniently available to service personnel during maintenance or servicing of the system.

The operation of the Fault Handler 204 is as follows: Messages from the Computers in the Fault-Tolerant Multi-Computer System are received by the individual Receivers 202 connected to the respective communication links. The Receivers 202 check the error detection code, the length of the message, etc. The received message is then forwarded to the Message Format Checker 216, along with information identifying the Receiver which received the message. If an error is detected by a Receiver, information identifying the type of error detected is communicated to the Message Format Checker 216. Because the messages are randomly received at the individual Receivers 202, and may be received at a rate too fast for immediate processing by the Message Format Checker 216, the messages are placed in a temporary storage buffer associated with each Receiver, until they can be checked by the Message Format Checker. Each temporary storage buffer is able to store about ten messages at any time.

Each received message contains additional bytes or bits of information, such as the message error detecting code, start of message and end of message codes, and character error detecting/correcting codes, which are only used by the Receivers. These additional bits of information are stripped from the message before it is forwarded to the buffer and Message Format Checker 216.

The Message Format Checker 216 interrogates the buffers associated with each Receiver 202 in a cyclical manner, and checks each received message. It checks if an error was detected by the Receiver, if the message type is a valid message type, and if the Receiver which received the message is associated with the particular Computer which originated the message. If the Message Format Checker detects an error, it sends a Record Error message to the Fault Tolerator 228. If no error is detected, the received message is forwarded to the appropriate Fault Handler module.

Subsequent operation of the Fault Handler depends upon the message type. Operation will thus be discussed for each message type.

Error-free Task Data Value messages, passed by the Message Format Checker 216, are forwarded to the Reasonable Limits Checker 218. The Reasonable Limits Checker checks each Task Data Value message and forwards it to the Fault Tolerator 228 if no error is detected. The Fault Tolerator checks if the Computer which sent the message is currently considered to be faulty. If that Computer is not faulty, the Task Data Value message is forwarded to the Scheduler 206; otherwise, the message is discarded. If the Reasonable Limits Checker detects an error, it sends a Record Error message to the Fault Tolerator 228.

Each error-free Redundant Data Value message, passed by the Message Format Checker 216, is forwarded to the Redundant Value Voter 220. The Redundant Value Voter compares the value of the data variable contained in the received message with the values of that data variable contained in previously received Redundant Data Value messages. If the data value contained in the received Redundant Data Value message agrees with the values in a predetermined number of previously received Redundant Data Value messages, a "voted data value" is obtained. The Redundant Data Value message containing the "voted data value" is forwarded to the Scheduler 206 through the Fault Tolerator 228. When a "voted data value" is obtained, and the value contained in a previously received Redundant Data Value message disagrees with the "voted data value" just obtained, a Record Error message is also transmitted to the Fault Tolerator identifying the Computer which sent the disagreeing data value. If the Redundant Data Value message does not produce a "voted data value", the Redundant Data Value message is discarded. If after a "voted data value" is obtained, the value of the data variable contained in a subsequent Redundant Data Value message disagrees with the "voted data value", a Record Error message is transmitted to the Fault Tolerator 228.

Each error-free Task Unselected/Selected message, passed by the Message Format Checker 216, is forwarded to the Message Sequence Checker 222. The Message Sequence Checker checks the message for scheduling sequence errors, and forwards it to the Fault Tolerator 228 if no errors are detected. The Fault Tolerator checks if the Computer which sent the message is currently considered to be faulty. If that Computer is not faulty, the error free Task Unselected/Selected message is forwarded to the Scheduler 206; otherwise, the message is discarded. If the Sequence Checker detects an error, it sends a Record Error message to the Fault Tolerator 228.

Each error-free Task Completed/Started message, passed by the Message Format Checker 216, is forwarded to the Execution Time Checker 224. The Execution Time Checker starts a watch-dog timer and forwards the message to the Message Sequence Checker 222. The Message Sequence Checker checks each message and forwards it to the Fault Tolerator 228, if no error is detected. The Fault Tolerator checks if the Computer which sent the message is currently considered to be faulty. If that Computer is not faulty, the Task Completed/Started message is forwarded to the Scheduler 206; otherwise, the message is discarded. If the watch-dog timer for a Computer expires before it is restarted by a subsequent Task Completed/Started message, the Execution Time Checker 224 sends a Record Error message to the Fault Tolerator 228. If the Message Sequence Checker detects an error, it sends a Record Error message to the Fault Tolerator.

Each error-free Sampling Number message, passed by the Message Format Checker 216, is forwarded to the Synchronizer 226. The Synchronizer compares the Sampling Number messages. Sampling Number messages are not passed on to the Fault Tolerator 228. However, the Synchronizer periodically generates a new Sampling Number message, sending it to the Transmitter 212. The Synchronizer compares the sampling number contained in each received Sampling Number message with the sampling numbers contained in previously received Sampling Number messages and with the previously determined "voted sampling number". If the sampling number contained in the received Sampling Number message agrees with a predetermined number of sampling numbers contained in previously received Sampling Number messages, a new "voted sampling number" is obtained and an "initiate input/output tasks" message is sent to the Scheduler 206. If the Sampling Number message produces a new "voted sampling number", and if the sampling number given in a previously received Sampling Number message disagrees with the "voted sampling number" just obtained, a Record Error message is sent to the Fault Tolerator 228.

Each error-free Error message is forwarded directly to the Fault Tolerator 228 from the Message Format Checker 216. The Fault Tolerator compares this message with previously received Error messages. If the Fault Tolerator decides that a particular Computer is faulty, based upon a predetermined number of Error messages naming that Computer, the Fault Tolerator will thereafter consider that Computer to be faulty. If that Computer was not previously considered to be faulty, the Fault Tolerator sends an internal Exclude Computer message to the Scheduler 206. The Fault Tolerator also activates the lamps in the Fault Status Display Panel 230 associated with the Computer which is now considered to be faulty. The Display Panel indicates those Computers which are presently excluded, as well as any Computer which was at one time excluded but has subsequently been readmitted into the system.

When a Record Error message is received by the Fault Tolerator 228, from the Message Format Checker 216, Reasonable Limits Checker 218, Redundant Value Voter 220, Message Sequence Checker 222, Execution Timer Checker 224, or Synchronizer 226, the Fault Tolerator thereafter considers the Computer identified in the Record Error message to be faulty. If a specified time interval has passed since an Error message was sent regarding that Computer, an Error message is sent to the Transmitter 212 for transmission to all Computers. If that Computer was not previously considered to be faulty, the Fault Tolerator sends an Exclude Computer message to the Scheduler 206. The Fault Tolerator also activates the lamps in the Fault Status Display Panel 230 associated with the Computer which is now considered to be faulty.

When the Fault Tolerator excludes a Computer, it checks for certain abnormal conditions. If the excluded Computer is the Fault Tolerator's own Computer, it restarts its own Computer. Similarly, if the number of excluded Computers exceeds a predetermined number, it restarts its own Computer. The number of excluded Computers could exceed the predetermined when its own Computer is faulty, or when some common fault produces errors in many Computers. To restart its own Computer, the Fault Tolerator sends a Restart message to the Start Fault Tolerator Module 231 and to the Scheduler 206.

The Fault Tolerator also monitors the elapsed time since a Computer was last deemed to be faulty, in response to either an internal Record Error message or matching Error messages received from other Computers. When a faulty (excluded) Computer transmits error-free messages for a predetermined length of time, the Fault Tolerator reverses the excluded status for that Computer and readmits that Computer into active participation in the system. When such a decision is made, the Fault Tolerator sends an Exclude Computer message to the Scheduler 206. The Exclude Computer message shows the readmitted Computer as not (presently) excluded. The Fault Tolerator also deactivates the presently excluded lamp in the Fault Status Display Panel associated with the Computer no longer excluded. However, it leaves on the lamp indicating that the Computer was excluded at one time.

When the Computer is starting after being turned on, or restarting after a momentary power failure or interruption, the Synchronizer 228 starts its sampling period timer, and transmits an internal Restart message to the Start Fault Handler Module 231 and the Scheduler 206. The Start Fault Handler Module initilizes internal data for the Fault Tolerator 228, Redundant Value Voter 220, Message Sequence Checker 222, and Execution Time Checker 224. The Synchronizer then generates an internal "initiate fail-safe task" message which is transmitted to the Scheduler 206. The Synchronizer continues to generate the "initiate fail-safe task" message at periodic intervals until a predetermined number of Computers are operating and their sampling period timers and sampling numbers are synchronized.

When the sampling period timer expires, the Synchronizer restarts the sampling period timer and generates a Sampling Number message containing its current sampling number. This message is sent via the Tansmitter 212 to all of the Computers in the system. Concurrently, the other Computers are generating similar Sampling Number messages, whether they are also starting, or are operating normally. The Synchronizer accepts the Sampling Number messages received from all Computers and attempts to determine the current sampling number of the system. The Sampling number is determined by a voting process, i.e, a sampling number on which at least a predetermined number of Computers agree. Once this "voted sampling number" is determined, the Synchronizer uses the "voted sampling number" as its own sampling number and synchronizes its sampling period timer with all the other sampling period timers in the system.

When the "voted sampling number" is first obtained and the sampling period timer is synchronized, the Synchronizer sends an internal "initiate start-up task" message to the Scheduler 206. The "initiate start-up task" message causes the Scheduler 206 to initiate scheduling of special start-up task(s) assigned to the Computer. The Synchronizer also generates an internal "initiate input/output tasks" message when a "voted sampling number" is obtained, which is sent to the Scheduler 206.

As previously indicated, the "initiate input/output tasks" message initiates scheduling of the input/output tasks which sample the system inputs and outputs. Sampling is done by the input/output tasks using the Input/Output Network 108 of the Applications Computer, to receive input data from the Sensors and Manual Controls 14 and to output data to the Actuators and Displays as shown on FIG. 3. The execution number used for the initiated input/output tasks is the current sampling number of the Synchronizer. The Computer thereafter receives messages from the other Computers, and new input data from the sensors and manual controls, and assumes normal active participation in the Fault Tolerant Multi-Computer System.

The preferred implementation of the Fault Handler is one, or possibly several, microprocessors having adequate storage and computational capabilities, such as the 8080A Microprocessor manufactured by the Intel Corporation of Santa Clara, Calif. or any other microcomputer of similar type. However, if desired, the Fault Handler may be made from commercially available discrete electronic components, as shall be shown by way of example in the following description of the individual modules of the Fault Handler.

The individual modules of the Fault Handler will be described in the following sections by means of Psuedo Code computer program listings. Psuedo Code is used for the program listings because it is not dedicated to a particular microprocessor or type of microprocessor, and is universally applicable to different types of computers and computer languages. A programmer having ordinary skills in the art would be able to translate the presented Psuedo Code program listings into actual program listings for a particular computer.

MESSAGE FORMAT CHECKER

The Psuedo Code program for the Message Format Checker 216 is given in Table III-A and a comparable flow diagram is shown on FIG. 6. The Message Format Checker module checks all messages received from all Computers 10, via the Receivers 202. The portions of the received message that are checked by the Message Format Checker are the first byte of the message which identifies the message type, the second byte which identifies the Computer sending the message, and the special bits generated by the Receiver which identify the Computer connected to that Receiver and any errors detected by the Receiver. As previously discussed, each Receiver 202 receives messages from a particular Computer and the Operations Controller has a plurality of Receivers 202, each receiving only the messages sent by a specified Computer in the system. In the given example, it is assumed that a special byte generated by the Receiver 202 is identical to the expected second byte of the message, which identifies the Computer which sent the message.

                  TABLE III-A
    ______________________________________
    MESSAGE FORMAT CHECKER
    ______________________________________
    /* IF ERROR DETECTED BY RECEIVER */
    IF ERROR DETECTED BITS NOT = 0
    THEN
    ERROR INDICATOR =
    FUNCTION OF (ERROR DETECTED BITS)
    ELSE /*IF MESSAGE TYPE CODE NOT VALID*/
    IF MESSAGE TYPE > MAXIMUM TYPE
    ORIF MESSAGE TYPE = 0
    THEN
    ERROR INDICATOR = MESSAGE TYPE ERROR
    ELSE /*CHECK SENDING COMPUTER CODE*/
    IF SENDING COMPUTER NOT = RECEIVER
    THEN
    ERROR INDICATOR =
    SENDING COMPUTER CODE ERROR
    ELSE
    ERROR INDICATOR = 0
    ENDIF
    ENDIF
    ENDIF
    IF ERROR
    INDICATOR NOT = 0 /* IF ERROR WAS DETECTED*/
    THEN
    CALL: SEND MESSAGE TO FALUT TOLERATOR
    INPUT DATA:
    MESSAGE TYPE = RECORD ERROR TYPE
    NEW FAULTY COMPUTER = RECEIVER
    ERROR INDICATOR = ERROR INDICATOR
    OUTPUT DATA: NONE
    ELSE /*FORWARD RECEIVED MESSAGE*/
    /*CASE OF MESSAGE TYPE*/
    IF MESSAGE TYPE = TASK DATA VALUE TYPE
    THEN
    CALL: SEND MESSAGE TO
    REASONABLE LIMITS CHECKER
    INPUT DATA: MESSAGE = RECEIVED MESSAGE
    OUTPUT DATA: NONE
    ELSE IF MESSAGE TYPE =
    REDUNDANT DATA VALUE
    TYPE
    THEN
    CALL: SEND MESSAGE TO REDUNDANT VALUE
    VOTER
    INPUT DATA: MESSAGE = RECEIVED MESSAGE
    OUTPUT DATA: NONE
    ELSE IF MESSAGE TYPE =
    TASK COMPLETED/STARTED TYPE
    THEN
    CALL: SEND MESSAGE TO EXECUTION TIME
    CHECKER
    INPUT DATA: MESSAGE = RECEIVED MESSAGE
    OUTPUT DATA: NONE
    ELSE IF MESSAGE TYPE =
    TASK UNSELECTED/SELECTED TYPE
    THEN
    CALL: SEND MESSAGE TO
    MESSAGE SEQUENCE CHECKER
    INPUT DATA: MESSAGE = RECEIVED MESSAGE
    OUTPUT DATA: NONE
    ELSE IF MESSAGE TYPE = SAMPLING NUMBER TYPE
    THEN
    CALL: SEND MESSAGE TO SYNCHRONIZER
    INPUT DATA: MESSAGE = RECEIVED MESSAGE
    OUTPUT DATA: NONE
    ELSE /*MESSAGE TYPE = ERROR MESSAGE TYPE*/
    CALL: SEND MESSAGE TO FAULT TOLERATOR
    INPUT DATA: MESSAGE = RECEIVED MESSAGE
    OUTPUT DATA: NONE
    ENDIF ENDIF ENDIF ENDIF ENDIF
    /*END CASE*/
    ENDIF
    RETURN
    END:
    ______________________________________


Referring to the Psuedo Code program in Table III-A and flow diagram of FIG. 6, the Message Format Checker 216 first checks if an error was detected by the Receiver, as shown in the flow diagram by block 232. The symbols "/*" and "*/" are used in the first line of Table III-A and thereafter to indicate that the enclosed text is a comment in the Psuedo Code and not part of the actual code. The enclosed text is only a comment explaining the following line. For example, the enclosed text on line one of Table III-A identifies the "ERROR DETECTED BITS" of line two as the error detected signals generated by the Receiver. If the error detected bits obtained from the Receiver are not equal to zero (0), where zero values of the error detected bits are indicative of no error detected by the Receiver, then a Record Error message is generated as indicated by block 234, identifying that an error was detected by the Receiver and the checking is terminated (third ENDIF). The error indicator code designating the type of error detected is generated as a function of the error detected bits obtained from the receiver.

If no error was detected by the Receiver, the Message Format Checker proceeds (ELSE) to check the message type code as indicated by block 236. If the message type code is a number greater than the constant maximum inter-computer message type number used in the system, or if it is equal to zero as checked by block 237, then a Record Error message is generated as indicated by block 238, and the checking is terminated (second ENDIF). The error indicator code is set equal to the fixed value which identifies the error as a message type error.

If the message type code is not equal to zero (0) and is not greater than the maximum type number, the program proceeds (ELSE) to compare the sending Computer byte of the message with the Computer code generated by the Receiver, as indicated by block 240. If the sending Computer code contained in the message does not agree with the Computer code generated by the Receiver, a Record Error message is generated as indicated by block 242 and the checking is terminated (first ENDIF). The error indicator is set equal to the fixed value which identifies the error as a sending Computer code error. If no error in the sending Computer code is found, the error indicator is set to zero (0) as indicated by block 244, and the checking is ended. The error indicator value of zero indicates that no error was detected.

In the Psuedo Code program Table III-A and flow diagram FIG. 6, a Record Error message is "generated" when an error is detected by making the error indicator non-zero. Following the checking (the third ENDIF), the error indicator is tested to determine if a Record Error message must be sent, as indicated by block 233. If the error indicator is not zero, (THEN) a Record Error message is sent to the Fault Tolerator, as indicated by block 235. If the error indicator is zero (ELSE), the received message must be forwarded to the proper checker module. The message type code is then tested to determine the message type.

If the message type is a Task Data Value message as tested by block 259, the received message is sent to the Reasonable Limits Checker 218, as indicated by block 239. If the message type is a Redundant Data Value message as tested by block 241, the received message is sent to the Redundant Value Voter 220, as indicated by block 243. If the message type is a Task Completed/Started messag