Pipeline flattener for simplifying event detection during data processor debug operations

6836882

Abstract

Pipeline activity information associated with all stages of execution of an instruction in an instruction pipeline of a data processor is presented to an event detector in timewise aligned format. This permits events in the pipeline to be presented to the event detector in a sequence that is consistent with the context in which a programmer of the event detector would normally think of those events, thereby simplifying programmation of the event detector.

Claims

What is claimed is:

Description

FIELD OF THE INVENTION
TABLE 1
Emulation System Architecture and Usage

Architectural  Visibility                Control                   Debug Usage
Component      Provisions                Provisions
-------------  ------------------------  ------------------------  ------------------------
RTE            Static view of the CPU    Analysis components are   Basic debug
               and memory state after    used to stop execution    Computational problems
               background program is     of background program.    Code design problems
               stopped. Interrupt
               driven code continues
               to execute.

RTDX™          Debugger software         Analysis components are   Dynamic instrumentation
               interacts with the        used to identify          Dynamic variable adjustments
               application code to       observation points and    Dynamic data collection
               exchange commands and     interrupt program flow
               data while the            to collect data.
               application continues
               to execute.

Trace          Bus snooper hardware      Analysis components are   Prog. flow corruption debug
               collects selective        used to define program    Memory corruption
               program flow and data     segments and bus          Benchmarking
               transactions for export   transactions that are to  Code Coverage
               without interacting with  be recorded for export.   Path Coverage
               the application.                                    Program timing problems

Analysis       Allows observation of     Alter program flow after  Benchmarking
               occurrences of events or  the detection of events   Event/sequence identification
               event sequences.          or event sequences.       Ext. trigger generation
               Measure elapsed time                                Stop program execution
               between events.                                     Activate Trace and RTDX™
               Generate external
               triggers.
Real-Time Emulation (RTE) provides a base set of fixed capabilities for real-time execution control (run, step, halt, etc.) and register/memory visibility. This component allows the user to debug application code while real-time interrupts continue to be serviced. Registers and memory may be accessed in real-time with no impact to interrupt processing. Users may distinguish between real-time and non real-time interrupts, and mark code that must not be disturbed by real-time debug memory accesses. This base emulation capability includes hardware that can be configured as two single point hardware breakpoints, a single data watchpoint, an event counter, or a data logging mechanism. The EMU pin capability includes trigger I/Os for multiprocessor event processing and a uni-directional (target to host) data logging mechanism.

RTDX™ provides real-time data transfers between an emulator host and target application. This component offers both bi-directional and uni-directional DSP target/host data transfers facilitated by the emulator. The DSP (or target) application may collect target data to be transferred to the host or receive data from the host, while emulation hardware (within the DSP and the emulator) manages the actual transfer. Several RTDX™ transfer mechanisms are supported, each providing different levels of bandwidth and pin utilization, allowing the trade off of gates and pin availability against bandwidth requirements.

Trace is a non-intrusive mechanism of providing visibility of the application activity. Trace is used to monitor CPU related activity such as program flow and memory accesses, system activity such as ASIC state machines, data streams and CPU collected data. Historical trace technology also used logic analyzer-like collection and special emulation (SE) devices with more pins than a production device.
The logic analyzer or like device processed native representations of the data using a state machine-like programming interface (filter mechanism). This trace model relied on all activity being exported, with external triggering selecting the data that needed to be stored, viewed and analyzed.

Existing logic analyzer-like technology does not, however, provide a solution to decreasing visibility due to higher integration levels, increasing clock rates and more sophisticated packaging. In this model, the production device must provide visibility through a limited number of pins. The data exported is encoded or compressed to reduce the export bandwidth required. The recording mechanism becomes a pure recording device, packing exported data into a deep trace memory. Trace software is used to convert the recorded data into a record of system activity.

On-chip Trace with high speed serial data export, in combination with Advanced Analysis, provides a solution for SOC designs. Trace is used to monitor CPU related activity such as program flow and memory accesses, system activity such as ASIC state machines, data streams etc., and CPU collected data. This creates four different classes of trace data: Program flow and timing provided by the DSP core (PC trace); Memory data references made by the DSP core or chip level peripherals (Data reads and writes); Application specific signals and data (ASIC activity); and CPU collected data. Collection mechanisms for the four classes of trace data are modular, allowing the trade off of functionality versus gates and pins required to meet desired bandwidth requirements.

The RTDX™ and Trace functions provide similar, but different forms of visibility. They differ in terms of how data is collected, and the circumstances under which they would be most effective. A brief explanation is included below for clarity.

RTDX™ (Real Time Data eXchange) is a CPU assisted solution for exchanging information; the data to be exchanged have a well-defined behavior in relation to the program flow. For example, RTDX™ can be used to record the input or output buffers from a DSP algorithm. RTDX™ requires CPU assistance in collecting data, hence there is a definite, but small, CPU bandwidth required to accomplish this. Thus, RTDX™ is an application intrusive mechanism of providing visibility with low recurring overhead cost.

Trace is a non-intrusive, hardware-assisted collection mechanism (such as bus snoopers) with very high bandwidth (BW) data export. Trace is used when there is a need to export data at a very high data rate or when the behavior of the information to be traced is not known, or is random in nature or associated with an address. Program flow is a typical example where it is not possible to know the behavior a priori. The bandwidth required to export this class of information is high. Data trace of specified addresses is another example. The bandwidth required to export data trace is very high.

Trace data is unidirectional, going from target to host only. RTDX™ can exchange data in either direction, although unidirectional forms of RTDX™ are supported (data logging). The Trace data path can also be used to provide very high speed uni-directional RTDX™ (CPU collected trace data). The high level features of Trace and RTDX™ are outlined in Table 2.
TABLE 2
RTDX™ and Trace Features

Features         RTDX™                               Trace
---------------  ----------------------------------  ------------------------------
Bandwidth/pin    Low                                 High
Intrusiveness    Intrusive                           Non-intrusive
Data Exchange    Bi-directional or uni-directional   Export only
Data collection  CPU assisted                        CPU or Hardware assisted
Data transfer    No extra hardware for minimum BW    Hardware assisted
                 (optional hardware for higher BW)
Cost             Relatively low recurring cost       Relatively high recurring cost
Advanced analysis provides a non-intrusive on-chip event detection and trigger generation mechanism. The trigger outputs created by advanced analysis control other infrastructure components such as Trace and RTDX™. Historical trace technology used bus activity exported to a logic analyzer to generate triggers that controlled trace within the logic analyzer unit or generated triggers which were supplied to the device to halt execution. This usually involved a chip that had more pins than the production device (an SE or special emulation device). This analysis model does not work well in the System-on-a-Chip (SOC) era as the integration levels and clock rates of today's devices preclude full visibility bus export.

Advanced analysis provides affordable on-chip instruction and data bus comparators, sequencers and state machines, and event counters to recreate the most important portions of the triggering function historically found off chip. Advanced analysis provides the control aspect of the debug triggering mechanism for Trace, RTDX™ and Real-Time Emulation. This architectural component identifies events, tracks event sequences, and assigns actions based on their occurrence (break execution, enable/disable trace, count, enable/disable RTDX™, etc.). The modular building blocks for this capability include bus comparators, external event generators, state machines or state sequencers, and trigger generators. The modularity of the advanced analysis system allows the trade off of functionality versus gates.

Emulator capability is created by the interaction of four emulator components: 1. debugger application program; 2. host computer; 3. emulation controller; and 4. on-chip debug facilities. These components are connected as shown in FIG. 1. The host computer 10 is connected to an emulation controller 12 (external to the host) with the emulation controller (also referred to herein as the emulator or the controller) also connected to the target system 16.
The user preferably controls the target application through a debugger application program, running on the host computer, for example, Texas Instruments' Code Composer Studio program. A typical debug system is shown in FIG. 1. This system uses a host computer 10 (generally a PC) to access the debug capabilities through an emulator 12. The debugger application program presents the debug capabilities in a user-friendly form via the host computer. The debug resources are allocated by debug software on an as needed basis, relieving the user of this burden. Source level debug utilizes the debug resources, hiding their complexity from the user. The debugger together with the on-chip Trace and triggering facilities provide a means to select, record, and display chip activity of interest. Trace displays are automatically correlated to the source code that generated the trace log. The emulator provides both the debug control and trace recording function. The debug facilities are programmed using standard emulator debug accesses through the target chips' JTAG or similar serial debug interface. Since pins are at a premium, the technology provides for the sharing of the debug pin pool by trace, trigger, and other debug functions with a small increment in silicon cost. Fixed pin formats are also supported. When the sharing of pins option is deployed, the debug pin utilization is determined at the beginning of each debug session (before the chip is directed to run the application program), maximizing the trace export bandwidth. Trace bandwidth is maximized by allocating the maximum number of pins to trace. The debug capability and building blocks within a system may vary. The emulator software therefore establishes the configuration at run-time. This approach requires the hardware blocks to meet a set of constraints dealing with configuration and register organization. 
Other components provide a hardware search capability designed to locate the blocks and other peripherals in the system memory map. The emulator software uses a search facility to locate the resources. The address where the modules are located and a type ID uniquely identify each block found. Once the IDs are found, a design database may be used to ascertain the exact configuration and all system inputs and outputs.

The host computer is generally a PC with at least 64 Mbytes of memory and capable of running at least Windows95, SR-2, Windows NT, or later versions of Windows. The PC must support one of the communications interfaces required by the emulator, for example: Ethernet 10T and 100T, TCP/IP protocol; Universal Serial Bus (USB), rev 1.x; Firewire, IEEE 1394; and/or Parallel Port (SPP, EPP, and ECP).

The emulation controller 12 provides a bridge between the host computer 10 and target system 16, handling all debug information passed between the debugger application running on the host computer and a target application executing on a DSP (or other target processor) 14.

One exemplary emulator configuration supports all of the following capabilities: Real-time Emulation; RTDX™; Trace; and Advanced Analysis. Additionally, the emulator-to-target interface supports: Input and output triggers; Bit I/O; and Managing special extended operating modes.

The emulation controller 12 accesses Real-time Emulation capabilities (execution control, memory, and register access) via a 3, 4, or 5 bit scan based interface. RTDX™ capabilities can be accessed by scan or by using three higher bandwidth RTDX™ formats that use direct target-to-emulator connections other than scan. The input and output triggers allow other system components to signal the chip with debug events and vice-versa.

The emulator 12 is partitioned into communication and emulation sections.
The communication section supports communication with the host 10 on host communication links while the emulation section interfaces to the target, managing target debug functions and the device debug port. The emulator 12 communicates with the host computer 10 using, e.g., one of the aforementioned industry standard communication links at 15. The host-to-emulator connection can be established with off-the-shelf cabling technology. Host-to-emulator separation is governed by the standards applied to the interface used.

The emulation controller 12 communicates with the target system 16 through a target cable or cables at 17. Debug, Trace, Triggers, and RTDX™ capabilities share the target cable, and in some cases, the same device pins.

FIG. 2 is a timing diagram which illustrates exemplary pipeline activity exhibited by a target processor, for example the processor shown at 14 in FIG. 1, with a pipelined architecture. FIG. 2 exhibits the following exemplary pipeline stages: Instruction Fetch IF; Instruction Data ID; Instruction Decode DC; Read Address RA; Read Data RD; Arithmetic Unit Operation AU; and Write WR. As shown in FIG. 2, a new seven-stage instruction begins with each new clock cycle (t0-t12).

Assume now, for example, that a debug event detector such as a state machine has been programmed to detect a sequence as indicated below:

State 0: If (IF0=0x55) then goto State 1
State 1: If (RA1=0x50 and RD1=0x9999) then goto State 2
State 2: If (IF2=0x90) then goto State 3
State 3: Trigger

The events which drive the foregoing exemplary state machine sequence are highlighted in FIG. 2. In this example, the programmer wishes to detect the following sequence of events: the execution of an instruction represented by instruction fetch 0 (IF0), followed by a read operation represented by RA1 (Read Address 1) and RD1 (Read Data 1), followed by the execution of an instruction represented by instruction fetch 2 (IF2).
If this event sequence is to be detected from observation of conventional pipeline activity (e.g. as shown in FIG. 2), then the sequence must be specified in a different order from the order in which the programmer would normally (and most conveniently) think of the sequence. More particularly, the programmer would normally, and most conveniently, think of the foregoing sequence in the following context: First, all activities of the seven pipeline stages of instruction 1 occur; second, all activities of the seven pipeline stages of instruction 2 occur, etc. In fact, the foregoing state machine sequence is programmed according to this way of thinking about the sequence of events. Unfortunately, due to the pipeline effect illustrated in FIG. 2, all of the desired events can occur as shown in FIG. 2 but, due to the state machine programmation, the state machine will not progress from state 2 to state 3 and therefore will not trigger. The state machine will not progress from state 2 into state 3 because the state machine programmation assumes that the events RA1 and RD1 will precede the event IF2. That is, the state machine has been designed with the aforementioned presumption that all activities associated with all pipeline stages of instruction 1 will occur, after which all activities associated with all pipeline stages of instruction 2 will occur. As shown in FIG. 2, this is not the case, inasmuch as event IF2 actually occurs before either of events RA1 or RD1, so the state machine will not progress from state 2 to state 3 as desired. The foregoing state machine programmation would be correct if IF2 were replaced in the desired sequence by IF6, because event IF6 occurs after events RA1 and RD1 so the state machine would advance from state 2 to state 3 as desired. 
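The ordering failure described above can be reproduced with a minimal Python sketch. It is an illustration only and not part of the disclosed apparatus. It rests on several assumptions: events are modeled as names such as IF0 or RD1 rather than as bus-value comparisons, instruction i is taken to occupy stage k (counting from zero) at clock i + k, following the description of FIG. 2, and the detector is assumed to latch the parts of a compound condition (RA1 and RD1) while it waits in a state. The names raw_events and run_detector are hypothetical.

```python
# Stage order as listed for FIG. 2 in the text.
STAGES = ["IF", "ID", "DC", "RA", "RD", "AU", "WR"]


def raw_events(num_instructions):
    """Return {clock: set of event names} for an unflattened pipeline.

    Assumption: instruction i occupies stage k (0-based) at clock i + k.
    """
    timeline = {}
    for i in range(num_instructions):
        for k, stage in enumerate(STAGES):
            timeline.setdefault(i + k, set()).add(f"{stage}{i}")
    return timeline


def run_detector(timeline, sequence):
    """Walk the timeline and return the final detector state.

    Each entry of `sequence` is the set of events a state waits for.
    Assumption: events seen while a state is active are latched until
    the whole set has been observed; at most one transition per clock.
    """
    state, seen = 0, set()
    for clock in sorted(timeline):
        if state == len(sequence):
            break  # trigger state reached
        seen |= timeline[clock] & sequence[state]
        if seen == sequence[state]:
            state, seen = state + 1, set()
    return state


timeline = raw_events(7)
stalled = run_detector(timeline, [{"IF0"}, {"RA1", "RD1"}, {"IF2"}])
triggered = run_detector(timeline, [{"IF0"}, {"RA1", "RD1"}, {"IF6"}])
```

Under these assumptions the detector stalls in state 2 for the IF2 sequence, because IF2 occurs at t2 while RD1 does not occur until t5, and it reaches state 3 when IF6 is substituted for IF2.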
The event ordering problem described above indicates that the state machine programmation should preferably anticipate the pipeline effects and the relative proximity of events, for example the proximity of event IF0 to event IF2. Anticipating the pipeline effects and the proximity of various events can be particularly difficult when, for example, the event sequence to be detected is related entirely to read and write operations. In such situations, the state machine programmation would be set up not knowing the relationship of the reads and the writes in the pipeline. If it is assumed that the read and write positioning in the pipeline will influence the event sequence, then the detection sequence would be specified in a different manner than if it were assumed that the read and write positioning in the pipeline does not influence the event sequence (i.e., the first event has cleared the pipeline before the second event happens). Because it is impossible to know what instruction sequence and event proximity will generate the desired read and write sequence, no matter how the state machine is programmed, it can still either fail to detect a legitimate sequence, or falsely indicate that a legitimate sequence has been detected. These problems are addressed according to exemplary embodiments of the invention by timewise aligning all pipeline stage activities of a given instruction with the activity of the last pipeline stage of that instruction. This timewise alignment of the activities of all pipeline stages of a given instruction advantageously permits the programmer to program the state machine or other detection logic according to the way that the programmer would normally think of the sequence of events that is to be detected. The aforementioned timewise alignment of the activities of all pipeline stages of a given instruction can be accomplished according to the invention by a pipeline flattener such as illustrated at 31 in FIG. 3. 
The pipeline flattener receives, for example, the pipeline activity information for each stage in the seven-stage pipeline sequence illustrated in FIG. 2. At the pipeline flattener input, the pipeline activity information can be arranged in the sequential format (see FIG. 2) that is conventionally provided to event detectors. For each instruction, the pipeline flattener 31 is operable to arrange the pipeline activity information from the first six pipeline stages in timewise alignment with the pipeline activity information from the seventh pipeline stage. This timewise aligned pipeline stage information is then provided by the pipeline flattener 31 to the event detector, for example a state machine or other suitable event detection logic. FIG. 4 is a timing diagram which illustrates exemplary operations which can be performed by the pipeline flattener 31 of FIG. 3. In the example of FIG. 4, the pipeline flattener operates on instruction 0 of FIG. 2 (other instructions have been omitted for clarity). In FIG. 4, the time scale proceeds horizontally in the same fashion as illustrated in FIG. 2. However, in FIG. 4, the pipeline stages are also offset from one another in the vertical direction in order to clearly illustrate the pipeline flattener operation. As shown in the example of FIG. 4, all activities of all pipeline stages are timewise aligned at time t7, the first clock cycle after execution of instruction 0 has been completed. The last pipeline stage of instruction 0, namely the write stage WR, exits the pipeline at time t7, after its execution at time t6. Thus, at time t7, the activities of all pipeline stages can be timewise aligned for presentation to the event detector. As shown in FIG. 4, this timewise alignment requires that each of the pipeline stages other than the WR stage be delayed by an appropriate amount. 
In particular, the IF stage is delayed by six clock cycles (D6), the ID stage is delayed by five clock cycles (D5), the DC stage is delayed by four clock cycles (D4), the RA stage is delayed by three clock cycles (D3), the RD stage is delayed by two clock cycles (D2), and the AU stage is delayed by one clock cycle (D1). By implementing these time delays with respect to the first six pipeline stages, the pipeline flattener is able to present all seven pipeline stages in timewise alignment at time t7.

FIG. 5 diagrammatically illustrates exemplary embodiments of the pipeline flattener of FIG. 3. As shown in FIG. 5, the pipeline flattener 31 is embodied as a plurality of delay lines which appropriately delay all but the last pipeline stage of each instruction, for example the first six pipeline stages of the seven stage pipeline of FIGS. 2 and 4. As shown in FIG. 5, the delay associated with a given pipeline stage n is equal to the pipeline length minus n. For example, in FIG. 4, the delay associated with the fourth pipeline stage RA (n=4) is 7-4=3 clock cycles.

FIG. 6 is a timing diagram which illustrates an example of the output of the pipeline flattener of FIGS. 3 and 5 in response to the pipeline stage information of the seven instructions of FIG. 2. In FIG. 6, the ID, DC and AU stages are omitted for purposes of clarity. It should also be noted that the WR pipeline stage of FIG. 2 is illustrated in FIG. 6 as a two-part pipeline stage including a write address portion WA and a write data portion WD. This relationship is illustrated in FIG. 7, which shows that, in some embodiments, the WR stage includes concurrent WA and WD sub-stages.

In FIG. 6, the pipeline stages of instruction 0 are timewise aligned at time t7, and the pipeline stages of instructions 1-6 are respectively timewise aligned at times t8-t13. The events that are highlighted in FIG. 2 are also highlighted in FIG. 6, thereby clearly illustrating that the desired events will now be presented to the event detector in a sequence that is consistent with the context in which the programmer would normally think of those events. Therefore, the state machine described in the above example would detect the desired events when provided with the pipeline flattener output illustrated in FIG. 6.

Although exemplary embodiments of the invention are described above in detail, this does not limit the scope of the invention, which can be practiced in a variety of embodiments.
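The delay-line arrangement described for FIG. 5 can be sketched in Python as follows. This is a hedged software model, not the hardware of the embodiments: the names PipelineFlattener, flatten and detect are hypothetical, events are modeled as names such as IF0 rather than bus values, and rows are indexed by the clock in which the last stage executes. The text presents the aligned result one clock later (t7 for instruction 0, whose WR stage executes at t6).

```python
from collections import deque

STAGES = ["IF", "ID", "DC", "RA", "RD", "AU", "WR"]
DEPTH = len(STAGES)


def stage_delay(n):
    """Delay in clocks for 1-based stage n: pipeline length minus n."""
    return DEPTH - n


class PipelineFlattener:
    """One delay line per stage; the last stage (WR) has zero delay."""

    def __init__(self):
        self.lines = [deque([None] * stage_delay(n))
                      for n in range(1, DEPTH + 1)]

    def clock(self, stage_values):
        """Push one clock of per-stage activity, return the aligned row."""
        aligned = []
        for line, value in zip(self.lines, stage_values):
            line.append(value)
            aligned.append(line.popleft())
        return aligned


def flatten(num_instructions):
    """Feed the raw pipeline activity through the flattener."""
    flattener = PipelineFlattener()
    rows = []
    for c in range(num_instructions + DEPTH - 1):
        # Assumption: instruction c - k occupies stage k at clock c.
        raw = [f"{s}{c - k}" if 0 <= c - k < num_instructions else None
               for k, s in enumerate(STAGES)]
        rows.append(flattener.clock(raw))
    return rows


def detect(rows, sequence):
    """Advance one state whenever a row contains the awaited events."""
    state = 0
    for row in rows:
        if state < len(sequence) and sequence[state] <= set(row):
            state += 1
    return state


rows = flatten(7)
final_state = detect(rows, [{"IF0"}, {"RA1", "RD1"}, {"IF2"}])
```

In this model rows[6] holds every stage of instruction 0, rows[7] every stage of instruction 1, and so on, so the three-step sequence from the earlier example reaches the trigger state without any latching of compound conditions.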
