Previous work on system level synthesis has focused mainly on fine-grain hardware/software partitioning. Examples include Vulcan II [GM92] and Cosyma [EHB93]. These programs automatically partition the input specification into basic blocks (or fine-grain operations) and move the basic blocks to hardware or software components while satisfying the given constraints. The resulting fine-grain partitioning may, however, move logically coherent blocks across different parts or put logically unrelated blocks in the same part. Furthermore, the resulting partitioning creates an implementation which is very different from the initial specification, and hence, is not convenient for human designers to debug or improve upon.
In contrast, coarse-grain partitioning does not decompose the initial specification into basic blocks and does not assign a process in the initial specification to several processors. It is therefore able to preserve the granularity and modularity of the initial specification. In addition, coarse-grain partitioning can exploit the designers' expertise more easily and can achieve a desired partitioning which satisfies some macroscopic choices more readily [MMS95]. Finally, the resulting solution has more logical coherence which facilitates the top-down design process and allows for debugging of the hardware/software.
Many of the coarse-grain partitioning algorithms start from a task graph which consists of a set of communicating processes. In the published literature, task graphs that describe the set of communicating processes (or tasks) are directed acyclic graphs (DAGs) which use nodes to represent processes and arcs to represent precedence relations or communication among the processes. In these task graphs, the communication is assumed to take place from the end of one process (node) to the beginning of another process. We refer to this type of communication as end/begin communication. The coarse-grain processes may however communicate with each other at times other than the end or the beginning of their lifetimes. We refer to this type of communication as midway communication and to the task graph with midway communication as a generalized task graph. The problem we attempt to solve is then stated as follows:
Problem 1: Given a generalized task graph consisting of processes which communicate with each other by arbitrary blocking/nonblocking communication mechanisms and a library containing several possible mappings (or implementations) for each process, simultaneously schedule and map the computational and communication processes to HW/SW resources so as to minimize the total area cost while satisfying a given deadline.
The cost of mapping a process to a library unit (implementation) cannot be determined exactly because of the possibility of sharing the same unit between different processes (e.g., using time-division multiplexing or TDM for short). The cost should account for this possibility and include the area and delay overhead associated with the context switching. We assume in this work that TDM will be used whenever possible, and that the overhead of the context switching is accounted for in the area/delay cost of processes which share the same unit.
A task graph with midway communication becomes a directed multi-graph, i.e., there may exist multiple arcs from one node to another node. The task graph may be periodic. We can handle the case where the period is no less than the deadline by performing the same schedule on every period.
The hardware components which are available in the library can be classified into computational or communication units. Both classes can be further divided into programmable or non-programmable. Examples of programmable computational units are CPUs, DSPs and examples of non-programmable computational units are ASICs and custom ICs. Examples of programmable communication units are FIFOs with controllers, bidirectional handshake controllers, DMA controllers, bus arbiters, or shared memory access and examples of non-programmable communication units are special purpose, customized communication units. All computational and communication units in our library are assumed to be compatible with industry interface standards such as the evolving Virtual Socket Interface. As a result, we can mix and match various IP blocks.
We allow the resource sharing of programmable components by different processes according to TDM, even if the process lifetimes overlap. For nonprogrammable resources, the sharing can only happen if the process lifetimes do not overlap or the processes are mutually exclusive.
Our algorithm consists of three major phases. First, processes are decomposed into subprocesses which perform parts of the required computation. The correct precedence relationships implied by the specified communication mechanism are then added in by a systematic transformation process. Second, the decomposed subprocesses are scheduled so as to ensure that the subprocesses which belong to the same original process are mapped to the same hardware type (for example, the same CPU with the same utilization factor). We refer to the condition that all of the subprocesses which are obtained from the same original process are mapped to the same hardware type with the same utilization factor as type consistency constraint. This constraint is necessary because we assume that the original coarse-grain process has strong internal communication (variable reference, etc.). As a result, we do not want the subprocesses which are decomposed from the same original coarse grain process to be mapped to different hardware units in the final solution. The scheduling is done using a dynamic programming based algorithm which finds the cost-optimal process mapping, while satisfying a given task deadline. The third phase is a hardware allocation and binding (sharing) phase which ensures that the decomposed subprocesses will be mapped not only to the same hardware type, but also to the same hardware instance. The allocation and binding determines the sharing of hardware among all coarse-grain processes in the system.
We summarize the major contributions of this work: