----------------------------------------------------------------------------

BUILD A MULTIBUS II SINGLE-BOARD COMPUTER

By designing a complete CPU on one standard Multibus II board, system
integrators can take advantage of rapidly advancing CPU technology while
preserving their investments in existing hardware.  Ideally, the
single-board computer should have its own memory, local I/O circuitry, and a
Multibus II interface.

By basing the CPU on an i860 64-bit microprocessor, the designer can bring
the power of a supercomputer to the single-board computer.  The board can be
the main CPU running an operating system such as Unix, or it can become an
application accelerator in an existing system.  As an alternative, multiple
CPU boards can be linked in a multiprocessor system with performance
unavailable in a minicomputer.

Multibus II is a processor-independent bus architecture with full
distributed-multiprocessor support.  The standard, IEEE-959, defines a
32-bit parallel system bus (PSB) with a maximum throughput of 40 Mbytes/s. 
The PSB, which is optimized for standardized interprocessor communications,
also handles accesses to I/O devices not dedicated to one CPU board.
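
As a rough check on the 40-Mbyte/s figure, the peak rate of a parallel bus is its width in bytes times the rate at which it completes transfers.  The sketch below assumes a 10-MHz bus clock and one full-width transfer per clock; neither assumption is stated in this article, and the function name is only illustrative.

```python
def peak_throughput_mbytes_per_s(bus_width_bits, clock_mhz):
    """Peak rate if one full-width transfer completes on every bus clock."""
    return bus_width_bits / 8 * clock_mhz


PSB_WIDTH_BITS = 32    # from the text
PSB_CLOCK_MHZ = 10.0   # assumption: the article does not give the bus clock

psb_peak = peak_throughput_mbytes_per_s(PSB_WIDTH_BITS, PSB_CLOCK_MHZ)
print(psb_peak)
```

Arbitration and protocol overhead would pull the sustained rate below this peak.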

The CPU board contains a number of functions that are made possible by the
high integration level of the i860 and another key component, the 82380
integrated system peripheral.  These features include a Multibus II
interface using DMA and a message-passing coprocessor (MPC), a dynamic RAM
(DRAM) main memory of 8 Mbytes based on 256-kbit-by-4-bit RAMs, a local I/O
system, and an SBX connector (Fig. 1).

Built into the i860 microprocessor are an integer execution unit, a
floating-point unit, a graphics unit, a 4-kbyte instruction and 8-kbyte data
cache, and memory-management and bus-control units.  The integer execution
unit runs 85 kDhrystones, and the floating-point unit delivers 24 MWhetstones
(double precision) and 80 MFLOPS (single precision).  Z-buffer checking and
shading are supplied by the graphics unit.  The cache memory makes possible a
960-Mbyte/s data rate.
The units can operate in parallel, with high-bandwidth data rates and a
40-MHz operating frequency.

The 82389 MPC simplifies the task of interfacing the processor's local bus
to the parallel system bus.  Designed for the message-passing protocols of
the Multibus II architecture, the coprocessor participates in the entire PSB
protocol and performs bus arbitration, transfer control, error detection and
reporting, and parity generation and checking.  These functions occur
independently of the host CPU.

BUSSES ISOLATED

The parallel system bus is isolated from the CPU local bus so that the i860
can access memory on its 64-bit data bus.  In addition, decoupling the local
bus activities from interprocessor communications over the PSB offers two
advantages.  First, resources that would be held in wait states while
dedicated bus-access arbitration is underway are instead free.  This
parallelism increases system performance.  Furthermore, the bandwidth of one
bus doesn't limit the transfer rate of the other: Each bus can perform
full-speed, synchronous transfers.

The MPC's signals can be divided into three functional groups: PSB
interface, local bus interface, and DMA interface (Fig. 2).  The primary
functions of the PSB interface signals are arbitration and system control. 
Six arbitration signals (ARB0 through ARB5) read card-slot ID and arbitration ID
from the central services module (CSM) during reset.  During arbitration,
these signals output the arbitration ID for priority resolution.

Bus Request (BREQ) shouldn't be confused with the i860 microprocessor's BREQ
signal.  Each bus agent asserts BREQ to gain control of the bus and samples
BREQ to determine if other agents are also contending for bus control.  A
bus agent sends Bus Error (BUSERR) to all other bus agents when it detects a
transfer-cycle parity error.  The CSM sends the bus Timeout signal (TIMOUT)
to all bus agents when a bus cycle fails to end within a prescribed time.

Ten System-Control signals (SC0 through SC9) coordinate transfer cycles, as
defined by the Multibus II Architectural Specification.  With Directional
Enables (SCDIR0 and SCDIR1), transceivers can buffer the
bidirectional system-control signals.  The MPC checks byte-parity lines
(PAR0 through PAR3) for incoming operations and sets the parity lines for
outgoing operations.
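
The byte-parity scheme can be sketched in a few lines.  The example assumes even parity and that the first parity line covers data bits 7 through 0; the article states neither the parity sense nor the lane order, so both are placeholders, as are the function names.

```python
def byte_parity_bits(word, even=True):
    """Return four parity bits, one per byte lane of a 32-bit word.

    Lane 0 is assumed to be bits 7..0. With even parity, the parity bit
    is chosen so that the byte plus its parity bit holds an even number
    of ones.
    """
    bits = []
    for lane in range(4):
        byte = (word >> (8 * lane)) & 0xFF
        odd_ones = bin(byte).count("1") & 1
        bits.append(odd_ones if even else odd_ones ^ 1)
    return bits


def parity_ok(word, parity, even=True):
    """Check a received word against its four received parity bits."""
    return byte_parity_bits(word, even) == list(parity)


sent = 0x00FF0103
sent_parity = byte_parity_bits(sent)      # generated for an outgoing transfer
corrupted = sent ^ 0x00000100             # a single bit flips in lane 1
print(parity_ok(sent, sent_parity), parity_ok(corrupted, sent_parity))
```

A single flipped bit in any lane changes that lane's parity, which is the kind of transfer-cycle error an agent would report with BUSERR.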

Other parallel system bus signals are Reset (RST), Reset-Not-Complete
(RSTNC), and ID Latch (LACHn, where n = slot number).  These signals are
only used during system initialization.

The MPC also handles interrupts for bus agents on the parallel system bus by
implementing them as virtual interrupts in the message space.  To send an
interrupt message, the processor writes the source, destination, and message
type to the MPC, which coordinates the interrupt message transfer.

The MPC's local bus interface is like that of any other simple I/O device,
consisting of select lines, address signals, and read and write control
lines.  The coprocessor's registers are accessed by asserting REGSEL and the
appropriate register address while performing a read or write cycle.  Among
other functions, the registers program data message transfers, receive and
send control transfers, and handle errors.

For DMA interfacing, the MPC has two channels that use the standard DMA
Request (DREQ) and DMA Acknowledge (DACK) hardware transfer protocol.  The
two channels are dedicated to the MPC: one as the input and one as the
output.  Each has its own control lines and operates independently.

The CPU board includes a static RAM (SRAM) message area that isolates
backplane data rates from the processor's local bus.  For an outgoing
solicited message (data transfer), the microprocessor sends messages to the
SRAM at a high rate.  The DMA channel then transfers messages from the SRAM
to the MPC's outgoing message FIFO buffer.  For an incoming solicited
message, the DMA channel transfers the message into the SRAM buffer,
signaling the microprocessor when the transfer is done.  The processor can
then quickly transfer the message from the SRAM into main memory.

A DRAM MAIN MEMORY

The i860's large on-chip caches and pipelined architecture make its
performance less dependent on a fast--and expensive--SRAM main memory. 
Instead, the microprocessor's bus is optimized for interfacing directly with
DRAMs.  The 64-bit wide data bus accepts data on every other clock for read
cycles, allowing a longer cycle time between accesses.  To further lengthen
the allowable access time without decreasing the bandwidth, the bus
accommodates two levels of pipelining, so that up to three cycles can be
outstanding.

To fill a cache line, the processor performs four read cycles (Fig. 3). 
When the Next Address signal (NA) is returned to the microprocessor, the
system can accept the next bus cycle.  Two NAs are returned before any of
the cycles are completed.  To complete a read cycle, the memory system puts
the data on the bus and returns Ready to the microprocessor.  When it's
fully pipelined, the memory system supplies data and Ready on every other
clock.  Ordinary static-column DRAMs can deliver this data rate.  The
processor also supplies a control signal, Next Near (NENE), to optimize DRAM
control.

The CPU board's memory system consists of two address latches, eight
latching data buffers, and a 64-bit-wide static-column DRAM (Fig. 4).  The
memories used are 256-kbit-by-4-bit devices, which allow for an incremental
system memory size of 2 Mbytes.  The use of 4-bit-wide memories reduces
power and signal-drive needs.
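
The 2-Mbyte increment follows directly from the device geometry and the bus width.  The arithmetic below uses only figures given in the text (256-kbit-by-4-bit devices, a 64-bit data bus, 8 Mbytes on the board); the variable names are illustrative, and any parity or spare devices are ignored.

```python
DEVICE_DEPTH = 256 * 1024      # addressable locations per DRAM device
DEVICE_WIDTH_BITS = 4          # bits per location
BUS_WIDTH_BITS = 64            # i860 data bus width
BOARD_MEMORY_MBYTES = 8        # main memory fitted on the board

# Devices needed side by side to span the 64-bit bus.
devices_per_bank = BUS_WIDTH_BITS // DEVICE_WIDTH_BITS

# Each bank stores DEVICE_DEPTH words of 64 bits.
bank_mbytes = DEVICE_DEPTH * (BUS_WIDTH_BITS // 8) // (1024 * 1024)

banks = BOARD_MEMORY_MBYTES // bank_mbytes
total_devices = banks * devices_per_bank

print(devices_per_bank, bank_mbytes, banks, total_devices)
```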

To accommodate pipelining, both address and data are latched.  The address
latches hold the address of the previous cycle while the data from the cycle
prior to that is held in the data buffers.

Using TTL components on the address and data paths isolates the memory from
the processor's pin timing.  The memory system uses two address latches in
order to multiplex the row and column addresses from the processor to the
DRAM's address lines.  When accesses occur within the DRAM page, only the
column address needs to be supplied to the memory-address lines.

Most systems that use a fast-access DRAM mode need an additional hardware
comparator, but the i860 has a comparator built into the bus unit.  On each
bus cycle the comparator supplies NENE, which the controller uses to
determine if a fast static-column mode access can occur or if a full DRAM
cycle must occur.
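
Functionally, the comparator checks whether the current address falls in the same DRAM page as the previous one.  The sketch below models that decision in software.  It assumes a 4-Kbyte page (nine column-address bits on an 8-byte-wide bus) and ignores bank selection; the page size and the function names are assumptions for illustration, not details taken from the board design.

```python
PAGE_SHIFT = 12   # assumption: 4-Kbyte DRAM page


def next_near(prev_addr, addr, page_shift=PAGE_SHIFT):
    """True when addr lies in the same DRAM page as the previous cycle."""
    if prev_addr is None:
        return False          # no previous cycle to compare against
    return (prev_addr >> page_shift) == (addr >> page_shift)


def classify(addresses, page_shift=PAGE_SHIFT):
    """Label each access as a static-column hit or a full DRAM cycle."""
    labels = []
    prev = None
    for addr in addresses:
        hit = next_near(prev, addr, page_shift)
        labels.append("static-column" if hit else "full-cycle")
        prev = addr
    return labels


print(classify([0x1000, 0x1008, 0x1FF8, 0x2000]))
```

Sequential accesses stay in page until the address crosses a page boundary, at which point the controller must supply a new row address.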

Bidirectional data buffers latch the data for both reads and writes.  For
reads, the data is latched and Ready is returned on the following clock. 
With two levels of pipelining, total access time is six clocks and data is
available every two clocks.
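
These figures can be turned into a simple timing estimate for a four-read cache-line fill.  The sketch counts clocks from the start of the first cycle, which is an assumption about where the six-clock access time is measured from; the function names are illustrative.

```python
def read_ready_clocks(n_reads, first_access_clocks=6, interval_clocks=2):
    """Clock on which Ready returns for each of n pipelined reads."""
    return [first_access_clocks + i * interval_clocks for i in range(n_reads)]


def sustained_mbytes_per_s(clock_mhz, bus_bytes=8, interval_clocks=2):
    """Data rate once the pipeline is full: one bus width every interval."""
    return clock_mhz * bus_bytes / interval_clocks


line_fill = read_ready_clocks(4)    # four 64-bit reads fill one cache line
print(line_fill)
print(sustained_mbytes_per_s(40.0), sustained_mbytes_per_s(33.0))
```

Under these assumptions the last word of the line arrives on clock 12, and the fully pipelined bus moves 8 bytes every two clocks.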

Write cycles don't need pipelining for zero-wait-state operation.  When a
write occurs, the address and data are latched in the buffers, making it
possible for Ready to be returned to the processor.  The actual write cycle
occurs after Ready is returned to the processor.  This delayed write
operation allows the processor to continue executing even though the write
is incomplete.

With 85-ns static-column-mode DRAMs for the main memory, the 33-MHz i860 can
run at zero wait states for accesses within the DRAM page.  And with the
two-level pipelining and two-clock-cycle transfer rate, the CPU board
doesn't need an expensive external cache memory.

The board's memory-mapped I/O system consists of an 82510 serial port, the
82380 integrated systems peripheral, one 8-bit boot EPROM, and an SBX
connector.  The serial port, which can be used in a polled or an interrupt
mode, is also suitable for use as a console monitor.  A reduced version of
the serial controller's oscillator module clocks the system's timers.  A
control port enables and disables the timers, whose outputs are connected as
interrupts.

PROGRAMMABLE TIMERS

The 82380 serves both as a slave I/O device and as a bus master DMA
controller.  The peripheral contains four 16-bit programmable timers
(8254s), two interrupt controllers (an 8259 master and slave), and eight DMA
channels.  Five internal interrupts can be used with the DMA channels and
the timers.  In addition, the device has connections for 15 external
interrupts.

Because it uses a double-frequency clock, the 20-MHz 82380 can operate
synchronously with a 40-MHz CPU.  At reset, a phase clock is generated and
used by the control logic for 82380 accesses.  Port addressing is handled by
connecting pin A3 of the CPU to pin A2 of the 82380.

The I/O system uses four of the eight DMA channels: two for the MPC
interface and two for the SBX connector.  These channels, which can transfer
into or out of the SRAM message area, are programmed with the 82380 in slave
mode.  Once programmed, the DMA channels become bus masters to complete
transfers.  To gain control of the bus, the 82380 asserts Hold.  When the
hold-acknowledge signal, Holda, is returned to the 82380, the peripheral
initiates the DMA transfer, which continues until completion or until Holda
is deasserted.

To support the SBX connector, the designer uses several jumpers on the 82380
to configure the DMA control interface as defined in the SBX specification. 
To access 8- and 16-bit SBX devices, the processor must address them on
8-byte boundaries.  The DMA controller performs byte assembly and disassembly
from 8 or 16 bits to 32 bits.
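
Byte assembly amounts to packing several narrow transfers into one 32-bit word, and disassembly is the reverse.  The sketch below assumes the first unit received lands in the low-order bits of the word; the article does not give the byte order, and the function names are illustrative.

```python
def _check_width(unit_bits):
    if unit_bits not in (8, 16):
        raise ValueError("unit_bits must be 8 or 16")


def assemble_words(units, unit_bits):
    """Pack 8- or 16-bit units into 32-bit words, first unit lowest."""
    _check_width(unit_bits)
    per_word = 32 // unit_bits
    if len(units) % per_word:
        raise ValueError("unit count must fill whole 32-bit words")
    mask = (1 << unit_bits) - 1
    words = []
    for start in range(0, len(units), per_word):
        word = 0
        for lane, unit in enumerate(units[start:start + per_word]):
            word |= (unit & mask) << (lane * unit_bits)
        words.append(word)
    return words


def disassemble_words(words, unit_bits):
    """Split 32-bit words back into 8- or 16-bit units, lowest first."""
    _check_width(unit_bits)
    per_word = 32 // unit_bits
    mask = (1 << unit_bits) - 1
    return [(word >> (lane * unit_bits)) & mask
            for word in words for lane in range(per_word)]


print(hex(assemble_words([0x11, 0x22, 0x33, 0x44], 8)[0]))
```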

As noted, the board uses the SRAM message area to handle DMA transfers with
the SBX.  DMA transfers may occur in parallel with CPU main memory accesses.
Both the DMA controller and the CPU can act as bus masters to access the
SRAM and the I/O bus.

The SRAM bank, which is 32 bits wide, connects directly to the I/O data bus,
but four data buffers are needed on the processor side.  The system only
uses the lower 32-bit processor data lines.  Though the processor can access
the SRAM with zero wait states, it may only access even-word addresses.  The
processor's automatic-increment addressing mode is used, and the full cycle
rate is maintained.  The system could handle a 64-bit SRAM bank but the
memory would need buffers on the processor side, and the I/O side of the bus
would require four additional SRAMs and data buffers.

A programmable logic device (PLD) arbitrates between the two bus masters,
granting control of the DMA message system to the 82380 by returning Holda.
If the microprocessor begins a bus cycle that requires the DMA message
system, the PLD forces the 82380 from the bus.

Arbitration occurs as follows: When the 82380 asserts Hold, the arbitration
PLD returns Holda.  The 82380 then takes control of the bus and performs DMA
cycles until the transfer is completed or Holda is deasserted.

If the microprocessor requests the bus, the PLD deasserts Holda and waits
for the system peripheral to relinquish bus control by deasserting Hold. 
This process duration includes the 82380 cycle time and the data bus
recovery time of the device being accessed.

Though Hold becomes active again in the next clock, Holda isn't returned to
the 82380 until the processor is finished with the bus.  When the processor
finishes its I/O cycle, the PLD returns Holda to the 82380.  For this arbitration
scheme to work, the 82380 is programmed in demand mode.
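
The handshake can be summarized as a small state machine.  The model below is a behavioral sketch only: the state names, the one-decision-per-clock timing, and the rule that the processor wins when both masters request an idle bus are assumptions, not details of the actual PLD equations.

```python
class ArbitrationPLD:
    """Toy model of the Hold/Holda handshake; state names are illustrative."""

    IDLE, DMA, RELEASING, CPU = "idle", "dma", "releasing", "cpu"

    def __init__(self):
        self.state = self.IDLE

    @property
    def holda(self):
        # Holda is asserted only while the 82380 owns the I/O bus.
        return self.state == self.DMA

    def clock(self, hold, cpu_request):
        """Advance one clock given Hold (from the 82380) and a CPU request."""
        if self.state == self.IDLE:
            if cpu_request:
                self.state = self.CPU
            elif hold:
                self.state = self.DMA
        elif self.state == self.DMA:
            if cpu_request:
                self.state = self.RELEASING   # drop Holda, wait for Hold to fall
            elif not hold:
                self.state = self.IDLE        # DMA transfer completed
        elif self.state == self.RELEASING:
            if not hold:
                self.state = self.CPU         # 82380 has released the bus
        elif self.state == self.CPU:
            if not cpu_request:
                self.state = self.DMA if hold else self.IDLE
        return self.holda


pld = ArbitrationPLD()
# (Hold, CPU request) on successive clocks
trace = [(1, 0), (1, 1), (0, 1), (1, 1), (1, 0), (0, 0)]
holda_trace = [pld.clock(bool(h), bool(c)) for h, c in trace]
print(holda_trace)
```

In the trace, Hold goes active again on the fourth clock, but Holda stays low until the processor request ends on the fifth.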

Using an SRAM message area as an I/O buffer gives the CPU design several
advantages.  For one thing, physical memory is paged in an i860
microprocessor system, and block transfers are often limited to the page
size.  But disk access times can be better used if the seek time is
amortized over a larger transfer size.  The SRAM does this by supplying a
continuous memory area where block transfers larger than 4 kbytes can be
programmed.
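
The benefit of a larger block is easy to quantify.  In the sketch below, the 20-ms seek time and 1-Mbyte/s media rate are hypothetical numbers chosen only to show the trend; they do not describe any particular disk or this board.

```python
def effective_rate(transfer_bytes, seek_s, media_bytes_per_s):
    """Average rate when every block transfer pays for one seek."""
    return transfer_bytes / (seek_s + transfer_bytes / media_bytes_per_s)


SEEK_S = 0.020            # hypothetical 20-ms seek
MEDIA_RATE = 1_000_000    # hypothetical 1-Mbyte/s media rate

for size in (4 * 1024, 16 * 1024, 64 * 1024):
    print(size, round(effective_rate(size, SEEK_S, MEDIA_RATE)))
```

The effective rate climbs toward the media rate as the block grows, because the fixed seek cost is spread over more bytes.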

SRAM ENHANCES DMA

Also, the SRAM area enhances DMA performance, which depends on the DMA
controller's transfer rate and the effect that the DMA controller has on the
processor's memory bandwidth.  The relative influence of these factors
varies with processor workload.  If the processor waits for DMA data, then
transfer rate is more important.  A multitasking system, however, puts a
task waiting for the DMA to sleep and continues to execute other tasks.  In
a multitasking environment, therefore, the DMA's effect on CPU bandwidth is
more critical.

I/O system devices are slow relative to the processor's local bus. 
Consequently, the board design separates the DMA controller from the
processor bus by address and data buffers, allowing DMA memory accesses to
occur independently of processor memory accesses.  As a result, the
processor retains full memory bandwidth during DMA transfers to the SRAM
area and may continue to execute tasks.  Transfers into the SRAM area,
however, require the processor to perform a main-memory data transfer once
the DMA transfer completes.

Third, transfers between the SRAM and DRAM areas execute very quickly.  In
the process, they automatically update the caches.  If desired, a special
load instruction will avoid emptying the current data cache contents.  The
copy can occur at over 40 Mbytes/s at 33 MHz.

Finally, any DMA method must also consider the CPU's write-back caching
protocols.  To prevent DMA accesses to stale data, a cache flush is needed
before a DMA transfer into or out of main memory, or the transfer areas need
to be marked as non-cacheable.  But because the SRAM address area is never
cached, the board requires no additional cache flushing.

Transfers between the I/O and SRAM systems may use a fly-by mode.  SRAM
reads and MPC writes will then occur together, producing the highest
possible DMA transfer rate.  Fly-by mode, however, can only be used with
32-bit DMA devices, such as the MPC.

The Multibus II specification defines an initialization sequence for the
system.  The 8751 microcontroller handles the sequence using the
interconnect space (Fig. 2, again).  The microcontroller brings the
processor out of reset to start board-level operations.

RESET OUT OF EPROM

Because the i860 CPU has a code size 8 (CS8) reset mode, it can execute out
of one 8-bit wide EPROM.  As a result, the processor can reset out of the
EPROM, initialize the local system, copy a program from EPROM to RAM, and
begin execution.  But the Reset address is the same as the interrupt
address, so once system execution begins, the EPROM must be mapped out of
the interrupt address space and replaced by the 64-bit-wide DRAM containing
the interrupt service routine.

To enter CS8 mode, the system's reset PLD forces the INT/CS8 pin high.  Once
the reset routine is complete and the system is ready to operate out of
DRAM, the processor disables CS8 mode by clearing a bit in the processor
status register.  The processor sends a write to the command-port PLD,
setting a control signal that the decode PLD uses to remap the EPROM.  The
processor can then enable interrupts, because they will trap to the service
routine in the DRAM.  Once the board exits CS8 mode, it can only be
re-entered by resetting the processor.
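
The ordering constraints in this sequence can be captured in a short model.  The class below is a sketch: its attribute and method names are invented for illustration, and the checks simply enforce the order in which the steps are described above rather than any documented hardware interlock.

```python
class BootModel:
    """Toy model of the reset sequence; all names are illustrative."""

    def __init__(self):
        self.cs8 = True               # reset PLD forces CS8 mode at reset
        self.eprom_at_vector = True   # EPROM sits at the reset/interrupt address
        self.program_in_dram = False
        self.interrupts = False

    def copy_program_to_dram(self):
        self.program_in_dram = True

    def leave_cs8(self):
        if not self.program_in_dram:
            raise RuntimeError("copy the program to DRAM before leaving CS8")
        self.cs8 = False              # cannot be set again without a reset

    def remap_eprom(self):
        if self.cs8:
            raise RuntimeError("still executing from the 8-bit EPROM")
        self.eprom_at_vector = False  # DRAM now answers at the interrupt address

    def enable_interrupts(self):
        if self.eprom_at_vector:
            raise RuntimeError("interrupts would trap into the EPROM")
        self.interrupts = True


board = BootModel()
board.copy_program_to_dram()
board.leave_cs8()
board.remap_eprom()
board.enable_interrupts()
print(board.interrupts)
```

Calling the steps out of order raises an error, which mirrors the reason interrupts stay disabled until the EPROM has been mapped out.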

Neal Margulis, Intel's chief applications engineer for high-performance
processors, is currently writing a book on programming the i860.  Margulis
has a BSEE from the University of Vermont in Burlington.

----------------------------------------------------------------------------
