
A Parallel Rendering Algorithm for MIMD Architectures


NASA Contractor Report 187571
ICASE Report No. 91-3

ICASE

A PARALLEL RENDERING ALGORITHM FOR MIMD ARCHITECTURES

Thomas W. Crockett
Tobias Orloff

Contract No. NAS1-18605
June 1991

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, Virginia 23665-5225

Operated by the Universities Space Research Association

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23665-5225

A Parallel Rendering Algorithm for MIMD Architectures

Thomas W. Crockett
Institute for Computer Applications in Science and Engineering
NASA Langley Research Center

Tobias Orloff
Great Northwestern Graphics

Abstract

Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. This paper describes a rendering algorithm targeted to distributed memory MIMD architectures. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance is found to be limited primarily by communication overheads. An experimental implementation on the Intel iPSC/860 shows performance increasing from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.

This work was supported by the National Aeronautics and Space Administration under NASA Contract No. NAS1-18605 while the authors were in residence at ICASE.

Authors' addresses: Thomas W. Crockett, ICASE, M.S. 132C, NASA Langley Research Center, Hampton, VA 23665; Tobias Orloff, Great Northwestern Graphics, 119 North Fourth Street, Suite 206, Minneapolis, MN 55401. Electronic mail: tom@icase.edu, orloff@poincare.geom.umn.edu.

1 Introduction

Applications such as real-time animation and scientific visualization demand high performance rendering of complex three-dimensional scenes. Although the results achieved with current rendering hardware have been impressive, major improvements in performance will require the use of highly parallel rendering algorithms. This paper describes one such algorithm. The algorithm is designed for scalable MIMD architectures and uses distributed memory and message passing. In shared memory systems, straightforward modifications will adapt it.

In the following section, we introduce the traditional rendering pipeline and consider the issues involved in parallelizing it. Next, we describe our algorithm and give a theoretical analysis of its performance. We then present experimental results for an implementation of the algorithm on the Intel iPSC/860¹ hypercube, and compare the performance with analytical predictions. Finally, we examine how the algorithm can be adapted for shared memory MIMD architectures.

2 The Rendering Problem

We assume that we are given a scene consisting of objects described as collections of 3D triangles, some light sources, and a viewpoint. The goal is to produce a 2D representation of the scene (Fig. 1), taking into account perspective distortion and lighting properties such as specularity and texture. For simplicity we assume that the objects are triangles only, that the light sources are point lights, and that the triangles possess only a diffuse coloring as the material attribute. The addition of other lights and material properties really does not affect the main structure of the algorithm.

There is now a fairly well established rendering pipeline for fast rendering of three dimensional scenes, described for example in [10]. The standard pipeline may be represented as shown in Figure 2. The exact sequence of the pipeline is not fixed: shading may be done after transforming (indeed, an essential portion of the Phong shading process may be delayed until the rasterizing step [3]), or clipping may be done in the rasterization step.

One way to parallelize the rendering process is to map the various stages of the pipeline directly into hardware [2]. This approach has been very successful, and has been adopted by a number of graphics hardware vendors. But the performance attainable by directly exploiting the pipeline is limited by the number of stages in the pipe. To achieve a greater degree of parallelism, other strategies must be examined. As is well known [11], most of the computation time in the rendering process is accounted for by three main steps:

1. The floating point operations performed on objects, such as transforming, lighting, and clipping.

2. The rasterization of primitives transformed into screen coordinates.

3. Writing the pixels to the frame buffer.

(We ignore here the problem of traversing the database prior to rendering.) The rendering time will be limited by the slowest of these three steps. Moreover, the three steps are serial and pipelined in current hardware implementations [11], with each step operating at its limit. Thus to obtain significant improvements in performance, it is necessary to map the rendering onto a hardware architecture in which each of these steps can be parallelized, preferably by replicating one basic type of processing element.

We refer to parallel computations in step 1 as object parallelism, and in steps 2 and 3 as image or pixel parallelism. A system with a high degree of object parallelism is described by Torberg in [12]. A system with a high degree of pixel parallelism, the classic Pixel-Planes system of Fuchs and Poulton, is described in [6]. Finally, a system incorporating both object and pixel parallelism is described by Fuchs et al. in [7]. In all these cases, the algorithms for 3D rendering are mapped onto specific hardware, more or less constructed for that purpose. In our case, we map the rendering algorithm onto more general purpose parallel architectures. This allows us to experiment with the algorithm at a high level and with a high degree of flexibility. Once the critical performance tradeoffs and parameters are thoroughly understood, the algorithm can be designed into special-purpose hardware to achieve maximum performance.

As we will show, the algorithm described in this paper achieves both object and pixel parallelism, and will run on systems containing from 1 to p processors, where p is bounded by the number of scanlines. For an excellent discussion of the various approaches to object and pixel parallelization, see [11].

Besides exploiting both types of parallelism, a good algorithm must ensure that all large data structures are distributed among the processors without wasteful duplication. In our case there are two such structures: the list of triangles and the frame buffer. We distribute these evenly among the processors, allowing the algorithm to scale to more complex scenes and higher frame resolutions. Note that distributing the triangles corresponds to object parallelism, while distributing the frame buffer corresponds to pixel parallelism.

¹iPSC, iPSC/2, iPSC/860, and i860 are trademarks of Intel Corporation.

3 Algorithm Description
To describe the algorithm we first specify how the data structures are divided among the processors:

* The triangles are distributed in round-robin fashion to all processors.

* The frame buffer is divided evenly by horizontal stripes (Fig. 3).

* Small data structures, such as the lights and viewing parameters, are replicated on each processor.

The distribution of the frame buffer can be modified considerably without affecting the basic structure of the algorithm. Essentially all that is needed is a regular geometric division from which the appropriate clipping strategy follows. We have implemented a division of a 1024 x 1024 frame buffer into horizontal stripes for rendering on 2 to 128 processors. The effect of different splittings of the frame buffer on performance is an interesting topic for further research.

With the above distribution of data, the following steps are performed by each processor on its local triangles. Before rasterizing a triangle, it is first transformed into screen coordinates, then split (if necessary) into trapezoids along the local frame buffer boundaries (Fig. 4). Each trapezoid is then sent to the processor which owns the segment of the frame buffer in which it lies. Upon receiving a trapezoid, a given processor rasterizes it into its local frame buffer using a standard z-buffer algorithm [4] to eliminate hidden surfaces. For simplicity, triangles which lie fully within a single frame buffer segment are treated as degenerate trapezoids, and triangular pieces of split triangles are treated as trapezoids in which two vertices happen to be the same point.
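As an illustration of the frame buffer decomposition, the short C sketch below is ours, not part of the report's code; the resolution, processor count, and function names are assumptions. It maps a screen scanline to the processor that owns it and lists the stripes overlapped by a triangle's projected y-extent, which is exactly the set of processors that must receive pieces of that triangle.

    #include <stdio.h>

    #define Y_RES 1024          /* frame buffer height in scanlines (assumed)         */
    #define NPROC 16            /* number of processors (assumed); Y_RES % NPROC == 0 */

    /* Each processor owns a contiguous horizontal stripe of Y_RES/NPROC scanlines. */
    static int owner_of_scanline(int y)
    {
        return y / (Y_RES / NPROC);
    }

    /* Print the processors whose stripes a triangle's projected y-extent overlaps;
       each of these will receive one trapezoid piece of the triangle. */
    static void stripes_covered(int ymin, int ymax)
    {
        int first = owner_of_scanline(ymin);
        int last  = owner_of_scanline(ymax);
        for (int p = first; p <= last; p++)
            printf("stripe/processor %d\n", p);
    }

    int main(void)
    {
        /* a triangle spanning scanlines 120..200 crosses stripes 1..3 when
           each stripe is 64 scanlines tall */
        stripes_covered(120, 200);
        return 0;
    }

With this decomposition, the number of stripe boundaries crossed, and hence the number of trapezoid pieces, grows with the triangle's height and with the number of processors, which is the effect quantified later in the performance analysis.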

Figure 1: A scene is described by a collection of objects and light sources, with associated viewing parameters.

Figure 2: A typical rendering pipeline (Scene Description; Transform, Cull, and Clip; Shade; Rasterize; Store and Display; Final Image). The dotted line divides object-level and pixel-level operations.

Figure 3: The frame buffer is distributed across processors by horizontal stripes.

Figure 4: In general, triangles must be split at frame buffer boundaries and the pieces sent to the correct processors for rasterization.

To reduce communication overhead, trapezoids destined for the same processor are buffered into larger messages before sending. The choice of buffer size, which can significantly affect performance, is discussed more fully later.

The algorithm may be summarized as follows. Each processor performs the loop:

    Until done {
        If local triangles remain {
            Select a local triangle
            Shade the triangle
            Transform, back face cull, and clip
            Split into trapezoids
            Put the trapezoids into outgoing buffers
            When a buffer fills up, send its contents
            If this is the last local triangle
                Send all non-empty buffers
        }
        If incoming messages exist
            For each incoming message
                Rasterize all of the trapezoids in the message
    }
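The per-destination buffering used inside this loop can be pictured with a small C sketch. This is a minimal illustration, not the authors' code: the TRAP record layout, the buffer depth, the processor count, and the deliver() routine (a stand-in for the machine's actual send primitive) are all assumptions made for the example.

    #include <stdio.h>
    #include <string.h>

    #define D 8            /* buffer depth d (trapezoids per message), an assumed value */
    #define P 4            /* number of processors, assumed for this sketch             */

    typedef struct {       /* hypothetical trapezoid record; the real one is 64 bytes   */
        float xl, xr, dxl, dxr;   /* left/right edges and their slopes                  */
        int   ytop, ybot;         /* scanline extent                                    */
        float z, dzdx, dzdy;      /* depth interpolation                                */
        float shade[3];           /* interpolated color                                 */
    } Trap;

    static Trap buf[P][D];        /* one outgoing buffer per destination processor      */
    static int  fill[P];          /* current number of trapezoids in each buffer        */

    /* Stand-in for the machine's asynchronous send primitive. */
    static void deliver(int dest, const Trap *t, int count)
    {
        printf("send %d trapezoid(s) to processor %d\n", count, dest);
    }

    /* Queue one trapezoid for its destination, flushing the buffer when it fills. */
    static void buffer_trap(int dest, const Trap *t)
    {
        buf[dest][fill[dest]++] = *t;
        if (fill[dest] == D) {
            deliver(dest, buf[dest], D);
            fill[dest] = 0;
        }
    }

    /* Called after the last local triangle: send every non-empty buffer. */
    static void flush_all(void)
    {
        for (int dest = 0; dest < P; dest++)
            if (fill[dest] > 0) {
                deliver(dest, buf[dest], fill[dest]);
                fill[dest] = 0;
            }
    }

    int main(void)
    {
        Trap t;
        memset(&t, 0, sizeof t);
        for (int i = 0; i < 20; i++)    /* pretend 20 trapezoids land on processor 1 */
            buffer_trap(1, &t);
        flush_all();
        return 0;
    }

The flush-on-full behavior is what trades message latency against network congestion: a deeper buffer means fewer, larger messages, at the cost of holding trapezoids longer and injecting them into the network in bursts.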

To avoid having to store large numbers of trapezoids in memory, the algorithm alternates between splitting triangles into trapezoids and disposing of incoming trapezoids by rasterizing them into the frame buffer. It is not obvious what the proper balance is between these two activities. If a processor concentrates on rasterizing incoming trapezoids, it may starve other processors of work by not generating enough trapezoids for them. Alternatively, if incoming messages are not disposed of quickly enough, incoming message queues will fill up and outgoing buffers will not be flushed, so work for other processors will be delayed. Experiments have shown that there is a slight advantage in processing a few triangles before checking for incoming data, but beyond that the algorithm is relatively insensitive to this choice.

A significant feature of this algorithm is the absence of a synchronization point in the loop. Processors start off with nearly the same number of triangles, but several factors tend to unbalance the workload. First, the culling and clipping operations cause a different number of triangles to be thrown away, or to be subdivided into smaller triangles, on different processors. Next, the time required for splitting triangles into trapezoids varies with the orientation of the triangle and the number of frame buffer boundaries which are intersected. The number of trapezoids in turn affects the buffering and communication time. Similarly, varying sizes of incoming trapezoids, along with differences in the results of the z-buffer comparisons, will cause variations in the rasterization times.

These considerations suggest that any synchronization points in the loop will introduce significant amounts of idle time, since each iteration of the loop would be bound by the slowest processor. Instead, our strategy is to let individual processors proceed as asynchronously as possible. Of course, some coordination is necessary to ensure that buffers are flushed and messages are passed correctly from one processor to the next. The use of an asynchronous message-passing protocol, combined with dual send and receive buffers, has proven effective in minimizing idle time spent waiting for messages. However, the lack of a synchronization point leads to difficulties in deciding when to exit the loop.
Even after a given processor completes work on its local triangles, it has no way of determining by itself when it has received the last incoming message from another processor. We use the following algorithm to detect termination:

1. Each processor maintains a list of all other processors to which it sends trapezoids. We refer to these as neighbors of the sending processor. Note that a processor may be a neighbor of itself.

2. After the last local triangle is processed, a processor sends a Last Trapezoid (LT) message to each of its neighbors, indicating that there will be no more work forthcoming from that particular source.
3. The message passing protocols must preserve message order, so that LT messages do not precede the trapezoids to which they refer. When a processor receives an LT message, it replies with a Last Trapezoid Complete (LTC) message. Receipt of an LTC message from a neighbor indicates that the neighbor has finished rasterizing all of the trapezoids sent to it. The processor records this fact.

4. When a processor has received an LTC message from every neighbor, it knows that all of its neighbors have finished all of the work which it gave to them. The processor then produces a Neighbors Complete (NC) message which it sends to a specific processor, which we arbitrarily choose to be processor 0.

5. When processor 0 receives an NC message from every processor (including itself), it knows that each processor has finished all of the work sent to it by all of its neighbors, and that the local frame buffer segments now contain the final image. Processor 0 then broadcasts a Global Completion (GC) message to all processors. Receipt of a GC message notifies a processor that the rendering is complete and that it should drop out of the loop.

From the time a processor generates LT messages for its neighbors until it receives a GC message, it must continue to check for incoming messages and rasterize the trapezoids they contain. Note that for a given number of processors p, the NC messages can be accumulated in time O(log p), rather than the O(p) of the process described above, by using a parallel merge. Similarly, the GC broadcast can be done in time O(log p), or even O(1) if the architecture directly supports a broadcast.

4 Performance Analysis

To analyze the performance of the algorithm we will break it down into the following steps:

* Shading, transforming, culling, and clipping.
* Splitting into trapezoids.
* Sending trapezoids.
* Rasterizing trapezoids.
* Storing pixel data.
* Wait time.
* Termination.

For each of these steps, we will break down the running time into a linear part and a nonlinear part. The linear component is that part which parallelizes perfectly, and therefore speeds up linearly as the number of processors increases. The nonlinear component contains the overheads which do not decrease linearly with increasing numbers of processors, and which thus detract from perfect speedup. Before proceeding, we introduce the following notation:

    p = number of processors
    n = number of triangles comprising the scene
    y = height of the frame buffer (in scanlines)
    h = average height of the triangles on the projection plane (in scanlines)
    d = trapezoid buffer depth (in trapezoids)
    r = number of trapezoids generated per processor
    v = average number of trapezoids generated per neighbor

We assume that y is a multiple of p, that n >> p, and that the triangles are uniformly distributed with respect to our frame buffer. Note that h refers to the height of the triangles after projection onto the projection plane, rather than in world coordinates.

The linear part of the running time is a term of the form

    C n / p        (1)

where the constant C is machine- and scene-dependent, but independent of n, p, and d. This is the contribution of the part of the running time that parallelizes perfectly. The nonlinear part of the running time will be everything else. We will attempt to determine this part as explicitly as possible in terms of the above variables and machine-dependent constants.

4.1 Shading, transforming, culling, and clipping

Since the triangles have been distributed evenly to the processors, and these operations may be performed independently on each triangle, this part of the algorithm contributes only a linear term to the running time.²

4.2 Splitting into trapezoids

Each triangle must first be split at its middle vertex (see Figure 4). Since this can be performed independently for each triangle, it contributes only a linear term to the running time. As a side effect, this split effectively doubles the number of triangles to 2n while reducing their average height to h/2.

Next, there is a certain setup cost before dividing the triangle into trapezoids. Although this cost would not be incurred in a serial version of this algorithm, in the parallel version it still contributes only a linear term to the total running time. This cost may be regarded as part of the overhead of the parallel algorithm. Although we are not explicitly isolating parallel overhead in our analysis, it does in fact contribute very little to the running time, so that the performance of the parallel algorithm on one processor is virtually identical to that of a serial version.

²Strictly speaking, back face culling and clipping can introduce local variations in workload which will detract from perfect speedup. But since we are assuming a uniform scene for purposes of analysis, we can ignore this effect. Similar variations can be introduced by the rasterization and z-buffer computations; in practice, the impact of these variations is scene-dependent, and they will likewise be ignored.
A nonlinear contribution to the running time results from actually splitting the triangle. In loose terms, the more processors we have the more we must split the triangle, so that adding processors increases the number of trapezoids in the system. To quantify this, one easily computes that a triangle in the projection plane crosses a local frame buffer boundary line on average hp/2y times. Since back face culling will, on average, eliminate half the original triangles, the number of resulting trapezoids per processor is

    r = (n/2p)(hp/2y + 1) = nh/4y + n/2p        (2)

and the time to split n triangles among p processors is simply r tsplit, where tsplit is the time for one split. We can further analyze tsplit by counting the arithmetic operations performed; the actual time will of course depend on the precise assembly code generated and the characteristics of the processor. In our current implementation, one split requires 15 integer adds and 10 integer compares.

4.3 Sending trapezoids

To a first approximation we assume that:

* A single processor sending several messages must do so one at a time.
* Multiple processors can be sending simultaneously.
* A processor does not incur communication overheads for messages to itself.

The second assumption in particular is somewhat questionable: edge contention among competing sends can seriously impair message passing performance, as shown in [1]. This point will be discussed more fully in later sections.

Communication time can be divided into two independent parts, a fixed overhead, or latency, tl, and a transfer cost tt. The latency includes various software overheads and hardware delays, including effects from network contention, and is incurred on a per message basis. The transfer cost is just the inverse of the network bandwidth multiplied by the total number of bytes to be communicated.

A processor will, on average, generate v = r/p trapezoids for each of p destinations, including itself. If tb is the per-byte transfer cost and s is the size of a trapezoid in bytes, then

    tt = (p - 1) v s tb        (3)

Taking buffering into account, the number of messages m generated by each processor is

    m = (p - 1) ceil(v/d)        (4)

and the total time for sending trapezoids is simply

    tsend = m tl + tt = (p - 1)(ceil(v/d) tl + v s tb)        (5)
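To make the communication terms concrete, the following C helper (ours, not part of the report's code) evaluates Equations 3-5. The numerical constants in main() are illustrative placeholders, not measured iPSC/860 values.

    #include <math.h>
    #include <stdio.h>

    /* Communication model of Section 4.3.
       p  : number of processors
       v  : average trapezoids generated per neighbor (v = r/p)
       d  : buffer depth in trapezoids
       s  : trapezoid size in bytes
       tl : per-message latency (seconds)
       tb : per-byte transfer time (seconds)                                  */
    static double t_send(int p, double v, int d, double s, double tl, double tb)
    {
        double m  = (p - 1) * ceil(v / d);      /* messages per processor, Eq. (4) */
        double tt = (p - 1) * v * s * tb;       /* transfer cost, Eq. (3)          */
        return m * tl + tt;                     /* Eq. (5)                         */
    }

    int main(void)
    {
        /* illustrative numbers only: 64 processors, v = 18.5 trapezoids,
           64-byte trapezoids, 430 us latency, 0.36 us/byte transfer time */
        for (int d = 2; d <= 64; d *= 2)
            printf("d = %2d  t_send = %.4f s\n", d,
                   t_send(64, 18.5, d, 64.0, 430e-6, 0.36e-6));
        return 0;
    }

Scanning d this way shows the latency term (p-1)ceil(v/d)tl shrinking as the buffer deepens, which is half of the tradeoff examined later; the other half, the wait time, grows with d.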

4.4 Rasterizing trapezoids

Since each pixel of each trapezoid is rasterized exactly once, and this work is split equally among the processors, it would appear that this part of the algorithm is linear. However, by splitting the triangles into trapezoids we incur an overhead for each trapezoid prior to rasterization. The rasterization essentially consists of several applications of the Bresenham linear interpolation algorithm [5], once in the vertical direction and once per scanline in the horizontal direction. The overhead is incurred in the vertical application of the Bresenham algorithm, which must be performed for every trapezoid. Therefore the nonlinear contribution to the running time is r tB, where tB is the startup cost for the Bresenham algorithm. In terms of integer arithmetic operations, tB is 5 divides, 10 multiplies, and 20 adds.

4.5 Storing pixels

The z-buffer compare and conditional store operations are perfectly distributed among the processors, so this computation contributes only a linear term to the running time.

4.6 Wait time

The rendering algorithm as viewed by a single processor consists of two distinct phases. During the first phase, the processor alternates between processing triangles and rasterizing trapezoids; if no more trapezoids have arrived during an iteration, it can keep busy by processing triangles. During the second phase, all local triangles have been processed, and the processor polls for incoming trapezoids until a Global Completion (GC) message arrives. During this phase the processor will be idle unless trapezoids arrive at least as fast as they can be rasterized, and it cannot terminate until the slowest of the processors which send to it has finished.

A precise treatment of this situation requires an argument based on queueing theory, which is beyond the scope of this paper. Instead, we present an argument which more or less approximates the observed performance of the algorithm. Our analysis assumes that communication capacity is the performance bound; for purposes of exposition we will use a hypercube communication network, but a similar argument can be applied to other networks.

Since one of the goals of our algorithm is to conserve memory, we will assume that d <= v/2. When a processor completes its last triangle, it must flush its partially filled outgoing buffers, which will contain on average d/2 trapezoids each, all of which must be sent to other processors. If we assume a uniform scene, the processors will reach this state at about the same time, and therefore the entire system contains about p(p - 1) messages of average length d/2 to be injected into the network more or less simultaneously. Because of edge contention, these messages compete for the available communication capacity. For a hypercube of size p = 2^k processors, the average distance a message must travel, and therefore the number of edges it ties up, is k/2 (assuming unidirectional communication). Since the number of edges in such a hypercube is pk/2, the deficit of communication capacity is on the order of

    p(p - 1)(d/2)(k/2) / (pk/2) = (p - 1)d/2 ~ pd/2        (6)

Thus the wait time, twait, is roughly proportional to the shortage of communication capacity:

    twait = α p d / 2        (7)

In a subsequent section, we will determine α empirically for a particular implementation.

4.7 Termination

The termination algorithm requires each pair of neighbors to exchange messages, followed by a global merge step and a broadcast. The time required for these last two operations depends on the architecture of the interconnection network. If we assume a hypercube, then

    tquit = 2[(p - 1) + log2 p] tl        (8)

There is no per-byte cost, since these messages are used only as signals and contain no data.³

4.8 Total time

Combining all of the above contributions, we find the total running time t for the algorithm:

    t = C n/p + r tsplit + tsend + r tB + twait + tquit        (9)

Substituting and rearranging, we get

    t = C n/p + r(tsplit + tB) + (p - 1)(ceil(v/d) tl + v s tb) + α p d/2 + 2[(p - 1) + log2 p] tl        (10)
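For readers who want to explore the model numerically, here is a small C sketch (ours) that evaluates Equation 10 term by term under the uniform-scene estimate of Equation 2. Every constant in main() is a placeholder to be replaced by measured values such as those reported in Section 6.

    #include <math.h>
    #include <stdio.h>

    /* Total running time of Equation 10.  All times are in seconds.
       C      : machine- and scene-dependent per-triangle cost
       tsplit : per-trapezoid splitting cost     tB    : per-trapezoid Bresenham setup
       tl     : per-message latency              tb    : per-byte transfer time
       s      : trapezoid size in bytes          alpha : wait-time constant            */
    static double t_total(double n, double p, double d, double h, double y,
                          double C, double tsplit, double tB,
                          double tl, double tb, double s, double alpha)
    {
        double r = (n / (2.0 * p)) * (h * p / (2.0 * y) + 1.0);  /* Eq. (2), uniform scene */
        double v = r / p;
        return C * n / p
             + r * (tsplit + tB)
             + (p - 1.0) * (ceil(v / d) * tl + v * s * tb)       /* Eq. (5) */
             + alpha * p * d / 2.0                               /* Eq. (7) */
             + 2.0 * ((p - 1.0) + log2(p)) * tl;                 /* Eq. (8) */
    }

    int main(void)
    {
        /* illustrative constants only */
        double t = t_total(100000, 64, 16, 8.3, 512,
                           1.0e-4, 2.5e-6, 13.375e-6, 431e-6, 0.36e-6, 64, 143e-6);
        printf("predicted t = %.3f s\n", t);
        return 0;
    }

A function like this can be scanned over d or p to reproduce the qualitative shapes of the curves discussed in Section 6.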

In Section 6 we present the experimental results for a particular implementation of this algorithm, and compare those results with predictions from the analytical model. First, we describe some pertinent details of our implementation.

5 iPSC/860 Implementation

We have implemented the above rendering algorithm in the C language on the Intel iPSC/2 and iPSC/860 hypercube computers. All of the experiments described below were performed on the latter system. The actual implementation differs slightly from the algorithm described above, in that shading calculations are pulled out of the main loop and done as a preprocessing step. (This is advantageous if a shaded scene will be displayed repeatedly using different viewing parameters.) Consequently, the rendering rates quoted below do not include the time for shading. Remember that the shading step parallelizes perfectly, and so would only improve the observed processor utilization. We should also note that the iPSC systems do not currently provide a graphical display device, so rendering rates do not reflect the final display step of the pipeline in Figure 2.⁴

Our sample implementation incorporates a standard scanline-based, z-buffered triangle renderer. The shading calculations take into account diffuse and ambient lighting components at the triangle vertices, and the rasterization process smoothly interpolates these values across the triangle. We use 8 bits for each of the red, green, and blue color channels, 24 bits for the z-buffer, and pixel positions are maintained to a subpixel accuracy of one part in 64. The current implementation makes little attempt to optimize the graphics code for the i860 processor chip employed in the iPSC/860. The code is written entirely in a scalar (as opposed to vector) style, and no use has been made of the built-in graphics features of the i860.

³Actually, the iPSC message passing routines must at least convey the message type, but we assume that all messages carry type information, the cost of which we include as part of the message latency, tl.

⁴Our current practice is to merge the finished contents of the local frame buffer segments into a file for later viewing offline. Our emphasis here is on the behavior of the parallel rendering engine.

Figure 5: Buffer busy ratio vs. buffer depth.

In addition, the compilers available to us exploited few of the high performance capabilities of the i860, such as pipelining and dual instruction mode. With better compilers and some tuning of the graphics code, rendering performance should increase substantially, leaving communication and I/O as the limiting factors on rendering speed.

In contrast to the graphics computations, we have gone to some lengths to optimize message passing. The iPSC operating system provides asynchronous routines for both message sending and receiving, which can be used to overlap message transfers with other computations. We have taken advantage of these, in conjunction with a double-buffering scheme for the outgoing message buffers, to hide most of the transfer time as well as much of the wait time due to edge contention delays. One measure of the overlap is the number of times that a buffer is still busy with a previous send when inserting trapezoids into it, expressed as a ratio of the total number of trapezoids generated. Figure 5 shows this buffer busy ratio for our standard test scene (described in the next section), with buffer depths ranging from 2 to v/2 and varying numbers of processors.⁵ Each data point is the mean value across five runs. It can be seen that the overlap strategy is very successful: in all cases, for d >= 3, more than 99.5% of the trapezoids are able to be placed in buffers immediately.

⁵On the iPSC/860, different message passing protocols are employed for short messages (<= 100 bytes) vs. long messages (> 100 bytes). Since our trapezoid data structure is 64 bytes long, buffers of depth 1 have different performance parameters than larger ones. For simplicity, we limit our analysis to buffer sizes of depth >= 2.

6 Performance Results

In this section we present experimental results from the iPSC/860 implementation of our algorithm and compare them to predictions based on our performance model. Our standard test scene is composed of 100000 10 x 10 pixel triangles in random orientations (Fig. 6). This scene was chosen since it statistically approximates a uniform scene for purposes of comparison with the performance model. In all cases, we enable back face culling so that the number of triangles actually drawn is about half of the total. The scene is rendered with a frame buffer resolution of 512 x 512. The average triangle height on the projection plane as measured by the renderer is h = 8.3. To determine the effects of scene complexity, we modify the standard scene by varying the number of triangles while holding the triangle size, in pixels, constant. Unless otherwise noted, performance figures are mean values across five runs.

6.1 Sensitivity to buffer depth

As mentioned previously, the selection of buffer depth can have a significant impact on performance. Figure 7 shows scatterplots of rendering time vs. buffer depth for our standard scene, with p ranging from 8 to 128. (Because of memory requirements, a minimum of 8 processors are needed to render the standard scene.) Again, d ranges from 2 to v/2. The sensitivity to d can be readily understood in terms of the performance model. If d is small, then the ratio v/d in Equation 5 is large and the costs due to message latency are high. If d is large, latency is reduced but wait time due to network congestion increases (Eq. 7). For sufficiently large d, our algorithm is equivalent to a simpler two-phase version in which (1) all triangles are first split into trapezoids and the trapezoids are stored in memory, then (2) trapezoids are sent to their destinations and rasterized. It is clear from the performance model that this simpler algorithm not only wastes memory, but also maximizes edge contention by injecting all of the trapezoid data into the network at once. By using smaller buffer sizes and allowing splitting and rasterization to proceed together, our algorithm not only conserves memory, but spreads the communication load over a longer period of time.

For best performance, we would like to be able to predict an optimum buffer size, dopt, without having to resort to a long series of test runs. If we know something about the scene, such as r, or n and h, then the performance model can be used to determine a near-optimal value for d. (Recall that v = r/p.) If we take the derivative of Equation 10 with respect to d (ignoring the ceiling function) and solve for the minimum, we get

    dopt = sqrt( 2(p - 1) v tl / (α p) )        (11)

The case where r is unknown is discussed in Section 6.4 in the context of non-uniform scenes.
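A direct transcription of Equation 11 into C looks like the following. This is our sketch, not the report's code; the values of tl and α in main() are merely illustrative stand-ins for measurements such as those in Table 1.

    #include <math.h>
    #include <stdio.h>

    /* Near-optimal buffer depth from Equation 11.
       p : processors, v : trapezoids per neighbor,
       tl : message latency, alpha : wait-time constant (same time units). */
    static double d_opt(double p, double v, double tl, double alpha)
    {
        return sqrt(2.0 * (p - 1.0) * v * tl / (alpha * p));
    }

    int main(void)
    {
        /* illustrative values roughly in the range reported in Table 1 */
        printf("d_opt = %.1f\n", d_opt(64.0, 18.5, 431e-6, 143e-6));
        return 0;
    }

In a production renderer this calculation would be performed once per frame, after r (and hence v) has been estimated from the scene or from a previous frame.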

case where

now turn 6.2

our attention latency

to values and

for tt and ?r. wait time

Message

Experimental measurements of message latency on the iPSC/860 have typically been done under carefully controlled test conditions in order to get consistent results. Because our algorithm is very dynamic, and because we include contributions due to buffer management, published values for message latency are not directly applicable. In addition, our simplistic analysis of wait time does not yield a value for the proportionality constant e_. These considerations lead us to determine the values of tl and c_ empirically. We recast Equation l0 as a function of d:

C1
t( d) = Co + -_ + C2,1 12 (12)

Figure 6: Standard test scene comprised of 100000 randomly oriented 10 x 10 pixel triangles. About half of the triangles have been eliminated by culling. h = 8.3.

Figure 7: Execution time as a function of buffer depth for the standard test scene, d = 2, ..., v/2.

where

    C0 = C n/p + r(tsplit + tB) + 2[(p - 1) + log2 p] tl0        (13)
    C1 = (p - 1) v tl        (14)
    C2 = α p / 2        (15)

For convenience, we again ignore the ceiling function. Because of the high degree of overlap between computation and data transfer achieved in our implementation, we have also dropped the data transfer contribution (Equation 3) from C0. Finally, we substitute tl0 for tl in the termination term of the algorithm, to reflect the differing communication protocols for short and long messages. We can now do a least-squares fit using the data from our standard test scene to determine, approximately, the values of C0, C1, and C2, and then solve for tl and α. The results are shown in Table 1.

The coefficients suggest that tl and α are not constants, but are instead functions of p, or more specifically of k, where k = log2 p. However, the limited sample size does not allow any firm conclusions to be drawn, and in the absence of a theoretical basis for determining the form of the functions, we have chosen to use the mean values.
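The fit itself is ordinary linear least squares in the basis {1, 1/d, d}. A compact C version (ours, for illustration only) builds and solves the 3x3 normal equations; the (d, t) pairs in main() are synthetic values generated from a known curve purely to exercise the routine, not measured data.

    #include <stdio.h>
    #include <math.h>

    /* Solve the 3x3 system A c = b by Gaussian elimination with partial pivoting. */
    static void solve3(double A[3][3], double b[3], double c[3])
    {
        for (int k = 0; k < 3; k++) {
            int m = k;
            for (int i = k + 1; i < 3; i++)
                if (fabs(A[i][k]) > fabs(A[m][k])) m = i;
            for (int j = 0; j < 3; j++) { double t = A[k][j]; A[k][j] = A[m][j]; A[m][j] = t; }
            { double t = b[k]; b[k] = b[m]; b[m] = t; }
            for (int i = k + 1; i < 3; i++) {
                double f = A[i][k] / A[k][k];
                for (int j = k; j < 3; j++) A[i][j] -= f * A[k][j];
                b[i] -= f * b[k];
            }
        }
        for (int i = 2; i >= 0; i--) {
            c[i] = b[i];
            for (int j = i + 1; j < 3; j++) c[i] -= A[i][j] * c[j];
            c[i] /= A[i][i];
        }
    }

    /* Fit t(d) = C0 + C1/d + C2*d to measured (d, t) pairs (Equation 12). */
    static void fit(const double *d, const double *t, int n, double c[3])
    {
        double A[3][3] = {{0}}, b[3] = {0};
        for (int i = 0; i < n; i++) {
            double phi[3] = {1.0, 1.0 / d[i], d[i]};
            for (int r = 0; r < 3; r++) {
                for (int s = 0; s < 3; s++) A[r][s] += phi[r] * phi[s];
                b[r] += phi[r] * t[i];
            }
        }
        solve3(A, b, c);
    }

    int main(void)
    {
        /* synthetic data from a known curve, used only to check the fit */
        double d[6] = {2, 4, 8, 16, 32, 64}, t[6], c[3];
        for (int i = 0; i < 6; i++) t[i] = 2.0 + 3.0 / d[i] + 0.01 * d[i];
        fit(d, t, 6, c);
        printf("C0 = %.3f  C1 = %.3f  C2 = %.4f\n", c[0], c[1], c[2]);
        return 0;
    }

Given fitted C1 and C2, Equations 14 and 15 are then inverted directly: tl = C1 / ((p - 1) v) and α = 2 C2 / p.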

6.3 Measured vs. predicted performance

To further explore the parallel performance of our algorithm and to validate the analytical model, we varied the complexity of the random triangle scene from 6250 up to 200000 triangles in multiples of 2, using up to 128 processors in powers of 2. For each scene, p ranged from the minimum allowed by memory requirements up to 128, and the buffer depth used was the optimal value predicted by Equation 11 (Table 2).

Table 1: Empirical values of message latency and wait time for the standard test scene (times in µs).

    p       tl     α
    8       452    113
    16      443    120
    32      419    152
    64      411    185
    Mean    431    143

Table 2: Predicted buffer sizes for several random triangle scenes.

Figure 8a shows the observed rendering rates for each scene. The results show that performance continues to increase as processors are added, even for the smallest scene, although large numbers of processors are most effective for more complex scenes.

The usual measure of effectiveness of a parallel algorithm is speedup, defined as the time to execute a problem on a single processor divided by the time to execute it on p processors. In our case, the larger test scenes cannot be run on a single processor due to memory limitations, so traditional speedups cannot be computed directly. Instead, we estimate speedups by comparing rendering rates, normalized to the performance of the smallest scene, which was the only one that would fit on a single processor: 4366 triangles/second for n = 12500.⁶

Table 3 shows the speedups computed in this way. Speedups on large numbers of processors (64 and 128) are poor for the smallest test scenes, primarily due to the communication costs (tsend and twait), which are the dominant overheads. As p decreases, and as scene complexity increases, the relative contribution of the communication terms declines, and the speedups become reasonable on moderate numbers of processors (16 and 32). Figure 9 shows the relative contributions of the individual terms of the performance model for our standard scene. Note that in the plot, tsplit and tB have been combined into a single trapezoid term, ttrap.

⁶We consider our normalized speedup computations to be just estimates for two reasons: (1) As the density of the random triangle scenes increases, a larger proportion of the z-buffer comparisons will fail because pixels are obscured by other triangles which lie closer to the viewer. This results in a lower percentage of frame buffer stores and slightly reduced computational cost per pixel. (2) Because of the suspected effects of caching (described subsequently), execution times on small numbers of processors may not be directly comparable to those on larger numbers of processors. This effect could be mitigated by comparing performance at constant values of n/p, but the compensation is only partial since the size of the frame buffer segments is independent of the number of triangles.

Figure 8: (a) Observed rendering rates for several random triangle scenes. (b) Observed and predicted rendering times for n = 12500, 50000, and 200000. Solid lines are the predicted performance.

In Figure 9, tsend and twait are roughly equal because the buffer size was chosen on the basis of Equation 11. The divergence of tsend and twait at certain values of p illustrates the importance of the ceiling function, which was ignored in order to derive Equation 11.

We note an interesting phenomenon in passing. At certain points in the speedup table (shown in bold type), dramatic increases in performance occur as p increases at a fixed value of n/p. Since these points occur at constant n/p, we conjecture that they are due to caching on the i860 processor. As p increases, the size of several data structures (triangles, message buffers, and the frame buffer segment) decreases, which may result in better cache hit ratios.

In Figure 8b, we compare the observed and predicted rendering times of several test scenes. To predict the performance using the model, we must determine the scene-dependent value of C, which is done by taking observed timing data and solving Equation 10 for C. We have chosen to use the first entries along the boldface diagonal of constant n/p in Table 3. Based on operation counts from Section 4 and the instruction timing information in [8, 9], we estimate tsplit = 2.500 µs and tB = 13.375 µs. Since the termination phase of the algorithm uses synchronous (non-overlapped) message passing and incurs very little overhead beyond the message latency, we use the published latency data from [1] to set tl0 = 75 µs. As Figure 8b shows, the model successfully predicts the general performance trends. Some discrepancies occur at small p, where the suspected caching perturbations appear, and the model slightly underestimates the overheads at large p. This lends credence to our previous observation that α is an increasing function of k, although other terms may be involved as well.

Table 3: Speedup estimates derived from observed rendering rates. Boldface entries indicate unexpectedly large performance increases.

Figure 9: Predicted contributions of individual components of the performance model for our standard test scene. ttrap = r(tsplit + tB).
6.4 Performance on non-uniform scenes

Although our random triangle scene is useful for analyzing the algorithm, it is not a very representative application. To obtain a better feel for performance on more realistic scenes, we ran experiments with two additional test cases, shown in Figures 10-11. The first scene, which we will refer to as Plato, contains a large number of small triangles placed in a small area of the screen, giving a high density of triangles. The second scene, designated LDEF, contains a wide range of triangle sizes, which is very effective at desynchronizing the processors because of differences in rasterization time. Both scenes were rendered at a resolution of 512 x 512.

The first issue we address is the effect of buffer size. Figure 12 shows rendering time as a function of buffer depth for these scenes, with d varying from 2 to 1.25v. In contrast to the uniform scene, the best values of d occur at much larger buffer sizes, and the prediction of Equation 11 is not applicable. Larger buffers work well for both scenes because the processors are out of sync; the buffer flushes are spread out in time, reducing twait, while the larger messages reduce the latency costs. As a rule, a buffer depth of around v seems to offer a good starting point for both cases, with further increases producing only minor additional gains or small decreases in performance.

If r, and hence v, are unknown, then in the absence of an analytical prediction the best we can do is hazard a guess. If a scene will be rendered repeatedly with the viewing parameters changing from frame to frame, as in an animated sequence, an initial guess for d is needed only for the first frame; for subsequent frames, the observed value of r from the previous frame can be used to derive a better buffer size, and the renderer adjusts d automatically.

Figure 13 shows rendering rates for the LDEF and Plato scenes using a buffer depth of v/2. Because of memory limitations, neither of these scenes could be rendered with a single processor at 512 x 512 resolution, so we have no single-processor data on which to base speedup estimates. Examination of the results shows that processor utilization is best for p of 16-32 or less, consistent with the previous scenes. Note that the performance of the Plato scene peaks at 64 processors and then declines as the communication overhead on 64 and 128 nodes becomes dominant. Careful choice of buffer sizes can boost the Plato performance at 64 processors to about 125000 triangles/second.

7 Considerations for Shared Memory Architectures

Although the algorithm described in Section 3 was designed specifically for distributed memory architectures, it can be readily adapted for shared memory machines. We assume that the shared memory system provides an efficient mechanism for locking shared variables. Given this, the basic structure of the algorithm remains the same, with interprocessor communication taking place through shared data structures rather than messages. Instead of partitioning the triangles among the processors in round-robin fashion, they are placed in a shared list, or array, and the processors work on them in a self-scheduled fashion. When a processor needs a triangle to work on, it grabs the next one from the list. A typical way of implementing this is to keep the list index in a shared variable which a processor locks, reads, increments, and unlocks before fetching the triangle and going to work on it. Presumably this can be done in a few instruction times with suitable architectural support. Some overhead may be incurred if another processor has the critical variable locked, but this should not be a problem for moderate numbers of processors, since the time to process a triangle will be much larger than the time required to fetch and update the list index.

Figure 10: The five hyperbolic Platonic solids. n = 59276, h = 3.1.

Figure 11: NASA's Long Duration Exposure Facility. n = 17726, h = 6.2.

Figure 12: Execution time as a function of buffer depth for the (a) Plato and (b) LDEF scenes. d ranges from 2 to 1.25v. The vertical bars indicate d = v/2.

Figure 13: Rendering rates for the LDEF and Plato scenes.

The other major shared data structure is the frame buffer. A naive approach would be to let processors rasterize triangles directly into the frame buffer after transforming them into screen coordinates. But since many processors would be doing this simultaneously, there would be memory conflicts when triangles overlap in the buffer. A poor solution would be to lock the entire frame buffer for the duration of the rasterization step, but that would effectively serialize the rasterization phase of the computation. A better solution is to partition the frame buffer into p segments. Then triangles could be split into trapezoids as in our original algorithm; but instead of sending the trapezoids to other processors, they would be placed on a shared list of trapezoids needing to be rasterized, with one trapezoid list per frame buffer segment. After processing one or more triangles, a processor would grab an unlocked frame buffer segment and process all of the outstanding trapezoids queued for that segment. Because there are as many frame buffer segments as there are processors, at least one will always be unlocked. By not tying frame buffer segments to particular processors, load balancing will be automatic and performance should be better than the distributed memory version of the algorithm. As before, the overhead for maintaining the trapezoid lists and locking and unlocking segments should be small compared to the cost of the rasterization computations. Thus, the shared memory version of the algorithm becomes:
    Until done {
        If triangles remain {
            Select the next triangle
            Shade the triangle
            Transform, back face cull, and clip
            Split into trapezoids
            Insert the trapezoids onto the trapezoid lists
        }
        Find an unlocked frame buffer segment with outstanding trapezoids (if any)
        Rasterize all of the trapezoids in that list
        Continue
    }

Termination of the algorithm is also simpler in the shared memory version. Each processor must certify when it has finished working on its last triangle; this occurs when a processor checks for the next triangle and none remain. After all processors have finished their last triangle, then when all of the trapezoid lists become empty and all of the frame buffer segments are unlocked, rendering is complete.

Modification of the performance model for the shared memory algorithm is straightforward. An additional nonlinear term is needed to model contention for the triangle list index variable. Message passing terms in the distributed memory model are replaced with terms which reflect the time needed to update the trapezoid lists (including contention) and to search for unlocked frame buffer segments with outstanding trapezoids.
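A minimal sketch of the shared memory bookkeeping, using POSIX threads (our choice; the report does not name a particular threading or locking interface), might look like the following. It shows only the shared counters and locks: a locked fetch-and-increment over the triangle list, and one lock plus one trapezoid count per frame buffer segment; transformation, splitting, and rasterization are left as placeholders.

    #include <pthread.h>
    #include <stdio.h>

    #define NSEG 8                 /* frame buffer segments == worker threads (assumed) */

    static int  next_tri  = 0;     /* shared index into the triangle list               */
    static int  n_tri     = 1000;  /* total triangles (placeholder)                     */
    static pthread_mutex_t tri_lock = PTHREAD_MUTEX_INITIALIZER;

    static int  pending[NSEG];     /* outstanding trapezoids queued for each segment    */
    static pthread_mutex_t seg_lock[NSEG];

    /* Self-scheduled triangle selection: lock, read, increment, unlock. */
    static int grab_triangle(void)
    {
        pthread_mutex_lock(&tri_lock);
        int i = (next_tri < n_tri) ? next_tri++ : -1;
        pthread_mutex_unlock(&tri_lock);
        return i;
    }

    /* Queue a trapezoid for one frame buffer segment (splitting not shown). */
    static void queue_trap(int seg)
    {
        pthread_mutex_lock(&seg_lock[seg]);
        pending[seg]++;
        pthread_mutex_unlock(&seg_lock[seg]);
    }

    /* Try to claim an unlocked segment with work and rasterize its queued trapezoids. */
    static void drain_one_segment(void)
    {
        for (int s = 0; s < NSEG; s++) {
            if (pthread_mutex_trylock(&seg_lock[s]) == 0) {
                /* rasterization of pending[s] trapezoids would happen here */
                pending[s] = 0;
                pthread_mutex_unlock(&seg_lock[s]);
                return;
            }
        }
    }

    static void *worker(void *arg)
    {
        int tri;
        while ((tri = grab_triangle()) >= 0) {
            queue_trap(tri % NSEG);   /* stand-in for transform/split: one trapezoid */
            drain_one_segment();
        }
        return arg;
    }

    int main(void)
    {
        pthread_t th[NSEG];
        for (int s = 0; s < NSEG; s++) pthread_mutex_init(&seg_lock[s], NULL);
        for (int s = 0; s < NSEG; s++) pthread_create(&th[s], NULL, worker, NULL);
        for (int s = 0; s < NSEG; s++) pthread_join(th[s], NULL);
        for (int s = 0; s < NSEG; s++) pending[s] = 0;  /* final sweep of any leftovers */
        printf("done\n");
        return 0;
    }

Because segments are claimed with a non-blocking trylock rather than being bound to particular threads, an idle thread can always pick up whichever segment happens to be free, which is the source of the automatic load balancing described above.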

8 Conclusion

In this paper we have described a parallel rendering algorithm for MIMD computer architectures. The algorithm is attractive for its exploitation of both object and pixel level parallelism. We have given a theoretical analysis of its performance on distributed memory, message passing systems, and compared this with an actual implementation on the Intel iPSC/860 hypercube. Our results show that the algorithm is a viable means of achieving a highly parallel renderer. Scalability is limited primarily by communication costs, which increase as a function of the number of processors. Expected improvements in communication speed and optimization of the transformation and rasterization software will allow this algorithm to compete favorably with other high-performance rendering systems.

Acknowledgments

We would like to thank the Institute for Parallel Computation at the University of Virginia, the Mathematical Sciences Section at Oak Ridge National Laboratory, and the NAS Systems Division at NASA Ames Research Center for providing access to their iPSC computers at various stages during the development and testing of our algorithm. We would also like to thank Shahid Bokhari for sharing his expertise during many helpful discussions, and Harry Jordan for providing valuable assistance in refining the performance model. The geometric database for the LDEF scene was provided courtesy of the Flight Software and Graphics Branch at NASA Langley Research Center and Computer Sciences Corporation.

References

[1] Bokhari, Shahid. Communication Overhead on the Intel iPSC-860 Hypercube. ICASE Interim Report 10 (NASA CR 182055), Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Hampton, Va., May 1990.

[2] Clark, J. The geometry engine: a VLSI geometry system for graphics. Computer Graphics, Vol. 16, No. 3, July 1982, 127-133.

[3] Foley, J. D., van Dam, A., Feiner, S. K., and Hughes, J. F. Computer Graphics: Principles and Practice. 2nd ed., Addison-Wesley, Reading, Mass., 1990, 738-739.

[4] Foley et al. op. cit., 668-672.

[5] Foley et al. op. cit., 71-81.

[6] Fuchs, H., and Poulton, J. Pixel-Planes: a VLSI-oriented design for a raster graphics engine. VLSI Design, Vol. 2, No. 3, Q3 1981, 20-28.

[7] Fuchs, H., Poulton, J., Eyles, J., Greer, T., Goldfeather, J., Ellsworth, D., Molnar, S., Turk, G., Tebbs, B., and Israel, L. Pixel-Planes 5: a heterogeneous multiprocessor graphics system using processor-enhanced memories. Computer Graphics, Vol. 23, No. 3, July 1989, 79-88.

[8] i860 64-Bit Microprocessor Data Sheet. Intel Corporation, Santa Clara, CA, 1989.

[9] i860 64-Bit Microprocessor Programmer's Reference Manual. Intel Corporation, Santa Clara, CA, 1990.

[10] Molnar, S., and Fuchs, H. Advanced raster graphics architecture. In Foley et al., op. cit., 855-923.

[11] Molnar, S., and Fuchs, H. op. cit., 866-873.

[12] Torberg, J. G. A parallel processor architecture for graphics arithmetic operations. Computer Graphics, Vol. 21, No. 4, July 1987, 197-204.

Report Documentation Page

1. Report No.: NASA CR-187571; ICASE Report No. 91-3
4. Title and Subtitle: A Parallel Rendering Algorithm for MIMD Architectures
5. Report Date: June 1991
7. Authors: Thomas W. Crockett, Tobias Orloff
8. Performing Organization Report No.: 91-3
9. Performing Organization Name and Address: Institute for Computer Applications in Science and Engineering, Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23665-5225
10. Work Unit No.: 505-90-52-01
11. Contract or Grant No.: NAS1-18605
12. Sponsoring Agency Name and Address: National Aeronautics and Space Administration, Langley Research Center, Hampton, VA 23665-5225
13. Type of Report and Period Covered: Contractor Report
15. Supplementary Notes: Langley Technical Monitor: Michael F. Card. Final Report. To be submitted to Concurrency: Practice and Experience.
16. Abstract: See the Abstract at the beginning of this report.
17. Key Words: 3D graphics, rendering, multiprocessors, parallel algorithms, performance analysis
18. Distribution Statement: Unclassified - Unlimited. Subject Category 61, Computer Programming and Software.
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of pages: 25