nexoncn.com

文档资料共享网 文档搜索专家

文档资料共享网 文档搜索专家

当前位置：首页 >> >> Abstract--

Comparison of the dynamic behavior of di erent multipliers in the 8 8 Inverse Discrete Cosine Transform, using VHDL

Delft University of Technology Department of Electrical Engineering Circuits and Systems Group Mekelweg 4, 2628 CD Delft, The Netherlands tel. +31 (0)15 278 6281 fax. +31 (0)15 278 6190

smirren@cas.et.tudelft.nl

D. van Smirren, R. Nouta, E.A. Hendriks

Abstract | The Inverse Discrete Cosine Transform (idct) is used in almost all video-decoders, like mpeg and hdtv. In de process of video decoding, the idct is a computationally heavy part. One of the performance characteristics is the power consumption, which is related to the dynamic behavior of the hardware. In this paper the signal transitions that occur in the multipliers are analyzed with vhdl, when two's complement arithmetic is changed in signed binary arithmetic. The simulations show that the signed binary multipliers need about three time less signal transitions than two's complement multipliers.

The use of digital coded video signals (images) will increase in the near future. Digital signals are more robust and more exible then the analogue representations. On the other hand digital video needs a lot of data (ca. 200 Mbit/sec). Therefore compression of the images, and the image-sequences is necessary. The current standard for coding digital image-sequences is mpeg, which results in compression factors between 10 and 50. In some cases, it is necessary to compress imagesequences, that were already compressed. For example when high-quality video needs to be transformed to a lower quality. This transformation is called transcoding. To research hardware implementations of transcoders, it's necessary to nd out what the 'hardware' bottlenecks are, and which trade-o s can be made. Both in a normal decoder, as in a normal encoder the Discrete Cosine Transform (dct) and it's inverse (idct) play an important role. These transfor-

I. Introduction

mations are both computational heavy, and therefore interesting for research. In this paper we are interested in optimization possibilities concerning speed, power or area. We are also interested in the use of vhdl as Hardware Description Language. In vhdl (large) digital circuits can be described on di erent levels of abstraction. The lowest level is the synthesizable one, which can be implemented on hardware. Of course vhdl is mainly focussed on the describing part of vlsi circuits, but it may include properties which can be used during analyses. Since power consumption is related to the number of transitions that occur in a digital circuit, we will focus on analysing the number of transitions in the multipliers. The multipliers are computationally the heaviest part in the dct and idct. Special attention will be paid to the simulation environment wich uses Matlab tools and special vhdl constructions.

II. Algorithm

In this section the dct and idct are introduced, and an algorithm based on row- column decomposition is shown A. The (Inverse) Discrete Cosine Transform The Discrete Cosine Transform and its inverse, are mappings of spatial blocks of pixel data to blocks of data in the transformed cosine domain. Given an input block x(m; n) with f0 m; n N g in the spatial domain, the 2-D-dct can be written as:

491

492

Proceedings of the ProRISC Workshop on Circuits, Systems and Signal Processing 1997

Where (0) = 1=2 and (i) = 1 for i 6= 0. From this point we will presume that N = 8, which is de defacto value of N for use of the dct in imageprocessing.

n=0 m=0 (1) (2m + 1) k cos (2n + 1) l r cos 2N 2N where (0) = 1 and (k) = 1 for k 6= 0: (6) 2 The inverse transform, from the cosine domain back to the spatial domain, is described as: Using this decomposition technique, decreases the number of nescecary multiplications from N 4 with diXX 2 N ?1 N ?1 (k) (l)Z (k; l) rect computation to 2N 3 . Therefore, this technique is x(m; n) = N (2) used in most implementations. k=0 l=0 The matrix A can now be written as + + cos (2m 2N1) k cos (2n 2N1) l

XX 2 (k) (l) N ?1 N ?1 x(m; n) Z (k; l) = N

+ 2 a(k; n) = N (k) cos (2n 2N1) k for n = 0; : : : ; N ? 1 and k = 0; : : : ; N ? 1

r

p

B. Row Column Decomposition Calculating the (Inverse) Discrete Cosine Transform by means of row column decomposition, is the most commonly used technique. Formula 2 can be rewritten as follows,

2 a a a a a a a a3 6 b d e g ?g ?e ?d ?b 7 6 c f ?f ?c ?c ?f f c 7 6 7 6 6 d ?g ?b ?e e b g ?d 7 7 A = 6 a ?a ?a a a ?a ?a a 7 6 7 6 6 e ?b g d ?d ?g b ?e 7 7 6 4 f ?c c ?f ?f c ?c f 7 5 g ?e d ?b b ?d e ?g 2 6a 6b 6 6 6c 6 6 6d 6 6 6e 6 6 6 6f 4

g

(7)

x(m; n) = N N

r N ?1 2 X

l=0

where

k (r N ?1 =0 X 2

(k)

Now, let r N ?1 4 0 (m; k) = 2 X (l)Z (k; l) cos (2n + 1)l : (4) x N 2N We can regard x0 (m; k) as intermediate data, and rewrite x(m;r) as n X 2 N ?1 (k)x0 (m; k) cos (2m + 1)k : x(m; n) = N 2N (5) k=0 The 2-D (i)dct is now separated in two 1-D transformations. This separated transformation can also be expressed in matrix notation: Z = AXAT X = AT ZA Where AAT = IN by virtue of the orthogonality of A, which is an N N matrix whose basis vectors are sampled cosines, de ned as

l=0

+ (l)Z (k; l) cos (2n 2N1)l + cos (2m 2N1)k : (3)

)

2 3 6 cos 4 7 6 cos 7 6 16 7 7 6 7 r 6 cos 6 8 7 7 6 7 = 2 6 cos 3 7 6 7 N 6 16 6 cos 5 7 6 16 7 6 7 6 3 7 6 cos 8 7 4 5

cos 7 16

3 7 7 7 7 7 7 7 7: 7 7 7 7 7 7 7 5

(8)

Even rows of A are even-symmetric and odd rows are odd-symmetric. This symmetry can be exploited, for both the dct as the idct, by separating the even and odd rows. For the idct we get

2 Y (0) 3 2 a c a f 3 2 X (0) 3 6 Y (1) 7 6 a f ?a ?c 7 6 X (2) 7 6 7 =6 4 Y (2) 5 4 a ?f ?a c 7 6 X (4) 7 54 5 Y (3) a ?c a ?f X (6) 2 b d e g 3 2 X (1) 3 6 7 b e7 6 + 6 d ?g ?g ?d 7 6 X (3) 7 4 e ?b 5 4 X (5) 5 g ?e d ?b X (7)

(9)

Comparison of the dynamic behavior of di erent multipliers

2 Y (7) 3 2 a c a f 3 2 X (0) 3 6 Y (6) 7 6 a f ?a ?c 7 6 X (2) 7 6 7 =6 4 Y (5) 5 4 a ?f ?a c 7 6 X (4) 7 54 5 Y (4) a ?c a ?f X (6) 2 b d e g 3 2 X (1) 3 6 7 b e7 6 ? 6 d ?g ?g ?d 7 6 X (3) 7 : 4 e ?b 5 4 X (5) 5 g ?e d ?b X (7)

III. Hardware Architecture

493

ACF

4

Multiplier and Accumulation Unit

(10)

8x

8

BDEG

4

Multiplier and Accumulation Unit

8x

Fig. 2. 1d-idct Unit

In this section an architecture to compute the (Inverse) Discrete Cosine Transform is described. This architecture is based on 2]. The architecture is described in vhdl. Vhdl extensions are used to make it possible to view the zero-to-one changes (and reverse). In this way relation between the numerical system and the dynamic behavior of the multipliers can be studied. In this study two's complement representation and signed binary representation are evaluated. A. Basic hardware structure The basic structure to calculate the dct and idct, is illustrated in gure 1.

coe cients, wich are used in the acf module, and the odd coe cients wich are used in the bdeg module. The output data from both units is added to compute Y (0); : : : ; Y (3), and subtracted to get Y (4); : : : ; Y (7). The hardware structure of the acf-multiply and accumulation unit is shown in gure 3. The bdeg-unit has the same structure. An input xe to the ACF multiplier bank, is broadcasted to all three multipliers. The products axe , cxe , and fxe are computed in parallel. A bank of four accumulators then selects one of these products to perform the various linear combinations required for the matrix-vector multiply over four clock cycles. In table I is shown wich accumulations should be performed when the input sequence of the input data is X (4), X (2), X (6), X (0), for the acf-unit and X (3), X (5), X (1), X (7) for the bdeg-unit.

Input Selections for Accumulators for the idct

X

MUX 2:1

1D DCT/IDCT UNIT

DMUX 1:2

Transpose Memory Y

TABLE I

Z

Fig. 1. Multiplexed dct/idct unit

In this gure the multiplexed architecture is shown, where one 1d-(i)dct unit performs all the computations, and the other blocks take care of data (re)arrangement. The structure presented in 2] is used to design a hardware architecture. In this paper we will concentrate on the computational part only. Further we will focus on the Inverse Discrete Cosine Transform. Because both the dct and the idct have the same structure it is useful to concentrate on one of them. We choose the Inverse Transformation. In gure 2 the hardware structure of the 1d-idct can be seen. Based on formulas 9 and 10 the computation is separated into two parts. A multiply and accumulation unit in which the multiply coe cients a, c and f are used (acf-unit), and a unit where the coe cients b, d, e and g are used (bdeg-unit). The incoming data X (0); : : : ; X (7) is separated in the even

ACF ACCBANK FOR IDCT ACC0 ACC1 ACC2 ACC3 a a a a c f f c f c c f a a a a BDEG ACCBANK FOR IDCT ACC0 ACC1 ACC2 ACC3 d g b e e b g d b d e g g e d b This section will concentrate on the multiplier architectures. First the di erent number representations that were used, are mentioned. Next the hardware architectures are shown.

IV. Multipliers

494

Timing and Control

Proceedings of the ProRISC Workshop on Circuits, Systems and Signal Processing 1997

2 2

-10

-9

Xe MULT a MULT c MULT f ACC 0 ACC 1 ACC 2 ACC 3 MUX 4:1 Ye

2

-8

2

-4

Fig. 3. acf matrix-vectory multiply unit

2

-2

A. Representations The computation of the idct uses 12 bits two's complement numbers as input, and returns 9 bits two's complement numbers. The computations performed during the rst 1d-idct use this 12 bits numbers as input. In 2] is shown that an intermediate result of 22 bits satis es the quality demands as speci ed for mpeg. The multiplier coe cients are coded in Signed Digit in a range of 2?2 ; : : : ; 2?14 . In this paper we are not concerned about quality demands, so we will reduce quality in order to get less complicated multipliers. The same structures are used for the Signed Binary variant as for the Two's complement multiplier. In this way we are able to compare these two representations in an honest way. In this paper we will use the following speci cations: Input coe cients are 12 bits twos's complement or 12 bits Signed Binary numbers. This holds for both the rst 1-D transformation as for the second one, where you should use 22 bits input coe cients. Output values are 22 bits two's complement or Signed Binary. Multiplication coe cients are coded in normal binary in a range of 2?2 ; : : : ; 2?10 . B. Architecture For the implementation of the Signed Binary and Two's Complement multipliers, with the mentioned speci cations, a straight forward architecture is chosen. In gure 4 the architectures are shown for a multiplication coe cient of 2?2 + 2?4 + 2?8 + 2?9 + 2?10 (0; 31933593750). For every '1' in the binary representation of the multiplication coe cient an adder section is inserted. Dependent on the representation, empty places at the left are lled with the sign, for two's complement, or with a '0' for signed binairy. The input coe cient in this picture is ?1399, which results after multiplication in ?446; 709765625. According to this structure seven multipliers are de-

Fig. 4. Two's complement (left) and Signed Binary (right) multiplier.

veloped with the multiplication coe cients as shown in table II C. Adders In both multipliers standard full adder sections are used, wich consist of two xor, two and and one or gate. The nand, and and or gates are presumed to be basic ports, which means that they have only one driving output in wich we are interested and they have a certain delay from input to output. The xor function is more complicated and therefore composed of four nands, which are also regarded as basic ports. D. vhdl descriptions The two's complement multipliers and the signed binairy multipliers are both described in VHDL on a synthesizable level. In this case the lowest level in the descriptions is the gate level. To use the Signed Binary multiplier in an environment which is based on two's complement representations, two conversion blocks are added. The two multipliers are now functional exactly the same, which makes a good comparison possible. To perform simulations, a special environment is created. First this environment will be discussed, and second the method of transition counting in vhdl which is used in this situation is shown. A. The simulation environment In gure 5 the simulation environment is shown. A bitmap image with gray values is used as input. With Matlab the 2d-dct from this image is computed, based on 8 8 blocks. This results in an image in the cosine domain. From this image, 8 8 blocks of data are used as input of the 1d-idct unit, which is described in vhdl. The rst output block contains the

V. Simulations

Comparison of the dynamic behavior of di erent multipliers

495

Multiplier coefficients

TABLE II

coe cient value 12 bit signed a 0.35355 0.35351 b 0.49039 0.49023 c d e f g 0.46194 0.41573 0.27779 0.19134 0.09755 0.46191 0.41552 0.27734 0.19091 0.09716

2?2 + 2?4 + 2?5 + 2?7 + 2?9 2?2 + 2?3 + 2?4 +2?5 + 2?6 + 2?8 + 2?9 2?2 + 2?3 + 2?4 +2?6 + 2?7 + 2?10 ?2 + 2?3 + 2?5 + 2?7 + 2?10 2 2?2 + 2?6 + 2?7 + 2?8 2?3 + 2?4 + 2?9 + 2?10 2?4 + 2?5 + 2?9 ? 2?10

Binairy representation

intermediate data, which needs to be transposed and In this way an and-port, a nand-port and an or-port fed again to the 1d-idct unit. This second transfor- are described. Higher hierarchical circuits, consist of mation supplies the spatial data. Besides the VHDL this blocks, as shown in gure 7. The count values of all the blocks in such a circuit are added and directly passed to the count value of this block. Counter Data Following this procedure, on all hierarchical levels the 8x8 block VHDL Matlab Transformed structure of the building blocks is the same: the norData 1D-IDCT Image Image Image 8x8 mal in-out signals are available, plus a clock input and (Cosine (Spatial) (Spatial) 2D-DCT Domain) a count-output. The seven multipliers, as described in the previous section, are composed in this way. So after every clock 8x8 block Intermediate data period the number of transitions that occurred in each Fig. 5. Simulation environment in vhdl of them is available. With a special interface block these values are written to disk every clock period. descriptions which describe the computational part of After each simulation, the counting values are added the transformation, other vhdl descriptions take care with Matlab. of the necessary data (re)ordering, like for example the transposition and the composition. count count The vdhl descriptions of the multipliers supply ev16 bits CLK CLK counter a a&b ery clock period the number of one-zero and zero-one b a a&b transitions that occurred in that clock period. These b numbers are collected and stored so they can be anaFig. 6. Transition counter in vhdl lyzed later.

B. Transition counting in VHDL To count the transitions that occur in a larger combinatorial circuit, like a multiplier, a special construction in vhdl is used. In g 6 a normal and port is connected to a 16 bits counter. Every time one of the inputs of the counter changes, the corresponding process in vhdl is evaluated. If the monitor input of the counter has changed, an internal counter is incremented with one. If a positive edge on the clock signal occurred, the internal count value is passed to the output port count and the internal counter is reset to zero.

Combinatorial Circuit

Fig. 7. Counting hierarchy in vhdl

Simulations have been performed with two images, the Mars image and the Clown image ( gure 8). To reduce simulation time, only a small image-size is chosen. Because the idct works on 8 8 blocks, these images will supply a lot of di erent 8 8 blocks. In the

VI. Results

496

Proceedings of the ProRISC Workshop on Circuits, Systems and Signal Processing 1997

Number of transitions with different images, and different delays of the or-port.

TABLE III

Picture

Mars Mars Mars Clown Clown Clown

Delay Two's OR-port Complement

1 ns 2 ns 3 ns 1 ns 2 ns 3 ns

Total

20

153.496.423 43.513.640 65.715.271 34.136.516 116.791.649 35.821.012 141.457.075 24.330.572 52.513.663 19.693.086 107.840.587 20.819.358 637.814.668 178.314.184

Signed Binary

Two's/SB

3,53 1,93 3,26 5,81 2,67 5,18 3,58

discussed, and second the usage of vhdl to setup the environment.

40

60

80

100

120 20 40 60 80 100 120

20

40

60

80

100

120 20 40 60 80 100 120

a input-to-output delay of 1ns. Simulations with different delays for the or-port are performed, because internal delays are expected to in uence the number of transitions. In table III the simulation results are shown.

VII. Conclusions and discussion

Fig. 8. Mars and Clown images simulations, the basic ports (and, or and nand) have

The conclusions and discussion are separated in two parts. First the results of the measurements will be

A. Comparison of two numerical systems Based on the results as shown in table III it can be concluded that the usage of the Signed Binary numerical system reduces the number of transitions signi cantly, compared to the number of transitions that occur when using the two's complement numerical system. The amount of reduction is expressed as the ratio Two0 s=SB . This ratio has in our measurements reached the lowest value of 1,93 when transforming the Mars image with 2 ns delay in the or-ports. The maximum value of 5,81 occurs when using the Clown image with 1 ns delay in the or-ports. The simulations performed are based on simple descriptions of the gates, with only a speci ed input-tooutput delay. Only from a description that is synthesized to a speci c VLSI implementation, real timing data can be extracted. This information is needed to analyze speed performance of the multipliers. Of course the amount of simulation data is not enough to relate the di erent Two0 s=SB values to the images that were used or to the delay in the or-port. It can be concluded that, when using this multiplier structure, the number of transitions is strongly related to the internal delays. Further it can be concluded that the Signed Binary multipliers can be expected to have signi cantly less internal transitions then Two's Complement multipliers. Further the question arises what these results tell about other multiplier structures. Do they also have less transitions when their structure is based on Signed Binary in stead of Two's Complement? Based on our results this question can of course not be answered for sure, but it is reasonable to think that this reduction also applies to other structures, because the

Comparison of the dynamic behavior of di erent multipliers

497

di erence in performance in this situation appears to 9] C.L. Wang and C.Y. Chen, "High-Througput VLSI Architectures for the 1-D and 2-D Discrete Cosine Transforms," be related to the used numerical system.

B. Vhdl as simulation environment Simulations were performed with vhdl as the hardware description language. Especially the possibilities of vhdl to describe hardware structures on di erent levels of abstraction are used. The counting mechanism is an example of such a description. Because vhdl simulates in an event-triggered 1 way, it is possible to detect signal changes in a circuit, and take action, in this case: update a counting value. Further the event-triggered operation of vhdl, causes simulations to be fast, since signals do not have to be monitored constantly. Of course the usage of vhdl as simulation tool as shown in this paper, is only one of the possible solutions. In other situations, the same principles can be used in other vhdl structures.

References

1] K.R. Rao and P. Yip, "Discrete Cosine Transform, Algorithms, Advantages, Applications", Academic Press, Inc. San Diego, 1990. 2] A. Madisetti and A.N. Wilson Jr., "A 100 MHz 2-D 8x8 DCT/IDCT Processor for HDTV Applications," IEEE Transactions on Circuits and Systems for video technology, vol. 5, no. 2, april 1995. 3] A.P. Thijsen et. al., "Digitale Techniek, van probleemstelling tot realisatie, deel 2", Delftse Uitgevers Maatschappij, 1990. 4] V. Srinivasan and K.J. Ray Liu, "VLSI Design of HighSpeed Time-Recursive 2-D DCT/IDCT Processor for Video Applications," IEEE Transactions on Circuits and Systems for video technology, vol. 6, no. 1, february 1996. 5] Y.T. Chang and C.L. Wang, "New Systolic Implementation of the 2-D Discrete Cosine Transform and Its Inverse," IEEE Transactions on Circuits and Systems for video technology, vol. 5, no. 2, april 1995. 6] T. Masaki et. al. "VLSI Implementation of Inverse Discrete Cosine Transformer and Motion Compensator for MPEG2 HDTV Video Coding," IEEE Transactions on Circuits and Systems for video technology, vol. 5, no. 5, october 1995. 7] R. Miyazaki, et al., "DCT/IDCT Processor for HDTV Developed with DSP Silicon Compiler," Journal of VLSI Signal Processing, vol. 5, pp. 151-158, 1993. 8] W. Chen, C.H. Smith, and S.C. Fralick, "A Fast Computational Algorithm for the Discrete Cosine Transform", IEEE Transactions on Cummunications, vol. Com-25, no.9, september 1977. Event-triggered: Software construction where a speci c part op the software is only evaluated when a change of one of the input parameters occurs.

1

IEEE Transactions on Circuits and Systems for video technology, vol. 5, no. 1, february 1995. 10] K.J.R. Liu, C.T. Chiu, R.K. Kolagotla and J.F. JaJa, "Optimal Uni ed Architectures for the Real-Time Computation of Time-Recursive Discrete Sinusoidal Transforms," IEEE Transactions on Circuits and Systems for video technology, vol. 4, no. 2, april 1994. 11] N. Ahmed, T. Natarajan and K.R. Rao, "Discrete Cosine Transform", IEEE Transactions on computers, vol. C23, pp 90-93, january 1974. 12] J.H. Hsiao, L.G. Chen, T.D. Chiuch and C.T. Chen, "High Throughput CORDIC-Based Systolic Array Design for the Discrete Cosine Transform," IEEE Transactions on Circuits and Systems for video technology, vol. 5, no. 3, june 1995. 13] T. Kuroda et. al., "A 0.9-V, 150-MHz, 10-mW, 4 mm2, 2D Discrete Cosine Transform Core Processor with Variable Threshold-Voltage (VT) Scheme," IEEE Journal Of SolidState Circuits, vol. 31, no. 11, noverber 1996.

498

Proceedings of the ProRISC Workshop on Circuits, Systems and Signal Processing 1997

更多相关文档:

Hendriks *Abstract* | The Inverse Discrete

Varma *Abstract*--A novel method for Elect

文献的摘要-*Abstract*-SelfStudyDirection_英语学习_外

AOAC-*Abstract*-Presentation[1] - WR Tolbe

更多相关标签:

相关文档

- The Discrete Cosine Transform (DCT)_Theory and Application
- Discrete Cosine Transform
- Image Compression Using the Discrete Cosine Transform
- AForward-Mapping Realization of the Inverse Discrete Cosine Transform
- A Fast Computational Algorithm for the Discrete Cosine Transform
- A Forward-Mapping Realization of the Inverse Discrete Cosine Transform
- A New Algorithm to Compute the Discrete Cosine Transform
- ABSTRACT VARIABILITY ANALYSIS OF DISCRETE COSINE TRANSFORM COEFFICIENT (DCTC) FEATURES FOR
- Using Modified Discrete Cosine Transform The MP3 Coding Stan-
- A fast recursive algorithm for computing the discrete cosine transform

文档资料共享网 nexoncn.com
copyright ©right 2010-2020。

文档资料共享网内容来自网络，如有侵犯请联系客服。email:zhit325@126.com