# Comparative Analysis of Latches and Flip-Flops for High-Performance Systems

Vladimir Stojanovic, Vojin Oklobdzija\* and Raminder Bajwa\*\*
University of Belgrade, Yugoslavia, \*Integration, Berkeley, CA
\*\*Semiconductor Research Laboratories, Hitachi America Ltd., San Jose, CA

#### **Abstract**

In this paper we propose a set of rules for consistent estimation of the real performance and power features of the latch and flip-flop structures. A new simulation and optimization approach is presented, targeting both high-performance and power budget issues. The analysis approach reveals the sources of performance and power consumption bottlenecks in different design styles. Certain misleading parameters have been properly modified and weighted to reflect the real properties of the compared structures. Furthermore, the results of the comparison of representative latches and flip-flops illustrate the advantages of our approach and the suitability of different design styles for high-performance applications.

#### 1. Introduction

Interpreting published results that compare various latches and flip-flops has been very difficult because different simulation methods were used to generate and present the results. Certain approaches, [1], [2], etc., did not illustrate the real performance and power features of the presented structures. The main reason was improper consideration and weighting of the relevant parameters. In this paper we establish a set of rules to make comparisons fair and realistic: first, a definition of the relevant set of parameters to be measured and rules for weighting their importance; and second, a set of relevant simulation conditions that emphasize the parameters of interest. The primary goal of the simulation and optimization procedures was the best compromise between power consumption and performance, given that the limitation in performance is usually imposed by the available power budget.

#### 2. Analysis

#### 2.1. Power Considerations

The data activity rate, $\alpha$, represents the average number of output transitions per clock cycle. We applied four different data sequences. The sequence ...010101010..., $\alpha = 1$, reflects maximum internal dynamic power consumption; however, depending on the structure, the sequence ...111111... can in some cases dissipate more power. A pseudo-random sequence with equal probability of all transitions (data activity rate $\alpha=0.5$) is considered to reflect average internal power consumption given a uniform data distribution. The sequence ...111111..., $\alpha=0$, reflects power dissipation of precharged nodes, while ...000000..., $\alpha=0$, reflects leakage power consumption and power spent on internal clock processing.
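The activity rates quoted for these sequences can be checked with a short sketch. The helper name `activity_rate`, the sequence lengths and the random seed are illustrative assumptions, and the output is assumed to follow the data sequence one-to-one.

```python
import random

def activity_rate(bits):
    """Average number of transitions per clock cycle for a bit string."""
    transitions = sum(a != b for a, b in zip(bits, bits[1:]))
    return transitions / (len(bits) - 1)

# The three deterministic sequences from the text.
alternating = "01" * 50   # ...010101...
all_ones = "1" * 100      # ...111111...
all_zeros = "0" * 100     # ...000000...

# Pseudo-random sequence: each bit drawn independently with probability 1/2,
# so all four transition types are equally likely (seed chosen arbitrarily).
rng = random.Random(0)
pseudo_random = "".join(rng.choice("01") for _ in range(10001))

print(activity_rate(alternating))    # alpha = 1
print(activity_rate(all_ones))       # alpha = 0
print(activity_rate(all_zeros))      # alpha = 0
print(activity_rate(pseudo_random))  # close to 0.5
```

Note that the two constant sequences share $\alpha = 0$ but, as stated above, are not equivalent in power: the activity rate alone does not capture precharge activity.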

Dynamic power consumption can be estimated by:

$$P_d = f\, C_{eff}\, V_{dd}^2, \quad \text{where} \quad C_{eff} = \sum_{i=1}^{N} \alpha_i k_i C_i$$

- α<sub>i</sub> is the switching probability of node i (in regard to the clock cycle)
- k<sub>i</sub> is the swing range coefficient of node i (k<sub>i</sub> =1 for rail to rail swing)
- C<sub>i</sub> is the total capacitance of node i
- f is the clock frequency
- *Vdd* is the rail to rail voltage range (supply voltage)
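A minimal numeric sketch of this estimate follows. The clock frequency (100 MHz) and supply voltage (2 V) are the simulation conditions used in this paper; the three nodes and their switching probabilities, swing coefficients and capacitances are hypothetical values chosen only to illustrate the arithmetic.

```python
def effective_capacitance(nodes):
    """C_eff = sum of alpha_i * k_i * C_i over all nodes (farads)."""
    return sum(alpha * k * c for alpha, k, c in nodes)

def dynamic_power(f_hz, c_eff, vdd):
    """P_d = f * C_eff * Vdd^2 (watts)."""
    return f_hz * c_eff * vdd ** 2

# Hypothetical nodes: (alpha_i, k_i, C_i in farads).
nodes = [
    (1.0, 1.0, 20e-15),   # precharged node, switches every cycle
    (0.5, 1.0, 200e-15),  # rail-to-rail output node, average activity
    (0.5, 0.5, 10e-15),   # internal node with half-rail swing
]

c_eff = effective_capacitance(nodes)     # 20 + 100 + 2.5 = 122.5 fF
p_d = dynamic_power(100e6, c_eff, 2.0)   # 100 MHz, Vdd = 2 V

print(f"C_eff = {c_eff * 1e15:.1f} fF, P_d = {p_d * 1e6:.1f} uW")
```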

Fig. 1 describes differences in switching activity, and therefore power consumption, for different design styles. Capacitances  $C_{total}$ ,  $C_{precharge}$  and  $C_{out}$  are calculated taking into account the  $C_i$  and  $k_i$  coefficient of each node in the circuit.

Semi-Dynamic structures are generally composed of dynamic (precharged) front-end and static output part. Thus we designated two major effective capacitances:  $C_{precharge}$  and  $C_{out}$ , each representing the corresponding part of the circuit. It is shown on Fig. 1 that these two capacitances have different charging and discharging activities.

Total effective precharge capacitance of semi-dynamic, differential structures is comprised of two effective capacitances of the same size:  $C_{prechargeQ}$  and  $C_{prechargeQb}$  which actually represent the two complementary halves of the precharged differential tree.



Fig. 1. Sources of internal, dynamic power consumption

 $C_{eff} = C_{prech}(p(0 \rightarrow 1) + p(1 \rightarrow 0) + p(0 \rightarrow 0) + p(1 \rightarrow 1)) + 2C_{out}(p(0 \rightarrow 1) + p(1 \rightarrow 0))$ 
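As a worked check, the test sequences of Section 2.1 can be substituted into the expression above, taking it at face value. The only assumption is that, for the pseudo-random sequence, each of the four transition types has probability $1/4$. Since the four probabilities always sum to one, the precharge term contributes $C_{prech}$ for every pattern:

$$
\begin{aligned}
\ldots 0101 \ldots:&\quad p(0 \rightarrow 1) = p(1 \rightarrow 0) = \tfrac{1}{2} &&\Rightarrow\; C_{eff} = C_{prech} + 2C_{out} \\
\text{pseudo-random}:&\quad p(0 \rightarrow 1) = p(1 \rightarrow 0) = \tfrac{1}{4} &&\Rightarrow\; C_{eff} = C_{prech} + C_{out} \\
\ldots 0000 \ldots \text{ or } \ldots 1111 \ldots:&\quad p(0 \rightarrow 1) = p(1 \rightarrow 0) = 0 &&\Rightarrow\; C_{eff} = C_{prech}
\end{aligned}
$$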

We used the .MEASURE average power statement in HSPICE to measure the power dissipation of interest. The results were compared with the earlier power measurement method presented in [3] and showed the same level of accuracy.

There are three main sources of power dissipation in the latch:

- Internal power dissipation of the latch, including the power dissipated for switching the output loads
- Local clock power dissipation, which represents the portion of power dissipated in the local clock buffer driving the clock input of the latch
- Local data power dissipation, which represents the portion of power dissipated in the logic stage driving the data input of the latch

Total power parameter is the sum of all three measured kinds of power.

#### 2.2. Timing

The Stable region, Fig. 2, is the region of the Data-Clk axis (Data-Clk being the time difference between the last transition of Data and the latching Clock edge) in which the Clk-Q delay does not depend on Data-Clk time. As Data-Clk decreases, at a certain point the Clk-Q delay starts to rise monotonically and ends in failure. This region of the Data-Clk axis is the Metastable region: the region of unstable Clk-Q delay, where the Clk-Q delay rises exponentially, as indicated by Shoji in [7]. Changes in Data that happen in the Failure region of D-Clk are not transferred to the outputs of the circuit.

The question arises: how much can we let the Clk-Q delay degrade in the Metastable region and still gain performance (due to the minimum in D-Q) while ensuring reliability?



Fig. 2. StrongArm110 flip-flop, Stable, Metastable and Failure regions

 $D_{CQ}$ , [6], is the value of *Clk-Q* delay, Fig. 2, in *Stable region*, and U, [6], is the minimum point on D-Clk axis which is still a part of *Stable region*.

In the *Metastable region*, the *D-Q* curve has its minimum as we move the last transition of data towards the latching edge of the clock. Beyond that *minimum D-Q* point there is no benefit in moving the Data transition closer to the rising edge of the clock. We refer to the D-Clk delay at that point as the optimum setup time, the limit beyond which the performance of the latch is degraded and reliability is endangered.

Our interest is to minimize the D-Q delay (or  $D_{CQ}+U$ , as defined by Unger and Tan, [6]), which represents the portion of time that the flip-flop or Master-Slave structure takes out of the clock cycle. Since  $D_{CQ}+U > minimum\ D-Q$  (as defined in Fig. 2), the cycle time is reduced if the change in Data is allowed to arrive as late as the *Optimum setup time* before the trailing edge of the clock.

In the light of the reasons presented above, we accepted *minimum D-Q* delay as *Delay* parameter of a flip-flop or Master-Slave latch.
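The search for the *minimum D-Q* point can be sketched as a sweep over D-Clk. The Clk-Q model below (a constant stable-region delay plus an exponential rise as data approaches the clock edge) and all of its constants are hypothetical stand-ins for simulated data, not values from this paper. D-Clk is taken as positive when data arrives before the clock edge, so that D-Q = D-Clk + Clk-Q. The Failure region is not modeled.

```python
import math

# Hypothetical model parameters, in picoseconds.
D_CQ = 180.0  # stable-region Clk-Q delay
A = 50.0      # amplitude of the Clk-Q degradation
TAU = 20.0    # decay constant of the degradation

def clk_q(d_clk):
    """Clk-Q delay as a function of D-Clk: flat when data is early,
    rising exponentially as data moves towards the clock edge."""
    return D_CQ + A * math.exp(-d_clk / TAU)

def find_optimum_setup(d_clk_points):
    """Return (optimum setup time, minimum D-Q) over the swept points."""
    best = min(d_clk_points, key=lambda t: t + clk_q(t))
    return best, best + clk_q(best)

sweep = range(-40, 201)  # D-Clk from -40 ps to 200 ps in 1 ps steps
opt_setup, min_dq = find_optimum_setup(sweep)

print(f"optimum setup time = {opt_setup} ps, minimum D-Q = {min_dq:.1f} ps")
```

In this toy model the minimum lies where the slope of the Clk-Q curve equals -1, that is, where moving data 1 ps closer to the clock costs exactly 1 ps of additional Clk-Q delay.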

Metastable region consists of Setup and Hold zones. Last data transition can be moved all the way to the Optimum setup time. First or late data transition is allowed to come after the Hold zone.

Hybrid design technique, [9], [13], [14], shifts the reference point of hold and setup time parameters from the rising edge of the clock to the falling edge of the buffered clock signal which ends the transparency period.

In this way setup and hold times measured in reference to the rising edge of the clock (as conventionally defined for flip-flops) are functions of the width of transparency period since their real reference point is the end of that period (just like in custom transparent latches).

#### 3. Simulation

#### 3.1. Test bench



Fig. 3. The simulation test bench

Buffering inverters in Fig. 3 provide realistic Data and Clock signals, while being fed from ideal voltage sources. Capacitive loads simulate the fan-out signal degradation. Since buffering inverters dissipate power even without any external load (due to their internal capacitances), we corrected the measured power of the shaded inverters, Fig. 3, by interpolating the power over a wide range of loads. In the case of the Data inverter, the correction took into account not only the inverter's intrinsic capacitance, but also the load Cl.

| Technology:                 |                              |
|-----------------------------|------------------------------|
| Channel length              | 0.2 µm                       |
| Min. gate width             | 1.6 µm                       |
| Max. gate width             | 22 µm                        |
| Vtp,n                       | 0.7 V                        |
| MOSFET Model:               | Level 28 modified BSIM Model |
| MOS Gate Capacitance Model: | Charge Conservation Model    |
| Conditions:                 |                              |
| Nominal                     | Vdd = 2 V, T = 25°C          |

Table 1. MOS transistor model parameters

Parameters of the MOS model used in simulations are shown in Table 1. For given technology, load capacitance Cl =200fF equals the load of 22 minimal inverters (wp/wn = 3.2u/1.6u). Dependence of power consumption on clock frequency appeared to be nearly linear (since the throughput was increased accordingly), so we decided to fix the frequency at 100MHz.

#### 3.2. Transistor width optimization

All structures were optimized both in terms of speed and power. We used the Levenberg-Marquardt optimization algorithm embedded in HSPICE. The search direction of this algorithm is a combination of the Steepest Descent and the Gauss-Newton methods. A variety of other optimization algorithms is available today, like the ones presented by Yuan and Svensson in [11] and [12]. Both algorithms will eventually lead to good results when applied to logic structures, but they do not take into account the setup time parameter and therefore the effective time taken from the cycle.

The first step is the optimization of both *Clk-Q* delay and *Total power*, which essentially represents optimization in terms of PDP with the addition of the *Total power* parameter. The next step is the calculation and correction of the *minimum D-Q*, taken as the *Delay* parameter. The problem lies in how to calculate the *Delay* and find the minimum PDP<sub>tot</sub> in one step. Several iterations are needed to achieve satisfactory results.

New automated tools are needed especially because the existing ones consider the *Clk-Q* delay as a relevant parameter for the optimization. If we try to optimize MS latch in terms of the classical PDP (*Clk-Q* \* Internal Power) the result will be minimal Master latch optimized for low power, and Slave latch optimized for both speed and power. The "optimized" structure will have excessively large setup time thus requiring the larger clock cycle to meet the timing requirements. The reason for such result is that the optimizer does not "see" the real performance through *Clk-Q* delay.

#### 4. Results

Results of the simulations are shown in Table 2. Power dissipation parameters presented in Table 2 are for the pseudo-random data sequence with equal probability of all transitions. The point of minimum Power-Delay Product exists and represents the point of optimal energy utilization. The PDP<sub>tot</sub> parameter is the product of the *Delay* and *Total power* parameters. We have chosen PDP<sub>tot</sub> as the overall performance parameter for comparison in terms of speed and power.

Main advantages of PowerPC 603 MS latch, Fig. 6, [4], are short direct path and low-power feedback. Its big clock load greatly influences the total power consumption on chip.

Modification of standard dynamic C<sup>2</sup>MOS MS latch, Fig. 14, has small clock load, achieved by the local clock buffering, and low-power feedback assuring fully static operation. It is slower than PowerPC 603 MS latch. The faster pull-up in PowerPC 603 MS latch is achieved by the use of complementary pass-gates, which are less robust. Unlike classical C<sup>2</sup>MOS structure, mC<sup>2</sup>MOS is robust to clock slope variation due to the local clock buffering.

Milestones of the hybrid-design technique are HLFF, Fig. 9, [9] and SDFF, Fig. 10, [13]. SDFF is the fastest of all the presented structures. Its significant advantage over HLFF lies in the very small performance penalty for embedded logic functions. SDFF's larger front-end increases the clock load, but is needed to charge the large effective precharge capacitance. The size of this capacitance causes increased power consumption for data patterns with more "ones".

K6 Edge-Triggered-Latch, Fig. 11, [14], is a dynamic, self-resetting, differential, hybrid structure. It is very fast but has very high power consumption, independent of the data pattern.

The precharged sense-amplifier stage SA-F/F, Fig. 12, [10], and the flip-flop used in StrongArm110, Fig. 13, [8], have their speed bottleneck in the output S-R latch stage. Uneven rise and fall times not only degrade speed but also cause glitches in succeeding logic stages, which increases total power consumption. The additional transistor in the StrongArm FF only provides fully static operation, with little penalty in power and delay.

SA-F/F, StrongArm110 FF, and the self-reset stage in K6 ETL have the very useful feature of monotonic transitions at the outputs, which drive fast domino logic, [14], [15]. These structures also have very small clock load.

The SSTC\* and DSTC\* MS latches, Fig. 7 and Fig. 8, were simulated with minimized Master latch, as proposed in [5], and optimized Slave latch.

Using our optimization approach we achieved approximately 40% better results, in terms of PDP<sub>tot</sub>.

Minimized Master latch in SSTC\* and DSTC\* suffers from substantial voltage drop at the outputs, due to the capacitive coupling effect between the common node of the Slave latch and the floating high output driving node of the Master latch. The optimized Master latch consumes more power than the minimized one but minimizes the portion of short circuit power dissipated in the Slave latch. With this tradeoff, power remains the same and setup time is significantly reduced which leads to much better PDP<sub>tot</sub>. However, the presented capacitive coupling effect along with the problems associated with the glitches at the data inputs, noted by Blair in [16], result in much worse performance and power features compared with other presented latches, even for the optimized structures SSTC and DSTC.

For systems where high performance is of primary interest, within the available power budget, single-ended, hybrid, semi-dynamic designs are a very good choice, given their negative setup time and small internal delay. Their power dissipation is comparable to that of Static MS latches, but their performance is much better.

| Nominal conditions  | # of transistors | Total gate width [µm] | Internal power [µW] | Clock power [µW] | Data power [µW] | Total power [µW] | Delay [ps] | PDP<sub>tot</sub> [fJ] |
|---------------------|--------------|-------------------------------|---------------------------|------------------------|-----------------------|------------------------|----------------|----------------------------|
| PowerPC             | 16           | 185                           | 56                        | 46                     | 5                     | 107                    | 266            | 28                         |
| HLFF                | 20           | 162                           | 126                       | 18                     | 3                     | 148                    | 199            | 29                         |
| SDFF                | 23           | 167                           | 178                       | 27                     | 2                     | 207                    | 187            | 39                         |
| mC <sup>2</sup> MOS | 24           | 170                           | 114                       | 15                     | 6                     | 136                    | 292            | 40                         |
| SA-F/F              | 19           | 214                           | 137                       | 18                     | 3                     | 158                    | 272            | 43                         |
| StrongArm           | 20           | 215                           | 141                       | 18                     | 3                     | 162                    | 275            | 45                         |
| K6 ETL              | 37           | 246                           | 330                       | 15                     | 5                     | 349                    | 200            | 70                         |
| SSTC                | 16           | 147                           | 134                       | 22                     | 4                     | 160                    | 592            | 95                         |
| DSTC                | 10           | 136                           | 172                       | 22                     | 4                     | 198                    | 629            | 125                        |
| SSTC*               | 16           | 86                            | 132                       | 14                     | 1                     | 146                    | 898            | 131                        |
| DSTC*               | 10           | 76                            | 172                       | 13                     | 1                     | 185                    | 1060           | 196                        |

**Table 2. General Characteristics** 
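The PDP<sub>tot</sub> column can be reproduced from the *Total power* and *Delay* columns. The sketch below copies those two columns from Table 2; the unit conversion (1 µW × 1 ps = 10<sup>-3</sup> fJ) is standard, while rounding to the nearest integer is an assumption that happens to match the printed values.

```python
# (name, total power [uW], delay [ps], printed PDP_tot [fJ]) from Table 2.
TABLE2 = [
    ("PowerPC",   107,  266,  28),
    ("HLFF",      148,  199,  29),
    ("SDFF",      207,  187,  39),
    ("mC2MOS",    136,  292,  40),
    ("SA-F/F",    158,  272,  43),
    ("StrongArm", 162,  275,  45),
    ("K6 ETL",    349,  200,  70),
    ("SSTC",      160,  592,  95),
    ("DSTC",      198,  629, 125),
    ("SSTC*",     146,  898, 131),
    ("DSTC*",     185, 1060, 196),
]

def pdp_tot_fj(total_power_uw, delay_ps):
    """Power-delay product in femtojoules: uW * ps = 1e-18 J = 1e-3 fJ."""
    return total_power_uw * delay_ps / 1000.0

computed = {name: pdp_tot_fj(p, d) for name, p, d, _ in TABLE2}

for name, p, d, printed in TABLE2:
    print(f"{name:10s} {computed[name]:7.1f} fJ (table: {printed})")
```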



Fig. 4: Overall Delay comparison

In our comparisons, differential structures appear to be worse than single-ended ones. Differential structures switch for all data patterns and have doubled input and output capacitive load. Differential latches based on DCVS logic style suffer from uneven rise and fall times which can cause glitches and short-circuit power dissipation in succeeding logic stages. However, in case where logic in the pipeline operates with reduced voltage swing signals these latches have the role of signal amplifiers, i.e. swing recovery circuits, [10]. Thus, the logic in the pipeline is the party that saves power and not the latches themselves.



Fig. 5. Ranges of PDP<sub>tot</sub>

Fig. 5 presents the ranges and distribution of  $PDP_{tot}$  for different data patterns. Symbol • designates the point of power dissipation ( $PDP_{tot}$ ) for average activity data pattern.

Detailed timing parameters of the presented structures are shown in Table 3.



| Nominal conditions  | Clk-Qhl [ps] | Clk-Qlh [ps] | Min. D-Qhl [ps] | Min. D-Qlh [ps] | Opt. setup time [ps] |
|---------------------|------|------|-------|-------|----------|
| HLFF                | 195  | 191  | 199   | 155   | -21      |
| PowerPC             | 145  | 139  | 266   | 220   | 79       |
| SDFF                | 176  | 176  | 187   | 143   | -21      |
| mC <sup>2</sup> MOS | 193  | 188  | 292   | 282   | 92       |
| Strong Arm          | 262  | 162  | 275   | 171   | -35      |
| SA-F/F              | 262  | 162  | 272   | 168   | -35      |
| K6 ETL              |      | 168  |       | 200   | -4       |
| SSTC                | 97   | 301  | 374   | 592   | 267      |
| DSTC                | 98   | 318  | 375   | 629   | 263      |
| SSTC*               | 150  | 393  | 639   | 898   | 476      |
| DSTC*               | 200  | 500  | 716   | 1060  | 480      |

Table 3. Timing parameters

On Fig. 15, hybrid structures show the best performance, as they really should, due to the negative setup time.

When only the *Clk-Q* parameter is taken as a valid performance indicator, the positive setup time of the MS structures becomes hidden and they become comparable to, if not better than, the hybrid ones. This is illustrated in Fig. 16, where the PowerPC 603 MS latch becomes the "fastest", the  $mC^2MOS$  MS latch becomes as "fast" as HLFF, and the DSTC and SSTC MS latches become comparable to other structures in terms of "speed".
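The change in ranking can be reproduced from the Table 3 data. In the sketch below, each structure is ranked once by *Clk-Q* and once by *minimum D-Q*; using the slower of the two transitions (high-to-low, low-to-high) as the figure of merit is an assumption, although it does coincide with the *Delay* column of Table 2. Entries missing from Table 3 are kept as `None`.

```python
# (name, Clk-Qhl, Clk-Qlh, min D-Qhl, min D-Qlh) in ps, copied from Table 3.
TABLE3 = [
    ("HLFF",      195,  191, 199,  155),
    ("PowerPC",   145,  139, 266,  220),
    ("SDFF",      176,  176, 187,  143),
    ("mC2MOS",    193,  188, 292,  282),
    ("StrongArm", 262,  162, 275,  171),
    ("SA-F/F",    262,  162, 272,  168),
    ("K6 ETL",    None, 168, None, 200),
    ("SSTC",       97,  301, 374,  592),
    ("DSTC",       98,  318, 375,  629),
    ("SSTC*",     150,  393, 639,  898),
    ("DSTC*",     200,  500, 716, 1060),
]

def worst(*values):
    """Slowest of the available transitions, ignoring missing entries."""
    return max(v for v in values if v is not None)

# Ranking when only Clk-Q is considered (setup time hidden).
by_clk_q = sorted(TABLE3, key=lambda r: worst(r[1], r[2]))
# Ranking by minimum D-Q, the Delay parameter used in this paper.
by_delay = sorted(TABLE3, key=lambda r: worst(r[3], r[4]))

print("by Clk-Q:  ", [r[0] for r in by_clk_q])
print("by min D-Q:", [r[0] for r in by_delay])
```

Ranked by *Clk-Q*, the PowerPC 603 MS latch comes first and mC<sup>2</sup>MOS lands next to HLFF; ranked by *minimum D-Q*, the hybrid structures SDFF, HLFF and K6 ETL take the first three places.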



Fig. 15. Total Power range vs. Delay



Fig. 16. Total Power range vs. Clk-Q

The amount of power consumed for driving the clock inputs of each structure is shown on Fig. 17.



Fig. 17. Local Clock power consumption

#### 5. Conclusion

The problem of consistency in the analysis of various latch and flip-flop designs was addressed. A consistent analysis approach and a set of simulation conditions have been introduced. We strongly feel that any research on latch and flip-flop design techniques for high-performance systems should take those parameters into account. The problems of the transistor width optimization methods have also been described. Some hidden weaknesses and potential dangers in terms of reliability of previous timing parameters and optimization methods were brought to light.

#### 6. Acknowledgements

The authors would like to thank Mr. Stanisha Okobdzija for his useful comments and help with the language aspects of the document, and Mr. Borivoje Nikolic, Ph.D. student, UC Davis, for sharp comments and ideas on simulation and analysis procedures.

#### 7. References

- [1] U. Ko, et al., "Design techniques for high-performance, energy-efficient control logic," ISLPED Digest of Technical Papers, Aug. 1996.

- [2] J. Yuan and C. Svensson, "Latches and flip-flops for Low Power Systems," in A. Chandrakasan and R. Brodersen, Low Power CMOS design, 233-238, IEEE Press, NJ 1998.
- [3] G. J. Fisher, "An Enhanced Power Meter for SPICE2 Circuit," *IEEE Transactions on Computer-Aided Design*, vol. 7, no. 5, Oct. 1986.
- [4] G. Gerosa, et al., "A 2.2 W, 80 MHz Superscalar RISC Microprocessor," *IEEE Journal of Solid-State Circuits*, vol. 29, no. 12, December 1994., 1440-1452.
- [5] J. Yuan and C. Svensson, "New Single-Clock CMOS Latches and Flipflops with Improved Speed and Power Savings," *IEEE Journal of Solid- State Circuits*, vol. 32, no. 1, January 1997.
- [6] S.H. Unger and C. Tan, "Clocking Schemes for High-Speed Digital Systems," *IEEE Transactions on Computers*, vol. C-35, No 10, October 1986
- [7] M. Shoji, *Theory of CMOS Digital Circuits and Circuit Failures*, Princeton University Press, Princeton NJ, 1992.
- [8] J. Montanaro, et al., "A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor," *IEEE Journal of Solid-State Circuits*, vol. 31, no. 11, 1703-14., Nov. 1996.
- [9] H. Partovi, et al., "Flow-through latch and edge-triggered flip-flop hybrid elements," ISSCC Digest of Technical Papers, Feb. 1996.
- [10] M. Matsui, et al. "A 200 MHz 13 mm<sup>2</sup> 2-D DCT Macrocell Using Sense-Amplifier Pipeline Flip-Flop Scheme," IEEE Journal of Solid-State Circuits, vol. 29, no. 12, 1482-91, Dec. 1994.
- [11] J. Yuan and C. Svensson, "CMOS Circuit Speed Optimization Based on Switch Level Simulation," Proceedings of International Symposium on Circuits and Systems, ISCAS 88, 1988.
- [12] J. Yuan and C. Svensson, "Principle of CMOS circuit power-delay optimization with transistor sizing," Proceedings of International Symposium on Circuits and Systems, ISCAS 96, vol.1, 1996.
- [13] F. Klass, "Semi-Dynamic and Dynamic Flip-Flops with embedded logic," *Digest of Technical Papers*, 1998 Symposium on VLSI Circuits, Honolulu, HI, USA, 11-13 June 1998.
- [14] D. Draper, et al., "Circuit techniques in a 266-MHz MMX-enabled processor," *IEEE Journal of Solid-State Circuits*, vol. 32, no. 11, 1650-64., Nov. 1997.
- [15] B.A. Gieseke, et al., "A 600 MHz superscalar RISC microprocessor with out-of-order execution," ISSCC Digest of Technical Papers, 176-7, 451, Feb. 1997.
- [16] G.M. Blair, "Comments on New single-clock CMOS latches and flip-flops with improved speed and power savings," *IEEE Journal of Solid-State Circuits*, vol. 32, no. 10, pp. 1610-11., Oct.1997.