Design-Agnostic Distributed Timing Fault Injection Monitor with End-to-End Design Automation

Yan He*, Yumin Su*, and Kaiyuan Yang

*Yan He and Yumin Su contributed equally to this paper.

Manuscript received on

Y. He, Y. Su, and K. Yang are with the Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA (corresponding author: Kaiyuan Yang, kyang@rice.edu).

1063-8210 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.

Abstract

Fault Injection Attacks (FIAs) induce hardware failures in circuits and exploit these faults to compromise the security of the system. FIAs have been shown to bypass system security mechanisms, cause faulty outputs, and gain access to secret information. Certain types of FIAs can be mounted with little effort by tampering with clock signals and/or the chip’s operating conditions. To mitigate such low-cost yet powerful attacks, we propose a fully synthesizable and distributable in situ Fault Injection Monitor that employs a Delay Locked Loop (DLL) to track the pulse width of the clock. We further develop a fully automated design framework to optimize and implement the FIA monitors at any process node. Our design is fabricated and verified in 65nm CMOS technology with a small footprint of $1500\mu m^{2}$. It can lock to clock frequencies from 2 MHz to 1.26 GHz while detecting all twelve types of possible clock glitches, as well as timing faults injected via the supply voltage, electromagnetic signals, and chip temperature.

Index Terms:

hardware security; Fault Injection Attacks; DLL; Fault Injection Monitors; design automation

I Introduction

Fault Injection Attacks (FIAs) induce errors in the logic operations of a chip and exploit such errors to break systems that are otherwise secure. For example, while a system is validating some information, a hardware failure could trick it into accepting incorrect data. All subsequent operations that depend on the authenticity of this information would then become insecure because of the injected fault. The effectiveness and risk of FIAs have been established through academic research and real-world attacks. Multiple gaming consoles have been hacked by introducing glitches on the clock or the voltage supply [1, 2]. Drones have been attacked without any physical contact, as electromagnetic (EM) interference can inject faults into them remotely [3]. Even CPU features carefully designed for security have proven vulnerable to FIAs: [4] breaks Intel SGX by injecting faults from software, while [5] attacks ARM TrustZone by manipulating the processor’s voltage. Because FIAs can open backdoors even in secure systems whose logic is flawlessly designed, they are a particularly devastating threat to modern computing systems.

As evident from the examples above, FIAs can be launched with a variety of methods and tools. But, broadly speaking, FIAs exploit two major categories of physical anomalies: directly flipping the voltage of a signal wire or a memory/register unit, and violating timing constraints to make a register sample a wrong value.

Bit Flips. Lasers are the most powerful tool for inducing bit flips and can cause highly localized faults in the chip; high-power lasers can even target individual gates and wires. While laser attacks are powerful, mounting them requires expensive equipment and considerable complexity. More importantly, various works have demonstrated that laser attacks can be detected with low-cost light sensors densely distributed across the chip. [6] reports a 28% area increase when the AES module is protected with Bulk Built-In Current Sensors, while the photosensors in [7] reduce the area overhead to as low as 20%. [8] employs standard inverters to form a laser detection circuit array, with a reported area overhead as low as 3% at an advanced node. Considering the gap between the requirements for attackers and defenders, we argue that these protections are sufficient to thwart most laser-based attacks.

Timing Violations. Logic faults can also be injected through intentional timing violations. The attacker alters the relative timing relationship between the clock and the data arrival time, causing the register to latch a wrong value. Timing FIAs can be conducted by directly hijacking the clock signal or by indirectly influencing the logic gates’ delay and, thus, data arrival time. As the delay of a CMOS gate can be affected by various environmental factors (voltage, electromagnetic interference, temperature, etc.), a broad attack surface exists for timing FIAs. More importantly, simple tools and even software hacks can be exploited to interfere with these factors and launch timing FIAs. Since timing FIAs are inexpensive yet effective, we focus on mitigating this type of FIA in this work. We aim to develop low-cost timing FIA sensors that will complement the aforementioned laser sensors to cover a broad spectrum of FIAs.

An ideal solution to FIAs should be applicable to any digital design, be compatible with the typical digital design workflow in any technology node, and require minimal design and testing effort. Specifically, we target a fully synthesizable FIA sensor design that can be generated by a compiler for synthesis and placement & routing (P&R), just like standard digital circuits. It should also tolerate the parasitics and layout mismatches introduced by the automatic P&R process without post-silicon testing. Furthermore, hardware development costs are greatly reduced if the FIA mitigation can be generated automatically by a framework. The user only needs to provide the process design kit (PDK), the standard cell library, and a few parameters describing the system’s clock. The framework then runs the required simulations and produces a netlist ready for layout with electronic design automation (EDA) tools. Such a framework enables agile and low-cost insertion of FIA mitigation into any existing design at any technology node.

To achieve these goals, we present a fully synthesizable design-agnostic timing FIA monitor along with an automatic design framework. The principle of the monitor is to replicate the system clock with an internal delay-locked loop (DLL) and compare it with every clock edge in real time. Our monitor is able to detect attacks on the clock signal as well as the logic gates’ propagation/contamination delay. It features a tiny footprint using standard cells and can be distributed across the chip for localized FIA protection. Our prototype of the timing FIA monitor in 65nm CMOS demonstrates:

• robust detection of timing faults, supporting a wide range of clock frequencies from 2 MHz to 1.26 GHz;
• a fully synthesized PVT-robust design validated across the automotive temperature range (-40 °C to 125 °C), 0.5 V to 1.4 V VDD, and 50 DUTs;
• an end-to-end automation framework to implement the monitor in any technology node;
• low power consumption (0.2 mW to 1.12 mW for 2 MHz to 1250 MHz clocks at 1.2 V and 25 °C) and a small footprint ($1500\mu m^{2}$).

This article extends [9] and is organized as follows. Section II summarizes the attack surface and possible mitigations of timing FIAs. Section III elaborates on our proposed monitor, followed by a description of the automation framework in Section IV. Section V offers the measurement results of 65nm test chips, while Section VI showcases the design framework in 28nm. Finally, Section VII concludes the article.

II Timing Fault Injection Attacks and Mitigations

Figure 1: (a) Timing FIA entry points: the clock signal and the gates’ delay. (b) Correct timing at the register. (c) The case where the data arrives too late, and (d) the case where the data changes too fast.

For standard synchronous digital circuits to function correctly, all timing constraints must be satisfied. Timing at the registers is particularly critical because an incorrectly sampled value cannot be recovered or even detected, leading to system failures and logic errors that attackers can exploit. The setup time requirement of a register specifies how long before the clock edge the data signal must be stable, while the hold time requirement specifies how long after the clock edge the data signal must remain unchanged. Violating either requirement causes a timing failure and potentially a wrong value sampled by the register. As shown in Fig. 1, timing FIAs induce faults by intentionally breaching these requirements, targeting either the clock signal or the data arrival time.
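The two requirements can be written as slack equations for a single register-to-register path. The Python sketch below is illustrative only: the function names, the picosecond values, and the single optional skew term are our assumptions, not data from this work. It shows that both entry points in Fig. 1, a shortened clock period and a slowed logic path, drive the setup slack negative.

```python
"""Setup/hold slack for one register-to-register path (all times in ps).

Illustrative sketch: the delay values used below are hypothetical.
"""


def setup_slack(t_clk, t_clk2q, t_logic_max, t_setup, skew=0):
    # Data launched at one edge must be stable t_setup before the next edge.
    # `skew` > 0 means the capture clock arrives later than the launch clock.
    return (t_clk + skew) - (t_clk2q + t_logic_max + t_setup)


def hold_slack(t_clk2q, t_logic_min, t_hold, skew=0):
    # New data must not arrive until t_hold after the same capture edge.
    return (t_clk2q + t_logic_min) - (t_hold + skew)


if __name__ == "__main__":
    print(setup_slack(1000, 80, 850, 40))  # nominal clock and path: 30
    print(setup_slack(900, 80, 850, 40))   # clock glitch shortens period: -70
    print(setup_slack(1000, 80, 935, 40))  # supply droop slows the path: -55
    print(hold_slack(80, 30, 50))          # nominal hold slack: 60
```

A negative value in either function corresponds to the violation cases (c) and (d) of Fig. 1.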

II-A Clock Glitching

Hijacking the clock signal is the most intuitive way to trigger a timing violation since the setup time and the hold time requirements are relative to the clock edge. In a highly optimized design, shifting the clock signal forward/backward will likely violate the setup/hold time requirement [10], [11], [12], [13]. Clock glitches and abnormal clock patterns can be induced through external clock pins, software/firmware control of the clock generator, physical interference, or hardware trojans. More advanced attacks have been demonstrated recently, in which dynamic voltage and frequency scaling of a CPU/GPU is exploited to disrupt the timing [14].

In practice, a wide variety of clock glitches can be induced. Without loss of generality, we categorize all possible clock glitches into four major classes comprising twelve types of waveforms (Fig. 2) to illustrate the coverage of our FIA sensor. While these elemental glitches may or may not form a valid FIA when used alone, real-world clock glitches are likely temporal-spatial combinations of them and will therefore be detected if our sensor covers all the basic elements.

II-A1 Pulse Addition

Additional pulses can be added to the clock signal. $\text{T}_{1}$ represents the scenario where an additional pulse is inserted into the negative phase of the clock, while $\text{T}_{2}$ shows an added pulse at the positive phase.

II-A2 Cycle Skipping

Clock pulses can also be removed. A positive (negative) clock phase is skipped in $\text{T}_{3}$ ( $\text{T}_{4}$ ).

II-A3 Duty Cycle Change

Attackers may choose to modify the duty cycle in one clock period as well. $\text{T}_{5}$ and $\text{T}_{7}$ extend the next and the previous positive phase, respectively. Similarly, $\text{T}_{6}$ and $\text{T}_{8}$ increase the negative phase.

II-A4 Phase Shift

The attacker may shift the clock phase starting from an arbitrary time point. The phase is shifted forward while the clock is low (high) in $\text{T}_{9}$ ( $\text{T}_{10}$ ). Likewise, the clock phase is shifted backward in $\text{T}_{11}$ and $\text{T}_{12}$ . Note that this class of clock glitches changes only one positive or negative phase, unlike the duty cycle change class, which varies both the positive and the negative phases.
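For simulation test benches, the elemental glitches can be generated from a segment-level clock description. The sketch below is a hypothetical helper set (`nominal_clock`, `add_pulse`, `skip_phase`, and `high_widths` are our names), and since Fig. 2 is not reproduced here the exact timing of each waveform type is only approximated. It shows that a $\text{T}_{1}$-style added pulse produces an abnormally short positive phase, while a $\text{T}_{3}$-style skipped positive phase merges three phases into one long low phase.

```python
"""Segment-based clock waveforms for glitch test benches (illustrative)."""
from typing import List, Tuple

Segment = Tuple[int, float]  # (logic level, duration in ns)


def nominal_clock(period: float, cycles: int, duty: float = 0.5) -> List[Segment]:
    high = period * duty
    segs: List[Segment] = []
    for _ in range(cycles):
        segs += [(1, high), (0, period - high)]
    return segs


def merge(segs: List[Segment]) -> List[Segment]:
    # Coalesce adjacent segments that share the same level.
    out: List[Segment] = []
    for level, dur in segs:
        if out and out[-1][0] == level:
            out[-1] = (level, out[-1][1] + dur)
        else:
            out.append((level, dur))
    return out


def add_pulse(segs: List[Segment], index: int, offset: float, width: float) -> List[Segment]:
    # Pulse addition (T1/T2 style): insert an opposite-level pulse of `width`
    # starting `offset` into segment `index`.
    level, dur = segs[index]
    if not (0 < offset and offset + width < dur):
        raise ValueError("pulse must fit strictly inside the segment")
    new = [(level, offset), (1 - level, width), (level, dur - offset - width)]
    return segs[:index] + new + segs[index + 1:]


def skip_phase(segs: List[Segment], index: int) -> List[Segment]:
    # Cycle skipping (T3/T4 style): hold the opposite level for segment `index`.
    level, dur = segs[index]
    return merge(segs[:index] + [(1 - level, dur)] + segs[index + 1:])


def high_widths(segs: List[Segment]) -> List[float]:
    return [dur for level, dur in segs if level == 1]


if __name__ == "__main__":
    clk = nominal_clock(10.0, 3)
    print(high_widths(add_pulse(clk, 1, 2.0, 1.0)))  # [5.0, 1.0, 5.0, 5.0]
    print(skip_phase(clk, 2))  # [(1, 5.0), (0, 15.0), (1, 5.0), (0, 5.0)]
```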

II-B Delay Manipulation of the Logic Paths

The other entry point for timing FIAs is the data input to the register. If the clock edge remains unchanged but the data arrives too late, a setup time violation occurs. Similarly, a hold time violation can be triggered when the data changes too soon after the clock edge. As shown in Fig. 3, an attacker can manipulate the delay of logic gates, and thus the data arrival time, by altering the chip’s operating conditions.

II-B1 Voltage

The delay of logic gates is highly dependent on the supply voltage: as the supply voltage decreases, the delay increases. [2] applies this technique to bypass a read length assertion in the PlayStation Vita and exploit a buffer overflow vulnerability. The voltage glitching attack in that work is performed with ChipWhisperer [15], an open-source toolchain. That such a readily available and affordable toolchain suffices underscores the low barrier to mounting timing FIAs.
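To make the voltage dependence concrete, the sketch below uses the alpha-power-law delay model of Sakurai and Newton, which we swap in purely for illustration; it is not a model used in this work. The threshold voltage of 0.4 V and the velocity-saturation index $\alpha = 1.3$ are assumed values, not 65nm process data.

```python
"""Relative gate delay versus supply voltage (alpha-power-law sketch)."""


def relative_delay(vdd, vdd_nom=1.2, vth=0.4, alpha=1.3):
    """Delay at `vdd` divided by delay at `vdd_nom`.

    Uses t_d ~ C_L * V_DD / (V_DD - V_th)**alpha; C_L cancels in the ratio.
    vth and alpha are assumed values, not characterized process data.
    """
    if vdd <= vth or vdd_nom <= vth:
        raise ValueError("model only applies above threshold")

    def delay(v):
        return v / (v - vth) ** alpha

    return delay(vdd) / delay(vdd_nom)


if __name__ == "__main__":
    for v in (1.2, 1.1, 1.0, 0.9):
        print(f"VDD = {v:.1f} V -> delay x{relative_delay(v):.2f}")
```

With these assumed parameters, a droop from 1.2 V to 1.0 V slows every gate by roughly 21%, which is enough to break any path with less setup slack than that.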

II-B2 Electromagnetic Coupling

Instead of controlling the supply voltage directly, the attacker may remotely couple a high-power EM signal into the power rail. In [16], short electromagnetic pulses (EMPs) were shown to trigger both data-dependent and constant faults during Advanced Encryption Standard (AES) computations. EMP attacks affect the circuit’s delay much like voltage glitch attacks: the supply voltage drops suddenly for a short interval, temporarily increasing the delay. More subtle attacks can be achieved through electromagnetic interference (EMI) [17]. EMI enables the attacker to precisely disturb the security module by coupling noise into the circuit’s supply cable at a frequency that is attenuated less on its way to the security module. [18] injects EM interference with an amplitude as small as 90 mV into the device. Such an attack takes longer to take effect, but the voltage disturbance on the victim chip’s power supply is smaller, which makes EMI attacks more difficult to detect.

II-B3 Temperature

It is well known that the delay of CMOS gates is sensitive to temperature variations, which offers another entry point for FIAs. High temperatures can also corrupt data stored in memories. For example, [19] reports fault injection into RSA encryption by heating the hardware. The faulty RSA computation is then exploited to reveal the secret RSA primes, showcasing the damaging consequences of FIAs.

It is worth noting that EM and temperature attacks targeting data paths or clock trees can be highly localized. They need not affect the functionality of the rest of the chip. This necessitates distributed low-cost sensors to protect the full chip.

II-C Existing Timing Fault Mitigations

Generally, chip-level FIA mitigations follow three directions: logic checking, adaptive design, and anomaly detection.

II-C1 Logic Checking

Hardware-enforced assertions can be integrated into the chip to detect faults. [8] employs a combination of parity and algorithmic checks to ensure that the computations inside the AES block produce the expected results. This method is effective against any AES fault but is not directly applicable to other designs. In order to capture FIAs in other circuits, all the checkers must be redesigned, which takes considerable engineering effort. Pre-silicon logic analysis can also be performed to identify circuits that are susceptible to FIAs [20, 21]. These frameworks try to resolve FIAs at design time and can reduce manufacturing and testing costs. They still, unfortunately, require human experts to input security properties or critical registers that are highly design-specific. In addition, logic checking is limited to specialized accelerators, since the flexibility of the logic executed on general-purpose processors makes it extremely challenging to predict and verify the circuit’s outputs with low overheads.

II-C2 Adaptive Design

Timing-adaptive design is a well-studied topic that aims to adapt the voltage and/or clock of a digital system to the specific process, voltage, and temperature (PVT) conditions of a chip. Adaptive designs are generally realized with either PVT sensors or in-situ error detectors. Their principles are therefore highly relevant to timing FIA detection and can potentially be reused here. We discuss two representative techniques and their pros and cons as FIA monitors.

First, Razor is an adaptive technique based on error detection that looks for setup time violations [22], [23], [24], [25]. Tiny transition detection circuits, with as few as three extra transistors [25], are embedded in critical registers/latches to detect data changes within a speculation window right after the clock edge. In principle, Razor should therefore detect timing FIAs that induce setup time violations. However, Razor techniques always face the challenge of deciding whether a data transition after the clock edge is a delayed arrival from the previous cycle or a fast transition in the current cycle. Razor avoids this ambiguity by enforcing a short-path constraint: all paths to a Razor register must be buffered to meet a minimum contamination delay. For PVT adaptation, this minimum delay can be relatively small, and only a small fraction of registers on critical paths need to be equipped with Razor. If Razor were employed to detect intentional clock glitches, however, the detection window would become so large that the overhead of the buffers would be significant. Additionally, some clock glitches, such as cycle skipping and phase shifts, cannot be detected by Razor.

Second, the Tunable Replica Circuit (TRC) is a well-known canary-type delay replica based on a programmable delay line calibrated to the delay of the critical path [26, 27]. As timing FIAs often alter gate delays, the TRC can identify malicious changes in the clock signal or operating conditions by detecting delay changes in the replica circuits. [28] discusses the feasibility of reusing TRCs to detect timing FIAs. However, most delay replica designs in prior works require post-silicon calibration. To serve as timing FIA monitors, they must be distributed across the chip in much higher quantities, leading to high power and area overheads as well as testing costs. Moreover, a time-to-digital converter with crafted pattern-matching logic is necessary to cover all types of glitches, further increasing the overhead.

In summary, adaptive design techniques were developed for goals that are highly relevant to, but different from, those of timing FIA monitors. They inspire the design of FIA countermeasures but are neither optimal nor comprehensive by themselves.
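The Razor trade-off can be illustrated with an idealized behavioral model, which is our own simplification and not a circuit from [22]-[25]: the main flip-flop samples at the clock edge, a shadow latch samples at the end of a speculation window W, and a mismatch flags an error. The function names and picosecond values are hypothetical.

```python
"""Idealized Razor-style error detection (behavioral sketch, times in ps)."""


def razor_sample(arrival, window, old, new):
    """Return (main, shadow, error) for a capture edge at t = 0.

    `arrival` is when the input switches from `old` to `new`, relative to the
    edge; a negative value means the data settled before the edge.
    """
    main = new if arrival <= 0 else old          # flip-flop samples at the edge
    shadow = new if arrival <= window else old   # shadow latch samples at edge + window
    return main, shadow, main != shadow


def short_path_ok(t_min_path, window, t_hold):
    # Data launched by the upstream register at this same edge must not corrupt
    # the shadow latch before it samples, so the shortest path (including
    # clock-to-Q) needs at least window + hold delay.
    return t_min_path >= window + t_hold


if __name__ == "__main__":
    print(razor_sample(-10, 100, 0, 1))  # on time: (1, 1, False)
    print(razor_sample(50, 100, 0, 1))   # late, inside window: (0, 1, True)
    print(razor_sample(150, 100, 0, 1))  # too late for the window: (0, 0, False)
```

The third case is the limitation discussed above: a glitch that moves the data arrival beyond W goes unnoticed, and widening W raises the minimum path delay that every Razor register must satisfy.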

II-C3 Anomaly Detection

The third class of FIA protection detects physical anomalies in the chip. For example, clock glitches can be monitored directly with an on-chip oscilloscope that oversamples the system clock. In [29], a frequency-locked loop (FLL) is designed to lock to the clock’s frequency and monitor the clock waveform in the following cycles. While this method can recognize all types of clock glitches, the FLL’s large area and power consumption prevent the monitor from being distributed across the chip to cover localized FIAs. Additionally, this method requires a high oversampling ratio, which severely restricts the supported system clock frequency. It is also dedicated to the clock waveform and offers little protection for the data arrival time. As another example, [30, 31] developed an on-chip sensor to detect electromagnetic probes placed near the chip to launch EM attacks. The sensor employs an on-chip LC oscillator whose frequency shift indicates abnormal coupling to the on-chip inductor. These sensors excel at detecting one particular type of EM attack but cannot generalize to other types of timing FIAs.

III DLL-Based Synthesizable FIA Monitor

We present a design-agnostic, compact, and distributable FIA monitor based on anomaly detection. Despite its small footprint, it detects both clock glitching and delay manipulation attacks. The monitor is a fully synthesizable soft IP that can be easily integrated into any existing design. Its negligible design and testing costs, together with minimal power and area overhead, enable the monitor to be distributed across the chip, providing extensive coverage for localized attacks.

III-A Timing Anomaly Detection with Clock Replica

The key principle of our FIA monitor is to track the clock pulse width with a clock replica circuit. The clock replica circuit locks its pulse width to that of the normal system clock. Since any clock signal naturally exhibits jitter and variations from noise and uncertainties, we introduce a programmable acceptance window, as shown in Fig. 4. Under normal operation, the system clock falls within this acceptance window. If the clock’s pulse width is shorter (longer) than the programmed minimum (maximum) threshold, the replica circuit reports that an attack has been detected.

This clock pulse width monitoring scheme is effective regardless of whether the attacker hijacks the clock signal or alters the data path delay. If an FIA is mounted by altering the clock signal, the system clock’s pulse width will differ from the locked pulse width and the attack can be detected. If the attacker targets timing violations by glitching the voltage, temperature, and EM conditions, the distributed clock replica’s pulse width will be affected by the attack and deviate from the actual clock (Fig. 5), thus raising alerts. The only possible way to bypass the monitor is to change the delay of the gates and then match that delay with the pulse width of the clock accordingly. However, since the delay and the clock match in this case, there are no timing violations in the first place.

III-B Same-Cycle FIA Alert Generation

We employ digital delay lines to realize the clock replica and the acceptance window, as shown in Fig. 6. The clock is first passed into the main configurable delay line to produce a pulse $\text{P}_{\text{Min}}$ , which determines the minimum acceptable delay $\text{D}_{\text{Min}}$ . A second configurable delay line is used to further delay the rising edge of the clock, creating $\text{P}_{\text{L}}$ . $\text{D}_{\text{L}}$ , derived from $\text{P}_{\text{L}}$ , will be calibrated by an FSM to be the same as the clock’s pulse width. Similarly, the maximum acceptable clock uncertainty is $\text{D}_{\text{Max}}$ , which is set by $\text{P}_{\text{Max}}$ , the output of the last configurable delay line.

With the three pulses generated by the configurable delay lines, FIAs can be detected in the same cycle as the fault is injected (Fig. 7). The values of $\text{D}_{\text{Min}}$ , $\text{D}_{\text{L}}$ , and $\text{D}_{\text{Max}}$ are sampled at the falling edge of the clock signal to produce $\text{R}_{\text{Min}}$ , $\text{R}_{\text{L}}$ , and $\text{R}_{\text{Max}}$ . Under normal operations, the clock’s negative edge comes after $\text{D}_{\text{Min}}$ but before $\text{D}_{\text{Max}}$ . Therefore, $\text{R}_{\text{Min}}$ should be 0 while $\text{R}_{\text{Max}}$ needs to be 1. Any other combination of $\text{R}_{\text{Min}}$ and $\text{R}_{\text{Max}}$ values indicates that the clock pulse width diverges from the expected value and an alert is raised.
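The alert rule above fits in a few lines of Python. This is a behavioral sketch, not the monitor’s netlist; it assumes each delayed signal $\text{D}_{x}$ stays high from the clock’s rising edge until its programmed delay has elapsed, so sampling it at the falling edge yields 1 only if the clock pulse ended first. The function names and nanosecond values are hypothetical.

```python
"""Same-cycle FIA alert from the three sampled replica signals (sketch)."""


def sample_at_falling_edge(delay, pulse_width):
    # Assumption: D_x is high from the rising edge until `delay` has elapsed,
    # so it still reads 1 at the falling edge only if the pulse is shorter.
    return 1 if pulse_width < delay else 0


def monitor_cycle(pulse_width, d_min, d_l, d_max):
    """Return (R_Min, R_L, R_Max, alert) for one positive clock pulse."""
    r_min = sample_at_falling_edge(d_min, pulse_width)
    r_l = sample_at_falling_edge(d_l, pulse_width)  # used by the locking FSM
    r_max = sample_at_falling_edge(d_max, pulse_width)
    # Normal operation requires R_Min == 0 and R_Max == 1.
    alert = not (r_min == 0 and r_max == 1)
    return r_min, r_l, r_max, alert


if __name__ == "__main__":
    window = dict(d_min=4.8, d_l=4.95, d_max=5.2)  # ns, hypothetical
    for pw in (5.0, 3.0, 6.0):  # nominal, shortened, stretched pulse
        print(pw, monitor_cycle(pw, **window))
```

Because the decision needs only the falling edge of the same pulse, the alert is available within the cycle in which the glitch occurs.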

A single FIA monitor detects nine of the twelve types of clock glitches because it locks to and monitors only the positive pulse width of the clock. The remaining three glitch types affect the negative pulse width of the clock while leaving the positive pulse width intact, as illustrated in Fig. 8b. To capture these glitches, another copy of the monitor can be inserted to track the negative phase of the clock. With this dual-monitor setup, all types of glitches can be captured. The placement of these monitors is flexible; they are not required to be deployed in pairs. Each monitor works independently to detect most types of clock glitches as well as voltage, EM, and temperature glitches. Placing positive-phase and negative-phase monitors alternately across the chip in a checkerboard pattern is one possible scheme for achieving comprehensive coverage with low overhead.
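A segment-level model shows why the negative-phase copy is needed. In the sketch below (hypothetical helper names and nanosecond windows), a waveform that shortens only one low phase passes the positive-phase check but is caught by the negative-phase check. We do not claim that this waveform matches any specific type in Fig. 8b.

```python
"""Positive- and negative-phase monitors over a segment-level waveform (sketch)."""
from typing import Dict, List, Tuple

Segment = Tuple[int, float]  # (logic level, duration in ns)


def phase_widths(segs: List[Segment], level: int) -> List[float]:
    return [dur for lvl, dur in segs if lvl == level]


def out_of_window(widths: List[float], lo: float, hi: float) -> List[int]:
    # Indices of phases whose width falls outside the acceptance window.
    return [i for i, w in enumerate(widths) if not lo <= w <= hi]


def dual_monitor(segs: List[Segment], pos_window: Tuple[float, float],
                 neg_window: Tuple[float, float]) -> Dict[str, List[int]]:
    return {
        "positive": out_of_window(phase_widths(segs, 1), *pos_window),
        "negative": out_of_window(phase_widths(segs, 0), *neg_window),
    }


if __name__ == "__main__":
    # Hypothetical glitch: the second low phase is cut to 2 ns, highs untouched.
    clk = [(1, 5.0), (0, 5.0), (1, 5.0), (0, 2.0), (1, 5.0), (0, 5.0)]
    print(dual_monitor(clk, (4.8, 5.2), (4.8, 5.2)))
    # -> {'positive': [], 'negative': [1]}
```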

III-C Digitally Controlled Delay Line

To ensure that the FIA monitor is synthesizable, the delay lines are constructed using only standard cells. This standard-cell-only design enables automatic place and route with software and guarantees that the monitor is technology-agnostic.

The main configurable delay line is broken down into three stages in order to provide a wide delay tuning range while maintaining a small footprint (Fig. 9). The coarse stage counts the cycles of a ring oscillator (RO). Once the pre-defined cycle number ( $\text{Conf}_{\text{c}}$ ) has been reached, the coarse stage outputs 1, and the medium stage takes over. The medium stage is a path-selection delay line with thermometer-coded control bits [32]. When the control bits are all one, the input needs to propagate through all the stages to reach the output port. If the control code is all zero, on the other hand, the input travels through the first stage and goes directly to the output. Note that the two delay lines to tune the acceptable window also have the same structure. The fine stage is built from standard-cell-based varactors with sub-gate-delay resolution. Such varactors work on the principle that the capacitance at a logic gate’s port (NAND, NOR, etc.) depends on the voltage at other ports. By holding other ports at 0 or 1, the delay, which is affected by the loading capacitance, can be tuned at a high precision.

Here, both the coarse stage and the fine stage can be bypassed for two purposes: to increase the maximum locking frequency and to reduce power consumption. The coarse stage and the fine stage have a relatively large delay offset, i.e., the inherent delay when the control code is 0. When all control bits are 0, skipping these two stages can further reduce the delay and increase the upper bound of the frequency range at the cost of resolution loss. As for the power consumption, the coarse stage (RO plus the counter) dominates the total power usage. Bypassing the coarse stage prevents the RO from oscillating and, therefore, reduces the power consumption.
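A first-order delay model makes the stage structure and the effect of bypassing explicit. The sketch below is an idealization under our own assumptions: every cell adds the same delay step, and all parameter values and names (`DelayLine`, `thermometer`) are hypothetical. Real post-layout steps are mismatched, as discussed later in this section.

```python
"""First-order delay model of the three-stage configurable delay line (sketch)."""
from dataclasses import dataclass


def thermometer(ones: int, width: int) -> str:
    # Thermometer code with `ones` trailing 1s, e.g. thermometer(3, 5) == "00111".
    return "0" * (width - ones) + "1" * ones


@dataclass
class DelayLine:
    t_ro: float           # RO period, i.e. one coarse step (ps)
    coarse_offset: float  # inherent coarse-stage delay at code 0
    med_step: float       # delay added per enabled medium cell
    med_offset: float     # delay through the first medium cell only
    fine_step: float      # delay added per enabled varactor
    fine_offset: float    # inherent fine-stage delay at code 0
    n_med: int
    n_fine: int

    def delay(self, conf_c: int, med_ones: int, fine_ones: int,
              bypass_coarse: bool = False, bypass_fine: bool = False) -> float:
        if not (0 <= med_ones <= self.n_med and 0 <= fine_ones <= self.n_fine):
            raise ValueError("code out of range")
        d = self.med_offset + med_ones * self.med_step  # medium stage always used
        if not bypass_coarse:
            d += self.coarse_offset + conf_c * self.t_ro
        elif conf_c:
            raise ValueError("coarse code must be 0 when the stage is bypassed")
        if not bypass_fine:
            d += self.fine_offset + fine_ones * self.fine_step
        elif fine_ones:
            raise ValueError("fine code must be 0 when the stage is bypassed")
        return d


if __name__ == "__main__":
    line = DelayLine(400, 150, 40, 30, 3, 20, n_med=12, n_fine=16)
    print(line.delay(0, 0, 0))                                        # 200
    print(line.delay(0, 0, 0, bypass_coarse=True, bypass_fine=True))  # 30
```

Bypassing both stages removes their offsets from the minimum delay, which raises the highest clock frequency the line can lock to at the cost of losing the coarse range and fine resolution.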

III-D Automatic Pulse Width Locking

Real-time delay-locking is implemented on-chip as a Finite State Machine (FSM) to eliminate off-chip calibration. The FSM automatically configures the delay line to lock to the pulse width of the clock. As shown in Fig. 10, it determines the configurations for the three delay stages step by step. The FSM starts with all configurations reset to zero and then sequentially calibrates coarse, medium, and fine stages. The coarse stage tuning can be completed in one cycle: the FSM times the system clock’s pulse width with the ring oscillator inside the coarse stage. The configuration for the coarse stage is exactly the ring oscillator cycles it takes for the system clock to rise and fall back to zero again.
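The one-cycle coarse calibration amounts to counting RO periods during the clock’s high phase. The sketch below assumes the RO starts on the rising edge and that only completed periods are counted, so the leftover time is handed to the medium and fine stages; `coarse_lock` is a hypothetical name.

```python
"""One-cycle coarse calibration: count RO periods inside the clock pulse (sketch)."""


def coarse_lock(pulse_width, t_ro):
    """Return (conf_c, residue) for a high pulse of `pulse_width`.

    Assumes the RO starts on the rising edge and the counter increments once per
    completed RO period until the falling edge; the residue is left for the
    medium and fine stages.
    """
    if t_ro <= 0:
        raise ValueError("RO period must be positive")
    conf_c = 0
    while (conf_c + 1) * t_ro <= pulse_width:
        conf_c += 1
    return conf_c, pulse_width - conf_c * t_ro


if __name__ == "__main__":
    # 2 MHz clock at 50% duty cycle (250 ns high) with an assumed 400 ps RO.
    print(coarse_lock(250_000, 400))  # (625, 0)
```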

To tune the medium or the fine stage, the FSM linearly searches for the delay configuration. The FSM increments the delay control bits by one until $R_{L}$ becomes one, which indicates that the pulse width of the replica is longer than that of the system clock. Then, the FSM will decrease the control bits to reduce the delay. At this point, the locked delay is the maximum delay that is less than the clock’s pulse width. Therefore, the linear search is completed if the pattern “010” is found on $R_{L}$ . On the other hand, if the minimum configuration (i.e., all zeros) is already too large, the FSM will sequentially try bypassing the coarse and fine stages to reduce the delay offset of the configurable delay line. If the FSM still cannot lock to the clock, it terminates and reports an error.
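The linear search and its “010” stopping rule can be sketched behaviorally as follows. The function name, the callable delay model, and the choice to return `None` when no code fits are our assumptions; the on-chip FSM would instead try bypassing the coarse and fine stages or report an error.

```python
"""Linear-search locking for the medium or fine stage (behavioral sketch)."""
from typing import Callable, List, Optional


def linear_lock(replica_delay: Callable[[int], float], pulse_width: float,
                max_code: int) -> Optional[int]:
    """Return the largest code whose replica is still shorter than the pulse.

    R_L = 1 means the replica is longer than the clock pulse. The search stops
    when R_L shows the pattern 0, 1, 0.
    """
    history: List[int] = []
    code = 0
    while True:
        r_l = 1 if replica_delay(code) > pulse_width else 0
        history.append(r_l)
        if history[-3:] == [0, 1, 0]:
            return code
        if r_l == 0:
            if code == max_code:
                return None  # longest setting still too short (assumption)
            code += 1
        else:
            if code == 0:
                return None  # minimum configuration is already too long
            code -= 1


if __name__ == "__main__":
    # Hypothetical stage: 10 ps per step plus a 5 ps offset, 42 ps residue.
    print(linear_lock(lambda c: 10 * c + 5, 42, 15))  # 3
```

Run once per stage (medium first, then fine) on the residue left by the coarse stage, this roughly mirrors the sequence in Fig. 10.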

III-E Real-Time Delay Tuning under Frequency Drift

The initial locking process above is performed once right after the system powers up or resets. The monitor starts FIA detection after this initial phase. The FSM then switches into the Full Range Linear Tracking mode. In this mode, the FSM dynamically matches the delay configurations to the clock signal using the finest step. Temporal Majority Voting is used to low-pass filter $R_{L}$ for better stability. The purpose of this mode is to tolerate and track the slow drifting of the clock due to clock source drifts, environmental changes, or device aging. Otherwise, the clock signal may drift out of the acceptance window without any FIAs, which will trigger false alerts.
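The exact update rule and voting length of the tracking mode are not specified here, so the sketch below assumes a simple bang-bang rule: after every `window` cycles, a majority vote of $R_{L}$ moves the code down one finest step if the replica is too long and up one step otherwise. `track`, `window`, and `flips` are hypothetical names; `flips` inverts selected samples to mimic jitter.

```python
"""Full-range linear tracking with temporal majority voting (behavioral sketch)."""
from typing import Callable, Iterable, List, Optional, Set


def majority(bits: List[int]) -> int:
    # Odd-length vote: 1 only if more than half of the samples are 1.
    if len(bits) % 2 == 0:
        raise ValueError("use an odd number of samples")
    return 1 if sum(bits) > len(bits) // 2 else 0


def track(code: int, pulse_widths: Iterable[float], delay_of: Callable[[int], float],
          window: int = 3, flips: Optional[Set[int]] = None) -> List[int]:
    """Return the code after every clock cycle.

    R_L = 1 (replica longer than the pulse) lowers the code and R_L = 0 raises
    it, so the locked code dithers by one step around the pulse width.
    """
    flips = flips or set()
    buf: List[int] = []
    trace: List[int] = []
    for n, pw in enumerate(pulse_widths):
        r_l = 1 if delay_of(code) > pw else 0
        if n in flips:
            r_l ^= 1  # injected noise on this sample
        buf.append(r_l)
        if len(buf) == window:
            code += -1 if majority(buf) else 1
            buf.clear()
        trace.append(code)
    return trace


if __name__ == "__main__":
    pws = [100] * 12 + [130] * 30  # slow drift from 100 ps to 130 ps
    print(track(9, pws, lambda c: 10 * c)[-6:])  # settles around codes 13/14
```

An isolated flipped sample is outvoted, so it does not move the code, while a sustained drift is followed one step at a time.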

Clock drift is not the only source of potential false alerts. Because the delay line is fully synthesized, its coarse, medium, and fine stages do not match perfectly, so the delay is not guaranteed to be monotonic in the digital calibration bits. This is an intentional design choice, because automatic placement and routing inevitably introduces delay mismatches unless significant effort is spent on matching the layout. In fact, the tuning range of each finer stage is designed to be greater than one step of the next coarser stage. As shown in Fig. 11a, when the finer stage reaches its maximum value and wraps to zero, the coarser stage increases by one. Because of the mismatch, the delay decreases even though the overall configuration code increases. This delay drop could push the clock edge out of the tolerance window and trigger a false alert. To overcome this issue, programmable configuration skips are introduced to ensure that the delay increases monotonically with the configuration code. Fig. 11b shows an example in which one medium step and nine fine steps are skipped after each coarse step increase, while six fine steps are skipped after each medium step increase. Specifically, when the coarse configuration increments by one, the medium and fine bits start from one and nine, respectively, instead of zero. This scheme minimizes false alerts even in the presence of mismatches between the delay tuning ranges.
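The skip scheme can be checked by enumerating the configuration codes in sweep order and evaluating them with a delay model. The skip defaults below follow the Fig. 11b example; the delay model (400 ps coarse, 40 ps medium, and 5 ps fine steps with overlapping ranges) and the function names are hypothetical.

```python
"""Configuration skips that keep the swept delay monotonic (sketch)."""
from typing import List, Tuple

Code = Tuple[int, int, int]  # (coarse, medium, fine)


def config_sequence(n_coarse: int, n_med: int, n_fine: int,
                    med_skip_c: int = 1, fine_skip_c: int = 9,
                    fine_skip_m: int = 6) -> List[Code]:
    """Enumerate codes in sweep order, skipping low settings after a carry."""
    seq: List[Code] = []
    for c in range(n_coarse + 1):
        m_first = med_skip_c if c > 0 else 0
        for m in range(m_first, n_med + 1):
            if c > 0 and m == m_first:
                f_first = fine_skip_c   # first setting after a coarse carry
            elif m > m_first:
                f_first = fine_skip_m   # first setting after a medium carry
            else:
                f_first = 0             # very first code (0, 0, 0)
            seq.extend((c, m, f) for f in range(f_first, n_fine + 1))
    return seq


def is_monotonic(delays: List[float]) -> bool:
    return all(b > a for a, b in zip(delays, delays[1:]))


if __name__ == "__main__":
    # Hypothetical post-layout steps (ps): each finer range overlaps the next
    # coarser step, as in the text.
    def delay(c, m, f):
        return 400 * c + 40 * m + 5 * f

    skipped = [delay(*k) for k in config_sequence(2, 10, 12)]
    naive = [delay(*k) for k in config_sequence(2, 10, 12, 0, 0, 0)]
    print(is_monotonic(skipped), is_monotonic(naive))  # True False
```

In the naive sweep, rolling from (0, 0, 12) to (0, 1, 0) drops the delay from 60 ps to 40 ps; with the skips, every carry lands on a longer delay.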

IV Automatic Generation of the FIA Monitor

To facilitate the agile implementation of our FIA monitor in different systems, we develop an end-to-end design framework that fully automates the design and implementation process (Fig. 12). Since the overall monitor design is fully synthesizable without any manual constraints, only the configurable delay line needs to be redesigned for different systems and process nodes to achieve optimal performance. The framework takes as inputs a list of delay cells to use, the desired locking frequency range, and the target resolution. It also needs access to the standard cell library and to the standard EDA tools of the digital design flow. It first generates the configurable delay line and then combines it with the tuning FSM compiled from Verilog code. The framework searches for the circuit with the minimum area that satisfies the frequency range and resolution requirements. We build the framework in Python, which interfaces with the EDA tools through Tcl scripts.

IV-A Configurable Delay Line Optimization

The workflow for designing the delay line is summarized in Fig. 13. The framework needs to decide which delay cells to use for the medium and the fine stages, as well as how many of these cells to use. The coarse stage is fixed to the RO-counting topology because this topology’s area scales sublinearly with the delay tuning range. The delay line generator iterates through all pairs of medium and fine cells in the user-specified delay cell list. For each pair of cells, Python generates an evaluation netlist and invokes a SPICE simulator to determine each cell’s delay with its control bit at 0 and at 1. The resolution of a cell is calculated as the difference between these two delays. The cell with the greater delay difference is assigned to the medium stage, and the other to the fine stage. Since the delay tuning range of the finer stage must be greater than one delay tuning step of the coarser stage, the number of cells in each stage follows directly: the ratio between the two stages’ resolutions, rounded up, gives the number of cells in the finer stage. The user may specify an additional margin to ensure that the finer stage’s range is sufficient under variations. If a $50\%$ margin is used, the number of cells in the finer stage becomes 1.5 times the resolution of the coarser stage divided by the finer stage’s resolution, rounded up to the nearest integer.
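The stage assignment and cell-count rule reduce to a few lines. In the sketch below, the simulated delays are assumed to come back from SPICE as (bit = 0, bit = 1) pairs; the cell names `DLY_A` and `NAND_CAP` and all numbers are hypothetical.

```python
"""Stage assignment and cell count for the configurable delay line (sketch)."""
import math
from typing import Dict, Tuple


def resolution(delays: Tuple[float, float]) -> float:
    # delays = (delay with control bit 0, delay with control bit 1)
    off, on = delays
    return abs(on - off)


def assign_stages(cell_a: str, cell_b: str,
                  sims: Dict[str, Tuple[float, float]]) -> Tuple[str, str]:
    """Return (medium_cell, fine_cell): the larger delay step becomes medium."""
    if resolution(sims[cell_a]) >= resolution(sims[cell_b]):
        return cell_a, cell_b
    return cell_b, cell_a


def finer_cell_count(coarser_step: float, finer_step: float, margin: float = 0.0) -> int:
    # Enough finer cells to span one coarser step, widened by the user margin.
    return math.ceil((1.0 + margin) * coarser_step / finer_step)


if __name__ == "__main__":
    sims = {"DLY_A": (50.0, 90.0), "NAND_CAP": (20.0, 23.0)}  # ps, hypothetical
    print(assign_stages("NAND_CAP", "DLY_A", sims))  # ('DLY_A', 'NAND_CAP')
    print(finer_cell_count(400, 40, margin=0.5))     # 15
```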

While the coarse stage’s topology is fixed, the framework still has to optimize the number of stages in RO. A shorter RO leads to a smaller coarse step, which reduces the number of medium cells. However, since the delay line needs to cover the user-specified delay tuning range, a smaller coarse step means that a larger counter must be used. To optimize under this tradeoff, the framework first simulates the period of the RO with the smallest number of stages. This period is also the smallest coarse step, which determines the maximum number of counter bits. Then, the framework increases the number of stages in the RO by two at a time, calculates the number of cells in each stage, and evaluates the total area of the delay line. As shown in Fig. 14, the counter’s footprint dominates the total area in the beginning. As the number of stages in the RO increases, the counter becomes smaller, and the medium stage eventually dominates the total area.
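The RO-length trade-off can be reproduced with a simple area model. The sketch below makes several assumptions of our own: the RO period is $2 n t_{stage}$ for $n$ stages, the counter needs enough bits to count to the longest required delay, the shortest RO has three stages, and each counter bit, RO stage, and medium cell has a fixed area in arbitrary units. The 250 ns maximum delay corresponds to the high phase of a 2 MHz clock at 50% duty cycle.

```python
"""RO length sweep trading counter bits against medium cells (sketch)."""
import math
from typing import List, Tuple


def delay_line_area(n_stages: int, t_stage: float, max_delay: float, med_step: float,
                    margin: float, a_bit: float, a_ro_stage: float, a_med: float) -> float:
    t_ro = 2 * n_stages * t_stage                  # coarse step = RO period
    max_count = math.ceil(max_delay / t_ro)        # largest coarse code needed
    counter_bits = max_count.bit_length()
    n_med = math.ceil((1.0 + margin) * t_ro / med_step)
    return counter_bits * a_bit + n_stages * a_ro_stage + n_med * a_med


def sweep(n_max: int, **params) -> Tuple[int, float, List[Tuple[int, float]]]:
    # Start from the shortest RO (assumed 3 stages) and add two stages at a time.
    table = [(n, delay_line_area(n, **params)) for n in range(3, n_max + 1, 2)]
    best_n, best_area = min(table, key=lambda row: row[1])
    return best_n, best_area, table


if __name__ == "__main__":
    params = dict(t_stage=20, max_delay=250_000, med_step=40, margin=0.5,
                  a_bit=30, a_ro_stage=2, a_med=3)  # all hypothetical
    best_n, best_area, table = sweep(31, **params)
    print(best_n, best_area)  # 7 347
```

With these numbers the counter dominates for short ROs, and the medium stage grows as the RO lengthens, so the minimum lies at an intermediate length.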

Once the RO design is determined, the framework performs post-layout simulations to evaluate the design built from the two selected delay cells. It runs standard EDA tools with Tcl scripts to place and route the delay line and obtain the post-layout netlist. Python then runs SPICE simulations on this netlist to extract the delay tuning range and the resolution of each stage. The framework first verifies that the tuning range of each finer stage indeed covers one tuning step of the coarser stage, and then checks whether the resolution and the tuning range meet the user-specified requirements. If all constraints are met, the framework records the area, and optionally the power, of the current design. Otherwise, it attempts to fine-tune the design and discards the circuit if fine-tuning fails. In either case, the framework then moves on to the next pair of delay cells.
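The post-layout acceptance checks can be sketched as a single predicate. This is a minimal sketch, assuming the extracted figures are available per stage (ordered coarse to fine) and that the user's frequency requirement has already been converted to a required delay span. `StageResult` and `meets_spec` are hypothetical names, and the numbers in the usage note are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageResult:
    """Post-layout figures extracted for one tuning stage (values in ps)."""
    step_ps: float    # resolution: delay added by one tuning step
    range_ps: float   # total delay tuning range of the stage


def meets_spec(stages: List[StageResult], min_delay_ps: float, max_delay_ps: float,
               req_min_ps: float, req_max_ps: float, req_resolution_ps: float) -> bool:
    """Check a delay line, stages ordered coarse -> fine, against the spec."""
    # 1) Each finer stage must span at least one step of the next-coarser stage.
    for coarser, finer in zip(stages, stages[1:]):
        if finer.range_ps < coarser.step_ps:
            return False
    # 2) The finest step must meet the resolution requirement.
    if stages[-1].step_ps > req_resolution_ps:
        return False
    # 3) The achievable delay span must cover the required span.
    return min_delay_ps <= req_min_ps and max_delay_ps >= req_max_ps
```

For instance, a fine stage spanning only 49ps beneath a 60ps medium step leaves gaps in the delay line and fails the first check, which in the framework would trigger a fine-tuning attempt.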

After exhaustively searching for all designs that meet the specifications, the framework selects the one with the smallest area. The framework also supports optimization for power consumption when the monitor's power is a concern; in that case, the user must specify the clock frequency at which the monitor's power is evaluated. Area and power can be combined with programmable weights into a single cost function for optimization.
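A weighted area/power selection step might look like the following. The paper does not specify how the two metrics are scaled, so this sketch assumes each is normalized by its largest value among the candidates to make the weights unitless. `Candidate`, `select_design`, and the example figures are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    area_um2: float
    power_uw: float   # evaluated at the user-specified clock frequency


def select_design(cands: List[Candidate], w_area: float = 1.0,
                  w_power: float = 0.0) -> Candidate:
    """Pick the candidate with the lowest weighted, normalized cost."""
    max_area = max(c.area_um2 for c in cands)
    max_power = max(c.power_uw for c in cands)

    def cost(c: Candidate) -> float:
        # Normalize each metric by its largest value so the weights are unitless.
        return w_area * c.area_um2 / max_area + w_power * c.power_uw / max_power

    return min(cands, key=cost)
```

With the default weights (1, 0) this reduces to the area-only selection described above, while setting `w_power` above zero trades footprint for power.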

IV-B FSM Integration

After obtaining the optimized configurable delay line, the framework attaches the auto-calibration FSM to it to complete the monitor design. Because the number of control bits in each stage is parameterized, the framework can reuse the FSM for any delay line design without modifying its code. From this point, the framework follows the standard backend flow to generate the monitor layout, and the user can instantiate the monitor wherever FIA protection is desired.

V Measurement Results

V-A Automated Monitor Design and Silicon Prototype

TABLE I: Monitors Designed by the Automation Framework

User Inputs			Generated Structure			Post-Layout Evaluation
Technology	Frequency Range	Resolution Requirement	RO Stages	Medium Stage Bits	Fine Stage	Frequency Range w/o Resolution Loss	Max. Frequency w/ Fine-Stage Bypassing	Resolution	Area ($\mu m^{2}$)
65nm	2MHz - 600MHz	10ps	13	8b	4b NAND2	1.84MHz - 698MHz	1.13GHz	6.84ps	1500
65nm	2MHz - 700MHz	20ps	13	8b	3b NOR2	1.89MHz - 768MHz	1.18GHz	17.1ps	1464
65nm	500kHz - 10MHz	100ps	17	11b	N/A	348kHz - 1.09GHz	1.09GHz	69.9ps	1596
65nm	500MHz - 600MHz	10ps	0	9b	4b NAND2	369MHz - 704MHz	1.16GHz	7.44ps	1188
28nm	2MHz - 2GHz	2ps	21	12b	4b NAND2	1.86MHz - 2.07GHz	2.66GHz	1.92ps	459
28nm	1.5GHz - 2.5GHz	5ps	0	16b	2b NOR3	1.43GHz - 2.66GHz	3.07GHz	4.92ps	374

We evaluated the automated framework with four specifications in 65nm CMOS, shown in the first four rows of Table I. To demonstrate technology scalability, we also generated two designs with a 28nm CMOS PDK, which are discussed in Section VI. The first row is a scenario demanding both a wide frequency range and a high resolution. For comparison, the second row shows that relaxing the resolution requirement allows higher frequencies to be reached without bypassing the fine stage, with a slightly smaller footprint than the first design. The third and fourth rows demonstrate low- and high-frequency use cases, respectively: the low-frequency design does not require the fine stage, while the counter-based coarse stage is omitted when only high frequencies are targeted. In all scenarios, the path-selection-based delay line is chosen as the medium stage because of its small delay offset and footprint.

The design in the first row of Table I is fabricated and verified in 65nm LP CMOS. It is automatically placed and routed by industry-standard tools without any manual adjustment or optimization. Each chip contains ten monitors, and each monitor occupies $30\mu m\times 50\mu m$ (Fig. 15), or equivalently $355,000F^{2}$ with a feature size (F) of 65nm.
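The F²-normalized footprint, which Table III uses to compare designs across process nodes, is simply the layout area divided by the square of the feature size. A small helper (the name `area_in_f2` is illustrative) reproduces the figure above:

```python
def area_in_f2(area_um2: float, feature_nm: float) -> float:
    """Express a layout area in units of F^2, where F is the feature size."""
    f_um = feature_nm / 1000.0          # convert nm to um
    return area_um2 / (f_um * f_um)


monitor_f2 = area_in_f2(30.0 * 50.0, 65.0)
# about 355,000 F^2, i.e. roughly 0.355 MF^2 as listed in Table III
```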

V-B Evaluations of the Synthesized Delay Line

Fig. 16a shows the locking process of the configurable delay line. In this example, the clock frequency is $250MHz$ , and locking takes 15 cycles, or $60ns$ . Depending on the clock's pulse width, the monitor takes from 7 to 26 cycles to lock: 1 cycle for the coarse stage, 3 to 9 cycles for the medium stage, and 3 to 17 cycles for the fine stage. The lower bounds for the medium and fine stages are set by the length of the “010” pattern, and the upper bounds by the number of possible configurations in each stage.

Fig. 16b summarizes the range and resolution of the three delay tuning stages. The tuning step of each stage is extrapolated from its maximum and minimum delays, assuming that the delay increases linearly with the configuration. Although the fine stage has the finest resolution, it also has the largest delay offset, which raises the minimum achievable delay; an option to bypass the fine stage is therefore implemented. With this option, the total tuning range of the configurable delay line is $400ps$ to $257ns$ at $1.2V$ , or equivalently $2MHz$ to $1.26GHz$ . The lower frequency limit can be reduced further by adding configuration bits to the coarse stage. Because the coarse stage counts ring oscillator cycles, each extra counter bit doubles the maximum coarse delay at almost no area overhead.
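The two conversions used in this paragraph can be written out explicitly. This sketch assumes a 50% clock duty cycle, so that the delay line replicates half the clock period; under that assumption the 400ps and 257ns delays map to roughly 1.25GHz and 1.95MHz, which agrees with the quoted range to within the rounding of those delays. The helper names are illustrative.

```python
def tuning_step_ps(d_min_ps: float, d_max_ps: float, n_settings: int) -> float:
    """Average step, assuming delay grows linearly across n_settings configurations."""
    return (d_max_ps - d_min_ps) / (n_settings - 1)


def pulse_width_to_freq_hz(pulse_width_ps: float) -> float:
    """Clock frequency whose half period equals the given pulse width (50% duty)."""
    return 1.0 / (2.0 * pulse_width_ps * 1e-12)


f_max = pulse_width_to_freq_hz(400.0)      # ~1.25 GHz
f_min = pulse_width_to_freq_hz(257_000.0)  # ~1.95 MHz
```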

Fifty DUTs are tested in total, and the locking frequency results are shown in Fig. 17 and Fig. 18. The locking frequency range shows minimal variation across the fifty DUTs and is not significantly affected by temperature. It does change with the supply voltage, which is expected because the monitor is designed to track the logic gates' delay at different voltages.

V-C Clock Glitching Attacks

Fig. 19b lists the monitor's miss rates for each clock glitch type under different acceptance window configurations. Clock glitches of all 12 types are injected into the nominal $250MHz$ clock using the on-chip $2GHz$ arbitrary pattern generator, with the glitch width set to $500ps$ . Fig. 19a shows a conceptual diagram of the test setup. A pair of monitors is tested, since detecting every type of clock glitch requires two monitors, as explained in Section III. When the acceptance window is $400ps$ or less, all 12 types of clock glitches are accurately detected. At $500ps$ or larger, the monitor can no longer distinguish glitches of types $\text{T}_{\text{5-12}}$ from clock jitter. Types $\text{T}_{\text{1-4}}$ , on the other hand, are still detected. $\text{T}_{\text{3-4}}$ merges three clock pulses into a longer clock phase; these glitches are always identified because the acceptance window never exceeds one pulse width of the clock. $\text{T}_{\text{1-2}}$ splits a clock pulse into three segments, which is detected as long as any of the modified clock edges falls outside the acceptance window.
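The detection behavior reported above can be illustrated with a deliberately simplified behavioral model, not the actual circuit. It assumes that the acceptance window acts as a tolerance on each clock phase's width relative to the locked reference (half the nominal period), abstracts the monitor pair into a single check over all phases, and ignores jitter. This interpretation of the window is an assumption, chosen because it reproduces the reported 400ps/500ps boundary for 500ps glitches. `flag_phases` is a hypothetical name.

```python
from typing import List


def flag_phases(edges_ps: List[int], ref_width_ps: int, window_ps: int) -> List[int]:
    """Return indices of clock phases whose width deviates from the locked
    reference pulse width by more than the acceptance window."""
    widths = [b - a for a, b in zip(edges_ps, edges_ps[1:])]
    return [i for i, w in enumerate(widths) if abs(w - ref_width_ps) > window_ps]


# 250 MHz clock: each phase nominally lasts 2000 ps.
shifted = [0, 2000, 3500, 6000, 8000]    # one edge moved 500 ps early
pulse = [0, 950, 1050, 2000, 4000]       # 100 ps pulse injected mid-phase
```

In this model, the 500ps edge shift is flagged with a 400ps window but missed with a 500ps window, whereas the 100ps injected pulse splits a phase into badly undersized segments and is flagged even when the window is wider than the pulse, consistent with the pulse-injection results discussed next.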

To further demonstrate the monitor's ability to detect pulse injection attacks on the clock, a fast pulse addition circuit is implemented on chip. Fig. 20a depicts the principle of this circuit and how it interacts with the monitor. The circuit can inject pulses as short as $100ps$ into the clock signal. As shown in Fig. 20b, the monitor again detects all of these glitches, even when the acceptance window is wider than the injected pulse width.

The test results show that the monitor detects all 12 types of glitches, provided that the acceptance window is set appropriately. Since real-world clock glitch attacks can be viewed as combinations of these 12 types, the monitor can reliably detect FIAs based on clock glitches.

V-D Fault Injections via Voltage, EM, and Temperature

Besides the clock signal, FIAs can also be mounted by altering the delay of the logic gates. Such alterations can be achieved by manipulating the chip's operating environment, such as the supply voltage, the temperature, or even EM signals. We evaluate whether the monitor can detect these physical fault injection attacks. The clock frequency is $100MHz$ in all of the tests below.

V-D1 Voltage Attack

An off-chip function generator injects glitches into the circuit's power supply. The monitor correctly identifies voltage glitches as small as $120mV$ with a duration of $9ns$ . Different sensitivity levels are achieved by configuring the acceptance window, as shown in Fig. 21. Although the supply voltage is quite noisy in the test setup, false alerts are avoided by choosing an appropriate acceptance window size.

TABLE II: Temperature Attack Parameters

	Heating Attack	Freezing Attack	Temperature Drift
Temperature (°C)	$25\rightarrow 122$	$25\rightarrow-11$	$-40\rightarrow 125$
Average Slew Rate (°C/min)	1200	-600	2
Testing Equipment	Hot air rework station	Freeze spray	Temperature chamber
Monitor Result	Glitch detected	Glitch detected	No glitch

V-D2 EM Attack

As in the voltage attack tests, an RF signal is generated off-chip and coupled to the chip's power supply. EMP attacks are not evaluated due to equipment restrictions, but their effects are similar to, if not weaker than, those of the voltage glitch attack. Fig. 22a shows the test setup for EMI attacks. The RF signal injects $10MHz$ noise with a power above $-10\text{dBm}$ into the supply, the same EM frequency as used in [18]. This supply disturbance corresponds to $60mV_{pp}$ , which is smaller than the $90mV$ interference in that work. The monitor successfully detects this more subtle EM FIA, as shown in Fig. 22b. The acceptance window can also be configured to tune the detection threshold for the injected noise.

V-D3 Temperature Attack

Freeze spray is applied to the chip to rapidly cool the circuit, while a hot air rework station is used to raise its temperature quickly (Fig. 23). As summarized in Table II, the temperature slew rate is $-600^{\circ}C/min$ for cooling and $1200^{\circ}C/min$ for heating, and the monitor detects the injection attack in both cases. In contrast, to emulate normal environmental temperature changes, the chip is placed in a temperature chamber and swept from $-40^{\circ}C$ to $125^{\circ}C$ at an average of $2^{\circ}C/min$ . Owing to its real-time delay tuning, the monitor tracks this slow drift and raises no false alert.

TABLE III: Comparison Table with State-of-the-Art FIA Monitors

	This Work	ISSCC’23 [8]	VLSI’22 [29]
Technology	65nm	4nm	5nm
Protection Target	Design Agnostic	AES-256	Design Agnostic
Principle	Pulse Width Comparison	Error Checking	High-Frequency Sampling
Digital Design	Fully Synthesizable	Fully Synthesizable	Partially Digital
Voltage (V)	0.5 - 1.4	0.75	0.5 - 1.0
Temperature (°C)	-40 - 125	25	25
Power (mW)	0.487 ${}^{\text{a}}$	-	0.8025 ${}^{\text{b}}$
Area (MF²)	0.355	244.56	192
Monitor Precision	DLL Delay Step	-	FLL Period
Clock Frequency	2MHz - 1.26GHz	0 - 780MHz	0 - 40MHz
Target Attacks	Clock, Voltage, EM, Temperature Attacks	Any Fault Attack on AES	Low-Frequency Clock Glitches
a: measured @ 1.2V VDD, locking to 250MHz clock.		b: measured @ 0.75V VDD, locking to 40MHz clock.

V-E Power Consumption and Breakdown

The monitor is tested across a wide range of temperatures and voltages, and the power consumption at each test point is depicted in Fig. 24 and Fig. 25. The monitor functions across the automotive temperature range of $-40$ to $125^{\circ}C$ , and the figures indicate that the tuning frequency is insensitive to ambient temperature. The monitor is also validated across a wide range of supply voltages, from $0.5V$ to $1.4V$ . The average power of the fifty DUTs is $0.487mW$ at $1.2V$ and $25^{\circ}C$ when locked to a $250MHz$ clock. Fig. 25b shows the power breakdown at various locking frequencies. At lower frequencies, the coarse stage dominates the power consumption because its counter is clocked by an active RO. After the coarse stage is bypassed at $833MHz$ , the delay window logic consumes most of the power.

VI Technology Scalability of the Framework

Table I includes two sample FIA monitor designs generated automatically by our framework in 28nm CMOS. The fifth row represents a typical use case requiring a wide tuning range and a high resolution. Compared with the 65nm design, a higher frequency and a finer resolution are achieved naturally, even with the same fine-stage delay cell (4b NAND2). To reach the same minimum frequency, however, the ring oscillator requires more stages (21 versus 13). The monitor functions across a wide VDD range, with a maximum locking frequency of $3.13GHz$ at $1.0V$ and $86.3MHz$ at $0.4V$ . The last row presents a scenario with a very narrow target frequency range; in this case, the coarse stage is omitted and the footprint is minimized. Fig. 26 shows the automatically generated layouts of the two designs.

VII Conclusion

Timing FIAs are powerful attacks that require relatively little effort yet can cause devastating consequences. They can be launched through a diverse set of methods, and the interference can be highly localized. To mitigate these versatile attacks, we propose a distributable and synthesizable timing FIA monitor that detects timing violations caused by clock glitches and data path delay manipulation. The monitor replicates the pulse width of the legitimate clock signal with a configurable delay line to detect injected anomalies. To further reduce development effort, the monitor is accompanied by an end-to-end automation framework that optimizes and implements it from a few user specifications. The 65nm CMOS prototype demonstrates comprehensive and robust detection of all possible types of clock glitches, as well as timing faults induced through voltage, EM, or temperature channels. It self-calibrates to support a wide clock frequency range ( $2MHz-1.26GHz$ ). Its small power ( $0.487mW$ ) and area ( $0.355MF^{2}$ ) allow the monitor to be readily integrated into existing systems in a distributed fashion for optimal coverage of localized attacks. Table III summarizes the performance of the presented FIA monitor and compares it with state-of-the-art designs.

References

[1] B. Giller, “Implementing practical electrical glitching attacks,” presented at Black Hat Europe, Amsterdam, Netherlands, Nov. 2015. [Online]. Available: https://www.blackhat.com/eu-15/briefings.html#implementing-practical-electrical-glitching-attacks
[2] Y. Lu, “Injecting Software Vulnerabilities with Voltage Glitching,” Feb. 2019. [Online]. Available: http://arxiv.org/abs/1903.08102
[3] G. Gonzalez, “Drone Security and Fault Injection Attacks,” IOActive, Tech. Rep., Jun. 2023. [Online]. Available: https://ioactive.com/drone-security-fault-injection-attacks-gabriel-gonzalez/
[4] K. Murdock, D. Oswald, F. D. Garcia, J. Van Bulck, D. Gruss, and F. Piessens, “Plundervolt: Software-based Fault Injection Attacks against Intel SGX,” in 2020 IEEE Symposium on Security and Privacy (S&P), May 2020, pp. 1466–1482.
[5] P. Qiu, D. Wang, Y. Lyu, and G. Qu, “VoltJockey: Breaching TrustZone by Software-Controlled Voltage Manipulation over Multi-core Frequencies,” in Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, Nov. 2019, pp. 195–209.
[6] K. Matsuda, T. Fujii, N. Shoji, T. Sugawara, K. Sakiyama, Y.-I. Hayashi, M. Nagata, and N. Miura, “A 286 F²/Cell Distributed Bulk-Current Sensor and Secure Flush Code Eraser Against Laser Fault Injection Attack on Cryptographic Processor,” IEEE Journal of Solid-State Circuits, vol. 53, no. 11, pp. 3174–3182, Nov. 2018.
[7] H. Zhang, L. Lin, Q. Fang, and M. Alioto, “Laser Voltage Probing Attack Detection With 100% Area/Time Coverage at Above/Below the Bandgap Wavelength and Fully-Automated Design,” IEEE Journal of Solid-State Circuits, vol. 58, no. 10, pp. 2919–2930, Oct. 2023.
[8] R. Kumar, A. Varna, C. Tokunaga, S. Taneja, V. De, and S. Mathew, “A 100Gbps Fault-Injection Attack Resistant AES-256 Engine with 99.1-to-99.99% Error Coverage in Intel 4 CMOS,” in 2023 IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2023, pp. 1–3.
[9] Y. He and K. Yang, “A Synthesizable Design-Agnostic Timing Fault Injection Monitor Covering 2MHz to 1.26GHz Clocks in 65nm CMOS,” in 2024 IEEE International Solid-State Circuits Conference (ISSCC), vol. 67, Feb. 2024, pp. 304–306.
[10] P. Yang, F. Luo, Q. Ou, and D. Zhou, “Design and Analysis of Clock Fault Injection for AES,” in 2020 International Conference on Computer Communication and Network Security (CCNS), Aug. 2020, pp. 87–91.
[11] R. Lashermes, G. Reymond, J.-M. Dutertre, J. Fournier, B. Robisson, and A. Tria, “A DFA on AES Based on the Entropy of Error Distributions,” in 2012 Workshop on Fault Diagnosis and Tolerance in Cryptography, Sep. 2012, pp. 34–43.
[12] B. Ning and Q. Liu, “Modeling and Efficiency Analysis of Clock Glitch Fault Injection Attack,” in 2018 Asian Hardware Oriented Security and Trust Symposium (AsianHOST), Dec. 2018, pp. 13–18.
[13] M. Matsubayashi, A. Satoh, and J. Ishii, “Clock glitch generator on SAKURA-G for fault injection attack against a cryptographic circuit,” in 2016 IEEE 5th Global Conference on Consumer Electronics, Oct. 2016, pp. 1–4.
[14] R. Sun, P. Qiu, Y. Lyu, J. Dong, H. Wang, D. Wang, and G. Qu, “Lightning: Leveraging DVFS-induced Transient Fault Injection to Attack Deep Learning Accelerator of GPUs,” ACM Transactions on Design Automation of Electronic Systems, vol. 29, no. 1, pp. 14:1–14:22, Nov. 2023.
[15] “ChipWhisperer - the complete open-source toolchain for side-channel power analysis and glitching attacks.” [Online]. Available: https://github.com/newaetech/chipwhisperer
[16] A. Dehbaoui, J.-M. Dutertre, B. Robisson, and A. Tria, “Electromagnetic Transient Faults Injection on a Hardware and a Software Implementations of AES,” in 2012 Workshop on Fault Diagnosis and Tolerance in Cryptography, Sep. 2012, pp. 7–15.
[17] Y.-i. Hayashi, N. Homma, T. Mizuki, T. Aoki, and H. Sone, “Transient IEMI Threats for Cryptographic Devices,” IEEE Transactions on Electromagnetic Compatibility, vol. 55, no. 1, pp. 140–148, Feb. 2013.
[18] D. Fujimoto, Y.-i. Hayashi, A. Beckers, J. Balasch, B. Gierlichs, and I. Verbauwhede, “Detection of IEMI fault injection using voltage monitor constructed with fully digital circuit,” in 2018 IEEE International Symposium on Electromagnetic Compatibility and 2018 IEEE Asia-Pacific Symposium on Electromagnetic Compatibility (EMC/APEMC), May 2018, pp. 753–755.
[19] M. Hutter and J.-M. Schmidt, “The Temperature Side Channel and Heating Fault Attacks,” in Smart Card Research and Advanced Applications: 12th International Conference (CARDIS), Jun. 2014, pp. 219–235.
[20] P. Nasahl, M. Osorio, P. Vogel, M. Schaffner, T. Trippel, D. Rizzo, and S. Mangard, “SYNFI: Pre-Silicon Fault Analysis of an Open-Source Secure Element,” IACR Transactions on Cryptographic Hardware and Embedded Systems, pp. 56–87, Aug. 2022.
[21] A. M. Shuvo, N. Pundir, J. Park, F. Farahmandi, and M. Tehranipoor, “LDTFI: Layout-aware Timing Fault-Injection Attack Assessment Against Differential Fault Analysis,” in 2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), Jul. 2022, pp. 134–139.
[22] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge, “Razor: a low-power pipeline based on circuit-level timing speculation,” in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), Dec. 2003, pp. 7–18.
[23] S. Das, C. Tokunaga, S. Pant, W.-H. Ma, S. Kalaiselvan, K. Lai, D. M. Bull, and D. T. Blaauw, “RazorII: In Situ Error Detection and Correction for PVT and SER Tolerance,” IEEE Journal of Solid-State Circuits, vol. 44, no. 1, pp. 32–48, Jan. 2009.
[24] I. Kwon, S. Kim, D. Fick, M. Kim, Y.-P. Chen, and D. Sylvester, “Razor-Lite: A Light-Weight Register for Error Detection by Observing Virtual Supply Rails,” IEEE Journal of Solid-State Circuits, vol. 49, no. 9, pp. 2054–2066, Sep. 2014.
[25] Y. Zhang, M. Khayatzadeh, K. Yang, M. Saligane, N. Pinckney, M. Alioto, D. Blaauw, and D. Sylvester, “iRazor: Current-Based Error Detection and Correction Scheme for PVT Variation in 40-nm ARM Cortex-R4 Processor,” IEEE Journal of Solid-State Circuits, vol. 53, no. 2, pp. 619–631, Feb. 2018.
[26] M. Cho, S. T. Kim, C. Tokunaga, C. Augustine, J. P. Kulkarni, K. Ravichandran, J. W. Tschanz, M. M. Khellah, and V. De, “Postsilicon Voltage Guard-Band Reduction in a 22 nm Graphics Execution Core Using Adaptive Voltage Scaling and Dynamic Power Gating,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 50–63, Jan. 2017.
[27] K. A. Bowman, C. Tokunaga, T. Karnik, V. K. De, and J. W. Tschanz, “A 22 nm All-Digital Dynamically Adaptive Clock Distribution for Supply Voltage Droop Tolerance,” IEEE Journal of Solid-State Circuits, vol. 48, no. 4, pp. 907–916, Apr. 2013.
[28] D. Nemiroff and C. Tokunaga, “Fault-Injection Detection Circuits: Design, Calibration, Validation and Tuning,” presented at Black Hat USA, Las Vegas, NV, USA, Aug. 2022. [Online]. Available: https://www.blackhat.com/us-22/briefings/schedule/index.html#fault-injection-detection-circuits-design-calibration-validation-and-tuning-27397
[29] S. Song, S. G. Tell, B. Zimmer, S. S. Kudva, N. Nedovic, and C. T. Gray, “An FLL-Based Clock Glitch Detector for Security Circuits in a 5nm FINFET Process,” in 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), Jun. 2022, pp. 146–147.
[30] N. Miura, D. Fujimoto, D. Tanaka, Y.-I. Hayashi, N. Homma, T. Aoki, and M. Nagata, “A local EM-analysis attack resistant cryptographic engine with fully-digital oscillator-based tamper-access sensor,” in 2014 Symposium on VLSI Circuits Digest of Technical Papers, Jun. 2014.
[31] D.-H. Seo, M. Nath, D. Das, B. Chatterjee, S. Ghosh, and S. Sen, “PG-CAS: Patterned-Ground Co-Planar Capacitive Asymmetry Sensing for mm-Range EM Side-Channel Attack Probe Detection,” in 2021 IEEE International Symposium on Circuits and Systems (ISCAS), May 2021, pp. 1–5.
[32] B. Liu, Y. Zhang, J. Qiu, H. C. Ngo, W. Deng, K. Nakata, T. Yoshioka, J. Emmei, J. Pang, A. T. Narayanan, H. Zhang, T. Someya, A. Shirane, and K. Okada, “A Fully Synthesizable Fractional-N MDLL With Zero-Order Interpolation-Based DTC Nonlinearity Calibration and Two-Step Hybrid Phase Offset Calibration,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 68, no. 2, pp. 603–616, Feb. 2021.