ReTMiC: Reliability-Aware Thermal Management in Multicore Mixed-Criticality Embedded Systems

Published in IEEE Xplore: 17 February 2025
Authors: Sepideh Safari, Mohsen Ansari, Shaahin Hessabi, Jörg Henkel
This figure shows the overall operational flow of the proposed ReTMiC (Reliability-Aware Thermal Management in Multicore Mixed-Criticality Embedded Systems) method. First, the scheduler receives software-level and hardware-level parameters. As shown in the figure, the proposed method consists of two phases: offline and online. In the offline phase, the method first computes, for each task, the minimum number of replicas required to satisfy the reliability target at the maximum voltage-frequency (v-f) level, and extends the task graph to include these replicas. Then, the minimum v-f level for each task is determined such that the reliability target is still satisfied. In the next step, the normal and overrun parts of each high-criticality (HC) task and its replicas are mapped to the cores and scheduled so that the timing and Thermal Design Power (TDP) constraints are met. Next, low-criticality (LC) tasks are mapped and scheduled so that the QoS and TDP constraints are satisfied. In the final schedule extracted in the offline phase, all LC, HC, and replica tasks are scheduled so that the deadline, reliability, TDP, and QoS constraints are met even in the worst-case fault or overrun scenarios. However, because the actual temperatures of tasks are known only at run-time, the temperature must be balanced at run-time. Hence, at design time, the scheduler determines the temperature balancing points and balancing factors that will be exploited at run-time. To this end, based on the schedule, the proposed method specifies Balancing Points (BPs) and Balancing Factors (BFs) to balance the overall temperature of the chip at run-time.

As the number of cores in multicore platforms increases, temperature constraints may prevent powering all cores simultaneously at the maximum voltage and frequency level. Thermal hot spots and unbalanced temperatures between the processing cores may degrade reliability. This paper introduces a reliability-aware thermal management scheduling (ReTMiC) method for mixed-criticality embedded systems. ReTMiC satisfies Thermal Design Power (TDP) as the chip-level power constraint at design time. To balance the temperature of the processing cores, the proposed method determines balancing points on each frame of the schedule. At run time, a lightweight online re-mapping technique is activated at each of these balancing points. The online mechanism exploits the proposed temperature-aware factor to reduce the system’s temperature based on the current temperature of the processing cores and the behavior of their corresponding running tasks. Our experimental results show that the ReTMiC method achieves up to 12.8°C reduction in the chip temperature and 3.5°C reduction in spatial thermal variation in comparison to the state-of-the-art techniques while keeping the system reliability at the required level.
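As a rough illustration of what a re-mapping decision at a balancing point could look like, the sketch below pairs the most power-hungry upcoming tasks with the coolest cores. This greedy rule is a stand-in chosen for illustration only; it is not the temperature-aware factor proposed in the paper, whose definition is not given here. The function name, the dictionary inputs, and the use of expected task power as a proxy for task behavior are all assumptions.

```python
def remap_at_balancing_point(core_temps: dict[str, float],
                             task_powers: dict[str, float]) -> dict[str, str]:
    """Greedy stand-in for an online re-mapping step (hypothetical).

    core_temps:  current temperature of each core, in degrees Celsius.
    task_powers: expected power of each task that starts after the
                 balancing point, in watts.
    Returns a task -> core assignment that places the highest-power task
    on the coolest core, the next one on the next-coolest core, and so on.
    """
    if len(task_powers) > len(core_temps):
        raise ValueError("more tasks than cores at this balancing point")
    coolest_first = sorted(core_temps, key=core_temps.get)
    hungriest_first = sorted(task_powers, key=task_powers.get, reverse=True)
    return dict(zip(hungriest_first, coolest_first))


if __name__ == "__main__":
    mapping = remap_at_balancing_point(
        core_temps={"c0": 70.0, "c1": 55.0, "c2": 62.0},
        task_powers={"t0": 1.0, "t1": 3.0, "t2": 2.0},
    )
    print(mapping)
```

A sort-and-pair step like this costs O(n log n) in the number of cores, which is in keeping with the lightweight nature of the online phase, although a real implementation would also have to preserve the timing and reliability guarantees established offline.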