RELY - Design for RELIABILITY of SoCs for Applications
RELY aims to establish reliability as a design goal across the entire value chain of a chip (semiconductor manufacturing, IC design, etc.). To ensure proper functionality under extreme boundary conditions, new reliability methodologies have to be investigated, and countermeasure mechanisms have to be placed and coordinated at all system levels. To increase the reliability of Systems-on-Chip (SoCs), internal system states (temperature, aging, etc.) have to be controlled at runtime. Depending on these states, operating parameters such as the operating frequency, circuit voltage, power mode or load distribution can be adjusted dynamically.
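As an illustration of adjusting operating parameters according to a monitored state, the following minimal sketch maps a measured temperature to a frequency/voltage pair. The temperature limits, the operating points, and the names `OperatingPoint` and `select_operating_point` are assumptions made for this example only; they are not taken from the RELY project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    freq_mhz: int
    voltage_mv: int


# Hypothetical policy table, ordered from the hottest limit downwards.
# Each entry: (temperature limit in deg C, operating point to apply at or above it).
POLICY = [
    (85.0, OperatingPoint(freq_mhz=400, voltage_mv=800)),
    (70.0, OperatingPoint(freq_mhz=800, voltage_mv=950)),
]

# Hypothetical nominal operating point used below all limits.
NOMINAL = OperatingPoint(freq_mhz=1200, voltage_mv=1100)


def select_operating_point(temp_c: float) -> OperatingPoint:
    """Return the operating point for the first limit the temperature reaches."""
    for limit, point in POLICY:
        if temp_c >= limit:
            return point
    return NOMINAL
```

A real controller would likely also consider aging indicators, hysteresis and load distribution; the table lookup only shows the basic state-to-parameter mapping.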
The Department for Integrated Systems (LIS) focusses on the system architecture domain. The other domains are worked on by project partners.
The major research focus is on:
- Methodologies to discover potential sources of error utilizing sensors
- Development of countermeasures on hardware level
In RELY it is planned to place sensors/monitors over the entire system to check internal system states.
Since a massive amount of data is expected, the sensor data cannot be forwarded directly to an operating system (OS) or middleware. Therefore, a new hardware layer (the reliability layer) has to be introduced that performs data preprocessing and data evaluation and triggers low-level countermeasures. For fail-safety and scalability reasons, the evaluation of the monitoring data is logically divided into areas/subsystems. To provide different Quality-of-Service (QoS) levels, the reliability layer consists of three sublayers that work in different time domains.
The aggregation layer performs data preprocessing (calculation of mean values over specified time slots, filtering according to threshold values, etc.). The preprocessed data is forwarded to the local evaluation layer. In this layer, the sensor data of an area/subsystem is evaluated, and local countermeasures that work at clock-cycle level (nanoseconds) are executed. To enable global optimizations, the preprocessed and pre-evaluated data is forwarded to the system evaluation layer, where the triggered countermeasures are executed within milliseconds. The remaining data is forwarded to the OS or middleware. Here, further evaluation of the internal system state can be done, which exploits the flexibility of software for both the evaluation itself and the triggered countermeasures (seconds).
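The first two sublayers can be sketched in software as follows. This is only an illustrative model, assuming a single sensor stream per area; the function names, the `"throttle"` action and the choice to forward every flagged slot to the system evaluation layer are assumptions of this example, not the project's hardware design.

```python
def aggregate(samples, slot_size, threshold):
    """Aggregation layer: mean value per time slot, keeping only slots
    whose mean reaches the threshold. Returns (slot_index, mean) pairs."""
    flagged = []
    for start in range(0, len(samples), slot_size):
        slot = samples[start:start + slot_size]
        slot_mean = sum(slot) / len(slot)
        if slot_mean >= threshold:
            flagged.append((start // slot_size, slot_mean))
    return flagged


def local_evaluation(area, flagged, critical):
    """Local evaluation layer for one area/subsystem.

    Triggers a (hypothetical) local countermeasure for critical slots and
    forwards the pre-evaluated data, tagged with the area name, to the
    system evaluation layer."""
    actions = [(area, slot, "throttle") for slot, value in flagged if value >= critical]
    forwarded = [(area, slot, value) for slot, value in flagged]
    return actions, forwarded


# Example: six temperature samples, two samples per time slot.
flagged = aggregate([40, 42, 80, 90, 41, 39], slot_size=2, threshold=60.0)
actions, forwarded = local_evaluation("core0", flagged, critical=80.0)
```

Only the second slot (mean 85.0) passes the threshold filter, which shows how the aggregation step reduces the data volume before any evaluation takes place.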
A Long Duration Transient Resilient Pipeline Scheme. IEEE Transactions on Device and Materials Reliability 17 (1), 2017
A Chip-level Redundant Threading (CRT) Scheme for Shared-Memory Protection. Int. Conference on High Performance Computing & Simulation (HPCS) 2016, 2016
Tackling Long Duration Transients in Sequential Logic. IEEE Int. Symp. on On-Line Testing and Robust System Design (IOLTS) 2016, 2016
Integrated Soft Error Resilience and Self-Test. IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC) 2016, 2016
Matching Detection and Correction Schemes for Soft Error Handling in Sequential Logic. Euromicro Conference on Digital System Design (DSD), 2015
A run-time reconfigurable NoC Monitoring System for performance analysis and debugging support. PARS-Workshop 2015, 2015
Improving the Significance of Probabilistic Circuit Fault Emulations. 20th IEEE International On-Line Testing Symposium (IOLTS), 2014
Probabilistic Circuit Fault Emulation. edaWorkshop, 2014
A Resource-efficient Probabilistic Fault Simulator. 23rd International Conference on Field Programmable Logic and Applications (FPL), 2013
RELY - Reliability of SoCs for Safety Critical Applications. edaWorkshop 12, 2012
An FPGA-based Probability-aware Fault Simulator. SAMOS XII, International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, 2012, 302-309