Feature Stories - Breaking through the fault-testing bottleneck in chip production
'A decade or two ago the complexity of silicon chips was considerably lower than now. Since then, the size in terms of the number of transistors has grown to billions. Such chips are difficult to design but even more difficult to test and verify. Testing and verification have clearly become the bottleneck of the development process,' explains Dr Jaan Raik, a professor of digital systems verification at the Department of Computer Engineering of Tallinn University of Technology in Estonia.
One EU-funded project, PROSYD, carried out extensive research on chip manufacturing processes and found that as much as 40 % of the entire IC design cycle is spent locating and correcting errors caused by design mistakes. In recent years numerous test and verification approaches have emerged. But, while they are good at identifying the presence of faults, they are usually unable to pin-point the root cause of an error. 'It is not helpful for the designer to merely know that the chip is not working. It is necessary also to locate the fault and, ultimately, to correct it. Relatively little attention has been paid to the latter tasks,' explains Dr Raik.
Until now. Over the course of three years a consortium of universities and technology companies teamed up in the 'Diagnosis, error-modelling and correction for reliable systems design' (DIAMOND) project to develop innovative models and technology to test, detect, verify and, most importantly, fix IC errors. Their approach, supported by almost EUR 2.9 million in funding from the European Commission, marks a major leap forward for the semiconductor industry - offering potentially enormous time and cost savings if implemented widely.
A holistic approach with three-fold benefits
According to the project, a modern chip design project costs around EUR 60 million, but if the error detection and correction process could be accelerated by automation, these costs could be cut by EUR 15 million.
Dr Raik, who coordinated DIAMOND, describes the project's contribution to solving the challenge of detecting and fixing IC errors as three-fold. 'First, a holistic model for different types of faults was developed. Based on this model the same localisation engines can be applied to design errors, soft-errors and defects. Second, more efficient automated localisation and correction methods were developed. Particular stress was put on system-level approaches where previous research work has been inadequate. And third, post-silicon in-situ debug approaches were developed. Such approaches extend the life-time of silicon chips by localising and isolating faulty regions in them,' Dr Raik says.
The team created an open source system-level design error localisation and correction system called 'Formal repair environment for simple C' (Forensic) - jointly developed by Graz University of Technology, the University of Bremen and Tallinn University of Technology, with a second version of the environment released last December. To detect and fix errors at the register-transfer level, the DIAMOND researchers used a design elaboration system called zamiaCAD, a highly scalable open-source platform that can easily handle large commercial systems, such as those used by one of the project's industrial partners, IBM. On top of that platform, the team implemented new error localisation methods capable of pin-pointing design errors in such large designs.
In addition, because so-called soft errors - such as those caused by the effects of radiation - are increasingly becoming an issue in new nanometre-scale technologies, IBM and the University of Bremen jointly developed efficient simulation and vulnerability check approaches for such faults. Meanwhile, a post-silicon fault management system to extend the lifetime of future chips was designed by Ericsson, the University of Linköping and Estonian 'electronic design automation' (EDA) company Testonica.
'The key innovations of the DIAMOND project are the holistic handling of different kinds of faults, as well as new engines for system-level fault localisation and correction, for soft error analysis and for fault management,' Dr Raik summarises.
An almost four-fold increase in efficiency
The upshot is major gains in the efficiency of the processes used to find and correct faults. 'At the system-level, Forensic was able to correct 60 % of the benchmark designs compared to 16 % with previous tools,' the project coordinator notes. 'At the register-transfer level, we performed a case study on a real processor design. We cooperated with a design team at TU Ilmenau who kindly provided us with documented bug cases. The DIAMOND methods were able to locate all the bugs in a couple of minutes, versus several hours needed for manual localisation.'
These new, more efficient fault correction methods mean huge savings. IBM estimates that EUR 15 million per chip project could be saved if fault diagnosis and correction efficiency can be doubled. For consumers, ultimately, it means cheaper and safer electronic products.
On the back of DIAMOND's success, three of the project partner organisations are preparing to launch BASTION, a follow-up project that has also won funding from the European Commission and aims to further enhance their fault-detection technology. IBM has so far filed two patents on technologies developed in DIAMOND and is continuing to exploit the results internally, including successfully applying the testing and verification tools in design projects. Ericsson, another project partner, is also exploiting the results internally and is applying the new fault-management technology in product developments. EDA companies TransEDA and Testonica - itself a spin-off of the research group at Tallinn University of Technology - have included the fault diagnosis tools in their product portfolios. Testonica has filed one patent on technologies developed in DIAMOND.
Meanwhile, some of the results have been presented to chipmaker Intel, and several SMEs have also expressed interest. 'The interest from outside has been quite strong,' Dr Raik notes. 'There is a clear trend towards multi-core designs due mainly to the need to keep the power dissipation of future chips under control. I therefore see that test, reliability and design will become more intertwined within multi-core systems. In fact, the regularity and modularity inherent to multi-core architectures provides new opportunities for test, verification and design.'
On the strength of these results, the project's work can be expected to help shorten the time it takes to design, produce and bring new chips to market - accelerating innovation in electronic devices - as well as leading to lower prices for electronic equipment.
DIAMOND received research funding under the European Union's Seventh Framework Programme (FP7).