Silent data corruptions at scale

Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, Sriram Sankar
Read: 15 September 2021

CoRR abs/2102.11245
2102.11245
2021
Note(s): Hardware faults
Papers: hochschild:hotos:2021

This paper explores the same problem as hochschild:hotos:2021: CPUs that silently produce incorrect results, with no mechanism to detect or correct the error (unlike, for example, SRAM and DRAM, which are protected by ECC). Rather than transient, radiation-induced failures, the authors are interested in the more consistent, higher-frequency failures that they observe in datacenters. They attribute these to timing errors, manufacturing variation, degradation and end-of-life wearout, and associate them with increased transistor density and wider datapaths.

They do a detailed case study of debugging one failure on one particular core of one chip. In a distributed filesystem, they found that the Scala power function was returning zero for a non-zero input (computed on double-precision floats; the results here are shown truncated to integers):

pow(1.1, 53) = 0

but

pow(1.1, 52) = 142

However, debugging JIT-compiled languages like Scala is non-trivial (one can’t just use gdb), and minimizing binary code takes care. Eventually, they reduce the failing code from 146K to 60 lines of assembly and find both positive and negative exponents that fail, including the following examples:

1.1 ^ 3 = 0 (should be 1)
1.1 ^ 107 = 32809 (should be 26854)
1.1 ^ -3 = 1 (should be 0)
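
The expected values quoted above can be checked against an independent computation. The following is a minimal sketch in Python, not the paper's 60-line assembly reproducer: it compares the integer part of a floating-point `pow` with the same quantity computed in exact rational arithmetic. The function names (`reference_int_pow`, `check_pow`, `find_mismatches`) are hypothetical and chosen only for illustration, and the sketch assumes that the integer arithmetic used for the reference is itself healthy.

```python
from fractions import Fraction
import math

# Exponents taken from the examples in the text; the base is always 1.1.
EXPONENTS = [52, 53, 3, 107, -3]


def reference_int_pow(base, exponent):
    """Integer part of base**exponent using exact rational arithmetic.

    `base` is a decimal string such as "1.1", so Fraction(base) is exactly
    11/10 and no floating-point arithmetic is involved in the reference.
    """
    return math.trunc(Fraction(base) ** exponent)


def check_pow(base, exponent):
    """Return (observed, expected, agrees) for one floating-point pow call."""
    observed = int(math.pow(float(base), exponent))  # path under test
    expected = reference_int_pow(base, exponent)     # independent reference
    return observed, expected, observed == expected


def find_mismatches(base="1.1", exponents=EXPONENTS):
    """List the (exponent, observed, expected) triples that disagree."""
    mismatches = []
    for n in exponents:
        observed, expected, agrees = check_pow(base, n)
        if not agrees:
            mismatches.append((n, observed, expected))
    return mismatches


if __name__ == "__main__":
    for n, observed, expected in find_mismatches():
        print(f"1.1 ^ {n}: observed {observed}, expected {expected}")
```

On a healthy core `find_mismatches()` should return an empty list. A result such as `(53, 0, 156)` would correspond to the kind of silent corruption described in the case study.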