Cores that don't count

Peter H. Hochschild, Paul Turner, Jeffrey C. Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, Amin Vahdat
Read: 10 September 2021

Proceedings of the Workshop on Hot Topics in Operating Systems
Association for Computing Machinery
New York, NY, USA
Pages 9-16
2021
Note(s): Google, hardware faults
Papers: dixit:arxiv:2021

Silent Corrupt Execution Errors (CEEs) are caused by minor manufacturing defects in CPUs: an affected core infrequently corrupts its output while otherwise appearing to be fine. (The infrequent aspect is what makes them “silent”.) Google and Facebook are both seeing a few “mercurial” cores per several thousand machines (it is not clear whether “machines” here means CPUs).

Suspected to be caused by smaller feature sizes, more complex designs, and the increasing difficulty of testing (esp. for corner cases and post-deployment aging).

Detection

  • pre/post-deployment screening (testing)
  • offline/online screening (testing idle cores or testing during idle moments)
  • infrastructure-level vs application-level screening (i.e., generic or application-specific testing)
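
To make the screening idea concrete, here is a minimal sketch of an online, infrastructure-level check: pin the process to each core in turn, run a deterministic workload, and compare the result with a known-good ("golden") value. It is only an illustration under assumptions, not the paper's method: the names workload, screen_core and screen_all are hypothetical, the SHA-256 loop stands in for a real test corpus, pinning relies on the Linux-only os.sched_setaffinity, and the golden value would in practice come from a trusted reference machine rather than the machine under test.

```python
import hashlib
import os


def workload(seed: int, rounds: int = 1000) -> str:
    """Deterministic computation: a chained SHA-256 starting from a seed."""
    digest = seed.to_bytes(8, "little")
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def screen_core(core: int, expected: str, seed: int = 42, repeats: int = 5) -> bool:
    """Run the workload `repeats` times on `core`; True if every run matches."""
    can_pin = hasattr(os, "sched_setaffinity")  # Linux only
    previous = os.sched_getaffinity(0) if can_pin else None
    try:
        if can_pin:
            try:
                os.sched_setaffinity(0, {core})
            except OSError:
                pass  # assumption: pinning is best effort in this sketch
        return all(workload(seed) == expected for _ in range(repeats))
    finally:
        if can_pin:
            try:
                os.sched_setaffinity(0, previous)  # restore the original mask
            except OSError:
                pass


def screen_all(expected: str) -> dict:
    """Screen every core this process may run on; maps core id -> passed."""
    if hasattr(os, "sched_getaffinity"):
        cores = sorted(os.sched_getaffinity(0))
    else:
        cores = [0]  # no affinity API: a single unpinned check
    return {core: screen_core(core, expected) for core in cores}


if __name__ == "__main__":
    golden = workload(42)  # assumption: normally taken from a trusted machine
    suspects = [core for core, ok in screen_all(golden).items() if not ok]
    print("suspect cores:", suspects)
```

Because the errors are infrequent, a single pass like this can easily miss a mercurial core; a real screener would presumably run many different workloads repeatedly over time.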

Mitigations - both hardware and software. Mostly an opportunity for future research.
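
One widely known software mitigation is redundant execution with voting (as in triple modular redundancy). The sketch below is an assumption-laden illustration rather than something taken from these notes: vote and run_redundantly are hypothetical names, and the replicas run sequentially in one process, whereas a real system would place each replica on a different core or machine.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def vote(results: Sequence[Hashable]) -> Hashable:
    """Return the strict-majority result, or raise if there is none."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among redundant results")
    return value


def run_redundantly(replicas: Sequence[Callable[[], Hashable]]) -> Hashable:
    """Run every replica and vote on the outputs.

    Assumption: each callable represents the same computation scheduled
    on a different core; here they simply run one after another.
    """
    return vote([replica() for replica in replicas])


if __name__ == "__main__":
    healthy = lambda: 6 * 7
    faulty = lambda: 43  # simulates a core that corrupts its result
    print(run_redundantly([healthy, faulty, healthy]))
```

The cost is at least a threefold increase in compute for the protected work, which is why it would likely be reserved for critical computations.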