Silent Corrupt Execution Errors (CEEs) are caused by minor manufacturing defects in CPUs: an affected core infrequently corrupts its output while otherwise appearing to work correctly. (The infrequency is what makes them “silent”.) Google and Facebook both report a few “mercurial” cores per several thousand machines (it is not clear whether they mean CPUs).
They are suspected to be caused by smaller feature sizes, more complex designs, and the increasing difficulty of testing (especially for corner cases and post-deployment aging).
Detection
- pre/post-deployment screening (testing)
- offline/online screening (testing idle cores or testing during idle moments)
- infrastructure-level vs application-level screening (i.e. generic or application-specific testing)
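To make the screening idea concrete, here is a minimal sketch of a generic (infrastructure-level) screener: run a deterministic workload repeatedly on one core and count results that differ from a known-good value. Everything in it is an assumption for illustration, not the paper's method: the names `workload`, `available_cores` and `screen_core` are hypothetical, the golden value would in practice come from a trusted machine, and a chained SHA-256 exercises only a small part of a core, whereas a real screener would need a much broader instruction mix. Pinning relies on `os.sched_setaffinity`, which is Linux-only, so the sketch falls back to an unpinned run elsewhere.

```python
import hashlib
import os


def workload(seed: int, rounds: int = 1000) -> str:
    """Deterministic compute kernel (hypothetical): chained SHA-256 of a seed."""
    digest = seed.to_bytes(8, "little")
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def available_cores() -> list:
    """Cores this process may run on; falls back to [0] where unsupported."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return [0]


def screen_core(core: int, golden: str, seed: int, repeats: int = 20,
                run=workload) -> int:
    """Run the workload `repeats` times on `core`; return the mismatch count."""
    can_pin = hasattr(os, "sched_setaffinity")
    previous = os.sched_getaffinity(0) if can_pin else None
    if can_pin:
        try:
            os.sched_setaffinity(0, {core})  # pin this process to one core
        except OSError:
            can_pin = False  # pinning refused; continue unpinned
    try:
        return sum(1 for _ in range(repeats) if run(seed) != golden)
    finally:
        if can_pin:
            os.sched_setaffinity(0, previous)  # restore the original affinity


if __name__ == "__main__":
    # Assumption: computed locally here; ideally taken from a trusted machine.
    golden = workload(42)
    for core in available_cores():
        print(core, screen_core(core, golden, 42))
```

Because CEEs are infrequent, a single clean pass says little; a screener like this would have to be run many times, which is where the offline/online distinction (idle cores vs idle moments) matters.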
Mitigations - both hardware and software. Mostly an opportunity for future research.
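As an illustration of what a software mitigation could look like, here is a sketch of redundant execution with majority voting. These notes do not record which mitigations the paper proposes, so this is a commonly used general technique swapped in as an example; the helper name `vote` is hypothetical. Note the assumption that replicas run on independent cores or machines: re-running on the same defective core may reproduce the same wrong answer.

```python
from collections import Counter
from typing import Callable, Hashable


def vote(compute: Callable[[], Hashable], replicas: int = 3) -> Hashable:
    """Run `compute` several times and return the strict-majority result.

    Assumption: in a real system each replica would be scheduled on a
    different core or machine; here they simply run one after another.
    """
    results = [compute() for _ in range(replicas)]
    value, count = Counter(results).most_common(1)[0]
    if count <= replicas // 2:
        raise RuntimeError("no majority among replicas")
    return value


if __name__ == "__main__":
    print(vote(lambda: sum(range(1000))))
```

The cost is the obvious one: every protected computation is paid for `replicas` times, so this only makes sense for the most critical operations.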
Papers related to Cores that don’t count
- Silent data corruptions at scale [dixit:arxiv:2021]