Dagstuhl: Formal Methods for Correct Persistent Programming

Last week, I boarded the train to Wadern, Germany (through London, Paris and Saarbrücken) to attend Dagstuhl seminar 23412 on “Formal Methods for Correct Persistent Programming”.

What is persistent programming?

Persistent programming is concerned with memory that retains its contents after a power loss. While magnetic disks and solid state disks satisfy that definition, the main focus was on non-volatile memory that provides fast access (not too much slower than DRAM) and fine granularity (notionally byte addressable but, in practice, at the granularity of a cache line (64 bytes)).

Non-volatile memory can be built using DRAM and a battery (and some non-volatile memories require the use of battery-backed caches for safety) but it is usually based on some more exotic technology such as memristors, spin, ferro-electric FETs, phase change, etc. These technologies promise to provide much higher capacity at much lower cost per gigabyte and much lower power consumption. Sadly, for the field, the most prominent non-volatile memory product was shut down last summer but we did not feel that this was the end of persistent systems.

What makes persistent programming hard?

There seem to be two main issues in persistent programming. The first is that your code has to be able to cope with a power failure at any moment so every write to non-volatile memory must take care that

Any data that it refers to is also in non-volatile memory because, otherwise, that data would be lost if the power failed.
It is possible to recover from a power failure after every single write. Your data structure in non-volatile memory must never be in an inconsistent state.

These properties are checked with type systems and/or formal verification of the software.

The second, and more challenging problem, is that, for performance reasons, memory systems reorder memory reads and writes so even if you write to address ‘x’ then address ‘y’ and then print “It is now safe to turn your computer off” on the screen, those three steps could occur in the reverse order. That is, all modern computer systems provide some form of “weak memory consistency.” Obviously, this reordering makes it harder to ensure that the data structure in the non-volatile memory is always in a recoverable state.

To avoid problems with weak memory, you need to add memory fences and flush cache contents out to the non-volatile memory but, if you do this too much, performance will suffer. So the trick is to add the bare minimum of fences and flushes to ensure that your data is safe. Unsurprisingly, this is what most of our discussion centered on with questions like

What exactly do the fence and flush instructions guarantee? The more precisely we understand what they do, the closer we can get to the edge (without stepping over).

Note that different ISAs have different properties so sometimes you need different code for different systems to make it simultaneously fast and sound.
How can we confirm that our understanding of the memory system is correct? Power cycling is too slow to test exhaustively by repeatedly turning off the power and checking that the memory got the values that we expect!
How should we formalize our understanding of the memory system’s guarantees? The most popular approach seemed to use the “cat” declarative notation that is supported by the “herd” tools and others.
Is one memory system stronger than another? A program written for one memory system should also work for any stronger memory system.
Given an algorithm or a C program or a machine-code program, can we formally verify that it contains enough fences and flushes to ensure that we can always recover from a power failure.

Lots of fun stuff to discuss!

What else was discussed?

A related, but subtly different, topic is intermittent computing. This is concerned with systems that use energy harvesting such as solar power, vibrations, RF power, etc. instead of having mains power or a long-life battery. The challenge here is that you may only have a few milliseconds of power before it disappears with no warning. To deal with this, your program must contain frequent ‘checkpoints’ where the state of the system is saved to non-volatile memory and the code between checkpoints must be idempotent because that code will be re-executed when power is restored. (There are usually no weak memory effects to worry about because embedded systems are designed with a different set of performance constraints.)

And, of course, you don’t need special hardware to get persistence: your filesystem provides a reliable way to save the state of your system to disk. So it’s possible to create interesting persistent applications and tools that anybody can run.

What is Dagstuhl?

To quote the Dagstuhl website, “Schloss Dagstuhl, the Leibniz Center for Informatics was originally founded in 1990 to provide a retreat for world class research and training in computer science.” In practice, this means that 30 or so researchers travel to a place out in the Saarland countryside that provides the perfect environment for focusing on a topic and meeting other researchers working in the field. A few things that stand out are the excellent conference facilities; the many smaller meeting rooms in case a group want to split off for a more focused discussion; the kitchen staff randomly assigning seats at lunch and dinner; and the potential for walks in the surrounding countryside or billiards in the games room.

Dagstuhl is well known among researchers in continental Europe but, even if you have never heard of Dagstuhl, you have probably heard of DBLP the CS publication database.

During the week, we shared Dagstuhl with a seminar on Accountable Software Systems. In the mornings and evenings we would talk to those researchers about their research and how their seminar was going. There were some similarities to our group (some of them also work on formal methods) but they were a much more diverse group with lawyers, control theorists, mathematicians, etc. so they were working hard to build a shared understanding of what accountability is while we largely understood the problems and were working on solutions. A very different meeting!

Written on October 15, 2023.
The opinions expressed are my own views and not my employer's.