Clément Pit-Claudel

Resources for prospective students

I will be joining EPFL as an assistant professor in academic year 2022-2023 (in the meantime, I'm taking a gap year at Amazon AWS). If you're interested in being part of my group, please apply!

I'm generally interested in the intersection of systems and formal methods (especially compilers, hardware, and verification), as well as topics around programmer productivity, tools, visualizations, and user interfaces.

There's a brief summary of my current projects on my home page.


Are you recruiting this year? (2022-2023)

Yes! I'm planning to take two PhD students and possibly one postdoc in Fall 2022.

Do I need to speak French to join EPFL?

No, not at all.

How large is your group? How large will it get?

We're just getting started :). I'm planning to ramp up slowly (1-2 students a year) to a reasonably small group (4-6 students at most for the foreseeable future): this way I'll have all the time I need to support and collaborate closely with my first students.

What criteria do you prioritize in selecting students?

Motivation, technical chops, collegiality, and fit with my group's research interests. I want to be confident that we'll both be happy to work together for ~5 years, so on your side I'm looking for evidence that you have skills and interests that match my group's research, and on my side I'm trying to ensure that EPFL will offer everything you'll need to develop to your very best (this includes my personal expertise and background, but also a sufficient mass of other faculty and students working on related topics to provide an exciting and stimulating environment).

I'm passionate about teaching and education (and EPFL profs care a lot about good teaching in general), so teaching experience or interest is a plus; and my group seeks to build real systems, so strong programming skills are a big plus as well.

Previous research experience helps, but it's not stricly necessary — you're doing a PhD to learn.

Do I need to know which specific problem I want to work on?

No, but you need to know what you're interested in. Part of the PhD is developing your research tastes and your ability to pick interesting problems to work on, so don't feel that you have to know exactly what problem you want to work on before you start. What you do need is to be passionate enough about a broad area / class of problems to spend 4-6 years working on that.

What is it like to do a PhD at EPFL/in Switzerland/in Europe?

See this blog post by my colleague Mathias Payer for a general overview, and this comprehensive presentation from EPFL's CS department (EDIC) for details about doing a PhD at EPFL.

Students are employees and by law have 4 weeks of paid vacation a year plus holidays, and hiking, biking, sailing, and winter sports are all extremely popular on the weekends. (In general, Switzerland is one of the top places in the world for work-life balance and life satisfaction, and EPFL follows suit.) Additionally, EPFL and the Swiss NSF have comprehensive provisions and financial support for post docs and grad students with children.

There's excellent public transit around Lausanne, so it's less common to own a car around here than in most US cities: Lausanne has two subway lines, and it sits on a train network that takes you mountain hiking in 20 to 40 minutes (or 1h to the top by cograil!) and gets you to Geneva (the closest international airport) in 40 minutes, to Zurich in 2 hours, to Lyon in 2 hours and a half, to Milan in 3 hours and a half, and to Paris in just 3 hours, 40 minutes.

Overall, Europe is a great place — there's a reason I came back!

Is living in Switzerland expensive?

Yes: the cost of living in Lausanne is roughly comparable to cities like Boston, San Francisco, London, or Paris (e.g. compare MIT's estimated living expenses and living wage estimates with a typical budget at EPFL). Fortunately, the high PhD salaries at EPFL (55-60kUSD/y) are more than enough to live comfortably.

How much will I have to worry about funding?

Typically, not at all. EPFL gives professors guaranteed base funding covering 2 to 4 students, so when grants don't go through there's always a backup. In exchange, EPFL doctoral students are expected to help with teaching most semesters, though at a much lower commitment level (20%) than a typical North American TA position (about 1 day a week vs 2 - 3 days in the US).

Who else could I work with at EPFL? Can I be co-advised?

EPFL is full of incredible people — we have a large CS department (~240 PhD candidates, 90% international) with plenty of great faculty and students working on systems, programming languages, and formal methods. Co-advising is relatively common. Check out the following professors and their labs and students, in no particular order: Viktor Kunčak, Martin Odersky, George Candea, Jim Larus, Bryan Ford, Katerina Argyraki, Mathias Payer, Sanidhya Kashyap, Christoph Koch, Ed Bugnion; see also the per-domain pages for PL and systems, and the Systems @ EPFL website.

I really want to move to Europe / stay in Europe for my PhD. I will apply to EPFL, but where else should I consider?

You have lots of great options.

  • In Switzerland, ETH is EPFL's sister institution and also a top option; if you're interested in my work, have a look at what Peter Müller and Markus Püschel are doing.
  • In France, PhDs are a bit shorter (~3 years) and more focused on a specific topic, with (typically) no teaching requirements. Labs are organized a bit differently (with more professors and fewer students), and most labs are not attached to a specific university (so searching for open positions can be a bit trickier). Look for PhD position announcements on field-relevant mailing lists, and ask your professors for recommendations.
  • Germany's Max-Plank institutes (MPI) are great; the system there is a bit different from the US as well, but if you like what I do you should enjoy most of the PL work at SWS.
  • There's a strong tradition of functional programming and type theory in Denmark (e.g. Bas Spitters and Amin Timany in the LOGSEM group at Aarhus) and Sweden (especially at Chalmers, also at KTH). Both countries rank high on quality-of-life metrics.
  • I don't have much experience with PhDs in the UK, but folks at Cambridge, Oxford, Imperial College, and Edinburgh consistently produce top-notch research in areas I'm interested in.
  • Other places that I'm less familiar with but produce great research include IMDEA in Spain, IST Austria, and TU Munich. Ask your professors and mentors for advice!

How do I apply to join EPFL?

The process is basically the same as US universities, except we have two deadlines: December 15th and April 15th. This page has all the info.

EPFL typically recruits PhD students after they complete a master's degree. For students coming in with a 4-year BS we also have a direct-doctorate program (and an excellent master's program, if you're not feeling ready to start a PhD). Refer to this page for details.

How do I apply to join your group at EPFL?

EPFL PhD admissions are centralized, so you need to apply to the EDIC PhD program (mention me as a potential research advisor in your statement). Bonus point if one of your professors or mentors can send me an email to introduce you.

Do you take interns?

Yes. For summer internships, please apply to the Summer@EPFL program. Otherwise, please send me a CV. Put “EPFL Internship” in the subject, and tell me what work you're interested in and what research you've done in the past, if any. Ideally, have one of your professors or mentors also send me an email to introduce you.

You haven't replied to my email!

Please wait two weeks, then resend it (once).

Core research

I always welcome new collaborators on the following topics; things move too fast to keep an up-to-date list of concrete projects for these on this page, but I keep personal notes: please talk to me if you're interested in working in one of the directions below:

Systems-software verification

I'm generally interested in most systems-verification projects, especially the ones that do not involve shared-memory concurrency (there's no need to introduce concurrent threads — or even I/O — to start running into lots of interesting bugs). Some software that I'm interested in verifying:

  • Binary parsing: compressed image formats, network packets, serialization libraries, etc.
  • (Userland) network stacks and network applications: TCP/IP, SSH, WireGuard
  • Unix style utilities (think Busybox)

In reality, the exact programs matter less than developing the technology that makes them easy to verify, so think of the above as case studies for new ways to approach systems verification. My favored approach at the moment is proof-producing compilation / binary extraction, which we're exploring in Rupicola.

Hardware and whole-system verification

In the long term I'm hoping to deploy long-running verified systems in challenging environments (think medical devices, communications, infrastructure, etc.) so we can afford to design custom hardware and verify it. At MIT I worked on a derivative of Bluespec called Kôika, and I'm interested in extending it to verify embedded-class processors that we'll use to run our verified software on, with end-to-end proofs connecting high-level specifications to the I/O behavior of RTL circuits.

Verification tools and methodology

I'm interested in exploring better ways to engineer, maintain, preserve, and communicate formal proofs (it's amazing how patient academics can be with subpar tools). These days, one of my main interests is proof presentation, in conjunction with literate programming — especially coming up with visualizations and media that make formal proofs easier to develop and comprehend.

Here are two illustrative medium- to long-term projects bring most of the above together:

Problem: Connecting a device to the internet is dangerous: it immediately exposes it to a wide world of potential attacks. Firewalls help, but only to some extent (attackers may hide behind a firewall; network equipment itself may be malicious, etc.).

Often a device does not need access to the whole internet; it just needs to talk to some other specific devices.

Proposal: Build a verified software-based communication stack (userland UDP/IP) from high-level specs (~ RFC diagrams), build a VPN on top of that, and run it on a verified chip built on an FPGA. The final device should expose just two Ethernet ports: one to connect to a LAN, and one to connect a computer or other networked device into.

Use this device to build secure point-to-point communication channels: pair two copies of the device, connect them to the internet on two sides of the world, and then plug two potentially-unsecured computers into the devices. These computers can now talk to each other as if they were connected by a physical wire, without fear of data theft, corruption, or attacks — it's as if they were on their own private network.

Context: A reasonable objection here is that this is already the abstraction that a pure-software VPN provides, but the attack surface there is much larger: bugs in the OS, the IP stack, the VPN software can all lead to compromise of the machines using the communication channel. With a verified physical VPN device, the only exposure that each machine has on both sides of the communication channel is to the other machine. If we achieve sufficiently strong guarantees, it would be safe to use this to use the internet to connect even critical infrastructure that is too complex to fully secure against network attacks (of course, there is still the quality of service issue: we won't provide guarantees against denial of service, network interruptions, etc.).

Problem: Medical devices combine relatively simple software and hardware to deliver vital care; a robust verification platform should be able to bridge the gap from high-level characterization of behavior (think: differential equations for an insulin pump system at night) to low-level implementation (think: voltage shifts on the pins of the physical insulin pump).

Can we develop the technologies and know-how to verify the operation of a simple medical device at reasonable cost?

Can we systematically rule out software or hardware-based misdesigns, security flaws, and functional bugs, thereby ensuring that the device is as trustworthy as its physical components, and isn't weakened by flawed software or hardware engineering?

Can we significantly reduce the amount of trust that patients are forced to put into vitally important systems? (“don't trust, verify”?)

Context: There is an excellent explanation of the genesis of looping, from its beginnings as an open-source hack, in this Atlantic article from 2017. There are now commercialized, FDA approved looping devices.

It requires simple hardware, reasonably simple software, and feasibility of a non-verified implementation has been demonstrated in online open source communities a while back.

Proposal: Develop an end-to-end prototype for looped insulin pumps. As an idealized roadmap (but to be successful this project will require collaboration with doctors, hardware-design experts, HCI folks for usability, and even ideally patients, so the steps below are just guidelines):

  1. Translate discrete equations relating blood sugar measurements with insulin release decisions into relations on I/O traces.

  2. Write a functional implementation of the control system; prove it correct against the trace relations.

  3. Lower the functional programs into an imperative language for which there exists a verified compiler: Bedrock2, or CompCert C, or some other option; compile to assembly.

  4. Verify a simple embedded-class processor chip (probably RISCV?) for execution on FPGA, exposing a Coq-level spec. Connect the monitor and the pump to that device, exposing them through MMIO.

  5. Link the proofs at each level of the stack, connecting the hardware I/O level with the original equations.

Beyond these, below are some concrete directions that I would be happy to explore with a motivated student or an interested collaborator.

Other projects

This is a sample of projects or interesting thoughts that I'm interested in exploring in the short to medium term, sorted by topic; if a project is in this list, it means that I currently do not have a student working on it.

This is not a complete list, and the projects are not fully fleshed out: a lot of what's below is intended to be a conversation starter, and in fact I expect some of the projects below will quickly lead to negative results (and that's OK, we'll write a blog post explaining why and everyone will learn from that).

Some projects below would be suitable as BS and MS projects for current EPFL students; others are more open project directions that would be suitable for prospective PhD students. The classification is fluid: good projects at all levels often have arbitrarily complex follow-ups.

I have certainly missed lots of existing work while compiling this list. If you have worked on, know of work about, or are starting work on any of these topics, I'd love to hear about it! I'll be happy to update the list and add pointers to your work.

Systems and performance engineering

Problem: Web browsers and web servers typically perform decompression and decoding separately. As a result decoding does not benefit from redundancy in the input: a 10kB compressed JSON JSON stream that decompresses into 10MB worth of JSON will cause 10MB worth of JSON to be fed into the JSON decoder, and 10MB worth of decoding work to be performed. This is a giant waste of time: there's really only 10kB worth of decoding work to do here; the rest is redundant.

Proposal: Explore opportunities to fuse decompression and decoding.

Here is a very simple example: suppose that “compression” is “run-length encoding” (00000001111111111000*7,1*10,0*2) and “decoding” is “counting the number of ones”.

Here is the decoder:

for bit in decompressed_stream:
    if bit == 1:
        count += 1

This is hopelessly inefficient compared to the fused decompressor/decoder that operates directly on the compressed stream:

for bit, repeats in compressed_stream:
    if bit == 1:
        count += repeats

We should be able to achieve similar savings with off-the-shelf formats and compression algorithms, such as gzip+JSON. For example, if a key is repeated 1000 times in a given JSON file, we ideally shouldn't have to re-decode it 1000 times. Instead, assuming (optimistically) that each repeated appearance of that key in the uncompressed stream is replaced in the compressed stream by a backpointer to the location of its first appearance, a fused decompressor-decoder should be able to decode the key only once (the first time it sees it) and then reuse the decoded version every time it sees a new backpointer to it.

(Of course, compressed block boundaries won't nicely align with JSON object boundaries — that's partly what makes the problem so interesting. I suppose it would be reasonable to start by exploring the case where we hack together a gzip compressor that's polite enough to put block boundaries in semantically meaningful places from JSON's perspective.)

Context: I first ran into this issue in 2020 in Alectryon. The HTML and JSON that Alectryon generates for unfocused proofs with many open goals is highly redundant, with many literally repeated fragments throughout the output. The result is a web page that weighs 25MB but compresses down to 20kB… and still takes 3 seconds to render, because the HTML decoder has to go through 25MB of HTML after decompressing.

I fixed the issue in Alectryon by using a bespoke compression scheme that's trivial to fuse with decoding: I replace each redundant HTML node by a backpointer (using a custom HTML tag), let the browser decode the deduplicated HTML, and then use a 3-lines JavaScript program to replace each backpointer with a copy of the node that it points to (copying nodes is very cheap — much cheaper than decoding the corresponding HTML over and over).

Amazingly, it turned out in that case that the same savings apply to the HTML encoding side as well: instead of repeatedly re-encoding the same objects as HTML, we can memoize the HTML corresponding to repeated objects and perform the encoding only once, shaving off over 90% of the encoding time in some cases. (Again, this is easy because of the bespoke compression scheme.)

The aim of this project is to try to replicate these savings with off-the shelf compression algorithms.

Problem: Storage media are filled with redundant copies of libraries and binaries. For example, my main workstation has 5 versions of OCaml, 3 versions of LLVM, 4 versions of GCC, 5 versions of Python. VMs and containers contain even more copies of all these. Things are similar or worse on my phone (each app bundles its own copies of all the libraries it uses). File-system compression helps a bit (but not much: most file systems compress data in individual blocks, not across blocks), and deduplication not at all (minor changes in libraries, binaries, or VM layouts move block boundaries, and deduplicating FSs typically deduplicate at the block boundary).

Proposal: Design a deduplicating file-system based on content-aware chunking.

Context: Deduplicating backup systems have been using content-aware chunking for years. The key issue with block-based deduplication is misalignment. Suppose a file containing aaabbbaaabbb is stored on a file system that deduplicates blocks of length three. The file is segmented into four blocks aaa, bbb, aaa, bbb, and only two blocks need to be stored. Now an insertion is performed on the file, and the contents become aaaXbbbaaabbb. The file gets segmented into aaa, Xbb, baa, abb, b, and all deduplication is lost.

Content-aware chunking does away with fixed block lengths, using a hash function to split a file into blocks instead: we'll split every time the hash of the previous three characters is 0. Let's use xyz ↦ x + y + z mod 3 as the hash on our our example:

...^     hash(aaa) = 3*a     mod 3 = 0 → cut after "aaa"
 ...^    hash(aab) = 2*a + b mod 3 ≠ 0
  ...^   hash(abb) = a + 2*b mod 3 ≠ 0
   ...^  hash(bbb) = 3*b     mod 3 = 0 → cut after "bbb"
    ...^ hash(bba) = 2*b + a mod 3 ≠ 0
     ... …etc.

In other words, we get the same two blocks aaa, bbb, aaa, bbb. What happens when we perform the insertion?

...^      hash(aaa) = 3*a     mod 3 = 0 → cut after "aaa"
 ...^     hash(aaX) = 2*a + X mod 3 ≠ 0
  ...^    hash(aXb) = a + X + b mod 3 ≠ 0
   ...^   hash(Xbb) = X + 2*b mod 3 ≠ 0
    ...^  hash(bbb) = 3*b     mod 3 = 0 → cut after "bbb"
     ...^ hash(bba) = 2*b + a mod 3 ≠ 0
      ... …etc.

Here we get the blocks aaa, Xbbb, aaa, bbb — using a content-based function to decide on block boundaries yields a much more robust segmentation.

Content-aware deduplication works great for backup systems where each new snapshot of a system is taken by scanning the whole system and rehashing all of it. Can we make it work on a live system?

Problem: Some common patterns in code generated by interpreted as well as lazy languages give a hard time to branch hardware branch predictors. There should be ways to generate code patterns that work better with branch prediction, or to propagate relevant information to the hardware.

For example, a typical functional language may represent pending or forced computations like this:

type α thunk = Forced of α | Suspended of unit  α

and lazy values like that:

type α lazy = ref (α thunk)

Then, every attempt to force a lazy value will branch on the thunk to determine if it has been forced yet. For a given lazy value, the branch is trivially predictable: it is never taken except on the first taken (assuming “taken” means that the value has not been forced yet — the Suspended case above). But if we have a single library function that performs that branch for all lazy values, then that branch is much harder to predict (branch histories help, but they unalias branches on the same value happening at two different locations in the program, so they are not ideal either).

Similarly, a typical bytecode interpreter for a non-JITted language will have a piece of code implementing that bytecode's notion of conditional branches, as that piece of code will use a branch conditioned on a register or stack entry of the bytecode's VM. The result is that all branches of the interpreted program are overlaid onto a single branch of the interpreter.

Proposal: Explore software and hardware approaches to mitigate this issue: in the lazy case, maybe this means assigning distinct branch instructions to each lazy value. In the bytecode case, this could mean exposing to the hardware the value of the interpreted PC.

Problem: Some applications use lots of memory. RAM compression helps (recent Windows kernels do it by default), but it comes with high CPU costs, and it is usually done in application-agnostic way (by the OS or the architecture). Can we improve things using application-specific algorithms and compressed datastructures? Are there application-specific ways to compress RAM that interact better with the application's access patterns?

Proposal: Investigate application-specific compression schemes compatible with operations performed on the decompressed data.

For example, Emacs uses a gap buffer, and every editing operation requires moving the gap to the right place (a linear-time operation) and then performing edits. A standard compression algorithm like gzip applied to the whole buffer would make motion and changes costly (changes at position x may require re-compressing every position after x).

Alternative compression schemes may perform better: for example, compressing both sides of the gap separately, from each end (left-to-right on the left part, right-to-left on the right part), would have better locality, and changes would not require re-encoding (on the other hand, motion would).

(There's a wealth of work out there on elaborate data structures that represent data more densely. This proposal is narrower in scope: can we stick closely to existing datastructures, but carefully apply existing compression schemes to slices of the program's data?)

Problem: Some programs, like SMT solvers, put a lot of pressure on the data cache, but not on the instruction cache. Can we use the instruction cache to store data?

Proposal: Encode data in the instruction cache through a form of just-in-time compilation.

For example, consider a multiply-and-add type of computation with a constant matrix M, a constant vector B, and input A, and an output C: C = M * A + B. Assume that M is a matrix that barely fits in cache, without enough space to store B as well. Then we should be able to replace the + B part of the computation (implemented with a loop that consumes very little instruction cache but loads B into the data cache) by a dynamically generated piece of assembly that encodes the + B as a sequence of memory operations with immediates holding the values of B. Concretely, for B = [12, 4, 8, 27, 31, 1, 44, 18], instead of running a small loop with a tiny instruction cache footprint, like this:

for (i = 0; i < 8; i++)
  c[i] += b[i];

… we'll run an unrolled computation that hardcodes B, with a larger instruction cache footprint but no data cache footprint, like that:

c[0] += 12; c[1] += 4; c[2] += 8;  c[3] += 27;
c[4] += 31; c[5] += 1; c[6] += 44; c[7] += 18;

In assembly, this would mean replacing this:

     mov     edx, DWORD PTR [rsi+rax]
     add     DWORD PTR [rdi+rax], edx
     add     rax, 4
     cmp     rax, 32
     jne     .loop

with this:

add     DWORD PTR [rdi],    12
add     DWORD PTR [rdi+4],  4
add     DWORD PTR [rdi+8],  8
add     DWORD PTR [rdi+12], 27
add     DWORD PTR [rdi+16], 31
add     DWORD PTR [rdi+20], 1
add     DWORD PTR [rdi+24], 44
add     DWORD PTR [rdi+28], 18

In more general cases, less efficient encodings are possible: for example, we can store a value as an immediate in an instruction that sets a register, followed by a ret to jump back to the caller (this is morally the same as storing an array of values as a flat array of functions that each return the corresponding value of the original array — “flat” because the array does not store pointers to functions, it stores the assembly code of each function directly).

Static analysis, program synthesis

Context: Dynamic variables (defvar and defcustom in Emacs Lisp, make-parameter in Racket) are a convenient way to pass arguments deep down a call chain without polluting every function along the chain with extra arguments (avoiding so-called tramp data). Other languages do this with global variables, implicit parameters 1, etc.

Problem: Dependencies on the values of dynamic variables are not easy to track: for example, lots of code in Emacs incorrectly depends on the value of case-fold-search (a variable that controls whether text searches are case-sensitive). Lisps have a great way to locally shadow a global (that's what let-binding is about), but authors often forget these let-bindings.

Proposal: Implement a static analysis pass that bubbles up dependencies on dynamic variables. Evaluate it by annotating Emacs primitives and seeing how bad the problem of dynamic variable pollution is on popular Emacs libraries. Annotating primitives could be done by hand, by instrumenting the Emacs binary, or by statically analyzing its C sources.


Global variables in Haskell, John Hughes, 2004

Problem: Advanced IDEs offer powerful keybindings, but users commonly don't use them optimally.

For example, in Emacs, I can use C-a <tab> C-SPC C-e C-w to cut a whole line of text without its leading indentation:

C-a      ;; beginning-of-visual-line
<tab>    ;; indent-for-tab-command
C-SPC    ;; set-mark-command
C-e      ;; end-of-visual-line
C-w      ;; kill-region

It's much more efficient, however, to use M-m C-k:

M-m      ;; back-to-indentation
C-k      ;; kill-visual-line

The same thing happens all the time in vim and other modal editors: users are constantly writing single-use programs to move around a text buffer and perform edits.

Proposal: Design and implement a program optimizer for keyboard macros to suggest shorter key combinations as users type.

Peephole optimizations are probably a good place to start, but eventually we want context-sensitive optimizations too: for example, C-; (flyspell-auto-correct-previous-word), is a valid alternative to C-5 C-b C-d C-d Em when the buffer contains EMacs right before the current point, and so is M-b M-c (both correct the capitalization to Emacs).

Eventually, we need the system to be fast enough to run as the user type, and there are interesting questions around the feedback mechanisms to use.

Proof presentation and exploration

Context: The ideal way to think of most proof assistant goals isn't plain text; early experiments with company-coq, Lean widgets, and Alectryon renderings show that there's plenty of demand for domain-specific graphical representations of proof states.

Proposal: Collect examples of domain-specific renderings, and implement them in the browser and in a proof assistant. Study specifically the diagrams used in published textbooks: what representations do people use when they are not constrained by plain text?

Context: Proof assistant goals often mix concepts from many different domains, each of which may be rendered differently for optimal clarity. For example, the following goal mixes programming language syntax (C code), logical syntax (Coq matching), graphs (separation logic), math (arithmetic formulae), and data structures (separation logic predicates):

{{ let t := tree p0 (Branch (k0, v0) t1
                            (match Heq return … with
                             | eq_rect => Branch (k1, v1) Empty t1
                             end)) in
   t * (t -* tree p tr) }}
  while (p && p->k != k0) {
    sum += p->u;
    p = p->k < k0 ? p->l : p->r;
  v = p ? p->v : null;
{{ λ out. out["v"] = (find k tr).(v) ∧
          out["sum"] = sum (map v (path k tr)) }}

Problem: Domain-specific visualization tools tend to be designed in a vacuum, and are typically not very modular: different fields have come up with different ways to build visualizations (graphs, diagrams, pictures, pretty-printed text, etc.), but they can't be combined easily. As a concrete example, Graphviz (one of the top graph-drawing programs) does not support displaying anything but text or fixed-sized images within a graph node.

Partly as a result, rich goal-visualization tools are rare in proof assistants, and mostly focus on a single custom kind of visualization. In the vast majority of cases it is plain text (possibly with some notations) that is used to represent everything, and that leads to goals that are much harder to read than needed.

Constraint-solving diagramming and rendering engines partly solve this problem, but:

  1. Many rendering algorithms have not been implemented efficiently as constraint solving problems (I'm not aware of an encoding of Wadler-style pretty printers or Coq's box-layout models as a general constraint-solving problem).

  2. Constraint solvers are not often continuous in the sense that small perturbations of the input preserve most of the output. When we're looking at consecutive proof states, we want to see clearly what changed from one step to the next, even if that leads to a slightly suboptimal rendering for the second goal.

  1. Design a reasonably modular constraint-solving based renderer, encode common kinds of pretty-printing as well as a few graph layout problems, and use it to render mixed proof goals.

  2. Design an appropriate extension mechanism to allow new kinds of rendering along with a language to describe layout problems, and implement a notation mechanism in Coq that lets users describe how to render specific types (or symbols?) using this layout language.

  3. Study continuity by constraining new renderings to be sufficiently “close” to previous ones.

Proof engineering

Problem: The magic wand operator separation logic isn't satisfactory: it allows only one single borrow from a given structure. Lots of extensions have been proposed. What's the right way to represent repeated borrows from a data structure?

As a concrete example, consider the following setup: we have an associative map (say, a tree) that stores keys and values inline (for concreteness we'll say that keys are bytes and values are constant-size objects).

We write map p m to say that map m is at address p and val p v to say that value v is at address p. We can use a simple representation for m, say a list of key * value pairs. For example, we may say map 42 [(37, v1), (12, v2)], which expands to a set of separating conjunctions, including val p1 v1 and val p2 v2 for some pointers p1 and p2.

First, let's consider the single-borrow case (assume that p0 points to a map that contains k1 and lookup returns an inner pointer to the value corresponding to key k1):

val* p1 = lookup(p0, k1);
memset((void*)p1, (int)0, sizeof(val));

How do we reason about this code fragment? To call memset we need ownership of p1, so we need a fact like val p1 v1. Assuming we start with map p0 [(k1, v1), (k2, v2), …], what do we know after running lookup, right before the memset?

It can't be val p1 v1 * map p0 [(k1, v1), (k2, v2), …], because lookup returns an internal pointer, not a copy, so there would be double ownership of p1 (we could say val p1 v1 ∧ map p0 [(k1, v1), …], but the non-separating conjunction is quite hard to work with in separation logic).

It can't be val p1 v1 * map p0 [(k2, v2), …], because this loses track of the space that was allocated for the key k1 inside the map.

This is what the magic wand operator is for; it's the separation-logic version of implication. We write val p1 v1 * (val p1 v1 -* map p0 [(k1, v1), (k2, v2), …]) to indicate that we have “borrowed” val p1 v1 out of the map. This is enough to then call memset, though we need to be careful about how to plug the mutated v1 back into the data structure, since after calling memset we have val p1 zeroes * (val p1 v1 -* map p0 [(k1, v1), (k2, v2), …]), and hence what we have and what's on the left of the magic wand doesn't match anymore (there's a paper about this by Andrew Appel, I think).

But now consider something trickier:

val* p1 = lookup(p0, k1);
val* p2 = lookup(p0, k2);
memmove((void*)p1, (void*)p2, sizeof(val));

We can't even call lookup the second time: we have a precondition like val p1 v1 -* map p0 [(k1, v1), (k2, v2), …], but we need a clean map fact to call lookup.

Proposal: Explore solutions to this problem. Malte Schwerhoff and Alex Summers (and perhaps other folks at ETH) had work on iterated separating conjunctions that approached this problem. I'd like to explore explicitly annotating borrows in the logical representation of the data instead (use list (key * option value) instead of list (key * value), where None means that the map has temporarily relinquished ownership of the corresponding value.

We have a more advanced version of this in Rupicola with two levels of borrowing (reserving and borrowing) and a more expressive type (instead of option you use a type that records the original pointer so that you can unborrow more easily). One nice aspect is that it allows you to borrow subparts of the whole data structure if you change the constructors of the type that represents the structure of the object (like list in this case); for example, you can borrow complete subtrees, mutate them separately, and plug them back in.

There are plenty of interesting questions on automating reasoning about this, and plenty to explore around representation choices, too.

Problem: What is a good representation for hardware circuits (specifically one to be used as a compilation target) in the context of a proof assistant like Coq?

Context: For circuits respecting the digital abstraction, there are some properties that we want to enforce (such as absence of combinational cycles), but not necessarily with types (we may want to prove that). Encoding choices have a strong impact on how easy it is to reason about circuits (for example, a fully general graph encoding will not allow us to write a structurally decreasing interpreter, because of termination concerts).

The Kôika language uses a cute encoding hack to avoid reasoning explicitly about cycles and sharing in the circuits it generates: its type of circuits is a tree, and the DAG structure is encoded using implicit in-memory sharing; then, an unverified pass reconstructs a proper DAG by testing for physical equality (in other words, there are no let-bindings in Kôika's circuit type). Let's take a concrete circuit for illustration purposes (A and B are arbitrary circuits, potentially large):


In Kôika, the same circuit will be represented as A B) (A ¬B), or more precisely as let a := A in let b := B in a b) (a ¬b); the two instances of a are physically equal, so it's possible to reconstruct the graph structure and the proper circuit sharing, but the reasoning and proofs get to treat circuits as trees, and reasoning is hence greatly simplified.

(It is interesting to think of what kind of optimizations are expressible on this tree representation: in particular, graph simplifications and circuit optimizations can be performed as part of the construction of the graph, but not as a separate pass after the fact, since applying a recursive function that reconstructs a tree circuit destroys sharing.)

In general, there are many constraints to satisfy: simplicity, ease of reasoning, support for local and global optimizations, compatibility with (or ease of export to) other representations, expressiveness, performance, etc.

Proposal: Design or identify a usable representation of circuits, and use it to retarget the Kôika compiler, reducing its trusted base. Explore extensions to a general representation of modules, including theorems about linking.

Problem: Coq's type inference does not introduce new type variables (it does not allow generalization). For example, Coq will refuse to typecheck the following definition:

Cannot infer ?T in the partial instance "?T -> ?T0" found for the type of f.

This makes writing non-dependently typed programs in Coq more painful than in (say) OCaml.

Proposal: The problem is not solvable in general because of the power of Coq's type system; but it would still be nice to handle simple cases like the above; it's possible that it would be enough to try and plug a type variable in each remaining hole after Coq's own type inference (Hindley-Milner's standard generalization rules). The most interesting insight here would probably be to see in which way this fails: does it ever lead to confusing errors? Does it lead to other issues?

Productivity and debugging

Problem: Backtraces are a terrible way to understand and debug issues. At worst they show no nothing but a list of lines that led to an error (without concrete values); at best they show the state of the program when it crashed, but not how it got there.

Proposal: Implement a tracing debugger that records every single line executed by a program, with concrete values substituted for variables and expressions. The result should be a full trace, with indentation indicating call depth.

This will generate huge traces. That's OK. It would be particularly useful for bugs deep in unfamiliar code (and it would beat detailed stack traces by far).

Here is a concrete example. Take this Python program:

def swap(l, src, dst):
    l[src], l[dst] = l[dst], l[src]
def shuffle(l: List[int]):
    for i in range(len(l)):
        swap(l, i, random.randint(i, len(l)))

l = MutableString("abcde")

It has a bug: the bounds of randint should be i, len(l) - 1. Running it prints the following backtrace, which does not make it clear exactly what happened:

Traceback (most recent call last):
  File "~/shuffle.py", line 10, in <module>
  File "~/shuffle.py", line 6, in shuffle
    swap(l, i, random.randint(i, len(l)))
  File "~/shuffle.py", line 2, in swap
    l[src], l[dst] = l[dst], l[src]
IndexError: list index out of range

The proposal is to generate this instead:

# for i in range(len(l)):
  for i⇒0 in range(len('abcde')⇒5):
     # swap(l, i, random.randint(i, len(l)))
       swap('abcde', 0, random.randint(0, 5)⇒1)
       # def swap(l, src, dst):
       ⤷ def swap(l⇒'abcde', src⇒0, dst⇒1):
           # l[src], l[dst] = l[dst], l[src]
             l[0], l[1] = l[1]⇒b, l[0]⇒a
# for i in range(len(l)):
  for i⇒1 in range(len('bacde')⇒5):
     # swap(l, i, random.randint(i, len(l)))
       swap('bacde', 1, random.randint(1, 5)⇒5)
       # def swap(l, src, dst):
       ⤷ def swap(l='bacde', src=1, dst=5):
           # l[src], l[dst] = l[dst], l[src]
             l[1], l[5] = l[5]⇒💥IndexError, l[1]⇒a
             💥 IndexError: list index out of range

Context: Traditional profilers are great at identifying obvious performance bottlenecks and blatant performance bugs. Unfortunately (perhaps because profilers work so well for that?), blatant performance bugs are not why most programs are slow — there's only so many places in a program where you can replace a slow algorithm with a fast one.

In fact, many established codebases are slow despite using efficient algorithms, and show no obvious bottlenecks with a typical profiler. Instead, inefficiencies often come from doing redundant or unnecessary work.

Problem: Traditional profilers don't help with identifying and eliminating useless computations: they don't automatically identify opportunities for memoization, for example.

Proposal: Design a different kind of profiler that uses memory snapshots and taint tracking to flag redundant or unused work.

Concretely, we'd want to show programmers information on how many times each function ended up computing distinct results from distinct inputs — not just how many times each function was called (this would help decide whether it's worth caching (or “memoizing”) the results of the function).

We'd also want to get information about how many times the result of a computation was used: if the results of a computation are unused, we'd want to suggest removing that computation, or disable it for certain inputs.

Finally, we might want to do this at a finer granularity than a single function, to identify places where computations performed by a given function differ in trivial ways, and the bulk of the work could be reused across calls.

Problem: The world depends on massive amounts of code written in legacy languages (think COBOL, early versions of Fortran, etc.), and we don't have enough developers to maintain them.

Proposal: Write a translator from a legacy language (say, COBOL) into a more popular one (say, Python). The result should be as readable and idiomatic as possible, so that a developer experienced in Python will be able to work with the result, using the standard suite of Python tools: debuggers, static analyzers, etc. (maybe even profilers!).

Then write a reverse translator, so that changes performed on the translation can be back-propagated to the source. Not all changes can be represented, of course, but it should be possible to reorder computations, change constants, add function parameters, introduce logging instructions, move code from one project to another (e.g. support for a new input format), etc.

Problem: Adding support for a new language (think syntax highlighting, semantic navigation, indentation, completion, structural editing, code folding, etc.) to a text editor is a very complex and time consuming process. Initiatives like the language server protocol and tree-sitter mitigate the n² problem (n editors × m languages), but implementing an LSP server or tools on top of a tree-sitter AST is still quite time-consuming.

Proposal: Enhance a parser-generator framework (say, Menhir) to accept indentation and other annotations in such a way that the grammar serves as a single source of truth for most syntactic features of the IDE (highlighting, indentation, structural editing, code folding, and related features). Leverage existing error-recovery mechanisms to (mostly) correctly highlight and indent partial or broken programs.

Software engineering

Problem: Semantic versioning is supposed to eliminate dependency hell by making version numbers informative and predictable. In practice, developers can make mistakes (incrementing version numbers too conservatively or too eagerly, e.g. by forgetting certain changes). How common are these mistakes, and how often do they cause issues? (Some ecosystems, like Elm, automatically enforce a form of semver by diffing APIs.)

Proposal: Use static analysis (API diffs) to check assigned version numbers for consistency with semver. Do human-assigned version numbers conform with semantic versioning?

Compare this with tools like semantic-release. How often are these tools wrong? (They use human annotations to determine the next version number, so they leave some responsibility to developers).

How about automatic enforcement with API diffs? How often does that make mistakes? (That technique can miss relevant changes that don't alter the signature or name of a function but change its semantics.) To answer this question API diffs will not be enough, but we can gather interesting data by looking at the amount of breakage if any that follows an update (if program P depends on library L and library L gets a minor or patch release then the test suite of P should not be affected; if it is, then L's change probably should have been classified as major and that change was missed).

Notes: Théo Zimmermann kindly pointed me to this very relevant paper, which suggests that there is wide deviation between human-assigned version numbers and proper semver. This is a great starting point to refine the question above (specifically: do projects that claim to follow semver in fact follow it?), and to explore the other questions that follow it.

(I suspect this has been done in various forms, so it might be enough to do a literature search and produce a nice summary of the results across various blogs and papers and the CIs of some large FLOSS projects.)

Problem: Compiler updates add a mix of new optimizations and new features, often at the cost of higher compiler times. How much do these updates actually buy you in terms of performance? Is that worth the added compilation cost?

Proposal: Pick a few interesting programs from the 60s to the 80s and compile them with every version of, say, GCC from then to now, on a modern machine. How much faster does the result get? How much slower does the compilation process get?

Context: A lot of programming languages and software projects are converging towards Sphinx as their documentation system. Sphinx is built on top of Docutils, an extensible document-processing framework originally written for Python's documentation.

Problem: Nominally, Docutils support arbitrary frontends. In practice one of them dominates by far: reStructuredText (reST). Unfortunately, while reST is extensible at the block level, it does not support nesting for inline constructs. Other frontends like MyST (for Markdown) suffer from various related problems.

Proposal: Design and implement a conservative extension of reST (ideally with minimal modifications) that solves its long-standing issues. Can its parser be fixed to support nesting in a simple way? (See the old notes about this on the reST website) Is there a reasonable alternative to significant indentation that avoids having to write pages and pages of text indented 9 spaces to the right?


Context: Role-playing video games are built around “quests” that players collect and resolve as they explore the world: to attain a goal, players need to perform a series of steps (talking to a character, collecting an item, visiting a location, etc.). A simple model to understand these quests is labeled transition systems: actions taken by the player influence the state of the game and hence the state of a given quest (the labels are the entries in the players “journal” or “quest log”).

Additionally, RP video games are very often extensible, so quests are expressed in a custom programming language that so-called “modders” use to describe new quests or patch existing ones.

Problem: RP video games are very buggy. It's common for quests to get stuck (a place in the state machine where no transitions are possible), to require repeated interactions (the transition system fails to take a step despite a trigger), to require a reset (the state of the whole transition system is corrupt), etc.

Proposal: Use automated program verification to prove functional assertions and reachability / liveness properties about complex quests, such as:

  • There is always a way to complete a quest: either success or failure, but no quest that persists forever, unfinished in the journal (a form of liveness).

  • It's not possible to obtain the reward without completing a specific step (a form of functional correctness)

  • It is not possible to receive a quest reward twice (a form of linearity).

etc.: existing bugs and patches provide a rich source of inspiration for interesting properties.

Problem: Checking Coq proofs can be extremely slow, because it requires re-executing an arbitrarily complex program. We'd like a consistently fast way to check a proof, so as to be able to cheaply verify a claimed proof of a property, possibly using a protocol or encoding that requires extra work on the prover's side (that is, on the side of the person supplying the proof, as opposed to the verifier who checks it).

Context: Research on PCP (probabilistically checkable proofs) has progressed a lot lately, and PCP in theory provides a cheap way to verify proofs (albeit at the cost of generating very large certificates).

Proposal: Investigate a PCP encoding of traces of Coq's VM, with the goal to confirm the correctness of a claimed reduction.


The projects below would produce artifacts useful for many people, but on their own may not directly produce a publishable research contribution. They would be good as undergraduate projects for engineering-focused students with strong hacking skills. Some of them would likely lead to interesting research questions and follow-ups, too.

Context: Alectryon currently supports Coq and has preliminary support for Lean3; folks are working on porting it to Lean4. Experimenting with new proof languages yields interesting insights into similarities and differences and how different communities like to present their proofs.

Proposal: Extend Alectryon to support other protocols and proof languages. We have early support for EasyCrypt; F* would be another natural candidate (early F* work was the inspiration for Alectryon), and so would be HOL-light.

Problem: We are developing ways to present proofs and documents in a convenient and accessible way, but these documents are static: readers cannot readily attach comments to relevant parts and discuss inline. (Contrast this with in-context discussions on e.g. Gitlab or SourceHut.)

Proposal: Extend reStructuredText to support inline annotations and discussions; ideally these annotations should be robust to changes in the document (perhaps comments visible by everyone could be expressed as patches to the original document, so that there remains a single source of truth).

Context: Umut Acar and Anil Ada at CMU have a CMS (https://www.diderot.one/) that does this sort of stuff, though in a custom format. If Diderot's source were available this would be a valid option.

Problem: Most free software uses text-only unit and integration tests (Emacs is like that, for example). Unfortunately, these tests fail to catch all sorts of interesting bugs that only manifest graphically (such bugs are hard to test for non-graphically, so large parts of the codebase are in fact untested).

Proposal: Apply the visual regression testing methodology to existing, large free software apps: use a DSL (essentially keyboard macros and mouse clicks) to drive interactions with the software, taking regular screenshots, and compare these screenshots across versions. Flag revisions that create display changes and review them for bugs. Does this find many new bugs? Does it flag bugs that were later reported by users and manually diagnosed?

Context: I implemented something like this a while back for Emacs, specifically in the context of automatic bisecting: there was a specific display bug that I observed and I implemented automatic screenshotting which, combined with git bisect, pinpointed the source of the issue in a few hours of automatic recompilation and screenshot cycles. Visual regression testing has also been very useful in testing company-coq, Alectryon, and other packages (it caught many bugs that were not caught by other testing mechanisms).

My simplest proposals in that direction live on the Coq bug tracker. The difficulty ranges from a solid afternoon of work to a few days of work. Beyond that, here are things that don't fit on the bug tracker:

Proposal: Better unification

Coq's unification errors are not very eager, so often Coq reports that the whole goal fails to unify with a whole lemma; then you spend a lot of time manually padding arguments to the lemma until it applies (typically because of a dependent type) or until you get a proper error that pinpoints the issue. The proposal is to report on a more eager unification attempt by pinpointing where a naive unification ended up failing.

Proposal: Single-step eauto

For extensible forward reasoning I use a trick with typeclasses eauto + shelve. This trick is too complicated and should be replaced by an eauto_one with db that applies any lemma in database db that matches the current goal.


I keep a private list of attack ideas (proposals for offensive security research); I'm happy to chat about it if we meet in person.