

Soutenance en vue de l'obtention de l'Habilitation à Diriger des Recherches

16 janvier 2014

# Reasoning between programming languages and architectures

Francesco Zappa Nardelli

http://www.di.ens.fr/~zappa/readings/hdr





## Illustrate my approach to research

## III Explain one research project ch



## 1. A Better World



```
int a = 1;
int b = 0;
```

#### Thread 1

```
int s;
for (s=0; s!=4; s++) {
    if (a==1)
      return NULL;
    for (b=0; b>=26; ++b)
      ;
}
```

```
int a = 1;
int b = 0;
```

#### Thread 1

```
int s;
for (s=0; s!=4; s++) {
    if (a==1)
        return NULL;
    for (b=0; b>=26; ++b)
      ;
}
```

```
b = 42;
printf("%d\n", b);
```

```
int a = 1;
int b = 0;
```

#### Thread 1

```
int s;
for (s=0; s!=4; s++) {
    if (a==1)
        return NULL;
    for (b=0; b>=26; ++b)
      ;
}
```

#### Thread 2

b = 42;
printf("%d\n", b);

```
int a = 1;
int b = 0;
```

#### Thread 1

int s;
for (s=0; s!=4; s++) {
 if (a==1)
 return NULL;
 for (b=0; b>=26; ++b)
 ;
}

```
int a = 1;
int b = 0;
```

#### Thread 1

```
int s;
for (s=0; s!=4; s++) {
    if (a==1)
        return NULL;
    for (b=0; b>=26; ++b)
      ;
}
```

```
b = 42;
printf("%d\n", b);
```

#### Thread 1

```
int s;
for (s=0; s!=4; s++) {
    if (a==1)
       return NULL;
    for (b=0; b>=26; ++b)
      ;
}
```

```
int a = 1;
int b = 0;
```

#### Thread 1

#### Thread 2

```
int s;
for (s=0; s!=4; s++) {
    if (a==1)
        return NULL;
    for (b=0; b>=26; ++b)
      ;
}
```

Thread 1 returns without modifying b

```
int a = 1;
int b = 0;
```

#### Thread 1

```
int a = 1;
int b = 0;
```

#### Thread 1

```
b = 42;
int s;
for (s=0; s!=4; s++) {
                                   printf("%d\n", b);
  if (a==1)
    return NULL;
  for (b=0; b>=26; ++b)
    9
}
            Thread 1 returns without modifying b
     Thread 2 is not affected by Thread 1 and vice-versa
               (this program is data-race free)
             We expect this program to print 42.
```

```
int a = 1;
int b = 0;
```

#### Thread 1

```
int s;
for (s=0; s!=4; s++) {
    if (a==1)
      return NULL;
    for (b=0; b>=26; ++b)
      ;
}
```

```
int a = 1;
int b = 0;
```

#### Thread 1

```
int s;
for (s=0; s!=4; s++) {
    if (a==1)
      return NULL;
    for (b=0; b>=26; ++b)
      ;
}
```

```
int a = 1;
int b = 0;
```

#### Thread 1

```
int s;
for (s=0; s!=4; s++) {
    if (a==1)
      return NULL;
    for (b=0; b>=26; ++b)
    ;
}
```

#### Thread 2



...sometimes we get 0 on the screen

```
int s;
for (s=0; s!=4; s++) {
    if (a==1)
       return NULL;
    for (b=0; b>=26; ++b)
      ;
}
```

```
movl a(%rip), %edx  # load a into edx
movl b(%rip), %eax  # load b into eax
testl %edx, %edx  # if a!=0
jne .L2  # jump to .L2
movl $0, b(%rip)
ret
.L2:
movl %eax, b(%rip)  # store eax into b
xorl %eax, %eax  # store 0 into eax
ret  # return
```

The outer loop can be (and is) optimised away

| MOVL  | a(%rip), %edx | # load a into edx             |
|-------|---------------|-------------------------------|
| movl  | b(%rip), %eax | # load b into eax             |
| testl | %edx, %edx    | # if a!=0                     |
| jne   | .L2           | # jump to .L2                 |
| movl  | \$0, b(%rip)  |                               |
| ret   |               |                               |
| .L2:  |               |                               |
| movl  | %eax, b(%rip) | <pre># store eax into b</pre> |
| xorl  | %eax, %eax    | <pre># store 0 into eax</pre> |
| ret   |               | # return                      |
|       |               |                               |

| movl  | a(%rip), %edx            | <pre># load a into edx</pre>  |
|-------|--------------------------|-------------------------------|
| movl  | b(%rip), %eax            | <pre># load b into eax</pre>  |
| testl | %edx, %edx               | # if a!=0                     |
| jne   | .L2                      | # jump to .L2                 |
| movl  | \$0, b(%rip)             |                               |
| ret   |                          |                               |
| .L2:  |                          |                               |
| movl  | <pre>%eax, b(%rip)</pre> | <pre># store eax into b</pre> |
| xorl  | %eax, %eax               | <pre># store 0 into eax</pre> |
| ret   |                          | # return                      |
|       |                          |                               |

| movl  | a(%rip), %edx            | <pre># load a into edx</pre>  |
|-------|--------------------------|-------------------------------|
| movl  | b(%rip), %eax            | <pre># load b into eax</pre>  |
| testl | %edx, %edx               | # if a!=0                     |
| jne   | .L2                      | # jump to $.L2$               |
| movl  | \$0, b(%rip)             |                               |
| ret   |                          |                               |
| .L2:  |                          |                               |
| movl  | <pre>%eax, b(%rip)</pre> | <pre># store eax into b</pre> |
| xorl  | %eax, %eax               | <pre># store 0 into eax</pre> |
| ret   |                          | # return                      |
|       |                          |                               |

```
movl a(%rip), %edx  # load a into edx
movl b(%rip), %eax  # load b into eax
testl %edx, %edx  # if a!=0
jne .L2  # jump to .L2
movl $0, b(%rip)
ret
.L2:
movl %eax, b(%rip)  # store eax into b
xorl %eax, %eax  # store 0 into eax
ret  # return
```

|       | a(%rip), %edx                       | <pre># load a into edx</pre>                     |
|-------|-------------------------------------|--------------------------------------------------|
| movl  | b(%rip), %eax                       | <pre># load b into eax</pre>                     |
| testl | %edx, %edx                          | # if a!=0                                        |
| jne   | .L2                                 | # jump to .L2                                    |
| movl  | \$0, b(%rip)                        |                                                  |
| ret   |                                     |                                                  |
| .L2:  |                                     |                                                  |
|       |                                     |                                                  |
| MOVL  | %eax, b(%rip)                       | <pre># store eax into b</pre>                    |
|       | <pre>%eax, b(%rip) %eax, %eax</pre> | <pre># store eax into b # store 0 into eax</pre> |

```
movl a(%rip), %edx  # load a into edx
movl b(%rip), %eax  # load b into eax
testl %edx, %edx  # if a!=0
jne .L2  # jump to .L2
movl $0, b(%rip)
ret
.L2:
movl %eax, b(%rip)  # store eax into b
xorl %eax, %eax  # store 0 into eax
ret  # return
```

The compiled code saves and restores **b** Correct in a sequential setting, but...

| movl  | a(%rip), %edx            | <pre># load a into edx</pre>  |
|-------|--------------------------|-------------------------------|
| movl  | b(%rip), %eax            | <pre># load b into eax</pre>  |
| testl | %edx, %edx               | # if a!=0                     |
| jne   | .L2                      | # jump to .L2                 |
| movl  | \$0, b(%rip)             |                               |
| ret   |                          |                               |
| .L2:  |                          |                               |
| movl  | <pre>%eax, b(%rip)</pre> | <pre># store eax into b</pre> |
| xorl  | %eax, %eax               | <pre># store 0 into eax</pre> |
| ret   |                          | # return                      |

```
int a = 1;
int b = 0;
```

#### Thread 1

movl a(%rip),%edx
movl b(%rip),%eax
testl %edx, %edx
jne .L2
movl \$0, b(%rip)
ret
.L2:
movl %eax, b(%rip)
xorl %eax, %eax
ret

#### Thread 2

b = 42;
printf("%d\n", b);

```
int a = 1;
int b = 0;
```

#### Thread 1

| movl  | a(%rip),%edx  |
|-------|---------------|
| movl  | b(%rip),%eax  |
| testl | %edx, %edx    |
| jne   | .L2           |
| movl  | \$0, b(%rip)  |
| ret   |               |
| L2:   |               |
| movl  | %eax, b(%rip) |
| xorl  | %eax, %eax    |
| ret   |               |
|       |               |

#### Thread 2

b = 42;
printf("%d\n", b);

- Read a (1) into edx

```
int a = 1;
int b = 0;
```

#### Thread 1

| movl  | a(%rip),%edx  |
|-------|---------------|
| movl  | b(%rip),%eax  |
| testl | %edx, %edx    |
| jne   | .L2           |
| movl  | \$0, b(%rip)  |
| ret   |               |
| L2:   |               |
| movl  | %eax, b(%rip) |
| xorl  | %eax, %eax    |
| ret   |               |

- Read a (1) into edx
- Read b (0) into eax

```
int a = 1;
int b = 0;
```

#### Thread 1

movl a(%rip),%edx
movl b(%rip),%eax
testl %edx, %edx
jne .L2
movl \$0, b(%rip)
ret
.L2:
movl %eax, b(%rip)
xorl %eax, %eax
ret

#### Thread 2

b = 42;
printf("%d\n", b);

- Read a (1) into edx
- Read b (0) into eax
- Store 42 into b

```
int a = 1;
int b = 0;
```

#### Thread 1

a(%rip),%edx movl movl b(%rip),%eax testl %edx, %edx jne .L2 movl \$0, b(%rip) ret .L2: %eax, b(%rip) movl %eax, %eax xorl ret

- Read a (1) into edx
- Read b (0) into eax
- Store 42 into b
- Store eax (0) into b

```
int a = 1;
int b = 0;
```

#### Thread 1

a(%rip),%edx movl movl b(%rip),%eax testl %edx, %edx jne .L2 \$0, b(%rip) movl ret .L2: movl %eax, b(%rip) xorl %eax, %eax ret

- Read a (1) into edx
- Read **b** (**0**) into **eax**
- Store 42 into b
- Store eax (0) into b
- Print b: 0 is printed

int a = 1; int b = 0;

## ...gives unexpected results in some concurrent contexts!

movl %eax, b(%rip)
xorl %eax, %eax
ret

- Read b (0) into eax
- Store 42 into b
- Store eax (0) into b
- Print b: 0 is printed



### What? Can our program print 0?



What? Can our program print 0?

# No, C11 states that printing 42 is the only correct output

# This is a compiler bug

What? Can our program print 0?

# No, C11 states that printing 42 is the only correct output

This is a concurrency compiler bug

# We reported it

# it was promptly fixed

# World is a better place



# World is a better place



# Can we catch more similar bugs and make the world even better?



# Compiler testing: state of the art

Yang, Chen, Eide, Regehr - PLDI 2011



# Compiler testing: state of the art

Yang, Chen, Eide, Regehr - PLDI 2011



# Compiler testing: state of the art

Yang, Chen, Eide, Regehr - PLDI 2011



## Hunting concurrency compiler bugs?

## How to deal with non-determinism?

How to generate non-racy interesting programs?

How to capture all the behaviours of concurrent programs?

A compiler can optimise away behaviours: *how to test for correctness? limit case*: two compilers generate correct code with disjoint final states C/C++ compilers support separate compilation Functions can be called in arbitrary non-racy concurrent contexts

C/C++ compilers can only apply transformations sound with respect to an arbitrary non-racy concurrent context

Hunt concurrency compiler bugs

search for transformations of sequential code not sound in an arbitrary non-racy context



# 2. Soundness of compiler optimisations in the C11/C++11 memory model



## World of optimisations

gcc 4.8.1 with -O2 option goes through 163 compilation passes computed using -fdump-tree-all and -fdump-rtl-all

Sun Hotspot Server JVM has 18 high-level passes each pass composed of one or more smaller passes

http://www.azulsystems.com/blog/cliff-click/2009-04-14-odds-ends

### **Compiler Writer**





## **Compiler Writer**



Sophisticated program analyses Fancy algorithms Source code or IR

Operations on AST



## **Compiler Writer**



Sophisticated program analyses Fancy algorithms Source code or IR

Operations on AST



## **Compiler Writer**



Sophisticated program analyses Fancy algorithms Source code or IR

**Operations on AST** 



```
tmp = y+1 ;
for (int i=0; i<2; i++) {
   z = i;
   x[i] +=tmp;
}</pre>
```

## Compiler Writer



Sophisticated program analyses Fancy algorithms Source code or IR

**Operations on AST** 

#### Semanticist



Elimination of run-time events Reordering of run-time events Introduction of run-time events *Operations on sets of events* 

```
tmp = y+1 ;
for (int i=0; i<2; i++) {
   z = i;
   x[i] +=tmp;
}</pre>
```

## Compiler Writer



Sophisticated program analyses Fancy algorithms Source code or IR

**Operations on AST** 

## Semanticist



Elimination of run-time events Reordering of run-time events Introduction of run-time events *Operations on sets of events* 

```
Store z 0
Load y 42
Store x[0] 43
Store z 1
Load y 42
Store x[1] 43
```

## Compiler Writer



Sophisticated program analyses Fancy algorithms Source code or IR

**Operations on AST** 

## Semanticist



Elimination of run-time events Reordering of run-time events Introduction of run-time events *Operations on sets of events* 

> Load y 42 Store z 0

Store x[0] 43 Store z 1

Store x[1] 43

## Elimination of overwritten writes



Under which conditions is it correct to eliminate the first store?

## Elimination of overwritten writes



Under which conditions is it correct to eliminate the first store?

# What is the semantics of concurrent C11/C++11 code?

## The C11/C++11 memory model

C11/C++11 are based on the DRF approach:

- racy code is undefined
- race-free code must exhibit only sequentially consistent behaviours
- main synchronisation mechanism: lock/unlock

## The C11/C++11 memory model

C11/C++11 are based on the DRF approach:

- racy code is undefined
- race-free code must exhibit only sequentially consistent behaviours
  - main synchronisation mechanism: lock/unlock

Our first example is data-race free. Printing 0 is disallowed by the standard.

## The C11/C++11 memory model

C11/C++11 are based on the DRF approach:

- racy code is undefined
- race-free code must exhibit only sequentially consistent behaviours
- main synchronisation mechanism: lock/unlock

Escape mechanism for experts, low-level atomics:

- races allowed
- attributes on accesses specify their semantics:

MO\_SEQ\_CST MO\_RELEASE/MO\_ACQUIRE MO\_RELAXED

$$g = 0$$
; atomic  $f = 0$ ;

#### Thread 1

#### Thread 2

g = 42; f.store(1,MO\_RELEASE);

$$g = 0;$$
 atomic  $f = 0;$ 

#### Thread 1

#### Thread 2

g = 42; f.store(1,MO\_RELEASE);

$$g = 0;$$
 atomic  $f = 0;$ 

#### Thread 1

#### Thread 2

g = 42; f.store(1,MO\_RELEASE);

$$g = 0;$$
 atomic  $f = 0;$ 

#### Thread 1

#### Thread 2

g = 42; f.store(1,MO\_RELEASE);

$$g = 0;$$
 atomic  $f = 0;$ 



$$g = 0$$
; atomic  $f = 0$ ;



The release/acquire synchronisation guarantees that:

- the program is DRF
- 42 is printed at the end of the execution

Remark: unlock  $\simeq$  release, lock  $\simeq$  acquire.

## Same-thread release/acquire pairs

A same-thread release-acquire pair is a pair of a release action followed by an acquire action in program order.

An action is a *release* if it is a possible source of a synchronisation *unlock mutex, release or seq\_cst atomic write* 

An action is an *acquire* if it is a possible target of a synchronisation lock mutex, acquire or seq\_cst atomic read

## Elimination of overwritten writes



It is safe to eliminate the first store if there are:

 no intervening accesses to *g* no intervening same-thread release-acquire pairs

## The soundness condition

Shared memory

$$g = 0;$$
 atomic f1 = f2 = 0;

#### Thread 1

```
g = 1;
f1.store(1,RELEASE);
while(f2.load(ACQUIRE)==0);
g = 2;
```

## The soundness condition

Shared memory

$$g = 0$$
; atomic f1 = f2 = 0;

Thread 1 candidate overwritten write
g = 1;
f1.store(1,RELEASE);
while(f2.load(ACQUIRE)==0);
g = 2;

## The soundness condition

#### Shared memory

$$g = 0$$
; atomic f1 = f2 = 0;

Thread 1 candidate overwritten write
g = 1;
f1.store(1,RELEASE); same-thread release-acquire pair
while(f2.load(ACQUIRE)==0);
g = 2;

#### Shared memory

$$g = 0$$
; atomic f1 = f2 = 0;

#### Thread 1

```
Thread 2
```

```
g = 1;
f1.store(1,RELEASE);
while(f2.load(ACQUIRE)==0);
g = 2;
```

while(f1.load(ACQUIRE)==0);
printf("%d", g);
f2.store(1,RELEASE);

#### Shared memory

$$g = 0;$$
 atomic f1 = f2 = 0;

#### Thread 1

#### Thread 2

#### g = 1; f1.store(1,RELEASE); while(f1.load(ACQUIRE)==0); while(f2.load(ACQUIRE)==0); g = 2; while(f1.load(ACQUIRE)==0); f2.store(1,RELEASE);

## Thread 2 is non-racy

#### Shared memory

$$g = 0$$
; atomic f1 = f2 = 0;

# Thread 1 Thread 2 g = 1; f1.store(1,RELEASE); while(f1.load(ACQUIRE)==0); f2.store(1,RELEASE); f2.store(1,RELEASE); g = 2;

## Thread 2 is non-racy The program should only print **1**



## Thread 2 is non-racy The program should only print **1**

If we perform overwritten write elimination it prints 0

#### Shared memory

$$g = 0;$$
 atomic f1 = f2 = 0;

#### Thread 1

#### Thread 2

while(f1.load(ACQUIRE)==0);
printf("%d", g);
f2.store(1,RELEASE);

#### Shared memory

$$g = 0;$$
 atomic f1 = f2 = 0;

Thread 1

#### Thread 2

while(f1.load(ACQUIRE)==0);
printf("%d", g);
f2.store(1,RELEASE);

#### Shared memory

$$g = 0;$$
 atomic f1 = f2 = 0;



If only a release (or acquire) is present, then all discriminating contexts *are racy*. It is sound to optimise the overwritten write.

# Eliminations: bestiary



Overwritten-Write Write-after-Write Read-after-Read Read-after-Write Write-after-Read

Reads which are not used (via data or control dependencies) to decide a write or synchronisation event are also eliminable (*irrelevant reads*).



Overwritten-Write Write-after-Write Read-after-Read Read-after-Write Write-after-Read

Reads which are not used (via data or control dependencies) to decide a write or synchronisation event are also eliminable (*irrelevant reads*).

# Reorderings and introductions

Correctness criterion for reordering events:

- different addresses
- no synchronisations in-between

Roach-motel reordering (reordering across locks) not observed in practice

## Read introductions observed in practice (gcc, clang).

Introduction of eliminable reads proved correct. Introduction of irrelevant reads does not introduce new behaviours, but cannot be proved correct in a DRF model.

# 3. From theory to the Cmmtest tool



Friday 24 January 14











```
const unsigned int g3 = 0UL;
long long g4 = 0x1;
int g6 = 6L;
volatile unsigned int g5 = 1UL;
void func_1(void){
     int *18 = \&g6;
     int 136 = 0 \times 5E9D070FL;
     unsigned int 1107 = 0xAA37C3ACL;
     q4 \&= q3;
     g5++;
     int *1102 = \&136;
     for (g6 = 4; g6 < (-3); g6 += 1);
     1102 = \&g6;
     *1102 = ((*18) && (1107 << 7)*(*1102));
}
```

Start with a randomly generated well-defined program

Init g3 0 Init g4 1 Init g5 1 Init g6 6

```
void func_1(void){
    int *l8 = &g6;
    int l36 = 0x5E9D070FL;
    unsigned int l107 = 0xAA37C3ACL;
    g4 &= g3;
    g5++;
    int *l102 = &l36;
    for (g6 = 4; g6 < (-3); g6 += 1);
    l102 = &g6;
    *l102 = ((*l8) && (l107 << 7)*(*l102));
}</pre>
```











If we focus on the miscompiled initial example...

```
int a = 1; int s;
int b = 0; for (s=0; s!=4; s++) {
    if (a==1)
        return NULL;
    for (b=0; b>=26; ++b)
      ;
}
```





## Cannot match some events — detect compiler bug





#### CSmith

## Subtleties:

- dependencies between eliminable events
- some optimisations (e.g. merging of accesses) cannot be expressed in the C11/C++11 formalisation
- the tool also ensures that the compilation of atomic accesses is preserved by the optimiser



in



# 4. Impact on the non-academic world



Friday 24 January 14

# 1. Some GCC concurrency bugs

Some concurrency compiler bugs found in the latest version of GCC.

Store introductions performed by loop invariant motion or if-conversion optimisations. *Remark*: these bugs break the Posix thread model too.

All promptly fixed.

## Good timing

### Existing compilers are being adapted to the C11/C++11 standard

> Just to get this straight, am I to assume that the default code > generation for GCC is a single threaded environment?

It certainly is, though we are getting more careful about this stuff in recent years and generally only read data-races are ok.

Store introductions performed by loop invariant motion of

if-conversion optimisations. *Remark*: these bugs break the Posix thread model too.

All promptly fixed.

All promptly fixed.

# 2. Checking compiler invariants

GCC internal invariant: never reorder with an atomic access

Baked this invariant into the tool and found a counterexample... ...not a bug, but fixed anyway

## 3. Detecting unexpected behaviours



The introduced store cannot be observed by a non-racy context. Still, *arguable if a compiler should do this or not*.

## 3. Detecting unexpected behaviours



The introduced store cannot be observed by a non-racy context. Still, arguable if a compiler should do this or not.

# 5. Behind the scenes



Friday 24 January 14



Low-level software

# Applications Architectures Language definitions Hardware

## Low-level software



#### 



#### Should not terminate printing **0**.



#### Should not terminate printing **0 0**.





The output 0 0 can be observed on x86 and on ARM/Power.

#### 

The output 0 0 can be observed on x86 and on ARM/Power.

Write buffers hide the latency of writes to memory but are observable by concurrent code.



$$x = f = 0$$

$$x = 1$$

$$f = 1$$

$$message passing$$

$$message passing$$

$$x = f = 0$$

$$x = 1$$

$$f = 1$$

$$message passing$$

$$message passing$$

#### Should not terminate printing 1 0.

$$x = f = 0$$

$$x = 1$$

$$f = 1$$

$$message passing$$

$$message passing$$

Should not terminate printing 1 0.

Let's see...



$$\begin{array}{c|c} x = f = 0 \\ \hline x = 1 \\ f = 1 \end{array} \qquad \begin{array}{c|c} \text{print f} \\ \text{print x} \end{array}$$
The output 0 0 can be observed on ARM/Power but not on x86.

spearur

Friday 24 January 14

# Establishing models (done, almost)

X86

Sarkar, Owens, Sewell, Zappa Nardelli

ARM/Power *Alglave, Maranget, Sarkar, Sewell, Williams*C11/C++11
Batty, Owens, Sarkar, Memarian, Sewell, Weber

# Testing against the models (doable)

Soundness of optimisations in JSR-133 and C11/C++11 Sevcik, Morisset, Pawan, Zappa Nardelli

Model checker for C11/C++11 atomics Norris, Demsky

# Verification above the models (in its infancy)

Compilation scheme from C11/C++11 to Power/ARM and X86 Sarkar, Owens, Memarian, Batty, Sewell, Owens

Verified compilation from a concurrent C-like language to X86 Sevcik, Vafeiadis, Zappa Nardelli, Jagannathan, Sewell

doing for C11/C++11 would be much harder, doing it for GCC would be impossible

Fence-elimination optimisations Vafeiadis, Zappa Nardelli

X86 lock algorithms, work-stealing algorithms on X86 and Power/ARM *Owens, Lê, Morisset, Guatto, Cohen, Zappa Nardelli* 

. . .

# Take-up in Industrial Concurrency Community?



#### Handled the real behaviour

Found some bugs

Ongoing dialogue with language designers and developers

# Take-up in Industrial Concurrency Community?



# Take-up in Industrial Concurrency Community?



# 6. The Big Picture



Friday 24 January 14

#### Audit logs

#### Static and dynamic typing









#### utic and dynamic typing

#### Safe marshalling

PhD



