Title: Allocations are good, actually
Category: Blog
Tags: Rust, programming
Date: 2019-04-07

Something that you can often hear in the Rust community,
especially from people who were previously C or C++ developers,
is the adage "allocations are slow".

The other day a friend asked me how to create a list of consecutive numbers.
I pointed her at `(0..).take(x).collect()`, which can be turned into a `Vec<_>`
of whatever length `x` she chooses.
It did make me think, however, about how this could be done
in a nicer, allocation-free manner.
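
For example, with an arbitrary length of 5 picked purely for illustration:

```rust
// Take the first five numbers from the unbounded counter `0..`
// and collect them into a heap-allocated Vec.
let nums: Vec<u32> = (0..).take(5).collect();
assert_eq!(nums, vec![0, 1, 2, 3, 4]);
```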

It led me to come up with the following code, which fills a fixed-size array `[_; _]`
with whatever integer type and length you choose.

```rust
// Note: `x` has to be a compile-time constant here,
// because the array's length is part of its type.
(0..)
  .zip(0..x)
  .fold([0; x], |mut acc, (step, i)| {
    acc[i] = step;
    acc
  })
```

So with this in mind, I wanted to run some comparisons.
I chose the numbers so that 32768 consecutive numbers would be generated,
and compiled the example in both `Debug` and `Release` mode.
(All of these measurements were done with `rustc 1.33.0 (2aa4c46cf 2019-02-28)`.)

Let's start with the non-allocating version.
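
The exact harness isn't shown here, but a minimal sketch of the timed binary for this
version looks roughly like the following. The trailing `println!` is only there so the
optimizer can't throw the whole computation away.

```rust
const LENGTH: usize = 1024 * 32;

fn main() {
    // Build the array without any heap allocation.
    let numbers = (0..)
        .zip(0..LENGTH)
        .fold([0u32; LENGTH], |mut acc, (step, i)| {
            acc[i] = step;
            acc
        });

    // Use the result so the work can't be optimized away.
    println!("{}", numbers[LENGTH - 1]);
}
```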

```console
$ time target/debug/playground
target/debug/playground  1.45s user 0.00s system 99% cpu 1.457 total
$ time target/release/playground
target/release/playground  0.27s user 0.00s system 99% cpu 0.270 total
```

Cool! So as you can see, the `Release` profile is over 5x faster.
And performance-wise this is quite reasonable.

Let's see how an allocating implementation stacks up against it.
The code used here is the following.

```rust
let vec: Vec<u32> = (0..)
  .take(1024 * 32)
  .collect();
```
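
For the timing runs this again lives in a small `main`, and the result has to be used
somewhere; a minimal sketch (the `println!` is just there so the optimizer can't discard
the work):

```rust
fn main() {
    // Collect the first 32768 numbers into a heap-allocated Vec.
    let vec: Vec<u32> = (0..)
        .take(1024 * 32)
        .collect();

    // Use the result so the allocation and fill can't be optimized away.
    println!("{}", vec[vec.len() - 1]);
}
```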

So how fast is this gonna be?

```console
$ time target/debug/playground
target/debug/playground  0.01s user 0.00s system 93% cpu 0.010 total
$ time target/release/playground
target/release/playground  0.00s user 0.00s system 85% cpu 0.005 total
```

What? ...it's faster?!

Well, I guess this goes to show that it's not as simple as saying "allocations are bad".
Avoiding allocations at all cost can slow you down.

Thanks for coming to my TED talk!

*. . .*

## Yes but *why*?

Okay maybe you're more curious than that and want to understand what's going on here.
So come along, let's read some assembly!

Let's focus mostly on the `Release` profile here,
because `Debug` generates a lot of extra code that makes the output harder to read.
So we have two code snippets that we can throw into [godbolt] to see what rustc does with them.

[godbolt]: https://rust.godbolt.org/

```rust
// This doesn't allocate
const LENGTH: usize = 1024 * 32;

pub fn slice() -> [u32; LENGTH] {
    (0..)
        .zip(0..LENGTH)
        .fold([0; LENGTH], |mut acc, (num, idx)| {
            acc[idx] = num;
            acc
        })
}

// This does
pub fn vec() -> Vec<u32> {
    (0..).take(1024 * 32).collect()
}
```
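
As a quick sanity check (nothing the timings depend on), both functions produce the same
sequence of numbers, which you can confirm by comparing them as slices:

```rust
fn main() {
    // Arrays coerce to slices, so the two results can be compared directly.
    assert_eq!(&slice()[..], &vec()[..]);
    println!("both versions produce the same numbers");
}
```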

Let's have a look at the assembly that the `vec()` function generates.

<skip>

```gas
.LCPI0_0:
        .long   0
        .long   1
        .long   2
        .long   3

# ... snip ...

example::vec:
        push    rbx
        mov     rbx, rdi
        mov     edi, 131072
        mov     esi, 4
        call    qword ptr [rip + __rust_alloc@GOTPCREL]
        test    rax, rax
        je      .LBB0_4
        movdqa  xmm0, xmmword ptr [rip + .LCPI0_0]
        mov     ecx, 28
        movdqa  xmm8, xmmword ptr [rip + .LCPI0_1]
        movdqa  xmm9, xmmword ptr [rip + .LCPI0_2]
        movdqa  xmm10, xmmword ptr [rip + .LCPI0_3]
        movdqa  xmm4, xmmword ptr [rip + .LCPI0_4]
        movdqa  xmm5, xmmword ptr [rip + .LCPI0_5]
        movdqa  xmm6, xmmword ptr [rip + .LCPI0_6]
        movdqa  xmm7, xmmword ptr [rip + .LCPI0_7]
        movdqa  xmm1, xmmword ptr [rip + .LCPI0_8]
.LBB0_2:
        movdqa  xmm2, xmm0
        paddd   xmm2, xmm8
        # ... snip ...
        ret
.LBB0_4:
        mov     edi, 131072
        mov     esi, 4
        call    qword ptr [rip + _ZN5alloc5alloc18...@GOTPCREL]
        ud2
```

(full code dump [here](https://pastebin.com/zDXi7qtt))

</skip>

As you can see, this uses the `movdqa` ("Move Aligned Packed Integer Values") instruction on x86_64.
From some `x86` docs:

> Moves 128, 256 or 512 bits of packed doubleword/quadword integer values from the source operand (the second operand) to the destination operand (the first operand).

Basically, LLVM can figure out that our numbers follow a predictable pattern
and can write them out in batches, several elements per instruction, instead of one at a time.
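
Conceptually, the vectorized loop does something like the following. This is just an
illustration in plain Rust (with a made-up helper name), not what LLVM literally emits:

```rust
fn fill_batched(buf: &mut [u32]) {
    // Keep the next four consecutive values in `lanes`
    // (compare the 0, 1, 2, 3 constant at .LCPI0_0 above).
    let mut lanes = [0u32, 1, 2, 3];
    for chunk in buf.chunks_exact_mut(4) {
        // One aligned store of four values at once...
        chunk.copy_from_slice(&lanes);
        // ...then bump every lane by 4, which is what the `paddd` is doing.
        for lane in lanes.iter_mut() {
            *lane += 4;
        }
    }
}
```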

We can already see why the non-alloc code is going to be slower:
the way it assigns the numbers is harder for a compiler to see through
(every step of the `fold` moves the whole array into and out of the closure,
assigning one value at a time), so it does not end up being batched.

That's not to say that allocating code is going to be this fast on every platform
(some RISC instruction sets offer far fewer vectorization features),
and this doesn't even take embedded targets into account.

But there you have it.

LLVM is magic...

... and saying "allocations are bad" really isn't telling the whole story.