1 files changed, 168 insertions, 0 deletions
diff --git a/content/blog/105_allocation.md b/content/blog/105_allocation.md
new file mode 100644
index 0000000..4016718
--- /dev/null
+++ b/content/blog/105_allocation.md
@@ -0,0 +1,168 @@
+Title: Allocations are good, actually
+Category: Blog
+Tags: Rust, programming
+Date: 2019-04-07
+
+Something that you can often hear in the Rust community,
+especially from people who were previously C or C++ developers,
+is the adage "allocations are slow".
+
+The other day a friend asked me how to create a consecutive list of numbers.
+I pointed her at `(0..).take(x).collect()` which can be made into a `Vec<_>`,
+with a number of her choice.
+It did made me think however about how this could be done
+much nicer in a allocation-free manner.
+
+It lead me to come up with the following code which creates a `[_; _]` slice,
+depending on which integer representation and length you choose.
+
+```rust
+(0..)
+  .zip(0..x)
+  .fold([0; x], |mut acc, (step, i)| {
+    acc[i] = step;
+    acc
+  })
+```
+
+So with this in mind, I wanted to run some comparisons.
+I chose the numbers so that 32768 consecutive numbers would be generated.
+I compiled the example with both `Debug` and `Release` mode.
+(All of these measurements are done with `rustc 1.33.0 (2aa4c46cf 2019-02-28)`)
+
+Let's start with the non-allocating version.
+
+```console
+$ time target/debug/playground
+target/debug/playground  1.45s user 0.00s system 99% cpu 1.457 total
+$ time target/release/playground
+target/release/playground  0.27s user 0.00s system 99% cpu 0.270 total
+```
+
+Cool! So as you can see, the `Release` profile is over 500% faster.
+And performance-wise this is quite reasonable.
+
+Let's see how an allocating implementation stacks up to it.
+The code used here is the following.
+
+```rust
+let vec: Vec<u32> = (0..)
+  .take(1024 * 32)
+  .collect();
+```
+
+So how fast is this gonna be?
+
+```console
+$ time target/debug/playground
+target/debug/playground  0.01s user 0.00s system 93% cpu 0.010 total
+$ time target/release/playground
+target/release/playground  0.00s user 0.00s system 85% cpu 0.005 total
+```
+
+What? ...it's faster?!
+
+Well, I guess this does to show that it's not as simple as saying "allocations are bad".
+Avoiding allocations at all cost can slow you down.
+
+Thanks for coming to my TED talk!
+
+*. . .*
+
+## Yes but *why*?
+
+Okay maybe you're more curious than that and want to understand what's going on here.
+So come along, let's read some assembly!
+
+Let's focus mostly on the release profile here,
+because `Debug` generates a lot of code that makes it harder to understand.
+So we have two code snippets that we should throw into [godbolt] to see what rustc does.
+
+[godbolt]: https://rust.godbolt.org/
+
+```rust
+// This doesn't allocate
+const length: usize = 1024 * 32;
+pub fn slice() -> [u32; length] {
+    (0..)
+        .zip(0..length)
+        .fold([0; length], |mut acc, (num, idx)| {
+        acc[idx] = num;
+        acc
+    })
+}
+
+// This does
+pub fn vec() -> Vec<u32> {
+    (0..).take(1024 * 32).collect()
+}
+```
+
+Let's have a look at the assembly that the `vec()` function generates.
+
+<skip>
+
+```gas
+.LCPI0_0:
+        .long   0
+        .long   1
+        .long   2
+        .long   3
+
+# ... snip ...
+
+example::vec:
+        push    rbx
+        mov     rbx, rdi
+        mov     edi, 131072
+        mov     esi, 4
+        call    qword ptr [rip + __rust_alloc@GOTPCREL]
+        test    rax, rax
+        je      .LBB0_4
+        movdqa  xmm0, xmmword ptr [rip + .LCPI0_0]
+        mov     ecx, 28
+        movdqa  xmm8, xmmword ptr [rip + .LCPI0_1]
+        movdqa  xmm9, xmmword ptr [rip + .LCPI0_2]
+        movdqa  xmm10, xmmword ptr [rip + .LCPI0_3]
+        movdqa  xmm4, xmmword ptr [rip + .LCPI0_4]
+        movdqa  xmm5, xmmword ptr [rip + .LCPI0_5]
+        movdqa  xmm6, xmmword ptr [rip + .LCPI0_6]
+        movdqa  xmm7, xmmword ptr [rip + .LCPI0_7]
+        movdqa  xmm1, xmmword ptr [rip + .LCPI0_8]
+.LBB0_2:
+        movdqa  xmm2, xmm0
+        paddd   xmm2, xmm8
+        # ... snip ...
+        ret
+.LBB0_4:
+        mov     edi, 131072
+        mov     esi, 4
+        call    qword ptr [rip + _ZN5alloc5alloc18...@GOTPCREL]
+        ud2
+```
+
+(full code dump [here](https://pastebin.com/zDXi7qtt))
+
+</skip>
+
+As you can see this uses the "Move Aligned Packed Integer Values" instructions in x86_64.
+From some `x86` docs:
+
+> Moves 128, 256 or 512 bits of packed doubleword/quadword integer values from the source operand (the second operand) to the destination operand (the first operand).
+
+Basically the LLVM can figure out that our numbers are predictable
+and can allocate them in a way that is batchable.
+
+We will already see how the non-alloc code is going to be slower:
+because the code that assigns numbers is less unterstandable to a compiler
+(i.e. assigning values to an array sequencially) this will not end up being batched.
+
+That's not to say that alloc code is going to be this fast on every platform
+(RISC instruction sets lack many vectoring techniques)
+and this doesn't even take embedded targets into account.
+
+But there you have it.
+
+LLVM is magic...
+
+... and saying "allocations are bad" really isn't telling the whole story.