aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
-rw-r--r--infra/website/content/blog/119_wrong_abstraction.md152
1 files changed, 152 insertions, 0 deletions
diff --git a/infra/website/content/blog/119_wrong_abstraction.md b/infra/website/content/blog/119_wrong_abstraction.md
new file mode 100644
index 000000000000..e89b7adc6d29
--- /dev/null
+++ b/infra/website/content/blog/119_wrong_abstraction.md
@@ -0,0 +1,152 @@
+Title: That's the wrong abstraction layer
+Category: Blog
+Date: 2020-11-11
+Tags: /dev/diary
+
+I'm writing this post mostly to my future self, not any specific
+project or piece of code I've seen other people write. That's not to
+say that I don't think this is something that probably applies to many
+projects. Sometimes it's easy to lose sight of what we're doing, and
+it's good to be reminded.
+
+So to start at the beginning: I've been working on [supergit], a Rust
+library to parse git repositories. It's built on top of `libgit2`
+(and the `git2` rust bindings), and aims to create a more Rustic
+interface and type fascade for git repositories. It also aims to
+solve issues such as: rename detection, path-history, and subtree
+management. I'm writing this library for [octopus], which will
+eventually host my monorepo.
+
+In `supergit` the main workflow is around iterating things, seeing as
+git is an acyclical graph, and iterators are a decent way to view this
+datastructure. But git graphs can get pretty big. I wanted the
+iterator to be configurable in a way that allows someone to write a
+tool that searches a whole repository history, while also making it
+possible to step through a history 20 commits at a time (to implement
+history pagination on a website, for example).
+
+Looking at the current API, this is how you would implement the
+latter, for a `main` branch:
+
+```rust
+use supergit::Repository;
+
+fn main() {
+ let path = ... // get your repository path somehow
+ let repo = Repository::open(path).unwrap();
+
+ let main = repo.get_branch("main").unwrap();
+ let iter = main.get(20);
+
+ iter.for_each(|c| {
+ println!("{}: {}", c.commit().id_str(), c.commit().summary());
+ });
+}
+```
+
+That's easy enough, right? But wait, why am I calling `.commit()` on
+`c`. Isn't it already a commit? Well...sort of. In `supergit`, this
+type is a `BranchCommit`, because this is where things get
+complicated.
+
+
+## Sort of like a tree, but not really
+
+In git, rarely is a branch just a history of single commits. Maybe
+this is how some people think about their history, but it certainly
+has never been the case for any of the repositories that I work on.
+Basically the second you have more than one contributor, it's very
+common for a history to have merge-commits in it.
+
+So how do we deal with that in an iterator? The design I chose was to
+wrap a `Commit` object in another type, which can convey this state.
+`BranchCommit` is an enum and has three variants: `Commit` (maybe I
+should rename that to `Simple` or something?), `Merge`, and `Octopus`
+(if you don't know what an octopus merge is, don't worry about it.
+Most people don't and they're very rare and weird).
+
+What `Merge` and `Octopus` contain are new `Branch` handles (the type
+returned by `get_branch()`), meaning that for every split it's now up
+to the user to decide whether they want to continue first-parent
+(i.e. only ever follow the main branch line, ignoring the history of
+merged branches), or if they want to enumerate the histories as well.
+Most importantly: for every branch merge, you get to re-decide what
+your iterator strategy should be: infinate, limited by number, or
+limited up to a certain commit-hash.
+
+So far so good I thought, this is an okay enough interface for me to
+work with. But this is where some problems appeared.
+
+
+## File histories (and git internals)
+
+*(a slight de-tour through git - feel free to skip)*
+
+The main reason why I'm writing this more Rustic wrapper around
+`libgit2` is to make it easier to determine what the history of a file
+has been. This is pretty simple to find out via the git CLI (`git --
+<your file here>`), but not something that `libgit2` exposes, because
+that's not how git stores data.
+
+To git, all data is stored in a key-value store indexed by a SHA1
+(soon to be SHA256 I think?) hash reference. That applies to files,
+full file trees, and commits as well. Say we have a file `acab.txt`,
+we commit it and it gets the ID
+`da39a3ee5e6b4b0d3255bfef95601890afd80709` (the file ID, not the
+commit ID!), but then we open it and write `ACAB` in it, and commit
+that again. Now the file ID is
+`99f069b8a0cbe4c9485a14fe50775d0f71deb4e7`. Both these files are
+saved in the git object store, because after all you might want to go
+back to the older version.
+
+But here's the thing: from the actual commits we can get two things:
+the file tree at the time of commit, and the commit parents. To
+figure out what actually _changed_ in the commit, you have to diff it
+against it's parents, which is exactly what `git show` does if you
+give it a reference to a commit.
+
+What this means is that if you want to have a library that grabs the
+history of a path, well you'll have to go through all commits, and
+check the tree for changes at that specific path. Furthermore, that
+won't actually let you know if a file has simply been renamed, only
+that it has changed. Further logic is required to figure out if the
+file is the same, but just has a different name.
+
+And all of this is something that `supergit` implements, behind a nice
+Rustic API (I hope...).
+
+
+## Bloated abstractions
+
+So I wrote a function that would, for a branch iterator, step along it
+and check the history of a path, by diffing each commit with it's
+parents, and tracking a path via the delta information in the diff.
+But this is where I ran into problems. Because my iterator design
+always chose the first-parent to step through. Other branches were
+ignored, and because the function accepted an iterator and stepped it
+internally, there was no way for my `file_history()` function to
+figure out the exact behaviour the user wanted.
+
+My first instinct was to implement branching in the `BranchIter`
+itself; allowing it to branch off, essentially pushing commits it
+would have to get back to onto a stack, and resuming from a previous
+position. That turned out to be a really [bad idea][badidea].
+
+It took me about an hour of banging my head against this abstraction
+before I realised that it wasn't meant to be. Sometimes systems are
+self-contained, and adding more functionality takes a considerable
+amount of effort, and begs the question, if it's really the right
+choice to make. Why add more functionality to an abstraction that
+works fine on it's own?
+
+Instead, embrace composition, and add another layer on top, that can
+use the previous. You end up with a much more managable design, and
+data can flow from one layer to the next. Make sure that your
+interfaces are flexible enough to be re-used, but don't think that
+just because a component _could_ technically be responsible for some
+work, that it really has to implement this work.
+
+And that's it basically. Thanks for reading my ramblings about git
+and one of my side-projects. I hope I managed to make you think about
+the way you build systems a bit, and maybe next time you are in a
+situation similar to this one, don't be like me :)