git friendly file format (`g3f`)

A flat file format that can encode pretty much any structured data, while being plain text (UTF-8) and very git friendly.

What does git friendly mean?

Changes are made in place, which results in visually pleasing (and useful) diffs from VCS programs such as git.

The spec

Before we start with the (more or less) formal specification, here are the design principles that went into g3f:

Easy to write by hand: a human should be able to write a data file without much effort or boilerplate. It should also be possible to edit generated files without being swamped with boilerplate (or indentation!).
Flat structure: a file should not allow nested structures in the file itself. Nesting makes files harder to edit by hand, adds complexity at the parse level, and makes graphs more difficult to express.
VCS friendly: a file change should only touch the parts of the data section that were changed.

Now...

g3f files are strongly typed. This means that every file has a schema section in its header defining what data types exist and how they are laid out.

The file extension for a g3f file is .g3f by default; however, this implementation is not opinionated on that.

Another important point: it should be considered part of the spec that writes are done in-place to existing data. No data should be overwritten if not explicitly desired by the application using the g3f library!

At the top of every g3f file is a header. It contains the spec version the file was made with as well as the implementation ID and version.

It looks something like this:

{header:builtin/header}
{spec "1.0.0"}
{impl "g3f-reference"}
{impl_version "0.8.5"}

{data}

A few notes here:

g3f is a flat format. Declaring a new top-level block (e.g. {data}) ends the {header} block.
A block can enforce a schema (here, we enforce that all required fields from builtin/header are present).
Nodes always have a single data value. Supported types are:
string ("1.0.0")
int (42)
float (13.37)
bool (true|false)
list<...> ([ ... ] - Elements are not comma-separated!)
ref (some_id - not quoted!)
type (<...> refers to some type information)
schema (<schema> as a literal)
# is a line comment. There are no block comments.
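
To make the type list above a bit more concrete, here is a minimal Python sketch of reading a single node value. The function names (tokenize, parse_tokens, parse_value) are made up for this example and are not the API of any g3f library. The tuple representation for refs and type markers, support for negative numbers, and the lack of string escapes are all assumptions read off the examples, not rules from the spec.

```python
import re

def tokenize(text):
    """Split a value into tokens: quoted strings, brackets, bare words."""
    return re.findall(r'"[^"]*"|\[|\]|[^\s\[\]]+', text)

def parse_tokens(tokens):
    """Consume one value from the front of the token list."""
    tok = tokens.pop(0)
    if tok == "[":
        # list elements are separated by whitespace, not commas
        items = []
        while tokens[0] != "]":
            items.append(parse_tokens(tokens))
        tokens.pop(0)  # drop the closing bracket
        return items
    if tok.startswith('"'):
        return tok[1:-1]  # string (no escape handling in this sketch)
    if tok in ("true", "false"):
        return tok == "true"
    if re.fullmatch(r"-?\d+", tok):
        return int(tok)
    if re.fullmatch(r"-?\d+\.\d+", tok):
        return float(tok)
    if tok.startswith("<"):
        return ("type", tok[1:-1])  # e.g. <list<int>> or <schema>
    return ("ref", tok)  # anything else is an unquoted reference

def parse_value(text):
    """Parse the value part of a node, e.g. the `"1.0.0"` in {spec "1.0.0"}."""
    return parse_tokens(tokenize(text))
```

With this sketch, parse_value("[ 1 2 3 ]") gives a Python list of ints and parse_value("some_id") gives a tagged ref. An unterminated list will simply raise an error, which a real parser would want to report more kindly.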

Schemas

As previously mentioned, g3f is a strongly typed file format. Schemas are IDs that can be referenced by other IDs. Because g3f is completely flat, it's impossible to have a {schemas} block in which to define schemas. Instead, inside the header you can use the <schema> type marker to pre-declare schemas, which are then defined by later blocks.

{header}
{node <schema>}
{link <schema>}

{node}
{id <int>}
{links <list<int>>}

{link}
{id <int>}
{in <int>}
{out <int>}
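
As a rough illustration of what "strongly typed" could mean in practice, here is a small Python sketch that checks already-parsed block data against the node and link schemas above. Everything in it is hypothetical (the matches and check_block helpers, the dict representation of a schema). It also assumes that every schema field is required and that extra fields are ignored, which the spec text does not spell out.

```python
def matches(value, type_name):
    """Return True if a Python value fits a g3f type name such as 'list<int>'."""
    if type_name.startswith("list<") and type_name.endswith(">"):
        inner = type_name[5:-1]
        return isinstance(value, list) and all(matches(v, inner) for v in value)
    if type_name == "int":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, int) and not isinstance(value, bool)
    if type_name == "float":
        return isinstance(value, float)
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "bool":
        return isinstance(value, bool)
    return False  # unknown type names never match in this sketch

def check_block(block, schema):
    """List the problems found when checking a block's fields against a schema."""
    problems = []
    for field, type_name in schema.items():
        if field not in block:
            problems.append(f"missing field: {field}")
        elif not matches(block[field], type_name):
            problems.append(f"{field}: expected {type_name}")
    return problems

# The two schemas from the example above, written as plain dicts.
NODE = {"id": "int", "links": "list<int>"}
LINK = {"id": "int", "in": "int", "out": "int"}
```

Calling check_block({"id": 0, "links": [1]}, NODE) returns an empty list, while a block with a string id or a missing links field returns a short description of each problem.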

Defining data

Using these schemas is then easy enough. You don't have to use schemas, however, if you want your file format to be completely dynamic and terrible.

{<>:node}
{id 0}
{links [ 1 ]}

{<>:node}
{id 1}
{links [ 0 ]}

Note that <> in the name position of a block refers to an anonymous block without a name of its own. Deserialisation of this file would happen as a list of nodes, each without a name.

When building graph structures, it is possible to have loops (cycles). g3f allows this.

Also of note: named blocks in a flat structure deserialise as a map name => { data }!
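
The list-versus-map behaviour described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the helper names (read_blocks, collect) are invented, values are kept as raw strings rather than typed, and comments are only recognised when they take up a whole line.

```python
import re

# {name}, {name:schema} or {<>:schema}
BLOCK_HEADER = re.compile(r"^\{(<>|\w+)(?::([\w/]+))?\}$")
# {key value}
NODE = re.compile(r"^\{(\w+) (.+)\}$")

def read_blocks(text):
    """Group the lines of a g3f-style file into (name, schema, fields) blocks."""
    blocks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        header = BLOCK_HEADER.match(line)
        if header:
            name, schema = header.groups()
            blocks.append((name, schema, {}))
            continue
        node = NODE.match(line)
        if node and blocks:
            blocks[-1][2][node.group(1)] = node.group(2)  # raw, untyped value
    return blocks

def collect(blocks):
    """Anonymous blocks become a list; named blocks become a name => fields map."""
    anonymous, named = [], {}
    for name, _schema, fields in blocks:
        if name == "<>":
            anonymous.append(fields)
        else:
            named[name] = fields
    return anonymous, named

SAMPLE = """
{<>:node}
{id 0}
{links [ 1 ]}

{<>:node}
{id 1}
{links [ 0 ]}

{first:node}
{id 2}
"""

anonymous, named = collect(read_blocks(SAMPLE))
```

Here anonymous ends up as a two-element list of field dicts and named as a dict with the single key "first". Mixing both kinds of block in one file is done purely to show both cases side by side.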

Some thoughts on deserialisation

(not specifically part of the spec - to be expanded!)

Deserialised into C code this would look like the following:

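#include <stddef.h>
#include <stdint.h>

struct node_t {
  int32_t id;
  int32_t *links;   /* ids of linked nodes */
  size_t links_len; /* number of entries in links (C arrays don't carry a length) */
};

static int32_t links_0[] = { 1 }, links_1[] = { 0 };
struct node_t nodes[] = { { 0, links_0, 1 }, { 1, links_1, 1 } };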

Because g3f has no hierarchical structure, and there are no in-file references between the two nodes, the deserialiser returns a list of nodes. Building a graph in memory is then your responsibility. However, g3f can handle a few scenarios for you.

Imagine we used references, instead of integers, for links:

{node}
{id <int>}
{links <list<ref>>}

What does this change? Well let's look at a data section:

{node_0:node}
{id 0}
{links [ node_1 ]}

{node_1:node}
{id 1}
{links [ node_0 ]}

In this case, g3f will deserialise into a list with a single node, which is node_0 because it is considered the root-node for the graph.
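
To show what resolving refs into an in-memory graph might look like (loops included), here is a small Python sketch. It assumes the blocks have already been parsed into a name => fields map with refs stored as plain names. The Node class and link_nodes function are hypothetical, and picking node_0 as the root by name is just a stand-in for whatever rule a real implementation uses.

```python
class Node:
    """In-memory graph node; links holds other Node objects, not ids."""
    def __init__(self, node_id):
        self.id = node_id
        self.links = []

def link_nodes(blocks):
    """Turn named blocks with ref lists into Node objects that point at each other."""
    # First pass: create every node so refs can point forwards or backwards.
    nodes = {name: Node(fields["id"]) for name, fields in blocks.items()}
    # Second pass: replace each ref name with the Node it names.
    for name, fields in blocks.items():
        nodes[name].links = [nodes[ref] for ref in fields["links"]]
    return nodes

# The data section from above, as already-parsed blocks.
blocks = {
    "node_0": {"id": 0, "links": ["node_1"]},
    "node_1": {"id": 1, "links": ["node_0"]},
}

nodes = link_nodes(blocks)
root = nodes["node_0"]  # assumption: the root is looked up by name here
```

The two-pass approach is what makes loops unproblematic: every node exists before any link is resolved, so following root.links[0].links[0] leads straight back to root.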

Upgradability

Applications might add new fields to their schemas and data sections. In binary encoders such as protobuf, code is specifically generated for an exchange format and also includes forward-compatible markers to allow for schema changes.

g3f needs none of that! Data state inside the parser is dynamic, and type checking is only done against the schema in the file. If the code using the parser library doesn't expect certain data keys, or expects keys that aren't present, this can be handled gracefully.

New keys can be added the same way they would be in a dynamic file. Keys that are present despite not being expected can simply be ignored. The spec makes explicit note of writes and re-writes being done in-place, meaning that changes are always local to the keys that are changed. If an update ignores certain keys, it doesn't matter whether they were ignored because they were unimportant or because they were unknown to the application.
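
The in-place write rule can be illustrated with a short Python sketch that rewrites exactly one line of a file and leaves everything else alone, including keys the application has never heard of. The set_value helper is invented for this example, and the colour key stands in for an unknown key. A real library would presumably work on parsed state rather than on raw text.

```python
import re

def set_value(text, block, key, new_value):
    """Rewrite one {key value} line inside the named block; keep all other lines as-is."""
    lines = text.split("\n")
    in_block = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if re.fullmatch(r"\{[^\s{}]+\}", stripped):
            # A block header: track whether we are inside the target block.
            in_block = stripped[1:-1].split(":", 1)[0] == block
        elif in_block and stripped.startswith("{" + key + " "):
            lines[i] = "{" + key + " " + new_value + "}"
            break
    return "\n".join(lines)

before = """{node_0:node}
{id 0}
{links [ node_1 ]}
{colour "red"}

{node_1:node}
{id 1}
{links [ node_0 ]}
"""

after = set_value(before, "node_1", "links", "[ node_0 node_1 ]")

# Pair up lines and keep only the ones that differ.
changed = [(a, b) for a, b in zip(before.split("\n"), after.split("\n")) if a != b]
```

Only the links line of node_1 differs between before and after, so a diff of the file touches a single line, and the unknown colour key on node_0 survives the write untouched.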