# git friendly file format (`g3f`)

A flat, plain-text (UTF-8) file format that can encode arbitrary data
while staying very git friendly.

**What does git friendly mean?**

Changes are made in place, which results in visually pleasing
(and useful) diffs from VCS programs such as `git`.

## The spec

Before we start with the (more or less) formal specification,
here are some design principles that went into `g3f`:

- Easy to write by hand: a human should easily be able to write
  a data file, without much effort or boilerplate. It should also
  be possible to edit generated files without being swamped with
  boilerplate (or indentation!)
- Flat structure: a file should not allow for nested structures
  in the file itself. Nesting adds complexity and makes files harder
  to edit by hand. It also adds complexity at the parse level
  and makes graphs more difficult to represent
- VCS friendly: a file change should only touch parts of the
  data section that were changed.

Now...

`g3f` files are strongly typed.
This means that every file has a schema section in its header
defining what data types exist and how they are laid out.

The file extension for a `g3f` file is `.g3f` by default;
however, this implementation is not opinionated on that.

### Header

At the top of every `g3f` file is a header.
It contains the spec version the file was made with
as well as the implementation `ID` and version.

It looks something like this:

```g3f
{header:builtin/header}
{spec "1.0.0"}
{impl "g3f-reference"}
{impl_version "0.8.5"}

{schemas} 
# ...
```

A few notes here:

- `g3f` is a flat format. Declaring a new top-level block (e.g. `{schemas}`) ends the `{header}` block.
- A block can enforce a schema (here we enforce that all required fields from `builtin/header` are present).
- Nodes always have a single data value. Supported types are:
  - string (`"1.0.0"`)
  - int (`42`)
  - float (`13.37`)
  - bool (`true`|`false`)
  - list<...> (`[ ... ]`; elements are not comma-separated!)
  - ref (`some_id`, not quoted!)
  - type (`<...>`, which refers to some type information)
  - NULL (`<>`, which is an empty type/name marker)
- `#` starts a line comment. There are no block comments.
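
The value types listed above can be sketched as a small classifier. This is a minimal Python sketch and not the reference implementation: the names `parse_value` and `TOKEN` are hypothetical, and it assumes that strings contain no escaped quotes, that list elements are separated by whitespace, and that any bare word that is neither a number nor a boolean is a ref.

```python
import re

# Assumed token shapes: quoted string, <type> or <> marker, list bracket, bare word.
TOKEN = re.compile(r'"[^"]*"|<[^\s\[\]]*>|\[|\]|[^\s\[\]]+')


def parse_value(text):
    """Parse a single g3f value into a (kind, payload) tuple."""
    value, rest = _parse(TOKEN.findall(text))
    if rest:
        raise ValueError(f"unexpected trailing tokens: {rest}")
    return value


def _parse(tokens):
    """Consume one value from the token list; return (value, remaining tokens)."""
    if not tokens:
        raise ValueError("missing value")
    tok, rest = tokens[0], tokens[1:]
    if tok == "[":
        items = []
        while rest and rest[0] != "]":
            item, rest = _parse(rest)
            items.append(item)
        if not rest:
            raise ValueError("unterminated list")
        return ("list", items), rest[1:]  # drop the closing bracket
    if tok == "]":
        raise ValueError("unexpected ']'")
    if tok.startswith('"'):
        if len(tok) < 2 or not tok.endswith('"'):
            raise ValueError(f"unterminated string: {tok}")
        return ("string", tok[1:-1]), rest
    if tok == "<>":
        return ("null", None), rest
    if tok.startswith("<"):
        if not tok.endswith(">"):
            raise ValueError(f"unterminated type marker: {tok}")
        return ("type", tok[1:-1]), rest
    if tok in ("true", "false"):
        return ("bool", tok == "true"), rest
    if re.fullmatch(r"[+-]?\d+", tok):
        return ("int", int(tok)), rest
    if re.fullmatch(r"[+-]?\d+\.\d+", tok):
        return ("float", float(tok)), rest
    return ("ref", tok), rest  # assumption: every other bare word is a ref
```

For example, `parse_value("[ 1 2 ]")` yields a `list` of two `int` values, and `parse_value("<list<int>>")` yields a `type` with the payload `list<int>`.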

### Schemas

As previously mentioned, `g3f` is a strongly typed file format.
Schemas are IDs that can be referenced by other IDs.
But because `g3f` is completely flat, it's impossible to define
schema blocks inside the `{schemas}` block itself.

Instead, it uses `NULL` markers to declare that schemas exist.
The schemas themselves are then defined later, in line with the rest of the data.

```g3f
{schemas}
{node <>}
{link <>}

{node}
{id <int>}
{links <list<int>>}

{link}
{id <int>}
{in <int>}
{out <int>}
```
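
To make the typing concrete, here is a rough Python sketch of checking an already-parsed block against the `node` and `link` schemas above. The function names `matches` and `check_block` are hypothetical, blocks are assumed to be plain dictionaries of Python values, and treating every schema key as required is an assumption of this sketch rather than something the spec states. `ref` and `type` values are left out for brevity.

```python
def matches(type_name, value):
    """Return True if a Python value fits a g3f type name such as 'list<int>'."""
    if type_name.startswith("list<") and type_name.endswith(">"):
        inner = type_name[len("list<"):-1]
        return isinstance(value, list) and all(matches(inner, v) for v in value)
    if type_name == "int":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if type_name == "float":
        return isinstance(value, float)
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "bool":
        return isinstance(value, bool)
    return False  # type name this sketch does not know about


def check_block(schema, block):
    """Return a list of problems; an empty list means the block passes."""
    problems = []
    for key, type_name in schema.items():
        if key not in block:
            # Assumption: every key named in the schema is required.
            problems.append(f"missing key '{key}'")
        elif not matches(type_name, block[key]):
            problems.append(f"key '{key}' is not of type <{type_name}>")
    return problems


# The two schemas from the example above, written as key -> type name.
NODE = {"id": "int", "links": "list<int>"}
LINK = {"id": "int", "in": "int", "out": "int"}
```

With these definitions, `check_block(NODE, {"id": 0, "links": [1]})` returns an empty list, while passing `"0"` as the `id` reports a type problem. Keys that the schema does not mention are ignored by this sketch.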

### Defining data

Using these schemas is then easy enough.
You don't have to use schemas, however,
if you want your file format to be completely dynamic and terrible.

```g3f
{<>:node}
{id 0}
{links [ 1 ]}

{<>:node}
{id 1}
{links [ 0 ]}
```

Note that `<>` in the name position of a block marks an anonymous block without a name of its own.
This file would deserialise into a list of nodes, each without a name.

Graph structures can contain loops (cycles).
`g3f` allows this.

Also of note: when blocks are named, the flat structure
deserialises into a map `name => { data }`!
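
The list-versus-map behaviour can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `read_blocks` and `deserialise` are hypothetical names, a braced line without a value is assumed to be a block header, comments are assumed to occupy whole lines, and values are kept as raw strings instead of being parsed.

```python
def read_blocks(text):
    """Split g3f text into blocks: {"name", "schema", "fields"} dictionaries."""
    blocks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or whole-line comment
        if not (line.startswith("{") and line.endswith("}")):
            raise ValueError(f"not a braced line: {line}")
        parts = line[1:-1].split(None, 1)
        if not parts:
            raise ValueError("empty braces")
        if len(parts) == 1:
            # Assumption: no value means a block header, `{name}` or `{name:schema}`.
            name, _, schema = parts[0].partition(":")
            blocks.append({"name": name, "schema": schema or None, "fields": {}})
        else:
            if not blocks:
                raise ValueError("node outside of any block")
            key, value = parts
            blocks[-1]["fields"][key] = value.strip()  # raw, unparsed value
    return blocks


def deserialise(blocks):
    """Anonymous `<>` blocks become a list, named blocks become a map."""
    anonymous = [b["fields"] for b in blocks if b["name"] == "<>"]
    named = {b["name"]: b["fields"] for b in blocks if b["name"] != "<>"}
    return anonymous, named
```

Feeding the two anonymous `node` blocks from above through `deserialise(read_blocks(...))` gives a two-element list and an empty map; giving the blocks names moves them into the map instead.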

### Some thoughts on deserialisation

(not specifically part of the spec - to be expanded!)

Deserialised into C code this would look like the following:

```C
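#include <stdint.h>

struct node_t {
  int32_t id;
  int32_t *links; /* ids of the linked nodes */
};

/* One element per anonymous `{<>:node}` block in the data section above. */
int32_t links_0[] = { 1 };
int32_t links_1[] = { 0 };
struct node_t nodes[] = { { 0, links_0 }, { 1, links_1 } };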
```

Because `g3f` has no hierarchy, and there are no in-file references between the two nodes,
the deserialiser returns a list of nodes.
Building a graph in memory is then your responsibility.
However, `g3f` can handle a few scenarios for you.

Imagine we used references, instead of integers, for links:

```g3f
{node}
{id <int>}
{links <list<ref>>}
```

What does this change? Well, let's look at a data section:

```g3f
{node_0:node}
{id 0}
{links [ node_1 ]}

{node_1:node}
{id 1}
{links [ node_0 ]}
```

In this case, `g3f` will deserialise into a list with a single node,
which is `node_0` because it is considered the root node of the graph.
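
One way to picture the reference resolution is a two-pass link step: first create every node, then wire up the refs, so that cycles are not a problem. The Python sketch below is hypothetical (`Node` and `link_graph` are illustrative names), takes the named blocks as an already-parsed dictionary, and assumes that the first declared block is the one treated as the root, which the text above does not spell out.

```python
class Node:
    """In-memory node whose links point at other Node objects."""

    def __init__(self, node_id):
        self.id = node_id
        self.links = []


def link_graph(blocks):
    """Turn `name -> {"id": int, "links": [ref names]}` into a linked graph."""
    # Pass 1: create every node so that forward references can be resolved.
    nodes = {name: Node(data["id"]) for name, data in blocks.items()}
    # Pass 2: replace each ref name with the node object it points to.
    for name, data in blocks.items():
        for target in data["links"]:
            if target not in nodes:
                raise KeyError(f"{name} links to unknown block {target}")
            nodes[name].links.append(nodes[target])
    # Assumption: the first declared block is considered the root.
    root = next(iter(nodes.values()))
    return [root]


roots = link_graph({
    "node_0": {"id": 0, "links": ["node_1"]},
    "node_1": {"id": 1, "links": ["node_0"]},
})
```

Here `roots` holds a single node with id `0`; following its link reaches the node with id `1`, whose own link leads back to the very same root object.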

### Upgradability

Applications might add new fields to their schemas and data sections.
With binary encoders such as protobuf, code is generated specifically for
an exchange format and includes forward-compatible markers to
allow for schema changes.

`g3f` needs none of that!
Data state inside the parser is dynamic, and type checking
is only done against the schema in the file itself.
So if the code using the parser library doesn't expect certain
data keys, or expects keys that aren't present,
this can be handled gracefully.

New keys can be added the same way they would be in a dynamic file.
Keys that are present despite not being expected can simply be ignored.
The spec makes explicit note of writes and re-writes being done in place,
meaning that changes are always local to the keys that are changed.
If an update ignores certain keys, it doesn't matter whether they were
ignored because they were unimportant or because they were unknown to the application.
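
As a rough illustration of both halves of this idea, the Python sketch below rewrites a single key in place and reads only the keys an application knows about. The helper names `set_key` and `pick_known` are hypothetical, as is the `colour` key in the usage note; the sketch assumes named blocks, one node per line, and a replacement value that is already valid `g3f` text.

```python
def set_key(text, block, key, value):
    """Rewrite one `{key value}` line inside the named block; touch nothing else."""
    lines = text.split("\n")
    current = None  # name of the block we are currently inside
    for i, line in enumerate(lines):
        body = line.strip()
        if not (body.startswith("{") and body.endswith("}")):
            continue  # blank line, comment, or anything else we leave alone
        parts = body[1:-1].split(None, 1)
        if not parts:
            continue
        if len(parts) == 1:
            # Block header: `{name}` or `{name:schema}`.
            current = parts[0].partition(":")[0]
        elif current == block and parts[0] == key:
            lines[i] = "{" + key + " " + value + "}"
            return "\n".join(lines)
    raise KeyError(f"{block}.{key} not found")


def pick_known(fields, defaults):
    """Keep only the keys the application knows, filling gaps with defaults."""
    return {key: fields.get(key, default) for key, default in defaults.items()}
```

Calling `set_key` on a block that also carries an unknown `{colour "red"}` node changes only the targeted line, so the resulting diff is a single line and the unknown key survives the rewrite. `pick_known` shows the reading side: unexpected keys are dropped and missing ones fall back to a default.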