From 89d2605fe096b7ff483b0f7acb5e4f29c8c6e98f Mon Sep 17 00:00:00 2001
From: Katharina Fey
Date: Thu, 28 Feb 2019 18:02:15 +0100
Subject: Adding the initial spec draft

---
 README.md | 183 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 183 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..8e78484
--- /dev/null
+++ b/README.md
@@ -0,0 +1,183 @@
+# git friendly file format (`g3f`)
+
+A flat file format that can encode literally anything,
+while being plain text (`utf-8` tho) and very git friendly.
+
+**What does git friendly mean?**
+
+Changes are done in place, resulting in visually pleasing
+(and useful) diffs that are generated by VCS programs such as `git`.
+
+## The spec
+
+Before we start with the (more or less) formal specification,
+there's some design principles that went into designing `g3f`:
+
+- Easy to write by hand: a human should easily be able to write
+  a data file, without much effort or boilerplate. It should also
+  be possible to edit generated files without being swamped with
+  boilerplate (or indentation!)
+- Flat structure: a file should not allow for nested structures
+  in the file itself. This adds complexity and makes it harder
+  to edit by hand. It also adds complexity at the parse level
+  and makes graphs more difficult
+- VCS friendly: a file change should only touch parts of the
+  data section that were changed.
+
+Now...
+
+`g3f` files are strongly typed.
+This means that every file has an schema section in it's header
+defining what data types exist and how they are layed out.
+
+The file extention for a `g3f` file is `.g3f` by default,
+however this implementation is not opinionated on that.
+
+### Header
+
+At the top of every `g3f` file is a header.
+It contains the spec version the file was made with
+as well as the implementation `ID` and version.
+
+It looks something like this
+
+```g3f
+{header:builtin/header}
+{spec "1.0.0"}
+{impl "g3f-reference"}
+{impl_version "0.8.5"}
+
+{schemas}
+# ...
+```
+
+A few notes here:
+
+- `g3f` is a flat format. When declaring a new top-level block (i.e. `{schemas}`) this ends the `{header}` block.
+- A block can enforce a schema (i.e. here we enforce that all required fields from `builtin/header` are present)
+- Nodes always have a single data value. Supported types are
+  - string (`"1.0.0"`)
+  - int (`42`)
+  - float (`13.37`)
+  - bool (`true`|`false`)
+  - list<...> (`[ ... ]` - Elements are not comma-separated!)
+  - ref (`some_id` - not quoted!)
+  - type (`<...>` refers to some type information
+  - NULL (`<>` which is an empty type/name marker)
+- `#` is a line comment. There are no block-comments
+
+### Schemas
+
+As previously mentioned `g3f` is a strongly typed file format.
+Schemas are IDs that can be referenced by other IDs.
+But because `g3f` is completely flat, it's impossible to define
+schema blocks inside the `{schema}` block itself.
+
+Instead it uses the `NULL` markers to define the existence of schemas.
+Schemas are then later defined in-line with the rest of the data.
+
+```g3f
+{schemas}
+{node <>}
+{link <>}
+
+{node}
+{id }
+{links >}
+
+{link}
+{id }
+{in }
+{out }
+```
+
+### Defining data
+
+Then using these schemas is easy enough.
+You don't have to use schemas however,
+if you want your file format to be completely dynamic and terrible.
+
+```g3f
+{<>:node}
+{id 0}
+{links [ 1 ]:}
+
+{<>:node}
+{id 1}
+{links [ 0 ]}
+```
+
+Note that `<>` in the name position of a block refers to an anonymous block without a name of it's own.
+Deserialisation of this file would happen as a list of nodes, each without a name.
+
+When building graph structures, it is possible to have loops.
+This is allowed via `g3f`.
+
+Also of note: when using blocks that are named, in a flat structure,
+deserialisation happens as a map `name => { data }`!
+
+### Some thoughts on deserialisation
+
+(not specifically part of the spec - to be expanded!)
+
+Deserialised into C code this would look like the following:
+
+```C
+struct node_t {
+    id: int32_t;
+    links: *int32_t;
+}
+
+struct node_t * nodes = [ node_t { ... }, node_t { ... } ];
+```
+
+Because `g3f` has no hirarchy structure, and there's no in-file format references between the two nodes,
+the deserialised returns a list of nodes.
+Building a graph in memory is then your responsibility.
+However, `g3f` can handle a few scenarios for you.
+
+Image we used references, instead of integers, for links:
+
+```g3f
+{node}
+{id }
+{links >}
+```
+
+What does this change? Well let's look at a data section:
+
+```g3f
+{node_0:node}
+{id 0}
+{links [ node_1 ]}
+
+{node_1:node}
+{id 1}
+{links [ node_0 ]}
+```
+
+In this case, `g3f` will deserialise into a list with a single node,
+which is `node_0` because it is considered the root-node for the graph.
+
+### Upgradability
+
+Applications might add new fields to their schemas and data sections.
+In binary encoders such as protobuf, code is specifically generated for
+an exchange format and also includes forwards compatible markers to
+allow for schema changes.
+
+`g3f` needs none of that!
+Because data state inside the parser is dynamic and type checking
+is only done against the schema in a file,
+if the code using the parser library doesn't expect certain
+data keys or expects others to be there that aren't present,
+this can be gracefully handled.
+
+New keys can be added the same way they would be in a dynamic file.
+Keys that are present despite not being expected can simply be ignored.
+The spec makes explicit note of writes and re-writes being done
+in-place,
+meaning that changes are always local to the keys that are changed.
+If an update ignores certain keys, it doesn't matter if they were
+ignored because they were not important or unknown to the application.
+
-- 
cgit v1.2.3

From f28abd489cc7ebd9f2d14f584052a93607d78985 Mon Sep 17 00:00:00 2001
From: Katharina Fey
Date: Thu, 28 Feb 2019 20:33:22 +0100
Subject: Adjusting the way that schemas work

---
 README.md | 36 +++++++++++++++++++++---------------
 1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 8e78484..1d92cd7 100644
--- a/README.md
+++ b/README.md
@@ -33,6 +33,12 @@ defining what data types exist and how they are layed out.
 The file extention for a `g3f` file is `.g3f` by default,
 however this implementation is not opinionated on that.
 
+Another important point:
+it should be considered part of the spec that writes are
+done in-place to existing data.
+No data should be overwritten if not explicitly desired
+by the application using the `g3f` library!
+
 ### Header
 
 At the top of every `g3f` file is a header.
@@ -47,14 +53,15 @@ It looks something like this
 {impl "g3f-reference"}
 {impl_version "0.8.5"}
 
-{schemas}
-# ...
+{data}
 ```
 
 A few notes here:
 
-- `g3f` is a flat format. When declaring a new top-level block (i.e. `{schemas}`) this ends the `{header}` block.
-- A block can enforce a schema (i.e. here we enforce that all required fields from `builtin/header` are present)
+- `g3f` is a flat format. When declaring a new top-level block
+  (i.e. `{data}`) this ends the `{header}` block.
+- A block can enforce a schema (i.e. here we enforce that all
+  required fields from `builtin/header` are present)
 - Nodes always have a single data value. Supported types are
   - string (`"1.0.0"`)
   - int (`42`)
@@ -63,23 +70,22 @@ A few notes here:
   - list<...> (`[ ... ]` - Elements are not comma-separated!)
   - ref (`some_id` - not quoted!)
   - type (`<...>` refers to some type information
-  - NULL (`<>` which is an empty type/name marker)
+  - schema (`` as a literal)
 - `#` is a line comment. There are no block-comments
 
 ### Schemas
 
 As previously mentioned `g3f` is a strongly typed file format.
 Schemas are IDs that can be referenced by other IDs.
-But because `g3f` is completely flat, it's impossible to define
-schema blocks inside the `{schema}` block itself.
-
-Instead it uses the `NULL` markers to define the existence of schemas.
-Schemas are then later defined in-line with the rest of the data.
+Because `g3f` is completely flat, it's impossible to have a `{schemas}`
+block in which to define schemas.
+Instead inside the header it's possible to use the `` type marker
+to pre-declare schema data which will later be defined by blocks.
 
 ```g3f
-{schemas}
-{node <>}
-{link <>}
+{header}
+{node }
+{links }
 
 {node}
 {id }
@@ -100,7 +106,7 @@ if you want your file format to be completely dynamic and terrible.
 ```g3f
 {<>:node}
 {id 0}
-{links [ 1 ]:}
+{links [ 1 ]}
 
 {<>:node}
 {id 1}
@@ -131,7 +137,7 @@ struct node_t {
 struct node_t * nodes = [ node_t { ... }, node_t { ... } ];
 ```
 
-Because `g3f` has no hirarchy structure, and there's no in-file format references between the two nodes,
+Because `g3f` has no hierarchical structure, and there's no in-file format references between the two nodes,
 the deserialised returns a list of nodes.
 Building a graph in memory is then your responsibility.
 However, `g3f` can handle a few scenarios for you.
-- 
cgit v1.2.3
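
The value grammar described in the README hunks of these patches (quoted strings, ints, floats, bools, whitespace-separated lists, bare refs, `#` line comments, one `{key value}` node per line) can be sketched as a small line-oriented reader. This is a minimal sketch in Python, not the reference implementation: the names `Ref`, `parse_scalar`, `parse_value` and `parse` are hypothetical, and the rule that a brace pair without a value opens a new block is an assumption read off the README examples rather than something the spec states.

```python
import re
from collections import namedtuple

# Hypothetical wrapper for the README's unquoted `ref` value type.
Ref = namedtuple("Ref", "name")

# A token is either a double-quoted string or a run of non-whitespace.
TOKEN = re.compile(r'"[^"]*"|\S+')


def parse_scalar(tok):
    """Map one token onto the scalar types listed in the README."""
    if len(tok) >= 2 and tok[0] == '"' and tok[-1] == '"':
        return tok[1:-1]
    if tok in ("true", "false"):
        return tok == "true"
    if re.fullmatch(r"-?\d+", tok):
        return int(tok)
    if re.fullmatch(r"-?\d+\.\d+", tok):
        return float(tok)
    return Ref(tok)  # assumption: anything else is a bare reference


def parse_value(text):
    """Parse a node value: a `[ ... ]` list or a single scalar."""
    text = text.strip()
    if text.startswith("[") and text.endswith("]"):
        # List elements are separated by whitespace, not commas.
        return [parse_scalar(t) for t in TOKEN.findall(text[1:-1])]
    return parse_scalar(text)


def parse(source):
    """Return the blocks of a g3f document in file order."""
    blocks = []
    for raw in source.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or `#` line comment
        if not (line.startswith("{") and line.endswith("}")):
            raise ValueError(f"not a g3f line: {raw!r}")
        key, _, value = line[1:-1].strip().partition(" ")
        if not key:
            raise ValueError(f"empty braces: {raw!r}")
        if not value:
            # Assumption: a brace pair without a value opens a new block,
            # written as `{name}` or `{name:schema}`; `<>` means anonymous.
            name, _, schema = key.partition(":")
            blocks.append({"name": None if name == "<>" else name,
                           "schema": schema or None,
                           "fields": {}})
        elif not blocks:
            raise ValueError(f"node outside of a block: {raw!r}")
        else:
            blocks[-1]["fields"][key] = parse_value(value)
    return blocks
```

Type markers (`<...>`) and schema enforcement are deliberately left out, so this only covers the data side of a file. Because the result is a flat list of blocks in file order, a caller could turn named blocks into the `name => { data }` map the README mentions, or resolve `Ref` values into an in-memory graph, as a separate pass.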