diff options
-rw-r--r-- | docs/ARCHITECTURE.md | 97 |
1 files changed, 96 insertions, 1 deletions
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 5b50f8faa..a1fa246c2 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md | |||
@@ -1 +1,96 @@ | |||
1 | # Design and open questions about libsyntax. | 1 | # Design and open questions about libsyntax |
2 | |||
3 | |||
4 | The high-level description of the architecture is in RFC.md. You might | ||
5 | also want to dig through https://github.com/matklad/fall/ which | ||
6 | contains some pretty interesting stuff build using similar ideas | ||
7 | (warning: it is completely undocumented, poorly written and in general | ||
8 | not the thing which I recommend to study (yes, this is | ||
9 | self-contradictory)). | ||
10 | |||
11 | ## Tree | ||
12 | |||
13 | The centerpiece of this whole endeavor is the syntax tree, in the | ||
14 | `tree` module. Open questions: | ||
15 | |||
16 | - how to best represent errors, to take advantage of the fact that | ||
17 | they are rare, but to enable fully-persistent style structure | ||
18 | sharing between tree nodes? | ||
19 | |||
20 | - should we make red/green split from Roslyn more pronounced? | ||
21 | |||
22 | - one can layout nodes in a single array in such a way that children | ||
23 | of the node form a continuous slice. Seems nifty, but do we need it? | ||
24 | |||
25 | - should we use SoA or AoS for NodeData? | ||
26 | |||
27 | - should we split leaf nodes and internal nodes into separate arrays? | ||
28 | Can we use it to save some bits here and there? (leaves don't need | ||
29 | first_child field, for example). | ||
30 | |||
31 | |||
32 | ## Parser | ||
33 | |||
34 | The syntax tree is produced using a three-staged process. | ||
35 | |||
36 | First, a raw text is split into tokens with a lexer. Lexer has a | ||
37 | peculiar signature: it is an `Fn(&str) -> Token`, where token is a | ||
38 | pair of `SyntaxKind` (you should have read the `tree` module and RFC | ||
39 | by this time! :)) and a len. That is, lexer chomps only the first | ||
40 | token of the input. This forces the lexer to be stateless, and makes | ||
41 | it possible to implement incremental relexing easily. | ||
42 | |||
43 | Then, the bulk of work, the parser turns a stream of tokens into | ||
44 | stream of events. Not that parser **does not** construct a tree right | ||
45 | away. This is done for several reasons: | ||
46 | |||
47 | * to decouple the actual tree data structure from the parser: you can | ||
48 | build any datastructre you want from the stream of events | ||
49 | |||
50 | * to make parsing fast: you can produce a list of events without | ||
51 | allocations | ||
52 | |||
53 | * to make it easy to tweak tree structure. Consider this code: | ||
54 | |||
55 | ``` | ||
56 | #[cfg(test)] | ||
57 | pub fn foo() {} | ||
58 | ``` | ||
59 | |||
60 | Here, the attribute and the `pub` keyword must be the children of | ||
61 | the `fn` node. However, when parsing them, we don't yet know if | ||
62 | there would be a function ahead: it very well might be a `struct` | ||
63 | there. If we use events, we generally don't care about this *in | ||
64 | parser* and just spit them in order. | ||
65 | |||
66 | * (Is this true?) to make incremental reparsing easier: you can reuse | ||
67 | the same rope data structure for all of the original string, the | ||
68 | tokens and the events. | ||
69 | |||
70 | |||
71 | The parser also does not know about whitespace tokens: it's the job of | ||
72 | the next layer to assign whitespace and comments to nodes. However, | ||
73 | parser can remap contextual tokens, like `>>` or `union`, so it has | ||
74 | access to the text. | ||
75 | |||
76 | And at last, the TreeBuilder converts a flat stream of events into a | ||
77 | tree structure. It also *should* be responsible for attaching comments | ||
78 | and rebalancing the tree, but it does not do this yet :) | ||
79 | |||
80 | |||
81 | ## Error reporing | ||
82 | |||
83 | TODO: describe how stuff like `skip_to_first` works | ||
84 | |||
85 | |||
86 | ## Validator | ||
87 | |||
88 | Parser and lexer accept a lot of *invalid* code intentionally. The | ||
89 | idea is to post-process the tree and to proper error reporting, | ||
90 | literal conversion and quick-fix suggestions. There is no | ||
91 | design/implementation for this yet. | ||
92 | |||
93 | |||
94 | ## AST | ||
95 | |||
96 | Nothing yet, see `AstNode` in `fall`. | ||