author    Aleksey Kladov <[email protected]>  2018-08-24 16:14:21 +0100
committer Aleksey Kladov <[email protected]>  2018-08-24 16:14:21 +0100
commit    2812015d40c94ddb94bb57ed1a9811ff24414c44 (patch)
tree      d134a3af875036f3921c5b620685510f6dbe0bf6 /docs
parent    6cade3f6d8ad7bb5a11b1910689b25f709c12502 (diff)

README

Diffstat (limited to 'docs')

-rw-r--r--  docs/ARCHITECTURE.md   93
-rw-r--r--  docs/RFC.md           494
-rw-r--r--  docs/TESTS.md          44
-rw-r--r--  docs/TOOLS.md          36

4 files changed, 0 insertions, 667 deletions
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
deleted file mode 100644
index 6b4434396..000000000
--- a/docs/ARCHITECTURE.md
+++ /dev/null
@@ -1,93 +0,0 @@
# Design and open questions about libsyntax

The high-level description of the architecture is in RFC.md. You might
also want to dig through https://github.com/matklad/fall/ which
contains some pretty interesting stuff built using similar ideas
(warning: it is completely undocumented, poorly written, and in general
not something I recommend studying (yes, this is self-contradictory)).

## Tree

The centerpiece of this whole endeavor is the syntax tree, in the
`tree` module. Open questions:

- how to best represent errors, to take advantage of the fact that
  they are rare, but to enable fully-persistent style structure
  sharing between tree nodes?

- should we make the red/green split from Roslyn more pronounced?

- one can lay out nodes in a single array in such a way that the children
  of a node form a contiguous slice. Seems nifty, but do we need it?

- should we use SoA or AoS for NodeData?

- should we split leaf nodes and internal nodes into separate arrays?
  Can we use it to save some bits here and there? (leaves don't need
  a first_child field, for example).

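For the SoA-vs-AoS question, here is a minimal sketch of the two layouts
(type and field names are hypothetical, not the actual libsyntax2 types):

```rust
// Array-of-Structs: one Vec, each element carries all fields of a node.
struct NodeAoS {
    kind: u16,
    range: (u32, u32),
    first_child: Option<u32>,
}
struct TreeAoS {
    nodes: Vec<NodeAoS>,
}

// Struct-of-Arrays: one Vec per field; index i describes node i.
// A pass that only inspects `kinds` never touches the other arrays,
// which can be friendlier to the cache.
struct TreeSoA {
    kinds: Vec<u16>,
    ranges: Vec<(u32, u32)>,
    first_children: Vec<Option<u32>>,
}

fn main() {
    let aos = TreeAoS {
        nodes: vec![NodeAoS { kind: 1, range: (0, 3), first_child: None }],
    };
    let soa = TreeSoA {
        kinds: vec![1],
        ranges: vec![(0, 3)],
        first_children: vec![None],
    };
    // Both layouts describe the same logical node.
    assert_eq!(aos.nodes[0].kind, soa.kinds[0]);
    assert_eq!(aos.nodes[0].range, soa.ranges[0]);
}
```
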

## Parser

The syntax tree is produced using a three-stage process.

First, the raw text is split into tokens by the lexer (the `lexer` module).
The lexer has a peculiar signature: it is an `Fn(&str) -> Token`, where a token
is a pair of a `SyntaxKind` (you should have read the `tree` module and the RFC
by this time! :)) and a length. That is, the lexer chomps only the first
token of the input. This forces the lexer to be stateless and makes
it possible to implement incremental relexing easily.

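The `Fn(&str) -> Token` shape can be sketched as follows. This is a toy
lexer over whitespace and identifiers only, not the real `lexer` module;
the names `next_token` and `tokenize` are illustrative:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum SyntaxKind { Whitespace, Ident, Error }

#[derive(Debug, PartialEq, Clone, Copy)]
struct Token { kind: SyntaxKind, len: usize }

// Chomp only the first token of the input; the caller advances by
// `len` bytes and calls again. No state is carried between calls.
fn next_token(text: &str) -> Token {
    let c = text.chars().next().expect("non-empty input");
    let (kind, pred): (SyntaxKind, fn(char) -> bool) = if c.is_whitespace() {
        (SyntaxKind::Whitespace, char::is_whitespace)
    } else if c.is_alphabetic() {
        (SyntaxKind::Ident, char::is_alphanumeric)
    } else {
        return Token { kind: SyntaxKind::Error, len: c.len_utf8() };
    };
    let len = text.chars().take_while(|&c| pred(c)).map(char::len_utf8).sum();
    Token { kind, len }
}

// Statelessness makes full tokenization a trivial driver loop,
// and relexing can restart from any token boundary.
fn tokenize(mut text: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    while !text.is_empty() {
        let tok = next_token(text);
        text = &text[tok.len..];
        tokens.push(tok);
    }
    tokens
}

fn main() {
    let toks = tokenize("fn foo");
    assert_eq!(toks.len(), 3); // "fn", " ", "foo"
    assert_eq!(toks[0], Token { kind: SyntaxKind::Ident, len: 2 });
}
```
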
Then comes the bulk of the work: the parser turns a stream of tokens into a
stream of events (the `parser` module; of particular interest are
the `parser/event` and `parser/parser` modules, which contain the parsing
API, and the `parser/grammar` module, which contains the actual parsing code
for various Rust syntactic constructs). Note that the parser **does not**
construct a tree right away. This is done for several reasons:

* to decouple the actual tree data structure from the parser: you can
  build any data structure you want from the stream of events

* to make parsing fast: you can produce a list of events without
  allocations

* to make it easy to tweak the tree structure. Consider this code:

  ```
  #[cfg(test)]
  pub fn foo() {}
  ```

  Here, the attribute and the `pub` keyword must be children of
  the `fn` node. However, when parsing them, we don't yet know if
  there will be a function ahead: it very well might be a `struct`
  there. With events, we generally don't care about this *in the
  parser* and just emit them in order.

* (Is this true?) to make incremental reparsing easier: you can reuse
  the same rope data structure for all of the original string, the
  tokens and the events.

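The event idea can be sketched like this (a simplified model with
hypothetical names, not the actual `parser/event` types): the parser emits a
flat list, and a later pass folds it into nesting.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    Start(&'static str), // open a node of the given kind
    Token(&'static str), // a leaf token
    Finish,              // close the most recently opened node
}

// Fold the flat event list into an indented textual tree. This stands
// in for real tree construction: any structure can be built this way.
fn render(events: &[Event]) -> String {
    let mut out = String::new();
    let mut depth = 0usize;
    for ev in events {
        match ev {
            Event::Start(kind) => {
                out.push_str(&"  ".repeat(depth));
                out.push_str(kind);
                out.push('\n');
                depth += 1;
            }
            Event::Token(kind) => {
                out.push_str(&"  ".repeat(depth));
                out.push_str(kind);
                out.push('\n');
            }
            Event::Finish => depth -= 1,
        }
    }
    out
}

fn main() {
    // For `#[cfg(test)] pub fn foo() {}` the attribute's events are
    // emitted before the parser knows the node is a function.
    let events = [
        Event::Start("FN_DEF"),
        Event::Token("ATTR"),
        Event::Token("FN_KW"),
        Event::Token("IDENT"),
        Event::Finish,
    ];
    assert_eq!(render(&events), "FN_DEF\n  ATTR\n  FN_KW\n  IDENT\n");
}
```
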

The parser also does not know about whitespace tokens: it's the job of
the next layer to assign whitespace and comments to nodes. However, the
parser can remap contextual tokens, like `>>` or `union`, so it has
access to the text.

Finally, the TreeBuilder converts the flat stream of events into a
tree structure. It also *should* be responsible for attaching comments
and rebalancing the tree, but it does not do this yet :)

## Validator

The parser and lexer intentionally accept a lot of *invalid* code. The
idea is to post-process the tree to do proper error reporting,
literal conversion and quick-fix suggestions. There is no
design/implementation for this yet.


## AST

Nothing yet, see `AstNode` in `fall`.
diff --git a/docs/RFC.md b/docs/RFC.md
deleted file mode 100644
index 2bd9f18f1..000000000
--- a/docs/RFC.md
+++ /dev/null
@@ -1,494 +0,0 @@
- Feature Name: libsyntax2.0
- Start Date: 2017-12-30
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)


> I think the lack of reusability comes in object-oriented languages,
> not functional languages. Because the problem with object-oriented
> languages is they’ve got all this implicit environment that they
> carry around with them. You wanted a banana but what you got was a
> gorilla holding the banana and the entire jungle.
>
> If you have referentially transparent code, if you have pure
> functions — all the data comes in its input arguments and everything
> goes out and leave no state behind — it’s incredibly reusable.
>
> **Joe Armstrong**
# Summary
[summary]: #summary

The long-term plan is to rewrite the libsyntax parser and syntax tree data
structure to create a software component independent of the rest of the
rustc compiler and suitable for the needs of IDEs and code
editors. This RFC is the first step of that plan; its goal is to
find out if this is possible at least in theory. If it is possible,
the next steps would be a prototype implementation as a crates.io
crate and a separate RFC for integrating the prototype with rustc,
other tools, and eventual libsyntax removal.

Note that this RFC does not propose to stabilize any API for working
with Rust syntax: the semver version of the hypothetical library would
be `0.1.0`. It is intended to be used by tools which are currently
closely related to the compiler: `rustc`, `rustfmt`, `clippy`, `rls`
and the hypothetical `rustfix`. While it would be possible to create
third-party tools on top of the new libsyntax, the burden of adapting
to breaking changes would be on the authors of such tools.


# Motivation
[motivation]: #motivation

There are two main drawbacks to the current version of libsyntax:

* It is tightly integrated with the compiler and hard to use
  independently

* The AST representation is not well-suited for use inside IDEs


## IDE support

There are several differences in how IDEs and compilers typically
treat source code.

In a compiler, it is convenient to transform the source
code into an Abstract Syntax Tree, which is independent of the
surface syntax. For example, it's convenient to discard comments and
whitespace and to desugar some syntactic constructs in terms of
simpler ones.

In contrast, IDEs work much closer to the source code, so it is
crucial to preserve full information about the original text. For
example, an IDE may adjust indentation after typing a `}` which closes a
block, and to do this correctly, the IDE must be aware of the syntax (that is,
that `}` indeed closes some block and is not a syntax error) and of
all whitespace and comments. So, an IDE-suitable AST should explicitly
account for syntactic elements not considered important by the
compiler.


Another difference is that IDEs typically work with incomplete and
syntactically invalid code. This boils down to two parser properties.
First, the parser must produce a syntax tree even if some required input
is missing. For example, for the input `fn foo` the function node should
be present in the parse, despite the fact that there are no parameters
or body. Second, the parser must be able to skip over parts of the input
it can't recognize and aggressively recover from errors. That is, the
syntax tree data structure should be able to handle both missing and
extra nodes.

IDEs also need the ability to incrementally reparse and relex source
code as the user types. A smart IDE would use the syntax tree structure
to handle editing commands (for example, to add/remove trailing commas
after join/split lines actions), so parsing time can be very
noticeable.


Currently rustc uses the classical AST approach and preserves some of
the source code information in the form of spans in the AST. It is not
clear whether this structure can fulfill all IDE requirements.

## Reusability

In theory, the parser can be a pure function which takes a `&str` as
input and produces a `ParseTree` as output.

This is great for reusability: for example, you can compile this
function to WASM and use it for fast client-side validation of syntax
on the Rust playground, or you can develop tools like `rustfmt` on
stable Rust outside of the rustc repository, or you can embed the parser
into your favorite IDE or code editor.

This is also great for correctness: with such a simple interface, it's
possible to write property-based tests to thoroughly compare two
different implementations of the parser. It's also straightforward to
create a comprehensive test suite, because all the inputs and outputs
are trivially serializable to human-readable text.

Another benefit is performance: with this signature, you can cache a
parse tree for each file, with a trivial strategy for cache invalidation
(invalidate an entry when the underlying file changes). On top of such
a cache it is possible to build a smart code indexer which maintains
the set of symbols in the project, watches files for changes and
automatically reindexes only changed files.
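The caching strategy above can be sketched as follows. This is a toy cache
keyed by the current file contents; `parse`, `ParseTree` and `ParseCache`
are hypothetical stand-ins, not proposed API:

```rust
use std::collections::HashMap;

// Stand-in for a real parse tree.
#[derive(Debug, Clone, PartialEq)]
struct ParseTree(String);

// A pure function of the text: same input, same output, no global state.
fn parse(text: &str) -> ParseTree {
    ParseTree(format!("tree of `{}`", text))
}

// Because `parse` is pure, caching by (path, current contents) is
// trivially correct: a changed file simply misses the cache.
struct ParseCache {
    entries: HashMap<String, (String, ParseTree)>, // path -> (text, tree)
}

impl ParseCache {
    fn new() -> Self {
        ParseCache { entries: HashMap::new() }
    }

    fn get(&mut self, path: &str, text: &str) -> &ParseTree {
        let slot = self
            .entries
            .entry(path.to_string())
            .or_insert_with(|| (text.to_string(), parse(text)));
        if slot.0 != text {
            // Invalidate on change: reparse and replace the entry.
            *slot = (text.to_string(), parse(text));
        }
        &slot.1
    }
}

fn main() {
    let mut cache = ParseCache::new();
    let t1 = cache.get("lib.rs", "fn foo() {}").clone();
    let t2 = cache.get("lib.rs", "fn foo() {}").clone();
    assert_eq!(t1, t2); // unchanged file: cached tree is reused
    let t3 = cache.get("lib.rs", "fn bar() {}").clone();
    assert_ne!(t3, t1); // changed file: reparsed
}
```
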

Unfortunately, the current libsyntax is far from this ideal. For
example, even the lexer makes use of the `FileMap`, which is
essentially global state of the compiler representing all known
files. As a data point, it turned out to be easier to move `rustfmt`
into the main `rustc` repository than to move libsyntax outside!


# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

Not applicable.


# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

It is not clear if a single parser can accommodate the needs of both the
compiler and the IDE, but there is hope that it is possible. This RFC
proposes to develop libsyntax2.0 as an experimental crates.io crate. If
the experiment turns out to be a success, a second RFC will propose
to integrate it with all existing tools and `rustc`.

Next, a syntax tree data structure is proposed for libsyntax2.0. It
seems to have the following important properties:

* It is lossless and faithfully represents the original source code,
  including explicit nodes for comments and whitespace.

* It is flexible and allows encoding arbitrary node structure,
  even for invalid syntax.

* It is minimal: it stores a small amount of data and has no
  dependencies. For instance, it does not need the compiler's string
  interner or literal data representation.

* While the tree itself is minimal, it is extensible, in the sense that
  it is possible to associate arbitrary data with certain nodes in a
  type-safe way.


It is not clear if this representation is the best one. It is heavily
inspired by the [PSI] data structure used in [IntelliJ]-based IDEs
and in the [Kotlin] compiler.

[PSI]: http://www.jetbrains.org/intellij/sdk/docs/reference_guide/custom_language_support/implementing_parser_and_psi.html
[IntelliJ]: https://github.com/JetBrains/intellij-community/
[Kotlin]: https://kotlinlang.org/


## Untyped Tree

The main idea is to store the minimal amount of information in the
tree itself, and instead lean heavily on the source code for the
actual data about identifier names, constant values, etc.

All nodes in the tree are of the same type and store a constant for
the syntactic category of the element and a range in the source code.

Here is a minimal implementation of this data structure with some Rust
syntactic categories:

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct NodeKind(u16);

pub struct File {
    text: String,
    nodes: Vec<NodeData>,
}

struct NodeData {
    kind: NodeKind,
    range: (u32, u32),
    parent: Option<u32>,
    first_child: Option<u32>,
    next_sibling: Option<u32>,
}

#[derive(Clone, Copy)]
pub struct Node<'f> {
    file: &'f File,
    idx: u32,
}

pub struct Children<'f> {
    next: Option<Node<'f>>,
}

impl File {
    pub fn root<'f>(&'f self) -> Node<'f> {
        assert!(!self.nodes.is_empty());
        Node { file: self, idx: 0 }
    }
}

impl<'f> Node<'f> {
    pub fn kind(&self) -> NodeKind {
        self.data().kind
    }

    pub fn text(&self) -> &'f str {
        let (start, end) = self.data().range;
        &self.file.text[start as usize..end as usize]
    }

    pub fn parent(&self) -> Option<Node<'f>> {
        self.as_node(self.data().parent)
    }

    pub fn children(&self) -> Children<'f> {
        Children { next: self.as_node(self.data().first_child) }
    }

    fn data(&self) -> &'f NodeData {
        &self.file.nodes[self.idx as usize]
    }

    fn as_node(&self, idx: Option<u32>) -> Option<Node<'f>> {
        idx.map(|idx| Node { file: self.file, idx })
    }
}

impl<'f> Iterator for Children<'f> {
    type Item = Node<'f>;

    fn next(&mut self) -> Option<Node<'f>> {
        let next = self.next;
        self.next = next.and_then(|node| node.as_node(node.data().next_sibling));
        next
    }
}

pub const ERROR: NodeKind = NodeKind(0);
pub const WHITESPACE: NodeKind = NodeKind(1);
pub const STRUCT_KW: NodeKind = NodeKind(2);
pub const IDENT: NodeKind = NodeKind(3);
pub const L_CURLY: NodeKind = NodeKind(4);
pub const R_CURLY: NodeKind = NodeKind(5);
pub const COLON: NodeKind = NodeKind(6);
pub const COMMA: NodeKind = NodeKind(7);
pub const AMP: NodeKind = NodeKind(8);
pub const LINE_COMMENT: NodeKind = NodeKind(9);
pub const FILE: NodeKind = NodeKind(10);
pub const STRUCT_DEF: NodeKind = NodeKind(11);
pub const FIELD_DEF: NodeKind = NodeKind(12);
pub const TYPE_REF: NodeKind = NodeKind(13);
```

Here is a Rust snippet and the corresponding parse tree:

```rust
struct Foo {
    field1: u32,
    &
    // non-doc comment
    field2:
}
```

```
FILE
  STRUCT_DEF
    STRUCT_KW
    WHITESPACE
    IDENT
    WHITESPACE
    L_CURLY
    WHITESPACE
    FIELD_DEF
      IDENT
      COLON
      WHITESPACE
      TYPE_REF
        IDENT
    COMMA
    WHITESPACE
    ERROR
      AMP
    WHITESPACE
    FIELD_DEF
      LINE_COMMENT
      WHITESPACE
      IDENT
      COLON
      ERROR
    WHITESPACE
    R_CURLY
```

Note several features of the tree:

* All whitespace and comments are explicitly accounted for.

* The node for `STRUCT_DEF` contains the error element for `&`, but
  still represents the following field correctly.

* The second field of the struct is incomplete: the `FIELD_DEF` node for
  it contains an `ERROR` element, but nevertheless has the correct
  `NodeKind`.

* The non-doc comment is correctly attached to the following
  field.

## Typed Tree

It's hard to work with this raw parse tree because it is untyped: a
node containing a struct definition has the same API as a node for
a struct field. But it's possible to add a strongly typed layer on
top of this raw tree and get a zero-cost AST. Here is an example
which adds type-safe wrappers for structs and fields:
```rust
// generic infrastructure

pub trait AstNode<'f>: Copy + 'f {
    fn new(node: Node<'f>) -> Option<Self>;
    fn node(&self) -> Node<'f>;
}

pub fn child_of_kind<'f>(node: Node<'f>, kind: NodeKind) -> Option<Node<'f>> {
    node.children().find(|child| child.kind() == kind)
}

pub fn ast_children<'f, A: AstNode<'f>>(node: Node<'f>) -> Box<Iterator<Item=A> + 'f> {
    Box::new(node.children().filter_map(A::new))
}

// AST elements, specific to Rust

#[derive(Clone, Copy)]
pub struct StructDef<'f>(Node<'f>);

#[derive(Clone, Copy)]
pub struct FieldDef<'f>(Node<'f>);

#[derive(Clone, Copy)]
pub struct TypeRef<'f>(Node<'f>);

pub trait NameOwner<'f>: AstNode<'f> {
    fn name_ident(&self) -> Node<'f> {
        child_of_kind(self.node(), IDENT).unwrap()
    }

    fn name(&self) -> &'f str { self.name_ident().text() }
}


impl<'f> AstNode<'f> for StructDef<'f> {
    fn new(node: Node<'f>) -> Option<Self> {
        if node.kind() == STRUCT_DEF { Some(StructDef(node)) } else { None }
    }
    fn node(&self) -> Node<'f> { self.0 }
}

impl<'f> NameOwner<'f> for StructDef<'f> {}

impl<'f> StructDef<'f> {
    pub fn fields(&self) -> Box<Iterator<Item=FieldDef<'f>> + 'f> {
        ast_children(self.node())
    }
}


impl<'f> AstNode<'f> for FieldDef<'f> {
    fn new(node: Node<'f>) -> Option<Self> {
        if node.kind() == FIELD_DEF { Some(FieldDef(node)) } else { None }
    }
    fn node(&self) -> Node<'f> { self.0 }
}

impl<'f> FieldDef<'f> {
    pub fn type_ref(&self) -> Option<TypeRef<'f>> {
        ast_children(self.node()).next()
    }
}

impl<'f> NameOwner<'f> for FieldDef<'f> {}


impl<'f> AstNode<'f> for TypeRef<'f> {
    fn new(node: Node<'f>) -> Option<Self> {
        if node.kind() == TYPE_REF { Some(TypeRef(node)) } else { None }
    }
    fn node(&self) -> Node<'f> { self.0 }
}
```

Note that although the AST wrappers provide type-safe access to the
tree, they are still represented as indexes, so clients of the syntax
tree can easily associate additional data with AST nodes by storing
it in a side table.

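Since a `Node` is essentially a `(file, index)` pair, such a side table is
just a map keyed by the node index. Here is a sketch; `TypeTable` and the
`Type` payload are hypothetical, standing in for whatever a client pass
wants to attach:

```rust
use std::collections::HashMap;

// Hypothetical per-node analysis result.
#[derive(Debug, Clone, PartialEq)]
enum Type { U32, Unknown }

// The tree itself stays immutable and minimal; a pass that wants to
// attach data to nodes keeps its own map from node index to payload.
struct TypeTable {
    types: HashMap<u32, Type>,
}

impl TypeTable {
    fn new() -> Self {
        TypeTable { types: HashMap::new() }
    }

    fn set(&mut self, node_idx: u32, ty: Type) {
        self.types.insert(node_idx, ty);
    }

    fn get(&self, node_idx: u32) -> Type {
        self.types.get(&node_idx).cloned().unwrap_or(Type::Unknown)
    }
}

fn main() {
    let mut table = TypeTable::new();
    table.set(12, Type::U32); // e.g. the TYPE_REF node of `field1`
    assert_eq!(table.get(12), Type::U32);
    assert_eq!(table.get(99), Type::Unknown); // absent nodes have no data
}
```
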
## Missing Source Code

The crucial feature of this syntax tree is that it is just a view into
the original source code. And this poses a problem for the Rust
language, because not all compiled Rust code is represented in the
form of source code! Specifically, Rust has a powerful macro system,
which effectively allows creating and parsing additional source code at
compile time. It is not entirely clear that the proposed parsing
framework is able to handle this use case, and it's the main purpose
of this RFC to figure it out. The current idea for handling macros is
to make each macro expansion produce a triple of (expansion text,
syntax tree, hygiene information), where hygiene information is a side
table which colors different ranges of the expansion text according
to the original syntactic context.

## Implementation plan

This RFC proposes huge changes to the internals of the compiler, so
it's important to proceed carefully and incrementally. The following
plan is suggested:

* RFC discussion about the theoretical feasibility of the proposal,
  and the best representation for the syntax tree.

* Implementation of the proposal as a completely separate crates.io
  crate, by refactoring existing libsyntax source code to produce a
  new tree.

* A prototype implementation of macro expansion on top of the new
  syntax tree.

* An additional round of discussion/RFC about merging with the mainline
  compiler.

# Drawbacks
[drawbacks]: #drawbacks

- No harm will be done as long as the new libsyntax exists as an
  experiment on crates.io. However, actually using it in the compiler
  and other tools would require massive refactorings.

- It's difficult to know upfront whether the proposed syntax tree would
  actually work well in both the compiler and the IDE. It may well be
  that some drawbacks will be discovered during implementation.

# Rationale and alternatives
[alternatives]: #alternatives

- Incrementally add more information about the source code to the current
  AST.

- Move the current libsyntax to crates.io as is. In the past, there
  were several failed attempts to do that.

- Explore alternative representations for the parse tree.

- Use a parser generator instead of a hand-written parser. Using the
  parser from libsyntax directly would be easier, and hand-written
  LL-style parsers usually have much better error recovery than
  generated LR-style ones.

# Unresolved questions
[unresolved]: #unresolved-questions

- Is it at all possible to represent the Rust parser as a pure function of
  the source code? It seems like the answer is yes, because the
  language, and especially macros, were cleverly designed with this
  use case in mind.


- Is it possible to implement macro expansion using the proposed
  framework? This is the main question of this RFC. The proposed
  solution of synthesizing source code on the fly seems workable: it's
  not that different from the current implementation, which
  synthesizes token trees.


- How to actually phase out the current libsyntax, if libsyntax2.0 turns
  out to be a success?
diff --git a/docs/TESTS.md b/docs/TESTS.md
deleted file mode 100644
index a9d32d1d4..000000000
--- a/docs/TESTS.md
+++ /dev/null
@@ -1,44 +0,0 @@
# libsyntax2.0 testing infrastructure

libsyntax2.0 tests live in the `tests/data` directory. Each test is a
pair of files: an `.rs` file with Rust code and a `.txt` file with a
human-readable representation of the syntax tree.

The test suite is intended to be independent of any particular parser:
that's why it is just a list of files.

The test suite is intended to be progressive: that is, if you want to
write a Rust parser, you can TDD it by working through the tests in
order. That's why each test file begins with a number. Generally,
tests should be added in order of the appearance of the corresponding
functionality in libsyntax2.0. If a bug in the parser is uncovered, a
**new** test should be created instead of modifying an existing one:
it is preferable to have a gazillion small isolated test files
rather than a single file which covers all edge cases. It's okay for
files to have the same name except for the leading number. In general,
the test suite should be append-only: old tests should not be modified;
new tests should be created instead.
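For example, a pair might look like this (the file name and tree shape here
are illustrative, not taken from the actual suite):

```
// 0001_struct_item.rs
struct S;

// 0001_struct_item.txt
FILE
  STRUCT_DEF
    STRUCT_KW
    WHITESPACE
    IDENT
    SEMI
```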

Note that only `ok` tests are normative: `err` tests exercise error
recovery, and it is totally OK for a parser not to implement any error
recovery at all. However, for libsyntax2.0 we do care about error
recovery, and we do care about precise and useful error messages.

There are also so-called "inline tests". They appear as comments
with a `test` header in the source code, like this:

```rust
// test fn_basic
// fn foo() {}
fn function(p: &mut Parser) {
    // ...
}
```

You can run the `cargo collect-tests` command to collect all inline tests
into the `tests/data/inline` directory. The main advantage of inline tests
is that they help to illustrate what the relevant code is doing.


Contribution opportunity: design and implement testing infrastructure
for validators.
diff --git a/docs/TOOLS.md b/docs/TOOLS.md
deleted file mode 100644
index f8754c06f..000000000
--- a/docs/TOOLS.md
+++ /dev/null
@@ -1,36 +0,0 @@
# Tools used to implement libsyntax

libsyntax uses several tools to help with development.

Each tool is a binary in the [tools/](../tools) package.
You can run them via the `cargo run` command:

```
cargo run --package tools --bin tool
```

There are also aliases in [.cargo/config](../.cargo/config),
so the following also works:

```
cargo tool
```


## Tool: `gen`

This tool reads a "grammar" from [grammar.ron](../grammar.ron) and
generates the `syntax_kinds.rs` file. You should run this tool if you
add new keywords or syntax elements.


## Tool: `parse`

This tool reads Rust source code from standard input, parses it,
and prints the result to stdout.


## Tool: `collect-tests`

This tool collects inline tests from comments in the libsyntax2 source code
and places them into the `tests/data/inline` directory.