aboutsummaryrefslogtreecommitdiff
path: root/docs
diff options
context:
space:
mode:
Diffstat (limited to 'docs')
-rw-r--r--docs/ARCHITECTURE.md1
-rw-r--r--docs/CONTRIBUTING.md32
-rw-r--r--docs/TESTS.md30
-rw-r--r--docs/rfc.md494
-rw-r--r--docs/validation.md14
5 files changed, 571 insertions, 0 deletions
diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md
new file mode 100644
index 000000000..5b50f8faa
--- /dev/null
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1 @@
# Design and open questions about libsyntax.
diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md
new file mode 100644
index 000000000..5ae2e830e
--- /dev/null
+++ b/docs/CONTRIBUTING.md
@@ -0,0 +1,32 @@
1The project is in its early stages: contributions are welcome and
2would be **very** helpful, but the project is not *yet* optimized for
3contributors. Moreover, it is doubly experimental, so there's no
4guarantee that any work here would reach production. That said, here
5are some arias where contributions would be **especially** welcome:
6
7
8* Designing internal data structures: RFC only outlines the
9 constraints, it's an open question how to satisfy them in the
10 optimal way. See `ARCHITECTURE.md` for current design questions.
11
12* Porting libsyntax parser to libsyntax2: currently libsyntax2 parses
13 only a tiny subset of Rust. This should be fixed by porting parsing
14 functions from libsyntax one by one.
15
16* Writing validators: by design, libsyntax2 is very lax about the
17 input. For example, the lexer happily accepts unclosed strings. The
18 idea is that there should be a higher level visitor, which walks the
19 syntax tree after parsing and produces all the warnings. Alas,
20 there's no such visitor yet :( Would you like to write one? :)
21
22* Creating tests: it would be tremendously helpful to read each of
23 libsyntax and libsyntax2 parser functions and crate a small separate
24 test cases to cover each and every edge case.
25
26* Building stuff with libsyntax2: it would be really cool to compile
27 libsyntax2 to WASM and add *client side* syntax validation to rust
28 playground!
29
30
31Do take a look at the issue tracker, and try to read other docs in
32this folder.
diff --git a/docs/TESTS.md b/docs/TESTS.md
new file mode 100644
index 000000000..8005ec9da
--- /dev/null
+++ b/docs/TESTS.md
@@ -0,0 +1,30 @@
1# libsyntax2.0 testing infrastructure
2
3Libsyntax2.0 tests are in the `tests/data` directory. Each test is a
4pair of files, an `.rs` file with Rust code and a `.txt` file with a
5human-readable representation of syntax tree.
6
7The test suite is intended to be independent from a particular parser:
8that's why it is just a list of files.
9
10The test suite is intended to be progressive: that is, if you want to
11write a Rust parser, you can TDD it by working through the test in
12order. That's why each test file begins with the number. Generally,
13tests should be added in order of the appearance of corresponding
14functionality in libsytnax2.0. If a bug in parser is uncovered, a
15**new** test should be created instead of modifying an existing one:
16it is preferable to have a gazillion of small isolated test files,
17rather than a single file which covers all edge cases. It's okay for
18files to have the same name except for the leading number. In general,
19test suite should be append-only: old tests should not be modified,
20new tests should be created instead.
21
22
23Note that only `ok` tests are normative: `err` tests test error
24recovery and it is totally ok for a parser to not implement any error
25recovery at all. However, for libsyntax2.0 we do care about error
26recovery, and we do care about precise and useful error messages.
27
28
29Contribution opportunity: design and implement testing infrastructure
30for validators.
diff --git a/docs/rfc.md b/docs/rfc.md
new file mode 100644
index 000000000..2bd9f18f1
--- /dev/null
+++ b/docs/rfc.md
@@ -0,0 +1,494 @@
1- Feature Name: libsyntax2.0
2- Start Date: 2017-12-30
3- RFC PR: (leave this empty)
4- Rust Issue: (leave this empty)
5
6
7>I think the lack of reusability comes in object-oriented languages,
8>not functional languages. Because the problem with object-oriented
9>languages is they’ve got all this implicit environment that they
10>carry around with them. You wanted a banana but what you got was a
11>gorilla holding the banana and the entire jungle.
12>
13>If you have referentially transparent code, if you have pure
14>functions — all the data comes in its input arguments and everything
15>goes out and leave no state behind — it’s incredibly reusable.
16>
17> **Joe Armstrong**
18
19# Summary
20[summary]: #summary
21
22The long-term plan is to rewrite libsyntax parser and syntax tree data
23structure to create a software component independent of the rest of
24rustc compiler and suitable for the needs of IDEs and code
25editors. This RFCs is the first step of this plan, whose goal is to
26find out if this is possible at least in theory. If it is possible,
27the next steps would be a prototype implementation as a crates.io
28crate and a separate RFC for integrating the prototype with rustc,
29other tools, and eventual libsyntax removal.
30
31Note that this RFC does not propose to stabilize any API for working
32with rust syntax: the semver version of the hypothetical library would
33be `0.1.0`. It is intended to be used by tools, which are currently
34closely related to the compiler: `rustc`, `rustfmt`, `clippy`, `rls`
35and hypothetical `rustfix`. While it would be possible to create
36third-party tools on top of the new libsyntax, the burden of adopting
37to breaking changes would be on authors of such tools.
38
39
40# Motivation
41[motivation]: #motivation
42
43There are two main drawbacks with the current version of libsyntax:
44
45* It is tightly integrated with the compiler and hard to use
46 independently
47
48* The AST representation is not well-suited for use inside IDEs
49
50
51## IDE support
52
53There are several differences in how IDEs and compilers typically
54treat source code.
55
56In the compiler, it is convenient to transform the source
57code into Abstract Syntax Tree form, which is independent of the
58surface syntax. For example, it's convenient to discard comments,
59whitespaces and desugar some syntactic constructs in terms of the
60simpler ones.
61
62In contrast, IDEs work much closer to the source code, so it is
63crucial to preserve full information about the original text. For
64example, IDE may adjust indentation after typing a `}` which closes a
65block, and to do this correctly, IDE must be aware of syntax (that is,
66that `}` indeed closes some block, and is not a syntax error) and of
67all whitespaces and comments. So, IDE suitable AST should explicitly
68account for syntactic elements, not considered important by the
69compiler.
70
71Another difference is that IDEs typically work with incomplete and
72syntactically invalid code. This boils down to two parser properties.
73First, the parser must produce syntax tree even if some required input
74is missing. For example, for input `fn foo` the function node should
75be present in the parse, despite the fact that there is no parameters
76or body. Second, the parser must be able to skip over parts of input
77it can't recognize and aggressively recover from errors. That is, the
78syntax tree data structure should be able to handle both missing and
79extra nodes.
80
81IDEs also need the ability to incrementally reparse and relex source
82code after the user types. A smart IDE would use syntax tree structure
83to handle editing commands (for example, to add/remove trailing commas
84after join/split lines actions), so parsing time can be very
85noticeable.
86
87
88Currently rustc uses the classical AST approach, and preserves some of
89the source code information in the form of spans in the AST. It is not
90clear if this structure can full fill all IDE requirements.
91
92
93## Reusability
94
95In theory, the parser can be a pure function, which takes a `&str` as
96an input, and produces a `ParseTree` as an output.
97
98This is great for reusability: for example, you can compile this
99function to WASM and use it for fast client-side validation of syntax
100on the rust playground, or you can develop tools like `rustfmt` on
101stable Rust outside of rustc repository, or you can embed the parser
102into your favorite IDE or code editor.
103
104This is also great for correctness: with such simple interface, it's
105possible to write property-based tests to thoroughly compare two
106different implementations of the parser. It's also straightforward to
107create a comprehensive test suite, because all the inputs and outputs
108are trivially serializable to human-readable text.
109
110Another benefit is performance: with this signature, you can cache a
111parse tree for each file, with trivial strategy for cache invalidation
112(invalidate an entry when the underling file changes). On top of such
113a cache it is possible to build a smart code indexer which maintains
114the set of symbols in the project, watches files for changes and
115automatically reindexes only changed files.
116
117Unfortunately, the current libsyntax is far from this ideal. For
118example, even the lexer makes use of the `FileMap` which is
119essentially a global state of the compiler which represents all know
120files. As a data point, it turned out to be easier to move `rustfmt`
121into the main `rustc` repository than to move libsyntax outside!
122
123
124# Guide-level explanation
125[guide-level-explanation]: #guide-level-explanation
126
127Not applicable.
128
129
130# Reference-level explanation
131[reference-level-explanation]: #reference-level-explanation
132
133It is not clear if a single parser can accommodate the needs of the
134compiler and the IDE, but there is hope that it is possible. The RFC
135proposes to develop libsynax2.0 as an experimental crates.io crate. If
136the experiment turns out to be a success, the second RFC will propose
137to integrate it with all existing tools and `rustc`.
138
139Next, a syntax tree data structure is proposed for libsyntax2.0. It
140seems to have the following important properties:
141
142* It is lossless and faithfully represents the original source code,
143 including explicit nodes for comments and whitespace.
144
145* It is flexible and allows to encode arbitrary node structure,
146 even for invalid syntax.
147
148* It is minimal: it stores small amount of data and has no
149 dependencies. For instance, it does not need compiler's string
150 interner or literal data representation.
151
152* While the tree itself is minimal, it is extensible in a sense that
153 it possible to associate arbitrary data with certain nodes in a
154 type-safe way.
155
156
157It is not clear if this representation is the best one. It is heavily
158inspired by [PSI] data structure which used in [IntelliJ] based IDEs
159and in the [Kotlin] compiler.
160
161[PSI]: http://www.jetbrains.org/intellij/sdk/docs/reference_guide/custom_language_support/implementing_parser_and_psi.html
162[IntelliJ]: https://github.com/JetBrains/intellij-community/
163[Kotlin]: https://kotlinlang.org/
164
165
166## Untyped Tree
167
168The main idea is to store the minimal amount of information in the
169tree itself, and instead lean heavily on the source code for the
170actual data about identifier names, constant values etc.
171
172All nodes in the tree are of the same type and store a constant for
173the syntactic category of the element and a range in the source code.
174
175Here is a minimal implementation of this data structure with some Rust
176syntactic categories
177
178
179```rust
180#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
181pub struct NodeKind(u16);
182
183pub struct File {
184 text: String,
185 nodes: Vec<NodeData>,
186}
187
188struct NodeData {
189 kind: NodeKind,
190 range: (u32, u32),
191 parent: Option<u32>,
192 first_child: Option<u32>,
193 next_sibling: Option<u32>,
194}
195
196#[derive(Clone, Copy)]
197pub struct Node<'f> {
198 file: &'f File,
199 idx: u32,
200}
201
202pub struct Children<'f> {
203 next: Option<Node<'f>>,
204}
205
206impl File {
207 pub fn root<'f>(&'f self) -> Node<'f> {
208 assert!(!self.nodes.is_empty());
209 Node { file: self, idx: 0 }
210 }
211}
212
213impl<'f> Node<'f> {
214 pub fn kind(&self) -> NodeKind {
215 self.data().kind
216 }
217
218 pub fn text(&self) -> &'f str {
219 let (start, end) = self.data().range;
220 &self.file.text[start as usize..end as usize]
221 }
222
223 pub fn parent(&self) -> Option<Node<'f>> {
224 self.as_node(self.data().parent)
225 }
226
227 pub fn children(&self) -> Children<'f> {
228 Children { next: self.as_node(self.data().first_child) }
229 }
230
231 fn data(&self) -> &'f NodeData {
232 &self.file.nodes[self.idx as usize]
233 }
234
235 fn as_node(&self, idx: Option<u32>) -> Option<Node<'f>> {
236 idx.map(|idx| Node { file: self.file, idx })
237 }
238}
239
240impl<'f> Iterator for Children<'f> {
241 type Item = Node<'f>;
242
243 fn next(&mut self) -> Option<Node<'f>> {
244 let next = self.next;
245 self.next = next.and_then(|node| node.as_node(node.data().next_sibling));
246 next
247 }
248}
249
250pub const ERROR: NodeKind = NodeKind(0);
251pub const WHITESPACE: NodeKind = NodeKind(1);
252pub const STRUCT_KW: NodeKind = NodeKind(2);
253pub const IDENT: NodeKind = NodeKind(3);
254pub const L_CURLY: NodeKind = NodeKind(4);
255pub const R_CURLY: NodeKind = NodeKind(5);
256pub const COLON: NodeKind = NodeKind(6);
257pub const COMMA: NodeKind = NodeKind(7);
258pub const AMP: NodeKind = NodeKind(8);
259pub const LINE_COMMENT: NodeKind = NodeKind(9);
260pub const FILE: NodeKind = NodeKind(10);
261pub const STRUCT_DEF: NodeKind = NodeKind(11);
262pub const FIELD_DEF: NodeKind = NodeKind(12);
263pub const TYPE_REF: NodeKind = NodeKind(13);
264```
265
266Here is a rust snippet and the corresponding parse tree:
267
268```rust
269struct Foo {
270 field1: u32,
271 &
272 // non-doc comment
273 field2:
274}
275```
276
277
278```
279FILE
280 STRUCT_DEF
281 STRUCT_KW
282 WHITESPACE
283 IDENT
284 WHITESPACE
285 L_CURLY
286 WHITESPACE
287 FIELD_DEF
288 IDENT
289 COLON
290 WHITESPACE
291 TYPE_REF
292 IDENT
293 COMMA
294 WHITESPACE
295 ERROR
296 AMP
297 WHITESPACE
298 FIELD_DEF
299 LINE_COMMENT
300 WHITESPACE
301 IDENT
302 COLON
303 ERROR
304 WHITESPACE
305 R_CURLY
306```
307
308Note several features of the tree:
309
310* All whitespace and comments are explicitly accounted for.
311
312* The node for `STRUCT_DEF` contains the error element for `&`, but
313 still represents the following field correctly.
314
315* The second field of the struct is incomplete: `FIELD_DEF` node for
316 it contains an `ERROR` element, but nevertheless has the correct
317 `NodeKind`.
318
319* The non-documenting comment is correctly attached to the following
320 field.
321
322
323## Typed Tree
324
325It's hard to work with this raw parse tree, because it is untyped:
326node containing a struct definition has the same API as the node for
327the struct field. But it's possible to add a strongly typed layer on
328top of this raw tree, and get a zero-cost AST. Here is an example
329which adds type-safe wrappers for structs and fields:
330
331```rust
332// generic infrastructure
333
334pub trait AstNode<'f>: Copy + 'f {
335 fn new(node: Node<'f>) -> Option<Self>;
336 fn node(&self) -> Node<'f>;
337}
338
339pub fn child_of_kind<'f>(node: Node<'f>, kind: NodeKind) -> Option<Node<'f>> {
340 node.children().find(|child| child.kind() == kind)
341}
342
343pub fn ast_children<'f, A: AstNode<'f>>(node: Node<'f>) -> Box<Iterator<Item=A> + 'f> {
344 Box::new(node.children().filter_map(A::new))
345}
346
347// AST elements, specific to Rust
348
349#[derive(Clone, Copy)]
350pub struct StructDef<'f>(Node<'f>);
351
352#[derive(Clone, Copy)]
353pub struct FieldDef<'f>(Node<'f>);
354
355#[derive(Clone, Copy)]
356pub struct TypeRef<'f>(Node<'f>);
357
358pub trait NameOwner<'f>: AstNode<'f> {
359 fn name_ident(&self) -> Node<'f> {
360 child_of_kind(self.node(), IDENT).unwrap()
361 }
362
363 fn name(&self) -> &'f str { self.name_ident().text() }
364}
365
366
367impl<'f> AstNode<'f> for StructDef<'f> {
368 fn new(node: Node<'f>) -> Option<Self> {
369 if node.kind() == STRUCT_DEF { Some(StructDef(node)) } else { None }
370 }
371 fn node(&self) -> Node<'f> { self.0 }
372}
373
374impl<'f> NameOwner<'f> for StructDef<'f> {}
375
376impl<'f> StructDef<'f> {
377 pub fn fields(&self) -> Box<Iterator<Item=FieldDef<'f>> + 'f> {
378 ast_children(self.node())
379 }
380}
381
382
383impl<'f> AstNode<'f> for FieldDef<'f> {
384 fn new(node: Node<'f>) -> Option<Self> {
385 if node.kind() == FIELD_DEF { Some(FieldDef(node)) } else { None }
386 }
387 fn node(&self) -> Node<'f> { self.0 }
388}
389
390impl<'f> FieldDef<'f> {
391 pub fn type_ref(&self) -> Option<TypeRef<'f>> {
392 ast_children(self.node()).next()
393 }
394}
395
396impl<'f> NameOwner<'f> for FieldDef<'f> {}
397
398
399impl<'f> AstNode<'f> for TypeRef<'f> {
400 fn new(node: Node<'f>) -> Option<Self> {
401 if node.kind() == TYPE_REF { Some(TypeRef(node)) } else { None }
402 }
403 fn node(&self) -> Node<'f> { self.0 }
404}
405```
406
407Note that although AST wrappers provide a type-safe access to the
408tree, they are still represented as indexes, so clients of the syntax
409tree can easily associated additional data with AST nodes by storing
410it in a side-table.
411
412
413## Missing Source Code
414
415The crucial feature of this syntax tree is that it is just a view into
416the original source code. And this poses a problem for the Rust
417language, because not all compiled Rust code is represented in the
418form of source code! Specifically, Rust has a powerful macro system,
419which effectively allows to create and parse additional source code at
420compile time. It is not entirely clear that the proposed parsing
421framework is able to handle this use case, and it's the main purpose
422of this RFC to figure it out. The current idea for handling macros is
423to make each macro expansion produce a triple of (expansion text,
424syntax tree, hygiene information), where hygiene information is a side
425table, which colors different ranges of the expansion text according
426to the original syntactic context.
427
428
429## Implementation plan
430
431This RFC proposes huge changes to the internals of the compiler, so
432it's important to proceed carefully and incrementally. The following
433plan is suggested:
434
435* RFC discussion about the theoretical feasibility of the proposal,
436 and the best representation representation for the syntax tree.
437
438* Implementation of the proposal as a completely separate crates.io
439 crate, by refactoring existing libsyntax source code to produce a
440 new tree.
441
442* A prototype implementation of the macro expansion on top of the new
443 sytnax tree.
444
445* Additional round of discussion/RFC about merging with the mainline
446 compiler.
447
448
449# Drawbacks
450[drawbacks]: #drawbacks
451
452- No harm will be done as long as the new libsyntax exists as an
453 experiemt on crates.io. However, actually using it in the compiler
454 and other tools would require massive refactorings.
455
456- It's difficult to know upfront if the proposed syntax tree would
457 actually work well in both the compiler and IDE. It may be possible
458 that some drawbacks will be discovered during implementation.
459
460
461# Rationale and alternatives
462[alternatives]: #alternatives
463
464- Incrementally add more information about source code to the current
465 AST.
466
467- Move the current libsyntax to crates.io as is. In the past, there
468 were several failed attempts to do that.
469
470- Explore alternative representations for the parse tree.
471
472- Use parser generator instead of hand written parser. Using the
473 parser from libsyntax directly would be easier, and hand-written
474 LL-style parsers usually have much better error recovery than
475 generated LR-style ones.
476
477# Unresolved questions
478[unresolved]: #unresolved-questions
479
480- Is it at all possible to represent Rust parser as a pure function of
481 the source code? It seems like the answer is yes, because the
482 language and especially macros were cleverly designed with this
483 use-case in mind.
484
485
486- Is it possible to implement macro expansion using the proposed
487 framework? This is the main question of this RFC. The proposed
488 solution of synthesizing source code on the fly seems workable: it's
489 not that different from the current implementation, which
490 synthesizes token trees.
491
492
493- How to actually phase out current libsyntax, if libsyntax2.0 turns
494 out to be a success?
diff --git a/docs/validation.md b/docs/validation.md
new file mode 100644
index 000000000..2739bfcdd
--- /dev/null
+++ b/docs/validation.md
@@ -0,0 +1,14 @@
1Fixmes:
2
3Lexer:
4* Fix `is_whitespace`, add more tests
5* Add more thorough tests for idents for XID_Start & XID_Continue
6* Validate that float and integer literals use digits only of the appropriate
7 base, and are in range
8* Validation for unclosed char literal
9* Strings are completely wrong: more tests and comparison with libsyntax.
10* Comment lexing is completely wrong
11
12Parser:
13* Figure out what is the expected state of attribute grammar.
14 Token trees or something more structured? Token trees would be unfortunate: no extend selection =/ \ No newline at end of file