From 4c10c31be339ef3082f142061d83149af2a30ec8 Mon Sep 17 00:00:00 2001 From: Aleksey Kladov Date: Wed, 10 Jan 2018 21:58:38 +0300 Subject: D: start documenting stuff --- Cargo.toml | 1 + README.md | 5 + docs/ARCHITECTURE.md | 1 + docs/CONTRIBUTING.md | 32 ++++ docs/TESTS.md | 30 ++++ docs/rfc.md | 494 +++++++++++++++++++++++++++++++++++++++++++++++++++ docs/validation.md | 14 ++ rfc.md | 494 --------------------------------------------------- validation.md | 14 -- 9 files changed, 577 insertions(+), 508 deletions(-) create mode 100644 README.md create mode 100644 docs/ARCHITECTURE.md create mode 100644 docs/CONTRIBUTING.md create mode 100644 docs/TESTS.md create mode 100644 docs/rfc.md create mode 100644 docs/validation.md delete mode 100644 rfc.md delete mode 100644 validation.md diff --git a/Cargo.toml b/Cargo.toml index 043f97752..802a99b37 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -2,6 +2,7 @@ name = "libsyntax2" version = "0.1.0" authors = ["Aleksey Kladov "] +license = "MIT OR Apache-2.0" [dependencies] unicode-xid = "0.1.0" diff --git a/README.md b/README.md new file mode 100644 index 000000000..78bcc89b3 --- /dev/null +++ b/README.md @@ -0,0 +1,5 @@ +# libsyntax2.0 + +libsyntax2.0 is an **experimental** implementation of the corresponding [RFC](https://github.com/rust-lang/rfcs/pull/2256). + +See `docs` folder for more details. \ No newline at end of file diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md new file mode 100644 index 000000000..5b50f8faa --- /dev/null +++ b/docs/ARCHITECTURE.md @@ -0,0 +1 @@ +# Design and open questions about libsyntax. diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md new file mode 100644 index 000000000..5ae2e830e --- /dev/null +++ b/docs/CONTRIBUTING.md @@ -0,0 +1,32 @@ +The project is in its early stages: contributions are welcome and +would be **very** helpful, but the project is not *yet* optimized for +contributors. Moreover, it is doubly experimental, so there's no +guarantee that any work here would reach production. That said, here +are some arias where contributions would be **especially** welcome: + + +* Designing internal data structures: RFC only outlines the + constraints, it's an open question how to satisfy them in the + optimal way. See `ARCHITECTURE.md` for current design questions. + +* Porting libsyntax parser to libsyntax2: currently libsyntax2 parses + only a tiny subset of Rust. This should be fixed by porting parsing + functions from libsyntax one by one. + +* Writing validators: by design, libsyntax2 is very lax about the + input. For example, the lexer happily accepts unclosed strings. The + idea is that there should be a higher level visitor, which walks the + syntax tree after parsing and produces all the warnings. Alas, + there's no such visitor yet :( Would you like to write one? :) + +* Creating tests: it would be tremendously helpful to read each of + libsyntax and libsyntax2 parser functions and crate a small separate + test cases to cover each and every edge case. + +* Building stuff with libsyntax2: it would be really cool to compile + libsyntax2 to WASM and add *client side* syntax validation to rust + playground! + + +Do take a look at the issue tracker, and try to read other docs in +this folder. diff --git a/docs/TESTS.md b/docs/TESTS.md new file mode 100644 index 000000000..8005ec9da --- /dev/null +++ b/docs/TESTS.md @@ -0,0 +1,30 @@ +# libsyntax2.0 testing infrastructure + +Libsyntax2.0 tests are in the `tests/data` directory. Each test is a +pair of files, an `.rs` file with Rust code and a `.txt` file with a +human-readable representation of syntax tree. + +The test suite is intended to be independent from a particular parser: +that's why it is just a list of files. + +The test suite is intended to be progressive: that is, if you want to +write a Rust parser, you can TDD it by working through the test in +order. That's why each test file begins with the number. Generally, +tests should be added in order of the appearance of corresponding +functionality in libsytnax2.0. If a bug in parser is uncovered, a +**new** test should be created instead of modifying an existing one: +it is preferable to have a gazillion of small isolated test files, +rather than a single file which covers all edge cases. It's okay for +files to have the same name except for the leading number. In general, +test suite should be append-only: old tests should not be modified, +new tests should be created instead. + + +Note that only `ok` tests are normative: `err` tests test error +recovery and it is totally ok for a parser to not implement any error +recovery at all. However, for libsyntax2.0 we do care about error +recovery, and we do care about precise and useful error messages. + + +Contribution opportunity: design and implement testing infrastructure +for validators. diff --git a/docs/rfc.md b/docs/rfc.md new file mode 100644 index 000000000..2bd9f18f1 --- /dev/null +++ b/docs/rfc.md @@ -0,0 +1,494 @@ +- Feature Name: libsyntax2.0 +- Start Date: 2017-12-30 +- RFC PR: (leave this empty) +- Rust Issue: (leave this empty) + + +>I think the lack of reusability comes in object-oriented languages, +>not functional languages. Because the problem with object-oriented +>languages is they’ve got all this implicit environment that they +>carry around with them. You wanted a banana but what you got was a +>gorilla holding the banana and the entire jungle. +> +>If you have referentially transparent code, if you have pure +>functions — all the data comes in its input arguments and everything +>goes out and leave no state behind — it’s incredibly reusable. +> +> **Joe Armstrong** + +# Summary +[summary]: #summary + +The long-term plan is to rewrite libsyntax parser and syntax tree data +structure to create a software component independent of the rest of +rustc compiler and suitable for the needs of IDEs and code +editors. This RFCs is the first step of this plan, whose goal is to +find out if this is possible at least in theory. If it is possible, +the next steps would be a prototype implementation as a crates.io +crate and a separate RFC for integrating the prototype with rustc, +other tools, and eventual libsyntax removal. + +Note that this RFC does not propose to stabilize any API for working +with rust syntax: the semver version of the hypothetical library would +be `0.1.0`. It is intended to be used by tools, which are currently +closely related to the compiler: `rustc`, `rustfmt`, `clippy`, `rls` +and hypothetical `rustfix`. While it would be possible to create +third-party tools on top of the new libsyntax, the burden of adopting +to breaking changes would be on authors of such tools. + + +# Motivation +[motivation]: #motivation + +There are two main drawbacks with the current version of libsyntax: + +* It is tightly integrated with the compiler and hard to use + independently + +* The AST representation is not well-suited for use inside IDEs + + +## IDE support + +There are several differences in how IDEs and compilers typically +treat source code. + +In the compiler, it is convenient to transform the source +code into Abstract Syntax Tree form, which is independent of the +surface syntax. For example, it's convenient to discard comments, +whitespaces and desugar some syntactic constructs in terms of the +simpler ones. + +In contrast, IDEs work much closer to the source code, so it is +crucial to preserve full information about the original text. For +example, IDE may adjust indentation after typing a `}` which closes a +block, and to do this correctly, IDE must be aware of syntax (that is, +that `}` indeed closes some block, and is not a syntax error) and of +all whitespaces and comments. So, IDE suitable AST should explicitly +account for syntactic elements, not considered important by the +compiler. + +Another difference is that IDEs typically work with incomplete and +syntactically invalid code. This boils down to two parser properties. +First, the parser must produce syntax tree even if some required input +is missing. For example, for input `fn foo` the function node should +be present in the parse, despite the fact that there is no parameters +or body. Second, the parser must be able to skip over parts of input +it can't recognize and aggressively recover from errors. That is, the +syntax tree data structure should be able to handle both missing and +extra nodes. + +IDEs also need the ability to incrementally reparse and relex source +code after the user types. A smart IDE would use syntax tree structure +to handle editing commands (for example, to add/remove trailing commas +after join/split lines actions), so parsing time can be very +noticeable. + + +Currently rustc uses the classical AST approach, and preserves some of +the source code information in the form of spans in the AST. It is not +clear if this structure can full fill all IDE requirements. + + +## Reusability + +In theory, the parser can be a pure function, which takes a `&str` as +an input, and produces a `ParseTree` as an output. + +This is great for reusability: for example, you can compile this +function to WASM and use it for fast client-side validation of syntax +on the rust playground, or you can develop tools like `rustfmt` on +stable Rust outside of rustc repository, or you can embed the parser +into your favorite IDE or code editor. + +This is also great for correctness: with such simple interface, it's +possible to write property-based tests to thoroughly compare two +different implementations of the parser. It's also straightforward to +create a comprehensive test suite, because all the inputs and outputs +are trivially serializable to human-readable text. + +Another benefit is performance: with this signature, you can cache a +parse tree for each file, with trivial strategy for cache invalidation +(invalidate an entry when the underling file changes). On top of such +a cache it is possible to build a smart code indexer which maintains +the set of symbols in the project, watches files for changes and +automatically reindexes only changed files. + +Unfortunately, the current libsyntax is far from this ideal. For +example, even the lexer makes use of the `FileMap` which is +essentially a global state of the compiler which represents all know +files. As a data point, it turned out to be easier to move `rustfmt` +into the main `rustc` repository than to move libsyntax outside! + + +# Guide-level explanation +[guide-level-explanation]: #guide-level-explanation + +Not applicable. + + +# Reference-level explanation +[reference-level-explanation]: #reference-level-explanation + +It is not clear if a single parser can accommodate the needs of the +compiler and the IDE, but there is hope that it is possible. The RFC +proposes to develop libsynax2.0 as an experimental crates.io crate. If +the experiment turns out to be a success, the second RFC will propose +to integrate it with all existing tools and `rustc`. + +Next, a syntax tree data structure is proposed for libsyntax2.0. It +seems to have the following important properties: + +* It is lossless and faithfully represents the original source code, + including explicit nodes for comments and whitespace. + +* It is flexible and allows to encode arbitrary node structure, + even for invalid syntax. + +* It is minimal: it stores small amount of data and has no + dependencies. For instance, it does not need compiler's string + interner or literal data representation. + +* While the tree itself is minimal, it is extensible in a sense that + it possible to associate arbitrary data with certain nodes in a + type-safe way. + + +It is not clear if this representation is the best one. It is heavily +inspired by [PSI] data structure which used in [IntelliJ] based IDEs +and in the [Kotlin] compiler. + +[PSI]: http://www.jetbrains.org/intellij/sdk/docs/reference_guide/custom_language_support/implementing_parser_and_psi.html +[IntelliJ]: https://github.com/JetBrains/intellij-community/ +[Kotlin]: https://kotlinlang.org/ + + +## Untyped Tree + +The main idea is to store the minimal amount of information in the +tree itself, and instead lean heavily on the source code for the +actual data about identifier names, constant values etc. + +All nodes in the tree are of the same type and store a constant for +the syntactic category of the element and a range in the source code. + +Here is a minimal implementation of this data structure with some Rust +syntactic categories + + +```rust +#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)] +pub struct NodeKind(u16); + +pub struct File { + text: String, + nodes: Vec, +} + +struct NodeData { + kind: NodeKind, + range: (u32, u32), + parent: Option, + first_child: Option, + next_sibling: Option, +} + +#[derive(Clone, Copy)] +pub struct Node<'f> { + file: &'f File, + idx: u32, +} + +pub struct Children<'f> { + next: Option>, +} + +impl File { + pub fn root<'f>(&'f self) -> Node<'f> { + assert!(!self.nodes.is_empty()); + Node { file: self, idx: 0 } + } +} + +impl<'f> Node<'f> { + pub fn kind(&self) -> NodeKind { + self.data().kind + } + + pub fn text(&self) -> &'f str { + let (start, end) = self.data().range; + &self.file.text[start as usize..end as usize] + } + + pub fn parent(&self) -> Option> { + self.as_node(self.data().parent) + } + + pub fn children(&self) -> Children<'f> { + Children { next: self.as_node(self.data().first_child) } + } + + fn data(&self) -> &'f NodeData { + &self.file.nodes[self.idx as usize] + } + + fn as_node(&self, idx: Option) -> Option> { + idx.map(|idx| Node { file: self.file, idx }) + } +} + +impl<'f> Iterator for Children<'f> { + type Item = Node<'f>; + + fn next(&mut self) -> Option> { + let next = self.next; + self.next = next.and_then(|node| node.as_node(node.data().next_sibling)); + next + } +} + +pub const ERROR: NodeKind = NodeKind(0); +pub const WHITESPACE: NodeKind = NodeKind(1); +pub const STRUCT_KW: NodeKind = NodeKind(2); +pub const IDENT: NodeKind = NodeKind(3); +pub const L_CURLY: NodeKind = NodeKind(4); +pub const R_CURLY: NodeKind = NodeKind(5); +pub const COLON: NodeKind = NodeKind(6); +pub const COMMA: NodeKind = NodeKind(7); +pub const AMP: NodeKind = NodeKind(8); +pub const LINE_COMMENT: NodeKind = NodeKind(9); +pub const FILE: NodeKind = NodeKind(10); +pub const STRUCT_DEF: NodeKind = NodeKind(11); +pub const FIELD_DEF: NodeKind = NodeKind(12); +pub const TYPE_REF: NodeKind = NodeKind(13); +``` + +Here is a rust snippet and the corresponding parse tree: + +```rust +struct Foo { + field1: u32, + & + // non-doc comment + field2: +} +``` + + +``` +FILE + STRUCT_DEF + STRUCT_KW + WHITESPACE + IDENT + WHITESPACE + L_CURLY + WHITESPACE + FIELD_DEF + IDENT + COLON + WHITESPACE + TYPE_REF + IDENT + COMMA + WHITESPACE + ERROR + AMP + WHITESPACE + FIELD_DEF + LINE_COMMENT + WHITESPACE + IDENT + COLON + ERROR + WHITESPACE + R_CURLY +``` + +Note several features of the tree: + +* All whitespace and comments are explicitly accounted for. + +* The node for `STRUCT_DEF` contains the error element for `&`, but + still represents the following field correctly. + +* The second field of the struct is incomplete: `FIELD_DEF` node for + it contains an `ERROR` element, but nevertheless has the correct + `NodeKind`. + +* The non-documenting comment is correctly attached to the following + field. + + +## Typed Tree + +It's hard to work with this raw parse tree, because it is untyped: +node containing a struct definition has the same API as the node for +the struct field. But it's possible to add a strongly typed layer on +top of this raw tree, and get a zero-cost AST. Here is an example +which adds type-safe wrappers for structs and fields: + +```rust +// generic infrastructure + +pub trait AstNode<'f>: Copy + 'f { + fn new(node: Node<'f>) -> Option; + fn node(&self) -> Node<'f>; +} + +pub fn child_of_kind<'f>(node: Node<'f>, kind: NodeKind) -> Option> { + node.children().find(|child| child.kind() == kind) +} + +pub fn ast_children<'f, A: AstNode<'f>>(node: Node<'f>) -> Box + 'f> { + Box::new(node.children().filter_map(A::new)) +} + +// AST elements, specific to Rust + +#[derive(Clone, Copy)] +pub struct StructDef<'f>(Node<'f>); + +#[derive(Clone, Copy)] +pub struct FieldDef<'f>(Node<'f>); + +#[derive(Clone, Copy)] +pub struct TypeRef<'f>(Node<'f>); + +pub trait NameOwner<'f>: AstNode<'f> { + fn name_ident(&self) -> Node<'f> { + child_of_kind(self.node(), IDENT).unwrap() + } + + fn name(&self) -> &'f str { self.name_ident().text() } +} + + +impl<'f> AstNode<'f> for StructDef<'f> { + fn new(node: Node<'f>) -> Option { + if node.kind() == STRUCT_DEF { Some(StructDef(node)) } else { None } + } + fn node(&self) -> Node<'f> { self.0 } +} + +impl<'f> NameOwner<'f> for StructDef<'f> {} + +impl<'f> StructDef<'f> { + pub fn fields(&self) -> Box> + 'f> { + ast_children(self.node()) + } +} + + +impl<'f> AstNode<'f> for FieldDef<'f> { + fn new(node: Node<'f>) -> Option { + if node.kind() == FIELD_DEF { Some(FieldDef(node)) } else { None } + } + fn node(&self) -> Node<'f> { self.0 } +} + +impl<'f> FieldDef<'f> { + pub fn type_ref(&self) -> Option> { + ast_children(self.node()).next() + } +} + +impl<'f> NameOwner<'f> for FieldDef<'f> {} + + +impl<'f> AstNode<'f> for TypeRef<'f> { + fn new(node: Node<'f>) -> Option { + if node.kind() == TYPE_REF { Some(TypeRef(node)) } else { None } + } + fn node(&self) -> Node<'f> { self.0 } +} +``` + +Note that although AST wrappers provide a type-safe access to the +tree, they are still represented as indexes, so clients of the syntax +tree can easily associated additional data with AST nodes by storing +it in a side-table. + + +## Missing Source Code + +The crucial feature of this syntax tree is that it is just a view into +the original source code. And this poses a problem for the Rust +language, because not all compiled Rust code is represented in the +form of source code! Specifically, Rust has a powerful macro system, +which effectively allows to create and parse additional source code at +compile time. It is not entirely clear that the proposed parsing +framework is able to handle this use case, and it's the main purpose +of this RFC to figure it out. The current idea for handling macros is +to make each macro expansion produce a triple of (expansion text, +syntax tree, hygiene information), where hygiene information is a side +table, which colors different ranges of the expansion text according +to the original syntactic context. + + +## Implementation plan + +This RFC proposes huge changes to the internals of the compiler, so +it's important to proceed carefully and incrementally. The following +plan is suggested: + +* RFC discussion about the theoretical feasibility of the proposal, + and the best representation representation for the syntax tree. + +* Implementation of the proposal as a completely separate crates.io + crate, by refactoring existing libsyntax source code to produce a + new tree. + +* A prototype implementation of the macro expansion on top of the new + sytnax tree. + +* Additional round of discussion/RFC about merging with the mainline + compiler. + + +# Drawbacks +[drawbacks]: #drawbacks + +- No harm will be done as long as the new libsyntax exists as an + experiemt on crates.io. However, actually using it in the compiler + and other tools would require massive refactorings. + +- It's difficult to know upfront if the proposed syntax tree would + actually work well in both the compiler and IDE. It may be possible + that some drawbacks will be discovered during implementation. + + +# Rationale and alternatives +[alternatives]: #alternatives + +- Incrementally add more information about source code to the current + AST. + +- Move the current libsyntax to crates.io as is. In the past, there + were several failed attempts to do that. + +- Explore alternative representations for the parse tree. + +- Use parser generator instead of hand written parser. Using the + parser from libsyntax directly would be easier, and hand-written + LL-style parsers usually have much better error recovery than + generated LR-style ones. + +# Unresolved questions +[unresolved]: #unresolved-questions + +- Is it at all possible to represent Rust parser as a pure function of + the source code? It seems like the answer is yes, because the + language and especially macros were cleverly designed with this + use-case in mind. + + +- Is it possible to implement macro expansion using the proposed + framework? This is the main question of this RFC. The proposed + solution of synthesizing source code on the fly seems workable: it's + not that different from the current implementation, which + synthesizes token trees. + + +- How to actually phase out current libsyntax, if libsyntax2.0 turns + out to be a success? diff --git a/docs/validation.md b/docs/validation.md new file mode 100644 index 000000000..2739bfcdd --- /dev/null +++ b/docs/validation.md @@ -0,0 +1,14 @@ +Fixmes: + +Lexer: +* Fix `is_whitespace`, add more tests +* Add more thorough tests for idents for XID_Start & XID_Continue +* Validate that float and integer literals use digits only of the appropriate + base, and are in range +* Validation for unclosed char literal +* Strings are completely wrong: more tests and comparison with libsyntax. +* Comment lexing is completely wrong + +Parser: +* Figure out what is the expected state of attribute grammar. + Token trees or something more structured? Token trees would be unfortunate: no extend selection =/ \ No newline at end of file diff --git a/rfc.md b/rfc.md deleted file mode 100644 index 2bd9f18f1..000000000 --- a/rfc.md +++ /dev/null @@ -1,494 +0,0 @@ -- Feature Name: libsyntax2.0 -- Start Date: 2017-12-30 -- RFC PR: (leave this empty) -- Rust Issue: (leave this empty) - - ->I think the lack of reusability comes in object-oriented languages, ->not functional languages. Because the problem with object-oriented ->languages is they’ve got all this implicit environment that they ->carry around with them. You wanted a banana but what you got was a ->gorilla holding the banana and the entire jungle. -> ->If you have referentially transparent code, if you have pure ->functions — all the data comes in its input arguments and everything ->goes out and leave no state behind — it’s incredibly reusable. -> -> **Joe Armstrong** - -# Summary -[summary]: #summary - -The long-term plan is to rewrite libsyntax parser and syntax tree data -structure to create a software component independent of the rest of -rustc compiler and suitable for the needs of IDEs and code -editors. This RFCs is the first step of this plan, whose goal is to -find out if this is possible at least in theory. If it is possible, -the next steps would be a prototype implementation as a crates.io -crate and a separate RFC for integrating the prototype with rustc, -other tools, and eventual libsyntax removal. - -Note that this RFC does not propose to stabilize any API for working -with rust syntax: the semver version of the hypothetical library would -be `0.1.0`. It is intended to be used by tools, which are currently -closely related to the compiler: `rustc`, `rustfmt`, `clippy`, `rls` -and hypothetical `rustfix`. While it would be possible to create -third-party tools on top of the new libsyntax, the burden of adopting -to breaking changes would be on authors of such tools. - - -# Motivation -[motivation]: #motivation - -There are two main drawbacks with the current version of libsyntax: - -* It is tightly integrated with the compiler and hard to use - independently - -* The AST representation is not well-suited for use inside IDEs - - -## IDE support - -There are several differences in how IDEs and compilers typically -treat source code. - -In the compiler, it is convenient to transform the source -code into Abstract Syntax Tree form, which is independent of the -surface syntax. For example, it's convenient to discard comments, -whitespaces and desugar some syntactic constructs in terms of the -simpler ones. - -In contrast, IDEs work much closer to the source code, so it is -crucial to preserve full information about the original text. For -example, IDE may adjust indentation after typing a `}` which closes a -block, and to do this correctly, IDE must be aware of syntax (that is, -that `}` indeed closes some block, and is not a syntax error) and of -all whitespaces and comments. So, IDE suitable AST should explicitly -account for syntactic elements, not considered important by the -compiler. - -Another difference is that IDEs typically work with incomplete and -syntactically invalid code. This boils down to two parser properties. -First, the parser must produce syntax tree even if some required input -is missing. For example, for input `fn foo` the function node should -be present in the parse, despite the fact that there is no parameters -or body. Second, the parser must be able to skip over parts of input -it can't recognize and aggressively recover from errors. That is, the -syntax tree data structure should be able to handle both missing and -extra nodes. - -IDEs also need the ability to incrementally reparse and relex source -code after the user types. A smart IDE would use syntax tree structure -to handle editing commands (for example, to add/remove trailing commas -after join/split lines actions), so parsing time can be very -noticeable. - - -Currently rustc uses the classical AST approach, and preserves some of -the source code information in the form of spans in the AST. It is not -clear if this structure can full fill all IDE requirements. - - -## Reusability - -In theory, the parser can be a pure function, which takes a `&str` as -an input, and produces a `ParseTree` as an output. - -This is great for reusability: for example, you can compile this -function to WASM and use it for fast client-side validation of syntax -on the rust playground, or you can develop tools like `rustfmt` on -stable Rust outside of rustc repository, or you can embed the parser -into your favorite IDE or code editor. - -This is also great for correctness: with such simple interface, it's -possible to write property-based tests to thoroughly compare two -different implementations of the parser. It's also straightforward to -create a comprehensive test suite, because all the inputs and outputs -are trivially serializable to human-readable text. - -Another benefit is performance: with this signature, you can cache a -parse tree for each file, with trivial strategy for cache invalidation -(invalidate an entry when the underling file changes). On top of such -a cache it is possible to build a smart code indexer which maintains -the set of symbols in the project, watches files for changes and -automatically reindexes only changed files. - -Unfortunately, the current libsyntax is far from this ideal. For -example, even the lexer makes use of the `FileMap` which is -essentially a global state of the compiler which represents all know -files. As a data point, it turned out to be easier to move `rustfmt` -into the main `rustc` repository than to move libsyntax outside! - - -# Guide-level explanation -[guide-level-explanation]: #guide-level-explanation - -Not applicable. - - -# Reference-level explanation -[reference-level-explanation]: #reference-level-explanation - -It is not clear if a single parser can accommodate the needs of the -compiler and the IDE, but there is hope that it is possible. The RFC -proposes to develop libsynax2.0 as an experimental crates.io crate. If -the experiment turns out to be a success, the second RFC will propose -to integrate it with all existing tools and `rustc`. - -Next, a syntax tree data structure is proposed for libsyntax2.0. It -seems to have the following important properties: - -* It is lossless and faithfully represents the original source code, - including explicit nodes for comments and whitespace. - -* It is flexible and allows to encode arbitrary node structure, - even for invalid syntax. - -* It is minimal: it stores small amount of data and has no - dependencies. For instance, it does not need compiler's string - interner or literal data representation. - -* While the tree itself is minimal, it is extensible in a sense that - it possible to associate arbitrary data with certain nodes in a - type-safe way. - - -It is not clear if this representation is the best one. It is heavily -inspired by [PSI] data structure which used in [IntelliJ] based IDEs -and in the [Kotlin] compiler. - -[PSI]: http://www.jetbrains.org/intellij/sdk/docs/reference_guide/custom_language_support/implementing_parser_and_psi.html -[IntelliJ]: https://github.com/JetBrains/intellij-community/ -[Kotlin]: https://kotlinlang.org/ - - -## Untyped Tree - -The main idea is to store the minimal amount of information in the -tree itself, and instead lean heavily on the source code for the -actual data about identifier names, constant values etc. - -All nodes in the tree are of the same type and store a constant for -the syntactic category of the element and a range in the source code. - -Here is a minimal implementation of this data structure with some Rust -syntactic categories - - -```rust -#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)] -pub struct NodeKind(u16); - -pub struct File { - text: String, - nodes: Vec, -} - -struct NodeData { - kind: NodeKind, - range: (u32, u32), - parent: Option, - first_child: Option, - next_sibling: Option, -} - -#[derive(Clone, Copy)] -pub struct Node<'f> { - file: &'f File, - idx: u32, -} - -pub struct Children<'f> { - next: Option>, -} - -impl File { - pub fn root<'f>(&'f self) -> Node<'f> { - assert!(!self.nodes.is_empty()); - Node { file: self, idx: 0 } - } -} - -impl<'f> Node<'f> { - pub fn kind(&self) -> NodeKind { - self.data().kind - } - - pub fn text(&self) -> &'f str { - let (start, end) = self.data().range; - &self.file.text[start as usize..end as usize] - } - - pub fn parent(&self) -> Option> { - self.as_node(self.data().parent) - } - - pub fn children(&self) -> Children<'f> { - Children { next: self.as_node(self.data().first_child) } - } - - fn data(&self) -> &'f NodeData { - &self.file.nodes[self.idx as usize] - } - - fn as_node(&self, idx: Option) -> Option> { - idx.map(|idx| Node { file: self.file, idx }) - } -} - -impl<'f> Iterator for Children<'f> { - type Item = Node<'f>; - - fn next(&mut self) -> Option> { - let next = self.next; - self.next = next.and_then(|node| node.as_node(node.data().next_sibling)); - next - } -} - -pub const ERROR: NodeKind = NodeKind(0); -pub const WHITESPACE: NodeKind = NodeKind(1); -pub const STRUCT_KW: NodeKind = NodeKind(2); -pub const IDENT: NodeKind = NodeKind(3); -pub const L_CURLY: NodeKind = NodeKind(4); -pub const R_CURLY: NodeKind = NodeKind(5); -pub const COLON: NodeKind = NodeKind(6); -pub const COMMA: NodeKind = NodeKind(7); -pub const AMP: NodeKind = NodeKind(8); -pub const LINE_COMMENT: NodeKind = NodeKind(9); -pub const FILE: NodeKind = NodeKind(10); -pub const STRUCT_DEF: NodeKind = NodeKind(11); -pub const FIELD_DEF: NodeKind = NodeKind(12); -pub const TYPE_REF: NodeKind = NodeKind(13); -``` - -Here is a rust snippet and the corresponding parse tree: - -```rust -struct Foo { - field1: u32, - & - // non-doc comment - field2: -} -``` - - -``` -FILE - STRUCT_DEF - STRUCT_KW - WHITESPACE - IDENT - WHITESPACE - L_CURLY - WHITESPACE - FIELD_DEF - IDENT - COLON - WHITESPACE - TYPE_REF - IDENT - COMMA - WHITESPACE - ERROR - AMP - WHITESPACE - FIELD_DEF - LINE_COMMENT - WHITESPACE - IDENT - COLON - ERROR - WHITESPACE - R_CURLY -``` - -Note several features of the tree: - -* All whitespace and comments are explicitly accounted for. - -* The node for `STRUCT_DEF` contains the error element for `&`, but - still represents the following field correctly. - -* The second field of the struct is incomplete: `FIELD_DEF` node for - it contains an `ERROR` element, but nevertheless has the correct - `NodeKind`. - -* The non-documenting comment is correctly attached to the following - field. - - -## Typed Tree - -It's hard to work with this raw parse tree, because it is untyped: -node containing a struct definition has the same API as the node for -the struct field. But it's possible to add a strongly typed layer on -top of this raw tree, and get a zero-cost AST. Here is an example -which adds type-safe wrappers for structs and fields: - -```rust -// generic infrastructure - -pub trait AstNode<'f>: Copy + 'f { - fn new(node: Node<'f>) -> Option; - fn node(&self) -> Node<'f>; -} - -pub fn child_of_kind<'f>(node: Node<'f>, kind: NodeKind) -> Option> { - node.children().find(|child| child.kind() == kind) -} - -pub fn ast_children<'f, A: AstNode<'f>>(node: Node<'f>) -> Box + 'f> { - Box::new(node.children().filter_map(A::new)) -} - -// AST elements, specific to Rust - -#[derive(Clone, Copy)] -pub struct StructDef<'f>(Node<'f>); - -#[derive(Clone, Copy)] -pub struct FieldDef<'f>(Node<'f>); - -#[derive(Clone, Copy)] -pub struct TypeRef<'f>(Node<'f>); - -pub trait NameOwner<'f>: AstNode<'f> { - fn name_ident(&self) -> Node<'f> { - child_of_kind(self.node(), IDENT).unwrap() - } - - fn name(&self) -> &'f str { self.name_ident().text() } -} - - -impl<'f> AstNode<'f> for StructDef<'f> { - fn new(node: Node<'f>) -> Option { - if node.kind() == STRUCT_DEF { Some(StructDef(node)) } else { None } - } - fn node(&self) -> Node<'f> { self.0 } -} - -impl<'f> NameOwner<'f> for StructDef<'f> {} - -impl<'f> StructDef<'f> { - pub fn fields(&self) -> Box> + 'f> { - ast_children(self.node()) - } -} - - -impl<'f> AstNode<'f> for FieldDef<'f> { - fn new(node: Node<'f>) -> Option { - if node.kind() == FIELD_DEF { Some(FieldDef(node)) } else { None } - } - fn node(&self) -> Node<'f> { self.0 } -} - -impl<'f> FieldDef<'f> { - pub fn type_ref(&self) -> Option> { - ast_children(self.node()).next() - } -} - -impl<'f> NameOwner<'f> for FieldDef<'f> {} - - -impl<'f> AstNode<'f> for TypeRef<'f> { - fn new(node: Node<'f>) -> Option { - if node.kind() == TYPE_REF { Some(TypeRef(node)) } else { None } - } - fn node(&self) -> Node<'f> { self.0 } -} -``` - -Note that although AST wrappers provide a type-safe access to the -tree, they are still represented as indexes, so clients of the syntax -tree can easily associated additional data with AST nodes by storing -it in a side-table. - - -## Missing Source Code - -The crucial feature of this syntax tree is that it is just a view into -the original source code. And this poses a problem for the Rust -language, because not all compiled Rust code is represented in the -form of source code! Specifically, Rust has a powerful macro system, -which effectively allows to create and parse additional source code at -compile time. It is not entirely clear that the proposed parsing -framework is able to handle this use case, and it's the main purpose -of this RFC to figure it out. The current idea for handling macros is -to make each macro expansion produce a triple of (expansion text, -syntax tree, hygiene information), where hygiene information is a side -table, which colors different ranges of the expansion text according -to the original syntactic context. - - -## Implementation plan - -This RFC proposes huge changes to the internals of the compiler, so -it's important to proceed carefully and incrementally. The following -plan is suggested: - -* RFC discussion about the theoretical feasibility of the proposal, - and the best representation representation for the syntax tree. - -* Implementation of the proposal as a completely separate crates.io - crate, by refactoring existing libsyntax source code to produce a - new tree. - -* A prototype implementation of the macro expansion on top of the new - sytnax tree. - -* Additional round of discussion/RFC about merging with the mainline - compiler. - - -# Drawbacks -[drawbacks]: #drawbacks - -- No harm will be done as long as the new libsyntax exists as an - experiemt on crates.io. However, actually using it in the compiler - and other tools would require massive refactorings. - -- It's difficult to know upfront if the proposed syntax tree would - actually work well in both the compiler and IDE. It may be possible - that some drawbacks will be discovered during implementation. - - -# Rationale and alternatives -[alternatives]: #alternatives - -- Incrementally add more information about source code to the current - AST. - -- Move the current libsyntax to crates.io as is. In the past, there - were several failed attempts to do that. - -- Explore alternative representations for the parse tree. - -- Use parser generator instead of hand written parser. Using the - parser from libsyntax directly would be easier, and hand-written - LL-style parsers usually have much better error recovery than - generated LR-style ones. - -# Unresolved questions -[unresolved]: #unresolved-questions - -- Is it at all possible to represent Rust parser as a pure function of - the source code? It seems like the answer is yes, because the - language and especially macros were cleverly designed with this - use-case in mind. - - -- Is it possible to implement macro expansion using the proposed - framework? This is the main question of this RFC. The proposed - solution of synthesizing source code on the fly seems workable: it's - not that different from the current implementation, which - synthesizes token trees. - - -- How to actually phase out current libsyntax, if libsyntax2.0 turns - out to be a success? diff --git a/validation.md b/validation.md deleted file mode 100644 index 2739bfcdd..000000000 --- a/validation.md +++ /dev/null @@ -1,14 +0,0 @@ -Fixmes: - -Lexer: -* Fix `is_whitespace`, add more tests -* Add more thorough tests for idents for XID_Start & XID_Continue -* Validate that float and integer literals use digits only of the appropriate - base, and are in range -* Validation for unclosed char literal -* Strings are completely wrong: more tests and comparison with libsyntax. -* Comment lexing is completely wrong - -Parser: -* Figure out what is the expected state of attribute grammar. - Token trees or something more structured? Token trees would be unfortunate: no extend selection =/ \ No newline at end of file -- cgit v1.2.3