| author    | Aleksey Kladov <[email protected]> | 2019-01-19 12:51:46 +0000 |
| committer | Aleksey Kladov <[email protected]> | 2019-01-21 08:27:01 +0000 |
| commit    | 7e1d866481aa6f11dbe96c4d47b6b61b276f07b3 (patch) | |
| tree      | 53355cf1d99d2af3555dab3030c5efc88002a052 | |
| parent    | 237bb929f4332021d97b064dd8178518b30e1f32 (diff) | |
add guide
-rw-r--r-- | guide.md | 364 |
1 file changed, 364 insertions, 0 deletions
diff --git a/guide.md b/guide.md
new file mode 100644
index 000000000..beb8294ba
--- /dev/null
+++ b/guide.md
@@ -0,0 +1,364 @@
1 | # Guide to rust-analyzer | ||
2 | |||
3 | ## About the guide | ||
4 | |||
5 | This guide describes the current state of rust-analyzer as of 2019-01-20 | ||
6 | (git tag `guide-2019-01`). Its purpose is to | ||
7 | document various problems and architectural solutions that arise when | ||
8 | building an IDE-first compiler. | ||
9 | |||
10 | ## The big picture | ||
11 | |||
12 | On the highest possible level, rust-analyzer is a stateful component. A client may | ||
13 | apply changes to the analyzer (the new text of the `foo.rs` file is "fn main() {}") | ||
14 | and may ask semantic questions about the current state (what is the | ||
15 | definition of the identifier at offset 92 in file `bar.rs`?). Two important | ||
16 | properties hold: | ||
17 | |||
18 | * The analyzer does not do any IO. It starts in an empty state and all input data is | ||
19 | provided via the `apply_change` API. | ||
20 | |||
21 | * Only queries about the current state are supported. One can, of course, | ||
22 | simulate undo and redo by keeping a log of changes and inverse changes. | ||
23 | |||
24 | ## IDE API | ||
25 | |||
26 | To see the big picture, let's take a look at the [`AnalysisHost`] and | ||
27 | [`Analysis`] pair of types. `AnalysisHost` has three methods: | ||
28 | |||
29 | * `default` for creating an empty analysis | ||
30 | * `apply_change(&mut self)` to make changes (this is how you get from an empty | ||
31 | state to something interesting) | ||
32 | * `analysis(&self)` to get an instance of `Analysis` | ||
33 | |||
34 | `Analysis` has a ton of methods for IDEs, like `goto_definition`, or | ||
35 | `completions`. Both inputs and outputs of `Analysis`' methods are formulated in | ||
36 | terms of files and offsets, and **not** in terms of Rust concepts like structs, | ||
37 | traits, etc. The "typed" API with Rust-specific types is slightly lower in the | ||
38 | stack; we'll talk about it later. | ||
39 | |||
40 | [`AnalysisHost`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/lib.rs#L265-L284 | ||
41 | [`Analysis`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/lib.rs#L291-L478 | ||
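Putting those three methods together, a minimal sketch of the intended usage looks roughly like this. The signatures are simplified for illustration (the exact `add_file` arguments, `SourceRootId` and offset handling are assumptions); the real definitions are behind the links above.

```rust
// A minimal sketch, assuming simplified signatures; see the linked sources
// for the real `AnalysisHost`/`Analysis` definitions.
use ra_ide_api::{AnalysisChange, AnalysisHost, FileId, FilePosition, SourceRootId};

fn main() {
    // `default` gives an empty analysis: no files, no crates.
    let mut host = AnalysisHost::default();

    // Build a change "transaction" (see the Inputs section below).
    let mut change = AnalysisChange::new();
    change.add_file(
        SourceRootId(0),               // which source root the file belongs to
        FileId(1),                     // integer id picked by the client
        "main.rs".into(),              // path relative to the source root
        "fn main() { foo(); }".into(), // current text of the file
    );
    host.apply_change(change);

    // Fork a consistent snapshot and ask an offset-based question.
    let analysis = host.analysis();
    let _nav = analysis.goto_definition(FilePosition {
        file_id: FileId(1),
        offset: 12.into(), // byte offset of `foo` in the text above
    });
}
```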
42 | |||
43 | The reason for the `Analysis` and `AnalysisHost` separation is that we want to apply | ||
44 | changes "uniquely", but we might want to fork an `Analysis` and send it to | ||
45 | another thread for background processing. That is, there is only a single | ||
46 | `AnalysisHost`, but there may be several (equivalent) `Analysis` instances. | ||
47 | |||
48 | Note that all of the `Analysis` API methods return `Cancelable<T>`. This is required | ||
49 | to be responsive in an IDE setting. Sometimes a long-running query is being computed | ||
50 | and the user types something in the editor and asks for completion. In this | ||
51 | case, we cancel the long-running computation (so it returns `Err(Canceled)`), | ||
52 | apply the change and execute the request for completion. We never use stale data to | ||
53 | answer requests. Under the hood, `AnalysisHost` "remembers" all outstanding | ||
54 | `Analysis` instances. The `AnalysisHost::apply_change` method cancels all | ||
55 | `Analysis`es, blocks until all of them are dropped and then applies the change | ||
56 | in-place. This is the read-write lock style of interior mutability that is | ||
57 | familiar to Rustaceans. | ||
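Schematically, the pattern looks like this. This is a minimal sketch with illustrative names, not the actual rust-analyzer internals; the cancellation signalling itself is elided.

```rust
// Sketch of the pattern described above: snapshots hold read guards,
// `apply_change` takes the write lock and therefore waits for all
// outstanding snapshots to be dropped.
use std::sync::{Arc, RwLock};

#[derive(Debug)]
struct Canceled;
type Cancelable<T> = Result<T, Canceled>;

#[derive(Default)]
struct State; // all inputs + memoized analysis results

struct Host {
    state: Arc<RwLock<State>>,
}

struct Snapshot {
    state: Arc<RwLock<State>>,
}

impl Host {
    fn analysis(&self) -> Snapshot {
        Snapshot { state: Arc::clone(&self.state) }
    }

    fn apply_change(&mut self) {
        // The real code first signals cancellation, so in-flight queries
        // return `Err(Canceled)` quickly; then it blocks here until every
        // read guard (i.e. every outstanding snapshot's query) is released.
        let _guard = self.state.write().unwrap();
        // ... mutate inputs in place ...
    }
}

impl Snapshot {
    fn some_query(&self) -> Cancelable<usize> {
        let _guard = self.state.read().unwrap();
        // ... long computation that periodically checks for cancellation ...
        Ok(92)
    }
}
```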
58 | |||
59 | Next, let's talk about what exactly the inputs to the `Analysis` are. | ||
60 | |||
61 | ## Inputs | ||
62 | |||
63 | rust-analyzer never does any IO itself; all inputs get passed explicitly via | ||
64 | the `AnalysisHost::apply_change` method, which accepts a single argument: | ||
65 | `AnalysisChange`. [`AnalysisChange`] is a builder for a single change | ||
66 | "transaction", so it suffices to study its methods to understand all of the | ||
67 | input data. | ||
68 | |||
69 | [`AnalysisChange`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/lib.rs#L119-L167 | ||
70 | |||
71 | The `(add|change|remove)_file` methods control the set of input files, where | ||
72 | each file has an integer id (`FileId`, picked by the client), text (`String`) | ||
73 | and a filesystem path. Paths are tricky; they'll be explained in the source roots | ||
74 | section, together with the `add_root` method. The `add_library` method allows adding a | ||
75 | group of files which are assumed to rarely change. It's mostly an optimization | ||
76 | and does not change the fundamental picture. | ||
77 | |||
78 | The `set_crate_graph` method controls how the input files are partitioned | ||
79 | into compilation units -- crates. It also controls (in theory, not implemented | ||
80 | yet) `cfg` flags. `CrateGraph` is a directed acyclic graph of crates. Each crate | ||
81 | has a root `FileId`, a set of active `cfg` flags and a set of dependencies. Each | ||
82 | dependency is a pair of a crate and a name. It is possible to have two crates | ||
83 | with the same root `FileId` but different `cfg`-flags/dependencies. This model | ||
84 | is lower-level than Cargo's model of packages: each Cargo package consists of several | ||
85 | targets, each of which is a separate crate (or several crates, if you try | ||
86 | different feature combinations). | ||
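For example, a crate graph with a binary target depending on a library target could be assembled roughly as follows. This is a sketch: the method names follow the description above and are assumptions, not the verbatim `CrateGraph` API.

```rust
// Sketch of assembling a CrateGraph; method names are assumptions based on
// the description above rather than the exact ra_db API.
use ra_db::{CrateGraph, FileId};

fn example_graph() -> CrateGraph {
    let mut graph = CrateGraph::default();

    // Each crate is identified by its root file, e.g. src/lib.rs and src/main.rs.
    let lib = graph.add_crate_root(FileId(1));
    let bin = graph.add_crate_root(FileId(2));

    // The binary depends on the library under the name `my_lib`.
    graph.add_dep(bin, "my_lib".into(), lib);

    // Adding FileId(1) again with different cfg flags would yield a distinct
    // crate, once cfg support is implemented.
    graph
}
```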
87 | |||
88 | Procedural macros should become inputs as well, but currently they are not | ||
89 | supported. A procedural macro will be a black-box `Box<dyn Fn(TokenStream) -> TokenStream>` | ||
90 | function, and will be inserted into the crate graph just like dependencies. | ||
91 | |||
92 | Soon we'll talk about how we build an LSP server on top of `Analysis`, but first, | ||
93 | let's deal with that paths issue. | ||
94 | |||
95 | |||
96 | ## Source roots (aka filesystems are horrible) | ||
97 | |||
98 | This is a non-essential section, feel free to skip. | ||
99 | |||
100 | The previous section said that the file system path is an attribute of a file, | ||
101 | but this is not the whole truth. Making it an absolute `PathBuf` would be bad for | ||
102 | several reasons. First, file systems are full of (platform-dependent) edge cases: | ||
103 | |||
104 | * it's hard (requires a syscall) to decide if two paths are equivalent | ||
105 | * some file systems are case-sensitive and some are not | ||
106 | * paths are not necessarily UTF-8 | ||
107 | * symlinks can form cycles | ||
108 | |||
109 | Second, this might hurt reproducibility and hermeticity of builds. In theory, | ||
110 | moving a project from `/foo/bar/my-project` to `/spam/eggs/my-project` should | ||
111 | not change a bit in the output. However, if the absolute path is a part of the | ||
112 | input, it is, at least in theory, observable, and *could* affect the output. | ||
113 | |||
114 | Yet another problem is that we really, really want to avoid doing IO, but with | ||
115 | Rust the set of "input" files is not necessarily known up-front. In theory, you | ||
116 | can have `#[path="/dev/random"] mod foo;`. | ||
117 | |||
118 | To solve (or explicitly refuse to solve) these problems, rust-analyzer uses the | ||
119 | concept of a source root. Roughly speaking, a source root is the contents of a | ||
120 | directory on a file system, like `/home/matklad/projects/rustraytracer/**.rs`. | ||
121 | |||
122 | More precisely, all files (`FileId`s) are partitioned into disjoint | ||
123 | `SourceRoot`s. Each file has a relative UTF-8 path within its `SourceRoot`. | ||
124 | Each `SourceRoot` has an identity (an integer id). Crucially, the root path of the | ||
125 | source root itself is unknown to the analyzer: the client is supposed to maintain a | ||
126 | mapping between `SourceRoot` ids (which are assigned by the client) and actual | ||
127 | `PathBuf`s. `SourceRoot`s give a sane tree model of the file system to the | ||
128 | analyzer. | ||
129 | |||
130 | Note that `mod`, `#[path]` and `include!()` can only reference files from the | ||
131 | same source root. It is of course possible to explicitly add extra files to | ||
132 | the source root, even `/dev/random`. | ||
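Schematically, the split between what the analyzer sees and what the client keeps looks like this (simplified stand-in types, not the real `SourceRoot` definition):

```rust
// Simplified stand-ins illustrating the partitioning; not the real types.
use std::collections::HashMap;
use std::path::PathBuf;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct FileId(u32);
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct SourceRootId(u32);

// What the analyzer sees: files grouped by root, addressed by relative
// UTF-8 paths such as "src/main.rs".
struct SourceRoot {
    files: HashMap<String, FileId>,
}

// What only the client knows: where each source root lives on disk.
struct ClientPaths {
    roots: HashMap<SourceRootId, PathBuf>,
}
```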
133 | |||
134 | ## Language Server Protocol | ||
135 | |||
136 | Now let's see how the `Analysis` API is exposed via the JSON-RPC based LSP protocol. The | ||
137 | hard part here is managing changes (which can come either from the file system | ||
138 | or from the editor) and concurrency (we want to spawn background jobs for things | ||
139 | like syntax highlighting). We use the event loop pattern to manage the zoo, and | ||
140 | the loop is the [`main_loop_inner`] function. The [`main_loop`] does one-time | ||
141 | initialization and tear-down of the resources. | ||
142 | |||
143 | [`main_loop`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L51-L110 | ||
144 | [`main_loop_inner`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L156-L258 | ||
145 | |||
146 | |||
147 | Let's walk through a typical analyzer session! | ||
148 | |||
149 | First, we need to figure out what to analyze. To do this, we run `cargo | ||
150 | metadata` to learn about the Cargo packages of the current workspace and its dependencies, | ||
151 | and we run `rustc --print sysroot` and scan the sysroot to learn about crates like | ||
152 | `std`. Currently we load this configuration once at the start of the server, but | ||
153 | it should be possible to dynamically reconfigure it later without a restart. | ||
154 | |||
155 | [main_loop.rs#L62-L70](https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L62-L70) | ||
156 | |||
157 | The [`ProjectModel`] we get after this step is very Cargo- and sysroot-specific; | ||
158 | it needs to be lowered to get the input in the form of an `AnalysisChange`. This | ||
159 | happens in the [`ServerWorldState::new`] method. Specifically, it will: | ||
160 | |||
161 | * Create a `SourceRoot` for each Cargo package and sysroot. | ||
162 | * Schedule a file system scan of the roots. | ||
163 | * Create an analyzer's `Crate` for each Cargo **target** and sysroot crate. | ||
164 | * Set up dependencies between the crates. | ||
165 | |||
166 | [`ProjectModel`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/project_model.rs#L16-L20 | ||
167 | [`ServerWorldState::new`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/server_world.rs#L38-L160 | ||
168 | |||
169 | The results of the scan (which may take a while) will be processed in the body | ||
170 | of the main loop, just like any other change. Here's where we handle: | ||
171 | |||
172 | * [File system changes](https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L194) | ||
173 | * [Changes from the editor](https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L377) | ||
174 | |||
175 | After a single turn of the loop, we group the changes into one `AnalysisChange` and | ||
176 | [apply] it. This always happens on the main thread and blocks the loop. | ||
177 | |||
178 | [apply]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/server_world.rs#L216 | ||
179 | |||
180 | To handle requests, like ["goto definition"], we create an instance of the | ||
181 | `Analysis` and [`schedule`] the task (which consumes the `Analysis`) onto the | ||
182 | thread pool. [The task] calls the corresponding `Analysis` method, while | ||
183 | massaging the types into the LSP representation. Keep in mind that if we are | ||
184 | executing "goto definition" on the threadpool and a new change comes in, the | ||
185 | task will be canceled as soon as the main loop calls `apply_change` on the | ||
186 | `AnalysisHost`. | ||
187 | |||
188 | ["goto definition"]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/server_world.rs#L216 | ||
189 | [`schedule`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L426-L455 | ||
190 | [The task]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop/handlers.rs#L205-L223 | ||
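In heavily simplified form, the loop selects over a few channels and dispatches on the result. The real [`main_loop_inner`] handles many more cases; the message types, channel names and helper functions below are invented for illustration.

```rust
// Heavily simplified sketch of the event-loop pattern.
use crossbeam_channel::{select, Receiver};

struct VfsChange;     // a batch of file-system changes
struct EditorMessage; // an LSP request or notification
struct TaskResult;    // the output of a finished background task

fn main_loop_inner(
    fs_events: Receiver<VfsChange>,
    editor_events: Receiver<EditorMessage>,
    task_results: Receiver<TaskResult>,
) {
    loop {
        select! {
            // Fold file-system changes into an AnalysisChange and apply it on
            // this (main) thread; this cancels in-flight background tasks.
            recv(fs_events) -> change => apply_fs_change(change.unwrap()),
            // Editor changes are applied the same way; requests are scheduled
            // onto the thread pool as background tasks.
            recv(editor_events) -> msg => handle_editor_message(msg.unwrap()),
            // Results of finished tasks are sent back to the editor.
            recv(task_results) -> result => send_response(result.unwrap()),
        }
    }
}

fn apply_fs_change(_: VfsChange) {}
fn handle_editor_message(_: EditorMessage) {}
fn send_response(_: TaskResult) {}
```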
191 | |||
192 | This concludes the overview of the analyzer's programming *interface*. Next, let's | ||
193 | dig into the implementation! | ||
194 | |||
195 | ## Salsa | ||
196 | |||
197 | The most straightforward way to implement an "apply change, get analysis, repeat" | ||
198 | API would be to maintain the input state and to compute all possible analysis | ||
199 | information from scratch after every change. This works, but scales poorly with | ||
200 | the size of the project. To make this fast, we need to take advantage of the | ||
201 | fact that most of the changes are small, and that analysis results are unlikely | ||
202 | to change significantly between invocations. | ||
203 | |||
204 | To do this we use [salsa]: a framework for incremental on-demand computation. | ||
205 | You can skip the rest of the section if you are familiar with rustc's red-green | ||
206 | algorithm. | ||
207 | |||
208 | [salsa]: https://github.com/salsa-rs/salsa | ||
209 | |||
210 | It's better to refer to salsa's docs to learn about it. Here's a small excerpt: | ||
211 | |||
212 | The key idea of salsa is that you define your program as a set of queries. Every | ||
213 | query is like a function `K -> V` that maps from some key of type `K` to a value | ||
214 | of type `V`. Queries come in two basic varieties: | ||
215 | |||
216 | * **Inputs**: the base inputs to your system. You can change these whenever you | ||
217 | like. | ||
218 | |||
219 | * **Functions**: pure functions (no side effects) that transform your inputs | ||
220 | into other values. The results of queries are memoized to avoid recomputing | ||
221 | them a lot. When you make changes to the inputs, we'll figure out (fairly | ||
222 | intelligently) when we can re-use these memoized values and when we have to | ||
223 | recompute them. | ||
224 | |||
225 | |||
226 | For further discussion, it's important to understand one bit of "fairly | ||
227 | intelligently". Suppose we have two functions, `f1` and `f2`, and one input, `i`. | ||
228 | We call `f1(X)` which in turn calls `f2(Y)` which inspects `i(Z)`. `i(Z)` | ||
229 | returns some value `V1`, `f2` uses that and returns `R1`, `f1` uses that and | ||
230 | returns `O`. Now, let's change `i` at `Z` from `V1` to `V2` and try to compute | ||
231 | `f1(X)` again. Because `f1(X)` (transitively) depends on `i(Z)`, we can't just | ||
232 | reuse its value as is. However, if `f2(Y)` is *still* equal to `R1` (despite the | ||
233 | `i`'s change), we, in fact, *can* reuse `O` as result of `f1(X)`. And that's how | ||
234 | salsa works: it recomputes results in *reverse* order, starting from inputs and | ||
235 | progressing towards outputs, stopping as soon as it sees an intermediate value | ||
236 | that hasn't changed. | ||
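Here is the same example spelled out as ordinary Rust, with the caching decisions written as comments. Salsa does this bookkeeping automatically; the code below only names the pieces.

```rust
// Plain-Rust illustration of the f1/f2/i example above.
use std::collections::HashMap;

type Key = u32;
type Value = u64;

struct Inputs {
    i: HashMap<Key, Value>, // the input query `i`
}

fn f2(inputs: &Inputs, y: Key) -> Value {
    let v = inputs.i[&y]; // f2(Y) inspects i(Z) ...
    v / 2                 // ... and derives its result from it
}

fn f1(inputs: &Inputs, x: Key) -> Value {
    f2(inputs, x) + 1 // f1(X) depends on i(Z) only through f2(Y)
}

fn main() {
    let mut inputs = Inputs { i: HashMap::from([(92, 10)]) };
    let o = f1(&inputs, 92); // memoize: f2(92) = 5, f1(92) = 6

    // Change the input: V1 = 10 becomes V2 = 11.
    inputs.i.insert(92, 11);

    // Salsa re-runs f2(92) first (it is closest to the input). 11 / 2 is
    // still 5, so the memoized f1(92) = 6 is reused without re-running f1.
    assert_eq!(f2(&inputs, 92), 5);
    assert_eq!(f1(&inputs, 92), o);
}
```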
237 | |||
238 | ## Salsa Input Queries | ||
239 | |||
240 | All analyzer information is stored in a salsa database. The `Analysis` and | ||
241 | `AnalysisHost` types are newtype wrappers around [`RootDatabase`] -- a salsa | ||
242 | database. | ||
243 | |||
244 | [`RootDatabase`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/db.rs#L88-L134 | ||
245 | |||
246 | Salsa input queries are defined in [`FilesDatabase`] (which is a part of | ||
247 | `RootDatabase`). They closely mirror the familiar `AnalysisChange` structure: | ||
248 | indeed, what `apply_change` does is set the values of the input queries. | ||
249 | |||
250 | [`FilesDatabase`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_db/src/input.rs#L150-L174 | ||
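The correspondence is easy to picture with stand-in types (a simplified illustration, not the actual `RootDatabase`/`FilesDatabase` definitions):

```rust
// Self-contained illustration of "input queries mirror AnalysisChange":
// the types below are stand-ins, not the real database types.
use std::collections::HashMap;
use std::sync::Arc;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct FileId(u32);

#[derive(Default)]
struct FakeInputs {
    file_text: HashMap<FileId, Arc<String>>, // input query: FileId -> text
}

#[derive(Default)]
struct FakeChange {
    files_changed: Vec<(FileId, Arc<String>)>,
}

// What `apply_change` morally does: copy the change into the input queries.
fn apply_change(db: &mut FakeInputs, change: FakeChange) {
    for (file_id, text) in change.files_changed {
        db.file_text.insert(file_id, text);
    }
}
```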
251 | |||
252 | ## From text to semantic model | ||
253 | |||
254 | The bulk of rust-analyzer is transforming input text into a semantic model of | ||
255 | Rust code: a web of entities like modules, structs, functions and traits. | ||
256 | |||
257 | An important fact to realize is that (unlike in most other languages, such as C# or | ||
258 | Java) there isn't a one-to-one mapping between source code and the semantic model. A | ||
259 | single function definition in the source code might result in several semantic | ||
260 | functions: for example, the same source file might be included as a module into | ||
261 | several crates, or a single "crate" might be present in the compilation DAG | ||
262 | several times, with different sets of `cfg`s enabled. | ||
263 | |||
264 | The semantic interface is declared in the [`code_model_api`] module. Each entity is | ||
265 | identified by an integer id and has a bunch of methods which take a salsa database | ||
266 | as an argument and return other entities (which are also ids). Internally, these | ||
267 | methods invoke various queries on the database to build the model on demand. | ||
268 | Here's [the list of queries]. | ||
269 | |||
270 | [`code_model_api`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/code_model_api.rs | ||
271 | [the list of queries]: https://github.com/rust-analyzer/rust-analyzer/blob/7e84440e25e19529e4ff8a66e521d1b06349c6ec/crates/ra_hir/src/db.rs#L20-L106 | ||
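To make "an entity is an id plus database-taking methods" concrete, here is the rough shape of such a type (a simplified stand-in, not the actual `code_model_api` code):

```rust
// Simplified stand-in for a code_model_api-style entity: a `Function` is just
// an id; everything interesting is computed by querying the salsa database.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct FunctionId(u32);

#[derive(Clone, Copy)]
struct Function {
    id: FunctionId,
}

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ModuleId(u32);

#[derive(Clone, Copy)]
struct Module {
    id: ModuleId,
}

// Stand-in for the salsa database trait with the relevant query.
trait HirDatabase {
    fn function_module(&self, f: FunctionId) -> ModuleId;
}

impl Function {
    // Methods take the database and return other entities (also ids).
    fn module(self, db: &impl HirDatabase) -> Module {
        Module { id: db.function_module(self.id) }
    }
}
```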
272 | |||
273 | The first step of building the model is parsing the source code. | ||
274 | |||
275 | ## Syntax trees | ||
276 | |||
277 | An important property of the Rust language is that each file can be parsed in | ||
278 | isolation. Unlike in, say, `C++`, an `#include` can't change the meaning of the | ||
279 | surrounding syntax. For this reason, rust-analyzer can build a syntax tree for each "source | ||
280 | file", which can then be reused by several semantic models if this file | ||
281 | happens to be a part of several crates. | ||
282 | |||
283 | rust-analyzer uses a representation of syntax trees similar to that of `Roslyn` | ||
284 | and Swift's new | ||
285 | [libsyntax](https://github.com/apple/swift/tree/5e2c815edfd758f9b1309ce07bfc01c4bc20ec23/lib/Syntax). | ||
286 | Swift's docs give an excellent overview of the approach, so I'll skip that part | ||
287 | here and instead outline the main characteristics of the syntax trees: | ||
288 | |||
289 | * Syntax trees are fully lossless. Converting **any** text to a syntax tree and | ||
290 | back is a total identity function. All whitespace and comments are explicitly | ||
291 | represented in the tree. | ||
292 | |||
293 | * Syntax nodes have generic `(next|previous)_sibling`, `parent`, | ||
294 | `(first|last)_child` functions. You can get from any one node to any other | ||
295 | node in the file using only these functions. | ||
296 | |||
297 | * Syntax nodes know their range (start offset and length) in the file. | ||
298 | |||
299 | * Syntax nodes share the ownership of their syntax tree: if you keep a reference | ||
300 | to a single function, the whole enclosing file is alive. | ||
301 | |||
302 | * Syntax trees are immutable and the cost of replacing the subtree is | ||
303 | proportional to the depth of the subtree. Read Swift's docs to learn how | ||
304 | immutable + parent pointers + cheap modification is possible. | ||
305 | |||
306 | * Syntax trees are built on a best-effort basis. All accessor methods return | ||
307 | `Option`s. The tree for `fn foo` will contain a function declaration with | ||
308 | `None` for the parameter list and body. | ||
309 | |||
310 | * Syntax trees do not know the file they are built from; they only know about | ||
311 | the text. | ||
312 | |||
313 | The implementation is based on the generic [rowan] crate on top of which a | ||
314 | [rust-specific] AST is generated. | ||
315 | |||
316 | [rowan]: https://github.com/rust-analyzer/rowan/tree/100a36dc820eb393b74abe0d20ddf99077b61f88 | ||
317 | [rust-specific]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_syntax/src/ast/generated.rs | ||
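A sketch of what these properties look like in code, assuming an `ra_syntax`-style API (`SourceFile::parse`, `syntax()`, `text()`, `descendants()`); exact names and signatures may differ at this commit:

```rust
// Illustrative sketch of the lossless and best-effort properties.
use ra_syntax::SourceFile;

fn demo() {
    // Incomplete code still parses, on a best-effort basis.
    let text = "fn foo  // no params, no body";
    let file = SourceFile::parse(text);

    // Lossless: whitespace and comments are in the tree, so the round-trip
    // reproduces the original text exactly.
    assert_eq!(file.syntax().text().to_string(), text);

    // Generic traversal: every node knows its parent, children and range.
    for node in file.syntax().descendants() {
        let _range = node.range(); // start offset + length within the text
    }
}
```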
318 | |||
319 | The next step in constructing the semantic model is ... | ||
320 | |||
321 | ## Building a Module Tree | ||
322 | |||
323 | The algorithm for building a tree of modules is to start with a crate root | ||
324 | (remember, each `Crate` from a `CrateGraph` has a `FileId`), collect all `mod` | ||
325 | declarations and recursively process child modules. This is handled by the | ||
326 | [`module_tree_query`](https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/module_tree.rs#L116-L123), | ||
327 | with two slight variations. | ||
328 | |||
329 | First, rust-analyzer builds a module tree for all crates in a source root | ||
330 | simultaneously. The main reason for this is historical (`module_tree` predates | ||
331 | `CrateGraph`), but this approach also allows us to account for files which are not | ||
332 | part of any crate. That is, if you create a file but do not include it as a | ||
333 | submodule anywhere, you still get semantic completion, and you get a warning | ||
334 | about the free-floating module (the actual warning is not implemented yet). | ||
335 | |||
336 | The second difference is that `module_tree_query` does not *directly* depend on | ||
337 | the "parse" query (which is confusingly called `source_file`). Why would calling | ||
338 | the parse directly be bad? Suppose the user changes the file slightly, by adding | ||
339 | some insignificant whitespace. Adding whitespace changes the parse tree (because | ||
340 | it includes whitespace), and that means recomputing the whole module tree. | ||
341 | |||
342 | We deal with this problem by introducing an intermediate [`submodules_query`]. | ||
343 | This query processes the syntax tree and extracts the set of declared submodule | ||
344 | names. Now, changing the whitespace results in `submodules_query` being | ||
345 | re-executed for a *single* module, but because the result of this query stays | ||
346 | the same, we don't have to re-execute `module_tree_query`. In fact, we only | ||
347 | need to re-execute it when we add/remove new files or when we change `mod` | ||
348 | declarations. | ||
349 | |||
350 | [`submodules_query`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/module_tree.rs#L41 | ||
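The early-cutoff effect is easy to see with a toy version of the idea (a deliberately simplified sketch: the real query walks the syntax tree rather than scanning lines):

```rust
// Simplified sketch of the submodules idea: reduce a source file to the set of
// declared submodule names, so edits that don't touch `mod` items produce an
// identical result and the module tree is not recomputed.
fn declared_submodules(text: &str) -> Vec<String> {
    let mut names: Vec<String> = text
        .lines()
        .filter_map(|line| {
            let line = line.trim();
            line.strip_prefix("mod ")
                .and_then(|rest| rest.strip_suffix(';'))
                .map(|name| name.trim().to_string())
        })
        .collect();
    names.sort();
    names
}

fn main() {
    let before = "mod foo;\nmod bar;\nfn main() {}\n";
    let after = "mod foo;\nmod bar;\n\n\nfn  main( ) {}\n"; // whitespace-only edit
    // Same submodule set => module_tree_query sees an unchanged input.
    assert_eq!(declared_submodules(before), declared_submodules(after));
}
```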
351 | |||
352 | |||
353 | |||
354 | |||
355 | |||
356 | ## Location Interner pattern | ||
357 | |||
358 | ## Macros and recursive locations | ||
359 | |||
360 | ## Name resolution | ||
361 | |||
362 | ## Source Map pattern | ||
363 | |||
364 | ## Tying it all together: completion | ||