# Guide to rust-analyzer

## About the guide

This guide describes the current state of rust-analyzer as of 2019-01-20 (git
tag [guide-2019-01]). Its purpose is to document various problems and
architectural solutions related to the problem of building an IDE-first compiler
for Rust. There is a video version of this guide as well:
https://youtu.be/ANKBNiSWyfc.

[guide-2019-01]: https://github.com/rust-analyzer/rust-analyzer/tree/guide-2019-01

## The big picture

On the highest possible level, rust-analyzer is a stateful component. A client may
apply changes to the analyzer (the new contents of the `foo.rs` file is "fn main() {}")
and may ask semantic questions about the current state (what is the
definition of the identifier with offset 92 in file `bar.rs`?). Two important
properties hold:

* The analyzer does not do any I/O. It starts in an empty state and all input data is
  provided via the `apply_change` API.

* Only queries about the current state are supported. One can, of course,
  simulate undo and redo by keeping a log of changes and inverse changes respectively.

## IDE API

To see the bigger picture of how the IDE features work, let's take a look at the [`AnalysisHost`] and
[`Analysis`] pair of types. `AnalysisHost` has three methods:

* `default()` for creating an empty analysis instance
* `apply_change(&mut self)` to make changes (this is how you get from an empty
  state to something interesting)
* `analysis(&self)` to get an instance of `Analysis`

`Analysis` has a ton of methods for IDEs, like `goto_definition` or
`completions`. Both inputs and outputs of `Analysis`' methods are formulated in
terms of files and offsets, and **not** in terms of Rust concepts like structs,
traits, etc. The "typed" API with Rust-specific types sits slightly lower in the
stack; we'll talk about it later.

[`AnalysisHost`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/lib.rs#L265-L284
[`Analysis`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/lib.rs#L291-L478

The reason for this separation of `Analysis` and `AnalysisHost` is that we want to apply
changes "uniquely", but we might also want to fork an `Analysis` and send it to
another thread for background processing. That is, there is only a single
`AnalysisHost`, but there may be several (equivalent) `Analysis` instances.

Note that all of the `Analysis` API methods return `Cancelable<T>`. This is required to
be responsive in an IDE setting. Sometimes a long-running query is being computed
and the user types something in the editor and asks for completion. In this
case, we cancel the long-running computation (so it returns `Err(Canceled)`),
apply the change and execute the request for completion. We never use stale data to
answer requests. Under the covers, `AnalysisHost` "remembers" all outstanding
`Analysis` instances. The `AnalysisHost::apply_change` method cancels all
`Analysis`es, blocks until all of them are dropped and then applies the changes
in-place. This may be familiar to Rustaceans who use read-write locks for interior
mutability.
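
Here is a minimal sketch of that lifecycle, assuming the `ra_ide_api` crate at
this tag and an `AnalysisChange::new()` constructor (signatures are
approximated from the description above, not copied from the source):

```rust
use ra_ide_api::{Analysis, AnalysisChange, AnalysisHost};

fn main() {
    // Start in an empty state...
    let mut host = AnalysisHost::default();

    // ...feed it some input (the "Inputs" section below covers what goes in here)...
    let change = AnalysisChange::new();
    host.apply_change(change);

    // ...and take a snapshot that can answer queries, possibly on another thread.
    let analysis: Analysis = host.analysis();
    // Every `analysis` method returns Cancelable<T>: a pending `apply_change`
    // on the host cancels outstanding queries on this snapshot.
    let _ = analysis;
}
```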
61 | |||
62 | Next, let's talk about what the inputs to the `Analysis` are, precisely. | ||
63 | |||
64 | ## Inputs | ||
65 | |||
66 | Rust Analyzer never does any I/O itself, all inputs get passed explicitly via | ||
67 | the `AnalysisHost::apply_change` method, which accepts a single argument, a | ||
68 | `AnalysisChange`. [`AnalysisChange`] is a builder for a single change | ||
69 | "transaction", so it suffices to study its methods to understand all of the | ||
70 | input data. | ||
71 | |||
72 | [`AnalysisChange`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/lib.rs#L119-L167 | ||
73 | |||
74 | The `(add|change|remove)_file` methods control the set of the input files, where | ||
75 | each file has an integer id (`FileId`, picked by the client), text (`String`) | ||
76 | and a filesystem path. Paths are tricky; they'll be explained below, in source roots | ||
77 | section, together with the `add_root` method. The `add_library` method allows us to add a | ||
78 | group of files which are assumed to rarely change. It's mostly an optimization | ||
79 | and does not change the fundamental picture. | ||
80 | |||
81 | The `set_crate_graph` method allows us to control how the input files are partitioned | ||
82 | into compilation unites -- crates. It also controls (in theory, not implemented | ||
83 | yet) `cfg` flags. `CrateGraph` is a directed acyclic graph of crates. Each crate | ||
84 | has a root `FileId`, a set of active `cfg` flags and a set of dependencies. Each | ||
85 | dependency is a pair of a crate and a name. It is possible to have two crates | ||
86 | with the same root `FileId` but different `cfg`-flags/dependencies. This model | ||
87 | is lower than Cargo's model of packages: each Cargo package consists of several | ||
88 | targets, each of which is a separate crate (or several crates, if you try | ||
89 | different feature combinations). | ||
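
As a toy model (hand-rolled types here, not the real `ra_db` definitions), the
shape of the graph looks roughly like this; note that nothing stops two crates
from sharing a root file:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct FileId(u32);
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct CrateId(u32);

struct CrateData {
    root_file: FileId,            // the file containing the crate root module
    cfgs: Vec<String>,            // active cfg flags
    deps: Vec<(String, CrateId)>, // (extern crate name, crate) pairs
}

struct CrateGraph {
    crates: HashMap<CrateId, CrateData>,
}

fn main() {
    let mut crates = HashMap::new();
    // The same root file compiled with different cfgs: two distinct crates.
    crates.insert(CrateId(0), CrateData { root_file: FileId(0), cfgs: vec![], deps: vec![] });
    crates.insert(CrateId(1), CrateData {
        root_file: FileId(0),
        cfgs: vec!["test".to_string()],
        deps: vec![("std".to_string(), CrateId(2))],
    });
    crates.insert(CrateId(2), CrateData { root_file: FileId(1), cfgs: vec![], deps: vec![] });
    let graph = CrateGraph { crates };
    assert_eq!(graph.crates[&CrateId(0)].root_file, graph.crates[&CrateId(1)].root_file);
}
```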
90 | |||
91 | Procedural macros should become inputs as well, but currently they are not | ||
92 | supported. Procedural macro will be a black box `Box<dyn Fn(TokenStream) -> TokenStream>` | ||
93 | function, and will be inserted into the crate graph just like dependencies. | ||
94 | |||
95 | Soon we'll talk how we build an LSP server on top of `Analysis`, but first, | ||
96 | let's deal with that paths issue. | ||
97 | |||
98 | |||
99 | ## Source roots (a.k.a. "Filesystems are horrible") | ||
100 | |||
101 | This is a non-essential section, feel free to skip. | ||
102 | |||
103 | The previous section said that the filesystem path is an attribute of a file, | ||
104 | but this is not the whole truth. Making it an absolute `PathBuf` will be bad for | ||
105 | several reasons. First, filesystems are full of (platform-dependent) edge cases: | ||
106 | |||
107 | * it's hard (requires a syscall) to decide if two paths are equivalent | ||
108 | * some filesystems are case-sensitive (e.g. on macOS) | ||
109 | * paths are not necessary UTF-8 | ||
110 | * symlinks can form cycles | ||
111 | |||
112 | Second, this might hurt reproducibility and hermeticity of builds. In theory, | ||
113 | moving a project from `/foo/bar/my-project` to `/spam/eggs/my-project` should | ||
114 | not change a bit in the output. However, if the absolute path is a part of the | ||
115 | input, it is at least in theory observable, and *could* affect the output. | ||
116 | |||
117 | Yet another problem is that we really *really* want to avoid doing I/O, but with | ||
118 | Rust the set of "input" files is not necessary known up-front. In theory, you | ||
119 | can have `#[path="/dev/random"] mod foo;`. | ||
120 | |||
121 | To solve (or explicitly refuse to solve) these problems rust-analyzer uses the | ||
122 | concept of a "source root". Roughly speaking, source roots are the contents of a | ||
123 | directory on a file systems, like `/home/matklad/projects/rustraytracer/**.rs`. | ||
124 | |||
125 | More precisely, all files (`FileId`s) are partitioned into disjoint | ||
126 | `SourceRoot`s. Each file has a relative UTF-8 path within the `SourceRoot`. | ||
127 | `SourceRoot` has an identity (integer ID). Crucially, the root path of the | ||
128 | source root itself is unknown to the analyzer: A client is supposed to maintain a | ||
129 | mapping between `SourceRoot` IDs (which are assigned by the client) and actual | ||
130 | `PathBuf`s. `SourceRoot`s give a sane tree model of the file system to the | ||
131 | analyzer. | ||
132 | |||
133 | Note that `mod`, `#[path]` and `include!()` can only reference files from the | ||
134 | same source root. It is of course is possible to explicitly add extra files to | ||
135 | the source root, even `/dev/random`. | ||
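
Here is a toy model of the partition (hand-rolled types; the real ones live in
`ra_db`): the analyzer sees only root-relative paths, while the client keeps
the absolute ones to itself:

```rust
use std::collections::HashMap;
use std::path::PathBuf;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct FileId(u32);
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct SourceRootId(u32);

/// What the analyzer sees: files keyed by *relative*, always-UTF-8 paths.
struct SourceRoot {
    files: HashMap<String, FileId>,
}

/// What the client keeps to itself: where each root actually lives.
struct ClientSideRoots {
    paths: HashMap<SourceRootId, PathBuf>,
}

fn main() {
    let mut root = SourceRoot { files: HashMap::new() };
    root.files.insert("src/main.rs".to_string(), FileId(0));

    let mut client = ClientSideRoots { paths: HashMap::new() };
    client.paths.insert(SourceRootId(0), PathBuf::from("/home/matklad/projects/rustraytracer"));

    // Moving the project only changes the client-side mapping;
    // the analyzer's inputs (and thus its outputs) are unaffected.
    client.paths.insert(SourceRootId(0), PathBuf::from("/spam/eggs/rustraytracer"));
    assert_eq!(root.files["src/main.rs"], FileId(0));
}
```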
136 | |||
137 | ## Language Server Protocol | ||
138 | |||
139 | Now let's see how the `Analysis` API is exposed via the JSON RPC based language server protocol. The | ||
140 | hard part here is managing changes (which can come either from the file system | ||
141 | or from the editor) and concurrency (we want to spawn background jobs for things | ||
142 | like syntax highlighting). We use the event loop pattern to manage the zoo, and | ||
143 | the loop is the [`main_loop_inner`] function. The [`main_loop`] does a one-time | ||
144 | initialization and tearing down of the resources. | ||
145 | |||
146 | [`main_loop`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L51-L110 | ||
147 | [`main_loop_inner`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L156-L258 | ||
148 | |||
149 | |||
150 | Let's walk through a typical analyzer session! | ||
151 | |||
152 | First, we need to figure out what to analyze. To do this, we run `cargo | ||
153 | metadata` to learn about Cargo packages for current workspace and dependencies, | ||
154 | and we run `rustc --print sysroot` and scan the "sysroot" (the directory containing the current Rust toolchain's files) to learn about crates like | ||
155 | `std`. Currently we load this configuration once at the start of the server, but | ||
156 | it should be possible to dynamically reconfigure it later without restart. | ||
157 | |||
158 | [main_loop.rs#L62-L70](https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L62-L70) | ||
159 | |||
The [`ProjectModel`] we get after this step is very Cargo- and sysroot-specific;
it needs to be lowered to get the input in the form of `AnalysisChange`. This
happens in the [`ServerWorldState::new`] method. Specifically:

* Create a `SourceRoot` for each Cargo package and the sysroot.
* Schedule a filesystem scan of the roots.
* Create an analyzer `Crate` for each Cargo **target** and sysroot crate.
* Set up dependencies between the crates.

[`ProjectModel`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/project_model.rs#L16-L20
[`ServerWorldState::new`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/server_world.rs#L38-L160

The results of the scan (which may take a while) will be processed in the body
of the main loop, just like any other change. Here's where we handle:

* [File system changes](https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L194)
* [Changes from the editor](https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L377)

After a single turn of the loop, we group the changes into one `AnalysisChange` and
[apply] it. This always happens on the main thread and blocks the loop.

[apply]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/server_world.rs#L216
182 | |||
183 | To handle requests, like ["goto definition"], we create an instance of the | ||
184 | `Analysis` and [`schedule`] the task (which consumes `Analysis`) on the | ||
185 | threadpool. [The task] calls the corresponding `Analysis` method, while | ||
186 | massaging the types into the LSP representation. Keep in mind that if we are | ||
187 | executing "goto definition" on the threadpool and a new change comes in, the | ||
188 | task will be canceled as soon as the main loop calls `apply_change` on the | ||
189 | `AnalysisHost`. | ||
190 | |||
191 | ["goto definition"]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/server_world.rs#L216 | ||
192 | [`schedule`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L426-L455 | ||
193 | [The task]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop/handlers.rs#L205-L223 | ||
194 | |||
195 | This concludes the overview of the analyzer's programing *interface*. Next, lets | ||
196 | dig into the implementation! | ||
197 | |||
198 | ## Salsa | ||
199 | |||
200 | The most straightforward way to implement an "apply change, get analysis, repeat" | ||
201 | API would be to maintain the input state and to compute all possible analysis | ||
202 | information from scratch after every change. This works, but scales poorly with | ||
203 | the size of the project. To make this fast, we need to take advantage of the | ||
204 | fact that most of the changes are small, and that analysis results are unlikely | ||
205 | to change significantly between invocations. | ||
206 | |||
207 | To do this we use [salsa]: a framework for incremental on-demand computation. | ||
208 | You can skip the rest of the section if you are familiar with rustc's red-green | ||
209 | algorithm (which is used for incremental compilation). | ||
210 | |||
211 | [salsa]: https://github.com/salsa-rs/salsa | ||
212 | |||
213 | It's better to refer to salsa's docs to learn about it. Here's a small excerpt: | ||
214 | |||
215 | The key idea of salsa is that you define your program as a set of queries. Every | ||
216 | query is used like a function `K -> V` that maps from some key of type `K` to a value | ||
217 | of type `V`. Queries come in two basic varieties: | ||
218 | |||
219 | * **Inputs**: the base inputs to your system. You can change these whenever you | ||
220 | like. | ||
221 | |||
222 | * **Functions**: pure functions (no side effects) that transform your inputs | ||
223 | into other values. The results of queries is memoized to avoid recomputing | ||
224 | them a lot. When you make changes to the inputs, we'll figure out (fairly | ||
225 | intelligently) when we can re-use these memoized values and when we have to | ||
226 | recompute them. | ||
227 | |||
228 | |||
229 | For further discussion, its important to understand one bit of "fairly | ||
230 | intelligently". Suppose we have two functions, `f1` and `f2`, and one input, | ||
231 | `z`. We call `f1(X)` which in turn calls `f2(Y)` which inspects `i(Z)`. `i(Z)` | ||
232 | returns some value `V1`, `f2` uses that and returns `R1`, `f1` uses that and | ||
233 | returns `O`. Now, let's change `i` at `Z` to `V2` from `V1` and try to compute | ||
234 | `f1(X)` again. Because `f1(X)` (transitively) depends on `i(Z)`, we can't just | ||
235 | reuse its value as is. However, if `f2(Y)` is *still* equal to `R1` (despite | ||
236 | `i`'s change), we, in fact, *can* reuse `O` as result of `f1(X)`. And that's how | ||
237 | salsa works: it recomputes results in *reverse* order, starting from inputs and | ||
238 | progressing towards outputs, stopping as soon as it sees an intermediate value | ||
239 | that hasn't changed. If this sounds confusing to you, don't worry: it is | ||
240 | confusing. This illustration by @killercup might help: | ||
241 | |||
242 | <img alt="step 1" src="https://user-images.githubusercontent.com/1711539/51460907-c5484780-1d6d-11e9-9cd2-d6f62bd746e0.png" width="50%"> | ||
243 | |||
244 | <img alt="step 2" src="https://user-images.githubusercontent.com/1711539/51460915-c9746500-1d6d-11e9-9a77-27d33a0c51b5.png" width="50%"> | ||
245 | |||
246 | <img alt="step 3" src="https://user-images.githubusercontent.com/1711539/51460920-cda08280-1d6d-11e9-8d96-a782aa57a4d4.png" width="50%"> | ||
247 | |||
248 | <img alt="step 4" src="https://user-images.githubusercontent.com/1711539/51460927-d1340980-1d6d-11e9-851e-13c149d5c406.png" width="50%"> | ||
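
Here is the `f1`/`f2`/`i` example as a hand-rolled sketch (not the real salsa
machinery): each derived query caches the dependency value it was computed from
alongside its result, which is exactly what lets `f1` be reused when `i`
changes but `f2`'s value does not:

```rust
/// Toy "early cutoff": f2 = i / 10, f1 = f2 + 1.
struct Db {
    i: i64,                       // the input query `i`
    f2_cache: Option<(i64, i64)>, // (value of `i` used, result)
    f1_cache: Option<(i64, i64)>, // (value of `f2` used, result)
}

impl Db {
    fn f2(&mut self) -> i64 {
        if let Some((seen_i, result)) = self.f2_cache {
            if seen_i == self.i {
                return result; // dependency unchanged: reuse
            }
        }
        let result = self.i / 10; // recompute
        self.f2_cache = Some((self.i, result));
        result
    }

    fn f1(&mut self) -> i64 {
        let f2 = self.f2(); // validate/recompute the dependency first...
        if let Some((seen_f2, result)) = self.f1_cache {
            if seen_f2 == f2 {
                return result; // ...and reuse f1 if f2's *value* is unchanged
            }
        }
        let result = f2 + 1;
        self.f1_cache = Some((f2, result));
        result
    }
}

fn main() {
    let mut db = Db { i: 92, f2_cache: None, f1_cache: None };
    assert_eq!(db.f1(), 10); // computes f2 = 9, then f1 = 10
    db.i = 95;               // change the input
    assert_eq!(db.f1(), 10); // f2 recomputes to 9 again, so f1's result is reused
}
```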
249 | |||
250 | ## Salsa Input Queries | ||
251 | |||
252 | All analyzer information is stored in a salsa database. `Analysis` and | ||
253 | `AnalysisHost` types are newtype wrappers for [`RootDatabase`] -- a salsa | ||
254 | database. | ||
255 | |||
256 | [`RootDatabase`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/db.rs#L88-L134 | ||
257 | |||
258 | Salsa input queries are defined in [`FilesDatabase`] (which is a part of | ||
259 | `RootDatabase`). They closely mirror the familiar `AnalysisChange` structure: | ||
260 | indeed, what `apply_change` does is it sets the values of input queries. | ||
261 | |||
262 | [`FilesDatabase`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_db/src/input.rs#L150-L174 | ||
263 | |||
264 | ## From text to semantic model | ||
265 | |||
266 | The bulk of the rust-analyzer is transforming input text into a semantic model of | ||
267 | Rust code: a web of entities like modules, structs, functions and traits. | ||
268 | |||
269 | An important fact to realize is that (unlike most other languages like C# or | ||
270 | Java) there isn't a one-to-one mapping between source code and the semantic model. A | ||
271 | single function definition in the source code might result in several semantic | ||
272 | functions: for example, the same source file might be included as a module into | ||
273 | several crate, or a single "crate" might be present in the compilation DAG | ||
274 | several times, with different sets of `cfg`s enabled. The IDE-specific task of | ||
275 | mapping source code position into a semantic model is inherently imprecise for | ||
276 | this reason, and is handled by the [`source_binder`]. | ||
277 | |||
278 | [`source_binder`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/source_binder.rs | ||
279 | |||
280 | The semantic interface is declared in the [`code_model_api`] module. Each entity is | ||
281 | identified by an integer ID and has a bunch of methods which take a salsa database | ||
282 | as an argument and returns other entities (which are also IDs). Internally, these | ||
283 | methods invoke various queries on the database to build the model on demand. | ||
284 | Here's [the list of queries]. | ||
285 | |||
286 | [`code_model_api`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/code_model_api.rs | ||
287 | [the list of queries]: https://github.com/rust-analyzer/rust-analyzer/blob/7e84440e25e19529e4ff8a66e521d1b06349c6ec/crates/ra_hir/src/db.rs#L20-L106 | ||
288 | |||
289 | The first step of building the model is parsing the source code. | ||
290 | |||
291 | ## Syntax trees | ||
292 | |||
293 | An important property of the Rust language is that each file can be parsed in | ||
294 | isolation. Unlike, say, `C++`, an `include` can't change the meaning of the | ||
295 | syntax. For this reason, rust-analyzer can build a syntax tree for each "source | ||
296 | file", which could then be reused by several semantic models if this file | ||
297 | happens to be a part of several crates. | ||
298 | |||
299 | The representation of syntax trees that rust-analyzer uses is similar to that of `Roslyn` | ||
300 | and Swift's new [libsyntax]. Swift's docs give an excellent overview of the | ||
301 | approach, so I skip this part here and instead outline the main characteristics | ||
302 | of the syntax trees: | ||
303 | |||
304 | * Syntax trees are fully lossless. Converting **any** text to a syntax tree and | ||
305 | back is a total identity function. All whitespace and comments are explicitly | ||
306 | represented in the tree. | ||
307 | |||
308 | * Syntax nodes have generic `(next|previous)_sibling`, `parent`, | ||
309 | `(first|last)_child` functions. You can get from any one node to any other | ||
310 | node in the file using only these functions. | ||
311 | |||
312 | * Syntax nodes know their range (start offset and length) in the file. | ||
313 | |||
314 | * Syntax nodes share the ownership of their syntax tree: if you keep a reference | ||
315 | to a single function, the whole enclosing file is alive. | ||
316 | |||
317 | * Syntax trees are immutable and the cost of replacing the subtree is | ||
318 | proportional to the depth of the subtree. Read Swift's docs to learn how | ||
319 | immutable + parent pointers + cheap modification is possible. | ||
320 | |||
321 | * Syntax trees are build on best-effort basis. All accessor methods return | ||
322 | `Option`s. The tree for `fn foo` will contain a function declaration with | ||
323 | `None` for parameter list and body. | ||
324 | |||
325 | * Syntax trees do not know the file they are built from, they only know about | ||
326 | the text. | ||
327 | |||
328 | The implementation is based on the generic [rowan] crate on top of which a | ||
329 | [rust-specific] AST is generated. | ||
330 | |||
331 | [libsyntax]: https://github.com/apple/swift/tree/5e2c815edfd758f9b1309ce07bfc01c4bc20ec23/lib/Syntax | ||
332 | [rowan]: https://github.com/rust-analyzer/rowan/tree/100a36dc820eb393b74abe0d20ddf99077b61f88 | ||
333 | [rust-specific]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_syntax/src/ast/generated.rs | ||
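
In code, the lossless and best-effort properties look roughly like this
(a sketch assuming the `ra_syntax` API at this tag; exact method names may
differ):

```rust
use ra_syntax::SourceFile;

fn main() {
    // Incomplete code still parses; the missing pieces become `None`s in the AST.
    let text = "fn foo /* no parameter list, no body */";
    let file = SourceFile::parse(text);

    // Lossless: the tree reproduces the input exactly,
    // whitespace and comments included.
    assert_eq!(file.syntax().text().to_string(), text);
}
```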
334 | |||
335 | The next step in constructing the semantic model is ... | ||
336 | |||
337 | ## Building a Module Tree | ||
338 | |||
339 | The algorithm for building a tree of modules is to start with a crate root | ||
340 | (remember, each `Crate` from a `CrateGraph` has a `FileId`), collect all `mod` | ||
341 | declarations and recursively process child modules. This is handled by the | ||
342 | [`module_tree_query`], with two slight variations. | ||
343 | |||
344 | [`module_tree_query`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/module_tree.rs#L116-L123 | ||
345 | |||
346 | First, rust-analyzer builds a module tree for all crates in a source root | ||
347 | simultaneously. The main reason for this is historical (`module_tree` predates | ||
348 | `CrateGraph`), but this approach also enables accounting for files which are not | ||
349 | part of any crate. That is, if you create a file but do not include it as a | ||
350 | submodule anywhere, you still get semantic completion, and you get a warning | ||
351 | about a free-floating module (the actual warning is not implemented yet). | ||
352 | |||
353 | The second difference is that `module_tree_query` does not *directly* depend on | ||
354 | the "parse" query (which is confusingly called `source_file`). Why would calling | ||
355 | the parse directly be bad? Suppose the user changes the file slightly, by adding | ||
356 | an insignificant whitespace. Adding whitespace changes the parse tree (because | ||
357 | it includes whitespace), and that means recomputing the whole module tree. | ||
358 | |||
359 | We deal with this problem by introducing an intermediate [`submodules_query`]. | ||
360 | This query processes the syntax tree and extracts a set of declared submodule | ||
361 | names. Now, changing the whitespace results in `submodules_query` being | ||
362 | re-executed for a *single* module, but because the result of this query stays | ||
363 | the same, we don't have to re-execute [`module_tree_query`]. In fact, we only | ||
364 | need to re-execute it when we add/remove new files or when we change mod | ||
365 | declarations. | ||
366 | |||
367 | [`submodules_query`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/module_tree.rs#L41 | ||
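
A toy stand-in for this query shows why the extra layer buys stability (the
real query walks the syntax tree; a naive token scan is enough to make the
point):

```rust
use std::collections::BTreeSet;

/// Toy `submodules_query`: extract the names declared by `mod name;` items.
fn submodules(text: &str) -> BTreeSet<String> {
    let tokens: Vec<&str> = text.split_whitespace().collect();
    tokens
        .windows(2)
        .filter(|w| w[0] == "mod" && w[1].ends_with(';'))
        .map(|w| w[1].trim_end_matches(';').to_string())
        .collect()
}

fn main() {
    let before = "mod foo; mod bar;";
    let after = "mod foo;\n\n    mod bar;"; // a whitespace-only edit
    // The parse trees differ, but the submodule set does not, so
    // `module_tree_query` does not need to be re-executed.
    assert_eq!(submodules(before), submodules(after));
}
```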
368 | |||
369 | We store the resulting modules in a `Vec`-based indexed arena. The indices in | ||
370 | the arena becomes module IDs. And this brings us to the next topic: | ||
371 | assigning IDs in the general case. | ||
372 | |||
373 | ## Location Interner pattern | ||
374 | |||
375 | One way to assign IDs is how we've dealt with modules: Collect all items into a | ||
376 | single array in some specific order and use the index in the array as an ID. The | ||
377 | main drawback of this approach is that these IDs are not stable: Adding a new item can | ||
378 | shift the IDs of all other items. This works for modules, because adding a module is | ||
379 | a comparatively rare operation, but would be less convenient for, for example, | ||
380 | functions. | ||
381 | |||
382 | Another solution here is positional IDs: We can identify a function as "the | ||
383 | function with name `foo` in a ModuleId(92) module". Such locations are stable: | ||
384 | adding a new function to the module (unless it is also named `foo`) does not | ||
385 | change the location. However, such "ID" types ceases to be a `Copy`able integer and in | ||
386 | general can become pretty large if we account for nesting (for example: "third parameter of | ||
387 | the `foo` function of the `bar` `impl` in the `baz` module"). | ||
388 | |||
389 | [`LocationInterner`] allows us to combine the benefits of positional and numeric | ||
390 | IDs. It is a bidirectional append-only map between locations and consecutive | ||
391 | integers which can "intern" a location and return an integer ID back. The salsa | ||
392 | database we use includes a couple of [interners]. How to "garbage collect" | ||
393 | unused locations is an open question. | ||
394 | |||
395 | [`LocationInterner`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_db/src/loc2id.rs#L65-L71 | ||
396 | [interners]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/db.rs#L22-L23 | ||
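
A minimal sketch of the idea (the real interner additionally needs to be shared
and thread-safe):

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// A bidirectional, append-only map between locations and small integer IDs.
struct LocationInterner<L: Clone + Eq + Hash> {
    id_by_loc: HashMap<L, u32>,
    loc_by_id: Vec<L>,
}

impl<L: Clone + Eq + Hash> LocationInterner<L> {
    fn new() -> Self {
        LocationInterner { id_by_loc: HashMap::new(), loc_by_id: Vec::new() }
    }

    /// The same location always interns to the same ID.
    fn intern(&mut self, loc: L) -> u32 {
        if let Some(&id) = self.id_by_loc.get(&loc) {
            return id;
        }
        let id = self.loc_by_id.len() as u32;
        self.loc_by_id.push(loc.clone());
        self.id_by_loc.insert(loc, id);
        id
    }

    fn lookup(&self, id: u32) -> &L {
        &self.loc_by_id[id as usize]
    }
}

fn main() {
    let mut interner = LocationInterner::new();
    // A positional location: (module ID, index of the item within the module).
    let id = interner.intern((92u32, 3u32));
    assert_eq!(interner.intern((92, 3)), id); // stable across calls
    assert_eq!(*interner.lookup(id), (92, 3));
}
```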
397 | |||
398 | For example, we use `LocationInterner` to assign IDs to definitions of functions, | ||
399 | structs, enums, etc. The location, [`DefLoc`] contains two bits of information: | ||
400 | |||
401 | * the ID of the module which contains the definition, | ||
402 | * the ID of the specific item in the modules source code. | ||
403 | |||
404 | We "could" use a text offset for the location of a particular item, but that would play | ||
405 | badly with salsa: offsets change after edits. So, as a rule of thumb, we avoid | ||
406 | using offsets, text ranges or syntax trees as keys and values for queries. What | ||
407 | we do instead is we store "index" of the item among all of the items of a file | ||
408 | (so, a positional based ID, but localized to a single file). | ||
409 | |||
410 | [`DefLoc`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/ids.rs#L127-L139 | ||
411 | |||
412 | One thing we've glossed over for the time being is support for macros. We have | ||
413 | only proof of concept handling of macros at the moment, but they are extremely | ||
414 | interesting from an "assigning IDs" perspective. | ||
415 | |||
416 | ## Macros and recursive locations | ||
417 | |||
418 | The tricky bit about macros is that they effectively create new source files. | ||
419 | While we can use `FileId`s to refer to original files, we can't just assign them | ||
420 | willy-nilly to the pseudo files of macro expansion. Instead, we use a special | ||
421 | ID, [`HirFileId`] to refer to either a usual file or a macro-generated file: | ||
422 | |||
423 | ```rust | ||
424 | enum HirFileId { | ||
425 | FileId(FileId), | ||
426 | Macro(MacroCallId), | ||
427 | } | ||
428 | ``` | ||
429 | |||
430 | `MacroCallId` is an interned ID that specifies a particular macro invocation. | ||
431 | Its `MacroCallLoc` contains: | ||
432 | |||
433 | * `ModuleId` of the containing module | ||
434 | * `HirFileId` of the containing file or pseudo file | ||
435 | * an index of this particular macro invocation in this file (positional id | ||
436 | again). | ||
437 | |||
438 | Note how `HirFileId` is defined in terms of `MacroCallLoc` which is defined in | ||
439 | terms of `HirFileId`! This does not recur infinitely though: any chain of | ||
440 | `HirFileId`s bottoms out in `HirFileId::FileId`, that is, some source file | ||
441 | actually written by the user. | ||
442 | |||
443 | [`HirFileId`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/ids.rs#L18-L125 | ||
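
A toy version of the recursion shows how any pseudo file resolves to a real one
(hand-rolled types; the `ModuleId` and invocation index are elided):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct FileId(u32);
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct MacroCallId(u32);

#[derive(Clone, Copy)]
enum HirFileId {
    FileId(FileId),
    Macro(MacroCallId),
}

/// Toy `MacroCallLoc`: just the file (or pseudo file) containing the call.
struct MacroCallLoc {
    source_file: HirFileId,
}

/// Any chain of pseudo files bottoms out in a real file.
fn original_file(calls: &HashMap<MacroCallId, MacroCallLoc>, mut file: HirFileId) -> FileId {
    loop {
        match file {
            HirFileId::FileId(id) => return id,
            HirFileId::Macro(call) => file = calls[&call].source_file,
        }
    }
}

fn main() {
    // A macro call in file 0, whose expansion contains another macro call.
    let mut calls = HashMap::new();
    calls.insert(MacroCallId(0), MacroCallLoc { source_file: HirFileId::FileId(FileId(0)) });
    calls.insert(MacroCallId(1), MacroCallLoc { source_file: HirFileId::Macro(MacroCallId(0)) });
    assert_eq!(original_file(&calls, HirFileId::Macro(MacroCallId(1))), FileId(0));
}
```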
444 | |||
445 | Now that we understand how to identify a definition, in a source or in a | ||
446 | macro-generated file, we can discuss name resolution a bit. | ||
447 | |||
448 | ## Name resolution | ||
449 | |||
450 | Name resolution faces the same problem as the module tree: if we look at the | ||
451 | syntax tree directly, we'll have to recompute name resolution after every | ||
452 | modification. The solution to the problem is the same: We [lower] the source code of | ||
453 | each module into a position-independent representation which does not change if | ||
454 | we modify bodies of the items. After that we [loop] resolving all imports until | ||
455 | we've reached a fixed point. | ||
456 | |||
457 | [lower]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/nameres/lower.rs#L113-L117 | ||
458 | [loop]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/nameres.rs#L186-L196 | ||
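
The fixed-point loop, as a toy with single-segment imports (invented types; the
real thing works on the lowered, position-independent representation):

```rust
use std::collections::HashMap;

/// Toy fixed-point resolution: `imports` are (importing module, source module,
/// item name) triples, i.e. `use source::name;` inside the importing module.
fn resolve(
    items: &HashMap<&str, Vec<&str>>,
    imports: &[(&str, &str, &str)],
) -> HashMap<String, Vec<String>> {
    // Start with the locally declared items...
    let mut scopes: HashMap<String, Vec<String>> = items
        .iter()
        .map(|(m, is)| (m.to_string(), is.iter().map(|i| i.to_string()).collect()))
        .collect();
    // ...and keep re-running the imports until a pass adds nothing new.
    loop {
        let mut changed = false;
        for &(target, source, name) in imports {
            let visible = scopes[source].iter().any(|i| i == name);
            if visible && !scopes[target].iter().any(|i| i == name) {
                scopes.get_mut(target).unwrap().push(name.to_string());
                changed = true;
            }
        }
        if !changed {
            return scopes; // fixed point reached
        }
    }
}

fn main() {
    let mut items = HashMap::new();
    items.insert("a", vec!["S"]);
    items.insert("b", vec![]);
    items.insert("c", vec![]);
    // `c` imports from `b`, which imports from `a`: this needs two passes.
    let imports = [("c", "b", "S"), ("b", "a", "S")];
    assert_eq!(resolve(&items, &imports)["c"], vec!["S"]);
}
```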
459 | |||
460 | And, given all our preparation with IDs and a position-independent representation, | ||
461 | it is satisfying to [test] that typing inside function body does not invalidate | ||
462 | name resolution results. | ||
463 | |||
464 | [test]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/nameres/tests.rs#L376 | ||
465 | |||
466 | An interesting fact about name resolution is that it "erases" all of the | ||
467 | intermediate paths from the imports: in the end, we know which items are defined | ||
468 | and which items are imported in each module, but, if the import was `use | ||
469 | foo::bar::baz`, we deliberately forget what modules `foo` and `bar` resolve to. | ||
470 | |||
471 | To serve "goto definition" requests on intermediate segments we need this info | ||
472 | in the IDE, however. Luckily, we need it only for a tiny fraction of imports, so we just ask | ||
473 | the module explicitly, "What does the path `foo::bar` resolve to?". This is a | ||
474 | general pattern: we try to compute the minimal possible amount of information | ||
475 | during analysis while allowing IDE to ask for additional specific bits. | ||
476 | |||
477 | Name resolution is also a good place to introduce another salsa pattern used | ||
478 | throughout the analyzer: | ||
479 | |||
480 | ## Source Map pattern | ||
481 | |||
482 | Due to an obscure edge case in completion, IDE needs to know the syntax node of | ||
483 | an use statement which imported the given completion candidate. We can't just | ||
484 | store the syntax node as a part of name resolution: this will break | ||
485 | incrementality, due to the fact that syntax changes after every file | ||
486 | modification. | ||
487 | |||
488 | We solve this problem during the lowering step of name resolution. The lowering | ||
489 | query actually produces a *pair* of outputs: `LoweredModule` and [`SourceMap`]. | ||
490 | The `LoweredModule` module contains [imports], but in a position-independent form. | ||
491 | The `SourceMap` contains a mapping from position-independent imports to | ||
492 | (position-dependent) syntax nodes. | ||
493 | |||
494 | The result of this basic lowering query changes after every modification. But | ||
495 | there's an intermediate [projection query] which returns only the first | ||
496 | position-independent part of the lowering. The result of this query is stable. | ||
497 | Naturally, name resolution [uses] this stable projection query. | ||
498 | |||
499 | [imports]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/nameres/lower.rs#L52-L59 | ||
500 | [`SourceMap`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/nameres/lower.rs#L52-L59 | ||
501 | [projection query]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/nameres/lower.rs#L97-L103 | ||
502 | [uses]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/query_definitions.rs#L49 | ||
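
Here is the shape of the pattern as a toy (hypothetical names and a line-based
"parser"; the real types hold syntax pointers): `lower` returns both halves, and
a projection keeps only the stable one:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ImportId(u32); // position-independent handle for an import

#[derive(PartialEq, Debug)]
struct LoweredModule {
    imports: Vec<(ImportId, String)>, // *what* is imported: stable across edits
}

struct SourceMap {
    syntax: Vec<(ImportId, usize)>, // *where* it is (line number): volatile
}

/// The lowering query produces both halves at once...
fn lower(text: &str) -> (LoweredModule, SourceMap) {
    let mut imports = Vec::new();
    let mut syntax = Vec::new();
    for (line_no, line) in text.lines().enumerate() {
        if let Some(path) = line.trim().strip_prefix("use ") {
            let id = ImportId(imports.len() as u32);
            imports.push((id, path.trim_end_matches(';').to_string()));
            syntax.push((id, line_no));
        }
    }
    (LoweredModule { imports }, SourceMap { syntax })
}

/// ...while the projection query exposes only the stable half, so that name
/// resolution's input does not change when things merely move around.
fn lowered_module(text: &str) -> LoweredModule {
    lower(text).0
}

fn main() {
    let before = "use foo::bar;";
    let after = "\n\nuse foo::bar;"; // the import moved, but didn't change
    assert_eq!(lowered_module(before), lowered_module(after));
    let (_, map) = lower(after);
    assert_eq!(map.syntax[0].1, 2); // the source map still knows the position
}
```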
503 | |||
504 | ## Type inference | ||
505 | |||
506 | First of all, implementation of type inference in rust-analyzer was spearheaded | ||
507 | by [@flodiebold]. [#327] was an awesome Christmas present, thank you, Florian! | ||
508 | |||
509 | Type inference runs on per-function granularity and uses the patterns we've | ||
510 | discussed previously. | ||
511 | |||
512 | First, we [lower the AST] of a function body into a position-independent | ||
513 | representation. In this representation, each expression is assigned a | ||
514 | [positional ID]. Alongside the lowered expression, [a source map] is produced, | ||
515 | which maps between expression ids and original syntax. This lowering step also | ||
516 | deals with "incomplete" source trees by replacing missing expressions by an | ||
517 | explicit `Missing` expression. | ||
518 | |||
519 | Given the lowered body of the function, we can now run [type inference] and | ||
520 | construct a mapping from `ExprId`s to types. | ||
521 | |||
522 | [@flodiebold]: https://github.com/flodiebold | ||
523 | [#327]: https://github.com/rust-analyzer/rust-analyzer/pull/327 | ||
524 | [lower the AST]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/expr.rs | ||
525 | [positional ID]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/expr.rs#L13-L15 | ||
526 | [a source map]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/expr.rs#L41-L44 | ||
527 | [type inference]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/ty.rs#L1208-L1223 | ||
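
A toy of the data shapes involved (invented types): expressions live in an
arena indexed by `ExprId`, missing pieces become an explicit variant, and
inference assigns a type to each `ExprId`:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ExprId(u32); // an index into the body's expression arena

#[derive(Clone, Copy)]
enum Expr {
    Literal(i64),
    Add(ExprId, ExprId),
    Missing, // inserted by lowering for incomplete source trees
}

#[derive(Clone, Copy, PartialEq, Debug)]
enum Ty {
    Int,
    Unknown,
}

struct Body {
    exprs: Vec<Expr>, // `ExprId(i)` means `exprs[i]`
}

/// Toy inference: compute the type of one expression.
fn ty(body: &Body, id: ExprId) -> Ty {
    match body.exprs[id.0 as usize] {
        Expr::Literal(_) => Ty::Int,
        Expr::Add(a, b) if ty(body, a) == Ty::Int && ty(body, b) == Ty::Int => Ty::Int,
        _ => Ty::Unknown,
    }
}

fn main() {
    // `1 + <missing>`: lowering replaced the absent operand with Expr::Missing.
    let body = Body {
        exprs: vec![Expr::Literal(1), Expr::Missing, Expr::Add(ExprId(0), ExprId(1))],
    };
    assert_eq!(ty(&body, ExprId(0)), Ty::Int);
    assert_eq!(ty(&body, ExprId(2)), Ty::Unknown);
}
```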
528 | |||
529 | ## Tying it all together: completion | ||
530 | |||
531 | To conclude the overview of the rust-analyzer, let's trace the request for | ||
532 | (type-inference powered!) code completion! | ||
533 | |||
534 | We start by [receiving a message] from the language client. We decode the | ||
535 | message as a request for completion and [schedule it on the threadpool]. This is | ||
536 | the also place where we [catch] canceled errors if, immediately after completion, the | ||
537 | client sends some modification. | ||
538 | |||
539 | In [the handler] we a deserialize LSP request into the rust-analyzer specific data | ||
540 | types (by converting a file url into a numeric `FileId`), [ask analysis for | ||
541 | completion] and serializer results to LSP. | ||
542 | |||
543 | The [completion implementation] is finally the place where we start doing the actual | ||
544 | work. The first step is to collect the `CompletionContext` -- a struct which | ||
545 | describes the cursor position in terms of Rust syntax and semantics. For | ||
546 | example, `function_syntax: Option<&'a ast::FnDef>` stores a reference to | ||
547 | enclosing function *syntax*, while `function: Option<hir::Function>` is the | ||
548 | `Def` for this function. | ||
549 | |||
550 | To construct the context, we first do an ["IntelliJ Trick"]: we insert a dummy | ||
551 | identifier at the cursor's position and parse this modified file, to get a | ||
552 | reasonably looking syntax tree. Then we do a bunch of "classification" routines | ||
553 | to figure out the context. For example, we [find an ancestor `fn` node] and we get a | ||
554 | [semantic model] for it (using the lossy `source_binder` infrastructure). | ||
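
The trick itself is tiny; here is a hedged sketch of the text patching (the
real code works with the file's text and the cursor offset in much the same
way, though the placeholder name here is invented):

```rust
/// Make the file parse "as intended" at the cursor by inserting a placeholder
/// identifier before handing the text to the parser.
fn with_dummy_ident(text: &str, cursor: usize) -> String {
    let mut patched = String::with_capacity(text.len() + 16);
    patched.push_str(&text[..cursor]);
    patched.push_str("dummyIdentifier"); // any valid identifier will do
    patched.push_str(&text[cursor..]);
    patched
}

fn main() {
    // Completing right after `foo.`: the raw text has no field expression to
    // classify, but the patched text does.
    assert_eq!(with_dummy_ident("foo.", 4), "foo.dummyIdentifier");
}
```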
555 | |||
556 | The second step is to run a [series of independent completion routines]. Let's | ||
557 | take a closer look at [`complete_dot`], which completes fields and methods in | ||
558 | `foo.bar|`. First we extract a semantic function and a syntactic receiver | ||
559 | expression out of the `Context`. Then we run type-inference for this single | ||
560 | function and map our syntactic expression to `ExprId`. Using the ID, we figure | ||
561 | out the type of the receiver expression. Then we add all fields & methods from | ||
562 | the type to completion. | ||
563 | |||
564 | [receiving a message]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L203 | ||
565 | [schedule it on the threadpool]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L428 | ||
566 | [catch]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L436-L442 | ||
567 | [the handler]: https://salsa.zulipchat.com/#narrow/stream/181542-rfcs.2Fsalsa-query-group/topic/design.20next.20steps | ||
568 | [ask analysis for completion]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/lib.rs#L439-L444 | ||
569 | [completion implementation]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/completion.rs#L46-L62 | ||
570 | [`CompletionContext`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/completion/completion_context.rs#L14-L37 | ||
571 | ["IntelliJ Trick"]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/completion/completion_context.rs#L72-L75 | ||
572 | [find an ancestor `fn` node]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/completion/completion_context.rs#L116-L120 | ||
573 | [semantic model]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/completion/completion_context.rs#L123 | ||
574 | [series of independent completion routines]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/completion.rs#L52-L59 | ||
575 | [`complete_dot`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/completion/complete_dot.rs#L6-L22 | ||