aboutsummaryrefslogtreecommitdiff
path: root/guide.md
blob: babf584884ca437dee5a3e6a860a516699088981 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
# Guide to rust-analyzer

## About the guide

This guide describes the current start of the rust-analyzer as of 2019-01-20
(commit hash guide-2019-01). Its purpose is to
document various problems and architectural solutions related to the problem of
building IDE-first compiler.

## The big picture

On the highest possible level, rust analyzer is a stateful component. Client may
apply changes to the analyzer (new contents of `foo.rs` file is "fn main() {}")
and it may ask semantic questions about the current state (what is the
definition of the identifier with offset 92 in file `bar.rs`?). Two important
properties hold:

* Analyzer does not do any IO. It starts in an empty state and all input data is
  provided via `apply_change` API.

* Only queries about the current state are supported. One can, of course,
  simulate undo and redo by keeping log of changes and inverse-changes.

## IDE API

To see this big picture, let's take a look at the [`AnalysisHost`] and
[`Analysis`] pair of types. `AnalysisHost` has three methods:

* `default` for creating an empty analysis
* `apply_change(&mut self)` to make changes (this is how you get from an empty
  state to something interesting)
* `analysis(&self)` to get an instance of `Analysis`

`Analysis` has a ton of methods for IDEs, like `goto_definition`, or
`completions`. Both inputs and outputs of `Analysis`' methods are formulated in
terms of files and offsets, and **not** in terms of Rust concepts like structs,
traits, etc. The "typed" API with Rust specific types is slightly lower in the
stack, we'll talk about it later.

[`AnalysisHost`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/lib.rs#L265-L284
[`Analysis`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/lib.rs#L291-L478

The reason for `Analysis` and `AnalysisHost` separation is that we want apply
changes "uniquely", but we might want to fork an `Analysis` and send it to
another thread for background processing. That is, there is only a single
`AnalysisHost`, but there may be several (equivalent) `Analysis`.

Note that all of the `Analysis` API return `Cancelable<T>`. This is required to
be responsive in IDE setting. Sometimes a long-running query is being computed
and the user types something in the editor and asks for completion. In this
case, we cancel the long-running computation (so it returns `Err(Canceled)`),
apply the change and execute request for completion. We never use stale data to
answer requests. Under the cover, `AnalysisHost` "remembers" all outstanding
`Analysis` instances. `AnalysisHost::apply_change` method cancels all
`Analysis`es, blocks until of them are `Dropped` and then applies change
in-place. This is the familiar to rustaceans read-write lock interior
mutability.

Next, lets talk about what are inputs to the Analysis, precisely.

## Inputs

Rust Analyzer never does any IO itself, all inputs get passed explicitly via
`AnalysisHost::apply_change` method, which accepts a single argument:
`AnalysisChange`. [`AnalysisChange`] is a builder for a single change
"transaction", so it suffices to study its methods to understand all of the
input data.

[`AnalysisChange`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/lib.rs#L119-L167

The `(add|change|remove)_file` methods control the set of the input files, where
each file has an integer id (`FileId`, picked by the client), text (`String`)
and a filesystem path. Paths are tricky, they'll be explained in source roots
section, together with `add_root` method. `add_library` method allows to add a
group of files which are assumed to rarely change. It's mostly an optimization
and does not change fundamental picture.

`set_crate_graph` method allows to control how the input files are partitioned
into compilation unites -- crates. It also controls (in theory, not implemented
yet) `cfg` flags. `CrateGraph` is a directed acyclic graph of crates. Each crate
has a root `FileId`, a set of active `cfg` flags and a set of dependencies. Each
dependency is a pair of a crate and a name. It is possible to have two crates
with the same root `FileId` but different `cfg`-flags/dependencies. This model
is lower than Cargo's model of packages: each Cargo package consists of several
targets, each of which is a separate crate (or several crates, if you try
different feature combinations).

Procedural macros should become inputs as well, but currently they are not
supported. Procedural macro will be a black box `Box<dyn Fn(TokenStream) -> TokenStream>`
function, and will be inserted into the crate graph just like dependencies.

Soon we'll talk how we build an LSP server on top of `Analysis`, but first,
let's deal with that paths issue.


## Source roots (aka filesystems are horrible)

This is a non-essential section, feel free to skip.

The previous section said that the file system path is an attribute of a file,
but this is not a whole truth. Making it an absolute `PathBuf` will be bad for
several reasons. First, file-systems are full of (platform-dependent) edge cases:

* it's hard (requires a syscall) to decide if two paths are equivalent
* some file-systems are case-sensitive
* paths are not necessary UTF-8
* symlinks can form cycles

Second, this might hurt reproducibility and hermeticity of builds. In theory,
moving a project from `/foo/bar/my-project` to `/spam/eggs/my-project` should
not change a bit in the output. However, if absolute path is a part of the
input, it is at least in theory observable, and *could* affect the output.

Yet another problem is that we really-really want to avoid doing IO, but with
Rust the set of "input" files is not necessary known up-front. In theory, you
can have `#[path="/dev/random"] mod foo;`.

To solve (or explicitly refuse to solve) these problems rust analyzer uses the
concept of source root. Roughly speaking, source roots is a contents of a
directory on a file systems, like `/home/matklad/projects/rustraytracer/**.rs`.

More precisely, all files (`FileId`s) are partitioned into disjoint
`SourceRoot`s. Each file has a relative utf-8 path within the `SourceRoot`.
`SourceRoot` has an identity (integer id). Crucially, the root path of the
source root itself is unknown to the analyzer: client is supposed to maintain a
mapping between SourceRoot ids (which are assigned by the client) and actual
`PathBuf`s. `SourceRoot`s give a sane tree model of the file system to the
analyzer.

Note that `mod`, `#[path]` and `include!()` can only reference files from the
same source root. It is of course is possible to explicitly add extra files to
the source root, even `/dev/random`.

## Language Server Protocol

Now let's see how `Analysis` API is exposed via JSON RPC based LSP protocol. The
hard part here is managing changes (which can come either from the file system
or from the editor) and concurrency (we want to spawn background jobs for things
like syntax highlighting). We use the event loop pattern to manage the zoo, and
the loop is the [`main_loop_inner`] function. The [`main_loop`] does a one-time
initialization and tearing down of the resources.

[`main_loop`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L51-L110
[`main_loop_inner`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L156-L258


Let's walk through a typical analyzer session!

First, we need to figure out what to analyze. To do this, we run `cargo
metadata` to learn about Cargo packages for current workspace and dependencies,
and we run `rustc --print sysroot` and scan sysroot to learn about crates like
`std`. Currently we load this configuration once at the start of the server, but
it should be possible to dynamically reconfigure it later without restart.

[main_loop.rs#L62-L70](https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L62-L70)

The [`ProjectModel`] we get after this step is very Cargo and sysroot specific,
it needs to be lowered to get the input in the form of `AnalysisChange`. This
happens in [`ServerWorldState::new`] method. Specifically

* Create a `SourceRoot` for each Cargo package and sysroot.
* Schedule a file system scan of the roots.
* Create an analyzer's `Crate` for each Cargo **target** and sysroot crate.
* Setup dependencies between the crates.

[`ProjectModel`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/project_model.rs#L16-L20
[`ServerWorldState::new`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/server_world.rs#L38-L160

The results of the scan (which may take a while) will be processed in the body
of the main loop, just like any other change. Here's where we handle

* [File system changes](https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L194)
* [Changes from the editor](https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L377)

After a single loop's turn, we group them into one `AnalysisChange` and
[apply] it. This always happens on the main thread and blocks the loop.

[apply]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/server_world.rs#L216

To handle requests, like ["goto definition"], we create an instance of the
`Analysis` and [`schedule`] the task (which consumes `Analysis`) onto
threadpool. [The task] calls the corresponding `Analysis` method, while
massaging the types into the LSP representation. Keep in mind that if we are
executing "goto definition" on the threadpool and a new change comes in, the
task will be canceled as soon as the main loop calls `apply_change` on the
`AnalysisHost`.

["goto definition"]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/server_world.rs#L216
[`schedule`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop.rs#L426-L455
[The task]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_lsp_server/src/main_loop/handlers.rs#L205-L223

This concludes the overview of the analyzer's programing *interface*. Next, lets
dig into the implementation!

## Salsa

The most straightforward way to implement "apply change, get analysis, repeat"
API would be to maintain the input state and to compute all possible analysis
information from scratch after every change. This works, but scales poorly with
the size of the project. To make this fast, we need to take advantage of the
fact that most of the changes are small, and that analysis results are unlikely
to change significantly between invocations.

To do this we use [salsa]: a framework for incremental on-demand computation.
You can skip the rest of the section if you are familiar with rustc red-green
algorithm.

[salsa]: https://github.com/salsa-rs/salsa

It's better to refer to salsa's docs to learn about it. Here's a small excerpt:

The key idea of salsa is that you define your program as a set of queries. Every
query is used like function K -> V that maps from some key of type K to a value
of type V. Queries come in two basic varieties:

* **Inputs**: the base inputs to your system. You can change these whenever you
  like.

* **Functions**: pure functions (no side effects) that transform your inputs
  into other values. The results of queries is memoized to avoid recomputing
  them a lot. When you make changes to the inputs, we'll figure out (fairlywe
  intelligently) when we can re-use these memoized values and when we have we
  recompute them.


For further discussion, its important to understand one bit of "fairly
intelligently". Suppose we have to functions, `f1` and `f2`, and one input,we
We call `f1(X)` which in turn calls `f2(Y)` which inspects `i(Z)`. `i(Z)`
returns some value `V1`, `f2` uses that and returns `R1`, `f1` uses that anwe
returns `O`. Now, let's change `i` at `Z` to `V2` from `V1` and try to compwe
`f1(X)` again. Because `f1(X)` (transitively) depends on `i(Z)`, we can't jwe
reuse its value as is. However, if `f2(Y)` is *still* equal to `R1` (despitwe
`i`'s change), we, in fact, *can* reuse `O` as result of `f1(X)`. And that'we
salsa works: it recomputes results in *reverse* order, starting from inputswe
progressing towards outputs, stopping as soon as it sees an intermediate vawe
that hasn't changed.

## Salsa Input Queries

All analyzer information is stored in a salsa database. `Analysis` and
`AnalysisHost` types are newtype wrappers for [`RootDatabase`] -- a salsa
database.

[`RootDatabase`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_ide_api/src/db.rs#L88-L134

Salsa input queries are defined in [`FilesDatabase`] (which is a part of
`RootDatabase`). They closely mirror the familiar `AnalysisChange` structure:
indeed, what `apply_change` does is it sets the values of input queries.

[`FilesDatabase`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_db/src/input.rs#L150-L174

## From text to semantic model

The bulk of the rust-analyzer is transforming input text into semantic model of
Rust code: a web of entities like modules, structs, functions and traits.

An important fact to realize is that (unlike most other languages like C# or
Java) there isn't a one-to-one mapping between source code and semantic model. A
single function definition in the source code might result in several semantic
functions: for example, the same source file might be included as a module into
several crate, or a single "crate" might be present in the compilation DAG
several times, with different sets of `cfg`s enabled.

The semantic interface is declared in [`code_model_api`] module. Each entity is
identified by integer id and has a bunch of methods which take a salsa database
as an argument and returns other entities (which are ids). Internally, this
methods invoke various queries on the database to build the model on demand.
Here's [the list of queries].

[`code_model_api`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/code_model_api.rs
[the list of queries]: https://github.com/rust-analyzer/rust-analyzer/blob/7e84440e25e19529e4ff8a66e521d1b06349c6ec/crates/ra_hir/src/db.rs#L20-L106

The first step of building the model is parsing the source code.

## Syntax trees

An important property of the Rust language is that each file can be parsed in
isolation. Unlike, say, `C++`, an `include` can't change the meaning of the
syntax. For this reason, Rust analyzer can build a syntax tree for each "source
file", which could then be reused by several semantic models if this file
happens to be a part of several crates.

Rust analyzer uses a similar representation of syntax trees to that of `Roslyn`
and Swift's new [libsyntax]. Swift's docs give an excellent overview of the
approach, so I skip this part here and instead outline the main characteristics
of the syntax trees:

* Syntax trees are fully lossless. Converting **any** text to a syntax tree and
  back is a total identity function. All whitespace and comments are explicitly
  represented in the tree.

* Syntax nodes have generic `(next|previous)_sibling`, `parent`,
  `(first|last)_child` functions. You can get from any one node to any other
  node in the file using only these functions.

* Syntax nodes know their range (start offset and length) in the file.

* Syntax nodes share the ownership of their syntax tree: if you keep a reference
  to a single function, the whole enclosing file is alive.

* Syntax trees are immutable and the cost of replacing the subtree is
  proportional to the depth of the subtree. Read Swift's docs to learn how
  immutable + parent pointers + cheap modification is possible.

* Syntax trees are build on best-effort basis. All accessor methods return
  `Option`s. The tree for `fn foo` will contain a function declaration with
  `None` for parameter list and body.

* Syntax trees do not know the file they are build from, they only know about
  the text.

The implementation is based on the generic [rowan] crate on top of which a
[rust-specific] AST is generated.

[libsyntax]: https://github.com/apple/swift/tree/5e2c815edfd758f9b1309ce07bfc01c4bc20ec23/lib/Syntax
[rowan]: https://github.com/rust-analyzer/rowan/tree/100a36dc820eb393b74abe0d20ddf99077b61f88
[rust-specific]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_syntax/src/ast/generated.rs

The next step in constructing the semantic model is ...

## Building a Module Tree

The algorithm for building a tree of modules is to start with a crate root
(remember, each `Crate` from a `CrateGraph` has a `FileId`), collect all mod
declarations and recursively process child modules. This is handled by the
[`module_tree_query`], with a two slight variations.

[`module_tree_query`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/module_tree.rs#L116-L123

First, rust analyzer builds a module tree for all crates in a source root
simultaneously. The main reason for this is historical (`module_tree` predates
`CrateGraph`), but this approach also allows to account for files which are not
part of any crate. That is, if you create a file but do not include it as a
submodule anywhere, you still get semantic completion, and you get a warning
about free-floating module (the actual warning is not implemented yet).

The second difference is that `module_tree_query` does not *directly* depend on
the "parse" query (which is confusingly called `source_file`). Why would calling
the parse directly be bad? Suppose the user changes the file slightly, by adding
an insignificant whitespace. Adding whitespace changes the parse tree (because
it includes whitespace), and that means recomputing the whole module tree.

We deal with this problem by introducing an intermediate [`submodules_query`].
This query processes the syntax tree an extract a set of declared submodule
names. Now, changing the whitespace results in `submodules_query` being
re-executed for a *single* module, but because the result of this query stays
the same, we don't have to re-execute [`module_tree_query`]. In fact, we only
need to re-execute it when we add/remove new files or when we change mod
declarations.

[`submodules_query`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/module_tree.rs#L41)

We store the resulting modules in a `Vec`-based indexed arena. The indices in
the arena becomes module ids. And this brings us to the next topic:
assigning ids in the general case.

## Location Interner pattern

One way to assign ids is how we've dealt with modules: collect all items into a
single array in some specific order and use index in the array as an id. The
main drawback of this approach is that ids are not stable: adding a new item can
shift ids of all other items. This works for modules, because adding a module is
a comparatively rare operation, but would be less convenient for, for example
functions.

Another solution here is positional ids: we can identify a function as "the
function with name `foo` in a ModuleId(92) module". Such locations are stable:
adding a new function to the module (unless it is also named `foo`) does not
change the location. However, such "id" ceases to be a `Copy` integer and in
general can become pretty large if we account for nesting (third parameter of
the foo function of the bar impl in the baz module).

[`LocationInterner`] allows us to combine benefits of positional and numeric
ids. It is a bidirectional append only map between locations and consecutive
integers which can "intern" a location and return an integer id back. Salsa
database we use includes a couple of [interners]. How to "garbage collect"
unused locations is an open question.

[`LocationInterner`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_db/src/loc2id.rs#L65-L71
[interners]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/db.rs#L22-L23

For example, we use `LocationInterner` to assign ids to defs: functions,
structs, enums, etc. The location, [`DefLoc`] contains two bits of information:

* the id of the module which contains the def,
* the id of the specific item in the modules source code.

We "could" use a text offset for location a particular item, but that would play
badly with salsa: offsets change after edits. So, as a rule of thumb, we avoid
using offsets, text ranges or syntax trees as keys and values for queries. What
we do instead is we store "index" of the item among all of the items of a file
(so, a positional based ID, but localized to a single file).

[`DefLoc`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/ids.rs#L127-L139

One thing we've glossed over for the time being is support for macros. We have
only proof of concept handling of macros at the moment, but they are extremely
interesting from "assigning ids" perspective.

## Macros and recursive locations

The tricky bit about macros is that they effectively create new source files.
While we can use `FileId`s to refer to original files, we can't just assign them
willy-nilly to the pseudo files of macro expansion. Instead, we use a special
ID, [`HirFileId`] to refer to either a usual file or a macro-generated file:

```rust
enum HirFileId {
    FileId(FileId),
    Macro(MacroCallId),
}
```

`MacroCallId` is an interned ID that specifies a particular macro invocation.
Its `MacroCallLoc` contains:

* `ModuleId` of the containing module
* `HirFileId` of the containing file or pseudo file
* an index of this particular macro invocation in this file (positional id
  again).

Note how `HirFileId` is defined in terms of `MacroCallLoc` which is defined in
terms of `HirFileId`! This does not recur infinitely though: any chain of
`HirFileId`s bottoms out in `HirFileId::FileId`, that is, some source file
actually written by the user.

[`HirFileId`]: https://github.com/rust-analyzer/rust-analyzer/blob/guide-2019-01/crates/ra_hir/src/ids.rs#L18-L125

## Name resolution

## Source Map pattern

## Tying it all together: completion