[Tree-sitter](https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries)
queries allow you to search for patterns in syntax trees,
much like a regex would in text. Combine that with some Rust
glue, and you can write simple, custom linters.

### Tree-sitter syntax trees

Here is a quick crash course on syntax trees generated by
tree-sitter. Syntax trees produced by tree-sitter are
represented by S-expressions. The generated S-expression for
the following Rust code,

```rust
fn main() {
    let x = 2;
}
```

would be:

```scheme
(source_file
 (function_item
  name: (identifier)
  parameters: (parameters)
  body: 
  (block
   (let_declaration 
    pattern: (identifier)
    value: (integer_literal)))))
```

Syntax trees generated by tree-sitter have another cool
property: they are _lossless_. Given a lossless syntax tree,
you can regenerate the original source code in its entirety.
Consider the following addition to our example:

```rust
 fn main() {
+    // a comment goes here
     let x = 2;
 }
```

The tree-sitter syntax tree preserves the comment, while the
typical abstract syntax tree wouldn't:

```scheme
 (source_file
  (function_item
   name: (identifier)
   parameters: (parameters)
   body:
   (block
+   (line_comment)
    (let_declaration
     pattern: (identifier)
     value: (integer_literal)))))
```
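
To make "lossless" concrete, here is a minimal sketch in plain
Rust. It is a toy tokenizer, not tree-sitter's internal
representation, and `Token`, `tokenize`, `regenerate` and
`strip_trivia` are hypothetical names made up for this
illustration. Because comments and whitespace are kept as
tokens, gluing the tokens back together reproduces the source
byte for byte. Throw them away, as an abstract syntax tree
typically does, and the original can no longer be recovered:

```rust
/// Hypothetical token type for this sketch; not part of tree-sitter.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    kind: &'static str,
    text: String,
}

/// Split `src` into tokens without discarding a single byte.
fn tokenize(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut rest = src;
    while let Some(first) = rest.chars().next() {
        let (kind, len) = if rest.starts_with("//") {
            // a line comment runs up to, but not including, the newline
            ("line_comment", rest.find('\n').unwrap_or(rest.len()))
        } else if first.is_whitespace() {
            let end = rest.find(|c: char| !c.is_whitespace());
            ("whitespace", end.unwrap_or(rest.len()))
        } else if first.is_alphanumeric() || first == '_' {
            let end = rest.find(|c: char| !(c.is_alphanumeric() || c == '_'));
            ("word", end.unwrap_or(rest.len()))
        } else {
            ("punctuation", first.len_utf8())
        };
        let (text, tail) = rest.split_at(len);
        tokens.push(Token { kind, text: text.to_string() });
        rest = tail;
    }
    tokens
}

/// Lossless: concatenating every token gives back the source.
fn regenerate(tokens: &[Token]) -> String {
    tokens.iter().map(|t| t.text.as_str()).collect()
}

/// What a lossy tree would keep: no comments, no whitespace.
fn strip_trivia(tokens: &[Token]) -> Vec<Token> {
    tokens
        .iter()
        .filter(|t| t.kind != "line_comment" && t.kind != "whitespace")
        .cloned()
        .collect()
}
```

Calling `regenerate(&tokenize(src))` returns `src` unchanged,
while `regenerate(&strip_trivia(&tokenize(src)))` does not.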

### Tree-sitter queries

Tree-sitter provides a DSL to match over concrete syntax
trees (CSTs), such as the ones above. Queries resemble our
S-expression syntax trees. Here is a query to match all line
comments in a Rust CST:

```scheme
(line_comment)

; matches the following rust code
; // a comment goes here
```

Neat, eh? But don't take my word for it, give it a go on the
[tree-sitter
playground](https://tree-sitter.github.io/tree-sitter/playground).
Type in a query like so:

```scheme
; the web playground requires you to specify a "capture"
; you will notice the capture and the nodes it captured
; turn blue
(line_comment) @capture
```

Here's another to match `let` declarations that
bind an integer to an identifier:

```scheme
(let_declaration
 pattern: (identifier)
 value: (integer_literal))
 
; matches:
; let foo = 2;
```

We can _capture_ nodes into variables:

```scheme
(let_declaration 
 pattern: (identifier) @my-capture
 value: (integer_literal))
 
; matches:
; let foo = 2;

; captures:
; foo
```

And apply certain _predicates_ to captures:

```scheme
((let_declaration
  pattern: (identifier) @my-capture
  value: (integer_literal))
 (#eq? @my-capture "foo"))
 
; matches:
; let foo = 2;

; and not:
; let bar = 2;
```

The `#match?` predicate checks if a capture matches a regex:

```scheme
((let_declaration
  pattern: (identifier) @my-capture
  value: (integer_literal))
 (#match? @my-capture "foo|bar"))
 
; matches both `foo` and `bar`:
; let foo = 2;
; let bar = 2;
```

Exhibit indifference, as a stoic programmer would, with the
_wildcard_ pattern:

```scheme
(let_declaration
 pattern: (identifier)
 value: (_))
 
; matches:
; let foo = "foo";
; let foo = 42;
; let foo = bar;
```

[The
documentation](https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries)
does the tree-sitter query DSL more justice, but we now know
enough to write our first lint.

### Write you a tree-sitter lint

String literals passed to `std::env` functions are error-prone:

```rust
std::env::remove_var("RUST_BACKTACE");
                            // ^^^^ "TACE" instead of "TRACE"
```

I prefer this instead:

```rust
// somewhere in a module that is well spellchecked
static BACKTRACE: &str = "RUST_BACKTRACE";

// rest of the codebase
std::env::remove_var(BACKTRACE);
```

Let's write a lint to find `std::env` functions that use
strings. Put aside the effectiveness of this lint for the
moment, and take a stab at writing a tree-sitter query. For
reference, a function call like so:

```rust
remove_var("RUST_BACKTRACE")
```

Produces the following S-expression:

```scheme
(call_expression
  function: (identifier)
  arguments: (arguments (string_literal)))
```

We are definitely looking for a `call_expression`:

```scheme
(call_expression) @raise
```

Whose function name matches `std::env::var` or
`std::env::remove_var` at the very least (I know, I know,
this isn't the best regex):

```scheme
((call_expression
  function: (_) @fn-name) @raise
 (#match? @fn-name "std::env::(var|remove_var)"))
```

Let's make that `std::` prefix optional:

```scheme
((call_expression
  function: (_) @fn-name) @raise
 (#match? @fn-name "(std::|)env::(var|remove_var)"))
```

And ensure that `arguments` is a string:

```scheme
((call_expression
  function: (_) @fn-name
  arguments: (arguments (string_literal))) @raise
 (#match? @fn-name "(std::|)env::(var|remove_var)"))
```
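
One caveat about that regex: as far as I can tell, `#match?`
performs an unanchored search, so the pattern also accepts any
function path that merely contains it, such as `env::var_os`
or `my_env::var`. That may or may not be what you want.
Wrapping the pattern in `^` and `$` tightens it. The sketch
below spells out, in plain Rust with no tree-sitter involved,
the check that an anchored pattern would express. `is_env_call`
is a hypothetical helper name, not part of any library:

```rust
/// Hypothetical helper: true only for the exact paths the lint targets.
/// Equivalent in spirit to the anchored regex
/// `^(std::)?env::(var|remove_var)$`.
fn is_env_call(path: &str) -> bool {
    // the `std::` prefix is optional
    let path = path.strip_prefix("std::").unwrap_or(path);
    // what remains must be exactly `env::<name>` with a known name
    match path.strip_prefix("env::") {
        Some(name) => matches!(name, "var" | "remove_var"),
        None => false,
    }
}
```

With this check, `std::env::var` and `env::remove_var` pass,
while `env::var_os` and `my_env::var` do not.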

### Running our linter

We could always plug our query into the web playground, but
let's go a step further:

```bash
cargo new --bin toy-lint
```

Add `tree-sitter` and `tree-sitter-rust` to your
dependencies:

```toml
# within Cargo.toml
[dependencies]
tree-sitter = "0.20"
# the grammar crate, kept on the same 0.20 series as `tree-sitter`
tree-sitter-rust = "0.20"

Let's load in some Rust code to work with. As [an ode to
Gödel](https://en.wikipedia.org/wiki/Self-reference)
(G`ode`l?), why not load in our linter itself:

```rust
fn main() {
    let src = include_str!("main.rs");
}
```

Most tree-sitter APIs require a `Language` struct. We will
be working with Rust, if you haven't already guessed:

```rust
use tree_sitter::Language;

let rust_lang: Language = tree_sitter_rust::language();
```

Enough scaffolding, let's parse some Rust:

```rust
use tree_sitter::Parser;

let mut parser = Parser::new();
parser.set_language(rust_lang).unwrap();

let parse_tree = parser.parse(&src, None).unwrap();
```

The second argument to `Parser::parse` may be of interest.
Tree-sitter has this cool feature that allows for quick
reparsing of existing parse trees if they contain edits. If
you do happen to want to reparse a source file, you can pass
in the old tree:

```rust
// if you wish to reparse instead of parse
old_tree.edit(/* redacted */);

// generate shiny new reparsed tree
// (`parse` takes the old tree by reference: `Option<&Tree>`)
let new_tree = parser.parse(&src, Some(&old_tree)).unwrap();
```

Anyhow ([hah!](http://github.com/dtolnay/anyhow)), now that we have a parse tree, we can inspect it:

```rust
println!("{}", parse_tree.root_node().to_sexp());
```

Or better yet, run a query on it:

```rust
use tree_sitter::Query;

let query = Query::new(
    rust_lang,
    r#"
    ((call_expression
      function: (_) @fn-name
      arguments: (arguments (string_literal))) @raise
     (#match? @fn-name "(std::|)env::(var|remove_var)"))
    "#
)
.unwrap();
```

A `QueryCursor` is tree-sitter's way of maintaining state as
we iterate through the matches or captures produced by
running a query on the parse tree. Observe:

```rust
use tree_sitter::QueryCursor;

let mut query_cursor = QueryCursor::new();
let all_matches = query_cursor.matches(
    &query,
    parse_tree.root_node(),
    src.as_bytes(),
);
```

We begin by passing our query to the cursor, followed by the
"root node", which is another way of saying "start from the
top", and lastly, the source itself. If you have already
taken a look at the C API, you will notice that the last
argument, the source (known as the `TextProvider`), is not
required there. The Rust bindings need it because predicates
such as `#match?` and `#eq?` are evaluated in the bindings,
against the text of each captured node.

Do something with the matches:

```rust
// get the index of the capture named "raise"
let raise_idx = query.capture_index_for_name("raise").unwrap();

for each_match in all_matches {
    // iterate over all captures called "raise"
    // ignore captures such as "fn-name"
    for capture in each_match
        .captures
        .iter()
        .filter(|c| c.index == raise_idx)
    {
        let range = capture.node.range();
        let text = &src[range.start_byte..range.end_byte];
        let line = range.start_point.row;
        let col = range.start_point.column;
        println!(
            "[Line: {}, Col: {}] Offending source code: `{}`",
            line, col, text
        );
    }
}
```
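
Note that tree-sitter rows and columns are zero-based, and
columns count bytes, so the positions printed by this loop are
one less than what most editors display. Here is a minimal
sketch of a formatter that reports one-based positions
instead. `format_diagnostic` is a hypothetical helper, and it
recomputes the position from the byte offset using only the
standard library so that it stands on its own:

```rust
/// Hypothetical helper: describe the byte range `start..end` of `src`
/// using one-based line and column numbers (columns counted in bytes).
fn format_diagnostic(src: &str, start: usize, end: usize) -> String {
    let before = &src[..start];
    // zero-based row: the number of newlines before the range
    let row = before.matches('\n').count();
    // byte offset at which the current line begins
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let column = start - line_start;
    format!(
        "[Line: {}, Col: {}] Offending source code: `{}`",
        row + 1,
        column + 1,
        &src[start..end]
    )
}
```

Inside the loop, you would call it with `range.start_byte` and
`range.end_byte`.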

Lastly, add the following line to your source code (inside
`main`, with `use std::env;` at the top), to get the linter
to catch something:

```rust
env::remove_var("RUST_BACKTRACE");
```

And `cargo run`:

```shell
λ cargo run
   Compiling toy-lint v0.1.0 (/redacted/path/to/toy-lint)
    Finished dev [unoptimized + debuginfo] target(s) in 0.74s
     Running `target/debug/toy-lint`
[Line: 40, Col: 4] Offending source code: `env::remove_var("RUST_BACKTRACE")`
```

Thank you tree-sitter!

### Bonus

Keen readers will notice that I avoided `std::env::set_var`.
That is because `set_var` is called with two arguments, a
"key" and a "value", unlike `env::var` and `env::remove_var`.
As a result, it requires more juggling:

```scheme
((call_expression
  function: (_) @fn-name
  arguments: (arguments . (string_literal)? . (string_literal) .)) @raise
 (#match? @fn-name "(std::|)env::(var|remove_var|set_var)"))
```

The interesting part of this query is the humble `.`, the
_anchor_ operator. Anchors help constrain child nodes in
certain ways. In this case, they ensure that we match either
exactly two `string_literal`s that are adjacent siblings, or
exactly one `string_literal` with no siblings. Unfortunately,
this query also matches the following invalid Rust code:

```rust
// remove_var accepts only 1 arg!
std::env::remove_var("RUST_BACKTRACE", "1");
```
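
The reason becomes clear once the anchored pattern is spelled
out. The sketch below models it in plain Rust over the kinds
of an `arguments` node's named children. `anchored_args_match`
is a hypothetical function for illustration, not tree-sitter's
matching algorithm:

```rust
/// Hypothetical model of
/// `(arguments . (string_literal)? . (string_literal) .)`.
/// `kinds` lists the kinds of the named children of an `arguments` node.
fn anchored_args_match(kinds: &[&str]) -> bool {
    match kinds {
        // optional literal absent: one literal that is both first and last
        ["string_literal"] => true,
        // both literals present, adjacent, with nothing before or after
        ["string_literal", "string_literal"] => true,
        _ => false,
    }
}
```

The argument pattern knows nothing about which function is
being called. The name is checked separately by `#match?`, so
nothing ties the two-argument form to `set_var`.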

### Notes

All in all, the query DSL does a great job of lowering the
bar for writing language tools. The knowledge gained from
mastering the query DSL can be applied to other languages
that have tree-sitter grammars too. This query
detects `to_json` methods that take no parameters, in Ruby:

```scheme
((method
  name: (identifier) @fn
  !parameters)
 (#eq? @fn "to_json"))
```