aboutsummaryrefslogtreecommitdiff
path: root/posts/lightweight_linting.md
diff options
context:
space:
mode:
Diffstat (limited to 'posts/lightweight_linting.md')
-rw-r--r--posts/lightweight_linting.md427
1 files changed, 427 insertions, 0 deletions
diff --git a/posts/lightweight_linting.md b/posts/lightweight_linting.md
new file mode 100644
index 0000000..2436f30
--- /dev/null
+++ b/posts/lightweight_linting.md
@@ -0,0 +1,427 @@
1[Tree-sitter](https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries)
2queries allow you to search for patterns in syntax trees,
3much like a regex would, in text. Combine that with some Rust
4glue to write simple, custom linters.
5
6### Tree-sitter syntax trees
7
8Here is a quick crash course on syntax trees generated by
9tree-sitter. Syntax trees produced by tree-sitter are
10represented by S-expressions. The generated S-expression for
11the following Rust code,
12
13```rust
14fn main() {
15 let x = 2;
16}
17```
18
19would be:
20
21```scheme
22(source_file
23 (function_item
24 name: (identifier)
25 parameters: (parameters)
26 body:
27 (block
28 (let_declaration
29 pattern: (identifier)
30 value: (integer_literal)))))
31```
32
33Syntax trees generated by tree-sitter have a couple of other
34cool properties: they are _lossless_ syntax trees. Given a
35lossless syntax tree, you can regenerate the original source
36code in its entirety. Consider the following addition to our
37example:
38
39```rust
40 fn main() {
41+ // a comment goes here
42 let x = 2;
43 }
44```
45
46The tree-sitter syntax tree preserves the comment, while the
47typical abstract syntax tree wouldn't:
48
49```scheme
50 (source_file
51 (function_item
52 name: (identifier)
53 parameters: (parameters)
54 body:
55 (block
56+ (line_comment)
57 (let_declaration
58 pattern: (identifier)
59 value: (integer_literal)))))
60```
61
62### Tree-sitter queries
63
64Tree-sitter provides a DSL to match over CSTs. These queries
65resemble our S-expression syntax trees, here is a query to
66match all line comments in a Rust CST:
67
68```scheme
69(line_comment)
70
71; matches the following rust code
72; // a comment goes here
73```
74
75Neat, eh? But don't take my word for it, give it a go on the
76[tree-sitter
77playground](https://tree-sitter.github.io/tree-sitter/playground).
78Type in a query like so:
79
80```scheme
81; the web playground requires you to specify a "capture"
82; you will notice the capture and the nodes it captured
83; turn blue
84(line_comment) @capture
85```
86
87Here's another to match `let` expressions that
88bind an integer to an identifier:
89
90```scheme
91(let_declaration
92 pattern: (identifier)
93 value: (integer_literal))
94
95; matches:
96; let foo = 2;
97```
98
99We can _capture_ nodes into variables:
100
101```scheme
102(let_declaration
103 pattern: (identifier) @my-capture
104 value: (integer_literal))
105
106; matches:
107; let foo = 2;
108
109; captures:
110; foo
111```
112
113And apply certain _predicates_ to captures:
114
115```scheme
116((let_declaration
117 pattern: (identifier) @my-capture
118 value: (integer_literal))
119 (#eq? @my-capture "foo"))
120
121; matches:
122; let foo = 2;
123
124; and not:
125; let bar = 2;
126```
127
128The `#match?` predicate checks if a capture matches a regex:
129
130```scheme
131((let_declaration
132 pattern: (identifier) @my-capture
133 value: (integer_literal))
134 (#match? @my-capture "foo|bar"))
135
136; matches both `foo` and `bar`:
137; let foo = 2;
138; let bar = 2;
139```
140
141Exhibit indifference, as a stoic programmer would, with the
142_wildcard_ pattern:
143
144```scheme
145(let_declaration
146 pattern: (identifier)
147 value: (_))
148
149; matches:
150; let foo = "foo";
151; let foo = 42;
152; let foo = bar;
153```
154
155[The
156documentation](https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries)
157does the tree-sitter query DSL more justice, but we now know
158enough to write our first lint.
159
160### Write you a tree-sitter lint
161
162Strings in `std::env` functions are error prone:
163
164```rust
165std::env::remove_var("RUST_BACKTACE");
166 // ^^^^ "TACE" instead of "TRACE"
167```
168
169I prefer this instead:
170
171```rust
172// somewhere in a module that is well spellchecked
173static BACKTRACE: &str = "RUST_BACKTRACE";
174
175// rest of the codebase
176std::env::remove_var(BACKTRACE);
177```
178
179Let's write a lint to find `std::env` functions that use
180strings. Put aside the effectiveness of this lint for the
181moment, and take a stab at writing a tree-sitter query. For
182reference, a function call like so:
183
184```rust
185remove_var("RUST_BACKTRACE")
186```
187
188Produces the following S-expression:
189
190```scheme
191(call_expression
192 function: (identifier)
193 arguments: (arguments (string_literal)))
194```
195
196We are definitely looking for a `call_expression`:
197
198```scheme
199(call_expression) @raise
200```
201
202Whose function name matches `std::env::var` or
203`std::env::remove_var` at the very least (I know, I know,
204this isn't the most optimal regex):
205
206```scheme
207((call_expression
208 function: (_) @fn-name) @raise
209 (#match? @fn-name "std::env::(var|remove_var)"))
210```
211
212Let's turn that `std::` prefix optional:
213
214```scheme
215((call_expression
216 function: (_) @fn-name) @raise
217 (#match? @fn-name "(std::|)env::(var|remove_var)"))
218```
219
220And ensure that `arguments` is a string:
221
222```scheme
223((call_expression
224 function: (_) @fn-name
225 arguments: (arguments (string_literal)))
226 (#match? @fn-name "(std::|)env::(var|remove_var)"))
227```
228
229### Running our linter
230
231We could always plug our query into the web playground, but
232let's go a step further:
233
234```bash
235cargo new --bin toy-lint
236```
237
238Add `tree-sitter` and `tree-sitter-rust` to your
239dependencies:
240
241```toml
242# within Cargo.toml
243[dependencies]
244tree-sitter = "0.20"
245
246[dependencies.tree-sitter-rust]
247git = "https://github.com/tree-sitter/tree-sitter-rust"
248```
249
250Let's load in some Rust code to work with. As [an ode to
251Gödel](https://en.wikipedia.org/wiki/Self-reference)
252(G`ode`l?), why not load in our linter itself:
253
254```rust
255fn main() {
256 let src = include_str!("main.rs");
257}
258```
259
260Most tree-sitter APIs require a reference to a `Language`
261struct, we will be working with Rust if you haven't
262already guessed:
263
264```rust
265use tree_sitter::Language;
266
267let rust_lang: Language = tree_sitter_rust::language();
268```
269
270Enough scaffolding, let's parse some Rust:
271
272```rust
273use tree_sitter::Parser;
274
275let mut parser = Parser::new();
276parser.set_language(rust_lang).unwrap();
277
278let parse_tree = parser.parse(&src, None).unwrap();
279```
280
281The second argument to `Parser::parse` may be of interest.
282Tree-sitter has this cool feature that allows for quick
283reparsing of existing parse trees if they contain edits. If
284you do happen to want to reparse a source file, you can pass
285in the old tree:
286
287```rust
288// if you wish to reparse instead of parse
289old_tree.edit(/* redacted */);
290
291// generate shiny new reparsed tree
292let new_tree = parser.parse(&src, Some(old_tree)).unwrap()
293```
294
295Anyhow ([hah!](http://github.com/dtolnay/anyhow)), now that we have a parse tree, we can inspect it:
296
297```rust
298println!("{}", parse_tree.root_node().to_sexp());
299```
300
301Or better yet, run a query on it:
302
303```rust
304use tree_sitter::Query;
305
306let query = Query::new(
307 rust_lang,
308 r#"
309 ((call_expression
310 function: (_) @fn-name
311 arguments: (arguments (string_literal))) @raise
312 (#match? @fn-name "(std::|)env::(var|remove_var)"))
313 "#
314)
315.unwrap();
316```
317
318A `QueryCursor` is tree-sitter's way of maintaining state as
319we iterate through the matches or captures produced by
320running a query on the parse tree. Observe:
321
322```rust
323use tree_sitter::QueryCursor;
324
325let mut query_cursor = QueryCursor::new();
326let all_matches = query_cursor.matches(
327 &query,
328 parse_tree.root_node(),
329 src.as_bytes(),
330);
331```
332
333We begin by passing our query to the cursor, followed by the
334"root node", which is another way of saying, "start from the
335top", and lastly, the source itself. If you have already
336taken a look at the C API, you will notice that the last
337argument, the source (known as the `TextProvider`), is not
338required. The Rust bindings seem to require this argument to
339provide predicate functionality such as `#match?` and
340`#eq?`.
341
342Do something with the matches:
343
344```rust
345// get the index of the capture named "raise"
346let raise_idx = query.capture_index_for_name("raise").unwrap();
347
348for each_match in all_matches {
349 // iterate over all captures called "raise"
350 // ignore captures such as "fn-name"
351 for capture in each_match
352 .captures
353 .iter()
354 .filter(|c| c.idx == raise_idx)
355 {
356 let range = capture.node.range();
357 let text = &src[range.start_byte..range.end_byte];
358 let line = range.start_point.row;
359 let col = range.start_point.column;
360 println!(
361 "[Line: {}, Col: {}] Offending source code: `{}`",
362 line, col, text
363 );
364 }
365}
366```
367
368Lastly, add the following line to your source code, to get
369the linter to catch something:
370
371```rust
372env::remove_var("RUST_BACKTRACE");
373```
374
375And `cargo run`:
376
377```shell
378λ cargo run
379 Compiling toy-lint v0.1.0 (/redacted/path/to/toy-lint)
380 Finished dev [unoptimized + debuginfo] target(s) in 0.74s
381 Running `target/debug/toy-lint`
382[Line: 40, Col: 4] Offending source code: `env::remove_var("RUST_BACKTRACE")`
383```
384
385Thank you tree-sitter!
386
387### Bonus
388
389Keen readers will notice that I avoided `std::env::set_var`.
390Because `set_var` is called with two arguments, a "key" and
391a "value", unlike `env::var` and `env::remove_var`. As a
392result, it requires more juggling:
393
394```scheme
395((call_expression
396 function: (_) @fn-name
397 arguments: (arguments . (string_literal)? . (string_literal) .)) @raise
398 (#match? @fn-name "(std::|)env::(var|remove_var|set_var)"))
399```
400
401The interesting part of this query is the humble `.`, the
402_anchor_ operator. Anchors help constrain child nodes in
403certain ways. In this case, it ensures that we match exactly
404two `string_literal`s who are siblings or exactly one
405`string_literal` with no siblings. Unfortunately, this query
406also matches the following invalid Rust code:
407
408```rust
409// remove_var accepts only 1 arg!
410std::env::remove_var("RUST_BACKTRACE", "1");
411```
412
413### Notes
414
415All-in-all, the query DSL does a great job in lowering the
416bar to writing language tools. The knowledge gained from
417mastering the query DSL can be applied to other languages
418that have tree-sitter grammars too. This query
419detects `to_json` methods that do not accept additional
420arguments, in Ruby:
421
422```scheme
423((method
424 name: (identifier) @fn
425 !parameters)
426 (#is? @fn "to_json"))
427```