diff options
Diffstat (limited to 'posts')
-rw-r--r-- | posts/lightweight_linting.md | 427 |
1 files changed, 427 insertions, 0 deletions
diff --git a/posts/lightweight_linting.md b/posts/lightweight_linting.md new file mode 100644 index 0000000..2436f30 --- /dev/null +++ b/posts/lightweight_linting.md | |||
@@ -0,0 +1,427 @@ | |||
1 | [Tree-sitter](https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries) | ||
2 | queries allow you to search for patterns in syntax trees, | ||
3 | much like a regex would, in text. Combine that with some Rust | ||
4 | glue to write simple, custom linters. | ||
5 | |||
6 | ### Tree-sitter syntax trees | ||
7 | |||
8 | Here is a quick crash course on syntax trees generated by | ||
9 | tree-sitter. Syntax trees produced by tree-sitter are | ||
10 | represented by S-expressions. The generated S-expression for | ||
11 | the following Rust code, | ||
12 | |||
13 | ```rust | ||
14 | fn main() { | ||
15 | let x = 2; | ||
16 | } | ||
17 | ``` | ||
18 | |||
19 | would be: | ||
20 | |||
21 | ```scheme | ||
22 | (source_file | ||
23 | (function_item | ||
24 | name: (identifier) | ||
25 | parameters: (parameters) | ||
26 | body: | ||
27 | (block | ||
28 | (let_declaration | ||
29 | pattern: (identifier) | ||
30 | value: (integer_literal))))) | ||
31 | ``` | ||
32 | |||
33 | Syntax trees generated by tree-sitter have a couple of other | ||
34 | cool properties: they are _lossless_ syntax trees. Given a | ||
35 | lossless syntax tree, you can regenerate the original source | ||
36 | code in its entirety. Consider the following addition to our | ||
37 | example: | ||
38 | |||
39 | ```rust | ||
40 | fn main() { | ||
41 | + // a comment goes here | ||
42 | let x = 2; | ||
43 | } | ||
44 | ``` | ||
45 | |||
46 | The tree-sitter syntax tree preserves the comment, while the | ||
47 | typical abstract syntax tree wouldn't: | ||
48 | |||
49 | ```scheme | ||
50 | (source_file | ||
51 | (function_item | ||
52 | name: (identifier) | ||
53 | parameters: (parameters) | ||
54 | body: | ||
55 | (block | ||
56 | + (line_comment) | ||
57 | (let_declaration | ||
58 | pattern: (identifier) | ||
59 | value: (integer_literal))))) | ||
60 | ``` | ||
61 | |||
62 | ### Tree-sitter queries | ||
63 | |||
64 | Tree-sitter provides a DSL to match over CSTs. These queries | ||
65 | resemble our S-expression syntax trees, here is a query to | ||
66 | match all line comments in a Rust CST: | ||
67 | |||
68 | ```scheme | ||
69 | (line_comment) | ||
70 | |||
71 | ; matches the following rust code | ||
72 | ; // a comment goes here | ||
73 | ``` | ||
74 | |||
75 | Neat, eh? But don't take my word for it, give it a go on the | ||
76 | [tree-sitter | ||
77 | playground](https://tree-sitter.github.io/tree-sitter/playground). | ||
78 | Type in a query like so: | ||
79 | |||
80 | ```scheme | ||
81 | ; the web playground requires you to specify a "capture" | ||
82 | ; you will notice the capture and the nodes it captured | ||
83 | ; turn blue | ||
84 | (line_comment) @capture | ||
85 | ``` | ||
86 | |||
87 | Here's another to match `let` expressions that | ||
88 | bind an integer to an identifier: | ||
89 | |||
90 | ```scheme | ||
91 | (let_declaration | ||
92 | pattern: (identifier) | ||
93 | value: (integer_literal)) | ||
94 | |||
95 | ; matches: | ||
96 | ; let foo = 2; | ||
97 | ``` | ||
98 | |||
99 | We can _capture_ nodes into variables: | ||
100 | |||
101 | ```scheme | ||
102 | (let_declaration | ||
103 | pattern: (identifier) @my-capture | ||
104 | value: (integer_literal)) | ||
105 | |||
106 | ; matches: | ||
107 | ; let foo = 2; | ||
108 | |||
109 | ; captures: | ||
110 | ; foo | ||
111 | ``` | ||
112 | |||
113 | And apply certain _predicates_ to captures: | ||
114 | |||
115 | ```scheme | ||
116 | ((let_declaration | ||
117 | pattern: (identifier) @my-capture | ||
118 | value: (integer_literal)) | ||
119 | (#eq? @my-capture "foo")) | ||
120 | |||
121 | ; matches: | ||
122 | ; let foo = 2; | ||
123 | |||
124 | ; and not: | ||
125 | ; let bar = 2; | ||
126 | ``` | ||
127 | |||
128 | The `#match?` predicate checks if a capture matches a regex: | ||
129 | |||
130 | ```scheme | ||
131 | ((let_declaration | ||
132 | pattern: (identifier) @my-capture | ||
133 | value: (integer_literal)) | ||
134 | (#match? @my-capture "foo|bar")) | ||
135 | |||
136 | ; matches both `foo` and `bar`: | ||
137 | ; let foo = 2; | ||
138 | ; let bar = 2; | ||
139 | ``` | ||
140 | |||
141 | Exhibit indifference, as a stoic programmer would, with the | ||
142 | _wildcard_ pattern: | ||
143 | |||
144 | ```scheme | ||
145 | (let_declaration | ||
146 | pattern: (identifier) | ||
147 | value: (_)) | ||
148 | |||
149 | ; matches: | ||
150 | ; let foo = "foo"; | ||
151 | ; let foo = 42; | ||
152 | ; let foo = bar; | ||
153 | ``` | ||
154 | |||
155 | [The | ||
156 | documentation](https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries) | ||
157 | does the tree-sitter query DSL more justice, but we now know | ||
158 | enough to write our first lint. | ||
159 | |||
160 | ### Write you a tree-sitter lint | ||
161 | |||
162 | Strings in `std::env` functions are error prone: | ||
163 | |||
164 | ```rust | ||
165 | std::env::remove_var("RUST_BACKTACE"); | ||
166 | // ^^^^ "TACE" instead of "TRACE" | ||
167 | ``` | ||
168 | |||
169 | I prefer this instead: | ||
170 | |||
171 | ```rust | ||
172 | // somewhere in a module that is well spellchecked | ||
173 | static BACKTRACE: &str = "RUST_BACKTRACE"; | ||
174 | |||
175 | // rest of the codebase | ||
176 | std::env::remove_var(BACKTRACE); | ||
177 | ``` | ||
178 | |||
179 | Let's write a lint to find `std::env` functions that use | ||
180 | strings. Put aside the effectiveness of this lint for the | ||
181 | moment, and take a stab at writing a tree-sitter query. For | ||
182 | reference, a function call like so: | ||
183 | |||
184 | ```rust | ||
185 | remove_var("RUST_BACKTRACE") | ||
186 | ``` | ||
187 | |||
188 | Produces the following S-expression: | ||
189 | |||
190 | ```scheme | ||
191 | (call_expression | ||
192 | function: (identifier) | ||
193 | arguments: (arguments (string_literal))) | ||
194 | ``` | ||
195 | |||
196 | We are definitely looking for a `call_expression`: | ||
197 | |||
198 | ```scheme | ||
199 | (call_expression) @raise | ||
200 | ``` | ||
201 | |||
202 | Whose function name matches `std::env::var` or | ||
203 | `std::env::remove_var` at the very least (I know, I know, | ||
204 | this isn't the most optimal regex): | ||
205 | |||
206 | ```scheme | ||
207 | ((call_expression | ||
208 | function: (_) @fn-name) @raise | ||
209 | (#match? @fn-name "std::env::(var|remove_var)")) | ||
210 | ``` | ||
211 | |||
212 | Let's turn that `std::` prefix optional: | ||
213 | |||
214 | ```scheme | ||
215 | ((call_expression | ||
216 | function: (_) @fn-name) @raise | ||
217 | (#match? @fn-name "(std::|)env::(var|remove_var)")) | ||
218 | ``` | ||
219 | |||
220 | And ensure that `arguments` is a string: | ||
221 | |||
222 | ```scheme | ||
223 | ((call_expression | ||
224 | function: (_) @fn-name | ||
225 | arguments: (arguments (string_literal))) | ||
226 | (#match? @fn-name "(std::|)env::(var|remove_var)")) | ||
227 | ``` | ||
228 | |||
229 | ### Running our linter | ||
230 | |||
231 | We could always plug our query into the web playground, but | ||
232 | let's go a step further: | ||
233 | |||
234 | ```bash | ||
235 | cargo new --bin toy-lint | ||
236 | ``` | ||
237 | |||
238 | Add `tree-sitter` and `tree-sitter-rust` to your | ||
239 | dependencies: | ||
240 | |||
241 | ```toml | ||
242 | # within Cargo.toml | ||
243 | [dependencies] | ||
244 | tree-sitter = "0.20" | ||
245 | |||
246 | [dependencies.tree-sitter-rust] | ||
247 | git = "https://github.com/tree-sitter/tree-sitter-rust" | ||
248 | ``` | ||
249 | |||
250 | Let's load in some Rust code to work with. As [an ode to | ||
251 | Gödel](https://en.wikipedia.org/wiki/Self-reference) | ||
252 | (G`ode`l?), why not load in our linter itself: | ||
253 | |||
254 | ```rust | ||
255 | fn main() { | ||
256 | let src = include_str!("main.rs"); | ||
257 | } | ||
258 | ``` | ||
259 | |||
260 | Most tree-sitter APIs require a reference to a `Language` | ||
261 | struct, we will be working with Rust if you haven't | ||
262 | already guessed: | ||
263 | |||
264 | ```rust | ||
265 | use tree_sitter::Language; | ||
266 | |||
267 | let rust_lang: Language = tree_sitter_rust::language(); | ||
268 | ``` | ||
269 | |||
270 | Enough scaffolding, let's parse some Rust: | ||
271 | |||
272 | ```rust | ||
273 | use tree_sitter::Parser; | ||
274 | |||
275 | let mut parser = Parser::new(); | ||
276 | parser.set_language(rust_lang).unwrap(); | ||
277 | |||
278 | let parse_tree = parser.parse(&src, None).unwrap(); | ||
279 | ``` | ||
280 | |||
281 | The second argument to `Parser::parse` may be of interest. | ||
282 | Tree-sitter has this cool feature that allows for quick | ||
283 | reparsing of existing parse trees if they contain edits. If | ||
284 | you do happen to want to reparse a source file, you can pass | ||
285 | in the old tree: | ||
286 | |||
287 | ```rust | ||
288 | // if you wish to reparse instead of parse | ||
289 | old_tree.edit(/* redacted */); | ||
290 | |||
291 | // generate shiny new reparsed tree | ||
292 | let new_tree = parser.parse(&src, Some(old_tree)).unwrap() | ||
293 | ``` | ||
294 | |||
295 | Anyhow ([hah!](http://github.com/dtolnay/anyhow)), now that we have a parse tree, we can inspect it: | ||
296 | |||
297 | ```rust | ||
298 | println!("{}", parse_tree.root_node().to_sexp()); | ||
299 | ``` | ||
300 | |||
301 | Or better yet, run a query on it: | ||
302 | |||
303 | ```rust | ||
304 | use tree_sitter::Query; | ||
305 | |||
306 | let query = Query::new( | ||
307 | rust_lang, | ||
308 | r#" | ||
309 | ((call_expression | ||
310 | function: (_) @fn-name | ||
311 | arguments: (arguments (string_literal))) @raise | ||
312 | (#match? @fn-name "(std::|)env::(var|remove_var)")) | ||
313 | "# | ||
314 | ) | ||
315 | .unwrap(); | ||
316 | ``` | ||
317 | |||
318 | A `QueryCursor` is tree-sitter's way of maintaining state as | ||
319 | we iterate through the matches or captures produced by | ||
320 | running a query on the parse tree. Observe: | ||
321 | |||
322 | ```rust | ||
323 | use tree_sitter::QueryCursor; | ||
324 | |||
325 | let mut query_cursor = QueryCursor::new(); | ||
326 | let all_matches = query_cursor.matches( | ||
327 | &query, | ||
328 | parse_tree.root_node(), | ||
329 | src.as_bytes(), | ||
330 | ); | ||
331 | ``` | ||
332 | |||
333 | We begin by passing our query to the cursor, followed by the | ||
334 | "root node", which is another way of saying, "start from the | ||
335 | top", and lastly, the source itself. If you have already | ||
336 | taken a look at the C API, you will notice that the last | ||
337 | argument, the source (known as the `TextProvider`), is not | ||
338 | required. The Rust bindings seem to require this argument to | ||
339 | provide predicate functionality such as `#match?` and | ||
340 | `#eq?`. | ||
341 | |||
342 | Do something with the matches: | ||
343 | |||
344 | ```rust | ||
345 | // get the index of the capture named "raise" | ||
346 | let raise_idx = query.capture_index_for_name("raise").unwrap(); | ||
347 | |||
348 | for each_match in all_matches { | ||
349 | // iterate over all captures called "raise" | ||
350 | // ignore captures such as "fn-name" | ||
351 | for capture in each_match | ||
352 | .captures | ||
353 | .iter() | ||
354 | .filter(|c| c.idx == raise_idx) | ||
355 | { | ||
356 | let range = capture.node.range(); | ||
357 | let text = &src[range.start_byte..range.end_byte]; | ||
358 | let line = range.start_point.row; | ||
359 | let col = range.start_point.column; | ||
360 | println!( | ||
361 | "[Line: {}, Col: {}] Offending source code: `{}`", | ||
362 | line, col, text | ||
363 | ); | ||
364 | } | ||
365 | } | ||
366 | ``` | ||
367 | |||
368 | Lastly, add the following line to your source code, to get | ||
369 | the linter to catch something: | ||
370 | |||
371 | ```rust | ||
372 | env::remove_var("RUST_BACKTRACE"); | ||
373 | ``` | ||
374 | |||
375 | And `cargo run`: | ||
376 | |||
377 | ```shell | ||
378 | λ cargo run | ||
379 | Compiling toy-lint v0.1.0 (/redacted/path/to/toy-lint) | ||
380 | Finished dev [unoptimized + debuginfo] target(s) in 0.74s | ||
381 | Running `target/debug/toy-lint` | ||
382 | [Line: 40, Col: 4] Offending source code: `env::remove_var("RUST_BACKTRACE")` | ||
383 | ``` | ||
384 | |||
385 | Thank you tree-sitter! | ||
386 | |||
387 | ### Bonus | ||
388 | |||
389 | Keen readers will notice that I avoided `std::env::set_var`. | ||
390 | Because `set_var` is called with two arguments, a "key" and | ||
391 | a "value", unlike `env::var` and `env::remove_var`. As a | ||
392 | result, it requires more juggling: | ||
393 | |||
394 | ```scheme | ||
395 | ((call_expression | ||
396 | function: (_) @fn-name | ||
397 | arguments: (arguments . (string_literal)? . (string_literal) .)) @raise | ||
398 | (#match? @fn-name "(std::|)env::(var|remove_var|set_var)")) | ||
399 | ``` | ||
400 | |||
401 | The interesting part of this query is the humble `.`, the | ||
402 | _anchor_ operator. Anchors help constrain child nodes in | ||
403 | certain ways. In this case, it ensures that we match exactly | ||
404 | two `string_literal`s who are siblings or exactly one | ||
405 | `string_literal` with no siblings. Unfortunately, this query | ||
406 | also matches the following invalid Rust code: | ||
407 | |||
408 | ```rust | ||
409 | // remove_var accepts only 1 arg! | ||
410 | std::env::remove_var("RUST_BACKTRACE", "1"); | ||
411 | ``` | ||
412 | |||
413 | ### Notes | ||
414 | |||
415 | All-in-all, the query DSL does a great job in lowering the | ||
416 | bar to writing language tools. The knowledge gained from | ||
417 | mastering the query DSL can be applied to other languages | ||
418 | that have tree-sitter grammars too. This query | ||
419 | detects `to_json` methods that do not accept additional | ||
420 | arguments, in Ruby: | ||
421 | |||
422 | ```scheme | ||
423 | ((method | ||
424 | name: (identifier) @fn | ||
425 | !parameters) | ||
426 | (#is? @fn "to_json")) | ||
427 | ``` | ||