From fb61d3d063555a9183fab0ec78bb0f4c3cbf0e82 Mon Sep 17 00:00:00 2001 From: Akshay Date: Wed, 26 Jan 2022 18:26:32 +0530 Subject: new post: Lightweight Linting --- posts/lightweight_linting.md | 427 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 427 insertions(+) create mode 100644 posts/lightweight_linting.md (limited to 'posts') diff --git a/posts/lightweight_linting.md b/posts/lightweight_linting.md new file mode 100644 index 0000000..2436f30 --- /dev/null +++ b/posts/lightweight_linting.md @@ -0,0 +1,427 @@ +[Tree-sitter](https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries) +queries allow you to search for patterns in syntax trees, +much like a regex would, in text. Combine that with some Rust +glue to write simple, custom linters. + +### Tree-sitter syntax trees + +Here is a quick crash course on syntax trees generated by +tree-sitter. Syntax trees produced by tree-sitter are +represented by S-expressions. The generated S-expression for +the following Rust code, + +```rust +fn main() { + let x = 2; +} +``` + +would be: + +```scheme +(source_file + (function_item + name: (identifier) + parameters: (parameters) + body: + (block + (let_declaration + pattern: (identifier) + value: (integer_literal))))) +``` + +Syntax trees generated by tree-sitter have a couple of other +cool properties: they are _lossless_ syntax trees. Given a +lossless syntax tree, you can regenerate the original source +code in its entirety. Consider the following addition to our +example: + +```rust + fn main() { ++ // a comment goes here + let x = 2; + } +``` + +The tree-sitter syntax tree preserves the comment, while the +typical abstract syntax tree wouldn't: + +```scheme + (source_file + (function_item + name: (identifier) + parameters: (parameters) + body: + (block ++ (line_comment) + (let_declaration + pattern: (identifier) + value: (integer_literal))))) +``` + +### Tree-sitter queries + +Tree-sitter provides a DSL to match over CSTs. These queries +resemble our S-expression syntax trees, here is a query to +match all line comments in a Rust CST: + +```scheme +(line_comment) + +; matches the following rust code +; // a comment goes here +``` + +Neat, eh? But don't take my word for it, give it a go on the +[tree-sitter +playground](https://tree-sitter.github.io/tree-sitter/playground). +Type in a query like so: + +```scheme +; the web playground requires you to specify a "capture" +; you will notice the capture and the nodes it captured +; turn blue +(line_comment) @capture +``` + +Here's another to match `let` expressions that +bind an integer to an identifier: + +```scheme +(let_declaration + pattern: (identifier) + value: (integer_literal)) + +; matches: +; let foo = 2; +``` + +We can _capture_ nodes into variables: + +```scheme +(let_declaration + pattern: (identifier) @my-capture + value: (integer_literal)) + +; matches: +; let foo = 2; + +; captures: +; foo +``` + +And apply certain _predicates_ to captures: + +```scheme +((let_declaration + pattern: (identifier) @my-capture + value: (integer_literal)) + (#eq? @my-capture "foo")) + +; matches: +; let foo = 2; + +; and not: +; let bar = 2; +``` + +The `#match?` predicate checks if a capture matches a regex: + +```scheme +((let_declaration + pattern: (identifier) @my-capture + value: (integer_literal)) + (#match? @my-capture "foo|bar")) + +; matches both `foo` and `bar`: +; let foo = 2; +; let bar = 2; +``` + +Exhibit indifference, as a stoic programmer would, with the +_wildcard_ pattern: + +```scheme +(let_declaration + pattern: (identifier) + value: (_)) + +; matches: +; let foo = "foo"; +; let foo = 42; +; let foo = bar; +``` + +[The +documentation](https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries) +does the tree-sitter query DSL more justice, but we now know +enough to write our first lint. + +### Write you a tree-sitter lint + +Strings in `std::env` functions are error prone: + +```rust +std::env::remove_var("RUST_BACKTACE"); + // ^^^^ "TACE" instead of "TRACE" +``` + +I prefer this instead: + +```rust +// somewhere in a module that is well spellchecked +static BACKTRACE: &str = "RUST_BACKTRACE"; + +// rest of the codebase +std::env::remove_var(BACKTRACE); +``` + +Let's write a lint to find `std::env` functions that use +strings. Put aside the effectiveness of this lint for the +moment, and take a stab at writing a tree-sitter query. For +reference, a function call like so: + +```rust +remove_var("RUST_BACKTRACE") +``` + +Produces the following S-expression: + +```scheme +(call_expression + function: (identifier) + arguments: (arguments (string_literal))) +``` + +We are definitely looking for a `call_expression`: + +```scheme +(call_expression) @raise +``` + +Whose function name matches `std::env::var` or +`std::env::remove_var` at the very least (I know, I know, +this isn't the most optimal regex): + +```scheme +((call_expression + function: (_) @fn-name) @raise + (#match? @fn-name "std::env::(var|remove_var)")) +``` + +Let's turn that `std::` prefix optional: + +```scheme +((call_expression + function: (_) @fn-name) @raise + (#match? @fn-name "(std::|)env::(var|remove_var)")) +``` + +And ensure that `arguments` is a string: + +```scheme +((call_expression + function: (_) @fn-name + arguments: (arguments (string_literal))) + (#match? @fn-name "(std::|)env::(var|remove_var)")) +``` + +### Running our linter + +We could always plug our query into the web playground, but +let's go a step further: + +```bash +cargo new --bin toy-lint +``` + +Add `tree-sitter` and `tree-sitter-rust` to your +dependencies: + +```toml +# within Cargo.toml +[dependencies] +tree-sitter = "0.20" + +[dependencies.tree-sitter-rust] +git = "https://github.com/tree-sitter/tree-sitter-rust" +``` + +Let's load in some Rust code to work with. As [an ode to +Gödel](https://en.wikipedia.org/wiki/Self-reference) +(G`ode`l?), why not load in our linter itself: + +```rust +fn main() { + let src = include_str!("main.rs"); +} +``` + +Most tree-sitter APIs require a reference to a `Language` +struct, we will be working with Rust if you haven't +already guessed: + +```rust +use tree_sitter::Language; + +let rust_lang: Language = tree_sitter_rust::language(); +``` + +Enough scaffolding, let's parse some Rust: + +```rust +use tree_sitter::Parser; + +let mut parser = Parser::new(); +parser.set_language(rust_lang).unwrap(); + +let parse_tree = parser.parse(&src, None).unwrap(); +``` + +The second argument to `Parser::parse` may be of interest. +Tree-sitter has this cool feature that allows for quick +reparsing of existing parse trees if they contain edits. If +you do happen to want to reparse a source file, you can pass +in the old tree: + +```rust +// if you wish to reparse instead of parse +old_tree.edit(/* redacted */); + +// generate shiny new reparsed tree +let new_tree = parser.parse(&src, Some(old_tree)).unwrap() +``` + +Anyhow ([hah!](http://github.com/dtolnay/anyhow)), now that we have a parse tree, we can inspect it: + +```rust +println!("{}", parse_tree.root_node().to_sexp()); +``` + +Or better yet, run a query on it: + +```rust +use tree_sitter::Query; + +let query = Query::new( + rust_lang, + r#" + ((call_expression + function: (_) @fn-name + arguments: (arguments (string_literal))) @raise + (#match? @fn-name "(std::|)env::(var|remove_var)")) + "# +) +.unwrap(); +``` + +A `QueryCursor` is tree-sitter's way of maintaining state as +we iterate through the matches or captures produced by +running a query on the parse tree. Observe: + +```rust +use tree_sitter::QueryCursor; + +let mut query_cursor = QueryCursor::new(); +let all_matches = query_cursor.matches( + &query, + parse_tree.root_node(), + src.as_bytes(), +); +``` + +We begin by passing our query to the cursor, followed by the +"root node", which is another way of saying, "start from the +top", and lastly, the source itself. If you have already +taken a look at the C API, you will notice that the last +argument, the source (known as the `TextProvider`), is not +required. The Rust bindings seem to require this argument to +provide predicate functionality such as `#match?` and +`#eq?`. + +Do something with the matches: + +```rust +// get the index of the capture named "raise" +let raise_idx = query.capture_index_for_name("raise").unwrap(); + +for each_match in all_matches { + // iterate over all captures called "raise" + // ignore captures such as "fn-name" + for capture in each_match + .captures + .iter() + .filter(|c| c.idx == raise_idx) + { + let range = capture.node.range(); + let text = &src[range.start_byte..range.end_byte]; + let line = range.start_point.row; + let col = range.start_point.column; + println!( + "[Line: {}, Col: {}] Offending source code: `{}`", + line, col, text + ); + } +} +``` + +Lastly, add the following line to your source code, to get +the linter to catch something: + +```rust +env::remove_var("RUST_BACKTRACE"); +``` + +And `cargo run`: + +```shell +λ cargo run + Compiling toy-lint v0.1.0 (/redacted/path/to/toy-lint) + Finished dev [unoptimized + debuginfo] target(s) in 0.74s + Running `target/debug/toy-lint` +[Line: 40, Col: 4] Offending source code: `env::remove_var("RUST_BACKTRACE")` +``` + +Thank you tree-sitter! + +### Bonus + +Keen readers will notice that I avoided `std::env::set_var`. +Because `set_var` is called with two arguments, a "key" and +a "value", unlike `env::var` and `env::remove_var`. As a +result, it requires more juggling: + +```scheme +((call_expression + function: (_) @fn-name + arguments: (arguments . (string_literal)? . (string_literal) .)) @raise + (#match? @fn-name "(std::|)env::(var|remove_var|set_var)")) +``` + +The interesting part of this query is the humble `.`, the +_anchor_ operator. Anchors help constrain child nodes in +certain ways. In this case, it ensures that we match exactly +two `string_literal`s who are siblings or exactly one +`string_literal` with no siblings. Unfortunately, this query +also matches the following invalid Rust code: + +```rust +// remove_var accepts only 1 arg! +std::env::remove_var("RUST_BACKTRACE", "1"); +``` + +### Notes + +All-in-all, the query DSL does a great job in lowering the +bar to writing language tools. The knowledge gained from +mastering the query DSL can be applied to other languages +that have tree-sitter grammars too. This query +detects `to_json` methods that do not accept additional +arguments, in Ruby: + +```scheme +((method + name: (identifier) @fn + !parameters) + (#is? @fn "to_json")) +``` -- cgit v1.2.3