[Tree-sitter](https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries) queries allow you to search for patterns in syntax trees, much like a regex would, in text. Combine that with some Rust glue to write simple, custom linters. ### Tree-sitter syntax trees Here is a quick crash course on syntax trees generated by tree-sitter. Syntax trees produced by tree-sitter are represented by S-expressions. The generated S-expression for the following Rust code, ```rust fn main() { let x = 2; } ``` would be: ```scheme (source_file (function_item name: (identifier) parameters: (parameters) body: (block (let_declaration pattern: (identifier) value: (integer_literal))))) ``` Syntax trees generated by tree-sitter have a couple of other cool properties: they are _lossless_ syntax trees. Given a lossless syntax tree, you can regenerate the original source code in its entirety. Consider the following addition to our example: ```rust fn main() { + // a comment goes here let x = 2; } ``` The tree-sitter syntax tree preserves the comment, while the typical abstract syntax tree wouldn't: ```scheme (source_file (function_item name: (identifier) parameters: (parameters) body: (block + (line_comment) (let_declaration pattern: (identifier) value: (integer_literal))))) ``` ### Tree-sitter queries Tree-sitter provides a DSL to match over CSTs. These queries resemble our S-expression syntax trees, here is a query to match all line comments in a Rust CST: ```scheme (line_comment) ; matches the following rust code ; // a comment goes here ``` Neat, eh? But don't take my word for it, give it a go on the [tree-sitter playground](https://tree-sitter.github.io/tree-sitter/playground). Type in a query like so: ```scheme ; the web playground requires you to specify a "capture" ; you will notice the capture and the nodes it captured ; turn blue (line_comment) @capture ``` Here's another to match `let` expressions that bind an integer to an identifier: ```scheme (let_declaration pattern: (identifier) value: (integer_literal)) ; matches: ; let foo = 2; ``` We can _capture_ nodes into variables: ```scheme (let_declaration pattern: (identifier) @my-capture value: (integer_literal)) ; matches: ; let foo = 2; ; captures: ; foo ``` And apply certain _predicates_ to captures: ```scheme ((let_declaration pattern: (identifier) @my-capture value: (integer_literal)) (#eq? @my-capture "foo")) ; matches: ; let foo = 2; ; and not: ; let bar = 2; ``` The `#match?` predicate checks if a capture matches a regex: ```scheme ((let_declaration pattern: (identifier) @my-capture value: (integer_literal)) (#match? @my-capture "foo|bar")) ; matches both `foo` and `bar`: ; let foo = 2; ; let bar = 2; ``` Exhibit indifference, as a stoic programmer would, with the _wildcard_ pattern: ```scheme (let_declaration pattern: (identifier) value: (_)) ; matches: ; let foo = "foo"; ; let foo = 42; ; let foo = bar; ``` [The documentation](https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries) does the tree-sitter query DSL more justice, but we now know enough to write our first lint. ### Write you a tree-sitter lint Strings in `std::env` functions are error prone: ```rust std::env::remove_var("RUST_BACKTACE"); // ^^^^ "TACE" instead of "TRACE" ``` I prefer this instead: ```rust // somewhere in a module that is well spellchecked static BACKTRACE: &str = "RUST_BACKTRACE"; // rest of the codebase std::env::remove_var(BACKTRACE); ``` Let's write a lint to find `std::env` functions that use strings. Put aside the effectiveness of this lint for the moment, and take a stab at writing a tree-sitter query. For reference, a function call like so: ```rust remove_var("RUST_BACKTRACE") ``` Produces the following S-expression: ```scheme (call_expression function: (identifier) arguments: (arguments (string_literal))) ``` We are definitely looking for a `call_expression`: ```scheme (call_expression) @raise ``` Whose function name matches `std::env::var` or `std::env::remove_var` at the very least (I know, I know, this isn't the most optimal regex): ```scheme ((call_expression function: (_) @fn-name) @raise (#match? @fn-name "std::env::(var|remove_var)")) ``` Let's turn that `std::` prefix optional: ```scheme ((call_expression function: (_) @fn-name) @raise (#match? @fn-name "(std::|)env::(var|remove_var)")) ``` And ensure that `arguments` is a string: ```scheme ((call_expression function: (_) @fn-name arguments: (arguments (string_literal))) (#match? @fn-name "(std::|)env::(var|remove_var)")) ``` ### Running our linter We could always plug our query into the web playground, but let's go a step further: ```bash cargo new --bin toy-lint ``` Add `tree-sitter` and `tree-sitter-rust` to your dependencies: ```toml # within Cargo.toml [dependencies] tree-sitter = "0.20" [dependencies.tree-sitter-rust] git = "https://github.com/tree-sitter/tree-sitter-rust" ``` Let's load in some Rust code to work with. As [an ode to Gödel](https://en.wikipedia.org/wiki/Self-reference) (G`ode`l?), why not load in our linter itself: ```rust fn main() { let src = include_str!("main.rs"); } ``` Most tree-sitter APIs require a reference to a `Language` struct, we will be working with Rust if you haven't already guessed: ```rust use tree_sitter::Language; let rust_lang: Language = tree_sitter_rust::language(); ``` Enough scaffolding, let's parse some Rust: ```rust use tree_sitter::Parser; let mut parser = Parser::new(); parser.set_language(rust_lang).unwrap(); let parse_tree = parser.parse(&src, None).unwrap(); ``` The second argument to `Parser::parse` may be of interest. Tree-sitter has this cool feature that allows for quick reparsing of existing parse trees if they contain edits. If you do happen to want to reparse a source file, you can pass in the old tree: ```rust // if you wish to reparse instead of parse old_tree.edit(/* redacted */); // generate shiny new reparsed tree let new_tree = parser.parse(&src, Some(old_tree)).unwrap() ``` Anyhow ([hah!](http://github.com/dtolnay/anyhow)), now that we have a parse tree, we can inspect it: ```rust println!("{}", parse_tree.root_node().to_sexp()); ``` Or better yet, run a query on it: ```rust use tree_sitter::Query; let query = Query::new( rust_lang, r#" ((call_expression function: (_) @fn-name arguments: (arguments (string_literal))) @raise (#match? @fn-name "(std::|)env::(var|remove_var)")) "# ) .unwrap(); ``` A `QueryCursor` is tree-sitter's way of maintaining state as we iterate through the matches or captures produced by running a query on the parse tree. Observe: ```rust use tree_sitter::QueryCursor; let mut query_cursor = QueryCursor::new(); let all_matches = query_cursor.matches( &query, parse_tree.root_node(), src.as_bytes(), ); ``` We begin by passing our query to the cursor, followed by the "root node", which is another way of saying, "start from the top", and lastly, the source itself. If you have already taken a look at the C API, you will notice that the last argument, the source (known as the `TextProvider`), is not required. The Rust bindings seem to require this argument to provide predicate functionality such as `#match?` and `#eq?`. Do something with the matches: ```rust // get the index of the capture named "raise" let raise_idx = query.capture_index_for_name("raise").unwrap(); for each_match in all_matches { // iterate over all captures called "raise" // ignore captures such as "fn-name" for capture in each_match .captures .iter() .filter(|c| c.idx == raise_idx) { let range = capture.node.range(); let text = &src[range.start_byte..range.end_byte]; let line = range.start_point.row; let col = range.start_point.column; println!( "[Line: {}, Col: {}] Offending source code: `{}`", line, col, text ); } } ``` Lastly, add the following line to your source code, to get the linter to catch something: ```rust env::remove_var("RUST_BACKTRACE"); ``` And `cargo run`: ```shell λ cargo run Compiling toy-lint v0.1.0 (/redacted/path/to/toy-lint) Finished dev [unoptimized + debuginfo] target(s) in 0.74s Running `target/debug/toy-lint` [Line: 40, Col: 4] Offending source code: `env::remove_var("RUST_BACKTRACE")` ``` Thank you tree-sitter! ### Bonus Keen readers will notice that I avoided `std::env::set_var`. Because `set_var` is called with two arguments, a "key" and a "value", unlike `env::var` and `env::remove_var`. As a result, it requires more juggling: ```scheme ((call_expression function: (_) @fn-name arguments: (arguments . (string_literal)? . (string_literal) .)) @raise (#match? @fn-name "(std::|)env::(var|remove_var|set_var)")) ``` The interesting part of this query is the humble `.`, the _anchor_ operator. Anchors help constrain child nodes in certain ways. In this case, it ensures that we match exactly two `string_literal`s who are siblings or exactly one `string_literal` with no siblings. Unfortunately, this query also matches the following invalid Rust code: ```rust // remove_var accepts only 1 arg! std::env::remove_var("RUST_BACKTRACE", "1"); ``` ### Notes All-in-all, the query DSL does a great job in lowering the bar to writing language tools. The knowledge gained from mastering the query DSL can be applied to other languages that have tree-sitter grammars too. This query detects `to_json` methods that do not accept additional arguments, in Ruby: ```scheme ((method name: (identifier) @fn !parameters) (#is? @fn "to_json")) ```