Lightweight Linting
Tree-sitter queries allow you to search for patterns in syntax trees, much like a regex would in text. Combine that with some Rust glue to write simple, custom linters.
Tree-sitter syntax trees
Here is a quick crash course on syntax trees generated by tree-sitter. Syntax trees produced by tree-sitter are represented by S-expressions. The generated S-expression for the following Rust code,
fn main() {
let x = 2;
}
would be:
(source_file
(function_item
name: (identifier)
parameters: (parameters)
body:
(block
(let_declaration
pattern: (identifier) value: (integer_literal)))))
Syntax trees generated by tree-sitter have another cool property: they are lossless. Given a lossless syntax tree, you can regenerate the original source code in its entirety. Consider the following addition to our example:
fn main() {
+ // a comment goes here
let x = 2;
}
The tree-sitter syntax tree preserves the comment, while the typical abstract syntax tree wouldn’t:
(source_file
(function_item
name: (identifier)
parameters: (parameters)
body:
(block
+ (line_comment)
(let_declaration
pattern: (identifier) value: (integer_literal)))))
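To make "lossless" a little more concrete, here is a toy sketch in plain Rust that does not depend on tree-sitter. The `Node` enum, `to_source`, `contains_kind` and `example_tree` are all made up for illustration. This toy stores text in its leaves, whereas tree-sitter nodes refer back to byte ranges of the original source.

```rust
/// A toy concrete syntax tree. Hypothetical types, not tree-sitter's API.
#[derive(Debug)]
enum Node {
    /// A leaf holding the exact source text it covers, whitespace included.
    Token(String),
    /// An inner node with a kind (e.g. "block") and its ordered children.
    Inner { kind: &'static str, children: Vec<Node> },
}

/// Rebuild the source by concatenating every leaf from left to right.
fn to_source(node: &Node) -> String {
    match node {
        Node::Token(text) => text.clone(),
        Node::Inner { children, .. } => children.iter().map(to_source).collect(),
    }
}

/// Check whether a node of the given kind appears anywhere in the tree.
fn contains_kind(node: &Node, wanted: &str) -> bool {
    match node {
        Node::Token(_) => false,
        Node::Inner { kind, children } => {
            *kind == wanted || children.iter().any(|c| contains_kind(c, wanted))
        }
    }
}

/// The example function from above, comment included.
fn example_tree() -> Node {
    let tok = |s: &str| Node::Token(s.to_string());
    Node::Inner {
        kind: "function_item",
        children: vec![
            tok("fn "),
            tok("main"),
            tok("() "),
            Node::Inner {
                kind: "block",
                children: vec![
                    tok("{\n"),
                    tok("    "),
                    Node::Inner {
                        kind: "line_comment",
                        children: vec![tok("// a comment goes here")],
                    },
                    tok("\n    "),
                    Node::Inner {
                        kind: "let_declaration",
                        children: vec![tok("let x = 2;")],
                    },
                    tok("\n"),
                    tok("}"),
                ],
            },
        ],
    }
}
```

Calling `to_source(&example_tree())` gives back the original text, comment and whitespace intact, because no leaf was thrown away while building the tree.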
Tree-sitter queries
Tree-sitter provides a DSL to match over CSTs (concrete syntax trees, the lossless trees described above). These queries resemble our S-expression syntax trees. Here is a query to match all line comments in a Rust CST:
(line_comment)
; matches the following rust code
; // a comment goes here
Neat, eh? But don’t take my word for it, give it a go on the tree-sitter playground. Type in a query like so:
; the web playground requires you to specify a "capture"
; you will notice the capture and the nodes it captured
; turn blue
(line_comment) @capture
Here’s another to match let expressions that bind an integer to an identifier:
(let_declaration
pattern: (identifier)
value: (integer_literal))
; matches:
; let foo = 2;
We can capture nodes into variables:
(let_declaration
pattern: (identifier) @my-capture
value: (integer_literal))
; matches:
; let foo = 2;
; captures:
; foo
And apply certain predicates to captures:
((let_declaration
pattern: (identifier) @my-capture
value: (integer_literal))
(#eq? @my-capture "foo"))
; matches:
; let foo = 2;
; and not:
; let bar = 2;
The #match? predicate checks if a capture matches a regex:
((let_declaration
pattern: (identifier) @my-capture
value: (integer_literal))
(#match? @my-capture "foo|bar"))
; matches both `foo` and `bar`:
; let foo = 2;
; let bar = 2;
Exhibit indifference, as a stoic programmer would, with the wildcard pattern:
(let_declaration
pattern: (identifier)
value: (_))
; matches:
; let foo = "foo";
; let foo = 42;
; let foo = bar;
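If it helps to see the matching rules spelled out, here is a toy matcher in plain Rust. It is only a sketch of the idea, not how tree-sitter implements queries, and `Node`, `Pattern`, `node_matches`, `let_declaration` and `let_pattern` are hypothetical names. A pattern with `kind: None` stands in for the `(_)` wildcard.

```rust
/// A toy syntax node: a kind plus named fields. Not tree-sitter's `Node`.
struct Node {
    kind: &'static str,
    fields: Vec<(&'static str, Node)>,
}

/// A toy pattern. `kind: None` plays the role of the `(_)` wildcard.
struct Pattern {
    kind: Option<&'static str>,
    fields: Vec<(&'static str, Pattern)>,
}

/// A node matches when its kind agrees (or the pattern is a wildcard) and
/// every field named by the pattern exists on the node and matches
/// recursively. Fields the pattern does not mention are ignored.
fn node_matches(pattern: &Pattern, node: &Node) -> bool {
    let kind_ok = pattern.kind.map_or(true, |k| k == node.kind);
    kind_ok
        && pattern.fields.iter().all(|(name, sub)| {
            node.fields
                .iter()
                .any(|(field, child)| field == name && node_matches(sub, child))
        })
}

/// `let <identifier> = <value>;` as a toy tree.
fn let_declaration(value_kind: &'static str) -> Node {
    Node {
        kind: "let_declaration",
        fields: vec![
            ("pattern", Node { kind: "identifier", fields: vec![] }),
            ("value", Node { kind: value_kind, fields: vec![] }),
        ],
    }
}

/// The pattern `(let_declaration pattern: (identifier) value: <value>)`.
/// Passing `None` yields `value: (_)`.
fn let_pattern(value_kind: Option<&'static str>) -> Pattern {
    Pattern {
        kind: Some("let_declaration"),
        fields: vec![
            ("pattern", Pattern { kind: Some("identifier"), fields: vec![] }),
            ("value", Pattern { kind: value_kind, fields: vec![] }),
        ],
    }
}
```

With these helpers, `let_pattern(Some("integer_literal"))` accepts `let foo = 2;` but rejects `let foo = "foo";`, while `let_pattern(None)` accepts both.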
The documentation does the tree-sitter query DSL more justice, but we now know enough to write our first lint.
Write you a tree-sitter lint
Strings in std::env functions are error-prone:
std::env::remove_var("RUST_BACKTACE");
// ^^^^ "TACE" instead of "TRACE"
I prefer this instead:
// somewhere in a module that is well spellchecked
static BACKTRACE: &str = "RUST_BACKTRACE";
// rest of the codebase
std::env::remove_var(BACKTRACE);
Let’s write a lint to find std::env functions that use strings. Put aside the effectiveness of this lint for the moment, and take a stab at writing a tree-sitter query. For reference, a function call like so:
remove_var("RUST_BACKTRACE")
Produces the following S-expression:
(call_expression
function: (identifier) arguments: (arguments (string_literal)))
We are definitely looking for a call_expression:
(call_expression) @raise
Whose function name matches std::env::var or std::env::remove_var at the very least (I know, I know, this isn’t the most optimal regex):
((call_expression
function: (_) @fn-name) @raise
(#match? @fn-name "std::env::(var|remove_var)"))
Let’s make that std:: prefix optional:
((call_expression
function: (_) @fn-name) @raise
(#match? @fn-name "(std::|)env::(var|remove_var)"))
And ensure that the only argument is a string literal:
((call_expression
function: (_) @fn-name
arguments: (arguments (string_literal)))
(#match? @fn-name "(std::|)env::(var|remove_var)"))
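To see just how loose that regex is, here is a stdlib-only sketch of what it accepts. It assumes the predicate performs an unanchored search, so the pattern may match anywhere inside the captured text; I believe that is how the Rust bindings evaluate #match?, but treat it as an assumption. Both function names are hypothetical, and the strict variant is one possible tightening, equivalent to writing `^(std::)?env::(var|remove_var)$`.

```rust
/// Stand-in for `(#match? @fn-name "(std::|)env::(var|remove_var)")`.
/// With an unanchored search and an optional prefix, the regex boils down
/// to "the text contains one of these two substrings".
fn loose_env_match(fn_name: &str) -> bool {
    fn_name.contains("env::var") || fn_name.contains("env::remove_var")
}

/// A stricter, anchored alternative: the whole path must be one of the two
/// functions, optionally prefixed with `std::`.
fn strict_env_match(fn_name: &str) -> bool {
    let path = fn_name.strip_prefix("std::").unwrap_or(fn_name);
    matches!(path, "env::var" | "env::remove_var")
}
```

The loose version also flags calls such as `env::var_os("HOME")` or `my_env::var("x")`, which may or may not be what you want from the lint.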
Running our linter
We could always plug our query into the web playground, but let’s go a step further:
cargo new --bin toy-lint
Add tree-sitter and tree-sitter-rust to your dependencies:
# within Cargo.toml
[dependencies]
tree-sitter = "0.20"
[dependencies.tree-sitter-rust]
git = "https://github.com/tree-sitter/tree-sitter-rust"
Let’s load in some Rust code to work with. As an ode to Gödel (Godel?), why not load in our linter itself:
fn main() {
let src = include_str!("main.rs");
}
Most tree-sitter APIs require a reference to a Language struct. We will be working with Rust, if you haven’t already guessed:
use tree_sitter::Language;
let rust_lang: Language = tree_sitter_rust::language();
Enough scaffolding, let’s parse some Rust:
use tree_sitter::Parser;
let mut parser = Parser::new();
parser.set_language(rust_lang).unwrap();
let parse_tree = parser.parse(&src, None).unwrap();
The second argument to Parser::parse may be of interest. Tree-sitter has this cool feature that allows for quick reparsing of existing parse trees if they contain edits. If you do happen to want to reparse a source file, you can pass in the old tree:
// if you wish to reparse instead of parse
old_tree.edit(/* redacted */);
// generate shiny new reparsed tree
// (parse takes the old tree by reference, as an Option<&Tree>)
let new_tree = parser.parse(&src, Some(&old_tree)).unwrap();
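The edit passed to the old tree is described by byte offsets plus row and column positions, before and after the change. The sketch below is not the redacted code above. It is a stdlib-only illustration of how those six values could be computed for a simple replacement. `EditSketch` is a hypothetical struct whose field names mirror tree-sitter's `InputEdit`, with positions as zero-based `(row, column)` pairs and columns counted in bytes.

```rust
/// Hypothetical mirror of the values an edit needs. Not tree-sitter's type.
#[derive(Debug, PartialEq)]
struct EditSketch {
    start_byte: usize,
    old_end_byte: usize,
    new_end_byte: usize,
    start_position: (usize, usize),
    old_end_position: (usize, usize),
    new_end_position: (usize, usize),
}

/// Zero-based (row, column) of a byte offset; the column is in bytes.
fn point_at(src: &str, byte: usize) -> (usize, usize) {
    let before = &src.as_bytes()[..byte];
    let row = before.iter().filter(|&&b| b == b'\n').count();
    let line_start = before
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);
    (row, byte - line_start)
}

/// Replace `old_src[start..old_end]` with `replacement` and describe the
/// change. Returns the new source together with the edit description.
fn describe_edit(
    old_src: &str,
    start: usize,
    old_end: usize,
    replacement: &str,
) -> (String, EditSketch) {
    let mut new_src = String::with_capacity(old_src.len() + replacement.len());
    new_src.push_str(&old_src[..start]);
    new_src.push_str(replacement);
    new_src.push_str(&old_src[old_end..]);

    let new_end = start + replacement.len();
    let edit = EditSketch {
        start_byte: start,
        old_end_byte: old_end,
        new_end_byte: new_end,
        start_position: point_at(old_src, start),
        old_end_position: point_at(old_src, old_end),
        // The new end position is measured in the edited text.
        new_end_position: point_at(&new_src, new_end),
    };
    (new_src, edit)
}
```

Changing `2` to `42` in our running example starts at row 1, column 12 and grows the text by one byte, which is the kind of information the old tree needs in order to be reused.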
Anyhow (hah!), now that we have a parse tree, we can inspect it:
println!("{}", parse_tree.root_node().to_sexp());
Or better yet, run a query on it:
use tree_sitter::Query;
let query = Query::new(
rust_lang,
r#"
((call_expression
function: (_) @fn-name
arguments: (arguments (string_literal))) @raise
(#match? @fn-name "(std::|)env::(var|remove_var)"))
"#
).unwrap();
A QueryCursor is tree-sitter’s way of maintaining state as we iterate through the matches or captures produced by running a query on the parse tree. Observe:
use tree_sitter::QueryCursor;
let mut query_cursor = QueryCursor::new();
let all_matches = query_cursor.matches(
&query,
parse_tree.root_node(),
src.as_bytes(),
);
We begin by passing our query to the cursor, followed by the “root node”, which is another way of saying “start from the top”, and lastly, the source itself. If you have already taken a look at the C API, you will notice that the last argument, the source (known as the TextProvider), is not required there. The Rust bindings seem to require this argument to provide predicate functionality such as #match? and #eq?.
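A rough way to see why text predicates need the source: a captured node only knows where it sits in the input, not what the input says. The sketch below is a stdlib-only illustration with hypothetical names (`CaptureRange`, `capture_text`, `eq_predicate`), not the bindings' actual implementation.

```rust
/// A captured node reduced to its byte range in the source.
/// Hypothetical stand-in for the range carried by a tree-sitter node.
struct CaptureRange {
    start_byte: usize,
    end_byte: usize,
}

/// Look up the text a capture covers. Without `src` there is nothing to
/// compare against.
fn capture_text<'a>(src: &'a [u8], cap: &CaptureRange) -> &'a [u8] {
    &src[cap.start_byte..cap.end_byte]
}

/// Toy version of `(#eq? @capture "literal")`.
fn eq_predicate(src: &[u8], cap: &CaptureRange, literal: &str) -> bool {
    capture_text(src, cap) == literal.as_bytes()
}
```

For `let foo = 2;`, the identifier occupies bytes 4 to 7, so an `#eq?` check against "foo" can only succeed once those bytes are read from the source.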
Do something with the matches:
// get the index of the capture named "raise"
let raise_idx = query.capture_index_for_name("raise").unwrap();
for each_match in all_matches {
// iterate over all captures called "raise"
// ignore captures such as "fn-name"
for capture in each_match
.captures
.iter()
.filter(|c| c.index == raise_idx)
{
let range = capture.node.range();
let text = &src[range.start_byte..range.end_byte];
let line = range.start_point.row;
let col = range.start_point.column;
println!(
"[Line: {}, Col: {}] Offending source code: `{}`",
line, col, text
);
}
}
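One detail worth knowing when reading the output: the rows and columns reported by tree-sitter are zero-based, while most editors count from one. Below is a small stdlib-only sketch of a reporting helper that converts to one-based positions and derives an exit code. The `Finding` struct, the message wording and the file name passed in are all hypothetical choices, not part of the linter above.

```rust
/// One lint hit, using zero-based positions as tree-sitter reports them.
struct Finding {
    row: usize,
    column: usize,
    text: String,
}

/// Format a finding as `file:line:col: message`, with one-based positions
/// so that editors can jump straight to it.
fn format_finding(file: &str, finding: &Finding) -> String {
    format!(
        "{}:{}:{}: string literal passed to env function: `{}`",
        file,
        finding.row + 1,
        finding.column + 1,
        finding.text
    )
}

/// Exit code for the whole run: non-zero if anything was found, which is
/// handy when the linter runs in CI.
fn exit_code(findings: &[Finding]) -> i32 {
    if findings.is_empty() { 0 } else { 1 }
}
```

A capture at row 40, column 4 would be reported as line 41, column 5 by this helper.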
Lastly, add the following line to your source code, to get the linter to catch something:
env::remove_var("RUST_BACKTRACE");
And cargo run:
λ cargo run
Compiling toy-lint v0.1.0 (/redacted/path/to/toy-lint)
Finished dev [unoptimized + debuginfo] target(s) in 0.74s
Running `target/debug/toy-lint`
[Line: 40, Col: 4] Offending source code: `env::remove_var("RUST_BACKTRACE")`
Thank you tree-sitter!
Bonus
Keen readers will notice that I avoided std::env::set_var. That is because set_var is called with two arguments, a “key” and a “value”, unlike env::var and env::remove_var. As a result, it requires more juggling:
((call_expression
function: (_) @fn-name
arguments: (arguments . (string_literal)? . (string_literal) .)) @raise
(#match? @fn-name "(std::|)env::(var|remove_var|set_var)"))
The interesting part of this query is the humble ., the anchor operator. Anchors help constrain child nodes in certain ways. In this case, they ensure that we match either exactly two string_literals that are siblings, or exactly one string_literal with no siblings. Unfortunately, this query also matches the following invalid Rust code:
// remove_var accepts only 1 arg!
std::env::remove_var("RUST_BACKTRACE", "1");
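Here is a stdlib-only sketch of what the anchored arguments pattern accepts, following the description above, along with one possible way to close the gap: check the arity in Rust once the query has matched. Both function names are hypothetical, and the kinds passed in stand for the named children of the arguments node.

```rust
/// Toy version of `(arguments . (string_literal)? . (string_literal) .)`:
/// the named children must be exactly one or exactly two string literals.
fn anchored_pattern_accepts(named_children: &[&str]) -> bool {
    matches!(
        named_children,
        ["string_literal"] | ["string_literal", "string_literal"]
    )
}

/// Follow-up check on the captured function name and the argument count.
/// The query alone cannot tie the number of arguments to the function.
fn arity_is_valid(fn_name: &str, arg_count: usize) -> bool {
    // Look only at the last path segment, e.g. `remove_var`.
    let name = fn_name.rsplit("::").next().unwrap_or(fn_name);
    match name {
        "var" | "remove_var" => arg_count == 1,
        "set_var" => arg_count == 2,
        _ => false,
    }
}
```

The pattern accepts the two string literals in the invalid `remove_var` call, and `arity_is_valid("std::env::remove_var", 2)` is what rejects it afterwards.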
Notes
All in all, the query DSL does a great job of lowering the bar to writing language tools. The knowledge gained from mastering the query DSL can be applied to other languages that have tree-sitter grammars too. This query detects to_json methods that do not accept additional arguments, in Ruby:
((method
name: (identifier) @fn
!parameters)
(#is? @fn "to_json"))
I'm Akshay, programmer and pixel-artist.
I write open-source stuff to pass time. I also design fonts: scientifica, curie.
Send me a mail at nerdy@peppe.rs or a message at np@irc.rizon.net.