From 366df8852f503523cc4f9046d82ba9a99dd51d7f Mon Sep 17 00:00:00 2001 From: Akshay Date: Sun, 12 Feb 2023 12:13:49 +0530 Subject: new art: lapse --- docs/posts/lightweight_linting/index.html | 212 +++++++++++++++++++++--------- 1 file changed, 151 insertions(+), 61 deletions(-) (limited to 'docs/posts/lightweight_linting') diff --git a/docs/posts/lightweight_linting/index.html b/docs/posts/lightweight_linting/index.html index 9bc84f2..b30c719 100644 --- a/docs/posts/lightweight_linting/index.html +++ b/docs/posts/lightweight_linting/index.html @@ -28,12 +28,12 @@ 26/01 — 2022
- 170.62 + 170.63 cm   - 8.5 + 8.6 min
@@ -42,14 +42,23 @@ Lightweight Linting
-

Tree-sitter queries allow you to search for patterns in syntax trees, much like a regex would, in text. Combine that with some Rust glue to write simple, custom linters.

+

Tree-sitter +queries allow you to search for patterns in syntax trees, much like a +regex would, in text. Combine that with some Rust glue to write simple, +custom linters.

Tree-sitter syntax trees

-

Here is a quick crash course on syntax trees generated by tree-sitter. Syntax trees produced by tree-sitter are represented by S-expressions. The generated S-expression for the following Rust code,

-
fn main() {
+

Here is a quick crash course on syntax trees generated by +tree-sitter. Syntax trees produced by tree-sitter are represented by +S-expressions. The generated S-expression for the following Rust +code,

+
fn main() {
     let x = 2;
 }

would be:

-
(source_file
+
(source_file
  (function_item
   name: (identifier)
   parameters: (parameters)
@@ -58,13 +67,19 @@
    (let_declaration 
     pattern: (identifier)
     value: (integer_literal)))))
-

Syntax trees generated by tree-sitter have another cool property: they are lossless syntax trees. Given a lossless syntax tree, you can regenerate the original source code in its entirety. Consider the following addition to our example:

-
 fn main() {
+

Syntax trees generated by tree-sitter have another cool +property: they are lossless syntax trees. Given a lossless +syntax tree, you can regenerate the original source code in its +entirety. Consider the following addition to our example:

+
 fn main() {
 +    // a comment goes here
      let x = 2;
  }
-

The tree-sitter syntax tree preserves the comment, while the typical abstract syntax tree wouldn’t:

-
 (source_file
+

The tree-sitter syntax tree preserves the comment, while the typical +abstract syntax tree wouldn’t:

+
 (source_file
   (function_item
    name: (identifier)
    parameters: (parameters)
@@ -75,25 +90,34 @@
      pattern: (identifier)
      value: (integer_literal)))))

Tree-sitter queries

-

Tree-sitter provides a DSL to match over concrete syntax trees (CSTs). These queries resemble our S-expression syntax trees; here is a query to match all line comments in a Rust CST:

-
(line_comment)
+

Tree-sitter provides a DSL to match over concrete syntax trees (CSTs). These queries resemble +our S-expression syntax trees; here is a query to match all line +comments in a Rust CST:

+
(line_comment)
 
 ; matches the following rust code
 ; // a comment goes here
-

Neat, eh? But don’t take my word for it, give it a go on the tree-sitter playground. Type in a query like so:

-
; the web playground requires you to specify a "capture"
+

Neat, eh? But don’t take my word for it, give it a go on the tree-sitter +playground. Type in a query like so:

+
; the web playground requires you to specify a "capture"
 ; you will notice the capture and the nodes it captured
 ; turn blue
 (line_comment) @capture
-

Here’s another to match let expressions that bind an integer to an identifier:

-
(let_declaration
+

Here’s another to match let expressions that bind an +integer to an identifier:

+
(let_declaration
  pattern: (identifier)
  value: (integer_literal))
  
 ; matches:
 ; let foo = 2;

We can capture nodes into variables:

-
(let_declaration 
+
(let_declaration 
  pattern: (identifier) @my-capture
  value: (integer_literal))
  
@@ -103,7 +127,8 @@
 ; captures:
 ; foo

And apply certain predicates to captures:

-
((let_declaration
+
((let_declaration
   pattern: (identifier) @my-capture
   value: (integer_literal))
  (#eq? @my-capture "foo"))
@@ -113,8 +138,10 @@
 
 ; and not:
 ; let bar = 2;
-

The #match? predicate checks if a capture matches a regex:

-
((let_declaration
+

The #match? predicate checks if a capture matches a +regex:

+
((let_declaration
   pattern: (identifier) @my-capture
   value: (integer_literal))
  (#match? @my-capture "foo|bar"))
@@ -122,8 +149,10 @@
 ; matches both `foo` and `bar`:
 ; let foo = 2;
 ; let bar = 2;
-

Exhibit indifference, as a stoic programmer would, with the wildcard pattern:

-
(let_declaration
+

Exhibit indifference, as a stoic programmer would, with the +wildcard pattern:

+
(let_declaration
  pattern: (identifier)
  value: (_))
  
@@ -131,73 +160,106 @@
 ; let foo = "foo";
 ; let foo = 42;
 ; let foo = bar;
-

The documentation does the tree-sitter query DSL more justice, but we now know enough to write our first lint.

+

The +documentation does the tree-sitter query DSL more justice, but we +now know enough to write our first lint.

Write you a tree-sitter lint

Strings in std::env functions are error prone:

-
std::env::remove_var("RUST_BACKTACE");
+
std::env::remove_var("RUST_BACKTACE");
                             // ^^^^ "TACE" instead of "TRACE"

I prefer this instead:

-
// somewhere in a module that is well spellchecked
+
// somewhere in a module that is well spellchecked
 static BACKTRACE: &str = "RUST_BACKTRACE";
 
 // rest of the codebase
 std::env::remove_var(BACKTRACE);
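
The payoff of this pattern is that a misspelled key becomes a compile error instead of a lookup that silently never matches. A small sketch of the idea; the `env_keys` module and the `backtrace_var_is_set` helper are hypothetical names chosen for illustration:

```rust
/// Hypothetical module that holds every environment key in one place.
mod env_keys {
    pub static BACKTRACE: &str = "RUST_BACKTRACE";
}

/// Returns true when the backtrace variable is present in the environment.
fn backtrace_var_is_set() -> bool {
    // A typo such as `env_keys::BACKTACE` would fail to compile here,
    // whereas the string "RUST_BACKTACE" would compile and never match.
    std::env::var_os(env_keys::BACKTRACE).is_some()
}
```

Every call site refers to the constant, so the spelling only has to be checked once.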
-

Let’s write a lint to find std::env functions that use strings. Put aside the effectiveness of this lint for the moment, and take a stab at writing a tree-sitter query. For reference, a function call like so:

-
remove_var("RUST_BACKTRACE")
+

Let’s write a lint to find std::env functions that use +strings. Put aside the effectiveness of this lint for the moment, and +take a stab at writing a tree-sitter query. For reference, a function +call like so:

+
remove_var("RUST_BACKTRACE")

Produces the following S-expression:

-
(call_expression
+
(call_expression
   function: (identifier)
   arguments: (arguments (string_literal)))

We are definitely looking for a call_expression:

-
(call_expression) @raise
-

Whose function name matches std::env::var or std::env::remove_var at the very least (I know, I know, this isn’t the most optimal regex):

-
((call_expression
+
(call_expression) @raise
+

Whose function name matches std::env::var or +std::env::remove_var at the very least (I know, I know, +this isn’t the most optimal regex):

+
((call_expression
   function: (_) @fn-name) @raise
  (#match? @fn-name "std::env::(var|remove_var)"))

Let’s turn that std:: prefix optional:

-
((call_expression
+
((call_expression
   function: (_) @fn-name) @raise
  (#match? @fn-name "(std::|)env::(var|remove_var)"))

And ensure that arguments is a string:

-
((call_expression
+
((call_expression
   function: (_) @fn-name
   arguments: (arguments (string_literal))) @raise
  (#match? @fn-name "(std::|)env::(var|remove_var)"))
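
A caveat on the regex: as far as I can tell, the Rust bindings evaluate `#match?` as an unanchored regex search, so the pattern above would also accept paths such as `my_env::var` or `env::var_os`. If that matters, the captured function path can be re-checked in the Rust glue. A minimal std-only sketch; `is_env_lookup` is a hypothetical helper and not part of tree-sitter:

```rust
/// Hypothetical helper: returns true only when `path` is exactly
/// `env::var` or `env::remove_var`, with an optional `std::` prefix.
fn is_env_lookup(path: &str) -> bool {
    // Drop the optional `std::` prefix, mirroring `(std::|)` in the query.
    let path = path.strip_prefix("std::").unwrap_or(path);
    // Compare the whole remaining path, unlike an unanchored regex search.
    matches!(path, "env::var" | "env::remove_var")
}
```

With this check, `is_env_lookup("std::env::var")` holds while `is_env_lookup("env::var_os")` does not. Anchoring the regex itself with `^` and `$` would be the query-side alternative.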

Running our linter

-

We could always plug our query into the web playground, but let’s go a step further:

-
cargo new --bin toy-lint
-

Add tree-sitter and tree-sitter-rust to your dependencies:

-
# within Cargo.toml
+

We could always plug our query into the web playground, but let’s go +a step further:

+
cargo new --bin toy-lint
+

Add tree-sitter and tree-sitter-rust to +your dependencies:

+
# within Cargo.toml
 [dependencies]
 tree-sitter = "0.20"
 
 [dependencies.tree-sitter-rust]
 git = "https://github.com/tree-sitter/tree-sitter-rust"
-

Let’s load in some Rust code to work with. As an ode to Gödel (Godel?), why not load in our linter itself:

-
fn main() {
+

Let’s load in some Rust code to work with. As an ode to Gödel +(Godel?), why not load in our linter itself:

+
fn main() {
     let src = include_str!("main.rs");
 }
-

Most tree-sitter APIs require a reference to a Language struct; we will be working with Rust, if you haven’t already guessed:

-
use tree_sitter::Language;
+

Most tree-sitter APIs require a reference to a Language +struct; we will be working with Rust, if you haven’t already guessed:

+
use tree_sitter::Language;
 
 let rust_lang: Language = tree_sitter_rust::language();

Enough scaffolding, let’s parse some Rust:

-
use tree_sitter::Parser;
+
use tree_sitter::Parser;
 
 let mut parser = Parser::new();
 parser.set_language(rust_lang).unwrap();
 
 let parse_tree = parser.parse(&src, None).unwrap();
-

The second argument to Parser::parse may be of interest. Tree-sitter has this cool feature that allows for quick reparsing of existing parse trees if they contain edits. If you do happen to want to reparse a source file, you can pass in the old tree:

-
// if you wish to reparse instead of parse
+

The second argument to Parser::parse may be of interest. +Tree-sitter has this cool feature that allows for quick reparsing of +existing parse trees if they contain edits. If you do happen to want to +reparse a source file, you can pass in the old tree:

+
// if you wish to reparse instead of parse
 old_tree.edit(/* redacted */);
 
 // generate shiny new reparsed tree
 let new_tree = parser.parse(&src, Some(&old_tree)).unwrap();
-

Anyhow (hah!), now that we have a parse tree, we can inspect it:

-
println!("{}", parse_tree.root_node().to_sexp());
+

Anyhow (hah!), now +that we have a parse tree, we can inspect it:

+
println!("{}", parse_tree.root_node().to_sexp());

Or better yet, run a query on it:

-
use tree_sitter::Query;
+
use tree_sitter::Query;
 
 let query = Query::new(
     rust_lang,
@@ -209,8 +271,11 @@
     "#
 )
 .unwrap();
-

A QueryCursor is tree-sitter’s way of maintaining state as we iterate through the matches or captures produced by running a query on the parse tree. Observe:

-
use tree_sitter::QueryCursor;
+

A QueryCursor is tree-sitter’s way of maintaining state +as we iterate through the matches or captures produced by running a +query on the parse tree. Observe:

+
use tree_sitter::QueryCursor;
 
 let mut query_cursor = QueryCursor::new();
 let all_matches = query_cursor.matches(
@@ -218,15 +283,22 @@
     parse_tree.root_node(),
     src.as_bytes(),
 );
-

We begin by passing our query to the cursor, followed by the “root node”, which is another way of saying, “start from the top”, and lastly, the source itself. If you have already taken a look at the C API, you will notice that the last argument, the source (known as the TextProvider), is not required. The Rust bindings seem to require this argument to provide predicate functionality such as #match? and #eq?.

+

We begin by passing our query to the cursor, followed by the “root +node”, which is another way of saying, “start from the top”, and lastly, +the source itself. If you have already taken a look at the C API, you +will notice that the last argument, the source (known as the +TextProvider), is not required. The Rust bindings seem to +require this argument to provide predicate functionality such as +#match? and #eq?.

Do something with the matches:

-
// get the index of the capture named "raise"
+
// get the index of the capture named "raise"
 let raise_idx = query.capture_index_for_name("raise").unwrap();
 
-for each_match in all_matches {
+for each_match in all_matches {
     // iterate over all captures called "raise"
     // ignore captures such as "fn-name"
-    for capture in each_match
+    for capture in each_match
         .captures
         .iter()
         .filter(|c| c.index == raise_idx)
@@ -241,8 +313,10 @@
         );
     }
 }
-

Lastly, add the following line to your source code, to get the linter to catch something:

-
env::remove_var("RUST_BACKTRACE");
+

Lastly, add the following line to your source code, to get the linter +to catch something:

+
env::remove_var("RUST_BACKTRACE");

And cargo run:

λ cargo run
    Compiling toy-lint v0.1.0 (/redacted/path/to/toy-lint)
@@ -251,17 +325,33 @@
 [Line: 40, Col: 4] Offending source code: `env::remove_var("RUST_BACKTRACE")`

Thank you tree-sitter!

Bonus

-

Keen readers will notice that I avoided std::env::set_var. That is because set_var is called with two arguments, a “key” and a “value”, unlike env::var and env::remove_var, so it requires more juggling:

-
((call_expression
+

Keen readers will notice that I avoided +std::env::set_var. That is because set_var is called +with two arguments, a “key” and a “value”, unlike env::var +and env::remove_var, so it requires more +juggling:

+
((call_expression
   function: (_) @fn-name
   arguments: (arguments . (string_literal)? . (string_literal) .)) @raise
  (#match? @fn-name "(std::|)env::(var|remove_var|set_var)"))
-

The interesting part of this query is the humble ., the anchor operator. Anchors help constrain child nodes in certain ways. In this case, it ensures that we match exactly two string_literals that are siblings or exactly one string_literal with no siblings. Unfortunately, this query also matches the following invalid Rust code:

-
// remove_var accepts only 1 arg!
+

The interesting part of this query is the humble ., the +anchor operator. Anchors help constrain child nodes in certain +ways. In this case, it ensures that we match exactly two +string_literals that are siblings or exactly one +string_literal with no siblings. Unfortunately, this query +also matches the following invalid Rust code:

+
// remove_var accepts only 1 arg!
 std::env::remove_var("RUST_BACKTRACE", "1");
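
One way around this is to keep the query permissive and check the argument count in the Rust glue. A minimal sketch, assuming the glue has already extracted the function path and counted the string-literal arguments of a match (both inputs are hypothetical here); the arities follow the standard library signatures of `var`, `remove_var` and `set_var`:

```rust
/// Expected number of arguments for the `std::env` functions the lint covers.
/// Returns `None` for any other path.
fn expected_arity(fn_path: &str) -> Option<usize> {
    // Accept both `std::env::...` and `env::...`, as the query does.
    let path = fn_path.strip_prefix("std::").unwrap_or(fn_path);
    match path {
        "env::var" | "env::remove_var" => Some(1),
        "env::set_var" => Some(2),
        _ => None,
    }
}

/// Hypothetical post-filter: keep a match only when the call passes as many
/// string literals as the function accepts arguments.
fn should_raise(fn_path: &str, string_args: usize) -> bool {
    expected_arity(fn_path) == Some(string_args)
}
```

Here `should_raise("std::env::remove_var", 2)` is false, so the invalid call above is left for the compiler to reject.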

Notes

-

All-in-all, the query DSL does a great job in lowering the bar to writing language tools. The knowledge gained from mastering the query DSL can be applied to other languages that have tree-sitter grammars too. This query detects to_json methods that do not accept additional arguments, in Ruby:

-
((method
+

All-in-all, the query DSL does a great job in lowering the bar to +writing language tools. The knowledge gained from mastering the query +DSL can be applied to other languages that have tree-sitter grammars +too. This query detects to_json methods that do not accept +additional arguments, in Ruby:

+
((method
   name: (identifier) @fn
   !parameters)
  (#is? @fn "to_json"))
-- cgit v1.2.3