Enhanced document symbol menu for Zig

Idea

If you want to take the code, here is the link.

It was a cold winter evening in Dublin. I was making a clone of Space Invaders in Zig and I came across another memory bug. The cursor in my Neovim stood steel. I realized that I created a self referential structure in the init method of the Game structure. I opened the fuzzy search and quickly typed "init", and found 4 symbols with the name I was looking for. Apparently, I could differ them by the preview, but I don't like using a preview. I realized that it would be much easier to find a document symbol by its path. Namely, it'd be easier to type two words "Game" and "init" to get the right method. When we use fuzzy search to find a file in a project, we see the entire file path instead of just a name. If we found multiple files with the same name, we can type the name of one of the folders from the path of the file which we need. Why wouldn't we use the same approach for symbols in a document?

zigSymbolsBefore.png
Figure 1: The list of the document symbols obtained with the Language protocol client.
zigSymbolsAfter.png
Figure 2: My list of the document symbols based on Treesitter.

Implementation

The fuzzy search in Neovim is implemented by a couple of plugins. Personally, I prefer Telescope, mainly it's easy to extend it, thank the rich documentation. Also, Neovim supports Tree-sitter to get the Abstract Syntax Tree (AST) to build the symbol paths, the plugin Nvim Treesitter does the work. Therefore, the high level plan is this:

  • Get the AST of the current document with Tree-sitter.
  • Traverse the AST and take complex language structures. In Zig they are: struct, union, enum and a test declaration.
  • Collect the items of the structures.
  • Build the paths to every item in a document and to every ascendant of the item.
  • Create the fuzzy search menu for the paths with Telescope.

Telescope API

The Telescope source code is well documented. You type :help tele and press <TAB> and see all components of the plugin in the completion list. Using the information you can create a dead simple bookmark menu with 35 lines of code:

local conf = require "telescope.config".values
local make_entry = require "telescope.make_entry"
local pickers = require "telescope.pickers"
local finders = require "telescope.finders"

local function create_bookmark_menu()
  local current_filename = vim.api.nvim_buf_get_name(0)
  local first_line = 1
  local last_line = vim.api.nvim_buf_line_count(0)
  local entries = {
    {
      lnum = first_line,
      col = 1,
      text = "begin",
      filename = current_filename,
    },
    {
      lnum = last_line,
      col = 1,
      text = "end",
      filename = current_filename,
    }
  }

  pickers.new({}, {
    prompt_title = "dummy menu",
    finder = finders.new_table({
      results = entries,
      entry_maker = make_entry.gen_from_buffer_lines({}),
      sorter = conf.generic_sorter({}),
      previewer = conf.grep_previewer({}),
    }),
  }):find()
end

Out of the example you can see two things:

  1. Telescope provides facilities to create a menu with 9 lines only.
  2. The most of the code is creating entries for the menu. Of cause, static entries aren't that useful. Creating the entries for the menu is the main part of the task.

Tree Sitter API

The data format we get out of Tree-sitter shapes the entries for Telescope. The data we can get has following pieces:

  1. The symbols coordinates. Namely, the line number and the column number.
  2. The name of the symbol.
  3. The kind of the symbol. This helps us to distinguish fields from variables and from functions.

This structure looks almost like that from the example from Telescope API. The symbols coordinates are pairs of lnum and col. The name of the symbol can be put into the text of a menu entry. There is no field for the kind of a symbol in the example. Therefore, I we have to introduce it. Also, we don't need the field filename, because we collect symbols from one file only. Given that, the structure looks like this:

entry = {
  lnum = 12,
  col = 1,
  text = "init",
  kind = "function"
}

However, the purpose of a menu with entries of this structure is missed, because text has only the symbol name. The purpose is having the entire path to the symbol. Also, Tree-sitter calls a function a "functiondeclaration". Therefore, the structure will be like this:

entry = {
  lnum = 12,
  col = 1,
  text = "App::init",
  kind = "function_declaration"
}

Data mining

To get the entries we traverse the AST of a file. The typical way is the visitor pattern. The function of the visitor requires the following data:

  1. AST node.
  2. AST node type. I'll explain the reason why it's passed separately.
  3. The current path which has been traversed by the visitor.
  4. The resulting table which will be passed to Telescope.
  5. The number of the buffer to get the text of the symbol with Tree-Sitter API.

That's what they are look like in the function declaration

local function visit_node(node, expected_node_type, tree_path_array, result, bufnr)

Let's define what the function does.

  1. Get the node name.
  2. Add the node name into tree_path_array.
  3. Get the node type name.
  4. Get the node coordinates.
  5. Visit the node children with the updated tree_path_array.
  6. Pop the node name from tree_path_array.

Here is the code of the steps:

visit_node = function(node, expected_node_type, tree_path_array, result, bufnr)
  local node_name = get_name(node, expected_node_type, bufnr)
  if node_name == nil or node_name:len() == 0 then
    return
  end

  local node_type = node:type()
  table.insert(tree_path_array, node_name)
  local line, col = node:start()
  local node_path = table.concat(tree_path_array, path_delim)
  table.insert(result, {
    lnum = line + 1,
    col = col + 1,
    kind = node_type,
    text = node_path,
  })

  visit_children(node, expected_node_type, tree_path_array, result, bufnr)
  table.remove(tree_path_array, table.maxn(tree_path_array))
end

Out of the code, you see that result gets the data exactly in the format we defined. Also, you can see that we have node_type and expected_node_type. Those two fields seem carrying the same information. I'll disclose the purpose of expected_node_type in the following sections.

The last piece in the algorithm is visiting the children. It has 2 steps:

  1. We have to identify if the current node has a body, and if it does, we visit the nodes in the body. The structure of every kind of a node is different. Therefore, the algorithm of fetching the body depends on the kind of the node.
  2. We don't want to visit every child in the body, we need only function declarations, fields and variable declarations.

In the code, it looks this way:

local function visit_children(parent_node, expected_parent_node_type, tree_path_array, result, bufnr)
  local body_node = (function()
    if expected_parent_node_type == expected_node_type_variable_declaration then
      return find_complex_child(parent_node:iter_children())
    elseif expected_parent_node_type == expected_node_type_test_declaration then
      return find_child(parent_node, expected_node_type_block)
    elseif expected_parent_node_type == expected_node_type_container_field then
      for _, body in pairs(parent_node:field("type")) do
        return filter_complex_child(body)
      end
    end
    return nil
  end)()

  if body_node ~= nil then
    for child_node, _ in body_node:iter_children() do
      local child_expected_node_type = get_expected_node_type(child_node)
      if (
          child_expected_node_type == expected_node_type_container_field or
          child_expected_node_type == expected_node_type_function_declaration or
          child_expected_node_type == expected_node_type_variable_declaration
          ) then
        visit_node(child_node, child_expected_node_type, tree_path_array, result, bufnr)
      end
    end
  end
end

The function find_complex_child finds a child which passes the function filter_complex_child, which determines if the child is one of the following types:

  • union
  • struct
  • enum

Then we traverse the child nodes of the body according the rules from (2).

Node type

The type of the data in node:type() is a string. The type of the data in expected_node_type is an integer. We use expected_node_type to get the name of the node. Consider the following Zig snippet:

const ObjectHandle = struct {
    index: u16,
};
fn app() void {}
test "Happy path" { }

We have 4 types of nodes here:

  1. Variable declaration
  2. Container field
  3. Function declaration
  4. Test declaration

Each one of them requires different approach to get its name.

The AST of the variable declaration requires to iterate over its children to find identifier:

(variable_declaration ; [0, 0] - [2, 2]
  (identifier) ; [0, 6] - [0, 18]
  (struct_declaration ; [0, 21] - [2, 1]
   ; the container field will be shown separately
    )) ; [1, 11] - [1, 14]

The AST of the function declaration has its name in the named child name:

(function_declaration ; [3, 0] - [3, 16]
  name: (identifier) ; [3, 3] - [3, 6]
  (parameters) ; [3, 6] - [3, 8]
  type: (builtin_type) ; [3, 9] - [3, 13]
  body: (block)) ; [3, 14] - [3, 16]

The AST of the container field has its name at the same place:

(container_field ; [1, 4] - [1, 14]
  name: (identifier) ; [1, 4] - [1, 9]
  type: (builtin_type)))) ; [1, 11] - [1, 14]

The AST of the test declaration requires to iterate over its children to find string:

(test_declaration ; [4, 0] - [4, 17]
  (string ; [4, 5] - [4, 13]
    (string_content)) ; [4, 6] - [4, 12]
  (block)) ; [4, 14] - [4, 17]

Given that, we have to compare the type of the visited node with the expected node types and get the name according the type:

local function get_name(of_node, expected_node_type, bufnr)
  if expected_node_type == expected_node_type_variable_declaration then
    local identifier = find_child(of_node, expected_node_type_identifier)
    if identifier == nil then
      return nil
    end
    return vim.treesitter.get_node_text(identifier, bufnr)
  elseif (
      expected_node_type == expected_node_type_function_declaration or
      expected_node_type == expected_node_type_container_field
      ) then
    local _, function_name_node = next(of_node:field("name"))
    if function_name_node == nil then
      return nil
    end
    return vim.treesitter.get_node_text(function_name_node, bufnr)
  elseif expected_node_type == expected_node_type_test_declaration then
    local test_name = find_child(of_node, expected_node_type_string)
    if test_name ~= nil then
      return "test" .. vim.treesitter.get_node_text(test_name, bufnr)
    else
      return "test"
    end
  else
    return nil
  end
end

It's time to answer the question. Why do I need this expected_node_type? As I said the data type is an integer. It's faster to compare the numbers than strings. In other words, the algorithmic complexity of the string comparison is linear, while integer is constant (spoiler: it's not true for LuaJIT 2.1). It's not a problem to compare strings once. However, sometimes you inspect a source file of 10k - 20k lines, and the plugin, which you rely on, fails to do in a frame (16 ms or even 32 ms). It is a deal breaker for me. The main reason why I like Neovim (and Vim in the past) is that, that it's snappy and crispy (not only performance-wise, but the experience in general). I really hate when some plugin takes away this experience. Also, it's not that green to spend unnecessary CPU cycles.

Node symbol

My initial idea was using the integer identifier which I get from node:symbol(). The method gives you an integer which you can use to identify the type. However, the identifiers change with a new version of Nvim Treesitter. That means, that I have to do a follow up change in the code. Apparently it's annoying.

In order to mitigate the inconvenience, I came up with a silly idea of keeping both ways of the symbol type identification in the code. Namely, the functions which utilize node:symbol() work, but the functions which utilize node:type() remain commented.

I remove the comment operator in the beginning of them, upgrade Nvim Treesitter, and my symbol menu works again. The flaw of the plan is that I have a few places where I get the type. It made me to create this expected_node_type and tables to match it against the data from node:symbol() and node:type(). That allowed me to amend only one place.

The entire idea is in these 3 functions:

local function get_expeceted_node_type_from_node_symbol(node)
  local node_symbol = node:symbol()
  for _, value in pairs(expected_node_type_against_symbols_table) do
    if value[1] == node_symbol then
      return value[2]
    end
  end
  return expected_node_type_unknown
end

local function get_expeceted_node_type_from_node_type(node)
  return expected_node_type_against_types_table[node:type()]
end

local function get_expected_node_type(node)
  local expected_node_type = get_expeceted_node_type_from_node_symbol(node)
  -- local expected_node_type = get_expeceted_node_type_from_node_type(node)
  return expected_node_type
end

As you can see they do a table lookup. I wanted to minimize them, that's why I pass expected_node_type to the visiting function. However, the necessity of maintaining the code compelled me to check the actual code execution duration, which surprised me.

Performance

I did the benchmark of the code this way:

  • Garbage collector remains working, because it works in a regular situation.
  • I run the algorithm 100 times, because the garbage collector can skip a single iteration. Running a few timer increases the chance of getting both of the cases equally distributed (at least I hope on it).
  • I take the median and the average value of the execution time.
  • The execution time is the difference between the two results of gettimeofday (actually its wrapper from luv), before and after the execution.
  • The result of the gettimeofday is a pair of seconds and microseconds. With C I would've used the timersub function to subtract two instances of the result to get the duration. However, I failed to find the function without installing anything extra. Therefore, I had to implement the subtraction.

Here is the code of the benchmark:

local diffs = {}
function benchmark(callable)
  local before_s, before_ms = vim.uv.gettimeofday()
  callable()
  local after_s, after_ms = vim.uv.gettimeofday()
  local after = after_s .. "." .. after_ms
  if before_ms > after_ms then
    after_ms = after_ms + 1000000
    after_s = after_s - 1
  end
  local diff_s = after_s - before_s
  local diff_ms = after_ms - before_ms
  print("before: " .. before_s .. "." .. before_ms .. ", after: " .. after .. ", duration " .. diff_s .. "." .. diff_ms)
  table.insert(diffs, diff_s * 1000000 + diff_ms)
end

for i = 0, 100, 1 do
  benchmark(function() collect_document_symbols(vim.api.nvim_get_current_buf()) end)
end

table.sort(diffs)
local avg = vim.iter(diffs):fold(0, function(acc, v) return acc + v end) / (#diffs)
print("median: " .. (diffs[49] + diffs[50]) / 2 .. " avg: " .. avg)

The function collect_document_symbols only collects the symbols with Treesitter API. Those are the symbols I'm talking about. The results of the benchmark are surprising:

  • The algorithm using the node:symbol() (integer):
    • median: 58461 μs
    • avg: 78586 μs
  • The algorithm using the node:type() (string):
    • median: 58419 μs
    • avg: 78379 μs

The difference is inside a statistical error range. Does the Lua compare string in a constant time? Yes it does! Here is my investigation.

As a result, I dropped node:symbol() and used only node:type(). However, I kept the expected_node_type. Another useful property of integers is that I can make binary masks out of them. Those masks can be used to say if a type is one of a set of types with a single condition.

Customize the Telescope menu

The last piece in the puzzle is filling the menu. In the example we used the entry maker from the menu for a buffer lines. Unfortunately, it doesn't work for us, because the format of our data is different. It expects the data with 2 fields: the line number, the line content. In our case, every entry has the line, the column, the kind, the text. There is Telescope's entry maker for Treesitter, though, but it creates its own text, which doesn't work for us, because we need our own text.

Also, in order to search among a certain kind of symbols, I customized the sorter of the menu. As a result the code to create our menu is:

local function ret(opts)
  if opts == nil then
    opts = {}
  end

  local bufnr = vim.api.nvim_get_current_buf()
  opts.filename = vim.api.nvim_buf_get_name(bufnr)
  local results = collect_document_symbols(bufnr)
  if vim.tbl_isempty(results) then
    return
  end

  pickers.new(opts, {
    prompt_title = "Zig symbols",
    finder = finders.new_table({
      results = results,
      entry_maker = opts.entry_maker or entry_maker(opts),
    }),
    sorter = conf.prefilter_sorter({
      tag = "kind",
      sorter = conf.generic_sorter(opts),
    }),
    previewer = conf.grep_previewer(opts),
  }):find()
end

Out of the code note 2 things:

  1. entry_maker is our custom function which I'll describe later.
  2. The field sorter has the field tag with the value "kind". This value will be used by Telescope to the kind of a symbol in an entry of the table results.

The entry maker - displayer

An entry maker returns a function which is the actual entry maker. There are 2 important things to create an entry maker:

  1. The factory of functions which format their entries. It's called displayer in the code of Telescope.
  2. The mapping of the values of our data to the data of a Telescope entry.

The formatting, which the displayer does, determines 3 things:

  1. The lengths of the fields of the entries. In our case, I'd like to make 5 characters for the line number, 25 characters for the kind, the remaining for the text.
  2. The order of the fields.
  3. The highlighting of the text in the columns. In Neovim, and Vim of cause, we can use highlight groups to do it.

The listed items is implemented with the following code:

local displayer = entry_display.create({
  separator = " │ ",
  items = {
    { width = 5 },
    { width = 25 },
    { remaining = true },
  },
})

local make_display = function(entry)
  return displayer({
    { entry.lnum, "TelescopeResultsLineNr" },
    { entry.kind, "TelescopeResultsSpecialComment" },
    { entry.text, "" },
  })
end

The entry maker - data mapping

The last piece in creating the entry maker is the mapping of the data, which we have mined, to the Telescope menu. As it navigates a user, it needs the following data:

  1. The file path.
  2. The location in the file.
  3. The text representation of the location.
  4. The text for filtering.

The code to do this:

return function(entry)
  local entry_kind = entry.kind
  return make_entry.set_default_entry_mt({
    ordinal = entry.text .. " " .. entry_kind,
    kind = entry_kind,
    display = make_display,
    filename = opts.filename,
    lnum = entry.lnum,
    text = entry.text,
    col = entry.col,
  }, opts)
end

That's what the function entry_maker returns.

Conclusion

Out of this little adventure, I found that it's fairly simple (not easy, but simple) to work with Tree-sitter to get information about the source code. It made me to wonder about custom code linters. For example, a team might have internal code style, they check if the style is maintained in every pull request manually. A custom linter would automate this process and it could be run even in CI or even Git hooks.

It was very interesting to discover the internals of comparing strings in Lua. I understand better the reason of Lua being popular in Game development where your budget is 16 ms, not mentioning screens with 120Hz and 240Hz refresh frequencies. I definitely learned the lesson of not assuming something based on my experience and checking the actual implementation.

I haven't created a plugin for this, because essentially it's just a configuration. There is no licence for the code, feel free to use however you want. Link to the code again.

Discussion