Lua string comparison under the hood

How I end up looking under this hood

While I was working on the enhanced document symbol menu for Neovim, I came across an interesting fact. The comparison of strings takes the same time as the comparison of integers. After an hour of looking up information on the internet, I found very few pieces of information about it. However, the information didn't have enough details, and it was conflicting. As a result I decided to find it on my own. I had a hunch that the string comparison might be more sophisticated, because of this piece of code in LuaJIT:

typedef struct GCstr {
  GCHeader;
  uint8_t reserved;     /* Used by lexer for fast lookup of reserved words. */
  uint8_t hashalg;      /* Hash algorithm. */
  StrID sid;            /* Interned string ID. */
  StrHash hash;         /* Hash of string. */
  MSize len;            /* Size of string. */
} GCstr;

The field hash gave me an impression that it might be used in the string comparison. I had in my mind a picture, that strings are compared by their hashes, then by their content. I checked the references of hash in the code, but it wasn't used to compare them.

Comparison implementation - where is it?

I found the function seemed the implementation of the string comparison. Here it is:

/* Ordered compare of strings. Assumes string data is 4-byte aligned. */
int32_t LJ_FASTCALL lj_str_cmp(GCstr *a, GCstr *b)

Even the comment whispered it, but GDB didn't stop at the function. I was puzzled, but found that I can get the byte code of my program with the command:

luajit -bl source.lua

I hoped that I can find the massive switch over all of the types of the Lua virtual machine operations and see the code which is invoked for the string comparison operation. I created this sample to keep the byte code easier to comprehend.

local clock = os.clock
local function sleep(n) -- seconds
  local t0 = clock()
  while clock() - t0 <= n do
  end
end

local function main()
  print("game begins\n")
  sleep(1)
  print(arg[1] == "arst")
  print("game over\n")
end

main()

The pause and the print commands are needed to see in GDB if we haven't stopped somewhere at the initialization of the script. The byte code of the program is 40 lines long, so I post only the sensible place:

0007    GGET     0   0      ; "print"
0008    GGET     2   2      ; "arg"
0009    TGETB    2   2   1
0010    ISEQS    2   3      ; "arst"
0011    JMP      2 => 0014
0012    KPRI     2   1
0013    JMP      3 => 0015
0014 => KPRI     2   2
0015 => CALL     0   1   2
0016    GGET     0   0      ; "print"
0017    KSTR     2   4      ; "game over\n"

So, ISEQS it is. I found a few places in the code which the operation, but GDB didn't stop at any of them. I saw a few matches in the files with the extension dasc. I ignored them in the beginning, but later I ran out of sensible guesses and just checked. I found that the names of the files have architecture, so I picked my x64, and inside of them I found this:

case BC_ISEQS: case BC_ISNES:
  vk = op == BC_ISEQS;
  |  ins_AND	// RA = src, RD = str const, JMP with RD = target
  |  mov RB, [BASE+RA*8]
  |  add PC, 4
  |  checkstr RB, >3
  |  cmp RB, [KBASE+RD*8]

Is it some sort of assembler? It is, and it's name is DynASM. As of now the page has this text:

Sorry, right now there is no proper documentation included other than some Examples and of course the source code. The source is well documented, though (IMHO).

However, the same page refers to the unofficial documentation and the tutorial from the same website. To be honest, I mostly worked on website backends. Reading assembly is not something I do regularly. I understand some basics which learned from:

This tutorial, following which I made a dead-simple clone of Curl
Some practice in data-driven programming, out of which I learned to identify things which cause cache misses.
What every programmer should know about memory. The Part 5 thought me how to deal with cache misses.

Nevertheless, out the 7 lines seemed DynASM, I can see following:

First two lines is the operator matching. That's how I guessed that's it's the code which I need.
The third line with the instruction int_AND doesn't make sense so far. I see that it's macro |.macro ins_AND; not RD; .endmacro, and I understand the content of the macro, but the purpose of it is not clear yet.
On the same line I see register mappings. For example, RB is mapped to rbp |.define RB, rbp // Must be rbp (C callee-save).
Again on the same line, the author of the code left the comment // RA = src, RD = str const, JMP with RD = target. I understand that the RA register has the string (or address to it) from my arg[1]. The register RD has my constant "arst".
The fourth line says that the string from arg[1] is set to RB. However, is see that the value is multiplied by 8, and it's taken with some shift BASE.
The fifth line adds something to the register PC. Not sure what it is for.
The six line has 2 complicated things at once
1. The macro checkstr. It unwraps to the code which checks the type of the value arg[1].
2. >3. This puzzled me until I found in the tutorial this chapter. It says that >3 jumps to the next code under the mark 3:. Given that, the entire line can be read as "make sure that the type in RB is string or jump to "3:".
The last line takes the constant string "arst" with manipulations similar to the item 5, and compares it to the register RB, where we have our string from arg[1].

After we see these 5 lines:

if (vk) {
  |  jne >2
} else {
  |  je >1
}

However, I have to be sure that I'm looking at the right place. I made sure that this is the code which is invoked when we compare string. Namely, I swapped jne >2 and je >1, which I expected to negate the comparison operator, and it did.

Comparison implementation - down to the machine code

So far, we found the code which performs the comparison, and the code allegedly has constant time complexity. Let's confirm. I ran luajit and set the breakpoint at lj_BC_ISEQS to see the block of the generated assembly.

Please note, that I compiled luajit with the following command:

make DEFAULT_CC="musl-gcc -static" BUILDMODE="static" TARGET_XCFLAGS=-DLUAJIT_NO_UNWIND Q=

The block of the assembly code is big enough, so I cover only the part we looked at, and split it into logical parts:

These two lines set string from arg[1] and the constant "arst" to registers.

Dump of assembler code for function lj_BC_ISEQS:
=> 0x000000000043508f <+0>:     not    rax
   0x0000000000435092 <+3>:     mov    rbp,QWORD PTR [rdx+rcx*8]

These two lines match the code of the macro from the DynASM code ins_AND. It's a good sign. The address in the register rbp is 0xfffdfffff7fe3038, and when I try to access it I get Cannot access memory at address 0xfffdfffff7fe3038. Why? Hasn't Linux allocated the address for the process? Let's have a look at the memory mappings of the process:

(gdb) info proc map
process 336700
Mapped address spaces:

          Start Addr           End Addr       Size     Offset  Perms  objfile
            0x400000           0x401000     0x1000        0x0  r--p   /home/laladrik/Nest/personal/luajit/current/src/luajit
            0x401000           0x4cb000    0xca000     0x1000  r-xp   /home/laladrik/Nest/personal/luajit/current/src/luajit
            0x4cb000           0x4eb000    0x20000    0xcb000  r--p   /home/laladrik/Nest/personal/luajit/current/src/luajit
            0x4eb000           0x4ed000     0x2000    0xea000  rw-p   /home/laladrik/Nest/personal/luajit/current/src/luajit
            0x4ed000           0x4ee000     0x1000        0x0  rw-p
            0x4ee000           0x4ef000     0x1000        0x0  ---p   [heap]
            0x4ef000           0x4f0000     0x1000        0x0  rw-p   [heap]
           0x58f0000          0x5900000    0x10000        0x0  r-xp
      0x7ffff7fd9000     0x7ffff7ff9000    0x20000        0x0  rw-p
      0x7ffff7ff9000     0x7ffff7ffd000     0x4000        0x0  r--p   [vvar]
      0x7ffff7ffd000     0x7ffff7fff000     0x2000        0x0  r-xp   [vdso]
      0x7ffffffde000     0x7ffffffff000    0x21000        0x0  rw-p   [stack]
  0xffffffffff600000 0xffffffffff601000     0x1000        0x0  --xp   [vsyscall]

None of the address ranges includes the value. It's not an address. Let's have a look back at the DynASM code. There are lines: | mov RB, [BASE+RA*8] and | checkstr RB, >3. We see that the register RB (which is rpb in the regular assembly) is used in both of them. The first line sets it, the second reads it for the macro checkstr. The code of the macro is

|.macro checkstr, reg, target; checktp reg, LJ_TSTR, target; .endmacro

The macro has 2 arguments: reg and target. The argument reg is used in the macro checktp, the second argument, which is passed to the macro, is LJ_TSTR. The code of the macro is

|.macro checktp, reg, tp, target
|  mov ITYPE, reg
|  cleartp reg
|  sar ITYPE, 47
|  cmp ITYPEd, tp
|  jne target

The ITYPE register is r11. However, the name says the value in the register is a type. Given that, checktp starts making sense - it's "check type". Therefore, the register rbp contains an encoded type, and this code decodes it and compares to LJ_TSTR. Grepping the code of LuaJIT I found this in lj_obj.h

#define LJ_TSTR                        (~4u)

The value ~4u is 0xfffffffb, which is the value we can see in the assembly code

0x000000000043509d <+14>:    shl    rbp,0x11 ; cleaning the original value
0x00000000004350a1 <+18>:    shr    rbp,0x11 ; cleaning the original value
0x00000000004350a5 <+22>:    sar    r11,0x2f ; decoding the type
0x00000000004350a9 <+26>:    cmp    r11d,0xfffffffb ; checking the type
0x00000000004350ad <+30>:    jne    0x4350d4 <lj_BC_ISEQS+69> ; leaving the code if the check fails.

In conclusion, we check the type at first place, which makes sense for a scripting language.

As of now the value in the register rbp has changed. Effectively, the first 11 bits of the value are zeroes now. Given that, the value 0xfffdfffff7fe3038 became 0x7ffff7fe3038. After we have this two lines:

0x00000000004350af <+32>:    cmp    rbp,QWORD PTR [r15+rax*8]
0x00000000004350b3 <+36>:    je     0x4350b5 <lj_BC_ISEQS+50>
0x00000000004350b5 <+38>:    movzx  eax,WORD PTR [rbx-0x2]

The value at the address pointed by the expression [r15+rax*8] is 0xf7fe3038, well at least GDB says it. The last 4 bytes of the value equal to the last 4 bytes of the value at the register rpb. However, the cmp instruction sets Zero Flag. That means that the values are equal. Let's do a little hack. Before the instruction let's put the following code:

|  push rdx
|  mov rdx, [KBASE+RD*8]
|  pop rdx

With this, I see that the value is 0x7ffff7fe3038 from the register rdx. I found that I actually misused GDB. I should've used x/g $r15+($rax*8) to see 8 byte long value at the address. I didn't set any format after x, as a consequence I saw just 4 byte long value.

As a result, a string comparison in Lua is comparison of 2 integers. The first for the type, the second for the value. This is the reason why string comparison takes as long as integer comparison.

Bonus level

Despite the fact, that I found the way the strings are compared, I've got some questions which I couldn't let go without answers.

The location of a string

I can't help, but wonder about the value 0x7ffff7fe3038. It's big enough to look like an address. However, the value at the address is zero. I had a hunch that the string is located somewhere near. Let's check the address which stores the value. The address is 0x7ffff7fe3038, and this range includes it:

0x7ffff7fd9000     0x7ffff7ff9000    0x20000        0x0  rw-p

I guess my string constant "arst" is somewhere there as well. Let's check it:

(gdb) find 0x7ffff7fd9000, (0x7ffff7ff9000 - 1), "arst"
0x7ffff7fe3050
1 pattern found.

It is there! The address 0x7ffff7fe3050 is 24 bytes away from 0x7ffff7fe3038. This is exactly the amount of memory which the structure GCstr takes if the LuaJIT is compiled without the flag -DLUAJIT_DISABLE_GC64. The flag affects the size of the fields which are embedded by the macro GCHeader, which is inside the structure. The macro embeds 3 fields, on of which takes either 4 or 8 bytes depending on the flag. I compiled LuaJIT with the same parameters and with the flag. After that, the address I got 0x40008680 and the pattern is found at 0x40008694, which 20 bytes as expected.

After I set a watch point on the address in GDB, I found the code which allocates memory for the string.

static GCstr *lj_str_alloc(lua_State *L, const char *str, MSize len,
                           StrHash hash, int hashalg)
{
  GCstr *s = lj_mem_newt(L, lj_str_size(len), GCstr);
  // ...
}

The macro lj_str_size(len) unwraps into the following code:

#define lj_str_size(len)        (sizeof(GCstr) + (((len)+4) & ~(MSize)3))

This shows two things:

The first chunk is supposed to be used for the structure GCstr as I guessed.
The maximal length of a string is 2³² - 3 bytes (slightly less than 4 GB).

In conclusion, the value which is used to compare the strings is their addresses.

Two strings - one address

Strings are compared by their addresses - clear. Why is the address the same? Is there any registry of strings? Well, there is. The global state owns the registry which is a hash-table of chains (linked lists). LuaJIT computes a Add-Rotate-Xor hash for the string in a constant time, looks up the registry by the hash for the same string. If something matches the string, the creation is canceled, and the match from the registry is returned. The following code shows the implementation (careful - original formatting):

/* Intern a string and return string object. */
GCstr *lj_str_new(lua_State *L, const char *str, size_t lenx)
{
  // ...
     GCobj *o = gcref(g->str.tab[hash & g->str.mask]);

  // ...
    while (o != NULL) {
      GCstr *sx = gco2str(o);
      if (sx->hash == hash && sx->len == len) {
        if (memcmp(str, strdata(sx), len) == 0) {
      // ...
          return sx;  /* Return existing string. */
        }
        coll++;
      }
      coll++;
      o = gcnext(o);
    }
  // ...
}

It's worth to mention that LuaJIT deals with the case when a string has more than 32 collisions. It recalculates hashes for the items of the chain.

Conclusion

The efficiency and popularity of Lua in game development. That fact, that Lua does such a common operation in a very efficient way, answers the question. Not only the algorithm takes constant time, it's implemented in assembly. When your budget is 16 ms, you want to have something highly efficient. LuaJIT has quite a few critical places implemented in assembly for a few CPU architectures.
Little code, good implementation. Despite my sarcasm about the formatting, I find the code of LuaJIT pretty straightforward. It's always challenging to understand macro in C or C++. It's good to have only a little bit of macro, and LuaJIT does it. Also, the code has a good amount of compile time tweaks. Not only are they clear out of their names, but have good comments in the Makefile.
I understood why it's recommended to avoid string concatenation for intermediate values and use arrays of characters or arrays of strings. The lookup the string registry for a match of a string reads random memory places. It slows down the process if the places are not next to each other. As a result, we might have cache misses.
While I was debugging the project, I understood why GDB is still used. Instead of using it interactively I wrote a script for it and LuaJIT within GDB along the script and just watched the data, pressing Enter occasionally. The second impressive thing was watching the memory addresses. It allowed to find the moment, when a string was created, but I was staying at start of the program. Another useful thing is convenience variables. I had a case when I wanted to trigger a breakpoint only if certain condition happened. However, that data for the condition was out of the scope of the function where I needed the breakpoint. I set a breakpoint at the place where the data is available, attach the commands to the breakpoint setting a convenience variable if certain conditions are met and continue, which means the breakpoint doesn't interrupt the process. At the place, which I needed to inspect I set a breakpoint with a condition to that convenience variable.