While I was working on the enhanced document symbol menu for Neovim, I came across an interesting fact. The comparison of strings takes the same time as the comparison of integers. After an hour of looking up information on the internet, I found very few pieces of information about it. However, the information didn't have enough details, and it was conflicting. As a result I decided to find it on my own. I had a hunch that the string comparison might be more sophisticated, because of this piece of code in LuaJIT:
typedef struct GCstr { GCHeader; uint8_t reserved; /* Used by lexer for fast lookup of reserved words. */ uint8_t hashalg; /* Hash algorithm. */ StrID sid; /* Interned string ID. */ StrHash hash; /* Hash of string. */ MSize len; /* Size of string. */ } GCstr;
The field hash
gave me an impression that it might be used in the string comparison. I had in my mind a picture, that strings are compared by their hashes, then by their content. I checked the references of hash
in the code, but it wasn't used to compare them.
I found the function seemed the implementation of the string comparison. Here it is:
/* Ordered compare of strings. Assumes string data is 4-byte aligned. */ int32_t LJ_FASTCALL lj_str_cmp(GCstr *a, GCstr *b)
Even the comment whispered it, but GDB didn't stop at the function. I was puzzled, but found that I can get the byte code of my program with the command:
luajit -bl source.lua
I hoped that I can find the massive switch over all of the types of the Lua virtual machine operations and see the code which is invoked for the string comparison operation. I created this sample to keep the byte code easier to comprehend.
local clock = os.clock local function sleep(n) -- seconds local t0 = clock() while clock() - t0 <= n do end end local function main() print("game begins\n") sleep(1) print(arg[1] == "arst") print("game over\n") end main()
The pause and the print commands are needed to see in GDB if we haven't stopped somewhere at the initialization of the script. The byte code of the program is 40 lines long, so I post only the sensible place:
0007 GGET 0 0 ; "print" 0008 GGET 2 2 ; "arg" 0009 TGETB 2 2 1 0010 ISEQS 2 3 ; "arst" 0011 JMP 2 => 0014 0012 KPRI 2 1 0013 JMP 3 => 0015 0014 => KPRI 2 2 0015 => CALL 0 1 2 0016 GGET 0 0 ; "print" 0017 KSTR 2 4 ; "game over\n"
So, ISEQS
it is. I found a few places in the code which the operation, but GDB didn't stop at any of them. I saw a few matches in the files with the extension dasc
. I ignored them in the beginning, but later I ran out of sensible guesses and just checked. I found that the names of the files have architecture, so I picked my x64, and inside of them I found this:
case BC_ISEQS: case BC_ISNES: vk = op == BC_ISEQS; | ins_AND // RA = src, RD = str const, JMP with RD = target | mov RB, [BASE+RA*8] | add PC, 4 | checkstr RB, >3 | cmp RB, [KBASE+RD*8]
Is it some sort of assembler? It is, and it's name is DynASM. As of now the page has this text:
Sorry, right now there is no proper documentation included other than some Examples and of course the source code. The source is well documented, though (IMHO).
However, the same page refers to the unofficial documentation and the tutorial from the same website. To be honest, I mostly worked on website backends. Reading assembly is not something I do regularly. I understand some basics which learned from:
Nevertheless, out the 7 lines seemed DynASM, I can see following:
int_AND
doesn't make sense so far. I see that it's macro |.macro ins_AND; not RD; .endmacro
, and I understand the content of the macro, but the purpose of it is not clear yet.|.define RB, rbp // Must be rbp (C callee-save).
// RA = src, RD = str const, JMP with RD = target
. I understand that the RA register has the string (or address to it) from my arg[1]
. The register RD has my constant "arst".
arg[1]
is set to RB. However, is see that the value is multiplied by 8
, and it's taken with some shift BASE
.checkstr
. It unwraps to the code which checks the type of the value arg[1]
.>3
. This puzzled me until I found in the tutorial this chapter. It says that >3
jumps to the next code under the mark 3:
.
Given that, the entire line can be read as "make sure that the type in RB is string or jump to "3:"."arst"
with manipulations similar to the item 5, and compares it to the register RB, where we have our string from arg[1]
.After we see these 5 lines:
if (vk) { | jne >2 } else { | je >1 }
However, I have to be sure that I'm looking at the right place. I made sure that this is the code which is invoked when we compare string. Namely, I swapped jne >2
and je >1
, which I expected to negate the comparison operator, and it did.
So far, we found the code which performs the comparison, and the code allegedly has constant time complexity. Let's confirm. I ran luajit and set the breakpoint at lj_BC_ISEQS
to see the block of the generated assembly.
Please note, that I compiled luajit with the following command:
make DEFAULT_CC="musl-gcc -static" BUILDMODE="static" TARGET_XCFLAGS=-DLUAJIT_NO_UNWIND Q=
The block of the assembly code is big enough, so I cover only the part we looked at, and split it into logical parts:
These two lines set string from arg[1]
and the constant "arst"
to registers.
Dump of assembler code for function lj_BC_ISEQS: => 0x000000000043508f <+0>: not rax 0x0000000000435092 <+3>: mov rbp,QWORD PTR [rdx+rcx*8]
These two lines match the code of the macro from the DynASM code ins_AND
. It's a good sign. The address in the register rbp is 0xfffdfffff7fe3038
, and when I try to access it I get Cannot access memory at address 0xfffdfffff7fe3038
. Why? Hasn't Linux allocated the address for the process? Let's have a look at the memory mappings of the process:
(gdb) info proc map process 336700 Mapped address spaces: Start Addr End Addr Size Offset Perms objfile 0x400000 0x401000 0x1000 0x0 r--p /home/laladrik/Nest/personal/luajit/current/src/luajit 0x401000 0x4cb000 0xca000 0x1000 r-xp /home/laladrik/Nest/personal/luajit/current/src/luajit 0x4cb000 0x4eb000 0x20000 0xcb000 r--p /home/laladrik/Nest/personal/luajit/current/src/luajit 0x4eb000 0x4ed000 0x2000 0xea000 rw-p /home/laladrik/Nest/personal/luajit/current/src/luajit 0x4ed000 0x4ee000 0x1000 0x0 rw-p 0x4ee000 0x4ef000 0x1000 0x0 ---p [heap] 0x4ef000 0x4f0000 0x1000 0x0 rw-p [heap] 0x58f0000 0x5900000 0x10000 0x0 r-xp 0x7ffff7fd9000 0x7ffff7ff9000 0x20000 0x0 rw-p 0x7ffff7ff9000 0x7ffff7ffd000 0x4000 0x0 r--p [vvar] 0x7ffff7ffd000 0x7ffff7fff000 0x2000 0x0 r-xp [vdso] 0x7ffffffde000 0x7ffffffff000 0x21000 0x0 rw-p [stack] 0xffffffffff600000 0xffffffffff601000 0x1000 0x0 --xp [vsyscall]
None of the address ranges includes the value. It's not an address. Let's have a look back at the DynASM code. There are lines: | mov RB, [BASE+RA*8]
and | checkstr RB, >3
. We see that the register RB (which is rpb in the regular assembly) is used in both of them. The first line sets it, the second reads it for the macro checkstr
. The code of the macro is
|.macro checkstr, reg, target; checktp reg, LJ_TSTR, target; .endmacro
The macro has 2 arguments: reg
and target
. The argument reg
is used in the macro checktp
, the second argument, which is passed to the macro, is LJ_TSTR
. The code of the macro is
|.macro checktp, reg, tp, target | mov ITYPE, reg | cleartp reg | sar ITYPE, 47 | cmp ITYPEd, tp | jne target
The ITYPE
register is r11. However, the name says the value in the register is a type. Given that, checktp
starts making sense - it's "check type". Therefore, the register rbp contains an encoded type, and this code decodes it and compares to LJ_TSTR
. Grepping the code of LuaJIT I found this in lj_obj.h
#define LJ_TSTR (~4u)
The value ~4u
is 0xfffffffb
, which is the value we can see in the assembly code
0x000000000043509d <+14>: shl rbp,0x11 ; cleaning the original value 0x00000000004350a1 <+18>: shr rbp,0x11 ; cleaning the original value 0x00000000004350a5 <+22>: sar r11,0x2f ; decoding the type 0x00000000004350a9 <+26>: cmp r11d,0xfffffffb ; checking the type 0x00000000004350ad <+30>: jne 0x4350d4 <lj_BC_ISEQS+69> ; leaving the code if the check fails.
In conclusion, we check the type at first place, which makes sense for a scripting language.
As of now the value in the register rbp has changed. Effectively, the first 11 bits of the value are zeroes now. Given that, the value 0xfffdfffff7fe3038
became 0x7ffff7fe3038
. After we have this two lines:
0x00000000004350af <+32>: cmp rbp,QWORD PTR [r15+rax*8] 0x00000000004350b3 <+36>: je 0x4350b5 <lj_BC_ISEQS+50> 0x00000000004350b5 <+38>: movzx eax,WORD PTR [rbx-0x2]
The value at the address pointed by the expression [r15+rax*8]
is 0xf7fe3038
, well at least GDB says it. The last 4 bytes of the value equal to the last 4 bytes of the value at the register rpb. However, the cmp instruction sets Zero Flag. That means that the values are equal. Let's do a little hack. Before the instruction let's put the following code:
| push rdx | mov rdx, [KBASE+RD*8] | pop rdx
With this, I see that the value is 0x7ffff7fe3038
from the register rdx. I found that I actually misused GDB. I should've used x/g $r15+($rax*8)
to see 8 byte long value at the address. I didn't set any format after x
, as a consequence I saw just 4 byte long value.
As a result, a string comparison in Lua is comparison of 2 integers. The first for the type, the second for the value. This is the reason why string comparison takes as long as integer comparison.
Despite the fact, that I found the way the strings are compared, I've got some questions which I couldn't let go without answers.
I can't help, but wonder about the value 0x7ffff7fe3038
. It's big enough to look like an address. However, the value at the address is zero. I had a hunch that the string is located somewhere near. Let's check the address which stores the value. The address is 0x7ffff7fe3038
, and this range includes it:
0x7ffff7fd9000 0x7ffff7ff9000 0x20000 0x0 rw-p
I guess my string constant "arst"
is somewhere there as well. Let's check it:
(gdb) find 0x7ffff7fd9000, (0x7ffff7ff9000 - 1), "arst" 0x7ffff7fe3050 1 pattern found.
It is there! The address 0x7ffff7fe3050
is 24 bytes away from 0x7ffff7fe3038
. This is exactly the amount of memory which the structure GCstr
takes if the LuaJIT is compiled without the flag -DLUAJIT_DISABLE_GC64
. The flag affects the size of the fields which are embedded by the macro GCHeader
, which is inside the structure. The macro embeds 3 fields, on of which takes either 4 or 8 bytes depending on the flag. I compiled LuaJIT with the same parameters and with the flag. After that, the address I got 0x40008680
and the pattern is found at 0x40008694
, which 20 bytes as expected.
After I set a watch point on the address in GDB, I found the code which allocates memory for the string.
static GCstr *lj_str_alloc(lua_State *L, const char *str, MSize len, StrHash hash, int hashalg) { GCstr *s = lj_mem_newt(L, lj_str_size(len), GCstr); // ... }
The macro lj_str_size(len)
unwraps into the following code:
#define lj_str_size(len) (sizeof(GCstr) + (((len)+4) & ~(MSize)3))
This shows two things:
GCstr
as I guessed.In conclusion, the value which is used to compare the strings is their addresses.
Strings are compared by their addresses - clear. Why is the address the same? Is there any registry of strings? Well, there is. The global state owns the registry which is a hash-table of chains (linked lists). LuaJIT computes a Add-Rotate-Xor hash for the string in a constant time, looks up the registry by the hash for the same string. If something matches the string, the creation is canceled, and the match from the registry is returned. The following code shows the implementation (careful - original formatting):
/* Intern a string and return string object. */ GCstr *lj_str_new(lua_State *L, const char *str, size_t lenx) { // ... GCobj *o = gcref(g->str.tab[hash & g->str.mask]); // ... while (o != NULL) { GCstr *sx = gco2str(o); if (sx->hash == hash && sx->len == len) { if (memcmp(str, strdata(sx), len) == 0) { // ... return sx; /* Return existing string. */ } coll++; } coll++; o = gcnext(o); } // ... }
It's worth to mention that LuaJIT deals with the case when a string has more than 32 collisions. It recalculates hashes for the items of the chain.