zunzuncito

About a week ago I noticed that fd(1), a Rust-based alternative to find(1), had suddenly started segfaulting on my musl-based server system. Usually a segfault is nothing particularly special in my eyes, but this one was different. Even just having fd(1) attempt to print its help text was enough to trigger it, and when I attempted to debug it with gdb(1), I saw the following:

(gdb) run
Starting program: /usr/bin/fd

Program received signal SIGSEGV, Segmentation fault.
memcpy () at ../src_musl/src/string/x86_64/memcpy.s:18
warning: 18	../src_musl/src/string/x86_64/memcpy.s: No such file or directory
(gdb) bt
#0  memcpy () at ../src_musl/src/string/x86_64/memcpy.s:18
#1  0x00007ffff7ab7177 in __copy_tls () at ../src_musl/src/env/__init_tls.c:66
#2  0x00007ffff7ab730d in static_init_tls () at ../src_musl/src/env/__init_tls.c:149
#3  0x00007ffff7aae89d in __init_libc () at ../src_musl/src/env/__libc_start_main.c:39
#4  0x00007ffff7aae9c0 in __libc_start_main () at ../src_musl/src/env/__libc_start_main.c:80
#5  0x00007ffff74107f6 in _start ()

So… the segfault is in musl, not in fd!?

I immediately checked whether other basic programs on the system worked. They did. I checked when I had last updated musl. A couple of months ago, so that couldn’t be it. I checked specifically whether another Rust-based program worked. It did.

fd(1) had been updated pretty recently, and I remembered it working correctly about a month ago, so maybe something specific to fd(1)’s usage of Rust triggered this segfault in musl? I wanted to make sure I could reproduce this in a development environment, so I cloned the fd(1) repository, made a debug build, and ran it…

It worked. Huh!?

I decided it was likely that portage, Gentoo’s package manager, was building the program differently, so I took care to apply the same build flags to the development build. And what can I say:

error: failed to run custom build command for `crossbeam-utils v0.8.20`

Caused by:
  process didn't exit successfully: `fd/target/[...]/build-script-build`
      (signal: 11, SIGSEGV: invalid memory reference)

… it didn’t even get to build the fd binary proper. A segfault again, too. What on earth was going on? Why didn’t this also happen in the portage build?

Thankfully I now had a reproducer, so I did the only sensible thing and started removing random build flags until I got fd to build again. This was our culprit:

-Wl,-z,pack-relative-relocs

I was already pretty out of my depth, given that I couldn’t fathom how fd(1) got musl to segfault in memcpy, and now a piece of the puzzle also required me to understand specific linker flags. Oof.

Unsure what to do next, I decided on a whim to compare the working and the broken binaries with readelf(1). The most obvious difference was that the working binary had its .rela.dyn relocation section populated with entries, whilst the broken one was missing .rela.dyn but had .relr.dyn instead. At a loss, I stopped and went to do something else.

The story would probably have ended here had I not mentioned this conundrum to my partner later in the day. We decided to have another look at the binaries. After some discussion we determined that the working binary was dynamically linked whilst the broken one wasn’t. The other working Rust-based program, rg(1), was also dynamically linked and had been built a while ago, so at some point portage must have stopped producing Rust executables that were dynamically linked. Finally some progress!
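For anyone wanting to repeat this kind of comparison, here is a minimal sketch of the two checks using readelf(1). The function names are made up for this example, and it assumes that “dynamically linked” means the binary requests a program interpreter (a PT_INTERP program header), which is how a statically linked PIE can be told apart from a dynamically linked one.

```shell
#!/bin/sh
# Sketch: summarise how an ELF binary is linked, using readelf(1) from binutils.
# linkage_kind and reloc_sections are hypothetical helper names for this example.

# Print "dynamic" if the binary requests a program interpreter (PT_INTERP),
# "static" if it does not, or "unknown" if readelf is missing or cannot
# read the file.
linkage_kind() {
    headers=$(readelf -lW "$1" 2>/dev/null) || { echo unknown; return; }
    case $headers in
        *INTERP*) echo dynamic ;;
        *)        echo static ;;
    esac
}

# List which dynamic relocation sections are present in the section headers,
# for example .rela.dyn or .relr.dyn.
reloc_sections() {
    readelf -SW "$1" 2>/dev/null | grep -oE '\.(rel|rela|relr)\.dyn' | sort -u
}

# Report on every binary given on the command line.
for bin in "$@"; do
    printf '%s: %s\n' "$bin" "$(linkage_kind "$bin")"
    reloc_sections "$bin" | sed 's/^/  /'
done
```

Run against a working and a broken binary, this should make both differences described above visible side by side: whether an interpreter is requested, and which relocation sections exist.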

At this point we need some background. Early on, Rust decided to use the x86_64-unknown-linux-musl target to provide statically linked binaries that would run on a wide range of systems. Whilst support for dynamically linked executables on musl systems was added back in 2017, the default behaviour was never changed, so Gentoo has to make sure to disable static linking by passing -C target-feature=-crt-static to rustc.

It does this in a system-wide fashion by setting an environment variable in /etc/env.d:

$ cat /etc/env.d/50rust-bin-1.80.1
LDPATH="/usr/lib/rust/lib"
MANPATH="/usr/lib/rust/man"
CARGO_TARGET_X86_64_UNKNOWN_LINUX_MUSL_RUSTFLAGS="-C target-feature=-crt-static"

This setting should therefore be picked up by portage as well, but when I examined its build environment it was simply not there. So finally we come to the last piece of the puzzle: a recent change in how RUSTFLAGS are set within portage. Here’s the important part:

local -x CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS="-C strip=none -C linker=${LD_A[0]}"
[[ ${#LD_A[@]} -gt 1 ]] && local CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS+="$(printf -- ' -C link-arg=%s' "${LD_A[@]:1}")"
local CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS+=" ${RUSTFLAGS}"

Quoth the bash(1) manual:

Local variables “shadow” variables with the same name declared at previous scopes. For instance, a local variable declared in a function hides a global variable of the same name: references and assignments refer to the local variable, leaving the global variable unmodified.

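Whereas previously the target-specific RUSTFLAGS environment variable was only touched when cross-compiling, it was now overridden: the new local declaration shadows the value inherited from /etc/env.d, so the -crt-static setting never reaches the compiler. To confirm, I edited the file in question to include the previous value, and both fd(1) and rg(1) worked again. Success!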
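The shadowing is easy to reproduce in isolation. The sketch below is not the eclass code: DEMO_RUSTFLAGS is a hypothetical stand-in for the target-specific variable, and build_preserving shows just one possible way of carrying the inherited value over, not necessarily how the actual fix was done.

```shell
#!/usr/bin/env bash
# DEMO_RUSTFLAGS is a hypothetical stand-in for CARGO_TARGET_<TRIPLE>_RUSTFLAGS,
# exported here the way /etc/env.d would export the real variable.
export DEMO_RUSTFLAGS="-C target-feature=-crt-static"

# Print the value that a child process (think: cargo) would inherit.
child_sees() {
    bash -c 'printf "%s\n" "$DEMO_RUSTFLAGS"'
}

# Declaring a fresh exported local hides the inherited value entirely.
build_shadowed() {
    local -x DEMO_RUSTFLAGS="-C strip=none"
    child_sees
}

# Capturing the inherited value before declaring the local keeps it around.
build_preserving() {
    local inherited="${DEMO_RUSTFLAGS}"
    local -x DEMO_RUSTFLAGS="-C strip=none ${inherited}"
    child_sees
}

build_shadowed     # prints: -C strip=none
build_preserving   # prints: -C strip=none -C target-feature=-crt-static
child_sees         # prints: -C target-feature=-crt-static
```

The last call shows that the global value is untouched once the functions return, which matches the manual’s description: the local only hides the outer variable for as long as the function is running.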

This whole saga was also reported to the Gentoo bug tracker and promptly fixed. A project for another day is figuring out exactly how a change from static linking to dynamic linking causes segfaults like this, because I sure would love to know the details.