About a week ago I noticed that fd(1), a Rust-based alternative to find(1),
would suddenly segfault on my musl-based server system. Usually a segfault is
nothing particularly special to my eyes, but this one was different. Even just
having fd(1) attempt to print its help text was enough to trigger it, and when I
attempted to debug it with gdb(1), I saw the following:
(gdb) run
Starting program: /usr/bin/fd
Program received signal SIGSEGV, Segmentation fault.
memcpy () at ../src_musl/src/string/x86_64/memcpy.s:18
warning: 18 ../src_musl/src/string/x86_64/memcpy.s: No such file or directory
(gdb) bt
#0 memcpy () at ../src_musl/src/string/x86_64/memcpy.s:18
#1 0x00007ffff7ab7177 in __copy_tls () at ../src_musl/src/env/__init_tls.c:66
#2 0x00007ffff7ab730d in static_init_tls () at ../src_musl/src/env/__init_tls.c:149
#3 0x00007ffff7aae89d in __init_libc () at ../src_musl/src/env/__libc_start_main.c:39
#4 0x00007ffff7aae9c0 in __libc_start_main () at ../src_musl/src/env/__libc_start_main.c:80
#5 0x00007ffff74107f6 in _start ()
So… the segfault is in musl, not in fd!?
I immediately checked whether other basic programs on the system worked. They did. I checked when I last updated musl. A couple of months ago, so that can’t be it. I checked specifically whether another Rust-based program worked. It did.
fd(1) had been updated pretty recently, and I remembered it working correctly
about a month ago, so maybe something specific to fd(1)’s usage of Rust
triggered this segfault in musl? I wanted to make sure I could reproduce this in
a development environment, so I cloned the fd(1) repository, built a debug
release, and ran it…
It worked. Huh!?
I decided it was likely that portage, Gentoo’s package manager, was building the
program differently, so I took care to apply the same build flags to the
development build. And what can I say:
error: failed to run custom build command for `crossbeam-utils v0.8.20`
Caused by:
process didn't exit successfully: `fd/target/[...]/build-script-build`
(signal: 11, SIGSEGV: invalid memory reference)
… it didn’t even get to build the fd binary proper. A segfault again, too.
What on earth was going on? Why didn’t this also happen in the portage build?
Thankfully I now had a reproducer, so I did the only sensible thing and started
removing random build flags until I got fd to build again. This was our culprit:
-Wl,-z,pack-relative-relocs
Already pretty out of my depth considering the fact that I couldn’t fathom how
fd(1) got musl to segfault on memcpy, I now also found that a piece of the
puzzle required me to understand specific linker flags. Oof.
Unsure what to do next, I decided on a whim to compare the working and the
broken binary with readelf(1). The most obvious difference was that the working
binary had its .rela.dyn relocation section populated with entries whilst the
broken one was missing .rela.dyn but had .relr.dyn instead. At a loss, I stopped
and went to do something else.
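A comparison along those lines could be sketched as follows. This is only an illustration, not the exact commands used: the helper name reloc_sections and both binary paths are hypothetical, and it assumes readelf(1) from binutils, where -S lists the section headers and -W keeps each header on a single line.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `readelf -S -W` output on stdin and prints
# which dynamic relocation sections (.rel.dyn, .rela.dyn, .relr.dyn) appear.
reloc_sections() {
    grep -oE '\.(rela|relr|rel)\.dyn' | sort -u
}

# Usage (the paths are placeholders for the two builds being compared):
#   readelf -S -W ./fd-working | reloc_sections
#   readelf -S -W ./fd-broken  | reloc_sections
```

For the two binaries described here, the working one would be expected to list .rela.dyn and the broken one .relr.dyn.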
The story would probably have ended here had I not mentioned this conundrum to
my partner later in the day. We decided to have
another look at the binaries. After some discussion we determined that the
working binary was dynamically linked whilst the broken one wasn’t. The other
working Rust-based program, rg(1), was also dynamically linked and had been
built a while ago, so at some point portage must have stopped producing Rust
executables that were dynamically linked. Finally some progress!
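One way to make that distinction, sketched under assumptions: a dynamically linked executable carries a PT_INTERP program header naming its dynamic loader, while a statically linked one does not. The helper name linkage_kind and the path in the usage comment are hypothetical; the helper only inspects the output of readelf -l -W and is meant for executables, since shared libraries usually lack PT_INTERP as well.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `readelf -l -W` output on stdin and reports
# whether the program headers include an INTERP entry.
linkage_kind() {
    if grep -qE '^[[:space:]]*INTERP[[:space:]]'; then
        echo dynamic
    else
        echo static
    fi
}

# Usage (the path is a placeholder):
#   readelf -l -W /usr/bin/fd | linkage_kind
```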
At this point we need some background. Early on, Rust decided to use the
x86_64-unknown-linux-musl target to provide statically-linked binaries that
would run on a wide range of systems. Whilst support for dynamically linked
executables on musl systems was added back in 2017, the default behaviour was
never changed, so Gentoo has to make sure to disable static linking by passing
the target-feature=-crt-static flag. It does this in a system-wide fashion by
setting an environment variable in /etc/env.d:
$ cat /etc/env.d/50rust-bin-1.80.1
LDPATH="/usr/lib/rust/lib"
MANPATH="/usr/lib/rust/man"
CARGO_TARGET_X86_64_UNKNOWN_LINUX_MUSL_RUSTFLAGS="-C target-feature=-crt-static"
This setting should therefore be picked up by portage as well, but when I
examined its build environment it was simply not there. So finally we come to
the last piece of the puzzle: a recent change in how RUSTFLAGS are set within
portage. Here’s the important part:
local -x CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS="-C strip=none -C linker=${LD_A[0]}"
[[ ${#LD_A[@]} -gt 1 ]] && local CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS+="$(printf -- ' -C link-arg=%s' "${LD_A[@]:1}")"
local CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS+=" ${RUSTFLAGS}"
Quoth the bash(1) manual:
Local variables “shadow” variables with the same name declared at previous scopes. For instance, a local variable declared in a function hides a global variable of the same name: references and assignments refer to the local variable, leaving the global variable unmodified.
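Where previously the RUSTFLAGS environment variable was only touched when
cross-compiling, it was now overridden. Because the new code declares the
variable with local, the value set in /etc/env.d is shadowed rather than
extended, so -C target-feature=-crt-static was missing from the build. To
confirm, I edited the file in question to include the previous value, and both
fd(1) and rg(1) worked again. Success!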
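The shadowing is easy to reproduce outside of portage. The following is a minimal sketch, not the actual eclass code or its eventual fix: the function names are hypothetical, the flag list is shortened to -C strip=none, and the export at the top merely stands in for the setting from /etc/env.d.

```bash
#!/usr/bin/env bash
# Stand-in for the system-wide setting that /etc/env.d provides.
export CARGO_TARGET_X86_64_UNKNOWN_LINUX_MUSL_RUSTFLAGS="-C target-feature=-crt-static"

# Prints the variable as a child process (such as cargo) would see it.
show_child_env() {
    sh -c 'printf "%s\n" "$CARGO_TARGET_X86_64_UNKNOWN_LINUX_MUSL_RUSTFLAGS"'
}

# Simplified version of the quoted pattern: a new exported local
# shadows the inherited value for the duration of the function.
build_shadowing() {
    local TRIPLE=X86_64_UNKNOWN_LINUX_MUSL
    local -x CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS="-C strip=none"
    show_child_env
}

# Hypothetical variant that copies the inherited value into the local.
build_preserving() {
    local TRIPLE=X86_64_UNKNOWN_LINUX_MUSL
    local inherited="${CARGO_TARGET_X86_64_UNKNOWN_LINUX_MUSL_RUSTFLAGS}"
    local -x CARGO_TARGET_"${TRIPLE}"_RUSTFLAGS="-C strip=none ${inherited}"
    show_child_env
}

build_shadowing    # prints: -C strip=none
build_preserving   # prints: -C strip=none -C target-feature=-crt-static
show_child_env     # prints: -C target-feature=-crt-static (global untouched)
```

The first call shows the problem in miniature: the child process never sees -crt-static, even though the global variable is intact once the function returns.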
This whole saga was also reported to the Gentoo bug tracker and promptly fixed. A project for another day is figuring out exactly how a change from static linking to dynamic linking causes segfaults like this, because I sure would love to know the details.