The function wordexp(3) is a POSIX C standard library function which
performs "word expansion like a POSIX shell". wordexp(3) combines the
safety of elaborate string parsing in C with the efficiency and
robustness of invoking the shell on arbitrary user input. Why does it
even exist? And why shouldn't you use it?
Probably the most legit use is by init-style programs executing a
command line (e.g. finit); though, since many wordexp implementations
invoke the shell anyway, these might as well exec sh -c 'exec ...'
instead.
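For an init-style program, handing the whole command line to the shell could look roughly like the sketch below. It is only an illustration, not how finit or any other init does it: `run_command_line` is a hypothetical helper name, and the code assumes a POSIX system with a shell at /bin/sh.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run one command line through /bin/sh and return its
 * exit status, or -1 if the child could not be started or died abnormally. */
int run_command_line(const char *cmdline)
{
    assert(cmdline != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the shell performs all word expansion itself. */
        execl("/bin/sh", "sh", "-c", cmdline, (char *)NULL);
        _exit(127); /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller that wants the shell replaced by the final program can pass a line beginning with exec, for example `run_command_line("exec true")`.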
Applications typically use wordexp to expand tildes and globs
(~/*.txt), and are oblivious to its excessive powers. Mostly, these
uses are in places like configuration files the user directly
controls, so any disasters as a consequence of wordexp can be
considered the user's fault.
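For readers who have not met the API, a typical call might look like the following minimal sketch. `print_expansion` is a hypothetical helper, not taken from any of the programs mentioned here; it passes no flags, so every expansion the libc supports is enabled.

```c
#include <assert.h>
#include <stdio.h>
#include <wordexp.h>

/* Hypothetical helper: expand `words` and print each resulting field.
 * Returns the number of fields, or -1 if wordexp() reported an error. */
int print_expansion(const char *words)
{
    assert(words != NULL);

    wordexp_t we = {0};
    int rc = wordexp(words, &we, 0);
    if (rc != 0) {
        /* On WRDE_NOSPACE some fields may already have been allocated. */
        if (rc == WRDE_NOSPACE)
            wordfree(&we);
        return -1;
    }

    for (size_t i = 0; i < we.we_wordc; i++)
        printf("%s\n", we.we_wordv[i]);

    int count = (int)we.we_wordc;
    wordfree(&we);
    return count;
}
```

Calling `print_expansion("~/*.txt")` is the tilde-and-glob case described above; the same call will just as readily accept parameter expansion and command substitution.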
More severe are the cases where wordexp's input comes from an untrusted source [1] or the program in question is setuid [2]. Sometimes people use it when parsing files, even (e.g. this tinygltf issue ended up affecting blender). [3]
I continue to find code copied from Stack Overflow where it remains recommended, despite it being unsafe, and probably slow, too.
All this would make wordexp seem like just another call like popen or
system, where it's obvious that you're opening Pandora's box, except
wordexp has a flag, WRDE_NOCMD, intended to prevent the worst abuses
of it. The existence of this flag is a mistake, because almost no libc
actually tries to make it consistently safe. This flag may imply to
people that wordexp is ever safe to use on untrusted input. However,
WRDE_NOCMD is effectively broken depending on the combination of shell
and libc in use.
Command substitution has two forms in shell: backtick-delimited
(`command`) and dollar-parenthesized ($(command)). The former
presents more problems for the user, but fewer for the author of the
parser: simply scan ahead for a matching backtick, obeying other
escaping and quoting.
The latter form is often where first-time shell writers fall into despair. It may be nested, and contain content-dependent unbalanced parentheses [4]. The lexer must mutually-recursively invoke the parser just to find the end of the token.
The POSIX standard is ambiguous about a concerning detail of shell syntax: the difference between command and arithmetic substitution. Don't expend too much effort trying to understand this:
> If the current character is an unquoted '$' or '`', the shell shall identify the start of any candidates for parameter expansion, command substitution, or arithmetic expansion from their introductory unquoted character sequences: '$' or "${", "$(" or '`', and "$((", respectively. The shell shall read sufficient input to determine the end of the unit to be expanded (as explained in the cited sections). While processing the characters, if instances of expansions or quoting are found nested within the substitution, the shell shall recursively process them in the manner specified for the construct that is found. The characters found from the beginning of the substitution to its end, allowing for any recursion necessary to recognize embedded constructs, shall be included unmodified in the result token, including any embedded or enclosing substitution operators or quotes. The token shall not be delimited by the end of the substitution.
Each shell and wordexp implementation has its own take on how to deal with ambiguous expressions like these:

    $((echo a);(echo b))
    $((echo "(");(echo ")"))
    $((case a in *) echo b;; esac))

An actual shell needs to recursively parse these expressions, whereas most wordexp implementations try to simply match parentheses. The last example in particular offers an opportunity to introduce arbitrary parentheses.
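To see why counting parentheses is not enough, here is a deliberately naive scanner. It is a hypothetical sketch, not the code of any particular libc: `naive_close_paren` counts parentheses and ignores both quoting and shell grammar.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical naive scanner: given a string starting with "$(", return the
 * index of the ')' that balances it by counting parentheses only, or -1 if
 * none is found. It knows nothing about quoting or shell grammar. */
long naive_close_paren(const char *s)
{
    assert(s != NULL && strncmp(s, "$(", 2) == 0);

    int depth = 0;
    for (size_t i = 1; s[i] != '\0'; i++) {
        if (s[i] == '(') {
            depth++;
        } else if (s[i] == ')') {
            if (--depth == 0)
                return (long)i;
        }
    }
    return -1;
}
```

For `$(echo a)` the scanner finds the final character, as expected. For `$((case a in *) echo b;; esac))` it treats the `)` that closes the case pattern as an ordinary closing parenthesis, so it reports the end one character early and leaves a stray `)` behind. A parser that reads this as a command substitution consumes the whole string.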
The POSIX rationale does say:
> Arithmetic expansions have precedence over command substitutions. That is, if the shell can parse an expansion beginning with "$((" as an arithmetic expansion then it will do so. It will only parse the expansion as a command substitution (that starts with a subshell) if it determines that it cannot parse the expansion as an arithmetic expansion. If the syntax is valid for neither type of expansion, then it is unspecified what kind of syntax error the shell reports.
But who's reading that? Clearly not the authors of most popular shells:
shell | $((echo a);(echo b)) | $((echo "(");(echo ")")) | $((case a in *) echo b;; esac)) |
---|---|---|---|
dash | expects ) | parse error | parse error |
bash | a b | ( ) | syntax error |
zsh | a b | bad math expression | b |
mksh | a b | ( ) | b |
hush | expects ) | syntax error | syntax error |
And while perhaps zsh is unlikely to ever be /bin/sh, all of the
others are reasonable candidates for it that I've seen on other
systems.
Now that we know shells handle these expressions inconsistently, how
do different libcs implement wordexp?
OpenBSD has the best possible implementation of wordexp(3): none.
The demerits of the function are discussed in this thread from 2010.
There are implementations which try to keep wordexp simple by
shelling out, which was probably the intended behavior when the
function was first created. Unfortunately, this means WRDE_NOCMD
can't be trusted in these libcs without the direct assistance of the
shell.
musl's implementation is nice and simple, because it shells out.
Unfortunately, in this simplicity, there's no way to really enforce
WRDE_NOCMD. musl tries, by matching parentheses, but
$((echo a);(echo b)) or similar will get around it, as long as /bin/sh
supports such contortions (this confuses dash, but bash happily
runs these commands).
FreeBSD added wordexp in 2002, using the problematic shell-invoking approach. To FreeBSD's credit, they fixed the major issues with this approach circa 2015 by specializing the shell, indeed noting:
> Shell syntax is too complicated to detect command substitution and unquoted operators reliably without implementing much of sh's parser. Therefore, have sh do this detection.
macOS has inherited versions of this implementation, with some
modifications (1044.1.2, 1534.81.1). An important practical
difference, though, is that on FreeBSD, /bin/sh is always their
ash, which at the least doesn't suffer from the aforementioned
parsing problem, while on macOS /bin/sh has been bash. [5]
One interesting twist of Apple's implementation is that in the past,
they shelled out to perl, via popen. Last time I checked, they
use a helper called /usr/lib/system/wordexp, but this did nothing to
prevent command substitution: Apple's libc suffered the same problem
as musl. (The shell situation on macOS is always evolving in
interesting ways, so who knows what the state is now.)
The Solaris implementation (originally from MKS with a copyright of
1985!) is notable for implementing WRDE_NOCMD by leveraging ksh's
restricted mode.
Few people may still be using this implementation, but it is pretty clever, and I guess it demonstrates that the commercial Unix implementations may not have been bad.
So, instead of being simple, we can try the herculean task of
implementing most of a shell in libc. The only libc I know of
that does this is glibc, though (often old, broken) copies of its
implementation are found widely, both in programs trying to get around
systems like OpenBSD and in other libcs. For example, uclibc
has an old version of glibc's wordexp which has the fatal flaw that
backticks can still be parsed from arithmetic expressions, so
e.g. $[`touch foo`] will execute a command.
glibc avoids calling out to the shell except when it must, for command substitution. This results in an elaborate, 2500-line reimplementation of some of a shell parser, including a single 800-line function to parse parameter expansions.
It omits many details. For example, in arithmetic expansion, many
valid (and useful) expressions like $((1<<16)) will not be
recognized.
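A quick way to see how a given libc treats an arithmetic expression is to expand it and look at the return code and the first field. The sketch below assumes nothing beyond the standard API; `expand_first` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <wordexp.h>

/* Hypothetical helper: expand `expr` and copy the first resulting field
 * into `out`. Returns wordexp()'s code (0 on success); on a nonzero return
 * `out` is left as an empty string. */
int expand_first(const char *expr, char *out, size_t outlen)
{
    assert(expr != NULL && out != NULL && outlen > 0);
    out[0] = '\0';

    wordexp_t we = {0};
    int rc = wordexp(expr, &we, 0);
    if (rc == 0 && we.we_wordc > 0)
        snprintf(out, outlen, "%s", we.we_wordv[0]);
    if (rc == 0 || rc == WRDE_NOSPACE)
        wordfree(&we);
    return rc;
}
```

A shell evaluates $((1<<16)) to 65536. Passing the same string to `expand_first` and comparing the return code and buffer against that value shows whether the libc in use accepted the expression.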
However, it has one great merit. In 2014, Carlos O'Donell fixed an
important vulnerability. Previously, the code tried to enforce
WRDE_NOCMD only when it recognized command substitution's two forms,
like many of the implementations which lean on the shell for
everything. Though it seems obvious in retrospect, glibc's current
implementation guards the actual execution of commands with
WRDE_NOCMD tests, instead of trying to do this during parsing.
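The design idea, checking the flag at the point of execution rather than while parsing, can be illustrated with a small sketch. This is not glibc's code: `run_substitution` is a hypothetical stand-in for the single function through which every substituted command would have to pass, and popen is used only to keep the example short.

```c
#define _POSIX_C_SOURCE 200809L /* for popen() under strict compiler modes */
#include <assert.h>
#include <stdio.h>
#include <wordexp.h>

/* Hypothetical sink: the single place where a substituted command would be
 * run. Checking the flag here means no parser path can reach popen()
 * without passing the check. */
int run_substitution(const char *cmd, int flags, char *out, size_t outlen)
{
    assert(cmd != NULL && out != NULL && outlen > 0);
    out[0] = '\0';

    if (flags & WRDE_NOCMD)
        return WRDE_CMDSUB; /* refuse before anything is executed */

    FILE *p = popen(cmd, "r");
    if (p == NULL)
        return WRDE_NOSPACE; /* sketch only: reuse an existing error code */

    size_t n = fread(out, 1, outlen - 1, p);
    out[n] = '\0';
    pclose(p);
    return 0;
}
```

However confused the parser gets about where a substitution starts or ends, a guard placed here still fires, because it does not depend on the parser having classified the input correctly.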
Despite its complexity, it does seem like the only implementation that is safe to use, though it seems a better policy is to declare a ban on wordexp.
I had never heard of wordexp(3) until I saw it mentioned in the
POSIX standard, while I was implementing a shell.
Many programs that just use wordexp("~/foo") could be replaced with
glob(..., GLOB_TILDE). (Though keep in mind, anyone can crash your
program with a bad glob.)
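A minimal sketch of that replacement follows. `expand_tilde` is a hypothetical helper, and note that GLOB_TILDE is an extension found in glibc and the BSDs rather than part of POSIX, so the feature-test macro at the top is an assumption about building against glibc.

```c
#define _DEFAULT_SOURCE /* glibc: expose GLOB_TILDE, an extension to POSIX */
#include <assert.h>
#include <glob.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: expand a leading ~ in `pattern` with glob() and copy
 * the first match into `out`. Returns 0 on success, -1 if nothing matched,
 * glob() failed, or the result did not fit in `out`. */
int expand_tilde(const char *pattern, char *out, size_t outlen)
{
    assert(pattern != NULL && out != NULL && outlen > 0);

    glob_t g = {0};
    int ok = -1;
    if (glob(pattern, GLOB_TILDE, NULL, &g) == 0 && g.gl_pathc > 0) {
        int n = snprintf(out, outlen, "%s", g.gl_pathv[0]);
        if (n >= 0 && (size_t)n < outlen)
            ok = 0;
    }
    globfree(&g);
    return ok;
}
```

Unlike wordexp, glob never hands the string to a shell, so there is no command substitution to worry about. Without GLOB_NOCHECK the path has to exist for a match to come back.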
I'm not sure where it was first implemented; scanning the Unix history repo, it doesn't really appear until the 90s but there's a reference in a manpage from FreeBSD 1.0 (which doesn't implement it). Let's hope it fades into history similarly.
(Post-publishing addendum: a friend alerted me to the POSIX rationale
for wordexp (which I should have read to begin with). It seems the
problem was that people kept demanding more and more features for
glob, so it seems like the committee threw up their hands and added
the far-too-broad wordexp in an attempt to cover all possible bases.)
[1] jailkit uses wordexp in its restricted shell; to be fair,
this is only enabled if you use the allow_word_expansion option,
which is disabled by default, and there's always the chance it will
link with a safer wordexp like in modern glibc. However, it ships
with a vulnerable wordexp bundled, for systems without it, and for
example jk_lsh -c '$[`touch foo`]' will do bad things in such a
case.
[2] sudo has some clever logic to stuff the WRDE_NOCMD flag
into calls to wordexp in its noexec mode; unfortunately, as
demonstrated in this post, that's not sufficient to prevent execution
from happening in all cases.
[3] Debian's excellent codesearch service provides a quick way to find calls: https://codesearch.debian.net/search?q=%5Cbwordexp%5C%28%5Cb&literal=0&page=1
[4] Consider this example:
$(case $(case $x in *) (echo $x);; esac) in (x) $(echo :);; *) $(echo :);; esac)
[5] bash has a (disabled by default at compile time) --wordexp
option, which is an attempt to provide this kind of functionality more
safely. It tries to disable command substitution everywhere and only
invokes the parser and expander immediately. Last time I checked,
macOS didn't use this.