Why you shouldn't parse ls(1) output

The ls(1) command is pretty good at showing you the attributes of a single file (at least in some cases), but when you ask it for a list of files, there is a huge problem: Unix allows almost any character in a file name, including spaces, newlines, periods, pipe symbols, and pretty much anything else you could ever try to use as a delimiter, except NUL. There are proposals to “fix” this within POSIX, but they will not help with the situation as it stands today (see also how to deal with file names correctly). When its standard output is not a terminal, ls in its default mode separates file names with newlines. That works fine until a file has a newline in its name. Since very few implementations of ls can terminate file names with NUL characters instead of newlines, there is no safe way to get a list of file names from ls, at least not portably.

$ touch 'a space' $'a\nnewline'
$ echo "don't taze me, bro" > a
$ ls | cat
a
a
newline
a space

This output appears to show that we have two files named a, one named newline, and one named a space.

But if you use ls -l, you can see that this is completely wrong:

$ ls -l
total 8
-rw-r-----  1 lhunath  lhunath  19 Mar 27 10:47 a
-rw-r-----  1 lhunath  lhunath   0 Mar 27 10:47 a?newline
-rw-r-----  1 lhunath  lhunath   0 Mar 27 10:47 a space

The problem is that from the output of ls, neither you nor the computer can tell which parts of it constitute a file name. Is it each word? No. Is it each line? No. There is only one correct answer to this question: you can't tell.
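One way to see the ambiguity is to compare the number of lines ls writes with the number of names a glob expands to. The following is only a sketch: the helper name count_mismatch is made up for this illustration, and it assumes an ls that writes newlines in file names unchanged when its output is a pipe (as GNU ls does).

```bash
# count_mismatch DIR: print "<lines written by ls> <names from the glob>".
# Hypothetical helper, for illustration only. Runs in a subshell so the
# cd and the changed positional parameters do not leak into the caller.
count_mismatch() (
    cd -- "$1" || exit 1
    # Lines ls writes when stdout is a pipe; the arithmetic expansion
    # strips any padding that wc adds on some systems.
    ls_lines=$(( $(ls | wc -l) ))
    # Names produced by the shell's own glob expansion.
    set -- *
    # An unmatched glob stays literal, so make sure the first name exists.
    [ -e "$1" ] || [ -L "$1" ] || set --
    echo "$ls_lines $#"
)
```

With the three files created above (a, a space, and a name containing a newline), the two numbers differ, which is exactly why counting or splitting lines is not the same as counting or splitting names.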

It is also worth noting that ls sometimes garbles file name data (in our case, it turned the newline character between the words a and newline into a question mark; some systems print \n instead). On some systems it does not do this when its output is not a terminal, while on others it always mangles the file name. All in all, you should never assume that the output of ls is a true representation of the names of the files you are working with.

Now that we understand the problem, let's explore various ways of coping with it. As usual, we have to start by figuring out what we actually want to do.

Listing files or performing file operations

Things get disastrous when people try to use ls to get a list of file names (all files, files matching a glob, or files sorted in some way).

If you just want to iterate over all the files in the current directory, use a for loop and a glob:

# Good!
for f in *; do
    [ -e "$f" ] || [ -L "$f" ] || continue
    ...
done

Also consider using shopt -s nullglob so that an empty directory does not give you a literal *.

# Good! (Bash only)
shopt -s nullglob
for f in *; do
    ...
done
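If the names are needed more than once, the same glob can fill an array. A minimal sketch, assuming Bash; the function name count_entries is hypothetical. The body runs in a subshell so that nullglob and the cd do not affect the calling shell, and, as with any plain *, names starting with a dot are not matched.

```bash
# count_entries DIR: print how many non-hidden entries DIR contains.
# Hypothetical helper; the ( ... ) body keeps nullglob local to the call.
count_entries() (
    shopt -s nullglob
    cd -- "$1" || exit 1
    files=(*)                      # one array element per name, however odd
    printf '%s\n' "${#files[@]}"
)
```

An empty directory yields 0 here rather than 1, because nullglob removes the unmatched pattern instead of leaving a literal *.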

Never do any of these:

# BAD! Don't do this!
for f in $(ls); do
    ...
done
# BAD! Don't do this!
for f in $(find . -maxdepth 1); do # in this context find is just as bad as ls
    ...
done
# BAD! Don't do this!
arr=($(ls)) # Word splitting and globbing happen here, same mistake as above
for f in "${arr[@]}"; do
    ...
done
# BAD! Don't do this! (The function itself is correct.)
f() {
    local f
    for f; do
        ...
    done
}

f $(ls) # Word splitting and globbing happen here, same mistake as above.

For more details, see BashPitfalls and DontReadLinesWithFor.

Things get more complicated if you need some special sorting that only ls can do, such as ordering by mtime. If you want the oldest or newest file in a directory, don't use ls -t | head -1; read Bash FAQ 99 instead. If you really need a list of all the files in a directory in mtime order so that you can process them in sequence, write a Perl program that opens and sorts the directory itself. Then either do the processing in the Perl program or (worst case) have the program print the file names with NUL delimiters.
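For the common "newest file" case, the shell's own -nt test is enough and no listing has to be parsed. A minimal sketch, assuming Bash and that only regular files are of interest; the function name newest_file and the use of REPLY as the result variable are choices made for this illustration.

```bash
# newest_file DIR: set REPLY to the most recently modified regular file
# in DIR (empty if there is none). Hypothetical helper.
newest_file() {
    local f
    REPLY=
    for f in "$1"/*; do
        [[ -f $f ]] || continue        # also skips an unmatched glob
        if [[ -z $REPLY || $f -nt $REPLY ]]; then
            REPLY=$f
        fi
    done
}
```

Swapping -nt for -ot gives the oldest file instead. The result is passed back in a variable rather than printed so that a name containing a newline survives intact.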

Even better, put the modification time in the file name, in YYYYMMDD format, so that glob order is also mtime order. Then you don't need ls, Perl, or anything else. (In the vast majority of cases where people want the oldest or newest file in a directory, the problem can be solved this way.)
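As an illustration of the date-in-the-name approach, suppose files are named backup-YYYYMMDD.tar (a naming scheme invented for this example). Glob results are sorted, so the first and last matches are the oldest and newest. A sketch assuming Bash; the function and variable names are hypothetical.

```bash
# pick_backups DIR: set the variables "oldest" and "newest" to the first
# and last backup-YYYYMMDD.tar file in DIR; return 1 if none exist.
# The file naming scheme is an assumption made for this example.
pick_backups() {
    local files
    files=("$1"/backup-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].tar)
    # Without nullglob an unmatched pattern stays literal, so test it.
    [[ -e ${files[0]} ]] || return 1
    oldest=${files[0]}
    newest=${files[${#files[@]}-1]}
}
```

Because the date is written most-significant part first, sorting the names as text is the same as sorting them chronologically.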

You could patch ls to support a --null option and submit the patch to your OS vendor. That should have been done about fifteen years ago. (In fact, people tried, but the patch was rejected! See below.)

Of course, the reason that was not done is that very few people really need the sorting of ls in their scripts. Most of the time, when people want a list of file names, they use find(1) instead, because they don't care about the order. And BSD/GNU find has been able to terminate file names with NUL for a very long time.

So, instead of this:

# Bad! Don't do this!
ls | while read filename; do
  ...
done

Try this:

# Note that this is not quite the same as the code above. It recurses and lists only regular files (that is, no directories or symlinks). This may be fine in some situations, but it is not a drop-in replacement.
find . -type f -print0 | while IFS= read -r -d '' filename; do
  ...
done

Besides, most people don't really want a list of file names. They want to perform operations on files. The list is just an intermediate step toward some real goal, for example replacing www.mydomain.com with mydomain.com in every *.html file. find can pass file names directly to another command. There is usually no need to write the file names out in a stream and rely on some other program to read the stream and split it back into names.
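The www.mydomain.com example could be handled along these lines, with find handing the names straight to a small inline script through -exec ... {} +. This is a sketch under assumptions: the function name strip_www is hypothetical, and the edit goes through a temporary file because in-place editing options differ between sed implementations.

```bash
# strip_www DIR: replace www.mydomain.com with mydomain.com in every
# *.html file under DIR. Hypothetical helper; the names never pass
# through a text stream, so odd characters in them are harmless.
strip_www() {
    find "$1" -type f -name '*.html' -exec sh -c '
        for f do
            # Write the edited copy next to the file, then move it in place.
            sed "s/www\.mydomain\.com/mydomain.com/g" "$f" > "$f.tmp" &&
                mv -- "$f.tmp" "$f"
        done' sh {} +
}
```

Note that replacing a file this way gives it the permissions of the newly created temporary file, which may or may not matter for the task at hand.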

Getting file metadata

If you need the file size, then a portable way would be to use wc:

# POSIX
size=$(wc -c < "$file")

Most implementations of wc detect that stdin is a regular file and get its size with a call to fstat(2). However, this is not guaranteed. Some implementations read all the bytes.
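Some wc implementations pad the number with leading blanks, so it can help to normalize the result before using it in a comparison or a message. A minimal sketch, assuming Bash; the function name file_size is hypothetical.

```bash
# file_size FILE: print the size of FILE in bytes with no padding.
# Hypothetical helper built on the portable wc -c idiom.
file_size() {
    local n
    n=$(wc -c < "$1") || return    # fails if FILE cannot be opened
    printf '%s\n' "$(( n ))"       # arithmetic expansion drops any blanks
}
```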

Other metadata is often difficult to obtain portably. stat(1) is not available on every platform, and where it is, its argument syntax often differs completely. There is no way to use stat that will not break on some other POSIX system the script might run on. However, if you are fine with that, the GNU implementations of stat(1) and find(1) (via the -printf option) are both very good ways to get file information, depending on whether you need it for one file or for several. AST find also has -printf, but again with incompatible formats, and it is much less common than GNU find.

# GNU
size=$(stat -c %s -- "$file")
(( totalSize = $(find . -maxdepth 1 -type f -printf %s+)0 ))
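When per-file values are needed rather than a total, -printf can emit one NUL-terminated record per file, which read -d '' can consume safely. A sketch assuming GNU find and Bash; the function name largest_file and the record layout ("size, space, path") are choices made for this illustration.

```bash
# largest_file DIR: set REPLY to the path of the largest regular file
# directly inside DIR (empty if there is none). GNU find only.
largest_file() {
    local rec size max=-1
    REPLY=
    while IFS= read -r -d '' rec; do
        size=${rec%% *}            # text before the first space
        if (( size > max )); then
            max=$size
            REPLY=${rec#* }        # everything after the first space
        fi
    done < <(find "$1" -maxdepth 1 -type f -printf '%s %p\0')
}
```

Putting the size first means the record can be split at the first space without caring what characters the path contains.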

If all else fails, you can try to parse some metadata out of the output of ls -l. But keep the following in mind:

  1. Run ls on only one file at a time. You cannot reliably tell where the first file name ends, because there is no good delimiter (and no, a newline is not a good enough delimiter), so there is no way to tell where the metadata for the second file begins.

  2. Don't parse the time/date stamp or anything beyond it. The time/date fields are usually formatted in a very platform- and locale-dependent way, so they cannot be parsed reliably.

  3. Don't forget the -d option, without which ls lists the contents of the directory if the file is a directory; also don't forget the -- delimiter, which avoids problems with file names that start with -.

  4. Run ls in the C/POSIX locale, since the output format is not specified outside that locale. The timestamp format in particular generally depends on the locale, but other things can depend on it too.

  5. Remember that the word-splitting behavior of read depends on the current value of $IFS.

  6. Prefer numeric output for owner and group by using -n instead of -l, because user and group names can sometimes contain spaces. User and group names may also sometimes be truncated.

This is quite reliable:

IFS=' ' read -r mode links owner _ < <(LC_ALL=C ls -nd -- "$file")

Note that the mode string is also often platform-specific. For example, OS X adds an @ for files with xattrs and a + for files with extended security information. GNU sometimes adds a . or + character. So, depending on what you are doing, you may need to limit the mode field to its first ten characters.

mode=${mode:0:10}
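Put together, the two snippets above might be wrapped like this. It is only a sketch, assuming Bash; the function name ls_fields is hypothetical, and it stops before the group field because of the -g differences between ls versions.

```bash
# ls_fields FILE: set the variables mode, links and owner from a single
# "ls -nd" line for FILE; return non-zero if ls prints nothing.
# Hypothetical helper following the rules listed above.
ls_fields() {
    local _
    # One file per call, C locale, numeric IDs, -d so directories are not
    # expanded, and -- so a leading "-" in the name is not taken as an option.
    IFS=' ' read -r mode links owner _ < <(LC_ALL=C ls -nd -- "$1") || return
    mode=${mode:0:10}              # drop platform-specific suffixes (@ + .)
}
```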

If you don't believe me, here's an example of why you shouldn't parse the timestamp:

# OpenBSD 4.4:
$ ls -l
-rwxr-xr-x  1 greg  greg  1080 Nov 10  2006 file1
-rw-r--r--  1 greg  greg  1020 Mar 15 13:57 file2

# Debian unstable (2009):
$ ls -l
-rw-r--r-- 1 wooledg wooledg       240 2007-12-07 11:44 file1
-rw-r--r-- 1 wooledg wooledg      1354 2009-03-13 12:10 file2

On OpenBSD, as on most versions of Unix, ls shows timestamps in three fields (month, day, and year-or-time), where the last field is the time (hours:minutes) if the file is less than six months old, or the year if it is older than that.

On Debian unstable (circa 2009), with a then-current version of GNU coreutils, ls showed timestamps in two fields: the first was Y-M-D and the second was H:M, no matter how old the file was.

So it should be pretty obvious that we never want to parse the output of ls if we need a file's timestamp. You would have to write code to handle all three of the time/date formats shown above, and possibly more.

But the fields before the date/time are usually fairly reliable.

(Note: some versions of ls do not print a file's group ownership by default and require a -g flag to do so. Others print the group by default, and -g suppresses it. You have been warned.)
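When a timestamp really is needed, it is easier to request it in a fixed, machine-readable form than to decode what ls prints. A non-portable sketch, assuming GNU stat (for -c %Y, the mtime in seconds since the epoch) and Bash 4.2 or later (for the %(...)T format of printf); the function name mtime_iso is hypothetical.

```bash
# mtime_iso FILE: print FILE's modification time as YYYY-MM-DD HH:MM:SS
# in the current time zone. GNU stat and Bash 4.2+ assumed.
mtime_iso() {
    local epoch
    epoch=$(stat -c %Y -- "$1") || return
    printf '%(%Y-%m-%d %H:%M:%S)T\n' "$epoch"
}
```

The format is chosen by the script, so it does not change with the age of the file or with the locale.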

If we wanted to get the metadata of several files in a single ls command, we would run into the same problem as before: files with newlines in their names break the output. Imagine how this code would break if a file name contained a newline:

# Don't do this
{ IFS=' ' read -r 'perms[0]' 'links[0]' 'owner[0]' 'group[0]' _
  IFS=' ' read -r 'perms[1]' 'links[1]' 'owner[1]' 'group[1]' _
} < <(LC_ALL=C ls -nd -- "$file1" "$file2")

Similar code that uses two separate ls calls would probably be fine, because the second read command is guaranteed to start reading at the beginning of the output of ls rather than in the middle of a file name. Also keep in mind that ls sorts its output and might not find one of the files, so we cannot be sure what ends up in perms[1]. The first problem can be worked around with the -q option of ls, but that does not help with the others.

If all of this sounds like a big headache, you're right. It is probably not worth trying to work around all this lack of standardization. See Bash FAQ 87 for ways to get file metadata without parsing the output of ls at all.
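The variant with one ls call per file could look like the sketch below, assuming Bash. The function name read_meta is hypothetical; the array names follow the example above. Each read starts at the beginning of a fresh ls output, and element i always belongs to argument i because nothing is sorted across files.

```bash
# read_meta FILE...: fill the arrays perms, links, owner and group with
# one element per argument, in argument order. Hypothetical helper.
read_meta() {
    local f i=0 _
    perms=() links=() owner=() group=()
    for f; do
        # A separate ls per file: a newline in one name cannot shift the
        # fields of the next file, and a missing file is detected at once.
        IFS=' ' read -r "perms[$i]" "links[$i]" "owner[$i]" "group[$i]" _ \
            < <(LC_ALL=C ls -nd -- "$f") || return
        i=$((i + 1))
    done
}
```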

Notes on ls from GNU coreutils

In 2014, a patch was rejected that would have added a -0 option (similar to find -print0) to GNU coreutils ls. Surprisingly, however, a --zero option was added in GNU coreutils 9.0 (2021). If you are fortunate enough to be writing for platforms that have ls --zero, you can use it for tasks like “delete the five oldest files in this directory.”

# Bash 4.4 and coreutils 9.0
# Delete the five oldest files in the current directory.
readarray -t -d '' -n 5 sorted < <(ls --zero -tr)
(( ${#sorted[@]} == 0 )) || rm -- "${sorted[@]}"
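Since --zero only exists in coreutils 9.0 and later, a script may want to probe for it before relying on it. The sketch below is one possible shape, assuming Bash 4.4 or later; the function name oldest_entries is hypothetical, and the fallback branch is a substitute technique (GNU find -printf piped to sort -z) that is not part of the approach described above.

```bash
# oldest_entries N DIR: fill the array "sorted" with the names of the N
# oldest non-hidden entries in DIR. Hypothetical helper; GNU tools assumed.
oldest_entries() {
    local n=$1 dir=$2 rec
    sorted=()
    if ls --zero -d -- . >/dev/null 2>&1; then
        # coreutils 9.0 or later: ls emits NUL-terminated names itself.
        readarray -t -d '' -n "$n" sorted < <(cd -- "$dir" && ls --zero -tr)
    else
        # Older GNU systems: find prints "<mtime> <name>" records, which
        # sort -z orders numerically while keeping the NUL terminators.
        while IFS= read -r -d '' rec; do
            sorted+=("${rec#* }")
        done < <(find "$dir" -mindepth 1 -maxdepth 1 ! -name '.*' \
                     -printf '%T@ %P\0' | sort -z -n)
        sorted=("${sorted[@]:0:n}")
    fi
}
```

Both branches leave bare names (relative to DIR) in the array, so the caller decides whether to display them, delete them, or pass them on.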

Recent (circa 2016) versions of GNU coreutils have a --quoting-style option with various choices.

One of them is quite useful in combination with bash's eval command. Specifically, --quoting-style=shell-always produces output that Bourne-style shells can parse back into file names.

$ touch zzz yyy $'zzz\nyyy'
$ ls --quoting-style=shell-always
'yyy'  'zzz'  'zzz?yyy'
$ ls --quoting-style=shell-always | cat
'yyy'
'zzz'
'zzz
yyy'

It always uses single quotes around file names (and renders single quotes themselves as \'), because that is the only safe form of quoting.

Note that some control characters are still rendered as ? when the output goes to a terminal, but not when the output is redirected (for example, when it is piped to cat as shown above, or more generally whenever the output is post-processed).

Combining this with eval, we can solve some problems like “get the five oldest files in this directory”. Of course, eval must be used with care.

# Bash + recent (circa 2016 or later) GNU coreutils

# Get all files, sorted by mtime.
eval "sorted=( $(ls -rt --quoting-style=shell-always) )"

# The first five array elements are the five oldest files.
# We can display them to the user:
(( ${#sorted[@]} == 0 )) || printf '<%s>\n' "${sorted[@]:0:5}"

# Or send them to xargs -r0:
print0() {
  [ "$#" -eq 0 ] || printf '%s\0' "$@"
}
print0 "${sorted[@]:0:5}" | xargs -r0 something

# Or do anything else with them

In addition, GNU ls supports a --quoting-style=shell-escape option (which became the default for terminal output in version 8.25), but it is not as safe, because it produces output that is not always quoted, or that uses quoting operators that are not portable or not safe in some locales.
