Althttpd

directory index

(1.11) By sodface on 2022-03-22 02:22:03 edited from 1.10 [source]

I wanted to get some basic directory browsing working within a specific directory (and all sub-directories) on my modest little website. I reviewed the script Stephan posted in the contrib folder but ultimately decided to roll my own.

Critique and comments welcome, but it does seem to be working fairly well, as seen in action at:

http://www.sodface.com/repo/

The script filename is "index" and it is placed in the root of the top-level directory you want to index. Browsing is then handled via links that include "index" in the URL. The text after "index" is placed in PATH_INFO by althttpd and is used by the script for navigation.
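The PATH_INFO handling at the top of the script boils down to a few lines like these (a minimal sketch of the idea, not the full script; the final echo is just for illustration):

```shell
#!/bin/sh
# althttpd puts the text after "index" in the URL into PATH_INFO.
# Squeeze duplicate slashes, strip a trailing slash, and fall back to
# the top level if the result is not a real directory under the
# script's directory.
path=$(printf '%s' "${PATH_INFO}" | tr -s "/")   # collapse "//" runs
path="${path%/}"                                 # drop trailing slash
[ -d ".${path}" ] || path=""                     # invalid -> top level
echo "serving directory: .${path:-/}"
```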

There were some differences between GNU awk and busybox awk printf that resulted in unwanted newlines when viewing the page source: busybox awk adds a newline when using a line continuation within a string. That is why I ended up with multiple statements in a row; it also made the script a little cleaner to look at.
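The multiple-statements style can be illustrated portably (this doesn't reproduce the busybox quirk itself, just the workaround): consecutive printf calls concatenate with no newline between them, so there is no need for a string continuation across lines.

```shell
# awk's printf adds no newline of its own, so splitting one long
# output line into several printf statements keeps the generated
# HTML free of stray newlines in both GNU awk and busybox awk.
printf 'a b\n' | awk '{
    printf("<td>%s</td>", $1)
    printf("<td>%s</td>\n", $2)
}'
```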

I'm not an experienced scripter or a security expert, so please use at your own risk. I made some effort at the top of the script to account for anyone manually adding extra slashes or other gibberish to the URL and to make sure that the path is valid and relative to the script's root directory.

Any regular file beginning with "index" is filtered from the output, so that may be an issue depending on your content. Another approach would be to check the file last modified timestamp and only include files newer than say Jan 1 1970. You can then use touch to manually set files you want to hide to 1970.
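The timestamp-hiding alternative could look something like this (a sketch only, not part of the posted script; it assumes the GNU/busybox forms of touch -d and find -newer, and the filenames are made up):

```shell
#!/bin/sh
# Backdate files you want hidden to the epoch, then list only files
# newer than a cutoff reference file.
d=$(mktemp -d)
printf 'keep\n' > "$d/shown.txt"
printf 'skip\n' > "$d/hidden.txt"
touch -d '1970-01-01 00:00:00 UTC' "$d/hidden.txt"  # "hide" it
touch -d '1970-01-02 00:00:00 UTC' "$d/.cutoff"     # comparison file
find "$d" -type f -newer "$d/.cutoff" ! -name '.cutoff'  # shown.txt only
rm -r "$d"
```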

//edit, the script below has been updated to filter out dirs/files that begin with "-", which stat was interpreting as an option. I also eliminated an unnecessary substr call and realigned the size column for dirs and files, which was off by one character after switching to human-readable file sizes.

//edit, the most current version of this script has been added to the contrib directory here:

dir-browser-2

(2.3) By sodface on 2022-03-16 19:29:20 edited from 2.2 in reply to 1.01 [link] [source]

I made a small update to display more human-readable file sizes, based on this stackexchange post.
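The size formatting (visible in the diffs later in this thread) is a log-base-1024 bucket lookup into the string "BKMGT": e picks the unit, and the size is divided down accordingly.

```shell
# Same approach as the script's awk section: e = floor(log1024(size)),
# then substr(u, e+1, 1) selects B, K, M, G or T.
printf '500\n2048\n1500000\n' | awk '
BEGIN { u = "BKMGT" }
{
    e = int(log($1) / log(1024))
    printf("%.1f%s\n", $1 / (1024 ^ e), substr(u, e + 1, 1))
}'
```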

Here's the updated script:

Edit// see first post for the most current version of the script that I'm using.

(3) By anonymous on 2022-03-15 22:58:19 in reply to 2.1 [link] [source]

It's your friendly OpenBSD guy, who's trying to run this script...

$ sh index
gstat: unknown option -- a
Try '/usr/local/bin/gstat --help' for more information.
Content-Type: text/html

<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8" />
<title>Index of </title></head>
<body><h1>Index of </h1><hr />
<pre><a href="/index">../</a>
</pre><hr /></body></html>

The only change I made was to give the full path to gstat: /usr/local/bin/gstat -c "%F|%n|%z|%s" * | sort | \

Where does it get the idea it's looking for -a?

(4.6) By sodface on 2022-03-16 19:29:52 edited from 4.5 in reply to 3 [link] [source]

Maybe something to do with the shell globbing?

Yes, I just tested it by creating a file named -aaa and got the same error:

server:/srv/smb-sod/photos$ stat -c "%F|%n|%z|%s" *
stat: unrecognized option: a

Edit// see first post for the most current version of the script that I'm using.

(5) By anonymous on 2022-03-17 13:54:59 in reply to 1.07 [link] [source]

I see there were a few changes to your latest script. The latest version seems to break when using gawk; replacing gawk with awk works.

--- index       Thu Mar 17 06:50:22 2022
+++ myindex     Thu Mar 17 06:50:02 2022
@@ -6,10 +6,10 @@
 rurl="${SCRIPT_DIRECTORY#${DOCUMENT_ROOT}}"
 durl="${rurl}/index"

-cd .${path}
+cd ".${path}"

 /usr/local/bin/gstat -c "%F|%n|%z|%s" [!-]* | sort | \
-gawk -F '|' -v rurl="${rurl}" -v durl="${durl}" \
+awk -F '|' -v rurl="${rurl}" -v durl="${durl}" \
           -v path="${path}" -v purl="${path%/*}" '
 BEGIN {
        u="BKMGT"
@@ -23,14 +23,13 @@

 $1=="directory" {
        printf("%s%-60s", "<a href=\""durl path"/"$2"\">", $2"/</a>")
-       printf("%s%21s\n", substr($3, 1, 16), "-"); next
+       printf("%.16s%20s\n", $3, "-"); next
 }

 $1=="regular file" && $2 !~ /^index/ {
-       printf("%s%-60s", "<a href=\""rurl path"/"$2"\">", $2"</a>")
-       printf(substr($3, 1, 16))
        e=int(log($4)/log(1024))
-       printf("%21.1f%s\n", $4/(1024^e), substr(u,e+1,1))
+       printf("%s%-60s", "<a href=\""rurl path"/"$2"\">", $2"</a>")
+       printf("%.16s%19.1f%s\n", $3, $4/(1024^e), substr(u,e+1,1))
 }

 END {

index is your previous version and myindex is your latest version; the latter one will work, since I'm not using gawk.

(6) By sodface on 2022-03-17 14:39:01 in reply to 5 [link] [source]

What's breaking with gawk? I've tested with gawk and busybox awk locally and both seem to work for me. On my website I'm using busybox awk and do not have gawk installed at all there.

My local gawk version is:

GNU Awk 5.1.1, API: 3.1

(7) By sean (naes_guy) on 2022-03-17 15:07:12 in reply to 6 [link] [source]

Ah, my mistake. I also needed to specify the full path to gawk. Sorry for the false alarm and thanks for the script.

I've created (another) account here on the forum. I hope not to lose this password!

@Stephan, do you think this script could get added to the contrib directory, or does it need to be carefully scrutinized?

(8) By Stephan Beal (stephan) on 2022-03-17 16:10:15 in reply to 7 [link] [source]

@Stephan, do you think this script could get added to the contrib directory, or does it need to be carefully scrutinized?

i don't see why not. That's on my todo list for this evening.

(9) By Stephan Beal (stephan) on 2022-03-17 16:55:26 in reply to 1.07 [link] [source]

Critique and comments welcome, but it does seem to be working fairly well, as seen in action at:

i'm preparing this script for inclusion in the contrib dir of the repo and have just one comment:

  • "It would be cool" if the "../" link were filtered out when at the top of the tree (since it doesn't do anything there).

FWIW, it's working for me as-is on a Mint Linux 20.3 system, and i like it better than my own solution primarily because it doesn't require adding a script to each browseable subdirectory.

(11) By sodface on 2022-03-17 21:48:57 in reply to 9 [link] [source]

Thanks for the review. More comments later, but what do you think of the update in post #1?

(10) By Stephan Beal (stephan) on 2022-03-17 17:07:59 in reply to 1.07 [link] [source]

Critique and comments welcome, but it does seem to be working fairly well, as seen in action at:

Here's a tiny fix to keep it from polluting stderr: the stat call should redirect stderr to /dev/null:

stat -c "%F|%n|%z|%s" [!-]* 2>/dev/null  ...
# --------------------------^^^^^^^^^^^ only that part is new

Without that, it generates (at least on my system) ugly output on each hit:

[stephan@nuc:~/fossil/althttpd]$ ./althttpd --root . -debug 1 --port 9090
stat: cannot stat '[!-]*': No such file or directory
stat: cannot stat '[!-]*': No such file or directory
stat: cannot stat '[!-]*': No such file or directory
stat: cannot stat '[!-]*': No such file or directory

(12) By sodface on 2022-03-17 21:49:15 in reply to 10 [link] [source]

Added to post #1, thanks!

(13.7) By sodface on 2022-03-18 01:36:48 edited from 13.6 in reply to 1.08 [link] [source]

Thanks to Stephan and Sean for reviewing and testing this script and for the suggested improvements.

I wanted to add a few notes probably more for my future self than anyone else.

My first go at this (which I didn't post), while functional, didn't use awk at all and just looped over the directory contents with a for loop. I used the find command with -mindepth and -maxdepth instead of the shell glob, and a cat heredoc to output the HTML. I have one directory on my website with 1000 or so files in it, and that page was quite slow to load, on average around 2500ms. I decided to rewrite the script from scratch to see if I could improve the performance.

I suspected that calling stat once per file inside the loop was probably not the best thing to do and also thought that awk's BEGIN, middle, END was a natural fit for the page structure so I moved to a find command with a -exec stat and piped the results to awk.

Whenever I work on a script (not often) and need to traverse directories, I seem to always be battling leading and trailing slashes, and using find here didn't help in that regard. I didn't want to do a lot of string chopping inside of awk, so the next change was to ditch find and move to a combination of changing directory into the requested path, using the shell glob star to expand the filenames, and calling stat once. This method produces just the base file or dir name, without any leading or trailing slashes to deal with. The output of find is also not sorted whereas the glob results are, which was nice, but I still ended up with a pipe to sort as a way to order the directories first before passing to awk.
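The before/after shapes described above can be sketched roughly like this (hypothetical fragments, not the original scripts; GNU stat is assumed):

```shell
#!/bin/sh
# Slow shape: one stat process per file, so a 1000-file directory
# costs on the order of 1000 fork/execs.
for f in *; do
    stat -c '%F|%n|%s' "$f"
done

# Fast shape: the glob expands every name into a single stat
# invocation, one process for the whole directory, and the glob
# results arrive already sorted by name.
stat -c '%F|%n|%s' * | sort
```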

The issue Sean found, where stat failed because it was treating a leading "-" in a file or dir name as an invalid command option, could also have been fixed with a double dash to signal the end of options, e.g. stat -c "%F|%n|%z|%s" -- *, but I think going with [!-]* is better because althttpd doesn't (by default) allow a leading dash in a file or dir name anyway, so there's no point in showing them in the index results if they can't be retrieved.
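Both fixes can be tried side by side (a sketch assuming GNU stat, with a throwaway file literally named "-aaa" for demonstration):

```shell
#!/bin/sh
d=$(mktemp -d); cd "$d"
touch -- -aaa normal.txt   # "--" lets touch create the dash-named file

# Fix 1: an end-of-options "--" so stat never parses "-aaa" as a
# flag; dash-named files are still listed.
stat -c '%n' -- *

# Fix 2: a glob that skips leading-dash names entirely; this is what
# the script uses, since althttpd won't serve such names anyway.
stat -c '%n' [!-]*
cd / && rm -r "$d"
```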

These changes resulted in about a 10x speed improvement, page load of the 1000 file directory went from ~2500ms to ~250ms.

I'm not that crazy about my variable names:

  • rurl = relative or root url
  • durl = directory url
  • purl = previous url
  • path is really the only self explanatory one.

I moved the u="BKMGT" assignment up into the BEGIN section because I thought it might be wasteful to assign the variable the same value over and over again for each record, which I assume is what was happening when I had it down in the regular file section. I did a local test with a directory of 10,000 files and it actually made no difference whatsoever. I left it in the BEGIN anyway even though it might be more readable to have it closer to where it's used in the script.

Rookie mistakes but it was fun to work on.

(14) By sodface on 2022-03-18 03:01:59 in reply to 1.08 [link] [source]

shellcheck results:

In index line 3:
path=$(printf "${PATH_INFO}" | tr -s "/")
              ^------------^ SC2059: Don't use variables in the printf format string. Use printf '..%s..' "$foo".


In index line 9:
cd ".${path}"
^-----------^ SC2164: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

Did you mean: 
cd ".${path}" || exit

I'm inclined to ignore both of those, as the printf works fine as-is and path should always be valid by the time we reach the cd command, but if it makes the script more correct, then maybe?

(15.1) By Stephan Beal (stephan) on 2022-03-18 09:26:11 edited from 15.0 in reply to 14 [link] [source]

Don't use variables in the printf format string.

The worst that can happen if you put variables in the format string is that a % character gets into the format string, leading to misformatted output. In C that can be disastrous, but in shell it's pretty harmless in terms of security and script integrity.

That said: it's not bad advice and it's certainly more robust.

Edit: but for your particular case, printf is not needed: echo "${PATH_INFO}" ... works just as well and is more efficient.
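The % behavior behind SC2059 is easy to demonstrate (a minimal sketch; the sample value is made up):

```shell
#!/bin/sh
# When the variable itself is the format string, a "%s" inside the
# data consumes a (missing) argument and the output is mangled.
PATH_INFO='/files/100%sale'
printf "${PATH_INFO}\n"        # %s expands to nothing: /files/100ale
printf '%s\n' "${PATH_INFO}"   # prints the value verbatim
echo "${PATH_INFO}"            # the simpler spelling for this case
```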

I'm inclined to ignore both of those as the printf works fine as-is and path should always be valid by the time we reach the cd command, but if it makes it more correct, then maybe?

Another option for the cd bit, though perhaps not appropriate for this script, is to run:

set -e

at the start of the script. That causes the script to behave as if || exit were added to every single command, exiting the script if any command fails.

(16) By Stephan Beal (stephan) on 2022-03-18 09:53:31 in reply to 14 [link] [source]

Did you mean: cd ".${path}" || exit

Your script checks for the given path using [ -d ... ] above that, so the || exit bit is superfluous in this case. If the path is invalid, it simply jumps back to the top of the browseable dirs. That seems to be the most appropriate thing to do, IMO.

(17) By sodface on 2022-03-18 15:52:45 in reply to 16 [link] [source]

Thanks for the feedback again Stephan, and the changes and additions to the contrib directory, much appreciated!

I'm going to edit post #1 and refer readers to the contrib directory for the most current script version.

(18) By sean (naes_guy) on 2022-03-18 17:01:08 in reply to 1.10 [link] [source]

Thanks for the script and @Stephan for working with sodface on it.

(19) By sodface on 2022-03-18 21:22:56 in reply to 1.10 [link] [source]

A minor change, same result but more succinct.

--- index.orig
+++ index
@@ -3,7 +3,7 @@
 path=$(printf "${PATH_INFO}" | tr -s "/")
 path="${path%/}"
 [ -d ".${path}" ] || path=""
-rurl="${SCRIPT_DIRECTORY#${DOCUMENT_ROOT}}"
+rurl="${SCRIPT_NAME%/*}"
 durl="${rurl}/index"
 
 cd ".${path}"

(20.1) By sodface on 2022-03-22 02:27:23 edited from 20.0 in reply to 1.11 [link] [source]

In the first post, I originally mentioned that I symlinked "index.html" to "index" so that the first page request would be handled by althttpd's normal default-file logic; I've edited that out. I hadn't noticed that "index" was recently added to the default list of filenames to look for, see this commit:

https://sqlite.org/althttpd/info/d59cf0a83eed61b0

So as long as you are running a version of althttpd since that change was made, and you keep the name "index", it should "just work".

(21) By sodface on 2022-03-26 16:31:52 in reply to 1.11 [link] [source]

Still tinkering around with this script. I thought it might be nice to add a "View" link for text files so that you can view them in the browser instead of downloading them. You can do this manually by adding view-source: in front of the download URL, but a link is more convenient.

Testing whether a file is binary or text (to decide whether to provide a View link or not) seems to be trickier than I thought it would be, especially if you limit yourself to the tools that busybox brings and don't install anything extra, like the file utility, which seems to be one of the better tools for determining binary vs text via its --mime option.

Testing the file while in the awk section seems more elegant anyway so I tried to come up with a method that worked both in busybox awk and gnu awk, and was reasonably reliable. There's probably plenty of gotchas where this will fall apart, but in my limited testing, it's working ok.

In the regular file section, awk reads in the file line by line and scans for non-printable characters, stopping at the first line that contains one, leaving b > 0. If b is 0 (meaning no non-printable characters were found) then a View link is provided, otherwise not. This does mean all files have to be opened and scanned, with text files always being scanned in their entirety (as no non-printable characters will be found). Non-printable characters seem to show up fairly quickly in binary files (within the first 100 lines, often in the first line). Performance impact I guess will vary depending on the number and size of files, though it's been negligible in my testing.

--- index.bak
+++ index
@@ -2,6 +2,13 @@
 
 path=$(printf "${PATH_INFO}" | tr -s "/")
 path="${path%/}"
+
+if [ -f ".${path}" ]
+then
+	printf "Content-Type: text/plain; charset=utf-8\n\n"
+	cat ".${path}"; exit
+fi
+
 [ -d ".${path}" ] || path=""
 rurl="${SCRIPT_NAME%/*}"
 durl="${rurl}/index"
@@ -29,7 +36,12 @@
 $1=="regular file" && $2 !~ /^index/ {
 	e=int(log($4)/log(1024))
 	printf("%s%-60s", "<a href=\""rurl path"/"$2"\">", $2"</a>")
-	printf("%.16s%19.1f%s\n", $3, $4/(1024^e), substr(u,e+1,1))
+	printf("%.16s%19.1f%s", $3, $4/(1024^e), substr(u,e+1,1))
+	while(( getline line < $2 ) > 0 ) {
+		b=match(line, /[^[:print:][:blank:]]/)
+		if(b>0) {break} else {continue}
+	}
+	printf("%5s%s\n", "", (b==0 ? "<a href=\""durl path"/"$2"\">View</a>" : ""))
 }
 
 END {

(22.1) By sodface on 2022-04-03 11:55:09 edited from 22.0 in reply to 21 [link] [source]

Last spam for today, here's another diff that includes the above "View" link for files and adds a "tgz" link for directories, which as implied, allows you to download a tarball of that dir (and subdirs).

There are probably improvements to be made here and I'm undecided if I'm even going to use these changes on my own website. I like the simplicity of the original script version and these additions make it a little more convoluted.

But, they do seem to be working ok...

--- index.bak
+++ index
@@ -2,6 +2,21 @@
 
 path=$(printf "${PATH_INFO}" | tr -s "/")
 path="${path%/}"
+
+if [ -f ".${path}" ]
+then
+	printf "Content-Type: text/plain; charset=utf-8\n\n"
+	cat ".${path}"; exit
+fi
+
+if [ -d ".${path%.tgz}" -a "${path##*.}" == "tgz" ]
+then
+	cd ".${path%/*}"
+	dir="${path##*/}"
+	printf "Content-Type: application/x-tar-gz\n\n"
+	tar czf - "${dir%.tgz}"; exit
+fi 
+
 [ -d ".${path}" ] || path=""
 rurl="${SCRIPT_NAME%/*}"
 durl="${rurl}/index"
@@ -23,13 +38,20 @@
 
 $1=="directory" {
 	printf("%s%-60s", "<a href=\""durl path"/"$2"\">", $2"/</a>")
-	printf("%.16s%20s\n", $3, "-"); next
+	printf("%.16s%20s", $3, "-")
+	printf("%6s%s\n", "", "<a href=\""durl path"/"$2".tgz\">tgz</a>")
+	next
 }
 
 $1=="regular file" && $2 !~ /^index/ {
 	e=int(log($4)/log(1024))
 	printf("%s%-60s", "<a href=\""rurl path"/"$2"\">", $2"</a>")
-	printf("%.16s%19.1f%s\n", $3, $4/(1024^e), substr(u,e+1,1))
+	printf("%.16s%19.1f%s", $3, $4/(1024^e), substr(u,e+1,1))
+	while(( getline line < $2 ) > 0 ) {
+		b=match(line, /[^[:print:][:blank:]]/)
+		if(b>0) {break} else {continue}
+	}
+	printf("%5s%s\n", "", (b==0 ? "<a href=\""durl path"/"$2"\">View</a>" : ""))
 }
 
 END {