301 Redirects For URLs not Ending in /

(1.1) By sodface on 2022-07-31 12:29:11 edited from 1.0

I think I noticed it before, but while testing the mime type for index yesterday, I saw again that a url which does not specify a resource or which does not end in a "/" gets a 301 redirect when althttpd appends and returns one of the search files, currently:

"/home", "/index", "/index.html", "/index.cgi"

This also seems somehow related to the discussion on PATH_INFO.

The althttpd.c source notes that:

If the requested URL does not end with "/" but we had to append "index.html", then a redirect is necessary. Otherwise none of the relative URLs in the delivered document will be correct.
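A small illustration of why that comment matters, sketched in Python with the standard urllib.parse.urljoin, which applies the usual relative-reference resolution rules. The name style.css is a made-up relative reference used only for illustration; the page URLs are the ones from this post.

```python
from urllib.parse import urljoin

# Hypothetical relative reference that a delivered page might contain.
REF = "style.css"

# Directory URL without the trailing slash: the last path segment is
# treated like a file name and dropped before the reference is appended.
no_slash = urljoin("http://www.sodface.com/misc/qots-crew-gen", REF)

# Same directory with the trailing slash: the reference stays inside it.
with_slash = urljoin("http://www.sodface.com/misc/qots-crew-gen/", REF)

# Top-level page, requested bare and after a redirect to /index.
home_bare = urljoin("http://www.sodface.com", REF)
home_index = urljoin("http://www.sodface.com/index", REF)

print(no_slash)    # http://www.sodface.com/misc/style.css
print(with_slash)  # http://www.sodface.com/misc/qots-crew-gen/style.css
print(home_bare)   # http://www.sodface.com/style.css
print(home_index)  # http://www.sodface.com/style.css
```

Under these resolution rules the missing slash changes the result for the deeper path, while for the top-level page the reference resolves to the same place either way.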

Screwing up relative URLs wouldn't be good, but it seems the result is that all non-specific page requests that don't end in a "/" get redirected, including the homepage, eg:

http://www.sodface.com

301 redirects to:

http://www.sodface.com/index

The homepage redirect actually behaves differently from pages with a deeper path in that for the deeper paths, as long as a trailing slash is included, it won't redirect, whereas the homepage seems to always redirect eg:

This will redirect:

http://www.sodface.com/misc/qots-crew-gen

And this won't:

http://www.sodface.com/misc/qots-crew-gen/

I'm not really sure what the standard behavior should be, but at least in the case of the top level homepage, always redirecting doesn't seem optimal.

(2.1) By sodface on 2022-08-05 01:46:14 edited from 2.0 in reply to 1.1

To follow up on this, I think the 301 redirect behavior of althttpd is correct, or at least consistent with Apache mod_dir documentation as far as paths beyond the top level domain page:

A "trailing slash" redirect is issued when the server receives a request for a URL http://servername/foo/dirname where dirname is a directory. Directories require a trailing slash, so mod_dir issues a redirect to http://servername/foo/dirname/.

I do not think that althttpd should always redirect requests for the domain itself as it does currently. There are many comments to be found that state a request for the domain url is treated differently from urls that contain additional path information with respect to redirection. I haven't been able to find a good reference that I think is completely clear on the distinction, though this one seems close:

6.2.3. Scheme-Based Normalization

The syntax and semantics of URIs vary from scheme to scheme, as described by the defining specification for each scheme. Implementations may use scheme-specific rules, at further processing cost, to reduce the probability of false negatives. For example, because the "http" scheme makes use of an authority component, has a default port of "80", and defines an empty path to be equivalent to "/", the following four URIs are equivalent:
 http://example.com
 http://example.com/
 http://example.com:/
 http://example.com:80/
In general, a URI that uses the generic syntax for authority with an empty path should be normalized to a path of "/". Likewise, an explicit ":port", for which the port is empty or the default for the scheme, is equivalent to one where the port and its ":" delimiter are elided and thus should be removed by scheme-based normalization. For example, the second URI above is the normal form for the "http" scheme.
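As a rough sketch of the normalization described in that excerpt, the Python below maps all four example URIs to the same string. The function name normalize and the DEFAULT_PORTS table are assumptions made for this illustration, and the sketch deliberately ignores userinfo and IPv6 literals.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports per scheme (illustrative table, not exhaustive).
DEFAULT_PORTS = {"http": "80", "https": "443"}

def normalize(uri):
    """Apply the two scheme-based rules quoted above to a simple URI.

    Sketch only: assumes a plain host[:port] authority with no userinfo
    and no IPv6 literal.
    """
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host, _, port = parts.netloc.partition(":")
    host = host.lower()
    # An empty port, or the default port for the scheme, is dropped
    # together with its ":" delimiter.
    if port == "" or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = host + ":" + port
    # An empty path is equivalent to "/".
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))

for uri in ("http://example.com", "http://example.com/",
            "http://example.com:/", "http://example.com:80/"):
    print(normalize(uri))  # each line prints http://example.com/
```

If a request with an empty path and a request for "/" are equivalent in this sense, it is hard to see what a redirect from one to the other would be correcting.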

(3.1) By sodface on 2022-08-05 00:35:38 edited from 3.0 in reply to 1.1

This patch seems to enable behavior which I think is more correct by the spec and equivalent to what other web servers do, that is, do not redirect a request for the homepage but do redirect a deeper path if a trailing slash had to be added:

--- althttpd.c.orig
+++ althttpd.c
@@ -3025,7 +3025,7 @@
         NotFound(400); /* LOG: URI is a directory w/o index.html */
       }
       zRealScript = StrDup(&zLine[j0]);
-      if( zScript[i]==0 ){
+      if( zScript[i]==0 && strcmp(zScript, "/")!=0 ){
         /* If the requested URL does not end with "/" but we had to
         ** append "index.html", then a redirect is necessary.  Otherwise
         ** none of the relative URLs in the delivered document will be
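
To make the intended effect of the one-line change easier to see, here is a toy Python model of the behavior reported in this thread. It is not a translation of althttpd.c: the function name redirects and the patched flag are made up for this sketch, and it assumes every path passed in names a directory for which an index file had to be appended.

```python
def redirects(path, patched=False):
    """Return True if a request for `path` would get a 301.

    Toy model of the behavior described in this thread. Assumes `path`
    names a directory, so the server had to append an index file.
    """
    if path == "/":
        # Unpatched: the homepage always redirects.
        # Patched: the added strcmp(zScript, "/")!=0 test exempts it.
        return not patched
    # Deeper paths redirect only when the trailing slash is missing.
    return not path.endswith("/")

for path in ("/", "/misc/qots-crew-gen", "/misc/qots-crew-gen/"):
    print(path, redirects(path), redirects(path, patched=True))
# /                     True  False
# /misc/qots-crew-gen   True  True
# /misc/qots-crew-gen/  False False
```

Only the first row changes with the patch; the trailing-slash redirect for deeper paths is left alone.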

(4) By Stephan Beal (stephan) on 2022-08-10 09:17:01 in reply to 3.1

This patch seems to enable behavior which I think is more correct by the spec and equivalent to what other web servers do

Follow-up: your post isn't being ignored, we (Richard and myself) are just consumed by other priorities for the time being. i will definitely play with your patch as priorities allow but don't currently have any estimate on when that will be - it may be 5 days or 30.

(5) By sodface on 2022-08-10 11:12:58 in reply to 4

No problem Stephan, thanks for the reply!

(6.1) By sodface on 2022-11-02 00:49:22 edited from 6.0 in reply to 4

Stephan, I'm sure you don't need any help but as a way of politely bumping this topic, below is a curl invocation to test the response code. I used firefox developer tools before and I'm sure there are plenty of alternatives, but the curl test is quick and easy.

First two are pre-patch, with and without trailing slash, both requests for the homepage redirect:

$ curl http://www.sodface.com -o /dev/null --silent --write-out "%{http_code}\n"
301
$ curl http://www.sodface.com/ -o /dev/null --silent --write-out "%{http_code}\n"
301

Same two tests with the patch applied:

$ curl http://www.sodface.com -o /dev/null --silent --write-out "%{http_code}\n"
200
$ curl http://www.sodface.com/ -o /dev/null --silent --write-out "%{http_code}\n"
200

I haven't noticed anything broken because of it.

One more test site:

$ curl https://wanderinghorse.net -o /dev/null --silent --write-out "%{http_code}\n"
301
$ curl https://wanderinghorse.net/ -o /dev/null --silent --write-out "%{http_code}\n"
301
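
For anyone who would rather script the check than use curl, something like the following Python sketch should be roughly equivalent. It uses the standard http.client module, which does not follow redirects on its own, so the first response is what gets reported. The function name status_code and the timeout default are choices made for this sketch.

```python
import http.client
from urllib.parse import urlsplit

def status_code(url, timeout=10.0):
    """Return (status, Location header) of the first response for `url`.

    Sketch only: sends a single GET and does not follow redirects.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port,
                                           timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port,
                                          timeout=timeout)
    try:
        # An empty path is sent as "/", as HTTP clients normally do.
        conn.request("GET", parts.path or "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

Calling status_code("http://www.sodface.com/") would then return the status and any Location header for the homepage, which is the same information the curl lines above print.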

(7) By Stephan Beal (stephan) on 2022-11-02 06:37:36 in reply to 6.1

... politely bumping this topic...

Has it already been that long?!? We've been preoccupied with sqlite and haven't been paying much attention to anything else :/.

... I haven't noticed anything broken because of it.

FWIW, i agree with your patch and test results but this is a change Richard will need to approve or disapprove.

(8) By drh on 2022-11-02 11:13:36 in reply to 7

I have read through this thread twice now. I don't understand what the problem is or why the change is requested. I do know that althttpd has been working great as currently coded for 20+ years and so I am reluctant to change it without a very good reason, because a break would disrupt a lot of stuff.

(9) By sodface on 2022-11-02 11:48:47 in reply to 8

I don't understand what the problem is or why the change is requested.

Requests for the base domain are redirected 100% of the time. I think this is incorrect behavior and results in an extra round trip for every request of the base domain.

working great as currently coded for 20+ years

Well, a redirect is mostly invisible to the user, so from that perspective it does indeed work great but I don't think that makes it technically correct.

(10) By drh on 2022-11-02 11:51:44 in reply to 9

If it is just a (minor) optimization, I'm not willing to take the risk at this time.

(11) By sodface on 2024-02-19 16:07:14 in reply to 1.1

At the risk of being an ass (I can't really judge anymore!), I was reminded of this redirect topic when a discussion in another thread about 80->443 redirects led me to the documentation, which bid me to try it myself:

Try it: visit http://sqlite.org/ and verify that you are redirected to https://sqlite.org/.

So I did and indeed you do ultimately end up with a secure connection although it's on the third try, not the second:

$ wget -S http://sqlite.org
Connecting to sqlite.org (45.33.6.223:80)
  HTTP/1.1 301 Permanent Redirect
  Connection: close
  Date: Mon, 19 Feb 2024 15:55:02 GMT
  Location: http://sqlite.org/index.html
Connecting to sqlite.org (45.33.6.223:80)
  HTTP/1.1 301 Permanent Redirect
  Connection: close
  Date: Mon, 19 Feb 2024 15:55:02 GMT
  Location: https://sqlite.org/index.html
Connecting to sqlite.org (45.33.6.223:443)
  HTTP/1.1 200 OK
  Connection: close
  Date: Mon, 19 Feb 2024 15:55:04 GMT
  Last-Modified: Thu, 01 Feb 2024 18:38:45 GMT
  Cache-Control: max-age=120
  ETag: "m65bbe535s2479"
  Content-type: text/html; charset=utf-8
  Content-length: 9337
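
To tally the hops in a transcript like the one above, a short Python sketch along these lines could be used. The function name summarize_wget is made up for this example, the parsing assumes the output layout shown above (other wget builds may format it differently), and the sample text is an abbreviated copy of that transcript.

```python
def summarize_wget(transcript):
    """Count connections and collect (status, Location) per response.

    Sketch only: assumes the "Connecting to", "HTTP/" and "Location:"
    line layout shown in the transcript above.
    """
    connections = 0
    hops = []
    status = None
    location = None
    for raw in transcript.splitlines():
        line = raw.strip()
        if line.startswith("Connecting to "):
            # A new connection starts; store the previous response first.
            if status is not None:
                hops.append((status, location))
            connections += 1
            status = None
            location = None
        elif line.startswith("HTTP/"):
            status = int(line.split()[1])
        elif line.lower().startswith("location:"):
            location = line.split(":", 1)[1].strip()
    if status is not None:
        hops.append((status, location))
    return connections, hops

# Abbreviated copy of the transcript above.
SAMPLE = """\
Connecting to sqlite.org (45.33.6.223:80)
  HTTP/1.1 301 Permanent Redirect
  Location: http://sqlite.org/index.html
Connecting to sqlite.org (45.33.6.223:80)
  HTTP/1.1 301 Permanent Redirect
  Location: https://sqlite.org/index.html
Connecting to sqlite.org (45.33.6.223:443)
  HTTP/1.1 200 OK
"""

connections, hops = summarize_wget(SAMPLE)
print(connections)  # 3
for status, location in hops:
    print(status, location)
```

For this sample it reports three connections: a 301 that appends index.html, a 301 that switches to https, and finally the 200.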

(12) By spindrift on 2024-02-20 07:58:54 in reply to 11

To be honest, that would seem to require two redirects (and hence three connections) with a simple redirection system.

One is the path redirect, and the other the protocol redirect.

I think you could argue about which order they should be in, as this order leaks some information about site structure, but nothing especially important.

Could both redirections be combined into one?

Maybe, but that would increase complexity without any apparent benefit at all. And these redirections should be cacheable for a standard web browser.

So I think this is even less of an issue than the trailing "/" redirect that this thread started with (which I also don't think is a problem worth solving!).

(13) By spindrift on 2024-02-20 08:06:40 in reply to 11

I'd also point out that in this particular situation, per your own timings, the impact of "the additional round trip" from the first redirect is inconsequential compared to the need to set up the HTTPS connection on the third connection.
So, irrespective of any theoretical objections, in this case at least there seems to be essentially nothing to gain by altering the (by all accounts entirely successful and reliable) current approach.

(14.1) By sodface on 2024-02-25 19:01:03 edited from 14.0 in reply to 13

Respectfully, I think your analysis and conclusions on this are wrong. This thread, admittedly, took a little while to get to the point, which is basically this question:

Is it correct behavior for a web server to return a 301 redirect for a request for the base domain, 100% of the time?

Where base domain is defined as a request for the domain name without specific path information, eg. https://sqlite.org

I think the answer to that question should be an obvious "no". Yes, there are cases where the answer is "yes", like a redirect from 80->443 as we've been discussing, but that's not the issue reported here. The issue is althttpd redirects all homepage requests, all the time.

I would argue that this is a bug. A low severity bug I agree, but a bug nonetheless. I'm surprised by the apparent resistance to characterize it as such. To describe it as "a minor optimization", "a problem not worth solving", or a "theoretical objection", is inexplicable to me.

would seem to require two redirects

Here's nginx doing the same thing with one redirect:

$ wget -S http://alpinelinux.org
Connecting to alpinelinux.org (213.219.36.190:80)
  HTTP/1.1 301 Moved Permanently
  Server: nginx
  Date: Wed, 21 Feb 2024 02:26:03 GMT
  Content-Type: text/html
  Content-Length: 162
  Connection: close
  Location: https://alpinelinux.org/
Connecting to alpinelinux.org (213.219.36.190:443)
  HTTP/1.1 200 OK
  Server: nginx
  Date: Wed, 21 Feb 2024 02:26:05 GMT
  Content-Type: text/html
  Content-Length: 8823
  Connection: close
  Last-Modified: Sun, 04 Feb 2024 17:48:57 GMT
  ETag: "65bfce09-2277"
  Accept-Ranges: bytes
  Strict-Transport-Security: max-age=31536000
  X-Frame-Options: DENY
  X-Content-Type-Options: nosniff

As another example with althttpd (I can't use sqlite.org because of the site structure) take a look at these two links from Stephan's site and explain to me why these should behave differently:

https://wanderinghorse.net/
https://wanderinghorse.net/computing/

They shouldn't, right? They are both requests with trailing slashes and neither specifies a file to retrieve, so althttpd has to look for and append a file (eg. index.html) for both. But in practice, the results are different. The first redirects, the second does not:

$ wget -S https://wanderinghorse.net/
Connecting to wanderinghorse.net (194.195.245.37:443)
  HTTP/1.1 301 Permanent Redirect
  Connection: close
  Date: Wed, 21 Feb 2024 01:18:34 GMT
  Location: https://wanderinghorse.net/index.html
Connecting to wanderinghorse.net (194.195.245.37:443)
  HTTP/1.1 200 OK
  Connection: close
  Date: Wed, 21 Feb 2024 01:18:36 GMT
  Last-Modified: Tue, 23 Jan 2024 02:22:10 GMT
  Cross-Origin-Opener-Policy: same-origin
  Cross-Origin-Embedder-Policy: require-corp
  Cache-Control: max-age=120
  ETag: "m65af22d2s12ce"
  Content-type: text/html; charset=utf-8
  Content-length: 4814

$ wget -S https://wanderinghorse.net/computing/
Connecting to wanderinghorse.net (194.195.245.37:443)
  HTTP/1.1 200 OK
  Connection: close
  Date: Wed, 21 Feb 2024 01:18:43 GMT
  Last-Modified: Wed, 16 Aug 2023 11:16:50 GMT
  Cross-Origin-Opener-Policy: same-origin
  Cross-Origin-Embedder-Policy: require-corp
  Cache-Control: max-age=120
  ETag: "m64dcb022s17da"
  Content-type: text/html; charset=utf-8
  Content-length: 6106

without any apparent benefit at all ... essentially nothing to gain

Let's say your website averages 100 requests per month for the "homepage" (the base domain with or without a trailing slash), that would be 200 lines in the log, 100 redirects and 100 OK's. That scales linearly. Is there any number where you would change your assessment of "nothing to gain" by fixing it? I'd be curious to know how many log entries per month there are for https://sqlite.org. Half of those are unnecessary 301 redirects that could be eliminated.

Back to using an example of Stephan's site, take a look at the total transfer time between these two urls:

https://wanderinghorse.net/computing
https://wanderinghorse.net/computing/

The first correctly redirects and the second, also correctly, does not. In the few tests I did, the redirect is the most expensive operation. Yes, we are talking about milliseconds, but given the option, which would you choose? As a reminder of the issue, both of the following urls for the homepage will always redirect, so currently the choice has been made for you:

https://wanderinghorse.net
https://wanderinghorse.net/

I haven't done the math, but setting up and tearing down a TCP connection isn't "free", both in terms of resources on the server and of bytes on the wire. Granted whatever it is, it's not much per connection, but if those unnecessary connections could be eliminated with a trivial code change, how many wasted connections would it take per month to make it worthwhile to fix? 100,000? 500,000? Any number?

And finally, regardless of whether there's any measurable benefit or not, there's the principle of the thing, which to me is reason enough to fix it (which I do with a downstream patch). Unpatched althttpd just flat out does the wrong thing for requests for the homepage. As in objectively the wrong thing, not just my opinion of what wrong is.

File it as a "won't fix" if you want but can't we at least agree that it's a bug? What am I missing?

(15) By drh on 2024-02-21 11:36:26 in reply to 14.0

The fact that you get two redirects, one for http:→https and then another to append "index.html" on the end is not a bug. Nor is it an optimization that I am willing to implement.

Those two redirects are happening in different parts of the code. If they are combined into one, then we lose separation of responsibility, making the code more complex and more difficult to audit for correctness and security. I am unwilling to accept the added complexity and risk for such a minor optimization.

Recall the published purpose of Althttpd: "a small, simple, stand-alone HTTP server" (Emphasis added). Adding minor optimizations like this moves the product away from "small" and "simple", thus defying the original purpose of Althttpd.

I am sorry that you find the two redirects to be inconvenient. If you like the way nginx does it better, then by all means use nginx. You will not hurt my feelings. Nginx makes no effort to be "small" and "simple". It is a complex beast, which is important in many contexts, and for many use cases, but not my context and my use case. I am not trying to compete with nginx.

(16) By sodface on 2024-02-21 14:28:14 in reply to 15

The fact that you get two redirects, one for http:→https and then another to append "index.html" on the end is not a bug. Nor is it an optimization that I am willing to implement.

That isn't the issue. It was just another example that illustrates the issue which is with the "homepage" only, not deeper urls which are handled correctly by althttpd, to include a redirect when appending a trailing slash.

Those two redirects are happening in different parts of the code. If they are combined into one, then we lose separation of responsibility, making the code more complex and more difficult to audit for correctness and security. I am unwilling to accept the added complexity and risk for such a minor optimization.

That wasn't the suggested fix. See patch upthread. I believe that one change would eliminate the first redirect seen in the http->https example without touching the other section of code. The http->https redirect is still required.

I am sorry that you find the two redirects to be inconvenient.

I'm not inconvenienced by this at all. I'm patching it.

If you like the way nginx does it better, then by all means use nginx. You will not hurt my feelings. Nginx makes no effort to be "small" and "simple". It is a complex beast, which is important in many contexts, and for many use cases, but not my context and my use case. I am not trying to compete with nginx.

I didn't say any of this.

Incidentally, when looking around sqlite.org to find a url for demonstration purposes (which I didn't), I came across this port 80 url that doesn't seem to redirect. Should it?

http://sqlite.org/c3ref/funclist.html

(17) By drh on 2024-02-21 14:54:12 in reply to 16

Very well. See check-in d5fe16ad7ef858d7.