Performance issue over time, high number of processes

(1.1) By sodface on 2023-06-01 23:03:38 edited from 1.0 [link] [source]

Over the last 6 months or so I noticed that my website would occasionally be very slow to load. On the order of 15-30 seconds. Each time this happens I ssh in and see ~80 althttpd processes running. To fix, I stop the service, kill all the processes and restart the service.

Things then are very snappy for some period of time. I don't go to my own site often so usually when this happens I check back over the next couple of days and see that performance is good and then I move on and forget about it until the next time I run into the issue. Like today.

Any tips on how to troubleshoot this next time?

(2) By drh on 2023-05-29 13:20:28 in reply to 1.0 [link] [source]

What I do:

Make sure logging is turned on.
Periodically download the log. Parse the log into an SQLite database and runs lots of analysis scripts to figure out which requests are taking a long time.
Put timeouts on slow scripts.
NULL-route IP addresses (and in some cases entire Class-C networks) that host abusive robots.

In the top-left corner of my desktop, I keep a shell open in which I have ssh-ed into the main SQLite server and run "top -d 10". That lets me keep an eye out for new robot attacks.

For references, the following are the sqlite3 CLI commandsI use to import the log:

CREATE TABLE log(
  date TEXT,                 /* Timestamp */
  ip TEXT,                   /* Source IP address */
  url TEXT,                  /* Request URI */
  ref TEXT,                  /* Referer */
  code INT,                  /* Result code.  ex: 200, 404 */
  nIn INT,                   /* Bytes in request */
  nOut INT,                  /* Bytes in reply */
  t1 INT, t2 INT,            /* Process time (user, system) milliseconds */
  t3 INT, t4 INT,            /* CGI script time (user, system) milliseconds */
  t5 INT,                    /* Wall-clock time, milliseconds */
  nreq INT,                  /* Sequence number of this request */
  agent TEXT,                /* User agent */
  user TEXT,                 /* Remote user */
  n INT,                     /* Bytes of url that are in SCRIPT_NAME */
  lineno INT                 /* Source code line that generated log entry */
);
.mode csv
.import http.log log

(3) By sodface on 2023-05-29 14:19:26 in reply to 2 [link] [source]

Thank you. I've got the log imported per your reply and I'm (slowly) working through different queries to see if anything jumps out.

(4) By sodface on 2023-05-29 16:30:44 in reply to 2 [link] [source]

t1-t5 are microseconds not milliseconds?

(5.2) By sodface on 2023-06-01 02:07:35 edited from 5.1 in reply to 1.0 [link] [source]

Right now the server is still responsive, but I was logged in for other reasons and checked out the running processes. The string of closely numbered pids with over a day elapsed time doesn't seem right? (I removed the command line args)

sod01:/srv/www/log$ ps -T -o pid,etime,args | grep althttpd
 6358  2d13   /usr/bin/althttpd
11326  1d07   /usr/bin/althttpd
11329  1d07   /usr/bin/althttpd
11330  1d07   /usr/bin/althttpd
11331  1d07   /usr/bin/althttpd
11340  1d07   /usr/bin/althttpd
11343  1d07   /usr/bin/althttpd
11345  1d07   /usr/bin/althttpd
11347  1d07   /usr/bin/althttpd
11352  1d07   /usr/bin/althttpd
11354  1d07   /usr/bin/althttpd
11357  1d07   /usr/bin/althttpd
11359  1d07   /usr/bin/althttpd
11361  1d07   /usr/bin/althttpd
11362  1d07   /usr/bin/althttpd
11363  1d07   /usr/bin/althttpd
11365  1d07   /usr/bin/althttpd
11366  1d07   /usr/bin/althttpd
11367  1d07   /usr/bin/althttpd
11371  1d07   /usr/bin/althttpd
14551  4:56   /usr/bin/althttpd
14580  4:56   /usr/bin/althttpd
14581  4:56   /usr/bin/althttpd
14582  4:56   /usr/bin/althttpd
14591  0:28   /usr/bin/althttpd

All the pids return the same strace output:

sod01:/srv/www/log$ sudo strace -p 11329
strace: Process 11329 attached
futex(0x7f5bdcd77900, FUTEX_WAIT_PRIVATE, 4294967295, NULL

sod01:/srv/www/log$ sudo strace -p 11359
strace: Process 11359 attached
futex(0x7f5bdcd77900, FUTEX_WAIT_PRIVATE, 4294967295, NULL

sod01:/srv/www/log$ sudo strace -p 11371
strace: Process 11371 attached
futex(0x7f5bdcd77900, FUTEX_WAIT_PRIVATE, 4294967295, NULL

And the same:

sod01:/srv/www/log$ sudo cat /proc/11329/stack 
[<0>] futex_wait_queue_me+0x98/0x100
[<0>] futex_wait+0xf7/0x250
[<0>] do_futex+0x1ba/0xb10
[<0>] __x64_sys_futex+0x85/0x1f0
[<0>] do_syscall_64+0x66/0xa0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb

I don't know what any of this means. I'm going to update the server since it's running Alpine 3.16.2 and 3.18 was recently released so maybe there's an update in there that might help.

(6) By Stephan Beal (stephan) on 2023-06-01 12:14:02 in reply to 5.2 [link] [source]

All the pids return the same strace output:

That trace is deep inside the OS-level code. It's apparently hung there, but not through any of althttpd's explicit doing. i unfortunately can't speculate what might be triggering that.

Today i learned:

cat /proc/11329/stack

That is pretty nifty.

(7) By sodface on 2023-06-01 23:01:50 in reply to 6 [link] [source]

Thanks Stephan. Only a single process running now, 22 hours after upgrade and reboot. Fingers crossed.

My logs are full of garbage. If it wasn't for bogus requests, I wouldn't get any traffic at all. I need to look into some better defense, enabling the ipshun feature or something...

(8.1) By sodface on 2023-06-09 11:55:15 edited from 8.0 in reply to 1.1 [link] [source]

Performance still ok but I've got a couple long running processes again:

sod01:/srv/www$ ps -T -o pid,etime,args | grep althttpd
 4864  7d11   /usr/bin/althttpd
13247  2d20   /usr/bin/althttpd
17793  3h45   /usr/bin/althttpd

The stack is slightly different but only because of the kernel version change from the upgrade I did:

sod01:/srv/www$ sudo strace -p 17793
strace: Process 17793 attached
futex(0x7fd143ea9900, FUTEX_WAIT_PRIVATE, 4294967295, NULL

sod01:/srv/www$ sudo cat /proc/17793/stack
[sudo] password for sodface: 
[<0>] futex_wait_queue+0x60/0x90
[<0>] futex_wait+0x18a/0x280
[<0>] do_futex+0x14d/0x270
[<0>] __x64_sys_futex+0x8e/0x1e0
[<0>] do_syscall_64+0x5b/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

See kernel commit

futex: Rename futex_wait_queue_me()

uname -a
Linux sod01 6.1.31-0-lts #1-Alpine SMP PREEMPT_DYNAMIC Tue, 30 May 2023 19:31:29 +0000 x86_64 Linux

I'll keep poking around and update here if I figure anything out.

(9) By sodface on 2023-06-23 21:28:42 in reply to 1.1 [link] [source]

Well crap, 53 processes right now, about half are 14 days old and half are 9 days old.

(10) By drh on 2023-06-23 21:36:30 in reply to 9 [link] [source]

I just pushed a change that blocks the AmazonBot. That reduced the load average on the SQLite.org server by 50%. Dunno if it will help you or not.

(11.1) By sodface on 2023-06-24 00:36:51 edited from 11.0 in reply to 10 [link] [source]

Thanks. I will upgrade but first, and based on really nothing but an uneducated guess, I'm going to try running without a log file for a bit. I feel like there might some kind of race condition going on with the log file based on how the "stuck" processes are all grouped by the same age and almost sequential pids.

Probably a dumb idea but we'll see what happens.

(12.1) By sodface on 2023-07-01 12:05:40 edited from 12.0 in reply to 11.1 [link] [source]

Hung processes again today so the log file didn't have anything to do with it. I upgraded to the latest checkin and also applied this patch:

--- ./althttpd.c
+++ ./althttpd.c
@@ -1154,6 +1154,7 @@
     "<body><h1>Document Not Found</h1>\n"
     "The document %s is not available on this server\n"
     "</body>\n", lineno, zScript);
+  closeConnection = 1;
   MakeLogEntry(0, lineno);
   althttpd_exit();
 }
@@ -2990,11 +2991,7 @@
   if( stat(zLine,&statbuf) || !S_ISDIR(statbuf.st_mode) ){
     sprintf(zLine, "%s/default.website", zRoot);
     if( stat(zLine,&statbuf) || !S_ISDIR(statbuf.st_mode) ){
-      if( standalone ){
-        sprintf(zLine, "%s", zRoot);
-      }else{
-        NotFound(350);  /* LOG: *.website permissions */
-      }
+      NotFound(350);  /* LOG: *.website permissions */
     }
   }
   zHome = StrDup(zLine);

(13) By Stephan Beal (stephan) on 2023-07-14 06:27:09 in reply to 12.1 [link] [source]

Hung processes again today

Are you still seeing this? It's unfortunately not happening on my systems so i can't actively poke at it.

PS: i apparently don't have an email address for you so will post it here: thank you very much for the thing you sent yesterday. i actually collect that type of thing for casual reading.

(14) By sodface on 2023-07-14 11:58:42 in reply to 13 [link] [source]

Are you still seeing this?

Since applying the patch discussed in this post the issue hasn't happened again, currently only the main process up for almost 12 days:

4380 11d19   /usr/bin/althttpd

The patch is different than the one mentioned further up this thread so posting here again just to reflect the current state:

--- ./althttpd.c
+++ ./althttpd.c
@@ -2990,11 +2990,10 @@
   if( stat(zLine,&statbuf) || !S_ISDIR(statbuf.st_mode) ){
     sprintf(zLine, "%s/default.website", zRoot);
     if( stat(zLine,&statbuf) || !S_ISDIR(statbuf.st_mode) ){
-      if( standalone ){
-        sprintf(zLine, "%s", zRoot);
-      }else{
-        NotFound(350);  /* LOG: *.website permissions */
-      }
+      UnlinkExpiredIPBlockers();
+      BlockIPAddress();
+      closeConnection = 1;
+      althttpd_exit();
     }
   }
   zHome = StrDup(zLine);

As noted by the anonymous poster in the other thread, there are probably other countermeasures I could take instead of patching but I wanted to run like this for a bit to see if it helped, which it has it seems.

thank you

(15) By spindrift on 2023-07-14 14:12:39 in reply to 14 [link] [source]

anonymous poster in the other thread

Ah yes, that was me. Apologies, wasn't logged in.

Glad it's working.

Did you send a custom error message in the end or just 404 them?

(16.2) By sodface on 2023-07-15 01:54:24 edited from 16.1 in reply to 15 [link] [source]

Did you send a custom error message in the end or just 404 them?

I'm running with the above patch which just shuns and drops the connection. Subsequent requests from the same IP get a 503 "blocked due to abuse" so even though its resolved my hanging process issue (so far) the 503's are still logged, so the noise level is still high.

If you are interested, you can see the result by going to:

http://www.pokelifehacks.com

which is the url that I don't own but resolves to my IP. You can see it's cloudflare doing the forwarding. First hit gets the unknown error page and then subsequent hits get the 503.

(17) By spindrift on 2023-07-15 11:18:11 in reply to 16.2 [link] [source]

Just seems to timeout from here, with cloudflare eventually reporting a 522 error.

Still, that seems fine too.

(18) By spindrift on 2023-07-16 10:38:53 in reply to 16.2 [link] [source]

I've also just noticed that another option now exists for shunning these unwanted referrals. By dropping a CGI script in the virtual host directory for the unwanted website and returning a 418 http status:

If a CGI reports status code 418 ("I'm a teapot", rfc2324) and if the --ipshun option is used, then the IP address is temporarily blocked.

Which should help you run trunk althttpd without needing to patch.

(19) By sodface on 2023-07-16 11:43:53 in reply to 18 [link] [source]

Ah! That does look like it would work well. I guess default.website would work (instead of a specific directory for pokelifehacks) to enforce that HTTP_HOST needs to be an exact match and to work around the stand-alone mode looking for content in --root ".".

I'm going try it. Thanks for coming up with this (though I'm annoyed I didn't see it myself!!).

(21) By sodface on 2023-07-20 01:18:13 in reply to 18 [link] [source]

Ok, I upgraded to the latest check-in and created a default.website folder with a CGI that returns 418. Tested and working. We'll see how it goes. I'm still running with another one line patch, see further down thread.

(23) By sodface on 2023-07-21 11:43:45 in reply to 18 [link] [source]

This works, but unless I'm missing something, it only works for a request for the domain with no path. Creating the virtual host directory (or default.website) and dropping a CGI script in the root will work to return 418 for a request like:

www.pokelifehacks.com

But requests with a path like:

www.pokelifehacks.com/foo/bar

Will just get a 404, and no ipshun, and I'm seeing a bunch of those in the logs since testing this. You'd have to rewrite or redirect all the requests to the same script and I don't think that's possible with an unpatched althttpd. As I said, I might be missing something obvious, which is frequently the case.

I've deleted the default.website directory and teapot script and running patched again, see down thread.

(26) By spindrift on 2023-07-21 13:45:56 in reply to 23 [link] [source]

You're right, I don't think there is a wildcard route option.

(20.1) By sodface on 2023-07-19 02:29:56 edited from 20.0 in reply to 1.1 [link] [source]

Hung processes are back again. 22 of them are 5 hours old. That corresponds to around 2023-07-18 20:35. Here is a log excerpt from that time. There are 56 log entries like this from two different source IPs. I don't really see an obvious pattern that would account for exactly 22 hung processes but it seems like 22 of these 56 just never exit.

2023-07-18 20:35:40,165.227.226.173,"http://","",400,1,180,286,0,0,0,34,1,"","",7,200                                                                                                          
2023-07-18 20:35:41,209.97.172.12,"http://7","",400,280,180,232,0,0,0,36,1,"","",8,200                                                                                                         
2023-07-18 20:35:41,209.97.172.12,"http://7","",913,0,0,160,0,0,0,783,1,"","",8,132                                                                                                            
2023-07-18 20:35:41,165.227.226.173,"http://","",400,4,180,0,231,0,0,31,1,"","",7,200                                                                                                          
2023-07-18 20:35:43,209.97.172.12,"http://","",400,1,180,191,0,0,0,41,1,"","",7,200                                                                                                            
2023-07-18 20:35:43,165.227.226.173,"http://","",400,1,180,241,0,0,0,55,1,"","",7,200                                                                                                          
2023-07-18 20:35:43,209.97.172.12,"http://","",400,6,180,237,0,0,0,25,1,"","",7,200                                                                                                            
2023-07-18 20:35:43,209.97.172.12,"http://","",400,36,180,196,0,0,0,22,1,"","",7,200                                                                                                           
2023-07-18 20:35:45,165.227.226.173,"http://","",400,4,180,261,0,0,0,36,1,"","",7,200                                                                                                          
2023-07-18 20:35:45,165.227.226.173,"http://^B^A^A`^G^B^A^B^D","",914,0,0,175,0,0,0,8490388,1,"","",16,802                                                                                     
2023-07-18 20:35:45,209.97.172.12,"http://","",400,6,180,261,0,0,0,45,1,"","",7,200                                                                                                            
2023-07-18 20:35:46,209.97.172.12,"http://","",400,3,180,269,0,0,0,30,1,"","",7,200                                                                                                            
2023-07-18 20:35:46,165.227.226.173,"http://","",400,1,180,225,0,0,0,38,1,"","",7,200                                                                                                          
2023-07-18 20:35:47,209.97.172.12,"http://XP|'|'|No|'|'|0.6.4|'|'|..|'|'||'|'|[endof]","",914,0,0,227,0,0,0,8499037,1,"","",50,802                                                             
2023-07-18 20:35:48,209.97.172.12,"http://","",400,2,180,291,0,0,0,29,1,"","",7,200                                                                                                            
2023-07-18 20:35:48,165.227.226.173,"http://","",400,0,180,234,0,0,0,85,1,"","",7,200                                                                                                          
2023-07-18 20:35:48,209.97.172.12,"http://.^A...","",400,100,180,213,0,0,0,21,1,"","",12,200                                                                                                   
2023-07-18 20:35:49,165.227.226.173,"http://","",400,2,180,126,0,0,0,41,1,"","",7,200                                                                                                          
2023-07-18 20:35:49,209.97.172.12,"http://7","",914,0,0,122,0,0,0,8499400,1,"","",8,802                                                                                                        
2023-07-18 20:35:50,209.97.172.12,"http://","",400,1,180,259,0,0,0,35,1,"","",7,200                                                                                                            
2023-07-18 20:35:51,165.227.226.173,"http://","",400,0,180,232,0,0,0,182,1,"","",7,200                                                                                                         
2023-07-18 20:35:52,209.97.172.12,"http://","",400,33,180,268,0,0,0,48,1,"","",7,200                                                                                                           
2023-07-18 20:35:53,165.227.226.173,"http://","",400,5,180,227,0,0,0,36,1,"","",7,200

I believe this is the relevant section of code handling these requests (the 400 responses). I think it's prior to ipshun logic being applied and it doesn't set closeConnection = 1 so maybe that's why some of these are hanging with open socket files?

Here's lsof of one of the pids:

sod01:/srv/www$ sudo lsof | grep 17360
17360	/usr/bin/althttpd	0	socket:[152490]
17360	/usr/bin/althttpd	1	socket:[152490]
17360	/usr/bin/althttpd	2	/dev/null
17360	/usr/bin/althttpd	3	socket:[105795]

(22) By sodface on 2023-07-20 01:23:29 in reply to 20.1 [link] [source]

Reverted the other patches discussed in this thread and now running the latest check-in with the below patch:

--- ./althttpd.c
+++ ./althttpd.c
@@ -2668,6 +2668,7 @@
    || i>9990
   ){
     zProtocol = 0;
+    closeConnection = 1;
     if( i<=9990 ){
       StartResponse("400 Bad Request");
       nOut += althttpd_printf(

(24) By sodface on 2023-07-21 11:57:50 in reply to 22 [link] [source]

Sorry to keep spamming this thread, but in addition to the above patch, I've also reapplied a variation of a previous patch. My goals are:

All requests must explicitly match a virtual site folder or the requester is shunned
Close connections for bogus requests (to help with my hanging process issue)

IMO this would be a nice option to have, eg. --strict-match or something. Maybe there are cases where non-matching requests are not malicious but I don't know what they are.

--- ./althttpd.c
+++ ./althttpd.c
@@ -3008,11 +3008,8 @@
   if( stat(zLine,&statbuf) || !S_ISDIR(statbuf.st_mode) ){
     sprintf(zLine, "%s/default.website", zRoot);
     if( stat(zLine,&statbuf) || !S_ISDIR(statbuf.st_mode) ){
-      if( standalone ){
-        sprintf(zLine, "%s", zRoot);
-      }else{
-        NotFound(350);  /* LOG: *.website permissions */
-      }
+      BlockIPAddress();
+      ServiceUnavailable(350); /* LOG: *.website permissions */
     }
   }
   zHome = StrDup(zLine);

(25) By Robert Hairgrove (bobhairgrove) on 2023-07-21 13:02:46 in reply to 24 [link] [source]

All requests must explicitly match a virtual site folder or the requester is shunned

What happens if someone makes a typo in the URL? Do you really want to shun them for all time?

(27) By sodface on 2023-07-21 15:06:56 in reply to 25 [link] [source]

If the virtual host directory matches but the path is typo'd they won't get shunned, they'll just get a 404 as usual. And the shuns that althttpd adds expire. If you manually touch a zero byte file in the ipshun dir, that's the only way to permanently ban an IP (as I understand it).

(28) By spindrift on 2023-07-21 15:14:35 in reply to 27 [link] [source]

You're correct about the permanent ban.

Similarly, you could always write a script that goes through your log and shuns IP addresses which have visited your "honeypot" site (I know it isn't one, but for want of a better term...)

(29) By Stephan Beal (stephan) on 2023-07-21 16:12:00 in reply to 27 [link] [source]

If you manually touch a zero byte file in the ipshun dir, that's the only way to permanently ban an IP (as I understand it).

That's correct. A temporary ban is 300s (5m) per byte of the file, with the timer comparing to the mtime of that file. Those 300s can be configured at compile-time with -DBANISH_TIME=numberOfSeconds. A zero-byte file is a permanent ban.

(30) By sodface on 2023-08-04 02:11:18 in reply to 20.1 [link] [source]

A couple hung processes showed up about a day after my last post but I didn't update this thread because I had no new info to add. Instead, I patched althttpd.c again to log the pid for each request as the last field in the line. I figured that would take the guess work out of determining which requests corresponded to which hanging process.

It took 6 days to get some more hung processes, 22 of them again, all about 4 hours old as I type this. Here are the log entries for those pids:

2023-08-03 21:52:51,45.79.134.132,"http://XP|'|'|No|'|'|0.6.4|'|'|..|'|'||'|'|[endof]","",400,117,180,325,0,0,0,39,1,"","",50,200,4928
2023-08-03 21:52:51,45.79.134.132,"http://XP|'|'|No|'|'|0.6.4|'|'|..|'|'||'|'|[endof]","",913,0,0,125,0,0,0,246,1,"","",50,132,4928
2023-08-03 21:52:59,45.79.134.132,"http://XP|'|'|No|'|'|0.6.4|'|'|..|'|'||'|'|[endof]","",914,0,0,137,0,0,0,8496407,1,"","",50,802,4928
2023-08-03 21:52:52,45.79.134.132,"http://7","",400,280,180,269,0,0,0,40,1,"","",8,200,4930
2023-08-03 21:52:52,45.79.134.132,"http://7","",913,0,0,114,0,0,0,279,1,"","",8,132,4930
2023-08-03 21:53:01,45.79.134.132,"http://7","",914,0,0,131,0,0,0,8499509,1,"","",8,802,4930
2023-08-03 21:52:54,45.79.134.132,"http://","",400,1,180,211,0,0,0,55,1,"","",7,200,4931
2023-08-03 21:52:55,45.79.134.132,"http://","",400,6,180,331,0,0,0,51,1,"","",7,200,4934
2023-08-03 21:52:57,45.79.134.132,"http://","",400,2,180,259,0,0,0,50,1,"","",7,200,4936
2023-08-03 21:52:58,45.79.134.132,"http://","",400,1,180,177,0,0,0,37,1,"","",7,200,4938
2023-08-03 21:53:00,45.79.134.132,"http://","",400,33,180,228,0,0,0,67,1,"","",7,200,4939
2023-08-03 21:53:28,157.245.45.169,"http://","",400,0,180,402,0,0,0,98,1,"","",7,200,4948
2023-08-03 21:53:41,157.245.45.169,"http://","",400,0,180,313,0,0,0,59,1,"","",7,200,4958
2023-08-03 21:53:46,157.245.45.169,"http://`","",400,11,180,222,0,0,0,44,1,"","",16,200,4961
2023-08-03 21:53:46,157.245.45.169,"http://`","",913,0,0,210,0,0,0,295,1,"","",16,132,4961
2023-08-03 21:53:54,157.245.45.169,"http://`","",914,0,0,301,0,0,0,8181914,1,"","",16,802,4961
2023-08-03 21:53:49,157.245.45.169,"http://","",400,5,180,277,0,0,0,59,1,"","",7,200,4963
2023-08-03 21:53:51,157.245.45.169,"http://","",400,1,180,301,0,0,0,61,1,"","",7,200,4964
2023-08-03 21:53:53,157.245.45.169,"http://","",400,4,180,264,0,0,0,63,1,"","",7,200,4965
2023-08-03 21:53:54,157.245.45.169,"http://","",400,1,180,134,0,0,0,91,1,"","",7,200,4966
2023-08-03 21:54:00,157.245.45.169,"http://","",400,1,180,254,0,0,0,52,1,"","",7,200,4969
2023-08-03 21:54:02,157.245.45.169,"http://","",400,0,180,241,0,0,0,48,1,"","",7,200,4970
2023-08-03 21:54:05,157.245.45.169,"http://","",400,0,180,294,0,0,0,61,1,"","",7,200,4972
2023-08-03 21:54:08,157.245.45.169,"http://","",400,5,180,145,0,0,0,56,1,"","",7,200,4973
2023-08-03 21:54:12,157.245.45.169,"http://","",400,0,180,595,0,0,0,62,1,"","",7,200,4975
2023-08-03 21:54:15,157.245.45.169,"http://","",400,2,180,299,0,0,0,62,1,"","",7,200,4976
2023-08-03 21:54:21,157.245.45.169,"http://","",400,1,180,135,0,0,0,31,1,"","",7,200,4977
2023-08-03 21:54:23,157.245.45.169,"http://","",400,0,180,0,373,0,0,52,1,"","",7,200,4979

The common thread between them is the code 400 Bad Request which I suspected in a previous post and added the patch to set closeConnection=1, which obviously didn't fix it. You can see all the patches I have applied here.

Still not sure how to fix it but at least now I'm reasonably sure what type of request is causing it. A week ago I tried to duplicate the issue by triggering a 400 response with nc from a script and couldn't cause the hang even with 1000 iterations.

(31) By sodface on 2023-08-04 21:42:54 in reply to 30 [link] [source]

I haven't killed the processes yet. If anybody can explain this to me I'd appreciate it because I don't understand it:

The main process, running for 7 days is pid 25949:

25949 7d09 /usr/bin/althttpd

The lowest numbered pid in the list of 22 hung processes is 4928 which has been running almost 24 hours:

4928 23h27 /usr/bin/althttpd

When I look at netstat, I'd expect to see pid 25949 listening on port 80, but I don't, the only althttpd pid I see is 4928:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN      0          43083      4928/althttpd

Notice the inode is 43083.

The output of lsof has all 22 hung processes and the main pid 25949 process listing socket:[43083] for fd3, to show just a few:

4928	/usr/bin/althttpd	3	socket:[43083]
4948	/usr/bin/althttpd	3	socket:[43083]
4972	/usr/bin/althttpd	3	socket:[43083]
25949	/usr/bin/althttpd	3	socket:[43083]

The strace on pid 25949 shows scrolling:

select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
wait4(0, NULL, WNOHANG, NULL)           = 0

When I strace the pid for my main process on my local home server I see scrolling:

select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
wait4(0, NULL, WNOHANG, NULL)           = -1 ECHILD (No child process)

(32) By Stephan Beal (stephan) on 2023-08-04 21:52:44 in reply to 31 [link] [source]

If anybody can explain this to me I'd appreciate it because I don't understand it:

My layman's suspicion (completely unconfirmed but backed by anecdotal evidence) is that the hung processes are a quirk of your particular OS (Alpine, IIRC?) or libc. Richard and myself run althttpd heavily and aren't seeing this. :/

The althttpd binary running on my public server is built statically on an Alpine system but run on an Ubuntu system, which leaves me at a loss to explain why my Alpine-built copy runs headache-free but yours (assuming i'm remembering correctly that you use Alpine) is choking.

(33) By sodface on 2023-08-04 22:58:20 in reply to 32 [link] [source]

Thanks for the reply Stephan. Yes, you remember correctly that I'm running on Alpine. I didn't realize you'd built a static version on Alpine, that was cool of you.

So I think two differences are you are running with inetd and https with built in tls, right?

I'm running stand-alone with tls support but without a cert (http on port 80).

I think my next test will be to rebuild without tls support. Just don't ask me to explain why, because I can't!

(34) By Stephan Beal (stephan) on 2023-08-04 23:03:37 in reply to 33 [link] [source]

So I think two differences are you are running with inetd and https with built in tls, right?

xinetd, but otherwise yes. You're running it standalone, i guess?

I think my next test will be to rebuild without tls support.

You can use stunnel for TLS support, but you probably know that already. That's documented in the home page.

(35) By sodface on 2023-08-04 23:11:42 in reply to 34 [link] [source]

xinetd

Oops, yes that's what I meant.

You're running it standalone, i guess?

Yes.

you probably know that already.

Yep.

I should probably get with the program and use https.

But anyway, I've rebuilt without tls support, killed all the processes and restarted althttpd (I didn't reboot the VM though).

For the record:

7427 6:08 /usr/bin/althttpd

(36) By sodface on 2023-08-09 01:36:40 in reply to 35 [link] [source]

Ok, last post in this thread unless I find a solution because this is just basically spam at this point. A bunch of day old processes which seem to be centered around code 400 again:

2023-08-07 22:23:23,206.81.2.108,"http://","",400,1,180,0,340,0,0,53,1,"","",7,200,13226
2023-08-07 22:23:26,206.81.2.108,"http://","",400,0,180,154,0,0,0,43,1,"","",7,200,13229
2023-08-07 22:23:28,206.81.2.108,"http://","",400,0,180,0,341,0,0,39,1,"","",7,200,13231
2023-08-07 22:23:29,159.89.15.37,"http://XP|'|'|No|'|'|0.6.4|'|'|..|'|'||'|'|[endof]","",400,117,180,0,261,0,0,39,1,"","",50,200,13233
2023-08-07 22:23:29,159.89.15.37,"http://XP|'|'|No|'|'|0.6.4|'|'|..|'|'||'|'|[endof]","",913,0,0,0,98,0,0,211,1,"","",50,132,13233
2023-08-07 22:23:37,159.89.15.37,"http://XP|'|'|No|'|'|0.6.4|'|'|..|'|'||'|'|[endof]","",914,0,0,0,130,0,0,8495719,1,"","",50,802,13233
2023-08-07 22:23:30,206.81.2.108,"http://","",400,1,180,168,0,0,0,35,1,"","",7,200,13235
2023-08-07 22:23:32,206.81.2.108,"http://","",400,0,180,0,260,0,0,35,1,"","",7,200,13238
2023-08-07 22:23:32,159.89.15.37,"http://","",400,1,180,236,0,0,0,36,1,"","",7,200,13239
2023-08-07 22:23:34,206.81.2.108,"http://","",400,0,180,262,0,0,0,32,1,"","",7,200,13240
2023-08-07 22:23:34,159.89.15.37,"http://","",400,6,180,207,0,0,0,36,1,"","",7,200,13243
2023-08-07 22:23:36,159.89.15.37,"http://","",400,2,180,408,0,0,0,81,1,"","",7,200,13247
2023-08-07 22:23:37,206.81.2.108,"http://","",400,0,180,227,0,0,0,35,1,"","",7,200,13251
2023-08-07 22:23:38,159.89.15.37,"http://","",400,1,180,302,0,0,0,37,1,"","",7,200,13253
2023-08-07 22:23:39,159.89.15.37,"http://","",400,33,180,310,0,0,0,32,1,"","",7,200,13256
2023-08-07 22:23:39,206.81.2.108,"http://`","",400,11,180,174,0,0,0,68,1,"","",16,200,13257
2023-08-07 22:23:39,206.81.2.108,"http://`","",913,0,0,122,0,0,0,246,1,"","",16,132,13257
2023-08-07 22:23:48,206.81.2.108,"http://`","",914,0,0,101,0,0,0,8444553,1,"","",16,802,13257
2023-08-07 22:23:41,206.81.2.108,"http://","",400,5,180,267,0,0,0,34,1,"","",7,200,13259
2023-08-07 22:23:43,206.81.2.108,"http://","",400,1,180,355,0,0,0,212,1,"","",7,200,13260
2023-08-07 22:23:49,206.81.2.108,"http://","",400,1,180,305,0,0,0,57,1,"","",7,200,13264
2023-08-07 22:23:51,206.81.2.108,"http://","",400,0,180,248,0,0,0,33,1,"","",7,200,13265
2023-08-07 22:23:53,206.81.2.108,"http://","",400,2,180,246,0,0,0,39,1,"","",7,200,13266
2023-08-07 22:23:54,206.81.2.108,"http://","",400,0,180,329,0,0,0,49,1,"","",7,200,13267
2023-08-07 22:23:56,206.81.2.108,"http://","",400,5,180,299,0,0,0,49,1,"","",7,200,13268
2023-08-07 22:23:57,206.81.2.108,"http://","",400,0,180,295,0,0,0,56,1,"","",7,200,13269
2023-08-07 22:23:59,206.81.2.108,"http://","",400,0,180,299,0,0,0,47,1,"","",7,200,13270
2023-08-07 22:24:00,206.81.2.108,"http://","",400,2,180,1094,0,0,0,146,1,"","",7,200,13271
2023-08-07 22:24:04,206.81.2.108,"http://","",400,0,180,311,0,0,0,49,1,"","",7,200,13274

(37) By jose i cabrera (jicman) on 2023-08-09 20:33:05 in reply to 36 [link] [source]

Hi.

I've seen all your posts, and I want to suggest something, but I don't know if you have tried this already or not. The change will require a rebuild of althttpd without any changes, just plain-ole vanilla althttpd. Give it a try for 4 hours or some time where you know that some processes should be hung, and if there are no hung precesses, then you can probably start to think that there may be something wrong with the changes/enhancements to the original program. If there are some hung state, I would say that it may be a hardware problem. Loooong time ago (1998), we had a few SunOS unix servers that printed to an specific print server, but one of those Unix servers kept leaving stale print jobs on the queue, even though those jobs finished printing. These servers had the same OS, same print driver version, everything was the same. We decided to change the network card on the server leaving those stale print jobs, and everything started to work ok. I am not saying that this is your problem, but, (WARNING) you may want to see if you can make some changes to the NIC settings and see if it may help. Just thinking out-loud. Ihth.

josé

(38) By sodface on 2023-08-12 01:23:11 in reply to 37 [link] [source]

josé

Thanks for the reply and suggestions. I understand what you are saying, reset back to factory and see what happens. I probably will do that at some point but I'm not quite ready yet. I'm still working on trying to generate a request that will trigger a hang so I can do some better debugging.

I've worked on my fair share of Sun servers also, mostly Sparc SunFire V240's, V440's etc. circa mid-2000's. If this is a hardware problem I'm not going to be able to do anything about since it's a VM on a shared host somewhere. I have been thinking about moving datacenters or maybe virtual hosting providers altogether. That would also change my IP and fix my issue with that one annoying domain.

I'll keep you posted.

(39) By sodface on 2023-09-10 22:32:58 in reply to 1.1 [link] [source]

Stephan mentioned that he built a statically linked version of althttpd on Alpine and hasn't experienced this issue, which inspired me to do the opposite. I built a statically linked non-tls althttpd in a glibc based Void linux chroot on Alpine.

I was going to wait a bit longer to post an update but so far it's been 14 days and 10 hours and all is well.

I've been through my log and I see the same pattern of successive 400 bad protocol requests that have triggered the hung processes previously but so far, as I said, all is well.

I'll give it another couple weeks but with the native dynamically linked builds, I probably would've had issues by now.

(40) By sodface on 2023-10-05 01:24:17 in reply to 1.1 [link] [source]

Ok, so the glibc static build ran fine for 37 days. No hung processes. I swapped back to the musl libc dynamically linked build and I've got one hung process already.

The original process has been running for 23h37 and a second process has been running for 14h44.

I checked the log for the pid and there's only one solitary code 400 entry for it:

2023-10-04 10:32:19,174.138.61.44,"http://","",400,3,180,258,0,0,0,47,1,"","",7,200,9780

I'm going to see if I can get an opinion from Rich Felker on this.

(41) By sodface on 2023-10-05 19:12:57 in reply to 1.1 [link] [source]

Richard or Stephan, would you be able to review this musl mailing list discussion:

Hung processes with althttpd web server

There's a lot there that's way over my head but the last statement was:

Solution is still the same as before: Either clean up the signal handler or don't even register it in the first place. The proper place to log unusual exit of a process is in the parent process. All of the intercepted signals have a default action of Term or Core, so they will cause the end of the process.

(42.1) By Stephan Beal (stephan) on 2023-10-05 21:28:34 edited from 42.0 in reply to 41 [link] [source]

Richard or Stephan, would you be able to review this musl mailing list discussion:

This part looks like a possible culprit:

https://www.openwall.com/lists/musl/2023/10/05/5

I often see this system call hung when signal handlers are doing signal-unsafe things. Looking at the source code, that is exactly what happens if the process catches a signal at the wrong time. Try removing all calls to signal(); that should do what the designers intended better (namely quit the process). If you want to log when a process dies of unnatural causes, that's something the parent process can do.

The signal handler will call MakeLogEntry(), and that will do signal-unsafe things such as call free(), localtime(), or fopen(). If the main process is currently using malloc() when that happens, you will get precisely this hang.

That it causes the process to hang in that particular OS and (apparently) not others argues against it being the genuine culprit, but the subtleties of Unix signal handling are beyond my ken. What Markus is claiming there certainly looks feasible to my layman's eyes, in particular given his apparently experience with precisely these symptoms.

(43) By Stephan Beal (stephan) on 2023-10-05 21:42:23 in reply to 42.1 [link] [source]

Further relevant comments from Rich F. in that thread:

OK, this is almost surely the source of the problem. It would still be interesting to know which lock is being hit here, since for the most part, locks are skipped in single-threaded processes. But even if the lock were skipped, the invalid calls to async-signal-unsafe functions from async-signal context would be corrupting the state those locks were meant to protect. That's probably what's happening on glibc (meaning this code only appears to work there, but it likely behaving dangerously).

(44) By sodface on 2023-10-05 23:39:57 in reply to 42.1 [link] [source]

A couple comments about that musl list discussion.

First, I didn't mention that the static glibc build I did was without TLS support. I didn't think it mattered since I wasn't using https in either case but Markus commented:

Apparently libssl has a destructor, and while it is running the signal hits.

So now I'm not sure if it's a relevant detail or not. Further up this thread though I noted that I rebuilt the musl build without TLS support and subsequently still had hangs.

(45) By Stephan Beal (stephan) on 2023-10-05 23:55:11 in reply to 44 [link] [source]

So now I'm not sure if it's a relevant detail or not.

Based on their statements about calling functions which are not strictly legal to call in signal handlers, my bet is that if you patch your copy to remove the innermost "if" from Timeout(), your hangs will go away. That block makes a call to MakeLogEntry(), which does all sorts of work:

Index: althttpd.c
==================================================================
--- althttpd.c
+++ althttpd.c
@@ -1232,5 +1232,5 @@
 static void Timeout(int iSig){
   if( !debugFlag ){
-    if( zScript && zScript[0] ){
+    if( 0 && zScript && zScript[0] ){
       char zBuf[10];
       zBuf[0] = '9';

How althttpd might be refactored to avoid that situation is currently unclear to me - signal handling and the interaction between parent/child processes is largely voodoo to me.

According to their comments, calling routines like malloc() is not safe from a signal handler, but there is no mention of such things in the Linux man pages for malloc(3) or signal(2), which leaves me wondering whether that's "insider info" or general knowledge which simply isn't mentioned in either of those two particular sources.

(46) By sodface on 2023-10-06 01:35:50 in reply to 45 [link] [source]

Patch applied and restarted. Thanks for reviewing Stephan.

(52.1) By sodface on 2023-10-12 14:55:56 edited from 52.0 in reply to 46 [link] [source]

Stephan, 22 hung processes again even with your patch applied, from the log (pid is last field):

2023-10-09 18:28:36,164.92.73.37,"http://","",400,1,180,236,0,0,0,61,1,"","",7,200,9616
2023-10-09 18:28:38,164.92.73.37,"http://","",400,0,180,334,0,0,0,96,1,"","",7,200,9617
2023-10-09 18:28:39,164.92.73.37,"http://","",400,0,180,0,530,0,0,46,1,"","",7,200,9619
2023-10-09 18:28:40,134.122.20.78,"http://XP|'|'|No|'|'|0.6.4|'|'|..|'|'||'|'|[endof]","",400,117,180,208,0,0,0,43,1,"","",50,200,9621
2023-10-09 18:28:42,134.122.20.78,"http://7","",400,280,180,139,0,0,0,69,1,"","",8,200,9627
2023-10-09 18:28:42,164.92.73.37,"http://","",400,0,180,244,0,0,0,40,1,"","",7,200,9629
2023-10-09 18:28:45,164.92.73.37,"http://`","",400,11,180,338,0,0,0,59,1,"","",16,200,9633
2023-10-09 18:28:45,134.122.20.78,"http://","",400,6,180,312,0,0,0,80,1,"","",7,200,9636
2023-10-09 18:28:46,134.122.20.78,"http://","",400,2,180,277,0,0,0,61,1,"","",7,200,9640
2023-10-09 18:28:48,134.122.20.78,"http://","",400,1,180,361,0,0,0,102,1,"","",7,200,9643
2023-10-09 18:28:50,164.92.73.37,"http://","",400,4,180,309,0,0,0,89,1,"","",7,200,9644
2023-10-09 18:28:50,134.122.20.78,"http://","",400,33,180,938,0,0,0,89,1,"","",7,200,9645
2023-10-09 18:28:51,164.92.73.37,"http://","",400,1,180,0,579,0,0,79,1,"","",7,200,9646
2023-10-09 18:28:54,164.92.73.37,"http://","",400,1,180,291,0,0,0,123,1,"","",7,200,9648
2023-10-09 18:28:58,164.92.73.37,"http://","",400,2,180,622,0,0,0,203,1,"","",7,200,9650
2023-10-09 18:28:59,164.92.73.37,"http://","",400,0,180,1306,0,0,0,58,1,"","",7,200,9651
2023-10-09 18:29:01,164.92.73.37,"http://","",400,5,180,1192,0,0,0,73,1,"","",7,200,9652
2023-10-09 18:29:02,164.92.73.37,"http://","",400,0,180,419,0,0,0,48,1,"","",7,200,9653
2023-10-09 18:29:04,164.92.73.37,"http://","",400,0,180,0,507,0,0,79,1,"","",7,200,9654
2023-10-09 18:29:06,164.92.73.37,"http://","",400,2,180,399,0,0,0,52,1,"","",7,200,9655
2023-10-09 18:29:07,164.92.73.37,"http://","",400,1,180,0,293,0,0,158,1,"","",7,200,9656
2023-10-09 18:29:09,164.92.73.37,"http://","",400,0,180,294,0,0,0,68,1,"","",7,200,9658

A backtrace of one of the processes. I checked several of them and they all seem basically the same:

(gdb) bt
#0  __syscall_cp_c (nr=202, u=139695721777408, v=128, w=-1, x=0, y=0, z=0) at ./arch/x86_64/syscall_arch.h:61
#1  0x00007f0d71db36c9 in __futex4_cp (to=0x0, val=-1, op=128, addr=0x7f0d71df8900 <lock>) at src/thread/__timedwait.c:24
#2  __timedwait_cp (addr=addr@entry=0x7f0d71df8900 <lock>, val=val@entry=-1, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128) at src/thread/__timedwait.c:52
#3  0x00007f0d71db376e in __timedwait (addr=addr@entry=0x7f0d71df8900 <lock>, val=-1, clk=clk@entry=0, at=at@entry=0x0, priv=128) at src/thread/__timedwait.c:68
#4  0x00007f0d71db6020 in __pthread_rwlock_timedwrlock (at=<optimized out>, rw=<optimized out>) at src/thread/pthread_rwlock_timedwrlock.c:18
#5  __pthread_rwlock_timedwrlock (rw=rw@entry=0x7f0d71df8900 <lock>, at=at@entry=0x0) at src/thread/pthread_rwlock_timedwrlock.c:3
#6  0x00007f0d71db60ef in __pthread_rwlock_wrlock (rw=rw@entry=0x7f0d71df8900 <lock>) at src/thread/pthread_rwlock_wrlock.c:5
#7  0x00007f0d71dbc9f3 in __libc_exit_fini () at ldso/dynlink.c:1441
#8  0x00007f0d71d73082 in exit (code=0) at src/exit/exit.c:30
#9  0x00005654289d4e89 in ?? ()
#10 <signal handler called>
#11 0x00007f0d71db40db in __syscall6 (a6=0, a5=0, a4=0, a3=139695721496797, a2=128, a1=139695721777284, n=202) at ./arch/x86_64/syscall_arch.h:59
#12 __syscall_cp_c (nr=202, u=139695721777284, v=128, w=139695721496797, x=0, y=0, z=0) at src/thread/pthread_cancel.c:31
#13 0x00007f0d71db36c9 in __futex4_cp (to=0x0, val=-2147483632, op=128, addr=0x7f0d71df8884 <init_fini_lock+4>) at src/thread/__timedwait.c:24
#14 __timedwait_cp (addr=addr@entry=0x7f0d71df8884 <init_fini_lock+4>, val=val@entry=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128) at src/thread/__timedwait.c:52
#15 0x00007f0d71db376e in __timedwait (addr=addr@entry=0x7f0d71df8884 <init_fini_lock+4>, val=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128)
    at src/thread/__timedwait.c:68
#16 0x00007f0d71db59b1 in __pthread_mutex_timedlock (m=0x7f0d71df8880 <init_fini_lock>, at=at@entry=0x0) at src/thread/pthread_mutex_timedlock.c:85
#17 0x00007f0d71db57c0 in __pthread_mutex_lock (m=m@entry=0x7f0d71df8880 <init_fini_lock>) at src/thread/pthread_mutex_lock.c:9
#18 0x00007f0d71dbc9ff in __libc_exit_fini () at ldso/dynlink.c:1442
#19 0x00007f0d71d73082 in exit (code=0) at src/exit/exit.c:30
#20 0x00005654289d4e89 in ?? ()
#21 <signal handler called>
#22 0x00007f0d71da92cd in __syscall3 (a3=2, a2=140734778735424, a1=1, n=20) at ./arch/x86_64/syscall_arch.h:29
#23 __stdio_write (f=0x7f0d71df62c0 <__stdout_FILE>, buf=<optimized out>, len=0) at src/stdio/__stdio_write.c:15
#24 0x00007f0d71da9153 in close_file (f=0x7f0d71df62c0 <__stdout_FILE>) at src/stdio/__stdio_exit.c:12
#25 0x00007f0d71da91a9 in __stdio_exit () at src/stdio/__stdio_exit.c:21
#26 0x00007f0d71d73089 in exit (code=0) at src/exit/exit.c:31
#27 0x00005654289d4be1 in ?? ()
#28 0x00005654289d6878 in ?? ()
#29 0x00005654289d42d2 in ?? ()
#30 0x00007f0d71d7aaad in libc_start_main_stage2 (main=0x5654289d3780, argc=13, argv=0x7fff5e7ec4c8) at src/env/__libc_start_main.c:95
#31 0x00005654289d431a in ?? ()
#32 0x000000000000000d in ?? ()
#33 0x00007fff5e7ece6a in ?? ()
#34 0x00007fff5e7ece7c in ?? ()
#35 0x00007fff5e7ece83 in ?? ()
#36 0x00007fff5e7ece85 in ?? ()
#37 0x00007fff5e7ece8c in ?? ()
#38 0x00007fff5e7ece8f in ?? ()
#39 0x00007fff5e7ece96 in ?? ()
#40 0x00007fff5e7ece9f in ?? ()
#41 0x00007fff5e7ecea8 in ?? ()
#42 0x00007fff5e7eceb8 in ?? ()
#43 0x00007fff5e7ecec2 in ?? ()
#44 0x00007fff5e7ecedc in ?? ()
#45 0x00007fff5e7ecee3 in ?? ()
--Type <RET> for more, q to quit, c to continue without paging--
#46 0x0000000000000000 in ?? ()

Do you think this exit() call could also be an issue?

(53) By Stephan Beal (stephan) on 2023-10-12 12:25:22 in reply to 52.0 [link] [source]

Do you think this exit() call could also be an issue?

i'm not experienced enough with those nuances to even guess :/. There is an alternative which skips some of the exit-time cleanup, _exit(0), but althttpd doesn't use the parts which are skipped except for TLS cleanup and i have no idea whether skipping that cleanup may cause problems. Some form of exit is necessary at that spot, in any case, or all of the code paths leading to that would need to be adapted to handle the change in behavior (that function returning instead of exiting).

My RSI has been acting up this week, delaying application of Richard's suggestions, but i'll post here once that's done (with no promises on when that might be - might be 2 days, might be 4 weeks).

(54) By sodface on 2023-10-12 14:47:03 in reply to 53 [link] [source]

Completely understand Stephan and I'm certainly not demanding or even expecting any fixes on any kind of timeline. I just thought I would provide this additional info in case it has any bearing on the remediation plan (whenever it might be implemented).

I'm just sort of happy to generally know what's happening. I can go back to running the static glibc build of althttpd which will allow me to stay on Alpine since I'm more comfortable there.

Thanks for all you do and I hope your RSI improves (and I don't mean that so you can work on this issue!).

Cheers, Carl.

(55) By Stephan Beal (stephan) on 2023-10-12 14:59:06 in reply to 54 [link] [source]

I'm certainly not demanding or even expecting any fixes on any kind of timeline.

No worries - i understand that you're just chasing down, and providing more details for, this problem.

(and I don't mean that so you can work on this issue!)

Sure you do ;).

Once my RSI lets up and my backlog is caught up i'll be patching this per Richard's recipe. Hopefully it will resolve the issue.

(47.1) By psykose on 2023-10-06 02:24:32 edited from 47.0 in reply to 45 [link] [source]

> but there is no mention of such things in the Linux man pages for malloc(3) or signal(2), which leaves me wondering whether that's "insider info" or general knowledge which simply isn't mentioned in either of those two particular sources.

i'm not sure which manpages you have on hand (version, etc, since there might be some that actually don't mention it), but linux 'man-pages' has for as long as i can remember a NOTES section under signal(2), and it contains:

See signal-safety(7) for a list of the async-signal-safe functions that
can be safely called from inside a signal handler.

signal-safety(7) is the reference for this case, and what you are looking for. it contains:

The set of functions required to be async-signal-safe by POSIX.1 is
shown in the following table.

followed by a list of async-signal-safe functions (the only valid ones to call in a signal handler), plus some notes/glibc trivia.

notably, said list does not contain malloc(3) or the printf(3) (!) family, amongst others. this is a very hard thing to get right- i've seen heaps of programs over the years that just have a very subtle misuse of something like this that has a "very low" probability of actually happening- and then happens once in a blue moon, or with higher likelihood under certain conditions, ..., and then it's a huge pain to debug/figure out. even trying to "brute force" a hang in something like this took 30+ seconds of pinning a cpu core sending signals at something with unsafe handling in a single-threaded process, and piling on unsafe interface use inside the handler (and outside), in a test program- the stars really have to align.

the reason glibc doesn't have an issue here is probably because one of the diagnosed hangs is in exit() in an exit handler. it probably just gets lucky via doing something very differently, or doesn't have a lock on the same destructor-handling path, and any corruption in this stage probably doesn't matter either way and just exits anyway.

for exit() specifically, the general way you would do that is write to a value in the signal handler (a global `volatile sig_atomic_t`? i wonder if there's a better way; this should be ok for a single-threaded process, and the read/write to that type is guaranteed atomic), and exit from the 'main loop' after return instead. that's one of the things a `while(!wantsToExit) { /* program */ }` construct is useful for here. but i haven't viewed the code as a whole, so there is probably more refactoring to do.

(48) By Stephan Beal (stephan) on 2023-10-06 02:17:06 in reply to 47.0 [link] [source]

See signal-safety(7) for a list of the async-signal-safe functions that can be safely called from inside a signal handler.

Indeed, i was grepping for "signal" in the malloc page and vice versa, so didn't find the connection. Thank you for the details.

(49) By drh on 2023-10-06 11:04:37 in reply to 48 [link] [source]

Remediation Plan

This is a sketch of what I think needs to happen in order to make the Timeout() signal handler signal-safe. Note that Timeout() is the only signal handler function in the application, so Timeout() is the only thing that needs to be made signal-safe. Within Timeout(), the offending routine that makes it unsafe is MakeLogEntry(). So the problem reduces to making MakeLogEntry() signal-safe. These are my proposed steps to achieve that:

Add a new global variable inSignalHandler which defaults to zero. Timeout() sets inSignalHandler to one prior to invoking MakeLogEntry().
Within MakeLogEntry(), put the call to free() inside of a "if( !inSignalHandler )". Do the same with the calls to getrusage() with the addition that the "else" clause should zero the "self" and "children" structures. That means we won't get timing information for logfile entries associated with signals.
Make zExpLogFile a global variable. The name means "Expanded Logfile Name". This idea is that the --logfile command line entry can have date values that get expanded by strftime(). Make a new subroutine to do that expansion, and call it when when starting processing on each new HTTP request - in other words call it outside of any signal handler. Thus the zExpLogFile value is already computed when the signal handler is invoked.
Use time() instead of gettimeofday() inside of MakeLogEntry().
Omit the use of localtime() inside of MakeLogEntry(). After this change, all log file entry timestamps should be in zulu-time, as they should be regardless. Call this a bug fix.
Compute the date string yyyy-mm-dd HH:MM:SS manually. We cannot use strftime() for that any more. Copy code to convert seconds since 1970 into yyyy-mm-dd HH:MM:SS from the date.c source file of SQLite. In particular, see the computeYMD() routine.
Use open() with O_APPEND instead of fopen().
Manually construct the log file message into a large (5000-byte?) stack buffer. Probably there will need to be new subroutines to convert integer values into strings and append the result to that buffer. Integer-to-text conversion is easy and only takes a few lines of code.
Write out the log message using write(). Use close() to close the file descriptor.
Always invoke exit() at the end of a MakeLogFile() call that is run when inSignalHandler is set.

I think with those changes, Timeout() will become signal safe. What we lose with these changes:

Log message are zulu-time instead of local time. This seems more like a bug-fix than a problem, though.
We do not get process run-time information in log file entries that come from the signal handler because getrusage() is unavailable.

Those seem like minor problems in order to make althttpd run better on systems that are picky about signal-safety.

(50) By sodface on 2023-10-07 01:41:03 in reply to 49 [link] [source]

Thanks a lot for taking the time to address this. I can't comment on the technical details but I did want to point out, in case you missed it in the noise, that I ran without a log file for a while and still got hangs.

It seems like doing that removes most of what MakeLogEntry() does except for:

if( zPostData ){
    free(zPostData);
    zPostData = 0;
  }

and

if( closeConnection ){
    exit(exitCode);
  }

If I've got that right and since every hang seems to be coming from the response code 400 requests maybe there's a less invasive fix that could just target that situation? But I guess that wouldn't address all the signal-safety items in the above punch list.

(51) By Stephan Beal (stephan) on 2023-10-08 14:00:43 in reply to 50 [link] [source]

Thanks a lot for taking the time to address this.

If Richard does not get around to it before then, i will use his description as a blueprint for patching it as soon as coding priorities allow (probably late this coming week).

(56) By Stephan Beal (stephan) on 2023-11-22 02:45:28 in reply to 50 [link] [source]

But I guess that wouldn't address all the signal-safety items in the above punch list.

Follow-up: my apologies for not having gotten to this yet. It's flagged in my inbox so it won't get overlooked, but project- and familial responsibilities have kept me devoid of extra energy for random hacking lately.

Historically i tend to be hyperactive, code-wise, around Christmas. Perhaps that trend will continue this year?

(57) By sodface on 2023-11-22 03:01:26 in reply to 56 [link] [source]

Thanks for the update Stephan and no worries! I've been running my static glibc build which seems fine. That said, I love Christmas presents!

(58) By Stephan Beal (stephan) on 2023-12-15 12:19:17 in reply to 49 [link] [source]

Remediation Plan

i've started work on this and have a question about the log handling:

Use open() with O_APPEND instead of fopen().

If we do that we lose the ability to fprintf() unless we use fdopen() on the file handle returned by open(), which would give us a FILE pointer for that handle. Presumably, however, if we should avoid fopen() then we should also avoid fdopen()?

Losing fprintf() would be unfortunate.

(59) By Stephan Beal (stephan) on 2023-12-15 13:08:30 in reply to 49 [link] [source]

Remediation Plan ...

What we lose with these changes:

Sub-second precision in processing time in the logs, since we'll be using time().

(60) By Stephan Beal (stephan) on 2023-12-15 16:05:32 in reply to 49 [link] [source]

Remediation Plan

Current status of the signal-safety branch:

✅Add a new global variable inSignalHandler which defaults to zero. ...
✅Within MakeLogEntry(), put the call to free() inside of a "if( !inSignalHandler )". ...
✅Make zExpLogFile a global variable. ...
✅Use time() instead of gettimeofday() inside of MakeLogEntry().
✅Omit the use of localtime() inside of MakeLogEntry(). ...
✅Compute the date string yyyy-mm-dd HH:MM:SS manually. ...
❌Use open() with O_APPEND instead of fopen().
❌Manually construct the log file message into a large (5000-byte?) stack buffer. ...
❌Write out the log message using write(). Use close() to close the file descriptor.
✅Always invoke exit() at the end of a MakeLogFile() call that is run when inSignalHandler is set.

The loss of sub-second precision, as a side effect of #4, is unfortunate but i'm at a loss as to how to regain that. For the time being, those log entries will be in seconds rather than the documented ms (alternately, we can multiply them by 1000 so that they're >0 for any requests which last longer than 1 second).

@sodface if you're willing to run that in the configuration which was previously giving you hung processes, it would be really interesting to know whether it still hangs intermittently for you. It's currently running on my server but that one uses a static build compiled on Alpine (and run on Ubuntu) and i never personally witnessed the hanging processes.

(61) By sodface on 2023-12-15 17:28:30 in reply to 60 [link] [source]

Thanks for working on this Stephan!! I'll get this built and running ASAP.

One item of possible interest: I've been running the glibc static build as a workaround as noted in previous comments. I do have one hung process even with that configuration:

sod01:/srv/www/log$ busybox ps -eo pid,etime,args | grep [a]lthttpd
 6197 23d20   /usr/bin/althttpd
18230 33d15   /usr/bin/althttpd

6197 being the hung process, with the pid numbers having wrapped around I guess. A bit of a departure from what I was seeing with most of the hangs being on code 400 responses, this one is a 404:

2023-11-06 18:01:18,172.104.4.17,"http:///","",404,18,262,0,2037,0,0,1778,1,"","",8,330,6193                                                                                                    
2023-11-06 18:01:18,172.104.4.17,"http:///","",501,20,183,187,0,0,0,25,1,"","",8,220,6194                                                                                                       
2023-11-06 18:01:18,172.104.4.17,"http:///","",400,20,180,312,0,0,0,41,1,"","",8,200,6195                                                                                                       
2023-11-06 18:01:23,172.104.4.17,"http://","",400,1,180,142,0,0,0,48,1,"","",7,200,6196                                                                                                         
2023-11-06 18:01:23,172.104.4.17,"http:///nice_ports_/Tri_ity.txt.bak","",404,53,289,329,0,0,0,145,1,"","",35,330,6197                                                                          
2023-11-06 18:01:25,172.104.4.17,"http://","",400,2,180,0,765,0,0,27,1,"","",7,200,6198                                                                                                         
2023-11-06 18:01:31,172.104.4.17,"http://","",400,1,180,244,0,0,0,39,1,"","",7,200,6199                                                                                                         
2023-11-06 18:01:37,172.104.4.17,"http://","",400,0,180,288,0,0,0,51,1,"","",7,200,6200                                                                                                         
2023-11-06 18:01:37,172.104.4.17,"http://","",400,6,180,0,263,0,0,23,1,"","",7,200,6201

There's a bunch more around the same time from that same source ip but I only included a few for context before and after.

Do you see this kind of junk in your logs?

(62) By Stephan Beal (stephan) on 2023-12-15 18:11:41 in reply to 61 [link] [source]

Do you see this kind of junk in your logs?

Indeed... there are lots of those in my logs, and they're all responses with code 400. i've got neither an explanation nor a hypothesis for those but will see if i can track that down while working on these changes.

(63) By sodface on 2023-12-15 18:59:13 in reply to 62 [link] [source]

Thanks. Currently running on the latest check-in, I'll keep you posted.

(64.1) By sodface on 2023-12-15 19:18:30 edited from 64.0 in reply to 61 [link] [source]

Actually, the elapsed process time of PID 6197 should be more like 39 days based on the log timestamp, not 23 days. I'm a little confused by that. My server clock is correct.

I've already restarted althttpd (and the VM) on the new code so a little late now to try to figure that out... but seems odd. I have a blinding headache today though so maybe I'm just not thinking straight.

(65) By Stephan Beal (stephan) on 2023-12-15 20:04:29 in reply to 61 [link] [source]

I'll get this built and running ASAP.

In case you didn't notice, there have been a few checkins since then and one of them replaced fopen/fclose/fprintf with open/close/write, which may well be relevant for your triggering case.

With that, i believe the new implementation matches Richard's checklist and it "works for me," but it could certainly use some wider use and review before merging.

(66.1) By sodface on 2023-12-15 21:37:36 edited from 66.0 in reply to 65 [link] [source]

Ok, I had a spot of trouble with my VM but I got it sorted and I'm running with the check-in as of 202312151958.

Server is running ok but there's a couple of things I noticed, first is that if I delete the log file and let althttpd create it on startup, it gets created with no permissions and I can only read it with sudo (it also only seems to log one line and that's it):

sod01:/srv/www/log$ ls -alh
total 60M    
drwxr-xr-x    2 sodface  sodface     4.0K Dec 15 21:18 .
drwxr-xr-x    6 sodface  sodface     4.0K Aug  4 01:44 ..
----------    1 sodface  sodface      313 Dec 15 21:18 althttpd.log
-rw-r--r--    1 sodface  sodface     2.3M Jun 24 00:20 althttpd.log.20230623
-rw-r--r--    1 sodface  sodface    48.2K Jul  1 14:00 althttpd.log.20230701

If I stop althttpd, delete the file, and then touch it before restarting althttpd, that fixes it.

The other thing, which may or may not be related to your recent changes because I think I've noticed it before, I'm starting and stopping althttpd with an openrc service file, eg:

$ sudo rc-service althttpd stop
$ sudo rc-service althttpd start

Sometimes it crashes on start:

sod01:/srv/www/log$ rc-status
Runlevel: default
 shorewall                          [  started  ]
 chronyd                            [  started  ]
 acpid                              [  started  ]
 althttpd                           [  crashed  ]
 crond                              [  started  ]
 sshd                               [  started  ]
Dynamic Runlevel: hotplugged

So I stop and start again and then it comes up ok. Probably disregard that for now and I'll try to debug it later.

(67) By Stephan Beal (stephan) on 2023-12-15 21:43:55 in reply to 66.1 [link] [source]

... , first is that if I delete the log file and let althttpd create it on startup, it gets created with no permissions and I can only read it with sudo (it also only seems to log one line and that's it):

That's almost certainly a misuse of open() on my part. i'll fix that in the morning.

Sometimes it crashes on start:

It can't be ruled out that my muddling is related to that!

(68) By Stephan Beal (stephan) on 2023-12-16 02:29:40 in reply to 66.1 [link] [source]

If I stop althttpd, delete the file, and then touch it before restarting althttpd, that fixes it.

That's fixed now. i had omitted the optional file-mode argument to open(), which is now set to 0640. On my system the default (without that explicit mode) was even stranger than yours (something like -rw---S---) but i didn't notice because the file was still readable.

PS: that default is subject to change. An argument couple be made for 0600.

Sometimes it crashes on start: ... Probably disregard that for now and I'll try to debug it later.

It's certainly possible that my changes are causing that, but if you can get more information about what's happening there it would be very helpful.

(69) By sodface on 2023-12-16 02:52:08 in reply to 68 [link] [source]

That's fixed now.

Yes, looks good on my end now too.

get more information

I'll do some more testing locally but just now applying the new update on my VM restarts went fine, no crashes.

(70) By sodface on 2023-12-16 03:04:19 in reply to 68 [link] [source]

Meant to mention that in order to keep my environment similar to what it was, I'm still applying three local patches that I was applying before:

10-allow-filename-plus-sign.patch
20-homepage-redirect-fix.patch
40-strict-website-match.patch

The patch to add the pid to the log file I can't apply with the latest changes so I've lost that quick reference should I encounter any more hangs.

(71) By Stephan Beal (stephan) on 2023-12-16 11:16:23 in reply to 70 [link] [source]

The patch to add the pid to the log file I can't apply with the latest changes so I've lost that quick reference should I encounter any more hangs.

Try this:

Index: Makefile
==================================================================
--- Makefile
+++ Makefile
@@ -1,10 +1,10 @@
 default: althttpd althttpsd
 SHELL = /bin/bash
 VERSION_NUMBER = 2.0
 CC=cc
-CFLAGS=-Os -Wall -Wextra
+CFLAGS=-Os -Wall -Wextra -DLOG_PID
 
 manifest:
 	@if which fossil > /dev/null; then \
 		fossil update --nosync current; \
 	else \

Index: althttpd.c
==================================================================
--- althttpd.c
+++ althttpd.c
@@ -977,10 +977,14 @@
       /* (13) */ aint( nRequest );
       /* (14) */ astr( "\"" ); astr( Escape(zAgent) ); astr2("\"");
       /* (15) */ astr( "\"" ); astr( Escape(zRM) ); astr2("\"");
       /* (16) */ aint( strlen(zHttpScheme)+strlen(zHttpHost)+strlen(zRealScript)+3 );
       /* (17) */ altfmt_int( zPos, zEnd, lineNum, 1, &zPos );
+#ifdef LOG_PID
+      acomma;
+      /* (18) */ altfmt_int( zPos, zEnd, getpid(), 1, &zPos );
+#endif
       astr( "\n" );
       priorSelf = self;
       priorChild = children;
 #endif
       if( zPos ){

(72) By sodface on 2023-12-16 14:33:46 in reply to 71 [link] [source]

This is great, thanks again Stephan. I saw you added this to the latest check-in so I rebuilt again and am back up and running and seeing pids in the log.

A couple of comments:

I'm building althttpd using the same Alpine build tooling that the distro uses and in order to set -DLOG_PID I found two ways in the APKBUILD file build() function (a patch would work too), the first is commented out below because I think the second is the preferred Alpine way:

build() {
#       sed -in '/^CFLAGS/ s/$/ -DLOG_PID/' Makefile
        sed -in 's/^CFLAGS=/CFLAGS+=/' Makefile
        export CFLAGS="$CFLAGS -DLOG_PID"
        make althttpsd
}

Would you consider changing the Makefile to CFLAGS+= so external CFLAGS can be passed in and not clobbered?

With the above changes, this is what my compiler invocation looked like:

cc -Os -Wformat -Werror=format-security -DLOG_PID -Os -Wall -Wextra '-DALTHTTPD_VERSION="2.0 [1fb1a1c2a54b] [2023-12-16T11:19:14.148]"' -fPIC -o althttpsd -DENABLE_TLS althttpd.c -lssl -lcro

While fiddling with that, I also took notice of the version hash in brackets, which doesn't really line up when you are browsing the repo, for example:

compiler value: 1fb1a1c2a54b
url value: 1fb1a1c2a54bc65d
check-in value displayed at the top of the page using above url: 1fb1a1c2a5

So three different length prefixes from same longer hash. Not a big deal I guess? I don't have fossil installed on my build box and the Makefile mentions that might cause a problem but as you can see I do get the correct first 12 characters at compile time.

(73) By sodface on 2023-12-16 15:08:44 in reply to 72 [link] [source]

Ah, either of these also work in the build function:

make CFLAGS+=-DLOG_PID althttpsd
make CFLAGS+=" -DLOG_PID" althttpsd

Though this option might not stay in the end as it's mainly for troubleshooting this issue, there are still Alpine global default CFLAGS that ideally would be pulled in:

export CFLAGS="-Os -Wformat -Werror=format-security"

(74) By sodface on 2023-12-17 13:58:36 in reply to 72 [link] [source]

Unfortunately, I'm seeing 23 hung processes, all of them 21h06 old (main process started 23h43 ago). Log file looks about the same as usual for these:

sod01:/srv/www$ ./logsearch.sh 
2023-12-16 16:45:02,66.228.39.232,"http://","",400,0,180,269,0,0,0,0,1,"","",7,200,5038
2023-12-16 16:45:04,66.228.39.232,"http://","",400,0,180,188,0,0,0,0,1,"","",7,200,5039
2023-12-16 16:45:07,66.228.39.232,"http://","",400,0,180,159,0,0,0,0,1,"","",7,200,5042
2023-12-16 16:45:09,66.228.39.232,"http://","",400,0,180,175,0,0,0,0,1,"","",7,200,5043
2023-12-16 16:45:10,66.228.39.232,"http://","",400,0,180,227,0,0,0,0,1,"","",7,200,5049
2023-12-16 16:45:12,66.228.39.232,"http://`","",400,11,180,200,0,0,0,0,1,"","",16,200,5052
2023-12-16 16:45:12,66.228.39.232,"http://`","",913,0,0,-200,0,0,0,0,1,"","",16,132,5052
2023-12-16 16:45:21,66.228.39.232,"http://`","",914,0,0,0,0,0,0,9000000,1,"","",16,802,5052
2023-12-16 16:45:15,66.228.39.232,"http://","",400,1,180,214,0,0,0,0,1,"","",7,200,5055
2023-12-16 16:45:17,66.228.39.232,"http://","",400,4,180,246,0,0,0,0,1,"","",7,200,5056
2023-12-16 16:45:18,66.228.39.232,"http://","",400,1,180,293,0,0,0,0,1,"","",7,200,5057
2023-12-16 16:45:24,66.228.39.232,"http://","",400,2,180,328,0,0,0,0,1,"","",7,200,5061
2023-12-16 16:45:26,66.228.39.232,"http://","",400,0,180,0,391,0,0,0,1,"","",7,200,5062
2023-12-16 16:45:27,66.228.39.232,"http://","",400,5,180,154,0,0,0,0,1,"","",7,200,5064
2023-12-16 16:45:27,137.184.100.40,"http://XP|'|'|No|'|'|0.6.4|'|'|..|'|'||'|'|[endof]","",400,117,180,265,0,0,0,0,1,"","",50,200,5066
2023-12-16 16:45:27,137.184.100.40,"http://XP|'|'|No|'|'|0.6.4|'|'|..|'|'||'|'|[endof]","",913,0,0,-265,0,0,0,0,1,"","",50,132,5066
2023-12-16 16:45:36,137.184.100.40,"http://XP|'|'|No|'|'|0.6.4|'|'|..|'|'||'|'|[endof]","",914,0,0,0,0,0,0,9000000,1,"","",50,802,5066
2023-12-16 16:45:29,66.228.39.232,"http://","",400,0,180,0,490,0,0,0,1,"","",7,200,5067
2023-12-16 16:45:29,137.184.100.40,"http://7","",400,280,180,222,0,0,0,0,1,"","",8,200,5069
2023-12-16 16:45:29,137.184.100.40,"http://7","",913,0,0,-222,0,0,0,0,1,"","",8,132,5069
2023-12-16 16:45:37,137.184.100.40,"http://7","",914,0,0,0,0,0,0,8000000,1,"","",8,802,5069
2023-12-16 16:45:30,66.228.39.232,"http://","",400,0,180,177,0,0,0,0,1,"","",7,200,5070
2023-12-16 16:45:30,137.184.100.40,"http://","",400,1,180,277,0,0,0,0,1,"","",7,200,5071
2023-12-16 16:45:32,66.228.39.232,"http://","",400,2,180,155,0,0,0,0,1,"","",7,200,5072
2023-12-16 16:45:32,137.184.100.40,"http://","",400,6,180,175,0,0,0,0,1,"","",7,200,5075
2023-12-16 16:45:33,66.228.39.232,"http://","",400,1,180,319,0,0,0,0,1,"","",7,200,5076
2023-12-16 16:45:33,137.184.100.40,"http://","",400,2,180,191,0,0,0,0,1,"","",7,200,5078
2023-12-16 16:45:35,66.228.39.232,"http://","",400,0,180,156,0,0,0,0,1,"","",7,200,5080
2023-12-16 16:45:36,137.184.100.40,"http://","",400,33,180,198,0,0,0,0,1,"","",7,200,5084

Backtrace for several of them is the same as below and also this earlier post.

Attaching to process 5038
Reading symbols from /usr/bin/althttpd...
(No debugging symbols found in /usr/bin/althttpd)
Reading symbols from /lib/libssl.so.3...
(No debugging symbols found in /lib/libssl.so.3)
Reading symbols from /lib/libcrypto.so.3...
(No debugging symbols found in /lib/libcrypto.so.3)
Reading symbols from /lib/ld-musl-x86_64.so.1...
Reading symbols from /usr/lib/debug//lib/ld-musl-x86_64.so.1.debug...
__syscall_cp_c (nr=202, u=139864990550272, v=128, w=-1, x=0, y=0, z=0) at ./arch/x86_64/syscall_arch.h:61
61	./arch/x86_64/syscall_arch.h: No such file or directory.
(gdb) bt
#0  __syscall_cp_c (nr=202, u=139864990550272, v=128, w=-1, x=0, y=0, z=0) at ./arch/x86_64/syscall_arch.h:61
#1  0x00007f34db0fb6b9 in __futex4_cp (to=0x0, val=-1, op=128, addr=0x7f34db140900 <lock>) at src/thread/__timedwait.c:24
#2  __timedwait_cp (addr=addr@entry=0x7f34db140900 <lock>, val=val@entry=-1, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128) at src/thread/__timedwait.c:52
#3  0x00007f34db0fb75e in __timedwait (addr=addr@entry=0x7f34db140900 <lock>, val=-1, clk=clk@entry=0, at=at@entry=0x0, priv=128) at src/thread/__timedwait.c:68
#4  0x00007f34db0fe010 in __pthread_rwlock_timedwrlock (at=<optimized out>, rw=<optimized out>) at src/thread/pthread_rwlock_timedwrlock.c:18
#5  __pthread_rwlock_timedwrlock (rw=rw@entry=0x7f34db140900 <lock>, at=at@entry=0x0) at src/thread/pthread_rwlock_timedwrlock.c:3
#6  0x00007f34db0fe0df in __pthread_rwlock_wrlock (rw=rw@entry=0x7f34db140900 <lock>) at src/thread/pthread_rwlock_wrlock.c:5
#7  0x00007f34db1049e3 in __libc_exit_fini () at ldso/dynlink.c:1441
#8  0x00007f34db0bb082 in exit (code=0) at src/exit/exit.c:30
#9  0x00005581aa0d962d in ?? ()
#10 <signal handler called>
#11 0x00007f34db0fc0cb in __syscall6 (a6=0, a5=0, a4=0, a3=139864990269645, a2=128, a1=139864990550148, n=202) at ./arch/x86_64/syscall_arch.h:59
#12 __syscall_cp_c (nr=202, u=139864990550148, v=128, w=139864990269645, x=0, y=0, z=0) at src/thread/pthread_cancel.c:31
#13 0x00007f34db0fb6b9 in __futex4_cp (to=0x0, val=-2147483632, op=128, addr=0x7f34db140884 <init_fini_lock+4>) at src/thread/__timedwait.c:24
#14 __timedwait_cp (addr=addr@entry=0x7f34db140884 <init_fini_lock+4>, val=val@entry=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128) at src/thread/__timedwait.c:52
#15 0x00007f34db0fb75e in __timedwait (addr=addr@entry=0x7f34db140884 <init_fini_lock+4>, val=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128)
    at src/thread/__timedwait.c:68
#16 0x00007f34db0fd9a1 in __pthread_mutex_timedlock (m=0x7f34db140880 <init_fini_lock>, at=at@entry=0x0) at src/thread/pthread_mutex_timedlock.c:85
#17 0x00007f34db0fd7b0 in __pthread_mutex_lock (m=m@entry=0x7f34db140880 <init_fini_lock>) at src/thread/pthread_mutex_lock.c:9
#18 0x00007f34db1049ef in __libc_exit_fini () at ldso/dynlink.c:1442
#19 0x00007f34db0bb082 in exit (code=0) at src/exit/exit.c:30
#20 0x00005581aa0d962d in ?? ()
#21 <signal handler called>
#22 0x00007f34db0f12bd in __syscall3 (a3=2, a2=140737279760528, a1=1, n=20) at ./arch/x86_64/syscall_arch.h:29
#23 __stdio_write (f=0x7f34db13e2c0 <__stdout_FILE>, buf=<optimized out>, len=0) at src/stdio/__stdio_write.c:15
#24 0x00007f34db0f1143 in close_file (f=0x7f34db13e2c0 <__stdout_FILE>) at src/stdio/__stdio_exit.c:12
#25 0x00007f34db0f1199 in __stdio_exit () at src/stdio/__stdio_exit.c:21
#26 0x00007f34db0bb089 in exit (code=0) at src/exit/exit.c:31
#27 0x00005581aa0d937d in ?? ()
#28 0x00005581aa0db08e in ?? ()
#29 0x00005581aa0d82e2 in ?? ()
#30 0x00007f34db0c2aad in libc_start_main_stage2 (main=0x5581aa0d7790, argc=13, argv=0x7ffff39172a8) at src/env/__libc_start_main.c:95
#31 0x00005581aa0d832a in ?? ()
#32 0x000000000000000d in ?? ()
#33 0x00007ffff3918e5e in ?? ()
#34 0x00007ffff3918e70 in ?? ()
#35 0x00007ffff3918e77 in ?? ()
#36 0x00007ffff3918e79 in ?? ()
#37 0x00007ffff3918e80 in ?? ()
#38 0x00007ffff3918e83 in ?? ()
#39 0x00007ffff3918e8a in ?? ()
#40 0x00007ffff3918e93 in ?? ()
#41 0x00007ffff3918e9c in ?? ()
#42 0x00007ffff3918eac in ?? ()
#43 0x00007ffff3918eb6 in ?? ()
#44 0x00007ffff3918ed0 in ?? ()
#45 0x00007ffff3918ed7 in ?? ()
#46 0x0000000000000000 in ?? ()

(75) By Stephan Beal (stephan) on 2023-12-19 12:27:56 in reply to 74 [link] [source]

Unfortunately, I'm seeing 23 hung processes, all of them 21h06 old (main process started 23h43 ago).

We missed one invocation of malloc() which might be the culprit. As a quick test, please try this patch:

Index: althttpd.c
==================================================================
--- althttpd.c
+++ althttpd.c
@@ -952,11 +952,11 @@
       ** (14) User agent
       ** (15) Remote user
       ** (16) Bytes of URL that correspond to the SCRIPT_NAME
       ** (17) Line number in source file
       */
-
+#define Escape(X) X
       /* (1) */ altfmt_DateTime( zPos, zEnd, &dt, &zPos );
       astr(",");
       /* (2) */ astr2( zRemoteAddr );
       /* (3) */ astr( "\"" );
       astr( zHttpScheme );
@@ -988,10 +988,11 @@
 #endif
       astr( "\n" );
       priorSelf = self;
       priorChild = children;
 #endif
+#undef Escape
       if( zPos ){
         assert( zPos<zEnd );
         assert( zPos>&msgbuf[0] );
         *zPos = 0;
         write( logfd, msgbuf, (size_t)(zPos - &msgbuf[0]) );

That's not a genuine fix because it will lead to unescaped locked entries, but that should not matter except for entries which themselves contain double-quotes.

i will get the string-escaping refactored to not use malloc() as time and energy allow.

(76) By sodface on 2023-12-19 13:10:05 in reply to 75 [link] [source]

Patch applied and server restarted, thanks.

(77.1) By sodface on 2023-12-19 14:07:10 edited from 77.0 in reply to 75 [link] [source]

One thing I'm a bit confused about in drh's remediation plan is item 10:

Always invoke exit() at the end of a MakeLogFile() call that is run when inSignalHandler is set.

In psykose's reply, she notes:

one of the diagnosed hangs is in exit() in an exit handler.

Obviously I'm not proficient enough in C (or much else) to understand everything that's going on in the code and be able to tell what's an issue and what isn't, but, doesn't the current code call exit() in a signal handler, and if so, isn't that an issue?

There are numerous references, like this one, that recommend using _exit() instead but I don't know what side effects that might have to althttpd.

(78) By Stephan Beal (stephan) on 2023-12-19 14:55:10 in reply to 77.1 [link] [source]

doesn't the current code call exit() in a signal handler, and if so, isn't that an issue?

Good question! Next question? ;)

The part about "in an exit handler" is telling, and suggests that it's happening via atexit(). althttpd has a single atexit() handler and that's the one which shuts down TLS. Something you can try is comment out either:

     atexit(tls_atexit);

(which will generate a harmless warning about an unused static function) or, in tls_atexit(), add 0 && like so:

  if( 0 && tlsState.sslCon ){

that will cause the atexit handler to do nothing.

Ideally althttpd will clean up the TLS resources when it ends, but that change will force it not to. That should not cause any weird effects except that it will very briefly leak memory for each connection (noting that the pending call to exit() will free that memory without performing whatever formal cleanup libssl would like to do).

(79) By sodface on 2023-12-21 13:21:53 in reply to 78 [link] [source]

Stephan, I was waiting to see if I'd get more hangs before applying the above change. I checked this morning and had 29 hanging processes about 8 hours old so I've now rebuilt using check-in as of 202312201454 and also patched in:

--- ./althttpd.c.orig
+++ ./althttpd.c
@@ -2218,7 +2218,7 @@
 }
 
 static void tls_atexit(void){
-  if( tlsState.sslCon ){
+  if( 0 && tlsState.sslCon ){
     tls_close_server(tlsState.sslCon);
     tlsState.sslCon = NULL;
   }

I'll keep an eye on it. Thanks again.

(80) By sodface on 2023-12-27 13:00:11 in reply to 78 [link] [source]

Stephan, up 5 days with no hangs so that's good!

There's something going on with the formatting of the request url in the log though, for example:

Before:

2023-10-04 02:04:08,73.131.169.187,"http://www.sodface.com/repo/x86_64/APKINDEX.tar.gz","",304,144,188,424,0,0,0,126,1,"libfetch/2.0","",50,470,9065

Now:

2023-12-27 12:02:06,73.131.169.187,"http://www.sodface.comwww.sodface.com/repo/x86_64/APKINDEX.tar.gz/repo/x86_64/APKINDEX.tar.gz","",200,119,3045,1099,0,0,0,0,1,"libfetch/2.0libfetch/2.0","",50,0,24395

(81) By Stephan Beal (stephan) on 2023-12-27 15:25:01 in reply to 80 [link] [source]

up 5 days with no hangs so that's good!

That's great news! Here's one thing to try while waiting on the log-escaping fix:

Replace:

  if( 0 && tlsState.sslCon ){

with:

  if( !inSignalHandler && tlsState.sslCon ){

or its equivalent:

  if( inSignalHandler==0 && tlsState.sslCon ){

That might have an equivalent effect while also properly cleaning up the TLS stuff for "normal" (non-signal) exits, but it might also cause hung processes if the signal arrives during that cleanup.

There's something going on with the formatting of the request url in the log though, for example:

Definitely a new bug of mine. i'll post back when it's fixed, but it might take me until tomorrow.

(82) By Stephan Beal (stephan) on 2023-12-27 16:09:12 in reply to 81 [link] [source]

Definitely a new bug of mine.

And a facepalmy one, at that! It's now fixed in the trunk.

Thank you again for your continued testing!

(83) By sodface on 2023-12-27 17:14:26 in reply to 82 [link] [source]

Thank you again for your continued testing!

Happy to help and thanks again for working on it!!

Up and running on the latest checkin and with patch:

--- ./althttpd.c.orig
+++ ./althttpd.c
@@ -2218,7 +2218,7 @@
 }
 
 static void tls_atexit(void){
-  if( tlsState.sslCon ){
+  if( !inSignalHandler && tlsState.sslCon ){
     tls_close_server(tlsState.sslCon);
     tlsState.sslCon = NULL;
   }

Log entries look good now.

(84) By sodface on 2023-12-28 15:35:52 in reply to 81 [link] [source]

but it might also cause hung processes if the signal arrives during that cleanup.

With that change it took about 20 hours to get a string of hung processes, roughly following the usual pattern:

sod01:/srv/www$ ./logsearch.sh 
2023-12-28 13:15:39,45.79.145.181,"http://","",400,1,180,0,1008,0,0,0,1,"","",7,200,26483
2023-12-28 13:15:40,45.79.145.181,"http://","",400,0,180,151,0,0,0,0,1,"","",7,200,26486
2023-12-28 13:15:44,45.79.145.181,"http://","",400,1,180,678,0,0,0,0,1,"","",7,200,26489
2023-12-28 13:15:45,45.79.145.181,"http://","",400,0,180,308,0,0,0,0,1,"","",7,200,26490
2023-12-28 13:15:47,45.79.145.181,"http://","",400,0,180,215,0,0,0,0,1,"","",7,200,26491
2023-12-28 13:15:50,45.79.145.181,"http://`","",400,11,180,295,0,0,0,0,1,"","",16,200,26500
2023-12-28 13:15:50,45.79.145.181,"http://`","",913,0,0,-295,0,0,0,0,1,"","",16,132,26500
2023-12-28 13:15:58,45.79.145.181,"http://`","",914,0,0,0,0,0,0,8000000,1,"","",16,802,26500
2023-12-28 13:15:53,45.79.145.181,"http://","",400,1,180,401,0,0,0,0,1,"","",7,200,26503
2023-12-28 13:15:54,45.79.145.181,"http://","",400,4,180,198,0,0,0,0,1,"","",7,200,26504
2023-12-28 13:15:56,45.79.145.181,"http://","",400,1,180,211,0,0,0,0,1,"","",7,200,26506
2023-12-28 13:15:57,23.239.16.144,"http://XP|'|'|No|'|'|0.6.4|'|'|..|'|'||'|'|[endof]","",400,117,180,251,0,0,0,0,1,"","",50,200,26509
2023-12-28 13:15:57,23.239.16.144,"http://XP|'|'|No|'|'|0.6.4|'|'|..|'|'||'|'|[endof]","",913,0,0,-251,0,0,0,0,1,"","",50,132,26509
2023-12-28 13:16:06,23.239.16.144,"http://XP|'|'|No|'|'|0.6.4|'|'|..|'|'||'|'|[endof]","",914,0,0,0,0,0,0,9000000,1,"","",50,802,26509
2023-12-28 13:15:59,45.79.145.181,"http://","",400,1,180,342,0,0,0,0,1,"","",7,200,26510
2023-12-28 13:15:59,23.239.16.144,"http://7","",400,280,180,195,0,0,0,0,1,"","",8,200,26512
2023-12-28 13:15:59,23.239.16.144,"http://7","",913,0,0,-195,0,0,0,0,1,"","",8,132,26512
2023-12-28 13:16:08,23.239.16.144,"http://7","",914,0,0,0,0,0,0,9000000,1,"","",8,802,26512
2023-12-28 13:16:01,45.79.145.181,"http://","",400,0,180,225,0,0,0,0,1,"","",7,200,26513
2023-12-28 13:16:01,23.239.16.144,"http://","",400,1,180,398,0,0,0,0,1,"","",7,200,26514
2023-12-28 13:16:02,45.79.145.181,"http://","",400,2,180,357,0,0,0,0,1,"","",7,200,26515
2023-12-28 13:16:02,23.239.16.144,"http://","",400,6,180,266,0,0,0,0,1,"","",7,200,26518
2023-12-28 13:16:04,45.79.145.181,"http://","",400,0,180,244,0,0,0,0,1,"","",7,200,26519
2023-12-28 13:16:04,23.239.16.144,"http://","",400,2,180,192,0,0,0,0,1,"","",7,200,26521
2023-12-28 13:16:05,45.79.145.181,"http://","",400,5,180,253,0,0,0,0,1,"","",7,200,26522
2023-12-28 13:16:07,45.79.145.181,"http://","",400,0,180,341,0,0,0,0,1,"","",7,200,26529
2023-12-28 13:16:08,45.79.145.181,"http://","",400,0,180,352,0,0,0,0,1,"","",7,200,26560
2023-12-28 13:16:10,45.79.145.181,"http://","",400,2,180,335,0,0,0,0,1,"","",7,200,26561
2023-12-28 13:16:11,45.79.145.181,"http://","",400,1,180,195,0,0,0,0,1,"","",7,200,26562
2023-12-28 13:16:13,45.79.145.181,"http://","",400,0,180,0,2187,0,0,0,1,"","",7,200,26564

(85) By Stephan Beal (stephan) on 2023-12-28 16:46:38 in reply to 84 [link] [source]

With that change it took about 20 hours to get a string of hung processes, roughly following the usual pattern:

Now, just to verify once more, please change the tls shutdown part to:

static void tls_atexit(void){
#if 0
  if( !inSignalHandler && tlsState.sslCon ){
    tls_close_server(tlsState.sslCon);
    tlsState.sslCon = NULL;
  }
#endif
}

That "should" get you back to a setup which does not leave hung processes, at the cost of not properly cleaning up the TLS connection state (it will be freed automatically when the althttpd process exits, it just won't be "properly" cleaned up).

(86.1) By sodface on 2023-12-28 22:43:42 edited from 86.0 in reply to 85 [link] [source]

Done and restarted.

//edit, btw, I didn't get an email for your last post.

(88) By Stephan Beal (stephan) on 2023-12-31 16:45:40 in reply to 86.1 [link] [source]

edit, btw, I didn't get an email for your last post.

If you're receiving the notifications via gmail, they had severe lags the past few days across all fossil forums i follow, and is definitely possible that some traffic was lost. That seems to be resolved now.

(89) By sodface on 2023-12-31 17:13:35 in reply to 88 [link] [source]

gmail

Yes, well a custom domain with a gmail backend, google workspace or whatever they are calling it these days. If there was a clear way out of it I would do it since it used to be "free" and now I'm paying $60 a month for basically the same service due to their policy changes and poor communication after the fact. If I didn't have family members using it I would have moved on already.

In my half hearted search for alternatives I came across this depressing post:

https://cfenollosa.com/blog/after-self-hosting-my-email-for-twenty-three-years-i-have-thrown-in-the-towel-the-oligopoly-has-won.html

But anyway, thanks for the info.

(90.1) By Stephan Beal (stephan) on 2024-01-02 08:05:31 edited from 90.0 in reply to 86.1 [link] [source]

Done and restarted.

Can you please take a look at your logs and make sure the date is correct? i seem to have introduced a buggy time conversion in which an off-by-one day-of-month is sometimes used. e.g. my current logs are says 2024-01-01, despite it being 2024-01-02 here (and GMT). Curiously, though, the logs from yesterday (2024-01-01) also say 2024-01-01.

Edit: that's resolved in the latest trunk. It was an off-by-half-a-day bug caused by a missing set of parens around a Julian Day-to-Unix conversion.

(91) By sodface on 2024-01-04 01:45:12 in reply to 90.1 [link] [source]

Just now restarted on checkin [ef5d1a1fe558] [2024-01-02T08:04:38.288].

Previous build was up for 3 days with no hangs.

Happy New Year!

(92) By Stephan Beal (stephan) on 2024-01-04 01:58:18 in reply to 91 [link] [source]

Previous build was up for 3 days with no hangs.

Great, thank you!

Now we just have to figure out what we're going to do about the response timing info. It now has seconds, rather that milliseconds, precision, which is essentially useless. Removing it from the logs would break anyone's log importers, but retaining it doesn't seem helpful (except to avoid breaking log importers). Perhaps there's an approach, unknown to me, which doesn't require losing that info.

(93) By sodface on 2024-01-04 02:54:50 in reply to 92 [link] [source]

Now we just have to figure out what we're going to do about the response timing info.

I won't be of much help other than to post SO links like this one which contains a couple of comments which sound reasonable but need a better assessment than I can provide.

gettimeofday is not defined as async-signal-safe, but if you pass 0 for the second (timezone) argument, it doesn't have to do anything that time and clock_gettime (both of which are officially async-signal-safe) don't also do, so as a matter of QoI it should be async-signal-safe in that case.

(94) By Stephan Beal (stephan) on 2024-01-04 03:12:18 in reply to 93 [link] [source]

... doesn't have to do anything that time and clock_gettime (both of which are officially async-signal-safe) don't also do,

clock_gettime() is a great tip, thank you. Looking over the code, the timing info is currently based on CPU usage, rather than clock time, but using clock_gettime() with the CLOCK_PROCESS_CPUTIME_ID should, it seems, give us essentially the same values. That particular timer isn't 100% portable but for platforms without that one we can use the CLOCK_MONOTONIC timer, which will give us "wall clock" time, which is close enough for this purpose.

Let's give this a try...

(95) By Stephan Beal (stephan) on 2024-01-04 12:11:08 in reply to 94 [link] [source]

Let's give this a try...

i was confusing two topics there: getrusage() and the response wall-clock time. The latter now, in the current check-in, uses clock_gettime(). The former still uses getrusage(), but only if a signal handler is not active. If a signal handler is active, getrusage() is not called and the four log fields of CPU times (as opposed to the wall-clock time) will be zero. That getrusage() handling is not new - it's been that way for a couple of weeks now - just a clarification.

(96) By sodface on 2024-01-04 21:17:45 in reply to 95 [link] [source]

Up and running on [7b35e12f1505] [2024-01-04T12:07:08.258], thanks!

(87) By sodface on 2023-12-31 14:18:26 in reply to 85 [link] [source]

Running on checkin [ff8547f8d74f] [2023-12-30T06:46:43.662]

(97) By Stephan Beal (stephan) on 2024-01-12 13:32:28 in reply to 87 [link] [source]

Running on checkin...

The signal-related cleanups have been merged to trunk. Hopefully they don't cause anyone any undue grief!

(98) By sodface on 2024-01-12 17:27:16 in reply to 97 [link] [source]

Running on [e28f3136109b] [2024-01-12T13:51:28.698]

Previous build was up for 7 days with no hangs.

Thanks a lot Stephan and Richard for working on this.

(99) By Stephan Beal (stephan) on 2024-01-17 12:29:18 in reply to 70 [link] [source]

The patch to add the pid to the log file I can't apply with the latest changes so I've lost that quick reference should I encounter any more hangs.

Achtung: it turns out that LOG_PID was always being defined, at least on Linux, because syslog.h defines that. The macro has been renamed to ALTHTTPD_LOG_PID.

https://pubs.opengroup.org/onlinepubs/009695399/basedefs/syslog.h.html

(100) By sodface on 2024-01-17 12:52:28 in reply to 99 [link] [source]

Copy. Still running on the previous build - up 4 days with no hangs. Looking good!

(101) By sodface on 2024-01-23 23:41:14 in reply to 1.1 [link] [source]

Unfortunately I'm seeing 25 hung processes again, same basic symptoms as before. Main process was up 11days and the 25 hanging processes are all pretty much sequential pids and 1d22h old.

The backtrace for the several I checked is the same:

(gdb) bt
#0  __syscall_cp_c (nr=202, u=140010256718080, v=128, w=-1, x=0, y=0, z=0) at ./arch/x86_64/syscall_arch.h:61
#1  0x00007f56ad9986b9 in __futex4_cp (to=0x0, val=-1, op=128, addr=0x7f56ad9dd900 <lock>) at src/thread/__timedwait.c:24
#2  __timedwait_cp (addr=addr@entry=0x7f56ad9dd900 <lock>, val=val@entry=-1, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128) at src/thread/__timedwait.c:52
#3  0x00007f56ad99875e in __timedwait (addr=addr@entry=0x7f56ad9dd900 <lock>, val=-1, clk=clk@entry=0, at=at@entry=0x0, priv=128) at src/thread/__timedwait.c:68
#4  0x00007f56ad99b010 in __pthread_rwlock_timedwrlock (at=<optimized out>, rw=<optimized out>) at src/thread/pthread_rwlock_timedwrlock.c:18
#5  __pthread_rwlock_timedwrlock (rw=rw@entry=0x7f56ad9dd900 <lock>, at=at@entry=0x0) at src/thread/pthread_rwlock_timedwrlock.c:3
#6  0x00007f56ad99b0df in __pthread_rwlock_wrlock (rw=rw@entry=0x7f56ad9dd900 <lock>) at src/thread/pthread_rwlock_wrlock.c:5
#7  0x00007f56ad9a19e3 in __libc_exit_fini () at ldso/dynlink.c:1441
#8  0x00007f56ad958082 in exit (code=0) at src/exit/exit.c:30
#9  0x000055db08b3ad03 in ?? ()
#10 0x000055db08b3b85d in ?? ()
#11 <signal handler called>
#12 0x00007f56ad9990cb in __syscall6 (a6=0, a5=0, a4=0, a3=140010256437453, a2=128, a1=140010256717956, n=202) at ./arch/x86_64/syscall_arch.h:59
#13 __syscall_cp_c (nr=202, u=140010256717956, v=128, w=140010256437453, x=0, y=0, z=0) at src/thread/pthread_cancel.c:31
#14 0x00007f56ad9986b9 in __futex4_cp (to=0x0, val=-2147483632, op=128, addr=0x7f56ad9dd884 <init_fini_lock+4>) at src/thread/__timedwait.c:24
#15 __timedwait_cp (addr=addr@entry=0x7f56ad9dd884 <init_fini_lock+4>, val=val@entry=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128) at src/thread/__timedwait.c:52
#16 0x00007f56ad99875e in __timedwait (addr=addr@entry=0x7f56ad9dd884 <init_fini_lock+4>, val=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128)
    at src/thread/__timedwait.c:68
#17 0x00007f56ad99a9a1 in __pthread_mutex_timedlock (m=0x7f56ad9dd880 <init_fini_lock>, at=at@entry=0x0) at src/thread/pthread_mutex_timedlock.c:85
#18 0x00007f56ad99a7b0 in __pthread_mutex_lock (m=m@entry=0x7f56ad9dd880 <init_fini_lock>) at src/thread/pthread_mutex_lock.c:9
#19 0x00007f56ad9a19ef in __libc_exit_fini () at ldso/dynlink.c:1442
#20 0x00007f56ad958082 in exit (code=0) at src/exit/exit.c:30
#21 0x000055db08b3ad03 in ?? ()
#22 0x000055db08b3b85d in ?? ()
#23 <signal handler called>
#24 0x00007f56ad98e2bd in __syscall3 (a3=2, a2=140733081039376, a1=1, n=20) at ./arch/x86_64/syscall_arch.h:29
#25 __stdio_write (f=0x7f56ad9db2c0 <__stdout_FILE>, buf=<optimized out>, len=0) at src/stdio/__stdio_write.c:15
#26 0x00007f56ad98e143 in close_file (f=0x7f56ad9db2c0 <__stdout_FILE>) at src/stdio/__stdio_exit.c:12
#27 0x00007f56ad98e199 in __stdio_exit () at src/stdio/__stdio_exit.c:21
#28 0x00007f56ad958089 in exit (code=0) at src/exit/exit.c:31
#29 0x000055db08b3ad03 in ?? ()
#30 0x000055db08b3c9d5 in ?? ()
#31 0x000055db08b39bb9 in ?? ()
#32 0x00007f56ad95faad in libc_start_main_stage2 (main=0x55db08b39030, argc=13, argv=0x7ffef94e0c28) at src/env/__libc_start_main.c:95
#33 0x000055db08b39c03 in ?? ()
#34 0x000000000000000d in ?? ()
#35 0x00007ffef94e2e5e in ?? ()
#36 0x00007ffef94e2e70 in ?? ()
#37 0x00007ffef94e2e77 in ?? ()
#38 0x00007ffef94e2e79 in ?? ()
#39 0x00007ffef94e2e80 in ?? ()
#40 0x00007ffef94e2e83 in ?? ()
#41 0x00007ffef94e2e8a in ?? ()
#42 0x00007ffef94e2e93 in ?? ()
#43 0x00007ffef94e2e9c in ?? ()
#44 0x00007ffef94e2eac in ?? ()
#45 0x00007ffef94e2eb6 in ?? ()
#46 0x00007ffef94e2ed0 in ?? ()
#47 0x00007ffef94e2ed7 in ?? ()
#48 0x0000000000000000 in ?? ()
(gdb)

I'm going to rebuild althttpd with debug symbols and install the other debug packages for openssl.

(102) By Stephan Beal (stephan) on 2024-01-24 03:22:33 in reply to 101 [link] [source]

I'm going to rebuild althttpd with debug symbols and install the other debug packages for openssl.

This one is particularly puzzling because the signal handler is triggered after exit() is called, during what appears to be the libc-level cleanup of stdout. Once we call exit(), we're at the mercy of the post-main() cleanup.

We do register one atexit() handler, but it's currently a no-op (it was previously used for TLS cleanup). Perhaps (just speculating) merely having an atexit() handler puts us in a signal-unsafe situation.

A quick-and-dirty patch to completely eliminate our use of atexit():

Index: althttpd.c
==================================================================
--- althttpd.c
+++ althttpd.c
@@ -619,11 +619,11 @@
   va_end(va);
   return rc;
 }
 static void *tls_new_server(int iSocket);
 static void tls_close_server(void *pServerArg);
-static void tls_atexit(void);
+/*static void tls_atexit(void);*/
 #else
 #define althttpd_printf printf
 #endif
 
 
@@ -2310,23 +2310,23 @@
   SSL_free(pServer->ssl);
   memset(pServer, 0, sizeof(TlsServerConn));
   free(pServer);
 }
 
-static void tls_atexit(void){
 #if 0
+static void tls_atexit(void){
   /*
   ** Shutting down TLS can lead to spurious hung processes on some
   ** platforms/builds.  See the long discussion on this at:
   ** https://sqlite.org/althttpd/forumpost/4dc31619341ce947
   */
   if( inSignalHandler==0 && tlsState.sslCon!=0 ){
     tls_close_server(tlsState.sslCon);
     tlsState.sslCon = NULL;
   }
-#endif
 }
+#endif
 #endif /* ENABLE_TLS */
 
 
 /*
 ** Works like fgets():
@@ -2860,11 +2860,11 @@
       tlsState.sslCon = (TlsServerConn *)tls_new_server(iSocket);
       if( NULL==tlsState.sslCon ){
         Malfunction(512, /* LOG: TLS context */
           "Could not instantiate TLS context.");
       }
-      atexit(tls_atexit);
+      /*atexit(tls_atexit);*/
     }
     return 1;
   }
 #else
   if( 0==iSocket ){/*unused arg*/}

That said: the chance that that will make any difference seems slim, as this all looks to be happening deep in libc. Perhaps you should not apply this patch until we see what happens with debugging symbols added. That might shed more light on where the train wreck starts.

(103) By sodface on 2024-01-24 12:19:20 in reply to 102 [link] [source]

Perhaps you should not apply this patch until we see what happens with debugging symbols added.

Alice helped me sort out the openssl debug symbols last night and I also rebuilt althttpd without stripping it (before you posted the patch above) so I think I'm better prepared now to get a good backtrace. As you suggest, I'll just leave it as is and wait for the next hang.

(104) By sodface on 2024-01-25 23:18:01 in reply to 102 [source]

One process hung for 15h46m.

Log entry:

2024-01-25 07:22:51,174.138.61.44,"http://","",400,3,180,0,603,0,0,89,1,"","",7,200,5274

Backtrace:

Attaching to process 5274
Reading symbols from /usr/bin/althttpd...
Reading symbols from /lib/libssl.so.3...
Reading symbols from /usr/lib/debug//lib/libssl.so.3.debug...
Reading symbols from /lib/libcrypto.so.3...
Reading symbols from /usr/lib/debug//lib/libcrypto.so.3.debug...
Reading symbols from /lib/ld-musl-x86_64.so.1...
Reading symbols from /usr/lib/debug//lib/ld-musl-x86_64.so.1.debug...
__syscall_cp_c (nr=202, u=140148024469572, v=128, w=-2147483632, x=0, y=0, z=0) at ./arch/x86_64/syscall_arch.h:61

warning: 61	./arch/x86_64/syscall_arch.h: No such file or directory
(gdb) bt
#0  __syscall_cp_c (nr=202, u=140148024469572, v=128, w=-2147483632, x=0, y=0, z=0) at ./arch/x86_64/syscall_arch.h:61
#1  0x00007f76c1327a2e in __futex4_cp (to=<optimized out>, val=-2147483632, op=128, addr=0x7f76c136e844 <init_fini_lock+4>) at src/thread/__timedwait.c:24
#2  __timedwait_cp (addr=addr@entry=0x7f76c136e844 <init_fini_lock+4>, val=val@entry=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128) at src/thread/__timedwait.c:52
#3  0x00007f76c1327afb in __timedwait (addr=addr@entry=0x7f76c136e844 <init_fini_lock+4>, val=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128)
    at src/thread/__timedwait.c:68
#4  0x00007f76c132a1c4 in __pthread_mutex_timedlock (m=0x7f76c136e840 <init_fini_lock>, at=at@entry=0x0) at src/thread/pthread_mutex_timedlock.c:85
#5  0x00007f76c132a000 in __pthread_mutex_lock (m=m@entry=0x7f76c136e840 <init_fini_lock>) at src/thread/pthread_mutex_lock.c:9
#6  0x00007f76c1332452 in __libc_exit_fini () at ldso/dynlink.c:1477
#7  0x00007f76c12e1094 in exit (code=0) at src/exit/exit.c:30
#8  0x000056469594c825 in Timeout ()
#9  <signal handler called>
#10 0x00007f76c12580c0 in __do_global_dtors_aux () from /lib/libssl.so.3
#11 0x00007f76c136eb08 in builtin_tls () from /lib/ld-musl-x86_64.so.1
#12 0x00007f76c1332557 in __libc_exit_fini () at ldso/dynlink.c:1488
#13 0x00007f76c12e1094 in exit (code=0) at src/exit/exit.c:30
#14 0x000056469594bcbd in MakeLogEntry ()
#15 0x000056469594d98f in ProcessOneRequest ()
#16 0x000056469594abb9 in main ()

(105) By Stephan Beal (stephan) on 2024-01-26 03:23:46 in reply to 104 [link] [source]

If i'm reading this properly:

#8  0x000056469594c825 in Timeout ()
#9  <signal handler called>
#10 0x00007f76c12580c0 in __do_global_dtors_aux () from /lib/libssl.so.3
#11 0x00007f76c136eb08 in builtin_tls () from /lib/ld-musl-x86_64.so.1
#12 0x00007f76c1332557 in __libc_exit_fini () at ldso/dynlink.c:1488
#13 0x00007f76c12e1094 in exit (code=0) at src/exit/exit.c:30
#14 0x000056469594bcbd in MakeLogEntry ()

... then what's happening is the Timeout() signal handler is being called after exit() has been called, which then, somewhat unsurprisingly, leads to grief.

Please try the signal-after-exit branch, which makes the following changes:

All exit() calls are replaced by a proxy function.
The proxy sets an is-exiting flag.
The signal handler checks for that flag and becomes a no-op if it's set.

It also reinstates the TLS atexit() cleanup, with the caveat that it will be a no-op if called during a signal handler.

(106) By sodface on 2024-01-26 03:44:16 in reply to 105 [link] [source]

Back up and running on [092234da8524] [2024-01-26T03:17:22.591] I'll keep you posted. Thanks again!

(107) By sodface on 2024-01-26 03:47:40 in reply to 105 [link] [source]

If i'm reading this properly:

This latest backtrace actually seems to line up pretty well with Markus' reading of one back in October on the musl mailing list thread.

The process is actually finished and called exit(). exit() has in turn called __libc_exit_fini(), which has taken the init_fini_lock and is now calling the destructors. Apparently libssl has a destructor, and while it is running the signal hits. And then the signal handler also calls exit() (which is invalid, for exit() is not signal-safe), and tries to call __libc_exit_fini() again, and that deadlocks on init_fini_lock.

(110) By sodface on 2024-02-08 02:48:42 in reply to 105 [link] [source]

Just checking in to say it's been 12 days and all is good, no hangs.

I was reading the musl page that documents differences with glibc just now for unrelated reasons and noticed this which has been more or less already covered here in the comments but I thought I would post the reference anyway:

Calling exit more than once invokes undefined behavior. musl does not support recursive calls to exit (via atexit functions/destructors) or simultaneous calls to exit (from multiple threads). The error is not diagnosed and may give unexpected results, or may happen to do what the programmer intended. musl used to deadlock when exit was called recursively and gracefully handled exit from multiple threads, but the code that caused this behavior was removed in 1.1.2. glibc does not deadlock under recursive exit, but it’s unclear whether such usage just “happens to work” or whether it’s supported.

(111) By Stephan Beal (stephan) on 2024-02-08 08:43:41 in reply to 110 [link] [source]

Just checking in to say it's been 12 days and all is good, no hangs.

Great, thank you for the feedback and the doc link.

These changes are now on the trunk.

(112) By sodface on 2024-02-17 02:13:22 in reply to 111 [link] [source]

Well, shoot. Took 20 days but I currently have two hung processes. They aren't sequential though, they are about 11 hours apart. Backtraces for each are a little different but it still looks like the exit from the signal handler we've been seeing.

(gdb) bt
#0  __syscall_cp_c (nr=202, u=140232719063108, v=128, w=-2147483632, x=0, y=0, z=0) at ./arch/x86_64/syscall_arch.h:61
#1  0x00007f8a79636a2e in __futex4_cp (to=<optimized out>, val=-2147483632, op=128, addr=0x7f8a7967d844 <init_fini_lock+4>) at src/thread/__timedwait.c:24
#2  __timedwait_cp (addr=addr@entry=0x7f8a7967d844 <init_fini_lock+4>, val=val@entry=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128) at src/thread/__timedwait.c:52
#3  0x00007f8a79636afb in __timedwait (addr=addr@entry=0x7f8a7967d844 <init_fini_lock+4>, val=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128)
    at src/thread/__timedwait.c:68
#4  0x00007f8a796391c4 in __pthread_mutex_timedlock (m=0x7f8a7967d840 <init_fini_lock>, at=at@entry=0x0) at src/thread/pthread_mutex_timedlock.c:85
#5  0x00007f8a79639000 in __pthread_mutex_lock (m=m@entry=0x7f8a7967d840 <init_fini_lock>) at src/thread/pthread_mutex_lock.c:9
#6  0x00007f8a79641452 in __libc_exit_fini () at ldso/dynlink.c:1477
#7  0x00007f8a795f0094 in exit (code=0) at src/exit/exit.c:30
#8  0x000055b2efd1ed20 in MakeLogEntry ()
#9  0x000055b2efd1f891 in Timeout ()
#10 <signal handler called>
#11 0x00007f8a7904a070 in __do_global_dtors_aux () from /lib/libcrypto.so.3
#12 0x00007f8a79641557 in __libc_exit_fini () at ldso/dynlink.c:1488
#13 0x00007f8a795f0094 in exit (code=0) at src/exit/exit.c:30
#14 0x000055b2efd1ed20 in MakeLogEntry ()
#15 0x000055b2efd20af6 in ProcessOneRequest ()
#16 0x000055b2efd1dbb9 in main ()

and

(gdb) bt
#0  __syscall_cp_c (nr=202, u=140232719063108, v=128, w=-2147483632, x=0, y=0, z=0) at ./arch/x86_64/syscall_arch.h:61
#1  0x00007f8a79636a2e in __futex4_cp (to=<optimized out>, val=-2147483632, op=128, addr=0x7f8a7967d844 <init_fini_lock+4>) at src/thread/__timedwait.c:24
#2  __timedwait_cp (addr=addr@entry=0x7f8a7967d844 <init_fini_lock+4>, val=val@entry=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128) at src/thread/__timedwait.c:52
#3  0x00007f8a79636afb in __timedwait (addr=addr@entry=0x7f8a7967d844 <init_fini_lock+4>, val=-2147483632, clk=clk@entry=0, at=at@entry=0x0, priv=priv@entry=128)
    at src/thread/__timedwait.c:68
#4  0x00007f8a796391c4 in __pthread_mutex_timedlock (m=0x7f8a7967d840 <init_fini_lock>, at=at@entry=0x0) at src/thread/pthread_mutex_timedlock.c:85
#5  0x00007f8a79639000 in __pthread_mutex_lock (m=m@entry=0x7f8a7967d840 <init_fini_lock>) at src/thread/pthread_mutex_lock.c:9
#6  0x00007f8a79641452 in __libc_exit_fini () at ldso/dynlink.c:1477
#7  0x00007f8a795f0094 in exit (code=0) at src/exit/exit.c:30
#8  0x000055b2efd1e204 in althttpd_exit ()
#9  0x000055b2efd1f89e in Timeout ()
#10 <signal handler called>
#11 0x00007f8a7962ad81 in __syscall3 (a3=2, a2=140722443814416, a1=1, n=20) at ./arch/x86_64/syscall_arch.h:29
#12 __stdio_write (f=0x7f8a7967b280 <__stdout_FILE>, buf=<optimized out>, len=0) at src/stdio/__stdio_write.c:15
#13 0x00007f8a7962ab7e in close_file (f=0x7f8a7967b280 <__stdout_FILE>) at src/stdio/__stdio_exit.c:12
#14 0x00007f8a7962abdb in __stdio_exit () at src/stdio/__stdio_exit.c:21
#15 0x00007f8a795f009b in exit (code=0) at src/exit/exit.c:31
#16 0x000055b2efd1ed20 in MakeLogEntry ()
#17 0x000055b2efd20af6 in ProcessOneRequest ()
#18 0x000055b2efd1dbb9 in main ()

(113) By Stephan Beal (stephan) on 2024-02-17 03:25:23 in reply to 112 [link] [source]

Well, shoot. Took 20 days but I currently have two hung processes.

When replacing exit() recently i missed two spots, one of which looks like it corresponds with your stack traces. My suspicion, based on poking around but not experimentation, is that the two responses which triggered the hangs just happened to be error responses (HTTP result code 400 or higher). That might explain why it took so long for the problem to reproduce.

Please try the latest trunk and let us know if it explodes.

Thanks again for your persistence in this!

(114) By sodface on 2024-02-17 13:16:51 in reply to 113 [link] [source]

HTTP result code 400 or higher

I didn't post the log entries because oddly my little log search script only found one of the two (the second one).

Looking at it again today, the two offending pids are 22271 and 22916. The backtraces above are posted in that order.

grep 22271 althttpd.log returns nothing. Opening the log in vi and searching for 22271 finds it though:

2024-02-16 03:24:32,185.100.87.136,"http://.p#..*5#].R:../^Q...Z..^Yf.........%d1^W.%n.","",400,52,180,276,0,0,0,83,1,"","",48,200,22271
2024-02-16 03:24:32,185.100.87.136,"http://.p#..*5#].R:../^Q...Z..^Yf.........%d1^W.%n.","",914,0,0,-276,0,0,0,244,1,"","",48,802,22271

So maybe grep is choking on one of the characters in the line?? A grep for 22270 and 22272 both match the line immediately before and after, respectively. Kinda weird.

Second one:

$ grep 22916 althttpd.log
2024-02-16 14:13:23,188.166.87.67,"http://","",400,3,180,0,618,0,0,103,1,"","",7,200,22916

So yes, both error codes.

Please try the latest trunk

Will do, thanks for the quick turnaround!

(115) By sodface on 2024-02-17 14:16:49 in reply to 114 [link] [source]

Ok, back up and running on latest trunk.

A bit off-topic, but it looks like the grep issue is with busybox. On my laptop (also Alpine) busybox grep returns nothing but a full version of grep returns:

~/Downloads $ busybox grep 22271 althttpd.log.20240217

~/Downloads $ grep 22271 althttpd.log.20240217
grep: althttpd.log.20240217: binary file matches

~/Downloads $ grep --text 22271 althttpd.log.20240217
2024-02-16 03:24:32,185.100.87.136,"http://p#5#]:Ȏ/͉Zfd1%n,"",400,52,180,276,0,0,0,83,1,"","",48,200,22271
2024-02-16 03:24:32,185.100.87.136,"http://p#5#]:Ȏ/͉Zfd1%n,"",914,0,0,-276,0,0,0,244,1,"","",48,802,22271

The busybox version of awk and then gawk behave similarly:

~/Downloads $ busybox awk '/22271/' althttpd.log.20240217

~/Downloads $ gawk '/22271/' althttpd.log.20240217
2024-02-16 03:24:32,185.100.87.136,"http://p#5#]:Ȏ/͉Zfd1%n,"",400,52,180,276,0,0,0,83,1,"","",48,200,22271
2024-02-16 03:24:32,185.100.87.136,"http://p#5#]:Ȏ/͉Zfd1%n,"",914,0,0,-276,0,0,0,244,1,"","",48,802,22271

(116) By spindrift on 2024-02-17 20:17:47 in reply to 115 [link] [source]

Yeah. Is it that "binary file" bit that's the giveaway?

(117) By sodface on 2024-02-17 22:35:51 in reply to 116 [link] [source]

Indeed, there are characters in the line that are outside the set of printable ascii characters.

Busybox's grep and awk implementations seem to bail to the next line when encountering the first non-printable* character. Grepping for a pattern in the line before the first non-printable matches, anything after doesn't. Which is misleading and a bit of a head scratcher considering it will print the entire line (minus the non-printables) so it is obviously capable of skipping over non-printables and not just truncating the output.

In this case, the pid 22271 is at the end of the line and has non-printable characters preceding it, so busybox grep returns nothing for:

$ busybox grep 22271 log-sample.txt

But grep for something before the first non-printable, like 2024:

$ busybox grep 2024 log-sample.txt 
2024-02-16 03:24:32,185.100.87.136,"http://p#5#]:Ȏ/͉Zfd1%n,"",400,52,180,276,0,0,0,83,1,"","",48,200,22271
2024-02-16 03:24:32,185.100.87.136,"http://p#5#]:Ȏ/͉Zfd1%n,"",914,0,0,-276,0,0,0,244,1,"","",48,802,22271

And you get all the printables in the line, including the 22271 that won't match in the first example.

Here's a link to a file with just the two offending log lines if anyone wants to play with it.

I wonder if althttpd should translate or remove non-printable characters before writing the log entry?

I'm not sure "non-printable" is the best term to use here, but it's the best I could come up with.

(118) By Stephan Beal (stephan) on 2024-02-17 23:05:44 in reply to 117 [link] [source]

Indeed, there are characters in the line that are outside the set of printable ascii characters.

Note that the URL does not have a closing quote in the log. Looking at your example log file, it turns out that that's just a display problem - a backspace or delete character in the URL is confusing the console.

I wonder if althttpd should translate or remove non-printable characters before writing the log entry?

Fair question. On the one hand, they're just going to confuse software which reads the logs. On the other, they're useful for seeing attempted attacks.

i'll let you and Richard decide how to proceed on that, then i'll implement any changes y'all decide on.

And just in case you don't already know this: cerbot is a trivially easy way to get a free SSL certificate. It takes just moments to set up and the docs describe how to use the provided certificates with althttpd. If you prefer to stick with http, rather than https, that's fine, too.

(119) By sodface on 2024-02-18 00:04:46 in reply to 118 [link] [source]

a backspace or delete character in the URL is confusing the console.

Yes, I looked at the log in neovim and it highlights the characters in blue and displays them like <8c> or ^Q.

get a free SSL certificate

Thanks, I need to do it. I haven't thought through how to do the redirects so existing http links don't break. I think we talked about it in another thread. Run two instances - one on port 80 and the other on 443? I'm sure there's an easy way to do it but I just haven't bothered to think about it yet.

(120) By sodface on 2024-02-18 01:05:02 in reply to 119 [link] [source]

I think we talked about it in another thread.

This thread

Which referenced the documentation for redirecting using the -auth file. The (obvious) requirement not explicitly mentioned there is something needs to listen on both 80 and 443.

In the example of sqlite.org, I believe that's xinetd and both config files documented in the xinetd page are present on the server.

In my case, running in standalone mode, I believe I would just need to start two instances of althttpd one on 80 and one on 443 and either serve from both or redirect 80->443 using the -auth file example.

(121) By spindrift on 2024-02-18 08:27:57 in reply to 120 [link] [source]

start two instances of althttpd one on 80 and one on 443 and either serve from both or redirect 80->443 using the -auth file example.

Yes, this is exactly how my site runs.

(123) By sodface on 2024-02-18 12:50:54 in reply to 121 [link] [source]

Thanks for the confirmation.

(122.1) By spindrift on 2024-02-18 08:44:08 edited from 122.0 in reply to 118 [link] [source]

I would prefer for the log to represent reality.

If the log file requires pre processing in order to work around limitations in other tools, I don't see that as a problem per se.

As we've seen, gnu grep works fine. It's just busybox grep's arguably buggy limitations on unusual characters that have caused the issue here.

Unless this is seen as an actual security risk (binary file infiltration via logging specially crafted requests, for example) I'd vote to leave the literal content.

I suppose the canonical tool for log analysis is sqlite via csv import (as this is how Richard describes analysing it in the code).

I haven't checked, but I wonder how this log file would be handled using this workflow? If it also causes problems importing or searching via the sqlite route, I'd be more enthused about it needing fixing.

(124) By sodface on 2024-02-18 13:01:13 in reply to 122.1 [link] [source]

I agree with your points here, though I don't have a strong or informed opinion either way. On other topics, Stephan has used "how does apache do it?" as a yardstick for expected behavior. Maybe not the best reference, but it was still open in my tab from yesterday so I'll post it:

You'll find that it gets escaped in ap_escape_logitem. Have a look at the following code. It uses a macro called TEST_CHAR to determine what needs escaped, but output is basically ASCII

(126.1) By spindrift on 2024-02-22 14:32:34 edited from 126.0 in reply to 124 [link] [source]

Ok, I've got another perspective on this.

Althttpd logging claims compliance with the csv RFC 4180 (here) which includes the following definition of field payload:

TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

(125.1) By spindrift on 2024-02-20 20:05:27 edited from 125.0 in reply to 122.1 [link] [source]

Just for completeness, my (ubuntu) grep only finds a binary file match too, but sqlite (unsurprisingly) doesn't have any issue with @sodface's logfile.

The forum layout rather suffers though, predictably!


sqlite> CREATE TABLE log(
(x1...>   date TEXT,             /* Timestamp */
(x1...>   ip TEXT,               /* Source IP address */
(x1...>   url TEXT,              /* Request URI */
(x1...>   ref TEXT,              /* Referer */
(x1...>   code INT,              /* Result code.  ex: 200, 404 */
(x1...>   nIn INT,               /* Bytes in request */
(x1...>   nOut INT,              /* Bytes in reply */
(x1...>   t1 INT, t2 INT,        /* Process time (user, system) milliseconds */
(x1...>   t3 INT, t4 INT,        /* CGI script time (user, system) milliseconds */
(x1...>   t5 INT,                /* Wall-clock time, milliseconds */
(x1...>   nreq INT,              /* Sequence number of this request */
(x1...>   agent TEXT,            /* User agent */
(x1...>   user TEXT,             /* Remote user */
(x1...>   n INT,                 /* Bytes of url that are in SCRIPT_NAME */
(x1...>   lineno INT,            /* Source code line that generated log entry */
(x1...>   pid INT                /* PID of process */
(x1...> );
sqlite> .mode csv
sqlite> .import log-sample.log log
sqlite> .mode qbox
sqlite> select * from log where pid = 22271;

┌───────────────────────┬──────────────────┬──────────────────────┬─────┬──────┬─────┬──────┬──────┬────┬────┬────┬─────┬──────┬───────┬──────┬────┬────────┬───────┐
│         date          │        ip        │         url          │ ref │ code │ nIn │ nOut │  t1  │ t2 │ t3 │ t4 │ t5  │ nreq │ agent │ user │ n  │ lineno │  pid  │
├───────────────────────┼──────────────────┼──────────────────────┼─────┼──────┼─────┼──────┼──────┼────┼────┼────┼─────┼──────┼───────┼──────┼────┼────────┼───────┤
│ '2024-02-16 03:24:32' │ '185.100.87.136' │ 'http://p#*5#]R:Ȏ/ │ ''  │ 400  │ 52  │ 180  │ 276  │ 0  │ 0  │ 0  │ 83  │ 1    │ ''    │ ''   │ 48 │ 200    │ 22271 │
│                       │                  │ ͉Z                  │     │      │     │      │      │    │    │    │     │      │       │      │    │        │       │
│                       │                  │ fγ%d1             │     │      │     │      │      │    │    │    │     │      │       │      │    │        │       │
│                       │                  │ %n'                 │     │      │     │      │      │    │    │    │     │      │       │      │    │        │       │
├───────────────────────┼──────────────────┼──────────────────────┼─────┼──────┼─────┼──────┼──────┼────┼────┼────┼─────┼──────┼───────┼──────┼────┼────────┼───────┤
│ '2024-02-16 03:24:32' │ '185.100.87.136' │ 'http://p#*5#]R:Ȏ/ │ ''  │ 914  │ 0   │ 0    │ -276 │ 0  │ 0  │ 0  │ 244 │ 1    │ ''    │ ''   │ 48 │ 802    │ 22271 │
│                       │                  │ ͉Z                  │     │      │     │      │      │    │    │    │     │      │       │      │    │        │       │
│                       │                  │ fγ%d1             │     │      │     │      │      │    │    │    │     │      │       │      │    │        │       │
│                       │                  │ %n'                 │     │      │     │      │      │    │    │    │     │      │       │      │    │        │       │
└───────────────────────┴──────────────────┴──────────────────────┴─────┴──────┴─────┴──────┴──────┴────┴────┴────┴─────┴──────┴───────┴──────┴────┴────────┴───────┘

(108) By sean (naes_guy) on 2024-01-27 21:34:19 in reply to 104 [link] [source]

Just curious, does this issue happen only on one operating system you use, or have you only tested this on the one OS?

(109) By Stephan Beal (stephan) on 2024-01-27 22:08:48 in reply to 108 [link] [source]

does this issue happen only on one operating system you use

So far it's only ever been reported in Alpine, but it could hypothetically show up on any OS which uses a libc other than GNU libc, such as musl libc (which Alpine uses). It's never come up before because althttpd is primarily run on Linux systems which use glibc, which is apparently less fussy about this level of signal safety. musl libc has indirectly exposed several problems in althttpd's usage of signal-unsafe functions within the context of a signal handler.

That is to say: the bug is not the fault of Alpine/musl libc, but in althttpd's historically lax use of APIs which, when one digs far enough into their docs, are not intended to be used the ways it uses them.

(127) By sodface on 2024-04-18 19:36:38 in reply to 1.1 [link] [source]

60+ days of uptime and no hanging processes!

24523 61d05   /usr/bin/althttpd

(128) By Stephan Beal (stephan) on 2024-04-18 21:33:17 in reply to 127 [link] [source]

60+ days of uptime and no hanging processes!

Great! It's not bug-free, though: Richard recently (a week or three ago) fixed a bug, introduced in those changes, in which one or more of the timing-related log fields was/were being-miscalculated. If you don't use those log fields, you're unaffected.