Q3:JA Server with Segment Fault 139

Jawfin
New to forums
Posts: 7
https://www.youtube.com/channel/UC40BgXanDqOYoVCYFDSTfHA
Joined: Mon Sep 02, 2013 8:29 pm

Post by Jawfin »

Hey :)

This server is one core, 1GB running the latest Ubuntu.

I've searched the forums but haven't found anyone experiencing this problem. I am using my virtual server to host the game Jedi Knight: Jedi Academy, which is based on the Quake 3 engine. The issue is that after about 12-14 hours of running, the application crashes with exit code 139. That means it's a segmentation fault: the process was killed by signal 11 (SIGSEGV), and the shell reports that as 128 + 11, so it was closed by the OS rather than exiting from within the application.
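The 128 + signal-number convention mentioned above can be checked directly in the shell. This is a minimal sketch; the function name decode_exit_status is made up for illustration and is not part of any script in this thread.

```shell
# Decode a shell exit status. By convention, a status above 128 means the
# process was terminated by signal number (status - 128).
# decode_exit_status is a hypothetical helper name.
decode_exit_status() (
    code=$1
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        # "kill -l N" prints the name of signal N (SEGV for 11)
        name=$(kill -l "$sig" 2>/dev/null) || name=unknown
        echo "killed by signal $sig ($name)"
    else
        echo "exited on its own with code $code"
    fi
)

decode_exit_status 139
```

A status of 139 therefore only says that the process received SIGSEGV; it does not by itself say whether the bad memory access came from the game code, a library, or something else.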

It's not much to go on, but here's the error I see: -
/home/knights/jka/GameData/knights.sh: line 62: 4123 Segmentation fault (core dumped) ./openjkded.i386 +set com_hunkmegs 256 +set logfile "3" +set fs_basepath ~/jka/GameData +set fs_homepath ~/jka/GameData +set sv_pure "1" +set fs_game japlus +exec krservconfig.cfg +set net_port "29070"

knights.sh = my script
./openjkded.i386 = the executable
and all that stuff after that are the parameters.

And my script responds with: -
JKA server crashed with exit code 139. Mon Sep 2 20:38:48 CDT 2013 respawning...
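For readers who want similar behaviour, a respawn wrapper along these lines might look like the sketch below. It is not the actual knights.sh (which is not shown in this thread); the function name respawn, the restart limit, and the RESPAWN_DELAY variable are all assumptions made for illustration.

```shell
# Hypothetical respawn wrapper, NOT the poster's knights.sh.
# Runs a command, logs every non-zero exit with a timestamp, and gives up
# after max_crashes so a crash loop cannot spin forever.
respawn() (
    max_crashes=$1
    shift
    n=0
    while [ "$n" -lt "$max_crashes" ]; do
        "$@"
        code=$?
        # A clean exit (code 0) ends the loop without respawning
        [ "$code" -eq 0 ] && break
        n=$((n + 1))
        echo "server crashed with exit code $code at $(date) (crash $n of $max_crashes)"
        # RESPAWN_DELAY is an assumed tunable, defaulting to 5 seconds
        sleep "${RESPAWN_DELAY:-5}"
    done
    echo "stopped after $n crash(es)"
)

# Usage (assumed): respawn 20 ./openjkded.i386 +set net_port "29070" >> crashes.log
```

Appending the output to a file gives a timestamped crash history, which helps when checking whether crashes cluster around a particular time of day.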


This is the thing though: it takes something like 14 hours to crash the first time, but if I fire the application up again it will crash in under 10 minutes, and will repeatedly do so until I sudo reboot. This shows it cannot be the application, as a reload should give it another 14 hours. What utilities can I use to diagnose what's happening? I've just started running top and vmstat each hour for the last few hours, but it hasn't crashed yet for me to judge what I am looking for. Would running out of RAM somehow do this? (Maybe from a memory leak in the application.) Also, it would appear the swap file is not active (in that the free utility always returns 0s in the swap values).
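As an alternative to reading top by hand each hour, the resident memory of the server process can be logged from /proc. The sketch below assumes a Linux system where /proc/PID/status has a VmRSS line; the helper names rss_kb_from_status and log_rss are invented for this example.

```shell
# Print the resident set size (in kB) from a /proc/<pid>/status style file.
# Hypothetical helper name.
rss_kb_from_status() {
    awk '/^VmRSS:/ { print $2 }' "$1"
}

# Print one timestamped sample for a given PID (hypothetical helper name).
log_rss() (
    pid=$1
    echo "$(date '+%Y-%m-%d %H:%M:%S') pid=$pid rss_kb=$(rss_kb_from_status "/proc/$pid/status")"
)

# Usage (assumed), e.g. hourly from cron, where 4123 is the server's PID:
#   log_rss 4123 >> rss.log
```

If rss_kb keeps climbing until the crash, a leak in the process is likely; if it stays flat while "free" memory shrinks, the memory is going somewhere else.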

Even if this app is using/leaking memory, surely it crashing would free that? Thus I could run it again without needing to reboot the whole box.

Thanks for any help, and let me know what further info I can provide!
Cheers
Jonathan
Edge100x
Founder
Posts: 13156
Joined: Thu Apr 18, 2002 11:04 pm
Location: Seattle

Re: Q3:JA Server with Segment Fault 139

Post by Edge100x »

It's possible that running out of RAM could cause problems, yes. You also might be seeing a kernel bug of some sort. Or, the game server has a bug related to server uptime or is storing state somewhere.

If you run the VDS for 14 hours and then start the server, do you have the same problem?

Are you running the latest versions of the kernel and utilities? What OS are you using?

Post by Jawfin »

I still haven't had a crash yet, but I haven't waited long enough...! I'm replying now though to say what I've found out so far.

To answer your questions: I'm running Ubuntu 13.04 with all the latest updates installed. Rebooting the actual box (sudo reboot) does give me back the 14 hours of server up-time. I'm not running any utilities or background applications, only the game server.


From running top every hour I can state my application is slowly using up memory. It starts with 200MB, and eats 30MB to 40MB per hour. The thing is, the app is releasing this memory, but the OS isn't claiming it back - so without proof (a crash) I strongly suspect I am running out of RAM.

I know the app is handling the memory correctly, as echo 3 > /proc/sys/vm/drop_caches frees it back up, which it couldn't do if my app wasn't letting it go. So I suspect I won't actually have a problem now, as I've set that command (along with sync) to run hourly. The app now runs between 150MB and 200MB depending on the load, and releases that 30MB with the cache drop.

I'll post again with my complete solution in case others need it (or just want to be tidier with their RAM) in a couple of days if this shows to be the solution.

Thanks for getting back to me so quickly :)

Post by Edge100x »

If you're seeing decreased free memory and /proc/sys/vm/drop_caches raises that number again, then you're just looking at the disk cache :). It's actually much better to have that disk cache in use, since it speeds up later reads. Disk cache is automatically recovered by the OS when it's needed for applications again.
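To see how much of the "missing" memory is just disk cache, the MemFree, Buffers and Cached fields of /proc/meminfo can be added up. This is a rough sketch: the function name mem_summary is made up, and treating all of Buffers + Cached as reclaimable is an approximation (some of it, such as tmpfs data, cannot be dropped).

```shell
# Summarise meminfo-format text from stdin.
# Hypothetical helper; Buffers + Cached is only an approximation of the
# memory the kernel can hand back to applications on demand.
mem_summary() {
    awk '
        /^MemFree:/ { free = $2 }
        /^Buffers:/ { buffers = $2 }
        /^Cached:/  { cached = $2 }
        END {
            printf "free=%d kB cache=%d kB free_plus_cache=%d kB\n",
                   free, buffers + cached, free + buffers + cached
        }'
}

# Usage: mem_summary < /proc/meminfo
```

If free_plus_cache stays roughly constant over the hours while free alone shrinks, the system is filling its disk cache rather than running out of RAM.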

Post by Jawfin »

Drat, then I haven't fixed anything.

Do you know what tools I could run to work this out? Remember that although the application completely terminates (I hope) after a crash from running 14 hours, subsequent executions can crash within a minute... until I reboot the box. I am a programmer and know computers pretty well, but not Linux, so I am stuck here!

Post by Edge100x »

Does "dmesg" show anything? Kernel output generally goes there, so if the kernel is shutting down the process for an exotic memory-related reason (such as an OOM condition), you should see it in dmesg.
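A quick way to pull the relevant lines out of a long dmesg dump is a case-insensitive filter like the one below. It is only a sketch: the function name kernel_crash_lines is invented, and the patterns are the strings the kernel commonly uses for segfault and OOM-killer messages, so other wording could be missed.

```shell
# Keep only kernel log lines that mention a segfault or the OOM killer.
# Hypothetical helper; reads log text on stdin.
kernel_crash_lines() {
    grep -Ei 'segfault|out of memory|oom-killer'
}

# Usage: dmesg | kernel_crash_lines
```

An "Out of memory" line naming the server process would point to the OOM killer, while a "segfault at ..." line points to a bad memory access inside the process.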

You could try attaching an "strace" process to the server to watch what it does before it crashes, or you could use "tcpdump" to see if there's any unusual traffic coming in. Have you tested with the network turned off? Have you performed the test I mentioned before, in terms of leaving the VDS on for 14 hours and then running the server for the first time, to see if it's uptime-related?
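The strace and tcpdump suggestions could be wrapped as follows. This is a sketch under assumptions: the function names and output file names are made up, both tools usually need root, and the port is the net_port value (29070) from the command line earlier in the thread. Quake 3 based servers talk UDP, so the capture filter is limited to UDP on that port.

```shell
# Attach strace to a running process and log its system calls with
# timestamps. Hypothetical helper; usually needs root.
attach_strace() (
    pid=$1
    out=$2
    # kill -0 only checks that the process exists and can be signalled
    kill -0 "$pid" 2>/dev/null || { echo "cannot signal process $pid" >&2; exit 1; }
    # -f follows child processes/threads, -tt adds microsecond timestamps
    strace -f -tt -o "$out" -p "$pid"
)

# Capture the game port's UDP traffic to a pcap file for later inspection.
# Hypothetical helper; usually needs root.
capture_game_traffic() (
    port=$1
    out=$2
    case $port in
        ''|*[!0-9]*) echo "port must be numeric" >&2; exit 1 ;;
    esac
    # -n disables name lookups, -i any listens on all interfaces
    tcpdump -n -i any -w "$out" udp port "$port"
)

# Usage (assumed):
#   sudo sh -c '. ./helpers.sh; attach_strace 4123 strace.txt'
#   sudo sh -c '. ./helpers.sh; capture_game_traffic 29070 jka.pcap'
```

Both commands run until interrupted or until the traced process dies, so the last lines of the strace log show what the server was doing just before the crash.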

Post by Jawfin »

Sorry, I misread your first 14 hour post. As this is a server I don't want to have it down if I can help it, but I understand where you are coming from - so that will be the next test.

For now though the crash still happens. After 5 crashes I rebooted the server. It's not from an attack, as I ran tcpdump previously and saw nothing suspicious. Here is a snapshot of the dmesg output: -
http://www.jawfin.net/jka/dmesg.txt
And I ran strace against the process until it crashed: -
http://www.jawfin.net/jka/strace.txt

I'm not expecting you to do my work for me, I just don't know what I am looking at here.

This is an open source application, so if the code is causing this somehow I may be able to fix it. But, as it completely terminates, I don't think it's the app. I'll leave it running for now in case there's a clue in those files, and if I'm still stuck I'll reset the server, wait 14 hours, run the app and see how long it lasts! Depending on the result of that I'll try with the network off.

Post by Edge100x »

There's nothing very useful in those, unfortunately, except to indicate that it's an application error.

You might try attaching a gdb process and getting a backtrace. If it's open-source and you have the symbols and code, that should show the line where the bug is occurring, as long as you plug it all in. (I don't have a lot of experience with doing this.)
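Since the crash message earlier in the thread says "(core dumped)", one way to get a backtrace is to open the core file in gdb after the crash rather than attaching beforehand. The sketch below makes several assumptions: the function name is invented, the core file location depends on the system (on Ubuntu, crash dumps may be handed to apport instead of being written to the working directory), and the output is only readable if the binary was built with debug symbols.

```shell
# Print a full backtrace from a core file using gdb in batch mode.
# Hypothetical helper; exe is the server binary, core the dump it produced.
backtrace_from_core() (
    exe=$1
    core=$2
    if [ ! -r "$core" ]; then
        echo "no readable core file at $core" >&2
        exit 1
    fi
    # -batch exits after running the commands; "bt full" also prints locals
    gdb -batch -ex 'bt full' "$exe" "$core"
)

# Usage (assumed), after allowing core files in the launching shell with
# "ulimit -c unlimited":
#   backtrace_from_core ./openjkded.i386 ./core
```

The top few frames of the backtrace name the function, and with symbols the source line, where the SIGSEGV was raised.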

Post by Jawfin »

I think that gives me enough to go on then. I can compile with debug on, which may help. Also, it could be the time of day rather than elapsed time: if it crashes when the clock hits, say, 10pm, that will always be 14-16 hours from boot. This time it also crashed pretty quickly *after* a reboot. So the gettimeofday() call [the last entry in strace] may be relevant, but as it's called so often it'll probably appear in every analysis/dump. It may be a signed/unsigned int issue, or a range check error, as it's doing math from the start of the epoch, 1970, in milliseconds - but I just can't see that raising a SegFault, which is usually from uninitialized pointers being used.
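The arithmetic behind the signed/unsigned suspicion can be worked through in the shell. This is only an illustration of 32-bit wrap-around, not a claim about what the game code actually does; the function name wrap_int32 is made up.

```shell
# Reinterpret an integer as a signed 32-bit value (two's complement wrap).
# Hypothetical helper for illustration only.
wrap_int32() (
    v=$(( $1 & 0xFFFFFFFF ))
    if [ "$v" -ge 2147483648 ]; then
        v=$(( v - 4294967296 ))
    fi
    echo "$v"
)

# A signed 32-bit millisecond counter tops out at 2147483647 ms.
# Integer division by 86400000 ms per day gives whole days before it wraps.
echo "whole days before a 32-bit ms counter wraps: $(( 2147483647 / 86400000 ))"

wrap_int32 2147483647   # largest value that still fits
wrap_int32 2147483648   # one millisecond later the sign flips
```

So a millisecond count measured from 1970 cannot fit in a 32-bit int at all, and one measured from some later base wraps after roughly 24 days. A wrapped, negative time value would more likely cause odd timing behaviour than a segfault on its own, unless it ends up being used as an array index or offset.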

I may actually wait it out too, log all the crashes and see if/when they stop happening - more strongly indicating a time of day issue. This I may also do with the network out, so I don't get complaints from the users! (Or just firewall the game port if I need the network up for testing/isolating the cause.)