Multiple HLDS running on VDS receiving random time outs

tenub · Post by **tenub** » Fri Apr 27, 2018 9:20 am

I currently run 4 HLDS game servers on my VDS. There has been an ongoing issue for the longest time, and my research has turned up nothing. I did find a thread someone posted here that seems to describe the same issue I'm experiencing, but there were no replies.

Users seem to randomly get disconnected from my CS 1.6 game servers with the message:

Disconnected
You have been disconnected from the server.
Reason: Timed out

This does not seem to be a server-side networking issue. People seem to drop in groups (not all people will receive a time out). I have not been able to verify whether these people who do get disconnected are across game servers or the issue only affects one game server at a time.

Recently this has become more and more frequent and is to the point where I am considering changing to a new VDS since my users are becoming restless and unhappy. I was wondering if there might be any information anyone could offer regarding a potential cause or solution.

Pedro-NF · Post by **Pedro-NF** » Sun Apr 29, 2018 12:07 am

I was having the exact same problem with a Rust server I added to my 3-core VDS in January, which until then only ran a (mostly empty) Wolfenstein: Enemy Territory server that has an extremely low resource usage. Players would be disconnected in groups (usually all players on the server), at seemingly random times, no matter how many people were playing (even just 1 or 2), and independently of location or internet connection latency, speed or quality. I tried EVERYTHING I could think of, and everything NFO support suggested (thanks for the patience, guys), but nothing solved the problem.

Like in your case, there was nothing unusual installed on the VDS, which runs Windows Server 2008 R2. I tried adding more RAM (even though there was more than enough left), adding SSD space and placing the page file on it, running it without a page file, fiddling with process priorities and affinities - everything. The most taxed core never went over 60% load. Support even tried moving the VDS to another physical machine, which didn't fix the problem either.

Then I upgraded the VDS to 4 cores, and without changing ANYTHING on it software wise, the timeouts stopped COMPLETELY. This led me to the conclusion that the problem had to be related to the way actual physical cores and "hyperthreaded cores" are assigned to VDS's by Xen and/or to the way NFO has Xen configured. In my case, with only one "heavier" server like Rust running, a 4-core VDS was enough to guarantee that whatever wasn't working well regarding core allocation to the VDS now works perfectly.

Post by **Edge100x** » Sun Apr 29, 2018 10:17 pm

We haven't seen problems in our testing with any specific size of VDS, but it's certainly possible that Rust would use the capacity of three threads occasionally, causing problems for other services.

tenub, what sort of CPU usage are you seeing? What size of VDS do you have? Are you seeing packet loss in MTR tests to the server (which could indicate a deeper problem with networking)?

tenub · Post by **tenub** » Mon Apr 30, 2018 7:30 pm

Edge100x wrote: ↑Sun Apr 29, 2018 10:17 pm We haven't seen problems in our testing with any specific size of VDS, but it's certainly possible that Rust would use the capacity of three threads occasionally, causing problems for other services.

tenub, what sort of CPU usage are you seeing? What size of VDS do you have? Are you seeing packet loss in MTR tests to the server (which could indicate a deeper problem with networking)?

CPU usage averages around 10% with a handful of active players. Each active player seems to add about 0.5-1% usage. It's the standard 4 core setup with Windows Server 2008 R2. No packet loss to report.

The time outs don't seem to be correlated to server load as I've had reports of it happening with just 1-2 players in a server.

I use HLSM to manage the HLDS instances. I don't know if this is useful information, though.

Post by **Edge100x** » Mon Apr 30, 2018 10:12 pm

Does the VDS itself become unresponsive when they happen, or does it respond slowly to some activities, such as disk access? Do all of the servers time out at the same time?

tenub · Post by **tenub** » Tue May 01, 2018 4:59 am

Edge100x wrote: ↑Mon Apr 30, 2018 10:12 pm Does the VDS itself become unresponsive when they happen, or does it respond slowly to some activities, such as disk access? Do all of the servers time out at the same time?

The VDS remains fine. The people who don't get disconnected are able to continue playing just fine. I haven't been able to verify whether the issue only happens on one HLDS instance at a time or across HLDS instances.
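To settle whether the drops hit one HLDS instance at a time or several at once, it may help to collect the disconnect times per server and group the ones that fall close together. The sketch below assumes you have already pulled (timestamp, server name) pairs out of each instance's logs. The function names, the server labels, and the 5-second window are hypothetical choices for illustration, not anything HLDS or HLSM provides.

```python
from datetime import datetime, timedelta


def group_disconnects(events, window_seconds=5):
    """Group (timestamp, server) events where each event follows the
    previous one in the group by at most window_seconds."""
    window = timedelta(seconds=window_seconds)
    groups = []
    for ts, server in sorted(events):
        if groups and ts - groups[-1][-1][0] <= window:
            groups[-1].append((ts, server))
        else:
            groups.append([(ts, server)])
    return groups


def servers_per_group(groups):
    """For every group with more than one event, return the set of servers involved."""
    return [{server for _, server in group} for group in groups if len(group) > 1]


if __name__ == "__main__":
    # Made-up sample data; real timestamps would come from your own logs.
    t0 = datetime(2018, 5, 1, 5, 0, 0)
    sample = [
        (t0, "hlds1"),
        (t0 + timedelta(seconds=2), "hlds2"),
        (t0 + timedelta(minutes=10), "hlds1"),
    ]
    for servers in servers_per_group(group_disconnects(sample)):
        print(sorted(servers))
```

If a group regularly contains more than one server name, the drops span instances, which would point away from any single HLDS process.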

tenub · Post by **tenub** » Tue May 01, 2018 5:18 am

Another thing to note is that HLSM eventually ends up having two processes running after a hard restart and the one that's visible on remote login can no longer find any of my HLDS instances (it shows them as "lost" status), so it constantly tries to start a new hlds process which of course won't work because one of the background hlds processes has already used the specified port. I doubt this has anything to do with the issue but I thought I'd mention it. Maybe it's due to virtualization or something else.
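A quick way to confirm that a hidden hlds process is already holding the port is to try binding it yourself. This is a minimal sketch, assuming the instance uses UDP port 27015 (the usual HLDS default; substitute whatever port your instances are actually configured with). The helper name is made up for illustration.

```python
import socket


def udp_port_in_use(port, host="0.0.0.0"):
    """Return True if binding a UDP socket to (host, port) fails.

    Any OSError counts as "in use" here, so a permissions problem
    would also be reported as True.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
    except OSError:
        return True
    else:
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    port = 27015  # assumed port; change to match your server config
    state = "already in use" if udp_port_in_use(port) else "free"
    print(f"UDP port {port} is {state}")
```

Running this while HLSM shows the instance as "lost" would tell you whether a background process still owns the port.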

Pedro-NF · Post by **Pedro-NF** » Tue May 01, 2018 10:20 am

Edge100x wrote: ↑Sun Apr 29, 2018 10:17 pm it's certainly possible that Rust would use the capacity of three threads occasionally, causing problems for other services.

No, it wasn't.

tenub wrote: ↑Tue May 01, 2018 5:18 am Another thing to note is that HLSM eventually ends up having two processes running after a hard restart and the one that's visible on remote login can no longer find any of my HLDS instances (it shows them as "lost" status), so it constantly tries to start a new hlds process which of course won't work because one of the background hlds processes has already used the specified port. I doubt this has anything to do with the issue but I thought I'd mention it. Maybe it's due to virtualization or something else.

That has always happened to any process I configure to start up with the OS that is not a service, so I gave up on trying to have servers automatically start (useful in case the VDS needs to be restarted by the host for maintenance). I also believe that's due to virtualization.

Pedro-NF · Post by **Pedro-NF** » Tue May 01, 2018 11:44 am

My take on this issue is that, with an odd number of VDS cores, at least one of them will always be running on a hyperthread "belonging" to a physical core running on another VDS. Whenever that "foreign" core needs more resources, it has priority over its corresponding hyperthread and processes running on that hyperthread might experience issues, depending on how sensitive those processes are to having their resources unavailable even for an extremely short time (like a few milliseconds). That would be undetectable in Windows using PerfMon, for example, which has a minimum polling rate of 1 second.

With an even number of cores, I believe most of the time Xen will allocate both the physical core and its hyperthread to the VDS, which should minimize those problems since the OS will always have full control of the cores' full resources. That probably happens with all VDS hosts, and certainly way less frequently with NFO, simply because you do offer the best service on the market, your staff is the most qualified, and your own personal experience and knowledge of these matters is unmatched.

My plan is to switch to one of your (amazing) dedicated server packages as soon as possible, anyway.

Post by **Edge100x** » Wed May 09, 2018 2:44 pm

Pedro-NF wrote: ↑Tue May 01, 2018 11:44 am My take on this issue is that, with an odd number of VDS cores, at least one of them will always be running on a hyperthread "belonging" to a physical core running on another VDS.

That's not how it works, because vCPUs aren't mapped permanently to physical cores. Xen migrates a server's virtual cores around on the machine to maximize performance, similar to the way the scheduler inside your OS does (unless overridden). It will attempt to put a busy virtual CPU core onto a physical CPU core alongside a less-busy virtual CPU core whenever possible. This happens in the same way regardless of the number of vCPUs assigned to the VDS.

Whenever that "foreign" core needs more resources, it has priority over its corresponding hyperthread and processes running on that hyperthread might experience issues, depending on how sensitive those processes are to having their resources unavailable even for an extremely short time (like a few milliseconds). That would be undetectable in Windows using PerfMon, for example, which has a minimum polling rate of 1 second.

It is not true that another customer's vCPU running on the same physical core, but a different hyperthreaded core, would preempt your vCPU. Both can run simultaneously (and both would also have the same priority). Delays due to resource contention within the physical CPU core will be measured in cycles and not milliseconds and manifest as apparent higher CPU usage for running processes.
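The sub-second stall theory can be tested directly instead of relying on one-second PerfMon samples. The sketch below sleeps in short steps and records how much real time passed between wakeups; a gap far longer than the requested interval means the process was not scheduled for that long. The function names, the 10 ms interval, and the 100 ms threshold are assumptions for illustration. On Windows the default timer granularity is commonly around 15 ms, so thresholds near or below that would only measure noise.

```python
import time


def sample_gaps(interval=0.01, duration=1.0):
    """Sleep repeatedly for `interval` seconds and record the real time
    elapsed between consecutive wakeups, for roughly `duration` seconds."""
    gaps = []
    last = time.perf_counter()
    end = last + duration
    while last < end:
        time.sleep(interval)
        now = time.perf_counter()
        gaps.append(now - last)
        last = now
    return gaps


def find_stalls(gaps, interval, threshold):
    """Return (index, overshoot_seconds) for each gap that exceeds
    `interval` by more than `threshold` seconds."""
    return [
        (i, gap - interval)
        for i, gap in enumerate(gaps)
        if gap - interval > threshold
    ]


if __name__ == "__main__":
    interval = 0.01
    gaps = sample_gaps(interval=interval, duration=5.0)
    for index, overshoot in find_stalls(gaps, interval, threshold=0.1):
        print(f"sample {index}: woke up {overshoot * 1000:.1f} ms late")
```

Left running alongside the game server, this would show whether anything close to a multi-second pause occurs around the time of a mass timeout.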

Pedro-NF · Post by **Pedro-NF** » Sun Jul 08, 2018 7:24 am

After not having experienced a single mass kick for timeout on my Rust server since I upgraded the VDS to 4 cores in February, we just had two mass kicks for timeout this morning - the first one around 8:21 am EST (with less than 10 players online), and the second one around 10:05 am EST. Like before, absolutely nothing wrong on the VDS, nothing in the event logs, CPU usage low, lots of available RAM, networking OK, etc. - the events seem completely random.

John, could you please check if anything was changed on the machine hosting the VDS (pid #110852), like new VDSs having been added or something like that? The server is just starting to reach a good player base and we simply can't start having this kind of problem again now.

hiimcody1 · Post by **hiimcody1** » Sun Jul 08, 2018 10:09 am

Pedro-NF wrote: ↑Sun Jul 08, 2018 7:24 am John, could you please check if anything was changed on the machine hosting the VDS (pid #110852), like new VDSs having been added or something like that? The server is just starting to reach a good player base and we simply can't start having this kind of problem again now.

For something like this, you'd want to shoot us a request from your panel so we can check the machine.

For security reasons, we can't directly investigate a specific service like that through the forums.

Pedro-NF · Post by **Pedro-NF** » Sun Jul 08, 2018 6:32 pm

hiimcody1 wrote: ↑Sun Jul 08, 2018 10:09 am
Pedro-NF wrote: ↑Sun Jul 08, 2018 7:24 am John, could you please check if anything was changed on the machine hosting the VDS (pid #110852), like new VDSs having been added or something like that? The server is just starting to reach a good player base and we simply can't start having this kind of problem again now.
For something like this, you'd want to shoot us a request from your panel so we can check the machine.

For security reasons, we can't directly investigate a specific service like that through the forums.

Yes, I did that when the mass kicks for timeout were happening in February, and support did everything they could to troubleshoot the problem, and even moved my VDS to a different physical machine (which didn't help), so I'm positive this is not related to the physical machine's hardware or to the VDS. It's not an isolated case either, as you can see from the OP's report and this other topic created today.

We haven't had new events today, so I'll update this topic in case it happens again.

Pedro-NF · Post by **Pedro-NF** » Sat Jul 21, 2018 6:58 pm

Unfortunately, the mass timeouts continue. Like I did before when I had this issue in February, right after my previous post I increased the number of "hyperthreaded cores", going from 4 to 6 this time. That stopped the timeouts for some time, but we just had one with less than 10 players on the server (21 Jul 2018, 22:13 EST), with 3 "hyperthreaded cores" barely registering any activity, and a 4th with ~20% usage. I made an experiment last week, going back to 4 "hyperthreaded cores", and the timeouts became much more frequent, so I switched to 6 again.

This issue is not related to whatever is running on my VDS, or the OP's VDS, or the other people having these issues. It is caused by a physical core running threads from different VDSs. That's simply a disaster waiting to happen. No matter how fast the hypervisor juggles those threads around, eventually that won't happen fast enough, especially when it comes to game servers. When you add more "HT cores", even when they are not needed, the probability of "thread crashing" decreases. The problem naturally starts happening more frequently as a physical machine becomes more populated and the probability of "thread crashing" increases. And it will keep happening for as long as a physical core is allowed to run threads from different VDSs.

What I fail to understand is this: since the physical machines are allegedly not overpopulated, why doesn't NFO sell VDS packages where each VDS is assigned a full core instead of "hyperthreaded cores"? That way, the 2 "HT core" package would become a 1 full core package, the 4 "HT core" package a 2 full core package, and so on. A VDS would never be sharing cores with another VDS. The only reason I can think of is that offering packages with a large number of "HT cores" (12 cores, 16 cores, etc.) might look more appealing (personally, I don't believe it does). NFO already offers a better service than other VDS host providers, and I believe that doing something like that would completely differentiate the company's services from the competition and attract many new clients who know better than to trust the "hyperthreaded core" system.

Post by **Edge100x** » Sat Jul 21, 2018 10:49 pm

Pedro-NF wrote: ↑Sat Jul 21, 2018 6:58 pm Unfortunately, the mass timeouts continue. Like I did before when I had this issue in February, right after my previous post I increased the number of "hyperthreaded cores", going from 4 to 6 this time. That stopped the timeouts for some time, but we just had one with less than 10 players on the server (21 Jul 2018, 22:13 EST), with 3 "hyperthreaded cores" barely registering any activity, and a 4th with ~20% usage. I made an experiment last week, going back to 4 "hyperthreaded cores", and the timeouts became much more frequent, so I switched to 6 again.

Certainly have me take another look at the machine just to be safe. And, on your end, take another look at the OS and game server logs to make sure that nothing obvious and unexpected is going on, such as an OOM condition or clear DoS attack.

Does this happen when there's a lot of I/O? If you're using Windows as your OS, I'll want to check to see if you're on an all-SSD host.

This issue is not related to whatever is running on my VDS, or the OP's VDS, or the other people having these issues. It is caused by a physical core running threads from different VDSs. That's simply a disaster waiting to happen. No matter how fast the hypervisor juggles those threads around, eventually that won't happen fast enough, especially when it comes to game servers. When you add more "HT cores", even when they are not needed, the probability of "thread crashing" decreases. The problem naturally starts happening more frequently as a physical machine becomes more populated and the probability of "thread crashing" increases. And it will keep happening for as long as a physical core is allowed to run threads from different VDSs.

Thankfully, no, that's not the case.

For clients to start timing out in-game, most games require several seconds without a response from the server.

Hyperthreading doesn't mean that one thread will grind to a halt for multiple seconds, or crash, while another thread runs on the other logical core of the same physical core. Instead, both threads effectively run all the time, with very small additional delays (sub-microsecond delays) added as their instructions are split into micro-operations and interleaved within the processor to take better advantage of its internal resources. These delays manifest simply as higher CPU usage to the OS.

Scheduling inside Xen also happens with microsecond resolution, at least with the settings that we use.

Multiple-second delays would have another cause, such as a design problem on the software side, an I/O delay, or an attack of some sort.

If your problem is not related to the other customer's, I should split this into a separate thread.
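On the I/O question raised above, one rough way to look for disk stalls is to time small synchronous writes at regular intervals and watch for outliers. This is only a sketch under assumptions: the helper name, the 4 KB payload, the one-second pacing, and the 0.5-second threshold are arbitrary, and the directory should be one on the same volume the game server writes to.

```python
import os
import tempfile
import time


def time_sync_write(directory, size=4096):
    """Write `size` bytes to a temporary file in `directory`, force the data
    to disk with fsync, and return the seconds the write and flush took."""
    payload = b"\0" * size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        os.write(fd, payload)
        os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return elapsed


if __name__ == "__main__":
    target = tempfile.gettempdir()  # assumed location; point at the server's disk
    for _ in range(60):
        elapsed = time_sync_write(target)
        if elapsed > 0.5:  # arbitrary threshold for a "slow" write
            print(f"{time.strftime('%H:%M:%S')} slow write: {elapsed:.3f} s")
        time.sleep(1)
```

If slow writes line up with the times players report being kicked, that would support an I/O delay rather than CPU scheduling as the cause.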
