I am running ClearOS 6.x Professional on a Dell PowerEdge R610 server with dual quad-core Xeons (8 CPUs), 32 GB RAM, 15k SAS drives in RAID 1, etc. (i.e. it's a very fast machine).
Its primary use is as a gateway with content filtering in a dorm with about 150-200 boys.
Our internet connection is 100/30 Mbps fiber.
Here's the problem (and this doesn't happen on our other two installations: our women's dorm gateway or our main campus gateway. Each of those locations has a 100/30 connection and the same model server as well).
Internet runs great. Bandwidth usage gets pretty high at peak times (Netflix, PlayStation, etc.) but web browsing is still fine. But at some point the internet just goes to an absolute crawl - and I've figured out the why, I just need a solution to fix it.
If I look at the "resource report", everything is fine and everything has plenty left: usually 95-99% CPU free, 20 GB of RAM free, no swap being used, etc. But if I look at the 'processes' count and it is at 1671-1674 or so, I know the internet is at a crawl. Processes normally stay at around 800-1000. But once the count gets to 1600-something, it is absolutely stuck until I touch the content filter in some way to reset it (restart dansguardian, delete the proxy cache, change a content filter setting, anything).
I've been through all the guides on optimizing squid. I am already running cacheless. Here is the relevant portion of my dansguardian config file (although I've changed it SEVERAL times, this is the latest iteration):
# sets the maximum number of processes to spawn to handle the incoming
# connections. Max value usually 250 depending on OS.
# On large sites you might want to try 180.
maxchildren = '999'
# sets the minimum number of processes to spawn to handle the incoming connections.
# On large sites you might want to try 32.
minchildren = '64'
# sets the minimum number of processes to be kept ready to handle connections.
# On large sites you might want to try 8.
minsparechildren = '32'
# sets the minimum number of processes to spawn when it runs out
# On large sites you might want to try 10.
preforkchildren = '16'
# sets the maximum number of processes to have doing nothing.
# When this many are spare it will cull some of them.
# On large sites you might want to try 64.
maxsparechildren = '64'
# sets the maximum age of a child process before it croaks it.
# This is the number of connections they handle before exiting.
# On large sites you might want to try 10000.
maxagechildren = '2000'
The last change I made was taking maxagechildren down from 10000 to 2000, thinking that maybe some processes were getting hung and that this would kill some of them off earlier and make room. Well, it worked for about 24 hours, which is usually the case - I have to fix this 'dansguardian processes maxed out' issue about every 24-48 hours.
Woke up this morning and here's what the processes looked like on the gateway:
So I knew the internet was deadly slow.
Ran the command to see how many processes dansguardian was using:
ps aux | grep dansguardian-av | wc -l
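Side note on that command: the pipeline can count the grep process itself, so the number may come out one too high. Here is a rough Python sketch of a stricter count. It assumes Python 3.7+ is available on the box and that the filter children show up in ps with the command name dansguardian-av; both are assumptions on my part, and the function names are just made up for illustration.

```python
import subprocess


def count_processes(ps_output, name):
    """Count lines of `ps -eo comm=` output whose command name is exactly `name`."""
    return sum(1 for line in ps_output.splitlines() if line.strip() == name)


def current_count(name="dansguardian-av"):
    """Ask ps for the bare command name of every process and count matches.

    Assumption: the children report the command name 'dansguardian-av'.
    """
    result = subprocess.run(
        ["ps", "-eo", "comm="],  # one command name per line, no header
        capture_output=True,
        text=True,
        check=True,
    )
    return count_processes(result.stdout, name)
```

Calling current_count() should give the number of matching processes without the extra grep line.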
Ran the 'top' command (not that I understand much of what I'm reading, but I knew you guys might ask, so here is a screenshot):
Restarted dansguardian, everything back to normal....for now.
Can anyone offer any advice? I've tried changing the numbers around on the dansguardian.conf file several times, but it doesn't seem to help. It doesn't appear that I'm running into any resource problems from the hardware as the system load is always around 0.1-0.2, RAM is never even remotely close to fully utilized, etc. It appears to me that I've got a hard process limit problem. I figure there are two ways to approach this:
1. Find a way to increase the HARD limit of 1,000 processes Dansguardian is allowed to run.
2. I have someone with a virus, BitTorrent, or other program that is generating way too many processes/requests. How could I find this person? OR, how can I set a limit on processes run by individual devices/IPs?
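For option 2, the kind of thing I have in mind is a per-client connection count, roughly like the Python sketch below. It parses the text output of `netstat -tn` and tallies ESTABLISHED connections to the filter port by client IP. The port 8080 is an assumption (I believe it is the usual dansguardian filter port, but check filterport in the config), the function name is made up, and the addresses in the usage note are invented sample data, not from my gateway.

```python
from collections import Counter


def connections_per_client(netstat_output, proxy_port="8080"):
    """Tally ESTABLISHED TCP connections to proxy_port by remote (client) IP.

    Expects text in the format printed by `netstat -tn` on Linux.
    Assumption: proxy_port is the port the content filter listens on.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the header lines and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        # The port is whatever follows the last colon in the local address.
        if local.rsplit(":", 1)[-1] != proxy_port:
            continue
        counts[remote.rsplit(":", 1)[0]] += 1
    return counts
```

Feeding it the netstat text and calling .most_common(10) on the result would list the ten busiest clients. If one IP such as 192.168.1.57 (made-up example) holds hundreds of connections while everyone else holds a handful, that would be the device to look at.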
OR, anyone have any other ideas? How would you handle this? We have the bandwidth - we have the hardware - but every day I have to reset this thing, and usually it's been dead for an hour or two before I can catch it, because I can't monitor it 24/7 - so the kids are usually pretty mad by then. Please Help...
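Since I can't watch it around the clock, a stopgap might be a small check run from cron that warns me before the children hit the ceiling. A minimal sketch in Python: the 999 default mirrors my maxchildren setting above, while the 90% warning level and the function name are arbitrary choices of mine, not anything from the dansguardian docs.

```python
def check_children(count, maxchildren=999, warn_fraction=0.9):
    """Return a warning string once count reaches warn_fraction of maxchildren.

    Returns None while the count is still below the warning level.
    Assumption: warn_fraction=0.9 is an arbitrary early-warning margin.
    """
    threshold = int(maxchildren * warn_fraction)
    if count >= threshold:
        return "dansguardian children at %d of %d (warn at %d)" % (
            count, maxchildren, threshold)
    return None
```

The count would come from the ps command I ran above, and any non-None result could be e-mailed to me so I hear about it before the kids do. This only detects the problem; it doesn't explain why the children pile up.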