I am running ClearOS 6.x Professional on a Dell PowerEdge R610 server with dual quad-core Xeons (8 CPUs), 32 GB RAM, 15k SAS drives in RAID 1, etc. (i.e. it's a very fast machine).

Its primary use is as a gateway with content filtering in a dorm with about 150-200 boys.

Our internet connection is 100/30 Mbps fiber.

Here's the problem. (It doesn't happen on our other two installations, the women's dorm gateway and the main campus gateway. Each of those locations has a 100/30 connection and the same model server as well.)

Internet runs great, bandwidth usage gets pretty high at peak times (netflix, playstation, etc.) but web browsing is still fine. But at some point the internet just goes to an absolute crawl - and I've figured out the why, I just need a solution to fix it.

If I go to look at "resource report" everything is fine, everything has plenty left, usually 95-99% cpu usage left, 20gb of RAM free, no swap being used, etc. But if I look at the 'processes' - if it is at 1671-1674 or so, I know the internet is at a crawl. Processes normally stay at around 800-1000. But once it gets to 1600-something, it is absolutely stuck until I touch the content filter in some way to reset it (restart dansguardian, delete proxy cache, change a content filter setting, anything.)

I've been through all the guides on optimizing squid. I am already running cacheless. Here is the relevant portion of my dansguardian config file (although I've changed it SEVERAL times, this is the latest iteration):

# sets the maximum number of processes to spawn to handle the incoming
# connections. Max value usually 250 depending on OS.
# On large sites you might want to try 180.
maxchildren = '999'


# sets the minimum number of processes to spawn to handle the incoming connections.
# On large sites you might want to try 32.
minchildren = '64'


# sets the minimum number of processes to be kept ready to handle connections.
# On large sites you might want to try 8.
minsparechildren = '32'


# sets the minimum number of processes to spawn when it runs out
# On large sites you might want to try 10.
preforkchildren = '16'


# sets the maximum number of processes to have doing nothing.
# When this many are spare it will cull some of them.
# On large sites you might want to try 64.
maxsparechildren = '64'


# sets the maximum age of a child process before it croaks it.
# This is the number of connections they handle before exiting.
# On large sites you might want to try 10000.
maxagechildren = '2000'


The last change I made was taking the maxagechildren down from 10000 to 2000, thinking that maybe some processes were getting hung and that it would kill off some of them earlier and make room. Well, it worked for about 24 hours. Which is usually the case - I have to fix this 'dansguardian processes maxxed out' issue about every 24-48 hours.

Woke up this morning and here's what the processes looked like on the gateway:

http://i1093.photobucket.com/albums/i424/blakemcginnis/1671.png

So I knew the internet was deadly slow.

Ran the command to see how many processes dansguardian was using:

ps aux | grep dansguardian-av | wc -l


1003

Maxxed out.
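
One caveat on that count: `ps aux | grep dansguardian-av | wc -l` also counts the grep process itself and matches anything that merely mentions the name on its command line, so the figure can be off by one or more. A minimal sketch of a stricter count, assuming the filter processes really do run under the command name `dansguardian-av` (taken from the grep above, not verified here); the helper name `count_named` is made up for illustration:

```shell
# Count how many lines on stdin are exactly equal to the given command name.
# -x matches whole lines only, -F treats the name as a fixed string,
# -c prints the number of matching lines instead of the lines themselves.
count_named() {
    grep -c -x -F -- "$1"
}

# Live usage (assumption: the command name is dansguardian-av):
#   ps -eo comm= | count_named dansguardian-av
# `ps -eo comm=` prints one bare command name per line with no header,
# so the grep used for counting never matches itself.
```

The result includes the parent process as well as the children, so it can legitimately sit slightly above maxchildren.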

Ran 'TOP' command (not that I understand much of what I'm reading, but I knew you guys might ask, so here is a screenshot:)

http://i1093.photobucket.com/albums/i424/blakemcginnis/top.png

Restarted dans-guardian, everything back to normal....for now.

Can anyone offer any advice? I've tried changing the numbers around on the dansguardian.conf file several times, but it doesn't seem to help. It doesn't appear that I'm running into any resource problems from the hardware as the system load is always around 0.1-0.2, RAM is never even remotely close to fully utilized, etc. It appears to me that I've got a hard process limit problem. I figure there are two ways to approach this:

1. Find a way to increase the HARD limit of 1,000 processes Dansguardian is allowed to run.

2. Someone has a virus, a BitTorrent client, or another program that is generating way too many processes/requests. How could I find this person? Or, how can I set a limit on processes run by individual devices/IPs?

OR, anyone have any other ideas? How would you handle this? We have the bandwidth - we have the hardware - but every day I have to reset this thing, and usually it's been dead for an hour or two before I can catch it, because I can't monitor it 24/7 - so the kids are usually pretty mad by then. Please Help...
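
Since the lock-up currently goes unnoticed for an hour or two, a stopgap could be a small check run from cron that restarts the filter when the process count gets close to the ceiling. This is only a sketch of a workaround, not a fix for the underlying limit. The threshold of 950, the helper name `needs_restart`, and the service name `dansguardian-av` are all assumptions to adjust for the actual system:

```shell
# Succeed (exit status 0) when the current process count has reached the threshold.
# $1 = current count, $2 = threshold
needs_restart() {
    [ "$1" -ge "$2" ]
}

# Hypothetical cron job body (service and command names are assumptions):
#   count=$(ps -eo comm= | grep -c -x -F dansguardian-av)
#   if needs_restart "$count" 950; then
#       logger "dansguardian-av at $count processes, restarting"
#       service dansguardian-av restart
#   fi
```

Logging each restart with a timestamp would also show whether the spikes line up with a particular time of day.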

Thanks!
Friday, October 02 2015, 02:08 PM
Responses (11)
  • Accepted Answer

    Monday, November 23 2015, 03:03 PM - #Permalink
    Resolved
    1 votes
    Hi Blake,

    There's a 1000-ish process limit in DansGuardian. We have gone down the road of patching Dansguardian to avoid the limit, but there are other outstanding issues that need to be resolved. Fundamentally, Dansguardian is old & crusty and we need to get the recommended replacement running on ClearOS.

    I see your support ticket in the queue and there will be a follow up. I believe there are ways around the limitation, but I'm not the expert on this topic.
  • Accepted Answer

    Friday, November 20 2015, 07:28 PM - #Permalink
    Resolved
    0 votes
    Yeah, I've seen those links - was just hoping I could escalate from here instead of having to re-explain the problem. I went ahead and put in a support ticket and attached a link to this thread. I have to get this fixed.
  • Accepted Answer

    Friday, November 20 2015, 05:53 PM - #Permalink
    Resolved
    0 votes
    Have you seen the support stuff here? I thought there was a cheaper option but I could be wrong.
  • Accepted Answer

    Friday, November 20 2015, 05:14 PM - #Permalink
    Resolved
    0 votes
    What's the proper procedure for escalating this to upper-level/engineer-type tech support? I don't care if I have to pay, I need help getting this fixed. Mods? Anyone?
  • Accepted Answer

    Friday, November 20 2015, 05:05 PM - #Permalink
    Resolved
    0 votes
    Yep, seen 'em all. I mentioned that I am running cacheless in my initial post (even put it in italics because I knew someone would mention it ;) ). It's not a 'my hardware is too slow' problem - it's a problem with the dansguardian processes shooting up from around 300 to 1000 in the space of 5 minutes and then sticking there forever (making the internet unusable) until I reset the proxy cache. There shouldn't even be a cache, since it's set to run cacheless, but I think changing any setting that makes dansguardian restart is what fixes it.

    I need to know how to do one of three things:

    1. Increase the number of maxchildren past 1000.
    2. Limit the number of processes/connections that are allowed to come from one IP/device.
    3. If it's a computer/device/etc. on the network that is doing this, I need to know how to find it - I've tried everything, and nothing stands out when looking at the network visualizer, etc., when it is locked up. If I run TOP, well...you can see the results of that in the posts above - nothing conclusive.
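
For item 2, one possible approach outside DansGuardian itself is the iptables `connlimit` match, which caps simultaneous TCP connections per source IP. The sketch below only prints a candidate rule for review rather than applying it. Port 8080 (DansGuardian's usual filter port) and the cap of 40 are assumptions, `connlimit_rule` is a made-up helper name, and ClearOS manages its own firewall, so where a rule like this would have to live to survive a firewall restart is not covered here:

```shell
# Print (do not apply) an iptables rule that rejects new TCP connections to a
# port once a single source IP already has more than the given number open.
# $1 = TCP port the content filter listens on (assumed 8080)
# $2 = maximum simultaneous connections allowed per source IP
connlimit_rule() {
    printf 'iptables -I INPUT -p tcp --syn --dport %s -m connlimit --connlimit-above %s -j REJECT --reject-with tcp-reset\n' "$1" "$2"
}

# Show the rule for review; nothing is changed on the system.
connlimit_rule 8080 40
```

A browser normally opens only a handful of connections at a time, so a cap in the tens should rarely affect ordinary browsing while still stopping a single device from tying up hundreds of filter processes. The right number is a judgment call.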
  • Accepted Answer

    Friday, November 20 2015, 04:54 PM - #Permalink
    Resolved
    0 votes
    Have you seen threads like this? Do you need to run the content filter cacheless?
  • Accepted Answer

    Friday, November 20 2015, 02:39 PM - #Permalink
    Resolved
    0 votes
    By the way - this is still happening. Every. Day.

    It's also started to creep up on our main network, not as often, but it has happened a few times - which makes me very worried.

    This seems to be a fatal flaw in ClearOS/dansguardian: once the maxchildren processes are maxxed out, the system can't recover. I've received no suggestions that I haven't already tried, so I guess I'm going to have to start looking for another content filter solution.
  • Accepted Answer

    Monday, October 05 2015, 09:11 PM - #Permalink
    Resolved
    0 votes
    I hoped you did not have the base configuration of the Dell, as its NICs seem to be tried and tested. ClearOS seems to come with the latest available drivers so there is no point trying to update them.

    I don't know Dansguardian and don't use it so can't really advise. You can try looking at connections with something like "netstat" with appropriate switches. I'd also consider removing any tweaking you did to Dansguardian and starting again.
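
As a rough sketch of the netstat idea: while the filter is stuck, counting TCP connections per client address on the filter port should make a single runaway device stand out. This assumes the content filter listens on port 8080 (the usual DansGuardian default, not confirmed for this setup); `top_clients` is a made-up helper name:

```shell
# Read `netstat -tn` output on stdin and print the ten remote addresses with
# the most TCP connections to the given local port, busiest first.
# $1 = local port to match (assumed 8080 for the content filter)
top_clients() {
    awk -v port="$1" '
        # Keep TCP lines whose local address (field 4) ends in :port
        $1 ~ /^tcp/ && $4 ~ (":" port "$") {
            ip = $5
            sub(/:[0-9]+$/, "", ip)   # strip the remote port, keep the address
            count[ip]++
        }
        END { for (ip in count) print count[ip], ip }
    ' | sort -rn | head -n 10
}

# Live usage while the problem is occurring:
#   netstat -tn | top_clients 8080
```

If one address holds hundreds of connections while everyone else has a dozen or so, that device is the first place to look.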
  • Accepted Answer

    Monday, October 05 2015, 07:17 PM - #Permalink
    Resolved
    0 votes
    Nick, thanks for the article - I just read it, and have read many others like it. They usually point to some sort of resource problem (not enough CPU power, not enough RAM, etc.) and how to tune dansguardian to work within those limited resources. The server I'm running is overkill to the extreme for this software, and always has resources to spare - never swaps, is running cacheless, etc.

    It worked fine over the weekend (which is not unusual, it seems to work fine for 24-72hrs.) I checked on the processes this morning and they were fine, 900 or so (with 200-500 of those being used by dansguardian.) and then bam, around 9AM, it went to 1672 (dansguardian using 1003) and internet was functionally dead until I reset the web proxy cache (which it shouldn't be using any of.)

    If it IS a particular problem user (malware, bit torrent, etc.) how do I find the offending IP/device? I can navigate the web interface and run any command I like easily while the processes are stuck - I just don't know where to look...

    Here is the output to the command you asked me to run (although I wouldn't think it's the NICs as I have 2 other identical servers running the same clearos setup at our other two locations, and this location is the only one with this trouble):

    [root@mensdorm ~]# lspci -k | grep Eth -A 3
    01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
    Subsystem: Dell PowerEdge R610 BCM5709 Gigabit Ethernet
    Kernel driver in use: bnx2
    Kernel modules: bnx2
    01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
    Subsystem: Dell PowerEdge R610 BCM5709 Gigabit Ethernet
    Kernel driver in use: bnx2
    Kernel modules: bnx2
    02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
    Subsystem: Dell PowerEdge R610 BCM5709 Gigabit Ethernet
    Kernel driver in use: bnx2
    Kernel modules: bnx2
    02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
    Subsystem: Dell PowerEdge R610 BCM5709 Gigabit Ethernet
    Kernel driver in use: bnx2
    Kernel modules: bnx2
    03:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)


    Also, I changed those two other mistake posts, so you can delete them if you like - I only set them to private so everyone wouldn't have to see them. thanks!
  • Accepted Answer

    Monday, October 05 2015, 06:51 PM - #Permalink
    Resolved
    0 votes
    There are heaps of articles about tuning Dansguardian including this one on the devs' site. I know it talks about swapping if maxchildren is too high so that may not be your issue as your swap usage is 0.

    Which NICs are you using as the Dell seems to have a number of options? Perhaps give the output of "lspci -k | grep Eth -A 3"


    BTW, If you've flagged your other two posts for Mods only, can you edit the posts and remove the flag. Then I may be able to delete them. There seem to be 2 classes of mods and only some (the devs mainly) have full access.
  • Accepted Answer

    Monday, October 05 2015, 02:10 PM - #Permalink
    Resolved
    0 votes
    Anyone?