The problem to solve
If you are monitoring a busy AWS instance where other programs access the same logs as the CloudWatch agent, you may run into the error “too many open files” in the agent's log, which stops any further logs or metrics being sent to CloudWatch. Not a huge problem, but what happens when everyone has gone home and no one is there to restart the agent? Or worse, what happens if you have had a security breach and you need those logs?
Part of the Solution
The first part is to create a small script that monitors the CloudWatch agent log file for the error “too many open files”, as done below. The script watches new lines as they are written to the log file. If a line contains that error, it stops the agent, starts it again, and writes a log message to say the agent has been restarted.
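#!/bin/bash
# Watch the CloudWatch agent log and restart the agent on "too many open files".
LOG=/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
# -n0 skips lines already in the file; read -r keeps backslashes in log lines intact.
tail -fn0 "$LOG" | while IFS= read -r line ; do
  if echo "${line}" | grep -qi "too many open files" ; then
    sudo amazon-cloudwatch-agent-ctl -a stop
    sleep 5
    sudo amazon-cloudwatch-agent-ctl -a start
    sleep 5
    # The >> redirection is done by this shell, not by sudo, so the script
    # itself needs write access to the log (it does when run as root).
    echo "Agent restarted-$(date +"%d_%m_%y__%I_%M_%p")" >> "$LOG"
  fi
done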
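If you want to check the matching step on its own before pointing it at a live agent, it can be pulled out into a function. This is only a sketch: the name is_fd_error and the two sample log lines are made up for illustration, and the real agent's log format may look different.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): succeeds when the
# given line contains the error text, ignoring case.
is_fd_error() {
  printf '%s\n' "$1" | grep -qi "too many open files"
}

# Made-up sample lines; only the first one contains the error text.
for sample in "E! open /var/log/app.log: Too many open files" "I! agent started"; do
  if is_fd_error "$sample"; then
    echo "match: $sample"
  else
    echo "no match: $sample"
  fi
done
```

Because grep is called with -i, the match does not depend on how the agent capitalises the message.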
But I don't want to have to put the script into a cron job and monitor that as well, along with the other scenarios that can interfere with it. It just becomes messy.
The other part of the Solution
Let's create a systemd service to manage it for us, as done below.
vi /etc/systemd/system/monitor-cloudwatch.service
[Unit]
Description=Monitors the Amazon CloudWatch agent and restarts it when too many files are open
[Service]
User=root
WorkingDirectory=/root
ExecStart=/root/restartcw.sh
Restart=always
[Install]
WantedBy=multi-user.target
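The unit points ExecStart at /root/restartcw.sh, so that file has to be executable or systemd will not be able to start it. A small pre-flight check, as a sketch (the function name check_executable is my own and not part of systemd):

```shell
#!/bin/bash
# Hypothetical helper: reports whether a script can be executed and
# suggests the fix if it cannot.
check_executable() {
  if [ -x "$1" ]; then
    echo "ok: $1 is executable"
  else
    echo "not executable: $1 (try: chmod +x $1)"
    return 1
  fi
}

# Path taken from ExecStart in the unit file; "|| true" keeps the check
# from aborting a larger script if the file is missing or not executable.
check_executable /root/restartcw.sh || true
```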
Once you have saved the service file we need to reload the daemon so our new service file gets included.
systemctl daemon-reload
Now let's start our new service.
systemctl start monitor-cloudwatch.service
If you want the service to come up at boot, don't forget to enable it.
systemctl enable monitor-cloudwatch.service
Check the status; everything should be happy.
systemctl status monitor-cloudwatch.service
Grab yourself two terminals and you can test it out.
tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
echo "too many open files" >> /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
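If you would rather not write test lines into the real agent log, the same loop can be tried against a throwaway file first. The following is a rough dry run under a few assumptions: the restart is replaced by a marker line, watch_once is a name I made up, and timeout (from coreutils) is only there so the background tail cannot hang around after the test.

```shell
#!/bin/bash
# Dry run of the watch loop against a temporary file; nothing here
# touches the real agent or its log.
log=$(mktemp)

# Same loop shape as the restart script, but the stop/start commands are
# replaced by a marker line and the loop exits after the first match.
watch_once() {
  timeout 3 tail -fn0 "$1" | while IFS= read -r line; do
    if printf '%s\n' "$line" | grep -qi "too many open files"; then
      echo "Agent restarted (simulated)" >> "$1"
      break
    fi
  done
}

watch_once "$log" &
sleep 1                                # give tail time to start following
echo "too many open files" >> "$log"   # the same test line as above
wait                                   # returns once the watcher has exited
result=$(cat "$log")
rm -f "$log"
echo "$result"
```

If the loop works, the output contains both the injected error line and the simulated restart message.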
You should see the error message show up, the agent restart, and a log message to say your agent has been restarted. Now you will also want to look at getting a notification/email when the agent has been restarted. I'm not going to cover it here, but you will want to create a metric filter that monitors your log group for specific patterns. AWS has this well documented.