David_Evans
Collaborator

Custom Metric-Behaves different, CLI vs sklnctl - (Fixed) - Logging rate custom script for MLM's

The script below works for exporting the total log receive rate when run on its own (that part is currently commented out).

But when I run it to pull each firewall's log send rate, it only works properly at the CLI. There I get nicely formatted JSON output with the 100+ firewall names listed individually, each with its receive rate value.

However, when I actually add this to Skyline, I get only the first firewall in each MDS domain and nothing else.

I originally wrote the inner loop as a for loop, which is my go-to, but switched to a while loop and got the same result. I've played around with putting various variables in quotes and out of quotes, and it seems to make no difference. It works great at the CLI, where I get JSON output several hundred lines long, but when added to Skyline, only the first firewall is listed in Grafana.

 

#!/bin/bash
. /opt/CPotlpAgent/cs_data_handler_is.bash


for i in $(mdsstat | grep CMA | cut -d"|" -f3); do      #Grab list of MDS Domains from MDS Stat, Loop through them
        export BASH_i=$i    				#Have to export i so that bash has the variable.
        logginginfo=$(bash -ic 'mdsenv $BASH_i ; cpstat ls -f logging')    #have to run mdsenv in bash, cs_data_handler doesn't like switching MDS env's
#       TotalLoggingRate=$(echo "$logginginfo" | grep "Log Receive Rate:" | cut -d ":" -f 2 | tr -d '[:space:]')  # Pull total log receive rate for each domain.
#       set_ot_object new value "$TotalLoggingRate"
#       set_ot_object last label Logging_Rate "TotalLogRate_$i"

        while IFS= read -r line; do					#loop through cpstat, pulling each connected firewall and the log count
                if [[ ! $line == *"Gateways"* && $line == *"Connected"* ]]; then     #Looking for the lines with connected firewalls
                        fwname=$(echo $line |cut -d "|" -f 2| tr -d '[:space:]')
                        fwvalue=$(echo $line |cut -d "|" -f 5| tr -d '[:space:]')
                        set_ot_object new value "$fwvalue"
                        set_ot_object last label Firewall_Name "$fwname"
                fi
        done <<< "$logginginfo"
done

#verbose_print "%s" "$LoggingRate"
script_exit "Finished running" 0

 

Vincent_Bacher
Advisor

This is a nice one, good job!
Regarding your issue, I would open a ticket with CP; I guess this can be answered by R&D. I doubt that many of us have dived this deep into custom metrics yet.
I just started playing around, so I can't tell either.

Cheers
Vince

and now to something completely different - CCVS, CCAS, CCTE, CCCS, CCSM elite
David_Evans
Collaborator

It does look like I will have to open a TAC case. I'll keep the thread updated.

I've tried several creative ways of running mdsenv on the MLMs, and none of them are reliable. Even the ones that seem to move between domains correctly when added at the CLI, and work for a while, stop working after a service stop and start, an mdsstop and start, etc.

Getting Skyline to run mdsenv to switch to a different domain, even for a moment to run one command, seems to be an issue.

Bob_Zimmerman
Authority

How about in Prometheus? On your MLM, run:

curl_cli "$(sklnctl --show_open_telemetry | jq '."export-targets"[0].url' | sed 's#write#label/Firewall_Name/values#' | tr -d '"')" | jq '.'

Edit: made a slight change to the command after testing more broadly.

David_Evans
Collaborator

Right now it's broken again, and I'm waiting until I get mdsenv to work consistently. I think some of my variability came from it not switching domains correctly: sometimes not switching at all, and sometimes erroring out on the command and not running the rest of the script.

Bob_Zimmerman
Authority

Yeah, I just confirmed that running /etc/profile.d/CP.sh before /opt/CPotlpAgent/cs_data_handler_is.bash results in a non-working mdsenv, and running it after results in the rest of the script not working. This feels wildly kludgy. I made some changes, and this seems to be working for me:

#!/bin/bash
. /opt/CPotlpAgent/cs_data_handler_is.bash

#Grab list of MDS Domains from MDS Stat, Loop through them
mdsstat | grep CMA | cut -d"|" -f3 | while read cmaName; do
	#Have to export the CMA name so we can pass it to a subshell.
	export cmaName
	loggingInfo=$(bash -lic 'mdsenv "${cmaName}"; cpstat ls -f logging')
	#loop through cpstat, pulling each connected firewall and the log count
	<<<"${loggingInfo}" grep "Connected" | grep -v "Gateways" \
	| while read line; do
		set_ot_object new value "$(<<<"${line}" cut -d"|" -f5 | tr -d '[:space:]')"
		set_ot_object last label Firewall_Name "$(<<<"${line}" cut -d"|" -f2 | tr -d '[:space:]')"
	done
done

script_exit "Finished running" 0
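
As a side note on the export step: a value can also be handed to a child bash as a positional parameter, which avoids changing the parent's environment. This is only a minimal sketch of that general bash feature. `run_in_child` is a hypothetical helper name, and whether this behaves any differently from the export approach when run under Skyline is untested.

```shell
#!/bin/bash
# Sketch: pass a value to a child bash as a positional parameter instead of exporting it.
# "run_in_child" is a hypothetical helper, not part of Skyline or Gaia.
run_in_child() {
	local name="$1"
	# The "_" fills $0 of the child shell; "$name" arrives there as $1.
	# The single quotes keep the parent from expanding $1 itself.
	bash -c 'printf "%s\n" "child saw: $1"' _ "$name"
}

run_in_child "Domain_A"   # prints: child saw: Domain_A
```

In the script above, the same pattern would mean putting the mdsenv and cpstat commands inside the single-quoted string and referring to "$1" there instead of "${cmaName}".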

The one-liner I posted above should find the Prometheus instance you're sending data to and ask it for all the values it has for the label Firewall_Name. The idea is to check where the problem is: if Prometheus has all the names, the problem is likely in Grafana.
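
To make the URL rewrite in that one-liner easier to see on its own, here is a small sketch of just the sed step. `label_values_url` is a hypothetical helper, and the example URL is a placeholder, not a real export target. Unlike the one-liner, this version only replaces "write" when it is the last thing in the URL.

```shell
#!/bin/bash
# Sketch: turn a Prometheus remote-write URL into the matching label-values query URL.
# "label_values_url" is a hypothetical helper; the URL below is a placeholder.
label_values_url() {
	local write_url="$1" label="$2"
	# Swap the trailing "write" path segment for "label/<name>/values".
	printf '%s\n' "${write_url}" | sed "s#write\$#label/${label}/values#"
}

label_values_url "http://prometheus.example:9090/api/v1/write" "Firewall_Name"
# prints: http://prometheus.example:9090/api/v1/label/Firewall_Name/values
```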

David_Evans
Collaborator

I think we were posting to the thread at the same time. I'll give it a try.

David_Evans
Collaborator

So this does work, for a while.

I put this on 6 MLMs. All of them gathered data for around 24 hours, +/- 3 hours. Then they just stopped. Restarting /opt/CPotelcol/CPotelcolCli.sh didn't bring them back and didn't generate a message in the logs.

However, they all started logging this message at about that time.

ts=2025-05-12T14:48:03.059Z caller=level.go:63 ts=2025-05-12T14:48:03.059Z caller=level.go:63 level=info msg="Api Status Collector: Command: api status Failed" TheCommand:apiexceededtheruntimethreshold=(MISSING)

I can't find any major API command run against them around that time that would have locked up the API service, but the API service is not happy on any of them.

The one MLM that I've been doing all my testing on is very broken now. After a reboot, and no other changes, it's back to giving me errors about "bash can't find the command". It's the exact same file that was working before. (I got so paranoid that I restored from backup and MD5-hashed the 2 files.)

So something about our workaround is not stable long term.

I'll reboot some of the other log servers when I have a window and keep the thread updated.

Bob_Zimmerman
Authority

What jumbo are you on?

Anything in /var/log/dump/usermode? Sounds like a process is crashing.

David_Evans
Collaborator

I'm on R81.20 jumbo 96


At least some of them had the lock file in the temp folder that I mentioned in the troubleshooting post (I forgot my own troubleshooting suggestions).
/var/log/cs_data_handler_is.bash.log

.....
Unable to acquire script lock: /tmp/cs_data_handler_is.bash.lock
....

Deleting and restarting services.

The times match up with when I'm backing up the logs from these MLMs.
The backup is an SSH login from a remote Linux box and an rsync command.
Maybe some conflict with spawning more than one bash shell?
CPU or IO related? I wouldn't think so; these are 6000XLs and not super busy at that time of day.
I'll try and get some more data.
No crash dumps on any of them.

Bob_Zimmerman
Authority

I also wouldn't expect this to be load-related. Once in the bad state, what do you get when you run 'api status' yourself?

What does $FWDIR/log/api.elg say?

David_Evans
Collaborator

The API acts weird on all my MLMs. We never run the API against them directly, so I'm not sure what is normal, but the API readiness tests always seem to fail. So I'm pretty sure that is normal for these.

'api status' does come back with a full page of text after about 20 seconds, but with the status "API Stopped".

Yesterday I got 5 running again. 3 are still running ~24 hours later. One stopped after about 3 hours, another after about 12 hours.

Both of the ones that stopped had the lock directory in /tmp/, but looking closer they also had a "storage" file in /tmp from the time of the crash.


[Expert@****-mlm3:0]# ls -lh /tmp/cs*
-rw-r--r-- 1 admin root 12K May 14 06:15 /tmp/cs_data_handler_is.bash.storage.vP SVQ4

/tmp/cs_data_handler_is.bash.lock:
total 0
[Expert@***-mlm3:0]#

Not sure if this is a symptom of the issue or a cause. My Skyline VM is in the same datacenter as the two MLMs that crashed, so connectivity shouldn't be an issue. Previously both the local MLM and the ones 1000 miles away stopped, but not all at the same time.

So I'm not sure what to make of that.

We had a spike overnight where 2 MLMs processed 30,000+ logs per second each for several minutes, and they continued to report afterwards, so, as expected, it's probably not CPU.

This is a side project right now, trying to balance some of the MLM load. I'm getting the "urgent" data I need for the most part, with a bit of extra work restarting services, but I'd like to have this long term. I'll keep poking at it.

I'd like to see others running it to see if it's something unique to my setup.


Bob_Zimmerman
Authority

That's definitely not normal. On my MLMs, the readiness test passes. 'mgmt_cli -f json -r true show domains' shows my domains. I can't show objects from the domains, but I can show objects and rules from the Global domain, and can show objects from the MDS level (like administrators; the 'show domains' command technically runs in the MDS domain).

I would involve the TAC at this point. The API log should contain some information about the last call which worked and the first call which failed. One of those is probably breaking something.

David_Evans
Collaborator

The MLMs that I've rebooted have stayed up for more than 24 hours now. The one MLM that I have not rebooted crashed after about 16 hours, with the lock file present.

So I'll continue to watch this for a while. Maybe a resource / memory leak.

Time will tell.

Next week I'll have some time to spend on a TAC case.

David_Evans
Collaborator

After the MLMs were rebooted, the custom script has been far more stable. The 'other' MLMs had been up for months and only ever had the good working copy of the script installed, once.

Since they were rebooted, they have been stable for 5 - 7 days now. No idea why it took a full reboot to help with the stability.

David_Evans
Collaborator

After the reboot, the custom script has now been stable for ~3 weeks. I'm not sure why it took a reboot to get it working correctly. As I don't have a non-working MLM any more, getting to the bottom of that is not likely.

I'm not sure if we want to start a new thread with the script so it's easier to find. I updated the thread title, so maybe it will pop up in some searches.

Sven_Glock
Advisor

Thanks @Bob_Zimmerman and @David_Evans for the nice short script.
I am actually working on the same kind of custom script, but mine was much longer for the same output 🙈

I added 3 improvements to your script:

  • an additional label, clm, to make it possible to filter by CLM
  • an additional filter for "Local Clients"
  • a label type "lograte" (just cosmetics)

Thanks a lot.
Looking forward to seeing the outcome of @David_Evans' investigations.

#!/bin/bash
. /opt/CPotlpAgent/cs_data_handler_is.bash

#Grab list of MDS Domains from MDS Stat, Loop through them
mdsstat | grep CMA | cut -d"|" -f3 | while read cmaName; do
	#Have to export the CMA name so we can pass it to a subshell.
	export cmaName
	loggingInfo=$(bash -lic 'mdsenv "${cmaName}"; cpstat ls -f logging')
	#loop through cpstat, pulling each connected firewall and the log count
	<<<"${loggingInfo}" grep "Connected" | grep -Ev "Gateways|Local Clients" \
	| while read line; do
		set_ot_object new value "$(<<<"${line}" cut -d"|" -f5 | tr -d '[:space:]')"
		set_ot_object last label clm "${cmaName}"
		set_ot_object last label Firewall_Name "$(<<<"${line}" cut -d"|" -f2 | tr -d '[:space:]')"
		set_ot_object last label type "lograte"
	done
done

script_exit "Finished running" 0
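
For anyone who wants to try the filtering and field extraction away from an MLM, here is a rough offline sketch. The `set_ot_object` here is a stub that only prints its arguments, and the sample rows are made up: they mimic nothing more than the pipe-delimited field positions the script relies on (name in field 2, rate in field 5), so real cpstat output may well look different.

```shell
#!/bin/bash
# Stub standing in for Skyline's set_ot_object so the parsing can run offline.
set_ot_object() { printf '%s\n' "$*"; }

# Made-up sample rows (assumption): only the field positions matter here.
sample='|Gateways|Connected|x|x|
|fw-east-1|Connected|x|120|
|Local Clients|Connected|x|5|
|fw-west-3|Connected|x|87|'

# Same filters and cut positions as the script above, reading from stdin.
parse_rates() {
	grep "Connected" | grep -Ev "Gateways|Local Clients" | while IFS= read -r line; do
		set_ot_object new value "$(cut -d"|" -f5 <<<"${line}" | tr -d '[:space:]')"
		set_ot_object last label Firewall_Name "$(cut -d"|" -f2 <<<"${line}" | tr -d '[:space:]')"
	done
}

parse_rates <<<"${sample}"
# prints:
# new value 120
# last label Firewall_Name fw-east-1
# new value 87
# last label Firewall_Name fw-west-3
```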


Regards
Sven

David_Evans
Collaborator

Some general troubleshooting notes for when you get this far into these scripts.

This log can be useful:
/var/log/cs_data_handler_is.bash.log

After stopping and starting the services (CPotlpagentCli.sh) a lot during testing, or accidentally creating an infinite loop in your custom script, you can get all Skyline services hung. They will not gather any data, even after a reboot. This will be in the log file:

Unable to acquire script lock: /tmp/cs_data_handler_is.bash.lock

Just delete the lock directory and restart the services again.
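
A cautious way to script that cleanup might look like the sketch below. `clear_empty_lock` is a hypothetical helper, and the default path is simply the one quoted in the log message. It uses rmdir, so it only removes the directory if it is empty (as it was in the listing earlier in the thread); it does not check whether the data handler is running at that moment, so that part is left to you.

```shell
#!/bin/bash
# Sketch: remove the lock directory only if it exists and is empty.
# "clear_empty_lock" is a hypothetical helper, not a Skyline tool.
clear_empty_lock() {
	local lock_dir="${1:-/tmp/cs_data_handler_is.bash.lock}"
	if [ ! -d "${lock_dir}" ]; then
		echo "no lock directory at ${lock_dir}"
		return 0
	fi
	# rmdir refuses to remove a non-empty directory, so unexpected contents are left alone.
	if rmdir "${lock_dir}" 2>/dev/null; then
		echo "removed ${lock_dir}"
	else
		echo "left ${lock_dir} in place (not empty or not removable)" >&2
		return 1
	fi
}

# Demo on a throwaway directory rather than the real lock path.
demo_dir="$(mktemp -d)"
clear_empty_lock "${demo_dir}"
```

The services still need to be restarted afterwards, as described above.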


Log rotation doesn't seem to be working on my MLMs for /opt/CPotlpAgent/otlp_agent.log. It hit 200MB during this troubleshooting and just stopped putting new logs in the file. Other MLMs that I haven't messed with are at 200MB and haven't put anything new in the file in months, so that must be a hard limit. If you are not getting logs in that file, check its size.
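
A quick size check could be scripted along these lines. `log_at_limit` is a hypothetical helper, and the default threshold of 200 * 1024 * 1024 bytes is an assumption based on the "200MB" observed above, not a documented limit.

```shell
#!/bin/bash
# Sketch: report whether a log file has reached a size threshold.
# "log_at_limit" is a hypothetical helper; the default limit is an assumption.
log_at_limit() {
	local log_file="$1" limit_bytes="${2:-209715200}"   # 200 * 1024 * 1024
	[ -r "${log_file}" ] || return 2                     # missing or unreadable file
	local size
	size=$(( $(wc -c <"${log_file}") ))                  # arithmetic strips any padding from wc
	[ "${size}" -ge "${limit_bytes}" ]
}

# Demo with a 10-byte throwaway file and a 10-byte limit.
tmp_log="$(mktemp)"
printf '0123456789' >"${tmp_log}"
if log_at_limit "${tmp_log}" 10; then
	echo "log is at or above the limit"
fi
rm -f "${tmp_log}"
```

On an MLM the call would be something like `log_at_limit /opt/CPotlpAgent/otlp_agent.log`, using the path from the note above.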


I think the solution to my problem is to include the MDS Check Point profile in the Skyline scripts / processes.

/opt/CPmds-R81.20/scripts/MDSprofile.sh

I've added that line to my custom script and to "/opt/CPotlpAgent/cs_data_handler_is.bash" with a couple of different syntaxes, and it never seems to find mdsenv to run.
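
For what it's worth, one way to at least make that failure explicit is to source the profile only if it is readable and then check whether the command actually became available. This is a sketch only: `load_profile_and_check` is a hypothetical helper, it does not fix the underlying problem, and the demo uses a throwaway profile with a fake command rather than the real MDSprofile.sh.

```shell
#!/bin/bash
# Sketch: source a profile, then verify that an expected command exists afterwards.
# "load_profile_and_check" is a hypothetical helper.
load_profile_and_check() {
	local profile="$1" cmd="$2"
	if [ ! -r "${profile}" ]; then
		echo "profile not readable: ${profile}" >&2
		return 1
	fi
	# Source into the current shell so functions, aliases and PATH changes persist.
	. "${profile}"
	# In bash, command -v finds functions, aliases and executables on PATH.
	if ! command -v "${cmd}" >/dev/null 2>&1; then
		echo "${cmd} still not found after sourcing ${profile}" >&2
		return 2
	fi
}

# Demo with a throwaway profile that defines a fake command.
demo_profile="$(mktemp)"
echo 'fake_cmd() { echo "fake_cmd ran"; }' >"${demo_profile}"
load_profile_and_check "${demo_profile}" fake_cmd && fake_cmd
rm -f "${demo_profile}"
```

With the real profile the call would be `load_profile_and_check /opt/CPmds-R81.20/scripts/MDSprofile.sh mdsenv`, and a return code of 2 would confirm that sourcing alone is not enough in that context.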

Directly running a new instance of bash, as in the script above, works at the CLI. I think that is because it pulls the profile from the logged-in user, and maybe it somehow gets that when it's added to Skyline as the logged-in user. But once the Skyline services are restarted "normally" it fails, and if you dig into the log files you get these super-easy-to-read error messages. Somebody got a little overzealous with the "remove white space" function...

"ts=2025-05-06T15:54:07.363Z caller=level.go:63 ts=2025-05-06T15:54:07.363Z caller=level.go:63 level=info msg="Collector: /home/admin/mlm_total_logginghas disabled due to: " Script:/var/log/CPotlpAgent/backup/scripts/mlm_total_logging.shchangethestatetodisableddueto:TheCommand:/bin/bash,Error:Error:exitstatus1,Stderr:bash:cannotsetterminalprocessgroup(139635):Inappropriateioctlfordevicebash:nojobcontrolinthisshellbash:mdsenv:commandnotfounderror:syntaxerror,unexpectedLITERAL,expecting'}'01compileerror;terminated=(MISSING)"

It ran for several hours after I restarted it per the SK for custom metrics, but after a reboot I started getting this message and it stopped working.


So at this point I just need to take most of this thread and put it in an actual TAC case.

Vincent_Bacher
Advisor

I would be delighted if this knowledge could be included in an SK article.

and now to something completely different - CCVS, CCAS, CCTE, CCCS, CCSM elite
Duane_Toler
Advisor

I'm replying at the top level for this, but for everyone's FYI:

Watch out for the various process pipes you're using. You are obviously aware of <<< for input redirection, but there are still some process pipes going on. Each of these pipes opens a sub-shell in Bash. Anything you do inside one, such as setting variable values, has no effect outside the sub-shell, and that content is lost once the sub-shell ends.

Things like "while ... do ... done <<< (blah)" are the right way (and you're doing that; excellent). Just be careful with anything else that uses pipes.

Sub-shells are nice silent killers for a good shell script! 😞
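
A tiny demonstration of that effect, assuming default bash settings (in particular, the lastpipe option is not enabled). The function names are made up for the example.

```shell
#!/bin/bash
# Sketch: a variable changed inside a piped while loop is lost,
# while the same loop fed by a here-string keeps its changes.
count_with_pipe() {
	local count=0 line
	printf 'a\nb\nc\n' | while IFS= read -r line; do
		count=$((count + 1))   # runs in a sub-shell; the outer count is untouched
	done
	echo "${count}"
}

count_with_herestring() {
	local count=0 line
	while IFS= read -r line; do
		count=$((count + 1))   # runs in the current shell
	done <<<"$(printf 'a\nb\nc\n')"
	echo "${count}"
}

count_with_pipe         # prints 0
count_with_herestring   # prints 3
```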

David_Evans
Collaborator

Here is an example of some of the dashboard data you can get from this script.

[Images: MLM-single2.png, MLM-Summary.png]
