Understanding Ktranslate CPU usage

Problem

You've ktranslate containers hitting 100% CPU utilization, or just generally it's higher than expected.

Tip

One detail to be careful about is that for ktranslate it's important to focus on the maximum CPU percent instead of the average. Ktranslate uses a high percentage of CPU near the beginning of a polling cycle and much less at the end of the cycle. When you look at the average usage you might see 60% and miss the fact that ktranslate is spending time close to 100%, which is a potential problem. You need to allocate enough resources so that the max CPU consumption doesn't hit 100%.

Solutions

The causes of CPU usage vary depending on what type of ktranslate container you're working with.

SNMP

SNMP containers primarily see CPU usage increases due to the number of data points we need to collect from the target devices, and the amount of time spent attempting to collect each data point.

Devices with high latency when responding to SNMP requests effectively require more resources. Having a longer timeout or more retries, makes it more likely that you'll get the data from these slow devices to avoid charting gaps, but it also increases the required amount of resources you need to provide for your container. It's also possible to have a high number of SNMP trap events coming in, as a general estimate you should be able to ingest roughly 1000 events per second for each CPU.

With high usage your options are these:

  • Add more CPU's to the host to handle the existing workload.
  • Increase the polling intervals by having a larger number in the poll_time_sec global or device specific configs.
  • Reduce the number of devices being polled by removing them from the devices section of ktranslate configuration yaml and making sure they are not included in an automatic rediscovery.
  • Exclude MIBs that you don't need from the mibs_enabled global list (less typical).
  • Possibly reduce the volume of lower value/noisy SNMP traps by changing configurations on network devices to not send them.

Flow

Flow containers primarily see CPU usage increases due to the number of inbound flow events. As a general estimate, you should be able to ingest roughly 1000 events per second for each CPU, but loads vary slightly for the different flow protocols.

With high usage your options are these:

  • Add more CPU's to the host to handle the existing workload.
  • Reduce the volume of incoming flows by changing settings on the network device, such as only collecting flow at critical interfaces or by using sampled flows if possible.

Syslog

Syslog containers primarily see CPU usage increases due to the number of inbound syslog events. As a general estimate, you should be able to ingest roughly 1000 events per second.

With high usage your options are these:

  • Add more CPU's to the host to handle the existing workload.
  • Reduce the volume of lower value/noisy syslog events by changing configurations on network devices to not send certain types of messages or severities. Consult with your device vendor's documentation for how to make these changes.