In the ever-evolving landscape of Kubernetes networking and security, Calico has proven to be a battle-hardened, scalable and robust solution. Core to Calico’s architecture are two components, Felix and Typha. Given their importance to running Kubernetes deployments, it is no surprise that monitoring these components is crucial for keeping clusters secure and operating optimally.
This blog post explores the importance of Calico Typha and Felix metrics, providing insights into their roles, key metrics, and recommended monitoring practices.
Calico Architecture: Introducing Typha and Felix
Calico comprises various components, each playing a distinct role in ensuring the seamless functioning of Kubernetes environments. Among these, Typha and Felix take center stage, collectively responsible for Calico’s networking and security functions.
Typha: Scaling Datastore Proxy
Typha acts as a caching datastore proxy, positioned between calico-nodes and the Kubernetes API server. Its primary function is to reduce the load on the Kubernetes API server, making it essential for large clusters. Typha watches for events related to nodes, pods, network policies, and BGP configurations on the Kubernetes API server. It then caches and deduplicates this data, efficiently distributing events to its many clients, with Felix being its primary recipient.
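To make the caching and deduplication idea concrete, here is a toy sketch in Python. It is not Typha's implementation; the `FanOutCache` class and its method names are hypothetical and exist only to illustrate how a single upstream feed can serve many clients while suppressing events that change nothing.

```python
from typing import Callable, Dict, List, Tuple

Update = Tuple[str, str]  # (resource key, serialized value)


class FanOutCache:
    """Toy caching proxy: one upstream feed, many downstream clients.

    Hypothetical illustration only; Typha's real protocol is more involved.
    """

    def __init__(self) -> None:
        self._cache: Dict[str, str] = {}
        self._clients: List[Callable[[Update], None]] = []

    def add_client(self, callback: Callable[[Update], None]) -> None:
        # A new client first receives a snapshot of the cache, so it does
        # not need its own watch against the upstream API server.
        for key, value in self._cache.items():
            callback((key, value))
        self._clients.append(callback)

    def on_upstream_event(self, key: str, value: str) -> bool:
        # Deduplicate: drop events that do not change the cached value.
        if self._cache.get(key) == value:
            return False
        self._cache[key] = value
        # Fan out the change to every connected client.
        for callback in self._clients:
            callback((key, value))
        return True
```

In this sketch, the upstream is watched once regardless of how many clients register, which is the property that lets a proxy of this kind reduce load on the API server.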
The significance of Typha lies in its ability to enhance cluster scalability by minimizing the memory requirements associated with continuous watches and requests from calico-nodes. Without Typha, managing large clusters could become resource-intensive and challenging.
Felix: Ensuring Policy Compliance
Felix, a core component of Calico-Node, plays a pivotal role in implementing Calico Network policies. Calico-Node metrics provide insights into BGP information and traffic flowing through Calico Network Policies.
For Felix to operate seamlessly, it must remain in continuous sync with the datastore, ensuring the correct application of policies to the respective nodes. It is also worth noting that, as an extra layer of security, these monitoring ports are protected by TLS (for Felix) or mTLS (for Calico-Node).
Monitoring Typha and Felix to secure Kubernetes applications and optimize operations
Monitoring Calico-Node
Effective monitoring of Calico Enterprise involves paying attention to specific metrics that highlight the performance and health of key components. Calico-Node metrics can provide data on policy actions that can indicate misconfigurations or potential attacks.
Denied Traffic Metrics
These metrics represent the number of packets or bytes dropped by explicit or implicit deny rules. The goal is for these metrics to report zero under stable conditions, so that any deviation signals that policy and traffic have diverged.
Metric
- calico_denied_packets
- calico_denied_bytes
Example Value
calico_denied_packets{ endpoint="calico-metrics-port", instance="<node-FQDN>", job="calico-node-metrics", namespace="calico-system", pod="calico-node-6pcqm", policy="default|default-deny|0|deny", service="calico-node-metrics", srcIP="10.48.0.214" }
Threshold Recommendations
Maintaining these metrics at zero is ideal, but achieving this depends on the cluster’s stability and policy maturity. It may be more achievable to determine a baseline and then monitor for deviations. A deviation can indicate either of the following:
- Unexpected traffic flowing through the cluster is being denied
  - This may be desired if you are only permitting traffic that you expect to see in the cluster
- Or there is a misconfigured policy that is denying traffic
  - This is usually not desired and warrants using the metric output to investigate the policy that is dropping traffic
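The baseline approach above can be sketched as follows. This is a minimal illustration, not a Tigera-recommended formula: the function name, the choice of mean plus a number of standard deviations, and the default tolerance of three are all assumptions. The inputs are per-interval denied-packet figures derived from `calico_denied_packets` in whatever way your monitoring stack provides them.

```python
from statistics import mean, pstdev
from typing import Sequence


def denied_rate_deviates(
    baseline_rates: Sequence[float],
    current_rate: float,
    tolerance_sigmas: float = 3.0,
) -> bool:
    """Return True when current_rate sits above the baseline band.

    The band is mean + tolerance_sigmas * population standard deviation
    of the baseline samples (an assumed heuristic, not an official rule).
    """
    if not baseline_rates:
        raise ValueError("baseline_rates must not be empty")
    center = mean(baseline_rates)
    spread = pstdev(baseline_rates)
    return current_rate > center + tolerance_sigmas * spread
```

With an all-zero baseline the band collapses to zero, so any denied traffic at all is flagged, which matches the ideal case described above. The sketch only flags upward deviations; a sudden drop in denials is not treated as a breach.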
Threshold Breach Symptoms
Unexpected traffic denial or dropped packets due to policy misconfigurations.
Threshold Breach Recommendations
Investigate for potential attacks, or, if the denied flows are legitimate, allow them by updating policies accordingly.
Priority Level
Recommended
Monitoring Typha
Monitoring Typha involves tracking several metrics, each providing valuable insights into the component’s performance. Let’s explore key Typha metrics regarding its clients.
Client Connections Actively Streaming Metric
This metric is the current count of Typha connections in the “streaming” state, meaning they have completed the handshake. It indicates the number of clients actively connected to a Typha instance.
Metric
- typha_connections_streaming
Example Value
{instance="10.0.1.20:9093"} 10
{instance="10.0.1.31:9093"} 5
Threshold Recommendations
Compare this Client Connections Actively Streaming metric with the Total Connections Accepted metric. Their fluctuations should align, reflecting accepted connections transitioning into actively streaming connections. Any discrepancy warrants investigation.
It is worth noting that, in smaller clusters, an imbalance in Typha connections may occur, which is acceptable as Typha can manage numerous connections efficiently.
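One way to check that the two metrics move together is to compare their changes between consecutive samples. The sketch below is illustrative only: `misaligned_intervals` and its `slack` parameter are hypothetical, and it assumes the Total Connections Accepted metric is exposed as `typha_connections_accepted`, a name this post does not give. Both series must be sampled at the same timestamps for the same Typha instance.

```python
from typing import List, Sequence


def misaligned_intervals(
    accepted: Sequence[float],
    streaming: Sequence[float],
    slack: float = 0.0,
) -> List[int]:
    """Return indexes i where, between samples i and i+1, accepted
    connections grew by more than streaming connections did (plus slack)."""
    if len(accepted) != len(streaming):
        raise ValueError("both series need the same number of samples")
    flagged = []
    for i in range(len(accepted) - 1):
        accepted_delta = accepted[i + 1] - accepted[i]
        streaming_delta = streaming[i + 1] - streaming[i]
        # Accepted connections that never show up as streaming may point
        # to failed handshakes or clients reconnecting repeatedly.
        if accepted_delta - streaming_delta > slack:
            flagged.append(i)
    return flagged
```

A non-zero `slack` is a reasonable choice in practice, since a few connections may still be mid-handshake when a sample is taken.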
Threshold Breach Symptoms
Issues may arise where Felix isn’t receiving updates from Typha, causing Calico Network Policies to fall out of sync.
Threshold Breach Recommendations
Examine Typha and Felix logs to determine potential issues that could cause an imbalance.
Priority Level
Recommended
Monitoring Felix Metrics
Felix is at the heart of Calico and can provide many valuable metrics. Here we will look at how fast Felix is applying dataplane updates.
Dataplane Apply Time Quantiles Metrics
- felix_int_dataplane_apply_time_seconds{quantile="0.5"}
- felix_int_dataplane_apply_time_seconds{quantile="0.9"}
- felix_int_dataplane_apply_time_seconds{quantile="0.99"}
This metric denotes the time, in seconds, required to apply a dataplane update, viewed across the median (50th percentile), 90th percentile, and 99th percentile.
Example Value
felix_int_dataplane_apply_time_seconds{quantile="0.5"}: felix_int_dataplane_apply_time_seconds{endpoint="metrics-port", instance="10.0.1.30:9091", job="felix-metrics-svc", namespace="calico-system", pod="calico-node-6pcqm", quantile="0.5", service="felix-metrics-svc"} 0.020859218
Threshold Recommendations
Threshold values will vary based on cluster size and update frequency. As always, it is recommended to establish a baseline to define a normal threshold value.
Examples of what has been seen in the field:
- 3-node test cluster averaging 100ms
- 1000-node 15x federated cluster averaging 30s (potentially larger at Felix boot time)
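As a rough sketch of checking these quantiles against a baseline, the Python below parses samples in the Prometheus text exposition format and reports which quantiles exceed their thresholds. The helper names and the baseline values are assumptions for illustration; real thresholds should come from your own cluster's observed behavior.

```python
import re
from typing import Dict

# Matches quantile samples only; the _sum and _count series lack the
# opening brace directly after the metric name and are skipped.
_SAMPLE = re.compile(
    r"^felix_int_dataplane_apply_time_seconds\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$"
)
_QUANTILE = re.compile(r'quantile="(?P<q>[^"]+)"')


def apply_time_quantiles(exposition_text: str) -> Dict[str, float]:
    """Extract {quantile: seconds} from Prometheus text-format output."""
    quantiles: Dict[str, float] = {}
    for line in exposition_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue
        quantile = _QUANTILE.search(match.group("labels"))
        if quantile:
            quantiles[quantile.group("q")] = float(match.group("value"))
    return quantiles


def breached(
    quantiles: Dict[str, float], baseline_seconds: Dict[str, float]
) -> Dict[str, float]:
    """Return the quantiles whose value exceeds the given baseline."""
    return {
        q: value
        for q, value in quantiles.items()
        if q in baseline_seconds and value > baseline_seconds[q]
    }
```

For a small test cluster, a baseline such as `{"0.5": 0.1, "0.9": 0.2}` (hypothetical values) would flag a 90th percentile of 0.25 seconds while leaving a median of about 0.02 seconds alone.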
Threshold Breach Symptoms
Extended time-to-apply values can lead to delays between Calico Security Policy commits and their enforcement in the dataplane. This delay is influenced by Calico waiting for kube-proxy to release the iptables lock, which in turn depends on the number of services in use.
Threshold Breach Recommendations
Consider scaling cluster resources or reducing the number of Kubernetes services if feasible to mitigate prolonged dataplane update times.
Priority Level
Recommended
Conclusion
In conclusion, monitoring Calico Enterprise metrics, especially those related to Typha and Felix, is essential for maintaining a robust Kubernetes environment. Tigera’s recommended metrics provide a solid foundation for effective monitoring, ensuring that clusters operate at scale while upholding security and policy compliance. By understanding the significance of each metric and following the recommended monitoring practices, organizations can proactively address issues and enhance the overall performance of their Calico Enterprise deployments.
Ready to try Calico node-specific policies? Sign up for a free trial of Calico Cloud