Fixing network performance problems with Google’s networking tools
Syah Ismail
Imagine this scenario: Your network is down and everyone in the office, including the boss, is hounding you to get it fixed as soon as possible. That’s one of the challenges any IT department faces almost every day.
Whether you’re trying to troubleshoot a performance problem or understand your network performance in order to make optimal deployment decisions, Google Cloud has a comprehensive set of tools for network monitoring, verification and optimisation. With these tools, you can visualise, measure, troubleshoot and optimise network performance on Google Cloud as well as in your on-premises and hybrid environments.
These tools will help you answer any number of questions about your network performance. This blog post will give a tour of these tools and show you how to use them to answer your most common network performance questions.
Before we delve into the different performance troubleshooting scenarios that networking teams encounter, let’s take a quick look at the tools for troubleshooting the Google Cloud network and beyond.
Network Intelligence Center
Network Intelligence Center is Google Cloud’s comprehensive network monitoring, verification and optimisation platform for use across on-prem and cloud environments. Google’s vision with Network Intelligence Center is to enable “intelligent network operations” through four modules, with several more to follow. Three of them appear in this post: Performance Dashboard, Connectivity Tests and Network Topology.
PerfKit Benchmarker
PerfKit Benchmarker is an open-source tool created at Google that allows you to measure and understand performance across multiple clouds and hybrid deployments. PerfKit Benchmarker is a great tool for benchmarking your network performance to guide deployment decisions. Using PerfKit Benchmarker, Google has also introduced a live dashboard measuring median inter-region performance metrics.
Now, let’s take a look at some common network performance scenarios that you’ve probably been asked to troubleshoot in your role as a network engineer.
1. The application is down or performing poorly
In this scenario, the networking team has to triage whether or not the underlying network is the root cause. Network Intelligence Center’s Performance Dashboard can show you real-time performance metrics (latency and packet loss) between the zones where you have VMs, enabling you to quickly pinpoint where packet loss is occurring and whether the network is at fault at all.
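To make the two metrics concrete, here is a minimal sketch of the kind of triage they support. It does not call Performance Dashboard or any Google Cloud API. The probe data, the `summarise` and `suspect_pairs` helpers and the 1% loss threshold are all assumptions made up for illustration.

```python
from statistics import median

# Hypothetical probe results per zone pair: round-trip times in milliseconds,
# with None marking a probe that got no reply. Values are illustrative only.
PROBES = {
    ("us-central1-a", "us-central1-b"): [0.4, 0.5, 0.4, 0.6, 0.5],
    ("us-central1-a", "europe-west1-b"): [101.0, None, 99.5, None, 102.3],
}


def summarise(samples):
    """Return (packet_loss_percent, median_rtt_ms) for one zone pair."""
    if not samples:
        raise ValueError("at least one probe is required")
    replies = [rtt for rtt in samples if rtt is not None]
    loss_pct = 100.0 * (len(samples) - len(replies)) / len(samples)
    # With no replies at all there is no latency to report.
    median_rtt = median(replies) if replies else None
    return loss_pct, median_rtt


def suspect_pairs(probes, max_loss_pct=1.0):
    """List zone pairs whose packet loss exceeds the assumed threshold."""
    return [
        pair
        for pair, samples in probes.items()
        if summarise(samples)[0] > max_loss_pct
    ]


if __name__ == "__main__":
    for pair in suspect_pairs(PROBES):
        loss, rtt = summarise(PROBES[pair])
        print(pair, f"loss={loss:.1f}%", f"median_rtt={rtt}ms")
```

If no zone pair shows elevated loss or latency, the network is a less likely culprit and attention can shift to the application itself.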
2. The network is being blamed for an outage
Network Intelligence Center’s Connectivity Tests module is your go-to for diagnosing connectivity issues. It lets you self-diagnose problems within Google Cloud Platform (GCP), or from GCP to an external IP address that’s on-prem or in another cloud, so you can isolate whether or not the issue is in GCP. It can also detect misconfigurations, show how they affect connectivity and help you resolve them proactively before they lead to performance issues.
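One common kind of misconfiguration is a firewall rule that shadows another. The sketch below is a simplified model of priority-based ingress evaluation, assuming a lower number means higher priority, that deny wins a priority tie and that unmatched ingress traffic is denied. It is not how Connectivity Tests is implemented, and the `Rule` class, rule names and addresses are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int      # lower number is evaluated first
    action: str        # "allow" or "deny"
    source_range: str  # CIDR block the rule applies to
    protocol: str
    port: int


def evaluate_ingress(rules, src_ip, protocol, port):
    """Return (action, rule_name) for the first rule matching the packet."""
    src = ip_address(src_ip)
    # Sort by priority; on a tie, deny rules sort ahead of allow rules.
    ordered = sorted(rules, key=lambda r: (r.priority, r.action != "deny"))
    for rule in ordered:
        if (
            src in ip_network(rule.source_range)
            and rule.protocol == protocol
            and rule.port == port
        ):
            return rule.action, rule.name
    # Nothing matched: fall through to the implied deny for ingress.
    return "deny", "implied-deny-ingress"


# Hypothetical rule set: the broad deny has higher priority than the
# narrower allow, so SSH from the office range is blocked unintentionally.
RULES = [
    Rule("allow-ssh-office", 1000, "allow", "203.0.113.0/24", "tcp", 22),
    Rule("deny-all-ssh", 900, "deny", "0.0.0.0/0", "tcp", 22),
]

if __name__ == "__main__":
    print(evaluate_ingress(RULES, "203.0.113.10", "tcp", 22))
```

Tracing a packet through the rules this way shows which rule actually decides its fate, which is the sort of answer you want when the network is being blamed.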
3. Users in a region are experiencing delays
The Network Topology module in Network Intelligence Center allows you to visualise your network and the associated network performance metrics to better monitor network health. For instance, you can easily visualise how your users are being served worldwide and whether they are being served out of the closest geographical region.
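The "closest geographical region" check can be sketched in a few lines. This is an illustration only and does not use Network Topology. The region coordinates are rough approximations, and the `closest_region` and `misrouted` helpers are hypothetical names. Geographic distance is also only a proxy, since the nearest region is not always the one with the lowest latency.

```python
from math import asin, cos, radians, sin, sqrt

# Approximate (latitude, longitude) of a few regions, for illustration only.
REGIONS = {
    "us-central1": (41.26, -95.86),
    "europe-west1": (50.45, 3.82),
    "asia-southeast1": (1.35, 103.82),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def closest_region(user_location, regions=REGIONS):
    """Name of the region geographically nearest to the user."""
    return min(regions, key=lambda name: haversine_km(user_location, regions[name]))


def misrouted(user_location, serving_region, regions=REGIONS):
    """True if the user is served from a region other than the nearest one."""
    return closest_region(user_location, regions) != serving_region


if __name__ == "__main__":
    kuala_lumpur = (3.14, 101.69)
    print(closest_region(kuala_lumpur), misrouted(kuala_lumpur, "us-central1"))
```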
4. You need to benchmark performance to make deployment decisions
If you are thinking of migrating workloads to the cloud, you need visibility into expected network performance metrics so you can choose the best cloud and deployment architecture for your use case. PerfKit Benchmarker makes network performance benchmarking fast and easy by automating network setup, provisioning of VMs and test runs. To help with your decision, you can use the live dashboard to view median inter-region latency and throughput for Google Cloud networks.
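Once you have median inter-region latencies, a deployment decision can be as simple as picking the candidate region with the best worst case. The sketch below assumes a hand-written latency table. The numbers are invented rather than taken from the live dashboard or from PerfKit Benchmarker output, and treating same-region latency as zero is a simplification.

```python
# Hypothetical median inter-region latency in milliseconds (illustrative only).
MEDIAN_LATENCY_MS = {
    ("asia-southeast1", "asia-east1"): 48.0,
    ("asia-southeast1", "us-west1"): 165.0,
    ("asia-east1", "us-west1"): 120.0,
}


def latency(a, b, table=MEDIAN_LATENCY_MS):
    """Median latency between two regions, looked up in either direction."""
    if a == b:
        return 0.0  # simplification: ignore latency within a region
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]


def best_region(candidates, peers, table=MEDIAN_LATENCY_MS):
    """Candidate region with the lowest worst-case latency to all peers."""
    return min(
        candidates,
        key=lambda c: max(latency(c, p, table) for p in peers),
    )


if __name__ == "__main__":
    candidates = ["asia-southeast1", "asia-east1", "us-west1"]
    peers = ["asia-southeast1", "us-west1"]  # where users or dependencies live
    print(best_region(candidates, peers))
```

Minimising the worst case suits workloads where every peer matters equally. If most traffic comes from one region, a weighted average would be a more fitting objective.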
With the right tools, you can get a comprehensive understanding of your network performance and make well-informed decisions about where to put your workloads. You can also prevent outages and triage and troubleshoot performance issues quickly, so that your users get the best possible experience.