Over the last two years, I worked as a DevOps Engineer for companies hosting on Google Cloud Platform. During this time I advised many developers and learned a few things I wish someone had told me when I started. In this post, I'm describing some best practices that helped me and the organizations I worked for.
Basics
Setup Billing Export
While the billing console is quite nice, sometimes it can be useful to have a bit more insight into things.
Fortunately, you can export your entire raw billing data into BigQuery. Doing this costs practically nothing, but allows you to query the data using SQL. I've used this in the past to create reports that quickly surface anomalies between months.
If you don't want to use SQL yourself, there are also FinOps tools that can use this data.
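If you want to automate such a report, here's a rough Terraform sketch. All project and dataset names are made up, the `gcp_billing_export_v1_…` table name is a placeholder for the table the export creates, and the export itself is switched on in the Cloud Billing console rather than via Terraform:

```hcl
# Dataset that receives the raw billing export. The export itself is
# enabled in the Cloud Billing console and pointed at this dataset.
resource "google_bigquery_dataset" "billing_export" {
  project    = "my-billing-project"
  dataset_id = "billing_export"
  location   = "EU"
}

# A scheduled query that aggregates cost per invoice month and service,
# which makes month-over-month anomalies easy to spot.
resource "google_bigquery_data_transfer_config" "monthly_cost_report" {
  project                = "my-billing-project"
  display_name           = "monthly-cost-per-service"
  location               = "EU"
  data_source_id         = "scheduled_query"
  schedule               = "every 24 hours"
  destination_dataset_id = google_bigquery_dataset.billing_export.dataset_id

  params = {
    destination_table_name_template = "monthly_cost_per_service"
    write_disposition               = "WRITE_TRUNCATE"
    # The export table is named after your billing account ID; XXXXXX is a placeholder.
    query = <<-EOT
      SELECT
        invoice.month       AS invoice_month,
        service.description AS service,
        ROUND(SUM(cost), 2) AS total_cost
      FROM `my-billing-project.billing_export.gcp_billing_export_v1_XXXXXX`
      GROUP BY invoice_month, service
    EOT
  }
}
```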
Use managed services whenever possible
Need RabbitMQ? Just use PubSub and save yourself the trouble of installing and maintaining a highly-available cluster.
Need Kafka? Use PubSub instead, unless you actually need stream replay.
Need MySQL/PostgreSQL? Use Cloud SQL and enjoy setting up replicas with just a few clicks.
Need a Data Warehouse? Just use BigQuery, it's awesome and has a rich ecosystem. Except Looker Studio, that one is a total pain to use.
You get the idea: Whatever managed service you can use, go for it. It's almost always easier and usually cheaper than spending the time to run it yourself.
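To make the point concrete, here's a minimal Terraform sketch (hypothetical names) of what "running a message broker" looks like when it's PubSub instead of a self-managed, highly-available cluster:

```hcl
# A topic and a pull subscription -- that's the whole "cluster".
resource "google_pubsub_topic" "orders" {
  name = "orders"
}

resource "google_pubsub_subscription" "orders_worker" {
  name  = "orders-worker"
  topic = google_pubsub_topic.orders.id

  # Consumers have 30 seconds to acknowledge a message before redelivery.
  ack_deadline_seconds = 30

  # Retain messages for up to 7 days, in case a consumer is down.
  message_retention_duration = "604800s"
}
```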
Project setup
Have dedicated projects for everything
Give every team their own folder. Give every application its own project.
In the beginning, this might sound like overkill. But time has shown over and over again that systems often grow or change ownership, and having multiple teams working in the same project will become a headache at some point in the future.
Additionally, there are rarely easy or cheap ways to move resources or data between projects - splitting them up later is therefore often dangerous and expensive.
Applying this is easier when combined with the next point:
You can Infrastructure-as-Code not just your project resources, but also the Google Cloud projects themselves.
Instead of clicking new projects together in the GUI, create a Terraform/OpenTofu module that, for example (see the sketch after this list):
- Configures which billing account to use (and lets you move everything at once whenever you change billing accounts)
- Enables a set of standard services
- Sets up dedicated Service Accounts for Terraform/OpenTofu or deployments
- Implements a log sink so all of your log messages end up in a central place
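A heavily condensed sketch of what such a module could contain (all variable names and IDs below are hypothetical):

```hcl
resource "google_project" "this" {
  name            = var.name
  project_id      = var.project_id
  folder_id       = var.team_folder_id      # every team gets its own folder
  billing_account = var.billing_account_id  # change it here, roll it out everywhere
}

# Enable a standard set of services on every project.
resource "google_project_service" "default" {
  for_each = toset(["compute.googleapis.com", "logging.googleapis.com", "monitoring.googleapis.com"])
  project  = google_project.this.project_id
  service  = each.value
}

# Dedicated Service Account for Terraform/OpenTofu or deployments.
resource "google_service_account" "deployer" {
  project      = google_project.this.project_id
  account_id   = "deployer"
  display_name = "Terraform / deployment service account"
}

# Ship all log messages to a central logging project.
resource "google_logging_project_sink" "central" {
  project                = google_project.this.project_id
  name                   = "central-logs"
  destination            = "logging.googleapis.com/projects/${var.central_logging_project}/locations/global/buckets/central"
  unique_writer_identity = true
}
```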
This also makes it easy to globally enforce the next point:
Label your projects with teams and owners
Google Cloud allows you to label entire projects; the labels can be viewed in the IAM section of the console. Especially in larger organisations, add owner labels (name and contact) and other attributes like cost centers.
This enables central teams like DevOps and Security to quickly find out who's responsible for a project and it's also useful for billing analysis.
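Assuming the project module from above, the labels are just another argument on the project resource. The values below are made up; note that label values may only contain lowercase letters, numbers, dashes and underscores, so no full e-mail addresses:

```hcl
resource "google_project" "this" {
  name       = "checkout-prod"
  project_id = "acme-checkout-prod"

  labels = {
    team        = "checkout"
    owner       = "jane-doe"   # no '@' or '.' allowed in label values
    cost-center = "cc-1234"
  }
}
```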
Label everything, FinOps will thank you later
While we're at it, label everything else that can be labelled (a Terraform sketch follows this list):
- Label buckets with their name, so you can filter for them in the billing data
- Label IP addresses to see how much you're paying for not migrating to IPv6
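Two hypothetical examples in Terraform, a labelled bucket and a labelled static IP (depending on your provider version, labels on addresses may require the beta provider):

```hcl
resource "google_storage_bucket" "assets" {
  name     = "my-company-assets"
  location = "EU"

  labels = {
    bucket = "my-company-assets"  # bucket names don't show up in billing data, labels do
    team   = "frontend"
  }
}

resource "google_compute_address" "legacy_ipv4" {
  name   = "legacy-ipv4"
  region = "europe-west1"

  labels = {
    purpose = "legacy-ipv4"
  }
}
```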
As with many things, the earlier you start, the easier it will be. Trust me, labeling hundreds of resources at once is not a lot of fun 😉
Security
Have a restore strategy
Many companies have backup strategies: They create copies of some of their data somewhere else. Sometimes they even check if it works.
But what happens when you actually need to restore your data?
- Does the backup actually exist, or did the process break months ago and nobody noticed?
- Are there written and up-to-date instructions, or has the only person with the knowledge left the company a year ago?
- What about dependencies between data? Do you have a consistent view of your world, or are the backups of your datastores hours apart and contain broken references?
- If an entire GCP region goes down (like the Paris 2023 incident), are you able to spin up your 30 microservices in another region and make them work again?
When planning your restore strategy, do not start with "what needs to be backed up". Instead, start at the very end (your applications) and walk backwards through all the dependencies your infrastructure has and everything you need to run it. This also helps you discover data that is normally overlooked, like container images, as well as the order in which things need to be brought back up.
Here are some fun facts about GCP services:
- BigQuery datasets are not safe against regional failure (even when selecting the multi-region option). You need to configure cross-region replication yourself and maybe keep some backups in Cloud Storage as well.
- If you delete a Cloud SQL database instance, all of its backups are also gone. Make sure instances can only be deleted by superadmins (see the sketch below).
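For Cloud SQL, a sketch of that last point could look like this (names are made up): `deletion_protection` makes Terraform refuse to destroy the instance, while `deletion_protection_enabled` blocks deletion on the API side as well.

```hcl
resource "google_sql_database_instance" "main" {
  name             = "main-postgres"
  database_version = "POSTGRES_15"
  region           = "europe-west1"

  # Terraform-level guard: plans that would destroy the instance fail.
  deletion_protection = true

  settings {
    tier = "db-custom-2-7680"

    # API-level guard: deletion is blocked even outside of Terraform.
    deletion_protection_enabled = true

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
    }
  }
}
```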
As admin: Have a less-powerful account for daily work
As a DevOps engineer, you often have Owner permissions on the entire organization. This is comfortable, but only actually needed in certain situations.
Instead, use a less powerful account for your daily work, which has just Viewer permissions everywhere. This approach also prevents you from doing things outside of Terraform, but still lets you debug issues. Developers should have similar permissions for their production environments.
To increase security, use a hardware security key for the admin account and a dedicated browser just for it - to prevent any credentials or cookies from ever being leaked by accident. The more powerful an account is, the higher the barriers to using (and breaching) it should be.
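If you manage IAM with Terraform anyway, the day-to-day account's read-only access could be a single org-level binding like this (org ID and user are placeholders):

```hcl
# Grant the daily-driver account Viewer on the whole organization.
resource "google_organization_iam_member" "daily_viewer" {
  org_id = "123456789012"
  role   = "roles/viewer"
  member = "user:jane@example.com"
}
```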
If you're using external auth (like Entra ID), have break-glass accounts
If you have set up your environment to use third-party authentication like Entra ID (formerly Azure AD), make sure you still have accounts that work without that vendor: if authentication breaks for any reason (downtime, misconfiguration), this ensures that you can still access and manage your cloud environment.
Use KMS for encrypting secrets
You should not store any credentials in your repositories. In cases where that's absolutely necessary, use sops with Google Cloud KMS as the key storage. Most tutorials use age because it's simple to use, but if you're careless, you might either lose or leak your keys. By using KMS instead, you can clearly control who can encrypt and decrypt secrets.
If you don't want to use sops, you can also use the gcloud kms CLI to directly encrypt/decrypt files.
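A sketch of a dedicated key for sops in Terraform, with encrypting and decrypting split between principals (all names are made up):

```hcl
resource "google_kms_key_ring" "secrets" {
  name     = "secrets"
  location = "europe"
}

resource "google_kms_crypto_key" "sops" {
  name     = "sops"
  key_ring = google_kms_key_ring.secrets.id
}

# Developers may encrypt new secrets...
resource "google_kms_crypto_key_iam_member" "encrypters" {
  crypto_key_id = google_kms_crypto_key.sops.id
  role          = "roles/cloudkms.cryptoKeyEncrypter"
  member        = "group:developers@example.com"
}

# ...but only the deployment Service Account can decrypt them.
resource "google_kms_crypto_key_iam_member" "decrypters" {
  crypto_key_id = google_kms_crypto_key.sops.id
  role          = "roles/cloudkms.cryptoKeyDecrypter"
  member        = "serviceAccount:deployer@my-project.iam.gserviceaccount.com"
}
```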
Cloud SQL: Use the auth proxy to not bother with passwords at all
While it's a bit complicated to configure, it's possible to authenticate against Cloud SQL instances using Service Accounts. This way your workloads like VMs, Kubernetes Pods or Cloud Run services do not need any passwords to connect to the database. The proxy also ensures a TLS-encrypted connection and uses short-lived certificates. Your security department will love you for that :)
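A rough Terraform sketch of the moving parts (instance, Service Account and project names are made up): the instance needs the IAM authentication flag, the Service Account becomes a database user, and it needs the client/instance-user roles to connect through the proxy.

```hcl
resource "google_sql_database_instance" "main" {
  name             = "main-postgres"
  database_version = "POSTGRES_15"
  region           = "europe-west1"

  settings {
    tier = "db-custom-2-7680"

    # Allow IAM principals to authenticate as database users.
    database_flags {
      name  = "cloudsql.iam_authentication"
      value = "on"
    }
  }
}

# The workload's Service Account becomes a database user. For Postgres,
# the ".gserviceaccount.com" suffix is dropped from the user name.
resource "google_sql_user" "app" {
  instance = google_sql_database_instance.main.name
  name     = "app@my-project.iam"
  type     = "CLOUD_IAM_SERVICE_ACCOUNT"
}

# The same account needs to connect via the proxy and log in via IAM.
resource "google_project_iam_member" "app_sql_roles" {
  for_each = toset(["roles/cloudsql.client", "roles/cloudsql.instanceUser"])
  project  = "my-project"
  role     = each.value
  member   = "serviceAccount:app@my-project.iam.gserviceaccount.com"
}
```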