
DevOps/Platform Engineer (Ref 23045)
About us
At Programize, we partner with teams of all sizes - from startups to established enterprises - across industries and continents to create innovative, high-impact software products. We don’t just implement requirements; we turn ambitious ideas into marketable software solutions we are genuinely proud to put our names on. With 200+ successfully delivered projects behind us, we’ve tackled everything from greenfield architectures to complex, large-scale platforms.
Our vision is to become the go-to company for entrepreneurs and engineers who want to design and develop impactful, scalable software systems.
To achieve that, we need talented professionals to join our team and share our enthusiasm for technology and innovation.
The Role
We are looking for a DevOps / Platform Engineer to join our team and collaborate with an international organization building a cutting-edge AI agent evaluation platform.
The platform runs large-scale AI benchmarks, multi-agent simulations, and automated evaluation workflows, powered by distributed systems running on Google Cloud. You will work closely with backend, ML, and frontend engineers to ensure the infrastructure is reliable, scalable, observable, and easy to operate.
This is a high-ownership role where you will play a key part in making deployments, monitoring, and operations repeatable, secure, and production-ready.
What You Will Do
- Cloud infrastructure as code. Own and extend our Terraform estate across multiple GCP environments (`base`, `core`, `obs`, `dev`, `test`, `prod`), including GKE clusters, Cloud SQL (Postgres/MySQL), networking, buckets, and IAM. Drive the in-progress "Neo" platform rollout and the cutover/retirement of legacy infrastructure.
- Kubernetes & containers. Manage workloads on GKE, maintain Dockerfiles and Helm-style application configs for ~10 backend services, and tune autoscaling, resource limits, and pod disruption budgets.
- CI/CD. Maintain and improve our GitHub Actions pipelines: PR checks (Python/JS lint, type-check, tests), Terraform prechecks, image builds and pushes, auto-deploy, and DB-migration labelling/gating. Reduce build times and flakiness, and make deploys self-service for product teams.
- Data & messaging infrastructure. Operate Postgres, Redis, and Celery-based async workers; manage Alembic migrations, queue health, and backpressure for long-running simulation jobs.
- Observability. Own our monitoring stack: Grafana dashboards, ClickHouse, Langfuse (LLM tracing), and Celery queue metrics. Build alerting and SLOs so we catch issues before customers do.
- Security & secrets. Manage secret distribution, least-privilege IAM, and remediation tracking. Partner with engineering on findings in our security assessment process.
- Cost & reliability. Keep an eye on cloud and LLM-proxy (LiteLLM) spend, right-size resources, and improve resilience of the simulation and evaluation pipelines.
What You Have
- 3+ years in DevOps / SRE / Platform Engineering, or strong backend experience with heavy infra ownership.
- Solid hands-on Terraform (modules, state, multi-environment) and cloud experience (GCP preferred; AWS/Azure transferable).
- Production Kubernetes experience: deployments, services, autoscaling, debugging pods, rollouts/rollbacks.
- Strong Docker fundamentals and comfort writing/optimising Dockerfiles.
- CI/CD pipeline design and maintenance (GitHub Actions, or equivalent like GitLab CI / CircleCI).
- Comfortable scripting and reading code in Python and/or Bash; able to navigate a polyglot monorepo.
- Operational experience with relational databases and managed database services (migrations, backups, performance).
- A reliability mindset: monitoring, alerting, incident response, and writing runbooks.
Nice to have
- Experience operating Celery / distributed task queues and Redis at scale.
- Familiarity with LLM/AI infrastructure (model proxies, GPU scheduling, token/cost management).
- Observability tooling depth (Grafana, Prometheus, ClickHouse, OpenTelemetry, Langfuse or similar tracing).
- Security/compliance experience (IAM hardening, secret management, vulnerability remediation).
- Cost-optimisation experience for cloud + third-party API spend.
- Experience supporting a monorepo with multiple language ecosystems and editable/internal package dependencies.
What to expect from us
Programize was founded on the values of respect and appreciation for customers and colleagues alike. We believe in equal opportunity, diversity, flexibility, hard work and continuous improvement in all aspects of our company. We want our people to feel happy, creative, productive and motivated. At Programize you will find the following:
- Friendly, respectful and appreciative working environment.
- Competitive remuneration package.
- On-site and remote working options.
- Lab-like, collaborative, and engaging environment.
- Continuous learning and growth opportunities.
- International working environment.
- Work-life balance.
- Private health insurance plan, including dependents.
Disclaimer:
Programize collects and processes personal data in accordance with the EU General Data Protection Regulation (GDPR). We are bound to use the information provided in your job application for recruitment purposes only and not to share it with any unauthorized third parties. All applications will be treated as strictly confidential.
