Lead AI Systems Engineer
About the Role
VectorGrid is a Series A climate-tech company on a mission to accelerate Europe's energy transition. Its real-time grid optimization platform helps energy traders and network operators balance supply and demand across renewable-heavy grids, reducing curtailment, cutting costs, and keeping the lights on.
Backed by Northzone and EQT Ventures, VectorGrid operates across the Netherlands, Germany, and the Nordics, with plans to expand into the UK and Iberian markets by 2027. The engineering team is 22 strong, and this role reports directly to the CTO.
As Lead AI Systems Engineer, you'll sit at the intersection of machine learning and infrastructure — architecting the systems that take forecasting models from Jupyter notebooks to production services making real-time decisions on critical energy infrastructure. This is not a research role. You'll be building production-grade ML pipelines that must be fast, reliable, and auditable.
The stack: Python and Rust services on AWS (EKS), real-time data ingestion via Apache Flink, model serving through a custom inference layer built on ONNX Runtime, and time-series storage in TimescaleDB. You'll inherit solid foundations and a team of three ML engineers who need technical leadership, not micromanagement.
This is a hybrid role based at VectorGrid's Amsterdam Zuidas office, with the team in the office two to three days per week. The company offers a competitive base salary, equity with a clear liquidity path, a €3,000 annual learning budget, and relocation support for candidates outside the Netherlands.
Responsibilities
- Architect and maintain the end-to-end ML platform — from feature engineering and model training pipelines through to real-time inference serving at sub-200ms latency
- Design high-availability deployment patterns for AI services that operate on critical energy infrastructure — these systems cannot go down during peak grid events
- Build and improve the real-time data ingestion layer (Apache Flink) processing 500,000+ grid telemetry events per second from SCADA systems and smart meters
- Collaborate with the ML research team to productionise forecasting models — translating prototype notebooks into versioned, monitored, and rollback-capable services
- Establish and enforce MLOps best practices — model versioning, A/B testing, drift detection, automated retraining triggers, and lineage tracking
- Lead the technical direction of the AI infrastructure squad (3 ML engineers), setting standards through architecture reviews, pairing, and written technical guidance
- Work closely with energy domain experts to ensure model outputs are interpretable and auditable for regulatory compliance across European energy markets
- Own observability and performance monitoring for all AI services — defining SLIs/SLOs, building dashboards, and driving down inference costs
Requirements
Essential
- 7+ years in software engineering with at least 3 years focused on ML infrastructure, MLOps, or AI platform engineering
- Strong production experience with Python — you write clean, tested, well-structured Python, not just scripts and notebooks
- Hands-on experience deploying and operating ML models in real-time production environments (not just batch inference)
- Deep understanding of model serving frameworks — ONNX Runtime, TensorFlow Serving, Triton, or similar
- Experience with stream processing systems (Apache Flink, Kafka Streams, or Apache Beam) for real-time feature computation
- Comfortable with cloud infrastructure on AWS or GCP — Kubernetes, Terraform, CI/CD pipelines
- Proven ability to lead a small technical team — setting direction, unblocking people, and raising the bar through code review and mentorship
- Excellent communication skills in English (VectorGrid's working language); Dutch is not required
Nice to Have
- Experience with Rust for performance-critical systems or data processing
- Background in energy, utilities, or other critical infrastructure domains
- Familiarity with time-series databases (TimescaleDB, InfluxDB, QuestDB)
- Understanding of forecasting techniques — time-series models, probabilistic forecasting, or reinforcement learning for control systems
- Contributions to ML infrastructure open-source projects (MLflow, Kubeflow, Feast, etc.)