Design & build large-scale distributed services for telemetry ingestion, event streaming, and command orchestration across edge and cloud environments
Implement real-time data pipelines using Kafka, NATS, or gRPC streams, ensuring low-latency, high-throughput processing
Maintain and optimize stateful services (Redis, InfluxDB, Postgres) for consistency, replication, and failover in multi-region deployments
Collaborate with embedded, controls, and ML teams to define API contracts, message schemas (Protobuf), and service SLAs
Develop infrastructure-as-code (Terraform, Helm) and CI/CD workflows to automate testing, security scans, and rolling upgrades
Monitor & troubleshoot production systems with Prometheus, Grafana, Jaeger, and custom observability tooling to meet 99.99% uptime goals
Champion best practices in reliability engineering, capacity planning, and incident response for distributed platforms
What You’ll Bring
5+ years building and operating distributed, fault-tolerant systems in production
Deep understanding of distributed systems concepts: consensus (Raft/Paxos), partition tolerance, consistency models, and backpressure
Hands-on experience with streaming platforms (Kafka, Pulsar) or message queues (NATS, RabbitMQ)
Expertise in container orchestration (Kubernetes), service mesh (Istio/Linkerd), and microservices architecture
Proficiency in systems programming (C++/Go/Python) and strong CS fundamentals (algorithms, data structures, networking)
Solid background with observability stacks (Prometheus/Grafana, OpenTelemetry, Jaeger/Zipkin)
Track record of automating infrastructure (Terraform, Ansible) and building reliable CI/CD pipelines
Excellent communication skills, a collaborative mindset, and a bias for pragmatic solutions

Join Menlo.ai’s Distributed Systems team and help architect the resilient infrastructure that underpins the future of autonomous robotics and AI-driven services.