About Me
I am a Senior Site Reliability & Platform Engineer with over 8 years of experience architecting scalable cloud-native systems and leading engineering teams. My expertise spans Kubernetes, AWS infrastructure automation, and GenAI/ML platform reliability. I have a track record of maintaining 99.99% service availability, reducing Change Failure Rate (CFR) by 60%, and accelerating deployment velocity for more than 100 engineering teams. I bring deep experience in observability, incident response, and Infrastructure as Code (IaC), and I apply software engineering principles to achieve operational excellence.
Professional Experience
LinkedIn
Senior Site Reliability & Platform Engineer | June 2022 - Present | Sunnyvale, CA
- Architected and owned the Automated Stability Guard from concept to production MVP in 4 weeks, implementing an SLI/SLO framework and error budget management that maintained 99.99% service availability while reducing Change Failure Rate by 60%, from 50 to under 20 SEVs per 10k deployments.
- Designed and delivered a GenAI-powered incident detection platform in Go and gRPC, building microservices and APIs that reduced mean time to detect (MTTD) for production threats by 40% through automated, real-time observability dashboards.
- Led end-to-end delivery of a unified test platform using Playwright and GitOps automation, creating a self-service Internal Developer Platform (IDP) with standardized golden path templates for CI/CD that improved release velocity and accelerated feature development for more than 100 teams.
- Defined the 18-month technical roadmap for platform reliability, securing senior leadership buy-in to invest in initiatives that improved infrastructure scalability and observability company-wide.
- Architected and shipped RESTful APIs and microservices for key platform features, collaborating with product teams to deliver new functionality for 50M+ daily users.
Amazon Lab126 Inc.
System Development Engineer | Nov 2020 - June 2022 | Sunnyvale, CA
- Owned the architectural migration of a monolithic Java system to a microservices platform, maintaining a 99.99% availability SLO post-launch through a comprehensive observability framework and automated monitoring.
- Engineered an internal developer platform built on core Java and Python microservices, cutting new pipeline onboarding from 8 weeks to 4 weeks through self-service infrastructure automation and standardized deployment patterns.
- Implemented SLI and SLO frameworks with automated alerting and incident response procedures, establishing platform reliability standards that became the blueprint for all new distributed infrastructure operations.
Amazon Web Services
Platform and Reliability Engineer | Jan 2020 - Nov 2020 | Dallas, TX
- Accelerated platform adoption for thousands of internal AWS developers by architecting high-level Python SDK abstractions, reducing infrastructure complexity and development cycle time.
- Designed and shipped new API features and automation tools as part of the internal developer experience team, supporting both reliability and product velocity goals.
- Enhanced service availability across AWS infrastructure by engineering proactive client-side IAM validation for the internal S3 client, eliminating an entire class of permission-related runtime errors.
Technical Skills
Software Engineering
- Software Architecture, Distributed Systems Design, API Development & Versioning, Microservices, End-to-End Feature Delivery
- Unit & Integration Testing, Test-Driven Development (TDD), Test Automation Frameworks (JUnit, pytest)
- Agile Development, Software Performance Tuning, Code Review Best Practices
Site Reliability Engineering
- SLI/SLO Management, Error Budget Management, Incident Response, Capacity Planning
- Service Mesh (Istio), Platform Engineering, Production Readiness
Cloud and Infrastructure
- AWS (8+ years), Kubernetes, Docker, Terraform, CDK, Serverless, Multi-Region Design
Observability and Monitoring
- Prometheus, Grafana, OpenTelemetry (OTel), Distributed Tracing, Performance Monitoring
DevOps and Automation
- Jenkins, GitLab CI, ArgoCD, Helm, GitOps, CI/CD Pipelines, Infrastructure Automation
Programming Languages
- Python (8+ years), Go, Java, JavaScript, Bash/zsh Scripting
Databases and Emerging Tech
- MySQL, DynamoDB, PostgreSQL, GenAI Guardrails, Vector Databases
Education
- Santa Clara University: M.S. Computer Science, 2019
- Pune University: B.S. Computer Science, 2016
Connect
- GitHub: github.com/abhimanbhau
- LinkedIn: linkedin.com/in/abhimank
- LeetCode: leetcode.com/abhimanbhau
Contact
- Email: akolte@icloud.com