Job seekers, please send resumes to resumes@hireitpeople.com
Detailed Job Description:
- Troubleshoot performance and availability issues in mission-critical full-stack applications, microservices, infrastructure, and legacy business applications and websites.
- Work with DevOps Architects to implement fault tolerance, backup, and disaster recovery solutions.
- Lead root cause analyses and investigations by identifying, analyzing, and remediating service performance and availability issues to maximize service uptime.
- Proactively search for system faults throughout the application lifecycle, both before and after release.
- Manage the incident lifecycle through to resolution and conduct blameless post-incident reviews.
- Work with the QA Lead to establish best practices for measuring and monitoring availability, latency, and overall system health. You are expected to be on call, have strong written communication skills, and be able to develop working relationships with coworkers.
- Experience in balancing service reliability, metrics, sustainability, technical debt, and operational toil for live services running at scale
- Implement chaos engineering concepts such as the Simian Army.
- Work across multiple project teams simultaneously to support rapid development efforts
- Solve complex, business-critical issues that impact bottom-line financial numbers and customer loyalty and experience
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
- Contribute positively to open-source projects and join existing communities
- Bring experience, pragmatism, empathy, and composure to interactions with teams outside of the RE organization
- Work frequently with Product teams on shared goals and cross-team projects
- Balance planned and reactive work using basic project planning techniques and technical roadmaps
- Collaborate across teams such as Application Services, Capacity Planning, Hardware, Network, and Datacenter Operations
- Participate in building advanced tooling for testing, monitoring, administration, and operations of multiple clusters across multiple environments
- Experience negotiating SLIs, SLOs, and SLAs with product owners
General/Minimum Qualifications:
- 3-5+ years of experience applying reliability and chaos engineering principles to distributed cloud services
- Strong knowledge of and comfort with GNU/Linux and Windows operating systems
- Proficiency in high-level languages such as Ruby, Python, PowerShell, and Bash
- Exposure to system-level languages such as Go, C/C++/C#
- Familiarity with configuration management software such as Puppet, Chef, Ansible, or Salt
- Source control, branching and merging, and packaging (Git, GitHub, NuGet, npm)
- Networking basics: TCP vs UDP, basic troubleshooting, HTTP load balancing, firewall, private networks, multi-tier design, scale-out
- Databases: RDBMS, NoSQL, SQL, analytics, persistent data
- Familiarity with standard infrastructure concepts like load balancers, firewalls, object storage and where/when they might be used
- Service management: incident response, change management, and problem management
- Experience with Kubernetes, Docker, Helm, and Virtual Machines
- Cloud computing concepts (one or more public cloud providers): VMs vs Docker containers, block storage vs object storage, infra automation vs install automation
- Experience operating a platform, software as a service, or shipping software
- Experience as an open-source contributor