Job ID: 28106
Company: Internal Postings
Location: Chicago, IL
Type: Contract
Duration: 6 Months
Salary: DOE
Status: Active
Openings: 1
Posted: 28 Aug 2020
Job seekers, please send resumes to resumes@hireitpeople.com

Detailed Job Description:

  • Troubleshoot performance and availability issues in mission-critical full-stack applications, microservices, infrastructure, and legacy business applications and websites.
  • Work with DevOps Architects to implement fault tolerance, back-up, and disaster recovery solutions.
  • Lead root cause analyses and investigations by identifying, analyzing, and remediating service performance and availability issues to maximize service uptime.
  • Proactively look for system faults throughout the application lifecycle, both before and after release.
  • Manage the incident lifecycle through to resolution and conduct blameless post-incident reviews.
  • Work with the QA Lead to establish best practices for measuring and monitoring availability, latency, and overall system health. You're expected to be on call, have strong written communication skills, and be able to develop working relationships with coworkers.
  • Experience in balancing service reliability, metrics, sustainability, technical debt, and operational toil for live services running at scale
  • Implement Chaos Engineering concepts such as the Simian Army.
  • Work across multiple project teams simultaneously to support rapid development efforts
  • Solve complex, business-critical issues that affect bottom-line financial numbers and customer loyalty and experience
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
  • Contribute positively to open source projects and join their existing communities
  • Bring experience, pragmatism, empathy, and composure to interactions with teams outside of the RE organization
  • Work frequently with Product teams on shared goals and cross-team projects
  • Balance planned and reactive work using basic project planning techniques and technical roadmaps
  • Collaborate across teams such as Application Services, Capacity Planning, Hardware, Network, and Datacenter Operations
  • Participate in building advanced tooling for testing, monitoring, administration, and operations of multiple clusters across multiple environments
  • Experience negotiating SLIs, SLOs, and SLAs with product owners

General/Minimum Qualifications:

  • 3-5+ years of experience applying reliability and chaos engineering principles to distributed cloud services
  • Strong knowledge of and comfort with GNU/Linux and Windows operating systems
  • Proficiency in high-level languages such as Ruby, Python, PowerShell, and Bash
  • Exposure to system-level languages such as Go, C/C++/C#
  • Familiarity with configuration management software such as Puppet, Chef, Ansible, or Salt
  • Source control, branching and merging, and packaging (Git, GitHub, NuGet, npm)
  • Networking basics: TCP vs UDP, basic troubleshooting, HTTP load balancing, firewall, private networks, multi-tier design, scale-out
  • Databases: RDBMS, NoSQL, SQL, analytics, persistent data
  • Familiarity with standard infrastructure concepts such as load balancers, firewalls, and object storage, and where and when they might be used
  • Service Management: Incident Response, Change Management, and Problem Management
  • Experience with Kubernetes, Docker, Helm, and Virtual Machines
  • Cloud computing concepts (one or more public cloud providers): VMs vs Docker containers, block storage vs object storage, infra automation vs install automation
  • Experience operating a platform, software as a service, or shipping software
  • Experience as an open-source contributor