nClouds is conducting an off-campus recruitment drive to hire candidates for the Site Reliability Engineer position. Interested candidates can read the details below and apply as soon as possible.
About: nClouds is a certified, award-winning provider of AWS and DevOps consulting and implementation services. **AWS Premier Consulting Partner.** We are an integrated team of skilled engineers, architects, developers, project managers, and sales & marketing professionals who are passionate about software excellence, innovation, and client success. We work with organizations of all sizes, in all industries, including some of the coolest startups and growth companies in Silicon Valley.
Position: Site Reliability Engineer
Requirements:
- Bachelor’s degree in computer science (preferred) or an equivalent management, technical, or scientific discipline.
- A clear understanding of SRE principles and practices, as well as Agile and DevOps methodologies.
- Experience applying the AWS Well-Architected Framework to implement scalable and reliable infrastructure.
- Great team player with the flexibility to work in 24/7 rotating shifts.
- Excellent written/verbal communication and leadership skills.
Responsibilities:
- Ensure the availability and reliability of distributed systems.
- Help the L1 team resolve clients’ infrastructure issues, escalations, alerts, tickets, and queries.
- Work as a bridge between DevOps and other teams to build and maintain resilient systems.
- Conduct, coordinate, and oversee post-incident root cause analyses and reviews.
- Build and maintain documentation for all assigned clients and projects.
- Apply DevOps and Agile methodologies and standards in day-to-day work.
- Propose and adopt automation of repetitive tasks to reduce or eliminate toil.
- Implement and troubleshoot using observability tools such as Datadog, New Relic, Splunk, and CloudWatch.
- Adopt SRE practices and ensure the team follows them.
- Maintain AWS managed resources, CI/CD pipelines, and infrastructure as code (IaC).
- Plan and implement disaster recovery and backup plans for AWS cloud platforms.
- Proactively work on efficiency and capacity planning.
- Eliminate toil from repetitive tasks and proactively spot problems, areas for improvement, and performance bottlenecks.
- Liaise and work closely with Layer-1 on-call support, DevOps, and Operations teams.
- Drive availability and reliability by defining and implementing SLIs, SLOs, error budgets, observability, disaster recovery, and backups to detect and mitigate issues.
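
For candidates less familiar with the terms in the last item, the relationship between an SLI, an SLO, and an error budget can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the 99.9% target, and the request counts are assumptions chosen for the example, not nClouds tooling or targets.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO allows over a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent, based on request counts.

    The SLI here is good_events / total_events (hypothetical request-based SLI).
    A negative result means the budget is exhausted and the SLO is breached.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = (1.0 - slo_target) * total_events  # failures the SLO tolerates
    actual_bad = total_events - good_events          # failures actually observed
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Assumed example: 99.9% SLO over 30 days, 1,000,000 requests, 500 failures.
    print(f"Allowed downtime: {error_budget_minutes(0.999):.1f} minutes")
    print(f"Budget remaining: {budget_remaining(0.999, 999_500, 1_000_000):.0%}")
```

In this sketch, spending the budget faster than the window elapses is the signal to slow releases and prioritize reliability work.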