Deadline: 30 Mar, 2025

They will also work closely with stakeholders to investigate and resolve incidents, perform root cause analysis, and propose solutions to increase the reliability and stability of the Databricks platform.

The impact you will have:

Monitor critical infrastructure, triage alerts to proactively identify incidents, and work with stakeholders to resolve incidents.
Investigate incidents and propose solutions to improve platform reliability and stability.
Perform root cause analysis for recurring incidents and provide proactive solutions.
Develop tooling or automate processes to improve platform monitoring and alerting.
Contribute to software development efforts to improve overall service reliability and stability.
Communicate with internal stakeholders, including executive staff, to provide incident analysis.
Participate in war rooms and temporary communication channels during outages.
Demonstrate cross-functional leadership and establish ownership of incidents and outages.
Multitask on several incidents and/or projects at once.

What we look for:

3 years of experience as a NOC, SRE, or DevOps engineer
Knowledge of cloud technologies such as Azure, AWS, and GCP
Hands-on experience with monitoring, logging, and alerting tools
Hands-on experience with containers and orchestration technologies
Automation and scripting skills
Linux systems administration skills
Knowledge of incident management
Excellent communication skills
Technical degree or equivalent experience
Willingness to learn the Databricks products

To apply for this job, please visit www.databricks.com.