
Manager, Site Reliability Engineering (SRE) - eCommerce Job At The Home Depot in Toronto, ON  M3C 4H9

Manager, Site Reliability Engineering (SRE) - eCommerce

1 Concorde Gate, Toronto, ON  M3C 4H9

Req125256 Full Time Corporate Remote

With a career at The Home Depot, you can be yourself and also be part of something bigger.
 

Position Overview:
 The Manager, SRE will lead a team of Site Reliability Engineers to ensure the reliability, performance, and operational support of our eCommerce systems, with a focus on Google Cloud Platform (GCP) environments. This role requires a strong background in reliability reviews, performance engineering practices, production engineering, and operational support, with emphasis on DevOps principles and GCP expertise.
Responsibilities:

  • Leadership & Management:
    • Lead and mentor a team of Site Reliability Engineers
    • Foster a culture of continuous improvement and innovation
    • Collaborate with cross-functional teams to align SRE practices with business objectives
  • Reliability & Performance:
    • Conduct reliability reviews to identify areas for improvement and implement solutions to enhance system reliability, particularly in GCP environments
    • Implement and promote performance engineering practices to ensure optimal system performance on GCP
    • Develop and maintain service level objectives (SLOs) and error budgets
  • Production Engineering & Operational Support:
    • Oversee production engineering efforts to ensure systems are designed for operational excellence and reliability, leveraging GCP services and best practices
    • Manage incident response and post-incident reviews to minimize downtime and improve system resilience
    • Implement monitoring, alerting, and observability solutions to proactively identify and address issues
    • Develop and maintain runbooks and playbooks for common operational tasks
    • Coordinate with security teams to ensure compliance with security policies and best practices
  • DevOps & Continuous Improvement:
    • Drive DevOps initiatives to improve collaboration between development and operations teams, with a focus on GCP-native tools and services
    • Implement and maintain CI/CD pipelines to streamline deployment processes in GCP environments
    • Identify and implement automation opportunities to reduce manual tasks and improve efficiency
    • Promote the use of Infrastructure as Code (IaC) to manage and provision cloud resources
    • Continuously evaluate and integrate new tools and technologies to enhance DevOps practices
  • Release Management:
    • Implement and maintain release management best practices to minimize disruptions and maximize system stability
    • Collaborate with DevOps teams to integrate release management into CI/CD pipelines
    • Oversee release schedules, ensuring minimal impact on business operations
    • Ensure there is a rigorous release readiness process in place that includes reviews and post-release retrospectives
    • Maintain a release calendar and communicate release plans to stakeholders
  • Strategic Planning:
    • Create and maintain a strategic roadmap for SRE initiatives, aligning with business goals and technological advancements
    • Refine and standardize Standard Operating Procedures (SOPs) to enhance operational efficiency and consistency
    • Address customer pain points by developing and implementing solutions that improve user experience and system reliability
    • Engage with stakeholders to understand their needs and incorporate feedback into strategic planning and execution
    • Monitor industry trends and best practices to ensure the SRE team remains at the forefront of technology


Experience:

  • Bachelor’s degree in Computer Science, Engineering, or a related field
  • Strong problem-solving and analytical abilities
  • Excellent communication and collaboration skills
  • 4-6 years of relevant work experience, including significant experience with GCP
  • Extensive experience with cloud infrastructure, GCP services and architecture
  • Proven track record of managing and optimizing large-scale systems on GCP
  • Proven ability to effectively communicate with individuals at all levels of the organization
  • Ability to maintain relationships and negotiate with vendors
  • Ability to operate in and leverage resources in a matrixed environment
  • Ability to analyze and present data to support ideas
  • Ability to clearly communicate to all levels of the organization
THIS JOB IS CLOSED
We strive to maintain a culture that welcomes everyone, and we believe it helps us achieve our business goals by driving excellent customer service and innovation, empowering our associates to thrive and excel, and enriching the communities in which we operate. This includes creating an environment where our associates feel welcomed, valued and respected and providing equal opportunity for all of our associates. If you require accommodation during the recruitment process, please contact accessibility_Canada@homedepot.com.
