Senior Site Reliability Engineer - CTJ - Poly
Location: Redmond
Posted on: June 23, 2025
Job Description:
Are you interested in shaping the future of Microsoft 365
products that empower our customers to seamlessly create,
collaborate, and share within government cloud environments? In
this role, you will leverage your expertise in software
development, online services, and AI to envision, design, and
improve upon next-generation Microsoft 365 government cloud service
offerings. The Site Reliability Engineering (SRE) team provides
leadership, direction and accountability for application
architecture, system design, and end-to-end implementation. As a
Senior Site Reliability Engineer, you will identify and deliver
software improvements using your expertise in software development,
AI, complexity analysis, and scalable system design.

Microsoft’s
mission is to empower every person and every organization on the
planet to achieve more. As employees we come together with a growth
mindset, innovate to empower others, and collaborate to realize our
shared goals. Each day we build on our values of respect,
integrity, and accountability to create a culture of inclusion
where everyone can thrive at work and beyond.

Qualifications

Required/Minimum Qualifications: 6 years technical experience in
software engineering, network engineering, or systems
administration OR Bachelor's Degree in Computer Science, Information
Technology, or related field AND 3 years technical experience in
software engineering, network engineering, or systems
administration OR Master's Degree in Computer Science, Information
Technology, or related field AND 2 years technical experience in
software engineering, network engineering, or systems
administration.

Other Requirements:

Security Clearance Requirements: Candidates must be able to meet
Microsoft, customer, and/or government security screening
requirements for this role. These requirements include, but are not
limited to, the
following specialized security screenings: Candidates must have an
active TS and be willing to upgrade to TS/SCI (with polygraph) or
have an active TS/SCI and be willing to upgrade to TS/SCI (with
polygraph). This role will require candidates to maintain the
TS/SCI (with polygraph) clearance. The ability to meet Microsoft,
customer, and/or government security screening requirements is
required for this role. Failure to maintain or obtain the
appropriate clearance and/or customer screening requirements may
result in employment action up to and including termination.
Clearance Verification: This position requires successful
verification of the stated security clearance to meet federal
government customer requirements. You will be asked to provide
clearance verification information prior to an offer of employment.
Microsoft Cloud Background Check: This position will be required to
pass the Microsoft Cloud background check upon hire/transfer and
every two years thereafter. Citizenship & Citizenship Verification:
This position requires verification of U.S. citizenship due to
citizenship-based legal restrictions. Specifically, this position
supports United States federal, state, and/or local
government agency customers and is subject to certain
citizenship-based restrictions where required or permitted by
applicable law. To meet this legal requirement, citizenship will be
verified via a valid passport, or other approved documents, or
verified US government clearance.

Preferred/Additional Qualifications: 7 years technical experience in
software engineering, network engineering, or systems administration OR
Bachelor's Degree in Computer Science, Information Technology, or
related field AND 4 years technical experience in software
engineering, network engineering, or systems administration OR
Master's Degree in Computer Science, Information Technology, or
related field AND 3 years technical experience in software
engineering, network engineering, or systems administration OR
Doctorate Degree in Computer Science, Information Technology, or
related field.

Site Reliability Engineering IC4 - The typical base
pay range for this role across the U.S. is USD $119,800 - $234,700
per year. There is a different range applicable to specific work
locations, within the San Francisco Bay area and New York City
metropolitan area, and the base pay range for this role in those
locations is USD $158,400 - $258,000 per year. Microsoft will
accept applications for the role until June 26, 2025.

Responsibilities

Technical Knowledge and Domain-Specific Expertise

Demonstrates end-to-end expertise in distributed systems design,
interactions between cloud technology layers and components,
functions of physical network devices, and dependencies at scale.
Drives efforts within an organization to identify and recommend
optimal configurations of cloud technology solutions and develops
or modifies the code base that defines infrastructures to improve
the reliability and operability of supported products. Develops
end-to-end technical expertise in the architecture, code, features,
and operations of specific products as required to implement
improvements in product availability, reliability, efficiency,
observability, and/or performance. Drives code/design reviews with
the engineering teams that develop and/or manage those products and
shares learnings and recommendations across engineering teams
working on related products within their organization. Researches
and maintains deep knowledge of industry trends as well as advances
in large-scale distributed systems and cloud technologies;
identifies opportunities to create, implement, and/or optimally
utilize new tools, technologies, and/or processes to solve
ambiguous problems and improve product availability, reliability,
efficiency, observability, and/or performance. Drives the adoption
of new solutions across engineering teams working with related
products within an organization and provides guidance and coaching
to others on relevant topics.

Contributions to Development and Design

Leverages technical expertise in the infrastructure of
large-scale distributed systems and specific products, as well as
objective insights drawn from analyses of production telemetry data
to advocate for, or directly contribute to, changes to the code
base to improve the availability, reliability, efficiency,
observability, and performance of related sets of products
developed and supported by teams within an organization. Develops,
tests, and implements changes to optimize code and improve the
observability, reliability and operability of platforms, systems,
and products at scale. Reviews the effect of these changes to
document and share development insights within their team. Engages
with product engineering teams within an organization by driving
code/design reviews, hosting regular meetings, and participating in
on-call rotations and incident responses throughout product
development and operations cycles; leverages end-to-end technical
expertise on underlying systems/platforms and insights from
engagements with product engineering teams and telemetry analyses
to propose scalable improvements in code and designs with attention
to customer/business objectives and incident prevention.

Driving Operational Excellence

Develops code, scripts, systems, or
platforms that automate moderately complex but repetitive
operations processes (e.g., monitoring, alerting, deploying
products and updates, debugging) at scale; reviews existing
automation code and scripts to evaluate reusability, extendibility,
and scalability within an organization. Leverages end-to-end
technical expertise and telemetry analysis to identify patterns and
opportunities to implement configuration and data changes for
related sets of platforms, systems, or products in production using
code, tooling, and automation; identifies cases where teams lack
the tools and/or capability to manage platforms, systems, or
products using code and drives efforts within an organization to
expand capabilities and/or tooling accordingly. Leverages existing
tools and automation to enable product engineering teams within
their organization to increase the velocity at which they can
reliably and safely implement changes in production; monitors the
effects of changes across platforms or systems. Analyzes data from
telemetry pipelines and monitoring tools that detail operations
metrics (e.g., availability, reliability, performance, efficiency)
of systems, platforms, or products operating at scale. Contributes
to the development of new tooling and/or predictive models to
identify and test potential improvements in product development
and/or operations, and monitors the impact of changes on operations
metrics (e.g., Time-to-X) within an organization. Identifies
optimal uses for existing tools and/or models to identify
contributing factors or points of failure that are affecting the
availability, reliability, performance, and/or efficiency of
systems, platforms, or products; proposes and implements solutions
that resolve root cause(s) and prevent issues from occurring in
related products by working with product engineering teams within
an organization to test and deploy them to production. Responds to
incidents during regular on-call rotations by identifying the level
of impact, troubleshooting complex issues, and deploying
appropriate fixes to resolve root cause(s); alerts product teams,
owners, and leadership to issues with major customer/business
impact and escalates resolution of the highly complex, ambiguous,
and impactful issues to include other engineering teams and/or
subject matter experts as needed. Shares details related to
incidents and their resolution through post-mortem reports and
during regular review meetings. Develops, maintains, and leverages
capacity planning models and monitoring tools to forecast product
capacity and resource demands; models the predicted effect of
changes to capacity plans to optimize code bases to better manage
resources in response to dynamic capacity demands. May contribute to
the development of automated resource utilization tools or
processes that can dynamically scale compute resources up or down
to adjust to capacity demands. Draws insights from performance and
resource monitoring across products within their organization to
identify whether there is a need to optimize code, infrastructure,
or architecture - or if changes to compute resources are required;
uses advanced models to forecast and verify the efficacy of changes
at scale and proposes solutions that are aligned with
customer/business needs. Shares insights and best practices that
can be applied to improve development and operations across related
sets of systems, platforms, and/or products. Continues to develop
their understanding of insights and best practices through
interactions with more experienced SREs and members of product
engineering teams. Mentors and coaches other engineers to help them
identify and propose relevant solutions.

Additional Responsibilities

Design, develop, and deliver engineering solutions
that serve and protect M365 government clouds. Own deployment,
availability, reliability, performance and customer escalation
targets for sovereign environments. Proactively identify and reduce
issues through design, testing, and implementation of
software-based solutions. Collaborate with Engineering and Program
Management partners to translate customer, business, and technical
requirements into architectural designs and feature releases. Drive
efficiencies through software improvement and root cause analysis,
resulting in improved service delivery, maturity, and scalability. Develop,
test, and implement changes to optimize code and improve platforms.
You leverage end-to-end technical expertise and telemetry analysis
to identify patterns and opportunities to implement configuration
and data changes. You review the effect of changes to document and
share development insights within your team. You drive code/design
reviews, host regular meetings, and participate in on-call
rotations and incident responses throughout product development and
operations cycles. In addition, you respond to incidents during
regular on-call rotations and share details related to incidents
and their resolution through post-mortem reports and regular review
meetings.

Other

Embody our culture and values.