Feb 22, 2022

Principal Site Reliability Engineer (SRE)

  • Canva
  • Sydney, New South Wales, Australia

Job Description

We’re constantly working towards making Canva the best place to work, for everyone. We believe deeply that bringing together diversity of thoughts, perspectives and expression is key for building the best product for our equally diverse community all around the world. We celebrate uniqueness and whatever makes you, you, and we encourage everyone who wants to help us transform the way the world designs to join us on this journey. We value all different types of experiences. If you don’t think you quite meet all of the qualifications, we’d still love to hear from you.

About Us
At Canva, our mission is to democratise design and empower creativity for anyone and everyone, on every platform. Inspired by a team of talented thinkers, an amazing culture and a remarkable growth trajectory – we’re out to change the world, one design at a time. 

Since launch in August 2013, we have grown exponentially, amassing over 75 million monthly active users across 190 different countries who have created more than 6 billion designs. We are one of the world’s fastest-growing technology companies and we have only achieved about 1% of what we want to do.

About the Reliability Platform Group 

The Reliability Platform Group is responsible for providing the tools and processes to scale reliability across all Canva services. Our teams work together, and with other groups, to deliver preventive and detective tooling, processes and best practices that uplift Canva’s reliability. We do this by driving operational excellence, reducing the impact of incidents, and providing visibility and accountability across the broader Engineering community.

The group encompasses Observability, Availability, Detection & Response, Major Incident Response and Pre-Emption domains and is set to grow rapidly in the near future as we shoot for some ambitious goals.

Role Responsibilities

  • Oversee all Reliability Engineering activities.
  • Perform deep dives into both systemic and latent reliability issues.
  • Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
  • Grow, build and drive the operational excellence culture and best practices across the engineering specialty.
  • Deliver continuous improvement of operational processes, tools and capabilities based on operational experiences and application of industry-standard methodologies.
  • Set vision and direction for reliability teams, including usage of platforms and tools.
  • Represent the reliability team in design reviews and operational readiness exercises for new and existing services.

Required Skills & Experience

  • Ability to set and drive the long-term strategy for the team, while delivering incremental value for the business.
  • Excellent ability to build and maintain healthy relationships with diverse business & engineering stakeholders.
  • Demonstrates a passion for learning and technology with the ambition to drive and succeed collaboratively.
  • Experience with technical and people leadership - having previously led high-performing teams where everyone is able to share their best ideas and be their best selves.
  • Experience working with a mainstream programming language. Our services and libraries are primarily written in Java, so Java experience is nice to have.
  • Solid understanding of resiliency techniques and patterns – load balancing, throttling, back pressure, circuit breaking, etc.
  • Disciplined coding practices, experience with code reviews and pull requests and a creative and conceptual problem-solving approach.
  • Strong communication and team collaboration skills, both written and verbal, as you will need to share knowledge, communicate and coordinate changes across multiple teams.
  • Ability to lead by example, promoting Canva’s values, a no-blame mentality, and engineering values.
  • Five-plus (5+) years of commercial experience developing complex, distributed web applications.

Nice to have, not required!

  • Experience working with microservice architectures in large, complex distributed cloud environments (ideally AWS). We’re hosted on AWS and leverage the tools they provide as much as possible.
  • Experience with RPC frameworks such as Finagle, Thrift or gRPC will be a huge plus, but is not required; understanding how services communicate with each other is crucial for finding out where a failure can occur.
  • Knowledge of networking protocols such as TCP, HTTP/2, WebSockets, etc. would be a big plus; the life of a request doesn’t start inside the backend web server, but rather in the browser of a user.
  • Previous experience working as a reliability/chaos engineer and/or strong knowledge of the Google SRE corpus et al.
  • Previous experience establishing, growing, and developing new teams from the ground up.
  • Experience writing Infrastructure as Code (IaC).

Our benefits

  • Competitive salary, plus stock options via our ESOP.
  • Flexible daily working hours; we value work-life balance.
  • Breakfast and lunch prepared by our wonderful Vibe team.
  • Onsite gym and yoga membership.
  • End-of-Trip Facilities: Bicycle parking and showers.
  • Generous parental (including secondary) leave policy.
  • Pet-friendly offices.
  • Internal Coaching and Employee Support Programs.
  • Sponsored social clubs, team events, and celebrations.
  • Relocation budget for interstate or overseas individuals who legally qualify for visa sponsorship.

We make hiring decisions based on your experience, skills and passion. If you’re keen to apply and need reasonable adjustments or would like to note which pronouns you use at any point in the application or interview process, please let us know.