Manager, Site Reliability Engineering

Austin Downtown Office

BigCommerce, named a “Best Place to Work” in Australia, a “Best and Brightest” place to work in San Francisco, and a “Best Place to Work” in Austin, is looking for a full-time Manager, Site Reliability Engineering to join our San Francisco, CA or Austin, TX office. The Manager, Site Reliability Engineering will be part of the team responsible for ensuring the BigCommerce platform is available, reliable, and performant.

As the Manager of Site Reliability and Systems Engineering, you will lead and enable a team of site reliability engineers and operations engineers who will execute on the vision of building a distributed computing operating system, middleware, services layer abstraction, and operational API, turning our vision of a global ecommerce SaaS platform into reality.

Our SRE team is made up of talented and enthusiastic individuals with extensive experience running, managing, and scaling large-scale web operations and systems administration. The team works closely with the rest of our Engineering organization to ensure that the platform powering BigCommerce remains reliable, performant, and secure, 24x7.

Success in this role requires very strong technical leadership skills, a broad background, and an understanding of every layer of the systems ecosystem. You will have strong ability in the areas of process development and management, organization, planning, and communication. It is absolutely critical that you have the skills to manage, motivate, lead, and gain the respect of a highly technical engineering team. We’re looking for an experienced candidate with a mindset skewed towards performance analysis, scalability, and high availability.

Day to day you’ll find us with our noses in the terminal, using Terraform and Puppet to manage our Debian hosts in a heterogeneous environment of Docker containers, VMs, and bare-metal servers. We rely heavily on Logstash and Grafana to provide the data we need to direct our focus and attention to diagnosing and resolving performance issues across a variety of software built in PHP, Ruby, Scala (JVM) and, on occasion, Go.

We are always working to empower the BigCommerce Engineering teams to deliver a faster and more robust platform.

What makes you tick:

  • You love to build and manage teams to excellence. You are passionate about the success of the company you work for and the people you work with
  • You love to code and enjoy working with multiple programming languages. We primarily work with PHP, Ruby, and Python. Puppet manages all of our configurations
  • You are a good communicator who works well with geographically distributed teams such as ours. We are split between Sydney, Austin, and San Francisco
  • You're obsessive-compulsive, in a good way. Your systems and scripts are clean, well-documented and comprehensible
  • You hate doing the same thing twice; you’d rather automate a problem away than spend time on it again
  • You have a passion for learning when it comes to working with new technologies or languages
  • You live and breathe scalable web architectures
  • You’re cool in a crisis and can work with others to ensure complex problems reach a timely and effective resolution
  • While work is a big part of your life, you strive to maintain a good balance between the office and home. The pager is an important part of your day, but you don’t let it rule your life

What you will do:

  • Manage a world-class NOC and full-stack system monitoring and reporting
  • Ensure the platform maintains 99.95% uptime. Increase platform availability to 99.99%
  • Build and operate a high-performance site reliability and operations team via hiring, mentoring and career development
  • Work with the rest of the BigCommerce Engineering team to deliver systems which are highly available, reliable, and performant
  • Own budgetary responsibility for staffing and the associated third-party product and services portfolio

Who We're Looking For:

  • BS/MS in Computer Science or equivalent engineering experience
  • 7+ years of experience building and running a high-transaction, 24x7 production environment
  • Strong familiarity with security as it applies to software, system and network engineering
  • Experience delivering a world-class systems platform with scalability and resiliency
  • Deep hands-on technical expertise designing and deploying global Linux-based systems
  • Attention to detail in technology, technical development, quality, operations, and system performance
  • Outstanding organizational, communication, interpersonal and relationship building skills
  • Is able to dive deep and is never out of touch with the details of the business or the technology
  • Thinks big. Has a bias for action. Delivers results
  • Communicates effectively both verbally and in writing
  • Proven success handling budgetary responsibilities

Curious what we’ve been up to?

Here is what our Infrastructure Engineering team has been up to, along with some upcoming roadmap items you could be involved in:

  • We ensured 100% uptime during North America’s busiest shopping period, Thanksgiving through Cyber Monday, for 4 straight years (these days are known around here as Cyber 5)
  • We’ve designed and built our own software (in Bash) to automatically upgrade system OS packages. Systems using this code now install OS upgrades with zero touch, so we can concentrate on more automation and serving our customers’ needs
  • Developing an intelligent incident management and response process, with automation in the form of a chatbot, giving anyone on the team the ability to comfortably handle anything thrown at them
  • Creating automation to identify, remediate, and purge spam from our systems. This keeps our mail reputation high and ensures we can deliver order email effectively
  • Deploying and scaling our Integration environment out into a second data center, giving the software engineering teams additional resources so they can release and test faster, and improving parity with our Production environment