Site Reliability Engineer
Cape Town, Western Cape, South Africa
3d ago

Who We Are

Zappi is a SaaS company that aims to completely transform the market research industry. Our platform integrates world-class research methodologies and engineering to allow brands to run consumer testing through all stages of advertising and innovation development.

We are constantly innovating and tackling diverse and complex problems using a multitude of technologies in order to scale expertise and create the world’s most powerful enterprise research platform, thereby making the world of insights even better.

We have created an environment that fosters constant learning and innovation, and we believe in having ambitious goals. We are data scientists, developers, researchers, analysts, designers, engineers and marketers, all driven by the notion of trying to make the impossible possible.

To realise our vision we are constantly in search of people who will bring a different perspective, who will challenge our thinking, create value for our customers and apply themselves passionately to our vision and culture.

You :

  • Prioritise your own learning.
  • Are passionate about solving complex problems with an array of engineering techniques.
  • Are versatile and think outside the box.
  • Display leadership qualities and are enthusiastic about taking on new problems.
  • Take time to listen deeply before acting.
  • Aren’t afraid to address challenging issues directly, with compassion.
  • Lead through inspiration not coercion and create space for others to lead.
  • Are comfortable with radical transparency.
  • Are comfortable being uncomfortable; change is constant at Zappi.
  • Are humble and honest.

We :

  • Listen carefully to each other and to our customers.
  • Believe anyone can achieve great things; we don’t put people in boxes.
  • Promote experimentation and freedom of expression whether it is with new engineering practices, new technologies or new cultural and working practices.
  • Fundamentally trust each other.
  • Leave our egos at the door.
  • Aren’t afraid to fail.
  • Want to have a positive impact on the planet and our communities.

    We are an equal opportunities employer; our diversity is a major strength. We maintain a constant dialogue with our teams and wider communities about how we can become a more inclusive place to work.


    We are looking for one Site Reliability Engineer to help us better manage the infrastructure that runs the Zappi platform and support the workflows of 60+ developers.

    We pride ourselves on making major infrastructure changes a non-event, providing tools to increase developer productivity and giving developers the confidence to ship features to our remote environments often, quickly and easily, even on a Friday!

    On a day-to-day basis you would be involved with everything listed in the ‘What You’ll Work On’ section of this job description.

    You will also need to write code from time to time, so programming proficiency is important. Web development experience helps.

    We have a few core traits and abilities that we expect those who join our team to possess. These are :

  • Good scripting ability.
  • Strong problem solving ability.
  • The ability to work autonomously.
  • Interest in learning about and working with infrastructure and tooling.
  • Attentiveness. Honest mistakes are understandable; negligence is not. Awareness of the level of danger you’re introducing to the system is important. If you’re not sure of the risks, ask one of your teammates for advice or a second opinion. If you do happen to make mistakes, admit them openly.

    It would be advantageous if you have :

  • Experience with logging, monitoring, containerisation, container orchestration, continuous integration / deployment, database management and cloud infrastructure is required.
  • Opinions on alerting, application configuration, autoscaling, automation, centralised logging, cloud infrastructure, containerisation, deployment strategies, distributed systems, failure management / modes, high-availability, immutable infrastructure, monitoring, latency reduction, load-testing, performance measuring and security, since you will have a high degree of influence on the design and implementation details of our infrastructure.

  • Knowledge of security tooling (e.g. SIEMs, IDSs & IPSs) would be a plus, but it’s absolutely not required as we have an internal dedicated security team that we work closely with to tackle any security work. However, you should have a general sense of what’s required to configure our applications and infrastructure securely.

    Our Stack

    It includes but is not limited to :

  • CI / CD: Jenkins (task runner), CircleCI (application tests) and Port Control (internal application that powers a Heroku-like deployment experience for developers)
  • Cloud: AWS (for compute) & GCP (for select application APIs)
  • Containerisation: Docker
  • Databases: MySQL (Aurora), Postgres (RDS), Redis (ElastiCache), RedisGraph, RedisRoaring and Elasticsearch
  • Infrastructure as Code: Terraform, Jenkins-Job-Builder, Packer
  • Logging Stack: Elasticsearch, Logstash, Kibana and Filebeat
  • Metrics Stack: Prometheus (including Alertmanager), Grafana and InfluxDB
  • Operating System: Linux
  • Orchestration: Kubernetes
  • Programming Languages: Ruby (Ruby on Rails), Python, JavaScript (Node.js), Go, Elixir (Phoenix) and PHP (WordPress)
  • Tracing: Honeycomb
  • Version Control: Git and GitHub

    We run all our applications on self-managed Kubernetes clusters, which we bootstrap and manage using Kops. We’ve complemented our Kubernetes setup quite a bit using add-ons such as : AWS Load Balancer Controller, Calico, Cluster Autoscaler, Custom Metrics Adapter, External DNS, Falco, Kube State Metrics, Metrics Server, Nginx Ingress and Node Problem Detector.

    We run all of the above on AWS, where some of the primary services we use and maintain are : CloudFront, CloudWatch, Cost Explorer, ECR, EC2, ELB, ElastiCache, Redshift, RDS, Route53, S3, SES, SNS and SQS.

    We also use and help maintain the following services alongside the security team : CloudTrail, Cognito, GuardDuty, KMS, Macie and WAF.

    If you find our stack interesting then you’ll probably love working with us. We also have some talks up in which we share a little about our journey and experience :

  • 5 Things I Wish I Knew Before Moving To Kubernetes
  • Around & After Kubernetes : The Principles and Ideas that Guide Us
  • Pitfalls of Kubernetes Adoption
  • Ruby on Rails on Kubernetes on Production

    What You’ll Work On

    To give you an inkling of what a typical day will look like, we’ll share general aspects of the role as well as what our focus will be over the upcoming months.

    Here are some tasks that you may find yourself working on day to day :

  • Designing, building, and maintaining the core infrastructure used by all of the development teams.
  • Building and maintaining internal tooling to manage continuous integration and deployment.
  • Automating arduous developer processes with the goal of making developers’ lives easier.
  • Debugging issues across services and different levels of the stack.
  • Monitoring and managing the cost of our infrastructure.
  • Planning for the growth of our infrastructure.
  • Working closely with the security team to configure for secure infrastructure.
  • Improving the experience of internal and external clients.
  • Writing high-quality application code in Go and Ruby.
  • Writing scripts to automate small tasks, e.g. Bash scripts.
  • Supporting developers in rolling out high-risk application changes, e.g. large migrations.
  • Performing upgrades to keep everything up to date.
  • Maintaining documentation of our infrastructure and tooling.
  • Educating developers on our infrastructure and tooling.

    We also have a keen interest in blogging a lot more about what we do and open-sourcing anything that would benefit the larger community.

    Team Focus

    We will be working on improving our stakeholders’ ability to self-serve (with respect to their infrastructure needs). We would like to enable our stakeholders (developers, business-intelligence and anyone else who uses our tooling) to perform most, if not all, of the tasks they would like to with no manual intervention required from members of the SRE team.

    This is because we believe our stakeholders will be happier and more effective if they no longer encounter some of the resistance they typically experience when attempting certain tasks.

    We also believe that we will be more effective as a team and better able to focus on our vision for SRE going forward.

    How We Work

    We start from a position of trust. We believe that given the right information, people will make good decisions. Therefore we lean toward principles and guidelines rather than hard and fast rules.

    Here are a few things that we would like to highlight about what it is like to work with us :

  • Advice & Feedback: You should both count on, and be prepared for, completely honest advice & feedback from your team-mates. We may offer encouragement or criticism, indifference or unease; in any case, you can count on it being honest and candid, and from the heart. In return, we expect and encourage you to also be courageously honest.

  • Decision Making: Once you have sought advice, you are empowered to make a decision. Not everyone has to agree with your chosen course of action; we value disruptive innovation and it might not always please everyone. Constantly seeking consensus can be tiresome and so we place emphasis on obtaining consent, not consensus.

  • Meetings: We have only three meetings per week, each running about 30 minutes to an hour depending on what needs to be discussed: one on Monday morning to plan the week, another on Wednesday where we break work down into chunks and make sure there’s a ticket to track each chunk, and a third on Friday to recap the week and provide peer feedback. The rest of the time you’re free to manage your time as you wish, as long as you’re getting work done. In some cases we scrap a meeting if we feel we’re already aligned.

  • Communication: Most of our communication is on Slack and should be asynchronous during work hours. However, if you’re blocked you’re free to nudge anyone on the team to unblock you.
  • Conventions: Internally we have a lot of conventions that we’ve come to follow over the years. Understandably you won’t know about all of these, but we value your perspective and want to make the most of your unique point of view. We’ll be ready to listen and discuss anything that you would like to challenge.

  • Onboarding & Support: You can expect to have the support you need to have a delightful onboarding experience with enough room to learn at a reasonable pace.
  • Working Hours: While 8-5 are the official hours, you have the freedom to slide this earlier or later depending on what works best for you (in agreement with the team). It’s certainly true that every now and then crunch time hits hard, and we might have to work some extra hours. But for the most part, this is the exception rather than the norm.

  • On-Call: We have one person from the team on call, rotating weekly. Their responsibility is to handle issues that arise after hours so as to give the rest of the team room to not think about work after hours. Our on-call is considered paid overtime.

    Application Process

    Once selected, our typical interview process will run you through the following steps :

  • Technical Interview: A role-based technical assessment that would evaluate your grasp of operations & infrastructure related tasks as well as your application programming skills. The former would be a take-home exercise that you have about a week to work through in your own time, and the latter would be a chat on Zoom where we just walk through your thinking (no hard-core algorithms, just a basic programming exercise that you would encounter on a typical day).

  • Team Chat: A casual 30-minute to one-hour chat with your team-mates, possibly on Zoom. We don’t aim for this to be long, but we’re open to giving you as much opportunity as you need to get to know us, and vice versa.
  • Company Chat: A one-hour coffee chat with people from different teams across the company so that you get acquainted with others in the company. If you prefer, it can be on Zoom.

  • CTO Chat: A thirty-minute chat with our CTO on Zoom, which includes (but is not limited to) discussing salary expectations.

    Benefits

  • Competitive pay scales benchmarked annually.
  • Unlimited holidays, and this is not a trap! We expect and encourage people to take plenty of leave.
  • Flexibility to work in a way that suits your lifestyle with flexible working and travel arrangements to and from work.
  • Nice working setup, e.g. MacBook Pro, high-res screen or 2 monitors, keyboard, mouse, stand etc. Basically, you’ll get what you ask for to make you productive.
  • Open plan office with stocked snacks, fruit, beers & cool-drinks.
  • Support setting up your home office, if appropriate, e.g. chair, desk etc.
  • Paid 24-hour secure parking.
  • Free Yoga.