At Hippo Insurance, we built a platform as a service that makes adding and mutating service infrastructure simple and engineer-friendly.

Before I talk about the platform itself, I want to spend a bit of time explaining the choices that we made along the way to make this possible, scalable, and super efficient.

Polyglot out, well-crafted OTF products in

At Hippo, we have dozens of micro-services. We choose not to allow teams to write a service in any language/framework they want. Instead, we built well-crafted, well-thought-out products in advance. As an engineer, you choose the product that best fits your needs, or you can suggest a new product.

What is considered a product?

  1. HTTP Service - NodeJS (TypeScript) using an internal framework that gives you everything: metrics, dashboards, observability, logging, etc. You as the engineer focus on your code.

Each HTTP service logs HTTP requests to ES by default. This happens “above” your code and you can utilize it for debugging purposes.

Each service has full metrics for its controllers by default, which means you don’t have to worry about measuring the controllers/actions. You CAN add more metrics through the framework.

Each service will have support for “who called me”, “how did they call me”, so engineers know which API is used and how they can deprecate parts.

Each service will generate OpenAPI by default. Each service will generate a client by default.

Each service auto scales by default. Auto scaling is horizontal, based on the number of requests your service is able to process. We “listen” on the writing connections of each service and scale it when it reaches a point we think it should scale at. You as an engineer can tweak this number.

Now, think about how much time this saves when you develop a service, and how much you don’t need to figure out. You pick up a new product and it has everything. You will also get upgrades pushed to you “for free”.
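To make the scaling behavior a bit more concrete, here is a minimal sketch of a step-scaling decision. The field names mirror the auto_scaling settings shown in the JSON DSL later in this post, but the decision logic itself (the comparisons, ignoring cooldowns, treating the load as a single number per service) is an assumption for illustration, not the platform's actual implementation.

```typescript
// Hypothetical shape, mirroring the auto_scaling block of the JSON DSL.
interface AutoScaling {
  up_threshold: number;   // scale up when the load metric reaches this value
  up_step: number;        // instances to add when scaling up
  down_threshold: number; // scale down when the load metric drops to this value
  down_step: number;      // negative number: instances to remove when scaling down
  min_capacity: number;   // never run fewer instances than this
  max_capacity: number;   // never run more instances than this
}

// Returns the desired instance count for the current load metric
// (for example, the number of writing connections).
// Cooldowns are left out to keep the sketch short.
function desiredCapacity(current: number, load: number, cfg: AutoScaling): number {
  let target = current;
  if (load >= cfg.up_threshold) {
    target = current + cfg.up_step;
  } else if (load <= cfg.down_threshold) {
    target = current + cfg.down_step; // down_step is negative
  }
  // Clamp the result to the configured capacity range.
  return Math.min(cfg.max_capacity, Math.max(cfg.min_capacity, target));
}
```

With thresholds of 8 (up) and 5 (down), a load between the two leaves the instance count unchanged, which keeps the service from flapping between sizes.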

  2. Worker - NodeJS (TypeScript). A Worker subscribes to a queue, and a queue subscribes to a topic (with or without a filter).

You get everything you get for an HTTP service: auto scaling, metrics, logging, etc.

  3. Runner - NodeJS (TypeScript). Like a worker, but the trigger is a cron mechanism, e.g. run X at midnight.

Everything is connected via strong conventions: URL patterns, queue names, and environment encapsulation.

Worker X will have a queue named X, subscribed to the broadcast topic.

HTTP service x will have a DNS name x.{{ env }}.{{ internal-domain }} that resolves to an internal address.

HTTP service x will connect to a DB named x_db. In production it will have a dedicated instance named x.

Your secrets will be stored in vault under applications/x/{{ env }}.
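The conventions above can be sketched as a single pure function. The function name, the return shape, and the example domain in the usage note are hypothetical; only the naming patterns themselves come from the rules listed above.

```typescript
// Hypothetical helper: derives the conventional resource names for a service.
interface ConventionalNames {
  queue: string;     // workers: queue named after the worker
  dns: string;       // HTTP services: x.{{ env }}.{{ internal-domain }}
  database: string;  // HTTP services: x_db
  vaultPath: string; // secrets: applications/x/{{ env }}
}

function conventionalNames(
  name: string,
  env: string,
  internalDomain: string
): ConventionalNames {
  return {
    queue: name,
    dns: `${name}.${env}.${internalDomain}`,
    database: `${name}_db`,
    vaultPath: `applications/${name}/${env}`,
  };
}
```

For example, conventionalNames("x", "staging", "internal.example") yields the DNS name x.staging.internal.example and the Vault path applications/x/staging (the domain here is a placeholder). Because every name is derived from the service name and the environment, nobody has to look anything up or pass resource names around.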

Architecture patterns

We have well-defined architecture patterns for pubsub, DB access, service-to-service communication. We believe these are correct 99% of the time.

If you as an engineer or EM believe you have a use-case that belongs in the 1%, we have an architecture review board to suggest/approve these changes.

For example: a worker will subscribe to a queue. You cannot subscribe to another queue. If you think your worker needs to subscribe to two queues, you will need to come up with reasoning for this and prove that it is the right choice.

All of our services use PG as the DB. You need BigQuery? OK. Explain why.

Service permissions

Services follow the least-access principle. A service cannot access anything that is outside of its naming convention: it can access buckets that match its naming pattern, the DB with its name on it, etc. Services cannot talk to services they have not been configured to access, or to any other cloud resource that is not in the architectural pattern.
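As a rough illustration of naming-based access, the check below allows a service to touch a resource only when the resource name is the service name itself or starts with it followed by a separator. The function name and the choice of separators are assumptions; the real enforcement presumably lives in generated cloud permissions rather than in application code.

```typescript
// Hypothetical check: does this resource fall inside the service's naming convention?
function ownsResource(service: string, resourceName: string): boolean {
  if (resourceName === service) {
    return true;
  }
  // Require a separator after the service name so that service "x"
  // does not accidentally match resources of service "xy".
  const prefixes = [`${service}-`, `${service}_`];
  return prefixes.some((prefix) => resourceName.indexOf(prefix) === 0);
}
```

Under these assumptions, service x may access x_db and x-uploads but not xy_db or y_db. A prefix rule like this still needs care when one service name is a prefix of another (for example x and x-admin), so a real implementation would have to guard against that.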

The PAAS implementation

Now that we have the boundaries established, we wanted to give engineers an easy way to add/mutate infra.

The choice was to NOT have engineers write any Terraform, while still using Terraform to make sure the entire environment is configured as code.

What we built is a JSON DSL that looks like this:

{
  "service_type": "safari-service",
  "tier": "backend",
  "name": "fake-name-service",
  "responsible_teams": [
    "REDACTED"
  ],
  "github": {
    "description": "Fake Name Service",
    "topics": []
  },
  "resources": {
    "cpu": 1024,
    "memory": 1024
  },
  "auto_scaling": {
    "up_threshold": 8,
    "up_step": 4,
    "up_cooldown": 30,
    "down_cooldown": 120,
    "down_threshold": 5,
    "down_step": -2,
    "min_capacity": 8,
    "max_capacity": 50
  },
  "databases": [
    {
      "type": "postgres",
      "instance_class": "db.m5.large",
      "storage": "100"
    }
  ]
}

  • service_type defines the type (product) of the service.
  • tier is the network tier in which the service “sits”. We have a 4-tier network that is private by default, and each tier can only reach one level deep at a time.
  • responsible_teams defines who gets alerted about this service. Alerts automatically go to the team channel, on-call (L1), and SRE.
  • auto_scaling, as mentioned above, lets you set the auto scaling for your HTTP service: how much it scales, at which point, etc.
  • resources defines the CPU and memory resources.
  • databases defines which DB the service uses.

All of these have sane defaults. You have to define almost nothing except the service name and responsible teams.

Once you commit this, it will generate Terraform, create repos, generate build templates, Docker, etc. You are now ready to write code.
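A minimal sketch of how defaults might be layered under a definition is shown below. The type, the function name, and the default values are assumptions (the values are simply borrowed from the example above); the actual defaults are not stated in this post.

```typescript
// Hypothetical subset of the DSL: only name and responsible_teams are required.
interface ServiceDefinition {
  name: string;
  responsible_teams: string[];
  service_type?: string;
  tier?: string;
  resources?: { cpu?: number; memory?: number };
}

// Illustrative defaults only, borrowed from the example definition.
const DEFAULTS = {
  service_type: "safari-service",
  tier: "backend",
  resources: { cpu: 1024, memory: 1024 },
};

// Validates the required fields and fills in everything else.
function withDefaults(def: ServiceDefinition) {
  if (!def.name) {
    throw new Error("name is required");
  }
  if (def.responsible_teams.length === 0) {
    throw new Error("at least one responsible team is required");
  }
  return {
    ...DEFAULTS,
    ...def,
    // Merge nested resources so a partial override keeps the other defaults.
    resources: { ...DEFAULTS.resources, ...def.resources },
  };
}
```

With this shape, a definition that sets only name, responsible_teams, and resources.memory still ends up with a tier, a service type, and a CPU value.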

For completeness, here’s how you’d define a worker:

{
  "service_type": "safari-worker",
  "tier": "worker",
  "name": "fake-name-worker",
  "responsible_teams": [
    "REDACTED"
  ],
  "github": {
    "description": "Fake Name Worker",
    "topics": []
  },
  "resources": {
    "memory": 512
  },
  "delay_seconds": 60,
  "max_receive_count": 10,
  "pubsub": [
    {
      "topic_type": "global",
      "filter_policy": [
        "something.create",
        "something.update"
      ]
    }
  ],
  "visibility_timeout_seconds": 1200
}

Notice that you do not define the queue name. This is the convention I mentioned above: by default, the queue is named after the worker.

You can tweak timeouts, retries, etc.
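The filter_policy in the worker definition can be read as an allow-list of event names. Here is a minimal sketch of that matching, assuming each message carries an event name such as something.create and that a missing or empty filter means "deliver everything"; how the real pubsub layer evaluates filters is not described in this post.

```typescript
// Hypothetical shape, mirroring one entry of the "pubsub" array in the DSL.
interface Subscription {
  topic_type: string;
  filter_policy?: string[]; // omitted: the queue receives every message on the topic
}

// Decides whether a message with the given event name reaches the worker's queue.
function shouldDeliver(eventName: string, sub: Subscription): boolean {
  if (!sub.filter_policy || sub.filter_policy.length === 0) {
    return true; // subscribed without a filter
  }
  return sub.filter_policy.indexOf(eventName) !== -1;
}
```

For the worker defined above, something.create and something.update would be delivered, while an event like something.delete would never reach the queue.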

Once you write and commit code, it will deploy. It’s that simple.

You can create a service and publish it to production in under 20 minutes.

Quick thoughts

Obviously, there’s a lot more to it than mentioned in this post, but I wanted to explain a bit about how we enable the business using technology and build a platform that will scale with the organization without creating bottlenecks on OPS/SRE.