Why Service Discovery

Let's say you have a web application that needs to query external services such as a database or an authentication service. These services are often hosted in a different environment, be it another host, rack or datacenter, so their addresses need to be resolvable. The easiest way is to assign a static address (DNS entry or IP) to each service and configure the web app to point to these addresses. This approach has several challenges, however:

  • Adding or removing services requires reconfiguring the web app.
  • If one db server goes down, how do we stop the web app from querying that server?
  • How do we load balance the services without adding logic to the web app?

In a shared, container-centric environment like Mesos, service discovery is critical if you plan to run everything on Mesos, because there is no guarantee where an app will run. (You can restrict placement with constraints, but it is not good practice.)

The two most popular approaches to service discovery are:

  • Server-side discovery: all services sit behind a proxy, commonly known as a load balancer, e.g. HAProxy or Nginx. The app client talks to the load balancer's endpoints (front end), which then route requests to the corresponding services (back end). The load balancer also handles health checking to ensure that only live back ends serve requests.
  • Client-side discovery (Consul): all services register themselves with a service registry. The app client asks the service registry where the available services are, then makes requests directly to the services using the information given by the registry.
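
The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not how HAProxy or Consul are implemented: the `Registry` and `LoadBalancer` classes, their methods and the addresses are hypothetical names made up for the example.

```python
import itertools


class Registry:
    """Toy in-memory service registry (hypothetical, for illustration only)."""

    def __init__(self):
        self._instances = {}  # service name -> {(host, port): healthy flag}

    def register(self, service, host, port):
        self._instances.setdefault(service, {})[(host, port)] = True

    def deregister(self, service, host, port):
        self._instances.get(service, {}).pop((host, port), None)

    def set_health(self, service, host, port, healthy):
        self._instances[service][(host, port)] = healthy

    def lookup(self, service):
        """Return the healthy (host, port) pairs, sorted for determinism."""
        instances = self._instances.get(service, {})
        return sorted(addr for addr, ok in instances.items() if ok)


class LoadBalancer:
    """Server-side discovery: clients only know the balancer, which picks a back end."""

    def __init__(self, registry, service):
        self._registry = registry
        self._service = service
        self._counter = itertools.count()

    def route(self):
        backends = self._registry.lookup(self._service)
        if not backends:
            raise LookupError(f"no healthy backend for {self._service}")
        # Simple round-robin over the currently healthy back ends.
        return backends[next(self._counter) % len(backends)]


registry = Registry()
registry.register("db", "10.0.0.1", 5432)
registry.register("db", "10.0.0.2", 5432)

# Client-side discovery: the client asks the registry and connects directly.
print(registry.lookup("db"))

# Server-side discovery: the client calls the balancer front end instead.
balancer = LoadBalancer(registry, "db")
registry.set_health("db", "10.0.0.1", 5432, False)  # one db server goes down
print(balancer.route())
```

In both cases the web app never hard-codes a back-end address, which addresses the three challenges listed earlier: instances can come and go, unhealthy ones are skipped, and the balancing logic lives outside the app.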

There are good resources to learn about service discovery on the Internet. Below are some of them:

Consul

Consul is a distributed service discovery tool and key-value store. It implements the Raft consensus algorithm to maintain a consistent, replicated log across servers, which is used to drive a finite state machine (FSM) representing the state of the system. Since a system usually consists of many servers, each server may have a different view of the FSM, so an algorithm is needed to keep every copy consistent even when failures happen.
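
The role of the replicated log can be illustrated with a small sketch. It is not Raft itself (there is no leader election or log replication here), only the final step: every server that applies the same log entries in the same order ends up with the same state. The entry format and the `apply_log` function are hypothetical.

```python
def apply_log(entries):
    """Replay log entries in order and return the resulting service catalog."""
    catalog = {}  # service name -> set of (host, port)
    for op, service, host, port in entries:
        if op == "register":
            catalog.setdefault(service, set()).add((host, port))
        elif op == "deregister":
            catalog.get(service, set()).discard((host, port))
            if service in catalog and not catalog[service]:
                del catalog[service]  # drop services with no instances left
        else:
            raise ValueError(f"unknown operation: {op}")
    return catalog


log = [
    ("register", "web", "10.0.0.1", 8080),
    ("register", "web", "10.0.0.2", 8080),
    ("register", "db", "10.0.0.3", 5432),
    ("deregister", "web", "10.0.0.1", 8080),
]

# Two servers that hold the same log compute the same state.
server_a = apply_log(log)
server_b = apply_log(list(log))
print(server_a == server_b)

# A server that is missing the last entry has a stale view until it catches up.
stale = apply_log(log[:-1])
print(stale == server_a)
```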

More on Raft:

In the context of service discovery and Consul, the FSM is the list of services in the system, also known as the service catalog. This catalog is replicated across the Consul masters. Each Consul slave is responsible for:

  • Monitoring the health of the host
  • Monitoring the health of the services on the host
  • Sending updates to the master

Consul goes a step further and exposes service information (host, port) via DNS, which clients can query on port 8600. A DNS cache server such as dnsmasq can be run on the agent so that queries can be resolved by both Consul and the normal DNS servers.
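
For a client that cannot easily issue DNS SRV queries, the same information can be read over the agent's HTTP API. The sketch below assumes an agent listening on its default HTTP port 8500 at 127.0.0.1 and a service named `web`; both are assumptions for illustration. It uses the `/v1/health/service/<name>` endpoint with the `passing` filter, and `healthy_endpoints` and `discover` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request


def healthy_endpoints(entries):
    """Turn a /v1/health/service response into a list of (address, port) pairs."""
    endpoints = []
    for entry in entries:
        service = entry["Service"]
        # Service.Address is empty when the service uses the node's address.
        address = service.get("Address") or entry["Node"]["Address"]
        endpoints.append((address, service["Port"]))
    return endpoints


def discover(service, agent="http://127.0.0.1:8500"):
    """Ask the local agent (assumed address) for passing instances of a service."""
    url = f"{agent}/v1/health/service/{urllib.parse.quote(service)}?passing"
    with urllib.request.urlopen(url, timeout=5) as response:
        return healthy_endpoints(json.load(response))


# Trimmed sample of the response shape, so the parser can be tried without an agent.
sample = [
    {"Node": {"Node": "agent-1", "Address": "10.0.0.1"},
     "Service": {"Service": "web", "Address": "", "Port": 31001}},
    {"Node": {"Node": "agent-2", "Address": "10.0.0.2"},
     "Service": {"Service": "web", "Address": "10.0.0.22", "Port": 31002}},
]
print(healthy_endpoints(sample))
```

With a running agent, `discover("web")` would return the same kind of list. The DNS equivalent is an SRV lookup for `web.service.consul` against port 8600.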

Mesos-Consul integration

We run a Consul agent in slave mode on each Mesos agent (and on the Mesos masters too). The slave agent is responsible for registering, deregistering and monitoring the services running on its Mesos slave. How does it know when a new service has been launched on the same host? This is the job of mesos-consul (orange), which polls the Mesos master for all task and framework information and sends it to the Consul slave running on the same host as the corresponding task or framework. That Consul slave then updates the service catalog at the Consul master.
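
The routing step can be sketched as follows. This is not the mesos-consul source, only an illustration of the idea: group the tasks reported by the Mesos master by the host they run on, so that each registration can be sent to the Consul agent on that host. The task fields and the function name are hypothetical.

```python
from collections import defaultdict


def registrations_by_host(tasks):
    """Group running tasks by host: {host: [(service name, port), ...]}."""
    per_host = defaultdict(list)
    for task in tasks:
        if task["state"] != "TASK_RUNNING":
            continue  # only running tasks should appear in the catalog
        for port in task["ports"]:
            per_host[task["host"]].append((task["name"], port))
    return dict(per_host)


# Simplified stand-in for the task list polled from the Mesos master.
tasks = [
    {"name": "web", "host": "10.0.0.1", "ports": [31001], "state": "TASK_RUNNING"},
    {"name": "web", "host": "10.0.0.2", "ports": [31005], "state": "TASK_RUNNING"},
    {"name": "batch", "host": "10.0.0.2", "ports": [31007], "state": "TASK_FINISHED"},
]

for host, services in registrations_by_host(tasks).items():
    # A real implementation would call the Consul agent on `host` here.
    print(host, services)
```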

Note: Consul has a mechanism called anti-entropy for keeping the state of the whole system consistent. It works by periodic synchronization between client and server. Therefore each update to the service catalog needs to be sent through the Consul slave on the relevant host. This means mesos-consul needs to be able to talk to the Consul client on every slave.
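
Conceptually, anti-entropy is a periodic diff between what the agent knows locally and what the catalog holds, with the agent's view treated as authoritative. A minimal sketch of that idea, using hypothetical names and plain dictionaries in place of Consul's real data structures:

```python
def anti_entropy_sync(local, catalog):
    """Make the catalog entries match the agent's local services.

    Both arguments map a service ID to its definition. Returns the list of
    changes applied, so the caller can log them.
    """
    changes = []
    for service_id, definition in local.items():
        if catalog.get(service_id) != definition:
            catalog[service_id] = definition  # missing or outdated in the catalog
            changes.append(("register", service_id))
    for service_id in list(catalog):
        if service_id not in local:
            del catalog[service_id]  # the agent no longer knows this service
            changes.append(("deregister", service_id))
    return changes


local_state = {"web-1": {"port": 31001}, "db-1": {"port": 5432}}
catalog_view = {"web-1": {"port": 31000}, "old-1": {"port": 9000}}

print(anti_entropy_sync(local_state, catalog_view))
print(catalog_view == local_state)
```

This also shows why a registration written anywhere other than the owning agent would not last: the next sync from that agent would remove it.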

Consul Configuration

Consul-Master systemd .service
[Unit]
Description=consul-master
After=network.target
Wants=network.target

[Service]
ExecStart=<PATH_TO_CONSUL>/consul agent --config-file=<PATH_TO_CONSUL>/master.json --config-dir=/etc/consul.d
ExecReload=<PATH_TO_CONSUL>/consul reload
Restart=always
RestartSec=20

StandardOutput=journal
StandardError=inherit

[Install]
WantedBy=multi-user.target

Consul-Slave systemd .service
[Unit]
Description=consul-slave
After=network.target
Wants=network.target

[Service]
ExecStart=<PATH_TO_CONSUL>/consul agent --config-file=<PATH_TO_CONSUL>/slave.json --config-dir=/etc/consul.d
ExecReload=<PATH_TO_CONSUL>/consul reload
Restart=always
RestartSec=20

StandardOutput=journal
StandardError=inherit

[Install]
WantedBy=multi-user.target

master.json
{
  "datacenter": "default",
  "data_dir": "/data/consul",
  "log_level": "INFO",
  "server": true,
  "bootstrap_expect": 2,
  "ui": true,
  "start_join": [
    "CONSUL_MASTER_001",
    "CONSUL_MASTER_002",
    "CONSUL_MASTER_003"
  ]
}

slave.json
{
  "datacenter": "default",
  "data_dir": "/data/consul",
  "log_level": "INFO",
  "server": false,
  "bind_addr": "specific_slave_IP_address",
  "client_addr": "0.0.0.0",
  "start_join": [
    "CONSUL_MASTER_001",
    "CONSUL_MASTER_002",
    "CONSUL_MASTER_003"
  ]
}

Alternatively, you can set the --bind option to the default private IP and the --client option to 0.0.0.0 on the command line:

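#!/usr/bin/env bash
# Take the IPv4 address of the last interface that "ip addr" reports as UP.
BIND_IP=$(ip addr | grep 'state UP' -A2 | tail -n1 | awk '{print $2}' | cut -f1 -d'/')
# Start the agent bound to that address, with client interfaces open on all addresses.
/opt/consul/bin/consul agent --config-file=/opt/consul/conf/master.json --bind="$BIND_IP" --client=0.0.0.0 --config-dir=/etc/consul.d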

Running Mesos-Consul

Clone the repo from https://github.com/mantl/mesos-consul/. You need Go to compile the binary. Alternatively, run mesos-consul from a Docker image built with the provided Dockerfile.

The best way to run mesos-consul is via Marathon.

{
  "id": "/mesos-consul",
  "cpus": 0.1,
  "mem": 128,
  "disk": 0,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "volumes": [],
    "docker": {
      "image": "DOCKER_REGISTRY/mesos-consul",
      "network": "HOST",
      "portMappings": [],
      "privileged": false,
      "parameters": [],
      "forcePullImage": false
    }
  },
  "uris": [
    "file:///etc/docker.tar.gz"
  ],
  "args": [
    "--zk=zk://zookeeper.service:2181/mesos",
    "--shared-service-name=(KafkaMesos)-.*",
    "--log-level=INFO",
    "--blacklist=task.*"
  ]
}

Some useful arguments to pass to mesos-consul:

  • --blacklist=REGEX: tasks and frameworks whose names match REGEX are not added to Consul.
  • --shared-service-name=REGEX: extracts the common group name from the task or framework name using REGEX and registers the service under that group name.
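
How the two expressions from the Marathon definition above might act on task names can be sketched in Python. The matching rules here are assumptions: mesos-consul is written in Go and may anchor or search its patterns differently, so treat this as an illustration of the intent rather than its exact behavior. The task names and the `service_name` helper are made up.

```python
import re

# Same expressions as in the Marathon "args" above.
BLACKLIST = re.compile(r"task.*")
SHARED_NAME = re.compile(r"(KafkaMesos)-.*")


def service_name(task_name):
    """Return the name to register in Consul, or None if the task is blacklisted."""
    if BLACKLIST.match(task_name):  # assumption: match from the start of the name
        return None
    shared = SHARED_NAME.match(task_name)
    if shared:
        return shared.group(1)  # all matching tasks share one service name
    return task_name


for name in ["task-cleanup", "KafkaMesos-broker-0", "KafkaMesos-broker-1", "web"]:
    print(name, "->", service_name(name))
```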
