Hi Lutter,
Michal pointed me to your draft of the Tracker design, which looks great.
A couple of notes are inline. I'm sending this to the aeolus-devel list too
so others can join in.
The original doc can be found here:
https://raw.github.com/lutter/deltacloud/tracker/design/site/content/trac...
---
title: Deltacloud Tracker Design (Draft)
extension: html
filter:
- markdown
- outline
---
# Introduction
The purpose of Deltacloud Tracker [TODO: find a snazzier name] is to allow
clients to be notified of state changes in cloud resources through
callbacks rather than by polling themselves. Under the hood, Tracker
will, of course, still have to poll.
The Tracker will add a few capabilities to the Deltacloud API; this
document is only concerned with changing the Deltacloud API. The other
Deltacloud frontends will need to be changed accordingly at some point.
The resource changes that will trigger a notification to clients are
specific to each resource (and driver?). They are:
* Instances: change to instance state
* [TODO: what else ?]
Realms are needed too. We check realm availability in Conductor, so we
won't be able to replace our current checking tool with Tracker until
realms are supported.
# API changes
Tracker needs credentials for the backend cloud; it is therefore important
that these are set properly on each request. In particular, Tracker will
store the driver, provider, user and password each time a callback is
registered.
## Registering a callback on resource creation
Any collection that supports state tracking will indicate this with a
'state_tracking' feature on the corresponding create operation. This feature
makes it possible to pass the following parameters to create operations:

* `track_hook`: the absolute URL to which Tracker will POST on state changes
* `track_token`: (optional) a security token that Tracker will include in each callback
For example, to register a callback when creating an instance, the request
would look like:

```
POST /api/instances?...&track_hook=http://example.com/cb&...
```
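A client could assemble such a create request roughly as follows. This is only a sketch: the `create_instance_uri` helper, the `image_id` parameter and the local server URL are hypothetical. The point it illustrates is that `track_hook` is itself a URL, so it should be percent-encoded when placed in the query string.

```ruby
require 'uri'

# Hypothetical helper: build the URI for a create-instance request that
# also registers a state-tracking callback. track_token is optional.
def create_instance_uri(base_url, image_id, track_hook, track_token = nil)
  params = { 'image_id' => image_id, 'track_hook' => track_hook }
  params['track_token'] = track_token if track_token
  uri = URI.join(base_url, '/api/instances')
  # encode_www_form percent-encodes the hook URL so that it survives as
  # a single query parameter.
  uri.query = URI.encode_www_form(params)
  uri
end

uri = create_instance_uri('http://localhost:3001', 'img1',
                          'http://example.com/cb', 'ABCDEFG42')
puts uri.query
# prints image_id=img1&track_hook=http%3A%2F%2Fexample.com%2Fcb&track_token=ABCDEFG42
```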
## Retrieving callback details
Resources that support state tracking will contain a `<callback/>` element in
their representation. The element will have the following form:

```xml
<callback hook="http://example.com/cb">
  <delivery status="(noevent|success|failure)"
            time="2012-05-23T18:23"/>
</callback>
```
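A client might read this element back out of a resource representation as shown below. The sketch uses REXML from Ruby's standard distribution. The `parse_callback` helper and the surrounding `<instance>` wrapper in the sample are assumptions for illustration.

```ruby
require 'rexml/document'

# Hypothetical helper: extract the <callback/> element of a resource
# representation into a Hash; returns nil if the resource has none.
def parse_callback(xml)
  doc = REXML::Document.new(xml)
  cb = REXML::XPath.first(doc, '//callback')
  return nil unless cb
  delivery = cb.elements['delivery']
  {
    hook: cb.attributes['hook'],
    status: delivery && delivery.attributes['status'],
    time: delivery && delivery.attributes['time']
  }
end

xml = <<~XML
  <instance id="42">
    <callback hook="http://example.com/cb">
      <delivery status="success" time="2012-05-23T18:23"/>
    </callback>
  </instance>
XML
p parse_callback(xml)
```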
## Registering a callback for an existing resource
Resources that support state tracking allow updating the callback
information with a PUT request to the resource. To register a callback for
an instance, which will overwrite any existing callbacks, issue:

```
PUT /instances/42?track_hook=http://example.com/cb
```

and to delete a callback, use the special token 'none':

```
PUT /instances/42?track_hook=none
```
Authentication will be needed for these requests to make sure that a third
party can't modify my callback.
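As a sketch of what an authenticated update could look like, the helper below builds, but does not send, the two PUT requests. Carrying the provider credentials in HTTP basic auth is an assumption here. So are the `callback_update_request` name and the base URL. Sending a request would go through `Net::HTTP.start(host, port) { |http| http.request(req) }`.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build the PUT request that registers a callback
# on an instance or, with hook = 'none', removes it. Credentials are
# attached via HTTP basic auth so that only their owner can change it.
def callback_update_request(base_url, instance_id, hook, user, password)
  uri = URI.join(base_url, "/instances/#{instance_id}")
  uri.query = URI.encode_www_form('track_hook' => hook)
  req = Net::HTTP::Put.new(uri)
  req.basic_auth(user, password)
  req
end

register = callback_update_request('http://localhost:3001', 42,
                                   'http://example.com/cb', 'alice', 's3cret')
remove = callback_update_request('http://localhost:3001', 42,
                                 'none', 'alice', 's3cret')
puts register.path  # prints /instances/42?track_hook=http%3A%2F%2Fexample.com%2Fcb
puts remove.path    # prints /instances/42?track_hook=none
```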
# Callback
When Tracker detects a change to a tracked resource, it will POST a JSON
document to the hook URL [TODO: do we need XML, too?]. The JSON body will
look like this, where each `attribute` is a path in [JSON pointer
notation](http://tools.ietf.org/html/draft-ietf-appsawg-json-pointer-00):

```
{
  "token": security token,
  "changes": [
    { "attribute": path of the changed attribute,
      "old": old value,
      "new": new value
    }
  ],
  "resource": ... representation of resource ...
}
```
For example, for an instance that just changed from 'pending' to
'running', the callback hook would receive the following JSON document:

```
{
  "token": "ABCDEFG42",
  "changes": [
    { "attribute": "/state", "old": "PENDING", "new": "RUNNING" }
  ],
  "resource": ... JSON object for the instance ...
}
```
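One way Tracker could produce such a body is to diff the stored old representation against the freshly fetched one. Each changed leaf is emitted as a JSON pointer path, escaped per RFC 6901 ('~' becomes '~0' and '/' becomes '~1'). This is a sketch: `pointer_token`, `diff` and `callback_body` are hypothetical helpers, and arrays are compared as whole values rather than element by element.

```ruby
require 'json'

# Escape one reference token for JSON pointer notation. '~' must be
# escaped before '/', otherwise the '~' in '~1' would be re-escaped.
def pointer_token(key)
  key.to_s.gsub('~', '~0').gsub('/', '~1')
end

# Recursively compare two resource representations and return one
# { attribute, old, new } entry per leaf value that differs.
def diff(before, after, path = '')
  if before.is_a?(Hash) && after.is_a?(Hash)
    (before.keys | after.keys).flat_map do |k|
      diff(before[k], after[k], "#{path}/#{pointer_token(k)}")
    end
  elsif before == after
    []
  else
    [{ 'attribute' => path, 'old' => before, 'new' => after }]
  end
end

# Hypothetical helper assembling the callback body described above.
def callback_body(token, old_res, new_res)
  { 'token' => token,
    'changes' => diff(old_res, new_res),
    'resource' => new_res }.to_json
end

old_res = { 'id' => '42', 'state' => 'PENDING' }
new_res = { 'id' => '42', 'state' => 'RUNNING' }
puts callback_body('ABCDEFG42', old_res, new_res)
```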
The recipient for the hook should respond with 204 No Content to indicate
that the update was received successfully.
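On the receiving side, a hook could check the token and choose its status code along these lines. Only the 204 comes from this draft. The `handle_callback` helper is hypothetical, and answering 403 for a token mismatch and 400 for a malformed body are assumptions.

```ruby
require 'json'

# Hypothetical hook-side handler: returns the HTTP status the hook
# should answer a callback request body with.
def handle_callback(body, expected_token)
  payload = JSON.parse(body)
  return 400 unless payload.is_a?(Hash)
  return 403 unless payload['token'] == expected_token
  Array(payload['changes']).each do |c|
    puts "#{c['attribute']}: #{c['old']} -> #{c['new']}"
  end
  204 # No Content: the update was received successfully
rescue JSON::ParserError
  400
end

body = '{"token":"ABCDEFG42","changes":[{"attribute":"/state",' \
       '"old":"PENDING","new":"RUNNING"}],"resource":{}}'
puts handle_callback(body, 'ABCDEFG42')
```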
## Explicitly retrieving events
Callbacks can fail and will be retried for a while, but at some point we
have to give up trying to deliver a callback (or retry so infrequently
that it's no longer useful to the recipient).
To make it possible for recipients to catch up after a failure on their
side, we'll support a 'changes' collection that only allows GET:
```
GET /changes
```
The response to this request will be a JSON array, where each entry is the
same JSON object that is used for delivering callbacks. Note that only
changes pertinent to the current provider will be delivered, i.e. clients
that use Tracker to track resources in multiple providers will need to make
one request for each provider. Once a change has been delivered through
this mechanism, it will be considered successfully delivered.
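The delivery semantics of this collection can be modelled in memory as follows. Pending changes are kept per provider. Fetching returns all of them and marks them delivered, so a second fetch comes back empty. `ChangeLog`, its methods and the provider key are hypothetical; a real Tracker would keep this state in its database.

```ruby
# In-memory sketch of the GET /changes semantics, keyed by provider.
class ChangeLog
  def initialize
    @pending = Hash.new { |h, k| h[k] = [] }
  end

  # Record an undelivered change (the same object a callback would POST).
  def record(provider, change)
    @pending[provider] << change
  end

  # Called when a callback POST for this change succeeded.
  def delivered(provider, change)
    @pending[provider].delete(change)
  end

  # GET /changes for one provider: return its pending changes and
  # consider them delivered from now on.
  def fetch(provider)
    @pending.delete(provider) || []
  end
end

log = ChangeLog.new
log.record('ec2', { 'token' => 'A' }) # hook registered by one client
log.record('ec2', { 'token' => 'B' }) # hook registered by another client
puts log.fetch('ec2').size # prints 2
puts log.fetch('ec2').size # prints 0
```

Because `fetch` is keyed only by provider, changes for both clients' hooks come back from a single request.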
This "fetch once, then consider it delivered" model might be a problem if
Tracker is used by multiple clients: _all_ pending callbacks for a provider
are fetched by a single request, no matter whom they are addressed to.
Maybe 'GET /changes' could instead just trigger a retry of all pending
callback deliveries for the provider rather than returning the changes
directly. Another benefit is that 'GET /changes' then wouldn't have to
be authenticated.
# Implementation Notes
We will need to run a background job that performs the state
polling. DelayedJob seems like the right tool for this; we'll want a
periodic job that goes out to each backend/provider and asks for changes
to tracked resources. For the first cut, we can do this resource-by-resource,
but longer term we want to be more clever and use cloud-specific features
(DescribeInstances for multiple instances in EC2, changed-since for
OpenStack etc.) and will therefore require driver support.

+1 for DelayedJob - we already have this requirement in Conductor.
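A resource-by-resource first cut of one polling pass might look like the sketch below. `Tracked`, `poll_once` and the `fetch_state` callable are hypothetical stand-ins for the Callback rows and a driver call. In Tracker, a DelayedJob job object whose `perform` method runs one pass could re-enqueue itself for the next poll.

```ruby
# Hypothetical record for one tracked resource (cf. the Callback model).
Tracked = Struct.new(:res_type, :res_id, :old_state)

# One pass of the periodic job: ask the backend for the current state of
# each tracked resource, collect the ones that changed, and remember the
# new state. Failed lookups are returned separately so they can be
# retried according to the poll-failure policy.
def poll_once(tracked, fetch_state)
  changes = []
  failures = []
  tracked.each do |t|
    begin
      current = fetch_state.call(t.res_type, t.res_id)
    rescue StandardError
      failures << t
      next
    end
    next if current == t.old_state
    changes << { res_id: t.res_id, old: t.old_state, new: current }
    t.old_state = current
  end
  [changes, failures]
end

backend = { '42' => 'RUNNING', '43' => 'PENDING' } # fake backend state
fetch = ->(_type, id) { backend.fetch(id) }         # KeyError if unknown
tracked = [Tracked.new('instance', '42', 'PENDING'),
           Tracked.new('instance', '43', 'PENDING'),
           Tracked.new('instance', '44', 'PENDING')]
changes, failures = poll_once(tracked, fetch)
puts changes.size  # prints 1 (instance 42 went PENDING -> RUNNING)
puts failures.size # prints 1 (instance 44 is unknown to the backend)
```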
Conceptually, Tracker decorates the backend driver for the API operations
that are modified by the 'state_tracking' feature. It is therefore tempting
to implement that aspect as a Module that gets included into drivers and
does the decoration. By doing this at the driver level, state tracking is
immediately available to all frontends.
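Ruby's `Module#prepend` puts a module in front of the class in method lookup, which makes it a natural fit for this decoration. In the sketch below, `StateTracking`, `CallbackRegistry` and `FakeDriver` are hypothetical, and the `create_instance(credentials, image_id, opts)` signature is an assumption about the driver interface. The module strips the tracking parameters, calls the real driver through `super`, and records the callback together with the request's credentials.

```ruby
# Hypothetical stand-in for the Callback table.
class CallbackRegistry
  attr_reader :entries

  def initialize
    @entries = []
  end

  def register(credentials, res_type, res_id, hook, token)
    @entries << { credentials: credentials, res_type: res_type,
                  res_id: res_id, hook: hook, token: token }
  end
end

# Decorates a driver's create_instance with state tracking.
module StateTracking
  def create_instance(credentials, image_id, opts = {})
    opts = opts.dup # leave the caller's hash untouched
    hook = opts.delete(:track_hook)
    token = opts.delete(:track_token)
    instance = super(credentials, image_id, opts)
    tracker.register(credentials, 'instance', instance[:id], hook, token) if hook
    instance
  end

  def tracker
    @tracker ||= CallbackRegistry.new
  end
end

# Stand-in for a real backend driver.
class FakeDriver
  def create_instance(_credentials, image_id, _opts = {})
    { id: 'inst-1', image_id: image_id, state: 'PENDING' }
  end
end

FakeDriver.prepend(StateTracking)

driver = FakeDriver.new
driver.create_instance({ user: 'alice' }, 'img1', track_hook: 'http://example.com/cb')
puts driver.tracker.entries.size # prints 1
```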
Implementing Tracker requires that we keep state about the registered
callbacks, and about the previous state of tracked resources. We'll use an
RDBMS (sqlite/postgresql) and ORM (DataMapper ?) for this purpose. Very
roughly, I hope we can get away with this data model (plus a jobs table):
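```ruby
require 'data_mapper'

class Provider
  include DataMapper::Resource

  property :id,       Serial
  property :driver,   String
  property :provider, String
  property :user,     String
  property :password, String

  has n, :callbacks
end

class Callback
  include DataMapper::Resource

  property :id,         Serial
  property :hook,       String, :length => 255 # default String length (50) is too short for URLs
  property :token,      String
  property :res_type,   String
  property :res_id,     String   # Just enough to get the resource from the backend
  property :res_old,    Text     # Serialization of the old resource state
  property :last_event, DateTime
  # TODO: Need to track delivery state of callback and payload

  belongs_to :provider
end

DataMapper.finalize # validate models and set up associations
```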
## Timings/frequencies
There are a number of timings (polling frequency, how long and how often to
retry change delivery). For now, they can be hardcoded, but they should live
in some central place so they can easily be tweaked; we do not want them
controlled through the API, though. We'll start with something like:
* Poll frequency: once a minute for instances in transient states (pending
etc.), once every 15 minutes for instances in permanent states (running,
stopped, ...)
* Poll failure: retry at the normal frequency 5 times, then back off
exponentially until frequency falls to once every two
hours. [TODO: Should we issue a callback at that point ?]
* Callback failure: retry the callback once a minute, 5 times, then back off
exponentially until the frequency falls to once a day
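These starting values could be collected in a single module, as sketched below. `TrackerTimings` and its methods are hypothetical. Treating only RUNNING and STOPPED as permanent states is an assumption. "Back off exponentially" is read here as doubling the delay after the fifth retry, capped at the stated floor frequency. When to give up entirely is still open.

```ruby
# Hypothetical central home for Tracker timings, in seconds. Hardcoded on
# purpose; not configurable through the API.
module TrackerTimings
  MINUTE = 60
  HOUR   = 60 * MINUTE
  DAY    = 24 * HOUR

  # Assumption: these are the permanent states; anything else counts as
  # transient and is polled more often.
  PERMANENT_STATES = %w[RUNNING STOPPED].freeze

  # Once a minute in transient states, every 15 minutes otherwise.
  def self.poll_interval(state)
    PERMANENT_STATES.include?(state.to_s.upcase) ? 15 * MINUTE : MINUTE
  end

  # Delay before the n-th (1-based) retry of a failed poll: the normal
  # interval for the first 5 retries, then doubling up to two hours.
  def self.poll_retry_delay(attempt, normal_interval)
    backoff(attempt, normal_interval, 2 * HOUR)
  end

  # Delay before the n-th retry of a failed callback: once a minute for
  # the first 5 retries, then doubling up to once a day.
  def self.callback_retry_delay(attempt)
    backoff(attempt, MINUTE, DAY)
  end

  def self.backoff(attempt, base, cap)
    return base if attempt <= 5
    [base * 2**(attempt - 5), cap].min
  end
end

p (1..8).map { |n| TrackerTimings.callback_retry_delay(n) }
# prints [60, 60, 60, 60, 60, 120, 240, 480]
```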