Deltacloud Tracker Design (Draft)

Thursday, 7 June 2012

Hi Lutter,
Michal pointed me to your draft of the Tracker design, which looks great.
A couple of notes are inline. I'm sending this to the aeolus-devel list too so
others can join.

Original doc can be found here:
https://raw.github.com/lutter/deltacloud/tracker/design/site/content/trac...

...
 ---
 title: Deltacloud Tracker Design (Draft)
 extension: html
 filter:
   - markdown
   - outline
 ---

 # Introduction

 The purpose of Deltacloud Tracker [TODO: find a snazzier name] is to allow
 clients to be notified of state changes in cloud resources through
 callbacks rather than by polling themselves. Under the hood, the tracker
 will of course have to poll, too.

 The Tracker will add a few capabilities to the Deltacloud API; this
 document is only concerned with changing the Deltacloud API. The other
 Deltacloud frontends will need to be changed accordingly at some point.

 The resource changes that will trigger a notification to clients are
 specific to each resource (and driver?). They are:

 * Instances: change to instance state
 * [TODO: what else ?] 
Realms are needed too. We check realm availability in Conductor, so we
will not be able to replace our current checking tool with Tracker until
realms are supported as well.

...

 # API changes

 Tracker needs credentials for the backend cloud; it is therefore important
 that on each request these are set properly. In particular, Tracker will
 store the driver, provider, user and password each time a callback is
 registered.

 ## Registering a callback on resource creation

 Any collection that supports state tracking will indicate that with a
 feature 'state_tracking' on the corresponding create operation. This will
 make it possible to add the following parameters to create operations:

   track\_hook : the absolute URL to which to post on state changes
   track\_token: (optional) a security token that must be included in the callback

 For example, to register a callback when creating an instance, the request
 would look like

     POST /api/instances?...&track_hook=http://example.com/cb&...
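
The hook URL has to be percent-encoded when it travels as a query parameter. A minimal Ruby sketch of building such a request path follows. The helper name `create_instance_query` and the `image_id` parameter are placeholders for illustration only; the draft elides the regular create parameters.

```ruby
require 'uri'

# Build the path and query string for a create request that also registers
# a callback. `extra` stands in for the usual create parameters, which the
# draft leaves out; `token` is optional, as in the draft.
def create_instance_query(hook, token = nil, extra = {})
  params = extra.merge('track_hook' => hook)
  params['track_token'] = token if token
  '/api/instances?' + URI.encode_www_form(params)
end

puts create_instance_query('http://example.com/cb', 'ABCDEFG42', 'image_id' => 'img1')
```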

 ## Retrieving callback details

 Resources that support state tracking will contain a <callback/> element in
 their representation. The element will have the following form

     <callback hook="http://example.com/cb">
       <delivery status="(noevent|success|failure)" time="2012-05-23T18:23"/>
     </callback>

 ## Registering a callback for an existing resource

 Resources that support state tracking allow updating the callback
 information with a PUT request to the resource. To register a callback for
 an instance, which will overwrite any existing callbacks, issue

     PUT /instances/42?track_hook=http://example.com/cb

 and to delete a callback, use the special token 'none':

     PUT /instances/42?track_hook=none

Authentication will be needed for the requests above, to make sure that a
third party can't modify my callback.

...
 # Callback

 When Tracker detects a change to a tracked resource, it will POST a JSON
 document to the hook URL [TODO: do we need XML, too ?]. The JSON body will
 look like

     {
       'token': security token,
       'changes': [
         { 'attribute': path in [JSON pointer notation](http://tools.ietf.org/html/draft-ietf-appsawg-json-pointer-00),
           'old': old value,
           'new': new value
         }
       ],
       'resource': ... representation of resource ...
     }

 For example, for an instance that just changed from 'pending' to 'running',
 the callback hook would receive the following JSON document

     {
       'token': "ABCDEFG42",
       'changes':
         [ { 'attribute': "/state", 'old': "PENDING", 'new': "RUNNING" } ],
       'resource': .. JSON object for the instance ...
     }

 The recipient for the hook should respond with 204 No Content to indicate
 that the update was received successfully.
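
A rough sketch of what a recipient could do with the callback body, in Ruby. It assumes the body arrives as valid JSON (double-quoted strings, unlike the shorthand in the examples). The 403 and 400 responses for a bad token or an unparsable body are assumptions; the draft only specifies 204 for success. `handle_callback` is a hypothetical name.

```ruby
require 'json'

# Handle one callback body and return the HTTP status code to respond with.
# Assumption: 403 for a token mismatch and 400 for a malformed body; the
# draft only defines 204 No Content for a successfully received update.
def handle_callback(body, expected_token)
  doc = JSON.parse(body)
  return 400 unless doc.is_a?(Hash)
  return 403 unless doc['token'] == expected_token

  doc.fetch('changes', []).each do |change|
    puts "#{change['attribute']}: #{change['old']} -> #{change['new']}"
  end
  204
rescue JSON::ParserError
  400
end
```

A web framework handler would pass the raw request body and the token it registered with `track_token`, then send back the returned status.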

 ## Explicitly retrieving events

 Callbacks can fail, and will be retried for a while, but at some point we
 have to give up trying to deliver the callback (or retry so infrequently
 that it's not really useful to the recipient).

 To make it possible for recipients to catch up after a failure on their
 side, we'll support a 'changes' collection that only allows GET:

     GET /changes

 The response to this request will be a JSON array, where each entry is the
 same JSON object that is used for delivering callbacks. Note that only
 changes pertinent to the current provider will be delivered, i.e. clients
 that use Tracker to track resources in multiple providers will need to make
 one request for each provider. Once a change has been delivered through
 this mechanism, it will be considered successfully delivered.
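
To illustrate the catch-up path, here is a small Ruby sketch that folds a `GET /changes` response body into the latest known value per resource and attribute. It assumes that each resource object carries an `id` field and that entries arrive oldest first; the draft states neither. `latest_values` is a hypothetical helper.

```ruby
require 'json'

# Fold a GET /changes response body (a JSON array of callback objects) into
# { resource id => { attribute path => latest value } }.
# Assumptions: every 'resource' has an 'id', and entries are ordered oldest
# first, so later entries overwrite earlier ones.
def latest_values(changes_body)
  latest = Hash.new { |hash, key| hash[key] = {} }
  JSON.parse(changes_body).each do |event|
    id = event.fetch('resource').fetch('id')
    event.fetch('changes').each do |change|
      latest[id][change['attribute']] = change['new']
    end
  end
  latest
end
```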

This might be a problem if Tracker is used by multiple clients - _all_ 
callbacks for a provider are fetched by one request no matter whom these 
callbacks are addressed to.

Maybe 'GET /changes' could just trigger a common retry of callback
delivery for all of the provider's pending callbacks instead of returning
changes directly. Another benefit is that 'GET /changes' then wouldn't have
to be authenticated.

...
 # Implementation Notes

 We will need to run a background job that performs the state
 polling. DelayedJob seems like the right tool for this; we'll want a 
+1 for DelayedJob - we already have this requirement in Conductor

...
 periodic job that goes out to each backend/provider and asks for changes to
 tracked resources. For the first cut, we can do this resource-by-resource,
 but longer term we want to be more clever and use cloud-specific features
 (DescribeInstances for multiple instances in EC2, changed-since for
 OpenStack etc.) and will therefore require driver support.

 Conceptually, Tracker decorates the backend driver for the API operations
 that are modified by the 'state_tracking' feature. It is therefore tempting
 to implement that aspect as a Module that gets included into drivers and
 does the decoration. By doing this at the driver level, state tracking is
 immediately available to all frontends.

 Implementing Tracker requires that we keep state about the registered
 callbacks, and about the previous state of tracked resources. We'll use an
 RDBMS (sqlite/postgresql) and ORM (DataMapper ?) for this purpose. Very
 roughly, I hope we can get away with this data model (plus a jobs table):

     class Provider
       property :id,       Serial
       property :driver,   String
       property :provider, String
       property :user,     String
       property :password, String
     end

     class Callback
       property :id,       Serial
       property :hook,     String
       property :token,    String
       property :res_type, String
       property :res_id,   String # Just enough to get resource from backend
       property :res_old,  Text   # Serialization of old resource state
       property :last_event, Timestamp
       # TODO: Need to track delivery state of callback and payload
       belongs_to :provider
     end
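
As an illustration of how `res_old` could feed the callback payload, the Ruby sketch below compares a stored serialization with a freshly fetched resource. It assumes `res_old` is stored as JSON (the draft only says "serialization") and compares top-level attributes only. Keys that would need escaping in JSON pointer notation are not handled. `detect_changes` is a hypothetical name.

```ruby
require 'json'

# Compare the stored serialization (res_old) with the current resource and
# return change entries in the shape used by the callback body.
# Assumptions: res_old is JSON, `current` is a Hash with string keys, and
# only top-level attributes are compared (nested values as a whole).
def detect_changes(res_old, current)
  old = JSON.parse(res_old)
  (old.keys | current.keys).sort.each_with_object([]) do |key, changes|
    next if old[key] == current[key]

    changes << { 'attribute' => "/#{key}", 'old' => old[key], 'new' => current[key] }
  end
end
```

An empty result means nothing changed since the last poll, so no callback would be issued.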

 ## Timings/frequencies

 There are a number of timings (polling frequency, how long and how often to
 retry change delivery). For now, they can be hardcoded, but should be in
 some central place so they can easily be tweaked - we do not want them
 controlled through the API though. We'll start with something like

 * Poll frequency: once a minute for instances in transient states (pending
   etc.), once every 15 minutes for instances in permanent states (running,
   stopped, ...)
 * Poll failure: retry at the normal frequency 5 times, then back off
   exponentially until frequency falls to once every two
   hours. [TODO: Should we issue a callback at that point ?]
 * Callback failure: retry callback 5 times every minute, then back off
   exponentially until frequency falls to once a day 
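
The timings listed above could live in one place along these lines. This is only a sketch: the doubling factor for "exponentially" and the list of transient states are assumptions, and the constant and method names are made up for illustration.

```ruby
# Central place for the proposed timings, all in seconds.
TIMINGS = {
  poll_transient: 60,               # once a minute
  poll_permanent: 15 * 60,          # once every 15 minutes
  poll_retry_cap: 2 * 60 * 60,      # back off until once every two hours
  callback_retry: 60,               # retry callbacks every minute at first
  callback_retry_cap: 24 * 60 * 60, # back off until once a day
  flat_retries: 5                   # retries at the normal frequency
}.freeze

# Assumption: the draft names only 'pending' as a transient state.
TRANSIENT_STATES = %w[PENDING].freeze

# Seconds to wait before polling an instance in the given state again.
def poll_interval(state)
  TRANSIENT_STATES.include?(state) ? TIMINGS[:poll_transient] : TIMINGS[:poll_permanent]
end

# Seconds to wait before retry number `attempt` (1-based): flat for the
# first five retries, then doubling (an assumed factor) up to `cap`.
def retry_delay(attempt, base, cap)
  return base if attempt <= TIMINGS[:flat_retries]

  [base * 2**(attempt - TIMINGS[:flat_retries]), cap].min
end
```

For callback failures this would be called as `retry_delay(n, TIMINGS[:callback_retry], TIMINGS[:callback_retry_cap])`, and for poll failures with the current poll interval and `TIMINGS[:poll_retry_cap]`.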
