On Thu, Apr 05, 2012 at 05:44:20PM +0100, Angus Thomas wrote:
We should give administrators the ability to configure the
launch-time provider account selection policy for a specific pool,
and to set a global default policy to apply in pools where no custom
policy is defined.
The policy would be applied after a set of viable provider accounts
has been identified. Those will be the provider accounts to which the
relevant images have been pushed, and for which a set of hardware
profile matches can be made, etc.
The selection policy should work by defining a probability
distribution, stating how likely each provider account is to be
selected to host the new deployment, expressed as a percentage. Once
those percentages are calculated, Conductor should pick a random
number between 1 and 100 and attempt to launch on the lucky provider
account.
Using a probability range and randomly selecting within that range
might seem counter-intuitive: Having done the maths and assigned a
numerical probability to each provider account, based on its
suitability to host the deployment, why not just launch on the "best"
provider? The issue is one of scale. When considering a single
launch, selecting the account which gathered the highest score makes
sense, but once Conductor is managing a large volume of deployments,
the downside of that approach becomes clear: without the randomness,
a provider account which gathered more than 50% of the probability
ranking would get 100% of the instances.
I've been struggling with this for a bit. I see the reasoning, but it
also feels wrong to randomly pick anything. I suspect I just need a
little more time to digest this, though, as the latter thought is more
of a knee-jerk reaction.
To be clear, this is a weighted random selection, yes? In other words,
if a policy plug-in gave a weight of 90 to Cloud A, and 10 to Cloud B,
there is a 90% chance it would launch on Cloud A? In that case, I think
this is fairly sensible.
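
To make the 90/10 example concrete, here is a minimal sketch of a
weighted random pick in Python. The function name
pick_provider_account and the dict-of-weights input are assumptions
for illustration only, not anything Conductor defines today.

```python
import random


def pick_provider_account(weights, rng=None):
    """Pick one account with probability proportional to its weight.

    `weights` maps a provider account name to a non-negative weight
    (hypothetical input shape, chosen for this sketch).
    """
    if rng is None:
        rng = random.Random()
    # Accounts with a zero weight can never be selected.
    accounts = [name for name, weight in weights.items() if weight > 0]
    if not accounts:
        raise ValueError("at least one account needs a positive weight")
    # random.choices performs the weighted draw; weights need not sum to 100.
    return rng.choices(accounts, weights=[weights[a] for a in accounts], k=1)[0]


if __name__ == "__main__":
    print(pick_provider_account({"Cloud A": 90, "Cloud B": 10}))
```

Over many launches, Cloud A should receive roughly 90% of the
deployments, while any single launch can still land on Cloud B.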
Whilst the various policies should be stackable, one of the two
following policies should be the initial basis for the calculation:
*Round robin, with optional weighting:*
With this policy, Conductor would use each of the available provider
accounts equally, by assigning the same probability to each of them.
Varying the probabilities, to assign a weighting, would be useful in
instances where the private cloud providers associated with each
provider account are of differing sizes. e.g. Three vSphere clusters,
one of which has double the capacity of the other two. In that
circumstance, the Administrator could adjust the weighting ratios to
more closely reflect the actual capacities of each cluster.
It is worth noting that this isn't strictly round robin. The provider
accounts wouldn't be selected in strict rotation, though the overall
result is the same.
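
The round robin weighting described above could be sketched roughly
as follows, in Python. The function name and the idea of passing a
per-account weighting ratio (defaulting to 1) are assumptions made
for this example.

```python
def weighted_round_robin(accounts, weighting=None):
    """Return a selection percentage for each provider account.

    Every account starts with the same weight (1); `weighting` lets an
    administrator override that ratio for specific accounts
    (hypothetical parameter, for illustration).
    """
    weighting = weighting or {}
    raw = {name: float(weighting.get(name, 1)) for name in accounts}
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Normalise so the percentages add up to 100.
    return {name: 100.0 * weight / total for name, weight in raw.items()}


if __name__ == "__main__":
    # Three vSphere clusters, one with double the capacity of the others.
    print(weighted_round_robin(["vsphere1", "vsphere2", "vsphere3"],
                               {"vsphere1": 2}))
```

With the three-cluster example, the larger cluster ends up with 50%
and the other two with 25% each.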
This is perhaps an implementation detail, but how should all of this be
configured? Do we need a web interface so you can dynamically tune this,
or should it be stored in a config file somewhere?
*Least used, with optional weighting:*
This policy would make most sense in scenarios where Conductor is the
sole means by which instances are launched on private cloud
providers. Conductor would seek to ensure that the usage of the
providers was balanced, by giving a higher probability to whichever
provider accounts are currently least used. As with round robin, the
weightings could be adjusted to reflect differing capacities between
providers.
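
One possible reading of "least used", sketched in Python: give each
provider account a probability proportional to its remaining
headroom. The proposal does not say how usage should be measured, so
the capacity and running-instance counts used here, and the function
name, are assumptions.

```python
def least_used_probabilities(capacity, running):
    """Weight each account by its free headroom (capacity minus usage).

    `capacity` and `running` map account names to instance counts; both
    are hypothetical inputs chosen for this sketch.
    """
    if not capacity:
        return {}
    headroom = {
        name: max(capacity[name] - running.get(name, 0), 0)
        for name in capacity
    }
    total = sum(headroom.values())
    if total == 0:
        # Everything is full: fall back to equal probabilities.
        return {name: 100.0 / len(capacity) for name in capacity}
    return {name: 100.0 * free / total for name, free in headroom.items()}


if __name__ == "__main__":
    print(least_used_probabilities({"a": 100, "b": 100}, {"a": 80, "b": 20}))
```

Here the account with 80 of 100 slots in use gets 20%, and the one
with only 20 in use gets 80%.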
Having used one of those two policies to acquire an initial set of
probabilities, administrators could then elect to apply additional
policies, including:
*Assigned priority:*
The probability assigned to each provider account would be adjusted
according to the provider accounts' priority, by increasing the
probability ranking percentage of the higher priority provider
accounts at the expense of the lower priority ones.
This almost sounds like it could be the default, since it's all data we
have today.
*Punishing failure:*
Once the audit history records past failures, for each occurrence of
a launch failure within a configurable period (6 hours feels
reasonable), a provider account would be fined 5% from its
probability ranking. This would serve to reduce the attempts to
launch on a provider which is running out of capacity, or
experiencing hardware issues etc.
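
A rough Python sketch of the failure penalty, under two assumptions
that the proposal leaves open: the 5% fine is read as 5 percentage
points per failure, and the fined points are handed back to the other
accounts by renormalising to 100. The function and parameter names
are made up for the example.

```python
from datetime import datetime, timedelta


def punish_failures(probabilities, failures, now,
                    window=timedelta(hours=6), fine=5.0):
    """Fine each account `fine` points per launch failure inside `window`.

    `failures` maps an account name to a list of failure timestamps
    (hypothetical shape; the audit history format is not defined yet).
    """
    cutoff = now - window
    fined = {}
    for name, probability in probabilities.items():
        recent = sum(1 for t in failures.get(name, []) if cutoff <= t <= now)
        fined[name] = max(probability - fine * recent, 0.0)
    total = sum(fined.values())
    if total == 0:
        # Every account was fined to zero: keep the original ranking.
        return dict(probabilities)
    # Assumption: renormalise so the percentages sum to 100 again.
    return {name: 100.0 * p / total for name, p in fined.items()}


if __name__ == "__main__":
    now = datetime(2012, 4, 5, 18, 0)
    history = {"a": [now - timedelta(hours=1), now - timedelta(hours=7)]}
    print(punish_failures({"a": 50.0, "b": 50.0}, history, now))
```

Failures older than the window are ignored, so a provider recovers
its share once it stops failing.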
I definitely like the stackable aspect. Each plugin can return its
scores summing to 100, and Conductor will then aggregate them and
proceed accordingly.
The question (in my mind, at least) is _how_ to combine them, especially
where a plugin gives a weight of zero. Suppose I have two plugins, one
of which gives a weight of 50/50 between the two, and the other does
100/0. Should we just add them to get 150 and 50, or should we take a
score of zero as meaning, "Absolutely do not use this provider" and drop
it from consideration? Should this be configurable?
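
The two options can be compared side by side in a short Python
sketch. The function name combine_scores and the zero_is_veto flag
are hypothetical; they only illustrate the question, not a decided
design.

```python
def combine_scores(plugin_scores, zero_is_veto=False):
    """Combine per-plugin scores into one set of percentages.

    `plugin_scores` is a list of dicts, one per plugin, mapping account
    name to score. With zero_is_veto=True, a score of zero from any
    plugin removes that account from consideration.
    """
    totals = {}
    vetoed = set()
    for scores in plugin_scores:
        for name, score in scores.items():
            totals[name] = totals.get(name, 0.0) + score
            if zero_is_veto and score == 0:
                vetoed.add(name)
    kept = {name: t for name, t in totals.items() if name not in vetoed}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no provider account left to choose from")
    return {name: 100.0 * t / total for name, t in kept.items()}


if __name__ == "__main__":
    plugins = [{"A": 50, "B": 50}, {"A": 100, "B": 0}]
    print(combine_scores(plugins))                     # simple addition
    print(combine_scores(plugins, zero_is_veto=True))  # zero means "never"
```

For the 50/50 plus 100/0 case, simple addition gives 150 and 50
(75% and 25% after normalising), while treating zero as a veto leaves
only the first provider.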
*Cost*
There are three principal cloud uses which can incur costs:
consumption of network bandwidth, consumption of storage and running
a VM.
Happily, only one of these needs to be a factor when Conductor is
selecting a provider account to launch: the cost of running the VM.
The amount of network bandwidth that a deployment will consume is
pretty much unknowable at launch time. And, if it is known because,
for example, a deployment is for a streaming media server, then
Administrators can minimize costs by only launching that deployment
in a "Low cost bandwidth" pool.
As long as we're not supporting deployments which include the
allocation of additional storage, the costs of storage consumption
are an issue to consider at build & push time, rather than at launch
time.
So, in order to allow cost to be another factor which affects the
probability rankings, all we need is a cost per realm, per hour, for
each provider hardware profile, for each provider account.
Admins are going to have to enter that data themselves. That's not as
onerous as it sounds, given that, for example, it will often be the
case that costs will not vary across realms, so the UI can help by
pre-filling.
I had previously given some thought to the various metrics we might care
about -- where is bandwidth cheapest? Where is the cheapest place to run
a compute-intensive workload? Where can I launch a memory hog for the
least cost? But you raise a really good point that these aren't things
Conductor can know to optimize for.
But I do think administrators may want to optimize for some of these
things on their own. If I'm building an application that's going to push
massive bandwidth, I might still want to build it for multiple providers
for flexibility, but heavily weight it to whatever provider is cheapest
on high-bandwidth deployables. I don't think that's something we
can/should provide out of the box, but I think that savvy administrators
may want to write their own plugins for specialized use cases.
Clearly, for private clouds, no alternative means for getting pricing
data into Conductor exists. For public providers, it would be
beneficial if their APIs exported list pricing, however:
- Few organizations which operate on a scale which justifies using
Aeolus are likely to be paying list price
Really just an off-topic nit, but I like to think that even small
organizations that don't have this sort of muscle could benefit from
Aeolus. I'm not disputing your point that pricing isn't standard, I just
don't think Aeolus has to be exclusively for enormous deployments.
- Organizations may wish to store and export the adjusted costs that
they'll be assigning to users, rather than the basic costs appearing
on the provider's monthly invoice.
Once the cost data is in Conductor, adjusting the probabilities of
each provider account to favour whichever provider could more cheaply
host the specific range of hardware profiles would be a relatively
simple matter of increasing the selection probability percentages of
cheaper provider accounts, by a configurable amount, at the expense
of the more costly provider accounts.
You know, I wonder how complex people want to make this. If you're just
weighting based on instance cost, does it make sense to just use the
provider's existing weight field? Can we assign different weights
depending on the hardware profile that would be matched? In other
words: on average Provider X is cheaper, but for this specific
deployment, it would get matched onto a HWP which is considerably more
expensive than what it would match on Provider Y. Can we catch that case
and do something marginally intelligent, as opposed to having flat
weighting per-provider?
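
One way to catch that case, sketched in Python: feed the adjustment
the hourly cost of the hardware profile each provider account would
actually match for this deployment, rather than a flat per-provider
figure. The function name, the `strength` knob (standing in for the
"configurable amount") and the prices are all assumptions for
illustration.

```python
def favour_cheaper(probabilities, hourly_cost, strength=0.5):
    """Shift probability towards accounts with a cheaper matched HWP.

    `hourly_cost` maps each account to the hourly price of the hardware
    profile matched for this deployment (hypothetical input).
    strength=0 leaves the ranking untouched; strength=1 scales each
    probability by how cheap the account is relative to the cheapest.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    if any(hourly_cost[name] <= 0 for name in probabilities):
        raise ValueError("hourly costs must be positive")
    cheapest = min(hourly_cost[name] for name in probabilities)
    scaled = {}
    for name, probability in probabilities.items():
        relative = cheapest / hourly_cost[name]  # 1.0 for the cheapest account
        scaled[name] = probability * ((1.0 - strength) + strength * relative)
    total = sum(scaled.values())
    return {name: 100.0 * s / total for name, s in scaled.items()}


if __name__ == "__main__":
    # Made-up prices: Provider X matches a much dearer HWP than Provider Y.
    print(favour_cheaper({"X": 50.0, "Y": 50.0}, {"X": 0.40, "Y": 0.10}))
```

Because the cost is looked up per matched hardware profile, a
provider that is cheaper on average can still lose share for a
deployment where its match is the expensive one.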
Having completed the stack of modules' calculations, the result is a
final set of probabilities. At this point, Conductor would roll the
loaded dice and attempt to launch on the winner.
I wonder how this will work with Jan's work around robust instance
launching, especially around failing over to the next provider if
something goes wrong. Do we "roll the dice" a second time, with the
failed provider removed from the set?
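
If the answer is "yes", the retry loop could look something like this
Python sketch. It is not based on Jan's work; the function name and
the try_launch callback (returning True on success) are assumptions
for the example.

```python
import random


def launch_with_failover(probabilities, try_launch, rng=None):
    """Roll the dice, and re-roll without the loser on each failure.

    `try_launch` is a hypothetical callback that attempts a launch on
    the given account and returns True on success.
    Returns the account that launched, or None if all of them failed.
    """
    if rng is None:
        rng = random.Random()
    remaining = {name: p for name, p in probabilities.items() if p > 0}
    while remaining:
        accounts = list(remaining)
        choice = rng.choices(accounts,
                             weights=[remaining[a] for a in accounts],
                             k=1)[0]
        if try_launch(choice):
            return choice
        # Drop the failed account; the rest keep their relative weights.
        del remaining[choice]
    return None


if __name__ == "__main__":
    print(launch_with_failover({"A": 70, "B": 20, "C": 10},
                               lambda account: account == "B"))
```

Removing the failed account leaves the relative weights of the
others unchanged, so no separate renormalisation step is needed
before the second roll.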
The UI to allow Administrators to enable modules, and to tune the
parameters associated with them, could give a real-time
representation of the effect of the current settings for a specific
deployable. A certain type of Administrator would be very happy,
tuning options and seeing an immediate change in, for example, a pie
chart, which showed the resultant probability ranking percentages.
In future, we could provide Administrators with an interface to
implement their own selection modules. They might choose, for
example, to vary the selection probability percentages according to
time and date, to increase usage of private cloud at times when they
would otherwise be relatively idle.
I'm not sure if this needs to be a web UI (maybe it does), but I think
we ought to allow administrators to create and load their own plugins.
It could just be a matter of copying the plugin into the vendor/plugins
directory (or wherever these end up going).
-- Matt