On Mon, Apr 09, 2012 at 05:44:45PM -0700, Richard Su wrote:
Hi,
This expands on some of the notes Jan provided in other RFCs.
delayed_job and resque appear to be the most commonly deployed
solutions.
I have worked with delayed_job a little bit in the past, and was pretty
happy with it.
I listed what I thought should be the requirements for a background
processing solution. For each requirement I then added some details
on how well delayed_job and resque could satisfy it.
Resque contains most of the features we need. It requires Redis,
which is an open source project sponsored by VMware. Redis is
available in Fedora. But I don't see Redis available in RHEL and
getting it in for RHEL is the big question mark.
Redis is really cool. But the idea of pulling it in as a dependency just
for a queuing system feels like overkill to me. (IMHO)
https://www.aeolusproject.org/redmine/projects/aeolus/wiki/Background_Pro...
---
Background Processing
# Summary
The two most common solutions are delayed_job and resque. There is a
good write-up on GitHub comparing other background processing
solutions and explaining why they eventually moved to delayed_job and
then resque:
https://github.com/blog/542-introducing-resque.
When we first started using DelayedJob on a previous project, I was
really nervous about the idea of using our already-busy database for it,
and at how it would scale as our number of jobs grew. It sounds like
GitHub _did_ hit scaling issues, but on the project I worked on using
DJ, we never ran into any problems even as we enqueued thousands of jobs
and processed several hundred a minute. (Sending out customized emails,
etc.)
The primary differences between delayed_job and resque are:
At the moment, delayed_job doesn't have support for recurring jobs.
Resque does support recurring jobs through the resque-scheduler
extension/gem.
I've only taken a quick look so far, but it looks like resque-scheduler
leverages
https://github.com/jmettraux/rufus-scheduler for its cron-like
functionality. I wonder if we can take advantage of that?
resque provides a Sinatra app to monitor the queue. delayed_job
doesn't provide monitoring tools out of the box, but we can potentially
build something on top of Rails or simply look at the contents of the
database table.
So I actually view this as a plus for DJ. Resque comes with a standalone
Sinatra app. DelayedJob can be integrated into our app cleanly by
treating the table like it's ActiveRecord. My memory is slightly hazy,
but I seem to recall that we set up a Job model based on DJ with some
named scopes, so we could easily do "Job.failed.count" or
"Job.pending.count" for quick statistics, and we built a tiny little
admin controller for paginating through the list of jobs.
resque requires multiple components and could potentially be more
difficult to support. It requires a second gem called
resque-scheduler. It also uses Redis as its backend, and Redis is
currently not available in RHEL. This may be the deal breaker.
# Requirements
1. Bucket jobs into different queues. A long running job to check
instance status for 1000 instances should not hold up other jobs. The
solution should also support multiple workers, which would minimize
the impact of longer running jobs, but using different queues will
offer finer-grained control.
* delayed_job: supports multiple queues through named queues starting
with version 3.0. Can start up multiple workers for all queues or for
specific queues.
* resque: supports multiple queues and workers.
2. Jobs should persist in some way. If a crash occurs, we should be
able to restart the system and continue with processing incomplete
jobs in the queue.
* delayed_job: Jobs persist as objects stored in ActiveRecord entries.
* resque: Jobs persist as JSON objects in Redis entries. Using JSON
objects instead of actual objects, which may have advanced to a
different version, potentially makes updating the application easier.
3. Recurring jobs.
* delayed_job: Not available, in development.
* resque: Through resque-scheduler extension.
* whenever: A potential alternative to do cron style scheduling [6].
See also rufus-scheduler, linked above.
It's not clear to me if "whenever" actually lets you define new jobs while
running or not.
4. Alerts. Failures should be presented to the user in some way
(email, conductor UI) so that appropriate actions can be taken.
* delayed_job: Supports code hooks for different stages in the
process. Hooks can be added for error, failure, and success. By default
workers will retry a job 25 times. We should use a lower number. No
sense in retrying that number of times and holding up the queue if
there is a hard failure somewhere in the system. By default it also
deletes failed jobs, but can be configured to leave them in the queue
with a flag to indicate failure.
* resque: Failed jobs can go through additional processing using
different failure backends: Redis, syslog, custom, etc.
5. A mechanism to requeue a failed job once the underlying issue has
been resolved. If an instance start job fails because of a network
failure to a provider, we should be able to requeue those jobs once
the network is back online. Not sure if this should be automated or if
this should be a button somewhere that lets a user manually requeue
all or selected failed jobs.
I'm slightly uneasy about this, but perhaps I'm just not thinking it
through fully enough. If launching an instance fails, I think, per Jan's
robust image launching stuff, we want to just move on and try somewhere
else, rather than having the instance potentially pop up an hour later.
I think the general idea is good, though.
* custom
6. Monitor job status. We should have some way to see what is in the queue.
* delayed_job: Can only view the queue through ActiveRecord database
entries. There is no UI, so it is more difficult to see what is going
on.
* resque: Provides a Sinatra app to monitor queues, jobs, and workers.
Though as I mentioned above, I would put a different spin on it -- DJ
lets us use our existing ActiveRecord interfaces to query this data
easily and integrate it into our app. Resque comes with a standalone
Sinatra app.
7. Should not enqueue duplicate jobs.
* custom
8. Ability to remove jobs from the queues and to place a pause on the
queues or jobs.
* custom
9. Supportable in Fedora and RHEL
* delayed_job: We used it in the past. Will need to carry the gem.
* resque: Will need to carry the gem. In addition it requires Redis
as the backend. Redis is available in Fedora but not in RHEL. Redis
is an open source project sponsored by VMware [4].
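To make the "custom" items (5, 7, and 8) a bit more concrete, here is a
minimal in-memory sketch in plain Ruby. It is not delayed_job or resque
code: SketchQueue, its method names, and the max_attempts setting are
all hypothetical, and a real version would sit on top of whichever
backend we pick.

```ruby
# Hypothetical in-memory stand-in for a job queue. Not the delayed_job or
# resque API; it only illustrates requirements 5, 7, and 8.
class SketchQueue
  Job = Struct.new(:key, :name, :args, :attempts, :last_error)

  attr_reader :pending, :failed

  def initialize(max_attempts: 3)
    @max_attempts = max_attempts # deliberately far below a default like 25
    @pending = []
    @failed = []
    @paused = false
  end

  # Requirement 7: refuse a job whose name and arguments match a pending one.
  def enqueue(name, args = {})
    key = [name, args]
    return false if @pending.any? { |job| job.key == key }
    @pending << Job.new(key, name, args, 0, nil)
    true
  end

  # Requirement 8: pause or resume processing, or drop pending jobs by name.
  def pause!
    @paused = true
  end

  def resume!
    @paused = false
  end

  def remove(name)
    @pending.reject! { |job| job.name == name }
  end

  # Run the next job. The block does the work and raises on failure.
  # Returns :ok, :retry, :failed, or nil when paused or empty.
  def work_one
    return nil if @paused || @pending.empty?
    job = @pending.shift
    job.attempts += 1
    yield job
    :ok
  rescue StandardError => e
    job.last_error = e.message
    if job.attempts >= @max_attempts
      @failed << job # keep the failed job around instead of deleting it
      :failed
    else
      @pending << job
      :retry
    end
  end

  # Requirement 5: move every failed job back to pending with a fresh
  # attempt count. Returns the number of jobs requeued.
  def requeue_failed
    moved = @failed.size
    @failed.each { |job| job.attempts = 0 }
    @pending.concat(@failed)
    @failed.clear
    moved
  end
end
```

Whether requeue_failed gets called automatically or from a button in
the UI is the open question from item 5; the sketch does not take a
side on that.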
# Use Cases
1. Dbomatic replacement for instance and realm checking and RHEV
instance start.
Each RHEV instance that is created will also lead to a job that is
enqueued to start that instance.
Create a new job to perform instance status check. Create a status
check job for each provider account. Allow status check job to be
disabled/enabled per provider account.
Create a new job to sync realms for all providers. This can be broken
up to a job per provider if needed.
Create two queues: one for managing instance lifecycle, and a second
queue for all other jobs. Start with two workers per queue. Make the
number of workers configurable so that it may be adjusted when
needed.
I think this is a good approach, though it bears mention that if each
worker has a copy of the Rails runtime, this can end up gobbling up a
lot of memory.
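Here is a rough sketch of the two-queue, configurable-worker layout in
plain Ruby. Threads stand in for worker processes, so it says nothing
about the memory cost of real workers, and WORKERS_PER_QUEUE,
start_workers, stop_workers, and the queue names are all hypothetical
rather than anything delayed_job or resque provides.

```ruby
# Hypothetical settings: two workers per queue, as proposed above. In a
# real deployment these numbers would come from a config file.
WORKERS_PER_QUEUE = {
  "instance_lifecycle" => 2,
  "default"            => 2
}

# Start the configured number of worker threads for each queue. Each
# worker pops callable jobs from its own queue until it sees :stop, so a
# long job on one queue never blocks workers on the other.
def start_workers(queues, workers_per_queue)
  workers_per_queue.flat_map do |name, count|
    queue = queues.fetch(name)
    Array.new(count) do
      Thread.new do
        while (job = queue.pop) != :stop
          job.call
        end
      end
    end
  end
end

# Send one :stop marker per worker, then wait for all workers to exit.
def stop_workers(queues, workers_per_queue, threads)
  workers_per_queue.each do |name, count|
    count.times { queues.fetch(name) << :stop }
  end
  threads.each(&:join)
end
```

Changing a number in WORKERS_PER_QUEUE is the only adjustment needed to
add or remove workers for a queue in this sketch.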
-- Matt