Is your feature request related to a problem? Please describe.
The bosh director currently is a single point of failure.
Describe the solution you'd like
I would like to be able to deploy multiple VMs which, as a cluster, perform the function of a bosh director. A single VM failure should not affect the functioning of bosh.
Describe alternatives you've considered
We've built a script which, in the event of an AZ failure, updates all bosh-deployed VMs with a new message bus URL. It has to be run manually, but it allows us to restore a bosh director in a different AZ.
Additional context
In the early days, bosh was deployed as "micro bosh" and "full bosh". Micro bosh was fully colocated, like the current bosh director, and full bosh was then deployed by the micro bosh as a number of independent VMs: the web process, the worker processes, redis, postgres, nats etc. This provided some scalability, but because nats, postgres and redis were all single points of failure it actually resulted in a lower predicted MTBF. Since those days bosh has usually been deployed fully colocated.
In the intervening years some interesting developments have happened. NATS has grown from a SPOF ruby process into a message bus which supports several kinds of clustering, and Cloud Foundry has been using it drama-free for many years to distribute the routing table to gorouters. OSS highly available relational databases have become much easier to come by. Additionally, bosh has migrated from Resque to Delayed Job, removing the dependency on Redis entirely.
Making a highly available bosh director today is a much simpler proposition than it was when bosh was conceived. If we posit a reliable blobstore and database, making bosh highly available comes within reach.
Changes I Made
I've prototyped this out and the majority of it surprisingly Just Works™.
Here are the approximate steps I took (omitting some trial and error):
used bosh to deploy a bosh director using the misc/bosh-dev.yml ops file
configured bosh to use an external database
configured bosh to use an external blobstore
found many instances of ((internal_ip)) in bosh deployment and replaced them either with 127.0.0.1 or a DNS record pointing at the bosh director(s)
updated certificates to include the DNS record or 127.0.0.1 as valid alternative names
updated the nats configuration to support clustering by stealing the nats-release clustered configuration
updated the delayed job worker name in src/bosh-director/lib/bosh/director/worker.rb from "worker_#{@index}" to "worker_#{@index}_#{Config.process_uuid}", because delayed job workers that share a name will attempt to pick up the same jobs at the same time
changed instances: 1 to instances: 3
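To illustrate why the worker name change matters, here is a minimal sketch in Ruby. It is a toy model of a job queue's reservation rule, not Delayed Job's actual code: it assumes a worker may reserve a job that is either unlocked or already locked under its own name. The Job struct and the reserve, worker_name and double_pickup? helpers are hypothetical names made up for this example.

```ruby
require "securerandom"

# Toy stand-in for a row in the jobs table; not the real Delayed Job schema.
Job = Struct.new(:id, :locked_by)

# Assumed reservation rule: a worker may take a job that is unlocked or
# that is already locked under the worker's own name.
def reserve(jobs, worker_name)
  job = jobs.find { |j| j.locked_by.nil? || j.locked_by == worker_name }
  job.locked_by = worker_name if job
  job
end

# Hypothetical helper mirroring the naming change described above.
def worker_name(index, process_uuid)
  "worker_#{index}_#{process_uuid}"
end

# Returns true when two workers on different directors both end up
# holding the same job.
def double_pickup?(name_a, name_b)
  jobs = [Job.new(1, nil)]
  first = reserve(jobs, name_a)
  second = reserve(jobs, name_b)
  !first.nil? && first.equal?(second)
end

# Index-only names collide across director instances.
puts double_pickup?("worker_0", "worker_0") # true

# Adding a per-process UUID keeps the names distinct.
puts double_pickup?(worker_name(0, SecureRandom.uuid),
                    worker_name(0, SecureRandom.uuid)) # false
```

With index-only names, worker 0 on each director looks like the same worker to the queue, so both can hold the same job. A per-process suffix avoids that without needing any coordination between instances.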
At this point the bosh director functions as a highly available director! It is possible to shut down a bosh director instance or two, connect to any remaining bosh director and run bosh deploy or bosh vms.
However, there are some caveats currently:
Health Monitor and Scheduler run independently on every instance, so when deployed on three instances they do everything three times. To do this for real they should use the database to elect a leader.
Task debug logs are stored on the director file system, so a request for a task debug log only succeeds when it happens to reach the instance that holds the log. These could be stored in the blobstore once the task is over.
Using a DNS record for the bosh director adds a significant barrier to entry. One fairly easy win here: the director NATS code does not support passing multiple IPs, as NATS now recommends, and adding that support would reduce the reliance on DNS.
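As a rough idea of what electing a leader through the database could look like, here is a minimal lease-based sketch in Ruby. It is an illustration under stated assumptions, not existing bosh code: LeaseTable is a hypothetical in-memory stand-in for a single-row table in the shared database, and try_acquire stands in for what would be one conditional UPDATE in a real implementation. The instance names and the 30 second TTL are made up.

```ruby
# Hypothetical in-memory stand-in for a single-row lease table.
class LeaseTable
  def initialize
    @holder = nil
    @expires_at = 0
    @mutex = Mutex.new
  end

  # Take or renew the lease if it is free, already ours, or expired.
  # Returns true when the candidate holds the lease afterwards.
  def try_acquire(candidate, now:, ttl:)
    @mutex.synchronize do
      if @holder.nil? || @holder == candidate || now >= @expires_at
        @holder = candidate
        @expires_at = now + ttl
      end
      @holder == candidate
    end
  end
end

lease = LeaseTable.new

# director-0 takes the lease first and becomes leader.
puts lease.try_acquire("director-0", now: 0, ttl: 30)  # true

# director-1 asks while the lease is still valid and stays a follower.
puts lease.try_acquire("director-1", now: 10, ttl: 30) # false

# director-0 stops renewing (for example its VM died); after expiry
# director-1 takes over.
puts lease.try_acquire("director-1", now: 45, ttl: 30) # true
```

Under this scheme each Health Monitor or Scheduler instance would call try_acquire on a timer and only do its periodic work while the call returns true, so a failed leader is replaced after at most one TTL.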
Next Steps
I'm unfortunately setting this work down for the moment. Creating this issue to document how close bosh is to being HA in hopes that someday someone picks this up again and finishes it.