Bosh director should be highly available #2601

Open
mkocher opened this issue Feb 5, 2025 · 0 comments

mkocher commented Feb 5, 2025

Is your feature request related to a problem? Please describe.

The bosh director currently is a single point of failure.

Describe the solution you'd like

I would like to be able to deploy multiple VMs that together, as a cluster, perform the function of a bosh director. A single VM failure should not affect the functioning of bosh.

Describe alternatives you've considered

We've built a script that, in the event of an AZ failure, updates all bosh-deployed VMs with a new message bus URL. This allows a bosh director to be restored manually in a different AZ.

Additional context

In the early days of bosh it was deployed as "micro bosh" and "full bosh". Microbosh was fully colocated, like the current bosh director, and full bosh was then deployed by the microbosh as a number of independent VMs - the web process, the worker processes, redis, postgres, nats etc. This provided some scalability, but because nats, postgres and redis were all single points of failure it actually resulted in a lower predicted MTBF. Since those days bosh has usually been deployed fully colocated.

In the intervening years some interesting developments have happened. NATS has grown from a SPOF ruby process into a message bus that supports several kinds of clustering, and Cloud Foundry has been using it drama-free for many years to distribute the routing table to gorouters. OSS highly available relational databases have become much easier to come by. Additionally, bosh has migrated from Resque to Delayed Job, removing the dependency on Redis entirely.

Making a highly available bosh director today is a much simpler proposition than it was when bosh was conceived. If we posit a reliable blobstore and database, making bosh highly available comes into reach.

Changes I Made

I've prototyped this out and the majority of it surprisingly Just Works™.

Here are the approximate steps I took (omitting some trial and error):

  • used bosh to deploy a bosh director using the misc/bosh-dev.yml ops file
  • configured bosh to use an external database
  • configured bosh to use an external blobstore
  • found many instances of ((internal_ip)) in bosh deployment and replaced them either with 127.0.0.1 or a DNS record pointing at the bosh director(s)
  • updated certificates to include the DNS record or 127.0.0.1 as valid alternative names
  • updated the nats configuration to support clustering by stealing the nats-release clustered configuration
  • updated the delayed job worker name in src/bosh-director/lib/bosh/director/worker.rb from "worker_#{@index}" to "worker_#{@index}_#{Config.process_uuid}", because delayed job workers that share a name will attempt to pick up the same jobs at the same time
  • changed instances: 1 to instances: 3
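
To illustrate the worker name change above, here is a minimal sketch of why appending a per-process UUID avoids the collision. It is not the director's actual code: `PROCESS_UUID` and `worker_name` are hypothetical stand-ins for `Config.process_uuid` and the naming logic in `worker.rb`, and I'm assuming the UUID is generated once per director process.

```ruby
require 'securerandom'

# Hypothetical stand-in for the director's Config.process_uuid:
# assumed to be one random UUID generated when the process boots.
PROCESS_UUID = SecureRandom.uuid

# Builds a worker name that is unique per director process, so two
# director VMs that both run "worker 0" no longer share a name.
def worker_name(index, process_uuid = PROCESS_UUID)
  "worker_#{index}_#{process_uuid}"
end

# Simulate two director VMs, each with its own process UUID and a worker
# at index 0. With the old "worker_#{index}" scheme both would be "worker_0".
vm_a = worker_name(0, SecureRandom.uuid)
vm_b = worker_name(0, SecureRandom.uuid)

puts vm_a
puts vm_b
puts "names collide: #{vm_a == vm_b}"
```

With the index-only name, every director instance would register identically named workers, which is the situation where delayed job picks up the same jobs at the same time.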

At this point the bosh director functions as a highly available director! It is possible to shut down a bosh director instance or two and connect to any remaining bosh directors and do a bosh deploy or bosh vms.

However, there are currently some caveats:

  • Health Monitor and Scheduler, when deployed on three instances, do everything three times as often. To do this for real they should use the database to elect a leader.
  • Task debug logs are stored on the director file system, so a request for a task debug log only succeeds when it happens to reach the instance that ran the task. These could be stored in the blobstore once the task is over.
  • Using a DNS record for the bosh director adds a significant barrier to entry. One fairly easy win here would be to teach the director NATS code to accept multiple IPs, as NATS now recommends; it does not support this today.
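
As a rough illustration of the leader election idea for Health Monitor and Scheduler, here is a minimal lease-based sketch. Everything in it is an assumption for illustration: `LeaseStore` is an in-memory stand-in for a single database row, the instance names are made up, and a real version would do the check-and-update in one SQL transaction rather than under a Ruby `Mutex`.

```ruby
# In-memory stand-in for one database row holding the current lease.
# Hypothetical: a real implementation would perform this compare-and-set
# atomically in the shared director database.
class LeaseStore
  def initialize
    @mutex = Mutex.new
    @holder = nil
    @expires_at = 0.0
  end

  # Grants or renews the lease if it is free, expired, or already held by
  # candidate_id. Returns true when candidate_id holds the lease afterwards.
  def try_acquire(candidate_id, ttl:, now:)
    @mutex.synchronize do
      if @holder.nil? || now >= @expires_at || @holder == candidate_id
        @holder = candidate_id
        @expires_at = now + ttl
      end
      @holder == candidate_id
    end
  end
end

store = LeaseStore.new

# Three hypothetical director instances compete at t=0; only the first
# one to reach the store becomes leader, the others stay on standby.
%w[director-0 director-1 director-2].each do |id|
  leader = store.try_acquire(id, ttl: 30, now: 0)
  puts "#{id}: #{leader ? 'leader' : 'standby'}"
end

# If the leader stops renewing, another instance can take over once the
# lease has expired (here at t=31, after the 30 second TTL).
puts "director-1 takes over: #{store.try_acquire('director-1', ttl: 30, now: 31)}"
```

Only the instance holding the lease would run scheduled work, and it renews the lease periodically. The TTL bounds how long scheduled work pauses after the leader VM disappears.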

Next Steps

I'm unfortunately setting this work down for the moment. Creating this issue to document how close bosh is to being HA in hopes that someday someone picks this up again and finishes it.
