Copyright (C) SchedMD LLC.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Name | Version |
---|---|
terraform | ~> 1.3 |
google | >= 3.53 |
random | ~> 3.0 |

Name | Version |
---|---|
google | >= 3.53 |
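
The version constraints above could be pinned in a root module roughly as follows. This is a minimal sketch: it assumes the unnamed `>= 3.53` entry refers to the `google` provider (the resources listed in this document are all `google_*`), and the `hashicorp/...` source addresses are the usual registry locations rather than something stated here.

```hcl
terraform {
  # Matches the "terraform | ~> 1.3" requirement.
  required_version = "~> 1.3"

  required_providers {
    # Assumption: the ">= 3.53" row is the google provider.
    google = {
      source  = "hashicorp/google"
      version = ">= 3.53"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
}
```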

Name | Source | Version |
---|---|---|
bucket | terraform-google-modules/cloud-storage/google | ~> 5.0 |
slurm_controller_hybrid | ./modules/slurm_controller_hybrid | n/a |
slurm_controller_instance | ./modules/slurm_controller_instance | n/a |
slurm_controller_template | ./modules/slurm_instance_template | n/a |
slurm_files | ./modules/slurm_files | n/a |
slurm_login_instance | ./modules/slurm_login_instance | n/a |
slurm_login_template | ./modules/slurm_instance_template | n/a |
slurm_nodeset | ./modules/slurm_nodeset | n/a |
slurm_nodeset_dyn | ./modules/slurm_nodeset_dyn | n/a |
slurm_nodeset_template | ./modules/slurm_instance_template | n/a |
slurm_nodeset_tpu | ./modules/slurm_nodeset_tpu | n/a |
slurm_partition | ./modules/slurm_partition | n/a |

Name | Type |
---|---|
google_storage_bucket_iam_binding.legacyReaders | resource |
google_storage_bucket_iam_binding.viewers | resource |
google_compute_subnetwork.nodeset_subnetwork | data source |

Name | Description | Type | Default | Required |
---|---|---|---|---|
bucket_dir | Bucket directory for cluster files to be put into. If not specified, then one will be chosen based on slurm_cluster_name. | string | null | no |
bucket_name | Name of GCS bucket. Ignored when 'create_bucket' is true. | string | null | no |
cgroup_conf_tpl | Slurm cgroup.conf template file path. | string | null | no |
cloud_parameters | cloud.conf options. | object({...}) | {} | no |
cloudsql | Use this database instead of the one on the controller. server_ip: Address of the database server; user: The user to access the database as; password: The password, given the user, to access the given database (sensitive); db_name: The database to access. | object({...}) | null | no |
compute_startup_scripts | List of scripts to be run on compute VM startup. | list(object({...})) | [] | no |
compute_startup_scripts_timeout | The timeout (seconds) applied to each script in compute_startup_scripts. If any script exceeds this timeout, then the instance setup process is considered failed and handled accordingly. NOTE: When set to 0, the timeout is considered infinite and thus disabled. | number | 300 | no |
controller_hybrid_config | Creates a hybrid controller with given configuration. See 'main.tf' for valid keys. | object({...}) | {} | no |
controller_instance_config | Creates a controller instance with given configuration. | object({...}) | {} | no |
controller_startup_scripts | List of scripts to be run on controller VM startup. | list(object({...})) | [] | no |
controller_startup_scripts_timeout | The timeout (seconds) applied to each script in controller_startup_scripts. If any script exceeds this timeout, then the instance setup process is considered failed and handled accordingly. NOTE: When set to 0, the timeout is considered infinite and thus disabled. | number | 300 | no |
create_bucket | Create GCS bucket instead of using an existing one. | bool | true | no |
disable_default_mounts | Disable default global network storage from the controller: /usr/local/etc/slurm, /etc/munge, /home, /apps. If these are disabled, the slurm etc and munge dirs must be added manually, or some other mechanism must be used to synchronize the slurm conf files and the munge key across the cluster. | bool | false | no |
enable_bigquery_load | Enables loading of cluster job usage into BigQuery. NOTE: Requires Google BigQuery API. | bool | false | no |
enable_cleanup_compute | Enables automatic cleanup of compute nodes and resource policies (e.g. placement groups) managed by this module, when the cluster is destroyed. NOTE: Requires Python and script dependencies. WARNING: Toggling this may impact the running workload. Deployed compute nodes may be destroyed and their jobs will be requeued. | bool | false | no |
enable_debug_logging | Enables debug logging mode. Not for production use. | bool | false | no |
enable_devel | Enables development mode. Not for production use. | bool | false | no |
enable_hybrid | Enables use of hybrid controller mode. When true, controller_hybrid_config will be used instead of controller_instance_config and will disable login instances. | bool | false | no |
enable_login | Enables the creation of login nodes and instance templates. | bool | true | no |
enable_slurm_gcp_plugins | Enables calling hooks in scripts/slurm_gcp_plugins during cluster resume and suspend. | any | false | no |
epilog_scripts | List of scripts to be used for Epilog. Programs for the slurmd to execute on every node when a user's job completes. See https://slurm.schedmd.com/slurm.conf.html#OPT_Epilog. | list(object({...})) | [] | no |
extra_logging_flags | The list of extra flags for the logging system to use. See the logging_flags variable in scripts/util.py to get the list of supported log flags. | map(bool) | {} | no |
login_network_storage | Storage to be mounted on login and controller instances. server_ip: Address of the storage server; remote_mount: The location in the remote instance filesystem to mount from; local_mount: The location on the instance filesystem to mount to; fs_type: Filesystem type (e.g. "nfs"); mount_options: Options to mount with. | list(object({...})) | [] | no |
login_nodes | List of slurm login instance definitions. | list(object({...})) | [] | no |
login_startup_scripts | List of scripts to be run on login VM startup. | list(object({...})) | [] | no |
login_startup_scripts_timeout | The timeout (seconds) applied to each script in login_startup_scripts. If any script exceeds this timeout, then the instance setup process is considered failed and handled accordingly. NOTE: When set to 0, the timeout is considered infinite and thus disabled. | number | 300 | no |
network_storage | Storage to be mounted on all instances. server_ip: Address of the storage server; remote_mount: The location in the remote instance filesystem to mount from; local_mount: The location on the instance filesystem to mount to; fs_type: Filesystem type (e.g. "nfs"); mount_options: Options to mount with. | list(object({...})) | [] | no |
nodeset | Define nodesets, as a list. | list(object({...})) | [] | no |
nodeset_dyn | Defines nodesets (dynamic), as a list. | list(object({...})) | [] | no |
nodeset_tpu | Define TPU nodesets, as a list. | list(object({...})) | [] | no |
partitions | Cluster partitions as a list. See module slurm_partition. | list(object({...})) | n/a | yes |
project_id | Project ID to create resources in. | string | n/a | yes |
prolog_scripts | List of scripts to be used for Prolog. Programs for the slurmd to execute whenever it is asked to run a job step from a new job allocation. See https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog. | list(object({...})) | [] | no |
region | The default region to place resources in. | string | n/a | yes |
slurm_cluster_name | Cluster name, used for resource naming and slurm accounting. | string | n/a | yes |
slurm_conf_tpl | Slurm slurm.conf template file path. | string | null | no |
slurmdbd_conf_tpl | Slurm slurmdbd.conf template file path. | string | null | no |
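
As a rough illustration of how the inputs above fit together, the sketch below calls the module with the four required inputs (`partitions`, `project_id`, `region`, `slurm_cluster_name`) and a few optional ones. The module label `slurm_cluster`, the `source` path, the example region, the bucket name, and the two `variable` blocks are all hypothetical. The `partitions` object schema is not reproduced in this document, so it is passed through from a variable rather than written out.

```hcl
# Hypothetical caller-side variables; names are illustrative only.
variable "project_id" {
  type        = string
  description = "Project ID to create resources in."
}

variable "partitions" {
  # Assumption: the real schema is the list(object({...})) type declared
  # by this module; see module slurm_partition for the fields.
  type        = any
  description = "Cluster partitions as a list."
}

module "slurm_cluster" {
  # Assumption: adjust to wherever this module actually lives.
  source = "./path/to/this/module"

  # Required inputs (no default in the table above).
  project_id         = var.project_id
  region             = "us-central1" # example value
  slurm_cluster_name = "demo"        # example value
  partitions         = var.partitions

  # Use an existing bucket: bucket_name is ignored when create_bucket is true.
  create_bucket = false
  bucket_name   = "my-existing-bucket" # hypothetical bucket

  # Per the table, 0 disables the per-script timeout; 300 is the default.
  compute_startup_scripts_timeout = 300

  # Login nodes are enabled by default; shown here for clarity.
  enable_login = true
}
```

Inputs left out of the call fall back to the defaults listed in the table.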

Name | Description |
---|---|
cloud_logging_filter | Cloud Logging filter to find startup errors. |
cluster_config | Slurm partition details. |
slurm_bucket_path | Bucket path used by cluster. |
slurm_cluster_name | Slurm cluster name. |
slurm_controller_instance_details | Slurm controller instance details. |
slurm_controller_instance_self_links | Slurm controller instance self_link. |
slurm_controller_instances | Slurm controller instance object details. |
slurm_login_instance_details | Slurm login instance details. |
slurm_login_instance_self_links | Slurm login instance self_link. |
slurm_nodeset | Slurm nodeset details. |
slurm_nodeset_dyn | Slurm dynamic nodeset details. |
slurm_partition | Slurm partition details. |
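
The outputs above can be re-exported or consumed from a calling configuration. A minimal sketch, assuming the module has been instantiated under the hypothetical label `module "slurm_cluster"`; the output names on the left are illustrative:

```hcl
# Re-export the controller self_link(s) for use by other tooling.
output "controller_self_links" {
  value = module.slurm_cluster.slurm_controller_instance_self_links
}

# Cloud Logging filter for finding startup errors, as described in the table.
output "startup_error_filter" {
  value = module.slurm_cluster.cloud_logging_filter
}

# Bucket path used by the cluster.
output "cluster_bucket_path" {
  value = module.slurm_cluster.slurm_bucket_path
}
```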