Disk Partitioning in M-grid Clusters

Authors: Arto Teräs, Pekka Tolvanen (CSC)
Status: Final, version 1.2
Date: 2004-09-28

Cluster disks should be partitioned according to this document. The RAID configuration on the frontend should already have been done by the HP engineer according to the installation and configuration guide. The file system type should be ext3 on all partitions except swap.

Cluster frontend

The cluster frontend should have two 72 GB system disks configured as RAID-1 (mirror) and a shared disk array configured as RAID-5. The partitions should be as follows:

Location     Mount point   Usage                                         Size             Comment
-----------  ------------  --------------------------------------------  ---------------  ------------------
system disk  - (swap)      Swap space                                    2 GB
system disk  /             Root, OS (CSC use)                            10 GB
system disk  /opt          Programs not belonging to the core OS         10 GB
system disk  /var          Logs, temporary files                         rest             Combined with /tmp
shared disk  /export/grid  Files related to Grid jobs                    10%, min. 50 GB
shared disk  /export       User home directories, data, local software   rest

Note that it is not possible to combine the /var and /tmp partitions during the installation, so the /tmp directory will initially reside on the / (root) partition. After the installation is complete, local admins should create a directory /var/tmp and replace /tmp with a symbolic link pointing to it.
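A minimal sketch of those steps (to be run right after installation, before users log in and while nothing holds files open in /tmp):

    mkdir -p /var/tmp                  # directory on the /var partition
    chmod 1777 /var/tmp                # world-writable with the sticky bit, like a normal /tmp
    cp -a /tmp/. /var/tmp/             # preserve anything already in /tmp, dotfiles included
    rm -rf /tmp
    ln -s /var/tmp /tmp                # /tmp now lives on the /var partition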

Quotas are mainly a question of local policy. A small default could be set by CSC, and sites can then manage their own quota requirements and sizes. FIXME: In the current version of the installation package quotas are not enabled by default. A reasonable amount of disk space for grid use is guaranteed by the separate /export/grid partition for the temporary data files related to grid jobs.
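If a site wants to enable quotas itself, the standard ext3 quota tools can be used. A hedged sketch, assuming /export is the partition to limit (the device name and the limits below are examples only):

    # Add the usrquota mount option to the /export line in /etc/fstab first, e.g.:
    #   /dev/sdb1  /export  ext3  defaults,usrquota  1 2
    mount -o remount /export
    quotacheck -cum /export            # create and populate aquota.user
    quotaon /export
    edquota -u username                # interactive; or non-interactively:
    setquota -u username 5000000 6000000 0 0 /export   # soft/hard limits in 1 KB blocks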

Sites can also decide how to divide the disk space between the actual user home directories, optional shared data directories under /export, and the directory /export/home/opt, which is reserved for local software installations. Rocks also has a directory /export/home/install where packages to be installed on the nodes are placed. If the home partition fills up, copying upgrades into this directory may fail; however, a separate partition for it was considered unnecessarily complicated.

Admin server

The admin server comes with two 80 GB system disks. They should be configured as software RAID-1 (mirror) by the local administrator; this can be done with Disk Druid in the installation program. The partitions should be as follows:

Location     Mount point  Usage                       Size   Comment
-----------  -----------  --------------------------  -----  -------------------------------------
system disk  - (swap)     Swap space                  1 GB
system disk  /            Root, OS, tmp               10 GB
system disk  /export      Home directories, backups   rest   Possibly /var will be moved here too

Note that there is no intention of exporting any admin server partitions over NFS; however, because the Rocks frontend installation is also used for the admin server, home directories are placed under /export by default.
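For reference, roughly the same layout can be expressed with Red Hat kickstart RAID directives; the sketch below assumes disks sda and sdb, and Disk Druid produces an equivalent result interactively:

    part raid.01 --size 1024  --ondisk sda        # swap halves
    part raid.02 --size 1024  --ondisk sdb
    part raid.11 --size 10000 --ondisk sda        # / halves
    part raid.12 --size 10000 --ondisk sdb
    part raid.21 --size 1 --grow --ondisk sda     # /export halves, rest of the disk
    part raid.22 --size 1 --grow --ondisk sdb
    raid swap    --level=1 --device=md0 raid.01 raid.02
    raid /       --level=1 --device=md1 --fstype ext3 raid.11 raid.12
    raid /export --level=1 --device=md2 --fstype ext3 raid.21 raid.22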

Compute nodes

Disk configuration of the compute nodes varies between groups. The following physical disk combinations are possible: a single disk, two disks as RAID-1 (mirror), or two disks as RAID-0 (stripe).

In general, it does not matter whether the disk size is 80 GB or 160 GB, but the different RAID configurations require different partitioning. The partitions are summarized in the following tables:

Compute node with one disk

Location    Mount point  Usage                                   Size   Comment
----------  -----------  --------------------------------------  -----  -------
first disk  - (swap)     Swap space                              1 GB
first disk  /            Root, OS                                5 GB
first disk  /opt         Programs not belonging to the core OS   10 GB
first disk  /tmp         Temporary files                         rest
 
Compute node with two disks as RAID-1 (mirror)

Location    Partitioning
----------  ----------------------------------------------------------------------------
both disks  As in compute nodes with only one disk (identical partitions on both disks)
 
Compute node with two disks as RAID-0 (stripe)

Location    Partitioning
----------  ----------------------------------------------------------------------------
both disks  As in compute nodes with only one disk (identical partitions on both disks)
The difference from the RAID-1 scheme is that here the two /tmp partitions are combined using RAID-0 into one large partition, while the / (root) and /opt partitions are configured as RAID-1 (mirror).

Based on tests by Mikael Johansson <mikael.johansson@helsinki.fi>, a chunk size of 128 KB gives optimal performance for the RAID-0 setup when using the ext3 filesystem.
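Should the array ever need to be recreated by hand, the same settings look roughly like this with mdadm (the partition names sda4 and sdb4 and the device md2 are assumptions; the partitioning schema files set this up at install time):

    mdadm --create /dev/md2 --level=0 --chunk=128 --raid-devices=2 /dev/sda4 /dev/sdb4
    mkfs.ext3 /dev/md2                             # ext3, as elsewhere in these clusters
    echo '/dev/md2  /tmp  ext3  defaults  1 2' >> /etc/fstab
    mount /tmp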

Some groups asked whether the swap should be bigger; we do not really see a need for that. Running calculations larger than the physical memory on cluster nodes is slow, and the nodes are not multi-user machines with several idle processes that could be pushed to swap. If a group still thinks it requires a large swap, it can modify the predefined partition table file, deducting the extra swap space from the /tmp partition; at least initially, groups will need to manually select the right partitioning schema file after frontend installation.
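As an illustration, a 4 GB swap on a one-disk node could look as follows in a kickstart-style schema file; because /tmp is the partition that grows to fill the disk, the extra swap space automatically comes out of it. The exact syntax here is an assumption, so check it against the actual files in partitioning_schemas:

    part swap --size 4096 --ondisk sda                    # was 1024 (1 GB)
    part /    --size 5000 --ondisk sda --fstype ext3
    part /opt --size 10000 --ondisk sda --fstype ext3
    part /tmp --size 1 --grow --ondisk sda --fstype ext3  # gets whatever is left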

Files implementing these partitioning schemas for the Rocks installation are available in the partitioning_schemas subdirectory.


Changelog

2004-09-28 Version 1.2. Rocks partitioning schema files added. (AJT)
2004-09-28 Version 1.1. RAID-0 node partitioning changed according to Mikael Johansson's suggestion. (AJT)
2004-09-27 Version 1.0.