Caution: Proxmox VE and OCFS2 Shared Storage

Proxmox VE (Virtual Environment) is enjoying a great deal of interest, not least because of the Broadcom/VMware deal. For many, the hypervisor is a welcome alternative to vSphere ESXi - but it differs when it comes to storage. You usually have the choice between:

  • NFS/CIFS
  • iSCSI
  • Btrfs (technical preview)
  • LVM
  • ZFS
  • CephFS/Ceph RBD
  • Gluster

However, snapshot functionality is not available for every storage type - ideally, one of the following should be selected:

  • NFS/CIFS (if qcow2 is used)
  • ZFS (local/over iSCSI)
  • LVM (thin-provisioned)
  • Gluster (if qcow2 is used)
  • CephFS/Ceph RBD

Note

More technical details can be found in the vendor's wiki.

Ceph is particularly interesting for HCI setups (hyper-converged infrastructure) with local storage, while conventional servers in the data center are more likely to rely on iSCSI or Fibre Channel storage. Unfortunately, the latter are only supported with LVM as shared storage and therefore offer no snapshot support.

Oracle Cluster Filesystem 2

Another approach is to use OCFS2 on a shared disk (e.g. attached via multipath) and mount the volume as a local directory.

OCFS2 is a cluster file system with a distributed lock manager, which regulates concurrent access to the storage. It has been an official component of the Linux kernel since version 2.6.16. By default, TCP port 7777 is used for cluster communication. The o2cb tool writes the configuration file /etc/ocfs2/cluster.conf.
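
This port must also be permitted by any firewall between the nodes. Once the cluster stack has been brought online (see below), reachability of a peer's o2net listener can be checked with a quick sketch like the following, assuming netcat is installed and <other-node-ip> stands for the address of another cluster node:

# nc -zv <other-node-ip> 7777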

Attention

The procedure shown here is not officially supported by Proxmox. Support requests will only be handled on a best-effort basis.

Implementation

First, the OCFS2 tools are installed and directories for the Lock Manager and the mountpoint are created.

# apt-get install ocfs2-tools
# mkdir /dlm /mnt/ocfs

Two parameters must be changed in the configuration file /etc/default/o2cb:

O2CB_ENABLED=true
O2CB_BOOTCLUSTER=MYPVE
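
Besides these two parameters, /etc/default/o2cb also contains the heartbeat and network timeout settings. They can normally be left alone, but they are the knobs to look at if nodes fence each other during brief network or storage hiccups. The values below are the usual defaults and may differ between versions:

O2CB_HEARTBEAT_THRESHOLD=31
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000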

The following commands must be executed on all cluster nodes in order to create and initialize the cluster:

# o2cb add-cluster MYPVE
# o2cb add-node --ip <ip> MYPVE node01
# o2cb add-node --ip <ip> MYPVE node02
...
# /etc/init.d/o2cb enable
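
These add-node calls populate /etc/ocfs2/cluster.conf. For the two-node example above, the result looks roughly like this (the IP addresses are placeholders, node numbers are assigned automatically, and the exact keys and their order may vary between ocfs2-tools versions):

cluster:
        node_count = 2
        name = MYPVE

node:
        ip_port = 7777
        ip_address = 192.0.2.11
        number = 0
        name = node01
        cluster = MYPVE

node:
        ip_port = 7777
        ip_address = 192.0.2.12
        number = 1
        name = node02
        cluster = MYPVE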

The file system is then created on one cluster node as follows:

# mkfs.ocfs2 -J block64 -T vmstore -L mypve /dev/mapper/<name>
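
Whether every node sees the freshly formatted volume can be checked with mounted.ocfs2 from the ocfs2-tools package; with -f it additionally lists the nodes that currently have the volume mounted (the device path is the same placeholder as above):

# mounted.ocfs2 -d /dev/mapper/<name>
# mounted.ocfs2 -f /dev/mapper/<name>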

A line must be added to the /etc/fstab file so that the volume is mounted when booting:

LABEL=mypve  /mnt/ocfs ocfs2 _netdev,defaults  0 0

Finally, the new storage is mounted and a new Proxmox storage is created (/etc/pve/storage.cfg):

# mount -a

dir: mypve
  path /mnt/ocfs
  content snippets,images,vztmpl,iso,rootdir
  is_mountpoint /mnt/ocfs
  shared 1
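
The same storage definition can presumably also be created with the pvesm command line tool instead of editing /etc/pve/storage.cfg by hand - a sketch of the equivalent call, with the option names following the storage.cfg properties:

# pvesm add dir mypve --path /mnt/ocfs --content snippets,images,vztmpl,iso,rootdir --is_mountpoint /mnt/ocfs --shared 1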

Note

Further technical details can be found in the following presentation from Heinlein Support.

Issues during upgrade

I recently observed a very disturbing error during an upgrade from Proxmox 8.0 to 8.1. With this upgrade, the Linux kernel changed from version 6.2.16 to 6.5.11 - and with it the version of the ocfs2 module.

The usual procedure for patching a cluster is as follows:

  1. Evacuate a node by moving its VMs to another node (see the command sketch after this list)
  2. Patch and reboot the node
  3. Move the VMs back
  4. Repeat the steps for the next node
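
On the command line, the evacuation and the move back boil down to live migrations - a minimal sketch, assuming VM ID 100 and the node names from above; the first call is run on node01 before patching, the second on node02 after the reboot:

# qm migrate 100 node02 --online
# qm migrate 100 node01 --online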

However, the last two steps did not work. Although the OCFS2 volume was mounted again correctly on the freshly patched node, and files on it could be viewed and written, running VMs did not work: they could be started, but were unable to write to the storage. The error messages in the kernel log were conspicuous (error code -5 corresponds to -EIO, a generic I/O error):

(kvm,182849,0):ocfs2_dio_end_io:2423 ERROR: Direct IO failed, bytes = -5
(kvm,182849,0):ocfs2_dio_end_io:2423 ERROR: Direct IO failed, bytes = -5
(kvm,182849,7):ocfs2_dio_end_io:2423 ERROR: Direct IO failed, bytes = -5
(kvm,182849,7):ocfs2_dio_end_io:2423 ERROR: Direct IO failed, bytes = -5

Rebooting into the old kernel solved the access problems. In this specific case, updating the cluster without downtime is therefore impossible - which defeats the purpose of a cluster in the first place.

It is not entirely clear to me whether OCFS2 generally only works if all cluster nodes have the same module version - or whether there were simply breaking changes in this specific version jump. Between the two kernel versions, I found the following changes to OCFS2 in the changelogs:

  • ocfs2: fix data corruption after failed write (6.3)
  • ocfs2: fix use-after-free when unmounting read-only filesystem (6.3.9 and 6.4)
  • ocfs2: Switch to security_inode_init_security() (6.4)
    • a new format for extended file attributes (xattrs) is used
  • ocfs2: remove redundant assignment to variable bit_off (6.5)

Another error message in the kernel log seems to point in the direction of the latter change:

Jan 05 13:37:00 proxmox01 kernel: seq_file: buggy .next function ocfs2_dlm_seq_next [ocfs2] did not update position index

Under these circumstances, I would advise against using OCFS2 in conjunction with Proxmox. 🫠

Note

Update from 27.03.2024: In the meantime, a workaround has been posted in the Proxmox forum. It consists of setting the VM disk parameter aio to the value threads.
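
Translated into a command, this means re-specifying the affected disk with aio=threads - a sketch, assuming VM 100 with a single scsi0 disk on the storage defined above (any other disk options that were set previously have to be repeated in the same call, and the change only takes effect after the VM has been restarted):

# qm set 100 --scsi0 mypve:100/vm-100-disk-0.qcow2,aio=threads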
