Deploying Alluxio in High Availability mode with the UFS journal requires a shared storage medium (the journal UFS) that all masters can access. The journaling system is vital to an Alluxio master's health because it is the bedrock of High Availability. This article describes what happens when the journal UFS becomes unavailable and how to avoid the problem.
Root Cause Analysis
Background
The Alluxio master applies journal entries to its internal state before it journals them. However, it only returns success to the client after the journal entry has been flushed to the journal UFS. Entries that have been applied to the master's internal state but not yet flushed to the journal UFS are said to be buffered.
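The apply-then-flush ordering described above can be illustrated with a small toy model. This is a minimal sketch, not Alluxio code: the class names `ToyJournal` and `ToyMaster` and all of their methods are hypothetical and exist only to show how entries become buffered when a flush fails.

```python
class JournalUnavailable(Exception):
    """Raised by the toy journal when the journal UFS cannot be reached."""


class ToyJournal:
    """Hypothetical stand-in for the journal UFS."""

    def __init__(self):
        self.available = True
        self.flushed = []

    def flush(self, entries):
        if not self.available:
            raise JournalUnavailable("journal UFS unreachable")
        self.flushed.extend(entries)


class ToyMaster:
    """Hypothetical master: applies entries first, then flushes them."""

    def __init__(self, journal):
        self.journal = journal
        self.state = {}     # in-memory metadata
        self.buffered = []  # applied to state but not yet flushed

    def apply(self, key, value):
        # The entry is applied to internal state before it is journaled.
        self.state[key] = value
        self.buffered.append((key, value))

    def flush(self):
        # Success is reported only once the buffered entries are flushed.
        try:
            self.journal.flush(self.buffered)
        except JournalUnavailable:
            return False  # entries stay buffered
        self.buffered = []
        return True
```

In this model, an entry applied while `journal.available` is `False` stays in `buffered` and is only written out by a later successful `flush()`. If the master process dies before that happens, the buffered entries are gone.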
Problem description
The machine running the journal UFS becomes unavailable. The leading master is no longer able to write new journal entries to the journal.
This will manifest in the master logs as:
2022-07-19 17:30:47,728 WARN AsyncJournalWriter - Failed to flush journal entry: Failed to flush journal file (UfsJournalFile{location=hdfs:// ...
Client-side operations may time out with an error such as:
Failed to connect to master (localhost:19998) after 22 attempts.
Please check if Alluxio master is currently running on "localhost:19998". Service="FileSystemMasterClient"
Current behavior in this failure case
When the shared storage becomes unavailable, the default behavior is to attempt to write / flush the journal to the shared storage every second for five minutes. After the five minutes have elapsed, the master process will crash.
The retry interval can be configured using:
alluxio.master.journal.retry.interval
The timeout can be configured using:
alluxio.master.journal.flush.timeout
Increasing the timeout exposes you to more metadata operations being lost in the case where the journal UFS does not become available again: more operations will get buffered by the master, and all buffered but unflushed operations are lost.
If the connection to the shared storage is restored before the timeout elapses, the master takes a moment to recover and then resumes normal operation.
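The retry-until-timeout behavior described in this section can be sketched as a simple loop. This is an illustrative simulation under stated assumptions, not Alluxio's implementation: the function `flush_with_retry`, its parameters, and the simulated clock are hypothetical. The defaults mirror the one-second interval and five-minute timeout mentioned above.

```python
def flush_with_retry(try_flush, retry_interval_s=1.0, flush_timeout_s=300.0):
    """Retry a flush on a simulated clock.

    try_flush is a hypothetical callable that receives the elapsed time in
    seconds and returns True when the flush succeeds.
    Returns (succeeded, attempts, elapsed_s).
    """
    elapsed = 0.0
    attempts = 0
    while True:
        attempts += 1
        if try_flush(elapsed):
            return True, attempts, elapsed
        # Give up once the next retry would fall past the timeout. A real
        # master would crash at this point; the sketch returns False.
        if elapsed + retry_interval_s > flush_timeout_s:
            return False, attempts, elapsed
        elapsed += retry_interval_s  # simulated wait, no real sleeping


# Journal UFS comes back 10 seconds into the outage: the flush succeeds.
recovered = flush_with_retry(lambda t: t >= 10)

# Journal UFS never comes back: the loop gives up at the timeout.
gave_up = flush_with_retry(lambda t: False)
```

Raising `flush_timeout_s` in this model lets the loop ride out longer outages, at the cost described above: more operations accumulate in the buffer while the loop is still retrying.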
Externalities
If the journal UFS is unavailable for a prolonged period before becoming available again, a client can see a timeout or failure for an operation even though the journal entries produced by that operation are still written to the journal, because they were buffered while the journal UFS was unavailable.
Effectively, a client can see an operation fail in the short term due to timeout while it succeeds later on.
Justification for the current behavior
Journal writing / flushing exceptions are exceptionally hard to recover from because journal flushes are what keep the masters in sync. Without journal flushes, High Availability is nonexistent, as the leading master's operations cannot be persisted for the other masters to see.
Journal writing / flushing, along with the ZooKeeper dependency, is critical to Alluxio's correct operation. Alluxio's retry mechanism can ride out an outage of a few minutes, but any prolonged outage of the journal UFS is fatal.
Resolution
The best way to avoid this problem is to use the Embedded Journal instead of the UFS Journal. The Embedded Journal does not rely on a single connection to the journal UFS. Rather, it relies on a majority of master nodes being available.
This conclusion is further supported by the comparison between Embedded Journal and UFS Journal in the docs.
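The "majority of master nodes" condition can be made concrete with a short calculation. This is a minimal sketch assuming a strict-majority rule (more than half of the masters); the function name `has_quorum` is hypothetical and not part of any Alluxio API.

```python
def has_quorum(total_masters, available_masters):
    """Return True when a strict majority of masters is available."""
    return available_masters >= total_masters // 2 + 1


# With 3 masters, losing one still leaves a majority; losing two does not.
three_masters_one_down = has_quorum(3, 2)
three_masters_two_down = has_quorum(3, 1)
```

Under this rule, a cluster with an even number of masters tolerates no more failures than the next smaller odd-sized cluster, which is why odd master counts are the usual choice for majority-based systems.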