Embedded transaction filesystem (ETFS)
ETFS implements a high-reliability filesystem for use with embedded solid-state memory devices, particularly NAND flash memory.
The filesystem supports a fully hierarchical directory structure with POSIX semantics as shown in the table above.
ETFS is a filesystem composed entirely of transactions. Every write operation, whether of user data or filesystem metadata, consists of a transaction. A transaction either succeeds or is treated as if it never occurred.
Transactions never overwrite live data. A write in the middle of a file or a directory update always writes to a new unused area. In this way, if the operation fails part way through (due to a crash or power failure), the old data is still intact.
Some log-based filesystems also operate under the principle that live data is never overwritten, but ETFS takes this to the extreme by turning everything into a log of transactions. The filesystem hierarchy is built on the fly by processing the log of transactions in the device. This scan occurs at startup, but is designed so that only a small subset of the data is read and CRC-checked, resulting in faster startup times without sacrificing reliability.
Transactions are position-independent in the device and may occur in any order. You could read the transactions from one device and write them in a different order to another device. This is important because it allows bulk programming of devices containing bad blocks that may be at arbitrary locations.
This design is well-suited for NAND flash memory. NAND flash is shipped with factory-marked bad blocks that may occur in any location.
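The position independence described above can be illustrated with a small in-memory model. This is only a sketch under assumptions: the `Transaction` class and `rebuild_files` function are hypothetical names, the fields mirror those listed under "Inside a transaction", and a real ETFS driver reads transactions from flash rather than from a Python list. The point is that replaying by sequence number gives the same result no matter where each transaction physically sits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    fid: int       # which file the data belongs to
    offset: int    # byte offset of the data within the file
    sequence: int  # monotonically increasing write order
    data: bytes    # the data portion


def rebuild_files(transactions):
    """Replay transactions in sequence order, ignoring their physical order."""
    files = {}
    for txn in sorted(transactions, key=lambda t: t.sequence):
        buf = files.setdefault(txn.fid, bytearray())
        end = txn.offset + len(txn.data)
        if len(buf) < end:
            # Grow the file image so the write fits.
            buf.extend(b"\x00" * (end - len(buf)))
        # A later transaction supersedes earlier data at the same offsets.
        buf[txn.offset:end] = txn.data
    return {fid: bytes(buf) for fid, buf in files.items()}


log = [
    Transaction(fid=1, offset=0, sequence=1, data=b"hello world"),
    Transaction(fid=1, offset=6, sequence=2, data=b"ETFS!"),
    Transaction(fid=2, offset=0, sequence=3, data=b"abc"),
]
# The same transactions as they might land on a second device,
# placed around bad blocks in a different physical order.
shuffled = [log[2], log[0], log[1]]
```

Both `rebuild_files(log)` and `rebuild_files(shuffled)` produce the same file contents, which is what makes bulk programming around arbitrarily located bad blocks possible.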
Inside a transaction
- FID
- A unique file ID that identifies which file the transaction belongs to.
- Offset
- The offset of the data portion within the file.
- Size
- The size of the data portion.
- Sequence
- A monotonically increasing number (to enable time ordering).
- CRCs
- Data integrity checks (for NAND, NOR, SRAM).
- ECCs
- Error correction (for NAND).
- Other
- Reserved for future expansion.
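As a rough illustration of how these fields fit together, the sketch below packs a transaction into bytes and verifies it on the way back. The layout is an assumption made for the example (four little-endian 32-bit fields followed by a CRC-32 over the header and data). It is not the actual ETFS on-device format, and it omits the ECC and reserved fields. The function names are hypothetical.

```python
import struct
import zlib

# Assumed layout: fid, offset, size, sequence as little-endian uint32 values.
HEADER = struct.Struct("<IIII")
CRC = struct.Struct("<I")


def pack_transaction(fid, offset, sequence, data):
    """Serialize one transaction: header, CRC-32, then the data portion."""
    header = HEADER.pack(fid, offset, len(data), sequence)
    crc = zlib.crc32(header + data)
    return header + CRC.pack(crc) + data


def unpack_transaction(raw):
    """Return (fid, offset, sequence, data), or None if the CRC check fails."""
    if len(raw) < HEADER.size + CRC.size:
        return None  # too short to hold a header and CRC
    header = raw[:HEADER.size]
    fid, offset, size, sequence = HEADER.unpack(header)
    (stored_crc,) = CRC.unpack_from(raw, HEADER.size)
    start = HEADER.size + CRC.size
    data = raw[start:start + size]
    if len(data) != size or zlib.crc32(header + data) != stored_crc:
        return None  # truncated or corrupted: treat as if it never occurred
    return fid, offset, sequence, data
```

A transaction that was cut short by a power failure, or that has a flipped bit, fails the check and is simply ignored, which matches the "succeeds or never occurred" rule.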
Types of storage media
| Class | CRC | ECC | Wear-leveling erase | Wear-leveling read | Cluster size |
|---|---|---|---|---|---|
| NAND 512+16 | Yes | Yes | Yes | Yes | 1 KB |
| NAND 2048+64 | Yes | Yes | Yes | Yes | 2 KB |
| RAM | No | No | No | No | 1 KB |
| SRAM | Yes | No | No | No | 1 KB |
| NOR | Yes | No | Yes | No | 1 KB |
Reliability features
- dynamic wear-leveling
- static wear-leveling
- CRC error detection
- ECC error correction
- read degradation monitoring with automatic refresh
- transaction rollback
- atomic file operations
- automatic file defragmentation
- Dynamic wear-leveling
- Flash memory allows a limited number of erase cycles on a flash block before the block fails. This number can be as low as 100,000. ETFS tracks the number of erases on each block. When selecting a block to use, ETFS attempts to spread the erase cycles evenly over the device, dramatically increasing its life. The difference can be extreme: in some usage scenarios, a device fails within a few days without wear-leveling but lasts over 40 years with it.
- Static wear-leveling
- Filesystems often consist of a large number of static files that are read but not written. These files will occupy flash blocks that have no reason to be erased. If the majority of the files in flash are static, this will cause the remaining blocks containing dynamic data to wear at a dramatically increased rate. ETFS notices these underworked static blocks and forces them into service by copying their data to an overworked block. This solves two problems: it gives the overworked block a rest, since it now contains static data, and it forces the underworked static block into the dynamic pool of blocks.
- CRC error detection
- Each transaction is protected by a cyclic redundancy check (CRC). This ensures quick detection of corrupted data, and forms the basis for the rollback operation of damaged or incomplete transactions at startup. The CRC can detect multiple bit errors that may occur during a power failure.
- ECC error correction
- On a CRC error, ETFS can apply error correction coding (ECC) to attempt to recover the data. This is suitable for NAND flash memory, in which single-bit errors may occur during normal usage. An ECC error is a warning signal that the flash block the error occurred in may be getting weak, i.e., losing charge. ETFS marks the weak block for a refresh operation, which copies the data to a new flash block and erases the weak block. The erasure recharges the flash block.
- Read degradation monitoring with automatic refresh
- Each read operation within a NAND flash block weakens the charge maintaining the data bits. Most devices support about 100,000 reads before there's danger of losing a bit. The ECC recovers a single-bit error, but may not be able to recover multi-bit errors. ETFS solves this by tracking reads and marking blocks for refresh before the 100,000 read limit is reached.
- Transaction rollback
- When ETFS starts, it processes all transactions and rolls back (discards) the last partial or damaged transaction. The rollback code is designed to handle a power failure during a rollback operation, thus allowing the system to recover from multiple nested faults. The validity of a transaction is protected by CRC codes on each transaction.
- Atomic file operations
- ETFS implements a very simple directory structure on the device, allowing significant modifications with a single flash write. For example, moving a file or directory to another directory is a multistage operation in many filesystems. In ETFS, a move is accomplished with a single flash write.
- Automatic file defragmentation
- Log-based filesystems often suffer from fragmentation, since each update or write to an existing file causes a new transaction to be created. ETFS uses write-buffering to combine small writes into larger write transactions, in an attempt to minimize the fragmentation caused by many very small transactions. ETFS also monitors the fragmentation level of each file and defragments badly fragmented files in the background. This background activity is always preempted by a user data request, to ensure immediate access to the file being defragmented.
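The bookkeeping behind dynamic wear-leveling and read-degradation refresh can be sketched with per-block counters. This is a simplified model, not the ETFS implementation: the `Block` class, the helper functions, and the `REFRESH_MARGIN` value are assumptions made for illustration. Only the figure of roughly 100,000 reads comes from the text above.

```python
from dataclasses import dataclass

READ_LIMIT = 100_000   # approximate read limit quoted in the text
REFRESH_MARGIN = 0.9   # assumed safety margin; not an ETFS parameter


@dataclass
class Block:
    index: int
    erase_count: int = 0
    read_count: int = 0
    free: bool = True


def pick_block_for_write(blocks):
    """Dynamic wear-leveling: choose the free block erased the fewest times."""
    free_blocks = [b for b in blocks if b.free]
    if not free_blocks:
        return None
    return min(free_blocks, key=lambda b: b.erase_count)


def needs_refresh(block):
    """Flag a block for refresh before its read count reaches the limit."""
    return block.read_count >= READ_LIMIT * REFRESH_MARGIN


def erase(block):
    """Erasing costs one erase cycle and resets the read counter."""
    block.erase_count += 1
    block.read_count = 0
    block.free = True
```

In this model, a block that has been read 95,000 times is flagged by `needs_refresh`. After its data is copied elsewhere and the block is erased, it rejoins the free pool with one more erase cycle on its counter, and `pick_block_for_write` weighs it against the other free blocks.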
