With namespace aggregation, files still have to be managed in
separate volumes, but a simple "veneer" layer allows individual directories
in volumes to be "glued" to a "top-level" tree via symbolic links. In that
model, LUNs and volumes, as well as volume limits, are still present. Files
have to be moved manually from volume to volume in order to load-balance.
The administrator has to be careful about how the tree is laid out. Tiering is
far from seamless and requires significant and continual intervention. Failover
requires mirroring files between volumes, driving down efficiency and ramping
up purchase cost, power, and cooling. Overall, the administrator burden when
using namespace aggregation is higher than it is for a simple traditional NAS
device, which prevents such infrastructures from growing very large.
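As a rough illustration of the "veneer" layer described above, the following Python sketch glues directories from separate volumes under one top-level tree with symbolic links. The volume paths and mount points are hypothetical, and the sketch is generic rather than tied to any particular NAS product:

import os

# Hypothetical per-volume directories that should appear under one tree.
VOLUMES = {
    "engineering": "/vol/vol01/engineering",
    "finance": "/vol/vol02/finance",
    "marketing": "/vol/vol03/marketing",
}

def build_veneer(top_level="/export/global"):
    """Glue per-volume directories into a single top-level tree via symlinks."""
    os.makedirs(top_level, exist_ok=True)
    for name, target in VOLUMES.items():
        link = os.path.join(top_level, name)
        if not os.path.islink(link):
            os.symlink(target, link)

# Usage (requires permission to create the top-level directory):
# build_veneer()

Note that the veneer changes only the view of the namespace: load-balancing still means physically copying files from one volume to another and re-pointing the corresponding symbolic link by hand.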
11.4.3 Data Layout
OneFS uses physical pointers and extents for metadata and stores file and
directory metadata in inodes. B-trees are used extensively in the file system,
allowing scalability to billions of objects and near-instant lookups of data
or metadata. OneFS is a completely symmetric and highly distributed file
system. Data and metadata are always redundant across multiple hardware
devices. Data is protected using erasure coding across the nodes in the cluster.
This creates a cluster with high efficiency, allowing up to 80% of raw capacity
to be used for data on clusters of five nodes or more. Metadata (which generally
makes up less than 1% of the system) is mirrored in the cluster for performance and
availability. Because OneFS is not reliant on RAID, the amount of redundancy
is selectable by the administrator at the file or directory level, beyond the
defaults of the cluster. Metadata access and locking tasks are managed by all
nodes collectively and equally in a peer-to-peer architecture. This symmetry
is key to the simplicity and resiliency of the architecture. There is no single
metadata server, lock manager, or gateway node.
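The efficiency figures above follow from simple arithmetic: with N data units protected by M redundancy units, the raw-to-usable ratio is N / (N + M). The short Python sketch below only illustrates that arithmetic; the stripe widths shown are examples, not a statement of how OneFS lays out any particular file:

def raw_to_usable(data_units, redundancy_units):
    """Fraction of raw capacity that holds user data."""
    return data_units / (data_units + redundancy_units)

print(raw_to_usable(8, 2))   # "N + 2" over a 10-unit stripe  -> 0.80
print(raw_to_usable(4, 1))   # "N + 1" over a 5-node cluster  -> 0.80
print(raw_to_usable(1, 2))   # 3x mirroring (e.g., metadata)  -> 0.33

Mirroring is far less space-efficient than erasure coding, which is why it is reserved for the small fraction of the system that is metadata.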
Because OneFS must access blocks from several devices simultaneously, the
addressing scheme used for data and metadata is indexed at the physical level
by a tuple of {node, drive, offset}. For example, if 12345 were a block address
for a block that lived on disk 2 of node 3, it would read {3, 2, 12345}. All
metadata within the cluster is multiply mirrored for data protection, at least
to the level of redundancy of the associated file. For example, if a file were at
an erasure-code protection of "N + 2," meaning the file could withstand two
simultaneous failures, then all metadata needed to access that file would be
mirrored three times, so it too could withstand two failures. The file system inherently
allows for any structure to use any and all blocks on any nodes in the cluster.
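A minimal Python sketch of the addressing tuple and the mirroring rule described above follows. The class and the concrete mirror locations are purely illustrative; they are not OneFS internals:

from typing import NamedTuple

class BlockAddress(NamedTuple):
    node: int     # node in the cluster that holds the block
    drive: int    # drive within that node
    offset: int   # block offset on that drive

# "A block at address 12345 on disk 2 of node 3":
addr = BlockAddress(node=3, drive=2, offset=12345)

# Metadata for a file protected at N+2 (survives two simultaneous failures)
# must itself be mirrored at least three times, on distinct nodes and drives.
# The addresses below are made-up placeholders.
metadata_mirrors = [
    BlockAddress(node=3, drive=2, offset=12345),
    BlockAddress(node=5, drive=1, offset=67890),
    BlockAddress(node=7, drive=4, offset=24680),
]
assert len({m.node for m in metadata_mirrors}) >= 3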
Other storage systems send data through RAID and volume management
layers, introducing inefficiencies in data layout and providing non-optimized
block access. Isilon's OneFS controls the placement of files directly, down to
the sector level on any drive anywhere in the cluster. This allows for optimized
data placement and I/O patterns and avoids unnecessary read-modify-write
operations. By laying data on the disks in a file-by-file manner, OneFS is able
to control the type of striping as well as the redundancy level of the storage
system at the system, directory, and even file level.
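The read-modify-write penalty mentioned above can be made concrete with textbook arithmetic for classic single-parity striping; the sketch below is a generic illustration, not a description of OneFS or of any vendor's RAID implementation:

def partial_stripe_write_ios():
    """Update one block in place: read old data, read old parity,
    write new data, write new parity."""
    return 4

def full_stripe_write_ios(data_units, parity_units):
    """Write a full protection group: parity is computed from data
    already in memory, so no preliminary reads are needed."""
    return data_units + parity_units

print(partial_stripe_write_ios())    # 4 I/Os to deliver 1 block of new data
print(full_stripe_write_ios(8, 2))   # 10 I/Os to deliver 8 blocks of new data

Laying data down in complete, file-by-file protection groups is what allows the read-modify-write cycle to be avoided.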
 