With namespace aggregation, files still have to be managed in
separate volumes, but a simple "veneer" layer allows individual directories
in volumes to be "glued" to a "top-level" tree via symbolic links. In that
model, LUNs and volumes, as well as volume limits, are still present. Files
have to be moved manually from volume to volume in order to load-balance.
The administrator has to be careful about how the tree is laid out. Tiering is
far from seamless and requires significant and continual intervention. Failover
requires mirroring files between volumes, driving down efficiency and ramping
up purchase cost, power, and cooling. Overall, the administrator burden when
using namespace aggregation is higher than it is for a simple traditional NAS
device, which prevents such infrastructures from growing very large.
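As a rough illustration of the "veneer" layer described above, the following Python sketch glues directories from separate volumes under one top-level tree with symbolic links. The volume paths and mount points are hypothetical, and the sketch is generic rather than tied to any particular NAS product:

import os

# Hypothetical per-volume directories that should appear under one tree.
VOLUMES = {
    "engineering": "/vol/vol01/engineering",
    "finance": "/vol/vol02/finance",
    "marketing": "/vol/vol03/marketing",
}

def build_veneer(top_level="/export/global"):
    """Glue per-volume directories into a single top-level tree via symlinks."""
    os.makedirs(top_level, exist_ok=True)
    for name, target in VOLUMES.items():
        link = os.path.join(top_level, name)
        if not os.path.islink(link):
            os.symlink(target, link)

# Usage (requires permission to create the top-level directory):
# build_veneer()

Note that the veneer changes only the view of the namespace: load-balancing still means physically copying files from one volume to another and re-pointing the corresponding symbolic link by hand.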
11.4.3 Data Layout
OneFS uses physical pointers and extents for metadata and stores file and
directory metadata in inodes. B-trees are used extensively in the file system,
allowing scalability to billions of objects and near-instant lookups of data
or metadata. OneFS is a completely symmetric and highly distributed file
system. Data and metadata are always redundant across multiple hardware
devices. Data is protected using erasure coding across the nodes in the cluster.
This creates a cluster with high efficiency, allowing up to 80% of raw capacity
to be used for data on clusters of five nodes or more. Metadata (which generally
makes up less than 1% of the system) is mirrored in the cluster for performance and
availability. Because OneFS is not reliant on RAID, the amount of redundancy
is selectable by the administrator at the file or directory level, beyond the
defaults of the cluster. Metadata access and locking tasks are managed by all
nodes collectively and equally in a peer-to-peer architecture. This symmetry
is key to the simplicity and resiliency of the architecture. There is no single
metadata server, lock manager, or gateway node.
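The efficiency figures above follow from simple arithmetic: with N data units protected by M redundancy units, the raw-to-usable ratio is N / (N + M). The short Python sketch below only illustrates that arithmetic; the stripe widths shown are examples, not a statement of how OneFS lays out any particular file:

def raw_to_usable(data_units, redundancy_units):
    """Fraction of raw capacity that holds user data."""
    return data_units / (data_units + redundancy_units)

print(raw_to_usable(8, 2))   # "N + 2" over a 10-unit stripe  -> 0.80
print(raw_to_usable(4, 1))   # "N + 1" over a 5-node cluster  -> 0.80
print(raw_to_usable(1, 2))   # 3x mirroring (e.g., metadata)  -> 0.33

Mirroring is far less space-efficient than erasure coding, which is why it is reserved for the small fraction of the system that is metadata.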
Because OneFS must access blocks from several devices simultaneously, the
addressing scheme used for data and metadata is indexed at the physical level
by a tuple of {node, drive, offset}. For example, if 12345 were a block address
for a block that lived on disk 2 of node 3, it would read {3, 2, 12345}. All
metadata within the cluster is multiply mirrored for data protection, at least
to the level of redundancy of the associated file. For example, if a file were at
an erasure-code protection of "N + 2," meaning the file could withstand two
simultaneous failures, then all metadata needed to access that file would be
mirrored three times, so it too could withstand two failures. The file system inherently
allows for any structure to use any and all blocks on any nodes in the cluster.
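A minimal Python sketch of the addressing tuple and the mirroring rule described above follows. The class and the concrete mirror locations are purely illustrative; they are not OneFS internals:

from typing import NamedTuple

class BlockAddress(NamedTuple):
    node: int     # node in the cluster that holds the block
    drive: int    # drive within that node
    offset: int   # block offset on that drive

# "A block at address 12345 on disk 2 of node 3":
addr = BlockAddress(node=3, drive=2, offset=12345)

# Metadata for a file protected at N+2 (survives two simultaneous failures)
# must itself be mirrored at least three times, on distinct nodes and drives.
# The addresses below are made-up placeholders.
metadata_mirrors = [
    BlockAddress(node=3, drive=2, offset=12345),
    BlockAddress(node=5, drive=1, offset=67890),
    BlockAddress(node=7, drive=4, offset=24680),
]
assert len({m.node for m in metadata_mirrors}) >= 3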
Other storage systems send data through RAID and volume management
layers, introducing inefficiencies in data layout and providing non-optimized
block access. Isilon's OneFS controls the placement of files directly, down to
the sector level on any drive anywhere in the cluster. This allows for optimized
data placement and I/O patterns and avoids unnecessary read-modify-write
operations. By laying data on the disks in a file-by-file manner, OneFS is able
to control the type of striping as well as the redundancy level of the storage
system at the system, directory, and even file level.
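The read-modify-write penalty mentioned above can be made concrete with textbook arithmetic for classic single-parity striping; the sketch below is a generic illustration, not a description of OneFS or of any vendor's RAID implementation:

def partial_stripe_write_ios():
    """Update one block in place: read old data, read old parity,
    write new data, write new parity."""
    return 4

def full_stripe_write_ios(data_units, parity_units):
    """Write a full protection group: parity is computed from data
    already in memory, so no preliminary reads are needed."""
    return data_units + parity_units

print(partial_stripe_write_ios())    # 4 I/Os to deliver 1 block of new data
print(full_stripe_write_ios(8, 2))   # 10 I/Os to deliver 8 blocks of new data

Laying data down in complete, file-by-file protection groups is what allows the read-modify-write cycle to be avoided.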
 