Amazon S3 storage for very large objects
Amazon stores billions of objects in its redundant, fault-tolerant S3 system. The
design is proprietary and is not widely publicized. However, it is offered as a
service and can be used as such.
Please note that Amazon S3 is also a key-value store and delivers the same
functionality as Google BlobStorage. Again, you can use it only if you are allowed
to use AWS as a part of your solution. It might make more sense than you
initially imagined, as cloud-based storage is fault-tolerant, simple, and
reliable. However, it can be expensive for large files, and you might not have
enough bandwidth to move them in and out. So much for S3.
A practical approach
Now is the time for the less fancy and more practical approaches that you can
implement yourself. You can store the URL (the HDFS URL) in HBase and the
actual file in HDFS. This approach gives you the fault tolerance of HDFS,
since data is replicated three times by default in HDFS.
Here is what your logical data structure might look like in this approach:
File ID (UUID)                          File display name   File URL of access
f81d4fae-7dec-11d0-a765-00a0c91e6bf6    My cat movie        /user/storage/pic1.mp4
This does feel a little crude, doesn't it? However, this is a design pattern that has
been around since the birth of SQL databases, and it has worked. Considering that
HDFS scales to very large capacities and is fault-tolerant and self-healing, it can
work even better here.
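The pattern above can be sketched in a few lines. This is a minimal illustration, not a real HBase client: InMemoryTable is a hypothetical stand-in for an HBase table, the column names meta:name and meta:url are assumptions, and the file is assumed to have been written to HDFS already.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """One logical row: metadata lives in HBase, the bytes live in HDFS."""
    file_id: str        # row key (UUID)
    display_name: str   # human-readable name
    hdfs_path: str      # where the actual file is stored in HDFS


class InMemoryTable:
    """Hypothetical stand-in for an HBase table: row key -> {column: value}."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, columns):
        self._rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        return dict(self._rows.get(row_key, {}))


def register_file(table, display_name, hdfs_path, file_id=None):
    """Store a pointer to a file that was already written to HDFS."""
    record = FileRecord(file_id or str(uuid.uuid4()), display_name, hdfs_path)
    table.put(record.file_id, {
        "meta:name": record.display_name,  # assumed column names
        "meta:url": record.hdfs_path,
    })
    return record


def resolve_path(table, file_id):
    """Look up the HDFS path for a file ID; raises KeyError if unknown."""
    row = table.get(file_id)
    if not row:
        raise KeyError(file_id)
    return row["meta:url"]
```

A reader asks HBase for the path first and then opens that path in HDFS, so the large payload never passes through HBase itself.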
Finally, you can look for a library, or design your own, that stores a large file
as multiple small blocks, each one in HBase. I have not found such a library, and
it would face the usual problem of hiding the complexities of a big data system,
which aims for scalability. Here is what I mean: big data is complex. Just because
you created a nice library and your developers like it does not automatically mean
that they will use it in the most efficient way. Since big data is all about
performance, you might have to formulate a set of best practices, with examples,
so that your elegant solution delivers elegant performance. Otherwise, developers
might not know how to use it efficiently.
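To make the idea of storing a large file as multiple small blocks concrete, here is a rough sketch of the splitting and reassembly logic only. The row-key layout (file ID, a colon, and a zero-padded block index) and the 1 MiB block size are assumptions chosen for illustration; writing the pairs to HBase is left out.

```python
CHUNK_SIZE = 1024 * 1024  # 1 MiB per cell; an assumed value, tune for your cluster


def chunk_row_key(file_id, index):
    """Zero-pad the index so row keys sort lexicographically in block order."""
    return f"{file_id}:{index:08d}"


def split_into_chunks(file_id, data, chunk_size=CHUNK_SIZE):
    """Yield (row_key, chunk_bytes) pairs covering `data` in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        yield chunk_row_key(file_id, index), data[offset:offset + chunk_size]


def reassemble(rows):
    """Rebuild one file from its (row_key, chunk_bytes) pairs, in any order."""
    return b"".join(chunk for _, chunk in sorted(rows, key=lambda r: r[0]))
```

Because the index is zero-padded, all blocks of one file sit next to each other in sorted key order, so a single range scan over the file ID prefix would return them in sequence. Choosing the block size is exactly the kind of decision that a best-practices guide would need to cover.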