• How much total data will comprise the initial project?
• What is the shape of the data files that will make up the
initial project?
The initial data volume of the project gives you an idea of the scope
of your infrastructure. The shape of the files is an equally important
part of the initial scoping. Will you be working with a large number of
small files or relatively few large files? Hadoop is designed to work with
large chunks of data at a time. When files are smaller than the HDFS block
size (64MB by default in older releases), each file still gets its own map
task under the default input formats, so the number of map tasks needed to
complete a job increases. This slows down each job, potentially
significantly. If you are talking about thousands or millions of files,
each very small (in the low-kilobyte range), you'll probably find it best to
aggregate these files before loading them into Hadoop.
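As a rough illustration of the aggregation step, the following Python sketch groups small files into batches that stay at or under a target size and concatenates each batch into one larger file. The names plan_batches and write_batches, the batch-NNNNN.dat naming, and the 64MB target are assumptions made for this example, not part of Hadoop. Plain concatenation also assumes line-oriented records in which every file ends with a newline.

```python
import os
import shutil
from pathlib import Path

# Assumed target size, taken from the 64MB figure discussed in the text.
TARGET_BYTES = 64 * 1024 * 1024


def plan_batches(paths, target_bytes=TARGET_BYTES):
    """Group files, in the order given, into batches whose combined size
    stays at or under target_bytes. A single file larger than the target
    ends up in a batch of its own."""
    batches, current, current_size = [], [], 0
    for path in paths:
        size = os.path.getsize(path)
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def write_batches(batches, out_dir):
    """Concatenate each batch into one output file and return the new paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for index, batch in enumerate(batches):
        out_path = out_dir / f"batch-{index:05d}.dat"  # hypothetical naming scheme
        with open(out_path, "wb") as out:
            for path in batch:
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, out)
        outputs.append(out_path)
    return outputs
```

With this approach, ten thousand 10KB files (roughly 100MB in total) would collapse into two batch files instead of ten thousand separate inputs.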
Incremental Data
Loading data for any big data or data warehouse solution usually happens in
two phases. First, you perform the initial bulk load of historical data, and
then you determine an approach for ingesting incremental data. Once again,
you need to consider the size of the files that you will be loading into the
Hadoop Distributed File System (HDFS).
If the files are many but relatively small, consider using Flume to queue
them up and write them as a larger data set. Otherwise, write your data
to Hadoop from its source using the HDFS put command, much as you
would load a staging environment for a data warehouse. Write the full data
set to Hadoop, and then rely on your transform processes in Hadoop to
identify duplicates, remove unwanted data, and apply any additional
necessary transforms.
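The load-then-transform pattern described above might look like the following Python sketch. It wraps the hdfs dfs -put shell command and shows the duplicate-removal logic in miniature. The function names, the runner parameter, and the example paths are assumptions for illustration. In practice the duplicate check would run as a job inside Hadoop rather than in a single Python process.

```python
import subprocess


def build_put_command(local_path, hdfs_dir):
    """Return the argument list for copying a local file into HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]


def load_to_staging(local_path, hdfs_dir, runner=subprocess.run):
    """Copy the full, untransformed file into an HDFS staging directory.

    runner is injectable (an assumption of this sketch) so the function
    can be exercised on a machine without a Hadoop client installed.
    """
    return runner(build_put_command(local_path, hdfs_dir), check=True)


def drop_duplicates(records, key):
    """Yield only the first record seen for each key.

    This illustrates the logic of a later transform step; it is not how
    the step would be executed at scale inside Hadoop.
    """
    seen = set()
    for record in records:
        record_key = key(record)
        if record_key in seen:
            continue
        seen.add(record_key)
        yield record
```

A call such as load_to_staging("/data/orders.csv", "/staging/orders") would copy the file as is (both paths are hypothetical), leaving duplicate handling to the transform step that follows.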
Privacy Laws
The issue of privacy and big data is a large and diverse subject that could
fill a book by itself. These privacy issues cross national and cultural
boundaries. As you collect and store more data about your customers and
augment it with additional ambient data from either third parties or
public sources, you must be aware of the privacy laws that affect the
customers whose data you are collecting.