Nokia's technology ecosystem has Hadoop at its core. Nokia has over 100 terabytes (TB) of structured data on Teradata and petabytes (PB) of multistructured data on HDFS. The centralized Hadoop cluster that lies at the heart of Nokia's infrastructure contains 0.5 PB of data. Nokia's data warehouses and data marts continuously stream multistructured data into a multitenant Hadoop environment, allowing the company's 60,000+ employees to access the data. Nokia runs hundreds of thousands of Scribe processes each day to efficiently move data from, for example, servers in Singapore to a Hadoop cluster in the U.K. data center. The company uses Sqoop to move data from HDFS to Oracle and/or Teradata, and it serves data out of Hadoop through HBase.
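The case study does not show how HBase serves this data, but a low-latency point read with the standard HBase Java client might look like the sketch below. The table name (user_metrics), the column family and qualifier (d:app_sessions), and the row-key scheme are all hypothetical, chosen only for illustration; the client API calls themselves are standard.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetricsLookup {
    public static void main(String[] args) throws Exception {
        // Standard client setup; reads hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // "user_metrics" is a hypothetical table name.
             Table table = connection.getTable(TableName.valueOf("user_metrics"))) {
            // Hypothetical row-key scheme: one row per device per day.
            Get get = new Get(Bytes.toBytes("device123#2012-06-01"));
            get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("app_sessions"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("d"),
                                           Bytes.toBytes("app_sessions"));
            if (value != null) {
                System.out.println("app_sessions = " + Bytes.toString(value));
            }
        }
    }
}
```

A single Get like this returns in milliseconds regardless of the size of the underlying table, which is what makes HBase suitable as a serving layer on top of a petabyte-scale cluster.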
Business challenges
Prior to deploying Hadoop, numerous groups within Nokia were building application silos to accommodate their individual needs. It didn't take long before the company realized it could derive greater
value from its collective data sets if these application silos could be integrated, enabling all globally
captured data to be cross-referenced for a single, comprehensive version of truth. “We were invento-
rying all of our applications and data sets,” O'Connor noted. “Our goal was to end up with a single
data asset.”
Nokia wanted to understand at a holistic level how people around the world interact with different applications, which required an infrastructure that could support daily, terabyte-scale streams of unstructured data from phones in use, services, log files, and other sources. Making this data consumable and useful for a variety of purposes, such as gleaning market insights or understanding the collective behavior of groups, also requires complex processing and computation; some aggregations of the data must also be easy to migrate to more structured environments in order to apply specific analytic tools.
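None of Nokia's actual jobs are described, so the following is only a minimal sketch of the kind of parallel aggregation Hadoop enables: a MapReduce job that counts application events across raw log files. The tab-delimited log layout, the field position of the event name, and the class names are assumptions; the MapReduce API is standard.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AppEventCounts {

    // Emits (event name, 1) for each log line; assumes a tab-delimited
    // layout with the event name in the second field (an assumption).
    public static class EventMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text event = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > 1) {
                event.set(fields[1]);
                context.write(event, ONE);
            }
        }
    }

    // Sums the per-event counts across all log files.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "app-event-counts");
        job.setJarByClass(AppEventCounts.class);
        job.setMapperClass(EventMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw logs on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // aggregate output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

An aggregate like this is small enough to export with Sqoop into Teradata or Oracle, matching the migration pattern the case study describes.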
However, capturing petabyte-scale data with a relational database was cost-prohibitive and would limit the data types that could be ingested. “We knew we'd break the bank trying to capture all this unstructured data in a structured environment,” O'Connor said. Because Hadoop uses industry-standard hardware, the cost per terabyte of storage is, on average, one-tenth that of a traditional relational data warehouse system. Additionally, in a relational warehouse, unstructured data must be reformatted to fit a relational schema before it can be loaded. This extra data processing step slows ingestion, creates latency, and eliminates elements of the data that could become important down the road.
Various groups of engineers at Nokia had already begun experimenting with Apache Hadoop, and a few were using Cloudera's Distribution Including Apache Hadoop (CDH). The benefits of Hadoop were clear: it offers reliable, cost-effective data storage and high-performance parallel processing of multistructured data at petabyte scale. However, the rapidly evolving platform, and the tools designed to support and enable it, are complex and can be difficult to deploy in production. CDH simplifies this process, bundling the most popular open-source projects in the Apache Hadoop stack into a single, integrated package with steady and reliable releases.
After experimenting with CDH for several months, the company decided to standardize on the Hadoop platform as the cornerstone of its technology ecosystem. With limited Hadoop expertise in-house, Nokia turned to Cloudera to augment its internal engineering team with strategic technical support and global training services, giving it the confidence and expertise necessary to deploy a very large production Hadoop environment in a short timeframe.