but in the LOD the data was created by different organizations, so the only way to join
the data is to use consistent URIs to identify nodes.
The number of datasets that participate in the LOD community is large and grow-
ing, but as you might guess, there are few ways to guarantee the quality and consis-
tency of public data. If you find inconsistencies and missing data, there's no easy way
to create bulk updates to correct the source data. This means you may need to manu-
ally edit hundreds of Wiki pages in order to add or correct data. After this is done, you
may need to wait till the next time the pages get indexed by the RDF extraction tools.
These are challenges that have led to the concept of curated datasets that are based
on public data but then undergo a postprocessing cleanup and normalization phase
to make the data more usable by organizations.
In this section, we've covered graph representations and shown how organizations
are using graph stores to solve business problems. We now move on to our third
NoSQL data architecture pattern.
4.3 Column family (Bigtable) stores
As you've seen, key-value stores and graph stores have simple structures that are useful
for solving a variety of business problems. Now let's look at how you can combine a
row and column from a table to use as the key.
Column family systems are important NoSQL data architecture patterns because
they can scale to manage large volumes of data. They're also known to be closely tied
with many MapReduce systems. As you may recall from our discussion of MapReduce
in chapter 2, MapReduce is a framework for performing parallel processing on large
datasets across multiple computers (nodes). In the MapReduce framework, the map
operation has a master node which breaks up an operation into subparts and distrib-
utes each operation to another node for processing, and reduce is the process where
the master node collects the results from the other nodes and combines them into the
answer to the original problem.
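
As a rough illustration of the map and reduce steps just described, the following single-process Python sketch counts words across several text chunks. The function names (map_phase, reduce_phase, word_count) and the sample input are assumptions made for this example; in a real MapReduce system each chunk would be processed on a separate node rather than in a loop.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: combine the intermediate pairs into one total per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def word_count(chunks):
    # In a cluster, each chunk would be handed to a different node.
    # Here the "nodes" are simulated by sequential function calls.
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_phase(chunk))
    # The master collects the partial results and combines them.
    return reduce_phase(intermediate)

counts = word_count(["row key column", "column family", "row store"])
print(counts)
```

The point of the split is that map_phase needs only its own chunk, so the chunks can be processed independently and in parallel.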
Column family stores use row and column identifiers as general-purpose keys for
data lookup. They're sometimes referred to as data stores rather than databases, since
they lack features you may expect to find in traditional databases. For example, they
lack typed columns, secondary indexes, triggers, and query languages. Almost all col-
umn family stores have been heavily influenced by the original Google Bigtable paper.
HBase, Hypertable, and Cassandra are good examples of systems that have Bigtable-
like interfaces, although how they're implemented varies.
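
To make the idea of a row-and-column key concrete, here is a minimal in-memory sketch in Python. The class name TinyColumnFamilyStore and its methods are hypothetical and don't correspond to the API of HBase, Hypertable, or Cassandra; the sketch only shows a lookup keyed by a (row, column) pair, with untyped values and no query language.

```python
class TinyColumnFamilyStore:
    """Toy store: each value is addressed by a (row_id, column_id) key."""

    def __init__(self):
        self._cells = {}

    def put(self, row_id, column_id, value):
        # Values are stored as-is; there are no typed columns.
        self._cells[(row_id, column_id)] = value

    def get(self, row_id, column_id, default=None):
        # Lookup is by the combined key only; there is no secondary index.
        return self._cells.get((row_id, column_id), default)

    def row(self, row_id):
        """Collect every column that has been stored for one row."""
        return {col: val for (row, col), val in self._cells.items()
                if row == row_id}

store = TinyColumnFamilyStore()
store.put("user:1001", "name", "Ann")
store.put("user:1001", "city", "Minneapolis")
store.put("user:1002", "name", "Dan")

print(store.get("user:1001", "city"))
print(store.row("user:1001"))
```

Note that different rows can hold different columns, and a missing cell simply returns the default rather than raising an error.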
We should note that the term column family is distinct from a column store. A column-
store database stores all information within a column of a table at the same location on
disk in the same way a row-store keeps row data together. Column stores are used in
many OLAP systems because their strength is rapid column aggregate calculation.
MonetDB, Sybase IQ, and Vertica are examples of column-store systems. Column-store
databases provide a SQL interface to access their data.
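
The difference between the two layouts can be sketched in a few lines of Python. The sample table and the to_columns helper are assumptions made for this illustration, and Python lists stand in for on-disk storage; the sketch only shows why an aggregate over one column touches less data when values are grouped by column.

```python
# Row-store layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "east", "sales": 120},
    {"id": 2, "region": "west", "sales": 80},
    {"id": 3, "region": "east", "sales": 200},
]

def to_columns(rows):
    """Regroup row-oriented records so each column's values sit together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

# Column-store layout: one contiguous list per column.
columns = to_columns(rows)

# A row store has to visit every record to total one field.
total_from_rows = sum(row["sales"] for row in rows)

# A column store reads only the single column it needs.
total_from_columns = sum(columns["sales"])

print(columns["sales"], total_from_rows, total_from_columns)
```

Both totals are the same; what changes is how much unrelated data has to be read to compute them.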