are guaranteed to be completely identical. However, CP systems cannot ensure high availability because of the high cost of consistency assurance. Therefore, CP systems suit scenarios with moderate load but stringent requirements on data accuracy (e.g., trading data). BigTable and HBase are two popular CP systems.
BigTable is well known because it successfully manages the backend data of Google's search engine. Because much of Google's data is structured, BigTable mainly stores data in tables. However, as more information is put into a table, the table grows and must be partitioned and stored separately; such tables are also usually highly sparse. BigTable therefore divides the columns into Column Families, where every column family stores the same type of information. In this way, similar data is stored together and the same type of information is processed in the same manner, which simplifies use of the system. Within a column family, new columns can be inserted arbitrarily, which greatly relaxes the usage restrictions of BigTable.
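The column-family model described above can be sketched as follows. This is an illustrative Python data structure, not BigTable's actual API; the class and method names are assumptions made for the example.

```python
# Sketch of a sparse table whose columns are grouped into column families
# (illustrative only; real BigTable exposes a different client API).
from collections import defaultdict

class SparseTable:
    def __init__(self, families):
        self.families = set(families)          # fixed at table creation
        # row_key -> family -> qualifier -> value (only stored cells use memory)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # A new column (qualifier) needs no schema change.
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family, qualifier):
        return self.rows[row_key][family].get(qualifier)

table = SparseTable(families=["anchor", "contents"])
table.put("com.example.www", "anchor", "cnn.com", "CNN")
table.put("com.example.www", "anchor", "bbc.com", "BBC")   # inserted arbitrarily
print(table.get("com.example.www", "anchor", "cnn.com"))   # -> CNN
```

Because only the cells that actually hold data are stored, a highly sparse table costs little memory, and columns of the same family stay together.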
BigTable is designed in a way similar to GFS, Google's distributed file system: one Master and several Tablet Servers form a star structure. A star structure has a single point of failure, so the load on the Master server should be kept low to minimize Master errors. In BigTable, data transmission and data addressing do not involve the Master, so its load is not high. To address the single point of failure itself, BigTable adopts a Master election mechanism: an asynchronous, consistent locking mechanism based on the Paxos protocol [3] ensures that exactly one Master is elected each time.
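The "exactly one Master" guarantee rests on an exclusive lock: every candidate tries to acquire it, and only the one that succeeds becomes Master. The real system backs the lock with a Paxos-based lock service; the single-process sketch below only illustrates the first-to-acquire-wins idea, and all names are hypothetical.

```python
# Simplified lock-based master election (illustration only; the production
# system uses a distributed, Paxos-backed lock service, not threading.Lock).
import threading

lock = threading.Lock()
master = None

def try_become_master(name):
    global master
    # Non-blocking acquire: at most one candidate can succeed.
    if lock.acquire(blocking=False):
        master = name                      # this candidate is the Master

candidates = [threading.Thread(target=try_become_master, args=(f"server-{i}",))
              for i in range(3)]
for t in candidates:
    t.start()
for t in candidates:
    t.join()
print("elected master:", master)           # exactly one candidate holds the lock
```

The losing candidates can simply watch the lock and re-run the election if the current Master ever releases it (e.g., by crashing).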
Data in BigTable is ordered lexicographically by row key. To insert a record into a sorted table, we must find the insertion position and then shift the existing data to make room for the new record, which is very time-consuming. BigTable uses batch processing to solve this problem. Specifically, BigTable stores data in two tables: a big table for historical data and a very small table for recently modified data. When the recent data accumulates to a certain amount, or after a certain period of time, BigTable merges the recent data into the historical data. This greatly reduces the number of times the big table is modified, since only the small table is modified frequently, and so the cost of data modification drops considerably. This method therefore mitigates the high cost of data changes and also speeds up look-ups of recently modified data.
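The two-table scheme above can be sketched as a minimal key-value store: writes land in a small dictionary, and once it passes a threshold its contents are merged into the big sorted table in one batch. The class and threshold are assumptions made for illustration (the design is in the spirit of an LSM tree, not BigTable's exact implementation).

```python
# Sketch of the two-table write scheme: cheap writes to a small recent
# table, periodic batch merges into a big sorted historical table.
import bisect

class TwoTableStore:
    def __init__(self, threshold=4):
        self.big = []            # sorted list of (key, value): historical data
        self.small = {}          # recently modified data, cheap to update
        self.threshold = threshold

    def put(self, key, value):
        self.small[key] = value
        if len(self.small) >= self.threshold:
            self._merge()        # expensive step happens rarely, in batch

    def get(self, key):
        if key in self.small:    # recently modified data is found fast
            return self.small[key]
        i = bisect.bisect_left(self.big, (key,))
        if i < len(self.big) and self.big[i][0] == key:
            return self.big[i][1]
        return None

    def _merge(self):
        # Merge all recent data into the big table at once, so the costly
        # sorted rebuild is amortized over many writes.
        merged = dict(self.big)
        merged.update(self.small)
        self.big = sorted(merged.items())
        self.small = {}

store = TwoTableStore(threshold=3)
for k, v in [("b", 2), ("a", 1), ("c", 3), ("d", 4)]:
    store.put(k, v)
print(store.get("a"), store.get("d"))   # -> 1 4
```

Each individual `put` touches only the small table; the big sorted table is rebuilt once per `threshold` writes instead of once per write.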
AP systems also ensure partition tolerance, but differ from CP systems in that they additionally guarantee availability. In exchange, AP systems provide only eventual consistency rather than the strong consistency of the previous two types of systems. Therefore, AP systems apply only to scenarios with frequent requests but not very high requirements on accuracy. For example, in online SNS (Social Networking Service) systems, there are many concurrent visits to the data, but a certain amount of data error is tolerable. Furthermore, because AP systems guarantee eventual consistency, accurate data can still be obtained after a certain delay. AP systems may therefore also be used when real-time requirements are not stringent.
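The trade-off can be seen in a toy model of eventual consistency (all names here are hypothetical): a write is acknowledged after updating one replica and propagates to the others asynchronously, so a read served by a lagging replica briefly returns stale data, yet all replicas agree once propagation completes.

```python
# Toy model of eventual consistency: writes are acknowledged immediately
# (availability) and replicated asynchronously (only eventual consistency).
class Replica:
    def __init__(self):
        self.data = {}

replicas = [Replica() for _ in range(3)]
pending = []   # asynchronous replication queue

def write(key, value):
    replicas[0].data[key] = value          # acknowledged right away
    pending.append((key, value))           # replicated later

def read(replica_index, key):
    return replicas[replica_index].data.get(key)

def propagate():
    # Runs "later": applies every queued write to every replica.
    while pending:
        key, value = pending.pop(0)
        for r in replicas:
            r.data[key] = value

write("likes", 42)
print(read(2, "likes"))    # -> None (stale read before propagation)
propagate()
print(read(2, "likes"))    # -> 42  (replicas have converged)
```

The stale `None` read is exactly the "certain amount of data error" an SNS workload tolerates; waiting until after `propagate()` corresponds to reading after the consistency delay.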