an experiment in which we counted the stream tuples that are processed using only
the non-swappable part of the disk buffer. The results of this experiment are shown
in Figure 4. As before, we set the size of the non-swappable part equal to
the size of the swappable part. The figure shows that, over 4000 iterations with
a memory budget of 50 MB and R containing 2 million tuples, about 0.4
million stream tuples are processed through the non-swappable part of the disk
buffer, and this number grows as the total allocated memory increases. With 250
MB of memory and the same size of R (2 million tuples), it reaches more
than 2 million. In the other algorithms, the data held in this non-swappable part
is instead loaded from the disk each time, which increases the I/O cost significantly.
Cost validation. We validate our results by comparing the predicted cost with
the measured cost. Figure 5 presents this comparison for each algorithm. For
every algorithm, the predicted cost closely matches the measured cost, which
supports the consistency of our study.
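One common way to quantify how closely a predicted cost matches a measured one is the relative error. The text does not state which measure was used, so the sketch below is only an illustration: the function name, the algorithm labels, and all cost values are hypothetical placeholders, not results from the experiments.

```python
def relative_error(predicted, measured):
    """Return |predicted - measured| / |measured| for one algorithm."""
    if measured == 0:
        raise ValueError("measured cost must be non-zero")
    return abs(predicted - measured) / abs(measured)


# Hypothetical placeholder costs (predicted, measured); NOT data from the paper.
costs = {
    "algorithm_a": (1.00, 1.05),
    "algorithm_b": (2.00, 1.90),
}

for name, (predicted, measured) in costs.items():
    err = relative_error(predicted, measured)
    print(f"{name}: relative error = {err:.1%}")
```

A small relative error for every algorithm would correspond to the close match described above.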
6 Conclusions and Future Work
In this paper, we explored the potential improvement for stream-based joins when
characteristics of the data, such as skew, are taken into account. MESHJOIN
performs worse with skewed distributions, which is a problem since these
distributions are common in real-world applications. We presented a robust algorithm
called X-HYBRIDJOIN (Extended Hybrid Join) with two major modifications
over MESHJOIN. The first modification is the use of an index on the disk-based
master data. The second modification is that X-HYBRIDJOIN caches the most
frequent tuples of the master data. As a result, it reduces disk access and improves
performance substantially. To validate our arguments, we implemented
prototypes for both modifications and carried out experiments comparing the
different algorithms. We provide open-source implementations of our algorithms.
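The two modifications summarized above can be illustrated with a deliberately simplified sketch. It is not the X-HYBRIDJOIN implementation: the class name `PinnedCacheJoin`, the in-memory dictionary standing in for the indexed disk-based master data, and the way hot keys are chosen from a sample are all assumptions made for illustration. The sketch only shows why pinning the most frequent master tuples in memory reduces disk lookups on skewed streams.

```python
from collections import Counter


class PinnedCacheJoin:
    """Toy stream join: hot master rows stay pinned in memory,
    all other rows are fetched through a keyed (index-like) lookup."""

    def __init__(self, master, hot_keys):
        # `master` stands in for indexed, disk-based master data: key -> row.
        self._master = master
        # Pinned cache of the most frequent rows (never evicted).
        self._pinned = {k: master[k] for k in hot_keys if k in master}
        self.pinned_hits = 0
        self.disk_lookups = 0

    def probe(self, key):
        """Return the master row for `key`, or None if there is no match."""
        if key in self._pinned:
            self.pinned_hits += 1
            return self._pinned[key]
        # Every miss in the pinned cache counts as one simulated disk access.
        self.disk_lookups += 1
        return self._master.get(key)

    def join(self, stream):
        """Join (key, payload) stream tuples with their master rows."""
        joined = []
        for key, payload in stream:
            row = self.probe(key)
            if row is not None:
                joined.append((key, payload, row))
        return joined


# Hypothetical master data and a small skewed stream (key 1 dominates).
master = {k: f"product-{k}" for k in range(1, 101)}
keys = [1, 1, 1, 2, 1, 3, 2, 1, 50, 1]
stream = [(k, f"sale-{i}") for i, k in enumerate(keys)]

# Assumption: hot keys are estimated from the key frequencies of a sample.
hot_keys = [k for k, _ in Counter(keys).most_common(2)]

joiner = PinnedCacheJoin(master, hot_keys)
result = joiner.join(stream)
print(len(result), joiner.pinned_hits, joiner.disk_lookups)  # prints: 10 8 2
```

In this toy run, eight of the ten stream tuples are served from the pinned cache, so only two simulated disk lookups remain. The more skewed the stream, the larger the share of tuples that never touch the disk.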
In the future we plan to tune the X-HYBRIDJOIN algorithm in order to
utilize the available memory resources optimally.
Source URL: The source code of our implementations and the pseudocode can be
downloaded from the following URL:
https://www.cs.auckland.ac.nz/research/groups/serg/src/
 