PARALLEL PROCESSING OF "GROUP-BY JOIN" QUERIES ON SHARED NOTHING MACHINES

M. Al Hajj Hassan, M. Bamha

2006

Abstract

SQL queries involving join and group-by operations are frequently used in decision support applications. In these applications, the input relations are usually very large, so parallelizing these queries is highly recommended in order to obtain a reasonable response time. The main drawbacks of existing parallel algorithms for this kind of query are that they are very sensitive to data skew and incur expensive communication and Input/Output costs in the evaluation of the join operation. In this paper, we present an algorithm that minimizes the communication cost by performing the group-by operation before redistribution, so that only tuples that will be present in the join result are redistributed. In addition, it evaluates the query without materializing the result of the join operation, thus reducing the Input/Output cost of join intermediate results. The performance of this algorithm is analyzed using the scalable and portable BSP (Bulk Synchronous Parallel) cost model, which predicts a near-linear speed-up even for highly skewed data.
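The idea of grouping before redistribution can be illustrated with a small single-process simulation. This is only a minimal sketch under stated assumptions, not the algorithm from the paper: the query shape (SELECT k, SUM(R.v * S.w) FROM R JOIN S ON R.k = S.k GROUP BY k), the function names, and the sample data are all hypothetical, processors are modelled as Python lists, and the set of joinable keys is computed globally for simplicity, whereas a real shared-nothing system would have to determine it in a distributed way.

```python
from collections import defaultdict


def local_group_by(fragment):
    """Partial aggregation on one simulated processor: key -> sum of values."""
    sums = defaultdict(int)
    for key, value in fragment:
        sums[key] += value
    return dict(sums)


def group_by_join(r_fragments, s_fragments):
    """Evaluate SUM(R.v * S.w) per join key without materializing the join.

    Returns (result, sent), where `sent` counts the partial aggregates that
    were redistributed between simulated processors.
    """
    p = len(r_fragments)

    # Step 1: every processor aggregates its own fragment locally.
    r_local = [local_group_by(f) for f in r_fragments]
    s_local = [local_group_by(f) for f in s_fragments]

    # Step 2 (simplification): keys that occur in both relations.
    r_keys = set().union(*(d.keys() for d in r_local))
    s_keys = set().union(*(d.keys() for d in s_local))
    joinable = r_keys & s_keys

    # Step 3: redistribute only partial aggregates whose key will join.
    r_dest = [defaultdict(int) for _ in range(p)]
    s_dest = [defaultdict(int) for _ in range(p)]
    sent = 0
    for local, dest in ((r_local, r_dest), (s_local, s_dest)):
        for partial in local:
            for key, total in partial.items():
                if key in joinable:
                    dest[hash(key) % p][key] += total
                    sent += 1

    # Step 4: combine per key; SUM(v * w) over the join equals SUM(v) * SUM(w).
    result = {}
    for r_part, s_part in zip(r_dest, s_dest):
        for key, r_total in r_part.items():
            result[key] = r_total * s_part[key]
    return result, sent


def naive_join_then_group(r_fragments, s_fragments):
    """Reference evaluation: materialize the join, then aggregate."""
    r = [t for f in r_fragments for t in f]
    s = [t for f in s_fragments for t in f]
    out = defaultdict(int)
    for r_key, v in r:
        for s_key, w in s:
            if r_key == s_key:
                out[r_key] += v * w
    return dict(out)


# Hypothetical data: two processors, tuples are (join key, value).
r_fragments = [[(1, 10), (1, 5), (2, 7)], [(1, 1), (3, 4)]]
s_fragments = [[(1, 2), (4, 9)], [(1, 3), (2, 1), (2, 1)]]

result, sent = group_by_join(r_fragments, s_fragments)
print(result == naive_join_then_group(r_fragments, s_fragments))
print(sent)
```

In this toy run, the ten input tuples shrink to six redistributed partial aggregates, and keys 3 and 4, which have no join partner, are never sent. The pre-aggregation is valid here only because the aggregate factorizes per join key; other aggregates would need their own partial-aggregation rule.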



Paper Citation


in Harvard Style

Al Hajj Hassan M. and Bamha M. (2006). PARALLEL PROCESSING OF "GROUP-BY JOIN" QUERIES ON SHARED NOTHING MACHINES. In Proceedings of the First International Conference on Software and Data Technologies - Volume 1: ICSOFT, ISBN 978-972-8865-69-6, pages 301-307. DOI: 10.5220/0001316003010307


in Bibtex Style

@conference{icsoft06,
author={M. Al Hajj Hassan and M. Bamha},
title={PARALLEL PROCESSING OF "GROUP-BY JOIN" QUERIES ON SHARED NOTHING MACHINES},
booktitle={Proceedings of the First International Conference on Software and Data Technologies - Volume 1: ICSOFT},
year={2006},
pages={301-307},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001316003010307},
isbn={978-972-8865-69-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the First International Conference on Software and Data Technologies - Volume 1: ICSOFT
TI - PARALLEL PROCESSING OF "GROUP-BY JOIN" QUERIES ON SHARED NOTHING MACHINES
SN - 978-972-8865-69-6
AU - Al Hajj Hassan M.
AU - Bamha M.
PY - 2006
SP - 301
EP - 307
DO - 10.5220/0001316003010307