As you may recall from my previous post, The Sharding Conflict, using a distributed database can do wonders for the scalability of your database. However, as with any new system, a few small kinks have to be worked out. Continuing with my initial example of ‘employee’ and ‘department’ data tables, let’s have a look at some other challenges that arise from “shard conflicts” that can occur in a distributed database. However, before I scare you with these problems, I should let you know that ScaleBase software addresses all of these issues for you. OK, so let’s take a look.
Aggregations (with or without a group)
Aggregations occur when a query requires data from multiple tables across various partitions. Say you need to know the average salary of employees over the age of 30 in your company. This is specific information that does not necessarily exist in one partition alone. If you’re looking for employees in a certain age group, you will have to examine multiple partitions from all of the departments. The real problem arises when selecting the average salary from all partitions and amalgamating the results. This does not generate the true average salary, demonstrating yet another conflict.
Finding identical values across multiple partitions can skew data analysis and produce false query results. In reality, data entries may be repeated and there’s nothing (and should be nothing) we can do about it. For example: salary or age. When an application requests a list of distinct values (salaries or ages) it means that the data needs to be processed in a way where repetitions are eliminated from result set. The challenge is the result-set needs to be further processed.
Querying occurs across multiple partitions in a distributed database. But wouldn’t the world be a better place if it were possible to join records that exist in different partitions? It may be possible after considering all of the query criteria. The Sharding Conflict exacerbates this situation, however, when attempting multiple joins from records that are situated across various partitions.
Many times the result of one query is needed to complete another query, which would seem simple if all queries were derived from the same partition. This type of query brings dependencies into the system, complicating the situation even further if any of the above conflicts exist in fulfilling the opening query.
Let’s say we need to process the following statement (keeping with our example from the Sharding Conflict article):
‘Select e.first_name, e.last_name, e.salary from employee where e.salary>(select avg(salary) from employee)’
The subquery result is based on an aggregation of records from multiple partitions and that’s why it can’t be simply executed as part of the whole query. Averaging one specific partition does not reflect the average of the database as a whole. The subquery has to be processed independently, and before the parent query, in order to uncover the average. Only then can the result be applied back into the parent query. It is crucial to comprehend though, that executing a query, as is, on a specific partition, or each partition individually, will return false results.
Multiple features in one single query
Plain and simple, any combination of the above conflicts can be complicated. You need to identify the problems and handle each one in the correct order, for example:
‘Select avg(distinct e.salary) from employee e’
This ‘simple’ short statement has no joins or subqueries this time. We simply want to have an average, but this time on the distinctive values of salary. So, we first need to eliminate repetitions and only then aggregate. It’s impossible to do both together.
The application DBA needs to consider how to arrange data across multiple partitions in a distributed database. By distributing the data with intelligence (according to an examination of your data, the functional use of your data, and your queries) will help avoid many conflicts. A simple way to think of this is that you want to place together the data that is used together. However, some cross-partition queries will always exist. In line with the examples above, placing employees with related departments in the same partition may prevent potential conflicts. Considering the nature of the queries and the application is key to create a functional distributional database, bearing in mind that how you write the data will affect how you will read it later on. Home-grown data distribution into a cluster of partitions is not for the faint of heart.
But, there are some very real benefits that only a distributed database can provide.
Another option to create a distributed database is to utilize a comprehensive solution like ScaleBase that provides specialized data distribution mechanisms and that resolves a broad range of use cases, like we have discussed (and beyond). Due to ScaleBase’s independent status, it has the freedom to create generic code, easing the implementation of said code on standard problems, such as partition/sharding conflicts. As a result, ScaleBase has a wealth of experience and knowledge handling numerous database distribution cases, including the examples listed above.
About the Author
Senior Software Engineer