I’d like to write about a particularly tricky problem that arises when you build a distributed database: the shard conflict. If you have deployed a home-grown, DIY database sharding solution, you may not be aware of the ramifications of this issue. In fact, you may be receiving incorrect query results without even realizing it. Thankfully, ScaleBase’s distributed database platform handles this situation for you. So, let’s take a look at the problem.
Sharding and Data Distribution
Traditional databases are made up of tables of data that are related to one another. An HR application, for example, may include one table listing all employees and another listing the various departments within a company. Every employee belongs to a department, and every department is made up of its employees. This relationship is expressed inside the database and can easily be queried and analyzed using conventional methods, since all of the data lives in one place and is easily accessible.
In the world of the cloud, however, data is distributed, meaning information is spread across multiple servers (or instances). While data distribution is necessary for scalability, it risks placing related data in different partitions, or shards. Yet related data is frequently requested together, which can make query processing a real challenge. Ideally, all related data would live in the same shard or partition, but in some cases that’s not possible.
So what is a “shard conflict”?
At ScaleBase, we have coined the term ‘shard conflict’ to describe a situation where a given statement cannot be executed unchanged on all (or one) of the partitions and still be counted on to yield a correct result.
In order to make this concept a bit clearer, let’s look at some shard conflicts with an example using a database with tables for departments and employees.
As seen in Example 1 below, the ‘Department Table’ consists of the columns ‘id’ and ‘name’, while the ‘Employee Table’ consists of the columns ‘id’, ‘first_name’, ‘last_name’, ‘department_id’, ‘salary’, and ‘manager_id’.
Suppose we want to shard both of these tables and we choose ‘id’ as the ‘Department Table’ shard key and, similarly, ‘id’ as the ‘Employee Table’ shard key.
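To make the routing concrete, here is a minimal Python sketch of how a hash-based sharding layer might assign rows to partitions. The two-shard count and the simple modulo routing function are invented for illustration; a real platform such as ScaleBase uses its own routing logic.

```python
NUM_SHARDS = 2  # hypothetical shard count, for illustration only


def shard_for(shard_key: int) -> int:
    """Route a row to a shard by taking its shard-key value modulo the shard count."""
    return shard_key % NUM_SHARDS


# Each table is routed independently by its own shard key:
dept_shard = shard_for(17)  # a 'department' row with id 17 lands on shard 1
emp_shard = shard_for(42)   # an 'employee' row with id 42 lands on shard 0
```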
‘SELECT e.first_name, e.last_name, d.name FROM employee e JOIN department d ON e.department_id = d.id’
The statement above presents a shard conflict because there is no guarantee that every employee is in the same shard as its corresponding department. That is to say, data about an employee’s department may live in a different partition than the data about the employee itself.
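We can reproduce this conflict with a small, self-contained sketch. The example below simulates two shards with in-memory SQLite databases, routes each table by its own ‘id’ modulo the shard count, and then runs the join unchanged on every shard. The sample data and two-shard layout are invented for illustration.

```python
import sqlite3

# Two in-memory SQLite databases stand in for two shards (illustrative setup).
shards = [sqlite3.connect(":memory:") for _ in range(2)]
for s in shards:
    s.execute("CREATE TABLE department (id INTEGER, name TEXT)")
    s.execute("CREATE TABLE employee (id INTEGER, first_name TEXT, last_name TEXT, "
              "department_id INTEGER, salary REAL, manager_id INTEGER)")

departments = [(1, "Engineering"), (2, "Sales")]
employees = [(1, "Ann", "Lee", 2, 90000, None),   # employee 1 -> shard 1, her dept 2 -> shard 0
             (2, "Bob", "Roy", 1, 80000, 1)]      # employee 2 -> shard 0, his dept 1 -> shard 1

# Route each table by its own 'id' (index 0), i.e. id % 2.
for d in departments:
    shards[d[0] % 2].execute("INSERT INTO department VALUES (?, ?)", d)
for e in employees:
    shards[e[0] % 2].execute("INSERT INTO employee VALUES (?, ?, ?, ?, ?, ?)", e)

query = ("SELECT e.first_name, e.last_name, d.name "
         "FROM employee e JOIN department d ON e.department_id = d.id")

# Naively run the unchanged join on every shard and union the results.
rows = [row for s in shards for row in s.execute(query)]
print(rows)  # [] -- both employees are silently lost: neither shares a shard with its department
```

Globally the correct answer has two rows (Ann works in Sales, Bob in Engineering), yet the per-shard union returns none of them, and it does so without raising any error.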
In Example 2, below, on the other hand, we chose ‘department_id’ as the ‘Employee Table’ shard key, in order to resolve the conflict from Example 1. Because both tables are now sharded on the department id, every employee row lands in the same partition as its department, so all department-related data is stored together. Now there are no cross-partition joins, and we can safely execute the statement on every partition and merge the results into one result set.
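Under the same simulated two-shard setup, re-routing employee rows by ‘department_id’ instead of ‘id’ makes the per-shard join correct. This is again only a sketch with invented data; the point is that each employee is now co-located with its department.

```python
import sqlite3

# Two in-memory SQLite databases stand in for two shards (illustrative setup).
shards = [sqlite3.connect(":memory:") for _ in range(2)]
for s in shards:
    s.execute("CREATE TABLE department (id INTEGER, name TEXT)")
    s.execute("CREATE TABLE employee (id INTEGER, first_name TEXT, last_name TEXT, "
              "department_id INTEGER, salary REAL, manager_id INTEGER)")

departments = [(1, "Engineering"), (2, "Sales")]
employees = [(1, "Ann", "Lee", 2, 90000, None),
             (2, "Bob", "Roy", 1, 80000, 1)]

for d in departments:
    shards[d[0] % 2].execute("INSERT INTO department VALUES (?, ?)", d)
for e in employees:
    # Route by department_id (index 3), not by the employee's own id.
    shards[e[3] % 2].execute("INSERT INTO employee VALUES (?, ?, ?, ?, ?, ?)", e)

query = ("SELECT e.first_name, e.last_name, d.name "
         "FROM employee e JOIN department d ON e.department_id = d.id")

# The same per-shard union now yields the complete, correct answer.
rows = sorted(row for s in shards for row in s.execute(query))
print(rows)  # [('Ann', 'Lee', 'Sales'), ('Bob', 'Roy', 'Engineering')]
```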
Wait a Minute…There’s Still a Conflict
It is clear from these examples that, when dealing with a foreign key between two tables, a common shard key can resolve certain conflicts, though it doesn’t address every situation.
Suppose we want to execute another statement:
‘SELECT e.first_name, e.last_name, m.first_name, m.last_name FROM employee e JOIN employee m ON e.manager_id = m.id’
In this case, we join the ‘Employee Table’ with itself to find each employee’s manager. While managers are themselves employees, there is no guarantee they are in the same partition, because the employee table cannot be partitioned by both ‘id’ and ‘manager_id’ at the same time (even though a foreign key clearly exists between the employee table and itself).
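Continuing the same simulated two-shard setup, with employees routed by ‘department_id’, the self-join below comes up empty on every shard: in our invented data, Bob’s manager, Ann, belongs to a different department and therefore lives on a different shard.

```python
import sqlite3

# Two in-memory SQLite databases stand in for two shards (illustrative setup).
shards = [sqlite3.connect(":memory:") for _ in range(2)]
for s in shards:
    s.execute("CREATE TABLE employee (id INTEGER, first_name TEXT, last_name TEXT, "
              "department_id INTEGER, salary REAL, manager_id INTEGER)")

# Ann (dept 2) manages Bob (dept 1); routing by department_id splits them apart.
employees = [(1, "Ann", "Lee", 2, 90000, None),
             (2, "Bob", "Roy", 1, 80000, 1)]
for e in employees:
    shards[e[3] % 2].execute("INSERT INTO employee VALUES (?, ?, ?, ?, ?, ?)", e)

query = ("SELECT e.first_name, e.last_name, m.first_name, m.last_name "
         "FROM employee e JOIN employee m ON e.manager_id = m.id")

# Run the unchanged self-join on every shard and union the results.
rows = [row for s in shards for row in s.execute(query)]
print(rows)  # [] -- Bob's row never finds Ann's, because she lives on the other shard
```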
So, as illustrated above, distributed data can become quite complex if not handled correctly. It’s the kind of problem that isn’t always obvious and can yield incorrect results without you ever knowing it. ScaleBase’s distributed database platform takes care of this for you to ensure your queries always yield the correct result.
A number of other querying challenges come into play when working with distributed databases, but we will save that discussion for my next post.