What Is Sharding?
Sharding has proven itself in some of the world’s most popular web sites. However, it is typically considered too complex to implement for regular database applications. Simply stated, it means that instead of storing data in one database, you distribute the application data across multiple databases. It’s done by many SaaS companies, some the most popular web sites and many big enterprises.
ScaleBase has developed a scale out solution via data distribution that will enable your MySQL to scale like never before.
While the implementation of Sharding is traditionally “outsourced” from the database to the application and it has been done in the application by the developers. This is the wrong way to go! Application developers should be focused on making the apps better, putting more logic into the app, enabling the business to be competitive and win. Spending time on a database access layer is wrong, and it’s also a skill mismatch, as this work should be done by database developers – expert engineers that are focused only on that and draw experience from hundreds of cases.
Wikipedia defines sharding as:
Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than splitting by columns (as for normalization). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.
But I think the best explanation uses an example, so let’s take the following table:
This is a small table containing a list of customers. Any modern database can handle such a table. But what will happen if instead of 7 rows the table has to store 7 million rows? Now, theoretically, this should be supported. But usually there will be lots of operations on such a large table – so we have many read and write operations on this table every second. Now, “At Scale, Everything breaks” (as Google VP Engineering, Urs Hölzle, says here). So our nice customer table will probably become a bottleneck. Why? Because it doesn’t fit in the database server cache anymore, because of database isolation management, and for other reasons, all of which cause the database to crawl under load.
Enter Sharding. If we take the customers table, and split it to 4 different databases, each database will contain 1.75 million rows. That’s still a lot – but less than 7 million rows. This will result in improved database performance. In fact, ScaleBase tests have shown about 75% response time improvements in some standard performance tests. You can see the results here.
The following diagram shows how such a table can be split:
Every database will get some of the rows, and it becomes the developer’s responsibility to know which row is located in which table.
1Unlimited Scale - Sharding allows apps to cost-effectively scale to an infinite number of users, while increasing performance. If you have hundreds of thousands (or hundreds of millions) of users, you probably experience angst, wondering if failure is imminent. Databases break at scale. With Sharding, that’s no longer an issue. As long as you have enough shards, everything will work.
2Response time improvement – If your database is big (over 50GB) or has many hits/second (anything above a few hundred hits/second), sharding will boost your database performance.
Writing sharding code is difficult. It requires you to rewrite most of your Data Access Layer from scratch. And while it’s difficult to do when you write your own SQL code, it’s even more complex when using O/R mapping tools, as most are not “sharding oriented”.
But even after writing the initial sharding code, you might run into issues. For instance, a common problem occurs when scaling requires adding more shards. Usually, internally written sharding code supports a fixed number of shards, and adding shards requires massive code rewrites – as well as the major downtime required when moving data from one shard to another.
Other parts of the infrastructure also change when using a sharded database. For example, the reporting application must now be aware of the sharding logic, since you want to collect data from multiple databases rather than just one. And if the reporting application is an off-the-shelf product, you’re out of luck. You’ll have to write the reporting application from scratch.
Backup is also an issue, as is Database Administration. And the complexities continue to pile on. So, while Database Sharding is a great solution for database scaling, it is complex and costly if done manually, as many of the costs are hidden and only become realized after the initial sharding is performed.
Scale Out via Data Distribution
ScaleBase eliminates the needs for Database Sharding, without changing a single line of your code. And since it’s not embedded inside your application, your BI, DBA team and backup tools can use it, too – no infrastructure changes!
ScaleBase also provides the ability to scale your database using read/write splitting; meaning copies of the database can now serve for read operations, while only one database is used for the writes. Database replicas are no longer useless machines only used for high availability – with ScaleBase they can be used for scaling as well and your architecture will look something like this:
In this architecture you get database high availability and scale out via data distribution or read/write splitting – all in a fully redundant environment that also boosts your performance. Several deployment options exist, so check out our Technical Whitepaper here to see which would be best for you.