In this third blog entry comparing MongoDB and MySQL scalability, I want to focus on query models.
We’ve previously discussed how MongoDB and ScaleBase are both highly scalable distributed databases, where data is distributed across an array of servers so as to distribute workloads, transactions, and concurrent users.
Now let’s compare query models, and see how MongoDB and ScaleBase answer application calls and queries.
The Challenge – Aggregating query results from several database nodes
Famously, efficiently satisfying queries on a distributed system can be challenging. While accessing a single collection or document is easy enough, building result-sets from data spread across multiple nodes and collections can often become a manual process.
At a high-level, we can say a query can be filtered either by ID (and return a single result) or a range (and return multiple results)
In a distributed database, the challenge for a query is to efficiently and accurately access several data nodes where each node gives a partial results, and then to aggregate results efficiently so as to provide one answer to the application and user.
Again, at a high level we can say there are several operations that aggregate query results:
- Aggregate functions: count, min, max, sum
The challenge is to execute these operations in a distributed data environment and so that the applications still “sees” one database.
For example, for a distributed database with four nodes a simple count of records across the entire database needs four queries, run in parallel on four databases, which results in four numbers that needs to be summed up and returned to the application as a single summed number.
As I mentioned in my previous blog entry, since MongoDB does not do joins, Discrete docs are “joined” together inside the application using after several roundtrips to the database.
Aggregations can make this even more complex, but both MongoDB and ScaleBase do most of the work for you. An aggregation is an operation that scans over a set of documents and returns computed values such as Sum, Average, or Count.
So, what options do MongoDB and ScaleBase offer to aggregate query results from across several database nodes?
MongoDB Query Aggregation Options
MongoDB provides two main methods of aggregation: the aggregation pipeline (and simple single purpose pipeline operations), and map-reduce.
MongoDB Aggregation Pipeline
The aggregation pipeline is a framework for aggregates built into MongoDB (since version 2.2). You can think of the pipeline framework as working something like the UNIX pipe command. You stream documents through the pipeline, filtering and transforming as needed on each operator.
Each pipeline operator can do things like skip, match, sort and even geospatial style matching. You can improve performance by doing filtering (match) at the beginning of the pipeline, thus reducing the amount of data being scanned and manipulated. The aggregation pipeline can use indexes that are available.
Single Purpose Aggregation Operations
The single purpose aggregation operations are very simple procedures that return a very specific set of data. Examples would be count on a single value, grouping, a distinct list of values, etc.
However, it’s important to note that in Mongo DB, the group function does NOT support distributed database nodes or shards. This is a major difference. Secondly, all single purpose aggregation operation results must be less than 16MB.
MongoDB’s Map-Reduce capability provides programmatic query processing flexibility not available in Aggregation Pipeline, but at a cost to performance and coherence. Map-Reduce is a massively parallel process for manipulating and condensing large volumes of data down to something more useful. MongoDB provides the mapReduce command to process map-reduce scripts.
In a map-reduce process, you match data that you want to work with (the map process) and then you filter and/or condense (with the reduce process).
The map process creates a set of key->value pairs that you want to work with and the reduce process takes those values in. In MongoDB, a map process takes in a collection and the output of the reduce operation can be a collection or it can be returned inline. If you generate a collection, it can then be used as input to another map-reduce process.
Like the pipelined aggregation, map-reduce does support sharded databases. MongoDB notes that while they have made improvements in map-reduce performance in later releases, the aggregation pipeline is usually more performant, though it may not be as dynamic and functional.
ScaleBase Query Aggregation Options
ScaleBase also provides two main methods of query data aggregation: ALL_DB Aggregation and an automatic and built-in “Map-Reduce-Like” capability that executes across distributed database nodes.
The primary difference is with ScaleBase you can use SQL and tools that you are most likely already familiar with.
Additionally, with ScaleBase you don’t need to decide between an aggregation pipeline or a map reduce approach. Instead, you submit regular SQL and ScaleBase performs the aggregations operations all behind the scenes for you.
ScaleBase supports single phase and multi-phase ALL_DB aggregation.
For every ALL_DB command, ScaleBase performs:
- Parallel execution
- Local processing at each individual DB, such as Filter, local joins, local groups, local sort (That’s most of the processing!)
- Meta-Aggregation at ScaleBase Controller
SELECT COUNT(*) FROM my_distributed _table;
ScaleBase delegates the same command to all databases in parallel. Databases perform it very efficiently in parallel, on smaller datasets, returns n counts to ScaleBase which in turn sums all the counts to one big count, to return to the client.
The same goes with “sum of sums”, “min of mins” and so on. ScaleBase even supports a global ALL_DB average aggregated function.
In addition, consider the following supported cases:
- Multi-phase aggregation:
GROUP BY + HAVING + ORDER BY + LIMIT
- DDL (ALTER TABLE, CREATE INDEX)
As mentioned before, ScaleBase provides a Ma-Reduce-Like capability. ScaleBase will deconstruct the query, run parallel scans, and aggregate results.
To compare Map-Reduce in ScaleBase and MongoDB, let’s look at the figure below.
Comparing a typical Map-Reduce query in MySQL and MongoDB, you can see that to get the same functionality, a simple SQL query is a lot easier to read and to write. Ideally, less code can also mean less bugs and less maintenance.
And the above example is just for a very simple SQL command on only one table, with no joins, with three WHERE predicates and six SELECT predicates. As we know, queries can be much longer than that, but not more complicated. This is really the power of SQL.
You can see how the flow resembles a map-reduce program in the figure below.
With no extra coding, you get localized access at the shard/node level, running in parallel, with local sorts, joins and grouping. ScaleBase does the final aggregation for your application in the middle ScaleBase layer. Not in the application.
Aggregation is a fact of life in the data world. You don’t want a tool that makes your life harder than it has to be. MongoDB and ScaleBase make it as easy as possible for you.
MongoDB was developed after SQL databases were around for quite a while already and providing great query capabilities. MongoDB is much more than a key/value database. It has great querying capabilities as well, some say equal to SQL databases, with very similar concepts of filters, indexes, results aggregation
The main difference is in the “how”.
In MongoDB, for most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface. However, map-reduce operations provide some programmatic flexibility that is not presently available in the aggregation pipeline.
The take-way is that (as we have seen in scalability, auto-sharding and data redistribution already), in query model and query execution there’s more similarity than difference between MongoDB and ScaleBase.
Bottom line: In a distributed database environment you must have the aggregation capabilities. MongoDB give that to you. ScaleBase gives that to you too, but retaining SQL and ACID properties.
If you’d like to review all of the ideas discussed in this series of MongoDB and ScaleBase comparisons, I’ve recorded a webinar that you can check out. It’s called: MongoDB and MySQL Scalability: How to get MongoDB scale from MySQL. I look forward to your feedback.
Read the rest of our Extreme Scalability with MongoDB and MySQL Series: