MongoDB: Grouping by ranges using the Aggregration Framework

Recently i was required to extract a count of registered users by some predefined age brackets for a simple statistics dashboard. Sounds like a simple task right? You’ve already got a good idea of the query you’d write in SQL? So did I, until i realised this was a project we decided to use MongoDB with rather than the normal MySQL database.

Like any other developer under time pressure, i turned to Google for a quick code snippet answer that i could modify rather than reading the documentation, i understood MongoDB well enough to modify something fairly easily, but not enough to create this query from scratch. Unfortunately all the search terms i tried yielded nothing all that useful, so it was time to grab a coffee, sit back and read the documentation. Now i understand more about aggregating documents in MongoDB, i thought i’d share how it’s done.

What is the Aggregation Framework Pipeline?

The first thing i recommend to anyone trying to understand aggregation in MongoDB, is stop referring to it as a framework, and start calling it the aggregation pipeline. Framework and Pipeline seem to be used interchangeably in the documentation, and most google results seem to use “framework” too, whatever the reason for this (legacy name maybe?), “pipeline” really helps to understand how aggregation works and goes a significant way to helping you understand it.

You can think of the pipeline as a series of steps (officially coined “stages”), each one accepting the output of the previous step as it’s input. The first stage’s input is the entire collection, and the last stage’s output is what is returned to you as the result. For me at least, this is actually an easier and more logical way of aggregating data (although feels more verbose). Normally with SQL, I would note down all the steps required, then start woking out how i mangle the steps together into a single query, then start playing around swapping between JOIN’s and Sub-query’s working out the most performant query i can. With MongoDB’s Aggregation Pipeline, we don’t have to do that, we can simply create a stage for each step we noted. The beautiful thing about this approach is we can see the output of an intermediary step by running the query before it’s all complete, so we can make sure that stage is doing what we want it too before moving on to the next.

How we do it

So here’s a list of steps i figured are required to get from a collection of my user documents to my required output.

Calculate an “age” field from the existing “birthday” field.
Calculate an age “range” field by placing the “age” into it’s corresponding group.
Group the “range” fields and return a “count” for each one.

Stage 1: Calculate the age field

{
  $project: {
    "_id":0, 
    "age": {
      $divide: [ 
        { 
          $subtract: [ 
            new Date(), 
            { $ifNull: ["$birthday", new Date() ]}
          ]
        }, 
        1000 * 86400 * 365 
      ]
    }
  }
}

{

$project: {

"_id":0,

"age": {

$divide: [

{

$subtract: [

new Date(),

{ $ifNull: ["$birthday", new Date() ]}

]

1000 * 86400 * 365

]

}

This first stage uses the $project operator in order to transform our input documents into a different document for the next stage. We calculate an “age” field rom the “birthday” field, and "_id": 0 excludes the “_id” field from the input documents as we don’t care about it, and it’s the only default inclusion.

Let’s get into the meat of this stage. What we are doing for the age field is calculating the number of milliseconds that have elapsed between the the current date, and the users date of birth by subtracting the two dates, then we divide that by the number of milliseconds in a year to give us the number of years. The $ifNull operator ensures that we have a valid date to subtract, in the event of a missing birthday field, we return the current date (this yields 0 years, which we can use as a special range bracket later on).

Here’s what the output of this looks like (remember we can run the query after each stage so we can make sure it’s doing what it should, and have a visual cue of the input for the next stage).

/* 1 */
{
    "age" : 14.5909136617199
}

/* 2 */
{
    "age" : 40.6099923654236
}

/* 3 */
{
    "age" : 0
}

/* 4 */
{
    "age" : 27.7481470457889
}

... Truncated ...

/* 1 */

{

"age" : 14.5909136617199

}

/* 2 */

{

"age" : 40.6099923654236

}

/* 3 */

{

"age" : 0

}

/* 4 */

{

"age" : 27.7481470457889

}

... Truncated ...

Notice that we have decimal values in the output, if our age range brackets were linear (i.e 0-9, 10-19) we could try dividing these further by 10, then rounding these down using the $floor operator. This would yield a single digit value (1 to 9 up to 99 years old) which we could group on, and have our application logic worry about the labelling (i.e. range=1 would be the 10 – 19 bracket).

In my case however, my brackets are not linear, so i need to do it a different way and therefore the decimal’s won’t cause any issues.

Stage 2: Calculate the range field

{
  $project: {    
    "range": {
       $concat: [
          { $cond: [{$lte: ["$age",0]}, "Unknown", ""]}, 
          { $cond: [{$and:[ {$gt:["$age", 0 ]}, {$lt: ["$age", 18]}]}, "Under 18", ""] },
          { $cond: [{$and:[ {$gte:["$age",18]}, {$lt:["$age", 25]}]}, "18 - 24", ""]},
          { $cond: [{$and:[ {$gte:["$age",25]}, {$lt:["$age", 31]}]}, "25 - 30", ""]},
          { $cond: [{$and:[ {$gte:["$age",31]}, {$lt:["$age", 41]}]}, "31 - 40", ""]},
          { $cond: [{$and:[ {$gte:["$age",41]}, {$lt:["$age", 51]}]}, "41 - 50", ""]},
          { $cond: [{$gte:["$age",51]}, "Over 50", ""]}
       ]
    }  
  }    
}

{

$project: {

"range": {

$concat: [

{ $cond: [{$lte: ["$age",0]}, "Unknown", ""]},

{ $cond: [{$and:[ {$gt:["$age", 0 ]}, {$lt: ["$age", 18]}]}, "Under 18", ""] },

{ $cond: [{$and:[ {$gte:["$age",18]}, {$lt:["$age", 25]}]}, "18 - 24", ""]},

{ $cond: [{$and:[ {$gte:["$age",25]}, {$lt:["$age", 31]}]}, "25 - 30", ""]},

{ $cond: [{$and:[ {$gte:["$age",31]}, {$lt:["$age", 41]}]}, "31 - 40", ""]},

{ $cond: [{$and:[ {$gte:["$age",41]}, {$lt:["$age", 51]}]}, "41 - 50", ""]},

{ $cond: [{$gte:["$age",51]}, "Over 50", ""]}

]

}

This may look a little daunting but it’s pretty simple really, let’s break it down.

{ $cond: [{$and:[ {$gte:["$age",18]}, {$lt:["$age", 25]}]}, "18 - 24", ""]},

1	{ $cond: [{$and:[ {$gte:["$age",18]}, {$lt:["$age", 25]}]}, "18 - 24", ""]},

This essentially can be represented like so:

if ( $age >= 18 && $age < 25 ) {
    return "18 - 24";
}

if ( $age >= 18 && $age < 25 ) {

return "18 - 24";

}

Hopefully that suddenly makes sense. Essentially all we are doing is testing to see which range the age fits into, and returning the label for that range (it doesn’t have to be a label of course, as long as its unique to that range). We do this using the following operators:

$cond is basically an IF/ELSE ternary statement, if the expression (first array argument) is true, it returns the second argument, otherwise the third.
$and is used in all but the first and last conditions (those outer conditions don’t have a lower or upper bound to their ranges). You pass an array of operators to $and, if they all match you get a true return value.
$gt / $lt / $gte / $lte are all basic comparitors you’ll have seen in any language using the >,<,>=,<= symbols.

Notice that the first condition is there to match against age <= 0, this captures any missing birthdays or any birthdays that are erroneously in the future, and places them into an unknown range.

Finally, the $concat operator is for string concatenation. We use this as MongoDB does not have a CASE operator. We could have a nest of $cond's but that would get messy and complicated to maintain/modify in the future. Using $concat all $cond operators will return an empty value except the matching one, which will return the label. So an age of 19 would return $concat: [ "", "", "18 - 24", "", "", "", "" ] resulting in a range field value of “18 – 24”.

Let’s have a look at some example output at this stage.

/* 1 */
{
    "range" : "Under 18"
}

/* 2 */
{
    "range" : "31 - 40"
}

/* 3 */
{
    "range" : "Under 18"
}

/* 4 */
{
    "range" : "Unknown"
}

/* 5 */
{
    "range" : "41 - 50"
}

/* 6 */
{
    "range" : "25 - 30"
}

... Truncated...

/* 1 */

{

"range" : "Under 18"

}

/* 2 */

{

"range" : "31 - 40"

}

/* 3 */

{

"range" : "Under 18"

}

/* 4 */

{

"range" : "Unknown"

}

/* 5 */

{

"range" : "41 - 50"

}

/* 6 */

{

"range" : "25 - 30"

}

... Truncated...

Stage 3: Get the counts of each range

{
  $group: { 
    "_id" : "$range", 
    count: { 
      $sum: 1
    } 
  }
}

{

$group: {

"_id" : "$range",

count: {

$sum: 1

}

So this one’s pretty simple, we use the $group operator. “_id” is essentially the field we want to group on (synonymous with MySQL’s GROUP BY column). That’s followed by our accumulator operator $sum creating a computed field of “count”. $sum which we pass a “1” into as the value to be summed per document in the group.

Finally, we are left with this output

/* 1 */
{
    "_id" : "25 - 30",
    "count" : 5
}

/* 2 */
{
    "_id" : "41 - 50",
    "count" : 2
}

/* 3 */
{
    "_id" : "Unknown",
    "count" : 9
}

/* 4 */
{
    "_id" : "31 - 40",
    "count" : 7
}

/* 5 */
{
    "_id" : "Under 18",
    "count" : 13
}

/* 1 */

{

"_id" : "25 - 30",

"count" : 5

}

/* 2 */

{

"_id" : "41 - 50",

"count" : 2

}

/* 3 */

{

"_id" : "Unknown",

"count" : 9

}

/* 4 */

{

"_id" : "31 - 40",

"count" : 7

}

/* 5 */

{

"_id" : "Under 18",

"count" : 13

}

Notice that they aren’t in order? For my application that wasn’t a problem as i wasn’t just going to loop through the results and output them, but if you wanted to sort them ready for output, you could add a 4th stage using the $sort operator.

That’s it, we now have the final query, for brevity, here’s it all-in-one!

db.getCollection('users').aggregate([
{
  $project: {
    "_id":0, 
    "age": {
      $divide: [ 
        { 
          $subtract: [ 
            new Date(), 
            { $ifNull: ["$birthday", new Date() ]}
          ]
        }, 
        1000 * 86400 * 365 
      ]
    }
  }
},
{
  $project: {    
    "range": {
       $concat: [
          { $cond: [{$lte: ["$age",0]}, "Unknown", ""]}, 
          { $cond: [{$and:[ {$gt:["$age", 0 ]}, {$lt: ["$age", 18]}]}, "Under 18", ""] },
          { $cond: [{$and:[ {$gte:["$age",18]}, {$lt:["$age", 25]}]}, "18 - 24", ""]},
          { $cond: [{$and:[ {$gte:["$age",25]}, {$lt:["$age", 31]}]}, "25 - 30", ""]},
          { $cond: [{$and:[ {$gte:["$age",31]}, {$lt:["$age", 41]}]}, "31 - 40", ""]},
          { $cond: [{$and:[ {$gte:["$age",41]}, {$lt:["$age", 51]}]}, "41 - 50", ""]},
          { $cond: [{$gte:["$age",51]}, "Over 50", ""]}
       ]
    }  
  }    
},
{
  $group: { 
    "_id" : "$range", 
    count: { 
      $sum: 1
    } 
  }
}     
]);

db.getCollection('users').aggregate([

{

$project: {

"_id":0,

"age": {

$divide: [

{

$subtract: [

new Date(),

{ $ifNull: ["$birthday", new Date() ]}

]

1000 * 86400 * 365

]

}

{

$project: {

"range": {

$concat: [

{ $cond: [{$lte: ["$age",0]}, "Unknown", ""]},

{ $cond: [{$and:[ {$gt:["$age", 0 ]}, {$lt: ["$age", 18]}]}, "Under 18", ""] },

{ $cond: [{$and:[ {$gte:["$age",18]}, {$lt:["$age", 25]}]}, "18 - 24", ""]},

{ $cond: [{$and:[ {$gte:["$age",25]}, {$lt:["$age", 31]}]}, "25 - 30", ""]},

{ $cond: [{$and:[ {$gte:["$age",31]}, {$lt:["$age", 41]}]}, "31 - 40", ""]},

{ $cond: [{$and:[ {$gte:["$age",41]}, {$lt:["$age", 51]}]}, "41 - 50", ""]},

{ $cond: [{$gte:["$age",51]}, "Over 50", ""]}

]

}

{

$group: {

"_id" : "$range",

count: {

$sum: 1

}

]);

Using the information above, it should be fairly trivial to create your own Aggregation pipelines to do all sorts of magical things with MongoDB. If you’re coming from an SQL background this may seem a little alien at first, but once you have that eureka moment and it all makes sense, you’ll start to love how simple MongoDB makes it all.

MongoDB: Grouping by ranges using the Aggregration Framework

What is the Aggregation Framework Pipeline?

How we do it

Stage 1: Calculate the age field

Stage 2: Calculate the range field

Stage 3: Get the counts of each range

Recent Posts

Archives

MongoDB: Grouping by ranges using the Aggregration Framework

What is the Aggregation Framework Pipeline?

How we do it

Stage 1: Calculate the age field

Stage 2: Calculate the range field

Stage 3: Get the counts of each range

Share this:

Recent Posts

Archives

Tags