Binning, and eventually graphing as a histogram, is an effective and useful way to analyze and understand trends in your data. BI tools like Tableau provide easy binning capabilities so you can bin on ranges like month or year. But is there a way without a BI tool?
Presented here is a simple, effective, and highly versatile method for binning your data directly in MySQL.
To have a highly versatile method, we should have some initial CTE to define our bins, our Select Statement to get our data, and a Group By. This method will just allow us to swap our select statement very easily without changing too much of the code.
The first example will bin on predefined ranges; the second example will use recursive CTEs to 'auto-magically' create bins.
First some fake data:
CREATE TABLE `books` (
`id` INT NOT NULL AUTO_INCREMENT,
`name` VARCHAR(255) NOT NULL,
`pages` VARCHAR(255) NOT NULL,
`created_at` DATE NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1;
INSERT INTO books (name, pages, created_at) VALUES
("Book 1", 228, "2000-01-03") ,
("Book 2", 235, "2000-01-07") ,
("Book 3", 197, "2000-02-09") ,
("Book 4", 179, "2000-03-14") ,
("Book 5", 197, "2000-04-12") ,
("Book 6", 230, "2000-04-27") ,
("Book 7", 217, "2000-05-29") ,
("Book 8", 197, "2000-05-08") ,
("Book 9", 241, "2000-05-29") ,
("Book 10", 240, "2000-05-02") ,
("Book 11", 230, "2000-06-01") ,
("Book 12", 198, "2000-07-09") ,
("Book 13", 234, "2000-08-17") ,
("Book 14", 240, "2000-09-30") ,
("Book 15", 238, "2000-10-03");
Example 1: Bin over page number
We can define some ranges and then bin over them.
Some useful ranges would be
- 0 - 10 page
- 11 - 100 pages
- 101 - 200 pages
- 201 + pages
We then left join our books
table onto our bins with a where statement and then we can group.
WITH bins (bin_name, bin_start, bin_end) AS (
SELECT '0-10' AS bin_name, 0 AS bin_start, 10 AS bin_end
UNION SELECT '11-100' , 11 , 100
UNION SELECT '101-200' , 101 , 200
UNION SELECT '201+' , 201 , NULL
),
T AS (
SELECT
pages AS `value`
FROM books
)
SELECT
bins.bin_name,
COUNT(T.value) AS count,
COUNT(T.value) / (SELECT COUNT(*) FROM T) * 100 AS percentages
FROM bins
LEFT JOIN T ON T.value
WHERE (
T.value >= bins.bin_start
AND (T.value < bins.bin_end OR bins.bin_end IS NULL)
)
GROUP BY 1;
Giving us:
+----------+-------+-------------+
| bin_name | count | percentages |
+----------+-------+-------------+
| 0-10 | 1 | 6.6667 |
| 11-100 | 2 | 13.3333 |
| 101-200 | 3 | 20.0000 |
| 201+ | 8 | 53.3333 |
+----------+-------+-------------+
It is so versatile because can swap in any table you'd like for T and adjust the bins and it just works.
Example 2: Bin with recursive CTE
What if we want to bin over something like months, and don't want to list out all the dates. We can use a recursive CTE to do the work for us!
WITH RECURSIVE bins as (
SELECT '2000-01-01' as dt, MONTH('2000-01-01') as `month`, YEAR('2000-01-01') as `year`
UNION
SELECT
DATE_ADD(bins.dt, INTERVAL 1 MONTH),
MONTH(DATE_ADD(bins.dt, INTERVAL 1 MONTH)),
YEAR(DATE_ADD(bins.dt, INTERVAL 1 MONTH))
FROM bins WHERE DATE_ADD(bins.dt, INTERVAL 1 MONTH) <= '2001-01-01'
),
T AS (
SELECT
id,
MONTH(created_at) AS `month`,
YEAR(created_at) AS `year`
FROM books
)
SELECT
bins.year,
bins.month,
COUNT(T.id) AS count,
COUNT(T.id) / (SELECT COUNT(T.id) FROM T) * 100 AS percentage
FROM bins
LEFT JOIN T ON (T.month = bins.month and T.year = bins.year)
GROUP BY 1, 2
ORDER BY 1, 2
And we get:
+------+-------+-------+------------+
| year | month | count | percentage |
+------+-------+-------+------------+
| 2000 | 1 | 2 | 13.3333 |
| 2000 | 2 | 1 | 6.6667 |
| 2000 | 3 | 1 | 6.6667 |
| 2000 | 4 | 2 | 13.3333 |
| 2000 | 5 | 4 | 26.6667 |
| 2000 | 6 | 1 | 6.6667 |
| 2000 | 7 | 1 | 6.6667 |
| 2000 | 8 | 1 | 6.6667 |
| 2000 | 9 | 1 | 6.6667 |
| 2000 | 10 | 1 | 6.6667 |
| 2000 | 11 | 0 | 0.0000 |
| 2000 | 12 | 0 | 0.0000 |
| 2001 | 1 | 0 | 0.0000 |
+------+-------+-------+------------+
With these two examples you can now bin by pre-defined ranges or by dates in a clean, simple, and versatile manner!