Home Microsoft SQL Server SQL GROUP BY

Polls

Who administers your corporate database infrastructure?
 

related links:

SQL GROUP BY
User Rating: / 10
PoorBest 

In this article I will describe the purpose and best practices of group by clause of select statement. Any reporting select query requires group by clause to aggregate some results by particular parameters. That is critical for overall system performance to follow some best practices of group by queries.

GROUP BY purpose

Aggregate functions

 Group by can't be discussed without appealing to aggregate functions. So Aggregate functions are embedded transact-sql functions that iterate through the recordset and collect the result. The main aggregate functions compliant with ANSI SQL 92 are MIN, MAX, SUM, COUNT, AVG. For visual understanding let's create a test table and fill it in with some test data

 create table testgroupby (
    id int identity (1,1) primary key clustered
    ,city varchar(100) not null
    ,country varchar(100) not null
    ,population int
    
    )
    
go
insert into  testgroupby (city, country, population)
select 'kiev', 'ukraine', 5000000
union all select 'odessa', 'ukraine', 1000000
union all select 'moscow', 'russia', 10000000
union all select 'rostov', 'russia', 2000000
union all select 'omsk', 'russia', 2000000
union all select 'los-angeles', 'USA', 15000000
union all select 'New York', 'USA', 20000000
union all select 'Miami', 'USA', 7000000
go

So let's use aggregation functions against the testgroupby table.

select COUNT (*), MIN(population), MAX( population), SUM(population)
from testgroupby

And you will see the next results

We see that aggregate functions iterated though all rows of testgroupby table and returned the result. Now if we want to see the aggregate values per country we need to use group by clause of select statement. In ANSI SQL 92, SQL Server 2000, SQL Server 2005, SQL Server 2008 and any other version of most dbmss this clause goes  right after where clause (if exists, otherwise - right after from clause). So the query calculating countries' populations will look like

select country, SUM(population)
from testgroupby
group by country

and the group by query output

 

How GROUP BY works

The columns specified in group by list divide table into groups and then aggregate functions are executed against each group separately as against the entire table in first example. in our case the table is divided into groups the next way


Columns ID and city don't take part in grouping table since we are interested only in countries and population columns.

All fields listed in select statement when you use aggregate functions with group by statement must be either results of operations with aggregate functions (i.e. sum=SUM(population), field1= SUM(Population)/AVG(Population)) or appear in group by fields list. For instance we can't include fields ID or city in our previous select statement because the next error will occur

Msg 8120, Level 16, State 1, Line 1
Column 'testgroupby.id' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

This is obvious, remember that we already divided the tables into groups and summarized population thus we left only one row per country. Now it is meaningless to address back to the cities. 

GROUP BY uses

GROUP BY has a wide usage in reporting queries. Remember that you can use group by not only against a table like testgroupby in our examples but against any resultset. This means that if you have a select statement with joins and where clause you can always apply aggregate functions and group by clause to this statement. You can filter results of group by query using HAVING clause. This works the same as where clause but applies to the resultset after all aggregate functions are calculated and rows are merged. For instance

 select country, pop= SUM(population)
from testgroupby
group by country
having SUM(population) > 10000000

select country, pop
from (
    select country, pop= SUM(population)
    from testgroupby
    group by country
) t
where pop > 10000000

both queries return the same data

but the first one is much more compact and readable as well as less resource consumptive even though in this particular case execution plans for both queries are the same.

GROUP BY instead of DISCTINCT

You can always use group by when you need to gather distinct values of a table/resultset, the rule is simple - include all columns into group by clause

select country
from testgroupby
group by country

select distinct country
from testgroupby

again, the plans are the same.

GROUP BY best practices

Remember that group by statement dramatically affects performance of the query and try to follow the next principles
Keep group by fields list as short as possible. That means that aggregations and GROUP BY clause must usually be formed in a nested select.

For instance imagine we have another table - countrieslist

 create table countrieslist (
     country varchar(100) not null primary key clustered
    ,location varchar(100)
    ,VAT int
    ,phonenumbers varchar(100)
    ,description varchar(100)
    )
GO

insert into countrieslist
select 'ukraine', 'eastern europe', 10000000, '+380', 'nice slovanic country'
union all select 'russia', 'western europe / asia', 100000000, '+7', 'powerful country'
union all select 'USA', 'north america', 1000000000, '+1', 'leader country'

Now when we want to see the overall county info including its population the worst choice is the next query

  select
      c.country
    ,location
    ,VAT
    ,phonenumbers
    ,description
    ,pop = SUM(t.population)
from countrieslist c
    inner join testgroupby t on c.country = t.country
group by
      c.country
    ,location
    ,VAT
    ,phonenumbers
    ,description

 Now lets try to put the aggregation and group by statement into the nested query (derived table)


 select
      c.country
    ,location
    ,VAT
    ,phonenumbers
    ,description
    ,pop = t.pop
from countrieslist c
    inner join (
        select country, pop= SUM(population)
        from testgroupby
        group by country
    ) t on t.country = c.country

 In this particular case the second query plan is worse than the first but this is only because we deal with only 8 records totally.

The information provided within this article applies to all database management systems compliant with ANSI SQL-92 not only transact-sql of sql server family (SQL Server 2005, SQL Server 2008) but also to Oracle, Sybase, MySQL etc.

 

Comments (0)

 
SPONSORED LINKS:
Content View Hits : 179489
DB-staff Remote DBA Services, Powered by DB-Staff 2008