18  aggregate functions, group by, having

18.1 The “books database”

In this and other sections we will be referring to the “books database”.

Click below to download a document that describes the books database and database technology in general. This document also contains many questions/answers to test your knowledge of SQL. You can also download the CSV files that contain the data for the books database.

18.2 click to download the specified files

# load the package we'll need
if (!require(sqldf)) { install.packages("sqldf"); require(sqldf) }
Loading required package: sqldf
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
if (!require(readr)) { install.packages("readr"); require(readr) }
Loading required package: readr
# Read in the data for the books database - see the 
titles = read_csv("data/booksDatabase/titles.csv", na="NULL", show_col_types=FALSE)
authors = read_csv("data/booksDatabase/authors.csv", na="NULL", show_col_types=FALSE)
publishers = read_csv("data/booksDatabase/publishers.csv", na="NULL", show_col_types=FALSE)
title_authors = read_csv("data/booksDatabase/title_authors.csv", na="NULL", show_col_types=FALSE)
royalties = read_csv("data/booksDatabase/royalties.csv", na="NULL", show_col_types=FALSE)

18.3 “non-aggreate functions” vs “aggregate functions”

############################################################################
# Intro to SQL functions:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#   SQL contains two different types of functions
#
#   - non-aggregate functions - return a different value for every row
#
#   - aggregate functions - return a single value for multiple rows in the table
#     These functions combine several data values in a single value. 
#     For example the sum function will combine several values into a single sum.
#     see below for more details
#
############################################################################

18.3.1 The titles table

# Let's remember what is in the titles table.
sqldf("select * from titles")
   title_id                          title_name       type pub_id pages price   sales         pubdate
1       T01                               1977!    history    P01   107 21.99     566   8/1/2000 0:00
2       T02           200 Years of German Humor    history    P03    14 19.95    9566   4/1/1998 0:00
3       T03        Ask Yor System Administrator   computer    P02  1226 39.95   25667   9/1/2000 0:00
4       T04           But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00
5       T05              Exchange of Platitudes psychology    P01   201  6.95  201440   1/1/2001 0:00
6       T06                    How About Never?  biography    P01   473 19.95   11320  7/31/2000 0:00
7       T07                   I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00
8       T08        Just Wait Until After School   children    P01    86 10.00    4095   6/1/2001 0:00
9       T09                     Kiss My Boo Boo   children    P01    22 13.95    5000  5/31/2002 0:00
10      T10         Not Without My Fabrerge Egg  biography    P05    NA    NA      NA            <NA>
11      T11    Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00
12      T12           Spontaneous, Not Annoying  biography    P05   507 12.99  100001  8/31/2000 0:00
13      T13 What Are The Civilian Applications?    history    P03   802 29.99   10467  5/31/1999 0:00

18.3.2 Example of a non-aggregate function - upper

#................................................................
# Example of non-aggregate function
#
# The upper function simply transforms its argument to UPPERCASE.
#................................................................

# we can get just the titles and the prices:
sqldf("select title_name, price from titles")
                            title_name price
1                                1977! 21.99
2            200 Years of German Humor 19.95
3         Ask Yor System Administrator 39.95
4            But I Did It Unconciously 12.99
5               Exchange of Platitudes  6.95
6                     How About Never? 19.95
7                    I Blame My Mother 23.95
8         Just Wait Until After School 10.00
9                      Kiss My Boo Boo 13.95
10         Not Without My Fabrerge Egg    NA
11    Perhaps It's a Glandular Problem  7.99
12           Spontaneous, Not Annoying 12.99
13 What Are The Civilian Applications? 29.99
# The following uses the upper function to make each title UPPER CASE
sqldf("select upper(title_name), price from titles")
                     upper(title_name) price
1                                1977! 21.99
2            200 YEARS OF GERMAN HUMOR 19.95
3         ASK YOR SYSTEM ADMINISTRATOR 39.95
4            BUT I DID IT UNCONCIOUSLY 12.99
5               EXCHANGE OF PLATITUDES  6.95
6                     HOW ABOUT NEVER? 19.95
7                    I BLAME MY MOTHER 23.95
8         JUST WAIT UNTIL AFTER SCHOOL 10.00
9                      KISS MY BOO BOO 13.95
10         NOT WITHOUT MY FABRERGE EGG    NA
11    PERHAPS IT'S A GLANDULAR PROBLEM  7.99
12           SPONTANEOUS, NOT ANNOYING 12.99
13 WHAT ARE THE CIVILIAN APPLICATIONS? 29.99

18.3.3 Example of an aggregate function - avg

#................................................................
# Example of an aggregate function
#
# The avg function takes the average of a set of a set of values.
#................................................................

# show all the prices
sqldf("select price from titles")
   price
1  21.99
2  19.95
3  39.95
4  12.99
5   6.95
6  19.95
7  23.95
8  10.00
9  13.95
10    NA
11  7.99
12 12.99
13 29.99
# show the average of the prices
sqldf("select avg(price) from titles")
  avg(price)
1    18.3875

18.4 Demonstrating non-aggreate functions with simple select statements

##############################################################################
# queries with JUST a select clause
##############################################################################

# It's not the most common thing to do but you can write a select statement
# that does not actually retrieve any information from any table. 
# 
# The purpose is to use the SQL language to calculate values, similar to
# how you can use R or a calculator to calculate sums, products, etc.
#
# This feature will help us demonstrate some of the sql functions below.
# 
# Example:

# perform some calculations and display the results
sqldf("select 3+2,   10*2,   23/5,   23.0/5")    # show the results of various calculations
  3+2 10*2 23/5 23.0/5
1   5   20    4    4.6
# We can include our own names for the columns.
# When you change the name of a column, the new column name is known as a 'column alias'
sqldf("select 3+2 as A,   10*2 as B,   23/5 as C,   23.0/5 as D")    
  A  B C   D
1 5 20 4 4.6
# You don't need the "as" when creating column aliases
# ... but some people do so anyway since it reads nicer with the "as"
#
# same results as above
sqldf("select 3+2 A,   10*2 B,   23/5 C,   23.0/5 D")    
  A  B C   D
1 5 20 4 4.6
# The following is not really doing anything useful but it
# helps to demonstrate what the upper function does.
# This technique is used on the sqlitetutorial.net website
# to demonstrate how many of the SQLite functions work. 
#
# For example, click on the specific function names you find on the following
# page for more info about the functions and examples of using the functions.
#   https://www.sqlitetutorial.net/sqlite-string-functions/

sqldf("select upper('abcde')")
  upper('abcde')
1          ABCDE

18.5 Non-aggreate functions in SQLite

#----------------------------------------------------
#
# Non-aggregate functions in SQLite
#
#----------------------------------------------------


# The non-aggregate functions are NOT defined as part of the ANSI SQL standard.
# Every database management system software offers their own versions of 
# non-aggregate functions.  For example, MySQL, Microsoft SQL Server, Postgres
# and sqlite all have different non-aggregate functions that they make available.
# To undertand how to use these function, you need to see the documentation
# for your particular version of SQL. See the following webpage for
# some functions in SQLite. We will first look at the 
# "string functions" i.e. functions that work with "string" values
# (i.e. character values):
#
#     https://www.sqlitetutorial.net/sqlite-string-functions/
#
# The string functions available in sqlite are the following.
# See the website for more details:
#
#    substr, trim, rtrim, ltrim, length, replace, upper, lower, instr

# first let's remember what is in the titles table
sqldf("select * from titles")
   title_id                          title_name       type pub_id pages price   sales         pubdate
1       T01                               1977!    history    P01   107 21.99     566   8/1/2000 0:00
2       T02           200 Years of German Humor    history    P03    14 19.95    9566   4/1/1998 0:00
3       T03        Ask Yor System Administrator   computer    P02  1226 39.95   25667   9/1/2000 0:00
4       T04           But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00
5       T05              Exchange of Platitudes psychology    P01   201  6.95  201440   1/1/2001 0:00
6       T06                    How About Never?  biography    P01   473 19.95   11320  7/31/2000 0:00
7       T07                   I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00
8       T08        Just Wait Until After School   children    P01    86 10.00    4095   6/1/2001 0:00
9       T09                     Kiss My Boo Boo   children    P01    22 13.95    5000  5/31/2002 0:00
10      T10         Not Without My Fabrerge Egg  biography    P05    NA    NA      NA            <NA>
11      T11    Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00
12      T12           Spontaneous, Not Annoying  biography    P05   507 12.99  100001  8/31/2000 0:00
13      T13 What Are The Civilian Applications?    history    P03   802 29.99   10467  5/31/1999 0:00

18.5.1 upper() and lower() functions

#........................................................
#
# upper ( character_value ) - returns the UPPERCASE version of the character value
#
# lower ( character_value ) - returns the lowercase version of the character value
# 
#........................................................

# show first 5 titles 
sqldf("select title_name
       from titles
       limit 5")
                    title_name
1                        1977!
2    200 Years of German Humor
3 Ask Yor System Administrator
4    But I Did It Unconciously
5       Exchange of Platitudes
# show the first 5 titles in UPPERCASE
sqldf("select upper(title_name)
      from titles
      limit 5")
             upper(title_name)
1                        1977!
2    200 YEARS OF GERMAN HUMOR
3 ASK YOR SYSTEM ADMINISTRATOR
4    BUT I DID IT UNCONCIOUSLY
5       EXCHANGE OF PLATITUDES
# Use a more friendly column name (i.e. a 'column alias').
# i.e. display "title" at the top of the column instead of "upper(title_name)"
sqldf("select upper(title_name) as title
      from titles
      limit 5")
                         title
1                        1977!
2    200 YEARS OF GERMAN HUMOR
3 ASK YOR SYSTEM ADMINISTRATOR
4    BUT I DID IT UNCONCIOUSLY
5       EXCHANGE OF PLATITUDES
# show both the original title and the uppercase version in the same query
sqldf("select title_name, upper(title_name) as TITLE_NAME
      from titles
      limit 5")
                    title_name                   TITLE_NAME
1                        1977!                        1977!
2    200 Years of German Humor    200 YEARS OF GERMAN HUMOR
3 Ask Yor System Administrator ASK YOR SYSTEM ADMINISTRATOR
4    But I Did It Unconciously    BUT I DID IT UNCONCIOUSLY
5       Exchange of Platitudes       EXCHANGE OF PLATITUDES
# show the title in addition to other information
sqldf("select upper(title_name) as TITLE_NAME, price
      from titles
      limit 5")
                    TITLE_NAME price
1                        1977! 21.99
2    200 YEARS OF GERMAN HUMOR 19.95
3 ASK YOR SYSTEM ADMINISTRATOR 39.95
4    BUT I DID IT UNCONCIOUSLY 12.99
5       EXCHANGE OF PLATITUDES  6.95
# In SQLite, you can use the * along with other column names
# but note that this is not standard.
# Not every SQL will accept the * unless it is the only value in the select clause.
#
# Notice that the * stands in for ALL column names, including the title_name
# even though you already displayed the title_name.

sqldf("select upper(title_name) as TITLE_NAME, *
       from titles
       limit 5")
                    TITLE_NAME title_id                   title_name       type pub_id pages price  sales        pubdate
1                        1977!      T01                        1977!    history    P01   107 21.99    566  8/1/2000 0:00
2    200 YEARS OF GERMAN HUMOR      T02    200 Years of German Humor    history    P03    14 19.95   9566  4/1/1998 0:00
3 ASK YOR SYSTEM ADMINISTRATOR      T03 Ask Yor System Administrator   computer    P02  1226 39.95  25667  9/1/2000 0:00
4    BUT I DID IT UNCONCIOUSLY      T04    But I Did It Unconciously psychology    P01   510 12.99  13001 5/31/1999 0:00
5       EXCHANGE OF PLATITUDES      T05       Exchange of Platitudes psychology    P01   201  6.95 201440  1/1/2001 0:00

18.5.2 ltrim() , rtrim() , trim()

#.........................................................................
#
# ltrim ( string, [character] )
# rtrim ( string, [character] )   
# trim  ( string, [character] )
#
# ARGUMENTS
#   - string - the value to be modified
#
#
#   - [character] 
#
#     The [brackets] in the documenation indicate that this value is optional.
#     If specified, this is expected to be a string (ie. a character value 
#     enclosed in 'single quotes'). The characters in the string will be removed
#     from the left (or right) hand side of the string:
#        the l in ltrim stands for left
#        the r in rtrim stands for right
#.........................................................................

# let's see the original data
sqldf("select title_name, pubdate
      from titles")
                            title_name         pubdate
1                                1977!   8/1/2000 0:00
2            200 Years of German Humor   4/1/1998 0:00
3         Ask Yor System Administrator   9/1/2000 0:00
4            But I Did It Unconciously  5/31/1999 0:00
5               Exchange of Platitudes   1/1/2001 0:00
6                     How About Never?  7/31/2000 0:00
7                    I Blame My Mother  10/1/1999 0:00
8         Just Wait Until After School   6/1/2001 0:00
9                      Kiss My Boo Boo  5/31/2002 0:00
10         Not Without My Fabrerge Egg            <NA>
11    Perhaps It's a Glandular Problem 11/30/2000 0:00
12           Spontaneous, Not Annoying  8/31/2000 0:00
13 What Are The Civilian Applications?  5/31/1999 0:00
# get rid of the time indicator from the end of the pubdate
sqldf("select title_name, rtrim(pubdate, '0:') as publication_date
      from titles")
                            title_name publication_date
1                                1977!        8/1/2000 
2            200 Years of German Humor        4/1/1998 
3         Ask Yor System Administrator        9/1/2000 
4            But I Did It Unconciously       5/31/1999 
5               Exchange of Platitudes        1/1/2001 
6                     How About Never?       7/31/2000 
7                    I Blame My Mother       10/1/1999 
8         Just Wait Until After School        6/1/2001 
9                      Kiss My Boo Boo       5/31/2002 
10         Not Without My Fabrerge Egg             <NA>
11    Perhaps It's a Glandular Problem      11/30/2000 
12           Spontaneous, Not Annoying       8/31/2000 
13 What Are The Civilian Applications?       5/31/1999 

18.5.3 substr( string, start, length )

#........................................................
# substr ( string, start, length)
# 
# returns a "substring", i.e. a portion of the string (ie. charcter value).
# ARGUMENTS:
#    start - the position to start taking info from
#    length - the number of characters to include
#........................................................

# get just the number from the title_id
sqldf("select title_id, substr(title_id, 2, 2) as title_number from titles")
   title_id title_number
1       T01           01
2       T02           02
3       T03           03
4       T04           04
5       T05           05
6       T06           06
7       T07           07
8       T08           08
9       T09           09
10      T10           10
11      T11           11
12      T12           12
13      T13           13
# If length is too long the rest of the string is taken.
sqldf("select title_id, substr(title_id, 2, 200) as title_number from titles")
   title_id title_number
1       T01           01
2       T02           02
3       T03           03
4       T04           04
5       T05           05
6       T06           06
7       T07           07
8       T08           08
9       T09           09
10      T10           10
11      T11           11
12      T12           12
13      T13           13
# abbreviate the title
sqldf("select title_name, substr(title_name, 1, 20) as abbreviated_title from titles")
                            title_name    abbreviated_title
1                                1977!                1977!
2            200 Years of German Humor 200 Years of German 
3         Ask Yor System Administrator Ask Yor System Admin
4            But I Did It Unconciously But I Did It Unconci
5               Exchange of Platitudes Exchange of Platitud
6                     How About Never?     How About Never?
7                    I Blame My Mother    I Blame My Mother
8         Just Wait Until After School Just Wait Until Afte
9                      Kiss My Boo Boo      Kiss My Boo Boo
10         Not Without My Fabrerge Egg Not Without My Fabre
11    Perhaps It's a Glandular Problem Perhaps It's a Gland
12           Spontaneous, Not Annoying Spontaneous, Not Ann
13 What Are The Civilian Applications? What Are The Civilia
# abbreviate the title (add on '...' at the end of every value - not just the ones that were shortened)
sqldf("select title_name, substr(title_name, 1, 20) || ' ...' as abbreviated_title from titles")
                            title_name        abbreviated_title
1                                1977!                1977! ...
2            200 Years of German Humor 200 Years of German  ...
3         Ask Yor System Administrator Ask Yor System Admin ...
4            But I Did It Unconciously But I Did It Unconci ...
5               Exchange of Platitudes Exchange of Platitud ...
6                     How About Never?     How About Never? ...
7                    I Blame My Mother    I Blame My Mother ...
8         Just Wait Until After School Just Wait Until Afte ...
9                      Kiss My Boo Boo      Kiss My Boo Boo ...
10         Not Without My Fabrerge Egg Not Without My Fabre ...
11    Perhaps It's a Glandular Problem Perhaps It's a Gland ...
12           Spontaneous, Not Annoying Spontaneous, Not Ann ...
13 What Are The Civilian Applications? What Are The Civilia ...

18.6 Practice

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# QUESTION
# Write a select statement that retrieves all books that were published 
# during a month with a single digit.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# ANSWER

sqldf("select *
       from titles
       where substr(pubdate, 2, 1) = '/'
       order by title_name")
   title_id                          title_name       type pub_id pages price  sales        pubdate
1       T01                               1977!    history    P01   107 21.99    566  8/1/2000 0:00
2       T02           200 Years of German Humor    history    P03    14 19.95   9566  4/1/1998 0:00
3       T03        Ask Yor System Administrator   computer    P02  1226 39.95  25667  9/1/2000 0:00
4       T04           But I Did It Unconciously psychology    P01   510 12.99  13001 5/31/1999 0:00
5       T05              Exchange of Platitudes psychology    P01   201  6.95 201440  1/1/2001 0:00
6       T06                    How About Never?  biography    P01   473 19.95  11320 7/31/2000 0:00
7       T08        Just Wait Until After School   children    P01    86 10.00   4095  6/1/2001 0:00
8       T09                     Kiss My Boo Boo   children    P01    22 13.95   5000 5/31/2002 0:00
9       T12           Spontaneous, Not Annoying  biography    P05   507 12.99 100001 8/31/2000 0:00
10      T13 What Are The Civilian Applications?    history    P03   802 29.99  10467 5/31/1999 0:00
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Question
# Write a select statement that lists the title_name, type and publication date
# of those books that were published in October, November or December 
# (i.e. months whose numeric value is 2 digits)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# ANSWER
# We could simply use != 

sqldf("select title_name, type, pubdate
       from titles
       where substr(pubdate, 2, 1) != '/'
       order by pubdate")
                        title_name       type         pubdate
1                I Blame My Mother  biography  10/1/1999 0:00
2 Perhaps It's a Glandular Problem psychology 11/30/2000 0:00
# ANSWER

# One way: Identify those rows that have a 2 digit month as those rows which
# have a '/' in the 3rd position of the pubdate.
#
# Notice that the books with a 1 digit months in the pubdate
# do not appear in the results.

sqldf("select title_name, type, pubdate 
       from titles
       where substr(pubdate, 3, 1) = '/'")
                        title_name       type         pubdate
1                I Blame My Mother  biography  10/1/1999 0:00
2 Perhaps It's a Glandular Problem psychology 11/30/2000 0:00
# Another way: Identify those rows for which the first 2 characters of the 
# pubdate are '10', '11' or '12'

sqldf("select title_name, type, pubdate 
       from titles
       where substr(pubdate, 1, 2) in ('10','11','12')
      ")
                        title_name       type         pubdate
1                I Blame My Mother  biography  10/1/1999 0:00
2 Perhaps It's a Glandular Problem psychology 11/30/2000 0:00
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Question
# Write a select statement that lists the title_name and day of the month 
# that each title was published.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# ANSWER

# To find the day of the month, let's first trim away the numbers from the 
# left and the right. Then we'll trim away the /'s from the left and the right
# to leave just the day of the month.


# We'll start by just trimming away the numbers spaces and colons.
sqldf("select title_name, trim(pubdate, '0123456789: ') 
       from titles")
                            title_name trim(pubdate, '0123456789: ')
1                                1977!                           /1/
2            200 Years of German Humor                           /1/
3         Ask Yor System Administrator                           /1/
4            But I Did It Unconciously                          /31/
5               Exchange of Platitudes                           /1/
6                     How About Never?                          /31/
7                    I Blame My Mother                           /1/
8         Just Wait Until After School                           /1/
9                      Kiss My Boo Boo                          /31/
10         Not Without My Fabrerge Egg                          <NA>
11    Perhaps It's a Glandular Problem                          /30/
12           Spontaneous, Not Annoying                          /31/
13 What Are The Civilian Applications?                          /31/
# Now for the final answer we can also trim away the slashes
sqldf("select title_name, trim(trim(pubdate, '0123456789: '),'/') as dayOfMonth
       from titles")
                            title_name dayOfMonth
1                                1977!          1
2            200 Years of German Humor          1
3         Ask Yor System Administrator          1
4            But I Did It Unconciously         31
5               Exchange of Platitudes          1
6                     How About Never?         31
7                    I Blame My Mother          1
8         Just Wait Until After School          1
9                      Kiss My Boo Boo         31
10         Not Without My Fabrerge Egg       <NA>
11    Perhaps It's a Glandular Problem         30
12           Spontaneous, Not Annoying         31
13 What Are The Civilian Applications?         31
# NOTE - the following DOES NOT WORK - because it trims off EVERY
sqldf("select title_name, trim(pubdate, '0123456789: /') as dayOfMonth
       from titles")
                            title_name dayOfMonth
1                                1977!           
2            200 Years of German Humor           
3         Ask Yor System Administrator           
4            But I Did It Unconciously           
5               Exchange of Platitudes           
6                     How About Never?           
7                    I Blame My Mother           
8         Just Wait Until After School           
9                      Kiss My Boo Boo           
10         Not Without My Fabrerge Egg       <NA>
11    Perhaps It's a Glandular Problem           
12           Spontaneous, Not Annoying           
13 What Are The Civilian Applications?           

18.6.1 Other functions in SQLite

#------------------------------------------------------------------
#
# see these webpages for examples of the other functions
#    https://www.sqlitetutorial.net/sqlite-functions/sqlite-instr/
#    https://www.techonthenet.com/sqlite/functions/index.php
#------------------------------------------------------------------

18.7 aggregate functions

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Aggregate functions
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Aggregate functions take a bunch of values as input and return a single value.
# For example the avg function averages together several numbers and returns
# a single number.
#
# The following are standard aggregate functions that are included with 
# every version of SQL. These are part of the ANSI standard for SQL.
# These are the aggregagte functions that we will focus on. 
#
#   sum, avg, count, min, max
#
# Different database management systems might include some other aggregate
# functions in addition to these. See the documentation for your particular
# database management system for more info about other aggregate functions.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# Show the contents of the titles table
sqldf("select * from titles")
   title_id                          title_name       type pub_id pages price   sales         pubdate
1       T01                               1977!    history    P01   107 21.99     566   8/1/2000 0:00
2       T02           200 Years of German Humor    history    P03    14 19.95    9566   4/1/1998 0:00
3       T03        Ask Yor System Administrator   computer    P02  1226 39.95   25667   9/1/2000 0:00
4       T04           But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00
5       T05              Exchange of Platitudes psychology    P01   201  6.95  201440   1/1/2001 0:00
6       T06                    How About Never?  biography    P01   473 19.95   11320  7/31/2000 0:00
7       T07                   I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00
8       T08        Just Wait Until After School   children    P01    86 10.00    4095   6/1/2001 0:00
9       T09                     Kiss My Boo Boo   children    P01    22 13.95    5000  5/31/2002 0:00
10      T10         Not Without My Fabrerge Egg  biography    P05    NA    NA      NA            <NA>
11      T11    Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00
12      T12           Spontaneous, Not Annoying  biography    P05   507 12.99  100001  8/31/2000 0:00
13      T13 What Are The Civilian Applications?    history    P03   802 29.99   10467  5/31/1999 0:00
# show the average price, minimum (i.e. lowest) price and maximum (ie. highest price) 
# of all the titles
sqldf("select avg(price),  min(price) , max(price)
      from titles")
  avg(price) min(price) max(price)
1    18.3875       6.95      39.95

18.8 Don’t mix aggregate functions with non-aggregated data

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# DO NO MIX AGGREGATE FUNCTIONS WITH NON-AGGREGATED DATA.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# IN MANY SQL DBMS's YOU WILL GET AN ERROR IF YOU TRY TO DO THIS.
# sqlite does NOT report an error if you do this but the results will not 
# make much sense. 
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# The following shows that in sqlite when you mix aggregate functions 
# with non-aggregated data the results are confusing. Don't do this!
# Some DBMS's will return an ERROR for this query. sqlite does not return
# an error but the results are very misleading.
#
# For example the following is the result of the code below:
#
#    > sqldf("select avg(price), sum(pages),  title_name, type
#             from titles")
# 
#       avg(price) sum(pages) title_name    type
#          18.3875       5107      1977! history
# 
# The avg function and the sum function return aggregagated data for all of
# the rows in the titles table. Specically the average price of all the books
# and the total number of pages from all the books. 
# 
# However, the title_name and type columns simply display
# the name of a specific book. The title_name and type that is displayed
# has nothing to do with the avg(price) and sum(pages). The exact book 
# that is displayed is really not predictable.
#
# Bottom line - do NOT mix aggregrate functions, e.g. avg(price), sum(paes)
# with non-aggregated data, e.g. title_name, type

# DON'T DO THIS!!! - see the comment above.
sqldf("select avg(price), sum(pages),  title_name, type
      from titles")
  avg(price) sum(pages) title_name    type
1    18.3875       5107      1977! history

18.9 count function - three different versions

#--------------------------------------------------------------------
# count function - three different versions
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# The count function counts the number of rows returned.
# There are three different versions of the count function
#
#   count(SOME_COLUMN) 
#
#     This returns the number of rows for the column that are NOT NULL. 
#     (Remember that in SQL NULL is basically the same as NA in R. 
#      Since we are using sqldf to analyze R dataframes instead of actual 
#      database tables, NA, is treated as NULL
#      ie. when using sqldf, count(SOME_COLUMN) will return the number of 
#          rows in that column that don't contain NA.
#
#   count(*) - returns the number of rows. This version counts all rows, including
#              NULL values and non-distinct values
#              Note that count is the ONLY aggregate function that can be used 
#              in this way.
#
#   count(distinct SOME_COLUMN) - Returns the number of DISTINCT values
#                                 (i.e. different unique values) in the column.
#                                 Note that other aggregate functions MAY ALSO
#                                 be used in this way, eg. avg(distinct price) 
#                                 will average the distinct prices.
#                                 
#--------------------------------------------------------------------

# show all of the data in the titles table
sqldf("select * from titles")
   title_id                          title_name       type pub_id pages price   sales         pubdate
1       T01                               1977!    history    P01   107 21.99     566   8/1/2000 0:00
2       T02           200 Years of German Humor    history    P03    14 19.95    9566   4/1/1998 0:00
3       T03        Ask Yor System Administrator   computer    P02  1226 39.95   25667   9/1/2000 0:00
4       T04           But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00
5       T05              Exchange of Platitudes psychology    P01   201  6.95  201440   1/1/2001 0:00
6       T06                    How About Never?  biography    P01   473 19.95   11320  7/31/2000 0:00
7       T07                   I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00
8       T08        Just Wait Until After School   children    P01    86 10.00    4095   6/1/2001 0:00
9       T09                     Kiss My Boo Boo   children    P01    22 13.95    5000  5/31/2002 0:00
10      T10         Not Without My Fabrerge Egg  biography    P05    NA    NA      NA            <NA>
11      T11    Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00
12      T12           Spontaneous, Not Annoying  biography    P05   507 12.99  100001  8/31/2000 0:00
13      T13 What Are The Civilian Applications?    history    P03   802 29.99   10467  5/31/1999 0:00
# show various aggregate function values for the titles column
sqldf("select sum(price) , avg(price) , avg(DISTINCT price), count(*), count(price) , count(DISTINCT price), min(price) , max(price), count(type), count(DISTINCT type)
      from titles")
  sum(price) avg(price) avg(DISTINCT price) count(*) count(price) count(DISTINCT price) min(price) max(price) count(type) count(DISTINCT type)
1     220.65    18.3875              18.771       13           12                    10       6.95      39.95          13                    5

18.10 min and max also work with character columns.

# min and max also work with character columns.
#
# min value of a character column is the value that would appear FIRST in alphabetical order.
# max value of a character column is the value that would appear LAST  in alphabetical order.
sqldf("select min(type), max(type), count(*), count(type), count(DISTINCT type)
      from titles")
  min(type)  max(type) count(*) count(type) count(DISTINCT type)
1 biography psychology       13          13                    5
# QUESTION
# WRite a query to display the different types of books. 
# Sort the results alphabetically.
# Only include each type once.
sqldf("select distinct type from titles order by type asc")
        type
1  biography
2   children
3   computer
4    history
5 psychology

18.11 two diffferent meanings of “distinct”

#--------------------------------------------------------------------
# two diffferent meanings of "distinct"
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# The word "distinct" means two toally different things based on 
# where it appears in the select command.
#
# 1. select ... count (distinct SOME_COLUMN) ...
#   
#    see above
#
#
# 2. select distinct COLUMN1, COLUMN2, ...
#
#    When "distinct" appears right after the word select, it 
#    means that the output should not include rows that are 
#    exact duplicates of each other - in that case only one of the 
#    duplicates is returned.
# 
#--------------------------------------------------------------------

# show all the different types of titles (do not show duplicates)
sqldf("select DISTINCT type
      from titles")
        type
1    history
2   computer
3 psychology
4  biography
5   children
# show just the entire contents of the type column
sqldf("select type
      from titles")
         type
1     history
2     history
3    computer
4  psychology
5  psychology
6   biography
7   biography
8    children
9    children
10  biography
11 psychology
12  biography
13    history
#---------------------------------------------------------------------
# "select distinct ..."  works on ENTIRE ROWS, not individual columns.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Compare the following two queries. Notice that distinct works on entire
# rows of output to eliminate those rows that are duplicates. 
#
# Queries that contain more than one column  could still result with duplicate
# values in each column even when DISTINCT is specified. DISTINCT only serves
# to eliminate entire rows of data that are exact duplicates.
#
# compare the following 2 queries ...
#---------------------------------------------------------------------

# show the entire contents of the type and pub_id columns
# sort the results by type and then by pub_id within each type
#
# notice in the output that some rows are exact duplicates of each other
sqldf("select type, pub_id
      from titles
      order by type, pub_id")
         type pub_id
1   biography    P01
2   biography    P03
3   biography    P05
4   biography    P05
5    children    P01
6    children    P01
7    computer    P02
8     history    P01
9     history    P03
10    history    P03
11 psychology    P01
12 psychology    P01
13 psychology    P05
# Do the same thing but use "distinct" to eliminate the duplicate rows.
# 
# Notice in the output that even though "distinct" is specified, each column
# in the result DOES have duplicate values. Hoewver, there are no rows in the
# output that are exact duplicates of each other.

sqldf("select distinct type, pub_id
      from titles
      order by type, pub_id")
        type pub_id
1  biography    P01
2  biography    P03
3  biography    P05
4   children    P01
5   computer    P02
6    history    P01
7    history    P03
8 psychology    P01
9 psychology    P05

18.12 Using WERE in queries that contain aggregate functions

#-----------------------------------------------------------------------------
# Using WERE in queries that contain aggregate functions
#-----------------------------------------------------------------------------
#
# Aggregate functions work only on rows that are included by the WHERE clause.
# 
#-----------------------------------------------------------------------------

# show the complete titles table
sqldf ("select * from titles")
   title_id                          title_name       type pub_id pages price   sales         pubdate
1       T01                               1977!    history    P01   107 21.99     566   8/1/2000 0:00
2       T02           200 Years of German Humor    history    P03    14 19.95    9566   4/1/1998 0:00
3       T03        Ask Yor System Administrator   computer    P02  1226 39.95   25667   9/1/2000 0:00
4       T04           But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00
5       T05              Exchange of Platitudes psychology    P01   201  6.95  201440   1/1/2001 0:00
6       T06                    How About Never?  biography    P01   473 19.95   11320  7/31/2000 0:00
7       T07                   I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00
8       T08        Just Wait Until After School   children    P01    86 10.00    4095   6/1/2001 0:00
9       T09                     Kiss My Boo Boo   children    P01    22 13.95    5000  5/31/2002 0:00
10      T10         Not Without My Fabrerge Egg  biography    P05    NA    NA      NA            <NA>
11      T11    Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00
12      T12           Spontaneous, Not Annoying  biography    P05   507 12.99  100001  8/31/2000 0:00
13      T13 What Are The Civilian Applications?    history    P03   802 29.99   10467  5/31/1999 0:00
# show the number of titles in the titles table
# The highest price and the average price.
sqldf( "select count(*), max(price), avg(price) 
        from titles")
  count(*) max(price) avg(price)
1       13      39.95    18.3875
# show the number of biographies,
# the highest price for a biography and the average price for a biography
# HINT: use the where clause
sqldf( "select count(*), max(price), avg(price)
        from titles
        where type = 'biography'")
  count(*) max(price) avg(price)
1        4      23.95   18.96333

18.13 Practice

#-----------------------------------------------------------------------------
# 
# Practice Questions
#
#-----------------------------------------------------------------------------

#-----------------------------------------------------------------------------
# Question:
#-----------------------------------------------------------------------------
# Write a query that shows the names of the authors who have written at least one book
# for which they got 1.00 of the royalty share.
# Display the author's names in alphabetical order.
# Do not repeat the same author's name twice
#
# HINTS
# 1. Which tables do you need to include in your query?
#    Make sure to join all of those tables and any intervening tables
# 
# 2. Use DISTINCT at the front of your query to eliminate duplicate rows.
#
# 3. Use "order by" to display the author's names in alphabetical order. Remember
#    that last name is more significant than first name when arranging names
#    in alphabetical order.
#-----------------------------------------------------------------------------

# Answer:

sqldf("select distinct au_fname, au_lname
       from authors join title_authors on authors.au_id = title_authors.au_id
       where royalty_shares = 1.0
       order by au_lname asc, au_fname asc")
   au_fname  au_lname
1     Sarah   Buchman
2     Wendy Heydemark
3      Klee      Hull
4 Christian     Kells
5    Harvey   Kellsey
# notice what will happen if you leave out the distinct
sqldf("select au_fname, au_lname
       from authors join title_authors on authors.au_id = title_authors.au_id
       where royalty_shares = 1.0
       order by au_lname asc, au_fname asc")
    au_fname  au_lname
1      Sarah   Buchman
2      Sarah   Buchman
3      Sarah   Buchman
4      Wendy Heydemark
5      Wendy Heydemark
6      Wendy Heydemark
7       Klee      Hull
8  Christian     Kells
9     Harvey   Kellsey
10    Harvey   Kellsey
#-----------------------------------------------------------------------------
# Question
#-----------------------------------------------------------------------------
# Write a query that shows the names of the authors who have written at least one book
# for which the book has a royalty_rate of at least .06
# Display the author's names in alphabetical order.
# Do not repeat the same author's name twice.

# HINTS
# 1. which tables do you need to include in your query?
#    make sure to join all of those tables and any intervening tables
# 
# 2. Use DISTINCT at the front of your query to eliminate duplicate rows.
#
# 3. Use "order by" to display the author's names in alphabetical order. Remember
#    that last name is more significant than first name when arranging names
#    in alphabetical order.
#-----------------------------------------------------------------------------


# First attempt ...
# Not the answer yet ...
# Let's start out by including the royalty_rate in the select clause. This lets
# us see that we indeed are only getting authors who contributed to a book 
# whose royalty rate was at least .06. 
#
# The problem with the following is that because we are including the royalty_rate
# in the output we will get some authors more than once since the royalty_rate
# may be different for different books that they worked on.

sqldf(" select distinct au_fname, au_lname, royalty_rate
        from authors join title_authors on authors.au_id = title_authors.au_id
                     join titles        on title_authors.title_id = titles.title_id
                     join royalties     on titles.title_id = royalties.title_id
        where royalty_rate > .05
        order by au_lname, au_fname
      ")
    au_fname  au_lname royalty_rate
1      Sarah   Buchman         0.06
2      Wendy Heydemark         0.08
3      Wendy Heydemark         0.11
4      Wendy Heydemark         0.09
5     Hallie      Hull         0.08
6     Hallie      Hull         0.07
7       Klee      Hull         0.08
8       Klee      Hull         0.09
9       Klee      Hull         0.11
10      Klee      Hull         0.07
11 Christian     Kells         0.07
12    Harvey   Kellsey         0.07
# Final answer
# Let's now take out the royalty_rate from the select clause. This will serve
# to have every author only be displayed once since the distinct will eliminate
# rows that are duplicates of each other.
sqldf(" select distinct au_fname, au_lname
        from authors join title_authors on authors.au_id = title_authors.au_id
                     join titles        on title_authors.title_id = titles.title_id
                     join royalties     on titles.title_id = royalties.title_id
        where royalty_rate > .06
        order by au_lname, au_fname
      ")
   au_fname  au_lname
1     Wendy Heydemark
2    Hallie      Hull
3      Klee      Hull
4 Christian     Kells
5    Harvey   Kellsey

18.14 Reminder about how multiple joins work

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Reminder about how multiple joins work
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Remember that every INNER JOIN (which contains an ON clause)
# is the same as doing a cross join followed by removing the rows from the
# cross join that don't match the on clause of the inner join.
# The condition in the ON clause would work the same way if instead it 
# were part of the WHERE clause.
#
# Also remember that when you join multiple tables, the first two tables 
# are joined (as described in the previous paragraph). Then the rows and
# columns that resulted from the first join are joined to the next table in the
# same way (i.e a cross join of rows from the result with the rows from the
# 2nd table followed by removing the rows that don't match the corresponding
# on clause). This keeps happening for all the tables. The entire process
# happens automatically when you specify the inner joins correctly.
#
# For example, if you run the following query, you will get the output shown
# below.
#
#   QUERY:
#
#       sqldf(" select distinct au_fname, au_lname
#               from authors join title_authors on authors.au_id = title_authors.au_id
#                            join titles        on title_authors.title_id = titles.title_id
#                            join royalties     on titles.title_id = royalties.title_id
#               where royalty_rate > .06
#               order by au_lname, au_fname
#             ")
#
#   OUTPUT:
#
#           au_fname     au_lname
#              Wendy    Heydemark
#             Hallie         Hull
#               Klee         Hull
#          Christian        Kells
#             Harvey      Kellsey      
#
#
# The output looks very short and simple. It has five rows and two columns.
# However, in order to arrive at this output, the database must go through
# several steps to process the SQL query. At each step of the internal
# processing there is an "intermediate result" that you normally would not see. 
#
# Below, we demonstrate step by step what actually happens internally when
# the database performs the query. I recommend that you look at
# the output of each step to better understand the process:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# 1st step - 
#
# The following part of the "from clause" in the above query is done first:
#
#    select ...                             (SELECT CLAUSE IS DONE LATER)
#    from authors join title_authors ...    (ON CLAUSE IS DONE LATER)
#                 ...                       (OTHER JOINS ARE DONE LATER)
#    where ...                              (WHERE CLAUSE IS DONE LATER)
#    order by ...                           (ORDER BY CLAUSE IS DONE LATER)
#
# Notice that this query does an "inner join" (or in other words, a plain
# "join"). However, what actually happens behind the scenes is that these
# two tables are really CROSS JOINED and then the extra rows from the 
# "cross join" that don't satisfy the on clause of the inner join are
# removed..
#
# At this point, the "select clause" has not executed yet so all columns from 
# both tables are still part of this intermediate result. In the output 
# of the following command every row contains 12 columns. The first 8 columns,
# come from authors table and the last 4 come from the title_authors table.
# 
#   The columns from the authors table are: 
#
#     au_id, au_fname, au_lname, phone, address, city, state, zip
# 
#   and the columns from the title_authors table are:
#
#     title_id, au_id, au_order, royalty_share
#
# Notice that in a single row of this intermediate output, the value of the
# au_id from the authors table and the value of the au_id from the title_authors
# table do NOT necessarily match

sqldf("select authors.*, title_authors.* 
      from authors CROSS JOIN title_authors")
    au_id  au_fname    au_lname        phone              address          city state   zip title_id au_id au_order royalty_shares
1     A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T01   A01        1            1.0
2     A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T02   A01        1            1.0
3     A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T03   A05        1            1.0
4     A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T04   A03        1            0.6
5     A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T04   A04        2            0.4
6     A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T05   A04        1            1.0
7     A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T06   A02        1            1.0
8     A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T07   A02        1            0.5
9     A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T07   A04        2            0.5
10    A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T08   A06        1            1.0
11    A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T09   A06        1            1.0
12    A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T10   A02        1            1.0
13    A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T11   A03        2            0.3
14    A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T11   A04        3            0.3
15    A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T11   A06        1            0.4
16    A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T12   A02        1            1.0
17    A01     Sarah     Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T13   A01        1            1.0
18    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T01   A01        1            1.0
19    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T02   A01        1            1.0
20    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T03   A05        1            1.0
21    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T04   A03        1            0.6
22    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T04   A04        2            0.4
23    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T05   A04        1            1.0
24    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T06   A02        1            1.0
25    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T07   A02        1            0.5
26    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T07   A04        2            0.5
27    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T08   A06        1            1.0
28    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T09   A06        1            1.0
29    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T10   A02        1            1.0
30    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T11   A03        2            0.3
31    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T11   A04        3            0.3
32    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T11   A06        1            0.4
33    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T12   A02        1            1.0
34    A02     Wendy   Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T13   A01        1            1.0
35    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T01   A01        1            1.0
36    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T02   A01        1            1.0
37    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T03   A05        1            1.0
38    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T04   A03        1            0.6
39    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T04   A04        2            0.4
40    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T05   A04        1            1.0
41    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T06   A02        1            1.0
42    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T07   A02        1            0.5
43    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T07   A04        2            0.5
44    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T08   A06        1            1.0
45    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T09   A06        1            1.0
46    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T10   A02        1            1.0
47    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A03        2            0.3
48    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A04        3            0.3
49    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A06        1            0.4
50    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T12   A02        1            1.0
51    A03    Hallie        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T13   A01        1            1.0
52    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T01   A01        1            1.0
53    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T02   A01        1            1.0
54    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T03   A05        1            1.0
55    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T04   A03        1            0.6
56    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T04   A04        2            0.4
57    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T05   A04        1            1.0
58    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T06   A02        1            1.0
59    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T07   A02        1            0.5
60    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T07   A04        2            0.5
61    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T08   A06        1            1.0
62    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T09   A06        1            1.0
63    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T10   A02        1            1.0
64    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A03        2            0.3
65    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A04        3            0.3
66    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A06        1            0.4
67    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T12   A02        1            1.0
68    A04      Klee        Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T13   A01        1            1.0
69    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T01   A01        1            1.0
70    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T02   A01        1            1.0
71    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T03   A05        1            1.0
72    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T04   A03        1            0.6
73    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T04   A04        2            0.4
74    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T05   A04        1            1.0
75    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T06   A02        1            1.0
76    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T07   A02        1            0.5
77    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T07   A04        2            0.5
78    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T08   A06        1            1.0
79    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T09   A06        1            1.0
80    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T10   A02        1            1.0
81    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T11   A03        2            0.3
82    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T11   A04        3            0.3
83    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T11   A06        1            0.4
84    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T12   A02        1            1.0
85    A05 Christian       Kells 212-771-4680       114 Horatio St      New York    NY 10014      T13   A01        1            1.0
86    A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T01   A01        1            1.0
87    A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T02   A01        1            1.0
88    A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T03   A05        1            1.0
89    A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T04   A03        1            0.6
90    A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T04   A04        2            0.4
91    A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T05   A04        1            1.0
92    A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T06   A02        1            1.0
93    A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T07   A02        1            0.5
94    A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T07   A04        2            0.5
95    A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T08   A06        1            1.0
96    A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T09   A06        1            1.0
97    A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T10   A02        1            1.0
98    A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T11   A03        2            0.3
99    A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T11   A04        3            0.3
100   A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T11   A06        1            0.4
101   A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T12   A02        1            1.0
102   A06    Harvey     Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T13   A01        1            1.0
103   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T01   A01        1            1.0
104   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T02   A01        1            1.0
105   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T03   A05        1            1.0
106   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T04   A03        1            0.6
107   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T04   A04        2            0.4
108   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T05   A04        1            1.0
109   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T06   A02        1            1.0
110   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T07   A02        1            0.5
111   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T07   A04        2            0.5
112   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T08   A06        1            1.0
113   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T09   A06        1            1.0
114   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T10   A02        1            1.0
115   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T11   A03        2            0.3
116   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T11   A04        3            0.3
117   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T11   A06        1            0.4
118   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T12   A02        1            1.0
119   A07     Paddy O'Furniture 941-925-0752         1442 Main St      Sarasota    FL 34236      T13   A01        1            1.0
# 2nd step - 
#
# The on clause is applied to the results from the previous step 1 to eliminate
# those rows from the cross join that don't match the on clause specified
# in the query. Notice that in the results of the following command the 
# first au_id and the second au_id 

sqldf("select authors.*, title_authors.*
      from authors join title_authors on authors.au_id = title_authors.au_id")
   au_id  au_fname  au_lname        phone              address          city state   zip title_id au_id au_order royalty_shares
1    A01     Sarah   Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T01   A01        1            1.0
2    A01     Sarah   Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T02   A01        1            1.0
3    A01     Sarah   Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T13   A01        1            1.0
4    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T06   A02        1            1.0
5    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T07   A02        1            0.5
6    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T10   A02        1            1.0
7    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T12   A02        1            1.0
8    A03    Hallie      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T04   A03        1            0.6
9    A03    Hallie      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A03        2            0.3
10   A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T04   A04        2            0.4
11   A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T05   A04        1            1.0
12   A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T07   A04        2            0.5
13   A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A04        3            0.3
14   A05 Christian     Kells 212-771-4680       114 Horatio St      New York    NY 10014      T03   A05        1            1.0
15   A06    Harvey   Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T08   A06        1            1.0
16   A06    Harvey   Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T09   A06        1            1.0
17   A06    Harvey   Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T11   A06        1            0.4
# 3rd step - 

sqldf("select authors.*, title_authors.*, titles.*
      from authors join title_authors on authors.au_id = title_authors.au_id
                   join titles        on title_authors.title_id = titles.title_id")
   au_id  au_fname  au_lname        phone              address          city state   zip title_id au_id au_order royalty_shares title_id                          title_name       type pub_id pages price   sales         pubdate
1    A01     Sarah   Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T01   A01        1            1.0      T01                               1977!    history    P01   107 21.99     566   8/1/2000 0:00
2    A01     Sarah   Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T02   A01        1            1.0      T02           200 Years of German Humor    history    P03    14 19.95    9566   4/1/1998 0:00
3    A01     Sarah   Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T13   A01        1            1.0      T13 What Are The Civilian Applications?    history    P03   802 29.99   10467  5/31/1999 0:00
4    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T06   A02        1            1.0      T06                    How About Never?  biography    P01   473 19.95   11320  7/31/2000 0:00
5    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T07   A02        1            0.5      T07                   I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00
6    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T10   A02        1            1.0      T10         Not Without My Fabrerge Egg  biography    P05    NA    NA      NA            <NA>
7    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T12   A02        1            1.0      T12           Spontaneous, Not Annoying  biography    P05   507 12.99  100001  8/31/2000 0:00
8    A03    Hallie      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T04   A03        1            0.6      T04           But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00
9    A03    Hallie      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A03        2            0.3      T11    Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00
10   A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T04   A04        2            0.4      T04           But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00
11   A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T05   A04        1            1.0      T05              Exchange of Platitudes psychology    P01   201  6.95  201440   1/1/2001 0:00
12   A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T07   A04        2            0.5      T07                   I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00
13   A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A04        3            0.3      T11    Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00
14   A05 Christian     Kells 212-771-4680       114 Horatio St      New York    NY 10014      T03   A05        1            1.0      T03        Ask Yor System Administrator   computer    P02  1226 39.95   25667   9/1/2000 0:00
15   A06    Harvey   Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T08   A06        1            1.0      T08        Just Wait Until After School   children    P01    86 10.00    4095   6/1/2001 0:00
16   A06    Harvey   Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T09   A06        1            1.0      T09                     Kiss My Boo Boo   children    P01    22 13.95    5000  5/31/2002 0:00
17   A06    Harvey   Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T11   A06        1            0.4      T11    Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00
# 4th step - join royalties on titles.title_id = royalties.title_id

sqldf("select authors.*, title_authors.*, titles.*, royalties.*
      from authors join title_authors on authors.au_id = title_authors.au_id
                   join titles        on title_authors.title_id = titles.title_id
                   join royalties     on titles.title_id = royalties.title_id")
   au_id  au_fname  au_lname        phone              address          city state   zip title_id au_id au_order royalty_shares title_id                          title_name       type pub_id pages price   sales         pubdate title_id advance royalty_rate
1    A01     Sarah   Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T01   A01        1            1.0      T01                               1977!    history    P01   107 21.99     566   8/1/2000 0:00      T01   10000         0.05
2    A01     Sarah   Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T02   A01        1            1.0      T02           200 Years of German Humor    history    P03    14 19.95    9566   4/1/1998 0:00      T02    1000         0.06
3    A01     Sarah   Buchman 718-496-7223       75 West 205 St         Bronx    NY 10468      T13   A01        1            1.0      T13 What Are The Civilian Applications?    history    P03   802 29.99   10467  5/31/1999 0:00      T13   20000         0.06
4    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T06   A02        1            1.0      T06                    How About Never?  biography    P01   473 19.95   11320  7/31/2000 0:00      T06   20000         0.08
5    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T07   A02        1            0.5      T07                   I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00      T07 1000000         0.11
6    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T10   A02        1            1.0      T10         Not Without My Fabrerge Egg  biography    P05    NA    NA      NA            <NA>      T10      NA           NA
7    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T12   A02        1            1.0      T12           Spontaneous, Not Annoying  biography    P05   507 12.99  100001  8/31/2000 0:00      T12   50000         0.09
8    A03    Hallie      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T04   A03        1            0.6      T04           But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00      T04   20000         0.08
9    A03    Hallie      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A03        2            0.3      T11    Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00      T11  100000         0.07
10   A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T04   A04        2            0.4      T04           But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00      T04   20000         0.08
11   A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T05   A04        1            1.0      T05              Exchange of Platitudes psychology    P01   201  6.95  201440   1/1/2001 0:00      T05  100000         0.09
12   A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T07   A04        2            0.5      T07                   I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00      T07 1000000         0.11
13   A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A04        3            0.3      T11    Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00      T11  100000         0.07
14   A05 Christian     Kells 212-771-4680       114 Horatio St      New York    NY 10014      T03   A05        1            1.0      T03        Ask Yor System Administrator   computer    P02  1226 39.95   25667   9/1/2000 0:00      T03   15000         0.07
15   A06    Harvey   Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T08   A06        1            1.0      T08        Just Wait Until After School   children    P01    86 10.00    4095   6/1/2001 0:00      T08       0         0.04
16   A06    Harvey   Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T09   A06        1            1.0      T09                     Kiss My Boo Boo   children    P01    22 13.95    5000  5/31/2002 0:00      T09       0         0.05
17   A06    Harvey   Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T11   A06        1            0.4      T11    Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00      T11  100000         0.07
# 5th step - where

sqldf("select authors.*, title_authors.*, titles.*, royalties.*
               from authors join title_authors on authors.au_id = title_authors.au_id
                            join titles        on title_authors.title_id = titles.title_id
                            join royalties     on titles.title_id = royalties.title_id
               where royalty_rate > .06
             ")
   au_id  au_fname  au_lname        phone              address          city state   zip title_id au_id au_order royalty_shares title_id                       title_name       type pub_id pages price   sales         pubdate title_id advance royalty_rate
1    A05 Christian     Kells 212-771-4680       114 Horatio St      New York    NY 10014      T03   A05        1            1.0      T03     Ask Yor System Administrator   computer    P02  1226 39.95   25667   9/1/2000 0:00      T03   15000         0.07
2    A03    Hallie      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T04   A03        1            0.6      T04        But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00      T04   20000         0.08
3    A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T04   A04        2            0.4      T04        But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00      T04   20000         0.08
4    A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T05   A04        1            1.0      T05           Exchange of Platitudes psychology    P01   201  6.95  201440   1/1/2001 0:00      T05  100000         0.09
5    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T06   A02        1            1.0      T06                 How About Never?  biography    P01   473 19.95   11320  7/31/2000 0:00      T06   20000         0.08
6    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T07   A02        1            0.5      T07                I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00      T07 1000000         0.11
7    A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T07   A04        2            0.5      T07                I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00      T07 1000000         0.11
8    A03    Hallie      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A03        2            0.3      T11 Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00      T11  100000         0.07
9    A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A04        3            0.3      T11 Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00      T11  100000         0.07
10   A06    Harvey   Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T11   A06        1            0.4      T11 Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00      T11  100000         0.07
11   A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T12   A02        1            1.0      T12        Spontaneous, Not Annoying  biography    P05   507 12.99  100001  8/31/2000 0:00      T12   50000         0.09
# 6th step - order by 
       
sqldf("select authors.*, title_authors.*, titles.*, royalties.*
               from authors join title_authors on authors.au_id = title_authors.au_id
                            join titles        on title_authors.title_id = titles.title_id
                            join royalties     on titles.title_id = royalties.title_id
               where royalty_rate > .06
               order by au_lname, au_fname
             ")
   au_id  au_fname  au_lname        phone              address          city state   zip title_id au_id au_order royalty_shares title_id                       title_name       type pub_id pages price   sales         pubdate title_id advance royalty_rate
1    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T06   A02        1            1.0      T06                 How About Never?  biography    P01   473 19.95   11320  7/31/2000 0:00      T06   20000         0.08
2    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T07   A02        1            0.5      T07                I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00      T07 1000000         0.11
3    A02     Wendy Heydemark 303-986-7020     2922 Baseline Rd       Boulder    CO 80303      T12   A02        1            1.0      T12        Spontaneous, Not Annoying  biography    P05   507 12.99  100001  8/31/2000 0:00      T12   50000         0.09
4    A03    Hallie      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T04   A03        1            0.6      T04        But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00      T04   20000         0.08
5    A03    Hallie      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A03        2            0.3      T11 Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00      T11  100000         0.07
6    A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T04   A04        2            0.4      T04        But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00      T04   20000         0.08
7    A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T05   A04        1            1.0      T05           Exchange of Platitudes psychology    P01   201  6.95  201440   1/1/2001 0:00      T05  100000         0.09
8    A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T07   A04        2            0.5      T07                I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00      T07 1000000         0.11
9    A04      Klee      Hull 415-549-4278 3800 Waldo Ave, #14F San Francisco    CA 94123      T11   A04        3            0.3      T11 Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00      T11  100000         0.07
10   A05 Christian     Kells 212-771-4680       114 Horatio St      New York    NY 10014      T03   A05        1            1.0      T03     Ask Yor System Administrator   computer    P02  1226 39.95   25667   9/1/2000 0:00      T03   15000         0.07
11   A06    Harvey   Kellsey 650-836-7128       390 Serra Mall     Palo Alto    CA 94305      T11   A06        1            0.4      T11 Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00      T11  100000         0.07
# 7th step - select

sqldf("select au_fname, au_lname
               from authors join title_authors on authors.au_id = title_authors.au_id
                            join titles        on title_authors.title_id = titles.title_id
                            join royalties     on titles.title_id = royalties.title_id
               where royalty_rate > .06
               order by au_lname, au_fname
             ")
    au_fname  au_lname
1      Wendy Heydemark
2      Wendy Heydemark
3      Wendy Heydemark
4     Hallie      Hull
5     Hallie      Hull
6       Klee      Hull
7       Klee      Hull
8       Klee      Hull
9       Klee      Hull
10 Christian     Kells
11    Harvey   Kellsey
# 6th step - select distinct

sqldf("select DISTINCT au_fname, au_lname
               from authors join title_authors on authors.au_id = title_authors.au_id
                            join titles        on title_authors.title_id = titles.title_id
                            join royalties     on titles.title_id = royalties.title_id
               where royalty_rate > .06
               order by au_lname, au_fname
             ")
   au_fname  au_lname
1     Wendy Heydemark
2    Hallie      Hull
3      Klee      Hull
4 Christian     Kells
5    Harvey   Kellsey

18.15 WARNING - If where clause removes all rows then avg,sum,max,min return NULL

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# WARNING : If no rows are returned by the where clause, then the 
#           aggregate functions (except for count) all return NULL
#  
#           Note that since we are using R dataframes instead
#           of tables, NULL will be displayed as NA
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# No rows match the where - result of all aggregate functions, except count 
# is NULL (i.e. NA in R)

sqldf("select count(*), sum(pages), avg(pages), min(pages), max(pages)
      from titles
      where type = 'this type does not exist' ")
  count(*) sum(pages) avg(pages) min(pages) max(pages)
1        0         NA         NA         NA         NA

18.16 Practice

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# show the total number of cookbooks (i.e. type = 'cookbook')
#          total number of pages in all the cookbooks
#          and the average number of pages in cookbooks
#
# NOTE that since there are currently no cookbooks in the table
# there will be no rows returned by the where clause. Therefore
# the sum and the avg will be NULL (i.e. NA in R)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

sqldf("select count(*), sum(pages), avg(pages)
      from titles
      where type = 'cookbook'")
  count(*) sum(pages) avg(pages)
1        0         NA         NA

18.17 DON’T use aggregate functions & non-aggregated values in same select clause

##################################################################
#
# WARNING : 
#
# DO NOT include both aggregate function and non-aggregated values
# in the same select clause
#
#    (NOTE - it may help to look at the example below before reading further)
#
# BECAUSE ...    
#
# The aggregate function will return fewer rows than the 
# non-aggregated data. This presents a problem for SQL because
# there is a contradiction as to how many rows should be shown!
#
# In many many versions of SQL, such queries would generate an ERROR!
#
# In SQLite, this query doesn't generate an error. Rather, the query 
# will only return one row. The values of the non-aggregate column
# will be the first value that would have appeared for that non-aggregate
# column if the aggregate function was not present.
#
# See this page for an in-depth analysis of this issue: 
#   https://www.dataquest.io/blog/sql-tutorial-selecting-ungrouped-columns-without-aggregate-functions/
##################################################################


# The output of the following produces ONE row
sqldf("select avg(price) from titles")
  avg(price)
1    18.3875
# The output of the following produces MORE THAN ONE row
sqldf("select title_name from titles")
                            title_name
1                                1977!
2            200 Years of German Humor
3         Ask Yor System Administrator
4            But I Did It Unconciously
5               Exchange of Platitudes
6                     How About Never?
7                    I Blame My Mother
8         Just Wait Until After School
9                      Kiss My Boo Boo
10         Not Without My Fabrerge Egg
11    Perhaps It's a Glandular Problem
12           Spontaneous, Not Annoying
13 What Are The Civilian Applications?
# The following will generate an ERROR in MANY MANY versions of SQL.
# since avg(price) should return ONE number, but there are many different
# values for title_name.
#
# SQLite allows this type of query but the results are confusing.
# Since the avg generates only a single number, the following results in
# only a single row with only one particular title even though there are
# many different title_names.

# DON'T DO THIS!!! (see the comment above)
sqldf("select avg(price) , title_name from titles")
  avg(price) title_name
1    18.3875      1977!

18.18 ORDER OF CLAUSES: select, from, where, group by, having, order by, limit

# REMEMBER
#    While select statements don't always need to use every clause, 
#    the clauses that are in a statement must appear in the following order:
#
# select ...
# from ...
# where ...
# group by ...
# having ...
# order by ...
# limit ... 

18.19 GROUP BY affects how aggregate functions work

#################################################################
# group by COLUMN1, COLUMN2, ...
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# The "group by" clause has a direct effect on how aggregate functions work.
#
# As we've seen above, when a query contains an
# aggregate function 
#   (and does not contain a group by clause -
#    remember we didn't start discussing "group by" until just now),
# the query will return a single row. The single row
# of output would contain the result of applying the aggregate function(s)
# to all of the rows from the table that satisfied the where clause.
#
# By contrast, a query that contains aggregate functions, and also contains
# a group by clause, could return more than one row of output, i.e. one row of
# output for each "group" of rows from the original table(s) [... keep reading ...] 
#
# The group by clause specifies one or more columns. The rows from the table(s)
# that contain the same values for these columns constitute a logical "group" of
# rows. For example the following shows the format of a query that gets info
# from the titles table. A lot of the query was left out because for now 
# I want to focus just on "from titles" and "group by type"
#
#     select ... (specify what you want to select)
#     from titles
#     ... (more stuff from the qeury)
#     group by type
#
# This query segments the rows from the titles table into
# different "groups". 
#   There will be one     "group" that contains all rows with type='biography'.
#   There will be another "group" that contains all rows with type='childrens'.
#   There will be another "group" that contains all rows with type='computer'.
#   etc. 
#
# If the query were to include aggregate functions, the output would indlude
# one row for each "group" of rows from the original table. This is 
# best explained with an example:
#################################################################

18.20 Practice

# First let's look at all of the types and prices in order
sqldf("select type, price 
      from titles 
      order by type, price")
         type price
1   biography    NA
2   biography 12.99
3   biography 19.95
4   biography 23.95
5    children 10.00
6    children 13.95
7    computer 39.95
8     history 19.95
9     history 21.99
10    history 29.99
11 psychology  6.95
12 psychology  7.99
13 psychology 12.99
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Write a query that returns the following for each 
# type of book (i.e. "childrens", "computer", etc)
#
#     the number of copies of each type 
#     the average price of all books of that type
#     the max price of all books of that type
#     the min price of all books of that type
#
# HINT: use group by to create different groups of rows from the original table
#       All the rows with the same value for type will be part of the same group.
#       Then use the avg(price) function to get the average price for each type.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

sqldf("select type, count(*), avg(price), max(price), min(price)
      from titles
      group by type
      order by type")
        type count(*) avg(price) max(price) min(price)
1  biography        4   18.96333      23.95      12.99
2   children        2   11.97500      13.95      10.00
3   computer        1   39.95000      39.95      39.95
4    history        3   23.97667      29.99      19.95
5 psychology        3    9.31000      12.99       6.95
# To make the output easier to read, change the code to display "#copies"
# instead of count(*) at the top of that column in the output.

sqldf("select type, count(*) as '#copies', avg(price), max(price), min(price)
      from titles
      group by type
      order by type")
        type #copies avg(price) max(price) min(price)
1  biography       4   18.96333      23.95      12.99
2   children       2   11.97500      13.95      10.00
3   computer       1   39.95000      39.95      39.95
4    history       3   23.97667      29.99      19.95
5 psychology       3    9.31000      12.99       6.95
sqldf("select count(*) as '#copies', avg(price), max(price), min(price)
      from titles")
  #copies avg(price) max(price) min(price)
1      13    18.3875      39.95       6.95

18.21 GROUP BY more than one column

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# GROUP BY more than one column
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# The GROUP BY may contain more than one column. 
# This in effect creates a group from all rows that have the same value
# for all of the specified columns.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# First let's look at the types, pub_id's, titles and prices of all books in order
sqldf("select type, pub_id, price, title_name
       from titles
       order by type, pub_id, price")
         type pub_id price                          title_name
1   biography    P01 19.95                    How About Never?
2   biography    P03 23.95                   I Blame My Mother
3   biography    P05    NA         Not Without My Fabrerge Egg
4   biography    P05 12.99           Spontaneous, Not Annoying
5    children    P01 10.00        Just Wait Until After School
6    children    P01 13.95                     Kiss My Boo Boo
7    computer    P02 39.95        Ask Yor System Administrator
8     history    P01 21.99                               1977!
9     history    P03 19.95           200 Years of German Humor
10    history    P03 29.99 What Are The Civilian Applications?
11 psychology    P01  6.95              Exchange of Platitudes
12 psychology    P01 12.99           But I Did It Unconciously
13 psychology    P05  7.99    Perhaps It's a Glandular Problem
# Now let's create groups and show how many titles in each group.
sqldf("select type, pub_id, count(*)
       from titles
       group by type, pub_id
       order by type, pub_id")
        type pub_id count(*)
1  biography    P01        1
2  biography    P03        1
3  biography    P05        2
4   children    P01        2
5   computer    P02        1
6    history    P01        1
7    history    P03        2
8 psychology    P01        2
9 psychology    P05        1

18.22 Practice

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Write a query that returns the average price, max price and min price of each
# type of book (i.e. "childrens", "computer", etc) that is published by each 
# publisher.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# IMPORTANT - REMEMBER ... 
# Clauses in a SELECT statement must appear in the following order:
# All of the clauses are optional except for the "select" clause.
#
#   select ...
#   from ...
#   where ...
#   group by  ...
#   having ...
#   order by ...
#   limit ...       (limit is not part of the SQL standard but is part of many versions of SQL)


# ANSWER
#
# In the output of the following command, note that only one row is
# shown for each type/pub_id combination

sqldf("select type, pub_id, count(*) as '#copies', avg(price), min(price), max(price)
      from titles
      group by type, pub_id
      order by type")
        type pub_id #copies avg(price) min(price) max(price)
1  biography    P01       1     19.950      19.95      19.95
2  biography    P03       1     23.950      23.95      23.95
3  biography    P05       2     12.990      12.99      12.99
4   children    P01       2     11.975      10.00      13.95
5   computer    P02       1     39.950      39.95      39.95
6    history    P01       1     21.990      21.99      21.99
7    history    P03       2     24.970      19.95      29.99
8 psychology    P01       2      9.970       6.95      12.99
9 psychology    P05       1      7.990       7.99       7.99
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# rewrite the order by of the previous query so that all books by the same publisher
# appear together in consecutive rows (i.e. rows that are "one after the other" for 
# the same publisher)
#
# Note it makes no difference which column is listed first in the group by 
#      it DOES make a difference which column is listed first in the order by

# ANSWER

sqldf("select type, pub_id, avg(price), count(*), min(price), max(price), sum(price)
      from titles
      group by type, pub_id   
      order by pub_id, type")
        type pub_id avg(price) count(*) min(price) max(price) sum(price)
1  biography    P01     19.950        1      19.95      19.95      19.95
2   children    P01     11.975        2      10.00      13.95      23.95
3    history    P01     21.990        1      21.99      21.99      21.99
4 psychology    P01      9.970        2       6.95      12.99      19.94
5   computer    P02     39.950        1      39.95      39.95      39.95
6  biography    P03     23.950        1      23.95      23.95      23.95
7    history    P03     24.970        2      19.95      29.99      49.94
8  biography    P05     12.990        2      12.99      12.99      12.99
9 psychology    P05      7.990        1       7.99       7.99       7.99
# Change the order of the pub_id and the type in the select so that the output is easier to read.

sqldf("select pub_id, type, avg(price), count(*), min(price), max(price), sum(price)
      from titles
      group by type, pub_id
      order by pub_id, type")
  pub_id       type avg(price) count(*) min(price) max(price) sum(price)
1    P01  biography     19.950        1      19.95      19.95      19.95
2    P01   children     11.975        2      10.00      13.95      23.95
3    P01    history     21.990        1      21.99      21.99      21.99
4    P01 psychology      9.970        2       6.95      12.99      19.94
5    P02   computer     39.950        1      39.95      39.95      39.95
6    P03  biography     23.950        1      23.95      23.95      23.95
7    P03    history     24.970        2      19.95      29.99      49.94
8    P05  biography     12.990        2      12.99      12.99      12.99
9    P05 psychology      7.990        1       7.99       7.99       7.99
# WARNING - THE FOLLOWING QUERY IS WRONG SINCE title_name IS NOT AN AGGREGATE
# FUNCTION AND IS ALSO NOT IN THE GROUP BY CLAUSE.
# 
# SOME Database Management Systems (DBMS) WILL RETURN AN ERROR FOR THE 
# FOLLOWING QUERY, WHEREAS SQLITE, the database that R sqldf uses by default,
# DOESN'T REUTRN AN ERROR BUT RATHER RETURNS CONFUSING RESULTS. -- DON'T DO THIS!

sqldf("select pub_id, type, title_name, avg(price), count(*), min(price), max(price), sum(price)
      from titles
      group by type, pub_id
      order by pub_id, type")
  pub_id       type                          title_name avg(price) count(*) min(price) max(price) sum(price)
1    P01  biography                    How About Never?     19.950        1      19.95      19.95      19.95
2    P01   children                     Kiss My Boo Boo     11.975        2      10.00      13.95      23.95
3    P01    history                               1977!     21.990        1      21.99      21.99      21.99
4    P01 psychology           But I Did It Unconciously      9.970        2       6.95      12.99      19.94
5    P02   computer        Ask Yor System Administrator     39.950        1      39.95      39.95      39.95
6    P03  biography                   I Blame My Mother     23.950        1      23.95      23.95      23.95
7    P03    history What Are The Civilian Applications?     24.970        2      19.95      29.99      49.94
8    P05  biography           Spontaneous, Not Annoying     12.990        2      12.99      12.99      12.99
9    P05 psychology    Perhaps It's a Glandular Problem      7.990        1       7.99       7.99       7.99
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# QUESTION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Rewrite the previous query to show the publisher's name instead of the publisher's id
# 
# HINTS
# - you must join the appropriate tables
#
# - since the query has a group by, the select clause should only refer to 
#   aggregate functions and to columns that appear in the group by.
#
#   The select clause should NOT refer to columns that do not appear in the group by.
#   Since we want to display the publisher's name we should also group by 
#   the publisher's name instead of the pub_id.
#   
#   NOTE - that this is reasonable to do since for all rows with the same
#          pub_id, the publisher's name is guaranteed to be the same 
#          (think about it)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

# show one row for each title.
#
# Show the info from the title table and the associated info
# from the publisher's table in the single row of output for each title.

sqldf("select titles.*, publishers.*
      from titles join publishers on titles.pub_id = publishers.pub_id
      order by pub_name, type")
   title_id                          title_name       type pub_id pages price   sales         pubdate pub_id             pub_name          city state country
1       T10         Not Without My Fabrerge Egg  biography    P05    NA    NA      NA            <NA>    P05       AAA Publishing      Berkeley    CA     USA
2       T12           Spontaneous, Not Annoying  biography    P05   507 12.99  100001  8/31/2000 0:00    P05       AAA Publishing      Berkeley    CA     USA
3       T11    Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00    P05       AAA Publishing      Berkeley    CA     USA
4       T06                    How About Never?  biography    P01   473 19.95   11320  7/31/2000 0:00    P01    Abatis Publishers      New York    NY     USA
5       T08        Just Wait Until After School   children    P01    86 10.00    4095   6/1/2001 0:00    P01    Abatis Publishers      New York    NY     USA
6       T09                     Kiss My Boo Boo   children    P01    22 13.95    5000  5/31/2002 0:00    P01    Abatis Publishers      New York    NY     USA
7       T01                               1977!    history    P01   107 21.99     566   8/1/2000 0:00    P01    Abatis Publishers      New York    NY     USA
8       T04           But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00    P01    Abatis Publishers      New York    NY     USA
9       T05              Exchange of Platitudes psychology    P01   201  6.95  201440   1/1/2001 0:00    P01    Abatis Publishers      New York    NY     USA
10      T03        Ask Yor System Administrator   computer    P02  1226 39.95   25667   9/1/2000 0:00    P02      Core Dump Books San Francisco    CA     USA
11      T07                   I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00    P03 Schandenfreude Press       Hamburg  <NA> Germany
12      T02           200 Years of German Humor    history    P03    14 19.95    9566   4/1/1998 0:00    P03 Schandenfreude Press       Hamburg  <NA> Germany
13      T13 What Are The Civilian Applications?    history    P03   802 29.99   10467  5/31/1999 0:00    P03 Schandenfreude Press       Hamburg  <NA> Germany
# Show the pub_name in the output (not just the pub_id)
# along with aggregate info for that publisher and type of book.
#
# In ANSI standard SQL and in most DBMS software (e.g. oracle, MySql, Postgres)
# We need to include pub_name in the group by if we will include it in the 
# output. 

sqldf("select pub_name, type, avg(price), count(*), min(price), max(price), sum(price)
      from titles join publishers on titles.pub_id = publishers.pub_id
      group by type, pub_name
      order by pub_name, type")
              pub_name       type avg(price) count(*) min(price) max(price) sum(price)
1       AAA Publishing  biography     12.990        2      12.99      12.99      12.99
2       AAA Publishing psychology      7.990        1       7.99       7.99       7.99
3    Abatis Publishers  biography     19.950        1      19.95      19.95      19.95
4    Abatis Publishers   children     11.975        2      10.00      13.95      23.95
5    Abatis Publishers    history     21.990        1      21.99      21.99      21.99
6    Abatis Publishers psychology      9.970        2       6.95      12.99      19.94
7      Core Dump Books   computer     39.950        1      39.95      39.95      39.95
8 Schandenfreude Press  biography     23.950        1      23.95      23.95      23.95
9 Schandenfreude Press    history     24.970        2      19.95      29.99      49.94
# SQLite will allow the pub_name in the select clause even if it isn't
# in the group by. However, you should avoid this - it isn't standard ANSI SQL.

sqldf("select pub_name, type, avg(price), count(*), min(price), max(price), sum(price)
      from titles join publishers on titles.pub_id = publishers.pub_id
      group by type, publishers.pub_id
      order by pub_name, type")
              pub_name       type avg(price) count(*) min(price) max(price) sum(price)
1       AAA Publishing  biography     12.990        2      12.99      12.99      12.99
2       AAA Publishing psychology      7.990        1       7.99       7.99       7.99
3    Abatis Publishers  biography     19.950        1      19.95      19.95      19.95
4    Abatis Publishers   children     11.975        2      10.00      13.95      23.95
5    Abatis Publishers    history     21.990        1      21.99      21.99      21.99
6    Abatis Publishers psychology      9.970        2       6.95      12.99      19.94
7      Core Dump Books   computer     39.950        1      39.95      39.95      39.95
8 Schandenfreude Press  biography     23.950        1      23.95      23.95      23.95
9 Schandenfreude Press    history     24.970        2      19.95      29.99      49.94
# Rewrite the previous query to only group by the pub_name and not by the type.
#
# HINTS - because the type is not part of the group by it should also be eliminated
#         from the select clause and the order by clause

sqldf("select pub_name, avg(price), count(*), min(price), max(price), sum(price)
      from titles join publishers on titles.pub_id = publishers.pub_id
      group by pub_name
      order by pub_name")
              pub_name avg(price) count(*) min(price) max(price) sum(price)
1       AAA Publishing     10.490        3       7.99      12.99      20.98
2    Abatis Publishers     14.305        6       6.95      21.99      85.83
3      Core Dump Books     39.950        1      39.95      39.95      39.95
4 Schandenfreude Press     24.630        3      19.95      29.99      73.89
#``````````````````````````````````````````````````````````````````````````````
# SQLite doesn't follow the standard exactly ..
#``````````````````````````````````````````````````````````````````````````````
# Note that the SQL standard does not allow for column names in the select and order by 
# clauses that are not also part of the group by.
# However, SQLite, does not report an error in these cases and the output could be
# confusing. The following output shows a type, but the avg and other aggregate functions
# include ALL types, not just the one shown.
#

# See these page for a description of the issue and suggestions for workarounds:
#
#   https://www.dataquest.io/blog/sql-tutorial-selecting-ungrouped-columns-without-aggregate-functions/
#
#   https://learnsql.com/blog/not-a-group-by-expression-error/
#
#
# See this page for the official SQLite documentation 
# Below is an excerpt of the relevant info from this page.
#   
#   https://www.sqlite.org/lang_select.html#bareagg

        # 2.5. Bare columns in an aggregate query The usual case is that all column
        # names in an aggregate query are either arguments to aggregate functions or
        # else appear in the GROUP BY clause. A result column which contains a column
        # name that is not within an aggregate function and that does not appear in the
        # GROUP BY clause (if one exists) is called a "bare" column. Example:
        #
        # SELECT a, b, sum(c) FROM tab1 GROUP BY a; In the query above, the "a" column
        # is part of the GROUP BY clause and so each row of the output contains one of
        # the distinct values for "a". The "c" column is contained within the sum()
        # aggregate function and so that output column is the sum of all "c" values in
        # rows that have the same value for "a". But what is the result of the bare
        # column "b"? The answer is that the "b" result will be the value for "b" in one
        # of the input rows that form the aggregate. The problem is that you usually do
        # not know which input row is used to compute "b", and so in many cases the
        # value for "b" is undefined.
        #
        # Special processing occurs when the aggregate function is either min() or
        # max(). Example:
        #
        # SELECT a, b, max(c) FROM tab1 GROUP BY a; If there is exactly one min() or
        # max() aggregate in the query, then all bare columns in the result set take
        # values from an input row which also contains the minimum or maximum. So in the
        # query above, the value of the "b" column in the output will be the value of
        # the "b" column in the input row that has the largest "c" value. There are
        # limitations on this special behavior of min() and max():
        #
        # If the same minimum or maximum value occurs on two or more rows, then bare
        # values might be selected from any of those rows. The choice is arbitrary.
        # There is no way to predict from which row the bare values will be choosen. The
        # choice might be different for different bare columns within the same query.
        #
        # If there are two or more min() or max() aggregates in the query, then bare
        # column values will be taken from one of the rows on which one of the
        # aggregates has their minimum or maximum value. The choice of which min() or
        # max() aggregate determines the selection of bare column values is arbitrary.
        # The choice might be different for different bare columns within the same
        # query.
        #
        # This special processing for min() or max() aggregates only works for the
        # built-in implementation of those aggregates. If an application overrides the
        # built-in min() or max() aggregates with application-defined alternatives, then
        # the values selected for bare columns will be taken from an arbitrary row.
        #
        # Most other SQL database engines disallow bare columns. If you include a bare
        # column in a query, other database engines will usually raise an error. The
        # ability to include bare columns in a query is an SQLite-specific extension.
#
#
#``````````````````````````````````````````````````````````````````````````````

sqldf("select pub_name, type, avg(price), count(*), min(price), max(price), sum(price)
      from titles join publishers on titles.pub_id = publishers.pub_id
      group by pub_name
      order by pub_name")
              pub_name      type avg(price) count(*) min(price) max(price) sum(price)
1       AAA Publishing biography     10.490        3       7.99      12.99      20.98
2    Abatis Publishers   history     14.305        6       6.95      21.99      85.83
3      Core Dump Books  computer     39.950        1      39.95      39.95      39.95
4 Schandenfreude Press   history     24.630        3      19.95      29.99      73.89

18.23 HAVING clause

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# HAVING 
#
#    group by COLUMN1, COLUMN2, ...
#    having SOME_LOGICAL_EXPRESSION
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Select statements that contain a "group by" MAY ALSO contain an optional "having clause".
# Select statements that do not contain a "group by" may NOT contain "having"
#
# Both "where" and "having" clauses contain logical expression. 
#
# Where an having clauses differ in the following ways:
# 
# 1. "where"  eliminates individual rows from the original table that do not meet the conditions.
#    "having" eliminates entire groups of rows that do not meet its conditions.
#
# 2. "where" may refer to any columns in the original tables, but may NOT refer to aggregate functions.
#    "having" may ONLY refer to aggregate functions and column names that appear in the group by clause.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


# Example

# show all the data from the titles table
sqldf("select * 
       from titles
       order by type")
   title_id                          title_name       type pub_id pages price   sales         pubdate
1       T06                    How About Never?  biography    P01   473 19.95   11320  7/31/2000 0:00
2       T07                   I Blame My Mother  biography    P03   333 23.95 1500200  10/1/1999 0:00
3       T10         Not Without My Fabrerge Egg  biography    P05    NA    NA      NA            <NA>
4       T12           Spontaneous, Not Annoying  biography    P05   507 12.99  100001  8/31/2000 0:00
5       T08        Just Wait Until After School   children    P01    86 10.00    4095   6/1/2001 0:00
6       T09                     Kiss My Boo Boo   children    P01    22 13.95    5000  5/31/2002 0:00
7       T03        Ask Yor System Administrator   computer    P02  1226 39.95   25667   9/1/2000 0:00
8       T01                               1977!    history    P01   107 21.99     566   8/1/2000 0:00
9       T02           200 Years of German Humor    history    P03    14 19.95    9566   4/1/1998 0:00
10      T13 What Are The Civilian Applications?    history    P03   802 29.99   10467  5/31/1999 0:00
11      T04           But I Did It Unconciously psychology    P01   510 12.99   13001  5/31/1999 0:00
12      T05              Exchange of Platitudes psychology    P01   201  6.95  201440   1/1/2001 0:00
13      T11    Perhaps It's a Glandular Problem psychology    P05   826  7.99   94123 11/30/2000 0:00
# Without having
sqldf("select type, avg(price), count(*), min(price), max(price), sum(price)
      from titles
      group by type
      order by type")
        type avg(price) count(*) min(price) max(price) sum(price)
1  biography   18.96333        4      12.99      23.95      56.89
2   children   11.97500        2      10.00      13.95      23.95
3   computer   39.95000        1      39.95      39.95      39.95
4    history   23.97667        3      19.95      29.99      71.93
5 psychology    9.31000        3       6.95      12.99      27.93
# same query WITH THE ADDITION of "having" (note that having must come before "order by")
#
# Notice that the rows for groups that don't satisfy the having conditions 
# are eliminated from the output.
sqldf("select type, avg(price), count(*), min(price), max(price), sum(price)
      from titles
      group by type
      having count(*) >=3 
      order by type")
        type avg(price) count(*) min(price) max(price) sum(price)
1  biography   18.96333        4      12.99      23.95      56.89
2    history   23.97667        3      19.95      29.99      71.93
3 psychology    9.31000        3       6.95      12.99      27.93
# We can see the effect of the having in the above query by running
# the same query WITHOUT the having
sqldf("select type, avg(price), count(*), min(price), max(price), sum(price)
      from titles
      group by type
      order by type")
        type avg(price) count(*) min(price) max(price) sum(price)
1  biography   18.96333        4      12.99      23.95      56.89
2   children   11.97500        2      10.00      13.95      23.95
3   computer   39.95000        1      39.95      39.95      39.95
4    history   23.97667        3      19.95      29.99      71.93
5 psychology    9.31000        3       6.95      12.99      27.93
# The having clause's logical expression can be as complex as you want.
#
# Modify the previous select to show only types of books for which
# - there are at least 3 books of that type
#        and also 
# - the avg price is at least $20 or is less than or equal to  $10

sqldf("select type, avg(price), count(*), min(price), max(price), sum(price)
      from titles
      group by type
      having count(*) >=3 and (avg(price) >= 20 or avg(price) <= 10)
      order by type")
        type avg(price) count(*) min(price) max(price) sum(price)
1    history   23.97667        3      19.95      29.99      71.93
2 psychology    9.31000        3       6.95      12.99      27.93

18.24 using where and having in the same query - a detailed analysis

#############################################################################
# using where and having in the same query
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# When a query runs the processing of the raw data in to the
# final output happens in the following logical order
#
#   - FROM: all of the data from the tables is retrieved and joined into 
#           a single "mega-table" (mega-table is NOT a standard word). 
#
#   - WHERE: the where clause eliminates rows from this "mega-table" that
#            do not satisfy the where's logical expression
#
#   - GROUP BY: the remaining rows a "grouped" by the group by clause
#
#   - HAINVG:   the having clause works to eliminate groups that don't 
#               satisfy the having clause's logical expression
#    
#   - SELECT:   the select clause is used to generate the output of the 
#               aggregate functions 
#
#   - ORDER BY: the resulting rows are ordered according to the "order by" clause
#
#   - LIMIT:    only the rows specified by the limit clause are shown
#
# To help you remember this, the clauses in a select statement must be specified
# in the above order - EXCEPT that the select clause must be listed first. The designers
# of SQL thought that since the select clause specifies which column appear in the 
# output it is important to see it first ...
#
# Notice that the where clause eliminates rows from the original tables
# BEFORE the having clause eliminates groups. This can have an effect on the output
# of queries that use both the where and having clauses.
#
##################################################################################


#----------------------------------------------------
# compare the following queries WITHOUT having
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# The numbers are different in the 2nd query since 
# the where clause eliminated specific rows.
# 
# After the where clause runs there are no childrens books
# left and there are fewer biograph, history and psychology books.
# The min and max page counts also are different to match the 
# books that are left after the where ran.
# The only group that wasn't affected by the where was the
# computer books.
#  
#----------------------------------------------------

# group by ..

sqldf("select type, count(*), count(pages), min(pages), max(pages)
       from titles
       GROUP BY TYPE
       order by type")
        type count(*) count(pages) min(pages) max(pages)
1  biography        4            3        333        507
2   children        2            2         22         86
3   computer        1            1       1226       1226
4    history        3            3         14        802
5 psychology        3            3        201        826
# where ...
# group by ..

sqldf("select type, count(*), count(pages), min(pages), max(pages)
       from titles
       WHERE PAGES > 350
       GROUP BY TYPE
       order by type")
        type count(*) count(pages) min(pages) max(pages)
1  biography        2            2        473        507
2   computer        1            1       1226       1226
3    history        1            1        802        802
4 psychology        2            2        510        826
#------------------------------------------------------------------
# Let's see the actual rows that were affected by the where clause.
# We'll do so by running a similar query WIHTOUT the GROUP BY.
# We will display the actual data for each row.
#-----------------------------------------------------------------

# See the rows that were processed by the 1st query above.

sqldf("select type, pages, title_name
       from titles
       order by type, pages, title_name
      ")
         type pages                          title_name
1   biography    NA         Not Without My Fabrerge Egg
2   biography   333                   I Blame My Mother
3   biography   473                    How About Never?
4   biography   507           Spontaneous, Not Annoying
5    children    22                     Kiss My Boo Boo
6    children    86        Just Wait Until After School
7    computer  1226        Ask Yor System Administrator
8     history    14           200 Years of German Humor
9     history   107                               1977!
10    history   802 What Are The Civilian Applications?
11 psychology   201              Exchange of Platitudes
12 psychology   510           But I Did It Unconciously
13 psychology   826    Perhaps It's a Glandular Problem
# Now let's add in the where clause to see the rows that were
# processed by the 2nd query above.

sqldf("select type, pages, title_name
       from titles
       WHERE PAGES > 350
       order by type, pages, title_name
      ")
        type pages                          title_name
1  biography   473                    How About Never?
2  biography   507           Spontaneous, Not Annoying
3   computer  1226        Ask Yor System Administrator
4    history   802 What Are The Civilian Applications?
5 psychology   510           But I Did It Unconciously
6 psychology   826    Perhaps It's a Glandular Problem
# We can now see why we got the results we did when we 
# ran the versions of the queries with the GROUP BY and 
# the aggregate functions.


#--------------------------------------------------------
# If you use
#     a where clause 
#     and a group by 
#     and a having
#
# The where clause runs then 
# the group by clause runs then
# the having clause runs
#--------------------------------------------------------

# group by ...   (same as the 1st query above)

sqldf("select type, count(*), count(pages), min(pages), max(pages)
       from titles
       GROUP BY TYPE
       order by type")
        type count(*) count(pages) min(pages) max(pages)
1  biography        4            3        333        507
2   children        2            2         22         86
3   computer        1            1       1226       1226
4    history        3            3         14        802
5 psychology        3            3        201        826
# group by ...
# having ...

# having gets rid of groups that don't meet it's criteria
sqldf("select type, count(*), count(pages), min(pages), max(pages)
       from titles
       group by type
       having count(*) > 1
       order by type")
        type count(*) count(pages) min(pages) max(pages)
1  biography        4            3        333        507
2   children        2            2         22         86
3    history        3            3         14        802
4 psychology        3            3        201        826
# ADDING IN A WHERE CHANGES THINGS
# 
# In the following we simply added the line: where pages > 350
# 
# This does two things
#   (a) childrens books don't show up at all - because none are longer than 350 pages
#   (b) only 1 history book shows up - the other 2 history books were shorter then 350 pages


# where ...
# group by ...

sqldf("select type, count(*), count(pages), min(pages), max(pages)
       from titles
       where pages > 350
       group by type
       order by type")
        type count(*) count(pages) min(pages) max(pages)
1  biography        2            2        473        507
2   computer        1            1       1226       1226
3    history        1            1        802        802
4 psychology        2            2        510        826
# NOW - since the where clause was added, the having clause will now
# also eliminate the history group in addition to the computer group
# since now the history group also only has 1 title.
#
# As a result we are only left with biography and psychology groups.


# where ...
# group by ...
# having ...

sqldf("select type, count(*), count(pages), min(pages), max(pages)
       from titles
       where pages > 350
       group by type
       having count(*) > 1
       order by type")
        type count(*) count(pages) min(pages) max(pages)
1  biography        2            2        473        507
2 psychology        2            2        510        826