32  Where to find datasets

Note that the following was compiled circa Apr 16, 2024.

Many sources of data can be found online. These resources are constantly being updated, and new ones keep becoming available. The following are just some ideas to get you started with your search for data.

32.1 Datasets in R packages

There are many R packages that come with datasets. By “dataset” I mean any data that is referenced in an R variable. This can be a dataframe, a vector, a list, etc. Below are just a few example R packages that contain datasets. There are many others. Also make sure to see the section below about R packages that offer access to APIs.

Package name    Description
nycflights13    Data about all flights that departed from NYC airports in 2013.
Lahman          Extensive data about baseball teams, players, etc.
babynames       Data provided by the US Social Security Administration about births from 1880 through 2017.
stringr         Contains the vectors fruit, sentences and words that the help uses in examples.
etc. …
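
As a quick illustration of the point that a “dataset” can be a dataframe, a vector or a list, here is a minimal sketch. It assumes only the datasets package, which is installed with R itself and is not one of the packages in the table above. The variable name dataset_classes is made up for this example.

```r
# The datasets package is installed with R itself, so nothing
# needs to be installed for this example.

class(datasets::mtcars)        # a dataframe
class(datasets::state.name)    # a character vector
class(datasets::state.center)  # a list

# Collect the classes in one named character vector
# (dataset_classes is just an illustrative name)
dataset_classes <- c(
  mtcars       = class(datasets::mtcars),
  state.name   = class(datasets::state.name),
  state.center = class(datasets::state.center)
)
dataset_classes
```

The same class() check works on a variable from any other package, which is a handy first step before deciding how to work with the data.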

To use the data in a package you should first install the package. As an example we’ll use the “nycflights13” package, which contains data about all airline flights that departed from New York City airports in 2013.

# Install the package if it is missing, then load it
if (!require("nycflights13")) { install.packages("nycflights13"); library(nycflights13) }

We can now view which datasets are in the package with the following command:

data(package="nycflights13")

To see the list of datasets in ALL packages that you have installed, you can use the following command:

data(package=rownames(installed.packages()))
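
The listing that data() displays can also be captured and searched. The following is a rough sketch that again uses the built-in datasets package so that it runs without installing anything. The names listing and dataset_info, and the search word "airline", are chosen only for illustration.

```r
# data() returns an object whose "results" element is a character matrix
# with the columns Package, LibPath, Item and Title.
listing <- data(package = "datasets")$results

# Keep just the dataset name and its one-line description
dataset_info <- as.data.frame(listing[, c("Item", "Title")])

nrow(dataset_info)    # how many datasets the package contains
head(dataset_info)    # the first few names and descriptions

# Search the descriptions for a keyword, ignoring upper/lower case
dataset_info[grepl("airline", dataset_info$Title, ignore.case = TRUE), ]
```

Replacing "datasets" with the name of any other installed package, such as "nycflights13", should work the same way.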

To access the data you can use the syntax PACKAGE_NAME::VARIABLE_NAME. Alternatively you can load the package by using the library or require functions and then access the data by just using the variable name.

# This will work after you've installed the nycflights13 package
# even before you've loaded it.

head(nycflights13::flights)   # show the first few rows
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

# Alternatively, if you have already loaded the package with the
# library or require functions (which we did above), you can access
# the flights variable without specifying nycflights13::

head(flights)   # show the first few rows
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
  • Summary of the commands for the nycflights13 package:

      install.packages("nycflights13")   # install the package
      library(nycflights13)              # load the package
      help(package="nycflights13")       # see what is included in the package

32.2 Project Gutenberg

https://www.gutenberg.org/

This website hosts the complete text of many books whose copyright has expired. If you are planning to work with “text” files, then you should search for the “UTF-8”, “ASCII” or “Plain Text” versions.
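
Once you have a plain text version of a book, one way to bring it into R is to read it line by line. The sketch below is only an illustration: read_book is a hypothetical helper name, and a small temporary file stands in for a downloaded book so that the example runs offline.

```r
# Read a plain text file into a character vector, one element per line.
# read_book is a hypothetical helper name, not part of any package.
read_book <- function(path_or_url) {
  readLines(path_or_url, encoding = "UTF-8", warn = FALSE)
}

# Stand-in for a downloaded book so that the example runs offline
book_file <- tempfile(fileext = ".txt")
writeLines(c("Chapter 1", "", "It was a dark and stormy night."), book_file)

book <- read_book(book_file)
length(book)              # number of lines in the file
book[nchar(book) > 0]     # drop the blank lines

# For a real book, pass the path of the file you downloaded, or the
# address of the plain text file copied from the book's page:
# book <- read_book("<path or address of the plain text file>")
```

Reading the text as a character vector keeps every line intact, so functions for working with strings can be applied to it afterwards.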

32.3 Search for “Open Data”

https://www.google.com/search?q=open+data

Search for “open data” and many websites, especially government-run ones, will come up.

32.5 Kaggle

https://www.kaggle.com/datasets

This is a popular website for machine learning and data science competitions, projects and tutorials. Many datasets are hosted on this platform.
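
Assuming the dataset you download is a CSV file (a common format, though not the only one you will come across), it can be read into a dataframe with read.csv. In the sketch below the file name, the column names and the values are all made up so that the example runs on its own. With a real download you would point csv_file at the file you saved.

```r
# Stand-in for a downloaded CSV file; the file name and its contents
# are made up so that the example runs on its own.
csv_file <- file.path(tempdir(), "example_dataset.csv")
writeLines(c("id,city,temperature",
             "1,Boston,71.5",
             "2,Chicago,68.0",
             "3,Denver,NA"), csv_file)

# Read the file into a dataframe
example_data <- read.csv(csv_file, stringsAsFactors = FALSE)

str(example_data)                  # column names and types
summary(example_data$temperature)  # numeric summary, reports the NA
colSums(is.na(example_data))       # missing values per column
```

Checking the column types and the number of missing values right after reading a file is a cheap way to catch problems before any analysis starts.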