if(!require("nycflights13")){install.packages("nycflights13");require("nycflights13");}
32 Where to find datasets
Note that the following was compiled circa Apr 16, 2024.
Many sources of data can be found online. These resources are updated frequently, and new ones become available all the time. The following are just some ideas to get you started with your search for data.
32.1 Datasets in R packages
There are many R packages that come with datasets. By “dataset” I mean any data that is referenced in an R variable. This can be a dataframe, a vector, a list, etc. Below are just a few example R packages that contain datasets. There are many others. Also make sure to see the section below about R packages that offer access to APIs.
Package name | Description |
---|---|
nycflights13 | Data about all flights that departed from NYC airports in 2013. |
Lahman | Extensive data about baseball teams, players, etc. |
babynames | Data provided by the US Social Security Administration about births from 1880 through 2017. |
stringr | Contains the vectors fruit, sentences, and words, which the help pages use in examples. |
etc … | … |
To use the data in a package you must first install the package. As an example we’ll use the “nycflights13” package, which contains data about all airline flights that departed from New York City airports in 2013.
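Installing a package every time a script runs is wasteful, so it is common to install only when the package is missing. The following is a minimal sketch of that idea. `ensure_package` and its `installer` parameter are hypothetical names introduced here for illustration; they are not part of any package.

```r
# Hypothetical helper: install a package only when it is missing.
# `installer` is a parameter so the install step can be swapped out.
ensure_package <- function(pkg, installer = utils::install.packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    installer(pkg)
  }
  # Report whether the package is available after the attempt
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call never triggers an installation
ensure_package("stats")

# For the package used in this section you would run:
# ensure_package("nycflights13")
# library(nycflights13)
```

`requireNamespace` only checks that the package can be loaded; it does not attach it, which is why the `library` call is still shown separately.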
We can now view which datasets are in the package with the following command:
data(package="nycflights13")
To see the list of datasets in ALL packages that you have installed, you can use the following command:
data(package=rownames(installed.packages()))
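The listing produced by `data()` opens in a viewer, which is awkward to search. As a rough sketch, the object that `data()` returns can be converted to a data frame instead; its `results` component is a character matrix with `Package`, `Item`, and `Title` columns. The example uses the built-in `datasets` package so that it runs without installing anything, and `catalog` is just an illustrative variable name.

```r
# Capture the listing instead of letting it open in a viewer
info <- data(package = "datasets")

# Keep the most useful columns as a regular data frame
catalog <- as.data.frame(
  info$results[, c("Package", "Item", "Title")],
  stringsAsFactors = FALSE
)

head(catalog)

# Search the titles for a keyword
catalog[grepl("car", catalog$Title, ignore.case = TRUE), ]
```

The same approach should work with `package = "nycflights13"` or with the vector of all installed packages shown above.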
To access the data you can use the syntax PACKAGE_NAME::VARIABLE_NAME. Alternatively you can load the package by using the library or require functions and then access the data by just using the variable name.
# This will work after you've installed the nycflights13 package
# even before you've loaded it.
head(nycflights13::flights) # show the first few rows
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# Alternatively, if you have already loaded the package with the
# library or require functions (which we did above), you can access
# the flights variable without specifying nycflights13::
head(flights) # show the first few rows
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
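Besides `::` and `library`, the `data()` function can also copy a named dataset into an environment of your choice. The sketch below compares the two approaches. It uses `mtcars` from the built-in `datasets` package as a stand-in so that it runs without installing anything; `scratch` and `cars_direct` are illustrative names.

```r
# `::` reaches into a package without attaching it to the search path
cars_direct <- datasets::mtcars

# data() copies a named dataset into the environment given by `envir`
scratch <- new.env()
data("mtcars", package = "datasets", envir = scratch)

# Both routes give a data frame with the same shape
dim(cars_direct)
dim(scratch$mtcars)
```

Assuming nycflights13 is installed, `data("flights", package = "nycflights13")` should load the flights data into the global environment in the same way.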
nycflights13 package
install.packages("nycflights13") # install the package
library(nycflights13) # load the package
help(package="nycflights13") # see what is included in the package
32.2 Project Gutenberg
Project Gutenberg (https://www.gutenberg.org/) hosts the complete text of many books whose copyright has expired. If you are planning to work with “text” files, you should search for the “UTF-8”, “ASCII”, or “Plain Text” versions.
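Once a plain-text version has been saved locally, base R can read it line by line. The sketch below uses a small temporary file with placeholder text as a stand-in for a downloaded book; no particular Project Gutenberg file or URL is assumed, and `path`, `book`, `words`, and `word_counts` are illustrative names.

```r
# Stand-in for a downloaded plain-text book: a small temporary file
path <- tempfile(fileext = ".txt")
writeLines(c("The quick brown fox jumps over the lazy dog.",
             "The dog sleeps."), path)

# Read the file one line per element, marking the text as UTF-8
book <- readLines(path, encoding = "UTF-8")

# Split into lowercase words and drop empty strings
words <- unlist(strsplit(tolower(book), "[^a-z']+"))
words <- words[nzchar(words)]

# Count how often each word appears, most frequent first
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts)
```

For a real book, replace `path` with the location of the file you downloaded.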
32.3 Search for “Open Data”
https://www.google.com/search?q=open+data
Search for “open data” and many websites, especially government-run sites, will come up. For example:
https://data.gov/ - US Government Open Data
https://data.ny.gov/ - New York State
https://opendata.cityofnewyork.us/ - New York City
many, many others …
32.4 Google dataset search
Google search engine for datasets: https://datasetsearch.research.google.com/
32.5 Kaggle
https://www.kaggle.com/datasets
This is a popular website for machine learning and data science competitions, projects, and tutorials. Many datasets are hosted on this platform.