STAT 19000: Project 11 — Fall 2020
Motivation: The ability to understand a problem, know what tools are available to you, and select the right tools to get the job done, takes practice. In this project we will use what you’ve learned so far this semester to solve data-driven problems. In previous projects, we’ve directed you towards certain tools. In this project, there will be less direction, and you will have the freedom to choose the tools you’d like.
Context: You’ve learned lots this semester about the R environment. You now have experience using a very balanced "portfolio" of R tools. We will practice using these tools on a set of economic data from Zillow.
Scope: R
Questions
Question 1
Read /class/datamine/data/zillow/Zip_time_series.csv into a data.frame called zipc. Look at the RegionName column. It is supposed to be a 5-digit zip code. Either fix the column by writing a function and applying it to the column, or take the time to read the read.csv documentation by running ?read.csv and use an argument to make sure that column is not read in as an integer (which is why zip codes starting with 0 lose the leading 0 when being read in).
This video demonstrates how to read in data and respect the leading zeroes.
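As a minimal sketch of the two approaches described in Question 1, the example below reads a small made-up CSV string instead of the Zillow file, so it runs anywhere. The colClasses argument is documented in ?read.csv; the toy data and the names csv_text, zip_default, zip_fixed, and padded are illustrative assumptions, not part of the project.

```r
# Toy CSV text standing in for the real file (hypothetical values).
csv_text <- "Date,RegionName,MedianListingPrice_AllHomes
2017-01-31,01001,210000
2017-01-31,46204,150000
"

# Default behavior: RegionName is guessed to be an integer,
# so the leading zero in 01001 is dropped.
zip_default <- read.csv(text = csv_text)
zip_default$RegionName

# Approach 1: tell read.csv to keep RegionName as character.
zip_fixed <- read.csv(text = csv_text,
                      colClasses = c(RegionName = "character"))
zip_fixed$RegionName

# Approach 2: repair an integer column after the fact by
# left-padding with zeros to a width of 5.
padded <- sprintf("%05d", zip_default$RegionName)
padded
```

With the real file, the same colClasses argument can be passed alongside the file path; the padding approach is the kind of function you could write and apply to the column yourself.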
- R code used to solve the problem.
- head of the RegionName column.
Question 2
One might assume that the owner of a house tends to value that house more than the buyer. If that were the case, perhaps the median listing price (the price at which the seller puts the house on the market, or ask price) would be higher than the ZHVI (Zillow Home Value Index, essentially an estimate of the home value). For those rows where both MedianListingPrice_AllHomes and ZHVI_AllHomes have non-NA values, on average how much higher or lower is the median listing price? Can you think of any other reasons why this may be?
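One possible way to restrict a comparison to rows where both columns are present is sketched below on a tiny made-up data.frame. The values and the names toy, both, and avg_diff are assumptions for illustration only; the column names match the ones in the question.

```r
# Toy data with some missing values (hypothetical numbers).
toy <- data.frame(
  MedianListingPrice_AllHomes = c(200000, NA, 310000, 150000),
  ZHVI_AllHomes               = c(190000, 250000, NA, 155000)
)

# TRUE only for rows where both columns are non-NA.
both <- complete.cases(
  toy[, c("MedianListingPrice_AllHomes", "ZHVI_AllHomes")]
)

# Average of (listing price - ZHVI) over those rows.
# A positive value means listing prices are higher on average.
avg_diff <- mean(
  toy$MedianListingPrice_AllHomes[both] - toy$ZHVI_AllHomes[both]
)
avg_diff
```

An equivalent filter is !is.na(x) & !is.na(y); complete.cases is used here only because it scales to more than two columns.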
- R code used to solve the problem.
- The result itself and 1-2 sentences talking about whether or not you can think of any other reasons that may explain the result.
Question 3
Convert the Date column to a date using as.Date. How many years of data do we have in this dataset? Create a line plot with lines for the average MedianListingPrice_AllHomes and average ZHVI_AllHomes by year. The result should be a single plot with multiple lines on it.
Here we give two videos to help you with this question. The first video gives some examples about working with dates in R.
This second video gives an example about how to plot two line graphs at the same time in R.
For a nice addition, add a dotted vertical line on year 2008 near the housing crisis:
abline(v="2008", lty="dotted")
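The steps in Question 3 (convert dates, extract the year, average by year, draw two lines on one plot) can be sketched as follows. This is only one of several reasonable approaches, shown on made-up data; the names toy, n_years, avg_listing, avg_zhvi, and years are assumptions. Because the x-axis here holds numeric years, the vertical line is drawn with a numeric v.

```r
# Toy data (hypothetical values) with dates stored as text.
toy <- data.frame(
  Date = c("2007-06-30", "2007-12-31", "2008-06-30",
           "2008-12-31", "2009-06-30"),
  MedianListingPrice_AllHomes = c(210000, 220000, 200000, NA, 180000),
  ZHVI_AllHomes = c(200000, 205000, 195000, 185000, 175000),
  stringsAsFactors = FALSE
)

# Convert to Date (the default format is year-month-day).
toy$Date <- as.Date(toy$Date)

# Pull out the year and count how many distinct years appear.
toy$year <- as.integer(format(toy$Date, "%Y"))
n_years <- length(unique(toy$year))

# Average each column within each year, ignoring NA values.
avg_listing <- tapply(toy$MedianListingPrice_AllHomes, toy$year,
                      mean, na.rm = TRUE)
avg_zhvi <- tapply(toy$ZHVI_AllHomes, toy$year, mean, na.rm = TRUE)
years <- as.integer(names(avg_listing))

# First line with plot(), second line added with lines().
plot(years, avg_listing, type = "l", col = "blue",
     ylim = range(c(avg_listing, avg_zhvi), na.rm = TRUE),
     xlab = "Year", ylab = "Average price")
lines(years, avg_zhvi, col = "red")
abline(v = 2008, lty = "dotted")
legend("topright", legend = c("Median listing price", "ZHVI"),
       col = c("blue", "red"), lty = 1)
```

Setting ylim from both series keeps the second line from being clipped when its values fall outside the range of the first.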
- R code used to solve the problem.
- The results of running the code.
Question 4
Read /class/datamine/data/zillow/State_time_series.csv into a data.frame called states. Calculate the average median listing price by state, and create a map using plot_usmap from the usmap package that shows the average median price by state.
We give a full example about how to plot values, by State, on a map.
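A rough sketch of the aggregate-then-map pattern is shown below on made-up data. It assumes the state identifier is a name or abbreviation that usmap recognizes and that the data passed to plot_usmap has a column named state; the toy values and the names toy, state_avg, and avg_listing are illustrative. The map is drawn only if the usmap package is installed.

```r
# Toy data (hypothetical values), several rows per state.
toy <- data.frame(
  RegionName = c("Indiana", "Indiana", "Ohio", "Ohio", "Illinois"),
  MedianListingPrice_AllHomes = c(150000, 170000, 140000, NA, 220000),
  stringsAsFactors = FALSE
)

# Average listing price per state. The formula interface of
# aggregate() drops rows with NA by default.
state_avg <- aggregate(MedianListingPrice_AllHomes ~ RegionName,
                       data = toy, FUN = mean)

# plot_usmap() looks for a column named "state" (or "fips").
names(state_avg) <- c("state", "avg_listing")
state_avg

# Draw the map only when usmap is available.
if (requireNamespace("usmap", quietly = TRUE)) {
  print(usmap::plot_usmap(data = state_avg, values = "avg_listing"))
}
```

States that are missing from the data are simply drawn without a fill value, so a partial data.frame like this one still produces a full map.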
In order for |
- R code used to solve the problem.
- The resulting map.
Question 5
Read /class/datamine/data/zillow/County_time_series.csv into a data.frame named counties. Choose a state (or states) that you would like to "dig down" into county-level data for, and create a plot (or plots) like in Question 4 that show some interesting statistic by county. You can choose average median listing price if you so desire; however, you don't need to! There are other cool data!
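For county-level maps, plot_usmap accepts regions = "counties" and an include argument to limit the map to chosen states, and it matches counties through a column named fips holding 5-digit codes as text. The sketch below assumes you already have such a column; the toy values and the names toy, county_avg, and avg_listing are illustrative, and Indiana is an arbitrary choice of state.

```r
# Toy data (hypothetical prices) keyed by 5-digit county FIPS codes:
# 18157 is Tippecanoe County, IN and 18097 is Marion County, IN.
toy <- data.frame(
  fips = c("18157", "18157", "18097", "18097"),
  MedianListingPrice_AllHomes = c(160000, 180000, 140000, 150000),
  stringsAsFactors = FALSE
)

# Average listing price per county.
county_avg <- aggregate(MedianListingPrice_AllHomes ~ fips,
                        data = toy, FUN = mean)
names(county_avg)[2] <- "avg_listing"
county_avg

# Draw only Indiana's counties, and only when usmap is available.
if (requireNamespace("usmap", quietly = TRUE)) {
  print(usmap::plot_usmap(regions = "counties", include = "IN",
                          data = county_avg, values = "avg_listing"))
}
```

If the county codes were read in as integers, they may have lost leading zeros in the same way as the zip codes in Question 1, so keeping them as character matters here too.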
Make sure that you remember to aggregate your data by |
If you get Question 4 working correctly, here are the main differences for Question 5. You need the |
- R code used to solve the problem.
- The resulting map.