Housing Prices in College Towns during Recessions

In [2]:
import pandas as pd
import numpy as np
import re
from scipy.stats import ttest_ind
from scipy import stats

Definitions:

  • A quarter is a specific three month period, Q1 is January through March, Q2 is April through June, Q3 is July through September, Q4 is October through December.
  • A recession is defined as starting with two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth.
  • A recession bottom is the quarter within a recession which had the lowest GDP.
  • A university town is a city which has a high percentage of university students compared to the total population of the city.

Hypothesis: University towns have their mean housing prices less affected by recessions. We run a t-test to compare the ratio of the mean price of houses in university towns the quarter before the recession starts compared to the recession bottom. (price_ratio=quarter_before_recession/recession_bottom)

The following data files are used in this project:

  • From the Zillow research data site there is housing data for the United States. In particular the datafile for all homes at a city level, City_Zhvi_AllHomes.csv, has median home sale prices at a fine grained level.
  • From the Wikipedia page on college towns is a list of university towns in the United States which has been copy and pasted into the file university_towns.txt.
  • From Bureau of Economic Analysis, US Department of Commerce, the GDP over time of the United States in current dollars, in quarterly intervals, in the file gdplev.xls. For this project, we only look at GDP data from the first quarter of 2000 onward.

If we read in the data from 'university_towns.txt', we find that state names (like Alabama) are always followed by the string '[edit]'. This allows us to extract which states are in the list with college towns. In this case, all 50 of the states in the text file.

In [17]:
university_towns = open('university_towns.txt').readlines()
university_towns[0:5]
Out[17]:
['Alabama[edit]\n',
 'Auburn (Auburn University)[1]\n',
 'Florence (University of North Alabama)\n',
 'Jacksonville (Jacksonville State University)[2]\n',
 'Livingston (University of West Alabama)[2]\n']

We also notice that college towns are often followed by characters inside parenthesis or brackets. Since the list of college towns was copied and pasted from wikipedia, brackets are usually indices corresponding to references.

In order to increase the chance that each entry in the list of college towns matches exactly with the corresponding item in the list of regions included in our housing data, we will get rid of everything in parenthesis or brackets.

Finally we note that in our list of college towns, one line appears to contain two regions separated by a *

In order to deal with this region, we could use the following line, but in general we note that since our list of college towns probably won't match with the list of regions included in our housing data, it's probably not worth trying to correct.

In [ ]:
#university_towns = ''.join(university_towns).replace('*','\n').split('\n')
In [18]:
states = [line.split('[edit]')[0] for line in university_towns if '[edit]' in line]
university_towns = [ x.split(' (')[0] for x in university_towns]
university_towns[0:5]
Out[18]:
['Alabama[edit]\n', 'Auburn', 'Florence', 'Jacksonville', 'Livingston']

We still need to get rid of the states in our list of college towns. In order to do this, we'll use our list of states and the fact that every state is followed by an '[edit]\n' and split on those characters with re.split

In [19]:
university_towns = '\n'.join(university_towns)
states_pattern = '\[edit\]\n|'.join(states) + '\[edit\]\n'

university_towns = re.split(states_pattern, university_towns)[1:]
university_towns[0:5]
Out[19]:
['\nAuburn\nFlorence\nJacksonville\nLivingston\nMontevallo\nTroy\nTuscaloosa\nTuskegee\n',
 '\nFairbanks\n',
 '\nFlagstaff\nTempe\nTucson\n',
 '\nArkadelphia\nConway\nFayetteville\nJonesboro\nMagnolia\nMonticello\nRussellville\nSearcy\n',
 '\nAngwin\nArcata\nBerkeley\nChico\nClaremont\nCotati\nDavis\nIrvine\nIsla Vista\nUniversity Park, Los Angeles\nMerced\nOrange\nPalo Alto\nPomona\nRedlands\nRiverside\nSacramento\nUniversity District, San Bernardino\nSan Diego\nSan Luis Obispo\nSanta Barbara\nSanta Cruz\nTurlock\nWestwood, Los Angeles\nWhittier\n']

Now each element of our list contains all of the college towns in a state, separated by a '\n'. We still need to format the list of strings above as a data frame, so we will build a list of dictionaries with key value pairs corresponding to (state, regionname)

In [20]:
schools = []

for state in range(len(states)):
    for school in university_towns[state].split('\n'):
        if school != '\n' and school != '':
            schools.append({"State":states[state], "RegionName":school})
df = pd.DataFrame(schools)
df.head()
Out[20]:
RegionName State
0 Auburn Alabama
1 Florence Alabama
2 Jacksonville Alabama
3 Livingston Alabama
4 Montevallo Alabama

The function below implements these operations and returns the raw data in a cleaned data frame format

In [23]:
def get_list_of_university_towns():
    '''Returns a DataFrame of towns and the states they are in from the 
    university_towns.txt list. The format of the DataFrame is:
    DataFrame( [ ["Michigan","Ann Arbor"], ["Michigan", "Yipsilanti"] ], 
    columns=["State","RegionName"]  )'''
    
    file = open('university_towns.txt').readlines()
    #file = ''.join(file).replace('*','\n').split('\n')
    states = [line.split('[edit]')[0] for line in file if '[edit]' in line]
    file = [ x.split(' (')[0] for x in file]

    file = '\n'.join(file) + '\n'
    states_pattern = '\[edit\]\n|'.join(states) + '\[edit\]\n'

    file = re.split(states_pattern, file)[1:]
    schools = []

    for state in range(len(states)):
        for school in file[state].split('\n'):
            if school != '\n' and school != '':
                schools.append({"State":states[state], "RegionName":school})
    df = pd.DataFrame(schools)
    
    return df.iloc[:,::-1]
get_list_of_university_towns().head()
Out[23]:
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo

Recall that a recession is defined as starting with two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth. We'll use our definition of a recession above and data from the Bureau of Economic Analysis to find the quarter and year when the recession starts.

In [25]:
filename = 'gdplev.xls'

economy_df = pd.read_excel(filename)

economy_df.head(n=10)
Out[25]:
Current-Dollar and "Real" Gross Domestic Product Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 2016-09-29 00:00:00
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 Annual NaN NaN NaN Quarterly NaN NaN NaN
2 NaN NaN NaN NaN (Seasonally adjusted annual rates) NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN GDP in billions of current dollars GDP in billions of chained 2009 dollars NaN NaN GDP in billions of current dollars GDP in billions of chained 2009 dollars NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN
7 1929 104.6 1056.6 NaN 1947q1 243.1 1934.5 NaN
8 1930 92.2 966.7 NaN 1947q2 246.3 1932.3 NaN
9 1931 77.4 904.8 NaN 1947q3 250.1 1930.3 NaN

Oops, we notice that there's some header information that we still need to get rid of. So we skip the first 5 rows. We also note that columns 1 through 4 contain GDP information in years, not quarters. We will keep only the information that we need and get rid of the rest.

Finally, we need to rename the columns to something simple and descriptive. We do that below

In [29]:
economy_df = pd.read_excel(filename,skiprows=5).iloc[2:,[4,6]]
    
economy_df = economy_df.rename(index = str, columns = {'Unnamed: 4':'Quarter',
                                             'GDP in billions of chained 2009 dollars.1':'GDP (billions)'})

economy_df.head(5)
Out[29]:
Quarter GDP (billions)
2 1947q1 1934.5
3 1947q2 1932.3
4 1947q3 1930.3
5 1947q4 1960.7
6 1948q1 1989.5

We are restricting ourselves to quarters starting in the year 2000, and we are interested in finding when a recession begins. According to our definition, a recession starts with two consecutive months of decreased GDP. The easiest way to check for this condition is to create new columns representing the GDP one and two quarters in the future.

In [37]:
economy_df = economy_df[economy_df.Quarter >= '2000q1']

economy_df['Next Quarter GDP'] = list(economy_df['GDP (billions)'].iloc[1:]) + [np.NAN]
economy_df['Two Quarters GDP'] = list(economy_df['GDP (billions)'].iloc[2:]) + 2*[np.NAN]

economy_df['Recession Begin'] = (economy_df['GDP (billions)'] > 
                                     economy_df['Next Quarter GDP']) & (
                                    economy_df['Next Quarter GDP'] > 
                                    economy_df['Two Quarters GDP'])

recession_begin = economy_df[economy_df['Recession Begin'] == True]
recession_begin
Out[37]:
Quarter GDP (billions) Next Quarter GDP Two Quarters GDP Recession Begin Last Quarter GDP
247 2008q2 14963.4 14891.6 14577.0 True 14889.5
248 2008q3 14891.6 14577.0 14375.0 True 14963.4
249 2008q4 14577.0 14375.0 14355.6 True 14891.6

So the recession started in '2008q2'. We now check for the end of the recession, or when the GDP increased two quarters in a row following the beginning of the recession

In [45]:
economy_df['Last Quarter GDP'] = [np.NAN] + list(economy_df['GDP (billions)'].iloc[:-1])
    
economy_df['Two Quarters Ago GDP'] = 2*[np.NAN] + list(economy_df['GDP (billions)'].iloc[:-2])
    
economy_df['Recession End'] = (economy_df['GDP (billions)'] > 
                                     economy_df['Last Quarter GDP']) & (
                                    economy_df['Last Quarter GDP'] > 
                                    economy_df['Two Quarters Ago GDP']) & (
                                    economy_df['Quarter'] > recession_begin['Quarter'].values[0])
recession_end = economy_df[(economy_df['Recession End'] == True)]
recession_end['Quarter'].values[0]
Out[45]:
'2009q4'

Finally, recalling that our definition of a recession bottom was the quarter with lowest GDP during the recession

In [50]:
economy_df = economy_df[(economy_df.Quarter >= recession_begin['Quarter'].values[0]) & (economy_df.Quarter <= recession_end['Quarter'].values[0])]
    
recession_bottom = economy_df[economy_df['GDP (billions)'] == economy_df['GDP (billions)'].min()]

recession_bottom['Quarter'].values[0]
Out[50]:
'2009q2'

Putting this together

In [51]:
def get_recession_data():
    '''Returns the year and quarter of the recession start, end, and bottom time as a 
    string value in a format such as 2005q3'''
    
    filename = 'gdplev.xls'
    
    economy_df = pd.read_excel(filename,skiprows=5).iloc[2:,[4,6]]
    
    economy_df = economy_df.rename(index = str, columns = {'Unnamed: 4':'Quarter',
                                             'GDP in billions of chained 2009 dollars.1':'GDP (billions)'})

    economy_df = economy_df[economy_df.Quarter >= '2000q1']

    economy_df['Next Quarter GDP'] = list(economy_df['GDP (billions)'].iloc[1:]) + [np.NAN]
    economy_df['Two Quarters GDP'] = list(economy_df['GDP (billions)'].iloc[2:]) + 2*[np.NAN]

    economy_df['Recession Begin'] = (economy_df['GDP (billions)'] > 
                                     economy_df['Next Quarter GDP']) & (
                                    economy_df['Next Quarter GDP'] > 
                                    economy_df['Two Quarters GDP'])

    recession_begin = economy_df[economy_df['Recession Begin'] == True]

    economy_df['Last Quarter GDP'] = [np.NAN] + list(economy_df['GDP (billions)'].iloc[:-1])
    
    economy_df['Two Quarters Ago GDP'] = 2*[np.NAN] + list(economy_df['GDP (billions)'].iloc[:-2])
    
    economy_df['Recession End'] = (economy_df['GDP (billions)'] > 
                                     economy_df['Last Quarter GDP']) & (
                                    economy_df['Last Quarter GDP'] > 
                                    economy_df['Two Quarters Ago GDP']) & (
                                    economy_df['Quarter'] > recession_begin['Quarter'].values[0])
    recession_end = economy_df[(economy_df['Recession End'] == True)]
    recession_end['Quarter'].values[0]

    economy_df = economy_df[(economy_df.Quarter >= recession_begin['Quarter'].values[0]) & (economy_df.Quarter <= recession_end['Quarter'].values[0])]
    
    recession_bottom = economy_df[economy_df['GDP (billions)'] == economy_df['GDP (billions)'].min()]

    recession_bottom['Quarter'].values[0]
    
    return recession_begin['Quarter'].values[0], recession_end['Quarter'].values[0], recession_bottom['Quarter'].values[0]

get_recession_data()
Out[51]:
('2008q2', '2009q4', '2009q2')

Next, we need to clean up the housing data so that we can determine whether college towns are less affected by recessions than non-college towns. Since the housing data has housing price time series data in months, we need to convert that to quarters. We'll get rid of columns that we don't need in the process

In [64]:
filename = 'City_Zhvi_AllHomes.csv'

housing_df = pd.read_csv(filename)
housing_df.head()
Out[64]:
RegionID RegionName State Metro CountyName SizeRank 1996-04 1996-05 1996-06 1996-07 ... 2015-11 2015-12 2016-01 2016-02 2016-03 2016-04 2016-05 2016-06 2016-07 2016-08
0 6181 New York NY New York Queens 1 NaN NaN NaN NaN ... 573600 576200 578400 582200 588000 592200 592500 590200 588000 586400
1 12447 Los Angeles CA Los Angeles-Long Beach-Anaheim Los Angeles 2 155000.0 154600.0 154400.0 154200.0 ... 558200 560800 562800 565600 569700 574000 577800 580600 583000 585100
2 17426 Chicago IL Chicago Cook 3 109700.0 109400.0 109300.0 109300.0 ... 207800 206900 206200 205800 206200 207300 208200 209100 211000 213000
3 13271 Philadelphia PA Philadelphia Philadelphia 4 50000.0 49900.0 49600.0 49400.0 ... 122300 121600 121800 123300 125200 126400 127000 127400 128300 129100
4 40326 Phoenix AZ Phoenix Maricopa 5 87200.0 87700.0 88200.0 88400.0 ... 183800 185300 186600 188000 189100 190200 191300 192800 194500 195900

5 rows × 251 columns

We load the housing data in. Now we need to rename the columns. Since there aren't many columns, a for loop should suffice despite the fact that for loops are slow in python. Scrolling to the far right, we see that the last available month is August, 2016, so we won't have full data for quarter 3 and we will have no data for quarter 4 of that year.

In [65]:
for year in range(2000,2017):
    for quarter in range(1,5):
        
        if quarter == 4 and year == 2016:
            break
            
        new_column_name = '{0}q{1}'.format(year, quarter)
        begin_month = (quarter-1)*3 + 1
        end_month = quarter*3
        begin_column = '{0}-{1:02d}'.format(year,begin_month)
        end_column = '{0}-{1:02d}'.format(year,end_month)
        
        if quarter == 3 and year == 2016:
            new_column_name = '2016q3'
            begin_month = 6
            end_month = 8
            begin_column = '{0}-{1:02d}'.format(year,begin_month)
            end_column = '{0}-{1:02d}'.format(year,end_month)                
            
        data = housing_df.loc[:,begin_column:end_column]
            
        housing_df[new_column_name] = data.mean(axis = 1)

We will apply a dictionary mapping states to two letter abbreviations, set the index of the housing data to a multiindex involving state and region so we can compare to the college towns in the next part, and take only the portion of the data that is concerned with housing prices in quarters.

In [66]:
states = {'OH': 'Ohio', 'KY': 'Kentucky', 'AS': 'American Samoa', 'NV': 'Nevada', 
          'WY': 'Wyoming', 'NA': 'National', 'AL': 'Alabama', 'MD': 'Maryland', 
          'AK': 'Alaska', 'UT': 'Utah', 'OR': 'Oregon', 'MT': 'Montana', 'IL': 'Illinois', 
          'TN': 'Tennessee', 'DC': 'District of Columbia', 'VT': 'Vermont', 'ID': 'Idaho', 
          'AR': 'Arkansas', 'ME': 'Maine', 'WA': 'Washington', 'HI': 'Hawaii', 
          'WI': 'Wisconsin', 'MI': 'Michigan', 'IN': 'Indiana', 'NJ': 'New Jersey', 
          'AZ': 'Arizona', 'GU': 'Guam', 'MS': 'Mississippi', 'PR': 'Puerto Rico', 
          'NC': 'North Carolina', 'TX': 'Texas', 'SD': 'South Dakota', 
          'MP': 'Northern Mariana Islands', 'IA': 'Iowa', 'MO': 'Missouri', 
          'CT': 'Connecticut', 'WV': 'West Virginia', 'SC': 'South Carolina', 
          'LA': 'Louisiana', 'KS': 'Kansas', 'NY': 'New York', 'NE': 'Nebraska', 
          'OK': 'Oklahoma', 'FL': 'Florida', 'CA': 'California', 'CO': 'Colorado', 
          'PA': 'Pennsylvania', 'DE': 'Delaware', 'NM': 'New Mexico', 'RI': 'Rhode Island', 
          'MN': 'Minnesota', 'VI': 'Virgin Islands', 'NH': 'New Hampshire', 
          'MA': 'Massachusetts', 'GA': 'Georgia', 'ND': 'North Dakota', 'VA': 'Virginia'}
In [67]:
housing_df['State'] = housing_df['State'].apply(lambda x: states[x])  
housing_df = housing_df.set_index(['State','RegionName']) 
    
begin = housing_df.columns.get_loc('1996-04')
end = housing_df.columns.get_loc('2016-08')
    
housing_df.drop(housing_df.columns[begin:end+1], axis=1, inplace = True)
housing_df.drop(housing_df.columns[0:4], axis=1, inplace = True)
housing_df.head()
Out[67]:
2000q1 2000q2 2000q3 2000q4 2001q1 2001q2 2001q3 2001q4 2002q1 2002q2 ... 2014q2 2014q3 2014q4 2015q1 2015q2 2015q3 2015q4 2016q1 2016q2 2016q3
State RegionName
New York New York NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 515466.666667 522800.000000 528066.666667 532266.666667 540800.000000 557200.000000 572833.333333 582866.666667 591633.333333 588200.000000
California Los Angeles 207066.666667 214466.666667 220966.666667 226166.666667 233000.000000 239100.000000 245066.666667 253033.333333 261966.666667 272700.000000 ... 498033.333333 509066.666667 518866.666667 528800.000000 538166.666667 547266.666667 557733.333333 566033.333333 577466.666667 582900.000000
Illinois Chicago 138400.000000 143633.333333 147866.666667 152133.333333 156933.333333 161800.000000 166400.000000 170433.333333 175500.000000 177566.666667 ... 192633.333333 195766.666667 201266.666667 201066.666667 206033.333333 208300.000000 207900.000000 206066.666667 208200.000000 211033.333333
Pennsylvania Philadelphia 53000.000000 53633.333333 54133.333333 54700.000000 55333.333333 55533.333333 56266.666667 57533.333333 59133.333333 60733.333333 ... 113733.333333 115300.000000 115666.666667 116200.000000 117966.666667 121233.333333 122200.000000 123433.333333 126933.333333 128266.666667
Arizona Phoenix 111833.333333 114366.666667 116000.000000 117400.000000 119600.000000 121566.666667 122700.000000 124300.000000 126533.333333 128366.666667 ... 164266.666667 165366.666667 168500.000000 171533.333333 174166.666667 179066.666667 183833.333333 187900.000000 191433.333333 194400.000000

5 rows × 67 columns

Putting this all together, we get

In [69]:
def convert_housing_data_to_quarters():
    '''Converts the housing data to quarters and returns it as mean 
    values in a dataframe. This dataframe should be a dataframe with
    columns for 2000q1 through 2016q3, and should have a multi-index
    in the shape of ["State","RegionName"].
    
    Note: Quarters are defined above, they are not arbitrary three 
    month periods.
    
    The resulting dataframe should have 67 columns, and 10,730 rows.
    '''
    
    filename = 'City_Zhvi_AllHomes.csv'
    
    housing_df = pd.read_csv(filename)
    
    for year in range(2000,2017):
        for quarter in range(1,5):
            
            if quarter == 4 and year == 2016:
                break
            
            new_column_name = '{0}q{1}'.format(year, quarter)
            begin_month = (quarter-1)*3 + 1
            end_month = quarter*3
            begin_column = '{0}-{1:02d}'.format(year,begin_month)
            end_column = '{0}-{1:02d}'.format(year,end_month)
            
            if quarter == 3 and year == 2016:
                new_column_name = '2016q3'
                begin_month = 6
                end_month = 8
                begin_column = '{0}-{1:02d}'.format(year,begin_month)
                end_column = '{0}-{1:02d}'.format(year,end_month)                
            
            data = housing_df.loc[:,begin_column:end_column]
            
            housing_df[new_column_name] = data.mean(axis = 1)
    housing_df['State'] = housing_df['State'].apply(lambda x: states[x])  
    housing_df = housing_df.set_index(['State','RegionName']) 
    
    begin = housing_df.columns.get_loc('1996-04')
    end = housing_df.columns.get_loc('2016-08')
    
    housing_df.drop(housing_df.columns[begin:end+1], axis=1, inplace = True)
    housing_df.drop(housing_df.columns[0:4], axis=1, inplace = True)
    #housing_df.drop(housing_df.columns['1996-04':'2016-08'], inplace=True)
    return housing_df
convert_housing_data_to_quarters().head()
Out[69]:
2000q1 2000q2 2000q3 2000q4 2001q1 2001q2 2001q3 2001q4 2002q1 2002q2 ... 2014q2 2014q3 2014q4 2015q1 2015q2 2015q3 2015q4 2016q1 2016q2 2016q3
State RegionName
New York New York NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 515466.666667 522800.000000 528066.666667 532266.666667 540800.000000 557200.000000 572833.333333 582866.666667 591633.333333 588200.000000
California Los Angeles 207066.666667 214466.666667 220966.666667 226166.666667 233000.000000 239100.000000 245066.666667 253033.333333 261966.666667 272700.000000 ... 498033.333333 509066.666667 518866.666667 528800.000000 538166.666667 547266.666667 557733.333333 566033.333333 577466.666667 582900.000000
Illinois Chicago 138400.000000 143633.333333 147866.666667 152133.333333 156933.333333 161800.000000 166400.000000 170433.333333 175500.000000 177566.666667 ... 192633.333333 195766.666667 201266.666667 201066.666667 206033.333333 208300.000000 207900.000000 206066.666667 208200.000000 211033.333333
Pennsylvania Philadelphia 53000.000000 53633.333333 54133.333333 54700.000000 55333.333333 55533.333333 56266.666667 57533.333333 59133.333333 60733.333333 ... 113733.333333 115300.000000 115666.666667 116200.000000 117966.666667 121233.333333 122200.000000 123433.333333 126933.333333 128266.666667
Arizona Phoenix 111833.333333 114366.666667 116000.000000 117400.000000 119600.000000 121566.666667 122700.000000 124300.000000 126533.333333 128366.666667 ... 164266.666667 165366.666667 168500.000000 171533.333333 174166.666667 179066.666667 183833.333333 187900.000000 191433.333333 194400.000000

5 rows × 67 columns

We are nearly ready to test whether there is a significant difference between college towns and other towns during a recession. First, we load all the necessary data.

In [73]:
recession_begin, recession_end, recession_bottom = get_recession_data()
    
university_towns = get_list_of_university_towns()
    
housing_df = convert_housing_data_to_quarters()

Next, we get housing prices at the beginning and end of the recession. We will test to see if housing prices have a smaller market loss during recessions. To check for this, we will use the ratio of the prices at the beginning and end of the recession. The higher this ratio, the worse, in terms of how much housing values fall.

In [75]:
prices_begin = housing_df[recession_begin]
prices_end = housing_df[recession_end]
housing_df['ratio'] = prices_begin/prices_end

We set a multiindex of our data frame of university towns as 'State', 'RegionName' to match with the multiindex of our housing prices data frame.

In [77]:
university_towns = university_towns.set_index(['State', 'RegionName'])

In order to conduct the t test, we need to drop anything from the table that is not a number so the ttest function works right

In [78]:
ratio_college = housing_df.loc[list(university_towns.index)]['ratio'].dropna()

We can use the complement of the indices which are university towns to get the indices of non-university towns

In [79]:
ratio_not_college_indices = set(housing_df.index) - set(ratio_college.index)
    
ratio_not_college = housing_df.loc[list(ratio_not_college_indices),:]['ratio'].dropna()

Now we can do our t-test. We collect our statistic and p-value, then we evaluate whether university towns or non-university towns are better off during recessions

In [80]:
statistic, p_value = tuple(stats.ttest_ind(ratio_college, ratio_not_college))

outcome = statistic < 0
    
different = p_value < 0.05
    
better = ["non-university town", "university town"]
    
(different, p_value, better[outcome])
Out[80]:
(True, 0.0013928955877980315, 'university town')

So we conclude there is a significant difference between market loss in university towns and non-university towns, with p-value = 0.0014 < 0.05.

We can put all of this together in the function below

In [81]:
def run_ttest():
    '''First creates new data showing the decline or growth of housing prices
    between the recession start and the recession bottom. Then runs a ttest
    comparing the university town values to the non-university towns values, 
    return whether the alternative hypothesis (that the two groups are the same)
    is true or not as well as the p-value of the confidence. 
    
    Return the tuple (different, p, better) where different=True if the t-test is
    True at a p<0.01 (we reject the null hypothesis), or different=False if 
    otherwise (we cannot reject the null hypothesis). The variable p should
    be equal to the exact p value returned from scipy.stats.ttest_ind(). The
    value for better should be either "university town" or "non-university town"
    depending on which has a lower mean price ratio (which is equivilent to a
    reduced market loss).'''
    
    recession_begin, recession_end, recession_bottom = get_recession_data()
    
    university_towns = get_list_of_university_towns()
    
    housing_df = convert_housing_data_to_quarters()
    
    prices_begin = housing_df[recession_begin]
    prices_end = housing_df[recession_end]
    
    housing_df['ratio'] = prices_begin/prices_end #The higher this ratio, the worse
    
    university_towns = university_towns.set_index(['State', 'RegionName'])
    
    ratio_college = housing_df.loc[list(university_towns.index)]['ratio'].dropna()
    
    ratio_not_college_indices = set(housing_df.index) - set(ratio_college.index)
    
    ratio_not_college = housing_df.loc[list(ratio_not_college_indices),:]['ratio'].dropna()
        
    statistic, p_value = tuple(stats.ttest_ind(ratio_college, ratio_not_college))
    
    outcome = statistic < 0
    
    different = p_value < 0.05
    
    better = ["non-university town", "university town"]
    
    return (different, p_value, better[outcome])

run_ttest()
Out[81]:
(True, 0.0013928955877980315, 'university town')