Exploring the BRFSS data


BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. The BRFSS was initiated in 1984, with 15 states collecting surveillance data on risk behaviors through monthly telephone interviews.

The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.


Load packages
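
The package-loading chunk is missing from this copy. The code further down uses the `%>%` pipe, verbs such as filter, mutate and summarise, and ggplot, so the following is an assumption about what was loaded rather than the original chunk:

```r
# Assumed packages, inferred from the functions used later in the analysis
library(dplyr)    # %>%, filter(), mutate(), group_by(), summarise(), if_else()
library(ggplot2)  # ggplot(), aes(), geom_smooth()
```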


Load data

First, we need to load the data. Change the working directory to the folder that contains the behavioral risk factor dataset (“brfss2013.RData”). Then run the following command:
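
The command itself did not survive in this copy. A minimal sketch of what it presumably looked like, assuming the file sits in the current working directory and contains the brfss2013 data frame used later on:

```r
# Assumed: the file sits in the current working directory.
data_file <- "brfss2013.RData"

if (file.exists(data_file)) {
  # load() restores the objects saved in the file and returns their names
  loaded <- load(data_file)
  print(loaded)
} else {
  message("File not found: ", data_file, " (check getwd())")
}
```

The file.exists() guard is only there so the sketch fails gently; a bare load("brfss2013.RData") does the same job once the working directory is set.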


Once the data is loaded, we can check its dimensions with dim(brfss2013):

## [1] 491775    330

Since the data set is large (491,775 rows and 330 columns), we will refer to the provided codebook, rather than inspect the raw data, to determine research questions.

Is the Data Randomly Sampled?

It’s important to note how the observations in the samples we use to answer our research questions are collected. This information can be found on the CDC website, where it is stated:

“Disproportionate stratified sampling (DSS) has been used for the landline sample since 2003.

“The cellular telephone sample is randomly generated from a sampling frame of confirmed cellular area code and prefix combinations.”

Data collected in 1984 contains only information from 15 states collected from monthly telephone interviews. By 2001, all 50 states and the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands were participating in the BRFSS. Data collected before 2011 is collected from landline telephones, while information afterwards includes cellular users, which tends to include more young people. Thus, information collected before 2011 has certain limitations. We can only generalize the information to specific states and demographics using data over these time periods. Additionally, if we were interested in generalizing about institutionalized people, we would be unable to do this since the data does not sample people from this group.

For landlines, interviewers collect data from a randomly selected adult in a household. This means that whoever answers the telephone in a given household does not necessarily answer the survey questions. This measure helps to ensure that the data is randomly sampled as much as possible. For cellular phones, any adult who participates resides in a private residence or college housing.

Is there Random Assignment?

No. The BRFSS is an observational, cross-sectional survey: respondents are asked about their behavior, and nobody is assigned to a treatment or condition. A longitudinal design would at least allow reasoning about how behavior changes over time, and whether causation can be inferred from cross-sectional studies is somewhat controversial, but I will assume the standard view. Because no experiment was performed, conclusions drawn from the data can sometimes be generalized, since respondents are randomly sampled subject to a few constraints, but they describe associations, not causal relationships.

Research questions

Research question 1:

Is there an association between alcohol consumption and access to health care? Strictly speaking, this is a question about whether alcohol consumption is associated with home ownership and medical coverage. However, alcohol consumption is plausibly associated with both wealth and poverty. Wealthy people may be able to afford alcohol and may have more leisure time in which to drink. Alcohol and poverty may also be associated through binge and heavy drinking. Access to health care through Medicaid, or a lack of coverage, may indicate poverty, as may an inability to get medicine because of cost or having medical bills.

Relevant variables for assessing wealth

through health coverage:

hlthpln1: Have Any Health Care Coverage
medcost: Could Not See Dr. Because Of Cost
checkup1: Length Of Time Since Last Routine Checkup

through home ownership

renthom1: Own Or Rent Home

Relevant variables for assessing alcohol consumption

Main Survey - Section 10 - Alcohol Consumption

alcday5: Days In Past 30 Had Alcoholic Beverage
avedrnk2: Avg Alcoholic Drinks Per Day In Past 30
drnk3ge5: Binge Drinking
maxdrnks: Most Drinks On Single Occasion Past 30 Days

Research question 2:

Is home ownership associated with consumption of 100% pure fruit juice when other fruit and vegetable consumption is held constant?

Main Survey - Section 11 - Fruits and Vegetables

fruitju1: How Many Times Did You Drink 100 Percent Pure Fruit Juices?
fruit1: How Many Times Did You Eat Fruit?
fvbeans: How Many Times Did You Eat Beans Or Lentils?
fvgreen: How Many Times Did You Eat Dark Green Vegetables?
fvorang: How Many Times Did You Eat Orange-Colored Vegetables?
vegetab1: How Many Times Did You Eat Other Vegetables?

Research question 3:

Among those who own a home and have not had to worry about rent and food, how satisfied are these individuals with life?

Relevant variables

Optional Module 19 - Social Context

scntmony: Times Past 12 Months Worried/Stressed About Having Enough Money To Pay Your Rent
scntmeal: Times Past 12 Months Worried/Stressed About Having Enough Money To Buy Nutritiou
scntpaid: How Are You Generally Paid For The Work You Do
scntlpad: How Were You Generally Paid For The Work You Did

Optional Module 22 - Emotional Support and Life Satisfaction

lsatisfy: Satisfaction With Life

Exploratory data analysis

Research question 1:

In the first research question we will be looking for an association between alcohol consumption and wealth and poverty. To find out whether alcohol consumption may be correlated with wealth, we will find out whether access to health care, owning a home, and the number of days alcohol was consumed in the last 30 days are associated. To find out whether lack of wealth and alcohol consumption might be related, we will check to see if lack of access to health care, home ownership, and binge drinking are associated.

First we note that the variable that records how often alcohol was consumed in the last 30 days, alcday5, is formatted peculiarly. If the first digit is a 1, the following two digits give the number of days per week on which at least one drink was consumed. If the first digit is a 2, the following two digits give the number of days in the past 30 days.

To account for this, we make a new column called by_month_week, which records whether the count of drinking days is per week (1) or per month (2). We make another column, num_per, which gives the number of drinking days in that week or month. We then create a new variable called days_drink, which converts num_per into a standardized form: the number of days on which at least one drink was consumed in the past 30 days. We store the result in a new data frame called ques1. The code is as follows.

ques1 <- brfss2013 %>%
  filter(!is.na(alcday5)) %>%
  mutate(by_month_week = alcday5 %/% 100, num_per = alcday5 %% 100) %>%
  mutate(days_drink = if_else(by_month_week == 1, 4 * num_per, num_per))
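
To see what the recoding above does, here is a small worked example on made-up codes (the values are illustrative, not taken from the data). It uses base R so that it runs on its own. Note that multiplying weekly counts by 4 treats a month as 28 days, so a daily drinker coded 107 maps to 28 rather than 30.

```r
# Hypothetical alcday5 codes: 1xx = days per week, 2xx = days in past 30
alcday5 <- c(101, 103, 107, 207, 230)

by_month_week <- alcday5 %/% 100   # leading digit: 1 = week, 2 = month
num_per       <- alcday5 %% 100    # last two digits: the count of days

# Same rule as in the pipeline: weekly counts are scaled by 4
days_drink <- ifelse(by_month_week == 1, 4 * num_per, num_per)

print(data.frame(alcday5, by_month_week, num_per, days_drink))
```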

We then apply additional operations to find the mean number of days on which alcohol was consumed (mean_day_drinks), the mean number of binge-drinking occasions in the past 30 days (mean_binge_drink), and the proportion of individuals who own a home (percent_home), grouped by whether or not these individuals have health care coverage (hlthpln1). I also include a count column to get a sense of the total numbers involved in each category.

ques1 %>%
  select(renthom1, hlthpln1, medcost, days_drink, drnk3ge5) %>%
  na.omit() %>%
  group_by(hlthpln1) %>%
  summarise(
    count = n(),
    percent_home = sum(renthom1 == "Own") / sum(!is.na(renthom1)),
    mean_day_drinks = mean(days_drink),
    mean_binge_drink = mean(drnk3ge5)
  )
## # A tibble: 2 × 5
##   hlthpln1  count percent_home mean_day_drinks mean_binge_drink
##     <fctr>  <int>        <dbl>           <dbl>            <dbl>
## 1      Yes 205772    0.7911523        9.270377        0.9725764
## 2       No  23503    0.4627069        8.135770        2.1970387
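
The grouped summary can be checked on a few made-up rows. This is only a sketch with hypothetical values; it also shows that, once rows with missing values have been dropped, sum(renthom1 == "Own") / sum(!is.na(renthom1)) is simply the mean of the logical vector renthom1 == "Own".

```r
library(dplyr)

# Made-up rows, only to illustrate the grouped summary
toy <- data.frame(
  hlthpln1   = c("Yes", "Yes", "Yes", "No", "No"),
  renthom1   = c("Own", "Own", "Rent", "Rent", "Own"),
  days_drink = c(10, 4, 7, 2, 8)
)

toy_summary <- toy %>%
  group_by(hlthpln1) %>%
  summarise(
    count = n(),
    # the mean of a logical vector is the proportion of TRUE values
    percent_home = mean(renthom1 == "Own"),
    mean_day_drinks = mean(days_drink)
  )

print(toy_summary)
```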

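The results are consistent with the hypothesis that more regular alcohol consumption is modestly associated with wealth, as measured by home ownership and health care coverage: respondents with coverage own their homes far more often (79% versus 46%) and drink on slightly more days per month (9.3 versus 8.1). We also find support for the hypothesis that binge drinking is associated with lack of coverage and lower home ownership: respondents without coverage report about 2.2 binge-drinking occasions in the past 30 days, compared with about 1.0 for those with coverage.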

Finally, we check whether the number of days on which alcohol was consumed in the past 30 days is associated with binge drinking, and whether that association differs across the combinations of home ownership and health care coverage.

ggplot(ques1, aes(days_drink, drnk3ge5, colour = interaction(renthom1, hlthpln1))) + geom_smooth(method = "lm", se = FALSE)
## Warning: Removed 240803 rows containing non-finite values (stat_smooth).

We find that drinking on more days is associated with more binge drinking in every group, but the association is much stronger without home ownership or health care coverage. Lack of coverage strengthens the relationship more than lack of home ownership does. Note that many rows were removed because many people do not drink or chose not to report how much they drank.
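
Because colour is mapped to a discrete variable, geom_smooth(method = "lm") fits a separate straight line for each group, so the strength of the association in each group can be read off as the slope of that group's line. A minimal sketch on made-up data (the group labels and numbers are hypothetical):

```r
# Made-up data: two groups with different slopes
toy <- data.frame(
  group      = rep(c("Own.Yes", "Rent.No"), each = 4),
  days_drink = rep(c(0, 10, 20, 30), times = 2),
  drnk3ge5   = c(0, 1, 2, 3,    # slope 0.1
                 0, 3, 6, 9)    # slope 0.3
)

# Fit one simple regression per group and keep the slope
slopes <- sapply(split(toy, toy$group), function(d) {
  coef(lm(drnk3ge5 ~ days_drink, data = d))[["days_drink"]]
})

print(slopes)
```

Applied to ques1, the same split-and-fit idea would put numbers on the differences that the plot shows visually.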

Research question 2:

Next, I am interested in whether there is an association between wealth and 100% pure fruit juice consumption when other fruit and vegetable consumption is held constant. General fruit and vegetable consumption could influence fruit juice consumption, so I’d like to control for these potentially confounding variables.

First, I select only the columns that I want, in this case, hlthpln1 and renthom1, whether a given person surveyed has access to health care or whether they own a home, and information about fruit and vegetable consumption.

fruitju1: How Many Times Did You Drink 100 Percent Pure Fruit Juices?
fruit1: How Many Times Did You Eat Fruit?
fvbeans: How Many Times Did You Eat Beans Or Lentils?
fvgreen: How Many Times Did You Eat Dark Green Vegetables?
fvorang: How Many Times Did You Eat Orange-Colored Vegetables?
vegetab1: How Many Times Did You Eat Other Vegetables?

I also use na.omit to get rid of any rows in which any of this information is not available.

ques2 <- brfss2013 %>% select(renthom1, hlthpln1, fruitju1, fruit1, fvbeans, fvgreen, fvorang, vegetab1) %>% na.omit()

I store this information in ques2. Next, I transform each of the columns so that every entry is formatted the same. The survey asks how often someone has consumed fruit in the last day, week, or month, and I would like all of this to be in the format of how often someone has eaten fruit in the last month. There are spurious values of consumed fruit and fruit juice, perhaps due to mistallying in the survey, perhaps due to considering small bundles of consumed fruits or fruit juice to be 1 serving of food. I set a cutoff of 150 servings per month.

ques2 <- ques2 %>%
  mutate(
    fruitju1x = if_else(fruitju1 %/% 100 == 1, 30, if_else(fruitju1 %/% 100 == 2, 4, 1)),
    fruit1x = if_else(fruit1 %/% 100 == 1, 30, if_else(fruit1 %/% 100 == 2, 4, 1)),
    fvbeansx = if_else(fvbeans %/% 100 == 1, 30, if_else(fvbeans %/% 100 == 2, 4, 1)),
    fvgreenx = if_else(fvgreen %/% 100 == 1, 30, if_else(fvgreen %/% 100 == 2, 4, 1)),
    fvorangx = if_else(fvorang %/% 100 == 1, 30, if_else(fvorang %/% 100 == 2, 4, 1)),
    vegetab1x = if_else(vegetab1 %/% 100 == 1, 30, if_else(vegetab1 %/% 100 == 2, 4, 1))
  )
ques2 <- ques2 %>%
  mutate(
    fruitju1 = fruitju1x * fruitju1 %% 100,
    fruit1 = fruit1 %% 100 * fruit1x,
    fvbeans = fvbeans %% 100 * fvbeansx,
    fvgreen = fvgreen %% 100 * fvgreenx,
    fvorang = fvorang %% 100 * fvorangx,
    vegetab1 = vegetab1 %% 100 * vegetab1x
  )
ques2 <- ques2 %>%
  filter(fruit1 < 150, fruitju1 < 150, fvbeans < 150, fvgreen < 150, fvorang < 150, vegetab1 < 150)
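
The same conversion rule is applied to all six columns, so it can be illustrated with one small helper. This is a sketch, not the code used above: to_monthly is a hypothetical name, the example codes are made up, and base R's ifelse stands in for if_else so the block runs on its own.

```r
# Convert a three-digit frequency code to times per month.
# Leading digit 1 = per day, 2 = per week, anything else = per month,
# matching the multipliers 30, 4 and 1 used in the pipeline.
to_monthly <- function(code) {
  multiplier <- ifelse(code %/% 100 == 1, 30,
                       ifelse(code %/% 100 == 2, 4, 1))
  multiplier * (code %% 100)
}

# Hypothetical codes: twice a day, 3 times a week, 15 times a month, never
codes <- c(102, 203, 315, 0)
print(to_monthly(codes))
```

With the multipliers above, twice a day becomes 60 times a month, which is why a cutoff such as 150 only removes quite extreme answers.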

First, I’d like to look for an association between wealth (home ownership and health care coverage) and fruit and vegetable consumption. Generally, the pattern in the table below is that wealth correlates with increased fruit and vegetable consumption, although beans are an exception: respondents without coverage report eating them more often. Since the question I am interested in is whether wealth correlates with 100% fruit juice consumption independent of general fruit and vegetable consumption, I have to control for these factors.

ques2 %>%
  group_by(renthom1, hlthpln1) %>%
  summarise(
    count = n(),
    mean_fruit_juice = mean(fruitju1),
    mean_fruit = mean(fruit1),
    mean_beans = mean(fvbeans),
    mean_green_veggies = mean(fvgreen),
    mean_orange = mean(fvorang),
    mean_veggie = mean(vegetab1)
  )
## Source: local data frame [6 x 9]
## Groups: renthom1 [?]
##            renthom1 hlthpln1  count mean_fruit_juice mean_fruit mean_beans
##              <fctr>   <fctr>  <int>            <dbl>      <dbl>      <dbl>
## 1               Own      Yes 289085         10.78710   30.45004   7.853400
## 2               Own       No  22355         10.98063   25.39870   8.788906
## 3              Rent      Yes  74621         12.10024   26.85335   7.698342
## 4              Rent       No  20302         11.93937   24.11191   9.442715
## 5 Other arrangement      Yes  13276         11.30642   25.91571   7.720172
## 6 Other arrangement       No   4127         10.98231   21.98740   8.810758
## # ... with 3 more variables: mean_green_veggies <dbl>, mean_orange <dbl>,
## #   mean_veggie <dbl>

To control for general fruit and vegetable consumption, I’d need to run a multiple regression with several predictors. Since I don’t know how to do that, I am instead going to look at how fruit juice consumption and fruit consumption are related. This at least allows us to see whether accounting for fruit consumption changes the relationship between wealth and fruit juice consumption, which is currently a negative association. This can be achieved with the following code.

ggplot(ques2, aes(fruit1, fruitju1, colour = interaction(renthom1, hlthpln1))) + geom_smooth(method = "lm", se = FALSE)
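
For reference, the multiple regression mentioned above could be fitted with base R's lm(). The sketch below runs on simulated data, not on ques2: the data frame sim and every number in it are made up so that the example is self-contained. On the real data, the data argument would be ques2 and the formula would list all five fruit and vegetable columns as predictors.

```r
set.seed(1)
n <- 200

# Simulated stand-in for ques2 (hypothetical values)
sim <- data.frame(
  renthom1 = factor(sample(c("Own", "Rent"), n, replace = TRUE)),
  hlthpln1 = factor(sample(c("Yes", "No"), n, replace = TRUE)),
  fruit1   = rpois(n, 25)
)
# Juice depends on fruit intake plus a renter effect, by construction
sim$fruitju1 <- 5 + 0.2 * sim$fruit1 + 2 * (sim$renthom1 == "Rent") + rnorm(n)

# Multiple regression: wealth proxies and fruit intake as joint predictors
fit <- lm(fruitju1 ~ renthom1 + hlthpln1 + fruit1, data = sim)
print(round(coef(fit), 2))
```

The coefficient on renthom1Rent estimates the difference in juice consumption between renters and owners with coverage status and fruit intake held fixed, which is the quantity research question 2 asks about.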