Health pattern analysis

This is an AM01 Applied statitics project undertaken by my study group to analyse the Youth Risk Behavior Surveillance data to study health pattern among high school students. We have performed EDA on YRBSS data to gain insights on the weight data available for high school students. We have also done analysis of physically active people in addition to determining confidence interval for inactive students.We have also plotted graphs to understand the relationship between physical activity and weight among the students.

About: Youth Risk Behavior Surveillance

Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.

Exploratory Data Analysis on distribution of weight data

data(yrbss)

# Create summary statistics
summary(yrbss$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##      30      56      64      68      76     181    1004
skimr::skim(yrbss)
(#tab:eda_on_weight)Data summary
Name yrbss
Number of rows 13583
Number of columns 13
_______________________
Column type frequency:
character 8
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 12 1.00 4 6 0 2 0
grade 79 0.99 1 5 0 5 0
hispanic 231 0.98 3 8 0 2 0
race 2805 0.79 5 41 0 5 0
helmet_12m 311 0.98 5 12 0 6 0
text_while_driving_30d 918 0.93 1 13 0 8 0
hours_tv_per_school_day 338 0.98 1 12 0 7 0
school_night_hours_sleep 1248 0.91 1 3 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 77 0.99 16.16 1.26 12.00 15.0 16.00 17.00 18.00 ▁▂▅▅▇
height 1004 0.93 1.69 0.10 1.27 1.6 1.68 1.78 2.11 ▁▅▇▃▁
weight 1004 0.93 67.91 16.90 29.94 56.2 64.41 76.20 180.99 ▆▇▂▁▁
physically_active_7d 273 0.98 3.90 2.56 0.00 2.0 4.00 7.00 7.00 ▆▂▅▃▇
strength_training_7d 1176 0.91 2.95 2.58 0.00 0.0 3.00 5.00 7.00 ▇▂▅▂▅
# Plot missing values
kable(yrbss%>%
  filter(is.na(weight))%>%
  summarise(Tatal_missing_data=n()),"simple")
Tatal_missing_data
1004
# Create histogram
x <- yrbss$weight 
h<-hist(x,  
        xlab="weight", main="Histogram with Normal Curve") 

# Create normal distribution as comparison
xfit<-seq(min(x,na.rm = TRUE),max(x,na.rm = TRUE),length=40)               #  order the data 
yfit<-dnorm(xfit,mean=mean(x,na.rm = TRUE),sd=sd(x,na.rm = TRUE))          #  density of normal dis
yfit <- yfit*diff(h$mids[1:2])*length(x) 
lines(xfit, yfit, col="red", lwd=2) 

Inference:

The minimum value of weight is 30, the maximum value is 181, the median is 64, the lower quantile is 56, the upper quantile is 76, and the total number of missing values is 1004. The distribution of weights looks lightly skewed to the right. We added a normal distribution line as comparison.

Analysis of physically active people (>3 days a week):

# Create new variable
yrbss$physical_3plus<-ifelse(yrbss$physically_active_7d>=3,"yes","no")

# Method group_by()... summarise()
yrbss%>%
  group_by(physical_3plus)%>%
  summarise(n=n())%>%
  mutate(prop=label_percent()(n/nrow(yrbss)))
## # A tibble: 3 × 3
##   physical_3plus     n prop 
##   <chr>          <int> <chr>
## 1 no              4404 32%  
## 2 yes             8906 66%  
## 3 <NA>             273 2%
# Method count()
yrbss%>%
count(physical_3plus) %>% 
mutate(prop=label_percent()(n/nrow(yrbss)))
## # A tibble: 3 × 3
##   physical_3plus     n prop 
##   <chr>          <int> <chr>
## 1 no              4404 32%  
## 2 yes             8906 66%  
## 3 <NA>             273 2%
# Create a graph to show difference
yrbss%>%
  filter(is.na(physical_3plus)==FALSE)%>%
  group_by(physical_3plus)%>%
  summarise(number=n())%>%
ggplot(aes(x=physical_3plus,y=number))+
  geom_bar(stat = "identity",width =0.5)+
  labs(x="physical_3plus",y="number")+
  ggtitle("How many children workout 3plus times?",
          subtitle = "")+
  theme_bw()

Inference:

In this first analysis we have separately shown the percentage proportion also for the NA values. 8906 kids practice sport 3plus times a week while 4404 do not. In the graph we have not included the NA values.

Confidence Interval calculation for the population proportion of high schools that are NOT active 3 or more days per week

#Filtering data

test<-yrbss %>% 
  filter(is.na(physical_3plus)==FALSE)%>%
  
# Changing yes no to numerical value to make it easier  
  mutate(physical_3_p=ifelse(physical_3plus=="yes",1,0), physical_3_l=ifelse(physical_3plus=="no",1,0))


# With formula

ci_yrbss <- test %>% 
  filter(!is.na(physical_3plus)) %>%
  summarize(count = n(),
            p = (sum(physical_3plus == "no"))/count,
            t_critical = qt(0.975, count-1),
            se_diff = sqrt(p*(1-p)/count),
            margin_of_error = t_critical * se_diff,
            prop_low = p - margin_of_error,
            prop_high = p + margin_of_error)

# print the table with confidence interval

kable(ci_yrbss,
      caption="CI 3 or more times active Formula")
(#tab:confidence_interval)CI 3 or more times active Formula
count p t_critical se_diff margin_of_error prop_low prop_high
13310 0.331 1.96 0.004 0.008 0.323 0.339
# Confidence Interval with bootstrap  

set.seed(1234)
boot_3lower <- test %>% 
# Select
  filter(is.na(physical_3plus)==FALSE)%>%                  
# Specify the variable of interest
  specify(response = physical_3_l) %>%
# Generate a bunch of bootstrap samples
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") 

percentile_ci<- boot_3lower %>% 
   get_confidence_interval(level=0.95, type="percentile")
kable(percentile_ci,
      caption="CI 3 or more times active Infer")
(#tab:confidence_interval)CI 3 or more times active Infer
lower_ci upper_ci
0.323 0.338

The 95% confidence interval is [0.323,0.339] calculated with both, formula and bootstrap.

Determining the relationship between physical_3plusandweight`

# Create boxplot

yrbss%>%
  filter(is.na(physical_3plus)==FALSE)%>%
ggplot(aes(x=physical_3plus,y=weight))+
  geom_boxplot()+
  ggtitle("The boxplot of the weight of physical_3plus",
          subtitle = "")+
  theme_bw()

Inference:

Although almost similiar, there is a positive relationship between being physically active for at least 3 days a week and weight. The result is somehow counter-intuitive, since we thought that more exercise in children would reduce there weight. This might be explained by the reason that people who stay highly active take part of physically demanding sports like basket, football and similiar and have thus a higher percentage of muscle or are just bigger. One additional reason could be that kids who are slightly heavier are more incentivised to exercise more frequently.

Summary statistics for weight & confidence interval

#Apllying formula to weight
ci_yrbss_weight <- yrbss %>% 
  filter(is.na(physical_3plus)==FALSE)%>%
  group_by(physical_3plus) %>% 
  summarize(mean = mean(weight, na.rm=TRUE),
            sd = sd(weight, na.rm=TRUE),
            count = n(),
            t_critical = qt(0.975, count-1),
            se_diff = sd/sqrt(count),
            margin_of_error = t_critical * se_diff,
            weight_low = mean - margin_of_error,
            weight_high = mean + margin_of_error)

#Plotting results
kable(ci_yrbss_weight)
physical_3plus mean sd count t_critical se_diff margin_of_error weight_low weight_high
no 66.7 17.6 4404 1.96 0.266 0.521 66.2 67.2
yes 68.4 16.5 8906 1.96 0.175 0.342 68.1 68.8

There is an observed mean difference of about 1.77kg (68.44 - 66.67), and we notice that the two confidence intervals do not overlap. It seems that the difference is statistically significant at 95% confidence.