Pay Discrimination Analysis

This is an AM01 Applied Statistics project undertaken by my study group to analyse the pay differences between men and women. We analysed the relationship between salary and gender via visualisation and hypothesis testing. We also examined the relationship between experience and gender, and between salary and experience, for the men and women in our dataset. Finally, we calculated the correlations between gender, experience and salary to draw final inferences from our data.

Omega Group: Pay Discrimination

Objective: Find out whether there is indeed a significant difference between the salaries of men and women, and whether the difference is due to discrimination or whether it is based on another, possibly valid, determining factor.

Loading the data:

library(tidyverse)  # read_csv(), glimpse(), dplyr verbs, ggplot2
omega <- read_csv(here::here("data", "omega.csv"))
glimpse(omega) # examine the data frame

## Rows: 50
## Columns: 3
## $ salary     <dbl> 81894, 69517, 68589, 74881, 65598, 76840, 78800, 70033, 635…
## $ gender     <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
## $ experience <dbl> 16, 25, 15, 33, 16, 19, 32, 34, 1, 44, 7, 14, 33, 19, 24, 3…

Analysis of Relationship between Salary & Gender

# Summary statistics of salary by gender
library(mosaic)  # favstats()
library(knitr)   # kable()
kable(favstats(salary ~ gender, data = omega))

gender	min	Q1	median	Q3	max	mean	sd	n	missing
female	47033	60338	64618	70033	78800	64543	7567	26	0
male	54768	68331	74675	78568	84576	73239	7463	24	0

# Data frame with two rows (male, female) whose columns are gender, mean, SD, sample size,
# the t-critical value, the standard error, the margin of error,
# and the low/high endpoints of a 95% confidence interval

ci_omega <- omega %>% 
  group_by(gender) %>% 
  summarize(mean = mean(salary, na.rm=TRUE),
            sd = sd(salary, na.rm=TRUE),
            count = n(),
            t_critical = qt(0.975, count-1),
            se_diff = sd/sqrt(count),
            margin_of_error = t_critical * se_diff,
            salary_low = mean - margin_of_error,
            salary_high = mean + margin_of_error)

# print the table with confidence interval

kable(ci_omega,
      caption="Salary CI by Gender")

Salary CI by Gender
gender	mean	sd	count	t_critical	se_diff	margin_of_error	salary_low	salary_high
female	64543	7567	26	2.06	1484	3056	61486	67599
male	73239	7463	24	2.07	1523	3151	70088	76390

Inference:

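The 95% confidence intervals for women's and men's salaries do not overlap: the interval for women (61,486 to 67,599) lies entirely below the interval for men (70,088 to 76,390). Non-overlapping 95% intervals imply that the difference in mean salary between the two groups is statistically significant at the 5% level, so a t-test is not strictly needed to reach this conclusion.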
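The interval endpoints in the table above can be reproduced from the summary statistics alone. The following is a minimal sketch in base R; `ci_from_summary()` is a hypothetical helper introduced here for illustration, and the inputs are the rounded means, standard deviations and counts reported in the table, so the results may differ from the table by a unit or so.

```r
# Hypothetical helper (not part of the original analysis): confidence interval
# for a mean, computed from summary statistics only.
ci_from_summary <- function(m, sd, n, level = 0.95) {
  t_critical <- qt(1 - (1 - level) / 2, df = n - 1)  # two-sided critical value
  se <- sd / sqrt(n)                                 # standard error of the mean
  margin <- t_critical * se                          # margin of error
  c(low = m - margin, high = m + margin, margin = margin)
}

# Rounded values copied from the "Salary CI by Gender" table.
ci_female <- ci_from_summary(m = 64543, sd = 7567, n = 26)
ci_male   <- ci_from_summary(m = 73239, sd = 7463, n = 24)

# Two intervals overlap when each one starts before the other one ends.
intervals_overlap <- ci_female[["low"]] <= ci_male[["high"]] &&
  ci_male[["low"]] <= ci_female[["high"]]

print(round(ci_female))
print(round(ci_male))
print(intervals_overlap)
```

Note that non-overlap is a sufficient but not a necessary condition for significance: two 95% intervals can overlap slightly while the difference in means is still significant at the 5% level.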

Analysis of relationship between salary & gender via Hypothesis Test (t.test + infer)

# hypothesis testing using t.test() 

t.test(salary ~ gender, data = omega)

## 
##  Welch Two Sample t-test
## 
## data:  salary by gender
## t = -4, df = 48, p-value = 0.0002
## alternative hypothesis: true difference in means between group female and group male is not equal to 0
## 95 percent confidence interval:
##  -12973  -4420
## sample estimates:
## mean in group female   mean in group male 
##                64543                73239

# hypothesis testing using infer package
library(infer)  # specify(), hypothesize(), generate(), calculate()

# Calculate the observed statistic
obs_stat <- omega %>%
  specify(salary ~ gender) %>%
  calculate(stat = "diff in means",
            order = c("female", "male"))

set.seed(1234)
salaries_in_null_world <- omega %>% 
  
  #Which variable we are interested in
  specify(salary ~ gender) %>% 
  
  #Hypothesis with no (null) difference
  hypothesize(null = "independence") %>% 
  
  #Create simulated samples
  generate(reps = 10000, type = "permute") %>% 
  
  #Mean difference in each sample
  calculate(stat = "diff in means",
            order = c("female", "male")) # give the order for subtraction first, second

#Visualize distribution  
salaries_in_null_world %>% visualize()+
  shade_p_value(obs_stat = obs_stat, direction = "both")

salaries_in_null_world %>% 
  get_p_value(obs_stat = obs_stat, direction = "both")

## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1  0.0002

Inference:

The t-test clearly shows that the null hypothesis (no difference in mean salary) can be rejected. Three indicators point to this: first, the absolute value of the t-statistic (about 4) exceeds the critical value of roughly 2; second, the 95% CI for the difference in means does not contain zero; third, the p-value is below 5%.
With 1,000 reps the permutation simulation gave a p-value of zero, which can happen when the observed statistic is so extreme that none of the simulated differences reaches it. Because the resolution of a simulated p-value is limited by the number of reps, we increased the reps to 10,000 to reproduce the result of the formula-based test.
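As a cross-check on the `t.test()` output, the Welch statistic, its degrees of freedom, the p-value and the confidence interval can be recomputed by hand. The sketch below uses only base R and the rounded summary statistics from the salary table, so small deviations from the printed output are expected; the variable names are chosen here for illustration.

```r
# Rounded summary statistics copied from the "Salary CI by Gender" table.
mean_f <- 64543; sd_f <- 7567; n_f <- 26
mean_m <- 73239; sd_m <- 7463; n_m <- 24

# Squared standard error of each group mean.
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m

# Welch t statistic for (female - male), the same order t.test() reports.
se_difference <- sqrt(v_f + v_m)
t_stat <- (mean_f - mean_m) / se_difference

# Welch-Satterthwaite approximation of the degrees of freedom.
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Two-sided p-value and 95% CI for the difference in means.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci_diff <- (mean_f - mean_m) + c(-1, 1) * qt(0.975, df_welch) * se_difference

print(c(t = t_stat, df = df_welch, p_value = p_value))
print(round(ci_diff))
```

The Welch form does not assume equal variances in the two groups, which is why the degrees of freedom are not simply 26 + 24 - 2.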

Analysis of relationship between experience & gender:

# Summary statistics of experience by gender
favstats(experience ~ gender, data = omega)

##   gender min    Q1 median   Q3 max  mean    sd  n missing
## 1 female   0  0.25    3.0 14.0  29  7.38  8.51 26       0
## 2   male   1 15.75   19.5 31.2  44 21.12 10.92 24       0

Based on this evidence, can you conclude that there is a significant difference between the experience of the male and female executives? Does your conclusion validate or endanger your conclusion about the difference in male and female salaries?

#Using formula to create confidence interval

ci_omega_experience <- omega %>% 
  group_by(gender) %>% 
  summarize(mean = mean(experience, na.rm=TRUE),
            sd = sd(experience, na.rm=TRUE),
            count = n(),
            t_critical = qt(0.975, count-1),
            se_diff = sd/sqrt(count),
            margin_of_error = t_critical * se_diff,
            experience_low = mean - margin_of_error,
            experience_high = mean + margin_of_error)

# print the table with confidence interval

kable(ci_omega_experience,
      caption="Experience CI by Gender")

Experience CI by Gender
gender	mean	sd	count	t_critical	se_diff	margin_of_error	experience_low	experience_high
female	7.38	8.51	26	2.06	1.67	3.44	3.95	10.8
male	21.12	10.92	24	2.07	2.23	4.61	16.52	25.7

Answer:

The 95% confidence intervals for the experience of women and men do not overlap. The difference in mean experience between the two groups is thus statistically significant, and there is no need to run a t-test. This finding endangers the conclusion drawn above (gender-based salary discrimination): not only gender but also experience, a factor not considered so far, could explain the pay gap. Further analysis is needed.

Analysis of relationship between salary & experience:

Someone at the meeting argues that clearly, a more thorough analysis of the relationship between salary and experience is required before any conclusion can be drawn about whether there is any gender-based salary discrimination in the company.

Analyse the relationship between salary and experience. Draw a scatterplot to visually inspect the data.

ggplot(omega, aes(x = experience, y = salary, color = gender)) +
  geom_point()+
  geom_smooth(method='lm', se=FALSE)+
  labs(
    title = "Relationship between salary and experience"
  )+
  theme_bw()

Correlations between Gender, Experience and Salary

library(GGally)  # ggpairs()
omega %>% 
  select(gender, experience, salary) %>% # order in which the variables appear in ggpairs()
  ggpairs(aes(colour=gender, alpha = 0.3))+
  theme_bw()

Inference:

The correlation matrix reveals many interesting things in a single visualization. The scatterplot shows a clear positive correlation between years of experience and salary, and this holds for both genders. The overall correlation of 0.8 confirms this numerically: the more experience, the higher the salary. The two colors in the plot above show that women's salaries increase more steeply with growing experience (steeper slope). With women having a median experience of 3 years and men of 19.5 years, it is not surprising that there is a significant pay gap. Omega should rather investigate the underlying reason why women have so little work experience. Are senior positions, which require more experience, mainly filled by men while graduate positions are mainly filled by women? In any case, further investigation is needed!
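One way to put numbers on the slopes seen in the scatterplot is a linear model with an interaction term, which yields the same fitted lines as the separate per-gender lines drawn by `geom_smooth(method = 'lm')`. This is only a sketch of a possible next step, not part of the original analysis: `fit_salary_model()` and `slopes_by_gender()` are hypothetical helpers, and the code assumes `gender` takes exactly the two values "female" and "male".

```r
# Hypothetical helper: one linear model with a separate intercept and slope
# per gender (salary ~ experience + gender + experience:gender).
fit_salary_model <- function(data) {
  stopifnot(all(c("salary", "experience", "gender") %in% names(data)))
  lm(salary ~ experience * gender, data = data)
}

# Hypothetical helper: slope of salary on experience for each gender.
# Assumes "female" is the reference level, which is R's default
# (alphabetical) ordering for the two values "female" and "male".
slopes_by_gender <- function(fit) {
  b <- coef(fit)
  c(female = unname(b["experience"]),
    male   = unname(b["experience"] + b["experience:gendermale"]))
}

# Runs only if the `omega` data frame from the loading step is available.
if (exists("omega")) {
  fit <- fit_salary_model(omega)
  print(slopes_by_gender(fit))                # salary change per year of experience
  print(cor(omega$experience, omega$salary))  # overall correlation
  print(summary(fit))                         # coefficient table with standard errors
}
```

In the summary, the `gendermale` coefficient estimates the salary gap at zero years of experience, and the `experience:gendermale` coefficient estimates how much the male slope differs from the female slope.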