1. Which college majors will pay the bills?¶

Wondering if that Philosophy major will really help you pay the bills? Think you're set with an Engineering degree? Choosing a college major is a complex decision evaluating personal interest, difficulty, and career prospects. Your first paycheck right out of college might say a lot about your salary potential by mid-career. Whether you're in school or navigating the postgrad world, join me as we explore the short and long term financial implications of this major decision.

In this notebook, we'll be using data collected from a year-long survey of 1.2 million people with only a bachelor's degree by PayScale Inc., made available here by the Wall Street Journal for their article Ivy League's Big Edge: Starting Pay. After doing some data clean up, we'll compare the recommendations from three different methods for determining the optimal number of clusters, apply a k-means clustering analysis, and visualize the results.

To begin, let's prepare by loading the following packages: tidyverse, dplyr, readr, ggplot2, cluster, and factoextra. We'll then import the data from degrees-that-pay-back.csv (which is stored in a folder called datasets), and take a quick look at what we're working with.

# Load relevant packages
library(tidyr)
library(dplyr)
library(readr)
library(ggplot2)
library(cluster)
library(factoextra)

# Read in the dataset
degrees <- read_csv('datasets/degrees-that-pay-back.csv', 
                    col_names=c('College.Major', 'Starting.Median.Salary', 'Mid.Career.Median.Salary', 
                                'Career.Percent.Growth', 'Percentile.10', 'Percentile.25', 'Percentile.75', 'Percentile.90'), skip=1)

# Display the first few rows and a summary of the data frame
head(degrees)
summary(degrees)

Parsed with column specification:
cols(
  College.Major = col_character(),
  Starting.Median.Salary = col_character(),
  Mid.Career.Median.Salary = col_character(),
  Career.Percent.Growth = col_double(),
  Percentile.10 = col_character(),
  Percentile.25 = col_character(),
  Percentile.75 = col_character(),
  Percentile.90 = col_character()
)

 College.Major      Starting.Median.Salary Mid.Career.Median.Salary
 Length:50          Length:50              Length:50               
 Class :character   Class :character       Class :character        
 Mode  :character   Mode  :character       Mode  :character        
                                                                   
                                                                   
                                                                   
 Career.Percent.Growth Percentile.10      Percentile.25      Percentile.75     
 Min.   : 23.40        Length:50          Length:50          Length:50         
 1st Qu.: 59.12        Class :character   Class :character   Class :character  
 Median : 67.80        Mode  :character   Mode  :character   Mode  :character  
 Mean   : 69.27                                                                
 3rd Qu.: 82.42                                                                
 Max.   :103.50                                                                
 Percentile.90     
 Length:50         
 Class :character  
 Mode  :character

2. Currency and strings and percents, oh my!¶

Notice that our salary data is in currency format, which R considers a string. Let's strip those special characters using the gsub function and convert all of our columns except College.Major to numeric.

While we're at it, we can also convert the Career.Percent.Growth column to a decimal value.

# Clean up the data
degrees_clean <- degrees %>% mutate_at(vars(Starting.Median.Salary:Percentile.90), function(x) as.numeric(gsub("[\\$,]","",x)) ) %>% 
  mutate(Career.Percent.Growth=Career.Percent.Growth/100)

3. The elbow method¶

Great! Now that we have a more manageable dataset, let's begin our clustering analysis by determining how many clusters we should be modeling. The best number of clusters for an unlabeled dataset is not always a clear-cut answer, but fortunately there are several techniques to help us optimize. We'll work with three different methods to compare recommendations:

Elbow Method
Silhouette Method
Gap Statistic Method

First up will be the Elbow Method. This method plots the percent variance against the number of clusters. The "elbow" bend of the curve indicates the optimal point at which adding more clusters will no longer explain a significant amount of the variance. To begin, let's select and scale the following features to base our clusters on: Starting.Median.Salary, Mid.Career.Median.Salary, Perc.10, and Perc.90. Then we'll use the fancy fviz_nbclust function from the factoextra library to determine and visualize the optimal number of clusters.

# Select and scale the relevant features and store as k_means_data
k_means_data <- degrees_clean %>%
  select(Starting.Median.Salary, Mid.Career.Median.Salary, 
         Percentile.10, Percentile.90) %>% scale()

# Run the fviz_nbclust function with our selected data and method "wss"
elbow_method <- fviz_nbclust(k_means_data, FUNcluster=kmeans, method="wss")

# View the plot
elbow_method

4. The silhouette method¶

Wow, that fviz_nbclust function was pretty nifty. Instead of needing to "manually" apply the elbow method by running multiple k_means models and plotting the calculated the total within cluster sum of squares for each potential value of k, fviz_nbclust handled all of this for us behind the scenes. Can we use it for the Silhouette Method as well? The Silhouette Method will evaluate the quality of clusters by how well each point fits within a cluster, maximizing average "silhouette" width.

# Run the fviz_nbclust function with the method "silhouette" 
silhouette_method <- fviz_nbclust(k_means_data, FUNcluster=kmeans, method="silhouette")

# View the plot
silhouette_method

5. The gap statistic method¶

Marvelous! But hmm, it seems that our two methods so far disagree on the optimal number of clusters... Time to pull out the tie breaker.

For our final method, let's see what the Gap Statistic Method has to say about this. The Gap Statistic Method will compare the total variation within clusters for different values of k to the null hypothesis, maximizing the "gap." The "null hypothesis" refers to a uniformly distributed simulated reference dataset with no observable clusters, generated by aligning with the principle components of our original dataset. In other words, how much more variance is explained by k clusters in our dataset than in a fake dataset where all majors have equal salary potential?

Fortunately, we have the clusGap function to calculate this behind the scenes and the fviz_gap_stat function to visualize the results.

# Use the clusGap function to apply the Gap Statistic Method
gap_stat <- clusGap(k_means_data, FUN=kmeans, nstart=25, K.max=10, B=50)

# Use the fviz_gap_stat function to vizualize the results
gap_stat_method <- fviz_gap_stat(gap_stat)

# View the plot
gap_stat_method

6. K-means algorithm¶

Looks like the Gap Statistic Method agreed with the Elbow Method! According to majority rule, let's use 3 for our optimal number of clusters. With this information, we can now run our k-means algorithm on the selected data. We will then add the resulting cluster information to label our original dataframe.

# Set a random seed
set.seed(111)

# Set k equal to the optimal number of clusters
num_clusters <- 3

# Run the k-means algorithm 
k_means <- kmeans(k_means_data, centers=num_clusters, iter.max=15, nstart=25)

# Label the clusters of degrees_clean
degrees_labeled <- degrees_clean %>%
    mutate(clusters = k_means$cluster)

7. Visualizing the clusters¶

Now for the pretty part: visualizing our results. First let's take a look at how each cluster compares in Starting vs. Mid Career Median Salaries. What do the clusters say about the relationship between Starting and Mid Career salaries?

career_growth <- ggplot(degrees_labeled, aes(x=Starting.Median.Salary,
                                             y=Mid.Career.Median.Salary,
                                             color=factor(clusters))) +
  xlab("Starting Median Salary") +
  ylab("Mid-Career Median Salary") +
  ggtitle("Salary Growth") + 
  geom_point(alpha=4/5, size=7) +
  scale_x_continuous(labels=scales::dollar) +
  scale_y_continuous(labels=scales::dollar) + 
  scale_color_manual(name="Clusters", values=c("#EC2C73", "#29AEC7", "#FFDD30"))

# View the plot
career_growth

8. A deeper dive into the clusters¶

Unsurprisingly, most of the data points are hovering in the top left corner, with a relatively linear relationship. In other words, the higher your starting salary, the higher your mid career salary. The three clusters provide a level of delineation that intuitively supports this.

How might the clusters reflect potential mid career growth? There are also a couple curious outliers from clusters 1 and 3... perhaps this can be explained by investigating the mid career percentiles further, and exploring which majors fall in each cluster.

Right now, we have a column for each percentile salary value. In order to visualize the clusters and majors by mid career percentiles, we'll need to reshape the degrees_labeled data using tidyr's gather function to make a percentile key column and a salary value column to use for the axes of our following graphs. We'll then be able to examine the contents of each cluster to see what stories they might be telling us about the majors.

# Use the gather function to reshape degrees and 
# use mutate() to reorder the new percentile column
degrees_perc <- degrees_labeled %>%
  select(College.Major, Percentile.10, Percentile.25, 
         Mid.Career.Median.Salary, Percentile.75, 
         Percentile.90, clusters) %>%
  gather(-c(College.Major, clusters), key=percentile, value=salary) %>%
  mutate(percentile=factor(percentile, levels=c("Percentile.10", "Percentile.25", 
                                                "Mid.Career.Median.Salary", 
                                                "Percentile.75", "Percentile.90")))

9. The liberal arts cluster¶

Let's graph Cluster 1 and examine the results. These Liberal Arts majors may represent the lowest percentiles with limited growth opportunity, but there is hope for those who make it! Music is our riskiest major with the lowest 10th percentile salary, but Drama wins the highest growth potential in the 90th percentile for this cluster (so don't let go of those Hollywood dreams!). Nursing is the outlier culprit of cluster number 1, with a higher safety net in the lowest percentile to the median. Otherwise, this cluster does represent the majors with limited growth opportunity.

An aside: It's worth noting that most of these majors leading to lower-paying jobs are women-dominated, according to this Glassdoor study. According to the research:

"The single biggest cause of the gender pay gap is occupation and industry sorting of men and women into jobs that pay differently throughout the economy. In the U.S., occupation and industry sorting explains 54 percent of the overall pay gap—by far the largest factor."

Does this imply that women are statistically choosing majors with lower pay potential, or do certain jobs pay less because women choose them...?

# Graph the majors of Cluster 1 by percentile
cluster_1 <- ggplot(filter(degrees_perc, clusters==1),
                    aes(x=percentile, y=salary,
                       group=College.Major, color=College.Major)) +
                    geom_point() + geom_line() +
                    ggtitle("Cluster 1: The Liberal Arts") +
                    theme(axis.text.x=element_text(size=7, angle=25))

# View the plot
cluster_1

10. The goldilocks cluster¶

On to Cluster 2, right in the middle! Accountants are known for having stable job security, but once you're in the big leagues you may be surprised to find that Marketing or Philosophy can ultimately result in higher salaries. The majors of this cluster are fairly middle of the road in our dataset, starting off not too low and not too high in the lowest percentile. However, this cluster also represents the majors with the greatest differential between the lowest and highest percentiles.

# Modify the previous plot to display Cluster 2
cluster_2 <- ggplot(filter(degrees_perc, clusters==2),
                    aes(x=percentile, y=salary,
                        group=College.Major, color=College.Major)) +
  geom_point() + geom_line() +
  ggtitle("Cluster 2: The Goldilocks") + 
  theme(axis.text.x=element_text(size=7, angle=25))

# View the plot
cluster_2

11. The over achiever cluster¶

Finally, let's visualize Cluster 3. If you want financial security, these are the majors to choose from. Besides our one previously observed outlier now identifiable as Physician Assistant lagging in the highest percentiles, these heavy hitters and solid engineers represent the highest growth potential in the 90th percentile, as well as the best security in the 10th percentile rankings. Maybe those Freakonomics guys are on to something...

# Modify the previous plot to display Cluster 3
cluster_3 <- ggplot(filter(degrees_perc, clusters==3),
                    aes(x=percentile, y=salary,
                        group=College.Major, color=College.Major)) +
  geom_point() + geom_line() +
  ggtitle("Cluster 3: The Overachievers") + 
  theme(axis.text.x=element_text(size=7, angle=25))

# View the plot
cluster_3

12. Every major's wonderful¶

Thus concludes our journey exploring salary projections by college major via a k-means clustering analysis! Dealing with unsupervized data always requires a bit of creativity, such as our usage of three popular methods to determine the optimal number of clusters. We also used visualizations to interpret the patterns revealed by our three clusters and tell a story.

Which two careers tied for the highest career percent growth? While it's tempting to focus on starting career salaries when choosing a major, it's important to also consider the growth potential down the road. Keep in mind that whether a major falls into the Liberal Arts, Goldilocks, or Over Achievers cluster, one's financial destiny will certainly be influenced by numerous other factors including the school attended, location, passion or talent for the subject, and of course the actual career(s) pursued.

A similar analysis to evaluate these factors may be conducted on the additional data provided by the Wall Street Journal article, comparing salary potential by type and region of college attended. But in the meantime, here's some inspiration from xkcd for any students out there still struggling to choose a major.

# Sort degrees by Career.Percent.Growth
arrange(degrees_labeled, desc(Career.Percent.Growth))

# Identify the two majors tied for highest career growth potential
highest_career_growth <- c('....','....')

College.Major	Starting.Median.Salary	Mid.Career.Median.Salary	Career.Percent.Growth	Percentile.10	Percentile.25	Percentile.75	Percentile.90
Accounting	$46,000.00	$77,100.00	67.6	$42,200.00	$56,100.00	$108,000.00	$152,000.00
Aerospace Engineering	$57,700.00	$101,000.00	75.0	$64,300.00	$82,100.00	$127,000.00	$161,000.00
Agriculture	$42,600.00	$71,900.00	68.8	$36,300.00	$52,100.00	$96,300.00	$150,000.00
Anthropology	$36,800.00	$61,500.00	67.1	$33,800.00	$45,500.00	$89,300.00	$138,000.00
Architecture	$41,600.00	$76,800.00	84.6	$50,600.00	$62,200.00	$97,000.00	$136,000.00
Art History	$35,800.00	$64,900.00	81.3	$28,800.00	$42,200.00	$87,400.00	$125,000.00

College.Major	Starting.Median.Salary	Mid.Career.Median.Salary	Career.Percent.Growth	Percentile.10	Percentile.25	Percentile.75	Percentile.90	clusters
Math	45400	92400	1.035	45200	64200	128000	183000	2
Philosophy	39900	81200	1.035	35500	52800	127000	168000	2
International Relations	40900	80900	0.978	38200	56000	111000	157000	2
Economics	50100	98600	0.968	50600	70600	145000	210000	3
Marketing	40800	79600	0.951	42100	55600	119000	175000	2
Physics	50300	97300	0.934	56000	74200	132000	178000	3
Political Science	40800	78200	0.917	41200	55300	114000	168000	2
Chemistry	42600	79900	0.876	45300	60700	108000	148000	2
Journalism	35600	66700	0.874	38400	48300	97700	145000	1
Architecture	41600	76800	0.846	50600	62200	97000	136000	2
Finance	47900	88300	0.843	47200	62100	128000	195000	2
Communications	38100	70000	0.837	37500	49700	98800	143000	2
Geology	43500	79500	0.828	45000	59600	101000	156000	2
Art History	35800	64900	0.813	28800	42200	87400	125000	1
History	39200	71000	0.811	37000	49200	103000	149000	2
Film	37900	68500	0.807	33900	45500	100000	136000	1
Aerospace Engineering	57700	101000	0.750	64300	82100	127000	161000	3
Computer Engineering	61400	105000	0.710	66100	84100	135000	162000	3
Computer Science	55900	95500	0.708	56000	74900	122000	154000	3
English	38000	64700	0.703	33400	44800	93200	133000	1
Chemical Engineering	63200	107000	0.693	71900	87300	143000	194000	3
Electrical Engineering	60900	103000	0.691	69300	83800	130000	168000	3
Agriculture	42600	71900	0.688	36300	52100	96300	150000	2
Psychology	35900	60400	0.682	31600	42100	87500	127000	1
Civil Engineering	53900	90500	0.679	63400	75100	115000	148000	3
Business Management	43000	72100	0.677	38800	51500	102000	147000	2
Accounting	46000	77100	0.676	42200	56100	108000	152000	2
Graphic Design	35700	59800	0.675	36000	45500	80800	112000	1
Management Information Systems (MIS)	49200	82300	0.673	45300	60500	108000	146000	2
Anthropology	36800	61500	0.671	33800	45500	89300	138000	1
Biology	38800	64800	0.670	36900	47400	94500	135000	1
Construction	53700	88900	0.655	56300	68100	118000	171000	3
Industrial Engineering	57700	94700	0.641	57100	72300	132000	173000	3
Mechanical Engineering	57900	93600	0.617	63700	76200	120000	163000	3
Criminal Justice	35000	56300	0.609	32200	41600	80700	107000	1
Forestry	39100	62600	0.601	41000	49300	78200	111000	1
Sociology	36500	58200	0.595	30700	40400	81200	118000	1
Geography	41200	65500	0.590	40000	50000	90800	132000	1
Drama	35900	56900	0.585	36700	41300	79100	153000	1
Health Care Administration	38800	60600	0.562	34600	45600	78800	101000	1
Spanish	34000	53100	0.562	31000	40000	76800	96400	1
Music	35900	55000	0.532	26700	40200	88000	134000	1
Religion	34100	52000	0.525	29700	36500	70900	96400	1
Information Technology (IT)	49100	74800	0.523	44500	56700	96700	129000	2
Hospitality & Tourism	37800	57500	0.521	35500	43600	81900	124000	1
Education	34900	52000	0.490	29300	37900	73400	102000	1
Interior Design	36100	53200	0.474	35700	42600	72500	107000	1
Nutrition	39900	55300	0.386	33900	44500	70500	99200	1
Nursing	54200	67000	0.236	47600	56400	80900	98300	1
Physician Assistant	74300	91700	0.234	66400	75200	108000	124000	3