1. Which college majors will pay the bills?

Wondering if that Philosophy major will really help you pay the bills? Think you're set with an Engineering degree? Choosing a college major is a complex decision evaluating personal interest, difficulty, and career prospects. Your first paycheck right out of college might say a lot about your salary potential by mid-career. Whether you're in school or navigating the postgrad world, join me as we explore the short and long term financial implications of this major decision.

In this notebook, we'll be using data collected from a year-long survey of 1.2 million people with only a bachelor's degree by PayScale Inc., made available here by the Wall Street Journal for their article Ivy League's Big Edge: Starting Pay. After doing some data clean up, we'll compare the recommendations from three different methods for determining the optimal number of clusters, apply a k-means clustering analysis, and visualize the results.

To begin, let's prepare by loading the following packages: tidyverse, dplyr, readr, ggplot2, cluster, and factoextra. We'll then import the data from degrees-that-pay-back.csv (which is stored in a folder called datasets), and take a quick look at what we're working with.

In [385]:
# Load relevant packages
library(tidyr)
library(dplyr)
library(readr)
library(ggplot2)
library(cluster)
library(factoextra)

# Read in the dataset
degrees <- read_csv('datasets/degrees-that-pay-back.csv', 
                    col_names=c('College.Major', 'Starting.Median.Salary', 'Mid.Career.Median.Salary', 
                                'Career.Percent.Growth', 'Percentile.10', 'Percentile.25', 'Percentile.75', 'Percentile.90'), skip=1)

# Display the first few rows and a summary of the data frame
head(degrees)
summary(degrees)
Parsed with column specification:
cols(
  College.Major = col_character(),
  Starting.Median.Salary = col_character(),
  Mid.Career.Median.Salary = col_character(),
  Career.Percent.Growth = col_double(),
  Percentile.10 = col_character(),
  Percentile.25 = col_character(),
  Percentile.75 = col_character(),
  Percentile.90 = col_character()
)
College.MajorStarting.Median.SalaryMid.Career.Median.SalaryCareer.Percent.GrowthPercentile.10Percentile.25Percentile.75Percentile.90
Accounting $46,000.00 $77,100.00 67.6 $42,200.00 $56,100.00 $108,000.00 $152,000.00
Aerospace Engineering$57,700.00 $101,000.00 75.0 $64,300.00 $82,100.00 $127,000.00 $161,000.00
Agriculture $42,600.00 $71,900.00 68.8 $36,300.00 $52,100.00 $96,300.00 $150,000.00
Anthropology $36,800.00 $61,500.00 67.1 $33,800.00 $45,500.00 $89,300.00 $138,000.00
Architecture $41,600.00 $76,800.00 84.6 $50,600.00 $62,200.00 $97,000.00 $136,000.00
Art History $35,800.00 $64,900.00 81.3 $28,800.00 $42,200.00 $87,400.00 $125,000.00
 College.Major      Starting.Median.Salary Mid.Career.Median.Salary
 Length:50          Length:50              Length:50               
 Class :character   Class :character       Class :character        
 Mode  :character   Mode  :character       Mode  :character        
                                                                   
                                                                   
                                                                   
 Career.Percent.Growth Percentile.10      Percentile.25      Percentile.75     
 Min.   : 23.40        Length:50          Length:50          Length:50         
 1st Qu.: 59.12        Class :character   Class :character   Class :character  
 Median : 67.80        Mode  :character   Mode  :character   Mode  :character  
 Mean   : 69.27                                                                
 3rd Qu.: 82.42                                                                
 Max.   :103.50                                                                
 Percentile.90     
 Length:50         
 Class :character  
 Mode  :character  
                   
                   
                   

2. Currency and strings and percents, oh my!

Notice that our salary data is in currency format, which R considers a string. Let's strip those special characters using the gsub function and convert all of our columns except College.Major to numeric.

While we're at it, we can also convert the Career.Percent.Growth column to a decimal value.

In [387]:
# Clean up the data
degrees_clean <- degrees %>% mutate_at(vars(Starting.Median.Salary:Percentile.90), function(x) as.numeric(gsub("[\\$,]","",x)) ) %>% 
  mutate(Career.Percent.Growth=Career.Percent.Growth/100)

3. The elbow method

Great! Now that we have a more manageable dataset, let's begin our clustering analysis by determining how many clusters we should be modeling. The best number of clusters for an unlabeled dataset is not always a clear-cut answer, but fortunately there are several techniques to help us optimize. We'll work with three different methods to compare recommendations:

  • Elbow Method
  • Silhouette Method
  • Gap Statistic Method

First up will be the Elbow Method. This method plots the percent variance against the number of clusters. The "elbow" bend of the curve indicates the optimal point at which adding more clusters will no longer explain a significant amount of the variance. To begin, let's select and scale the following features to base our clusters on: Starting.Median.Salary, Mid.Career.Median.Salary, Perc.10, and Perc.90. Then we'll use the fancy fviz_nbclust function from the factoextra library to determine and visualize the optimal number of clusters.

In [389]:
# Select and scale the relevant features and store as k_means_data
k_means_data <- degrees_clean %>%
  select(Starting.Median.Salary, Mid.Career.Median.Salary, 
         Percentile.10, Percentile.90) %>% scale()

# Run the fviz_nbclust function with our selected data and method "wss"
elbow_method <- fviz_nbclust(k_means_data, FUNcluster=kmeans, method="wss")

# View the plot
elbow_method

4. The silhouette method

Wow, that fviz_nbclust function was pretty nifty. Instead of needing to "manually" apply the elbow method by running multiple k_means models and plotting the calculated the total within cluster sum of squares for each potential value of k, fviz_nbclust handled all of this for us behind the scenes. Can we use it for the Silhouette Method as well? The Silhouette Method will evaluate the quality of clusters by how well each point fits within a cluster, maximizing average "silhouette" width.

In [391]:
# Run the fviz_nbclust function with the method "silhouette" 
silhouette_method <- fviz_nbclust(k_means_data, FUNcluster=kmeans, method="silhouette")

# View the plot
silhouette_method

5. The gap statistic method

Marvelous! But hmm, it seems that our two methods so far disagree on the optimal number of clusters... Time to pull out the tie breaker.

For our final method, let's see what the Gap Statistic Method has to say about this. The Gap Statistic Method will compare the total variation within clusters for different values of k to the null hypothesis, maximizing the "gap." The "null hypothesis" refers to a uniformly distributed simulated reference dataset with no observable clusters, generated by aligning with the principle components of our original dataset. In other words, how much more variance is explained by k clusters in our dataset than in a fake dataset where all majors have equal salary potential?

Fortunately, we have the clusGap function to calculate this behind the scenes and the fviz_gap_stat function to visualize the results.

In [393]:
# Use the clusGap function to apply the Gap Statistic Method
gap_stat <- clusGap(k_means_data, FUN=kmeans, nstart=25, K.max=10, B=50)

# Use the fviz_gap_stat function to vizualize the results
gap_stat_method <- fviz_gap_stat(gap_stat)

# View the plot
gap_stat_method

6. K-means algorithm

Looks like the Gap Statistic Method agreed with the Elbow Method! According to majority rule, let's use 3 for our optimal number of clusters. With this information, we can now run our k-means algorithm on the selected data. We will then add the resulting cluster information to label our original dataframe.

In [395]:
# Set a random seed
set.seed(111)

# Set k equal to the optimal number of clusters
num_clusters <- 3

# Run the k-means algorithm 
k_means <- kmeans(k_means_data, centers=num_clusters, iter.max=15, nstart=25)

# Label the clusters of degrees_clean
degrees_labeled <- degrees_clean %>%
    mutate(clusters = k_means$cluster)

7. Visualizing the clusters

Now for the pretty part: visualizing our results. First let's take a look at how each cluster compares in Starting vs. Mid Career Median Salaries. What do the clusters say about the relationship between Starting and Mid Career salaries?

In [397]:
career_growth <- ggplot(degrees_labeled, aes(x=Starting.Median.Salary,
                                             y=Mid.Career.Median.Salary,
                                             color=factor(clusters))) +
  xlab("Starting Median Salary") +
  ylab("Mid-Career Median Salary") +
  ggtitle("Salary Growth") + 
  geom_point(alpha=4/5, size=7) +
  scale_x_continuous(labels=scales::dollar) +
  scale_y_continuous(labels=scales::dollar) + 
  scale_color_manual(name="Clusters", values=c("#EC2C73", "#29AEC7", "#FFDD30"))

# View the plot
career_growth

8. A deeper dive into the clusters

Unsurprisingly, most of the data points are hovering in the top left corner, with a relatively linear relationship. In other words, the higher your starting salary, the higher your mid career salary. The three clusters provide a level of delineation that intuitively supports this.

How might the clusters reflect potential mid career growth? There are also a couple curious outliers from clusters 1 and 3... perhaps this can be explained by investigating the mid career percentiles further, and exploring which majors fall in each cluster.

Right now, we have a column for each percentile salary value. In order to visualize the clusters and majors by mid career percentiles, we'll need to reshape the degrees_labeled data using tidyr's gather function to make a percentile key column and a salary value column to use for the axes of our following graphs. We'll then be able to examine the contents of each cluster to see what stories they might be telling us about the majors.

In [399]:
# Use the gather function to reshape degrees and 
# use mutate() to reorder the new percentile column
degrees_perc <- degrees_labeled %>%
  select(College.Major, Percentile.10, Percentile.25, 
         Mid.Career.Median.Salary, Percentile.75, 
         Percentile.90, clusters) %>%
  gather(-c(College.Major, clusters), key=percentile, value=salary) %>%
  mutate(percentile=factor(percentile, levels=c("Percentile.10", "Percentile.25", 
                                                "Mid.Career.Median.Salary", 
                                                "Percentile.75", "Percentile.90")))

9. The liberal arts cluster

Let's graph Cluster 1 and examine the results. These Liberal Arts majors may represent the lowest percentiles with limited growth opportunity, but there is hope for those who make it! Music is our riskiest major with the lowest 10th percentile salary, but Drama wins the highest growth potential in the 90th percentile for this cluster (so don't let go of those Hollywood dreams!). Nursing is the outlier culprit of cluster number 1, with a higher safety net in the lowest percentile to the median. Otherwise, this cluster does represent the majors with limited growth opportunity.

An aside: It's worth noting that most of these majors leading to lower-paying jobs are women-dominated, according to this Glassdoor study. According to the research:

"The single biggest cause of the gender pay gap is occupation and industry sorting of men and women into jobs that pay differently throughout the economy. In the U.S., occupation and industry sorting explains 54 percent of the overall pay gap—by far the largest factor."

Does this imply that women are statistically choosing majors with lower pay potential, or do certain jobs pay less because women choose them...?

In [401]:
# Graph the majors of Cluster 1 by percentile
cluster_1 <- ggplot(filter(degrees_perc, clusters==1),
                    aes(x=percentile, y=salary,
                       group=College.Major, color=College.Major)) +
                    geom_point() + geom_line() +
                    ggtitle("Cluster 1: The Liberal Arts") +
                    theme(axis.text.x=element_text(size=7, angle=25))

# View the plot
cluster_1

10. The goldilocks cluster

On to Cluster 2, right in the middle! Accountants are known for having stable job security, but once you're in the big leagues you may be surprised to find that Marketing or Philosophy can ultimately result in higher salaries. The majors of this cluster are fairly middle of the road in our dataset, starting off not too low and not too high in the lowest percentile. However, this cluster also represents the majors with the greatest differential between the lowest and highest percentiles.

In [403]:
# Modify the previous plot to display Cluster 2
cluster_2 <- ggplot(filter(degrees_perc, clusters==2),
                    aes(x=percentile, y=salary,
                        group=College.Major, color=College.Major)) +
  geom_point() + geom_line() +
  ggtitle("Cluster 2: The Goldilocks") + 
  theme(axis.text.x=element_text(size=7, angle=25))

# View the plot
cluster_2

11. The over achiever cluster

Finally, let's visualize Cluster 3. If you want financial security, these are the majors to choose from. Besides our one previously observed outlier now identifiable as Physician Assistant lagging in the highest percentiles, these heavy hitters and solid engineers represent the highest growth potential in the 90th percentile, as well as the best security in the 10th percentile rankings. Maybe those Freakonomics guys are on to something...

In [405]:
# Modify the previous plot to display Cluster 3
cluster_3 <- ggplot(filter(degrees_perc, clusters==3),
                    aes(x=percentile, y=salary,
                        group=College.Major, color=College.Major)) +
  geom_point() + geom_line() +
  ggtitle("Cluster 3: The Overachievers") + 
  theme(axis.text.x=element_text(size=7, angle=25))

# View the plot
cluster_3

12. Every major's wonderful

Thus concludes our journey exploring salary projections by college major via a k-means clustering analysis! Dealing with unsupervized data always requires a bit of creativity, such as our usage of three popular methods to determine the optimal number of clusters. We also used visualizations to interpret the patterns revealed by our three clusters and tell a story.

Which two careers tied for the highest career percent growth? While it's tempting to focus on starting career salaries when choosing a major, it's important to also consider the growth potential down the road. Keep in mind that whether a major falls into the Liberal Arts, Goldilocks, or Over Achievers cluster, one's financial destiny will certainly be influenced by numerous other factors including the school attended, location, passion or talent for the subject, and of course the actual career(s) pursued.

A similar analysis to evaluate these factors may be conducted on the additional data provided by the Wall Street Journal article, comparing salary potential by type and region of college attended. But in the meantime, here's some inspiration from xkcd for any students out there still struggling to choose a major.

In [409]:
# Sort degrees by Career.Percent.Growth
arrange(degrees_labeled, desc(Career.Percent.Growth))

# Identify the two majors tied for highest career growth potential
highest_career_growth <- c('....','....')
College.MajorStarting.Median.SalaryMid.Career.Median.SalaryCareer.Percent.GrowthPercentile.10Percentile.25Percentile.75Percentile.90clusters
Math 45400 92400 1.035 45200 64200 128000 183000 2
Philosophy 39900 81200 1.035 35500 52800 127000 168000 2
International Relations 40900 80900 0.978 38200 56000 111000 157000 2
Economics 50100 98600 0.968 50600 70600 145000 210000 3
Marketing 40800 79600 0.951 42100 55600 119000 175000 2
Physics 50300 97300 0.934 56000 74200 132000 178000 3
Political Science 40800 78200 0.917 41200 55300 114000 168000 2
Chemistry 42600 79900 0.876 45300 60700 108000 148000 2
Journalism 35600 66700 0.874 38400 48300 97700 145000 1
Architecture 41600 76800 0.846 50600 62200 97000 136000 2
Finance 47900 88300 0.843 47200 62100 128000 195000 2
Communications 38100 70000 0.837 37500 49700 98800 143000 2
Geology 43500 79500 0.828 45000 59600 101000 156000 2
Art History 35800 64900 0.813 28800 42200 87400 125000 1
History 39200 71000 0.811 37000 49200 103000 149000 2
Film 37900 68500 0.807 33900 45500 100000 136000 1
Aerospace Engineering 57700 101000 0.750 64300 82100 127000 161000 3
Computer Engineering 61400 105000 0.710 66100 84100 135000 162000 3
Computer Science 55900 95500 0.708 56000 74900 122000 154000 3
English 38000 64700 0.703 33400 44800 93200 133000 1
Chemical Engineering 63200 107000 0.693 71900 87300 143000 194000 3
Electrical Engineering 60900 103000 0.691 69300 83800 130000 168000 3
Agriculture 42600 71900 0.688 36300 52100 96300 150000 2
Psychology 35900 60400 0.682 31600 42100 87500 127000 1
Civil Engineering 53900 90500 0.679 63400 75100 115000 148000 3
Business Management 43000 72100 0.677 38800 51500 102000 147000 2
Accounting 46000 77100 0.676 42200 56100 108000 152000 2
Graphic Design 35700 59800 0.675 36000 45500 80800 112000 1
Management Information Systems (MIS)49200 82300 0.673 45300 60500 108000 146000 2
Anthropology 36800 61500 0.671 33800 45500 89300 138000 1
Biology 38800 64800 0.670 36900 47400 94500 135000 1
Construction 53700 88900 0.655 56300 68100 118000 171000 3
Industrial Engineering 57700 94700 0.641 57100 72300 132000 173000 3
Mechanical Engineering 57900 93600 0.617 63700 76200 120000 163000 3
Criminal Justice 35000 56300 0.609 32200 41600 80700 107000 1
Forestry 39100 62600 0.601 41000 49300 78200 111000 1
Sociology 36500 58200 0.595 30700 40400 81200 118000 1
Geography 41200 65500 0.590 40000 50000 90800 132000 1
Drama 35900 56900 0.585 36700 41300 79100 153000 1
Health Care Administration 38800 60600 0.562 34600 45600 78800 101000 1
Spanish 34000 53100 0.562 31000 40000 76800 96400 1
Music 35900 55000 0.532 26700 40200 88000 134000 1
Religion 34100 52000 0.525 29700 36500 70900 96400 1
Information Technology (IT) 49100 74800 0.523 44500 56700 96700 129000 2
Hospitality & Tourism 37800 57500 0.521 35500 43600 81900 124000 1
Education 34900 52000 0.490 29300 37900 73400 102000 1
Interior Design 36100 53200 0.474 35700 42600 72500 107000 1
Nutrition 39900 55300 0.386 33900 44500 70500 99200 1
Nursing 54200 67000 0.236 47600 56400 80900 98300 1
Physician Assistant 74300 91700 0.234 66400 75200 108000 124000 3