Burtch Works Foreword
After examining the results of our 2018 SAS, R, or Python flash survey to determine how the tool preferences of data scientists and analytics professionals vary by a myriad of factors, we decided to turn our dataset over to an actual data scientist to see what else could be found in the data!
This blog has some of his analysis, but you can also learn more in this 20-minute video, where he shares more detail about his methodology and key takeaways about what factors influence quantitative professionals to prefer certain tools over others.
SAS, R, or Python: Examining Preference Predictors
This blog post is contributed by Howard Friedman, Chief Data Scientist at DataMed Solutions and Adjunct Professor at Columbia University where he teaches graduate-level data science and applied statistics courses.
For this analysis, Burtch Works provided me with an anonymized dataset including the tool preference of nearly 1,200 data scientists and analytics professionals, plus a variety of other factors such as company, industry, region, years’ experience, education and degree specialty, and more. This survey does not ask questions about how the tools are used or why a specific tool is preferred – focusing only on which tool is preferred of the three options: SAS, R, or Python. Also, it is important to note that this is not a random survey, but rather suffers from selection bias as the respondents are a subset of Burtch Works’ network and not necessarily representative of the US data scientist population.
Burtch Works’ original analysis looked at the relationship between tool preference and one factor at a time, such as years’ experience, industry, or region. This can be viewed as a one-dimensional or “main effects” analysis. They then shared the anonymized dataset with me so that I could take a closer look!
What I’ve done here and in the video is basic Exploratory Data Analysis (EDA), so unfortunately there are no exciting Gaussian mixed models, CNNs, SVMs etc., but rather it is a systematic approach to examining which factors are the greatest predictors of a professional’s tool preference.
Segmenting Respondents by Years’ Experience
In Burtch Works’ initial analysis, they found that tool preferences can shift dramatically depending on a professional’s years of quantitative work experience. Whereas the original analysis focused on only three groupings (0-5 years, 6-15 years, and 16+ years), I used a univariate CART model to determine where the ideal cut points should be.
Classification and regression tree modeling showed that around 10 years of experience was the most important “cut point”, or the point at which tool preferences seemed to shift the most dramatically.
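To make the idea of a “cut point” concrete, here is a minimal sketch of how a CART-style split on a single numeric variable can be chosen. It assumes Gini impurity as the split criterion and uses a small made-up dataset; the function names (gini, best_cut_point) and the toy values are illustrative assumptions, not the survey data or the exact software used in this analysis.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_cut_point(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    distinct = sorted(set(v for v, _ in pairs))
    best = (None, float("inf"))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Made-up toy data for illustration only (not the Burtch Works survey).
years = [1, 2, 3, 4, 6, 8, 12, 15, 18, 22, 25, 30]
tool = ["Python", "Python", "R", "Python", "R", "Python",
        "SAS", "SAS", "SAS", "SAS", "R", "SAS"]

threshold, score = best_cut_point(years, tool)
print(threshold, round(score, 3))
```

A full CART implementation repeats this search recursively within each resulting group; the sketch only shows the first split, which is the one that identifies the most important cut point.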
Informed by the CART analysis, I selected 5-year increments of experience for some of the Exploratory Data Analysis. Using these increments, we can see that Python is most popular among those with 0-5 years’ experience, while SAS dominates among those with 11+ years’ experience, and its lead grows as years of experience increase.
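As a rough illustration of this kind of tabulation, the sketch below groups respondents into experience bands and computes each tool’s share within a band. The band boundaries (0-5, 6-10, 11-15, and so on), the function names, and the sample records are assumptions made for the example, not details taken from the survey.

```python
from collections import Counter, defaultdict

def experience_band(years, width=5):
    """Map whole years of experience to a band label such as '0-5' or '6-10'."""
    if years <= width:
        return f"0-{width}"
    lower = ((years - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"

def preference_shares(records):
    """records: iterable of (years, tool). Returns {band: {tool: share}}."""
    counts = defaultdict(Counter)
    for years, tool in records:
        counts[experience_band(years)][tool] += 1
    return {
        band: {tool: n / sum(c.values()) for tool, n in c.items()}
        for band, c in counts.items()
    }

# Made-up records for illustration only.
sample = [(2, "Python"), (3, "R"), (7, "R"), (12, "SAS"), (14, "SAS")]
for band, shares in preference_shares(sample).items():
    print(band, shares)
```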
Examining Company Type
I also wanted to analyze some of the factors that were not covered in Burtch Works’ previous analysis. After cleaning the Company Name field, I mapped the company names to the Fortune 500 list to see whether being at a larger company correlates with tool preference.
As one might imagine, there is some tendency for Fortune 500 employees to prefer SAS. This may reflect the fact that SAS’s proprietary software likely has a stronger foothold in larger companies while free software such as Python and R may be more appealing to smaller companies.
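For readers curious about the name-matching step, here is a minimal sketch of one way to clean free-text company names before comparing them with a reference list. The suffix list, the function names, and the placeholder company names are all hypothetical; the actual cleaning rules used for the survey data are not described here and were likely more involved.

```python
import re

# Hypothetical set of legal suffixes to ignore when comparing names.
SUFFIXES = {"inc", "incorporated", "corp", "corporation",
            "co", "company", "llc", "ltd"}

def normalize_company(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in SUFFIXES]
    return " ".join(tokens)

def is_on_list(name, reference_names):
    """True if the cleaned name matches a cleaned name on the reference list."""
    reference = {normalize_company(n) for n in reference_names}
    return normalize_company(name) in reference

# Placeholder names for illustration only (not real survey entries).
reference_list = ["Acme Corporation", "Globex, Inc."]
print(is_on_list("ACME Corp.", reference_list))
print(is_on_list("Initech LLC", reference_list))
```

Exact matching after normalization is deliberately conservative: it misses abbreviations and subsidiaries, so a manual review of unmatched names would still be needed.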
Examining Degree Specialty
Another field I was eager to explore was the degree specialty, or what a professional studied at university. I took the text field for the highest degree achieved, then assigned each respondent to one of a set of mutually exclusive categories based on the text in the degree title.
A hierarchy was applied to ensure that assignment was mutually exclusive with an ordering of Data Science, Statistics, Analytics, Operations, etc. meaning if the person’s highest degree title said Data Science, then any other category also mentioned in the title was ignored. Below you can see the survey respondents grouped by degree specialty.
The largest easily identifiable degree specialty was Statistics: 18% of respondents had the word Statistics in their degree title (Statistics, Biostatistics, Applied Statistics, etc.). Interestingly, only about 2% of the respondents have a degree that contains “Data Science”, which may reflect the relative newness of the degree, the selection bias of the sample, and the recent tendency of many professionals to rebrand themselves as “Data Scientists” in response to market demand.
Python preference is highest among those with degrees in Computer Science, Data Science and Engineering, while SAS is highest among those with degrees in Business/Business Administration, Statistics, and Economics. These preferences may reflect the tools used in school as well as the differences in job functions.
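The hierarchical assignment of degree titles to categories described in this section can be sketched as an ordered keyword lookup, where the first matching category wins. The order below follows the one given above (Data Science, Statistics, Analytics, Operations); the specific keywords, the “Other” fallback, and the function name are assumptions for illustration, and the real mapping covered more categories.

```python
# Ordered from highest to lowest priority; the first match wins.
# Keywords are illustrative assumptions, not the actual rules used.
HIERARCHY = [
    ("Data Science", ["data science"]),
    ("Statistics", ["statistic"]),   # also matches "Biostatistics"
    ("Analytics", ["analytics"]),
    ("Operations", ["operations"]),
]

def degree_specialty(title):
    """Assign a degree title to exactly one category using the hierarchy."""
    lowered = title.lower()
    for category, keywords in HIERARCHY:
        if any(keyword in lowered for keyword in keywords):
            return category
    return "Other"

# Made-up degree titles for illustration only.
for title in ["MS in Data Science and Statistics",
              "PhD Biostatistics",
              "MS Business Analytics",
              "BA History"]:
    print(title, "->", degree_specialty(title))
```

Because the list is checked in order, a title mentioning both Data Science and Statistics lands in Data Science only, which keeps the categories mutually exclusive.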
Examining Preference Predictors and Analysis Details
Next, I used univariate logistic regression modeling and CART modeling to look at which variables were the strongest predictors of tool preference. Both approaches produced the same order of variable importance: years’ experience was the strongest predictor, and the designation of predictive analytics professional (PAP) versus data scientist (DS) was the second strongest.
Predictive analytics professional vs. data scientist is a designation made by Burtch Works based on their assessment of a respondent’s typical work, including use of unstructured data, data size, deployment of models, etc.
Burtch Works separates data scientists from traditional predictive analytics professionals because of the measured differences in salaries, tool usage, data volume and structure, and a variety of other factors. The main difference for our purposes is that data scientists are defined as those working primarily with unstructured or streaming data, while traditional predictive analytics professionals work primarily with structured data.
There are some concerns that the PAP/DS designation (one that was assigned by Burtch Works and not provided by the respondents) may be partially circular in predicting tool preference. One way to understand this more deeply would be for future surveys to ask which tasks the individual performs most often and why they choose a specific tool.
I explored two simple approaches to multivariable modeling: CART modeling and multinomial logistic regression with multiple variables (main effects only).
Logistic regression: Once a model included Years’ Experience, PAP/DS, and Degree Specialty, adding the other variables (Region, Highest Degree, Industry, Fortune 500) provided limited incremental value.
To learn more about this analysis, I highly recommend watching the 20-minute video, where I dive into more detail about preference predictors and how they vary!
In the video I also discuss in detail what factors, aside from those covered in the dataset, might impact a quantitative professional’s tool preference.
Thanks again to Burtch Works for letting me tinker with this dataset. It was quite interesting for me to examine!