Machine Learning in Revenue Operations
I am always shocked by how few of my RevOps colleagues have machine learning capabilities.
“Why do we need machine learning for RevOps?”
Let me count the ways…
Lead Scoring and Prioritization: ML algorithms analyze historical lead data (demographics, behavior, engagement) to predict which leads are most likely to convert. This helps sales teams focus their efforts on high-potential prospects.
Sales Forecasting: Machine learning models can analyze past sales data, market trends, and even external factors (like economic indicators) to create more accurate revenue forecasts. This aids in resource allocation and planning.
Churn Prediction: ML models identify patterns in customer behavior (usage, support tickets, engagement) that indicate a higher risk of churn. Proactive interventions can then be initiated to retain valuable customers.
Pricing Optimization: By analyzing market data, competitor pricing, and customer behavior, machine learning can suggest optimal pricing strategies to maximize revenue while remaining competitive.
Customer Segmentation: ML-powered segmentation goes beyond basic demographics, grouping customers based on their behaviors and preferences. This allows for more personalized marketing and sales approaches.
Recommendation Engines: In upselling and cross-selling, machine learning can suggest relevant products or services to customers based on their purchase history and behavior, increasing the average order value.
Content Optimization: ML can analyze how different types of content perform with specific customer segments, helping marketing teams tailor their messaging for maximum impact.
Sales Rep Performance Analysis: By analyzing sales call transcripts, emails, and CRM data, ML can identify patterns in successful sales reps' behaviors, which can then be used to train and improve the performance of the entire team.
Those are eight powerful use cases.
Some of them are achievable with old-fashioned statistics and other traditional methods, but can still be improved with ML.
Ultimately, we use ML to gain insight from data and then use that insight to perform better, thanks to an algorithmic understanding of the future and/or a better description of today.
Learning Machine Learning
Rather than practice on randomly generated Lead data or Opportunity conversion statistics, let’s speed-run through a classic beginner’s exercise in ML - the Titanic Survivor Prediction Algorithm.
As the name implies, our goal is to use data about survival outcomes (rows of passenger data from the Titanic) to build a model that predicts the survival probability of any given passenger.
This exercise comes from Kaggle.com, which hosts competitions, stores datasets, and provides a proving ground to build models and discuss them with fellow data scientists:
The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).
To execute this exercise you need two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. Kaggle shares the data.
Here’s the basic process at a high-level:
Data Preprocessing: Handle missing values, convert categorical data to numerical values, and possibly scale data.
Model Training: Use the training dataset to train your model.
Model Evaluation: Use cross-validation on the training dataset to assess performance.
Prediction: Apply the trained model to the test dataset to predict survival.
Model Tuning: Tune hyperparameters and possibly ensemble multiple models to improve prediction accuracy.
One dataset is titled train.csv and the other is titled test.csv.
train.csv contains the details of a subset of the Titanic’s passenger list - 891 passengers in total. This data indicates whether each passenger survived or not, also known as the “ground truth” — this is our target; we’re aiming to build a model that predicts survival.
The test.csv dataset contains similar information but does not disclose the ground truth for each passenger. Using the patterns in the train.csv data, ML practitioners can predict whether the other 418 passengers on board (found in test.csv) survived.
Building a Crystal Ball
The truth is we don’t need ML to make predictions here.
You need look no further than gender…
Ran these numbers myself this morning.
Women had a 3 out of 4 shot of surviving the ordeal.
Fewer than 1 in 5 men survived.
No wonder Rose was the one narrating the movie Titanic.
If your “algorithm” were simply to predict that all the men died and all the women survived… you’d post pretty strong results.
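If you want to check my math, here’s a quick sketch using pandas, assuming you’ve downloaded train.csv from Kaggle:

import pandas as pd

train_data = pd.read_csv("train.csv")

# Survival rate by gender: roughly 3 in 4 for women, under 1 in 5 for men
print(train_data.groupby("Sex")["Survived"].mean())

# The gender-only "algorithm": predict every woman survived, every man did not
baseline = (train_data["Sex"] == "female").astype(int)
print((baseline == train_data["Survived"]).mean())  # about 0.79 on the training data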
We can do much better, though!
To solve this prediction problem, you can employ various machine learning models. Each model will have its strengths and weaknesses, and choosing the right one can depend on the nuances of the data and the specific characteristics you want to emphasize in the analysis.
Let’s run through the candidates in order of escalating complexity… except the final one. Random Forest comes last because it’s the approach I went with to solve this problem.
1) Logistic Regression
Use Case: This is a good starting model for binary classification problems. It's interpretable and efficient, which can help quickly identify significant features like sex, age, or passenger class.
Logistic Regression is a statistical method used for binary classification tasks, which predicts the probability that an observation belongs to one of two classes. It models the probability that a given input point belongs to a particular class as a function of the input variables. This is achieved using a logistic function, which is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.
Logistic regression is simple and efficient for problems with a linear boundary between classes. It is less prone to overfitting but can struggle with complex relationships in data, which can be mitigated by using techniques such as regularization. It is widely used in various fields like medicine for predicting disease incidence, in finance for credit scoring, and in marketing for predicting customer retention.
Strength: Easy to implement and interpret.
Weakness: May underperform if relationships are non-linear or if there are complex interactions between features.
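For illustration, here’s a minimal baseline sketch with scikit-learn, using the same four features the Random Forest script later in this article relies on:

import pandas as pd
from sklearn.linear_model import LogisticRegression

train_data = pd.read_csv("train.csv")
y = train_data["Survived"]
X = pd.get_dummies(train_data[["Pclass", "Sex", "SibSp", "Parch"]])

# C is the inverse of regularization strength; smaller C penalizes complexity harder
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X, y)

# The interpretability payoff: sign and size of each coefficient hint at its influence
for name, coef in zip(X.columns, model.coef_[0]):
    print(f"{name}: {coef:+.2f}")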
2) Decision Trees
Use Case: Suitable for capturing non-linear relationships and interactions between variables without needing extensive data transformation.
Decision Trees are a non-parametric supervised learning method used for classification and regression tasks. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A decision tree is represented as a binary tree structure. It divides a dataset into smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
A decision node has two or more branches, each representing values for the attribute tested. A leaf node represents a final decision on the target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees handle both categorical and continuous data.
The tree is constructed by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when the subset at a node has all the same value of the target variable, or when splitting no longer adds value to the predictions.
These models are popular due to their simplicity and the fact that they require very little data preparation. However, they are also prone to overfitting, especially with very complex trees. Techniques such as pruning (removing parts of the tree that don’t provide additional predictive power) are necessary to avoid this problem.
I really enjoy using Decision Trees, and we’re going to explore an evolved version of them that addresses their weak generalization capability.
Strength: Easy to understand and interpret; no need for feature scaling.
Weakness: Prone to overfitting; might not generalize well without proper tuning (e.g., setting max depth).
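A minimal sketch, reusing the X and y from the logistic regression example above; max_depth here acts as a crude stand-in for pruning:

from sklearn.tree import DecisionTreeClassifier, export_text

# Capping depth is the simplest guard against overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X, y)

# Print the learned decision rules, readable straight off the tree
print(export_text(tree, feature_names=list(X.columns)))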
3) Gradient Boosting Machines (GBM)
Use Case: Builds an ensemble of weak learners (typically decision trees) sequentially, with each new learner correcting its predecessors’ errors, often leading to high accuracy.
Gradient Boosting Machines (GBM) are a powerful machine-learning technique that builds on decision trees by optimizing a loss function. The key idea is to construct new trees that predict the residuals or errors of prior trees and then combine these trees in an additive model. Each new tree makes up for the deficiencies in the previously built trees.
In gradient boosting, trees are built sequentially such that each subsequent tree aims to reduce the errors of the previous trees. Trees are added one at a time, and existing trees in the model are not changed. A gradient descent procedure is used to minimize the loss when adding trees. Typically, shallow trees are used, such as those with only a few levels, which makes the model robust to overfitting and allows it to generalize well.
GBMs can be used for both regression and classification problems. They have been shown to be very effective, often winning many machine learning competitions. The strength of the model can be controlled by the number of trees, the depth of the trees, and the learning rate, among other parameters. However, they can be more sensitive to overfitting if the data is noisy. Training generally takes longer because trees are built sequentially. They are also harder to tune due to the increased number of hyperparameters.
Strength: Provides high predictive accuracy and can handle different types of feature data effectively.
Weakness: Can be computationally intensive and prone to overfitting if not tuned correctly.
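A minimal sketch with scikit-learn’s GradientBoostingClassifier, again reusing X and y; the exact parameter values are just illustrative starting points:

from sklearn.ensemble import GradientBoostingClassifier

# Shallow trees added one at a time; learning_rate scales each tree's correction
# and trades off against n_estimators
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3, random_state=1)
gbm.fit(X, y)
print(gbm.score(X, y))  # training accuracy - an optimistic estimate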
4) Support Vector Machines (SVM)
Use Case: Effective in high-dimensional spaces and ideal for classification boundaries with a clear margin of separation.
Support Vector Machines (SVM) are a set of supervised learning methods used for classification, regression, and outlier detection. The main idea of SVM is to find a hyperplane in an N-dimensional space (N — the number of features) that distinctly classifies the data points.
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. The optimal hyperplane is the one that has the maximum margin, i.e., the maximum distance between data points of both classes. SVM finds this hyperplane using support vectors and margins. Support vectors are the data points nearest to the hyperplane; the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of the data set.
The advantage of SVM is its effectiveness in high-dimensional spaces. Even in situations where the number of dimensions exceeds the number of samples, SVMs are robust and effective.
Strength: Effective in complex classification problems with clear margins of separation in higher dimensions.
Weakness: Requires feature scaling, computationally intensive for large datasets, and sensitive to the choice of kernel.
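Because of the scaling requirement, the idiomatic scikit-learn move is to chain a scaler and the classifier in a pipeline. A minimal sketch, reusing X and y:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# StandardScaler handles the required feature scaling; the RBF kernel lets
# the decision boundary bend around non-linear structure
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X, y)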
5) Neural Networks
Use Case: Good for capturing complex patterns through layers of learning — gains value if you have a very large dataset or can engineer features that capture deep interactions.
Neural Networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated.
Neural Networks consist of layers of interconnected nodes. Each node is a neuron that uses a nonlinear activation function. Neural networks rely on training data to learn and improve their accuracy over time. Once trained, neural networks can be applied to tasks like speech recognition, image recognition, and forecasting.
One of the key features of neural networks is their ability to learn and model non-linear and complex relationships. This makes them extremely flexible and powerful. They can learn to perform tasks by considering examples, generally without being programmed with task-specific rules.
For example, a neural network for handwriting recognition is defined by a set of input neurons which may be activated by the pixels of an input image. After being processed by a layer (likely multiple layers) of neurons, the output at the other end is the identification of the image as a particular character.
Neural networks can be trained with several algorithms, most notably backpropagation. This involves providing the network with inputs and known outputs, then adjusting the weights (parameters) by propagating the errors back through the system.
The complexity of neural networks makes them particularly good at modeling complex patterns, but this complexity can also lead to challenges such as overfitting and lengthy training times. Advances in computing power, particularly GPU computing, have played a key role in making neural networks a central component in deep learning, which has been responsible for many of the recent advancements in artificial intelligence.
Strength: Can model highly non-linear relationships.
Weakness: Requires significant data preprocessing, prone to overfitting, and less interpretable.
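For a tabular dataset of 891 rows, a compact multi-layer perceptron is about as much neural network as the problem warrants. A minimal sketch with scikit-learn, reusing X and y; the layer sizes are just illustrative:

from sklearn.neural_network import MLPClassifier

# Two small hidden layers; max_iter raised so the optimizer can converge
nn = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=1)
nn.fit(X, y)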
6) Random Forest
Use Case: An ensemble of decision trees that improves generalization by averaging multiple deep decision trees, reducing the risk of overfitting.
Random Forest is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
The fundamental concept behind random forest is to combine the predictions of several decision trees to produce a more accurate and stable prediction. Each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set.
When splitting a node during the construction of the tree, the best split is found either from all input features or a random subset of them. This adds diversity to the model, which is why the performance of the forest usually increases with the number of trees in the ensemble.
Random Forests are very handy because they require very little input preparation. They can handle binary features, categorical features, and numerical features without the need for scaling. They also perform well on large datasets and maintain accuracy even when a large part of the data is missing.
Strength: Better performance and generalization than a single decision tree.
Weakness: Less interpretable compared to simple decision trees.
I began my preparation for this article with a Random Forest approach. Let’s start by walking through this relatively simple and readable Python script.
The script begins by importing the RandomForestClassifier class from the scikit-learn library. This class is a powerful machine learning algorithm used for classification tasks (in this case, predicting whether a passenger survived or not).
y = train_data["Survived"]
This line extracts the target variable (Survived) from your training dataset. This is what we want to predict.
features = ["Pclass", "Sex", "SibSp", "Parch"]
This defines the features (passenger class, sex, number of siblings/spouses aboard, number of parents/children aboard) that will be used to make predictions.
X = pd.get_dummies(train_data[features])
This uses pd.get_dummies to create “dummy” variables for the categorical features (transforms “Sex” into “Sex_male” and “Sex_female”, etc…). This is necessary because most machine learning algorithms work with numerical data.
X_test = pd.get_dummies(test_data[features])
This applies the same dummy variable transformation to your testing data.
Next we actually configure the model.
model = RandomForestClassifier(n_estimators=250, max_depth=5, random_state=1)
model.fit(X, y)
This line creates a RandomForestClassifier object. The parameters (n_estimators, max_depth, random_state) control the complexity and reproducibility of the model. Here’s what they control:
n_estimators=250 specifies the number of trees in the forest. In this case, 250 trees will be generated.
max_depth=5 sets the maximum depth of each tree. Limiting the depth of each tree helps prevent overfitting, making the model simpler by allowing only 5 splits in each decision path from root to leaf.
random_state=1 sets the seed used by the random number generator, ensuring that the random processes in the algorithm (like splitting at nodes and selecting features) are repeatable and consistent across different runs.
model.fit(X, y)
By executing this code, the model object becomes a trained random forest classifier, ready to predict the classes of new, unseen data using the learned decision rules encapsulated within the 250 trees of the model. This approach is widely appreciated for its robustness to noise and its ability to handle large datasets with many input variables.
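Putting the pieces together, here’s the whole script assembled end to end, with the imports and the prediction step added (writing submission.csv is the standard Kaggle workflow, not something unique to this model):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

# Target and features, exactly as walked through above
y = train_data["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# 250 depth-limited trees, reproducible via the fixed seed
model = RandomForestClassifier(n_estimators=250, max_depth=5, random_state=1)
model.fit(X, y)

# Predict survival for the 418 test passengers and write a submission file
predictions = model.predict(X_test)
output = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": predictions})
output.to_csv("submission.csv", index=False)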
Polishing the Crystal Ball
There are a great many steps we can take to improve the power of our predictive systems.
Data Cleaning and Preprocessing
The most important step is maximizing the value of your data. You start by cleaning it, and once it’s hygienic, you can engineer it to extract more signal.
Let’s look at some examples of this step from the Titanic problem.
Handle Missing Data
Age: Consider imputing missing ages based on other features such as class, sex, or title derived from the name.
Cabin: Given the high percentage of missing values (77%), either drop this column or create a binary feature indicating whether cabin information is available.
Embarked: Since only a small percentage is missing, impute with the most common value or based on other correlated features like fare or class.
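Here’s one way those choices might look in pandas, assuming train_data has been loaded from train.csv; the grouping used for the Age imputation is just one reasonable option:

# Age: impute with the median age of each (Pclass, Sex) group
train_data["Age"] = train_data.groupby(["Pclass", "Sex"])["Age"].transform(
    lambda s: s.fillna(s.median())
)

# Cabin: mostly missing, so keep only a binary "was a cabin recorded" flag
train_data["HasCabin"] = train_data["Cabin"].notna().astype(int)

# Embarked: only two rows are missing in train.csv, so fill with the mode
train_data["Embarked"] = train_data["Embarked"].fillna(train_data["Embarked"].mode()[0])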
Feature Engineering
Title Extraction: Extract titles (Mr, Mrs, Miss, Master, etc.) from names which can provide hints about social status, age, and gender — remember, we are missing some values!
Family Size: Create a new feature by combining SibSp and Parch to indicate the total number of family members on board.
Categorical Encoding: Convert categorical variables like Sex and Embarked into numeric codes or use one-hot encoding.
Fare Bins: Group fares into categories to reduce the effects of extreme values and help the model generalize better.
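And a sketch of the engineered features themselves; the Title regex assumes Kaggle’s Name format (e.g. “Braund, Mr. Owen Harris”):

# Title: pull the honorific ("Mr", "Mrs", "Miss", ...) out of the Name column
train_data["Title"] = train_data["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Family size: siblings/spouses + parents/children + the passenger themselves
train_data["FamilySize"] = train_data["SibSp"] + train_data["Parch"] + 1

# Fare bins: quartiles blunt the influence of extreme fares
train_data["FareBin"] = pd.qcut(train_data["Fare"], 4, labels=False)

# One-hot encode the categoricals for modeling
X = pd.get_dummies(
    train_data[["Pclass", "Sex", "SibSp", "Parch", "Embarked", "Title", "FamilySize", "FareBin"]]
)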
Once your data is well-structured, we can begin to study it. We need to study it in order to select the right model architecture and optimizations.
Exploratory Data Analysis (EDA)
Survival Rate Analysis: Investigate the survival rate across different groups (like sex, Pclass, Embarked, and newly engineered features).
Correlation Check: Analyze how each feature correlates with the survival rate and with other features. This can help in understanding interdependencies and in feature selection.
Combined, these efforts give insights into which factors might be predictive of survival. These are the signals we’re hunting for.
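A few lines of pandas go a long way here (numeric_only requires a recent pandas version):

# Survival rate sliced by group
print(train_data.groupby("Sex")["Survived"].mean())
print(train_data.groupby(["Pclass", "Sex"])["Survived"].mean())

# How each numeric feature correlates with survival (and spot interdependencies)
print(train_data.corr(numeric_only=True)["Survived"].sort_values(ascending=False))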
Model Building
Start Simple: Begin modeling with logistic regression to establish a baseline.
Experiment with Complex Models: Based on the initial results, experiment with more complex models such as Random Forests and Gradient Boosting Machines to see if they offer significant improvements.
As mentioned, for this example I went with Random Forest.
Validation: Use cross-validation techniques to evaluate model performance. Stratified K-fold validation can be particularly useful here due to the class imbalance in the survived column.
The Challenge of Class Imbalance
In datasets like the Titanic survival data, you often have class imbalance, meaning one class ("survived") has significantly fewer instances than the other ("not survived").
If you use regular cross-validation, your model might primarily train on the majority class and perform poorly on the minority class, leading to misleading performance metrics.
Stratified K-fold validation addresses this issue head-on:
Preserving Class Proportions: It divides your dataset into K folds (subsets) while ensuring that each fold maintains the same proportion of class labels as your original dataset.
Balanced Evaluation: In each iteration of the cross-validation process, one fold is used for testing, and the remaining folds are used for training. This guarantees that the model is tested on a representative sample of both the majority and minority classes in every iteration.
Robust Performance Metrics: By averaging the model's performance across all K folds, you get a more accurate estimate of how well your model generalizes to unseen data, especially for the minority class.
Stratified K-fold validation will give you a better idea of how well your model can predict survival, even for the smaller group of passengers who survived. It helps prevent your model from focusing too much on the "not survived" class and neglecting the "survived" class, the road to overfitting we want to avoid.
Stratified K-fold validation is your friend when dealing with class imbalance.
It gives you a fairer and more reliable assessment of your model's ability to predict both classes, leading to better decision-making.
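In scikit-learn this takes only a few lines. A sketch, reusing the Random Forest and the X and y from earlier:

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Five folds, each preserving the roughly 38/62 survived/not-survived split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")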
Model Tuning and Evaluation
Most of these models can be tuned and controlled to produce even more power.
Use grid search or random search to find the optimal parameters for your models.
Hyperparameters are like the settings of your machine learning algorithm. They're not directly learned from the data during training but are crucial for model performance. Examples include:
Regularization Strength (C in Logistic Regression): Controls how much the model penalizes complex relationships to avoid overfitting.
Number of Trees (in Random Forest): Determines the ensemble size and can influence model complexity.
Learning Rate (in Gradient Boosting): Affects how quickly the model adjusts to the data.
The default hyperparameters for your models might not be the best fit for the Titanic dataset. Tuning them can:
Boost Model Performance: Find the hyperparameter values that lead to the most accurate survival predictions.
Prevent Overfitting: Avoid hyperparameters that make your model too complex and prone to memorizing the training data instead of generalizing well.
Optimize for Specific Goals: You can tune hyperparameters to prioritize accuracy, recall (finding all survivors), or precision (minimizing false positives).
Grid Search vs. Random Search
Both techniques are used to explore the hyperparameter space:
Grid Search: You define a grid of possible hyperparameter values. The algorithm systematically tries all combinations and picks the best one based on performance metrics. It's thorough but can be computationally expensive if your grid has a lot of surface area.
Random Search: You specify a range for each hyperparameter, and the algorithm randomly samples combinations within those ranges. It's faster than grid search and can sometimes find better solutions because it's not constrained to a fixed grid.
Grid is preferable when you have a limited hyperparameter space and want to be exhaustive.
Random search is a good option when your hyperparameter space is large, you have computational constraints, or you want to quickly explore different possibilities.
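A sketch of both approaches on the Random Forest, with a small illustrative grid; swap the scoring argument for “f1” or “roc_auc” (discussed next) if accuracy alone isn’t the goal:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {
    "n_estimators": [100, 250, 500],
    "max_depth": [3, 5, 8],
    "min_samples_leaf": [1, 2, 4],
}

# Grid search: exhaustively tries all 27 combinations
grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# Random search: samples 10 combinations from the same space
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1), param_grid, n_iter=10, cv=5, random_state=1
)
search.fit(X, y)
print(search.best_params_, search.best_score_)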
Besides accuracy, consider other metrics like Precision, Recall, F1 Score, and ROC-AUC to evaluate your models, especially since the survival class is imbalanced.
Feature Importance
Once you have a reasonably good model, analyze the feature importances. This will not only provide insights into which features are most predictive, but can also guide further feature engineering and selection efforts.
If features have very low importance scores, consider removing them from your model.
If a feature has high importance, explore ways to create new features derived from it or combine it with other relevant features.
If two features are highly correlated and both seem important, you might want to keep only one to avoid redundancy and potential overfitting.
Why Feature Importances Matter
Interpretability: They reveal which features play a significant role in the model's predictions. This can offer valuable insights into the underlying factors that influenced survival (e.g., gender, class, age).
Feature Selection: Identifying unimportant features can lead to a simpler model that's easier to interpret and potentially less prone to overfitting.
Feature Engineering: Feature importances can highlight areas where you might create new features or transform existing ones to boost model performance. For example, if 'Age' is important, you could create a 'Child' feature to distinguish young passengers.
Many machine learning algorithms offer ways to assess feature importance, although the methods often differ from model to model:
Tree-based models (e.g., Random Forest, Gradient Boosting): These models naturally calculate feature importance based on how often a feature is used for splits or how much it improves impurity measures.
Linear models (e.g., Logistic Regression): You can examine the absolute values of the coefficients to gauge the relative importance of each feature.
Permutation Importance: This technique involves shuffling the values of one feature at a time and measuring how much the model's performance drops. A larger drop indicates a more important feature.
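A sketch of the first and third approaches, using the trained Random Forest and the X and y from earlier:

from sklearn.inspection import permutation_importance

# Built-in importances from the trained forest (impurity-based)
for name, score in sorted(zip(X.columns, model.feature_importances_), key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")

# Permutation importance: shuffle one column at a time and measure the score drop
result = permutation_importance(model, X, y, n_repeats=10, random_state=1)
for name, score in zip(X.columns, result.importances_mean):
    print(f"{name}: {score:.3f}")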
Interpreting Feature Importances
It's important to interpret feature importances in the context of your model and data.
The scale of feature importances can vary depending on the algorithm and the method used. Compare them relative to each other within the same model.
Highly correlated features might share importance. If you remove one, the other might pick up its slack. Always use your understanding of the Titanic data to validate whether the feature importances make sense. Do they align with your intuition about which factors likely played a role in survival?
This structured approach can help you methodically build and refine a predictive model for the Titanic survival prediction challenge.
It has also made you much more effective at using data to create value, which is the name of the game in Revenue Operations.
Now when the forecast starts to sink, you can help the team survive and thrive!
👋 Thank you for reading Mastering Revenue Operations.
To help us continue our growth, would you please Like, Comment and Share this?
I started this newsletter in November 2023 as a central hub for the intersection of technology and revenue operations, and technology has only accelerated since. Our audience includes Fortune 500 Executives, RevOps Leaders, Venture Capitalists and Entrepreneurs.
Thank you again for spreading the message.