Introduction to Active Learning in Machine Learning

  • vazquezgz
  • May 21, 2024
  • 4 min read



Active learning is a machine learning paradigm in which the algorithm selectively chooses the data from which it learns. Unlike traditional passive learning, where the model is trained on a pre-existing labeled dataset, active learning is an interactive process: the model actively queries an oracle (often a human expert) to label new data points. This approach is particularly useful when labeled data is scarce or expensive to obtain, as it allows the model to achieve high performance with fewer labeled instances.


How Active Learning Works


The fundamental idea behind active learning is to identify the most informative data points to label, thereby improving the model's performance more efficiently. The process typically involves the following steps (sketched in code after the list):


  • Initialize the Model: Start with a small labeled dataset and train an initial model.


  • Query Selection: Use a strategy to select the most informative unlabeled instances.


  • Labeling: Query the oracle to obtain labels for the selected instances.


  • Update the Model: Retrain the model with the newly labeled data.


  • Iteration: Repeat steps 2-4 until a stopping criterion is met (e.g., model performance stabilizes).
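

Here is a minimal sketch of that loop in Python. The select_queries and query_oracle helpers are hypothetical stand-ins for a query strategy and a human labeler, and the stopping criterion is simplified to a fixed number of rounds; any scikit-learn-style classifier can serve as the model.


import numpy as np

def active_learning_loop(model, X_labeled, y_labeled, X_pool,
                         select_queries, query_oracle,
                         n_rounds=10, batch_size=20):
    model.fit(X_labeled, y_labeled)                      # Step 1: initialize
    for _ in range(n_rounds):                            # Step 5: iterate
        idx = select_queries(model, X_pool, batch_size)  # Step 2: query selection
        y_new = query_oracle(X_pool[idx])                # Step 3: labeling
        X_labeled = np.vstack((X_labeled, X_pool[idx]))  # Step 4: grow the labeled set
        y_labeled = np.concatenate((y_labeled, y_new))
        X_pool = np.delete(X_pool, idx, axis=0)
        model.fit(X_labeled, y_labeled)                  # Step 4: retrain
    return model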


Uncertainty Sampling


One of the most popular strategies in active learning is uncertainty sampling. The premise is simple: the model should query the instances about which it is most uncertain. This strategy helps the model learn from the data points that are most challenging, thereby improving its overall performance.


How Uncertainty Sampling Works


  • Initial Training: Train the model on the currently labeled instances.

  • Prediction on Unlabeled Data: Use the trained model to make predictions on all unlabeled instances.


  • Measure Uncertainty: Identify the instances for which the model is most uncertain. This can be done in several ways (see the sketch after this list):

  • Least Confidence: Select instances with the lowest predicted probability for the most likely class.

  • Margin Sampling: Select instances with the smallest difference between the two most likely class probabilities.

  • Entropy: Select instances whose predicted class distribution has the highest entropy.

  • Query Oracle: Submit the most uncertain instances to the oracle for labeling.


  • Update Model: Retrain the model with the newly labeled data and repeat the process until performance improvement becomes marginal.
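

To make these criteria concrete, here is a small sketch that computes all three measures from a matrix of predicted class probabilities. The probs array below is a toy example, not output from any particular model:


import numpy as np

# probs: shape (n_samples, n_classes), e.g. from model.predict_proba(X_pool)
probs = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.5, 0.5]])

# Least confidence: 1 minus the probability of the most likely class
least_confidence = 1 - probs.max(axis=1)

# Margin: gap between the top two class probabilities (smaller = more uncertain)
sorted_probs = np.sort(probs, axis=1)
margin = sorted_probs[:, -1] - sorted_probs[:, -2]

# Entropy: highest when the predicted distribution is flattest
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)

print(least_confidence)  # [0.1 0.4 0.5] -> the third instance is most uncertain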



This chart illustrates the process of uncertainty sampling. The red points represent instances where the model is uncertain (predicted probabilities are low), and the gray area indicates the model's uncertainty around the true function.


Other Active Learning Strategies


While uncertainty sampling is highly effective, other strategies can also be employed to select informative instances.


  1. Query by Committee (QBC):

  • This strategy maintains a committee of models trained on the current labeled data and queries the instances on which the committee members disagree most.

  2. Expected Model Change:

  • Select instances that would result in the largest change in the model parameters if labeled and added to the training set.

  3. Expected Error Reduction:

  • Select instances that are expected to result in the largest reduction in the model's validation error.

Example: Query by Committee (QBC) with SVM and Random Forest


Imagine we have an SVM and a Random Forest model trained on a subset of labeled data. For a given unlabeled instance, we compare their predictions:


  • SVM predicts class 1 with 60% probability.

  • Random Forest predicts class 0 with 55% probability.


Here, the disagreement is significant. Querying this instance and labeling it can provide valuable information, helping both models converge towards better performance.
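

Here is a minimal sketch of this idea using scikit-learn's SVC and RandomForestClassifier on synthetic data. The disagreement score below, the absolute difference between the two models' predicted probabilities for class 1, is just one simple choice; vote entropy or KL divergence are common alternatives:


import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Small labeled set plus a large unlabeled pool
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_pool, y_train, _ = train_test_split(X, y, train_size=50, random_state=0)

# Two-member committee trained on the same labeled data
svm = SVC(probability=True, random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Disagreement: how far apart the two class-1 probability estimates are
p_svm = svm.predict_proba(X_pool)[:, 1]
p_rf = forest.predict_proba(X_pool)[:, 1]
disagreement = np.abs(p_svm - p_rf)

# Query the 10 pool instances the committee disagrees on most
query_idx = np.argsort(disagreement)[-10:]
print(query_idx)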



This histogram shows the disagreement measure between two models (SVM and Random Forest). The red dashed line represents the mean disagreement. Instances with high disagreement are prime candidates for labeling in the Query by Committee strategy.


Example: Active Learning with XGBoost Classifier


Step 1: Initialize the Model


We'll start with a small labeled dataset to train an initial XGBoost classifier.


Step 2: Query Selection


We'll use uncertainty sampling to select the most uncertain instances. In this example, we'll use the least confidence criterion.


Step 3: Labeling


We'll simulate the labeling process by assigning labels from the ground truth.


Step 4: Update the Model


We'll retrain the model with the newly labeled data.


Step 5: Iteration


We'll repeat the process until a stopping criterion is met.


Here’s the Python code to demonstrate this process:


 
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=42)

# Hold out a test set for evaluation, then split the rest into a small
# labeled training set and a large "unlabeled" pool
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_pool, y_train, y_pool = train_test_split(X_rest, y_rest, test_size=0.9, random_state=42)

# Initialize the XGBoost model
# (use_label_encoder is deprecated in recent XGBoost versions, so it is omitted)
model = xgb.XGBClassifier(eval_metric='logloss')

# Initial training with the small labeled dataset
model.fit(X_train, y_train)
previous_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f'Initial accuracy: {previous_accuracy:.2f}')

# Active learning loop
num_iterations = 10
num_queries_per_iteration = 20
for i in range(num_iterations):
    # Predict probabilities on the unlabeled pool
    y_prob = model.predict_proba(X_pool)

    # Uncertainty sampling: select the instances with the lowest confidence
    # in the predicted class (least confidence criterion)
    uncertainty = 1 - np.max(y_prob, axis=1)
    query_idx = np.argsort(uncertainty)[-num_queries_per_iteration:]

    # Simulate labeling by adding the queried instances to the training set
    X_train = np.vstack((X_train, X_pool[query_idx]))
    y_train = np.concatenate((y_train, y_pool[query_idx]))

    # Remove queried instances from the pool
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)

    # Retrain the model and evaluate on the held-out test set
    model.fit(X_train, y_train)
    current_accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f'Iteration {i+1} accuracy: {current_accuracy:.2f}')

    # Stopping criterion: stop once the improvement over the previous
    # iteration falls below a threshold
    if current_accuracy - previous_accuracy < 0.01:
        print('Stopping criterion reached.')
        break
    previous_accuracy = current_accuracy

# Final model evaluation on the held-out test set
final_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f'Final accuracy: {final_accuracy:.2f}')
 

Explanation


  1. Data Generation:

  • We generate synthetic data using make_classification.

  • A test set is held out for evaluation, and the rest is split into an initial training set and a pool of unlabeled instances.

  2. Model Initialization:

  • We initialize an XGBoost classifier and train it on the initial labeled dataset.

  • The initial accuracy on the held-out test set is printed for reference.

  3. Active Learning Loop:

  • For each iteration, the model predicts probabilities on the unlabeled pool.

  • Instances with the lowest confidence (highest uncertainty) are selected for labeling.

  • These instances are added to the training set, and the model is retrained.

  • The loop continues until the accuracy improvement over the previous iteration falls below a threshold.

  4. Final Evaluation:

  • The final model is evaluated on the held-out test set to assess its performance.

This example demonstrates how active learning with uncertainty sampling can be implemented using an XGBoost classifier. The iterative process helps in selecting the most informative instances, thereby improving the model's performance efficiently.


Active learning, particularly uncertainty sampling, is a powerful approach for improving model performance efficiently. By focusing on the most informative data points, we can train models with fewer labeled instances while achieving high accuracy. Exploring other strategies like Query by Committee and Expected Error Reduction can further enhance the active learning process, making it a versatile tool in the machine learning toolkit.
