Reinforcement Learning for Pseudo-Labeling

Model-agnostic RL framework for intelligent pseudo-label selection in semi-supervised learning

Overview

In machine learning, model performance heavily depends on large, high-quality labeled datasets. However, manual labeling is time-consuming and costly. This project develops an intelligent Reinforcement Learning framework to address the critical problem of error propagation in semi-supervised learning.

Developed as part of the Fall 2025 Capstone for the George Washington University Data Science Program.

The Problem

Semi-Supervised Learning (SSL) aims to leverage small amounts of labeled data alongside large amounts of unlabeled data. A common SSL technique is pseudo-labeling, where a model predicts labels for unlabeled data and trains on its own predictions.

The core challenge: Error Propagation. When models generate incorrect pseudo-labels, they train on these errors, reinforcing mistakes and causing performance degradation.

Our Approach

We implemented and enhanced the novel framework from “Reinforcement Learning-Guided Semi-Supervised Learning” (RLGSSL; Heidari et al., 2024), which reframes pseudo-labeling as a Reinforcement Learning problem:

Framework Components

  1. Policy Network: The main deep learning model (CNN/Wide-ResNet) acts as the RL policy
  2. States: Unlabeled data points fed to the model
  3. Actions: Probability distributions (pseudo-labels) output by the model
  4. Reward Mechanism (sketched below):
    • The RL “environment” is handled implicitly within the training loop rather than as a separate Gymnasium environment
    • Reward calculated from model performance on synthetic “Mixup” data (interpolated labeled + pseudo-labeled samples)
    • Defined as the negative Mean Squared Error (MSE) between predictions and the mixed labels
    • Used in a novel RL loss function to update the policy
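
A minimal sketch of this reward in PyTorch, assuming equal-size labeled and unlabeled batches, soft (probability) labels, and a Beta-distributed mixing coefficient as in standard Mixup; the function and argument names are illustrative, not the exact implementation:

```python
import torch
import torch.nn.functional as F

def mixup_reward(model, x_labeled, y_labeled, x_unlabeled, pseudo_labels, alpha=1.0):
    """Reward = negative MSE between predictions on mixed inputs and the mixed labels."""
    # Sample a mixing coefficient from a Beta distribution (standard Mixup recipe).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()

    # Interpolate labeled and pseudo-labeled samples and their soft labels.
    x_mix = lam * x_labeled + (1.0 - lam) * x_unlabeled
    y_mix = lam * y_labeled + (1.0 - lam) * pseudo_labels

    # Higher reward when the model's predictions match the mixed labels.
    preds = F.softmax(model(x_mix), dim=1)
    return -F.mse_loss(preds, y_mix)
```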

Teacher-Student Framework

  • Student Model: Trained directly on data
  • Teacher Model: Exponential Moving Average (EMA) of the student’s weights (update sketched below)
  • Purpose: Teacher provides stable pseudo-labels to guide student training
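
A minimal sketch of the EMA update, assuming a typical decay such as 0.999 (the exact decay used in training is not specified here):

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, ema_decay=0.999):
    """Blend each teacher parameter toward the corresponding student parameter."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(ema_decay).add_(s_param.detach(), alpha=1.0 - ema_decay)
    # Buffers (e.g., BatchNorm running statistics) are copied directly.
    for t_buf, s_buf in zip(teacher.buffers(), student.buffers()):
        t_buf.copy_(s_buf)
```

Running this after each optimizer step keeps the teacher a smoothed copy of the student, which is why its pseudo-labels are more stable than the student’s own predictions.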

Key Innovations

We successfully reproduced the core RLGSSL framework and developed significant enhancements to improve stability and performance:

🎯 Adaptive Reward Function

Enhanced the basic negative MSE reward with two novel components (a combined sketch follows this list):

  • Confidence Bonus: Rewards confident predictions on confident targets, accelerating learning on clear samples
  • Diversity Bonus: Uses batch-level entropy to encourage diverse label predictions, preventing mode collapse
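
A sketch of how these bonuses can be folded into the base reward; the bonus weights and exact formulas below are illustrative assumptions, not the tuned values:

```python
import torch

def adaptive_reward(preds, targets, base_reward, conf_weight=0.1, div_weight=0.1, eps=1e-8):
    """Augment the negative-MSE base reward with confidence and diversity bonuses.

    preds and targets are (batch, num_classes) probability distributions,
    e.g. student predictions and teacher pseudo-labels.
    """
    # Confidence bonus: large when both the prediction and its target are confident.
    confidence_bonus = (preds.max(dim=1).values * targets.max(dim=1).values).mean()

    # Diversity bonus: entropy of the batch-averaged prediction. High entropy means
    # the batch covers many classes, which discourages mode collapse.
    mean_pred = preds.mean(dim=0)
    diversity_bonus = -(mean_pred * (mean_pred + eps).log()).sum()

    return base_reward + conf_weight * confidence_bonus + div_weight * diversity_bonus
```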

🔄 Adaptive RL Loss Function

Improved the loss calculation for more stable training (sketched below):

  • Curriculum Learning: A weight that gradually increases over epochs lets the model learn from “easy” high-confidence samples first
  • Entropy Regularization: A small penalty discourages overconfidence, maintaining exploration and reducing overfitting
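
A sketch of an RL loss combining both mechanisms; the REINFORCE-style surrogate, warm-up fraction, and entropy coefficient are illustrative assumptions rather than the exact loss used in our training runs:

```python
import torch

def adaptive_rl_loss(logits, rewards, epoch, total_epochs,
                     warmup_frac=0.3, entropy_coef=0.01):
    """Curriculum-weighted, entropy-regularized policy-gradient surrogate."""
    log_probs = torch.log_softmax(logits, dim=1)
    probs = log_probs.exp()

    # Log-probability of the greedily chosen pseudo-label (the "action").
    actions = probs.argmax(dim=1, keepdim=True)
    chosen_log_prob = log_probs.gather(1, actions).squeeze(1)

    # Curriculum weight: ramps from 0 to 1 over the warm-up phase so the RL signal
    # is introduced gradually while the model is still learning from easy samples.
    ramp = min(1.0, epoch / max(1.0, warmup_frac * total_epochs))

    # Entropy regularization: a small penalty against overconfident distributions.
    entropy = -(probs * log_probs).sum(dim=1).mean()

    # REINFORCE-style term: scale chosen log-probabilities by per-sample rewards.
    policy_loss = -(rewards * chosen_log_prob).mean()
    return ramp * policy_loss - entropy_coef * entropy
```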

Tech Stack

  • Languages: Python
  • Frameworks: PyTorch, torchvision
  • Data Science: NumPy, scikit-learn
  • RL Tools: Gymnasium (used for initial policy-gradient familiarization)
  • Core Concepts: Reinforcement Learning (policy gradients, MDPs), Semi-Supervised Learning, Pseudo-Labeling, Mixup augmentation, Teacher-Student (EMA) frameworks
  • Infrastructure: High Performance Computing (HPC) clusters
  • Monitoring: TensorBoard for experiment tracking; Matplotlib/Seaborn for visualization
  • Datasets: MNIST, CIFAR-10

Experimental Results

Performance Metrics

  • Peak Accuracy: 45.55% at epoch 175 on CIFAR-10 (training was cut short by a system crash)
  • Training Stability: Improved initial training phases through curriculum learning
  • Exploration: Enhanced diversity in pseudo-label generation

Key Findings

  • Successful Framework Reproduction: Validated the core RLGSSL methodology
  • Curriculum Learning Impact: Significantly stabilized early training phases
  • Diversity Bonus Effect: Prevented the model from collapsing onto a few classes
  • ⚠️ System Instability: The reward-function enhancements are the likely cause of the crash and require fine-tuning of their component weights

Technical Achievements

  • Model Agnostic Design: Framework works with any deep learning architecture
  • Reproducible Evaluation: Standardized training and evaluation pipeline enables consistent benchmarking
  • Hyperparameter Optimization: Systematic tuning of reward and loss components
  • HPC Optimization: Efficient distributed training on cluster infrastructure

Future Directions

  1. Reward Weight Tuning: Fine-tune adaptive component weights for improved stability
  2. Extended Evaluation: Test on additional datasets (CIFAR-100, ImageNet, SVHN)
  3. Architecture Exploration: Evaluate with Vision Transformers and other modern architectures
  4. Ablation Studies: Systematic analysis of each enhancement component’s contribution

Impact

This project demonstrates how Reinforcement Learning can intelligently guide semi-supervised learning to:

  • Reduce labeling costs by maximizing value from unlabeled data
  • Prevent error propagation through intelligent pseudo-label selection
  • Enable reproducible research with standardized evaluation framework
  • Advance state-of-the-art in SSL with novel adaptive mechanisms

This research contributes to making machine learning more accessible by reducing dependency on expensive labeled datasets while maintaining high model performance.

