The 91% Problem: Data Preprocessing for Medical Image Object Detection

Published: November 26, 2025
Tags: Machine Learning, Medical Imaging, Computer Vision, Data Engineering

Summary

This blog takes a closer look at preprocessing, one of the most important and most often overlooked aspects of machine learning. In medical fields, datasets are frequently noisy, incomplete, or imbalanced. This blog demonstrates effective strategies for handling such medical imaging datasets, using the TBX11K dataset as an example, while maintaining clinical relevance and model performance.

1. Introduction

Building robust and accurate models for production-ready applications is not just about training the models; it is also about understanding the data, cleaning it, and restructuring it to match the requirements of the model framework. In medical imaging, this challenge is particularly pronounced due to the complexity and variability of clinical data. Medical images often come with a host of issues, including inconsistent formats, varying resolutions, annotation errors, and privacy concerns. These issues can significantly hinder model performance if not properly addressed during the preprocessing stage.

This blog post walks through the complete journey of preparing the TBX11K tuberculosis detection dataset for xxx training, including the data exploration, preprocessing and format conversion steps taken.

2. Dataset

The TBX11K dataset is a large-scale collection of chest X-ray images annotated with bounding boxes for tuberculosis (TB) detection and localization. It's designed for multiple computer vision tasks:

However, like many real-world datasets, TBX11K comes with its own challenges that require careful preprocessing.

2.2 Initial Data Analysis

2.2.1 Dataset Composition

The first step is understanding the dataset. The TBX11K dataset is structured as follows:

The dataset is heavily imbalanced. Most of the images do not contain any TB regions, which is realistic for most medical scenarios. However, this poses a challenge for training effective detection models: it can lead to overfitting, misclassification, and poor generalization, and it wastes computational resources by increasing training time.
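One way to quantify this imbalance is to count how many image entries in a COCO-style annotation file have at least one bounding box. The sketch below is illustrative only: the function name `summarize_image_coverage` and the toy dictionary are assumptions made for this example, not part of the TBX11K tooling, and the numbers are made up.

```python
def summarize_image_coverage(coco):
    """Count images with and without at least one annotation in a COCO-style dict."""
    images = coco.get("images", [])
    # Every image id that is referenced by at least one annotation
    annotated_ids = {ann["image_id"] for ann in coco.get("annotations", [])}
    total = len(images)
    annotated = sum(1 for img in images if img["id"] in annotated_ids)
    unannotated = total - annotated
    return {
        "total_images": total,
        "annotated": annotated,
        "unannotated": unannotated,
        # Guard against an empty image list
        "unannotated_fraction": unannotated / total if total else 0.0,
    }


# Toy data for illustration (hypothetical, not real TBX11K entries)
toy_coco = {
    "images": [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [5, 5, 20, 20]},
        {"id": 11, "image_id": 1, "category_id": 2, "bbox": [40, 40, 10, 10]},
        {"id": 12, "image_id": 3, "category_id": 1, "bbox": [8, 8, 15, 15]},
    ],
}

coverage = summarize_image_coverage(toy_coco)
print(coverage)
```

Running the same function on the real training and validation JSON files would give the share of images that carry no TB boxes at all.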

2.2.2 Category Analysis

The dataset contains annotations for 3 types of TB manifestations:

After further analyzing the annotations, the following was discovered:

The class "PulmonaryTuberculosis" has no annotations in either the training or validation sets. This indicates a potential issue with the data collection or annotation process that needs to be addressed during preprocessing. Including an empty class would waste computational resources.
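A check like this can be sketched by counting annotations per category across the splits. The helper names and the toy dictionaries below are assumptions for illustration, and the category ids (1, 2, 3) are placeholders that may not match the ids in the actual TBX11K files.

```python
from collections import Counter


def count_annotations_per_category(coco):
    """Map each category name to its number of annotations in one COCO-style dict."""
    counts = Counter(ann["category_id"] for ann in coco.get("annotations", []))
    return {cat["name"]: counts.get(cat["id"], 0) for cat in coco.get("categories", [])}


def find_empty_categories(*splits):
    """Return names of categories that have zero annotations across all given splits."""
    totals = {}
    for coco in splits:
        for name, n in count_annotations_per_category(coco).items():
            totals[name] = totals.get(name, 0) + n
    return sorted(name for name, n in totals.items() if n == 0)


# Hypothetical category ids; the real files may number them differently
categories = [
    {"id": 1, "name": "ActiveTuberculosis"},
    {"id": 2, "name": "ObsoletePulmonaryTuberculosis"},
    {"id": 3, "name": "PulmonaryTuberculosis"},
]
toy_train = {
    "categories": categories,
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1},
        {"id": 2, "image_id": 2, "category_id": 1},
        {"id": 3, "image_id": 3, "category_id": 2},
    ],
}
toy_val = {
    "categories": categories,
    "annotations": [{"id": 4, "image_id": 9, "category_id": 1}],
}

train_counts = count_annotations_per_category(toy_train)
empty = find_empty_categories(toy_train, toy_val)
print(train_counts)
print(empty)
```

Checking all splits together matters here: a class is only safe to drop if it is empty everywhere it could be used.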

2.2.3 Class Imbalance

Another insight emerged from this distribution, as shown below: ActiveTuberculosis cases outnumber ObsoletePulmonaryTuberculosis by roughly 4:1. This significant class imbalance can lead to biased model training, where the model becomes proficient at detecting the majority class while neglecting the minority class.

Figure: Class Distribution

To mitigate this, some of the solutions considered include:

3. Preprocessing Pipeline

3.1 Removing Unannotated Images

The first preprocessing step was to clean the COCO JSON annotation files by removing image entries that have no corresponding annotations. These images don't contribute to learning for the object detection task; they also take up unnecessary disk space and increase training time. This was accomplished by passing the COCO JSON files to a cleaning function that identifies images without annotations and generates new JSON files that include only images with at least one bounding box annotation.
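The cleaning step described above might look roughly like the following. This is a minimal sketch and not the exact function used for this post: the names `remove_unannotated_images` and `clean_coco_file` are hypothetical, and it assumes the standard COCO layout with top-level `images` and `annotations` lists.

```python
import json


def remove_unannotated_images(coco):
    """Return a copy of a COCO-style dict keeping only images with at least one annotation."""
    annotated_ids = {ann["image_id"] for ann in coco.get("annotations", [])}
    cleaned = dict(coco)  # shallow copy so the input dict is not modified
    cleaned["images"] = [
        img for img in coco.get("images", []) if img["id"] in annotated_ids
    ]
    return cleaned


def clean_coco_file(src_path, dst_path):
    """Read a COCO JSON file, drop unannotated images, and write a new file.

    Returns the number of image entries that were removed.
    """
    with open(src_path, "r", encoding="utf-8") as f:
        coco = json.load(f)
    cleaned = remove_unannotated_images(coco)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(cleaned, f)
    return len(coco.get("images", [])) - len(cleaned["images"])


# Toy data for illustration (hypothetical, not real TBX11K entries)
toy_coco = {
    "images": [
        {"id": 1, "file_name": "a.png"},
        {"id": 2, "file_name": "b.png"},
        {"id": 3, "file_name": "c.png"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [5, 5, 20, 20]},
        {"id": 11, "image_id": 3, "category_id": 2, "bbox": [8, 8, 15, 15]},
    ],
    "categories": [{"id": 1, "name": "ActiveTuberculosis"}],
}

cleaned_coco = remove_unannotated_images(toy_coco)
print([img["file_name"] for img in cleaned_coco["images"]])
```

Writing to a new file instead of overwriting the original keeps the raw annotations available in case the unannotated images are needed later, for example as negative samples.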

3.2 Removing Extra Class

Next, the empty class "PulmonaryTuberculosis" was removed from the annotation files. This involved updating the category definitions for the training and validation datasets. Removing the unused class reduced complexity in the dataset and ensured that the model focuses on learning from relevant categories only.
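As a rough sketch of this step, the category entry can be filtered out of the `categories` list of each split. The function name `drop_category` and the toy data are assumptions for illustration, and the category ids are placeholders. The annotation filter is only a safety guard, since the class has no annotations in this dataset.

```python
def drop_category(coco, name):
    """Return a copy of a COCO-style dict without the named category."""
    drop_ids = {cat["id"] for cat in coco.get("categories", []) if cat["name"] == name}
    cleaned = dict(coco)  # shallow copy so the input dict is not modified
    cleaned["categories"] = [
        cat for cat in coco.get("categories", []) if cat["id"] not in drop_ids
    ]
    # Safety guard: also drop any annotation that still points at the removed class
    cleaned["annotations"] = [
        ann for ann in coco.get("annotations", []) if ann["category_id"] not in drop_ids
    ]
    return cleaned


# Toy data for illustration (hypothetical ids, not real TBX11K entries)
toy_coco = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1},
        {"id": 11, "image_id": 2, "category_id": 2},
    ],
    "categories": [
        {"id": 1, "name": "ActiveTuberculosis"},
        {"id": 2, "name": "ObsoletePulmonaryTuberculosis"},
        {"id": 3, "name": "PulmonaryTuberculosis"},
    ],
}

trimmed = drop_category(toy_coco, "PulmonaryTuberculosis")
print([cat["name"] for cat in trimmed["categories"]])
```

The same call would be applied to both the training and validation files so that their category definitions stay identical. Whether the remaining category ids also need to be renumbered depends on the training framework and is not covered by this sketch.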

6. Conclusion

This blog has presented a guided approach to addressing the data quality challenges inherent in medical image object detection by implementing structured preprocessing workflows.

Further information on the models trained, demonstrating the effectiveness of these approaches, will be shared soon.
