Learning to classify data with label sparsity

Date: 1st Aug 2023

Time: 12:00 PM

Venue: SSB 233 (MR-1)

PAST EVENT

Details

Data science is a journey from raw data to actionable intelligence. Every decision taken along this journey hinges on how well we understand the properties of the dataset. Label sparsity is a serious challenge that every researcher has to tackle while building machine learning solutions. We address two popular variations of label sparsity: (a) class imbalance and (b) label scarcity.

To address the class imbalance challenge, techniques such as directed data sampling and data-level cost-sensitive methods use data point importance information to sample from the dataset so that the essential data points are retained and possibly oversampled. We propose a novel topic-modeling-based weighting framework that assigns importance to the data points in an imbalanced dataset based on the topic-posterior probabilities estimated by topic models. We propose TODUS, a topics-oriented directed undersampling algorithm that draws samples following the estimated data distribution, aiming to minimize the loss of important information that random undersampling would incur. We also propose TOMBoost, a topic-modeled boosting scheme that builds on the weighting framework and AdaBoost, tuned particularly for learning under class imbalance. Our empirical study spanning 40 datasets shows that TOMBoost outperforms other boosting and sampling methods. We also show that TOMBoost minimizes model bias faster than other popular boosting methods.
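
To make the weighting idea concrete, below is a minimal Python sketch (not the speaker's implementation) that derives importance weights from LDA topic posteriors and uses them for directed undersampling of the majority class. The dominant-topic weighting rule and all function names are illustrative assumptions; in the spirit of TOMBoost, one could analogously seed AdaBoost's initial sample weights from such a framework.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def topic_importance_weights(X_counts, n_topics=10, seed=0):
    # Fit a topic model and read off per-document topic posteriors.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    theta = lda.fit_transform(X_counts)  # shape: (n_docs, n_topics)
    # Illustrative weighting: score each point by its dominant topic's
    # posterior mass, so topically coherent points get higher importance.
    return theta.max(axis=1)

def directed_undersample(X, y, weights, majority_label, n_keep, seed=0):
    # Keep all minority points; draw majority points in proportion to
    # their importance weights instead of uniformly at random.
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == majority_label)
    p = weights[maj] / weights[maj].sum()
    kept = rng.choice(maj, size=n_keep, replace=False, p=p)
    sel = np.concatenate([kept, np.flatnonzero(y != majority_label)])
    return X[sel], y[sel]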

Label scarcity refers to the scarce availability of labeled data for training a classification model. Data is usually unlabeled when first collected, as labels are task-specific. Classification trees are simple and explainable models for data classification, grown by repeatedly partitioning the dataset according to a split criterion. In a classification tree learning task, when the class ratio of the unlabeled part of the dataset is made available, it becomes feasible to use the unlabeled data alongside the labeled data to train the tree in a semi-supervised manner. Motivated by the abundance of unlabeled data, we propose a semi-supervised approach to growing classification trees, in which maximum mean discrepancy (MMD) is applied to estimate the class ratio at every node split. Our experiments on several binary and multiclass classification datasets show that our semi-supervised classification tree performs statistically better than traditional decision trees. Additionally, we observe that our method works well even for moderately imbalanced datasets.
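
For intuition, the sketch below shows one standard way to estimate a binary class ratio with MMD by matching kernel mean embeddings; the RBF kernel, the closed-form minimizer, and the function name are assumptions for illustration and need not match the speaker's exact formulation.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd_class_ratio(X_pos, X_neg, X_unlabeled, gamma=1.0):
    # Estimate the fraction t of positives in X_unlabeled by minimizing
    # ||mu_U - t*mu_pos - (1 - t)*mu_neg||^2 in the kernel's RKHS,
    # where mu_* are kernel mean embeddings of the three samples.
    m_pp = rbf_kernel(X_pos, X_pos, gamma=gamma).mean()
    m_nn = rbf_kernel(X_neg, X_neg, gamma=gamma).mean()
    m_pn = rbf_kernel(X_pos, X_neg, gamma=gamma).mean()
    m_up = rbf_kernel(X_unlabeled, X_pos, gamma=gamma).mean()
    m_un = rbf_kernel(X_unlabeled, X_neg, gamma=gamma).mean()
    # Closed-form minimizer of the quadratic in t, clipped to [0, 1].
    denom = m_pp - 2 * m_pn + m_nn  # ||mu_pos - mu_neg||^2 in the RKHS
    if denom <= 1e-12:
        return 0.5  # classes indistinguishable under this kernel
    t = (m_up - m_un - m_pn + m_nn) / denom
    return float(np.clip(t, 0.0, 1.0))

At a given node, the estimated ratio could then stand in for the unknown label distribution of the unlabeled points reaching that node when scoring candidate splits.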

Speakers

Sudarsun Santhiappan (CS13D030)

Computer Science and Engineering