Supervised machine learning tasks need labelled data as a foundation. However, the labelling budget is often much smaller than the pool of unlabelled data that spans the feature space of interest. A traditional machine learning pipeline randomly selects a subset to be labelled (e.g. by running an experiment or by human annotation). Active learning (AL) is a family of machine learning methods that select the next instance(s) from the unlabelled pool to be labelled, aiming to optimise predictive model performance with a limited sampling budget.
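The pool-based loop described above can be sketched in a few lines. This is an illustrative example only, not part of the project: it uses a synthetic dataset in place of a real unlabelled pool, scikit-learn's LogisticRegression as the predictive model, and least-confidence sampling as one possible informativeness criterion.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the data pool (illustrative, not a project dataset).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Seed the labelled set with 5 points per class so the model can be fitted.
labelled = list(np.flatnonzero(y == 0)[:5]) + list(np.flatnonzero(y == 1)[:5])
pool = [i for i in range(len(X)) if i not in labelled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):  # request 20 more labels, one per round
    model.fit(X[labelled], y[labelled])
    proba = model.predict_proba(X[pool])
    # Least-confidence criterion: pick the pool point whose highest
    # class probability is lowest, i.e. where the model is least sure.
    pick = pool[int(np.argmin(proba.max(axis=1)))]
    labelled.append(pick)   # in practice: send this instance for labelling
    pool.remove(pick)

model.fit(X[labelled], y[labelled])
print(len(labelled))  # 30 labelled points after the loop (10 seed + 20 queried)
```

In a real pipeline the `labelled.append(pick)` step is where the experiment or human annotator provides the label; here the labels are already known, which is only acceptable for simulation studies.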
Three criteria commonly considered when choosing an AL sampling method are informativeness, representativeness and diversity. However, these criteria are sometimes insufficient or inappropriate for the data set of interest. The aim of this project is to integrate one or more new criteria for data selection in AL. During the project the student will learn to understand and reason about AL metrics comprehensively, and to implement those metrics in an AL algorithm. Learning curves will be used to study underfitting, overfitting and uncertainty, and to compare metrics. Uniformity of the data is the first criterion that will be considered. In a longer project the student will be able to propose and implement new criteria or combinations of criteria. The algorithms will be validated on several public datasets from the UCI repository.
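One way such criteria can be combined is a two-stage rule: shortlist candidates by informativeness, then break the tie by diversity. The sketch below is a hypothetical illustration (the function name, the least-confidence score, and the nearest-labelled-neighbour distance are assumed choices, not the project's prescribed method):

```python
import numpy as np

def select_diverse_uncertain(proba, X_pool, X_labelled, k=10):
    """Hypothetical two-stage criterion: shortlist the k least-confident
    pool points (informativeness), then pick the shortlisted point whose
    nearest labelled neighbour is farthest away (diversity)."""
    confidence = proba.max(axis=1)                 # top class probability per point
    shortlist = np.argsort(confidence)[:k]         # k most uncertain pool indices
    # Distance from each shortlisted point to every labelled point, via broadcasting.
    d = np.linalg.norm(X_pool[shortlist, None, :] - X_labelled[None, :, :], axis=2)
    # Keep the candidate farthest from its nearest labelled neighbour.
    return int(shortlist[np.argmax(d.min(axis=1))])
```

Given the model's class probabilities for the pool and the feature matrices of the pool and labelled sets, the function returns the pool index to query next; swapping in other informativeness or diversity scores changes only the two scoring lines.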
Understand active learning algorithms and selection criteria; good data hygiene; hyperparameter optimisation; documenting workflows; machine learning applied to difficult data sets (e.g. small, imbalanced, or with many features)
Active learning, data selection, data measurement.