Authors: Mohamed Abou-Zleikha, Zheng-Hua Tan, Mads Græsbøll Christensen
Type: Conference paper
Abstract: Data labelling and annotation for dataset creation is an important yet costly and time-consuming process, whereas unlabelled data is cheaper and easier to collect. One solution is to select, in a clever way, a set of unlabelled examples to be labelled.
Ensemble-based techniques have been successfully employed in many classification tasks, such as object detection, face recognition, and audio event detection.
The purpose of this paper is to propose an active learning approach that combines supervised and unsupervised random forests.
The proposed approach trains a forest of trees, where each tree is trained on both labelled and unlabelled data.
Then, from each tree, a set of examples is selected for labelling; the selection is based on the distribution of labelled data in the tree's leaves. The selected examples that occur most frequently across the trees of the forest are labelled by an oracle and used to update the forest. This selection and updating process continues until no new examples are selected for labelling, the validation accuracy reaches a threshold, or the maximum number of labelled examples is reached.
The proposed approach was tested on a subset of the MNIST database, where it achieved a 4.2% (absolute) lower error rate (PER) than random selection.
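
The selection and stopping steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`leaf_entropy`, `select_by_vote`, `should_stop`) are hypothetical, and the use of entropy as the measure of a leaf's labelled data distribution is an assumption, since the abstract does not state which measure is used. Only the cross-tree frequency vote and the three stopping conditions come directly from the text.

```python
import math
from collections import Counter


def leaf_entropy(label_counts):
    """Entropy (in bits) of the labelled-class counts inside one leaf.

    ASSUMPTION: entropy is used here only as one plausible way to score a
    leaf's labelled data distribution; the abstract does not name a measure.
    """
    total = sum(label_counts.values())
    if total == 0:
        # ASSUMPTION: a leaf with no labelled data is treated as maximally uncertain.
        return float("inf")
    return -sum(
        (count / total) * math.log2(count / total)
        for count in label_counts.values()
        if count > 0
    )


def select_by_vote(per_tree_candidates, k):
    """Return up to k example ids proposed by the largest number of trees."""
    votes = Counter()
    for candidates in per_tree_candidates:
        votes.update(set(candidates))  # each tree votes at most once per example
    # Highest vote count first; ties broken by id so the result is deterministic.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [example_id for example_id, _ in ranked[:k]]


def should_stop(new_selected, val_accuracy, n_labelled, acc_threshold, max_labelled):
    """The three stopping conditions listed in the abstract."""
    return (
        len(new_selected) == 0             # no new examples selected
        or val_accuracy >= acc_threshold   # validation accuracy reached the threshold
        or n_labelled >= max_labelled      # labelling budget exhausted
    )
```

For example, if three trees propose the candidate sets `[1, 2, 3]`, `[2, 3]`, and `[3, 4]`, then `select_by_vote(..., k=2)` returns `[3, 2]`, because example 3 is proposed by all three trees and example 2 by two of them. Those examples would then be sent to the oracle and used to update the forest.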