
Abstract

When developing systems with machine learning methods, the annotation of training data is one of the major expenses and has become a bottleneck in the development of new systems. Accordingly, the use of active learning (AL) to reduce annotation costs has recently generated considerable interest. The intuition behind AL is that giving the learner control over which data are labeled will yield higher-performing models with less annotation effort.
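To make the setting concrete, the sketch below shows a generic pool-based active learning loop with uncertainty sampling, where the examples the current model is least certain about are queried for labels. All names here (the pool `X_pool`, the simulated oracle `oracle_labels`, the seed size, batch size, and number of rounds) are illustrative placeholders, not artifacts of the dissertation.

```python
import numpy as np
from sklearn.svm import SVC

def uncertainty_sampling_loop(X_pool, oracle_labels, n_seed=10, batch=5, rounds=20):
    """Generic pool-based AL loop: repeatedly train on the labeled set and
    query labels for the examples the model is least certain about."""
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X_pool), size=n_seed, replace=False))
    unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

    for _ in range(rounds):
        model = SVC(kernel="linear")
        model.fit(X_pool[labeled], oracle_labels[labeled])  # assumes both classes in the seed
        # For an SVM, distance to the separating hyperplane is the usual
        # uncertainty measure: smallest margin = least certain.
        margins = np.abs(model.decision_function(X_pool[unlabeled]))
        chosen = [unlabeled[i] for i in np.argsort(margins)[:batch]]
        labeled.extend(chosen)  # the oracle "annotates" the queried examples
        unlabeled = [i for i in unlabeled if i not in chosen]

    # Final model trained on everything labeled so far.
    return SVC(kernel="linear").fit(X_pool[labeled], oracle_labels[labeled]), labeled
```

Querying by distance to the hyperplane is a standard uncertainty criterion for AL with SVMs and is the flavor of selection this abstract refers to.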

Support Vector Machines (SVMs) are a method for learning linear classifiers that has worked well for many applications since its introduction and is now in widespread use. Accordingly, AL with SVMs (AL-SVM) is an important area to investigate. In addition to interest in AL-SVM, there has also been considerable interest in dealing effectively with the class imbalance that exists for many applications of machine learning. This dissertation considers class imbalance in the case of binary classification where the target examples are a small proportion of the total number of examples. It is known from the passive learning literature that SVMs are susceptible to underperforming when there is class imbalance and that methods for addressing class imbalance can significantly improve performance.
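As a concrete example of one standard remedy from the passive learning literature, a cost-sensitive SVM penalizes errors on the rare class more heavily than errors on the majority class. A minimal sketch using scikit-learn's `class_weight` parameter follows; the 10:1 cost ratio and the training-data names are purely illustrative.

```python
from sklearn.svm import SVC

# Cost-sensitive linear SVM: errors on the rare positive class (label 1)
# incur ten times the penalty of errors on the majority class (label 0).
# The 10:1 ratio is illustrative; class_weight="balanced" instead sets
# costs inversely proportional to the observed class frequencies.
svm = SVC(kernel="linear", class_weight={0: 1.0, 1: 10.0})
# svm.fit(X_train, y_train)  # X_train, y_train are hypothetical data
```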

However, addressing imbalance during AL has received relatively little attention. One part of this dissertation explores how to address class imbalance effectively during AL-SVM. A main theme is that the process of AL creates training data with markedly different characteristics from training data created through passive annotation, and that modifying the base inference procedure used during AL to account for those characteristics can lead to more successful active learning. This theme is explored in detail for the important case of AL-SVM for imbalanced datasets. Experimental results across a range of Information Extraction, Text Classification, and Named Entity Classification tasks show that the new methods, which account for the skewed data created by the AL scenario, outperform methods that do not.
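The dissertation's exact procedures are developed in the body of the work; the hedged sketch below conveys only the theme, re-estimating the skew of the actively sampled labeled set at each round and setting the SVM's class costs accordingly, rather than fixing them once from assumptions about the full distribution. Binary 0/1 labels are assumed.

```python
import numpy as np
from sklearn.svm import SVC

def train_imbalance_aware(X_lab, y_lab):
    """Fit a linear SVM whose per-class costs reflect the skew observed
    in the actively sampled labeled set (a sketch of the theme, not the
    dissertation's exact method).  Assumes binary labels 0/1."""
    n_pos = max(1, int(np.sum(y_lab == 1)))
    n_neg = max(1, int(np.sum(y_lab == 0)))
    # Weight each class inversely to its share of the labeled data, so
    # errors on the rare class cost more as the sample grows more skewed.
    weights = {0: len(y_lab) / (2.0 * n_neg), 1: len(y_lab) / (2.0 * n_pos)}
    return SVC(kernel="linear", class_weight=weights).fit(X_lab, y_lab)
```

Calling a trainer like this inside the loop sketched earlier, in place of the unweighted SVM, is the general shape of the modification the abstract describes.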

In order to realize the performance gains enabled by AL, an effective method for stopping the process is critical. Stopping too early results in a lower-performing model, and stopping too late results in wasted annotation effort. The second part of this dissertation presents a new stopping method based on detecting stabilization of the model's predictions on data that does not have to be labeled. The principles behind this new method are explained, and experimental results across a range of Information Extraction, Text Classification, and Named Entity Classification tasks show that the new method outperforms previous methods, filling a need for a more aggressive stopping method and providing users with more control over the behavior of automatic stopping of active learning.
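The core mechanism can be sketched as follows: hold out a fixed "stop set" of unlabeled examples, record each successive model's predictions on it, and stop once agreement between consecutive models (measured here with Cohen's kappa) stays above a threshold for several rounds in a row. The threshold and window values below are illustrative defaults; both are the kind of user-adjustable knobs the abstract alludes to.

```python
from sklearn.metrics import cohen_kappa_score

def stabilizing_predictions_stop(pred_history, kappa_threshold=0.99, window=3):
    """Decide whether to stop AL: pred_history[i] holds model i's
    predictions on a fixed, unlabeled stop set.  Stop when the last
    `window` consecutive-model agreements all reach the kappa threshold."""
    if len(pred_history) < window + 1:
        return False
    n = len(pred_history)
    agreements = [
        cohen_kappa_score(pred_history[i], pred_history[i + 1])
        for i in range(n - window - 1, n - 1)
    ]
    return all(k >= kappa_threshold for k in agreements)
```

Because the stop set never needs labels, the check adds no annotation cost, which is what makes a prediction-stability criterion attractive in this setting.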

Details

Title: Active learning with support vector machines for imbalanced datasets and a method for stopping active learning based on stabilizing predictions
Author: Bloodgood, Michael
Year: 2009
Publisher: ProQuest Dissertations Publishing
ISBN: 978-1-109-24854-8
Source type: Dissertation or Thesis
Language of publication: English
ProQuest document ID: 304872209
Copyright: Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.