
Abstract

When developing systems with machine learning methods, the annotation of training data is one of the major expenses and has become a bottleneck in the development of new systems. Accordingly, the use of active learning (AL) to reduce annotation costs has recently generated considerable interest. The intuition behind AL is that giving the learner control over which data are labeled will yield higher-performing models with less annotation effort.
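To make the setting concrete, the sketch below shows a generic pool-based active learning loop with uncertainty sampling, where the examples the current model is least certain about are queried for labels. All names here (the pool `X_pool`, the simulated oracle `oracle_labels`, the seed size, batch size, and number of rounds) are illustrative placeholders, not artifacts of the dissertation.

```python
import numpy as np
from sklearn.svm import SVC

def uncertainty_sampling_loop(X_pool, oracle_labels, n_seed=10, batch=5, rounds=20):
    """Generic pool-based AL loop: repeatedly train on the labeled set and
    query labels for the examples the model is least certain about."""
    rng = np.random.default_rng(0)
    labeled = list(rng.choice(len(X_pool), size=n_seed, replace=False))
    unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

    for _ in range(rounds):
        model = SVC(kernel="linear")
        model.fit(X_pool[labeled], oracle_labels[labeled])  # assumes both classes in the seed
        # For an SVM, distance to the separating hyperplane is the usual
        # uncertainty measure: smallest margin = least certain.
        margins = np.abs(model.decision_function(X_pool[unlabeled]))
        chosen = [unlabeled[i] for i in np.argsort(margins)[:batch]]
        labeled.extend(chosen)  # the oracle "annotates" the queried examples
        unlabeled = [i for i in unlabeled if i not in chosen]

    # Final model trained on everything labeled so far.
    return SVC(kernel="linear").fit(X_pool[labeled], oracle_labels[labeled]), labeled
```

Querying by distance to the hyperplane is a standard uncertainty criterion for AL with SVMs and is the flavor of selection this abstract refers to.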

Support Vector Machines (SVMs) are a method for learning linear classifiers that has worked well for many applications since its introduction and is now in widespread use. Accordingly, AL with SVMs (AL-SVM) is an important area to investigate. In addition to interest in AL-SVM, there has also been considerable interest in dealing effectively with the class imbalance that exists for many applications of machine learning. This dissertation considers class imbalance in the case of binary classification where the target examples are a small proportion of the total number of examples. It is known from the passive learning literature that SVMs are susceptible to underperforming when there is class imbalance and that methods for addressing class imbalance can significantly improve performance.
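As a concrete example of one standard remedy from the passive learning literature, a cost-sensitive SVM penalizes errors on the rare class more heavily than errors on the majority class. A minimal sketch using scikit-learn's `class_weight` parameter follows; the 10:1 cost ratio and the training-data names are purely illustrative.

```python
from sklearn.svm import SVC

# Cost-sensitive linear SVM: errors on the rare positive class (label 1)
# incur ten times the penalty of errors on the majority class (label 0).
# The 10:1 ratio is illustrative; class_weight="balanced" instead sets
# costs inversely proportional to the observed class frequencies.
svm = SVC(kernel="linear", class_weight={0: 1.0, 1: 10.0})
# svm.fit(X_train, y_train)  # X_train, y_train are hypothetical data
```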

However, addressing imbalance during AL has received relatively little attention. One part of this dissertation explores how to address class imbalance effectively during AL-SVM. A main theme is that the process of AL creates training data with markedly different characteristics from training data created through passive annotation, and that modifying the base inference procedure used during AL to account for those characteristics can lead to more successful active learning. This theme is explored in detail for the important case of AL-SVM for imbalanced datasets. Experimental results across a range of Information Extraction, Text Classification, and Named Entity Classification tasks show that the new methods, which account for the skewed data created by the AL scenario, outperform methods that do not.
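The dissertation's exact procedures are developed in the body of the work; the hedged sketch below conveys only the theme, re-estimating the skew of the actively sampled labeled set at each round and setting the SVM's class costs accordingly, rather than fixing them once from assumptions about the full distribution. Binary 0/1 labels are assumed.

```python
import numpy as np
from sklearn.svm import SVC

def train_imbalance_aware(X_lab, y_lab):
    """Fit a linear SVM whose per-class costs reflect the skew observed
    in the actively sampled labeled set (a sketch of the theme, not the
    dissertation's exact method).  Assumes binary labels 0/1."""
    n_pos = max(1, int(np.sum(y_lab == 1)))
    n_neg = max(1, int(np.sum(y_lab == 0)))
    # Weight each class inversely to its share of the labeled data, so
    # errors on the rare class cost more as the sample grows more skewed.
    weights = {0: len(y_lab) / (2.0 * n_neg), 1: len(y_lab) / (2.0 * n_pos)}
    return SVC(kernel="linear", class_weight=weights).fit(X_lab, y_lab)
```

Calling a trainer like this inside the loop sketched earlier, in place of the unweighted SVM, is the general shape of the modification the abstract describes.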

In order to realize the performance gains enabled by AL, an effective method for stopping the process is critical. Stopping too early results in a lower-performing model, and stopping too late results in wasted annotation effort. The second part of this dissertation presents a new stopping method based on detecting stabilization of the model's predictions on data that does not have to be labeled. The principles behind this new method are explained, and experimental results across a range of Information Extraction, Text Classification, and Named Entity Classification tasks show that the new method outperforms previous methods, filling a need for a more aggressive stopping method and providing users with more control over the behavior of automatic stopping of active learning.
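The core mechanism can be sketched as follows: hold out a fixed "stop set" of unlabeled examples, record each successive model's predictions on it, and stop once agreement between consecutive models (measured here with Cohen's kappa) stays above a threshold for several rounds in a row. The threshold and window values below are illustrative defaults; both are the kind of user-adjustable knobs the abstract alludes to.

```python
from sklearn.metrics import cohen_kappa_score

def stabilizing_predictions_stop(pred_history, kappa_threshold=0.99, window=3):
    """Decide whether to stop AL: pred_history[i] holds model i's
    predictions on a fixed, unlabeled stop set.  Stop when the last
    `window` consecutive-model agreements all reach the kappa threshold."""
    if len(pred_history) < window + 1:
        return False
    n = len(pred_history)
    agreements = [
        cohen_kappa_score(pred_history[i], pred_history[i + 1])
        for i in range(n - window - 1, n - 1)
    ]
    return all(k >= kappa_threshold for k in agreements)
```

Because the stop set never needs labels, the check adds no annotation cost, which is what makes a prediction-stability criterion attractive in this setting.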

Details

Title: Active learning with support vector machines for imbalanced datasets and a method for stopping active learning based on stabilizing predictions
Author: Bloodgood, Michael
Year: 2009
Publisher: ProQuest Dissertations Publishing
ISBN: 978-1-109-24854-8
Source type: Dissertation or Thesis
Language of publication: English
ProQuest document ID: 304872209
Copyright: Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.