Abstract-Streaming
data are typically infinite sequences of incoming data arriving at very high speed,
which may evolve over time and offer only limited labeled instances. This poses several
challenges in mining large-scale, high-speed data streams in real-time
application domains. In this paper, we propose a framework for building prediction
models from data streams that contain a limited set of labeled examples and plenty
of unlabeled ones. We argue that, due to increasing data collection
ability but limited resources for labeling, the stream data at hand may
have only a small number of labeled examples while a large portion
remains unlabeled; however, this unlabeled data can be beneficial for
learning. We then focus on applying semi-supervised self-training to data streams. The main goal of semi-supervised learning is
to use unlabeled examples, combining the information in the unlabeled data
with the explicit classification information of the labeled data to improve
classification performance. We present a semi-supervised naïve Bayes approach to
mine data streams. We then make several modifications to
semi-supervised self-training with a Naïve Bayes base learner that produce better
probability estimates for its predictions. The modification we consider
uses Cluster Labeling by Majority (CLM) to improve the selection metric of
self-training. Our experimental results on different datasets indicate that
our method outperforms conventional methods in mining data streams.

 

Keywords- Data Stream, Semi-Supervised Learning, Self-Training,
Limited Labeled Data, Naïve Bayes Classifier.

 

I.     Introduction

  Data streams are infinite, high-speed sequences
of data points [1]. Mining patterns from these large-scale data streams has
drawn significant attention from researchers in the machine learning and data
mining communities over the past few years. Data streams closely resemble
real-time incoming data sequences. Their sources include
various sensors situated in the medical domain to monitor patients' health
conditions, in the industrial domain to monitor manufactured products, in
environmental monitoring, and so on. Other sources are user click streams on
social networking and e-commerce sites, Twitter posts, blogs, and web logs
[2], [3]. These sources not only produce data streams, but also
produce them in huge volume (on the scale of terabytes to petabytes) and at rapid
speed. Mining such huge data in real time raises various challenges and
has recently become a hot area of research. These challenges include infinite
length, evolving nature, concept drift, and lack of labeled data.

  Several stream mining surveys give overviews of
data stream algorithms. Nguyen et al. [4] presented a comprehensive survey of
state-of-the-art data stream mining algorithms, focusing on clustering and
classification because of their ubiquitous usage. Their survey identifies
mining constraints, proposes a general model for data stream mining, and
addresses the relationship between traditional data mining and data stream
mining. Rohit Prasad et al. [5] addressed various challenges associated with
mining such data streams, describing several available stream classification
and clustering algorithms along with their features. They also presented the
significant performance evaluation measures relevant to streaming data
classification and clustering, and discussed various streaming data
computation platforms with their major capabilities.

  In many of the streams mentioned above, labeled
instances are difficult, expensive, or time-consuming to obtain, because
they require empirical research or experienced human annotators. Traditional
machine learning approaches may therefore be unable to handle these kinds of
applications. In this case, semi-supervised learning can be a proper solution.
Semi-supervised learning algorithms use not only labeled data but also
unlabeled data to construct a classification model. The goal of
semi-supervised learning is to use unlabeled instances, combining the
information in the unlabeled data with the explicit classification information
of the labeled data to improve classification accuracy. The main issue in
semi-supervised learning is how to exploit the information in the unlabeled
data [7].

  Self-training is a commonly used approach to
semi-supervised learning in many application domains, such as natural language
processing [7], [8], [9] and object detection and recognition [10]. A self-training
algorithm is an iterative method for semi-supervised learning that wraps
around a base classifier and uses the classifier's own predictions to label
unlabeled data. In each iteration, a set of newly labeled data, which we call
the high-confidence predictions, is selected and added to the training set.
The performance of the self-training algorithm strongly depends on the
selected newly labeled data. Therefore, it is vital to self-training that the
confidence of each prediction, called the probability estimate, be measured
correctly [6].
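The iterative wrapper described above can be sketched in a few lines of Python. The snippet below is a minimal, generic illustration of self-training, not the exact procedure proposed in this paper: the Gaussian naïve Bayes base learner, the 0.9 confidence threshold, and the batch selection strategy are all assumptions made for the sake of the example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.9, max_iter=10):
    """Generic self-training loop: repeatedly fit a base classifier,
    move high-confidence unlabeled examples into the training set."""
    X_l = np.asarray(X_labeled, dtype=float)
    y_l = np.asarray(y_labeled)
    X_u = np.asarray(X_unlabeled, dtype=float)
    clf = GaussianNB()  # illustrative base learner; any probabilistic classifier fits
    for _ in range(max_iter):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        conf = proba.max(axis=1)        # probability estimate of each prediction
        mask = conf >= threshold        # selection metric: keep confident ones
        if not mask.any():
            break                       # nothing confident enough; stop early
        new_y = clf.classes_[proba[mask].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[mask]])
        y_l = np.concatenate([y_l, new_y])
        X_u = X_u[~mask]                # remove newly labeled points from the pool
    return clf.fit(X_l, y_l)
```

As the paper argues, the quality of the selection metric (here, the raw maximum class probability) drives the whole loop: an over-confident base learner will import wrong labels into the training set.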

  In this paper, a self-training method is
applied to data stream mining with a Naïve Bayes learner as the base
classifier. The goal is to show how to effectively use a Naïve Bayes classifier
as the base learner in self-training over data streams with limited labeled
data. By modifying the selection metric of self-training, the classification
performance is improved; despite the lack of labeled data in the stream, the
proposed algorithm benefits from the unlabeled data and performs well. The
results of experiments on several benchmark dataset streams confirm this.
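The CLM selection metric itself is introduced later in the paper; purely as a rough illustration of the cluster-labeling-by-majority idea, one can cluster the pooled labeled and unlabeled data and give each cluster the majority label of its labeled members. The sketch below is a hypothetical reading of that idea using k-means; the choice of clustering algorithm, the number of clusters, and the handling of clusters without labeled members are assumptions of this illustration, not the paper's exact design.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_majority_labels(X_labeled, y_labeled, X_unlabeled,
                            n_clusters=2, seed=0):
    """Cluster labeled + unlabeled points together, then propagate each
    cluster's majority label (over its labeled members) to its unlabeled
    points. Returns one label per unlabeled point, or None when the
    point's cluster contains no labeled member."""
    X_all = np.vstack([X_labeled, X_unlabeled])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_all)
    n_l = len(X_labeled)
    labeled_clusters = km.labels_[:n_l]
    cluster_label = {}
    for c in range(n_clusters):
        members = np.asarray(y_labeled)[labeled_clusters == c]
        if len(members) > 0:
            vals, counts = np.unique(members, return_counts=True)
            cluster_label[c] = vals[counts.argmax()]  # majority vote
    return [cluster_label.get(c) for c in km.labels_[n_l:]]
```

Such cluster-derived labels can then be compared against the base classifier's own predictions, keeping only the unlabeled points where the two agree as the high-confidence set.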

  The rest of this paper is organized as
follows. Section 2 briefly reviews existing approaches to data stream
classification. Section 3 defines data streams and then introduces our new
approach. Section 4 presents the results of the experiments. Finally,
Section 5 rounds the paper off with some conclusions.