
Parallel and distributed data mining offer
great promise for addressing cybersecurity. The Minnesota Intrusion Detection
System can detect sophisticated cyber attacks on large-scale
networks that are difficult to detect using signature-based systems.

The phenomenal growth of computing power over
much of the past five decades has been driven by scientific applications that
require enormous amounts of computation. More recently, however, the primary
focus for parallel and high-performance computers has been on data-centric
applications, in which the overall application complexity is determined by the
size and nature of the data. Data mining is one of these
data-centric applications, and it increasingly drives the development of
parallel and distributed computing technology.
The explosive growth in the availability of
various kinds of data in the commercial and scientific fields has created
an unprecedented opportunity to develop automated, data-driven knowledge
discovery techniques. Data mining, an essential step in this knowledge
discovery process, consists of methods that find interesting, nontrivial, and
useful patterns hidden in the data.1,2
The massive size and high dimensionality of
the available data sets make large-scale data mining applications
computationally demanding, so high-performance parallel computing is rapidly
becoming an essential component of the solution. Data also tend to be
distributed, and issues such as scalability, privacy, and security prevent the
data from being brought together. Such cases call for distributed data mining.
Into this mix enters the Internet, along
with its enormous advantages and vulnerabilities. The need for computer
security and the inadequacy of traditional approaches have sparked interest
in applying data mining to intrusion detection. This article
focuses on the promise and application of parallel and distributed data
mining for cybersecurity.

Need for cybersecurity

People and organizations attack and misuse
information systems, creating new Internet threats every day. The number
of cyber attacks has increased exponentially in recent years,3 and their
severity and sophistication are also growing.4 For example, when the
Slammer/Sapphire worm began to spread through the Internet in early 2003, it
doubled in size every 8.5 seconds and infected at least 75,000 hosts.3 It
caused network outages and unforeseen consequences such as canceled
flights, interference with elections, and ATM failures.
The conventional approach to protecting
information systems is to design mechanisms such as firewalls,
authentication tools, and virtual private networks that create a protective
shield. However, these mechanisms almost inevitably have vulnerabilities. They
cannot ward off attacks that continually adapt to exploit system
weaknesses, which are often caused by careless design and implementation
flaws. This has created the need for intrusion detection,5,6 a security
technology that complements conventional security approaches by monitoring
systems and identifying cyber attacks.
Traditional intrusion detection methods are
based on human experts' extensive knowledge of
attack signatures (character strings in a message's
payload that indicate malicious content). They have several
limitations. They cannot detect novel attacks because
someone must manually revise the signature database beforehand for
each new type of intrusion discovered. And once someone
discovers a new attack and develops its signature,
deployment of that signature is often delayed. These limitations
have led to a growing interest in intrusion detection based on data mining.

The Minnesota Intrusion Detection System

The data-mining-based Minnesota Intrusion Detection System (MINDS) detects
unusual network behavior and emerging cyber threats. It is
deployed at the University of Minnesota, where several hundred
million network flows are recorded each day from a network of more than
40,000 computers. MINDS is also part of the Interrogator architecture7
at the US Army Research Laboratory's Center for Intrusion Monitoring
and Protection (ARL-CIMP), where
analysts collect and analyze network traffic from dozens of DoD sites.8
MINDS is enjoying great success at both sites, routinely detecting new
attacks that signature-based systems could not have found. In addition, it often
discovers rogue communication channels and data exfiltration that other widely
used tools such as Snort have had difficulty identifying.8,9
Figure 1 illustrates the process of analyzing
real network traffic data using MINDS. The MINDS suite contains several
modules for collecting and analyzing massive amounts of network traffic. Typical
analyses include behavioral anomaly detection, summarization, and
profiling. In addition, the system has modules for feature extraction and for
filtering out attacks for which good predictive models exist (for
example, for scan detection). Independently, each of these modules
provides key insights into the network. When combined, which MINDS does
automatically, these modules have a multiplicative effect on the analysis.
Anomaly detection

At the core of MINDS is a behavioral anomaly
detection module based on a novel data-driven technique for calculating the
distance between points in a high-dimensional space. In particular, this
technique enables a meaningful calculation of the similarity between records
that contain a mixture of categorical and numeric attributes (such as network
traffic records). Unlike other widely investigated anomaly detection
methods, this new framework does not suffer from a high number of false alarms.
As far as we know, no other existing anomaly detection technique can find
complex behavioral anomalies in a real-world environment while maintaining a
very low false alarm rate. A multithreaded parallel formulation of this module
allows the analysis of network traffic from many sensors in near-real time.
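
The idea of measuring distance between records that mix categorical and numeric attributes can be illustrated with a small sketch. The article does not specify the MINDS distance measure or scoring method, so the code below is only an assumption-laden stand-in: it uses a Gower-style distance (mismatch for categorical fields, range-normalized difference for numeric fields) and scores each record by its average distance to its k nearest neighbors. The record layout, the function names, and the sample values are all hypothetical.

```python
# Hypothetical flow records: (protocol, dest_port, bytes, duration_s).
# Protocol and port are treated as categorical; bytes and duration as numeric.

def mixed_distance(a, b, numeric_ranges):
    """Gower-style distance in [0, 1] over categorical and numeric fields."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            span = numeric_ranges[i]
            # Numeric field: absolute difference scaled by the observed range.
            total += abs(x - y) / span if span else 0.0
        else:
            # Categorical field: 0 on a match, 1 on a mismatch.
            total += 0.0 if x == y else 1.0
    return total / len(a)


def knn_anomaly_scores(records, numeric_idx, k=2):
    """Score each record by its mean distance to its k nearest neighbors."""
    ranges = {
        i: max(r[i] for r in records) - min(r[i] for r in records)
        for i in numeric_idx
    }
    scores = []
    for i, r in enumerate(records):
        dists = sorted(
            mixed_distance(r, other, ranges)
            for j, other in enumerate(records)
            if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


flows = [
    ("tcp", 80, 1200, 0.5),
    ("tcp", 80, 1300, 0.6),
    ("tcp", 80, 1100, 0.4),
    ("tcp", 80, 1250, 0.5),
    ("udp", 1434, 404, 0.0),  # differs from the rest in every field
]
scores = knn_anomaly_scores(flows, numeric_idx={2, 3}, k=2)
most_anomalous = max(range(len(flows)), key=scores.__getitem__)
```

The pairwise loop is quadratic in the number of records, which is one reason a parallel formulation matters at the traffic volumes described above.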

The ability to summarize large amounts of
network traffic can be valuable for network security analysts, who often have
to deal with large amounts of data. For example, when analysts use the MINDS
anomaly detection algorithm to score several million network flows in a typical
data window, several hundred high-scoring flows may require attention. But
because of the limited time available, analysts can often look only at
the first few pages of results, covering the top few dozen
most anomalous flows. Because MINDS can summarize many of
these flows into a small representation, the analyst can analyze a much
larger set of anomalies than would otherwise be possible. Our research group
has formulated a methodology for summarizing the information in a
database of transactions with categorical attributes as an optimization
problem.9,10 This methodology uses association pattern
analysis, originally developed to discover consumer behavior
patterns in large sales transaction data sets. These algorithms
have helped us better understand the nature of cyber attacks and create new
signature rules for intrusion detection systems. Specifically, the MINDS
summarization component compresses the output of the anomaly detection
component into a compact representation, so analysts can examine numerous
anomalous activities on a single screen.
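
The summarization idea can be sketched as a greedy cover of high-scoring connections by frequent attribute patterns. This is not the optimization formulation from references 9 and 10, whose details are not given here; it is a simplified heuristic under stated assumptions. The function name `summarize`, the attribute names, the scores, and the gain rule (rows covered times pattern length) are all hypothetical.

```python
from itertools import combinations


def summarize(connections, min_support=2, max_len=2):
    """Greedily cover scored connections with frequent attribute patterns.

    connections: list of (attribute_dict, anomaly_score) pairs.
    Returns (summary_rows, leftovers), where each summary row is
    (pattern_dict, connection_count, average_score).
    """
    # Index every attribute=value combination of up to max_len items.
    candidates = {}
    for idx, (attrs, _score) in enumerate(connections):
        items = sorted(attrs.items())
        for size in range(1, max_len + 1):
            for pattern in combinations(items, size):
                candidates.setdefault(pattern, set()).add(idx)

    uncovered = set(range(len(connections)))
    summary = []
    while uncovered:
        # Favor patterns that cover many uncovered rows and retain more detail.
        best = max(
            candidates,
            key=lambda p: (len(candidates[p] & uncovered) * len(p), len(p)),
        )
        covered = candidates[best] & uncovered
        if len(covered) < min_support:
            break  # nothing left that compresses at least min_support rows
        avg = sum(connections[i][1] for i in covered) / len(covered)
        summary.append((dict(best), len(covered), avg))
        uncovered -= covered

    leftovers = [connections[i] for i in sorted(uncovered)]
    return summary, leftovers


anomalies = [
    ({"src_port": "80", "proto": "tcp", "dst_port": "1042"}, 9.1),
    ({"src_port": "80", "proto": "tcp", "dst_port": "2210"}, 8.7),
    ({"src_port": "80", "proto": "tcp", "dst_port": "3391"}, 8.9),
    ({"src_port": "6667", "proto": "tcp", "dst_port": "4000"}, 7.5),
]
summary, leftovers = summarize(anomalies)
```

In this toy input, the three connections that share a protocol and source port collapse into one summary line with a count and an average score, while the unrelated connection is left for individual inspection.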

Figure 2 shows typical MINDS output after anomaly detection and
summarization. The system sorts the connections by
the score assigned to them by
the anomaly detection algorithm. Then, using the patterns
generated by the association analysis module, MINDS summarizes the anomalous connections with
the highest scores. Each line contains the
average anomaly score, the number
of connections represented by the line, eight
basic connection features, and the relative contribution
of each basic and derived anomaly detection
feature. For example, the second line in Figure 2 represents
138 anomalous connections. From this summary, analysts can
easily infer that this is backscatter from a denial-of-service attack on a
computer outside the network being examined. Such an inference is
difficult to make from individual connections, even if the anomaly
detection module ranks them highly. Figure 2 also shows the analysts'
interpretations of several other summaries found by the system.

Figure 2. MINDS summarization module output. Each line contains
an anomaly score, the number of connections represented by
the line, and several other pieces of information that help the analyst form
a quick picture.


We can use clustering, a data mining technique
for grouping similar items, to find related network connections and thus
discover dominant modes of behavior. MINDS uses the shared nearest neighbor
(SNN) clustering algorithm,11 which works particularly well when data is
high-dimensional and noisy (for example, network data). SNN is
highly computationally intensive, on the order of O(n²), where n is the number
of network connections. Thus, we need to use parallel computing to scale this
algorithm to large data sets. Our group has developed a parallel
formulation of the SNN clustering algorithm
for behavior modeling, which makes it feasible to analyze massive amounts
of network data.
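
The similarity measure behind shared nearest neighbor clustering, and the source of its quadratic cost, can be sketched as follows. This is a serial toy version, not the parallel formulation described above, and it omits the later steps of the full algorithm (density estimation, core point selection, and cluster formation). The Euclidean distance, the function names, and the sample points are assumptions chosen for illustration.

```python
def knn_lists(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Squared Euclidean distance to every other point; doing this for
        # all n points is the O(n^2) step.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points)
            if j != i
        )
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def snn_similarity(neighbors, i, j):
    """Number of shared neighbors, counted only if i and j list each other."""
    if i in neighbors[j] and j in neighbors[i]:
        return len(neighbors[i] & neighbors[j])
    return 0


# Two well-separated groups of four points each (illustrative data).
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
]
neighbors = knn_lists(points, k=3)
same_group = snn_similarity(neighbors, 0, 1)
cross_group = snn_similarity(neighbors, 0, 4)
```

Points in the same group share neighbors and get a positive similarity, while points in different groups get zero, which is what makes the measure robust to noise and to differences in density.
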
An experiment we ran on a
real network illustrates this approach, as well as the computing
power required to run SNN clustering
on network data. The data consisted of 850,000
connections collected over one hour. On a 16-CPU cluster, the SNN algorithm took 10 hours to run and required 100
Mbytes of memory at each node to calculate the distances between
points. The final clustering step required 500 Mbytes of memory at one
node. The algorithm produced 3,135 clusters ranging in size from 10 to 500
records. Most large clusters corresponded to normal behavior modes, such as
virtual private network traffic. However, many
smaller clusters corresponded to minor deviant behavior modes
related to misconfigured computers, insider abuse, and policy violations
that were undetectable by other methods. Such clusters give analysts
information they can act on immediately and can help them
understand their network traffic behavior. Figure 3 shows two
clusters obtained from this experiment. These
clusters represent connections from inside machines to a site that allows users (or attackers) to control
desktops remotely. This is a policy violation in the
organization for which this data was being analyzed.
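
A rough calculation shows why this experiment needed a cluster. It uses only the figures reported above and assumes, as a simplification, that every pair of connections is compared once and that the whole 10-hour run is spent on those comparisons, so the resulting rate is an upper bound on the distance-computation throughput rather than a measured value.

```python
n = 850_000                  # connections in the one-hour data set
pairs = n * (n - 1) // 2     # distinct pairs if every pair is compared once

cpus = 16
hours = 10
cpu_seconds = cpus * hours * 3600

# Pairs handled per CPU per second under the simplifying assumptions above.
pairs_per_cpu_second = pairs / cpu_seconds

print(f"{pairs:,} pairs, about {pairs_per_cpu_second:,.0f} per CPU-second")
```

Roughly 3.6 × 10^11 pairs for a single hour of traffic explains why a quadratic algorithm cannot keep pace with live traffic on a single machine.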

Figure 3. Two
clusters obtained from network traffic at a US Army
base. They represent connections to a remote desktop control site.

Detecting distributed attacks
Interestingly, attacks often arise
from multiple locations. In fact, individual attackers often
control numerous machines and can use
different machines to launch different phases of
an attack. Furthermore, the targets of the attack could be
distributed across multiple sites. An intrusion
detection system (IDS) running at one site may not have enough
information by itself to detect the attack. Rapidly
detecting such distributed cyber attacks requires an interconnected system of
IDSs that can ingest network traffic data in near-real time, detect anomalous
connections, communicate their results to other IDSs,
and incorporate information from other systems to
enhance the anomaly scores of such
threats. Such a system consists of several autonomous IDSs that
share their knowledge bases with each other to
quickly detect malicious, large-scale cyber attacks.
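
One simple way to picture how an IDS might incorporate results from its peers is to raise a source's local anomaly score for every other site that has independently flagged it. The article does not describe the actual exchange protocol or fusion rule, so the function name, the report format, the multiplicative boost, and the addresses (taken from documentation-reserved ranges) are all hypothetical.

```python
def fuse_reports(local_scores, peer_reports, boost=0.5):
    """Raise each local anomaly score once per peer site flagging the source.

    local_scores: {source_ip: score} computed at this site.
    peer_reports: {site_name: set of source_ips that site considers suspicious}.
    """
    fused = {}
    for src, score in local_scores.items():
        confirmations = sum(1 for flagged in peer_reports.values() if src in flagged)
        fused[src] = score * (1 + boost * confirmations)
    return fused


local_scores = {"203.0.113.7": 0.4, "198.51.100.20": 0.4}
peer_reports = {
    "site_b": {"203.0.113.7"},
    "site_c": {"203.0.113.7", "192.0.2.44"},
}
fused = fuse_reports(local_scores, peer_reports)
```

Only a list of flagged sources crosses site boundaries in this sketch, not raw traffic, so each site stays autonomous while still benefiting from the others' observations.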

Figure 4 illustrates the distributed aspect of
this problem. It shows the global Internet Protocol space in two dimensions,
such that every IP address allocated in the world is represented in some
block. The black region represents unallocated IP space.

Figure 4. Map of the global IP space.

Figure 5 shows a graphical illustration of
suspicious connections originating from outside (right
pane) to machines inside the University of Minnesota's IP
space (left pane) over a typical 10-minute window. Each red
dot in the right pane represents a suspicious connection made by
an outside machine to an internal machine on port 80. In this
case, suspicious means that the internal machine being contacted does not
have a Web server running, which makes the external machines that are
trying to connect to port 80 suspected attackers. The right pane
indicates that most of these potential attackers are clustered in specific
blocks of Internet addresses. A closer look shows that most of
the dense areas belong to network blocks allocated to cable and AOL
users in the United States or to blocks allocated
to Asia and Latin America. There are 999 unique
outside sources trying to contact 1,126 destinations within the
University of Minnesota's IP network space. The total number of flows
involved is 1,516, which means that most external sources made only one
suspicious connection to the inside. It is hard to label a
source as malicious on the basis of just one
connection. If several sites running the same analysis across the IP
space report the same external source as suspicious, the
classification would be much more accurate.
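
The port-80 rule described above is simple enough to sketch: a connection is suspicious if it targets port 80 on an internal machine that is not known to run a Web server. The flow tuple layout, the `web_servers` inventory, and the addresses are hypothetical; a real deployment would derive them from flow records and an asset list.

```python
from collections import Counter


def suspicious_port80(flows, web_servers):
    """Return (source, destination) pairs hitting port 80 on non-Web hosts."""
    return [
        (src, dst)
        for src, dst, port in flows
        if port == 80 and dst not in web_servers
    ]


# (source_ip, destination_ip, destination_port); illustrative values only.
flows = [
    ("203.0.113.7", "10.0.0.5", 80),
    ("203.0.113.7", "10.0.0.9", 80),
    ("198.51.100.20", "10.0.0.5", 80),
    ("192.0.2.44", "10.0.0.1", 80),  # 10.0.0.1 really is a Web server
    ("192.0.2.44", "10.0.0.9", 22),  # not port 80
]
web_servers = {"10.0.0.1"}

suspects = suspicious_port80(flows, web_servers)
per_source = Counter(src for src, _dst in suspects)
single_connection_sources = [s for s, count in per_source.items() if count == 1]
```

Sources that appear only once, like the second address here, are exactly the cases where a single site has too little evidence and reports from other sites would help.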

Figure 5. Suspicious traffic on port
80. (a) Destination IP addresses
of suspicious connections within the three Class B networks at the
University of Minnesota. (b) Source IP addresses
of suspicious connections in the global IP space.

The ideal scenario would be to
gather the data collected at these different sites in one place and then
analyze it. But this is not feasible because

• The data is naturally distributed
and is better suited to distributed analysis;
• The cost of merging large amounts of
data and running the analysis at one site is very high; and
• Privacy, security, and trust issues
arise when sharing network data among different organizations.
Thus, what is really needed is a distributed
framework in which these different sites can independently analyze their data
and then share high-level patterns and results while respecting the privacy of
each site's data. Implementing such a system would require handling
distributed data, addressing privacy issues, and using data mining
tools, and it would be much easier if middleware provided these
functions. The University of
Minnesota, the University of Florida,
and the University of Illinois at Chicago are developing and
implementing such a system (see Figure 6) as part of a collaborative
project funded by the US National Science Foundation called Data
Mining Middleware for the Grid.

Figure 6. The distributed network intrusion detection system
being developed collaboratively by three universities.

This work is supported by ARDA grant AR/F30602-03-C-0243,
NSF grants IIS-0308264 and ACI-0325949, and the US Army High Performance
Computing Research Center under contract DAAD19-01-2-0014. The
research reported in this article was performed in collaboration with Paul Dokas, Eric Eilertson, Levent Ertoz, Aleksandar Lazarevic,
Michael Steinbach, George Simon, Mark Shaneck, Liu Haiyang, Jaideep Srivastava, Pang-Ning Tan, Varun Chandola, Yongdae Kim, Zhi-Li Zhang, Sanjay Ranka, and Bob Grossman.
I thank Devdatta Kulkarni for volunteering to integrate the
audio and PowerPoint files.