Abstract. used while constructing graph from twitter data.

Abstract.   Heaviest k-subgraph problem is determine
a block of  k nodes of a weighted graph
(of n nodes) such that the total edge weight within the subgraph induced by the
block is maximized. Finding heaviest
k-subgraph is a NP hard problem in literature. We have proposed an approach for
approximating the solution of heaviest k-subgraph in which greedy approach is
used to reduce the size of graph which is used as input for branch and bound implementation
of heaviest k-subgraph problem.

1   Introduction

Given an undirected graph

the density of a subgraph on vertex set S is defined

 , where E(S)
is the set of edges in the subgraph induced by S.

The problem of finding a densest
subgraph of a given graph G can be solved optimally in polynomial time 1. When
we require the subgraph to have a specified size, the problem of finding a
maximum density subgraph becomes NP-hard.

Densest subgraph problems have
received significant attention for detecting important sub structures in
massive graphs like web and different social networks. Algorithms for finding
dense subgraphs have proved to be an effective tool in data analysis with applications
in community detection, finding patterns in gene annotation graphs , link spam
detection, as well as event detection in social media.  In all these applications the underlying
graph is massive and thus fast scalable algorithms for detecting dense
subgraphs are required to be effective.

In this paper we used greedy technique to approximate the solution
of heaviest k-subgrah. The main intuition of greedy method is to always select
a choice that looks best among the available options at that moment. This
heuristic might not always give an optimal solution, but can be helpful in
giving near optimal solution. In this paper greedy approach is used while
constructing graph from twitter data. We reduce the size of the graph which act
as input to branch and bound implementation 2 of heaviest k-subgraph.

2   Related Work


Heaviest k-subraph problem is a
NP-hard and there does not exist polynomial time algorithm to find exact
solution. Still there exists some of the researches on approximating heaviest
k-subgraph problem.

Asahiro et al. 3 describe a greedy
algorithm for heaviest k-subgraph problem. They assign weight to each of the
vertices. This is equal to the total sum of the weight of the edges connecting
to that vertex known as weighed degree. They repeatedly remove a vertex with
the minimum weighted-degree in the currently remaining graph, until exactly k
vertices are left.

Feige et al. 4 present an
approximation algorithm for heaviest k-subgraph problem using semidefinite

Papailiopoulos et al. 5 present an
algorithm that combines spectral and combinatorial techniques. This algorithm
examines candidate subgraphs obtained from vectors lying in a low-dimensional
subspace of the adjacency matrix of the graph. Depend upon the spectrum of
graph this algorithm comes with novel performance.

Khuller et al.6 give a
1/2-approximation algorithm for (DalkS) and show that DamkS is as hard as DkS
within a constant factor.

Charikar in 7 describes a simple
heuristic that has a 2-approximation guarantee.

Nicolas et al. 8 alpresent several
approximation algorithms which run in moderately exponential or parameterized
time, describing trade-offs between complexity and approximability.

3   Proposed Technique


In this section, we presented a
proposed technique for approximating solution of heaviest k-subgraph. Floe
diagram of various steps are shown in Fig. 1.


1. Flow diagram of Proposed Technique

3.1   Pre-processing of

this step, pre-processing of twitter data is done. Each tweet is first
tokenized and then Stanford POS Tagger is used to remove the stop words. Remaining
words in particular tweet are stored in same sequence as in original tweet.

Let each tweet be represented by

  which contains stop words

 and other words

. After preprocessing
stopwords are removed and tweet is represented by


3.2   Graph Formation


 be the set of tweets and

  be one of the tweets where set of

 is the sequence of words in tweet


A graph

 be the structure specifying relationships
between the words of a collection, where

 corresponds to the set of words, called
vertices or nodes and

 is the set of relations among words. Each word


 represents a node in  

 and an edge between any two nodes corresponds
to the simultaneous occurrence of the two words in a tweet. The weight denotes
the number of such instances.



 be the boolean function which tells whether
their exists edge between node





Undirected graph is
required in heaviest k- subgraph to extract keywords and directed graph is
required for forming sequences.

For undirected graph,
let it be


 is an unordered pair and edge between x and y
is bidirectional, thus

  is same as



3.3   Heaviest

definition (Heaviest-k-Subgraph
problem). Given an undirected weighted graph G and an integer k
> 1, we wish to find a subgraph containing k nodes such that the
sum of the weights of its edges is maximum.


3.4   Approximation
of heaviest k-subgraph

main idea of this method is to reduce the size of graph which act as input to
branch and bound method. We used greedy approach to select most effective
vertices of the graph. We didn’t pick directly top k weighted degree vertices
rather we used some technique to reduce size of graph. We first calculate the
weighed degree of each of the vertices. Then we take top k/2 weighted degree
vertices in set H and sort the remaining vertices by sum of the weights of the
edges connecting to set H.  From that
sorted vertices, again we take top k/2 vertices in set G .In next step we
extract graph induced by these k vertices of set H and G  and vertices connecting these k vertices.
According to our approach large graph is reduced to small size on which branch
and bound implementation of heaviest k-subraph can give approximate solution.
In next section we have presented pseudocode of the above approach.



3.5  Proposed Algorithm


1.       Input: A graph

and integer

Output: Aproximate  heaviest k-subgraph of G(V,E)


 is an array that contains sum of the weights
of the edges connecting each node

4.       for each node








7.       Set


8.       //remove top


 and put in set






11.    //Put remaining element from  in Set R



13.    //for each node in R calculate sum of weight of edges
connecting to set H

14.    For each node








17.    /remove top


 and put in set

18.    for





20.    Set


21.    // Extract graph induced by vertices of set k  and vertices connecting to these k vertices.

22.    Graph



23.    //apply branch and bound on graph



25.    Return



4   Experiments and Results

We have collected a set of tweets by
means of Twitter Streaming API and Twitter4J (an unofficial java library for
Twitter API). We use these samples for our proposed algorithm for approximating
heaviest k-subgraph problem. The algorithms were implemented in Java and were
run on a machine with 2.53 GHz clock.



4.1   Results


In this section we compare result of
approximation algorithm of heaviest k-subgraph problem with exact branch and
bound algorithm of heaviest k-subgraph problem. We compare running time and
solution of the algorithm i.e. weight of heaviest k-subgraph



Figure 2


Figure 2 shows the running the time
of approximate algorithm with exact branch and bound. It shows that exact
branch and bound runs faster when value of k is smaller i.e. upto 15.  When the value of k get increase more than
certain value (here more then 15), approximate algorithm runs faster. Running
time of exact branch and bound algorithm increase exponential after certain
value of k. After value of k=20, the time of exact branch and bound explode.
But on the other hand approximate algorithm has able to solve problem of large
k without exploding.


Figure 3


In fig 3 we compared solution of
heaviest k-subgraph problem i.e. the weight of k-subgraph returned by both of
the algorithm. Branch and bound algorithm 1 gives the exact solution of
heaviest k-subgraph problem whereas approximate algorithm approximate this
value. According to results if we compare ratio of approximate solution with
exact solution, it shows that most of the time, ration is above 0.9.

After analyzing results of figure 2
and 3, we can say that approximate algorithm can run efficiently for large
value of k while maintaining the ratio of approximate solution to exact
solution more than 0.9 most of the times.








5   Conclusion

We have proposed an approximating algorithm for heaviest
k-subgraph problem where main idea is to reduce the size of graph which act as
input for branch and bound algorithm. Results show that as the value of k is
increased approximate algorithm runs faster with ratio of approximate solution
to exact solution above 0.9 most of the times. Future work includes improvement
of this ratio so that approximate algorithm give more accurate result.