Abstract. Heaviest k-subgraph problem is determine

a block of k nodes of a weighted graph

(of n nodes) such that the total edge weight within the subgraph induced by the

block is maximized. Finding heaviest

k-subgraph is a NP hard problem in literature. We have proposed an approach for

approximating the solution of heaviest k-subgraph in which greedy approach is

used to reduce the size of graph which is used as input for branch and bound implementation

of heaviest k-subgraph problem.

1 Introduction

Given an undirected graph

the density of a subgraph on vertex set S is defined

as

, where E(S)

is the set of edges in the subgraph induced by S.

The problem of finding a densest

subgraph of a given graph G can be solved optimally in polynomial time 1. When

we require the subgraph to have a specified size, the problem of finding a

maximum density subgraph becomes NP-hard.

Densest subgraph problems have

received significant attention for detecting important sub structures in

massive graphs like web and different social networks. Algorithms for finding

dense subgraphs have proved to be an effective tool in data analysis with applications

in community detection, finding patterns in gene annotation graphs , link spam

detection, as well as event detection in social media. In all these applications the underlying

graph is massive and thus fast scalable algorithms for detecting dense

subgraphs are required to be effective.

In this paper we used greedy technique to approximate the solution

of heaviest k-subgrah. The main intuition of greedy method is to always select

a choice that looks best among the available options at that moment. This

heuristic might not always give an optimal solution, but can be helpful in

giving near optimal solution. In this paper greedy approach is used while

constructing graph from twitter data. We reduce the size of the graph which act

as input to branch and bound implementation 2 of heaviest k-subgraph.

2 Related Work

Heaviest k-subraph problem is a

NP-hard and there does not exist polynomial time algorithm to find exact

solution. Still there exists some of the researches on approximating heaviest

k-subgraph problem.

Asahiro et al. 3 describe a greedy

algorithm for heaviest k-subgraph problem. They assign weight to each of the

vertices. This is equal to the total sum of the weight of the edges connecting

to that vertex known as weighed degree. They repeatedly remove a vertex with

the minimum weighted-degree in the currently remaining graph, until exactly k

vertices are left.

Feige et al. 4 present an

approximation algorithm for heaviest k-subgraph problem using semidefinite

programming.

Papailiopoulos et al. 5 present an

algorithm that combines spectral and combinatorial techniques. This algorithm

examines candidate subgraphs obtained from vectors lying in a low-dimensional

subspace of the adjacency matrix of the graph. Depend upon the spectrum of

graph this algorithm comes with novel performance.

Khuller et al.6 give a

1/2-approximation algorithm for (DalkS) and show that DamkS is as hard as DkS

within a constant factor.

Charikar in 7 describes a simple

heuristic that has a 2-approximation guarantee.

Nicolas et al. 8 alpresent several

approximation algorithms which run in moderately exponential or parameterized

time, describing trade-offs between complexity and approximability.

3 Proposed Technique

In this section, we presented a

proposed technique for approximating solution of heaviest k-subgraph. Floe

diagram of various steps are shown in Fig. 1.

Fig.

1. Flow diagram of Proposed Technique

3.1 Pre-processing of

data

In

this step, pre-processing of twitter data is done. Each tweet is first

tokenized and then Stanford POS Tagger is used to remove the stop words. Remaining

words in particular tweet are stored in same sequence as in original tweet.

Let each tweet be represented by

which contains stop words

and other words

. After preprocessing

stopwords are removed and tweet is represented by

.

3.2 Graph Formation

Let

be the set of tweets and

be one of the tweets where set of

is the sequence of words in tweet

.

A graph

be the structure specifying relationships

between the words of a collection, where

corresponds to the set of words, called

vertices or nodes and

is the set of relations among words. Each word

in

represents a node in

and an edge between any two nodes corresponds

to the simultaneous occurrence of the two words in a tweet. The weight denotes

the number of such instances.

Let

be the boolean function which tells whether

their exists edge between node

and

.

Undirected graph is

required in heaviest k- subgraph to extract keywords and directed graph is

required for forming sequences.

For undirected graph,

let it be

,

is an unordered pair and edge between x and y

is bidirectional, thus

is same as

3.3 Heaviest

k-subgraph

Problem

definition (Heaviest-k-Subgraph

problem). Given an undirected weighted graph G and an integer k

> 1, we wish to find a subgraph containing k nodes such that the

sum of the weights of its edges is maximum.

3.4 Approximation

of heaviest k-subgraph

Main

main idea of this method is to reduce the size of graph which act as input to

branch and bound method. We used greedy approach to select most effective

vertices of the graph. We didn’t pick directly top k weighted degree vertices

rather we used some technique to reduce size of graph. We first calculate the

weighed degree of each of the vertices. Then we take top k/2 weighted degree

vertices in set H and sort the remaining vertices by sum of the weights of the

edges connecting to set H. From that

sorted vertices, again we take top k/2 vertices in set G .In next step we

extract graph induced by these k vertices of set H and G and vertices connecting these k vertices.

According to our approach large graph is reduced to small size on which branch

and bound implementation of heaviest k-subraph can give approximate solution.

In next section we have presented pseudocode of the above approach.

3.5 Proposed Algorithm

1. Input: A graph

and integer

2.

Output: Aproximate heaviest k-subgraph of G(V,E)

3.

Let

is an array that contains sum of the weights

of the edges connecting each node

4. for each node

in

do

5.

6.

=

(S);

7. Set

;

8. //remove top

from

and put in set

9.

for

to

10.

=

;

11. //Put remaining element from in Set R

12.

;

13. //for each node in R calculate sum of weight of edges

connecting to set H

14. For each node

in

15.

=

;

16.

=

;

17. /remove top

from

and put in set

18. for

to

19.

=

;

20. Set

;

21. // Extract graph induced by vertices of set k and vertices connecting to these k vertices.

22. Graph

=

;

23. //apply branch and bound on graph

24.

=

25. Return

;

4 Experiments and Results

We have collected a set of tweets by

means of Twitter Streaming API and Twitter4J (an unofficial java library for

Twitter API). We use these samples for our proposed algorithm for approximating

heaviest k-subgraph problem. The algorithms were implemented in Java and were

run on a machine with 2.53 GHz clock.

4.1 Results

In this section we compare result of

approximation algorithm of heaviest k-subgraph problem with exact branch and

bound algorithm of heaviest k-subgraph problem. We compare running time and

solution of the algorithm i.e. weight of heaviest k-subgraph

Figure 2

Figure 2 shows the running the time

of approximate algorithm with exact branch and bound. It shows that exact

branch and bound runs faster when value of k is smaller i.e. upto 15. When the value of k get increase more than

certain value (here more then 15), approximate algorithm runs faster. Running

time of exact branch and bound algorithm increase exponential after certain

value of k. After value of k=20, the time of exact branch and bound explode.

But on the other hand approximate algorithm has able to solve problem of large

k without exploding.

Figure 3

In fig 3 we compared solution of

heaviest k-subgraph problem i.e. the weight of k-subgraph returned by both of

the algorithm. Branch and bound algorithm 1 gives the exact solution of

heaviest k-subgraph problem whereas approximate algorithm approximate this

value. According to results if we compare ratio of approximate solution with

exact solution, it shows that most of the time, ration is above 0.9.

After analyzing results of figure 2

and 3, we can say that approximate algorithm can run efficiently for large

value of k while maintaining the ratio of approximate solution to exact

solution more than 0.9 most of the times.

5 Conclusion

We have proposed an approximating algorithm for heaviest

k-subgraph problem where main idea is to reduce the size of graph which act as

input for branch and bound algorithm. Results show that as the value of k is

increased approximate algorithm runs faster with ratio of approximate solution

to exact solution above 0.9 most of the times. Future work includes improvement

of this ratio so that approximate algorithm give more accurate result.