20th International
Conference on
Management of
Data

Hyderabad, India | December 17-19, 2014

COMAD 2014: Programming Contest


Programming Contest Submission Website - New!


PDF version


Motivation

Mining interesting graph patterns has attracted much attention of late, with interests in studying protein interaction networks [PJZ05,SIKS06,YHC08], statistically significant pattern mining [HS06,YCHY08,RS09,SB12,ASB14,LPP14], and mining dense subgraphs and cliques [MYTM09,LW08,WZZ06,ZWZK06]. To take this further and to revisit the state-of-the-art, we take this opportunity to organize the COMAD 2014 programming contest and invite interesting solutions on mining densest subgraphs from a given graph.

The problem of finding relatively highly dense sub-structures in various social networks such as communication networks, collaboration networks and web networks has received a lot of attention [Gol84,Cha00,GKT05,KP93,Law01]. Experiments have suggested that such subgraphs correspond to communities in those networks. Formally, given an undirected un-weighted graph, the problem is to find the densest subgraph of the given graph.

The combinatorial algorithms based on max flow computation for finding maximum density subgraphs have been studied extensively [Gol84,Law01]. However, the time complexity of exactly finding the densest component using these algorithms is cubic or worse, and hence, they are not scalable for large graphs in real-world settings. Although the previous stated results are the best that any deterministic algorithm can achieve, there exist fast linear time algorithms for computing approximate solutions [Cha00,KP93]. However, they are destructive and, thus, return only a single subgraph.

There are other important variations such as finding the densest $ k$ subgraphs. This problem, unfortunately, is NP-complete [FPK01].


Problem Statement

We first define the density of a subgraph.

Definition 1 (Density of a Subgraph)   Suppose $ S = (V',E')$ be any subgraph of $ G(V,E)$ where $ V' \subseteq V$ and an edge $ e' = (u',v') \in E'$ if and only if $ e' \in E$ . The density of the subgraph $ S$ , denoted by $ \rho_{S}$ , is defined as the ratio of the number of edges $ m' = \vert E'\vert$ to the number of vertices $ n' = \vert V'\vert$ :

$\displaystyle <tex2html_comment_mark>3 \rho_{S} = \frac{m'}{n'} %
$ (1)

We invite participants to submit novel and interesting solutions that aim at solving the following technical problems.

Participants can choose to solve either just the first problem or both of them. Depending on the problem being solved, the result could either contain information about a single subgraph or $ k$ subgraphs.

The results must be returned according to the following format:

We will use the publicly available graph data sets from the Stanford Large Network Database collection (SNAP, http://snap.stanford.edu/data), to evaluate the solutions. We will consider input graphs from the following two categories of datasets:

  1. Networks with ground-truth communities
  2. Collaboration networks

The code written by participants should handle the input format of the corresponding mentioned categories as published on the website. More specifically, the participants should expose a command-line option with two input values:

  1. The first is a positive integer. If only the first problem is attempted, it should be simply $ 1$ . Otherwise, this can be any $ k$ .
  2. The second argument is the graph input file in the format directly available from the website.
A note of caution is the fact that many of these graphs are big enough to not fit in the memory directly.

The solutions would be evaluated based on the following two parameters:

  1. Optimal Density: The optimal density here is the density of the subgraph returned after running the Goldberg's algorithm [Gol84] on our given graph. The solutions should aim at reducing the gap between the output from their approach and the optimal density as computed from the Goldberg's algorithm.

  2. Running Time: The running time to obtain the optimal density again can be checked by using the Goldberg's algorithm. The participants need to reduce this time as much as possible, while keeping the density as close to the optimal as possible.

Submissions

Programming Contest Submission Website - New!

The submission guidelines and the execution environment details are provided below:

Evaluation

The evaluation will be done using a scoring function that takes into account both the density of the solution and the running time. Currently, the function can be simply thought of as a ratio of the density and the time:

score$\displaystyle = \frac{\text{density}}{\text{running time}}$

The solution with the highest score wins.

However, to prevent trivial solutions, those that return subgraph(s) with density less than $ \tau$ % of the optimal density will be rejected. Currently, $ \tau$ is set to be $ 50$ %.

The committee reserves the right to change both the scoring function and the threshold $ \tau$ with adequate notice to the participants.

Resources

We will also like to provide the participants with a head-start to the resources or papers, that we think, would be helpful in arriving at an interesting and novel solution. However, these resources are in no way limiting.

Rules

The competition is open to all. Participants can be in teams. There is a limit of $ 5$ on the team size.

Deadline

Prizes

The top-3 solutions and/or novel/interesting solutions will be invited to present a poster and demonstrate their software in the conference. In addition, the committee may invite the authors of those solutions to write a joint paper which may be published as part of the COMAD conference or at some other venue or journal. The top-3 teams will be presented with a certificate as well.

Sample Code

A sample program is given in sample.py. It works on the input file CA-GrQc.csv and produces the output file dense-CA-GrQc.csv.

Bibliography

ASB14
Akhil Arora, Mayank Sachan, and Arnab Bhattacharya.
Mining statistically significant connected subgraphs in vertex labeled graphs.
In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 1003-1014, 2014.

Cha00
Moses Charikar.
Greedy approximation algorithms for finding dense components in a graph.
In Proceedings of the Third International Workshop on Approximation Algorithms for Combinatorial Optimization, APPROX '00, pages 84-95, 2000.

FPK01
U. Feige, D. Peleg, and G. Korsatz.
The dense $ k$ -subgraph problem.
Algorithmica, 29(3):410-421, 2001.

GKT05
David Gibson, Ravi Kumar, and Andrew Tomkins.
Discovering large dense subgraphs in massive graphs.
In Proceedings of the 31st International Conference on Very Large Data Bases, VLDB '05, pages 721-732, 2005.

Gol84
A. V. Goldberg.
Finding a maximum density subgraph.
Technical report, University of California, Berkeley, 1984.

HS06
Huahai He and Ambuj K. Singh.
Graphrank: Statistical modeling and mining of significant subgraphs in the feature space.
In Proceedings of the Sixth International Conference on Data Mining, ICDM '06, pages 885-890, 2006.

KP93
G. Kortsarz and D. Peleg.
On choosing a dense subgraph.
In Foundations of Computer Science, 1993. Proceedings., 34th Annual Symposium on, pages 692-701, Nov 1993.

Law01
Eugene Lawler.
Combinatorial Optimization: Networks and Matroids.
Dover Publications, 2001.

LPP14
Jefrey Lijffijt, Panagiotis Papapetrou, and Kai Puolamäki.
A statistical significance testing approach to mining the most informative set of patterns.
Data Min. Knowl. Discov., 28(1):238-263, 2014.

LW08
Guimei Liu and Limsoon Wong.
Effective pruning techniques for mining quasi-cliques.
In Machine Learning and Knowledge Discovery in Databases, volume 5212 of Lecture Notes in Computer Science, pages 33-49. Springer, 2008.

MYTM09
Tsutomu Matsunaga, Chikara Yonemori, Etsuji Tomita, and Masaaki Muramatsu.
Clique-based data mining for related genes in a biomedical database.
BMC Bioinformatics, 10, 2009.

PJZ05
Jian Pei, Daxin Jiang, and Aidong Zhang.
Mining cross-graph quasi-cliques in gene expression and protein interaction data.
In Proceedings of the 21st International Conference on Data Engineering, ICDE '05, pages 353-354, 2005.

RS09
Sayan Ranu and Ambuj K. Singh.
Graphsig: A scalable approach to mining significant subgraphs in large graph databases.
In Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE '09, pages 844-855, 2009.

SB12
Mayank Sachan and Arnab Bhattacharya.
Mining statistically significant substrings using the chi-square statistic.
Proc. VLDB Endow., 5(10):1052-1063, 2012.

SIKS06
Jacob Scott, Trey Ideker, Richard M. Karp, and Roded Sharan.
Efficient algorithms for detecting signaling pathways in protein interaction networks.
In Journal of Computational Biology, pages 133-144, 2006.

WZZ06
Jianyong Wang, Zhiping Zeng, and Lizhu Zhou.
Clan: An algorithm for mining closed cliques from large dense graph databases.
In ICDE, pages 73-84, 2006.

YCHY08
Xifeng Yan, Hong Cheng, Jiawei Han, and Philip S. Yu.
Mining significant graph patterns by leap search.
In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 433-444, 2008.

YHC08
Chang Hun You, Lawrence B. Holder, and Diane J. Cook.
Temporal and structural analysis of biological networks in combination with microarray data.
In CIBCB, pages 62-69, 2008.

ZWZK06
Zhiping Zeng, Jianyong Wang, Lizhu Zhou, and George Karypis.
Coherent closed quasi-clique discovery from large dense graph databases.
In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pages 797-802, 2006.