Programming Contest Submission Website - New!
Mining interesting graph patterns has attracted much attention of late, with
interests in studying protein interaction networks [PJZ05,SIKS06,YHC08], statistically significant pattern mining [HS06,YCHY08,RS09,SB12,ASB14,LPP14], and mining dense subgraphs and cliques [MYTM09,LW08,WZZ06,ZWZK06]. To take this further and to revisit the
state-of-the-art, we take this opportunity to organize the COMAD 2014
programming contest and invite interesting solutions on mining densest
subgraphs from a given graph.
The problem of finding relatively highly dense sub-structures in various social
networks such as communication networks, collaboration networks and web networks
has received a lot of attention [Gol84,Cha00,GKT05,KP93,Law01].
Experiments have suggested that such subgraphs correspond to communities
in those networks. Formally, given an undirected un-weighted graph, the problem
is to find the densest subgraph of the given graph.
The combinatorial algorithms based on max flow computation for finding maximum
density subgraphs have been studied extensively [Gol84,Law01]. However,
the time complexity of exactly finding the densest component using these
algorithms is cubic or worse, and hence, they are not scalable for large graphs
in real-world settings. Although the previous stated results are the best that
any deterministic algorithm can achieve, there exist fast linear time algorithms
for computing approximate solutions [Cha00,KP93]. However, they are
destructive and, thus, return only a single subgraph.
There are other important variations such as finding the densest
subgraphs. This problem, unfortunately, is NP-complete [FPK01].
We first define the density of a subgraph.
We invite participants to submit novel and interesting solutions that aim at
solving the following technical problems.
Participants can choose to solve either just the first problem or both of them.
Depending on the problem being solved, the result could either contain
information about a single subgraph or
subgraphs.
The results must be returned according to the following format:
We will use the publicly available graph data sets from the Stanford Large
Network Database collection (SNAP, http://snap.stanford.edu/data), to
evaluate the solutions. We will consider input graphs from the following two
categories of datasets:
The code written by participants should handle the input
format of the corresponding mentioned categories as published on the website.
More specifically, the participants should expose a command-line option with two
input values:
The solutions would be evaluated based on the following two parameters:
Optimal Density: The optimal density here is the density of
the subgraph returned after running the Goldberg's algorithm
[Gol84] on our given graph. The solutions should aim at reducing
the gap between the output from their approach and the optimal density
as computed from the Goldberg's algorithm.
Running Time: The running time to obtain the optimal density
again can be checked by using the Goldberg's algorithm. The participants
need to reduce this time as much as possible, while keeping the density
as close to the optimal as possible.
The submission guidelines and the execution environment details are provided
below:
Each submission should contain all the necessary code, scripts,
dependencies, libraries etc. to execute your program. Apart from this,
it should also contain a detailed README (preferably with a
makefile) that describes a step-by-step procedure to compile
and run the submission.
In addition, the final executable should be also included.
Each submission should follow the input and output specification as
mentioned in Section 2.
Each submission would be evaluated against the datasets (chosen by the
judges) from the ones as mentioned in Section 2. The
participants can judge the complexity and nature of the datasets by
looking on the ones mentioned on the SNAP website.
The codes would be run sequentially (no multi-threading, no map-reduce)
on an Intel(R) Xeon(R) E5645 12-core machine with 2.4GHz CPU, 92GB
RAM, and 2TB HDD running Linux Debian 6.0.7. The participants can
choose any programming language/technology/tool, with the constraint
that it should be installed freely and readily on a Linux Debian
desktop.
Finally, the participants are free to use any open/existing
code-bases, libraries to facilitate an easy, neat and faster code for
the same. However, the participants will need to properly acknowledge
this work in their reports.
Each team is allowed up to
submissions.
The evaluation will be done using a scoring function that takes into account
both the density of the solution and the running time. Currently, the function
can be simply thought of as a ratio of the density and the time:
However, to prevent trivial solutions, those that return subgraph(s) with
density less than
% of the optimal density will be rejected. Currently,
is set to be
%.
The committee reserves the right to change both the scoring function and the
threshold
with adequate notice to the participants.
We will also like to provide the participants with a head-start to the resources
or papers, that we think, would be helpful in arriving at an interesting and
novel solution. However, these resources are in no way limiting.
Andrew V. Goldberg: Finding a Maximum Density Subgraph.
http://www.eecs.berkeley.edu/Pubs/TechRpts/1984/CSD-84-171.pdf
Victor E. Lee, Ning Ruan, Ruoming Jin, Charu C. Aggarwal: A Survey of
Algorithms for Dense Subgraph Discovery. Managing and Mining Graph Data
2010: 303-336. http://charuaggarwal.net/dense-survey
Akhil Arora, Mayank Sachan, Arnab Bhattacharya: Mining statistically
significant connected subgraphs in vertex labeled graphs. SIGMOD,
2014: 1003-1014.
David Gibson, Ravi Kumar, Andrew Tomkins: Discovering Large Dense
Subgraphs in Massive Graphs. VLDB, 2005: 721-732.
Charalampos E. Tsourakakis, Francesco Bonchi, Aristides Gionis,
Francesco Gullo, Maria A. Tsiarli: Denser than the densest subgraph:
extracting optimal quasi-cliques with quality guarantees. KDD, 2013:
104-112.
http://videolectures.net/kdd2013_tsourakakis_quasi_cliques/
Samir Khuller, Barna Saha: On Finding Dense Subgraphs.
http://www.cs.umd.edu/~samir/talks/ismp-dense.pdf
Finding Dense Subgraphs.
http://www.cs.princeton.edu/~zdvir/apx11slides/charikar-slides.pdf
The competition is open to all. Participants can be in teams. There is a limit
of
on the team size.
Release of problem statement: 15th September.
Registration deadline of teams: Submission: Announcement of results:
The top-3 solutions and/or novel/interesting solutions will be invited to
present a poster and demonstrate their software in the conference. In addition,
the committee may invite the authors of those solutions to write a joint paper
which may be published as part of the COMAD conference or at some other venue or
journal. The top-3 teams will be presented with a certificate as well.
A sample program is given in sample.py. It works on the input file
CA-GrQc.csv and produces the output file dense-CA-GrQc.csv.
Problem Statement
(1)
A note of caution is the fact that many of these graphs are big enough to not
fit in the memory directly.
Submissions
Programming Contest Submission Website - New!
Evaluation
Resources
Rules
Deadline
20th October 3rd November - New!.
27th October 5th November - New!.
3rd November 12th November - New!.
Prizes
Sample Code
Bibliography
Mining statistically significant connected subgraphs in vertex
labeled graphs.
In Proceedings of the 2014 ACM SIGMOD International Conference
on Management of Data, SIGMOD '14, pages 1003-1014, 2014.
Greedy approximation algorithms for finding dense components in a
graph.
In Proceedings of the Third International Workshop on
Approximation Algorithms for Combinatorial Optimization, APPROX '00, pages
84-95, 2000.
The dense
-subgraph problem.
Algorithmica, 29(3):410-421, 2001.
Discovering large dense subgraphs in massive graphs.
In Proceedings of the 31st International Conference on Very
Large Data Bases, VLDB '05, pages 721-732, 2005.
Finding a maximum density subgraph.
Technical report, University of California, Berkeley, 1984.
Graphrank: Statistical modeling and mining of significant subgraphs
in the feature space.
In Proceedings of the Sixth International Conference on Data
Mining, ICDM '06, pages 885-890, 2006.
On choosing a dense subgraph.
In Foundations of Computer Science, 1993. Proceedings., 34th
Annual Symposium on, pages 692-701, Nov 1993.
Combinatorial Optimization: Networks and Matroids.
Dover Publications, 2001.
A statistical significance testing approach to mining the most
informative set of patterns.
Data Min. Knowl. Discov., 28(1):238-263, 2014.
Effective pruning techniques for mining quasi-cliques.
In Machine Learning and Knowledge Discovery in Databases,
volume 5212 of Lecture Notes in Computer Science, pages 33-49.
Springer, 2008.
Clique-based data mining for related genes in a biomedical database.
BMC Bioinformatics, 10, 2009.
Mining cross-graph quasi-cliques in gene expression and protein
interaction data.
In Proceedings of the 21st International Conference on Data
Engineering, ICDE '05, pages 353-354, 2005.
Graphsig: A scalable approach to mining significant subgraphs in
large graph databases.
In Proceedings of the 2009 IEEE International Conference on Data
Engineering, ICDE '09, pages 844-855, 2009.
Mining statistically significant substrings using the chi-square
statistic.
Proc. VLDB Endow., 5(10):1052-1063, 2012.
Efficient algorithms for detecting signaling pathways in protein
interaction networks.
In Journal of Computational Biology, pages 133-144, 2006.
Clan: An algorithm for mining closed cliques from large dense graph
databases.
In ICDE, pages 73-84, 2006.
Mining significant graph patterns by leap search.
In Proceedings of the 2008 ACM SIGMOD International Conference
on Management of Data, SIGMOD '08, pages 433-444, 2008.
Temporal and structural analysis of biological networks in
combination with microarray data.
In CIBCB, pages 62-69, 2008.
Coherent closed quasi-clique discovery from large dense graph
databases.
In Proceedings of the 12th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, KDD '06, pages 797-802, 2006.