Mining interesting graph patterns has attracted much attention of late, with interests in studying protein interaction networks [PJZ05,SIKS06,YHC08], statistically significant pattern mining [HS06,YCHY08,RS09,SB12,ASB14,LPP14], and mining dense subgraphs and cliques [MYTM09,LW08,WZZ06,ZWZK06]. To take this further and to revisit the state-of-the-art, we take this opportunity to organize the COMAD 2014 programming contest and invite interesting solutions on mining densest subgraphs from a given graph.
The problem of finding relatively highly dense sub-structures in various social networks such as communication networks, collaboration networks and web networks has received a lot of attention [Gol84,Cha00,GKT05,KP93,Law01]. Experiments have suggested that such subgraphs correspond to communities in those networks. Formally, given an undirected un-weighted graph, the problem is to find the densest subgraph of the given graph.
The combinatorial algorithms based on max flow computation for finding maximum density subgraphs have been studied extensively [Gol84,Law01]. However, the time complexity of exactly finding the densest component using these algorithms is cubic or worse, and hence, they are not scalable for large graphs in real-world settings. Although the previous stated results are the best that any deterministic algorithm can achieve, there exist fast linear time algorithms for computing approximate solutions [Cha00,KP93]. However, they are destructive and, thus, return only a single subgraph.
There are other important variations such as finding the densest subgraphs. This problem, unfortunately, is NP-complete [FPK01].
We first define the density of a subgraph.
(1) |
We invite participants to submit novel and interesting solutions that aim at solving the following technical problems.
Participants can choose to solve either just the first problem or both of them. Depending on the problem being solved, the result could either contain information about a single subgraph or subgraphs.
The results must be returned according to the following format:
We will use the publicly available graph data sets from the Stanford Large Network Database collection (SNAP, http://snap.stanford.edu/data), to evaluate the solutions. We will consider input graphs from the following two categories of datasets:
The code written by participants should handle the input format of the corresponding mentioned categories as published on the website. More specifically, the participants should expose a command-line option with two input values:
The solutions would be evaluated based on the following two parameters:
Optimal Density: The optimal density here is the density of the subgraph returned after running the Goldberg's algorithm [Gol84] on our given graph. The solutions should aim at reducing the gap between the output from their approach and the optimal density as computed from the Goldberg's algorithm.
Running Time: The running time to obtain the optimal density again can be checked by using the Goldberg's algorithm. The participants need to reduce this time as much as possible, while keeping the density as close to the optimal as possible.
The submission guidelines and the execution environment details are provided below:
Each submission should contain all the necessary code, scripts, dependencies, libraries etc. to execute your program. Apart from this, it should also contain a detailed README (preferably with a makefile) that describes a step-by-step procedure to compile and run the submission.
In addition, the final executable should be also included.
Each submission should follow the input and output specification as mentioned in Section 2.
Each submission would be evaluated against the datasets (chosen by the judges) from the ones as mentioned in Section 2. The participants can judge the complexity and nature of the datasets by looking on the ones mentioned on the SNAP website.
The codes would be run sequentially (no multi-threading, no map-reduce) on an Intel(R) Xeon(R) E5645 12-core machine with 2.4GHz CPU, 92GB RAM, and 2TB HDD running Linux Debian 6.0.7. The participants can choose any programming language/technology/tool, with the constraint that it should be installed freely and readily on a Linux Debian desktop.
Finally, the participants are free to use any open/existing code-bases, libraries to facilitate an easy, neat and faster code for the same. However, the participants will need to properly acknowledge this work in their reports.
Each team is allowed up to submissions.
The evaluation will be done using a scoring function that takes into account both the density of the solution and the running time. Currently, the function can be simply thought of as a ratio of the density and the time:
The solution with the highest score wins.
However, to prevent trivial solutions, those that return subgraph(s) with density less than % of the optimal density will be rejected. Currently, is set to be %.
The committee reserves the right to change both the scoring function and the threshold with adequate notice to the participants.
We will also like to provide the participants with a head-start to the resources or papers, that we think, would be helpful in arriving at an interesting and novel solution. However, these resources are in no way limiting.
Andrew V. Goldberg: Finding a Maximum Density Subgraph. http://www.eecs.berkeley.edu/Pubs/TechRpts/1984/CSD-84-171.pdf
Victor E. Lee, Ning Ruan, Ruoming Jin, Charu C. Aggarwal: A Survey of Algorithms for Dense Subgraph Discovery. Managing and Mining Graph Data 2010: 303-336. http://charuaggarwal.net/dense-survey
Akhil Arora, Mayank Sachan, Arnab Bhattacharya: Mining statistically significant connected subgraphs in vertex labeled graphs. SIGMOD, 2014: 1003-1014.
David Gibson, Ravi Kumar, Andrew Tomkins: Discovering Large Dense Subgraphs in Massive Graphs. VLDB, 2005: 721-732.
Charalampos E. Tsourakakis, Francesco Bonchi, Aristides Gionis, Francesco Gullo, Maria A. Tsiarli: Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees. KDD, 2013: 104-112. http://videolectures.net/kdd2013_tsourakakis_quasi_cliques/
Samir Khuller, Barna Saha: On Finding Dense Subgraphs. http://www.cs.umd.edu/~samir/talks/ismp-dense.pdf
Finding Dense Subgraphs. http://www.cs.princeton.edu/~zdvir/apx11slides/charikar-slides.pdf
The competition is open to all. Participants can be in teams. There is a limit of on the team size.
Release of problem statement: 15th September.
Registration deadline of teams: 20th October 3rd November - New!.
Submission: 27th October 5th November - New!.
Announcement of results: 3rd November 12th November - New!.
The top-3 solutions and/or novel/interesting solutions will be invited to present a poster and demonstrate their software in the conference. In addition, the committee may invite the authors of those solutions to write a joint paper which may be published as part of the COMAD conference or at some other venue or journal. The top-3 teams will be presented with a certificate as well.
A sample program is given in sample.py. It works on the input file CA-GrQc.csv and produces the output file dense-CA-GrQc.csv.