Title: Exploration and Mining of Web Repositories
With the proliferation of very large data repositories hidden behind web interfaces, e.g., keyword search, form-like search and hierarchical/graph-based browsing interfaces for Amazon.com, eBay.com, etc., efficient ways of searching, exploring and/or mining such web data are of increasing importance. There are two key challenges facing these tasks: how to properly understand web interfaces, and how to bypass the interface restrictions. In this tutorial, we start with a general overview of web search and data mining, including various exciting applications enabled by the effective search, exploration, and mining of web repositories. Then, we focus on the fundamental developments in the field, including web interface understanding, sampling, and data analytics over web repositories with various types of interfaces. We also discuss the potential changes required for query processing, data mining and machine learning algorithms to be applied to web data. Our goal is two-fold: first, to promote awareness of existing web data search/exploration/mining techniques among all web researchers who are interested in leveraging web data; and second, to encourage researchers, especially those who have not previously worked in web search and mining, to initiate their own research in these exciting areas.
Gautam Das is a Full Professor in the Computer Science and Engineering Department of the University of Texas at Arlington. Prior to joining UTA, Dr. Das held positions at Microsoft Research, Compaq Corporation and the University of Memphis, as well as visiting positions at IBM Research. He graduated with a BTech in computer science from IIT Kanpur, India, and with a PhD in computer science from the University of Wisconsin-Madison. Dr. Das's research interests span social computing, data mining, information retrieval, databases, graph and network algorithms, and computational geometry. His research has resulted in over 150 papers, many of which have appeared in premier conferences and journals. He is the recipient of the IEEE ICDE 2012 Influential Paper Award. His research has been supported by grants from federal and state agencies such as the US National Science Foundation, US Office of Naval Research, US Department of Education, Texas Higher Education Coordinating Board, and Qatar National Research Fund, as well as by industry partners such as Cadence, Nokia, Apollo, and Microsoft.
Title: Process Mining Software Repositories
This tutorial covers Process Mining Software Repositories, an emerging discipline at the intersection of Process Mining (PM) and Mining Software Repositories (MSR). Process mining is a sub-field of business process intelligence and consists of mining event-log data for the purpose of process discovery, conformance checking or verification, and process enhancement. Mining Software Repositories consists of analyzing and mining structured and unstructured data stored in various software archives such as version control systems, issue tracking systems, peer code review systems, source code repositories and mail archives to solve problems encountered by practitioners. The tutorial is a half-day (4 hours) session at an intermediate level. The target audience is industry practitioners, researchers and academics in the areas of data mining, process mining, software analytics and mining software repositories. The prerequisite for the tutorial is a basic background in data mining and software engineering. The tutorial will be divided into 3 parts and cover topics such as: fundamentals of process mining, familiarity with the open-source process mining framework ProM, basics of Business Process Modeling Notation (BPMN), overview of mining software repositories and software analytics, understanding of common software repositories and archives, important applications of mining software repositories, and the basics, techniques and applications of process mining software repositories.
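As a rough illustration of process discovery and conformance checking on software repository data, the sketch below builds a directly-follows graph from a small event log. The issue identifiers, activity names and timestamps are hypothetical, and the directly-follows graph is only one simple discovery technique; this is not the tutorial's own material and does not use ProM.

```python
from collections import Counter, defaultdict

# Hypothetical issue-tracker event log: (case_id, activity, timestamp).
EVENT_LOG = [
    ("BUG-1", "Reported", 1),
    ("BUG-1", "Assigned", 2),
    ("BUG-1", "Resolved", 3),
    ("BUG-1", "Closed", 4),
    ("BUG-2", "Reported", 1),
    ("BUG-2", "Assigned", 2),
    ("BUG-2", "Resolved", 3),
    ("BUG-2", "Reopened", 4),
    ("BUG-2", "Resolved", 5),
    ("BUG-2", "Closed", 6),
]


def build_traces(event_log):
    """Group events by case and order them by timestamp."""
    traces = defaultdict(list)
    for case_id, activity, _ts in sorted(event_log, key=lambda e: (e[0], e[2])):
        traces[case_id].append(activity)
    return dict(traces)


def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    counts = Counter()
    for activities in traces.values():
        counts.update(zip(activities, activities[1:]))
    return counts


def conforms(trace, allowed_pairs):
    """Naive conformance check: every consecutive step must be an allowed pair."""
    return all(pair in allowed_pairs for pair in zip(trace, trace[1:]))


if __name__ == "__main__":
    traces = build_traces(EVENT_LOG)
    for (a, b), n in sorted(directly_follows(traces).items()):
        print(f"{a} -> {b}: {n}")
    # Treat BUG-1 as the reference process and check BUG-2 against it.
    reference = set(zip(traces["BUG-1"], traces["BUG-1"][1:]))
    print("BUG-2 conforms:", conforms(traces["BUG-2"], reference))
```

Here the reopened issue deviates from the reference lifecycle, which is the kind of finding conformance checking is meant to surface.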
Ashish Sureka is a Faculty Member at Indraprastha Institute of Information Technology, Delhi (IIIT-D). His current research interests are in the areas of Mining Software Repositories, Software Analytics, and Social Media Analytics. He graduated with MS and PhD degrees in Computer Science from North Carolina State University (NCSU) in May 2002 and May 2005 respectively. He has worked at IBM Research Labs in the USA and Siemens Research Lab (India), and was a Senior Research Associate at the R&D Unit of Infosys Technologies Limited before joining IIIT-D in July 2009. He has received research grants from the Department of Information Technology (DIT, Government of India), the Confederation of Indian Industry (CII) and the Department of Science and Technology (DST, Government of India). He has published several research papers in international conferences and journals, graduated several PhD and MTech students, organized workshops co-located with conferences, and received best paper awards. He was selected for the ACM India Eminent Speaker Program. He holds seven granted US patents.
Girish Maskeri Rama is a senior research scientist at Infosys. He has nearly 15 years of experience in applied research and product development. His research focus is on mining software repositories to provide actionable insights for better software maintenance. Previously, he worked extensively in software metrics and measurement, software refactoring, program comprehension, and model driven software development. Girish has served as a reviewer for several conferences such as Mining Software Repositories (MSR) and ISEC. He has published several papers in international journals and conferences such as IEEE TSE, Journal of Systems and Software, Wiley Software: Practice and Experience, ICSE, ICSM, APSEC and ISEC. Girish has also filed several patents (4 of which have been granted) in various areas of software engineering. Girish received his Masters in Computer Science from the University of York, UK, and is currently pursuing his PhD at IISc Bangalore.
Atul Kumar is a Senior Researcher at Siemens. Before Siemens he was a Principal Scientist in the Software Research group at ABB Corporate Research, India. He has worked at IBM Research, Microsoft and Accenture Technology Labs. His research interests are in the areas of Distributed Systems, Software Engineering, Internet Technologies and Data Engineering. He has co-organized workshops and special sessions related to Software Engineering and Cloud Computing at various conferences, including ICSE, ISEC and i-Society. He is serving as tutorials co-chair at ICIIS 2014. Atul holds a master's degree and a PhD in Computer Science from IIT Kanpur. Atul is a senior member of both IEEE and ACM.
Title: Entity Linking: Detecting Entities within Text
With unstructured text on the web and social media increasing at a furious pace, it is all the more important to develop techniques that can ease semantic understanding of text data for humans. One of the key tasks in this process is that of entity linking: identifying mentions of entities in text. Consider the line that reads "The Prime Minister came under harsh criticism over the Immigration Act 2014". Without any additional context, it is not obvious to a human reader who is being talked about. An entity linking technique that has the entity database at its disposal, however, can easily figure out that the mention Prime Minister refers to the Prime Minister of the UK, since the mention of Immigration Act 2014 in the same sentence narrows down the search space from the set of all countries that have Prime Ministers to just the UK. Such linking of text documents to entities enables easier understanding for the reader, as well as improved accuracy in automated tasks such as text document clustering, classification and information retrieval.
With the advent of social media, the set of entities that have a presence on the web has increased from just famous places, objects and people, to everyone that has a social media presence, which is to say, the vast majority of human beings. Availability of such a heterogeneous set of entities, ranging from those in domain-specific ontologies to social media profiles, provides fresh challenges and opportunities for entity linking. In this tutorial, we will cover the set of entity linking techniques that have been proposed in the literature over the years, and provide a systematic survey of them with classifications along various dimensions. We will also explore the applicability of entity linking on noisy and short texts, such as those generated in microblogging platforms (e.g., Twitter), and elaborate on the new challenges for entity linking that have not yet received enough attention from the scholarly community.
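The Prime Minister example above can be sketched as a toy context-overlap linker. The entity database, its alias sets and its context terms are hypothetical, and counting matching context terms is only one simple disambiguation heuristic, not a specific technique from the tutorial.

```python
# Hypothetical toy entity database: each entity has aliases (surface forms)
# and context terms that tend to co-occur with it.
ENTITY_DB = {
    "Prime Minister of the United Kingdom": {
        "aliases": {"prime minister"},
        "context": {"immigration act 2014", "house of commons"},
    },
    "Prime Minister of India": {
        "aliases": {"prime minister"},
        "context": {"lok sabha", "new delhi"},
    },
    "Prime Minister of Canada": {
        "aliases": {"prime minister"},
        "context": {"ottawa", "parliament hill"},
    },
}


def candidates(mention):
    """Return every entity whose alias set contains the mention."""
    key = mention.lower()
    return [name for name, entry in ENTITY_DB.items() if key in entry["aliases"]]


def link(mention, sentence):
    """Pick the candidate whose context terms best match the sentence.

    Returns None when there is no candidate, no supporting context,
    or a tie between the top-scoring candidates.
    """
    text = sentence.lower()
    scores = {
        name: sum(term in text for term in ENTITY_DB[name]["context"])
        for name in candidates(mention)
    }
    if not scores:
        return None
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


if __name__ == "__main__":
    sentence = ("The Prime Minister came under harsh criticism "
                "over the Immigration Act 2014")
    print(link("Prime Minister", sentence))
```

Without the supporting context term, all three candidates score zero and the linker declines to choose, which mirrors the ambiguity a human reader faces.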
Deepak P is a researcher in the Information Management Group at IBM Research - India, Bangalore. He obtained his B.Tech degree from Cochin University, India, followed by M.Tech and PhD degrees from IIT Madras, India, all in Computer Science. His current research interests include Similarity Search, Spatio-temporal Data Analytics, Graph Mining, Information Retrieval and Machine Learning. He is a senior member of the ACM and IEEE.
Sayan Ranu is an Assistant Professor at IIT Madras. Prior to joining IIT Madras, he was a researcher in the Information Management group at IBM Research - India, Bangalore. He obtained his PhD from the University of California, Santa Barbara. His current research interests include spatio-temporal data analytics, graph indexing and mining, and bioinformatics.