  1. Large-scale Knowledge Harvesting
    Speaker: Partha Pratim Talukdar (IISc)

  2. Machine Learning in the Real World
    Speakers: Vineet Chaoji, Rajeev Rastogi, Gourav Roy (Amazon)

  3. Estimating Network Properties via Sampling
    Speaker: Anirban Dasgupta (IIT Gandhinagar)

Tutorial 1: Large-scale Knowledge Harvesting

Knowledge harvesting from Web-scale text datasets has emerged as an important and active research area over the last decade, resulting in the automatic construction of large knowledge bases (KBs) consisting of millions of entities and the relationships among them. This has the potential to revolutionize Artificial Intelligence and intelligent decision making by removing the knowledge bottleneck that has long plagued systems in these areas. Knowledge harvesting has also seen prominent commercial adoption in the form of the Google Knowledge Graph and the IBM Watson system.

In spite of this early success, several challenging research questions spanning Machine Learning, Natural Language Processing, Crowdsourcing, Knowledge Representation, Data Management, Systems, and Large Data Analytics remain wide open in the area of Web-scale knowledge harvesting. This tutorial will give an overview of relevant foundational and recent literature on this topic, with the goal of preparing participants for further research in this exciting and emerging area.


Partha Talukdar is an Assistant Professor in the Department of Computational and Data Sciences (CDS) at the Indian Institute of Science (IISc), Bangalore. Before that, he was a Postdoctoral Fellow in the Machine Learning Department at Carnegie Mellon University, working with Tom Mitchell on the NELL project. Partha received his PhD (2010) in CIS from the University of Pennsylvania, working under the supervision of Fernando Pereira, Zack Ives, and Mark Liberman. Partha is broadly interested in Machine Learning, Natural Language Processing, and Cognitive Neuroscience, with particular interest in large-scale learning and inference. Partha is a recipient of an IBM Faculty Award, Google’s Focused Research Award, and an Accenture Open Innovation Award. He is a co-author of a book on Graph-based Semi-Supervised Learning published by Morgan & Claypool Publishers. Homepage: http://talukdar.net

Tutorial 2: Machine Learning in the Real World

Machine Learning (ML) has become a mature technology that is being applied to a wide range of business problems such as web search, online advertising, product recommendations, and object recognition. As a result, it has become imperative for researchers and practitioners to have a fundamental understanding of ML concepts and practical knowledge of end-to-end modeling. This tutorial takes a hands-on approach to introducing the audience to machine learning. The first part of the tutorial gives a broad overview and discusses some of the key concepts within machine learning. The second part of the tutorial takes the audience through the end-to-end modeling pipeline for a real-world income prediction problem. The tutorial includes some hands-on exercises. If you want to follow along, you will need a laptop with at least 2 GB of RAM and the Firefox or Google Chrome browser installed. Note that your laptop must be able to connect to the internet via Wi-Fi or your mobile data connection. We will be using Docker containers, so no other specific software needs to be installed on your laptop.


Vineet Chaoji is an Applied Science Manager within the Core Machine Learning team at Amazon where he leads projects related to econometric models of customer behavior, customer targeting and malware detection. Prior to joining Amazon, he was a Scientist at Yahoo! Labs in Bangalore where his research focused on online advertising and social networks. Vineet obtained a PhD in Computer Science from Rensselaer Polytechnic Institute. He has published at top-tier data mining and database conferences and journals. Vineet has also served on the program committees of leading data and web mining conferences.

Rajeev Rastogi is the Director of Machine Learning at Amazon, where he directs the development of machine learning platforms and applications such as product classification, product recommendations, customer targeting, and deals ranking. Previously, he was the Vice President of Yahoo! Labs in Bangalore, where he was responsible for research programs impacting Yahoo!'s web search and online advertising products. He was named a Bell Labs Fellow in 2003 for his contributions to Lucent's networking products while he was at Bell Labs Research in Murray Hill, New Jersey. Rajeev was named an ACM Fellow in 2012 for his contributions to large-scale data analysis and management. He has published over 100 papers in top-tier international conferences and 33 papers in international journals. Rajeev has also been a prolific inventor with 57 issued US patents. He is currently a member of the News editorial board of the CACM, and was previously an Associate Editor for TKDE. He has served on over 50 program committees of the leading database and data mining conferences, and was a Program Co-chair for the Applied Data Science track of the KDD conference in 2016, the CIKM conference in 2013, and the ICDM conference in 2005.

Gourav Roy is a Senior Software Engineer in the Core Machine Learning team at Amazon, where he builds scalable machine learning platforms and applications. He is interested in streaming approximate algorithms and distributed systems. His work on streaming anomaly detection was recently accepted at the International Conference on Machine Learning. Prior to joining Amazon, he earned a bachelor's degree in Computer Science at BIT Mesra.

Tutorial 3: Estimating Network Properties via Sampling

Large networks are ubiquitous in various fields and knowledge of the values of different network properties is often important in making various scientific and business decisions. Such value estimates can also provide key insights about the current "status" and "health" of the network, as well as about possible generative processes. However, in various cases, the network is either implicit and has to be inferred, or is accessible only indirectly via queries made or experiments done. Examples include social networks formed by relations among people, various chemical interaction networks, or networks that are inaccessible due to privacy concerns. Often the large size of the network itself might be a bottleneck to arbitrary access patterns. In such settings, sampling the network judiciously and using the sample to infer the desired network property is often an effective technique. Such techniques have been much studied from both the data mining and theory perspectives.

In this tutorial, we will survey a number of network sampling strategies for different property estimation tasks, e.g., sizes of different subpopulations, average degree, various motif counts, and other structural properties. Such sampling often has to be implemented via random walks and crawling techniques. We will discuss some of these methods and their practical significance, present approaches for proving theoretical guarantees, and outline some of the open questions in the area. The tutorial will be mostly self-contained and accessible to anyone who knows the basics of graph theory, probability, and linear algebra.


Anirban Dasgupta is currently an Associate Professor of Computer Science & Engineering at IIT Gandhinagar. Prior to this, he was a Senior Scientist at Yahoo! Labs Sunnyvale. Anirban works on algorithmic problems for massive data sets, large-scale machine learning, analysis of large social networks and randomized algorithms in general. He did his undergraduate studies at IIT Kharagpur and doctoral studies at Cornell University. He has also received the Google Faculty Research Award (2015), the Cisco University grant (2016), and the ICDT Best Newcomer Award (2016).

Copyright 2016-17 CSI. All Rights Reserved