Course Project
Guidelines and suggestions for course projects
Your class project is an opportunity for you to explore an interesting problem in the context of real-world data sets. Projects should be done in teams of 3-4 students. Each project will be assigned a TA as a project consultant/mentor; instructors and TAs will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 46% of your final class grade, and will have 4 deliverables:
 Proposal : 2 pages excluding references (10%)
 Midway Report : 5 pages excluding references (20%)
 Presentation : poster presentation (20%)
 Final Report : 8 pages excluding references (50%)
All writeups should use the ICML style.
 Team Formation
 Project Proposal
 Midway Report
 Final Report
 Presentation
 Project Suggestions
 Deep generative models for disentangled representation learning
 Deep generative models for video, text, and audio generation
 Squeezing NO TEARS: scalable Bayesian network structure learning
 Optimization-based decoding for improved neural machine translation
 Learning contextual models for personalization, interpretability, and beyond
 Hierarchical RL & Maximum Entropy RL
 Large-scale (scalable) Machine Learning
 Machine Learning on/with Graphs
 Automated Machine Learning
Team Formation
You are responsible for forming project teams of 3-4 people. In some cases, we will also accept teams of 2, but a 3-4-person group is preferred. Once you have formed your group, please send one email per team to the class instructor list with the names of all team members. If you have trouble forming a group, please send us an email and we will help you find project partners.
Project Proposal
You must turn in a brief project proposal that provides an overview of your idea and also contains a brief survey of related work on the topic. We will provide a list of suggested project ideas for you to choose from, though you may discuss other project ideas with us, whether applied or theoretical. Note that even though you can use datasets you have used before, you cannot use work that you started prior to this class as your project.
Proposals should be approximately two pages long, and should include the following information:
 Project title and list of group members.
 Overview of project idea. This should be approximately half a page long.
 A short literature survey of 4 or more relevant papers. The literature review should take up approximately one page.
 Description of potential data sets to use for the experiments.
 Plan of activities, including what you plan to complete by the midway report and how you plan to divide up the work.
The grading breakdown for the proposal is as follows:
 40% for clear and concise description of proposed method
 40% for literature survey that covers at least 4 relevant papers
 10% for plan of activities
 10% for quality of writing
The project proposal will be due at 11:59 PM on Friday, February 22nd, and should be submitted via Gradescope.
Midway Report
The midway report will serve as a checkpoint at the halfway mark of your project. It should be about 5 pages long, and should be formatted like a conference paper, with the following sections: introduction, background & related work, methods, experiments, conclusion. The introduction and related work sections should be in their final form; the section on the proposed methods should be almost finished; the sections on the experiments and conclusions will have the results you have obtained, perhaps with placeholders for the results you plan/hope to obtain.
The grading breakdown for the midway report is as follows:
 20% for introduction and literature survey
 40% for proposed method
 20% for the design of upcoming experiments and revised plan of activities (in an appendix, please show the old and new activity plans)
 10% for data collection and preliminary results
 10% for quality of writing
The project midway report will be due at 11:59 PM on Friday, March 29th, and must be submitted via Gradescope.
Final Report
Your final report is expected to be 8 pages excluding references, in accordance with the length requirements for an ICML paper. It should have roughly the following format:
 Introduction: problem definition and motivation
 Background & Related Work: background info and literature survey
 Methods – overview of your proposed method; intuition on why it should be better than the state of the art; details of models and algorithms that you developed
 Experiments – description of your testbed and a list of questions your experiments are designed to answer; details of the experiments and results
 Conclusion: discussion and future work
The grading breakdown for the final report is as follows:
 10% for introduction and literature survey
 30% for proposed method (soundness and originality)
 30% for correctness, completeness, and difficulty of experiments and figures
 10% for empirical and theoretical analysis of results and methods
 20% for quality of writing (clarity, organization, flow, etc.)
The project final report will be due at 11:59 PM on Friday, May 10th (tentative), and must be submitted via Gradescope.
Presentation
All project teams will present their work at the end of the semester. We will have a 2-3-hour poster session held in the NSH atrium on April 30. Each team should prepare a poster (similar in style to a conference poster) and present it during the allocated time. If applicable, live demonstrations of your software are highly encouraged.
Project Suggestions
If you are interested in a particular project, please contact the respective contact person to get further ideas or details. We may add more project suggestions down the road.
Deep generative models for disentangled representation learning
Contact person: Paul Liang
Disentangled representation learning involves learning a set of latent variables that each capture individual factors of variation in the data. For example, when we learn a generative model for shapes, it would be ideal if each latent variable corresponded to the shape's pose, shadow, rotation, lighting, etc. This improves the interpretability of our learned representations and allows flexible generation from latent variables. You can explore new methods of learning disentangled representations in both supervised and unsupervised settings (i.e., whether information about pose, shadow, and rotation is given or not), design metrics for improved evaluation of disentanglement in models, or develop new applications of disentangled representation learning to improve performance on NLP, vision, and multimodal tasks.
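One well-known pressure toward disentanglement (not among the references below, but closely related) is the β-VAE idea of up-weighting the KL term in the VAE objective, pushing each latent dimension toward an independent prior. A minimal numpy sketch of that penalty, assuming a diagonal-Gaussian encoder:

```python
import numpy as np

def beta_vae_kl(mu, logvar, beta=4.0):
    """beta-weighted KL(q(z|x) || N(0, I)) for a diagonal-Gaussian encoder.

    Per-dimension closed form: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2).
    With beta > 1 (as in beta-VAE), each latent is pushed harder toward
    the factorized prior, one common pressure toward disentanglement.
    """
    kl_per_dim = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar)
    return beta * kl_per_dim.sum()

# An encoder output that exactly matches the prior incurs zero penalty:
print(beta_vae_kl(np.zeros(4), np.zeros(4)))   # 0.0
```

In a full model this term is added to the reconstruction loss; designing better-motivated regularizers (and metrics for them) is exactly the kind of question this project could explore.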
References:
 Papers in NIPS 2017 workshop on disentangled representation learning.
 Denton et al., Unsupervised Learning of Disentangled Representations from Video. NIPS 2017.
 Chen et al., InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. NIPS 2016.
 Kulkarni et al., Deep Convolutional Inverse Graphics Network. NIPS 2015.
Deep generative models for video, text, and audio generation
Contact person: Paul Liang
Generative models are important for probabilistic reasoning within graphical models. Recent advances in parameterizing these models using deep neural networks and optimizing them with gradient-based techniques have enabled large-scale modeling of high-dimensional, real-world data. Deep generative models have been successfully applied to image, text, and audio generation. You can explore new applications of deep generative models, improve the theoretical understanding and empirical optimization of deep generative models, design metrics for improved evaluation of deep generative models, or pursue other new directions.
References:
 IJCAI Tutorial on Deep Generative Models by Aditya Grover and Stefano Ermon.
 ICML Workshop on Deep Generative Models.
 Brock et al., Large Scale GAN Training for High Fidelity Natural Image Synthesis. ICLR 2019.
 Hu et al., Toward Controlled Generation of Text. ICML 2017.
 van den Oord et al., WaveNet: A Generative Model for Raw Audio. arXiv 2016.
Squeezing NO TEARS: scalable Bayesian network structure learning
Contact person: Xun Zheng
Estimating Bayesian network structure from data is one of the fundamental problems in graphical models. A similar problem of estimating Markov network structure has efficient solutions such as the graphical lasso [1]. However, compared to Markov networks, estimating Bayesian networks involves extra challenges: 1) the adjacency matrix is not symmetric; 2) the acyclicity constraint is combinatorial. Recently [2] proposed a smooth characterization of directed acyclic graphs, enabling continuous optimization for Bayesian network structure learning, similar to graphical lasso for Markov networks.
In this project, your goal would be to scale up the algorithm of [2] to thousands of nodes. Some computation can be saved by exploiting the problem structure, and some computation can be made more efficient using various approximation techniques.
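The key idea of [2] is the smooth acyclicity function \(h(W) = \mathrm{tr}(e^{W \circ W}) - d\), which equals zero exactly when the weighted adjacency matrix \(W\) encodes a DAG. A small numpy sketch (approximating the matrix exponential by a truncated power series, purely for illustration):

```python
import numpy as np

def notears_acyclicity(W, terms=20):
    """NO TEARS acyclicity function h(W) = tr(exp(W ∘ W)) - d.

    h(W) = 0 iff W encodes a DAG; h(W) > 0 measures 'cyclicity'.
    The matrix exponential is approximated here by a truncated power
    series (real implementations use scipy.linalg.expm or similar).
    """
    d = W.shape[0]
    A = W * W                     # elementwise square: non-negative entries
    E = np.eye(d)                 # running term A^k / k!
    trace = float(d)              # tr(A^0 / 0!) = tr(I) = d
    for k in range(1, terms):
        E = E @ A / k
        trace += np.trace(E)
    return trace - d

W_dag = np.array([[0.0, 1.5], [0.0, 0.0]])   # DAG: edge 0 -> 1 only
W_cyc = np.array([[0.0, 1.5], [0.7, 0.0]])   # cycle: 0 -> 1 -> 0
print(notears_acyclicity(W_dag))   # 0.0: no directed cycles
print(notears_acyclicity(W_cyc))   # > 0: the 2-cycle is penalized
```

Scaling this to thousands of nodes is exactly where the computational bottleneck lies: evaluating \(h\) and its gradient costs \(O(d^3)\) per step, which is what the project would try to reduce.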
References:
 Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics.
 Zheng, X., Aragam, B., Ravikumar, P., & Xing, E. P. (2018). DAGs with NO TEARS: Continuous Optimization for Structure Learning. In Advances in Neural Information Processing Systems.
Optimization-based decoding for improved neural machine translation
Contact person: Maruan Al-Shedivat
Encoder-decoder [1] with attention [2] is a typical architecture of modern neural machine translation (NMT) systems. Arguably, the quality of the decoder (as well as the decoding process itself) significantly affects translation and often requires large amounts of parallel training data.
In this project, your goal would be to explore the potential of building more efficient decoders using pretrained target language models (LMs). The idea is inspired by a recent technique used in model-based reinforcement learning [3]:
Given a sentence in the source language and a pretrained target LM, generate a sequence of words in the target language by starting from a random sequence and iteratively refining it to increase its likelihood under the given LM.
The challenge is to ensure that, while optimizing likelihood under the language model (which represents an unconditional distribution), the generated sentence remains a valid translation (i.e., preserves the meaning of the source sentence).
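To make the refinement idea concrete, here is a deliberately tiny sketch: a hypothetical bigram table stands in for the pretrained target LM, and single-position resampling stands in for gradient-based refinement. The source-sentence constraint (translation validity) is omitted entirely; this shows only the refine-to-increase-likelihood loop.

```python
import random

random.seed(0)

# Hypothetical stand-in for a pretrained target-language LM: a bigram
# table scoring adjacent word pairs (a real project would use a neural LM).
VOCAB = ["the", "cat", "sat", "on", "mat"]
GOOD = {("<s>", "the"), ("the", "cat"), ("cat", "sat"), ("sat", "on"),
        ("on", "the"), ("the", "mat"), ("mat", "</s>")}

def lm_loglik(seq):
    """Toy log-likelihood: 0 for each plausible bigram, -5 otherwise."""
    return sum(0.0 if pair in GOOD else -5.0 for pair in zip(seq, seq[1:]))

def refine(seq, steps=300):
    """Start from a degenerate sequence and iteratively resample single
    positions, keeping proposals that do not decrease LM likelihood: a
    discrete analogue of the gradient-based refinement described above."""
    seq = list(seq)
    for _ in range(steps):
        i = random.randrange(1, len(seq) - 1)        # keep <s>, </s> fixed
        proposal = seq[:i] + [random.choice(VOCAB)] + seq[i + 1:]
        if lm_loglik(proposal) >= lm_loglik(seq):
            seq = proposal
    return seq

start = ["<s>"] + ["mat"] * 6 + ["</s>"]
out = refine(start)
```

With a neural LM the inner loop would instead take continuous (e.g., gradient) steps on a relaxed sequence representation, and the open problem is adding the conditioning that keeps the output faithful to the source.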
References:
 Sutskever, Vinyals, Le. Sequence to sequence learning with neural networks. NIPS 2014.
 Bahdanau, Cho, Bengio. Neural machine translation by jointly learning to align and translate. ICLR 2015.
 Srinivas, Jabri, Abbeel, Levine, Finn. Universal Planning Networks. ICML 2018.
Learning contextual models for personalization, interpretability, and beyond
Contact person: Maruan Al-Shedivat
Oftentimes, we would like the models of our data to be not only accurate, but also human-interpretable. A recently developed class of models, called Contextual Explanation Networks (CENs) [1], provides a flexible way to achieve this: on a high level, it makes it possible to learn families of contextualized simple/interpretable models (e.g., sparse linear models), called explanations, which are ‘glued’ together via deep neural networks. In essence, CENs model conditional probability distributions of the form \(P(Y \mid X, C)\), distinguishing between semantic (or interpretable) features \(X\) and non-semantic features \(C\). There are a number of interesting directions one could take CEN further.
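The structural idea can be sketched in a few lines of numpy. All dimensions and weights below are illustrative (random, untrained); this is only a shape-level sketch of the CEN pattern, a network mapping context \(C\) to the parameters of a linear explanation over \(X\), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: interpretable features X, context features C.
dim_x, dim_c, hidden = 3, 8, 16

# "Context encoder": a small net mapping C to the weights (and bias) of a
# linear explanation over X. Random here; learned end-to-end in a real CEN.
W1 = rng.normal(size=(dim_c, hidden))
W2 = rng.normal(size=(hidden, dim_x + 1))

def explanation(c):
    """Generate a per-example linear model (w, b) from context c."""
    h = np.tanh(c @ W1)
    theta = h @ W2
    return theta[:dim_x], theta[dim_x]

def predict_proba(x, c):
    """P(Y=1 | X=x, C=c) = sigmoid(w(c) . x + b(c))."""
    w, b = explanation(c)
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

x = rng.normal(size=dim_x)
c = rng.normal(size=dim_c)
p = predict_proba(x, c)
w, b = explanation(c)   # the 'explanation': an inspectable linear model
```

The per-example weights `w` are what make the prediction interpretable; the directions below ask how far this pattern can be pushed beyond linear explanations and beyond conditional models.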

Beyond linear explanations.
The original paper [1] developed CEN with only linear explanations (for scalar and structured output spaces). Is it possible to use other types of simple models? For instance, lists of rules [2] or their causal version [3] are a popular approach when it comes to interpretability. Is it possible to integrate such models into the CEN framework and make them trainable end-to-end? Applications of ML in the healthcare domain may significantly benefit from such models.
Beyond conditional models.
CEN was designed to represent conditional distributions, i.e., to directly answer queries such as: What is the probability of \(Y\) given \(X\) and \(C\)? The problem is that the model always requires access to both modalities of the input (\(X\) and \(C\)), which is a limitation, since in real-world scenarios certain data modalities might be missing or hard to obtain. Hence, is it possible to build a CEN model that works with missing or latent \(X\) or \(C\) variables? Is building a contextual model for the joint distribution over \(X, C, Y\) the right way to go?
CEN for fewshot learning and/or metalearning.
Note that the CEN framework technically uses DNNs to generate parameters for simpler models (to contextualize them). More generally, one part of the model is used to ‘modulate’ another part, an approach that has been successful in a number of areas [4], including zero-shot learning, multilingual translation [5], etc. Multi-task few-shot learning and meta-learning [6] pose an interesting new setting in which contextual/conditional models could work very well [7]. Another direction to explore in your project is the design and implementation of CEN-like models for these new tasks.
References:
 Al-Shedivat, Dubey, Xing. Contextual explanation networks. arXiv 2017.
 Wang, Rudin. Falling rule lists. AISTATS 2015.
 Wang, Rudin. Causal falling rule lists. FATML workshop 2017.
 Dumoulin et al. Feature-wise transformations. Distill 2018.
 Platanios et al. Contextual parameter generation for neural machine translation. EMNLP 2018.
 Finn, Abbeel, Levine. Model-agnostic meta-learning for fast adaptation of deep networks. ICML 2017.
 Garnelo et al. Conditional neural processes. ICML 2018.
Hierarchical RL & Maximum Entropy RL
Contact person: Lisa Lee
Reinforcement learning (RL) is typically formulated as a Markov decision process (MDP) of an agent interacting with the environment to maximize its cumulative reward. More specifically, an MDP is a tuple \((\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)\) where \(\mathcal{S}\) is a set of states, \(\mathcal{A}\) is a set of actions, \(\mathcal{T}(s' \mid s, a)\) is the transition probability of ending up in state \(s' \in \mathcal{S}\) when executing action \(a \in \mathcal{A}\) in state \(s \in \mathcal{S}\), \(\mathcal{R}: \mathcal{S} \rightarrow \mathbb{R}\) is the reward function, and \(\gamma \in (0, 1]\) is a discount factor. A policy \(\pi\) maps each state-action pair \((s, a) \in \mathcal{S} \times \mathcal{A}\) to the probability \(\pi(s, a)\) of taking action \(a\) when in state \(s\). The agent’s goal is to learn a policy that maximizes its cumulative discounted reward \(\mathbb{E}_\pi\left[\sum_t \gamma^t r_t \right]\).
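For intuition, the definitions above can be exercised on a toy MDP. The transition probabilities and rewards below are made up purely for illustration; value iteration computes the optimal values and a greedy policy directly from the \((\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)\) tuple:

```python
import numpy as np

# A tiny 2-state, 2-action MDP matching the tuple (S, A, T, R, gamma) above.
# T[s, a, s'] = transition probability; R[s] = state reward (illustrative).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([0.0, 1.0])      # state 1 is the rewarding state
gamma = 0.9

# Value iteration: V(s) <- R(s) + gamma * max_a sum_s' T(s, a, s') V(s')
V = np.zeros(2)
for _ in range(200):
    V = R + gamma * (T @ V).max(axis=1)   # (T @ V) has shape (|S|, |A|)

# Greedy policy with respect to the converged values
pi = (T @ V).argmax(axis=1)
```

Here state 1 ends up with value \(1/(1-\gamma) = 10\), and the greedy policy in state 0 picks the action most likely to reach it. The deep RL challenges discussed next arise precisely because real problems give neither \(\mathcal{T}\) nor such small state spaces.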
In the last several years, deep learning has helped achieve major breakthroughs in RL by enabling methods to automatically learn features from high-dimensional observations (e.g., raw image pixels). These advances have especially benefited vision-based RL problems and robotic manipulation methods. Despite the progress, several key challenges limit the applicability and scalability of deep RL algorithms. For example, efficient exploration and long-term credit assignment remain core problems, especially in settings with sparse or delayed rewards. Another challenge is that the data inefficiency of deep RL algorithms makes training computationally expensive and difficult to scale with the complexity and number of tasks.
One framework to tackle these challenges is hierarchical RL (HRL), which enables temporal abstraction by learning hierarchical policies operating at different timescales and decomposing tasks into smaller subtasks. Another interesting line of work is maximum entropy RL, which encourages agents to learn diverse behaviors agnostic of the task.
I’d be happy to share more specific project ideas and advise students. Please feel free to stop by office hours if interested.
Some relevant works:
 Feudal Reinforcement Learning (1993)
 Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning (1999)
 FeUdal Networks for Hierarchical Reinforcement Learning (2017)
 Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation (2016)
 Learning Diverse Skills via Maximum Entropy Deep Reinforcement Learning (2017)
 Modular Multitask Reinforcement Learning with Policy Sketches (2017)
 On the Complexity of Exploration in Goal-Driven Navigation (2018)
 Learning Self-Imitating Diverse Policies (2018)
Large-scale (scalable) Machine Learning
Contact person: Hao Zhang
Machine learning is nowadays applied to extremely large datasets, which poses many challenges for existing models and algorithms: lack of scalability, lack of convergence guarantees, inefficient inference, the difficulty of programming over big data and big models, etc. This topic will allow us to explore different directions in large-scale machine learning to address these problems:
 Data parallelism and model parallelism [1-4]
 Large batch SGD, SGD with staleness [3, 4, 5, 7]
 Efficient Training/inference of Dynamic Neural Networks [6, 8]
 Scalable ML Systems [1-8]
References:
 Large Scale Distributed Deep Networks.
 Intro to Distributed Deep Learning Systems.
 More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server.
 Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
 An Empirical Model of LargeBatch Training.
 Cavs: An Efficient Runtime System for Dynamic Neural Networks.
 Toward Understanding the Impact of Staleness in Distributed Machine Learning.
 On-the-fly Operation Batching in Dynamic Computation Graphs.
Machine Learning on/with Graphs
Contact person: Hao Zhang
Machine learning on graphs is an important and ubiquitous task with applications ranging from drug design to friendship recommendation in social networks. The primary challenge in this domain is finding a way to represent, or encode, graph structure so that it can be easily exploited by machine learning models. Traditionally, machine learning approaches relied on user-defined heuristics to extract features encoding structural information about a graph (e.g., degree statistics or kernel functions). However, recent years have seen a surge in approaches that automatically learn to encode graph structure into low-dimensional embeddings, using techniques based on deep learning and nonlinear dimensionality reduction. Below are a few interesting topics that have been developed very actively in recent years and are worth exploring in this class:
 Deep Generative Models for Graphs [1, 2, 4]
 Combining Graphical Models with Neural Networks [1-5]
 Semi-supervised Learning on Graph-structured Data [3, 5]
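As a concrete example of learning to encode graph structure, the graph convolutional network of reference [3] propagates node features over the (self-loop-augmented, symmetrically normalized) adjacency matrix. A minimal numpy sketch of one layer, with illustrative sizes and random (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_layer(A, H, W):
    """One graph convolution layer in the style of reference [3]:
    H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W), where I adds self-loops
    and D is the degree matrix of A + I."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# 4-node path graph; 5 input features, 2 output features (illustrative)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 5))    # initial node features
W = rng.normal(size=(5, 2))    # learnable weights (random here)
H1 = gcn_layer(A, H, W)        # new node embeddings, shape (4, 2)
```

Stacking such layers lets each node's embedding aggregate information from progressively larger neighborhoods, which is the basic building block behind the semi-supervised and generative approaches listed above.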
References:
 Learning Deep Generative Models of Graphs.
 GraphRNN: Generating Realistic Graphs with Deep Autoregressive Models.
 Semi-Supervised Classification with Graph Convolutional Networks.
 Learning Multimodal Graph-to-Graph Translation for Molecular Optimization.
 Representation Learning on Graphs: Methods and Applications.
Automated Machine Learning
Contact person: Hao Zhang
The goal of AutoML is to make machine learning more accessible by automatically generating a data analysis pipeline that can include data preprocessing, feature selection, and feature engineering methods along with machine learning methods and parameter settings that are optimized for your data. Recently we have witnessed very active development in a few directions such as hyperparameter search, neural architecture search, etc. However, most of the existing methods only address a very small component of an end-to-end ML pipeline (from data to values), and they usually exhibit unsatisfactory efficiency and scalability.
 Efficient Search for Better Model Architectures, Optimization Methods, or Both [1-5, 7, 8, 9]
 Searching for Better End-to-end ML Pipelines [7]
 AutoML (for) Systems [6, 9]
References:
 ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware.
 AutoLoss: Learning Discrete Schedules for Alternate Optimization.
 DARTS: Differentiable Architecture Search.
 Neural Optimizer Search with Reinforcement Learning.
 Neural Architecture Search with Bayesian Optimisation and Optimal Transport.
 Ray: A Distributed Framework for Emerging AI Applications.
 AutoAugment: Learning Augmentation Policies from Data.
 MnasNet: Platform-Aware Neural Architecture Search for Mobile.
 Device Placement Optimization with Reinforcement Learning.