Notes from The Science of Deep Learning Colloquia

We attended The Science of Deep Learning colloquium, which is part of the Arthur M. Sackler Colloquia of the National Academy of Sciences, US. We listened to mostly high-level but motivating and fun talks by leading researchers working on deep learning. We drafted some notes, which are very terse and might be inaccurate, but you can get the intuition of the talks. We were told the videos will be released within two weeks on the NAS YouTube channel.

Day-1 Speakers: Amnon Shashua, Jitendra Malik, Chris Manning, Regina Barzilay, Tomaso Poggio, Oriol Vinyals, Terrence Sejnowski, Olga Troyanskaya, Kyle Cranmer, Eero Simoncelli, Bruno Olshausen, Antonio Torralba, Rodney Brooks

  • Favorites: Amnon Shashua, Kyle Cranmer, Eero Simoncelli, Olga Troyanskaya, Antonio Torralba, Rodney Brooks, Regina Barzilay

Day-2 Speakers: Tomaso Poggio, Nati Srebro, Peter Bartlett, Anders Hansen, P. Jonathon Phillips, Doina Precup, Haim Sompolinsky, Ronald Coifman, Konrad Kording, Tara Sainath, Jitendra Malik, Antonio Torralba, Jon Kleinberg, Terrence Sejnowski, Isabelle Guyon, Leon Bottou

  • Favorites: Nati Srebro, Peter Bartlett


1. The State of Deep Learning : Overview Talk (I) Amnon Shashua

  • Deep Learning w. overparametrized networks enabled: Training Error ↓ Test Error ↓

  • Quantum Entanglement related with Deep Nets

  • Self-Driving Cars, Driving Policy, Ethics in AI

    • Failure is allowed only due to perception error, and it should be less frequent than human error
    • There should be no failure due to a wrong decision
    • The world is rich with details; the challenge with pattern recognition is accuracy
    • For humans, perceptual accidents are very rare: one accident per 10^7 hours of driving
    • Combine two different (independent) subsystem for same task to reduce risk
  • Natural Language Understanding is a very good small test environment for General AI. Understanding a book has enough complexity.

2. The State of Deep Learning : Overview Talk (II) - Jitendra Malik, UC Berkeley

Phylogeny of intelligence.

Intelligence is really about perception and action.

The evolutionary progression

  • Vision and Locomotion
  • Manipulation - science falling behind
  • Language - same

Major Success of DL

CV, Speech understanding, machine translation, game playing

Behind the success: data + computing + annotation + simulation

1 neuron = 1000 instructions/sec

DL in the context of early history

Turing: rather than simulating an adult brain, try to simulate a child's brain and then educate it.

1980: Neocognitron: A Self-organizing Neural Network Model

LeCun 1989 - Convolutional Neural Networks


Mask R-CNN

Biological inspiration is overrated - deeper is better: from the retina to the back of the brain there are only seven synapses.

  • LeNet - AlexNet - VGG - GoogLeNet - ResNet - ResNeXt - Mask R-CNN

The future: Seeing 3D

Understand the geometry

Learned multiview scenario

Visual Navigation in Novel Environments


  • few-shot learning
  • learning with little supervision
  • unifying learning with geometric reasoning
  • perception and control

What about unsupervised learning? DL is a function approximation technique, there is input and output.

3. The State of Deep Learning : Overview Talk (III) - Chris Manning, Stanford

He provided an overview of DL in speech recognition and speech synthesis and how this technology evolved into products such as Alexa and Siri.

A challenge in the past: similar words had dissimilar representations, e.g. hotel and motel.

RNNs generate words by repeated sampling.

The idea of a bypass is very effective! LSTMs: Hochreiter & Schmidhuber 1997
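The hotel/motel point can be illustrated with a tiny sketch. With one-hot vectors every pair of distinct words is orthogonal, while dense vectors can place related words close together. The dense vectors below are made up by hand for illustration; they are not taken from the talk or from any trained model.

```python
import numpy as np

# Toy vocabulary; the words come from the talk, the vectors below are made up.
vocab = ["hotel", "motel", "banana"]

# One-hot vectors: every pair of distinct words is orthogonal.
one_hot = np.eye(len(vocab))

# Hypothetical dense embeddings, hand-picked so that hotel and motel are close.
dense = np.array([
    [0.9, 0.1, 0.0],   # hotel
    [0.8, 0.2, 0.1],   # motel
    [0.0, 0.1, 0.9],   # banana
])

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

hotel, motel, banana = 0, 1, 2
one_hot_sim = cosine(one_hot[hotel], one_hot[motel])          # no similarity at all
dense_sim = cosine(dense[hotel], dense[motel])                # high similarity
dense_sim_unrelated = cosine(dense[hotel], dense[banana])     # low similarity
```

In a real system the dense vectors would be learned from data rather than written by hand.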

Recurrent neural encoder-decoder networks, e.g. for translation tasks. Refs for deep representations of this kind: Sutskever et al. 2014, Luong et al. 2015

Bottleneck: all information of the sentence has to pass through one pipeline

Solution: seq to seq with attention.
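A minimal sketch of one attention step, to show how it removes the bottleneck: instead of a single fixed vector, the decoder builds a fresh weighted average of all encoder states at every step. The dot-product scoring used here is one common choice and is our assumption; the talk did not specify a scoring function, and the function and variable names are ours.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_attention(decoder_state, encoder_states):
    """Return (context, weights) for one decoder step.

    decoder_state:  shape (d,)    current decoder hidden state
    encoder_states: shape (T, d)  one hidden state per source token
    """
    scores = encoder_states @ decoder_state   # (T,) similarity scores
    weights = softmax(scores)                 # (T,) non-negative, sums to 1
    context = weights @ encoder_states        # (d,) weighted average of encoder states
    return context, weights

# Random stand-ins for real hidden states: 5 source tokens, hidden size 8.
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))
decoder_state = rng.normal(size=8)
context, weights = dot_product_attention(decoder_state, encoder_states)
```

The `weights` vector can be read as a soft alignment between the current output word and the source tokens.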

One evaluation: a 26% performance improvement on English-to-German translation in one year, from 2014 to 2015

Open domain question answering: DrQA. Ref. Chen et al., ACL 2017

The Stanford Attentive Reader

Contextual word representations from LMs - think of hidden states as representations

ELMo (Peters 2018) in an NER tagger - representations of this kind have grown over the years, e.g. BERT, GPT etc.

4. The State of Deep Reinforcement Learning, Oriol Vinyals, Google DeepMind

Summary: Atari - AlphaGo - AlphaStar

Reinforcement Learning Scheme: Agent, Observations, Actions, Environment
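The agent/environment scheme can be sketched as a simple interaction loop. Everything below is a hypothetical toy (a coin-guessing environment and a random agent, with names of our choosing); it only illustrates how observations, actions and rewards flow, not any system from the talk.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of the next coin flip."""

    def __init__(self, seed=0, horizon=10):
        self.rng = random.Random(seed)
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy first observation

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return coin, reward, done  # the observation is the last coin outcome

class RandomAgent:
    """Agent that ignores the observation and acts uniformly at random."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def act(self, observation):
        return self.rng.randint(0, 1)

def run_episode(env, agent):
    """Agent observes, acts, and the environment answers with reward and next observation."""
    observation = env.reset()
    total_reward, done, steps = 0.0, False, 0
    while not done:
        action = agent.act(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps

total_reward, steps = run_episode(CoinFlipEnv(), RandomAgent())
```

A learning agent would replace `RandomAgent.act` with a policy that is updated from the observed rewards.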

Mentioned DeepMind's 2015 Atari paper

Behind every great agent there's a great environment (importance of datasets)

Policy and Value Learning

Atari vs Go: the action space is larger in Go; information type: near-perfect vs perfect

  1. AlphaGo: Policy Network and Value Network (predicts the winner). Take a search + imitate the search: related summary blog post that I found

    Policy Improvement Theorem

    AlphaGo Zero: 4 TPUs; AlphaGo (Lee Sedol): 48 TPUs

  2. StarCraft: the action space is 10^6. Exploration is really hard. Imitation learning. It is a real-time strategy game, so fast prediction is needed in the playing phase. Grandmaster level

Patterns for Success and Challenges Ahead

  1. Success

    1. Environment full of rewards: Atari

    2. Available human demonstrations: AlphaGo, AlphaStar

    3. Algorithmic ways to improve the policy: AlphaGo, AlphaZero

    4. ...

  2. Challenges

    1. The real world is not a simulation

    2. Transfer / General AI: AlphaGo should learn chess quickly, AlphaStar should learn Atari easily, an ImageNet classifier should trivially transfer to MNIST

    3. Understanding the theory of non-convex optimization

5. Panel: Tomaso Poggio, Regina Barzilay, Terrence Sejnowski, Rodney Brooks

Regina: Interpretability is needed for better understanding. Supervision and quality of supervision. Datasets are biased (e.g. Fake News Dataset: remove the evidence and we can still predict the bias :)).

Tomaso: We need more science and math, but we have many GPU hours instead 🙂 Big data is not realistic. We humans learn from very little data. Current architectures are not the way humans learn.

Terrence: New scientific studies will be enabled by the hype of deep learning; we experienced similar things when the heat equation was solved. Deep learning revolution

Rodney: we can't trust systems trained on deep nets

6. Deep Learning in Science(I): Regina Barzilay, MIT

MIT Machine Learning for Pharmaceutical Discovery Consortium

Challenge in drug discovery: a huge combinatorial space.

How to deploy ML to address this challenge? Predicting chemical reactions.

  1. Property prediction: take a molecule, extract a molecular fingerprint, graph convolution. One reason the initial model failed was domain transfer.
  2. Better Molecule Generation

What are the open questions? Molecular representation beyond graphs; modeling the underlying physics; how to improve a molecule to have better properties.

String to string generation: Linearize the molecules (SMILES) - did poorly.

Graph to graph generation: Invalidity of intermediate molecules is a challenge. Should produce many diverse outputs.

Tree decomposition: Molecule to tree.

7. Deep Learning in Science(II): Deep Learning and Particle Physics - Kyle Cranmer, NYU

  • Higgs boson discovery

  • particle collision = complex probabilistic model

  • created particles create other particles

  • likelihood calculation is very hard

  • Bayesian inference under intractable likelihoods: likelihood-free inference

    • Approximate Bayesian Computation (can also be intractable)
    • You just need to do forward simulation
    • Sufficient statistics sometimes cannot be determined

New approaches

  1. Use Simulator
    1. Hijack the inside of simulator
  2. Learning The Simulator
    1. Generative Adversarial Network
    2. Learning the Likelihood Ratio (Supervised)
    3. Likelihood ratio trick (binary classifier ~= likelihood ratio)
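The likelihood ratio trick can be sketched on a toy problem where the answer is known. A classifier s(x) trained on balanced samples from p0 and p1 approximates p1/(p0 + p1), so s/(1 - s) approximates the ratio p1/p0. The two Gaussians, the logistic regression and all names below are our assumptions for illustration; in the physics setting the samples would come from the simulator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
x0 = rng.normal(loc=0.0, scale=1.0, size=n)   # samples from p0 = N(0, 1), label 0
x1 = rng.normal(loc=1.0, scale=1.0, size=n)   # samples from p1 = N(1, 1), label 1
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(n), np.ones(n)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic regression s(x) = sigmoid(w * x + b), fitted by plain gradient descent
# on the mean cross-entropy loss.
w, b = 0.0, 0.0
for _ in range(2000):
    s = sigmoid(w * x + b)
    w -= 0.5 * np.mean((s - y) * x)
    b -= 0.5 * np.mean(s - y)

def estimated_log_ratio(x_new):
    """log r(x) = log(s / (1 - s)) = w * x + b for a balanced training set."""
    return w * x_new + b

def true_log_ratio(x_new):
    """log N(x; 1, 1) - log N(x; 0, 1) = x - 0.5."""
    return x_new - 0.5

grid = np.linspace(-2.0, 3.0, 11)
max_abs_error = float(np.max(np.abs(estimated_log_ratio(grid) - true_log_ratio(grid))))
```

Only samples from the two distributions are used during training; the densities themselves are never evaluated, which is the point of the trick.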


ML has the potential to effectively bridge the microscopic-macroscopic divide

Physics Aware Machine Learning

Intersection of Deep Learning (successful) and Bayesian Methods (interpretable)

  1. Physics aware Gaussian Processes

  2. QCD-Aware Recursive Neural Networks

  3. QCD-Aware Graph Convolutional Networks

  4. JUNIPR: a generative model for jets that can be trained on real data and is interpretable!

8. Deep Learning in Science(III): DL in Genomic Research - Olga Troyanskaya, Princeton

How does a single mutation in the genome affect gene regulation?

Which SNPs are functional and lead to human disease?

Understanding disease causing mutations using DL.

Two types of mutations:

  1. Coding variant
  2. Noncoding regulatory variant

Model should be

Genomic sequence -> sequence model -> chromatin organization

Model is trained on a single genome.

Deep convolutional network-based sequence model

Why relevant to genome data?

  • Many examples of the same sequence along the whole DNA.
  • Capture context information
  • Interaction of seq features
  • Multi task prediction

The proposed model is able to predict histone marks, DNase accessibility, transcription factors given a single code change/mutation.

DeepSEA identifies a significant noncoding regulatory mutation burden in ASD. The ASD cohort is composed of families where autism is observed in only one of the children and not in the rest of the family.

One question: could autism be a result of stronger mutations, not just any mutation, i.e. a sibling can also have the mutation but not the disease? Yes. BioRxiv. Nature Genetics.

ExPecto - ab initio prediction of tissue-specific gene expression from sequence.

A pipeline of methods (deep learning, spatial feature transformation, regularized linear models) to obtain the expression and the associated impact of a mutation.


  • A DL-based algorithmic framework for predicting the effect of any non-coding mutation in the genome.
  • A computational framework for accurate prediction of tissue-specific expression, including de novo prediction of expression variation
  • Functional networks produced by semi-supervised data integration enable insight into mechanisms of human disease, including Alzheimer's, Parkinson's and cardiovascular diseases.

9. Deep Learning in Science(IV), Eero Simoncelli, NYU

  • Deep Convolutional Neural Nets
  • Largely inspired by neurobiology
  • Astonishing (but often brittle) results
  • Model for sensory neurobiology
  • Basic neural selectivity
  • MRI - stimulus similarity
  • Missing:
    • Largely unsupervised
    • Non-classification objective
    • Local learning
    • Recurrence/state/context (memory, reward, attention)
    • Myriad biophysical details

Example 1: Difference between two images (Berardino et al. NIPS 2017)

  • MSE is not a good measure for human eyes
  • L(X, Xhat) = ||f(X) - f(Xhat)||
  • TID2008 Dataset
  • All models (whether deep or not) perform the same on the test data
  • Which one generalizes? Compare the least noticeable and the most visible eigen-distortions
  • Local gain control
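A small sketch of why L(X, Xhat) = ||f(X) - f(Xhat)|| can disagree with MSE. The transform `f` below is a hypothetical stand-in (global mean removal and contrast normalization, a crude form of gain control), not the model from the paper. Two distortions with the same MSE get very different distances under it.

```python
import numpy as np

def f(image, eps=1e-8):
    """Hypothetical stand-in for a perceptual transform: remove the mean
    luminance and divide by the contrast (a crude, global gain control)."""
    centered = image - image.mean()
    return centered / (centered.std() + eps)

def mse(x, x_hat):
    return float(np.mean((x - x_hat) ** 2))

def perceptual_distance(x, x_hat):
    """L(X, Xhat) = ||f(X) - f(Xhat)|| with the toy f above."""
    return float(np.linalg.norm(f(x) - f(x_hat)))

rng = np.random.default_rng(0)
image = rng.uniform(0.2, 0.8, size=(16, 16))

brighter = image + 0.1                           # uniform luminance shift, MSE = 0.01
noise = rng.normal(size=image.shape)
noise *= 0.1 / np.sqrt(np.mean(noise ** 2))      # rescale so the MSE is also 0.01
noisy = image + noise

mse_brighter, mse_noisy = mse(image, brighter), mse(image, noisy)
dist_brighter = perceptual_distance(image, brighter)   # close to zero
dist_noisy = perceptual_distance(image, noisy)         # clearly larger
```

MSE rates both distortions as equally bad, while the normalized representation ignores the brightness shift and penalizes the noise.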

Example 2: Perceptual Straightening of Videos (Henaff, Goris, Simoncelli, Nature Neuroscience 2019)

  • Curvature in Pixel Domain
  • Perceptual experiments on Humans
  • Humans reduce curvature in their brain
  • CNNs do not work like the human brain

10. Panel: Scientific Funding for Deep Learning

Super Turing Computation

Can handle situations it hasn't encountered before utilizing previous learning.

Lifelong Learning Machines (L2M) - is concerned with learning while executing and improving over a lifetime

11. Can Deep Learning provide Deep Insights for Neuroscience, Bruno Olshausen, Berkeley

Embrace complexity of biology

  • Neuroscience has moved on a lot: a neuron in ML is not a real neuron
  • Cortical circuits:
    1. Highly organized by layer
    2. Layers are interconnected in a canonical microcircuit
    3. Feedback connections

What problems should we be solving?

Showed pictures of animals with good capabilities

Animals' vision systems are very robust and low power

  • Nakayama et al. (1995)

  • O'Regan & Noe (2001)

  • Mumford (2010) Pattern Theory - Sparse Discreteness



The Sparse Manifold Transform (Yubei Chen, NeurIPS 2018)

12. Dissecting Neural Networks, Antonio Torralba, MIT

Very fun talk :))

10 billion dollars were spent on CERN data. We spend far, far less on datasets for learning.

Cycle of Deep Learning: we realize datasets are biased, then google (?), then new datasets

Understanding Deep Representations: Network Dissection (~visualization)

Test units for semantic segmentation. Top activated images: IoU

GANs: just as we can identify which neuron is responsible for detecting which object in CNNs, we can also identify which neuron in a GAN draws which part of the image

13. Super Intelligence, Rodney Brooks

History of AI

Turing papers mentioned

Approaches to AI

a) Symbolic Approach

Logic, statements about symbols, inference and reasoning
Compositionality in symbolic systems
Symbols are not grounded

b) Neural Networks

c) Traditional Robotics

Finding corners and feature points in a picture

d) Behavior Robotics

Behavior trees

What we are doing wrong currently

There were some fun comics

A better Turing test? Get machines to do real tasks in the world

Hard Things to Do

Real Perception

Ex: Chessboard where the gray and white squares have the same intensity

Ex: Blue-filtered strawberry image where the RGB colors on the computer are not red but rather blue

Our perception adjusts according to the context

Audience learned a category with 3 images 🙂

Real Manipulation:

Read a book

Common Sense Reasoning

What we should work on

  • Object recognition capabilities of a 2 year old
  • Language capabilities of a 4 year old
  • Manual dexterity of a 6 year old
  • Social understanding of an 8 year old

Day 2

1. Networks of neurons for learning and representing symbols in the brain, Tomaso Poggio, MIT


2. Inductive Bias and Optimization in DL - Nati Srebro, TTIC


  1. Capacity of the learning system - how many samples do we need to generalize?
  2. Expressiveness - can we capture reality?
    • One opinion: NNs can approximate any function. The objective could be expressiveness with small samples. In some cases, even small networks can capture everything.
  3. Computation/Optimization: it is NP-hard to find weights even with 2 hidden units. Even for the simplest NN with O(log d) units and no noise, no poly-time algorithm always works.
    • Thus there might be some magic property of reality that makes local search work.


As the number of hidden units increases, the training error decreases. In one trial it turns out that the test error also keeps decreasing as the number of parameters increases. In repeated trials, most of the test errors are large. Maybe in the cases where the test fails although the training error is zero, it is the norm of the parameters that is high, not the number of parameters. So what is the relevant "complexity measure" (it could be a norm etc.)? And how is it minimized by the optimization algorithm?

Ref. Neyshabur, Tomioka, Srebro, ICLR '15



Different optimization algorithm -> Different bias in optimum reached -> Different inductive bias -> Different generalization properties

We need to understand an optimization algorithm not just as reaching some (global) optimum, but as reaching a specific optimum. The choice of optimization algorithm matters! The solution space is like an ocean.
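A simple linear analogue of "reaching a specific optimum" (our choice of example, not one from the talk): in underdetermined least squares there are infinitely many zero-error solutions, yet gradient descent started at zero converges to the one with minimum L2 norm. The sizes, step size and names below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 20                          # fewer samples than parameters
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Gradient descent on 0.5 * ||X w - y||^2, started at zero.
w = np.zeros(p)
step = 1.0 / np.linalg.norm(X, 2) ** 2    # 1 / (largest singular value)^2
for _ in range(5000):
    w -= step * X.T @ (X @ w - y)

w_min_norm = np.linalg.pinv(X) @ y        # minimum L2-norm interpolant

# Another global minimum: add a direction from the null space of X.
null_direction = np.linalg.svd(X)[2][-1]  # right singular vector with X @ v ~ 0
w_other = w_min_norm + 3.0 * null_direction

train_residual_gd = float(np.linalg.norm(X @ w - y))
train_residual_other = float(np.linalg.norm(X @ w_other - y))
gap_to_min_norm = float(np.linalg.norm(w - w_min_norm))
```

Both `w` and `w_other` fit the training data exactly, but gradient descent lands on the minimum-norm one; a different algorithm or initialization could land elsewhere.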

Example 1: Unconstrained Matrix Completion

Ref. Gunasekar, Woodworth, Bhojanapalli, Neyshabur 2017

Gradient descent (small step size etc.) finds not just any global minimum but the minimum nuclear norm solution, which brings generalization.

Example 2: Single Overparameterized Linear Unit

Example 3: Linear Conv Nets Over-parameterization

Optimization geometry, and hence inductive bias, is affected by the geometry of local search in parameter space and by the choice of parameterization.

3. Peter Bartlett

Generalization: prediction accuracy on the test set

Typical theorem: pred_err <= trn_err + complexity_penalty [1]

  1. Empirical process theory for classification

  2. Margins analysis: relating classification to regression

  3. Interpolation: there is no apparent tradeoff between fit and complexity

  4. Interpolation in Linear Regression

VC Theory

With probability at least 1 - \delta: P(f(x) \neq y) \le \frac{1}{n} \sum_{i=1}^{n} 1[f(x_i) \neq y_i] + \sqrt{\frac{c}{n} (VCdim(F) + \log(1/\delta))}

For neural networks, the VC-dimension grows with p (number of parameters) and L (number of layers), roughly as

  • p if the nonlinearity is piecewise constant
  • pL if the nonlinearity is piecewise linear

A classification problem becomes a regression if we use a loss function that doesn't vary too quickly.

For regression, the complexity of a NN is controlled by the size of parameters.

Interpolation in DL - A new challenge for Statistical Learning Theory

Deep networks can be trained to zero training error for regression loss with near state-of-the-art performance, even for noisy problems. Thus there is no notion of a tradeoff between fit to training data and complexity as in [1]. Ref. Zhang, Bengio, Hardt et al. 2017 and Belkin et al. 2018.

Interpolation in Linear Regression

Classical linear regression setting, with n samples. f(x) = x'\theta, squared error as loss, risk = E[loss].

Choose \hat{\theta} to minimize the average training error.

Excess expected loss: Empirical Risk - True Risk

^Q is corrupted because our view of covariance of x is distorted by x1, x2, ..., xn. Also, the noise.

Accurate interpolating prediction as dimension p_n grows.

Consider covariance of x in two pieces

  • a fixed piece due to dimension k
  • a tail which flattens with n


Interpolation: far from the regime of a trade off between fit to training data and complexity.

In high-dimensional linear regression if the covariance has a long and flat tail the minimum norm interpolant can hide the noise in these many unimportant directions.

  • Relies on overparametrization
  • and lots of unimportant parameters
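For reference, the minimum norm interpolant mentioned above can be written down explicitly. This is the standard definition, stated under our assumptions that X is the n x p design matrix with p > n and full row rank, and y is the vector of training targets:

```latex
\hat{\theta}
  = \arg\min_{\theta} \|\theta\|_2
    \quad \text{subject to} \quad X\theta = y,
\qquad \text{which gives} \qquad
\hat{\theta} = X^{\top} \left( X X^{\top} \right)^{-1} y .
```

It fits the training data exactly (X \hat{\theta} = y), so any noise in y is absorbed by the parameters rather than traded off against fit.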

Can we extend these results to interpolating deep networks?

Empirical process theory for classification: need n>>p

Margins analysis: with a Lipschitz loss, complexity can depend on the size of the parameters.

Interpolation: a new challenge. Where is the tradeoff between fit and complexity?

  • Interpolation in linear regression can exploit overparametrization to hide the noise.

4. Why neuroscience needs science of DL , Konrad Kording, UPenn

Goals in computational systems neuroscience

  1. Understandable.
  2. Should work.

5. Does AI come at a cost? Instabilities in DL - Anders Hansen

DeepFool was developed at EPFL to test the instability of NNs.

Theorem: There are uncountably many classification problems.

Key point: there is always a NN that achieves zero training error but still achieves generalizability.

Question: Can stable neural networks be produced using recursion?

Example: Ref. On instabilities of deep learning in image reconstruction. Antun Renna Poon et al.

Image reconstruction with NNs is completely unstable.

If you overperform in two images, things go wrong (instability).

The instability problem is a nontrivial one. But we can test them against instabilities. Cure is DL theory.

6. Challenge and scope for Empirical Modeling for ML - Ronald Coifman, Yale

At this point ML provides encoders/tabulation and regression. The real quest should be to find intrinsic variables which enable direct consistency and performance matching between algorithmic learners.

7. Panel - Julia Kempe & Eero Simoncelli

Expressive theory vs General Theory

What do our students care about? Computation, data size (n=1), instabilities.

8. Dataset for Analyzing Face Recognition - P. Jonathon Phillips, NIST

  • FERET, Dept of State, Mugshots (2010) - 1.6 million images

Two questions: Verification and Rank 1 recognition (who is this person?)

Ref. Lessons from collecting a million biometric samples - Phillips, Flynn

Face recognition accuracy of forensic examiners, superrecognizers and algorithms

An experiment including human recognizers in four groups with different expertise levels.

Best recognizer agent is created by combining one facial examiner and A2017b.

9. Neural Solvers for Power Transmission Problems, Isabelle Guyon, Paris-Sud University

AI & Electricity

Thesis works

  • Deep learning methods for predicting flows in the power grid by Benjamin Donnot
  • RL for controlling power grids

The load flow: Input: production, topology etc. --- numeric solver ---> output: power flows

One example of a numeric solver is Hades 2: the challenge is speed; at 100 ms it should be faster by 2 orders of magnitude.

LEAPNet - Latent Encoding of Atypical Perturbations

  • Generalizes to combinatorial topology changes

LEAPNet is able to predict around operating conditions. Ref. LEAPNets for power grid perturbations, Donnot et al. 2019.

GNS (Graph Neural Solver) for Power Systems - iteratively propagates messages through edges

Conclusion: Augmented intelligence = operators + hades2 + NNs

10. From Deep Reinforcement Learning to AI, Doina Precup, McGill-MILA

The standard RL scheme is inspired by animals. In AlphaGo the environment is very clear and the reward function is well known

Golden Goal: Efficient, continual learning and reasoning

Knowledge Representation of AlphaGo (policy and value)

Procedural Knowledge and Predictive/empirical knowledge

Knowledge must be: Expressive, Learnable, Composable

Procedural Knowledge: Options

  • Option: (initiation set, policy, termination condition)
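The option triple can be sketched as a small data structure. The types, the `run_option` helper, the `step` function and the corridor example are all hypothetical and only illustrate the idea of an option as a temporally extended action; termination is treated as deterministic for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Set

State = int
Action = int

@dataclass
class Option:
    """An option as the triple (initiation set, policy, termination condition)."""
    initiation_set: Set[State]
    policy: Callable[[State], Action]
    termination: Callable[[State], float]   # probability of stopping in a state

    def can_start(self, state: State) -> bool:
        return state in self.initiation_set

def run_option(option, state, step, max_steps=100):
    """Follow the option's policy until it terminates.

    `step(state, action) -> next_state` is a hypothetical environment function.
    Termination is simplified: stop once the termination probability reaches 1.
    """
    assert option.can_start(state)
    trajectory = [state]
    for _ in range(max_steps):
        state = step(state, option.policy(state))
        trajectory.append(state)
        if option.termination(state) >= 1.0:
            break
    return trajectory

# Toy corridor with states 0..5: the option "walk right until the door at state 5".
walk_to_door = Option(
    initiation_set={0, 1, 2, 3, 4},
    policy=lambda s: +1,
    termination=lambda s: 1.0 if s == 5 else 0.0,
)
trajectory = run_option(walk_to_door, state=2, step=lambda s, a: min(s + a, 5))
```

A higher-level policy can then choose among such options instead of among primitive actions.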

Options as behavioral programs

Where do options come from: Domain knowledge, Option-Critic Models

Back to value function

Knowledge Representation: Generalized Value Functions (cumulant function, continuation function) coming from the Horde architecture

Option Models

Life Learning Agent

11. Theory-based measures of object representations in deep artificial and biological networks, Haim Sompolinsky, Hebrew University of Jerusalem

! Not familiar with the subject, so couldn't write much !

Untangling Object Manifolds

Object classification capacity

12. NNs in Speech Recognition - Tara Sainath , Google AI

Conventional ASR pipeline:

Input speech -> Feature extraction -> DNN/RNN Acoustic Models -> Decoder -> Second Pass -> Rescoring Output

NNs helped combine the feature extraction and classification steps into one.

Deepness in speech: in lower layers, similar phones from different people are grouped together, whereas in higher layers better discrimination is achieved.

Ref. B. Li et al Interspeech 2017

Multi-channel neural networks for Google HOME

What does the network learn? Filters are doing spatial and spectral filtering.

Model: End2End-trained Seq2Seq recognizer combining the whole pipeline for the sake of simplicity; the model size shrinks and joint optimization becomes possible.

Ref. C. Chiu, ICASSP 2018 - the conventional baseline model was outperformed by E2E, which is launched in Gboard.

Tail cases in speech recognition: numerics, context injecting, injecting domain knowledge.

13. Panel: What's missing in today's experimental analysis of DL? P. Jonathon Phillips, Jitendra Malik, Peter Bartlett, Antonio Torralba, Isabelle Guyon

Question: If a breakthrough happens, do we have the capacity to realize/test it?

Reproducibility facilitated the DL revolution.

Datasets are biased such that the creator's algorithm shines.

14. Right Ways Forward(I): Terrence Sejnowski, Salk Institute for Biological Studies

! Organizers requested short talks starting with this talk, so there was not much to note !

High-dimensional geometry and subspaces will become important.

Adversarial examples: perturbation could help build NNs that are robust against adversarial attacks.

We’re looking for general architectural principles.

15. Right Ways Forward(II): Jon Kleinberg, Cornell

Social policy and algorithmic decisions

Screening as a prediction problem: tabular structures into predictions, e.g. CVs

Interpretability problem for human decisions

Two categories of discrimination:

  • Disparate treatment: deliberately favoring individuals based on race, gender etc.
  • Disparate impact: regardless of intent, the outcome is disproportionate.

The challenge in correcting for human bias

Key argument: well regulated algorithms can make discrimination easier to detect.

Decomposing a gap in outcomes: Disparity = structural disparity + bias from choice of outcome + ...

16. From Machine Learning to Artificial Intelligence: Leon Bottou, Facebook AI Research

Caveat 1: Statistical problem is only a proxy to the real problem.

ML algorithms recklessly take advantage of spurious correlations.

Caveat 2: Causality

Viewpoints on causality:

  • Manipulative causation
  • Causal invariance
  • Causal reasoning
  • Dispositional causation: where do causations come from
  • Causal intuition: correlation is not causation, but the data contains hints, for instance asymmetric relations.

The scientific method is a good model of a learning process. Hypothesis generation precedes empirical validation.