We attended the Science of Deep Learning colloquium, which is part of the Arthur M. Sackler Colloquia of the National Academy of Sciences, US. We listened to mostly high-level but motivating and fun talks by leading researchers working on deep learning. We drafted some notes, which are very terse and might be inaccurate, but you can get the intuition of the talks. We were told the videos would be released within two weeks on the NAS YouTube channel.
Day 1 Speakers: Amnon Shashua, Jitendra Malik, Chris Manning, Regina Barzilay, Tomaso Poggio, Oriol Vinyals, Terrence Sejnowski, Olga Troyanskaya, Kyle Cranmer, Eero Simoncelli, Bruno Olshausen, Antonio Torralba, Rodney Brooks
 Favorites: Amnon Shashua, Kyle Cranmer, Eero Simoncelli, Olga Troyanskaya, Antonio Torralba, Rodney Brooks, Regina Barzilay
Day 2 Speakers: Tomaso Poggio, Nati Srebro, Peter Bartlett, Anders Hansen, P. Jonathon Phillips, Doina Precup, Haim Sompolinsky, Ronald Coifman, Konrad Kording, Tara Sainath, Jitendra Malik, Antonio Torralba, Jon Kleinberg, Terrence Sejnowski, Isabelle Guyon, Leon Bottou
 Favorites: Nati Srebro, Peter Bartlett
DAY 1
1. The State of Deep Learning : Overview Talk (I) Amnon Shashua

Deep learning with overparametrized networks enabled both: training error ↓ and test error ↓

Quantum entanglement is related to deep nets: https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.122.065301

Self-driving cars, driving policy, ethics in AI
 Failure is allowed only due to perception error, and it should be less than human error
 There should be no failure due to a wrong decision
 The world is rich with details; the challenge with pattern recognition is accuracy
 For humans, perceptual accidents are very rare: 10^7 hours of driving per accident
 Combine two different (independent) subsystems for the same task to reduce risk
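The redundancy argument in the last bullet can be made concrete with a small worked calculation. This is only a sketch: the per-hour failure probabilities and the subsystem names below are hypothetical (they are not figures from the talk), and the result holds only if the two subsystems really fail independently.

```python
def combined_failure_rate(p_a: float, p_b: float) -> float:
    """Probability that two independent subsystems both fail in the same hour."""
    return p_a * p_b


# Hypothetical per-hour perception failure probabilities (not from the talk).
p_camera = 1e-4
p_radar = 1e-4

p_both = combined_failure_rate(p_camera, p_radar)
hours_per_joint_failure = 1.0 / p_both

print(f"joint failure probability per hour: {p_both:.1e}")
print(f"expected hours per joint failure:   {hours_per_joint_failure:.1e}")
```

Under these assumed numbers, two mediocre but independent subsystems together fail less often than the 10^7 hours per accident quoted for humans. Any correlation between their failures weakens the product rule.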

Natural Language Understanding is a very good small test environment for General AI. Understanding a book has enough complexity.
2. The State of Deep Learning : Overview Talk (II)  Jitendra Malik, UC Berkeley
Phylogeny of intelligence.
Intelligence is really about perception and action.
The evolutionary progression
 Vision and Locomotion
 Manipulation: science is falling behind
 Language: same
Major Success of DL
CV, Speech understanding, machine translation, game playing
Behind the success: data + computing + annotation + simulation
1 neuron = 1000 instructions/sec
DL in the context of early history
Turing: rather than simulating an adult brain, try to simulate a child's brain and then educate it.
1980: Neocognitron: A Self-organizing Neural Network Model
LeCun 1989: Convolutional Neural Networks
R-CNN
Mask R-CNN
Biological inspiration is overrated; deeper is better: from the retina to the back of the brain there are only seven synapses.
 LeNet -> AlexNet -> VGG -> GoogLeNet -> ResNet -> ResNeXt -> Mask R-CNN
The future: Seeing 3D
Understand the geometry
Learned multiview scenario
Visual Navigation in Novel Environments
Challenges:
 few-shot learning
 learning with little supervision
 unifying learning with geometric reasoning
 perception and control
What about unsupervised learning? DL is a function approximation technique, there is input and output.
3. The State of Deep Learning : Overview Talk (III)  Chris Manning, Stanford
He provided an overview of DL in speech recognition and speech synthesis, and of how this technology evolved into innovations such as Alexa and Siri.
A challenge in the past: similar words but dissimilar representations, e.g. hotel and motel
RNNs generate words by repeated sampling
The idea of a bypass is very effective!
LSTMs: Hochreiter & Schmidhuber 1997
Recurrent neural encoder-decoder networks, e.g. for translation tasks. Refs for deep representations of this kind: Sutskever et al. 2014, Luong 2015
Bottleneck: all information of the sentence has to pass through one pipeline
Solution: seq to seq with attention.
One evaluation: a 26% performance improvement for English-to-German translation in one year, from 2014 to 2015
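To make the attention idea concrete, here is a minimal NumPy sketch. It assumes dot-product scoring between the current decoder state and every encoder state, which is one common choice; the notes do not record which variant the talk covered. The names `attend`, `encoder_states` and `decoder_state` are illustrative.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def attend(decoder_state: np.ndarray, encoder_states: np.ndarray):
    """Dot-product attention.

    decoder_state:  shape (d,)   current decoder hidden state
    encoder_states: shape (T, d) one hidden state per source word
    Returns the context vector (d,) and the attention weights (T,).
    """
    scores = encoder_states @ decoder_state   # one score per source position
    weights = softmax(scores)                 # normalize to a distribution
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights


rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))      # 5 source words, 4-dim states
decoder_state = rng.normal(size=4)
context, weights = attend(decoder_state, encoder_states)
print(weights.round(3), context.round(3))
```

Instead of squeezing the whole sentence through a single vector, the decoder gets a fresh weighted summary of all encoder states at every step.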
Open-domain question answering: DrQA. Ref. Chen et al., ACL 2017
The Stanford Attentive Reader
Contextual word representations from LMs: think of hidden states as representations
ELMo (Peters 2018) in an NER tagger; representations of this kind grew over the years, e.g. BERT, GPT etc.
4. The State of Deep Reinforcement Learning, Oriol Vinyals, Google, DeepMind
Summary: Atari -> AlphaGo -> AlphaStar
Reinforcement Learning Scheme: Agent, Observations, Actions, Environment
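The agent/observation/action/environment scheme boils down to a simple interaction loop. The sketch below uses a made-up toy environment (`CoinFlipEnv`) and a random agent purely for illustration; the class names and the `reset`/`step`/`act` interface are assumptions, not something shown in the talk.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the next coin flip, reward 1 if right."""

    def __init__(self, episode_length: int = 10, seed: int = 0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return 0  # dummy first observation

    def step(self, action: int):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.t += 1
        done = self.t >= self.episode_length
        return coin, reward, done  # observation is the coin that was flipped


class RandomAgent:
    """Agent that ignores the observation and acts uniformly at random."""

    def __init__(self, seed: int = 1):
        self.rng = random.Random(seed)

    def act(self, observation: int) -> int:
        return self.rng.randint(0, 1)


def run_episode(env, agent):
    """Standard loop: observe, act, receive reward and next observation."""
    observation = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = agent.act(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


total_reward, steps = run_episode(CoinFlipEnv(), RandomAgent())
print(f"steps={steps}, total reward={total_reward}")
```

A learning agent would replace `RandomAgent.act` with a policy that is updated from the observed rewards.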
Mentioned DeepMind's 2015 Atari paper
Behind every great agent there's a great environment (importance of datasets)
Policy and Value Learning
Atari vs. Go: the action space is larger in Go; information type is near-perfect vs. perfect

AlphaGo: Policy Network and Value Network (predicts the winner)
Take a search + imitate the search: related summary blog post that I found
Policy Improvement Theorem
AlphaGo Zero: 4 TPUs
AlphaGo (Lee Sedol): 48 TPUs

StarCraft:
 Action space is 10^6
 Exploration is really hard
 Imitation learning
 Real-time strategy game; fast prediction is needed in the playing phase
 Grandmaster level
Patterns for Success and Challenges Ahead

Success

Environment full of rewards: Atari

Available human demonstrations: AlphaGo, AlphaStar

Algorithmic ways to improve the policy: AlphaGo, AlphaZero

...


Challenges

The real world is not a simulation

Transfer / General AI: AlphaGo should learn chess quickly, AlphaStar should learn Atari easily, an ImageNet classifier should trivially transfer to MNIST

Understanding the theory: non-convex optimization

5. Panel: Tomaso Poggio, Regina Barzilay, Terrence Sejnowski, Rodney Brooks
Regina: Interpretability is needed for better understanding. Supervision and quality of supervision matter. Datasets are biased (e.g. in a fake news dataset, remove the evidence and we can still predict, thanks to the bias :)).
Tomaso: We need more science and math, but we have many GPU hours instead 🙂 Big data is not realistic. We humans learn from very little data. Current architectures are not the way humans learn.
Terrence: New scientific studies will be enabled by the deep learning hype; we experienced something similar when the heat equation was solved. Deep learning revolution.
Rodney: We can't trust systems trained on deep nets.
6. Deep Learning in Science(I): Regina Barzilay, MIT
MIT Machine Learning for Pharmaceutical Discovery Consortium
Challenge in drug discovery: a huge combinatorial space.
How to deploy ML to address this challenge. Predicting chemical reactions.
 Property prediction: take a molecule, extract a molecular fingerprint, apply graph convolution. One reason the initial model failed was domain transfer.
 Better Molecule Generation
What are the open questions?
 Molecular representation beyond graphs
 Modeling the underlying physics
 How to improve a molecule to have better properties
String-to-string generation: linearize the molecules (SMILES); this did poorly.
Graph to graph generation: Invalidity of intermediate molecules is a challenge. Should produce many diverse outputs.
Tree decomposition: Molecule to tree.
7. Deep Learning in Science(II): Deep Learning and Particle Physics  Kyle Cranmer, NYU

Higgs boson discovery

particle collision = complex probabilistic model

created particles create other particles

likelihood calculation is very hard

Bayesian inference under intractable likelihoods: likelihood-free inference
 Approximate Bayesian Computation (can also be intractable)
 You just need to do forward simulation
 Sufficient statistics sometimes cannot be determined
New approaches
 Use the simulator
 Hijack the inside of the simulator
 Learn the simulator
 Generative Adversarial Network
 Learn the likelihood ratio (supervised)
 Likelihood ratio trick (binary classifier ~= likelihood ratio)
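The likelihood ratio trick in the last bullet can be illustrated on a toy problem. A balanced binary classifier trained to separate samples from p0 and p1 has output s(x) with s(x)/(1 - s(x)) ≈ p1(x)/p0(x). In the sketch below two 1-D Gaussians stand in for samples drawn from a simulator at two parameter points (an assumption made for illustration), and a hand-rolled logistic regression stands in for the classifier. For N(0,1) vs N(1,1) the true log ratio is x - 0.5, so the learned logit can be checked directly.

```python
import numpy as np


def fit_logistic(x: np.ndarray, y: np.ndarray, lr: float = 0.5, steps: int = 2000):
    """1-D logistic regression fitted by plain gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        s = 1.0 / (1.0 + np.exp(-(w * x + b)))   # classifier output s(x)
        w -= lr * np.mean((s - y) * x)
        b -= lr * np.mean(s - y)
    return w, b


rng = np.random.default_rng(0)
n = 20_000
x0 = rng.normal(0.0, 1.0, n)   # stand-in for simulator samples under hypothesis 0
x1 = rng.normal(1.0, 1.0, n)   # stand-in for simulator samples under hypothesis 1
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(n), np.ones(n)])

w, b = fit_logistic(x, y)


def estimated_log_ratio(x_new):
    """logit(s(x)) = log(s / (1 - s)), which estimates log p1(x) / p0(x)."""
    return w * x_new + b


def true_log_ratio(x_new):
    """Exact log likelihood ratio for N(1,1) vs N(0,1)."""
    return x_new - 0.5


grid = np.linspace(-2.0, 3.0, 6)
print(np.round(estimated_log_ratio(grid), 2))
print(np.round(true_log_ratio(grid), 2))
```

Only forward samples were needed; the densities themselves were never evaluated during training, which is the point of the trick when the likelihood is intractable.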
Takeaways
ML has the potential to effectively bridge the microscopic-macroscopic divide
Physics-Aware Machine Learning
Intersection of Deep Learning (successful) and Bayesian Methods (interpretable)

Physics-aware Gaussian Processes

QCD-Aware Recursive Neural Networks

QCD-Aware Graph Convolutional Networks

JUNIPR: a generative model for jets; it can train on real data and is interpretable
8. Deep Learning in Science(III): DL in Genomic Research  Olga Troyanskaya, Princeton
How does a single mutation in genome affect gene regulation?
Which SNPs are functional and lead to human disease?
Understanding disease causing mutations using DL.
Two types of mutations:
 Coding variant
 Noncoding regulatory variant
Model should be
Genomic sequence -> sequence model -> chromatin organization
Model is trained on a single genome.
Deep convolutional network-based sequence model
Why relevant to genome data?
 Many examples of the same sequence along the whole DNA.
 Capture context information
 Interaction of seq features
 Multi task prediction
The proposed model is able to predict histone marks, DNase accessibility and transcription factor binding given a single code change/mutation.
DeepSEA identifies a significant noncoding regulatory mutation burden in ASD. The ASD cohort is composed of families where autism is observed in only one of the children and not in the rest of the family.
One question: could autism be a result of stronger mutations, not just a mutation, i.e. can a sibling also have the mutation but not the disease? Yes. BioRxiv. Nature Genetics.
ExPecto: ab initio prediction of tissue-specific gene expression from sequence.
A pipeline of methods (deep learning, spatial feature transformation, regularized linear models) to obtain the expression and the associated impact of a mutation.
Summary
 A DL-based algorithmic framework for predicting the effect of any noncoding mutation in the genome.
 A computational framework for accurate prediction of tissue-specific expression, including de novo prediction of expression variation.
 Functional networks produced by semi-supervised data integration enable insight into mechanisms of human disease, including Alzheimer's, Parkinson's and cardiovascular diseases.
9. Deep Learning in Science(IV), Eero Simoncelli, NYU
 Deep Convolutional Neural Nets
 Largely inspired by neurobiology
 Astonishing (but often brittle) results
 Model for sensory neurobiology
 Basic neural selectivity
 MRI  stimulus similarity
 Missing:
Largely unsupervised
Non-classification objective
Local learning,
adaptation,
gain control,
homeostasis,
Recurrence/state/context (memory, reward, attention)
Myriad biophysical details
Example 1: Difference between two images (Berardino et al. NIPS 2017)
 MSE is not a good measure for human eyes
 L(X, Xhat) = ||f(X) - f(Xhat)||
 TID2008 Dataset
 All models (whether deep or not) perform the same on the test data
 Which one generalizes? Compare the least noticeable and the most visible eigen-distortions
 Local gain control
Example 2: Perceptual Straightening of Videos (Henaff, Goris, Simoncelli, Nature Neuroscience 2019)
 Curvature in Pixel Domain
 Perceptual experiments on Humans
 Humans reduce curvature in their brain
 CNNs do not work like human brain
10. Panel: Scientific Funding for Deep Learning
Super Turing Computation
Can handle situations it hasn't encountered before utilizing previous learning.
Lifelong Learning Machines (L2M): concerned with learning while executing and improving over a lifetime
11. Can Deep Learning provide Deep Insights for Neuroscience, Bruno Olshausen, Berkeley
Embrace complexity of biology
 Neuroscience has moved on a lot: a biological neuron is not the neuron of ML
 Cortical Circuits:
 Highly Organized by layer
 Layers are interconnected in a canonical microcircuit
 Feedback connections
What problems should we be solving?
showed pictures of animals with good capabilities
animal's vision system is very robust, low power

Nakayama et al. (1995)

O'Regan & Noe (2001)

Mumford (2010) Pattern Theory Sparse Discreteness
Transformations
Hierarchy
The Sparse Manifold Transform (Yubei Chen, NeurIPS 2018)
12. Dissecting Neural Networks, Antonio Torralba, MIT
Very fun talk :))
10 billion dollars were spent for CERN data. We spend far, far less on datasets for learning.
Cycle of Deep Learning: we realize datasets are biased, then google (?), then new datasets
Understanding Deep Representations: Network Dissection (~visualization)
Test units for semantic segmentation; top activated images: IoU
GANs: just as we can identify which neuron is responsible for detecting which object in CNNs, we can also identify which neuron draws which part of the image in GANs
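The IoU score mentioned above compares the region where a unit fires with the region labeled as a concept. Here is a minimal sketch on a single tiny activation map; the threshold is picked by hand and the function name `unit_concept_iou` is illustrative, whereas a real dissection pipeline would aggregate over a whole dataset and choose the threshold from activation statistics.

```python
import numpy as np


def unit_concept_iou(activation: np.ndarray, concept_mask: np.ndarray,
                     threshold: float) -> float:
    """Intersection over union between a thresholded unit activation map
    and a binary segmentation mask for one concept."""
    unit_mask = activation > threshold
    intersection = np.logical_and(unit_mask, concept_mask).sum()
    union = np.logical_or(unit_mask, concept_mask).sum()
    return float(intersection) / float(union) if union > 0 else 0.0


# Toy 2x2 example: the unit fires on the top row, the concept covers one pixel.
activation = np.array([[0.9, 0.8],
                       [0.1, 0.2]])
concept_mask = np.array([[True, False],
                         [False, False]])

print(unit_concept_iou(activation, concept_mask, threshold=0.5))  # 0.5
```

A unit is then associated with the concept for which this score is highest, which is how a neuron gets a label such as "tree" or "door".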
13. Super Intelligence, Rodney Brooks
History of AI
Turing papers mentioned
Approaches to AI
a) Symbolic Approach
Logic, statements about symbols, inference and reasoning
Compositionality in symbolic systems
Symbols are not grounded
b) Neural Networks
c) Traditional Robotics
Finding corners and feature points in a picture
d) Behavior Robotics
Behavior trees
What we are doing wrong currently (there were some fun comics)
A better Turing test? Get machines to do real tasks in the world
Hard Things to Do
Real Perception
Ex: Chess board where the grays and whites have the same intensity
Ex: Blue-filtered strawberry image where the computer's RGB colors are not red but rather more blue
Our perception adjusts according to the context
Audience learned a category with 3 images 🙂
Real Manipulation:
Read a book
Common Sense Reasoning
What we should work on
 Object recognition capabilities of a 2-year-old
 Language capabilities of a 4-year-old
 Manual dexterity of a 6-year-old
 Social understanding of an 8-year-old
DAY 2
1. Networks of neurons for learning and representing symbols in the brain, Tomaso Poggio, MIT
missed!
2. Inductive Bias and Optimization in DL  Nati Srebro, TTIC
Goals:
 Capacity of the learning system: how many samples do we need to generalize?
 Expressiveness: can we capture reality?
 One opinion: NNs can approximate any function. The objective could be expressiveness with small samples. In some cases, even small networks can capture everything.
 Computation/Optimization: it is NP-hard to find the weights even with 2 hidden units. Even for the simplest NN with O(log d) units and no noise, there is no polynomial-time algorithm that always works.
 Thus there might be some magic property of reality that makes local search work.
Experiment
As the number of hidden units increases, the training error decreases. In one trial, it turns out that the test error also keeps decreasing as the number of parameters increases. In repeated trials, most of the test errors are large. Maybe in the cases where the test fails although the training error is zero, it is the norm of the parameters that is high, not the number of parameters. So what is the relevant "complexity measure" (it could be a norm etc.)? And how is it minimized by the optimization algorithm?
Ref. Neyshabur, Tomioka, Srebro, ICLR '15
SGD vs ADAM
Optimization
Different optimization algorithm -> Different bias in the optimum reached -> Different inductive bias -> Different generalization properties
We need to understand an optimization algorithm not just as reaching some (global) optimum, but as reaching a specific optimum. The choice of optimization algorithm matters! The solution space is like an ocean.
Example 1: Unconstrained Matrix Completion
Ref. Gunasekar, Woodworth, Bhojanapalli, Neyshabur 2017
Gradient descent (with small step size etc.) does not find just any global minimum but the minimum nuclear norm solution, which brings generalization.
Example 2: Single Overparametrized Linear Unit
Example 3: Linear Conv Nets Overparametrization
Optimization geometry, and hence inductive bias, is affected by the geometry of local search in parameter space and by the choice of parametrization.
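The claim that the optimizer picks a specific global optimum can be checked in the simplest setting. The sketch below is not one of the talk's examples: it uses underdetermined least squares (more parameters than samples, random data chosen for illustration), where it is a standard fact that gradient descent started from zero converges to the minimum Euclidean norm interpolating solution, even though infinitely many other zero-error solutions exist.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 20                      # fewer samples than parameters
X = rng.normal(size=(n, p))
y = rng.normal(size=n)


def gradient_descent(X: np.ndarray, y: np.ndarray,
                     lr: float = 0.01, steps: int = 20_000) -> np.ndarray:
    """Gradient descent on 0.5 * ||X theta - y||^2, starting from zero."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta -= lr * X.T @ (X @ theta - y)
    return theta


theta_gd = gradient_descent(X, y)
theta_min_norm = np.linalg.pinv(X) @ y           # minimum L2-norm interpolant

# Another global minimum: add a component from the null space of X.
null_projector = np.eye(p) - np.linalg.pinv(X) @ X
theta_other = theta_min_norm + null_projector @ rng.normal(size=p)

print("train residual (GD):   ", np.linalg.norm(X @ theta_gd - y))
print("train residual (other):", np.linalg.norm(X @ theta_other - y))
print("norm (GD):   ", np.linalg.norm(theta_gd))
print("norm (other):", np.linalg.norm(theta_other))
print("GD equals min-norm solution:", np.allclose(theta_gd, theta_min_norm, atol=1e-6))
```

Both solutions fit the training data exactly, but gradient descent lands on the one with the smallest norm. A different algorithm or parametrization would carry a different implicit bias.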
3. Peter Bartlett
Generalization: prediction accuracy on the test set
Typical Theorem: pred_err <= trn_err+complexity_penalty [1]
Agenda

Empirical process theory for classification

Margins analysis: relating classification to regression

Interpolation: there is no apparent tradeoff between fit and complexity

Interpolation in Linear Regression
VC Theory
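P(f(x) \neq y) <= (1/n) \sum_i 1[f(x_i) \neq y_i] + \sqrt{(c/n) (VCdim(F) + \log(1/\delta))}, where the first term is the training classification error rate and the bound holds with probability at least 1 - \delta.
The VC dimension of neural networks increases with (p = # of parameters, L = # of layers):
 p if the nonlinearity is continuous
 pL if the nonlinearity is piecewise continuous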
A classification problem becomes a regression if we use a loss function that doesn't vary too quickly.
For regression, the complexity of a NN is controlled by the size of parameters.
Interpolation in DL: a new challenge for Statistical Learning Theory
Deep networks can be trained to zero training error (for regression loss) with near state-of-the-art performance, even for noisy problems. Thus there is no sign of the tradeoff between fit to training data and complexity expressed in [1]. Ref. Zhang, Bengio, Hardt et al. 2017 and Belkin et al. 2018.
Interpolation in Linear Regression
Classical linear regression setting, with n samples. f(x) = x'\theta, squared error as loss, risk = E[loss].
Choose \hat\theta to minimize the average training error.
Excess expected loss: Empirical Risk - True Risk
^Q is corrupted because our view of covariance of x is distorted by x1, x2, ..., xn. Also, the noise.
Accurate interpolating prediction as dimension p_n grows.
Consider covariance of x in two pieces
 a fixed piece due to dimension k
 a tail which flattens with n
Summary
Interpolation: far from the regime of a tradeoff between fit to training data and complexity.
In high-dimensional linear regression, if the covariance has a long and flat tail, the minimum norm interpolant can hide the noise in these many unimportant directions.
 Relies on overparametrization
 and lots of unimportant parameters
Can we extend these results to interpolating deep networks?
Empirical process theory for classification: need n >> p
Margins analysis: with a Lipschitz loss, complexity can depend on the size of the parameters.
Interpolation: a new challenge. Where is the tradeoff between fit and complexity?
 Interpolation in linear regression can exploit overparametrization to hide the noise.
4. Why neuroscience needs science of DL , Konrad Kording, UPenn
Goals in computational systems neuroscience
 Understandable.
 Should work.
5. Does AI come at a cost? Instabilities in DL, Anders Hansen
DeepFool was developed at EPFL to test the instability of NNs.
Theorem: There are uncountably many classification problems.
Key point: there is always a NN that achieves zero training error but also achieves generalizability.
Question: Can stable neural networks be produced using recursion?
Example: Ref. On instabilities of deep learning in image reconstruction. Antun, Renna, Poon et al.
Image reconstruction with NNs is completely unstable.
If you overperform in two images, things go wrong (instability).
The instability problem is a nontrivial one. But we can test networks against instabilities. The cure is DL theory.
6. Challenge and scope for Empirical Modeling for ML, Ronald Coifman, Yale
At this point ML provides encoders/tabulation and regression. The real quest should be to find intrinsic variables which enable direct consistency and performance match between algorithmic learners.
7. Panel  Julia Kempe & Eero Simoncelli
Expressive theory vs General Theory
What do our students care about? Computation, data size (n=1), instabilities.
8. Dataset for Analyzing Face Recognition  Jonathon Phillips, NIST
Datasets
 FERET, Dept of State, Mugshots (2010)  1.6 million images
Two questions: Verification and Rank 1 recognition (who is this person?)
Ref. Lessons from collecting a million biometric samples, Phillips, Flynn
Face recognition accuracy of forensic examiners, super-recognizers and algorithms
Experiment including human recognizers: four groups with different expertise levels.
The best recognizer is created by combining one facial examiner and A2017b.
9. Neural Solvers for Power Transmission Problems, Isabelle Guyon, Paris-Sud University
AI & Electricity
Thesis works
 Deep learning methods for predicting flows in power grids, by Benjamin Donnot
 RL for controlling power grids
The load flow: Input: production, topology etc. -> numeric solver -> output: power flows
One example of a numeric solver is Hades 2. The challenge is speed: it takes 100 ms and should be faster by 2 orders of magnitude.
LEAPNet: Latent Encoding of Atypical Perturbations
 Generalizes to combinatorial topology changes
LEAPNet is able to predict around operating conditions. Ref. LEAPNets for power grid perturbations, Donnot et al. 2019.
GNS (Graph Neural Solver) for Power Systems: iteratively propagates messages through edges
Conclusion: Augmented intelligence = operators + Hades 2 + NNs
10. From Deep Reinforcement Learning to AI, Doina Precup, McGill-MILA
Standard RL scheme inspired by animals. In AlphaGo the environment is very clear and the reward function is well known.
Golden Goal: Efficient, continual learning and reasoning
Knowledge Representation of AlphaGo (policy and value)
Procedural Knowledge and Predictive/empirical knowledge
Knowledge must be: Expressive, Learnable, Composable
Procedural Knowledge: Options
 Option: (initiation set, policy, termination condition)
Options as behavioral programs
Where do options come from: Domain knowledge, Option-Critic Models
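The option triple (initiation set, policy, termination condition) maps naturally onto a small data structure. The sketch below is an illustration under assumptions: the corridor world, `corridor_step` and `run_option` are made up for this example, and the termination condition is modeled as a per-state stopping probability.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Option:
    initiation_set: Set[int]              # states where the option may start
    policy: Callable[[int], int]          # state -> action
    termination: Callable[[int], float]   # state -> probability of stopping


def run_option(option: Option, state: int,
               step: Callable[[int, int], int],
               rng: random.Random, max_steps: int = 100) -> List[int]:
    """Follow the option's policy until its termination condition fires."""
    if state not in option.initiation_set:
        raise ValueError("option cannot be initiated in this state")
    trajectory = [state]
    for _ in range(max_steps):
        action = option.policy(state)
        state = step(state, action)
        trajectory.append(state)
        if rng.random() < option.termination(state):
            break
    return trajectory


def corridor_step(state: int, action: int) -> int:
    """Hypothetical corridor with states 0..5; the action is -1 or +1."""
    return min(5, max(0, state + action))


go_to_right_end = Option(
    initiation_set={0, 1, 2, 3, 4},
    policy=lambda s: 1,                              # always move right
    termination=lambda s: 1.0 if s == 5 else 0.0,    # stop at the right wall
)

trajectory = run_option(go_to_right_end, 2, corridor_step, random.Random(0))
print(trajectory)  # [2, 3, 4, 5]
```

A higher-level policy can then choose among options as if they were single actions, which is what makes them behave like behavioral programs.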
Back to value function
Knowledge Representation: Generalized Value Functions (cumulant function, continuation function), coming from the Horde architecture
Option Models
Lifelong Learning Agent
11. Theory-based measures of object representations in deep artificial and biological networks, Haim Sompolinsky, Hebrew University of Jerusalem
! Not familiar with the subject, so couldn't write much !
Untangling Object Manifolds
Object classification capacity
12. NNs in Speech Recognition  Tara Sainath , Google AI
Conventional ASR pipeline:
Input speech -> Feature extraction -> DNN/RNN Acoustic Models -> Decoder -> Second Pass -> Rescoring Output
NNs helped combine the feature extraction and classification steps into one.
Deepness in speech: in lower layers, similar phones from different people are grouped together, whereas in higher layers better discrimination is achieved.
Ref. B. Li et al., Interspeech 2017
Multichannel neural networks for Google Home
What does the network learn? Filters are doing spatial and spectral filtering.
Model: End2End-trained Seq2Seq recognizer combining the whole pipeline for the sake of simplicity; the model size shrinks and optimization is joint.
Ref. C. Chiu, ICASSP 2018: the conventional baseline model was outperformed by E2E, which is launched in Gboard.
Tail cases in speech recognition: numerics, context injecting, injecting domain knowledge.
13. Panel: What's missing in today's experimental analysis of DL? P. Jonathon Phillips, Jitendra Malik, Peter Bartlett, Antonio Torralba, Isabelle Guyon
Question: If a breakthrough happens, do we have the capacity to realize/test it?
Reproducibility facilitated the DL revolution.
Datasets are biased such that the creator's algorithm shines.
14. Right Ways Forward (I): Terrence Sejnowski, Salk Institute for Biological Studies
! Organizers requested short talks starting with this talk, so there was not much to note !
High-dimensional geometry: subspaces will become important
Adversarial examples: perturbations could help build NNs that are robust against adversarial attacks.
We're looking for general architectural principles.
15. Right Ways Forward(II): Jon Kleinberg, Cornell
Social policy and algorithmic decisions
Screening as a prediction problem: tabular structures into predictions, e.g. CVs
Interpretability problem for human decisions
Two categories of discrimination:
 Disparate treatment: deliberately favoring individuals based on race, gender etc.
 Disparate impact: regardless of intent, the outcome is disproportionate.
The challenge in correcting for human bias
Key argument: well regulated algorithms can make discrimination easier to detect.
Decomposing a Gap in Outcomes Disparity = Structural disparity + bias from choice of outcome + ...
16. From Machine Learning to Artificial Intelligence: Leon Bottou, Facebook AI Research
Caveat 1: Statistical problem is only a proxy to the real problem.
ML algorithms recklessly take advantage of spurious correlations.
Caveat 2: Causality
Viewpoints on causality:
 Manipulative causation
 Causal invariance
 Causal reasoning
 Dispositional causation: where do causations come from?
 Causal intuition: correlation is not causation, but the data contains hints, for instance asymmetric relations.
The scientific method is a good model of a learning process. Hypothesis generation precedes empirical validation.