We attended the Science of Deep Learning colloquium, which is part of the Arthur M. Sackler Colloquia of the National Academy of Sciences (US). We listened to mostly high-level but motivating and fun talks by leading researchers working on deep learning. We drafted some notes, which are very terse and might be inaccurate, but they should give you the intuition of the talks. We were told the videos would be released within two weeks on the NAS YouTube channel.
Day-1 Speakers: Amnon Shashua, Jitendra Malik, Chris Manning, Regina Barzilay, Tomaso Poggio, Oriol Vinyals, Terrence Sejnowski, Olga Troyanskaya, Kyle Cranmer, Eero Simoncelli, Bruno Olshausen, Antonio Torralba, Rodney Brooks
- Favorites: Amnon Shashua, Kyle Cranmer, Eero Simoncelli, Olga Troyanskaya, Antonio Torralba, Rodney Brooks, Regina Barzilay
Day-2 Speakers: Tomaso Poggio, Nati Srebro, Peter Bartlett, Anders Hansen, Jonathon Phillips, Doina Precup, Haim Sompolinsky, Ronald Coifman, Konrad Kording, Tara Sainath, P. Jonathon Phillips, Jitendra Malik, Antonio Torralba, Jon Kleinberg, Terrence Sejnowski, Isabelle Guyon, Leon Bottou
- Favorites: Nati Srebro, Peter Bartlett
1. The State of Deep Learning : Overview Talk (I) Amnon Shashua
Deep learning with overparametrized networks enabled both: training error ↓, test error ↓
Quantum entanglement is related to deep nets: https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.122.065301
Self-Driving Cars, Driving Policy, Ethics in AI
- Failure is allowed only due to perception error, and it should be less than a human's error
- There should be no failure due to a wrong decision
- The world is rich with details; the challenge with pattern recognition is accuracy
- For humans, perceptual accidents are very rare: about 10^7 hours of driving per accident
- Combine two different (independent) subsystems for the same task to reduce risk
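The last point rests on simple arithmetic: if two subsystems fail independently, the chance that both fail at the same time is the product of their individual failure rates. A minimal sketch, where the subsystem names and per-hour failure rates are made-up numbers for illustration and not figures from the talk:

```python
def joint_failure_rate(p_a: float, p_b: float) -> float:
    """Probability that two *independent* subsystems both fail in the same hour."""
    return p_a * p_b


# Hypothetical numbers: each subsystem alone fails about once per 10^4 hours.
p_subsystem_a = 1e-4
p_subsystem_b = 1e-4

p_both = joint_failure_rate(p_subsystem_a, p_subsystem_b)
hours_between_joint_failures = 1.0 / p_both

print(f"joint failure rate per hour: {p_both:.1e}")
print(f"hours between joint failures: {hours_between_joint_failures:.1e}")
```

The product rule only holds if the failures really are independent; two subsystems that share a sensor or a failure mode would gain much less.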
Natural Language Understanding is a very good small test environment for General AI. Understanding a book has enough complexity.
2. The State of Deep Learning : Overview Talk (II) - Jitendra Malik, UC Berkeley
Phylogeny of intelligence.
Intelligence really about perception and action.
The evolutionary progression
- Vision and Locomotion
- Manipulation - science falling behind
- Language - same
Major Success of DL
CV, Speech understanding, machine translation, game playing
Behind the success: data + computing + annotation + simulation
1 neuron = 1000 instructions/sec
DL in the context of early history
Turing: rather than simulating an adult brain, try to simulate a child's brain and then educate it.
1980: Neocognitron: A Self-organizing Neural Network Model
Lecun 1989 - Convolutional Neural Networks
Biological inspiration is overrated - deeper is better: from the retina to the back of the brain there are only seven synapses.
- LeNet - AlexNet - VGG - GoogLeNet - ResNet - ResNeXt - Mask R-CNN
The future: Seeing 3D
Understand the geometry
Learned multiview scenario
Visual Navigation in Novel Environments
- few-shot learning
- learning with little supervision
- unifying learning with geometric reasoning
- perception and control
What about unsupervised learning? DL is a function approximation technique, there is input and output.
3. The State of Deep Learning : Overview Talk (III) - Chris Manning, Stanford
He provided an overview of DL in speech recognition and speech synthesis, and how this technology evolved into products such as Alexa and Siri.
A challenge in the past: similar words had dissimilar representations, e.g. hotel and motel.
RNNs generate words by repeated sampling.
The idea of the by-pass is very effective!
LSTMs: Hochreiter & Schmidhuber 1997
Recurrent neural encoder-decoder networks, e.g. for translation tasks. Ref. for deep representations of this kind: Sutskever et al. 2014, Luong et al. 2015
Bottleneck: all information in the sentence has to pass through one pipeline
Solution: seq-to-seq with attention.
One evaluation: a 26% performance improvement on English-to-German in one year, from 2014 to 2015
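To make the attention idea concrete: instead of squeezing the whole sentence through one vector, the decoder looks back at every encoder state and takes a weighted average. Below is a minimal NumPy sketch of one attention step. The dot-product scoring, the sizes and the random vectors are assumptions chosen for illustration, not details from the talk.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()


def attend(decoder_state: np.ndarray, encoder_states: np.ndarray):
    """One attention step with dot-product scores.

    decoder_state:  shape (d,)   current decoder hidden state
    encoder_states: shape (T, d) one hidden state per source word
    """
    scores = encoder_states @ decoder_state   # (T,) similarity to each source word
    weights = softmax(scores)                 # (T,) non-negative, sums to 1
    context = weights @ encoder_states        # (d,) weighted average of encoder states
    return context, weights


# Toy data: 5 source words, hidden size 8 (hypothetical sizes).
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 8))
decoder_state = rng.normal(size=8)

context, weights = attend(decoder_state, encoder_states)
print(weights.round(3))
```

The context vector is recomputed at every decoding step, so different output words can focus on different source words.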
Open-domain question answering: DrQA. Ref. Chen et al., ACL 2017
The Stanford Attentive Reader
Contextual word representations from LMs - think of hidden states as representations
ELMo (Peters et al. 2018) in an NER tagger - representations of this kind grew over the years, e.g. BERT, GPT etc.
4. The State of Deep Reinforcement Learning, Oriol Vinyals, Google DeepMind
Reinforcement Learning Scheme: Agent, Observations, Actions, Environment
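The scheme is a loop: the agent receives an observation, picks an action, and the environment answers with the next observation and a reward. A minimal sketch of that loop follows; `CoinFlipEnv` and `RandomAgent` are hypothetical toy classes invented for this illustration, not anything shown in the talk.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses the next coin flip."""

    def __init__(self, episode_length: int = 10, seed: int = 0):
        self.rng = random.Random(seed)
        self.episode_length = episode_length

    def reset(self) -> int:
        self.t = 0
        self.coin = self.rng.randint(0, 1)
        return self.t  # the observation is just the step index

    def step(self, action: int):
        reward = 1.0 if action == self.coin else 0.0
        self.t += 1
        self.coin = self.rng.randint(0, 1)
        done = self.t >= self.episode_length
        return self.t, reward, done


class RandomAgent:
    """Picks actions uniformly at random, ignoring the observation."""

    def __init__(self, seed: int = 1):
        self.rng = random.Random(seed)

    def act(self, observation: int) -> int:
        return self.rng.randint(0, 1)


def run_episode(env, agent) -> float:
    """Agent-environment loop: observe, act, collect reward, repeat."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
    return total_reward


total = run_episode(CoinFlipEnv(), RandomAgent())
print(total)
```

A learning agent would replace `RandomAgent.act` with a policy that is updated from the rewards it collects.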
Mentioned DeepMind's 2015 Atari paper
Behind every great agent there's a great environment (importance of datasets)
Policy and Value Learning
Atari vs. Go: the action space is larger in Go; the information type is near-perfect vs. perfect
AlphaGo: Policy Network and Value Network (predicts the winner)
Take a search + imitate the search (related summary blog post that I found)
Policy Improvement Theorem
AlphaGo Zero: 4 TPUs; AlphaGo (Lee Sedol): 48 TPUs
StarCraft:
- Action space is 10^6
- Exploration is really hard
- Imitation learning
- Real-time strategy game; fast prediction is needed in the playing phase
- Grandmaster level
Patterns for Success and Challenges Ahead
- Environment full of rewards: Atari
- Available human demonstrations: AlphaGo, AlphaStar
- Algorithmic ways to improve the policy: AlphaGo, AlphaZero
- The real world is not a simulation
- Transfer / General AI: AlphaGo should learn chess quickly, AlphaStar should learn Atari easily, an ImageNet classifier should trivially transfer to MNIST
- Understanding the theory of non-convex optimization
5. Panel: Tomaso Poggio, Regina Barzilay, Terrence Sejnowski, Rodney Brooks
Regina: Interpretability is needed for better understanding. Supervision and the quality of supervision matter. Datasets are biased (e.g. in a fake news dataset: remove the evidence and we can still predict, thanks to bias :)).
Tomaso: We need more science and math, but we have many GPU hours instead 🙂 Big data is not realistic. We humans learn from very little data. Current architectures are not the way humans learn.
Terrence: New scientific studies will be enabled by the hype around deep learning; we experienced similar things when the heat equations were solved. Deep learning revolution.
Rodney: We can't trust systems trained on deep nets.
6. Deep Learning in Science(I): Regina Barzilay, MIT
MIT Machine Learning for Pharmaceutical Discovery Consortium
Challenge in drug discovery: a huge combinatorial space.
How can ML be deployed to address this challenge? Predicting chemical reactions.
- Property prediction: take a molecule, extract a molecular fingerprint, apply graph convolution. One reason the initial model failed was domain transfer.
- Better molecule generation
What are the open questions?
- Molecular representation beyond graphs
- Modeling the underlying physics
- How to improve a molecule to have better properties
String-to-string generation: linearize the molecules (SMILES) - did poorly.
Graph-to-graph generation: invalidity of intermediate molecules is a challenge. Should produce many diverse outputs.
Tree decomposition: Molecule to tree.
7. Deep Learning in Science(II): Deep Learning and Particle Physics - Kyle Cranmer, NYU
Higgs boson discovery
Particle collision = complex probabilistic model
Created particles create other particles
Likelihood calculation is very hard
Bayesian inference under intractable likelihoods: likelihood-free inference
- Approximate Bayesian Computation (can also be intractable)
- You just need to do Forward Simulation
- sufficient statistics can not be determined sometimes
- Use Simulator
- Hijack the inside of simulator
- Learning The Simulator
- Generative Adversarial Network
- Learning the Likelihood Ratio (Supervised)
- Likelihood ratio trick (binary classifier ~= likelihood ratio)
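The likelihood ratio trick in the last bullet can be checked on a toy problem. A classifier trained to separate equal numbers of samples from p0 and p1 has output s(x) close to p1(x) / (p0(x) + p1(x)), so s / (1 - s) estimates the ratio p1(x) / p0(x) using samples only. The sketch below is an illustration under assumptions that are not from the talk: two 1-D Gaussians stand in for the simulator, and a hand-written logistic regression stands in for the neural network. For N(0, 1) vs. N(1, 1) the exact log ratio is x - 0.5, so the learned weight and bias should land near 1 and -0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x0 = rng.normal(0.0, 1.0, n)   # "simulated" samples from p0 = N(0, 1), label 0
x1 = rng.normal(1.0, 1.0, n)   # "simulated" samples from p1 = N(1, 1), label 1
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Logistic regression s(x) = sigmoid(w * x + b), fitted by plain gradient descent
# on the cross-entropy loss.
w, b = 0.0, 0.0
for _ in range(1000):
    s = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 0.5 * np.mean((s - y) * x)
    b -= 0.5 * np.mean(s - y)


def estimated_log_ratio(t: float) -> float:
    """log(s / (1 - s)) for the fitted classifier, which equals w * t + b."""
    return w * t + b


def true_log_ratio(t: float) -> float:
    """Exact log p1(t) / p0(t) for N(1, 1) vs. N(0, 1)."""
    return t - 0.5


print(f"w = {w:.3f}, b = {b:.3f}")
print(estimated_log_ratio(2.0), true_log_ratio(2.0))
```

The classifier never evaluates either density; it only sees labeled samples, which is what makes the trick usable when the likelihood is intractable.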
ML has the potential to effectively bridge the microscopic-macroscopic divide
Physics-Aware Machine Learning
Intersection of Deep Learning (successful) and Bayesian Methods (interpretable)
Physics aware Gaussian Processes
QCD-Aware Recursive Neural Networks
QCD-Aware Graph Convolutional Networks
JUNIPR: a generative model for jets; it can be trained on real data and is interpretable!
8. Deep Learning in Science(III): DL in Genomic Research - Olga Troyanskaya, Princeton
How does a single mutation in genome affect gene regulation?
Which SNPs are functional and lead to human disease?
Understanding disease causing mutations using DL.
Two types of mutations:
- Coding variant
- Noncoding regulatory variant
Model should be
Genomic sequence -> sequence model -> chromatin organization
Model is trained on a single genome.
Deep convolutional network-based sequence model
Why relevant to genome data?
- Many examples of the same sequence along the whole DNA.
- Capture context information
- Interaction of seq features
- Multi task prediction
The proposed model is able to predict histone marks, DNase accessibility and transcription factors given a single base change/mutation.
DeepSEA identifies a significant noncoding regulatory mutation burden in ASD. The ASD cohort is composed of families where autism is observed in only one of the children and not in the rest of the family.
One question: could autism be a result of stronger mutations, not just of having a mutation, i.e. a sibling can also have a mutation but not the disease? Yes. Refs.: bioRxiv, Nature Genetics.
ExPecto - ab initio prediction of tissue-specific gene expression from sequence.
A pipeline of methods (deep learning, spatial feature transformation, regularized linear models) to obtain the expression and the associated impact of a mutation.
- A DL-based algorithmic framework for predicting the effect of any noncoding mutation in the genome.
- A computational framework for accurate prediction of tissue-specific expression, including de novo prediction of expression variation
- Functional networks produced by semi-supervised data integration enable insight into mechanisms of human disease, including Alzheimer's, Parkinson's and cardiovascular diseases.
9. Deep Learning in Science(IV), Eero Simoncelli, NYU
- Deep Convolutional Neural Nets
- Largely inspired by neurobiology
- Astonishing (but often brittle) results
- Model for sensory neurobiology
- Basic neural selectivity
- MRI - stimulus similarity
- Recurrence/state/context (memory, reward, attention)
- Myriad biophysical details
Example 1: Difference between two images (Berardino et al., NIPS 2017)
- MSE is not a good measure for human eyes
- L(X, Xhat) = ||f(X) - f(Xhat)||
- TID2018 Dataset
- All models (whether deep or not) perform the same on the test data
- Which one generalizes? Compare the least noticeable and the most visible eigen-distortions
- Local gain control
Example 2: Perceptual Straightening of Videos (Henaff, Goris, Simoncelli, Nature Neuroscience 2019)
- Curvature in Pixel Domain
- Perceptual experiments on Humans
- Humans reduce curvature in their brain
- CNNs do not work like human brain
10. Panel: Scientific Funding for Deep Learning
Super Turing Computation
Can handle situations it hasn't encountered before utilizing previous learning.
Lifelong Learning Machines (L2M) - is concerned with learning while executing and improving over a lifetime
11. Can Deep Learning provide Deep Insights for Neuroscience, Bruno Olshausen, Berkeley
Embrace complexity of biology
- Neuroscience has moved on a lot: a biological neuron is not the neuron of ML
- Cortical Circuits:
- Highly Organized by layer
- Layers are interconnected in a canonical microcircuit
- Feed-back connections
What problems should we be solving?
showed pictures of animals with good capabilities
animal's vision system is very robust, low power
Nakayama et al. (1995)
O'Regan & Noë (2001)
Mumford (2010) Pattern Theory - Sparse Discreteness
The Sparse Manifold Transform (Yubei Chen, NeurIPS 2018)
12. Dissecting Neural Networks, Antonio Torralba, MIT
Very fun talk :))
10 billion dollars were spent on CERN data. We spend far, far less on datasets for learning.
Cycle of Deep Learning: we realize datasets are biased, then google (?), then new datasets
Understanding Deep Representations: Network Dissection (~visualization)
Test units for semantic segmentation
Top activated images: IoU
GANs
We can identify which neuron is responsible for detecting which object in CNNs; in the same way, in GANs we can identify which neuron draws which part of the image.
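The IoU score mentioned above can be sketched in a few lines: threshold a unit's activation map into a binary mask and compare it with the segmentation mask of a concept. This is a minimal illustration; the tiny 3x3 arrays and the fixed threshold of 0.5 are made up here, and the actual Network Dissection pipeline chooses thresholds and aggregates over a whole dataset in its own way.

```python
import numpy as np


def unit_iou(activation: np.ndarray, concept_mask: np.ndarray, threshold: float) -> float:
    """Intersection-over-union between a thresholded activation map and a concept mask."""
    unit_mask = activation > threshold
    intersection = np.logical_and(unit_mask, concept_mask).sum()
    union = np.logical_or(unit_mask, concept_mask).sum()
    return float(intersection / union) if union > 0 else 0.0


# Hypothetical activation map of one unit, upsampled to image resolution.
activation = np.array([[0.9, 0.8, 0.1],
                       [0.7, 0.2, 0.0],
                       [0.1, 0.0, 0.0]])

# Hypothetical ground-truth segmentation mask for one concept (e.g. "dog").
concept_mask = np.array([[1, 1, 0],
                         [1, 1, 0],
                         [0, 0, 0]], dtype=bool)

iou = unit_iou(activation, concept_mask, threshold=0.5)
print(iou)  # 3 overlapping pixels out of 4 in the union -> 0.75
```

A unit is then labeled with the concept that gives it the highest IoU, provided the score is high enough to be meaningful.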
13. Super Intelligence, Rodney Brooks
History of AI
Turing papers mentioned
Approaches to AI
a) Symbolic Approach
Logic, statements about symbols, inference and reasoning
Compositionality in symbolic systems
Symbols are not grounded
b) Neural Networks
c) Traditional Robotics
Finding corners and feature points in a picture
d) Behavior Robotics
What we are doing wrong currently
There were some fun comics
A better Turing test? Get machines to do real tasks in the world
Hard Things to Do
Ex: Chess board where the gray and white squares have the same intensity
Ex: Blue-filtered strawberry image where the pixels' RGB colors are not red but rather blue
Our perception adjusts according to the context
The audience learned a category from 3 images 🙂
Read a book
Common-Sense Reasoning
What we should work on
- Object recognition capabilities of a 2-year-old
- Language capabilities of a 4-year-old
- Manual dexterity of a 6-year-old
- Social understanding of an 8-year-old
1. Networks of neurons for learning and representing symbols in the brain, Tomaso Poggio, MIT
2. Inductive Bias and Optimization in DL - Nati Srebro, TTIC
- Capacity of the learning system - how many samples do we need to generalize?
- Expressiveness - can we capture reality?
- One opinion: NNs can approximate any function. The objective could be expressiveness with small samples. In some cases, even small networks can capture everything.
- Computation/Optimization: it is NP-hard to find the weights even with 2 hidden units. Even for the simplest NN with O(log d) units and no noise, no polynomial-time algorithm always works.
- Thus there might be some magic property of reality that makes local search work.
As the number of hidden units increases, the training error decreases. In one trial it turns out that the test error keeps decreasing as the number of parameters increases. In repeated trials, most of the test errors are large. Maybe in the cases where the test fails while the training error is zero, it is the norm of the parameters that is high, not the number of parameters. So what is the relevant "complexity measure" (it could be a norm etc.)? And how is it minimized by the optimization algorithm?
Ref. Neyshabur, Tomioka, Srebro, ICLR '15
SGD vs ADAM
Different optimization algorithm -> Different bias in optimum reached -> Different inductive bias -> Different generalization properties
We need to understand the optimization algorithm not just as reaching some (global) optimum, but as reaching a specific optimum. The choice of optimization algorithm matters! The solution space is like an ocean.
Example 1: Unconstrained Matrix Completion
Ref. Gunasekar, Woodworth, Bhojanapalli, Neyshabur 2017
Gradient descent (small step size etc.) finds not just any global minimum but the minimum nuclear norm solution, which brings generalization.
Example 2: Single Overparametrized Linear Unit
Example 3: Linear Conv Nets, Overparametrization
Optimization geometry, and hence inductive bias, is affected by the geometry of local search in parameter space and by the choice of parametrization.
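The claim that the optimizer picks a specific global optimum has a simple linear analogue that is easy to check numerically: on an underdetermined least-squares problem, gradient descent started from zero converges to the interpolating solution with the smallest L2 norm. The sketch below illustrates that simplified setting only; it is not the matrix-completion experiment from the talk, and the problem sizes and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                      # fewer equations than unknowns: many exact solutions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Gradient descent on 0.5 * ||X w - y||^2, started from w = 0.
w = np.zeros(d)
step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
for _ in range(5000):
    w -= step * X.T @ (X @ w - y)

# Closed-form minimum-L2-norm interpolant via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("training residual:", np.linalg.norm(X @ w - y))
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))
```

Starting from a different initial point, or using a different optimizer, would still reach zero training error but generally at a different solution, which is the sense in which the algorithm carries the inductive bias.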
3. Peter Bartlett
Generalization: prediction accuracy on the test set
Typical theorem: pred_err <= trn_err + complexity_penalty
Empirical process theory for classification
Margins analysis: relating classification to regression
Interpolation: there is no apparent tradeoff between fit and complexity
Interpolation in Linear Regression
P(f(x) \neq y) <= (1/n) (number of training classification errors) + \sqrt{(c/n) (VCdim(F) + \log(1/\sigma))}
The VC-dimension of neural networks increases with p = # of parameters and L = # of layers:
- p if the nonlinearity is piecewise constant
- pL if the nonlinearity is piecewise linear
A classification problem becomes a regression if we use a loss function that doesn't vary too quickly.
For regression, the complexity of a NN is controlled by the size of parameters.
Interpolation in DL - A new challenge for Statistical Learning Theory
Deep networks can be trained to zero training error (for regression loss) with near state-of-the-art performance, even for noisy problems. Thus there is no sign of a tradeoff between fit to training data and complexity. Ref. Zhang, Bengio, Hardt et al. 2017 and Belkin et al. 2018.
Interpolation in Linear Regression
Classical linear regression setting, with n samples. f(x) = x'\theta, squared error as loss, risk = E[loss].
Choose \hat{\theta} to minimize the average training error.
Excess expected loss: Empirical Risk - True Risk
^Q is corrupted because our view of covariance of x is distorted by x1, x2, ..., xn. Also, the noise.
Accurate interpolating prediction as dimension p_n grows.
Consider covariance of x in two pieces
- a fixed piece due to dimension k
- a tail which flattens with n
Interpolation: far from the regime of a trade off between fit to training data and complexity.
In high-dimensional linear regression if the covariance has a long and flat tail the minimum norm interpolant can hide the noise in these many unimportant directions.
- Relies on overparametrization
- and lots of unimportant parameters
Can we extend these results to interpolating deep networks?
Empirical process theory for classification: need n >> p
Margins analysis with a Lipschitz loss: complexity can depend on the size of the parameters.
Interpolation: a new challenge. Where is the tradeoff between fit and complexity?
- Interpolation in linear regression can exploit overparametrization to hide the noise.
4. Why neuroscience needs science of DL , Konrad Kording, UPenn
Goals in computational systems neuroscience
- Should work.
5. Does AI come at a cost? Instabilities in DL - Anders Hansen
DeepFool was developed at EPFL to test the instability of NNs.
Theorem: There are uncountably many classification problems.
Key point: there is always an NN that achieves zero training error but still achieves generalizability.
Question: Can stable neural networks be produced using recursion?
Example: Ref. On instabilities of deep learning in image reconstruction, Antun, Renna, Poon et al.
Image reconstruction with NNs is completely unstable.
If you overperform on two images, things go wrong (instability).
The instability problem is a nontrivial one, but we can test networks against instabilities. The cure is DL theory.
6. Challenge and scope for Empirical Modeling for ML - Ronald Coifman, Yale
At this point ML provides encoders/tabulation and regression. The real quest should be to find intrinsic variables that enable direct consistency and performance matching between algorithmic learners.
7. Panel - Julia Kempe & Eero Simoncelli
Expressive theory vs General Theory
What do our students care about? Computation, data size (n=1), instabilities.
8. Dataset for Analyzing Face Recognition - Jonathon Phillips, NIST
- FERET, Dept of State, Mugshots (2010) - 1.6 million images
Two questions: Verification and Rank 1 recognition (who is this person?)
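The two questions differ in what is compared. Verification checks one probe image against one claimed identity (a yes/no decision with a threshold), while rank-1 identification compares the probe against a whole gallery and returns the best match. A minimal sketch, assuming faces have already been turned into embedding vectors; the names, the 3-D vectors, the cosine score and the 0.8 threshold are all hypothetical choices for illustration and are not taken from the talk.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def verify(probe: np.ndarray, claimed: np.ndarray, threshold: float = 0.8) -> bool:
    """Verification (1:1): is the probe the same person as the claimed identity?"""
    return cosine_similarity(probe, claimed) >= threshold


def rank1_identify(probe: np.ndarray, gallery: dict) -> str:
    """Rank-1 identification (1:N): who in the gallery is most similar to the probe?"""
    return max(gallery, key=lambda name: cosine_similarity(probe, gallery[name]))


# Hypothetical gallery of enrolled identities with toy embeddings.
gallery = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "bob": np.array([0.0, 1.0, 0.0]),
    "carol": np.array([0.0, 0.0, 1.0]),
}
probe = np.array([0.9, 0.1, 0.0])

print(verify(probe, gallery["alice"]))   # probe vs. one claimed identity
print(rank1_identify(probe, gallery))    # probe vs. everyone in the gallery
```

Verification accuracy depends on where the threshold is set, whereas rank-1 accuracy depends on the gallery size: the more people are enrolled, the more chances for a wrong top match.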
Ref. Lessons from collecting a million biometric samples - Phillips, Flynn
Face recognition accuracy of forensic examiners, super-recognizers and algorithms
An experiment including human recognizers in four groups with different expertise levels.
The best recognizer is created by combining one facial examiner and A2017b.
9. Neural Solvers for Power Transmission Problems, Isabelle Guyon, Paris-Sud University
AI & Electricity
- Deep learning methods for predicting flows in the power grid, by Benjamin Donnot
- RL for controlling power grids
The load flow: Input: production, topology etc. --- numeric solver ---> Output: power flows
One example of a numeric solver is Hades 2. The challenge is speed: it takes 100 ms and should be faster by 2 orders of magnitude.
LEAPNet - Latent Encoding of Atypical Perturbations
- Generalizes to combinatorial topology changes
LEAPNet is able to predict around operating conditions. Ref. LEAPNets for power grid perturbations, Donnot et al. 2019.
GNS (Graph Neural Solver) for Power Systems - iteratively propagates messages through edges
Conclusion: Augmented intelligence = operators + Hades 2 + NNs
10. From Deep Reinforcement Learning to AI, Doina Precup, McGill-MILA
Standard RL scheme inspired by animals. In AlphaGo the environment is very clear and the reward function is well known.
Golden Goal: Efficient, continual learning and reasoning
Knowledge Representation of AlphaGo (policy and value)
Procedural knowledge and predictive/empirical knowledge
Knowledge must be: expressive, learnable, composable
Procedural Knowledge: Options
- Option: (initiation set, policy, termination condition)
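The option triple maps naturally onto a small data structure: where the option may start, what it does while running, and when it stops. Below is a minimal sketch on a toy corridor; the `Option` class layout, the corridor world and the `walk_right` option are hypothetical and only illustrate the definition.

```python
import random
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Option:
    initiation_set: FrozenSet[int]        # states where the option may be started
    policy: Callable[[int], int]          # state -> action while the option runs
    termination: Callable[[int], float]   # state -> probability of stopping there


def run_option(option, state, transition, rng, max_steps=100):
    """Execute an option until its termination condition fires; return the final state."""
    if state not in option.initiation_set:
        raise ValueError("option cannot be initiated in this state")
    for _ in range(max_steps):
        state = transition(state, option.policy(state))
        if rng.random() < option.termination(state):
            break
    return state


GOAL = 5  # hypothetical corridor with states 0..5


def corridor(state: int, action: int) -> int:
    """Toy deterministic dynamics: move left/right, clipped to the corridor."""
    return min(max(state + action, 0), GOAL)


walk_right = Option(
    initiation_set=frozenset(range(GOAL)),             # can start anywhere except the goal
    policy=lambda s: +1,                               # always step right
    termination=lambda s: 1.0 if s == GOAL else 0.0,   # stop exactly at the goal
)

final_state = run_option(walk_right, 0, corridor, random.Random(0))
print(final_state)
```

From the outside, a higher-level policy can treat `walk_right` as a single extended action, which is what makes options behave like reusable behavioral programs.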
Options as behavioral programs
Where do options come from? Domain knowledge, option-critic models
Back to value function
Knowledge Representation: Generalized Value Functions (cumulant function, continuation function), coming from the Horde architecture
Lifelong learning agent
11. Theory-based measures of object representations in deep artificial and biological networks, Haim Sompolinsky, Hebrew University of Jerusalem
! Not familiar with the subject, so couldn't write much !
Untangling Object Manifolds
Object classification capacity
12. NNs in Speech Recognition - Tara Sainath , Google AI
Conventional ASR pipeline:
Input speech -> Feature extraction -> DNN/RNN Acoustic Models -> Decoder -> Second Pass -> Rescoring Output
NNs helped combine the feature extraction and classification steps into one.
Deepness in speech: in lower layers, similar phones from different people are grouped together, whereas in higher layers better discrimination is achieved.
Ref. B. Li et al., Interspeech 2017
Multi-channel neural networks for Google Home
What does the network learn? Filters are doing spatial and spectral filtering.
Model: end-to-end trained seq2seq recognizer combining the whole pipeline, for the sake of simplicity, a smaller model size and joint optimization.
Ref. C. Chiu et al., ICASSP 2018 - the conventional baseline model was outperformed by the E2E model, which has been launched in Gboard.
Tail cases in speech recognition: numerics, context injection, injecting domain knowledge.
13. Panel: What's missing in today's experimental analysis of DL? P. Jonathon Phillips, Jitendra Malik, Peter Bartlett, Antonio Torralba, Isabelle Guyon
Question: If a breakthrough happens, do we have the capacity to realize/test it?
Reproducibility facilitated the DL revolution.
Datasets are biased such that the creator's algorithm shines.
14. Right Ways Forward(I): Terrence Sejnowski, Salk Institute for Biological Studies
! Organizers requested short talks starting with this talk, so there was not much to note !
High-dimensional geometry; subspaces will become important
Adversarial examples: perturbation could help build NNs that are robust against adversarial attacks.
We’re looking for general architectural principles.
15. Right Ways Forward(II): Jon Kleinberg, Cornell
Social policy and algorithmic decisions
Screening as a prediction problem: tabular structures into predictions, e.g. CVs
Interpretability problem for human decisions
Two categories of discrimination:
- Disparate treatment: deliberately favoring individuals based on race, gender etc.
- Disparate impact: regardless of intent, the output is disproportionate.
The challenge in correcting for human bias
Key argument: well regulated algorithms can make discrimination easier to detect.
Decomposing a gap in outcomes: Disparity = structural disparity + bias from choice of outcome + ...
16. From Machine Learning to Artificial Intelligence: Leon Bottou, Facebook AI Research
Caveat 1: Statistical problem is only a proxy to the real problem.
ML algorithms recklessly take advantage of spurious correlations.
Caveat 2: Causality
Viewpoints on causality:
- Manipulative causation
- Causal invariance
- Causal reasoning
- Dispositional causation: where do causations come from?
Causal intuition: correlation is not causation, but the data contains hints, for instance asymmetric relations.
The scientific method is a good model of a learning process. Hypothesis generation precedes empirical validation.