The Complete Deep Dive: Machine Learning in Astronomy
Let me take you through absolutely everything, from the mathematical foundations to the cutting-edge applications for your distributed telescope array.
Part 1: The Foundations of Machine Learning
What's Actually Happening Under the Hood
When we say a computer "learns," we're being a bit poetic. What's really happening is mathematical optimization. Let me break this down completely.
The Core Concept: Functions and Parameters
Imagine you have a function: a mathematical machine that takes inputs and produces outputs. For astronomy:
- Input: Raw pixel values from a telescope image (maybe 1 million numbers representing brightness at each point)
- Output: A classification like "spiral galaxy" or "elliptical galaxy"
The function has parameters: adjustable knobs that change how it behaves. A simple function might have 10 parameters. Modern deep learning models have billions.
Learning means finding the parameter values that make the function produce correct outputs for known examples. Once found, the function can (hopefully) produce correct outputs for new examples it's never seen.
The Three Types of Machine Learning
Supervised Learning: You have labeled examples. "Here's an image, and I'm telling you it's a spiral galaxy." The algorithm learns to predict labels from inputs.
- Training data: Images paired with correct classifications
- Goal: Predict correct labels for new, unseen images
- Astronomy uses: Galaxy classification, stellar property prediction, transient detection
Unsupervised Learning: No labels. You just have data and want to find structure.
- Training data: Just images, no labels
- Goal: Discover patterns, groupings, or anomalies
- Astronomy uses: Finding new types of objects, clustering similar stars, discovering outliers
Reinforcement Learning: The algorithm takes actions and gets rewards or penalties.
- Training: Trial and error with feedback
- Goal: Learn optimal behavior
- Astronomy uses: Telescope scheduling, adaptive optics control, observation prioritization
The Mathematics (As Gently As Possible)
Linear Regression: The Simplest ML
Suppose you want to predict a star's temperature from its color. The simplest model:
Temperature = w₁ × (blue brightness) + w₂ × (red brightness) + b
Here, w₁, w₂, and b are parameters. Learning means finding values that minimize prediction errors across your training data.
The "error" (called loss) might be the average squared difference between predicted and actual temperatures:
Loss = average of (predicted - actual)²
We find the best parameters using gradient descent: start with random values, calculate which direction to adjust them to reduce the loss, take a small step in that direction, repeat thousands of times.
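Here's a minimal sketch of that loop in plain NumPy. The two-parameter temperature model mirrors the one above, but the data and the "true" coefficients (3000, 1500, 800) are synthetic, invented so we can check that gradient descent recovers them:

```python
# Gradient descent for: Temperature = w1*(blue) + w2*(red) + b
import numpy as np

rng = np.random.default_rng(0)
blue = rng.uniform(0.5, 2.0, 200)
red = rng.uniform(0.5, 2.0, 200)
temp = 3000 * blue + 1500 * red + 800    # invented "true" relation

w1, w2, b = 0.0, 0.0, 0.0                # start with (bad) initial values
lr = 0.05                                # step size
for _ in range(200000):
    pred = w1 * blue + w2 * red + b      # predict with current parameters
    err = pred - temp                    # compare to correct answers
    w1 -= lr * 2 * np.mean(err * blue)   # step each parameter downhill
    w2 -= lr * 2 * np.mean(err * red)
    b -= lr * 2 * np.mean(err)

print(round(w1), round(w2), round(b))    # recovers 3000 1500 800
```

The three update lines are the gradient of the squared-error loss with respect to each parameter; everything else is bookkeeping.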
Neural Networks: Stacking Complexity
A neural network is just this simple idea repeated and stacked:
Layer 1: Takes raw inputs, applies weights, produces intermediate values
Layer 2: Takes Layer 1's outputs, applies different weights, produces new intermediate values
... more layers ...
Final Layer: Produces the prediction
Each layer can learn different features:
- Layer 1 might learn to detect edges in an image
- Layer 2 might combine edges into shapes
- Layer 3 might recognize that certain shapes indicate spiral arms
- Final layer decides "spiral galaxy"
The "deep" in deep learning just means many layers (sometimes hundreds).
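To make "stacking" concrete, here's a forward pass through a tiny two-layer network in NumPy. The sizes (4 inputs, 8 hidden units, 3 classes) are arbitrary, and the weights are random rather than trained, so the output probabilities are meaningless; the point is the shape of the computation:

```python
# Forward pass of a tiny two-layer network in plain NumPy
import numpy as np

def relu(z):
    return np.maximum(0.0, z)        # activation: introduces non-linearity

rng = np.random.default_rng(42)
x = rng.normal(size=4)               # stand-in for 4 input "pixels"
W1 = rng.normal(size=(8, 4))         # layer 1: 4 inputs -> 8 features
W2 = rng.normal(size=(3, 8))         # layer 2: 8 features -> 3 class scores

h = relu(W1 @ x)                     # layer 1 output (the "edges")
scores = W2 @ h                      # final layer (class scores)
probs = np.exp(scores) / np.exp(scores).sum()   # softmax -> probabilities
print(probs.sum())                   # probabilities sum to 1
```

A deep network is just more `W @ h` plus activation steps between input and output.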
Why This Works for Astronomy
Astronomical data has hierarchical structure:
- Pixels combine into features (bright spots, dark regions)
- Features combine into structures (spiral arms, central bulges)
- Structures combine into object types (spiral galaxy, elliptical galaxy)
Neural networks naturally learn these hierarchies.
Types of Neural Networks in Astronomy
Convolutional Neural Networks (CNNs)
Perfect for images. Instead of treating each pixel independently, CNNs look at small patches and learn local patterns.
Imagine sliding a small window across a telescope image:
- The window might learn to recognize "this pattern of pixels looks like a point source"
- Another window learns "this gradient pattern suggests a galaxy edge"
- These combine into higher-level detections
Astronomy applications:
- Galaxy morphology classification
- Identifying gravitational lenses
- Detecting transients (supernovae, asteroids)
- Separating stars from galaxies
- Finding image artifacts
Recurrent Neural Networks (RNNs) and Transformers
Perfect for sequences. Astronomical data often comes as time series: brightness measurements over time.
RNNs process data sequentially, maintaining "memory" of what came before. They can learn patterns like:
- "This star's brightness dips periodically; probably an eclipsing binary"
- "This brightness curve shape indicates a Type Ia supernova"
- "This radio signal has a characteristic pulsar signature"
Transformers (the architecture behind ChatGPT) are newer and can find relationships across very long sequences. They're increasingly used for:
- Analyzing years of photometric data
- Finding periodic signals with irregular spacing
- Cross-matching observations across time
Autoencoders
These learn to compress and reconstruct data. Train them on normal telescope images; they learn what "normal" looks like. When they fail to reconstruct something, that's interesting!
Astronomy applications:
- Anomaly detection (finding weird objects)
- Noise reduction (learn to reconstruct clean images)
- Data compression (critical for your distributed array!)
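The compress-and-reconstruct idea can be sketched with a linear autoencoder (mathematically, a PCA projection) in NumPy; a real system would use a deep network, but the flag-by-reconstruction-error logic is identical. All data here is simulated:

```python
# Anomaly detection by reconstruction error, using a linear "autoencoder"
# (a rank-3 projection learned from normal data, i.e. PCA).
import numpy as np

rng = np.random.default_rng(0)
# 500 "normal" frames: 20-pixel vectors lying near a 3-dimensional subspace
basis = rng.normal(size=(3, 20))
normal = rng.normal(size=(500, 3)) @ basis + 0.05 * rng.normal(size=(500, 20))

# "Train": learn the top-3 principal directions of the normal data
mean = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
encode_decode = Vt[:3].T @ Vt[:3]          # project onto the learned subspace

def reconstruction_error(frame):
    centered = frame - mean
    recon = centered @ encode_decode       # encode -> decode
    return np.sqrt(np.mean((centered - recon) ** 2))

threshold = max(reconstruction_error(f) for f in normal)

anomaly = rng.normal(size=20) * 5.0        # a frame unlike the training data
print(reconstruction_error(anomaly) > threshold)   # flagged as anomalous
```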
Generative Models (GANs, Diffusion Models)
These learn to create realistic data. Train on real galaxy images, and they can generate synthetic galaxies.
Astronomy applications:
- Generating training data for rare events
- Testing analysis pipelines
- Simulating what observations should look like
- Super-resolution (enhancing image detail)
Part 2: Current Astronomy Applications in Detail
Galaxy Classification at Scale
The Problem: Modern surveys like SDSS have imaged hundreds of millions of galaxies. Human classification is impossible at this scale.
The ML Solution: Train a CNN on galaxies that humans have classified (the Galaxy Zoo project provided millions of human classifications). The network learns to recognize:
- Spiral vs. elliptical morphology
- Presence of bars, rings, or tidal features
- Signs of mergers or interactions
- Active galactic nuclei (AGN) signatures
Current State-of-the-Art:
- Accuracy exceeds 95% for basic morphology
- Can classify a million galaxies in hours
- Now being extended to fine-grained features
- Some models identify structures humans miss
For Your Distributed Array: Even a small-scale version could classify objects in real-time, flagging interesting morphologies for follow-up across your network.
Transient Detection
The Problem: Some astronomical events last only hours or days: supernovae, gamma-ray burst afterglows, gravitational wave counterparts, asteroids. You need to find them fast.
The ML Pipeline:
- Image Subtraction: Compare new images to reference images
- Candidate Detection: Find things that changed
- ML Classification: Is this a real transient or an artifact?
The classification step is crucial. Most "changes" are:
- Cosmic ray hits on the detector
- Satellite trails
- Bad pixels
- Atmospheric artifacts
- Subtraction errors
ML learns to distinguish real astrophysical transients from garbage.
Real-Time Systems:
- ZTF (Zwicky Transient Facility) processes millions of candidates nightly
- ML cuts false positives by >99%
- Interesting candidates trigger automatic follow-up within minutes
For Your Array: A trained transient detector could alert when any telescope sees something unusual, triggering coordinated follow-up across your entire network within seconds.
Stellar Spectroscopy
The Problem: A star's spectrum (how its light splits into different colors) encodes everything: temperature, composition, velocity, age. But traditional analysis is slow.
The ML Approach: Train on stars with known properties (from detailed physics analysis), then predict properties for millions of other stars instantly.
What ML Learns:
- Which absorption lines indicate which elements
- How line shapes encode temperature and pressure
- Doppler shifts revealing motion
- Age-related abundance patterns
Current Capabilities:
- Predict 20+ stellar parameters from a single spectrum
- Process millions of spectra in minutes
- Precision approaching physics-based methods
- Can identify chemically peculiar stars automatically
Exoplanet Detection
The Problem: Finding planets around other stars means detecting tiny signals: small brightness dips (transits) or subtle wobbles (radial velocity).
ML Techniques:
For Transits:
- Distinguish planet transits from stellar variability, eclipsing binaries, or instrumental effects
- Learn the characteristic shapes of planet transits
- Identify multi-planet systems from overlapping signals
For Radial Velocity:
- Separate planetary signals from stellar activity
- Handle multiple overlapping planetary signatures
- Distinguish planets from stellar pulsations
Kepler/TESS Results: ML has found thousands of planet candidates that traditional methods missed, including some in the habitable zone.
Gravitational Lens Finding
The Problem: Gravitational lenses, where massive objects bend light from background sources, are rare and scientifically valuable. Finding them in millions of images is hard.
Why ML Excels: Lenses have characteristic signatures (arcs, Einstein rings, multiple images) that CNNs learn to recognize even when faint or distorted.
Current Systems:
- Survey thousands of square degrees automatically
- Find lens candidates with >90% accuracy
- Have discovered hundreds of new lenses
- Some found lenses humans missed
Radio Astronomy
Unique Challenges:
- Data volumes are enormous (petabytes per day for SKA)
- Interference from human sources (satellites, phones, etc.)
- Complex imaging from antenna arrays
ML Applications:
- Real-time RFI (radio frequency interference) flagging
- Source detection and classification
- Fast radio burst detection
- Pulsar searching
- Image reconstruction
Part 3: Your Distributed Telescope Array (The Complete ML Architecture)
Now let's design a comprehensive ML system for your specific project.
The Data Challenge
With multiple geographically distributed telescopes, you're dealing with:
- Volume: Each telescope generates gigabytes nightly
- Velocity: Data arrives continuously from all sites
- Variety: Different weather, different equipment quirks, different calibrations
- Veracity: How do you know which data to trust?
ML addresses all of these.
Layer 1: Per-Telescope Intelligence
Each telescope site runs local ML systems:
Real-Time Quality Assessment
A trained model continuously evaluates incoming frames:
Input: Raw telescope frame
Output: Quality score (0-100) + issue flags
Issues detected:
- Cloud coverage percentage
- Atmospheric seeing estimate
- Tracking errors
- Focus problems
- Sensor issues (hot pixels, columns)
- Satellite/plane trails
Training Data: Historical frames labeled by quality, weather conditions, resulting science output.
Action: Bad frames immediately flagged; severe issues trigger alerts or automatic system adjustments.
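A toy version of such a scorer, assuming (hypothetically) that each frame has already been reduced to two summary features, cloud fraction and seeing, trained as a simple logistic regression on simulated historical labels:

```python
# A toy frame-quality scorer: logistic regression on two summary features.
# The features, labels, and thresholds are all simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 400
cloud = rng.uniform(0.0, 1.0, n)              # cloud fraction per frame
seeing = rng.uniform(0.5, 4.0, n)             # seeing in arcsec
good = ((cloud < 0.4) & (seeing < 2.5)).astype(float)   # "historical" labels

X = np.column_stack([cloud, seeing, np.ones(n)])
w = np.zeros(3)
for _ in range(5000):                          # gradient descent on log-loss
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - good) / n

def quality_score(cloud_frac, seeing_arcsec):
    z = w @ [cloud_frac, seeing_arcsec, 1.0]
    return 100 / (1 + np.exp(-z))              # 0-100 quality score

print(quality_score(0.05, 1.0) > quality_score(0.9, 3.5))   # True
```

A production system would learn directly from pixels with a CNN, but the output contract (frame in, 0-100 score out) is the same.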
Local Anomaly Detection
An autoencoder trained on "normal" observations:
Normal flow:
Frame → Encode → Compressed representation → Decode → Reconstructed frame
If reconstruction error > threshold:
Flag as anomaly for immediate review
This catches:
- Sudden transients
- Equipment malfunctions
- Unusual atmospheric events
- Potential discoveries
Edge Computing Benefits
Running ML locally means:
- Instant response (no network latency)
- Reduced data transfer (only interesting data goes to central)
- Resilience (works even if network fails)
- Bandwidth savings (critical for remote sites)
Layer 2: Cross-Site Coordination
A central ML system coordinates your entire network:
Intelligent Scheduling
This is a reinforcement learning problem. The system learns to maximize scientific output by:
State:
- Current conditions at each site
- Queue of observation requests
- Recent data quality from each telescope
- Astronomical event predictions
- Maintenance schedules
Actions:
- Assign targets to specific telescopes
- Coordinate multi-site observations
- Trigger follow-up observations
- Adjust exposure times
Reward:
- Scientific value of observations obtained
- Data quality metrics
- Response time to transients
- Network efficiency
Over time, the system learns patterns:
- "Site A produces better data on these targets"
- "Coordinated observations during this window work best"
- "When weather deteriorates at B, shift to C"
Dynamic Resource Allocation
When something interesting happens:
Event: Transient detected at Site A
ML System:
1. Classify transient type (supernova? asteroid? unknown?)
2. Predict evolution (how long will it be visible?)
3. Calculate optimal follow-up strategy
4. Identify which other sites can observe
5. Generate observation commands
6. Prioritize based on scientific value
Result: Within seconds, multiple sites coordinate
Data Fusion Engine
Combining data from multiple sites is non-trivial. Each telescope has:
- Different atmospheric conditions
- Different instrumental responses
- Different pointing accuracies
- Different time synchronization
An ML model learns the optimal combination:
Inputs:
- Frames from Sites A, B, C, D
- Metadata (conditions, calibrations)
- Cross-calibration history
Output:
- Combined image superior to any individual frame
- Uncertainty map
- Flags for inconsistent data
This is similar to how your brain combines two eye images into one 3D perception, but with multiple telescopes.
Layer 3: Science-Ready Processing
Automated Calibration Pipeline
Traditional calibration requires:
- Bias frames (sensor offsets)
- Dark frames (thermal noise)
- Flat fields (sensitivity variations)
- Photometric calibration (brightness scale)
- Astrometric calibration (position mapping)
ML can:
- Learn sensor behavior and predict calibrations
- Identify when calibrations are outdated
- Flag calibration failures automatically
- Cross-calibrate between sites
Source Detection and Classification
For every processed image:
Pipeline:
1. Detect sources (stars, galaxies, artifacts)
2. Classify each source
3. Measure properties (brightness, shape, color)
4. Cross-match with catalogs
5. Flag unknowns or interesting objects
ML models at each step:
- Detection: U-Net or similar segmentation network
- Classification: ResNet or EfficientNet
- Property measurement: Regression networks
- Anomaly flagging: Isolation forests or autoencoders
Automated Science Products
The system can generate:
- Nightly summary reports
- Transient alerts
- Photometric databases
- Astrometric solutions
- Quality metrics
- Science-ready catalogs
All with ML-driven quality control.
Layer 4: Discovery Systems
Unknown Object Discovery
Here's where it gets really exciting. Train ML to know what "normal" objects look like, then find things that don't fit:
Approach 1: Clustering
- Represent each object as a feature vector
- Cluster similar objects together
- Objects far from all clusters are interesting
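Approach 1 can be sketched in a few lines of NumPy with a tiny k-means and a distance threshold; the two "object classes" and the oddball point are simulated:

```python
# Flag outliers: cluster feature vectors, then flag points that sit
# far from every cluster centre.
import numpy as np

rng = np.random.default_rng(1)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2))
oddball = np.array([[2.5, -4.0]])        # unlike either class
X = np.vstack([cluster_a, cluster_b, oddball])

# Tiny k-means (k=2), initialised with two far-apart points
c0 = X[0]
c1 = X[np.argmax(np.linalg.norm(X - c0, axis=1))]
centroids = np.array([c0, c1])
for _ in range(20):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# Distance of every object to its nearest centroid
nearest = np.linalg.norm(X - centroids[labels], axis=1)
outliers = np.where(nearest > nearest.mean() + 5 * nearest.std())[0]
print(outliers)   # index 200: the oddball
```

Real feature vectors would come from a trained network's embedding rather than two hand-made coordinates, but the flagging logic is the same.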
Approach 2: Anomaly scoring
- Train model on known object classes
- Low confidence predictions = potential new classes
Approach 3: Self-supervised learning
- Let model learn structure without labels
- Deviations from learned structure = anomalies
This is how ML might discover entirely new types of astronomical objects.
Pattern Detection in Time
Over months of operation, your array generates time-series data. ML can find:
- Periodic variables with unusual periods
- Long-term trends
- Correlated variations across different objects
- Quasi-periodic oscillations
- Chaotic behavior
Some of these patterns might reveal new physics.
Part 4: The Technical Implementation
Data Architecture
Site Level:
├── Raw Data Buffer (hours)
├── Quick-Look Processing
├── ML Quality Assessment
├── Local Database (days)
├── Compression/Selection
└── Upload Queue
Central Level:
├── Ingestion Pipeline
├── Data Lake (petabytes)
├── Processing Clusters
├── ML Training Infrastructure
├── Science Databases
└── User Interface/API
ML Infrastructure
Training: You need GPU clusters. Options:
- Cloud (AWS, Google Cloud, Azure): flexible but ongoing cost
- On-premises: high upfront cost but lower long-term cost
- Hybrid: train in the cloud, deploy on-premises
Inference (running trained models): Can run on:
- CPUs for some models
- Edge devices (NVIDIA Jetson) at telescope sites
- Central GPU servers for complex models
Model Management
Over time, you'll have many models:
- Different versions of the same model
- Models for different tasks
- Models trained on different data
You need MLOps:
- Version control for models
- Automated testing (does new model improve results?)
- Deployment pipelines
- Performance monitoring
- Retraining triggers
Practical Code Example: A Galaxy Classifier
Here's what a real implementation might look like:
import numpy as np
import cv2
import torch
import torch.nn as nn
from torchvision import models, transforms
from astropy.io import fits

# Load a pre-trained model and modify it for galaxy classification
class GalaxyClassifier(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        # Use EfficientNet as base (good accuracy/efficiency trade-off)
        self.base = models.efficientnet_b0(pretrained=True)
        # Modify the input layer for astronomical data:
        # single-channel images rather than 3-channel RGB
        self.base.features[0][0] = nn.Conv2d(
            1, 32, kernel_size=3, stride=2, padding=1, bias=False
        )
        # Modify the output layer for our classes
        # (spiral, elliptical, irregular, merger, artifact)
        self.base.classifier[1] = nn.Linear(1280, num_classes)

    def forward(self, x):
        return self.base(x)

# Preprocessing for telescope images
def preprocess_fits(filepath):
    """Load a FITS file and prepare it for the model."""
    with fits.open(filepath) as hdul:
        data = hdul[0].data.astype('float32')
    # Astronomical preprocessing:
    # 1. Handle negative values (common in processed images)
    data = data - data.min()
    # 2. Log stretch (astronomical images have huge dynamic range)
    data = np.log1p(data)
    # 3. Normalize to 0-1
    data = (data - data.min()) / (data.max() - data.min())
    # 4. Resize to the model input size
    data = cv2.resize(data, (224, 224))
    # 5. Add a channel dimension
    data = data[np.newaxis, :, :]
    return torch.tensor(data)

# Training loop sketch (validate() is assumed to return accuracy on val_loader)
def train_galaxy_classifier(model, train_loader, val_loader, epochs=50):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()
    best_accuracy = 0.0
    for epoch in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        # Validate and keep the best model
        val_accuracy = validate(model, val_loader)
        if val_accuracy > best_accuracy:
            best_accuracy = val_accuracy
            torch.save(model.state_dict(), 'best_model.pt')
Real-Time Transient Detection System
import numpy as np
from skimage import measure

class TransientDetector:
    def __init__(self):
        # load_model and AlertBroker are project-specific helpers
        self.classifier = load_model('transient_classifier.pt')
        self.alert_system = AlertBroker()

    def process_frame(self, new_frame, reference_frame):
        # Step 1: Image subtraction
        difference = new_frame - reference_frame
        # Step 2: Find candidates (simple thresholding + morphology)
        candidates = self.find_candidates(difference)
        # Step 3: Classify each candidate
        for candidate in candidates:
            cutout = self.extract_cutout(new_frame, candidate.position)
            # Run through the ML classifier
            probs = self.classifier(cutout)
            # Classes: real_transient, cosmic_ray, bad_pixel, satellite, noise
            if probs['real_transient'] > 0.8:
                # Real detection!
                self.alert_system.send_alert(
                    position=candidate.position,
                    confidence=probs['real_transient'],
                    cutout=cutout,
                    timestamp=new_frame.timestamp
                )

    def find_candidates(self, difference_image):
        """Find potential transient locations."""
        # Threshold at 5 sigma
        threshold = 5 * np.std(difference_image)
        mask = difference_image > threshold
        # Find connected components
        labels = measure.label(mask)
        regions = measure.regionprops(labels)
        return [r for r in regions if self.is_valid_candidate(r)]
Distributed Coordination System
import asyncio

THRESHOLD = 0.5  # example priority cut-off; tune for your science goals

# MLScheduler and DataFusionModel are project-specific components
class TelescopeNetwork:
    def __init__(self, sites):
        self.sites = sites  # list of telescope connections
        self.scheduler = MLScheduler()
        self.data_fusion = DataFusionModel()

    async def handle_transient_alert(self, alert):
        """Coordinate the network response to a transient detection."""
        # 1. Classify the transient
        classification = self.classify_transient(alert)
        # 2. Determine follow-up priority
        priority = self.calculate_priority(classification)
        if priority > THRESHOLD:
            # 3. Find available telescopes that can observe
            available = []
            for site in self.sites:
                visibility = site.check_visibility(alert.position)
                if visibility['observable']:
                    available.append({
                        'site': site,
                        'quality': site.current_conditions(),
                        'visibility': visibility
                    })
            # 4. ML decides the optimal observation strategy
            strategy = self.scheduler.plan_followup(
                transient=classification,
                available_sites=available,
                priority=priority
            )
            # 5. Execute coordinated observations
            tasks = []
            for assignment in strategy['assignments']:
                task = assignment['site'].observe(
                    target=alert.position,
                    exposure=assignment['exposure'],
                    filters=assignment['filters']
                )
                tasks.append(task)
            # 6. Gather and fuse the results
            results = await asyncio.gather(*tasks)
            combined = self.data_fusion.combine(results)
            return combined
Part 5: The Future (What's Coming)
Foundation Models for Astronomy
Just as GPT learned language and DALL-E learned images, astronomy foundation models are being developed. These are trained on vast astronomical datasets and can be fine-tuned for specific tasks.
Imagine a model that has "seen" every public telescope image ever taken. It understands:
- What different objects look like
- How instruments behave
- What noise looks like
- The structure of the astronomical universe
You could fine-tune this for your specific telescopes with minimal data.
Autonomous Discovery Systems
Current ML classifies objects into known categories. Future systems will:
- Propose hypotheses: "These objects share unusual features; they might be a new class"
- Design observations: "To test this hypothesis, we need spectra of these 5 objects"
- Request telescope time: Automatically submit proposals
- Analyze results: "Hypothesis confirmed/rejected, here's what we learned"
- Write papers: Generate preliminary reports of findings
This is AI-driven scienceβthe algorithm becomes a collaborator.
Multi-Messenger Astronomy
When gravitational waves, neutrinos, and light all come from the same event (like a neutron star merger), we need instant coordination. Future ML systems will:
- Ingest alerts from all types of observatories
- Triangulate source positions
- Coordinate hundreds of telescopes worldwide
- Prioritize based on predicted scientific value
- Adapt in real-time as new data arrives
Your distributed array could be part of this global network.
Simulation-Based Inference
Instead of training ML on observed data, train on simulated universes. Run physics simulations of different cosmological parameters, generate synthetic observations, train ML to infer parameters from observations.
This connects ML directly to physical theory: the algorithm learns not just patterns but physics.
Real-Time Adaptive Optics
Ground-based telescopes battle atmospheric turbulence. Adaptive optics (deformable mirrors) correct this, but current systems are limited. ML can:
- Predict atmospheric turbulence milliseconds ahead
- Control mirror surfaces faster than traditional systems
- Learn optimal corrections for each site
- Potentially achieve space-telescope quality from ground
Federated Learning for Privacy and Bandwidth
Not all data can be shared freely. Federated learning lets multiple telescope networks contribute to training a model without sharing raw data:
- Global model sent to each site
- Each site trains on local data
- Only model updates (not data) sent back
- Updates combined into improved global model
This enables collaboration while respecting data ownership.
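The four steps above can be sketched as federated averaging (FedAvg) on a shared linear model; the sites, their data, the learning rate, and the round counts here are all simulated and arbitrary:

```python
# Federated averaging: sites train locally, only parameter updates are shared
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])       # relation all sites observe

def make_site_data(n):
    X = rng.normal(size=(n, 3))
    y = X @ true_w
    return X, y

sites = [make_site_data(100) for _ in range(4)]
global_w = np.zeros(3)

for round_ in range(50):
    updates = []
    for X, y in sites:
        w = global_w.copy()               # 1. global model sent to each site
        for _ in range(10):               # 2. each site trains on local data
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= 0.05 * grad
        updates.append(w - global_w)      # 3. only the update leaves the site
    global_w += np.mean(updates, axis=0)  # 4. server averages the updates

print(np.round(global_w, 2))              # converges to [ 2. -1.  0.5]
```

Notice that the raw `X, y` arrays never leave the inner loop; only `w - global_w` is communicated.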
Part 6: Getting Started (Practical Roadmap)
Phase 1: Foundation (Months 1-3)
Learn the basics:
- Python programming
- Basic ML concepts (scikit-learn tutorials)
- Astronomical data formats (FITS, catalogs)
- Astropy library
Set up infrastructure:
- Data storage solution
- Version control (Git)
- Basic processing pipeline
- Database for metadata
First ML project:
- Implement a simple image quality classifier
- Train on your own telescope data
- Deploy at one site
Phase 2: Single-Site Intelligence (Months 4-8)
Build local ML systems:
- Real-time quality assessment
- Basic transient detection
- Source detection and cataloging
- Automated calibration assistance
Infrastructure upgrades:
- GPU for ML inference
- Automated pipeline deployment
- Monitoring and logging
- Alert system
Phase 3: Network Coordination (Months 9-15)
Central intelligence:
- Data aggregation system
- Cross-site calibration
- Basic scheduling optimization
- Multi-site alert coordination
Advanced ML:
- Data fusion models
- Reinforcement learning for scheduling
- Anomaly detection across network
Phase 4: Discovery Systems (Months 16+)
Scientific ML:
- Unknown object discovery
- Time-series analysis
- Automated science products
- Publication-quality outputs
Network optimization:
- Full autonomous scheduling
- Predictive maintenance
- Adaptive observing strategies
Resources for Learning
Courses:
- Fast.ai (practical deep learning)
- Coursera ML specializations
- AstroML tutorials
Books:
- "Statistics, Data Mining, and Machine Learning in Astronomy"
- "Deep Learning" by Goodfellow et al.
Tools:
- PyTorch or TensorFlow (deep learning frameworks)
- Astropy (astronomical data handling)
- scikit-learn (traditional ML)
Communities:
- AstroML mailing list
- Astropy community
- ML in astronomy conferences (like ADASS)
Summary
Machine learning transforms astronomical data processing from a bottleneck into an opportunity. For your distributed telescope array specifically:
- Local intelligence at each site handles quality control and real-time detection
- Central coordination optimizes scheduling and combines data
- Science systems extract discoveries and generate products
- Future capabilities include autonomous discovery and global coordination
The key insight: ML doesn't replace astronomical expertise; it amplifies it. The algorithms learn from human knowledge, then apply that knowledge at scales and speeds impossible for humans alone.
Your distributed array, equipped with ML, becomes more than the sum of its parts. Each telescope contributes to a collective intelligence that sees more, responds faster, and discovers things no single instrument could find.
Do you want me to dive even deeper into any specific area? I could elaborate on:
- The mathematical details of specific algorithms
- Detailed system architecture for your array
- Specific code implementations
- Training data strategies
- Hardware recommendations
- Integration with existing astronomical systems
The Complete AI/Machine Learning Education: From Zero to Practitioner
I'm going to teach you everything from the absolute ground up. By the end of this, you'll understand how AI actually works and be ready to start building.
Chapter 1: What Is AI, Really?
Forgetting the Science Fiction
First, forget everything movies have told you. AI isn't a sentient robot. It's not "thinking" the way you think. At its core, AI is:
Mathematics that finds patterns in data.
That's it. Everything else (image recognition, language understanding, game playing) emerges from this simple idea applied at massive scale.
The Spectrum of AI
- Rule-Based Systems: "If X, then Y." Example: "If temperature > 100°, send alert."
- Machine Learning: "Learn from examples." Example: "Show me 10,000 spam emails, learn what spam looks like."
- Deep Learning: "Learn complex patterns with neural networks." Example: "Show me millions of images, learn to recognize anything."
Rule-based: You write explicit rules. Limited but predictable.
Machine Learning: The computer discovers rules from data. Flexible but needs examples.
Deep Learning: Machine learning with neural networks. Can learn incredibly complex patterns but needs lots of data and computation.
Why This Matters for Astronomy
Traditional astronomy: "If brightness dips by X% for Y hours with this shape, it might be a planet transit."
ML astronomy: "Here are 10,000 confirmed planet transits. Learn what they look like. Now find more."
The second approach finds patterns humans might never think to look for.
Chapter 2: The Mathematics You Actually Need
Don't panic. You need less math than you think, and I'll explain each piece intuitively.
Concept 1: Variables and Functions
A variable is just a placeholder for a number:
x = 5
temperature = 72.4
brightness = 0.00847
A function takes inputs and produces outputs:
f(x) = 2x + 1
When x = 3: f(3) = 2(3) + 1 = 7
When x = 10: f(10) = 2(10) + 1 = 21
ML insight: A trained model IS a function. It takes your data as input and produces predictions as output.
Concept 2: Vectors and Matrices
A vector is a list of numbers:
pixel_values = [0.1, 0.4, 0.9, 0.2, 0.8]
star_properties = [temperature, brightness, distance, mass]
A matrix is a grid of numbers:
image = [
[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6],
[0.7, 0.8, 0.9]
]
ML insight: All data becomes vectors or matrices. An image? Matrix of pixel values. A spectrum? Vector of intensity values. Text? Converted to vectors of numbers.
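In NumPy, the standard Python numerical library, those examples look like this:

```python
# All data becomes arrays: an "image" and a "spectrum" in NumPy
import numpy as np

image = np.array([[0.1, 0.2, 0.3],
                  [0.4, 0.5, 0.6],
                  [0.7, 0.8, 0.9]])              # 3x3 matrix of pixel values
spectrum = np.array([0.1, 0.4, 0.9, 0.2, 0.8])   # vector of intensities

print(image.shape, spectrum.shape)               # (3, 3) (5,)
print(image.flatten()[:4])                       # models often see a flat vector
```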
Concept 3: The Dot Product
This is the key operation in ML. Multiply corresponding elements and add:
vector_a = [1, 2, 3]
vector_b = [4, 5, 6]
dot_product = (1×4) + (2×5) + (3×6)
= 4 + 10 + 18
= 32
ML insight: This is how neural networks combine inputs. Each input gets multiplied by a "weight," then everything is added up.
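The same calculation in NumPy:

```python
# The dot product: multiply corresponding elements, then add
import numpy as np

vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6])
print(np.dot(vector_a, vector_b))   # 32, matching the hand calculation
```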
Concept 4: Probability Basics
Probability measures likelihood (0 = impossible, 1 = certain):
P(coin lands heads) = 0.5
P(sun rises tomorrow) β 1.0
P(finding a unicorn) = 0.0
ML insight: Models output probabilities. "This image is 94% likely to be a spiral galaxy, 5% elliptical, 1% artifact."
Concept 5: Derivatives (Just the Intuition)
A derivative measures "how fast something is changing."
Imagine driving a car:
- Position = where you are
- Velocity (derivative of position) = how fast position is changing
- Acceleration (derivative of velocity) = how fast velocity is changing
ML insight: Training uses derivatives to figure out "if I adjust this parameter slightly, how much does my error change?" This guides learning.
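You can see this numerically: nudge a parameter, measure how the error changes, and compare with the calculus answer. The one-parameter model here is invented for illustration:

```python
# "If I adjust this parameter slightly, how much does my error change?"
def error(w):
    # squared error of a one-parameter model: prediction = w * 2, target = 10
    return (w * 2 - 10) ** 2

w = 3.0
h = 1e-6
numeric = (error(w + h) - error(w - h)) / (2 * h)   # finite-difference estimate
analytic = 2 * (w * 2 - 10) * 2                      # chain rule
print(numeric, analytic)   # both are -16.0: increasing w reduces the error
```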
Chapter 3: How Machine Learning Actually Works
The Core Loop
Every ML system follows this pattern:
1. INITIALIZE: Start with random parameter values
2. PREDICT: Use current parameters to make predictions
3. MEASURE ERROR: Compare predictions to correct answers
4. UPDATE: Adjust parameters to reduce error
5. REPEAT: Go back to step 2, thousands of times
Let me make this concrete.
Example: Predicting Star Temperature from Color
The Data:
Star 1: Blue/Red ratio = 0.8, Temperature = 5000K
Star 2: Blue/Red ratio = 1.2, Temperature = 6500K
Star 3: Blue/Red ratio = 1.5, Temperature = 8000K
Star 4: Blue/Red ratio = 2.0, Temperature = 11000K
... (thousands more)
The Model (simplest possible):
Predicted_Temperature = w Γ (Blue/Red ratio) + b
Where w and b are parameters we need to learn
Training Process:
Step 1: Random initialization
w = 1000 (random guess)
b = 2000 (random guess)
Step 2: Make predictions
Star 1: 1000 × 0.8 + 2000 = 2800K (actual: 5000K) ← way off!
Star 2: 1000 × 1.2 + 2000 = 3200K (actual: 6500K) ← way off!
Step 3: Measure error
Error = average of (predicted - actual)²
= ((2800-5000)² + (3200-6500)²) / 2
= (4,840,000 + 10,890,000) / 2
= 7,865,000 ← big number, bad!
Step 4: Update parameters
Mathematics tells us:
- Increasing w will reduce error
- Increasing b will reduce error
New w = 1000 + adjustment = 3000
New b = 2000 + adjustment = 2500
Step 5: Repeat
With new parameters, error becomes 2,100,000
Keep going...
After 1000 iterations:
w ≈ 5000
b ≈ 1000
Error is now tiny!
Final model:
Temperature ≈ 5000 × (Blue/Red) + 1000
This simple model learned the relationship between color and temperature!
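The whole training loop above fits in a few lines of NumPy. This is a minimal sketch using just the four example stars (with thousands of real stars the fit would be more trustworthy); the learning rate of 0.05 and the step count are arbitrary choices:

```python
import numpy as np

# The four example stars from above: blue/red ratio vs. temperature (K)
ratios = np.array([0.8, 1.2, 1.5, 2.0])
temps = np.array([5000.0, 6500.0, 8000.0, 11000.0])

w, b = 1000.0, 2000.0   # the "random guesses" from Step 1
lr = 0.05               # learning rate (chosen by trial and error)

for step in range(20_000):
    preds = w * ratios + b                   # Step 2: predict
    errors = preds - temps                   # Step 3: measure error
    grad_w = 2 * np.mean(errors * ratios)    # derivative of MSE w.r.t. w
    grad_b = 2 * np.mean(errors)             # derivative of MSE w.r.t. b
    w -= lr * grad_w                         # Step 4: update
    b -= lr * grad_b                         # Step 5: repeat

mse = np.mean((w * ratios + b - temps) ** 2)
print(f"w = {w:.0f}, b = {b:.0f}, MSE = {mse:.0f}")
```

On these four points the best straight line lands near w ≈ 5000 and b ≈ 700, so the learned values come out close to (though not exactly) the round numbers in the walkthrough.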
Gradient Descent: The Heart of Learning¶
"Gradient descent" is just a fancy name for the update process. Here's the intuition:
Imagine you're blindfolded on a hilly landscape. Your goal: find the lowest valley (minimum error).
Strategy:
- Feel the ground around you (compute gradient/derivative)
- Figure out which direction goes downhill (direction of steepest descent)
- Take a step that direction (update parameters)
- Repeat until you stop going downhill (reached minimum)
Error
  ^
  | *                  <- starting point (random parameters)
  |  *
  |   *                <- each step moves downhill
  |    *
  |     *
  |      *____*        <- minimum (best parameters)
  +----------------------> Parameters
The Learning Rate¶
How big should each step be?
- Too big: You overshoot the minimum, bounce around, never converge
- Too small: Takes forever to reach the minimum
- Just right: Steady progress toward the best solution
(Sketch: with too high a rate the error bounces back and forth across the valley and never settles; with too low a rate it inches downhill and takes forever; with a good rate it descends steadily and converges in a handful of steps.)
The learning rate is a hyperparameter: something you choose, not something the model learns.
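You can see all three behaviors on the simplest possible problem: minimizing f(x) = x², whose gradient is 2x. A quick sketch (the specific rates are just illustrative):

```python
def descend(lr, steps=50, x=5.0):
    """Run gradient descent on f(x) = x**2, whose gradient is 2*x."""
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(descend(1.1))    # too high: overshoots the minimum and diverges
print(descend(1e-4))   # too low: after 50 steps, barely moved from 5.0
print(descend(0.1))    # good: lands essentially at the minimum, 0
```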
Chapter 4: Neural Networks Explained¶
The Biological Inspiration (Loosely)¶
Your brain has neurons connected by synapses. A neuron:
- Receives signals from other neurons
- If total signal exceeds a threshold, it "fires"
- Sends signals to other neurons
Artificial neural networks are inspired by this (but much simpler).
The Artificial Neuron¶
Inputs (x₁, x₂, x₃)      Weights (w₁, w₂, w₃)
        |                        |
        v                        v
+---------------------------------------------+
|                                             |
|  weighted_sum = w₁×x₁ + w₂×x₂ + w₃×x₃ + b   |
|                                             |
|  output = activation(weighted_sum)          |
|                                             |
+---------------------------------------------+
                      |
                      v
                   Output
Inputs: The data (pixel values, measurements, features)
Weights: Learnable parameters that determine importance of each input
Bias (b): An adjustable offset
Activation function: Introduces non-linearity (explained below)
Why Activation Functions Matter¶
Without activation functions, stacking layers would be pointless:
Layer 1: output = w₁ × input + b₁
Layer 2: output = w₂ × (w₁ × input + b₁) + b₂
                = (w₂×w₁) × input + (w₂×b₁ + b₂)
                = W × input + B    ← still just a linear function!
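A quick numerical check of that collapse, with made-up scalar weights:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5)          # any inputs will do
w1, b1 = 3.0, 2.0               # layer 1 parameters (arbitrary)
w2, b2 = -0.5, 1.0              # layer 2 parameters (arbitrary)

two_layers = w2 * (w1 * x + b1) + b2
one_layer = (w2 * w1) * x + (w2 * b1 + b2)   # W = w2*w1, B = w2*b1 + b2

print(np.allclose(two_layers, one_layer))    # True: the stack collapses
```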
Activation functions break this linearity, allowing complex patterns:
ReLU (Rectified Linear Unit), the most common:
ReLU(x) = max(0, x)
If x is negative, output 0
If x is positive, output x
Examples:
ReLU(-5) = 0
ReLU(0) = 0
ReLU(3) = 3
Sigmoid, which squashes outputs to 0-1 (good for probabilities):
Sigmoid(x) = 1 / (1 + e^(-x))
Very negative x → ~0
Zero → 0.5
Very positive x → ~1
Softmax, for classification (outputs sum to 1):
Used in final layer for classification
Converts raw scores to probabilities
Scores: [2.0, 1.0, 0.1]
Softmax: [0.66, 0.24, 0.10] ← these sum to 1.0
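All three activations are one-liners in NumPy. A sketch (the max subtraction inside softmax is a standard trick to avoid numerical overflow):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtracting the max avoids overflow
    return e / e.sum()

print(relu(np.array([-5.0, 0.0, 3.0])))       # [0. 0. 3.]
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # roughly [0, 0.5, 1]
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.round(2))                          # ~[0.66 0.24 0.1]
```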
Building a Neural Network¶
Stack neurons into layers:
INPUT LAYER       HIDDEN LAYER 1       HIDDEN LAYER 2       OUTPUT LAYER
(your data)       (learned features)   (complex features)   (predictions)

x₁ ---+
x₂ ---+---> [neurons] ------> [neurons] ------> class 1 prob
x₃ ---+                                 ------> class 2 prob
x₄ ---+                                 ------> class 3 prob

(Every input feeds every neuron in hidden layer 1, every hidden-layer-1 neuron
feeds every neuron in hidden layer 2, and so on: the layers are "fully connected.")
Each connection has a weight (learnable)
Each neuron has a bias (learnable)
Each neuron applies an activation function
What Each Layer Learns (Image Example)¶
For image classification:
Layer 1: Detects simple patterns
- Edge detectors (vertical, horizontal, diagonal)
- Color blobs
- Simple textures
Layer 2: Combines simple patterns into shapes
- Corners (vertical + horizontal edges)
- Curves (many edge detectors)
- Texture regions
Layer 3: Combines shapes into parts
- "This looks like a spiral arm"
- "This looks like a galactic core"
- "This looks like a star cluster"
Layer 4+: Combines parts into objects
- "Spiral arms + bright core + overall shape = spiral galaxy"
This hierarchical learning is why deep networks are so powerful!
Forward Pass vs Backward Pass¶
Forward Pass: Data flows through the network, producing predictions
Input → Layer 1 → Layer 2 → ... → Output → Prediction
Backward Pass (Backpropagation): Errors flow backward, updating weights
How wrong was the prediction?
    ↓
How much did each Layer N weight contribute to the error?
    ↓
Adjust Layer N weights
    ↓
How much did each Layer N-1 weight contribute to the error?
    ↓
Adjust Layer N-1 weights
    ↓
... continue back to Layer 1 ...
This is where the calculus happens: computing how each weight affects the final error.
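Here is one full forward/backward cycle for a tiny one-hidden-layer network fitting a single made-up example, with the chain rule written out by hand. Every number here is an arbitrary placeholder; real frameworks automate exactly these steps:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # one input example (3 features)
y_true = 2.0                           # its made-up target value

W1, b1 = 0.5 * rng.normal(size=(4, 3)), np.zeros(4)   # layer 1: 3 -> 4
W2, b2 = 0.5 * rng.normal(size=(1, 4)), np.zeros(1)   # layer 2: 4 -> 1

lr = 0.02
for step in range(2000):
    # Forward pass
    z1 = W1 @ x + b1
    h = np.maximum(0, z1)              # ReLU
    y = W2 @ h + b2
    loss = (y - y_true) ** 2
    # Backward pass: chain rule, output layer first
    dy = 2 * (y - y_true)              # dLoss/dy
    dW2, db2 = np.outer(dy, h), dy
    dh = W2.T @ dy                     # error flowing back into h
    dz1 = dh * (z1 > 0)                # ReLU passes gradient only where it fired
    dW1, db1 = np.outer(dz1, x), dz1
    # Update every parameter a small step downhill
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(f"final loss: {loss.item():.6f}")
```

The loss shrinks toward zero as the network memorizes its single example, which is exactly the mechanism that, over thousands of examples, becomes learning.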
Chapter 5: Convolutional Neural Networks (CNNs) for Images¶
Since you're working with telescope images, CNNs are crucial.
The Problem with Regular Networks for Images¶
A small 256×256 grayscale image has 65,536 pixels.
If your first layer has 1000 neurons, you'd have 65,536,000 connections from input to first layer alone!
This is:
- Computationally expensive
- Prone to overfitting (too many parameters for limited data)
- Ignores the structure of images (nearby pixels are related)
The Key Insight: Local Patterns¶
In images, patterns are local:
- An edge is a few pixels wide
- A star is a small region
- Artifacts have local signatures
We don't need every neuron to look at every pixel!
Convolution: The Core Operation¶
A filter (or kernel) is a small pattern detector:
Example: 3×3 edge-detecting filter
Filter: Slide over image:
[-1 0 1]
[-1 0 1] Original After convolution
[-1 0 1] [image] --> [edge map]
How convolution works:
Image region:    Filter:           Calculation:
[1, 2, 3]        [-1, 0, 1]        Sum of element-wise products:
[4, 5, 6]    ×   [-1, 0, 1]      = (-1×1)+(0×2)+(1×3)+
[7, 8, 9]        [-1, 0, 1]        (-1×4)+(0×5)+(1×6)+
                                   (-1×7)+(0×8)+(1×9)
                                 = -1+0+3-4+0+6-7+0+9 = 6
Slide the filter across the entire image, computing this at each position. The result is a feature map.
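That sliding-window arithmetic is a dozen lines of NumPy. A sketch of "valid" convolution (strictly, cross-correlation, which is what CNN layers actually compute), reproducing the worked example:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`, summing element-wise products."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

patch = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
edge_filter = np.array([[-1, 0, 1]] * 3, dtype=float)
print(convolve2d(patch, edge_filter))   # [[6.]], matching the example
```

On a full image, the same loop produces the feature map; real frameworks just do it vastly faster on the GPU.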
Multiple Filters = Multiple Features¶
A CNN layer has many filters, each learning to detect different patterns:
Input Image (1 channel: grayscale)
    ↓
Conv Layer 1 (32 filters)
    ↓
32 Feature Maps (different patterns detected)
    ↓
Conv Layer 2 (64 filters, each looks at all 32 previous maps)
    ↓
64 Feature Maps (combinations of patterns)
    ↓
... more layers ...
    ↓
Final Classification
Pooling: Reducing Size¶
After convolution, we often pool to reduce the size:
Max Pooling (2×2):
[1, 3, 2, 4]
[5, 6, 1, 2]   →   [6, 4]     Take the max of each 2×2 region
[3, 2, 1, 0]       [3, 3]
[1, 2, 3, 1]
This:
- Reduces computation for later layers
- Adds some translation invariance (small shifts don't matter)
- Keeps the strongest activations
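2×2 max pooling is a one-line reshape trick in NumPy. A sketch that reproduces the example above:

```python
import numpy as np

def max_pool_2x2(x):
    """Max-pool a 2D array with even sides by 2x2 blocks."""
    h, w = x.shape
    # Group pixels into 2x2 blocks, then take the max of each block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [3, 2, 1, 0],
              [1, 2, 3, 1]])
print(max_pool_2x2(x))   # [[6 4]
                         #  [3 3]]
```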
Complete CNN Architecture¶
Input: 256×256×1 telescope image
Conv1: 32 filters (3×3), ReLU → 256×256×32
Pool1: Max pool (2×2) → 128×128×32
Conv2: 64 filters (3×3), ReLU → 128×128×64
Pool2: Max pool (2×2) → 64×64×64
Conv3: 128 filters (3×3), ReLU → 64×64×128
Pool3: Max pool (2×2) → 32×32×128
Flatten: 32×32×128 = 131,072 values
Dense1: 512 neurons, ReLU
Dense2: 128 neurons, ReLU
Output: 5 neurons, Softmax → [spiral, elliptical, irregular, merger, artifact]
Why CNNs Work So Well for Astronomical Images¶
- Translation invariance: A galaxy in the corner looks the same as one in the center
- Hierarchical features: Learn edges β shapes β structures β objects
- Parameter efficiency: Same filter applied everywhere, fewer total parameters
- Natural for 2D data: Respects spatial relationships
Chapter 6: Training in Practice¶
The Training/Validation/Test Split¶
Never evaluate on data you trained on! Split your data:
All Your Data (e.g., 10,000 galaxy images)
    ↓
+------------------------------------------+
| Training Set (70%): 7,000 images         | ← model learns from these
+------------------------------------------+
| Validation Set (15%): 1,500 images       | ← tune hyperparameters, early stopping
+------------------------------------------+
| Test Set (15%): 1,500 images             | ← final evaluation only (touch once!)
+------------------------------------------+
Training set: Model sees these, adjusts weights
Validation set: Model never trains on these; use to check performance during training
Test set: Model never sees until final evaluation; gives unbiased performance estimate
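A 70/15/15 split is two calls to scikit-learn's train_test_split. A sketch with placeholder data standing in for your images and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10_000)                       # stand-ins for 10,000 images
y = np.random.randint(0, 5, size=10_000)    # stand-ins for their labels

# First split off 30%, then cut that 30% in half
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 7000 1500 1500
```

Fixing random_state makes the split reproducible, so the test set stays the same set of images across experiments.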
Overfitting vs Underfitting¶
Underfitting: Model too simple, can't capture patterns
Training accuracy: 60%
Validation accuracy: 58%
Both are bad → you need a more complex model
Good fit: Model captures patterns without memorizing
Training accuracy: 95%
Validation accuracy: 92%
Both good and close together → a well-tuned model
Overfitting: Model memorized training data, fails on new data
Training accuracy: 99%
Validation accuracy: 70%
Big gap → the model is memorizing, not learning
Visualization: as model complexity increases, training error falls steadily, while validation error falls, bottoms out at the sweet spot, then climbs again once the model crosses into the overfitting regime.
Regularization: Preventing Overfitting¶
Dropout: Randomly "turn off" neurons during training
During training:
[neuron1] [ ] [neuron3] [ ] [neuron5]  ← 40% dropped
    ↓
Forces the network not to rely on any single neuron
    ↓
More robust, generalizes better
L2 Regularization: Penalize large weights
Loss = Prediction_Error + λ × (sum of squared weights)
Large weights get penalized
Forces model to use smaller, more distributed weights
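In PyTorch you rarely add the penalty term by hand: the optimizer's weight_decay argument plays the role of λ (strictly, with Adam this is weight decay rather than textbook L2 regularization; AdamW makes the distinction explicit). A minimal sketch:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)

# weight_decay is the λ in: Loss = prediction_error + λ × (sum of squared weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# The equivalent explicit penalty, if you prefer to see it in the loss:
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
print(f"current L2 penalty term: {l2_penalty.item():.4f}")
```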
Data Augmentation: Create variations of training data
Original galaxy image
    ↓
Augmented versions:
- Rotated 90°, 180°, 270°
- Flipped horizontally
- Flipped vertically
- Slightly shifted
- Slightly zoomed
- Noise added
- Brightness adjusted
1 image becomes 10+ training examples!
For astronomy, augmentation is powerful because physics doesn't change with rotation.
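The rotation-and-flip augmentations are cheap to generate with NumPy. A sketch (the 0.01 noise level is an arbitrary placeholder you would tune to your camera's actual noise):

```python
import numpy as np

def augment(image, rng=None):
    """Return physics-preserving variants of a 2D image array."""
    rng = rng or np.random.default_rng()
    variants = [image]
    for k in (1, 2, 3):                   # rotations by 90°, 180°, 270°
        variants.append(np.rot90(image, k))
    variants.append(np.fliplr(image))     # horizontal flip
    variants.append(np.flipud(image))     # vertical flip
    variants.append(image + rng.normal(0, 0.01, image.shape))  # mild noise
    return variants

img = np.random.rand(64, 64)
print(len(augment(img)), "training examples from 1 image")
```

Combining transforms (a rotated, flipped, noisy copy) multiplies the count further.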
Batch Training¶
Processing all data at once is memory-intensive. Instead, use mini-batches:
10,000 training images
    ↓
Split into batches of 32
    ↓
313 batches per epoch (312 full batches plus one final batch of 16)
Each training step:
1. Load batch of 32 images
2. Forward pass: compute predictions
3. Compute loss
4. Backward pass: compute gradients
5. Update weights
6. Next batch
One complete pass through all batches = 1 epoch
Training typically runs for 10-100+ epochs
Learning Rate Schedules¶
Learning rate can change during training:
Constant: the rate stays flat for the whole run.
Step decay: the rate drops by a fixed factor at set intervals (a staircase).
Exponential: the rate decays smoothly toward zero.
Cosine annealing: the rate sweeps down (and optionally back up) in smooth waves.
Common approach: Start high (learn fast), decrease over time (fine-tune).
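In PyTorch, schedules are one-liners wrapped around the optimizer. A sketch of step decay (the model and the specific numbers are placeholders):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... one epoch of training would go here ...
    scheduler.step()

print(optimizer.param_groups[0]['lr'])   # 0.1 * 0.5**3 = 0.0125
```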
Early Stopping¶
Stop training when validation performance stops improving:
Epoch 1: Val accuracy = 70%
Epoch 2: Val accuracy = 78%
Epoch 3: Val accuracy = 84%
Epoch 4: Val accuracy = 88%
Epoch 5: Val accuracy = 90%
Epoch 6: Val accuracy = 91%
Epoch 7: Val accuracy = 91% ← stopped improving
Epoch 8: Val accuracy = 90% ← getting worse (overfitting starting)
Epoch 9: Val accuracy = 89%
...
Early stopping: Stop at epoch 6 or 7, save that model
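The stopping rule itself is a few lines of bookkeeping. A sketch with a patience of 2 epochs, replayed on the accuracy trace above:

```python
def early_stop_epoch(val_accuracies, patience=2):
    """Return (best_epoch, best_acc): stop once accuracy hasn't
    improved for `patience` consecutive epochs."""
    best_acc, best_epoch, waited = float("-inf"), 0, 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break   # give up and restore the best checkpoint
    return best_epoch, best_acc

accs = [70, 78, 84, 88, 90, 91, 91, 90, 89]
print(early_stop_epoch(accs))   # (6, 91): keep the epoch-6 model
```

In practice you save the model's weights each time best_acc improves, then reload that checkpoint when the loop breaks.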
Chapter 7: Practical Python for ML¶
Setting Up Your Environment¶
Step 1: Install Python (version 3.9 or 3.10 recommended)
Step 2: Install essential packages
pip install numpy pandas matplotlib scikit-learn
pip install torch torchvision # PyTorch (or tensorflow if you prefer)
pip install astropy # For astronomy data
pip install jupyter # For interactive development
Step 3: Verify installation
import numpy as np
import torch
import astropy
print("All imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}") # True if you have GPU
NumPy: The Foundation¶
NumPy is for numerical computing. Everything in ML uses it.
import numpy as np
# Creating arrays
a = np.array([1, 2, 3, 4, 5])
b = np.zeros((3, 3)) # 3x3 array of zeros
c = np.ones((2, 4)) # 2x4 array of ones
d = np.random.randn(100, 100) # 100x100 random values (normal distribution)
# Array operations (element-wise)
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
print(x + y) # [5, 7, 9]
print(x * y) # [4, 10, 18]
print(x ** 2) # [1, 4, 9]
print(np.sqrt(x)) # [1.0, 1.414, 1.732]
# Statistics
data = np.random.randn(1000)
print(np.mean(data)) # ~0
print(np.std(data)) # ~1
print(np.max(data)) # ~3
print(np.min(data)) # ~-3
# Reshaping
image = np.random.randn(256, 256) # 2D image
flat = image.reshape(-1) # Flatten to 1D: 65536 elements
back = flat.reshape(256, 256) # Back to 2D
# Slicing
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr[0, :]) # First row: [1, 2, 3]
print(arr[:, 1]) # Second column: [2, 5, 8]
print(arr[1:, 1:]) # Bottom-right: [[5, 6], [8, 9]]
Matplotlib: Visualization¶
import matplotlib.pyplot as plt
import numpy as np
# Basic line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.title('Sine Wave')
plt.show()
# Scatter plot
x = np.random.randn(100)
y = x + np.random.randn(100) * 0.5
plt.scatter(x, y, alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
# Image display (crucial for astronomy!)
image = np.random.randn(256, 256)
plt.imshow(image, cmap='gray')
plt.colorbar()
plt.title('Random Image')
plt.show()
# Histogram
data = np.random.randn(10000)
plt.hist(data, bins=50, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Count')
plt.title('Distribution')
plt.show()
# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
axes[0, 0].plot(x, y)
axes[0, 1].scatter(x, y)
axes[1, 0].imshow(image, cmap='viridis')
axes[1, 1].hist(data, bins=30)
plt.tight_layout()
plt.show()
Astropy: Handling Astronomical Data¶
from astropy.io import fits
from astropy import units as u
from astropy.coordinates import SkyCoord
import numpy as np
import matplotlib.pyplot as plt
# Reading FITS files (telescope images)
def load_fits_image(filepath):
with fits.open(filepath) as hdul:
# Primary data is usually in index 0 or 1
        hdul.info()  # Prints a summary of the file's contents (returns None)
data = hdul[0].data # The image data
header = hdul[0].header # Metadata
return data, header
# Example usage
# data, header = load_fits_image('my_observation.fits')
# print(f"Image shape: {data.shape}")
# print(f"Object: {header.get('OBJECT', 'Unknown')}")
# print(f"Exposure time: {header.get('EXPTIME', 'Unknown')} seconds")
# Working with coordinates
coord = SkyCoord('10h30m00s', '+45d00m00s', frame='icrs')
print(f"RA: {coord.ra.degree} degrees")
print(f"Dec: {coord.dec.degree} degrees")
# Unit conversions
distance = 100 * u.pc # 100 parsecs
print(f"In light years: {distance.to(u.lyr)}")
print(f"In AU: {distance.to(u.AU)}")
# Displaying astronomical images properly
def display_astronomical_image(data, title='Astronomical Image'):
"""Display with log stretch (common for astronomy)"""
# Handle negative values
data_shifted = data - np.nanmin(data) + 1
# Log stretch
log_data = np.log10(data_shifted)
# Display
plt.figure(figsize=(10, 10))
plt.imshow(log_data, cmap='gray', origin='lower')
plt.colorbar(label='log(counts)')
plt.title(title)
plt.show()
PyTorch Basics¶
PyTorch is a deep learning framework. Here are the essentials:
import torch
import torch.nn as nn
import torch.optim as optim
# Tensors (like numpy arrays, but can run on GPU)
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.zeros(3, 3)
c = torch.randn(100, 100)
# Move to GPU (if available)
if torch.cuda.is_available():
a = a.cuda()
# or
device = torch.device('cuda')
a = a.to(device)
# Convert between numpy and torch
import numpy as np
numpy_array = np.array([1.0, 2.0, 3.0])
torch_tensor = torch.from_numpy(numpy_array)
back_to_numpy = torch_tensor.numpy()
# Automatic differentiation (the magic of PyTorch!)
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x + 1  # y = x² + 3x + 1
y.backward()  # Compute derivative
print(x.grad)  # dy/dx = 2x + 3 = 2(2) + 3 = 7 ✓
Building Your First Neural Network in PyTorch¶
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
# Define the network
class SimpleClassifier(nn.Module):
def __init__(self, input_size, hidden_size, num_classes):
super(SimpleClassifier, self).__init__()
self.network = nn.Sequential(
nn.Linear(input_size, hidden_size),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(hidden_size, hidden_size // 2),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(hidden_size // 2, num_classes)
)
def forward(self, x):
return self.network(x)
# Create synthetic data for demonstration
num_samples = 1000
input_size = 100
num_classes = 5
X = torch.randn(num_samples, input_size)
y = torch.randint(0, num_classes, (num_samples,))
# Split into train/val
train_X, val_X = X[:800], X[800:]
train_y, val_y = y[:800], y[800:]
# Create data loaders
train_dataset = TensorDataset(train_X, train_y)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataset = TensorDataset(val_X, val_y)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
# Initialize model, loss, optimizer
model = SimpleClassifier(input_size=100, hidden_size=256, num_classes=5)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
num_epochs = 20
for epoch in range(num_epochs):
model.train() # Set to training mode
train_loss = 0
for batch_X, batch_y in train_loader:
# Forward pass
outputs = model(batch_X)
loss = criterion(outputs, batch_y)
# Backward pass
optimizer.zero_grad() # Clear previous gradients
loss.backward() # Compute gradients
optimizer.step() # Update weights
train_loss += loss.item()
# Validation
model.eval() # Set to evaluation mode
val_loss = 0
correct = 0
total = 0
with torch.no_grad(): # No gradient computation for validation
for batch_X, batch_y in val_loader:
outputs = model(batch_X)
loss = criterion(outputs, batch_y)
val_loss += loss.item()
_, predicted = torch.max(outputs, 1)
total += batch_y.size(0)
correct += (predicted == batch_y).sum().item()
accuracy = 100 * correct / total
print(f'Epoch [{epoch+1}/{num_epochs}], '
f'Train Loss: {train_loss/len(train_loader):.4f}, '
f'Val Loss: {val_loss/len(val_loader):.4f}, '
f'Val Accuracy: {accuracy:.2f}%')
Building a CNN for Images¶
import torch
import torch.nn as nn
class AstronomyCNN(nn.Module):
def __init__(self, num_classes=5):
super(AstronomyCNN, self).__init__()
# Convolutional layers
self.conv_layers = nn.Sequential(
# Input: 1 channel (grayscale), Output: 32 channels
nn.Conv2d(1, 32, kernel_size=3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(2, 2), # 256 -> 128
nn.Conv2d(32, 64, kernel_size=3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(2, 2), # 128 -> 64
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.MaxPool2d(2, 2), # 64 -> 32
nn.Conv2d(128, 256, kernel_size=3, padding=1),
nn.BatchNorm2d(256),
nn.ReLU(),
nn.MaxPool2d(2, 2), # 32 -> 16
)
# Fully connected layers
self.fc_layers = nn.Sequential(
nn.Flatten(),
nn.Linear(256 * 16 * 16, 512),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(512, 128),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(128, num_classes)
)
def forward(self, x):
x = self.conv_layers(x)
x = self.fc_layers(x)
return x
# Create model
model = AstronomyCNN(num_classes=5)
# Print model summary
print(model)
# Check with dummy input
dummy_input = torch.randn(1, 1, 256, 256) # Batch of 1, 1 channel, 256x256
output = model(dummy_input)
print(f"Output shape: {output.shape}") # Should be [1, 5]
Complete Training Script for Astronomical Image Classification¶
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
from astropy.io import fits
import os
from pathlib import Path
import matplotlib.pyplot as plt
class AstronomyDataset(Dataset):
"""Custom dataset for astronomical images"""
def __init__(self, image_dir, labels_file, transform=None):
"""
Args:
image_dir: Directory with FITS images
labels_file: Text file with "filename,label" per line
transform: Optional transform function
"""
self.image_dir = Path(image_dir)
self.transform = transform
# Load labels
self.samples = []
with open(labels_file, 'r') as f:
for line in f:
filename, label = line.strip().split(',')
self.samples.append((filename, int(label)))
self.classes = ['spiral', 'elliptical', 'irregular', 'merger', 'artifact']
def __len__(self):
return len(self.samples)
def __getitem__(self, idx):
filename, label = self.samples[idx]
# Load FITS image
filepath = self.image_dir / filename
with fits.open(filepath) as hdul:
image = hdul[0].data.astype(np.float32)
# Preprocessing
image = self.preprocess(image)
# Apply transforms if any
if self.transform:
image = self.transform(image)
# Convert to tensor
image = torch.from_numpy(image).unsqueeze(0) # Add channel dimension
return image, label
def preprocess(self, image):
"""Standard preprocessing for astronomical images"""
# Handle NaN values
image = np.nan_to_num(image, nan=0.0)
# Clip extreme values (cosmic rays, bad pixels)
p1, p99 = np.percentile(image, [1, 99])
image = np.clip(image, p1, p99)
# Log stretch (handles large dynamic range)
image = image - image.min() + 1
image = np.log(image)
# Normalize to 0-1
image = (image - image.min()) / (image.max() - image.min() + 1e-8)
return image
def train_model(model, train_loader, val_loader, num_epochs=50,
learning_rate=0.001, device='cuda'):
"""Complete training function with bells and whistles"""
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
optimizer, mode='max', factor=0.5, patience=5
)
best_accuracy = 0
history = {'train_loss': [], 'val_loss': [], 'val_accuracy': []}
for epoch in range(num_epochs):
# Training phase
model.train()
train_loss = 0
for images, labels in train_loader:
images, labels = images.to(device), labels.to(device)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
train_loss += loss.item()
# Validation phase
model.eval()
val_loss = 0
correct = 0
total = 0
with torch.no_grad():
for images, labels in val_loader:
images, labels = images.to(device), labels.to(device)
outputs = model(images)
loss = criterion(outputs, labels)
val_loss += loss.item()
_, predicted = torch.max(outputs, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
accuracy = 100 * correct / total
avg_train_loss = train_loss / len(train_loader)
avg_val_loss = val_loss / len(val_loader)
# Update scheduler
scheduler.step(accuracy)
# Save history
history['train_loss'].append(avg_train_loss)
history['val_loss'].append(avg_val_loss)
history['val_accuracy'].append(accuracy)
# Save best model
if accuracy > best_accuracy:
best_accuracy = accuracy
torch.save(model.state_dict(), 'best_model.pt')
print(f'Epoch [{epoch+1}/{num_epochs}] '
f'Train Loss: {avg_train_loss:.4f} '
f'Val Loss: {avg_val_loss:.4f} '
f'Val Acc: {accuracy:.2f}% '
f'(Best: {best_accuracy:.2f}%)')
return history
def plot_training_history(history):
"""Visualize training progress"""
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Loss plot
ax1.plot(history['train_loss'], label='Train Loss')
ax1.plot(history['val_loss'], label='Val Loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training and Validation Loss')
ax1.legend()
# Accuracy plot
ax2.plot(history['val_accuracy'], label='Val Accuracy', color='green')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy (%)')
ax2.set_title('Validation Accuracy')
ax2.legend()
plt.tight_layout()
plt.savefig('training_history.png')
plt.show()
# Example usage (you'd replace with your actual data):
if __name__ == '__main__':
# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Create model
model = AstronomyCNN(num_classes=5)
# For demonstration, create random data
# In practice, you'd use AstronomyDataset with real data
train_X = torch.randn(800, 1, 256, 256)
train_y = torch.randint(0, 5, (800,))
val_X = torch.randn(200, 1, 256, 256)
val_y = torch.randint(0, 5, (200,))
train_dataset = torch.utils.data.TensorDataset(train_X, train_y)
val_dataset = torch.utils.data.TensorDataset(val_X, val_y)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
# Train
history = train_model(model, train_loader, val_loader,
num_epochs=20, device=device)
# Plot results
plot_training_history(history)
Chapter 8: Your First Complete Project¶
Let's build something real: an image quality classifier for your telescope.
Project: Automatic Image Quality Assessment¶
Goal: Given a raw telescope frame, predict quality (good/medium/bad) automatically.
Step 1: Data Collection¶
First, manually classify some of your existing images:
import os
import shutil
from pathlib import Path
# Create directory structure
for quality in ['good', 'medium', 'bad']:
Path(f'training_data/{quality}').mkdir(parents=True, exist_ok=True)
print("""
Manual Classification Guide:
- GOOD: Clear stars, low background, good focus
- MEDIUM: Some clouds, slightly out of focus, minor issues
- BAD: Heavy clouds, tracking errors, severe artifacts
Move or copy your FITS files into the appropriate folders.
Aim for at least 100 images per category.
""")
Step 2: Data Preparation¶
import numpy as np
from astropy.io import fits
from pathlib import Path
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
class QualityDataset(Dataset):
def __init__(self, filepaths, labels, image_size=128):
self.filepaths = filepaths
self.labels = labels
self.image_size = image_size
def __len__(self):
return len(self.filepaths)
def __getitem__(self, idx):
# Load image
with fits.open(self.filepaths[idx]) as hdul:
image = hdul[0].data.astype(np.float32)
# Resize to consistent size
from scipy.ndimage import zoom
zoom_factor = self.image_size / max(image.shape)
image = zoom(image, zoom_factor)
# Pad to exact size if needed
if image.shape[0] < self.image_size:
pad = self.image_size - image.shape[0]
image = np.pad(image, ((0, pad), (0, 0)))
if image.shape[1] < self.image_size:
pad = self.image_size - image.shape[1]
image = np.pad(image, ((0, 0), (0, pad)))
# Crop to exact size
image = image[:self.image_size, :self.image_size]
# Normalize
image = np.nan_to_num(image, nan=0)
p1, p99 = np.percentile(image, [1, 99])
image = np.clip(image, p1, p99)
image = (image - image.min()) / (image.max() - image.min() + 1e-8)
# To tensor
image = torch.from_numpy(image).unsqueeze(0)
return image, self.labels[idx]
def prepare_data(data_dir='training_data'):
"""Load data from organized folders"""
filepaths = []
labels = []
label_map = {'good': 0, 'medium': 1, 'bad': 2}
for quality, label in label_map.items():
folder = Path(data_dir) / quality
for filepath in folder.glob('*.fits'):
filepaths.append(str(filepath))
labels.append(label)
# Split into train/val/test
train_files, temp_files, train_labels, temp_labels = train_test_split(
filepaths, labels, test_size=0.3, stratify=labels, random_state=42
)
val_files, test_files, val_labels, test_labels = train_test_split(
temp_files, temp_labels, test_size=0.5, stratify=temp_labels, random_state=42
)
print(f"Training samples: {len(train_files)}")
print(f"Validation samples: {len(val_files)}")
print(f"Test samples: {len(test_files)}")
return (
(train_files, train_labels),
(val_files, val_labels),
(test_files, test_labels)
)
Step 3: Model Definition¶
import torch.nn as nn
class QualityClassifier(nn.Module):
"""Lightweight CNN for image quality assessment"""
def __init__(self, num_classes=3):
super().__init__()
self.features = nn.Sequential(
# Block 1: 128 -> 64
nn.Conv2d(1, 16, 3, padding=1),
nn.BatchNorm2d(16),
nn.ReLU(),
nn.MaxPool2d(2),
# Block 2: 64 -> 32
nn.Conv2d(16, 32, 3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(2),
# Block 3: 32 -> 16
nn.Conv2d(32, 64, 3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(2),
# Block 4: 16 -> 8
nn.Conv2d(64, 128, 3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.MaxPool2d(2),
)
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(128 * 8 * 8, 256),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(256, num_classes)
)
def forward(self, x):
x = self.features(x)
x = self.classifier(x)
return x
Step 4: Training Script¶
def train_quality_model():
# Configuration
BATCH_SIZE = 16
LEARNING_RATE = 0.001
NUM_EPOCHS = 30
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Prepare data
(train_files, train_labels), (val_files, val_labels), _ = prepare_data()
train_dataset = QualityDataset(train_files, train_labels)
val_dataset = QualityDataset(val_files, val_labels)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
shuffle=True, num_workers=2)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE,
shuffle=False, num_workers=2)
# Initialize model
model = QualityClassifier(num_classes=3).to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
# Training loop
best_accuracy = 0
for epoch in range(NUM_EPOCHS):
# Train
model.train()
train_loss = 0
for images, labels in train_loader:
images, labels = images.to(DEVICE), labels.to(DEVICE)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
train_loss += loss.item()
# Validate
model.eval()
correct = 0
total = 0
with torch.no_grad():
for images, labels in val_loader:
images, labels = images.to(DEVICE), labels.to(DEVICE)
outputs = model(images)
_, predicted = torch.max(outputs, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
accuracy = 100 * correct / total
print(f'Epoch [{epoch+1}/{NUM_EPOCHS}] '
f'Loss: {train_loss/len(train_loader):.4f} '
f'Accuracy: {accuracy:.1f}%')
# Save best model
if accuracy > best_accuracy:
best_accuracy = accuracy
torch.save({
'model_state': model.state_dict(),
'accuracy': accuracy,
'epoch': epoch
}, 'quality_classifier_best.pt')
print(f"\nTraining complete! Best accuracy: {best_accuracy:.1f}%")
return model
Step 5: Deployment for Real-Time Use¶
class RealTimeQualityChecker:
"""Deploy the trained model for real-time quality assessment"""
def __init__(self, model_path='quality_classifier_best.pt'):
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load model
self.model = QualityClassifier(num_classes=3)
checkpoint = torch.load(model_path, map_location=self.device)
self.model.load_state_dict(checkpoint['model_state'])
self.model.to(self.device)
self.model.eval()
self.classes = ['good', 'medium', 'bad']
def preprocess(self, image):
"""Preprocess a raw numpy image"""
        from scipy.ndimage import zoom
        # Resize so the longest side is 128 px, pad the short side, crop to 128x128
        # (without the padding, non-square frames would come out smaller than 128x128)
        zoom_factor = 128 / max(image.shape)
        image = zoom(image.astype(np.float32), zoom_factor)
        pad_h = max(0, 128 - image.shape[0])
        pad_w = max(0, 128 - image.shape[1])
        image = np.pad(image, ((0, pad_h), (0, pad_w)))
        image = image[:128, :128]
# Normalize
image = np.nan_to_num(image, nan=0)
p1, p99 = np.percentile(image, [1, 99])
image = np.clip(image, p1, p99)
image = (image - image.min()) / (image.max() - image.min() + 1e-8)
# To tensor
tensor = torch.from_numpy(image).unsqueeze(0).unsqueeze(0)
return tensor.to(self.device)
def assess(self, image):
"""
Assess image quality
Args:
image: numpy array (raw telescope image)
Returns:
dict with quality label and confidence
"""
tensor = self.preprocess(image)
with torch.no_grad():
outputs = self.model(tensor)
probabilities = torch.softmax(outputs, dim=1)[0]
predicted_class = torch.argmax(probabilities).item()
return {
'quality': self.classes[predicted_class],
'confidence': probabilities[predicted_class].item(),
'all_probabilities': {
cls: prob.item()
for cls, prob in zip(self.classes, probabilities)
}
}
def assess_file(self, filepath):
"""Assess quality of a FITS file"""
with fits.open(filepath) as hdul:
image = hdul[0].data
return self.assess(image)
# Usage example:
if __name__ == '__main__':
checker = RealTimeQualityChecker('quality_classifier_best.pt')
# Assess a single file
result = checker.assess_file('new_observation.fits')
print(f"Quality: {result['quality']} ({result['confidence']:.1%} confident)")
# In a real-time loop
def process_new_frame(filepath):
result = checker.assess_file(filepath)
if result['quality'] == 'bad':
            print(f"⚠️ Bad frame detected: {filepath}")
# Could trigger alert or stop observation
elif result['quality'] == 'medium':
print(f"β‘ Medium quality: {filepath}")
# Continue but flag for review
else:
print(f"β Good frame: {filepath}")
# Proceed normally
return result
Chapter 9: Next Steps and Resources¶
Your Learning Path¶
Week 1-2: Python fundamentals
- Complete a Python tutorial (Codecademy, Python.org tutorial)
- Practice with NumPy and Matplotlib
- Load and visualize your telescope images
Week 3-4: Machine learning concepts
- Take Andrew Ng's ML course on Coursera (free to audit)
- Implement simple models with scikit-learn
- Understand training/validation/testing
Week 5-6: Deep learning basics
- Work through Fast.ai course (free, practical)
- Build your first CNN in PyTorch
- Train on your own data
Week 7-8: Your first real project
- Implement the quality classifier above
- Collect and label your data
- Train, validate, deploy
Month 2+: Advanced topics
- Time-series analysis for transient detection
- Multi-site coordination systems
- Real-time processing pipelines
Essential Resources¶
Books:
- "Python for Astronomers" (free online)
- "Deep Learning" by Goodfellow (the bible, free online)
- "Hands-On Machine Learning" by GΓ©ron
Courses:
- Fast.ai (practical deep learning)
- Coursera: Andrew Ng's courses
- DeepLearning.AI specializations
Astronomy-specific:
- AstroML documentation
- Astropy tutorials
- AAS astronomy + ML workshops
Communities:
- Stack Overflow (coding help)
- Cross Validated (ML theory)
- r/MachineLearning (Reddit)
- Astropy Slack/Discord
Hardware Recommendations¶
For learning (budget):
- Any modern laptop with 8GB+ RAM
- Use Google Colab for free GPU access
For development (intermediate):
- Desktop with NVIDIA GPU (RTX 3060 or better)
- 32GB+ RAM
- Fast SSD storage
For production (your telescope array):
- Edge devices: NVIDIA Jetson at each site
- Central server: Multiple GPUs for training
- Cloud backup for burst computing
Final Thoughts¶
You now have a complete foundation in AI/ML for astronomy. The key principles:
- ML finds patterns in data: nothing magical
- Neural networks learn hierarchically: simple to complex features
- Training requires iteration: thousands of updates to find good parameters
- Data quality matters more than model complexity: garbage in, garbage out
- Start simple, add complexity as needed: don't over-engineer initially
For your distributed telescope array, the path is:
- Build local quality assessment (the project above)
- Add transient detection at each site
- Implement cross-site coordination
- Develop data fusion capabilities
- Create autonomous discovery systems
Each step builds on the previous. You don't need to understand everything at once; learn what you need for each stage.
Would you like me to elaborate on any specific section? I can provide:
- More code examples for specific tasks
- Deeper mathematical explanations
- Hardware setup guides
- Specific astronomy ML techniques
- Step-by-step project walkthroughs
Complete Deep Dive: ML for Your Distributed Telescope Array¶
Part 1: How This Applies Specifically to Your Project¶
Your Unique Situation¶
You're not building a single telescope; you're building a network. This fundamentally changes everything about how ML applies to your project. Let me explain why this is both harder and more powerful than single-telescope ML.
The Distributed Data Problem¶
When you have telescopes in different locations, you face challenges that single observatories never encounter:
Heterogeneous Conditions: Your telescope in India sees through different atmosphere than your telescope in Chile. Humidity in one location, dust in another, light pollution patterns unique to each site. A galaxy image from Site A looks subtly different from the same galaxy imaged at Site B, even with identical equipment.
Temporal Asynchrony: It's daytime somewhere while it's nighttime elsewhere. Your network is always partially active, partially sleeping. Events happen when only some telescopes can see them. Coordinating observations across time zones means predicting conditions hours in advance.
Communication Latency: Data from a remote site might take seconds or minutes to reach your central system. In those seconds, a transient event could fade. ML must make local decisions fast while still benefiting from global coordination.
Calibration Drift: Each telescope drifts differently over time. Mirrors get dusty, sensors age, tracking develops quirks. What was perfectly calibrated last month might be slightly off now, and differently off at each site.
How ML Specifically Addresses Your Challenges¶
Learning Site-Specific Characteristics: Rather than manually characterizing each site, ML learns automatically. Feed it data from each telescope along with quality assessments, and it learns that Site A produces slightly bluer images, Site B has periodic vibration from nearby traffic, Site C gets dew formation around 3 AM local time. This knowledge is encoded in the model's parameters; no explicit rules are needed.
Predictive Coordination: ML can learn patterns invisible to humans. Perhaps observations from Sites A and C together, taken within 30 minutes of each other, produce better combined data than A and B together. Maybe certain atmospheric conditions at one site predict what conditions will be at another site two hours later. These correlations exist in your data; ML finds them.
Adaptive Resource Allocation: Your network has finite resources: observation time, storage, bandwidth, and human attention. ML learns to allocate these optimally. When something interesting happens, which telescopes should respond? How should you balance survey observations against transient follow-up? ML can learn policies that maximize scientific output.
Unified Understanding from Diverse Data: The holy grail for your project is combining observations from multiple sites into something greater than any single observation. ML models can learn optimal combination strategies that account for each site's quirks, each observation's quality, and the physics of what you're observing.
The Mathematics Behind Your Specific Needs¶
Let me walk you through the actual math that makes this work for distributed telescope networks.
Multi-Site Calibration: Transfer Learning Mathematics¶
When you train a model on data from Site A, then want it to work at Site B, you're doing transfer learning. Here's how the math works:
Imagine each image can be described by two components: the underlying astronomical signal S, and site-specific effects E. For Site A:
Image_A = S + E_A + noise
For Site B:
Image_B = S + E_B + noise
The astronomical signal S is the same (it's the same object), but E_A and E_B differ. A naive model trained on Site A learns to recognize S + E_A as a unit. It fails at Site B because it's looking for E_A characteristics that aren't there.
Transfer learning separates these. The mathematics involves training the model's early layers (which learn generic features like edges and shapes) to be site-independent, while allowing later layers to adapt. Formally, you minimize a loss function that includes both prediction accuracy and a penalty for how different the learned representations are between sites:
Total Loss = Prediction Error + λ × Domain Difference
The domain difference term forces the model to find representations that work across sites. The λ parameter controls how much you care about cross-site consistency versus raw accuracy.
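To make this concrete, here is a pure-Python toy of the combined loss. The function name is invented, and the squared difference of mean feature activations is a crude stand-in for the more sophisticated domain-difference measures (such as maximum mean discrepancy) used in real transfer learning systems:

```python
def domain_adaptation_loss(pred_errors, feats_a, feats_b, lam=0.1):
    """Toy total loss: mean prediction error plus lambda times the squared
    difference of mean feature activations between the two sites."""
    prediction_error = sum(pred_errors) / len(pred_errors)
    mean_a = sum(feats_a) / len(feats_a)
    mean_b = sum(feats_b) / len(feats_b)
    domain_difference = (mean_a - mean_b) ** 2
    return prediction_error + lam * domain_difference

# Matching mean activations at both sites: no domain penalty
base = domain_adaptation_loss([0.2, 0.4], [1.0, 2.0], [1.5, 1.5])
# Shifted activations at Site B: the penalty pushes the total loss up
shifted = domain_adaptation_loss([0.2, 0.4], [1.0, 2.0], [3.0, 3.0])
print(base, shifted)
```

Increasing `lam` makes the training care more about cross-site agreement, exactly as the λ knob in the formula above.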
Data Fusion: Optimal Combination Theory¶
When combining observations from multiple telescopes, you want to weight each contribution appropriately. The mathematically optimal combination minimizes total uncertainty.
If Telescope 1 measures a value with uncertainty σ₁, and Telescope 2 measures with uncertainty σ₂, the optimal combined estimate is:
Combined = (value₁/σ₁² + value₂/σ₂²) / (1/σ₁² + 1/σ₂²)
This is inverse-variance weighting: better measurements (smaller σ) contribute more.
But in reality, your uncertainties aren't simple numbers. They're complex functions of atmospheric conditions, telescope state, target properties, and inter-site correlations. ML learns this uncertainty structure from data. It implicitly estimates these complex σ values and performs near-optimal combination.
The neural network is learning a function:
Combined_Image = f(Image_A, Image_B, Image_C, Metadata_A, Metadata_B, Metadata_C)
Where f is a highly nonlinear function with millions of parameters, trained to produce combined images that match what expert analysis would produce.
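Before trusting a learned fusion network, it helps to see the baseline it must beat. Here is a minimal sketch of the inverse-variance formula, with invented measurement values:

```python
def inverse_variance_combine(values, sigmas):
    """Combine measurements, weighting each by 1/sigma^2."""
    weights = [1.0 / s ** 2 for s in sigmas]
    combined = sum(v * w for v, w in zip(values, weights)) / sum(weights)
    # The combined uncertainty is smaller than any single measurement's
    combined_sigma = (1.0 / sum(weights)) ** 0.5
    return combined, combined_sigma

# Telescope 1: flux 10.0 +/- 0.5; Telescope 2: flux 10.6 +/- 1.0
value, sigma = inverse_variance_combine([10.0, 10.6], [0.5, 1.0])
print(value, sigma)  # the result sits closer to the more precise measurement
```

The ML fusion network effectively learns to produce these weights from raw images and metadata instead of from hand-supplied σ values.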
Scheduling: Reinforcement Learning Mathematics¶
Deciding which telescope observes what, and when, is a sequential decision problem. The mathematics come from reinforcement learning.
You have a state representing current conditions: weather at each site, queue of targets, recent observation quality, predicted satellite passages, current calibration status, and more.
You take actions: assign target X to telescope Y for Z minutes.
You receive rewards: scientific value of resulting observation, minus costs (slew time, missed opportunities elsewhere).
The goal is to learn a policy (a function mapping states to actions) that maximizes total reward over time.
The mathematics involve the Bellman equation, which describes optimal decision-making:
V(state) = max over all actions of [immediate_reward + γ × V(next_state)]
V(state) is the "value" of being in a particular state: how much total future reward you can expect. The parameter γ (gamma) discounts future rewards (a reward now is worth more than the same reward later).
This equation seems circular (V depends on V), but it can be solved iteratively. Start with a random guess for V, apply the equation repeatedly, and it converges to the true optimal values. Then your policy is just: from any state, take the action that leads to the highest-value next state.
For your telescope network, the state space is enormous. You can't enumerate all possible states. Neural networks approximate V(state), learning to estimate values for any state they encounter. This is deep reinforcement learning.
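The iterative convergence is easy to see in a toy world. This sketch uses an invented one-action corridor (so the "max over actions" is trivial); the point is that repeatedly applying the Bellman update drives V to its fixed point:

```python
GAMMA = 0.9  # discount factor

def value_iteration(n_states=4, sweeps=100):
    """Toy corridor MDP: states 0..3, one action ('move right') per state,
    reward 1 on entering the terminal state. Repeated Bellman updates
    converge to the optimal values."""
    V = [0.0] * n_states
    for _ in range(sweeps):
        for s in range(n_states - 1):  # the last state is terminal, V stays 0
            reward = 1.0 if s + 1 == n_states - 1 else 0.0
            V[s] = reward + GAMMA * V[s + 1]  # Bellman update
    return V

values = value_iteration()
print(values)  # value decays by a factor of gamma per step away from the goal
```

With only one action the update is a plain fixed-point iteration; with many actions you take the max over them at each state, and with huge state spaces you replace the table V with a neural network, which is deep reinforcement learning.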
Anomaly Detection: Statistical Learning Theory¶
Finding unusual objects requires understanding what "usual" looks like. The mathematics here involve probability density estimation.
Given training data of normal observations, you're estimating the probability distribution P(x) over possible observations. An anomaly is something with very low probability: P(x_anomaly) << typical P(x).
Autoencoders approach this indirectly. They learn to compress and reconstruct normal data. The reconstruction error for any input tells you how "unusual" it is:
Anomaly_Score(x) = ||x - Reconstruct(x)||²
If the model can reconstruct x well, it's similar to training data (normal). If reconstruction is poor, it's unlike anything the model has seen (potentially anomalous).
The mathematical guarantee comes from information theory: autoencoders learn efficient codes for the training distribution. Data from outside this distribution can't be efficiently coded, so reconstruction suffers.
For your telescope network, this is powerful. Train on normal observations from all sites. The model learns what normal looks like across your whole network. When something genuinely unusual appears (a new type of transient, an equipment failure mode never seen before, an atmospheric phenomenon unique to one site), the anomaly score spikes.
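The compress-reconstruct-score idea can be demonstrated without a neural network at all. This sketch uses a linear stand-in for an autoencoder: power iteration finds the principal direction of the training data, "encoding" is projection onto that direction, "decoding" projects back, and the anomaly score is the squared reconstruction residual. All names and data here are illustrative:

```python
def anomaly_scores(train, queries, iters=200):
    """Reconstruction-error scoring with a linear stand-in for an autoencoder."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    centered = [(x - mx, y - my) for x, y in train]

    # Power iteration: converges to the top eigenvector of the covariance,
    # i.e. the direction a one-unit linear autoencoder would learn
    vx, vy = 1.0, 0.5
    for _ in range(iters):
        dots = [cx * vx + cy * vy for cx, cy in centered]
        nx = sum(d * cx for d, (cx, cy) in zip(dots, centered)) / n
        ny = sum(d * cy for d, (cx, cy) in zip(dots, centered)) / n
        norm = (nx * nx + ny * ny) ** 0.5
        vx, vy = nx / norm, ny / norm

    scores = []
    for x, y in queries:
        cx, cy = x - mx, y - my
        proj = cx * vx + cy * vy       # "encode" to one number
        rx, ry = proj * vx, proj * vy  # "decode" back to 2-D
        scores.append((cx - rx) ** 2 + (cy - ry) ** 2)
    return scores

# "Normal" data lies near the line y = 2x; the off-line point scores far higher
train = [(t, 2 * t + 0.01 * ((-1) ** i)) for i, t in enumerate(range(-5, 6))]
normal_score, odd_score = anomaly_scores(train, [(2.0, 4.0), (2.0, -4.0)])
print(normal_score, odd_score)
```

A real autoencoder does the same thing nonlinearly and in thousands of dimensions, but the failure mode is identical: anything the learned code cannot represent reconstructs badly and scores high.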
Hardware Requirements for Your Specific Scale¶
Let me be concrete about what hardware your distributed telescope project actually needs.
At Each Telescope Site¶
Edge Computing Unit: You need local ML inference capability. This means:
For a small site (single telescope, basic automation):
- NVIDIA Jetson Nano or Orin Nano
- 4-8 GB unified memory
- Power consumption: 10-15 watts
- Cost: $200-500
- Capabilities: Real-time quality assessment, basic transient detection, image preprocessing
For a medium site (multiple instruments, more sophisticated local processing):
- NVIDIA Jetson AGX Xavier or Orin
- 32-64 GB unified memory
- Power consumption: 30-60 watts
- Cost: $700-2000
- Capabilities: Full local ML pipeline, preliminary data fusion, complex anomaly detection
For a major site (significant local autonomy required):
- Compact server with NVIDIA RTX 4080/4090 or A4000
- 64+ GB system RAM
- Dedicated storage array
- Power consumption: 300-500 watts
- Cost: $3000-8000
- Capabilities: Can operate fully autonomously, train local models, handle complete scientific analysis
Storage: Raw astronomical data accumulates fast. A single night might generate 50-200 GB depending on your instruments. You need:
- Fast SSD for working data (1-4 TB)
- Larger HDD or SSD array for local archive (10-50 TB)
- Fast network interface for uploads (1+ Gbps ideal)
Environmental Considerations: Edge devices at telescope sites face real challenges: temperature swings, humidity, power fluctuations. You need:
- Proper enclosure (temperature-controlled if extreme climate)
- Uninterruptible power supply
- Remote management capability (you can't physically visit every site easily)
Central Coordination System¶
This is where the heavy computation happens: training models, combining data from all sites, running complex analyses.
For a network of 3-5 small telescopes:
- Workstation with 1-2 NVIDIA RTX 4090 GPUs
- 128 GB RAM
- Fast storage: 10+ TB NVMe SSD
- Archive storage: 100+ TB
- Cost: $10,000-20,000
For a network of 5-15 telescopes with serious ambitions:
- Small server cluster or cloud resources
- 4-8 high-end GPUs (RTX 4090, A6000, or equivalent)
- 256-512 GB RAM per node
- Fast interconnect between GPUs
- Petabyte-scale storage
- Cost: $50,000-150,000 (or equivalent cloud spend)
For a large network approaching professional scale:
- HPC cluster or significant cloud allocation
- Dozens of GPUs for parallel training
- Multiple petabytes of storage
- Dedicated networking infrastructure
- Cost: $500,000+ (or major cloud commitment)
Network Infrastructure¶
Your system is only as good as its connectivity:
Bandwidth: Each site needs reliable upload capability. Assuming you want to transfer reduced data (not raw) in near-real-time:
- Minimum: 10 Mbps sustained upload per site
- Comfortable: 100 Mbps sustained upload per site
- Ideal: 1 Gbps (allows raw data transfer if needed)
Latency: For real-time coordination (transient response), latency matters:
- Acceptable: 200-500ms round-trip to central system
- Good: 50-200ms
- Excellent: <50ms
Reliability: Telescopes often sit in remote locations. Network failures happen. Your system needs:
- Local buffering for network outages
- Graceful degradation (sites continue operating independently)
- Automatic reconnection and synchronization
Compute Requirements by Task¶
Different ML tasks have different requirements:
Real-time quality assessment: Very lightweight. A Jetson Nano can run this at 10+ frames per second. Must run locally at each site.
Transient detection: Moderate requirements. Needs to process each frame in less time than the exposure time. For typical 30-60 second exposures, even modest edge hardware is sufficient.
Scheduling optimization: Can be computationally intensive but isn't time-critical. Run on central system, update schedules every few minutes.
Data fusion: Moderately intensive. Combining data from multiple sites requires having all that data in one place and processing it. Central system task.
Model training: By far the most intensive. Training new models or retraining existing ones requires serious GPU power. Plan for multi-hour to multi-day training runs. Can be batched during low-activity periods.
Anomaly detection for discovery: Variable intensity. Simple methods run in real-time. Sophisticated searches over historical data require substantial computation. Balance between always-running lightweight detection and periodic deep searches.
Part 2: ML System for Task Assignment and Observation Creation¶
The Complete Task Assignment System¶
Let me design a comprehensive ML system that handles both assigning existing tasks to telescopes and creating new observation tasks automatically.
Understanding the Problem Space¶
Your task assignment system must juggle competing demands:
Scientific Priorities: Different observations have different value. A follow-up of a confirmed gravitational wave counterpart might be worth 100 times more than a routine survey field. But value isn't fixed; it depends on what's already been observed, what other facilities are doing, and how the target is evolving.
Physical Constraints: Each telescope can only point at part of the sky at any moment. Targets rise and set. Weather changes. Instruments need calibration. Slewing takes time. These constraints are hard; violating them produces zero useful data.
Resource Optimization: Observation time is precious. Every minute spent on a lower-value target is a minute not spent on something better. But you can't always know when something better will appear. Balance exploitation (observe known-good targets) with exploration (survey for unknowns).
Coordination: Multiple telescopes can work together or independently. Some observations benefit from simultaneous multi-site coverage. Others are better done sequentially across sites. The system must know when coordination helps and when it's unnecessary overhead.
Architecture of the Task Assignment ML System¶
The system has several interconnected components:
Component 1: The State Representation Module¶
Before the ML can make decisions, it needs to understand the current state of your entire network. This module maintains a real-time representation including:
Environmental State: For each site, current and predicted conditions (cloud cover, seeing, humidity, wind, moon position and phase, twilight status). This comes from local sensors, weather services, and historical patterns.
Equipment State: Telescope pointing, current filter/instrument configuration, time since last calibration, known issues or limitations, thermal status (some instruments need cooling time after changes).
Queue State: All pending observation requests with their priorities, time constraints, progress so far, and dependencies on other observations.
Historical Context: What has been observed recently? What patterns has the system learned about success rates for different target/site/condition combinations?
External Information: Are there active alerts from gravitational wave detectors, gamma-ray satellites, or other facilities? What are other telescopes doing (from public streams)?
This state representation is updated continuously: some elements every second, others every few minutes.
Component 2: The Value Estimation Network¶
This neural network takes the state representation and, for any proposed observation, estimates its expected scientific value.
The network architecture combines several types of information:
Target Features: Position, brightness, type, variability history, time since last observation, relationship to other targets.
Observation Features: Proposed telescope, exposure time, filters, timing.
Context Features: Current conditions, competing demands, external alerts.
The output is a scalar value estimate plus uncertainty bounds. High uncertainty might mean the system needs more information before committing.
Training this network requires historical data with value labels. You can derive these from:
- Expert assessments of past observations
- Publication outcomes (did this observation lead to science?)
- Detection metrics (did we find what we were looking for?)
- Data quality achieved versus predicted
The network learns to integrate all these factors into a unified value estimate. It might learn that observing a certain type of target at Site B when humidity exceeds 70% has low expected value, even though individually those factors seem fine.
Component 3: The Constraint Satisfaction Engine¶
Not every observation is physically possible. This component evaluates hard constraints:
Visibility: Can the telescope actually see this target now? This involves coordinate transformations, horizon modeling, and obstruction maps.
Timing: Does the observation fit in available time? Account for slew time, setup, and required duration.
Instrument Compatibility: Is the right instrument available? Does the target require filters or modes that this telescope supports?
Exclusive Resources: Some operations can't happen simultaneously: you can't observe two targets at once, can't calibrate while observing, can't change filters mid-exposure.
This component doesn't use ML; it's hard logic. But it interfaces with the ML components to filter impossible options before the system wastes computation evaluating them.
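A minimal sketch of such a hard-constraint filter, with invented field names and a precomputed altitude standing in for real coordinate transformations and horizon models:

```python
from dataclasses import dataclass

@dataclass
class Request:
    target: str
    min_altitude_deg: float  # simplistic visibility proxy
    needed_filter: str
    duration_min: float

@dataclass
class Telescope:
    name: str
    current_altitude_of: dict  # target -> altitude now (computed elsewhere)
    filters: set
    free_minutes: float

def feasible(req, scope):
    """Hard constraint check: visibility, instrument, and time.
    No ML here; infeasible options never reach the value network."""
    alt = scope.current_altitude_of.get(req.target, -90.0)
    return (alt >= req.min_altitude_deg
            and req.needed_filter in scope.filters
            and req.duration_min <= scope.free_minutes)

scope = Telescope('site_a', {'M31': 55.0, 'M51': 12.0}, {'g', 'r'}, 45.0)
queue = [Request('M31', 30.0, 'r', 20.0),  # visible, filter available, fits
         Request('M51', 30.0, 'r', 20.0),  # below the altitude limit
         Request('M31', 30.0, 'z', 20.0)]  # filter not available here
print([feasible(q, scope) for q in queue])
```

In the real system the altitude would come from astronomical ephemeris calculations, but the structure (a cheap boolean gate ahead of the expensive ML scoring) is the point.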
Component 4: The Policy Network¶
This is the core decision-making network. Given the current state and value estimates for all options, it selects actions.
The architecture is a combination of:
Attention Mechanisms: The network can "focus" on the most relevant parts of the state. When responding to a transient alert, it attends strongly to the alert information and capable sites, largely ignoring routine queue items.
Recurrent Components: The network maintains memory of recent decisions. This prevents thrashing (constantly switching between options) and enables multi-step planning.
Multi-Head Output: The network produces decisions for multiple aspects simultaneously: which target, which telescope, what configuration, how long.
The policy network is trained using reinforcement learning. It tries different decisions, observes outcomes, and adjusts to improve over time. The reward signal combines:
- Scientific value of observations obtained
- Efficiency metrics (minimal wasted time)
- Responsiveness (fast reaction to alerts)
- Fairness (different science programs get appropriate time)
Component 5: The Observation Generator¶
This component creates new observation tasks automatically. It's not just assigning existing requests; it's inventing new ones.
Survey Field Selection: For survey operations, the generator proposes fields to observe based on:
- Coverage requirements (what hasn't been observed yet?)
- Scientific priorities for different regions
- Current conditions (which fields are optimally positioned?)
- Expected discovery yield per field
Follow-Up Proposals: When something interesting is detected, the generator creates appropriate follow-up observations:
- Same target, different filters (for color information)
- Same target, later time (for variability)
- Nearby targets (for context)
- Different site (for confirmation)
Calibration Scheduling: The generator monitors data quality and schedules calibrations when needed:
- Regular flats and darks
- Focus checks
- Pointing model updates
- Photometric standard observations
Opportunistic Observations: When primary programs can't observe (weather, equipment issues), the generator proposes useful alternatives:
- Shorter exposures of bright targets
- Engineering tests
- Calibration catch-up
- Low-priority but useful survey work
The Decision Flow¶
Here's how these components work together in real-time:
Continuous Monitoring Phase: State representation is constantly updated. Value estimation network runs in background on high-priority queue items. Constraint engine maintains pre-computed visibility windows.
Decision Point Trigger: When a decision is needed (current observation ending, alert received, conditions changed significantly), the policy network activates.
Option Generation: The observation generator proposes candidates, both from the existing queue and newly created ones. The constraint engine filters them to feasible options.
Value Assessment: The value estimation network scores all feasible options. Scores reflect expected scientific return given current conditions.
Policy Execution: The policy network selects from scored options, considering not just current value but strategic factors (don't neglect long-term programs for short-term gains).
Action Implementation: Commands go to the appropriate telescope. Monitoring continues.
Outcome Observation: When the observation completes, results feed back into training data. Did prediction match reality? What was actual scientific value?
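The phases above can be sketched as one decision cycle, with stand-in callables for each component; everything here is illustrative, and the real system would run these steps continuously against live telescope state:

```python
def decision_cycle(state, queue, generate, is_feasible, value_of, choose):
    """One pass of the decision flow: generate options, filter by hard
    constraints, score by estimated value, then let the policy pick."""
    candidates = queue + generate(state)                   # option generation
    viable = [c for c in candidates if is_feasible(state, c)]
    scored = [(value_of(state, c), c) for c in viable]     # value assessment
    return choose(state, scored)                           # policy execution

# Toy stand-ins: candidates are (target, priority) pairs
state = {'cloudy_targets': {'M51'}}
queue = [('M31', 5), ('M51', 9)]
generate = lambda s: [('calibration', 1)]                  # observation generator
is_feasible = lambda s, c: c[0] not in s['cloudy_targets'] # constraint engine
value_of = lambda s, c: c[1]                               # value network stand-in
choose = lambda s, scored: max(scored)[1] if scored else None
print(decision_cycle(state, queue, generate, is_feasible, value_of, choose))
```

Note that the highest-priority target (M51) loses because it fails a hard constraint: the filter runs before any scoring, exactly as described above.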
Learning and Adaptation¶
The system improves over time through several mechanisms:
Online Learning: Every observation outcome provides training data. The value estimation network continuously refines its predictions. The policy network adjusts its strategies.
Periodic Retraining: Deep retraining happens offline, using accumulated data. This catches slow drifts and discovers new patterns.
Transfer Learning: Insights from one site transfer to others. If the system learns that a certain type of observation requires longer exposures than expected, this knowledge propagates across the network.
Human Feedback Integration: Expert assessments of observations (was this good science? was this a waste of time?) provide high-quality training signal. The system learns to match expert judgment while scaling beyond human attention capacity.
Handling Uncertainty¶
Real-world scheduling faces massive uncertainty. The ML system handles this through:
Probabilistic Predictions: Instead of single-point estimates, the system maintains probability distributions. "The value of this observation is probably around 7, but might be as low as 3 or as high as 15."
Robust Scheduling: When uncertainty is high, the system prefers decisions that are good across many scenarios over decisions that are optimal for one scenario but terrible for others.
Information-Seeking Actions: Sometimes the best decision is to gather more information before committing. The system can propose quick test observations to resolve uncertainty before dedicating major resources.
Graceful Replanning: Plans aren't rigid. When conditions change (weather shifts, new alert arrives, equipment fails), the system replans without requiring human intervention.
Multi-Site Coordination Specifics¶
Your distributed network enables coordination patterns impossible with single telescopes:
Simultaneous Observations: For some targets, observing from multiple sites simultaneously provides unique science (parallax measurements, multi-angle imaging, redundancy against clouds). The task system recognizes these opportunities and schedules accordingly.
Relay Coverage: For time-critical monitoring, sites can relay coverage as the Earth rotates. Site A observes until target sets, Site B picks up as it rises there. The task system plans these handoffs.
Confirmation Mode: An interesting detection at one site can trigger immediate confirmation attempts at other sites. This filters false positives before alerting humans.
Division of Labor: Different sites might specialize in different target types based on their equipment, conditions, or location advantages. The task system learns these specializations and routes accordingly.
Part 3: Limitations of ML and AI¶
Fundamental Limitations¶
Let me be completely honest about what ML cannot do and where it fails.
The Data Dependency¶
ML systems are only as good as their training data. This creates several fundamental limitations:
Garbage In, Garbage Out: If your training data contains errors, biases, or gaps, your model inherits them. A classifier trained on mislabeled images will confidently make the same mistakes. If your training set underrepresents certain types of objects, the model will struggle with them in deployment.
Distribution Shift: ML assumes the future resembles the past. When reality changes (new instrument, different observing strategy, novel type of object), models trained on old data may fail silently. They don't know what they don't know.
Data Volume Requirements: Deep learning requires substantial data. For rare phenomena (unusual transients, exotic object types), you might have only a handful of examples. Models trained on few examples overfit badly. This is the regime where ML struggles most.
Label Quality: Supervised learning needs labeled examples. In astronomy, labels often come from expert classification, which is expensive and sometimes inconsistent. Experts disagree, make mistakes, and have biases. Models learn from this imperfect supervision.
The Black Box Problem¶
Neural networks, especially deep ones, are largely opaque:
No Explanations: When a model classifies an image as a spiral galaxy, it doesn't explain why. You see the input and output, but the reasoning is encoded in millions of parameters that resist human interpretation. For scientific applications, this lack of explanation is problematic.
Debugging Difficulty: When models fail, diagnosing the cause is hard. Unlike traditional code where you can step through logic, neural networks fail in diffuse ways. The bug might be spread across thousands of parameters.
Unpredictable Failures: Models can fail in ways that seem random or inexplicable. An image almost identical to training examples might be misclassified while a completely different image is handled correctly. This unpredictability makes mission-critical deployment risky.
Adversarial Vulnerability: ML models can be fooled by carefully crafted inputs. Small, imperceptible changes to an image can cause confident misclassification. While intentional adversarial attacks are rare in astronomy, natural variations can accidentally hit these failure modes.
The Extrapolation Problem¶
ML excels at interpolation (handling inputs similar to training data). It fails at extrapolation (handling truly novel situations):
Novelty Blindness: A model trained on known object types cannot reliably identify genuinely new types. It might classify them as the nearest known type (missing the discovery) or flag everything unusual (overwhelming you with false positives).
Regime Changes: If physical conditions exceed anything in the training data (brighter sources, fainter sources, different wavelengths, different instruments), model behavior is undefined. It might extrapolate reasonably or fail completely.
Black Swan Events: Extremely rare events (once-per-decade transients, unprecedented phenomena) cannot be in training data by definition. ML provides no advantage over traditional methods for true black swans.
Statistical Limitations¶
ML makes statistical predictions, not certainties:
Irreducible Error: Even a perfect model has error rates. If your best classifier achieves 95% accuracy, that means 5% errors are inherent to the problem given available information. No amount of training reduces this.
Calibration Problems: Models often give poorly calibrated confidence scores. A model might say it's 90% confident when it's actually right only 70% of the time. Or vice versa. Trusting reported confidences without calibration analysis is dangerous.
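Calibration can be measured. A common diagnostic is expected calibration error (ECE): bin predictions by their reported confidence and compare each bin's average confidence with its actual accuracy. A minimal sketch with invented predictions:

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Bin predictions by reported confidence; a well-calibrated model
    has mean confidence close to accuracy in every bin, so ECE near 0."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# An overconfident toy model: claims 90% confidence but is right half the time
confs = [0.9] * 10
hits = [True] * 5 + [False] * 5
print(expected_calibration_error(confs, hits))
```

Running a check like this on held-out data before trusting a model's reported confidences is cheap insurance; for the quality classifier earlier, it would tell you whether "95% confident it's good" actually means 95%.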
Long-Tail Problems: Real data has long tails, with rare examples far from typical. Standard training emphasizes common cases. Rare cases matter scientifically but get little training attention.
Simpson's Paradox and Confounding: ML can find correlations that don't reflect causation. A model might learn that observations at Site A have fewer artifacts, not because Site A is better, but because a skilled operator happens to work there. If that operator leaves, the model's expectations break.
Practical Limitations¶
Beyond theory, real-world ML deployment faces practical challenges:
Computational Costs¶
Training Expense: Training large models requires significant GPU time, often days or weeks. Iteration is slow. Exploring architectural variations is expensive.
Inference Costs: Running models in production requires ongoing computation. For real-time applications, this means dedicated hardware. The marginal cost per prediction might be small, but it's not zero.
Energy Consumption: ML training and inference consume substantial electricity. This matters for remote telescope sites on limited power and for environmental considerations broadly.
Scaling Challenges: As your network grows, ML demands grow too. More data means more storage and processing. More sites mean more edge devices. Costs don't grow linearlyβthey can explode.
Maintenance Burden¶
Model Decay: Deployed models degrade over time as the world changes. Regular retraining is necessary but often neglected.
Technical Debt: ML systems accumulate technical debt faster than traditional software. Data pipelines, feature engineering, model management: all require ongoing attention.
Expertise Requirements: Operating ML systems requires specialized knowledge. Debugging, optimization, and adaptation need skills different from traditional software engineering.
Integration Complexity: ML models must interface with data systems, hardware, user interfaces, and other ML models. Integration is frequently underestimated.
Human Factors¶
Trust Calibration: People tend to either over-trust ML (automation bias) or under-trust it (algorithm aversion). Neither is appropriate. Developing correct calibration requires experience and training.
Deskilling Risk: Relying on ML can atrophy human expertise. If the ML always classifies images, operators lose classification skills. When the ML fails, humans may not be able to recover.
Accountability Gaps: When an ML system makes a decision, who is responsible? This question becomes sharp when decisions matter: prioritizing observations, triggering alerts, discarding data.
Transparency Demands: Science requires reproducibility and explanation. ML systems often can't explain their decisions in scientifically meaningful terms. This creates tension with scientific values.
Astronomy-Specific Limitations¶
Some limitations are particularly relevant to astronomical applications:
Rare Object Discovery¶
The most exciting discoveries are often things never seen before. ML is inherently weak here:
Training Paradox: You can't train on examples of objects that haven't been discovered yet. The first detection of a new phenomenon must come through some other means.
Confirmation Bias: ML systems favor known categories. A new type of transient might be classified as the most similar known type, its novelty invisible.
Anomaly Flooding: Systems tuned for novelty detection produce many false positives. The genuine discovery drowns in a sea of artifacts, glitches, and merely unusual known objects.
Small Sample Science¶
Much of astronomy involves small numbers of special objects:
Few-Shot Learning Limits: Despite progress, ML still struggles when training examples number in tens rather than thousands. Rare object types remain hard.
Statistical Power: ML confidence intervals on small-sample predictions are necessarily wide. Claims based on few examples require extra skepticism.
Selection Effects: Training data for rare objects often has selection effects. We observe the bright examples, miss the faint ones. Models learn these biases.
Systematic Effects¶
Telescope data has systematic effects that ML can mislearn:
Instrumental Signatures: ML might learn to recognize CCD artifacts, scattered light patterns, or optical ghosts rather than astronomical signal. It might even perform better by using these cluesβwhile learning nothing about astronomy.
Time-Dependent Effects: Sensors change over time. Training data from last year might not represent this year's behavior. Models need constant recalibration.
Site-Specific Quirks: In a distributed network, site-specific systematics are pernicious. A model might learn that a certain pattern indicates good data at Site A while the same pattern indicates bad data at Site B, without any astronomical reason.
Physical Understanding¶
ML is fundamentally empirical: it learns patterns without understanding physics:
No Physical Constraints: A physics model knows that certain configurations are impossible. ML doesn't. It might predict physically impossible stellar properties or generate images that violate conservation laws.
No Generalization to New Regimes: Physical understanding allows extrapolation to new regimes. ML cannot. A stellar model based on physics works for stars never observed. An ML model might fail on any star outside the training distribution.
Explanation vs. Prediction: Science values explanation. ML provides prediction without explanation. A model that predicts stellar properties accurately but offers no insight into stellar physics is scientifically incomplete.
What ML Cannot Replace¶
Despite capabilities, some things remain firmly beyond ML:
Scientific Judgment: Deciding what questions to ask, what observations would be most informative, what results meanβthese require human insight ML cannot provide.
Novel Hypothesis Generation: ML finds patterns in data. Generating new theoretical frameworks to explain patterns requires creativity ML lacks.
Ethical Considerations: Decisions about resource allocation, data sharing, collaboration, and publication involve values ML cannot assess.
Error Checking: ML systems make mistakes. Humans must check results, especially unusual ones. Removing humans from the loop is dangerous.
Adaptation to Truly Novel Situations: When something genuinely unprecedented happens, human flexibility exceeds ML rigidity.
Part 4: Battle-Tested Libraries and Models¶
Core Deep Learning Frameworks¶
These are the foundations everything else builds on:
PyTorch¶
The dominant framework for research and increasingly for production. Developed by Meta AI.
Strengths: Intuitive design that matches how you think about neural networks. Excellent debugging (standard Python debugging works). Huge ecosystem. Active development. Strong community.
Weaknesses: Deployment to production requires additional tooling. Can be memory-inefficient compared to alternatives.
Maturity: Extremely mature. Used by most academic labs, many companies. If something works in deep learning, there's a PyTorch implementation.
Astronomy Usage: Default choice for new astronomical ML projects. Most astronomical ML papers use PyTorch.
TensorFlow¶
Google's framework. Older and more established in production settings.
Strengths: Excellent production deployment tools. TensorFlow Serving for scalable inference. TensorFlow Lite for edge devices. Strong enterprise support.
Weaknesses: Less intuitive programming model (though Keras helps). Slower to adopt research innovations.
Maturity: Very mature. Powers much of Google's ML. Extensive production track record.
Astronomy Usage: Still used in many production systems. Large astronomical surveys often use TensorFlow for deployment stability.
JAX¶
Google's newer framework focused on high performance and functional programming.
Strengths: Incredible performance through XLA compilation. Easy parallelization across devices. Automatic differentiation through arbitrary Python code.
Weaknesses: Steeper learning curve. Smaller ecosystem than PyTorch/TensorFlow. Functional paradigm unfamiliar to many.
Maturity: Mature but younger than alternatives. Growing adoption in research.
Astronomy Usage: Growing in computational astrophysics. Good for physics-informed neural networks.
Traditional Machine Learning¶
Not everything needs deep learning. These libraries handle classical ML:
scikit-learn¶
The standard library for classical machine learning in Python.
Capabilities: Classification (random forests, SVMs, logistic regression), regression, clustering (k-means, DBSCAN), dimensionality reduction (PCA, t-SNE), preprocessing, model selection, metrics.
Strengths: Consistent API across all algorithms. Excellent documentation. Very well tested. Fast for moderate data sizes.
Weaknesses: Not designed for deep learning. Doesn't scale to very large datasets (millions of examples, many features).
Maturity: Extremely mature. Used in production at countless companies. The default choice for non-deep-learning ML in Python.
Astronomy Usage: Widely used for classification tasks, clustering, and as baseline comparisons for deep learning approaches.
XGBoost / LightGBM / CatBoost¶
Gradient boosting libraries. Often the best choice for tabular data.
Capabilities: Classification and regression on tabular data. Handles missing values, categorical features. Often achieves state-of-the-art on structured data.
Strengths: Often beats neural networks on tabular data. Fast training and inference. Built-in handling of many practical issues.
Weaknesses: Not for images, sequences, or other unstructured data. Requires feature engineering.
Maturity: Very mature. Winners of many Kaggle competitions. Widely deployed in industry.
Astronomy Usage: Excellent for tasks with tabular features (stellar parameters from catalog data, transient classification from light curve features, photometric redshift estimation).
Computer Vision Libraries¶
For image-based astronomical data:
torchvision¶
PyTorch's computer vision library.
Capabilities: Pre-trained models (ResNet, EfficientNet, Vision Transformers). Image transformations and augmentation. Standard datasets. Detection and segmentation models.
Strengths: Tight integration with PyTorch. Well-maintained pre-trained weights. Standard transforms.
Weaknesses: Geared toward natural images (ImageNet). Astronomical images need adaptation.
Maturity: Very mature. Used everywhere PyTorch is used for vision.
Astronomy Usage: Starting point for most image classification work. Pre-trained models fine-tuned for astronomical tasks.
timm (PyTorch Image Models)¶
Huge collection of state-of-the-art image models.
Capabilities: Hundreds of model architectures with pre-trained weights. Includes latest research models. Consistent interface across all models.
Strengths: Most comprehensive collection available. Often has weights trained on larger datasets than torchvision. Regular updates with new models.
Weaknesses: So many options can be overwhelming. Documentation varies.
Maturity: Mature and widely used. Default source for SOTA image models.
Astronomy Usage: When you need the latest architectures for challenging classification or detection tasks.
Albumentations¶
Image augmentation library.
Capabilities: Fast augmentations (rotation, flipping, scaling, color adjustments, noise injection, and many more). Handles masks for segmentation. Handles keypoints and bounding boxes.
Strengths: Much faster than alternatives. Huge variety of transforms. Well-designed for ML pipelines.
Weaknesses: Learning curve for composition syntax.
Maturity: Very mature. Standard choice for augmentation in PyTorch pipelines.
Astronomy Usage: Essential for training robust astronomical image classifiers with limited data.
Astronomy-Specific Libraries¶
These are built specifically for astronomical ML:
AstroML¶
Machine learning for astronomy, built on scikit-learn.
Capabilities: Astronomical datasets, statistical tools, density estimation, time-series analysis, classification examples.
Strengths: Designed by astronomers for astronomers. Includes relevant datasets. Good tutorial material.
Weaknesses: Less actively developed than general ML libraries. Focuses on classical ML rather than deep learning.
Maturity: Mature but somewhat dated. Good for learning, less so for cutting-edge work.
Astronomy Usage: Learning astronomical ML. Baseline methods. Statistical analysis.
astropy¶
Not ML per se, but essential for astronomical data handling.
Capabilities: FITS file I/O, coordinate transformations, unit handling, cosmological calculations, time handling, table operations, astronomical constants.
Strengths: The standard astronomical Python library. Comprehensive. Well-documented. Actively developed.
Weaknesses: Not ML-specific. You need it alongside ML libraries, not instead of them.
Maturity: Extremely mature. Used by virtually all Python-based astronomical software.
Astronomy Usage: Loading data, coordinate handling, preprocessing. Essential foundation for any astronomical ML work.
photutils¶
Source detection and photometry.
Capabilities: Source detection, aperture and PSF photometry, background estimation, segmentation, centroiding.
Strengths: Standard astronomical photometry methods. Well-integrated with astropy.
Weaknesses: Classical methods, not ML-based.
Maturity: Mature. Standard tool for photometric analysis.
Astronomy Usage: Preprocessing before ML. Ground truth generation. Baseline comparisons.
SEP (Source Extractor in Python)¶
Python binding for Source Extractor functionality.
Capabilities: Background estimation, source detection, photometry. Fast C implementation with Python interface.
Strengths: Very fast. Matches behavior of classic Source Extractor.
Weaknesses: Less flexible than pure Python alternatives.
Maturity: Mature. Based on decades-old, proven algorithms.
Astronomy Usage: Fast preprocessing. Production pipelines where speed matters.
Time-Series Libraries¶
For light curves and temporal data:
tsfresh¶
Automatic feature extraction from time series.
Capabilities: Extracts hundreds of features from time series automatically. Features include statistical moments, spectral properties, entropy measures, and more.
Strengths: Comprehensive feature extraction. Little manual engineering needed. Works well with classical ML.
Weaknesses: Can be slow on large datasets. Feature explosion requires selection.
Maturity: Mature. Used in many time-series competition winners.
Astronomy Usage: Light curve classification. Variable star analysis. Transient characterization.
tslearn¶
Time series machine learning.
Capabilities: Time series classification, clustering, and metrics. DTW (dynamic time warping) implementations. Time series transformations.
Strengths: Dedicated to time series. Includes specialized algorithms not in general libraries.
Weaknesses: Less comprehensive than combining general libraries.
Maturity: Mature. Good for time-series-specific algorithms.
Astronomy Usage: Light curve similarity searches. Variable star clustering.
Reinforcement Learning¶
For scheduling and control:
Stable Baselines3¶
Standard implementations of RL algorithms.
Capabilities: PPO, A2C, SAC, TD3, DQN, and more. Consistent API. Built on PyTorch.
Strengths: Well-tested implementations. Active development. Good documentation.
Weaknesses: Customization can be awkward. RL still requires significant tuning.
Maturity: Mature. Standard starting point for applied RL.
Astronomy Usage: Telescope scheduling. Adaptive control systems. Resource allocation.
RLlib¶
Scalable RL library from Ray.
Capabilities: Distributed training, many algorithms, multi-agent RL, custom environments.
Strengths: Scales to large problems. Production-ready. Integrates with Ray ecosystem.
Weaknesses: Complex setup. Overkill for simple problems.
Maturity: Mature. Used at scale by many companies.
Astronomy Usage: Large-scale scheduling optimization. Multi-telescope coordination.
Pre-trained Models for Astronomy¶
Some models trained specifically on astronomical data:
Zoobot¶
Galaxy morphology classification models.
Training Data: Trained on Galaxy Zoo volunteer classifications of hundreds of thousands of galaxies.
Capabilities: Predicts detailed morphological features (spiral arms, bars, bulges, mergers, etc.). State-of-the-art galaxy classification.
Availability: Open source with pre-trained weights.
Astronomy Usage: Galaxy classification. Transfer learning starting point for morphology tasks.
AstroCLIP¶
Contrastive learning model for astronomical images.
Training Data: Trained on large astronomical image collections with self-supervised learning.
Capabilities: General-purpose astronomical image embeddings. Can be fine-tuned for various tasks.
Availability: Research code and weights available.
Astronomy Usage: Starting point for custom classification. Image similarity search.
ASTROMER¶
Transformer model for light curves.
Training Data: Pre-trained on large light curve collections.
Capabilities: Learns general representations of time-varying astronomical sources. Fine-tunable for classification.
Availability: Research code available.
Astronomy Usage: Variable star classification. Transient classification. Light curve analysis.
Deployment Tools¶
For putting models into production:
ONNX¶
Open Neural Network Exchange format.
Capabilities: Convert models between frameworks. Optimize for inference. Deploy to various runtimes.
Strengths: Framework-agnostic. Good optimization. Wide runtime support.
Weaknesses: Not all operations supported. Conversion can be tricky.
Maturity: Very mature. Industry standard for model exchange.
Astronomy Usage: Deploy PyTorch models to edge devices. Cross-framework compatibility.
TensorRT¶
NVIDIA's inference optimizer.
Capabilities: Optimize neural networks for NVIDIA GPUs. Quantization, layer fusion, kernel optimization.
Strengths: Massive speedups on NVIDIA hardware. Production-ready.
Weaknesses: NVIDIA-only. Requires supported operations.
Maturity: Very mature. Used in production at scale.
Astronomy Usage: Fast inference on GPU-equipped systems.
Docker¶
Containerization for ML systems.
Capabilities: Package entire ML environment (code, dependencies, models) into reproducible containers.
Strengths: Reproducibility. Easy deployment. Isolation.
Weaknesses: Learning curve. Container management overhead.
Maturity: Extremely mature. Industry standard for deployment.
Astronomy Usage: Deploy ML pipelines to telescope sites. Reproducible analysis environments.
Data Management¶
For handling large astronomical datasets:
Dask¶
Parallel computing library.
Capabilities: Parallel arrays, dataframes, and custom computations. Scales from laptop to cluster.
Strengths: Familiar pandas/numpy API. Lazy evaluation. Good for larger-than-memory data.
Weaknesses: Some operations less efficient than native numpy. Debugging can be tricky.
Maturity: Very mature. Widely used for large-scale data processing.
Astronomy Usage: Processing large image collections. Distributed feature extraction.
Vaex¶
Out-of-core dataframes.
Capabilities: Billion-row dataframes on a laptop. Memory-mapped files. Fast aggregations.
Strengths: Handles huge catalogs that don't fit in memory. Very fast for certain operations.
Weaknesses: Less flexible than pandas. Not all operations supported.
Maturity: Mature. Good for specific large-data use cases.
Astronomy Usage: Exploring large astronomical catalogs. Cross-matching. Feature extraction from massive datasets.
This covers the complete landscape of ML for your distributed telescope project: the specific applications, the underlying mathematics, the hardware requirements, system design for task assignment, honest limitations, and the practical tools you'll use. Let me know if you want to dive deeper into any particular area.
Understanding ML Model Architectures for Your Distributed Telescope Array¶
Let me take you through every major architecture, how it actually works, and exactly where each fits into your telescope network.
The Fundamental Question: Why Different Architectures?¶
Before diving into specifics, understand why we have different architectures at all.
Data comes in different shapes:
Tabular data: Rows and columns, like a spreadsheet. Star catalogs with measurements. Each row is independent, columns are features.
Images: 2D grids of pixels. Your telescope frames. Nearby pixels are related. Spatial structure matters.
Sequences: Ordered data points. Light curves over time. What came before affects interpretation of what comes after.
Graphs: Networks of connected entities. Stars in clusters. Galaxies in groups. Relationships between objects matter.
Sets: Collections without order. Multiple observations of the same field. The set matters, not the sequence.
Each architecture embodies assumptions about data structure. Using the wrong architecture means fighting against its assumptions. Using the right architecture means the model naturally captures relevant patterns.
Feedforward Neural Networks: The Foundation¶
What They Are¶
The simplest neural network. Data flows in one direction: input to output, no loops, no memory.
Input Layer → Hidden Layer 1 → Hidden Layer 2 → ... → Output Layer
Each layer is fully connected to the next. Every neuron in layer N connects to every neuron in layer N+1.
How They Process Information¶
Imagine your input is a vector of 100 numbers representing measurements of a star: brightness in different filters, position, proper motion, and so on.
Layer 1 (say, 256 neurons): Each neuron computes a weighted sum of all 100 inputs, adds a bias, applies an activation function. You get 256 new numbers, each representing some combination of the original features.
Layer 2 (say, 128 neurons): Each neuron takes all 256 outputs from Layer 1, computes weighted sums, applies activation. Now you have 128 numbers representing combinations of combinations.
Output Layer (say, 5 neurons for 5 star types): Each neuron combines the 128 Layer 2 outputs. Apply softmax to get probabilities.
The key insight: each successive layer learns more abstract representations. Layer 1 might learn "this combination of colors indicates high temperature." Layer 2 might learn "high temperature plus this proper motion pattern suggests a certain stellar population."
Mathematical Formulation¶
For a single layer:
output = activation(weights × input + bias)
Where:
- input is a vector of N values
- weights is a matrix of size (M × N), where M is the number of neurons
- bias is a vector of M values
- activation is a nonlinear function applied element-wise
- output is a vector of M values
Stacking layers:
h₁ = activation(W₁ × input + b₁)
h₂ = activation(W₂ × h₁ + b₂)
h₃ = activation(W₃ × h₂ + b₃)
output = softmax(W₄ × h₃ + b₄)
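The stacked layers above can be sketched in a few lines of pure Python; the weights, sizes, and input values here are illustrative toys:

```python
import math

def dense(x, W, b):
    """One fully connected layer: each output is a weighted sum plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    exps = [math.exp(a - max(v)) for a in v]  # subtract max for stability
    s = sum(exps)
    return [e / s for e in exps]

# Toy network: 3 inputs -> 2 hidden neurons -> 2 classes
x = [1.0, 2.0, 3.0]
W1, b1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [0.0, 0.0]
W2, b2 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]

h = relu(dense(x, W1, b1))         # hidden representation
probs = softmax(dense(h, W2, b2))  # class probabilities
print(all(p > 0 for p in probs), round(sum(probs), 6))  # True 1.0
```

Each `dense` call is exactly the `weights × input + bias` step from the formulas, and the probabilities always sum to one because of the softmax.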
Strengths¶
Universality: Can theoretically approximate any function given enough neurons. This is a mathematical guarantee.
Simplicity: Easy to implement, understand, debug. Training is straightforward.
Speed: Fast inference. No complex operations, just matrix multiplications.
Flexibility: Works on any fixed-size input. No structural assumptions beyond input dimension.
Weaknesses¶
No spatial awareness: Treats each input independently. For images, pixel 1 and pixel 1000 are equally "distant" from the network's perspective, even if they're adjacent in the image.
No temporal awareness: Each input is processed independently. Can't learn that a brightness measurement depends on previous measurements.
Parameter explosion: For large inputs, fully-connected layers have enormous numbers of parameters. A 256×256 image has 65,536 pixels. A single hidden layer of 1000 neurons would have 65 million parameters just for that layer.
No weight sharing: Patterns learned in one part of the input don't transfer to other parts. A galaxy in the corner of an image requires separate learning from a galaxy in the center.
For Your Telescope Array¶
Good for: Processing extracted features (not raw images). Tabular data from catalogs. Final classification layers after other architectures have extracted features.
Specific applications:
- Classifying stars from catalog measurements (colors, proper motions, parallax)
- Predicting observation quality from metadata (temperature, humidity, moon phase, elevation)
- Combining high-level features from multiple sources for final decision-making
- Quick assessment models where speed matters more than accuracy
Example scenario: You've extracted 50 features from a light curve (mean brightness, variance, periodicity measures, etc.). A feedforward network takes these 50 numbers and classifies the variable star type. The feature extraction handles temporal structure; the feedforward network handles the final classification.
Convolutional Neural Networks: Spatial Intelligence¶
What They Are¶
Networks designed for data with spatial structure, primarily images. Instead of connecting every input to every neuron, they use local connections and weight sharing.
The Core Insight¶
Images have two crucial properties feedforward networks ignore:
Locality: Relevant patterns are local. An edge is a few pixels. A star is a small region. You don't need to look at pixels 1000 apart simultaneously to detect these patterns.
Translation invariance: A spiral arm looks like a spiral arm regardless of where it appears in the image. Learning to recognize it in one location should transfer to all locations.
CNNs embody these assumptions through convolution operations.
How Convolution Works¶
A convolutional layer has small filters (also called kernels), typically 3×3, 5×5, or 7×7 pixels.
Each filter slides across the entire image, computing a dot product at each position:
Image patch:      Filter:          Computation:
[a b c]           [w₁ w₂ w₃]       output = a×w₁ + b×w₂ + c×w₃ +
[d e f]     ×     [w₄ w₅ w₆]                d×w₄ + e×w₅ + f×w₆ +
[g h i]           [w₇ w₈ w₉]                g×w₇ + h×w₈ + i×w₉
This single number represents "how much does this patch match this filter?"
Sliding the filter across all positions produces a feature map: a 2D grid showing where the pattern was detected.
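Here is the sliding-filter computation as a minimal pure-Python sketch (no padding, stride 1; the image and kernel are illustrative):

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over every valid position; each output value is
    the dot product of the kernel with the image patch under it."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for m in range(kh):
                for n in range(kw):
                    acc += image[i + m][j + n] * kernel[m][n]
            row.append(acc)
        out.append(row)
    return out

# A difference kernel responds where brightness changes left to right,
# i.e. it acts as a vertical-edge detector.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge = [[-1, 1]]  # 1x2 difference kernel
print(conv2d_valid(image, edge))
# [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
```

The output is the feature map: it peaks exactly at the column where the dark-to-bright edge sits, and is zero where the patch doesn't match the filter.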
Multiple Filters, Multiple Layers¶
A single convolutional layer has many filters (32, 64, 128 are common). Each learns to detect a different pattern.
Layer 1 filters learn simple patterns:
- Horizontal edges
- Vertical edges
- Diagonal edges
- Brightness gradients
- Spots of various sizes
Layer 2 filters operate on Layer 1's output, learning combinations:
- Corners (horizontal + vertical edges)
- Curves (sequences of edge orientations)
- Texture patterns
- Ring-like structures
Layer 3 and beyond learn increasingly complex combinations:
- Spiral arm signatures
- Galaxy core patterns
- Specific artifact shapes
- Complex morphological features
This hierarchy emerges automatically from training. You don't specify "learn edges then corners then spirals." The network discovers this hierarchy because it's efficient for reducing classification error.
Pooling Operations¶
Between convolutional layers, pooling reduces spatial dimensions:
Max pooling: Take the maximum value in each small region
[1 3 2 4]
[5 6 1 2] → Max pool 2×2 → [6 4]
[3 2 1 0] [3 3]
[1 2 3 1]
Average pooling: Take the mean value in each region
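The 2×2 max-pooling example above can be checked in a few lines of pure Python (the grid is the one from the example):

```python
def max_pool_2x2(grid):
    """Take the maximum over each non-overlapping 2x2 block."""
    return [
        [max(grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1])
         for j in range(0, len(grid[0]), 2)]
        for i in range(0, len(grid), 2)
    ]

grid = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [3, 2, 1, 0],
    [1, 2, 3, 1],
]
print(max_pool_2x2(grid))  # [[6, 4], [3, 3]]
```

Notice how shifting a value within its 2×2 block would not change the output: that is the small translation invariance pooling buys.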
Pooling provides:
- Reduced computation for subsequent layers
- Some translation invariance (small shifts don't change max values much)
- Larger effective receptive field (later layers "see" more of the original image)
Receptive Fields¶
A crucial concept: how much of the original image influences a single neuron in a later layer?
Layer 1 neuron: Sees only its 3×3 filter region. Receptive field = 9 pixels.
Layer 2 neuron: Takes input from Layer 1 neurons, each of which saw 3×3. After pooling, each Layer 2 neuron effectively sees ~6×6 of the original image.
Deep layer neuron: Might effectively see the entire image, but through a hierarchical lens.
This is why deep CNNs can learn global patterns while still using local operations: information propagates through the hierarchy.
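Receptive-field growth follows a simple recurrence: each layer widens the field by (kernel − 1) times the cumulative stride of everything before it. A sketch (exact numbers depend on where pooling sits in the stack; the layer lists below are illustrative):

```python
def receptive_field(layers):
    """Track receptive field size through a stack of (kernel, stride) layers.
    Each layer grows the field by (kernel - 1) times the cumulative stride."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride  # later layers step over more original pixels
    return field

print(receptive_field([(3, 1)]))                  # 3  -> a 3x3 patch
print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8  (conv, pool, conv)
```

Stack enough conv/pool pairs and the field covers the whole image, which is how local 3×3 operations end up seeing global structure.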
Mathematical Formulation¶
For a 2D convolution:
output[i,j] = Σₘ Σₙ input[i+m, j+n] × filter[m,n] + bias
Where the sums run over the filter dimensions.
With multiple input channels (like RGB, or previous layer features):
output[i,j] = Σ_c Σₘ Σₙ input[c, i+m, j+n] × filter[c,m,n] + bias
Where c indexes input channels.
Architecture Patterns¶
Standard CNN architectures follow patterns:
VGG pattern: Stack many 3×3 convolutions. Simple but effective.
Conv3×3 → Conv3×3 → Pool → Conv3×3 → Conv3×3 → Pool → ... → Dense → Output
ResNet pattern: Add skip connections that let gradients flow directly through many layers.
input → Conv → Conv → (+input) → Conv → Conv → (+previous) → ...
Skip connections solve the vanishing gradient problem, allowing very deep networks (50, 100, 150+ layers).
Inception/GoogLeNet pattern: Use multiple filter sizes in parallel, concatenate results.
input → [1×1 conv, 3×3 conv, 5×5 conv, pool] → concatenate → ...
This captures patterns at multiple scales simultaneously.
Strengths¶
Parameter efficiency: A 3×3 filter has 9 parameters regardless of image size. Compared to feedforward networks, CNNs have far fewer parameters.
Translation equivariance: A pattern detected at position (10, 10) uses the same weights as detection at (100, 100). Learning transfers across positions.
Hierarchical feature learning: Automatically learns appropriate feature hierarchy for the task.
Proven architecture: Decades of refinement. Well-understood behavior. Extensive pre-trained models available.
Weaknesses¶
Fixed input size: Standard CNNs expect fixed image dimensions. Variable sizes require padding, cropping, or architectural changes.
Limited global awareness: Despite stacking layers, CNNs can struggle with patterns requiring true global context. A pattern depending on opposite corners remains hard.
Translation invariance can hurt: Sometimes position matters. The center of a galaxy image is semantically different from the edge. Pure CNNs don't distinguish.
No temporal understanding: Each image is processed independently. Sequential relationships require additional architecture.
For Your Telescope Array¶
Good for: Any image-based task. Quality assessment. Object detection. Galaxy classification. Artifact identification.
Specific applications:
Real-time quality assessment: A lightweight CNN at each telescope evaluates incoming frames. Input: single frame. Output: quality score and issue flags (clouds, tracking error, focus problem, etc.).
Source detection: Semantic segmentation CNNs identify every source in an image. Each pixel gets classified: background, star, galaxy, artifact, satellite trail.
Galaxy morphology: CNNs trained on Galaxy Zoo data classify galaxy types, identify features like bars, rings, spiral arms, merger signatures.
Transient detection: CNNs compare new images to references, classifying differences as real transients, artifacts, or noise.
Cross-site calibration: CNNs learn to map images from different sites to a common representation, normalizing site-specific effects.
Example architecture for your quality classifier:
Input: 256×256 grayscale image

Block 1: Conv(32 filters, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
  Output: 128×128×32
Block 2: Conv(64 filters, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
  Output: 64×64×64
Block 3: Conv(128 filters, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
  Output: 32×32×128
Block 4: Conv(256 filters, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
  Output: 16×16×256

Global Average Pool: 256 values
Dense(128) → ReLU → Dropout(0.5)
Dense(3) → Softmax
Output: probabilities for [good, medium, bad]
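Assuming the convolutions are same-padded (so only the 2×2 pools change spatial size, which is what the shapes in the listing above imply), the output shapes can be checked mechanically:

```python
def track_shapes(size, channel_plan):
    """Follow the spatial size through blocks of same-padded 3x3 conv
    followed by 2x2 max pooling (each block halves height and width)."""
    shapes = []
    for channels in channel_plan:
        size //= 2  # the 2x2 pool halves each spatial dimension
        shapes.append((size, size, channels))
    return shapes

print(track_shapes(256, [32, 64, 128, 256]))
# [(128, 128, 32), (64, 64, 64), (32, 32, 128), (16, 16, 256)]
```

Tracing shapes like this before building a model is a cheap way to catch dimension mismatches between the convolutional stack and the dense head.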
Recurrent Neural Networks: Temporal Intelligence¶
What They Are¶
Networks designed for sequential data. They maintain internal state that persists across sequence elements, giving them a form of memory.
The Core Insight¶
Many phenomena unfold over time. A light curve isn't just a collection of brightness measurements; it's an ordered sequence where each measurement relates to those before and after.
Standard feedforward networks process each input independently. RNNs process sequences element by element, maintaining hidden state that captures what they've seen so far.
Basic RNN Operation¶
At each time step t:
hidden[t] = activation(W_input × input[t] + W_hidden × hidden[t-1] + bias)
output[t] = W_output × hidden[t]
The key: hidden[t] depends on hidden[t-1]. Information flows forward through time.
input[0] → [RNN Cell] → hidden[0] → output[0]
                            ↓
input[1] → [RNN Cell] → hidden[1] → output[1]
                            ↓
input[2] → [RNN Cell] → hidden[2] → output[2]
                            ↓
...
The same weights are used at every time step. The only thing changing is the hidden state.
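To make the recurrence concrete, here is a minimal pure-Python RNN with scalar state; the weight values are illustrative:

```python
import math

def rnn_forward(inputs, w_in, w_hidden, bias):
    """Process a sequence one element at a time, carrying hidden state.
    The same three weights are reused at every step."""
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = math.tanh(w_in * x + w_hidden * hidden + bias)
        states.append(hidden)
    return states

# The same input value yields different hidden states depending on history:
states = rnn_forward([1.0, 1.0, 1.0], w_in=0.5, w_hidden=0.8, bias=0.0)
print(states[1] > states[0])  # True: step 2 carries memory of step 1
```

Even with identical inputs at every step, the hidden state evolves, which is exactly the memory a feedforward network lacks.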
The Vanishing Gradient Problem¶
Basic RNNs have a critical flaw: information fades over time.
During training, gradients must flow backward through time. At each step, they get multiplied by weights. If weights are less than 1, gradients shrink exponentially. After 50 or 100 steps, gradients are effectively zero.
Result: basic RNNs can only learn short-range dependencies. They forget distant past, even when it's crucial.
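The shrinkage is easy to see numerically. This simplified scalar picture ignores activations and treats the recurrent weight as a single number:

```python
# Gradients flowing back through T steps get multiplied by the recurrent
# weight once per step; below 1, they shrink exponentially.
weight = 0.9
gradient = 1.0
for step in range(100):
    gradient *= weight
print(gradient < 1e-4)  # True: after 100 steps the signal is effectively gone
```

With a weight of 0.9, the gradient after 100 steps is around 2.7e-5: whatever happened 100 steps ago contributes essentially nothing to the weight update.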
LSTM: Long Short-Term Memory¶
LSTMs solve the vanishing gradient problem with a gated architecture:
┌─────────────────────────────────────────┐
│                LSTM Cell                │
│                                         │
│   ┌──────┐   ┌──────┐   ┌──────┐        │
│   │Forget│   │Input │   │Output│        │
│   │ Gate │   │ Gate │   │ Gate │        │
│   └──────┘   └──────┘   └──────┘        │
│      │          │          │            │
│   ┌─────────────────────────────┐       │
│   │         Cell State          │ ──────┼──→ (memory highway)
│   └─────────────────────────────┘       │
│                                         │
└─────────────────────────────────────────┘
Forget gate: Decides what to discard from cell state. "The transit event is over, forget those details."
Input gate: Decides what new information to store. "This brightness spike is important, remember it."
Output gate: Decides what to output based on cell state. "Based on everything seen, output this classification."
Cell state: The memory highway. Information can flow unchanged across many time steps, and gradients flow along it through elementwise gating rather than repeated multiplication by weight matrices, so they fade far more slowly.
The mathematics:
forget = sigmoid(W_f × [hidden[t-1], input[t]] + b_f)
input_gate = sigmoid(W_i × [hidden[t-1], input[t]] + b_i)
candidate = tanh(W_c × [hidden[t-1], input[t]] + b_c)
cell[t] = forget × cell[t-1] + input_gate × candidate
output_gate = sigmoid(W_o × [hidden[t-1], input[t]] + b_o)
hidden[t] = output_gate × tanh(cell[t])
The gates are sigmoid functions outputting values between 0 and 1, acting as soft switches.
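A minimal NumPy sketch of one LSTM step, following the equations above. The four gate pre-activations are packed into a single weight matrix (a common implementation trick); all shapes and names are illustrative, not any particular library's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to four stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates: soft switches in (0, 1)
    g = np.tanh(g)                                # candidate values
    c = f * c_prev + i * g                        # cell state: memory highway
    h = o * np.tanh(c)                            # gated output
    return h, c

rng = np.random.default_rng(1)
H, D = 8, 3                                  # hidden size, input size
W = rng.normal(size=(4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(20, D)):         # run over a 20-step sequence
    h, c = lstm_step(x_t, h, c, W, b)
```

The key line is `c = f * c_prev + i * g`: when the forget gate stays near 1, information rides the cell state across many steps unchanged.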
GRU: Gated Recurrent Unit¶
A simplified gating mechanism, often performing comparably to LSTM with fewer parameters:
reset = sigmoid(W_r × [hidden[t-1], input[t]])
update = sigmoid(W_u × [hidden[t-1], input[t]])
candidate = tanh(W × [reset × hidden[t-1], input[t]])
hidden[t] = (1 - update) × hidden[t-1] + update × candidate
Two gates instead of three. Often faster to train with similar performance.
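The same equations as a NumPy sketch (toy weights and names; real implementations fuse these operations for speed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_u, W_c):
    """One GRU step with reset and update gates (two gates instead of three)."""
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(W_r @ hx)                                    # reset gate
    u = sigmoid(W_u @ hx)                                    # update gate
    cand = np.tanh(W_c @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1.0 - u) * h_prev + u * cand                     # blend old and new

rng = np.random.default_rng(2)
H, D = 6, 3
W_r, W_u, W_c = (rng.normal(size=(H, H + D)) * 0.1 for _ in range(3))
h = np.zeros(H)
for x_t in rng.normal(size=(15, D)):
    h = gru_step(x_t, h, W_r, W_u, W_c)
```

The final line interpolates between the previous state and the candidate: the update gate decides, per dimension, how much memory to keep.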
Bidirectional RNNs¶
Sometimes context from the future helps interpret the present. Bidirectional RNNs process sequences both forward and backward:
Forward:  input[0] → input[1] → input[2] → ... → input[T]
             ↓          ↓          ↓                ↓
         hidden_f[0] hidden_f[1] hidden_f[2] ... hidden_f[T]

Backward: input[0] ← input[1] ← input[2] ← ... ← input[T]
             ↓          ↓          ↓                ↓
         hidden_b[0] hidden_b[1] hidden_b[2] ... hidden_b[T]

Combined: [hidden_f[t], hidden_b[t]] for each t
Each position gets context from both past and future. Useful when you have the complete sequence before processing.
Sequence-to-Sequence Architectures¶
For tasks where input and output are both sequences, use encoder-decoder architectures:
Encoder: Processes the input sequence, producing a summary hidden state.
Decoder: Takes the summary and generates the output sequence.
Input sequence → [Encoder RNN] → Summary State → [Decoder RNN] → Output sequence
This architecture underlies machine translation, summarization, and can be adapted for time-series forecasting.
Strengths¶
Natural for sequences: Explicitly models temporal dependencies. Hidden state carries information across time.
Variable length: Unlike feedforward networks, RNNs handle sequences of any length.
Parameter efficiency: Same weights used at every time step. A 100-step sequence doesn't need 100Γ the parameters.
Interpretable dynamics: Hidden state evolution can be analyzed. What is the network remembering?
Weaknesses¶
Sequential computation: Can't parallelize across time steps. Each step waits for the previous. Training and inference are slower than parallelizable architectures.
Long-range dependencies: Even LSTMs struggle with very long sequences (hundreds to thousands of steps). Information still fades, just more slowly.
Training instability: RNNs can suffer from exploding gradients. Requires careful initialization and gradient clipping.
Superseded by transformers: For many tasks, transformers achieve better performance with easier training. RNNs are less dominant than they once were.
For Your Telescope Array¶
Good for: Light curves. Time-series data. Sequential observations. Any data where temporal order matters.
Specific applications:
Light curve classification: An LSTM processes a sequence of brightness measurements, classifying the variable star type, detecting transients, or identifying periodic behavior.
Light curve: [mag[0], mag[1], mag[2], ..., mag[T]]
                 ↓       ↓       ↓            ↓
              [LSTM] → [LSTM] → [LSTM] → ... → [LSTM]
                                                  ↓
                                           Classification
Transient detection in time series: RNN monitors brightness sequence, outputs probability of transient at each time step. Alert when probability exceeds threshold.
Predictive modeling: Given recent conditions (weather, seeing, performance), predict near-future conditions for scheduling.
Anomaly detection in sequences: Train LSTM to predict next value in normal sequences. Large prediction errors indicate anomalies.
State tracking: RNN maintains hidden state representing current system status, updated with each new observation or event.
Example architecture for light curve classification:
Input: sequence of (time, magnitude, error) tuples, variable length
Embedding: Dense(64) applied to each time step
Output: sequence of 64-dimensional vectors
Bidirectional LSTM(128 units)
Output: sequence of 256-dimensional vectors (128 forward + 128 backward)
Attention layer (or just take final hidden state)
Output: 256-dimensional vector
Dense(128) β ReLU β Dropout(0.3)
Dense(64) β ReLU β Dropout(0.3)
Dense(num_classes) β Softmax
Output: class probabilities
Transformers: Attention is All You Need¶
What They Are¶
Transformers process sequences without recurrence. Instead of maintaining hidden state, they use attention mechanisms to directly relate any element to any other element.
The Core Insight¶
RNNs process sequences step by step. Information from early steps must pass through many intermediate steps to affect later processing. This creates bottlenecks.
Transformers skip the middleman. Every position can directly attend to every other position. Information flows directly between any pair of elements.
Self-Attention: The Key Mechanism¶
Self-attention computes relationships between all pairs of positions in a sequence.
For each position, create three vectors:
- Query (Q): "What am I looking for?"
- Key (K): "What do I have to offer?"
- Value (V): "What information do I carry?"
Attention score between position i and position j:
score[i,j] = Q[i] · K[j] / sqrt(d_k)
The dot product measures similarity. Division by sqrt(d_k) (dimension of keys) prevents scores from growing too large.
Apply softmax to get attention weights:
weights[i] = softmax(scores[i]) # weights[i] sums to 1
Output for position i is weighted sum of values:
output[i] = Σⱼ weights[i,j] × V[j]
Each position's output incorporates information from all other positions, weighted by relevance.
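The full computation fits in a short NumPy sketch. This is single-head, unbatched scaled dot-product attention with made-up dimensions; the projection matrices `W_q`, `W_k`, `W_v` would be learned in practice.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (T, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (T, T): every pair of positions
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights           # weighted sum of values

rng = np.random.default_rng(3)
T, d, d_k = 6, 16, 8
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) * 0.1 for _ in range(3))
out, w = self_attention(X, W_q, W_k, W_v)
```

Row `w[i]` is position i's attention distribution over all positions; `out[i]` is the corresponding mixture of value vectors.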
Multi-Head Attention¶
A single attention mechanism learns one type of relationship. Multi-head attention runs several attention mechanisms in parallel:
Head 1: Q₁, K₁, V₁ → output₁
Head 2: Q₂, K₂, V₂ → output₂
...
Head N: Qₙ, Kₙ, Vₙ → outputₙ

Concatenate: [output₁, output₂, ..., outputₙ]
Project: W_o × concatenated
Different heads learn different relationships:
- Head 1 might attend to nearby positions
- Head 2 might attend to similar values
- Head 3 might attend to periodically related positions
The Transformer Block¶
A complete transformer block:
Input
  ↓
Multi-Head Self-Attention
  ↓
Add (residual connection) + Layer Normalization
  ↓
Feed-Forward Network (two dense layers)
  ↓
Add (residual connection) + Layer Normalization
  ↓
Output
Stack many blocks (6, 12, 24, or more in large models).
Residual connections let gradients flow directly through the network, enabling very deep architectures.
Positional Encoding¶
Self-attention is permutation-invariant: it doesn't inherently know that position 1 comes before position 2. Order information must be added explicitly.
Sinusoidal encoding (original transformer):
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Different frequencies for different dimensions. Positions get unique signatures, and relative positions can be computed from these encodings.
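The sinusoidal scheme is easy to compute directly; a NumPy sketch (function name and argument names are mine):

```python
import numpy as np

def positional_encoding(max_len, d):
    """Sinusoidal encodings: PE[pos, 2i] = sin(pos / 10000^(2i/d)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d // 2)[None, :]                # (1, d/2) frequency index
    angles = pos / np.power(10000.0, 2 * i / d)   # one frequency per pair of dims
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                  # even dims: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return pe

pe = positional_encoding(50, 32)   # one 32-dim signature per position
```

Each row is added to the corresponding token embedding before the first transformer block, so attention can distinguish positions.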
Learned encodings: Just learn a vector for each position. Works well when maximum sequence length is known.
Encoder-Decoder Transformers¶
For sequence-to-sequence tasks:
Encoder: Self-attention sees entire input. Each position attends to all input positions.
Decoder: Self-attention is masked (positions can only attend to earlier positions, not future). Cross-attention lets decoder positions attend to encoder outputs.
Input Sequence → [Encoder Stack] → Encoded Representations
                                            ↓
                [Decoder Stack with Cross-Attention] → Output Sequence
Encoder-Only (BERT-style)¶
For tasks where you need to understand the input but not generate sequences:
Input → [Transformer Encoder] → Representations → Task-specific head
BERT, RoBERTa, and similar models use this pattern. Fine-tune for classification, extraction, or other tasks.
Decoder-Only (GPT-style)¶
For generation tasks:
Context → [Transformer Decoder] → Next token prediction
GPT models use this pattern. The model predicts the next element based on all previous elements.
Vision Transformers (ViT)¶
Transformers for images:
- Split image into patches (e.g., 16×16 pixels each)
- Flatten each patch into a vector
- Add position encodings
- Process with standard transformer
Image → [Split into patches] → [Linear embedding] → [Add position] → [Transformer] → [Classification head]
This treats an image as a sequence of patches, letting attention learn spatial relationships.
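Splitting an image into flattened patches is a simple reshape; a NumPy sketch for a single-channel image (the function name and patch size are illustrative):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxW image into flattened non-overlapping patches (ViT-style)."""
    H, W = image.shape
    assert H % patch == 0 and W % patch == 0, "image must divide evenly"
    p = image.reshape(H // patch, patch, W // patch, patch)
    # Bring the patch-grid axes together, then flatten each patch to a vector.
    return p.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
tokens = patchify(img, patch=16)   # 4x4 grid of patches, each a 256-dim vector
```

Each row of `tokens` then gets a linear embedding plus a position encoding and enters the transformer exactly like a word token.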
Strengths¶
Parallelization: Unlike RNNs, all positions can be computed simultaneously. Training is much faster on GPUs.
Long-range dependencies: Every position directly attends to every other. No information bottleneck.
Scalability: Transformers scale well. Larger models, more data, more compute generally means better performance.
State-of-the-art: Transformers dominate language, increasingly dominate vision, and excel in many domains.
Flexibility: Same architecture works for language, images, audio, and more with minimal modification.
Weaknesses¶
Quadratic complexity: Self-attention compares all pairs of positions. For sequence length N, complexity is O(N²). Very long sequences become expensive.
Data hungry: Transformers typically need more training data than CNNs or RNNs to achieve good performance.
Compute hungry: Large transformers require substantial GPU resources for training and inference.
Position encoding limitations: Learned position encodings don't generalize beyond training length. Sinusoidal encodings help but aren't perfect.
Less inductive bias: Transformers make fewer assumptions about data structure. This flexibility means they need to learn structure from data rather than having it built in.
For Your Telescope Array¶
Good for: Complex sequences where long-range dependencies matter. Multi-modal data fusion. Tasks where CNNs or RNNs underperform.
Specific applications:
Advanced light curve analysis: Transformers can capture long-range periodicity, complex variability patterns, and subtle correlations that RNNs miss.
Multi-site data fusion: Treat observations from different sites as sequence elements. Attention learns which observations to weight more heavily, how to combine information across sites.
[Obs_Site_A, Obs_Site_B, Obs_Site_C, ...] → [Transformer] → Fused Representation
Catalog cross-matching: Given entries from multiple catalogs, transformer attention learns which entries correspond to the same object.
Vision Transformer for images: For challenging image classification tasks where CNNs plateau, ViT might push further (with sufficient data).
Multimodal understanding: Combine image features and light curve features in a single transformer. Attention learns relationships between visual appearance and temporal behavior.
Example architecture for multi-site data fusion:
Inputs: Observations from N sites, each represented as a vector
[obs_1, obs_2, ..., obs_N] where obs_i includes: image embedding, quality metrics, timestamp, site ID embedding
Positional encoding: Site embeddings rather than sequence positions
Transformer Encoder (4 layers, 8 attention heads, 256 dimensions)
Each observation attends to all others
Learns which sites to weight, how to combine
Global pooling or CLS token
Output: Fused representation
Task heads:
- Classification head: Dense β class probabilities
- Quality estimation head: Dense β expected quality of combined result
- Uncertainty head: Dense β confidence bounds
Autoencoders: Learning Compression¶
What They Are¶
Networks that learn to compress data to a smaller representation, then reconstruct the original. Not for prediction, but for representation learning.
The Core Insight¶
If a network can compress data to a small representation and reconstruct it accurately, that small representation must capture the essential information. What's lost is presumably noise or irrelevant detail.
Architecture¶
Input → [Encoder] → Bottleneck (small) → [Decoder] → Reconstruction
(high-dimensional    (low-dimensional       (high-dimensional
 input)               code/latent)           output)
Encoder: Compresses input to bottleneck. Typically uses convolutions (for images) or dense layers.
Bottleneck: The compressed representation. Much smaller than input (e.g., 256×256 image → 128 numbers).
Decoder: Reconstructs input from bottleneck. Mirror of encoder architecture.
Loss: Reconstruction error, typically mean squared error between input and output.
Variational Autoencoders (VAEs)¶
Standard autoencoders learn a deterministic mapping. VAEs learn a probabilistic one.
Instead of encoding to a single point, VAE encodes to a distribution (mean and variance):
Input → [Encoder] → (μ, σ) → Sample z ~ N(μ, σ) → [Decoder] → Reconstruction
Loss includes:
- Reconstruction error
- KL divergence between learned distribution and prior (regularizes latent space)
VAEs have smoother latent spaces. You can sample from the prior and generate realistic outputs.
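Two pieces of the VAE recipe are easy to show concretely: the reparameterization trick (sample via z = μ + σ·ε so the sampling step stays differentiable) and the closed-form KL divergence to a standard normal prior for a diagonal Gaussian. A NumPy sketch with illustrative names:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, sigma) as z = mu + sigma * eps, eps ~ N(0, 1)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

rng = np.random.default_rng(4)
mu, log_var = np.zeros(8), np.zeros(8)   # encoder outputs for one input
z = reparameterize(mu, log_var, rng)     # latent sample fed to the decoder
kl = kl_to_standard_normal(mu, log_var)  # 0 when posterior equals the prior
```

The training loss is reconstruction error plus this KL term (often with a weighting factor), which is what pulls the latent space toward the smooth, sampleable prior.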
Uses of Autoencoders¶
Dimensionality reduction: The bottleneck representation is a compressed version of input. Useful for visualization, clustering, or as input to other models.
Denoising: Train autoencoder on noisy inputs with clean targets. It learns to remove noise.
Anomaly detection: Train on normal data. Anomalies reconstruct poorly (high error).
Generation: VAEs (and related models) can generate new samples by decoding random latent vectors.
Strengths¶
Unsupervised: Don't need labels. Just need examples of normal data.
Representation learning: Learn useful features without explicit supervision.
Anomaly detection: Natural fit for finding unusual objects.
Compression: Learned compression can outperform hand-designed methods.
Weaknesses¶
Reconstruction focus: Optimizing reconstruction might not produce representations useful for downstream tasks.
Mode collapse: Can learn to ignore some input variation, reconstructing only "average" outputs.
Blurry outputs: VAEs especially tend to produce blurry reconstructions, averaging over uncertainty.
Hyperparameter sensitivity: Bottleneck size, architecture choices significantly affect results.
For Your Telescope Array¶
Good for: Anomaly detection. Data compression. Finding unusual objects. Learning representations without labels.
Specific applications:
Anomaly detection: Train autoencoder on normal telescope images. High reconstruction error flags unusual images for human review.
Training:   Normal images → Autoencoder → Minimize reconstruction error
Deployment: New image → Autoencoder → Measure reconstruction error
            If error > threshold: Flag as anomalous
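The scoring and thresholding logic can be sketched in NumPy. Here `reconstruct` is a toy stand-in for a trained autoencoder's forward pass, and all data is synthetic, just to show how a threshold is set from normal validation data.

```python
import numpy as np

def anomaly_scores(images, reconstruct):
    """Per-image mean squared reconstruction error."""
    return np.array([np.mean((img - reconstruct(img)) ** 2) for img in images])

def pick_threshold(val_scores, false_positive_rate=0.01):
    """Choose a threshold so ~1% of normal validation images get flagged."""
    return np.quantile(val_scores, 1.0 - false_positive_rate)

rng = np.random.default_rng(5)
reconstruct = lambda img: img * 0.9              # hypothetical trained model
normal = rng.normal(1.0, 0.1, size=(200, 8, 8))  # synthetic "normal" frames
threshold = pick_threshold(anomaly_scores(normal, reconstruct))

weird = rng.normal(5.0, 1.0, size=(8, 8))        # out-of-distribution frame
flagged = anomaly_scores([weird], reconstruct)[0] > threshold
```

Setting the threshold from a quantile of normal-data scores directly controls the expected false positive rate, which matters when humans review every flag.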
Compression for transmission: Train autoencoder to compress images. Send only bottleneck codes from remote sites, decode centrally. Lossy but much smaller.
Unknown object discovery: Cluster objects in latent space. Objects far from known clusters might be new types.
Quality-aware compression: Train autoencoder with quality-weighted loss. Preserve important regions (sources) more than background.
Example anomaly detection system:
Convolutional Autoencoder:
Encoder:
Conv(32, 3×3) → ReLU → Pool(2×2)    # 256 → 128
Conv(64, 3×3) → ReLU → Pool(2×2)    # 128 → 64
Conv(128, 3×3) → ReLU → Pool(2×2)   # 64 → 32
Conv(256, 3×3) → ReLU → Pool(2×2)   # 32 → 16
Flatten → Dense(512) → Dense(128) → Bottleneck
Decoder (mirror of encoder):
Dense(512) → Dense(16×16×256) → Reshape
Upsample(2×2) → Conv(128, 3×3) → ReLU   # 16 → 32
Upsample(2×2) → Conv(64, 3×3) → ReLU    # 32 → 64
Upsample(2×2) → Conv(32, 3×3) → ReLU    # 64 → 128
Upsample(2×2) → Conv(1, 3×3) → Output   # 128 → 256
Loss: Mean squared error
Anomaly score: Reconstruction error per image
Threshold: Set from validation data to achieve desired false positive rate
Graph Neural Networks: Relational Intelligence¶
What They Are¶
Networks designed for data naturally represented as graphs: nodes connected by edges. Where CNNs exploit spatial structure and RNNs exploit temporal structure, GNNs exploit relational structure.
The Core Insight¶
Many astronomical phenomena involve relationships:
- Stars in clusters are related
- Galaxies in groups interact
- Observations of the same object are connected
- Telescope sites share information
Graphs naturally represent these relationships. GNNs learn to use relational structure.
Graph Representation¶
A graph consists of:
- Nodes: Entities (stars, galaxies, observations, telescopes)
- Edges: Relationships between nodes (physical proximity, causal connection, same object)
- Node features: Attributes of each node (brightness, color, position)
- Edge features: Attributes of each relationship (distance, time difference, strength)
Message Passing: The Core Operation¶
GNNs work by passing messages between connected nodes:
For each node:
1. Gather messages from neighbors
2. Aggregate messages (sum, mean, max, or learned aggregation)
3. Update node representation based on current state + aggregated messages
After several rounds of message passing, each node's representation incorporates information from its neighborhood.
Round 1: Each node knows about immediate neighbors
Round 2: Each node knows about neighbors-of-neighbors
Round 3: Information from 3-hop neighborhood
...
Mathematical Formulation¶
Basic message passing:
m[i] = Aggregate({h[j] : j ∈ Neighbors(i)})
h'[i] = Update(h[i], m[i])
Where:
- h[i] is node i's representation
- m[i] is aggregated message for node i
- Aggregate is a permutation-invariant function (sum, mean, max)
- Update combines current state with message (typically neural network)
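A bare-bones NumPy sketch of one message-passing round on a tiny path graph (0 – 1 – 2). The mean aggregation matches the formulation above; the Update here is a fixed tanh stand-in for what a real GNN would learn.

```python
import numpy as np

def message_passing_round(H, neighbors):
    """One round: mean-aggregate neighbor states, then update each node."""
    H_new = np.empty_like(H)
    for i in range(len(H)):
        if neighbors[i]:
            m = np.mean(H[neighbors[i]], axis=0)   # permutation-invariant
        else:
            m = np.zeros_like(H[i])
        H_new[i] = np.tanh(H[i] + m)               # stand-in Update function
    return H_new

# Path graph 0 - 1 - 2: after two rounds, node 0 sees node 2's information.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
H = np.array([[1.0, 0.0],   # node 0 starts with feature in dim 0
              [0.0, 0.0],   # node 1 starts blank
              [0.0, 1.0]])  # node 2 starts with feature in dim 1
for _ in range(2):
    H = message_passing_round(H, neighbors)
```

After round 1, node 0 only knows about node 1; after round 2, node 2's dimension-1 signal has propagated through node 1 into node 0, illustrating the growing receptive field.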
Common architectures:
Graph Convolutional Network (GCN):
H' = σ(D^(-1/2) A D^(-1/2) H W)
Where A is adjacency matrix, D is degree matrix, H is node features, W is learnable weights.
Graph Attention Network (GAT): Use attention to weight neighbor contributions differently.
GraphSAGE: Sample and aggregate neighbors, enabling mini-batch training on large graphs.
Strengths¶
Natural for relational data: Directly encodes relationships. No need to flatten graph structure into vectors.
Flexible structure: Works on graphs of any size and topology. Adapts to varying numbers of neighbors.
Inductive: Can generalize to unseen nodes/graphs if features are meaningful.
Combines information: Learns how to aggregate information from related entities.
Weaknesses¶
Scalability: Very large graphs (millions of nodes) require sophisticated sampling or approximation.
Oversmoothing: Many message-passing rounds make all node representations similar. Deep GNNs are harder to train.
Edge definition: Results depend on how you define graph structure. Wrong edges hurt performance.
Less mature: GNNs are newer than CNNs/RNNs. Fewer established best practices.
For Your Telescope Array¶
Good for: Modeling relationships between objects, sites, or observations. Catalog analysis. Network coordination.
Specific applications:
Star cluster analysis: Nodes are stars, edges connect probable cluster members. GNN learns cluster membership, identifies interlopers.
Galaxy group finding: Nodes are galaxies, edges from proximity or velocity similarity. GNN identifies group memberships, predicts properties.
Multi-observation fusion: Nodes are observations of the same target (different times, sites, instruments). Edges connect same-object observations. GNN learns optimal combination.
Graph structure:
Nodes: Individual observations
Edges: Same object, temporal proximity, or spatial proximity
Node features: Measurement values, quality metrics, metadata
Edge features: Time difference, site pair, conditions similarity
GNN:
Message passing learns how to weight and combine observations
Output: Fused estimate for each unique object
Telescope network optimization: Nodes are telescope sites, edges connect sites with complementary capabilities. GNN learns coordination patterns, recommends resource allocation.
Anomaly detection in context: When detecting anomalies, consider relationships. A star that's anomalous in isolation might be normal given its cluster context. GNN incorporates context.
Example architecture for multi-observation fusion:
Graph construction:
For each unique object, create nodes for all observations
Connect observations with edges (fully connected or based on relevance)
Node features (per observation):
- Measured values (magnitudes, colors, etc.)
- Uncertainty estimates
- Observation quality metrics
- Site identifier (embedded)
- Time of observation
Edge features:
- Time difference
- Site pair identifier
- Condition similarity score
GNN architecture:
GraphSAGE with 3 message-passing layers
Hidden dimension: 128
Aggregation: Attention-weighted mean
After message passing:
Global pooling across all nodes for this object
Dense layers for final estimate
Output:
Fused measurement estimate
Uncertainty bounds
Outlier flags for individual observations
Generative Models: Creating New Data¶
What They Are¶
Models that learn to generate new samples resembling training data. Instead of classifying or predicting, they create.
Generative Adversarial Networks (GANs)¶
Two networks in competition:
Generator: Takes random noise, produces fake samples.
Discriminator: Tries to distinguish real from fake samples.
Training is adversarial:
- Discriminator improves at detecting fakes
- Generator improves at fooling discriminator
- At equilibrium, generator produces samples discriminator can't distinguish from real
Random noise z → [Generator] → Fake sample
                                    ↓
                            [Discriminator] → Real or Fake?
                                    ↑
                               Real sample
Loss functions:
Discriminator: maximize log(D(real)) + log(1 - D(G(z)))
Generator: maximize log(D(G(z))) (or minimize log(1 - D(G(z))))
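The two objectives are just log-probability sums over discriminator outputs; a NumPy sketch of evaluating them (function names are mine, and `eps` guards against log(0)):

```python
import numpy as np

def d_objective(d_real, d_fake, eps=1e-8):
    """Discriminator objective (maximized): log D(real) + log(1 - D(fake))."""
    return np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def g_objective(d_fake, eps=1e-8):
    """Non-saturating generator objective (maximized): log D(fake)."""
    return np.mean(np.log(d_fake + eps))

# A sharp discriminator (real -> 0.99, fake -> 0.01) scores far better
# than a fooled one outputting 0.5 for everything.
sharp = d_objective(np.array([0.99]), np.array([0.01]))
fooled = d_objective(np.array([0.5]), np.array([0.5]))
```

The non-saturating form log D(G(z)) is preferred in practice because minimizing log(1 - D(G(z))) gives the generator vanishing gradients early in training, when the discriminator easily rejects its samples.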
Diffusion Models¶
Currently state-of-the-art for image generation.
Forward process: Gradually add noise to real data until it's pure noise.
Reverse process: Learn to gradually remove noise, recovering data from noise.
Real image → [Add noise] → [Add noise] → ... → Pure noise
Pure noise → [Denoise] → [Denoise] → ... → Generated image
The denoising network learns to predict and remove noise at each step. Many small denoising steps produce high-quality samples.
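The forward (noising) process has a convenient closed form: for DDPM-style models you can jump straight to step t as x_t = √(ᾱ_t)·x₀ + √(1 − ᾱ_t)·ε. A NumPy sketch with an illustrative linear noise schedule (the reverse, learned denoiser is not shown):

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar, rng):
    """Jump directly to noise level t: keeps sqrt(abar_t) of the signal."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # per-step noise amounts (illustrative)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative fraction of signal retained

rng = np.random.default_rng(6)
x0 = rng.normal(size=(32, 32))                        # stand-in "image"
x_mid = forward_diffuse(x0, 500, alpha_bar, rng)      # partly noised
x_end = forward_diffuse(x0, T - 1, alpha_bar, rng)    # nearly pure noise
```

Training then amounts to picking random (x₀, t) pairs, noising them this way, and teaching the network to predict the added noise.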
Uses in Astronomy¶
Data augmentation: Generate synthetic training examples, especially for rare classes.
Simulation: Generate realistic synthetic observations to test pipelines.
Super-resolution: Generate high-resolution images from low-resolution inputs.
Inpainting: Fill in missing or corrupted regions of images.
Conditional generation: Generate images matching specific properties (galaxy with certain morphology, star with certain spectrum).
For Your Telescope Array¶
Specific applications:
Training data generation: Have few examples of rare transients? Train a generative model on what you have, generate more for classifier training.
Pipeline testing: Generate realistic synthetic observations to stress-test processing pipelines before real data arrives.
Data recovery: Inpaint satellite trails, cosmic rays, or bad pixels in otherwise good observations.
Prediction: Given current conditions and recent observations, generate predictions of what observations will look like in near future.
Architecture Selection Guide for Your Project¶
Let me be concrete about which architecture to use for each component of your distributed telescope system.
At Individual Telescope Sites¶
| Task | Architecture | Rationale |
|---|---|---|
| Frame quality assessment | Lightweight CNN | Fast inference, spatial patterns matter, proven performance |
| Real-time transient detection | CNN + threshold | Need speed, looking for spatial signatures |
| Basic source detection | U-Net (CNN variant) | Semantic segmentation task, well-established |
| Quick classification | Small CNN or feedforward from features | Speed critical, accuracy secondary |
| Equipment anomaly detection | Autoencoder | Unsupervised, learns normal behavior |
At Central Coordination¶
| Task | Architecture | Rationale |
|---|---|---|
| Deep image classification | ResNet/EfficientNet CNN or ViT | Accuracy matters, have compute resources |
| Light curve classification | Transformer or LSTM | Sequential data with long-range dependencies |
| Multi-site data fusion | Transformer or GNN | Relating multiple inputs, flexible attention |
| Scheduling optimization | Reinforcement learning (various) | Sequential decision-making |
| Catalog cross-matching | GNN or Transformer | Relational structure matters |
| Anomaly detection at scale | Autoencoder + clustering | Find unknowns in large datasets |
| Multi-modal analysis | Transformer | Naturally handles multiple input types |
Decision Flowchart¶
Is your data...?
├── Images (2D spatial)
│   ├── Classification/detection → CNN (ResNet, EfficientNet)
│   ├── Segmentation → U-Net, DeepLab
│   ├── Very complex patterns → Vision Transformer (if enough data)
│   └── Need speed → MobileNet, lightweight CNN
│
├── Sequences (time series)
│   ├── Short sequences (<100 steps) → LSTM or GRU
│   ├── Long sequences (>100 steps) → Transformer
│   ├── Real-time streaming → LSTM with online updates
│   └── Bidirectional context available → Bidirectional LSTM or Transformer
│
├── Tabular (features/measurements)
│   ├── Clear features → XGBoost/LightGBM (often beats neural networks)
│   ├── Need neural network → Feedforward
│   └── Complex interactions → Feedforward with more layers
│
├── Graph (relational)
│   └── Use GNN (GraphSAGE, GAT)
│
├── Multiple modalities (images + sequences + tabular)
│   └── Transformer (or separate encoders feeding shared transformer)
│
└── Unlabeled data
    ├── Want compression/representation → Autoencoder
    ├── Want anomaly detection → Autoencoder or isolation forest
    └── Want to generate samples → GAN or diffusion model
Hybrid Architectures for Your System¶
Real systems often combine architectures:
CNN + LSTM for video or image sequences:
Frame 1 → [CNN] → features[1] ─┐
Frame 2 → [CNN] → features[2] ─┼→ [LSTM] → Sequence classification
Frame 3 → [CNN] → features[3] ─┘
Use CNN to extract per-frame features, LSTM to model temporal evolution.
CNN + Transformer for multi-site fusion:
Site A image → [CNN] → embedding_A ─┐
Site B image → [CNN] → embedding_B ─┼→ [Transformer] → Fused result
Site C image → [CNN] → embedding_C ─┘
Use CNN to extract site-specific features, transformer to learn optimal combination.
Autoencoder + Classifier for semi-supervised learning:
Labeled + Unlabeled data → [Autoencoder] → Latent representations
Latent representations + Labels → [Classifier] → Predictions
Use autoencoder to learn representations from all data (including unlabeled), classifier on top using labels.
Summary Comparison Table¶
| Architecture | Best For | Input Type | Strengths | Weaknesses | Your Use Cases |
|---|---|---|---|---|---|
| Feedforward | Tabular data, simple tasks | Fixed-size vectors | Simple, fast, universal | No structure awareness | Feature-based classification, final layers |
| CNN | Images, spatial data | 2D/3D grids | Translation invariance, hierarchical features | Fixed input size, local focus | Image quality, source detection, morphology |
| RNN/LSTM | Sequences, time series | Variable-length sequences | Temporal modeling, memory | Sequential (slow), limited range | Light curves, streaming data |
| Transformer | Long sequences, multi-modal | Any (with encoding) | Parallelizable, long-range, flexible | Quadratic complexity, data hungry | Complex light curves, data fusion |
| Autoencoder | Compression, anomaly detection | Any | Unsupervised, learns representations | Reconstruction-focused | Anomaly detection, compression |
| GNN | Relational data, graphs | Graphs | Models relationships | Scaling, oversmoothing | Cluster analysis, observation fusion |
| GAN/Diffusion | Data generation | Any | Creates new samples | Training instability | Data augmentation, simulation |
This should give you a complete understanding of how each architecture works, what it's suited for, and exactly where each fits into your distributed telescope network. The key is matching architecture assumptions to your data's structure and your task's requirements.