The Complete Deep Dive: Machine Learning in Astronomy
Let me take you through absolutely everything, from the mathematical foundations to the cutting-edge applications for your distributed telescope array.
Part 1: The Foundations of Machine Learning
What's Actually Happening Under the Hood
When we say a computer "learns," we're being a bit poetic. What's really happening is mathematical optimization. Let me break this down completely.
The Core Concept: Functions and Parameters
Imagine you have a function: a mathematical machine that takes inputs and produces outputs. For astronomy:
- Input: Raw pixel values from a telescope image (maybe 1 million numbers representing brightness at each point)
- Output: A classification like "spiral galaxy" or "elliptical galaxy"
The function has parameters: adjustable knobs that change how it behaves. A simple function might have 10 parameters. Modern deep learning models have billions.
Learning means finding the parameter values that make the function produce correct outputs for known examples. Once found, the function can (hopefully) produce correct outputs for new examples it's never seen.
The Three Types of Machine Learning
Supervised Learning: You have labeled examples. "Here's an image, and I'm telling you it's a spiral galaxy." The algorithm learns to predict labels from inputs.
- Training data: Images paired with correct classifications
- Goal: Predict correct labels for new, unseen images
- Astronomy uses: Galaxy classification, stellar property prediction, transient detection
Unsupervised Learning: No labels. You just have data and want to find structure.
- Training data: Just images, no labels
- Goal: Discover patterns, groupings, or anomalies
- Astronomy uses: Finding new types of objects, clustering similar stars, discovering outliers
Reinforcement Learning: The algorithm takes actions and gets rewards or penalties.
- Training: Trial and error with feedback
- Goal: Learn optimal behavior
- Astronomy uses: Telescope scheduling, adaptive optics control, observation prioritization
The Mathematics (As Gently As Possible)
Linear Regression: The Simplest ML
Suppose you want to predict a star's temperature from its color. The simplest model:
Temperature = w₁ × (blue brightness) + w₂ × (red brightness) + b
Here, w₁, w₂, and b are parameters. Learning means finding values that minimize prediction errors across your training data.
The "error" (called loss) might be the average squared difference between predicted and actual temperatures:
Loss = average of (predicted - actual)²
We find the best parameters using gradient descent: start with random values, calculate which direction to adjust them to reduce the loss, take a small step in that direction, repeat thousands of times.
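Here's a minimal sketch of that loop in plain NumPy. The two-parameter temperature model mirrors the one above, but the data and the "true" coefficients (3000, 1500, 800) are synthetic, invented so we can check that gradient descent recovers them:

```python
# Gradient descent for: Temperature = w1*(blue) + w2*(red) + b
import numpy as np

rng = np.random.default_rng(0)
blue = rng.uniform(0.5, 2.0, 200)
red = rng.uniform(0.5, 2.0, 200)
temp = 3000 * blue + 1500 * red + 800    # invented "true" relation

w1, w2, b = 0.0, 0.0, 0.0                # start with (bad) initial values
lr = 0.05                                # step size
for _ in range(200000):
    pred = w1 * blue + w2 * red + b      # predict with current parameters
    err = pred - temp                    # compare to correct answers
    w1 -= lr * 2 * np.mean(err * blue)   # step each parameter downhill
    w2 -= lr * 2 * np.mean(err * red)
    b -= lr * 2 * np.mean(err)

print(round(w1), round(w2), round(b))    # recovers 3000 1500 800
```

The three update lines are the gradient of the squared-error loss with respect to each parameter; everything else is bookkeeping.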
Neural Networks: Stacking Complexity
A neural network is just this simple idea repeated and stacked:
Layer 1: Takes raw inputs, applies weights, produces intermediate values
Layer 2: Takes Layer 1's outputs, applies different weights, produces new intermediate values
... more layers ...
Final Layer: Produces the prediction
Each layer can learn different features:
- Layer 1 might learn to detect edges in an image
- Layer 2 might combine edges into shapes
- Layer 3 might recognize that certain shapes indicate spiral arms
- Final layer decides "spiral galaxy"
The "deep" in deep learning just means many layers (sometimes hundreds).
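To make "stacking" concrete, here's a forward pass through a tiny two-layer network in NumPy. The sizes (4 inputs, 8 hidden units, 3 classes) are arbitrary, and the weights are random rather than trained, so the output probabilities are meaningless; the point is the shape of the computation:

```python
# Forward pass of a tiny two-layer network in plain NumPy
import numpy as np

def relu(z):
    return np.maximum(0.0, z)        # activation: introduces non-linearity

rng = np.random.default_rng(42)
x = rng.normal(size=4)               # stand-in for 4 input "pixels"
W1 = rng.normal(size=(8, 4))         # layer 1: 4 inputs -> 8 features
W2 = rng.normal(size=(3, 8))         # layer 2: 8 features -> 3 class scores

h = relu(W1 @ x)                     # layer 1 output (the "edges")
scores = W2 @ h                      # final layer (class scores)
probs = np.exp(scores) / np.exp(scores).sum()   # softmax -> probabilities
print(probs.sum())                   # probabilities sum to 1
```

A deep network is just more `W @ h` plus activation steps between input and output.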
Why This Works for Astronomy
Astronomical data has hierarchical structure:
- Pixels combine into features (bright spots, dark regions)
- Features combine into structures (spiral arms, central bulges)
- Structures combine into object types (spiral galaxy, elliptical galaxy)
Neural networks naturally learn these hierarchies.
Types of Neural Networks in Astronomy
Convolutional Neural Networks (CNNs)
Perfect for images. Instead of treating each pixel independently, CNNs look at small patches and learn local patterns.
Imagine sliding a small window across a telescope image:
- The window might learn to recognize "this pattern of pixels looks like a point source"
- Another window learns "this gradient pattern suggests a galaxy edge"
- These combine into higher-level detections
Astronomy applications:
- Galaxy morphology classification
- Identifying gravitational lenses
- Detecting transients (supernovae, asteroids)
- Separating stars from galaxies
- Finding image artifacts
Recurrent Neural Networks (RNNs) and Transformers
Perfect for sequences. Astronomical data often comes as time series: brightness measurements over time.
RNNs process data sequentially, maintaining "memory" of what came before. They can learn patterns like:
- "This star's brightness dips periodically; probably an eclipsing binary"
- "This brightness curve shape indicates a Type Ia supernova"
- "This radio signal has a characteristic pulsar signature"
Transformers (the architecture behind ChatGPT) are newer and can find relationships across very long sequences. They're increasingly used for:
- Analyzing years of photometric data
- Finding periodic signals with irregular spacing
- Cross-matching observations across time
Autoencoders
These learn to compress and reconstruct data. Train them on normal telescope images; they learn what "normal" looks like. When they fail to reconstruct something, that's interesting!
Astronomy applications:
- Anomaly detection (finding weird objects)
- Noise reduction (learn to reconstruct clean images)
- Data compression (critical for your distributed array!)
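The compress-and-reconstruct idea can be sketched with a linear autoencoder (mathematically, a PCA projection) in NumPy; a real system would use a deep network, but the flag-by-reconstruction-error logic is identical. All data here is simulated:

```python
# Anomaly detection by reconstruction error, using a linear "autoencoder"
# (a rank-3 projection learned from normal data, i.e. PCA).
import numpy as np

rng = np.random.default_rng(0)
# 500 "normal" frames: 20-pixel vectors lying near a 3-dimensional subspace
basis = rng.normal(size=(3, 20))
normal = rng.normal(size=(500, 3)) @ basis + 0.05 * rng.normal(size=(500, 20))

# "Train": learn the top-3 principal directions of the normal data
mean = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
encode_decode = Vt[:3].T @ Vt[:3]          # project onto the learned subspace

def reconstruction_error(frame):
    centered = frame - mean
    recon = centered @ encode_decode       # encode -> decode
    return np.sqrt(np.mean((centered - recon) ** 2))

threshold = max(reconstruction_error(f) for f in normal)

anomaly = rng.normal(size=20) * 5.0        # a frame unlike the training data
print(reconstruction_error(anomaly) > threshold)   # flagged as anomalous
```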
Generative Models (GANs, Diffusion Models)
These learn to create realistic data. Train on real galaxy images, and they can generate synthetic galaxies.
Astronomy applications:
- Generating training data for rare events
- Testing analysis pipelines
- Simulating what observations should look like
- Super-resolution (enhancing image detail)
Part 2: Current Astronomy Applications in Detail
Galaxy Classification at Scale
The Problem: Modern surveys like SDSS have imaged hundreds of millions of galaxies. Human classification is impossible at this scale.
The ML Solution: Train a CNN on galaxies that humans have classified (the Galaxy Zoo project provided millions of human classifications). The network learns to recognize:
- Spiral vs. elliptical morphology
- Presence of bars, rings, or tidal features
- Signs of mergers or interactions
- Active galactic nuclei (AGN) signatures
Current State-of-the-Art:
- Accuracy exceeds 95% for basic morphology
- Can classify a million galaxies in hours
- Now being extended to fine-grained features
- Some models identify structures humans miss
For Your Distributed Array: Even a small-scale version could classify objects in real-time, flagging interesting morphologies for follow-up across your network.
Transient Detection
The Problem: Some astronomical events last only hours or days: supernovae, gamma-ray burst afterglows, gravitational wave counterparts, asteroids. You need to find them fast.
The ML Pipeline:
- Image Subtraction: Compare new images to reference images
- Candidate Detection: Find things that changed
- ML Classification: Is this a real transient or an artifact?
The classification step is crucial. Most "changes" are:
- Cosmic ray hits on the detector
- Satellite trails
- Bad pixels
- Atmospheric artifacts
- Subtraction errors
ML learns to distinguish real astrophysical transients from garbage.
Real-Time Systems:
- ZTF (Zwicky Transient Facility) processes millions of candidates nightly
- ML cuts false positives by >99%
- Interesting candidates trigger automatic follow-up within minutes
For Your Array: A trained transient detector could alert when any telescope sees something unusual, triggering coordinated follow-up across your entire network within seconds.
Stellar Spectroscopy
The Problem: A star's spectrum (how its light splits into different colors) encodes everything: temperature, composition, velocity, age. But traditional analysis is slow.
The ML Approach: Train on stars with known properties (from detailed physics analysis), then predict properties for millions of other stars instantly.
What ML Learns:
- Which absorption lines indicate which elements
- How line shapes encode temperature and pressure
- Doppler shifts revealing motion
- Age-related abundance patterns
Current Capabilities:
- Predict 20+ stellar parameters from a single spectrum
- Process millions of spectra in minutes
- Precision approaching physics-based methods
- Can identify chemically peculiar stars automatically
Exoplanet Detection
The Problem: Finding planets around other stars means detecting tiny signals: small brightness dips (transits) or subtle wobbles (radial velocity).
ML Techniques:
For Transits:
- Distinguish planet transits from stellar variability, eclipsing binaries, or instrumental effects
- Learn the characteristic shapes of planet transits
- Identify multi-planet systems from overlapping signals
For Radial Velocity:
- Separate planetary signals from stellar activity
- Handle multiple overlapping planetary signatures
- Distinguish planets from stellar pulsations
Kepler/TESS Results: ML has found thousands of planet candidates that traditional methods missed, including some in the habitable zone.
Gravitational Lens Finding
The Problem: Gravitational lenses, where massive objects bend light from background sources, are rare and scientifically valuable. Finding them in millions of images is hard.
Why ML Excels: Lenses have characteristic signatures (arcs, Einstein rings, multiple images) that CNNs learn to recognize even when faint or distorted.
Current Systems:
- Survey thousands of square degrees automatically
- Find lens candidates with >90% accuracy
- Have discovered hundreds of new lenses
- Some found lenses humans missed
Radio Astronomy
Unique Challenges:
- Data volumes are enormous (petabytes per day for SKA)
- Interference from human sources (satellites, phones, etc.)
- Complex imaging from antenna arrays
ML Applications:
- Real-time RFI (radio frequency interference) flagging
- Source detection and classification
- Fast radio burst detection
- Pulsar searching
- Image reconstruction
Part 3: Your Distributed Telescope Array (The Complete ML Architecture)
Now let's design a comprehensive ML system for your specific project.
The Data Challenge
With multiple geographically distributed telescopes, you're dealing with:
- Volume: Each telescope generates gigabytes nightly
- Velocity: Data arrives continuously from all sites
- Variety: Different weather, different equipment quirks, different calibrations
- Veracity: How do you know which data to trust?
ML addresses all of these.
Layer 1: Per-Telescope Intelligence
Each telescope site runs local ML systems:
Real-Time Quality Assessment
A trained model continuously evaluates incoming frames:
Input: Raw telescope frame
Output: Quality score (0-100) + issue flags
Issues detected:
- Cloud coverage percentage
- Atmospheric seeing estimate
- Tracking errors
- Focus problems
- Sensor issues (hot pixels, columns)
- Satellite/plane trails
Training Data: Historical frames labeled by quality, weather conditions, resulting science output.
Action: Bad frames immediately flagged; severe issues trigger alerts or automatic system adjustments.
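A toy version of such a scorer, assuming (hypothetically) that each frame has already been reduced to two summary features, cloud fraction and seeing, trained as a simple logistic regression on simulated historical labels:

```python
# A toy frame-quality scorer: logistic regression on two summary features.
# The features, labels, and thresholds are all simulated for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 400
cloud = rng.uniform(0.0, 1.0, n)              # cloud fraction per frame
seeing = rng.uniform(0.5, 4.0, n)             # seeing in arcsec
good = ((cloud < 0.4) & (seeing < 2.5)).astype(float)   # "historical" labels

X = np.column_stack([cloud, seeing, np.ones(n)])
w = np.zeros(3)
for _ in range(5000):                          # gradient descent on log-loss
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - good) / n

def quality_score(cloud_frac, seeing_arcsec):
    z = w @ [cloud_frac, seeing_arcsec, 1.0]
    return 100 / (1 + np.exp(-z))              # 0-100 quality score

print(quality_score(0.05, 1.0) > quality_score(0.9, 3.5))   # True
```

A production system would learn directly from pixels with a CNN, but the output contract (frame in, 0-100 score out) is the same.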
Local Anomaly Detection
An autoencoder trained on "normal" observations:
Normal flow:
Frame → Encode → Compressed representation → Decode → Reconstructed frame
If reconstruction error > threshold:
Flag as anomaly for immediate review
This catches:
- Sudden transients
- Equipment malfunctions
- Unusual atmospheric events
- Potential discoveries
Edge Computing Benefits
Running ML locally means:
- Instant response (no network latency)
- Reduced data transfer (only interesting data goes to central)
- Resilience (works even if network fails)
- Bandwidth savings (critical for remote sites)
Layer 2: Cross-Site Coordination
A central ML system coordinates your entire network:
Intelligent Scheduling
This is a reinforcement learning problem. The system learns to maximize scientific output by:
State:
- Current conditions at each site
- Queue of observation requests
- Recent data quality from each telescope
- Astronomical event predictions
- Maintenance schedules
Actions:
- Assign targets to specific telescopes
- Coordinate multi-site observations
- Trigger follow-up observations
- Adjust exposure times
Reward:
- Scientific value of observations obtained
- Data quality metrics
- Response time to transients
- Network efficiency
Over time, the system learns patterns:
- "Site A produces better data on these targets"
- "Coordinated observations during this window work best"
- "When weather deteriorates at B, shift to C"
Dynamic Resource Allocation
When something interesting happens:
Event: Transient detected at Site A
ML System:
1. Classify transient type (supernova? asteroid? unknown?)
2. Predict evolution (how long will it be visible?)
3. Calculate optimal follow-up strategy
4. Identify which other sites can observe
5. Generate observation commands
6. Prioritize based on scientific value
Result: Within seconds, multiple sites coordinate
Data Fusion Engine
Combining data from multiple sites is non-trivial. Each telescope has:
- Different atmospheric conditions
- Different instrumental responses
- Different pointing accuracies
- Different time synchronization
An ML model learns the optimal combination:
Inputs:
- Frames from Sites A, B, C, D
- Metadata (conditions, calibrations)
- Cross-calibration history
Output:
- Combined image superior to any individual frame
- Uncertainty map
- Flags for inconsistent data
This is similar to how your brain combines two eye images into one 3D perception, but with multiple telescopes.
Layer 3: Science-Ready Processing
Automated Calibration Pipeline
Traditional calibration requires:
- Bias frames (sensor offsets)
- Dark frames (thermal noise)
- Flat fields (sensitivity variations)
- Photometric calibration (brightness scale)
- Astrometric calibration (position mapping)
ML can:
- Learn sensor behavior and predict calibrations
- Identify when calibrations are outdated
- Flag calibration failures automatically
- Cross-calibrate between sites
Source Detection and Classification
For every processed image:
Pipeline:
1. Detect sources (stars, galaxies, artifacts)
2. Classify each source
3. Measure properties (brightness, shape, color)
4. Cross-match with catalogs
5. Flag unknowns or interesting objects
ML models at each step:
- Detection: U-Net or similar segmentation network
- Classification: ResNet or EfficientNet
- Property measurement: Regression networks
- Anomaly flagging: Isolation forests or autoencoders
Automated Science Products
The system can generate:
- Nightly summary reports
- Transient alerts
- Photometric databases
- Astrometric solutions
- Quality metrics
- Science-ready catalogs
All with ML-driven quality control.
Layer 4: Discovery Systems
Unknown Object Discovery
Here's where it gets really exciting. Train ML to know what "normal" objects look like, then find things that don't fit:
Approach 1: Clustering
- Represent each object as a feature vector
- Cluster similar objects together
- Objects far from all clusters are interesting
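Approach 1 can be sketched in a few lines of NumPy with a tiny k-means and a distance threshold; the two "object classes" and the oddball point are simulated:

```python
# Flag outliers: cluster feature vectors, then flag points that sit
# far from every cluster centre.
import numpy as np

rng = np.random.default_rng(1)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2))
oddball = np.array([[2.5, -4.0]])        # unlike either class
X = np.vstack([cluster_a, cluster_b, oddball])

# Tiny k-means (k=2), initialised with two far-apart points
c0 = X[0]
c1 = X[np.argmax(np.linalg.norm(X - c0, axis=1))]
centroids = np.array([c0, c1])
for _ in range(20):
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# Distance of every object to its nearest centroid
nearest = np.linalg.norm(X - centroids[labels], axis=1)
outliers = np.where(nearest > nearest.mean() + 5 * nearest.std())[0]
print(outliers)   # index 200: the oddball
```

Real feature vectors would come from a trained network's embedding rather than two hand-made coordinates, but the flagging logic is the same.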
Approach 2: Anomaly scoring
- Train model on known object classes
- Low confidence predictions = potential new classes
Approach 3: Self-supervised learning
- Let model learn structure without labels
- Deviations from learned structure = anomalies
This is how ML might discover entirely new types of astronomical objects.
Pattern Detection in Time
Over months of operation, your array generates time-series data. ML can find:
- Periodic variables with unusual periods
- Long-term trends
- Correlated variations across different objects
- Quasi-periodic oscillations
- Chaotic behavior
Some of these patterns might reveal new physics.
Part 4: The Technical Implementation
Data Architecture
Site Level:
├── Raw Data Buffer (hours)
├── Quick-Look Processing
├── ML Quality Assessment
├── Local Database (days)
├── Compression/Selection
└── Upload Queue
Central Level:
├── Ingestion Pipeline
├── Data Lake (petabytes)
├── Processing Clusters
├── ML Training Infrastructure
├── Science Databases
└── User Interface/API
ML Infrastructure
Training: You need GPU clusters. Options:
- Cloud (AWS, Google Cloud, Azure): flexible but ongoing cost
- On-premises: high upfront cost but lower long-term cost
- Hybrid: train in the cloud, deploy on-premises
Inference (running trained models): Can run on:
- CPUs for some models
- Edge devices (NVIDIA Jetson) at telescope sites
- Central GPU servers for complex models
Model Management
Over time, you'll have many models:
- Different versions of the same model
- Models for different tasks
- Models trained on different data
You need MLOps:
- Version control for models
- Automated testing (does new model improve results?)
- Deployment pipelines
- Performance monitoring
- Retraining triggers
Practical Code Example: A Galaxy Classifier
Here's what a real implementation might look like:
import numpy as np
import cv2
import torch
import torch.nn as nn
from torchvision import models, transforms
from astropy.io import fits

# Load a pre-trained model and modify it for galaxy classification
class GalaxyClassifier(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        # Use EfficientNet as base (good accuracy/efficiency trade-off)
        self.base = models.efficientnet_b0(pretrained=True)
        # Modify the input layer for astronomical data:
        # single-channel images rather than 3-channel RGB
        self.base.features[0][0] = nn.Conv2d(
            1, 32, kernel_size=3, stride=2, padding=1, bias=False
        )
        # Modify the output layer for our classes
        # (spiral, elliptical, irregular, merger, artifact)
        self.base.classifier[1] = nn.Linear(1280, num_classes)

    def forward(self, x):
        return self.base(x)

# Preprocessing for telescope images
def preprocess_fits(filepath):
    """Load a FITS file and prepare it for the model."""
    with fits.open(filepath) as hdul:
        data = hdul[0].data.astype('float32')
    # Astronomical preprocessing:
    # 1. Handle negative values (common in processed images)
    data = data - data.min()
    # 2. Log stretch (astronomical images have huge dynamic range)
    data = np.log1p(data)
    # 3. Normalize to 0-1
    data = (data - data.min()) / (data.max() - data.min())
    # 4. Resize to the model input size
    data = cv2.resize(data, (224, 224))
    # 5. Add a channel dimension
    data = data[np.newaxis, :, :]
    return torch.tensor(data)

# Training loop sketch (validate() is assumed to return accuracy on val_loader)
def train_galaxy_classifier(model, train_loader, val_loader, epochs=50):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()
    best_accuracy = 0.0
    for epoch in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        # Validate and keep the best model
        val_accuracy = validate(model, val_loader)
        if val_accuracy > best_accuracy:
            best_accuracy = val_accuracy
            torch.save(model.state_dict(), 'best_model.pt')
Real-Time Transient Detection System
import numpy as np
from skimage import measure

class TransientDetector:
    def __init__(self):
        # load_model and AlertBroker are project-specific helpers
        self.classifier = load_model('transient_classifier.pt')
        self.alert_system = AlertBroker()

    def process_frame(self, new_frame, reference_frame):
        # Step 1: Image subtraction
        difference = new_frame - reference_frame
        # Step 2: Find candidates (simple thresholding + morphology)
        candidates = self.find_candidates(difference)
        # Step 3: Classify each candidate
        for candidate in candidates:
            cutout = self.extract_cutout(new_frame, candidate.position)
            # Run through the ML classifier
            probs = self.classifier(cutout)
            # Classes: real_transient, cosmic_ray, bad_pixel, satellite, noise
            if probs['real_transient'] > 0.8:
                # Real detection!
                self.alert_system.send_alert(
                    position=candidate.position,
                    confidence=probs['real_transient'],
                    cutout=cutout,
                    timestamp=new_frame.timestamp
                )

    def find_candidates(self, difference_image):
        """Find potential transient locations."""
        # Threshold at 5 sigma
        threshold = 5 * np.std(difference_image)
        mask = difference_image > threshold
        # Find connected components
        labels = measure.label(mask)
        regions = measure.regionprops(labels)
        return [r for r in regions if self.is_valid_candidate(r)]
Distributed Coordination System
import asyncio

THRESHOLD = 0.5  # example priority cut-off; tune for your science goals

# MLScheduler and DataFusionModel are project-specific components
class TelescopeNetwork:
    def __init__(self, sites):
        self.sites = sites  # list of telescope connections
        self.scheduler = MLScheduler()
        self.data_fusion = DataFusionModel()

    async def handle_transient_alert(self, alert):
        """Coordinate the network response to a transient detection."""
        # 1. Classify the transient
        classification = self.classify_transient(alert)
        # 2. Determine follow-up priority
        priority = self.calculate_priority(classification)
        if priority > THRESHOLD:
            # 3. Find available telescopes that can observe
            available = []
            for site in self.sites:
                visibility = site.check_visibility(alert.position)
                if visibility['observable']:
                    available.append({
                        'site': site,
                        'quality': site.current_conditions(),
                        'visibility': visibility
                    })
            # 4. ML decides the optimal observation strategy
            strategy = self.scheduler.plan_followup(
                transient=classification,
                available_sites=available,
                priority=priority
            )
            # 5. Execute coordinated observations
            tasks = []
            for assignment in strategy['assignments']:
                task = assignment['site'].observe(
                    target=alert.position,
                    exposure=assignment['exposure'],
                    filters=assignment['filters']
                )
                tasks.append(task)
            # 6. Gather and fuse the results
            results = await asyncio.gather(*tasks)
            combined = self.data_fusion.combine(results)
            return combined
Part 5: The Future (What's Coming)
Foundation Models for Astronomy
Just as GPT learned language and DALL-E learned images, astronomy foundation models are being developed. These are trained on vast astronomical datasets and can be fine-tuned for specific tasks.
Imagine a model that has "seen" every public telescope image ever taken. It understands:
- What different objects look like
- How instruments behave
- What noise looks like
- The structure of the astronomical universe
You could fine-tune this for your specific telescopes with minimal data.
Autonomous Discovery Systems
Current ML classifies objects into known categories. Future systems will:
- Propose hypotheses: "These objects share unusual features; they might be a new class"
- Design observations: "To test this hypothesis, we need spectra of these 5 objects"
- Request telescope time: Automatically submit proposals
- Analyze results: "Hypothesis confirmed/rejected, here's what we learned"
- Write papers: Generate preliminary reports of findings
This is AI-driven scienceβthe algorithm becomes a collaborator.
Multi-Messenger Astronomy
When gravitational waves, neutrinos, and light all come from the same event (like a neutron star merger), we need instant coordination. Future ML systems will:
- Ingest alerts from all types of observatories
- Triangulate source positions
- Coordinate hundreds of telescopes worldwide
- Prioritize based on predicted scientific value
- Adapt in real-time as new data arrives
Your distributed array could be part of this global network.
Simulation-Based Inference
Instead of training ML on observed data, train on simulated universes. Run physics simulations of different cosmological parameters, generate synthetic observations, train ML to infer parameters from observations.
This connects ML directly to physical theory: the algorithm learns not just patterns but physics.
Real-Time Adaptive Optics
Ground-based telescopes battle atmospheric turbulence. Adaptive optics (deformable mirrors) correct this, but current systems are limited. ML can:
- Predict atmospheric turbulence milliseconds ahead
- Control mirror surfaces faster than traditional systems
- Learn optimal corrections for each site
- Potentially achieve space-telescope quality from ground
Federated Learning for Privacy and Bandwidth
Not all data can be shared freely. Federated learning lets multiple telescope networks contribute to training a model without sharing raw data:
- Global model sent to each site
- Each site trains on local data
- Only model updates (not data) sent back
- Updates combined into improved global model
This enables collaboration while respecting data ownership.
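The four steps above can be sketched as federated averaging (FedAvg) on a shared linear model; the sites, their data, the learning rate, and the round counts here are all simulated and arbitrary:

```python
# Federated averaging: sites train locally, only parameter updates are shared
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])       # relation all sites observe

def make_site_data(n):
    X = rng.normal(size=(n, 3))
    y = X @ true_w
    return X, y

sites = [make_site_data(100) for _ in range(4)]
global_w = np.zeros(3)

for round_ in range(50):
    updates = []
    for X, y in sites:
        w = global_w.copy()               # 1. global model sent to each site
        for _ in range(10):               # 2. each site trains on local data
            grad = 2 * X.T @ (X @ w - y) / len(y)
            w -= 0.05 * grad
        updates.append(w - global_w)      # 3. only the update leaves the site
    global_w += np.mean(updates, axis=0)  # 4. server averages the updates

print(np.round(global_w, 2))              # converges to [ 2. -1.  0.5]
```

Notice that the raw `X, y` arrays never leave the inner loop; only `w - global_w` is communicated.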
Part 6: Getting Started (Practical Roadmap)
Phase 1: Foundation (Months 1-3)
Learn the basics:
- Python programming
- Basic ML concepts (scikit-learn tutorials)
- Astronomical data formats (FITS, catalogs)
- Astropy library
Set up infrastructure:
- Data storage solution
- Version control (Git)
- Basic processing pipeline
- Database for metadata
First ML project:
- Implement a simple image quality classifier
- Train on your own telescope data
- Deploy at one site
Phase 2: Single-Site Intelligence (Months 4-8)
Build local ML systems:
- Real-time quality assessment
- Basic transient detection
- Source detection and cataloging
- Automated calibration assistance
Infrastructure upgrades:
- GPU for ML inference
- Automated pipeline deployment
- Monitoring and logging
- Alert system
Phase 3: Network Coordination (Months 9-15)
Central intelligence:
- Data aggregation system
- Cross-site calibration
- Basic scheduling optimization
- Multi-site alert coordination
Advanced ML:
- Data fusion models
- Reinforcement learning for scheduling
- Anomaly detection across network
Phase 4: Discovery Systems (Months 16+)
Scientific ML:
- Unknown object discovery
- Time-series analysis
- Automated science products
- Publication-quality outputs
Network optimization:
- Full autonomous scheduling
- Predictive maintenance
- Adaptive observing strategies
Resources for Learning
Courses:
- Fast.ai (practical deep learning)
- Coursera ML specializations
- AstroML tutorials
Books:
- "Statistics, Data Mining, and Machine Learning in Astronomy"
- "Deep Learning" by Goodfellow et al.
Tools:
- PyTorch or TensorFlow (deep learning frameworks)
- Astropy (astronomical data handling)
- scikit-learn (traditional ML)
Communities:
- AstroML mailing list
- Astropy community
- ML in astronomy conferences (like ADASS)
Summary
Machine learning transforms astronomical data processing from a bottleneck into an opportunity. For your distributed telescope array specifically:
- Local intelligence at each site handles quality control and real-time detection
- Central coordination optimizes scheduling and combines data
- Science systems extract discoveries and generate products
- Future capabilities include autonomous discovery and global coordination
The key insight: ML doesn't replace astronomical expertise; it amplifies it. The algorithms learn from human knowledge, then apply that knowledge at scales and speeds impossible for humans alone.
Your distributed array, equipped with ML, becomes more than the sum of its parts. Each telescope contributes to a collective intelligence that sees more, responds faster, and discovers things no single instrument could find.
Do you want me to dive even deeper into any specific area? I could elaborate on:
- The mathematical details of specific algorithms
- Detailed system architecture for your array
- Specific code implementations
- Training data strategies
- Hardware recommendations
- Integration with existing astronomical systems
The Complete AI/Machine Learning Education: From Zero to Practitioner
I'm going to teach you everything from the absolute ground up. By the end of this, you'll understand how AI actually works and be ready to start building.
Chapter 1: What Is AI, Really?
Forgetting the Science Fiction
First, forget everything movies have told you. AI isn't a sentient robot. It's not "thinking" the way you think. At its core, AI is:
Mathematics that finds patterns in data.
That's it. Everything else (image recognition, language understanding, game playing) emerges from this simple idea applied at massive scale.
The Spectrum of AI
- Rule-Based Systems: "If X, then Y." Example: "If temperature > 100°, send alert."
- Machine Learning: "Learn from examples." Example: "Show me 10,000 spam emails, learn what spam looks like."
- Deep Learning: "Learn complex patterns with neural networks." Example: "Show me millions of images, learn to recognize anything."
Rule-based: You write explicit rules. Limited but predictable.
Machine Learning: The computer discovers rules from data. Flexible but needs examples.
Deep Learning: Machine learning with neural networks. Can learn incredibly complex patterns but needs lots of data and computation.
Why This Matters for Astronomy
Traditional astronomy: "If brightness dips by X% for Y hours with this shape, it might be a planet transit."
ML astronomy: "Here are 10,000 confirmed planet transits. Learn what they look like. Now find more."
The second approach finds patterns humans might never think to look for.
Chapter 2: The Mathematics You Actually Need
Don't panic. You need less math than you think, and I'll explain each piece intuitively.
Concept 1: Variables and Functions
A variable is just a placeholder for a number:
x = 5
temperature = 72.4
brightness = 0.00847
A function takes inputs and produces outputs:
f(x) = 2x + 1
When x = 3: f(3) = 2(3) + 1 = 7
When x = 10: f(10) = 2(10) + 1 = 21
ML insight: A trained model IS a function. It takes your data as input and produces predictions as output.
Concept 2: Vectors and Matrices
A vector is a list of numbers:
pixel_values = [0.1, 0.4, 0.9, 0.2, 0.8]
star_properties = [temperature, brightness, distance, mass]
A matrix is a grid of numbers:
image = [
[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6],
[0.7, 0.8, 0.9]
]
ML insight: All data becomes vectors or matrices. An image? Matrix of pixel values. A spectrum? Vector of intensity values. Text? Converted to vectors of numbers.
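In NumPy, the standard Python numerical library, those examples look like this:

```python
# All data becomes arrays: an "image" and a "spectrum" in NumPy
import numpy as np

image = np.array([[0.1, 0.2, 0.3],
                  [0.4, 0.5, 0.6],
                  [0.7, 0.8, 0.9]])              # 3x3 matrix of pixel values
spectrum = np.array([0.1, 0.4, 0.9, 0.2, 0.8])   # vector of intensities

print(image.shape, spectrum.shape)               # (3, 3) (5,)
print(image.flatten()[:4])                       # models often see a flat vector
```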
Concept 3: The Dot Product
This is the key operation in ML. Multiply corresponding elements and add:
vector_a = [1, 2, 3]
vector_b = [4, 5, 6]
dot_product = (1×4) + (2×5) + (3×6)
= 4 + 10 + 18
= 32
ML insight: This is how neural networks combine inputs. Each input gets multiplied by a "weight," then everything is added up.
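The same calculation in NumPy:

```python
# The dot product: multiply corresponding elements, then add
import numpy as np

vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6])
print(np.dot(vector_a, vector_b))   # 32, matching the hand calculation
```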
Concept 4: Probability Basics
Probability measures likelihood (0 = impossible, 1 = certain):
P(coin lands heads) = 0.5
P(sun rises tomorrow) β 1.0
P(finding a unicorn) = 0.0
ML insight: Models output probabilities. "This image is 94% likely to be a spiral galaxy, 5% elliptical, 1% artifact."
Concept 5: Derivatives (Just the Intuition)
A derivative measures "how fast something is changing."
Imagine driving a car:
- Position = where you are
- Velocity (derivative of position) = how fast position is changing
- Acceleration (derivative of velocity) = how fast velocity is changing
ML insight: Training uses derivatives to figure out "if I adjust this parameter slightly, how much does my error change?" This guides learning.
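You can see this numerically: nudge a parameter, measure how the error changes, and compare with the calculus answer. The one-parameter model here is invented for illustration:

```python
# "If I adjust this parameter slightly, how much does my error change?"
def error(w):
    # squared error of a one-parameter model: prediction = w * 2, target = 10
    return (w * 2 - 10) ** 2

w = 3.0
h = 1e-6
numeric = (error(w + h) - error(w - h)) / (2 * h)   # finite-difference estimate
analytic = 2 * (w * 2 - 10) * 2                      # chain rule
print(numeric, analytic)   # both are -16.0: increasing w reduces the error
```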
Chapter 3: How Machine Learning Actually Works
The Core Loop
Every ML system follows this pattern:
1. INITIALIZE: Start with random parameter values
2. PREDICT: Use current parameters to make predictions
3. MEASURE ERROR: Compare predictions to correct answers
4. UPDATE: Adjust parameters to reduce error
5. REPEAT: Go back to step 2, thousands of times
Let me make this concrete.
Example: Predicting Star Temperature from Color
The Data:
Star 1: Blue/Red ratio = 0.8, Temperature = 5000K
Star 2: Blue/Red ratio = 1.2, Temperature = 6500K
Star 3: Blue/Red ratio = 1.5, Temperature = 8000K
Star 4: Blue/Red ratio = 2.0, Temperature = 11000K
... (thousands more)
The Model (simplest possible):
Predicted_Temperature = w Γ (Blue/Red ratio) + b
Where w and b are parameters we need to learn
Training Process:
Step 1: Random initialization
w = 1000 (random guess)
b = 2000 (random guess)
Step 2: Make predictions
Star 1: 1000 × 0.8 + 2000 = 2800K (actual: 5000K) ← way off!
Star 2: 1000 × 1.2 + 2000 = 3200K (actual: 6500K) ← way off!
Step 3: Measure error
Error = average of (predicted - actual)²
= ((2800-5000)² + (3200-6500)²) / 2
= (4,840,000 + 10,890,000) / 2
= 7,865,000 ← big number, bad!
Step 4: Update parameters
Mathematics tells us:
- Increasing w will reduce error
- Increasing b will reduce error
New w = 1000 + adjustment = 3000
New b = 2000 + adjustment = 2500
Step 5: Repeat
With new parameters, error becomes 2,100,000
Keep going...
After 1000 iterations:
w ≈ 5000
b ≈ 1000
Error is now tiny!
Final model:
Temperature ≈ 5000 × (Blue/Red) + 1000
This simple model learned the relationship between color and temperature!
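The whole training loop above fits in a few lines of NumPy. This is a minimal sketch using just the four example stars (with thousands of real stars the fit would be more trustworthy); the learning rate of 0.05 and the step count are arbitrary choices:

```python
import numpy as np

# The four example stars from above: blue/red ratio vs. temperature (K)
ratios = np.array([0.8, 1.2, 1.5, 2.0])
temps = np.array([5000.0, 6500.0, 8000.0, 11000.0])

w, b = 1000.0, 2000.0   # the "random guesses" from Step 1
lr = 0.05               # learning rate (chosen by trial and error)

for step in range(20_000):
    preds = w * ratios + b                   # Step 2: predict
    errors = preds - temps                   # Step 3: measure error
    grad_w = 2 * np.mean(errors * ratios)    # derivative of MSE w.r.t. w
    grad_b = 2 * np.mean(errors)             # derivative of MSE w.r.t. b
    w -= lr * grad_w                         # Step 4: update
    b -= lr * grad_b                         # Step 5: repeat

mse = np.mean((w * ratios + b - temps) ** 2)
print(f"w = {w:.0f}, b = {b:.0f}, MSE = {mse:.0f}")
```

On these four points the best straight line lands near w ≈ 5000 and b ≈ 700, so the learned values come out close to (though not exactly) the round numbers in the walkthrough.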
Gradient Descent: The Heart of Learning¶
"Gradient descent" is just a fancy name for the update process. Here's the intuition:
Imagine you're blindfolded on a hilly landscape. Your goal: find the lowest valley (minimum error).
Strategy:
- Feel the ground around you (compute gradient/derivative)
- Figure out which direction goes downhill (direction of steepest descent)
- Take a step that direction (update parameters)
- Repeat until you stop going downhill (reached minimum)
Error
  ^
  | *                  <- starting point (random parameters)
  |  *
  |   *                <- each step moves downhill
  |    *
  |     *
  |      *____*        <- minimum (best parameters)
  +----------------------> Parameters
The Learning Rate¶
How big should each step be?
- Too big: You overshoot the minimum, bounce around, never converge
- Too small: Takes forever to reach the minimum
- Just right: Steady progress toward the best solution
(Sketch: with too high a rate the error bounces back and forth across the valley and never settles; with too low a rate it inches downhill and takes forever; with a good rate it descends steadily and converges in a handful of steps.)
The learning rate is a hyperparameter: something you choose, not something the model learns.
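You can see all three behaviors on the simplest possible problem: minimizing f(x) = x², whose gradient is 2x. A quick sketch (the specific rates are just illustrative):

```python
def descend(lr, steps=50, x=5.0):
    """Run gradient descent on f(x) = x**2, whose gradient is 2*x."""
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(descend(1.1))    # too high: overshoots the minimum and diverges
print(descend(1e-4))   # too low: after 50 steps, barely moved from 5.0
print(descend(0.1))    # good: lands essentially at the minimum, 0
```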
Chapter 4: Neural Networks Explained¶
The Biological Inspiration (Loosely)¶
Your brain has neurons connected by synapses. A neuron:
- Receives signals from other neurons
- If total signal exceeds a threshold, it "fires"
- Sends signals to other neurons
Artificial neural networks are inspired by this (but much simpler).
The Artificial Neuron¶
Inputs (x₁, x₂, x₃)      Weights (w₁, w₂, w₃)
        |                        |
        v                        v
+---------------------------------------------+
|                                             |
|  weighted_sum = w₁×x₁ + w₂×x₂ + w₃×x₃ + b   |
|                                             |
|  output = activation(weighted_sum)          |
|                                             |
+---------------------------------------------+
                      |
                      v
                   Output
Inputs: The data (pixel values, measurements, features)
Weights: Learnable parameters that determine importance of each input
Bias (b): An adjustable offset
Activation function: Introduces non-linearity (explained below)
Why Activation Functions Matter¶
Without activation functions, stacking layers would be pointless:
Layer 1: output = w₁ × input + b₁
Layer 2: output = w₂ × (w₁ × input + b₁) + b₂
                = (w₂×w₁) × input + (w₂×b₁ + b₂)
                = W × input + B    ← still just a linear function!
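A quick numerical check of that collapse, with made-up scalar weights:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5)          # any inputs will do
w1, b1 = 3.0, 2.0               # layer 1 parameters (arbitrary)
w2, b2 = -0.5, 1.0              # layer 2 parameters (arbitrary)

two_layers = w2 * (w1 * x + b1) + b2
one_layer = (w2 * w1) * x + (w2 * b1 + b2)   # W = w2*w1, B = w2*b1 + b2

print(np.allclose(two_layers, one_layer))    # True: the stack collapses
```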
Activation functions break this linearity, allowing complex patterns:
ReLU (Rectified Linear Unit), the most common:
ReLU(x) = max(0, x)
If x is negative, output 0
If x is positive, output x
Examples:
ReLU(-5) = 0
ReLU(0) = 0
ReLU(3) = 3
Sigmoid, which squashes outputs to 0-1 (good for probabilities):
Sigmoid(x) = 1 / (1 + e^(-x))
Very negative x → ~0
Zero → 0.5
Very positive x → ~1
Softmax, for classification (outputs sum to 1):
Used in final layer for classification
Converts raw scores to probabilities
Scores: [2.0, 1.0, 0.1]
Softmax: [0.66, 0.24, 0.10] ← these sum to 1.0
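All three activations are one-liners in NumPy. A sketch (the max subtraction inside softmax is a standard trick to avoid numerical overflow):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtracting the max avoids overflow
    return e / e.sum()

print(relu(np.array([-5.0, 0.0, 3.0])))       # [0. 0. 3.]
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # roughly [0, 0.5, 1]
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.round(2))                          # ~[0.66 0.24 0.1]
```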
Building a Neural Network¶
Stack neurons into layers:
INPUT LAYER       HIDDEN LAYER 1       HIDDEN LAYER 2       OUTPUT LAYER
(your data)       (learned features)   (complex features)   (predictions)

x₁ ---+
x₂ ---+---> [neurons] ------> [neurons] ------> class 1 prob
x₃ ---+                                 ------> class 2 prob
x₄ ---+                                 ------> class 3 prob

(Every input feeds every neuron in hidden layer 1, every hidden-layer-1 neuron
feeds every neuron in hidden layer 2, and so on: the layers are "fully connected.")
Each connection has a weight (learnable)
Each neuron has a bias (learnable)
Each neuron applies an activation function
What Each Layer Learns (Image Example)¶
For image classification:
Layer 1: Detects simple patterns
- Edge detectors (vertical, horizontal, diagonal)
- Color blobs
- Simple textures
Layer 2: Combines simple patterns into shapes
- Corners (vertical + horizontal edges)
- Curves (many edge detectors)
- Texture regions
Layer 3: Combines shapes into parts
- "This looks like a spiral arm"
- "This looks like a galactic core"
- "This looks like a star cluster"
Layer 4+: Combines parts into objects
- "Spiral arms + bright core + overall shape = spiral galaxy"
This hierarchical learning is why deep networks are so powerful!
Forward Pass vs Backward Pass¶
Forward Pass: Data flows through the network, producing predictions
Input → Layer 1 → Layer 2 → ... → Output → Prediction
Backward Pass (Backpropagation): Errors flow backward, updating weights
How wrong was the prediction?
    ↓
How much did each Layer N weight contribute to the error?
    ↓
Adjust Layer N weights
    ↓
How much did each Layer N-1 weight contribute to the error?
    ↓
Adjust Layer N-1 weights
    ↓
... continue back to Layer 1 ...
This is where the calculus happens: computing how each weight affects the final error.
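Here is one full forward/backward cycle for a tiny one-hidden-layer network fitting a single made-up example, with the chain rule written out by hand. Every number here is an arbitrary placeholder; real frameworks automate exactly these steps:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # one input example (3 features)
y_true = 2.0                           # its made-up target value

W1, b1 = 0.5 * rng.normal(size=(4, 3)), np.zeros(4)   # layer 1: 3 -> 4
W2, b2 = 0.5 * rng.normal(size=(1, 4)), np.zeros(1)   # layer 2: 4 -> 1

lr = 0.02
for step in range(2000):
    # Forward pass
    z1 = W1 @ x + b1
    h = np.maximum(0, z1)              # ReLU
    y = W2 @ h + b2
    loss = (y - y_true) ** 2
    # Backward pass: chain rule, output layer first
    dy = 2 * (y - y_true)              # dLoss/dy
    dW2, db2 = np.outer(dy, h), dy
    dh = W2.T @ dy                     # error flowing back into h
    dz1 = dh * (z1 > 0)                # ReLU passes gradient only where it fired
    dW1, db1 = np.outer(dz1, x), dz1
    # Update every parameter a small step downhill
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print(f"final loss: {loss.item():.6f}")
```

The loss shrinks toward zero as the network memorizes its single example, which is exactly the mechanism that, over thousands of examples, becomes learning.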
Chapter 5: Convolutional Neural Networks (CNNs) for Images¶
Since you're working with telescope images, CNNs are crucial.
The Problem with Regular Networks for Images¶
A small 256×256 grayscale image has 65,536 pixels.
If your first layer has 1000 neurons, you'd have 65,536,000 connections from input to first layer alone!
This is:
- Computationally expensive
- Prone to overfitting (too many parameters for limited data)
- Ignores the structure of images (nearby pixels are related)
The Key Insight: Local Patterns¶
In images, patterns are local:
- An edge is a few pixels wide
- A star is a small region
- Artifacts have local signatures
We don't need every neuron to look at every pixel!
Convolution: The Core Operation¶
A filter (or kernel) is a small pattern detector:
Example: 3×3 edge-detecting filter
Filter: Slide over image:
[-1 0 1]
[-1 0 1] Original After convolution
[-1 0 1] [image] --> [edge map]
How convolution works:
Image region:    Filter:           Calculation:
[1, 2, 3]        [-1, 0, 1]        Sum of element-wise products:
[4, 5, 6]    ×   [-1, 0, 1]      = (-1×1)+(0×2)+(1×3)+
[7, 8, 9]        [-1, 0, 1]        (-1×4)+(0×5)+(1×6)+
                                   (-1×7)+(0×8)+(1×9)
                                 = -1+0+3-4+0+6-7+0+9 = 6
Slide the filter across the entire image, computing this at each position. The result is a feature map.
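That sliding-window arithmetic is a dozen lines of NumPy. A sketch of "valid" convolution (strictly, cross-correlation, which is what CNN layers actually compute), reproducing the worked example:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image`, summing element-wise products."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

patch = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
edge_filter = np.array([[-1, 0, 1]] * 3, dtype=float)
print(convolve2d(patch, edge_filter))   # [[6.]], matching the example
```

On a full image, the same loop produces the feature map; real frameworks just do it vastly faster on the GPU.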
Multiple Filters = Multiple Features¶
A CNN layer has many filters, each learning to detect different patterns:
Input Image (1 channel: grayscale)
    ↓
Conv Layer 1 (32 filters)
    ↓
32 Feature Maps (different patterns detected)
    ↓
Conv Layer 2 (64 filters, each looks at all 32 previous maps)
    ↓
64 Feature Maps (combinations of patterns)
    ↓
... more layers ...
    ↓
Final Classification
Pooling: Reducing Size¶
After convolution, we often pool to reduce the size:
Max Pooling (2×2):
[1, 3, 2, 4]
[5, 6, 1, 2]   →   [6, 4]     Take the max of each 2×2 region
[3, 2, 1, 0]       [3, 3]
[1, 2, 3, 1]
This:
- Reduces computation for later layers
- Adds some translation invariance (small shifts don't matter)
- Keeps the strongest activations
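2×2 max pooling is a one-line reshape trick in NumPy. A sketch that reproduces the example above:

```python
import numpy as np

def max_pool_2x2(x):
    """Max-pool a 2D array with even sides by 2x2 blocks."""
    h, w = x.shape
    # Group pixels into 2x2 blocks, then take the max of each block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [3, 2, 1, 0],
              [1, 2, 3, 1]])
print(max_pool_2x2(x))   # [[6 4]
                         #  [3 3]]
```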
Complete CNN Architecture¶
Input: 256×256×1 telescope image
Conv1: 32 filters (3×3), ReLU → 256×256×32
Pool1: Max pool (2×2) → 128×128×32
Conv2: 64 filters (3×3), ReLU → 128×128×64
Pool2: Max pool (2×2) → 64×64×64
Conv3: 128 filters (3×3), ReLU → 64×64×128
Pool3: Max pool (2×2) → 32×32×128
Flatten: 32×32×128 = 131,072 values
Dense1: 512 neurons, ReLU
Dense2: 128 neurons, ReLU
Output: 5 neurons, Softmax → [spiral, elliptical, irregular, merger, artifact]
Why CNNs Work So Well for Astronomical Images¶
- Translation invariance: A galaxy in the corner looks the same as one in the center
- Hierarchical features: Learn edges β shapes β structures β objects
- Parameter efficiency: Same filter applied everywhere, fewer total parameters
- Natural for 2D data: Respects spatial relationships
Chapter 6: Training in Practice¶
The Training/Validation/Test Split¶
Never evaluate on data you trained on! Split your data:
All Your Data (e.g., 10,000 galaxy images)
    ↓
+------------------------------------------+
| Training Set (70%): 7,000 images         | ← model learns from these
+------------------------------------------+
| Validation Set (15%): 1,500 images       | ← tune hyperparameters, early stopping
+------------------------------------------+
| Test Set (15%): 1,500 images             | ← final evaluation only (touch once!)
+------------------------------------------+
Training set: Model sees these, adjusts weights
Validation set: Model never trains on these; use to check performance during training
Test set: Model never sees until final evaluation; gives unbiased performance estimate
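A 70/15/15 split is two calls to scikit-learn's train_test_split. A sketch with placeholder data standing in for your images and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10_000)                       # stand-ins for 10,000 images
y = np.random.randint(0, 5, size=10_000)    # stand-ins for their labels

# First split off 30%, then cut that 30% in half
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 7000 1500 1500
```

Fixing random_state makes the split reproducible, so the test set stays the same set of images across experiments.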
Overfitting vs Underfitting¶
Underfitting: Model too simple, can't capture patterns
Training accuracy: 60%
Validation accuracy: 58%
Both are bad → you need a more complex model
Good fit: Model captures patterns without memorizing
Training accuracy: 95%
Validation accuracy: 92%
Both good and close together → a well-tuned model
Overfitting: Model memorized training data, fails on new data
Training accuracy: 99%
Validation accuracy: 70%
Big gap → the model is memorizing, not learning
Visualization: as model complexity increases, training error falls steadily, while validation error falls, bottoms out at the sweet spot, then climbs again once the model crosses into the overfitting regime.
Regularization: Preventing Overfitting¶
Dropout: Randomly "turn off" neurons during training
During training:
[neuron1] [ ] [neuron3] [ ] [neuron5]  ← 40% dropped
    ↓
Forces the network not to rely on any single neuron
    ↓
More robust, generalizes better
L2 Regularization: Penalize large weights
Loss = Prediction_Error + λ × (sum of squared weights)
Large weights get penalized
Forces model to use smaller, more distributed weights
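In PyTorch you rarely add the penalty term by hand: the optimizer's weight_decay argument plays the role of λ (strictly, with Adam this is weight decay rather than textbook L2 regularization; AdamW makes the distinction explicit). A minimal sketch:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)

# weight_decay is the λ in: Loss = prediction_error + λ × (sum of squared weights)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# The equivalent explicit penalty, if you prefer to see it in the loss:
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
print(f"current L2 penalty term: {l2_penalty.item():.4f}")
```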
Data Augmentation: Create variations of training data
Original galaxy image
    ↓
Augmented versions:
- Rotated 90°, 180°, 270°
- Flipped horizontally
- Flipped vertically
- Slightly shifted
- Slightly zoomed
- Noise added
- Brightness adjusted
1 image becomes 10+ training examples!
For astronomy, augmentation is powerful because physics doesn't change with rotation.
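The rotation-and-flip augmentations are cheap to generate with NumPy. A sketch (the 0.01 noise level is an arbitrary placeholder you would tune to your camera's actual noise):

```python
import numpy as np

def augment(image, rng=None):
    """Return physics-preserving variants of a 2D image array."""
    rng = rng or np.random.default_rng()
    variants = [image]
    for k in (1, 2, 3):                   # rotations by 90°, 180°, 270°
        variants.append(np.rot90(image, k))
    variants.append(np.fliplr(image))     # horizontal flip
    variants.append(np.flipud(image))     # vertical flip
    variants.append(image + rng.normal(0, 0.01, image.shape))  # mild noise
    return variants

img = np.random.rand(64, 64)
print(len(augment(img)), "training examples from 1 image")
```

Combining transforms (a rotated, flipped, noisy copy) multiplies the count further.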
Batch Training¶
Processing all data at once is memory-intensive. Instead, use mini-batches:
10,000 training images
    ↓
Split into batches of 32
    ↓
313 batches per epoch (312 full batches plus one final batch of 16)
Each training step:
1. Load batch of 32 images
2. Forward pass: compute predictions
3. Compute loss
4. Backward pass: compute gradients
5. Update weights
6. Next batch
One complete pass through all batches = 1 epoch
Training typically runs for 10-100+ epochs
Learning Rate Schedules¶
Learning rate can change during training:
Constant: the rate stays flat for the whole run.
Step decay: the rate drops by a fixed factor at set intervals (a staircase).
Exponential: the rate decays smoothly toward zero.
Cosine annealing: the rate sweeps down (and optionally back up) in smooth waves.
Common approach: Start high (learn fast), decrease over time (fine-tune).
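In PyTorch, schedules are one-liners wrapped around the optimizer. A sketch of step decay (the model and the specific numbers are placeholders):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... one epoch of training would go here ...
    scheduler.step()

print(optimizer.param_groups[0]['lr'])   # 0.1 * 0.5**3 = 0.0125
```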
Early Stopping¶
Stop training when validation performance stops improving:
Epoch 1: Val accuracy = 70%
Epoch 2: Val accuracy = 78%
Epoch 3: Val accuracy = 84%
Epoch 4: Val accuracy = 88%
Epoch 5: Val accuracy = 90%
Epoch 6: Val accuracy = 91%
Epoch 7: Val accuracy = 91% ← stopped improving
Epoch 8: Val accuracy = 90% ← getting worse (overfitting starting)
Epoch 9: Val accuracy = 89%
...
Early stopping: Stop at epoch 6 or 7, save that model
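The stopping rule itself is a few lines of bookkeeping. A sketch with a patience of 2 epochs, replayed on the accuracy trace above:

```python
def early_stop_epoch(val_accuracies, patience=2):
    """Return (best_epoch, best_acc): stop once accuracy hasn't
    improved for `patience` consecutive epochs."""
    best_acc, best_epoch, waited = float("-inf"), 0, 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best_acc:
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break   # give up and restore the best checkpoint
    return best_epoch, best_acc

accs = [70, 78, 84, 88, 90, 91, 91, 90, 89]
print(early_stop_epoch(accs))   # (6, 91): keep the epoch-6 model
```

In practice you save the model's weights each time best_acc improves, then reload that checkpoint when the loop breaks.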
Chapter 7: Practical Python for ML¶
Setting Up Your Environment¶
Step 1: Install Python (version 3.9 or 3.10 recommended)
Step 2: Install essential packages
pip install numpy pandas matplotlib scikit-learn
pip install torch torchvision # PyTorch (or tensorflow if you prefer)
pip install astropy # For astronomy data
pip install jupyter # For interactive development
Step 3: Verify installation
import numpy as np
import torch
import astropy
print("All imports successful!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}") # True if you have GPU
NumPy: The Foundation¶
NumPy is for numerical computing. Everything in ML uses it.
import numpy as np
# Creating arrays
a = np.array([1, 2, 3, 4, 5])
b = np.zeros((3, 3)) # 3x3 array of zeros
c = np.ones((2, 4)) # 2x4 array of ones
d = np.random.randn(100, 100) # 100x100 random values (normal distribution)
# Array operations (element-wise)
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
print(x + y) # [5, 7, 9]
print(x * y) # [4, 10, 18]
print(x ** 2) # [1, 4, 9]
print(np.sqrt(x)) # [1.0, 1.414, 1.732]
# Statistics
data = np.random.randn(1000)
print(np.mean(data)) # ~0
print(np.std(data)) # ~1
print(np.max(data)) # ~3
print(np.min(data)) # ~-3
# Reshaping
image = np.random.randn(256, 256) # 2D image
flat = image.reshape(-1) # Flatten to 1D: 65536 elements
back = flat.reshape(256, 256) # Back to 2D
# Slicing
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr[0, :]) # First row: [1, 2, 3]
print(arr[:, 1]) # Second column: [2, 5, 8]
print(arr[1:, 1:]) # Bottom-right: [[5, 6], [8, 9]]
Matplotlib: Visualization¶
import matplotlib.pyplot as plt
import numpy as np
# Basic line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y)
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.title('Sine Wave')
plt.show()
# Scatter plot
x = np.random.randn(100)
y = x + np.random.randn(100) * 0.5
plt.scatter(x, y, alpha=0.5)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
# Image display (crucial for astronomy!)
image = np.random.randn(256, 256)
plt.imshow(image, cmap='gray')
plt.colorbar()
plt.title('Random Image')
plt.show()
# Histogram
data = np.random.randn(10000)
plt.hist(data, bins=50, edgecolor='black')
plt.xlabel('Value')
plt.ylabel('Count')
plt.title('Distribution')
plt.show()
# Multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
axes[0, 0].plot(x, y)
axes[0, 1].scatter(x, y)
axes[1, 0].imshow(image, cmap='viridis')
axes[1, 1].hist(data, bins=30)
plt.tight_layout()
plt.show()
Astropy: Handling Astronomical Data¶
from astropy.io import fits
from astropy import units as u
from astropy.coordinates import SkyCoord
import numpy as np
import matplotlib.pyplot as plt
# Reading FITS files (telescope images)
def load_fits_image(filepath):
with fits.open(filepath) as hdul:
# Primary data is usually in index 0 or 1
        hdul.info()  # Prints a summary of the file's contents (returns None)
data = hdul[0].data # The image data
header = hdul[0].header # Metadata
return data, header
# Example usage
# data, header = load_fits_image('my_observation.fits')
# print(f"Image shape: {data.shape}")
# print(f"Object: {header.get('OBJECT', 'Unknown')}")
# print(f"Exposure time: {header.get('EXPTIME', 'Unknown')} seconds")
# Working with coordinates
coord = SkyCoord('10h30m00s', '+45d00m00s', frame='icrs')
print(f"RA: {coord.ra.degree} degrees")
print(f"Dec: {coord.dec.degree} degrees")
# Unit conversions
distance = 100 * u.pc # 100 parsecs
print(f"In light years: {distance.to(u.lyr)}")
print(f"In AU: {distance.to(u.AU)}")
# Displaying astronomical images properly
def display_astronomical_image(data, title='Astronomical Image'):
"""Display with log stretch (common for astronomy)"""
# Handle negative values
data_shifted = data - np.nanmin(data) + 1
# Log stretch
log_data = np.log10(data_shifted)
# Display
plt.figure(figsize=(10, 10))
plt.imshow(log_data, cmap='gray', origin='lower')
plt.colorbar(label='log(counts)')
plt.title(title)
plt.show()
PyTorch Basics¶
PyTorch is a deep learning framework. Here are the essentials:
import torch
import torch.nn as nn
import torch.optim as optim
# Tensors (like numpy arrays, but can run on GPU)
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.zeros(3, 3)
c = torch.randn(100, 100)
# Move to GPU (if available)
if torch.cuda.is_available():
a = a.cuda()
# or
device = torch.device('cuda')
a = a.to(device)
# Convert between numpy and torch
import numpy as np
numpy_array = np.array([1.0, 2.0, 3.0])
torch_tensor = torch.from_numpy(numpy_array)
back_to_numpy = torch_tensor.numpy()
# Automatic differentiation (the magic of PyTorch!)
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x + 1  # y = x² + 3x + 1
y.backward()  # Compute derivative
print(x.grad)  # dy/dx = 2x + 3 = 2(2) + 3 = 7 ✓
Building Your First Neural Network in PyTorch¶
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
# Define the network
class SimpleClassifier(nn.Module):
def __init__(self, input_size, hidden_size, num_classes):
super(SimpleClassifier, self).__init__()
self.network = nn.Sequential(
nn.Linear(input_size, hidden_size),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(hidden_size, hidden_size // 2),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(hidden_size // 2, num_classes)
)
def forward(self, x):
return self.network(x)
# Create synthetic data for demonstration
num_samples = 1000
input_size = 100
num_classes = 5
X = torch.randn(num_samples, input_size)
y = torch.randint(0, num_classes, (num_samples,))
# Split into train/val
train_X, val_X = X[:800], X[800:]
train_y, val_y = y[:800], y[800:]
# Create data loaders
train_dataset = TensorDataset(train_X, train_y)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataset = TensorDataset(val_X, val_y)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
# Initialize model, loss, optimizer
model = SimpleClassifier(input_size=100, hidden_size=256, num_classes=5)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Training loop
num_epochs = 20
for epoch in range(num_epochs):
model.train() # Set to training mode
train_loss = 0
for batch_X, batch_y in train_loader:
# Forward pass
outputs = model(batch_X)
loss = criterion(outputs, batch_y)
# Backward pass
optimizer.zero_grad() # Clear previous gradients
loss.backward() # Compute gradients
optimizer.step() # Update weights
train_loss += loss.item()
# Validation
model.eval() # Set to evaluation mode
val_loss = 0
correct = 0
total = 0
with torch.no_grad(): # No gradient computation for validation
for batch_X, batch_y in val_loader:
outputs = model(batch_X)
loss = criterion(outputs, batch_y)
val_loss += loss.item()
_, predicted = torch.max(outputs, 1)
total += batch_y.size(0)
correct += (predicted == batch_y).sum().item()
accuracy = 100 * correct / total
print(f'Epoch [{epoch+1}/{num_epochs}], '
f'Train Loss: {train_loss/len(train_loader):.4f}, '
f'Val Loss: {val_loss/len(val_loader):.4f}, '
f'Val Accuracy: {accuracy:.2f}%')
Building a CNN for Images¶
import torch
import torch.nn as nn
class AstronomyCNN(nn.Module):
def __init__(self, num_classes=5):
super(AstronomyCNN, self).__init__()
# Convolutional layers
self.conv_layers = nn.Sequential(
# Input: 1 channel (grayscale), Output: 32 channels
nn.Conv2d(1, 32, kernel_size=3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(2, 2), # 256 -> 128
nn.Conv2d(32, 64, kernel_size=3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(2, 2), # 128 -> 64
nn.Conv2d(64, 128, kernel_size=3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.MaxPool2d(2, 2), # 64 -> 32
nn.Conv2d(128, 256, kernel_size=3, padding=1),
nn.BatchNorm2d(256),
nn.ReLU(),
nn.MaxPool2d(2, 2), # 32 -> 16
)
# Fully connected layers
self.fc_layers = nn.Sequential(
nn.Flatten(),
nn.Linear(256 * 16 * 16, 512),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(512, 128),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(128, num_classes)
)
def forward(self, x):
x = self.conv_layers(x)
x = self.fc_layers(x)
return x
# Create model
model = AstronomyCNN(num_classes=5)
# Print model summary
print(model)
# Check with dummy input
dummy_input = torch.randn(1, 1, 256, 256) # Batch of 1, 1 channel, 256x256
output = model(dummy_input)
print(f"Output shape: {output.shape}") # Should be [1, 5]
Complete Training Script for Astronomical Image Classification¶
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
from astropy.io import fits
import os
from pathlib import Path
import matplotlib.pyplot as plt
class AstronomyDataset(Dataset):
"""Custom dataset for astronomical images"""
def __init__(self, image_dir, labels_file, transform=None):
"""
Args:
image_dir: Directory with FITS images
labels_file: Text file with "filename,label" per line
transform: Optional transform function
"""
self.image_dir = Path(image_dir)
self.transform = transform
# Load labels
self.samples = []
with open(labels_file, 'r') as f:
for line in f:
filename, label = line.strip().split(',')
self.samples.append((filename, int(label)))
self.classes = ['spiral', 'elliptical', 'irregular', 'merger', 'artifact']
def __len__(self):
return len(self.samples)
def __getitem__(self, idx):
filename, label = self.samples[idx]
# Load FITS image
filepath = self.image_dir / filename
with fits.open(filepath) as hdul:
image = hdul[0].data.astype(np.float32)
# Preprocessing
image = self.preprocess(image)
# Apply transforms if any
if self.transform:
image = self.transform(image)
# Convert to tensor
image = torch.from_numpy(image).unsqueeze(0) # Add channel dimension
return image, label
def preprocess(self, image):
"""Standard preprocessing for astronomical images"""
# Handle NaN values
image = np.nan_to_num(image, nan=0.0)
# Clip extreme values (cosmic rays, bad pixels)
p1, p99 = np.percentile(image, [1, 99])
image = np.clip(image, p1, p99)
# Log stretch (handles large dynamic range)
image = image - image.min() + 1
image = np.log(image)
# Normalize to 0-1
image = (image - image.min()) / (image.max() - image.min() + 1e-8)
return image
def train_model(model, train_loader, val_loader, num_epochs=50,
learning_rate=0.001, device='cuda'):
"""Complete training function with bells and whistles"""
model = model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
optimizer, mode='max', factor=0.5, patience=5
)
best_accuracy = 0
history = {'train_loss': [], 'val_loss': [], 'val_accuracy': []}
for epoch in range(num_epochs):
# Training phase
model.train()
train_loss = 0
for images, labels in train_loader:
images, labels = images.to(device), labels.to(device)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
train_loss += loss.item()
# Validation phase
model.eval()
val_loss = 0
correct = 0
total = 0
with torch.no_grad():
for images, labels in val_loader:
images, labels = images.to(device), labels.to(device)
outputs = model(images)
loss = criterion(outputs, labels)
val_loss += loss.item()
_, predicted = torch.max(outputs, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
accuracy = 100 * correct / total
avg_train_loss = train_loss / len(train_loader)
avg_val_loss = val_loss / len(val_loader)
# Update scheduler
scheduler.step(accuracy)
# Save history
history['train_loss'].append(avg_train_loss)
history['val_loss'].append(avg_val_loss)
history['val_accuracy'].append(accuracy)
# Save best model
if accuracy > best_accuracy:
best_accuracy = accuracy
torch.save(model.state_dict(), 'best_model.pt')
print(f'Epoch [{epoch+1}/{num_epochs}] '
f'Train Loss: {avg_train_loss:.4f} '
f'Val Loss: {avg_val_loss:.4f} '
f'Val Acc: {accuracy:.2f}% '
f'(Best: {best_accuracy:.2f}%)')
return history
def plot_training_history(history):
"""Visualize training progress"""
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Loss plot
ax1.plot(history['train_loss'], label='Train Loss')
ax1.plot(history['val_loss'], label='Val Loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training and Validation Loss')
ax1.legend()
# Accuracy plot
ax2.plot(history['val_accuracy'], label='Val Accuracy', color='green')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy (%)')
ax2.set_title('Validation Accuracy')
ax2.legend()
plt.tight_layout()
plt.savefig('training_history.png')
plt.show()
# Example usage (you'd replace with your actual data):
if __name__ == '__main__':
# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
# Create model
model = AstronomyCNN(num_classes=5)
# For demonstration, create random data
# In practice, you'd use AstronomyDataset with real data
train_X = torch.randn(800, 1, 256, 256)
train_y = torch.randint(0, 5, (800,))
val_X = torch.randn(200, 1, 256, 256)
val_y = torch.randint(0, 5, (200,))
train_dataset = torch.utils.data.TensorDataset(train_X, train_y)
val_dataset = torch.utils.data.TensorDataset(val_X, val_y)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
# Train
history = train_model(model, train_loader, val_loader,
num_epochs=20, device=device)
# Plot results
plot_training_history(history)
Chapter 8: Your First Complete Project¶
Let's build something real: an image quality classifier for your telescope.
Project: Automatic Image Quality Assessment¶
Goal: Given a raw telescope frame, predict quality (good/medium/bad) automatically.
Step 1: Data Collection¶
First, manually classify some of your existing images:
import os
import shutil
from pathlib import Path
# Create directory structure
for quality in ['good', 'medium', 'bad']:
Path(f'training_data/{quality}').mkdir(parents=True, exist_ok=True)
print("""
Manual Classification Guide:
- GOOD: Clear stars, low background, good focus
- MEDIUM: Some clouds, slightly out of focus, minor issues
- BAD: Heavy clouds, tracking errors, severe artifacts
Move or copy your FITS files into the appropriate folders.
Aim for at least 100 images per category.
""")
Step 2: Data Preparation¶
import numpy as np
from astropy.io import fits
from pathlib import Path
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
class QualityDataset(Dataset):
def __init__(self, filepaths, labels, image_size=128):
self.filepaths = filepaths
self.labels = labels
self.image_size = image_size
def __len__(self):
return len(self.filepaths)
def __getitem__(self, idx):
# Load image
with fits.open(self.filepaths[idx]) as hdul:
image = hdul[0].data.astype(np.float32)
# Resize to consistent size
from scipy.ndimage import zoom
zoom_factor = self.image_size / max(image.shape)
image = zoom(image, zoom_factor)
# Pad to exact size if needed
if image.shape[0] < self.image_size:
pad = self.image_size - image.shape[0]
image = np.pad(image, ((0, pad), (0, 0)))
if image.shape[1] < self.image_size:
pad = self.image_size - image.shape[1]
image = np.pad(image, ((0, 0), (0, pad)))
# Crop to exact size
image = image[:self.image_size, :self.image_size]
# Normalize
image = np.nan_to_num(image, nan=0)
p1, p99 = np.percentile(image, [1, 99])
image = np.clip(image, p1, p99)
image = (image - image.min()) / (image.max() - image.min() + 1e-8)
# To tensor
image = torch.from_numpy(image).unsqueeze(0)
return image, self.labels[idx]
def prepare_data(data_dir='training_data'):
"""Load data from organized folders"""
filepaths = []
labels = []
label_map = {'good': 0, 'medium': 1, 'bad': 2}
for quality, label in label_map.items():
folder = Path(data_dir) / quality
for filepath in folder.glob('*.fits'):
filepaths.append(str(filepath))
labels.append(label)
# Split into train/val/test
train_files, temp_files, train_labels, temp_labels = train_test_split(
filepaths, labels, test_size=0.3, stratify=labels, random_state=42
)
val_files, test_files, val_labels, test_labels = train_test_split(
temp_files, temp_labels, test_size=0.5, stratify=temp_labels, random_state=42
)
print(f"Training samples: {len(train_files)}")
print(f"Validation samples: {len(val_files)}")
print(f"Test samples: {len(test_files)}")
return (
(train_files, train_labels),
(val_files, val_labels),
(test_files, test_labels)
)
Step 3: Model Definition¶
import torch.nn as nn
class QualityClassifier(nn.Module):
"""Lightweight CNN for image quality assessment"""
def __init__(self, num_classes=3):
super().__init__()
self.features = nn.Sequential(
# Block 1: 128 -> 64
nn.Conv2d(1, 16, 3, padding=1),
nn.BatchNorm2d(16),
nn.ReLU(),
nn.MaxPool2d(2),
# Block 2: 64 -> 32
nn.Conv2d(16, 32, 3, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(2),
# Block 3: 32 -> 16
nn.Conv2d(32, 64, 3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.MaxPool2d(2),
# Block 4: 16 -> 8
nn.Conv2d(64, 128, 3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.MaxPool2d(2),
)
self.classifier = nn.Sequential(
nn.Flatten(),
nn.Linear(128 * 8 * 8, 256),
nn.ReLU(),
nn.Dropout(0.5),
nn.Linear(256, num_classes)
)
def forward(self, x):
x = self.features(x)
x = self.classifier(x)
return x
Step 4: Training Script¶
def train_quality_model():
# Configuration
BATCH_SIZE = 16
LEARNING_RATE = 0.001
NUM_EPOCHS = 30
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Prepare data
(train_files, train_labels), (val_files, val_labels), _ = prepare_data()
train_dataset = QualityDataset(train_files, train_labels)
val_dataset = QualityDataset(val_files, val_labels)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
shuffle=True, num_workers=2)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE,
shuffle=False, num_workers=2)
# Initialize model
model = QualityClassifier(num_classes=3).to(DEVICE)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
# Training loop
best_accuracy = 0
for epoch in range(NUM_EPOCHS):
# Train
model.train()
train_loss = 0
for images, labels in train_loader:
images, labels = images.to(DEVICE), labels.to(DEVICE)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
train_loss += loss.item()
# Validate
model.eval()
correct = 0
total = 0
with torch.no_grad():
for images, labels in val_loader:
images, labels = images.to(DEVICE), labels.to(DEVICE)
outputs = model(images)
_, predicted = torch.max(outputs, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
accuracy = 100 * correct / total
print(f'Epoch [{epoch+1}/{NUM_EPOCHS}] '
f'Loss: {train_loss/len(train_loader):.4f} '
f'Accuracy: {accuracy:.1f}%')
# Save best model
if accuracy > best_accuracy:
best_accuracy = accuracy
torch.save({
'model_state': model.state_dict(),
'accuracy': accuracy,
'epoch': epoch
}, 'quality_classifier_best.pt')
print(f"\nTraining complete! Best accuracy: {best_accuracy:.1f}%")
return model
Step 5: Deployment for Real-Time Use¶
class RealTimeQualityChecker:
"""Deploy the trained model for real-time quality assessment"""
def __init__(self, model_path='quality_classifier_best.pt'):
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load model
self.model = QualityClassifier(num_classes=3)
checkpoint = torch.load(model_path, map_location=self.device)
self.model.load_state_dict(checkpoint['model_state'])
self.model.to(self.device)
self.model.eval()
self.classes = ['good', 'medium', 'bad']
def preprocess(self, image):
"""Preprocess a raw numpy image"""
        from scipy.ndimage import zoom
        # Resize so the longest side is 128 px, pad the short side, crop to 128x128
        # (without the padding, non-square frames would come out smaller than 128x128)
        zoom_factor = 128 / max(image.shape)
        image = zoom(image.astype(np.float32), zoom_factor)
        pad_h = max(0, 128 - image.shape[0])
        pad_w = max(0, 128 - image.shape[1])
        image = np.pad(image, ((0, pad_h), (0, pad_w)))
        image = image[:128, :128]
# Normalize
image = np.nan_to_num(image, nan=0)
p1, p99 = np.percentile(image, [1, 99])
image = np.clip(image, p1, p99)
image = (image - image.min()) / (image.max() - image.min() + 1e-8)
# To tensor
tensor = torch.from_numpy(image).unsqueeze(0).unsqueeze(0)
return tensor.to(self.device)
def assess(self, image):
"""
Assess image quality
Args:
image: numpy array (raw telescope image)
Returns:
dict with quality label and confidence
"""
tensor = self.preprocess(image)
with torch.no_grad():
outputs = self.model(tensor)
probabilities = torch.softmax(outputs, dim=1)[0]
predicted_class = torch.argmax(probabilities).item()
return {
'quality': self.classes[predicted_class],
'confidence': probabilities[predicted_class].item(),
'all_probabilities': {
cls: prob.item()
for cls, prob in zip(self.classes, probabilities)
}
}
def assess_file(self, filepath):
"""Assess quality of a FITS file"""
with fits.open(filepath) as hdul:
image = hdul[0].data
return self.assess(image)
# Usage example:
if __name__ == '__main__':
checker = RealTimeQualityChecker('quality_classifier_best.pt')
# Assess a single file
result = checker.assess_file('new_observation.fits')
print(f"Quality: {result['quality']} ({result['confidence']:.1%} confident)")
# In a real-time loop
def process_new_frame(filepath):
result = checker.assess_file(filepath)
if result['quality'] == 'bad':
            print(f"⚠️ Bad frame detected: {filepath}")
# Could trigger alert or stop observation
elif result['quality'] == 'medium':
print(f"β‘ Medium quality: {filepath}")
# Continue but flag for review
else:
print(f"β Good frame: {filepath}")
# Proceed normally
return result
Chapter 9: Next Steps and Resources¶
Your Learning Path¶
Week 1-2: Python fundamentals
- Complete a Python tutorial (Codecademy, Python.org tutorial)
- Practice with NumPy and Matplotlib
- Load and visualize your telescope images
Week 3-4: Machine learning concepts
- Take Andrew Ng's ML course on Coursera (free to audit)
- Implement simple models with scikit-learn
- Understand training/validation/testing
Week 5-6: Deep learning basics
- Work through Fast.ai course (free, practical)
- Build your first CNN in PyTorch
- Train on your own data
Week 7-8: Your first real project
- Implement the quality classifier above
- Collect and label your data
- Train, validate, deploy
Month 2+: Advanced topics
- Time-series analysis for transient detection
- Multi-site coordination systems
- Real-time processing pipelines
Essential Resources¶
Books:
- "Python for Astronomers" (free online)
- "Deep Learning" by Goodfellow (the bible, free online)
- "Hands-On Machine Learning" by GΓ©ron
Courses:
- Fast.ai (practical deep learning)
- Coursera: Andrew Ng's courses
- DeepLearning.AI specializations
Astronomy-specific:
- AstroML documentation
- Astropy tutorials
- AAS astronomy + ML workshops
Communities:
- Stack Overflow (coding help)
- Cross Validated (ML theory)
- r/MachineLearning (Reddit)
- Astropy Slack/Discord
Hardware Recommendations¶
For learning (budget):
- Any modern laptop with 8GB+ RAM
- Use Google Colab for free GPU access
For development (intermediate):
- Desktop with NVIDIA GPU (RTX 3060 or better)
- 32GB+ RAM
- Fast SSD storage
For production (your telescope array):
- Edge devices: NVIDIA Jetson at each site
- Central server: Multiple GPUs for training
- Cloud backup for burst computing
Final Thoughts¶
You now have a complete foundation in AI/ML for astronomy. The key principles:
- ML finds patterns in data: nothing magical
- Neural networks learn hierarchically: simple to complex features
- Training requires iteration: thousands of updates to find good parameters
- Data quality matters more than model complexity: garbage in, garbage out
- Start simple, add complexity as needed: don't over-engineer initially
For your distributed telescope array, the path is:
- Build local quality assessment (the project above)
- Add transient detection at each site
- Implement cross-site coordination
- Develop data fusion capabilities
- Create autonomous discovery systems
Each step builds on the previous. You don't need to understand everything at once; learn what you need for each stage.
Would you like me to elaborate on any specific section? I can provide:
- More code examples for specific tasks
- Deeper mathematical explanations
- Hardware setup guides
- Specific astronomy ML techniques
- Step-by-step project walkthroughs
Complete Deep Dive: ML for Your Distributed Telescope Array¶
Part 1: How This Applies Specifically to Your Project¶
Your Unique Situation¶
You're not building a single telescope; you're building a network. This fundamentally changes everything about how ML applies to your project. Let me explain why this is both harder and more powerful than single-telescope ML.
The Distributed Data Problem¶
When you have telescopes in different locations, you face challenges that single observatories never encounter:
Heterogeneous Conditions: Your telescope in India sees through different atmosphere than your telescope in Chile. Humidity in one location, dust in another, light pollution patterns unique to each site. A galaxy image from Site A looks subtly different from the same galaxy imaged at Site B, even with identical equipment.
Temporal Asynchrony: It's daytime somewhere while it's nighttime elsewhere. Your network is always partially active, partially sleeping. Events happen when only some telescopes can see them. Coordinating observations across time zones means predicting conditions hours in advance.
Communication Latency: Data from a remote site might take seconds or minutes to reach your central system. In those seconds, a transient event could fade. ML must make local decisions fast while still benefiting from global coordination.
Calibration Drift: Each telescope drifts differently over time. Mirrors get dusty, sensors age, tracking develops quirks. What was perfectly calibrated last month might be slightly off now, and differently off at each site.
How ML Specifically Addresses Your Challenges¶
Learning Site-Specific Characteristics: Rather than manually characterizing each site, ML learns automatically. Feed it data from each telescope along with quality assessments, and it learns that Site A produces slightly bluer images, Site B has periodic vibration from nearby traffic, Site C gets dew formation around 3 AM local time. This knowledge is encoded in the model's parameters; no explicit rules are needed.
Predictive Coordination: ML can learn patterns invisible to humans. Perhaps observations from Sites A and C together, taken within 30 minutes of each other, produce better combined data than A and B together. Maybe certain atmospheric conditions at one site predict what conditions will be at another site two hours later. These correlations exist in your data; ML finds them.
Adaptive Resource Allocation: Your network has finite resources: observation time, storage, bandwidth, and human attention. ML learns to allocate these optimally. When something interesting happens, which telescopes should respond? How should you balance survey observations against transient follow-up? ML can learn policies that maximize scientific output.
Unified Understanding from Diverse Data: The holy grail for your project is combining observations from multiple sites into something greater than any single observation. ML models can learn optimal combination strategies that account for each site's quirks, each observation's quality, and the physics of what you're observing.
The Mathematics Behind Your Specific Needs¶
Let me walk you through the actual math that makes this work for distributed telescope networks.
Multi-Site Calibration: Transfer Learning Mathematics¶
When you train a model on data from Site A, then want it to work at Site B, you're doing transfer learning. Here's how the math works:
Imagine each image can be described by two components: the underlying astronomical signal S, and site-specific effects E. For Site A:
Image_A = S + E_A + noise
For Site B:
Image_B = S + E_B + noise
The astronomical signal S is the same (it's the same object), but E_A and E_B differ. A naive model trained on Site A learns to recognize S + E_A as a unit. It fails at Site B because it's looking for E_A characteristics that aren't there.
Transfer learning separates these. The mathematics involves training the model's early layers (which learn generic features like edges and shapes) to be site-independent, while allowing later layers to adapt. Formally, you minimize a loss function that includes both prediction accuracy and a penalty for how different the learned representations are between sites:
Total Loss = Prediction Error + λ × Domain Difference
The domain difference term forces the model to find representations that work across sites. The λ parameter controls how much you care about cross-site consistency versus raw accuracy.
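To make this concrete, here is a pure-Python toy of the combined loss. The function name is invented, and the squared difference of mean feature activations is a crude stand-in for the more sophisticated domain-difference measures (such as maximum mean discrepancy) used in real transfer learning systems:

```python
def domain_adaptation_loss(pred_errors, feats_a, feats_b, lam=0.1):
    """Toy total loss: mean prediction error plus lambda times the squared
    difference of mean feature activations between the two sites."""
    prediction_error = sum(pred_errors) / len(pred_errors)
    mean_a = sum(feats_a) / len(feats_a)
    mean_b = sum(feats_b) / len(feats_b)
    domain_difference = (mean_a - mean_b) ** 2
    return prediction_error + lam * domain_difference

# Matching mean activations at both sites: no domain penalty
base = domain_adaptation_loss([0.2, 0.4], [1.0, 2.0], [1.5, 1.5])
# Shifted activations at Site B: the penalty pushes the total loss up
shifted = domain_adaptation_loss([0.2, 0.4], [1.0, 2.0], [3.0, 3.0])
print(base, shifted)
```

Increasing `lam` makes the training care more about cross-site agreement, exactly as the λ knob in the formula above.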
Data Fusion: Optimal Combination Theory¶
When combining observations from multiple telescopes, you want to weight each contribution appropriately. The mathematically optimal combination minimizes total uncertainty.
If Telescope 1 measures a value with uncertainty σ₁, and Telescope 2 measures with uncertainty σ₂, the optimal combined estimate is:
Combined = (value₁/σ₁² + value₂/σ₂²) / (1/σ₁² + 1/σ₂²)
This is inverse-variance weighting: better measurements (smaller σ) contribute more.
But in reality, your uncertainties aren't simple numbers. They're complex functions of atmospheric conditions, telescope state, target properties, and inter-site correlations. ML learns this uncertainty structure from data. It implicitly estimates these complex σ values and performs near-optimal combination.
The neural network is learning a function:
Combined_Image = f(Image_A, Image_B, Image_C, Metadata_A, Metadata_B, Metadata_C)
Where f is a highly nonlinear function with millions of parameters, trained to produce combined images that match what expert analysis would produce.
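Before trusting a learned fusion network, it helps to see the baseline it must beat. Here is a minimal sketch of the inverse-variance formula, with invented measurement values:

```python
def inverse_variance_combine(values, sigmas):
    """Combine measurements, weighting each by 1/sigma^2."""
    weights = [1.0 / s ** 2 for s in sigmas]
    combined = sum(v * w for v, w in zip(values, weights)) / sum(weights)
    # The combined uncertainty is smaller than any single measurement's
    combined_sigma = (1.0 / sum(weights)) ** 0.5
    return combined, combined_sigma

# Telescope 1: flux 10.0 +/- 0.5; Telescope 2: flux 10.6 +/- 1.0
value, sigma = inverse_variance_combine([10.0, 10.6], [0.5, 1.0])
print(value, sigma)  # the result sits closer to the more precise measurement
```

The ML fusion network effectively learns to produce these weights from raw images and metadata instead of from hand-supplied σ values.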
Scheduling: Reinforcement Learning Mathematics¶
Deciding which telescope observes what, and when, is a sequential decision problem. The mathematics come from reinforcement learning.
You have a state representing current conditions: weather at each site, queue of targets, recent observation quality, predicted satellite passages, current calibration status, and more.
You take actions: assign target X to telescope Y for Z minutes.
You receive rewards: scientific value of resulting observation, minus costs (slew time, missed opportunities elsewhere).
The goal is to learn a policy (a function mapping states to actions) that maximizes total reward over time.
The mathematics involve the Bellman equation, which describes optimal decision-making:
V(state) = max over all actions of [immediate_reward + γ × V(next_state)]
V(state) is the "value" of being in a particular state: how much total future reward you can expect. The parameter γ (gamma) discounts future rewards (a reward now is worth more than the same reward later).
This equation seems circular (V depends on V), but it can be solved iteratively. Start with a random guess for V, apply the equation repeatedly, and it converges to the true optimal values. Then your policy is just: from any state, take the action that leads to the highest-value next state.
For your telescope network, the state space is enormous. You can't enumerate all possible states. Neural networks approximate V(state), learning to estimate values for any state they encounter. This is deep reinforcement learning.
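The iterative convergence is easy to see in a toy world. This sketch uses an invented one-action corridor (so the "max over actions" is trivial); the point is that repeatedly applying the Bellman update drives V to its fixed point:

```python
GAMMA = 0.9  # discount factor

def value_iteration(n_states=4, sweeps=100):
    """Toy corridor MDP: states 0..3, one action ('move right') per state,
    reward 1 on entering the terminal state. Repeated Bellman updates
    converge to the optimal values."""
    V = [0.0] * n_states
    for _ in range(sweeps):
        for s in range(n_states - 1):  # the last state is terminal, V stays 0
            reward = 1.0 if s + 1 == n_states - 1 else 0.0
            V[s] = reward + GAMMA * V[s + 1]  # Bellman update
    return V

values = value_iteration()
print(values)  # value decays by a factor of gamma per step away from the goal
```

With only one action the update is a plain fixed-point iteration; with many actions you take the max over them at each state, and with huge state spaces you replace the table V with a neural network, which is deep reinforcement learning.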
Anomaly Detection: Statistical Learning Theory¶
Finding unusual objects requires understanding what "usual" looks like. The mathematics here involve probability density estimation.
Given training data of normal observations, you're estimating the probability distribution P(x) over possible observations. An anomaly is something with very low probability: P(x_anomaly) << typical P(x).
Autoencoders approach this indirectly. They learn to compress and reconstruct normal data. The reconstruction error for any input tells you how "unusual" it is:
Anomaly_Score(x) = ||x - Reconstruct(x)||²
If the model can reconstruct x well, it's similar to training data (normal). If reconstruction is poor, it's unlike anything the model has seen (potentially anomalous).
The mathematical guarantee comes from information theory: autoencoders learn efficient codes for the training distribution. Data from outside this distribution can't be efficiently coded, so reconstruction suffers.
For your telescope network, this is powerful. Train on normal observations from all sites. The model learns what normal looks like across your whole network. When something genuinely unusual appears (a new type of transient, an equipment failure mode never seen before, an atmospheric phenomenon unique to one site), the anomaly score spikes.
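The compress-reconstruct-score idea can be demonstrated without a neural network at all. This sketch uses a linear stand-in for an autoencoder: power iteration finds the principal direction of the training data, "encoding" is projection onto that direction, "decoding" projects back, and the anomaly score is the squared reconstruction residual. All names and data here are illustrative:

```python
def anomaly_scores(train, queries, iters=200):
    """Reconstruction-error scoring with a linear stand-in for an autoencoder."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    centered = [(x - mx, y - my) for x, y in train]

    # Power iteration: converges to the top eigenvector of the covariance,
    # i.e. the direction a one-unit linear autoencoder would learn
    vx, vy = 1.0, 0.5
    for _ in range(iters):
        dots = [cx * vx + cy * vy for cx, cy in centered]
        nx = sum(d * cx for d, (cx, cy) in zip(dots, centered)) / n
        ny = sum(d * cy for d, (cx, cy) in zip(dots, centered)) / n
        norm = (nx * nx + ny * ny) ** 0.5
        vx, vy = nx / norm, ny / norm

    scores = []
    for x, y in queries:
        cx, cy = x - mx, y - my
        proj = cx * vx + cy * vy       # "encode" to one number
        rx, ry = proj * vx, proj * vy  # "decode" back to 2-D
        scores.append((cx - rx) ** 2 + (cy - ry) ** 2)
    return scores

# "Normal" data lies near the line y = 2x; the off-line point scores far higher
train = [(t, 2 * t + 0.01 * ((-1) ** i)) for i, t in enumerate(range(-5, 6))]
normal_score, odd_score = anomaly_scores(train, [(2.0, 4.0), (2.0, -4.0)])
print(normal_score, odd_score)
```

A real autoencoder does the same thing nonlinearly and in thousands of dimensions, but the failure mode is identical: anything the learned code cannot represent reconstructs badly and scores high.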
Hardware Requirements for Your Specific Scale¶
Let me be concrete about what hardware your distributed telescope project actually needs.
At Each Telescope Site¶
Edge Computing Unit: You need local ML inference capability. This means:
For a small site (single telescope, basic automation):
- NVIDIA Jetson Nano or Orin Nano
- 4-8 GB unified memory
- Power consumption: 10-15 watts
- Cost: $200-500
- Capabilities: Real-time quality assessment, basic transient detection, image preprocessing
For a medium site (multiple instruments, more sophisticated local processing):
- NVIDIA Jetson AGX Xavier or Orin
- 32-64 GB unified memory
- Power consumption: 30-60 watts
- Cost: $700-2000
- Capabilities: Full local ML pipeline, preliminary data fusion, complex anomaly detection
For a major site (significant local autonomy required):
- Compact server with NVIDIA RTX 4080/4090 or A4000
- 64+ GB system RAM
- Dedicated storage array
- Power consumption: 300-500 watts
- Cost: $3000-8000
- Capabilities: Can operate fully autonomously, train local models, handle complete scientific analysis
Storage: Raw astronomical data accumulates fast. A single night might generate 50-200 GB depending on your instruments. You need:
- Fast SSD for working data (1-4 TB)
- Larger HDD or SSD array for local archive (10-50 TB)
- Fast network interface for uploads (1+ Gbps ideal)
Environmental Considerations: Edge devices at telescope sites face real challenges: temperature swings, humidity, power fluctuations. You need:
- Proper enclosure (temperature-controlled if extreme climate)
- Uninterruptible power supply
- Remote management capability (you can't physically visit every site easily)
Central Coordination System¶
This is where the heavy computation happens: training models, combining data from all sites, running complex analyses.
For a network of 3-5 small telescopes:
- Workstation with 1-2 NVIDIA RTX 4090 GPUs
- 128 GB RAM
- Fast storage: 10+ TB NVMe SSD
- Archive storage: 100+ TB
- Cost: $10,000-20,000
For a network of 5-15 telescopes with serious ambitions:
- Small server cluster or cloud resources
- 4-8 high-end GPUs (RTX 4090, A6000, or equivalent)
- 256-512 GB RAM per node
- Fast interconnect between GPUs
- Petabyte-scale storage
- Cost: $50,000-150,000 (or equivalent cloud spend)
For a large network approaching professional scale:
- HPC cluster or significant cloud allocation
- Dozens of GPUs for parallel training
- Multiple petabytes of storage
- Dedicated networking infrastructure
- Cost: $500,000+ (or major cloud commitment)
Network Infrastructure¶
Your system is only as good as its connectivity:
Bandwidth: Each site needs reliable upload capability. Assuming you want to transfer reduced data (not raw) in near-real-time:
- Minimum: 10 Mbps sustained upload per site
- Comfortable: 100 Mbps sustained upload per site
- Ideal: 1 Gbps (allows raw data transfer if needed)
Latency: For real-time coordination (transient response), latency matters:
- Acceptable: 200-500ms round-trip to central system
- Good: 50-200ms
- Excellent: <50ms
Reliability: Telescopes often sit in remote locations. Network failures happen. Your system needs:
- Local buffering for network outages
- Graceful degradation (sites continue operating independently)
- Automatic reconnection and synchronization
Compute Requirements by Task¶
Different ML tasks have different requirements:
Real-time quality assessment: Very lightweight. A Jetson Nano can run this at 10+ frames per second. Must run locally at each site.
Transient detection: Moderate requirements. Needs to process each frame in less time than the exposure time. For typical 30-60 second exposures, even modest edge hardware is sufficient.
Scheduling optimization: Can be computationally intensive but isn't time-critical. Run on central system, update schedules every few minutes.
Data fusion: Moderately intensive. Combining data from multiple sites requires having all that data in one place and processing it. Central system task.
Model training: By far the most intensive. Training new models or retraining existing ones requires serious GPU power. Plan for multi-hour to multi-day training runs. Can be batched during low-activity periods.
Anomaly detection for discovery: Variable intensity. Simple methods run in real-time. Sophisticated searches over historical data require substantial computation. Balance between always-running lightweight detection and periodic deep searches.
Part 2: ML System for Task Assignment and Observation Creation¶
The Complete Task Assignment System¶
Let me design a comprehensive ML system that handles both assigning existing tasks to telescopes and creating new observation tasks automatically.
Understanding the Problem Space¶
Your task assignment system must juggle competing demands:
Scientific Priorities: Different observations have different value. A follow-up of a confirmed gravitational wave counterpart might be worth 100 times more than a routine survey field. But value isn't fixed; it depends on what's already been observed, what other facilities are doing, and how the target is evolving.
Physical Constraints: Each telescope can only point at part of the sky at any moment. Targets rise and set. Weather changes. Instruments need calibration. Slewing takes time. These constraints are hard; violating them produces zero useful data.
Resource Optimization: Observation time is precious. Every minute spent on a lower-value target is a minute not spent on something better. But you can't always know when something better will appear. Balance exploitation (observe known-good targets) with exploration (survey for unknowns).
Coordination: Multiple telescopes can work together or independently. Some observations benefit from simultaneous multi-site coverage. Others are better done sequentially across sites. The system must know when coordination helps and when it's unnecessary overhead.
Architecture of the Task Assignment ML System¶
The system has several interconnected components:
Component 1: The State Representation Module¶
Before the ML can make decisions, it needs to understand the current state of your entire network. This module maintains a real-time representation including:
Environmental State: For each site, current and predicted conditions (cloud cover, seeing, humidity, wind, moon position and phase, twilight status). This comes from local sensors, weather services, and historical patterns.
Equipment State: Telescope pointing, current filter/instrument configuration, time since last calibration, known issues or limitations, thermal status (some instruments need cooling time after changes).
Queue State: All pending observation requests with their priorities, time constraints, progress so far, and dependencies on other observations.
Historical Context: What has been observed recently? What patterns has the system learned about success rates for different target/site/condition combinations?
External Information: Are there active alerts from gravitational wave detectors, gamma-ray satellites, or other facilities? What are other telescopes doing (from public streams)?
This state representation is updated continuously: some elements every second, others every few minutes.
Component 2: The Value Estimation Network¶
This neural network takes the state representation and, for any proposed observation, estimates its expected scientific value.
The network architecture combines several types of information:
Target Features: Position, brightness, type, variability history, time since last observation, relationship to other targets.
Observation Features: Proposed telescope, exposure time, filters, timing.
Context Features: Current conditions, competing demands, external alerts.
The output is a scalar value estimate plus uncertainty bounds. High uncertainty might mean the system needs more information before committing.
Training this network requires historical data with value labels. You can derive these from:
- Expert assessments of past observations
- Publication outcomes (did this observation lead to science?)
- Detection metrics (did we find what we were looking for?)
- Data quality achieved versus predicted
The network learns to integrate all these factors into a unified value estimate. It might learn that observing a certain type of target at Site B when humidity exceeds 70% has low expected value, even though individually those factors seem fine.
Component 3: The Constraint Satisfaction Engine¶
Not every observation is physically possible. This component evaluates hard constraints:
Visibility: Can the telescope actually see this target now? This involves coordinate transformations, horizon modeling, and obstruction maps.
Timing: Does the observation fit in available time? Account for slew time, setup, and required duration.
Instrument Compatibility: Is the right instrument available? Does the target require filters or modes that this telescope supports?
Exclusive Resources: Some operations can't happen simultaneously: you can't observe two targets at once, can't calibrate while observing, can't change filters mid-exposure.
This component doesn't use ML; it's hard logic. But it interfaces with the ML components to filter impossible options before the system wastes computation evaluating them.
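A minimal sketch of such a hard-constraint filter, with invented field names and a precomputed altitude standing in for real coordinate transformations and horizon models:

```python
from dataclasses import dataclass

@dataclass
class Request:
    target: str
    min_altitude_deg: float  # simplistic visibility proxy
    needed_filter: str
    duration_min: float

@dataclass
class Telescope:
    name: str
    current_altitude_of: dict  # target -> altitude now (computed elsewhere)
    filters: set
    free_minutes: float

def feasible(req, scope):
    """Hard constraint check: visibility, instrument, and time.
    No ML here; infeasible options never reach the value network."""
    alt = scope.current_altitude_of.get(req.target, -90.0)
    return (alt >= req.min_altitude_deg
            and req.needed_filter in scope.filters
            and req.duration_min <= scope.free_minutes)

scope = Telescope('site_a', {'M31': 55.0, 'M51': 12.0}, {'g', 'r'}, 45.0)
queue = [Request('M31', 30.0, 'r', 20.0),  # visible, filter available, fits
         Request('M51', 30.0, 'r', 20.0),  # below the altitude limit
         Request('M31', 30.0, 'z', 20.0)]  # filter not available here
print([feasible(q, scope) for q in queue])
```

In the real system the altitude would come from astronomical ephemeris calculations, but the structure (a cheap boolean gate ahead of the expensive ML scoring) is the point.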
Component 4: The Policy Network¶
This is the core decision-making network. Given the current state and value estimates for all options, it selects actions.
The architecture is a combination of:
Attention Mechanisms: The network can "focus" on the most relevant parts of the state. When responding to a transient alert, it attends strongly to the alert information and capable sites, largely ignoring routine queue items.
Recurrent Components: The network maintains memory of recent decisions. This prevents thrashing (constantly switching between options) and enables multi-step planning.
Multi-Head Output: The network produces decisions for multiple aspects simultaneously: which target, which telescope, what configuration, how long.
The policy network is trained using reinforcement learning. It tries different decisions, observes outcomes, and adjusts to improve over time. The reward signal combines:
- Scientific value of observations obtained
- Efficiency metrics (minimal wasted time)
- Responsiveness (fast reaction to alerts)
- Fairness (different science programs get appropriate time)
Component 5: The Observation Generator¶
This component creates new observation tasks automatically. It's not just assigning existing requests; it's inventing new ones.
Survey Field Selection: For survey operations, the generator proposes fields to observe based on:
- Coverage requirements (what hasn't been observed yet?)
- Scientific priorities for different regions
- Current conditions (which fields are optimally positioned?)
- Expected discovery yield per field
Follow-Up Proposals: When something interesting is detected, the generator creates appropriate follow-up observations:
- Same target, different filters (for color information)
- Same target, later time (for variability)
- Nearby targets (for context)
- Different site (for confirmation)
Calibration Scheduling: The generator monitors data quality and schedules calibrations when needed:
- Regular flats and darks
- Focus checks
- Pointing model updates
- Photometric standard observations
Opportunistic Observations: When primary programs can't observe (weather, equipment issues), the generator proposes useful alternatives:
- Shorter exposures of bright targets
- Engineering tests
- Calibration catch-up
- Low-priority but useful survey work
The Decision Flow¶
Here's how these components work together in real-time:
Continuous Monitoring Phase: State representation is constantly updated. Value estimation network runs in background on high-priority queue items. Constraint engine maintains pre-computed visibility windows.
Decision Point Trigger: When a decision is needed (current observation ending, alert received, conditions changed significantly), the policy network activates.
Option Generation: The observation generator proposes candidates, both from the existing queue and newly created ones. The constraint engine filters them to feasible options.
Value Assessment: The value estimation network scores all feasible options. Scores reflect expected scientific return given current conditions.
Policy Execution: The policy network selects from scored options, considering not just current value but strategic factors (don't neglect long-term programs for short-term gains).
Action Implementation: Commands go to the appropriate telescope. Monitoring continues.
Outcome Observation: When the observation completes, results feed back into training data. Did prediction match reality? What was actual scientific value?
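The phases above can be sketched as one decision cycle, with stand-in callables for each component; everything here is illustrative, and the real system would run these steps continuously against live telescope state:

```python
def decision_cycle(state, queue, generate, is_feasible, value_of, choose):
    """One pass of the decision flow: generate options, filter by hard
    constraints, score by estimated value, then let the policy pick."""
    candidates = queue + generate(state)                   # option generation
    viable = [c for c in candidates if is_feasible(state, c)]
    scored = [(value_of(state, c), c) for c in viable]     # value assessment
    return choose(state, scored)                           # policy execution

# Toy stand-ins: candidates are (target, priority) pairs
state = {'cloudy_targets': {'M51'}}
queue = [('M31', 5), ('M51', 9)]
generate = lambda s: [('calibration', 1)]                  # observation generator
is_feasible = lambda s, c: c[0] not in s['cloudy_targets'] # constraint engine
value_of = lambda s, c: c[1]                               # value network stand-in
choose = lambda s, scored: max(scored)[1] if scored else None
print(decision_cycle(state, queue, generate, is_feasible, value_of, choose))
```

Note that the highest-priority target (M51) loses because it fails a hard constraint: the filter runs before any scoring, exactly as described above.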
Learning and Adaptation¶
The system improves over time through several mechanisms:
Online Learning: Every observation outcome provides training data. The value estimation network continuously refines its predictions. The policy network adjusts its strategies.
Periodic Retraining: Deep retraining happens offline, using accumulated data. This catches slow drifts and discovers new patterns.
Transfer Learning: Insights from one site transfer to others. If the system learns that a certain type of observation requires longer exposures than expected, this knowledge propagates across the network.
Human Feedback Integration: Expert assessments of observations (was this good science? was this a waste of time?) provide high-quality training signal. The system learns to match expert judgment while scaling beyond human attention capacity.
Handling Uncertainty¶
Real-world scheduling faces massive uncertainty. The ML system handles this through:
Probabilistic Predictions: Instead of single-point estimates, the system maintains probability distributions. "The value of this observation is probably around 7, but might be as low as 3 or as high as 15."
Robust Scheduling: When uncertainty is high, the system prefers decisions that are good across many scenarios over decisions that are optimal for one scenario but terrible for others.
Information-Seeking Actions: Sometimes the best decision is to gather more information before committing. The system can propose quick test observations to resolve uncertainty before dedicating major resources.
Graceful Replanning: Plans aren't rigid. When conditions change (weather shifts, new alert arrives, equipment fails), the system replans without requiring human intervention.
Multi-Site Coordination Specifics¶
Your distributed network enables coordination patterns impossible with single telescopes:
Simultaneous Observations: For some targets, observing from multiple sites simultaneously provides unique science (parallax measurements, multi-angle imaging, redundancy against clouds). The task system recognizes these opportunities and schedules accordingly.
Relay Coverage: For time-critical monitoring, sites can relay coverage as the Earth rotates. Site A observes until target sets, Site B picks up as it rises there. The task system plans these handoffs.
Confirmation Mode: An interesting detection at one site can trigger immediate confirmation attempts at other sites. This filters false positives before alerting humans.
Division of Labor: Different sites might specialize in different target types based on their equipment, conditions, or location advantages. The task system learns these specializations and routes accordingly.
Part 3: Limitations of ML and AI¶
Fundamental Limitations¶
Let me be completely honest about what ML cannot do and where it fails.
The Data Dependency¶
ML systems are only as good as their training data. This creates several fundamental limitations:
Garbage In, Garbage Out: If your training data contains errors, biases, or gaps, your model inherits them. A classifier trained on mislabeled images will confidently make the same mistakes. If your training set underrepresents certain types of objects, the model will struggle with them in deployment.
Distribution Shift: ML assumes the future resembles the past. When reality changes (new instrument, different observing strategy, novel type of object), models trained on old data may fail silently. They don't know what they don't know.
Data Volume Requirements: Deep learning requires substantial data. For rare phenomena (unusual transients, exotic object types), you might have only a handful of examples. Models trained on few examples overfit badly. This is the regime where ML struggles most.
Label Quality: Supervised learning needs labeled examples. In astronomy, labels often come from expert classification, which is expensive and sometimes inconsistent. Experts disagree, make mistakes, and have biases. Models learn from this imperfect supervision.
The Black Box Problem¶
Neural networks, especially deep ones, are largely opaque:
No Explanations: When a model classifies an image as a spiral galaxy, it doesn't explain why. You see the input and output, but the reasoning is encoded in millions of parameters that resist human interpretation. For scientific applications, this lack of explanation is problematic.
Debugging Difficulty: When models fail, diagnosing the cause is hard. Unlike traditional code where you can step through logic, neural networks fail in diffuse ways. The bug might be spread across thousands of parameters.
Unpredictable Failures: Models can fail in ways that seem random or inexplicable. An image almost identical to training examples might be misclassified while a completely different image is handled correctly. This unpredictability makes mission-critical deployment risky.
Adversarial Vulnerability: ML models can be fooled by carefully crafted inputs. Small, imperceptible changes to an image can cause confident misclassification. While intentional adversarial attacks are rare in astronomy, natural variations can accidentally hit these failure modes.
The Extrapolation Problem¶
ML excels at interpolation (handling inputs similar to training data). It fails at extrapolation (handling truly novel situations):
Novelty Blindness: A model trained on known object types cannot reliably identify genuinely new types. It might classify them as the nearest known type (missing the discovery) or flag everything unusual (overwhelming you with false positives).
Regime Changes: If physical conditions exceed anything in the training data (brighter sources, fainter sources, different wavelengths, different instruments), model behavior is undefined. It might extrapolate reasonably or fail completely.
Black Swan Events: Extremely rare events (once-per-decade transients, unprecedented phenomena) cannot be in training data by definition. ML provides no advantage over traditional methods for true black swans.
Statistical Limitations¶
ML makes statistical predictions, not certainties:
Irreducible Error: Even a perfect model has error rates. If your best classifier achieves 95% accuracy, that means 5% errors are inherent to the problem given available information. No amount of training reduces this.
Calibration Problems: Models often give poorly calibrated confidence scores. A model might say it's 90% confident when it's actually right only 70% of the time. Or vice versa. Trusting reported confidences without calibration analysis is dangerous.
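Calibration can be measured. A common diagnostic is expected calibration error (ECE): bin predictions by their reported confidence and compare each bin's average confidence with its actual accuracy. A minimal sketch with invented predictions:

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Bin predictions by reported confidence; a well-calibrated model
    has mean confidence close to accuracy in every bin, so ECE near 0."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# An overconfident toy model: claims 90% confidence but is right half the time
confs = [0.9] * 10
hits = [True] * 5 + [False] * 5
print(expected_calibration_error(confs, hits))
```

Running a check like this on held-out data before trusting a model's reported confidences is cheap insurance; for the quality classifier earlier, it would tell you whether "95% confident it's good" actually means 95%.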
Long-Tail Problems: Real data has long tails, with rare examples far from typical. Standard training emphasizes common cases. Rare cases matter scientifically but get little training attention.
Simpson's Paradox and Confounding: ML can find correlations that don't reflect causation. A model might learn that observations at Site A have fewer artifacts, not because Site A is better, but because a skilled operator happens to work there. If that operator leaves, the model's expectations break.
Practical Limitations¶
Beyond theory, real-world ML deployment faces practical challenges:
Computational Costs¶
Training Expense: Training large models requires significant GPU time, often days or weeks. Iteration is slow. Exploring architectural variations is expensive.
Inference Costs: Running models in production requires ongoing computation. For real-time applications, this means dedicated hardware. The marginal cost per prediction might be small, but it's not zero.
Energy Consumption: ML training and inference consume substantial electricity. This matters for remote telescope sites on limited power and for environmental considerations broadly.
Scaling Challenges: As your network grows, ML demands grow too. More data means more storage and processing. More sites mean more edge devices. Costs don't grow linearlyβthey can explode.
Maintenance Burden¶
Model Decay: Deployed models degrade over time as the world changes. Regular retraining is necessary but often neglected.
Technical Debt: ML systems accumulate technical debt faster than traditional software. Data pipelines, feature engineering, model management: all require ongoing attention.
Expertise Requirements: Operating ML systems requires specialized knowledge. Debugging, optimization, and adaptation need skills different from traditional software engineering.
Integration Complexity: ML models must interface with data systems, hardware, user interfaces, and other ML models. Integration is frequently underestimated.
Human Factors¶
Trust Calibration: People tend to either over-trust ML (automation bias) or under-trust it (algorithm aversion). Neither is appropriate. Developing correct calibration requires experience and training.
Deskilling Risk: Relying on ML can atrophy human expertise. If the ML always classifies images, operators lose classification skills. When the ML fails, humans may not be able to recover.
Accountability Gaps: When an ML system makes a decision, who is responsible? This question becomes sharp when decisions matter: prioritizing observations, triggering alerts, discarding data.
Transparency Demands: Science requires reproducibility and explanation. ML systems often can't explain their decisions in scientifically meaningful terms. This creates tension with scientific values.
Astronomy-Specific Limitations¶
Some limitations are particularly relevant to astronomical applications:
Rare Object Discovery¶
The most exciting discoveries are often things never seen before. ML is inherently weak here:
Training Paradox: You can't train on examples of objects that haven't been discovered yet. The first detection of a new phenomenon must come through some other means.
Confirmation Bias: ML systems favor known categories. A new type of transient might be classified as the most similar known type, its novelty invisible.
Anomaly Flooding: Systems tuned for novelty detection produce many false positives. The genuine discovery drowns in a sea of artifacts, glitches, and merely unusual known objects.
Small Sample Science¶
Much of astronomy involves small numbers of special objects:
Few-Shot Learning Limits: Despite progress, ML still struggles when training examples number in tens rather than thousands. Rare object types remain hard.
Statistical Power: ML confidence intervals on small-sample predictions are necessarily wide. Claims based on few examples require extra skepticism.
Selection Effects: Training data for rare objects often has selection effects. We observe the bright examples, miss the faint ones. Models learn these biases.
Systematic Effects¶
Telescope data has systematic effects that ML can mislearn:
Instrumental Signatures: ML might learn to recognize CCD artifacts, scattered light patterns, or optical ghosts rather than astronomical signal. It might even perform better by using these cluesβwhile learning nothing about astronomy.
Time-Dependent Effects: Sensors change over time. Training data from last year might not represent this year's behavior. Models need constant recalibration.
Site-Specific Quirks: In a distributed network, site-specific systematics are pernicious. A model might learn that a certain pattern indicates good data at Site A while the same pattern indicates bad data at Site B, without any astronomical reason.
Physical Understanding¶
ML is fundamentally empirical: it learns patterns without understanding physics:
No Physical Constraints: A physics model knows that certain configurations are impossible. ML doesn't. It might predict physically impossible stellar properties or generate images that violate conservation laws.
No Generalization to New Regimes: Physical understanding allows extrapolation to new regimes. ML cannot. A stellar model based on physics works for stars never observed. An ML model might fail on any star outside the training distribution.
Explanation vs. Prediction: Science values explanation. ML provides prediction without explanation. A model that predicts stellar properties accurately but offers no insight into stellar physics is scientifically incomplete.
What ML Cannot Replace¶
Despite capabilities, some things remain firmly beyond ML:
Scientific Judgment: Deciding what questions to ask, what observations would be most informative, what results meanβthese require human insight ML cannot provide.
Novel Hypothesis Generation: ML finds patterns in data. Generating new theoretical frameworks to explain patterns requires creativity ML lacks.
Ethical Considerations: Decisions about resource allocation, data sharing, collaboration, and publication involve values ML cannot assess.
Error Checking: ML systems make mistakes. Humans must check results, especially unusual ones. Removing humans from the loop is dangerous.
Adaptation to Truly Novel Situations: When something genuinely unprecedented happens, human flexibility exceeds ML rigidity.
Part 4: Battle-Tested Libraries and Models¶
Core Deep Learning Frameworks¶
These are the foundations everything else builds on:
PyTorch¶
The dominant framework for research and increasingly for production. Developed by Meta AI.
Strengths: Intuitive design that matches how you think about neural networks. Excellent debugging (standard Python debugging works). Huge ecosystem. Active development. Strong community.
Weaknesses: Deployment to production requires additional tooling. Can be memory-inefficient compared to alternatives.
Maturity: Extremely mature. Used by most academic labs, many companies. If something works in deep learning, there's a PyTorch implementation.
Astronomy Usage: Default choice for new astronomical ML projects. Most astronomical ML papers use PyTorch.
TensorFlow¶
Google's framework. Older and more established in production settings.
Strengths: Excellent production deployment tools. TensorFlow Serving for scalable inference. TensorFlow Lite for edge devices. Strong enterprise support.
Weaknesses: Less intuitive programming model (though Keras helps). Slower to adopt research innovations.
Maturity: Very mature. Powers much of Google's ML. Extensive production track record.
Astronomy Usage: Still used in many production systems. Large astronomical surveys often use TensorFlow for deployment stability.
JAX¶
Google's newer framework focused on high performance and functional programming.
Strengths: Incredible performance through XLA compilation. Easy parallelization across devices. Automatic differentiation through arbitrary Python code.
Weaknesses: Steeper learning curve. Smaller ecosystem than PyTorch/TensorFlow. Functional paradigm unfamiliar to many.
Maturity: Mature but younger than alternatives. Growing adoption in research.
Astronomy Usage: Growing in computational astrophysics. Good for physics-informed neural networks.
Traditional Machine Learning¶
Not everything needs deep learning. These libraries handle classical ML:
scikit-learn¶
The standard library for classical machine learning in Python.
Capabilities: Classification (random forests, SVMs, logistic regression), regression, clustering (k-means, DBSCAN), dimensionality reduction (PCA, t-SNE), preprocessing, model selection, metrics.
Strengths: Consistent API across all algorithms. Excellent documentation. Very well tested. Fast for moderate data sizes.
Weaknesses: Not designed for deep learning. Doesn't scale to very large datasets (millions of examples, many features).
Maturity: Extremely mature. Used in production at countless companies. The default choice for non-deep-learning ML in Python.
Astronomy Usage: Widely used for classification tasks, clustering, and as baseline comparisons for deep learning approaches.
XGBoost / LightGBM / CatBoost¶
Gradient boosting libraries. Often the best choice for tabular data.
Capabilities: Classification and regression on tabular data. Handles missing values, categorical features. Often achieves state-of-the-art on structured data.
Strengths: Often beats neural networks on tabular data. Fast training and inference. Built-in handling of many practical issues.
Weaknesses: Not for images, sequences, or other unstructured data. Requires feature engineering.
Maturity: Very mature. Winners of many Kaggle competitions. Widely deployed in industry.
Astronomy Usage: Excellent for tasks with tabular features (stellar parameters from catalog data, transient classification from light curve features, photometric redshift estimation).
Computer Vision Libraries¶
For image-based astronomical data:
torchvision¶
PyTorch's computer vision library.
Capabilities: Pre-trained models (ResNet, EfficientNet, Vision Transformers). Image transformations and augmentation. Standard datasets. Detection and segmentation models.
Strengths: Tight integration with PyTorch. Well-maintained pre-trained weights. Standard transforms.
Weaknesses: Geared toward natural images (ImageNet). Astronomical images need adaptation.
Maturity: Very mature. Used everywhere PyTorch is used for vision.
Astronomy Usage: Starting point for most image classification work. Pre-trained models fine-tuned for astronomical tasks.
timm (PyTorch Image Models)¶
Huge collection of state-of-the-art image models.
Capabilities: Hundreds of model architectures with pre-trained weights. Includes latest research models. Consistent interface across all models.
Strengths: Most comprehensive collection available. Often has weights trained on larger datasets than torchvision. Regular updates with new models.
Weaknesses: So many options can be overwhelming. Documentation varies.
Maturity: Mature and widely used. Default source for SOTA image models.
Astronomy Usage: When you need the latest architectures for challenging classification or detection tasks.
Albumentations¶
Image augmentation library.
Capabilities: Fast augmentations (rotation, flipping, scaling, color adjustments, noise injection, and many more). Handles masks for segmentation. Handles keypoints and bounding boxes.
Strengths: Much faster than alternatives. Huge variety of transforms. Well-designed for ML pipelines.
Weaknesses: Learning curve for composition syntax.
Maturity: Very mature. Standard choice for augmentation in PyTorch pipelines.
Astronomy Usage: Essential for training robust astronomical image classifiers with limited data.
Astronomy-Specific Libraries¶
These are built specifically for astronomical ML:
AstroML¶
Machine learning for astronomy, built on scikit-learn.
Capabilities: Astronomical datasets, statistical tools, density estimation, time-series analysis, classification examples.
Strengths: Designed by astronomers for astronomers. Includes relevant datasets. Good tutorial material.
Weaknesses: Less actively developed than general ML libraries. Focuses on classical ML rather than deep learning.
Maturity: Mature but somewhat dated. Good for learning, less so for cutting-edge work.
Astronomy Usage: Learning astronomical ML. Baseline methods. Statistical analysis.
astropy¶
Not ML per se, but essential for astronomical data handling.
Capabilities: FITS file I/O, coordinate transformations, unit handling, cosmological calculations, time handling, table operations, astronomical constants.
Strengths: The standard astronomical Python library. Comprehensive. Well-documented. Actively developed.
Weaknesses: Not ML-specific. You need it alongside ML libraries, not instead of them.
Maturity: Extremely mature. Used by virtually all Python-based astronomical software.
Astronomy Usage: Loading data, coordinate handling, preprocessing. Essential foundation for any astronomical ML work.
photutils¶
Source detection and photometry.
Capabilities: Source detection, aperture and PSF photometry, background estimation, segmentation, centroiding.
Strengths: Standard astronomical photometry methods. Well-integrated with astropy.
Weaknesses: Classical methods, not ML-based.
Maturity: Mature. Standard tool for photometric analysis.
Astronomy Usage: Preprocessing before ML. Ground truth generation. Baseline comparisons.
SEP (Source Extractor in Python)¶
Python binding for Source Extractor functionality.
Capabilities: Background estimation, source detection, photometry. Fast C implementation with Python interface.
Strengths: Very fast. Matches behavior of classic Source Extractor.
Weaknesses: Less flexible than pure Python alternatives.
Maturity: Mature. Based on decades-old, proven algorithms.
Astronomy Usage: Fast preprocessing. Production pipelines where speed matters.
Time-Series Libraries¶
For light curves and temporal data:
tsfresh¶
Automatic feature extraction from time series.
Capabilities: Extracts hundreds of features from time series automatically. Features include statistical moments, spectral properties, entropy measures, and more.
Strengths: Comprehensive feature extraction. Little manual engineering needed. Works well with classical ML.
Weaknesses: Can be slow on large datasets. Feature explosion requires selection.
Maturity: Mature. Used in many time-series competition winners.
Astronomy Usage: Light curve classification. Variable star analysis. Transient characterization.
tslearn¶
Time series machine learning.
Capabilities: Time series classification, clustering, and metrics. DTW (dynamic time warping) implementations. Time series transformations.
Strengths: Dedicated to time series. Includes specialized algorithms not in general libraries.
Weaknesses: Less comprehensive than combining general libraries.
Maturity: Mature. Good for time-series-specific algorithms.
Astronomy Usage: Light curve similarity searches. Variable star clustering.
Reinforcement Learning¶
For scheduling and control:
Stable Baselines3¶
Standard implementations of RL algorithms.
Capabilities: PPO, A2C, SAC, TD3, DQN, and more. Consistent API. Built on PyTorch.
Strengths: Well-tested implementations. Active development. Good documentation.
Weaknesses: Customization can be awkward. RL still requires significant tuning.
Maturity: Mature. Standard starting point for applied RL.
Astronomy Usage: Telescope scheduling. Adaptive control systems. Resource allocation.
RLlib¶
Scalable RL library from Ray.
Capabilities: Distributed training, many algorithms, multi-agent RL, custom environments.
Strengths: Scales to large problems. Production-ready. Integrates with Ray ecosystem.
Weaknesses: Complex setup. Overkill for simple problems.
Maturity: Mature. Used at scale by many companies.
Astronomy Usage: Large-scale scheduling optimization. Multi-telescope coordination.
Pre-trained Models for Astronomy¶
Some models trained specifically on astronomical data:
Zoobot¶
Galaxy morphology classification models.
Training Data: Trained on Galaxy Zoo volunteer classifications of hundreds of thousands of galaxies.
Capabilities: Predicts detailed morphological features (spiral arms, bars, bulges, mergers, etc.). State-of-the-art galaxy classification.
Availability: Open source with pre-trained weights.
Astronomy Usage: Galaxy classification. Transfer learning starting point for morphology tasks.
AstroCLIP¶
Contrastive learning model for astronomical images.
Training Data: Trained on large astronomical image collections with self-supervised learning.
Capabilities: General-purpose astronomical image embeddings. Can be fine-tuned for various tasks.
Availability: Research code and weights available.
Astronomy Usage: Starting point for custom classification. Image similarity search.
ASTROMER¶
Transformer model for light curves.
Training Data: Pre-trained on large light curve collections.
Capabilities: Learns general representations of time-varying astronomical sources. Fine-tunable for classification.
Availability: Research code available.
Astronomy Usage: Variable star classification. Transient classification. Light curve analysis.
Deployment Tools¶
For putting models into production:
ONNX¶
Open Neural Network Exchange format.
Capabilities: Convert models between frameworks. Optimize for inference. Deploy to various runtimes.
Strengths: Framework-agnostic. Good optimization. Wide runtime support.
Weaknesses: Not all operations supported. Conversion can be tricky.
Maturity: Very mature. Industry standard for model exchange.
Astronomy Usage: Deploy PyTorch models to edge devices. Cross-framework compatibility.
TensorRT¶
NVIDIA's inference optimizer.
Capabilities: Optimize neural networks for NVIDIA GPUs. Quantization, layer fusion, kernel optimization.
Strengths: Massive speedups on NVIDIA hardware. Production-ready.
Weaknesses: NVIDIA-only. Requires supported operations.
Maturity: Very mature. Used in production at scale.
Astronomy Usage: Fast inference on GPU-equipped systems.
Docker¶
Containerization for ML systems.
Capabilities: Package entire ML environment (code, dependencies, models) into reproducible containers.
Strengths: Reproducibility. Easy deployment. Isolation.
Weaknesses: Learning curve. Container management overhead.
Maturity: Extremely mature. Industry standard for deployment.
Astronomy Usage: Deploy ML pipelines to telescope sites. Reproducible analysis environments.
Data Management¶
For handling large astronomical datasets:
Dask¶
Parallel computing library.
Capabilities: Parallel arrays, dataframes, and custom computations. Scales from laptop to cluster.
Strengths: Familiar pandas/numpy API. Lazy evaluation. Good for larger-than-memory data.
Weaknesses: Some operations less efficient than native numpy. Debugging can be tricky.
Maturity: Very mature. Widely used for large-scale data processing.
Astronomy Usage: Processing large image collections. Distributed feature extraction.
Vaex¶
Out-of-core dataframes.
Capabilities: Billion-row dataframes on a laptop. Memory-mapped files. Fast aggregations.
Strengths: Handles huge catalogs that don't fit in memory. Very fast for certain operations.
Weaknesses: Less flexible than pandas. Not all operations supported.
Maturity: Mature. Good for specific large-data use cases.
Astronomy Usage: Exploring large astronomical catalogs. Cross-matching. Feature extraction from massive datasets.
This covers the complete landscape of ML for your distributed telescope project: the specific applications, the underlying mathematics, the hardware requirements, system design for task assignment, honest limitations, and the practical tools you'll use. Let me know if you want to dive deeper into any particular area.
Understanding ML Model Architectures for Your Distributed Telescope Array¶
Let me take you through every major architecture, how it actually works, and exactly where each fits into your telescope network.
The Fundamental Question: Why Different Architectures?¶
Before diving into specifics, understand why we have different architectures at all.
Data comes in different shapes:
Tabular data: Rows and columns, like a spreadsheet. Star catalogs with measurements. Each row is independent, columns are features.
Images: 2D grids of pixels. Your telescope frames. Nearby pixels are related. Spatial structure matters.
Sequences: Ordered data points. Light curves over time. What came before affects interpretation of what comes after.
Graphs: Networks of connected entities. Stars in clusters. Galaxies in groups. Relationships between objects matter.
Sets: Collections without order. Multiple observations of the same field. The set matters, not the sequence.
Each architecture embodies assumptions about data structure. Using the wrong architecture means fighting against its assumptions. Using the right architecture means the model naturally captures relevant patterns.
Feedforward Neural Networks: The Foundation¶
What They Are¶
The simplest neural network. Data flows in one direction: input to output, no loops, no memory.
Input Layer → Hidden Layer 1 → Hidden Layer 2 → ... → Output Layer
Each layer is fully connected to the next. Every neuron in layer N connects to every neuron in layer N+1.
How They Process Information¶
Imagine your input is a vector of 100 numbers representing measurements of a star: brightness in different filters, position, proper motion, and so on.
Layer 1 (say, 256 neurons): Each neuron computes a weighted sum of all 100 inputs, adds a bias, applies an activation function. You get 256 new numbers, each representing some combination of the original features.
Layer 2 (say, 128 neurons): Each neuron takes all 256 outputs from Layer 1, computes weighted sums, applies activation. Now you have 128 numbers representing combinations of combinations.
Output Layer (say, 5 neurons for 5 star types): Each neuron combines the 128 Layer 2 outputs. Apply softmax to get probabilities.
The key insight: each successive layer learns more abstract representations. Layer 1 might learn "this combination of colors indicates high temperature." Layer 2 might learn "high temperature plus this proper motion pattern suggests a certain stellar population."
Mathematical Formulation¶
For a single layer:
output = activation(weights × input + bias)
Where:
- input is a vector of N values
- weights is a matrix of size (M × N), where M is the number of neurons
- bias is a vector of M values
- activation is a nonlinear function applied element-wise
- output is a vector of M values
Stacking layers:
h₁ = activation(W₁ × input + b₁)
h₂ = activation(W₂ × h₁ + b₂)
h₃ = activation(W₃ × h₂ + b₃)
output = softmax(W₄ × h₃ + b₄)
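The stacked layers above can be sketched in a few lines of pure Python; the weights, sizes, and input values here are illustrative toys:

```python
import math

def dense(x, W, b):
    """One fully connected layer: each output is a weighted sum plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    exps = [math.exp(a - max(v)) for a in v]  # subtract max for stability
    s = sum(exps)
    return [e / s for e in exps]

# Toy network: 3 inputs -> 2 hidden neurons -> 2 classes
x = [1.0, 2.0, 3.0]
W1, b1 = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], [0.0, 0.0]
W2, b2 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]

h = relu(dense(x, W1, b1))         # hidden representation
probs = softmax(dense(h, W2, b2))  # class probabilities
print(all(p > 0 for p in probs), round(sum(probs), 6))  # True 1.0
```

Each `dense` call is exactly the `weights × input + bias` step from the formulas, and the probabilities always sum to one because of the softmax.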
Strengths¶
Universality: Can theoretically approximate any function given enough neurons. This is a mathematical guarantee.
Simplicity: Easy to implement, understand, debug. Training is straightforward.
Speed: Fast inference. No complex operations, just matrix multiplications.
Flexibility: Works on any fixed-size input. No structural assumptions beyond input dimension.
Weaknesses¶
No spatial awareness: Treats each input independently. For images, pixel 1 and pixel 1000 are equally "distant" from the network's perspective, even if they're adjacent in the image.
No temporal awareness: Each input is processed independently. Can't learn that a brightness measurement depends on previous measurements.
Parameter explosion: For large inputs, fully-connected layers have enormous numbers of parameters. A 256×256 image has 65,536 pixels. A single hidden layer of 1000 neurons would have 65 million parameters just for that layer.
No weight sharing: Patterns learned in one part of the input don't transfer to other parts. A galaxy in the corner of an image requires separate learning from a galaxy in the center.
For Your Telescope Array¶
Good for: Processing extracted features (not raw images). Tabular data from catalogs. Final classification layers after other architectures have extracted features.
Specific applications:
- Classifying stars from catalog measurements (colors, proper motions, parallax)
- Predicting observation quality from metadata (temperature, humidity, moon phase, elevation)
- Combining high-level features from multiple sources for final decision-making
- Quick assessment models where speed matters more than accuracy
Example scenario: You've extracted 50 features from a light curve (mean brightness, variance, periodicity measures, etc.). A feedforward network takes these 50 numbers and classifies the variable star type. The feature extraction handles temporal structure; the feedforward network handles the final classification.
Convolutional Neural Networks: Spatial Intelligence¶
What They Are¶
Networks designed for data with spatial structure, primarily images. Instead of connecting every input to every neuron, they use local connections and weight sharing.
The Core Insight¶
Images have two crucial properties feedforward networks ignore:
Locality: Relevant patterns are local. An edge is a few pixels. A star is a small region. You don't need to look at pixels 1000 apart simultaneously to detect these patterns.
Translation invariance: A spiral arm looks like a spiral arm regardless of where it appears in the image. Learning to recognize it in one location should transfer to all locations.
CNNs embody these assumptions through convolution operations.
How Convolution Works¶
A convolutional layer has small filters (also called kernels), typically 3×3, 5×5, or 7×7 pixels.
Each filter slides across the entire image, computing a dot product at each position:
Image patch:      Filter:          Computation:
[a b c]           [w₁ w₂ w₃]       output = a×w₁ + b×w₂ + c×w₃ +
[d e f]     ×     [w₄ w₅ w₆]                d×w₄ + e×w₅ + f×w₆ +
[g h i]           [w₇ w₈ w₉]                g×w₇ + h×w₈ + i×w₉
This single number represents "how much does this patch match this filter?"
Sliding the filter across all positions produces a feature map: a 2D grid showing where the pattern was detected.
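Here is the sliding-filter computation as a minimal pure-Python sketch (no padding, stride 1; the image and kernel are illustrative):

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over every valid position; each output value is
    the dot product of the kernel with the image patch under it."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for m in range(kh):
                for n in range(kw):
                    acc += image[i + m][j + n] * kernel[m][n]
            row.append(acc)
        out.append(row)
    return out

# A difference kernel responds where brightness changes left to right,
# i.e. it acts as a vertical-edge detector.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
edge = [[-1, 1]]  # 1x2 difference kernel
print(conv2d_valid(image, edge))
# [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
```

The output is the feature map: it peaks exactly at the column where the dark-to-bright edge sits, and is zero where the patch doesn't match the filter.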
Multiple Filters, Multiple Layers¶
A single convolutional layer has many filters (32, 64, 128 are common). Each learns to detect a different pattern.
Layer 1 filters learn simple patterns:
- Horizontal edges
- Vertical edges
- Diagonal edges
- Brightness gradients
- Spots of various sizes
Layer 2 filters operate on Layer 1's output, learning combinations:
- Corners (horizontal + vertical edges)
- Curves (sequences of edge orientations)
- Texture patterns
- Ring-like structures
Layer 3 and beyond learn increasingly complex combinations:
- Spiral arm signatures
- Galaxy core patterns
- Specific artifact shapes
- Complex morphological features
This hierarchy emerges automatically from training. You don't specify "learn edges then corners then spirals." The network discovers this hierarchy because it's efficient for reducing classification error.
Pooling Operations¶
Between convolutional layers, pooling reduces spatial dimensions:
Max pooling: Take the maximum value in each small region
[1 3 2 4]
[5 6 1 2] → Max pool 2×2 → [6 4]
[3 2 1 0] [3 3]
[1 2 3 1]
Average pooling: Take the mean value in each region
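The 2×2 max-pooling example above can be checked in a few lines of pure Python (the grid is the one from the example):

```python
def max_pool_2x2(grid):
    """Take the maximum over each non-overlapping 2x2 block."""
    return [
        [max(grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1])
         for j in range(0, len(grid[0]), 2)]
        for i in range(0, len(grid), 2)
    ]

grid = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [3, 2, 1, 0],
    [1, 2, 3, 1],
]
print(max_pool_2x2(grid))  # [[6, 4], [3, 3]]
```

Notice how shifting a value within its 2×2 block would not change the output: that is the small translation invariance pooling buys.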
Pooling provides:
- Reduced computation for subsequent layers
- Some translation invariance (small shifts don't change max values much)
- Larger effective receptive field (later layers "see" more of the original image)
Receptive Fields¶
A crucial concept: how much of the original image influences a single neuron in a later layer?
Layer 1 neuron: Sees only its 3×3 filter region. Receptive field = 9 pixels.
Layer 2 neuron: Takes input from Layer 1 neurons, each of which saw 3×3. After pooling, each Layer 2 neuron effectively sees ~6×6 of the original image.
Deep layer neuron: Might effectively see the entire image, but through a hierarchical lens.
This is why deep CNNs can learn global patterns while still using local operations: information propagates through the hierarchy.
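Receptive-field growth follows a simple recurrence: each layer widens the field by (kernel − 1) times the cumulative stride of everything before it. A sketch (exact numbers depend on where pooling sits in the stack; the layer lists below are illustrative):

```python
def receptive_field(layers):
    """Track receptive field size through a stack of (kernel, stride) layers.
    Each layer grows the field by (kernel - 1) times the cumulative stride."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride  # later layers step over more original pixels
    return field

print(receptive_field([(3, 1)]))                  # 3  -> a 3x3 patch
print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8  (conv, pool, conv)
```

Stack enough conv/pool pairs and the field covers the whole image, which is how local 3×3 operations end up seeing global structure.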
Mathematical Formulation¶
For a 2D convolution:
output[i,j] = Σₘ Σₙ input[i+m, j+n] × filter[m,n] + bias
Where the sums run over the filter dimensions.
With multiple input channels (like RGB, or previous layer features):
output[i,j] = Σ_c Σₘ Σₙ input[c, i+m, j+n] × filter[c,m,n] + bias
Where c indexes input channels.
Architecture Patterns¶
Standard CNN architectures follow patterns:
VGG pattern: Stack many 3×3 convolutions. Simple but effective.
Conv3×3 → Conv3×3 → Pool → Conv3×3 → Conv3×3 → Pool → ... → Dense → Output
ResNet pattern: Add skip connections that let gradients flow directly through many layers.
input → Conv → Conv → (+input) → Conv → Conv → (+previous) → ...
Skip connections solve the vanishing gradient problem, allowing very deep networks (50, 100, 150+ layers).
Inception/GoogLeNet pattern: Use multiple filter sizes in parallel, concatenate results.
input → [1×1 conv, 3×3 conv, 5×5 conv, pool] → concatenate → ...
This captures patterns at multiple scales simultaneously.
Strengths¶
Parameter efficiency: A 3×3 filter has 9 parameters regardless of image size. Compared to feedforward networks, CNNs have far fewer parameters.
Translation equivariance: A pattern detected at position (10, 10) uses the same weights as detection at (100, 100). Learning transfers across positions.
Hierarchical feature learning: Automatically learns appropriate feature hierarchy for the task.
Proven architecture: Decades of refinement. Well-understood behavior. Extensive pre-trained models available.
Weaknesses¶
Fixed input size: Standard CNNs expect fixed image dimensions. Variable sizes require padding, cropping, or architectural changes.
Limited global awareness: Despite stacking layers, CNNs can struggle with patterns requiring true global context. A pattern depending on opposite corners remains hard.
Translation invariance can hurt: Sometimes position matters. The center of a galaxy image is semantically different from the edge. Pure CNNs don't distinguish.
No temporal understanding: Each image is processed independently. Sequential relationships require additional architecture.
For Your Telescope Array¶
Good for: Any image-based task. Quality assessment. Object detection. Galaxy classification. Artifact identification.
Specific applications:
Real-time quality assessment: A lightweight CNN at each telescope evaluates incoming frames. Input: single frame. Output: quality score and issue flags (clouds, tracking error, focus problem, etc.).
Source detection: Semantic segmentation CNNs identify every source in an image. Each pixel gets classified: background, star, galaxy, artifact, satellite trail.
Galaxy morphology: CNNs trained on Galaxy Zoo data classify galaxy types, identify features like bars, rings, spiral arms, merger signatures.
Transient detection: CNNs compare new images to references, classifying differences as real transients, artifacts, or noise.
Cross-site calibration: CNNs learn to map images from different sites to a common representation, normalizing site-specific effects.
Example architecture for your quality classifier:
Input: 256×256 grayscale image

Block 1: Conv(32 filters, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
  Output: 128×128×32
Block 2: Conv(64 filters, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
  Output: 64×64×64
Block 3: Conv(128 filters, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
  Output: 32×32×128
Block 4: Conv(256 filters, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
  Output: 16×16×256

Global Average Pool: 256 values
Dense(128) → ReLU → Dropout(0.5)
Dense(3) → Softmax
Output: probabilities for [good, medium, bad]
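Assuming the convolutions are same-padded (so only the 2×2 pools change spatial size, which is what the shapes in the listing above imply), the output shapes can be checked mechanically:

```python
def track_shapes(size, channel_plan):
    """Follow the spatial size through blocks of same-padded 3x3 conv
    followed by 2x2 max pooling (each block halves height and width)."""
    shapes = []
    for channels in channel_plan:
        size //= 2  # the 2x2 pool halves each spatial dimension
        shapes.append((size, size, channels))
    return shapes

print(track_shapes(256, [32, 64, 128, 256]))
# [(128, 128, 32), (64, 64, 64), (32, 32, 128), (16, 16, 256)]
```

Tracing shapes like this before building a model is a cheap way to catch dimension mismatches between the convolutional stack and the dense head.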
Recurrent Neural Networks: Temporal Intelligence¶
What They Are¶
Networks designed for sequential data. They maintain internal state that persists across sequence elements, giving them a form of memory.
The Core Insight¶
Many phenomena unfold over time. A light curve isn't just a collection of brightness measurements; it's an ordered sequence where each measurement relates to those before and after.
Standard feedforward networks process each input independently. RNNs process sequences element by element, maintaining hidden state that captures what they've seen so far.
Basic RNN Operation¶
At each time step t:
hidden[t] = activation(W_input × input[t] + W_hidden × hidden[t-1] + bias)
output[t] = W_output × hidden[t]
The key: hidden[t] depends on hidden[t-1]. Information flows forward through time.
input[0] → [RNN Cell] → hidden[0] → output[0]
                            ↓
input[1] → [RNN Cell] → hidden[1] → output[1]
                            ↓
input[2] → [RNN Cell] → hidden[2] → output[2]
                            ↓
...
The same weights are used at every time step. The only thing changing is the hidden state.
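To make the recurrence concrete, here is a minimal pure-Python RNN with scalar state; the weight values are illustrative:

```python
import math

def rnn_forward(inputs, w_in, w_hidden, bias):
    """Process a sequence one element at a time, carrying hidden state.
    The same three weights are reused at every step."""
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = math.tanh(w_in * x + w_hidden * hidden + bias)
        states.append(hidden)
    return states

# The same input value yields different hidden states depending on history:
states = rnn_forward([1.0, 1.0, 1.0], w_in=0.5, w_hidden=0.8, bias=0.0)
print(states[1] > states[0])  # True: step 2 carries memory of step 1
```

Even with identical inputs at every step, the hidden state evolves, which is exactly the memory a feedforward network lacks.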
The Vanishing Gradient Problem¶
Basic RNNs have a critical flaw: information fades over time.
During training, gradients must flow backward through time. At each step, they get multiplied by weights. If weights are less than 1, gradients shrink exponentially. After 50 or 100 steps, gradients are effectively zero.
Result: basic RNNs can only learn short-range dependencies. They forget distant past, even when it's crucial.
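The shrinkage is easy to see numerically. This simplified scalar picture ignores activations and treats the recurrent weight as a single number:

```python
# Gradients flowing back through T steps get multiplied by the recurrent
# weight once per step; below 1, they shrink exponentially.
weight = 0.9
gradient = 1.0
for step in range(100):
    gradient *= weight
print(gradient < 1e-4)  # True: after 100 steps the signal is effectively gone
```

With a weight of 0.9, the gradient after 100 steps is around 2.7e-5: whatever happened 100 steps ago contributes essentially nothing to the weight update.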
LSTM: Long Short-Term Memory¶
LSTMs solve the vanishing gradient problem with a gated architecture:
┌─────────────────────────────────────────┐
│                LSTM Cell                │
│                                         │
│   ┌──────┐   ┌──────┐   ┌──────┐        │
│   │Forget│   │Input │   │Output│        │
│   │ Gate │   │ Gate │   │ Gate │        │
│   └──────┘   └──────┘   └──────┘        │
│      │          │          │            │
│   ┌─────────────────────────────┐       │
│   │         Cell State          │ ──────┼──→ (memory highway)
│   └─────────────────────────────┘       │
│                                         │
└─────────────────────────────────────────┘
Forget gate: Decides what to discard from cell state. "The transit event is over, forget those details."
Input gate: Decides what new information to store. "This brightness spike is important, remember it."
Output gate: Decides what to output based on cell state. "Based on everything seen, output this classification."
Cell state: The memory highway. Information can flow unchanged across many time steps, and gradients flow along it through elementwise gating rather than repeated multiplication by weight matrices, so they fade far more slowly.
The mathematics:
forget = sigmoid(W_f × [hidden[t-1], input[t]] + b_f)
input_gate = sigmoid(W_i × [hidden[t-1], input[t]] + b_i)
candidate = tanh(W_c × [hidden[t-1], input[t]] + b_c)
cell[t] = forget × cell[t-1] + input_gate × candidate
output_gate = sigmoid(W_o × [hidden[t-1], input[t]] + b_o)
hidden[t] = output_gate × tanh(cell[t])
The gates are sigmoid functions outputting values between 0 and 1, acting as soft switches.
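A minimal NumPy sketch of one LSTM step, following the equations above. The four gate pre-activations are packed into a single weight matrix (a common implementation trick); all shapes and names are illustrative, not any particular library's API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to four stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates: soft switches in (0, 1)
    g = np.tanh(g)                                # candidate values
    c = f * c_prev + i * g                        # cell state: memory highway
    h = o * np.tanh(c)                            # gated output
    return h, c

rng = np.random.default_rng(1)
H, D = 8, 3                                  # hidden size, input size
W = rng.normal(size=(4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(20, D)):         # run over a 20-step sequence
    h, c = lstm_step(x_t, h, c, W, b)
```

The key line is `c = f * c_prev + i * g`: when the forget gate stays near 1, information rides the cell state across many steps unchanged.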
GRU: Gated Recurrent Unit¶
A simplified gating mechanism, often performing comparably to LSTM with fewer parameters:
reset = sigmoid(W_r × [hidden[t-1], input[t]])
update = sigmoid(W_u × [hidden[t-1], input[t]])
candidate = tanh(W × [reset × hidden[t-1], input[t]])
hidden[t] = (1 - update) × hidden[t-1] + update × candidate
Two gates instead of three. Often faster to train with similar performance.
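The same equations as a NumPy sketch (toy weights and names; real implementations fuse these operations for speed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_r, W_u, W_c):
    """One GRU step with reset and update gates (two gates instead of three)."""
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(W_r @ hx)                                    # reset gate
    u = sigmoid(W_u @ hx)                                    # update gate
    cand = np.tanh(W_c @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1.0 - u) * h_prev + u * cand                     # blend old and new

rng = np.random.default_rng(2)
H, D = 6, 3
W_r, W_u, W_c = (rng.normal(size=(H, H + D)) * 0.1 for _ in range(3))
h = np.zeros(H)
for x_t in rng.normal(size=(15, D)):
    h = gru_step(x_t, h, W_r, W_u, W_c)
```

The final line interpolates between the previous state and the candidate: the update gate decides, per dimension, how much memory to keep.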
Bidirectional RNNs¶
Sometimes context from the future helps interpret the present. Bidirectional RNNs process sequences both forward and backward:
Forward:  input[0] → input[1] → input[2] → ... → input[T]
             ↓          ↓          ↓                ↓
         hidden_f[0] hidden_f[1] hidden_f[2] ... hidden_f[T]

Backward: input[0] ← input[1] ← input[2] ← ... ← input[T]
             ↓          ↓          ↓                ↓
         hidden_b[0] hidden_b[1] hidden_b[2] ... hidden_b[T]

Combined: [hidden_f[t], hidden_b[t]] for each t
Each position gets context from both past and future. Useful when you have the complete sequence before processing.
Sequence-to-Sequence Architectures¶
For tasks where input and output are both sequences, use encoder-decoder architectures:
Encoder: Processes the input sequence, producing a summary hidden state.
Decoder: Takes the summary and generates the output sequence.
Input sequence → [Encoder RNN] → Summary State → [Decoder RNN] → Output sequence
This architecture underlies machine translation, summarization, and can be adapted for time-series forecasting.
Strengths¶
Natural for sequences: Explicitly models temporal dependencies. Hidden state carries information across time.
Variable length: Unlike feedforward networks, RNNs handle sequences of any length.
Parameter efficiency: Same weights used at every time step. A 100-step sequence doesn't need 100Γ the parameters.
Interpretable dynamics: Hidden state evolution can be analyzed. What is the network remembering?
Weaknesses¶
Sequential computation: Can't parallelize across time steps. Each step waits for the previous. Training and inference are slower than parallelizable architectures.
Long-range dependencies: Even LSTMs struggle with very long sequences (hundreds to thousands of steps). Information still fades, just more slowly.
Training instability: RNNs can suffer from exploding gradients. Requires careful initialization and gradient clipping.
Superseded by transformers: For many tasks, transformers achieve better performance with easier training. RNNs are less dominant than they once were.
For Your Telescope Array¶
Good for: Light curves. Time-series data. Sequential observations. Any data where temporal order matters.
Specific applications:
Light curve classification: An LSTM processes a sequence of brightness measurements, classifying the variable star type, detecting transients, or identifying periodic behavior.
Light curve: [mag[0], mag[1], mag[2], ..., mag[T]]
                 ↓       ↓       ↓            ↓
              [LSTM] → [LSTM] → [LSTM] → ... → [LSTM]
                                                  ↓
                                           Classification
Transient detection in time series: RNN monitors brightness sequence, outputs probability of transient at each time step. Alert when probability exceeds threshold.
Predictive modeling: Given recent conditions (weather, seeing, performance), predict near-future conditions for scheduling.
Anomaly detection in sequences: Train LSTM to predict next value in normal sequences. Large prediction errors indicate anomalies.
State tracking: RNN maintains hidden state representing current system status, updated with each new observation or event.
Example architecture for light curve classification:
Input: sequence of (time, magnitude, error) tuples, variable length
Embedding: Dense(64) applied to each time step
Output: sequence of 64-dimensional vectors
Bidirectional LSTM(128 units)
Output: sequence of 256-dimensional vectors (128 forward + 128 backward)
Attention layer (or just take final hidden state)
Output: 256-dimensional vector
Dense(128) β ReLU β Dropout(0.3)
Dense(64) β ReLU β Dropout(0.3)
Dense(num_classes) β Softmax
Output: class probabilities
Transformers: Attention is All You Need¶
What They Are¶
Transformers process sequences without recurrence. Instead of maintaining hidden state, they use attention mechanisms to directly relate any element to any other element.
The Core Insight¶
RNNs process sequences step by step. Information from early steps must pass through many intermediate steps to affect later processing. This creates bottlenecks.
Transformers skip the middleman. Every position can directly attend to every other position. Information flows directly between any pair of elements.
Self-Attention: The Key Mechanism¶
Self-attention computes relationships between all pairs of positions in a sequence.
For each position, create three vectors:
- Query (Q): "What am I looking for?"
- Key (K): "What do I have to offer?"
- Value (V): "What information do I carry?"
Attention score between position i and position j:
score[i,j] = Q[i] · K[j] / sqrt(d_k)
The dot product measures similarity. Division by sqrt(d_k) (dimension of keys) prevents scores from growing too large.
Apply softmax to get attention weights:
weights[i] = softmax(scores[i]) # weights[i] sums to 1
Output for position i is weighted sum of values:
output[i] = Σⱼ weights[i,j] × V[j]
Each position's output incorporates information from all other positions, weighted by relevance.
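The full computation fits in a short NumPy sketch. This is single-head, unbatched scaled dot-product attention with made-up dimensions; the projection matrices `W_q`, `W_k`, `W_v` would be learned in practice.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (T, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (T, T): every pair of positions
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights           # weighted sum of values

rng = np.random.default_rng(3)
T, d, d_k = 6, 16, 8
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) * 0.1 for _ in range(3))
out, w = self_attention(X, W_q, W_k, W_v)
```

Row `w[i]` is position i's attention distribution over all positions; `out[i]` is the corresponding mixture of value vectors.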
Multi-Head Attention¶
A single attention mechanism learns one type of relationship. Multi-head attention runs several attention mechanisms in parallel:
Head 1: Q₁, K₁, V₁ → output₁
Head 2: Q₂, K₂, V₂ → output₂
...
Head N: Qₙ, Kₙ, Vₙ → outputₙ

Concatenate: [output₁, output₂, ..., outputₙ]
Project: W_o × concatenated
Different heads learn different relationships:
- Head 1 might attend to nearby positions
- Head 2 might attend to similar values
- Head 3 might attend to periodically related positions
The Transformer Block¶
A complete transformer block:
Input
  ↓
Multi-Head Self-Attention
  ↓
Add (residual connection) + Layer Normalization
  ↓
Feed-Forward Network (two dense layers)
  ↓
Add (residual connection) + Layer Normalization
  ↓
Output
Stack many blocks (6, 12, 24, or more in large models).
Residual connections let gradients flow directly through the network, enabling very deep architectures.
Positional Encoding¶
Self-attention is permutation-invariant: it doesn't inherently know that position 1 comes before position 2. Order information must be added explicitly.
Sinusoidal encoding (original transformer):
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Different frequencies for different dimensions. Positions get unique signatures, and relative positions can be computed from these encodings.
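The sinusoidal scheme is easy to compute directly; a NumPy sketch (function name and argument names are mine):

```python
import numpy as np

def positional_encoding(max_len, d):
    """Sinusoidal encodings: PE[pos, 2i] = sin(pos / 10000^(2i/d)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d // 2)[None, :]                # (1, d/2) frequency index
    angles = pos / np.power(10000.0, 2 * i / d)   # one frequency per pair of dims
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                  # even dims: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return pe

pe = positional_encoding(50, 32)   # one 32-dim signature per position
```

Each row is added to the corresponding token embedding before the first transformer block, so attention can distinguish positions.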
Learned encodings: Just learn a vector for each position. Works well when maximum sequence length is known.
Encoder-Decoder Transformers¶
For sequence-to-sequence tasks:
Encoder: Self-attention sees entire input. Each position attends to all input positions.
Decoder: Self-attention is masked (positions can only attend to earlier positions, not future). Cross-attention lets decoder positions attend to encoder outputs.
Input Sequence → [Encoder Stack] → Encoded Representations
                                            ↓
                [Decoder Stack with Cross-Attention] → Output Sequence
Encoder-Only (BERT-style)¶
For tasks where you need to understand the input but not generate sequences:
Input → [Transformer Encoder] → Representations → Task-specific head
BERT, RoBERTa, and similar models use this pattern. Fine-tune for classification, extraction, or other tasks.
Decoder-Only (GPT-style)¶
For generation tasks:
Context → [Transformer Decoder] → Next token prediction
GPT models use this pattern. The model predicts the next element based on all previous elements.
Vision Transformers (ViT)¶
Transformers for images:
- Split image into patches (e.g., 16×16 pixels each)
- Flatten each patch into a vector
- Add position encodings
- Process with standard transformer
Image → [Split into patches] → [Linear embedding] → [Add position] → [Transformer] → [Classification head]
This treats an image as a sequence of patches, letting attention learn spatial relationships.
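Splitting an image into flattened patches is a simple reshape; a NumPy sketch for a single-channel image (the function name and patch size are illustrative):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxW image into flattened non-overlapping patches (ViT-style)."""
    H, W = image.shape
    assert H % patch == 0 and W % patch == 0, "image must divide evenly"
    p = image.reshape(H // patch, patch, W // patch, patch)
    # Bring the patch-grid axes together, then flatten each patch to a vector.
    return p.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
tokens = patchify(img, patch=16)   # 4x4 grid of patches, each a 256-dim vector
```

Each row of `tokens` then gets a linear embedding plus a position encoding and enters the transformer exactly like a word token.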
Strengths¶
Parallelization: Unlike RNNs, all positions can be computed simultaneously. Training is much faster on GPUs.
Long-range dependencies: Every position directly attends to every other. No information bottleneck.
Scalability: Transformers scale well. Larger models, more data, more compute generally means better performance.
State-of-the-art: Transformers dominate language, increasingly dominate vision, and excel in many domains.
Flexibility: Same architecture works for language, images, audio, and more with minimal modification.
Weaknesses¶
Quadratic complexity: Self-attention compares all pairs of positions. For sequence length N, complexity is O(N²). Very long sequences become expensive.
Data hungry: Transformers typically need more training data than CNNs or RNNs to achieve good performance.
Compute hungry: Large transformers require substantial GPU resources for training and inference.
Position encoding limitations: Learned position encodings don't generalize beyond training length. Sinusoidal encodings help but aren't perfect.
Less inductive bias: Transformers make fewer assumptions about data structure. This flexibility means they need to learn structure from data rather than having it built in.
For Your Telescope Array¶
Good for: Complex sequences where long-range dependencies matter. Multi-modal data fusion. Tasks where CNNs or RNNs underperform.
Specific applications:
Advanced light curve analysis: Transformers can capture long-range periodicity, complex variability patterns, and subtle correlations that RNNs miss.
Multi-site data fusion: Treat observations from different sites as sequence elements. Attention learns which observations to weight more heavily, how to combine information across sites.
[Obs_Site_A, Obs_Site_B, Obs_Site_C, ...] → [Transformer] → Fused Representation
Catalog cross-matching: Given entries from multiple catalogs, transformer attention learns which entries correspond to the same object.
Vision Transformer for images: For challenging image classification tasks where CNNs plateau, ViT might push further (with sufficient data).
Multimodal understanding: Combine image features and light curve features in a single transformer. Attention learns relationships between visual appearance and temporal behavior.
Example architecture for multi-site data fusion:
Inputs: Observations from N sites, each represented as a vector
[obs_1, obs_2, ..., obs_N] where obs_i includes: image embedding, quality metrics, timestamp, site ID embedding
Positional encoding: Site embeddings rather than sequence positions
Transformer Encoder (4 layers, 8 attention heads, 256 dimensions)
Each observation attends to all others
Learns which sites to weight, how to combine
Global pooling or CLS token
Output: Fused representation
Task heads:
- Classification head: Dense β class probabilities
- Quality estimation head: Dense β expected quality of combined result
- Uncertainty head: Dense β confidence bounds
Autoencoders: Learning Compression¶
What They Are¶
Networks that learn to compress data to a smaller representation, then reconstruct the original. Not for prediction, but for representation learning.
The Core Insight¶
If a network can compress data to a small representation and reconstruct it accurately, that small representation must capture the essential information. What's lost is presumably noise or irrelevant detail.
Architecture¶
Input → [Encoder] → Bottleneck (small) → [Decoder] → Reconstruction
(high-dimensional    (low-dimensional       (high-dimensional
 input)               code/latent)           output)
Encoder: Compresses input to bottleneck. Typically uses convolutions (for images) or dense layers.
Bottleneck: The compressed representation. Much smaller than input (e.g., 256×256 image → 128 numbers).
Decoder: Reconstructs input from bottleneck. Mirror of encoder architecture.
Loss: Reconstruction error, typically mean squared error between input and output.
Variational Autoencoders (VAEs)¶
Standard autoencoders learn a deterministic mapping. VAEs learn a probabilistic one.
Instead of encoding to a single point, VAE encodes to a distribution (mean and variance):
Input → [Encoder] → (μ, σ) → Sample z ~ N(μ, σ) → [Decoder] → Reconstruction
Loss includes:
- Reconstruction error
- KL divergence between learned distribution and prior (regularizes latent space)
VAEs have smoother latent spaces. You can sample from the prior and generate realistic outputs.
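Two pieces of the VAE recipe are easy to show concretely: the reparameterization trick (sample via z = μ + σ·ε so the sampling step stays differentiable) and the closed-form KL divergence to a standard normal prior for a diagonal Gaussian. A NumPy sketch with illustrative names:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, sigma) as z = mu + sigma * eps, eps ~ N(0, 1)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

rng = np.random.default_rng(4)
mu, log_var = np.zeros(8), np.zeros(8)   # encoder outputs for one input
z = reparameterize(mu, log_var, rng)     # latent sample fed to the decoder
kl = kl_to_standard_normal(mu, log_var)  # 0 when posterior equals the prior
```

The training loss is reconstruction error plus this KL term (often with a weighting factor), which is what pulls the latent space toward the smooth, sampleable prior.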
Uses of Autoencoders¶
Dimensionality reduction: The bottleneck representation is a compressed version of input. Useful for visualization, clustering, or as input to other models.
Denoising: Train autoencoder on noisy inputs with clean targets. It learns to remove noise.
Anomaly detection: Train on normal data. Anomalies reconstruct poorly (high error).
Generation: VAEs (and related models) can generate new samples by decoding random latent vectors.
Strengths¶
Unsupervised: Don't need labels. Just need examples of normal data.
Representation learning: Learn useful features without explicit supervision.
Anomaly detection: Natural fit for finding unusual objects.
Compression: Learned compression can outperform hand-designed methods.
Weaknesses¶
Reconstruction focus: Optimizing reconstruction might not produce representations useful for downstream tasks.
Mode collapse: Can learn to ignore some input variation, reconstructing only "average" outputs.
Blurry outputs: VAEs especially tend to produce blurry reconstructions, averaging over uncertainty.
Hyperparameter sensitivity: Bottleneck size, architecture choices significantly affect results.
For Your Telescope Array¶
Good for: Anomaly detection. Data compression. Finding unusual objects. Learning representations without labels.
Specific applications:
Anomaly detection: Train autoencoder on normal telescope images. High reconstruction error flags unusual images for human review.
Training:   Normal images → Autoencoder → Minimize reconstruction error
Deployment: New image → Autoencoder → Measure reconstruction error
            If error > threshold: Flag as anomalous
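The scoring and thresholding logic can be sketched in NumPy. Here `reconstruct` is a toy stand-in for a trained autoencoder's forward pass, and all data is synthetic, just to show how a threshold is set from normal validation data.

```python
import numpy as np

def anomaly_scores(images, reconstruct):
    """Per-image mean squared reconstruction error."""
    return np.array([np.mean((img - reconstruct(img)) ** 2) for img in images])

def pick_threshold(val_scores, false_positive_rate=0.01):
    """Choose a threshold so ~1% of normal validation images get flagged."""
    return np.quantile(val_scores, 1.0 - false_positive_rate)

rng = np.random.default_rng(5)
reconstruct = lambda img: img * 0.9              # hypothetical trained model
normal = rng.normal(1.0, 0.1, size=(200, 8, 8))  # synthetic "normal" frames
threshold = pick_threshold(anomaly_scores(normal, reconstruct))

weird = rng.normal(5.0, 1.0, size=(8, 8))        # out-of-distribution frame
flagged = anomaly_scores([weird], reconstruct)[0] > threshold
```

Setting the threshold from a quantile of normal-data scores directly controls the expected false positive rate, which matters when humans review every flag.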
Compression for transmission: Train autoencoder to compress images. Send only bottleneck codes from remote sites, decode centrally. Lossy but much smaller.
Unknown object discovery: Cluster objects in latent space. Objects far from known clusters might be new types.
Quality-aware compression: Train autoencoder with quality-weighted loss. Preserve important regions (sources) more than background.
Example anomaly detection system:
Convolutional Autoencoder:
Encoder:
Conv(32, 3×3) → ReLU → Pool(2×2)    # 256 → 128
Conv(64, 3×3) → ReLU → Pool(2×2)    # 128 → 64
Conv(128, 3×3) → ReLU → Pool(2×2)   # 64 → 32
Conv(256, 3×3) → ReLU → Pool(2×2)   # 32 → 16
Flatten → Dense(512) → Dense(128) → Bottleneck
Decoder (mirror of encoder):
Dense(512) → Dense(16×16×256) → Reshape
Upsample(2×2) → Conv(128, 3×3) → ReLU   # 16 → 32
Upsample(2×2) → Conv(64, 3×3) → ReLU    # 32 → 64
Upsample(2×2) → Conv(32, 3×3) → ReLU    # 64 → 128
Upsample(2×2) → Conv(1, 3×3) → Output   # 128 → 256
Loss: Mean squared error
Anomaly score: Reconstruction error per image
Threshold: Set from validation data to achieve desired false positive rate
Graph Neural Networks: Relational Intelligence¶
What They Are¶
Networks designed for data naturally represented as graphs: nodes connected by edges. Where CNNs exploit spatial structure and RNNs exploit temporal structure, GNNs exploit relational structure.
The Core Insight¶
Many astronomical phenomena involve relationships:
- Stars in clusters are related
- Galaxies in groups interact
- Observations of the same object are connected
- Telescope sites share information
Graphs naturally represent these relationships. GNNs learn to use relational structure.
Graph Representation¶
A graph consists of:
- Nodes: Entities (stars, galaxies, observations, telescopes)
- Edges: Relationships between nodes (physical proximity, causal connection, same object)
- Node features: Attributes of each node (brightness, color, position)
- Edge features: Attributes of each relationship (distance, time difference, strength)
Message Passing: The Core Operation¶
GNNs work by passing messages between connected nodes:
For each node:
1. Gather messages from neighbors
2. Aggregate messages (sum, mean, max, or learned aggregation)
3. Update node representation based on current state + aggregated messages
After several rounds of message passing, each node's representation incorporates information from its neighborhood.
Round 1: Each node knows about immediate neighbors
Round 2: Each node knows about neighbors-of-neighbors
Round 3: Information from 3-hop neighborhood
...
Mathematical Formulation¶
Basic message passing:
m[i] = Aggregate({h[j] : j ∈ Neighbors(i)})
h'[i] = Update(h[i], m[i])
Where:
- h[i] is node i's representation
- m[i] is aggregated message for node i
- Aggregate is a permutation-invariant function (sum, mean, max)
- Update combines current state with message (typically neural network)
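A bare-bones NumPy sketch of one message-passing round on a tiny path graph (0 – 1 – 2). The mean aggregation matches the formulation above; the Update here is a fixed tanh stand-in for what a real GNN would learn.

```python
import numpy as np

def message_passing_round(H, neighbors):
    """One round: mean-aggregate neighbor states, then update each node."""
    H_new = np.empty_like(H)
    for i in range(len(H)):
        if neighbors[i]:
            m = np.mean(H[neighbors[i]], axis=0)   # permutation-invariant
        else:
            m = np.zeros_like(H[i])
        H_new[i] = np.tanh(H[i] + m)               # stand-in Update function
    return H_new

# Path graph 0 - 1 - 2: after two rounds, node 0 sees node 2's information.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
H = np.array([[1.0, 0.0],   # node 0 starts with feature in dim 0
              [0.0, 0.0],   # node 1 starts blank
              [0.0, 1.0]])  # node 2 starts with feature in dim 1
for _ in range(2):
    H = message_passing_round(H, neighbors)
```

After round 1, node 0 only knows about node 1; after round 2, node 2's dimension-1 signal has propagated through node 1 into node 0, illustrating the growing receptive field.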
Common architectures:
Graph Convolutional Network (GCN):
H' = σ(D^(-1/2) A D^(-1/2) H W)
Where A is adjacency matrix, D is degree matrix, H is node features, W is learnable weights.
Graph Attention Network (GAT): Use attention to weight neighbor contributions differently.
GraphSAGE: Sample and aggregate neighbors, enabling mini-batch training on large graphs.
Strengths¶
Natural for relational data: Directly encodes relationships. No need to flatten graph structure into vectors.
Flexible structure: Works on graphs of any size and topology. Adapts to varying numbers of neighbors.
Inductive: Can generalize to unseen nodes/graphs if features are meaningful.
Combines information: Learns how to aggregate information from related entities.
Weaknesses¶
Scalability: Very large graphs (millions of nodes) require sophisticated sampling or approximation.
Oversmoothing: Many message-passing rounds make all node representations similar. Deep GNNs are harder to train.
Edge definition: Results depend on how you define graph structure. Wrong edges hurt performance.
Less mature: GNNs are newer than CNNs/RNNs. Fewer established best practices.
For Your Telescope Array¶
Good for: Modeling relationships between objects, sites, or observations. Catalog analysis. Network coordination.
Specific applications:
Star cluster analysis: Nodes are stars, edges connect probable cluster members. GNN learns cluster membership, identifies interlopers.
Galaxy group finding: Nodes are galaxies, edges from proximity or velocity similarity. GNN identifies group memberships, predicts properties.
Multi-observation fusion: Nodes are observations of the same target (different times, sites, instruments). Edges connect same-object observations. GNN learns optimal combination.
Graph structure:
Nodes: Individual observations
Edges: Same object, temporal proximity, or spatial proximity
Node features: Measurement values, quality metrics, metadata
Edge features: Time difference, site pair, conditions similarity
GNN:
Message passing learns how to weight and combine observations
Output: Fused estimate for each unique object
Telescope network optimization: Nodes are telescope sites, edges connect sites with complementary capabilities. GNN learns coordination patterns, recommends resource allocation.
Anomaly detection in context: When detecting anomalies, consider relationships. A star that's anomalous in isolation might be normal given its cluster context. GNN incorporates context.
Example architecture for multi-observation fusion:
Graph construction:
For each unique object, create nodes for all observations
Connect observations with edges (fully connected or based on relevance)
Node features (per observation):
- Measured values (magnitudes, colors, etc.)
- Uncertainty estimates
- Observation quality metrics
- Site identifier (embedded)
- Time of observation
Edge features:
- Time difference
- Site pair identifier
- Condition similarity score
GNN architecture:
GraphSAGE with 3 message-passing layers
Hidden dimension: 128
Aggregation: Attention-weighted mean
After message passing:
Global pooling across all nodes for this object
Dense layers for final estimate
Output:
Fused measurement estimate
Uncertainty bounds
Outlier flags for individual observations
Generative Models: Creating New Data¶
What They Are¶
Models that learn to generate new samples resembling training data. Instead of classifying or predicting, they create.
Generative Adversarial Networks (GANs)¶
Two networks in competition:
Generator: Takes random noise, produces fake samples.
Discriminator: Tries to distinguish real from fake samples.
Training is adversarial:
- Discriminator improves at detecting fakes
- Generator improves at fooling discriminator
- At equilibrium, generator produces samples discriminator can't distinguish from real
Random noise z → [Generator] → Fake sample
                                    ↓
                            [Discriminator] → Real or Fake?
                                    ↑
                               Real sample
Loss functions:
Discriminator: maximize log(D(real)) + log(1 - D(G(z)))
Generator: maximize log(D(G(z))) (or minimize log(1 - D(G(z))))
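The two objectives are just log-probability sums over discriminator outputs; a NumPy sketch of evaluating them (function names are mine, and `eps` guards against log(0)):

```python
import numpy as np

def d_objective(d_real, d_fake, eps=1e-8):
    """Discriminator objective (maximized): log D(real) + log(1 - D(fake))."""
    return np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def g_objective(d_fake, eps=1e-8):
    """Non-saturating generator objective (maximized): log D(fake)."""
    return np.mean(np.log(d_fake + eps))

# A sharp discriminator (real -> 0.99, fake -> 0.01) scores far better
# than a fooled one outputting 0.5 for everything.
sharp = d_objective(np.array([0.99]), np.array([0.01]))
fooled = d_objective(np.array([0.5]), np.array([0.5]))
```

The non-saturating form log D(G(z)) is preferred in practice because minimizing log(1 - D(G(z))) gives the generator vanishing gradients early in training, when the discriminator easily rejects its samples.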
Diffusion Models¶
Currently state-of-the-art for image generation.
Forward process: Gradually add noise to real data until it's pure noise.
Reverse process: Learn to gradually remove noise, recovering data from noise.
Real image → [Add noise] → [Add noise] → ... → Pure noise
Pure noise → [Denoise] → [Denoise] → ... → Generated image
The denoising network learns to predict and remove noise at each step. Many small denoising steps produce high-quality samples.
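The forward (noising) process has a convenient closed form: for DDPM-style models you can jump straight to step t as x_t = √(ᾱ_t)·x₀ + √(1 − ᾱ_t)·ε. A NumPy sketch with an illustrative linear noise schedule (the reverse, learned denoiser is not shown):

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar, rng):
    """Jump directly to noise level t: keeps sqrt(abar_t) of the signal."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # per-step noise amounts (illustrative)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative fraction of signal retained

rng = np.random.default_rng(6)
x0 = rng.normal(size=(32, 32))                        # stand-in "image"
x_mid = forward_diffuse(x0, 500, alpha_bar, rng)      # partly noised
x_end = forward_diffuse(x0, T - 1, alpha_bar, rng)    # nearly pure noise
```

Training then amounts to picking random (x₀, t) pairs, noising them this way, and teaching the network to predict the added noise.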
Uses in Astronomy¶
Data augmentation: Generate synthetic training examples, especially for rare classes.
Simulation: Generate realistic synthetic observations to test pipelines.
Super-resolution: Generate high-resolution images from low-resolution inputs.
Inpainting: Fill in missing or corrupted regions of images.
Conditional generation: Generate images matching specific properties (galaxy with certain morphology, star with certain spectrum).
For Your Telescope Array¶
Specific applications:
Training data generation: Have few examples of rare transients? Train a generative model on what you have, generate more for classifier training.
Pipeline testing: Generate realistic synthetic observations to stress-test processing pipelines before real data arrives.
Data recovery: Inpaint satellite trails, cosmic rays, or bad pixels in otherwise good observations.
Prediction: Given current conditions and recent observations, generate predictions of what observations will look like in near future.
Architecture Selection Guide for Your Project¶
Let me be concrete about which architecture to use for each component of your distributed telescope system.
At Individual Telescope Sites¶
| Task | Architecture | Rationale |
|---|---|---|
| Frame quality assessment | Lightweight CNN | Fast inference, spatial patterns matter, proven performance |
| Real-time transient detection | CNN + threshold | Need speed, looking for spatial signatures |
| Basic source detection | U-Net (CNN variant) | Semantic segmentation task, well-established |
| Quick classification | Small CNN or feedforward from features | Speed critical, accuracy secondary |
| Equipment anomaly detection | Autoencoder | Unsupervised, learns normal behavior |
At Central Coordination¶
| Task | Architecture | Rationale |
|---|---|---|
| Deep image classification | ResNet/EfficientNet CNN or ViT | Accuracy matters, have compute resources |
| Light curve classification | Transformer or LSTM | Sequential data with long-range dependencies |
| Multi-site data fusion | Transformer or GNN | Relating multiple inputs, flexible attention |
| Scheduling optimization | Reinforcement learning (various) | Sequential decision-making |
| Catalog cross-matching | GNN or Transformer | Relational structure matters |
| Anomaly detection at scale | Autoencoder + clustering | Find unknowns in large datasets |
| Multi-modal analysis | Transformer | Naturally handles multiple input types |
Decision Flowchart¶
Is your data...?
├── Images (2D spatial)
│   ├── Classification/detection → CNN (ResNet, EfficientNet)
│   ├── Segmentation → U-Net, DeepLab
│   ├── Very complex patterns → Vision Transformer (if enough data)
│   └── Need speed → MobileNet, lightweight CNN
│
├── Sequences (time series)
│   ├── Short sequences (<100 steps) → LSTM or GRU
│   ├── Long sequences (>100 steps) → Transformer
│   ├── Real-time streaming → LSTM with online updates
│   └── Bidirectional context available → Bidirectional LSTM or Transformer
│
├── Tabular (features/measurements)
│   ├── Clear features → XGBoost/LightGBM (often beats neural networks)
│   ├── Need neural network → Feedforward
│   └── Complex interactions → Feedforward with more layers
│
├── Graph (relational)
│   └── Use GNN (GraphSAGE, GAT)
│
├── Multiple modalities (images + sequences + tabular)
│   └── Transformer (or separate encoders feeding shared transformer)
│
└── Unlabeled data
    ├── Want compression/representation → Autoencoder
    ├── Want anomaly detection → Autoencoder or isolation forest
    └── Want to generate samples → GAN or diffusion model
Hybrid Architectures for Your System¶
Real systems often combine architectures:
CNN + LSTM for video or image sequences:
Frame 1 → [CNN] → features[1] ─┐
Frame 2 → [CNN] → features[2] ─┼→ [LSTM] → Sequence classification
Frame 3 → [CNN] → features[3] ─┘
Use CNN to extract per-frame features, LSTM to model temporal evolution.
CNN + Transformer for multi-site fusion:
Site A image → [CNN] → embedding_A ─┐
Site B image → [CNN] → embedding_B ─┼→ [Transformer] → Fused result
Site C image → [CNN] → embedding_C ─┘
Use CNN to extract site-specific features, transformer to learn optimal combination.
Autoencoder + Classifier for semi-supervised learning:
Labeled + Unlabeled data → [Autoencoder] → Latent representations
Latent representations + Labels → [Classifier] → Predictions
Use autoencoder to learn representations from all data (including unlabeled), classifier on top using labels.
Summary Comparison Table¶
| Architecture | Best For | Input Type | Strengths | Weaknesses | Your Use Cases |
|---|---|---|---|---|---|
| Feedforward | Tabular data, simple tasks | Fixed-size vectors | Simple, fast, universal | No structure awareness | Feature-based classification, final layers |
| CNN | Images, spatial data | 2D/3D grids | Translation invariance, hierarchical features | Fixed input size, local focus | Image quality, source detection, morphology |
| RNN/LSTM | Sequences, time series | Variable-length sequences | Temporal modeling, memory | Sequential (slow), limited range | Light curves, streaming data |
| Transformer | Long sequences, multi-modal | Any (with encoding) | Parallelizable, long-range, flexible | Quadratic complexity, data hungry | Complex light curves, data fusion |
| Autoencoder | Compression, anomaly detection | Any | Unsupervised, learns representations | Reconstruction-focused | Anomaly detection, compression |
| GNN | Relational data, graphs | Graphs | Models relationships | Scaling, oversmoothing | Cluster analysis, observation fusion |
| GAN/Diffusion | Data generation | Any | Creates new samples | Training instability | Data augmentation, simulation |
This should give you a complete understanding of how each architecture works, what it's suited for, and exactly where each fits into your distributed telescope network. The key is matching architecture assumptions to your data's structure and your task's requirements.