
Understanding ML Model Architectures for Your Distributed Telescope Array

Let me take you through every major architecture, how it actually works, and exactly where each fits into your telescope network.


The Fundamental Question: Why Different Architectures?

Before diving into specifics, understand why we have different architectures at all.

Data comes in different shapes:

Tabular data: Rows and columns, like a spreadsheet. Star catalogs with measurements. Each row is independent, columns are features.

Images: 2D grids of pixels. Your telescope frames. Nearby pixels are related. Spatial structure matters.

Sequences: Ordered data points. Light curves over time. What came before affects interpretation of what comes after.

Graphs: Networks of connected entities. Stars in clusters. Galaxies in groups. Relationships between objects matter.

Sets: Collections without order. Multiple observations of the same field. The set matters, not the sequence.

Each architecture embodies assumptions about data structure. Using the wrong architecture means fighting against its assumptions. Using the right architecture means the model naturally captures relevant patterns.


Feedforward Neural Networks: The Foundation

What They Are

The simplest neural network. Data flows in one direction: input to output, no loops, no memory.

Input Layer β†’ Hidden Layer 1 β†’ Hidden Layer 2 β†’ ... β†’ Output Layer

Each layer is fully connected to the next. Every neuron in layer N connects to every neuron in layer N+1.

How They Process Information

Imagine your input is a vector of 100 numbers representing measurements of a star: brightness in different filters, position, proper motion, and so on.

Layer 1 (say, 256 neurons): Each neuron computes a weighted sum of all 100 inputs, adds a bias, applies an activation function. You get 256 new numbers, each representing some combination of the original features.

Layer 2 (say, 128 neurons): Each neuron takes all 256 outputs from Layer 1, computes weighted sums, applies activation. Now you have 128 numbers representing combinations of combinations.

Output Layer (say, 5 neurons for 5 star types): Each neuron combines the 128 Layer 2 outputs. Apply softmax to get probabilities.

The key insight: each successive layer learns more abstract representations. Layer 1 might learn "this combination of colors indicates high temperature." Layer 2 might learn "high temperature plus this proper motion pattern suggests a certain stellar population."

Mathematical Formulation

For a single layer:

output = activation(weights Γ— input + bias)

Where:

  • input is a vector of N values
  • weights is a matrix of size (M Γ— N), where M is the number of neurons
  • bias is a vector of M values
  • activation is a nonlinear function applied element-wise
  • output is a vector of M values

Stacking layers:

h₁ = activation(W₁ Γ— input + b₁)
hβ‚‚ = activation(Wβ‚‚ Γ— h₁ + bβ‚‚)
h₃ = activation(W₃ Γ— hβ‚‚ + b₃)
output = softmax(Wβ‚„ Γ— h₃ + bβ‚„)

Strengths

Universality: Can approximate any continuous function to arbitrary accuracy given enough neurons. This is a mathematical guarantee (the universal approximation theorem).

Simplicity: Easy to implement, understand, debug. Training is straightforward.

Speed: Fast inference. No complex operations, just matrix multiplications.

Flexibility: Works on any fixed-size input. No structural assumptions beyond input dimension.

Weaknesses

No spatial awareness: Treats each input dimension independently of its neighbors. For images, pixel 1 and pixel 1000 are equally "distant" from the network's perspective, even if they're adjacent in the image.

No temporal awareness: Each input is processed independently. Can't learn that a brightness measurement depends on previous measurements.

Parameter explosion: For large inputs, fully-connected layers have enormous numbers of parameters. A 256Γ—256 image has 65,536 pixels. A single hidden layer of 1000 neurons would have 65 million parameters just for that layer.

No weight sharing: Patterns learned in one part of the input don't transfer to other parts. A galaxy in the corner of an image requires separate learning from a galaxy in the center.

For Your Telescope Array

Good for: Processing extracted features (not raw images). Tabular data from catalogs. Final classification layers after other architectures have extracted features.

Specific applications:

  • Classifying stars from catalog measurements (colors, proper motions, parallax)
  • Predicting observation quality from metadata (temperature, humidity, moon phase, elevation)
  • Combining high-level features from multiple sources for final decision-making
  • Quick assessment models where speed matters more than accuracy

Example scenario: You've extracted 50 features from a light curve (mean brightness, variance, periodicity measures, etc.). A feedforward network takes these 50 numbers and classifies the variable star type. The feature extraction handles temporal structure; the feedforward network handles the final classification.


Convolutional Neural Networks: Spatial Intelligence

What They Are

Networks designed for data with spatial structure, primarily images. Instead of connecting every input to every neuron, they use local connections and weight sharing.

The Core Insight

Images have two crucial properties feedforward networks ignore:

Locality: Relevant patterns are local. An edge is a few pixels. A star is a small region. You don't need to look at pixels 1000 apart simultaneously to detect these patterns.

Translation invariance: A spiral arm looks like a spiral arm regardless of where it appears in the image. Learning to recognize it in one location should transfer to all locations.

CNNs embody these assumptions through convolution operations.

How Convolution Works

A convolutional layer has small filters (also called kernels), typically 3Γ—3, 5Γ—5, or 7Γ—7 pixels.

Each filter slides across the entire image, computing a dot product at each position:

Image patch:        Filter:           Computation:
[a b c]            [w₁ wβ‚‚ w₃]        output = aΓ—w₁ + bΓ—wβ‚‚ + cΓ—w₃ +
[d e f]     Γ—      [wβ‚„ wβ‚… w₆]                 dΓ—wβ‚„ + eΓ—wβ‚… + fΓ—w₆ +
[g h i]            [w₇ wβ‚ˆ w₉]                 gΓ—w₇ + hΓ—wβ‚ˆ + iΓ—w₉

This single number represents "how much does this patch match this filter?"

Sliding the filter across all positions produces a feature map: a 2D grid showing where the pattern was detected.

Multiple Filters, Multiple Layers

A single convolutional layer has many filters (32, 64, 128 are common). Each learns to detect a different pattern.

Layer 1 filters learn simple patterns:

  • Horizontal edges
  • Vertical edges
  • Diagonal edges
  • Brightness gradients
  • Spots of various sizes

Layer 2 filters operate on Layer 1's output, learning combinations:

  • Corners (horizontal + vertical edges)
  • Curves (sequences of edge orientations)
  • Texture patterns
  • Ring-like structures

Layer 3 and beyond learn increasingly complex combinations:

  • Spiral arm signatures
  • Galaxy core patterns
  • Specific artifact shapes
  • Complex morphological features

This hierarchy emerges automatically from training. You don't specify "learn edges then corners then spirals." The network discovers this hierarchy because it's efficient for reducing classification error.

Pooling Operations

Between convolutional layers, pooling reduces spatial dimensions:

Max pooling: Take the maximum value in each small region

[1 3 2 4]
[5 6 1 2]  β†’ Max pool 2Γ—2 β†’  [6 4]
[3 2 1 0]                     [3 3]
[1 2 3 1]

Average pooling: Take the mean value in each region

Pooling provides:

  • Reduced computation for subsequent layers
  • Some translation invariance (small shifts don't change max values much)
  • Larger effective receptive field (later layers "see" more of the original image)

Receptive Fields

A crucial concept: how much of the original image influences a single neuron in a later layer?

Layer 1 neuron: Sees only its 3Γ—3 filter region. Receptive field = 9 pixels.

Layer 2 neuron: Takes input from Layer 1 neurons, each of which saw 3Γ—3. With 2Γ—2 pooling in between, each Layer 2 neuron effectively sees roughly 8Γ—8 of the original image.

Deep layer neuron: Might effectively see the entire image, but through a hierarchical lens.

This is why deep CNNs can learn global patterns while still using local operations: information propagates through the hierarchy.

Mathematical Formulation

For a 2D convolution:

output[i,j] = Ξ£β‚˜ Ξ£β‚™ input[i+m, j+n] Γ— filter[m,n] + bias

Where the sums run over the filter dimensions.

With multiple input channels (like RGB, or previous layer features):

output[i,j] = Ξ£_c Ξ£β‚˜ Ξ£β‚™ input[c, i+m, j+n] Γ— filter[c,m,n] + bias

Where c indexes input channels.
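The single-channel formula above translates directly into a (deliberately unvectorized) NumPy loop; "valid" sliding is assumed, so the output shrinks by the filter size minus one:

```python
import numpy as np

def conv2d(img, kernel, bias=0.0):
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product of the filter with the patch under it
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel) + bias
    return out

# A vertical-edge filter: responds where brightness changes left to right
edge = np.array([[-1.0, 0.0, 1.0]] * 3)
flat = np.ones((5, 5))
response = conv2d(flat, edge)  # zero everywhere: a flat image has no edges
```
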

Architecture Patterns

Standard CNN architectures follow patterns:

VGG pattern: Stack many 3Γ—3 convolutions. Simple but effective.

Conv3Γ—3 β†’ Conv3Γ—3 β†’ Pool β†’ Conv3Γ—3 β†’ Conv3Γ—3 β†’ Pool β†’ ... β†’ Dense β†’ Output

ResNet pattern: Add skip connections that let gradients flow directly through many layers.

input β†’ Conv β†’ Conv β†’ (+input) β†’ Conv β†’ Conv β†’ (+previous) β†’ ...

Skip connections solve the vanishing gradient problem, allowing very deep networks (50, 100, 150+ layers).

Inception/GoogLeNet pattern: Use multiple filter sizes in parallel, concatenate results.

input β†’ [1Γ—1 conv, 3Γ—3 conv, 5Γ—5 conv, pool] β†’ concatenate β†’ ...

This captures patterns at multiple scales simultaneously.

Strengths

Parameter efficiency: A 3Γ—3 filter has 9 parameters regardless of image size. Compared to feedforward networks, CNNs have far fewer parameters.

Translation equivariance: A pattern detected at position (10, 10) uses the same weights as detection at (100, 100). Learning transfers across positions.

Hierarchical feature learning: Automatically learns appropriate feature hierarchy for the task.

Proven architecture: Decades of refinement. Well-understood behavior. Extensive pre-trained models available.

Weaknesses

Fixed input size: Standard CNNs expect fixed image dimensions. Variable sizes require padding, cropping, or architectural changes.

Limited global awareness: Despite stacking layers, CNNs can struggle with patterns requiring true global context. A pattern depending on opposite corners remains hard.

Translation invariance can hurt: Sometimes position matters. The center of a galaxy image is semantically different from the edge. Pure CNNs don't distinguish.

No temporal understanding: Each image is processed independently. Sequential relationships require additional architecture.

For Your Telescope Array

Good for: Any image-based task. Quality assessment. Object detection. Galaxy classification. Artifact identification.

Specific applications:

Real-time quality assessment: A lightweight CNN at each telescope evaluates incoming frames. Input: single frame. Output: quality score and issue flags (clouds, tracking error, focus problem, etc.).

Source detection: Semantic segmentation CNNs identify every source in an image. Each pixel gets classified: background, star, galaxy, artifact, satellite trail.

Galaxy morphology: CNNs trained on Galaxy Zoo data classify galaxy types, identify features like bars, rings, spiral arms, merger signatures.

Transient detection: CNNs compare new images to references, classifying differences as real transients, artifacts, or noise.

Cross-site calibration: CNNs learn to map images from different sites to a common representation, normalizing site-specific effects.

Example architecture for your quality classifier:

Input: 256Γ—256 grayscale image

Block 1: Conv(32 filters, 3Γ—3) β†’ BatchNorm β†’ ReLU β†’ MaxPool(2Γ—2)
Output: 128Γ—128Γ—32

Block 2: Conv(64 filters, 3Γ—3) β†’ BatchNorm β†’ ReLU β†’ MaxPool(2Γ—2)
Output: 64Γ—64Γ—64

Block 3: Conv(128 filters, 3Γ—3) β†’ BatchNorm β†’ ReLU β†’ MaxPool(2Γ—2)
Output: 32Γ—32Γ—128

Block 4: Conv(256 filters, 3Γ—3) β†’ BatchNorm β†’ ReLU β†’ MaxPool(2Γ—2)
Output: 16Γ—16Γ—256

Global Average Pool: 256 values

Dense(128) β†’ ReLU β†’ Dropout(0.5)
Dense(3) β†’ Softmax

Output: probabilities for [good, medium, bad]
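A quick sanity check of the shapes above, assuming 3Γ—3 convolutions with "same" padding (so only the 2Γ—2 pooling changes the spatial size):

```python
def after_block(h, w, filters):
    # 3x3 conv with 'same' padding keeps h and w; 2x2 max pool halves them
    return h // 2, w // 2, filters

shape = (256, 256, 1)
for filters in (32, 64, 128, 256):
    h, w, _ = shape
    shape = after_block(h, w, filters)

# shape is now (16, 16, 256); global average pooling yields 256 values
```
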

Recurrent Neural Networks: Temporal Intelligence

What They Are

Networks designed for sequential data. They maintain internal state that persists across sequence elements, giving them a form of memory.

The Core Insight

Many phenomena unfold over time. A light curve isn't just a collection of brightness measurementsβ€”it's an ordered sequence where each measurement relates to those before and after.

Standard feedforward networks process each input independently. RNNs process sequences element by element, maintaining hidden state that captures what they've seen so far.

Basic RNN Operation

At each time step t:

hidden[t] = activation(W_input Γ— input[t] + W_hidden Γ— hidden[t-1] + bias)
output[t] = W_output Γ— hidden[t]

The key: hidden[t] depends on hidden[t-1]. Information flows forward through time.

input[0] β†’ [RNN Cell] β†’ hidden[0] β†’ output[0]
              ↓
input[1] β†’ [RNN Cell] β†’ hidden[1] β†’ output[1]
              ↓
input[2] β†’ [RNN Cell] β†’ hidden[2] β†’ output[2]
              ↓
             ...

The same weights are used at every time step. The only thing changing is the hidden state.
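The update rule above in NumPy form; note that the same `W_in`, `W_h`, and `b` are reused at every step, and only the hidden state `h` changes:

```python
import numpy as np

def rnn_step(x, h_prev, W_in, W_h, b):
    # hidden[t] = tanh(W_input @ input[t] + W_hidden @ hidden[t-1] + bias)
    return np.tanh(W_in @ x + W_h @ h_prev + b)

rng = np.random.default_rng(0)
hidden_size, input_size = 16, 3
W_in = rng.normal(size=(hidden_size, input_size)) * 0.1
W_h = rng.normal(size=(hidden_size, hidden_size)) * 0.1
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                     # initial state
sequence = rng.normal(size=(10, input_size))  # 10 time steps
for x in sequence:
    h = rnn_step(x, h, W_in, W_h, b)          # h carries information forward
```
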

The Vanishing Gradient Problem

Basic RNNs have a critical flaw: information fades over time.

During training, gradients must flow backward through time. At each step, they get multiplied by the recurrent weights. If those weight magnitudes are less than 1, gradients shrink exponentially. After 50 or 100 steps, gradients are effectively zero.

Result: basic RNNs can only learn short-range dependencies. They forget distant past, even when it's crucial.

LSTM: Long Short-Term Memory

LSTMs solve the vanishing gradient problem with a gated architecture:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                LSTM Cell                β”‚
β”‚                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”        β”‚
β”‚  β”‚Forgetβ”‚   β”‚Input β”‚   β”‚Outputβ”‚        β”‚
β”‚  β”‚ Gate β”‚   β”‚ Gate β”‚   β”‚ Gate β”‚        β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”˜        β”‚
β”‚      ↓          ↓          ↓           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”‚
β”‚  β”‚        Cell State          β”‚ ←──────┼── (memory highway)
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β”‚
β”‚                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Forget gate: Decides what to discard from cell state. "The transit event is over, forget those details."

Input gate: Decides what new information to store. "This brightness spike is important, remember it."

Output gate: Decides what to output based on cell state. "Based on everything seen, output this classification."

Cell state: The memory highway. Information can flow unchanged across many time steps. Gradients flow through without multiplication by weights.

The mathematics:

forget = sigmoid(W_f Γ— [hidden[t-1], input[t]] + b_f)
input_gate = sigmoid(W_i Γ— [hidden[t-1], input[t]] + b_i)
candidate = tanh(W_c Γ— [hidden[t-1], input[t]] + b_c)
cell[t] = forget Γ— cell[t-1] + input_gate Γ— candidate
output_gate = sigmoid(W_o Γ— [hidden[t-1], input[t]] + b_o)
hidden[t] = output_gate Γ— tanh(cell[t])

The gates are sigmoid functions outputting values between 0 and 1, acting as soft switches.
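The six equations above, transcribed into NumPy. This is a sketch for clarity; real implementations fuse the four weight matrices into one matrix multiply for speed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # W stacks W_f, W_i, W_c, W_o; b stacks the matching biases
    z = np.concatenate([h_prev, x])        # [hidden[t-1], input[t]]
    W_f, W_i, W_c, W_o = W
    b_f, b_i, b_c, b_o = b
    forget = sigmoid(W_f @ z + b_f)        # what to discard from cell state
    input_gate = sigmoid(W_i @ z + b_i)    # what new information to store
    candidate = np.tanh(W_c @ z + b_c)
    cell = forget * c_prev + input_gate * candidate
    output_gate = sigmoid(W_o @ z + b_o)   # what to expose as hidden state
    hidden = output_gate * np.tanh(cell)
    return hidden, cell

rng = np.random.default_rng(0)
hid, inp = 8, 3
W = rng.normal(size=(4, hid, hid + inp)) * 0.1
b = np.zeros((4, hid))
h, c = np.zeros(hid), np.zeros(hid)
h, c = lstm_step(rng.normal(size=inp), h, c, W, b)
```
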

GRU: Gated Recurrent Unit

A simplified gating mechanism, often performing comparably to LSTM with fewer parameters:

reset = sigmoid(W_r Γ— [hidden[t-1], input[t]])
update = sigmoid(W_u Γ— [hidden[t-1], input[t]])
candidate = tanh(W Γ— [reset Γ— hidden[t-1], input[t]])
hidden[t] = (1 - update) Γ— hidden[t-1] + update Γ— candidate

Two gates instead of three. Often faster to train with similar performance.

Bidirectional RNNs

Sometimes context from the future helps interpret the present. Bidirectional RNNs process sequences both forward and backward:

Forward:  input[0] β†’ input[1] β†’ input[2] β†’ ... β†’ input[T]
                ↓         ↓         ↓              ↓
           hidden_f[0]  hidden_f[1] hidden_f[2] ... hidden_f[T]

Backward: input[0] ← input[1] ← input[2] ← ... ← input[T]
                ↓         ↓         ↓              ↓
           hidden_b[0]  hidden_b[1] hidden_b[2] ... hidden_b[T]

Combined: [hidden_f[t], hidden_b[t]] for each t

Each position gets context from both past and future. Useful when you have the complete sequence before processing.

Sequence-to-Sequence Architectures

For tasks where input and output are both sequences, use encoder-decoder architectures:

Encoder: Processes input sequence, produces summary hidden state

Decoder: Takes summary, generates output sequence

Input sequence β†’ [Encoder RNN] β†’ Summary State β†’ [Decoder RNN] β†’ Output sequence

This architecture underlies machine translation, summarization, and can be adapted for time-series forecasting.

Strengths

Natural for sequences: Explicitly models temporal dependencies. Hidden state carries information across time.

Variable length: Unlike feedforward networks, RNNs handle sequences of any length.

Parameter efficiency: Same weights used at every time step. A 100-step sequence doesn't need 100Γ— the parameters.

Interpretable dynamics: Hidden state evolution can be analyzed. What is the network remembering?

Weaknesses

Sequential computation: Can't parallelize across time steps. Each step waits for the previous. Training and inference are slower than parallelizable architectures.

Long-range dependencies: Even LSTMs struggle with very long sequences (hundreds to thousands of steps). Information still fades, just more slowly.

Training instability: RNNs can suffer from exploding gradients. Requires careful initialization and gradient clipping.

Superseded by transformers: For many tasks, transformers achieve better performance with easier training. RNNs are less dominant than they once were.

For Your Telescope Array

Good for: Light curves. Time-series data. Sequential observations. Any data where temporal order matters.

Specific applications:

Light curve classification: An LSTM processes a sequence of brightness measurements, classifying the variable star type, detecting transients, or identifying periodic behavior.

Light curve: [mag[0], mag[1], mag[2], ..., mag[T]]
                ↓        ↓        ↓           ↓
             [LSTM] β†’ [LSTM] β†’ [LSTM] β†’ ... β†’ [LSTM]
                                                 ↓
                                          Classification

Transient detection in time series: RNN monitors brightness sequence, outputs probability of transient at each time step. Alert when probability exceeds threshold.

Predictive modeling: Given recent conditions (weather, seeing, performance), predict near-future conditions for scheduling.

Anomaly detection in sequences: Train LSTM to predict next value in normal sequences. Large prediction errors indicate anomalies.

State tracking: RNN maintains hidden state representing current system status, updated with each new observation or event.

Example architecture for light curve classification:

Input: sequence of (time, magnitude, error) tuples, variable length

Embedding: Dense(64) applied to each time step
Output: sequence of 64-dimensional vectors

Bidirectional LSTM(128 units)
Output: sequence of 256-dimensional vectors (128 forward + 128 backward)

Attention layer (or just take final hidden state)
Output: 256-dimensional vector

Dense(128) β†’ ReLU β†’ Dropout(0.3)
Dense(64) β†’ ReLU β†’ Dropout(0.3)
Dense(num_classes) β†’ Softmax

Output: class probabilities

Transformers: Attention Is All You Need

What They Are

Transformers process sequences without recurrence. Instead of maintaining hidden state, they use attention mechanisms to directly relate any element to any other element.

The Core Insight

RNNs process sequences step by step. Information from early steps must pass through many intermediate steps to affect later processing. This creates bottlenecks.

Transformers skip the middleman. Every position can directly attend to every other position. Information flows directly between any pair of elements.

Self-Attention: The Key Mechanism

Self-attention computes relationships between all pairs of positions in a sequence.

For each position, create three vectors:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I have to offer?"
  • Value (V): "What information do I carry?"

Attention score between position i and position j:

score[i,j] = Q[i] Β· K[j] / sqrt(d_k)

The dot product measures similarity. Division by sqrt(d_k) (dimension of keys) prevents scores from growing too large.

Apply softmax to get attention weights:

weights[i] = softmax(scores[i])  # weights[i] sums to 1

Output for position i is weighted sum of values:

output[i] = Ξ£β±Ό weights[i,j] Γ— V[j]

Each position's output incorporates information from all other positions, weighted by relevance.
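The three equations above fit in a few lines of NumPy (single head, learned Q/K/V projections omitted):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # score[i, j] = Q[i] . K[j] / sqrt(d_k)
    weights = softmax(scores)        # each row sums to 1
    return weights @ V, weights      # output[i] = sum_j weights[i, j] * V[j]

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 5, 8, 4
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))
out, w = self_attention(Q, K, V)
```
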

Multi-Head Attention

A single attention mechanism learns one type of relationship. Multi-head attention runs several attention mechanisms in parallel:

Head 1: Q₁, K₁, V₁ β†’ output₁
Head 2: Qβ‚‚, Kβ‚‚, Vβ‚‚ β†’ outputβ‚‚
...
Head N: Qβ‚™, Kβ‚™, Vβ‚™ β†’ outputβ‚™

Concatenate: [output₁, outputβ‚‚, ..., outputβ‚™]
Project: W_o Γ— concatenated

Different heads learn different relationships:

  • Head 1 might attend to nearby positions
  • Head 2 might attend to similar values
  • Head 3 might attend to periodically related positions

The Transformer Block

A complete transformer block:

Input
  ↓
Multi-Head Self-Attention
  ↓
Add (residual connection) + Layer Normalization
  ↓
Feed-Forward Network (two dense layers)
  ↓
Add (residual connection) + Layer Normalization
  ↓
Output

Stack many blocks (6, 12, 24, or more in large models).

Residual connections let gradients flow directly through the network, enabling very deep architectures.

Positional Encoding

Self-attention is permutation-invariant: it doesn't inherently know that position 1 comes before position 2. Order information must be added explicitly.

Sinusoidal encoding (original transformer):

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Different frequencies for different dimensions. Positions get unique signatures, and relative positions can be computed from these encodings.
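A NumPy sketch of the sinusoidal encoding above, assuming an even model dimension `d`:

```python
import numpy as np

def sinusoidal_encoding(max_len, d):
    pos = np.arange(max_len)[:, None]            # positions 0..max_len-1
    i = np.arange(d // 2)[None, :]               # dimension-pair index
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(100, 64)  # each row is one position's signature
```
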

Learned encodings: Just learn a vector for each position. Works well when maximum sequence length is known.

Encoder-Decoder Transformers

For sequence-to-sequence tasks:

Encoder: Self-attention sees entire input. Each position attends to all input positions.

Decoder: Self-attention is masked (positions can only attend to earlier positions, not future). Cross-attention lets decoder positions attend to encoder outputs.

Input Sequence β†’ [Encoder Stack] β†’ Encoded Representations
                                            ↓
                        [Decoder Stack with Cross-Attention] β†’ Output Sequence

Encoder-Only (BERT-style)

For tasks where you need to understand the input but not generate sequences:

Input β†’ [Transformer Encoder] β†’ Representations β†’ Task-specific head

BERT, RoBERTa, and similar models use this pattern. Fine-tune for classification, extraction, or other tasks.

Decoder-Only (GPT-style)

For generation tasks:

Context β†’ [Transformer Decoder] β†’ Next token prediction

GPT models use this pattern. The model predicts the next element based on all previous elements.

Vision Transformers (ViT)

Transformers for images:

  1. Split image into patches (e.g., 16Γ—16 pixels each)
  2. Flatten each patch into a vector
  3. Add position encodings
  4. Process with standard transformer

Image β†’ [Split into patches] β†’ [Linear embedding] β†’ [Add position] β†’ [Transformer] β†’ [Classification head]

This treats an image as a sequence of patches, letting attention learn spatial relationships.
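Steps 1 and 2 (split and flatten) amount to a reshape, sketched here for a single-channel image:

```python
import numpy as np

def image_to_patches(img, p):
    # img: (H, W) with H and W divisible by the patch size p
    H, W = img.shape
    blocks = img.reshape(H // p, p, W // p, p).swapaxes(1, 2)
    return blocks.reshape(-1, p * p)  # one flattened vector per patch

img = np.arange(32 * 32, dtype=float).reshape(32, 32)
patches = image_to_patches(img, 16)   # 4 patches of 256 values each
```
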

Strengths

Parallelization: Unlike RNNs, all positions can be computed simultaneously. Training is much faster on GPUs.

Long-range dependencies: Every position directly attends to every other. No information bottleneck.

Scalability: Transformers scale well. Larger models, more data, and more compute generally mean better performance.

State-of-the-art: Transformers dominate language, increasingly dominate vision, and excel in many domains.

Flexibility: Same architecture works for language, images, audio, and more with minimal modification.

Weaknesses

Quadratic complexity: Self-attention compares all pairs of positions. For sequence length N, complexity is O(NΒ²). Very long sequences become expensive.

Data hungry: Transformers typically need more training data than CNNs or RNNs to achieve good performance.

Compute hungry: Large transformers require substantial GPU resources for training and inference.

Position encoding limitations: Learned position encodings don't generalize beyond training length. Sinusoidal encodings help but aren't perfect.

Less inductive bias: Transformers make fewer assumptions about data structure. This flexibility means they need to learn structure from data rather than having it built in.

For Your Telescope Array

Good for: Complex sequences where long-range dependencies matter. Multi-modal data fusion. Tasks where CNNs or RNNs underperform.

Specific applications:

Advanced light curve analysis: Transformers can capture long-range periodicity, complex variability patterns, and subtle correlations that RNNs miss.

Multi-site data fusion: Treat observations from different sites as sequence elements. Attention learns which observations to weight more heavily, how to combine information across sites.

[Obs_Site_A, Obs_Site_B, Obs_Site_C, ...] β†’ [Transformer] β†’ Fused Representation

Catalog cross-matching: Given entries from multiple catalogs, transformer attention learns which entries correspond to the same object.

Vision Transformer for images: For challenging image classification tasks where CNNs plateau, ViT might push further (with sufficient data).

Multimodal understanding: Combine image features and light curve features in a single transformer. Attention learns relationships between visual appearance and temporal behavior.

Example architecture for multi-site data fusion:

Inputs: Observations from N sites, each represented as a vector
[obs_1, obs_2, ..., obs_N] where obs_i includes: image embedding, quality metrics, timestamp, site ID embedding

Positional encoding: Site embeddings rather than sequence positions

Transformer Encoder (4 layers, 8 attention heads, 256 dimensions)
Each observation attends to all others
Learns which sites to weight, how to combine

Global pooling or CLS token
Output: Fused representation

Task heads:
- Classification head: Dense β†’ class probabilities
- Quality estimation head: Dense β†’ expected quality of combined result
- Uncertainty head: Dense β†’ confidence bounds

Autoencoders: Learning Compression

What They Are

Networks that learn to compress data to a smaller representation, then reconstruct the original. Not for prediction, but for representation learning.

The Core Insight

If a network can compress data to a small representation and reconstruct it accurately, that small representation must capture the essential information. What's lost is presumably noise or irrelevant detail.

Architecture

Input β†’ [Encoder] β†’ Bottleneck (small) β†’ [Decoder] β†’ Reconstruction

        High-dimensional                              High-dimensional
           input                                         output
                        Low-dimensional
                          code/latent

Encoder: Compresses input to bottleneck. Typically uses convolutions (for images) or dense layers.

Bottleneck: The compressed representation. Much smaller than input (e.g., 256Γ—256 image β†’ 128 numbers).

Decoder: Reconstructs input from bottleneck. Mirror of encoder architecture.

Loss: Reconstruction error, typically mean squared error between input and output.

Variational Autoencoders (VAEs)

Standard autoencoders learn a deterministic mapping. VAEs learn a probabilistic one.

Instead of encoding to a single point, VAE encodes to a distribution (mean and variance):

Input β†’ [Encoder] β†’ (ΞΌ, Οƒ) β†’ Sample z ~ N(ΞΌ, Οƒ) β†’ [Decoder] β†’ Reconstruction

Loss includes:

  • Reconstruction error
  • KL divergence between learned distribution and prior (regularizes latent space)

VAEs have smoother latent spaces. You can sample from the prior and generate realistic outputs.
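The sampling step uses the reparameterization trick: draw unit noise once, then shift and scale it, so gradients can flow through ΞΌ and Οƒ. A sketch with made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.5, -1.0])     # encoder's predicted mean
sigma = np.array([0.1, 0.2])   # encoder's predicted standard deviation

# z ~ N(mu, sigma) written as a deterministic function of unit noise
eps = rng.standard_normal(mu.shape)
z = mu + sigma * eps           # decoder input
```
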

Uses of Autoencoders

Dimensionality reduction: The bottleneck representation is a compressed version of input. Useful for visualization, clustering, or as input to other models.

Denoising: Train autoencoder on noisy inputs with clean targets. It learns to remove noise.

Anomaly detection: Train on normal data. Anomalies reconstruct poorly (high error).

Generation: VAEs (and related models) can generate new samples by decoding random latent vectors.

Strengths

Unsupervised: Don't need labels. Just need examples of normal data.

Representation learning: Learn useful features without explicit supervision.

Anomaly detection: Natural fit for finding unusual objects.

Compression: Learned compression can outperform hand-designed methods.

Weaknesses

Reconstruction focus: Optimizing reconstruction might not produce representations useful for downstream tasks.

Mode collapse: Can learn to ignore some input variation, reconstructing only "average" outputs.

Blurry outputs: Especially VAEs tend to produce blurry reconstructions, averaging over uncertainty.

Hyperparameter sensitivity: Bottleneck size, architecture choices significantly affect results.

For Your Telescope Array

Good for: Anomaly detection. Data compression. Finding unusual objects. Learning representations without labels.

Specific applications:

Anomaly detection: Train autoencoder on normal telescope images. High reconstruction error flags unusual images for human review.

Training: Normal images β†’ Autoencoder β†’ Minimize reconstruction error
Deployment: New image β†’ Autoencoder β†’ Measure reconstruction error
            If error > threshold: Flag as anomalous
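The deployment logic above in sketch form; `reconstruct` is a hypothetical stand-in for a trained autoencoder's encode-decode pass:

```python
import numpy as np

def reconstruction_error(image, reconstruct):
    # Mean squared error between the image and its reconstruction
    return float(np.mean((image - reconstruct(image)) ** 2))

# Threshold chosen from errors on known-good validation images,
# e.g. at the 99th percentile for a ~1% false positive rate
validation_errors = np.array([0.010, 0.012, 0.009, 0.011, 0.013])
threshold = np.percentile(validation_errors, 99)

def is_anomalous(image, reconstruct):
    return reconstruction_error(image, reconstruct) > threshold

# With a perfect reconstructor, a normal frame is never flagged
identity = lambda img: img
flagged = is_anomalous(np.ones((8, 8)), identity)
```
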

Compression for transmission: Train autoencoder to compress images. Send only bottleneck codes from remote sites, decode centrally. Lossy but much smaller.

Unknown object discovery: Cluster objects in latent space. Objects far from known clusters might be new types.

Quality-aware compression: Train autoencoder with quality-weighted loss. Preserve important regions (sources) more than background.

Example anomaly detection system:

Convolutional Autoencoder:

Encoder:
Conv(32, 3Γ—3) β†’ ReLU β†’ Pool(2Γ—2)  # 256 β†’ 128
Conv(64, 3Γ—3) β†’ ReLU β†’ Pool(2Γ—2)  # 128 β†’ 64
Conv(128, 3Γ—3) β†’ ReLU β†’ Pool(2Γ—2) # 64 β†’ 32
Conv(256, 3Γ—3) β†’ ReLU β†’ Pool(2Γ—2) # 32 β†’ 16
Flatten β†’ Dense(512) β†’ Dense(128) β†’ Bottleneck

Decoder (mirror of encoder):
Dense(512) β†’ Dense(16Γ—16Γ—256) β†’ Reshape
Upsample(2Γ—2) β†’ Conv(128, 3Γ—3) β†’ ReLU  # 16 β†’ 32
Upsample(2Γ—2) β†’ Conv(64, 3Γ—3) β†’ ReLU   # 32 β†’ 64
Upsample(2Γ—2) β†’ Conv(32, 3Γ—3) β†’ ReLU   # 64 β†’ 128
Upsample(2Γ—2) β†’ Conv(1, 3Γ—3) β†’ Output  # 128 β†’ 256

Loss: Mean squared error

Anomaly score: Reconstruction error per image
Threshold: Set from validation data to achieve desired false positive rate

Graph Neural Networks: Relational Intelligence

What They Are

Networks designed for data naturally represented as graphs: nodes connected by edges. Where CNNs exploit spatial structure and RNNs exploit temporal structure, GNNs exploit relational structure.

The Core Insight

Many astronomical phenomena involve relationships:

  • Stars in clusters are related
  • Galaxies in groups interact
  • Observations of the same object are connected
  • Telescope sites share information

Graphs naturally represent these relationships. GNNs learn to use relational structure.

Graph Representation

A graph consists of:

  • Nodes: Entities (stars, galaxies, observations, telescopes)
  • Edges: Relationships between nodes (physical proximity, causal connection, same object)
  • Node features: Attributes of each node (brightness, color, position)
  • Edge features: Attributes of each relationship (distance, time difference, strength)

Message Passing: The Core Operation

GNNs work by passing messages between connected nodes:

For each node:
    1. Gather messages from neighbors
    2. Aggregate messages (sum, mean, max, or learned aggregation)
    3. Update node representation based on current state + aggregated messages

After several rounds of message passing, each node's representation incorporates information from its neighborhood.

Round 1: Each node knows about immediate neighbors
Round 2: Each node knows about neighbors-of-neighbors
Round 3: Information from 3-hop neighborhood
...

Mathematical Formulation

Basic message passing:

m[i] = Aggregate({h[j] : j ∈ Neighbors(i)})
h'[i] = Update(h[i], m[i])

Where:

  • h[i] is node i's representation
  • m[i] is aggregated message for node i
  • Aggregate is a permutation-invariant function (sum, mean, max)
  • Update combines current state with message (typically neural network)
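
The formulation above can be written out for a toy graph. This is an illustrative sketch using mean aggregation and a tanh update, with fixed random matrices standing in for learned weights.

```python
import numpy as np

def message_pass(h, neighbors, W_self, W_msg):
    """One round of mean-aggregation message passing.

    h: (num_nodes, d) node representations
    neighbors: dict mapping node index -> list of neighbor indices
    W_self, W_msg: weight matrices (learned in practice, fixed here)
    """
    h_new = np.zeros_like(h)
    for i in range(len(h)):
        m = h[neighbors[i]].mean(axis=0)               # Aggregate (mean)
        h_new[i] = np.tanh(h[i] @ W_self + m @ W_msg)  # Update
    return h_new

# Tiny 3-node path graph: 0 - 1 - 2
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1]}
rng = np.random.default_rng(0)
W_self, W_msg = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))

h1 = message_pass(h, neighbors, W_self, W_msg)
print(h1.shape)  # (3, 2)
```

Stacking this function k times gives each node a view of its k-hop neighborhood, exactly as in the rounds described above.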

Common architectures:

Graph Convolutional Network (GCN):

H' = Οƒ(D^(-1/2) A D^(-1/2) H W)

Where A is the adjacency matrix (in practice usually with self-loops added, Ã = A + I, so each node retains its own features), D is the corresponding degree matrix, H is the node feature matrix, and W is a learnable weight matrix.
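
The GCN propagation rule fits in a few lines of NumPy. A sketch, assuming the common convention of adding self-loops (A + I) and using ReLU as the nonlinearity Οƒ:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^(-1/2) (A+I) D^(-1/2) H W).
    Self-loops (A + I) are added, as is common, so each node
    keeps its own features during propagation."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)                    # degrees including self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Triangle graph, 2-dim input features, 4 output channels
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
W = np.random.default_rng(1).normal(size=(2, 4))

H1 = gcn_layer(A, H, W)
print(H1.shape)  # (3, 4)
```

The symmetric D^(-1/2) normalization keeps the scale of node features stable regardless of how many neighbors a node has.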

Graph Attention Network (GAT): Use attention to weight neighbor contributions differently.

GraphSAGE: Sample and aggregate neighbors, enabling mini-batch training on large graphs.

Strengths

Natural for relational data: Directly encodes relationships. No need to flatten graph structure into vectors.

Flexible structure: Works on graphs of any size and topology. Adapts to varying numbers of neighbors.

Inductive: Can generalize to unseen nodes/graphs if features are meaningful.

Combines information: Learns how to aggregate information from related entities.

Weaknesses

Scalability: Very large graphs (millions of nodes) require sophisticated sampling or approximation.

Oversmoothing: Many message-passing rounds make all node representations similar. Deep GNNs are harder to train.

Edge definition: Results depend on how you define graph structure. Wrong edges hurt performance.

Less mature: GNNs are newer than CNNs/RNNs. Fewer established best practices.

For Your Telescope Array

Good for: Modeling relationships between objects, sites, or observations. Catalog analysis. Network coordination.

Specific applications:

Star cluster analysis: Nodes are stars, edges connect probable cluster members. GNN learns cluster membership, identifies interlopers.

Galaxy group finding: Nodes are galaxies, edges from proximity or velocity similarity. GNN identifies group memberships, predicts properties.

Multi-observation fusion: Nodes are observations of the same target (different times, sites, instruments). Edges connect same-object observations. GNN learns optimal combination.

Graph structure:
  Nodes: Individual observations
  Edges: Same object, temporal proximity, or spatial proximity
  Node features: Measurement values, quality metrics, metadata
  Edge features: Time difference, site pair, conditions similarity

GNN:
  Message passing learns how to weight and combine observations
  Output: Fused estimate for each unique object

Telescope network optimization: Nodes are telescope sites, edges connect sites with complementary capabilities. GNN learns coordination patterns, recommends resource allocation.

Anomaly detection in context: When detecting anomalies, consider relationships. A star that's anomalous in isolation might be normal given its cluster context. GNN incorporates context.

Example architecture for multi-observation fusion:

Graph construction:
  For each unique object, create nodes for all observations
  Connect observations with edges (fully connected or based on relevance)

Node features (per observation):
  - Measured values (magnitudes, colors, etc.)
  - Uncertainty estimates
  - Observation quality metrics
  - Site identifier (embedded)
  - Time of observation

Edge features:
  - Time difference
  - Site pair identifier
  - Condition similarity score

GNN architecture:
  GraphSAGE with 3 message-passing layers
  Hidden dimension: 128
  Aggregation: Attention-weighted mean

After message passing:
  Global pooling across all nodes for this object
  Dense layers for final estimate

Output:
  Fused measurement estimate
  Uncertainty bounds
  Outlier flags for individual observations
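
For context, a simple non-learned baseline for the fused-estimate output is inverse-variance weighting of the observations; the GNN can be viewed as learning a context-dependent generalization of this rule. The magnitudes and uncertainties below are made up for illustration.

```python
import numpy as np

def inverse_variance_fusion(values, sigmas):
    """Combine repeated measurements of one object, weighting each
    by 1/sigma^2. Returns the fused value and its uncertainty."""
    w = 1.0 / np.asarray(sigmas) ** 2
    fused = np.sum(w * values) / np.sum(w)
    fused_sigma = 1.0 / np.sqrt(np.sum(w))
    return fused, fused_sigma

# Three sites measure the same star's magnitude with different precision
values = np.array([14.20, 14.25, 14.10])
sigmas = np.array([0.05, 0.10, 0.20])

fused, err = inverse_variance_fusion(values, sigmas)
print(round(fused, 3), round(err, 3))  # 14.205 0.044
```

Unlike this fixed rule, the GNN can downweight an observation because of its context (poor conditions at that site, disagreement with temporally adjacent observations), which is what the edge features are there to support.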

Generative Models: Creating New Data

What They Are

Models that learn to generate new samples resembling training data. Instead of classifying or predicting, they create.

Generative Adversarial Networks (GANs)

Two networks in competition:

Generator: Takes random noise, produces fake samples.
Discriminator: Tries to distinguish real from fake samples.

Training is adversarial:

  • Discriminator improves at detecting fakes
  • Generator improves at fooling discriminator
  • At equilibrium, generator produces samples discriminator can't distinguish from real
Random noise z β†’ [Generator] β†’ Fake sample
                                    ↓
                              [Discriminator] β†’ Real or Fake?
                                    ↑
                              Real sample

Loss functions:

Discriminator: maximize log(D(real)) + log(1 - D(G(z)))
Generator: maximize log(D(G(z))) (or minimize log(1 - D(G(z))))
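
Both losses can be computed directly from discriminator scores. A minimal NumPy sketch (the example scores are hypothetical, and `eps` guards the logarithms):

```python
import numpy as np

def d_loss(d_real, d_fake, eps=1e-8):
    """Discriminator loss (negated objective, to minimize):
    -[log D(real) + log(1 - D(G(z)))]."""
    return -np.mean(np.log(d_real + eps) + np.log(1 - d_fake + eps))

def g_loss(d_fake, eps=1e-8):
    """Non-saturating generator loss: -log D(G(z))."""
    return -np.mean(np.log(d_fake + eps))

# Discriminator scores in (0, 1) for a hypothetical batch
d_real = np.array([0.9, 0.8])   # discriminator is confident these are real
d_fake = np.array([0.1, 0.2])   # discriminator is confident these are fake

# A discriminator that separates real from fake has lower loss than a coin flip
print(d_loss(d_real, d_fake) < d_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # True
```

The `-log D(G(z))` form for the generator is the standard non-saturating variant: it gives stronger gradients early in training, when the discriminator easily rejects generated samples.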

Diffusion Models

Currently state-of-the-art for image generation.

Forward process: Gradually add noise to real data until it's pure noise.
Reverse process: Learn to gradually remove noise, recovering data from noise.

Real image β†’ [Add noise] β†’ [Add noise] β†’ ... β†’ Pure noise
Pure noise β†’ [Denoise] β†’ [Denoise] β†’ ... β†’ Generated image

The denoising network learns to predict and remove noise at each step. Many small denoising steps produce high-quality samples.
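
Under the standard DDPM formulation, the forward process has a convenient closed form: x_t = √(αΎ±_t)Β·x_0 + √(1 βˆ’ αΎ±_t)Β·Ξ΅, where αΎ±_t is the cumulative product of (1 βˆ’ Ξ²_s). A sketch with a linear Ξ² schedule (the schedule values are illustrative choices):

```python
import numpy as np

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from the forward (noising) process in closed form:
    x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps, eps

# Linear beta schedule over 1000 steps (a common choice)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(16, 16))        # a toy "image"
x_early, _ = forward_noise(x0, t=10, alpha_bar=alpha_bar, rng=rng)
x_late, _ = forward_noise(x0, t=999, alpha_bar=alpha_bar, rng=rng)

# Early steps stay close to the image; by the last step it is almost pure noise
print(alpha_bar[10] > 0.99, alpha_bar[999] < 1e-4)  # True True
```

The denoising network is trained to predict the `eps` that was added at a randomly sampled step t, which is what makes the many-step reverse process learnable from single-step targets.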

Uses in Astronomy

Data augmentation: Generate synthetic training examples, especially for rare classes.

Simulation: Generate realistic synthetic observations to test pipelines.

Super-resolution: Generate high-resolution images from low-resolution inputs.

Inpainting: Fill in missing or corrupted regions of images.

Conditional generation: Generate images matching specific properties (galaxy with certain morphology, star with certain spectrum).

For Your Telescope Array

Specific applications:

Training data generation: Have few examples of rare transients? Train a generative model on what you have, generate more for classifier training.

Pipeline testing: Generate realistic synthetic observations to stress-test processing pipelines before real data arrives.

Data recovery: Inpaint satellite trails, cosmic rays, or bad pixels in otherwise good observations.

Prediction: Given current conditions and recent observations, generate predictions of what observations will look like in the near future.


Architecture Selection Guide for Your Project

Let me be concrete about which architecture to use for each component of your distributed telescope system.

At Individual Telescope Sites

| Task | Architecture | Rationale |
|---|---|---|
| Frame quality assessment | Lightweight CNN | Fast inference, spatial patterns matter, proven performance |
| Real-time transient detection | CNN + threshold | Need speed, looking for spatial signatures |
| Basic source detection | U-Net (CNN variant) | Semantic segmentation task, well-established |
| Quick classification | Small CNN or feedforward from features | Speed critical, accuracy secondary |
| Equipment anomaly detection | Autoencoder | Unsupervised, learns normal behavior |

At Central Coordination

| Task | Architecture | Rationale |
|---|---|---|
| Deep image classification | ResNet/EfficientNet CNN or ViT | Accuracy matters, have compute resources |
| Light curve classification | Transformer or LSTM | Sequential data with long-range dependencies |
| Multi-site data fusion | Transformer or GNN | Relating multiple inputs, flexible attention |
| Scheduling optimization | Reinforcement learning (various) | Sequential decision-making |
| Catalog cross-matching | GNN or Transformer | Relational structure matters |
| Anomaly detection at scale | Autoencoder + clustering | Find unknowns in large datasets |
| Multi-modal analysis | Transformer | Naturally handles multiple input types |

Decision Flowchart

Is your data...?

β”œβ”€β”€ Images (2D spatial)
β”‚   β”œβ”€β”€ Classification/detection β†’ CNN (ResNet, EfficientNet)
β”‚   β”œβ”€β”€ Segmentation β†’ U-Net, DeepLab
β”‚   β”œβ”€β”€ Very complex patterns β†’ Vision Transformer (if enough data)
β”‚   └── Need speed β†’ MobileNet, lightweight CNN
β”‚
β”œβ”€β”€ Sequences (time series)
β”‚   β”œβ”€β”€ Short sequences (<100 steps) β†’ LSTM or GRU
β”‚   β”œβ”€β”€ Long sequences (>100 steps) β†’ Transformer
β”‚   β”œβ”€β”€ Real-time streaming β†’ LSTM with online updates
β”‚   └── Bidirectional context available β†’ Bidirectional LSTM or Transformer
β”‚
β”œβ”€β”€ Tabular (features/measurements)
β”‚   β”œβ”€β”€ Clear features β†’ XGBoost/LightGBM (often beats neural networks)
β”‚   β”œβ”€β”€ Need neural network β†’ Feedforward
β”‚   └── Interactions complex β†’ Feedforward with more layers
β”‚
β”œβ”€β”€ Graph (relational)
β”‚   └── Use GNN (GraphSAGE, GAT)
β”‚
β”œβ”€β”€ Multiple modalities (images + sequences + tabular)
β”‚   └── Transformer (or separate encoders feeding shared transformer)
β”‚
└── Unlabeled data
    β”œβ”€β”€ Want compression/representation β†’ Autoencoder
    β”œβ”€β”€ Want anomaly detection β†’ Autoencoder or isolation forest
    └── Want to generate samples β†’ GAN or diffusion model

Hybrid Architectures for Your System

Real systems often combine architectures:

CNN + LSTM for video or image sequences:

Frame 1 β†’ [CNN] β†’ features[1] ─┐
Frame 2 β†’ [CNN] β†’ features[2] ─┼→ [LSTM] β†’ Sequence classification
Frame 3 β†’ [CNN] β†’ features[3] β”€β”˜

Use CNN to extract per-frame features, LSTM to model temporal evolution.

CNN + Transformer for multi-site fusion:

Site A image β†’ [CNN] β†’ embedding_A ─┐
Site B image β†’ [CNN] β†’ embedding_B ─┼→ [Transformer] β†’ Fused result
Site C image β†’ [CNN] β†’ embedding_C β”€β”˜

Use CNN to extract site-specific features, transformer to learn optimal combination.

Autoencoder + Classifier for semi-supervised learning:

Labeled + Unlabeled data β†’ [Autoencoder] β†’ Latent representations
Latent representations + Labels β†’ [Classifier] β†’ Predictions

Use autoencoder to learn representations from all data (including unlabeled), classifier on top using labels.


Summary Comparison Table

| Architecture | Best For | Input Type | Strengths | Weaknesses | Your Use Cases |
|---|---|---|---|---|---|
| Feedforward | Tabular data, simple tasks | Fixed-size vectors | Simple, fast, universal | No structure awareness | Feature-based classification, final layers |
| CNN | Images, spatial data | 2D/3D grids | Translation invariance, hierarchical features | Fixed input size, local focus | Image quality, source detection, morphology |
| RNN/LSTM | Sequences, time series | Variable-length sequences | Temporal modeling, memory | Sequential (slow), limited range | Light curves, streaming data |
| Transformer | Long sequences, multi-modal | Any (with encoding) | Parallelizable, long-range, flexible | Quadratic complexity, data hungry | Complex light curves, data fusion |
| Autoencoder | Compression, anomaly detection | Any | Unsupervised, learns representations | Reconstruction-focused | Anomaly detection, compression |
| GNN | Relational data, graphs | Graphs | Models relationships | Scaling, oversmoothing | Cluster analysis, observation fusion |
| GAN/Diffusion | Data generation | Any | Creates new samples | Training instability | Data augmentation, simulation |

This should give you a complete understanding of how each architecture works, what it's suited for, and exactly where each fits into your distributed telescope network. The key is matching architecture assumptions to your data's structure and your task's requirements.