Understanding ML Model Architectures for Your Distributed Telescope Array¶
Let me take you through every major architecture, how it actually works, and exactly where each fits into your telescope network.
The Fundamental Question: Why Different Architectures?¶
Before diving into specifics, understand why we have different architectures at all.
Data comes in different shapes:
Tabular data: Rows and columns, like a spreadsheet. Star catalogs with measurements. Each row is independent, columns are features.
Images: 2D grids of pixels. Your telescope frames. Nearby pixels are related. Spatial structure matters.
Sequences: Ordered data points. Light curves over time. What came before affects interpretation of what comes after.
Graphs: Networks of connected entities. Stars in clusters. Galaxies in groups. Relationships between objects matter.
Sets: Collections without order. Multiple observations of the same field. The set matters, not the sequence.
Each architecture embodies assumptions about data structure. Using the wrong architecture means fighting against its assumptions. Using the right architecture means the model naturally captures relevant patterns.
Feedforward Neural Networks: The Foundation¶
What They Are¶
The simplest neural network. Data flows in one direction: input to output, no loops, no memory.
Input Layer → Hidden Layer 1 → Hidden Layer 2 → ... → Output Layer
Each layer is fully connected to the next. Every neuron in layer N connects to every neuron in layer N+1.
How They Process Information¶
Imagine your input is a vector of 100 numbers representing measurements of a star: brightness in different filters, position, proper motion, and so on.
Layer 1 (say, 256 neurons): Each neuron computes a weighted sum of all 100 inputs, adds a bias, applies an activation function. You get 256 new numbers, each representing some combination of the original features.
Layer 2 (say, 128 neurons): Each neuron takes all 256 outputs from Layer 1, computes weighted sums, applies activation. Now you have 128 numbers representing combinations of combinations.
Output Layer (say, 5 neurons for 5 star types): Each neuron combines the 128 Layer 2 outputs. Apply softmax to get probabilities.
The key insight: each successive layer learns more abstract representations. Layer 1 might learn "this combination of colors indicates high temperature." Layer 2 might learn "high temperature plus this proper motion pattern suggests a certain stellar population."
Mathematical Formulation¶
For a single layer:
output = activation(weights × input + bias)
Where:
- input is a vector of N values
- weights is a matrix of size (M × N), where M is the number of neurons
- bias is a vector of M values
- activation is a nonlinear function applied element-wise
- output is a vector of M values
Stacking layers:
h₁ = activation(W₁ × input + b₁)
h₂ = activation(W₂ × h₁ + b₂)
h₃ = activation(W₃ × h₂ + b₃)
output = softmax(W₄ × h₃ + b₄)
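Stacked like this, the whole forward pass is a handful of matrix products. A minimal NumPy sketch, using the 100-feature star example from above (the random weights are placeholders for trained values):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())          # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)

# Random stand-ins for trained weights: 100 features -> 256 -> 128 -> 5 classes
W1, b1 = rng.normal(0, 0.1, (256, 100)), np.zeros(256)
W2, b2 = rng.normal(0, 0.1, (128, 256)), np.zeros(128)
W3, b3 = rng.normal(0, 0.1, (5, 128)), np.zeros(5)

x = rng.normal(size=100)             # one star's feature vector

h1 = relu(W1 @ x + b1)               # layer 1: combinations of raw features
h2 = relu(W2 @ h1 + b2)              # layer 2: combinations of combinations
probs = softmax(W3 @ h2 + b3)        # output: probabilities over 5 star types

print(probs.shape, probs.sum())
```

Each line mirrors one equation above; swapping softmax for an identity map would turn the same skeleton into a regression network.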
Strengths¶
Universality: Can theoretically approximate any continuous function given enough neurons. This is a mathematical guarantee (the universal approximation theorem).
Simplicity: Easy to implement, understand, debug. Training is straightforward.
Speed: Fast inference. No complex operations, just matrix multiplications.
Flexibility: Works on any fixed-size input. No structural assumptions beyond input dimension.
Weaknesses¶
No spatial awareness: Treats each input independently. For images, pixel 1 and pixel 1000 are equally "distant" from the network's perspective, even if they're adjacent in the image.
No temporal awareness: Each input is processed independently. Can't learn that a brightness measurement depends on previous measurements.
Parameter explosion: For large inputs, fully-connected layers have enormous numbers of parameters. A 256×256 image has 65,536 pixels. A single hidden layer of 1000 neurons would have over 65 million parameters just for that layer.
No weight sharing: Patterns learned in one part of the input don't transfer to other parts. A galaxy in the corner of an image requires separate learning from a galaxy in the center.
For Your Telescope Array¶
Good for: Processing extracted features (not raw images). Tabular data from catalogs. Final classification layers after other architectures have extracted features.
Specific applications:
- Classifying stars from catalog measurements (colors, proper motions, parallax)
- Predicting observation quality from metadata (temperature, humidity, moon phase, elevation)
- Combining high-level features from multiple sources for final decision-making
- Quick assessment models where speed matters more than accuracy
Example scenario: You've extracted 50 features from a light curve (mean brightness, variance, periodicity measures, etc.). A feedforward network takes these 50 numbers and classifies the variable star type. The feature extraction handles temporal structure; the feedforward network handles the final classification.
Convolutional Neural Networks: Spatial Intelligence¶
What They Are¶
Networks designed for data with spatial structure, primarily images. Instead of connecting every input to every neuron, they use local connections and weight sharing.
The Core Insight¶
Images have two crucial properties feedforward networks ignore:
Locality: Relevant patterns are local. An edge is a few pixels. A star is a small region. You don't need to look at pixels 1000 apart simultaneously to detect these patterns.
Translation invariance: A spiral arm looks like a spiral arm regardless of where it appears in the image. Learning to recognize it in one location should transfer to all locations.
CNNs embody these assumptions through convolution operations.
How Convolution Works¶
A convolutional layer has small filters (also called kernels), typically 3×3, 5×5, or 7×7 pixels.
Each filter slides across the entire image, computing a dot product at each position:
Image patch:    Filter:           Computation:
[a b c]         [w₁ w₂ w₃]       output = a×w₁ + b×w₂ + c×w₃ +
[d e f]    ×    [w₄ w₅ w₆]                d×w₄ + e×w₅ + f×w₆ +
[g h i]         [w₇ w₈ w₉]                g×w₇ + h×w₈ + i×w₉
This single number represents "how much does this patch match this filter?"
Sliding the filter across all positions produces a feature map: a 2D grid showing where the pattern was detected.
Multiple Filters, Multiple Layers¶
A single convolutional layer has many filters (32, 64, 128 are common). Each learns to detect a different pattern.
Layer 1 filters learn simple patterns:
- Horizontal edges
- Vertical edges
- Diagonal edges
- Brightness gradients
- Spots of various sizes
Layer 2 filters operate on Layer 1's output, learning combinations:
- Corners (horizontal + vertical edges)
- Curves (sequences of edge orientations)
- Texture patterns
- Ring-like structures
Layer 3 and beyond learn increasingly complex combinations:
- Spiral arm signatures
- Galaxy core patterns
- Specific artifact shapes
- Complex morphological features
This hierarchy emerges automatically from training. You don't specify "learn edges then corners then spirals." The network discovers this hierarchy because it's efficient for reducing classification error.
Pooling Operations¶
Between convolutional layers, pooling reduces spatial dimensions:
Max pooling: Take the maximum value in each small region
[1 3 2 4]
[5 6 1 2]  → Max pool 2×2 →  [6 4]
[3 2 1 0]                    [3 3]
[1 2 3 1]
Average pooling: Take the mean value in each region
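Both operations are easy to verify in NumPy. This sketch reproduces the 2×2 max-pool example above by reshaping the 4×4 array into non-overlapping 2×2 blocks:

```python
import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [3, 2, 1, 0],
              [1, 2, 3, 1]])

# Split into non-overlapping 2x2 blocks: axes (block_row, row_in_block,
# block_col, col_in_block), then reduce over the in-block axes.
blocks = x.reshape(2, 2, 2, 2)
pooled = blocks.max(axis=(1, 3))     # max pooling
avg = blocks.mean(axis=(1, 3))       # average pooling

print(pooled)   # [[6 4]
                #  [3 3]]
```

The same reshape trick generalizes to any pool size that divides the image dimensions evenly.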
Pooling provides:
- Reduced computation for subsequent layers
- Some translation invariance (small shifts don't change max values much)
- Larger effective receptive field (later layers "see" more of the original image)
Receptive Fields¶
A crucial concept: how much of the original image influences a single neuron in a later layer?
Layer 1 neuron: Sees only its 3×3 filter region. Receptive field = 9 pixels.
Layer 2 neuron: Takes input from Layer 1 neurons, each of which saw 3×3. After 2×2 pooling, each Layer 2 neuron effectively sees about 8×8 pixels of the original image.
Deep layer neuron: Might effectively see the entire image, but through a hierarchical lens.
This is why deep CNNs can learn global patterns while still using local operations: information propagates through the hierarchy.
Mathematical Formulation¶
For a 2D convolution:
output[i,j] = Σₘ Σₙ input[i+m, j+n] × filter[m,n] + bias
Where the sums run over the filter dimensions.
With multiple input channels (like RGB, or previous layer features):
output[i,j] = Σ_c Σₘ Σₙ input[c, i+m, j+n] × filter[c,m,n] + bias
Where c indexes input channels.
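A minimal NumPy sketch of the single-channel formula (note that deep-learning "convolution" slides the filter without flipping it, i.e. cross-correlation). The edge filter and test image here are illustrative choices, not from the text:

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Valid cross-correlation, matching the formula above."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product of the filter with the patch under it
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel) + bias
    return out

# A vertical-edge filter applied to an image containing a bright column
image = np.zeros((5, 5))
image[:, 2] = 1.0                       # bright vertical line
edge_filter = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]], float)

fmap = conv2d(image, edge_filter)
print(fmap.shape)                       # (3, 3) feature map
```

The feature map responds strongly (positive then negative) on either side of the line and is zero directly on it, which is exactly the "how much does this patch match this filter?" reading from above.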
Architecture Patterns¶
Standard CNN architectures follow patterns:
VGG pattern: Stack many 3×3 convolutions. Simple but effective.
Conv3×3 → Conv3×3 → Pool → Conv3×3 → Conv3×3 → Pool → ... → Dense → Output
ResNet pattern: Add skip connections that let gradients flow directly through many layers.
input → Conv → Conv → (+input) → Conv → Conv → (+previous) → ...
Skip connections solve the vanishing gradient problem, allowing very deep networks (50, 100, 150+ layers).
Inception/GoogLeNet pattern: Use multiple filter sizes in parallel, concatenate results.
input → [1×1 conv, 3×3 conv, 5×5 conv, pool] → concatenate → ...
This captures patterns at multiple scales simultaneously.
Strengths¶
Parameter efficiency: A 3×3 filter has 9 parameters regardless of image size. Compared to feedforward networks, CNNs have far fewer parameters.
Translation equivariance: A pattern detected at position (10, 10) uses the same weights as detection at (100, 100). Learning transfers across positions.
Hierarchical feature learning: Automatically learns appropriate feature hierarchy for the task.
Proven architecture: Decades of refinement. Well-understood behavior. Extensive pre-trained models available.
Weaknesses¶
Fixed input size: Standard CNNs expect fixed image dimensions. Variable sizes require padding, cropping, or architectural changes.
Limited global awareness: Despite stacking layers, CNNs can struggle with patterns requiring true global context. A pattern depending on opposite corners remains hard.
Translation invariance can hurt: Sometimes position matters. The center of a galaxy image is semantically different from the edge. Pure CNNs don't distinguish.
No temporal understanding: Each image is processed independently. Sequential relationships require additional architecture.
For Your Telescope Array¶
Good for: Any image-based task. Quality assessment. Object detection. Galaxy classification. Artifact identification.
Specific applications:
Real-time quality assessment: A lightweight CNN at each telescope evaluates incoming frames. Input: single frame. Output: quality score and issue flags (clouds, tracking error, focus problem, etc.).
Source detection: Semantic segmentation CNNs identify every source in an image. Each pixel gets classified: background, star, galaxy, artifact, satellite trail.
Galaxy morphology: CNNs trained on Galaxy Zoo data classify galaxy types, identify features like bars, rings, spiral arms, merger signatures.
Transient detection: CNNs compare new images to references, classifying differences as real transients, artifacts, or noise.
Cross-site calibration: CNNs learn to map images from different sites to a common representation, normalizing site-specific effects.
Example architecture for your quality classifier:
Input: 256×256 grayscale image
Block 1: Conv(32 filters, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
Output: 128×128×32
Block 2: Conv(64 filters, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
Output: 64×64×64
Block 3: Conv(128 filters, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
Output: 32×32×128
Block 4: Conv(256 filters, 3×3) → BatchNorm → ReLU → MaxPool(2×2)
Output: 16×16×256
Global Average Pool: 256 values
Dense(128) → ReLU → Dropout(0.5)
Dense(3) → Softmax
Output: probabilities for [good, medium, bad]
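The shapes and parameter counts in this sketch can be sanity-checked with a few lines of arithmetic. This assumes "same" padding (so only pooling changes spatial size) and ignores BatchNorm parameters:

```python
# Trace spatial size and parameter count through the quality-classifier sketch.
# Assumptions: 'same'-padded 3x3 convolutions, BatchNorm parameters ignored.
def conv_params(in_ch, out_ch, k=3):
    return out_ch * (in_ch * k * k + 1)      # weights + one bias per filter

size, channels, params = 256, 1, 0
for out_ch in (32, 64, 128, 256):
    params += conv_params(channels, out_ch)
    size //= 2                               # MaxPool(2x2) halves each axis
    channels = out_ch
    print(f"after block: {size}x{size}x{channels}")

params += 256 * 128 + 128                    # Dense(128) after global average pool
params += 128 * 3 + 3                        # Dense(3) classifier
print(f"total parameters: {params:,}")
```

Under half a million parameters for the whole network, versus 65 million for a single dense layer on the same image: this is the parameter efficiency argument made concrete.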
Recurrent Neural Networks: Temporal Intelligence¶
What They Are¶
Networks designed for sequential data. They maintain internal state that persists across sequence elements, giving them a form of memory.
The Core Insight¶
Many phenomena unfold over time. A light curve isn't just a collection of brightness measurements; it's an ordered sequence where each measurement relates to those before and after.
Standard feedforward networks process each input independently. RNNs process sequences element by element, maintaining hidden state that captures what they've seen so far.
Basic RNN Operation¶
At each time step t:
hidden[t] = activation(W_input × input[t] + W_hidden × hidden[t-1] + bias)
output[t] = W_output × hidden[t]
The key: hidden[t] depends on hidden[t-1]. Information flows forward through time.
input[0] → [RNN Cell] → hidden[0] → output[0]
                ↓
input[1] → [RNN Cell] → hidden[1] → output[1]
                ↓
input[2] → [RNN Cell] → hidden[2] → output[2]
                ↓
...
The same weights are used at every time step. The only thing changing is the hidden state.
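The recurrence is short enough to write out directly. A minimal NumPy sketch with illustrative sizes (16 hidden units, 3 input features; random weights stand in for trained values):

```python
import numpy as np

def rnn_forward(inputs, W_in, W_hid, b, W_out):
    """Run the recurrence above over a whole sequence.

    The same three weight matrices are reused at every time step; only
    the hidden state changes.
    """
    h = np.zeros(W_hid.shape[0])
    outputs = []
    for x_t in inputs:
        h = np.tanh(W_in @ x_t + W_hid @ h + b)   # hidden[t] from hidden[t-1]
        outputs.append(W_out @ h)
    return np.array(outputs), h

rng = np.random.default_rng(1)
seq = rng.normal(size=(20, 3))        # 20 time steps, 3 features each
W_in  = rng.normal(0, 0.1, (16, 3))
W_hid = rng.normal(0, 0.1, (16, 16))
b     = np.zeros(16)
W_out = rng.normal(0, 0.1, (4, 16))

outs, final_h = rnn_forward(seq, W_in, W_hid, b, W_out)
print(outs.shape, final_h.shape)      # (20, 4) (16,)
```

Note that the loop handles any sequence length with the same parameters, which is the variable-length property discussed below.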
The Vanishing Gradient Problem¶
Basic RNNs have a critical flaw: information fades over time.
During training, gradients must flow backward through time. At each step, they get multiplied by weights. If weights are less than 1, gradients shrink exponentially. After 50 or 100 steps, gradients are effectively zero.
Result: basic RNNs can only learn short-range dependencies. They forget distant past, even when it's crucial.
LSTM: Long Short-Term Memory¶
LSTMs solve the vanishing gradient problem with a gated architecture:
┌─────────────────────────────────────────┐
│                LSTM Cell                │
│                                         │
│   ┌──────┐    ┌──────┐    ┌──────┐      │
│   │Forget│    │Input │    │Output│      │
│   │ Gate │    │ Gate │    │ Gate │      │
│   └──────┘    └──────┘    └──────┘      │
│      │           │           │          │
│   ┌─────────────────────────────┐       │
│   │         Cell State          │ ──────┼── (memory highway)
│   └─────────────────────────────┘       │
│                                         │
└─────────────────────────────────────────┘
Forget gate: Decides what to discard from cell state. "The transit event is over, forget those details."
Input gate: Decides what new information to store. "This brightness spike is important, remember it."
Output gate: Decides what to output based on cell state. "Based on everything seen, output this classification."
Cell state: The memory highway. Information can flow unchanged across many time steps. Gradients flow through without multiplication by weights.
The mathematics:
forget = sigmoid(W_f × [hidden[t-1], input[t]] + b_f)
input_gate = sigmoid(W_i × [hidden[t-1], input[t]] + b_i)
candidate = tanh(W_c × [hidden[t-1], input[t]] + b_c)
cell[t] = forget × cell[t-1] + input_gate × candidate
output_gate = sigmoid(W_o × [hidden[t-1], input[t]] + b_o)
hidden[t] = output_gate × tanh(cell[t])
The gates are sigmoid functions outputting values between 0 and 1, acting as soft switches.
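The six equations translate line for line into NumPy. A minimal sketch with the four gate weight matrices packed into one array (sizes are illustrative, weights random):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step implementing the equations above.

    W packs the four gate matrices into shape (4H, H + D); b the biases.
    """
    H = h_prev.size
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])            # forget gate: what to discard
    i = sigmoid(z[H:2*H])          # input gate: what to store
    g = np.tanh(z[2*H:3*H])        # candidate values
    o = sigmoid(z[3*H:4*H])        # output gate
    c = f * c_prev + i * g         # cell state: the memory highway
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(2)
H, D = 8, 3                        # hidden size, input size
W = rng.normal(0, 0.1, (4*H, H + D))
b = np.zeros(4*H)

h = c = np.zeros(H)
for x_t in rng.normal(size=(30, D)):   # run over a 30-step sequence
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

Notice that the cell state update is purely additive gating (f × old + i × new), with no weight matrix in the path, which is why gradients survive many steps.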
GRU: Gated Recurrent Unit¶
A simplified gating mechanism, often performing comparably to LSTM with fewer parameters:
reset = sigmoid(W_r × [hidden[t-1], input[t]])
update = sigmoid(W_u × [hidden[t-1], input[t]])
candidate = tanh(W × [reset × hidden[t-1], input[t]])
hidden[t] = (1 - update) × hidden[t-1] + update × candidate
Two gates instead of three. Often faster to train with similar performance.
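For comparison, one GRU step in the same style (biases omitted for brevity; sizes illustrative, weights random):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wr, Wu, Wc):
    """One GRU step following the equations above."""
    hx = np.concatenate([h_prev, x_t])
    reset  = sigmoid(Wr @ hx)                       # how much past to consult
    update = sigmoid(Wu @ hx)                       # how much to overwrite
    cand   = np.tanh(Wc @ np.concatenate([reset * h_prev, x_t]))
    return (1 - update) * h_prev + update * cand    # blend old state, candidate

rng = np.random.default_rng(3)
H, D = 8, 3
Wr = rng.normal(0, 0.1, (H, H + D))
Wu = rng.normal(0, 0.1, (H, H + D))
Wc = rng.normal(0, 0.1, (H, H + D))

h = np.zeros(H)
for x_t in rng.normal(size=(30, D)):
    h = gru_step(x_t, h, Wr, Wu, Wc)
print(h.shape)
```

The final line makes the parameter saving visible: there is no separate cell state to carry, just one hidden vector blended between old and new.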
Bidirectional RNNs¶
Sometimes context from the future helps interpret the present. Bidirectional RNNs process sequences both forward and backward:
Forward:  input[0] → input[1] → input[2] → ... → input[T]
             ↓          ↓          ↓                ↓
          hidden_f[0] hidden_f[1] hidden_f[2] ... hidden_f[T]

Backward: input[0] ← input[1] ← input[2] ← ... ← input[T]
             ↓          ↓          ↓                ↓
          hidden_b[0] hidden_b[1] hidden_b[2] ... hidden_b[T]

Combined: [hidden_f[t], hidden_b[t]] for each t
Each position gets context from both past and future. Useful when you have the complete sequence before processing.
Sequence-to-Sequence Architectures¶
For tasks where input and output are both sequences, use encoder-decoder architectures:
Encoder: Processes input sequence, produces summary hidden state
Decoder: Takes summary, generates output sequence
Input sequence → [Encoder RNN] → Summary State → [Decoder RNN] → Output sequence
This architecture underlies machine translation, summarization, and can be adapted for time-series forecasting.
Strengths¶
Natural for sequences: Explicitly models temporal dependencies. Hidden state carries information across time.
Variable length: Unlike feedforward networks, RNNs handle sequences of any length.
Parameter efficiency: Same weights used at every time step. A 100-step sequence doesn't need 100× the parameters.
Interpretable dynamics: Hidden state evolution can be analyzed. What is the network remembering?
Weaknesses¶
Sequential computation: Can't parallelize across time steps. Each step waits for the previous. Training and inference are slower than parallelizable architectures.
Long-range dependencies: Even LSTMs struggle with very long sequences (hundreds to thousands of steps). Information still fades, just more slowly.
Training instability: RNNs can suffer from exploding gradients. Requires careful initialization and gradient clipping.
Superseded by transformers: For many tasks, transformers achieve better performance with easier training. RNNs are less dominant than they once were.
For Your Telescope Array¶
Good for: Light curves. Time-series data. Sequential observations. Any data where temporal order matters.
Specific applications:
Light curve classification: An LSTM processes a sequence of brightness measurements, classifying the variable star type, detecting transients, or identifying periodic behavior.
Light curve: [mag[0], mag[1], mag[2], ..., mag[T]]
                ↓        ↓        ↓            ↓
             [LSTM] → [LSTM] → [LSTM] → ... → [LSTM]
                                                ↓
                                         Classification
Transient detection in time series: RNN monitors brightness sequence, outputs probability of transient at each time step. Alert when probability exceeds threshold.
Predictive modeling: Given recent conditions (weather, seeing, performance), predict near-future conditions for scheduling.
Anomaly detection in sequences: Train LSTM to predict next value in normal sequences. Large prediction errors indicate anomalies.
State tracking: RNN maintains hidden state representing current system status, updated with each new observation or event.
Example architecture for light curve classification:
Input: sequence of (time, magnitude, error) tuples, variable length
Embedding: Dense(64) applied to each time step
Output: sequence of 64-dimensional vectors
Bidirectional LSTM(128 units)
Output: sequence of 256-dimensional vectors (128 forward + 128 backward)
Attention layer (or just take final hidden state)
Output: 256-dimensional vector
Dense(128) → ReLU → Dropout(0.3)
Dense(64) → ReLU → Dropout(0.3)
Dense(num_classes) → Softmax
Output: class probabilities
Transformers: Attention is All You Need¶
What They Are¶
Transformers process sequences without recurrence. Instead of maintaining hidden state, they use attention mechanisms to directly relate any element to any other element.
The Core Insight¶
RNNs process sequences step by step. Information from early steps must pass through many intermediate steps to affect later processing. This creates bottlenecks.
Transformers skip the middleman. Every position can directly attend to every other position. Information flows directly between any pair of elements.
Self-Attention: The Key Mechanism¶
Self-attention computes relationships between all pairs of positions in a sequence.
For each position, create three vectors:
- Query (Q): "What am I looking for?"
- Key (K): "What do I have to offer?"
- Value (V): "What information do I carry?"
Attention score between position i and position j:
score[i,j] = Q[i] · K[j] / sqrt(d_k)
The dot product measures similarity. Division by sqrt(d_k) (dimension of keys) prevents scores from growing too large.
Apply softmax to get attention weights:
weights[i] = softmax(scores[i]) # weights[i] sums to 1
Output for position i is weighted sum of values:
output[i] = Σⱼ weights[i,j] × V[j]
Each position's output incorporates information from all other positions, weighted by relevance.
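The whole mechanism is a few matrix products and a softmax. A minimal single-head NumPy sketch (dimensions illustrative, projection weights random):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (T, D)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)          # (T, T): every pair of positions
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ V, weights              # each output mixes all values

rng = np.random.default_rng(4)
T, D, d = 6, 16, 8                            # sequence length, model dim, head dim
X  = rng.normal(size=(T, D))
Wq = rng.normal(0, 0.1, (D, d))
Wk = rng.normal(0, 0.1, (D, d))
Wv = rng.normal(0, 0.1, (D, d))

out, attn = self_attention(X, Wq, Wk, Wv)
print(out.shape, attn.shape)                  # (6, 8) (6, 6)
```

The (T, T) attention matrix is also where the quadratic cost discussed later comes from: every position scores against every other.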
Multi-Head Attention¶
A single attention mechanism learns one type of relationship. Multi-head attention runs several attention mechanisms in parallel:
Head 1: Q₁, K₁, V₁ → output₁
Head 2: Q₂, K₂, V₂ → output₂
...
Head N: Qₙ, Kₙ, Vₙ → outputₙ
Concatenate: [output₁, output₂, ..., outputₙ]
Project: W_o × concatenated
Different heads learn different relationships:
- Head 1 might attend to nearby positions
- Head 2 might attend to similar values
- Head 3 might attend to periodically related positions
The Transformer Block¶
A complete transformer block:
Input
  ↓
Multi-Head Self-Attention
  ↓
Add (residual connection) + Layer Normalization
  ↓
Feed-Forward Network (two dense layers)
  ↓
Add (residual connection) + Layer Normalization
  ↓
Output
Stack many blocks (6, 12, 24, or more in large models).
Residual connections let gradients flow directly through the network, enabling very deep architectures.
Positional Encoding¶
Self-attention is permutation-invariant: it doesn't inherently know that position 1 comes before position 2. Order information must be added explicitly.
Sinusoidal encoding (original transformer):
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Different frequencies for different dimensions. Positions get unique signatures, and relative positions can be computed from these encodings.
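The sinusoidal formulas vectorize directly. A minimal NumPy sketch (max_len and d_model are illustrative choices):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings from the formulas above: one row per position."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))   # one frequency per dim pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=32)
print(pe.shape)          # (50, 32), every value in [-1, 1]
```

These vectors are simply added to the input embeddings before the first transformer block, which is how order information enters an otherwise permutation-invariant model.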
Learned encodings: Just learn a vector for each position. Works well when maximum sequence length is known.
Encoder-Decoder Transformers¶
For sequence-to-sequence tasks:
Encoder: Self-attention sees entire input. Each position attends to all input positions.
Decoder: Self-attention is masked (positions can only attend to earlier positions, not future). Cross-attention lets decoder positions attend to encoder outputs.
Input Sequence → [Encoder Stack] → Encoded Representations
                                           ↓
                  [Decoder Stack with Cross-Attention] → Output Sequence
Encoder-Only (BERT-style)¶
For tasks where you need to understand the input but not generate sequences:
Input → [Transformer Encoder] → Representations → Task-specific head
BERT, RoBERTa, and similar models use this pattern. Fine-tune for classification, extraction, or other tasks.
Decoder-Only (GPT-style)¶
For generation tasks:
Context → [Transformer Decoder] → Next token prediction
GPT models use this pattern. The model predicts the next element based on all previous elements.
Vision Transformers (ViT)¶
Transformers for images:
- Split image into patches (e.g., 16×16 pixels each)
- Flatten each patch into a vector
- Add position encodings
- Process with standard transformer
Image → [Split into patches] → [Linear embedding] → [Add position] → [Transformer] → [Classification head]
This treats an image as a sequence of patches, letting attention learn spatial relationships.
Strengths¶
Parallelization: Unlike RNNs, all positions can be computed simultaneously. Training is much faster on GPUs.
Long-range dependencies: Every position directly attends to every other. No information bottleneck.
Scalability: Transformers scale well. Larger models, more data, and more compute generally mean better performance.
State-of-the-art: Transformers dominate language, increasingly dominate vision, and excel in many domains.
Flexibility: Same architecture works for language, images, audio, and more with minimal modification.
Weaknesses¶
Quadratic complexity: Self-attention compares all pairs of positions. For sequence length N, complexity is O(N²). Very long sequences become expensive.
Data hungry: Transformers typically need more training data than CNNs or RNNs to achieve good performance.
Compute hungry: Large transformers require substantial GPU resources for training and inference.
Position encoding limitations: Learned position encodings don't generalize beyond training length. Sinusoidal encodings help but aren't perfect.
Less inductive bias: Transformers make fewer assumptions about data structure. This flexibility means they need to learn structure from data rather than having it built in.
For Your Telescope Array¶
Good for: Complex sequences where long-range dependencies matter. Multi-modal data fusion. Tasks where CNNs or RNNs underperform.
Specific applications:
Advanced light curve analysis: Transformers can capture long-range periodicity, complex variability patterns, and subtle correlations that RNNs miss.
Multi-site data fusion: Treat observations from different sites as sequence elements. Attention learns which observations to weight more heavily, how to combine information across sites.
[Obs_Site_A, Obs_Site_B, Obs_Site_C, ...] → [Transformer] → Fused Representation
Catalog cross-matching: Given entries from multiple catalogs, transformer attention learns which entries correspond to the same object.
Vision Transformer for images: For challenging image classification tasks where CNNs plateau, ViT might push further (with sufficient data).
Multimodal understanding: Combine image features and light curve features in a single transformer. Attention learns relationships between visual appearance and temporal behavior.
Example architecture for multi-site data fusion:
Inputs: Observations from N sites, each represented as a vector
[obs_1, obs_2, ..., obs_N] where obs_i includes: image embedding, quality metrics, timestamp, site ID embedding
Positional encoding: Site embeddings rather than sequence positions
Transformer Encoder (4 layers, 8 attention heads, 256 dimensions)
Each observation attends to all others
Learns which sites to weight, how to combine
Global pooling or CLS token
Output: Fused representation
Task heads:
- Classification head: Dense → class probabilities
- Quality estimation head: Dense → expected quality of combined result
- Uncertainty head: Dense → confidence bounds
Autoencoders: Learning Compression¶
What They Are¶
Networks that learn to compress data to a smaller representation, then reconstruct the original. Not for prediction, but for representation learning.
The Core Insight¶
If a network can compress data to a small representation and reconstruct it accurately, that small representation must capture the essential information. What's lost is presumably noise or irrelevant detail.
Architecture¶
Input → [Encoder] → Bottleneck (small) → [Decoder] → Reconstruction
High-dimensional     Low-dimensional        High-dimensional
input                code/latent            output
Encoder: Compresses input to bottleneck. Typically uses convolutions (for images) or dense layers.
Bottleneck: The compressed representation. Much smaller than input (e.g., 256×256 image → 128 numbers).
Decoder: Reconstructs input from bottleneck. Mirror of encoder architecture.
Loss: Reconstruction error, typically mean squared error between input and output.
Variational Autoencoders (VAEs)¶
Standard autoencoders learn a deterministic mapping. VAEs learn a probabilistic one.
Instead of encoding to a single point, VAE encodes to a distribution (mean and variance):
Input → [Encoder] → (μ, σ) → Sample z ~ N(μ, σ) → [Decoder] → Reconstruction
Loss includes:
- Reconstruction error
- KL divergence between learned distribution and prior (regularizes latent space)
VAEs have smoother latent spaces. You can sample from the prior and generate realistic outputs.
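In practice the encoder outputs μ and log σ², and sampling uses the reparameterization trick so gradients can flow through the random step. A minimal NumPy sketch; the μ and log-variance values here are hypothetical stand-ins for encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical encoder outputs for one input (4-dimensional latent space)
mu      = np.array([0.5, -1.2, 0.0, 2.0])
log_var = np.array([-0.5, 0.1, -1.0, 0.3])

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1),
# so the sample is a differentiable function of mu and sigma.
sigma = np.exp(0.5 * log_var)
eps = rng.standard_normal(mu.shape)
z = mu + sigma * eps                     # latent sample fed to the decoder

# KL divergence of N(mu, sigma^2) from the N(0, 1) prior (closed form);
# this is the regularization term added to the reconstruction loss.
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
print(z.shape, float(kl))
```

Generation then amounts to skipping the encoder entirely: draw z ~ N(0, 1) and decode it.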
Uses of Autoencoders¶
Dimensionality reduction: The bottleneck representation is a compressed version of input. Useful for visualization, clustering, or as input to other models.
Denoising: Train autoencoder on noisy inputs with clean targets. It learns to remove noise.
Anomaly detection: Train on normal data. Anomalies reconstruct poorly (high error).
Generation: VAEs (and related models) can generate new samples by decoding random latent vectors.
Strengths¶
Unsupervised: Don't need labels. Just need examples of normal data.
Representation learning: Learn useful features without explicit supervision.
Anomaly detection: Natural fit for finding unusual objects.
Compression: Learned compression can outperform hand-designed methods.
Weaknesses¶
Reconstruction focus: Optimizing reconstruction might not produce representations useful for downstream tasks.
Mode collapse: Can learn to ignore some input variation, reconstructing only "average" outputs.
Blurry outputs: VAEs in particular tend to produce blurry reconstructions, averaging over uncertainty.
Hyperparameter sensitivity: Bottleneck size, architecture choices significantly affect results.
For Your Telescope Array¶
Good for: Anomaly detection. Data compression. Finding unusual objects. Learning representations without labels.
Specific applications:
Anomaly detection: Train autoencoder on normal telescope images. High reconstruction error flags unusual images for human review.
Training: Normal images → Autoencoder → Minimize reconstruction error
Deployment: New image → Autoencoder → Measure reconstruction error
If error > threshold: Flag as anomalous
Compression for transmission: Train autoencoder to compress images. Send only bottleneck codes from remote sites, decode centrally. Lossy but much smaller.
Unknown object discovery: Cluster objects in latent space. Objects far from known clusters might be new types.
Quality-aware compression: Train autoencoder with quality-weighted loss. Preserve important regions (sources) more than background.
Example anomaly detection system:
Convolutional Autoencoder:
Encoder:
Conv(32, 3×3) → ReLU → Pool(2×2)    # 256 → 128
Conv(64, 3×3) → ReLU → Pool(2×2)    # 128 → 64
Conv(128, 3×3) → ReLU → Pool(2×2)   # 64 → 32
Conv(256, 3×3) → ReLU → Pool(2×2)   # 32 → 16
Flatten → Dense(512) → Dense(128) → Bottleneck
Decoder (mirror of encoder):
Dense(512) → Dense(16×16×256) → Reshape
Upsample(2×2) → Conv(128, 3×3) → ReLU   # 16 → 32
Upsample(2×2) → Conv(64, 3×3) → ReLU    # 32 → 64
Upsample(2×2) → Conv(32, 3×3) → ReLU    # 64 → 128
Upsample(2×2) → Conv(1, 3×3) → Output   # 128 → 256
Loss: Mean squared error
Anomaly score: Reconstruction error per image
Threshold: Set from validation data to achieve desired false positive rate
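Setting the threshold from validation errors is a one-line percentile computation. A minimal sketch using synthetic stand-in errors (a real system would use reconstruction errors from the trained autoencoder on held-out normal images):

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-ins for per-image reconstruction errors on normal validation data
val_errors = rng.gamma(shape=2.0, scale=0.01, size=5000)

# Target a 1% false-positive rate on normal data: take the 99th percentile
# of normal-image errors as the alert threshold.
threshold = np.percentile(val_errors, 99)

def is_anomalous(reconstruction_error):
    return reconstruction_error > threshold

flagged = np.mean(val_errors > threshold)
print(f"threshold={threshold:.4f}, flagged on normal data={flagged:.1%}")
```

Tightening the percentile trades fewer false alarms for a higher chance of missing genuine anomalies; the right balance depends on how much human review time is available.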
Graph Neural Networks: Relational Intelligence¶
What They Are¶
Networks designed for data naturally represented as graphs: nodes connected by edges. Where CNNs exploit spatial structure and RNNs exploit temporal structure, GNNs exploit relational structure.
The Core Insight¶
Many astronomical phenomena involve relationships:
- Stars in clusters are related
- Galaxies in groups interact
- Observations of the same object are connected
- Telescope sites share information
Graphs naturally represent these relationships. GNNs learn to use relational structure.
Graph Representation¶
A graph consists of:
- Nodes: Entities (stars, galaxies, observations, telescopes)
- Edges: Relationships between nodes (physical proximity, causal connection, same object)
- Node features: Attributes of each node (brightness, color, position)
- Edge features: Attributes of each relationship (distance, time difference, strength)
Message Passing: The Core Operation¶
GNNs work by passing messages between connected nodes:
For each node:
1. Gather messages from neighbors
2. Aggregate messages (sum, mean, max, or learned aggregation)
3. Update node representation based on current state + aggregated messages
After several rounds of message passing, each node's representation incorporates information from its neighborhood.
Round 1: Each node knows about immediate neighbors
Round 2: Each node knows about neighbors-of-neighbors
Round 3: Information from 3-hop neighborhood
...
Mathematical Formulation¶
Basic message passing:
m[i] = Aggregate({h[j] : j ∈ Neighbors(i)})
h'[i] = Update(h[i], m[i])
Where:
- h[i] is node i's representation
- m[i] is aggregated message for node i
- Aggregate is a permutation-invariant function (sum, mean, max)
- Update combines current state with message (typically neural network)
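One round of message passing with mean aggregation fits in a few lines. A minimal NumPy sketch on a toy 5-node graph (weights random, sizes illustrative):

```python
import numpy as np

def message_passing_round(h, adj, W_self, W_msg):
    """One round of the scheme above: mean-aggregate neighbors, then update."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    m = (adj @ h) / deg                        # steps 1-2: gather + mean aggregate
    return np.tanh(h @ W_self + m @ W_msg)     # step 3: update from state + message

rng = np.random.default_rng(7)

# A toy 5-node graph (think: five observations of neighboring objects)
adj = np.array([[0, 1, 1, 0, 0],
                [1, 0, 1, 0, 0],
                [1, 1, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [0, 0, 0, 1, 0]], float)

h = rng.normal(size=(5, 8))                    # 8 features per node
W_self = rng.normal(0, 0.3, (8, 8))
W_msg  = rng.normal(0, 0.3, (8, 8))

for _ in range(3):                             # 3 rounds -> 3-hop neighborhoods
    h = message_passing_round(h, adj, W_self, W_msg)
print(h.shape)
```

Because the aggregation is a mean, the update is permutation-invariant in the neighbors, and the same two weight matrices apply to every node regardless of its degree.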
Common architectures:
Graph Convolutional Network (GCN):
H' = σ(D^(-1/2) A D^(-1/2) H W)
Where A is adjacency matrix, D is degree matrix, H is node features, W is learnable weights.
Graph Attention Network (GAT): Use attention to weight neighbor contributions differently.
GraphSAGE: Sample and aggregate neighbors, enabling mini-batch training on large graphs.
Strengths¶
Natural for relational data: Directly encodes relationships. No need to flatten graph structure into vectors.
Flexible structure: Works on graphs of any size and topology. Adapts to varying numbers of neighbors.
Inductive: Can generalize to unseen nodes/graphs if features are meaningful.
Combines information: Learns how to aggregate information from related entities.
Weaknesses¶
Scalability: Very large graphs (millions of nodes) require sophisticated sampling or approximation.
Oversmoothing: Many message-passing rounds make all node representations similar. Deep GNNs are harder to train.
Edge definition: Results depend on how you define graph structure. Wrong edges hurt performance.
Less mature: GNNs are newer than CNNs/RNNs. Fewer established best practices.
For Your Telescope Array¶
Good for: Modeling relationships between objects, sites, or observations. Catalog analysis. Network coordination.
Specific applications:
Star cluster analysis: Nodes are stars, edges connect probable cluster members. GNN learns cluster membership, identifies interlopers.
Galaxy group finding: Nodes are galaxies, edges from proximity or velocity similarity. GNN identifies group memberships, predicts properties.
Multi-observation fusion: Nodes are observations of the same target (different times, sites, instruments). Edges connect same-object observations. GNN learns optimal combination.
Graph structure:
- Nodes: Individual observations
- Edges: Same object, temporal proximity, or spatial proximity
- Node features: Measurement values, quality metrics, metadata
- Edge features: Time difference, site pair, conditions similarity
GNN:
- Message passing learns how to weight and combine observations
- Output: Fused estimate for each unique object
Telescope network optimization: Nodes are telescope sites, edges connect sites with complementary capabilities. GNN learns coordination patterns, recommends resource allocation.
Anomaly detection in context: When detecting anomalies, consider relationships. A star that's anomalous in isolation might be normal given its cluster context. GNN incorporates context.
Example architecture for multi-observation fusion:
Graph construction:
- For each unique object, create nodes for all observations
- Connect observations with edges (fully connected or based on relevance)
Node features (per observation):
- Measured values (magnitudes, colors, etc.)
- Uncertainty estimates
- Observation quality metrics
- Site identifier (embedded)
- Time of observation
Edge features:
- Time difference
- Site pair identifier
- Condition similarity score
GNN architecture:
- GraphSAGE with 3 message-passing layers
- Hidden dimension: 128
- Aggregation: attention-weighted mean
After message passing:
- Global pooling across all nodes for this object
- Dense layers for final estimate
Output:
- Fused measurement estimate
- Uncertainty bounds
- Outlier flags for individual observations
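The graph-construction step of this design can be sketched in plain Python. The field names and values below are illustrative only; the GNN layers themselves would come from a library such as PyTorch Geometric:

```python
import itertools

# Hypothetical observations of one unique object (illustrative fields).
observations = [
    {"obs_id": 0, "site": "A", "time": 100.0, "mag": 14.21, "mag_err": 0.03},
    {"obs_id": 1, "site": "B", "time": 100.4, "mag": 14.25, "mag_err": 0.05},
    {"obs_id": 2, "site": "A", "time": 102.1, "mag": 14.19, "mag_err": 0.02},
]

# Nodes: one per observation, with features from the measurement record.
nodes = {o["obs_id"]: (o["mag"], o["mag_err"]) for o in observations}

# Edges: fully connected within the object, with time difference and
# site pair carried as edge features.
edges = []
for a, b in itertools.combinations(observations, 2):
    edges.append({
        "pair": (a["obs_id"], b["obs_id"]),
        "dt": abs(a["time"] - b["time"]),
        "sites": (a["site"], b["site"]),
    })
```

For n observations, the fully connected construction yields n(n-1)/2 edges, so for objects with very many observations you would switch to the relevance-based edges mentioned above.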
Generative Models: Creating New Data¶
What They Are¶
Models that learn to generate new samples resembling training data. Instead of classifying or predicting, they create.
Generative Adversarial Networks (GANs)¶
Two networks in competition:
Generator: Takes random noise, produces fake samples.
Discriminator: Tries to distinguish real from fake samples.
Training is adversarial:
- The discriminator improves at detecting fakes
- The generator improves at fooling the discriminator
- At equilibrium, the generator produces samples the discriminator cannot distinguish from real data
Random noise z → [Generator] → Fake sample
                                    ↓
                            [Discriminator] → Real or Fake?
                                    ↑
                               Real sample
Loss functions:
Discriminator: maximize log(D(real)) + log(1 - D(G(z)))
Generator: maximize log(D(G(z))) (or minimize log(1 - D(G(z))))
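These objectives translate directly into scalar loss functions of the discriminator's outputs. A minimal sketch (written as losses to minimize, i.e. the negated objectives):

```python
import math

def discriminator_loss(d_real, d_fake):
    # Negative of log D(real) + log(1 - D(fake)); minimizing this
    # maximizes the discriminator objective above.
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    # Non-saturating form: minimize -log D(G(z)).
    return -math.log(d_fake)

# At equilibrium the optimal discriminator outputs 0.5 everywhere:
d_eq = discriminator_loss(0.5, 0.5)   # 2 * ln 2
g_eq = generator_loss(0.5)            # ln 2
```

The non-saturating generator loss is preferred in practice because minimizing log(1 - D(G(z))) gives vanishing gradients early in training, when the discriminator easily rejects fakes.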
Diffusion Models¶
Currently state-of-the-art for image generation.
Forward process: Gradually add noise to real data until it's pure noise. Reverse process: Learn to gradually remove noise, recovering data from noise.
Real image → [Add noise] → [Add noise] → ... → Pure noise
Pure noise → [Denoise] → [Denoise] → ... → Generated image
The denoising network learns to predict and remove noise at each step. Many small denoising steps produce high-quality samples.
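The forward (noising) process has a convenient closed form in one common parameterization (the DDPM formulation), letting you jump straight to any noise level. A sketch:

```python
import numpy as np

def forward_noise(x0, alpha_bar, rng):
    """DDPM closed-form forward step:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    where alpha_bar_t decreases from ~1 toward 0 as t grows."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps   # the denoiser is trained to predict eps from (x_t, t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))   # a tiny stand-in "image"

x_early, _ = forward_noise(x0, alpha_bar=0.99, rng=rng)  # barely noised
x_late, _ = forward_noise(x0, alpha_bar=0.01, rng=rng)   # almost pure noise
```

Training then amounts to sampling a random noise level, noising a real image with this formula, and asking the network to recover the injected eps.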
Uses in Astronomy¶
Data augmentation: Generate synthetic training examples, especially for rare classes.
Simulation: Generate realistic synthetic observations to test pipelines.
Super-resolution: Generate high-resolution images from low-resolution inputs.
Inpainting: Fill in missing or corrupted regions of images.
Conditional generation: Generate images matching specific properties (galaxy with certain morphology, star with certain spectrum).
For Your Telescope Array¶
Specific applications:
Training data generation: Have few examples of rare transients? Train a generative model on what you have, generate more for classifier training.
Pipeline testing: Generate realistic synthetic observations to stress-test processing pipelines before real data arrives.
Data recovery: Inpaint satellite trails, cosmic rays, or bad pixels in otherwise good observations.
Prediction: Given current conditions and recent observations, generate predictions of what observations will look like in the near future.
Architecture Selection Guide for Your Project¶
Let me be concrete about which architecture to use for each component of your distributed telescope system.
At Individual Telescope Sites¶
| Task | Architecture | Rationale |
|---|---|---|
| Frame quality assessment | Lightweight CNN | Fast inference, spatial patterns matter, proven performance |
| Real-time transient detection | CNN + threshold | Need speed, looking for spatial signatures |
| Basic source detection | U-Net (CNN variant) | Semantic segmentation task, well-established |
| Quick classification | Small CNN or feedforward from features | Speed critical, accuracy secondary |
| Equipment anomaly detection | Autoencoder | Unsupervised, learns normal behavior |
At Central Coordination¶
| Task | Architecture | Rationale |
|---|---|---|
| Deep image classification | ResNet/EfficientNet CNN or ViT | Accuracy matters, have compute resources |
| Light curve classification | Transformer or LSTM | Sequential data with long-range dependencies |
| Multi-site data fusion | Transformer or GNN | Relating multiple inputs, flexible attention |
| Scheduling optimization | Reinforcement learning (various) | Sequential decision-making |
| Catalog cross-matching | GNN or Transformer | Relational structure matters |
| Anomaly detection at scale | Autoencoder + clustering | Find unknowns in large datasets |
| Multi-modal analysis | Transformer | Naturally handles multiple input types |
Decision Flowchart¶
Is your data...?
├── Images (2D spatial)
│   ├── Classification/detection → CNN (ResNet, EfficientNet)
│   ├── Segmentation → U-Net, DeepLab
│   ├── Very complex patterns → Vision Transformer (if enough data)
│   └── Need speed → MobileNet, lightweight CNN
│
├── Sequences (time series)
│   ├── Short sequences (<100 steps) → LSTM or GRU
│   ├── Long sequences (>100 steps) → Transformer
│   ├── Real-time streaming → LSTM with online updates
│   └── Bidirectional context available → Bidirectional LSTM or Transformer
│
├── Tabular (features/measurements)
│   ├── Clear features → XGBoost/LightGBM (often beats neural networks)
│   ├── Need neural network → Feedforward
│   └── Complex interactions → Feedforward with more layers
│
├── Graph (relational)
│   └── Use GNN (GraphSAGE, GAT)
│
├── Multiple modalities (images + sequences + tabular)
│   └── Transformer (or separate encoders feeding shared transformer)
│
└── Unlabeled data
    ├── Want compression/representation → Autoencoder
    ├── Want anomaly detection → Autoencoder or isolation forest
    └── Want to generate samples → GAN or diffusion model
Hybrid Architectures for Your System¶
Real systems often combine architectures:
CNN + LSTM for video or image sequences:
Frame 1 → [CNN] → features[1] ─┐
Frame 2 → [CNN] → features[2] ─┼→ [LSTM] → Sequence classification
Frame 3 → [CNN] → features[3] ─┘
Use CNN to extract per-frame features, LSTM to model temporal evolution.
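A shape-level sketch of this pattern, with a fixed projection standing in for the (trained) CNN and a bare recurrence standing in for the LSTM:

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.standard_normal((3, 32, 32))   # 3 frames of 32x32 pixels

# Stand-in for the CNN: a fixed projection from pixels to an 8-d
# feature vector, applied identically to every frame.
proj = rng.standard_normal((32 * 32, 8))
features = frames.reshape(3, -1) @ proj     # shape (3, 8)

# Stand-in for the LSTM: a simple recurrence carrying a hidden state
# across the frame sequence.
W_h = 0.1 * rng.standard_normal((8, 8))
W_x = 0.1 * rng.standard_normal((8, 8))
h = np.zeros(8)
for x in features:                          # temporal evolution
    h = np.tanh(h @ W_h + x @ W_x)

# h now summarizes the whole sequence; a classification head reads it out.
```

The key design point is weight sharing: one CNN processes every frame, so the sequence model only has to reason about how the per-frame features evolve.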
CNN + Transformer for multi-site fusion:
Site A image → [CNN] → embedding_A ─┐
Site B image → [CNN] → embedding_B ─┼→ [Transformer] → Fused result
Site C image → [CNN] → embedding_C ─┘
Use CNN to extract site-specific features, transformer to learn optimal combination.
Autoencoder + Classifier for semi-supervised learning:
Labeled + unlabeled data → [Autoencoder] → Latent representations
Latent representations + labels → [Classifier] → Predictions
Use autoencoder to learn representations from all data (including unlabeled), classifier on top using labels.
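A compact sketch of the two-stage pipeline. It uses PCA as a stand-in for the autoencoder (a linear autoencoder trained with MSE learns the same subspace) and a nearest-centroid classifier on top; both choices are placeholders for real trained models:

```python
import numpy as np

rng = np.random.default_rng(0)
X_all = rng.standard_normal((200, 20))   # labeled + unlabeled rows
y = (X_all[:40, 0] > 0).astype(int)      # labels exist for only 40 rows

# Stage 1: representation learning on ALL rows, labels not needed.
mean = X_all.mean(axis=0)
_, _, Vt = np.linalg.svd(X_all - mean, full_matrices=False)
encode = lambda X: (X - mean) @ Vt[:5].T  # 5-d latent codes

# Stage 2: a classifier fit on the latents of the labeled subset only
# (nearest-centroid here; any supervised model works).
Z = encode(X_all)
c0 = Z[:40][y == 0].mean(axis=0)
c1 = Z[:40][y == 1].mean(axis=0)
pred = (np.linalg.norm(Z - c1, axis=1) <
        np.linalg.norm(Z - c0, axis=1)).astype(int)
```

The payoff is that the representation is shaped by all 200 rows even though only 40 carry labels, which is exactly the situation with large unlabeled survey archives.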
Summary Comparison Table¶
| Architecture | Best For | Input Type | Strengths | Weaknesses | Your Use Cases |
|---|---|---|---|---|---|
| Feedforward | Tabular data, simple tasks | Fixed-size vectors | Simple, fast, universal | No structure awareness | Feature-based classification, final layers |
| CNN | Images, spatial data | 2D/3D grids | Translation invariance, hierarchical features | Fixed input size, local focus | Image quality, source detection, morphology |
| RNN/LSTM | Sequences, time series | Variable-length sequences | Temporal modeling, memory | Sequential (slow), limited range | Light curves, streaming data |
| Transformer | Long sequences, multi-modal | Any (with encoding) | Parallelizable, long-range, flexible | Quadratic complexity, data hungry | Complex light curves, data fusion |
| Autoencoder | Compression, anomaly detection | Any | Unsupervised, learns representations | Reconstruction-focused | Anomaly detection, compression |
| GNN | Relational data, graphs | Graphs | Models relationships | Scaling, oversmoothing | Cluster analysis, observation fusion |
| GAN/Diffusion | Data generation | Any | Creates new samples | Training instability | Data augmentation, simulation |
This should give you a complete understanding of how each architecture works, what it's suited for, and exactly where each fits into your distributed telescope network. The key is matching architecture assumptions to your data's structure and your task's requirements.