
Understanding ML Model Architectures for Your Distributed Telescope Array

Let me take you through every major architecture, how it actually works, and exactly where each fits into your telescope network.


The Fundamental Question: Why Different Architectures?

Before diving into specifics, understand why we have different architectures at all.

Data comes in different shapes:

Tabular data: Rows and columns, like a spreadsheet. Star catalogs with measurements. Each row is independent, columns are features.

Images: 2D grids of pixels. Your telescope frames. Nearby pixels are related. Spatial structure matters.

Sequences: Ordered data points. Light curves over time. What came before affects interpretation of what comes after.

Graphs: Networks of connected entities. Stars in clusters. Galaxies in groups. Relationships between objects matter.

Sets: Collections without order. Multiple observations of the same field. The set matters, not the sequence.

Each architecture embodies assumptions about data structure. Using the wrong architecture means fighting against its assumptions. Using the right architecture means the model naturally captures relevant patterns.


Feedforward Neural Networks: The Foundation

What They Are

The simplest neural network. Data flows in one direction: input to output, no loops, no memory.

Input Layer β†’ Hidden Layer 1 β†’ Hidden Layer 2 β†’ ... β†’ Output Layer

Each layer is fully connected to the next. Every neuron in layer N connects to every neuron in layer N+1.

How They Process Information

Imagine your input is a vector of 100 numbers representing measurements of a star: brightness in different filters, position, proper motion, and so on.

Layer 1 (say, 256 neurons): Each neuron computes a weighted sum of all 100 inputs, adds a bias, applies an activation function. You get 256 new numbers, each representing some combination of the original features.

Layer 2 (say, 128 neurons): Each neuron takes all 256 outputs from Layer 1, computes weighted sums, applies activation. Now you have 128 numbers representing combinations of combinations.

Output Layer (say, 5 neurons for 5 star types): Each neuron combines the 128 Layer 2 outputs. Apply softmax to get probabilities.

The key insight: each successive layer learns more abstract representations. Layer 1 might learn "this combination of colors indicates high temperature." Layer 2 might learn "high temperature plus this proper motion pattern suggests a certain stellar population."

Mathematical Formulation

For a single layer:

output = activation(weights Γ— input + bias)

Where:

  • input is a vector of N values
  • weights is a matrix of size (M Γ— N), where M is the number of neurons
  • bias is a vector of M values
  • activation is a nonlinear function applied element-wise
  • output is a vector of M values

Stacking layers:

h₁ = activation(W₁ Γ— input + b₁)
hβ‚‚ = activation(Wβ‚‚ Γ— h₁ + bβ‚‚)
h₃ = activation(W₃ Γ— hβ‚‚ + b₃)
output = softmax(Wβ‚„ Γ— h₃ + bβ‚„)

Strengths

Universality: Can approximate any continuous function to arbitrary accuracy given enough neurons. This is a mathematical guarantee (the universal approximation theorem).

Simplicity: Easy to implement, understand, debug. Training is straightforward.

Speed: Fast inference. No complex operations, just matrix multiplications.

Flexibility: Works on any fixed-size input. No structural assumptions beyond input dimension.

Weaknesses

No spatial awareness: Treats each input dimension independently of its neighbors. For images, pixel 1 and pixel 1000 are equally "distant" from the network's perspective, even if they're adjacent in the image.

No temporal awareness: Each input is processed independently. Can't learn that a brightness measurement depends on previous measurements.

Parameter explosion: For large inputs, fully-connected layers have enormous numbers of parameters. A 256Γ—256 image has 65,536 pixels. A single hidden layer of 1000 neurons would have 65 million parameters just for that layer.

No weight sharing: Patterns learned in one part of the input don't transfer to other parts. A galaxy in the corner of an image requires separate learning from a galaxy in the center.

For Your Telescope Array

Good for: Processing extracted features (not raw images). Tabular data from catalogs. Final classification layers after other architectures have extracted features.

Specific applications:

  • Classifying stars from catalog measurements (colors, proper motions, parallax)
  • Predicting observation quality from metadata (temperature, humidity, moon phase, elevation)
  • Combining high-level features from multiple sources for final decision-making
  • Quick assessment models where speed matters more than accuracy

Example scenario: You've extracted 50 features from a light curve (mean brightness, variance, periodicity measures, etc.). A feedforward network takes these 50 numbers and classifies the variable star type. The feature extraction handles temporal structure; the feedforward network handles the final classification.


Convolutional Neural Networks: Spatial Intelligence

What They Are

Networks designed for data with spatial structure, primarily images. Instead of connecting every input to every neuron, they use local connections and weight sharing.

The Core Insight

Images have two crucial properties feedforward networks ignore:

Locality: Relevant patterns are local. An edge is a few pixels. A star is a small region. You don't need to look at pixels 1000 apart simultaneously to detect these patterns.

Translation invariance: A spiral arm looks like a spiral arm regardless of where it appears in the image. Learning to recognize it in one location should transfer to all locations.

CNNs embody these assumptions through convolution operations.

How Convolution Works

A convolutional layer has small filters (also called kernels), typically 3Γ—3, 5Γ—5, or 7Γ—7 pixels.

Each filter slides across the entire image, computing a dot product at each position:

Image patch:        Filter:           Computation:
[a b c]            [w₁ wβ‚‚ w₃]        output = aΓ—w₁ + bΓ—wβ‚‚ + cΓ—w₃ +
[d e f]     Γ—      [wβ‚„ wβ‚… w₆]                 dΓ—wβ‚„ + eΓ—wβ‚… + fΓ—w₆ +
[g h i]            [w₇ wβ‚ˆ w₉]                 gΓ—w₇ + hΓ—wβ‚ˆ + iΓ—w₉

This single number represents "how much does this patch match this filter?"

Sliding the filter across all positions produces a feature map: a 2D grid showing where the pattern was detected.

Multiple Filters, Multiple Layers

A single convolutional layer has many filters (32, 64, 128 are common). Each learns to detect a different pattern.

Layer 1 filters learn simple patterns:

  • Horizontal edges
  • Vertical edges
  • Diagonal edges
  • Brightness gradients
  • Spots of various sizes

Layer 2 filters operate on Layer 1's output, learning combinations:

  • Corners (horizontal + vertical edges)
  • Curves (sequences of edge orientations)
  • Texture patterns
  • Ring-like structures

Layer 3 and beyond learn increasingly complex combinations:

  • Spiral arm signatures
  • Galaxy core patterns
  • Specific artifact shapes
  • Complex morphological features

This hierarchy emerges automatically from training. You don't specify "learn edges then corners then spirals." The network discovers this hierarchy because it's efficient for reducing classification error.

Pooling Operations

Between convolutional layers, pooling reduces spatial dimensions:

Max pooling: Take the maximum value in each small region

[1 3 2 4]
[5 6 1 2]  β†’ Max pool 2Γ—2 β†’  [6 4]
[3 2 1 0]                     [3 3]
[1 2 3 1]

Average pooling: Take the mean value in each region

Pooling provides:

  • Reduced computation for subsequent layers
  • Some translation invariance (small shifts don't change max values much)
  • Larger effective receptive field (later layers "see" more of the original image)

Receptive Fields

A crucial concept: how much of the original image influences a single neuron in a later layer?

Layer 1 neuron: Sees only its 3Γ—3 filter region. Receptive field = 9 pixels.

Layer 2 neuron: Takes input from Layer 1 neurons, each of which saw 3Γ—3. With 2Γ—2 pooling in between, each Layer 2 neuron effectively sees roughly 8Γ—8 of the original image.

Deep layer neuron: Might effectively see the entire image, but through a hierarchical lens.

This is why deep CNNs can learn global patterns while still using local operations: information propagates through the hierarchy.

Mathematical Formulation

For a 2D convolution:

output[i,j] = Ξ£β‚˜ Ξ£β‚™ input[i+m, j+n] Γ— filter[m,n] + bias

Where the sums run over the filter dimensions.

With multiple input channels (like RGB, or previous layer features):

output[i,j] = Ξ£_c Ξ£β‚˜ Ξ£β‚™ input[c, i+m, j+n] Γ— filter[c,m,n] + bias

Where c indexes input channels.
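The single-channel formula above translates directly into a (deliberately unvectorized) NumPy loop; "valid" sliding is assumed, so the output shrinks by the filter size minus one:

```python
import numpy as np

def conv2d(img, kernel, bias=0.0):
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product of the filter with the patch under it
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel) + bias
    return out

# A vertical-edge filter: responds where brightness changes left to right
edge = np.array([[-1.0, 0.0, 1.0]] * 3)
flat = np.ones((5, 5))
response = conv2d(flat, edge)  # zero everywhere: a flat image has no edges
```
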

Architecture Patterns

Standard CNN architectures follow patterns:

VGG pattern: Stack many 3Γ—3 convolutions. Simple but effective.

Conv3Γ—3 β†’ Conv3Γ—3 β†’ Pool β†’ Conv3Γ—3 β†’ Conv3Γ—3 β†’ Pool β†’ ... β†’ Dense β†’ Output

ResNet pattern: Add skip connections that let gradients flow directly through many layers.

input β†’ Conv β†’ Conv β†’ (+input) β†’ Conv β†’ Conv β†’ (+previous) β†’ ...

Skip connections solve the vanishing gradient problem, allowing very deep networks (50, 100, 150+ layers).

Inception/GoogLeNet pattern: Use multiple filter sizes in parallel, concatenate results.

input β†’ [1Γ—1 conv, 3Γ—3 conv, 5Γ—5 conv, pool] β†’ concatenate β†’ ...

This captures patterns at multiple scales simultaneously.

Strengths

Parameter efficiency: A 3Γ—3 filter has 9 parameters regardless of image size. Compared to feedforward networks, CNNs have far fewer parameters.

Translation equivariance: A pattern detected at position (10, 10) uses the same weights as detection at (100, 100). Learning transfers across positions.

Hierarchical feature learning: Automatically learns appropriate feature hierarchy for the task.

Proven architecture: Decades of refinement. Well-understood behavior. Extensive pre-trained models available.

Weaknesses

Fixed input size: Standard CNNs expect fixed image dimensions. Variable sizes require padding, cropping, or architectural changes.

Limited global awareness: Despite stacking layers, CNNs can struggle with patterns requiring true global context. A pattern depending on opposite corners remains hard.

Translation invariance can hurt: Sometimes position matters. The center of a galaxy image is semantically different from the edge. Pure CNNs don't distinguish.

No temporal understanding: Each image is processed independently. Sequential relationships require additional architecture.

For Your Telescope Array

Good for: Any image-based task. Quality assessment. Object detection. Galaxy classification. Artifact identification.

Specific applications:

Real-time quality assessment: A lightweight CNN at each telescope evaluates incoming frames. Input: single frame. Output: quality score and issue flags (clouds, tracking error, focus problem, etc.).

Source detection: Semantic segmentation CNNs identify every source in an image. Each pixel gets classified: background, star, galaxy, artifact, satellite trail.

Galaxy morphology: CNNs trained on Galaxy Zoo data classify galaxy types, identify features like bars, rings, spiral arms, merger signatures.

Transient detection: CNNs compare new images to references, classifying differences as real transients, artifacts, or noise.

Cross-site calibration: CNNs learn to map images from different sites to a common representation, normalizing site-specific effects.

Example architecture for your quality classifier:

Input: 256Γ—256 grayscale image

Block 1: Conv(32 filters, 3Γ—3) β†’ BatchNorm β†’ ReLU β†’ MaxPool(2Γ—2)
Output: 128Γ—128Γ—32

Block 2: Conv(64 filters, 3Γ—3) β†’ BatchNorm β†’ ReLU β†’ MaxPool(2Γ—2)
Output: 64Γ—64Γ—64

Block 3: Conv(128 filters, 3Γ—3) β†’ BatchNorm β†’ ReLU β†’ MaxPool(2Γ—2)
Output: 32Γ—32Γ—128

Block 4: Conv(256 filters, 3Γ—3) β†’ BatchNorm β†’ ReLU β†’ MaxPool(2Γ—2)
Output: 16Γ—16Γ—256

Global Average Pool: 256 values

Dense(128) β†’ ReLU β†’ Dropout(0.5)
Dense(3) β†’ Softmax

Output: probabilities for [good, medium, bad]
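A quick sanity check of the shapes above, assuming 3Γ—3 convolutions with "same" padding (so only the 2Γ—2 pooling changes the spatial size):

```python
def after_block(h, w, filters):
    # 3x3 conv with 'same' padding keeps h and w; 2x2 max pool halves them
    return h // 2, w // 2, filters

shape = (256, 256, 1)
for filters in (32, 64, 128, 256):
    h, w, _ = shape
    shape = after_block(h, w, filters)

# shape is now (16, 16, 256); global average pooling yields 256 values
```
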

Recurrent Neural Networks: Temporal Intelligence

What They Are

Networks designed for sequential data. They maintain internal state that persists across sequence elements, giving them a form of memory.

The Core Insight

Many phenomena unfold over time. A light curve isn't just a collection of brightness measurementsβ€”it's an ordered sequence where each measurement relates to those before and after.

Standard feedforward networks process each input independently. RNNs process sequences element by element, maintaining hidden state that captures what they've seen so far.

Basic RNN Operation

At each time step t:

hidden[t] = activation(W_input Γ— input[t] + W_hidden Γ— hidden[t-1] + bias)
output[t] = W_output Γ— hidden[t]

The key: hidden[t] depends on hidden[t-1]. Information flows forward through time.

input[0] β†’ [RNN Cell] β†’ hidden[0] β†’ output[0]
              ↓
input[1] β†’ [RNN Cell] β†’ hidden[1] β†’ output[1]
              ↓
input[2] β†’ [RNN Cell] β†’ hidden[2] β†’ output[2]
              ↓
             ...

The same weights are used at every time step. The only thing changing is the hidden state.
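The update rule above in NumPy form; note that the same `W_in`, `W_h`, and `b` are reused at every step, and only the hidden state `h` changes:

```python
import numpy as np

def rnn_step(x, h_prev, W_in, W_h, b):
    # hidden[t] = tanh(W_input @ input[t] + W_hidden @ hidden[t-1] + bias)
    return np.tanh(W_in @ x + W_h @ h_prev + b)

rng = np.random.default_rng(0)
hidden_size, input_size = 16, 3
W_in = rng.normal(size=(hidden_size, input_size)) * 0.1
W_h = rng.normal(size=(hidden_size, hidden_size)) * 0.1
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                     # initial state
sequence = rng.normal(size=(10, input_size))  # 10 time steps
for x in sequence:
    h = rnn_step(x, h, W_in, W_h, b)          # h carries information forward
```
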

The Vanishing Gradient Problem

Basic RNNs have a critical flaw: information fades over time.

During training, gradients must flow backward through time. At each step, they get multiplied by the recurrent weights. If those weight magnitudes are less than 1, gradients shrink exponentially. After 50 or 100 steps, gradients are effectively zero.

Result: basic RNNs can only learn short-range dependencies. They forget distant past, even when it's crucial.

LSTM: Long Short-Term Memory

LSTMs solve the vanishing gradient problem with a gated architecture:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                LSTM Cell                β”‚
β”‚                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”        β”‚
β”‚  β”‚Forgetβ”‚   β”‚Input β”‚   β”‚Outputβ”‚        β”‚
β”‚  β”‚ Gate β”‚   β”‚ Gate β”‚   β”‚ Gate β”‚        β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”˜        β”‚
β”‚      ↓          ↓          ↓           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”‚
β”‚  β”‚        Cell State          β”‚ ←──────┼── (memory highway)
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β”‚
β”‚                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Forget gate: Decides what to discard from cell state. "The transit event is over, forget those details."

Input gate: Decides what new information to store. "This brightness spike is important, remember it."

Output gate: Decides what to output based on cell state. "Based on everything seen, output this classification."

Cell state: The memory highway. Information can flow unchanged across many time steps. Gradients flow through without multiplication by weights.

The mathematics:

forget = sigmoid(W_f Γ— [hidden[t-1], input[t]] + b_f)
input_gate = sigmoid(W_i Γ— [hidden[t-1], input[t]] + b_i)
candidate = tanh(W_c Γ— [hidden[t-1], input[t]] + b_c)
cell[t] = forget Γ— cell[t-1] + input_gate Γ— candidate
output_gate = sigmoid(W_o Γ— [hidden[t-1], input[t]] + b_o)
hidden[t] = output_gate Γ— tanh(cell[t])

The gates are sigmoid functions outputting values between 0 and 1, acting as soft switches.
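The six equations above, transcribed into NumPy. This is a sketch for clarity; real implementations fuse the four weight matrices into one matrix multiply for speed:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # W stacks W_f, W_i, W_c, W_o; b stacks the matching biases
    z = np.concatenate([h_prev, x])        # [hidden[t-1], input[t]]
    W_f, W_i, W_c, W_o = W
    b_f, b_i, b_c, b_o = b
    forget = sigmoid(W_f @ z + b_f)        # what to discard from cell state
    input_gate = sigmoid(W_i @ z + b_i)    # what new information to store
    candidate = np.tanh(W_c @ z + b_c)
    cell = forget * c_prev + input_gate * candidate
    output_gate = sigmoid(W_o @ z + b_o)   # what to expose as hidden state
    hidden = output_gate * np.tanh(cell)
    return hidden, cell

rng = np.random.default_rng(0)
hid, inp = 8, 3
W = rng.normal(size=(4, hid, hid + inp)) * 0.1
b = np.zeros((4, hid))
h, c = np.zeros(hid), np.zeros(hid)
h, c = lstm_step(rng.normal(size=inp), h, c, W, b)
```
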

GRU: Gated Recurrent Unit

A simplified gating mechanism, often performing comparably to LSTM with fewer parameters:

reset = sigmoid(W_r Γ— [hidden[t-1], input[t]])
update = sigmoid(W_u Γ— [hidden[t-1], input[t]])
candidate = tanh(W Γ— [reset Γ— hidden[t-1], input[t]])
hidden[t] = (1 - update) Γ— hidden[t-1] + update Γ— candidate

Two gates instead of three. Often faster to train with similar performance.

Bidirectional RNNs

Sometimes context from the future helps interpret the present. Bidirectional RNNs process sequences both forward and backward:

Forward:  input[0] β†’ input[1] β†’ input[2] β†’ ... β†’ input[T]
                ↓         ↓         ↓              ↓
           hidden_f[0]  hidden_f[1] hidden_f[2] ... hidden_f[T]

Backward: input[0] ← input[1] ← input[2] ← ... ← input[T]
                ↓         ↓         ↓              ↓
           hidden_b[0]  hidden_b[1] hidden_b[2] ... hidden_b[T]

Combined: [hidden_f[t], hidden_b[t]] for each t

Each position gets context from both past and future. Useful when you have the complete sequence before processing.

Sequence-to-Sequence Architectures

For tasks where input and output are both sequences, use encoder-decoder architectures:

Encoder: Processes input sequence, produces summary hidden state

Decoder: Takes summary, generates output sequence

Input sequence β†’ [Encoder RNN] β†’ Summary State β†’ [Decoder RNN] β†’ Output sequence

This architecture underlies machine translation, summarization, and can be adapted for time-series forecasting.

Strengths

Natural for sequences: Explicitly models temporal dependencies. Hidden state carries information across time.

Variable length: Unlike feedforward networks, RNNs handle sequences of any length.

Parameter efficiency: Same weights used at every time step. A 100-step sequence doesn't need 100Γ— the parameters.

Interpretable dynamics: Hidden state evolution can be analyzed. What is the network remembering?

Weaknesses

Sequential computation: Can't parallelize across time steps. Each step waits for the previous. Training and inference are slower than parallelizable architectures.

Long-range dependencies: Even LSTMs struggle with very long sequences (hundreds to thousands of steps). Information still fades, just more slowly.

Training instability: RNNs can suffer from exploding gradients. Requires careful initialization and gradient clipping.

Superseded by transformers: For many tasks, transformers achieve better performance with easier training. RNNs are less dominant than they once were.

For Your Telescope Array

Good for: Light curves. Time-series data. Sequential observations. Any data where temporal order matters.

Specific applications:

Light curve classification: An LSTM processes a sequence of brightness measurements, classifying the variable star type, detecting transients, or identifying periodic behavior.

Light curve: [mag[0], mag[1], mag[2], ..., mag[T]]
                ↓        ↓        ↓           ↓
             [LSTM] β†’ [LSTM] β†’ [LSTM] β†’ ... β†’ [LSTM]
                                                 ↓
                                          Classification

Transient detection in time series: RNN monitors brightness sequence, outputs probability of transient at each time step. Alert when probability exceeds threshold.

Predictive modeling: Given recent conditions (weather, seeing, performance), predict near-future conditions for scheduling.

Anomaly detection in sequences: Train LSTM to predict next value in normal sequences. Large prediction errors indicate anomalies.

State tracking: RNN maintains hidden state representing current system status, updated with each new observation or event.

Example architecture for light curve classification:

Input: sequence of (time, magnitude, error) tuples, variable length

Embedding: Dense(64) applied to each time step
Output: sequence of 64-dimensional vectors

Bidirectional LSTM(128 units)
Output: sequence of 256-dimensional vectors (128 forward + 128 backward)

Attention layer (or just take final hidden state)
Output: 256-dimensional vector

Dense(128) β†’ ReLU β†’ Dropout(0.3)
Dense(64) β†’ ReLU β†’ Dropout(0.3)
Dense(num_classes) β†’ Softmax

Output: class probabilities

Transformers: Attention Is All You Need

What They Are

Transformers process sequences without recurrence. Instead of maintaining hidden state, they use attention mechanisms to directly relate any element to any other element.

The Core Insight

RNNs process sequences step by step. Information from early steps must pass through many intermediate steps to affect later processing. This creates bottlenecks.

Transformers skip the middleman. Every position can directly attend to every other position. Information flows directly between any pair of elements.

Self-Attention: The Key Mechanism

Self-attention computes relationships between all pairs of positions in a sequence.

For each position, create three vectors:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I have to offer?"
  • Value (V): "What information do I carry?"

Attention score between position i and position j:

score[i,j] = Q[i] Β· K[j] / sqrt(d_k)

The dot product measures similarity. Division by sqrt(d_k) (dimension of keys) prevents scores from growing too large.

Apply softmax to get attention weights:

weights[i] = softmax(scores[i])  # weights[i] sums to 1

Output for position i is weighted sum of values:

output[i] = Ξ£β±Ό weights[i,j] Γ— V[j]

Each position's output incorporates information from all other positions, weighted by relevance.
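The three equations above fit in a few lines of NumPy (single head, learned Q/K/V projections omitted):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # score[i, j] = Q[i] . K[j] / sqrt(d_k)
    weights = softmax(scores)        # each row sums to 1
    return weights @ V, weights      # output[i] = sum_j weights[i, j] * V[j]

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 5, 8, 4
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))
out, w = self_attention(Q, K, V)
```
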

Multi-Head Attention

A single attention mechanism learns one type of relationship. Multi-head attention runs several attention mechanisms in parallel:

Head 1: Q₁, K₁, V₁ β†’ output₁
Head 2: Qβ‚‚, Kβ‚‚, Vβ‚‚ β†’ outputβ‚‚
...
Head N: Qβ‚™, Kβ‚™, Vβ‚™ β†’ outputβ‚™

Concatenate: [output₁, outputβ‚‚, ..., outputβ‚™]
Project: W_o Γ— concatenated

Different heads learn different relationships:

  • Head 1 might attend to nearby positions
  • Head 2 might attend to similar values
  • Head 3 might attend to periodically related positions

The Transformer Block

A complete transformer block:

Input
  ↓
Multi-Head Self-Attention
  ↓
Add (residual connection) + Layer Normalization
  ↓
Feed-Forward Network (two dense layers)
  ↓
Add (residual connection) + Layer Normalization
  ↓
Output

Stack many blocks (6, 12, 24, or more in large models).

Residual connections let gradients flow directly through the network, enabling very deep architectures.

Positional Encoding

Self-attention is permutation-invariant: it doesn't inherently know that position 1 comes before position 2. Order information must be added explicitly.

Sinusoidal encoding (original transformer):

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Different frequencies for different dimensions. Positions get unique signatures, and relative positions can be computed from these encodings.
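A NumPy sketch of the sinusoidal encoding above, assuming an even model dimension `d`:

```python
import numpy as np

def sinusoidal_encoding(max_len, d):
    pos = np.arange(max_len)[:, None]            # positions 0..max_len-1
    i = np.arange(d // 2)[None, :]               # dimension-pair index
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(100, 64)  # each row is one position's signature
```
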

Learned encodings: Just learn a vector for each position. Works well when maximum sequence length is known.

Encoder-Decoder Transformers

For sequence-to-sequence tasks:

Encoder: Self-attention sees entire input. Each position attends to all input positions.

Decoder: Self-attention is masked (positions can only attend to earlier positions, not future). Cross-attention lets decoder positions attend to encoder outputs.

Input Sequence β†’ [Encoder Stack] β†’ Encoded Representations
                                            ↓
                        [Decoder Stack with Cross-Attention] β†’ Output Sequence

Encoder-Only (BERT-style)

For tasks where you need to understand the input but not generate sequences:

Input β†’ [Transformer Encoder] β†’ Representations β†’ Task-specific head

BERT, RoBERTa, and similar models use this pattern. Fine-tune for classification, extraction, or other tasks.

Decoder-Only (GPT-style)

For generation tasks:

Context β†’ [Transformer Decoder] β†’ Next token prediction

GPT models use this pattern. The model predicts the next element based on all previous elements.

Vision Transformers (ViT)

Transformers for images:

  1. Split image into patches (e.g., 16Γ—16 pixels each)
  2. Flatten each patch into a vector
  3. Add position encodings
  4. Process with standard transformer

Image β†’ [Split into patches] β†’ [Linear embedding] β†’ [Add position] β†’ [Transformer] β†’ [Classification head]

This treats an image as a sequence of patches, letting attention learn spatial relationships.
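Steps 1 and 2 (split and flatten) amount to a reshape, sketched here for a single-channel image:

```python
import numpy as np

def image_to_patches(img, p):
    # img: (H, W) with H and W divisible by the patch size p
    H, W = img.shape
    blocks = img.reshape(H // p, p, W // p, p).swapaxes(1, 2)
    return blocks.reshape(-1, p * p)  # one flattened vector per patch

img = np.arange(32 * 32, dtype=float).reshape(32, 32)
patches = image_to_patches(img, 16)   # 4 patches of 256 values each
```
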

Strengths

Parallelization: Unlike RNNs, all positions can be computed simultaneously. Training is much faster on GPUs.

Long-range dependencies: Every position directly attends to every other. No information bottleneck.

Scalability: Transformers scale well. Larger models, more data, and more compute generally mean better performance.

State-of-the-art: Transformers dominate language, increasingly dominate vision, and excel in many domains.

Flexibility: Same architecture works for language, images, audio, and more with minimal modification.

Weaknesses

Quadratic complexity: Self-attention compares all pairs of positions. For sequence length N, complexity is O(NΒ²). Very long sequences become expensive.

Data hungry: Transformers typically need more training data than CNNs or RNNs to achieve good performance.

Compute hungry: Large transformers require substantial GPU resources for training and inference.

Position encoding limitations: Learned position encodings don't generalize beyond training length. Sinusoidal encodings help but aren't perfect.

Less inductive bias: Transformers make fewer assumptions about data structure. This flexibility means they need to learn structure from data rather than having it built in.

For Your Telescope Array

Good for: Complex sequences where long-range dependencies matter. Multi-modal data fusion. Tasks where CNNs or RNNs underperform.

Specific applications:

Advanced light curve analysis: Transformers can capture long-range periodicity, complex variability patterns, and subtle correlations that RNNs miss.

Multi-site data fusion: Treat observations from different sites as sequence elements. Attention learns which observations to weight more heavily, how to combine information across sites.

[Obs_Site_A, Obs_Site_B, Obs_Site_C, ...] β†’ [Transformer] β†’ Fused Representation

Catalog cross-matching: Given entries from multiple catalogs, transformer attention learns which entries correspond to the same object.

Vision Transformer for images: For challenging image classification tasks where CNNs plateau, ViT might push further (with sufficient data).

Multimodal understanding: Combine image features and light curve features in a single transformer. Attention learns relationships between visual appearance and temporal behavior.

Example architecture for multi-site data fusion:

Inputs: Observations from N sites, each represented as a vector
[obs_1, obs_2, ..., obs_N] where obs_i includes: image embedding, quality metrics, timestamp, site ID embedding

Positional encoding: Site embeddings rather than sequence positions

Transformer Encoder (4 layers, 8 attention heads, 256 dimensions)
Each observation attends to all others
Learns which sites to weight, how to combine

Global pooling or CLS token
Output: Fused representation

Task heads:
- Classification head: Dense β†’ class probabilities
- Quality estimation head: Dense β†’ expected quality of combined result
- Uncertainty head: Dense β†’ confidence bounds

Autoencoders: Learning Compression

What They Are

Networks that learn to compress data to a smaller representation, then reconstruct the original. Not for prediction, but for representation learning.

The Core Insight

If a network can compress data to a small representation and reconstruct it accurately, that small representation must capture the essential information. What's lost is presumably noise or irrelevant detail.

Architecture

Input β†’ [Encoder] β†’ Bottleneck (small) β†’ [Decoder] β†’ Reconstruction

        High-dimensional                              High-dimensional
           input                                         output
                        Low-dimensional
                          code/latent

Encoder: Compresses input to bottleneck. Typically uses convolutions (for images) or dense layers.

Bottleneck: The compressed representation. Much smaller than input (e.g., 256Γ—256 image β†’ 128 numbers).

Decoder: Reconstructs input from bottleneck. Mirror of encoder architecture.

Loss: Reconstruction error, typically mean squared error between input and output.

Variational Autoencoders (VAEs)

Standard autoencoders learn a deterministic mapping. VAEs learn a probabilistic one.

Instead of encoding to a single point, VAE encodes to a distribution (mean and variance):

Input β†’ [Encoder] β†’ (ΞΌ, Οƒ) β†’ Sample z ~ N(ΞΌ, Οƒ) β†’ [Decoder] β†’ Reconstruction

Loss includes:

  • Reconstruction error
  • KL divergence between learned distribution and prior (regularizes latent space)

VAEs have smoother latent spaces. You can sample from the prior and generate realistic outputs.
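The sampling step uses the reparameterization trick: draw unit noise once, then shift and scale it, so gradients can flow through ΞΌ and Οƒ. A sketch with made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([0.5, -1.0])     # encoder's predicted mean
sigma = np.array([0.1, 0.2])   # encoder's predicted standard deviation

# z ~ N(mu, sigma) written as a deterministic function of unit noise
eps = rng.standard_normal(mu.shape)
z = mu + sigma * eps           # decoder input
```
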

Uses of Autoencoders

Dimensionality reduction: The bottleneck representation is a compressed version of input. Useful for visualization, clustering, or as input to other models.

Denoising: Train autoencoder on noisy inputs with clean targets. It learns to remove noise.

Anomaly detection: Train on normal data. Anomalies reconstruct poorly (high error).

Generation: VAEs (and related models) can generate new samples by decoding random latent vectors.

Strengths

Unsupervised: Don't need labels. Just need examples of normal data.

Representation learning: Learn useful features without explicit supervision.

Anomaly detection: Natural fit for finding unusual objects.

Compression: Learned compression can outperform hand-designed methods.

Weaknesses

Reconstruction focus: Optimizing reconstruction might not produce representations useful for downstream tasks.

Mode collapse: Can learn to ignore some input variation, reconstructing only "average" outputs.

Blurry outputs: Especially VAEs tend to produce blurry reconstructions, averaging over uncertainty.

Hyperparameter sensitivity: Bottleneck size, architecture choices significantly affect results.

For Your Telescope Array

Good for: Anomaly detection. Data compression. Finding unusual objects. Learning representations without labels.

Specific applications:

Anomaly detection: Train autoencoder on normal telescope images. High reconstruction error flags unusual images for human review.

Training: Normal images β†’ Autoencoder β†’ Minimize reconstruction error
Deployment: New image β†’ Autoencoder β†’ Measure reconstruction error
            If error > threshold: Flag as anomalous
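The deployment logic above in sketch form; `reconstruct` is a hypothetical stand-in for a trained autoencoder's encode-decode pass:

```python
import numpy as np

def reconstruction_error(image, reconstruct):
    # Mean squared error between the image and its reconstruction
    return float(np.mean((image - reconstruct(image)) ** 2))

# Threshold chosen from errors on known-good validation images,
# e.g. at the 99th percentile for a ~1% false positive rate
validation_errors = np.array([0.010, 0.012, 0.009, 0.011, 0.013])
threshold = np.percentile(validation_errors, 99)

def is_anomalous(image, reconstruct):
    return reconstruction_error(image, reconstruct) > threshold

# With a perfect reconstructor, a normal frame is never flagged
identity = lambda img: img
flagged = is_anomalous(np.ones((8, 8)), identity)
```
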

Compression for transmission: Train autoencoder to compress images. Send only bottleneck codes from remote sites, decode centrally. Lossy but much smaller.

Unknown object discovery: Cluster objects in latent space. Objects far from known clusters might be new types.

Quality-aware compression: Train autoencoder with quality-weighted loss. Preserve important regions (sources) more than background.

Example anomaly detection system:

Convolutional Autoencoder:

Encoder:
Conv(32, 3Γ—3) β†’ ReLU β†’ Pool(2Γ—2)  # 256 β†’ 128
Conv(64, 3Γ—3) β†’ ReLU β†’ Pool(2Γ—2)  # 128 β†’ 64
Conv(128, 3Γ—3) β†’ ReLU β†’ Pool(2Γ—2) # 64 β†’ 32
Conv(256, 3Γ—3) β†’ ReLU β†’ Pool(2Γ—2) # 32 β†’ 16
Flatten β†’ Dense(512) β†’ Dense(128) β†’ Bottleneck

Decoder (mirror of encoder):
Dense(512) β†’ Dense(16Γ—16Γ—256) β†’ Reshape
Upsample(2Γ—2) β†’ Conv(128, 3Γ—3) β†’ ReLU  # 16 β†’ 32
Upsample(2Γ—2) β†’ Conv(64, 3Γ—3) β†’ ReLU   # 32 β†’ 64
Upsample(2Γ—2) β†’ Conv(32, 3Γ—3) β†’ ReLU   # 64 β†’ 128
Upsample(2Γ—2) β†’ Conv(1, 3Γ—3) β†’ Output  # 128 β†’ 256

Loss: Mean squared error

Anomaly score: Reconstruction error per image
Threshold: Set from validation data to achieve desired false positive rate

Graph Neural Networks: Relational Intelligence

What They Are

Networks designed for data naturally represented as graphs: nodes connected by edges. Where CNNs exploit spatial structure and RNNs exploit temporal structure, GNNs exploit relational structure.

The Core Insight

Many astronomical phenomena involve relationships:

  • Stars in clusters are related
  • Galaxies in groups interact
  • Observations of the same object are connected
  • Telescope sites share information

Graphs naturally represent these relationships. GNNs learn to use relational structure.

Graph Representation

A graph consists of:

  • Nodes: Entities (stars, galaxies, observations, telescopes)
  • Edges: Relationships between nodes (physical proximity, causal connection, same object)
  • Node features: Attributes of each node (brightness, color, position)
  • Edge features: Attributes of each relationship (distance, time difference, strength)

Message Passing: The Core Operation

GNNs work by passing messages between connected nodes:

For each node:
    1. Gather messages from neighbors
    2. Aggregate messages (sum, mean, max, or learned aggregation)
    3. Update node representation based on current state + aggregated messages

After several rounds of message passing, each node's representation incorporates information from its neighborhood.

Round 1: Each node knows about immediate neighbors
Round 2: Each node knows about neighbors-of-neighbors
Round 3: Information from 3-hop neighborhood
...

Mathematical Formulation

Basic message passing:

m[i] = Aggregate({h[j] : j ∈ Neighbors(i)})
h'[i] = Update(h[i], m[i])

Where:

  • h[i] is node i's representation
  • m[i] is aggregated message for node i
  • Aggregate is a permutation-invariant function (sum, mean, max)
  • Update combines current state with message (typically neural network)
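
The formulation above can be written out for a toy graph. This is an illustrative sketch using mean aggregation and a tanh update, with fixed random matrices standing in for learned weights.

```python
import numpy as np

def message_pass(h, neighbors, W_self, W_msg):
    """One round of mean-aggregation message passing.

    h: (num_nodes, d) node representations
    neighbors: dict mapping node index -> list of neighbor indices
    W_self, W_msg: weight matrices (learned in practice, fixed here)
    """
    h_new = np.zeros_like(h)
    for i in range(len(h)):
        m = h[neighbors[i]].mean(axis=0)               # Aggregate (mean)
        h_new[i] = np.tanh(h[i] @ W_self + m @ W_msg)  # Update
    return h_new

# Tiny 3-node path graph: 0 - 1 - 2
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1]}
rng = np.random.default_rng(0)
W_self, W_msg = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))

h1 = message_pass(h, neighbors, W_self, W_msg)
print(h1.shape)  # (3, 2)
```

Stacking this function k times gives each node a view of its k-hop neighborhood, exactly as in the rounds described above.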

Common architectures:

Graph Convolutional Network (GCN):

H' = Οƒ(D^(-1/2) A D^(-1/2) H W)

Where A is the adjacency matrix (in practice usually with self-loops added, Ã = A + I, so each node retains its own features), D is the corresponding degree matrix, H is the node feature matrix, and W is a learnable weight matrix.
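
The GCN propagation rule fits in a few lines of NumPy. A sketch, assuming the common convention of adding self-loops (A + I) and using ReLU as the nonlinearity Οƒ:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: H' = ReLU(D^(-1/2) (A+I) D^(-1/2) H W).
    Self-loops (A + I) are added, as is common, so each node
    keeps its own features during propagation."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)                    # degrees including self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Triangle graph, 2-dim input features, 4 output channels
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
W = np.random.default_rng(1).normal(size=(2, 4))

H1 = gcn_layer(A, H, W)
print(H1.shape)  # (3, 4)
```

The symmetric D^(-1/2) normalization keeps the scale of node features stable regardless of how many neighbors a node has.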

Graph Attention Network (GAT): Use attention to weight neighbor contributions differently.

GraphSAGE: Sample and aggregate neighbors, enabling mini-batch training on large graphs.

Strengths

Natural for relational data: Directly encodes relationships. No need to flatten graph structure into vectors.

Flexible structure: Works on graphs of any size and topology. Adapts to varying numbers of neighbors.

Inductive: Can generalize to unseen nodes/graphs if features are meaningful.

Combines information: Learns how to aggregate information from related entities.

Weaknesses

Scalability: Very large graphs (millions of nodes) require sophisticated sampling or approximation.

Oversmoothing: Many message-passing rounds make all node representations similar. Deep GNNs are harder to train.

Edge definition: Results depend on how you define graph structure. Wrong edges hurt performance.

Less mature: GNNs are newer than CNNs/RNNs. Fewer established best practices.

For Your Telescope Array

Good for: Modeling relationships between objects, sites, or observations. Catalog analysis. Network coordination.

Specific applications:

Star cluster analysis: Nodes are stars, edges connect probable cluster members. GNN learns cluster membership, identifies interlopers.

Galaxy group finding: Nodes are galaxies, edges from proximity or velocity similarity. GNN identifies group memberships, predicts properties.

Multi-observation fusion: Nodes are observations of the same target (different times, sites, instruments). Edges connect same-object observations. GNN learns optimal combination.

Graph structure:
  Nodes: Individual observations
  Edges: Same object, temporal proximity, or spatial proximity
  Node features: Measurement values, quality metrics, metadata
  Edge features: Time difference, site pair, conditions similarity

GNN:
  Message passing learns how to weight and combine observations
  Output: Fused estimate for each unique object

Telescope network optimization: Nodes are telescope sites, edges connect sites with complementary capabilities. GNN learns coordination patterns, recommends resource allocation.

Anomaly detection in context: When detecting anomalies, consider relationships. A star that's anomalous in isolation might be normal given its cluster context. GNN incorporates context.

Example architecture for multi-observation fusion:

Graph construction:
  For each unique object, create nodes for all observations
  Connect observations with edges (fully connected or based on relevance)

Node features (per observation):
  - Measured values (magnitudes, colors, etc.)
  - Uncertainty estimates
  - Observation quality metrics
  - Site identifier (embedded)
  - Time of observation

Edge features:
  - Time difference
  - Site pair identifier
  - Condition similarity score

GNN architecture:
  GraphSAGE with 3 message-passing layers
  Hidden dimension: 128
  Aggregation: Attention-weighted mean

After message passing:
  Global pooling across all nodes for this object
  Dense layers for final estimate

Output:
  Fused measurement estimate
  Uncertainty bounds
  Outlier flags for individual observations
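
For context, a simple non-learned baseline for the fused-estimate output is inverse-variance weighting of the observations; the GNN can be viewed as learning a context-dependent generalization of this rule. The magnitudes and uncertainties below are made up for illustration.

```python
import numpy as np

def inverse_variance_fusion(values, sigmas):
    """Combine repeated measurements of one object, weighting each
    by 1/sigma^2. Returns the fused value and its uncertainty."""
    w = 1.0 / np.asarray(sigmas) ** 2
    fused = np.sum(w * values) / np.sum(w)
    fused_sigma = 1.0 / np.sqrt(np.sum(w))
    return fused, fused_sigma

# Three sites measure the same star's magnitude with different precision
values = np.array([14.20, 14.25, 14.10])
sigmas = np.array([0.05, 0.10, 0.20])

fused, err = inverse_variance_fusion(values, sigmas)
print(round(fused, 3), round(err, 3))  # 14.205 0.044
```

Unlike this fixed rule, the GNN can downweight an observation because of its context (poor conditions at that site, disagreement with temporally adjacent observations), which is what the edge features are there to support.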

Generative Models: Creating New Data

What They Are

Models that learn to generate new samples resembling training data. Instead of classifying or predicting, they create.

Generative Adversarial Networks (GANs)

Two networks in competition:

Generator: Takes random noise, produces fake samples.
Discriminator: Tries to distinguish real from fake samples.

Training is adversarial:

  • Discriminator improves at detecting fakes
  • Generator improves at fooling discriminator
  • At equilibrium, generator produces samples discriminator can't distinguish from real
Random noise z β†’ [Generator] β†’ Fake sample
                                    ↓
                              [Discriminator] β†’ Real or Fake?
                                    ↑
                              Real sample

Loss functions:

Discriminator: maximize log(D(real)) + log(1 - D(G(z)))
Generator: maximize log(D(G(z))) (or minimize log(1 - D(G(z))))
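
Both losses can be computed directly from discriminator scores. A minimal NumPy sketch (the example scores are hypothetical, and `eps` guards the logarithms):

```python
import numpy as np

def d_loss(d_real, d_fake, eps=1e-8):
    """Discriminator loss (negated objective, to minimize):
    -[log D(real) + log(1 - D(G(z)))]."""
    return -np.mean(np.log(d_real + eps) + np.log(1 - d_fake + eps))

def g_loss(d_fake, eps=1e-8):
    """Non-saturating generator loss: -log D(G(z))."""
    return -np.mean(np.log(d_fake + eps))

# Discriminator scores in (0, 1) for a hypothetical batch
d_real = np.array([0.9, 0.8])   # discriminator is confident these are real
d_fake = np.array([0.1, 0.2])   # discriminator is confident these are fake

# A discriminator that separates real from fake has lower loss than a coin flip
print(d_loss(d_real, d_fake) < d_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # True
```

The `-log D(G(z))` form for the generator is the standard non-saturating variant: it gives stronger gradients early in training, when the discriminator easily rejects generated samples.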

Diffusion Models

Currently state-of-the-art for image generation.

Forward process: Gradually add noise to real data until it's pure noise.
Reverse process: Learn to gradually remove noise, recovering data from noise.

Real image β†’ [Add noise] β†’ [Add noise] β†’ ... β†’ Pure noise
Pure noise β†’ [Denoise] β†’ [Denoise] β†’ ... β†’ Generated image

The denoising network learns to predict and remove noise at each step. Many small denoising steps produce high-quality samples.
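
Under the standard DDPM formulation, the forward process has a convenient closed form: x_t = √(αΎ±_t)Β·x_0 + √(1 βˆ’ αΎ±_t)Β·Ξ΅, where αΎ±_t is the cumulative product of (1 βˆ’ Ξ²_s). A sketch with a linear Ξ² schedule (the schedule values are illustrative choices):

```python
import numpy as np

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from the forward (noising) process in closed form:
    x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps, eps

# Linear beta schedule over 1000 steps (a common choice)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(16, 16))        # a toy "image"
x_early, _ = forward_noise(x0, t=10, alpha_bar=alpha_bar, rng=rng)
x_late, _ = forward_noise(x0, t=999, alpha_bar=alpha_bar, rng=rng)

# Early steps stay close to the image; by the last step it is almost pure noise
print(alpha_bar[10] > 0.99, alpha_bar[999] < 1e-4)  # True True
```

The denoising network is trained to predict the `eps` that was added at a randomly sampled step t, which is what makes the many-step reverse process learnable from single-step targets.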

Uses in Astronomy

Data augmentation: Generate synthetic training examples, especially for rare classes.

Simulation: Generate realistic synthetic observations to test pipelines.

Super-resolution: Generate high-resolution images from low-resolution inputs.

Inpainting: Fill in missing or corrupted regions of images.

Conditional generation: Generate images matching specific properties (galaxy with certain morphology, star with certain spectrum).

For Your Telescope Array

Specific applications:

Training data generation: Have few examples of rare transients? Train a generative model on what you have, generate more for classifier training.

Pipeline testing: Generate realistic synthetic observations to stress-test processing pipelines before real data arrives.

Data recovery: Inpaint satellite trails, cosmic rays, or bad pixels in otherwise good observations.

Prediction: Given current conditions and recent observations, generate predictions of what observations will look like in the near future.


Architecture Selection Guide for Your Project

Let me be concrete about which architecture to use for each component of your distributed telescope system.

At Individual Telescope Sites

| Task | Architecture | Rationale |
|---|---|---|
| Frame quality assessment | Lightweight CNN | Fast inference, spatial patterns matter, proven performance |
| Real-time transient detection | CNN + threshold | Need speed, looking for spatial signatures |
| Basic source detection | U-Net (CNN variant) | Semantic segmentation task, well-established |
| Quick classification | Small CNN or feedforward from features | Speed critical, accuracy secondary |
| Equipment anomaly detection | Autoencoder | Unsupervised, learns normal behavior |

At Central Coordination

| Task | Architecture | Rationale |
|---|---|---|
| Deep image classification | ResNet/EfficientNet CNN or ViT | Accuracy matters, have compute resources |
| Light curve classification | Transformer or LSTM | Sequential data with long-range dependencies |
| Multi-site data fusion | Transformer or GNN | Relating multiple inputs, flexible attention |
| Scheduling optimization | Reinforcement learning (various) | Sequential decision-making |
| Catalog cross-matching | GNN or Transformer | Relational structure matters |
| Anomaly detection at scale | Autoencoder + clustering | Find unknowns in large datasets |
| Multi-modal analysis | Transformer | Naturally handles multiple input types |

Decision Flowchart

Is your data...?

β”œβ”€β”€ Images (2D spatial)
β”‚   β”œβ”€β”€ Classification/detection β†’ CNN (ResNet, EfficientNet)
β”‚   β”œβ”€β”€ Segmentation β†’ U-Net, DeepLab
β”‚   β”œβ”€β”€ Very complex patterns β†’ Vision Transformer (if enough data)
β”‚   └── Need speed β†’ MobileNet, lightweight CNN
β”‚
β”œβ”€β”€ Sequences (time series)
β”‚   β”œβ”€β”€ Short sequences (<100 steps) β†’ LSTM or GRU
β”‚   β”œβ”€β”€ Long sequences (>100 steps) β†’ Transformer
β”‚   β”œβ”€β”€ Real-time streaming β†’ LSTM with online updates
β”‚   └── Bidirectional context available β†’ Bidirectional LSTM or Transformer
β”‚
β”œβ”€β”€ Tabular (features/measurements)
β”‚   β”œβ”€β”€ Clear features β†’ XGBoost/LightGBM (often beats neural networks)
β”‚   β”œβ”€β”€ Need neural network β†’ Feedforward
β”‚   └── Interactions complex β†’ Feedforward with more layers
β”‚
β”œβ”€β”€ Graph (relational)
β”‚   └── Use GNN (GraphSAGE, GAT)
β”‚
β”œβ”€β”€ Multiple modalities (images + sequences + tabular)
β”‚   └── Transformer (or separate encoders feeding shared transformer)
β”‚
└── Unlabeled data
    β”œβ”€β”€ Want compression/representation β†’ Autoencoder
    β”œβ”€β”€ Want anomaly detection β†’ Autoencoder or isolation forest
    └── Want to generate samples β†’ GAN or diffusion model

Hybrid Architectures for Your System

Real systems often combine architectures:

CNN + LSTM for video or image sequences:

Frame 1 β†’ [CNN] β†’ features[1] ─┐
Frame 2 β†’ [CNN] β†’ features[2] ─┼→ [LSTM] β†’ Sequence classification
Frame 3 β†’ [CNN] β†’ features[3] β”€β”˜

Use CNN to extract per-frame features, LSTM to model temporal evolution.

CNN + Transformer for multi-site fusion:

Site A image β†’ [CNN] β†’ embedding_A ─┐
Site B image β†’ [CNN] β†’ embedding_B ─┼→ [Transformer] β†’ Fused result
Site C image β†’ [CNN] β†’ embedding_C β”€β”˜

Use CNN to extract site-specific features, transformer to learn optimal combination.

Autoencoder + Classifier for semi-supervised learning:

Labeled + Unlabeled data β†’ [Autoencoder] β†’ Latent representations
Latent representations + Labels β†’ [Classifier] β†’ Predictions

Use autoencoder to learn representations from all data (including unlabeled), classifier on top using labels.


Summary Comparison Table

| Architecture | Best For | Input Type | Strengths | Weaknesses | Your Use Cases |
|---|---|---|---|---|---|
| Feedforward | Tabular data, simple tasks | Fixed-size vectors | Simple, fast, universal | No structure awareness | Feature-based classification, final layers |
| CNN | Images, spatial data | 2D/3D grids | Translation invariance, hierarchical features | Fixed input size, local focus | Image quality, source detection, morphology |
| RNN/LSTM | Sequences, time series | Variable-length sequences | Temporal modeling, memory | Sequential (slow), limited range | Light curves, streaming data |
| Transformer | Long sequences, multi-modal | Any (with encoding) | Parallelizable, long-range, flexible | Quadratic complexity, data hungry | Complex light curves, data fusion |
| Autoencoder | Compression, anomaly detection | Any | Unsupervised, learns representations | Reconstruction-focused | Anomaly detection, compression |
| GNN | Relational data, graphs | Graphs | Models relationships | Scaling, oversmoothing | Cluster analysis, observation fusion |
| GAN/Diffusion | Data generation | Any | Creates new samples | Training instability | Data augmentation, simulation |

This should give you a complete understanding of how each architecture works, what it's suited for, and exactly where each fits into your distributed telescope network. The key is matching architecture assumptions to your data's structure and your task's requirements.