Merged from: AI and ideas.md + Edge ML.md (empty) + Figuring out ML.md + ML for project.md. All originals preserved. This is the consolidated ML overview for the project.

Machine Learning for OpenAstro

ML is not the core of OpenAstro — the physics is. But ML solves specific problems in the pipeline that rule-based systems can't handle well. This note covers: what ML is for, which architecture types apply where, and how to get started.


Where ML Actually Helps in This Project

| Task | ML Approach | Why Not Rules? |
|---|---|---|
| Transient classification | CNN on image cutouts | Satellite trails, cosmic rays, and real transients look subtly different |
| Quality flagging | Anomaly detection (autoencoder) | "Bad" data has no fixed definition — weather, tracking errors, electronics |
| Scheduling optimization | Reinforcement learning (future) | Greedy heuristic is fine for MVP; RL improves at scale |
| Period detection | Lomb-Scargle (not ML) | Statistical signal processing beats ML here — this is math, not learning |
| FRB candidate detection | CNN / RNN on time-series | Millisecond-scale patterns in high-cadence photometry |
| Alert stream filtering | Random forest / gradient boost | Filter ZTF/Gaia alert streams for targets relevant to network capabilities |

Priority order for implementation: Quality flagging first (every observation needs it), transient classification second, scheduling optimization last.


Architecture Decision: Which Model Type?

The open question from AI and ideas.md: "Which kind of model — RNN, CNN, reinforcement, or transformers?"

| Architecture | Use in OpenAstro | Feasibility |
|---|---|---|
| CNN | Image classification (transients, quality), 2D spectra | High — well-understood, modest compute, good pretrained models |
| RNN / LSTM | Time-series (light curves, TTV sequences) | Medium — good for sequences, but Transformers now dominate |
| Transformer | Light curve classification, FRB detection, multi-modal | High compute cost, but BERT-style pretrained models exist for time series |
| Reinforcement Learning | Scheduling (target assignment) | Overkill at MVP scale; revisit at 200+ sites |
| Random Forest / XGBoost | Alert filtering, quality flags | Low compute, high interpretability, good first choice |

Recommendation: Start with Random Forest/XGBoost for quality flagging (interpretable, fast). Use a CNN for image-based transient classification. Adopt Transformers only when you have enough labeled data and GPU access.

On transformers specifically (from AI and ideas.md): They are expensive to train from scratch but there are astronomy-specific pretrained models (AstroCLIP, Astromer for light curves) that can be fine-tuned cheaply.


How Neural Networks Work (Foundation)

For anyone building intuition from scratch: a neural network is a series of weighted sums + nonlinearities. The star classification example:

  • Input: Temperature (normalized), Luminosity (normalized)
  • Output: Dwarf (0) or Giant (1)
  • Learning: 5000 iterations of forward pass → error calculation → backpropagation via chain rule → weight update

The key insight: you never write the rule "high luminosity = Giant". The weights self-assign during training. The chain rule distributes blame backwards through the network.

Physics intuition: Forward pass = signal propagation. Backprop = measuring how each weight contributed to the error. Gradient descent = nudging each weight in the direction that reduces error.

Matrix math vs. explicit neurons: Using np.dot to process all neurons simultaneously is equivalent to writing a for loop over individual Neuron objects — it's just GPU-efficient. weights1 = np.random.rand(2, 3) is literally 3 neurons with 2 weights each.
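The star-classification example above can be sketched in a few lines of numpy. This is a minimal illustration, not a production model: the six data points, the learning rate, and the network size (2 inputs → 3 hidden → 1 output) are all made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy training set: [temperature, luminosity], both normalized to [0, 1].
# Dwarfs: hot but dim. Giants: cooler but luminous. Values are illustrative.
X = np.array([[0.9, 0.2], [0.8, 0.1], [0.7, 0.3],   # dwarfs -> 0
              [0.3, 0.9], [0.2, 0.8], [0.4, 0.7]])  # giants -> 1
y = np.array([[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# weights1 is "literally 3 neurons with 2 weights each".
weights1 = rng.random((2, 3))
weights2 = rng.random((3, 1))

lr = 1.0
for _ in range(5000):
    # Forward pass: signal propagation through the layers.
    hidden = sigmoid(X @ weights1)
    output = sigmoid(hidden @ weights2)

    # Error, then backpropagation via the chain rule.
    error = y - output
    d_output = error * output * (1 - output)
    d_hidden = (d_output @ weights2.T) * hidden * (1 - hidden)

    # Gradient descent: nudge each weight to reduce the error.
    weights2 += lr * hidden.T @ d_output
    weights1 += lr * X.T @ d_hidden

preds = sigmoid(sigmoid(X @ weights1) @ weights2)
print(np.round(preds.ravel(), 2))  # dwarfs near 0, giants near 1
```

No rule "high luminosity = Giant" appears anywhere; the weights self-assign during the 5000 iterations.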


Practical Starting Point for OpenAstro

  1. Label ~1000 observations as good/bad quality manually — this is your training set
  2. Extract features: airmass, PSF FWHM, sky background, comparison star residuals, exposure time
  3. Train a Random Forest classifier — interpretable, works well with small labeled sets, no GPU needed
  4. Iterate: As more data accumulates, switch to a CNN on the actual image data for richer features
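Step 3 above is a few lines with scikit-learn. The data here are synthetic stand-ins (the labelling rule is invented purely so the demo runs); in practice X comes from your ~1000 hand-labelled observations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Features per observation: airmass, PSF FWHM (arcsec), sky background
# (ADU), comparison-star residual (mag), exposure time (s).
X = np.column_stack([
    rng.uniform(1.0, 2.5, n),      # airmass
    rng.uniform(1.5, 6.0, n),      # PSF FWHM
    rng.uniform(100, 5000, n),     # sky background
    rng.normal(0.0, 0.05, n),      # comparison residuals
    rng.choice([30, 60, 120], n),  # exposure time
])

# Toy labelling rule (illustrative only): bad seeing or bright sky -> bad.
y = ((X[:, 1] > 4.5) | (X[:, 2] > 4000)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("accuracy:", clf.score(X_te, y_te))
# Interpretability: which features drive the quality flag?
print("importances:", np.round(clf.feature_importances_, 2))
```

The feature importances are the payoff of starting with a forest rather than a neural net: you can see immediately which measurement is driving the bad-quality flags.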

Don't start with transformers. Don't start with RL. Start with the simplest thing that can produce a useful quality flag. Upgrade when you have data and a clear bottleneck.


See Also

  • Types of nets.md — comprehensive architecture reference (CNNs, RNNs, Transformers, Autoencoders, GNNs, GANs with maths)
  • ML for project.md — high-level architecture (immutable data lake, distributed inference)
  • Everything on FRBs.md — FRB-specific ML pipelines (FETCH classifier, DANCE clustering)
  • NewOpenAstro/Tech/Distributed Computing Strategy.md — compute infrastructure for ML training/inference

Distributed Network ML — Unique Challenges (from NewOpenAstro/Science/Machine Learning/Dumpppp.md and ML and Distributed telescope.md)

A distributed telescope network introduces ML challenges that single-observatory systems don't face.

The Distributed Data Problem

Heterogeneous Conditions: Telescope A (India) sees through different atmosphere than Telescope B (Chile). The same galaxy imaged at both sites looks subtly different. A model trained on Site A data will have systematic errors at Site B. Solution: Transfer learning — train site-independent representations in early network layers, allow site-specific adaptation in later layers.

Temporal Asynchrony: It's always daytime somewhere. Events happen when only some telescopes can see them. ML must make fast local decisions while benefiting from global coordination.

Calibration Drift: Each telescope drifts differently over time. ML that learns site-specific characteristics automatically (rather than through manual characterisation) scales to hundreds of nodes.

Light Curve Folding (Exoplanet Period Finding)

To detect an exoplanet transit from distributed telescope data:

  1. Collect flux measurements from many sites over days/weeks
  2. "Fold" the data: transform absolute timestamps into phase values using a trial period P: phase = (t mod P) / P
  3. With the wrong P, the data looks like noise
  4. With the correct P, transit dips at the same phase (e.g., 0.5) from all sites line up — random noise cancels, signal accumulates
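The folding idea can be demonstrated end to end on simulated data. Real pipelines use Box Least Squares (e.g. astropy.timeseries.BoxLeastSquares); this sketch uses a cruder "deepest folded bin" statistic, and every number (period, dip depth, noise level) is made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(7)
P_true = 3.7                              # days (illustrative)

t = np.sort(rng.uniform(0, 60, 2000))     # irregular sampling, many sites
flux = np.ones_like(t)
in_transit = np.abs((t % P_true) / P_true - 0.5) < 0.02
flux[in_transit] -= 0.01                  # 1% transit dip at phase 0.5
flux += rng.normal(0, 0.003, t.size)      # photometric noise

def deepest_bin(period, nbins=50):
    # Fold at a trial period and bin. The correct period concentrates the
    # transit points into a few bins, producing a deep minimum; a wrong
    # period smears them out and the dip averages away.
    phase = (t % period) / period
    idx = np.minimum((phase * nbins).astype(int), nbins - 1)
    total = np.bincount(idx, weights=flux, minlength=nbins)
    count = np.bincount(idx, minlength=nbins)
    binned = np.where(count > 0, total / np.maximum(count, 1), 1.0)
    return binned.min()

trial_periods = np.arange(3.0, 4.5, 0.001)
depths = np.array([deepest_bin(p) for p in trial_periods])
best = trial_periods[depths.argmin()]
print(f"recovered period: {best:.3f} d (true: {P_true} d)")
```

Note the trial grid is deliberately restricted around the true period: folding statistics are degenerate at harmonics (P/2, 2P), which is why real BLS implementations report and vet harmonic aliases.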

This is how Kepler and TESS discover planets. The brute-force period search (Box Least Squares / BLS algorithm) is the standard first step.

K-Means for Pre-Processing

Before running expensive CNN classifiers, K-means clustering of light curve features (dip depth, dip width, symmetry, period) creates categories:

  • Cluster 1: Flat lines (constant stars — skip)
  • Cluster 2: Sine waves (pulsating variables)
  • Cluster 3: Sharp periodic dips (eclipsing binaries or exoplanet candidates)

Feed only Cluster 3 to the CNN. Massive compute saving.
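A sketch of that triage with scikit-learn, on synthetic feature vectors. The feature values and cluster shapes are invented for the demo; real features would come from the extraction step of the pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Feature order (an assumption): dip depth, dip width, symmetry, period.
flat    = rng.normal([0.00, 0.00, 0.5, 0.0], 0.01, size=(300, 4))  # constant stars
sines   = rng.normal([0.10, 0.50, 0.5, 2.0], 0.05, size=(100, 4))  # pulsators
dippers = rng.normal([0.05, 0.05, 0.9, 3.0], 0.02, size=(20, 4))   # transit-like
X = np.vstack([flat, sines, dippers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Route only the cluster containing sharp periodic dips to the CNN.
dip_cluster = km.predict(dippers[:1])[0]
candidates = X[km.labels_ == dip_cluster]
print(f"{len(candidates)} of {len(X)} light curves go to the CNN")
```

Here only ~5% of the light curves reach the expensive classifier, which is the whole point of the pre-processing stage.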

K-means also works on telescopes themselves — cluster nodes by data quality metrics to automatically identify high-reliability vs suspect equipment without manual labelling.

ML Pipeline Chain for Exoplanet Detection

1. Ingestion: raw flux from 100 telescopes
2. BLS period folding: find strongest periodic signal
3. Feature extraction: dip depth, width, duration, symmetry
4. K-means: classify as "likely junk" / "variable star" / "potential transit"
5. CNN: score only the "potential transit" cluster → probability output
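The five stages above can be written down as a plain function chain. Every body here is a placeholder assumption so the control flow is concrete; real implementations would call BLS, a feature extractor like tsfresh, a trained KMeans, and a trained CNN.

```python
def ingest(sites):
    # Stage 1: gather (time, flux) series from each telescope (stubbed).
    return [{"site": s, "flux": [1.0, 0.99, 1.0]} for s in sites]

def fold(series):
    # Stage 2: BLS search; attach the strongest trial period (stubbed).
    for s in series:
        s["period"] = 3.7  # placeholder value
    return series

def extract_features(series):
    # Stage 3: dip depth, width, duration, symmetry (only depth stubbed).
    for s in series:
        s["features"] = {"depth": 1.0 - min(s["flux"]), "period": s["period"]}
    return series

def triage(series):
    # Stage 4: K-means routing -- keep only "potential transit".
    return [s for s in series if s["features"]["depth"] > 0.005]

def score(series):
    # Stage 5: CNN probability; stubbed as a constant.
    return [{**s, "p_transit": 0.5} for s in series]

candidates = score(triage(extract_features(fold(ingest(range(100))))))
print(len(candidates), "candidates scored")
```

The staged shape matters more than any single stage: each step cheaply shrinks the data volume so the expensive final scorer sees as little as possible.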

Key Battle-Tested Libraries

Deep Learning: PyTorch (research default), TensorFlow (production), JAX (physics-informed networks)

Classical ML: scikit-learn (Random Forest, K-means, SVM, PCA), XGBoost/LightGBM (tabular data — often beats neural nets)

Time-Series: tsfresh (automatic feature extraction), tslearn (DTW, time-series clustering)

Astronomy-Specific: AstroML (classical ML for astronomy), Astropy (data handling foundation), photutils / SEP (source detection), AstroCLIP (pretrained image embeddings), ASTROMER (pretrained light curve transformer)

Scheduling RL: Stable Baselines3 (standard RL algorithms, PyTorch-based), RLlib (large-scale distributed RL)

Deployment: ONNX (cross-framework model exchange), TensorRT (NVIDIA GPU inference optimisation), Docker (reproducible environments at telescope sites)

Edge ML Hardware at Telescope Sites

For sites that need local real-time inference (transient detection, quality assessment):

| Scale | Hardware | Cost | Capabilities |
|---|---|---|---|
| Single small scope | NVIDIA Jetson Nano / Orin Nano | $200–500 | Quality assessment, basic transient detection |
| Medium site | NVIDIA Jetson AGX Xavier/Orin | $700–2000 | Full local pipeline, preliminary data fusion |
| Major site | Compact server + RTX 4080/4090 | $3000–8000 | Fully autonomous, local model training |

Honest Limitations of ML

  • Garbage in, garbage out: Training data errors propagate into model behaviour
  • Black box problem: Neural networks cannot explain their decisions in scientifically meaningful terms — problematic for peer review
  • Distribution shift: When reality changes (new instrument, novel event type), models trained on old data fail silently
  • Rare objects: Deep learning needs many examples. For once-per-decade transients or genuinely new object types, ML provides no advantage over traditional methods
  • No physical understanding: ML cannot extrapolate beyond training distribution. A physics model works for stars never observed; an ML model may not
  • ML cannot replace: scientific judgment about what questions to ask, novel hypothesis generation, error checking, or adaptation to truly unprecedented situations