Merged from: AI and ideas.md, Edge ML.md (empty), Figuring out ML.md, ML for project.md. All originals preserved. This is the consolidated ML overview for the project.
# Machine Learning for OpenAstro
ML is not the core of OpenAstro — the physics is. But ML solves specific problems in the pipeline that rule-based systems can't handle well. This note covers: what ML is for, which architecture types apply where, and how to get started.
## Where ML Actually Helps in This Project
| Task | ML Approach | Why Not Rules? |
|---|---|---|
| Transient classification | CNN on image cutouts | Satellite trails, cosmic rays, real transients look subtly different |
| Quality flagging | Anomaly detection (autoencoder) | "Bad" data has no fixed definition — weather, tracking errors, electronics |
| Scheduling optimization | Reinforcement learning (future) | Greedy heuristic is fine for MVP; RL improves at scale |
| Period detection | Lomb-Scargle (not ML) | Statistical signal processing beats ML here — this is math, not learning |
| FRB candidate detection | CNN / RNN on time-series | Millisecond-scale patterns in high-cadence photometry |
| Alert stream filtering | Random forest / gradient boost | Filter ZTF/Gaia alert streams for targets relevant to network capabilities |
Priority order for implementation: Quality flagging first (every observation needs it), transient classification second, scheduling optimization last.
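The "not ML" row deserves a concrete illustration: a Lomb-Scargle periodogram recovers the period of an irregularly sampled light curve with no training at all. A minimal sketch using `scipy.signal.lombscargle` on synthetic data (all values invented for illustration):

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(4)

# Irregularly sampled light curve: 400 points over 20 days,
# a sinusoidal variable with a 2.7-day period plus noise.
t = np.sort(rng.uniform(0, 20, 400))
true_period = 2.7
flux = np.sin(2 * np.pi * t / true_period) + rng.normal(0, 0.3, t.size)

# Scan trial periods; lombscargle wants angular frequencies.
periods = np.linspace(1.0, 5.0, 2000)
ang_freqs = 2 * np.pi / periods
power = lombscargle(t, flux - flux.mean(), ang_freqs)

best = periods[np.argmax(power)]
print(f"recovered period: {best:.2f} d")
```

No labels, no training loop: the periodogram peak is pure signal processing, which is the point of the table row.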
## Architecture Decision: Which Model Type?
The open question from AI and ideas.md: "Which kind of model — RNN, CNN, reinforcement, or transformers?"
| Architecture | Use in OpenAstro | Feasibility |
|---|---|---|
| CNN | Image classification (transients, quality), 2D spectra | High — well-understood, modest compute, good pretrained models |
| RNN / LSTM | Time-series (light curves, TTV sequences) | Medium — good for sequences, but Transformers now dominate |
| Transformer | Light curve classification, FRB detection, multi-modal | High compute cost, but BERT-style pretrained models exist for time series |
| Reinforcement Learning | Scheduling (target assignment) | Overkill at MVP scale; revisit at 200+ sites |
| Random Forest / XGBoost | Alert filtering, quality flags | Low compute, high interpretability, good first choice |
Recommendation: Start with Random Forest/XGBoost for quality flagging (interpretable, fast). Use CNN for image-based transient classification. Transformers only when you have enough labeled data and GPU access.
On transformers specifically (from AI and ideas.md): they are expensive to train from scratch, but astronomy-specific pretrained models exist (AstroCLIP, Astromer for light curves) and can be fine-tuned cheaply.
## How Neural Networks Work (Foundation)
For anyone building intuition from scratch: a neural network is a series of weighted sums + nonlinearities. The star classification example:
- Input: Temperature (normalized), Luminosity (normalized)
- Output: Dwarf (0) or Giant (1)
- Learning: 5000 iterations of forward pass → error calculation → backpropagation via chain rule → weight update
The key insight: you never write the rule "high luminosity = Giant". The weights self-assign during training. The chain rule distributes blame backwards through the network.
Physics intuition: Forward pass = signal propagation. Backprop = measuring how each weight contributed to the error. Gradient descent = nudging each weight in the direction that reduces error.
Matrix math vs. explicit neurons: using `np.dot` to process all neurons simultaneously is equivalent to writing a for loop over individual `Neuron` objects; it is just GPU-efficient. `weights1 = np.random.rand(2, 3)` is literally 3 neurons with 2 weights each.
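The star-classification example above can be written out end to end in a few lines of NumPy: forward pass, chain-rule backprop, weight update. The toy data, seed, and learning rate are illustrative choices, not values from the original note:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy set: [temperature, luminosity] (normalized); label 0=Dwarf, 1=Giant.
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]])
y = np.array([[0.0], [0.0], [1.0], [1.0]])

# "3 neurons with 2 weights each" as one matrix, no Neuron objects needed.
W1 = rng.random((2, 3))
W2 = rng.random((3, 1))

lr = 1.0
for _ in range(5000):
    # Forward pass: signal propagation (weighted sums + nonlinearities).
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)
    # Backprop: chain rule distributes blame for the error backwards.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient descent: nudge each weight against its error gradient.
    W2 -= lr * h.T @ d_out
    W1 -= lr * X.T @ d_h

pred = sigmoid(sigmoid(X @ W1) @ W2)
print(np.round(pred.ravel(), 2))
```

Note that nowhere is the rule "high luminosity = Giant" written down; the weights converge to it during the 5000 iterations.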
## Practical Starting Point for OpenAstro
- Label ~1000 observations as good/bad quality manually — this is your training set
- Extract features: airmass, PSF FWHM, sky background, comparison star residuals, exposure time
- Train a Random Forest classifier — interpretable, works well with small labeled sets, no GPU needed
- Iterate: As more data accumulates, switch to a CNN on the actual image data for richer features
Don't start with transformers. Don't start with RL. Start with the simplest thing that can produce a useful quality flag. Upgrade when you have data and a clear bottleneck.
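Steps 1-3 above can be sketched as follows. The feature columns match the list; the data and the labelling rule are synthetic stand-ins for the ~1000 manually labelled observations:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for the manually labelled training set.
features = np.column_stack([
    rng.uniform(1.0, 2.5, n),   # airmass
    rng.uniform(1.5, 6.0, n),   # PSF FWHM (arcsec)
    rng.uniform(50, 500, n),    # sky background (ADU)
    rng.uniform(0.0, 0.1, n),   # comparison star residuals (mag RMS)
    rng.uniform(10, 300, n),    # exposure time (s)
])
# Toy quality rule, only so the example runs end to end:
labels = ((features[:, 1] < 4.0) & (features[:, 3] < 0.05)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)

print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
# Interpretability: which features drive the quality flag?
names = ["airmass", "fwhm", "sky", "resid", "exptime"]
for name, imp in zip(names, clf.feature_importances_):
    print(f"{name:8s} {imp:.2f}")
```

The `feature_importances_` printout is the payoff of choosing a forest over a neural net at this stage: the flag is auditable.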
## See Also
- Types of nets.md: comprehensive architecture reference (CNNs, RNNs, Transformers, Autoencoders, GNNs, GANs with maths)
- ML for project.md: high-level architecture (immutable data lake, distributed inference)
- Everything on FRBs.md: FRB-specific ML pipelines (FETCH classifier, DANCE clustering)
- NewOpenAstro/Tech/Distributed Computing Strategy.md: compute infrastructure for ML training/inference
## Distributed Network ML: Unique Challenges (from NewOpenAstro/Science/Machine Learning/Dumpppp.md and ML and Distributed telescope.md)
A distributed telescope network introduces ML challenges that single-observatory systems don't face.
### The Distributed Data Problem
Heterogeneous Conditions: Telescope A (India) sees through different atmosphere than Telescope B (Chile). The same galaxy imaged at both sites looks subtly different. A model trained on Site A data will have systematic errors at Site B. Solution: Transfer learning — train site-independent representations in early network layers, allow site-specific adaptation in later layers.
Temporal Asynchrony: It's always daytime somewhere. Events happen when only some telescopes can see them. ML must make fast local decisions while benefiting from global coordination.
Calibration Drift: Each telescope drifts differently over time. ML that learns site-specific characteristics automatically (rather than through manual characterisation) scales to hundreds of nodes.
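The "shared representation, site-specific adaptation" idea can be illustrated with a deliberately simple stand-in: a logistic model trained on plentiful Site A data whose coefficients are frozen, while only the bias is refit on a small labelled Site B sample. This is a toy analogue of freezing early network layers (a real pipeline would fine-tune a deep model in PyTorch); all data and the offset are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

def make_site(n, threshold):
    """Two features; label 1 when their sum exceeds a site threshold."""
    X = rng.uniform(0, 2, (n, 2))
    y = (X.sum(axis=1) > threshold).astype(int)
    return X, y

# Site A: large labelled set. Site B: same physics, but a systematic
# offset (e.g. different atmosphere) shifts the effective threshold.
XA, yA = make_site(2000, 1.0)
XB, yB = make_site(200, 1.4)

clf = LogisticRegression(max_iter=1000).fit(XA, yA)
acc_naive = clf.score(XB, yB)   # Site A model applied blindly at Site B

# "Freeze early layers, adapt late layers": keep the learned
# coefficients, grid-search only the bias on the small Site B sample.
w = clf.coef_.ravel()
def b_error(b):
    return np.mean(((XB @ w + b) > 0).astype(int) != yB)
best_bias = min(np.linspace(-40, 0, 401), key=b_error)
clf.intercept_ = np.array([best_bias])
acc_adapted = clf.score(XB, yB)

print(f"naive: {acc_naive:.2f}  adapted: {acc_adapted:.2f}")
```

The shared weights carry the transferable structure; the one refit parameter absorbs the site-specific systematic, which is the scaling argument for this approach across hundreds of nodes.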
### Light Curve Folding (Exoplanet Period Finding)
To detect an exoplanet transit from distributed telescope data:
1. Collect flux measurements from many sites over days/weeks
2. "Fold" the data: transform absolute timestamps into phase values using a trial period P: phase = (t mod P) / P
3. With the wrong P: data looks like noise
4. With the correct P: transit dips at the same phase (e.g., 0.5) from all sites line up — random noise cancels, signal accumulates
This is how Kepler and TESS discover planets. The brute-force period search (Box Least Squares / BLS algorithm) is the standard first step.
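The four steps above can be sketched end to end: inject a 1% transit into noisy multi-site flux, then brute-force trial periods and score each by the contrast of the binned, folded light curve. The scoring function is a crude stand-in for a real BLS implementation, and every numeric value here is invented:

```python
import numpy as np

def fold(t, period):
    """Step 2: phase = (t mod P) / P."""
    return np.mod(t, period) / period

# Step 1: simulated flux from many sites over ~30 days,
# with a transit every 3.21 days (1% deep, centred at phase 0.5).
rng = np.random.default_rng(1)
true_P = 3.21
t = np.sort(rng.uniform(0, 30, 2000))
flux = 1.0 + rng.normal(0, 0.002, t.size)
in_transit = np.abs(fold(t, true_P) - 0.5) < 0.02
flux[in_transit] -= 0.01

def contrast(period, n_bins=50):
    """Bin the folded light curve; at the right P the transit bin
    stands out, at wrong P the dip smears into the noise."""
    bins = np.floor(fold(t, period) * n_bins).astype(int)
    means = np.array([flux[bins == b].mean()
                      for b in range(n_bins) if np.any(bins == b)])
    return means.max() - means.min()

# Steps 3-4: brute-force period search.
trial_periods = np.linspace(2.5, 4.0, 301)
best = trial_periods[np.argmax([contrast(P) for P in trial_periods])]
print(f"recovered period: {best:.2f} d (true {true_P} d)")
```

At wrong trial periods the per-bin means hover at the noise floor; at the true period the transit bin drops by the full depth, so the signal accumulates exactly as described above.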
### K-Means for Pre-Processing
Before running expensive CNN classifiers, K-means clustering of light curve features (dip depth, dip width, symmetry, period) creates categories:
- Cluster 1: Flat lines (constant stars; skip)
- Cluster 2: Sine waves (pulsating variables)
- Cluster 3: Sharp periodic dips (eclipsing binaries or exoplanet candidates)
Feed only Cluster 3 to the CNN. Massive compute saving.
K-means also works on telescopes themselves — cluster nodes by data quality metrics to automatically identify high-reliability vs suspect equipment without manual labelling.
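A minimal sketch of the triage step with scikit-learn. The three synthetic populations and all feature values are invented for illustration, and K-means assigns arbitrary cluster indices, so the "transit" cluster has to be identified after fitting:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

def population(n, depth, width, depth_sd, width_sd):
    """Synthetic (dip depth, dip width, symmetry, period) rows."""
    return np.column_stack([
        rng.normal(depth, depth_sd, n),
        rng.normal(width, width_sd, n),
        rng.normal(0.0, 0.1, n),    # symmetry: uninformative here
        rng.uniform(1, 10, n),      # period: uninformative here
    ])

flat = population(300, depth=0.00, width=0.00,
                  depth_sd=0.002, width_sd=0.01)   # constant stars
sines = population(300, depth=0.05, width=0.50,
                   depth_sd=0.01, width_sd=0.05)   # smooth pulsators
transits = population(300, depth=0.05, width=0.05,
                      depth_sd=0.01, width_sd=0.01)  # sharp dips
X = np.vstack([flat, sines, transits])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(StandardScaler().fit_transform(X))

# Find the cluster holding the sharp-dip population and forward
# only those light curves to the CNN stage.
transit_cluster = np.bincount(labels[600:]).argmax()
to_cnn = X[labels == transit_cluster]
print(f"forwarded to CNN: {to_cnn.shape[0]} of {X.shape[0]}")
```

Roughly two thirds of the light curves never touch the CNN, which is the compute saving the note claims.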
### ML Pipeline Chain for Exoplanet Detection
1. Ingestion: raw flux from 100 telescopes
2. BLS period folding: find strongest periodic signal
3. Feature extraction: dip depth, width, duration, symmetry
4. K-means: classify as "likely junk" / "variable star" / "potential transit"
5. CNN: score only the "potential transit" cluster → probability output
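The chain can be sketched as a pipeline skeleton. Every function body below is a placeholder (including the hard-coded period and score), standing in for the real BLS, feature-extraction, K-means, and CNN stages:

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    site_count: int
    period: float = 0.0
    features: dict = field(default_factory=dict)
    cluster: str = ""
    score: float = 0.0

def ingest(raw):                      # 1. raw flux from N telescopes
    return Candidate(site_count=len(raw))

def bls_fold(c):                      # 2. strongest periodic signal
    c.period = 3.21                   # placeholder value
    return c

def extract_features(c):              # 3. dip shape statistics
    c.features = {"depth": 0.01, "width": 0.03, "symmetry": 0.0}
    return c

def kmeans_triage(c):                 # 4. cheap clustering gate
    c.cluster = ("potential transit"
                 if c.features["depth"] > 0.005 else "likely junk")
    return c

def cnn_score(c):                     # 5. score only surviving candidates
    if c.cluster == "potential transit":
        c.score = 0.93                # placeholder probability
    return c

cand = cnn_score(kmeans_triage(extract_features(bls_fold(
    ingest(range(100))))))
print(cand.cluster, cand.score)
```

The shape matters more than the bodies: each stage takes and returns the same candidate record, so stages can be swapped or distributed independently.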
### Key Battle-Tested Libraries
Deep Learning: PyTorch (research default), TensorFlow (production), JAX (physics-informed networks)
Classical ML: scikit-learn (Random Forest, K-means, SVM, PCA), XGBoost/LightGBM (tabular data — often beats neural nets)
Time-Series: tsfresh (automatic feature extraction), tslearn (DTW, time-series clustering)
Astronomy-Specific: AstroML (classical ML for astronomy), Astropy (data handling foundation), photutils / SEP (source detection), AstroCLIP (pretrained image embeddings), ASTROMER (pretrained light curve transformer)
Scheduling RL: Stable Baselines3 (standard RL algorithms, PyTorch-based), RLlib (large-scale distributed RL)
Deployment: ONNX (cross-framework model exchange), TensorRT (NVIDIA GPU inference optimisation), Docker (reproducible environments at telescope sites)
### Edge ML Hardware at Telescope Sites
For sites that need local real-time inference (transient detection, quality assessment):
| Scale | Hardware | Cost | Capabilities |
|---|---|---|---|
| Single small scope | NVIDIA Jetson Nano / Orin Nano | $200–500 | Quality assessment, basic transient detection |
| Medium site | NVIDIA Jetson AGX Xavier/Orin | $700–2000 | Full local pipeline, preliminary data fusion |
| Major site | Compact server + RTX 4080/4090 | $3000–8000 | Fully autonomous, local model training |
## Honest Limitations of ML
- Garbage in, garbage out: Training data errors propagate into model behaviour
- Black box problem: Neural networks cannot explain their decisions in scientifically meaningful terms — problematic for peer review
- Distribution shift: When reality changes (new instrument, novel event type), models trained on old data fail silently
- Rare objects: Deep learning needs many examples. For once-per-decade transients or genuinely new object types, ML provides no advantage over traditional methods
- No physical understanding: ML cannot extrapolate beyond training distribution. A physics model works for stars never observed; an ML model may not
- ML cannot replace: scientific judgment about what questions to ask, novel hypothesis generation, error checking, or adaptation to truly unprecedented situations