Merged from: AI and ideas.md + Edge ML.md (empty) + Figuring out ML.md + ML for project.md. All originals preserved. This is the consolidated ML overview for the project.

Machine Learning for OpenAstro

ML is not the core of OpenAstro — the physics is. But ML solves specific problems in the pipeline that rule-based systems can't handle well. This note covers: what ML is for, which architecture types apply where, and how to get started.


Where ML Actually Helps in This Project

| Task | ML Approach | Why Not Rules? |
|---|---|---|
| Transient classification | CNN on image cutouts | Satellite trails, cosmic rays, and real transients look subtly different |
| Quality flagging | Anomaly detection (autoencoder) | "Bad" data has no fixed definition — weather, tracking errors, electronics |
| Scheduling optimization | Reinforcement learning (future) | Greedy heuristic is fine for MVP; RL improves at scale |
| Period detection | Lomb-Scargle (not ML) | Statistical signal processing beats ML here — this is math, not learning |
| FRB candidate detection | CNN / RNN on time-series | Millisecond-scale patterns in high-cadence photometry |
| Alert stream filtering | Random forest / gradient boost | Filter ZTF/Gaia alert streams for targets relevant to network capabilities |

Priority order for implementation: Quality flagging first (every observation needs it), transient classification second, scheduling optimization last.


Architecture Decision: Which Model Type?

The open question from AI and ideas.md: "Which kind of model — RNN, CNN, reinforcement, or transformers?"

| Architecture | Use in OpenAstro | Feasibility |
|---|---|---|
| CNN | Image classification (transients, quality), 2D spectra | High — well-understood, modest compute, good pretrained models |
| RNN / LSTM | Time-series (light curves, TTV sequences) | Medium — good for sequences, but Transformers now dominate |
| Transformer | Light curve classification, FRB detection, multi-modal | High compute cost, but BERT-style pretrained models exist for time series |
| Reinforcement Learning | Scheduling (target assignment) | Overkill at MVP scale; revisit at 200+ sites |
| Random Forest / XGBoost | Alert filtering, quality flags | Low compute, high interpretability, good first choice |

Recommendation: Start with Random Forest/XGBoost for quality flagging (interpretable, fast). Use a CNN for image-based transient classification. Adopt Transformers only when you have enough labeled data and GPU access.

On transformers specifically (from AI and ideas.md): They are expensive to train from scratch but there are astronomy-specific pretrained models (AstroCLIP, Astromer for light curves) that can be fine-tuned cheaply.


How Neural Networks Work (Foundation)

For anyone building intuition from scratch: a neural network is a series of weighted sums + nonlinearities. The star classification example:

  • Input: Temperature (normalized), Luminosity (normalized)
  • Output: Dwarf (0) or Giant (1)
  • Learning: 5000 iterations of forward pass → error calculation → backpropagation via chain rule → weight update

The key insight: you never write the rule "high luminosity = Giant". The weights self-assign during training. The chain rule distributes blame backwards through the network.

Physics intuition: Forward pass = signal propagation. Backprop = measuring how each weight contributed to the error. Gradient descent = nudging each weight in the direction that reduces error.

Matrix math vs. explicit neurons: Using np.dot to process all neurons simultaneously is equivalent to writing a for loop over individual Neuron objects — it's just GPU-efficient. weights1 = np.random.rand(2, 3) is literally 3 neurons with 2 weights each.
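The star-classification example above can be sketched in a few lines of numpy. This is a minimal illustration, not a production model: the six data points, the learning rate, and the network size (2 inputs → 3 hidden → 1 output) are all made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy training set: [temperature, luminosity], both normalized to [0, 1].
# Dwarfs: hot but dim. Giants: cooler but luminous. Values are illustrative.
X = np.array([[0.9, 0.2], [0.8, 0.1], [0.7, 0.3],   # dwarfs -> 0
              [0.3, 0.9], [0.2, 0.8], [0.4, 0.7]])  # giants -> 1
y = np.array([[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# weights1 is "literally 3 neurons with 2 weights each".
weights1 = rng.random((2, 3))
weights2 = rng.random((3, 1))

lr = 1.0
for _ in range(5000):
    # Forward pass: signal propagation through the layers.
    hidden = sigmoid(X @ weights1)
    output = sigmoid(hidden @ weights2)

    # Error, then backpropagation via the chain rule.
    error = y - output
    d_output = error * output * (1 - output)
    d_hidden = (d_output @ weights2.T) * hidden * (1 - hidden)

    # Gradient descent: nudge each weight to reduce the error.
    weights2 += lr * hidden.T @ d_output
    weights1 += lr * X.T @ d_hidden

preds = sigmoid(sigmoid(X @ weights1) @ weights2)
print(np.round(preds.ravel(), 2))  # dwarfs near 0, giants near 1
```

No rule "high luminosity = Giant" appears anywhere; the weights self-assign during the 5000 iterations.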


Practical Starting Point for OpenAstro

  1. Label ~1000 observations as good/bad quality manually — this is your training set
  2. Extract features: airmass, PSF FWHM, sky background, comparison star residuals, exposure time
  3. Train a Random Forest classifier — interpretable, works well with small labeled sets, no GPU needed
  4. Iterate: As more data accumulates, switch to a CNN on the actual image data for richer features
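Step 3 above is a few lines with scikit-learn. The data here are synthetic stand-ins (the labelling rule is invented purely so the demo runs); in practice X comes from your ~1000 hand-labelled observations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Features per observation: airmass, PSF FWHM (arcsec), sky background
# (ADU), comparison-star residual (mag), exposure time (s).
X = np.column_stack([
    rng.uniform(1.0, 2.5, n),      # airmass
    rng.uniform(1.5, 6.0, n),      # PSF FWHM
    rng.uniform(100, 5000, n),     # sky background
    rng.normal(0.0, 0.05, n),      # comparison residuals
    rng.choice([30, 60, 120], n),  # exposure time
])

# Toy labelling rule (illustrative only): bad seeing or bright sky -> bad.
y = ((X[:, 1] > 4.5) | (X[:, 2] > 4000)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("accuracy:", clf.score(X_te, y_te))
# Interpretability: which features drive the quality flag?
print("importances:", np.round(clf.feature_importances_, 2))
```

The feature importances are the payoff of starting with a forest rather than a neural net: you can see immediately which measurement is driving the bad-quality flags.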

Don't start with transformers. Don't start with RL. Start with the simplest thing that can produce a useful quality flag. Upgrade when you have data and a clear bottleneck.


See Also

  • Types of nets.md — comprehensive architecture reference (CNNs, RNNs, Transformers, Autoencoders, GNNs, GANs with maths)
  • ML for project.md — high-level architecture (immutable data lake, distributed inference)
  • Everything on FRBs.md — FRB-specific ML pipelines (FETCH classifier, DANCE clustering)
  • NewOpenAstro/Tech/Distributed Computing Strategy.md — compute infrastructure for ML training/inference

Distributed Network ML — Unique Challenges (from NewOpenAstro/Science/Machine Learning/Dumpppp.md and ML and Distributed telescope.md)

A distributed telescope network introduces ML challenges that single-observatory systems don't face.

The Distributed Data Problem

Heterogeneous Conditions: Telescope A (India) sees through different atmosphere than Telescope B (Chile). The same galaxy imaged at both sites looks subtly different. A model trained on Site A data will have systematic errors at Site B. Solution: Transfer learning — train site-independent representations in early network layers, allow site-specific adaptation in later layers.

Temporal Asynchrony: It's always daytime somewhere. Events happen when only some telescopes can see them. ML must make fast local decisions while benefiting from global coordination.

Calibration Drift: Each telescope drifts differently over time. ML that learns site-specific characteristics automatically (rather than through manual characterisation) scales to hundreds of nodes.

Light Curve Folding (Exoplanet Period Finding)

To detect an exoplanet transit from distributed telescope data:

  1. Collect flux measurements from many sites over days/weeks
  2. "Fold" the data: transform absolute timestamps into phase values using a trial period P: phase = (t mod P) / P
  3. With the wrong P, the data looks like noise
  4. With the correct P, transit dips at the same phase (e.g., 0.5) from all sites line up — random noise cancels, signal accumulates
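The folding idea can be demonstrated end to end on simulated data. Real pipelines use Box Least Squares (e.g. astropy.timeseries.BoxLeastSquares); this sketch uses a cruder "deepest folded bin" statistic, and every number (period, dip depth, noise level) is made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(7)
P_true = 3.7                              # days (illustrative)

t = np.sort(rng.uniform(0, 60, 2000))     # irregular sampling, many sites
flux = np.ones_like(t)
in_transit = np.abs((t % P_true) / P_true - 0.5) < 0.02
flux[in_transit] -= 0.01                  # 1% transit dip at phase 0.5
flux += rng.normal(0, 0.003, t.size)      # photometric noise

def deepest_bin(period, nbins=50):
    # Fold at a trial period and bin. The correct period concentrates the
    # transit points into a few bins, producing a deep minimum; a wrong
    # period smears them out and the dip averages away.
    phase = (t % period) / period
    idx = np.minimum((phase * nbins).astype(int), nbins - 1)
    total = np.bincount(idx, weights=flux, minlength=nbins)
    count = np.bincount(idx, minlength=nbins)
    binned = np.where(count > 0, total / np.maximum(count, 1), 1.0)
    return binned.min()

trial_periods = np.arange(3.0, 4.5, 0.001)
depths = np.array([deepest_bin(p) for p in trial_periods])
best = trial_periods[depths.argmin()]
print(f"recovered period: {best:.3f} d (true: {P_true} d)")
```

Note the trial grid is deliberately restricted around the true period: folding statistics are degenerate at harmonics (P/2, 2P), which is why real BLS implementations report and vet harmonic aliases.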

This is how Kepler and TESS discover planets. The brute-force period search (Box Least Squares / BLS algorithm) is the standard first step.

K-Means for Pre-Processing

Before running expensive CNN classifiers, K-means clustering of light curve features (dip depth, dip width, symmetry, period) creates categories:

  • Cluster 1: Flat lines (constant stars — skip)
  • Cluster 2: Sine waves (pulsating variables)
  • Cluster 3: Sharp periodic dips (eclipsing binaries or exoplanet candidates)

Feed only Cluster 3 to the CNN. Massive compute saving.
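A sketch of that triage with scikit-learn, on synthetic feature vectors. The feature values and cluster shapes are invented for the demo; real features would come from the extraction step of the pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Feature order (an assumption): dip depth, dip width, symmetry, period.
flat    = rng.normal([0.00, 0.00, 0.5, 0.0], 0.01, size=(300, 4))  # constant stars
sines   = rng.normal([0.10, 0.50, 0.5, 2.0], 0.05, size=(100, 4))  # pulsators
dippers = rng.normal([0.05, 0.05, 0.9, 3.0], 0.02, size=(20, 4))   # transit-like
X = np.vstack([flat, sines, dippers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Route only the cluster containing sharp periodic dips to the CNN.
dip_cluster = km.predict(dippers[:1])[0]
candidates = X[km.labels_ == dip_cluster]
print(f"{len(candidates)} of {len(X)} light curves go to the CNN")
```

Here only ~5% of the light curves reach the expensive classifier, which is the whole point of the pre-processing stage.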

K-means also works on telescopes themselves — cluster nodes by data quality metrics to automatically identify high-reliability vs suspect equipment without manual labelling.

ML Pipeline Chain for Exoplanet Detection

1. Ingestion: raw flux from 100 telescopes
2. BLS period folding: find strongest periodic signal
3. Feature extraction: dip depth, width, duration, symmetry
4. K-means: classify as "likely junk" / "variable star" / "potential transit"
5. CNN: score only the "potential transit" cluster → probability output
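The five stages above can be written down as a plain function chain. Every body here is a placeholder assumption so the control flow is concrete; real implementations would call BLS, a feature extractor like tsfresh, a trained KMeans, and a trained CNN.

```python
def ingest(sites):
    # Stage 1: gather (time, flux) series from each telescope (stubbed).
    return [{"site": s, "flux": [1.0, 0.99, 1.0]} for s in sites]

def fold(series):
    # Stage 2: BLS search; attach the strongest trial period (stubbed).
    for s in series:
        s["period"] = 3.7  # placeholder value
    return series

def extract_features(series):
    # Stage 3: dip depth, width, duration, symmetry (only depth stubbed).
    for s in series:
        s["features"] = {"depth": 1.0 - min(s["flux"]), "period": s["period"]}
    return series

def triage(series):
    # Stage 4: K-means routing -- keep only "potential transit".
    return [s for s in series if s["features"]["depth"] > 0.005]

def score(series):
    # Stage 5: CNN probability; stubbed as a constant.
    return [{**s, "p_transit": 0.5} for s in series]

candidates = score(triage(extract_features(fold(ingest(range(100))))))
print(len(candidates), "candidates scored")
```

The staged shape matters more than any single stage: each step cheaply shrinks the data volume so the expensive final scorer sees as little as possible.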

Key Battle-Tested Libraries

Deep Learning: PyTorch (research default), TensorFlow (production), JAX (physics-informed networks)

Classical ML: scikit-learn (Random Forest, K-means, SVM, PCA), XGBoost/LightGBM (tabular data — often beats neural nets)

Time-Series: tsfresh (automatic feature extraction), tslearn (DTW, time-series clustering)

Astronomy-Specific: AstroML (classical ML for astronomy), Astropy (data handling foundation), photutils / SEP (source detection), AstroCLIP (pretrained image embeddings), ASTROMER (pretrained light curve transformer)

Scheduling RL: Stable Baselines3 (standard RL algorithms, PyTorch-based), RLlib (large-scale distributed RL)

Deployment: ONNX (cross-framework model exchange), TensorRT (NVIDIA GPU inference optimisation), Docker (reproducible environments at telescope sites)

Edge ML Hardware at Telescope Sites

For sites that need local real-time inference (transient detection, quality assessment):

| Scale | Hardware | Cost | Capabilities |
|---|---|---|---|
| Single small scope | NVIDIA Jetson Nano / Orin Nano | $200–500 | Quality assessment, basic transient detection |
| Medium site | NVIDIA Jetson AGX Xavier/Orin | $700–2000 | Full local pipeline, preliminary data fusion |
| Major site | Compact server + RTX 4080/4090 | $3000–8000 | Fully autonomous, local model training |

Honest Limitations of ML

  • Garbage in, garbage out: Training data errors propagate into model behaviour
  • Black box problem: Neural networks cannot explain their decisions in scientifically meaningful terms — problematic for peer review
  • Distribution shift: When reality changes (new instrument, novel event type), models trained on old data fail silently
  • Rare objects: Deep learning needs many examples. For once-per-decade transients or genuinely new object types, ML provides no advantage over traditional methods
  • No physical understanding: ML cannot extrapolate beyond training distribution. A physics model works for stars never observed; an ML model may not
  • ML cannot replace: scientific judgment about what questions to ask, novel hypothesis generation, error checking, or adaptation to truly unprecedented situations