*Sourced from: dump.md, massive info dump.md*
# Cloud-Scale Architecture for OpenAstro
This document covers the serverless / cloud-native architecture the network must adopt once it grows beyond a handful of nodes and a single cheap VPS can no longer handle the volatility and load. The MVP stack (FastAPI + SQLite on Hetzner) is documented in `mvp_server_guide.md`; this file is the upgrade path.
## Core Principle: Pay-Per-Photon
The goal is to eliminate idle compute time entirely. Every infrastructure component should cost nothing when telescopes are offline (daytime, clouds) and scale automatically when hundreds of nodes are active. This is achieved through serverless and event-driven patterns.
## Component Architecture
| Component | Cloud Solution | Cost/Scaling Strategy |
|---|---|---|
| Global Scheduler Logic | Function-as-a-Service (FaaS) — AWS Lambda, Google Cloud Functions | Zero idle cost: the function runs only for the milliseconds it takes to assign a mission or process a heartbeat. |
| Persistence Database | NoSQL Document Store — AWS DynamoDB, Google Firestore | Optimised for fast reads and horizontal scalability. Billed per read/write, not by provisioned server size. Handles constant heartbeat pings without needing a large instance. |
| Message Queue | AWS SQS or Google Pub/Sub | Decoupling: the scheduler drops commands into the queue and agents pull them. If one node fails, the queue buffers the command rather than crashing the scheduler. Billed per message. |
| Raw Data Archival | Object Storage — AWS S3, Google Cloud Storage | Extremely cheap for petabytes of raw FITS data. Primary cost is storage + egress on first download. |
| Stacking / Heavy Compute | Containerised short-lived tasks — AWS Fargate, Google Cloud Run | Spins up hundreds of container instances in parallel for a big stacking job, then shuts them down immediately. Massive power burst for minutes of cost. |
| Workflow Orchestration | AWS Step Functions, Google Cloud Workflows | Tracks which telescope delivered which exposure chunk; automatically calculates remaining exposure needed and reassigns. |
## Event-Driven Pipeline Flow
1. Telescope finishes an image → pushes FITS file to Object Storage (S3/GCS)
2. Successful upload triggers a Cloud Function (FaaS)
3. FaaS performs: metadata extraction, WCS header check, writes record to NoSQL DB
4. When mission Status Flag → COMPLETE (all exposure met), stacking pipeline is triggered
5. Containerised stacking tasks spin up, collect all FITS tagged with mission UUID, reproject and co-add
6. Final stacked FITS written back to Object Storage and torrent-seeded
Nothing in this flow requires a long-running server. Cost is proportional to actual work done.
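The ingestion step (3) and the completion check (4) can be sketched in plain Python. This is a hypothetical illustration only: a real deployment would run it as an AWS Lambda triggered by an S3 `ObjectCreated` event and persist records to DynamoDB, but here the database is a plain dict and the object-key layout (`missions/<uuid>/<frame>.fits`), the per-frame exposure field, and the 60-minute budget are all assumptions, not part of the source.

```python
def handle_upload_event(event: dict, db: dict) -> dict:
    """Sketch of the FaaS ingestion step: extract the mission UUID from the
    uploaded object's key, record the frame, and flag the mission COMPLETE
    once its exposure budget is met (which would trigger stacking)."""
    record = event["Records"][0]["s3"]
    key = record["object"]["key"]            # e.g. missions/<uuid>/frame_001.fits
    mission_id = key.split("/")[1]
    mission = db.setdefault(mission_id, {
        "required_s": 3600,                  # assumed 60-minute exposure budget
        "collected_s": 0,
        "frames": [],
        "status": "RUNNING",
    })
    mission["frames"].append(key)
    mission["collected_s"] += event.get("exposure_s", 300)   # assumed 300 s frames
    if mission["collected_s"] >= mission["required_s"]:
        mission["status"] = "COMPLETE"       # step 4: stacking pipeline fires here
    return mission
```

Because the handler is stateless between invocations (all state lives in the database), it costs nothing between uploads, which is exactly the pay-per-photon property.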
## Hot-Plug / Volatile Node Architecture
Amateur nodes go offline without warning. The architecture must handle this without human intervention.
### Heartbeat Monitoring
- Every node sends a heartbeat to the message queue every 15–30 seconds containing: `node_id`, `state` (ONLINE/BUSY/PARKED), `current_target`, `weather_ok`
- If three consecutive heartbeats are missed, a FaaS function automatically marks that node as FAULTED
- The scheduler removes it from the available pool and reassigns pending tasks
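The watchdog logic above can be sketched as follows. This is a minimal in-memory illustration, not the actual implementation; the class and field names are invented, and in production the sweep would run as a scheduled FaaS invocation against the NoSQL node table.

```python
HEARTBEAT_INTERVAL_S = 30   # upper bound of the 15-30 s interval in the text
MISSED_LIMIT = 3            # three missed heartbeats => FAULTED

class HeartbeatMonitor:
    """Marks a node FAULTED once three heartbeat intervals elapse
    without a ping, so the scheduler can reassign its pending tasks."""

    def __init__(self):
        self.last_seen = {}   # node_id -> timestamp of last heartbeat
        self.state = {}       # node_id -> "ONLINE" / "FAULTED"

    def heartbeat(self, node_id: str, now: float) -> None:
        self.last_seen[node_id] = now
        self.state[node_id] = "ONLINE"

    def sweep(self, now: float) -> list:
        """Run periodically; returns nodes newly marked FAULTED."""
        faulted = []
        for node_id, seen in self.last_seen.items():
            silent_for = now - seen
            if self.state[node_id] == "ONLINE" and \
                    silent_for > MISSED_LIMIT * HEARTBEAT_INTERVAL_S:
                self.state[node_id] = "FAULTED"
                faulted.append(node_id)
        return faulted
```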
### Decoupled Command Flow
- The scheduler never waits synchronously for a telescope to respond
- Commands are placed in the message queue; the telescope agent consumes them when ready
- If the agent stops, its unacknowledged commands remain in the queue and can be re-consumed by a replacement node
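The queue semantics that make this work are SQS-style visibility timeouts: a consumed command stays hidden until it is acknowledged, and reappears if the consumer dies. The sketch below simulates that behaviour in memory purely to illustrate the redelivery guarantee; the class and the 60-second timeout are assumptions, and a real deployment would use SQS/Pub/Sub directly.

```python
import collections
import itertools

class CommandQueue:
    """In-memory stand-in for SQS/Pub/Sub delivery semantics: received
    commands are invisible until acked; unacked commands become
    deliverable again after the visibility timeout expires."""
    VISIBILITY_S = 60.0

    def __init__(self):
        self._ids = itertools.count()
        self._pending = collections.deque()   # (msg_id, body)
        self._inflight = {}                   # msg_id -> (body, redelivery deadline)

    def send(self, body) -> None:
        self._pending.append((next(self._ids), body))

    def receive(self, now: float):
        # Requeue any in-flight command whose visibility timeout expired.
        for msg_id, (body, deadline) in list(self._inflight.items()):
            if now >= deadline:
                del self._inflight[msg_id]
                self._pending.appendleft((msg_id, body))
        if not self._pending:
            return None
        msg_id, body = self._pending.popleft()
        self._inflight[msg_id] = (body, now + self.VISIBILITY_S)
        return msg_id, body

    def ack(self, msg_id) -> None:
        self._inflight.pop(msg_id, None)      # delete = command fully handled
```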
### Plugging In
A new node registers via the API, sends its first heartbeat, and begins polling the queue. No central downtime or reconfiguration required.
## Distributed Compute via BOINC (Volunteer CPU)
For the most computationally intensive tasks (WCS reprojection, exoplanet transit MCMC fitting, image deconvolution), cloud compute is expensive. The alternative is volunteer CPU via BOINC (Berkeley Open Infrastructure for Network Computing).
Why BOINC fits:
The ideal BOINC job has a low-bandwidth input and a high compute time — the volunteer's machine does the heavy lifting while we only pay for cheap cloud storage and the tiny task/result messages.
| Task Type | Input | Output | Compute Intensity |
|---|---|---|---|
| WCS Reprojection | 1 FITS image (~48 MB) + master WCS (1 KB) | Reprojected FITS or transformation matrix | High — complex pixel grid interpolation |
| Exoplanet Transit Fitting | Light curve file (~500 KB) | MCMC posterior parameters | Very high — billions of iterations |
| Artifact Classification | Image cutout (~200 KB) | Classification label + confidence score | Moderate — CNN inference |
| Asteroid Orbit Fitting | Astrometric positions (~1 KB) | Orbital element set | High — numerical propagation |
BOINC validation: The same work unit is sent to multiple volunteers. If two results match, the result is accepted. This handles untrusted, volunteer hardware.
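The replication check can be sketched as follows. This is an illustrative simplification of BOINC's validator concept, with invented names and an assumed numeric tolerance; real BOINC validators are project-specific C++ plugins and comparison rules vary by task type.

```python
def validate_work_unit(results: dict, quorum: int = 2, tol: float = 1e-6):
    """Replication-based validation: the same work unit goes to several
    volunteers; a numeric result becomes canonical once `quorum` hosts
    agree within `tol`. Returns the agreed value, or None if no quorum
    (in which case the work unit would be sent to additional hosts)."""
    values = list(results.values())
    for v in values:
        agreeing = [w for w in values if abs(w - v) <= tol]
        if len(agreeing) >= quorum:
            return sum(agreeing) / len(agreeing)   # canonical result
    return None
```

Floating-point results from heterogeneous volunteer hardware rarely match bit-for-bit, which is why agreement is defined with a tolerance rather than strict equality.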
Cost model: We pay only for: (1) cloud storage of raw files, (2) compute for the event-driven FaaS functions, and (3) initial egress for the first few torrent downloads that become seeds.
## Open Data Distribution via BitTorrent
Distributing stacked FITS files via BitTorrent eliminates cloud egress costs after the first few seeds are established.
- **Data integrity:** The `.torrent` file contains a cryptographic hash of every data chunk (SHA-1 in BitTorrent v1, SHA-256 in v2), so the trustworthiness of the science data is guaranteed regardless of which peer served it.
- **Seeding:** The central cloud infrastructure permanently seeds all torrents. Community seed boxes are encouraged as redundancy.
- **Format:** Distribute final stacked FITS only, not individual raw frames. Use FITS tile compression (`fpack`) to reduce file sizes by 30–50%.
## Scaling the Scheduler: Asynchronous Workflow Orchestration
The hot-plug requirement demands a workflow manager that can track partial progress across hundreds of parallel missions.
- Each long observation request is broken into tasks tracked in the NoSQL database
- A workflow orchestration service (Step Functions / Cloud Workflows) maintains the state machine for each observation
- Partial completion is a first-class state: if a telescope delivers 30 min of a required 60 min, that 30 min is archived immediately and the scheduler creates a new task for the remaining 30 min
This ensures every photon collected is saved even if the telescope shuts down mid-exposure.
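The partial-completion rule can be sketched as a single state transition. This is a hypothetical illustration of the bookkeeping, not the Step Functions definition itself; the field names and task shape are invented for the example.

```python
def record_delivery(task: dict, delivered_s: int):
    """Archive whatever exposure was delivered; if the budget is not met,
    emit a follow-up task for only the remaining exposure so the scheduler
    can reassign it to another node."""
    task["collected_s"] += delivered_s
    remaining = task["required_s"] - task["collected_s"]
    if remaining <= 0:
        task["status"] = "COMPLETE"
        return task, None
    task["status"] = "PARTIAL"        # delivered frames are already archived
    followup = {
        "mission_id": task["mission_id"],
        "required_s": remaining,      # only the missing exposure is reassigned
        "collected_s": 0,
        "status": "PENDING",
    }
    return task, followup
```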