*Sourced from: dump.md, massive info dump.md*
# Cloud-Scale Architecture for OpenAstro
This document covers the serverless / cloud-native architecture the network must adopt once it grows beyond a handful of nodes and a single cheap VPS can no longer handle the volatility and load. The MVP stack (FastAPI + SQLite on Hetzner) is documented in `mvp_server_guide.md`; this file is the upgrade path.
## Core Principle: Pay-Per-Photon
The goal is to eliminate idle compute time entirely. Every infrastructure component should cost nothing when telescopes are offline (daytime, clouds) and scale automatically when hundreds of nodes are active. This is achieved through serverless and event-driven patterns.
## Component Architecture
| Component | Cloud Solution | Cost/Scaling Strategy |
|---|---|---|
| Global Scheduler Logic | Function-as-a-Service (FaaS) — AWS Lambda, Google Cloud Functions | Zero idle cost: the function runs only for the milliseconds it takes to assign a mission or process a heartbeat. |
| Persistence Database | NoSQL Document Store — AWS DynamoDB, Google Firestore | Optimised for fast reads and horizontal scalability. Billed per read/write, not by provisioned server size. Handles constant heartbeat pings without needing a large instance. |
| Message Queue | AWS SQS or Google Pub/Sub | Decoupling: the scheduler drops commands into the queue and agents pull them. If one node fails, the queue buffers the command rather than crashing the scheduler. Billed per message. |
| Raw Data Archival | Object Storage — AWS S3, Google Cloud Storage | Extremely cheap for petabytes of raw FITS data. Primary cost is storage + egress on first download. |
| Stacking / Heavy Compute | Containerised short-lived tasks — AWS Fargate, Google Cloud Run | Spins up hundreds of container instances in parallel for a big stacking job, then shuts them down immediately. Massive power burst for minutes of cost. |
| Workflow Orchestration | AWS Step Functions, Google Cloud Workflows | Tracks which telescope delivered which exposure chunk; automatically calculates remaining exposure needed and reassigns. |
## Event-Driven Pipeline Flow
1. Telescope finishes an image → pushes FITS file to Object Storage (S3/GCS)
2. Successful upload triggers a Cloud Function (FaaS)
3. FaaS performs: metadata extraction, WCS header check, writes record to NoSQL DB
4. When mission Status Flag → COMPLETE (all exposure met), stacking pipeline is triggered
5. Containerised stacking tasks spin up, collect all FITS tagged with mission UUID, reproject and co-add
6. Final stacked FITS written back to Object Storage and torrent-seeded
Nothing in this flow requires a long-running server. Cost is proportional to actual work done.
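The ingestion step (3) and the completion check (4) can be sketched in plain Python. This is a hypothetical illustration only: a real deployment would run it as an AWS Lambda triggered by an S3 `ObjectCreated` event and persist records to DynamoDB, but here the database is a plain dict and the object-key layout (`missions/<uuid>/<frame>.fits`), the per-frame exposure field, and the 60-minute budget are all assumptions, not part of the source.

```python
def handle_upload_event(event: dict, db: dict) -> dict:
    """Sketch of the FaaS ingestion step: extract the mission UUID from the
    uploaded object's key, record the frame, and flag the mission COMPLETE
    once its exposure budget is met (which would trigger stacking)."""
    record = event["Records"][0]["s3"]
    key = record["object"]["key"]            # e.g. missions/<uuid>/frame_001.fits
    mission_id = key.split("/")[1]
    mission = db.setdefault(mission_id, {
        "required_s": 3600,                  # assumed 60-minute exposure budget
        "collected_s": 0,
        "frames": [],
        "status": "RUNNING",
    })
    mission["frames"].append(key)
    mission["collected_s"] += event.get("exposure_s", 300)   # assumed 300 s frames
    if mission["collected_s"] >= mission["required_s"]:
        mission["status"] = "COMPLETE"       # step 4: stacking pipeline fires here
    return mission
```

Because the handler is stateless between invocations (all state lives in the database), it costs nothing between uploads, which is exactly the pay-per-photon property.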
## Hot-Plug / Volatile Node Architecture
Amateur nodes go offline without warning. The architecture must handle this without human intervention.
### Heartbeat Monitoring
- Every node sends a heartbeat to the message queue every 15–30 seconds containing: `node_id`, `state` (ONLINE/BUSY/PARKED), `current_target`, `weather_ok`
- If three consecutive heartbeats are missed, a FaaS function automatically marks that node as FAULTED
- The scheduler removes it from the available pool and reassigns pending tasks
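The watchdog logic above can be sketched as follows. This is a minimal in-memory illustration, not the actual implementation; the class and field names are invented, and in production the sweep would run as a scheduled FaaS invocation against the NoSQL node table.

```python
HEARTBEAT_INTERVAL_S = 30   # upper bound of the 15-30 s interval in the text
MISSED_LIMIT = 3            # three missed heartbeats => FAULTED

class HeartbeatMonitor:
    """Marks a node FAULTED once three heartbeat intervals elapse
    without a ping, so the scheduler can reassign its pending tasks."""

    def __init__(self):
        self.last_seen = {}   # node_id -> timestamp of last heartbeat
        self.state = {}       # node_id -> "ONLINE" / "FAULTED"

    def heartbeat(self, node_id: str, now: float) -> None:
        self.last_seen[node_id] = now
        self.state[node_id] = "ONLINE"

    def sweep(self, now: float) -> list:
        """Run periodically; returns nodes newly marked FAULTED."""
        faulted = []
        for node_id, seen in self.last_seen.items():
            silent_for = now - seen
            if self.state[node_id] == "ONLINE" and \
                    silent_for > MISSED_LIMIT * HEARTBEAT_INTERVAL_S:
                self.state[node_id] = "FAULTED"
                faulted.append(node_id)
        return faulted
```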
### Decoupled Command Flow
- The scheduler never waits synchronously for a telescope to respond
- Commands are placed in the message queue; the telescope agent consumes them when ready
- If the agent stops, its unacknowledged commands remain in the queue and can be re-consumed by a replacement node
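The queue semantics that make this work are SQS-style visibility timeouts: a consumed command stays hidden until it is acknowledged, and reappears if the consumer dies. The sketch below simulates that behaviour in memory purely to illustrate the redelivery guarantee; the class and the 60-second timeout are assumptions, and a real deployment would use SQS/Pub/Sub directly.

```python
import collections
import itertools

class CommandQueue:
    """In-memory stand-in for SQS/Pub/Sub delivery semantics: received
    commands are invisible until acked; unacked commands become
    deliverable again after the visibility timeout expires."""
    VISIBILITY_S = 60.0

    def __init__(self):
        self._ids = itertools.count()
        self._pending = collections.deque()   # (msg_id, body)
        self._inflight = {}                   # msg_id -> (body, redelivery deadline)

    def send(self, body) -> None:
        self._pending.append((next(self._ids), body))

    def receive(self, now: float):
        # Requeue any in-flight command whose visibility timeout expired.
        for msg_id, (body, deadline) in list(self._inflight.items()):
            if now >= deadline:
                del self._inflight[msg_id]
                self._pending.appendleft((msg_id, body))
        if not self._pending:
            return None
        msg_id, body = self._pending.popleft()
        self._inflight[msg_id] = (body, now + self.VISIBILITY_S)
        return msg_id, body

    def ack(self, msg_id) -> None:
        self._inflight.pop(msg_id, None)      # delete = command fully handled
```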
### Plugging In
A new node registers via the API, sends its first heartbeat, and begins polling the queue. No central downtime or reconfiguration required.
## Distributed Compute via BOINC (Volunteer CPU)
For the most computationally intensive tasks (WCS reprojection, exoplanet transit MCMC fitting, image deconvolution), cloud compute is expensive. The alternative is volunteer CPU via BOINC (Berkeley Open Infrastructure for Network Computing).
Why BOINC fits:
The ideal BOINC job has a low-bandwidth input and a high compute time — the volunteer's machine does the heavy lifting while we only pay for cheap cloud storage and the tiny task/result messages.
| Task Type | Input | Output | Compute Intensity |
|---|---|---|---|
| WCS Reprojection | 1 FITS image (~48 MB) + master WCS (1 KB) | Reprojected FITS or transformation matrix | High — complex pixel grid interpolation |
| Exoplanet Transit Fitting | Light curve file (~500 KB) | MCMC posterior parameters | Very high — billions of iterations |
| Artifact Classification | Image cutout (~200 KB) | Classification label + confidence score | Moderate — CNN inference |
| Asteroid Orbit Fitting | Astrometric positions (~1 KB) | Orbital element set | High — numerical propagation |
BOINC validation: The same work unit is sent to multiple volunteers. If two results match, the result is accepted. This handles untrusted, volunteer hardware.
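The replication check can be sketched as follows. This is an illustrative simplification of BOINC's validator concept, with invented names and an assumed numeric tolerance; real BOINC validators are project-specific C++ plugins and comparison rules vary by task type.

```python
def validate_work_unit(results: dict, quorum: int = 2, tol: float = 1e-6):
    """Replication-based validation: the same work unit goes to several
    volunteers; a numeric result becomes canonical once `quorum` hosts
    agree within `tol`. Returns the agreed value, or None if no quorum
    (in which case the work unit would be sent to additional hosts)."""
    values = list(results.values())
    for v in values:
        agreeing = [w for w in values if abs(w - v) <= tol]
        if len(agreeing) >= quorum:
            return sum(agreeing) / len(agreeing)   # canonical result
    return None
```

Floating-point results from heterogeneous volunteer hardware rarely match bit-for-bit, which is why agreement is defined with a tolerance rather than strict equality.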
Cost model: We pay only for: (1) cloud storage of raw files, (2) compute for the event-driven FaaS functions, and (3) initial egress for the first few torrent downloads that become seeds.
## Open Data Distribution via BitTorrent
Distributing stacked FITS files via BitTorrent eliminates cloud egress costs after the first few seeds are established.
- **Data integrity:** The `.torrent` file contains a cryptographic hash of every data chunk (SHA-1 in BitTorrent v1, SHA-256 in v2), so the trustworthiness of the science data is guaranteed regardless of which peer served it.
- **Seeding:** The central cloud infrastructure permanently seeds all torrents. Community seed boxes are encouraged as redundancy.
- **Format:** Distribute final stacked FITS only, not individual raw frames. Use FITS tile compression (`fpack`) to reduce file sizes by 30–50%.
## Scaling the Scheduler: Asynchronous Workflow Orchestration
The hot-plug requirement demands a workflow manager that can track partial progress across hundreds of parallel missions.
- Each long observation request is broken into tasks tracked in the NoSQL database
- A workflow orchestration service (Step Functions / Cloud Workflows) maintains the state machine for each observation
- Partial completion is a first-class state: if a telescope delivers 30 min of a required 60 min, that 30 min is archived immediately and the scheduler creates a new task for the remaining 30 min
This ensures every photon collected is saved even if the telescope shuts down mid-exposure.
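The partial-completion rule can be sketched as a single state transition. This is a hypothetical illustration of the bookkeeping, not the Step Functions definition itself; the field names and task shape are invented for the example.

```python
def record_delivery(task: dict, delivered_s: int):
    """Archive whatever exposure was delivered; if the budget is not met,
    emit a follow-up task for only the remaining exposure so the scheduler
    can reassign it to another node."""
    task["collected_s"] += delivered_s
    remaining = task["required_s"] - task["collected_s"]
    if remaining <= 0:
        task["status"] = "COMPLETE"
        return task, None
    task["status"] = "PARTIAL"        # delivered frames are already archived
    followup = {
        "mission_id": task["mission_id"],
        "required_s": remaining,      # only the missing exposure is reassigned
        "collected_s": 0,
        "status": "PENDING",
    }
    return task, followup
```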