Methodology — how Luxstay collects and validates data

The pipeline

01
Curate destinations
A human-curated list of destinations defines the Year-1 catalog (60 Vietnam destinations across three priority tiers). Curation considers search demand, traveler intent, and geographic coverage — not editorial favouritism.
02
Hydrate from open sources
For every destination we fetch records from GeoNames (geography, population, timezone), Wikidata (cross-locale identifiers), Wikipedia (narrative summaries), and OpenStreetMap (POIs). Raw payloads are attributed to a row in the entity_sources audit table so we can trace every fact back to a source.
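As a rough illustration of the audit trail described above, the sketch below stores one raw payload per source row. Only the table name entity_sources comes from the text; the column names, the record_source helper, and the sample payload are assumptions made for this example, and an in-memory SQLite database stands in for the real store.

```python
import hashlib
import json
import sqlite3

# Hypothetical columns: only the table name entity_sources comes from the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE entity_sources (
        id INTEGER PRIMARY KEY,
        entity_id TEXT NOT NULL,       -- destination the payload describes
        source TEXT NOT NULL,          -- e.g. 'geonames', 'wikidata'
        payload_sha256 TEXT NOT NULL,  -- fingerprint of the raw payload
        raw_payload TEXT NOT NULL,
        fetched_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def record_source(entity_id: str, source: str, payload: dict) -> int:
    """Store one raw payload and return the id of its audit row."""
    raw = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT INTO entity_sources (entity_id, source, payload_sha256, raw_payload) "
        "VALUES (?, ?, ?, ?)",
        (entity_id, source, digest, raw),
    )
    conn.commit()
    return cur.lastrowid


# Illustrative payload, not real GeoNames output.
row_id = record_source("da-nang", "geonames", {"timezone": "Asia/Ho_Chi_Minh"})
source = conn.execute(
    "SELECT source FROM entity_sources WHERE id = ?", (row_id,)
).fetchone()[0]
```

Any fact stored later can carry row_id as a foreign key, which is what would make tracing it back to a source a single lookup.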
03
AI structuring (extract, never invent)
Source data is fed to Anthropic's Claude with a tightly-scoped extraction prompt. The model is instructed to extract — not invent — factual values. Where source data is missing, the field is omitted rather than hallucinated. Each call returns a JSON document validated against a Zod schema before it touches the database.
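The validate-before-write step might look roughly like the following. The production check is a Zod schema that is not shown on this page, so this is a hand-rolled Python stand-in; the field names in SCHEMA and the validate_extraction helper are hypothetical.

```python
import json

# Hypothetical field list; the real schema is defined in Zod and not shown here.
# Each entry maps a field name to (expected type, required?).
SCHEMA = {
    "name": (str, True),
    "population": (int, False),
    "timezone": (str, False),
}


def validate_extraction(raw_json: str) -> dict:
    """Parse a model response and return only schema-conformant fields."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    unknown = set(data) - set(SCHEMA)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    clean = {}
    for field, (expected_type, required) in SCHEMA.items():
        value = data.get(field)
        if value is None:
            if required:
                raise ValueError(f"missing required field: {field}")
            continue  # optional and absent: omitted, never defaulted or guessed
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"wrong type for field: {field}")
        clean[field] = value
    return clean


# The source had no population figure, so the field is simply absent.
result = validate_extraction('{"name": "Hoi An", "timezone": "Asia/Ho_Chi_Minh"}')
```

Rejecting unknown fields is a design choice in this sketch: it keeps a model response from smuggling in values the schema never asked for.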
04
Cost & audit logging
Every Claude call writes a row to the content_generations table recording the model, prompt version, token counts, latency, and cost in USD. This makes content economics transparent and lets us regenerate specific pages without re-running the whole catalog.
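A minimal sketch of such a log row is shown below. The table name content_generations and the recorded quantities come from the text; the column names, the GenerationLog class, the model label, and above all the per-token prices are placeholders chosen for illustration, not real pricing.

```python
import sqlite3
from dataclasses import dataclass

# Placeholder prices in USD per million tokens; NOT real pricing.
PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


@dataclass
class GenerationLog:
    page_id: str
    model: str
    prompt_version: str
    input_tokens: int
    output_tokens: int
    latency_ms: int

    @property
    def cost_usd(self) -> float:
        """Cost of one call, derived from token counts and the price table."""
        price = PRICE_PER_MTOK[self.model]
        total = self.input_tokens * price["input"] + self.output_tokens * price["output"]
        return round(total / 1_000_000, 6)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content_generations ("
    "id INTEGER PRIMARY KEY, page_id TEXT, model TEXT, prompt_version TEXT, "
    "input_tokens INTEGER, output_tokens INTEGER, latency_ms INTEGER, cost_usd REAL)"
)


def log_generation(log: GenerationLog) -> None:
    """Write one row per model call."""
    conn.execute(
        "INSERT INTO content_generations "
        "(page_id, model, prompt_version, input_tokens, output_tokens, latency_ms, cost_usd) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (log.page_id, log.model, log.prompt_version, log.input_tokens,
         log.output_tokens, log.latency_ms, log.cost_usd),
    )
    conn.commit()


log_generation(GenerationLog("hoi-an", "example-model", "v3", 2000, 500, 1840))
log_generation(GenerationLog("hoi-an", "example-model", "v4", 1000, 1000, 2100))

# Per-page spend is a single aggregate query over the log.
total_cost = conn.execute(
    "SELECT SUM(cost_usd) FROM content_generations WHERE page_id = ?", ("hoi-an",)
).fetchone()[0]
```

Because each row carries a page identifier and prompt version, a single page can be selected for regeneration without touching the rest of the catalog.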
05
Comparison engine
Head-to-head comparison pages are generated from already-extracted destination facts. Claude writes only the comparison narrative, never the underlying numbers. Data points displayed in the comparison table are derived from structured facts, not free-form text.
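One way to keep table values independent of generated prose is to build the rows directly from the stored facts, as in this sketch. The function name, field names, and fact values are all made up for illustration.

```python
from typing import Any


def build_comparison_rows(
    facts_a: dict[str, Any], facts_b: dict[str, Any], fields: list[str]
) -> list[tuple[str, Any, Any]]:
    """Return (field, value_a, value_b) rows read only from structured facts.

    A fact missing for one destination is carried through as None so the
    page can render it as blank instead of guessing a value.
    """
    return [(field, facts_a.get(field), facts_b.get(field)) for field in fields]


# Made-up facts for two placeholder destinations.
destination_a = {"population": 100, "timezone": "UTC+7"}
destination_b = {"population": 250, "timezone": "UTC+7", "airport_code": "BBB"}

rows = build_comparison_rows(
    destination_a, destination_b, ["population", "timezone", "airport_code"]
)

# The narrative is a separate string and is never parsed for numbers.
narrative = "Model-written comparison text would go here."
page = {"table": rows, "narrative": narrative}
```

Keeping the table and the narrative as separate keys means a change to the narrative text cannot alter a displayed number.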
06
Update cadence
Source data is re-pulled on a scheduled cadence (Year 1: monthly). Pages are regenerated when source data changes meaningfully or when the prompt version is bumped. Every regeneration increments the page's generation_version so historical content is auditable.
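The regeneration rule can be sketched as a small decision function. What counts as a meaningful change is not specified on this page, so this example assumes any change to the fingerprint of the extracted facts qualifies; PageState, facts_fingerprint, and maybe_regenerate are hypothetical names, while generation_version comes from the text.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class PageState:
    facts_hash: str
    prompt_version: str
    generation_version: int


def facts_fingerprint(facts: dict) -> str:
    """Stable hash of the facts; key order does not affect the result."""
    raw = json.dumps(facts, sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def maybe_regenerate(page: PageState, facts: dict, prompt_version: str) -> bool:
    """Bump generation_version when facts or prompt version changed.

    Assumption: any fingerprint change counts as a meaningful change.
    """
    new_hash = facts_fingerprint(facts)
    if new_hash == page.facts_hash and prompt_version == page.prompt_version:
        return False  # nothing changed, keep the current page
    page.facts_hash = new_hash
    page.prompt_version = prompt_version
    page.generation_version += 1
    return True


facts = {"population": 100, "timezone": "UTC+7"}  # made-up values
page = PageState(facts_fingerprint(facts), "v3", 1)

unchanged = maybe_regenerate(page, facts, "v3")       # same facts, same prompt
prompt_bump = maybe_regenerate(page, facts, "v4")     # prompt version bumped
data_change = maybe_regenerate(page, {**facts, "population": 120}, "v4")
```

A real pipeline would likely apply a tolerance per field (for example, ignoring small population revisions) before treating a change as meaningful.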
07
Affiliate independence
Editorial content (rankings, comparisons, recommendations) is produced before any affiliate or partner data is layered on. We never re-rank destinations or hide negatives based on commission. Affiliate links are clearly disclosed and tracked separately from page content.
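The ordering guarantee can be expressed as a layering step that only decorates an already-ranked list. This is a sketch under assumed names: layer_affiliate_links, the id and affiliate_url keys, and the example.com link are hypothetical.

```python
def layer_affiliate_links(ranked: list[dict], links: dict[str, str]) -> list[dict]:
    """Attach affiliate URLs to an already-ranked list without reordering it.

    Items with no affiliate link keep their position and get None.
    """
    return [{**item, "affiliate_url": links.get(item["id"])} for item in ranked]


# Editorial ranking is fixed before any partner data is consulted.
editorial_ranking = [{"id": "dest-a"}, {"id": "dest-b"}, {"id": "dest-c"}]
partner_links = {"dest-c": "https://example.com/partner/dest-c"}

decorated = layer_affiliate_links(editorial_ranking, partner_links)
order_before = [item["id"] for item in editorial_ranking]
order_after = [item["id"] for item in decorated]
```

Because the function never sorts or filters, a check that order_before equals order_after can serve as a regression test for the no-re-ranking rule.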

Hard rules

Constraints we don't deviate from. Violations are bugs.

01 No scraping of OTA pages (Airbnb, Booking, Vrbo, TripAdvisor); affiliate APIs only.
02 No copying of individual user reviews. Themes only, derived from our own affiliate-API data.
03 No invented facts. Where a source lacks a value, the field is omitted.
04 No editorial filler. Prose stays factual, hedged where source data is approximate.
05 No consumer profiling. Subscriber emails are stored alone, not joined to behavioural data.

Spotted an inaccuracy? Email [email protected] with the page URL and we'll trace it back to the source row in the audit table.

Cited. Auditable.
