Dataset Preparation

This page describes the public data.yaml contract and the standardized outputs produced by --steps prep.

Run Command

Prepare one or more datasets:

st-cnvbench --steps prep \
  --data-config data.yaml \
  --prep-ids sample_1 sample_2

For the packaged cSCC walkthrough, see Quickstart Demo And Expected Outputs.

Dataset Entries

Each dataset entry in data.yaml represents one sample or sample group.

Required top-level fields:

| Field | Meaning |
| --- | --- |
| dataset_id | Unique dataset name used in output and result paths. |
| platform | Spatial platform name, for example Visium, ST, or SlideDNAseq. |
| format | Raw input layout currently supported by prep. |
| genome | Genome label used by model wrappers, for example hg38. |
| species | Species label, for example human or mouse. |
| ref_norm | Whether model running should use reference normal spots when supported. |
| tumor_normal_mode | How tumor-normal annotations are used in the run/eval flow. |
| raw | Raw input paths and ground-truth (GT) paths. |
| output.root | Standardized output directory for this dataset. |

Allowed values:

  • format: SpaceRanger or STpipeline
  • tumor_normal_mode: subset, de_novo, or off
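
The required fields and allowed values above can be checked before running prep. The following is a minimal sketch, not the tool's own validation: `validate_entry`, `REQUIRED_FIELDS`, and `ALLOWED_VALUES` are hypothetical names, and the entry is assumed to be already loaded into a plain Python dict.

```python
from __future__ import annotations

# Field names and allowed values are taken from the tables on this page.
REQUIRED_FIELDS = (
    "dataset_id", "platform", "format", "genome", "species",
    "ref_norm", "tumor_normal_mode", "raw", "output",
)
ALLOWED_VALUES = {
    "format": {"SpaceRanger", "STpipeline"},
    "tumor_normal_mode": {"subset", "de_novo", "off"},
}


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one dataset entry (hypothetical helper)."""
    errors = []
    for field in REQUIRED_FIELDS:
        if field not in entry:
            errors.append(f"missing required field: {field}")

    # output.root is nested, so it needs a separate check.
    output = entry.get("output")
    if "output" in entry and not (isinstance(output, dict) and output.get("root")):
        errors.append("missing required field: output.root")

    for field, allowed in ALLOWED_VALUES.items():
        if field in entry and entry[field] not in allowed:
            errors.append(
                f"invalid {field}: {entry[field]!r} (allowed: {sorted(allowed)})"
            )
    return errors
```

An empty list means the entry satisfies the checks sketched here; it does not guarantee that prep itself will accept the entry.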

raw.* Fields

| Field | Meaning |
| --- | --- |
| root | Base raw-data directory. Used for path interpolation and optional auto-discovery. |
| counts | Optional explicit count matrix path when the input is not discovered from raw.root. |
| barcodes | Optional explicit barcode file path. |
| features | Optional explicit feature file path. |
| scalefactors | Spatial scale-factor file. Required for STpipeline; for SpaceRanger it can be omitted when the file is present under raw.root/spatial/. |
| tissue_positions | Spatial coordinate table when not discovered automatically. |
| tissue_image | Optional tissue image used by prep and plotting. |
| tumor_normal | Model-run annotation for reference-normal selection. In de_novo mode this can be null. |
| tumor_normal_gt | Ground-truth tumor-normal annotation used only by evaluation. |
| subclone_gt | Ground-truth spot-level subclone labels for subclone tasks. |
| cnv_gt | Ground-truth CNV profile input for cnv_profile, and clone-level CNV GT for subclone tasks when applicable. |
| bam, bai | Alignment file (BAM) and its index (BAI), needed by allele-aware wrappers such as Numbat and Xclone. |
| wgs_wes_tumor_bedg, wgs_wes_normal_bedg | WGS/WES count tracks for Clonalscope_WGS. |
| beads_mapping | Slide-DNA-seq style pseudo-barcode to original-barcode mapping for subclone evaluation. |
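
The scalefactors rule is the one field above whose requirement depends on format. A rough sketch of that rule follows, assuming a hypothetical helper `resolve_scalefactors`. The discovered file name `scalefactors_json.json` is also an assumption, borrowed from the standardized spatial/ output listed later on this page; the actual discovery logic in prep may differ.

```python
from pathlib import Path


def resolve_scalefactors(fmt, raw):
    """Pick the scale-factor file for one dataset (hypothetical helper).

    fmt is the entry's `format` value; raw is the entry's `raw` mapping.
    """
    explicit = raw.get("scalefactors")
    if explicit:
        # An explicit path always wins.
        return Path(explicit)

    if fmt == "STpipeline":
        # Per the table: required for STpipeline, so fail loudly.
        raise ValueError("raw.scalefactors is required for STpipeline inputs")

    # SpaceRanger: fall back to discovery under raw.root/spatial/.
    # Assumption: the file is named scalefactors_json.json.
    root = raw.get("root")
    if root:
        candidate = Path(root) / "spatial" / "scalefactors_json.json"
        if candidate.is_file():
            return candidate

    # Nothing found; this sketch leaves the decision to the caller.
    return None
```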

Tumor-Normal Modes

tumor_normal_mode controls how the pipeline treats tumor-normal annotations.

| Mode | Meaning |
| --- | --- |
| subset | Use raw.tumor_normal to provide reference-normal spots during model run, and remove those reference spots from tumor-normal evaluation. |
| de_novo | Run without reference-normal labels. Models that support de novo operation can still be benchmarked. |
| off | Disable tumor-normal evaluation for this dataset. |

Conceptually:

  • raw.tumor_normal is a model-run input
  • raw.tumor_normal_gt is an evaluation GT file
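
To illustrate how the three modes affect which spots are scored, here is a small sketch. It is not the pipeline's implementation: `spots_for_tumor_normal_eval` is a hypothetical function, and the GT annotation is assumed to be already loaded as a barcode-to-label dict, with the reference-normal spots given as a collection of barcodes.

```python
def spots_for_tumor_normal_eval(mode, gt_labels, reference_spots=()):
    """Return the GT labels that tumor-normal evaluation would score.

    mode            -- tumor_normal_mode: "subset", "de_novo", or "off"
    gt_labels       -- dict of barcode -> label (from raw.tumor_normal_gt)
    reference_spots -- barcodes used as reference normals (from raw.tumor_normal)
    """
    if mode == "off":
        # Tumor-normal evaluation is disabled for this dataset.
        return None
    if mode == "de_novo":
        # No reference labels were given to the model, so nothing is excluded.
        return dict(gt_labels)
    if mode == "subset":
        # Reference spots were model inputs; drop them from evaluation.
        excluded = set(reference_spots)
        return {b: lab for b, lab in gt_labels.items() if b not in excluded}
    raise ValueError(f"unknown tumor_normal_mode: {mode!r}")
```

In subset mode the spots handed to the model as reference normals are removed first, so a model is not scored on labels it was given as input.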

Minimal Example

datasets:
  sample_1:
    dataset_id: sample_1
    platform: Visium
    format: SpaceRanger
    genome: hg38
    species: human
    ref_norm: false
    tumor_normal_mode: de_novo
    raw:
      root: /path/to/raw/sample_1
      tissue_positions: null
      scalefactors: null
      tissue_image: null
      tumor_normal: null
      tumor_normal_gt: null
      subclone_gt: null
      cnv_gt: null
      bam: null
      bai: null
      wgs_wes_tumor_bedg: null
      wgs_wes_normal_bedg: null
      beads_mapping: null
    output:
      root: /path/to/processed/sample_1

Standardized Outputs Produced By prep

The public prep step creates a benchmark-ready dataset bundle under:

<output.root>/

Common outputs include:

filtered_feature_bc_matrix/
filtered_feature_bc_matrix.h5ad
spatial/
metadata_<dataset_id>_tumor_normal.tsv   # when a model-run tumor-normal annotation is available

Expected spatial/ contents:

spatial/tissue_positions.csv
spatial/scalefactors_json.json
spatial/tissue_hires_image.png   # optional; present only when a tissue image is available

The .h5ad file is assembled during preparation and is part of the standardized output bundle used by downstream benchmarking steps.
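
After prep finishes, a quick check of the bundle layout can catch an incomplete run. The sketch below is illustrative only: `missing_outputs` is a hypothetical helper, and treating every non-optional path listed above as mandatory is an assumption, since the page describes them as common outputs.

```python
from pathlib import Path

# Paths relative to <output.root>, taken from the listings above.
EXPECTED_OUTPUTS = (
    "filtered_feature_bc_matrix",
    "filtered_feature_bc_matrix.h5ad",
    "spatial/tissue_positions.csv",
    "spatial/scalefactors_json.json",
)


def missing_outputs(output_root, dataset_id=None, expect_tumor_normal=False):
    """List expected bundle paths that do not exist under output_root."""
    root = Path(output_root)
    expected = list(EXPECTED_OUTPUTS)
    if expect_tumor_normal:
        # Only written when a model-run tumor-normal annotation is available.
        expected.append(f"metadata_{dataset_id}_tumor_normal.tsv")
    # spatial/tissue_hires_image.png is optional, so it is not checked.
    return [rel for rel in expected if not (root / rel).exists()]
```

For example, `missing_outputs("/path/to/processed/sample_1")` returns an empty list when all four common outputs are present.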

Notes

  • If a required input is missing, prep should fail loudly rather than silently substitute a replacement.
  • de_novo datasets can still run, but only with wrappers that support reference-free operation.
  • Keep GT paths null for evaluation tasks you do not plan to run.
  • Use one dataset entry per logical sample or sample group you want the benchmark controller to manage.