Data Structure
A RaggedAtlas is backed by two co-located stores:
- LanceDB database — all tabular data: cell metadata, feature registries, dataset inventory, and the per-dataset feature mapping
- Zarr object store — all array data: count matrices, embeddings, image tiles
No manifest files or external sidecars need to be maintained outside the atlas.
Overall layout
graph TD
subgraph LanceDB
cells["cell table\n(HoxBaseSchema subclass)"]
datasets["datasets table\n(DatasetRecord)"]
layouts["_feature_layouts table\n(FeatureLayout)"]
reg["registry tables\n(FeatureBaseSchema subclasses)"]
versions["atlas_versions table\n(AtlasVersionRecord)"]
end
subgraph Zarr["Zarr object store"]
sparse["sparse groups\n(csr/ + optional csc/)"]
dense["dense groups\n(data or layers/)"]
end
cells -- "pointer fields" --> Zarr
datasets -- "zarr_group" --> Zarr
layouts -. "feature_uid" .-> reg
layouts -. "layout_uid" .-> datasets
Cell table: HoxBaseSchema
Each row in the cell table is one observation — a cell, nucleus, spatial tile, etc. The exact schema is user-defined by subclassing HoxBaseSchema.
Every cell row carries:
| Field | Type | Description |
|---|---|---|
uid |
str |
Random 16-char hex. Unique per cell; safe for concurrent writes. |
dataset_uid |
str |
Links back to the originating DatasetRecord. |
| pointer fields | SparseZarrPointer \| DenseZarrPointer \| None |
One column per feature space the cell may have been profiled in. |
Pointer field names must match a registered feature space name (enforced at class definition time). At least one pointer field must be declared.
Pointer types
SparseZarrPointer — used for high-dimensional sparse assays (gene expression, chromatin peaks):
| Field | Description |
|---|---|
zarr_group |
Path to the zarr group within the object store |
start / end |
Element positions into csr/indices (derived from the CSR indptr). Slice [start:end] gives this cell's non-zero entries. |
zarr_row |
0-indexed row in the group; used for column-oriented (CSC) reads |
DenseZarrPointer — used for dense assays (protein abundance, image embeddings):
| Field | Description |
|---|---|
zarr_group |
Path to the zarr group |
position |
Row index into the group's dense array |
A typical multimodal schema looks like:
class MySchema(HoxBaseSchema):
gene_expression: SparseZarrPointer | None = None
protein_abundance: DenseZarrPointer | None = None
classDiagram
class HoxBaseSchema {
+str uid
+str dataset_uid
}
class MySchema {
+SparseZarrPointer | None gene_expression
+DenseZarrPointer | None protein_abundance
}
class SparseZarrPointer {
+str zarr_group
+int start
+int end
+int zarr_row
}
class DenseZarrPointer {
+str zarr_group
+int position
}
HoxBaseSchema <|-- MySchema
MySchema --> SparseZarrPointer : gene_expression
MySchema --> DenseZarrPointer : protein_abundance
Zarr group layouts
Each ingested dataset occupies one zarr group. The internal layout depends on the feature space's PointerKind.
Sparse groups
Used for gene expression, chromatin peaks, and other high-dimensional sparse assays. Data is stored in coordinate order — appending a new dataset is a pure array append.
<zarr_group>/
└── csr/
├── indices # (N_entries,) uint32 — local feature indices
└── layers/
└── counts # (N_entries,) dtype — values
indices holds local column indices (0-based within this dataset's feature ordering). The mapping from local to global feature index is stored in _feature_layouts.global_index and used at read time for scatter/gather operations.
A cell's data is at csr/indices[start:end] (and the matching slice from each layer), where start/end come from the cell's SparseZarrPointer.
After calling add_csc(), a column-sorted counterpart is written alongside CSR. This enables efficient feature-filtered queries (reading a feature across many cells) without scanning all cells:
<zarr_group>/
├── csr/
│ ├── indices # row-sorted: entry order matches cell order
│ └── layers/
│ └── counts
└── csc/
├── indices # col-sorted: entry order matches feature order (cell row ids)
└── layers/
└── counts
Dense groups
Used for protein abundance, image feature vectors, and other low-dimensional dense assays. Dense groups write a 2-D array with shape (N_cells, N_features). Depending on whether a layer name was specified at ingest time, the array lives either directly at data or under a layers/ subgroup:
or:
A cell is located at row index position from the cell's DenseZarrPointer.
Datasets table
Every ingested zarr group is registered as a DatasetRecord:
| Field | Description |
|---|---|
uid |
Stable dataset identifier |
zarr_group |
Path to the zarr group (matches pointer fields in cell table) |
feature_space |
Which feature space this group belongs to |
n_cells |
Number of cells in this dataset |
created_at |
UTC ISO timestamp |
The datasets table is the authoritative inventory of what zarr groups exist. validate() uses it to enumerate groups for consistency checks.
_feature_layouts table
_feature_layouts stores feature orderings ("layouts") shared across datasets. Each FeatureLayout row records one (layout, feature) pair. Datasets with identical feature orderings share the same layout_uid, dramatically reducing row count compared to a per-dataset approach. For the Python API, see Feature Layouts.
| Field | Description |
|---|---|
layout_uid |
Content-hash of the ordered feature list. Shared across datasets with the same feature ordering. Scalar-indexed for layout → features lookup. |
feature_uid |
global_feature_uid from the registry. FTS-indexed for feature → layouts lookup. |
local_index |
0-based position of this feature in the layout's zarr array (i.e. the column index stored in csr/indices). |
global_index |
Denormalized from the registry. Used as a scatter/gather key at training time — no database lookup needed in the hot path. |
An FTS index on feature_uid and a scalar index on layout_uid make two queries efficient:
- Feature → datasets: which layouts (and thus datasets) include feature X? (
find_datasets_with_features) - Layout → features: given a
layout_uid, reconstruct thelocal → globalindex remap for vectorized scatter/gather
erDiagram
DatasetRecord {
str uid PK
str zarr_group
str feature_space
int n_cells
str layout_uid FK
}
FeatureRegistry {
str uid PK
int global_index
}
FeatureLayout {
str layout_uid FK
str feature_uid FK
int local_index
int global_index
}
DatasetRecord }o--|| FeatureLayout : "references layout"
FeatureRegistry ||--o{ FeatureLayout : "referenced by"
Feature registries
Each feature space with a stable feature axis maintains a LanceDB registry table. All schemas subclass FeatureBaseSchema:
| Field | Description |
|---|---|
uid |
Stable canonical identifier. Never reassigned. Use for durable references across atlas rebuilds. |
global_index |
Dense integer (0 .. N-1) assigned by reindex_registry(). Used as an array index in compute paths. Never reassigned once set — new features get max + 1. |
Modality-specific subclasses add fields like gene symbol, Ensembl ID, UniProt ID, SMILES, guide sequence, etc.
The uid / global_index split separates two concerns: uid is safe for concurrent writers (no coordination needed); global_index is assigned in a single-writer reindexing pass so it always forms a contiguous range suitable for array indexing and scatter/gather operations.
Versioning
snapshot() records a consistent point-in-time view across all Lance table versions as an AtlasVersionRecord. checkout(version) (or checkout_latest()) reopens every table pinned to the exact Lance version captured in that record — giving a reproducible, read-only view of the atlas state. query() is only available on a checked-out atlas.
sequenceDiagram
participant User
participant Atlas
User->>Atlas: ingest datasets
User->>Atlas: optimize()
Note right of Atlas: assigns global_index,<br/>compacts tables,<br/>rebuilds FTS indexes
User->>Atlas: snapshot()
Atlas-->>User: version = N
User->>Atlas: checkout_latest(db_uri, schema, store)
Atlas-->>User: read-only atlas @ version N
User->>Atlas: query()