Zarr Group Specs
A ZarrGroupSpec is a declaration: it tells homeobox what zarr arrays to expect inside a group, which pointer kind to use, and how to reconstruct data at query time. Every feature space in the atlas must have a registered spec. Specs are validated at class-definition time — a HoxBaseSchema subclass that references a feature space without a registered spec will raise immediately.
from homeobox.group_specs import (
ZarrGroupSpec, PointerKind, LayersSpec, ArraySpec,
DTypeKind, register_spec, get_spec, registered_feature_spaces,
)
from homeobox.reconstruction import SparseCSRReconstructor, DenseReconstructor, FeatureCSCReconstructor
Core concepts
PointerKind
PointerKind is a string enum with two values:
| Value | Pointer type on cell rows | Typical use |
|---|---|---|
SPARSE |
SparseZarrPointer |
High-dimensional sparse assays: gene expression, chromatin accessibility |
DENSE |
DenseZarrPointer |
Low-dimensional dense assays: protein panels, image embeddings |
The PointerKind declared in the spec must match the pointer field types used in your HoxBaseSchema subclass. Constructing a SparseZarrPointer against a DENSE spec (or vice versa) raises a ValueError immediately.
ArraySpec
ArraySpec declares expected properties of a single zarr array:
| Field | Type | Description |
|---|---|---|
array_name |
str |
Path of the array relative to the group root (e.g. "csr/indices"). |
dtype_kind |
DTypeKind \| None |
Expected numeric category: BOOL, SIGNED_INTEGER, UNSIGNED_INTEGER, or FLOAT. None means any dtype is accepted. |
ndim |
int \| None |
Expected number of dimensions. None means any dimensionality is accepted. |
LayersSpec
LayersSpec declares the layers zarr subgroup for a feature space:
| Field | Type | Description |
|---|---|---|
prefix |
str |
Path prefix before layers/. Empty string means "layers/" at the group root; "csr" means "csr/layers/". |
uniform_shape |
bool |
When True, all arrays in the layers subgroup must share the same shape. Useful for asserting that all layers have identical cell and feature counts. |
match_shape_of |
str \| None |
When set, every array in the layers subgroup must have the same shape as the sibling array named here (resolved relative to the parent group). |
required |
list[str] |
Layer names that must exist in the layers subgroup. Also used as the default layers to read at query time. |
allowed |
list[str] |
Whitelist of valid layer names for ingestion validation. |
The path property returns the resolved layers path: f"{prefix}/layers" if prefix is non-empty, otherwise "layers".
ZarrGroupSpec fields
ZarrGroupSpec is the top-level registration object. Each field controls a different aspect of how homeobox interacts with a feature space.
feature_space
A string that must be unique across the spec registry. This value is used as the pointer field name in HoxBaseSchema subclasses and as the key for the feature registry table (when has_var_df=True). Choosing a stable, descriptive name matters: renaming a registered space after cells have been ingested requires migrating pointer field values in the cell table.
pointer_kind
PointerKind.SPARSE or PointerKind.DENSE. Determines whether cells in this space carry SparseZarrPointer or DenseZarrPointer fields, and what zarr layout the ingestion layer writes.
reconstructor
A Reconstructor protocol instance. Controls how data is assembled back into an AnnData at query time. The three built-in options are:
SparseCSRReconstructor()— for sparse assays stored in CSR layout. Reads byte ranges fromcsr/indicesand the corresponding layer array, then scatter-gathers into the global feature space.DenseReconstructor()— for dense assays. Reads row slices from a dense 2-D array and scatters into the global feature space.FeatureCSCReconstructor()— for feature-filtered queries on sparse data. Requires thatadd_csc()has been called on the group; reads fromcsc/indicesinstead ofcsr/indices, enabling efficient column extraction without touching all cell rows.
See the Reconstructors page for guidance on choosing between these.
has_var_df
bool. When True, this feature space has a feature registry table and entries in the _feature_layouts table (see Feature Layouts). The registry table schema must be supplied to RaggedAtlas.create() via registry_schemas. Set to False for spaces with no stable feature axis — for example, arbitrary-length embedding vectors where the column count varies by dataset and no cross-dataset feature identity is meaningful.
required_arrays
list[ArraySpec]. Arrays that must exist directly under the group root. The validate_group method checks that each named array is present and (if specified) has the right dtype_kind and ndim. Missing or mistyped arrays are reported as errors.
layers
LayersSpec. Declares the layers subgroup: its path prefix, shape constraints, required layers, and allowed layers. The layers.path property resolves the full subgroup path (e.g. "csr/layers" or "layers"). layers.required lists the layer names loaded by default at query time. layers.allowed is a whitelist for ingestion validation — attempting to ingest a layer whose name is not in this list raises an error.
Built-in specs
Two specs are pre-registered in homeobox.builtins. They are imported automatically when homeobox.builtins is imported, which happens at package init.
GENE_EXPRESSION_SPEC
ZarrGroupSpec(
feature_space="gene_expression",
pointer_kind=PointerKind.SPARSE,
has_var_df=True,
required_arrays=[ArraySpec(array_name="csr/indices", ndim=1, dtype_kind=DTypeKind.UNSIGNED_INTEGER)],
layers=LayersSpec(
prefix="csr",
uniform_shape=True,
match_shape_of="csr/indices",
required=["counts"],
allowed=["counts", "log_normalized", "tpm"],
),
reconstructor=SparseCSRReconstructor(),
)
The match_shape_of="csr/indices" constraint on the layers subgroup ensures that every layer array has the same number of entries as the indices array — a prerequisite for correct CSR reads. The uniform_shape=True constraint ensures that all layers within the subgroup have matching shapes with each other, preventing partial ingestion where one layer has more entries than another.
IMAGE_FEATURES_SPEC
ZarrGroupSpec(
feature_space="image_features",
pointer_kind=PointerKind.DENSE,
has_var_df=True,
layers=LayersSpec(
uniform_shape=True,
required=["raw"],
allowed=["raw", "log_normalized", "ctrl_standardized"],
),
reconstructor=DenseReconstructor(),
)
No required_arrays at the root because dense groups place everything under layers/. The uniform_shape=True on the layers subgroup enforces that all layer arrays share the same (N_cells, N_features) shape.
Defining a custom spec
Custom specs are the primary extension point. The pattern is: construct a ZarrGroupSpec, call register_spec(), then define any HoxBaseSchema subclass that uses it.
register_spec() must be called before any HoxBaseSchema subclass that references the feature space is defined. The pointer field validation happens at class-definition time, not at instantiation time.
Dense custom spec
This example registers a lognorm_rna space for dense log-normalized RNA-seq data — the same spec used in the atlas walkthrough.
from homeobox.group_specs import (
ZarrGroupSpec, PointerKind, LayersSpec, register_spec,
)
from homeobox.reconstruction import DenseReconstructor
LOGNORM_RNA_SPEC = ZarrGroupSpec(
feature_space="lognorm_rna",
pointer_kind=PointerKind.DENSE,
has_var_df=True,
layers=LayersSpec(
uniform_shape=True,
required=["log_normalized"],
allowed=["log_normalized"],
),
reconstructor=DenseReconstructor(),
)
register_spec(LOGNORM_RNA_SPEC)
After this call, "lognorm_rna" is a valid pointer field name for any HoxBaseSchema subclass defined in the same process.
Sparse custom spec
For sparse data such as chromatin accessibility, mirror the structure of GENE_EXPRESSION_SPEC:
from homeobox.group_specs import ArraySpec, DTypeKind, LayersSpec
from homeobox.reconstruction import SparseCSRReconstructor
ATAC_SPEC = ZarrGroupSpec(
feature_space="chromatin_accessibility",
pointer_kind=PointerKind.SPARSE,
has_var_df=True,
required_arrays=[ArraySpec(array_name="csr/indices", ndim=1, dtype_kind=DTypeKind.UNSIGNED_INTEGER)],
layers=LayersSpec(
prefix="csr",
uniform_shape=True,
match_shape_of="csr/indices",
required=["counts"],
allowed=["counts"],
),
reconstructor=SparseCSRReconstructor(),
)
register_spec(ATAC_SPEC)
Ordering requirements
sequenceDiagram
participant App
participant Registry
participant Schema
App->>Registry: register_spec(ATAC_SPEC)
Note right of Registry: "chromatin_accessibility" is now valid
App->>Schema: class MyCell(HoxBaseSchema)
Note right of Schema: __init_subclass__ checks pointer fields<br/>against registry — passes
App->>Schema: class BadCell(HoxBaseSchema)
Note right of Schema: pointer field "chromatin_accessibility"<br/>not yet registered — raises TypeError
If register_spec() is called after the schema class is defined, the TypeError from the schema definition will have already propagated. The fix is always to move register_spec() earlier in the module load order, typically in a dedicated specs.py module that is imported before any schema module.
Querying the registry
Two utility functions inspect the live registry:
from homeobox.group_specs import get_spec, registered_feature_spaces
registered_feature_spaces()
# {'gene_expression', 'image_features', 'lognorm_rna', 'chromatin_accessibility', ...}
spec = get_spec("gene_expression")
# ZarrGroupSpec(feature_space='gene_expression', pointer_kind=<PointerKind.SPARSE: 'sparse'>, ...)
get_spec() raises KeyError if the named space is not registered. Use registered_feature_spaces() to check membership before calling get_spec() in code that handles unknown spaces.
Group validation
ZarrGroupSpec.validate_group(group) returns a list of error strings — empty if the group satisfies all constraints declared by the spec. This is called internally by atlas.validate() but can also be used during development to verify a group before ingestion.
import zarr
group = zarr.open_group("/path/to/group")
errors = spec.validate_group(group)
if errors:
for e in errors:
print(e)
Typical errors include missing arrays, wrong ndim, wrong dtype_kind, a missing subgroup, arrays within a subgroup with mismatched shapes, and a layer array whose shape does not match the required sibling array. Validation does not load any array data — it only inspects zarr metadata, so it is fast even for large remote groups.