homeobox
Homeobox is a multimodal single-cell database built for interactive analysis and ML training at scale. It stores cell metadata in LanceDB and raw array data (count matrices, images and features) in Zarr, and provides a PyTorch-native data loading layer that reads directly from those stores without intermediate copies or format conversions.
A central design goal is to make it practical to train foundation models on collections of heterogeneous datasets — datasets with different gene panels, different assayed modalities, and different obs schemas — without forcing them into a common rectangular matrix upfront.
Why homeobox
Ragged feature spaces, unified cell table
Real-world atlas building involves stitching together datasets that were not designed to be compatible. A 10x 3' dataset measures ~33,000 genes. A targeted panel measures 500. A CITE-seq experiment adds protein on top of RNA. Conventional tools handle this by padding to a union matrix (wasteful) or intersecting to shared features (lossy).
Homeobox takes a different approach: each dataset occupies its own zarr group with its own feature ordering, and each cell carries a pointer into its group. The reconstruction layer handles union/intersection/feature-filter logic at query time — no padding is stored, no information is discarded at ingest.
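As a toy illustration of that query-time mapping (gene names and values here are invented, not homeobox's API): the registry defines a global column order, each dataset keeps its own local order plus a local→global index map, and values are scattered into global positions only when a query asks for them.

```python
import numpy as np

# Toy query-time feature mapping (names/values invented): the registry
# fixes a global column order; a dataset's local panel maps into it via
# a small index array, so no padded matrix is ever stored.
registry = ["CD19", "CD3D", "MS4A1"]            # global feature order
panel_a = ["CD3D", "CD19"]                      # dataset A's local panel/order
col_map = np.array([registry.index(g) for g in panel_a])  # local -> global
row = np.zeros(len(registry))
row[col_map] = [5.0, 2.0]                       # A's local values, scattered
```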
Querying across cells, modalities, and feature spaces
The cell table lives in LanceDB, so the full query surface is available without writing custom loaders: SQL predicates, vector similarity search (ANN), full-text search, and multi-column aggregations all work out of the box. You can count cell types, find nearest neighbors by embedding, filter by metadata, and load the result as AnnData or MuData in a single fluent chain.
# Filter by metadata, load gene expression as AnnData
adata = atlas_r.query().where("tissue = 'lung' AND cell_type IS NOT NULL").to_anndata()
# Retrieve the 50 cells most similar to a query embedding
hits = atlas_r.query().search(query_vec, vector_column_name="embedding").limit(50).to_anndata()
# Load only a specific gene panel — uses the CSC index when available for minimal I/O
adata = atlas_r.query().features(["CD3D", "CD19", "MS4A1"], "gene_expression").to_anndata()
Feature-filtered queries are a first-class operation. When you request a gene panel, homeobox uses the optional CSC (compressed sparse column) index to read only the byte ranges belonging to those features — O(nnz of the requested features) instead of O(nnz across all cells). Groups without a CSC index fall back to CSR reads automatically, so the index is purely additive and can be built incrementally.
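The idea behind the CSC read path can be sketched in a few lines (the helper and array names are illustrative, not homeobox's API): the CSC indptr array locates each feature's contiguous run of nonzeros, so only those slices ever need to be fetched.

```python
import numpy as np

# Illustrative CSC-style feature-filtered read (helper/names invented):
# indptr gives each column's contiguous [start, stop) range of nonzeros,
# so a panel query touches only those slices.
def read_feature_columns(indptr, row_indices, values, wanted):
    out = {}
    for col in wanted:
        start, stop = int(indptr[col]), int(indptr[col + 1])
        out[col] = (row_indices[start:stop], values[start:stop])
    return out

# Toy 3x4 matrix: col 0 has one nonzero, col 1 none, col 2 two, col 3 one.
indptr = np.array([0, 1, 1, 3, 4])
rows = np.array([0, 1, 2, 0])
vals = np.array([1.0, 2.0, 3.0, 4.0])
panel = read_feature_columns(indptr, rows, vals, [0, 2])
```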
For multimodal atlases, .to_mudata() returns one AnnData per modality wrapped in MuData, with per-cell presence masks tracking which cells were measured by each assay. .to_batches() provides a streaming iterator for queries that would exceed memory if materialised all at once.
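A toy illustration of per-cell presence masks (cell IDs and modality names invented): each modality gets a boolean vector over the query's cells marking which of them it actually measured.

```python
import numpy as np

# Toy presence masks: RNA was assayed in every result cell, protein
# (e.g. CITE-seq) only in a subset.
result_cells = ["c1", "c2", "c3", "c4"]
measured = {"rna": {"c1", "c2", "c3", "c4"}, "protein": {"c2", "c4"}}
masks = {mod: np.array([c in cells for c in result_cells])
         for mod, cells in measured.items()}
```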
Fast reads from cloud storage
Zarr's sharded storage format packs many chunks into a single object-store file. The shard index records each chunk's byte offset, enabling targeted range reads — but the Python zarr stack issues one HTTP request per chunk even when chunks can be coalesced.
Homeobox includes a Rust extension (RustShardReader) that handles shard reads manually: it batches all requested ranges, issues one get_ranges call per shard file (coalescing multiple chunks into as few network requests as possible), and decodes chunks in parallel using rayon. On remote object stores (S3, GCS, Azure), this typically cuts latency-dominated read time by an order of magnitude compared to sequential per-chunk fetches. The same reader backs both interactive queries and ML training — fast cloud reads are not a special mode, they are the default.
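The coalescing step can be modeled in a few lines of Python (a simplified sketch of the idea; the real reader is Rust and operates on shard-index entries, and `max_gap` is an invented parameter):

```python
# Simplified model of range coalescing: merge sorted (offset, length)
# byte ranges whose gap is <= max_gap, turning many per-chunk reads
# into few network requests.
def coalesce_ranges(ranges, max_gap=0):
    merged = []
    for off, length in sorted(ranges):
        if merged and off - (merged[-1][0] + merged[-1][1]) <= max_gap:
            prev_off, prev_len = merged[-1]
            end = max(prev_off + prev_len, off + length)
            merged[-1] = (prev_off, end - prev_off)
        else:
            merged.append((off, length))
    return merged
```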
BP-128 bitpacking
When ingesting integer count data (int32, int64, uint32, uint64), homeobox automatically applies BP-128 bitpacking with delta encoding to the sparse indices array, and BP-128 without delta to the values array. BP-128 is a SIMD-accelerated codec that packs integers using the minimum number of bits required per 128-element block.
In practice this delivers compression ratios comparable to zarr's default zstd on typical single-cell count matrices while decoding at memory-bandwidth speeds, making it a better default than general-purpose codecs for this kind of integer data.
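A back-of-the-envelope look at why delta encoding helps on sorted sparse indices (numbers invented): raw indices here need ~10 bits each, but their deltas fit in 3, and BP-128 packs each 128-element block at the width its largest value needs.

```python
import numpy as np

# Sorted CSR-style indices within a row: large absolute values, small gaps.
indices = np.array([1000, 1003, 1007, 1010, 1012], dtype=np.int64)
first, deltas = int(indices[0]), np.diff(indices)
bits_raw = int(indices.max()).bit_length()    # bits/element without delta
bits_delta = int(deltas.max()).bit_length()   # bits/element after delta
# Decoding is a prefix sum over the deltas, restoring the original indices.
restored = np.concatenate([[first], first + np.cumsum(deltas)])
```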
Map-style PyTorch datasets
Homeobox's CellDataset and MultimodalCellDataset are map-style PyTorch datasets that expose a __getitem__ interface. This means PyTorch's DataLoader can dispatch any index to any worker process independently.
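The contract is the standard `__len__`/`__getitem__` protocol; a toy stand-in (not homeobox's implementation) shows the shape:

```python
# Toy map-style dataset: __len__ plus an index-addressable __getitem__
# is all PyTorch's DataLoader needs to dispatch work to any worker.
class ToyCellDataset:
    def __init__(self, n_cells):
        self.n_cells = n_cells

    def __len__(self):
        return self.n_cells

    def __getitem__(self, idx):
        # a real CellDataset would follow the cell's zarr pointer here
        return {"cell_index": idx}

ds = ToyCellDataset(1024)
```

A real CellDataset is then wrapped in torch.utils.data.DataLoader with num_workers > 0; because lookups are independent, any worker can serve any index.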
Versioned snapshots
Homeobox separates the writable ingest path from the read/query path with an explicit snapshot model. You ingest freely, call snapshot() to record a consistent point-in-time view across all LanceDB table versions, then checkout(version) to open a read-only handle pinned to that snapshot. Queries and training runs execute against a frozen, reproducible view of the atlas — concurrent ingestion into the live atlas does not affect them.
Architecture overview
graph TD
subgraph LanceDB
cells["cell table\n(HoxBaseSchema)"]
registry["feature registry\n(FeatureBaseSchema)"]
layouts["_feature_layouts\n(local→global index map)"]
datasets["datasets table\n(DatasetRecord)"]
versions["atlas_versions\n(AtlasVersionRecord)"]
end
subgraph Zarr["Zarr object store"]
csr["sparse groups\ncsr/indices + csr/layers/"]
csc["optional CSC index\ncsc/indices + csc/layers/"]
dense["dense groups\nlayers/"]
end
subgraph PyTorch
ds["CellDataset / MultimodalCellDataset"]
loader["DataLoader (spawn, parallel workers)"]
end
cells -- "SparseZarrPointer\nDenseZarrPointer" --> Zarr
layouts -. "local_index → global_index" .-> registry
Zarr --> ds
ds --> loader
Installation
Prebuilt wheels are available on PyPI. Requires Python 3.13.
pip install homeobox # core: atlas, querying, ingestion
pip install homeobox[ml] # + PyTorch dataloader
pip install homeobox[bio] # + scanpy, GEOparse
pip install homeobox[io] # + S3/GCS/Azure, image codecs
pip install homeobox[viz] # + marimo, matplotlib
pip install homeobox[all] # everything
To build from source (requires a Rust toolchain):
curl -LsSf https://astral.sh/uv/install.sh | sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
uv sync
maturin develop --release
Documentation
Concepts
- Data Structure — the LanceDB + zarr layout, pointer types, _feature_layouts feature mapping, and versioning model.
- Building an Atlas — end-to-end walkthrough: register a spec, define schemas, ingest two datasets with different feature panels, snapshot, and run union/intersection queries.
Reference
- Schemas — all LanceDB schema classes: HoxBaseSchema, SparseZarrPointer, DenseZarrPointer, FeatureBaseSchema, DatasetRecord, FeatureLayout, AtlasVersionRecord. Covers the uid/global_index split and how pointer fields are validated.
- Feature Layouts — Python API for the _feature_layouts table: computing layout UIDs, building layout DataFrames, reindexing the registry, syncing global indices, and resolving feature UIDs to global positions.
- Group Specs — ZarrGroupSpec, PointerKind, ArraySpec, LayersSpec, built-in specs, and how to define custom specs for new assay types.
- Querying — the AtlasQuery fluent builder: filtering cells, controlling feature reconstruction, union/intersection joins, feature-filtered queries, and all terminal methods (.to_anndata(), .to_mudata(), .to_batches(), .count()).
- Reconstructors — SparseCSRReconstructor, DenseReconstructor, FeatureCSCReconstructor; choosing between them; the Reconstructor protocol for custom implementations.
- Array Storage — add_from_anndata internals: streaming from backed .h5ad files, chunk/shard sizing, BP-128 bitpacking, the _feature_layouts feature mapping. Building the optional CSC column index with add_csc() for fast feature-filtered reads.
- PyTorch Data Loading — CellDataset and MultimodalCellDataset; CellSampler (locality-aware bin-packing); collate functions; make_loader with spawn parallelism.
- Differential Expression — scanpy-style dex() function: Mann-Whitney U and Welch's t-test across groups, with FDR correction. Works on both sparse and dense feature spaces.
Quickstart
import obstore.store
from homeobox.atlas import RaggedAtlas
from homeobox.schema import HoxBaseSchema, FeatureBaseSchema, SparseZarrPointer
from homeobox.ingestion import add_from_anndata
# homeobox registers built-in specs (gene_expression, image_features) at import time —
# no register_spec() call needed for these feature spaces.
# 1. Define schemas
class GeneFeature(FeatureBaseSchema):
gene_symbol: str
class CellSchema(HoxBaseSchema):
cell_type: str | None = None
gene_expression: SparseZarrPointer | None = None # matches built-in feature space name
# 2. Create atlas
store = obstore.store.LocalStore("/data/atlas/arrays")
atlas = RaggedAtlas.create(
db_uri="/data/atlas/db",
cell_table_name="cells",
cell_schema=CellSchema,
store=store,
registry_schemas={"gene_expression": GeneFeature},
)
# 3. Register features and ingest
# (features: list of GeneFeature rows; adata: an AnnData; record: a
#  DatasetRecord — all assumed prepared beforehand)
atlas.register_features("gene_expression", features)
add_from_anndata(atlas, adata, feature_space="gene_expression",
                 zarr_layer="counts", dataset_record=record)
# 4. Snapshot and query
# optimize() assigns global_index to any newly registered features internally
atlas.optimize()
atlas.snapshot()
atlas_r = RaggedAtlas.checkout_latest("/data/atlas/db", CellSchema, store)
adata = atlas_r.query().where("cell_type = 'T cells'").to_anndata()
For a full walkthrough with two heterogeneous datasets, see Building an Atlas.