How to Build a Virtual Cell in Python from Scratch

This is a gentle introduction to building a virtual cell in Python. We build a simple model that predicts how a cell’s gene expression changes in response to a perturbation. I try to explain everything step by step to show the complete thought process behind every decision.

We start by answering the fundamental questions: what a virtual cell is and why it matters for disease understanding and drug discovery. Then we go through the entire process and cover: downloading and exploring single-cell RNA-seq data; preprocessing the data; designing a training pipeline; splitting data by unseen perturbations; representing perturbations with gene embeddings; building PyTorch datasets and data loaders; training a simple neural net; testing the model; comparing against a baseline; dealing with model collapse; improving the model with highly variable genes, pseudobulk expression, better evaluation, and delta prediction; and discussing possible next steps.

The final model takes a perturbation embedding, predicts a change in expression, and adds that change to the control cell state.

To be clear, this tutorial is for educational purposes and aims to illustrate the main steps involved in building a virtual cell. It does not produce a competitive model for perturbation response prediction, but it is a starting point for you to play around with and improve.

Code: https://github.com/MaciejPiernik/virt...

Resources
Arc Institute Virtual Cell Atlas: https://arcinstitute.org/tools/virtua...
Virtual Cell Challenge dataset: https://github.com/ArcInstitute/arc-v...
Gene embeddings (benchmark paper + downloads): https://www.biorxiv.org/content/10.11...
CELLxGENE: https://cellxgene.cziscience.com/

Chapters
0:00 Intro
6:30 Representing a cell
11:24 Project setup
12:58 Intuition
20:25 What data we need?
25:02 Downloading data
28:58 Exploring data
35:09 Preprocessing data
43:12 The training pipeline
45:52 Splitting data
57:31 Encoding perturbations
1:02:55 Gene embeddings
1:12:40 The full training loop
1:27:52 The model
1:30:34 Data loaders
1:39:31 Mapping genes to embeddings
1:57:52 First training run
2:01:00 Refactor
2:07:32 Testing the model
2:15:21 Technical improvements
2:21:17 Model collapse
2:24:55 Fix #1: Highly variable genes
2:26:52 Fix #2: Pseudobulk
2:34:06 Fix #3: Loss & eval
2:40:20 Baseline
2:43:55 Fix #4: Predicting delta
2:48:13 Improving over baseline
2:54:46 Next steps and Conclusion

#VirtualCell #MachineLearning #Bioinformatics #Python #SingleCell #RNASeq #DeepLearning #DrugDiscovery #ComputationalBiology
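
Quick sketch
To give a feel for three of the ideas mentioned in the description (splitting by unseen perturbations, pseudobulk expression, and delta prediction), here is a minimal NumPy sketch. It is not the code from the video or the repository. The data is random, the gene names, the helper names `split_by_perturbation` and `pseudobulk`, and the 25% test fraction are all made up for illustration, and the mean-delta baseline is just one common choice that may differ from the baseline used in the video.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic toy data (assumption): 5 conditions x 20 cells, 5 genes.
n_genes = 5
perturbations = ["control", "GENE_A", "GENE_B", "GENE_C", "GENE_D"]
labels = np.repeat(perturbations, 20)
expression = rng.normal(size=(labels.size, n_genes))


def split_by_perturbation(names, test_fraction, seed):
    """Hold out whole perturbations so test conditions are unseen in training."""
    candidates = sorted(set(names) - {"control"})
    shuffled = np.random.default_rng(seed).permutation(candidates)
    n_test = max(1, int(round(test_fraction * len(candidates))))
    return list(shuffled[n_test:]), list(shuffled[:n_test])


def pseudobulk(expression, labels, name):
    """Average expression over all cells sharing one perturbation label."""
    return expression[labels == name].mean(axis=0)


train_perts, test_perts = split_by_perturbation(perturbations, 0.25, seed=0)
control_mean = pseudobulk(expression, labels, "control")

# Delta = perturbed pseudobulk minus control pseudobulk, one row per
# training perturbation.
train_deltas = np.stack(
    [pseudobulk(expression, labels, p) - control_mean for p in train_perts]
)

# Hypothetical baseline: predict the mean training delta for every unseen
# perturbation, then add it back to the control state.
baseline_delta = train_deltas.mean(axis=0)
predicted_state = control_mean + baseline_delta

for p in test_perts:
    true_state = pseudobulk(expression, labels, p)
    mse = float(np.mean((predicted_state - true_state) ** 2))
    print(f"{p}: baseline MSE = {mse:.4f}")
```

A learned model would replace `baseline_delta` with a prediction computed from the perturbation's embedding. The last step stays the same: add the predicted change to the control state.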